Jungfraujoch

7 Commits

Author	SHA1	Message	Date
leonarski_f and Claude Opus 4.8	617184041f	RotationScaleMerge: enable GPU combine + scale-fulls by default Steps 1-2 (GPU 3D-combine + resident scale-fulls) are validated bit-parity and run-to-run deterministic against the CPU path across the rotation battery, and cut the combine+scale-fulls region from ~0.46s to ~0.32s on lyso, so make them the default when a GPU is present (consistent with phase-1 partial scaling already being default-on). JFJOCH_RSM_CPU_COMBINE forces the CPU combine/scale-fulls for A/B or debugging; JFJOCH_RSM_NO_GPU still disables the whole GPU path. The only battery crystal whose reported metrics move is EP_cs_01-24 (CC1/2 2%, unindexable noise) whose upstream integration is itself nondeterministic; its merged intensities/CC/completeness are unchanged, only the ill-conditioned error-model b. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 08:35:23 +02:00
leonarski_f and Claude Opus 4.8	ced85bcd9d	RotationScaleMerge: GPU scale-fulls, fulls kept resident (phase 2 step 2) Scale the combined fulls (Unity model) on the device so they no longer round-trip between the combine and the merge: after the GPU combine, build the fulls' per-frame and per-ASU-group CSRs on the host from just the small key arrays (f_frame/f_group) with a deterministic counting sort - no GPU stable-sort - then scale in place and download once. The four scaling kernels are reused unchanged except FitPerFrameGKernel, which gains an optional `perm` argument (null for the partials, whose arrays are already frame-contiguous; a frame-grouping permutation for the emit-ordered fulls) so the fulls are scaled without a physical reorder. The Unity model falls out of giving the fulls all-ones partiality/rlp/zeta (coeff = mean), so no other kernel changes and the committed phase-1 partial-scaling path is bit-identical (perm == null -> idx == i). Validated across the rotation battery (JFJOCH_RSM_GPU_COMBINE=1): all 15 deterministic crystals stay run-to-run deterministic and their merged output is bit-identical to the CPU path (SG/ISa/CC1.2/completeness). The lone exception is EP_cs_01-24 (CC1/2 2%, R_meas 379% - unindexable noise): merged intensities/CC/completeness match exactly, but the ill-conditioned 16-bin error-model b fit amplifies the ~1e-7 scale-fulls rounding to ISa 10.6 vs 10.8 - benign, same class as the accepted phase-1 GPU rounding. The 3 upstream-nondeterministic crystals vary as before (GPU-prediction overflow, not this). Scale-fulls drops from ~0.09s to ~0 across the two passes; combine+scale-fulls region ~0.32s GPU vs ~0.46s CPU on lyso. Still opt-in (fulls are downloaded for the host merge; the win grows once the merge/error-model also stay resident). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 07:51:29 +02:00
leonarski_f and Claude Opus 4.8	2c928b27cd	RotationScaleMerge: GPU 3D-combine (CUDA port, phase 2 step 1) Port Combine() (partials->fulls) to CUDA, mirroring process_rawrun bit-for-bit: one thread per raw-hkl run splits its usable partials into rocking events (frame gap <= 2), pools background, seeds F, runs 3 de-biased Poisson reweights and the capture-uncertainty term. Emission is deterministic - a count pass, a host exclusive prefix sum for per-run offsets, then an emit pass at those offsets - so fulls come out in raw-run-major/event order, identical to the CPU path; both pass instantiations share the same arithmetic so count == emit exactly. Dmax/Dmin/Fmax reproduce std::max/min NaN semantics (not fmax) for parity. Validated across the 18-crystal rotation battery: all 15 deterministic crystals (P1/P2/C2/H3/I23/P41212/P222/P422) match the CPU combine exactly on SG/ISa/CC1.2/completeness and run-to-run (fulls count bit-identical); the 3 upstream-nondet crystals vary from GPU-prediction overflow, not the combine. Gated opt-in behind JFJOCH_RSM_GPU_COMBINE (default = CPU combine): combine alone is timing-neutral because the shared 1.2M SortFullsByFrame std::sort dominates and the fulls round-trip adds a copy - it only pays off once the fulls stay resident for scale-fulls + merge. Also add JFJOCH_RSM_NO_GPU master switch to force the CPU fallback (incl. phase-1 scaling) from one binary for A/B parity. SortFullsByFrame extracted from the Combine tail and shared by both paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 07:27:33 +02:00
leonarski_f and Claude Opus 4.8	dd25de461d	RotationScaleMerge: GPU partial-scaling loop (CUDA port, phase 1) First stage of moving the rotation scale/merge onto the GPU. The per-frame partial-scaling loop (inverse-variance group-mean reduction -> robust per-frame IRLS G -> corr update, x scaling_iter) now runs in RotationScaleMergeGPU (.cu) when a GPU is present; the CPU loops remain the fallback. The host keeps the one-time raw-hkl sort and the per-space-group gemmi ASU keying, and hands the GPU a group-ordered permutation + CSR so the per-group reduction is a DETERMINISTIC segmented reduction (one thread per group, fixed order, no atomics) - preserving the run-to-run determinism just won on the CPU path (a float atomicAdd reduction would have re-introduced jitter). Reduction is one-thread-per-group (groups average tens of obs, so a block-per-group wastes threads); the IRLS is one block per frame with a deterministic shared-memory reduction. Validated: bit-identical to the CPU path and deterministic run-to-run on lyso/cytC/Ins_H/pding (P41212 ISa 7.8 CC1/2 99.7%, etc.). The scaling kernels are ~7x faster than the CPU compute (~36 ms for 3 iters vs ~0.28 s); end-to-end scale/merge ~2.0 -> ~1.5 s. The remaining gap to the <1 s target is the per-pass host round-trip (corr down/upload for the CPU combine + per-SG group-CSR rebuild); phase 2 keeps the data resident by moving the 3D combine and the merge/error-model onto the GPU too, so nothing round-trips. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 22:26:29 +02:00
leonarski_f and Claude Opus 4.8	29c8ba6112	rotation: deterministic frame-order mosaicity smoothing + partiality recompute Prediction applied a mosaicity/profile-radius moving average (RotationParameters) over the last N processed frames. Under the parallel per-image loop that window is thread-arrival order, so the smoothed value - and hence which reflections are predicted/integrated - was non-deterministic run-to-run, swinging CC1/2 (and even the space group) on marginal crystals. `-N 1` was deterministic; `-N 32` was not. Fix (as designed with FL): prediction now uses each frame's OWN mosaicity/profile-radius (image-local, deterministic membership - a reflection on the cutoff contributes ~nothing). The smoothing that actually matters is moved into RotationScaleMerge and done in FRAME order (deterministic): per-frame mosaicity is smoothed with the same window as smooth-G, then every partial's partiality is recomputed from it BEFORE the 3D combine. This is the mosaicity analogue of smooth-G: combining a reflection's per-frame partials only tiles the rocking curve correctly (captured fractions summing toward 1) if neighbouring frames share a consistent mosaicity. Battery (18 crystals, /data/rotation_test, 2 runs each): 15/18 now bit-identical run-to-run (the good crystals unchanged - lyso P41212 ISa 7.8 CC1/2 99.7%). The 3 residual crystals (EcwtAL500, EcwtCQ066S, pding4_003 - all large/triclinic cells) still jitter ~0.002%, traced to a SEPARATE, benign cause: the GPU prediction buffer overflow (BraggPredictionRotGPU max_reflections=10000 with a racy atomicAdd/atomicSub) on dense frames - cell/space group stay stable; to be addressed in the GPU prediction/integration rework (naively raising the cap also changes prediction quality, so it is not a one-line bump). Minor label refinements from the recomputed partiality: cytC_2 P321 -> P3121 (now consistent with cytC_3), Ins_I_2/3 report the honest I23/I213 screw-axis ambiguity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:58:10 +02:00
leonarski_f and Claude Opus 4.8	b5d9167bf4	RotationScaleMerge: chunked parallelism -> scale/merge phase ~2s The per-observation corr update (7.6M items) ran through a work-stealing ParallelFor that does one atomic fetch_add PER item - pure contention for trivial work (measured: update 0.60s vs reduce 0.15s / fit 0.13s in the scale-partials loop). Add ParallelChunks (one contiguous range per worker, no per-item sync) and use it for UpdateCorr, and parallelise the ASU keying (gemmi reduction per distinct raw hkl - HKLKeyGenerator is const, safe to read concurrently) and the group-stamping over disjoint raw-hkl runs. scale-partials 0.90 -> 0.28s, group-hkl 0.20 -> 0.09s, per-pass warm 0.83s, whole scale/merge phase ~3.3 -> ~2.0s. Bit-identical output (same space group, ISa, CC1/2). ParallelChunks is the CPU stand-in for a flat CUDA grid-stride kernel; ParallelFor stays for the heavy, uneven per-frame fits where the atomic amortises and work-stealing balances the load. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:31:41 +02:00
leonarski_f and Claude Opus 4.8	29c5410328	rotation scaling: dedicated allocate-once RotationScaleMerge, ~3.4x faster scale/merge The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge -> error model -> stats) dominated offline wall clock because ScaleOnTheFly + MergeAll + CombineRotationObservations + MergeOnTheFly rebuild a std::map keyed by hkl on every scaling iteration and every merge (7-14 map rebuilds per space-group pass), each re-walking and re-keying millions of per-frame partials. RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them across both space-group passes: the raw-hkl ordering is sorted a single time at ingest, so the per-pass 3D combine only re-splits events (no sort) and the ASU grouping is one gemmi reduction per distinct raw hkl (~10x fewer) rather than per observation, reusing that order. Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are computed once at the end, not every iteration. It is a distinct path from ScaleOnTheFly, used only for the self-scaling rotation case (Rotation partiality + combine3d, per-image G, no B refinement, no external reference, no absorption surface, no wedge/mosaicity override). Stills, B-factor refinement, reference scaling and the absorption surface stay on the classic path. Numerically equivalent to the classic path (same robust per-frame G, same 3D combine, same XDS-order scale-fulls, same global error model, same merge statistics), validated on the 18-crystal /data/rotation_test battery: 16/18 bit-identical in space group / ISa / CC1/2; the 2 differing crystals are ice-heavy / marginal ones on which the classic path is equally non-deterministic run-to-run (a pre-existing upstream integration race). Scale/merge wall time drops ~3.4x (median 9.8 -> 4.1 s), making higher --scaling-iterations cheap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:00:12 +02:00
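
Several of the commits above rely on the same idea: build a CSR (offsets + permutation) from a small key array with a counting sort, then reduce each group in a fixed order so the floating-point result does not depend on thread scheduling. The following is a minimal host-side C++ sketch of that pattern, not the Jungfraujoch code. The names GroupCSR, BuildGroupCSR and GroupWeightedMean are hypothetical, and the inverse-variance weighting 1/sigma^2 is an assumption about what "inverse-variance group-mean" means here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types and names for illustration; not the Jungfraujoch API.
struct GroupCSR {
    std::vector<std::size_t> offset; // n_groups + 1 entries; group g owns perm[offset[g] .. offset[g+1])
    std::vector<std::size_t> perm;   // observation indices grouped by key, input order kept within a group
};

// Stable counting sort: count pass -> exclusive prefix sum -> emit pass.
// Assumes every key is < n_groups.
GroupCSR BuildGroupCSR(const std::vector<std::uint32_t>& key, std::size_t n_groups) {
    GroupCSR csr;
    csr.offset.assign(n_groups + 1, 0);

    // Count pass: offset[k + 1] temporarily holds the size of group k.
    for (std::uint32_t k : key) ++csr.offset[k + 1];

    // Prefix sum: offset[g] becomes the first slot of group g.
    for (std::size_t g = 0; g < n_groups; ++g) csr.offset[g + 1] += csr.offset[g];

    // Emit pass: walk the input in order, so ties keep their input order.
    std::vector<std::size_t> cursor(csr.offset.begin(), csr.offset.end() - 1);
    csr.perm.resize(key.size());
    for (std::size_t i = 0; i < key.size(); ++i) csr.perm[cursor[key[i]]++] = i;
    return csr;
}

// Segmented reduction: each group is summed by one worker in CSR order.
// On a GPU the outer loop would be one thread per group; no atomics are
// needed, so the summation order (and the rounding) is the same every run.
std::vector<double> GroupWeightedMean(const GroupCSR& csr,
                                      const std::vector<double>& intensity,
                                      const std::vector<double>& sigma) {
    const std::size_t n_groups = csr.offset.size() - 1;
    std::vector<double> mean(n_groups, 0.0);
    for (std::size_t g = 0; g < n_groups; ++g) {
        double sum_w = 0.0, sum_wi = 0.0;
        for (std::size_t j = csr.offset[g]; j < csr.offset[g + 1]; ++j) {
            const std::size_t i = csr.perm[j];
            const double w = 1.0 / (sigma[i] * sigma[i]); // assumed inverse-variance weight
            sum_w += w;
            sum_wi += w * intensity[i];
        }
        if (sum_w > 0.0) mean[g] = sum_wi / sum_w; // empty groups stay at 0
    }
    return mean;
}
```

The same count / prefix-sum / emit structure is what makes a variable-length output deterministic: each producer first reports how many items it will write, the prefix sum gives it a private output range, and the second pass writes into that range without any shared counter.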