Jungfraujoch

Author	SHA1	Message	Date
leonarski_fandClaude Opus 4.8	fccf9b83e7	RotationScaleMerge: drop profiling + env gates, honour forced mosaicity Remove the [rsm] per-stage lap timing and the JFJOCH_RSM_NO_GPU / JFJOCH_RSM_CPU_COMBINE env gates now that the GPU-resident path is the validated default (it runs whenever a GPU is present, with the CPU loops as the bit-parity fallback; the diagnostic-dump path still uses the CPU combine). Honour a fixed (forced) mosaicity: SmoothMosaicityAndPartiality now overrides every frame with GetForcedMosaicity() when set, instead of always reading the per-frame integration value - so the caller can route the --mosaicity case through RotationScaleMerge (its partiality recompute makes it a natural fit) rather than a separate path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 11:41:31 +02:00
leonarski_fandClaude Opus 4.8	34b3c3c4e7	RotationScaleMerge: GPU merge + error-model reductions over the resident fulls Port the four fulls-walking reductions of MergeAndStats to the GPU, over the fulls group-CSR already resident from scale-fulls: the per-group inv-var mean + leverage- corrected error-model samples, the merge accumulate (inv-var sums + deterministic half-sets, error-model-corrected sigma, with outlier rejection), and R_meas + the per-shell usable count. The host keeps the parts that don't parallelise cleanly or are tiny: the I2-sort + 16-bin (a,b) median fit, the per-group reject median (a per-group median is awkward on the GPU - cheap on the host from the GPU cnt), the merged export, the shells and the gemmi completeness. Only per-group arrays (~55k) + the samples (~n_fulls, for the fit) come back - the fulls are not re-walked on the host. Device HalfForImage (splitmix64) + IceRingIndex mirror the host; the corrected-sigma uses (bI_for_b)^2 (not b^2I^2) to match the host rounding; the R_meas usable count requires finite d (the host counts only fulls with a valid shell, and a group's fulls share d, so the shell is assigned per group). Gated on fulls_resident (GPU combine+scale-fulls active); reject is fully supported so it runs for the default rot3d command. merge+stats ~0.49 -> ~0.37s, taking RSM on lyso to ~0.78s (was ~0.91). Validated across the battery: 15/15 deterministic crystals bit-identical to the CPU path (SG / ISa / CC1.2 / completeness / total-obs, and the exact outlier-reject count), only EP_cs_01-24 noise wobbles. The em-sort + a,b fit are the remaining host floor. Non-CUDA build unaffected (use_gpu_merge is always false there). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 10:38:36 +02:00
leonarski_fandClaude Opus 4.8	786edf9929	RotationScaleMerge: GPU smooth-G, corr kept resident across the pass Apply smooth-G's corr adjustment on the device (a small kernel: corr[i] *= ratio[frame[i]] for flagged frames, double-then-float, matching host SmoothG) so the per-image corr never leaves the GPU: it now stays resident through scaling -> smooth-G -> per-frame CC -> combine, and across the two space-group passes exactly as the old host round-trip did. The host only builds the tiny per-frame ratio (g/g_smooth via the extracted ComputeSmoothGWindow) and refreshes host partials[].corr solely for the CPU-combine path (JFJOCH_RSM_CPU_COMBINE or the diagnostic dump). This drops the post-scale GetCorr and the two SetCorr re-uploads (~3x25MB/pass) plus the 6.3M host corr-adjust loop: scale-partials ~0.21->~0.10s and the smooth+combine region shrinks, taking RSM on lyso to ~0.91s (was ~1.47s with phase-1-only, ~1.71s full-CPU) - under the 1s target for this crystal; merge+stats (~0.49s) is now the dominant chunk. Bit-identical (GPU smooth-G == host SmoothG on the resident corr); validated across the battery (15/15 deterministic crystals bit-identical to CPU across default / CPU-combine / NO_GPU, only EP_cs_01-24 noise wobbles). Non-CUDA build unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 09:49:31 +02:00
leonarski_fandClaude Opus 4.8	87c70195f8	RotationScaleMerge: faster group-hkl via flat group ids + skip Obs stamp Profiling showed the per-space-group "group hkl" step (~0.30s/2-pass on lyso) is not gemmi-bound (the ASU keying is ~6ms) but memory-bandwidth-bound: stamping the group id onto, and reading it back from, the `group` field scattered across the 56-byte Obs struct touches the whole ~350MB partials array twice per pass. Precompute the per-obs AcceptReflection finiteness once (immutable) into a flat 1-byte array, then stamp the ASU-group id from rawrun_group + that flat array into a flat group_ids vector for the GPU, and build the group CSR (a stable counting sort, now parallel) from group_ids - all sequential/flat reads. The Obs.group field is written only when a CPU stage will read it (no GPU: scaling/CC/combine otherwise use group_ids / rawrun_group, never partials.group), so the default path skips the strided Obs pass entirely. group hkl ~0.31 -> ~0.20 s/2-pass on lyso. Output is bit-identical (group_ids values and the obs-index-ordered gperm are unchanged), so the merged results are unchanged; validated across the battery (15/15 deterministic crystals bit-identical to the CPU path, only EP_cs_01-24 noise keeps its benign wobble). Non-CUDA build unaffected (need_obs_group is always true there). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 09:25:31 +02:00
leonarski_fandClaude Opus 4.8	7d804eb799	RotationScaleMerge: GPU post-smooth group means + per-frame CC The single-threaded ReduceGroupMeans over the 6.3M partials (~0.07s/2-pass) and the per-frame diagnostic CC now run on the resident partials on the GPU: after SmoothG, the smoothed corr is uploaded once (and left resident for the combine, dropping the combine's redundant re-upload), then the post-smooth group means (reusing the scaling reduce) and the per-frame Pearson CC (a new one-block-per-frame kernel) run there and only the tiny per-frame cc/cc_n come back. FinalizePerFrameScale is split into ComputePerFrameCC (host reference) + the writeback; the GPU path uses ComputePartialCC. The per-frame CC is diagnostic only (the per-image scaling table), so the tree reduction's ~ulp difference from the CPU is immaterial and it does not touch merged intensities. smooth+CC region ~0.10s GPU vs ~0.15s CPU on lyso. Validated across the battery: 15/15 deterministic crystals run-to-run deterministic and merged output bit-identical to the CPU path (only EP_cs_01-24, unindexable noise, keeps its benign error-model-b wobble). CPU fallbacks (JFJOCH_RSM_CPU_COMBINE / _NO_GPU) unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 08:58:26 +02:00
leonarski_fandClaude Opus 4.8	617184041f	RotationScaleMerge: enable GPU combine + scale-fulls by default Steps 1-2 (GPU 3D-combine + resident scale-fulls) are validated bit-parity and run-to-run deterministic against the CPU path across the rotation battery, and cut the combine+scale-fulls region from ~0.46s to ~0.32s on lyso, so make them the default when a GPU is present (consistent with phase-1 partial scaling already being default-on). JFJOCH_RSM_CPU_COMBINE forces the CPU combine/scale-fulls for A/B or debugging; JFJOCH_RSM_NO_GPU still disables the whole GPU path. The only battery crystal whose reported metrics move is EP_cs_01-24 (CC1/2 2%, unindexable noise) whose upstream integration is itself nondeterministic; its merged intensities/CC/completeness are unchanged, only the ill-conditioned error-model b. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 08:35:23 +02:00
leonarski_fandClaude Opus 4.8	ced85bcd9d	RotationScaleMerge: GPU scale-fulls, fulls kept resident (phase 2 step 2) Scale the combined fulls (Unity model) on the device so they no longer round-trip between the combine and the merge: after the GPU combine, build the fulls' per-frame and per-ASU-group CSRs on the host from just the small key arrays (f_frame/f_group) with a deterministic counting sort - no GPU stable-sort - then scale in place and download once. The four scaling kernels are reused unchanged except FitPerFrameGKernel, which gains an optional `perm` argument (null for the partials, whose arrays are already frame-contiguous; a frame-grouping permutation for the emit-ordered fulls) so the fulls are scaled without a physical reorder. The Unity model falls out of giving the fulls all-ones partiality/rlp/zeta (coeff = mean), so no other kernel changes and the committed phase-1 partial-scaling path is bit-identical (perm == null -> idx == i). Validated across the rotation battery (JFJOCH_RSM_GPU_COMBINE=1): all 15 deterministic crystals stay run-to-run deterministic and their merged output is bit-identical to the CPU path (SG/ISa/CC1.2/completeness). The lone exception is EP_cs_01-24 (CC1/2 2%, R_meas 379% - unindexable noise): merged intensities/CC/completeness match exactly, but the ill-conditioned 16-bin error-model b fit amplifies the ~1e-7 scale-fulls rounding to ISa 10.6 vs 10.8 - benign, same class as the accepted phase-1 GPU rounding. The 3 upstream-nondeterministic crystals vary as before (GPU-prediction overflow, not this). Scale-fulls drops from ~0.09s to ~0 across the two passes; combine+scale-fulls region ~0.32s GPU vs ~0.46s CPU on lyso. Still opt-in (fulls are downloaded for the host merge; the win grows once the merge/error-model also stay resident). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 07:51:29 +02:00
leonarski_fandClaude Opus 4.8	2c928b27cd	RotationScaleMerge: GPU 3D-combine (CUDA port, phase 2 step 1) Port Combine() (partials->fulls) to CUDA, mirroring process_rawrun bit-for-bit: one thread per raw-hkl run splits its usable partials into rocking events (frame gap <= 2), pools background, seeds F, runs 3 de-biased Poisson reweights and the capture-uncertainty term. Emission is deterministic - a count pass, a host exclusive prefix sum for per-run offsets, then an emit pass at those offsets - so fulls come out in raw-run-major/event order, identical to the CPU path; both pass instantiations share the same arithmetic so count == emit exactly. Dmax/Dmin/Fmax reproduce std::max/min NaN semantics (not fmax) for parity. Validated across the 18-crystal rotation battery: all 15 deterministic crystals (P1/P2/C2/H3/I23/P41212/P222/P422) match the CPU combine exactly on SG/ISa/CC1.2/ completeness and run-to-run (fulls count bit-identical); the 3 upstream-nondet crystals vary from GPU-prediction overflow, not the combine. Gated opt-in behind JFJOCH_RSM_GPU_COMBINE (default = CPU combine): combine alone is timing-neutral because the shared 1.2M SortFullsByFrame std::sort dominates and the fulls round-trip adds a copy - it only pays off once the fulls stay resident for scale-fulls + merge. Also add JFJOCH_RSM_NO_GPU master switch to force the CPU fallback (incl. phase-1 scaling) from one binary for A/B parity. SortFullsByFrame extracted from the Combine tail and shared by both paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-03 07:27:33 +02:00
leonarski_fandClaude Opus 4.8	dd25de461d	RotationScaleMerge: GPU partial-scaling loop (CUDA port, phase 1) First stage of moving the rotation scale/merge onto the GPU. The per-frame partial-scaling loop (inverse-variance group-mean reduction -> robust per-frame IRLS G -> corr update, x scaling_iter) now runs in RotationScaleMergeGPU (.cu) when a GPU is present; the CPU loops remain the fallback. The host keeps the one-time raw-hkl sort and the per-space-group gemmi ASU keying, and hands the GPU a group-ordered permutation + CSR so the per-group reduction is a DETERMINISTIC segmented reduction (one thread per group, fixed order, no atomics) - preserving the run-to-run determinism just won on the CPU path (a float atomicAdd reduction would have re-introduced jitter). Reduction is one-thread-per-group (groups average tens of obs, so a block-per-group wastes threads); the IRLS is one block per frame with a deterministic shared-memory reduction. Validated: bit-identical to the CPU path and deterministic run-to-run on lyso/cytC/Ins_H/pding (P41212 ISa 7.8 CC1/2 99.7%, etc.). The scaling kernels are ~7x faster than the CPU compute (~36 ms for 3 iters vs ~0.28 s); end-to-end scale/merge ~2.0 -> ~1.5 s. The remaining gap to the <1 s target is the per-pass host round-trip (corr down/upload for the CPU combine + per-SG group-CSR rebuild); phase 2 keeps the data resident by moving the 3D combine and the merge/error-model onto the GPU too, so nothing round-trips. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 22:26:29 +02:00
leonarski_fandClaude Opus 4.8	29c8ba6112	rotation: deterministic frame-order mosaicity smoothing + partiality recompute Prediction applied a mosaicity/profile-radius moving average (RotationParameters) over the last N processed frames. Under the parallel per-image loop that window is thread-arrival order, so the smoothed value - and hence which reflections are predicted/integrated - was non-deterministic run-to-run, swinging CC1/2 (and even the space group) on marginal crystals. `-N 1` was deterministic; `-N 32` was not. Fix (as designed with FL): prediction now uses each frame's OWN mosaicity/profile-radius (image-local, deterministic membership - a reflection on the cutoff contributes ~nothing). The smoothing that actually matters is moved into RotationScaleMerge and done in FRAME order (deterministic): per-frame mosaicity is smoothed with the same window as smooth-G, then every partial's partiality is recomputed from it BEFORE the 3D combine. This is the mosaicity analogue of smooth-G: combining a reflection's per-frame partials only tiles the rocking curve correctly (captured fractions summing toward 1) if neighbouring frames share a consistent mosaicity. Battery (18 crystals, /data/rotation_test, 2 runs each): 15/18 now bit-identical run-to-run (the good crystals unchanged - lyso P41212 ISa 7.8 CC1/2 99.7%). The 3 residual crystals (EcwtAL500, EcwtCQ066S, pding4_003 - all large/triclinic cells) still jitter ~0.002%, traced to a SEPARATE, benign cause: the GPU prediction buffer overflow (BraggPredictionRotGPU max_reflections=10000 with a racy atomicAdd/atomicSub) on dense frames - cell/space group stay stable; to be addressed in the GPU prediction/integration rework (naively raising the cap also changes prediction quality, so it is not a one-line bump). Minor label refinements from the recomputed partiality: cytC_2 P321 -> P3121 (now consistent with cytC_3), Ins_I_2/3 report the honest I23/I213 screw-axis ambiguity. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:58:10 +02:00
leonarski_fandClaude Opus 4.8	b5d9167bf4	RotationScaleMerge: chunked parallelism -> scale/merge phase ~2s The per-observation corr update (7.6M items) ran through a work-stealing ParallelFor that does one atomic fetch_add PER item - pure contention for trivial work (measured: update 0.60s vs reduce 0.15s / fit 0.13s in the scale-partials loop). Add ParallelChunks (one contiguous range per worker, no per-item sync) and use it for UpdateCorr, and parallelise the ASU keying (gemmi reduction per distinct raw hkl - HKLKeyGenerator is const, safe to read concurrently) and the group-stamping over disjoint raw-hkl runs. scale-partials 0.90 -> 0.28s, group-hkl 0.20 -> 0.09s, per-pass warm 0.83s, whole scale/merge phase ~3.3 -> ~2.0s. Bit-identical output (same space group, ISa, CC1/2). ParallelChunks is the CPU stand-in for a flat CUDA grid-stride kernel; ParallelFor stays for the heavy, uneven per-frame fits where the atomic amortises and work-stealing balances the load. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:31:41 +02:00
leonarski_fandClaude Opus 4.8	29c5410328	rotation scaling: dedicated allocate-once RotationScaleMerge, ~3.4x faster scale/merge The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge -> error model -> stats) dominated offline wall clock because ScaleOnTheFly + MergeAll + CombineRotationObservations + MergeOnTheFly rebuild a std::map keyed by hkl on every scaling iteration and every merge (7-14 map rebuilds per space-group pass), each re-walking and re-keying millions of per-frame partials. RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them across both space-group passes: the raw-hkl ordering is sorted a single time at ingest, so the per-pass 3D combine only re-splits events (no sort) and the ASU grouping is one gemmi reduction per distinct raw hkl (~10x fewer) rather than per observation, reusing that order. Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are computed once at the end, not every iteration. It is a distinct path from ScaleOnTheFly, used only for the self-scaling rotation case (Rotation partiality + combine3d, per-image G, no B refinement, no external reference, no absorption surface, no wedge/mosaicity override). Stills, B-factor refinement, reference scaling and the absorption surface stay on the classic path. Numerically equivalent to the classic path (same robust per-frame G, same 3D combine, same XDS-order scale-fulls, same global error model, same merge statistics), validated on the 18-crystal /data/rotation_test battery: 16/18 bit-identical in space group / ISa / CC1/2; the 2 differing crystals are ice-heavy / marginal ones on which the classic path is equally non-deterministic run-to-run (a pre-existing upstream integration race). Scale/merge wall time drops ~3.4x (median 9.8 -> 4.1 s), making higher --scaling-iterations cheap. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 21:00:12 +02:00
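Several commits above rely on the same idea: build a group CSR from flat group_ids with a stable counting sort, then run a segmented reduction over each group in a fixed order, so that no atomics are involved and the result is identical run-to-run. The following is a minimal host-side C++ sketch of that pattern, not the Jungfraujoch implementation. GroupCSR, BuildGroupCSR and GroupMeans are hypothetical names chosen for illustration, and the weight 1/sigma^2 is an assumed form of the inverse-variance mean described in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR layout: observations of group g are
// perm[offsets[g]] .. perm[offsets[g + 1] - 1], in ascending observation index.
struct GroupCSR {
    std::vector<std::uint32_t> offsets;  // size n_groups + 1
    std::vector<std::uint32_t> perm;     // observation indices, group-major
};

// Stable counting sort of observation indices by group id.
// Stability is what keeps the per-group summation order fixed.
GroupCSR BuildGroupCSR(const std::vector<std::uint32_t>& group_ids,
                       std::uint32_t n_groups) {
    GroupCSR csr;
    csr.offsets.assign(static_cast<std::size_t>(n_groups) + 1, 0);

    // Pass 1: histogram of group sizes, shifted by one slot.
    for (std::uint32_t g : group_ids) {
        assert(g < n_groups);
        ++csr.offsets[g + 1];
    }
    // Prefix sum turns the histogram into start offsets.
    for (std::uint32_t g = 0; g < n_groups; ++g)
        csr.offsets[g + 1] += csr.offsets[g];

    // Pass 2: scatter indices in observation order (hence stable).
    csr.perm.resize(group_ids.size());
    std::vector<std::uint32_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
    for (std::uint32_t i = 0; i < group_ids.size(); ++i)
        csr.perm[cursor[group_ids[i]]++] = i;
    return csr;
}

// Inverse-variance weighted mean per group. Each group is summed by one
// "worker" in CSR order, mirroring a one-thread-per-group kernel: there is
// no shared accumulator, so the floating-point result does not depend on
// scheduling. Observations with non-positive sigma are skipped (assumption).
std::vector<double> GroupMeans(const GroupCSR& csr,
                               const std::vector<float>& intensity,
                               const std::vector<float>& sigma) {
    const std::size_t n_groups = csr.offsets.size() - 1;
    std::vector<double> mean(n_groups, 0.0);
    for (std::size_t g = 0; g < n_groups; ++g) {
        double sum_w = 0.0, sum_wI = 0.0;
        for (std::uint32_t k = csr.offsets[g]; k < csr.offsets[g + 1]; ++k) {
            const std::uint32_t i = csr.perm[k];
            if (!(sigma[i] > 0.0f)) continue;
            const double w = 1.0 / (static_cast<double>(sigma[i]) * sigma[i]);
            sum_w += w;
            sum_wI += w * intensity[i];
        }
        if (sum_w > 0.0) mean[g] = sum_wI / sum_w;
    }
    return mean;
}
```

Calling BuildGroupCSR once per space-group pass and then GroupMeans per scaling iteration matches the shape described in the log: the sort cost is paid once, and each iteration is a flat sequential walk. On a GPU the outer loop over g would become the thread index, with the inner loop unchanged.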