Steps 1-2 (GPU 3D-combine + resident scale-fulls) are validated as bit-identical to the
CPU path and run-to-run deterministic across the rotation battery, and cut
the combine+scale-fulls region from ~0.46s to ~0.32s on lyso, so make them the
default when a GPU is present (consistent with phase-1 partial scaling already being
default-on). JFJOCH_RSM_CPU_COMBINE forces the CPU combine/scale-fulls for A/B or
debugging; JFJOCH_RSM_NO_GPU still disables the whole GPU path.
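
The gating described above can be sketched as follows. This is a minimal illustration, not the committed code: `CombinePath`, `EnvSet` and `SelectCombinePath` are hypothetical names, and treating a variable as "on" whenever it is set at all is an assumption (the message does not say whether the value is inspected).

```cpp
#include <cstdlib>

// Which implementation runs the combine + scale-fulls region.
enum class CombinePath { Cpu, Gpu };

// Assumption: a variable counts as enabled if it is set, whatever its value.
inline bool EnvSet(const char* name) { return std::getenv(name) != nullptr; }

// Hypothetical helper. The GPU path is the default when a GPU is present;
// JFJOCH_RSM_NO_GPU disables the whole GPU path, while JFJOCH_RSM_CPU_COMBINE
// only forces the CPU combine/scale-fulls.
inline CombinePath SelectCombinePath(bool gpu_present) {
    if (!gpu_present) return CombinePath::Cpu;
    if (EnvSet("JFJOCH_RSM_NO_GPU")) return CombinePath::Cpu;
    if (EnvSet("JFJOCH_RSM_CPU_COMBINE")) return CombinePath::Cpu;
    return CombinePath::Gpu;
}
```
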
The only battery crystal whose reported metrics move is EP_cs_01-24 (CC1/2 2%,
unindexable noise), whose upstream integration is itself nondeterministic; its merged
intensities/CC/completeness are unchanged, and only the ill-conditioned error-model b moves.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scale the combined fulls (Unity model) on the device so they no longer round-trip
between the combine and the merge: after the GPU combine, build the fulls' per-frame
and per-ASU-group CSRs on the host from just the small key arrays (f_frame/f_group)
with a deterministic counting sort - no GPU stable-sort - then scale in place and
download once.
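
A host-side counting sort that builds a CSR from a small key array might look like the sketch below. `Csr` and `BuildCsr` are hypothetical names, and the key array stands in for f_frame or f_group; the point is only that a stable counting sort yields the same permutation on every run.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CSR over items grouped by an integer key: the items with key k are
// perm[offsets[k]] .. perm[offsets[k + 1] - 1], in their original relative order.
struct Csr {
    std::vector<uint32_t> offsets;  // size num_keys + 1
    std::vector<uint32_t> perm;     // size keys.size()
};

// Stable counting sort of item indices by key (hypothetical helper).
Csr BuildCsr(const std::vector<uint32_t>& keys, uint32_t num_keys) {
    Csr csr;
    csr.offsets.assign(num_keys + 1, 0);
    // Histogram: count the items per key, shifted by one slot.
    for (uint32_t k : keys) {
        assert(k < num_keys);
        ++csr.offsets[k + 1];
    }
    // Running sum turns the counts into segment start offsets.
    for (uint32_t k = 0; k < num_keys; ++k) csr.offsets[k + 1] += csr.offsets[k];
    // Scatter the indices in input order, which keeps the sort stable.
    std::vector<uint32_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
    csr.perm.resize(keys.size());
    for (uint32_t i = 0; i < keys.size(); ++i) csr.perm[cursor[keys[i]]++] = i;
    return csr;
}
```

Only the key array is read, so the cost is linear in the number of fulls and independent of the payload arrays that stay on the device.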
The four scaling kernels are reused unchanged except FitPerFrameGKernel, which gains
an optional `perm` argument (null for the partials, whose arrays are already
frame-contiguous; a frame-grouping permutation for the emit-ordered fulls) so the
fulls are scaled without a physical reorder. The Unity model falls out of giving the
fulls all-ones partiality/rlp/zeta (coeff = mean), so no other kernel changes and the
committed phase-1 partial-scaling path is bit-identical (perm == null -> idx == i).
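
The optional `perm` indirection can be illustrated on the host as below. This is a sketch under stated assumptions: `ResolveIndex` and `FitFrameG` are hypothetical names, and the fit is a plain weighted least-squares scale used as a stand-in for the robust IRLS in FitPerFrameGKernel, whose exact arithmetic is not given here.

```cpp
#include <cstdint>
#include <vector>

// Element visited at position i of a frame segment: the identity when perm is
// null (frame-contiguous arrays), otherwise a lookup in the grouping permutation.
inline uint32_t ResolveIndex(const uint32_t* perm, uint32_t i) {
    return perm ? perm[i] : i;
}

// Stand-in fit for one frame segment [begin, end):
// G = sum(w * I * c) / sum(w * c * c). Not the robust fit from the source.
double FitFrameG(const std::vector<double>& I, const std::vector<double>& coeff,
                 const std::vector<double>& w, const uint32_t* perm,
                 uint32_t begin, uint32_t end) {
    double num = 0.0, den = 0.0;
    for (uint32_t i = begin; i < end; ++i) {
        const uint32_t idx = ResolveIndex(perm, i);
        num += w[idx] * I[idx] * coeff[idx];
        den += w[idx] * coeff[idx] * coeff[idx];
    }
    return den > 0.0 ? num / den : 1.0;  // fall back to unit scale on an empty segment
}
```

With a null `perm` the loop touches exactly the same elements in the same order as before the argument existed, which is why the partials path is unaffected.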
Validated across the rotation battery (JFJOCH_RSM_GPU_COMBINE=1): all 15 deterministic
crystals stay run-to-run deterministic and their merged output is bit-identical to the
CPU path (SG, ISa, CC1/2, completeness). The lone exception is EP_cs_01-24 (CC1/2 2%,
R_meas 379% - unindexable noise): merged intensities/CC/completeness match exactly, but
the ill-conditioned 16-bin error-model b fit amplifies the ~1e-7 scale-fulls rounding
to ISa 10.6 vs 10.8 - benign, same class as the accepted phase-1 GPU rounding. The 3
upstream-nondeterministic crystals vary as before (GPU-prediction overflow, not this).
Scale-fulls drops from ~0.09s to ~0 across the two passes; combine+scale-fulls region
~0.32s GPU vs ~0.46s CPU on lyso. Still opt-in (fulls are downloaded for the host merge;
the win grows once the merge/error-model also stay resident).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Port Combine() (partials->fulls) to CUDA, mirroring process_rawrun bit-for-bit:
one thread per raw-hkl run splits its usable partials into rocking events (frame
gap <= 2), pools background, seeds F, runs 3 de-biased Poisson reweights and the
capture-uncertainty term. Emission is deterministic - a count pass, a host
exclusive prefix sum for per-run offsets, then an emit pass at those offsets - so
fulls come out in raw-run-major/event order, identical to the CPU path; both pass
instantiations share the same arithmetic so count == emit exactly. Dmax/Dmin/Fmax
reproduce std::max/min NaN semantics (not fmax) for parity.
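
The count / prefix-sum / emit pattern can be sketched on the host as follows. The names `SplitEvents` and `EmitEventStarts` are hypothetical, and each event is reduced to its first frame number purely for illustration; the real combine emits a full per event. Each iteration over `r` corresponds to one thread in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk one run's ascending frame numbers and start a new event whenever the
// gap to the previous frame exceeds max_gap. With out == nullptr this is the
// count pass; otherwise it also writes each event's first frame. Both modes
// share the same loop, so the counts always agree.
uint32_t SplitEvents(const std::vector<int>& frames, uint32_t begin, uint32_t end,
                     int max_gap, int* out) {
    uint32_t n = 0;
    for (uint32_t i = begin; i < end; ++i) {
        if (i == begin || frames[i] - frames[i - 1] > max_gap) {
            if (out) out[n] = frames[i];
            ++n;
        }
    }
    return n;
}

// run_offsets is a CSR over the runs (size num_runs + 1).
std::vector<int> EmitEventStarts(const std::vector<int>& frames,
                                 const std::vector<uint32_t>& run_offsets,
                                 int max_gap = 2) {
    assert(!run_offsets.empty());
    const std::size_t num_runs = run_offsets.size() - 1;
    std::vector<uint32_t> out_offsets(num_runs + 1, 0);
    // Count pass: the number of events per run.
    for (std::size_t r = 0; r < num_runs; ++r)
        out_offsets[r + 1] =
            SplitEvents(frames, run_offsets[r], run_offsets[r + 1], max_gap, nullptr);
    // Exclusive prefix sum: where each run's output starts.
    for (std::size_t r = 0; r < num_runs; ++r) out_offsets[r + 1] += out_offsets[r];
    // Emit pass: every run writes into its own disjoint slice.
    std::vector<int> out(out_offsets[num_runs]);
    for (std::size_t r = 0; r < num_runs; ++r) {
        const uint32_t n = SplitEvents(frames, run_offsets[r], run_offsets[r + 1],
                                       max_gap, out.data() + out_offsets[r]);
        assert(n == out_offsets[r + 1] - out_offsets[r]);
        (void)n;
    }
    return out;
}
```

Because the output position depends only on the run index and the event index within the run, the order is fixed regardless of how threads are scheduled.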
Validated across the 18-crystal rotation battery: all 15 deterministic crystals
(P1/P2/C2/H3/I23/P41212/P222/P422) match the CPU combine exactly on SG, ISa, CC1/2 and
completeness, and are stable run-to-run (fulls count bit-identical); the 3 upstream-nondeterministic
crystals vary from GPU-prediction overflow, not the combine.
Gated opt-in behind JFJOCH_RSM_GPU_COMBINE (default = CPU combine): combine alone
is timing-neutral because the shared 1.2M SortFullsByFrame std::sort dominates and
the fulls round-trip adds a copy - it only pays off once the fulls stay resident
for scale-fulls + merge. Also add JFJOCH_RSM_NO_GPU master switch to force the CPU
fallback (incl. phase-1 scaling) from one binary for A/B parity. SortFullsByFrame
extracted from the Combine tail and shared by both paths.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First stage of moving the rotation scale/merge onto the GPU. The per-frame partial-scaling loop
(inverse-variance group-mean reduction -> robust per-frame IRLS G -> corr update, x scaling_iter)
now runs in RotationScaleMergeGPU (.cu) when a GPU is present; the CPU loops remain the fallback.
The host keeps the one-time raw-hkl sort and the per-space-group gemmi ASU keying, and hands the
GPU a group-ordered permutation + CSR so the per-group reduction is a DETERMINISTIC segmented
reduction (one thread per group, fixed order, no atomics) - preserving the run-to-run determinism
just won on the CPU path (a float atomicAdd reduction would have re-introduced jitter). Reduction is
one-thread-per-group (groups average tens of obs, so a block-per-group wastes threads); the IRLS is
one block per frame with a deterministic shared-memory reduction.
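
A host-side sketch of the segmented reduction is given below. `GroupMeans` is a hypothetical name, and the inverse-variance weighting w = 1/sigma^2 is written out as an assumption about the group mean; the loop over `g` plays the role of one thread per group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <cstdint>

// Inverse-variance weighted mean per group. Observations of group g are
// perm[offsets[g]] .. perm[offsets[g + 1] - 1]; they are accumulated in that
// fixed order, so the floating-point result does not depend on scheduling.
std::vector<double> GroupMeans(const std::vector<double>& I,
                               const std::vector<double>& sigma,
                               const std::vector<uint32_t>& offsets,
                               const std::vector<uint32_t>& perm) {
    assert(!offsets.empty() && I.size() == sigma.size());
    const std::size_t num_groups = offsets.size() - 1;
    std::vector<double> mean(num_groups, 0.0);
    for (std::size_t g = 0; g < num_groups; ++g) {  // one "thread" per group
        double sum_wI = 0.0, sum_w = 0.0;
        for (uint32_t p = offsets[g]; p < offsets[g + 1]; ++p) {
            const uint32_t idx = perm[p];
            const double w = 1.0 / (sigma[idx] * sigma[idx]);
            sum_wI += w * I[idx];
            sum_w += w;
        }
        mean[g] = sum_w > 0.0 ? sum_wI / sum_w : 0.0;
    }
    return mean;
}
```

An atomicAdd version would add the same terms in whatever order the threads arrive, and since floating-point addition is not associative, the last bits of the sum could then differ between runs.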
Validated: bit-identical to the CPU path and deterministic run-to-run on lyso/cytC/Ins_H/pding
(P41212 ISa 7.8 CC1/2 99.7%, etc.). The scaling kernels are ~7x faster than the CPU compute
(~36 ms for 3 iters vs ~0.28 s); end-to-end scale/merge ~2.0 -> ~1.5 s. The remaining gap to the
<1 s target is the per-pass host round-trip (corr down/upload for the CPU combine + per-SG group-CSR
rebuild); phase 2 keeps the data resident by moving the 3D combine and the merge/error-model onto
the GPU too, so nothing round-trips.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prediction applied a mosaicity/profile-radius moving average (RotationParameters) over the
last N *processed* frames. Under the parallel per-image loop that window is thread-arrival
order, so the smoothed value - and hence which reflections are predicted/integrated - was
non-deterministic run-to-run, swinging CC1/2 (and even the space group) on marginal crystals.
`-N 1` was deterministic; `-N 32` was not.
Fix (as designed with FL): prediction now uses each frame's OWN mosaicity/profile-radius
(image-local, deterministic membership - a reflection on the cutoff contributes ~nothing).
The smoothing that actually matters is moved into RotationScaleMerge and done in FRAME order
(deterministic): per-frame mosaicity is smoothed with the same window as smooth-G, then every
partial's partiality is recomputed from it BEFORE the 3D combine. This is the mosaicity analogue
of smooth-G: combining a reflection's per-frame partials only tiles the rocking curve correctly
(captured fractions summing toward 1) if neighbouring frames share a consistent mosaicity.
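
Frame-order smoothing can be sketched as below. The window shape is an assumption: the message only says the same window as smooth-G is used, so the centred box average and the names `SmoothByFrame` and `half_width` are hypothetical. What matters for determinism is that the window is indexed by frame number, not by processing order.

```cpp
#include <algorithm>
#include <vector>

// Centred moving average over per-frame values, indexed by frame number.
// The window is clipped at the first and last frame.
std::vector<double> SmoothByFrame(const std::vector<double>& per_frame,
                                  int half_width) {
    const int n = static_cast<int>(per_frame.size());
    std::vector<double> out(per_frame.size());
    for (int f = 0; f < n; ++f) {
        const int lo = std::max(0, f - half_width);
        const int hi = std::min(n - 1, f + half_width);
        double sum = 0.0;
        for (int k = lo; k <= hi; ++k) sum += per_frame[k];
        out[f] = sum / (hi - lo + 1);
    }
    return out;
}
```

Each output depends only on the input values of neighbouring frames, so the result is the same however many threads processed the images and in whatever order they finished.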
Battery (18 crystals, /data/rotation_test, 2 runs each): 15/18 now bit-identical run-to-run
(the good crystals unchanged - lyso P41212 ISa 7.8 CC1/2 99.7%). The 3 residual crystals
(EcwtAL500, EcwtCQ066S, pding4_003 - all large/triclinic cells) still jitter ~0.002%, traced
to a SEPARATE, benign cause: the GPU prediction buffer overflow (BraggPredictionRotGPU
max_reflections=10000 with a racy atomicAdd/atomicSub) on dense frames - cell/space group stay
stable; to be addressed in the GPU prediction/integration rework (naively raising the cap also
changes prediction quality, so it is not a one-line bump). Minor label refinements from the
recomputed partiality: cytC_2 P321 -> P3121 (now consistent with cytC_3), Ins_I_2/3 report the
honest I23/I213 screw-axis ambiguity.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-observation corr update (7.6M items) ran through a work-stealing ParallelFor that
does one atomic fetch_add PER item - pure contention for trivial work (measured: update 0.60s
vs reduce 0.15s / fit 0.13s in the scale-partials loop). Add ParallelChunks (one contiguous
range per worker, no per-item sync) and use it for UpdateCorr, and parallelise the ASU keying
(gemmi reduction per distinct raw hkl - HKLKeyGenerator is const, safe to read concurrently)
and the group-stamping over disjoint raw-hkl runs.
scale-partials 0.90 -> 0.28s, group-hkl 0.20 -> 0.09s, per-pass warm 0.83s, whole scale/merge
phase ~3.3 -> ~2.0s. Bit-identical output (same space group, ISa, CC1/2). ParallelChunks is
the CPU stand-in for a flat CUDA grid-stride kernel; ParallelFor stays for the heavy, uneven
per-frame fits where the atomic amortises and work-stealing balances the load.
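
The chunked scheme can be sketched as follows. The signature is hypothetical (the message only states one contiguous range per worker and no per-item synchronisation), and spawning a `std::thread` per chunk is a simplification; a real implementation would more likely reuse a thread pool.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split [0, n) into at most `workers` contiguous ranges and run
// body(begin, end) on each in its own thread. The only synchronisation is the
// final join; there is no shared counter touched per item.
template <class F>
void ParallelChunks(std::size_t n, unsigned workers, F&& body) {
    workers = std::max(1u, workers);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceil(n / workers)
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(w) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // fewer items than workers
        pool.emplace_back([begin, end, &body] { body(begin, end); });
    }
    for (auto& t : pool) t.join();
}
```

A corr update would then be called roughly as `ParallelChunks(corr.size(), nthreads, [&](std::size_t b, std::size_t e) { for (std::size_t i = b; i < e; ++i) corr[i] *= g[frame[i]]; });`, where the update expression itself is illustrative only.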
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge ->
error model -> stats) dominated offline wall clock because ScaleOnTheFly + MergeAll
+ CombineRotationObservations + MergeOnTheFly rebuild a std::map keyed by hkl on every
scaling iteration and every merge (7-14 map rebuilds per space-group pass), each re-walking
and re-keying millions of per-frame partials.
RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them
across both space-group passes: the raw-hkl ordering is sorted a single time at ingest, so
the per-pass 3D combine only re-splits events (no sort) and the ASU grouping is one gemmi
reduction per distinct raw hkl (~10x fewer) rather than per observation, reusing that order.
Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel
per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are
computed once at the end, not every iteration.
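
The sort-once-at-ingest idea can be sketched as below. `Hkl`, `RawRuns` and `SortRawHkl` are hypothetical names, not the RotationScaleMerge layout; the sketch only shows how a single sorted order plus run boundaries lets later passes visit each distinct raw hkl once.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <tuple>
#include <vector>

struct Hkl { int h, k, l; };

// order: observation indices sorted by raw (h, k, l).
// run_offsets: CSR over runs of identical hkl within `order`.
struct RawRuns {
    std::vector<uint32_t> order;
    std::vector<uint32_t> run_offsets;
};

RawRuns SortRawHkl(const std::vector<Hkl>& hkl) {
    auto key = [&](uint32_t i) { return std::make_tuple(hkl[i].h, hkl[i].k, hkl[i].l); };
    RawRuns r;
    r.order.resize(hkl.size());
    std::iota(r.order.begin(), r.order.end(), 0u);
    // A stable sort keeps ingest order within a run, so the result is reproducible.
    std::stable_sort(r.order.begin(), r.order.end(),
                     [&](uint32_t a, uint32_t b) { return key(a) < key(b); });
    r.run_offsets.push_back(0);
    for (uint32_t i = 1; i < r.order.size(); ++i)
        if (key(r.order[i]) != key(r.order[i - 1])) r.run_offsets.push_back(i);
    if (!r.order.empty())
        r.run_offsets.push_back(static_cast<uint32_t>(r.order.size()));
    return r;
}
```

A per-space-group pass can then compute one ASU key per run and stamp it onto every observation in that run, instead of reducing each observation separately.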
It is a distinct path from ScaleOnTheFly, used only for the self-scaling rotation case
(Rotation partiality + combine3d, per-image G, no B refinement, no external reference, no
absorption surface, no wedge/mosaicity override). Stills, B-factor refinement, reference
scaling and the absorption surface stay on the classic path.
Numerically equivalent to the classic path (same robust per-frame G, same 3D combine, same
XDS-order scale-fulls, same global error model, same merge statistics), validated on the
18-crystal /data/rotation_test battery: 16/18 bit-identical in space group / ISa / CC1/2;
the 2 differing crystals are ice-heavy / marginal ones on which the classic path is equally
non-deterministic run-to-run (a pre-existing upstream integration race). Scale/merge wall
time drops ~3.4x (median 9.8 -> 4.1 s), making higher --scaling-iterations cheap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>