Port the four fulls-walking reductions of MergeAndStats to the GPU, operating on the fulls
group-CSR that is already resident from scale-fulls: the per-group inv-var mean plus the
leverage-corrected error-model samples, the merge accumulate (inv-var sums + deterministic
half-sets, error-model-corrected sigma, with outlier rejection), and R_meas plus the
per-shell usable count. The host keeps the parts that are tiny or don't parallelise
cleanly: the I2-sort + 16-bin (a,b) median fit, the per-group reject median (a per-group
median is awkward on the GPU but cheap on the host using the cnt returned by the GPU), the
merged export, the shells and the gemmi completeness. Only the per-group arrays (~55k) and
the samples (~n_fulls, needed for the fit) are copied back; the fulls are not re-walked on
the host.
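
For illustration, a minimal host-side C++ sketch of the per-group inv-var mean over a CSR
layout. It is not the code in this commit: the names GroupMean, InvVarGroupMeans, row_ptr,
intensity and sigma are hypothetical, and the skip rule for unusable fulls is an
assumption. Each outer-loop iteration stands in for the work one GPU thread would do for
one group.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-group result; only arrays of this size would need to come back.
struct GroupMean {
    double mean;   // inverse-variance weighted mean intensity
    double sum_w;  // sum of the weights 1/sigma^2
    int    cnt;    // number of fulls that contributed
};

// Assumed CSR layout: the fulls of group g are the range [row_ptr[g], row_ptr[g+1]).
std::vector<GroupMean> InvVarGroupMeans(const std::vector<int>& row_ptr,
                                        const std::vector<double>& intensity,
                                        const std::vector<double>& sigma) {
    assert(!row_ptr.empty());
    assert(intensity.size() == sigma.size());
    const std::size_t n_groups = row_ptr.size() - 1;
    const double nan = std::numeric_limits<double>::quiet_NaN();
    std::vector<GroupMean> out(n_groups, GroupMean{nan, 0.0, 0});
    for (std::size_t g = 0; g < n_groups; ++g) {  // one GPU thread per group
        double sum_w = 0.0, sum_wi = 0.0;
        int cnt = 0;
        for (int k = row_ptr[g]; k < row_ptr[g + 1]; ++k) {
            // Assumed skip rule: ignore non-finite values and non-positive sigmas.
            if (!std::isfinite(intensity[k]) || !std::isfinite(sigma[k]) ||
                !(sigma[k] > 0.0)) continue;
            const double w = 1.0 / (sigma[k] * sigma[k]);
            sum_w += w;
            sum_wi += w * intensity[k];
            ++cnt;
        }
        if (cnt > 0) out[g] = GroupMean{sum_wi / sum_w, sum_w, cnt};
    }
    return out;
}
```

Walking each group's fulls in CSR order keeps the summation order fixed, which is one way
to keep such a reduction deterministic.
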
Device HalfForImage (splitmix64) and IceRingIndex mirror the host versions. The corrected
sigma uses (b*I_for_b)^2 rather than b^2*I^2 so that it matches the host rounding. The
R_meas usable count requires finite d: the host counts only fulls with a valid shell, and
a group's fulls share d, so the shell is assigned per group. Gated on fulls_resident (GPU
combine+scale-fulls active); reject is fully supported, so this path runs for the default
rot3d command.
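
For illustration, a hedged C++ sketch of the two details above. SplitMix64 is the standard
splitmix64 mixing function; HalfForImageSketch is a hypothetical stand-in for
HalfForImage (the real key, seed and bit choice are not shown here and are assumptions).
BTermProductFirst and BTermSquaresFirst are hypothetical names for the two
ways of forming the b-term: they are equal in exact arithmetic but can differ in the last
bits in floating point, which is why host and device need to use the same form.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Standard splitmix64 step applied to a 64-bit key. Pure integer arithmetic, so host
// and device produce identical results.
inline std::uint64_t SplitMix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hypothetical half-set assignment: hash the image id with a seed, take the low bit.
inline int HalfForImageSketch(std::uint64_t image_id, std::uint64_t seed) {
    return static_cast<int>(SplitMix64(image_id ^ seed) & 1ULL);
}

// b-term formed as (b * I)^2: multiply first, then square.
inline double BTermProductFirst(double b, double i_for_b) {
    const double bi = b * i_for_b;
    return bi * bi;
}

// b-term formed as b^2 * I^2: same value in exact arithmetic, different rounding steps.
inline double BTermSquaresFirst(double b, double i_for_b) {
    return (b * b) * (i_for_b * i_for_b);
}
```

In a CUDA build the same functions could be marked __host__ __device__ so that one
definition serves both paths.
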
merge+stats ~0.49s -> ~0.37s, taking RSM on lyso to ~0.78s (was ~0.91s). Validated across
the battery: 15/15 deterministic crystals are bit-identical to the CPU path (SG / ISa /
CC1/2 / completeness / total-obs, and the exact outlier-reject count); only EP_cs_01-24
shows noise-level wobble. The em-sort + a,b fit are the remaining host floor. The non-CUDA
build is unaffected (use_gpu_merge is always false there).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>