Scale the combined fulls (Unity model) on the device so they no longer round-trip
between the combine and the merge. After the GPU combine, the fulls' per-frame and
per-ASU-group CSRs are built on the host from just the small key arrays
(f_frame/f_group), using a deterministic counting sort instead of a GPU stable sort.
The fulls are then scaled in place and downloaded once.
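A minimal host-side sketch of the counting-sort CSR build described above. It is an
illustration under stated assumptions, not the committed code: `KeyCsr` and
`BuildCsrByCountingSort` are hypothetical names, and keys are assumed to be dense
integers in [0, n_keys). Because the passes are sequential and entries keep their
original order within each key, the result is identical from run to run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR container: items[offsets[k] .. offsets[k+1]) holds the indices
// of all entries whose key equals k, in their original order.
struct KeyCsr {
    std::vector<uint32_t> offsets;  // size n_keys + 1
    std::vector<uint32_t> items;    // size keys.size(); also usable as a permutation
};

// Assumption: every key is < n_keys (e.g. a frame number or an ASU-group id).
KeyCsr BuildCsrByCountingSort(const std::vector<uint32_t>& keys, uint32_t n_keys) {
    KeyCsr csr;
    csr.offsets.assign(static_cast<std::size_t>(n_keys) + 1, 0);
    csr.items.resize(keys.size());

    // Pass 1: histogram, shifted by one so the prefix sum yields start offsets.
    for (uint32_t k : keys) {
        assert(k < n_keys);
        ++csr.offsets[static_cast<std::size_t>(k) + 1];
    }

    // Pass 2: inclusive prefix sum turns counts into CSR row offsets.
    for (std::size_t k = 0; k < n_keys; ++k) {
        csr.offsets[k + 1] += csr.offsets[k];
    }

    // Pass 3: stable scatter; visiting entries in input order keeps ties ordered.
    std::vector<uint32_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        csr.items[cursor[keys[i]]++] = static_cast<uint32_t>(i);
    }
    return csr;
}
```

For keys {2, 0, 2, 1, 0} and n_keys = 3 this gives offsets {0, 2, 3, 5} and items
{1, 4, 3, 0, 2}. Only the key array is read, so the payload arrays can stay on the
device while the CSR is built.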
The four scaling kernels are reused unchanged except FitPerFrameGKernel, which gains
an optional `perm` argument (null for the partials, whose arrays are already
frame-contiguous; a frame-grouping permutation for the emit-ordered fulls) so the
fulls are scaled without a physical reorder. The Unity model falls out of giving the
fulls all-ones partiality/rlp/zeta (coeff = mean), so no other kernel changes and the
committed phase-1 partial-scaling path is bit-identical (perm == null -> idx == i).
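The optional-permutation indirection can be sketched on the host as follows. This is
not the signature of FitPerFrameGKernel; `ResolveIndex` and `SumFrameRange` are
hypothetical helpers that only illustrate how a null `perm` leaves the original
access pattern untouched while a non-null `perm` groups scattered entries by frame.

```cpp
#include <cstddef>
#include <cstdint>

// perm == nullptr -> idx == i (arrays already frame-contiguous).
// perm != nullptr -> idx == perm[i] (frame-grouping permutation).
inline std::size_t ResolveIndex(const uint32_t* perm, std::size_t i) {
    return perm ? perm[i] : i;
}

// Sum the values of one frame, given its CSR range [begin, end).
double SumFrameRange(const float* values, const uint32_t* perm,
                     std::size_t begin, std::size_t end) {
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += values[ResolveIndex(perm, i)];
    }
    return sum;
}
```

With a null `perm` the loop reads exactly the same elements in the same order as a
version without the argument, which is why an existing contiguous path can remain
bit-identical.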
Validated across the rotation battery (JFJOCH_RSM_GPU_COMBINE=1): all 15 deterministic
crystals stay run-to-run deterministic and their merged output is bit-identical to the
CPU path (SG/ISa/CC1/2/completeness). The lone exception is EP_cs_01-24 (CC1/2 2%,
R_meas 379%, i.e. unindexable noise): merged intensities/CC/completeness match exactly,
but the ill-conditioned 16-bin error-model b fit amplifies the ~1e-7 scale-fulls
rounding into ISa 10.6 vs 10.8. This is benign and in the same class as the accepted
phase-1 GPU rounding. The 3 upstream-nondeterministic crystals vary as before (caused
by GPU-prediction overflow, not by this change).
Scale-fulls drops from ~0.09s to ~0s across the two passes; the combine+scale-fulls
region takes ~0.32s on the GPU vs ~0.46s on the CPU for lyso. Still opt-in (the fulls
are downloaded for the host merge; the win grows once the merge and error model also
stay resident).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>