Steps 1-2 (GPU 3D-combine + resident scale-fulls) are validated as bit-for-bit
identical to the CPU path and run-to-run deterministic across the rotation battery,
and they cut the combine+scale-fulls region from ~0.46 s to ~0.32 s on lyso. Make
them the default when a GPU is present, consistent with phase-1 partial scaling
already being default-on. JFJOCH_RSM_CPU_COMBINE forces the CPU combine/scale-fulls
path for A/B testing or debugging; JFJOCH_RSM_NO_GPU still disables the whole GPU
path.
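
The precedence of the two switches can be sketched as below. This is only an
illustration, not the code in this change: RsmGpuConfig, EnvSet and
SelectRsmGpuConfig are hypothetical names, and treating "variable present with any
value" as "set" is an assumption about how the variables are parsed.

```cpp
#include <cstdlib>

// Hypothetical result type: which parts of the RSM pipeline run on the GPU.
struct RsmGpuConfig {
    bool gpu_enabled;  // any GPU work at all (e.g. phase-1 partial scaling)
    bool gpu_combine;  // steps 1-2: GPU 3D-combine + resident scale-fulls
};

// Assumption: a variable counts as set if it exists, whatever its value.
static bool EnvSet(const char *name) {
    return std::getenv(name) != nullptr;
}

// Hypothetical selection logic; gpu_present comes from device detection.
RsmGpuConfig SelectRsmGpuConfig(bool gpu_present) {
    // JFJOCH_RSM_NO_GPU disables the whole GPU path.
    const bool gpu_enabled = gpu_present && !EnvSet("JFJOCH_RSM_NO_GPU");
    // JFJOCH_RSM_CPU_COMBINE only forces combine/scale-fulls back to the CPU.
    const bool gpu_combine = gpu_enabled && !EnvSet("JFJOCH_RSM_CPU_COMBINE");
    return {gpu_enabled, gpu_combine};
}
```

Under these assumptions, setting only JFJOCH_RSM_CPU_COMBINE leaves the rest of
the GPU path enabled, which is what makes it usable for A/B comparison of the
combine/scale-fulls region alone.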

The only battery crystal whose reported metrics move is EP_cs_01-24 (CC1/2 2%,
unindexable noise), whose upstream integration is itself nondeterministic. Its merged
intensities, CC, and completeness are unchanged; only the ill-conditioned error-model
b moves.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>