The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge ->
error model -> stats) dominated offline wall-clock time because ScaleOnTheFly, MergeAll,
CombineRotationObservations and MergeOnTheFly rebuild a std::map keyed by hkl on every
scaling iteration and every merge (7-14 map rebuilds per space-group pass), and each
rebuild re-walks and re-keys millions of per-frame partials.

RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them
across both space-group passes. The partials are sorted by raw hkl a single time at ingest,
so the per-pass 3D combine only re-splits events (no sort), and the ASU grouping reuses that
order to run one gemmi reduction per distinct raw hkl (~10x fewer calls) rather than one
per observation.
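A minimal sketch of the sort-once idea, not the actual RotationScaleMerge code. FlatPartials, BuildOrder and AssignAsu are hypothetical names, and the to_asu callback is a stand-in for whatever gemmi call performs the ASU reduction:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

using Hkl = std::array<int, 3>;

// Hypothetical flat store: one entry per partial observation.
struct FlatPartials {
    std::vector<Hkl> raw_hkl;        // indices as integrated, not yet reduced to the ASU
    std::vector<std::size_t> order;  // permutation that sorts raw_hkl, built once
    std::vector<std::size_t> seg;    // seg[k]..seg[k+1] = one run of equal raw hkl in `order`
};

// Sort once at ingest and record where each run of identical raw hkl starts.
void BuildOrder(FlatPartials& p) {
    p.order.resize(p.raw_hkl.size());
    std::iota(p.order.begin(), p.order.end(), std::size_t{0});
    std::stable_sort(p.order.begin(), p.order.end(),
                     [&](std::size_t a, std::size_t b) { return p.raw_hkl[a] < p.raw_hkl[b]; });
    p.seg.clear();
    for (std::size_t i = 0; i < p.order.size(); ++i)
        if (i == 0 || p.raw_hkl[p.order[i]] != p.raw_hkl[p.order[i - 1]])
            p.seg.push_back(i);
    p.seg.push_back(p.order.size());  // end sentinel
}

// Per space-group pass: call the reducer once per distinct raw hkl and
// broadcast the result to every observation in that run. No re-sort.
std::vector<Hkl> AssignAsu(const FlatPartials& p,
                           const std::function<Hkl(const Hkl&)>& to_asu,
                           std::size_t* n_calls = nullptr) {
    std::vector<Hkl> asu(p.raw_hkl.size());
    std::size_t calls = 0;
    for (std::size_t k = 0; k + 1 < p.seg.size(); ++k) {
        const Hkl reduced = to_asu(p.raw_hkl[p.order[p.seg[k]]]);
        ++calls;
        for (std::size_t i = p.seg[k]; i < p.seg[k + 1]; ++i)
            asu[p.order[i]] = reduced;
    }
    if (n_calls) *n_calls = calls;
    return asu;
}
```

BuildOrder would run once at ingest; AssignAsu would run once per candidate space group with a different to_asu, so the cost of the symmetry reduction scales with the number of distinct raw hkl rather than with the number of observations.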
Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel
per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are
computed once at the end, not every iteration.
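As an illustration of the "flat loop" shape, here is a hedged sketch of a per-frame robust IRLS scale fit over frame-contiguous arrays. The model I_obs ~ G * I_ref, the Huber weights and every name below (FrameObs, RefineFrameScales, huber_k) are assumptions made for the example; the commit does not state which robust loss the real code uses:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: observations stored contiguously per frame.
struct FrameObs {
    std::vector<double> i_obs;             // observed intensity
    std::vector<double> i_ref;             // current merged estimate for the same hkl
    std::vector<double> sigma;             // sigma(i_obs), assumed > 0
    std::vector<std::size_t> frame_start;  // size n_frames + 1, segment boundaries
};

// One scale G per frame, refined by iteratively reweighted least squares.
// Frames are independent, so the outer loop could be split across threads
// or mapped to one GPU block per frame.
std::vector<double> RefineFrameScales(const FrameObs& d, int n_iter = 10,
                                      double huber_k = 1.345) {
    const std::size_t n_frames = d.frame_start.empty() ? 0 : d.frame_start.size() - 1;
    std::vector<double> g(n_frames, 1.0);
    for (std::size_t f = 0; f < n_frames; ++f) {
        for (int it = 0; it < n_iter; ++it) {
            double num = 0.0, den = 0.0;
            for (std::size_t i = d.frame_start[f]; i < d.frame_start[f + 1]; ++i) {
                // Normalised residual under the current scale.
                const double r = (d.i_obs[i] - g[f] * d.i_ref[i]) / d.sigma[i];
                const double a = std::fabs(r);
                // Huber weight: 1 inside the threshold, k/|r| outside, times 1/sigma^2.
                const double w = (a <= huber_k ? 1.0 : huber_k / a) / (d.sigma[i] * d.sigma[i]);
                num += w * d.i_obs[i] * d.i_ref[i];
                den += w * d.i_ref[i] * d.i_ref[i];
            }
            if (den > 0.0) g[f] = num / den;  // weighted least-squares update
        }
    }
    return g;
}
```

The inner loop is a segmented reduction (two sums per frame), which is the pattern that carries over to a GPU kernel without restructuring.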

RotationScaleMerge is a distinct path from ScaleOnTheFly, used only for the self-scaling
rotation case (Rotation partiality + combine3d, per-image G, no B refinement, no external
reference, no absorption surface, no wedge/mosaicity override). Stills, B-factor refinement,
reference scaling and the absorption surface stay on the classic path.
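The eligibility rule above could be expressed as a single predicate. This is only a sketch: ScalingSettings, its field names and UseRotationScaleMerge are hypothetical and do not claim to match the real option struct:

```cpp
// Hypothetical settings struct; field names are illustrative only.
enum class PartialityModel { Still, Rotation };

struct ScalingSettings {
    PartialityModel partiality = PartialityModel::Rotation;
    bool combine3d = true;
    bool per_image_g = true;
    bool refine_b = false;
    bool has_external_reference = false;
    bool absorption_surface = false;
    bool wedge_or_mosaicity_override = false;
};

// True only for the self-scaling rotation case; everything else
// falls back to the classic path.
inline bool UseRotationScaleMerge(const ScalingSettings& s) {
    return s.partiality == PartialityModel::Rotation
        && s.combine3d
        && s.per_image_g
        && !s.refine_b
        && !s.has_external_reference
        && !s.absorption_surface
        && !s.wedge_or_mosaicity_override;
}
```

Keeping the check in one place makes the fallback explicit: any option outside the listed set routes to the classic path rather than being partially supported.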

The new path is numerically equivalent to the classic path (same robust per-frame G, same
3D combine, same XDS-order scale-fulls, same global error model, same merge statistics),
validated on the 18-crystal /data/rotation_test battery: 16/18 are bit-identical in space
group / ISa / CC1/2; the 2 differing crystals are ice-heavy / marginal cases on which the
classic path is equally non-deterministic run-to-run (a pre-existing upstream integration
race). Scale/merge wall time drops ~3.4x (median 9.8 -> 4.1 s), making higher
--scaling-iterations cheap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>