Commit Graph
3 Commits
Author SHA1 Message Date
leonarski_f and Claude Opus 4.8 29c8ba6112 rotation: deterministic frame-order mosaicity smoothing + partiality recompute
Prediction applied a mosaicity/profile-radius moving average (RotationParameters) over the
last N *processed* frames. Under the parallel per-image loop that window is thread-arrival
order, so the smoothed value - and hence which reflections are predicted/integrated - was
non-deterministic run-to-run, swinging CC1/2 (and even the space group) on marginal crystals.
`-N 1` was deterministic; `-N 32` was not.
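
The order dependence described above can be reproduced in a few lines. This is only a sketch: TrailingAverage and SmoothedAtFrame are hypothetical stand-ins for a "last N processed frames" smoother, not the RotationParameters code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

// Hypothetical smoother: averages whatever values were pushed most recently,
// regardless of which frame number they belong to.
class TrailingAverage {
public:
    explicit TrailingAverage(std::size_t window) : window_(window) { assert(window > 0); }

    // Add one value and return the mean of the last `window_` values pushed.
    double Push(double value) {
        recent_.push_back(value);
        if (recent_.size() > window_) recent_.pop_front();
        return std::accumulate(recent_.begin(), recent_.end(), 0.0) /
               static_cast<double>(recent_.size());
    }

private:
    std::size_t window_;
    std::deque<double> recent_;
};

// Smoothed value that `target_frame` sees when frames finish in `arrival_order`.
double SmoothedAtFrame(const std::vector<double>& mosaicity_per_frame,
                       const std::vector<std::size_t>& arrival_order,
                       std::size_t target_frame, std::size_t window) {
    TrailingAverage avg(window);
    for (std::size_t frame : arrival_order) {
        assert(frame < mosaicity_per_frame.size());
        const double smoothed = avg.Push(mosaicity_per_frame[frame]);
        if (frame == target_frame) return smoothed;
    }
    assert(false && "target frame missing from arrival order");
    return 0.0;
}
```

With per-frame values {0.10, 0.20, 0.30, 0.40} and a window of 2, frame 2 gets 0.25 when frames arrive as 0,1,2,3 but 0.35 when they arrive as 1,0,3,2: same data, different prediction input.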

Fix (as designed with FL): prediction now uses each frame's OWN mosaicity/profile-radius
(image-local, deterministic membership - a reflection on the cutoff contributes ~nothing).
The smoothing that actually matters is moved into RotationScaleMerge and done in FRAME order
(deterministic): per-frame mosaicity is smoothed with the same window as smooth-G, then every
partial's partiality is recomputed from it BEFORE the 3D combine. This is the mosaicity analogue
of smooth-G: combining a reflection's per-frame partials only tiles the rocking curve correctly
(captured fractions summing toward 1) if neighbouring frames share a consistent mosaicity.
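
A minimal sketch of why a consistent mosaicity matters for the combine. It ASSUMES a Gaussian rocking curve and a centred, edge-truncated window; the actual partiality model and the smooth-G window shape are not stated in this commit, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// ASSUMPTION: Gaussian rocking curve centred at phi0 with standard deviation sigma.
// Returns the fraction of the reflection captured while rotating phi_start -> phi_end.
double GaussianPartiality(double phi_start, double phi_end, double phi0, double sigma) {
    assert(sigma > 0.0 && phi_end >= phi_start);
    const double inv = 1.0 / (sigma * std::sqrt(2.0));
    return 0.5 * (std::erf((phi_end - phi0) * inv) - std::erf((phi_start - phi0) * inv));
}

// Centred moving average indexed by frame number, so the result cannot depend on
// the order in which threads finished. The window is truncated at both ends.
std::vector<double> SmoothInFrameOrder(const std::vector<double>& per_frame,
                                       std::size_t half_window) {
    std::vector<double> out(per_frame.size());
    for (std::size_t i = 0; i < per_frame.size(); ++i) {
        const std::size_t lo = (i >= half_window) ? i - half_window : 0;
        const std::size_t hi = std::min(i + half_window, per_frame.size() - 1);
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += per_frame[j];
        out[i] = sum / static_cast<double>(hi - lo + 1);
    }
    return out;
}

// Sum of one reflection's partialities over contiguous frames of equal width,
// each frame evaluated with its own sigma.
double SummedPartiality(const std::vector<double>& sigma_per_frame,
                        double frame_width, double phi0) {
    double total = 0.0;
    for (std::size_t i = 0; i < sigma_per_frame.size(); ++i) {
        const double start = static_cast<double>(i) * frame_width;
        total += GaussianPartiality(start, start + frame_width, phi0, sigma_per_frame[i]);
    }
    return total;
}
```

When every frame uses the same sigma the per-frame terms telescope and the sum approaches 1. When sigma alternates between neighbouring frames the terms no longer tile the curve, and in this toy case smoothing in frame order before recomputing brings the sum back towards 1.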

Battery (18 crystals, /data/rotation_test, 2 runs each): 15/18 are now bit-identical run-to-run
(the good crystals are unchanged: lyso P41212 ISa 7.8 CC1/2 99.7%). The 3 residual crystals
(EcwtAL500, EcwtCQ066S, pding4_003; all large/triclinic cells) still jitter by ~0.002%. This is
traced to a SEPARATE, benign cause: overflow of the GPU prediction buffer on dense frames
(BraggPredictionRotGPU max_reflections=10000, with a racy atomicAdd/atomicSub). Cell and space
group stay stable. It will be addressed in the GPU prediction/integration rework: naively raising
the cap also changes prediction quality, so it is not a one-line bump.

Minor label refinements from the recomputed partiality: cytC_2 P321 -> P3121 (now consistent with
cytC_3); Ins_I_2/3 report the honest I23/I213 screw-axis ambiguity.
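
For illustration only, a CPU sketch of how a capped buffer filled with an add-then-undo atomic pattern becomes order dependent once it overflows. This is an ASSUMED shape of the problem, written with std::atomic instead of CUDA atomics; CappedBuffer and TryAppend are hypothetical names, not the BraggPredictionRotGPU code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity output buffer, filled with an add-then-undo pattern.
struct CappedBuffer {
    explicit CappedBuffer(std::size_t capacity) : slots(capacity, -1), count(0) {}

    std::vector<int> slots;          // stored reflection ids (-1 = empty)
    std::atomic<std::size_t> count;  // number of claimed slots

    // Claim a slot; if the buffer is already full, undo the claim and drop the item.
    bool TryAppend(int reflection_id) {
        const std::size_t idx = count.fetch_add(1);
        if (idx >= slots.size()) {
            // Between the add above and this sub, other threads can observe
            // count > capacity.
            count.fetch_sub(1);
            return false;
        }
        slots[idx] = reflection_id;
        return true;
    }
};
```

Even single-threaded, a capacity of 2 keeps {1, 2} when ids arrive as 1,2,3 and {3, 2} when they arrive as 3,2,1. On a GPU the arrival order is set by thread scheduling, so which reflections survive an overflow can differ between runs.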

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:58:10 +02:00
leonarski_f and Claude Opus 4.8 b5d9167bf4 RotationScaleMerge: chunked parallelism -> scale/merge phase ~2s
The per-observation corr update (7.6M items) ran through a work-stealing ParallelFor that
does one atomic fetch_add PER item: pure contention for trivial work (measured: update 0.60s
vs reduce 0.15s / fit 0.13s in the scale-partials loop). Add ParallelChunks (one contiguous
range per worker, no per-item sync) and use it for UpdateCorr. Also parallelise the ASU keying
(gemmi reduction per distinct raw hkl; HKLKeyGenerator is const, so safe to read concurrently)
and the group-stamping over disjoint raw-hkl runs.

scale-partials 0.90 -> 0.28s, group-hkl 0.20 -> 0.09s, per-pass warm 0.83s, whole scale/merge
phase ~3.3 -> ~2.0s. Bit-identical output (same space group, ISa, CC1/2). ParallelChunks is
the CPU stand-in for a flat CUDA grid-stride kernel; ParallelFor stays for the heavy, uneven
per-frame fits where the atomic amortises and work-stealing balances the load.
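
The two scheduling strategies can be sketched side by side. ParallelFor and ParallelChunks are the names used in this commit, but the signatures below are assumptions, and plain std::thread stands in for whatever worker pool the real code uses.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Work-stealing style loop: every item costs one atomic fetch_add on a shared counter.
void ParallelForSketch(std::size_t n, std::size_t workers,
                       const std::function<void(std::size_t)>& body) {
    assert(workers > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&next, &body, n] {
            for (std::size_t i = next.fetch_add(1); i < n; i = next.fetch_add(1)) body(i);
        });
    }
    for (auto& t : pool) t.join();
}

// Chunked loop: each worker gets one contiguous [begin, end) range, no per-item sync.
void ParallelChunksSketch(std::size_t n, std::size_t workers,
                          const std::function<void(std::size_t, std::size_t)>& body) {
    assert(workers > 0);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceil(n / workers)
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        if (begin == end) break;  // nothing left for this or later workers
        pool.emplace_back([&body, begin, end] { body(begin, end); });
    }
    for (auto& t : pool) t.join();
}
```

Both variants touch each index exactly once, so the output is the same. The chunked form suits cheap, uniform items; the per-item counter only pays off when each item is expensive and uneven enough for load balancing to matter.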

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:31:41 +02:00
leonarski_f and Claude Opus 4.8 29c5410328 rotation scaling: dedicated allocate-once RotationScaleMerge, ~3.4x faster scale/merge
The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge ->
error model -> stats) dominated offline wall clock because ScaleOnTheFly + MergeAll
+ CombineRotationObservations + MergeOnTheFly rebuild a std::map keyed by hkl on every
scaling iteration and every merge (7-14 map rebuilds per space-group pass), each re-walking
and re-keying millions of per-frame partials.

RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them
across both space-group passes: the raw-hkl ordering is sorted a single time at ingest, so
the per-pass 3D combine only re-splits events (no sort) and the ASU grouping is one gemmi
reduction per distinct raw hkl (~10x fewer) rather than per observation, reusing that order.
Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel
per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are
computed once at the end, not every iteration.
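
The sort-once, flat-loop idea can be sketched as follows. The Observation layout, FlatObservations, Ingest and MeanPerRun are hypothetical names for illustration; a plain mean stands in for the real per-group statistics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

struct Observation {  // hypothetical layout
    int h, k, l;
    double intensity;
};

struct FlatObservations {
    std::vector<Observation> obs;        // sorted by (h, k, l) once at ingest
    std::vector<std::size_t> run_start;  // first index of each hkl run, plus end sentinel
};

// Sort a single time and record where each run of equal hkl begins.
FlatObservations Ingest(std::vector<Observation> obs) {
    auto key = [](const Observation& o) { return std::tie(o.h, o.k, o.l); };
    std::stable_sort(obs.begin(), obs.end(),
                     [&key](const Observation& a, const Observation& b) {
                         return key(a) < key(b);
                     });
    FlatObservations flat;
    for (std::size_t i = 0; i < obs.size(); ++i) {
        if (i == 0 || key(obs[i]) != key(obs[i - 1])) flat.run_start.push_back(i);
    }
    flat.run_start.push_back(obs.size());
    flat.obs = std::move(obs);
    return flat;
}

// Segmented reduction: one flat pass over the runs, no map rebuild, no re-sort.
std::vector<double> MeanPerRun(const FlatObservations& flat) {
    assert(!flat.run_start.empty());
    std::vector<double> mean;
    mean.reserve(flat.run_start.size() - 1);
    for (std::size_t r = 0; r + 1 < flat.run_start.size(); ++r) {
        double sum = 0.0;
        for (std::size_t i = flat.run_start[r]; i < flat.run_start[r + 1]; ++i) {
            sum += flat.obs[i].intensity;
        }
        mean.push_back(sum / static_cast<double>(flat.run_start[r + 1] - flat.run_start[r]));
    }
    return mean;
}
```

MeanPerRun can be called on every scaling iteration without touching the ordering, which is the property the allocate-once design relies on. Each run is independent, so the outer loop is also a natural unit for a parallel or GPU reduction.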

It is a distinct path from ScaleOnTheFly, used only for the self-scaling rotation case
(Rotation partiality + combine3d, per-image G, no B refinement, no external reference, no
absorption surface, no wedge/mosaicity override). Stills, B-factor refinement, reference
scaling and the absorption surface stay on the classic path.
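
The path selection above amounts to a single predicate. A sketch with a hypothetical settings struct; the field names are illustrative and do not correspond to real option names in the code base.

```cpp
#include <cassert>

// Hypothetical settings struct; field names are illustrative only.
struct ScalingSettings {
    bool rotation_partiality = false;
    bool combine3d = false;
    bool per_image_g = false;
    bool refine_b = false;
    bool external_reference = false;
    bool absorption_surface = false;
    bool wedge_or_mosaicity_override = false;
};

// True only for the self-scaling rotation case; everything else stays on the classic path.
bool UsesRotationScaleMerge(const ScalingSettings& s) {
    return s.rotation_partiality && s.combine3d && s.per_image_g &&
           !s.refine_b && !s.external_reference && !s.absorption_surface &&
           !s.wedge_or_mosaicity_override;
}
```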

Numerically equivalent to the classic path (same robust per-frame G, same 3D combine, same
XDS-order scale-fulls, same global error model, same merge statistics), validated on the
18-crystal /data/rotation_test battery: 16/18 bit-identical in space group / ISa / CC1/2;
the 2 differing crystals are ice-heavy / marginal ones on which the classic path is equally
non-deterministic run-to-run (a pre-existing upstream integration race). Scale/merge wall
time drops ~3.4x (median 9.8 -> 4.1 s), making higher --scaling-iterations cheap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:00:12 +02:00