Commit Graph
12 Commits
Author SHA1 Message Date
leonarski_f and Claude Opus 4.8 fccf9b83e7 RotationScaleMerge: drop profiling + env gates, honour forced mosaicity
Remove the [rsm] per-stage lap timing and the JFJOCH_RSM_NO_GPU / JFJOCH_RSM_CPU_COMBINE
env gates now that the GPU-resident path is the validated default (it runs whenever a GPU
is present, with the CPU loops as the bit-parity fallback; the diagnostic-dump path still
uses the CPU combine).

Honour a fixed (forced) mosaicity: SmoothMosaicityAndPartiality now overrides every frame
with GetForcedMosaicity() when set, instead of always reading the per-frame integration
value - so the caller can route the --mosaicity case through RotationScaleMerge (its
partiality recompute makes it a natural fit) rather than a separate path.
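
A minimal sketch of the override described above, assuming the forced value can be modelled as a `std::optional<float>`. `EffectiveMosaicity` and its signature are hypothetical and are not the actual SmoothMosaicityAndPartiality code; the real return type of GetForcedMosaicity() is not stated in this message.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical helper: choose the mosaicity used for every frame.
// `forced` stands in for whatever GetForcedMosaicity() returns (assumed optional).
std::vector<float> EffectiveMosaicity(const std::vector<float>& per_frame,
                                      const std::optional<float>& forced) {
    if (forced) {
        assert(*forced >= 0.0f);
        // A forced value overrides every frame, ignoring the integration result.
        return std::vector<float>(per_frame.size(), *forced);
    }
    // Otherwise the per-frame values are kept (and would be smoothed downstream).
    return per_frame;
}
```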

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 11:41:31 +02:00
leonarski_f and Claude Opus 4.8 34b3c3c4e7 RotationScaleMerge: GPU merge + error-model reductions over the resident fulls
Port the four fulls-walking reductions of MergeAndStats to the GPU, over the fulls
group-CSR already resident from scale-fulls: the per-group inv-var mean + leverage-
corrected error-model samples, the merge accumulate (inv-var sums + deterministic
half-sets, error-model-corrected sigma, with outlier rejection), and R_meas + the
per-shell usable count. The host keeps the parts that don't parallelise cleanly or are
tiny: the I2-sort + 16-bin (a,b) median fit, the per-group reject median (a per-group
median is awkward on the GPU - cheap on the host from the GPU cnt), the merged export,
the shells and the gemmi completeness. Only per-group arrays (~55k) + the samples
(~n_fulls, for the fit) come back - the fulls are not re-walked on the host.

Device HalfForImage (splitmix64) + IceRingIndex mirror the host; the corrected-sigma
uses (b*I_for_b)^2 (not b^2*I^2) to match the host rounding; the R_meas usable count
requires finite d (the host counts only fulls with a valid shell, and a group's fulls
share d, so the shell is assigned per group). Gated on fulls_resident (GPU
combine+scale-fulls active); reject is fully supported so it runs for the default
rot3d command.
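
For illustration, a sketch of a deterministic half-set assignment built on the standard splitmix64 mixer, plus the `(b*I_for_b)^2` term written so the product is formed before squaring. How HalfForImage actually maps the hash to a half-set is not stated here; taking the top bit is an assumption, and `HalfForImageSketch` and `BTermSquared` are hypothetical names.

```cpp
#include <cstdint>

// Standard splitmix64 mixing step (well-known public-domain constants).
inline std::uint64_t SplitMix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Assumed mapping: the top bit of the hash selects half-set 0 or 1.
// The same integer arithmetic gives the same answer on host and device.
inline int HalfForImageSketch(std::uint64_t image_number) {
    return static_cast<int>(SplitMix64(image_number) >> 63);
}

// Form the product first, then square it: (b*I)^2.
// b*b*I*I can round differently in float, which would break bit parity.
inline float BTermSquared(float b, float i_for_b) {
    const float t = b * i_for_b;
    return t * t;
}
```

Because the assignment depends only on the image number, it does not change with thread scheduling or with the order in which fulls are visited.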

merge+stats ~0.49 -> ~0.37s, taking RSM on lyso to ~0.78s (was ~0.91). Validated across
the battery: 15/15 deterministic crystals bit-identical to the CPU path (SG / ISa /
CC1/2 / completeness / total-obs, and the exact outlier-reject count), only EP_cs_01-24
noise wobbles. The em-sort + a,b fit are the remaining host floor. Non-CUDA build
unaffected (use_gpu_merge is always false there).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 10:38:36 +02:00
leonarski_f and Claude Opus 4.8 786edf9929 RotationScaleMerge: GPU smooth-G, corr kept resident across the pass
Apply smooth-G's corr adjustment on the device (a small kernel: corr[i] *=
ratio[frame[i]] for flagged frames, double-then-float, matching host SmoothG) so the
per-image corr never leaves the GPU: it now stays resident through scaling ->
smooth-G -> per-frame CC -> combine, and across the two space-group passes exactly as
the old host round-trip did. The host only builds the tiny per-frame ratio (g/g_smooth
via the extracted ComputeSmoothGWindow) and refreshes host partials[].corr solely for
the CPU-combine path (JFJOCH_RSM_CPU_COMBINE or the diagnostic dump).
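
A host-side sketch of the per-observation adjustment, under the assumption that "double-then-float" means the product is formed in double and rounded to float once. `ApplySmoothGRatio` and the array layout are illustrative; on the device each loop iteration would be one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference loop for corr[i] *= ratio[frame[i]] on flagged frames.
void ApplySmoothGRatio(std::vector<float>& corr,
                       const std::vector<std::uint32_t>& frame,
                       const std::vector<double>& ratio,
                       const std::vector<std::uint8_t>& flagged) {
    assert(corr.size() == frame.size());
    assert(ratio.size() == flagged.size());
    for (std::size_t i = 0; i < corr.size(); ++i) {
        const std::uint32_t f = frame[i];
        assert(f < ratio.size());
        if (!flagged[f]) continue;  // unflagged frames keep their corr untouched
        // Multiply in double, round to float once (assumed meaning of
        // "double-then-float"), so host and device round identically.
        corr[i] = static_cast<float>(static_cast<double>(corr[i]) * ratio[f]);
    }
}
```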

This drops the post-scale GetCorr and the two SetCorr re-uploads (~3x25MB/pass) plus the
6.3M host corr-adjust loop: scale-partials ~0.21->~0.10s and the smooth+combine region
shrinks, taking RSM on lyso to ~0.91s (was ~1.47s with phase-1-only, ~1.71s full-CPU) -
under the 1s target for this crystal; merge+stats (~0.49s) is now the dominant chunk.

Bit-identical (GPU smooth-G == host SmoothG on the resident corr); validated across the
battery (15/15 deterministic crystals bit-identical to CPU across default / CPU-combine /
NO_GPU, only EP_cs_01-24 noise wobbles). Non-CUDA build unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 09:49:31 +02:00
leonarski_f and Claude Opus 4.8 87c70195f8 RotationScaleMerge: faster group-hkl via flat group ids + skip Obs stamp
Profiling showed the per-space-group "group hkl" step (~0.30s/2-pass on lyso) is not
gemmi-bound (the ASU keying is ~6ms) but memory-bandwidth-bound: stamping the group id
onto, and reading it back from, the `group` field scattered across the 56-byte Obs
struct touches the whole ~350MB partials array twice per pass.

Precompute the per-obs AcceptReflection finiteness once (immutable) into a flat 1-byte
array, then stamp the ASU-group id from rawrun_group + that flat array into a flat
group_ids vector for the GPU, and build the group CSR (a stable counting sort, now
parallel) from group_ids - all sequential/flat reads. The Obs.group field is written
only when a CPU stage will read it (no GPU: scaling/CC/combine otherwise use
group_ids / rawrun_group, never partials.group), so the default path skips the strided
Obs pass entirely. group hkl ~0.31 -> ~0.20 s/2-pass on lyso.

Output is bit-identical (group_ids values and the obs-index-ordered gperm are unchanged),
so the merged results are unchanged; validated across the battery (15/15 deterministic
crystals bit-identical to the CPU path, only EP_cs_01-24 noise keeps its benign wobble).
Non-CUDA build unaffected (need_obs_group is always true there).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 09:25:31 +02:00
leonarski_f and Claude Opus 4.8 7d804eb799 RotationScaleMerge: GPU post-smooth group means + per-frame CC
The single-threaded ReduceGroupMeans over the 6.3M partials (~0.07s/2-pass) and the
per-frame diagnostic CC now run on the resident partials on the GPU: after SmoothG,
the smoothed corr is uploaded once (and left resident for the combine, dropping the
combine's redundant re-upload), then the post-smooth group means (reusing the scaling
reduce) and the per-frame Pearson CC (a new one-block-per-frame kernel) run there and
only the tiny per-frame cc/cc_n come back. FinalizePerFrameScale is split into
ComputePerFrameCC (host reference) + the writeback; the GPU path uses ComputePartialCC.
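
As a reference for what a per-frame Pearson CC computes, a sequential sketch that returns the coefficient and the number of pairs. The inputs are assumed to be the corrected intensities on one frame and their group means; `PearsonCCSketch` is a hypothetical name, not ComputePerFrameCC itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct FrameCC {
    double cc;      // Pearson correlation, NaN if undefined
    std::size_t n;  // number of pairs used
};

// Sequential reference: a GPU version would reduce the same five sums per frame.
FrameCC PearsonCCSketch(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double nan = std::numeric_limits<double>::quiet_NaN();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    if (n < 2) return {nan, n};
    const double cov = sxy - sx * sy / static_cast<double>(n);
    const double vx = sxx - sx * sx / static_cast<double>(n);
    const double vy = syy - sy * sy / static_cast<double>(n);
    if (vx <= 0 || vy <= 0) return {nan, n};
    return {cov / std::sqrt(vx * vy), n};
}
```

A tree reduction adds these sums in a different order than the loop above, which is where a last-place rounding difference between GPU and CPU can come from.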

The per-frame CC is diagnostic only (the per-image scaling table), so the tree
reduction's ~ulp difference from the CPU is immaterial and it does not touch merged
intensities. smooth+CC region ~0.10s GPU vs ~0.15s CPU on lyso. Validated across the
battery: 15/15 deterministic crystals run-to-run deterministic and merged output
bit-identical to the CPU path (only EP_cs_01-24, unindexable noise, keeps its benign
error-model-b wobble). CPU fallbacks (JFJOCH_RSM_CPU_COMBINE / _NO_GPU) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 08:58:26 +02:00
leonarski_f and Claude Opus 4.8 617184041f RotationScaleMerge: enable GPU combine + scale-fulls by default
Steps 1-2 (GPU 3D-combine + resident scale-fulls) are validated bit-parity and
run-to-run deterministic against the CPU path across the rotation battery, and cut
the combine+scale-fulls region from ~0.46s to ~0.32s on lyso, so make them the
default when a GPU is present (consistent with phase-1 partial scaling already being
default-on). JFJOCH_RSM_CPU_COMBINE forces the CPU combine/scale-fulls for A/B or
debugging; JFJOCH_RSM_NO_GPU still disables the whole GPU path.

The only battery crystal whose reported metrics move is EP_cs_01-24 (CC1/2 2%,
unindexable noise) whose upstream integration is itself nondeterministic; its merged
intensities/CC/completeness are unchanged, only the ill-conditioned error-model b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 08:35:23 +02:00
leonarski_f and Claude Opus 4.8 ced85bcd9d RotationScaleMerge: GPU scale-fulls, fulls kept resident (phase 2 step 2)
Scale the combined fulls (Unity model) on the device so they no longer round-trip
between the combine and the merge: after the GPU combine, build the fulls' per-frame
and per-ASU-group CSRs on the host from just the small key arrays (f_frame/f_group)
with a deterministic counting sort - no GPU stable-sort - then scale in place and
download once.
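
A sketch of building a CSR (offsets plus permutation) from a small key array with a stable counting sort, as the paragraph above describes for f_frame/f_group. The struct and function names are hypothetical; only the technique is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GroupCsr {
    std::vector<std::uint32_t> offsets;  // size n_keys + 1
    std::vector<std::uint32_t> perm;     // item indices, grouped by key
};

// Stable counting sort: items with equal key keep their original order,
// so the result is the same on every run.
GroupCsr BuildCsrSketch(const std::vector<std::uint32_t>& key, std::uint32_t n_keys) {
    GroupCsr csr;
    csr.offsets.assign(static_cast<std::size_t>(n_keys) + 1, 0);
    for (std::uint32_t k : key) {
        assert(k < n_keys);
        ++csr.offsets[k + 1];  // histogram, shifted by one
    }
    for (std::uint32_t k = 0; k < n_keys; ++k)
        csr.offsets[k + 1] += csr.offsets[k];  // prefix sum -> start of each group
    std::vector<std::uint32_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
    csr.perm.resize(key.size());
    for (std::uint32_t i = 0; i < key.size(); ++i)
        csr.perm[cursor[key[i]]++] = i;  // scatter in input order (stable)
    return csr;
}
```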

The four scaling kernels are reused unchanged except FitPerFrameGKernel, which gains
an optional `perm` argument (null for the partials, whose arrays are already
frame-contiguous; a frame-grouping permutation for the emit-ordered fulls) so the
fulls are scaled without a physical reorder. The Unity model falls out of giving the
fulls all-ones partiality/rlp/zeta (coeff = mean), so no other kernel changes and the
committed phase-1 partial-scaling path is bit-identical (perm == null -> idx == i).
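
The optional-permutation idea can be sketched as a single index indirection. `SumFrameSketch` is a hypothetical stand-in for the per-frame loop inside FitPerFrameGKernel, whose real signature is not shown in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sum the values belonging to one frame, stored at positions [begin, end).
// perm == nullptr: data is already frame-contiguous, so idx == i.
// perm != nullptr: position i maps to the physical index perm[i].
double SumFrameSketch(const float* values, const std::uint32_t* perm,
                      std::size_t begin, std::size_t end) {
    assert(begin <= end);
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        const std::size_t idx = perm ? perm[i] : i;
        sum += values[idx];
    }
    return sum;
}
```

With a null `perm` the loop visits exactly the same indices in the same order as before the argument was added, which is why an existing path can stay bit-identical.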

Validated across the rotation battery (JFJOCH_RSM_GPU_COMBINE=1): all 15 deterministic
crystals stay run-to-run deterministic and their merged output is bit-identical to the
CPU path (SG / ISa / CC1/2 / completeness). The lone exception is EP_cs_01-24 (CC1/2 2%,
R_meas 379% - unindexable noise): merged intensities/CC/completeness match exactly, but
the ill-conditioned 16-bin error-model b fit amplifies the ~1e-7 scale-fulls rounding
to ISa 10.6 vs 10.8 - benign, same class as the accepted phase-1 GPU rounding. The 3
upstream-nondeterministic crystals vary as before (GPU-prediction overflow, not this).

Scale-fulls drops from ~0.09s to ~0 across the two passes; combine+scale-fulls region
~0.32s GPU vs ~0.46s CPU on lyso. Still opt-in (fulls are downloaded for the host merge;
the win grows once the merge/error-model also stay resident).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 07:51:29 +02:00
leonarski_f and Claude Opus 4.8 2c928b27cd RotationScaleMerge: GPU 3D-combine (CUDA port, phase 2 step 1)
Port Combine() (partials->fulls) to CUDA, mirroring process_rawrun bit-for-bit:
one thread per raw-hkl run splits its usable partials into rocking events (frame
gap <= 2), pools background, seeds F, runs 3 de-biased Poisson reweights and the
capture-uncertainty term. Emission is deterministic - a count pass, a host
exclusive prefix sum for per-run offsets, then an emit pass at those offsets - so
fulls come out in raw-run-major/event order, identical to the CPU path; both pass
instantiations share the same arithmetic so count == emit exactly. Dmax/Dmin/Fmax
reproduce std::max/min NaN semantics (not fmax) for parity.
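
The count / prefix-sum / emit pattern can be sketched on the host as follows. Each run is reduced to a list of frame numbers and each rocking event emits only its partial count; the real combine does far more per event. All names are hypothetical, and the split rule (a new event starts when the frame gap exceeds 2) is taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Shared by both passes so that count == emit exactly.
// Writes one value per event into `out` when it is non-null.
std::size_t ProcessRunSketch(const std::vector<int>& frames, int* out) {
    std::size_t n_events = 0;
    std::size_t i = 0;
    while (i < frames.size()) {
        std::size_t j = i + 1;
        while (j < frames.size() && frames[j] - frames[j - 1] <= 2) ++j;
        if (out) out[n_events] = static_cast<int>(j - i);  // partials in this event
        ++n_events;
        i = j;
    }
    return n_events;
}

std::vector<int> CombineSketch(const std::vector<std::vector<int>>& runs) {
    std::vector<std::size_t> offset(runs.size() + 1, 0);
    for (std::size_t r = 0; r < runs.size(); ++r)  // count pass
        offset[r + 1] = ProcessRunSketch(runs[r], nullptr);
    for (std::size_t r = 0; r < runs.size(); ++r)  // exclusive prefix sum
        offset[r + 1] += offset[r];
    std::vector<int> out(offset.back());
    for (std::size_t r = 0; r < runs.size(); ++r) {  // emit pass at fixed offsets
        const std::size_t n = ProcessRunSketch(runs[r], out.data() + offset[r]);
        assert(n == offset[r + 1] - offset[r]);
        (void)n;
    }
    return out;
}
```

Because every run writes to a slot computed from the prefix sum, the output order is run-major and independent of which run finishes first.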

Validated across the 18-crystal rotation battery: all 15 deterministic crystals
(P1/P2/C2/H3/I23/P41212/P222/P422) match the CPU combine exactly on SG / ISa / CC1/2 /
completeness and run-to-run (fulls count bit-identical); the 3 upstream-nondeterministic
crystals vary from GPU-prediction overflow, not the combine.

Gated opt-in behind JFJOCH_RSM_GPU_COMBINE (default = CPU combine): combine alone
is timing-neutral because the shared 1.2M SortFullsByFrame std::sort dominates and
the fulls round-trip adds a copy - it only pays off once the fulls stay resident
for scale-fulls + merge. Also add JFJOCH_RSM_NO_GPU master switch to force the CPU
fallback (incl. phase-1 scaling) from one binary for A/B parity. SortFullsByFrame
extracted from the Combine tail and shared by both paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 07:27:33 +02:00
leonarski_f and Claude Opus 4.8 dd25de461d RotationScaleMerge: GPU partial-scaling loop (CUDA port, phase 1)
First stage of moving the rotation scale/merge onto the GPU. The per-frame partial-scaling loop
(inverse-variance group-mean reduction -> robust per-frame IRLS G -> corr update, x scaling_iter)
now runs in RotationScaleMergeGPU (.cu) when a GPU is present; the CPU loops remain the fallback.

The host keeps the one-time raw-hkl sort and the per-space-group gemmi ASU keying, and hands the
GPU a group-ordered permutation + CSR so the per-group reduction is a DETERMINISTIC segmented
reduction (one thread per group, fixed order, no atomics) - preserving the run-to-run determinism
just won on the CPU path (a float atomicAdd reduction would have re-introduced jitter). Reduction is
one-thread-per-group (groups average tens of obs, so a block-per-group wastes threads); the IRLS is
one block per frame with a deterministic shared-memory reduction.
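
A host-side sketch of the deterministic segmented reduction: each group walks its members in the fixed CSR order, so the floating-point sum is formed in the same order on every run. The inverse-variance weighting follows the text; the function name and argument layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// One loop iteration per group corresponds to one GPU thread per group.
std::vector<double> InvVarGroupMeanSketch(const std::vector<std::uint32_t>& offsets,
                                          const std::vector<std::uint32_t>& perm,
                                          const std::vector<float>& intensity,
                                          const std::vector<float>& sigma) {
    assert(!offsets.empty());
    assert(intensity.size() == sigma.size());
    std::vector<double> mean(offsets.size() - 1,
                             std::numeric_limits<double>::quiet_NaN());
    for (std::size_t g = 0; g + 1 < offsets.size(); ++g) {
        double sum_w = 0.0, sum_wi = 0.0;
        for (std::uint32_t k = offsets[g]; k < offsets[g + 1]; ++k) {
            const std::uint32_t i = perm[k];  // fixed visiting order, no atomics
            const double s = sigma[i];
            const double w = 1.0 / (s * s);   // inverse-variance weight
            sum_w += w;
            sum_wi += w * intensity[i];
        }
        if (sum_w > 0.0) mean[g] = sum_wi / sum_w;
    }
    return mean;
}
```

An atomicAdd-based reduction would add the same terms in arrival order, and since floating-point addition is not associative, the result could differ in the last bits from run to run.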

Validated: bit-identical to the CPU path and deterministic run-to-run on lyso/cytC/Ins_H/pding
(P41212 ISa 7.8 CC1/2 99.7%, etc.). The scaling kernels are ~7x faster than the CPU compute
(~36 ms for 3 iters vs ~0.28 s); end-to-end scale/merge ~2.0 -> ~1.5 s. The remaining gap to the
<1 s target is the per-pass host round-trip (corr down/upload for the CPU combine + per-SG group-CSR
rebuild); phase 2 keeps the data resident by moving the 3D combine and the merge/error-model onto
the GPU too, so nothing round-trips.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 22:26:29 +02:00
leonarski_f and Claude Opus 4.8 29c8ba6112 rotation: deterministic frame-order mosaicity smoothing + partiality recompute
Prediction applied a mosaicity/profile-radius moving average (RotationParameters) over the
last N *processed* frames. Under the parallel per-image loop that window is thread-arrival
order, so the smoothed value - and hence which reflections are predicted/integrated - was
non-deterministic run-to-run, swinging CC1/2 (and even the space group) on marginal crystals.
`-N 1` was deterministic; `-N 32` was not.

Fix (as designed with FL): prediction now uses each frame's OWN mosaicity/profile-radius
(image-local, deterministic membership - a reflection on the cutoff contributes ~nothing).
The smoothing that actually matters is moved into RotationScaleMerge and done in FRAME order
(deterministic): per-frame mosaicity is smoothed with the same window as smooth-G, then every
partial's partiality is recomputed from it BEFORE the 3D combine. This is the mosaicity analogue
of smooth-G: combining a reflection's per-frame partials only tiles the rocking curve correctly
(captured fractions summing toward 1) if neighbouring frames share a consistent mosaicity.
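
A sketch of smoothing done in frame order rather than arrival order. The actual smooth-G window shape is not given in this message; a centred moving average with a fixed half-width is assumed purely for illustration, and `SmoothInFrameOrder` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each output depends only on frame indices, never on which thread
// finished first, so the result is identical on every run.
std::vector<double> SmoothInFrameOrder(const std::vector<double>& per_frame,
                                       std::size_t half_window) {
    const std::size_t n = per_frame.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t f = 0; f < n; ++f) {
        const std::size_t lo = (f >= half_window) ? f - half_window : 0;
        const std::size_t hi = std::min(n - 1, f + half_window);
        double sum = 0.0;
        for (std::size_t k = lo; k <= hi; ++k) sum += per_frame[k];  // fixed order
        out[f] = sum / static_cast<double>(hi - lo + 1);
    }
    return out;
}
```

A window over the last N processed frames, by contrast, depends on thread scheduling whenever more than one worker is active.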

Battery (18 crystals, /data/rotation_test, 2 runs each): 15/18 now bit-identical run-to-run
(the good crystals unchanged - lyso P41212 ISa 7.8 CC1/2 99.7%). The 3 residual crystals
(EcwtAL500, EcwtCQ066S, pding4_003 - all large/triclinic cells) still jitter ~0.002%, traced
to a SEPARATE, benign cause: the GPU prediction buffer overflow (BraggPredictionRotGPU
max_reflections=10000 with a racy atomicAdd/atomicSub) on dense frames - cell/space group stay
stable; to be addressed in the GPU prediction/integration rework (naively raising the cap also
changes prediction quality, so it is not a one-line bump). Minor label refinements from the
recomputed partiality: cytC_2 P321 -> P3121 (now consistent with cytC_3), Ins_I_2/3 report the
honest I23/I213 screw-axis ambiguity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:58:10 +02:00
leonarski_f and Claude Opus 4.8 b5d9167bf4 RotationScaleMerge: chunked parallelism -> scale/merge phase ~2s
The per-observation corr update (7.6M items) ran through a work-stealing ParallelFor that
does one atomic fetch_add PER item - pure contention for trivial work (measured: update 0.60s
vs reduce 0.15s / fit 0.13s in the scale-partials loop). Add ParallelChunks (one contiguous
range per worker, no per-item sync) and use it for UpdateCorr, and parallelise the ASU keying
(gemmi reduction per distinct raw hkl - HKLKeyGenerator is const, safe to read concurrently)
and the group-stamping over disjoint raw-hkl runs.
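
A sketch of the chunked scheme: every worker gets one contiguous range and touches no shared counter while working. `ChunkRange` and `ParallelChunksSketch` are illustrative and built on `std::thread`; the real ParallelChunks may use a thread pool and a different signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Contiguous range [first, second) owned by worker w out of n_workers.
std::pair<std::size_t, std::size_t> ChunkRange(std::size_t n, std::size_t n_workers,
                                               std::size_t w) {
    assert(n_workers > 0 && w < n_workers);
    const std::size_t chunk = (n + n_workers - 1) / n_workers;  // ceil(n / workers)
    const std::size_t begin = std::min(n, w * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    return {begin, end};
}

// body(begin, end) is called once per non-empty range, each on its own thread.
// There is no per-item synchronisation; the only sync point is the final join.
template <class Body>
void ParallelChunksSketch(std::size_t n, std::size_t n_workers, Body body) {
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < n_workers; ++w) {
        const auto range = ChunkRange(n, n_workers, w);
        if (range.first == range.second) continue;  // nothing left for this worker
        pool.emplace_back([range, &body] { body(range.first, range.second); });
    }
    for (auto& t : pool) t.join();
}
```

For a trivial per-item body such as a corr update, this replaces millions of atomic increments with one range computation per worker, at the cost of no load balancing.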

scale-partials 0.90 -> 0.28s, group-hkl 0.20 -> 0.09s, per-pass warm 0.83s, whole scale/merge
phase ~3.3 -> ~2.0s. Bit-identical output (same space group, ISa, CC1/2). ParallelChunks is
the CPU stand-in for a flat CUDA grid-stride kernel; ParallelFor stays for the heavy, uneven
per-frame fits where the atomic amortises and work-stealing balances the load.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:31:41 +02:00
leonarski_f and Claude Opus 4.8 29c5410328 rotation scaling: dedicated allocate-once RotationScaleMerge, ~3.4x faster scale/merge
The rot3d post-pass (scale -> smooth-G -> 3D combine -> scale-fulls -> merge ->
error model -> stats) dominated offline wall clock because ScaleOnTheFly + MergeAll
+ CombineRotationObservations + MergeOnTheFly rebuild a std::map keyed by hkl on every
scaling iteration and every merge (7-14 map rebuilds per space-group pass), each re-walking
and re-keying millions of per-frame partials.

RotationScaleMerge ingests the per-frame partials ONCE into flat vectors and reuses them
across both space-group passes: the raw-hkl ordering is sorted a single time at ingest, so
the per-pass 3D combine only re-splits events (no sort) and the ASU grouping is one gemmi
reduction per distinct raw hkl (~10x fewer) rather than per observation, reusing that order.
Every hot step is a flat loop (segmented reduction + per-frame robust IRLS + parallel
per-run combine) that also maps directly onto CUDA kernels. CC1/2 and the per-image CC are
computed once at the end, not every iteration.
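
To illustrate the "sort once, reuse the order" idea, a sketch that orders observations by raw hkl a single time and then finds the runs of equal hkl with one linear scan. The `Hkl` alias and both function names are hypothetical; the real ingest stores more per observation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Hkl = std::array<int, 3>;

// One-time sort at ingest: returns observation indices ordered by raw hkl.
// stable_sort keeps the input order within each hkl, so the result is deterministic.
std::vector<std::uint32_t> SortByRawHkl(const std::vector<Hkl>& hkl) {
    std::vector<std::uint32_t> order(hkl.size());
    std::iota(order.begin(), order.end(), 0u);
    std::stable_sort(order.begin(), order.end(),
                     [&hkl](std::uint32_t a, std::uint32_t b) { return hkl[a] < hkl[b]; });
    return order;
}

// Start position of each run of equal hkl, plus a final end marker.
// Per-pass work (e.g. one ASU reduction per run) can then loop over runs.
std::vector<std::uint32_t> RunStarts(const std::vector<Hkl>& hkl,
                                     const std::vector<std::uint32_t>& order) {
    assert(order.size() == hkl.size());
    std::vector<std::uint32_t> starts;
    for (std::uint32_t k = 0; k < order.size(); ++k)
        if (k == 0 || hkl[order[k]] != hkl[order[k - 1]]) starts.push_back(k);
    starts.push_back(static_cast<std::uint32_t>(order.size()));
    return starts;
}
```

Unlike a `std::map` keyed by hkl that is rebuilt on every iteration, the `order` and `starts` vectors are computed once and only read afterwards.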

It is a distinct path from ScaleOnTheFly, used only for the self-scaling rotation case
(Rotation partiality + combine3d, per-image G, no B refinement, no external reference, no
absorption surface, no wedge/mosaicity override). Stills, B-factor refinement, reference
scaling and the absorption surface stay on the classic path.

Numerically equivalent to the classic path (same robust per-frame G, same 3D combine, same
XDS-order scale-fulls, same global error model, same merge statistics), validated on the
18-crystal /data/rotation_test battery: 16/18 bit-identical in space group / ISa / CC1/2;
the 2 differing crystals are ice-heavy / marginal ones on which the classic path is equally
non-deterministic run-to-run (a pre-existing upstream integration race). Scale/merge wall
time drops ~3.4x (median 9.8 -> 4.1 s), making higher --scaling-iterations cheap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 21:00:12 +02:00