Jungfraujoch/image_analysis
leonarski_f and Claude Opus 4.8 dd25de461d RotationScaleMerge: GPU partial-scaling loop (CUDA port, phase 1)
First stage of moving the rotation scale/merge onto the GPU. The per-frame partial-scaling loop
(inverse-variance group-mean reduction -> robust per-frame IRLS G -> corr update, x scaling_iter)
now runs in RotationScaleMergeGPU (.cu) when a GPU is present; the CPU loops remain the fallback.
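
As an illustration of the "robust per-frame IRLS G" step, here is a minimal host-side C++ sketch of an iteratively reweighted least-squares fit of one scale factor per frame. It is not taken from RotationScaleMerge: the model I_i ≈ G * m_i (m_i being the current group mean for observation i), the Huber weight, the tuning constant k and the function name fit_frame_scale are all assumptions made for the example; the commit does not state which robust loss is used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: fit one scale G for a single frame so that I[i] ~ G * m[i].
//   I      observed intensities on this frame
//   m      group mean each observation is compared against
//   sigma  standard deviation of each observation
//   n_irls number of reweighting passes (0 gives plain weighted least squares)
//   k      Huber threshold in units of sigma (assumed value, not from the source)
inline float fit_frame_scale(const std::vector<float> &I,
                             const std::vector<float> &m,
                             const std::vector<float> &sigma,
                             int n_irls, float k = 1.345f) {
    assert(I.size() == m.size() && I.size() == sigma.size());
    float G = 1.0f;
    for (int it = 0; it <= n_irls; ++it) {
        float num = 0.0f, den = 0.0f;
        // Fixed summation order: the result does not depend on scheduling.
        for (std::size_t i = 0; i < I.size(); ++i) {
            const float w = 1.0f / (sigma[i] * sigma[i]); // inverse variance
            float h = 1.0f;                               // robust weight
            if (it > 0) {
                const float r = std::fabs(I[i] - G * m[i]) / sigma[i];
                if (r > k) h = k / r;                     // down-weight outliers
            }
            num += h * w * I[i] * m[i];
            den += h * w * m[i] * m[i];
        }
        if (den <= 0.0f) break; // nothing to fit, keep the previous G
        G = num / den;          // weighted least-squares update
    }
    return G;
}
```

In this sketch the first pass is an ordinary inverse-variance weighted fit, and each later pass recomputes the weights from the residuals of the previous G, so a single strong outlier pulls the scale less than it would in the plain fit.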

The host keeps the one-time raw-hkl sort and the per-space-group gemmi ASU keying, and hands the
GPU a group-ordered permutation + CSR so the per-group reduction is a DETERMINISTIC segmented
reduction (one thread per group, fixed order, no atomics) - preserving the run-to-run determinism
just won on the CPU path (a float atomicAdd reduction would have re-introduced jitter). Reduction is
one-thread-per-group (groups average tens of obs, so a block-per-group wastes threads); the IRLS is
one block per frame with a deterministic shared-memory reduction.
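
The group-ordered permutation + CSR idea can be sketched on the host in C++ as follows. This is a reference illustration under stated assumptions, not the code in RotationScaleMergeGPU: the struct GroupCSR, the functions build_group_csr and reduce_groups, and the plain 1/sigma^2 weighting are hypothetical names and choices made for the example. The body of the outer loop in reduce_groups is what a single GPU thread would execute for its group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: group g owns perm[row_ptr[g] .. row_ptr[g + 1]).
struct GroupCSR {
    std::vector<std::uint32_t> row_ptr; // size n_groups + 1
    std::vector<std::uint32_t> perm;    // observation indices ordered by group
};

// Stable counting sort of observations by group key (key[i] < n_groups).
inline GroupCSR build_group_csr(const std::vector<std::uint32_t> &key,
                                std::size_t n_groups) {
    GroupCSR csr;
    csr.row_ptr.assign(n_groups + 1, 0);
    for (std::uint32_t k : key) {
        assert(k < n_groups);
        ++csr.row_ptr[k + 1];
    }
    for (std::size_t g = 0; g < n_groups; ++g)
        csr.row_ptr[g + 1] += csr.row_ptr[g];           // prefix sum
    std::vector<std::uint32_t> cursor(csr.row_ptr.begin(), csr.row_ptr.end() - 1);
    csr.perm.resize(key.size());
    for (std::uint32_t i = 0; i < key.size(); ++i)
        csr.perm[cursor[key[i]]++] = i;                 // keeps input order per group
    return csr;
}

// Inverse-variance weighted mean per group. Each group is summed by one
// worker in CSR order, so the floating-point result is the same on every
// run; no atomics are involved.
inline std::vector<float> reduce_groups(const GroupCSR &csr,
                                        const std::vector<float> &I,
                                        const std::vector<float> &sigma) {
    const std::size_t n_groups = csr.row_ptr.size() - 1;
    std::vector<float> mean(n_groups, 0.0f);
    for (std::size_t g = 0; g < n_groups; ++g) {        // one "thread" per group
        float num = 0.0f, den = 0.0f;
        for (std::uint32_t k = csr.row_ptr[g]; k < csr.row_ptr[g + 1]; ++k) {
            const std::uint32_t i = csr.perm[k];
            const float w = 1.0f / (sigma[i] * sigma[i]);
            num += w * I[i];
            den += w;
        }
        if (den > 0.0f) mean[g] = num / den;
    }
    return mean;
}
```

Because every group's sum is accumulated by a single worker in a fixed order, repeated runs give the same bits. An atomicAdd-based reduction would add the same terms in a scheduling-dependent order, and since float addition is not associative the low bits could then vary from run to run.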

Validated: bit-identical to the CPU path and deterministic run-to-run on lyso/cytC/Ins_H/pding
(P41212 ISa 7.8 CC1/2 99.7%, etc.). The scaling kernels are ~7x faster than the CPU compute
(~36 ms for 3 iters vs ~0.28 s); end-to-end scale/merge ~2.0 -> ~1.5 s. The remaining gap to the
<1 s target is the per-pass host round-trip (corr down/upload for the CPU combine + per-SG group-CSR
rebuild); phase 2 keeps the data resident by moving the 3D combine and the merge/error-model onto
the GPU too, so nothing round-trips.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 22:26:29 +02:00