- Non-photon pixels now update pedestal (push_fast equivalent)
directly in the kernel, no atomics needed
- Commented out quadrant significance test (c2): absent from
sequential CPU code, was producing GPU-only clusters.
- Added d_pd_sum to device allocations and host upload
Build (sm_89): 46 registers, 0 spills, 100% occupancy.
Verified on 256x256 Jungfrau data, 5000 frames, nSigma=5.0:
CPU 8428 vs GPU 8471 clusters, 99.8% match
0.63 ms/frame CPU vs 0.04 ms/frame GPU (~16x)