mirror of
https://github.com/slsdetectorgroup/aare.git
synced 2026-06-09 04:48:44 +02:00
6a12e3de24
Rework the multi-stream pipeline to eliminate per-frame sync barriers and fix the D2H staging architecture. Sync reduction: - Replace one cudaStreamSynchronize per frame with one per stream per batch, cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams) - Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]] stored in a single class-level lazy-allocated pinned pool (h_output_pinned), replacing the per-stream separate cluster/count device buffers - Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated, enabling correct kernel timing across any batch size Pinned H2D without CPU-side copy: - Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping cudaHostRegister so callers can pin their existing batch buffer once; all find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams): Before: ~34 µs/frame -> After: ~28 µs/frame (−18 %)