aare/python
kferjaoui 6a12e3de24
Refactor ClusterFinderCUDA
Rework the multi-stream pipeline to eliminate per-frame sync barriers and
fix the D2H staging architecture.

Sync reduction:
- Replace one cudaStreamSynchronize per frame with one per stream per batch,
  cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams)
- Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]]
  stored in a single class-level lazy-allocated pinned pool (h_output_pinned),
  replacing the per-stream separate cluster/count device buffers
- Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated,
  enabling correct kernel timing across any batch size
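
The unified per-frame layout [uint32_t count | clusters[max]] described above can be sketched on the host side as follows. This is only an illustration under stated assumptions, not the code from this commit: Cluster3x3, SlotLayout, write_slot, read_count and read_clusters are hypothetical names, the pool here is an ordinary byte buffer rather than the pinned h_output_pinned allocation, and write_slot merely stands in for what a D2H copy would deposit in a slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cluster record (not the actual aare type): centre + 3x3 pixels.
struct Cluster3x3 {
    int16_t x;
    int16_t y;
    uint16_t data[9];
};

// Round n up to the next multiple of a (a > 0).
constexpr std::size_t align_up(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
}

// Byte layout of one frame slot: [uint32_t count | clusters[max_clusters]].
struct SlotLayout {
    std::size_t max_clusters;

    // Offset of the first cluster within a slot, after the count field.
    std::size_t clusters_offset() const {
        return align_up(sizeof(uint32_t), alignof(Cluster3x3));
    }
    // Distance between consecutive slots, padded so every count stays aligned.
    std::size_t stride() const {
        return align_up(clusters_offset() + max_clusters * sizeof(Cluster3x3),
                        alignof(uint32_t));
    }
    // Total pool size needed for n_frames slots.
    std::size_t pool_bytes(std::size_t n_frames) const {
        return n_frames * stride();
    }
};

// Stand-in for the device-to-host copy: fill one slot with n clusters.
inline void write_slot(unsigned char* pool, const SlotLayout& l,
                       std::size_t frame, const Cluster3x3* src, uint32_t n) {
    assert(n <= l.max_clusters);
    unsigned char* slot = pool + frame * l.stride();
    std::memcpy(slot, &n, sizeof n);
    if (n > 0)
        std::memcpy(slot + l.clusters_offset(), src, n * sizeof(Cluster3x3));
}

// Read the cluster count stored at the start of a frame's slot.
inline uint32_t read_count(const unsigned char* pool, const SlotLayout& l,
                           std::size_t frame) {
    uint32_t n = 0;
    std::memcpy(&n, pool + frame * l.stride(), sizeof n);
    return n;
}

// Copy out only the valid clusters of one frame.
inline std::vector<Cluster3x3> read_clusters(const unsigned char* pool,
                                             const SlotLayout& l,
                                             std::size_t frame) {
    const uint32_t n = read_count(pool, l, frame);
    assert(n <= l.max_clusters);
    std::vector<Cluster3x3> out(n);
    if (n > 0)
        std::memcpy(out.data(),
                    pool + frame * l.stride() + l.clusters_offset(),
                    n * sizeof(Cluster3x3));
    return out;
}
```

With such a layout a single contiguous copy per frame carries both the count and the clusters, which is one plausible reason a unified slot simplifies the D2H staging compared with separate cluster and count buffers.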

Pinned H2D without CPU-side copy:
- Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping
  cudaHostRegister so callers can pin their existing batch buffer once. All
  find_clusters_batched() slices then transfer at pinned DMA speed (~22 GB/s)
  instead of ~15 GB/s for pageable memory, with no extra memcpy and no
  write-combined (WC) memory penalty

Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams):
  Before: ~34 µs/frame  ->  After: ~28 µs/frame  (−18 %)
2026-05-18 16:30:13 +02:00