aare/python at 6a12e3de244f258447a5c4018b49c5f633019110 - aare

detectors/aare

mirror of https://github.com/slsdetectorgroup/aare.git synced 2026-07-30 18:53:37 +02:00

kferjaoui 6a12e3de24

Build on RHEL9 / build (push) Successful in 3m22s

Build on RHEL8 / build (push) Successful in 3m28s

Run tests using data on local RHEL8 / build (push) Successful in 3m37s

Refactor ClusterFinderCUDA

Rework the multi-stream pipeline to eliminate per-frame sync barriers and
fix the D2H staging architecture.

Sync reduction:
- Replace one cudaStreamSynchronize per frame with one per stream per batch,
  cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams)
- Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]]
  stored in a single class-level lazy-allocated pinned pool (h_output_pinned),
  replacing the per-stream separate cluster/count device buffers
- Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated,
  enabling correct kernel timing across any batch size
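
The unified per-frame output layout described above can be pictured on the host side as a flat byte pool divided into equal slots, each holding a count followed by up to `max` clusters. The following is a minimal sketch in plain C++ (no CUDA calls), not the code from this commit: `Cluster3x3`, `header_bytes`, `slot_bytes`, `write_slot` and `read_slot` are hypothetical names, the cluster record is an assumed shape, and a `std::vector` stands in for the pinned pool (`h_output_pinned`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 3x3 cluster record; the real aare cluster type may differ.
struct Cluster3x3 {
    int16_t x;
    int16_t y;
    int32_t data[9];
};

// Bytes reserved for the uint32_t count, rounded up so that the cluster
// array following it keeps the alignment of Cluster3x3.
inline std::size_t header_bytes() {
    const std::size_t a = alignof(Cluster3x3);
    return (sizeof(uint32_t) + a - 1) / a * a;
}

// Size of one frame slot: [uint32_t count | clusters[max_clusters]].
inline std::size_t slot_bytes(std::size_t max_clusters) {
    return header_bytes() + max_clusters * sizeof(Cluster3x3);
}

// Store the clusters found in one frame into that frame's slot.
inline void write_slot(std::vector<unsigned char> &pool, std::size_t frame,
                       std::size_t max_clusters,
                       const std::vector<Cluster3x3> &clusters) {
    assert(clusters.size() <= max_clusters);
    assert((frame + 1) * slot_bytes(max_clusters) <= pool.size());
    unsigned char *slot = pool.data() + frame * slot_bytes(max_clusters);
    const uint32_t n = static_cast<uint32_t>(clusters.size());
    std::memcpy(slot, &n, sizeof n);
    if (n > 0) {
        std::memcpy(slot + header_bytes(), clusters.data(),
                    n * sizeof(Cluster3x3));
    }
}

// Read back only the valid clusters of one frame, using the stored count.
inline std::vector<Cluster3x3> read_slot(const std::vector<unsigned char> &pool,
                                         std::size_t frame,
                                         std::size_t max_clusters) {
    assert((frame + 1) * slot_bytes(max_clusters) <= pool.size());
    const unsigned char *slot = pool.data() + frame * slot_bytes(max_clusters);
    uint32_t n = 0;
    std::memcpy(&n, slot, sizeof n);
    assert(n <= max_clusters);
    std::vector<Cluster3x3> out(n);
    if (n > 0) {
        std::memcpy(out.data(), slot + header_bytes(), n * sizeof(Cluster3x3));
    }
    return out;
}
```

With a layout of this kind, the count and the clusters of a frame sit in one contiguous region, so a single copy per frame can carry both, and the host can walk a whole batch after one synchronisation per stream by stepping through the pool in `slot_bytes` strides.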

Pinned H2D without CPU-side copy:
- Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping
  cudaHostRegister so callers can pin their existing batch buffer once; all
  find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead
  of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty

Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams):
  Before: ~34 µs/frame  ->  After: ~28 µs/frame  (−18 %)
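
As a rough cross-check of the figures quoted above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope calculation, assuming 1 GB = 10^9 bytes and taking the approximate bandwidths and timings from the message at face value; the helper names are hypothetical and nothing here measures real hardware.

```cpp
#include <cassert>
#include <cstddef>

// Size of one frame in bytes (400 x 400 pixels of uint16 -> 2 bytes each).
inline std::size_t frame_bytes(std::size_t rows, std::size_t cols,
                               std::size_t bytes_per_pixel) {
    return rows * cols * bytes_per_pixel;
}

// Ideal transfer time in microseconds at a given bandwidth.
// Assumption: 1 GB/s = 1e9 bytes/s.
inline double transfer_us(std::size_t bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e6;
}

// Relative change in percent; negative means the "after" value is smaller.
inline double relative_change_percent(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

Under these assumptions one frame is 320000 bytes, which corresponds to roughly 14.5 µs per frame at ~22 GB/s versus roughly 21.3 µs at ~15 GB/s for the host-to-device copy alone, and the change from ~34 µs to ~28 µs per frame is about −17.6 %, consistent with the rounded −18 % quoted. The per-frame totals also include kernel execution and device-to-host copies overlapped across streams, so the transfer times are not expected to add up to them directly.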

2026-05-18 16:30:13 +02:00

Name            Last commit                         Date
aare            Refactor ClusterFinderCUDA          2026-05-18 16:30:13 +02:00
examples        Feature/minuit2 wrapper (#279)      2026-03-30 09:12:23 +02:00
src             Refactor ClusterFinderCUDA          2026-05-18 16:30:13 +02:00
tests           Refactor ClusterFinderCUDA          2026-05-18 16:30:13 +02:00
CMakeLists.txt  Apply cmake-format to build files   2026-04-27 11:53:56 +02:00