9 Commits

Author SHA1 Message Date
kferjaoui 5922c73c07 feat(ClusterFinderCUDA): async submit_batch/collect API
Build on RHEL8 / build (push) Successful in 3m16s
Build on RHEL9 / build (push) Successful in 3m26s
Run tests using data on local RHEL8 / build (push) Successful in 9m42s
- Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches
  to be in flight simultaneously:
  - submit_batch() enqueues H2D + kernel + D2H without blocking.
  - collect() syncs via cudaEventSynchronize (not cudaStreamSynchronize),
    so a queued second batch runs uninterrupted.

- Two ping-pong output slots (NUM_SLOTS=2) with per-slot pinned buffers
  and cudaEventDisableTiming sync events.
- find_clusters_batched() keeps its direct implementation.

- Measured: 0.026 -> 0.022 ms/frame (~18% higher throughput).
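
The two-slot submit/collect pattern described above can be sketched on the host alone. This is only an analogy, not the ClusterFinderCUDA API: std::future stands in for the per-slot sync event, std::async stands in for the asynchronous stream work, and the class name PingPongPipeline, the int payload, and the summing "kernel" are all hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <future>
#include <numeric>
#include <stdexcept>
#include <utility>
#include <vector>

// Host-only stand-in for the two-slot scheme (hypothetical names throughout).
class PingPongPipeline {
public:
    static constexpr std::size_t NUM_SLOTS = 2;

    // Enqueue one batch and return the slot it was assigned to.
    // Returns immediately; the work runs in the background.
    std::size_t submit_batch(std::vector<int> batch) {
        const std::size_t slot = next_slot_;
        if (slots_[slot].valid())
            throw std::runtime_error("slot still in flight: call collect() first");
        slots_[slot] = std::async(std::launch::async, [b = std::move(batch)] {
            return std::accumulate(b.begin(), b.end(), 0L);  // placeholder for the real work
        });
        next_slot_ = (next_slot_ + 1) % NUM_SLOTS;
        return slot;
    }

    // Wait only for the given slot; work queued in the other slot keeps running.
    long collect(std::size_t slot) { return slots_.at(slot).get(); }

private:
    std::array<std::future<long>, NUM_SLOTS> slots_;  // one pending result per slot
    std::size_t next_slot_ = 0;
};
```

A caller would typically submit batch N+1 before collecting batch N, so the consumer never leaves the producer idle between batches. Waiting on a per-slot handle rather than on the whole queue is what keeps the second batch running during collect().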
2026-05-28 16:23:37 +02:00
kferjaoui 4c66802980 perf(ClusterFinderCUDA): FP32 device pedestal and bulk memcpy drain
Build on RHEL8 / build (push) Successful in 3m0s
Build on RHEL9 / build (push) Successful in 3m41s
Run tests using data on local RHEL8 / build (push) Successful in 3m47s
- Device pedestal arrays (mean/sum/sum2) are now float instead of
  double: this halves global-memory traffic for pedestal reads/writes
  and eliminates FP64 arithmetic in the kernel (3.3x kernel speedup,
  15 µs -> 4.6 µs).

- Replace the per-cluster push_back loop in the D2H drain with a
  single resize()+memcpy().
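
A minimal sketch of the bulk drain, assuming the cluster record is trivially copyable. The Cluster struct and the drain_bulk name are hypothetical; the real types are not shown in this log.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical cluster record used only for illustration.
struct Cluster {
    std::int16_t x;
    std::int16_t y;
    float data[9];
};
static_assert(std::is_trivially_copyable<Cluster>::value,
              "memcpy is only valid for trivially copyable types");

// Append `count` clusters from a staging buffer to `out` with one bulk copy
// instead of `count` individual push_back calls.
void drain_bulk(const Cluster* staging, std::size_t count, std::vector<Cluster>& out) {
    if (count == 0) return;
    const std::size_t old_size = out.size();
    out.resize(old_size + count);  // one allocation/growth step
    std::memcpy(out.data() + old_size, staging, count * sizeof(Cluster));
}
```

The resize() value-initializes the new elements before they are overwritten, which is usually still cheaper than per-element capacity checks in a push_back loop.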
2026-05-21 14:12:02 +02:00
kferjaoui 6a12e3de24 Refactor ClusterFinderCUDA
Build on RHEL9 / build (push) Successful in 3m22s
Build on RHEL8 / build (push) Successful in 3m28s
Run tests using data on local RHEL8 / build (push) Successful in 3m37s
Rework the multi-stream pipeline to eliminate per-frame sync barriers and
fix the D2H staging architecture.

Sync reduction:
- Replace one cudaStreamSynchronize per frame with one per stream per batch,
  cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams)
- Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]]
  stored in a single class-level lazy-allocated pinned pool (h_output_pinned),
  replacing the per-stream separate cluster/count device buffers
- Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated,
  enabling correct kernel timing across any batch size
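
The per-frame layout [uint32_t count | clusters[max]] can be illustrated with plain offset arithmetic. This is a sketch under assumptions: the Cluster record, FrameLayout, and read_frame are hypothetical names, and the alignment padding after the count header is a guess at how such a slot might be laid out, not the layout used by h_output_pinned.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cluster record used only for illustration.
struct Cluster {
    std::int16_t x;
    std::int16_t y;
    float data[9];
};

// One slot per frame: a uint32_t count header followed by max_clusters records.
struct FrameLayout {
    std::size_t max_clusters;

    // Header size, rounded up so the cluster array starts at an aligned offset.
    static constexpr std::size_t header_bytes() {
        return (sizeof(std::uint32_t) + alignof(Cluster) - 1) / alignof(Cluster) * alignof(Cluster);
    }
    std::size_t slot_bytes() const { return header_bytes() + max_clusters * sizeof(Cluster); }
    std::size_t frame_offset(std::size_t frame) const { return frame * slot_bytes(); }
};

// Copy the valid clusters of one frame out of the pooled host buffer.
std::vector<Cluster> read_frame(const unsigned char* pool, const FrameLayout& layout,
                                std::size_t frame) {
    const unsigned char* slot = pool + layout.frame_offset(frame);
    std::uint32_t count = 0;
    std::memcpy(&count, slot, sizeof(count));
    if (count > layout.max_clusters) count = static_cast<std::uint32_t>(layout.max_clusters);
    std::vector<Cluster> out(count);
    if (count != 0)
        std::memcpy(out.data(), slot + FrameLayout::header_bytes(), count * sizeof(Cluster));
    return out;
}
```

Keeping the count and the payload in the same slot means one transfer per frame carries both, and the host can locate any frame by index alone.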

Pinned H2D without CPU-side copy:
- Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping
  cudaHostRegister so callers can pin their existing batch buffer once; all
  find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead
  of ~15 GB/s for pageable memory, with no extra memcpy and no
  write-combined (WC) memory penalty

Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams):
  Before: ~34 µs/frame  ->  After: ~28 µs/frame  (−18 %)
2026-05-18 16:30:13 +02:00
kferjaoui 41d5184e1b Fix ClusterVector move semantics 2026-05-06 11:30:59 +02:00
kferjaoui 88e0e8d678 Optimize CUDA cluster finder transfers and kernel hot path
Build on RHEL8 / build (push) Successful in 2m51s
Build on RHEL9 / build (push) Successful in 3m15s
Run tests using data on local RHEL8 / build (push) Successful in 3m47s
- Use per-stream pinned host staging buffers for truly async CUDA transfers.
- Avoid reserving full device capacity per result frame.
- Reduce kernel work by delaying cluster payload construction.
- Use squared comparisons, removing per-pixel sqrtf() calls.
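
The squared-comparison idea can be sketched as follows, assuming the threshold test has the form value > n_sigma * sqrt(variance). That form, and both function names, are assumptions for illustration; the actual kernel condition is not shown in this log.

```cpp
#include <cmath>

// Reference form: one square root per pixel.
bool above_threshold_sqrt(float value, float variance, float n_sigma) {
    return value > n_sigma * std::sqrt(variance);
}

// Squared form: no square root. For n_sigma >= 0 and variance >= 0 the
// right-hand side is non-negative, so only positive values can pass, and
// for positive values squaring both sides preserves the ordering.
bool above_threshold_squared(float value, float variance, float n_sigma) {
    return value > 0.0f && value * value > n_sigma * n_sigma * variance;
}
```

The explicit sign check matters: without it, a large negative value would pass the squared test even though it fails the original one. Results can differ from the sqrt form by rounding for values extremely close to the threshold.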
2026-04-30 18:23:31 +02:00
kferjaoui 34e69a8065 Add per-frame kernel timing via CUDA events
Build on RHEL8 / build (push) Successful in 3m13s
Build on RHEL9 / build (push) Successful in 3m37s
Run tests using data on local RHEL8 / build (push) Successful in 3m51s
2026-04-28 13:09:25 +02:00
kferjaoui ac96d1f688 Implement mixed precision: f32 stencil, f64 pedestal
Build on RHEL8 / build (push) Successful in 2m53s
Build on RHEL9 / build (push) Successful in 3m15s
Run tests using data on local RHEL8 / build (push) Successful in 3m47s
- Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias).
- Pedestal accumulation stays double to preserve variance accuracy.

Notes:
- On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving
  stencil math to float improves performance.
- Using float also avoids shared memory bank conflicts: stride-18 maps
  to distinct banks for 32-bit values, but causes conflicts for 64-bit values.
2026-04-27 14:56:40 +02:00
kferjaoui fddef977af Exclude notebooks from JSON check 2026-04-27 11:53:15 +02:00
kferjaoui e894bdac9b Add Python bindings for CUDA cluster finder
Build on RHEL8 / build (push) Successful in 2m50s
Build on RHEL9 / build (push) Successful in 2m57s
Run tests using data on local RHEL8 / build (push) Successful in 3m38s
- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for
  ClusterFinderCUDA
- Build CUDA bindings as a separate _aare_cuda.so to avoid
  segfaults from mixing nvcc-compiled and gcc-compiled code in the
  same shared object
- Re-export CUDA classes onto _aare in __init__.py so user
  code uses `from aare import ClusterFinderCUDA` regardless
  of which .so hosts the class
- Factory in ClusterFinder.py selects backend; RuntimeError
  if GPU requested on CPU-only build
- Update python/CMakeLists.txt: _aare_cuda module gated
  behind AARE_CUDA and AARE_PYTHON_BINDINGS
- Add validation notebook: ~20x speedup vs sequential ClusterFinder
2026-04-23 11:43:40 +02:00