kferjaoui 5922c73c07
feat(ClusterFinderCUDA): async submit_batch/collect API
- Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches
  to be in flight simultaneously:
  - submit_batch() enqueues the H2D copy, kernel, and D2H copy without
    blocking the host.
  - collect() syncs via cudaEventSynchronize (not cudaStreamSynchronize),
    so it waits only for its own batch and a queued second batch runs
    uninterrupted.

- Two ping-pong output slots (NUM_SLOTS=2), each with its own pinned buffers
  and a sync event created with cudaEventDisableTiming.
- find_clusters_batched() keeps its direct implementation.

- Measured: 0.026 -> 0.022 ms/frame (~15% less time per frame, i.e. ~18%
  higher throughput).
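
The two-slot submit/collect bookkeeping described above can be modeled on the host alone. The sketch below is not the ClusterFinderCUDA code. It is plain C++ in which std::async stands in for the stream's H2D+kernel+D2H work and std::future stands in for the per-slot sync event. The class name PingPongSubmitter, the int "frames", and the summing placeholder "kernel" are all hypothetical, chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical host-only model of a two-slot async API.
// std::future plays the role of the per-slot event; std::async plays the
// role of the work enqueued on the stream. No CUDA is involved here.
class PingPongSubmitter {
public:
    static constexpr std::size_t NUM_SLOTS = 2;

    // Start processing a batch and return the slot that holds it.
    // Returns immediately; the work runs on another thread.
    std::size_t submit_batch(std::vector<int> frames) {
        const std::size_t slot = next_slot_;
        // A slot must be collected before it can be reused.
        assert(!pending_[slot].valid() && "slot still in flight");
        pending_[slot] = std::async(std::launch::async,
            [batch = std::move(frames)] {
                // Placeholder "kernel": sum the values in the batch.
                return std::accumulate(batch.begin(), batch.end(), 0L);
            });
        next_slot_ = (next_slot_ + 1) % NUM_SLOTS;  // ping-pong between slots
        return slot;
    }

    // Block until the batch in `slot` is done. Only that slot is waited on,
    // so a batch queued in the other slot keeps running.
    long collect(std::size_t slot) {
        return pending_[slot].get();  // get() also marks the slot as free
    }

private:
    std::array<std::future<long>, NUM_SLOTS> pending_;
    std::size_t next_slot_ = 0;
};
```

A caller would typically submit batch N+1 before collecting batch N, then reuse the freed slot for batch N+2. Waiting on a per-slot handle rather than on everything queued is the point of the model: it mirrors why the commit prefers cudaEventSynchronize to cudaStreamSynchronize.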
2026-05-28 16:23:37 +02:00