Optimize CUDA cluster finder transfers and kernel hot path

- Use per-stream pinned host staging buffers for truly async CUDA transfers.
- Avoid reserving full device capacity per result frame.
- Reduce kernel work by delaying cluster payload construction.
- Use squared comparisons to remove per-pixel sqrtf() calls.
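
A minimal sketch of the squared-comparison idea from the last item, written as plain host-side C++ so it can be checked without a GPU. It assumes the test compares a pixel value against n_sigma times a standard deviation derived from a stored variance; the function names and parameters are hypothetical and are not taken from the diff.

```cpp
#include <cmath>

// Hypothetical reference form: one square root per pixel.
inline bool above_threshold_sqrt(float value, float variance, float n_sigma) {
    return value > n_sigma * std::sqrt(variance);
}

// Squared form: no square root.
// Assumes variance >= 0 and n_sigma >= 0, so the threshold is non-negative.
// The explicit sign check keeps negative values from passing after squaring.
inline bool above_threshold_squared(float value, float variance, float n_sigma) {
    return value > 0.0f && value * value > n_sigma * n_sigma * variance;
}
```

Under those assumptions the two forms agree except for floating-point rounding very close to the threshold. In a kernel, n_sigma * n_sigma could also be computed once outside the per-pixel loop.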
kferjaoui
2026-04-30 18:23:31 +02:00
parent 34e69a8065
commit 88e0e8d678
3 changed files with 241 additions and 91 deletions