perf(ClusterFinderCUDA): FP32 device pedestal and bulk memcpy drain

- Store the device pedestal arrays (mean/sum/sum2) as float instead of
  double. This halves global-memory traffic for pedestal reads/writes
  and eliminates FP64 arithmetic in the kernel (about 3.3x kernel
  speedup, 15 µs -> 4.6 µs).

- Replace the per-cluster push_back loop in the D2H drain with a
  single resize()+memcpy().
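
The drain change can be sketched roughly as follows. This is a minimal
host-side illustration, not the code from this commit: the Cluster
layout, the staging pointer, and both function names are hypothetical,
and it assumes the cluster record is trivially copyable and that the
staging buffer has already been copied back from the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical cluster record; the real layout is not shown in this commit.
struct Cluster {
    short x;
    short y;
    float data[9];
};

// memcpy into a vector is only valid for trivially copyable element types.
static_assert(std::is_trivially_copyable<Cluster>::value,
              "bulk memcpy drain requires a trivially copyable Cluster");

// Before (sketch): one push_back per cluster, with a capacity check and
// possible reallocation on every iteration.
void drain_push_back(const Cluster* staging, std::size_t n,
                     std::vector<Cluster>& out) {
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(staging[i]);
    }
}

// After (sketch): grow the vector once, then copy all n records in one call.
void drain_bulk(const Cluster* staging, std::size_t n,
                std::vector<Cluster>& out) {
    if (n == 0) {
        return;  // nothing to append; also avoids memcpy from a null source
    }
    assert(staging != nullptr);
    const std::size_t old_size = out.size();
    out.resize(old_size + n);  // value-initializes the new tail
    std::memcpy(out.data() + old_size, staging, n * sizeof(Cluster));
}
```

Note that resize() value-initializes the new elements before memcpy()
overwrites them, so each element is written twice. That is usually
still cheaper than n separate push_back calls, but it is a trade-off
rather than a free win.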
kferjaoui
2026-05-21 14:12:02 +02:00
parent 6a12e3de24
commit 4c66802980
4 changed files with 107 additions and 87 deletions