detectors/aare - aare - PSI GIT Service

mirror of https://github.com/slsdetectorgroup/aare.git synced 2026-06-30 00:09:15 +02:00

Author	SHA1	Message	Date
kferjaoui	5922c73c07	feat(ClusterFinderCUDA): async submit_batch/collect API - Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches to be in-flight simultaneously: - submit_batch() enqueues H2D+kernel+D2H without blocking - collect() syncs via cudaEventSynchronize (not cudaStreamSynchronize) so a queued second batch runs uninterrupted. - Two ping-pong output slots (NUM_SLOTS=2) with per-slot pinned buffers and cudaEventDisableTiming sync events. - find_clusters_batched() keeps its direct implementation. * Measured: 0.026 -> 0.022 ms/frame (~18%).	2026-05-28 16:23:37 +02:00
kferjaoui	4c66802980	perf(ClusterFinderCUDA): FP32 device pedestal and bulk memcpy drain - Device pedestal arrays (mean/sum/sum2) are now float instead of double: halves global-memory bandwidth for pedestal reads/writes and eliminates FP64 arithmetic in the kernel (3.3x kernel speedup, 15 µs -> 4.6 µs). - Replace the per-cluster push_back loop in the D2H drain with a single resize()+memcpy().	2026-05-21 14:12:02 +02:00
kferjaoui	6a12e3de24	Refactor ClusterFinderCUDA: rework the multi-stream pipeline to eliminate per-frame sync barriers and fix the D2H staging architecture. Sync reduction: - Replace one cudaStreamSynchronize per frame with one per stream per batch, cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams) - Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]] stored in a single class-level lazy-allocated pinned pool (h_output_pinned), replacing the per-stream separate cluster/count device buffers - Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated, enabling correct kernel timing across any batch size. Pinned H2D without CPU-side copy: - Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping cudaHostRegister so callers can pin their existing batch buffer once; all find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty. Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams): Before: ~34 µs/frame -> After: ~28 µs/frame (−18 %)	2026-05-18 16:30:13 +02:00
kferjaoui	41d5184e1b	Fix ClusterVector move semantics	2026-05-06 11:30:59 +02:00
kferjaoui	88e0e8d678	Optimize CUDA cluster finder transfers and kernel hot path - Use per-stream pinned host staging buffers for truly async CUDA transfers. - Avoid reserving full device capacity per result frame. - Reduce kernel work by delaying cluster payload construction. - Use squared comparisons and remove per-pixel sqrtf() ops.	2026-04-30 18:23:31 +02:00
kferjaoui	34e69a8065	Add per-frame kernel timing via CUDA events	2026-04-28 13:09:25 +02:00
kferjaoui	ac96d1f688	Implement mixed precision: f32 stencil, f64 pedestal - Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias). - Pedestal accumulation stays double to preserve variance accuracy. Notes: - On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving stencil math to float improves performance. - Using float also avoids shared memory bank conflicts: stride-18 maps to distinct banks for 32-bit values, but caused conflicts with 64-bit.	2026-04-27 14:56:40 +02:00
kferjaoui	fddef977af	Exclude notebooks from JSON check	2026-04-27 11:53:15 +02:00
kferjaoui	e894bdac9b	Add Python bindings for CUDA cluster finder - Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for ClusterFinderCUDA - Build CUDA bindings as separate _aare_cuda.so to avoid segfaults from mixing nvcc and gcc compiled code in the same shared object - Re-export CUDA classes onto _aare in __init__.py so user code uses `from aare import ClusterFinderCUDA` regardless of which .so hosts the class - Factory in ClusterFinder.py selects backend; RuntimeError if GPU requested on CPU-only build - Update python/CMakeLists.txt: _aare_cuda module gated behind AARE_CUDA and AARE_PYTHON_BINDINGS - Add validation notebook: ~20x speedup vs sequential ClusterFinder	2026-04-23 11:43:40 +02:00
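
Commit 5922c73c07 describes two ping-pong output slots (NUM_SLOTS=2) driven by submit_batch() and collect(). The host-side slot bookkeeping behind such an API might look roughly like the sketch below. It is an illustration only: the class name PingPongSlots, the FIFO collection order, and the exception thrown when both slots are busy are assumptions rather than aare's implementation, and the CUDA work is indicated in comments instead of being executed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical bookkeeping for a two-slot (ping-pong) submit/collect pipeline.
// Only the slot rotation is modelled; no GPU work is performed.
class PingPongSlots {
public:
    static constexpr std::size_t NUM_SLOTS = 2;

    // Reserve the next free slot for a batch and return its index.
    // Assumption: submitting while both slots are in flight is an error.
    std::size_t submit_batch() {
        if (in_flight_ == NUM_SLOTS) {
            throw std::runtime_error("both slots busy: collect() first");
        }
        const std::size_t slot = (head_ + in_flight_) % NUM_SLOTS;
        // A real implementation would enqueue the H2D copy, the kernel and
        // the D2H copy here, then record this slot's event, without blocking.
        ++in_flight_;
        return slot;
    }

    // Release the oldest in-flight slot (FIFO) and return its index.
    std::size_t collect() {
        if (in_flight_ == 0) {
            throw std::runtime_error("nothing to collect");
        }
        const std::size_t slot = head_;
        // A real implementation would wait on this slot's event here
        // (e.g. cudaEventSynchronize), leaving the other slot's work running.
        head_ = (head_ + 1) % NUM_SLOTS;
        --in_flight_;
        return slot;
    }

    std::size_t in_flight() const { return in_flight_; }

private:
    std::size_t head_ = 0;       // oldest in-flight slot
    std::size_t in_flight_ = 0;  // number of submitted, uncollected batches
};
```

With this rotation a caller can submit batch k+1 before collecting batch k, which is the overlap the commit message credits with removing the inter-batch idle gap.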

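Commit 4c66802980 replaces a per-cluster push_back loop in the D2H drain with a single resize()+memcpy(). A minimal host-only sketch of the two variants follows. The Cluster3x3 struct and both function names are hypothetical stand-ins, not aare types, and the bulk variant is only valid for trivially copyable records.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical cluster record used for illustration only.
struct Cluster3x3 {
    std::int16_t x;
    std::int16_t y;
    std::int32_t data[9];
};
static_assert(std::is_trivially_copyable<Cluster3x3>::value,
              "memcpy-based drain requires a trivially copyable record");

// Per-element drain: one push_back (with its capacity check) per cluster.
inline void drain_push_back(std::vector<Cluster3x3>& out,
                            const Cluster3x3* staged, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(staged[i]);
    }
}

// Bulk drain: grow the vector once, then copy all records in one call.
inline void drain_bulk(std::vector<Cluster3x3>& out,
                       const Cluster3x3* staged, std::size_t n) {
    if (n == 0) {
        return;  // nothing to copy; also avoids memcpy on a null source
    }
    const std::size_t old_size = out.size();
    out.resize(old_size + n);
    std::memcpy(out.data() + old_size, staged, n * sizeof(Cluster3x3));
}
```

Both functions leave the output vector with the same contents; the bulk version trades n capacity checks and element copies for one allocation check and one contiguous copy.
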
9 Commits
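
Commit 6a12e3de24 introduces a per-frame output layout of the form [uint32_t count | clusters[max]] held in one pinned pool. The sketch below shows one way such a pool could be indexed and decoded on the host. The Cluster struct, the function names, and the clamping of an oversized count are assumptions made for illustration; the actual record type and pool handling in aare may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cluster record used for illustration only.
struct Cluster {
    std::int16_t x;
    std::int16_t y;
    std::int32_t sum;
};

// Size in bytes of one frame slot: a count header plus room for max_clusters.
inline std::size_t slot_bytes(std::size_t max_clusters) {
    return sizeof(std::uint32_t) + max_clusters * sizeof(Cluster);
}

// Host-side stand-in for what the device would write into a frame slot.
// Assumes count <= max_clusters.
inline void write_frame_slot(unsigned char* pool, std::size_t frame,
                             std::size_t max_clusters,
                             const Cluster* clusters, std::uint32_t count) {
    unsigned char* slot = pool + frame * slot_bytes(max_clusters);
    std::memcpy(slot, &count, sizeof(count));
    if (count > 0) {
        std::memcpy(slot + sizeof(count), clusters, count * sizeof(Cluster));
    }
}

// Decode one frame slot: read the count, then copy that many clusters.
inline std::vector<Cluster> read_frame_slot(const unsigned char* pool,
                                            std::size_t frame,
                                            std::size_t max_clusters) {
    const unsigned char* slot = pool + frame * slot_bytes(max_clusters);
    std::uint32_t count = 0;
    std::memcpy(&count, slot, sizeof(count));
    if (count > max_clusters) {
        count = static_cast<std::uint32_t>(max_clusters);  // assumed clamp
    }
    std::vector<Cluster> out(count);
    if (count > 0) {
        std::memcpy(out.data(), slot + sizeof(count), count * sizeof(Cluster));
    }
    return out;
}
```

Because every frame owns a fixed-size slot, frame i always starts at byte offset i * slot_bytes(max), so one contiguous transfer can carry the counts and clusters of a whole batch.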