- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for
ClusterFinderCUDA
- Build CUDA bindings as a separate _aare_cuda.so to avoid
segfaults from mixing nvcc- and gcc-compiled code in the
same shared object
- Re-export CUDA classes onto _aare in __init__.py so user
code uses `from aare import ClusterFinderCUDA` regardless
of which .so hosts the class
- Factory in ClusterFinder.py selects the backend and raises
RuntimeError if GPU is requested on a CPU-only build
- Update python/CMakeLists.txt: _aare_cuda module gated
behind AARE_CUDA and AARE_PYTHON_BINDINGS
- Add validation notebook showing ~20x speedup over the
sequential ClusterFinder
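
For reviewers, a minimal sketch of the backend-selection behavior described above. It is not the code in ClusterFinder.py: the helper names `load_cuda_module` and `select_backend`, the `use_gpu` flag, and the error text are hypothetical, and the real factory may import the extension differently.

```python
import importlib


def load_cuda_module(name="_aare_cuda"):
    """Return the optional CUDA extension module, or None on a CPU-only build.

    The default module name is taken from this change; the import path
    used inside the real package is an assumption.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # The extension was not built (e.g. AARE_CUDA disabled).
        return None


def select_backend(use_gpu, cuda_module):
    """Pick a backend name; refuse a GPU request when no CUDA module exists."""
    if not use_gpu:
        return "cpu"
    if cuda_module is None:
        raise RuntimeError(
            "GPU backend requested, but this build has no CUDA support"
        )
    return "cuda"
```

Keeping the import attempt separate from the selection logic lets the CPU path work without ever touching the CUDA shared object, and makes the failure on a CPU-only build explicit instead of surfacing as an ImportError at first use.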