- Use per-stream pinned host staging buffers for truly async CUDA transfers.
- Avoid reserving full device capacity per result frame.
- Reduce kernel work by delaying cluster payload construction.
- Use squared comparisons and remove per-pixel sqrtf() calls.
- Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias).
- Pedestal accumulation stays double to preserve variance accuracy.
Notes:
- On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving
stencil math to float improves performance.
- Using float also avoids shared memory bank conflicts: stride-18 maps
to distinct banks for 32-bit values, but causes conflicts for 64-bit
values.
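The squared-comparison change listed above can be illustrated with a minimal host-side C++ sketch. It assumes the per-pixel noise is available as a variance and that `n_sigma` is non-negative; the function names are hypothetical and are not taken from the kernel.

```cpp
#include <cassert>
#include <cmath>

// Reference form: one square root per pixel.
inline bool above_threshold_sqrt(float value, float variance, float n_sigma) {
    return value > n_sigma * std::sqrt(variance);
}

// Squared form: no square root. The explicit sign check is needed because
// squaring would otherwise let large negative values pass.
inline bool above_threshold_squared(float value, float variance, float n_sigma) {
    assert(variance >= 0.0f && n_sigma >= 0.0f); // sketch-only preconditions
    return value > 0.0f && value * value > n_sigma * n_sigma * variance;
}
```

Both forms agree away from the exact threshold; right at the boundary, floating-point rounding can make them differ by one ULP-level case.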
- After upgrading to pybind11 3, duplicate registration of cluster-related
types across `_aare` and `_aare_cuda` started failing.
- Mark the `Cluster` and `ClusterVector` bindings as `py::module_local()` so
each extension owns its local registration.
Note: cluster objects from CPU and CUDA bindings are now distinct Python types.
- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for
ClusterFinderCUDA
- Build CUDA bindings as separate _aare_cuda.so to avoid
segfaults from mixing nvcc and gcc compiled code in the
same shared object
- Re-export CUDA classes onto _aare in __init__.py so user
code uses `from aare import ClusterFinderCUDA` regardless
of which .so hosts the class
- Factory in ClusterFinder.py selects backend; RuntimeError
if GPU requested on CPU-only build
- Update python/CMakeLists.txt: _aare_cuda module gated
behind AARE_CUDA and AARE_PYTHON_BINDINGS
- Add validation notebook: ~20x speedup vs sequential ClusterFinder
To improve codebase quality and reduce human error, this PR introduces
the pre-commit framework. This ensures that all code adheres to project
standards before it is even committed, maintaining a consistent style
and catching common mistakes early.
Key Changes:
- Code Formatting: Automated C++ formatting using clang-format (based on
the project's .clang-format file).
- Syntax Validation: Basic checks for file integrity and syntax.
- Spell Check: Automated scanning for typos in source code and comments.
- CMake Formatting: Standardization of CMakeLists.txt and .cmake
configuration files.
- GitHub Workflow: Added a CI action that validates every Pull Request
against the pre-commit configuration to ensure compliance.
The configuration includes a `ci` block to handle automated fixes within
the PR. Automatic fixes are currently disabled. If we want the CI to
automatically commit formatting fixes back to the PR branch, set
`autofix_prs` to true in .pre-commit-config.yaml.
```yaml
ci:
  autofix_commit_msg: "[pre-commit] auto fixes from pre-commit hooks"
  autofix_prs: false
  autoupdate_schedule: monthly
```
The last large commit with the fit functions, for example, was not
formatted according to the clang-format rules. This PR helps avoid
similar mistakes in the future.
Python formatting with `ruff` for tests and a sanitiser for `.ipynb`
notebooks can be added as well.
- Non-photon pixels now update pedestal (push_fast equivalent)
directly in the kernel, no atomics needed
- Commented out quadrant significance test (c2): absent from
sequential CPU code, was producing GPU-only clusters.
- Added d_pd_sum to device allocations and host upload
Build (sm_89): 46 registers, 0 spills, 100% occupancy.
Verified on 256x256 Jungfrau data, 5000 frames, nSigma=5.0:
CPU 8428 vs GPU 8471 clusters, 99.8% match
0.63 ms/frame CPU vs 0.04 ms/frame GPU (~16x)
Implements a GPU version of the sequential ClusterFinder for
single-frame cluster reconstruction.
Kernel (ClusterFinderCUDA.cuh):
- Shared memory tiling with generalized halo loading for arbitrary
cluster sizes (3x3, 5x5, ...)
- Zero-initialization of shared memory to handle image boundary
and partial edge-block cases.
- Pedestal subtraction during shared memory loading.
- Compile-time cluster geometry enabling full loop unrolling
of the stencil reduction
- Atomic global counter for lock-free cluster output across blocks.
- RAII host wrapper: the `ClusterFinderCUDA` struct.
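The atomic output counter in the list above can be sketched on the host side. This is only an analogue: `std::atomic::fetch_add` stands in for CUDA's `atomicAdd`, the class name `AtomicOutput` is hypothetical, and the capacity check is an assumption about how overflow might be handled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T> class AtomicOutput {
  public:
    explicit AtomicOutput(std::size_t capacity) : slots_(capacity) {}

    // Reserve the next free slot with a single atomic increment, then write
    // into it. No lock is taken; returns false once the buffer is full.
    bool push(const T &item) {
        const std::size_t idx = next_.fetch_add(1, std::memory_order_relaxed);
        if (idx >= slots_.size())
            return false;
        slots_[idx] = item;
        return true;
    }

    // The counter may overshoot the capacity, so clamp it.
    std::size_t size() const {
        const std::size_t n = next_.load(std::memory_order_relaxed);
        return n < slots_.size() ? n : slots_.size();
    }

    const T &operator[](std::size_t i) const {
        assert(i < size());
        return slots_[i];
    }

  private:
    std::vector<T> slots_;
    std::atomic<std::size_t> next_{0};
};
```

Because each writer gets a unique index from the increment, the order of items in the buffer depends on scheduling and is not deterministic across runs.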
- Set parameter starting values and limits, or fix parameters, by name
as well as by index
- Updated parameter names for the scurve
- Fast approximation to erf function (~10% speedup of fitting)
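The commit does not say which erf approximation is used. As an illustration only, here is a sketch of one well-known choice, the Abramowitz and Stegun formula 7.1.26, whose absolute error is about 1.5e-7; the function name `fast_erf` is hypothetical.

```cpp
#include <cmath>

// Abramowitz & Stegun 7.1.26: erf(x) ~ 1 - (a1 t + ... + a5 t^5) exp(-x^2),
// with t = 1 / (1 + p x) for x >= 0. Odd symmetry covers negative x.
inline double fast_erf(double x) {
    const double sign = x < 0.0 ? -1.0 : 1.0;
    const double ax = std::fabs(x);
    const double t = 1.0 / (1.0 + 0.3275911 * ax);
    // Horner evaluation of the fifth-order polynomial in t.
    const double poly =
        t * (0.254829592 +
             t * (-0.284496736 +
                  t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    return sign * (1.0 - poly * std::exp(-ax * ax));
}
```

Whether an error of this size is acceptable depends on the fit; the approximation actually used in the code may trade accuracy and speed differently.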
---------
Co-authored-by: Khalil Ferjaoui <khalilferjaoui@yahoo.fr>
## Unified Minuit2 fitting framework with FitModel API
### Models (`Models.hpp`)
Consolidate all model structs (Gaussian, RisingScurve, FallingScurve)
into a single header. Each model provides: `eval`, `eval_and_grad`,
`is_valid`, `estimate_par`, `compute_steps`, and `param_info` metadata.
No Minuit2 dependency.
### Chi2 functors (`Chi2.hpp`)
Generic `Chi2Model1DGrad` (analytic gradient) templated on the model
struct. Replaces the separate Chi2Gaussian, Chi2GaussianGradient,
Chi2Scurves, and Chi2ScurvesGradient headers.
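The idea of one chi-square functor templated on the model struct can be sketched as below. The names `GaussianSketch` and `Chi2Sketch`, the `n_par` constant, and the `eval` signature are assumptions made for illustration; the real `Chi2Model1DGrad` also provides an analytic gradient, which is omitted here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model struct: only `eval` is sketched.
struct GaussianSketch {
    static constexpr std::size_t n_par = 3; // amplitude, mean, sigma
    static double eval(double x, const std::array<double, n_par> &p) {
        const double z = (x - p[1]) / p[2];
        return p[0] * std::exp(-0.5 * z * z);
    }
};

// One functor serves every model that exposes `n_par` and `eval`.
template <typename Model> class Chi2Sketch {
  public:
    Chi2Sketch(std::vector<double> x, std::vector<double> y,
               std::vector<double> y_err)
        : x_(std::move(x)), y_(std::move(y)), y_err_(std::move(y_err)) {
        assert(x_.size() == y_.size() && y_.size() == y_err_.size());
    }

    // Sum of squared, error-weighted residuals.
    double operator()(const std::array<double, Model::n_par> &p) const {
        double chi2 = 0.0;
        for (std::size_t i = 0; i < x_.size(); ++i) {
            const double r = (y_[i] - Model::eval(x_[i], p)) / y_err_[i];
            chi2 += r * r;
        }
        return chi2;
    }

  private:
    std::vector<double> x_, y_, y_err_;
};
```

Adding a new model then means writing one struct rather than a new pair of chi-square headers.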
### FitModel (`FitModel.hpp`)
Configuration object wrapping `MnUserParameters`, strategy, tolerance,
and user-override tracking. User constraints (fixed parameters, start
values, limits) always take precedence over automatic data-driven
estimates.
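A minimal sketch of the override-tracking rule described above, assuming a per-parameter flag that records whether the user supplied a value. `ParamSketch` and `ParamTableSketch` are hypothetical names and do not reflect the actual `FitModel` or `MnUserParameters` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ParamSketch {
    double value = 0.0;
    bool user_set = false; // true once the user has supplied a value
};

class ParamTableSketch {
  public:
    explicit ParamTableSketch(std::size_t n) : params_(n) {}

    // A user-supplied start value marks the parameter as overridden.
    void set_user_value(std::size_t i, double v) {
        assert(i < params_.size());
        params_[i].value = v;
        params_[i].user_set = true;
    }

    // Data-driven estimates only fill parameters the user has not touched.
    void apply_estimates(const std::vector<double> &estimates) {
        assert(estimates.size() == params_.size());
        for (std::size_t i = 0; i < params_.size(); ++i) {
            if (!params_[i].user_set)
                params_[i].value = estimates[i];
        }
    }

    double value(std::size_t i) const { return params_[i].value; }

  private:
    std::vector<ParamSketch> params_;
};
```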
### Fit functions (`Fit.hpp`)
- `fit_pixel<Model, FCN>(model, x, y, y_err)` -> single-pixel,
self-contained
- `fit_pixel<Model, FCN>(model, upar_local, x, y, y_err)` -> pre-cloned
upar for hot loops
- `fit_3d<Model, FCN>(model, x, y, y_err, ..., n_threads)` ->
row-parallel over pixel grid
### Python bindings
- `Pol1`, `Pol2`, `Gaussian`, `RisingScurve`, `FallingScurve` model
classes with `FixParameter`, `SetParLimits`, `SetParameter`, and
properties for `max_calls`, `tolerance`, `compute_errors`
- Single `fit(model, x, y, y_err, n_threads)` dispatch replacing the old
`fit_gaus_minuit`, `fit_gaus_minuit_grad`, `fit_scurve_minuit_grad`,
etc.
### Benchmarks
- Updated `fit_benchmark.cpp` (Google Benchmark) to use the new FitModel
API
- Jupyter notebooks for 1D and 3D S-curve fitting (lmfit vs Minuit2
analytic)
- ~1.8x speedup over lmfit, near-linear thread scaling up to physical
core count
---------
Co-authored-by: Erik Fröjdh <erik.frojdh@psi.ch>
- Fixed failing conda builds (numpy >=2.1 and added pytest-check)
- added cibuildwheel settings for osx
- Bumped cibuildwheel to 3.4.0 in pipeline
- build also for macos-latest
- fixed conda workflow, which only ran on the developer branch (it now
runs on main and on PRs to main)
Matterhorn10 Transform
some other Transformations from pyctbGUI
added method get_reading_mode for easier error handling in decoders
## TODO:
- proper error handling for all other decoders
- proper documentation for all other decoders
- refactoring all other decoders to store hard-coded values in a
`ChipSpecification` struct
Reading multiple ROIs for aare
- read_frame, read_n, etc. throw when multiple ROIs are defined
- new functions read_ROIs, read_n_ROIs
- read_roi_into (used by the Python bindings to avoid a copy)
All these functions use get_frame or get_frame_into, to which one passes
the roi_index
## Refactoring:
- each ROI keeps track of the subfiles it has to open, so a subfile
can be opened several times
- refactored class DetectorGeometry - keep track of the updated module
geometries in new class ROIGeometry.
- ModuleGeometry updates based on ROI
## ROIGeometry:
- stores the number of modules overlapping the ROI and their indices
- size of ROI
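Finding the modules that overlap an ROI comes down to rectangle intersection. The sketch below assumes half-open pixel rectangles and uses hypothetical names (`RectSketch`, `modules_in_roi`); it does not reflect the actual `ROIGeometry` or `ModuleGeometry` members.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open pixel rectangle [xmin, xmax) x [ymin, ymax). Assumed convention.
struct RectSketch {
    int xmin, xmax, ymin, ymax;
};

inline bool overlaps(const RectSketch &a, const RectSketch &b) {
    return a.xmin < b.xmax && b.xmin < a.xmax &&
           a.ymin < b.ymax && b.ymin < a.ymax;
}

// Part of `a` that lies inside `b`; only meaningful when overlaps(a, b).
inline RectSketch intersection(const RectSketch &a, const RectSketch &b) {
    assert(overlaps(a, b));
    return {std::max(a.xmin, b.xmin), std::min(a.xmax, b.xmax),
            std::max(a.ymin, b.ymin), std::min(a.ymax, b.ymax)};
}

// Indices of all modules that contribute pixels to the ROI.
inline std::vector<std::size_t>
modules_in_roi(const std::vector<RectSketch> &modules, const RectSketch &roi) {
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < modules.size(); ++i) {
        if (overlaps(modules[i], roi))
            indices.push_back(i);
    }
    return indices;
}
```

An ROI that straddles a module boundary yields two indices, and each module's contribution is the intersection of its rectangle with the ROI.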
Note: only the size of the resulting frames was tested, not the actual
values
---------
Co-authored-by: Erik Fröjdh <erik.frojdh@psi.ch>
Co-authored-by: Erik Fröjdh <erik.frojdh@gmail.com>
File content is the same as the normal ctb files so no changes needed
for reading.
- Accept DetectorType::Xilinx_ChipTestBoard in the CtbRawFile class
- Check if file is open in constructor
Directly casting the values in the cluster finder would truncate the
resulting ADU value and create an offset when summing the cluster.
1. do pedestal subtraction and then round
2. static cast to result type only after rounding
3. if constexpr to avoid unnecessary rounding
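The three steps above can be sketched as follows. The function name `subtract_pedestal` and its template parameters are hypothetical, and `std::lround` is one possible rounding choice (half away from zero); the cluster finder may use a different one.

```cpp
#include <cmath>
#include <type_traits>

template <typename ResultT, typename FrameT, typename PedestalT>
ResultT subtract_pedestal(FrameT raw, PedestalT pedestal) {
    // 1. Subtract in floating point.
    const double diff = static_cast<double>(raw) - static_cast<double>(pedestal);
    if constexpr (std::is_integral_v<ResultT>) {
        // 2. Round first, then cast, so 2.6 becomes 3 rather than 2.
        return static_cast<ResultT>(std::lround(diff));
    } else {
        // 3. Floating-point results need no rounding; this branch is
        //    selected at compile time.
        return static_cast<ResultT>(diff);
    }
}
```

A direct `static_cast<int>(2.6)` gives 2, and that truncation toward zero on every pixel is what produces the offset in the cluster sum.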
- automatically run python tests
- automatically run test using data files on local runner from gitea
- fixed some of the workflows
---------
Co-authored-by: Erik Fröjdh <erik.frojdh@psi.ch>
Issue from Jonathan.
- writing to the output queue did not check whether the queue was full,
so frames were dropped.
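The class of bug can be illustrated with a small single-threaded sketch, assuming a bounded queue whose push reports failure. `BoundedQueueSketch` and `push_or_wait` are hypothetical names; the real queue is shared between threads and this sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

template <typename T> class BoundedQueueSketch {
  public:
    explicit BoundedQueueSketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Reports failure instead of silently discarding the item.
    bool try_push(T item) {
        if (items_.size() >= capacity_)
            return false;
        items_.push_back(std::move(item));
        return true;
    }

    bool try_pop(T &out) {
        if (items_.empty())
            return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

  private:
    std::size_t capacity_;
    std::deque<T> items_;
};

// Retry until the item is accepted. `on_full` is where a real producer
// would sleep or yield while the consumer drains the queue.
template <typename T, typename OnFull>
void push_or_wait(BoundedQueueSketch<T> &queue, const T &item, OnFull on_full) {
    while (!queue.try_push(item))
        on_full();
}
```

Ignoring the return value of `try_push` reproduces the dropped-frame behaviour; checking it and retrying does not lose data, at the cost of stalling the producer.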
## Dataset to recreate issue:
data over the 10G interface can be accessed on pc: pc-highz-02
raw frames:
/mnt/sls_det_storage_10G/highZ_data/JMulvey/Calibration_From_HZ02/2025Jan_m343/Zr15800eV/250129_CZTsolo_Xray_Tp_15C_tint_100_master_0.json
pedestal frames:
/mnt/sls_det_storage_10G/highZ_data/JMulvey/Calibration_From_HZ02/2025Jan_m343/Zr15800eV/250129_CZTsolo_Pedestal_Tp_15C_tint_100_master_0.json
---------
Co-authored-by: Erik Fröjdh <erik.frojdh@psi.ch>
- Removed redundant arr.value(ix,iy...) on NDArray; use arr(ix,iy...)
instead
- Removed Print/Print_some/Print_all from NDArray (operator << still
works)
- Added const* version of .data()
- Comment for documentation
- Some extra tests
- New aare::to_string/string_to similar to what we have in
slsDetectorPackage
- Added members period and exptime to RawMasterFile
- Parsing exposure time and period for json and raw master file formats
- Parsing of RawMasterFile from string stream to enable test without
files
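Parsing a time value such as an exposure time from a string can be sketched as below. The function name `duration_from_string`, the accepted unit suffixes, and the rule that a bare number means seconds are all assumptions for illustration, not the actual `string_to` interface.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>

inline std::chrono::nanoseconds duration_from_string(const std::string &text) {
    std::size_t pos = 0;
    // std::stod throws std::invalid_argument if there is no leading number.
    const double value = std::stod(text, &pos);
    while (pos < text.size() && text[pos] == ' ')
        ++pos; // allow "1.5 s" as well as "1.5s"
    const std::string unit = text.substr(pos);

    double ns_per_unit = 0.0;
    if (unit == "s" || unit.empty()) // assumption: bare numbers are seconds
        ns_per_unit = 1e9;
    else if (unit == "ms")
        ns_per_unit = 1e6;
    else if (unit == "us")
        ns_per_unit = 1e3;
    else if (unit == "ns")
        ns_per_unit = 1.0;
    else
        throw std::invalid_argument("unknown time unit: " + unit);

    return std::chrono::nanoseconds(std::llround(value * ns_per_unit));
}
```

Since the function only takes a string, it can be unit-tested without touching any file, which is the same motivation as parsing the master file from a string stream.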
Comments:
- to_string is not a public header at the moment. It can be made public
later if needed. This gives us full freedom with the API
- FileConfig should probably be deprecated; need to look into it.
Meanwhile, removed its Python bindings and string conversion