Jungfraujoch/image_analysis/bragg_integration/BraggIntegrationEngineGPU.h
leonarski_f and Claude Opus 4.8 ddddfb6ffc
bragg_integration: GPU box + profile-fit integrator (standalone engine)
Reimplement BraggIntegrate2D (box sum) and ProfileIntegrate2D (Kabsch
profile fit) under one roof as a base + CPU + GPU engine, mirroring the
AzIntEngine / ROIIntegration pattern. Reads the preprocessed int32
ImagePreprocessorBuffer (masked=INT32_MIN, saturated=INT32_MAX), the same
buffer AzIntEngineGPU/ROIIntegrationGPU consume.
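
As a rough illustration of how the two sentinel values might be treated during a box sum, here is a minimal CPU sketch. The struct `BoxSumResult`, the function `box_sum`, the square window of radius `r`, and the choice to skip saturated pixels while flagging the reflection are all assumptions made for this example; they are not taken from the engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type, for illustration only.
struct BoxSumResult {
    int64_t sum = 0;        // sum over valid pixels
    int n_valid = 0;        // number of pixels that contributed
    bool saturated = false; // true if any pixel in the window was saturated
};

// Sum a (2r+1) x (2r+1) window centred on (cx, cy) in a row-major int32 image.
// Masked pixels (INT32_MIN) are skipped; saturated pixels (INT32_MAX) are
// skipped and flagged (an assumption about the policy, not the engine's rule).
BoxSumResult box_sum(const std::vector<int32_t> &img, int width, int height,
                     int cx, int cy, int r) {
    assert(width > 0 && height > 0 && r >= 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    BoxSumResult res;
    for (int y = cy - r; y <= cy + r; ++y) {
        if (y < 0 || y >= height) continue;   // clip at the image edge
        for (int x = cx - r; x <= cx + r; ++x) {
            if (x < 0 || x >= width) continue;
            const int32_t v = img[static_cast<std::size_t>(y) * width + x];
            if (v == INT32_MIN) continue;     // masked
            if (v == INT32_MAX) {             // saturated
                res.saturated = true;
                continue;
            }
            res.sum += v;
            ++res.n_valid;
        }
    }
    return res;
}
```

Accumulating into `int64_t` keeps the sum away from the sentinel range, so a large but valid total cannot be mistaken for a masked or saturated value.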

The CUDA engine runs one block per reflection with shared-memory
reductions across six kernels (reset, mask, box-sum, profile learning,
profile build, Kabsch fit); the resolution shell is computed inline. The
learning/fit hot path is single precision: FP64 is throttled on consumer
GPUs, and float reproduces the double-precision CPU path to ~1e-4.
Collapsing the per-frame CUDA API calls into one reset kernel keeps
launch-latency overhead low.
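
The reduction pattern behind "one block per reflection" can be sketched on the CPU. The code below is plain C++, not CUDA, and only emulates the data flow: `n_threads` stands in for the block size, the `shared` vector stands in for shared memory, and the function name `block_reduce_sum` is hypothetical. In a real kernel each stride step would be separated by a block-wide barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of a block-wide tree reduction over one reflection's window.
float block_reduce_sum(const std::vector<float> &window, std::size_t n_threads) {
    // The halving loop below assumes a power-of-two "block size".
    assert(n_threads > 0 && (n_threads & (n_threads - 1)) == 0);

    // Step 1: each "thread" t accumulates a strided slice of the window.
    std::vector<float> shared(n_threads, 0.0f);
    for (std::size_t t = 0; t < n_threads; ++t)
        for (std::size_t i = t; i < window.size(); i += n_threads)
            shared[t] += window[i];

    // Step 2: pairwise tree reduction; the active range halves each round.
    for (std::size_t s = n_threads / 2; s > 0; s >>= 1)
        for (std::size_t t = 0; t < s; ++t)
            shared[t] += shared[t + s];

    return shared[0];
}
```

Because the window around a spot is small, one pass of strided accumulation followed by log2(`n_threads`) halving rounds covers it, which is why a block per reflection is a natural fit.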

Standalone for now: NOT wired into IndexAndRefine. See
BRAGG_INTEGRATION_ENGINE.md for the design and the binding steps.
BraggIntegrationEngineGPUTest checks GPU == CPU across all three modes
(box/gaussian/empirical) within numeric tolerance, and includes a
[bragg_bench] perf sweep.
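
A comparison "within numeric tolerance" between a float GPU result and a double CPU result could look like the sketch below. The helper name `approx_equal` and the default tolerances are assumptions chosen to match the ~1e-4 figure quoted above; the actual test may use different thresholds or a test-framework matcher.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when the GPU (float) and CPU (double) values agree
// to within a relative tolerance, with an absolute floor for values near zero.
bool approx_equal(float gpu, double cpu, double rel_tol = 1e-4, double abs_tol = 1e-6) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    const double g = static_cast<double>(gpu);
    const double diff = std::fabs(g - cpu);
    const double scale = std::max(std::fabs(g), std::fabs(cpu));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

The absolute floor matters for weak reflections, where the intensity is close to zero and a purely relative test would reject differences that are only rounding noise.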

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 20:59:45 +02:00


// SPDX-FileCopyrightText: 2026 Filip Leonarski, Paul Scherrer Institute <filip.leonarski@psi.ch>
// SPDX-License-Identifier: GPL-3.0-only
#pragma once
#include <cstdint>
#include <memory>
#include <vector>
#include "BraggIntegrationEngine.h"
#include "../indexing/CUDAMemHelpers.h"
// CUDA engine: reproduces BraggIntegrationEngineCPU up to floating-point precision. Each stage is a
// kernel with one CUDA block per reflection cooperating over the small window via shared-memory
// reductions (the natural mapping for thousands of independent, tiny per-spot integrations).
//
// Pipeline (profile modes): reset -> mark_mask -> boxsum -> learn_profile -> build_profiles -> fit
// (the resolution shell is computed inline, so there is no separate shell pass). BoxSum mode stops
// after boxsum (that pass is the BraggIntegrate2D box integrator and the seed of the profile fit).
// The preprocessed image already lives on the device (ImagePreprocessorBufferGPU::getGPUBuffer());
// only the per-frame predicted centres are uploaded.
class BraggIntegrationEngineGPU : public BraggIntegrationEngine {
    std::shared_ptr<CudaStream> stream;
    int threads;
    size_t fit_shared_bytes;
    size_t capacity = 0;   // per-reflection device/host arrays hold at least this many reflections

    // --- per-reflection device arrays (grown by EnsureCapacity) ---
    CudaDevicePtr<float> d_px_x, d_px_y, d_d;
    CudaDevicePtr<int> d_cx, d_cy;
    CudaDevicePtr<float> d_I, d_sigma, d_bkg, d_obs_x, d_obs_y;
    CudaDevicePtr<uint8_t> d_ok, d_strong, d_has_obs;

    // --- fixed-size device arrays ---
    // The learning/fit math is single precision: FP64 is heavily throttled on consumer GPUs and the
    // extraction is Poisson-noise limited, so float reproduces the double CPU path to ~1e-4.
    CudaDevicePtr<uint8_t> d_mask;                      // per-pixel r2-disk reflection mask
    CudaDevicePtr<float> d_shell_grid, d_global_grid;   // learned profile accumulators (N_SHELL*GG, GG)
    CudaDevicePtr<float> d_shell_P, d_global_P;         // normalised profiles (empirical mode)
    CudaDevicePtr<float> d_shell_sigma2, d_global_sigma2;
    CudaDevicePtr<int> d_shell_n, d_global_n;
    CudaDevicePtr<unsigned long long> d_invd2;          // [min,max] inv-d^2 as monotonic bit patterns

    // --- host staging (copied back once per frame) ---
    std::vector<float> h_px_x, h_px_y, h_d;
    std::vector<float> h_I, h_sigma, h_bkg, h_obs_x, h_obs_y;
    std::vector<uint8_t> h_ok, h_has_obs;

    void EnsureCapacity(size_t n);
public:
    BraggIntegrationEngineGPU(const DiffractionExperiment &experiment, std::shared_ptr<CudaStream> stream);

    std::vector<Reflection> Run(const ImagePreprocessorBuffer &image,
                                const std::vector<Reflection> &predicted, size_t npredicted,
                                int64_t image_number) override;
};
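
The `d_invd2` comment mentions storing the [min,max] inv-d^2 range "as monotonic bit patterns". One common way to do this, sketched below, maps a floating-point value to an unsigned integer whose ordering matches the numeric ordering, so integer min/max operations (including atomic ones on a GPU) give the right answer. The function names and the use of `double` are assumptions for illustration; the header does not show the encoding the engine actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Map a double to a uint64_t key so that a < b implies key(a) < key(b).
// Positive values get the sign bit set; negative values have all bits flipped.
uint64_t to_ordered_bits(double v) {
    assert(v == v);   // NaN has no place in a total order
    uint64_t u;
    std::memcpy(&u, &v, sizeof u);   // well-defined bit copy
    const uint64_t sign = 0x8000000000000000ULL;
    return (u & sign) ? ~u : (u | sign);
}

// Inverse mapping: recover the double from its ordered key.
double from_ordered_bits(uint64_t k) {
    const uint64_t sign = 0x8000000000000000ULL;
    const uint64_t u = (k & sign) ? (k & ~sign) : ~k;
    double v;
    std::memcpy(&v, &u, sizeof v);
    return v;
}
```

Since inv-d^2 is never negative, the raw IEEE 754 bit pattern would already sort correctly as an unsigned integer; the sign handling above only makes the sketch safe for general inputs.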