Reimplement BraggIntegrate2D (box sum) and ProfileIntegrate2D (Kabsch profile fit) under one roof as a base + CPU + GPU engine, mirroring the AzIntEngine / ROIIntegration pattern. Reads the preprocessed int32 ImagePreprocessorBuffer (masked=INT32_MIN, saturated=INT32_MAX), the same buffer AzIntEngineGPU/ROIIntegrationGPU consume.

The CUDA engine runs one block per reflection with shared-memory reductions across six kernels (reset, mask, box-sum, profile learning, profile build, Kabsch fit); the resolution shell is computed inline. The learning/fit hot path is single precision (FP64 is throttled on consumer GPUs; reproduces the double CPU path to ~1e-4). Collapsing the per-frame CUDA API calls into one reset kernel keeps launch-latency overhead low.

Standalone for now: NOT wired into IndexAndRefine. See BRAGG_INTEGRATION_ENGINE.md for the design and the binding steps.

BraggIntegrationEngineGPUTest checks GPU == CPU across all three modes (box/gaussian/empirical) within numeric tolerance, plus a [bragg_bench] perf sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
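The sentinel convention named in the commit message (masked=INT32_MIN, saturated=INT32_MAX) can be illustrated with a small CPU-side sketch. This is not the engine's box integrator: `BoxSumResult`, `BoxSumWindow`, the square window, and the choice to skip saturated pixels while flagging them are all assumptions made for illustration. The commit message does not say how the engine treats a saturated pixel or how it handles background.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel values as stated in the commit message for the preprocessed int32 buffer.
constexpr int32_t kMasked = std::numeric_limits<int32_t>::min();    // INT32_MIN
constexpr int32_t kSaturated = std::numeric_limits<int32_t>::max(); // INT32_MAX

// Hypothetical result type, for illustration only.
struct BoxSumResult {
    int64_t sum = 0;        // sum over valid (neither masked nor saturated) pixels
    int n_valid = 0;        // number of pixels that contributed to the sum
    bool saturated = false; // true if any pixel in the window was saturated
};

// Hypothetical helper: sum a (2*half+1) x (2*half+1) window centred on (cx, cy)
// in a row-major image, skipping sentinel pixels and pixels outside the image.
BoxSumResult BoxSumWindow(const std::vector<int32_t> &image, int width, int height,
                          int cx, int cy, int half) {
    BoxSumResult r;
    for (int y = cy - half; y <= cy + half; ++y) {
        if (y < 0 || y >= height) continue;
        for (int x = cx - half; x <= cx + half; ++x) {
            if (x < 0 || x >= width) continue;
            const int32_t v = image[static_cast<size_t>(y) * width + x];
            if (v == kMasked) continue; // masked pixel: ignore
            if (v == kSaturated) {      // assumption: flag and skip
                r.saturated = true;
                continue;
            }
            r.sum += v; // accumulate in 64 bits to avoid int32 overflow
            ++r.n_valid;
        }
    }
    return r;
}
```

Checking both sentinels before accumulating keeps INT32_MIN and INT32_MAX out of the sum, where they would otherwise dominate or overflow it.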
58 lines
2.8 KiB
C++
// SPDX-FileCopyrightText: 2026 Filip Leonarski, Paul Scherrer Institute <filip.leonarski@psi.ch>
// SPDX-License-Identifier: GPL-3.0-only

#pragma once

#include <cstdint>
#include <memory>
#include <vector>

#include "BraggIntegrationEngine.h"
#include "../indexing/CUDAMemHelpers.h"

// CUDA engine: reproduces BraggIntegrationEngineCPU up to floating-point precision. Each stage is a
// kernel with one CUDA block per reflection cooperating over the small window via shared-memory
// reductions (the natural mapping for thousands of independent, tiny per-spot integrations).
//
// Pipeline (profile modes): reset -> mark_mask -> boxsum -> learn_profile -> build_profiles -> fit
// (the resolution shell is computed inline, so there is no separate shell pass). BoxSum mode stops
// after boxsum (that pass is the BraggIntegrate2D box integrator and the seed of the profile fit).
// The preprocessed image already lives on the device (ImagePreprocessorBufferGPU::getGPUBuffer());
// only the per-frame predicted centres are uploaded.
class BraggIntegrationEngineGPU : public BraggIntegrationEngine {
    std::shared_ptr<CudaStream> stream;
    int threads;
    size_t fit_shared_bytes;

    size_t capacity = 0; // per-reflection device/host arrays hold at least this many reflections

    // --- per-reflection device arrays (grown by EnsureCapacity) ---
    CudaDevicePtr<float> d_px_x, d_px_y, d_d;
    CudaDevicePtr<int> d_cx, d_cy;
    CudaDevicePtr<float> d_I, d_sigma, d_bkg, d_obs_x, d_obs_y;
    CudaDevicePtr<uint8_t> d_ok, d_strong, d_has_obs;

    // --- fixed-size device arrays ---
    // The learning/fit math is single precision: FP64 is heavily throttled on consumer GPUs and the
    // extraction is Poisson-noise limited, so float reproduces the double CPU path to ~1e-4.
    CudaDevicePtr<uint8_t> d_mask; // per-pixel r2-disk reflection mask
    CudaDevicePtr<float> d_shell_grid, d_global_grid; // learned profile accumulators (N_SHELL*GG, GG)
    CudaDevicePtr<float> d_shell_P, d_global_P; // normalised profiles (empirical mode)
    CudaDevicePtr<float> d_shell_sigma2, d_global_sigma2;
    CudaDevicePtr<int> d_shell_n, d_global_n;
    CudaDevicePtr<unsigned long long> d_invd2; // [min,max] inv-d^2 as monotonic bit patterns

    // --- host staging (copied back once per frame) ---
    std::vector<float> h_px_x, h_px_y, h_d;
    std::vector<float> h_I, h_sigma, h_bkg, h_obs_x, h_obs_y;
    std::vector<uint8_t> h_ok, h_has_obs;

    void EnsureCapacity(size_t n);

public:
    BraggIntegrationEngineGPU(const DiffractionExperiment &experiment, std::shared_ptr<CudaStream> stream);
    std::vector<Reflection> Run(const ImagePreprocessorBuffer &image,
                                const std::vector<Reflection> &predicted, size_t npredicted,
                                int64_t image_number) override;
};
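The `d_invd2` member is commented as holding the [min,max] of inv-d^2 "as monotonic bit patterns". The header does not show the encoding, so the following is only a sketch of one common way to do this: map a floating-point value to an unsigned integer whose ordering matches the numeric ordering, so that integer min/max operations give the floating-point min/max. The helper names `ToOrderedBits` and `FromOrderedBits` and the use of `double` are assumptions, not taken from the source.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: order-preserving map from double to a 64-bit unsigned key.
// For non-negative inputs the sign bit is set; for negative inputs all bits are
// inverted. Unsigned comparison of keys then agrees with numeric comparison of the
// doubles (NaN excluded).
inline uint64_t ToOrderedBits(double v) {
    uint64_t u;
    std::memcpy(&u, &v, sizeof(u)); // well-defined way to read the bit pattern
    constexpr uint64_t kSign = 0x8000000000000000ULL;
    return (u & kSign) ? ~u : (u | kSign);
}

// Hypothetical helper: inverse of ToOrderedBits.
inline double FromOrderedBits(uint64_t key) {
    constexpr uint64_t kSign = 0x8000000000000000ULL;
    const uint64_t u = (key & kSign) ? (key & ~kSign) : ~key;
    double v;
    std::memcpy(&v, &u, sizeof(v));
    return v;
}
```

With such a key, a running minimum and maximum can be kept with plain unsigned comparisons and decoded once at the end. Since inv-d^2 is non-negative, the raw bit pattern alone would already sort correctly, so the engine may well use a simpler encoding than the one sketched here.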