Jungfraujoch/docs/HDF5.md
leonarski_f 6b95600260 writer: link ROI results into the VDS master file
In VDS mode the per-image ROI results (max/sum/sum_sq/npixel/x/y) are
written into the data files but were not exposed in the master, so a VDS
master surfaced no ROI statistics. Add virtual datasets under
/entry/roi/<name> in LinkToData_VDS, one group per ROI, mirroring how the
spot-finding and azimuthal-integration arrays are linked. Integrated and
legacy formats are unaffected (the results are already reachable there).

Extended the reader round-trip test to write real ROI results and check
they read back from the master for both VDS and integrated formats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:38:24 +02:00


HDF5 / NeXus data format

Jungfraujoch stores images and on-the-fly analysis results in HDF5 files that aim to be NXmx-compliant. On top of the NXmx application definition, Jungfraujoch records a substantial amount of derived metadata (spot finding, indexing, integration, azimuthal integration, per-image statistics, timing). These extra entries do not exist in NXmx and are documented here so that the layout is unambiguous and reusable.

This page documents the file layout and the data fields. The operational behaviour of the writer (running, republishing, file finalisation, CBF/TIFF output) is described in jfjoch_writer. The wire format that feeds the writer is described in CBOR messages; fields below frequently correspond one-to-one to CBOR message fields, and that document is a useful companion for their meaning.

1. Motivation: derived metadata and FAIR data

The goal of Jungfraujoch is not only to store high-throughput datasets efficiently, but to keep them findable, accessible, interoperable and reusable (FAIR). Jungfraujoch is used for both rotation macromolecular crystallography (single- and multi-crystal, including fine-sliced and helical scans) and serial crystallography (stills, grid scans); the same concerns apply to both:

  • Findability. Raw diffraction images carry almost no descriptive metadata about content. Quantities such as background level, number of diffraction spots, or indexing outcome let a user judge the quality and relevance of a dataset before inspecting the raw images.
  • Accessibility at scale. A single experiment can span tens to hundreds of terabytes. Standard retrieval (e.g. HTTP) makes a dataset available but not inspectable — users would otherwise have to download a large fraction of the data just to decide whether it is useful. Compact derived representations make discovery, assessment and reuse feasible.

Because Jungfraujoch couples acquisition with real-time analysis used to steer experiments, transparency and reproducibility of that analysis matter. As a minimum the writer therefore preserves spot-finding and indexing results together with the filters that were applied, and it can retain an unbiased, down-sampled reference set of unfiltered images for validation and reuse.

Two complementary layouts: per-image spots vs. a reflection table

Jungfraujoch stores analysis products in two shapes, matching how each is accessed.

Per-image spot finding / indexing. Spot finding and indexing are inherently image-centric — the natural query is "give me the spots for image n" — and this holds for serial stills and for rotation frames alike. For these products Jungfraujoch adopts a layout similar to the Coherent X-ray Imaging (CXI) data bank (Maia, 2012) and the convention understood by CrystFEL: spot properties (position, intensity, Miller index, …) are stored in fixed-size two-dimensional arrays indexed by image number, with each image allocated room for up to a predefined maximum number of spots. These dense arrays are addressed with ordinary HDF5 hyperslab reads, so the spots of a single image are retrieved without traversing variable-length structures. The cost is some storage overhead for unused slots (padded with sentinels), which is acceptable for the access pattern.
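
The padded, image-indexed layout described above can be sketched as follows. This is a minimal illustration, not the Jungfraujoch reader: it assumes the dataset names listed in §4.1 (nPeaks, peakXPosRaw, peakYPosRaw, peakTotalIntensity), and the helper name `spots_for_image`, the demo values and the 0.0 padding used in the demo are made up for the example.

```python
def spots_for_image(mx, image_number):
    """Return (x, y, intensity) tuples for one image from CXI-style padded arrays.

    `mx` is anything indexable by dataset name, for example an h5py group
    opened at /entry/MX, or a plain dict of nested lists as in the demo below.
    """
    n_spots = int(mx["nPeaks"][image_number])  # number of filled slots in this row
    xs = mx["peakXPosRaw"][image_number][:n_spots]
    ys = mx["peakYPosRaw"][image_number][:n_spots]
    intensities = mx["peakTotalIntensity"][image_number][:n_spots]
    return [(float(x), float(y), float(i)) for x, y, i in zip(xs, ys, intensities)]


# Demo data: 2 images with room for 3 spots each; unused slots are padding
# (0.0 here is an arbitrary stand-in for the sentinel stored in real files).
demo_mx = {
    "nPeaks": [2, 0],
    "peakXPosRaw": [[10.5, 200.0, 0.0], [0.0, 0.0, 0.0]],
    "peakYPosRaw": [[33.0, 41.25, 0.0], [0.0, 0.0, 0.0]],
    "peakTotalIntensity": [[150.0, 80.0, 0.0], [0.0, 0.0, 0.0]],
}
print(spots_for_image(demo_mx, 0))
```

Only row `image_number` is touched, which is the point of the dense layout: the cost of reading one image does not depend on how many images the run contains.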

Integrated reflections. Integrated intensities are naturally a dataset-wide table, which is exactly the model of the NeXus NXreflections base class. This fits rotation crystallography well, and Jungfraujoch uses NXreflections for its integration results (see §4.2 below). We deliberately do not force spot finding/indexing into a single experiment-wide table: across the hundreds of thousands of patterns typical of serial — or fine-sliced rotation — experiments, that would require aggregating the whole experiment before the spots of one image can be read. We encourage the community to develop standardised NeXus application definitions for image-centric crystallography products that combine NeXus interoperability with the access patterns and scale of modern high-throughput experiments.

2. File layout

A run is written as one master file plus, depending on the format, one or more data files:

<prefix>_master.h5             # NXmx master file (metadata + links / virtual datasets)
<prefix>_data_000001.h5        # data file: images + per-image analysis
<prefix>_data_000002.h5
...

The master file is produced by writer/HDF5NXmx.cpp; data files by writer/HDF5DataFile.cpp and its plugins (writer/HDF5DataFilePlugin*.cpp). Files are written to a temporary *.<random>.tmp name and renamed on successful close.
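
The naming scheme and the write-then-rename step can be sketched as below. This is an illustration under stated assumptions, not the writer's code: the helper names are hypothetical, the form of the random suffix is a guess at the `*.<random>.tmp` pattern, and the payload is placeholder bytes rather than real HDF5 content.

```python
import os
import tempfile
import uuid


def run_file_names(prefix, n_data_files):
    """Build the master and data file names for a run (data files are numbered from 1)."""
    master = f"{prefix}_master.h5"
    data = [f"{prefix}_data_{index:06d}.h5" for index in range(1, n_data_files + 1)]
    return master, data


def write_then_rename(final_path, payload):
    """Write to a temporary sibling name, then rename once the file is closed."""
    tmp_path = f"{final_path}.{uuid.uuid4().hex[:8]}.tmp"  # hypothetical random suffix
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
    os.replace(tmp_path, final_path)  # the final name only appears after a clean close
    return final_path


master_name, data_names = run_file_names("lyso", 2)
with tempfile.TemporaryDirectory() as folder:
    write_then_rename(os.path.join(folder, master_name), b"placeholder")
    leftovers = sorted(os.listdir(folder))
print(master_name, data_names, leftovers)
```

Because the rename happens last, a consumer that watches the directory for `*.h5` names does not pick up a partially written file.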

Three master-file variants exist (set via file_format):

| Format | Value | Master ↔ data linking |
| --- | --- | --- |
| NXmxLegacy (default) | 1 | One external link in /entry/data per data file (data_000001, …). HDF5 1.8 compatible; works with Neggia/Durin XDS plugins and Albula 4.0. |
| NXmxVDS | 2 | A single virtual dataset /entry/data/data spans all data files; spot finding, azimuthal integration and reflections are linked the same way. Requires HDF5 1.10 / Albula 4.1+. |
| NXmxIntegrated | 3 | No separate data files: images and all metadata live in one file. Equivalent in content to the VDS format. |

In legacy/VDS mode, image-indexed analysis arrays live in the data files and are exposed in the master file through external links or virtual datasets; in integrated mode they are written directly into the single file. Throughout this document a "✓ in master" column marks entries that are visible (directly or via link/VDS) from the master file.

Images are stored chunked (one image per chunk) and compressed with bitshuffle + LZ4 or bitshuffle + Zstd. The HDF5 fill value doubles as the "masked / no-data" sentinel: INTx_MIN for signed integer image datasets and UINTx_MAX for unsigned ones.
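
For reference, the sentinel values follow directly from the integer width. The sketch below computes them with plain integer arithmetic; the function name `image_fill_value` is hypothetical and the set of accepted bit depths is an assumption.

```python
def image_fill_value(bit_depth, signed):
    """Return INTx_MIN for signed images and UINTx_MAX for unsigned ones."""
    if bit_depth not in (8, 16, 32, 64):  # assumed set of integer widths
        raise ValueError(f"unsupported bit depth: {bit_depth}")
    if signed:
        return -(1 << (bit_depth - 1))  # e.g. INT16_MIN
    return (1 << bit_depth) - 1  # e.g. UINT16_MAX


print(image_fill_value(16, signed=True))
print(image_fill_value(16, signed=False))
```

A reader can compare pixel values against this number to build a mask of pixels that carry no data.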

Reprocessing output: <prefix>_process.h5

The offline reprocessing tool jfjoch_process (tools/jfjoch_process.cpp) re-runs the full analysis pipeline (spot finding, indexing, refinement, integration, scaling) on an existing dataset and writes its results to a master file named <prefix>_process.h5. This file uses the integrated format, but instead of copying the images its /entry/data/data is a virtual dataset that links back to the original image files (hdf5_source_dataNXmx::LinkToData_ProcessingVDS). The result is a compact, self-describing companion file that holds all the derived analysis (everything in §4) plus a virtual view of the raw images — without duplicating terabytes of data.

This is a particularly FAIR-friendly artefact: it can be shared or archived alongside (or instead of) the raw data to convey what is in a dataset and how it was processed, while the /entry/data/data VDS still resolves to the original images when they are available. jfjoch_process can also process an equally-spaced subset of images (start/end/stride), producing a down-sampled reference set.

3. NXmx-standard content

The entries below are part of, or valid base classes for, the NXmx application definition. "NXmx" = listed in the application definition; "base" = a valid field of the relevant NeXus base class (NXdetector, NXsample, NXsource) but not in the NXmx required/recommended subset.

/entry (NXentry)

| Field | Std | Notes |
| --- | --- | --- |
| definition | NXmx | value "NXmx" |
| start_time | NXmx | arming time |
| end_time, end_time_estimated | NXmx | approximate end time |

File-level HDF5 attributes file_name, file_time, HDF5_Version are also set.

/entry/source (NXsource), /entry/instrument (NXinstrument)

| Field | Std | Units |
| --- | --- | --- |
| source/name, source/type | NXmx / base | |
| source/current | base | A |
| instrument/name | NXmx | |

/entry/instrument/beam (NXbeam)

| Field | Std | Units |
| --- | --- | --- |
| incident_wavelength | NXmx | angstrom |
| incident_wavelength_spread | NXmx | angstrom (only if polychromatic) |
| total_flux | NXmx | Hz |

/entry/instrument/attenuator (NXattenuator)

| Field | Std |
| --- | --- |
| attenuator_transmission | NXmx |

/entry/instrument/detector (NXdetector)

| Field | Std | Units / notes |
| --- | --- | --- |
| depends_on | NXmx | transformations/rot3 |
| beam_center_x, beam_center_y | NXmx | pixel |
| distance | NXmx | m |
| count_time, frame_time | NXmx | s |
| sensor_thickness | NXmx | m |
| sensor_material | NXmx | |
| description | NXmx | |
| threshold_energy | NXmx | eV (EIGER; written only for a single channel) |
| x_pixel_size, y_pixel_size | base | m |
| serial_number | base | |
| bit_depth_readout | NXmx | |
| saturation_value | NXmx | |
| flatfield_applied | NXmx | |
| pixel_mask, pixel_mask_applied | NXmx | pixel_mask is [y, x], hard-linked from detectorSpecific/pixel_mask |
| countrate_correction_applied | NXmx | |
| number_of_cycles | base | frame-summation factor |

/entry/instrument/detector/transformations (NXtransformations)

The NXtransformations mechanism (the depends_on chain, transformation_type, vector, offset attributes) is standard. The axis names follow the pyFAI PONI convention chosen by Jungfraujoch (see DETECTOR_GEOMETRY):

| Axis | Type | Units | Depends on |
| --- | --- | --- | --- |
| translation | translation | m | . |
| rot1 | rotation | rad | translation |
| rot2 | rotation | rad | rot1 |
| rot3 | rotation | rad | rot2 |
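
The depends_on chain in the table can be walked from the detector's entry point (rot3) down to the terminating "." as sketched below. The dictionary simply restates the table with bare axis names; in a real file the depends_on values are stored as HDF5 attributes and may be full paths, so treat the helper `transformation_chain` as a hypothetical illustration.

```python
# Axis -> axis it depends on, copied from the table above ("." ends the chain).
DEPENDS_ON = {
    "rot3": "rot2",
    "rot2": "rot1",
    "rot1": "translation",
    "translation": ".",
}


def transformation_chain(start, depends_on):
    """Follow depends_on links from `start` until the "." terminator is reached."""
    chain = []
    seen = set()
    axis = start
    while axis != ".":
        if axis in seen:
            raise ValueError(f"depends_on cycle at {axis!r}")
        seen.add(axis)
        chain.append(axis)
        axis = depends_on[axis]
    return chain


print(transformation_chain("rot3", DEPENDS_ON))
```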

/entry/instrument/detector/module (NXdetector_module)

data_origin, data_size, fast_pixel_direction, slow_pixel_direction, module_offset — all NXmx (fast/slow_pixel_direction and module_offset carry transformation attributes).

/entry/sample (NXsample)

| Field | Std | Units / notes |
| --- | --- | --- |
| name | NXmx | |
| depends_on | NXmx | points at the last goniometer / grid-scan axis, or . for stills |
| temperature | NXmx | K |
| transformations/ (NXtransformations) | NXmx | rotation axis (e.g. omega) or grid-scan translation; hard-linked as /entry/sample/goniometer |
| unit_cell | base | [a, b, c, α, β, γ] |
| ub_matrix | base | [1, 3, 3], Angstrom⁻¹ |

For a rotation scan the goniometer axis is written as a per-image angle array <axis> plus <axis>_end, together with the scalars <axis>_range_average and <axis>_range_total and, for helical scans, <axis>_helical_x/_y/_z. The datasets beyond the bare axis array are Jungfraujoch conveniences.

/entry/data (NXdata)

data (3-D image stack, [n_images, y, x]) with image_nr_low / image_nr_high attributes. In legacy mode this group instead contains one external link data_000001, … per data file.
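
In legacy mode a reader has to decide which linked dataset holds a given image. One way to do that is sketched below, assuming that each linked dataset carries image_nr_low / image_nr_high attributes and that they are inclusive bounds; both points are assumptions, as are the helper name `locate_image` and the demo ranges.

```python
def locate_image(links, image_number):
    """Find the linked dataset containing `image_number` and the index inside it.

    `links` is a list of (link_name, image_nr_low, image_nr_high) tuples with
    inclusive bounds, in whatever numbering base the file uses.
    """
    for name, low, high in links:
        if low <= image_number <= high:
            return name, image_number - low
    raise IndexError(f"image {image_number} is not covered by any data file")


# Hypothetical run: 1500 images split over two data files.
links = [("data_000001", 1, 1000), ("data_000002", 1001, 1500)]
print(locate_image(links, 1001))
```

With the VDS and integrated formats this lookup is unnecessary, because /entry/data/data is a single stack addressed directly by image index.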

4. Extensions beyond NXmx

Everything in this section is outside the NXmx standard. Each group is declared with NX_class = NXcollection (the NeXus-sanctioned container for non-standardised content) unless noted. The per-image arrays are indexed by image number, padded to the run length and filled with a sentinel (NaN for floats, -1/0 for integer indices) where a quantity is absent.
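
When aggregating per-image float vectors, the NaN sentinel has to be skipped explicitly, since any arithmetic involving NaN yields NaN. A minimal sketch with hypothetical helper names follows; it covers only the float case, because the integer sentinel (-1 or 0) depends on the dataset.

```python
import math


def present_floats(values):
    """Drop NaN entries, which mark images where the quantity is absent."""
    return [value for value in values if not math.isnan(value)]


def mean_ignoring_missing(values):
    """Mean over the images that actually carry a value (NaN if there are none)."""
    present = present_floats(values)
    return sum(present) / len(present) if present else math.nan


# Made-up per-image values; the second image has no estimate.
print(mean_ignoring_missing([1.0, math.nan, 3.0]))
```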

4.1 /entry/MX — spot finding and indexing (CXI-style)

The flagship extension. Spot ("peak") properties are stored as fixed-size [n_images, max_spots] arrays (CXI layout, recognised by CrystFEL); scalar-per-image quantities as [n_images] vectors. In legacy/VDS mode these live in the data files and are linked/virtual-stacked into the master.

Per-spot arrays [n_images, max_spots]:

Dataset Units Meaning Indexing only
peakXPosRaw, peakYPosRaw pixel spot position (raw detector frame)
peakTotalIntensity photons spot intensity
peakIceRingRes spot lies in an ice-ring resolution band
peakH, peakK, peakL Miller indices of the (indexed) spot
peakDistEwaldSphere Å⁻¹ distance of the spot from the Ewald sphere
peakIndexed spot fits the indexing solution
peakLattice lattice the spot belongs to (-1 = unindexed)

Per-image vectors [n_images]:

| Dataset | Units | Meaning |
| --- | --- | --- |
| nPeaks | | number of spots stored for the image (CXI) |
| strongPixels | | strong-pixel count (first spot-finding stage) |
| peakCountUnfiltered | | spots found before filtering |
| peakCountLowRes | | low-resolution spots |
| peakCountIceRingRes | | spots inside ice-ring bands |
| peakCountIndexed | | spots fitting the indexing solution |
| imageIndexed | | image was indexed (0/1) |
| indexingLatticeCount | | number of lattices found for the image |
| niggliClass | | Niggli class of the indexed Bravais lattice (see International Tables for Crystallography (2016), Vol. A, Table 3.1.3.1) |
| bravaisLattice | | Bravais lattice short code, e.g. aP, mC, oF, tI, hP, hR, cF |
| profileRadius | Å⁻¹ | crystal profile radius |
| mosaicity | deg | mosaicity estimate |
| bFactor | Ų | per-image B-factor estimate |
| resolutionEstimate | Å | diffraction resolution estimate |
| integratedReflections | | number of integrated reflections |
| bkgEstimate | photons | mean background in the 3–5 Å resolution band |
| beam_corr_x, beam_corr_y | pixel | beam-center correction applied during processing |
| imageScaleFactor | | on-the-fly per-image scale factor g |
| imageScaleCC | | on-the-fly scaling correlation coefficient |
| imageScaleMosaicity | deg | scaling-model mosaicity |
| imageScaleBFactor | Ų | scaling-model B-factor |

Per-image lattices: latticeIndexed [n_images, 9] (Å) — the real-space lattice (flattened 3×3); latticeIndexedExtra [n_images, max_extra_lattices, 9] (Å) — additional orientation variants.
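
A flattened lattice can be turned back into unit-cell parameters as sketched below. The sketch assumes the nine numbers are stored row-major with the three rows being the real-space vectors a, b and c; that ordering is an assumption and should be checked against the writer before relying on it. The function name `cell_from_lattice` is hypothetical.

```python
import math


def cell_from_lattice(flat):
    """Convert a flattened 3x3 real-space lattice into (a, b, c, alpha, beta, gamma).

    Assumes row-major storage with rows a, b, c; lengths in Å, angles in degrees.
    """
    if len(flat) != 9:
        raise ValueError("expected 9 lattice components")
    a, b, c = flat[0:3], flat[3:6], flat[6:9]

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cosine = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))  # clamp rounding noise

    # alpha is the angle between b and c, beta between a and c, gamma between a and b.
    return (norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b))


# Made-up orthorhombic lattice for the demo.
print(cell_from_lattice([10.0, 0.0, 0.0, 0.0, 20.0, 0.0, 0.0, 0.0, 30.0]))
```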

Run-level summaries (written into the master /entry/MX at finalisation):

| Dataset | Units | Meaning |
| --- | --- | --- |
| indexing_algorithm | | FFBIDX / FFT (CUDA) / FFT (FFTW) |
| geom_refinement_algorithm | | e.g. beam_center |
| rotationLatticeIndexed | Å | whole-run rotation-indexing lattice ([9]) |
| rotationLatticeIndexedExtra | Å | additional whole-run lattices ([m, 9]) |
| rotationLatticeNiggliClass | | Niggli class of the run lattice |
| imageIndexedMean | | mean indexing rate over the run |
| bkgEstimateMean | photons | mean background over the run |
| indexedLatticeCount | | per-image lattice count summary (master). Note: data files use indexingLatticeCount; readers accept either. |

CrystFEL can read the spots directly with:

peak_list = /entry/MX
peak_list_type = cxi

4.2 /entry/reflections — integrated reflections (NXreflections)

Integrated reflections are stored per image as /entry/reflections/image_NNNNNN groups, each declared NX_class = NXreflections. The columns map mostly onto the standard NXreflections base class:

| Dataset | Units | NXreflections | Meaning |
| --- | --- | --- | --- |
| h, k, l | | standard | Miller indices |
| d | Å | standard | resolution |
| int_sum | photons | standard | integrated intensity (summation) |
| int_err | photons | non-standard name | σ of the intensity (standard equivalent: int_sum_errors) |
| background_mean | photons | standard | mean background under the peak |
| predicted_x, predicted_y | pixel | name standard, units differ | predicted position. NXreflections predicted_x/_y are physical lengths; the pixel datasets are predicted_px_x/_y |
| observed_x, observed_y | pixel | name standard, units differ | observed centroid (pixels; standard pixel form is observed_px_x/_y) |
| observed_frame | | standard | image number of the reflection |
| lp | | standard | Lorentz-polarization factor (stored as 1/rlp) |
| partiality | | standard | recorded fraction of the reflection |
| delta_phi | deg | extension | XDS Δφ: offset from the centre of the current frame |
| zeta | | extension | Lorentz ζ factor (reciprocal-space geometry term) |
| image_scale_corr | | extension | per-image scale correction; I_true = image_scale_corr · int_sum |
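
The relation I_true = image_scale_corr · int_sum can be applied column-wise as sketched below. Scaling int_err by the same factor is an assumption based on ordinary error propagation for a constant multiplier, not something this format specifies; the function name and the demo numbers are hypothetical.

```python
def corrected_intensities(int_sum, int_err, image_scale_corr):
    """Apply the per-reflection scale correction to intensities and their sigmas."""
    if not (len(int_sum) == len(int_err) == len(image_scale_corr)):
        raise ValueError("columns must have the same length")
    return [
        (scale * intensity, abs(scale) * sigma)  # sigma scaling is an assumption
        for intensity, sigma, scale in zip(int_sum, int_err, image_scale_corr)
    ]


# Made-up values for two reflections.
print(corrected_intensities([100.0, 50.0], [10.0, 5.0], [1.5, 2.0]))
```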

In the master file these per-image groups are exposed through /entry/reflections external links (VDS/integrated formats).

4.3 /entry/azint — azimuthal integration

| Dataset | Shape | Units | Meaning |
| --- | --- | --- | --- |
| bin_to_q | [φ_bins, q_bins] | Å⁻¹ | q value of each bin |
| bin_to_two_theta | [φ_bins, q_bins] | deg | 2θ of each bin |
| bin_to_phi | [φ_bins, q_bins] | deg | azimuthal angle of each bin |
| image | [n_images, φ_bins, q_bins] | | per-image integrated profile (NaN for empty bins) |
| image_std | [n_images, φ_bins, q_bins] | | per-bin standard deviation |
| image_count | [n_images, φ_bins, q_bins] | | pixels contributing per bin |
| map | [y, x] | | pixel→bin mapping (master file only) |

4.4 /entry/roi — regions of interest (per-image results)

/entry/roi/<roi_name> has one sub-group per configured ROI, holding the per-image result vectors [n_images]. These are written into the data files; in VDS mode they are exposed from the master file through virtual datasets, and in integrated mode they are in the single file. (In legacy mode they remain only in the data files.)

| Dataset | Meaning |
| --- | --- |
| max | maximum pixel value in the ROI |
| sum | sum of pixel values |
| sum_sq | sum of squared pixel values |
| npixel | number of valid pixels |
| x, y | intensity-weighted centroid |
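
Storing sum, sum_sq and npixel is enough to recover the mean and the spread of the pixel values in an ROI without re-reading the image. The sketch below uses the population variance E[v²] − E[v]²; the function name `roi_mean_and_std` is hypothetical.

```python
import math


def roi_mean_and_std(roi_sum, roi_sum_sq, npixel):
    """Mean and population standard deviation of ROI pixels from the stored sums."""
    if npixel <= 0:
        return math.nan, math.nan  # no valid pixels in the ROI
    mean = roi_sum / npixel
    variance = max(roi_sum_sq / npixel - mean * mean, 0.0)  # clamp rounding noise
    return mean, math.sqrt(variance)


# Made-up ROI with pixel values 1, 2, 3, 6: sum = 12, sum_sq = 50, npixel = 4.
print(roi_mean_and_std(12, 50, 4))
```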

4.4.1 /entry/roi_defs — ROI definitions (master file)

The dataset-wide ROI definitions (geometry, fixed for the whole acquisition) live in the master file under a separate /entry/roi_defs group — kept apart from /entry/roi above so that older readers, which iterate /entry/roi, are unaffected by these entries. One sub-group /entry/roi_defs/<roi_name> per ROI:

| Dataset | Meaning |
| --- | --- |
| bit_index | which bit of roi_map (below) marks this ROI |
| type | box, circle or azim |
| min_x_pxl, max_x_pxl, min_y_pxl, max_y_pxl | box bounds (type box) |
| center_x_pxl, center_y_pxl, radius_pxl | circle (type circle) |
| q_min_recipA, q_max_recipA | Q range (type azim) |
| phi_min_deg, phi_max_deg | azimuthal-angle sector (type azim, omitted for a full ring) |

/entry/roi_defs/roi_map [y, x] is a uint16 per-pixel bitmask: bit bit_index is set for every pixel belonging to that ROI, so an ROI's footprint can be recovered exactly.
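
Decoding the bitmask is a matter of shifting and masking. The sketch below works on plain nested lists so that it runs without extra libraries; the ROI names, the bit assignments and the 2×2 map are made up, and the helper names are hypothetical.

```python
def rois_for_pixel(mask_value, bit_index_by_name):
    """Names of all ROIs whose bit is set in one roi_map pixel value."""
    return sorted(
        name for name, bit in bit_index_by_name.items() if (mask_value >> bit) & 1
    )


def roi_pixel_count(roi_map, bit_index):
    """Count the pixels of a [y][x] bitmask that belong to the ROI with this bit_index."""
    return sum((value >> bit_index) & 1 for row in roi_map for value in row)


# Hypothetical ROI names and a tiny 2x2 map; real maps are detector-sized uint16 arrays.
bit_index_by_name = {"beam": 0, "ring": 1}
roi_map = [
    [0b01, 0b11],
    [0b10, 0b00],
]
print(rois_for_pixel(roi_map[0][1], bit_index_by_name))
print(roi_pixel_count(roi_map, bit_index_by_name["ring"]))
```

Since a pixel value is a set of bits, overlapping ROIs are represented without ambiguity: the pixel at [0][1] in the demo belongs to both.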

4.5 /entry/image — per-image pixel statistics

[n_images] vectors: max_value, min_value (viable min/max, excluding error/saturated pixels), error_pixels, saturated_pixels, pixel_sum. Surfaced in the master file under /entry/image.

4.6 /entry/profiling — per-image timing

[n_images] vectors in seconds: spotFindingTime, indexingTime, integrationTime, refinementTime, processingTime, braggPredictionTime, preprocessingTime, compressionTime, azIntTime, indexAnalysisTime, imageScaleTime.

4.7 /entry/detector — acquisition diagnostics (data file)

A convenience NXcollection in the data file (note: distinct from the standard /entry/instrument/detector). In integrated format these datasets are written under /entry/instrument/detector/detectorSpecific instead.

| Dataset | Meaning |
| --- | --- |
| timestamp, exptime | per-image timestamp and exposure time |
| number | image number (original number if image rejection was used) |
| det_info | JUNGFRAU debug field |
| storage_cell_image | storage-cell number |
| rcv_delay, rcv_free_send_buffers | receiver internal diagnostics |
| packets_expected, packets_received | UDP packets per image |
| data_collection_efficiency_image | received / expected packet ratio |

4.8 /entry/xfel — pulsed-source metadata

[n_images] vectors pulseID and eventCode, written for pulsed sources (e.g. SwissFEL).

4.9 Other collections

| Path | Class | Content |
| --- | --- | --- |
| /entry/instrument/detector/detectorSpecific | NXcollection | Dectris-style detector metadata + Jungfraujoch fields: x_pixels_in_detector, y_pixels_in_detector, nimages, ntrigger, nimages_collected, nimages_written, data_collection_efficiency, max_receiver_delay, storage_cell_number, storage_cell_delay [ns], software_git_commit, software_git_date, jfjoch_release, jfjoch_writer_release, summation_mode, detect_ice_rings, gain_file_names, data_reduction_factor_serialmx, adu_histogram/, data_collection_efficiency_image |
| /entry/instrument/detector/calibration | NXcollection | per-channel pedestal / calibration images (bitshuffle-compressed) |
| /entry/instrument/fluorescence | NXcollection | XRF spectrum: energy [eV], data |
| /entry/user | NXcollection | scalar values supplied under header_appendix.hdf5 |

4.10 Non-standard fields inside the NXmx detector group

A few extension scalars are written inside the otherwise-standard /entry/instrument/detector group for compatibility with existing tooling:

| Field | Units | Meaning |
| --- | --- | --- |
| detector_distance | m | duplicate of distance (Dectris/Neggia compatibility) |
| detector_number | | detector identifier (Dectris convention) |
| error_value | | masked/error pixel sentinel (NXmx standard would be underload_value) |
| bit_depth_image | | stored image bit depth (NXmx standard is bit_depth_readout) |
| acquisition_type | | always triggered (Dectris convention) |
| jungfrau_conversion_applied | | JUNGFRAU photon/keV conversion applied |
| jungfrau_conversion_factor | eV | conversion factor |
| geometry_transformation_applied | | module→full-detector geometry applied |

4.11 User-supplied metadata: header_appendix and image_appendix

Facilities frequently need to attach metadata that Jungfraujoch does not model explicitly. Two free-form JSON fields in the /start request (broker/jfjoch_api.yaml) provide this without any schema change; both accept any valid JSON:

| Field | Carried in | Persisted to HDF5? |
| --- | --- | --- |
| header_appendix | the start message, under user_data.user (see CBOR) | no, except the hdf5 sub-object (below) |
| image_appendix | every image message, as user_data | no |

Both are forwarded verbatim through the ZeroMQ/CBOR stream to every downstream consumer (writer, republished analysis, viewers), so they are the recommended channel for facility- or beamline-specific provenance (proposal, operator, optics state, per-image trigger info, …) that has no dedicated API field.

Persisting selected values to HDF5. header_appendix is normally not written to the master file. As an exception, if it contains a key hdf5 whose value is a JSON object of scalars (strings and numbers — no arrays or nested objects), the writer stores each entry under /entry/user/<key>.

For example, a /start request containing:

{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}

forwards the whole header_appendix as user_data.user on the start message and {"trigger_source": "external"} as user_data on every image message, and writes three scalars into the master file:

/entry/user/beamline          = "X06SA"
/entry/user/ring_mode         = "top-up"
/entry/user/attenuator_foils  = 2
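
The selection rule can be mimicked on the client side to preview which values would end up under /entry/user. The sketch below is not the writer's implementation: the function name `user_entries` is hypothetical, and silently skipping non-scalar values and booleans is an assumption, since the behaviour for such entries is not described here.

```python
import json


def user_entries(start_request):
    """Map header_appendix.hdf5 scalars (strings and numbers) to /entry/user/<key> paths."""
    appendix = start_request.get("header_appendix")
    hdf5 = appendix.get("hdf5") if isinstance(appendix, dict) else None
    if not isinstance(hdf5, dict):
        return {}
    entries = {}
    for key, value in hdf5.items():
        if isinstance(value, bool):
            continue  # assumption: JSON booleans are not treated as numbers
        if isinstance(value, (str, int, float)):
            entries[f"/entry/user/{key}"] = value
        # arrays and nested objects are skipped in this sketch (assumed behaviour)
    return entries


request = json.loads("""
{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}
""")
print(user_entries(request))
```

Keys outside the hdf5 sub-object (proposal, operator) do not appear in the result, matching the rule that they travel with the stream but are not persisted.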

5. Notes

  • Units are written as the HDF5 units attribute on the dataset (e.g. m, eV, deg, Angstrom, Angstrom^-1, Angstrom^2, pixel, s).
  • Sentinels. Missing per-image values are NaN (floats) or -1/0 (integer indices); image pixels use INTx_MIN / UINTx_MAX.
  • Master vs data file. In legacy/VDS formats the analysis arrays physically live in the data files; the master file links to them (external links in legacy, virtual datasets in VDS). In the integrated format there are no data files and everything is in one place.
  • CXI / CrystFEL. /entry/MX follows the CXI peak-list convention; see CXI file format.