New docs/HDF5.md documents the on-disk HDF5/NeXus format produced by the writer: a FAIR/derived-metadata rationale (CXI-style per-image spot layout, NXreflections for integration), the master/data-file layout and the three NXmx format variants, the NXmx-standard fields that are populated, and every Jungfraujoch extension group (/entry/MX, /entry/reflections, /entry/azint, /entry/roi, /entry/image, /entry/profiling, /entry/detector, /entry/xfel, detectorSpecific, calibration, fluorescence, user). Content is derived from writer/HDF5NXmx.cpp and writer/HDF5DataFilePlugin*.cpp and cross-checked against the NXmx and NXreflections definitions. JFJOCH_WRITER.md's stale, partial structure table is replaced by a pointer to the new doc; HDF5 is added to the Sphinx toctree. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
HDF5 / NeXus data format
Jungfraujoch stores images and on-the-fly analysis results in HDF5 files that aim to be NXmx-compliant. On top of the NXmx application definition, Jungfraujoch records a substantial amount of derived metadata (spot finding, indexing, integration, azimuthal integration, per-image statistics, timing). These extra entries do not exist in NXmx and are documented here so that the layout is unambiguous and reusable.
This page documents the file layout and the data fields. The operational behaviour of the writer (running, republishing, file finalisation, CBF/TIFF output) is described in jfjoch_writer. The wire format that feeds the writer is described in CBOR messages; fields below frequently correspond one-to-one to CBOR message fields, and that document is a useful companion for their meaning.
1. Motivation: derived metadata and FAIR data
The goal of Jungfraujoch is not only to store high-throughput datasets efficiently, but to keep them findable, accessible, interoperable and reusable (FAIR). For serial crystallography this is hard for two practical reasons:
- Findability. Raw diffraction images carry almost no descriptive metadata about content. Quantities such as background level, number of diffraction spots, or indexing outcome let a user judge the quality and relevance of a dataset before inspecting the raw images.
- Accessibility at scale. A single experiment can span tens to hundreds of terabytes. Standard retrieval (e.g. HTTP) makes a dataset available but not inspectable: without compact summaries, users would have to download a large fraction of the data just to decide whether it is useful. Compact derived representations make discovery, assessment and reuse feasible.
Because Jungfraujoch couples acquisition with real-time analysis used to steer experiments, transparency and reproducibility of that analysis matter. As a minimum the writer therefore preserves spot-finding and indexing results together with the filters that were applied, and it can retain an unbiased, down-sampled reference set of unfiltered images for validation and reuse.
Why a CXI-style per-image layout for spot finding / indexing
Spot-finding and indexing results in serial crystallography are inherently image-centric: the natural query is "give me the spots for image n". For these products Jungfraujoch adopts a layout similar to the Coherent X-ray Imaging (CXI) data bank (Maia, 2012) and the convention understood by CrystFEL: spot properties (position, intensity, Miller index, …) are stored in fixed-size two-dimensional arrays indexed by image number, with each image allocated room for up to a predefined maximum number of spots. These dense arrays are addressed with ordinary HDF5 hyperslab reads, so the spots of a single image are retrieved without traversing variable-length structures. The cost is some storage overhead for unused slots (padded with sentinels), which is acceptable for the access pattern.
We also evaluated the NeXus NXreflections base class. NXreflections models a dataset-wide reflection table, which fits integrated rotation data well — and Jungfraujoch does use it for integration results (see §4.2 below). But for spot finding/indexing across hundreds of thousands of patterns a single table would force aggregation over the whole experiment before the spots of one image can be accessed efficiently. For these intermediate products a per-image representation is more suitable. We encourage the community to develop standardised NeXus application definitions for image-centric serial crystallography products that combine NeXus interoperability with the access patterns and scale of modern experiments.
2. File layout
A run is written as one master file plus, depending on the format, one or more data files:
<prefix>_master.h5 # NXmx master file (metadata + links / virtual datasets)
<prefix>_data_000001.h5 # data file: images + per-image analysis
<prefix>_data_000002.h5
...
The master file is produced by writer/HDF5NXmx.cpp; data files by writer/HDF5DataFile.cpp and
its plugins (writer/HDF5DataFilePlugin*.cpp). Files are written to a temporary *.<random>.tmp
name and renamed on successful close.
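The naming scheme above can be sketched as a pair of small helpers. This is an illustration only: the function names are hypothetical, and the six-digit, 1-based counter is inferred from the example file names shown above, not taken from the writer source.

```python
def master_file_name(prefix: str) -> str:
    """Name of the NXmx master file for a given run prefix."""
    return f"{prefix}_master.h5"


def data_file_name(prefix: str, file_number: int) -> str:
    """Name of one data file; the counter is zero-padded to six digits."""
    if file_number < 1:
        # Assumption: numbering starts at 1, as in the listing above.
        raise ValueError("file_number must be >= 1")
    return f"{prefix}_data_{file_number:06d}.h5"


print(master_file_name("lyso"))
print([data_file_name("lyso", n) for n in (1, 2)])
```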
Three master-file variants exist (set via file_format):
| Format | Value | Master ↔ data linking |
|---|---|---|
| NXmxLegacy (default) | 1 | One external link in /entry/data per data file (data_000001, …). HDF5 1.8 compatible — works with Neggia/Durin XDS plugins and Albula 4.0. |
| NXmxVDS | 2 | A single virtual dataset /entry/data/data spans all data files; spot finding, azimuthal integration and reflections are linked the same way. Requires HDF5 1.10 / Albula 4.1+. |
| NXmxIntegrated | 3 | No separate data files — images and all metadata live in one file. Equivalent in content to the VDS format. |
In legacy/VDS mode, image-indexed analysis arrays live in the data files and are exposed in the master file through external links or virtual datasets; in integrated mode they are written directly into the single file. Throughout this document a "✓ in master" column marks entries that are visible (directly or via link/VDS) from the master file.
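For client code that has to branch on the variant, the table above could be mirrored in a small enumeration. This is a sketch: the numeric values and names come from the table, while the class and the two helper functions are hypothetical and not part of any Jungfraujoch API.

```python
from enum import IntEnum


class FileFormat(IntEnum):
    """Values of file_format as listed in the table above."""
    NXmxLegacy = 1
    NXmxVDS = 2
    NXmxIntegrated = 3


def has_separate_data_files(fmt: FileFormat) -> bool:
    """Only the integrated variant keeps images and metadata in one file."""
    return fmt != FileFormat.NXmxIntegrated


def linking_mechanism(fmt: FileFormat) -> str:
    """How the master file exposes image-indexed arrays."""
    if fmt == FileFormat.NXmxLegacy:
        return "external links"
    if fmt == FileFormat.NXmxVDS:
        return "virtual datasets"
    return "written directly into the single file"


for fmt in FileFormat:
    print(fmt.name, int(fmt), has_separate_data_files(fmt), linking_mechanism(fmt))
```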
Images are stored chunked (one image per chunk) and compressed with bitshuffle + LZ4 or
bitshuffle + Zstd; signed integer image datasets use INTx_MIN as the HDF5 fill value (the
"masked / no-data" sentinel), unsigned use UINTx_MAX.
3. NXmx-standard content
The entries below are either listed in the NXmx application definition or are valid fields of
the NeXus base classes it uses.
In the Std column, "NXmx" means the field is listed in the application definition; "base" means it is a valid field of the relevant NeXus base
class (NXdetector, NXsample, NXsource) but is not in the NXmx required/recommended subset.
/entry (NXentry)
| Field | Std | Notes |
|---|---|---|
| definition | NXmx | value "NXmx" |
| start_time | NXmx | arming time |
| end_time, end_time_estimated | NXmx | approximate end time |
File-level HDF5 attributes file_name, file_time, HDF5_Version are also set.
/entry/source (NXsource), /entry/instrument (NXinstrument)
| Field | Std | Units |
|---|---|---|
| source/name, source/type | NXmx / base | |
| source/current | base | A |
| instrument/name | NXmx | |
/entry/instrument/beam (NXbeam)
| Field | Std | Units |
|---|---|---|
| incident_wavelength | NXmx | angstrom |
| incident_wavelength_spread | NXmx | angstrom (only if polychromatic) |
| total_flux | NXmx | Hz |
/entry/instrument/attenuator (NXattenuator)
| Field | Std |
|---|---|
| attenuator_transmission | NXmx |
/entry/instrument/detector (NXdetector)
| Field | Std | Units / notes |
|---|---|---|
| depends_on | NXmx | → transformations/rot3 |
| beam_center_x, beam_center_y | NXmx | pixel |
| distance | NXmx | m |
| count_time, frame_time | NXmx | s |
| sensor_thickness | NXmx | m |
| sensor_material | NXmx | |
| description | NXmx | |
| threshold_energy | NXmx | eV (EIGER; written only for a single channel) |
| x_pixel_size, y_pixel_size | base | m |
| serial_number | base | |
| bit_depth_readout | NXmx | |
| saturation_value | NXmx | |
| flatfield_applied | NXmx | |
| pixel_mask, pixel_mask_applied | NXmx | pixel_mask is [y, x], hard-linked from detectorSpecific/pixel_mask |
| countrate_correction_applied | NXmx | |
| number_of_cycles | base | frame-summation factor |
/entry/instrument/detector/transformations (NXtransformations)
The NXtransformations mechanism (the depends_on chain, transformation_type, vector,
offset attributes) is standard. The axis names follow the PyFAI PONI convention chosen by
Jungfraujoch (see DETECTOR_GEOMETRY):
| Axis | Type | Units | Depends on |
|---|---|---|---|
| translation | translation | m | . |
| rot1 | rotation | rad | translation |
| rot2 | rotation | rad | rot1 |
| rot3 | rotation | rad | rot2 |
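The depends_on chain in the table can be resolved by following each axis to its parent until the NeXus root marker "." is reached. The sketch below works on a plain dictionary that mirrors the table; the function name is hypothetical, and reading the actual depends_on attributes from a file is left out.

```python
# Axis -> parent, copied from the "Depends on" column above.
DEPENDS_ON = {
    "translation": ".",
    "rot1": "translation",
    "rot2": "rot1",
    "rot3": "rot2",
}


def dependency_chain(start: str, depends_on: dict) -> list:
    """Return the axes from `start` down to the last axis before '.'."""
    chain = []
    axis = start
    while axis != ".":
        if axis in chain:
            raise ValueError(f"cyclic depends_on chain at {axis!r}")
        chain.append(axis)
        axis = depends_on[axis]
    return chain


# The detector's depends_on points at rot3, so the walk starts there.
print(dependency_chain("rot3", DEPENDS_ON))
```

The resulting list gives the axes from the detector outwards; how the individual transformations are composed is defined by the NXtransformations convention and the geometry described in DETECTOR_GEOMETRY.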
/entry/instrument/detector/module (NXdetector_module)
The datasets data_origin, data_size, fast_pixel_direction, slow_pixel_direction and module_offset are all
part of NXmx; fast_pixel_direction, slow_pixel_direction and module_offset carry transformation attributes.
/entry/sample (NXsample)
| Field | Std | Units / notes |
|---|---|---|
| name | NXmx | |
| depends_on | NXmx | points at the last goniometer / grid-scan axis, or . for stills |
| temperature | NXmx | K |
| transformations/ (NXtransformations) | NXmx | rotation axis (e.g. omega) or grid-scan translation; hard-linked as /entry/sample/goniometer |
| unit_cell | base | [a, b, c, α, β, γ] |
| ub_matrix | base | [1, 3, 3], Angstrom⁻¹ |
For a rotation scan the goniometer axis is written as a per-image angle array <axis> plus
<axis>_end, scalar <axis>_range_average, <axis>_range_total, and for helical scans
<axis>_helical_x/_y/_z. These extra goniometer datasets beyond the bare axis array are Jungfraujoch
conveniences.
/entry/data (NXdata)
The group contains data, a 3-D image stack [n_images, y, x], with image_nr_low / image_nr_high attributes.
In legacy mode this group instead contains one external link (data_000001, …) per data file.
4. Extensions beyond NXmx
Everything in this section is outside the NXmx standard. Each group is declared with
NX_class = NXcollection (the NeXus-sanctioned container for non-standardised content) unless noted.
The per-image arrays are indexed by image number, padded to the run length and filled with a
sentinel (NaN for floats, -1/0 for integer indices) where a quantity is absent.
4.1 /entry/MX — spot finding and indexing (CXI-style)
The flagship extension. Spot ("peak") properties are stored as fixed-size [n_images, max_spots]
arrays (CXI layout, recognised by CrystFEL); scalar-per-image quantities as [n_images] vectors.
In legacy/VDS mode these live in the data files and are linked/virtual-stacked into the master.
Per-spot arrays [n_images, max_spots]:
| Dataset | Units | Meaning | Indexing only |
|---|---|---|---|
| peakXPosRaw, peakYPosRaw | pixel | spot position (raw detector frame) | |
| peakTotalIntensity | photons | spot intensity | |
| peakIceRingRes | | spot lies in an ice-ring resolution band | |
| peakH, peakK, peakL | | Miller indices of the (indexed) spot | ✓ |
| peakDistEwaldSphere | Å⁻¹ | distance of the spot from the Ewald sphere | ✓ |
| peakIndexed | | spot fits the indexing solution | ✓ |
| peakLattice | | lattice the spot belongs to (-1 = unindexed) | ✓ |
Per-image vectors [n_images]:
| Dataset | Units | Meaning |
|---|---|---|
| nPeaks | | number of spots stored for the image (CXI) |
| strongPixels | | strong-pixel count (first spot-finding stage) |
| peakCountUnfiltered | | spots found before filtering |
| peakCountLowRes | | low-resolution spots |
| peakCountIceRingRes | | spots inside ice-ring bands |
| peakCountIndexed | | spots fitting the indexing solution |
| imageIndexed | | image was indexed (0/1) |
| indexingLatticeCount | | number of lattices found for the image |
| niggliClass | | Niggli class of the indexed Bravais lattice (see International Tables for Crystallography A, Table 3.1.3.1) |
| bravaisLattice | | Bravais lattice short code, e.g. aP, mC, oF, tI, hP, hR, cF |
| profileRadius | Å⁻¹ | crystal profile radius |
| mosaicity | deg | mosaicity estimate |
| bFactor | Ų | per-image B-factor estimate |
| resolutionEstimate | Å | diffraction resolution estimate |
| integratedReflections | | number of integrated reflections |
| bkgEstimate | photons | mean background in the 3–5 Å resolution band |
| beam_corr_x, beam_corr_y | pixel | beam-center correction applied during processing |
| imageScaleFactor | | on-the-fly per-image scale factor g |
| imageScaleCC | | on-the-fly scaling correlation coefficient |
| imageScaleMosaicity | deg | scaling-model mosaicity |
| imageScaleBFactor | Ų | scaling-model B-factor |
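Because the per-spot arrays are padded to max_spots, a reader would normally trim each row to nPeaks before use. The sketch below shows that step on a toy dictionary of nested lists; the function name is hypothetical, row indices are assumed to be 0-based, and the same indexing pattern should carry over to any object that supports group[name][image], such as an h5py group opened on /entry/MX.

```python
import math


def spots_for_image(mx, image_number):
    """Return (x, y, intensity) tuples for the spots stored for one image.

    mx maps dataset names to arrays indexable as [image][slot].
    """
    count = int(mx["nPeaks"][image_number])  # slots beyond this are padding
    xs = mx["peakXPosRaw"][image_number][:count]
    ys = mx["peakYPosRaw"][image_number][:count]
    intensities = mx["peakTotalIntensity"][image_number][:count]
    return [(float(x), float(y), float(i)) for x, y, i in zip(xs, ys, intensities)]


# Toy stand-in for /entry/MX: 2 images, max_spots = 3, NaN-padded (made-up values).
nan = math.nan
toy_mx = {
    "nPeaks": [2, 0],
    "peakXPosRaw": [[10.5, 300.0, nan], [nan, nan, nan]],
    "peakYPosRaw": [[20.0, 41.25, nan], [nan, nan, nan]],
    "peakTotalIntensity": [[150.0, 87.0, nan], [nan, nan, nan]],
}

print(spots_for_image(toy_mx, 0))
print(spots_for_image(toy_mx, 1))
```

Trimming by nPeaks avoids having to test every slot against the sentinel value.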
Per-image lattices: latticeIndexed [n_images, 9] (Å) — the real-space lattice (flattened
3×3); latticeIndexedExtra [n_images, max_extra_lattices, 9] (Å) — additional orientation
variants.
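A flattened lattice can be turned into conventional cell parameters with a few dot products. The sketch below assumes that the nine values are stored row-major with one real-space basis vector per row (a, then b, then c); that ordering is an assumption made for illustration and should be checked against the writer source before relying on it. The function name is hypothetical.

```python
import math


def cell_from_lattice(flat):
    """Return (a, b, c, alpha, beta, gamma) in Å and degrees from 9 values."""
    if len(flat) != 9:
        raise ValueError("expected 9 values (flattened 3x3 lattice)")
    # Assumption: row-major storage, one basis vector per row.
    a_vec, b_vec, c_vec = flat[0:3], flat[3:6], flat[6:9]

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle_deg(u, v):
        cosine = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        # Clamp to guard against rounding slightly outside [-1, 1].
        return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

    return (
        norm(a_vec), norm(b_vec), norm(c_vec),
        angle_deg(b_vec, c_vec),  # alpha: angle between b and c
        angle_deg(a_vec, c_vec),  # beta: angle between a and c
        angle_deg(a_vec, b_vec),  # gamma: angle between a and b
    )


# Made-up orthorhombic example, axes aligned with x, y, z.
print(cell_from_lattice([40.0, 0.0, 0.0, 0.0, 50.0, 0.0, 0.0, 0.0, 60.0]))
```

Cell lengths and angles do not depend on whether the basis vectors are stored as rows or columns of an orthogonal frame, but for a general lattice the two readings give different cells, so the storage order matters.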
Run-level summaries (written into the master /entry/MX at finalisation):
| Dataset | Units | Meaning |
|---|---|---|
| indexing_algorithm | | FFBIDX / FFT (CUDA) / FFT (FFTW) |
| geom_refinement_algorithm | | e.g. beam_center |
| rotationLatticeIndexed | Å | whole-run rotation-indexing lattice ([9]) |
| rotationLatticeIndexedExtra | Å | additional whole-run lattices ([m, 9]) |
| rotationLatticeNiggliClass | | Niggli class of the run lattice |
| imageIndexedMean | | mean indexing rate over the run |
| bkgEstimateMean | photons | mean background over the run |
| indexedLatticeCount | | per-image lattice count summary (master). Note: data files use indexingLatticeCount; readers accept either. |
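A third-party reader that wants to accept both spellings of the lattice-count dataset, as noted in the last table row, could use a fallback lookup like the one below. The function name is hypothetical; the sketch relies only on membership tests and item access, which plain dictionaries provide and which HDF5 group objects in common Python bindings are expected to provide as well.

```python
LATTICE_COUNT_NAMES = ("indexedLatticeCount", "indexingLatticeCount")


def read_lattice_count(group):
    """Return the lattice-count dataset under whichever name is present."""
    for name in LATTICE_COUNT_NAMES:
        if name in group:
            return group[name]
    raise KeyError(f"none of {LATTICE_COUNT_NAMES} found")


# Toy stand-ins for a master-file group and a data-file group (made-up values).
master_like = {"indexedLatticeCount": [1, 0, 2]}
data_like = {"indexingLatticeCount": [1, 0, 2]}

print(read_lattice_count(master_like))
print(read_lattice_count(data_like))
```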
CrystFEL can read the spots directly with:
peak_list = /entry/MX
peak_list_type = cxi
4.2 /entry/reflections — integrated reflections (NXreflections)
Integrated reflections are stored per image as
/entry/reflections/image_NNNNNN groups, each declared NX_class = NXreflections. The columns map
mostly onto the standard
NXreflections base class:
| Dataset | Units | NXreflections | Meaning |
|---|---|---|---|
| h, k, l | | standard | Miller indices |
| d | Å | standard | resolution |
| int_sum | photons | standard | integrated intensity (summation) |
| int_err | photons | non-standard name | σ of the intensity (standard equivalent: int_sum_errors) |
| background_mean | photons | standard | mean background under the peak |
| predicted_x, predicted_y | pixel | name standard, units differ | predicted position. NXreflections predicted_x/_y are physical lengths; the pixel datasets are predicted_px_x/_y |
| observed_x, observed_y | pixel | name standard, units differ | observed centroid (pixels; standard pixel form is observed_px_x/_y) |
| observed_frame | | standard | image number of the reflection |
| lp | | standard | Lorentz–polarization factor (stored as 1/rlp) |
| partiality | | standard | recorded fraction of the reflection |
| delta_phi | deg | extension | XDS Δφ: offset from the centre of the current frame |
| zeta | | extension | Lorentz ζ factor (reciprocal-space geometry term) |
| image_scale_corr | | extension | per-image scale correction; I_true = image_scale_corr · int_sum |
In the master file these per-image groups are exposed through /entry/reflections external links
(VDS/integrated formats).
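The relation I_true = image_scale_corr · int_sum from the table can be applied column-wise. The sketch below assumes that image_scale_corr is stored with one value per reflection, parallel to int_sum and int_err, and scales σ by the magnitude of the same factor (which follows from the scaling being linear). The function name and the numbers are illustrative only.

```python
def scaled_intensities(int_sum, int_err, image_scale_corr):
    """Apply I_true = image_scale_corr * int_sum to parallel columns.

    Returns (scaled intensities, scaled sigmas).
    """
    if not (len(int_sum) == len(int_err) == len(image_scale_corr)):
        raise ValueError("columns must have the same length")
    intensities = [k * i for k, i in zip(image_scale_corr, int_sum)]
    # A linear scale factor multiplies sigma by its absolute value.
    sigmas = [abs(k) * s for k, s in zip(image_scale_corr, int_err)]
    return intensities, sigmas


# Made-up columns for two reflections.
i_true, sigma_true = scaled_intensities([100.0, 50.0], [10.0, 5.0], [2.0, 0.5])
print(i_true, sigma_true)
```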
4.3 /entry/azint — azimuthal integration
| Dataset | Shape | Units | Meaning |
|---|---|---|---|
| bin_to_q | [φ_bins, q_bins] | Å⁻¹ | q value of each bin |
| bin_to_two_theta | [φ_bins, q_bins] | deg | 2θ of each bin |
| bin_to_phi | [φ_bins, q_bins] | deg | azimuthal angle of each bin |
| image | [n_images, φ_bins, q_bins] | | per-image integrated profile (NaN for empty bins) |
| image_std | [n_images, φ_bins, q_bins] | | per-bin standard deviation |
| image_count | [n_images, φ_bins, q_bins] | | pixels contributing per bin |
| map | [y, x] | | pixel→bin mapping (master file only) |
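When only a 1-D radial profile is needed, the φ dimension of one image can be collapsed using image_count as weights. The sketch below assumes that image holds a per-bin mean, so that a count-weighted average over φ gives the mean over all contributing pixels; if the writer stores sums instead, the bins would have to be added rather than averaged. The function name is hypothetical.

```python
import math


def collapse_phi(profile, counts):
    """Count-weighted average over phi for one image.

    profile, counts: nested lists shaped [phi_bins][q_bins].
    Returns a list of q_bins values; NaN where no pixel contributed.
    """
    n_q = len(profile[0])
    result = []
    for q in range(n_q):
        weighted_sum = 0.0
        total = 0
        for phi in range(len(profile)):
            value = profile[phi][q]
            n = counts[phi][q]
            if n > 0 and not math.isnan(value):  # skip empty bins
                weighted_sum += value * n
                total += n
        result.append(weighted_sum / total if total else math.nan)
    return result


# Made-up example: 2 phi bins, 2 q bins; the second q bin is empty everywhere.
print(collapse_phi([[1.0, math.nan], [3.0, math.nan]], [[1, 0], [3, 0]]))
```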
4.4 /entry/roi/<roi_name> — regions of interest
One sub-group per configured ROI, each with [n_images] vectors:
| Dataset | Meaning |
|---|---|
| max | maximum pixel value in the ROI |
| sum | sum of pixel values |
| sum_sq | sum of squared pixel values |
| npixel | number of valid pixels |
| x, y | intensity-weighted centroid |
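Storing sum, sum_sq and npixel makes it possible to recover the mean and the spread of the ROI pixel values without the image itself. A minimal sketch for one image, using the population variance E[x²] − (E[x])²; the function name is hypothetical:

```python
import math


def roi_mean_and_std(roi_sum, roi_sum_sq, npixel):
    """Mean and population standard deviation from the ROI accumulators."""
    if npixel <= 0:
        return math.nan, math.nan  # no valid pixels in the ROI
    mean = roi_sum / npixel
    # Clamp at zero: rounding can make the difference slightly negative.
    variance = max(roi_sum_sq / npixel - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Pixels 1, 2, 3 give sum = 6, sum_sq = 14, npixel = 3.
print(roi_mean_and_std(6, 14, 3))
```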
4.5 /entry/image — per-image pixel statistics
[n_images] vectors: max_value, min_value (viable min/max, excluding error/saturated pixels),
error_pixels, saturated_pixels, pixel_sum. Surfaced in the master file under /entry/image.
4.6 /entry/profiling — per-image timing
[n_images] vectors in seconds: spotFindingTime, indexingTime, integrationTime,
refinementTime, processingTime, braggPredictionTime, preprocessingTime, compressionTime,
azIntTime, indexAnalysisTime, imageScaleTime.
4.7 /entry/detector — acquisition diagnostics (data file)
A convenience NXcollection in the data file (note: distinct from the standard
/entry/instrument/detector). In integrated format these datasets are written under
/entry/instrument/detector/detectorSpecific instead.
| Dataset | Meaning |
|---|---|
| timestamp, exptime | per-image timestamp and exposure time |
| number | image number (original number if image rejection was used) |
| det_info | JUNGFRAU debug field |
| storage_cell_image | storage-cell number |
| rcv_delay, rcv_free_send_buffers | receiver internal diagnostics |
| packets_expected, packets_received | UDP packets per image |
| data_collection_efficiency_image | received / expected packet ratio |
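The packet counters allow a quick check for incomplete images. The sketch below recomputes the received / expected ratio described in the table and lists the images with missing packets; both function names are hypothetical, and the returned positions are 0-based row indices, not necessarily the values stored in the number dataset.

```python
def collection_efficiency(received, expected):
    """Received / expected packet ratio, or None if nothing was expected."""
    if expected <= 0:
        return None
    return received / expected


def incomplete_images(packets_received, packets_expected):
    """Row indices of images that received fewer packets than expected."""
    return [
        index
        for index, (received, expected)
        in enumerate(zip(packets_received, packets_expected))
        if received < expected
    ]


# Made-up counters for three images.
print(collection_efficiency(96, 128))
print(incomplete_images([128, 120, 128], [128, 128, 128]))
```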
4.8 /entry/xfel — pulsed-source metadata
[n_images] vectors pulseID and eventCode, written for pulsed sources (e.g. SwissFEL).
4.9 Other collections
| Path | Class | Content |
|---|---|---|
| /entry/instrument/detector/detectorSpecific | NXcollection | Dectris-style detector metadata + Jungfraujoch fields: x_pixels_in_detector, y_pixels_in_detector, nimages, ntrigger, nimages_collected, nimages_written, data_collection_efficiency, max_receiver_delay, storage_cell_number, storage_cell_delay [ns], software_git_commit, software_git_date, jfjoch_release, jfjoch_writer_release, summation_mode, detect_ice_rings, gain_file_names, data_reduction_factor_serialmx, adu_histogram/, data_collection_efficiency_image |
| /entry/instrument/detector/calibration | NXcollection | per-channel pedestal / calibration images (bitshuffle-compressed) |
| /entry/instrument/fluorescence | NXcollection | XRF spectrum: energy [eV], data |
| /entry/user | NXcollection | scalar values supplied under header_appendix.hdf5 |
4.10 Non-standard fields inside the NXmx detector group
A few extension scalars are written inside the otherwise-standard /entry/instrument/detector
group for compatibility with existing tooling:
| Field | Units | Meaning |
|---|---|---|
| detector_distance | m | duplicate of distance (Dectris/Neggia compatibility) |
| detector_number | | detector identifier (Dectris convention) |
| error_value | | masked/error pixel sentinel (NXmx standard would be underload_value) |
| bit_depth_image | | stored image bit depth (NXmx standard is bit_depth_readout) |
| acquisition_type | | always triggered (Dectris convention) |
| jungfrau_conversion_applied | | JUNGFRAU photon/keV conversion applied |
| jungfrau_conversion_factor | eV | conversion factor |
| geometry_transformation_applied | | module→full-detector geometry applied |
5. Notes
- Units are written as the HDF5 units attribute on the dataset (e.g. m, eV, deg, Angstrom, Angstrom^-1, Angstrom^2, pixel, s).
- Sentinels. Missing per-image values are NaN (floats) or -1/0 (integer indices); image pixels use INTx_MIN / UINTx_MAX.
- Master vs data file. In legacy/VDS formats the analysis arrays physically live in the data files; the master file links to them (external links in legacy, virtual datasets in VDS). In the integrated format there are no data files and everything is in one place.
- CXI / CrystFEL. /entry/MX follows the CXI peak-list convention; see CXI file format.
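The pixel sentinels named in the notes above (INTx_MIN for signed and UINTx_MAX for unsigned image data) follow directly from the bit depth. A minimal sketch for computing the value to mask when reading images; the function name is hypothetical:

```python
def image_fill_value(bit_depth: int, signed: bool) -> int:
    """Sentinel for masked / no-data pixels: INTx_MIN or UINTx_MAX."""
    if bit_depth not in (8, 16, 32, 64):
        raise ValueError("bit_depth must be 8, 16, 32 or 64")
    if signed:
        return -(1 << (bit_depth - 1))  # e.g. INT16_MIN = -32768
    return (1 << bit_depth) - 1         # e.g. UINT16_MAX = 65535


for bits in (16, 32):
    print(bits, image_fill_value(bits, True), image_fill_value(bits, False))
```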