HDF5 / NeXus data format
Jungfraujoch stores images and on-the-fly analysis results in HDF5 files that aim to be NXmx-compliant. On top of the NXmx application definition, Jungfraujoch records a substantial amount of derived metadata (spot finding, indexing, integration, azimuthal integration, per-image statistics, timing). These extra entries do not exist in NXmx and are documented here so that the layout is unambiguous and reusable.
This page documents the file layout and the data fields. The operational behaviour of the writer (running, republishing, file finalisation, CBF/TIFF output) is described in jfjoch_writer. The wire format that feeds the writer is described in CBOR messages; fields below frequently correspond one-to-one to CBOR message fields, and that document is a useful companion for their meaning.
1. Motivation: derived metadata and FAIR data
The goal of Jungfraujoch is not only to store high-throughput datasets efficiently, but to keep them findable, accessible, interoperable and reusable (FAIR). Jungfraujoch is used for both rotation macromolecular crystallography (single- and multi-crystal, including fine-sliced and helical scans) and serial crystallography (stills, grid scans); the same concerns apply to both:
- Findability. Raw diffraction images carry almost no descriptive metadata about content. Quantities such as background level, number of diffraction spots, or indexing outcome let a user judge the quality and relevance of a dataset before inspecting the raw images.
- Accessibility at scale. A single experiment can span tens to hundreds of terabytes. Standard retrieval (e.g. HTTP) makes a dataset available but not inspectable — users would otherwise have to download a large fraction of the data just to decide whether it is useful. Compact derived representations make discovery, assessment and reuse feasible.
Because Jungfraujoch couples acquisition with real-time analysis used to steer experiments, transparency and reproducibility of that analysis matter. As a minimum the writer therefore preserves spot-finding and indexing results together with the filters that were applied, and it can retain an unbiased, down-sampled reference set of unfiltered images for validation and reuse.
Two complementary layouts: per-image spots vs. a reflection table
Jungfraujoch stores analysis products in two shapes, matching how each is accessed.
Per-image spot finding / indexing. Spot finding and indexing are inherently image-centric — the natural query is "give me the spots for image n" — and this holds for serial stills and for rotation frames alike. For these products Jungfraujoch adopts a layout similar to the Coherent X-ray Imaging (CXI) data bank (Maia, 2012) and the convention understood by CrystFEL: spot properties (position, intensity, Miller index, …) are stored in fixed-size two-dimensional arrays indexed by image number, with each image allocated room for up to a predefined maximum number of spots. These dense arrays are addressed with ordinary HDF5 hyperslab reads, so the spots of a single image are retrieved without traversing variable-length structures. The cost is some storage overhead for unused slots (padded with sentinels), which is acceptable for the access pattern.
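The access pattern described above can be sketched in Python. This is a minimal illustration, not the reader shipped with Jungfraujoch: the helper name spots_for_image is hypothetical, the dataset names are those listed in §4.1, and the NaN padding in the toy data is an assumption. The function accepts any mapping of dataset name to array-like, so a plain dict of nested lists works, and an opened h5py group for /entry/MX is assumed to behave the same way.

```python
def spots_for_image(mx, n):
    """Return the (x, y, intensity) spots stored for image n.

    `mx` maps dataset names to array-likes indexed by image number,
    e.g. a dict of nested lists (an h5py group is assumed to work too).
    """
    count = int(mx["nPeaks"][n])           # number of valid slots for image n
    xs = mx["peakXPosRaw"][n][:count]      # row n only, padded slots dropped
    ys = mx["peakYPosRaw"][n][:count]
    intensities = mx["peakTotalIntensity"][n][:count]
    return list(zip(xs, ys, intensities))


# Toy data: 2 images, room for 3 spots each; padding shown as NaN (assumption).
nan = float("nan")
toy = {
    "nPeaks": [2, 0],
    "peakXPosRaw": [[10.5, 200.0, nan], [nan, nan, nan]],
    "peakYPosRaw": [[20.5, 300.0, nan], [nan, nan, nan]],
    "peakTotalIntensity": [[150.0, 80.0, nan], [nan, nan, nan]],
}
print(spots_for_image(toy, 0))
```

Only row n of each array is touched, which is the point of the fixed-size layout: no other image has to be read or decoded.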
Integrated reflections. Integrated intensities are naturally a dataset-wide table, which is exactly the model of the NeXus NXreflections base class. This fits rotation crystallography well, and Jungfraujoch uses NXreflections for its integration results (see §4.2 below). We deliberately do not force spot finding/indexing into a single experiment-wide table: across the hundreds of thousands of patterns typical of serial — or fine-sliced rotation — experiments, that would require aggregating the whole experiment before the spots of one image can be read. We encourage the community to develop standardised NeXus application definitions for image-centric crystallography products that combine NeXus interoperability with the access patterns and scale of modern high-throughput experiments.
2. File layout
A run is written as one master file plus, depending on the format, one or more data files:
<prefix>_master.h5 # NXmx master file (metadata + links / virtual datasets)
<prefix>_data_000001.h5 # data file: images + per-image analysis
<prefix>_data_000002.h5
...
The master file is produced by writer/HDF5NXmx.cpp; data files by writer/HDF5DataFile.cpp and
its plugins (writer/HDF5DataFilePlugin*.cpp). Files are written to a temporary *.<random>.tmp
name and renamed on successful close.
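As a small sketch of the naming scheme, the helper below builds the expected file names for a run. The function name is hypothetical, and the 1-based, six-digit zero-padded counter is inferred from the listing above rather than taken from the writer source.

```python
def run_file_names(prefix, n_data_files):
    """Return (master_name, [data_names]) for a run with the given prefix."""
    master = f"{prefix}_master.h5"
    # Data files are numbered from 1 with six digits (inferred from the listing).
    data = [f"{prefix}_data_{i:06d}.h5" for i in range(1, n_data_files + 1)]
    return master, data


master, data = run_file_names("lyso", 2)
print(master)
print(data)
```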
Three master-file variants exist (set via file_format):
| Format | Value | Master ↔ data linking |
|---|---|---|
| NXmxLegacy (default) | 1 | One external link in /entry/data per data file (data_000001, …). HDF5 1.8 compatible — works with Neggia/Durin XDS plugins and Albula 4.0. |
| NXmxVDS | 2 | A single virtual dataset /entry/data/data spans all data files; spot finding, azimuthal integration and reflections are linked the same way. Requires HDF5 1.10 / Albula 4.1+. |
| NXmxIntegrated | 3 | No separate data files — images and all metadata live in one file. Equivalent in content to the VDS format. |
In legacy/VDS mode, image-indexed analysis arrays live in the data files and are exposed in the master file through external links or virtual datasets; in integrated mode they are written directly into the single file. Throughout this document a "✓ in master" column marks entries that are visible (directly or via link/VDS) from the master file.
Images are stored chunked (one image per chunk) and compressed with bitshuffle + LZ4 or
bitshuffle + Zstd; signed integer image datasets use INTx_MIN as the HDF5 fill value (the
"masked / no-data" sentinel), unsigned use UINTx_MAX.
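The sentinel values follow directly from the integer width. A minimal sketch, with a hypothetical helper name and an assumed set of supported widths:

```python
def fill_value(bits, signed):
    """Masked / no-data sentinel for an image dataset of the given integer type."""
    if bits not in (8, 16, 32, 64):        # assumed set of widths
        raise ValueError(f"unsupported bit depth: {bits}")
    if signed:
        return -(1 << (bits - 1))          # INTx_MIN
    return (1 << bits) - 1                 # UINTx_MAX


def is_masked(pixel, bits, signed):
    """True if a pixel carries the no-data sentinel."""
    return pixel == fill_value(bits, signed)


print(fill_value(16, True), fill_value(32, False))
```

A reader should test for the sentinel before summing or averaging pixels, since the sentinel is a legal integer and would otherwise bias the result.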
Reprocessing output: <prefix>_process.h5
The offline reprocessing tool jfjoch_process (tools/jfjoch_process.cpp) re-runs the
full analysis pipeline (spot finding, indexing, refinement, integration, scaling) on an existing
dataset and writes its results to a master file named <prefix>_process.h5. This file uses the
integrated format, but instead of copying the images its /entry/data/data is a virtual
dataset that links back to the original image files (hdf5_source_data →
NXmx::LinkToData_ProcessingVDS). The result is a compact, self-describing companion file that
holds all the derived analysis (everything in §4) plus a virtual view
of the raw images — without duplicating terabytes of data.
This is a particularly FAIR-friendly artefact: it can be shared or archived alongside (or instead
of) the raw data to convey what is in a dataset and how it was processed, while the /entry/data/data
VDS still resolves to the original images when they are available. jfjoch_process can also process
an equally-spaced subset of images (start/end/stride), producing a down-sampled reference set.
3. NXmx-standard content
The entries below are part of, or valid base classes for, the
NXmx application definition.
"NXmx" = listed in the application definition; "base" = a valid field of the relevant NeXus base
class (NXdetector, NXsample, NXsource) but not in the NXmx required/recommended subset.
/entry (NXentry)
| Field | Std | Notes |
|---|---|---|
| definition | NXmx | value "NXmx" |
| start_time | NXmx | arming time |
| end_time, end_time_estimated | NXmx | approximate end time |
File-level HDF5 attributes file_name, file_time, HDF5_Version are also set.
/entry/source (NXsource), /entry/instrument (NXinstrument)
| Field | Std | Units |
|---|---|---|
| source/name, source/type | NXmx / base | |
| source/current | base | A |
| instrument/name | NXmx | |
/entry/instrument/beam (NXbeam)
| Field | Std | Units |
|---|---|---|
| incident_wavelength | NXmx | angstrom |
| incident_wavelength_spread | NXmx | angstrom (only if polychromatic) |
| total_flux | NXmx | Hz |
/entry/instrument/attenuator (NXattenuator)
| Field | Std |
|---|---|
| attenuator_transmission | NXmx |
/entry/instrument/detector (NXdetector)
| Field | Std | Units |
|---|---|---|
| depends_on | NXmx | → transformations/rot3 |
| beam_center_x, beam_center_y | NXmx | pixel |
| distance | NXmx | m |
| count_time, frame_time | NXmx | s |
| sensor_thickness | NXmx | m |
| sensor_material | NXmx | |
| description | NXmx | |
| threshold_energy | NXmx | eV (EIGER; written only for a single channel) |
| x_pixel_size, y_pixel_size | base | m |
| serial_number | base | |
| bit_depth_readout | NXmx | |
| saturation_value | NXmx | |
| flatfield_applied | NXmx | |
| pixel_mask, pixel_mask_applied | NXmx | pixel_mask is [y, x], hard-linked from detectorSpecific/pixel_mask |
| countrate_correction_applied | NXmx | |
| number_of_cycles | base | frame-summation factor |
/entry/instrument/detector/transformations (NXtransformations)
The NXtransformations mechanism (the depends_on chain, transformation_type, vector,
offset attributes) is standard. The axis names follow the PyFAI PONI convention chosen by
Jungfraujoch (see DETECTOR_GEOMETRY):
| Axis | Type | Units | Depends on |
|---|---|---|---|
| translation | translation | m | . |
| rot1 | rotation | rad | translation |
| rot2 | rotation | rad | rot1 |
| rot3 | rotation | rad | rot2 |
/entry/instrument/detector/module (NXdetector_module)
data_origin, data_size, fast_pixel_direction, slow_pixel_direction, module_offset — all
NXmx (fast/slow_pixel_direction and module_offset carry transformation attributes).
/entry/sample (NXsample)
| Field | Std | Units / notes |
|---|---|---|
| name | NXmx | |
| depends_on | NXmx | points at the last goniometer / grid-scan axis, or . for stills |
| temperature | NXmx | K |
| transformations/ (NXtransformations) | NXmx | rotation axis (e.g. omega) or grid-scan translation; hard-linked as /entry/sample/goniometer |
| unit_cell | base | [a, b, c, α, β, γ] |
| ub_matrix | base | [1, 3, 3], Angstrom⁻¹ |
For a rotation scan the goniometer axis is written as a per-image angle array <axis> plus
<axis>_end, scalar <axis>_range_average, <axis>_range_total, and for helical scans
<axis>_helical_x/_y/_z. These extra goniometer datasets beyond the bare axis array are Jungfraujoch
conveniences.
/entry/data (NXdata)
data (3-D image stack, [n_images, y, x]) with image_nr_low / image_nr_high attributes.
In legacy mode this group instead contains one external link data_000001, … per data file.
4. Extensions beyond NXmx
Everything in this section is outside the NXmx standard. Each group is declared with
NX_class = NXcollection (the NeXus-sanctioned container for non-standardised content) unless noted.
The per-image arrays are indexed by image number, padded to the run length and filled with a
sentinel (NaN for floats, -1/0 for integer indices) where a quantity is absent.
4.1 /entry/MX — spot finding and indexing (CXI-style)
The flagship extension. Spot ("peak") properties are stored as fixed-size [n_images, max_spots]
arrays (CXI layout, recognised by CrystFEL); scalar-per-image quantities as [n_images] vectors.
In legacy/VDS mode these live in the data files and are linked/virtual-stacked into the master.
Per-spot arrays [n_images, max_spots]:
| Dataset | Units | Meaning | Indexing only |
|---|---|---|---|
| peakXPosRaw, peakYPosRaw | pixel | spot position (raw detector frame) | |
| peakTotalIntensity | photons | spot intensity | |
| peakIceRingRes | | spot lies in an ice-ring resolution band | |
| peakH, peakK, peakL | | Miller indices of the (indexed) spot | ✓ |
| peakDistEwaldSphere | Å⁻¹ | distance of the spot from the Ewald sphere | ✓ |
| peakIndexed | | spot fits the indexing solution | ✓ |
| peakLattice | | lattice the spot belongs to (-1 = unindexed) | ✓ |
Per-image vectors [n_images]:
| Dataset | Units | Meaning |
|---|---|---|
| nPeaks | | number of spots stored for the image (CXI) |
| strongPixels | | strong-pixel count (first spot-finding stage) |
| peakCountUnfiltered | | spots found before filtering |
| peakCountLowRes | | low-resolution spots |
| peakCountIceRingRes | | spots inside ice-ring bands |
| peakCountIndexed | | spots fitting the indexing solution |
| imageIndexed | | image was indexed (0/1) |
| indexingLatticeCount | | number of lattices found for the image |
| niggliClass | | Niggli class of the indexed Bravais lattice (see International Tables for Crystallography A (2016), Vol. A, Table 3.1.3.1) |
| bravaisLattice | | Bravais lattice short code, e.g. aP, mC, oF, tI, hP, hR, cF |
| profileRadius | Å⁻¹ | crystal profile radius |
| mosaicity | deg | mosaicity estimate |
| bFactor | Ų | per-image B-factor estimate |
| resolutionEstimate | Å | diffraction resolution estimate |
| integratedReflections | | number of integrated reflections |
| bkgEstimate | photons | mean background in the 3–5 Å resolution band |
| beam_corr_x, beam_corr_y | pixel | beam-center correction applied during processing |
| imageScaleFactor | | on-the-fly per-image scale factor g |
| imageScaleCC | | on-the-fly scaling correlation coefficient |
| imageScaleMosaicity | deg | scaling-model mosaicity |
| imageScaleBFactor | Ų | scaling-model B-factor |
Per-image lattices: latticeIndexed [n_images, 9] (Å) — the real-space lattice (flattened
3×3); latticeIndexedExtra [n_images, max_extra_lattices, 9] (Å) — additional orientation
variants.
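A flattened lattice can be turned into conventional cell parameters with a few lines of Python. This sketch assumes the nine values are the three real-space basis vectors a, b, c stored one after another (row-major); that ordering is not stated above and should be checked against the writer before relying on it. The helper name is hypothetical. Unindexed images are assumed to carry the NaN sentinel.

```python
import math


def cell_from_lattice(flat):
    """Return (a, b, c, alpha, beta, gamma) in Å and degrees, or None if unindexed.

    Assumes `flat` holds the vectors a, b, c consecutively (row-major 3x3).
    """
    if len(flat) != 9:
        raise ValueError("expected 9 lattice values")
    if any(math.isnan(v) for v in flat):
        return None                         # NaN sentinel: no lattice for this image
    a, b, c = flat[0:3], flat[3:6], flat[6:9]

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding noise

    # alpha is the angle between b and c, beta between a and c, gamma between a and b.
    return norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b)


print(cell_from_lattice([10.0, 0.0, 0.0, 0.0, 20.0, 0.0, 0.0, 0.0, 30.0]))
```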
Run-level summaries (written into the master /entry/MX at finalisation):
| Dataset | Units | Meaning |
|---|---|---|
| indexing_algorithm | | FFBIDX / FFT (CUDA) / FFT (FFTW) |
| geom_refinement_algorithm | | e.g. beam_center |
| rotationLatticeIndexed | Å | whole-run rotation-indexing lattice ([9]) |
| rotationLatticeIndexedExtra | Å | additional whole-run lattices ([m, 9]) |
| rotationLatticeNiggliClass | | Niggli class of the run lattice |
| imageIndexedMean | | mean indexing rate over the run |
| bkgEstimateMean | photons | mean background over the run |
| indexedLatticeCount | | per-image lattice count summary (master). Note: data files use indexingLatticeCount; readers accept either. |
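The two run-level means can be recomputed from the per-image vectors as a consistency check. The sketch below assumes that images with a missing value (NaN sentinel) are excluded from the mean; whether the writer does the same is not stated here, so small differences are possible. The helper name is hypothetical.

```python
import math


def mean_skipping_nan(values):
    """Mean of the entries that are not NaN; NaN if none are valid."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


# Toy per-image vectors for a 4-image run.
image_indexed = [1, 0, 1, 1]                       # imageIndexed (0/1)
bkg_estimate = [2.0, float("nan"), 4.0, 3.0]       # bkgEstimate, photons

print(mean_skipping_nan(image_indexed))   # compare with imageIndexedMean
print(mean_skipping_nan(bkg_estimate))    # compare with bkgEstimateMean
```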
CrystFEL can read the spots directly with:
peak_list = /entry/MX
peak_list_type = cxi
4.2 /entry/reflections — integrated reflections (NXreflections)
Integrated reflections are stored per image as
/entry/reflections/image_NNNNNN groups, each declared NX_class = NXreflections. The columns map
mostly onto the standard
NXreflections base class:
| Dataset | Units | NXreflections | Meaning |
|---|---|---|---|
| h, k, l | | standard | Miller indices |
| d | Å | standard | resolution |
| int_sum | photons | standard | integrated intensity (summation) |
| int_err | photons | non-standard name | σ of the intensity (standard equivalent: int_sum_errors) |
| background_mean | photons | standard | mean background under the peak |
| predicted_x, predicted_y | pixel | name standard, units differ | predicted position. NXreflections predicted_x/_y are physical lengths; the pixel datasets are predicted_px_x/_y |
| observed_x, observed_y | pixel | name standard, units differ | observed centroid (pixels; standard pixel form is observed_px_x/_y) |
| observed_frame | | standard | image number of the reflection |
| lp | | standard | Lorentz–polarization factor (stored as 1/rlp) |
| partiality | | standard | recorded fraction of the reflection |
| delta_phi | deg | extension | XDS Δφ: offset from the centre of the current frame |
| zeta | | extension | Lorentz ζ factor (reciprocal-space geometry term) |
| image_scale_corr | | extension | per-image scale correction; I_true = image_scale_corr · int_sum |
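The relation I_true = image_scale_corr · int_sum from the table can be applied as follows. Scaling int_err by the same factor is an assumption of this sketch (simple linear error propagation with the correction treated as exact); the helper name is hypothetical.

```python
def scaled_intensity(int_sum, int_err, image_scale_corr):
    """Apply the per-image scale correction to one reflection.

    Returns (I_true, sigma_true). The sigma scaling is an assumption of
    this sketch, not a documented behaviour of Jungfraujoch.
    """
    i_true = image_scale_corr * int_sum
    sigma_true = abs(image_scale_corr) * int_err
    return i_true, sigma_true


print(scaled_intensity(100.0, 10.0, 1.5))
```

Because both values are multiplied by the same factor, the I/σ ratio of a reflection is unchanged by this correction.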
In the master file these per-image groups are exposed through /entry/reflections external links
(VDS/integrated formats).
4.3 /entry/azint — azimuthal integration
| Dataset | Shape | Units | Meaning |
|---|---|---|---|
| bin_to_q | [φ_bins, q_bins] | Å⁻¹ | q value of each bin |
| bin_to_two_theta | [φ_bins, q_bins] | deg | 2θ of each bin |
| bin_to_phi | [φ_bins, q_bins] | deg | azimuthal angle of each bin |
| image | [n_images, φ_bins, q_bins] | | per-image integrated profile (NaN for empty bins) |
| image_std | [n_images, φ_bins, q_bins] | | per-bin standard deviation |
| image_count | [n_images, φ_bins, q_bins] | | pixels contributing per bin |
| map | [y, x] | | pixel→bin mapping (master file only) |
4.4 /entry/roi — regions of interest (per-image results)
/entry/roi/<roi_name> has one sub-group per configured ROI, holding the per-image result
vectors [n_images]. These are written into the data files; in VDS mode they are exposed from
the master file through virtual datasets, and in integrated mode they are in the single file.
(In legacy mode they remain only in the data files.)
| Dataset | Meaning |
|---|---|
| max | maximum pixel value in the ROI |
| sum | sum of pixel values |
| sum_sq | sum of squared pixel values |
| npixel | number of valid pixels |
| x, y | intensity-weighted centroid |
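Storing sum, sum_sq and npixel is sufficient to recover the mean and the spread of the pixel values in an ROI without re-reading the image. A minimal sketch using the population variance, sum_sq / npixel − mean²; the helper name is hypothetical, and returning NaN for an ROI with no valid pixels is a choice of this example.

```python
import math


def roi_mean_std(total, total_sq, npixel):
    """Mean and population standard deviation of an ROI from its stored sums."""
    if npixel <= 0:
        return float("nan"), float("nan")   # no valid pixels in this ROI
    mean = total / npixel
    # Clamp at zero: rounding can push the variance slightly negative.
    variance = max(total_sq / npixel - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Pixels 1, 2, 3, 4 give sum = 10, sum_sq = 30, npixel = 4.
print(roi_mean_std(10, 30, 4))
```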
4.4.1 /entry/roi_defs — ROI definitions (master file)
The dataset-wide ROI definitions (geometry, fixed for the whole acquisition) live in the
master file under a separate /entry/roi_defs group — kept apart from /entry/roi above so
that older readers, which iterate /entry/roi, are unaffected by these entries. One sub-group
/entry/roi_defs/<roi_name> per ROI:
| Dataset | Meaning |
|---|---|
| bit_index | which bit of roi_map (below) marks this ROI |
| type | box, circle or azim |
| min_x_pxl, max_x_pxl, min_y_pxl, max_y_pxl | box bounds (type box) |
| center_x_pxl, center_y_pxl, radius_pxl | circle (type circle) |
| q_min_recipA, q_max_recipA | Q range (type azim) |
| phi_min_deg, phi_max_deg | azimuthal-angle sector (type azim, omitted for a full ring) |
/entry/roi_defs/roi_map [y, x] is a uint16 per-pixel bitmask: bit bit_index is set for
every pixel belonging to that ROI, so an ROI's footprint can be recovered exactly.
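Decoding the bitmask is a matter of testing one bit per ROI. The sketch below works on nested lists for illustration; the helper names and the toy ROI names are hypothetical, and the roi_map would in practice be read from /entry/roi_defs/roi_map.

```python
def rois_at_pixel(mask_value, bit_index_by_name):
    """Names of all ROIs whose bit is set in one roi_map value."""
    return sorted(
        name
        for name, bit in bit_index_by_name.items()
        if (mask_value >> bit) & 1
    )


def roi_footprint(roi_map, bit_index):
    """All (y, x) pixel coordinates that belong to the ROI with this bit_index."""
    return [
        (y, x)
        for y, row in enumerate(roi_map)
        for x, value in enumerate(row)
        if (value >> bit_index) & 1
    ]


# Toy 2x2 map with two overlapping ROIs (names are illustrative only).
bits = {"beam": 0, "ring": 1}
toy_map = [[0b01, 0b11],
           [0b10, 0b00]]
print(rois_at_pixel(toy_map[0][1], bits))
print(roi_footprint(toy_map, bits["ring"]))
```

Since each ROI owns one bit, overlapping ROIs are represented without ambiguity, and a uint16 map can hold up to 16 of them.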
4.5 /entry/image — per-image pixel statistics
[n_images] vectors: max_value, min_value (viable min/max, excluding error/saturated pixels),
error_pixels, saturated_pixels, pixel_sum. Surfaced in the master file under /entry/image.
4.6 /entry/profiling — per-image timing
[n_images] vectors in seconds: spotFindingTime, indexingTime, integrationTime,
refinementTime, processingTime, braggPredictionTime, preprocessingTime, compressionTime,
azIntTime, indexAnalysisTime, imageScaleTime.
4.7 /entry/detector — acquisition diagnostics (data file)
A convenience NXcollection in the data file (note: distinct from the standard
/entry/instrument/detector). In integrated format these datasets are written under
/entry/instrument/detector/detectorSpecific instead.
| Dataset | Meaning |
|---|---|
| timestamp, exptime | per-image timestamp and exposure time |
| number | image number (original number if image rejection was used) |
| det_info | JUNGFRAU debug field |
| storage_cell_image | storage-cell number |
| rcv_delay, rcv_free_send_buffers | receiver internal diagnostics |
| packets_expected, packets_received | UDP packets per image |
| data_collection_efficiency_image | received / expected packet ratio |
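The packet counters make it easy to find images affected by packet loss. A small sketch with hypothetical helper names; returning NaN when no packets were expected is a choice of this example, not a documented convention.

```python
def per_image_efficiency(packets_received, packets_expected):
    """Received / expected packet ratio for each image (NaN if none were expected)."""
    return [
        received / expected if expected > 0 else float("nan")
        for received, expected in zip(packets_received, packets_expected)
    ]


def incomplete_images(packets_received, packets_expected):
    """Indices of images for which at least one expected packet is missing."""
    return [
        i
        for i, (received, expected) in enumerate(zip(packets_received, packets_expected))
        if received < expected
    ]


received = [128, 64, 128]
expected = [128, 128, 128]
print(per_image_efficiency(received, expected))
print(incomplete_images(received, expected))
```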
4.8 /entry/xfel — pulsed-source metadata
[n_images] vectors pulseID and eventCode, written for pulsed sources (e.g. SwissFEL).
4.9 Other collections
| Path | Class | Content |
|---|---|---|
| /entry/instrument/detector/detectorSpecific | NXcollection | Dectris-style detector metadata + Jungfraujoch fields: x_pixels_in_detector, y_pixels_in_detector, nimages, ntrigger, nimages_collected, nimages_written, data_collection_efficiency, max_receiver_delay, storage_cell_number, storage_cell_delay [ns], software_git_commit, software_git_date, jfjoch_release, jfjoch_writer_release, summation_mode, detect_ice_rings, gain_file_names, data_reduction_factor_serialmx, adu_histogram/, data_collection_efficiency_image |
| /entry/instrument/detector/calibration | NXcollection | per-channel pedestal / calibration images (bitshuffle-compressed) |
| /entry/instrument/fluorescence | NXcollection | XRF spectrum: energy [eV], data |
| /entry/user | NXcollection | scalar values supplied under header_appendix.hdf5 |
4.10 Non-standard fields inside the NXmx detector group
A few extension scalars are written inside the otherwise-standard /entry/instrument/detector
group for compatibility with existing tooling:
| Field | Units | Meaning |
|---|---|---|
| detector_distance | m | duplicate of distance (Dectris/Neggia compatibility) |
| detector_number | | detector identifier (Dectris convention) |
| error_value | | masked/error pixel sentinel (NXmx standard would be underload_value) |
| bit_depth_image | | stored image bit depth (NXmx standard is bit_depth_readout) |
| acquisition_type | | always triggered (Dectris convention) |
| jungfrau_conversion_applied | | JUNGFRAU photon/keV conversion applied |
| jungfrau_conversion_factor | eV | conversion factor |
| geometry_transformation_applied | | module→full-detector geometry applied |
4.11 User-supplied metadata: header_appendix and image_appendix
Facilities frequently need to attach metadata that Jungfraujoch does not model explicitly. Two
free-form JSON fields in the /start request (broker/jfjoch_api.yaml) provide this without any
schema change; both accept any valid JSON:
| Field | Carried in | Persisted to HDF5? |
|---|---|---|
| header_appendix | the start message, under user_data.user (see CBOR) | no, except the hdf5 sub-object (below) |
| image_appendix | every image message, as user_data | no |
Both are forwarded verbatim through the ZeroMQ/CBOR stream to every downstream consumer (writer, republished analysis, viewers), so they are the recommended channel for facility- or beamline-specific provenance (proposal, operator, optics state, per-image trigger info, …) that has no dedicated API field.
Persisting selected values to HDF5. header_appendix is normally not written to the master
file. As an exception, if it contains a key hdf5 whose value is a JSON object of scalars (strings
and numbers — no arrays or nested objects), the writer stores each entry under /entry/user/<key>.
For example, a /start request containing:
{
"header_appendix": {
"proposal": "p20001",
"operator": "jdoe",
"hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
},
"image_appendix": { "trigger_source": "external" }
}
forwards the whole header_appendix as user_data.user on the start message and
{"trigger_source": "external"} as user_data on every image message, and writes three scalars
into the master file:
/entry/user/beamline = "X06SA"
/entry/user/ring_mode = "top-up"
/entry/user/attenuator_foils = 2
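The selection rule can be mimicked on the client side to preview what will end up in the master file. This is a sketch, not the writer's code: the helper name is hypothetical, and silently skipping non-scalar values and JSON booleans is an assumption, since the behaviour of the writer for such entries is not described here.

```python
def persisted_user_fields(header_appendix):
    """Map HDF5 path -> value for the scalars under header_appendix['hdf5']."""
    hdf5 = header_appendix.get("hdf5") if isinstance(header_appendix, dict) else None
    if not isinstance(hdf5, dict):
        return {}                           # no hdf5 sub-object: nothing is persisted
    fields = {}
    for key, value in hdf5.items():
        if isinstance(value, bool):
            continue                        # assumption: booleans are not treated as numbers
        if isinstance(value, (str, int, float)):
            fields[f"/entry/user/{key}"] = value
        # arrays and nested objects are skipped in this sketch (assumption)
    return fields


appendix = {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": {"beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2},
}
for path, value in persisted_user_fields(appendix).items():
    print(path, "=", repr(value))
```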
5. Notes
- Units are written as the HDF5 units attribute on the dataset (e.g. m, eV, deg, Angstrom, Angstrom^-1, Angstrom^2, pixel, s).
- Sentinels. Missing per-image values are NaN (floats) or -1/0 (integer indices); image pixels use INTx_MIN / UINTx_MAX.
- Master vs data file. In legacy/VDS formats the analysis arrays physically live in the data files; the master file links to them (external links in legacy, virtual datasets in VDS). In the integrated format there are no data files and everything is in one place.
- CXI / CrystFEL. /entry/MX follows the CXI peak-list convention; see CXI file format.