HDF5 / NeXus data format

Jungfraujoch stores images and on-the-fly analysis results in HDF5 files that aim to be NXmx-compliant. On top of the NXmx application definition, Jungfraujoch records a substantial amount of derived metadata (spot finding, indexing, integration, azimuthal integration, per-image statistics, timing). These extra entries do not exist in NXmx and are documented here so that the layout is unambiguous and reusable.

This page documents the file layout and the data fields. The operational behaviour of the writer (running, republishing, file finalisation, CBF/TIFF output) is described in jfjoch_writer. The wire format that feeds the writer is described in CBOR messages; fields below frequently correspond one-to-one to CBOR message fields, and that document is a useful companion for their meaning.

1. Motivation: derived metadata and FAIR data

The goal of Jungfraujoch is not only to store high-throughput datasets efficiently, but to keep them findable, accessible, interoperable and reusable (FAIR). Jungfraujoch is used for both rotation macromolecular crystallography (single- and multi-crystal, including fine-sliced and helical scans) and serial crystallography (stills, grid scans); the same concerns apply to both:

  • Findability. Raw diffraction images carry almost no descriptive metadata about content. Quantities such as background level, number of diffraction spots, or indexing outcome let a user judge the quality and relevance of a dataset before inspecting the raw images.

  • Accessibility at scale. A single experiment can span tens to hundreds of terabytes. Standard retrieval (e.g. HTTP) makes a dataset available but not inspectable — users would otherwise have to download a large fraction of the data just to decide whether it is useful. Compact derived representations make discovery, assessment and reuse feasible.

Because Jungfraujoch couples acquisition with real-time analysis used to steer experiments, transparency and reproducibility of that analysis matter. As a minimum the writer therefore preserves spot-finding and indexing results together with the filters that were applied, and it can retain an unbiased, down-sampled reference set of unfiltered images for validation and reuse.

Two complementary layouts: per-image spots vs. a reflection table

Jungfraujoch stores analysis products in two shapes, matching how each is accessed.

Per-image spot finding / indexing. Spot finding and indexing are inherently image-centric — the natural query is “give me the spots for image n” — and this holds for serial stills and for rotation frames alike. For these products Jungfraujoch adopts a layout similar to the Coherent X-ray Imaging (CXI) data bank (Maia, 2012) and the convention understood by CrystFEL: spot properties (position, intensity, Miller index, …) are stored in fixed-size two-dimensional arrays indexed by image number, with each image allocated room for up to a predefined maximum number of spots. These dense arrays are addressed with ordinary HDF5 hyperslab reads, so the spots of a single image are retrieved without traversing variable-length structures. The cost is some storage overhead for unused slots (padded with sentinels), which is acceptable for the access pattern.

Integrated reflections. Integrated intensities are naturally a dataset-wide table, which is exactly the model of the NeXus NXreflections base class. This fits rotation crystallography well, and Jungfraujoch uses NXreflections for its integration results (see §4.2 below). We deliberately do not force spot finding/indexing into a single experiment-wide table: across the hundreds of thousands of patterns typical of serial — or fine-sliced rotation — experiments, that would require aggregating the whole experiment before the spots of one image can be read. We encourage the community to develop standardised NeXus application definitions for image-centric crystallography products that combine NeXus interoperability with the access patterns and scale of modern high-throughput experiments.

2. File layout

A run is written as one master file plus, depending on the format, one or more data files:

<prefix>_master.h5             # NXmx master file (metadata + links / virtual datasets)
<prefix>_data_000001.h5        # data file: images + per-image analysis
<prefix>_data_000002.h5
...

The master file is produced by writer/HDF5NXmx.cpp; data files by writer/HDF5DataFile.cpp and its plugins (writer/HDF5DataFilePlugin*.cpp). Files are written to a temporary *.<random>.tmp name and renamed on successful close.

Three master-file variants exist (set via file_format):

| Format | Value | Master ↔ data linking |
|---|---|---|
| NXmxLegacy (default) | 1 | One external link in /entry/data per data file (data_000001, …). HDF5 1.8 compatible; works with Neggia/Durin XDS plugins and Albula 4.0. |
| NXmxVDS | 2 | A single virtual dataset /entry/data/data spans all data files; spot finding, azimuthal integration and reflections are linked the same way. Requires HDF5 1.10 / Albula 4.1+. |
| NXmxIntegrated | 3 | No separate data files: images and all metadata live in one file. Equivalent in content to the VDS format. |

In legacy/VDS mode, image-indexed analysis arrays live in the data files and are exposed in the master file through external links or virtual datasets; in integrated mode they are written directly into the single file. Throughout this document a “✓ in master” column marks entries that are visible (directly or via link/VDS) from the master file.

Images are stored chunked (one image per chunk) and compressed with bitshuffle + LZ4 or bitshuffle + Zstd; signed integer image datasets use INTx_MIN as the HDF5 fill value (the “masked / no-data” sentinel), unsigned use UINTx_MAX.
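As a minimal sketch of how a reader might honour these fill values, the helper below derives the sentinel from the image dtype and builds a validity mask. The names pixel_sentinel and valid_pixels are illustrative and not part of Jungfraujoch, and the image is synthetic.

```python
import numpy as np

def pixel_sentinel(dtype):
    """Return the 'masked / no-data' value for an integer image dtype."""
    info = np.iinfo(dtype)
    # Signed images use INTx_MIN, unsigned images use UINTx_MAX (see above).
    if np.issubdtype(dtype, np.signedinteger):
        return info.min
    return info.max

def valid_pixels(image):
    """Boolean mask that is True where a pixel holds real data."""
    return image != pixel_sentinel(image.dtype)

# Synthetic 2x2 int16 image with one masked pixel.
img = np.array([[5, np.iinfo(np.int16).min], [0, 7]], dtype=np.int16)
n_valid = int(valid_pixels(img).sum())
```

Deriving the sentinel from the dtype avoids hard-coding one constant per bit depth.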

Reprocessing output: <prefix>_process.h5

The offline reprocessing tool jfjoch_process (tools/jfjoch_process.cpp) re-runs the full analysis pipeline (spot finding, indexing, refinement, integration, scaling) on an existing dataset and writes its results to a master file named <prefix>_process.h5. This file uses the integrated format, but instead of copying the images its /entry/data/data is a virtual dataset that links back to the original image files (hdf5_source_dataNXmx::LinkToData_ProcessingVDS). The result is a compact, self-describing companion file that holds all the derived analysis (everything in §4) plus a virtual view of the raw images — without duplicating terabytes of data.

This is a particularly FAIR-friendly artefact: it can be shared or archived alongside (or instead of) the raw data to convey what is in a dataset and how it was processed, while the /entry/data/data VDS still resolves to the original images when they are available. jfjoch_process can also process an equally-spaced subset of images (start/end/stride), producing a down-sampled reference set.

3. NXmx-standard content

The entries below are part of, or valid base classes for, the NXmx application definition. “NXmx” = listed in the application definition; “base” = a valid field of the relevant NeXus base class (NXdetector, NXsample, NXsource) but not in the NXmx required/recommended subset.

/entry (NXentry)

| Field | Std | Notes |
|---|---|---|
| definition | NXmx | value "NXmx" |
| start_time | NXmx | arming time |
| end_time, end_time_estimated | NXmx | approximate end time |

File-level HDF5 attributes file_name, file_time, HDF5_Version are also set.

/entry/source (NXsource), /entry/instrument (NXinstrument)

| Field | Std | Units |
|---|---|---|
| source/name, source/type | NXmx / base | |
| source/current | base | A |
| instrument/name | NXmx | |

/entry/instrument/beam (NXbeam)

| Field | Std | Units |
|---|---|---|
| incident_wavelength | NXmx | angstrom |
| incident_wavelength_spread | NXmx | angstrom (only if polychromatic) |
| total_flux | NXmx | Hz |

/entry/instrument/attenuator (NXattenuator)

| Field | Std |
|---|---|
| attenuator_transmission | NXmx |

/entry/instrument/detector (NXdetector)

| Field | Std | Units / notes |
|---|---|---|
| depends_on | NXmx | transformations/rot3 |
| beam_center_x, beam_center_y | NXmx | pixel |
| distance | NXmx | m |
| count_time, frame_time | NXmx | s |
| sensor_thickness | NXmx | m |
| sensor_material | NXmx | |
| description | NXmx | |
| threshold_energy | NXmx | eV (EIGER; written only for a single channel) |
| x_pixel_size, y_pixel_size | base | m |
| serial_number | base | |
| bit_depth_readout | NXmx | |
| saturation_value | NXmx | |
| flatfield_applied | NXmx | |
| pixel_mask, pixel_mask_applied | NXmx | pixel_mask is [y, x], hard-linked from detectorSpecific/pixel_mask |
| countrate_correction_applied | NXmx | |
| number_of_cycles | base | frame-summation factor |

/entry/instrument/detector/transformations (NXtransformations)

The NXtransformations mechanism (the depends_on chain, transformation_type, vector, offset attributes) is standard. The axis names follow the PyFAI PONI convention chosen by Jungfraujoch (see DETECTOR_GEOMETRY):

| Axis | Type | Units | Depends on |
|---|---|---|---|
| translation | translation | m | . |
| rot1 | rotation | rad | translation |
| rot2 | rotation | rad | rot1 |
| rot3 | rotation | rad | rot2 |
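The depends_on chain can be followed programmatically from the detector's entry point (transformations/rot3) down to the root marker ".". The sketch below uses a dictionary copied by hand from the axes listed above; in a real file the same information would be read from the depends_on attribute of each axis dataset, which typically holds a full HDF5 path. The function name dependency_chain is hypothetical.

```python
# Axis -> depends_on, copied from the axes listed above (illustration only).
DEPENDS_ON = {
    "translation": ".",
    "rot1": "translation",
    "rot2": "rot1",
    "rot3": "rot2",
}

def dependency_chain(start, depends_on):
    """Return the axes from `start` down to the root, in application order."""
    chain = []
    axis = start
    while axis != ".":          # "." terminates a NeXus depends_on chain
        if axis in chain:
            raise ValueError(f"circular depends_on chain at {axis!r}")
        chain.append(axis)
        axis = depends_on[axis]
    return chain

chain = dependency_chain("rot3", DEPENDS_ON)
```

The cycle check is a defensive addition for malformed files; a well-formed chain always ends at ".".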

/entry/instrument/detector/module (NXdetector_module)

data_origin, data_size, fast_pixel_direction, slow_pixel_direction, module_offset — all NXmx (fast/slow_pixel_direction and module_offset carry transformation attributes).

/entry/sample (NXsample)

| Field | Std | Units / notes |
|---|---|---|
| name | NXmx | |
| depends_on | NXmx | points at the last goniometer / grid-scan axis, or . for stills |
| temperature | NXmx | K |
| transformations/ (NXtransformations) | NXmx | rotation axis (e.g. omega) or grid-scan translation; hard-linked as /entry/sample/goniometer |
| unit_cell | base | [a, b, c, α, β, γ] |
| ub_matrix | base | [1, 3, 3], Angstrom⁻¹ |

For a rotation scan, the goniometer axis is written as a per-image angle array <axis>, accompanied by <axis>_end, the scalars <axis>_range_average and <axis>_range_total, and, for helical scans, <axis>_helical_x/_y/_z. The datasets beyond the bare axis array are Jungfraujoch conveniences.

/entry/data (NXdata)

data (3-D image stack, [n_images, y, x]) with image_nr_low / image_nr_high attributes. In legacy mode this group instead contains one external link data_000001, … per data file.

4. Extensions beyond NXmx

Everything in this section is outside the NXmx standard. Each group is declared with NX_class = NXcollection (the NeXus-sanctioned container for non-standardised content) unless noted. The per-image arrays are indexed by image number, padded to the run length and filled with a sentinel (NaN for floats, -1/0 for integer indices) where a quantity is absent.

4.1 /entry/MX — spot finding and indexing (CXI-style)

The flagship extension. Spot (“peak”) properties are stored as fixed-size [n_images, max_spots] arrays (CXI layout, recognised by CrystFEL); scalar-per-image quantities as [n_images] vectors. In legacy/VDS mode these live in the data files and are linked/virtual-stacked into the master.

Per-spot arrays [n_images, max_spots]:

| Dataset | Units | Meaning | Indexing only |
|---|---|---|---|
| peakXPosRaw, peakYPosRaw | pixel | spot position (raw detector frame) | |
| peakTotalIntensity | photons | spot intensity | |
| peakIceRingRes | | spot lies in an ice-ring resolution band | |
| peakH, peakK, peakL | | Miller indices of the (indexed) spot | |
| peakDistEwaldSphere | Å⁻¹ | distance of the spot from the Ewald sphere | |
| peakIndexed | | spot fits the indexing solution | |
| peakLattice | | lattice the spot belongs to (-1 = unindexed) | |

Per-image vectors [n_images]:

| Dataset | Units | Meaning |
|---|---|---|
| nPeaks | | number of spots stored for the image (CXI) |
| strongPixels | | strong-pixel count (first spot-finding stage) |
| peakCountUnfiltered | | spots found before filtering |
| peakCountLowRes | | low-resolution spots |
| peakCountIceRingRes | | spots inside ice-ring bands |
| peakCountIndexed | | spots fitting the indexing solution |
| imageIndexed | | image was indexed (0/1) |
| indexingLatticeCount | | number of lattices found for the image |
| niggliClass | | Niggli class of the indexed Bravais lattice (see International Tables for Crystallography (2016), Vol. A, Table 3.1.3.1) |
| bravaisLattice | | Bravais lattice short code, e.g. aP, mC, oF, tI, hP, hR, cF |
| profileRadius | Å⁻¹ | crystal profile radius |
| mosaicity | deg | mosaicity estimate |
| bFactor | Ų | per-image B-factor estimate |
| resolutionEstimate | Å | diffraction resolution estimate |
| integratedReflections | | number of integrated reflections |
| bkgEstimate | photons | mean background in the 3–5 Å resolution band |
| beam_corr_x, beam_corr_y | pixel | beam-center correction applied during processing |
| imageScaleFactor | | on-the-fly per-image scale factor g |
| imageScaleCC | | on-the-fly scaling correlation coefficient |
| imageScaleMosaicity | deg | scaling-model mosaicity |
| imageScaleBFactor | Ų | scaling-model B-factor |

Per-image lattices: latticeIndexed [n_images, 9] (Å) — the real-space lattice (flattened 3×3); latticeIndexedExtra [n_images, max_extra_lattices, 9] (Å) — additional orientation variants.
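A reader often wants conventional cell parameters rather than the flattened lattice. The sketch below converts one latticeIndexed row into (a, b, c, α, β, γ). It assumes row-major storage with one lattice vector per row, which this page does not state; verify the ordering against the writer source before relying on it. The function name cell_from_lattice is hypothetical.

```python
import numpy as np

def cell_from_lattice(flat9):
    """Convert a flattened 3x3 real-space lattice (Å) to (a, b, c, α, β, γ)."""
    m = np.asarray(flat9, dtype=float).reshape(3, 3)
    if not np.all(np.isfinite(m)):
        return None  # NaN sentinel: no lattice stored for this image
    # ASSUMPTION: row-major storage, one lattice vector per row (a, b, c).
    a, b, c = m

    def angle_deg(u, v):
        cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return float(np.degrees(np.arccos(np.clip(cos_uv, -1.0, 1.0))))

    lengths = [float(np.linalg.norm(v)) for v in (a, b, c)]
    # α is the angle between b and c, β between a and c, γ between a and b.
    return (*lengths, angle_deg(b, c), angle_deg(a, c), angle_deg(a, b))

# Synthetic orthorhombic lattice, not taken from a real dataset.
cell = cell_from_lattice([10, 0, 0, 0, 20, 0, 0, 0, 30])
```

If the vectors turn out to be stored as columns, transposing m before unpacking is the only change needed.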

Run-level summaries (written into the master /entry/MX at finalisation):

| Dataset | Units | Meaning |
|---|---|---|
| indexing_algorithm | | FFBIDX / FFT (CUDA) / FFT (FFTW) |
| geom_refinement_algorithm | | e.g. beam_center |
| rotationLatticeIndexed | Å | whole-run rotation-indexing lattice ([9]) |
| rotationLatticeIndexedExtra | Å | additional whole-run lattices ([m, 9]) |
| rotationLatticeNiggliClass | | Niggli class of the run lattice |
| imageIndexedMean | | mean indexing rate over the run |
| bkgEstimateMean | photons | mean background over the run |
| indexedLatticeCount | | per-image lattice count summary (master). Note: data files use indexingLatticeCount; readers accept either. |

CrystFEL can read the spots directly with:

peak_list = /entry/MX
peak_list_type = cxi
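Outside CrystFEL, the same arrays can be read with any HDF5 binding. The following sketch uses h5py to fetch the spots of one image with row slices, which h5py turns into hyperslab reads. The in-memory file, its values and the NaN padding are synthetic stand-ins, and spots_for_image is a hypothetical helper, not a Jungfraujoch API.

```python
import h5py
import numpy as np

def spots_for_image(mx, n):
    """Return (x, y, intensity) arrays for image n from a CXI-style group."""
    k = int(mx["nPeaks"][n])  # number of filled slots; the rest is padding
    if k == 0:
        empty = np.empty(0, dtype=np.float32)
        return empty, empty, empty
    # Each indexing expression is one hyperslab read: row n, columns 0..k-1.
    return (mx["peakXPosRaw"][n, :k],
            mx["peakYPosRaw"][n, :k],
            mx["peakTotalIntensity"][n, :k])

# Synthetic in-memory file (never written to disk) standing in for real data.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    mx = f.create_group("entry/MX")
    pad = np.nan  # padding value chosen for this demo only
    mx["nPeaks"] = np.array([2, 0], dtype=np.uint32)
    mx["peakXPosRaw"] = np.array([[10.5, 20.25, pad], [pad] * 3], dtype=np.float32)
    mx["peakYPosRaw"] = np.array([[1.5, 2.5, pad], [pad] * 3], dtype=np.float32)
    mx["peakTotalIntensity"] = np.array([[100.0, 50.0, pad], [pad] * 3],
                                        dtype=np.float32)
    x0, y0, i0 = spots_for_image(mx, 0)
    x1, y1, i1 = spots_for_image(mx, 1)
```

Reading nPeaks first keeps the padded slots out of the result, so callers never have to filter sentinels themselves.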

4.2 /entry/reflections — integrated reflections (NXreflections)

Integrated reflections are stored per image as /entry/reflections/image_NNNNNN groups, each declared NX_class = NXreflections. The columns map mostly onto the standard NXreflections base class:

| Dataset | Units | NXreflections | Meaning |
|---|---|---|---|
| h, k, l | | standard | Miller indices |
| d | Å | standard | resolution |
| int_sum | photons | standard | integrated intensity (summation) |
| int_err | photons | non-standard name | σ of the intensity (standard equivalent: int_sum_errors) |
| background_mean | photons | standard | mean background under the peak |
| predicted_x, predicted_y | pixel | name standard, units differ | predicted position. NXreflections predicted_x/_y are physical lengths; the pixel datasets are predicted_px_x/_y |
| observed_x, observed_y | pixel | name standard, units differ | observed centroid (pixels; standard pixel form is observed_px_x/_y) |
| observed_frame | | standard | image number of the reflection |
| lp | | standard | Lorentz–polarization factor (stored as 1/rlp) |
| partiality | | standard | recorded fraction of the reflection |
| delta_phi | deg | extension | XDS Δφ: offset from the centre of the current frame |
| zeta | | extension | Lorentz ζ factor (reciprocal-space geometry term) |
| image_scale_corr | | extension | per-image scale correction; I_true = image_scale_corr · int_sum |

In the master file these per-image groups are exposed through /entry/reflections external links (VDS/integrated formats).
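The relation I_true = image_scale_corr · int_sum can be applied per group as sketched below. The helper names are hypothetical, the group name assumes a six-digit zero-padded image number (suggested by image_NNNNNN, but not confirmed here, and neither is the starting index), and scaling int_err by the same factor is a simple error-propagation assumption that treats the correction as exact. A plain dictionary stands in for an h5py group; both support the same indexing.

```python
import numpy as np

def reflection_group_name(image_number):
    """Build 'image_NNNNNN' (ASSUMPTION: six digits, zero padded)."""
    return f"image_{image_number:06d}"

def scaled_intensities(group):
    """Return (I_true, sigma_true) for one per-image reflection group."""
    # [...] reads the whole dataset for h5py and is a no-op for NumPy arrays.
    corr = np.asarray(group["image_scale_corr"][...], dtype=float)
    i_true = corr * np.asarray(group["int_sum"][...], dtype=float)
    # ASSUMPTION: the scale correction carries no uncertainty of its own.
    sigma = np.abs(corr) * np.asarray(group["int_err"][...], dtype=float)
    return i_true, sigma

# Synthetic stand-in for /entry/reflections/<group>.
demo_group = {
    "int_sum": np.array([100.0, 50.0]),
    "int_err": np.array([10.0, 5.0]),
    "image_scale_corr": np.array([2.0, 2.0]),
}
i_true, sigma_true = scaled_intensities(demo_group)
```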

4.3 /entry/azint — azimuthal integration

| Dataset | Shape | Units | Meaning |
|---|---|---|---|
| bin_to_q | [φ_bins, q_bins] | Å⁻¹ | q value of each bin |
| bin_to_two_theta | [φ_bins, q_bins] | deg | 2θ of each bin |
| bin_to_phi | [φ_bins, q_bins] | deg | azimuthal angle of each bin |
| image | [n_images, φ_bins, q_bins] | | per-image integrated profile (NaN for empty bins) |
| image_std | [n_images, φ_bins, q_bins] | | per-bin standard deviation |
| image_count | [n_images, φ_bins, q_bins] | | pixels contributing per bin |
| map | [y, x] | | pixel→bin mapping (master file only) |
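To collapse the azimuthal sectors of one image into a single radial profile, the bins can be combined with image_count as weights. The sketch below assumes that image holds the mean intensity per bin, which this page does not state explicitly; if it holds sums instead, the weighting must be dropped. The function name radial_profile is hypothetical and the arrays are synthetic.

```python
import numpy as np

def radial_profile(image, image_count):
    """Count-weighted mean over the φ axis of one [φ_bins, q_bins] slice."""
    image = np.asarray(image, dtype=float)
    count = np.asarray(image_count, dtype=float)
    filled = np.isfinite(image) & (count > 0)      # skip NaN / empty bins
    weighted = np.where(filled, image * count, 0.0).sum(axis=0)
    total = np.where(filled, count, 0.0).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # q bins with no contributing pixels stay NaN, matching the file.
        return np.where(total > 0, weighted / total, np.nan)

# Two φ sectors, two q bins; the second q bin is empty in both sectors.
demo_image = np.array([[1.0, np.nan], [3.0, np.nan]])
demo_count = np.array([[1, 0], [3, 0]])
profile = radial_profile(demo_image, demo_count)
```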

4.4 /entry/roi — regions of interest (per-image results)

/entry/roi/<roi_name> has one sub-group per configured ROI, holding the per-image result vectors [n_images]. These are written into the data files; in VDS mode they are exposed from the master file through virtual datasets, and in integrated mode they are in the single file. (In legacy mode they remain only in the data files.)

| Dataset | Meaning |
|---|---|
| max | maximum pixel value in the ROI |
| sum | sum of pixel values |
| sum_sq | sum of squared pixel values |
| npixel | number of valid pixels |
| x, y | intensity-weighted centroid |
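Storing sum, sum_sq and npixel is enough to recover the mean and variance of the pixel values in an ROI without touching the images: mean = sum / npixel and variance = sum_sq / npixel − mean². A minimal sketch follows; roi_mean_variance is a hypothetical helper and the numbers are synthetic (image 0 corresponds to pixel values 1, 2, 3, 6).

```python
import numpy as np

def roi_mean_variance(roi_sum, roi_sum_sq, npixel):
    """Per-image mean and population variance from the stored ROI vectors."""
    s = np.asarray(roi_sum, dtype=float)
    sq = np.asarray(roi_sum_sq, dtype=float)
    n = np.asarray(npixel, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = np.where(n > 0, s / n, np.nan)
        var = np.where(n > 0, sq / n - mean ** 2, np.nan)
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, np.maximum(var, 0.0)

# Image 0: pixels 1, 2, 3, 6 -> sum 12, sum_sq 50. Image 1: no valid pixels.
mean, var = roi_mean_variance([12, 0], [50, 0], [4, 0])
```

This is the population variance; multiply by npixel / (npixel − 1) if the sample variance is wanted.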

4.4.1 /entry/roi_defs — ROI definitions (master file)

The dataset-wide ROI definitions (geometry, fixed for the whole acquisition) live in the master file under a separate /entry/roi_defs group — kept apart from /entry/roi above so that older readers, which iterate /entry/roi, are unaffected by these entries. One sub-group /entry/roi_defs/<roi_name> per ROI:

| Dataset | Meaning |
|---|---|
| bit_index | which bit of roi_map (below) marks this ROI |
| type | box, circle or azim |
| min_x_pxl, max_x_pxl, min_y_pxl, max_y_pxl | box bounds (type box) |
| center_x_pxl, center_y_pxl, radius_pxl | circle (type circle) |
| q_min_recipA, q_max_recipA | Q range (type azim) |
| phi_min_deg, phi_max_deg | azimuthal-angle sector (type azim, omitted for a full ring) |

/entry/roi_defs/roi_map [y, x] is a uint16 per-pixel bitmask: bit bit_index is set for every pixel belonging to that ROI, so an ROI’s footprint can be recovered exactly.
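Recovering an ROI footprint from the bitmask is a shift-and-mask operation. The sketch below uses a synthetic 2×2 roi_map; roi_footprint is a hypothetical helper name.

```python
import numpy as np

def roi_footprint(roi_map, bit_index):
    """Boolean [y, x] mask of the pixels that belong to the given ROI."""
    bits = np.asarray(roi_map, dtype=np.uint16)
    # Move the ROI's bit to position 0, then test it.
    return ((bits >> np.uint16(bit_index)) & np.uint16(1)) == 1

# Synthetic map: bit 0 marks the top row, bit 1 marks the anti-diagonal.
demo_map = np.array([[0b01, 0b11],
                     [0b10, 0b00]], dtype=np.uint16)
roi0 = roi_footprint(demo_map, 0)
roi1 = roi_footprint(demo_map, 1)
```

Because each ROI owns one bit, overlapping ROIs are represented without ambiguity; the top-right pixel above belongs to both.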

4.5 /entry/image — per-image pixel statistics

[n_images] vectors: max_value, min_value (viable min/max, excluding error/saturated pixels), error_pixels, saturated_pixels, pixel_sum. Surfaced in the master file under /entry/image.

4.6 /entry/profiling — per-image timing

[n_images] vectors in seconds: spotFindingTime, indexingTime, integrationTime, refinementTime, processingTime, braggPredictionTime, preprocessingTime, compressionTime, azIntTime, indexAnalysisTime, imageScaleTime.

4.7 /entry/detector — acquisition diagnostics (data file)

A convenience NXcollection in the data file (note: distinct from the standard /entry/instrument/detector). In integrated format these datasets are written under /entry/instrument/detector/detectorSpecific instead.

| Dataset | Meaning |
|---|---|
| timestamp, exptime | per-image timestamp and exposure time |
| number | image number (original number if image rejection was used) |
| det_info | JUNGFRAU debug field |
| storage_cell_image | storage-cell number |
| rcv_delay, rcv_free_send_buffers | receiver internal diagnostics |
| packets_expected, packets_received | UDP packets per image |
| data_collection_efficiency_image | received / expected packet ratio |

4.8 /entry/xfel — pulsed-source metadata

[n_images] vectors pulseID and eventCode, written for pulsed sources (e.g. SwissFEL).

4.9 Other collections

| Path | Class | Content |
|---|---|---|
| /entry/instrument/detector/detectorSpecific | NXcollection | Dectris-style detector metadata + Jungfraujoch fields: x_pixels_in_detector, y_pixels_in_detector, nimages, ntrigger, nimages_collected, nimages_written, data_collection_efficiency, max_receiver_delay, storage_cell_number, storage_cell_delay [ns], software_git_commit, software_git_date, jfjoch_release, jfjoch_writer_release, summation_mode, detect_ice_rings, gain_file_names, data_reduction_factor_serialmx, adu_histogram/, data_collection_efficiency_image |
| /entry/instrument/detector/calibration | NXcollection | per-channel pedestal / calibration images (bitshuffle-compressed) |
| /entry/instrument/fluorescence | NXcollection | XRF spectrum: energy [eV], data |
| /entry/user | NXcollection | scalar values supplied under header_appendix.hdf5 |

4.10 Non-standard fields inside the NXmx detector group

A few extension scalars are written inside the otherwise-standard /entry/instrument/detector group for compatibility with existing tooling:

| Field | Units | Meaning |
|---|---|---|
| detector_distance | m | duplicate of distance (Dectris/Neggia compatibility) |
| detector_number | | detector identifier (Dectris convention) |
| error_value | | masked/error pixel sentinel (NXmx standard would be underload_value) |
| bit_depth_image | | stored image bit depth (NXmx standard is bit_depth_readout) |
| acquisition_type | | always triggered (Dectris convention) |
| jungfrau_conversion_applied | | JUNGFRAU photon/keV conversion applied |
| jungfrau_conversion_factor | eV | conversion factor |
| geometry_transformation_applied | | module→full-detector geometry applied |

4.11 User-supplied metadata: header_appendix and image_appendix

Facilities frequently need to attach metadata that Jungfraujoch does not model explicitly. Two free-form JSON fields in the /start request (broker/jfjoch_api.yaml) provide this without any schema change; both accept any valid JSON:

| Field | Carried in | Persisted to HDF5? |
|---|---|---|
| header_appendix | the start message, under user_data.user (see CBOR) | no, except the hdf5 sub-object (below) |
| image_appendix | every image message, as user_data | no |

Both are forwarded verbatim through the ZeroMQ/CBOR stream to every downstream consumer (writer, republished analysis, viewers), so they are the recommended channel for facility- or beamline-specific provenance (proposal, operator, optics state, per-image trigger info, …) that has no dedicated API field.

Persisting selected values to HDF5. header_appendix is normally not written to the master file. As an exception, if it contains a key hdf5 whose value is a JSON object of scalars (strings and numbers — no arrays or nested objects), the writer stores each entry under /entry/user/<key>.

For example, a /start request containing:

{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}

forwards the whole header_appendix as user_data.user on the start message and {"trigger_source": "external"} as user_data on every image message, and writes three scalars into the master file:

/entry/user/beamline          = "X06SA"
/entry/user/ring_mode         = "top-up"
/entry/user/attenuator_foils  = 2
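A client can check before sending /start which keys of the hdf5 sub-object qualify as scalars and which HDF5 paths to expect. The sketch below is a client-side pre-flight check only: plan_user_entries is a hypothetical helper, and how the writer itself treats non-scalar, boolean or null values is not specified on this page, so they are merely reported here.

```python
import json

def plan_user_entries(header_appendix):
    """Return ({hdf5_path: value}, [keys that are not plain scalars])."""
    hdf5 = header_appendix.get("hdf5", {})
    if not isinstance(hdf5, dict):
        return {}, ["hdf5"]
    accepted, rejected = {}, []
    for key, value in hdf5.items():
        # Scalars named in the text: strings and numbers.
        # ASSUMPTION: booleans and null are reported as unsupported.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if isinstance(value, str) or is_number:
            accepted[f"/entry/user/{key}"] = value
        else:
            rejected.append(key)
    return accepted, rejected

request = json.loads("""
{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}
""")
accepted, rejected = plan_user_entries(request["header_appendix"])
```

For the request shown earlier, accepted maps the three /entry/user paths to their values and rejected is empty.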

5. Notes

  • Units are written as the HDF5 units attribute on the dataset (e.g. m, eV, deg, Angstrom, Angstrom^-1, Angstrom^2, pixel, s).

  • Sentinels. Missing per-image values are NaN (floats) or -1/0 (integer indices); image pixels use INTx_MIN / UINTx_MAX.

  • Master vs data file. In legacy/VDS formats the analysis arrays physically live in the data files; the master file links to them (external links in legacy, virtual datasets in VDS). In the integrated format there are no data files and everything is in one place.

  • CXI / CrystFEL. /entry/MX follows the CXI peak-list convention; see CXI file format.
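Since units travel as an HDF5 attribute, a reader can check them before interpreting a value. The h5py sketch below builds a small in-memory file for illustration; the dataset values are synthetic and units_of is a hypothetical helper.

```python
import h5py
import numpy as np

def units_of(dataset):
    """Return the 'units' attribute as str, or None if the dataset has none."""
    units = dataset.attrs.get("units")
    if units is None:
        return None
    # Fixed-length HDF5 strings arrive as bytes, variable-length ones as str.
    return units.decode() if isinstance(units, bytes) else str(units)

# Synthetic in-memory file (never written to disk).
with h5py.File("units_demo.h5", "w", driver="core", backing_store=False) as f:
    dist = f.create_dataset("entry/instrument/detector/distance", data=0.075)
    dist.attrs["units"] = "m"
    mask = f.create_dataset("entry/instrument/detector/pixel_mask",
                            data=np.zeros((2, 2), dtype=np.uint32))
    distance_units = units_of(dist)
    mask_units = units_of(mask)
```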