# HDF5 / NeXus data format

Jungfraujoch stores images and on-the-fly analysis results in HDF5 files that aim to be [NXmx](https://manual.nexusformat.org/classes/applications/NXmx.html)-compliant. On top of the NXmx application definition, Jungfraujoch records a substantial amount of *derived* metadata (spot finding, indexing, integration, azimuthal integration, per-image statistics, timing). These extra entries do not exist in NXmx and are documented here so that the layout is unambiguous and reusable.

This page documents the **file layout and the data fields**. The operational behaviour of the writer (running, republishing, file finalisation, CBF/TIFF output) is described in [jfjoch_writer](JFJOCH_WRITER.md). The wire format that feeds the writer is described in [CBOR messages](CBOR.md); fields below frequently correspond one-to-one to CBOR message fields, and that document is a useful companion for their meaning.

## 1. Motivation: derived metadata and FAIR data

The goal of Jungfraujoch is not only to store high-throughput datasets efficiently, but to keep them findable, accessible, interoperable and reusable (FAIR). Jungfraujoch is used for both **rotation** macromolecular crystallography (single- and multi-crystal, including fine-sliced and helical scans) and **serial** crystallography (stills, grid scans); the same concerns apply to both:

* **Findability.** Raw diffraction images carry almost no descriptive metadata about *content*. Quantities such as background level, number of diffraction spots, or indexing outcome let a user judge the quality and relevance of a dataset *before* inspecting the raw images.
* **Accessibility at scale.** A single experiment can span tens to hundreds of terabytes. Standard retrieval (e.g. HTTP) makes a dataset *available* but not *inspectable*; users would otherwise have to download a large fraction of the data just to decide whether it is useful. Compact derived representations make discovery, assessment and reuse feasible.

Because Jungfraujoch couples acquisition with real-time analysis used to *steer* experiments, transparency and reproducibility of that analysis matter. As a minimum the writer therefore preserves spot-finding and indexing results together with the filters that were applied, and it can retain an unbiased, down-sampled reference set of unfiltered images for validation and reuse.

### Two complementary layouts: per-image spots vs. a reflection table

Jungfraujoch stores analysis products in two shapes, matching how each is accessed.

**Per-image spot finding / indexing.** Spot finding and indexing are inherently *image-centric* (the natural query is "give me the spots for image *n*"), and this holds for serial stills and for rotation frames alike. For these products Jungfraujoch adopts a layout similar to the [Coherent X-ray Imaging (CXI) data bank](https://www.cxidb.org) (Maia, 2012) and the convention understood by [CrystFEL](https://www.desy.de/~twhite/crystfel/): spot properties (position, intensity, Miller index, …) are stored in fixed-size two-dimensional arrays indexed by image number, with each image allocated room for up to a predefined maximum number of spots. These dense arrays are addressed with ordinary HDF5 hyperslab reads, so the spots of a single image are retrieved without traversing variable-length structures. The cost is some storage overhead for unused slots (padded with sentinels), which is acceptable for the access pattern.

**Integrated reflections.** Integrated intensities are naturally a *dataset-wide* table, which is exactly the model of the NeXus [NXreflections](https://manual.nexusformat.org/classes/base_classes/NXreflections.html) base class. This fits rotation crystallography well, and Jungfraujoch uses NXreflections for its integration results (see §4.2 below).

We deliberately do *not* force spot finding/indexing into a single experiment-wide table: across the hundreds of thousands of patterns typical of serial (or fine-sliced rotation) experiments, that would require aggregating the whole experiment before the spots of one image can be read. We encourage the community to develop standardised NeXus application definitions for image-centric crystallography products that combine NeXus interoperability with the access patterns and scale of modern high-throughput experiments.

## 2. File layout

A run is written as one **master file** plus, depending on the format, one or more **data files**:

```
_master.h5            # NXmx master file (metadata + links / virtual datasets)
_data_000001.h5       # data file: images + per-image analysis
_data_000002.h5
...
```

The master file is produced by `writer/HDF5NXmx.cpp`; data files by `writer/HDF5DataFile.cpp` and its plugins (`writer/HDF5DataFilePlugin*.cpp`). Files are written to a temporary `*..tmp` name and renamed on successful close.

Three master-file variants exist (set via `file_format`):

| Format | Value | Master ↔ data linking |
|--------|:-----:|------------------------|
| **NXmxLegacy** (default) | 1 | One external link in `/entry/data` per data file (`data_000001`, …). HDF5 1.8 compatible; works with Neggia/Durin XDS plugins and Albula 4.0. |
| **NXmxVDS** | 2 | A single virtual dataset `/entry/data/data` spans all data files; spot finding, azimuthal integration and reflections are linked the same way. Requires HDF5 1.10 / Albula 4.1+. |
| **NXmxIntegrated** | 3 | No separate data files: images and all metadata live in one file. Equivalent in content to the VDS format. |

In legacy/VDS mode, image-indexed analysis arrays live in the **data files** and are exposed in the master file through external links or virtual datasets; in integrated mode they are written directly into the single file.
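
As a rough illustration of the linking schemes in the table above, the sketch below picks the image dataset paths from the member names of `/entry/data`. The function name `image_dataset_paths` is hypothetical, and the assumption that legacy link names are `data_` followed by six digits is inferred from the `data_000001` example rather than taken from the writer's code.

```python
import re

# Assumption: legacy link names look like "data_000001" (six digits),
# as suggested by the example in the format table.
LEGACY_LINK = re.compile(r"^data_\d{6}$")


def image_dataset_paths(entry_data_members):
    """Return HDF5 paths holding image stacks, given the member names of /entry/data.

    VDS and integrated masters expose a single /entry/data/data dataset;
    legacy masters expose one external link per data file instead.
    """
    members = set(entry_data_members)
    if "data" in members:
        return ["/entry/data/data"]
    links = sorted(name for name in members if LEGACY_LINK.match(name))
    return [f"/entry/data/{name}" for name in links]


# Hypothetical listings of /entry/data for the two cases.
vds_paths = image_dataset_paths(["data"])
legacy_paths = image_dataset_paths(["data_000002", "data_000001"])
print(vds_paths)
print(legacy_paths)
```

With h5py, the member names could be obtained from `list(f["/entry/data"].keys())` on an open master file; the selection logic itself does not depend on any HDF5 library.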

Throughout this document a "✓ in master" column marks entries that are visible (directly or via link/VDS) from the master file.

Images are stored chunked (one image per chunk) and compressed with bitshuffle + LZ4 or bitshuffle + Zstd; signed integer image datasets use `INTx_MIN` as the HDF5 fill value (the "masked / no-data" sentinel), unsigned use `UINTx_MAX`.

### Reprocessing output: `_process.h5`

The offline reprocessing tool [`jfjoch_process`](TOOLS.md) (`tools/jfjoch_process.cpp`) re-runs the full analysis pipeline (spot finding, indexing, refinement, integration, scaling) on an existing dataset and writes its results to a master file named **`_process.h5`**. This file uses the **integrated** format, but instead of copying the images its `/entry/data/data` is a *virtual dataset that links back to the original image files* (`hdf5_source_data` → `NXmx::LinkToData_ProcessingVDS`).

The result is a compact, self-describing companion file that holds *all* the derived analysis (everything in §4) plus a virtual view of the raw images, without duplicating terabytes of data. This is a particularly FAIR-friendly artefact: it can be shared or archived alongside (or instead of) the raw data to convey what is in a dataset and how it was processed, while the `/entry/data/data` VDS still resolves to the original images when they are available. `jfjoch_process` can also process an equally-spaced *subset* of images (start/end/stride), producing a down-sampled reference set.

## 3. NXmx-standard content

The entries below are part of, or valid base classes for, the [NXmx](https://manual.nexusformat.org/classes/applications/NXmx.html) application definition. "NXmx" = listed in the application definition; "base" = a valid field of the relevant NeXus base class (`NXdetector`, `NXsample`, `NXsource`) but not in the NXmx required/recommended subset.

### `/entry` (NXentry)

| Field | Std | Notes |
|-------|:---:|-------|
| `definition` | NXmx | value `"NXmx"` |
| `start_time` | NXmx | arming time |
| `end_time`, `end_time_estimated` | NXmx | approximate end time |

File-level HDF5 attributes `file_name`, `file_time`, `HDF5_Version` are also set.

### `/entry/source` (NXsource), `/entry/instrument` (NXinstrument)

| Field | Std | Units |
|-------|:---:|-------|
| `source/name`, `source/type` | NXmx / base | |
| `source/current` | base | A |
| `instrument/name` | NXmx | |

### `/entry/instrument/beam` (NXbeam)

| Field | Std | Units |
|-------|:---:|-------|
| `incident_wavelength` | NXmx | angstrom |
| `incident_wavelength_spread` | NXmx | angstrom (only if polychromatic) |
| `total_flux` | NXmx | Hz |

### `/entry/instrument/attenuator` (NXattenuator)

| Field | Std |
|-------|:---:|
| `attenuator_transmission` | NXmx |

### `/entry/instrument/detector` (NXdetector)

| Field | Std | Units / notes |
|-------|:---:|-------|
| `depends_on` | NXmx | → `transformations/rot3` |
| `beam_center_x`, `beam_center_y` | NXmx | pixel |
| `distance` | NXmx | m |
| `count_time`, `frame_time` | NXmx | s |
| `sensor_thickness` | NXmx | m |
| `sensor_material` | NXmx | |
| `description` | NXmx | |
| `threshold_energy` | NXmx | eV (EIGER; written only for a single channel) |
| `x_pixel_size`, `y_pixel_size` | base | m |
| `serial_number` | base | |
| `bit_depth_readout` | NXmx | |
| `saturation_value` | NXmx | |
| `flatfield_applied` | NXmx | |
| `pixel_mask`, `pixel_mask_applied` | NXmx | `pixel_mask` is `[y, x]`, hard-linked from `detectorSpecific/pixel_mask` |
| `countrate_correction_applied` | NXmx | |
| `number_of_cycles` | base | frame-summation factor |

### `/entry/instrument/detector/transformations` (NXtransformations)

The NXtransformations *mechanism* (the `depends_on` chain, `transformation_type`, `vector`, `offset` attributes) is standard.
The axis **names** follow the PyFAI PONI convention chosen by Jungfraujoch (see [DETECTOR_GEOMETRY](DETECTOR_GEOMETRY.md)):

| Axis | Type | Units | Depends on |
|------|------|-------|-----------|
| `translation` | translation | m | `.` |
| `rot1` | rotation | rad | `translation` |
| `rot2` | rotation | rad | `rot1` |
| `rot3` | rotation | rad | `rot2` |

### `/entry/instrument/detector/module` (NXdetector_module)

`data_origin`, `data_size`, `fast_pixel_direction`, `slow_pixel_direction`, `module_offset`: all NXmx (`fast/slow_pixel_direction` and `module_offset` carry transformation attributes).

### `/entry/sample` (NXsample)

| Field | Std | Units / notes |
|-------|:---:|-------|
| `name` | NXmx | |
| `depends_on` | NXmx | points at the last goniometer / grid-scan axis, or `.` for stills |
| `temperature` | NXmx | K |
| `transformations/` (NXtransformations) | NXmx | rotation axis (e.g. `omega`) or grid-scan translation; hard-linked as `/entry/sample/goniometer` |
| `unit_cell` | base | `[a, b, c, α, β, γ]` |
| `ub_matrix` | base | `[1, 3, 3]`, Angstrom⁻¹ |

For a rotation scan the goniometer axis is written as a per-image angle array `` plus `_end`, scalar `_range_average`, `_range_total`, and for helical scans `_helical_x/_y/_z`. These extra goniometer datasets beyond the bare axis array are Jungfraujoch conveniences.

### `/entry/data` (NXdata)

`data` (3-D image stack, `[n_images, y, x]`) with `image_nr_low` / `image_nr_high` attributes. In legacy mode this group instead contains one external link `data_000001`, … per data file.

## 4. Extensions beyond NXmx

Everything in this section is **outside the NXmx standard**. Each group is declared with `NX_class = NXcollection` (the NeXus-sanctioned container for non-standardised content) unless noted. The per-image arrays are indexed by image number, padded to the run length and filled with a sentinel (`NaN` for floats, `-1`/`0` for integer indices) where a quantity is absent.
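
The padded, image-indexed layout can be read one image at a time. The sketch below is illustrative only: it assumes h5py and NumPy are installed, builds a tiny in-memory file using `/entry/MX` dataset names from §4.1, then reads the spots of one image and drops the padded slots using `nPeaks`. The helper name `spots_for_image`, the toy values, the integer type of `nPeaks` and the use of `NaN` as padding for unused spot slots are assumptions made for the example.

```python
import h5py
import numpy as np


def spots_for_image(h5file, n):
    """Return (x, y, intensity) arrays for image n, without padded slots."""
    mx = h5file["/entry/MX"]
    count = int(mx["nPeaks"][n])
    # Hyperslab reads: only row n and its first `count` columns are fetched.
    x = mx["peakXPosRaw"][n, :count]
    y = mx["peakYPosRaw"][n, :count]
    intensity = mx["peakTotalIntensity"][n, :count]
    return x, y, intensity


# Toy file kept in memory (nothing is written to disk): 2 images, max_spots = 3.
pad = np.nan  # assumed padding value for unused float slots
with h5py.File("toy_spots.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("/entry/MX/nPeaks", data=np.array([2, 0], dtype=np.int32))
    f.create_dataset(
        "/entry/MX/peakXPosRaw",
        data=np.array([[10.5, 200.0, pad], [pad, pad, pad]], dtype=np.float32),
    )
    f.create_dataset(
        "/entry/MX/peakYPosRaw",
        data=np.array([[20.5, 300.0, pad], [pad, pad, pad]], dtype=np.float32),
    )
    f.create_dataset(
        "/entry/MX/peakTotalIntensity",
        data=np.array([[150.0, 75.0, pad], [pad, pad, pad]], dtype=np.float32),
    )
    # Per-image float vector with NaN where the quantity is absent.
    f.create_dataset("/entry/MX/bkgEstimate", data=np.array([1.5, np.nan], dtype=np.float32))

    x, y, intensity = spots_for_image(f, 0)
    x_empty, _, _ = spots_for_image(f, 1)
    # Skip NaN sentinels when averaging a per-image vector.
    bkg_mean = float(np.nanmean(f["/entry/MX/bkgEstimate"][:]))

print(x, y, intensity, len(x_empty), bkg_mean)
```

Trimming by `nPeaks` avoids having to test each slot against a sentinel, and the same pattern should apply whether the datasets are reached directly, through external links or through virtual datasets.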

### 4.1 `/entry/MX`: spot finding and indexing (CXI-style)

The flagship extension. Spot ("peak") properties are stored as fixed-size `[n_images, max_spots]` arrays (CXI layout, recognised by CrystFEL); scalar-per-image quantities as `[n_images]` vectors. In legacy/VDS mode these live in the data files and are linked/virtual-stacked into the master.

**Per-spot arrays `[n_images, max_spots]`:**

| Dataset | Units | Meaning | Indexing only |
|---------|-------|---------|:---:|
| `peakXPosRaw`, `peakYPosRaw` | pixel | spot position (raw detector frame) | |
| `peakTotalIntensity` | photons | spot intensity | |
| `peakIceRingRes` | | spot lies in an ice-ring resolution band | |
| `peakH`, `peakK`, `peakL` | | Miller indices of the (indexed) spot | ✓ |
| `peakDistEwaldSphere` | Å⁻¹ | distance of the spot from the Ewald sphere | ✓ |
| `peakIndexed` | | spot fits the indexing solution | ✓ |
| `peakLattice` | | lattice the spot belongs to (`-1` = unindexed) | ✓ |

**Per-image vectors `[n_images]`:**

| Dataset | Units | Meaning |
|---------|-------|---------|
| `nPeaks` | | number of spots stored for the image (CXI) |
| `strongPixels` | | strong-pixel count (first spot-finding stage) |
| `peakCountUnfiltered` | | spots found before filtering |
| `peakCountLowRes` | | low-resolution spots |
| `peakCountIceRingRes` | | spots inside ice-ring bands |
| `peakCountIndexed` | | spots fitting the indexing solution |
| `imageIndexed` | | image was indexed (0/1) |
| `indexingLatticeCount` | | number of lattices found for the image |
| `niggliClass` | | Niggli class of the indexed Bravais lattice (see *International Tables for Crystallography* (2016), Vol. A, [Table 3.1.3.1](https://onlinelibrary.wiley.com/iucr/itc/Ac/ch3o1v0001/table3o1o3o1.pdf)) |
| `bravaisLattice` | | Bravais lattice short code, e.g. `aP`, `mC`, `oF`, `tI`, `hP`, `hR`, `cF` |
| `profileRadius` | Å⁻¹ | crystal profile radius |
| `mosaicity` | deg | mosaicity estimate |
| `bFactor` | Ų | per-image B-factor estimate |
| `resolutionEstimate` | Å | diffraction resolution estimate |
| `integratedReflections` | | number of integrated reflections |
| `bkgEstimate` | photons | mean background in the 3–5 Å resolution band |
| `beam_corr_x`, `beam_corr_y` | pixel | beam-center correction applied during processing |
| `imageScaleFactor` | | on-the-fly per-image scale factor *g* |
| `imageScaleCC` | | on-the-fly scaling correlation coefficient |
| `imageScaleMosaicity` | deg | scaling-model mosaicity |
| `imageScaleBFactor` | Ų | scaling-model B-factor |

**Per-image lattices:**

* `latticeIndexed` `[n_images, 9]` (Å): the real-space lattice (flattened 3×3).
* `latticeIndexedExtra` `[n_images, max_extra_lattices, 9]` (Å): additional orientation variants.

**Run-level summaries** (written into the master `/entry/MX` at finalisation):

| Dataset | Units | Meaning |
|---------|-------|---------|
| `indexing_algorithm` | | `FFBIDX` / `FFT (CUDA)` / `FFT (FFTW)` |
| `geom_refinement_algorithm` | | e.g. `beam_center` |
| `rotationLatticeIndexed` | Å | whole-run rotation-indexing lattice (`[9]`) |
| `rotationLatticeIndexedExtra` | Å | additional whole-run lattices (`[m, 9]`) |
| `rotationLatticeNiggliClass` | | Niggli class of the run lattice |
| `imageIndexedMean` | | mean indexing rate over the run |
| `bkgEstimateMean` | photons | mean background over the run |
| `indexedLatticeCount` | | per-image lattice count summary (master). *Note: data files use `indexingLatticeCount`; readers accept either.* |

CrystFEL can read the spots directly with:

```
peak_list = /entry/MX
peak_list_type = cxi
```

### 4.2 `/entry/reflections`: integrated reflections (NXreflections)

Integrated reflections are stored **per image** as `/entry/reflections/image_NNNNNN` groups, each declared `NX_class = NXreflections`.
The columns map mostly onto the standard [NXreflections](https://manual.nexusformat.org/classes/base_classes/NXreflections.html) base class:

| Dataset | Units | NXreflections | Meaning |
|---------|-------|:-------------:|---------|
| `h`, `k`, `l` | | standard | Miller indices |
| `d` | Å | standard | resolution |
| `int_sum` | photons | standard | integrated intensity (summation) |
| `int_err` | photons | non-standard name | σ of the intensity (standard equivalent: `int_sum_errors`) |
| `background_mean` | photons | standard | mean background under the peak |
| `predicted_x`, `predicted_y` | pixel | name standard, units differ | predicted position. NXreflections `predicted_x/_y` are *physical* lengths; the pixel datasets are `predicted_px_x/_y` |
| `observed_x`, `observed_y` | pixel | name standard, units differ | observed centroid (pixels; standard pixel form is `observed_px_x/_y`) |
| `observed_frame` | | standard | image number of the reflection |
| `lp` | | standard | Lorentz–polarization factor (stored as `1/rlp`) |
| `partiality` | | standard | recorded fraction of the reflection |
| `delta_phi` | deg | **extension** | XDS Δφ: offset from the centre of the current frame |
| `zeta` | | **extension** | Lorentz ζ factor (reciprocal-space geometry term) |
| `image_scale_corr` | | **extension** | per-image scale correction; `I_true = image_scale_corr · int_sum` |

In the master file these per-image groups are exposed through `/entry/reflections` external links (VDS/integrated formats).
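
The relation `I_true = image_scale_corr · int_sum` from the table can be applied per reflection as sketched below. The `Reflection` record and the `scaled_intensity` helper are hypothetical names, the numbers are made up, and scaling `int_err` by the same factor is an assumption (ordinary error propagation for a constant multiplier), not documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    h: int
    k: int
    l: int
    int_sum: float           # photons
    int_err: float           # photons
    image_scale_corr: float  # dimensionless


def scaled_intensity(refl):
    """Return (I_true, sigma_true) using I_true = image_scale_corr * int_sum."""
    i_true = refl.image_scale_corr * refl.int_sum
    # Assumption: sigma scales with the absolute value of the same factor.
    sigma_true = abs(refl.image_scale_corr) * refl.int_err
    return i_true, sigma_true


# Made-up values, for illustration only.
example = Reflection(h=1, k=2, l=3, int_sum=200.0, int_err=10.0, image_scale_corr=1.5)
i_true, sigma_true = scaled_intensity(example)
print(i_true, sigma_true)
```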

### 4.3 `/entry/azint`: azimuthal integration

| Dataset | Shape | Units | Meaning |
|---------|-------|-------|---------|
| `bin_to_q` | `[φ_bins, q_bins]` | Å⁻¹ | q value of each bin |
| `bin_to_two_theta` | `[φ_bins, q_bins]` | deg | 2θ of each bin |
| `bin_to_phi` | `[φ_bins, q_bins]` | deg | azimuthal angle of each bin |
| `image` | `[n_images, φ_bins, q_bins]` | | per-image integrated profile (NaN for empty bins) |
| `image_std` | `[n_images, φ_bins, q_bins]` | | per-bin standard deviation |
| `image_count` | `[n_images, φ_bins, q_bins]` | | pixels contributing per bin |
| `map` | `[y, x]` | | pixel→bin mapping (master file only) |

### 4.4 `/entry/roi`: regions of interest (per-image results)

`/entry/roi/` has one sub-group per configured ROI, holding the **per-image result vectors** `[n_images]`. These are written into the data files; in VDS mode they are exposed from the master file through virtual datasets, and in integrated mode they are in the single file. (In legacy mode they remain only in the data files.)

| Dataset | Meaning |
|---------|---------|
| `max` | maximum pixel value in the ROI |
| `sum` | sum of pixel values |
| `sum_sq` | sum of squared pixel values |
| `npixel` | number of valid pixels |
| `x`, `y` | intensity-weighted centroid |

### 4.4.1 `/entry/roi_defs`: ROI definitions (master file)

The **dataset-wide ROI definitions** (geometry, fixed for the whole acquisition) live in the master file under a *separate* `/entry/roi_defs` group, kept apart from `/entry/roi` above so that older readers, which iterate `/entry/roi`, are unaffected by these entries.
One sub-group `/entry/roi_defs/` per ROI:

| Dataset | Meaning |
|---------|---------|
| `bit_index` | which bit of `roi_map` (below) marks this ROI |
| `type` | `box`, `circle` or `azim` |
| `min_x_pxl`, `max_x_pxl`, `min_y_pxl`, `max_y_pxl` | box bounds (type `box`) |
| `center_x_pxl`, `center_y_pxl`, `radius_pxl` | circle (type `circle`) |
| `q_min_recipA`, `q_max_recipA` | Q range (type `azim`) |
| `phi_min_deg`, `phi_max_deg` | azimuthal-angle sector (type `azim`, omitted for a full ring) |

`/entry/roi_defs/roi_map` `[y, x]` is a `uint16` per-pixel bitmask: bit `bit_index` is set for every pixel belonging to that ROI, so an ROI's footprint can be recovered exactly.

### 4.5 `/entry/image`: per-image pixel statistics

`[n_images]` vectors: `max_value`, `min_value` (viable min/max, excluding error/saturated pixels), `error_pixels`, `saturated_pixels`, `pixel_sum`. Surfaced in the master file under `/entry/image`.

### 4.6 `/entry/profiling`: per-image timing

`[n_images]` vectors in seconds: `spotFindingTime`, `indexingTime`, `integrationTime`, `refinementTime`, `processingTime`, `braggPredictionTime`, `preprocessingTime`, `compressionTime`, `azIntTime`, `indexAnalysisTime`, `imageScaleTime`.

### 4.7 `/entry/detector`: acquisition diagnostics (data file)

A convenience NXcollection in the data file (note: distinct from the standard `/entry/instrument/detector`). In **integrated** format these datasets are written under `/entry/instrument/detector/detectorSpecific` instead.

| Dataset | Meaning |
|---------|---------|
| `timestamp`, `exptime` | per-image timestamp and exposure time |
| `number` | image number (original number if image rejection was used) |
| `det_info` | JUNGFRAU debug field |
| `storage_cell_image` | storage-cell number |
| `rcv_delay`, `rcv_free_send_buffers` | receiver internal diagnostics |
| `packets_expected`, `packets_received` | UDP packets per image |
| `data_collection_efficiency_image` | received / expected packet ratio |

### 4.8 `/entry/xfel`: pulsed-source metadata

`[n_images]` vectors `pulseID` and `eventCode`, written for pulsed sources (e.g. SwissFEL).

### 4.9 Other collections

| Path | Class | Content |
|------|-------|---------|
| `/entry/instrument/detector/detectorSpecific` | NXcollection | Dectris-style detector metadata + Jungfraujoch fields: `x_pixels_in_detector`, `y_pixels_in_detector`, `nimages`, `ntrigger`, `nimages_collected`, `nimages_written`, `data_collection_efficiency`, `max_receiver_delay`, `storage_cell_number`, `storage_cell_delay` [ns], `software_git_commit`, `software_git_date`, `jfjoch_release`, `jfjoch_writer_release`, `summation_mode`, `detect_ice_rings`, `gain_file_names`, `data_reduction_factor_serialmx`, `adu_histogram/`, `data_collection_efficiency_image` |
| `/entry/instrument/detector/calibration` | NXcollection | per-channel pedestal / calibration images (bitshuffle-compressed) |
| `/entry/instrument/fluorescence` | NXcollection | XRF spectrum: `energy` [eV], `data` |
| `/entry/user` | NXcollection | scalar values supplied under `header_appendix.hdf5` |

### 4.10 Non-standard fields inside the NXmx detector group

A few extension scalars are written *inside* the otherwise-standard `/entry/instrument/detector` group for compatibility with existing tooling:

| Field | Units | Meaning |
|-------|-------|---------|
| `detector_distance` | m | duplicate of `distance` (Dectris/Neggia compatibility) |
| `detector_number` | | detector identifier (Dectris convention) |
| `error_value` | | masked/error pixel sentinel (NXmx standard would be `underload_value`) |
| `bit_depth_image` | | stored image bit depth (NXmx standard is `bit_depth_readout`) |
| `acquisition_type` | | always `triggered` (Dectris convention) |
| `jungfrau_conversion_applied` | | JUNGFRAU photon/keV conversion applied |
| `jungfrau_conversion_factor` | eV | conversion factor |
| `geometry_transformation_applied` | | module→full-detector geometry applied |

### 4.11 User-supplied metadata: `header_appendix` and `image_appendix`

Facilities frequently need to attach metadata that Jungfraujoch does not model explicitly. Two free-form JSON fields in the `/start` request (`broker/jfjoch_api.yaml`) provide this without any schema change; both accept *any valid JSON*:

| Field | Carried in | Persisted to HDF5? |
|-------|-----------|--------------------|
| `header_appendix` | the **start** message, under `user_data.user` (see [CBOR](CBOR.md)) | no, except the `hdf5` sub-object (below) |
| `image_appendix` | **every image** message, as `user_data` | no |

Both are forwarded verbatim through the ZeroMQ/CBOR stream to every downstream consumer (writer, republished analysis, viewers), so they are the recommended channel for facility- or beamline-specific provenance (proposal, operator, optics state, per-image trigger info, …) that has no dedicated API field.

**Persisting selected values to HDF5.** `header_appendix` is normally *not* written to the master file. As an exception, if it contains a key `hdf5` whose value is a JSON object of scalars (strings and numbers; no arrays or nested objects), the writer stores each entry under `/entry/user/`.
For example, a `/start` request containing:

```json
{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}
```

forwards the whole `header_appendix` as `user_data.user` on the start message and `{"trigger_source": "external"}` as `user_data` on every image message, and writes three scalars into the master file:

```
/entry/user/beamline         = "X06SA"
/entry/user/ring_mode        = "top-up"
/entry/user/attenuator_foils = 2
```

## 5. Notes

* **Units** are written as the HDF5 `units` attribute on the dataset (e.g. `m`, `eV`, `deg`, `Angstrom`, `Angstrom^-1`, `Angstrom^2`, `pixel`, `s`).
* **Sentinels.** Missing per-image values are `NaN` (floats) or `-1`/`0` (integer indices); image pixels use `INTx_MIN` / `UINTx_MAX`.
* **Master vs data file.** In legacy/VDS formats the analysis arrays physically live in the data files; the master file links to them (external links in legacy, virtual datasets in VDS). In the integrated format there are no data files and everything is in one place.
* **CXI / CrystFEL.** `/entry/MX` follows the CXI peak-list convention; see [CXI file format](https://raw.githubusercontent.com/cxidb/CXI/master/cxi_file_format.pdf).
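
As a closing illustration of the `header_appendix` persistence rule in §4.11, the following sketch selects the entries that would end up under `/entry/user/`. It is not the writer's implementation: the function name `persistable_user_fields` is hypothetical, and silently skipping non-scalar values (and booleans) is an assumption, since the text only states that the `hdf5` object holds strings and numbers.

```python
import json


def persistable_user_fields(header_appendix):
    """Map header_appendix["hdf5"] scalars to their /entry/user/ paths."""
    hdf5 = header_appendix.get("hdf5")
    if not isinstance(hdf5, dict):
        return {}
    result = {}
    for key, value in hdf5.items():
        # bool is a subclass of int in Python; excluded here as an assumption,
        # because only strings and numbers are mentioned.
        if isinstance(value, bool):
            continue
        if isinstance(value, (str, int, float)):
            result[f"/entry/user/{key}"] = value
    return result


# The /start request from the example in section 4.11.
request = json.loads("""
{
  "header_appendix": {
    "proposal": "p20001",
    "operator": "jdoe",
    "hdf5": { "beamline": "X06SA", "ring_mode": "top-up", "attenuator_foils": 2 }
  },
  "image_appendix": { "trigger_source": "external" }
}
""")
fields = persistable_user_fields(request["header_appendix"])
for path, value in fields.items():
    print(path, "=", value)
```

Keys outside the `hdf5` sub-object (`proposal`, `operator`) are not selected, matching the statement that they travel only on the message stream.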