lib_cpp_h5_writer
This library is used to create C++-based stream writers for H5 files. It focuses on the functionality and performance needed for integrating high-performance detectors.
Key features:
- Get data from ZMQ stream (Array-1.0 protocol) - htypes specification
- Write a Nexus-compliant H5 file (user specified) - nexus format
- Specify additional ZMQ stream parameters to write to file.
- Receive additional parameters to write to file via the REST API.
- Interact with the writer over the REST API (stop, kill, get_statistics).
Quick start for using the library
To create your own stream writer you need to specify:
- The H5 file format you want to write.
- The mapping of REST input variables to your H5 format.
- Additional H5 format fields with default values or calculated fields (based on input or default values).
- The mapping between the stream header metadata and your H5 file format.
- Additional metadata that is transferred in the stream message header.
Under sf/ and csaxs/ you can see examples of this. Feel free to use any of these folders as a template.
IMPORTANT: We are using a monorepo for this project (all implementations should live in this git repository). To create a new implementation, please add a folder to the root of the project (like sf/ and csaxs/).
The minimum you need to implement your own writer is:
- Writer runner (example: csaxs/csaxs_h5_writer.cpp)
- File format (example: csaxs/CsaxsFormat.cpp)
- Build file (example: csaxs/Makefile)
Writer runner
Example: csaxs/csaxs_h5_writer.cpp
The runner is the actual executable you will run to create files. In the writer runner you:
- Specify and parse input parameters.
- Prepare your system for writing (creating folders, switching the process user, etc.).
- Instantiate the file format object.
- Define the parameters that come in the stream header.
- Start the writer (mostly boilerplate code, if you do not need any special implementations).
File format
Example: csaxs/CsaxsFormat.cpp
This is a class that extends the H5Format class. You need to specify:
- input_value_type (REST API value name to type mapping)
- default_values (Fields in the file format that have default values)
- dataset_move_mapping (Move datasets to another place in the file if needed)
- file_format (The hierarchical structure of your H5 format)
It is best to specify all the values above in the class constructor. Some values (all except file_format) can be empty, but they should not be null.
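The "empty but not null" rule can be illustrated with a minimal sketch. It assumes the mappings are held through shared pointers, as the stream header values are later in this document. FormatSketch, is_usable, and all entries below are hypothetical stand-ins and not the H5Format interface; check csaxs/CsaxsFormat.cpp for the real types.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

using StringMap = std::unordered_map<std::string, std::string>;

// Hypothetical stand-in for a format definition. Only the four member names
// come from this README; the member types are an assumption.
struct FormatSketch {
    std::shared_ptr<StringMap> input_value_type;      // REST value name -> type name
    std::shared_ptr<StringMap> default_values;        // field name -> default (as text)
    std::shared_ptr<StringMap> dataset_move_mapping;  // temporary path -> final path
    std::shared_ptr<StringMap> file_format;           // flattened tree: path -> node kind

    // Everything is set in the constructor, so no member is ever null.
    FormatSketch()
        : input_value_type(std::make_shared<StringMap>()),
          default_values(std::make_shared<StringMap>()),
          dataset_move_mapping(std::make_shared<StringMap>()),
          file_format(std::make_shared<StringMap>())
    {
        // Illustrative entries only; these names and paths are made up.
        (*input_value_type)["exposure_time"] = "float64";
        (*file_format)["/entry"] = "group";
        (*file_format)["/entry/exposure_time"] = "dataset";

        // file_format is the only member that must not be empty.
        assert(!file_format->empty());
    }
};

// True when no member is null and file_format has at least one node.
bool is_usable(const FormatSketch& format) {
    return format.input_value_type && format.default_values &&
           format.dataset_move_mapping && format.file_format &&
           !format.file_format->empty();
}
```

In this sketch default_values and dataset_move_mapping stay empty, which is fine, while resetting any of the pointers makes is_usable return false.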
The current cSAXS and SF formats are quite simple. As a reference, you can check the old cSAXS file format implementation: csaxs_cpp_h5_writer
Build file
Example: csaxs/Makefile
If you want to use Makefiles, you can basically copy one from an existing implementation (csaxs/) and rename the executable. In case you want something more sophisticated you will have to provide it yourself.
In addition, you can also deploy your writer as an Anaconda package. In this case you will need to include the conda-recipe folder as well (see csaxs/conda-recipe).
Build
You need your compiler to support C++11.
The easiest way to build the library is via Anaconda. If you are not familiar with Anaconda (and do not want to learn), you can also install all the dependencies directly in your OS.
The base library is located in lib/. Change your current directory to lib/ and:
- make (build the library for production)
- make clean (clean the previous build)
- make deploy (deploy the library to your local conda environment)
- make debug (build library with debug prints in the standard output)
- make perf (build the library with performance measurements in the standard output)
- make test (create tests)
The usual procedure would be:
- make test (build the tests)
- ./bin/execute_tests (execute the tests)
- make deploy (deploy the library)
You can then start building your executable. It is also a good idea to automate the base library build from your executable build system (for example, see the lib target in csaxs/Makefile).
Conda build
If you use conda, you can create an environment with the needed libraries by running:
conda create -c paulscherrerinstitute --name <env_name> make cppzmq==4.3.0 hdf5==1.10.4 boost==1.61.0 gtest==1.8.1
After that, you can activate your newly created environment:
conda activate <env_name>
and start linking your builds against the libraries. To do that you can use the environment variables Anaconda sets:
-L${CONDA_PREFIX}/lib (for linking libraries you have installed with Anaconda)
To run your executables inside the Anaconda environment, you will also need to export the lib/ path in your environment variables:
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib
If you decide not to use Anaconda, you will have to install the following libraries in your system:
- make
- cppzmq ==4.3.0
- hdf5 ==1.10.4
- boost ==1.61.0
Basic concepts
In this chapter we describe the basic concepts you need to get a hold of in order to use the library. In case more advanced knowledge is needed, please feel free to browse the code. The most important components are discussed in the subchapters below.
General overview
The writer has 3 processes:
- REST process (listens to incoming REST requests).
- ZMQ process (listens for incoming ZMQ stream messages).
- H5 process (writes the received data to disk).
The communication bridges between processes are:
- REST to H5 process: WriterManager (WriterManager.cpp).
- ZMQ to H5 process: WriterManager for process control (WriterManager.cpp) and RingBuffer (RingBuffer.cpp) for data transfer.
In order to have a central place for fine-tuning parameters, the config.cpp file is used.
The ZMQ process receives data from the stream, extracts it, and packs it (with additional metadata) into the ring buffer. Meanwhile, the H5 process listens for data in the ring buffer. When new data arrives, the H5 process writes it into temporary datasets (for performance reasons, the file format is written at the end).
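The handoff between the receiving side and the writing side can be pictured with a small bounded ring buffer. This is only a sketch of the general technique under simplifying assumptions (one mutex, non-blocking calls, frames copied into slots); it is not the implementation in RingBuffer.cpp, and FrameSketch and RingBufferSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical frame record: image bytes plus one value from the stream header.
struct FrameSketch {
    std::uint64_t pulse_id;
    std::vector<char> data;
};

// Fixed-capacity FIFO guarded by a mutex, so that one thread can push
// while another thread pops.
class RingBufferSketch {
public:
    explicit RingBufferSketch(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Receiving side: returns false when the buffer is full.
    bool try_push(FrameSketch frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == slots_.size()) return false;
        slots_[(head_ + count_) % slots_.size()] = std::move(frame);
        ++count_;
        return true;
    }

    // Writing side: returns false when there is nothing to write.
    bool try_pop(FrameSketch& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return false;
        out = std::move(slots_[head_]);
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return true;
    }

private:
    std::vector<FrameSketch> slots_;  // storage, reused in a circular fashion
    std::size_t head_ = 0;            // index of the oldest stored frame
    std::size_t count_ = 0;           // number of frames currently stored
    std::mutex mutex_;
};
```

In this sketch the receiver would call try_push for each message and the writer would call try_pop in a loop. A full buffer means the writer cannot keep up with the stream, so the caller has to decide whether to wait or to drop data.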
When the end of the writing is triggered (via the REST API, when the desired number of frames has been received, or when the user terminates the process), an attempt to write the file format is performed. If the format writing is successful, the temporary datasets are moved to their final place in the file format. If the format writing step fails for any reason, the data will remain in the temporary datasets and the user will need to fix the file manually (the goal is to preserve the data as much as possible).
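The closing sequence described above can be sketched as follows. The sketch assumes that a failed format write is reported through an exception and that the two steps are passed in as callbacks; both are assumptions for illustration, and finalize_file_sketch and CloseResult are hypothetical names rather than library functions.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>

enum class CloseResult {
    FormatWritten,               // format written, datasets moved to final place
    DataLeftInTemporaryDatasets  // format failed, raw data preserved as received
};

// Tries to write the file format first. The temporary datasets are moved only
// if that step succeeds, so a broken format never touches the stored data.
CloseResult finalize_file_sketch(const std::function<void()>& write_format,
                                 const std::function<void()>& move_datasets) {
    assert(write_format && move_datasets);

    try {
        write_format();
    } catch (const std::exception& error) {
        std::cerr << "Format writing failed: " << error.what()
                  << ". Data stays in the temporary datasets." << std::endl;
        return CloseResult::DataLeftInTemporaryDatasets;
    }

    move_datasets();
    return CloseResult::FormatWritten;
}
```

The ordering is the point of the sketch: the received data is already on disk before the format is attempted, so a format error costs manual repair work but not the data.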
ZmqReceiver
The stream receiver that gets your data from the stream. This is PSI specific, and currently supports only the Array-1.0 protocol.
The protocol specification can be found here: htypes specification
Stream header values
In addition to the image in the stream, the receiver can also pass data defined in the stream header to the writer, for example:
- pulse_id (The pulse id for the current image)
- source (The source of the current image)
- etc.
These fields are specific to your input stream. The only constraint is that values should be scalars (one value per message). The allowed data types for these values are:
- "uint8"
- "uint16"
- "uint32"
- "uint64"
- "int8"
- "int16"
- "int32"
- "int64"
- "float32"
- "float64"
These stream header parameters need to be specified when constructing your ZmqReceiver instance:
// Extract the "pulse_id" value from the header, and convert it into uint64 type.
auto header_values = shared_ptr<unordered_map<string, string>>(new unordered_map<string, string> {
{"pulse_id", "uint64"},
});
// Pass the header_values to the ZmqReceiver constructor.
ZmqReceiver receiver(connect_address, n_io_threads, receive_timeout, header_values);
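Since every header value must use one of the type names listed above, it can be useful to check the map before handing it to the receiver. The helpers below are a hypothetical sketch and not part of the library; whether ZmqReceiver performs a similar check itself is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Size in bytes of each type name allowed for stream header values.
const std::unordered_map<std::string, std::size_t>& header_type_sizes() {
    static const std::unordered_map<std::string, std::size_t> sizes = {
        {"uint8", 1}, {"uint16", 2}, {"uint32", 4}, {"uint64", 8},
        {"int8", 1},  {"int16", 2},  {"int32", 4},  {"int64", 8},
        {"float32", 4}, {"float64", 8},
    };
    return sizes;
}

// Size lookup for a type name that is already known to be allowed.
std::size_t header_type_size(const std::string& type_name) {
    auto found = header_type_sizes().find(type_name);
    assert(found != header_type_sizes().end());
    return found->second;
}

// Returns true if every entry (header name -> type name) uses an allowed type.
bool validate_header_values(
        const std::unordered_map<std::string, std::string>& header_values) {
    for (const auto& entry : header_values) {
        if (header_type_sizes().count(entry.second) == 0) {
            return false;
        }
    }
    return true;
}
```

With these helpers, a map containing {"pulse_id", "uint64"} passes the check, while a map that requests an unlisted type such as "string" is rejected before the receiver is constructed.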
Read the H5Writer chapter to see where this data is written in the H5 file. Knowing where the data is written is important for properly setting up the dataset_move_mapping in the file format. See the H5Format chapter for more info.