HDF5 Data Operations
- class src.hdf5_ops.HDF5DataOpsManager(file_path, mode='r+')[source]
Bases: object
A class that handles fundamental mid-level HDF5 file operations, supporting data updates, metadata revision, and data analysis on HDF5 files that encode multi-instrument experimental campaign data.
Parameters:
- file_path: str
Path to the HDF5 file, e.g., path/to/hdf5file.h5.
- mode: str
'r' (read) or 'r+' (read/write); in either case the file must already exist.
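Example (a minimal construction sketch; the file path is hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

# Open an existing HDF5 file for reading and writing.
dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')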
- append_metadata(obj_name, annotation_dict)[source]
Appends metadata attributes to the specified object (obj_name) based on the provided annotation_dict.
This method ensures that the provided metadata attributes do not overwrite any existing ones; if an attribute already exists, a ValueError is raised. Supported values are scalars (int, float, str) and compound values such as dictionaries, which are converted into NumPy structured arrays before being added to the metadata. A usage sketch follows the parameter list below.
Parameters:
- obj_name: str
Path to the target object (dataset or group) within the HDF5 file.
- annotation_dict: dict
A dictionary where the keys represent new attribute names (strings), and the values can be:
- Scalars: int, float, or str.
- Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
Example of a compound value:
annotation_dict = {
    "relative_humidity": {
        "value": 65,
        "units": "percentage",
        "range": "[0,100]",
        "definition": "amount of water vapor present …"
    }
}
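Example (a minimal usage sketch; the file name, object path, and attribute values are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')
# Add one scalar and one compound attribute to an existing group.
# A ValueError is raised if either attribute name already exists.
dm.append_metadata('/instrument_1', {
    'operator': 'jdoe',
    'relative_humidity': {'value': 65, 'units': 'percentage'}
})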
- delete_metadata(obj_name, annotation_dict)[source]
Deletes metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
Parameters:
- obj_name: str
Path to the target object (dataset or group) within the HDF5 file.
- annotation_dict: dict
Dictionary where keys represent attribute names, and values are dictionaries containing {"delete": True} to mark them for deletion.
Example:
annotation_dict = {"attr_to_be_deleted": {"delete": True}}
Behavior:
Deletes the specified attributes from the object’s metadata if marked for deletion.
Issues a warning if the attribute is not found or not marked for deletion.
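Example (a minimal sketch; file name, object path, and attribute name are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')
# Delete one attribute; attributes that are missing or not marked
# with {"delete": True} only trigger a warning.
dm.delete_metadata('/instrument_1', {'operator': {'delete': True}})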
- extract_dataset_as_dataframe(dataset_name)[source]
Returns a copy of the dataset content as a Pandas DataFrame when possible; otherwise, as a NumPy array.
- get_metadata(obj_path)[source]
Get the attributes of the object at path = obj_path. For example, obj_path = '/' retrieves the root-level attributes, i.e., the file's metadata.
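Example combining the two accessors above (file, group, and dataset names are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r')
# Root-level attributes of the file.
root_attrs = dm.get_metadata('/')
# Dataset content as a DataFrame when possible, otherwise a NumPy array.
data = dm.extract_dataset_as_dataframe('/instrument_1/temperature')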
- reformat_datetime_column(dataset_name, column_name, src_format, desired_format='%Y-%m-%d %H:%M:%S.%f')[source]
Reformats the values of the datetime column column_name in dataset dataset_name from src_format to desired_format.
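Example (a hedged call sketch; the dataset name, column name, and source format are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')
# Rewrite day-first timestamps into the default '%Y-%m-%d %H:%M:%S.%f' format.
dm.reformat_datetime_column('/instrument_1/temperature', 'timestamp',
                            src_format='%d-%m-%Y %H:%M:%S')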
- rename_metadata(obj_name, renaming_map)[source]
Renames metadata attributes of the specified object (obj_name) based on the provided renaming_map.
Parameters:
- obj_name: str
Path to the target object (dataset or group) within the HDF5 file.
- renaming_map: dict
A dictionary where keys are current attribute names (strings), and values are the new attribute names (strings or byte strings) to rename to.
Example:
renaming_map = {
    "old_attr_name": "new_attr_name",
    "old_attr_2": "new_attr_2"
}
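Example (a minimal sketch; file, object, and attribute names are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')
# Rename two existing attributes in one call.
dm.rename_metadata('/instrument_1', {
    'rh': 'relative_humidity',
    'temp_units': 'temperature_units'
})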
- update_metadata(obj_name, annotation_dict)[source]
Updates the value of existing metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
The function disregards non-existing attributes and suggests using the append_metadata() method to add those to the metadata instead. See the usage sketch below the parameter list.
Parameters:
- obj_name: str
Path to the target object (dataset or group) within the HDF5 file.
- annotation_dict: dict
A dictionary where the keys represent existing attribute names (strings), and the values can be:
- Scalars: int, float, or str.
- Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
Example of a compound value:
annotation_dict = {
    "relative_humidity": {
        "value": 65,
        "units": "percentage",
        "range": "[0,100]",
        "definition": "amount of water vapor present …"
    }
}
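Example (a minimal sketch; file, object, and attribute names are hypothetical):
from src.hdf5_ops import HDF5DataOpsManager

dm = HDF5DataOpsManager('data/campaign.h5', mode='r+')
# Overwrite the values of attributes that already exist; names not found
# are skipped with a pointer to append_metadata().
dm.update_metadata('/instrument_1', {
    'operator': 'asmith',
    'relative_humidity': {'value': 70, 'units': 'percentage'}
})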
- src.hdf5_ops.read_mtable_as_dataframe(filename)[source]
Reconstruct a MATLAB Table encoded in a .h5 file as a Pandas DataFrame.
This function reads a .h5 file containing a MATLAB Table and reconstructs it as a Pandas DataFrame. The input .h5 file contains one group per row of the MATLAB Table. Each group stores the table’s dataset-like variables as Datasets, while categorical and numerical variables are represented as attributes of the respective group.
To ensure homogeneity of data columns, the DataFrame is constructed column-wise.
Parameters
- filename: str
The name of the .h5 file. This may include the file’s location and path information.
Returns
- pd.DataFrame
The MATLAB Table reconstructed as a Pandas DataFrame.
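Example (a minimal sketch; the file name is hypothetical):
from src.hdf5_ops import read_mtable_as_dataframe

# Reconstruct a MATLAB Table stored in an .h5 file as a DataFrame.
df = read_mtable_as_dataframe('data/matlab_table.h5')
print(df.head())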
- src.hdf5_ops.serialize_metadata(input_filename_path, folder_depth: int = 4, output_format: str = 'yaml') → str[source]
Serialize metadata from an HDF5 file into YAML or JSON format.
Parameters
- input_filename_path: str
The path to the input HDF5 file.
- folder_depth: int, optional
The folder depth to control how much of the HDF5 file hierarchy is traversed (default is 4).
- output_format: str, optional
The format to serialize the output, either ‘yaml’ or ‘json’ (default is ‘yaml’).
Returns
- str
The output file path where the serialized metadata is stored (either .yaml or .json).
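Example (a minimal sketch; the input file name is hypothetical):
from src.hdf5_ops import serialize_metadata

# Serialize metadata down to 4 hierarchy levels; the return value is the
# path of the generated .yaml file.
output_path = serialize_metadata('data/campaign.h5', folder_depth=4,
                                 output_format='yaml')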
HDF5 Writer
- src.hdf5_writer.create_hdf5_file_from_dataframe(ofilename, input_data, group_by_funcs: list, approach: str = None, extract_attrs_func=None)[source]
Creates an HDF5 file with hierarchical groups based on the specified grouping functions or columns.
Parameters:
- ofilename (str): Path for the output HDF5 file.
- input_data (pd.DataFrame or str): Input data as a DataFrame or a valid file system path.
- group_by_funcs (list): List of callables or column names to define hierarchical grouping.
- approach (str): Specifies the approach ('top-down' or 'bottom-up') for creating the HDF5 file.
- extract_attrs_func (callable, optional): Function to extract additional attributes for HDF5 groups.
Returns:
None
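Example (a hedged sketch under the assumption that column names may be passed in group_by_funcs; the DataFrame and its columns are hypothetical):
import pandas as pd
from src.hdf5_writer import create_hdf5_file_from_dataframe

df = pd.DataFrame({
    'campaign': ['c1', 'c1'],
    'instrument': ['i1', 'i2'],
    'value': [0.1, 0.2],
})
# Group rows by campaign, then by instrument, building groups top-down.
create_hdf5_file_from_dataframe('output.h5', df,
                                group_by_funcs=['campaign', 'instrument'],
                                approach='top-down')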
- src.hdf5_writer.create_hdf5_file_from_filesystem_path(path_to_input_directory: str, path_to_filenames_dict: dict = None, select_dir_keywords: list = [], root_metadata_dict: dict = {}, mode='w')[source]
Creates an .h5 file with name "output_filename" that preserves the directory tree (or folder structure) of a given filesystem path.
The data integration capabilities are limited by our file reader, which can only access data from a list of admissible file formats. These, however, can be extended. Directories are groups in the resulting HDF5 file. Files are formatted as composite objects consisting of a group, file, and attributes.
Parameters
- output_filename: str
Name of the output HDF5 file.
- path_to_input_directory: str
Path to root directory, specified with forward slashes, e.g., path/to/root.
- path_to_filenames_dict: dict, optional
A pre-processed dictionary where keys are directory paths on the input directory's tree and values are lists of files. If provided, path_to_input_directory is ignored.
- select_dir_keywords: list
List of strings used to select only directory paths that contain a word in select_dir_keywords. When empty, all directory paths are included in the HDF5 file group hierarchy.
- root_metadata_dict: dict
Metadata to include at the root level of the HDF5 file.
- mode: str
'w' creates the file, truncating it if it exists; 'r+' opens it for read/write, in which case the file must exist. By default, mode = 'w'.
Returns
- output_filename: str
Path to the created HDF5 file.
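Example (a minimal sketch; the directory, keyword, and root metadata are hypothetical):
from src.hdf5_writer import create_hdf5_file_from_filesystem_path

# Mirror a directory tree into an HDF5 file, keeping only directory paths
# that contain the keyword 'instrument'.
out_file = create_hdf5_file_from_filesystem_path(
    'path/to/root',
    select_dir_keywords=['instrument'],
    root_metadata_dict={'project': 'multi-instrument campaign'},
    mode='w')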
- src.hdf5_writer.save_processed_dataframe_to_hdf5(df, annotator, output_filename)[source]
Save processed dataframe columns with annotations to an HDF5 file.
Parameters:
- df (pd.DataFrame): DataFrame containing processed time series.
- annotator: Annotator object with a get_metadata method.
- output_filename (str): Path to the source HDF5 file.
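Example (a hedged sketch; the annotator below is a hypothetical stub that only illustrates the expected get_metadata interface):
import pandas as pd
from src.hdf5_writer import save_processed_dataframe_to_hdf5

class StubAnnotator:
    # Hypothetical stand-in for the project's annotator object.
    def get_metadata(self):
        return {'processing_step': 'smoothing'}

df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=3, freq='h'),
    'temperature': [20.1, 20.3, 20.2],
})
save_processed_dataframe_to_hdf5(df, StubAnnotator(), 'data/campaign.h5')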