diff --git a/.gitignore b/.gitignore
index 254d4ae..0108897 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,9 @@
-*.pyc
-__pycache__/
-*.h5
-tmp_files/
-*.ipynb
-logs/
-envs/
-hidden.py
+*.pyc
+__pycache__/
+*.h5
+tmp_files/
+*.ipynb
+logs/
+envs/
+hidden.py
output_files/
\ No newline at end of file
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index c38d206..83bb132 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -1,12 +1,12 @@
-pages:
- stage: deploy
- script:
- - echo "Deploying pre-built HTML..."
- - cp -r docs/build/html public # Copy the pre-built HTML to the public directory
- artifacts:
- paths:
- - public
- only:
- changes:
- - docs/source/** # Run only if files in docs/source/ change
+pages:
+ stage: deploy
+ script:
+ - echo "Deploying pre-built HTML..."
+ - cp -r docs/build/html public # Copy the pre-built HTML to the public directory
+ artifacts:
+ paths:
+ - public
+ only:
+ changes:
+ - docs/source/** # Run only if files in docs/source/ change
- docs/Makefile
\ No newline at end of file
diff --git a/README.md b/README.md
index 62f171c..5632774 100644
--- a/README.md
+++ b/README.md
@@ -1,267 +1,267 @@
-## DIMA: Data Integration and Metadata Annotation
-
-
-## Description
-
-**DIMA** (Data Integration and Metadata Annotation) is a Python package developed to support the findable, accessible, interoperable, and reusable (FAIR) data transformation of multi-instrument data at the **Laboratory of Atmospheric Chemistry** as part of the project **IVDAV**: *Instant and Versatile Data Visualization During the Current Dark Period of the Life Cycle of FAIR Research*, funded by the [ETH-Domain ORD Program Measure 1](https://ethrat.ch/en/measure-1-calls-for-field-specific-actions/).
-
-
-The **FAIR** data transformation involves cycles of data harmonization and metadata review. DIMA facilitates these processes by enabling the integration and annotation of multi-instrument data in HDF5 format. This data may originate from diverse experimental campaigns, including **beamtimes**, **kinetic flowtube studies**, **smog chamber experiments**, and **field campaigns**.
-
-
-## Key features
-
-DIMA provides reusable operations for data integration, manipulation, and extraction using HDF5 files. These serve as the foundation for the following higher-level operations:
-
-1. **Data integration pipeline**: Searches for, retrieves, and integrates multi-instrument data sources in HDF5 format using a human-readable campaign descriptor YAML file that points to the data sources on a network drive. A minimal descriptor sketch is shown after this list.
-
-2. **Metadata revision pipeline**: Enables updates, deletions, and additions of metadata in an HDF5 file. It operates on the target HDF5 file and a YAML file specifying the required changes. A suitable YAML file specification can be generated by serializing the current metadata of the target HDF5 file. This supports alignment with conventions and the development of campaign-centric vocabularies.
-
-
-3. **Visualization pipeline:**
- Generates a treemap visualization of an HDF5 file, highlighting its structure and key metadata elements.
-
-4. **Jupyter notebooks**
- Demonstrates DIMA’s core functionalities, such as data integration, HDF5 file creation, visualization, and metadata annotation. Key notebooks include examples for data sharing, OpenBis ETL, and workflow demos.
-
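For orientation, here is a minimal sketch of such a campaign descriptor together with a pre-flight check of the keys the integration pipeline expects. The key names mirror the `required_keys` list in `load_config_and_setup_logging` further below; every path and value is an illustrative placeholder, not actual campaign data.

```python
# Hedged sketch: a minimal campaign descriptor plus a pre-flight key check.
# Key names follow the pipeline's required_keys; all values are placeholders.
import yaml

example_descriptor = """
project: example_project
experiment: example_experiment
contact: ABC
actris_level: 0
input_file_directory: //network-drive/example_campaign   # placeholder path
output_file_directory: output_files/
instrument_datafolder:        # instrument folders to search for data files
  - instrument_A              # placeholder folder name
  - instrument_B              # placeholder folder name
# optional:
# integration_mode: single_experiment
# datetime_steps: ['2024-01-01 00-00-00']
"""

config = yaml.safe_load(example_descriptor)
required = ['experiment', 'contact', 'input_file_directory', 'output_file_directory',
            'instrument_datafolder', 'project', 'actris_level']
missing = [key for key in required if key not in config]
print('missing keys:', missing)   # expect [] for a well-formed descriptor
```

A review-file example for the metadata revision pipeline appears later, next to `update_hdf5_file_with_review`.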
-## Requirements
-
-For **Windows** users, the following are required:
-
-1. **Git Bash**: Install [Git Bash](https://git-scm.com/downloads) to run shell scripts (`.sh` files).
-
-2. **Conda**: Install [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
-
-3. **PSI Network Access**: Ensure access to PSI’s network and access rights to the source drives from which campaign data is retrieved, as specified by the YAML files in the `input_files/` folder.
-
-:bulb: **Tip**: Editing your system’s PATH variable ensures both Conda and Git are available in the terminal environment used by Git Bash.
-
-
-## Getting Started
-
-### Download DIMA
-
-Open a **Git Bash** terminal.
-
-Navigate to your `GitLab` folder, clone the repository, and navigate to the `dima` folder as follows:
-
- ```bash
- cd path/to/GitLab
- git clone --recurse-submodules https://gitlab.psi.ch/5505/dima.git
- cd dima
- ```
-
-### Install Python Interpreter
-
-Open a **Git Bash** terminal.
-
-**Option 1**: Install a suitable conda environment `multiphase_chemistry_env` inside the repository `dima` as follows:
-
- ```bash
- cd path/to/GitLab/dima
- bash setup_env.sh
- ```
-
-Open **Anaconda Prompt** or a terminal with access to conda.
-
-**Option 2**: Install the conda environment from the YAML file as follows:
- ```bash
- cd path/to/GitLab/dima
- conda env create --file environment.yml
- ```
-
-### Working with Jupyter Notebooks
-
-We now make the previously installed Python environment `multiphase_chemistry_env` selectable as a kernel in Jupyter's interface.
-
-1. Open an Anaconda Prompt, check if the environment exists, and activate it:
- ```
- conda env list
- conda activate multiphase_chemistry_env
- ```
-2. Register the environment in Jupyter:
- ```
- python -m ipykernel install --user --name multiphase_chemistry_env --display-name "Python (multiphase_chemistry_env)"
- ```
-3. Start a Jupyter Notebook by running the command:
- ```
- jupyter notebook
- ```
- and select the `multiphase_chemistry_env` environment from the kernel options.
-
-
-
-
+
+
-import sys
-import os
-
-try:
- thisFilePath = os.path.abspath(__file__)
-except NameError:
- print("Error: __file__ is not available. Ensure the script is being run from a file.")
- print("[Notice] Path to DIMA package may not be resolved properly.")
- thisFilePath = os.getcwd() # Use current directory or specify a default
-
-dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
-
-if dimaPath not in sys.path: # Avoid duplicate entries
- sys.path.append(dimaPath)
-
-
-import yaml
-import logging
-from datetime import datetime
-# Importing chain class from itertools
-from itertools import chain
-
-# Import DIMA modules
-import src.hdf5_writer as hdf5_lib
-import utils.g5505_utils as utils
-from instruments.readers import filereader_registry
-
-allowed_file_extensions = filereader_registry.file_extensions
-
-def _generate_datetime_dict(datetime_steps):
- """ Generate the datetime augment dictionary from datetime steps. """
- datetime_augment_dict = {}
- for datetime_step in datetime_steps:
- #tmp = datetime.strptime(datetime_step, '%Y-%m-%d %H-%M-%S')
- datetime_augment_dict[datetime_step] = [
- datetime_step.strftime('%Y-%m-%d'), datetime_step.strftime('%Y_%m_%d'), datetime_step.strftime('%Y.%m.%d'), datetime_step.strftime('%Y%m%d')
- ]
- return datetime_augment_dict
-
-
-def load_config_and_setup_logging(yaml_config_file_path, log_dir):
- """Load YAML configuration file, set up logging, and validate required keys and datetime_steps."""
-
- # Define required keys
- required_keys = [
- 'experiment', 'contact', 'input_file_directory', 'output_file_directory',
- 'instrument_datafolder', 'project', 'actris_level'
- ]
-
- # Supported integration modes
- supported_integration_modes = ['collection', 'single_experiment']
-
-
- # Set up logging
- date = utils.created_at("%Y_%m").replace(":", "-")
- utils.setup_logging(log_dir, f"integrate_data_sources_{date}.log")
-
- # Load YAML configuration file
- with open(yaml_config_file_path, 'r') as stream:
- try:
- config_dict = yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- logging.error("Error loading YAML file: %s", exc)
- raise ValueError(f"Failed to load YAML file: {exc}")
-
- # Check if required keys are present
- missing_keys = [key for key in required_keys if key not in config_dict]
- if missing_keys:
- raise KeyError(f"Missing required keys in YAML configuration: {missing_keys}")
-
- # Validate integration_mode
-    integration_mode = config_dict.get('integration_mode', 'N/A')  # 'N/A' if not specified in the YAML file
- if integration_mode not in supported_integration_modes:
-        raise RuntimeWarning(
-            f"Unsupported integration_mode '{integration_mode}'. Supported modes are {supported_integration_modes}."
-        )
-
-
- # Validate datetime_steps format if it exists
- if 'datetime_steps' in config_dict:
- datetime_steps = config_dict['datetime_steps']
- expected_format = '%Y-%m-%d %H-%M-%S'
-
- # Check if datetime_steps is a list or a falsy value
- if datetime_steps and not isinstance(datetime_steps, list):
- raise TypeError(f"datetime_steps should be a list of strings or a falsy value (None, empty), but got {type(datetime_steps)}")
-
- for step_idx, step in enumerate(datetime_steps):
- try:
- # Attempt to parse the datetime to ensure correct format
- config_dict['datetime_steps'][step_idx] = datetime.strptime(step, expected_format)
- except ValueError:
- raise ValueError(f"Invalid datetime format for '{step}'. Expected format: {expected_format}")
-        # Augment datetime_steps list as a dictionary. This speeds up single-experiment file generation
- config_dict['datetime_steps_dict'] = _generate_datetime_dict(datetime_steps)
- else:
- # If datetime_steps is not present, set the integration mode to 'collection'
- logging.info("datetime_steps missing, setting integration_mode to 'collection'.")
- config_dict['integration_mode'] = 'collection'
-
- # Validate filename_format if defined
- if 'filename_format' in config_dict:
- if not isinstance(config_dict['filename_format'], str):
-            raise ValueError('Specified filename_format must be a string')
-
- # Split the string and check if each key exists in config_dict
- keys = [key.strip() for key in config_dict['filename_format'].split(',')]
- missing_keys = [key for key in keys if key not in config_dict]
-
-        # If any required key is missing, fall back to the default filename_format (None)
- # assert not missing_keys, f'Missing key(s) in config_dict: {", ".join(missing_keys)}'
- if not missing_keys:
- config_dict['filename_format'] = ','.join(keys)
- else:
- config_dict['filename_format'] = None
- print(f'"filename_format" should contain comma-separated keys that match existing keys in the YAML config file.')
- print('Setting "filename_format" as None')
- else:
- config_dict['filename_format'] = None
-
- # Compute complementary metadata elements
-
- # Create output filename prefix
- if not config_dict['filename_format']: # default behavior
- config_dict['filename_prefix'] = '_'.join([config_dict[key] for key in ['experiment', 'contact']])
- else:
- config_dict['filename_prefix'] = '_'.join([config_dict[key] for key in config_dict['filename_format'].split(sep=',')])
-
- # Set default dates from datetime_steps if not provided
- current_date = datetime.now().strftime('%Y-%m-%d')
- dates = config_dict.get('datetime_steps',[])
- if not config_dict.get('dataset_startdate'):
- config_dict['dataset_startdate'] = min(config_dict['datetime_steps']).strftime('%Y-%m-%d') if dates else current_date # Earliest datetime step
-
- if not config_dict.get('dataset_enddate'):
- config_dict['dataset_enddate'] = max(config_dict['datetime_steps']).strftime('%Y-%m-%d') if dates else current_date # Latest datetime step
-
- config_dict['expected_datetime_format'] = '%Y-%m-%d %H-%M-%S'
-
- return config_dict
-
-
-
-
-def copy_subtree_and_create_hdf5(src, dst, select_dir_keywords, select_file_keywords, allowed_file_extensions, root_metadata_dict):
-
- """Helper function to copy directory with constraints and create HDF5."""
- src = src.replace(os.sep,'/')
- dst = dst.replace(os.sep,'/')
-
- logging.info("Creating constrained copy of the experimental campaign folder %s at: %s", src, dst)
-
- path_to_files_dict = utils.copy_directory_with_contraints(src, dst, select_dir_keywords, select_file_keywords, allowed_file_extensions)
- logging.info("Finished creating a copy of the experimental campaign folder tree at: %s", dst)
-
-
- logging.info("Creating HDF5 file at: %s", dst)
- hdf5_path = hdf5_lib.create_hdf5_file_from_filesystem_path(dst, path_to_files_dict, select_dir_keywords, root_metadata_dict)
- logging.info("Completed creation of HDF5 file %s at: %s", hdf5_path, dst)
-
- return hdf5_path
-
-
-
-
-def run_pipeline(path_to_config_yamlFile, log_dir='logs/'):
-
- """Integrates data sources specified by the input configuration file into HDF5 files.
-
- Parameters:
-        path_to_config_yamlFile (str): Path to the YAML configuration file.
- log_dir (str): Directory to save the log file.
-
- Returns:
- list: List of Paths to the created HDF5 file(s).
- """
-
- config_dict = load_config_and_setup_logging(path_to_config_yamlFile, log_dir)
-
- path_to_input_dir = config_dict['input_file_directory']
- path_to_output_dir = config_dict['output_file_directory']
- select_dir_keywords = config_dict['instrument_datafolder']
-
- # Define root folder metadata dictionary
- root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'actris_level']}
-
- # Get dataset start and end dates
- dataset_startdate = config_dict['dataset_startdate']
- dataset_enddate = config_dict['dataset_enddate']
-
- # Determine mode and process accordingly
- output_filename_path = []
- campaign_name_template = lambda filename_prefix, suffix: '_'.join([filename_prefix, suffix])
- date_str = f'{dataset_startdate}_{dataset_enddate}'
-
- # Create path to new raw datafolder and standardize with forward slashes
- path_to_rawdata_folder = os.path.join(
- path_to_output_dir, 'collection_' + campaign_name_template(config_dict['filename_prefix'], date_str), "").replace(os.sep, '/')
-
- # Process individual datetime steps if available, regardless of mode
- if config_dict.get('datetime_steps_dict', {}):
- # Single experiment mode
- for datetime_step, file_keywords in config_dict['datetime_steps_dict'].items():
- date_str = datetime_step.strftime('%Y-%m-%d')
- single_campaign_name = campaign_name_template(config_dict['filename_prefix'], date_str)
- path_to_rawdata_subfolder = os.path.join(path_to_rawdata_folder, single_campaign_name, "")
-
- path_to_integrated_stepwise_hdf5_file = copy_subtree_and_create_hdf5(
- path_to_input_dir, path_to_rawdata_subfolder, select_dir_keywords,
- file_keywords, allowed_file_extensions, root_metadata_dict)
-
- output_filename_path.append(path_to_integrated_stepwise_hdf5_file)
-
- # Collection mode processing if specified
- if 'collection' in config_dict.get('integration_mode', 'single_experiment'):
- path_to_filenames_dict = {path_to_rawdata_folder: [os.path.basename(path) for path in output_filename_path]} if output_filename_path else {}
- hdf5_path = hdf5_lib.create_hdf5_file_from_filesystem_path(path_to_rawdata_folder, path_to_filenames_dict, [], root_metadata_dict)
- output_filename_path.append(hdf5_path)
- else:
- path_to_integrated_stepwise_hdf5_file = copy_subtree_and_create_hdf5(
- path_to_input_dir, path_to_rawdata_folder, select_dir_keywords, [],
- allowed_file_extensions, root_metadata_dict)
- output_filename_path.append(path_to_integrated_stepwise_hdf5_file)
-
- return output_filename_path
-
-
-
-if __name__ == "__main__":
-
- if len(sys.argv) < 2:
- print("Usage: python data_integration.py <function_name> <function_args>")
- sys.exit(1)
-
- # Extract the function name from the command line arguments
- function_name = sys.argv[1]
-
- # Handle function execution based on the provided function name
- if function_name == 'run':
-
- if len(sys.argv) != 3:
- print("Usage: python data_integration.py run <path_to_config_yamlFile>")
- sys.exit(1)
- # Extract path to configuration file, specifying the data integration task
- path_to_config_yamlFile = sys.argv[2]
- run_pipeline(path_to_config_yamlFile)
-
-
-
+import sys
+import os
+
+try:
+ thisFilePath = os.path.abspath(__file__)
+except NameError:
+ print("Error: __file__ is not available. Ensure the script is being run from a file.")
+ print("[Notice] Path to DIMA package may not be resolved properly.")
+ thisFilePath = os.getcwd() # Use current directory or specify a default
+
+dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
+
+if dimaPath not in sys.path: # Avoid duplicate entries
+ sys.path.append(dimaPath)
+
+
+import yaml
+import logging
+from datetime import datetime
+# Importing chain class from itertools
+from itertools import chain
+
+# Import DIMA modules
+import src.hdf5_writer as hdf5_lib
+import utils.g5505_utils as utils
+from instruments.readers import filereader_registry
+
+allowed_file_extensions = filereader_registry.file_extensions
+
+def _generate_datetime_dict(datetime_steps):
+ """ Generate the datetime augment dictionary from datetime steps. """
+ datetime_augment_dict = {}
+ for datetime_step in datetime_steps:
+ #tmp = datetime.strptime(datetime_step, '%Y-%m-%d %H-%M-%S')
+ datetime_augment_dict[datetime_step] = [
+ datetime_step.strftime('%Y-%m-%d'), datetime_step.strftime('%Y_%m_%d'), datetime_step.strftime('%Y.%m.%d'), datetime_step.strftime('%Y%m%d')
+ ]
+ return datetime_augment_dict
+
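As a quick illustration of the structure produced above, the self-contained sketch below recreates the four date spellings for one arbitrary example date, using the same strftime patterns as `_generate_datetime_dict`.

```python
# Hedged sketch: the date spellings generated for one datetime step,
# mirroring the strftime patterns in _generate_datetime_dict above.
from datetime import datetime

step = datetime(2024, 1, 31)  # arbitrary example date
spellings = [step.strftime(fmt) for fmt in ('%Y-%m-%d', '%Y_%m_%d', '%Y.%m.%d', '%Y%m%d')]
print(spellings)  # ['2024-01-31', '2024_01_31', '2024.01.31', '20240131']
```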
+
+def load_config_and_setup_logging(yaml_config_file_path, log_dir):
+ """Load YAML configuration file, set up logging, and validate required keys and datetime_steps."""
+
+ # Define required keys
+ required_keys = [
+ 'experiment', 'contact', 'input_file_directory', 'output_file_directory',
+ 'instrument_datafolder', 'project', 'actris_level'
+ ]
+
+ # Supported integration modes
+ supported_integration_modes = ['collection', 'single_experiment']
+
+
+ # Set up logging
+ date = utils.created_at("%Y_%m").replace(":", "-")
+ utils.setup_logging(log_dir, f"integrate_data_sources_{date}.log")
+
+ # Load YAML configuration file
+ with open(yaml_config_file_path, 'r') as stream:
+ try:
+ config_dict = yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ logging.error("Error loading YAML file: %s", exc)
+ raise ValueError(f"Failed to load YAML file: {exc}")
+
+ # Check if required keys are present
+ missing_keys = [key for key in required_keys if key not in config_dict]
+ if missing_keys:
+ raise KeyError(f"Missing required keys in YAML configuration: {missing_keys}")
+
+ # Validate integration_mode
+    integration_mode = config_dict.get('integration_mode', 'N/A')  # 'N/A' if not specified in the YAML file
+ if integration_mode not in supported_integration_modes:
+        raise RuntimeWarning(
+            f"Unsupported integration_mode '{integration_mode}'. Supported modes are {supported_integration_modes}."
+        )
+
+
+ # Validate datetime_steps format if it exists
+ if 'datetime_steps' in config_dict:
+ datetime_steps = config_dict['datetime_steps']
+ expected_format = '%Y-%m-%d %H-%M-%S'
+
+ # Check if datetime_steps is a list or a falsy value
+ if datetime_steps and not isinstance(datetime_steps, list):
+ raise TypeError(f"datetime_steps should be a list of strings or a falsy value (None, empty), but got {type(datetime_steps)}")
+
+ for step_idx, step in enumerate(datetime_steps):
+ try:
+ # Attempt to parse the datetime to ensure correct format
+ config_dict['datetime_steps'][step_idx] = datetime.strptime(step, expected_format)
+ except ValueError:
+ raise ValueError(f"Invalid datetime format for '{step}'. Expected format: {expected_format}")
+        # Augment datetime_steps list as a dictionary. This speeds up single-experiment file generation
+ config_dict['datetime_steps_dict'] = _generate_datetime_dict(datetime_steps)
+ else:
+ # If datetime_steps is not present, set the integration mode to 'collection'
+ logging.info("datetime_steps missing, setting integration_mode to 'collection'.")
+ config_dict['integration_mode'] = 'collection'
+
+ # Validate filename_format if defined
+ if 'filename_format' in config_dict:
+ if not isinstance(config_dict['filename_format'], str):
+            raise ValueError('Specified filename_format must be a string')
+
+ # Split the string and check if each key exists in config_dict
+ keys = [key.strip() for key in config_dict['filename_format'].split(',')]
+ missing_keys = [key for key in keys if key not in config_dict]
+
+        # If any required key is missing, fall back to the default filename_format (None)
+ # assert not missing_keys, f'Missing key(s) in config_dict: {", ".join(missing_keys)}'
+ if not missing_keys:
+ config_dict['filename_format'] = ','.join(keys)
+ else:
+ config_dict['filename_format'] = None
+ print(f'"filename_format" should contain comma-separated keys that match existing keys in the YAML config file.')
+ print('Setting "filename_format" as None')
+ else:
+ config_dict['filename_format'] = None
+
+ # Compute complementary metadata elements
+
+ # Create output filename prefix
+ if not config_dict['filename_format']: # default behavior
+ config_dict['filename_prefix'] = '_'.join([config_dict[key] for key in ['experiment', 'contact']])
+ else:
+ config_dict['filename_prefix'] = '_'.join([config_dict[key] for key in config_dict['filename_format'].split(sep=',')])
+
+ # Set default dates from datetime_steps if not provided
+ current_date = datetime.now().strftime('%Y-%m-%d')
+ dates = config_dict.get('datetime_steps',[])
+ if not config_dict.get('dataset_startdate'):
+ config_dict['dataset_startdate'] = min(config_dict['datetime_steps']).strftime('%Y-%m-%d') if dates else current_date # Earliest datetime step
+
+ if not config_dict.get('dataset_enddate'):
+ config_dict['dataset_enddate'] = max(config_dict['datetime_steps']).strftime('%Y-%m-%d') if dates else current_date # Latest datetime step
+
+ config_dict['expected_datetime_format'] = '%Y-%m-%d %H-%M-%S'
+
+ return config_dict
+
+
+
+
+def copy_subtree_and_create_hdf5(src, dst, select_dir_keywords, select_file_keywords, allowed_file_extensions, root_metadata_dict):
+
+ """Helper function to copy directory with constraints and create HDF5."""
+ src = src.replace(os.sep,'/')
+ dst = dst.replace(os.sep,'/')
+
+ logging.info("Creating constrained copy of the experimental campaign folder %s at: %s", src, dst)
+
+ path_to_files_dict = utils.copy_directory_with_contraints(src, dst, select_dir_keywords, select_file_keywords, allowed_file_extensions)
+ logging.info("Finished creating a copy of the experimental campaign folder tree at: %s", dst)
+
+
+ logging.info("Creating HDF5 file at: %s", dst)
+ hdf5_path = hdf5_lib.create_hdf5_file_from_filesystem_path(dst, path_to_files_dict, select_dir_keywords, root_metadata_dict)
+ logging.info("Completed creation of HDF5 file %s at: %s", hdf5_path, dst)
+
+ return hdf5_path
+
+
+
+
+def run_pipeline(path_to_config_yamlFile, log_dir='logs/'):
+
+ """Integrates data sources specified by the input configuration file into HDF5 files.
+
+ Parameters:
+        path_to_config_yamlFile (str): Path to the YAML configuration file.
+ log_dir (str): Directory to save the log file.
+
+ Returns:
+ list: List of Paths to the created HDF5 file(s).
+ """
+
+ config_dict = load_config_and_setup_logging(path_to_config_yamlFile, log_dir)
+
+ path_to_input_dir = config_dict['input_file_directory']
+ path_to_output_dir = config_dict['output_file_directory']
+ select_dir_keywords = config_dict['instrument_datafolder']
+
+ # Define root folder metadata dictionary
+ root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'actris_level']}
+
+ # Get dataset start and end dates
+ dataset_startdate = config_dict['dataset_startdate']
+ dataset_enddate = config_dict['dataset_enddate']
+
+ # Determine mode and process accordingly
+ output_filename_path = []
+ campaign_name_template = lambda filename_prefix, suffix: '_'.join([filename_prefix, suffix])
+ date_str = f'{dataset_startdate}_{dataset_enddate}'
+
+ # Create path to new raw datafolder and standardize with forward slashes
+ path_to_rawdata_folder = os.path.join(
+ path_to_output_dir, 'collection_' + campaign_name_template(config_dict['filename_prefix'], date_str), "").replace(os.sep, '/')
+
+ # Process individual datetime steps if available, regardless of mode
+ if config_dict.get('datetime_steps_dict', {}):
+ # Single experiment mode
+ for datetime_step, file_keywords in config_dict['datetime_steps_dict'].items():
+ date_str = datetime_step.strftime('%Y-%m-%d')
+ single_campaign_name = campaign_name_template(config_dict['filename_prefix'], date_str)
+ path_to_rawdata_subfolder = os.path.join(path_to_rawdata_folder, single_campaign_name, "")
+
+ path_to_integrated_stepwise_hdf5_file = copy_subtree_and_create_hdf5(
+ path_to_input_dir, path_to_rawdata_subfolder, select_dir_keywords,
+ file_keywords, allowed_file_extensions, root_metadata_dict)
+
+ output_filename_path.append(path_to_integrated_stepwise_hdf5_file)
+
+ # Collection mode processing if specified
+ if 'collection' in config_dict.get('integration_mode', 'single_experiment'):
+ path_to_filenames_dict = {path_to_rawdata_folder: [os.path.basename(path) for path in output_filename_path]} if output_filename_path else {}
+ hdf5_path = hdf5_lib.create_hdf5_file_from_filesystem_path(path_to_rawdata_folder, path_to_filenames_dict, [], root_metadata_dict)
+ output_filename_path.append(hdf5_path)
+ else:
+ path_to_integrated_stepwise_hdf5_file = copy_subtree_and_create_hdf5(
+ path_to_input_dir, path_to_rawdata_folder, select_dir_keywords, [],
+ allowed_file_extensions, root_metadata_dict)
+ output_filename_path.append(path_to_integrated_stepwise_hdf5_file)
+
+ return output_filename_path
+
+
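For completeness, a hedged sketch of calling the pipeline from Python rather than through the CLI handled below; the module/package location and the descriptor filename are assumptions (the README only states that descriptors live under `input_files/`).

```python
# Hedged usage sketch: import location and YAML filename are assumptions.
from data_integration import run_pipeline  # module name taken from the CLI usage string below

created_files = run_pipeline('input_files/example_campaign_descriptor.yaml', log_dir='logs/')
for hdf5_file in created_files:
    print('created:', hdf5_file)
```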
+
+if __name__ == "__main__":
+
+ if len(sys.argv) < 2:
+ print("Usage: python data_integration.py <function_name> <function_args>")
+ sys.exit(1)
+
+ # Extract the function name from the command line arguments
+ function_name = sys.argv[1]
+
+ # Handle function execution based on the provided function name
+ if function_name == 'run':
+
+ if len(sys.argv) != 3:
+ print("Usage: python data_integration.py run <path_to_config_yamlFile>")
+ sys.exit(1)
+ # Extract path to configuration file, specifying the data integration task
+ path_to_config_yamlFile = sys.argv[2]
+ run_pipeline(path_to_config_yamlFile)
+
+
+
-import sys
-import os
-
-try:
- thisFilePath = os.path.abspath(__file__)
-except NameError:
- print("Error: __file__ is not available. Ensure the script is being run from a file.")
- print("[Notice] Path to DIMA package may not be resolved properly.")
- thisFilePath = os.getcwd() # Use current directory or specify a default
-
-dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
-
-if dimaPath not in sys.path: # Avoid duplicate entries
- sys.path.append(dimaPath)
-
-import h5py
-import yaml
-import src.hdf5_ops as hdf5_ops
-
-
-
-def load_yaml(review_yaml_file):
- with open(review_yaml_file, 'r') as stream:
- try:
- return yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- print(exc)
- return None
-
-
-
-def validate_yaml_dict(input_hdf5_file, yaml_dict):
- errors = []
- notes = []
-
- with h5py.File(input_hdf5_file, 'r') as hdf5_file:
- # 1. Check for valid object names
- for key in yaml_dict:
- if key not in hdf5_file:
- error_msg = f"Error: {key} is not a valid object's name in the HDF5 file."
- print(error_msg)
- errors.append(error_msg)
-
- # 2. Confirm metadata dict for each object is a dictionary
- for key, meta_dict in yaml_dict.items():
- if not isinstance(meta_dict, dict):
- error_msg = f"Error: Metadata for {key} should be a dictionary."
- print(error_msg)
- errors.append(error_msg)
- else:
- if 'attributes' not in meta_dict:
- warning_msg = f"Warning: No 'attributes' in metadata dict for {key}."
- print(warning_msg)
- notes.append(warning_msg)
-
- # 3. Verify update, append, and delete operations are well specified
- for key, meta_dict in yaml_dict.items():
- attributes = meta_dict.get("attributes", {})
-
- for attr_name, attr_value in attributes.items():
- # Ensure the object exists before accessing attributes
- if key in hdf5_file:
- hdf5_obj_attrs = hdf5_file[key].attrs # Access object-specific attributes
-
- if attr_name in hdf5_obj_attrs:
- # Attribute exists: it can be updated or deleted
- if isinstance(attr_value, dict) and "delete" in attr_value:
- note_msg = f"Note: '{attr_name}' in {key} may be deleted if 'delete' is set as true."
- print(note_msg)
- notes.append(note_msg)
- else:
- note_msg = f"Note: '{attr_name}' in {key} will be updated."
- print(note_msg)
- notes.append(note_msg)
- else:
- # Attribute does not exist: it can be appended or flagged as an invalid delete
- if isinstance(attr_value, dict) and "delete" in attr_value:
- error_msg = f"Error: Cannot delete non-existent attribute '{attr_name}' in {key}."
- print(error_msg)
- errors.append(error_msg)
- else:
- note_msg = f"Note: '{attr_name}' in {key} will be appended."
- print(note_msg)
- notes.append(note_msg)
- else:
- error_msg = f"Error: '{key}' is not a valid object in the HDF5 file."
- print(error_msg)
- errors.append(error_msg)
-
- return len(errors) == 0, errors, notes
-
-
-
-
-def update_hdf5_file_with_review(input_hdf5_file, review_yaml_file):
-
- """
- Updates, appends, or deletes metadata attributes in an HDF5 file based on a provided YAML dictionary.
-
- Parameters:
- -----------
- input_hdf5_file : str
- Path to the HDF5 file.
-
-    review_yaml_file : str
-        Path to the review YAML file. Its content maps objects to attributes and operations, e.g.:
-        {
-            "object_name": {
-                "attributes": {
-                    "attr_name": {"value": attr_value, "delete": true | false}
-                }
-            }
-        }
- """
- yaml_dict = load_yaml(review_yaml_file)
-
- success, errors, notes = validate_yaml_dict(input_hdf5_file,yaml_dict)
- if not success:
- raise ValueError(f"Review yaml file {review_yaml_file} is invalid. Validation errors: {errors}")
-
- # Initialize HDF5 operations manager
- DataOpsAPI = hdf5_ops.HDF5DataOpsManager(input_hdf5_file)
- DataOpsAPI.load_file_obj()
-
- # Iterate over each object in the YAML dictionary
- for obj_name, attr_dict in yaml_dict.items():
- # Prepare dictionaries for append, update, and delete actions
- append_dict = {}
- update_dict = {}
- delete_dict = {}
-
- if not obj_name in DataOpsAPI.file_obj:
- continue # Skip if the object does not exist
-
- # Iterate over each attribute in the current object
- for attr_name, attr_props in attr_dict['attributes'].items():
- if not isinstance(attr_props, dict):
- #attr_props = {'value': attr_props}
- # Check if the attribute exists (for updating)
- if attr_name in DataOpsAPI.file_obj[obj_name].attrs:
- update_dict[attr_name] = attr_props
- # Otherwise, it's a new attribute to append
- else:
- append_dict[attr_name] = attr_props
- else:
- # Check if the attribute is marked for deletion
- if attr_props.get('delete', False):
- delete_dict[attr_name] = attr_props
-
- # Perform a single pass for all three operations
- if append_dict:
- DataOpsAPI.append_metadata(obj_name, append_dict)
- if update_dict:
- DataOpsAPI.update_metadata(obj_name, update_dict)
- if delete_dict:
- DataOpsAPI.delete_metadata(obj_name, delete_dict)
-
- # Close hdf5 file
- DataOpsAPI.unload_file_obj()
- # Regenerate yaml snapshot of updated HDF5 file
- output_yml_filename_path = hdf5_ops.serialize_metadata(input_hdf5_file)
-    print(f'{output_yml_filename_path} was successfully regenerated from the updated version of {input_hdf5_file}')
-
-
-
-def count(hdf5_obj,yml_dict):
- print(hdf5_obj.name)
- if isinstance(hdf5_obj,h5py.Group) and len(hdf5_obj.name.split('/')) <= 4:
- obj_review = yml_dict[hdf5_obj.name]
- additions = [not (item in hdf5_obj.attrs.keys()) for item in obj_review['attributes'].keys()]
- count_additions = sum(additions)
- deletions = [not (item in obj_review['attributes'].keys()) for item in hdf5_obj.attrs.keys()]
-        count_deletions = sum(deletions)
-        print('additions', count_additions, 'deletions', count_deletions)
-
-
-if __name__ == "__main__":
-
- if len(sys.argv) < 4:
- print("Usage: python metadata_revision.py update <path/to/target_file.hdf5> <path/to/metadata_review_file.yaml>")
- sys.exit(1)
-
-
- if sys.argv[1] == 'update':
- input_hdf5_file = sys.argv[2]
- review_yaml_file = sys.argv[3]
- update_hdf5_file_with_review(input_hdf5_file, review_yaml_file)
- #run(sys.argv[2])
-
+import sys
+import os
+
+try:
+ thisFilePath = os.path.abspath(__file__)
+except NameError:
+ print("Error: __file__ is not available. Ensure the script is being run from a file.")
+ print("[Notice] Path to DIMA package may not be resolved properly.")
+ thisFilePath = os.getcwd() # Use current directory or specify a default
+
+dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
+
+if dimaPath not in sys.path: # Avoid duplicate entries
+ sys.path.append(dimaPath)
+
+import h5py
+import yaml
+import src.hdf5_ops as hdf5_ops
+
+
+
+def load_yaml(review_yaml_file):
+ with open(review_yaml_file, 'r') as stream:
+ try:
+ return yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ print(exc)
+ return None
+
+
+
+def validate_yaml_dict(input_hdf5_file, yaml_dict):
+ errors = []
+ notes = []
+
+ with h5py.File(input_hdf5_file, 'r') as hdf5_file:
+ # 1. Check for valid object names
+ for key in yaml_dict:
+ if key not in hdf5_file:
+ error_msg = f"Error: {key} is not a valid object's name in the HDF5 file."
+ print(error_msg)
+ errors.append(error_msg)
+
+ # 2. Confirm metadata dict for each object is a dictionary
+ for key, meta_dict in yaml_dict.items():
+ if not isinstance(meta_dict, dict):
+ error_msg = f"Error: Metadata for {key} should be a dictionary."
+ print(error_msg)
+ errors.append(error_msg)
+ else:
+ if 'attributes' not in meta_dict:
+ warning_msg = f"Warning: No 'attributes' in metadata dict for {key}."
+ print(warning_msg)
+ notes.append(warning_msg)
+
+ # 3. Verify update, append, and delete operations are well specified
+ for key, meta_dict in yaml_dict.items():
+ attributes = meta_dict.get("attributes", {})
+
+ for attr_name, attr_value in attributes.items():
+ # Ensure the object exists before accessing attributes
+ if key in hdf5_file:
+ hdf5_obj_attrs = hdf5_file[key].attrs # Access object-specific attributes
+
+ if attr_name in hdf5_obj_attrs:
+ # Attribute exists: it can be updated or deleted
+ if isinstance(attr_value, dict) and "delete" in attr_value:
+ note_msg = f"Note: '{attr_name}' in {key} may be deleted if 'delete' is set as true."
+ print(note_msg)
+ notes.append(note_msg)
+ else:
+ note_msg = f"Note: '{attr_name}' in {key} will be updated."
+ print(note_msg)
+ notes.append(note_msg)
+ else:
+ # Attribute does not exist: it can be appended or flagged as an invalid delete
+ if isinstance(attr_value, dict) and "delete" in attr_value:
+ error_msg = f"Error: Cannot delete non-existent attribute '{attr_name}' in {key}."
+ print(error_msg)
+ errors.append(error_msg)
+ else:
+ note_msg = f"Note: '{attr_name}' in {key} will be appended."
+ print(note_msg)
+ notes.append(note_msg)
+ else:
+ error_msg = f"Error: '{key}' is not a valid object in the HDF5 file."
+ print(error_msg)
+ errors.append(error_msg)
+
+ return len(errors) == 0, errors, notes
+
+
+
+
+def update_hdf5_file_with_review(input_hdf5_file, review_yaml_file):
+
+ """
+ Updates, appends, or deletes metadata attributes in an HDF5 file based on a provided YAML dictionary.
+
+ Parameters:
+ -----------
+ input_hdf5_file : str
+ Path to the HDF5 file.
+
+    review_yaml_file : str
+        Path to the review YAML file. Its content maps objects to attributes and operations, e.g.:
+        {
+            "object_name": {
+                "attributes": {
+                    "attr_name": {"value": attr_value, "delete": true | false}
+                }
+            }
+        }
+ """
+ yaml_dict = load_yaml(review_yaml_file)
+
+ success, errors, notes = validate_yaml_dict(input_hdf5_file,yaml_dict)
+ if not success:
+ raise ValueError(f"Review yaml file {review_yaml_file} is invalid. Validation errors: {errors}")
+
+ # Initialize HDF5 operations manager
+ DataOpsAPI = hdf5_ops.HDF5DataOpsManager(input_hdf5_file)
+ DataOpsAPI.load_file_obj()
+
+ # Iterate over each object in the YAML dictionary
+ for obj_name, attr_dict in yaml_dict.items():
+ # Prepare dictionaries for append, update, and delete actions
+ append_dict = {}
+ update_dict = {}
+ delete_dict = {}
+
+ if not obj_name in DataOpsAPI.file_obj:
+ continue # Skip if the object does not exist
+
+ # Iterate over each attribute in the current object
+ for attr_name, attr_props in attr_dict['attributes'].items():
+ if not isinstance(attr_props, dict):
+ #attr_props = {'value': attr_props}
+ # Check if the attribute exists (for updating)
+ if attr_name in DataOpsAPI.file_obj[obj_name].attrs:
+ update_dict[attr_name] = attr_props
+ # Otherwise, it's a new attribute to append
+ else:
+ append_dict[attr_name] = attr_props
+ else:
+ # Check if the attribute is marked for deletion
+ if attr_props.get('delete', False):
+ delete_dict[attr_name] = attr_props
+
+ # Perform a single pass for all three operations
+ if append_dict:
+ DataOpsAPI.append_metadata(obj_name, append_dict)
+ if update_dict:
+ DataOpsAPI.update_metadata(obj_name, update_dict)
+ if delete_dict:
+ DataOpsAPI.delete_metadata(obj_name, delete_dict)
+
+ # Close hdf5 file
+ DataOpsAPI.unload_file_obj()
+ # Regenerate yaml snapshot of updated HDF5 file
+ output_yml_filename_path = hdf5_ops.serialize_metadata(input_hdf5_file)
+    print(f'{output_yml_filename_path} was successfully regenerated from the updated version of {input_hdf5_file}')
+
+
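To make the review-file format documented above concrete, here is a hedged sketch of a small review YAML and how it would be applied; the group and attribute names are invented for illustration and must correspond to objects in the target HDF5 file.

```python
# Hedged sketch: a metadata review file following the documented structure.
# Group/attribute names and file paths are illustrative placeholders.
import yaml

example_review = """
/instrument_A:                 # placeholder HDF5 group name
  attributes:
    operator:                  # update or append, per the format documented above
      value: example_operator
    obsolete_flag:             # removed when 'delete' is true and the attribute exists
      delete: true
"""
review_dict = yaml.safe_load(example_review)
print(review_dict['/instrument_A']['attributes'])

# Applying it (paths are placeholders):
# update_hdf5_file_with_review('output_files/example_campaign.h5', 'example_review.yaml')
```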
+
+def count(hdf5_obj,yml_dict):
+ print(hdf5_obj.name)
+ if isinstance(hdf5_obj,h5py.Group) and len(hdf5_obj.name.split('/')) <= 4:
+ obj_review = yml_dict[hdf5_obj.name]
+ additions = [not (item in hdf5_obj.attrs.keys()) for item in obj_review['attributes'].keys()]
+ count_additions = sum(additions)
+ deletions = [not (item in obj_review['attributes'].keys()) for item in hdf5_obj.attrs.keys()]
+        count_deletions = sum(deletions)
+        print('additions', count_additions, 'deletions', count_deletions)
+
+
+if __name__ == "__main__":
+
+ if len(sys.argv) < 4:
+ print("Usage: python metadata_revision.py update <path/to/target_file.hdf5> <path/to/metadata_review_file.yaml>")
+ sys.exit(1)
+
+
+ if sys.argv[1] == 'update':
+ input_hdf5_file = sys.argv[2]
+ review_yaml_file = sys.argv[3]
+ update_hdf5_file_with_review(input_hdf5_file, review_yaml_file)
+ #run(sys.argv[2])
+
-import os
-
-import src.hdf5_lib as hdf5_lib
-import src.g5505_utils as utils
-import yaml
-
-import logging
-from datetime import datetime
-
-
-
-
-def integrate_data_sources(yaml_config_file_path, log_dir='logs/'):
-
- """ Integrates data sources specified by the input configuration file into HDF5 files.
-
- Parameters:
- yaml_config_file_path (str): Path to the YAML configuration file.
- log_dir (str): Directory to save the log file.
-
- Returns:
- str: Path (or list of Paths) to the created HDF5 file(s).
- """
-
- date = utils.created_at()
- utils.setup_logging(log_dir, f"integrate_data_sources_{date}.log")
-
- with open(yaml_config_file_path,'r') as stream:
- try:
- config_dict = yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- logging.error("Error loading YAML file: %s", exc)
- raise
-
- def output_filename(name, date, initials):
- return f"{name}_{date}_{initials}.h5"
-
- exp_campaign_name = config_dict['experiment']
- initials = config_dict['contact']
- input_file_dir = config_dict['input_file_directory']
- output_dir = config_dict['output_file_directory']
- select_dir_keywords = config_dict['instrument_datafolder']
- root_metadata_dict = {
- 'project' : config_dict['project'],
- 'experiment' : config_dict['experiment'],
- 'contact' : config_dict['contact'],
- 'actris_level': config_dict['actris_level']
- }
-
- def create_hdf5_file(date_str, select_file_keywords,root_metadata):
- filename = output_filename(exp_campaign_name, date_str, initials)
- output_path = os.path.join(output_dir, filename)
- logging.info("Creating HDF5 file at: %s", output_path)
-
- return hdf5_lib.create_hdf5_file_from_filesystem_path(
- output_path, input_file_dir, select_dir_keywords, select_file_keywords, root_metadata_dict=root_metadata
- )
-
- if config_dict.get('datetime_steps'):
-
- datetime_augment_dict = {}
- for datetime_step in config_dict['datetime_steps']:
- tmp = datetime.strptime(datetime_step,'%Y-%m-%d %H-%M-%S') #convert(datetime_step)
- datetime_augment_dict[tmp] = [tmp.strftime('%Y-%m-%d'),tmp.strftime('%Y_%m_%d'),tmp.strftime('%Y.%m.%d'),tmp.strftime('%Y%m%d')]
- print(tmp)
-
- if 'single_experiment' in config_dict['integration_mode']:
- output_filename_path = []
- for datetime_step in datetime_augment_dict.keys():
- date_str = datetime_step.strftime('%Y-%m-%d')
- select_file_keywords = datetime_augment_dict[datetime_step]
-
- root_metadata_dict.update({'dataset_startdate': date_str,
- 'dataset_enddate': date_str})
- dt_step_output_filename_path= create_hdf5_file(date_str, select_file_keywords, root_metadata_dict)
- output_filename_path.append(dt_step_output_filename_path)
-
- elif 'collection' in config_dict['integration_mode']:
- select_file_keywords = []
- for datetime_step in datetime_augment_dict.keys():
- select_file_keywords = select_file_keywords + datetime_augment_dict[datetime_step]
-
- config_dict['dataset_startdate'] = min(datetime_augment_dict.keys())
- config_dict['dataset_enddate'] = max(datetime_augment_dict.keys())
- startdate = config_dict['dataset_startdate'].strftime('%Y-%m-%d')
- enddate = config_dict['dataset_enddate'].strftime('%Y-%m-%d')
- root_metadata_dict.update({'dataset_startdate': startdate,
- 'dataset_enddate': enddate})
-
- date_str = f'{startdate}_{enddate}'
- output_filename_path = create_hdf5_file(date_str, select_file_keywords, root_metadata_dict)
- else:
- startdate = config_dict['dataset_startdate']
- enddate = config_dict['dataset_enddate']
- root_metadata_dict.update({'dataset_startdate': startdate,
- 'dataset_enddate': enddate})
- date_str = f'{startdate}_{enddate}'
- output_filename_path = create_hdf5_file(date_str, select_file_keywords = [], root_metadata = root_metadata_dict)
-
- return output_filename_path
-
-
+import os
+
+import src.hdf5_lib as hdf5_lib
+import src.g5505_utils as utils
+import yaml
+
+import logging
+from datetime import datetime
+
+
+
+
+def integrate_data_sources(yaml_config_file_path, log_dir='logs/'):
+
+ """ Integrates data sources specified by the input configuration file into HDF5 files.
+
+ Parameters:
+ yaml_config_file_path (str): Path to the YAML configuration file.
+ log_dir (str): Directory to save the log file.
+
+ Returns:
+ str: Path (or list of Paths) to the created HDF5 file(s).
+ """
+
+ date = utils.created_at()
+ utils.setup_logging(log_dir, f"integrate_data_sources_{date}.log")
+
+ with open(yaml_config_file_path,'r') as stream:
+ try:
+ config_dict = yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ logging.error("Error loading YAML file: %s", exc)
+ raise
+
+ def output_filename(name, date, initials):
+ return f"{name}_{date}_{initials}.h5"
+
+ exp_campaign_name = config_dict['experiment']
+ initials = config_dict['contact']
+ input_file_dir = config_dict['input_file_directory']
+ output_dir = config_dict['output_file_directory']
+ select_dir_keywords = config_dict['instrument_datafolder']
+ root_metadata_dict = {
+ 'project' : config_dict['project'],
+ 'experiment' : config_dict['experiment'],
+ 'contact' : config_dict['contact'],
+ 'actris_level': config_dict['actris_level']
+ }
+
+ def create_hdf5_file(date_str, select_file_keywords,root_metadata):
+ filename = output_filename(exp_campaign_name, date_str, initials)
+ output_path = os.path.join(output_dir, filename)
+ logging.info("Creating HDF5 file at: %s", output_path)
+
+ return hdf5_lib.create_hdf5_file_from_filesystem_path(
+ output_path, input_file_dir, select_dir_keywords, select_file_keywords, root_metadata_dict=root_metadata
+ )
+
+ if config_dict.get('datetime_steps'):
+
+ datetime_augment_dict = {}
+ for datetime_step in config_dict['datetime_steps']:
+ tmp = datetime.strptime(datetime_step,'%Y-%m-%d %H-%M-%S') #convert(datetime_step)
+ datetime_augment_dict[tmp] = [tmp.strftime('%Y-%m-%d'),tmp.strftime('%Y_%m_%d'),tmp.strftime('%Y.%m.%d'),tmp.strftime('%Y%m%d')]
+ print(tmp)
+
+ if 'single_experiment' in config_dict['integration_mode']:
+ output_filename_path = []
+ for datetime_step in datetime_augment_dict.keys():
+ date_str = datetime_step.strftime('%Y-%m-%d')
+ select_file_keywords = datetime_augment_dict[datetime_step]
+
+ root_metadata_dict.update({'dataset_startdate': date_str,
+ 'dataset_enddate': date_str})
+ dt_step_output_filename_path= create_hdf5_file(date_str, select_file_keywords, root_metadata_dict)
+ output_filename_path.append(dt_step_output_filename_path)
+
+ elif 'collection' in config_dict['integration_mode']:
+ select_file_keywords = []
+ for datetime_step in datetime_augment_dict.keys():
+ select_file_keywords = select_file_keywords + datetime_augment_dict[datetime_step]
+
+ config_dict['dataset_startdate'] = min(datetime_augment_dict.keys())
+ config_dict['dataset_enddate'] = max(datetime_augment_dict.keys())
+ startdate = config_dict['dataset_startdate'].strftime('%Y-%m-%d')
+ enddate = config_dict['dataset_enddate'].strftime('%Y-%m-%d')
+ root_metadata_dict.update({'dataset_startdate': startdate,
+ 'dataset_enddate': enddate})
+
+ date_str = f'{startdate}_{enddate}'
+ output_filename_path = create_hdf5_file(date_str, select_file_keywords, root_metadata_dict)
+ else:
+ startdate = config_dict['dataset_startdate']
+ enddate = config_dict['dataset_enddate']
+ root_metadata_dict.update({'dataset_startdate': startdate,
+ 'dataset_enddate': enddate})
+ date_str = f'{startdate}_{enddate}'
+ output_filename_path = create_hdf5_file(date_str, select_file_keywords = [], root_metadata = root_metadata_dict)
+
+ return output_filename_path
+
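This legacy entry point consumes the same campaign descriptor as `run_pipeline`; a hedged one-liner for reference, with the import left as an assumption (the module name of this older script is not stated here) and a placeholder YAML filename.

```python
# Hedged usage sketch: the import location and the YAML path are assumptions.
# from <legacy_module> import integrate_data_sources

result = integrate_data_sources('input_files/example_campaign_descriptor.yaml', log_dir='logs/')
print(result)  # a single path or a list of paths, per the docstring above
```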
+
-import os
-
-import numpy as np
-import pandas as pd
-import collections
-from igor2.binarywave import load as loadibw
-
-import src.g5505_utils as utils
-#import src.metadata_review_lib as metadata
-#from src.metadata_review_lib import parse_attribute
-
-import yaml
-import h5py
-
-ROOT_DIR = os.path.abspath(os.curdir)
-
-
-def read_xps_ibw_file_as_dict(filename):
- """
- Reads IBW files from the Multiphase Chemistry Group, which contain XPS spectra and acquisition settings,
- and formats the data into a dictionary with the structure {datasets: list of datasets}. Each dataset in the
- list has the following structure:
-
- {
- 'name': 'name',
- 'data': data_array,
- 'data_units': 'units',
- 'shape': data_shape,
- 'dtype': data_type
- }
-
- Parameters
- ----------
- filename : str
- The IBW filename from the Multiphase Chemistry Group beamline.
-
- Returns
- -------
- file_dict : dict
- A dictionary containing the datasets from the IBW file.
-
- Raises
- ------
- ValueError
- If the input IBW file is not a valid IBW file.
-
- """
-
-
- file_obj = loadibw(filename)
-
- required_keys = ['wData','data_units','dimension_units','note']
- if sum([item in required_keys for item in file_obj['wave'].keys()]) < len(required_keys):
-        raise ValueError('This is not a valid XPS IBW file. It does not satisfy minimum admissibility criteria.')
-
- file_dict = {}
- path_tail, path_head = os.path.split(filename)
-
- # Group name and attributes
- file_dict['name'] = path_head
- file_dict['attributes_dict'] = {}
-
- # Convert notes of bytes class to string class and split string into a list of elements separated by '\r'.
- notes_list = file_obj['wave']['note'].decode("utf-8").split('\r')
- exclude_list = ['Excitation Energy']
- for item in notes_list:
- if '=' in item:
- key, value = tuple(item.split('='))
- # TODO: check if value can be converted into a numeric type. Now all values are string type
- if not key in exclude_list:
- file_dict['attributes_dict'][key] = value
-
- # TODO: talk to Thorsten to see if there is an easier way to access the below attributes
- dimension_labels = file_obj['wave']['dimension_units'].decode("utf-8").split(']')
- file_dict['attributes_dict']['dimension_units'] = [item+']' for item in dimension_labels[0:len(dimension_labels)-1]]
-
- # Datasets and their attributes
-
- file_dict['datasets'] = []
-
- dataset = {}
- dataset['name'] = 'spectrum'
- dataset['data'] = file_obj['wave']['wData']
- dataset['data_units'] = file_obj['wave']['data_units']
- dataset['shape'] = dataset['data'].shape
- dataset['dtype'] = type(dataset['data'])
-
- # TODO: include energy axis dataset
-
- file_dict['datasets'].append(dataset)
-
-
- return file_dict
-
-
-
-def copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
- # Create copy of original file to avoid possible file corruption and work with it.
-
- if work_with_copy:
- tmp_file_path = utils.make_file_copy(source_file_path)
- else:
- tmp_file_path = source_file_path
-
-    # Open the backup HDF5 file and copy its complete contents onto a group in the destination HDF5 file
- with h5py.File(tmp_file_path,'r') as src_file:
- dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
-
- if 'tmp_files' in tmp_file_path:
- os.remove(tmp_file_path)
-
-
-
-def read_txt_files_as_dict(filename : str , work_with_copy : bool = True ):
-
- # Get the directory of the current module
- module_dir = os.path.dirname(__file__)
- # Construct the relative file path
- instrument_configs_path = os.path.join(module_dir, 'instruments', 'text_data_sources.yaml')
-
- with open(instrument_configs_path,'r') as stream:
- try:
- config_dict = yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- print(exc)
-    # Verify that the file can be read by one of the available instrument configurations.
- if not any(key in filename.replace(os.sep,'/') for key in config_dict.keys()):
- return {}
-
-
- #TODO: this may be prone to error if assumed folder structure is non compliant
- file_encoding = config_dict['default']['file_encoding'] #'utf-8'
- separator = config_dict['default']['separator']
- table_header = config_dict['default']['table_header']
-
- for key in config_dict.keys():
- if key.replace('/',os.sep) in filename:
- file_encoding = config_dict[key].get('file_encoding',file_encoding)
- separator = config_dict[key].get('separator',separator).replace('\\t','\t')
- table_header = config_dict[key].get('table_header',table_header)
- timestamp_variables = config_dict[key].get('timestamp',[])
- datetime_format = config_dict[key].get('datetime_format',[])
-
- description_dict = {}
- #link_to_description = config_dict[key].get('link_to_description',[]).replace('/',os.sep)
- link_to_description = os.path.join(module_dir,config_dict[key].get('link_to_description',[]).replace('/',os.sep))
- with open(link_to_description,'r') as stream:
- try:
- description_dict = yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- print(exc)
- break
- #if 'None' in table_header:
- # return {}
-
- # Read header as a dictionary and detect where data table starts
- header_dict = {}
- data_start = False
- # Work with copy of the file for safety
- if work_with_copy:
- tmp_filename = utils.make_file_copy(source_file_path=filename)
- else:
- tmp_filename = filename
-
- #with open(tmp_filename,'rb',encoding=file_encoding,errors='ignore') as f:
- with open(tmp_filename,'rb') as f:
- table_preamble = []
- for line_number, line in enumerate(f):
-
- if table_header in line.decode(file_encoding):
- list_of_substrings = line.decode(file_encoding).split(separator)
-
- # Count occurrences of each substring
- substring_counts = collections.Counter(list_of_substrings)
- data_start = True
- # Generate column names with appended index only for repeated substrings
- column_names = [f"{i}_{name.strip()}" if substring_counts[name] > 1 else name.strip() for i, name in enumerate(list_of_substrings)]
-
- #column_names = [str(i)+'_'+name.strip() for i, name in enumerate(list_of_substrings)]
- #column_names = []
- #for i, name in enumerate(list_of_substrings):
- # column_names.append(str(i)+'_'+name)
-
- #print(line_number, len(column_names ),'\n')
- break
- # Subdivide line into words, and join them by single space.
-            # I assume this produces a cleaner line without stray separator characters (\t, \r) or extra spaces.
-            list_of_substrings = line.decode(file_encoding).split()
-            # TODO: ideally we would use a multiline string, but the YAML parser does not recognize \n as a special character
- #line = ' '.join(list_of_substrings+['\n'])
- #line = ' '.join(list_of_substrings)
- table_preamble.append(' '.join([item for item in list_of_substrings]))# += new_line
-
- # Represent string values as fixed length strings in the HDF5 file, which need
- # to be decoded as string when we read them. It provides better control than variable strings,
- # at the expense of flexibility.
- # https://docs.h5py.org/en/stable/strings.html
-
- if table_preamble:
- header_dict["table_preamble"] = utils.convert_string_to_bytes(table_preamble)
-
-
-
- # TODO: it does not work with separator as none :(. fix for RGA
- try:
- df = pd.read_csv(tmp_filename,
- delimiter = separator,
- header=line_number,
- #encoding='latin-1',
- encoding = file_encoding,
- names=column_names,
- skip_blank_lines=True)
-
- df_numerical_attrs = df.select_dtypes(include ='number')
- df_categorical_attrs = df.select_dtypes(exclude='number')
- numerical_variables = [item for item in df_numerical_attrs.columns]
-
- # Consolidate into single timestamp column the separate columns 'date' 'time' specified in text_data_source.yaml
- if timestamp_variables:
- #df_categorical_attrs['timestamps'] = [' '.join(df_categorical_attrs.loc[i,timestamp_variables].to_numpy()) for i in df.index]
- #df_categorical_attrs['timestamps'] = [ df_categorical_attrs.loc[i,'0_Date']+' '+df_categorical_attrs.loc[i,'1_Time'] for i in df.index]
-
-
- #df_categorical_attrs['timestamps'] = df_categorical_attrs[timestamp_variables].astype(str).agg(' '.join, axis=1)
- timestamps_name = ' '.join(timestamp_variables)
- df_categorical_attrs[ timestamps_name] = df_categorical_attrs[timestamp_variables].astype(str).agg(' '.join, axis=1)
-
- valid_indices = []
- if datetime_format:
- df_categorical_attrs[ timestamps_name] = pd.to_datetime(df_categorical_attrs[ timestamps_name],format=datetime_format,errors='coerce')
- valid_indices = df_categorical_attrs.dropna(subset=[timestamps_name]).index
- df_categorical_attrs = df_categorical_attrs.loc[valid_indices,:]
- df_numerical_attrs = df_numerical_attrs.loc[valid_indices,:]
-
- df_categorical_attrs[timestamps_name] = df_categorical_attrs[timestamps_name].dt.strftime(config_dict['default']['desired_format'])
- startdate = df_categorical_attrs[timestamps_name].min()
- enddate = df_categorical_attrs[timestamps_name].max()
-
- df_categorical_attrs[timestamps_name] = df_categorical_attrs[timestamps_name].astype(str)
- #header_dict.update({'stastrrtdate':startdate,'enddate':enddate})
- header_dict['startdate']= str(startdate)
- header_dict['enddate']=str(enddate)
-
- if len(timestamp_variables) > 1:
- df_categorical_attrs = df_categorical_attrs.drop(columns = timestamp_variables)
-
-
- #df_categorical_attrs.reindex(drop=True)
- #df_numerical_attrs.reindex(drop=True)
-
-
-
- categorical_variables = [item for item in df_categorical_attrs.columns]
- ####
- #elif 'RGA' in filename:
- # df_categorical_attrs = df_categorical_attrs.rename(columns={'0_Time(s)' : 'timestamps'})
-
- ###
- file_dict = {}
- path_tail, path_head = os.path.split(tmp_filename)
-
- file_dict['name'] = path_head
- # TODO: review this header dictionary, it may not be the best way to represent header data
- file_dict['attributes_dict'] = header_dict
- file_dict['datasets'] = []
- ####
-
- df = pd.concat((df_categorical_attrs,df_numerical_attrs),axis=1)
-
- #if numerical_variables:
- dataset = {}
- dataset['name'] = 'data_table'#_numerical_variables'
- dataset['data'] = utils.dataframe_to_np_structured_array(df) #df_numerical_attrs.to_numpy()
- dataset['shape'] = dataset['data'].shape
- dataset['dtype'] = type(dataset['data'])
- #dataset['data_units'] = file_obj['wave']['data_units']
- #
- # Create attribute descriptions based on description_dict
- dataset['attributes'] = {}
-
- for column_name in df.columns:
- column_attr_dict = description_dict['table_header'].get(column_name,
- {'note':'there was no description available. Review instrument files.'})
- dataset['attributes'].update({column_name: utils.parse_attribute(column_attr_dict)})
-
- #try:
- # dataset['attributes'] = description_dict['table_header'].copy()
- # for key in description_dict['table_header'].keys():
- # if not key in numerical_variables:
- # dataset['attributes'].pop(key) # delete key
- # else:
- # dataset['attributes'][key] = utils.parse_attribute(dataset['attributes'][key])
- # if timestamps_name in categorical_variables:
- # dataset['attributes'][timestamps_name] = utils.parse_attribute({'unit':'YYYY-MM-DD HH:MM:SS.ffffff'})
- #except ValueError as err:
- # print(err)
-
- file_dict['datasets'].append(dataset)
-
-
- #if categorical_variables:
- # dataset = {}
- # dataset['name'] = 'table_categorical_variables'
- # dataset['data'] = dataframe_to_np_structured_array(df_categorical_attrs) #df_categorical_attrs.loc[:,categorical_variables].to_numpy()
- # dataset['shape'] = dataset['data'].shape
- # dataset['dtype'] = type(dataset['data'])
- # if timestamps_name in categorical_variables:
- # dataset['attributes'] = {timestamps_name: utils.parse_attribute({'unit':'YYYY-MM-DD HH:MM:SS.ffffff'})}
- # file_dict['datasets'].append(dataset)
-
-
-
-
- except:
- return {}
-
- return file_dict
-
-
-
-[docs]
-def main():
-
- inputfile_dir = '\\\\fs101\\5505\\People\\Juan\\TypicalBeamTime'
-
- file_dict = read_xps_ibw_file_as_dict(inputfile_dir+'\\SES\\0069069_N1s_495eV.ibw')
-
- for key in file_dict.keys():
- print(key,file_dict[key])
-
-
-
-if __name__ == '__main__':
-
- main()
-
- print(':)')
-
+import os
+
+import numpy as np
+import pandas as pd
+import collections
+from igor2.binarywave import load as loadibw
+
+import src.g5505_utils as utils
+#import src.metadata_review_lib as metadata
+#from src.metadata_review_lib import parse_attribute
+
+import yaml
+import h5py
+
+ROOT_DIR = os.path.abspath(os.curdir)
+
+
+[docs]
+def read_xps_ibw_file_as_dict(filename):
+ """
+ Reads IBW files from the Multiphase Chemistry Group, which contain XPS spectra and acquisition settings,
+ and formats the data into a dictionary with the structure {datasets: list of datasets}. Each dataset in the
+ list has the following structure:
+
+ {
+ 'name': 'name',
+ 'data': data_array,
+ 'data_units': 'units',
+ 'shape': data_shape,
+ 'dtype': data_type
+ }
+
+ Parameters
+ ----------
+ filename : str
+ The IBW filename from the Multiphase Chemistry Group beamline.
+
+ Returns
+ -------
+ file_dict : dict
+ A dictionary containing the datasets from the IBW file.
+
+ Raises
+ ------
+ ValueError
+ If the input IBW file is not a valid IBW file.
+
+ """
+
+
+ file_obj = loadibw(filename)
+
+ required_keys = ['wData','data_units','dimension_units','note']
+    if not all(key in file_obj['wave'].keys() for key in required_keys):
+        raise ValueError('This is not a valid XPS IBW file: it does not satisfy the minimum admissibility criteria.')
+
+ file_dict = {}
+ path_tail, path_head = os.path.split(filename)
+
+ # Group name and attributes
+ file_dict['name'] = path_head
+ file_dict['attributes_dict'] = {}
+
+    # Decode the note attribute from bytes to a string and split it into a list of entries separated by '\r'.
+ notes_list = file_obj['wave']['note'].decode("utf-8").split('\r')
+ exclude_list = ['Excitation Energy']
+ for item in notes_list:
+ if '=' in item:
+ key, value = tuple(item.split('='))
+ # TODO: check if value can be converted into a numeric type. Now all values are string type
+ if not key in exclude_list:
+ file_dict['attributes_dict'][key] = value
+
+ # TODO: talk to Thorsten to see if there is an easier way to access the below attributes
+ dimension_labels = file_obj['wave']['dimension_units'].decode("utf-8").split(']')
+ file_dict['attributes_dict']['dimension_units'] = [item+']' for item in dimension_labels[0:len(dimension_labels)-1]]
+
+ # Datasets and their attributes
+
+ file_dict['datasets'] = []
+
+ dataset = {}
+ dataset['name'] = 'spectrum'
+ dataset['data'] = file_obj['wave']['wData']
+ dataset['data_units'] = file_obj['wave']['data_units']
+ dataset['shape'] = dataset['data'].shape
+ dataset['dtype'] = type(dataset['data'])
+
+ # TODO: include energy axis dataset
+
+ file_dict['datasets'].append(dataset)
+
+
+ return file_dict
+
+
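+# Illustrative usage sketch (not part of the processing pipeline): the IBW file path below is a
+# hypothetical placeholder and should point to a real file from the Multiphase Chemistry Group beamline.
+def _example_read_xps_ibw_file():
+    file_dict = read_xps_ibw_file_as_dict('path/to/0069069_N1s_495eV.ibw')
+    print('group name:', file_dict['name'])
+    for dataset in file_dict['datasets']:
+        print(dataset['name'], dataset['shape'], dataset['data_units'])
+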
+
+[docs]
+def copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
+ # Create copy of original file to avoid possible file corruption and work with it.
+
+ if work_with_copy:
+ tmp_file_path = utils.make_file_copy(source_file_path)
+ else:
+ tmp_file_path = source_file_path
+
+    # Open the backup h5 file and copy its complete filesystem directory onto a group in h5file
+ with h5py.File(tmp_file_path,'r') as src_file:
+ dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
+
+ if 'tmp_files' in tmp_file_path:
+ os.remove(tmp_file_path)
+
+
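+# Illustrative usage sketch: both HDF5 file paths below are hypothetical placeholders. The contents of the
+# source file are copied into a group of the destination file, which is opened here in append mode.
+def _example_copy_file_in_group():
+    with h5py.File('path/to/destination.h5', 'a') as dest_file_obj:
+        copy_file_in_group('path/to/source.h5', dest_file_obj, 'source_file_copy', work_with_copy=False)
+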
+
+[docs]
+def read_txt_files_as_dict(filename : str , work_with_copy : bool = True ):
+
+ # Get the directory of the current module
+ module_dir = os.path.dirname(__file__)
+ # Construct the relative file path
+ instrument_configs_path = os.path.join(module_dir, 'instruments', 'text_data_sources.yaml')
+
+ with open(instrument_configs_path,'r') as stream:
+ try:
+ config_dict = yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ print(exc)
+    # Verify that the file can be read by one of the available instrument configurations.
+ if not any(key in filename.replace(os.sep,'/') for key in config_dict.keys()):
+ return {}
+
+
+    # TODO: this may be prone to error if the assumed folder structure is non-compliant
+ file_encoding = config_dict['default']['file_encoding'] #'utf-8'
+ separator = config_dict['default']['separator']
+ table_header = config_dict['default']['table_header']
+
+ for key in config_dict.keys():
+ if key.replace('/',os.sep) in filename:
+ file_encoding = config_dict[key].get('file_encoding',file_encoding)
+ separator = config_dict[key].get('separator',separator).replace('\\t','\t')
+ table_header = config_dict[key].get('table_header',table_header)
+ timestamp_variables = config_dict[key].get('timestamp',[])
+ datetime_format = config_dict[key].get('datetime_format',[])
+
+ description_dict = {}
+ #link_to_description = config_dict[key].get('link_to_description',[]).replace('/',os.sep)
+ link_to_description = os.path.join(module_dir,config_dict[key].get('link_to_description',[]).replace('/',os.sep))
+ with open(link_to_description,'r') as stream:
+ try:
+ description_dict = yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ print(exc)
+ break
+ #if 'None' in table_header:
+ # return {}
+
+ # Read header as a dictionary and detect where data table starts
+ header_dict = {}
+ data_start = False
+ # Work with copy of the file for safety
+ if work_with_copy:
+ tmp_filename = utils.make_file_copy(source_file_path=filename)
+ else:
+ tmp_filename = filename
+
+ #with open(tmp_filename,'rb',encoding=file_encoding,errors='ignore') as f:
+ with open(tmp_filename,'rb') as f:
+ table_preamble = []
+ for line_number, line in enumerate(f):
+
+ if table_header in line.decode(file_encoding):
+ list_of_substrings = line.decode(file_encoding).split(separator)
+
+ # Count occurrences of each substring
+ substring_counts = collections.Counter(list_of_substrings)
+ data_start = True
+ # Generate column names with appended index only for repeated substrings
+ column_names = [f"{i}_{name.strip()}" if substring_counts[name] > 1 else name.strip() for i, name in enumerate(list_of_substrings)]
+
+ #column_names = [str(i)+'_'+name.strip() for i, name in enumerate(list_of_substrings)]
+ #column_names = []
+ #for i, name in enumerate(list_of_substrings):
+ # column_names.append(str(i)+'_'+name)
+
+ #print(line_number, len(column_names ),'\n')
+ break
+            # Split the line into words and rejoin them with single spaces.
+            # I assume this produces a cleaner line without stray separator characters (\t, \r) or extra spaces.
+ list_of_substrings = line.decode(file_encoding).split()
+            # TODO: ideally we should use a multiline string, but the YAML parser does not recognize \n as a special character
+ #line = ' '.join(list_of_substrings+['\n'])
+ #line = ' '.join(list_of_substrings)
+ table_preamble.append(' '.join([item for item in list_of_substrings]))# += new_line
+
+    # Represent string values as fixed-length strings in the HDF5 file, which need
+    # to be decoded back to strings when read. This provides better control than variable-length strings,
+    # at the expense of flexibility.
+    # https://docs.h5py.org/en/stable/strings.html
+
+ if table_preamble:
+ header_dict["table_preamble"] = utils.convert_string_to_bytes(table_preamble)
+
+
+
+    # TODO: this does not work when the separator is None; fix for RGA files.
+ try:
+ df = pd.read_csv(tmp_filename,
+ delimiter = separator,
+ header=line_number,
+ #encoding='latin-1',
+ encoding = file_encoding,
+ names=column_names,
+ skip_blank_lines=True)
+
+ df_numerical_attrs = df.select_dtypes(include ='number')
+ df_categorical_attrs = df.select_dtypes(exclude='number')
+ numerical_variables = [item for item in df_numerical_attrs.columns]
+
+        # Consolidate the separate date and time columns specified in text_data_sources.yaml into a single timestamp column.
+ if timestamp_variables:
+ #df_categorical_attrs['timestamps'] = [' '.join(df_categorical_attrs.loc[i,timestamp_variables].to_numpy()) for i in df.index]
+ #df_categorical_attrs['timestamps'] = [ df_categorical_attrs.loc[i,'0_Date']+' '+df_categorical_attrs.loc[i,'1_Time'] for i in df.index]
+
+
+ #df_categorical_attrs['timestamps'] = df_categorical_attrs[timestamp_variables].astype(str).agg(' '.join, axis=1)
+ timestamps_name = ' '.join(timestamp_variables)
+ df_categorical_attrs[ timestamps_name] = df_categorical_attrs[timestamp_variables].astype(str).agg(' '.join, axis=1)
+
+ valid_indices = []
+ if datetime_format:
+ df_categorical_attrs[ timestamps_name] = pd.to_datetime(df_categorical_attrs[ timestamps_name],format=datetime_format,errors='coerce')
+ valid_indices = df_categorical_attrs.dropna(subset=[timestamps_name]).index
+ df_categorical_attrs = df_categorical_attrs.loc[valid_indices,:]
+ df_numerical_attrs = df_numerical_attrs.loc[valid_indices,:]
+
+ df_categorical_attrs[timestamps_name] = df_categorical_attrs[timestamps_name].dt.strftime(config_dict['default']['desired_format'])
+ startdate = df_categorical_attrs[timestamps_name].min()
+ enddate = df_categorical_attrs[timestamps_name].max()
+
+ df_categorical_attrs[timestamps_name] = df_categorical_attrs[timestamps_name].astype(str)
+ #header_dict.update({'stastrrtdate':startdate,'enddate':enddate})
+ header_dict['startdate']= str(startdate)
+ header_dict['enddate']=str(enddate)
+
+ if len(timestamp_variables) > 1:
+ df_categorical_attrs = df_categorical_attrs.drop(columns = timestamp_variables)
+
+
+ #df_categorical_attrs.reindex(drop=True)
+ #df_numerical_attrs.reindex(drop=True)
+
+
+
+ categorical_variables = [item for item in df_categorical_attrs.columns]
+ ####
+ #elif 'RGA' in filename:
+ # df_categorical_attrs = df_categorical_attrs.rename(columns={'0_Time(s)' : 'timestamps'})
+
+ ###
+ file_dict = {}
+ path_tail, path_head = os.path.split(tmp_filename)
+
+ file_dict['name'] = path_head
+ # TODO: review this header dictionary, it may not be the best way to represent header data
+ file_dict['attributes_dict'] = header_dict
+ file_dict['datasets'] = []
+ ####
+
+ df = pd.concat((df_categorical_attrs,df_numerical_attrs),axis=1)
+
+ #if numerical_variables:
+ dataset = {}
+ dataset['name'] = 'data_table'#_numerical_variables'
+ dataset['data'] = utils.dataframe_to_np_structured_array(df) #df_numerical_attrs.to_numpy()
+ dataset['shape'] = dataset['data'].shape
+ dataset['dtype'] = type(dataset['data'])
+ #dataset['data_units'] = file_obj['wave']['data_units']
+ #
+ # Create attribute descriptions based on description_dict
+ dataset['attributes'] = {}
+
+ for column_name in df.columns:
+ column_attr_dict = description_dict['table_header'].get(column_name,
+ {'note':'there was no description available. Review instrument files.'})
+ dataset['attributes'].update({column_name: utils.parse_attribute(column_attr_dict)})
+
+ #try:
+ # dataset['attributes'] = description_dict['table_header'].copy()
+ # for key in description_dict['table_header'].keys():
+ # if not key in numerical_variables:
+ # dataset['attributes'].pop(key) # delete key
+ # else:
+ # dataset['attributes'][key] = utils.parse_attribute(dataset['attributes'][key])
+ # if timestamps_name in categorical_variables:
+ # dataset['attributes'][timestamps_name] = utils.parse_attribute({'unit':'YYYY-MM-DD HH:MM:SS.ffffff'})
+ #except ValueError as err:
+ # print(err)
+
+ file_dict['datasets'].append(dataset)
+
+
+ #if categorical_variables:
+ # dataset = {}
+ # dataset['name'] = 'table_categorical_variables'
+ # dataset['data'] = dataframe_to_np_structured_array(df_categorical_attrs) #df_categorical_attrs.loc[:,categorical_variables].to_numpy()
+ # dataset['shape'] = dataset['data'].shape
+ # dataset['dtype'] = type(dataset['data'])
+ # if timestamps_name in categorical_variables:
+ # dataset['attributes'] = {timestamps_name: utils.parse_attribute({'unit':'YYYY-MM-DD HH:MM:SS.ffffff'})}
+ # file_dict['datasets'].append(dataset)
+
+
+
+
+    except Exception:
+ return {}
+
+ return file_dict
+
+
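+# Illustrative usage sketch: the text file path below is a hypothetical placeholder. The file must match one
+# of the instrument keys declared in instruments/text_data_sources.yaml, otherwise an empty dict is returned.
+def _example_read_txt_file():
+    file_dict = read_txt_files_as_dict('path/to/instrument_folder/measurement.txt', work_with_copy=False)
+    if file_dict:
+        print('header attributes:', list(file_dict['attributes_dict'].keys()))
+        print('first dataset:', file_dict['datasets'][0]['name'], file_dict['datasets'][0]['shape'])
+    else:
+        print('File could not be matched to any known instrument configuration.')
+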
+
+[docs]
+def main():
+
+ inputfile_dir = '\\\\fs101\\5505\\People\\Juan\\TypicalBeamTime'
+
+ file_dict = read_xps_ibw_file_as_dict(inputfile_dir+'\\SES\\0069069_N1s_495eV.ibw')
+
+ for key in file_dict.keys():
+ print(key,file_dict[key])
+
+
+
+if __name__ == '__main__':
+
+ main()
+
+ print(':)')
+
-import h5py
-import pandas as pd
-import numpy as np
-import os
-import src.hdf5_vis as hdf5_vis
-
-
-
-[docs]
-def read_dataset_from_hdf5file(hdf5_file_path, dataset_path):
- # Open the HDF5 file
- with h5py.File(hdf5_file_path, 'r') as hdf:
- # Load the dataset
- dataset = hdf[dataset_path]
- data = np.empty(dataset.shape, dtype=dataset.dtype)
- dataset.read_direct(data)
- df = pd.DataFrame(data)
-
- for col_name in df.select_dtypes(exclude='number'):
- df[col_name] = df[col_name].str.decode('utf-8') #apply(lambda x: x.decode('utf-8') if isinstance(x,bytes) else x)
- ## Extract metadata (attributes) and convert to a dictionary
- #metadata = hdf5_vis.construct_attributes_dict(hdf[dataset_name].attrs)
- ## Create a one-row DataFrame with the metadata
- #metadata_df = pd.DataFrame.from_dict(data, orient='columns')
- return df
-
-
-
-[docs]
-def read_metadata_from_hdf5obj(hdf5_file_path, obj_path):
- # TODO: Complete this function
- metadata_df = pd.DataFrame.empty()
- return metadata_df
-
-
-
-[docs]
-def list_datasets_in_hdf5file(hdf5_file_path):
-
- def get_datasets(name, obj, list_of_datasets):
- if isinstance(obj,h5py.Dataset):
- list_of_datasets.append(name)
- #print(f'Adding dataset: {name}') #tail: {head} head: {tail}')
-
-
- with h5py.File(hdf5_file_path,'r') as file:
- list_of_datasets = []
- file.visititems(lambda name, obj: get_datasets(name, obj, list_of_datasets))
-
- dataset_df = pd.DataFrame({'dataset_name':list_of_datasets})
-
- dataset_df['parent_instrument'] = dataset_df['dataset_name'].apply(lambda x: x.split('/')[-3])
- dataset_df['parent_file'] = dataset_df['dataset_name'].apply(lambda x: x.split('/')[-2])
-
- return dataset_df
-
-
+import h5py
+import pandas as pd
+import numpy as np
+import os
+import src.hdf5_vis as hdf5_vis
+
+
+
+[docs]
+def read_dataset_from_hdf5file(hdf5_file_path, dataset_path):
+ # Open the HDF5 file
+ with h5py.File(hdf5_file_path, 'r') as hdf:
+ # Load the dataset
+ dataset = hdf[dataset_path]
+ data = np.empty(dataset.shape, dtype=dataset.dtype)
+ dataset.read_direct(data)
+ df = pd.DataFrame(data)
+
+ for col_name in df.select_dtypes(exclude='number'):
+ df[col_name] = df[col_name].str.decode('utf-8') #apply(lambda x: x.decode('utf-8') if isinstance(x,bytes) else x)
+ ## Extract metadata (attributes) and convert to a dictionary
+ #metadata = hdf5_vis.construct_attributes_dict(hdf[dataset_name].attrs)
+ ## Create a one-row DataFrame with the metadata
+ #metadata_df = pd.DataFrame.from_dict(data, orient='columns')
+ return df
+
+
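+# Illustrative usage sketch: the HDF5 file path and dataset path below are hypothetical placeholders.
+def _example_read_dataset():
+    df = read_dataset_from_hdf5file('path/to/campaign.h5', 'instrument/file.txt/data_table')
+    print(df.head())
+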
+
+[docs]
+def read_metadata_from_hdf5obj(hdf5_file_path, obj_path):
+ # TODO: Complete this function
+    metadata_df = pd.DataFrame()  # placeholder: return an empty DataFrame until this function is implemented
+ return metadata_df
+
+
+
+[docs]
+def list_datasets_in_hdf5file(hdf5_file_path):
+
+ def get_datasets(name, obj, list_of_datasets):
+ if isinstance(obj,h5py.Dataset):
+ list_of_datasets.append(name)
+ #print(f'Adding dataset: {name}') #tail: {head} head: {tail}')
+
+
+ with h5py.File(hdf5_file_path,'r') as file:
+ list_of_datasets = []
+ file.visititems(lambda name, obj: get_datasets(name, obj, list_of_datasets))
+
+ dataset_df = pd.DataFrame({'dataset_name':list_of_datasets})
+
+ dataset_df['parent_instrument'] = dataset_df['dataset_name'].apply(lambda x: x.split('/')[-3])
+ dataset_df['parent_file'] = dataset_df['dataset_name'].apply(lambda x: x.split('/')[-2])
+
+ return dataset_df
+
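+# Illustrative usage sketch: the HDF5 file path below is a hypothetical placeholder. Dataset paths are assumed
+# to be at least three levels deep (instrument/file/dataset) so the parent columns can be derived.
+def _example_list_datasets():
+    dataset_df = list_datasets_in_hdf5file('path/to/campaign.h5')
+    print(dataset_df[['parent_instrument', 'parent_file', 'dataset_name']].head())
+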
+
-import sys
-import os
-root_dir = os.path.abspath(os.curdir)
-sys.path.append(root_dir)
-
-import pandas as pd
-import numpy as np
-import h5py
-import logging
-
-import utils.g5505_utils as utils
-import instruments.readers.filereader_registry as filereader_registry
-
-
-
-def __transfer_file_dict_to_hdf5(h5file, group_name, file_dict):
- """
- Transfers data from a file_dict to an HDF5 file.
-
- Parameters
- ----------
- h5file : h5py.File
- HDF5 file object where the data will be written.
- group_name : str
- Name of the HDF5 group where data will be stored.
- file_dict : dict
- Dictionary containing file data to be transferred. Required structure:
- {
- 'name': str,
- 'attributes_dict': dict,
- 'datasets': [
- {
- 'name': str,
- 'data': array-like,
- 'shape': tuple,
- 'attributes': dict (optional)
- },
- ...
- ]
- }
-
- Returns
- -------
- None
- """
-
- if not file_dict:
- return
-
- try:
- # Create group and add their attributes
- group = h5file[group_name].create_group(name=file_dict['name'])
- # Add group attributes
- group.attrs.update(file_dict['attributes_dict'])
-
- # Add datasets to the just created group
- for dataset in file_dict['datasets']:
- dataset_obj = group.create_dataset(
- name=dataset['name'],
- data=dataset['data'],
- shape=dataset['shape']
- )
-
- # Add dataset's attributes
- attributes = dataset.get('attributes', {})
- dataset_obj.attrs.update(attributes)
- group.attrs['last_update_date'] = utils.created_at().encode('utf-8')
- except Exception as inst:
- print(inst)
- logging.error('Failed to transfer data into HDF5: %s', inst)
-
-def __copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
- # Create copy of original file to avoid possible file corruption and work with it.
-
- if work_with_copy:
- tmp_file_path = utils.make_file_copy(source_file_path)
- else:
- tmp_file_path = source_file_path
-
- # Open backup h5 file and copy complet filesystem directory onto a group in h5file
- with h5py.File(tmp_file_path,'r') as src_file:
- dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
-
- if 'tmp_files' in tmp_file_path:
- os.remove(tmp_file_path)
-
-
-[docs]
-def create_hdf5_file_from_filesystem_path(path_to_input_directory: str,
- path_to_filenames_dict: dict = None,
- select_dir_keywords : list = [],
- root_metadata_dict : dict = {}, mode = 'w'):
-
- """
- Creates an .h5 file with name "output_filename" that preserves the directory tree (or folder structure)
- of a given filesystem path.
-
- The data integration capabilities are limited by our file reader, which can only access data from a list of
- admissible file formats. These, however, can be extended. Directories are groups in the resulting HDF5 file.
- Files are formatted as composite objects consisting of a group, file, and attributes.
-
- Parameters
- ----------
- output_filename : str
- Name of the output HDF5 file.
- path_to_input_directory : str
- Path to root directory, specified with forward slashes, e.g., path/to/root.
-
- path_to_filenames_dict : dict, optional
- A pre-processed dictionary where keys are directory paths on the input directory's tree and values are lists of files.
- If provided, 'input_file_system_path' is ignored.
-
- select_dir_keywords : list
- List of string elements to consider or select only directory paths that contain
- a word in 'select_dir_keywords'. When empty, all directory paths are considered
- to be included in the HDF5 file group hierarchy.
- root_metadata_dict : dict
- Metadata to include at the root level of the HDF5 file.
-
- mode : str
- 'w' create File, truncate if it exists, or 'r+' read/write, File must exists. By default, mode = "w".
-
- Returns
- -------
- output_filename : str
- Path to the created HDF5 file.
- """
-
-
- if not mode in ['w','r+']:
- raise ValueError(f'Parameter mode must take values in ["w","r+"]')
-
- if not '/' in path_to_input_directory:
- raise ValueError('path_to_input_directory needs to be specified using forward slashes "/".' )
-
- #path_to_output_directory = os.path.join(path_to_input_directory,'..')
- path_to_input_directory = os.path.normpath(path_to_input_directory).rstrip(os.sep)
-
-
- for i, keyword in enumerate(select_dir_keywords):
- select_dir_keywords[i] = keyword.replace('/',os.sep)
-
- if not path_to_filenames_dict:
- # On dry_run=True, returns path to files dictionary of the output directory without making a actual copy of the input directory.
- # Therefore, there wont be a copying conflict by setting up input and output directories the same
- path_to_filenames_dict = utils.copy_directory_with_contraints(input_dir_path=path_to_input_directory,
- output_dir_path=path_to_input_directory,
- dry_run=True)
- # Set input_directory as copied input directory
- root_dir = path_to_input_directory
- path_to_output_file = path_to_input_directory.rstrip(os.path.sep) + '.h5'
-
- with h5py.File(path_to_output_file, mode=mode, track_order=True) as h5file:
-
- number_of_dirs = len(path_to_filenames_dict.keys())
- dir_number = 1
- for dirpath, filtered_filenames_list in path_to_filenames_dict.items():
-
- start_message = f'Starting to transfer files in directory: {dirpath}'
- end_message = f'\nCompleted transferring files in directory: {dirpath}'
- # Print and log the start message
- print(start_message)
- logging.info(start_message)
-
- # Check if filtered_filenames_list is nonempty. TODO: This is perhaps redundant by design of path_to_filenames_dict.
- if not filtered_filenames_list:
- continue
-
- group_name = dirpath.replace(os.sep,'/')
- group_name = group_name.replace(root_dir.replace(os.sep,'/') + '/', '/')
-
- # Flatten group name to one level
- if select_dir_keywords:
- offset = sum([len(i.split(os.sep)) if i in dirpath else 0 for i in select_dir_keywords])
- else:
- offset = 1
- tmp_list = group_name.split('/')
- if len(tmp_list) > offset+1:
- group_name = '/'.join([tmp_list[i] for i in range(offset+1)])
-
- # Group hierarchy is implicitly defined by the forward slashes
- if not group_name in h5file.keys():
- h5file.create_group(group_name)
- h5file[group_name].attrs['creation_date'] = utils.created_at().encode('utf-8')
- #h5file[group_name].attrs.create(name='filtered_file_list',data=convert_string_to_bytes(filtered_filename_list))
- #h5file[group_name].attrs.create(name='file_list',data=convert_string_to_bytes(filenames_list))
- else:
- print(group_name,' was already created.')
-
- for filenumber, filename in enumerate(filtered_filenames_list):
-
- #file_ext = os.path.splitext(filename)[1]
- #try:
-
- # hdf5 path to filename group
- dest_group_name = f'{group_name}/{filename}'
-
- if not 'h5' in filename:
- #file_dict = config_file.select_file_readers(group_id)[file_ext](os.path.join(dirpath,filename))
- #file_dict = ext_to_reader_dict[file_ext](os.path.join(dirpath,filename))
- file_dict = filereader_registry.select_file_reader(dest_group_name)(os.path.join(dirpath,filename))
-
- __transfer_file_dict_to_hdf5(h5file, group_name, file_dict)
-
- else:
- source_file_path = os.path.join(dirpath,filename)
- dest_file_obj = h5file
- #group_name +'/'+filename
- #ext_to_reader_dict[file_ext](source_file_path, dest_file_obj, dest_group_name)
- #g5505f_reader.select_file_reader(dest_group_name)(source_file_path, dest_file_obj, dest_group_name)
- __copy_file_in_group(source_file_path, dest_file_obj, dest_group_name, False)
-
- # Update the progress bar and log the end message
- utils.progressBar(dir_number, number_of_dirs, end_message)
- logging.info(end_message)
- dir_number = dir_number + 1
-
-
-
- if len(root_metadata_dict.keys())>0:
- for key, value in root_metadata_dict.items():
- #if key in h5file.attrs:
- # del h5file.attrs[key]
- h5file.attrs.create(key, value)
- #annotate_root_dir(output_filename,root_metadata_dict)
-
-
- #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename)
-
- return path_to_output_file #, output_yml_filename_path
-
-
-
-[docs]
-def save_processed_dataframe_to_hdf5(df, annotator, output_filename): # src_hdf5_path, script_date, script_name):
- """
- Save processed dataframe columns with annotations to an HDF5 file.
-
- Parameters:
- df (pd.DataFrame): DataFrame containing processed time series.
- annotator (): Annotator object with get_metadata method.
- output_filename (str): Path to the source HDF5 file.
- """
- # Convert datetime columns to string
- datetime_cols = df.select_dtypes(include=['datetime64']).columns
-
- if list(datetime_cols):
- df[datetime_cols] = df[datetime_cols].map(str)
-
- # Convert dataframe to structured array
- icad_data_table = utils.convert_dataframe_to_np_structured_array(df)
-
- # Get metadata
- metadata_dict = annotator.get_metadata()
-
- # Prepare project level attributes to be added at the root level
-
- project_level_attributes = metadata_dict['metadata']['project']
-
- # Prepare high-level attributes
- high_level_attributes = {
- 'parent_files': metadata_dict['parent_files'],
- **metadata_dict['metadata']['sample'],
- **metadata_dict['metadata']['environment'],
- **metadata_dict['metadata']['instruments']
- }
-
- # Prepare data level attributes
- data_level_attributes = metadata_dict['metadata']['datasets']
-
- for key, value in data_level_attributes.items():
- if isinstance(value,dict):
- data_level_attributes[key] = utils.convert_attrdict_to_np_structured_array(value)
-
-
- # Prepare file dictionary
- file_dict = {
- 'name': project_level_attributes['processing_file'],
- 'attributes_dict': high_level_attributes,
- 'datasets': [{
- 'name': "data_table",
- 'data': icad_data_table,
- 'shape': icad_data_table.shape,
- 'attributes': data_level_attributes
- }]
- }
-
- # Check if the file exists
- if os.path.exists(output_filename):
- mode = "a"
- print(f"File {output_filename} exists. Opening in append mode.")
- else:
- mode = "w"
- print(f"File {output_filename} does not exist. Creating a new file.")
-
-
- # Write to HDF5
- with h5py.File(output_filename, mode) as h5file:
- # Add project level attributes at the root/top level
- h5file.attrs.update(project_level_attributes)
- __transfer_file_dict_to_hdf5(h5file, '/', file_dict)
-
-
-#if __name__ == '__main__':
-
+import sys
+import os
+root_dir = os.path.abspath(os.curdir)
+sys.path.append(root_dir)
+
+import pandas as pd
+import numpy as np
+import h5py
+import logging
+
+import utils.g5505_utils as utils
+import instruments.readers.filereader_registry as filereader_registry
+
+
+
+def __transfer_file_dict_to_hdf5(h5file, group_name, file_dict):
+ """
+ Transfers data from a file_dict to an HDF5 file.
+
+ Parameters
+ ----------
+ h5file : h5py.File
+ HDF5 file object where the data will be written.
+ group_name : str
+ Name of the HDF5 group where data will be stored.
+ file_dict : dict
+ Dictionary containing file data to be transferred. Required structure:
+ {
+ 'name': str,
+ 'attributes_dict': dict,
+ 'datasets': [
+ {
+ 'name': str,
+ 'data': array-like,
+ 'shape': tuple,
+ 'attributes': dict (optional)
+ },
+ ...
+ ]
+ }
+
+ Returns
+ -------
+ None
+ """
+
+ if not file_dict:
+ return
+
+ try:
+        # Create the group and add its attributes
+ group = h5file[group_name].create_group(name=file_dict['name'])
+ # Add group attributes
+ group.attrs.update(file_dict['attributes_dict'])
+
+ # Add datasets to the just created group
+ for dataset in file_dict['datasets']:
+ dataset_obj = group.create_dataset(
+ name=dataset['name'],
+ data=dataset['data'],
+ shape=dataset['shape']
+ )
+
+ # Add dataset's attributes
+ attributes = dataset.get('attributes', {})
+ dataset_obj.attrs.update(attributes)
+ group.attrs['last_update_date'] = utils.created_at().encode('utf-8')
+ except Exception as inst:
+ print(inst)
+ logging.error('Failed to transfer data into HDF5: %s', inst)
+
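+# Illustrative sketch of the minimal file_dict structure expected by __transfer_file_dict_to_hdf5.
+# The group name and output file name below are hypothetical placeholders.
+def _example_transfer_file_dict():
+    file_dict = {
+        'name': 'example_file.txt',
+        'attributes_dict': {'startdate': '2024-01-01 00:00:00'},
+        'datasets': [{
+            'name': 'data_table',
+            'data': np.arange(10),
+            'shape': (10,),
+            'attributes': {'units': 'counts'}
+        }]
+    }
+    with h5py.File('example.h5', 'w') as h5file:
+        h5file.create_group('example_instrument')
+        __transfer_file_dict_to_hdf5(h5file, 'example_instrument', file_dict)
+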
+def __copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
+ # Create copy of original file to avoid possible file corruption and work with it.
+
+ if work_with_copy:
+ tmp_file_path = utils.make_file_copy(source_file_path)
+ else:
+ tmp_file_path = source_file_path
+
+    # Open the backup h5 file and copy its complete filesystem directory onto a group in h5file
+ with h5py.File(tmp_file_path,'r') as src_file:
+ dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
+
+ if 'tmp_files' in tmp_file_path:
+ os.remove(tmp_file_path)
+
+
+[docs]
+def create_hdf5_file_from_filesystem_path(path_to_input_directory: str,
+ path_to_filenames_dict: dict = None,
+ select_dir_keywords : list = [],
+ root_metadata_dict : dict = {}, mode = 'w'):
+
+ """
+    Creates an .h5 file, named after the input directory, that preserves the directory tree (or folder structure)
+    of the given filesystem path.
+
+    The data integration capabilities are limited by our file reader, which can only access data from a list of
+    admissible file formats. These, however, can be extended. Directories become groups in the resulting HDF5 file.
+    Files are formatted as composite objects consisting of a group, file, and attributes.
+
+    Parameters
+    ----------
+    path_to_input_directory : str
+        Path to the root directory, specified with forward slashes, e.g., path/to/root.
+
+    path_to_filenames_dict : dict, optional
+        A pre-processed dictionary where keys are directory paths in the input directory's tree and values are lists of files.
+        If provided, the input directory is not scanned again.
+
+    select_dir_keywords : list
+        List of strings used to select only directory paths that contain
+        a word in 'select_dir_keywords'. When empty, all directory paths are
+        included in the HDF5 file group hierarchy.
+    root_metadata_dict : dict
+        Metadata to include at the root level of the HDF5 file.
+
+    mode : str
+        'w' creates the file, truncating it if it exists; 'r+' opens it for read/write, in which case the file must exist. By default, mode = 'w'.
+
+    Returns
+    -------
+    path_to_output_file : str
+        Path to the created HDF5 file.
+ """
+
+
+ if not mode in ['w','r+']:
+        raise ValueError('Parameter mode must take values in ["w","r+"]')
+
+ if not '/' in path_to_input_directory:
+ raise ValueError('path_to_input_directory needs to be specified using forward slashes "/".' )
+
+ #path_to_output_directory = os.path.join(path_to_input_directory,'..')
+ path_to_input_directory = os.path.normpath(path_to_input_directory).rstrip(os.sep)
+
+
+ for i, keyword in enumerate(select_dir_keywords):
+ select_dir_keywords[i] = keyword.replace('/',os.sep)
+
+ if not path_to_filenames_dict:
+        # With dry_run=True, this returns the path-to-files dictionary of the output directory without making an actual copy of the input directory.
+        # Therefore, there won't be a copying conflict even though the input and output directories are set to the same path.
+ path_to_filenames_dict = utils.copy_directory_with_contraints(input_dir_path=path_to_input_directory,
+ output_dir_path=path_to_input_directory,
+ dry_run=True)
+ # Set input_directory as copied input directory
+ root_dir = path_to_input_directory
+ path_to_output_file = path_to_input_directory.rstrip(os.path.sep) + '.h5'
+
+ with h5py.File(path_to_output_file, mode=mode, track_order=True) as h5file:
+
+ number_of_dirs = len(path_to_filenames_dict.keys())
+ dir_number = 1
+ for dirpath, filtered_filenames_list in path_to_filenames_dict.items():
+
+ start_message = f'Starting to transfer files in directory: {dirpath}'
+ end_message = f'\nCompleted transferring files in directory: {dirpath}'
+ # Print and log the start message
+ print(start_message)
+ logging.info(start_message)
+
+ # Check if filtered_filenames_list is nonempty. TODO: This is perhaps redundant by design of path_to_filenames_dict.
+ if not filtered_filenames_list:
+ continue
+
+ group_name = dirpath.replace(os.sep,'/')
+ group_name = group_name.replace(root_dir.replace(os.sep,'/') + '/', '/')
+
+ # Flatten group name to one level
+ if select_dir_keywords:
+ offset = sum([len(i.split(os.sep)) if i in dirpath else 0 for i in select_dir_keywords])
+ else:
+ offset = 1
+ tmp_list = group_name.split('/')
+ if len(tmp_list) > offset+1:
+ group_name = '/'.join([tmp_list[i] for i in range(offset+1)])
+
+ # Group hierarchy is implicitly defined by the forward slashes
+ if not group_name in h5file.keys():
+ h5file.create_group(group_name)
+ h5file[group_name].attrs['creation_date'] = utils.created_at().encode('utf-8')
+ #h5file[group_name].attrs.create(name='filtered_file_list',data=convert_string_to_bytes(filtered_filename_list))
+ #h5file[group_name].attrs.create(name='file_list',data=convert_string_to_bytes(filenames_list))
+ else:
+ print(group_name,' was already created.')
+
+ for filenumber, filename in enumerate(filtered_filenames_list):
+
+ #file_ext = os.path.splitext(filename)[1]
+ #try:
+
+ # hdf5 path to filename group
+ dest_group_name = f'{group_name}/{filename}'
+
+ if not 'h5' in filename:
+ #file_dict = config_file.select_file_readers(group_id)[file_ext](os.path.join(dirpath,filename))
+ #file_dict = ext_to_reader_dict[file_ext](os.path.join(dirpath,filename))
+ file_dict = filereader_registry.select_file_reader(dest_group_name)(os.path.join(dirpath,filename))
+
+ __transfer_file_dict_to_hdf5(h5file, group_name, file_dict)
+
+ else:
+ source_file_path = os.path.join(dirpath,filename)
+ dest_file_obj = h5file
+ #group_name +'/'+filename
+ #ext_to_reader_dict[file_ext](source_file_path, dest_file_obj, dest_group_name)
+ #g5505f_reader.select_file_reader(dest_group_name)(source_file_path, dest_file_obj, dest_group_name)
+ __copy_file_in_group(source_file_path, dest_file_obj, dest_group_name, False)
+
+ # Update the progress bar and log the end message
+ utils.progressBar(dir_number, number_of_dirs, end_message)
+ logging.info(end_message)
+ dir_number = dir_number + 1
+
+
+
+ if len(root_metadata_dict.keys())>0:
+ for key, value in root_metadata_dict.items():
+ #if key in h5file.attrs:
+ # del h5file.attrs[key]
+ h5file.attrs.create(key, value)
+ #annotate_root_dir(output_filename,root_metadata_dict)
+
+
+ #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename)
+
+ return path_to_output_file #, output_yml_filename_path
+
+
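+# Illustrative usage sketch: the input directory, keyword list, and root metadata below are hypothetical
+# placeholders. The output .h5 file is written next to the input directory and named after it.
+def _example_create_hdf5_from_path():
+    output_path = create_hdf5_file_from_filesystem_path(
+        'path/to/campaign_data',
+        select_dir_keywords=['gas', 'smps'],
+        root_metadata_dict={'project': 'IVDAV', 'contact': 'data owner'})
+    print('HDF5 file created at:', output_path)
+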
+
+[docs]
+def save_processed_dataframe_to_hdf5(df, annotator, output_filename): # src_hdf5_path, script_date, script_name):
+ """
+ Save processed dataframe columns with annotations to an HDF5 file.
+
+ Parameters:
+ df (pd.DataFrame): DataFrame containing processed time series.
+    annotator (object): Annotator object exposing a get_metadata() method.
+    output_filename (str): Path to the output HDF5 file.
+ """
+ # Convert datetime columns to string
+ datetime_cols = df.select_dtypes(include=['datetime64']).columns
+
+ if list(datetime_cols):
+ df[datetime_cols] = df[datetime_cols].map(str)
+
+ # Convert dataframe to structured array
+ icad_data_table = utils.convert_dataframe_to_np_structured_array(df)
+
+ # Get metadata
+ metadata_dict = annotator.get_metadata()
+
+ # Prepare project level attributes to be added at the root level
+
+ project_level_attributes = metadata_dict['metadata']['project']
+
+ # Prepare high-level attributes
+ high_level_attributes = {
+ 'parent_files': metadata_dict['parent_files'],
+ **metadata_dict['metadata']['sample'],
+ **metadata_dict['metadata']['environment'],
+ **metadata_dict['metadata']['instruments']
+ }
+
+ # Prepare data level attributes
+ data_level_attributes = metadata_dict['metadata']['datasets']
+
+ for key, value in data_level_attributes.items():
+ if isinstance(value,dict):
+ data_level_attributes[key] = utils.convert_attrdict_to_np_structured_array(value)
+
+
+ # Prepare file dictionary
+ file_dict = {
+ 'name': project_level_attributes['processing_file'],
+ 'attributes_dict': high_level_attributes,
+ 'datasets': [{
+ 'name': "data_table",
+ 'data': icad_data_table,
+ 'shape': icad_data_table.shape,
+ 'attributes': data_level_attributes
+ }]
+ }
+
+ # Check if the file exists
+ if os.path.exists(output_filename):
+ mode = "a"
+ print(f"File {output_filename} exists. Opening in append mode.")
+ else:
+ mode = "w"
+ print(f"File {output_filename} does not exist. Creating a new file.")
+
+
+ # Write to HDF5
+ with h5py.File(output_filename, mode) as h5file:
+ # Add project level attributes at the root/top level
+ h5file.attrs.update(project_level_attributes)
+ __transfer_file_dict_to_hdf5(h5file, '/', file_dict)
+
+
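+# Illustrative sketch of the metadata structure that save_processed_dataframe_to_hdf5 expects from the
+# annotator object. The stub annotator, attribute values, and output path below are hypothetical placeholders.
+class _StubAnnotator:
+    def get_metadata(self):
+        return {
+            'parent_files': ['path/to/raw_file.h5'],
+            'metadata': {
+                'project': {'processing_file': 'processed_data', 'project_name': 'IVDAV'},
+                'sample': {'sample_name': 'ambient aerosol'},
+                'environment': {'temperature_degC': 25},
+                'instruments': {'instrument_name': 'icad'},
+                'datasets': {'no2_ppb': {'units': 'ppb', 'description': 'NO2 mixing ratio'}}
+            }
+        }
+
+def _example_save_processed_dataframe():
+    df = pd.DataFrame({'timestamp': pd.date_range('2024-01-01', periods=3, freq='min'),
+                       'no2_ppb': [10.2, 11.5, 9.8]})
+    save_processed_dataframe_to_hdf5(df, _StubAnnotator(), 'path/to/processed_output.h5')
+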
+#if __name__ == '__main__':
+
-import sys
-import os
-
-try:
- thisFilePath = os.path.abspath(__file__)
-except NameError:
- print("Error: __file__ is not available. Ensure the script is being run from a file.")
- print("[Notice] Path to DIMA package may not be resolved properly.")
- thisFilePath = os.getcwd() # Use current directory or specify a default
-
-dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
-
-if dimaPath not in sys.path: # Avoid duplicate entries
- sys.path.append(dimaPath)
-
-
-import h5py
-import pandas as pd
-import numpy as np
-
-import utils.g5505_utils as utils
-import src.hdf5_writer as hdf5_lib
-import logging
-import datetime
-
-import h5py
-
-import yaml
-import json
-import copy
-
-
-[docs]
-class HDF5DataOpsManager():
-
- """
-    A class for fundamental, mid-level HDF5 file operations that power data updates, metadata revision, and data analysis
-    on HDF5 files encoding multi-instrument experimental campaign data.
-
- Parameters:
- -----------
-    file_path : str
-        Path to the HDF5 file, e.g., path/to/hdf5file.h5.
-    mode : str
-        'r' (read) or 'r+' (read/write); the file must already exist.
- """
- def __init__(self, file_path, mode = 'r+') -> None:
-
- # Class attributes
- if mode in ['r','r+']:
- self.mode = mode
- self.file_path = file_path
- self.file_obj = None
- #self._open_file()
- self.dataset_metadata_df = None
-
- # Define private methods
-
- # Define public methods
-
-
-[docs]
- def load_file_obj(self):
- if self.file_obj is None:
- self.file_obj = h5py.File(self.file_path, self.mode)
-
-
-
-[docs]
- def unload_file_obj(self):
- if self.file_obj:
- self.file_obj.flush() # Ensure all data is written to disk
- self.file_obj.close()
- self.file_obj = None
-
-
-
-[docs]
- def extract_and_load_dataset_metadata(self):
-
- def __get_datasets(name, obj, list_of_datasets):
- if isinstance(obj,h5py.Dataset):
- list_of_datasets.append(name)
- #print(f'Adding dataset: {name}') #tail: {head} head: {tail}')
- list_of_datasets = []
-
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to extract datasets.")
-
- try:
-
- list_of_datasets = []
-
- self.file_obj.visititems(lambda name, obj: __get_datasets(name, obj, list_of_datasets))
-
- dataset_metadata_df = pd.DataFrame({'dataset_name': list_of_datasets})
- dataset_metadata_df['parent_instrument'] = dataset_metadata_df['dataset_name'].apply(lambda x: x.split('/')[-3])
- dataset_metadata_df['parent_file'] = dataset_metadata_df['dataset_name'].apply(lambda x: x.split('/')[-2])
-
- self.dataset_metadata_df = dataset_metadata_df
-
- except Exception as e:
-
- self.unload_file_obj()
- print(f"An unexpected error occurred: {e}. File object will be unloaded.")
-
-
-
-
-
-
-[docs]
- def extract_dataset_as_dataframe(self,dataset_name):
- """
-        Returns a copy of the dataset content as a DataFrame when possible, otherwise as a NumPy array.
- """
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to extract datasets.")
-
- dataset_obj = self.file_obj[dataset_name]
- # Read dataset content from dataset obj
- data = dataset_obj[...]
- # The above statement can be understood as follows:
- # data = np.empty(shape=dataset_obj.shape,
- # dtype=dataset_obj.dtype)
- # dataset_obj.read_direct(data)
-
- try:
- return pd.DataFrame(data)
- except ValueError as e:
- logging.error(f"Failed to convert dataset '{dataset_name}' to DataFrame: {e}. Instead, dataset will be returned as Numpy array.")
- return data # 'data' is a NumPy array here
- except Exception as e:
- self.unload_file_obj()
- print(f"An unexpected error occurred: {e}. Returning None and unloading file object")
- return None
-
-
- # Define metadata revision methods: append(), update(), delete(), and rename().
-
-
-[docs]
- def append_metadata(self, obj_name, annotation_dict):
- """
- Appends metadata attributes to the specified object (obj_name) based on the provided annotation_dict.
-
- This method ensures that the provided metadata attributes do not overwrite any existing ones. If an attribute already exists,
- a ValueError is raised. The function supports storing scalar values (int, float, str) and compound values such as dictionaries
- that are converted into NumPy structured arrays before being added to the metadata.
-
- Parameters:
- -----------
- obj_name: str
- Path to the target object (dataset or group) within the HDF5 file.
-
- annotation_dict: dict
- A dictionary where the keys represent new attribute names (strings), and the values can be:
- - Scalars: int, float, or str.
- - Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
- Example of a compound value:
-
- Example:
- ----------
- annotation_dict = {
- "relative_humidity": {
- "value": 65,
- "units": "percentage",
- "range": "[0,100]",
- "definition": "amount of water vapor present ..."
- }
- }
- """
-
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
-
- # Create a copy of annotation_dict to avoid modifying the original
- annotation_dict_copy = copy.deepcopy(annotation_dict)
-
- try:
- obj = self.file_obj[obj_name]
-
- # Check if any attribute already exists
- if any(key in obj.attrs for key in annotation_dict_copy.keys()):
- raise ValueError("Make sure the provided (key, value) pairs are not existing metadata elements or attributes. To modify or delete existing attributes use .modify_annotation() or .delete_annotation()")
-
- # Process the dictionary values and convert them to structured arrays if needed
- for key, value in annotation_dict_copy.items():
- if isinstance(value, dict):
- # Convert dictionaries to NumPy structured arrays for complex attributes
- annotation_dict_copy[key] = utils.convert_attrdict_to_np_structured_array(value)
-
- # Update the object's attributes with the new metadata
- obj.attrs.update(annotation_dict_copy)
-
- except Exception as e:
- self.unload_file_obj()
- print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
-
-
-
-
-[docs]
- def update_metadata(self, obj_name, annotation_dict):
- """
- Updates the value of existing metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
-
- The function disregards non-existing attributes and suggests to use the append_metadata() method to include those in the metadata.
-
- Parameters:
- -----------
- obj_name : str
- Path to the target object (dataset or group) within the HDF5 file.
-
- annotation_dict: dict
- A dictionary where the keys represent existing attribute names (strings), and the values can be:
- - Scalars: int, float, or str.
- - Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
- Example of a compound value:
-
- Example:
- ----------
- annotation_dict = {
- "relative_humidity": {
- "value": 65,
- "units": "percentage",
- "range": "[0,100]",
- "definition": "amount of water vapor present ..."
- }
- }
-
-
- """
-
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
-
- update_dict = {}
-
- try:
-
- obj = self.file_obj[obj_name]
- for key, value in annotation_dict.items():
- if key in obj.attrs:
- if isinstance(value, dict):
- update_dict[key] = utils.convert_attrdict_to_np_structured_array(value)
- else:
- update_dict[key] = value
- else:
- # Optionally, log or warn about non-existing keys being ignored.
- print(f"Warning: Key '{key}' does not exist and will be ignored.")
-
- obj.attrs.update(update_dict)
-
- except Exception as e:
- self.unload_file_obj()
- print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
-
-
-
-[docs]
- def delete_metadata(self, obj_name, annotation_dict):
- """
- Deletes metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
-
- Parameters:
- -----------
- obj_name: str
- Path to the target object (dataset or group) within the HDF5 file.
-
- annotation_dict: dict
- Dictionary where keys represent attribute names, and values should be dictionaries containing
- {"delete": True} to mark them for deletion.
-
- Example:
- --------
- annotation_dict = {"attr_to_be_deleted": {"delete": True}}
-
- Behavior:
- ---------
- - Deletes the specified attributes from the object's metadata if marked for deletion.
- - Issues a warning if the attribute is not found or not marked for deletion.
- """
-
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
-
- try:
- obj = self.file_obj[obj_name]
- for attr_key, value in annotation_dict.items():
- if attr_key in obj.attrs:
- if isinstance(value, dict) and value.get('delete', False):
- obj.attrs.__delitem__(attr_key)
- else:
- msg = f"Warning: Value for key '{attr_key}' is not marked for deletion or is invalid."
- print(msg)
- else:
- msg = f"Warning: Key '{attr_key}' does not exist in metadata."
- print(msg)
-
- except Exception as e:
- self.unload_file_obj()
- print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
-
-
-
-
-[docs]
- def rename_metadata(self, obj_name, renaming_map):
- """
- Renames metadata attributes of the specified object (obj_name) based on the provided renaming_map.
-
- Parameters:
- -----------
- obj_name: str
- Path to the target object (dataset or group) within the HDF5 file.
-
- renaming_map: dict
- A dictionary where keys are current attribute names (strings), and values are the new attribute names (strings or byte strings) to rename to.
-
- Example:
- --------
- renaming_map = {
- "old_attr_name": "new_attr_name",
- "old_attr_2": "new_attr_2"
- }
-
- """
-
- if self.file_obj is None:
- raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
-
- try:
- obj = self.file_obj[obj_name]
- # Iterate over the renaming_map to process renaming
- for old_attr, new_attr in renaming_map.items():
- if old_attr in obj.attrs:
- # Get the old attribute's value
- attr_value = obj.attrs[old_attr]
-
- # Create a new attribute with the new name
- obj.attrs.create(new_attr, data=attr_value)
-
- # Delete the old attribute
- obj.attrs.__delitem__(old_attr)
- else:
- # Skip if the old attribute doesn't exist
- msg = f"Skipping: Attribute '{old_attr}' does not exist."
- print(msg) # Optionally, replace with warnings.warn(msg)
- except Exception as e:
- self.unload_file_obj()
- print(
- f"An unexpected error occurred: {e}. The file object has been properly closed. "
- "Please ensure that 'obj_name' exists in the file, and that the keys in 'renaming_map' are valid attributes of the object."
- )
-
- self.unload_file_obj()
-
-
-
-[docs]
- def get_metadata(self, obj_path):
- """ Get file attributes from object at path = obj_path. For example,
- obj_path = '/' will get root level attributes or metadata.
- """
- try:
- # Access the attributes for the object at the given path
- metadata_dict = self.file_obj[obj_path].attrs
- except KeyError:
- # Handle the case where the path doesn't exist
- logging.error(f'Invalid object path: {obj_path}')
- metadata_dict = {}
-
- return metadata_dict
-
-
-
-
-[docs]
- def reformat_datetime_column(self, dataset_name, column_name, src_format, desired_format='%Y-%m-%d %H:%M:%S.%f'):
- # Access the dataset
- dataset = self.file_obj[dataset_name]
-
- # Read the column data into a pandas Series and decode bytes to strings
- dt_column_data = pd.Series(dataset[column_name][:]).apply(lambda x: x.decode() )
-
- # Convert to datetime using the source format
- dt_column_data = pd.to_datetime(dt_column_data, format=src_format, errors = 'coerce')
-
- # Reformat datetime objects to the desired format as strings
- dt_column_data = dt_column_data.dt.strftime(desired_format)
-
- # Encode the strings back to bytes
- #encoded_data = dt_column_data.apply(lambda x: x.encode() if not pd.isnull(x) else 'N/A').to_numpy()
-
- # Update the dataset in place
- #dataset[column_name][:] = encoded_data
-
- # Convert byte strings to datetime objects
- #timestamps = [datetime.datetime.strptime(a.decode(), src_format).strftime(desired_format) for a in dt_column_data]
-
- #datetime.strptime('31/01/22 23:59:59.999999',
- # '%d/%m/%y %H:%M:%S.%f')
-
- #pd.to_datetime(
- # np.array([a.decode() for a in dt_column_data]),
- # format=src_format,
- # errors='coerce'
- #)
-
-
- # Standardize the datetime format
- #standardized_time = datetime.strftime(desired_format)
-
- # Convert to byte strings to store back in the HDF5 dataset
- #standardized_time_bytes = np.array([s.encode() for s in timestamps])
-
- # Update the column in the dataset (in-place update)
- # TODO: make this a more secure operation
- #dataset[column_name][:] = standardized_time_bytes
-
- #return np.array(timestamps)
- return dt_column_data.to_numpy()
-
-
- # Define data append operations: append_dataset(), and update_file()
-
-
-[docs]
- def append_dataset(self,dataset_dict, group_name):
-
- # Parse value into HDF5 admissible type
- for key in dataset_dict['attributes'].keys():
- value = dataset_dict['attributes'][key]
- if isinstance(value, dict):
- dataset_dict['attributes'][key] = utils.convert_attrdict_to_np_structured_array(value)
-
- if not group_name in self.file_obj:
- self.file_obj.create_group(group_name, track_order=True)
- self.file_obj[group_name].attrs['creation_date'] = utils.created_at().encode("utf-8")
-
- self.file_obj[group_name].create_dataset(dataset_dict['name'], data=dataset_dict['data'])
- self.file_obj[group_name][dataset_dict['name']].attrs.update(dataset_dict['attributes'])
- self.file_obj[group_name].attrs['last_update_date'] = utils.created_at().encode("utf-8")
-
-
-
-[docs]
- def update_file(self, path_to_append_dir):
- # Split the reference file path and the append directory path into directories and filenames
- ref_tail, ref_head = os.path.split(self.file_path)
- ref_head_filename, head_ext = os.path.splitext(ref_head)
- tail, head = os.path.split(path_to_append_dir)
-
-
- # Ensure the append directory is in the same directory as the reference file and has the same name (without extension)
- if not (ref_tail == tail and ref_head_filename == head):
- raise ValueError("The append directory must be in the same directory as the reference HDF5 file and have the same name without the extension.")
-
- # Close the file if it's already open
- if self.file_obj is not None:
- self.unload_file_obj()
-
- # Attempt to open the file in 'r+' mode for appending
- try:
- hdf5_lib.create_hdf5_file_from_filesystem_path(path_to_append_dir, mode='r+')
- except FileNotFoundError:
- raise FileNotFoundError(f"Reference HDF5 file '{self.file_path}' not found.")
- except OSError as e:
- raise OSError(f"Error opening HDF5 file: {e}")
-
-
-
-
-
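-# Illustrative usage sketch of a typical load/annotate/unload cycle with HDF5DataOpsManager.
-# The file path, attribute key, and attribute value below are hypothetical placeholders.
-def _example_hdf5_data_ops():
-    dataOpsObj = HDF5DataOpsManager('path/to/campaign.h5', mode='r+')
-    dataOpsObj.load_file_obj()
-    try:
-        dataOpsObj.append_metadata('/', {'data_owner': 'Multiphase Chemistry Group'})
-        dataOpsObj.extract_and_load_dataset_metadata()
-        print(dataOpsObj.dataset_metadata_df.head())
-    finally:
-        dataOpsObj.unload_file_obj()
-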
-
-[docs]
-def get_parent_child_relationships(file: h5py.File):
-
- nodes = ['/']
- parent = ['']
- #values = [file.attrs['count']]
- # TODO: maybe we should make this more general and not dependent on file_list attribute?
- #if 'file_list' in file.attrs.keys():
- # values = [len(file.attrs['file_list'])]
- #else:
- # values = [1]
- values = [len(file.keys())]
-
- def node_visitor(name,obj):
- if name.count('/') <=2:
- nodes.append(obj.name)
- parent.append(obj.parent.name)
- #nodes.append(os.path.split(obj.name)[1])
- #parent.append(os.path.split(obj.parent.name)[1])
-
- if isinstance(obj,h5py.Dataset):# or not 'file_list' in obj.attrs.keys():
- values.append(1)
- else:
- print(obj.name)
- try:
- values.append(len(obj.keys()))
- except:
- values.append(0)
-
- file.visititems(node_visitor)
-
- return nodes, parent, values
-
-
-
-def __print_metadata__(name, obj, folder_depth, yaml_dict):
-
- """
- Extracts metadata from HDF5 groups and datasets and organizes them into a dictionary with compact representation.
-
- Parameters:
- -----------
- name (str): Name of the HDF5 object being inspected.
- obj (h5py.Group or h5py.Dataset): The HDF5 object (Group or Dataset).
- folder_depth (int): Maximum depth of folders to explore.
- yaml_dict (dict): Dictionary to populate with metadata.
- """
- # Process only objects within the specified folder depth
- if len(obj.name.split('/')) <= folder_depth: # and ".h5" not in obj.name:
- name_to_list = obj.name.split('/')
- name_head = name_to_list[-1] if not name_to_list[-1]=='' else obj.name
-
- if isinstance(obj, h5py.Group): # Handle groups
- # Convert attributes to a YAML/JSON serializable format
- attr_dict = {key: utils.to_serializable_dtype(val) for key, val in obj.attrs.items()}
-
- # Initialize the group dictionary
- group_dict = {"name": name_head, "attributes": attr_dict}
-
- # Handle group members compactly
- #subgroups = [member_name for member_name in obj if isinstance(obj[member_name], h5py.Group)]
- #datasets = [member_name for member_name in obj if isinstance(obj[member_name], h5py.Dataset)]
-
- # Summarize groups and datasets
- #group_dict["content_summary"] = {
- # "group_count": len(subgroups),
- # "group_preview": subgroups[:3] + (["..."] if len(subgroups) > 3 else []),
- # "dataset_count": len(datasets),
- # "dataset_preview": datasets[:3] + (["..."] if len(datasets) > 3 else [])
- #}
-
- yaml_dict[obj.name] = group_dict
-
- elif isinstance(obj, h5py.Dataset): # Handle datasets
- # Convert attributes to a YAML/JSON serializable format
- attr_dict = {key: utils.to_serializable_dtype(val) for key, val in obj.attrs.items()}
-
- dataset_dict = {"name": name_head, "attributes": attr_dict}
-
- yaml_dict[obj.name] = dataset_dict
-
-
-
-
-[docs]
-def serialize_metadata(input_filename_path, folder_depth: int = 4, output_format: str = 'yaml') -> str:
- """
- Serialize metadata from an HDF5 file into YAML or JSON format.
-
- Parameters
- ----------
- input_filename_path : str
- The path to the input HDF5 file.
- folder_depth : int, optional
- The folder depth to control how much of the HDF5 file hierarchy is traversed (default is 4).
- output_format : str, optional
- The format to serialize the output, either 'yaml' or 'json' (default is 'yaml').
-
- Returns
- -------
- str
- The output file path where the serialized metadata is stored (either .yaml or .json).
-
- """
-
- # Choose the appropriate output format (YAML or JSON)
- if output_format not in ['yaml', 'json']:
- raise ValueError("Unsupported format. Please choose either 'yaml' or 'json'.")
-
- # Initialize dictionary to store YAML/JSON data
- yaml_dict = {}
-
- # Split input file path to get the output file's base name
- output_filename_tail, ext = os.path.splitext(input_filename_path)
-
- # Open the HDF5 file and extract metadata
- with h5py.File(input_filename_path, 'r') as f:
- # Convert attribute dict to a YAML/JSON serializable dict
- #attrs_dict = {key: utils.to_serializable_dtype(val) for key, val in f.attrs.items()}
- #yaml_dict[f.name] = {
- # "name": f.name,
- # "attributes": attrs_dict,
- # "datasets": {}
- #}
- __print_metadata__(f.name, f, folder_depth, yaml_dict)
- # Traverse HDF5 file hierarchy and add datasets
- f.visititems(lambda name, obj: __print_metadata__(name, obj, folder_depth, yaml_dict))
-
-
- # Serialize and write the data
- output_file_path = output_filename_tail + '.' + output_format
- with open(output_file_path, 'w') as output_file:
- if output_format == 'json':
- json_output = json.dumps(yaml_dict, indent=4, sort_keys=False)
- output_file.write(json_output)
- elif output_format == 'yaml':
- yaml_output = yaml.dump(yaml_dict, sort_keys=False)
- output_file.write(yaml_output)
-
- return output_file_path
-
-
-
-
-[docs]
-def get_groups_at_a_level(file: h5py.File, level: str):
-
- groups = []
- def node_selector(name, obj):
- if name.count('/') == level:
- print(name)
- groups.append(obj.name)
-
- file.visititems(node_selector)
- #file.visititems()
- return groups
-
-
-
-[docs]
-def read_mtable_as_dataframe(filename):
-
- """
- Reconstruct a MATLAB Table encoded in a .h5 file as a Pandas DataFrame.
-
- This function reads a .h5 file containing a MATLAB Table and reconstructs it as a Pandas DataFrame.
- The input .h5 file contains one group per row of the MATLAB Table. Each group stores the table's
- dataset-like variables as Datasets, while categorical and numerical variables are represented as
- attributes of the respective group.
-
- To ensure homogeneity of data columns, the DataFrame is constructed column-wise.
-
- Parameters
- ----------
- filename : str
- The name of the .h5 file. This may include the file's location and path information.
-
- Returns
- -------
- pd.DataFrame
- The MATLAB Table reconstructed as a Pandas DataFrame.
- """
-
-
- #contructs dataframe by filling out entries columnwise. This way we can ensure homogenous data columns"""
-
- with h5py.File(filename,'r') as file:
-
- # Define group's attributes and datasets. This should hold
- # for all groups. TODO: implement verification and noncompliance error if needed.
- group_list = list(file.keys())
- group_attrs = list(file[group_list[0]].attrs.keys())
- #
- column_attr_names = [item[item.find('_')+1::] for item in group_attrs]
- column_attr_names_idx = [int(item[4:(item.find('_'))]) for item in group_attrs]
-
- group_datasets = list(file[group_list[0]].keys()) if not 'DS_EMPTY' in file[group_list[0]].keys() else []
- #
- column_dataset_names = [file[group_list[0]][item].attrs['column_name'] for item in group_datasets]
- column_dataset_names_idx = [int(item[2:]) for item in group_datasets]
-
-
- # Define data_frame as group_attrs + group_datasets
- #pd_series_index = group_attrs + group_datasets
- pd_series_index = column_attr_names + column_dataset_names
-
- output_dataframe = pd.DataFrame(columns=pd_series_index,index=group_list)
-
- tmp_col = []
-
- for meas_prop in group_attrs + group_datasets:
- if meas_prop in group_attrs:
- column_label = meas_prop[meas_prop.find('_')+1:]
- # Create numerical or categorical column from group's attributes
- tmp_col = [file[group_key].attrs[meas_prop][()][0] for group_key in group_list]
- else:
- # Create dataset column from group's datasets
- column_label = file[group_list[0] + '/' + meas_prop].attrs['column_name']
- #tmp_col = [file[group_key + '/' + meas_prop][()][0] for group_key in group_list]
- tmp_col = [file[group_key + '/' + meas_prop][()] for group_key in group_list]
-
- output_dataframe.loc[:,column_label] = tmp_col
-
- return output_dataframe
-
-
-if __name__ == "__main__":
- if len(sys.argv) < 5:
- print("Usage: python hdf5_ops.py serialize <path/to/target_file.hdf5> <folder_depth : int = 2> <format=json|yaml>")
- sys.exit(1)
-
- if sys.argv[1] == 'serialize':
- input_hdf5_file = sys.argv[2]
- folder_depth = int(sys.argv[3])
- file_format = sys.argv[4]
-
- try:
- # Call the serialize_metadata function and capture the output path
- path_to_file = serialize_metadata(input_hdf5_file,
- folder_depth = folder_depth,
- output_format=file_format)
- print(f"Metadata serialized to {path_to_file}")
- except Exception as e:
- print(f"An error occurred during serialization: {e}")
- sys.exit(1)
-
- #run(sys.argv[2])
-
-
+import sys
+import os
+
+try:
+ thisFilePath = os.path.abspath(__file__)
+except NameError:
+ print("Error: __file__ is not available. Ensure the script is being run from a file.")
+ print("[Notice] Path to DIMA package may not be resolved properly.")
+ thisFilePath = os.getcwd() # Use current directory or specify a default
+
+dimaPath = os.path.normpath(os.path.join(thisFilePath, "..",'..')) # Move up to project root
+
+if dimaPath not in sys.path: # Avoid duplicate entries
+ sys.path.append(dimaPath)
+
+
+import h5py
+import pandas as pd
+import numpy as np
+
+import utils.g5505_utils as utils
+import src.hdf5_writer as hdf5_lib
+import logging
+import datetime
+
+import yaml
+import json
+import copy
+
+
+[docs]
+class HDF5DataOpsManager():
+
+ """
+ A class for fundamental mid-level HDF5 file operations that power data updates, metadata revision, and data analysis
+ on HDF5 files encoding multi-instrument experimental campaign data.
+
+ Parameters:
+ -----------
+ file_path : str
+ Path to the HDF5 file, e.g., path/to/file.h5.
+ mode : str
+ 'r' (read-only) or 'r+' (read/write). The file must already exist.
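+
+ Example (illustrative sketch; the file path and dataset path below are hypothetical):
+
+ >>> dataOpsObj = HDF5DataOpsManager('path/to/file.h5', mode='r+')
+ >>> dataOpsObj.load_file_obj()
+ >>> df = dataOpsObj.extract_dataset_as_dataframe('/instFolder/file.txt/data_table')
+ >>> dataOpsObj.unload_file_obj()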
+ """
+ def __init__(self, file_path, mode = 'r+') -> None:
+
+ # Class attributes
+ if mode in ['r','r+']:
+ self.mode = mode
+ self.file_path = file_path
+ self.file_obj = None
+ #self._open_file()
+ self.dataset_metadata_df = None
+
+ # Define private methods
+
+ # Define public methods
+
+
+[docs]
+ def load_file_obj(self):
+ if self.file_obj is None:
+ self.file_obj = h5py.File(self.file_path, self.mode)
+
+
+
+[docs]
+ def unload_file_obj(self):
+ if self.file_obj:
+ self.file_obj.flush() # Ensure all data is written to disk
+ self.file_obj.close()
+ self.file_obj = None
+
+
+
+[docs]
+ def extract_and_load_dataset_metadata(self):
+
+ def __get_datasets(name, obj, list_of_datasets):
+ if isinstance(obj,h5py.Dataset):
+ list_of_datasets.append(name)
+ #print(f'Adding dataset: {name}') #tail: {head} head: {tail}')
+ list_of_datasets = []
+
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to extract datasets.")
+
+ try:
+
+ list_of_datasets = []
+
+ self.file_obj.visititems(lambda name, obj: __get_datasets(name, obj, list_of_datasets))
+
+ dataset_metadata_df = pd.DataFrame({'dataset_name': list_of_datasets})
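+ # Assumes dataset paths follow the layout <instrument_folder>/<file_name>/<dataset_name>,
+ # so the parent instrument and parent file can be recovered from the path components below.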
+ dataset_metadata_df['parent_instrument'] = dataset_metadata_df['dataset_name'].apply(lambda x: x.split('/')[-3])
+ dataset_metadata_df['parent_file'] = dataset_metadata_df['dataset_name'].apply(lambda x: x.split('/')[-2])
+
+ self.dataset_metadata_df = dataset_metadata_df
+
+ except Exception as e:
+
+ self.unload_file_obj()
+ print(f"An unexpected error occurred: {e}. File object will be unloaded.")
+
+
+
+
+
+
+[docs]
+ def extract_dataset_as_dataframe(self,dataset_name):
+ """
+ Returns a copy of the dataset content as a pandas DataFrame when conversion is possible; otherwise, returns the content as a NumPy array.
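+
+ Example (illustrative sketch; the manager object and dataset path are hypothetical):
+
+ >>> df = dataOpsObj.extract_dataset_as_dataframe('/instFolder/file.txt/data_table')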
+ """
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to extract datasets.")
+
+ dataset_obj = self.file_obj[dataset_name]
+ # Read dataset content from dataset obj
+ data = dataset_obj[...]
+ # The above statement can be understood as follows:
+ # data = np.empty(shape=dataset_obj.shape,
+ # dtype=dataset_obj.dtype)
+ # dataset_obj.read_direct(data)
+
+ try:
+ return pd.DataFrame(data)
+ except ValueError as e:
+ logging.error(f"Failed to convert dataset '{dataset_name}' to DataFrame: {e}. Instead, dataset will be returned as Numpy array.")
+ return data # 'data' is a NumPy array here
+ except Exception as e:
+ self.unload_file_obj()
+ print(f"An unexpected error occurred: {e}. Returning None and unloading file object")
+ return None
+
+
+ # Define metadata revision methods: append(), update(), delete(), and rename().
+
+
+[docs]
+ def append_metadata(self, obj_name, annotation_dict):
+ """
+ Appends metadata attributes to the specified object (obj_name) based on the provided annotation_dict.
+
+ This method ensures that the provided metadata attributes do not overwrite any existing ones. If an attribute already exists,
+ a ValueError is raised. The function supports storing scalar values (int, float, str) and compound values such as dictionaries
+ that are converted into NumPy structured arrays before being added to the metadata.
+
+ Parameters:
+ -----------
+ obj_name: str
+ Path to the target object (dataset or group) within the HDF5 file.
+
+ annotation_dict: dict
+ A dictionary where the keys represent new attribute names (strings), and the values can be:
+ - Scalars: int, float, or str.
+ - Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
+
+ Example:
+ --------
+ annotation_dict = {
+ "relative_humidity": {
+ "value": 65,
+ "units": "percentage",
+ "range": "[0,100]",
+ "definition": "amount of water vapor present ..."
+ }
+ }
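+
+ A call sketch, assuming a hypothetical target path:
+
+ >>> dataOpsObj.append_metadata('/instFolder/file.txt/data_table', annotation_dict)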
+ """
+
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
+
+ # Create a copy of annotation_dict to avoid modifying the original
+ annotation_dict_copy = copy.deepcopy(annotation_dict)
+
+ try:
+ obj = self.file_obj[obj_name]
+
+ # Check if any attribute already exists
+ if any(key in obj.attrs for key in annotation_dict_copy.keys()):
+ raise ValueError("Make sure the provided (key, value) pairs are not existing metadata elements or attributes. To modify or delete existing attributes use .modify_annotation() or .delete_annotation()")
+
+ # Process the dictionary values and convert them to structured arrays if needed
+ for key, value in annotation_dict_copy.items():
+ if isinstance(value, dict):
+ # Convert dictionaries to NumPy structured arrays for complex attributes
+ annotation_dict_copy[key] = utils.convert_attrdict_to_np_structured_array(value)
+
+ # Update the object's attributes with the new metadata
+ obj.attrs.update(annotation_dict_copy)
+
+ except Exception as e:
+ self.unload_file_obj()
+ print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
+
+
+
+
+[docs]
+ def update_metadata(self, obj_name, annotation_dict):
+ """
+ Updates the value of existing metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
+
+ The function ignores non-existing attributes and suggests using the append_metadata() method to add them.
+
+ Parameters:
+ -----------
+ obj_name : str
+ Path to the target object (dataset or group) within the HDF5 file.
+
+ annotation_dict: dict
+ A dictionary where the keys represent existing attribute names (strings), and the values can be:
+ - Scalars: int, float, or str.
+ - Compound values (dictionaries) for more complex metadata, which are converted to NumPy structured arrays.
+
+ Example:
+ --------
+ annotation_dict = {
+ "relative_humidity": {
+ "value": 65,
+ "units": "percentage",
+ "range": "[0,100]",
+ "definition": "amount of water vapor present ..."
+ }
+ }
+
+
+ """
+
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
+
+ update_dict = {}
+
+ try:
+
+ obj = self.file_obj[obj_name]
+ for key, value in annotation_dict.items():
+ if key in obj.attrs:
+ if isinstance(value, dict):
+ update_dict[key] = utils.convert_attrdict_to_np_structured_array(value)
+ else:
+ update_dict[key] = value
+ else:
+ # Optionally, log or warn about non-existing keys being ignored.
+ print(f"Warning: Key '{key}' does not exist and will be ignored.")
+
+ obj.attrs.update(update_dict)
+
+ except Exception as e:
+ self.unload_file_obj()
+ print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
+
+
+
+[docs]
+ def delete_metadata(self, obj_name, annotation_dict):
+ """
+ Deletes metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
+
+ Parameters:
+ -----------
+ obj_name: str
+ Path to the target object (dataset or group) within the HDF5 file.
+
+ annotation_dict: dict
+ Dictionary where keys represent attribute names, and values should be dictionaries containing
+ {"delete": True} to mark them for deletion.
+
+ Example:
+ --------
+ annotation_dict = {"attr_to_be_deleted": {"delete": True}}
+
+ Behavior:
+ ---------
+ - Deletes the specified attributes from the object's metadata if marked for deletion.
+ - Issues a warning if the attribute is not found or not marked for deletion.
+ """
+
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
+
+ try:
+ obj = self.file_obj[obj_name]
+ for attr_key, value in annotation_dict.items():
+ if attr_key in obj.attrs:
+ if isinstance(value, dict) and value.get('delete', False):
+ obj.attrs.__delitem__(attr_key)
+ else:
+ msg = f"Warning: Value for key '{attr_key}' is not marked for deletion or is invalid."
+ print(msg)
+ else:
+ msg = f"Warning: Key '{attr_key}' does not exist in metadata."
+ print(msg)
+
+ except Exception as e:
+ self.unload_file_obj()
+ print(f"An unexpected error occurred: {e}. The file object has been properly closed.")
+
+
+
+
+[docs]
+ def rename_metadata(self, obj_name, renaming_map):
+ """
+ Renames metadata attributes of the specified object (obj_name) based on the provided renaming_map.
+
+ Parameters:
+ -----------
+ obj_name: str
+ Path to the target object (dataset or group) within the HDF5 file.
+
+ renaming_map: dict
+ A dictionary where keys are current attribute names (strings), and values are the new attribute names (strings or byte strings) to rename to.
+
+ Example:
+ --------
+ renaming_map = {
+ "old_attr_name": "new_attr_name",
+ "old_attr_2": "new_attr_2"
+ }
+
+ """
+
+ if self.file_obj is None:
+ raise RuntimeError("File object is not loaded. Please load the HDF5 file using the 'load_file_obj' method before attempting to modify it.")
+
+ try:
+ obj = self.file_obj[obj_name]
+ # Iterate over the renaming_map to process renaming
+ for old_attr, new_attr in renaming_map.items():
+ if old_attr in obj.attrs:
+ # Get the old attribute's value
+ attr_value = obj.attrs[old_attr]
+
+ # Create a new attribute with the new name
+ obj.attrs.create(new_attr, data=attr_value)
+
+ # Delete the old attribute
+ obj.attrs.__delitem__(old_attr)
+ else:
+ # Skip if the old attribute doesn't exist
+ msg = f"Skipping: Attribute '{old_attr}' does not exist."
+ print(msg) # Optionally, replace with warnings.warn(msg)
+ except Exception as e:
+ self.unload_file_obj()
+ print(
+ f"An unexpected error occurred: {e}. The file object has been properly closed. "
+ "Please ensure that 'obj_name' exists in the file, and that the keys in 'renaming_map' are valid attributes of the object."
+ )
+
+ self.unload_file_obj()
+
+
+
+[docs]
+ def get_metadata(self, obj_path):
+ """ Get file attributes from object at path = obj_path. For example,
+ obj_path = '/' will get root level attributes or metadata.
+ """
+ try:
+ # Access the attributes for the object at the given path
+ metadata_dict = self.file_obj[obj_path].attrs
+ except KeyError:
+ # Handle the case where the path doesn't exist
+ logging.error(f'Invalid object path: {obj_path}')
+ metadata_dict = {}
+
+ return metadata_dict
+
+
+
+
+[docs]
+ def reformat_datetime_column(self, dataset_name, column_name, src_format, desired_format='%Y-%m-%d %H:%M:%S.%f'):
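+ """
+ Reformat a byte-encoded datetime column of a dataset into the desired string format.
+
+ The column is read from the dataset at 'dataset_name', decoded, parsed with 'src_format'
+ (unparseable entries are coerced), and reformatted to 'desired_format'. The reformatted
+ values are returned as a NumPy array; the dataset itself is not modified in place.
+ """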
+ # Access the dataset
+ dataset = self.file_obj[dataset_name]
+
+ # Read the column data into a pandas Series and decode bytes to strings
+ dt_column_data = pd.Series(dataset[column_name][:]).apply(lambda x: x.decode() )
+
+ # Convert to datetime using the source format
+ dt_column_data = pd.to_datetime(dt_column_data, format=src_format, errors = 'coerce')
+
+ # Reformat datetime objects to the desired format as strings
+ dt_column_data = dt_column_data.dt.strftime(desired_format)
+
+ # Encode the strings back to bytes
+ #encoded_data = dt_column_data.apply(lambda x: x.encode() if not pd.isnull(x) else 'N/A').to_numpy()
+
+ # Update the dataset in place
+ #dataset[column_name][:] = encoded_data
+
+ # Convert byte strings to datetime objects
+ #timestamps = [datetime.datetime.strptime(a.decode(), src_format).strftime(desired_format) for a in dt_column_data]
+
+ #datetime.strptime('31/01/22 23:59:59.999999',
+ # '%d/%m/%y %H:%M:%S.%f')
+
+ #pd.to_datetime(
+ # np.array([a.decode() for a in dt_column_data]),
+ # format=src_format,
+ # errors='coerce'
+ #)
+
+
+ # Standardize the datetime format
+ #standardized_time = datetime.strftime(desired_format)
+
+ # Convert to byte strings to store back in the HDF5 dataset
+ #standardized_time_bytes = np.array([s.encode() for s in timestamps])
+
+ # Update the column in the dataset (in-place update)
+ # TODO: make this a more secure operation
+ #dataset[column_name][:] = standardized_time_bytes
+
+ #return np.array(timestamps)
+ return dt_column_data.to_numpy()
+
+
+ # Define data append operations: append_dataset(), and update_file()
+
+
+[docs]
+ def append_dataset(self,dataset_dict, group_name):
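+ """
+ Append a dataset to 'group_name', creating the group (with a creation_date attribute) if it does not exist.
+
+ 'dataset_dict' is expected to provide 'name', 'data', and 'attributes' keys; dictionary-valued
+ attributes are converted to NumPy structured arrays before being stored. The group's
+ last_update_date attribute is refreshed after the dataset is added.
+ """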
+
+ # Parse value into HDF5 admissible type
+ for key in dataset_dict['attributes'].keys():
+ value = dataset_dict['attributes'][key]
+ if isinstance(value, dict):
+ dataset_dict['attributes'][key] = utils.convert_attrdict_to_np_structured_array(value)
+
+ if not group_name in self.file_obj:
+ self.file_obj.create_group(group_name, track_order=True)
+ self.file_obj[group_name].attrs['creation_date'] = utils.created_at().encode("utf-8")
+
+ self.file_obj[group_name].create_dataset(dataset_dict['name'], data=dataset_dict['data'])
+ self.file_obj[group_name][dataset_dict['name']].attrs.update(dataset_dict['attributes'])
+ self.file_obj[group_name].attrs['last_update_date'] = utils.created_at().encode("utf-8")
+
+
+
+[docs]
+ def update_file(self, path_to_append_dir):
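+ """
+ Append the contents of 'path_to_append_dir' to the HDF5 file associated with this manager.
+
+ The append directory must live in the same directory as the HDF5 file and share its name
+ (without the .h5 extension); otherwise a ValueError is raised. Any open file handle is closed
+ before the file is reopened in 'r+' mode for appending.
+ """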
+ # Split the reference file path and the append directory path into directories and filenames
+ ref_tail, ref_head = os.path.split(self.file_path)
+ ref_head_filename, head_ext = os.path.splitext(ref_head)
+ tail, head = os.path.split(path_to_append_dir)
+
+
+ # Ensure the append directory is in the same directory as the reference file and has the same name (without extension)
+ if not (ref_tail == tail and ref_head_filename == head):
+ raise ValueError("The append directory must be in the same directory as the reference HDF5 file and have the same name without the extension.")
+
+ # Close the file if it's already open
+ if self.file_obj is not None:
+ self.unload_file_obj()
+
+ # Attempt to open the file in 'r+' mode for appending
+ try:
+ hdf5_lib.create_hdf5_file_from_filesystem_path(path_to_append_dir, mode='r+')
+ except FileNotFoundError:
+ raise FileNotFoundError(f"Reference HDF5 file '{self.file_path}' not found.")
+ except OSError as e:
+ raise OSError(f"Error opening HDF5 file: {e}")
+
+
+
+
+
+
+[docs]
+def get_parent_child_relationships(file: h5py.File):
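+ """
+ Walk the HDF5 file down to three levels below the root and return three parallel lists:
+ node paths, their parent paths, and a per-node count (number of children for groups, 1 for datasets).
+ These lists are used to build the treemap visualization of the file hierarchy.
+ """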
+
+ nodes = ['/']
+ parent = ['']
+ #values = [file.attrs['count']]
+ # TODO: maybe we should make this more general and not dependent on file_list attribute?
+ #if 'file_list' in file.attrs.keys():
+ # values = [len(file.attrs['file_list'])]
+ #else:
+ # values = [1]
+ values = [len(file.keys())]
+
+ def node_visitor(name,obj):
+ if name.count('/') <=2:
+ nodes.append(obj.name)
+ parent.append(obj.parent.name)
+ #nodes.append(os.path.split(obj.name)[1])
+ #parent.append(os.path.split(obj.parent.name)[1])
+
+ if isinstance(obj,h5py.Dataset):# or not 'file_list' in obj.attrs.keys():
+ values.append(1)
+ else:
+ print(obj.name)
+ try:
+ values.append(len(obj.keys()))
+ except:
+ values.append(0)
+
+ file.visititems(node_visitor)
+
+ return nodes, parent, values
+
+
+
+def __print_metadata__(name, obj, folder_depth, yaml_dict):
+
+ """
+ Extracts metadata from HDF5 groups and datasets and organizes them into a dictionary with compact representation.
+
+ Parameters:
+ -----------
+ name (str): Name of the HDF5 object being inspected.
+ obj (h5py.Group or h5py.Dataset): The HDF5 object (Group or Dataset).
+ folder_depth (int): Maximum depth of folders to explore.
+ yaml_dict (dict): Dictionary to populate with metadata.
+ """
+ # Process only objects within the specified folder depth
+ if len(obj.name.split('/')) <= folder_depth: # and ".h5" not in obj.name:
+ name_to_list = obj.name.split('/')
+ name_head = name_to_list[-1] if not name_to_list[-1]=='' else obj.name
+
+ if isinstance(obj, h5py.Group): # Handle groups
+ # Convert attributes to a YAML/JSON serializable format
+ attr_dict = {key: utils.to_serializable_dtype(val) for key, val in obj.attrs.items()}
+
+ # Initialize the group dictionary
+ group_dict = {"name": name_head, "attributes": attr_dict}
+
+ # Handle group members compactly
+ #subgroups = [member_name for member_name in obj if isinstance(obj[member_name], h5py.Group)]
+ #datasets = [member_name for member_name in obj if isinstance(obj[member_name], h5py.Dataset)]
+
+ # Summarize groups and datasets
+ #group_dict["content_summary"] = {
+ # "group_count": len(subgroups),
+ # "group_preview": subgroups[:3] + (["..."] if len(subgroups) > 3 else []),
+ # "dataset_count": len(datasets),
+ # "dataset_preview": datasets[:3] + (["..."] if len(datasets) > 3 else [])
+ #}
+
+ yaml_dict[obj.name] = group_dict
+
+ elif isinstance(obj, h5py.Dataset): # Handle datasets
+ # Convert attributes to a YAML/JSON serializable format
+ attr_dict = {key: utils.to_serializable_dtype(val) for key, val in obj.attrs.items()}
+
+ dataset_dict = {"name": name_head, "attributes": attr_dict}
+
+ yaml_dict[obj.name] = dataset_dict
+
+
+
+
+[docs]
+def serialize_metadata(input_filename_path, folder_depth: int = 4, output_format: str = 'yaml') -> str:
+ """
+ Serialize metadata from an HDF5 file into YAML or JSON format.
+
+ Parameters
+ ----------
+ input_filename_path : str
+ The path to the input HDF5 file.
+ folder_depth : int, optional
+ The folder depth to control how much of the HDF5 file hierarchy is traversed (default is 4).
+ output_format : str, optional
+ The format to serialize the output, either 'yaml' or 'json' (default is 'yaml').
+
+ Returns
+ -------
+ str
+ The output file path where the serialized metadata is stored (either .yaml or .json).
+
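+ Example
+ -------
+ Illustrative sketch; the file path is hypothetical:
+
+ >>> serialize_metadata('path/to/file.h5', folder_depth=2, output_format='yaml')
+ 'path/to/file.yaml'
+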
+ """
+
+ # Choose the appropriate output format (YAML or JSON)
+ if output_format not in ['yaml', 'json']:
+ raise ValueError("Unsupported format. Please choose either 'yaml' or 'json'.")
+
+ # Initialize dictionary to store YAML/JSON data
+ yaml_dict = {}
+
+ # Split input file path to get the output file's base name
+ output_filename_tail, ext = os.path.splitext(input_filename_path)
+
+ # Open the HDF5 file and extract metadata
+ with h5py.File(input_filename_path, 'r') as f:
+ # Convert attribute dict to a YAML/JSON serializable dict
+ #attrs_dict = {key: utils.to_serializable_dtype(val) for key, val in f.attrs.items()}
+ #yaml_dict[f.name] = {
+ # "name": f.name,
+ # "attributes": attrs_dict,
+ # "datasets": {}
+ #}
+ __print_metadata__(f.name, f, folder_depth, yaml_dict)
+ # Traverse HDF5 file hierarchy and add datasets
+ f.visititems(lambda name, obj: __print_metadata__(name, obj, folder_depth, yaml_dict))
+
+
+ # Serialize and write the data
+ output_file_path = output_filename_tail + '.' + output_format
+ with open(output_file_path, 'w') as output_file:
+ if output_format == 'json':
+ json_output = json.dumps(yaml_dict, indent=4, sort_keys=False)
+ output_file.write(json_output)
+ elif output_format == 'yaml':
+ yaml_output = yaml.dump(yaml_dict, sort_keys=False)
+ output_file.write(yaml_output)
+
+ return output_file_path
+
+
+
+
+[docs]
+def get_groups_at_a_level(file: h5py.File, level: int):
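+ """
+ Return the names of HDF5 objects whose path contains exactly 'level' '/' separators.
+
+ Note: both groups and datasets at that depth are collected, and each matching name is printed as a side effect.
+ """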
+
+ groups = []
+ def node_selector(name, obj):
+ if name.count('/') == level:
+ print(name)
+ groups.append(obj.name)
+
+ file.visititems(node_selector)
+ #file.visititems()
+ return groups
+
+
+
+[docs]
+def read_mtable_as_dataframe(filename):
+
+ """
+ Reconstruct a MATLAB Table encoded in a .h5 file as a Pandas DataFrame.
+
+ This function reads a .h5 file containing a MATLAB Table and reconstructs it as a Pandas DataFrame.
+ The input .h5 file contains one group per row of the MATLAB Table. Each group stores the table's
+ dataset-like variables as Datasets, while categorical and numerical variables are represented as
+ attributes of the respective group.
+
+ To ensure homogeneity of data columns, the DataFrame is constructed column-wise.
+
+ Parameters
+ ----------
+ filename : str
+ The name of the .h5 file. This may include the file's location and path information.
+
+ Returns
+ -------
+ pd.DataFrame
+ The MATLAB Table reconstructed as a Pandas DataFrame.
+ """
+
+
+ # Construct the DataFrame by filling out entries column-wise. This ensures homogeneous data columns.
+
+ with h5py.File(filename,'r') as file:
+
+ # Define group's attributes and datasets. This should hold
+ # for all groups. TODO: implement verification and noncompliance error if needed.
+ group_list = list(file.keys())
+ group_attrs = list(file[group_list[0]].attrs.keys())
+ #
+ column_attr_names = [item[item.find('_')+1::] for item in group_attrs]
+ column_attr_names_idx = [int(item[4:(item.find('_'))]) for item in group_attrs]
+
+ group_datasets = list(file[group_list[0]].keys()) if not 'DS_EMPTY' in file[group_list[0]].keys() else []
+ #
+ column_dataset_names = [file[group_list[0]][item].attrs['column_name'] for item in group_datasets]
+ column_dataset_names_idx = [int(item[2:]) for item in group_datasets]
+
+
+ # Define data_frame as group_attrs + group_datasets
+ #pd_series_index = group_attrs + group_datasets
+ pd_series_index = column_attr_names + column_dataset_names
+
+ output_dataframe = pd.DataFrame(columns=pd_series_index,index=group_list)
+
+ tmp_col = []
+
+ for meas_prop in group_attrs + group_datasets:
+ if meas_prop in group_attrs:
+ column_label = meas_prop[meas_prop.find('_')+1:]
+ # Create numerical or categorical column from group's attributes
+ tmp_col = [file[group_key].attrs[meas_prop][()][0] for group_key in group_list]
+ else:
+ # Create dataset column from group's datasets
+ column_label = file[group_list[0] + '/' + meas_prop].attrs['column_name']
+ #tmp_col = [file[group_key + '/' + meas_prop][()][0] for group_key in group_list]
+ tmp_col = [file[group_key + '/' + meas_prop][()] for group_key in group_list]
+
+ output_dataframe.loc[:,column_label] = tmp_col
+
+ return output_dataframe
+
+
+if __name__ == "__main__":
+ if len(sys.argv) < 5:
+ print("Usage: python hdf5_ops.py serialize <path/to/target_file.hdf5> <folder_depth : int = 2> <format=json|yaml>")
+ sys.exit(1)
+
+ if sys.argv[1] == 'serialize':
+ input_hdf5_file = sys.argv[2]
+ folder_depth = int(sys.argv[3])
+ file_format = sys.argv[4]
+
+ try:
+ # Call the serialize_metadata function and capture the output path
+ path_to_file = serialize_metadata(input_hdf5_file,
+ folder_depth = folder_depth,
+ output_format=file_format)
+ print(f"Metadata serialized to {path_to_file}")
+ except Exception as e:
+ print(f"An error occurred during serialization: {e}")
+ sys.exit(1)
+
+ #run(sys.argv[2])
+
+
-import sys
-import os
-root_dir = os.path.abspath(os.curdir)
-sys.path.append(root_dir)
-
-import h5py
-import yaml
-
-import numpy as np
-import pandas as pd
-
-from plotly.subplots import make_subplots
-import plotly.graph_objects as go
-import plotly.express as px
-#import plotly.io as pio
-from src.hdf5_ops import get_parent_child_relationships
-
-
-
-
-[docs]
-def display_group_hierarchy_on_a_treemap(filename: str):
-
- """
- filename (str): hdf5 file's filename"""
-
- with h5py.File(filename,'r') as file:
- nodes, parents, values = get_parent_child_relationships(file)
-
- metadata_list = []
- metadata_dict={}
- for key in file.attrs.keys():
- #if 'metadata' in key:
- if isinstance(file.attrs[key], str): # Check if the attribute is a string
- metadata_key = key[key.find('_') + 1:]
- metadata_value = file.attrs[key]
- metadata_dict[metadata_key] = metadata_value
- metadata_list.append(f'{metadata_key}: {metadata_value}')
-
- #metadata_dict[key[key.find('_')+1::]]= file.attrs[key]
- #metadata_list.append(key[key.find('_')+1::]+':'+file.attrs[key])
-
- metadata = '<br>'.join(['<br>'] + metadata_list)
-
- customdata_series = pd.Series(nodes)
- customdata_series[0] = metadata
-
- fig = make_subplots(1, 1, specs=[[{"type": "domain"}]],)
- fig.add_trace(go.Treemap(
- labels=nodes, #formating_df['formated_names'][nodes],
- parents=parents,#formating_df['formated_names'][parents],
- values=values,
- branchvalues='remainder',
- customdata= customdata_series,
- #marker=dict(
- # colors=df_all_trees['color'],
- # colorscale='RdBu',
- # cmid=average_score),
- #hovertemplate='<b>%{label} </b> <br> Number of files: %{value}<br> Success rate: %{color:.2f}',
- hovertemplate='<b>%{label} </b> <br> Count: %{value} <br> Path: %{customdata}',
- name='',
- root_color="lightgrey"
- ))
- fig.update_layout(width = 800, height= 600, margin = dict(t=50, l=25, r=25, b=25))
- fig.show()
- file_name, file_ext = os.path.splitext(filename)
- fig.write_html(file_name + ".html")
-
-
- #pio.write_image(fig,file_name + ".png",width=800,height=600,format='png')
-
-#
-
+import sys
+import os
+root_dir = os.path.abspath(os.curdir)
+sys.path.append(root_dir)
+
+import h5py
+import yaml
+
+import numpy as np
+import pandas as pd
+
+from plotly.subplots import make_subplots
+import plotly.graph_objects as go
+import plotly.express as px
+#import plotly.io as pio
+from src.hdf5_ops import get_parent_child_relationships
+
+
+
+
+[docs]
+def display_group_hierarchy_on_a_treemap(filename: str):
+
+ """
+ filename (str): hdf5 file's filename"""
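+
+ # Example (illustrative; the path below is hypothetical):
+ #   display_group_hierarchy_on_a_treemap('path/to/file.h5')
+ #   # writes path/to/file.html next to the input file and opens an interactive figure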
+
+ with h5py.File(filename,'r') as file:
+ nodes, parents, values = get_parent_child_relationships(file)
+
+ metadata_list = []
+ metadata_dict={}
+ for key in file.attrs.keys():
+ #if 'metadata' in key:
+ if isinstance(file.attrs[key], str): # Check if the attribute is a string
+ metadata_key = key[key.find('_') + 1:]
+ metadata_value = file.attrs[key]
+ metadata_dict[metadata_key] = metadata_value
+ metadata_list.append(f'{metadata_key}: {metadata_value}')
+
+ #metadata_dict[key[key.find('_')+1::]]= file.attrs[key]
+ #metadata_list.append(key[key.find('_')+1::]+':'+file.attrs[key])
+
+ metadata = '<br>'.join(['<br>'] + metadata_list)
+
+ customdata_series = pd.Series(nodes)
+ customdata_series[0] = metadata
+
+ fig = make_subplots(1, 1, specs=[[{"type": "domain"}]],)
+ fig.add_trace(go.Treemap(
+ labels=nodes, #formating_df['formated_names'][nodes],
+ parents=parents,#formating_df['formated_names'][parents],
+ values=values,
+ branchvalues='remainder',
+ customdata= customdata_series,
+ #marker=dict(
+ # colors=df_all_trees['color'],
+ # colorscale='RdBu',
+ # cmid=average_score),
+ #hovertemplate='<b>%{label} </b> <br> Number of files: %{value}<br> Success rate: %{color:.2f}',
+ hovertemplate='<b>%{label} </b> <br> Count: %{value} <br> Path: %{customdata}',
+ name='',
+ root_color="lightgrey"
+ ))
+ fig.update_layout(width = 800, height= 600, margin = dict(t=50, l=25, r=25, b=25))
+ fig.show()
+ file_name, file_ext = os.path.splitext(filename)
+ fig.write_html(file_name + ".html")
+
+
+ #pio.write_image(fig,file_name + ".png",width=800,height=600,format='png')
+
+#
+
-import sys
-import os
-root_dir = os.path.abspath(os.curdir)
-sys.path.append(root_dir)
-
-import pandas as pd
-import numpy as np
-import h5py
-import logging
-
-import utils.g5505_utils as utils
-import instruments.readers.filereader_registry as filereader_registry
-
-
-
-def __transfer_file_dict_to_hdf5(h5file, group_name, file_dict):
- """
- Transfers data from a file_dict to an HDF5 file.
-
- Parameters
- ----------
- h5file : h5py.File
- HDF5 file object where the data will be written.
- group_name : str
- Name of the HDF5 group where data will be stored.
- file_dict : dict
- Dictionary containing file data to be transferred. Required structure:
- {
- 'name': str,
- 'attributes_dict': dict,
- 'datasets': [
- {
- 'name': str,
- 'data': array-like,
- 'shape': tuple,
- 'attributes': dict (optional)
- },
- ...
- ]
- }
-
- Returns
- -------
- None
- """
-
- if not file_dict:
- return
-
- try:
- # Create group and add their attributes
- filename = file_dict['name']
- group = h5file[group_name].create_group(name=filename)
- # Add group attributes
- group.attrs.update(file_dict['attributes_dict'])
-
- # Add datasets to the just created group
- for dataset in file_dict['datasets']:
- dataset_obj = group.create_dataset(
- name=dataset['name'],
- data=dataset['data'],
- shape=dataset['shape']
- )
-
- # Add dataset's attributes
- attributes = dataset.get('attributes', {})
- dataset_obj.attrs.update(attributes)
- group.attrs['last_update_date'] = utils.created_at().encode('utf-8')
-
- stdout = f'Completed transfer for /{group_name}/{filename}'
-
- except Exception as inst:
- stdout = inst
- logging.error('Failed to transfer data into HDF5: %s', inst)
-
- return stdout
-
-def __copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
- # Create copy of original file to avoid possible file corruption and work with it.
-
- if work_with_copy:
- tmp_file_path = utils.make_file_copy(source_file_path)
- else:
- tmp_file_path = source_file_path
-
- # Open backup h5 file and copy complet filesystem directory onto a group in h5file
- with h5py.File(tmp_file_path,'r') as src_file:
- dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
-
- if 'tmp_files' in tmp_file_path:
- os.remove(tmp_file_path)
-
- stdout = f'Completed transfer for /{dest_group_name}'
- return stdout
-
-
-[docs]
-def create_hdf5_file_from_filesystem_path(path_to_input_directory: str,
- path_to_filenames_dict: dict = None,
- select_dir_keywords : list = [],
- root_metadata_dict : dict = {}, mode = 'w'):
-
- """
- Creates an .h5 file with name "output_filename" that preserves the directory tree (or folder structure)
- of a given filesystem path.
-
- The data integration capabilities are limited by our file reader, which can only access data from a list of
- admissible file formats. These, however, can be extended. Directories are groups in the resulting HDF5 file.
- Files are formatted as composite objects consisting of a group, file, and attributes.
-
- Parameters
- ----------
- output_filename : str
- Name of the output HDF5 file.
- path_to_input_directory : str
- Path to root directory, specified with forward slashes, e.g., path/to/root.
-
- path_to_filenames_dict : dict, optional
- A pre-processed dictionary where keys are directory paths on the input directory's tree and values are lists of files.
- If provided, 'input_file_system_path' is ignored.
-
- select_dir_keywords : list
- List of string elements to consider or select only directory paths that contain
- a word in 'select_dir_keywords'. When empty, all directory paths are considered
- to be included in the HDF5 file group hierarchy.
- root_metadata_dict : dict
- Metadata to include at the root level of the HDF5 file.
-
- mode : str
- 'w' create File, truncate if it exists, or 'r+' read/write, File must exists. By default, mode = "w".
-
- Returns
- -------
- output_filename : str
- Path to the created HDF5 file.
- """
-
-
- if not mode in ['w','r+']:
- raise ValueError(f'Parameter mode must take values in ["w","r+"]')
-
- if not '/' in path_to_input_directory:
- raise ValueError('path_to_input_directory needs to be specified using forward slashes "/".' )
-
- #path_to_output_directory = os.path.join(path_to_input_directory,'..')
- path_to_input_directory = os.path.normpath(path_to_input_directory).rstrip(os.sep)
-
-
- for i, keyword in enumerate(select_dir_keywords):
- select_dir_keywords[i] = keyword.replace('/',os.sep)
-
- if not path_to_filenames_dict:
- # On dry_run=True, returns path to files dictionary of the output directory without making a actual copy of the input directory.
- # Therefore, there wont be a copying conflict by setting up input and output directories the same
- path_to_filenames_dict = utils.copy_directory_with_contraints(input_dir_path=path_to_input_directory,
- output_dir_path=path_to_input_directory,
- dry_run=True)
- # Set input_directory as copied input directory
- root_dir = path_to_input_directory
- path_to_output_file = path_to_input_directory.rstrip(os.path.sep) + '.h5'
-
- start_message = f'\n[Start] Data integration :\nSource: {path_to_input_directory}\nDestination: {path_to_output_file}\n'
-
- print(start_message)
- logging.info(start_message)
-
- # Check if the .h5 file already exists
- if os.path.exists(path_to_output_file) and mode in ['w']:
- message = (
- f"[Notice] The file '{path_to_output_file}' already exists and will not be overwritten.\n"
- "If you wish to replace it, please delete the existing file first and rerun the program."
- )
- print(message)
- logging.error(message)
- else:
- with h5py.File(path_to_output_file, mode=mode, track_order=True) as h5file:
-
- number_of_dirs = len(path_to_filenames_dict.keys())
- dir_number = 1
- for dirpath, filtered_filenames_list in path_to_filenames_dict.items():
-
- # Check if filtered_filenames_list is nonempty. TODO: This is perhaps redundant by design of path_to_filenames_dict.
- if not filtered_filenames_list:
- continue
-
- group_name = dirpath.replace(os.sep,'/')
- group_name = group_name.replace(root_dir.replace(os.sep,'/') + '/', '/')
-
- # Flatten group name to one level
- if select_dir_keywords:
- offset = sum([len(i.split(os.sep)) if i in dirpath else 0 for i in select_dir_keywords])
- else:
- offset = 1
- tmp_list = group_name.split('/')
- if len(tmp_list) > offset+1:
- group_name = '/'.join([tmp_list[i] for i in range(offset+1)])
-
- # Create group called "group_name". Hierarchy of nested groups can be implicitly defined by the forward slashes
- if not group_name in h5file.keys():
- h5file.create_group(group_name)
- h5file[group_name].attrs['creation_date'] = utils.created_at().encode('utf-8')
- #h5file[group_name].attrs.create(name='filtered_file_list',data=convert_string_to_bytes(filtered_filename_list))
- #h5file[group_name].attrs.create(name='file_list',data=convert_string_to_bytes(filenames_list))
- #else:
- #print(group_name,' was already created.')
- instFoldermsgStart = f'Starting data transfer from instFolder: {group_name}'
- print(instFoldermsgStart)
-
- for filenumber, filename in enumerate(filtered_filenames_list):
-
- #file_ext = os.path.splitext(filename)[1]
- #try:
-
- # hdf5 path to filename group
- dest_group_name = f'{group_name}/{filename}'
-
- if not 'h5' in filename:
- #file_dict = config_file.select_file_readers(group_id)[file_ext](os.path.join(dirpath,filename))
- #file_dict = ext_to_reader_dict[file_ext](os.path.join(dirpath,filename))
- file_dict = filereader_registry.select_file_reader(dest_group_name)(os.path.join(dirpath,filename))
-
- stdout = __transfer_file_dict_to_hdf5(h5file, group_name, file_dict)
-
- else:
- source_file_path = os.path.join(dirpath,filename)
- dest_file_obj = h5file
- #group_name +'/'+filename
- #ext_to_reader_dict[file_ext](source_file_path, dest_file_obj, dest_group_name)
- #g5505f_reader.select_file_reader(dest_group_name)(source_file_path, dest_file_obj, dest_group_name)
- stdout = __copy_file_in_group(source_file_path, dest_file_obj, dest_group_name, False)
-
- # Update the progress bar and log the end message
- instFoldermsdEnd = f'\nCompleted data transfer for instFolder: {group_name}\n'
- # Print and log the start message
- utils.progressBar(dir_number, number_of_dirs, instFoldermsdEnd)
- logging.info(instFoldermsdEnd )
- dir_number = dir_number + 1
-
- print('[End] Data integration')
- logging.info('[End] Data integration')
-
- if len(root_metadata_dict.keys())>0:
- for key, value in root_metadata_dict.items():
- #if key in h5file.attrs:
- # del h5file.attrs[key]
- h5file.attrs.create(key, value)
- #annotate_root_dir(output_filename,root_metadata_dict)
-
-
- #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename)
-
- return path_to_output_file #, output_yml_filename_path
-
-
-
-[docs]
-def create_hdf5_file_from_dataframe(ofilename, input_data, group_by_funcs: list, approach: str = None, extract_attrs_func=None):
- """
- Creates an HDF5 file with hierarchical groups based on the specified grouping functions or columns.
-
- Parameters:
- -----------
- ofilename (str): Path for the output HDF5 file.
- input_data (pd.DataFrame or str): Input data as a DataFrame or a valid file system path.
- group_by_funcs (list): List of callables or column names to define hierarchical grouping.
- approach (str): Specifies the approach ('top-down' or 'bottom-up') for creating the HDF5 file.
- extract_attrs_func (callable, optional): Function to extract additional attributes for HDF5 groups.
-
- Returns:
- --------
- None
- """
- # Check whether input_data is a valid file-system path or a DataFrame
- is_valid_path = lambda x: os.path.exists(x) if isinstance(x, str) else False
-
- if is_valid_path(input_data):
- # If input_data is a file-system path, create a DataFrame with file info
- file_list = os.listdir(input_data)
- df = pd.DataFrame(file_list, columns=['filename'])
- df = utils.augment_with_filetype(df) # Add filetype information if needed
- elif isinstance(input_data, pd.DataFrame):
- # If input_data is a DataFrame, make a copy
- df = input_data.copy()
- else:
- raise ValueError("input_data must be either a valid file-system path or a DataFrame.")
-
- # Generate grouping columns based on group_by_funcs
- if utils.is_callable_list(group_by_funcs):
- grouping_cols = []
- for i, func in enumerate(group_by_funcs):
- col_name = f'level_{i}_groups'
- grouping_cols.append(col_name)
- df[col_name] = func(df)
- elif utils.is_str_list(group_by_funcs) and all([item in df.columns for item in group_by_funcs]):
- grouping_cols = group_by_funcs
- else:
- raise ValueError("'group_by_funcs' must be a list of callables or valid column names in the DataFrame.")
-
- # Generate group paths
- df['group_path'] = ['/' + '/'.join(row) for row in df[grouping_cols].values.astype(str)]
-
- # Open the HDF5 file in write mode
- with h5py.File(ofilename, 'w') as file:
- for group_path in df['group_path'].unique():
- # Create groups in HDF5
- group = file.create_group(group_path)
-
- # Filter the DataFrame for the current group
- datatable = df[df['group_path'] == group_path].copy()
-
- # Drop grouping columns and the generated 'group_path'
- datatable = datatable.drop(columns=grouping_cols + ['group_path'])
-
- # Add datasets to groups if data exists
- if not datatable.empty:
- dataset = utils.convert_dataframe_to_np_structured_array(datatable)
- group.create_dataset(name='data_table', data=dataset)
-
- # Add attributes if extract_attrs_func is provided
- if extract_attrs_func:
- attrs = extract_attrs_func(datatable)
- for key, value in attrs.items():
- group.attrs[key] = value
-
- # Save metadata about depth of hierarchy
- file.attrs.create(name='depth', data=len(grouping_cols) - 1)
-
- print(f"HDF5 file created successfully at {ofilename}")
-
- return ofilename
-
-
-
-
-[docs]
-def save_processed_dataframe_to_hdf5(df, annotator, output_filename): # src_hdf5_path, script_date, script_name):
- """
- Save processed dataframe columns with annotations to an HDF5 file.
-
- Parameters:
- df (pd.DataFrame): DataFrame containing processed time series.
- annotator (): Annotator object with get_metadata method.
- output_filename (str): Path to the source HDF5 file.
- """
- # Convert datetime columns to string
- datetime_cols = df.select_dtypes(include=['datetime64']).columns
-
- if list(datetime_cols):
- df[datetime_cols] = df[datetime_cols].map(str)
-
- # Convert dataframe to structured array
- icad_data_table = utils.convert_dataframe_to_np_structured_array(df)
-
- # Get metadata
- metadata_dict = annotator.get_metadata()
-
- # Prepare project level attributes to be added at the root level
-
- project_level_attributes = metadata_dict['metadata']['project']
-
- # Prepare high-level attributes
- high_level_attributes = {
- 'parent_files': metadata_dict['parent_files'],
- **metadata_dict['metadata']['sample'],
- **metadata_dict['metadata']['environment'],
- **metadata_dict['metadata']['instruments']
- }
-
- # Prepare data level attributes
- data_level_attributes = metadata_dict['metadata']['datasets']
-
- for key, value in data_level_attributes.items():
- if isinstance(value,dict):
- data_level_attributes[key] = utils.convert_attrdict_to_np_structured_array(value)
-
-
- # Prepare file dictionary
- file_dict = {
- 'name': project_level_attributes['processing_file'],
- 'attributes_dict': high_level_attributes,
- 'datasets': [{
- 'name': "data_table",
- 'data': icad_data_table,
- 'shape': icad_data_table.shape,
- 'attributes': data_level_attributes
- }]
- }
-
- # Check if the file exists
- if os.path.exists(output_filename):
- mode = "a"
- print(f"File {output_filename} exists. Opening in append mode.")
- else:
- mode = "w"
- print(f"File {output_filename} does not exist. Creating a new file.")
-
-
- # Write to HDF5
- with h5py.File(output_filename, mode) as h5file:
- # Add project level attributes at the root/top level
- h5file.attrs.update(project_level_attributes)
- __transfer_file_dict_to_hdf5(h5file, '/', file_dict)
-
-
-#if __name__ == '__main__':
-
+import sys
+import os
+root_dir = os.path.abspath(os.curdir)
+sys.path.append(root_dir)
+
+import pandas as pd
+import numpy as np
+import h5py
+import logging
+
+import utils.g5505_utils as utils
+import instruments.readers.filereader_registry as filereader_registry
+
+
+
+def __transfer_file_dict_to_hdf5(h5file, group_name, file_dict):
+ """
+ Transfers data from a file_dict to an HDF5 file.
+
+ Parameters
+ ----------
+ h5file : h5py.File
+ HDF5 file object where the data will be written.
+ group_name : str
+ Name of the HDF5 group where data will be stored.
+ file_dict : dict
+ Dictionary containing file data to be transferred. Required structure:
+ {
+ 'name': str,
+ 'attributes_dict': dict,
+ 'datasets': [
+ {
+ 'name': str,
+ 'data': array-like,
+ 'shape': tuple,
+ 'attributes': dict (optional)
+ },
+ ...
+ ]
+ }
+
+ Returns
+ -------
+ None
+ """
+
+ if not file_dict:
+ return
+
+ try:
+ # Create group and add their attributes
+ filename = file_dict['name']
+ group = h5file[group_name].create_group(name=filename)
+ # Add group attributes
+ group.attrs.update(file_dict['attributes_dict'])
+
+ # Add datasets to the just created group
+ for dataset in file_dict['datasets']:
+ dataset_obj = group.create_dataset(
+ name=dataset['name'],
+ data=dataset['data'],
+ shape=dataset['shape']
+ )
+
+ # Add dataset's attributes
+ attributes = dataset.get('attributes', {})
+ dataset_obj.attrs.update(attributes)
+ group.attrs['last_update_date'] = utils.created_at().encode('utf-8')
+
+ stdout = f'Completed transfer for /{group_name}/{filename}'
+
+ except Exception as inst:
+ stdout = inst
+ logging.error('Failed to transfer data into HDF5: %s', inst)
+
+ return stdout
+
+def __copy_file_in_group(source_file_path, dest_file_obj : h5py.File, dest_group_name, work_with_copy : bool = True):
+ # Create copy of original file to avoid possible file corruption and work with it.
+
+ if work_with_copy:
+ tmp_file_path = utils.make_file_copy(source_file_path)
+ else:
+ tmp_file_path = source_file_path
+
+ # Open the backup h5 file and copy the complete filesystem directory onto a group in h5file
+ with h5py.File(tmp_file_path,'r') as src_file:
+ dest_file_obj.copy(source= src_file['/'], dest= dest_group_name)
+
+ if 'tmp_files' in tmp_file_path:
+ os.remove(tmp_file_path)
+
+ stdout = f'Completed transfer for /{dest_group_name}'
+ return stdout
+
+
+[docs]
+def create_hdf5_file_from_filesystem_path(path_to_input_directory: str,
+ path_to_filenames_dict: dict = None,
+ select_dir_keywords : list = [],
+ root_metadata_dict : dict = {}, mode = 'w'):
+
+ """
+ Creates an .h5 file, named after the input directory, that preserves the directory tree (or folder structure)
+ of a given filesystem path.
+
+ The data integration capabilities are limited by our file readers, which can only access data from a list of
+ admissible file formats. These, however, can be extended. Directories become groups in the resulting HDF5 file,
+ and files are represented as composite objects consisting of a group, datasets, and attributes.
+
+ Parameters
+ ----------
+ path_to_input_directory : str
+ Path to the root directory, specified with forward slashes, e.g., path/to/root.
+
+ path_to_filenames_dict : dict, optional
+ A pre-processed dictionary where keys are directory paths in the input directory's tree and values are lists of files.
+ If provided, the input directory is not scanned again.
+
+ select_dir_keywords : list
+ List of strings used to select only directory paths that contain a word in 'select_dir_keywords'.
+ When empty, all directory paths are included in the HDF5 file group hierarchy.
+ root_metadata_dict : dict
+ Metadata to include at the root level of the HDF5 file.
+
+ mode : str
+ 'w' creates the file (an existing file is not overwritten); 'r+' opens an existing file for read/write. Default is 'w'.
+
+ Returns
+ -------
+ path_to_output_file : str
+ Path to the created HDF5 file.
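+
+ Example
+ -------
+ Illustrative sketch; the directory path and keyword below are hypothetical. The call returns
+ 'path/to/campaign_dir.h5' and prints progress messages along the way:
+
+ >>> create_hdf5_file_from_filesystem_path('path/to/campaign_dir',
+ ...                                       select_dir_keywords=['gas_analyzer'])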
+ """
+
+
+ if not mode in ['w','r+']:
+ raise ValueError(f'Parameter mode must take values in ["w","r+"]')
+
+ if not '/' in path_to_input_directory:
+ raise ValueError('path_to_input_directory needs to be specified using forward slashes "/".' )
+
+ #path_to_output_directory = os.path.join(path_to_input_directory,'..')
+ path_to_input_directory = os.path.normpath(path_to_input_directory).rstrip(os.sep)
+
+
+ for i, keyword in enumerate(select_dir_keywords):
+ select_dir_keywords[i] = keyword.replace('/',os.sep)
+
+ if not path_to_filenames_dict:
+ # With dry_run=True, this returns the path-to-files dictionary of the output directory without making an actual copy of the input directory.
+ # Therefore, there won't be a copying conflict even though the input and output directories are set to the same path.
+ path_to_filenames_dict = utils.copy_directory_with_contraints(input_dir_path=path_to_input_directory,
+ output_dir_path=path_to_input_directory,
+ dry_run=True)
+ # Set input_directory as copied input directory
+ root_dir = path_to_input_directory
+ path_to_output_file = path_to_input_directory.rstrip(os.path.sep) + '.h5'
+
+ start_message = f'\n[Start] Data integration :\nSource: {path_to_input_directory}\nDestination: {path_to_output_file}\n'
+
+ print(start_message)
+ logging.info(start_message)
+
+ # Check if the .h5 file already exists
+ if os.path.exists(path_to_output_file) and mode in ['w']:
+ message = (
+ f"[Notice] The file '{path_to_output_file}' already exists and will not be overwritten.\n"
+ "If you wish to replace it, please delete the existing file first and rerun the program."
+ )
+ print(message)
+ logging.error(message)
+ else:
+ with h5py.File(path_to_output_file, mode=mode, track_order=True) as h5file:
+
+ number_of_dirs = len(path_to_filenames_dict.keys())
+ dir_number = 1
+ for dirpath, filtered_filenames_list in path_to_filenames_dict.items():
+
+ # Check if filtered_filenames_list is nonempty. TODO: This is perhaps redundant by design of path_to_filenames_dict.
+ if not filtered_filenames_list:
+ continue
+
+ group_name = dirpath.replace(os.sep,'/')
+ group_name = group_name.replace(root_dir.replace(os.sep,'/') + '/', '/')
+
+ # Flatten group name to one level
+ if select_dir_keywords:
+ offset = sum([len(i.split(os.sep)) if i in dirpath else 0 for i in select_dir_keywords])
+ else:
+ offset = 1
+ tmp_list = group_name.split('/')
+ if len(tmp_list) > offset+1:
+ group_name = '/'.join([tmp_list[i] for i in range(offset+1)])
+
+ # Create group called "group_name". Hierarchy of nested groups can be implicitly defined by the forward slashes
+ if not group_name in h5file.keys():
+ h5file.create_group(group_name)
+ h5file[group_name].attrs['creation_date'] = utils.created_at().encode('utf-8')
+ #h5file[group_name].attrs.create(name='filtered_file_list',data=convert_string_to_bytes(filtered_filename_list))
+ #h5file[group_name].attrs.create(name='file_list',data=convert_string_to_bytes(filenames_list))
+ #else:
+ #print(group_name,' was already created.')
+ instFoldermsgStart = f'Starting data transfer from instFolder: {group_name}'
+ print(instFoldermsgStart)
+
+ for filenumber, filename in enumerate(filtered_filenames_list):
+
+ #file_ext = os.path.splitext(filename)[1]
+ #try:
+
+ # hdf5 path to filename group
+ dest_group_name = f'{group_name}/{filename}'
+
+ if not 'h5' in filename:
+ #file_dict = config_file.select_file_readers(group_id)[file_ext](os.path.join(dirpath,filename))
+ #file_dict = ext_to_reader_dict[file_ext](os.path.join(dirpath,filename))
+ file_dict = filereader_registry.select_file_reader(dest_group_name)(os.path.join(dirpath,filename))
+
+ stdout = __transfer_file_dict_to_hdf5(h5file, group_name, file_dict)
+
+ else:
+ source_file_path = os.path.join(dirpath,filename)
+ dest_file_obj = h5file
+ #group_name +'/'+filename
+ #ext_to_reader_dict[file_ext](source_file_path, dest_file_obj, dest_group_name)
+ #g5505f_reader.select_file_reader(dest_group_name)(source_file_path, dest_file_obj, dest_group_name)
+ stdout = __copy_file_in_group(source_file_path, dest_file_obj, dest_group_name, False)
+
+ # Update the progress bar and log the end message
+ instFoldermsdEnd = f'\nCompleted data transfer for instFolder: {group_name}\n'
+ # Print and log the start message
+ utils.progressBar(dir_number, number_of_dirs, instFoldermsdEnd)
+ logging.info(instFoldermsdEnd )
+ dir_number = dir_number + 1
+
+ print('[End] Data integration')
+ logging.info('[End] Data integration')
+
+ if len(root_metadata_dict.keys())>0:
+ for key, value in root_metadata_dict.items():
+ #if key in h5file.attrs:
+ # del h5file.attrs[key]
+ h5file.attrs.create(key, value)
+ #annotate_root_dir(output_filename,root_metadata_dict)
+
+
+ #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename)
+
+ return path_to_output_file #, output_yml_filename_path
+
+
+
+[docs]
+def create_hdf5_file_from_dataframe(ofilename, input_data, group_by_funcs: list, approach: str = None, extract_attrs_func=None):
+ """
+ Creates an HDF5 file with hierarchical groups based on the specified grouping functions or columns.
+
+ Parameters:
+ -----------
+ ofilename (str): Path for the output HDF5 file.
+ input_data (pd.DataFrame or str): Input data as a DataFrame or a valid file system path.
+ group_by_funcs (list): List of callables or column names to define hierarchical grouping.
+ approach (str): Specifies the approach ('top-down' or 'bottom-up') for creating the HDF5 file.
+ extract_attrs_func (callable, optional): Function to extract additional attributes for HDF5 groups.
+
+ Returns:
+ --------
+ ofilename (str): Path to the created HDF5 file.
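+
+ Example (illustrative sketch; the DataFrame and column names are hypothetical). Grouping by the
+ 'sample' and 'site' columns creates one group per unique (sample, site) path, each holding a
+ 'data_table' dataset with the remaining columns:
+
+ >>> df = pd.DataFrame({'sample': ['a', 'b'], 'site': ['x', 'y'], 'value': [1.0, 2.0]})
+ >>> create_hdf5_file_from_dataframe('output.h5', df, group_by_funcs=['sample', 'site'])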
+ """
+ # Check whether input_data is a valid file-system path or a DataFrame
+ is_valid_path = lambda x: os.path.exists(x) if isinstance(x, str) else False
+
+ if is_valid_path(input_data):
+ # If input_data is a file-system path, create a DataFrame with file info
+ file_list = os.listdir(input_data)
+ df = pd.DataFrame(file_list, columns=['filename'])
+ df = utils.augment_with_filetype(df) # Add filetype information if needed
+ elif isinstance(input_data, pd.DataFrame):
+ # If input_data is a DataFrame, make a copy
+ df = input_data.copy()
+ else:
+ raise ValueError("input_data must be either a valid file-system path or a DataFrame.")
+
+ # Generate grouping columns based on group_by_funcs
+ if utils.is_callable_list(group_by_funcs):
+ grouping_cols = []
+ for i, func in enumerate(group_by_funcs):
+ col_name = f'level_{i}_groups'
+ grouping_cols.append(col_name)
+ df[col_name] = func(df)
+ elif utils.is_str_list(group_by_funcs) and all([item in df.columns for item in group_by_funcs]):
+ grouping_cols = group_by_funcs
+ else:
+ raise ValueError("'group_by_funcs' must be a list of callables or valid column names in the DataFrame.")
+
+ # Generate group paths
+ df['group_path'] = ['/' + '/'.join(row) for row in df[grouping_cols].values.astype(str)]
+
+ # Open the HDF5 file in write mode
+ with h5py.File(ofilename, 'w') as file:
+ for group_path in df['group_path'].unique():
+ # Create groups in HDF5
+ group = file.create_group(group_path)
+
+ # Filter the DataFrame for the current group
+ datatable = df[df['group_path'] == group_path].copy()
+
+ # Drop grouping columns and the generated 'group_path'
+ datatable = datatable.drop(columns=grouping_cols + ['group_path'])
+
+ # Add datasets to groups if data exists
+ if not datatable.empty:
+ dataset = utils.convert_dataframe_to_np_structured_array(datatable)
+ group.create_dataset(name='data_table', data=dataset)
+
+ # Add attributes if extract_attrs_func is provided
+ if extract_attrs_func:
+ attrs = extract_attrs_func(datatable)
+ for key, value in attrs.items():
+ group.attrs[key] = value
+
+ # Save metadata about depth of hierarchy
+ file.attrs.create(name='depth', data=len(grouping_cols) - 1)
+
+ print(f"HDF5 file created successfully at {ofilename}")
+
+ return ofilename
+
+
+
+
+[docs]
+def save_processed_dataframe_to_hdf5(df, annotator, output_filename): # src_hdf5_path, script_date, script_name):
+ """
+ Save processed dataframe columns with annotations to an HDF5 file.
+
+ Parameters:
+ df (pd.DataFrame): DataFrame containing the processed time series.
+ annotator: Annotator object exposing a get_metadata() method.
+ output_filename (str): Path to the output HDF5 file (created if it does not exist, appended to otherwise).
+ """
+ # Convert datetime columns to string
+ datetime_cols = df.select_dtypes(include=['datetime64']).columns
+
+ if list(datetime_cols):
+ df[datetime_cols] = df[datetime_cols].map(str)
+
+ # Convert dataframe to structured array
+ icad_data_table = utils.convert_dataframe_to_np_structured_array(df)
+
+ # Get metadata
+ metadata_dict = annotator.get_metadata()
+
+ # Prepare project level attributes to be added at the root level
+
+ project_level_attributes = metadata_dict['metadata']['project']
+
+ # Prepare high-level attributes
+ high_level_attributes = {
+ 'parent_files': metadata_dict['parent_files'],
+ **metadata_dict['metadata']['sample'],
+ **metadata_dict['metadata']['environment'],
+ **metadata_dict['metadata']['instruments']
+ }
+
+ # Prepare data level attributes
+ data_level_attributes = metadata_dict['metadata']['datasets']
+
+ for key, value in data_level_attributes.items():
+ if isinstance(value,dict):
+ data_level_attributes[key] = utils.convert_attrdict_to_np_structured_array(value)
+
+
+ # Prepare file dictionary
+ file_dict = {
+ 'name': project_level_attributes['processing_file'],
+ 'attributes_dict': high_level_attributes,
+ 'datasets': [{
+ 'name': "data_table",
+ 'data': icad_data_table,
+ 'shape': icad_data_table.shape,
+ 'attributes': data_level_attributes
+ }]
+ }
+
+ # Check if the file exists
+ if os.path.exists(output_filename):
+ mode = "a"
+ print(f"File {output_filename} exists. Opening in append mode.")
+ else:
+ mode = "w"
+ print(f"File {output_filename} does not exist. Creating a new file.")
+
+
+ # Write to HDF5
+ with h5py.File(output_filename, mode) as h5file:
+ # Add project level attributes at the root/top level
+ h5file.attrs.update(project_level_attributes)
+ __transfer_file_dict_to_hdf5(h5file, '/', file_dict)
+
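+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# Shows the metadata layout that save_processed_dataframe_to_hdf5 expects from its
+# `annotator` argument: get_metadata() must return {'parent_files': [...],
+# 'metadata': {'project': ..., 'sample': ..., 'environment': ..., 'instruments': ...,
+# 'datasets': ...}}, with a 'processing_file' key in the project entry. The annotator
+# stub, column names, and file paths below are hypothetical placeholders.
+def _example_save_processed_dataframe():
+    class _MinimalAnnotator:
+        def get_metadata(self):
+            return {
+                'parent_files': ['output_files/raw_smog_chamber.h5'],
+                'metadata': {
+                    'project': {'processing_file': 'processed_example.h5'},
+                    'sample': {'sample': 'example_sample'},
+                    'environment': {},
+                    'instruments': {},
+                    'datasets': {'no2_ppb': {'units': 'ppb'}}
+                }
+            }
+
+    example_df = pd.DataFrame({
+        'datetime': pd.date_range('2024-03-19', periods=3),
+        'no2_ppb': [1.2, 1.4, 1.1]
+    })
+    save_processed_dataframe_to_hdf5(example_df, _MinimalAnnotator(), 'output_files/processed_example.h5')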
+
+#if __name__ == '__main__':
+
-import sys
-import os
-root_dir = os.path.abspath(os.curdir)
-sys.path.append(root_dir)
-import subprocess
-
-import h5py
-import yaml
-import src.g5505_utils as utils
-import src.hdf5_vis as hdf5_vis
-import src.hdf5_lib as hdf5_lib
-import src.git_ops as git_ops
-
-
-import numpy as np
-
-
-
-YAML_EXT = ".yaml"
-TXT_EXT = ".txt"
-
-
-
-
-[docs]
-def get_review_status(filename_path):
-
- filename_path_tail, filename_path_head = os.path.split(filename_path)
- filename, ext = os.path.splitext(filename_path_head)
- # TODO:
- with open(os.path.join("review/",filename+"-review_status"+TXT_EXT),'r') as f:
- workflow_steps = []
- for line in f:
- workflow_steps.append(line)
- return workflow_steps[-1]
-
-
-
-[docs]
-def first_initialize_metadata_review(hdf5_file_path, reviewer_attrs, restart = False):
-
- """
- First: Initialize review branch with review folder with a copy of yaml representation of
- hdf5 file under review and by creating a txt file with the state of the review process, e.g., under review.
-
- """
-
- initials = reviewer_attrs['initials']
- #branch_name = '-'.join([reviewer_attrs['type'],'review_',initials])
- branch_name = '_'.join(['review',initials])
-
- hdf5_file_path_tail, filename_path_head = os.path.split(hdf5_file_path)
- filename, ext = os.path.splitext(filename_path_head)
-
- # Check file_path points to h5 file
- if not 'h5' in ext:
- raise ValueError("filename_path needs to point to an h5 file.")
-
- # Verify if yaml snapshot of input h5 file exists
- if not os.path.exists(os.path.join(hdf5_file_path_tail,filename+YAML_EXT)):
- raise ValueError("metadata review cannot be initialized. The associated .yaml file under review was not found. Run take_yml_snapshot_of_hdf5_file(filename_path) ")
-
- # Initialize metadata review workflow
- # print("Create branch metadata-review-by-"+initials+"\n")
-
- #checkout_review_branch(branch_name)
-
- # Check you are working at the right branch
-
- curr_branch = git_ops.show_current_branch()
- if not branch_name in curr_branch.stdout:
- raise ValueError("Branch "+branch_name+" was not found. \nPlease open a Git Bash Terminal, and follow the below instructions: \n1. Change directory to your project's directory. \n2. Excecute the command: git checkout "+branch_name)
-
- # Check if review file already exists and then check if it is still untracked
- review_yaml_file_path = os.path.join("review/",filename+YAML_EXT)
- review_yaml_file_path_tail, ext = os.path.splitext(review_yaml_file_path)
- review_status_yaml_file_path = os.path.join(review_yaml_file_path_tail+"-review_status"+".txt")
-
- if not os.path.exists(review_yaml_file_path) or restart:
- review_yaml_file_path = utils.make_file_copy(os.path.join(hdf5_file_path_tail,filename+YAML_EXT), 'review')
- if restart:
- print('metadata review has been reinitialized. The review files will reflect the current state of the hdf5 files metadata')
-
-
-
- #if not os.path.exists(os.path.join(review_yaml_file_path_tail+"-review_status"+".txt")):
-
- with open(review_status_yaml_file_path,'w') as f:
- f.write('under review')
-
- # Stage untracked review files and commit them to local repository
- status = git_ops.get_status()
- untracked_files = []
- for line in status.stdout.splitlines():
- #tmp = line.decode("utf-8")
- #modified_files.append(tmp.split()[1])
- if 'review/' in line:
- if not 'modified' in line: # untracked filesand
- untracked_files.append(line.strip())
- else:
- untracked_files.append(line.strip().split()[1])
-
- if 'output_files/'+filename+YAML_EXT in line and not 'modified' in line:
- untracked_files.append(line.strip())
-
- if untracked_files:
- result = subprocess.run(git_ops.add_files_to_git(untracked_files),capture_output=True,check=True)
- message = 'Initialized metadata review.'
- commit_output = subprocess.run(git_ops.commit_changes(message),capture_output=True,check=True)
-
- for line in commit_output.stdout.splitlines():
- print(line.decode('utf-8'))
- #else:
- # print('This action will not have any effect because metadata review process has been already initialized.')
-
-
-
-
- #status_dict = repo_obj.status()
- #for filepath, file_status in status_dict.items():
- # Identify keys associated to review files and stage them
- # if 'review/'+filename in filepath:
- # Stage changes
- # repo_obj.index.add(filepath)
-
- #author = config_file.author #default_signature
- #committer = config_file.committer
- #message = "Initialized metadata review process."
- #tree = repo_obj.index.write_tree()
- #oid = repo_obj.create_commit('HEAD', author, committer, message, tree, [repo_obj.head.peel().oid])
-
- #print("Add and commit"+"\n")
-
- return review_yaml_file_path, review_status_yaml_file_path
-
-
-
-
-
-[docs]
-def second_save_metadata_review(review_yaml_file_path, reviewer_attrs):
- """
- Second: Once you're done reviewing the yaml representation of hdf5 file in review folder.
- Change the review status to complete and save (add and commit) modified .yalm and .txt files in the project by
- running this function.
-
- """
- # 1 verify review initializatin was performed first
- # 2. change review status in txt to complete
- # 3. git add review/ and git commit -m "Submitted metadata review"
-
- initials = reviewer_attrs['initials']
- #branch_name = '-'.join([reviewer_attrs['type'],'review','by',initials])
- branch_name = '_'.join(['review',initials])
- # TODO: replace with subprocess + git
- #checkout_review_branch(repo_obj, branch_name)
-
- # Check you are working at the right branch
- curr_branch = git_ops.show_current_branch()
- if not branch_name in curr_branch.stdout:
- raise ValueError('Please checkout ' + branch_name + ' via Git Bash before submitting metadata review files. ')
-
- # Collect modified review files
- status = git_ops.get_status()
- modified_files = []
- os.path.basename(review_yaml_file_path)
- for line in status.stdout.splitlines():
- # conver line from bytes to str
- tmp = line.decode("utf-8")
- if 'modified' in tmp and 'review/' in tmp and os.path.basename(review_yaml_file_path) in tmp:
- modified_files.append(tmp.split()[1])
-
- # Stage modified files and commit them to local repository
- review_yaml_file_path_tail, review_yaml_file_path_head = os.path.split(review_yaml_file_path)
- filename, ext = os.path.splitext(review_yaml_file_path_head)
- if modified_files:
- review_status_file_path = os.path.join("review/",filename+"-review_status"+TXT_EXT)
- with open(review_status_file_path,'a') as f:
- f.write('\nsubmitted')
-
- modified_files.append(review_status_file_path)
-
- result = subprocess.run(git_ops.add_files_to_git(modified_files),capture_output=True,check=True)
- message = 'Submitted metadata review.'
- commit_output = subprocess.run(git_ops.commit_changes(message),capture_output=True,check=True)
-
- for line in commit_output.stdout.splitlines():
- print(line.decode('utf-8'))
- else:
- print('Nothing to commit.')
-
-
-#
-
-[docs]
-def load_yaml(yaml_review_file):
- with open(yaml_review_file, 'r') as stream:
- try:
- return yaml.load(stream, Loader=yaml.FullLoader)
- except yaml.YAMLError as exc:
- print(exc)
- return None
-
-
-
-[docs]
-def update_hdf5_attributes(input_hdf5_file, yaml_dict):
-
- def update_attributes(hdf5_obj, yaml_obj):
- for attr_name, attr_value in yaml_obj['attributes'].items():
-
- if not isinstance(attr_value, dict):
- attr_value = {'rename_as': attr_name, 'value': attr_value}
-
- if (attr_name in hdf5_obj.attrs.keys()): # delete or update
- if attr_value.get('delete'): # delete when True
- hdf5_obj.attrs.__delitem__(attr_name)
- elif not (attr_value.get('rename_as') == attr_name): # update when true
- hdf5_obj.attrs[attr_value.get('rename_as')] = hdf5_obj.attrs[attr_name] # parse_attribute(attr_value)
- hdf5_obj.attrs.__delitem__(attr_name)
- else: # add a new attribute
- hdf5_obj.attrs.update({attr_name : utils.parse_attribute(attr_value)})
-
- with h5py.File(input_hdf5_file, 'r+') as f:
- for key in yaml_dict.keys():
- hdf5_obj = f[key]
- yaml_obj = yaml_dict[key]
- update_attributes(hdf5_obj, yaml_obj)
-
-
-
-[docs]
-def update_hdf5_file_with_review(input_hdf5_file, yaml_review_file):
- yaml_dict = load_yaml(yaml_review_file)
- update_hdf5_attributes(input_hdf5_file, yaml_dict)
- # Regenerate yaml snapshot of updated HDF5 file
- output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(input_hdf5_file)
- print(f'{output_yml_filename_path} was successfully regenerated from the updated version of{input_hdf5_file}')
-
-
-
-[docs]
-def third_update_hdf5_file_with_review(input_hdf5_file, yaml_review_file, reviewer_attrs={}, hdf5_upload=False):
- if 'submitted' not in get_review_status(input_hdf5_file):
- raise ValueError('Review yaml file must be submitted before trying to perform an update. Run first second_submit_metadata_review().')
-
- update_hdf5_file_with_review(input_hdf5_file, yaml_review_file)
- git_ops.perform_git_operations(hdf5_upload)
-
-
-
-[docs]
-def count(hdf5_obj,yml_dict):
- print(hdf5_obj.name)
- if isinstance(hdf5_obj,h5py.Group) and len(hdf5_obj.name.split('/')) <= 4:
- obj_review = yml_dict[hdf5_obj.name]
- additions = [not (item in hdf5_obj.attrs.keys()) for item in obj_review['attributes'].keys()]
- count_additions = sum(additions)
- deletions = [not (item in obj_review['attributes'].keys()) for item in hdf5_obj.attrs.keys()]
- count_delections = sum(deletions)
- print('additions',count_additions, 'deletions', count_delections)
-
-
-
-[docs]
-def last_submit_metadata_review(reviewer_attrs):
-
- """Fourth: """
-
- initials =reviewer_attrs['initials']
-
- repository = 'origin'
- branch_name = '_'.join(['review',initials])
-
- push_command = lambda repository,refspec: ['git','push',repository,refspec]
-
- list_branches_command = ['git','branch','--list']
-
- branches = subprocess.run(list_branches_command,capture_output=True,text=True,check=True)
- if not branch_name in branches.stdout:
- print('There is no branch named '+branch_name+'.\n')
- print('Make sure to run data owner review workflow from the beginning without missing any steps.')
- return
-
- curr_branch = git_ops.show_current_branch()
- if not branch_name in curr_branch.stdout:
- print('Complete metadata review could not be completed.\n')
- print('Make sure a data-owner workflow has already been started on branch '+branch_name+'\n')
- print('The step "Complete metadata review" will have no effect.')
- return
-
-
-
- # push
- result = subprocess.run(push_command(repository,branch_name),capture_output=True,text=True,check=True)
- print(result.stdout)
-
- # 1. git add output_files/
- # 2. delete review/
- #shutil.rmtree(os.path.join(os.path.abspath(os.curdir),"review"))
- # 3. git rm review/
- # 4. git commit -m "Completed review process. Current state of hdf5 file and yml should be up to date."
- return result.returncode
-
-
-
-#import config_file
-#import hdf5_vis
-
-
-[docs]
-class MetadataHarvester:
- def __init__(self, parent_files=None):
- if parent_files is None:
- parent_files = []
- self.parent_files = parent_files
- self.metadata = {
- "project": {},
- "sample": {},
- "environment": {},
- "instruments": {},
- "datasets": {}
- }
-
-
-[docs]
- def add_project_info(self, key_or_dict, value=None, append=False):
- self._add_info("project", key_or_dict, value, append)
-
-
-
-[docs]
- def add_sample_info(self, key_or_dict, value=None, append=False):
- self._add_info("sample", key_or_dict, value, append)
-
-
-
-[docs]
- def add_environment_info(self, key_or_dict, value=None, append=False):
- self._add_info("environment", key_or_dict, value, append)
-
-
-
-[docs]
- def add_instrument_info(self, key_or_dict, value=None, append=False):
- self._add_info("instruments", key_or_dict, value, append)
-
-
-
-[docs]
- def add_dataset_info(self, key_or_dict, value=None, append=False):
- self._add_info("datasets", key_or_dict, value, append)
-
-
- def _add_info(self, category, key_or_dict, value, append):
- """Internal helper method to add information to a category."""
- if isinstance(key_or_dict, dict):
- self.metadata[category].update(key_or_dict)
- else:
- if key_or_dict in self.metadata[category]:
- if append:
- current_value = self.metadata[category][key_or_dict]
-
- if isinstance(current_value, list):
-
- if not isinstance(value, list):
- # Append the new value to the list
- self.metadata[category][key_or_dict].append(value)
- else:
- self.metadata[category][key_or_dict] = current_value + value
-
- elif isinstance(current_value, str):
- # Append the new value as a comma-separated string
- self.metadata[category][key_or_dict] = current_value + ',' + str(value)
- else:
- # Handle other types (for completeness, usually not required)
- self.metadata[category][key_or_dict] = [current_value, value]
- else:
- self.metadata[category][key_or_dict] = value
- else:
- self.metadata[category][key_or_dict] = value
-
-
-[docs]
- def get_metadata(self):
- return {
- "parent_files": self.parent_files,
- "metadata": self.metadata
- }
-
-
-
-[docs]
- def print_metadata(self):
- print("parent_files", self.parent_files)
-
- for key in self.metadata.keys():
- print(key,'metadata:\n')
- for item in self.metadata[key].items():
- print(item[0],item[1])
-
-
-
-
-
-[docs]
- def clear_metadata(self):
- self.metadata = {
- "project": {},
- "sample": {},
- "environment": {},
- "instruments": {},
- "datasets": {}
- }
- self.parent_files = []
-
-
-
-
-[docs]
-def main():
-
- output_filename_path = "output_files/unified_file_smog_chamber_2024-03-19_UTC-OFST_+0100_NG.h5"
- output_yml_filename_path = "output_files/unified_file_smog_chamber_2024-03-19_UTC-OFST_+0100_NG.yalm"
- output_yml_filename_path_tail, filename = os.path.split(output_yml_filename_path)
-
- #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename_path)
-
- #first_initialize_metadata_review(output_filename_path,initials='NG')
- #second_submit_metadata_review()
- #if os.path.exists(os.path.join(os.path.join(os.path.abspath(os.curdir),"review"),filename)):
- # third_update_hdf5_file_with_review(output_filename_path, os.path.join(os.path.join(os.path.abspath(os.curdir),"review"),filename))
- #fourth_complete_metadata_review()
-
-#if __name__ == '__main__':
-
-# main()
-
+import sys
+import os
+root_dir = os.path.abspath(os.curdir)
+sys.path.append(root_dir)
+import subprocess
+
+import h5py
+import yaml
+import src.g5505_utils as utils
+import src.hdf5_vis as hdf5_vis
+import src.hdf5_lib as hdf5_lib
+import src.git_ops as git_ops
+
+
+import numpy as np
+
+
+
+YAML_EXT = ".yaml"
+TXT_EXT = ".txt"
+
+
+
+
+[docs]
+def get_review_status(filename_path):
+
+ filename_path_tail, filename_path_head = os.path.split(filename_path)
+ filename, ext = os.path.splitext(filename_path_head)
+ # TODO:
+ with open(os.path.join("review/",filename+"-review_status"+TXT_EXT),'r') as f:
+ workflow_steps = []
+ for line in f:
+ workflow_steps.append(line)
+ return workflow_steps[-1]
+
+
+
+[docs]
+def first_initialize_metadata_review(hdf5_file_path, reviewer_attrs, restart = False):
+
+ """
+    First: Initialize the review branch and the review folder with a copy of the YAML representation of the
+    HDF5 file under review, and create a txt file recording the state of the review process (e.g., 'under review').
+
+ """
+
+ initials = reviewer_attrs['initials']
+ #branch_name = '-'.join([reviewer_attrs['type'],'review_',initials])
+ branch_name = '_'.join(['review',initials])
+
+ hdf5_file_path_tail, filename_path_head = os.path.split(hdf5_file_path)
+ filename, ext = os.path.splitext(filename_path_head)
+
+ # Check file_path points to h5 file
+ if not 'h5' in ext:
+ raise ValueError("filename_path needs to point to an h5 file.")
+
+ # Verify if yaml snapshot of input h5 file exists
+ if not os.path.exists(os.path.join(hdf5_file_path_tail,filename+YAML_EXT)):
+ raise ValueError("metadata review cannot be initialized. The associated .yaml file under review was not found. Run take_yml_snapshot_of_hdf5_file(filename_path) ")
+
+ # Initialize metadata review workflow
+ # print("Create branch metadata-review-by-"+initials+"\n")
+
+ #checkout_review_branch(branch_name)
+
+ # Check you are working at the right branch
+
+ curr_branch = git_ops.show_current_branch()
+ if not branch_name in curr_branch.stdout:
+ raise ValueError("Branch "+branch_name+" was not found. \nPlease open a Git Bash Terminal, and follow the below instructions: \n1. Change directory to your project's directory. \n2. Excecute the command: git checkout "+branch_name)
+
+ # Check if review file already exists and then check if it is still untracked
+ review_yaml_file_path = os.path.join("review/",filename+YAML_EXT)
+ review_yaml_file_path_tail, ext = os.path.splitext(review_yaml_file_path)
+ review_status_yaml_file_path = os.path.join(review_yaml_file_path_tail+"-review_status"+".txt")
+
+ if not os.path.exists(review_yaml_file_path) or restart:
+ review_yaml_file_path = utils.make_file_copy(os.path.join(hdf5_file_path_tail,filename+YAML_EXT), 'review')
+ if restart:
+            print("Metadata review has been reinitialized. The review files now reflect the current state of the HDF5 file's metadata.")
+
+
+
+ #if not os.path.exists(os.path.join(review_yaml_file_path_tail+"-review_status"+".txt")):
+
+ with open(review_status_yaml_file_path,'w') as f:
+ f.write('under review')
+
+ # Stage untracked review files and commit them to local repository
+ status = git_ops.get_status()
+ untracked_files = []
+ for line in status.stdout.splitlines():
+ #tmp = line.decode("utf-8")
+ #modified_files.append(tmp.split()[1])
+ if 'review/' in line:
+            if not 'modified' in line: # untracked files
+ untracked_files.append(line.strip())
+ else:
+ untracked_files.append(line.strip().split()[1])
+
+ if 'output_files/'+filename+YAML_EXT in line and not 'modified' in line:
+ untracked_files.append(line.strip())
+
+ if untracked_files:
+ result = subprocess.run(git_ops.add_files_to_git(untracked_files),capture_output=True,check=True)
+ message = 'Initialized metadata review.'
+ commit_output = subprocess.run(git_ops.commit_changes(message),capture_output=True,check=True)
+
+ for line in commit_output.stdout.splitlines():
+ print(line.decode('utf-8'))
+ #else:
+ # print('This action will not have any effect because metadata review process has been already initialized.')
+
+
+
+
+ #status_dict = repo_obj.status()
+ #for filepath, file_status in status_dict.items():
+ # Identify keys associated to review files and stage them
+ # if 'review/'+filename in filepath:
+ # Stage changes
+ # repo_obj.index.add(filepath)
+
+ #author = config_file.author #default_signature
+ #committer = config_file.committer
+ #message = "Initialized metadata review process."
+ #tree = repo_obj.index.write_tree()
+ #oid = repo_obj.create_commit('HEAD', author, committer, message, tree, [repo_obj.head.peel().oid])
+
+ #print("Add and commit"+"\n")
+
+ return review_yaml_file_path, review_status_yaml_file_path
+
+
+
+
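+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# reviewer_attrs is expected to carry at least the reviewer's initials; the review
+# branch is then named 'review_<initials>'. The initials, the 'type' value, and the
+# file path below are hypothetical placeholders echoing the commented-out main() example.
+def _example_initialize_review():
+    reviewer_attrs = {'initials': 'NG', 'type': 'data-owner'}
+    hdf5_file_path = 'output_files/unified_file_smog_chamber_2024-03-19_UTC-OFST_+0100_NG.h5'
+    return first_initialize_metadata_review(hdf5_file_path, reviewer_attrs)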
+
+[docs]
+def second_save_metadata_review(review_yaml_file_path, reviewer_attrs):
+ """
+    Second: Once you are done reviewing the yaml representation of the hdf5 file in the review folder,
+    change the review status to complete and save (add and commit) the modified .yaml and .txt files in the project by
+    running this function.
+
+ """
+    # 1. verify review initialization was performed first
+ # 2. change review status in txt to complete
+ # 3. git add review/ and git commit -m "Submitted metadata review"
+
+ initials = reviewer_attrs['initials']
+ #branch_name = '-'.join([reviewer_attrs['type'],'review','by',initials])
+ branch_name = '_'.join(['review',initials])
+ # TODO: replace with subprocess + git
+ #checkout_review_branch(repo_obj, branch_name)
+
+ # Check you are working at the right branch
+ curr_branch = git_ops.show_current_branch()
+ if not branch_name in curr_branch.stdout:
+ raise ValueError('Please checkout ' + branch_name + ' via Git Bash before submitting metadata review files. ')
+
+ # Collect modified review files
+ status = git_ops.get_status()
+ modified_files = []
+ os.path.basename(review_yaml_file_path)
+ for line in status.stdout.splitlines():
+        # convert line from bytes to str
+ tmp = line.decode("utf-8")
+ if 'modified' in tmp and 'review/' in tmp and os.path.basename(review_yaml_file_path) in tmp:
+ modified_files.append(tmp.split()[1])
+
+ # Stage modified files and commit them to local repository
+ review_yaml_file_path_tail, review_yaml_file_path_head = os.path.split(review_yaml_file_path)
+ filename, ext = os.path.splitext(review_yaml_file_path_head)
+ if modified_files:
+ review_status_file_path = os.path.join("review/",filename+"-review_status"+TXT_EXT)
+ with open(review_status_file_path,'a') as f:
+ f.write('\nsubmitted')
+
+ modified_files.append(review_status_file_path)
+
+ result = subprocess.run(git_ops.add_files_to_git(modified_files),capture_output=True,check=True)
+ message = 'Submitted metadata review.'
+ commit_output = subprocess.run(git_ops.commit_changes(message),capture_output=True,check=True)
+
+ for line in commit_output.stdout.splitlines():
+ print(line.decode('utf-8'))
+ else:
+ print('Nothing to commit.')
+
+
+#
+
+[docs]
+def load_yaml(yaml_review_file):
+ with open(yaml_review_file, 'r') as stream:
+ try:
+ return yaml.load(stream, Loader=yaml.FullLoader)
+ except yaml.YAMLError as exc:
+ print(exc)
+ return None
+
+
+
+[docs]
+def update_hdf5_attributes(input_hdf5_file, yaml_dict):
+
+ def update_attributes(hdf5_obj, yaml_obj):
+ for attr_name, attr_value in yaml_obj['attributes'].items():
+
+ if not isinstance(attr_value, dict):
+ attr_value = {'rename_as': attr_name, 'value': attr_value}
+
+ if (attr_name in hdf5_obj.attrs.keys()): # delete or update
+ if attr_value.get('delete'): # delete when True
+ hdf5_obj.attrs.__delitem__(attr_name)
+ elif not (attr_value.get('rename_as') == attr_name): # update when true
+ hdf5_obj.attrs[attr_value.get('rename_as')] = hdf5_obj.attrs[attr_name] # parse_attribute(attr_value)
+ hdf5_obj.attrs.__delitem__(attr_name)
+ else: # add a new attribute
+ hdf5_obj.attrs.update({attr_name : utils.parse_attribute(attr_value)})
+
+ with h5py.File(input_hdf5_file, 'r+') as f:
+ for key in yaml_dict.keys():
+ hdf5_obj = f[key]
+ yaml_obj = yaml_dict[key]
+ update_attributes(hdf5_obj, yaml_obj)
+
+
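+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# Demonstrates the review dictionary consumed by update_hdf5_attributes: keys are
+# HDF5 object paths (they must exist in the file) and each entry carries an
+# 'attributes' mapping, where a plain value adds a missing attribute,
+# {'delete': True} removes an existing one, and a differing 'rename_as' renames it
+# while keeping its current value. The group path and attribute names are hypothetical.
+def _example_update_hdf5_attributes(input_hdf5_file):
+    example_review_dict = {
+        '/group_1': {
+            'attributes': {
+                'operator': 'NG',                                  # added if not present
+                'obsolete_flag': {'delete': True},                 # deleted if present
+                'temp': {'rename_as': 'temperature', 'value': ''}  # renamed if present
+            }
+        }
+    }
+    update_hdf5_attributes(input_hdf5_file, example_review_dict)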
+
+[docs]
+def update_hdf5_file_with_review(input_hdf5_file, yaml_review_file):
+ yaml_dict = load_yaml(yaml_review_file)
+ update_hdf5_attributes(input_hdf5_file, yaml_dict)
+ # Regenerate yaml snapshot of updated HDF5 file
+ output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(input_hdf5_file)
+    print(f'{output_yml_filename_path} was successfully regenerated from the updated version of {input_hdf5_file}')
+
+
+
+[docs]
+def third_update_hdf5_file_with_review(input_hdf5_file, yaml_review_file, reviewer_attrs={}, hdf5_upload=False):
+ if 'submitted' not in get_review_status(input_hdf5_file):
+        raise ValueError('Review yaml file must be submitted before trying to perform an update. Run second_save_metadata_review() first.')
+
+ update_hdf5_file_with_review(input_hdf5_file, yaml_review_file)
+ git_ops.perform_git_operations(hdf5_upload)
+
+
+
+[docs]
+def count(hdf5_obj,yml_dict):
+ print(hdf5_obj.name)
+ if isinstance(hdf5_obj,h5py.Group) and len(hdf5_obj.name.split('/')) <= 4:
+ obj_review = yml_dict[hdf5_obj.name]
+ additions = [not (item in hdf5_obj.attrs.keys()) for item in obj_review['attributes'].keys()]
+ count_additions = sum(additions)
+ deletions = [not (item in obj_review['attributes'].keys()) for item in hdf5_obj.attrs.keys()]
+        count_deletions = sum(deletions)
+        print('additions', count_additions, 'deletions', count_deletions)
+
+
+
+[docs]
+def last_submit_metadata_review(reviewer_attrs):
+
+ """Fourth: """
+
+ initials =reviewer_attrs['initials']
+
+ repository = 'origin'
+ branch_name = '_'.join(['review',initials])
+
+ push_command = lambda repository,refspec: ['git','push',repository,refspec]
+
+ list_branches_command = ['git','branch','--list']
+
+ branches = subprocess.run(list_branches_command,capture_output=True,text=True,check=True)
+ if not branch_name in branches.stdout:
+ print('There is no branch named '+branch_name+'.\n')
+        print('Make sure to run the data-owner review workflow from the beginning without missing any steps.')
+ return
+
+ curr_branch = git_ops.show_current_branch()
+ if not branch_name in curr_branch.stdout:
+        print('The metadata review could not be completed.\n')
+        print('Make sure a data-owner workflow has already been started on branch '+branch_name+'.\n')
+ print('The step "Complete metadata review" will have no effect.')
+ return
+
+
+
+ # push
+ result = subprocess.run(push_command(repository,branch_name),capture_output=True,text=True,check=True)
+ print(result.stdout)
+
+ # 1. git add output_files/
+ # 2. delete review/
+ #shutil.rmtree(os.path.join(os.path.abspath(os.curdir),"review"))
+ # 3. git rm review/
+ # 4. git commit -m "Completed review process. Current state of hdf5 file and yml should be up to date."
+ return result.returncode
+
+
+
+#import config_file
+#import hdf5_vis
+
+
+[docs]
+class MetadataHarvester:
+ def __init__(self, parent_files=None):
+ if parent_files is None:
+ parent_files = []
+ self.parent_files = parent_files
+ self.metadata = {
+ "project": {},
+ "sample": {},
+ "environment": {},
+ "instruments": {},
+ "datasets": {}
+ }
+
+
+[docs]
+ def add_project_info(self, key_or_dict, value=None, append=False):
+ self._add_info("project", key_or_dict, value, append)
+
+
+
+[docs]
+ def add_sample_info(self, key_or_dict, value=None, append=False):
+ self._add_info("sample", key_or_dict, value, append)
+
+
+
+[docs]
+ def add_environment_info(self, key_or_dict, value=None, append=False):
+ self._add_info("environment", key_or_dict, value, append)
+
+
+
+[docs]
+ def add_instrument_info(self, key_or_dict, value=None, append=False):
+ self._add_info("instruments", key_or_dict, value, append)
+
+
+
+[docs]
+ def add_dataset_info(self, key_or_dict, value=None, append=False):
+ self._add_info("datasets", key_or_dict, value, append)
+
+
+ def _add_info(self, category, key_or_dict, value, append):
+ """Internal helper method to add information to a category."""
+ if isinstance(key_or_dict, dict):
+ self.metadata[category].update(key_or_dict)
+ else:
+ if key_or_dict in self.metadata[category]:
+ if append:
+ current_value = self.metadata[category][key_or_dict]
+
+ if isinstance(current_value, list):
+
+ if not isinstance(value, list):
+ # Append the new value to the list
+ self.metadata[category][key_or_dict].append(value)
+ else:
+ self.metadata[category][key_or_dict] = current_value + value
+
+ elif isinstance(current_value, str):
+ # Append the new value as a comma-separated string
+ self.metadata[category][key_or_dict] = current_value + ',' + str(value)
+ else:
+ # Handle other types (for completeness, usually not required)
+ self.metadata[category][key_or_dict] = [current_value, value]
+ else:
+ self.metadata[category][key_or_dict] = value
+ else:
+ self.metadata[category][key_or_dict] = value
+
+
+[docs]
+ def get_metadata(self):
+ return {
+ "parent_files": self.parent_files,
+ "metadata": self.metadata
+ }
+
+
+
+[docs]
+ def print_metadata(self):
+ print("parent_files", self.parent_files)
+
+ for key in self.metadata.keys():
+ print(key,'metadata:\n')
+ for item in self.metadata[key].items():
+ print(item[0],item[1])
+
+
+
+
+
+[docs]
+ def clear_metadata(self):
+ self.metadata = {
+ "project": {},
+ "sample": {},
+ "environment": {},
+ "instruments": {},
+ "datasets": {}
+ }
+ self.parent_files = []
+
+
+
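+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# Minimal walk-through of the MetadataHarvester API defined above; all keys and
+# values are hypothetical placeholders.
+def _example_metadata_harvester():
+    harvester = MetadataHarvester(parent_files=['output_files/raw_file.h5'])
+    harvester.add_project_info({'project': 'smog_chamber_study', 'processing_file': 'processed.h5'})
+    harvester.add_sample_info('sample', 'ammonium_nitrate')
+    harvester.add_instrument_info('instrument', 'ICAD')
+    harvester.add_instrument_info('instrument', 'htof', append=True)  # string values are appended as 'ICAD,htof'
+    harvester.add_dataset_info('no2_ppb', {'units': 'ppb'})
+    harvester.print_metadata()
+    return harvester.get_metadata()  # {'parent_files': [...], 'metadata': {...}}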
+
+[docs]
+def main():
+
+ output_filename_path = "output_files/unified_file_smog_chamber_2024-03-19_UTC-OFST_+0100_NG.h5"
+ output_yml_filename_path = "output_files/unified_file_smog_chamber_2024-03-19_UTC-OFST_+0100_NG.yalm"
+ output_yml_filename_path_tail, filename = os.path.split(output_yml_filename_path)
+
+ #output_yml_filename_path = hdf5_vis.take_yml_snapshot_of_hdf5_file(output_filename_path)
+
+ #first_initialize_metadata_review(output_filename_path,initials='NG')
+ #second_submit_metadata_review()
+ #if os.path.exists(os.path.join(os.path.join(os.path.abspath(os.curdir),"review"),filename)):
+ # third_update_hdf5_file_with_review(output_filename_path, os.path.join(os.path.join(os.path.abspath(os.curdir),"review"),filename))
+ #fourth_complete_metadata_review()
+
+#if __name__ == '__main__':
+
+# main()
+
-import pandas as pd
-import os
-import sys
-import shutil
-import datetime
-import logging
-import numpy as np
-import h5py
-import re
-
-
-
-[docs]
-def setup_logging(log_dir, log_filename):
- """Sets up logging to a specified directory and file.
-
- Parameters:
- log_dir (str): Directory to save the log file.
- log_filename (str): Name of the log file.
- """
- # Ensure the log directory exists
- os.makedirs(log_dir, exist_ok=True)
-
- # Create a logger instance
- logger = logging.getLogger()
- logger.setLevel(logging.INFO)
-
- # Create a file handler
- log_path = os.path.join(log_dir, log_filename)
- file_handler = logging.FileHandler(log_path)
-
- # Create a formatter and set it for the handler
- formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
- file_handler.setFormatter(formatter)
-
- # Add the handler to the logger
- logger.addHandler(file_handler)
-
-
-
-
-
-
-
-
-
-
-[docs]
-def augment_with_filetype(df):
- df['filetype'] = [os.path.splitext(item)[1][1::] for item in df['filename']]
- #return [os.path.splitext(item)[1][1::] for item in df['filename']]
- return df
-
-
-
-[docs]
-def augment_with_filenumber(df):
- df['filenumber'] = [item[0:item.find('_')] for item in df['filename']]
- #return [item[0:item.find('_')] for item in df['filename']]
- return df
-
-
-
-[docs]
-def group_by_df_column(df, column_name: str):
- """
- df (pandas.DataFrame):
- column_name (str): column_name of df by which grouping operation will take place.
- """
-
- if not column_name in df.columns:
- raise ValueError("column_name must be in the columns of df.")
-
- return df[column_name]
-
-
-
-[docs]
-def split_sample_col_into_sample_and_data_quality_cols(input_data: pd.DataFrame):
-
- sample_name = []
- sample_quality = []
- for item in input_data['sample']:
- if item.find('(')!=-1:
- #print(item)
- sample_name.append(item[0:item.find('(')])
- sample_quality.append(item[item.find('(')+1:len(item)-1])
- else:
- if item=='':
- sample_name.append('Not yet annotated')
- sample_quality.append('unevaluated')
- else:
- sample_name.append(item)
- sample_quality.append('good data')
- input_data['sample'] = sample_name
- input_data['data_quality'] = sample_quality
-
- return input_data
-
-
-
-[docs]
-def make_file_copy(source_file_path, output_folder_name : str = 'tmp_files'):
-
- pathtail, filename = os.path.split(source_file_path)
- #backup_filename = 'backup_'+ filename
- backup_filename = filename
- # Path
- ROOT_DIR = os.path.abspath(os.curdir)
-
- tmp_dirpath = os.path.join(ROOT_DIR,output_folder_name)
- if not os.path.exists(tmp_dirpath):
- os.mkdir(tmp_dirpath)
-
- tmp_file_path = os.path.join(tmp_dirpath,backup_filename)
- shutil.copy(source_file_path, tmp_file_path)
-
- return tmp_file_path
-
-
-
-[docs]
-def created_at(datetime_format = '%Y-%m-%d %H:%M:%S'):
- now = datetime.datetime.now()
- # Populate now object with time zone information obtained from the local system
- now_tz_aware = now.astimezone()
- tz = now_tz_aware.strftime('%z')
- # Replace colons in the time part of the timestamp with hyphens to make it file name friendly
- created_at = now_tz_aware.strftime(datetime_format) #+ '_UTC-OFST_' + tz
- return created_at
-
-
-
-[docs]
-def sanitize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
- # Handle datetime columns (convert to string in 'yyyy-mm-dd hh:mm:ss' format)
- datetime_cols = df.select_dtypes(include=['datetime']).columns
- for col in datetime_cols:
- # Convert datetime to string in the specified format, handling NaT
- df[col] = df[col].dt.strftime('%Y-%m-%d %H-%M-%S')
-
- # Handle object columns with mixed types
- otype_cols = df.select_dtypes(include='O')
- for col in otype_cols:
- col_data = df[col]
-
- # Check if all elements in the column are strings
- if col_data.apply(lambda x: isinstance(x, str)).all():
- df[col] = df[col].astype(str)
- else:
- # If the column contains mixed types, attempt to convert to numeric, coercing errors to NaN
- df[col] = pd.to_numeric(col_data, errors='coerce')
-
- # Handle NaN values differently based on dtype
- if pd.api.types.is_string_dtype(df[col]):
- # Replace NaN in string columns with empty string
- df[col] = df[col].fillna('') # Replace NaN with empty string
- elif pd.api.types.is_numeric_dtype(df[col]):
- # For numeric columns, we want to keep NaN as it is
- # But if integer column has NaN, consider casting to float
- if pd.api.types.is_integer_dtype(df[col]):
- df[col] = df[col].astype(float) # Cast to float to allow NaN
- else:
- df[col] = df[col].fillna(np.nan) # Keep NaN in float columns
-
- return df
-
-
-
-[docs]
-def convert_dataframe_to_np_structured_array(df: pd.DataFrame):
-
- df = sanitize_dataframe(df)
- # Define the dtype for the structured array, ensuring compatibility with h5py
- dtype = []
- for col in df.columns:
-
- col_data = df[col]
- col_dtype = col_data.dtype
-
- try:
- if pd.api.types.is_string_dtype(col_dtype):
- # Convert string dtype to fixed-length strings
- max_len = col_data.str.len().max() if not col_data.isnull().all() else 0
- dtype.append((col, f'S{max_len}'))
- elif pd.api.types.is_integer_dtype(col_dtype):
- dtype.append((col, 'i4')) # Assuming 32-bit integer
- elif pd.api.types.is_float_dtype(col_dtype):
- dtype.append((col, 'f4')) # Assuming 32-bit float
- else:
- # Handle unsupported data types
- print(f"Unsupported dtype found in column '{col}': {col_data.dtype}")
- raise ValueError(f"Unsupported data type: {col_data.dtype}")
-
- except Exception as e:
- # Log more detailed error message
- print(f"Error processing column '{col}': {e}")
- raise
-
- # Convert the DataFrame to a structured array
- structured_array = np.array(list(df.itertuples(index=False, name=None)), dtype=dtype)
-
- return structured_array
-
-
-
-[docs]
-def convert_string_to_bytes(input_list: list):
- """Convert a list of strings into a numpy array with utf8-type entries.
-
- Parameters
- ----------
- input_list (list) : list of string objects
-
- Returns
- -------
- input_array_bytes (ndarray): array of ut8-type entries.
- """
- utf8_type = lambda max_length: h5py.string_dtype('utf-8', max_length)
- if input_list:
- max_length = max(len(item) for item in input_list)
- # Convert the strings to bytes with utf-8 encoding, specifying errors='ignore' to skip characters that cannot be encoded
- input_list_bytes = [item.encode('utf-8', errors='ignore') for item in input_list]
- input_array_bytes = np.array(input_list_bytes,dtype=utf8_type(max_length))
- else:
- input_array_bytes = np.array([],dtype=utf8_type(0))
-
- return input_array_bytes
-
-
-
-[docs]
-def convert_attrdict_to_np_structured_array(attr_value: dict):
- """
- Converts a dictionary of attributes into a numpy structured array for HDF5
- compound type compatibility.
-
- Each dictionary key is mapped to a field in the structured array, with the
- data type (S) determined by the longest string representation of the values.
- If the dictionary is empty, the function returns 'missing'.
-
- Parameters
- ----------
- attr_value : dict
- Dictionary containing the attributes to be converted. Example:
- attr_value = {
- 'name': 'Temperature',
- 'unit': 'Celsius',
- 'value': 23.5,
- 'timestamp': '2023-09-26 10:00'
- }
-
- Returns
- -------
- new_attr_value : ndarray or str
- Numpy structured array with UTF-8 encoded fields. Returns 'missing' if
- the input dictionary is empty.
- """
- dtype = []
- values_list = []
- max_length = max(len(str(attr_value[key])) for key in attr_value.keys())
- for key in attr_value.keys():
- if key != 'rename_as':
- dtype.append((key, f'S{max_length}'))
- values_list.append(attr_value[key])
- if values_list:
- new_attr_value = np.array([tuple(values_list)], dtype=dtype)
- else:
- new_attr_value = 'missing'
-
- return new_attr_value
-
-
-
-
-[docs]
-def infer_units(column_name):
- # TODO: complete or remove
-
- match = re.search('\[.+\]')
-
- if match:
- return match
- else:
- match = re.search('\(.+\)')
-
- return match
-
-
-
-[docs]
-def progressBar(count_value, total, suffix=''):
- bar_length = 100
- filled_up_Length = int(round(bar_length* count_value / float(total)))
- percentage = round(100.0 * count_value/float(total),1)
- bar = '=' * filled_up_Length + '-' * (bar_length - filled_up_Length)
- sys.stdout.write('[%s] %s%s ...%s\r' %(bar, percentage, '%', suffix))
- sys.stdout.flush()
-
-
-
-[docs]
-def copy_directory_with_contraints(input_dir_path, output_dir_path,
- select_dir_keywords = None,
- select_file_keywords = None,
- allowed_file_extensions = None,
- dry_run = False):
- """
- Copies files from input_dir_path to output_dir_path based on specified constraints.
-
- Parameters
- ----------
- input_dir_path (str): Path to the input directory.
- output_dir_path (str): Path to the output directory.
- select_dir_keywords (list): optional, List of keywords for selecting directories.
- select_file_keywords (list): optional, List of keywords for selecting files.
- allowed_file_extensions (list): optional, List of allowed file extensions.
-
- Returns
- -------
- path_to_files_dict (dict): dictionary mapping directory paths to lists of copied file names satisfying the constraints.
- """
-
- # Unconstrained default behavior: No filters, make sure variable are lists even when defined as None in function signature
- select_dir_keywords = select_dir_keywords or []
- select_file_keywords = select_file_keywords or []
- allowed_file_extensions = allowed_file_extensions or []
-
- date = created_at('%Y_%m').replace(":", "-")
- log_dir='logs/'
- setup_logging(log_dir, f"copy_directory_with_contraints_{date}.log")
-
- # Define helper functions. Return by default true when filtering lists are either None or []
- def has_allowed_extension(filename):
- return not allowed_file_extensions or os.path.splitext(filename)[1] in allowed_file_extensions
-
- def file_is_selected(filename):
- return not select_file_keywords or any(keyword in filename for keyword in select_file_keywords)
-
-
- # Collect paths of directories, which are directly connected to the root dir and match select_dir_keywords
- paths = []
- if select_dir_keywords:
- for item in os.listdir(input_dir_path): #Path(input_dir_path).iterdir():
- if any([item in keyword for keyword in select_dir_keywords]):
- paths.append(os.path.join(input_dir_path,item))
- else:
- paths.append(input_dir_path) #paths.append(Path(input_dir_path))
-
-
- path_to_files_dict = {} # Dictionary to store directory-file pairs satisfying constraints
-
- for subpath in paths:
-
- for dirpath, _, filenames in os.walk(subpath,topdown=False):
-
- # Reduce filenames to those that are admissible
- admissible_filenames = [
- filename for filename in filenames
- if file_is_selected(filename) and has_allowed_extension(filename)
- ]
-
- if admissible_filenames: # Only create directory if there are files to copy
-
- relative_dirpath = os.path.relpath(dirpath, input_dir_path)
- target_dirpath = os.path.join(output_dir_path, relative_dirpath)
- path_to_files_dict[target_dirpath] = admissible_filenames
-
- if not dry_run:
-
- # Perform the actual copying
-
- os.makedirs(target_dirpath, exist_ok=True)
-
- for filename in admissible_filenames:
- src_file_path = os.path.join(dirpath, filename)
- dest_file_path = os.path.join(target_dirpath, filename)
- try:
- shutil.copy2(src_file_path, dest_file_path)
- except Exception as e:
- logging.error("Failed to copy %s: %s", src_file_path, e)
-
- return path_to_files_dict
-
-
-
-[docs]
-def to_serializable_dtype(value):
-
- """Transform value's dtype into YAML/JSON compatible dtype
-
- Parameters
- ----------
- value : _type_
- _description_
-
- Returns
- -------
- _type_
- _description_
- """
- try:
- if isinstance(value, np.generic):
- if np.issubdtype(value.dtype, np.bytes_):
- value = value.decode('utf-8')
- elif np.issubdtype(value.dtype, np.unicode_):
- value = str(value)
- elif np.issubdtype(value.dtype, np.number):
- value = float(value)
- else:
- print('Yaml-compatible data-type was not found. Value has been set to NaN.')
- value = np.nan
- elif isinstance(value, np.ndarray):
- # Handling structured array types (with fields)
- if value.dtype.names:
- value = {field: to_serializable_dtype(value[field]) for field in value.dtype.names}
- else:
- # Handling regular array NumPy types with assumption of unform dtype accross array elements
- # TODO: evaluate a more general way to check for individual dtypes
- if isinstance(value[0], bytes):
- # Decode bytes
- value = [item.decode('utf-8') for item in value] if len(value) > 1 else value[0].decode('utf-8')
- elif isinstance(value[0], str):
- # Already a string type
- value = [str(item) for item in value] if len(value) > 1 else str(value[0])
- elif isinstance(value[0], int):
- # Integer type
- value = [int(item) for item in value] if len(value) > 1 else int(value[0])
- elif isinstance(value[0], float):
- # Floating type
- value = [float(item) for item in value] if len(value) > 1 else float(value[0])
- else:
- print('Yaml-compatible data-type was not found. Value has been set to NaN.')
- print("Debug: value.dtype is", value.dtype)
- value = np.nan
-
- except Exception as e:
- print(f'Error converting value: {e}. Value has been set to NaN.')
- value = np.nan
-
- return value
-
-
-
-[docs]
-def is_structured_array(attr_val):
- if isinstance(attr_val,np.ndarray):
- return True if attr_val.dtype.names is not None else False
- else:
- return False
-
-
+import pandas as pd
+import os
+import sys
+import shutil
+import datetime
+import logging
+import numpy as np
+import h5py
+import re
+
+
+
+[docs]
+def setup_logging(log_dir, log_filename):
+ """Sets up logging to a specified directory and file.
+
+ Parameters:
+ log_dir (str): Directory to save the log file.
+ log_filename (str): Name of the log file.
+ """
+ # Ensure the log directory exists
+ os.makedirs(log_dir, exist_ok=True)
+
+ # Create a logger instance
+ logger = logging.getLogger()
+ logger.setLevel(logging.INFO)
+
+ # Create a file handler
+ log_path = os.path.join(log_dir, log_filename)
+ file_handler = logging.FileHandler(log_path)
+
+ # Create a formatter and set it for the handler
+ formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+ file_handler.setFormatter(formatter)
+
+ # Add the handler to the logger
+ logger.addHandler(file_handler)
+
+
+
+
+
+
+
+
+
+
+[docs]
+def augment_with_filetype(df):
+ df['filetype'] = [os.path.splitext(item)[1][1::] for item in df['filename']]
+ #return [os.path.splitext(item)[1][1::] for item in df['filename']]
+ return df
+
+
+
+[docs]
+def augment_with_filenumber(df):
+ df['filenumber'] = [item[0:item.find('_')] for item in df['filename']]
+ #return [item[0:item.find('_')] for item in df['filename']]
+ return df
+
+
+
+[docs]
+def group_by_df_column(df, column_name: str):
+ """
+    df (pandas.DataFrame): DataFrame whose column will define the grouping.
+    column_name (str): Column of df by which the grouping operation will take place.
+ """
+
+ if not column_name in df.columns:
+ raise ValueError("column_name must be in the columns of df.")
+
+ return df[column_name]
+
+
+
+[docs]
+def split_sample_col_into_sample_and_data_quality_cols(input_data: pd.DataFrame):
+
+ sample_name = []
+ sample_quality = []
+ for item in input_data['sample']:
+ if item.find('(')!=-1:
+ #print(item)
+ sample_name.append(item[0:item.find('(')])
+ sample_quality.append(item[item.find('(')+1:len(item)-1])
+ else:
+ if item=='':
+ sample_name.append('Not yet annotated')
+ sample_quality.append('unevaluated')
+ else:
+ sample_name.append(item)
+ sample_quality.append('good data')
+ input_data['sample'] = sample_name
+ input_data['data_quality'] = sample_quality
+
+ return input_data
+
+
+
+[docs]
+def make_file_copy(source_file_path, output_folder_name : str = 'tmp_files'):
+
+ pathtail, filename = os.path.split(source_file_path)
+ #backup_filename = 'backup_'+ filename
+ backup_filename = filename
+ # Path
+ ROOT_DIR = os.path.abspath(os.curdir)
+
+ tmp_dirpath = os.path.join(ROOT_DIR,output_folder_name)
+ if not os.path.exists(tmp_dirpath):
+ os.mkdir(tmp_dirpath)
+
+ tmp_file_path = os.path.join(tmp_dirpath,backup_filename)
+ shutil.copy(source_file_path, tmp_file_path)
+
+ return tmp_file_path
+
+
+
+[docs]
+def created_at(datetime_format = '%Y-%m-%d %H:%M:%S'):
+ now = datetime.datetime.now()
+ # Populate now object with time zone information obtained from the local system
+ now_tz_aware = now.astimezone()
+ tz = now_tz_aware.strftime('%z')
+    # Format the timestamp according to datetime_format; appending the '_UTC-OFST_' offset suffix is currently disabled
+ created_at = now_tz_aware.strftime(datetime_format) #+ '_UTC-OFST_' + tz
+ return created_at
+
+
+
+[docs]
+def sanitize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
+    # Handle datetime columns (convert to string in 'YYYY-MM-DD HH-MM-SS' format, matching the strftime call below)
+ datetime_cols = df.select_dtypes(include=['datetime']).columns
+ for col in datetime_cols:
+ # Convert datetime to string in the specified format, handling NaT
+ df[col] = df[col].dt.strftime('%Y-%m-%d %H-%M-%S')
+
+ # Handle object columns with mixed types
+ otype_cols = df.select_dtypes(include='O')
+ for col in otype_cols:
+ col_data = df[col]
+
+ # Check if all elements in the column are strings
+ if col_data.apply(lambda x: isinstance(x, str)).all():
+ df[col] = df[col].astype(str)
+ else:
+ # If the column contains mixed types, attempt to convert to numeric, coercing errors to NaN
+ df[col] = pd.to_numeric(col_data, errors='coerce')
+
+ # Handle NaN values differently based on dtype
+ if pd.api.types.is_string_dtype(df[col]):
+ # Replace NaN in string columns with empty string
+ df[col] = df[col].fillna('') # Replace NaN with empty string
+ elif pd.api.types.is_numeric_dtype(df[col]):
+ # For numeric columns, we want to keep NaN as it is
+ # But if integer column has NaN, consider casting to float
+ if pd.api.types.is_integer_dtype(df[col]):
+ df[col] = df[col].astype(float) # Cast to float to allow NaN
+ else:
+ df[col] = df[col].fillna(np.nan) # Keep NaN in float columns
+
+ return df
+
+
+
+[docs]
+def convert_dataframe_to_np_structured_array(df: pd.DataFrame):
+
+ df = sanitize_dataframe(df)
+ # Define the dtype for the structured array, ensuring compatibility with h5py
+ dtype = []
+ for col in df.columns:
+
+ col_data = df[col]
+ col_dtype = col_data.dtype
+
+ try:
+ if pd.api.types.is_string_dtype(col_dtype):
+                # Convert string dtype to fixed-length byte strings sized to the longest value
+                max_len = int(col_data.str.len().max()) if not col_data.isnull().all() else 0
+                dtype.append((col, f'S{max(max_len, 1)}'))
+ elif pd.api.types.is_integer_dtype(col_dtype):
+ dtype.append((col, 'i4')) # Assuming 32-bit integer
+ elif pd.api.types.is_float_dtype(col_dtype):
+ dtype.append((col, 'f4')) # Assuming 32-bit float
+ else:
+ # Handle unsupported data types
+ print(f"Unsupported dtype found in column '{col}': {col_data.dtype}")
+ raise ValueError(f"Unsupported data type: {col_data.dtype}")
+
+ except Exception as e:
+ # Log more detailed error message
+ print(f"Error processing column '{col}': {e}")
+ raise
+
+ # Convert the DataFrame to a structured array
+ structured_array = np.array(list(df.itertuples(index=False, name=None)), dtype=dtype)
+
+ return structured_array
+
+
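+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# Round-trips a small mixed-type DataFrame through convert_dataframe_to_np_structured_array
+# (which sanitizes datetimes and strings first) and stores the result as a compound-type
+# HDF5 dataset. The file path and column names are hypothetical placeholders.
+def _example_dataframe_to_hdf5(ofilename='example_table.h5'):
+    df = pd.DataFrame({
+        'filename': ['0001_scan.txt', '0002_scan.txt'],
+        'timestamp': pd.to_datetime(['2024-03-19 10:00', '2024-03-19 11:00']),
+        'value': [1.5, 2.5],
+    })
+    table = convert_dataframe_to_np_structured_array(df)
+    with h5py.File(ofilename, 'w') as f:
+        f.create_dataset('data_table', data=table)
+    return table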
+
+[docs]
+def convert_string_to_bytes(input_list: list):
+ """Convert a list of strings into a numpy array with utf8-type entries.
+
+ Parameters
+ ----------
+ input_list (list) : list of string objects
+
+ Returns
+ -------
+    input_array_bytes (ndarray): array of utf8-type entries.
+ """
+ utf8_type = lambda max_length: h5py.string_dtype('utf-8', max_length)
+ if input_list:
+ max_length = max(len(item) for item in input_list)
+ # Convert the strings to bytes with utf-8 encoding, specifying errors='ignore' to skip characters that cannot be encoded
+ input_list_bytes = [item.encode('utf-8', errors='ignore') for item in input_list]
+ input_array_bytes = np.array(input_list_bytes,dtype=utf8_type(max_length))
+ else:
+ input_array_bytes = np.array([],dtype=utf8_type(0))
+
+ return input_array_bytes
+
+
+
+[docs]
+def convert_attrdict_to_np_structured_array(attr_value: dict):
+ """
+ Converts a dictionary of attributes into a numpy structured array for HDF5
+ compound type compatibility.
+
+ Each dictionary key is mapped to a field in the structured array, with the
+ data type (S) determined by the longest string representation of the values.
+ If the dictionary is empty, the function returns 'missing'.
+
+ Parameters
+ ----------
+ attr_value : dict
+ Dictionary containing the attributes to be converted. Example:
+ attr_value = {
+ 'name': 'Temperature',
+ 'unit': 'Celsius',
+ 'value': 23.5,
+ 'timestamp': '2023-09-26 10:00'
+ }
+
+ Returns
+ -------
+ new_attr_value : ndarray or str
+ Numpy structured array with UTF-8 encoded fields. Returns 'missing' if
+ the input dictionary is empty.
+ """
+ dtype = []
+ values_list = []
+    max_length = max((len(str(attr_value[key])) for key in attr_value.keys()), default=1)
+ for key in attr_value.keys():
+ if key != 'rename_as':
+ dtype.append((key, f'S{max_length}'))
+ values_list.append(attr_value[key])
+ if values_list:
+ new_attr_value = np.array([tuple(values_list)], dtype=dtype)
+ else:
+ new_attr_value = 'missing'
+
+ return new_attr_value
+
+
+
+
+[docs]
+def infer_units(column_name):
+ # TODO: complete or remove
+
+    match = re.search(r'\[.+\]', column_name)
+
+ if match:
+ return match
+ else:
+        match = re.search(r'\(.+\)', column_name)
+
+ return match
+
+
+
+[docs]
+def progressBar(count_value, total, suffix=''):
+ bar_length = 100
+ filled_up_Length = int(round(bar_length* count_value / float(total)))
+ percentage = round(100.0 * count_value/float(total),1)
+ bar = '=' * filled_up_Length + '-' * (bar_length - filled_up_Length)
+ sys.stdout.write('[%s] %s%s ...%s\r' %(bar, percentage, '%', suffix))
+ sys.stdout.flush()
+
+
+
+[docs]
+def copy_directory_with_contraints(input_dir_path, output_dir_path,
+ select_dir_keywords = None,
+ select_file_keywords = None,
+ allowed_file_extensions = None,
+ dry_run = False):
+ """
+ Copies files from input_dir_path to output_dir_path based on specified constraints.
+
+ Parameters
+ ----------
+ input_dir_path (str): Path to the input directory.
+ output_dir_path (str): Path to the output directory.
+ select_dir_keywords (list): optional, List of keywords for selecting directories.
+ select_file_keywords (list): optional, List of keywords for selecting files.
+ allowed_file_extensions (list): optional, List of allowed file extensions.
+
+ Returns
+ -------
+ path_to_files_dict (dict): dictionary mapping directory paths to lists of copied file names satisfying the constraints.
+ """
+
+    # Unconstrained default behavior: no filters; make sure the filter variables are lists even when passed as None in the function signature
+ select_dir_keywords = select_dir_keywords or []
+ select_file_keywords = select_file_keywords or []
+ allowed_file_extensions = allowed_file_extensions or []
+
+ date = created_at('%Y_%m').replace(":", "-")
+ log_dir='logs/'
+ setup_logging(log_dir, f"copy_directory_with_contraints_{date}.log")
+
+ # Define helper functions. Return by default true when filtering lists are either None or []
+ def has_allowed_extension(filename):
+ return not allowed_file_extensions or os.path.splitext(filename)[1] in allowed_file_extensions
+
+ def file_is_selected(filename):
+ return not select_file_keywords or any(keyword in filename for keyword in select_file_keywords)
+
+
+ # Collect paths of directories, which are directly connected to the root dir and match select_dir_keywords
+ paths = []
+ if select_dir_keywords:
+ for item in os.listdir(input_dir_path): #Path(input_dir_path).iterdir():
+ if any([item in keyword for keyword in select_dir_keywords]):
+ paths.append(os.path.join(input_dir_path,item))
+ else:
+ paths.append(input_dir_path) #paths.append(Path(input_dir_path))
+
+
+ path_to_files_dict = {} # Dictionary to store directory-file pairs satisfying constraints
+
+ for subpath in paths:
+
+ for dirpath, _, filenames in os.walk(subpath,topdown=False):
+
+ # Reduce filenames to those that are admissible
+ admissible_filenames = [
+ filename for filename in filenames
+ if file_is_selected(filename) and has_allowed_extension(filename)
+ ]
+
+ if admissible_filenames: # Only create directory if there are files to copy
+
+ relative_dirpath = os.path.relpath(dirpath, input_dir_path)
+ target_dirpath = os.path.join(output_dir_path, relative_dirpath)
+ path_to_files_dict[target_dirpath] = admissible_filenames
+
+ if not dry_run:
+
+ # Perform the actual copying
+
+ os.makedirs(target_dirpath, exist_ok=True)
+
+ for filename in admissible_filenames:
+ src_file_path = os.path.join(dirpath, filename)
+ dest_file_path = os.path.join(target_dirpath, filename)
+ try:
+ shutil.copy2(src_file_path, dest_file_path)
+ except Exception as e:
+ logging.error("Failed to copy %s: %s", src_file_path, e)
+
+ return path_to_files_dict
+
+
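+# --- Hedged usage sketch (illustrative only, not part of the original module) ---
+# Previews which files copy_directory_with_contraints would copy by running it with
+# dry_run=True before performing the actual copy. The directory paths, keywords, and
+# extensions below are hypothetical placeholders.
+def _example_constrained_copy():
+    preview = copy_directory_with_contraints(
+        input_dir_path='//network-drive/campaign_2024',
+        output_dir_path='output_files/campaign_2024_subset',
+        select_dir_keywords=['gas_phase'],
+        select_file_keywords=['2024-03-19'],
+        allowed_file_extensions=['.txt', '.dat'],
+        dry_run=True)
+    for target_dirpath, filenames in preview.items():
+        print(target_dirpath, ':', len(filenames), 'file(s) selected')
+    return preview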
+
+[docs]
+def to_serializable_dtype(value):
+
+ """Transform value's dtype into YAML/JSON compatible dtype
+
+ Parameters
+ ----------
+    value : np.generic or np.ndarray
+        NumPy scalar or array to convert.
+
+    Returns
+    -------
+    str, float, int, list, dict, or np.nan
+        Python-native equivalent of the input value, or np.nan if no YAML-compatible conversion was found.
+ """
+ try:
+ if isinstance(value, np.generic):
+ if np.issubdtype(value.dtype, np.bytes_):
+ value = value.decode('utf-8')
+ elif np.issubdtype(value.dtype, np.unicode_):
+ value = str(value)
+ elif np.issubdtype(value.dtype, np.number):
+ value = float(value)
+ else:
+ print('Yaml-compatible data-type was not found. Value has been set to NaN.')
+ value = np.nan
+ elif isinstance(value, np.ndarray):
+ # Handling structured array types (with fields)
+ if value.dtype.names:
+ value = {field: to_serializable_dtype(value[field]) for field in value.dtype.names}
+ else:
+                # Handling regular NumPy array types, with the assumption of a uniform dtype across array elements
+ # TODO: evaluate a more general way to check for individual dtypes
+ if isinstance(value[0], bytes):
+ # Decode bytes
+ value = [item.decode('utf-8') for item in value] if len(value) > 1 else value[0].decode('utf-8')
+ elif isinstance(value[0], str):
+ # Already a string type
+ value = [str(item) for item in value] if len(value) > 1 else str(value[0])
+ elif isinstance(value[0], int):
+ # Integer type
+ value = [int(item) for item in value] if len(value) > 1 else int(value[0])
+ elif isinstance(value[0], float):
+ # Floating type
+ value = [float(item) for item in value] if len(value) > 1 else float(value[0])
+ else:
+ print('Yaml-compatible data-type was not found. Value has been set to NaN.')
+ print("Debug: value.dtype is", value.dtype)
+ value = np.nan
+
+ except Exception as e:
+ print(f'Error converting value: {e}. Value has been set to NaN.')
+ value = np.nan
+
+ return value
+
+
+
+[docs]
+def is_structured_array(attr_val):
+ if isinstance(attr_val,np.ndarray):
+ return True if attr_val.dtype.names is not None else False
+ else:
+ return False
+
+
-import sys
-import os
-root_dir = os.path.abspath(os.curdir)
-sys.path.append(root_dir)
-
-import h5py
-import yaml
-
-import numpy as np
-import pandas as pd
-
-from plotly.subplots import make_subplots
-import plotly.graph_objects as go
-import plotly.express as px
-#import plotly.io as pio
-from src.hdf5_ops import get_parent_child_relationships
-
-
-
-
-[docs]
-def display_group_hierarchy_on_a_treemap(filename: str):
-
- """
- filename (str): hdf5 file's filename"""
-
- with h5py.File(filename,'r') as file:
- nodes, parents, values = get_parent_child_relationships(file)
-
- metadata_list = []
- metadata_dict={}
- for key in file.attrs.keys():
- #if 'metadata' in key:
- if isinstance(file.attrs[key], str): # Check if the attribute is a string
- metadata_key = key[key.find('_') + 1:]
- metadata_value = file.attrs[key]
- metadata_dict[metadata_key] = metadata_value
- metadata_list.append(f'{metadata_key}: {metadata_value}')
-
- #metadata_dict[key[key.find('_')+1::]]= file.attrs[key]
- #metadata_list.append(key[key.find('_')+1::]+':'+file.attrs[key])
-
- metadata = '<br>'.join(['<br>'] + metadata_list)
-
- customdata_series = pd.Series(nodes)
- customdata_series[0] = metadata
-
- fig = make_subplots(1, 1, specs=[[{"type": "domain"}]],)
- fig.add_trace(go.Treemap(
- labels=nodes, #formating_df['formated_names'][nodes],
- parents=parents,#formating_df['formated_names'][parents],
- values=values,
- branchvalues='remainder',
- customdata= customdata_series,
- #marker=dict(
- # colors=df_all_trees['color'],
- # colorscale='RdBu',
- # cmid=average_score),
- #hovertemplate='<b>%{label} </b> <br> Number of files: %{value}<br> Success rate: %{color:.2f}',
- hovertemplate='<b>%{label} </b> <br> Count: %{value} <br> Path: %{customdata}',
- name='',
- root_color="lightgrey"
- ))
- fig.update_layout(width = 800, height= 600, margin = dict(t=50, l=25, r=25, b=25))
- fig.show()
- file_name, file_ext = os.path.splitext(filename)
- fig.write_html(file_name + ".html")
-
-
- #pio.write_image(fig,file_name + ".png",width=800,height=600,format='png')
-
-#
-
+import sys
+import os
+root_dir = os.path.abspath(os.curdir)
+sys.path.append(root_dir)
+
+import h5py
+import yaml
+
+import numpy as np
+import pandas as pd
+
+from plotly.subplots import make_subplots
+import plotly.graph_objects as go
+import plotly.express as px
+#import plotly.io as pio
+from src.hdf5_ops import get_parent_child_relationships
+
+
+
+
+[docs]
+def display_group_hierarchy_on_a_treemap(filename: str):
+
+ """
+ filename (str): hdf5 file's filename"""
+
+ with h5py.File(filename,'r') as file:
+ nodes, parents, values = get_parent_child_relationships(file)
+
+ metadata_list = []
+ metadata_dict={}
+ for key in file.attrs.keys():
+ #if 'metadata' in key:
+ if isinstance(file.attrs[key], str): # Check if the attribute is a string
+ metadata_key = key[key.find('_') + 1:]
+ metadata_value = file.attrs[key]
+ metadata_dict[metadata_key] = metadata_value
+ metadata_list.append(f'{metadata_key}: {metadata_value}')
+
+ #metadata_dict[key[key.find('_')+1::]]= file.attrs[key]
+ #metadata_list.append(key[key.find('_')+1::]+':'+file.attrs[key])
+
+ metadata = '<br>'.join(['<br>'] + metadata_list)
+
+ customdata_series = pd.Series(nodes)
+ customdata_series[0] = metadata
+
+ fig = make_subplots(1, 1, specs=[[{"type": "domain"}]],)
+ fig.add_trace(go.Treemap(
+ labels=nodes, #formating_df['formated_names'][nodes],
+ parents=parents,#formating_df['formated_names'][parents],
+ values=values,
+ branchvalues='remainder',
+ customdata= customdata_series,
+ #marker=dict(
+ # colors=df_all_trees['color'],
+ # colorscale='RdBu',
+ # cmid=average_score),
+ #hovertemplate='<b>%{label} </b> <br> Number of files: %{value}<br> Success rate: %{color:.2f}',
+ hovertemplate='<b>%{label} </b> <br> Count: %{value} <br> Path: %{customdata}',
+ name='',
+ root_color="lightgrey"
+ ))
+ fig.update_layout(width = 800, height= 600, margin = dict(t=50, l=25, r=25, b=25))
+ fig.show()
+ file_name, file_ext = os.path.splitext(filename)
+ fig.write_html(file_name + ".html")
+
+
+ #pio.write_image(fig,file_name + ".png",width=800,height=600,format='png')
+
+#
+
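A short usage sketch for the treemap helper above (the module path follows the module index; the file name is illustrative):

    from visualization.hdf5_vis import display_group_hierarchy_on_a_treemap

    # Renders the treemap in the active Plotly renderer and writes an
    # interactive HTML copy next to the input file.
    display_group_hierarchy_on_a_treemap('output_files/collection_of_experiments.h5')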
[The remainder of the Sphinx search-highlighting helper script (the SphinxHighlight module shipped with the built HTML) is removed and re-added verbatim; only line endings change.]

diff --git a/docs/build/html/genindex.html b/docs/build/html/genindex.html
index 3afeba2..2e1259c 100644
--- a/docs/build/html/genindex.html
+++ b/docs/build/html/genindex.html
@@ -1,409 +1,409 @@
[genindex.html: the general index tables are re-emitted with normalized line endings; no index entries are added or removed.]
[Page navigation from the built HTML, removed and re-added with only line-ending changes. The "Contents:" toctree for utils.g5505_utils lists:]
augment_with_filenumber()
augment_with_filetype()
convert_attrdict_to_np_structured_array()
convert_dataframe_to_np_structured_array()
convert_string_to_bytes()
copy_directory_with_contraints()
created_at()
group_by_df_column()
infer_units()
is_callable_list()
is_str_list()
is_structured_array()
make_file_copy()
progressBar()
sanitize_dataframe()
setup_logging()
split_sample_col_into_sample_and_data_quality_cols()
to_serializable_dtype()
[Rendered API reference for pipelines.data_integration and pipelines.metadata_revision, removed and re-added with only line-ending changes. The documented behaviour:]

pipelines.data_integration
Helper function to copy a directory with constraints and create an HDF5 file.
Loads the YAML configuration file, sets up logging, and validates the required keys and datetime_steps.
Integrates the data sources specified by the input configuration file into HDF5 files.
Args: yaml_config_file_path (str): path to the YAML configuration file; log_dir (str): directory to save the log file.
Returns: list of Paths to the created HDF5 file(s).

pipelines.metadata_revision
Updates, appends, or deletes metadata attributes in an HDF5 file based on a provided YAML dictionary.
Args: the path to the HDF5 file, and a dictionary specifying objects and their attributes with operations. Example format:

    {
        "object_name": {
            "attributes": {
                "attr_name": {
                    "value": attr_value,
                    "delete": true | false
                }
            }
        }
    }
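A sketch of a review dictionary in the format documented above, expressed as a Python dict (the object path and attribute names are illustrative):

    review_dict = {
        "/instrument_1/humidity": {                      # object_name: group or dataset path in the HDF5 file
            "attributes": {
                "units": {"value": "percentage"},        # set or update this attribute's value
                "obsolete_note": {"delete": True},       # mark this attribute for deletion
            }
        }
    }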
[Rendered API reference for src.hdf5_ops and src.hdf5_writer, removed and re-added with only line-ending changes. The documented behaviour:]

src.hdf5_ops: HDF5 data operations class (Bases: object)
A class to handle fundamental, mid-level HDF5 file operations that power data updates, metadata revision, and data analysis on HDF5 files encoding multi-instrument experimental campaign data.
Parameters: path_to_file (str): path/to/hdf5file; mode (str): 'r' or 'r+', read or read/write mode, only when the file exists.

Append metadata
Appends metadata attributes to the specified object (obj_name) based on the provided annotation_dict.
This method ensures that the provided metadata attributes do not overwrite any existing ones; if an attribute already exists, a ValueError is raised. It supports scalar values (int, float, str) and compound values such as dictionaries, which are converted into NumPy structured arrays before being added to the metadata.
Parameters: obj_name: path to the target object (dataset or group) within the HDF5 file; annotation_dict: scalars (int, float, str) or compound values (dictionaries) for more complex metadata. Example of a compound value:

    "attr_name": {
        "value": 65,
        "units": "percentage",
        "range": "[0,100]",
        "definition": "amount of water vapor present ..."
    }

Delete metadata
Deletes metadata attributes of the specified object (obj_name) based on the provided annotation_dict.
Parameters: obj_name: path to the target object (dataset or group); annotation_dict: dictionary where keys are attribute names and values are dictionaries containing {"delete": True} to mark them for deletion, e.g. annotation_dict = {"attr_to_be_deleted": {"delete": True}}.
Behaviour: deletes the specified attributes from the object's metadata if marked for deletion; issues a warning if an attribute is not found or not marked for deletion.

Dataset and attribute access
Returns a copy of the dataset content as a DataFrame when possible, otherwise as a NumPy array.
Gets file attributes from the object at path = obj_path; for example, obj_path = '/' returns the root-level attributes or metadata.

Rename metadata
Renames metadata attributes of the specified object (obj_name) based on the provided renaming_map.
Parameters: obj_name: path to the target object (dataset or group); renaming_map: dictionary where keys are current attribute names (strings) and values are the new attribute names (strings or byte strings), e.g. {"old_attr_name": "new_attr_name", "old_attr_2": "new_attr_2"}.

Update metadata
Updates the value of existing metadata attributes of the specified object (obj_name) based on the provided annotation_dict. Non-existing attributes are disregarded; the append_metadata() method should be used to include those in the metadata.
Parameters: obj_name and annotation_dict as for the append operation: scalars (int, float, str) or compound values (dictionaries) converted to NumPy structured arrays (see the compound-value example above).

MATLAB Table reconstruction
Reconstructs a MATLAB Table encoded in a .h5 file as a Pandas DataFrame. The input .h5 file contains one group per row of the MATLAB Table; each group stores the table's dataset-like variables as Datasets, while categorical and numerical variables are represented as attributes of the respective group. To ensure homogeneity of data columns, the DataFrame is constructed column-wise.
Parameters: the name of the .h5 file, which may include the file's location and path information. Returns: the MATLAB Table reconstructed as a Pandas DataFrame.

Metadata serialization
Serializes metadata from an HDF5 file into YAML or JSON format.
Parameters: the path to the input HDF5 file; the folder depth controlling how much of the HDF5 hierarchy is traversed (default is 4); the output format, either 'yaml' or 'json' (default is 'yaml'). Returns: the output file path where the serialized metadata is stored (either .yaml or .json).

src.hdf5_writer
Creates an HDF5 file with hierarchical groups based on the specified grouping functions or columns.
Args: ofilename (str): path for the output HDF5 file; input_data (pd.DataFrame or str): input data as a DataFrame or a valid file system path; group_by_funcs (list): callables or column names defining the hierarchical grouping; approach (str): 'top-down' or 'bottom-up' approach for creating the HDF5 file; extract_attrs_func (callable, optional): function to extract additional attributes for HDF5 groups.
Returns: None.

Creates an .h5 file with name "output_filename" that preserves the directory tree (or folder structure) of a given filesystem path. The data integration capabilities are limited by our file reader, which can only access data from a list of admissible file formats; these, however, can be extended. Directories are groups in the resulting HDF5 file, and files are formatted as composite objects consisting of a group, file, and attributes.
Parameters: the name of the output HDF5 file; 'input_file_system_path': path to the root directory, specified with forward slashes, e.g., path/to/root; a pre-processed dictionary whose keys are directory paths on the input directory's tree and whose values are lists of files (if provided, 'input_file_system_path' is ignored); 'select_dir_keywords': only directory paths containing a word in 'select_dir_keywords' are included (when empty, all directory paths are included in the HDF5 file group hierarchy); metadata to include at the root level of the HDF5 file; mode: 'w' creates the file, truncating it if it exists, 'r+' opens it read/write (the file must exist); by default, mode = "w".
Returns: path to the created HDF5 file.

Saves processed DataFrame columns with annotations to an HDF5 file.
Args: df (pd.DataFrame): DataFrame containing processed time series; annotator: annotator object with a get_metadata method; output_filename (str): path to the source HDF5 file.
-Each dictionary key is mapped to a field in the structured array, with the -data type (S) determined by the longest string representation of the values. -If the dictionary is empty, the function returns ‘missing’.
-Dictionary containing the attributes to be converted. Example: -attr_value = {
---‘name’: ‘Temperature’, -‘unit’: ‘Celsius’, -‘value’: 23.5, -‘timestamp’: ‘2023-09-26 10:00’
-
}
-Numpy structured array with UTF-8 encoded fields. Returns ‘missing’ if -the input dictionary is empty.
-Convert a list of strings into a numpy array with utf8-type entries.
-input_list (list) : list of string objects
-input_array_bytes (ndarray): array of ut8-type entries.
-Copies files from input_dir_path to output_dir_path based on specified constraints.
---input_dir_path (str): Path to the input directory. -output_dir_path (str): Path to the output directory. -select_dir_keywords (list): optional, List of keywords for selecting directories. -select_file_keywords (list): optional, List of keywords for selecting files. -allowed_file_extensions (list): optional, List of allowed file extensions.
-
--path_to_files_dict (dict): dictionary mapping directory paths to lists of copied file names satisfying the constraints.
-
df (pandas.DataFrame): -column_name (str): column_name of df by which grouping operation will take place.
-Sets up logging to a specified directory and file.
-log_dir (str): Directory to save the log file. -log_filename (str): Name of the log file.
-Converts a dictionary of attributes into a numpy structured array for HDF5 +compound type compatibility.
+Each dictionary key is mapped to a field in the structured array, with the +data type (S) determined by the longest string representation of the values. +If the dictionary is empty, the function returns ‘missing’.
+Dictionary containing the attributes to be converted. Example: +attr_value = {
+++‘name’: ‘Temperature’, +‘unit’: ‘Celsius’, +‘value’: 23.5, +‘timestamp’: ‘2023-09-26 10:00’
+
}
+Numpy structured array with UTF-8 encoded fields. Returns ‘missing’ if +the input dictionary is empty.
+Convert a list of strings into a numpy array with utf8-type entries.
+input_list (list) : list of string objects
+input_array_bytes (ndarray): array of ut8-type entries.
+Copies files from input_dir_path to output_dir_path based on specified constraints.
+++input_dir_path (str): Path to the input directory. +output_dir_path (str): Path to the output directory. +select_dir_keywords (list): optional, List of keywords for selecting directories. +select_file_keywords (list): optional, List of keywords for selecting files. +allowed_file_extensions (list): optional, List of allowed file extensions.
+
++path_to_files_dict (dict): dictionary mapping directory paths to lists of copied file names satisfying the constraints.
+
df (pandas.DataFrame): +column_name (str): column_name of df by which grouping operation will take place.
+Sets up logging to a specified directory and file.
+log_dir (str): Directory to save the log file. +log_filename (str): Name of the log file.
+- n | ||
- |
- notebooks | - |
- p | ||
- |
- pipelines | - |
- |
- pipelines.data_integration | - |
- |
- pipelines.metadata_revision | - |
- s | ||
- |
- src | - |
- |
- src.hdf5_ops | - |
- |
- src.hdf5_writer | - |
- u | ||
- |
- utils | - |
- |
- utils.g5505_utils | - |
- v | ||
- |
- visualization | - |
- |
- visualization.hdf5_vis | - |
+ n | ||
+ |
+ notebooks | + |
+ p | ||
+ |
+ pipelines | + |
+ |
+ pipelines.data_integration | + |
+ |
+ pipelines.metadata_revision | + |
+ s | ||
+ |
+ src | + |
+ |
+ src.hdf5_ops | + |
+ |
+ src.hdf5_writer | + |
+ u | ||
+ |
+ utils | + |
+ |
+ utils.g5505_utils | + |
+ v | ||
+ |
+ visualization | + |
+ |
+ visualization.hdf5_vis | + |