Compare commits

9 Commits

| Author | SHA1 | Date |
|---|---|---|
|  | 5f6d0e4f2b |  |
|  | e96ecfa951 |  |
|  | 8daa57c396 |  |
|  | 11b9e35526 |  |
|  | 2a9e39b9ca |  |
|  | 1d2f311b1f |  |
|  | d43ead5f6c |  |
|  | 978101f9c2 |  |
|  | d6bb20ae7d |  |
CHANGELOG.md

@@ -30,4 +30,11 @@ Format based on [Keep a Changelog](https://keepachangelog.com) and [Semantic Ver

- Include Licence

### Changed

- Update README.md with new description + authors and funding sections

## [1.2.0] - 2025-06-29

### Changed

- Updated `README.md` to use Miniforge and `conda-forge` for environment setup.
- Removed unreliable `setup_env.sh` shell-based installation instructions.
- Added instructions to configure Conda to use only `conda-forge` with strict priority.
- Included a notice to verify base environment origin via `conda info`.
README.md (61 lines changed)

@@ -30,7 +30,7 @@ For **Windows** users, the following are required:

 1. **Git Bash**: Install [Git Bash](https://git-scm.com/downloads) to run shell scripts (`.sh` files).

-2. **Conda**: Install [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
+2. **Miniforge**: Install [Miniforge](https://conda-forge.org/download/).

 3. **PSI Network Access**
@@ -44,56 +44,65 @@ For **Windows** users, the following are required:

 ### Download DIMA

-Open a **Git Bash** terminal.
+Open a **Git Bash** terminal (or a terminal of your choice).

-Navigate to your `Gitea` folder, clone the repository, and navigate to the `dima` folder as follows:
+Navigate to your `Gitea` folder, clone the repository, and move into the `dima` directory:

 ```bash
 cd path/to/Gitea
 git clone --recurse-submodules https://gitea.psi.ch/5505-public/dima.git
 cd dima
 ```
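> Note: the clone command above uses `--recurse-submodules`, so submodules arrive with the initial clone. As a hedged aside (not part of the README diff), if the repository was cloned without that flag, the standard Git fallback is:

```bash
# Fetch and check out any submodules that were skipped during the initial clone
git submodule update --init --recursive
```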
-### Install Python Interpreter
+### Install Python Environment Using Miniforge and conda-forge

-Open **Git Bash** terminal.
+We recommend using Miniforge to manage your conda environments. Miniforge ensures compatibility with packages from the conda-forge channel.

-**Option 1**: Install a suitable conda environment `multiphase_chemistry_env` inside the repository `dima` as follows:
-
-```bash
-cd path/to/GitLab/dima
-bash setup_env.sh
-```
-
-Open **Anaconda Prompt** or a terminal with access to conda.
-
-**Option 2**: Install conda environment from YAML file as follows:
-
-```bash
-cd path/to/GitLab/dima
-conda env create --file environment.yml
-```
+1. Make sure you have installed **Miniforge**.
+
+2. Open **Miniforge Prompt**.
+
+   > ⚠️ Ensure your Conda base environment is from Miniforge (not Anaconda). Run `conda info` and check for `miniforge` in the base path and `conda-forge` as the default channel.
+
+3. Create the environment from `environment.yml`. Inside the **Miniforge Prompt**, or any terminal with access to conda, run:
+
+   ```bash
+   cd path/to/Gitea/dima
+   conda env create --file environment.yml
+   ```
+
+4. Activate the environment:
+
+   ```bash
+   conda activate dima_env
+   ```
+
+5. Remove the `defaults` channel (if present):
+
+   ```bash
+   conda config --remove channels defaults
+   ```
+
+6. Add `conda-forge` as the highest-priority channel:
+
+   ```bash
+   conda config --add channels conda-forge
+   conda config --set channel_priority strict
+   ```
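> Note: steps 5 and 6 change conda's global channel configuration. A quick verification, added here as a hedged aside rather than part of the README diff, uses conda's own introspection commands:

```bash
# Show the configured channels and the channel-priority mode set above
conda config --show channels channel_priority

# The base environment path should point at a Miniforge install,
# and conda-forge should be listed first among the channel URLs
conda info
```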
-<details>
-<summary> <b> Working with Jupyter Notebooks </b> </summary>
+### Working with Jupyter Notebooks

-We now make the previously installed Python environment `multiphase_chemistry_env` selectable as a kernel in Jupyter's interface.
+We now make the previously installed Python environment `dima_env` selectable as a kernel in Jupyter's interface.

 1. Open an Anaconda Prompt, check if the environment exists, and activate it:

    ```
    conda env list
-   conda activate multiphase_chemistry_env
+   conda activate dima_env
    ```

 2. Register the environment in Jupyter:

    ```
-   python -m ipykernel install --user --name multiphase_chemistry_env --display-name "Python (multiphase_chemistry_env)"
+   python -m ipykernel install --user --name dima_env --display-name "Python (dima_env)"
    ```

 3. Start a Jupyter Notebook by running the command:

    ```
    jupyter notebook
    ```

-   and select the `multiphase_chemistry_env` environment from the kernel options.
+   and select the `dima_env` environment from the kernel options.

-</details>
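> Note: if the `dima_env` kernel does not appear in Jupyter's kernel menu, listing the registered kernelspecs is a standard check; this snippet is an editor addition, not part of the README diff:

```bash
# "dima_env" should appear in this list after the ipykernel install step
jupyter kernelspec list
```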
 ## Repository Structure and Software architecture
environment.yml

@@ -1,8 +1,6 @@
-name: pyenv5505
-#prefix: ./envs/pyenv5505 # Custom output folder
+name: dima_env
 channels:
   - conda-forge
-  - defaults
 dependencies:
   - python=3.11
   - jupyter
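> Note: Miniforge also bundles `mamba`, so the environment defined above can be created with mamba's drop-in equivalents of the conda commands. A hedged alternative, not shown in the diff:

```bash
# mamba resolves dependencies faster than classic conda and accepts the same flags here
mamba env create --file environment.yml
conda activate dima_env
```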
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'kinetic_flowtube_study' # 'beamtime', 'smog_chamber_study'
 dataset_startdate:
 dataset_enddate:
-actris_level: '0'
+data_level: 0

 # Instrument folders containing raw data from the campaign
 instrument_datafolder:
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'beamtime' # beamtime, smog_chamber, lab_experiment
 dataset_startdate: '2023-09-22'
 dataset_enddate: '2023-09-25'
-actris_level: '0'
+data_level: 0

 institution : "PSI"
 filename_format : "institution,experiment,contact"
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'smog_chamber_study' # beamtime, smog_chamber, lab_experiment
 dataset_startdate:
 dataset_enddate:
-actris_level: '0'
+data_level: 0

 # Instrument folders containing raw data from the campaign
 instrument_datafolder:
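> Note: all three campaign descriptors receive the same rename — the quoted string `actris_level: '0'` becomes the integer `data_level: 0`. A minimal sketch of what a consumer of these files now sees, using a hypothetical filename (the real filenames are not preserved in this extract):

```python
import yaml  # PyYAML

# "campaign_descriptor.yaml" is a hypothetical stand-in for the three config files in the diff
with open("campaign_descriptor.yaml") as f:
    cfg = yaml.safe_load(f)

# The old key held a string ('0'); the new key holds a plain integer
assert "actris_level" not in cfg
assert isinstance(cfg["data_level"], int)
```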
@@ -23,7 +23,9 @@ import logging
 import utils.g5505_utils as utils

+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_jsonflag_as_dict(path_to_file):
@@ -21,10 +21,9 @@ import argparse
 import logging

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_acsm_files_as_dict(filename: str, instruments_dir: str = None, work_with_copy: bool = True):
     # If instruments_dir is not provided, use the default path relative to the module directory
     if not instruments_dir:
@@ -21,8 +21,9 @@ import argparse
 import logging
 import warnings
 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with_copy: bool = True):

     filename = os.path.normpath(filename)

@@ -44,7 +45,7 @@ def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with

     # Read header as a dictionary and detect where data table starts
-    header_dict = {'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
     data_start = False
     # Work with copy of the file for safety
     if work_with_copy:

@@ -54,7 +55,7 @@ def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with

     # Run header detection
     header_line_number, column_names, fmt_dict, table_preamble = detect_table_header_line(tmp_filename, format_variants)

+    header_dict = {}
     # Unpack validated format info
     table_header = fmt_dict['table_header']
     separator = fmt_dict['separator']
@@ -22,11 +22,12 @@ import logging
 import utils.g5505_utils as utils
 import src.hdf5_ops as hdf5_ops
 import instruments.filereader_registry as filereader_registry
+from src.meta_ops import record_data_lineage

-def hdf5_file_reader(dest_file_obj_or_path, src_file_path=None, dest_group_name=None, work_with_copy: bool = True):
-    import inspect
+@record_data_lineage(data_level=0)
+def hdf5_file_reader(dest_file_obj_or_path, src_file_path : str = None, dest_group_name : str = None, work_with_copy: bool = True):
     """
     Reads an HDF5 file and copies its contents to a destination group.
     If an HDF5 file object is provided, it skips reading from a file path.
@@ -22,7 +22,7 @@ import argparse

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

 def split_header(header_lines):
     header_lines_copy = []

@@ -79,6 +79,8 @@ def extract_var_descriptions(part2):

+@record_data_lineage(data_level=0)
 def read_nasa_ames_as_dict(filename, instruments_dir: str = None, work_with_copy: bool = True):

     # If instruments_dir is not provided, use the default path relative to the module directory
@@ -20,7 +20,9 @@ import argparse
 import logging

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_structured_file_as_dict(path_to_file):
     """
     Reads a JSON or YAML file, flattens nested structures using pandas.json_normalize,

@@ -32,7 +34,7 @@ def read_structured_file_as_dict(path_to_file):
     _, path_head = os.path.split(path_to_file)

     file_dict['name'] = path_head
-    file_dict['attributes_dict'] = {'actris_level': 0, 'processing_date': utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
+    file_dict['attributes_dict'] = {} #'actris_level': 0, 'processing_date': utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
     file_dict['datasets'] = []

     try:
@@ -21,8 +21,9 @@ from igor2.binarywave import load as loadibw
 import logging
 import argparse
 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_xps_ibw_file_as_dict(filename):
     """
     Reads IBW files from the Multiphase Chemistry Group, which contain XPS spectra and acquisition settings,

@@ -66,7 +67,7 @@ def read_xps_ibw_file_as_dict(filename):

     # Group name and attributes
     file_dict['name'] = path_head
-    file_dict['attributes_dict'] = {'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
+    file_dict['attributes_dict'] = {} #'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}

     # Convert notes of bytes class to string class and split string into a list of elements separated by '\r'.
     notes_list = file_obj['wave']['note'].decode("utf-8").split('\r')
@@ -18,6 +18,7 @@ if dimaPath not in sys.path: # Avoid duplicate entries
 import yaml
 import logging
 from datetime import datetime
+import shutil
 # Importing chain class from itertools
 from itertools import chain
 import shutil

@@ -57,7 +58,7 @@ def load_config_and_setup_logging(yaml_config_file_path, log_dir):
     # Define required keys
     required_keys = [
         'experiment', 'contact', 'input_file_directory', 'output_file_directory',
-        'instrument_datafolder', 'project', 'actris_level'
+        'instrument_datafolder', 'project', 'data_level'
     ]

     # Supported integration modes

@@ -258,7 +259,7 @@ def run_pipeline(path_to_config_yamlFile, log_dir='logs/'):
     select_dir_keywords = config_dict['instrument_datafolder']

     # Define root folder metadata dictionary
-    root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'actris_level']}
+    root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'data_level']}

     # Get dataset start and end dates
     dataset_startdate = config_dict['dataset_startdate']
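> Note: the `required_keys` list now expects `data_level` instead of `actris_level`, so older config files that still define `actris_level` will fail validation. Below is a hedged sketch of the kind of check `load_config_and_setup_logging` implies; the exact validation code is not shown in the diff:

```python
# Sketch only: mirrors the required_keys list from the diff, not the actual DIMA validation code
required_keys = [
    'experiment', 'contact', 'input_file_directory', 'output_file_directory',
    'instrument_datafolder', 'project', 'data_level',
]

def validate_config(config_dict: dict) -> None:
    # Collect any required keys absent from the loaded YAML config
    missing = [key for key in required_keys if key not in config_dict]
    if missing:
        raise KeyError(f"Config is missing required keys: {missing}")
```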
setup_env.sh (file deleted, 47 lines)

@@ -1,47 +0,0 @@
#!/bin/bash

# Define the name of the environment
ENV_NAME="multiphase_chemistry_env"

# Check if mamba is available and use it instead of conda for faster installation
if command -v mamba &> /dev/null; then
    CONDA_COMMAND="mamba"
else
    CONDA_COMMAND="conda"
fi

# Create the conda environment with all dependencies, resolving from conda-forge and defaults
$CONDA_COMMAND create -y -n "$ENV_NAME" -c conda-forge -c defaults python=3.11 \
    jupyter numpy h5py pandas matplotlib plotly=5.24 scipy pip

# Check if the environment was successfully created
if [ $? -ne 0 ]; then
    echo "Failed to create the environment '$ENV_NAME'. Please check the logs above for details."
    exit 1
fi

# Activate the new environment
if source activate "$ENV_NAME" 2>/dev/null || conda activate "$ENV_NAME" 2>/dev/null; then
    echo "Environment '$ENV_NAME' activated successfully."
else
    echo "Failed to activate the environment '$ENV_NAME'. Please check your conda setup."
    exit 1
fi

# Install additional pip packages only if the environment is activated
echo "Installing additional pip packages..."
pip install pybis==1.35 igor2 ipykernel sphinx

# Check if pip installations were successful
if [ $? -ne 0 ]; then
    echo "Failed to install pip packages. Please check the logs above for details."
    exit 1
fi

# Optional: Export the environment to a YAML file (commented out)
# $CONDA_COMMAND env export -n "$ENV_NAME" > "$ENV_NAME-environment.yaml"

# Print success message
echo "Environment '$ENV_NAME' created and configured successfully."
# echo "Environment configuration saved to '$ENV_NAME-environment.yaml'."
src/meta_ops.py (new file, 84 lines)

@@ -0,0 +1,84 @@
import sys
import os

try:
    thisFilePath = os.path.abspath(__file__)
except NameError:
    print("Error: __file__ is not available. Ensure the script is being run from a file.")
    print("[Notice] Path to DIMA package may not be resolved properly.")
    thisFilePath = os.getcwd()  # Use current directory or specify a default

dimaPath = os.path.normpath(os.path.join(thisFilePath, "..", '..'))  # Move up to project root

if dimaPath not in sys.path:  # Avoid duplicate entries
    sys.path.append(dimaPath)


import h5py
import pandas as pd
import numpy as np
import logging
import datetime
import yaml
import json
import copy

import utils.g5505_utils as utils
#import src.hdf5_writer as hdf5_lib
import inspect
from functools import wraps


def record_data_lineage(data_level: int = 0):
    """Parameterized decorator to record data lineage information.

    `data_level` is a user-defined integer.
    Adds lineage metadata to dict returns or HDF5 group attributes."""

    def decorator(function: callable):
        # Get relative path to the script where the function is defined
        tmpFunctionAbsPath = inspect.getfile(function)
        functionFileRelativePath = os.path.relpath(tmpFunctionAbsPath, dimaPath)
        func_signature = inspect.signature(function)

        @wraps(function)
        def wrapper_func(*args, **kwargs):
            # Bind args/kwargs to the function signature
            bound_args = func_signature.bind(*args, **kwargs)
            bound_args.apply_defaults()

            dest_file_path = bound_args.arguments.get('dest_file_obj_or_path')
            dest_group_name = bound_args.arguments.get('dest_group_name')

            # If the file is already an h5py.File object, use its filename
            if isinstance(dest_file_path, h5py.File):
                dest_file_path = dest_file_path.filename

            # Call the original function
            result = function(*args, **kwargs)

            # Prepare lineage metadata
            data_lineage_metadata = {
                'data_level': data_level,
                'processing_script': functionFileRelativePath,
                'processing_date': utils.created_at(),
            }

            # Case 1: dict result → inject metadata
            if isinstance(result, dict):
                if 'attributes_dict' not in result:
                    result['attributes_dict'] = {}
                result['attributes_dict'].update(data_lineage_metadata)

            # Case 2: HDF5 group → inject metadata safely
            elif dest_file_path and dest_group_name:
                if os.path.exists(dest_file_path) and dest_file_path.endswith('.h5'):
                    with h5py.File(dest_file_path, mode='r+', track_order=True) as fobj:
                        if dest_group_name in fobj:
                            for key, value in data_lineage_metadata.items():
                                fobj[dest_group_name].attrs[key] = value

            return result

        return wrapper_func

    return decorator
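> Note: to make the decorator's behavior concrete, here is a hedged usage sketch; the toy reader below is hypothetical and not part of the diff. For a decorated function that returns a dict, as the instrument readers above do, the wrapper injects the lineage metadata into `attributes_dict`; for HDF5-writing functions such as `hdf5_file_reader`, it instead writes the same keys as attributes of the destination group.

```python
from src.meta_ops import record_data_lineage

@record_data_lineage(data_level=0)
def read_toy_file_as_dict(path_to_file):
    # Hypothetical reader: returns the dict shape the DIMA readers use
    return {'name': path_to_file, 'attributes_dict': {}, 'datasets': []}

result = read_toy_file_as_dict('example.txt')
# The wrapper has updated attributes_dict with:
#   data_level (the decorator argument), processing_script (the reader's
#   module path relative to the DIMA root), and processing_date
print(result['attributes_dict'])
```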