Compare commits

9 Commits

| Author | SHA1 | Date |
|---|---|---|
|  | 5f6d0e4f2b |  |
|  | e96ecfa951 |  |
|  | 8daa57c396 |  |
|  | 11b9e35526 |  |
|  | 2a9e39b9ca |  |
|  | 1d2f311b1f |  |
|  | d43ead5f6c |  |
|  | 978101f9c2 |  |
|  | d6bb20ae7d |  |
CHANGELOG.md

@@ -30,4 +30,11 @@ Format based on [Keep a Changelog](https://keepachangelog.com) and [Semantic Ver

- Include Licence

### Changed

- Update README.md with new description + authors and funding sections

## [1.2.0] - 2025-06-29

### Changed

- Updated `README.md` to use Miniforge and `conda-forge` for environment setup.
- Removed unreliable `setup_env.sh` shell-based installation instructions.
- Added instructions to configure Conda to use only `conda-forge` with strict priority.
- Included a notice to verify base environment origin via `conda info`.
README.md (61 lines changed)

@@ -30,7 +30,7 @@ For **Windows** users, the following are required:

 1. **Git Bash**: Install [Git Bash](https://git-scm.com/downloads) to run shell scripts (`.sh` files).

-2. **Conda**: Install [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
+2. **Miniforge**: Install [Miniforge](https://conda-forge.org/download/).

 3. **PSI Network Access**
@@ -44,56 +44,65 @@ For **Windows** users, the following are required:

 ### Download DIMA

-Open a **Git Bash** terminal.
+Open a **Git Bash** terminal (or a terminal of your choice).

-Navigate to your `Gitea` folder, clone the repository, and navigate to the `dima` folder as follows:
+Navigate to your `Gitea` folder, clone the repository, and move into the `dima` directory:

 ```bash
 cd path/to/Gitea
 git clone --recurse-submodules https://gitea.psi.ch/5505-public/dima.git
 cd dima
 ```
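> Note: the clone command above uses `--recurse-submodules`, so submodules arrive with the initial clone. As a hedged aside (not part of the README diff), if the repository was cloned without that flag, the standard Git fallback is:

```bash
# Fetch and check out any submodules that were skipped during the initial clone
git submodule update --init --recursive
```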
-### Install Python Interpreter
+### Install Python Environment Using Miniforge and conda-forge

-Open **Git Bash** terminal.
+We recommend using Miniforge to manage your conda environments. Miniforge ensures compatibility with packages from the conda-forge channel.

-**Option 1**: Install a suitable conda environment `multiphase_chemistry_env` inside the repository `dima` as follows:
-
-```bash
-cd path/to/GitLab/dima
-bash setup_env.sh
-```
-
-Open **Anaconda Prompt** or a terminal with access to conda.
-
-**Option 2**: Install conda environment from YAML file as follows:
-
-```bash
-cd path/to/GitLab/dima
-conda env create --file environment.yml
-```
+1. Make sure you have installed **Miniforge**.
+
+2. Open **Miniforge Prompt**.
+
+   > ⚠️ Ensure your Conda base environment is from Miniforge (not Anaconda). Run `conda info` and check for `miniforge` in the base path and `conda-forge` as the default channel.
+
+3. Create the environment from `environment.yml`. Inside the **Miniforge Prompt**, or any terminal with access to conda, run:
+
+   ```bash
+   cd path/to/Gitea/dima
+   conda env create --file environment.yml
+   ```
+
+4. Activate the environment:
+
+   ```bash
+   conda activate dima_env
+   ```
+
+5. Remove the `defaults` channel (if present):
+
+   ```bash
+   conda config --remove channels defaults
+   ```
+
+6. Add `conda-forge` as the highest-priority channel:
+
+   ```bash
+   conda config --add channels conda-forge
+   conda config --set channel_priority strict
+   ```
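> Note: steps 5 and 6 change conda's global channel configuration. A quick verification, added here as a hedged aside rather than part of the README diff, uses conda's own introspection commands:

```bash
# Show the configured channels and the channel-priority mode set above
conda config --show channels channel_priority

# The base environment path should point at a Miniforge install,
# and conda-forge should be listed first among the channel URLs
conda info
```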
-<details>
-<summary> <b> Working with Jupyter Notebooks </b> </summary>
+### Working with Jupyter Notebooks

-We now make the previously installed Python environment `multiphase_chemistry_env` selectable as a kernel in Jupyter's interface.
+We now make the previously installed Python environment `dima_env` selectable as a kernel in Jupyter's interface.

 1. Open an Anaconda Prompt, check if the environment exists, and activate it:

    ```
    conda env list
-   conda activate multiphase_chemistry_env
+   conda activate dima_env
    ```

 2. Register the environment in Jupyter:

    ```
-   python -m ipykernel install --user --name multiphase_chemistry_env --display-name "Python (multiphase_chemistry_env)"
+   python -m ipykernel install --user --name dima_env --display-name "Python (dima_env)"
    ```

 3. Start a Jupyter Notebook by running the command:

    ```
    jupyter notebook
    ```

-   and select the `multiphase_chemistry_env` environment from the kernel options.
+   and select the `dima_env` environment from the kernel options.

-</details>
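> Note: if the `dima_env` kernel does not appear in Jupyter's kernel menu, listing the registered kernelspecs is a standard check; this snippet is an editor addition, not part of the README diff:

```bash
# "dima_env" should appear in this list after the ipykernel install step
jupyter kernelspec list
```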
 ## Repository Structure and Software architecture
environment.yml

@@ -1,8 +1,6 @@
-name: pyenv5505
-#prefix: ./envs/pyenv5505 # Custom output folder
+name: dima_env
 channels:
   - conda-forge
-  - defaults
 dependencies:
   - python=3.11
   - jupyter
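> Note: Miniforge also bundles `mamba`, so the environment defined above can be created with mamba's drop-in equivalents of the conda commands. A hedged alternative, not shown in the diff:

```bash
# mamba resolves dependencies faster than classic conda and accepts the same flags here
mamba env create --file environment.yml
conda activate dima_env
```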
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'kinetic_flowtube_study' # 'beamtime', 'smog_chamber_study'
 dataset_startdate:
 dataset_enddate:
-actris_level: '0'
+data_level: 0

 # Instrument folders containing raw data from the campaign
 instrument_datafolder:
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'beamtime' # beamtime, smog_chamber, lab_experiment
 dataset_startdate: '2023-09-22'
 dataset_enddate: '2023-09-25'
-actris_level: '0'
+data_level: 0

 institution : "PSI"
 filename_format : "institution,experiment,contact"
@@ -13,7 +13,7 @@ group_id: '5505'
 experiment: 'smog_chamber_study' # beamtime, smog_chamber, lab_experiment
 dataset_startdate:
 dataset_enddate:
-actris_level: '0'
+data_level: 0

 # Instrument folders containing raw data from the campaign
 instrument_datafolder:
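> Note: all three campaign descriptors receive the same rename — the quoted string `actris_level: '0'` becomes the integer `data_level: 0`. A minimal sketch of what a consumer of these files now sees, using a hypothetical filename (the real filenames are not preserved in this extract):

```python
import yaml  # PyYAML

# "campaign_descriptor.yaml" is a hypothetical stand-in for the three config files in the diff
with open("campaign_descriptor.yaml") as f:
    cfg = yaml.safe_load(f)

# The old key held a string ('0'); the new key holds a plain integer
assert "actris_level" not in cfg
assert isinstance(cfg["data_level"], int)
```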
@@ -23,7 +23,9 @@ import logging
 import utils.g5505_utils as utils

+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_jsonflag_as_dict(path_to_file):
@@ -21,10 +21,9 @@ import argparse
 import logging

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_acsm_files_as_dict(filename: str, instruments_dir: str = None, work_with_copy: bool = True):
     # If instruments_dir is not provided, use the default path relative to the module directory
     if not instruments_dir:
@@ -21,8 +21,9 @@ import argparse
 import logging
 import warnings
 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with_copy: bool = True):

     filename = os.path.normpath(filename)

@@ -44,7 +45,7 @@ def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with

     # Read header as a dictionary and detect where data table starts
-    header_dict = {'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
     data_start = False
     # Work with copy of the file for safety
     if work_with_copy:

@@ -54,7 +55,7 @@ def read_txt_files_as_dict(filename: str, instruments_dir: str = None, work_with

     # Run header detection
     header_line_number, column_names, fmt_dict, table_preamble = detect_table_header_line(tmp_filename, format_variants)

+    header_dict = {}
     # Unpack validated format info
     table_header = fmt_dict['table_header']
     separator = fmt_dict['separator']
@@ -22,11 +22,12 @@ import logging
 import utils.g5505_utils as utils
 import src.hdf5_ops as hdf5_ops
 import instruments.filereader_registry as filereader_registry
+from src.meta_ops import record_data_lineage

-def hdf5_file_reader(dest_file_obj_or_path, src_file_path=None, dest_group_name=None, work_with_copy: bool = True):
-    import inspect
+@record_data_lineage(data_level=0)
+def hdf5_file_reader(dest_file_obj_or_path, src_file_path : str = None, dest_group_name : str = None, work_with_copy: bool = True):
     """
     Reads an HDF5 file and copies its contents to a destination group.
     If an HDF5 file object is provided, it skips reading from a file path.
@@ -22,7 +22,7 @@ import argparse

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

 def split_header(header_lines):
     header_lines_copy = []

@@ -79,6 +79,8 @@ def extract_var_descriptions(part2):

+@record_data_lineage(data_level=0)
 def read_nasa_ames_as_dict(filename, instruments_dir: str = None, work_with_copy: bool = True):

     # If instruments_dir is not provided, use the default path relative to the module directory
@@ -20,7 +20,9 @@ import argparse
 import logging

 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_structured_file_as_dict(path_to_file):
     """
     Reads a JSON or YAML file, flattens nested structures using pandas.json_normalize,

@@ -32,7 +34,7 @@ def read_structured_file_as_dict(path_to_file):
     _, path_head = os.path.split(path_to_file)

     file_dict['name'] = path_head
-    file_dict['attributes_dict'] = {'actris_level': 0, 'processing_date': utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
+    file_dict['attributes_dict'] = {} #'actris_level': 0, 'processing_date': utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
     file_dict['datasets'] = []

     try:
@@ -21,8 +21,9 @@ from igor2.binarywave import load as loadibw
 import logging
 import argparse
 import utils.g5505_utils as utils
+from src.meta_ops import record_data_lineage

+@record_data_lineage(data_level=0)
 def read_xps_ibw_file_as_dict(filename):
     """
     Reads IBW files from the Multiphase Chemistry Group, which contain XPS spectra and acquisition settings,

@@ -66,7 +67,7 @@ def read_xps_ibw_file_as_dict(filename):

     # Group name and attributes
     file_dict['name'] = path_head
-    file_dict['attributes_dict'] = {'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}
+    file_dict['attributes_dict'] = {} #'actris_level': 0, 'processing_date':utils.created_at(), 'processing_script' : os.path.relpath(thisFilePath,dimaPath)}

     # Convert notes of bytes class to string class and split string into a list of elements separated by '\r'.
     notes_list = file_obj['wave']['note'].decode("utf-8").split('\r')
@@ -18,6 +18,7 @@ if dimaPath not in sys.path: # Avoid duplicate entries
 import yaml
 import logging
 from datetime import datetime
+import shutil
 # Importing chain class from itertools
 from itertools import chain
 import shutil

@@ -57,7 +58,7 @@ def load_config_and_setup_logging(yaml_config_file_path, log_dir):
     # Define required keys
     required_keys = [
         'experiment', 'contact', 'input_file_directory', 'output_file_directory',
-        'instrument_datafolder', 'project', 'actris_level'
+        'instrument_datafolder', 'project', 'data_level'
     ]

     # Supported integration modes

@@ -258,7 +259,7 @@ def run_pipeline(path_to_config_yamlFile, log_dir='logs/'):
     select_dir_keywords = config_dict['instrument_datafolder']

     # Define root folder metadata dictionary
-    root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'actris_level']}
+    root_metadata_dict = {key : config_dict[key] for key in ['project', 'experiment', 'contact', 'data_level']}

     # Get dataset start and end dates
     dataset_startdate = config_dict['dataset_startdate']
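> Note: the `required_keys` list now expects `data_level` instead of `actris_level`, so older config files that still define `actris_level` will fail validation. Below is a hedged sketch of the kind of check `load_config_and_setup_logging` implies; the exact validation code is not shown in the diff:

```python
# Sketch only: mirrors the required_keys list from the diff, not the actual DIMA validation code
required_keys = [
    'experiment', 'contact', 'input_file_directory', 'output_file_directory',
    'instrument_datafolder', 'project', 'data_level',
]

def validate_config(config_dict: dict) -> None:
    # Collect any required keys absent from the loaded YAML config
    missing = [key for key in required_keys if key not in config_dict]
    if missing:
        raise KeyError(f"Config is missing required keys: {missing}")
```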
setup_env.sh (file deleted, 47 lines)

@@ -1,47 +0,0 @@
#!/bin/bash

# Define the name of the environment
ENV_NAME="multiphase_chemistry_env"

# Check if mamba is available and use it instead of conda for faster installation
if command -v mamba &> /dev/null; then
    CONDA_COMMAND="mamba"
else
    CONDA_COMMAND="conda"
fi

# Create the conda environment with all dependencies, resolving from conda-forge and defaults
$CONDA_COMMAND create -y -n "$ENV_NAME" -c conda-forge -c defaults python=3.11 \
    jupyter numpy h5py pandas matplotlib plotly=5.24 scipy pip

# Check if the environment was successfully created
if [ $? -ne 0 ]; then
    echo "Failed to create the environment '$ENV_NAME'. Please check the logs above for details."
    exit 1
fi

# Activate the new environment
if source activate "$ENV_NAME" 2>/dev/null || conda activate "$ENV_NAME" 2>/dev/null; then
    echo "Environment '$ENV_NAME' activated successfully."
else
    echo "Failed to activate the environment '$ENV_NAME'. Please check your conda setup."
    exit 1
fi

# Install additional pip packages only if the environment is activated
echo "Installing additional pip packages..."
pip install pybis==1.35 igor2 ipykernel sphinx

# Check if pip installations were successful
if [ $? -ne 0 ]; then
    echo "Failed to install pip packages. Please check the logs above for details."
    exit 1
fi

# Optional: Export the environment to a YAML file (commented out)
# $CONDA_COMMAND env export -n "$ENV_NAME" > "$ENV_NAME-environment.yaml"

# Print success message
echo "Environment '$ENV_NAME' created and configured successfully."
# echo "Environment configuration saved to '$ENV_NAME-environment.yaml'."
src/meta_ops.py (new file, 84 lines)

@@ -0,0 +1,84 @@
import sys
import os

try:
    thisFilePath = os.path.abspath(__file__)
except NameError:
    print("Error: __file__ is not available. Ensure the script is being run from a file.")
    print("[Notice] Path to DIMA package may not be resolved properly.")
    thisFilePath = os.getcwd()  # Use current directory or specify a default

dimaPath = os.path.normpath(os.path.join(thisFilePath, "..", '..'))  # Move up to project root

if dimaPath not in sys.path:  # Avoid duplicate entries
    sys.path.append(dimaPath)


import h5py
import pandas as pd
import numpy as np
import logging
import datetime
import yaml
import json
import copy

import utils.g5505_utils as utils
#import src.hdf5_writer as hdf5_lib
import inspect
from functools import wraps


def record_data_lineage(data_level: int = 0):
    """Parameterized decorator to record data lineage information.

    `data_level` is a user-defined integer.
    Adds lineage metadata to dict returns or HDF5 group attributes."""

    def decorator(function: callable):
        # Get relative path to the script where the function is defined
        tmpFunctionAbsPath = inspect.getfile(function)
        functionFileRelativePath = os.path.relpath(tmpFunctionAbsPath, dimaPath)
        func_signature = inspect.signature(function)

        @wraps(function)
        def wrapper_func(*args, **kwargs):
            # Bind args/kwargs to the function signature
            bound_args = func_signature.bind(*args, **kwargs)
            bound_args.apply_defaults()

            dest_file_path = bound_args.arguments.get('dest_file_obj_or_path')
            dest_group_name = bound_args.arguments.get('dest_group_name')

            # If the file is already an h5py.File object, use its filename
            if isinstance(dest_file_path, h5py.File):
                dest_file_path = dest_file_path.filename

            # Call the original function
            result = function(*args, **kwargs)

            # Prepare lineage metadata
            data_lineage_metadata = {
                'data_level': data_level,
                'processing_script': functionFileRelativePath,
                'processing_date': utils.created_at(),
            }

            # Case 1: dict result → inject metadata
            if isinstance(result, dict):
                if 'attributes_dict' not in result:
                    result['attributes_dict'] = {}
                result['attributes_dict'].update(data_lineage_metadata)

            # Case 2: HDF5 group → inject metadata safely
            elif dest_file_path and dest_group_name:
                if os.path.exists(dest_file_path) and dest_file_path.endswith('.h5'):
                    with h5py.File(dest_file_path, mode='r+', track_order=True) as fobj:
                        if dest_group_name in fobj:
                            for key, value in data_lineage_metadata.items():
                                fobj[dest_group_name].attrs[key] = value

            return result

        return wrapper_func

    return decorator
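> Note: to make the decorator's behavior concrete, here is a hedged usage sketch; the toy reader below is hypothetical and not part of the diff. For a decorated function that returns a dict, as the instrument readers above do, the wrapper injects the lineage metadata into `attributes_dict`; for HDF5-writing functions such as `hdf5_file_reader`, it instead writes the same keys as attributes of the destination group.

```python
from src.meta_ops import record_data_lineage

@record_data_lineage(data_level=0)
def read_toy_file_as_dict(path_to_file):
    # Hypothetical reader: returns the dict shape the DIMA readers use
    return {'name': path_to_file, 'attributes_dict': {}, 'datasets': []}

result = read_toy_file_as_dict('example.txt')
# The wrapper has updated attributes_dict with:
#   data_level (the decorator argument), processing_script (the reader's
#   module path relative to the DIMA root), and processing_date
print(result['attributes_dict'])
```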