commit f9b9e1226b2258cdd26754adcd6ff02571f3013e
Author: Florez Ospina Juan Felipe
Date:   Wed Jun 18 14:23:18 2025 +0200

    Basic init idear template

diff --git a/.dockerignore b/.dockerignore
new file mode 100644
index 0000000..2f67e11
--- /dev/null
+++ b/.dockerignore
@@ -0,0 +1,16 @@
+data/
+figures/
+notebooks/
+scripts/
+envs/
+logs/
+*.pyc
+__pycache__/
+*.h5
+.Trash-0/
+.ipynb_checkpoints/
+env_setup.sh
+docker-compose.yaml
+run_container.sh
+TODO.md
+.env
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..90d5b36
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+envs/
+logs/
+*.pyc
+__pycache__/
+*.h5
+.env
+.ipynb_checkpoints
+.Trash-0
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..367c882
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "dima"]
+	path = dima
+	url = https://gitea.psi.ch/5505-public/dima.git
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..2ecfbbb
--- /dev/null
+++ b/README.md
@@ -0,0 +1,131 @@
+# IDEAR FAIRification Toolkit
+
+This is a **containerized, JupyterLab-based data toolkit** developed as part of the IDEAR project. It supports efficient, reproducible, and metadata-enriched data processing workflows for instrument-generated datasets.
+
+---
+
+### Key Features
+
+- Modular pipeline with reusable notebook workflows
+- Metadata-driven HDF5 outputs for long-term data reuse
+- Optional network-mounted input for seamless integration with shared drives
+
+---
+
+### Output Format
+
+- **Self-describing HDF5 files**, including:
+  - Project-level, contextual, and data lineage metadata
+
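+For instance, the embedded metadata of a produced file can be inspected directly with `h5py`. The snippet below is a minimal illustrative sketch: the file name is a placeholder, and the exact attribute names depend on your campaign descriptor.
+
+```python
+import h5py
+
+# Placeholder path to a toolkit-produced HDF5 file
+with h5py.File("data/collection_output.h5", "r") as f:
+    # Project-level metadata is stored as attributes on the root group
+    for key, value in f.attrs.items():
+        print(f"{key}: {value}")
+    # List every group and dataset path contained in the file
+    f.visit(print)
+```
+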
+---
+
+### Extensibility
+
+New instruments can be supported by extending the file parsing capabilities in the `dima/` module.
+
+## Repository Structure
+
+<details>
+<summary>Click to expand</summary>
+
+- `data/` — Input and output datasets (mounted volume)
+- `figures/` — Output visualizations (mounted volume)
+- `notebooks/` — Jupyter notebooks for processing and metadata integration
+- `scripts/` — Supplementary processing logic
+- `dima/` — Metadata and HDF5 schema utilities (persisted module)
+- `Dockerfile` — Container image definition
+- `docker-compose.yaml` — Local and networked deployment options
+- `env_setup.sh` — Optional local environment bootstrap
+- `CITATION.cff`, `LICENCE`, `README.md`, `.gitignore`, `.dockerignore` — Project metadata and config
+- `campaignDescriptor.yaml` — Campaign-specific metadata
+
+</details>
+
+---
+
+## Getting Started
+
+### Requirements
+
+#### For Docker-based usage:
+
+- **Docker Desktop**
+- **Git Bash** (for running shell scripts on Windows)
+
+#### Optional for local (non-Docker) usage:
+
+- **Conda** (`miniconda` or `anaconda`)
+
+#### If accessing network drives (e.g., PSI):
+
+- PSI credentials with access to mounted network shares
+
+---
+
+## Clone the Repository
+
+```bash
+git clone --recurse-submodules <repository-url>
+cd <repository-folder>
+```
+
+## Run with Docker
+
+This toolkit includes a containerized JupyterLab environment for executing the data processing pipeline, plus an optional dashboard for manual flagging.
+
+1. Open **PowerShell as Administrator** and navigate to the cloned repository.
+2. Create a `.env` file in the repository root.
+3. **Securely store your network drive access credentials** in the `.env` file by adding the following lines:
+   ```plaintext
+   CIFS_USER=<your-username>
+   CIFS_PASS=<your-password>
+   JUPYTER_TOKEN=my-token
+   ```
+   **To protect your credentials:**
+   - Do not share the `.env` file with others.
+   - Ensure the file is excluded from version control by adding `.env` to your `.gitignore` and `.dockerignore` files.
+4. Open **Docker Desktop**, then build the container image:
+   ```bash
+   docker build -f Dockerfile -t idear_processor .
+   ```
+5. Start the environment:
+
+   - **Locally, without network drive mount:**
+
+     ```bash
+     docker compose up idear_processor
+     ```
+
+   - **With network drive mount:**
+
+     ```bash
+     docker compose up idear_processor_networked
+     ```
+
+6. Access:
+   - **Jupyter Lab**: [http://localhost:8889/lab/tree/notebooks/](http://localhost:8889/lab/tree/notebooks/)
+
+7. Stop the app. In the previously opened PowerShell terminal, press:
+   ```bash
+   Ctrl + C
+   ```
+   After the container has stopped properly, remove its container process with:
+   ```bash
+   docker rm $(docker ps -aq --filter ancestor=idear_processor)
+   ```
+
+## (Optional) Set Up the Python Environment
+
+> Required only if you plan to run the toolkit outside of Docker.
+
+If **Git Bash** lacks a suitable Python interpreter, run:
+
+```bash
+bash env_setup.sh
+```
+
+## Citation
+
+## License
diff --git a/data/.gitkeep b/data/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/dima b/dima
new file mode 160000
index 0000000..d5fa2b6
--- /dev/null
+++ b/dima
@@ -0,0 +1 @@
+Subproject commit d5fa2b6c71b2d04e67691d0a544ad4f7b4ce6334
diff --git a/figures/.gitkeep b/figures/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/notebooks/demo_data_integration.ipynb b/notebooks/demo_data_integration.ipynb
new file mode 100644
index 0000000..5888231
--- /dev/null
+++ b/notebooks/demo_data_integration.ipynb
@@ -0,0 +1,154 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data integration workflow of an experimental campaign\n",
+    "\n",
+    "In this notebook, we will go through our data integration workflow. This involves the following steps:\n",
+    "\n",
+    "1. Specify the data integration task through a YAML configuration file.\n",
+    "2. Create an integrated HDF5 file of the experimental campaign from the configuration file.\n",
+    "3. Display the created HDF5 file using a treemap.\n",
+    "\n",
+    "## Import libraries and modules\n",
+    "\n",
+    "* Execute (or Run) the cell below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import os\n",
+    "\n",
+    "# Set up project root directory\n",
+    "notebook_dir = os.getcwd()  # Current working directory (assumes running from notebooks/)\n",
+    "project_path = os.path.normpath(os.path.join(notebook_dir, \"..\"))  # Move up to project root\n",
+    "dima_path = os.path.normpath(os.path.join(project_path, \"dima\"))  # Path to the dima submodule\n",
+    "\n",
+    "for item in sys.path:\n",
+    "    print(item)\n",
+    "\n",
+    "if project_path not in sys.path:  # Avoid duplicate entries\n",
+    "    sys.path.append(project_path)\n",
+    "    print(project_path)\n",
+    "if dima_path not in sys.path:\n",
+    "    sys.path.insert(0, dima_path)\n",
+    "    print(dima_path)\n",
+    "\n",
+    "import dima.visualization.hdf5_vis as hdf5_vis\n",
+    "import dima.pipelines.data_integration as data_integration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1: Specify the data integration task through a YAML configuration file\n",
+    "\n",
+    "* Open the `campaignDescriptor.yaml` file located in the root directory.\n",
+    "\n",
+    "* Refer to examples in `/dima/input_files/` for guidance.\n",
+    "\n",
+    "* Specify the input and output directory paths.\n",
+    "\n",
+    "* Execute the cell to initiate the configuration.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yaml_config_file_path = '../campaignDescriptor.yaml'"
+   ]
+  },
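+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Optional sanity check (illustrative sketch, assuming the descriptor is plain YAML):* parse the configuration with `yaml.safe_load` and print its top-level keys to confirm the file is readable and the paths you just specified are picked up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import yaml\n",
+    "\n",
+    "# Assumes the descriptor parses to a mapping; key names vary per campaign,\n",
+    "# so this only confirms the file is readable.\n",
+    "with open(yaml_config_file_path) as f:\n",
+    "    config = yaml.safe_load(f)\n",
+    "\n",
+    "for key, value in config.items():\n",
+    "    print(f\"{key}: {type(value).__name__}\")"
+   ]
+  },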
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2: Create an integrated HDF5 file of the experimental campaign\n",
+    "\n",
+    "* Execute the cell. Here we run the function `run_pipeline` with the previously specified YAML config file as its input argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hdf5_file_path = data_integration.run_pipeline(yaml_config_file_path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hdf5_file_path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3: Display the integrated HDF5 file using a treemap\n",
+    "\n",
+    "* Execute the cell. An HTML visualization of the integrated file should be displayed and stored in the output directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if isinstance(hdf5_file_path, list):\n",
+    "    for path_item in hdf5_file_path:\n",
+    "        hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
+    "else:\n",
+    "    hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
+   ]
+  },
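+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Optional check (illustrative sketch):* since the treemap is also stored as an HTML file in the output directory, we can list the files that were written. This assumes the HTML lands next to the integrated HDF5 file; adjust the pattern if your output layout differs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import glob\n",
+    "import os\n",
+    "\n",
+    "# Assumption: the treemap HTML is written next to the integrated HDF5 file\n",
+    "first_path = hdf5_file_path[0] if isinstance(hdf5_file_path, list) else hdf5_file_path\n",
+    "output_dir = os.path.dirname(first_path)\n",
+    "print(glob.glob(os.path.join(output_dir, '*.html')))"
+   ]
+  },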
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "dash_multi_chem_env",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..e69de29
diff --git a/scripts/.gitkeep b/scripts/.gitkeep
new file mode 100644
index 0000000..e69de29