Basic init idear template

2025-06-18 14:23:18 +02:00
commit f9b9e1226b
10 changed files with 313 additions and 0 deletions

16
.dockerignore Normal file

@@ -0,0 +1,16 @@
data/
figures/
notebooks/
scripts/
envs/
logs/
*.pyc
__pycache__/
*.h5
.Trash-0/
.ipynb_checkpoints/
env_setup.sh
docker-compose.yaml
run_container.sh
TODO.md
.env

8
.gitignore vendored Normal file

@@ -0,0 +1,8 @@
envs/
logs/
*.pyc
__pycache__/
*.h5
.env
.ipynb_checkpoints
.Trash-0

3
.gitmodules vendored Normal file

@@ -0,0 +1,3 @@
[submodule "dima"]
path = dima
url = https://gitea.psi.ch/5505-public/dima.git

131
README.md Normal file

@@ -0,0 +1,131 @@
# IDEAR FAIRification Toolkit
This is a **containerized, JupyterLab-based data toolkit** developed as part of the IDEAR project. It supports efficient, reproducible, and metadata-enriched data processing workflows for instrument-generated datasets.
---
### Key Features
- Modular pipeline with reusable notebook workflows
- Metadata-driven HDF5 outputs for long-term data reuse
- Optional network-mounted input for seamless integration with shared drives
---
### Output Format
- **Self-describing HDF5 files**, including:
- Project-level, contextual, and data lineage metadata (illustrated in the sketch below)
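For a concrete picture of what "self-describing" means here, the sketch below stores metadata as HDF5 attributes alongside the data using `h5py`. File, group, and attribute names are illustrative assumptions, not the toolkit's actual schema:

```python
# Sketch only: names and attribute keys are illustrative assumptions,
# not the toolkit's actual HDF5 schema.
import h5py

with h5py.File("campaign.h5", "w") as f:
    f.attrs["project"] = "IDEAR"                       # project-level metadata
    f.attrs["campaign_start"] = "2025-06-01"           # contextual metadata
    grp = f.create_group("instrument_a")
    grp.attrs["source_file"] = "raw/instrument_a.txt"  # data lineage metadata
    grp.create_dataset("measurements", data=[1.0, 2.0, 3.0])
```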
---
### Extensibility
New instruments can be supported by extending the file parsing capabilities in the `dima/` module.
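For illustration, extending the parsing capabilities usually means adding a reader that converts a raw instrument file into datasets plus attributes ready for HDF5 export. The following is a minimal sketch assuming a hypothetical tab-separated format; `read_new_instrument` and the `file_readers` registry are assumptions for the sketch, not the actual `dima` interface:

```python
# Sketch only: names and the registry layout below are assumptions,
# not the real dima API.
import pandas as pd

def read_new_instrument(path: str) -> dict:
    """Parse a raw instrument file into datasets plus attributes for HDF5 export."""
    # Assumed tab-separated raw format with '#' comment lines
    table = pd.read_csv(path, sep="\t", comment="#")
    return {
        "datasets": {"measurements": table},
        "attributes": {"instrument": "new_instrument", "source_file": path},
    }

# A hypothetical extension-to-reader registry the toolkit could consult:
# file_readers[".nit"] = read_new_instrument
```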
## Repository Structure
<details>
<summary><b>Click to expand</b></summary>
- `data/` — Input and output datasets (mounted volume)
- `figures/` — Output visualizations (mounted volume)
- `notebooks/` — Jupyter notebooks for processing and metadata integration
- `scripts/` — Supplementary processing logic
- `dima/` — Metadata and HDF5 schema utilities (persisted module)
- `Dockerfile` — Container image definition
- `docker-compose.yaml` — Local and networked deployment options
- `env_setup.sh` — Optional local environment bootstrap
- `CITATION.cff`, `LICENCE`, `README.md`, `.gitignore`, `.dockerignore` — Project metadata and config
- `campaignDescriptor.yaml` — Campaign-specific metadata
</details>
---
## Getting Started
### Requirements
#### For Docker-based usage:
- **Docker Desktop**
- **Git Bash** (for running shell scripts on Windows)
#### Optional for local (non-Docker) usage:
- **Conda** (`miniconda` or `anaconda`)
#### If accessing network drives (e.g., PSI):
- PSI credentials with access to mounted network shares
---
## Clone the Repository
```bash
git clone --recurse-submodules <your-repo-url>
cd <your-repo-name>
```
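If you cloned without `--recurse-submodules`, fetch the `dima` submodule afterwards with `git submodule update --init --recursive`.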
## Run with Docker
This toolkit includes a containerized JupyterLab environment for executing the data processing pipeline, plus an optional dashboard for manual flagging.
1. Open **PowerShell as Administrator** and navigate to the cloned repository.
2. Create a `.env` file in the repository root.
3. **Securely store your network drive access credentials** in the `.env` file by adding the following lines:
```plaintext
CIFS_USER=<your-username>
CIFS_PASS=<your-password>
JUPYTER_TOKEN=my-token
```
**To protect your credentials:**
- Do not share the `.env` file with others.
- Ensure the file is excluded from version control by adding `.env` to your `.gitignore` and `.dockerignore` files (this template already lists it in both).
4. Open **Docker Desktop**, then build the container image:
```bash
docker build -f Dockerfile -t idear_processor .
```
5. Start the environment:
- **Locally, without network drive mount:**
```bash
docker compose up idear_processor
```
- **With network drive mount:**
```bash
docker compose up idear_processor_networked
```
6. Access:
- **Jupyter Lab**: [http://localhost:8889/lab/tree/notebooks/](http://localhost:8889/lab/tree/notebooks/)
7. Stop the app:
In the previously opened PowerShell terminal, press `Ctrl + C`.
After the container has stopped properly, remove it with:
```bash
docker rm $(docker ps -aq --filter ancestor=idear_processor)
```
## (Optional) Set Up the Python Environment
> Required only if you plan to run the toolkit outside of Docker
If **Git Bash** lacks a suitable Python interpreter, run:
```bash
bash env_setup.sh
```
## Citation
## License

0
data/.gitkeep Normal file

1
dima Submodule

Submodule dima added at d5fa2b6c71

0
figures/.gitkeep Normal file

154
notebooks/… Normal file

@@ -0,0 +1,154 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data integration workflow of experimental campaign\n",
"\n",
"In this notebook, we will go through a our data integration workflow. This involves the following steps:\n",
"\n",
"1. Specify data integration file through YAML configuration file.\n",
"2. Create an integrated HDF5 file of experimental campaign from configuration file.\n",
"3. Display the created HDF5 file using a treemap\n",
"\n",
"## Import libraries and modules\n",
"\n",
"* Excecute (or Run) the Cell below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"# Set up project root directory\n",
"\n",
"notebook_dir = os.getcwd() # Current working directory (assumes running from notebooks/)\n",
"project_path = os.path.normpath(os.path.join(notebook_dir, \"..\")) # Move up to project root\n",
"dima_path = os.path.normpath(os.path.join(project_path, \"dima\")) # Move up to project root\n",
"\n",
"for item in sys.path:\n",
" print(item)\n",
"\n",
"\n",
"if project_path not in sys.path: # Avoid duplicate entries\n",
" sys.path.append(project_path)\n",
" print(project_path)\n",
"if dima_path not in sys.path:\n",
" sys.path.insert(0,dima_path)\n",
" print(dima_path)\n",
"\n",
"import dima.visualization.hdf5_vis as hdf5_vis\n",
"import dima.pipelines.data_integration as data_integration\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Specify data integration task through YAML configuration file\n",
"\n",
"* Open the `campaignDescriptor.yaml` file located in the root directory.\n",
"\n",
"* Refer to examples in `/dima/input_files/` for guidance.\n",
"\n",
"* Specify the input and output directory paths.\n",
"\n",
"* Execute the cell to initiate the configuration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yaml_config_file_path ='../campaignDescriptor.yaml'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create an integrated HDF5 file of experimental campaign.\n",
"\n",
"* Excecute Cell. Here we run the function `integrate_data_sources` with input argument as the previously specified YAML config file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"hdf5_file_path = data_integration.run_pipeline(yaml_config_file_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hdf5_file_path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Display integrated HDF5 file using a treemap\n",
"\n",
"* Excecute Cell. A visual representation in html format of the integrated file should be displayed and stored in the output directory folder"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if isinstance(hdf5_file_path ,list):\n",
" for path_item in hdf5_file_path :\n",
" hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
"else:\n",
" hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dash_multi_chem_env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

0
requirements.txt Normal file

0
scripts/.gitkeep Normal file