Basic init IDEAR template
.dockerignore (Normal file, +16 lines)
@@ -0,0 +1,16 @@
data/
figures/
notebooks/
scripts/
envs/
logs/
*.pyc
__pycache__/
*.h5
.Trash-0/
.ipynb_checkpoints/
env_setup.sh
docker-compose.yaml
run_container.sh
TODO.md
.env
.gitignore (vendored, Normal file, +8 lines)
@@ -0,0 +1,8 @@
envs/
logs/
*.pyc
__pycache__/
*.h5
.env
.ipynb_checkpoints
.Trash-0
.gitmodules (vendored, Normal file, +3 lines)
@@ -0,0 +1,3 @@
[submodule "dima"]
	path = dima
	url = https://gitea.psi.ch/5505-public/dima.git
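The `.gitmodules` entry above pins the external `dima` helper library as a git submodule. If the repository was cloned without `--recurse-submodules`, the `dima/` directory will be empty; it can be populated afterwards with standard git commands:

```bash
# Fetch the dima submodule into the working tree (run from the repository root)
git submodule update --init dima

# Later, sync to the revision recorded by the superproject
git submodule update
```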
README.md (Normal file, +131 lines)
@@ -0,0 +1,131 @@
# IDEAR FAIRification Toolkit

This is a **containerized, JupyterLab-based data toolkit** developed as part of the IDEAR project. It supports efficient, reproducible, and metadata-enriched data processing workflows for instrument-generated datasets.

---

### Key Features

- Modular pipeline with reusable notebook workflows
- Metadata-driven HDF5 outputs for long-term data reuse
- Optional network-mounted input for seamless integration with shared drives

---

### Output Format

- **Self-describing HDF5 files**, including:
  - Project-level, contextual, and data lineage metadata
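Because the outputs are plain HDF5, the embedded metadata can be inspected with any HDF5 reader, independently of this toolkit. A minimal sketch using `h5py` (the file name below is a placeholder, and the actual attribute names depend on the campaign configuration):

```python
import h5py

# Placeholder path: any HDF5 file produced by the pipeline
with h5py.File("campaign_output.h5", "r") as f:
    # File-level (project and contextual) metadata attributes
    for key, value in f.attrs.items():
        print(f"{key}: {value}")

    # Walk the hierarchy and show each group's or dataset's own attributes
    def show_attrs(name, obj):
        for key, value in obj.attrs.items():
            print(f"{name}/@{key}: {value}")

    f.visititems(show_attrs)
```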
---

### Extensibility

New instruments can be supported by extending the file-parsing capabilities in the `dima/` module.
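The concrete extension hooks live in `dima` and are not shown in this commit. Purely as an illustration, adding an instrument usually amounts to a reader that maps a raw file to tabular data plus metadata; the function name and return layout below are hypothetical:

```python
import pandas as pd

# Hypothetical reader for a new instrument format. The real registration
# mechanism is defined by the dima module, not by this sketch.
def read_myinstrument(path: str) -> dict:
    table = pd.read_csv(path, sep="\t", comment="#")  # assumed raw format
    return {
        "data": table,  # tabular payload
        "metadata": {"instrument": "myinstrument", "source_file": path},
    }
```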
## Repository Structure

<details>
<summary><b>Click to expand</b></summary>

- `data/` — Input and output datasets (mounted volume)
- `figures/` — Output visualizations (mounted volume)
- `notebooks/` — Jupyter notebooks for processing and metadata integration
- `scripts/` — Supplementary processing logic
- `dima/` — Metadata and HDF5 schema utilities (persisted module)
- `Dockerfile` — Container image definition
- `docker-compose.yaml` — Local and networked deployment options
- `env_setup.sh` — Optional local environment bootstrap
- `CITATION.cff`, `LICENCE`, `README.md`, `.gitignore`, `.dockerignore` — Project metadata and config
- `campaignDescriptor.yaml` — Campaign-specific metadata

</details>

---

## Getting Started

### Requirements
#### For Docker-based usage:

- **Docker Desktop**
- **Git Bash** (for running shell scripts on Windows)

#### Optional for local (non-Docker) usage:

- **Conda** (`miniconda` or `anaconda`)

#### If accessing network drives (e.g., PSI):

- PSI credentials with access to mounted network shares

---

## Clone the Repository

```bash
git clone --recurse-submodules <your-repo-url>
cd <your-repo-name>
```
## Run with Docker

This toolkit includes a containerized JupyterLab environment for executing the data processing pipeline, plus an optional dashboard for manual flagging.

1. Open **PowerShell as Administrator** and navigate to the cloned repository.

2. Create a `.env` file in the repository root.

3. **Securely store your network drive access credentials** in the `.env` file by adding the following lines:

   ```plaintext
   CIFS_USER=<your-username>
   CIFS_PASS=<your-password>
   JUPYTER_TOKEN=my-token
   ```

   **To protect your credentials:**

   - Do not share the `.env` file with others.
   - Ensure the file is excluded from version control by adding `.env` to your `.gitignore` and `.dockerignore` files.

4. Open **Docker Desktop**, then build the container image:

   ```bash
   docker build -f Dockerfile -t idear_processor .
   ```

5. Start the environment (a hypothetical sketch of the two compose services follows this list):

   - **Locally, without network drive mount:**

     ```bash
     docker compose up idear_processor
     ```

   - **With network drive mount:**

     ```bash
     docker compose up idear_processor_networked
     ```

6. Access **Jupyter Lab** at [http://localhost:8889/lab/tree/notebooks/](http://localhost:8889/lab/tree/notebooks/).

7. Stop the app: in the previously opened PowerShell terminal, press `Ctrl + C`. Once the container has stopped, remove it with:

   ```bash
   docker rm $(docker ps -aq --filter ancestor=idear_processor)
   ```
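The actual `docker-compose.yaml` ships with the repository but is not shown in this diff. The sketch below only illustrates how the two service names used in step 5 could be wired up; the image name, mount points, and CIFS handling are assumptions:

```yaml
# Hypothetical sketch of the two compose services referenced above.
services:
  idear_processor:
    image: idear_processor
    ports:
      - "8889:8888"            # host port per the access URL; container port assumed
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./data:/app/data       # assumed mount points
      - ./figures:/app/figures
      - ./notebooks:/app/notebooks

  idear_processor_networked:
    image: idear_processor
    ports:
      - "8889:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
      - CIFS_USER=${CIFS_USER}
      - CIFS_PASS=${CIFS_PASS}
    privileged: true           # often needed to mount CIFS shares inside a container
```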
## (Optional) Set Up the Python Environment

> Required only if you plan to run the toolkit outside of Docker.

If **Git Bash** lacks a suitable Python interpreter, run:

```bash
bash env_setup.sh
```
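`env_setup.sh` is included in the repository but its contents are not part of this diff. A bootstrap script of this kind typically creates a Conda environment and installs the pinned dependencies; a minimal, assumed sketch (environment name and Python version are placeholders):

```bash
#!/bin/bash
# Hypothetical outline of an environment bootstrap; the real env_setup.sh may differ.
set -e

ENV_NAME=idear_env
conda create -y -n "$ENV_NAME" python=3.11
conda run -n "$ENV_NAME" pip install -r requirements.txt
```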
## Citation

## License
data/.gitkeep (Normal file, empty)

dima (Submodule, added at d5fa2b6c71)

figures/.gitkeep (Normal file, empty)

notebooks/demo_data_integration.ipynb (Normal file, +154 lines)
@@ -0,0 +1,154 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data integration workflow of an experimental campaign\n",
    "\n",
    "In this notebook, we will go through our data integration workflow. This involves the following steps:\n",
    "\n",
    "1. Specify the data integration task through a YAML configuration file.\n",
    "2. Create an integrated HDF5 file of the experimental campaign from the configuration file.\n",
    "3. Display the created HDF5 file using a treemap.\n",
    "\n",
    "## Import libraries and modules\n",
    "\n",
    "* Execute (or Run) the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "\n",
    "# Set up project root directory\n",
    "notebook_dir = os.getcwd()  # Current working directory (assumes running from notebooks/)\n",
    "project_path = os.path.normpath(os.path.join(notebook_dir, \"..\"))  # Move up to project root\n",
    "dima_path = os.path.normpath(os.path.join(project_path, \"dima\"))  # Path to the dima submodule\n",
    "\n",
    "# Show the current module search paths\n",
    "for item in sys.path:\n",
    "    print(item)\n",
    "\n",
    "if project_path not in sys.path:  # Avoid duplicate entries\n",
    "    sys.path.append(project_path)\n",
    "    print(project_path)\n",
    "if dima_path not in sys.path:\n",
    "    sys.path.insert(0, dima_path)\n",
    "    print(dima_path)\n",
    "\n",
    "import dima.visualization.hdf5_vis as hdf5_vis\n",
    "import dima.pipelines.data_integration as data_integration\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Specify the data integration task through a YAML configuration file\n",
    "\n",
    "* Open the `campaignDescriptor.yaml` file located in the root directory.\n",
    "\n",
    "* Refer to examples in `/dima/input_files/` for guidance.\n",
    "\n",
    "* Specify the input and output directory paths.\n",
    "\n",
    "* Execute the cell to initiate the configuration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "yaml_config_file_path = '../campaignDescriptor.yaml'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create an integrated HDF5 file of the experimental campaign\n",
    "\n",
    "* Execute the cell. Here we run the function `run_pipeline` with the previously specified YAML config file as input argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdf5_file_path = data_integration.run_pipeline(yaml_config_file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdf5_file_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Display the integrated HDF5 file using a treemap\n",
    "\n",
    "* Execute the cell. A visual representation of the integrated file, in HTML format, is displayed and stored in the output directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if isinstance(hdf5_file_path, list):\n",
    "    for path_item in hdf5_file_path:\n",
    "        hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
    "else:\n",
    "    hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dash_multi_chem_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
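Step 1 of the notebook above reads `campaignDescriptor.yaml`, which is referenced in the README but not included in this commit. Purely as an illustration, such a descriptor names the campaign and points the pipeline at input and output locations; every key below is an assumption, so follow the examples in `/dima/input_files/` for the real schema:

```yaml
# Hypothetical campaignDescriptor.yaml sketch; authoritative key names
# live in the examples under /dima/input_files/.
campaign_name: my_campaign          # assumed key
input_file_directory: ./data/raw    # assumed key
output_file_directory: ./data       # assumed key
instruments:                        # assumed key
  - myinstrument
```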
requirements.txt (Normal file, empty)

scripts/.gitkeep (Normal file, empty)