Basic init IDEAR template
.dockerignore (Normal file, +16 lines)
@@ -0,0 +1,16 @@
data/
figures/
notebooks/
scripts/
envs/
logs/
*.pyc
__pycache__/
*.h5
.Trash-0/
.ipynb_checkpoints/
env_setup.sh
docker-compose.yaml
run_container.sh
TODO.md
.env
.gitignore (vendored, Normal file, +8 lines)
@@ -0,0 +1,8 @@
envs/
logs/
*.pyc
__pycache__/
*.h5
.env
.ipynb_checkpoints
.Trash-0
.gitmodules (vendored, Normal file, +3 lines)
@@ -0,0 +1,3 @@
[submodule "dima"]
	path = dima
	url = https://gitea.psi.ch/5505-public/dima.git
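The `.gitmodules` entry above pins the external `dima` helper library as a git submodule. If the repository was cloned without `--recurse-submodules`, the `dima/` directory will be empty; it can be populated afterwards with standard git commands:

```bash
# Fetch the dima submodule into the working tree (run from the repository root)
git submodule update --init dima

# Later, sync to the revision recorded by the superproject
git submodule update
```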
README.md (Normal file, +131 lines)
@@ -0,0 +1,131 @@
# IDEAR FAIRification Toolkit

This is a **containerized, JupyterLab-based data toolkit** developed as part of the IDEAR project. It supports efficient, reproducible, and metadata-enriched data processing workflows for instrument-generated datasets.

---

### Key Features

- Modular pipeline with reusable notebook workflows
- Metadata-driven HDF5 outputs for long-term data reuse
- Optional network-mounted input for seamless integration with shared drives

---

### Output Format

- **Self-describing HDF5 files**, including:
  - Project-level, contextual, and data lineage metadata
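Because the outputs are plain HDF5, the embedded metadata can be inspected with any HDF5 reader, independently of this toolkit. A minimal sketch using `h5py` (the file name below is a placeholder, and the actual attribute names depend on the campaign configuration):

```python
import h5py

# Placeholder path: any HDF5 file produced by the pipeline
with h5py.File("campaign_output.h5", "r") as f:
    # File-level (project and contextual) metadata attributes
    for key, value in f.attrs.items():
        print(f"{key}: {value}")

    # Walk the hierarchy and show each group's or dataset's own attributes
    def show_attrs(name, obj):
        for key, value in obj.attrs.items():
            print(f"{name}/@{key}: {value}")

    f.visititems(show_attrs)
```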
---

### Extensibility

New instruments can be supported by extending the file-parsing capabilities in the `dima/` module.
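The concrete extension hooks live in `dima` and are not shown in this commit. Purely as an illustration, adding an instrument usually amounts to a reader that maps a raw file to tabular data plus metadata; the function name and return layout below are hypothetical:

```python
import pandas as pd

# Hypothetical reader for a new instrument format. The real registration
# mechanism is defined by the dima module, not by this sketch.
def read_myinstrument(path: str) -> dict:
    table = pd.read_csv(path, sep="\t", comment="#")  # assumed raw format
    return {
        "data": table,  # tabular payload
        "metadata": {"instrument": "myinstrument", "source_file": path},
    }
```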
## Repository Structure

<details>
<summary><b>Click to expand</b></summary>

- `data/` — Input and output datasets (mounted volume)
- `figures/` — Output visualizations (mounted volume)
- `notebooks/` — Jupyter notebooks for processing and metadata integration
- `scripts/` — Supplementary processing logic
- `dima/` — Metadata and HDF5 schema utilities (persisted module)
- `Dockerfile` — Container image definition
- `docker-compose.yaml` — Local and networked deployment options
- `env_setup.sh` — Optional local environment bootstrap
- `CITATION.cff`, `LICENCE`, `README.md`, `.gitignore`, `.dockerignore` — Project metadata and config
- `campaignDescriptor.yaml` — Campaign-specific metadata

</details>

---

## Getting Started

### Requirements
#### For Docker-based usage:

- **Docker Desktop**
- **Git Bash** (for running shell scripts on Windows)

#### Optional for local (non-Docker) usage:

- **Conda** (`miniconda` or `anaconda`)

#### If accessing network drives (e.g., PSI):

- PSI credentials with access to mounted network shares

---

## Clone the Repository

```bash
git clone --recurse-submodules <your-repo-url>
cd <your-repo-name>
```
## Run with Docker

This toolkit includes a containerized JupyterLab environment for executing the data processing pipeline, plus an optional dashboard for manual flagging.

1. Open **PowerShell as Administrator** and navigate to the cloned repository.

2. Create a `.env` file in the repository root.

3. **Securely store your network drive access credentials** in the `.env` file by adding the following lines:

   ```plaintext
   CIFS_USER=<your-username>
   CIFS_PASS=<your-password>
   JUPYTER_TOKEN=my-token
   ```

   **To protect your credentials:**

   - Do not share the `.env` file with others.
   - Ensure the file is excluded from version control by adding `.env` to your `.gitignore` and `.dockerignore` files.

4. Open **Docker Desktop**, then build the container image:

   ```bash
   docker build -f Dockerfile -t idear_processor .
   ```

5. Start the environment (a hypothetical sketch of the two compose services follows this list):

   - **Locally, without network drive mount:**

     ```bash
     docker compose up idear_processor
     ```

   - **With network drive mount:**

     ```bash
     docker compose up idear_processor_networked
     ```

6. Access **Jupyter Lab** at [http://localhost:8889/lab/tree/notebooks/](http://localhost:8889/lab/tree/notebooks/).

7. Stop the app: in the previously opened PowerShell terminal, press `Ctrl + C`. Once the container has stopped, remove it with:

   ```bash
   docker rm $(docker ps -aq --filter ancestor=idear_processor)
   ```
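The actual `docker-compose.yaml` ships with the repository but is not shown in this diff. The sketch below only illustrates how the two service names used in step 5 could be wired up; the image name, mount points, and CIFS handling are assumptions:

```yaml
# Hypothetical sketch of the two compose services referenced above.
services:
  idear_processor:
    image: idear_processor
    ports:
      - "8889:8888"            # host port per the access URL; container port assumed
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./data:/app/data       # assumed mount points
      - ./figures:/app/figures
      - ./notebooks:/app/notebooks

  idear_processor_networked:
    image: idear_processor
    ports:
      - "8889:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
      - CIFS_USER=${CIFS_USER}
      - CIFS_PASS=${CIFS_PASS}
    privileged: true           # often needed to mount CIFS shares inside a container
```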
## (Optional) Set Up the Python Environment

> Required only if you plan to run the toolkit outside of Docker.

If **Git Bash** lacks a suitable Python interpreter, run:

```bash
bash env_setup.sh
```
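`env_setup.sh` is included in the repository but its contents are not part of this diff. A bootstrap script of this kind typically creates a Conda environment and installs the pinned dependencies; a minimal, assumed sketch (environment name and Python version are placeholders):

```bash
#!/bin/bash
# Hypothetical outline of an environment bootstrap; the real env_setup.sh may differ.
set -e

ENV_NAME=idear_env
conda create -y -n "$ENV_NAME" python=3.11
conda run -n "$ENV_NAME" pip install -r requirements.txt
```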
## Citation

## License
data/.gitkeep (Normal file, empty)

dima (Submodule, added at d5fa2b6c71)

figures/.gitkeep (Normal file, empty)

notebooks/demo_data_integration.ipynb (Normal file, +154 lines)
@@ -0,0 +1,154 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data integration workflow of an experimental campaign\n",
    "\n",
    "In this notebook, we will go through our data integration workflow. This involves the following steps:\n",
    "\n",
    "1. Specify the data integration task through a YAML configuration file.\n",
    "2. Create an integrated HDF5 file of the experimental campaign from the configuration file.\n",
    "3. Display the created HDF5 file using a treemap.\n",
    "\n",
    "## Import libraries and modules\n",
    "\n",
    "* Execute (or Run) the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "\n",
    "# Set up project root directory\n",
    "notebook_dir = os.getcwd()  # Current working directory (assumes running from notebooks/)\n",
    "project_path = os.path.normpath(os.path.join(notebook_dir, \"..\"))  # Move up to project root\n",
    "dima_path = os.path.normpath(os.path.join(project_path, \"dima\"))  # Path to the dima submodule\n",
    "\n",
    "# Show the current module search paths\n",
    "for item in sys.path:\n",
    "    print(item)\n",
    "\n",
    "if project_path not in sys.path:  # Avoid duplicate entries\n",
    "    sys.path.append(project_path)\n",
    "    print(project_path)\n",
    "if dima_path not in sys.path:\n",
    "    sys.path.insert(0, dima_path)\n",
    "    print(dima_path)\n",
    "\n",
    "import dima.visualization.hdf5_vis as hdf5_vis\n",
    "import dima.pipelines.data_integration as data_integration\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Specify the data integration task through a YAML configuration file\n",
    "\n",
    "* Open the `campaignDescriptor.yaml` file located in the root directory.\n",
    "\n",
    "* Refer to examples in `/dima/input_files/` for guidance.\n",
    "\n",
    "* Specify the input and output directory paths.\n",
    "\n",
    "* Execute the cell to initiate the configuration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "yaml_config_file_path = '../campaignDescriptor.yaml'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create an integrated HDF5 file of the experimental campaign\n",
    "\n",
    "* Execute the cell. Here we run the function `run_pipeline` with the previously specified YAML config file as input argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdf5_file_path = data_integration.run_pipeline(yaml_config_file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdf5_file_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Display the integrated HDF5 file using a treemap\n",
    "\n",
    "* Execute the cell. A visual representation of the integrated file, in HTML format, is displayed and stored in the output directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if isinstance(hdf5_file_path, list):\n",
    "    for path_item in hdf5_file_path:\n",
    "        hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
    "else:\n",
    "    hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dash_multi_chem_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
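Step 1 of the notebook above reads `campaignDescriptor.yaml`, which is referenced in the README but not included in this commit. Purely as an illustration, such a descriptor names the campaign and points the pipeline at input and output locations; every key below is an assumption, so follow the examples in `/dima/input_files/` for the real schema:

```yaml
# Hypothetical campaignDescriptor.yaml sketch; authoritative key names
# live in the examples under /dima/input_files/.
campaign_name: my_campaign          # assumed key
input_file_directory: ./data/raw    # assumed key
output_file_directory: ./data       # assumed key
instruments:                        # assumed key
  - myinstrument
```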
requirements.txt (Normal file, empty)

scripts/.gitkeep (Normal file, empty)