ACSM FAIRifier
ACSM FAIRifier is a containerized JupyterLab-based toolkit for preparing Aerosol Chemical Speciation Monitor (ACSM) datasets for EBAS submission and domain-agnostic reuse. It enables users to transform raw or processed ACSM data into:
- EBAS-compliant outputs, with appropriate metadata and file structure
- Self-describing HDF5 files, containing final and intermediate data products for transparent, reusable, and reproducible science
Key Features
- Notebook-driven pipelines with automatic provenance tracking
- Notebook-driven visualizations of data products
- Dash Plotly app for interactive data annotation for quality control
- Direct integration with an HDF5-based data structure
- HDF5 output includes intermediate data products in addition to final outputs
Input Format
The ACSM data integration pipeline reads data from an input directory and an instrument folder, both specified in the campaignDescriptor.yaml file with key-value pairs as follows:
input_file_directory : NETWORK_MOUNT/Data/JFJ/
instrument_datafolder: ACSM_TOFWARE/2024/
Make sure the combined path contains the following structure and files:
NETWORK_MOUNT/Data/JFJ/ACSM_TOFWARE/2024/
├── ACSM_JFJ_2024_meta.txt
├── ACSM_JFJ_2024_timeseries.txt
├── CH0001G.20240101000000...lev1.nas
├── Org_data_valid.csv
├── Org_err_valid.csv
├── Org_mz_valid.csv
├── Org_time_valid.csv
└── params/
    ├── calibration_params.yaml
    ├── limits_of_detection.yaml
    └── validity_thresholds.yaml
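Before starting the pipeline, it can help to confirm that the combined path actually contains these files. The following is a minimal sketch and not part of the toolkit: the function names combined_input_path and missing_inputs are hypothetical, and the file list is copied from the listing above. The .nas file is left out because its full name is truncated in the listing.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
# The truncated .nas file name is intentionally omitted.
EXPECTED_FILES = (
    "ACSM_JFJ_2024_meta.txt",
    "ACSM_JFJ_2024_timeseries.txt",
    "Org_data_valid.csv",
    "Org_err_valid.csv",
    "Org_mz_valid.csv",
    "Org_time_valid.csv",
    "params/calibration_params.yaml",
    "params/limits_of_detection.yaml",
    "params/validity_thresholds.yaml",
)


def combined_input_path(input_file_directory, instrument_datafolder):
    """Join the two descriptor values into the folder that is read."""
    return Path(input_file_directory) / instrument_datafolder


def missing_inputs(folder, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not files under `folder`."""
    return [name for name in expected if not (Path(folder) / name).is_file()]
```

For example, missing_inputs(combined_input_path("NETWORK_MOUNT/Data/JFJ/", "ACSM_TOFWARE/2024/")) returns an empty list when every listed file is present. In practice NETWORK_MOUNT would be replaced by the actual mount location.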
Output Formats
- NAS EBAS-compliant files, structured and metadata-rich for archive submission
- Self-describing HDF5 files, including:
  - Project-level, contextual, and data lineage metadata
  - Intermediate and final processed datasets
- YAML workflow file, automatically generated in Renku format, recording the prospective provenance of the data processing chain (i.e., planned steps, parameters, and dependencies)
Extensibility
While designed for ACSM datasets, the FAIRifier framework is modular and adaptable to new instruments and processing pipelines. Email the authors for details.
Visual Overview of Domain-Agnostic Data Products
Repository Structure
- app/: Dash Plotly app for interactive data flagging
- data/: Contains ACSM datasets in HDF5 format (input/output)
- dima/: Submodule supporting HDF5 metadata structure
- notebooks/: Jupyter notebooks for stepwise FAIRification and submission preparation
- pipelines/: Data chain scripts powering the transformation workflow
- docs/: Additional documentation resources
- figures/: Generated plots and visualizations
- third_party/: External code dependencies
- workflows/: Workflow automation (e.g., CI/CD pipelines)
- Configuration files:
  - Dockerfile.acsmchain for container builds
  - docker-compose.yaml for orchestrating multi-container setups
  - env_setup.sh to bootstrap the local environment
- Project metadata files: README.md, LICENSE, CITATION.cff, TODO.md, and campaigndescriptor.yaml.
Getting Started
Requirements
For Windows users, the following are required:
- Docker Desktop: Required to run the toolkit using containers. Download and install Docker Desktop.
- Git Bash: Used to run shell scripts (.sh files).
- Miniforge (Optional): Install Miniforge. Required only if you plan to run the toolkit outside of Docker.
- PSI Network Access (for data retrieval): Needed only if accessing live data from a PSI network-mounted drive.
Clone the Repository
Open Git Bash and run:
cd Gitea
git clone --recurse-submodules https://gitea.psi.ch/apog/acsm-fairifier.git
cd acsm-fairifier
Run the ACSM FAIRifier Toolkit
This toolkit includes a containerized JupyterLab environment for executing the data processing pipeline, plus an optional dashboard for manual flagging.
- Open PowerShell as Administrator and navigate to the acsm-fairifier repository.
- Create a .env file in the root of acsm-fairifier/.
- Securely store your network drive access credentials in the .env file by adding the following lines:
    CIFS_USER=<your-username>
    CIFS_PASS=<your-password>
    NETWORK_MOUNT=//your-server/your-share
  To protect your credentials:
  - Do not share the .env file with others.
  - Ensure the file is excluded from version control and from the Docker build context by adding .env to your .gitignore and .dockerignore files.
- Open Docker Desktop, then build the container image:
    docker build -f Dockerfile.acsmchain -t datachain_processor .
- Start the toolkit:
  - Locally, without the network drive mount:
      docker compose up datachain_processor
  - With the network drive mount:
      docker compose up datachain_processor_networked
- Access:
  - JupyterLab: http://localhost:8889/lab/
- Stop the app: in the PowerShell terminal you opened earlier, press Ctrl + C. Once the container has stopped, remove it with:
    docker rm $(docker ps -aq --filter ancestor=datachain_processor)
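A missing or empty entry in the .env file is an easy mistake to make before starting the networked service. The sketch below is an illustration only and not part of the toolkit: parse_env and missing_keys are hypothetical helper names, and the parser handles only plain KEY=VALUE lines, which is simpler than the rules Docker Compose itself applies (quoting and variable interpolation are ignored here).

```python
# Keys named in the .env step above.
REQUIRED_KEYS = ("CIFS_USER", "CIFS_PASS", "NETWORK_MOUNT")


def parse_env(text):
    """Parse plain KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not values.get(key)]
```

A typical use would be missing_keys(parse_env(open(".env").read())), run from the repository root. An empty list means all three entries are set. Avoid printing the parsed values, since they include the password.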
(Optional) Set Up the Python Environment
Required only if you plan to run the toolkit outside of Docker
Install Python Environment Using Miniforge and conda-forge
We recommend using Miniforge to manage your conda environments. Miniforge ensures compatibility with packages from the conda-forge channel.
- Make sure you have installed Miniforge.
- Create the environment from environment.yml. After installing Miniforge, open Miniforge Prompt or a terminal with access to conda and run:
    cd path/to/Gitea/acsm-fairifier
    conda env create --file environment.yml
Working with Jupyter Notebooks
The following steps make the previously installed Python environment acsmnode_env selectable as a kernel in Jupyter's interface.
- Open a Miniforge Prompt, check that the environment exists, and activate it:
    conda env list
    conda activate acsmnode_env
- Register the environment in Jupyter:
    python -m ipykernel install --user --name acsmnode_env --display-name "Python (acsmnode_env)"
- Start Jupyter Notebook:
    jupyter notebook
  Then select the acsmnode_env environment from the kernel options.
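If the kernel does not show up in the kernel options, one way to check the registration is to inspect the output of jupyter kernelspec list --json. The sketch below is an illustration under assumptions, not part of the toolkit: registered_kernels and kernelspec_listing are hypothetical helper names, and the code assumes the JSON has a top-level "kernelspecs" mapping whose entries carry a "spec" with a "display_name".

```python
import json
import subprocess


def registered_kernels(listing_json):
    """Map kernel name to display name from `jupyter kernelspec list --json` output."""
    specs = json.loads(listing_json).get("kernelspecs", {})
    return {
        name: info.get("spec", {}).get("display_name", "")
        for name, info in specs.items()
    }


def kernelspec_listing():
    """Run the Jupyter CLI and return its JSON listing as text (needs jupyter on PATH)."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With the environment activated, "acsmnode_env" in registered_kernels(kernelspec_listing()) should be True once the registration step has succeeded.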
(Optional) Run the Dashboard App
Run the following command to start the dashboard app:
python data_flagging_app.py
This command will launch the data flagging app.
(Optional) Stop the Dashboard App
In the terminal where the app is running, press:
Ctrl + C
This terminates the server process running the app.
Authors
This toolkit was developed by:
- Juan F. Flórez-Ospina
- Leïla H. Simon
- Nora K. Nowak
- Benjamin T. Brem
- Martin Gysel-Beer
- Robin L. Modini
All authors are affiliated with the PSI Center for Energy and Environmental Sciences, 5232 Villigen PSI, Switzerland.
- For general correspondence: robin.modini@psi.ch
- For implementation-specific questions: juan.florez-ospina@psi.ch, juanflo16@gmail.com
Funding
This work was funded by the ETH-Domain Open Research Data (ORD) Program – Measure 1.
It is part of the project “Building FAIR Data Chains for Atmospheric Observations in the ACTRIS Switzerland Network”, which is described in more detail at the ORD Program project portal.