Save changes.

2025-02-15 18:20:58 +01:00
parent 25f3ee12a4
commit 0911260f26
5 changed files with 228 additions and 228 deletions


@@ -1,20 +1,20 @@
cff-version: 1.2.0
title: >-
QC/QA flagging app for ACSM experimental campaigns
message: >-
If you use our code and datasets, please cite our
repository and related paper.
type: software
authors:
- given-names: Juan Felipe
family-names: Florez-Ospina
email: juanflo16@gmail.com
orcid: 'https://orcid.org/0000-0001-5971-9042'
- given-names: Robin Lewis
family-names: Modini
email: robin.modini@psi.ch
orcid: 'https://orcid.org/0000-0002-2982-1369'
date-released: 2024-11-26
url: "https://gitlab.psi.ch/apog/acsmnode.git"
doi:
license:
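If a quick validity check of this citation metadata is wanted, the third-party `cffconvert` tool can be used. This is an optional sketch; it assumes the file is saved under the conventional name `CITATION.cff` in the repository root and that the tool has been installed (e.g. `pip install cffconvert`):

```bash
cffconvert --validate
```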

README.md

@@ -1,56 +1,56 @@
# QC/QA Data Flagging Application
This repository hosts a Dash Plotly data flagging app for ACSM data structured in HDF5 format using the DIMA submodule. The provided Jupyter notebooks walk you through the steps to append metadata about diagnostic and target channels, which are necessary for the app to run properly.
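For orientation, the HDF5 files handled by the app organize instrument data under dataset paths such as `ACSM_TOFWARE/<year>/<source file>/data_table` (see the commented examples in the plotting utility further below). A minimal sketch for inspecting such a file with `h5py`, using a hypothetical file name:

```python
import h5py

# 'collection.h5' is a placeholder; substitute the HDF5 file produced by the DIMA pipeline
with h5py.File('collection.h5', 'r') as f:
    # Print every group and dataset path in the file hierarchy
    f.visit(print)
```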
## Getting Started
### Requirements
For Windows users, the following are required:
1. **Git Bash**: Git Bash will be used to run shell scripts (`.sh` files).
2. **Conda**: You must have [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) installed on your system. Git Bash needs access to Conda to set up the environment properly. Ensure that Conda is added to your system's PATH during installation.
3. **PSI Network Access (for data retrieval)**: Real data retrieval can only be performed when connected to the PSI network and with the appropriate access rights to the source network drive.
## Clone the Repository
Open a **Git Bash** terminal.
1. Navigate to your GitLab folder, clone the repository, and navigate to the `acsmnode` folder:
```bash
cd GitLab
git clone --recurse-submodules https://gitlab.psi.ch/apog/acsmnode.git
cd acsmnode
```
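If the repository was cloned without `--recurse-submodules`, the DIMA submodule can still be fetched afterwards:

```bash
git submodule update --init --recursive
```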
### Set Up the Python Environment
Skip this step if the **Git Bash** terminal already has access to a suitable Python interpreter.
Otherwise, run the following command to create and configure the `multiphase_chemistry_env` Conda environment:
```bash
bash env_setup.sh
```
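The script creates a Conda environment named `multiphase_chemistry_env`; if your terminal does not activate it automatically, activate it before launching the app:

```bash
conda activate multiphase_chemistry_env
```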
## Run the Dashboard App
Run the following command to start the dashboard app:
```bash
python data_flagging_app.py
```
This command will launch the data flagging app. Unless the script configures a different host or port, Dash apps are served at `http://127.0.0.1:8050` by default.
## Stop the Dashboard App
To stop the dashboard app, press the following key combination in the terminal where it is running:
```bash
CTRL + C
```
This command will terminate the server process running the app.

TODO.md

@@ -1,28 +1,28 @@
# TODO
* Implement flagging-app-specific data operations such as:
1. [New item] When the 'verify flags' checklist item is active, enable the delete-flag button to delete the flag associated with the active cell in the table.
2. [New item] When the 'verify' and 'ready to transfer' checklist items are active, enable the record-flags button to record verified flags into the HDF5 file.
3. [New item] When all checklist items are active, enable the apply button to apply flags to the time series data and save it to the HDF5 file.
1. ~~Define data manager obj with apply flags behavior.~~
2. Define metadata recording who performed the flagging and the quality assurance tests.
3. Update instruments/dictionaries/ACSM_TOFWARE_flags.yaml and instruments/readers/flag_reader.py to describe metadata elements based on the dictionary.
4. ~~Update DIMA data integration pipeline to allow a user-defined file naming template~~
5. ~~Design and implement flag visualization feature: click flag on table and display on figure shaded region when feature is enabled~~
6. Implement schema validator on yaml/json representation of hdf5 metadata
7. Implement updates to 'actris level' and 'processing_script' after an operation is applied to the data/file.
* ~~When `Create Flag` is clicked, modify the title to indicate that we are in flagging mode and ROIs can be drawn by dragging.~~
* ~~Update `Commit Flag` logic:~~
~~3. Update recorded flags directory, and add provenance information to each flag (which instrument and channel it belongs to).~~
* Record collected flag information initially in a YAML or JSON file. Is this faster than writing directly to the HDF5 file?
* Should we actively transfer collected flags by clicking a button? After the commit button is pressed, each flag is now stored in an independent JSON file (see the sketch below).
* Enable some form of chunked storage and visualization from the HDF5 file, iterating over chunks to balance display speed against access time.
1. Do I need to modify DIMA?
2. What is a good chunk size?
3. What Dash component can we use to iterate over the chunks?
![Screenshot](figures/flagging_app_screenshot.JPG)
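A minimal sketch of what one such independent flag file could contain and how it might be written; the field names and output directory are illustrative assumptions, not the app's actual schema:

```python
import json
from pathlib import Path

# Hypothetical flag record; field names are illustrative, not the app's actual schema
flag_record = {
    "instrument": "ACSM_TOFWARE",
    "channel": "VaporizerTemp_C",      # assumed diagnostic channel name
    "startdate": "2024-07-01T10:15:00",
    "enddate": "2024-07-01T11:40:00",
    "flag_code": 456,
    "created_by": "jdoe",
}

flags_dir = Path("flags")              # assumed output directory
flags_dir.mkdir(exist_ok=True)
with open(flags_dir / "flag_0001.json", "w", encoding="utf-8") as f:
    json.dump(flag_record, f, indent=2)
```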


@@ -1,48 +1,48 @@
#!/bin/bash

# Define the name of the environment
ENV_NAME="multiphase_chemistry_env"

# Check if mamba is available and use it instead of conda for faster installation
if command -v mamba &> /dev/null; then
    CONDA_COMMAND="mamba"
else
    CONDA_COMMAND="conda"
fi

# Create the conda environment with all dependencies, resolving from conda-forge and defaults
$CONDA_COMMAND create -y -n "$ENV_NAME" -c conda-forge -c defaults python=3.11 \
    jupyter numpy h5py pandas matplotlib plotly=5.24 scipy pip

# Check if the environment was successfully created
if [ $? -ne 0 ]; then
    echo "Failed to create the environment '$ENV_NAME'. Please check the logs above for details."
    exit 1
fi

# Activate the new environment
if source activate "$ENV_NAME" 2>/dev/null || conda activate "$ENV_NAME" 2>/dev/null; then
    echo "Environment '$ENV_NAME' activated successfully."
else
    echo "Failed to activate the environment '$ENV_NAME'. Please check your conda setup."
    exit 1
fi

# Install additional pip packages only if the environment is activated
echo "Installing additional pip packages..."
pip install pybis==1.35 igor2 ipykernel sphinx && \
    pip install dash dash-bootstrap-components

# Check if the pip installations were successful (the && chain above makes $? reflect both installs)
if [ $? -ne 0 ]; then
    echo "Failed to install pip packages. Please check the logs above for details."
    exit 1
fi

# Optional: Export the environment to a YAML file (commented out)
# $CONDA_COMMAND env export -n "$ENV_NAME" > "$ENV_NAME-environment.yaml"

# Print success message
echo "Environment '$ENV_NAME' created and configured successfully."
# echo "Environment configuration saved to '$ENV_NAME-environment.yaml'."


@@ -1,78 +1,78 @@
import dima.src.hdf5_ops as dataOps
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def visualize_table_variables(data_file_path, dataset_name, flags_dataset_name, x_var, y_vars):

    if not os.path.exists(data_file_path):
        raise ValueError(f"Path to input file {data_file_path} does not exist. The parameter 'data_file_path' must be a valid path to a suitable HDF5 file.")

    # Create data manager object
    dataManager = dataOps.HDF5DataOpsManager(data_file_path)
    dataManager.load_file_obj()

    # Specify diagnostic variables and the associated flags
    #dataset_name = 'ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt/data_table'
    #flags_dataset_name = 'ACSM_TOFWARE_flags/2024/ACSM_JFJ_2024_meta_flags.csv/data_table'
    dataset_df = dataManager.extract_dataset_as_dataframe(dataset_name)
    flags_df = dataManager.extract_dataset_as_dataframe(flags_dataset_name)

    if x_var not in dataset_df.columns or x_var not in flags_df.columns:
        raise ValueError(f'Invalid x_var: {x_var}. x_var must refer to a time variable name that is present in both {dataset_name} and {flags_dataset_name}.')

    flags_df[x_var] = pd.to_datetime(flags_df[x_var].apply(lambda x: x.decode(encoding="utf-8")))

    dataManager.unload_file_obj()

    if not all(var in dataset_df.columns for var in y_vars):
        raise ValueError(f'Invalid y_vars: {y_vars}. y_vars must be a subset of {dataset_df.columns}.')

    #fig, ax = plt.subplots(len(y_vars), 1, figsize=(12, 5))
    for var_idx, var in enumerate(y_vars):
        #y = dataset_df[var].to_numpy()

        # Plot the diagnostic variable against the time variable
        fig = plt.figure(var_idx, figsize=(12, 2.5))
        ax = plt.gca()
        #ax = fig.get_axes()
        ax.plot(dataset_df[x_var], dataset_df[var], label=var, alpha=0.8, color='tab:blue')

        # Specify the flag name associated with var. By construction, flag columns are assumed
        # to follow the prefix convention flag_<var>.
        var_flag_name = f"flag_{var}"
        if var_flag_name in flags_df.columns:
            # Identify valid and invalid indices
            ind_valid = flags_df[var_flag_name].to_numpy()
            ind_invalid = np.logical_not(ind_valid)

            # Detect start and end indices of invalid regions by locating the transition
            # points of the boolean invalid mask
            invalid_starts = np.diff(np.concatenate(([False], ind_invalid, [False]))).nonzero()[0][::2]
            invalid_ends = np.diff(np.concatenate(([False], ind_invalid, [False]))).nonzero()[0][1::2]

            # Shade invalid regions
            t_base = dataset_df[x_var].to_numpy()
            for start, end in zip(invalid_starts, invalid_ends):
                ax.fill_betweenx([dataset_df[var].min(), dataset_df[var].max()], t_base[start], t_base[end],
                                 color='red', alpha=0.3, label="Invalid Data" if start == invalid_starts[0] else "")

        # Labels and legends
        ax.set_xlabel(x_var)
        ax.set_ylabel(var)
        ax.legend()
        ax.grid(True)

    #plt.tight_layout()
    #plt.show()
    return fig, ax
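A minimal usage sketch, following the commented-out dataset paths above; the file path, time column, and diagnostic variable names are assumptions for illustration:

```python
# Hypothetical call; adjust paths and column names to the actual HDF5 file
fig, ax = visualize_table_variables(
    data_file_path='data/collection_JFJ_2024.h5',  # assumed HDF5 file location
    dataset_name='ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt/data_table',
    flags_dataset_name='ACSM_TOFWARE_flags/2024/ACSM_JFJ_2024_meta_flags.csv/data_table',
    x_var='t_start_Buf',            # assumed name of the shared time column
    y_vars=['VaporizerTemp_C'],     # assumed diagnostic variable; its flag column would be 'flag_VaporizerTemp_C'
)
fig.savefig('figures/VaporizerTemp_C_flagged.png', dpi=150)
```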