# ACSM Data Chain Workflow

This notebook walks through the **ACSM Data Chain**, which consists of the following steps:

1. Run the data integration pipeline to retrieve ACSM input data and prepare it for processing.  
2. Perform QC/QA analysis.  
3. (Optional) Conduct visual analysis for flag validation.  
4. Prepare input data and QC/QA analysis results for submission to the EBAS database.  

## Import Libraries and Data Chain Steps

* Execute (or Run) the cell below.

In [None]:
import sys
import os
# Set up the project root and dima directories
notebook_dir = os.getcwd()  # Current working directory (assumes running from notebooks/)
project_path = os.path.normpath(os.path.join(notebook_dir, ".."))  # Move up to project root
dima_path = os.path.normpath(os.path.join(project_path, "dima"))  # dima directory inside the project root

if project_path not in sys.path:  # Avoid duplicate entries
    sys.path.append(project_path)
if dima_path not in sys.path:
    sys.path.insert(0, dima_path)
#sys.path.append(os.path.join(root_dir,'dima','instruments'))
#sys.path.append(os.path.join(root_dir,'dima','src'))
#sys.path.append(os.path.join(root_dir,'dima','utils'))

#import dima.visualization.hdf5_vis as hdf5_vis
#import dima.pipelines.data_integration as data_integration
import subprocess


for item in sys.path:
    print(item)

from dima.pipelines.data_integration import run_pipeline as get_campaign_data
from pipelines.steps.apply_calibration_factors import main as apply_calibration_factors
from pipelines.steps.generate_flags import main as generate_flags
from pipelines.steps.prepare_ebas_submission import main as prepare_ebas_submission 
from pipelines.steps.update_actris_header import main as update_actris_header
from pipelines.steps.utils import load_project_yaml_files
from pipelines.steps.update_datachain_params import main as update_datachain_params
from pipelines.steps.drop_column_from_nas_file import main as drop_column_from_nas_file
from pipelines.steps.adjust_uncertainty_column_in_nas_file import main as adjust_uncertainty_column_in_nas_file

campaign_descriptor = load_project_yaml_files(project_path, "campaignDescriptor.yaml")
YEAR = campaign_descriptor['year']
STATION_ABBR = campaign_descriptor['station_abbr']

workflow_fname = f'workflow_acsm_data_{STATION_ABBR}_{YEAR}'

print(workflow_fname)
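
The cell above assumes that `campaignDescriptor.yaml` provides the keys `year` and `station_abbr`. As a minimal sketch, the helper below (hypothetical, not part of the pipeline) checks for those keys before building the workflow name, so that a missing entry fails with a clear message rather than a bare `KeyError`. The descriptor values are illustrative only.

```python
def build_workflow_name(descriptor: dict) -> str:
    """Build the workflow name from a campaign descriptor dictionary."""
    required = ("year", "station_abbr")
    missing = [key for key in required if key not in descriptor]
    if missing:
        raise KeyError(f"campaign descriptor is missing keys: {missing}")
    return f"workflow_acsm_data_{descriptor['station_abbr']}_{descriptor['year']}"


# Hand-written descriptor for illustration (not read from a YAML file)
example_descriptor = {"year": 2024, "station_abbr": "JFJ"}
print(build_workflow_name(example_descriptor))
```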

## Step 1: Retrieve Input Data from a Network Drive

* Create a configuration file (i.e., a `.yaml` file) following the example provided in the input folder.
* Set up the input and output directory paths.
* Execute the cell (or skip it and execute the next cell with manually defined **CAMPAIGN_DATA_FILE** and **APPEND_DATA_DIR**).

In [None]:
path_to_config_file = '../campaignDescriptor.yaml'
paths_to_hdf5_files = get_campaign_data(path_to_config_file)
# Select campaign data file and append directory
CAMPAIGN_DATA_FILE = paths_to_hdf5_files[0]
APPEND_DATA_DIR = os.path.splitext(CAMPAIGN_DATA_FILE)[0]


In [None]:
# Uncomment and define the following variables manually to reanalyze previous data collections
#CAMPAIGN_DATA_FILE = '../data/collection_PAY_2024_2025-06-05_2025-06-05.h5'
#CAMPAIGN_DATA_FILE = '../data/collection_JFJ_2024_2025-06-06_2025-06-06.h5'
#APPEND_DATA_DIR = '../data/collection_JFJ_2024_2025-06-06_2025-06-06'
#APPEND_DATA_DIR = '../data/collection_PAY_2024_2025-05-26_2025-05-26'
#CAMPAIGN_DATA_FILE = '../data/collection_PAY_2024_2025-05-21_2025-05-21.h5'
#APPEND_DATA_DIR = '../data/collection_PAY_2024_2025-05-21_2025-05-21'
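
When the two variables are set by hand, they need to refer to the same collection. The sketch below derives the append directory from the data file name in the same way as the Step 1 cell (`os.path.splitext`). The helper name and the extension check are assumptions added for illustration.

```python
import os


def derive_append_dir(campaign_data_file: str) -> str:
    """Return the file path without its extension, as done in Step 1."""
    root, ext = os.path.splitext(campaign_data_file)
    # Assumption: campaign data files end in .h5 or .hdf5
    if ext.lower() not in (".h5", ".hdf5"):
        raise ValueError(f"expected an HDF5 file, got: {campaign_data_file}")
    return root


example_file = "../data/collection_PAY_2024_2025-05-21_2025-05-21.h5"
print(derive_append_dir(example_file))
```

Deriving `APPEND_DATA_DIR` this way avoids pairing a data file from one collection with the directory of another.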

## Step 1.1: Update Data Chain Parameters with Input Data
* Ensure the data folder retrieved from the network drive contains a properly structured `ACSM_TOFWARE/<year>/params` folder.

In [None]:
update_datachain_params(CAMPAIGN_DATA_FILE, f'ACSM_TOFWARE/{YEAR}', capture_renku_metadata=True, workflow_name=workflow_fname)
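
A quick pre-flight check can confirm that the `params` folder is present before this step runs. The sketch below assumes the folder lives on the local filesystem under some data root. Both the helper name and the example root path are hypothetical, and the actual lookup performed by `update_datachain_params` may differ.

```python
from pathlib import Path
from typing import Optional


def find_params_dir(data_root, year) -> Optional[Path]:
    """Return <data_root>/ACSM_TOFWARE/<year>/params if it is a directory, else None."""
    candidate = Path(data_root) / "ACSM_TOFWARE" / str(year) / "params"
    return candidate if candidate.is_dir() else None


# Hypothetical data root; prints None when the folder does not exist
print(find_params_dir("../data/example_collection", 2024))
```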

## Step 2: Calibrate Input Campaign Data and Save Data Products

* Make sure the variable `CAMPAIGN_DATA_FILE` is properly defined in the previous step. Otherwise, set it manually as indicated below.
* Execute the cell.

In [None]:
# To set the data file path manually, uncomment the following line and fill in the path

# CAMPAIGN_DATA_FILE = '../data/<enter here *.h5 filename of interest inside the data directory>'
path_to_data_file = CAMPAIGN_DATA_FILE
path_to_calibration_file = '../pipelines/params/calibration_factors.yaml'

apply_calibration_factors(path_to_data_file, path_to_calibration_file, capture_renku_metadata=True, workflow_name=workflow_fname)


## Step 3: Perform QC/QA Analysis

* Generate automated flags based on validity thresholds for diagnostic channels.
* (Optional) Generate manual flags using the **Data Flagging App**, accessible at:  
  [http://localhost:8050/](http://localhost:8050/)
* Execute the cell.

In [None]:
dataset_name = f'ACSM_TOFWARE/{YEAR}/ACSM_{STATION_ABBR}_{YEAR}_meta.txt/data_table'
path_to_config_file = 'pipelines/params/validity_thresholds.yaml'
#command = ['python', 'pipelines/steps/compute_automated_flags.py', path_to_data_file, dataset_name, path_to_config_file]
#status = subprocess.run(command, capture_output=True, check=True)
#print(status.stdout.decode())
path_to_data_file = CAMPAIGN_DATA_FILE
generate_flags(path_to_data_file, 'diagnostics', capture_renku_metadata=True, workflow_name=workflow_fname)



In [None]:

generate_flags(path_to_data_file, 'cpc', capture_renku_metadata=True, workflow_name=workflow_fname)
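
To illustrate what threshold-based flagging means, the sketch below marks each reading of a diagnostic channel that falls outside a validity range. It is not the implementation behind `generate_flags`. The function name, the example limits, and the choice to flag missing values are all assumptions.

```python
import math


def flag_out_of_range(values, lower, upper):
    """Return True for each value that is missing or outside [lower, upper]."""
    flags = []
    for v in values:
        # Assumption: missing readings (None or NaN) are flagged as invalid
        is_missing = v is None or (isinstance(v, float) and math.isnan(v))
        flags.append(is_missing or not (lower <= v <= upper))
    return flags


# Hypothetical diagnostic readings and validity limits
readings = [0.98, 1.02, float("nan"), 1.45, 1.00]
print(flag_out_of_range(readings, lower=0.9, upper=1.1))
```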

## (Optional) Step 3.1: Inspect Previously Generated Flags for Correctness

* Perform flag validation using the Jupyter Notebook workflow available at:  
  [../notebooks/demo_visualize_diagnostic_flags_from_hdf5_file.ipynb](demo_visualize_diagnostic_flags_from_hdf5_file.ipynb)
* Follow the notebook steps to visually inspect previously generated flags.

## Step 4: Apply Diagnostic and Manual Flags to Variables of Interest

* Generate flags for species based on previously collected QC/QA flags.
* Execute the cell.

In [None]:
#CAMPAIGN_DATA_FILE = '../data/collection_JFJ_2024_2025-04-08_2025-04-08.h5'
path_to_data_file = CAMPAIGN_DATA_FILE
dataset_name = f'ACSM_TOFWARE/{YEAR}/ACSM_{STATION_ABBR}_{YEAR}_meta.txt/data_table'
path_to_config_file = 'pipelines/params/validity_thresholds.yaml'
#command = ['python', 'pipelines/steps/compute_automated_flags.py', path_to_data_file, dataset_name, path_to_config_file]
#status = subprocess.run(command, capture_output=True, check=True)
#print(status.stdout.decode())
generate_flags(path_to_data_file, 'species', capture_renku_metadata=True, workflow_name=workflow_fname)

## Step 5: Generate Campaign Data in EBAS Format

* Gather and set paths to the required data products produced in the previous steps.
* Execute the cell.

In [None]:
import warnings
print(APPEND_DATA_DIR)
DATA_DIR = f"{APPEND_DATA_DIR}/ACSM_TOFWARE_processed/{YEAR}"
FLAGS_DIR = f"{APPEND_DATA_DIR}/ACSM_TOFWARE_flags/{YEAR}"

PATH1 = f"{DATA_DIR}/ACSM_{STATION_ABBR}_{YEAR}_timeseries_calibrated.csv"
PATH2 = f"{DATA_DIR}/ACSM_{STATION_ABBR}_{YEAR}_timeseries_calibrated_err.csv"
PATH3 = f"{DATA_DIR}/ACSM_{STATION_ABBR}_{YEAR}_timeseries_calibration_factors.csv"
PATH4 = f"{FLAGS_DIR}/ACSM_{STATION_ABBR}_{YEAR}_timeseries_flags.csv"

for p in [PATH1, PATH2, PATH3, PATH4]:
    print(p, os.path.exists(p))
update_actris_header('../campaignDescriptor.yaml')

month = "2-3"
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    prepare_ebas_submission([PATH1, PATH2, PATH3], PATH4, month, capture_renku_metadata=True, workflow_name=workflow_fname)
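
The cell above prints whether each data product exists but continues regardless. As a minimal sketch, the hypothetical helpers below collect the missing paths and stop early with a single error message. The example file names are illustrative.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


def require_paths(paths):
    """Raise FileNotFoundError listing every missing path; return None otherwise."""
    missing = missing_paths(paths)
    if missing:
        raise FileNotFoundError("missing data products: " + ", ".join(missing))


# Illustrative file names; prints those that are not found
print(missing_paths(["example_timeseries_calibrated.csv", "example_timeseries_flags.csv"]))
```

In this notebook, calling `require_paths([PATH1, PATH2, PATH3, PATH4])` before `prepare_ebas_submission` would surface a missing input before the submission files are generated.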


## Step 5.1: Remove the inletP Column from a Generated .nas File
* Select a `.nas` file from the data folder.

In [None]:
#path_to_data_file = '../data/CH0001G.20240201010000.20250519140310.aerosol_mass_spectrometer.chemistry_ACSM.pm1_non_refractory.2mo.1h.CH02L_Aerodyne_ToF-ACSM_017.CH02L_Aerodyne_ToF-ACSM_JFJ.lev2.nas'
path_to_data_file = '../data/CH0001G.20240201010000.20250527075812.aerosol_mass_spectrometer.chemistry_ACSM.pm1_non_refractory.2mo.1h.CH02L_Aerodyne_ToF-ACSM_017.CH02L_Aerodyne_ToF-ACSM_JFJ.lev2.nas'

drop_column_from_nas_file(path_to_data_file, column_to_remove='inletP')

In [None]:
#print(((0.5*0.9520)**2 + (0.5*0.0554)**2)**0.5)
#from math import sqrt
#print(':)',sqrt((0.5*0.9520)**2 + (0.5*0.0554)**2))

## Step 5.2: Adjust the Uncertainty of Selected Variables by Adding a Constant
* Select a `.nas` file from the data folder.

In [None]:
path_to_data_file = '../data/CH0001G.20240201010000.20250527075812.aerosol_mass_spectrometer.chemistry_ACSM.pm1_non_refractory.2mo.1h.CH02L_Aerodyne_ToF-ACSM_017.CH02L_Aerodyne_ToF-ACSM_JFJ.lev2.nas'
#adjust_uncertainty_column_in_nas_file(path_to_data_file, base_column_name='Org')
variables = ['Org', 'NO3', 'NH4', 'SO4', 'Chl']
adjust_uncertainty_column_in_nas_file(path_to_data_file, base_column_names=variables)

## Step 6: Save Data Products to an HDF5 File

* Gather and set paths to the required data products produced in the previous steps.
* Execute the cell.


In [None]:
import dima.src.hdf5_ops as dataOps 
#print(os.curdir)


dataManager = dataOps.HDF5DataOpsManager(CAMPAIGN_DATA_FILE)
print(dataManager.file_path)
print(APPEND_DATA_DIR)
dataManager.update_file(APPEND_DATA_DIR)


In [None]:
# Inspect the HDF5 campaign file (path_to_data_file refers to a .nas file after Step 5.2)
dataManager = dataOps.HDF5DataOpsManager(CAMPAIGN_DATA_FILE)
dataManager.load_file_obj()
dataManager.extract_and_load_dataset_metadata()
df = dataManager.dataset_metadata_df
print(df.head(10))
dataManager.unload_file_obj()
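
If an exception occurs between `load_file_obj()` and `unload_file_obj()`, the file object may stay open. One possible way to guard against this is a small context manager around the two calls used above. The sketch below is an illustration only. `opened` is a hypothetical helper, and `FakeManager` is a stand-in so the example runs without dima.

```python
from contextlib import contextmanager


@contextmanager
def opened(manager):
    """Load the manager's file object and always unload it on exit."""
    manager.load_file_obj()
    try:
        yield manager
    finally:
        manager.unload_file_obj()


class FakeManager:
    """Stand-in exposing the two methods called in the cell above."""

    def __init__(self):
        self.is_open = False

    def load_file_obj(self):
        self.is_open = True

    def unload_file_obj(self):
        self.is_open = False


manager = FakeManager()
with opened(manager) as m:
    print(m.is_open)  # file object is loaded inside the block
print(manager.is_open)  # and unloaded afterwards
```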