# Data Processing Workflow

## Recommended Directory Structure

```
Campaign_name/
├── SP2XR_files/                  # Raw instrument files (.csv/.zip)
│   └── 20200415/                 # Organized by date
├── SP2XR_pbp_parquet/            # Converted particle data
├── SP2XR_hk_parquet/             # Converted housekeeping data
└── SP2XR_pbp_processed_1min/     # Calibrated and processed data (user-defined resolution)
```

## Processing Steps

### 1. CSV to Parquet Conversion (`sp2xr_csv2parquet.py`)

- Convert raw CSV/ZIP files to Parquet format
- Separate particle-by-particle (PbP) and housekeeping (HK) data streams
- Apply data schema and column mapping transformations
- Organize by date and hour for efficient querying

### 2. Data Loading and Preparation (`sp2xr_pipeline.py`)

- Load PbP and HK Parquet files with time-based filtering
- Repartition data for optimal parallel processing
- Resample housekeeping data to user-defined time resolution (e.g., 1s, 60s)

### 3. Calibration and Quality Control

- Apply scattering and incandescence calibrations to raw particle signals
- Convert signals to physical units (diameter in nm, mass in fg)
- Flag particles based on instrument quality control parameters
- Calculate mixing state classifications using time delay method
- Merge calibrated particle data with resampled flow measurements

### 4. Time Aggregation and Summary Statistics

- Aggregate particle-by-particle data to time bins (dt resolution)
- Calculate summary statistics (counts, means) for each time bin
- Join aggregated PbP data with resampled HK data

### 5. Bulk Concentrations (optional, if `conc: true`)

- Compute number and mass concentrations for different particle types
- Calculate size-resolved concentrations for different coating states
- Account for flow rate corrections and sampling efficiency

### 6. Size and Mass Distributions (optional, if `*_hist: true`)

- Compute size distributions (dNdlogDsc) for scattering-only particles
- Compute mass distributions (dNdlogDmev, dMdlogDmev) for BC-containing particles
- Calculate time-lag distributions for mixing state analysis
- User-configurable bin edges and ranges

## Known Limitations

1. **Sampling frequency**: Currently assumes all BC and BC-free particles are recorded in PbP files
2. **Distribution calculations**: When using time resolution ≠ 1s, histogram calculations may be incorrect. Process at 1s resolution first, then resample.
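
The workaround for the distribution limitation (process at 1s, then resample) could look roughly like the sketch below. It is not the pipeline's own code: the column names (`bin_100nm`, etc.) and the synthetic data are hypothetical, and averaging with a plain mean assumes every second carries equal weight (for example, a constant sample flow). If the flow varies, a flow-weighted average would be more appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical 1 s output: one row per second, one column per size bin.
# Column names and values are illustrative, not the pipeline's actual schema.
index = pd.date_range("2020-04-15 00:00:00", periods=120, freq="1s")
rng = np.random.default_rng(0)
hist_1s = pd.DataFrame(
    rng.poisson(5.0, size=(120, 3)).astype(float),
    index=index,
    columns=["bin_100nm", "bin_200nm", "bin_300nm"],
)

# Average the 1 s distributions into 60 s bins.
# Assumption: equal weight per second (e.g. constant sample flow).
hist_60s = hist_1s.resample("60s").mean()
```

Here `hist_60s` has one row per minute and the same bin columns as the 1s input.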