Data Processing Workflow

Campaign_name/
├── SP2XR_files/                # Raw instrument files (.csv/.zip)
│   └── 20200415/               # One subfolder per date (YYYYMMDD)
├── SP2XR_pbp_parquet/          # Converted particle-by-particle (PbP) data
├── SP2XR_hk_parquet/           # Converted housekeeping (HK) data
└── SP2XR_pbp_processed_1min/   # Calibrated and processed data (user-defined resolution)

Processing Steps

1. CSV to Parquet Conversion (sp2xr_csv2parquet.py)

  • Convert raw CSV/ZIP files to Parquet format
  • Separate particle-by-particle (PbP) and housekeeping (HK) data streams
  • Apply data schema and column mapping transformations
  • Organize by date and hour for efficient querying
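
As an illustration of the date-and-hour organization, the sketch below derives an output directory from a record timestamp. The function name `partition_dir` and the Hive-style `date=`/`hour=` folder names are assumptions made for this example; the layout actually written by sp2xr_csv2parquet.py may differ.

```python
from datetime import datetime
from pathlib import Path


def partition_dir(root: Path, timestamp: datetime) -> Path:
    """Return a date/hour partition directory for one record's timestamp."""
    # Hive-style "key=value" folder names are an assumption of this sketch,
    # not necessarily the naming used by the conversion script.
    return root / f"date={timestamp:%Y-%m-%d}" / f"hour={timestamp:%H}"


example = partition_dir(Path("SP2XR_pbp_parquet"), datetime(2020, 4, 15, 13, 7, 42))
print(example.as_posix())  # SP2XR_pbp_parquet/date=2020-04-15/hour=13
```

Partitioning on date and hour lets a reader open only the folders that overlap a requested time window instead of scanning the whole campaign.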

2. Data Loading and Preparation (sp2xr_pipeline.py)

  • Load PbP and HK Parquet files with time-based filtering
  • Repartition data for optimal parallel processing
  • Resample housekeeping data to user-defined time resolution (e.g., 1s, 60s)
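
The resampling step can be pictured as averaging all housekeeping samples that fall in the same fixed-width time bin. The following is a minimal standard-library sketch of that idea, not the pipeline's own code; `resample_mean` and the flow readings are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def resample_mean(samples, dt_seconds):
    """Average (epoch_seconds, value) samples into bins of width dt_seconds.

    Returns {bin_start: mean_value}, keyed by the left edge of each bin.
    """
    bins = defaultdict(list)
    for t, value in samples:
        bin_start = int(t // dt_seconds) * dt_seconds
        bins[bin_start].append(value)
    return {start: fmean(values) for start, values in sorted(bins.items())}


# Hypothetical sample-flow readings: (epoch seconds, flow in cm3/min).
flow = [(0, 120.0), (30, 122.0), (59, 121.0), (60, 118.0), (90, 120.0)]
flow_60s = resample_mean(flow, 60)
print(flow_60s)  # {0: 121.0, 60: 119.0}
```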

3. Calibration and Quality Control

  • Apply scattering and incandescence calibrations to raw particle signals
  • Convert signals to physical units (diameter in nm, mass in fg)
  • Flag particles based on instrument quality control parameters
  • Classify particle mixing state using the time-delay method
  • Merge calibrated particle data with resampled flow measurements
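
To make the calibration and flagging steps concrete, here is a hedged sketch. It assumes a polynomial calibration curve, a simple mass-range check as the quality flag, and a BC material density of 1.8 g/cm3 for the mass-equivalent diameter. The function names, coefficients, and limits are placeholders chosen for illustration, not values taken from the pipeline.

```python
import math


def apply_polynomial(signal, coefficients):
    """Evaluate c0 + c1*x + c2*x**2 + ... with Horner's method."""
    result = 0.0
    for c in reversed(coefficients):
        result = result * signal + c
    return result


def mass_equivalent_diameter_nm(mass_fg, density_g_cm3=1.8):
    """Diameter of a sphere with the given mass (density is an assumption)."""
    volume_cm3 = mass_fg * 1e-15 / density_g_cm3      # 1 fg = 1e-15 g
    diameter_cm = (6.0 * volume_cm3 / math.pi) ** (1.0 / 3.0)
    return diameter_cm * 1e7                          # 1 cm = 1e7 nm


def calibrate_particle(incand_peak, coefficients, min_mass_fg, max_mass_fg):
    """Convert an incandescence peak height to mass and diameter, then flag it."""
    mass_fg = apply_polynomial(incand_peak, coefficients)
    return {
        "mass_fg": mass_fg,
        "diameter_nm": mass_equivalent_diameter_nm(mass_fg),
        # Flag: True when the mass lies inside the calibrated range.
        "qc_ok": min_mass_fg <= mass_fg <= max_mass_fg,
    }


# Placeholder linear calibration (offset 0, slope 0.002 fg per signal unit).
particle = calibrate_particle(5000.0, (0.0, 0.002), min_mass_fg=0.3, max_mass_fg=400.0)
print(particle)
```

Scattering signals would follow the same pattern with their own calibration curve, and the time-delay classification is not covered by this sketch.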

4. Time Aggregation and Summary Statistics

  • Aggregate particle-by-particle data into time bins of width dt
  • Calculate summary statistics (counts, means) for each time bin
  • Join aggregated PbP data with resampled HK data
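
The aggregation and join can be sketched as follows. The example assumes both tables are keyed by the left edge of each time bin and that bins without particles should be kept with a count of zero; both choices, along with the names `aggregate_particles` and `join_with_hk`, are assumptions of this illustration.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_particles(particles, dt_seconds):
    """Group (epoch_seconds, mass_fg) records into bins; return count and mean."""
    grouped = defaultdict(list)
    for t, mass_fg in particles:
        grouped[int(t // dt_seconds) * dt_seconds].append(mass_fg)
    return {
        start: {"count": len(masses), "mean_mass_fg": fmean(masses)}
        for start, masses in sorted(grouped.items())
    }


def join_with_hk(pbp_bins, hk_bins):
    """Attach particle statistics to every HK bin (empty bins get count 0)."""
    empty = {"count": 0, "mean_mass_fg": None}
    return {
        start: {"flow_ccm": flow, **pbp_bins.get(start, empty)}
        for start, flow in hk_bins.items()
    }


# Hypothetical inputs: three particles and three 60 s housekeeping bins.
particles = [(5, 2.0), (42, 4.0), (61, 6.0)]
hk_60s = {0: 120.0, 60: 118.0, 120: 119.0}
joined = join_with_hk(aggregate_particles(particles, 60), hk_60s)
print(joined[120])  # {'flow_ccm': 119.0, 'count': 0, 'mean_mass_fg': None}
```

Keeping empty bins matters for the later concentration step, where a bin with no particles should report a concentration of zero rather than disappear.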

5. Bulk Concentrations (optional, if conc: true)

  • Compute number and mass concentrations for different particle types
  • Calculate size-resolved concentrations for different coating states
  • Account for flow rate corrections and sampling efficiency
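
A concentration is a count (or summed mass) divided by the air volume sampled during the time bin. The sketch below shows that arithmetic under the assumption that the sample flow is given in cm3/min; the function names are hypothetical and the sampling-efficiency correction is left out.

```python
def sampled_volume_cm3(flow_ccm, dt_seconds):
    """Air volume drawn through the instrument during one time bin."""
    return flow_ccm * dt_seconds / 60.0


def number_concentration_cm3(count, flow_ccm, dt_seconds):
    """Particles per cm3 of sampled air."""
    return count / sampled_volume_cm3(flow_ccm, dt_seconds)


def mass_concentration_ng_m3(total_mass_fg, flow_ccm, dt_seconds):
    """Mass per volume; 1 fg/cm3 = 1e-15 g / 1e-6 m3 = 1 ng/m3."""
    return total_mass_fg / sampled_volume_cm3(flow_ccm, dt_seconds)


# Hypothetical 60 s bin: 240 particles, 600 fg total mass, flow 120 cm3/min.
n_conc = number_concentration_cm3(240, flow_ccm=120.0, dt_seconds=60)
m_conc = mass_concentration_ng_m3(600.0, flow_ccm=120.0, dt_seconds=60)
print(n_conc, m_conc)  # 2.0 5.0
```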

6. Size and Mass Distributions (optional, if *_hist: true)

  • Compute size distributions (dNdlogDsc) for scattering-only particles
  • Compute mass distributions (dNdlogDmev, dMdlogDmev) for BC-containing particles
  • Calculate time-lag distributions for mixing state analysis
  • User-configurable bin edges and ranges
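
A distribution such as dNdlogDsc is obtained by counting particles per size bin and normalizing by the sampled volume and by the logarithmic bin width. The following sketch assumes base-10 logarithms and half-open bins [lower, upper); the function name `dn_dlogd` and the bin edges are illustrative only.

```python
import math
from bisect import bisect_right


def dn_dlogd(diameters_nm, bin_edges_nm, volume_cm3):
    """Return dN/dlogD per bin for ascending bin edges.

    Particles outside [first_edge, last_edge) are ignored.
    """
    counts = [0] * (len(bin_edges_nm) - 1)
    for d in diameters_nm:
        i = bisect_right(bin_edges_nm, d) - 1   # index of the bin holding d
        if 0 <= i < len(counts):
            counts[i] += 1
    widths = [math.log10(hi / lo) for lo, hi in zip(bin_edges_nm, bin_edges_nm[1:])]
    return [c / (volume_cm3 * w) for c, w in zip(counts, widths)]


# Hypothetical data: five particles, two decade-wide bins, 2 cm3 of sampled air.
edges = [100.0, 1000.0, 10000.0]
dist = dn_dlogd([150.0, 200.0, 500.0, 2000.0, 50.0], edges, volume_cm3=2.0)
print(dist)  # [1.5, 0.5]
```

The same normalization applies to dMdlogDmev if each particle contributes its mass instead of a count of one.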

Known Limitations

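  1. Sampling frequency: The pipeline currently assumes that every particle, both BC-containing and BC-free, is recorded in the PbP files. Data acquired with only a subset of particles saved are not corrected for.
  2. Distribution calculations: When the time resolution is not 1 s, histogram calculations may be incorrect. As a workaround, process at 1 s resolution first, then resample the results to the desired resolution.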