Data Processing Workflow
Recommended Directory Structure
Campaign_name/
├── SP2XR_files/ # Raw instrument files (.csv/.zip)
│ └── 20200415/ # Organized by date
├── SP2XR_pbp_parquet/ # Converted particle data
├── SP2XR_hk_parquet/ # Converted housekeeping data
└── SP2XR_pbp_processed_1min/ # Calibrated and processed data (user-defined resolution)
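The layout above can be prepared before any processing so that each stage has a known output location. The following is a minimal sketch using only the standard library; the function names and the `resolution` argument are illustrative and are not part of the package.

```python
from pathlib import Path


def campaign_dirs(root, resolution="1min"):
    """Return the four data directories of a campaign as Path objects."""
    root = Path(root)
    return {
        "raw": root / "SP2XR_files",
        "pbp": root / "SP2XR_pbp_parquet",
        "hk": root / "SP2XR_hk_parquet",
        # The suffix reflects the user-defined time resolution.
        "processed": root / f"SP2XR_pbp_processed_{resolution}",
    }


def create_campaign_dirs(root, resolution="1min"):
    """Create any missing directories and return the layout."""
    dirs = campaign_dirs(root, resolution)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `create_campaign_dirs("Campaign_name")` would create the tree shown above, apart from the per-date subfolders under SP2XR_files/.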
Processing Steps
1. CSV to Parquet Conversion (sp2xr_csv2parquet.py)
- Convert raw CSV/ZIP files to Parquet format
- Separate particle-by-particle (PbP) and housekeeping (HK) data streams
- Apply data schema and column mapping transformations
- Organize by date and hour for efficient querying
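Organizing by date and hour means a time-filtered read only has to open the matching folders. The sketch below shows one way to derive such a folder from a timestamp. The `date=`/`hour=` folder naming and the function name are assumptions made for illustration; the layout actually written by sp2xr_csv2parquet.py is not described here.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_path(base, timestamp):
    """Build a date/hour partition folder for one timestamp (assumed naming)."""
    date_part = f"date={timestamp:%Y-%m-%d}"
    hour_part = f"hour={timestamp.hour}"
    return PurePosixPath(base) / date_part / hour_part


ts = datetime(2020, 4, 15, 13, 27, 5)
print(partition_path("SP2XR_pbp_parquet", ts))
# prints: SP2XR_pbp_parquet/date=2020-04-15/hour=13
```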
2. Data Loading and Preparation (sp2xr_pipeline.py)
- Load PbP and HK Parquet files with time-based filtering
- Repartition data for optimal parallel processing
- Resample housekeeping data to user-defined time resolution (e.g., 1s, 60s)
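Resampling the housekeeping stream amounts to averaging each variable (for example the sample flow) over fixed time bins. The following standard-library sketch illustrates the idea for a single variable; it is not the pipeline's implementation, and the function name and bin-labelling convention (bin start, aligned to midnight) are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def resample_mean(samples, dt_seconds):
    """Average (timestamp, value) pairs into bins of dt_seconds.

    Bins are labelled by their start time and aligned to midnight.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bin start -> [running sum, n]
    for ts, value in samples:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        elapsed = int((ts - midnight).total_seconds())
        offset = elapsed // dt_seconds * dt_seconds
        bin_start = midnight + timedelta(seconds=offset)
        sums[bin_start][0] += value
        sums[bin_start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}
```

With 1 Hz housekeeping records and `dt_seconds=60`, each output value is the mean of up to 60 input samples.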
3. Calibration and Quality Control
- Apply scattering and incandescence calibrations to raw particle signals
- Convert signals to physical units (diameter in nm, mass in fg)
- Flag particles based on instrument quality control parameters
- Classify the mixing state of each particle using the time-delay method
- Merge calibrated particle data with resampled flow measurements
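The calibration and classification steps above can be sketched as three small functions. Everything specific here is an assumption for illustration: the polynomial form of the incandescence calibration, the coefficient values a user would pass in, the BC material density of 1.8 g/cm³ (a commonly used value), the delay threshold, and the label names. The calibration functions used by the pipeline may differ.

```python
import math

BC_DENSITY_G_CM3 = 1.8  # assumed BC material density; adjust as needed


def incandescence_to_mass_fg(peak, coeffs):
    """Polynomial calibration: mass = c0 + c1*peak + c2*peak**2 + ..."""
    return sum(c * peak**i for i, c in enumerate(coeffs))


def mass_to_diameter_nm(mass_fg, density_g_cm3=BC_DENSITY_G_CM3):
    """Mass-equivalent diameter of a sphere with the given mass and density."""
    volume_cm3 = mass_fg * 1e-15 / density_g_cm3  # fg -> g, then g -> cm^3
    return (6.0 * volume_cm3 / math.pi) ** (1.0 / 3.0) * 1e7  # cm -> nm


def classify_mixing_state(delay, threshold):
    """Label a BC particle from the delay between its scattering and
    incandescence peaks; a long delay indicates a thick coating."""
    return "thickly_coated" if delay >= threshold else "thinly_coated"
```

Under these assumptions, a 1 fg particle has a mass-equivalent diameter of about 102 nm.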
4. Time Aggregation and Summary Statistics
- Aggregate particle-by-particle data into time bins of width dt (the user-defined time resolution)
- Calculate summary statistics (counts, means) for each time bin
- Join aggregated PbP data with resampled HK data
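As a rough illustration of the aggregation and join, the sketch below bins particles by time, computes a count and a mean per bin, and attaches the housekeeping values for the same bin. The field names (`count`, `mean_mass_fg`, `flow_ccm`) are hypothetical, and the choice to keep every housekeeping bin (so that bins without particles appear with a count of 0) is an assumption; the join type used by the pipeline is not stated here.

```python
from collections import defaultdict


def aggregate_particles(particles, dt_seconds):
    """Count particles and average their mass in bins of dt_seconds.

    particles: iterable of (time_in_seconds, mass_fg) tuples.
    Returns {bin_start_seconds: {"count": n, "mean_mass_fg": m}}.
    """
    bins = defaultdict(list)
    for t, mass in particles:
        bins[int(t // dt_seconds) * dt_seconds].append(mass)
    return {
        start: {"count": len(m), "mean_mass_fg": sum(m) / len(m)}
        for start, m in sorted(bins.items())
    }


def join_with_hk(pbp_bins, hk_bins):
    """Join on bin start, keeping every HK bin; empty bins get count 0."""
    empty = {"count": 0, "mean_mass_fg": None}
    return {
        start: {**pbp_bins.get(start, empty), **hk}
        for start, hk in hk_bins.items()
    }
```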
5. Bulk Concentrations (optional, if conc: true)
- Compute number and mass concentrations for different particle types
- Calculate size-resolved concentrations for different coating states
- Account for flow rate corrections and sampling efficiency
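A concentration is the per-bin count (or summed mass) divided by the volume of air sampled in that bin. The sketch below assumes the sample flow is given in cm³/min and omits any sampling-efficiency correction; the function names are illustrative only.

```python
def sampled_volume_cm3(flow_ccm, dt_seconds):
    """Air volume sampled in one time bin, for a flow in cm^3 per minute."""
    return flow_ccm / 60.0 * dt_seconds


def number_concentration(count, flow_ccm, dt_seconds):
    """Number concentration in particles per cm^3."""
    return count / sampled_volume_cm3(flow_ccm, dt_seconds)


def mass_concentration_ng_m3(total_mass_fg, flow_ccm, dt_seconds):
    """Mass concentration in ng/m^3 (1 fg/cm^3 equals 1 ng/m^3)."""
    return total_mass_fg / sampled_volume_cm3(flow_ccm, dt_seconds)
```

For example, 120 particles counted in 60 s at a flow of 60 cm³/min correspond to 2 particles per cm³.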
6. Size and Mass Distributions (optional, if *_hist: true)
- Compute size distributions (dNdlogDsc) for scattering-only particles
- Compute mass distributions (dNdlogDmev, dMdlogDmev) for BC-containing particles
- Calculate time-lag distributions for mixing state analysis
- User-configurable bin edges and ranges
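A normalized size distribution divides the number concentration in each bin by the bin width in log10(D), which makes distributions with different bin edges comparable. The following is a minimal sketch of that calculation with logarithmically spaced, user-chosen edges; the function names are hypothetical and the pipeline's own binning code may differ.

```python
import math
from bisect import bisect_right


def log_bin_edges(d_min, d_max, n_bins):
    """Return n_bins + 1 edges, equally spaced in log10(D)."""
    step = (math.log10(d_max) - math.log10(d_min)) / n_bins
    return [10 ** (math.log10(d_min) + i * step) for i in range(n_bins + 1)]


def dn_dlogd(diameters, edges, volume_cm3):
    """dN/dlogD per bin in cm^-3; diameters outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for d in diameters:
        if edges[0] <= d < edges[-1]:
            counts[bisect_right(edges, d) - 1] += 1
    return [
        n / volume_cm3 / (math.log10(hi) - math.log10(lo))
        for n, lo, hi in zip(counts, edges[:-1], edges[1:])
    ]
```

The same function applies to dNdlogDmev if mass-equivalent diameters are passed in. A mass distribution would sum particle masses per bin instead of counting particles.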
Known Limitations
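- Sampling frequency: The pipeline currently assumes that every particle, both BC-containing and BC-free, is recorded in the PbP files. No correction is applied for a reduced particle-recording frequency.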
- Distribution calculations: When the time resolution is not 1s, histogram calculations may be incorrect. As a workaround, process at 1s resolution first, then resample the results to the target resolution.
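One possible way to resample distributions computed at 1s resolution is to average consecutive rows. This is only a sketch: it assumes one histogram row per second with no gaps and a roughly constant sample flow, so that a plain mean of per-second concentrations is acceptable. The function name is hypothetical.

```python
def resample_histograms(hist_1s, window):
    """Average consecutive 1 s histograms in groups of `window` rows.

    hist_1s: list of equal-length lists, one per second.
    A trailing incomplete group is averaged over the rows it has.
    """
    out = []
    for start in range(0, len(hist_1s), window):
        chunk = hist_1s[start:start + window]
        # zip(*chunk) iterates over size bins (columns) of the chunk.
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out
```

With `window=60`, each output row is a 1 min average of the per-second distributions.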