# Data Processing Workflow
## Recommended Directory Structure
```
Campaign_name/
├── SP2XR_files/ # Raw instrument files (.csv/.zip)
│ └── 20200415/ # Organized by date
├── SP2XR_pbp_parquet/ # Converted particle data
├── SP2XR_hk_parquet/ # Converted housekeeping data
└── SP2XR_pbp_processed_1min/ # Calibrated and processed data (user-defined resolution)
```
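
The layout above can be created up front with a few lines of standard-library Python. This is a minimal sketch, not part of the SP2XR package: the helper name `make_campaign_tree` is hypothetical, and the folder names are simply copied from the tree above.

```python
import tempfile
from pathlib import Path

# Sub-directory names copied from the recommended layout above.
SUBDIRS = (
    "SP2XR_files",
    "SP2XR_pbp_parquet",
    "SP2XR_hk_parquet",
    "SP2XR_pbp_processed_1min",
)


def make_campaign_tree(root, campaign="Campaign_name", dates=("20200415",)):
    """Create the recommended folders (hypothetical helper) and return the campaign path."""
    base = Path(root) / campaign
    for name in SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    # Raw files are organized in one sub-folder per date.
    for date in dates:
        (base / "SP2XR_files" / date).mkdir(exist_ok=True)
    return base


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = make_campaign_tree(tmp)
    created = sorted(p.name for p in base.iterdir())
    has_date_dir = (base / "SP2XR_files" / "20200415").is_dir()
```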
## Processing Steps
### 1. CSV to Parquet Conversion (`sp2xr_csv2parquet.py`)
- Convert raw CSV/ZIP files to Parquet format
- Separate particle-by-particle (PbP) and housekeeping (HK) data streams
- Apply data schema and column mapping transformations
- Organize by date and hour for efficient querying
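
To illustrate what organizing by date and hour can look like, the sketch below maps record timestamps to partition directories. The `date=.../hour=...` naming is an assumption (a common Hive-style convention) and may not match what `sp2xr_csv2parquet.py` actually writes; `partition_dir` and `group_by_partition` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_dir(root, timestamp):
    """Return an assumed date/hour partition directory for one timestamp."""
    return (
        PurePosixPath(root)
        / f"date={timestamp:%Y-%m-%d}"
        / f"hour={timestamp.hour}"
    )


def group_by_partition(root, timestamps):
    """Group timestamps by the partition directory they would be written to."""
    groups = {}
    for ts in timestamps:
        groups.setdefault(str(partition_dir(root, ts)), []).append(ts)
    return groups


stamps = [
    datetime(2020, 4, 15, 13, 5),
    datetime(2020, 4, 15, 13, 40),
    datetime(2020, 4, 15, 14, 2),
]
groups = group_by_partition("SP2XR_pbp_parquet", stamps)
```

Partitioning this way lets a time-filtered query read only the directories that overlap the requested period.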
### 2. Data Loading and Preparation (`sp2xr_pipeline.py`)
- Load PbP and HK Parquet files with time-based filtering
- Repartition data for optimal parallel processing
- Resample housekeeping data to user-defined time resolution (e.g., 1s, 60s)
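
The resampling step can be pictured as flooring each housekeeping timestamp to the start of its bin and averaging within the bin. The sketch below uses plain Python rather than the pipeline's own data frames; averaging is an assumed aggregation, and `floor_time` and `resample_mean` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_time(ts, dt_seconds):
    """Floor a naive timestamp to the start of its dt-second bin."""
    step = timedelta(seconds=dt_seconds)
    return EPOCH + ((ts - EPOCH) // step) * step


def resample_mean(records, dt_seconds):
    """Average (timestamp, value) records within each dt-second bin."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in records:
        acc = sums[floor_time(ts, dt_seconds)]
        acc[0] += value
        acc[1] += 1
    return {t: total / n for t, (total, n) in sorted(sums.items())}


# Hypothetical housekeeping flow readings.
hk = [
    (datetime(2020, 4, 15, 0, 0, 10), 100.0),
    (datetime(2020, 4, 15, 0, 0, 50), 120.0),
    (datetime(2020, 4, 15, 0, 1, 5), 90.0),
]
hk_60s = resample_mean(hk, 60)
```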
### 3. Calibration and Quality Control
- Apply scattering and incandescence calibrations to raw particle signals
- Convert signals to physical units (diameter in nm, mass in fg)
- Flag particles based on instrument quality control parameters
- Calculate mixing state classifications using the time-delay method
- Merge calibrated particle data with resampled flow measurements
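
As an illustration of the unit conversion, the sketch below turns an incandescence peak height into a BC mass with a polynomial calibration and then into a mass-equivalent diameter. The polynomial form, the coefficients, and the BC density of 1.8 g/cm³ are assumptions chosen for the example, not values taken from the pipeline; both function names are hypothetical.

```python
import math

# Assumed BC material density; not stated in this document.
BC_DENSITY_G_CM3 = 1.8


def incand_peak_to_mass_fg(peak_height, coeffs):
    """Evaluate an assumed polynomial calibration: c0 + c1*h + c2*h**2 + ..."""
    return sum(c * peak_height**i for i, c in enumerate(coeffs))


def mass_to_diameter_nm(mass_fg, density_g_cm3=BC_DENSITY_G_CM3):
    """Diameter (nm) of a sphere with the given mass (fg) and density (g/cm3)."""
    # 1 fg at 1 g/cm3 occupies 1e-15 cm3 = 1e6 nm3.
    volume_nm3 = mass_fg * 1e6 / density_g_cm3
    return (6.0 * volume_nm3 / math.pi) ** (1.0 / 3.0)


# Hypothetical linear calibration: 0.002 fg per unit of peak height.
mass = incand_peak_to_mass_fg(500.0, (0.0, 0.002))
diameter = mass_to_diameter_nm(mass)
```

With these assumed values, a 1 fg particle corresponds to a mass-equivalent diameter of roughly 102 nm.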
### 4. Time Aggregation and Summary Statistics
- Aggregate particle-by-particle data into time bins of width `dt`
- Calculate summary statistics (counts, means) for each time bin
- Join aggregated PbP data with resampled HK data
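
The aggregation and join can be sketched as a group-by on the floored timestamp followed by a key-based merge. This is a plain-Python illustration under stated assumptions: the field names (`count`, `mean_mass_fg`, `flow_ccm`) are hypothetical, and an inner join is assumed, which may differ from the join type the pipeline uses.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def time_bin(ts, dt_seconds):
    """Floor a naive timestamp to the start of its dt-second bin."""
    step = timedelta(seconds=dt_seconds)
    return EPOCH + ((ts - EPOCH) // step) * step


def aggregate_particles(particles, dt_seconds):
    """Count particles and average their mass in each time bin."""
    grouped = defaultdict(list)
    for ts, mass_fg in particles:
        grouped[time_bin(ts, dt_seconds)].append(mass_fg)
    return {
        t: {"count": len(m), "mean_mass_fg": sum(m) / len(m)}
        for t, m in sorted(grouped.items())
    }


def join_with_hk(pbp_summary, hk_resampled):
    """Inner-join the two tables on the time-bin key (assumed join type)."""
    return {
        t: {**row, **hk_resampled[t]}
        for t, row in pbp_summary.items()
        if t in hk_resampled
    }


t0 = datetime(2020, 4, 15, 0, 0)
particles = [
    (t0 + timedelta(seconds=1), 2.0),
    (t0 + timedelta(seconds=30), 4.0),
    (t0 + timedelta(seconds=70), 6.0),
]
summary = aggregate_particles(particles, 60)
joined = join_with_hk(summary, {t0: {"flow_ccm": 120.0}})
```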
### 5. Bulk Concentrations (optional, if `conc: true`)
- Compute number and mass concentrations for different particle types
- Calculate size-resolved concentrations for different coating states
- Account for flow rate corrections and sampling efficiency
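
At its core, a bulk concentration is a count or summed mass divided by the air volume sampled during the time bin. The sketch below assumes the sample flow is given in cm³/min and leaves out any sampling-efficiency correction; the function names and units are illustrative, not the pipeline's.

```python
def sampled_volume_cm3(flow_ccm, dt_seconds):
    """Air volume (cm3) sampled in one time bin at a flow given in cm3/min."""
    return flow_ccm / 60.0 * dt_seconds


def number_concentration(count, flow_ccm, dt_seconds):
    """Particles per cm3 in one time bin."""
    return count / sampled_volume_cm3(flow_ccm, dt_seconds)


def mass_concentration_ng_m3(total_mass_fg, flow_ccm, dt_seconds):
    """Mass concentration in ng/m3 from the summed particle mass in fg."""
    # fg/cm3 equals ng/m3: 1 fg = 1e-6 ng and 1 m3 = 1e6 cm3, so the factors cancel.
    return total_mass_fg / sampled_volume_cm3(flow_ccm, dt_seconds)


n_conc = number_concentration(240, flow_ccm=120.0, dt_seconds=60)
m_conc = mass_concentration_ng_m3(600.0, flow_ccm=120.0, dt_seconds=60)
```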
### 6. Size and Mass Distributions (optional, if `*_hist: true`)
- Compute size distributions (dNdlogDsc) for scattering-only particles
- Compute mass distributions (dNdlogDmev, dMdlogDmev) for BC-containing particles
- Calculate time-lag distributions for mixing state analysis
- Use user-configurable bin edges and ranges
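
A size distribution such as dNdlogDsc normalizes the count in each size bin by the sampled volume and by the logarithmic bin width. The sketch below shows that normalization with half-open bins `[lo, hi)`; the helper names, the base-10 logarithm, and the example edges are assumptions for illustration only.

```python
import math
from bisect import bisect_right


def log_spaced_edges(d_min_nm, d_max_nm, n_bins):
    """Return n_bins + 1 bin edges evenly spaced in log10(diameter)."""
    lo, hi = math.log10(d_min_nm), math.log10(d_max_nm)
    step = (hi - lo) / n_bins
    return [10 ** (lo + i * step) for i in range(n_bins + 1)]


def dn_dlogd(diameters_nm, edges_nm, sampled_volume_cm3):
    """Number size distribution dN/dlog10(D) in cm^-3 per bin."""
    counts = [0] * (len(edges_nm) - 1)
    for d in diameters_nm:
        i = bisect_right(edges_nm, d) - 1
        # Diameters outside the configured range are ignored.
        if 0 <= i < len(counts):
            counts[i] += 1
    return [
        n / sampled_volume_cm3 / math.log10(hi / lo)
        for n, lo, hi in zip(counts, edges_nm[:-1], edges_nm[1:])
    ]


# One bin spanning a full decade, so the log width is 1.
dist = dn_dlogd([150.0, 200.0, 999.0, 50.0, 2000.0], [100.0, 1000.0], 2.0)
```

The same normalization applies to mass distributions if the per-bin count is replaced by the per-bin summed mass.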
## Known Limitations
1. **Sampling frequency**: The pipeline currently assumes that all BC and BC-free particles are recorded in the PbP files.
2. **Distribution calculations**: When the time resolution is not 1 s, histogram calculations may be incorrect. Process at 1 s resolution first, then resample to the target resolution.
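
One way to follow the second recommendation is to compute histograms at 1 s and then combine consecutive rows. The sketch below assumes the 1 s histograms are stored as raw counts, which can simply be summed; if they are stored as concentrations, an average over the rows (ideally weighted by sampled volume) would be needed instead. `coarsen_histograms` is a hypothetical helper.

```python
def coarsen_histograms(counts_1s, factor):
    """Sum consecutive 1 s count histograms in groups of `factor` rows."""
    out = []
    for start in range(0, len(counts_1s), factor):
        block = counts_1s[start:start + factor]
        # Add the counts bin by bin across the rows in this block.
        out.append([sum(col) for col in zip(*block)])
    return out


# Four 1 s histograms with two size bins each, combined into 2 s histograms.
coarse = coarsen_histograms([[1, 0], [2, 1], [0, 3], [1, 1]], 2)
```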