Update readme and documentation
# Conversion of SP2-XR *PbP* and *HK* .csv/.zip files to .parquet

## Overview

Different SP2-XR instrument versions may use different column names in their CSV/ZIP data files. To ensure consistent data processing, the SP2-XR package now supports automatic column name standardization during CSV to Parquet conversion.

## How It Works

1. **Input**: Instrument-specific CSV/ZIP files with their original column names
2. **Mapping**: A configuration file maps your column names to canonical (standard) column names
3. **Output**: Parquet files with standardized column names for consistent downstream processing

## Usage

### Step 1: Generate Enhanced Config Template

Use `generate_config.py` to create a configuration template:

```python
from meta_files.generate_config import generate_mapping_template

# Generate config with column mappings
generate_mapping_template("your_pbp_file.csv", "your_hk_file.csv", "config_with_mapping.yaml")
```

### Step 2: Customize Column Mappings

Open the generated `config_with_mapping.yaml` and update the column mappings:

```yaml
pbp_column_mapping:
  # Canonical name -> Your file's column name
  Time (sec): "Time_Seconds"        # Replace with your actual column name
  Particle Flags: "Flags"           # Replace with your actual column name
  Incand Mass (fg): "Mass_fg"       # Replace with your actual column name
  # ... etc

hk_column_mapping:
  Time Stamp: "Timestamp"           # Replace with your actual column name
  Time (sec): "Time_Seconds"        # Replace with your actual column name
  # ... etc
```

### Step 3: Convert CSV to Parquet with Mapping

Use your customized config with the CSV to Parquet conversion:

```bash
python scripts/sp2xr_csv2parquet.py --source /path/to/csv --target /path/to/parquet --config config_with_mapping.yaml
```

## Configuration File Structure

The enhanced config file contains several sections:

- **`pbp_schema`** / **`hk_schema`**: Data types for your input files
- **`pbp_canonical_schema`** / **`hk_canonical_schema`**: Standard column schemas used by SP2-XR processing
- **`pbp_column_mapping`** / **`hk_column_mapping`**: Maps canonical names to your file's column names

## Column Mapping Rules

1. **Exact matches**: If your column names exactly match canonical names, they're automatically mapped
2. **Custom mapping**: Replace placeholder values with your actual column names
3. **Missing columns**: Set to `null` or remove the line if your data doesn't have that column
4. **Extra columns**: Unmapped columns in your files are preserved as-is
README.md
# SP2XR

This repository contains Python functions and template scripts to analyze SP2-XR data with Python.

## Suggested structure for the data

- Campaign_name
  - **SP2XR_files**
    - 20200415
      (from here on it is the usual SP2XR structure, no need to unzip the files)
  - **SP2XR_pbp_parquet**
    Directory automatically generated by `read_csv_files_with_dask()`; it contains the PbP files converted to parquet and organized by date and hour. Single file names correspond to the original file names.
  - **SP2XR_hk_parquet**
    Directory automatically generated by `read_csv_files_with_dask()`; it contains the HK files converted to parquet and organized by date and hour. Single file names correspond to the original file names.
  - **SP2XR_pbp_processed**
    Directory automatically generated by `process_pbp_parquet`. It contains the processed PbP files (calibration applied, various distributions calculated, mixing state calculated, ...). By default the data are grouped to 1 s time resolution.
  - **SP2XR_pbp_processed_1min**
    Directory automatically generated by `resample_to_dt`. It contains files at the same processing level as SP2XR_pbp_processed but at the specified time resolution.

# Suggested structure for the code

❔ Do you want to use git to track your analysis code?

- **Yes!**
  1. Create a repository for the analysis of your specific dataset (from here on this is referred to as the main repository)
  2. Add the SP2XR repository (this repository) to your main repository as a submodule
  3. Copy the template file `processing_code.py` from the submodule to your main repository.
  4. Modify the template `processing_code.py` according to your needs (file paths, time resolution, calibration values, ...)
- **No, thanks.**
  1. Download this repository and place it in the same directory as your analysis scripts.
  2. Modify the template `processing_code.py` according to your needs (file paths, time resolution, calibration values, ...)

# How to run `processing_code.py`

The code contains several blocks for the different processing steps. This division helps to troubleshoot possible problems in file reading/writing.
The different blocks correspond to these actions:

1. Define paths and other variables
2. From .csv/.zip to parquet for
   a. PbP files
   b. HK files
3. Analysis of the single particle data (PbP data, not raw traces) to a user-defined time resolution. Operations performed in this block:
   - Read the specified config file (!! for the moment it is assumed that all the files processed in one go have the same config parameters)
   - Apply scattering and incandescence calibration parameters
   - Flag data according to config file parameters (e.g., `Incand Transit Time`, `Incand FWHM`, `Incand relPeak`, ...)
   - Flag data according to mixing state (see dedicated section)
   - Resample PbP data to the specified time resolution (usually, and recommended, 1 s). Some columns are summed (e.g., BC mass for the BC mass concentration) and some counted (e.g., BC mass for the BC number concentration, and optical diameter of purely scattering particles for their number concentration).
   - Resample the flow columns in the HK data to the same time resolution
   - Create a joint pbp_hk file with the specified time resolution
   - Calculate distributions for the different flags. The time resolution is the same as for the computations above. Min, max, and number of bins for time delay, [BC mass, BC number], and scattering number are defined by the user.
   - Merge pbp_hk and distributions into one variable and save it as parquet files partitioned by date and hour
4. Resampling of the pbp_processed data to another time resolution
5. From .sp2b to parquet. See notes below for the analysis of the raw traces. This block can usually be skipped.
6. Process the sp2b.parquet files

# Known bugs and missing features

1. Missing feature: Currently, the code does not take the sampling frequency into account; it assumes all BC and BC-free particles are recorded in the PbP file.
2. Bug: When calculating the distributions (histograms) with a time resolution different from 1 s, the calculation is wrong! Currently, to obtain correctly calculated distributions you have to process the data at 1 s and then average to 1 min.

# SP2XR - Single Particle Soot Photometer Extended Range Toolkit

A comprehensive Python toolkit for analyzing SP2-XR ([Single Particle Soot Photometer Extended Range, Droplet Measurement Technologies](https://www.dropletmeasurement.com/product/single-particle-soot-photometer-extended-range/)) data, providing calibration and data processing for black carbon (BC) aerosol measurements.

Some functions in the calibration part of the toolkit have been adapted from the Igor SP2 toolkit.
Some helper functions have been adapted from Rob Modini's first implementation of the analysis toolkit for the SP2-XR.

## Overview

The SP2-XR is a scientific instrument that measures individual BC particles in real time, providing data on their mass and mixing state.

A high-power Nd:YAG laser at 1064 nm illuminates each aerosol particle drawn into the optical region of the instrument.
Single-particle black carbon mass is measured via laser-induced incandescence (LII).
Particles composed of refractory absorbing carbon (i.e., black carbon) absorb the laser energy and heat up, eventually vaporizing and incandescing (emitting thermal radiation). The intensity of that emission is proportional to the mass of the incandescing BC (via a set of calibration constants).
All particles, including BC-free particles (i.e., particles without a detectable incandescence signal), scatter the laser light, providing a measurement of their optical diameter (via a set of calibration constants).
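For illustration, here is a minimal sketch of how such calibration constants might be applied to per-particle peak heights. The column names, coefficient values, and helper name are hypothetical placeholders, not the package's actual API (see `sp2xr.calibration` in the API reference for the real functions).

```python
import numpy as np
import pandas as pd

# Illustrative calibration constants (assumptions, not real instrument values):
#   incandescence: BC mass [fg]          = a0 + a1 * peak_height   (polynomial)
#   scattering:    optical diameter [nm] = a * peak_height ** b    (power law)
INC_COEFFS = (0.05, 2.047e-07)
SCATT_COEFFS = (17.22, 0.169)

def apply_example_calibrations(df: pd.DataFrame) -> pd.DataFrame:
    a0, a1 = INC_COEFFS
    a, b = SCATT_COEFFS
    out = df.copy()
    # Incandescence peak height -> BC mass (fg)
    out["Incand Mass (fg)"] = a0 + a1 * out["Incand Peak Height"]
    # Scattering peak height -> optical diameter (nm)
    out["Scatter Size (nm)"] = a * np.power(out["Scatt Peak Height"], b)
    return out

example = pd.DataFrame({"Incand Peak Height": [1.0e5, 5.0e5],
                        "Scatt Peak Height": [2.0e4, 8.0e4]})
print(apply_example_calibrations(example))
```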
This repository contains Python functions and scripts to process and analyze SP2-XR data files, including:

- **Converting raw data files** (\*Pbp\* and \*hk\* files in .csv/.zip formats) to Parquet files indexed by time
- Applying scattering and incandescence **calibrations**
- **Processing single particle data** with quality control flags (e.g., saturation, FWHM outside of accepted range, ...)
- Calculating number and mass **concentrations**
- Calculating size **distributions** as a function of the mass-equivalent diameter for BC-containing particles (dNdlogDmev, dMdlogDmev), or of the optical diameter for particles without detectable BC content (dNdlogDsc)
- **Mixing state analysis** based on the time delay method (Moteki and Kondo, [2007](https://doi.org/10.1080/02786820701199728))

The toolkit is designed for **parallel processing** to handle large datasets efficiently.

## Quick Start

### Installation

```bash
git clone <repository-url>
cd SP2XR_code
pip install -e .
```

### Basic Usage

```bash
# 1. Generate configuration from your data
python scripts/sp2xr_generate_config.py /path/to/data --mapping

# 2. Convert CSV/ZIP to Parquet
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv \
    --target /path/to/parquet \
    --config config_with_mapping.yaml \
    --local

# 3. Run processing pipeline
python scripts/sp2xr_pipeline.py --config your_config.yaml
```

## Repository Structure

```
SP2XR_code/
├── src/sp2xr/                   # Main package source code
├── scripts/                     # Command-line processing scripts
├── docs/                        # Detailed documentation
├── meta_files/                  # Configuration templates
├── tests/                       # Test suite
└── calibration_workflow.ipynb   # Interactive calibration
```

## Core Scripts

- **`sp2xr_csv2parquet.py`** - Data format conversion with column mapping
- **`sp2xr_pipeline.py`** - Complete processing pipeline with distributed computing
- **`sp2xr_generate_config.py`** - Auto-generate configuration files
- **`sp2xr_apply_calibration.py`** - Apply instrument calibrations

## Documentation

- **[Installation Guide](docs/installation.md)** - Setup and dependencies
- **[Configuration Guide](docs/configuration.md)** - Schema and calibration files
- **[Processing Workflow](docs/workflow.md)** - Step-by-step data processing
- **[Scripts Reference](docs/scripts.md)** - Detailed script usage
- **[API Reference](docs/api-reference.md)** - Function documentation
- **[Usage Examples](docs/examples.md)** - Code examples and workflows
- **[CSV to Parquet Conversion](docs/How_to_convert_SP2XR_raw_data_to_parquet.md)** - Detailed conversion guide

## Requirements

- Python >= 3.9
- Dask, pandas, numpy, pyyaml
- Optional: Jupyter for interactive calibration

## License

See [LICENSE.md](LICENSE.md) for license information.
docs/How_to_convert_SP2XR_raw_data_to_parquet.md
# Conversion of SP2-XR *PbP* and *HK* .csv/.zip files to .parquet

## Overview

Different SP2-XR instrument versions may use different column names in their CSV/ZIP data files. To ensure consistent data processing, the SP2-XR package now supports automatic column name standardization during CSV to Parquet conversion.

## How It Works

1. **Input**: Instrument-specific CSV/ZIP files with their original column names
2. **Detection**: Automatically find and analyze SP2-XR files in your data directory
3. **Mapping**: A configuration file maps your column names to canonical (standard) column names
4. **Output**: Parquet files with standardized column names for consistent downstream processing

## Quick Start

Follow these 3 simple steps to convert your SP2-XR data:

```bash
# Step 1: Generate config with column mapping from your data directory
python scripts/sp2xr_generate_config.py /path/to/your/sp2xr/data --mapping

# Step 2: Review and edit config_with_mapping.yaml (see below for details)

# Step 3: Convert CSV/ZIP to Parquet (local processing)
python scripts/sp2xr_csv2parquet.py --source /path/to/csv --target /path/to/parquet --config config_with_mapping.yaml --local
```

## Detailed Usage

### Step 1: Generate Configuration

The script automatically detects SP2-XR files in your directory (including subdirectories) and creates a mapping template:

```bash
# Generate config with column mapping (recommended approach)
python scripts/sp2xr_generate_config.py /path/to/your/data --mapping

# Optional: Use specific files instead of auto-detection
python scripts/sp2xr_generate_config.py /path/to/data --mapping --pbp-file specific_pbp.csv --hk-file specific_hk.csv

# Optional: Custom output filename
python scripts/sp2xr_generate_config.py /path/to/data --mapping --output my_custom_config.yaml
```

**File Detection Patterns:**

- **PbP files**: Files containing "PbP", "pbp", or "Pbp" in the name
- **HK files**: Files containing "hk", "HK", or "Hk" in the name
- **Formats**: Supports .csv, .zip, and .parquet files
- **Search**: Recursively searches all subdirectories

### Step 2: Review and Customize Column Mappings

Open the generated `config_with_mapping.yaml` and verify the column mappings:

```yaml
pbp_column_mapping:
  # Canonical name -> Your file's column name
  Time (sec): "Time (sec)"                         # ✅ Already matches - no change needed
  Particle Flags: "Particle Flags"                 # ✅ Already matches - no change needed
  Incand Mass (fg): "Mass_fg"                      # ❌ Update this to match your file's column name
  Scatter Size (nm): "Size_nanometers"             # ❌ Update this to match your file's column name
  # ... etc

hk_column_mapping:
  Time Stamp: "Time Stamp"                         # ✅ Already matches - no change needed
  Time (sec): "Time_Seconds"                       # ❌ Update this to match your file's column name
  Sample Flow Controller Read (sccm): "Flow_Rate"  # ❌ Update this to match your file's column name
  # ... etc
```

**What to do:**

- ✅ **Perfect matches**: Leave as-is (e.g., `Time (sec): "Time (sec)"`)
- ❌ **Different names**: Update the right side with your actual column name
- 🗑️ **Missing columns**: Set to `null` or delete the line if your data doesn't have that column

### Step 3: Convert CSV to Parquet

Use your configuration with the CSV to Parquet conversion. The script supports both local and cluster processing:

#### Local Processing (Recommended for Small Datasets)

```bash
# Process PbP files locally (auto-detects system resources)
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config_with_mapping.yaml \
    --filter "PbP" \
    --local

# Process HK files locally
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config_with_mapping.yaml \
    --filter "hk" \
    --local
```

#### Cluster Processing (For Large Datasets)

```bash
# Process on SLURM cluster with custom resources
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config_with_mapping.yaml \
    --filter "PbP" \
    --cores 32 --memory 64GB --partition general

# Custom chunk size for memory optimization
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config_with_mapping.yaml \
    --filter "PbP" \
    --chunk 25 \
    --cores 16 --memory 32GB
```

## Configuration File Structure

The enhanced config file contains several sections:

- **`pbp_schema`** / **`hk_schema`**: Data types detected from your input files
- **`pbp_canonical_schema`** / **`hk_canonical_schema`**: Standard column schemas used by SP2-XR processing
- **`pbp_column_mapping`** / **`hk_column_mapping`**: Maps canonical names to your file's column names

## Column Mapping Rules

1. **Exact matches**: If your column names exactly match canonical names, they're automatically mapped
2. **Custom mapping**: Replace placeholder values with your actual column names
3. **Missing columns**: Set to `null` or remove the line if your data doesn't have that column
4. **Extra columns**: Unmapped columns in your files are preserved as-is
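As an illustration of these rules, the following sketch shows one way such a mapping could be applied with pandas. It is a simplified assumption of the behaviour, not the package's internal implementation.

```python
import pandas as pd

# Hypothetical mapping loaded from config_with_mapping.yaml:
# canonical name -> name found in your file (None/null = column not present)
pbp_column_mapping = {
    "Time (sec)": "Time (sec)",        # exact match, kept as-is
    "Incand Mass (fg)": "Mass_fg",     # renamed to the canonical name
    "Particle Flags": None,            # missing in this file, skipped
}

def standardize_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    # Invert to {your_name: canonical_name}, dropping missing (null) entries
    rename_map = {src: canon for canon, src in mapping.items() if src is not None}
    # Unmapped extra columns are left untouched by DataFrame.rename
    return df.rename(columns=rename_map)

df = pd.DataFrame({"Time (sec)": [0.1], "Mass_fg": [2.5], "Extra": ["kept"]})
print(standardize_columns(df, pbp_column_mapping).columns.tolist())
# ['Time (sec)', 'Incand Mass (fg)', 'Extra']
```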
## Example Workflow

```bash
# Navigate to your project directory
cd /path/to/your/project

# Generate config from your raw data
python scripts/sp2xr_generate_config.py ./raw_data --mapping

# Edit config_with_mapping.yaml to match your column names (if needed)
# nano config_with_mapping.yaml

# Convert PbP files (local processing)
python scripts/sp2xr_csv2parquet.py \
    --source ./raw_data \
    --target ./SP2XR_pbp_parquet \
    --config config_with_mapping.yaml \
    --filter "PbP" \
    --local

# Convert HK files (local processing)
python scripts/sp2xr_csv2parquet.py \
    --source ./raw_data \
    --target ./SP2XR_hk_parquet \
    --config config_with_mapping.yaml \
    --filter "hk" \
    --local
```

## Performance Tips

- **Local processing**: Automatically detects and uses all available CPU cores and 80% of system memory
- **Cluster processing**: Use multiple cores (e.g., `--cores 32`) for faster processing of large datasets
- **Chunk size**: Reduce the `--chunk` value (e.g., `--chunk 25`) if experiencing memory issues
docs/api-reference.md
# API Reference

## Core Modules

### `sp2xr.io`

Input/output operations for data conversion and processing.

**Functions:**

- `csv_to_parquet(source_dir, target_dir, config_file, filter_pattern)` - Convert CSV/ZIP files to Parquet format
- `process_sp2xr_file(file_path, config_file, output_dir)` - Process individual SP2XR data files
- `read_sp2xr_csv(file_path, schema, **kwargs)` - Read SP2XR CSV/ZIP files with schema validation
- `load_matching_hk_file(pbp_file, hk_dir, hk_schema)` - Load housekeeping file matching a PbP file
- `enrich_sp2xr_dataframe(df, filename)` - Add time-derived and metadata columns (date, hour, filename)
- `save_sp2xr_parquet(df, output_dir, partition_cols)` - Save DataFrame to partitioned Parquet format

---

### `sp2xr.calibration`

Complete calibration workflow with flags and processing.

**Calibration Functions:**

- `calibrate_single_particle(ddf, instr_config, run_config)` - Complete calibration workflow with quality flags
- `calibrate_particle_data(df, config)` - Apply scattering and incandescence calibrations to particle data
- `apply_calibration(df, config)` - Legacy wrapper for backward compatibility
- `apply_inc_calibration(df, calib_params)` - Apply incandescence calibration (BC mass calculation)
- `apply_scatt_calibration(df, calib_params)` - Apply scattering calibration (optical diameter calculation)

**Calibration Curve Functions:**

- `polynomial(x, coeffs)` - Generic polynomial curve: `a0 + a1*x + a2*x^2 + ...`
- `powerlaw(x, a, b)` - Generic power-law calibration: `a * x^b`

**Mass/Diameter Conversion:**

- `BC_mass_to_diam(mass_fg, material='fullerene')` - Convert BC mass (fg) to mass-equivalent diameter (nm)
- `BC_diam_to_mass(diam_nm, material='fullerene')` - Convert BC diameter (nm) to mass (fg)
- `mass2meqDiam(mass, rho_eff)` - Calculate mass-equivalent diameter from mass and effective density

**Distribution Conversion:**

- `dNdlogDp_to_dMdlogDp(dNdlogDp, dp_nm, rho)` - Convert number distribution to mass distribution
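For orientation, the arithmetic behind these converters is roughly the following hedged sketch; argument names and unit conventions (fg, g cm⁻³, nm) are assumptions inferred from the function names, not a copy of the package code.

```python
import numpy as np

def mass_to_meq_diam_nm(mass_fg, rho_eff_g_cm3=1.8):
    """Mass-equivalent diameter D = (6 m / (pi * rho))**(1/3); 1 fg at 1.8 g/cm^3 -> ~102 nm."""
    mass_g = np.asarray(mass_fg, dtype=float) * 1e-15                    # fg -> g
    diam_cm = (6.0 * mass_g / (np.pi * rho_eff_g_cm3)) ** (1.0 / 3.0)
    return diam_cm * 1e7                                                 # cm -> nm

def number_to_mass_distribution(dNdlogDp_cm3, dp_nm, rho_g_cm3=1.8):
    """dM/dlogDp = dN/dlogDp * (pi/6) * rho * Dp^3 (returned here in fg cm^-3, an assumed unit choice)."""
    dp_cm = np.asarray(dp_nm, dtype=float) * 1e-7                        # nm -> cm
    mass_per_particle_fg = (np.pi / 6.0) * rho_g_cm3 * dp_cm**3 * 1e15   # g -> fg
    return np.asarray(dNdlogDp_cm3, dtype=float) * mass_per_particle_fg

print(round(float(mass_to_meq_diam_nm(1.0)), 1))  # ~102 nm
```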
---

### `sp2xr.calibration_constants`

Spline coefficient functions for BC calibration materials.

**Core Functions:**

- `SP2_calibCurveSpline(x, spline_coeffs)` - Spline interpolation for density calculations
- `SP2_calibCurveSplineCheck(spline_coeffs)` - Validate spline coefficient dimensions

**Aquadag Calibration Functions:**

- `Aquadag_RhoVsLogMspline_Var25()` - Aquadag density vs. log(mass) spline coefficients (Var25)
- `Aquadag_RhoVsLogMspline_Var8()` - Aquadag density vs. log(mass) spline coefficients (Var8)
- `Aquadag_RhoVsLogDspline_Var25()` - Aquadag density vs. log(diameter) spline (Var25)
- `Aquadag_RhoVsLogDspline_Var8()` - Aquadag density vs. log(diameter) spline (Var8)
- `Aquadag_RhoVsLogDspline_Var8_old()` - Legacy Aquadag diameter-to-density spline

**Fullerene Calibration Functions:**

- `Fullerene_RhoVsLogMspline_Var2()` - Fullerene density vs. log(mass) spline (Var2)
- `Fullerene_RhoVsLogMspline_Var5()` - Fullerene density vs. log(mass) spline (Var5)
- `Fullerene_RhoVsLogMspline_Var8()` - Fullerene density vs. log(mass) spline (Var8)
- `Fullerene_RhoVsLogMspline_Var8_old()` - Legacy Fullerene mass-to-density spline
- `Fullerene_RhoVsLogDspline_Var2()` - Fullerene diameter-to-density spline (Var2)
- `Fullerene_RhoVsLogDspline_Var5()` - Fullerene diameter-to-density spline (Var5)
- `Fullerene_RhoVsLogDspline_Var8()` - Fullerene diameter-to-density spline (Var8)
- `Fullerene_RhoVsLogDspline_Var8_old()` - Legacy Fullerene diameter-to-density spline

**Glassy Carbon Calibration Functions:**

- `GlassyCarbonAlpha_Mass2Diam()` - Glassy carbon mass-to-diameter conversion coefficients
- `GlassyCarbonAlpha_Diam2Mass()` - Glassy carbon diameter-to-mass conversion coefficients

---

### `sp2xr.flag_single_particle_data`

Quality control flagging based on instrument parameters.

**Functions:**

- `define_flags(df, flag_config)` - Define quality control flags (transit time, FWHM, peak intensity, etc.)
- `add_thin_thick_flags(df, lag_threshold)` - Add thin/thick coating classification flags based on lag time

**Constants:**

- `FLAG_COLS` - List of flag column names used in quality filtering
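The sketch below illustrates the time-delay (lag) idea behind the thin/thick classification: thickly coated BC incandesces later than it scatters, because the coating must evaporate first. Column names and the threshold value are illustrative assumptions, not the actual implementation of `add_thin_thick_flags`.

```python
import pandas as pd

def add_example_timelag_flags(df: pd.DataFrame, lag_threshold_s: float = 2e-6) -> pd.DataFrame:
    out = df.copy()
    # Lag between the incandescence and scattering peak times (assumed column names)
    out["time_lag"] = out["Incand Peak Time"] - out["Scatt Peak Time"]
    out["thick_coating"] = out["time_lag"] > lag_threshold_s
    out["thin_coating"] = ~out["thick_coating"]
    return out

particles = pd.DataFrame({"Scatt Peak Time": [1.0e-6, 1.0e-6],
                          "Incand Peak Time": [1.5e-6, 6.0e-6]})
print(add_example_timelag_flags(particles)[["time_lag", "thick_coating"]])
```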
---

### `sp2xr.distribution`

Size/mass distribution calculations.

**Main Functions:**

- `process_histograms(ddf, config, inc_bins, inc_ctrs, scatt_bins, scatt_ctrs, timelag_bins, timelag_ctrs, chunk_start, client)` - Calculate size/mass distributions (main workflow function)
- `process_hist_and_dist_partition(partition, dt_s, inc_mass_bin_lims, scatt_bin_lims, timelag_bins_lims, calc_dist, flow_dt)` - Process histogram and distribution for a single partition

**Bin Utilities:**

- `make_bin_arrays(min_val, max_val, n_bins, log_scale=True)` - Create bin arrays for histograms
- `bin_lims_to_ctrs(bin_lims)` - Convert bin limits to bin centers
- `bin_ctrs_to_lims(bin_ctrs)` - Convert bin centers to bin limits
- `get_dlogp(bin_lims)` - Calculate log bin widths (for dN/dlogDp calculations)

**Distribution Calculations:**

- `calculate_histogram(series, bins, dt_s, flow_sccm)` - Calculate histogram from series with time/flow normalization
- `counts2numConc(counts, dlogDp, dt_s, flow_sccm)` - Convert counts to number concentration (dN/dlogDp)
- `dNdlogDp_to_dMdlogDp(dNdlogDp, dp_nm, rho)` - Convert number distribution to mass distribution
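A hedged sketch of the normalization these functions describe (counts per log-bin width per sampled volume); the flow/time convention is an assumption based on the argument names, not the package code.

```python
import numpy as np

def example_counts_to_numconc(counts, bin_lims_nm, dt_s, flow_sccm):
    """dN/dlogDp = counts / (dlogDp * V_sample), with V_sample = flow_sccm / 60 * dt_s in cm^3."""
    bin_lims_nm = np.asarray(bin_lims_nm, dtype=float)
    dlogDp = np.diff(np.log10(bin_lims_nm))      # log10 bin widths
    sample_volume_cm3 = flow_sccm / 60.0 * dt_s  # cm^3 sampled during the averaging interval
    return np.asarray(counts, dtype=float) / (dlogDp * sample_volume_cm3)

# e.g. 10 particles in one log-spaced bin, 1 s averaging, 120 sccm sample flow -> 50 cm^-3
print(example_counts_to_numconc([10], [100.0, 125.9], dt_s=1.0, flow_sccm=120.0))
```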
**Metadata:**

- `make_hist_meta(bin_ctrs, dt_index)` - Create metadata DataFrame for histogram output

---

### `sp2xr.resample_pbp_hk`

Time resampling functions for data aggregation.

**Functions:**

- `build_dt_summary(pdf, dt_s)` - Resample single-particle data to time bins (aggregate particle counts)
- `resample_hk_partition(pdf, dt)` - Partition-wise resampling of housekeeping data to specified time resolution
- `join_pbp_with_flow(ddf_pbp, flow_series, config)` - Join particle data with flow measurements
- `aggregate_dt(ddf_pbp_dt, ddf_hk_dt, config)` - Aggregate PbP and HK data at specified time resolution

---

### `sp2xr.concentrations`

Concentration calculations for different particle types.

**Functions:**

- `add_concentrations(df, dt)` - Add BC mass concentration, scattering mass concentration, and number concentration columns
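For orientation, the arithmetic behind such concentration columns is roughly the following; names and unit conventions are assumptions for illustration, not the package's actual implementation.

```python
import numpy as np

def example_bc_concentrations(bc_masses_fg, dt_s, flow_sccm):
    """Bulk BC concentrations for one time bin (assumed unit conventions).

    Number concentration [cm^-3]  = N_particles / V_sample
    Mass concentration  [ug m^-3] = sum(mass_fg) / V_sample * 1e-9 (fg->ug) * 1e6 (cm^-3 -> m^-3)
    """
    bc_masses_fg = np.asarray(bc_masses_fg, dtype=float)
    sample_volume_cm3 = flow_sccm / 60.0 * dt_s
    number_conc = bc_masses_fg.size / sample_volume_cm3
    mass_conc = bc_masses_fg.sum() / sample_volume_cm3 * 1e-9 * 1e6
    return number_conc, mass_conc

print(example_bc_concentrations([1.0, 2.5, 4.0], dt_s=1.0, flow_sccm=120.0))
```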
---

### `sp2xr.helpers`

Utility functions for file handling, argument parsing, cluster initialization, and configuration management.

#### Configuration Management

- `load_and_resolve_config(args)` - Load and merge YAML configuration with command-line arguments
- `load_yaml_cfg(path)` - Load YAML configuration file
- `parse_args()` - Parse command-line arguments for pipeline scripts
- `apply_sets(config, set_args)` - Apply command-line overrides (--set) to configuration
- `get(config, key_path, default=None)` - Get nested configuration value using dot notation
- `choose(cli_val, config, key_path, default)` - Choose between CLI argument and config value
- `validate_config_compatibility(config)` - Validate configuration compatibility and consistency
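A minimal sketch of the kind of dot-notation lookup that `get(config, key_path, default)` provides; this is an assumption about the behaviour, not the actual helper.

```python
def get_nested(config: dict, key_path: str, default=None):
    # Walk the nested dict one dotted key at a time, falling back to the default
    node = config
    for key in key_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

cfg = {"cluster": {"use_local": False, "cores": 16}, "workflow": {"dt": 60}}
print(get_nested(cfg, "cluster.cores"))         # 16
print(get_nested(cfg, "histo.inc.n_bins", 50))  # missing -> default 50
```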
#### Cluster/Dask Management

- `initialize_cluster(config)` - Initialize Dask cluster (local or SLURM)
- `make_slurm_cluster(config)` - Create SLURM cluster with specified resources
- `make_local_cluster(config)` - Create local Dask cluster with auto-detected resources

#### File Operations

- `find_files(directory, pattern)` - Recursively find files matching pattern
- `find_matching_hk_file(pbp_file, hk_dir)` - Find housekeeping file matching a PbP file
- `extract_base_filename(file_path)` - Extract base filename from SP2XR file path
- `extract_sp2xr_filename_parts(filename)` - Extract filename components (timestamp, instrument ID, etc.)

#### Time/Partition Management

- `extract_partitioned_datetimes(parquet_path)` - Extract timestamps from partitioned Parquet paths
- `get_time_chunks_from_range(start, end, freq)` - Generate time chunk tuples for processing
- `delete_partition_if_exists(output_path, partition_values)` - Delete specific partition directory
- `floor_index_to_dt(df, dt_s)` - Replace DatetimeIndex with lower-second floored values
- `calculate_delta_sec(time1, time2)` - Calculate time difference in seconds
- `extract_datetime(filename)` - Extract datetime from SP2XR filename

#### INI/YAML Conversion

- `read_xr_ini_file(ini_path)` - Read SP2-XR .ini calibration file
- `find_and_validate_ini_files(directory)` - Find and validate .ini files (ensure consistency)
- `export_xr_ini_to_yaml(ini_path, yaml_path)` - Convert .ini file to YAML format
- `export_xr_ini_to_yaml_with_source(ini_path, yaml_path)` - Convert .ini to YAML with source metadata

#### Utilities

- `chunks(lst, n)` - Yield successive n-sized chunks from list
- `partition_rowcount(ddf)` - Count total rows in Dask DataFrame

---

### `sp2xr.schema`

Data schema definitions and type enforcement for SP2XR data streams.

**Constants:**

- `CANONICAL_DTYPES` - Dictionary mapping column names to canonical data types
- `DEFAULT_FLOAT` - Default float dtype for numeric columns

**Functions:**

- `enforce_schema(df)` - Cast DataFrame columns to canonical data types
- `cast_and_arrow(df)` - Cast to canonical dtypes and convert to PyArrow backend
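A hedged sketch of what schema enforcement amounts to; the dtype table below is hypothetical (the real one is `CANONICAL_DTYPES`), and the real `enforce_schema` may handle more cases.

```python
import pandas as pd

EXAMPLE_CANONICAL_DTYPES = {"Time (sec)": "float64",
                            "Particle Flags": "int64",
                            "Incand Mass (fg)": "float64"}

def example_enforce_schema(df: pd.DataFrame, dtypes: dict = EXAMPLE_CANONICAL_DTYPES) -> pd.DataFrame:
    # Cast only the columns that are present; leave extra columns untouched
    casts = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(casts)

df = pd.DataFrame({"Time (sec)": ["0.5"], "Particle Flags": ["3"], "Extra": ["x"]})
print(example_enforce_schema(df).dtypes)
```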
docs/configuration.md
# Configuration Guide

## Overview

SP2-XR uses configuration files to handle different instrument versions, data schemas, calibrations, and processing parameters.

## Configuration System Structure

### Primary Locations

- **`config/`** - Essential templates and examples
- **Auto-generated configs** - Created by the `sp2xr_generate_config.py` script

### Recommended Workflow

1. **Auto-generate** data schema and instrument settings from your actual data
2. **Copy and customize** the pipeline template for your workflow
3. **Validate** with small datasets before full processing

### Configuration File Types

## 1. Configuration Generation Tools

**Features:**

- Automatically detects PbP and HK files (CSV, ZIP, Parquet)
- Analyzes actual data to determine column names and types
- Creates mapping templates for column standardization
- Creates a .yaml file with the instrument settings

### Auto-Generated Schemas

```bash
# Generate schema with instrument settings (recommended)
python scripts/sp2xr_generate_config.py /path/to/your/data \
    --mapping \
    --schema-output my_data_schema.yaml \
    --instrument-output my_instrument_settings.yaml
```

**Benefits of auto-generation:**

- Analyzes actual data files to detect correct column names and types
- Automatically finds and validates INI calibration files
- Creates instrument settings with source traceability
- Handles different SP2-XR instrument versions automatically

### Generated Schema Structure

The auto-generated schema includes:

```yaml
# Data type definitions (detected from your files)
pbp_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... all columns from your data

# Standard column names used by SP2-XR package
pbp_canonical_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... canonical column definitions

# Column mapping (your files → canonical names)
pbp_column_mapping:
  Time (sec): "Time (sec)"          # Exact match
  Scatter Size (nm): "Size_nm"      # Maps your column name
```

### Auto-Generated Instrument Settings

The generation script automatically converts INI calibration files to structured YAML format with:

- **Metadata**: Source file path, generation timestamp, traceability
- **Instrument parameters**: All settings from the INI file
- **Signal saturation**

```yaml
metadata:
  source_ini_file: /full/path/to/calibration.ini
  generated_on: '2024-01-01T12:00:00'
  generated_by: sp2xr_generate_config.py

instrument_parameters:
  ScattTransitMin: 10.0
  IncTransitMin: 5.0
  # ... all INI parameters
```

## 2. Main Data Processing Configurations

### Comprehensive Template

Copy and customize the complete pipeline template:

```bash
# Copy the template
cp config/complete_example.yaml my_campaign_pipeline.yaml

# Edit for your specific needs
nano my_campaign_pipeline.yaml
```

### Pipeline Configuration Structure

The pipeline config includes all processing settings:

```yaml
# File paths
paths:
  input_pbp: /path/to/SP2XR_pbp_parquet
  input_hk: /path/to/SP2XR_hk_parquet
  output: /path/to/SP2XR_processed_output
  instrument_config: /path/to/instrument_settings.yaml

# Workflow settings
workflow:
  conc: true          # Calculate concentrations
  BC_hist: true       # BC mass distributions
  scatt_hist: true    # Scattering size distributions
  dt: 60              # Time resolution (seconds)

# Computing resources
cluster:
  use_local: false    # true for local, false for SLURM
  cores: 16           # CPU cores
  memory: 128GB       # Memory allocation

# Analysis parameters
histo:
  inc:                # BC mass histograms
    min_mass: 0.3
    max_mass: 400
    n_bins: 50
  scatt:              # Scattering histograms
    min_D: 100
    max_D: 500
    n_bins: 20

# Calibration parameters
calibration:
  incandescence:
    curve_type: "polynomial"
    parameters: [0.05, 2.047e-07]
  scattering:
    curve_type: "powerlaw"
    parameters: [17.22, 0.169, -1.494]
```

**Key sections to customize:**

- **`paths`** - Update all file and directory paths
- **`calibration`** - Use parameters from your instrument settings
- **`cluster`** - Match your computing environment
- **`workflow.dt`** - Set an appropriate time resolution
- **`histo`** - Configure size/mass distribution bins
docs/examples.md
# Complete Workflow Example

```bash
# Navigate to your project directory
cd /path/to/your/project

# Step 1: Generate configuration from your raw data
python scripts/sp2xr_generate_config.py ./raw_data --mapping

# Step 2: Edit config_with_mapping.yaml to match your column names (if needed)
# nano config_with_mapping.yaml

# Step 3: Convert PbP files (local processing)
python scripts/sp2xr_csv2parquet.py \
    --source ./raw_data \
    --target ./SP2XR_pbp_parquet \
    --config config_with_mapping.yaml \
    --filter "PbP" \
    --local

# Step 4: Convert HK files (local processing)
python scripts/sp2xr_csv2parquet.py \
    --source ./raw_data \
    --target ./SP2XR_hk_parquet \
    --config config_with_mapping.yaml \
    --filter "hk" \
    --local

# Step 5: Run the full processing pipeline
python scripts/sp2xr_pipeline.py --config your_pipeline_config.yaml
```
docs/installation.md
# Installation Guide

## Requirements

- Python >= 3.9
- Dependencies listed in `pyproject.toml`

## Install from Source

1. Clone the repository:

```bash
git clone <repository-url>
cd SP2XR_code/sp2xr
```

2. Install the package:

```bash
pip install -e .
```

For Jupyter notebook support:

```bash
pip install -e ".[notebook]"
```

## Dependencies

Core dependencies include:

- `dask[dataframe]` >= 2024.6 - Parallel computing
- `dask[distributed]` - Distributed computing across multiple machines
- `dask-jobqueue` - Integration with job schedulers (SLURM)
- `pandas` >= 2.2
- `numpy` >= 1.26
- `pyyaml` - YAML configuration file parsing
- `psutil` - System and process utilities

## Development Setup

### Pre-commit Hooks

The repository uses pre-commit hooks for code quality:

```bash
pip install pre-commit
pre-commit install
```

Configured tools:

- **Black**: Code formatting
- **Ruff**: Linting and import sorting

### Testing

```bash
pytest tests/
```
docs/scripts.md
# Scripts Reference

## Available Scripts

### `scripts/sp2xr_pipeline.py`

Main processing pipeline that orchestrates the complete data analysis workflow.

**Usage:**

```bash
python scripts/sp2xr_pipeline.py --config path/to/config.yaml [options]
```

**Command-line Options:**

- `--config` - Path to YAML configuration file (required)
- `--set KEY=VALUE` - Override config values using dot notation (e.g., `--set dt=60`)
- Additional cluster configuration options (see `--help`)

**Features:**

- Distributed processing with Dask (local or SLURM cluster)
- Automatic time chunking and partition management
- Calibration application and quality flagging
- Distribution calculations and time resampling
- Concentration calculations
- Output partitioned by date and hour

### `scripts/sp2xr_csv2parquet.py`

Batch conversion of raw CSV/ZIP files to Parquet format with support for both local and cluster processing.

**Usage:**

```bash
# Local processing (automatic resource detection)
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --local

# SLURM cluster processing
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --cores 32 --memory 64GB --partition general

# Process housekeeping files with custom chunk size
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "hk" \
    --chunk 50 \
    --local
```

**Features:**

- Supports both local and SLURM cluster execution
- Automatic resource detection for local processing
- Configurable batch processing with chunking
- Progress tracking and error handling
- Graceful shutdown with signal handling

### `scripts/sp2xr_apply_calibration.py`

Apply calibration parameters to particle data using Dask + SLURM.

**Usage:**

```bash
python scripts/sp2xr_apply_calibration.py \
    --input /path/to/data.parquet \
    --config /path/to/config.yaml \
    --output /path/to/calibrated.parquet \
    [--cores 32] [--memory 64GB] [--walltime 02:00:00] [--partition daily]
```

**Options:**

- `--input` - Input Parquet file or directory (required)
- `--config` - YAML calibration configuration (required)
- `--output` - Output directory for calibrated Parquet dataset (required)
- `--cores` - Cores per SLURM job (default: 32)
- `--memory` - Memory per job (default: 64GB)
- `--walltime` - Wall-time limit (default: 02:00:00)
- `--partition` - SLURM partition (default: daily)

**Features:**

- Parallel processing with Dask + SLURM cluster
- Automatic scaling and resource management
- Partitioned output by date and hour

### `scripts/sp2xr_generate_config.py`

Generate schema configuration files by automatically detecting SP2XR files in a directory.

**Usage:**

```bash
# Generate basic schema config from current directory
python scripts/sp2xr_generate_config.py .

# Generate config from specific directory
python scripts/sp2xr_generate_config.py /path/to/sp2xr/data

# Generate config with column mapping support (for non-standard column names)
python scripts/sp2xr_generate_config.py /path/to/data --mapping

# Specify custom schema and instrument settings output filenames
python scripts/sp2xr_generate_config.py /path/to/data \
    --schema-output my_schema.yaml \
    --instrument-output my_settings.yaml

# Generate mapping config with custom output names
python scripts/sp2xr_generate_config.py /path/to/data --mapping \
    --schema-output campaign_schema.yaml \
    --instrument-output campaign_settings.yaml

# Use specific files instead of auto-detection
python scripts/sp2xr_generate_config.py . --pbp-file data.pbp.csv --hk-file data.hk.csv
```

**Options:**

- `directory` - Directory containing SP2XR files (PbP and HK files)
- `--schema-output`, `-s` - Output filename for data schema config (default: `config_schema.yaml`)
- `--instrument-output`, `-i` - Output filename for instrument settings config (default: `{schema_output}_instrument_settings.yaml`)
- `--mapping`, `-m` - Generate config with column mapping support (creates canonical column mappings)
- `--pbp-file` - Specify a specific PbP file instead of auto-detection
- `--hk-file` - Specify a specific HK file instead of auto-detection

**Features:**

- Automatic file detection (searches recursively for PbP and HK files)
- Schema inference from CSV/ZIP/Parquet files
- Column mapping support for non-standard column names
- Automatic INI file detection and conversion to YAML
- Validates INI file consistency across multiple files

### `scripts/sp2xr_ini2yaml.py`

Convert legacy INI calibration files to YAML format.

**Usage:**

```bash
python scripts/sp2xr_ini2yaml.py input.ini output.yaml
```

**Arguments:**

- `ini` - Input .ini calibration file
- `yaml` - Output .yaml file path

**Features:**

- Converts SP2-XR instrument .ini files to editable YAML format
- Preserves all calibration parameters and settings

### `calibration_workflow.ipynb`

Interactive Jupyter notebook for determining instrument-specific calibration coefficients.

**Purpose:**

- Analyze calibration standards (PSL spheres, Aquadag, etc.) to derive scattering and incandescence calibration curves
- Iterative process with visualization for quality control
- Generate calibration parameters for use in configuration files

**Workflow:**

1. Load calibration standard measurements
2. Plot raw signals vs. known particle properties
3. Fit calibration curves (polynomial, power-law, etc.)
4. Export calibration coefficients to YAML configuration

### `scripts/run_sp2xr_pipeline.sbatch`

SLURM batch job script for running the pipeline on HPC systems.

**Features:**

- Configurable resource allocation
- Automatic scratch directory management
- Module loading and environment activation
- Error and output logging
docs/workflow.md
# Data Processing Workflow

## Recommended Directory Structure

```
Campaign_name/
├── SP2XR_files/                 # Raw instrument files (.csv/.zip)
│   └── 20200415/                # Organized by date
├── SP2XR_pbp_parquet/           # Converted particle data
├── SP2XR_hk_parquet/            # Converted housekeeping data
└── SP2XR_pbp_processed_1min/    # Calibrated and processed data (user-defined resolution)
```

## Processing Steps

### 1. CSV to Parquet Conversion (`sp2xr_csv2parquet.py`)

- Convert raw CSV/ZIP files to Parquet format
- Separate particle-by-particle (PbP) and housekeeping (HK) data streams
- Apply data schema and column mapping transformations
- Organize by date and hour for efficient querying

### 2. Data Loading and Preparation (`sp2xr_pipeline.py`)

- Load PbP and HK Parquet files with time-based filtering
- Repartition data for optimal parallel processing
- Resample housekeeping data to user-defined time resolution (e.g., 1s, 60s)

### 3. Calibration and Quality Control

- Apply scattering and incandescence calibrations to raw particle signals
- Convert signals to physical units (diameter in nm, mass in fg)
- Flag particles based on instrument quality control parameters
- Calculate mixing state classifications using the time delay method
- Merge calibrated particle data with resampled flow measurements

### 4. Time Aggregation and Summary Statistics

- Aggregate particle-by-particle data to time bins (dt resolution)
- Calculate summary statistics (counts, means) for each time bin
- Join aggregated PbP data with resampled HK data
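As an illustration of this aggregation step, here is a simplified pandas sketch; the pipeline itself uses Dask via `sp2xr.resample_pbp_hk`, and the column names and values below are illustrative assumptions.

```python
import pandas as pd

dt = "1s"
pbp = pd.DataFrame(
    {"Incand Mass (fg)": [1.2, 0.8, 2.0]},
    index=pd.to_datetime(["2020-04-15 00:00:00.2",
                          "2020-04-15 00:00:00.7",
                          "2020-04-15 00:00:01.3"]),
)
hk = pd.DataFrame(
    {"Sample Flow Controller Read (sccm)": [120.0, 121.0]},
    index=pd.to_datetime(["2020-04-15 00:00:00", "2020-04-15 00:00:01"]),
)

# Sum BC mass and count particles in each time bin, then join the resampled flow
pbp_dt = pbp.resample(dt)["Incand Mass (fg)"].agg(["sum", "count"])
pbp_dt.columns = ["bc_mass_fg", "bc_count"]
joined = pbp_dt.join(hk.resample(dt).mean())
print(joined)
```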
### 5. Bulk Concentrations (optional, if `conc: true`)

- Compute number and mass concentrations for different particle types
- Calculate size-resolved concentrations for different coating states
- Account for flow rate corrections and sampling efficiency

### 6. Size and Mass Distributions (optional, if `*_hist: true`)

- Compute size distributions (dNdlogDsc) for scattering-only particles
- Compute mass distributions (dNdlogDmev, dMdlogDmev) for BC-containing particles
- Calculate time-lag distributions for mixing state analysis
- User-configurable bin edges and ranges

## Known Limitations

1. **Sampling frequency**: Currently assumes all BC and BC-free particles are recorded in PbP files
2. **Distribution calculations**: When using time resolution ≠ 1s, histogram calculations may be incorrect. Process at 1s resolution first, then resample.