
# Configuration Guide

## Overview

SP2-XR uses configuration files to handle different instrument versions, data schemas, calibrations, and processing parameters.

## Configuration System Structure

### Primary Locations
- **`config/`** - Essential templates and examples
- **Auto-generated configs** - Created by the `sp2xr_generate_config.py` script

### Recommended Workflow
1. **Auto-generate** the data schema and instrument settings from your actual data
2. **Copy and customize** the pipeline template for your workflow
3. **Validate** the configuration on a small dataset before full processing

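### Configuration File Types

The sections below cover the two groups of configuration files: the generated data schema and instrument settings, and the main pipeline configuration.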

## 1. Configuration Generation Tools

**Features of `sp2xr_generate_config.py`:**
- Automatically detects PbP and HK files (CSV, ZIP, Parquet)
- Analyzes the actual data to determine column names and types
- Creates mapping templates for column standardization
- Writes the instrument settings to a YAML file

### Auto-Generated Schemas

```bash
# Generate schema with instrument settings (recommended)
python scripts/sp2xr_generate_config.py /path/to/your/data \
  --mapping \
  --schema-output my_data_schema.yaml \
  --instrument-output my_instrument_settings.yaml
```

**Benefits of auto-generation:**
- Analyzes actual data files to detect correct column names and types
- Automatically finds and validates INI calibration files
- Creates instrument settings with source traceability
- Handles different SP2-XR instrument versions automatically

### Generated Schema Structure
The auto-generated schema includes:

```yaml
# Data type definitions (detected from your files)
pbp_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... all columns from your data

# Standard column names used by SP2-XR package
pbp_canonical_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... canonical column definitions

# Column mapping (your files → canonical names)
pbp_column_mapping:
  Time (sec): "Time (sec)"           # Exact match
  Scatter Size (nm): "Size_nm"       # Maps your column name
```
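
The mapping can be applied at load time by inverting it. The sketch below assumes, as the example entries suggest, that keys are canonical names and values are the column names found in your files. The helper `rename_to_canonical` is hypothetical and not part of the SP2-XR package; it uses plain dictionaries so it runs without extra dependencies.

```python
# Mapping as written in the schema: canonical name -> column name in your files.
pbp_column_mapping = {
    "Time (sec)": "Time (sec)",
    "Scatter Size (nm)": "Size_nm",
}


def rename_to_canonical(row, column_mapping):
    """Return a copy of `row` keyed by canonical column names.

    Hypothetical helper for illustration only. Columns that do not
    appear in the mapping are passed through unchanged.
    """
    # Invert the mapping so lookups go from file name to canonical name.
    file_to_canonical = {
        file_name: canonical for canonical, file_name in column_mapping.items()
    }
    return {file_to_canonical.get(name, name): value for name, value in row.items()}


# One record as it might be read from a file ("Flag" is an unmapped column).
row = {"Time (sec)": 12.5, "Size_nm": 210.0, "Flag": 1}
canonical_row = rename_to_canonical(row, pbp_column_mapping)
print(canonical_row)
```

With a pandas DataFrame, the same inverted dictionary could be passed to `DataFrame.rename(columns=...)` instead of renaming row by row.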

### Auto-Generated Instrument Settings
The generation script automatically converts INI calibration files to a structured YAML format containing:

- **Metadata**: Source file path, generation timestamp, traceability
- **Instrument parameters**: All settings from INI file
- **Signal saturation**

```yaml
metadata:
  source_ini_file: /full/path/to/calibration.ini
  generated_on: '2024-01-01T12:00:00'
  generated_by: sp2xr_generate_config.py

instrument_parameters:
  ScattTransitMin: 10.0
  IncTransitMin: 5.0
  # ... all INI parameters
```
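
To illustrate the kind of conversion involved, here is a minimal sketch that reads INI text with Python's standard `configparser` and builds the same `metadata` / `instrument_parameters` layout. It is not the actual logic of `sp2xr_generate_config.py`: the function name `ini_to_settings`, the `[Calibration]` section name, and the flattening of all sections into one dictionary are assumptions made for the example.

```python
import configparser
from datetime import datetime

# Hypothetical INI content; a real calibration file holds many more parameters.
ini_text = """
[Calibration]
ScattTransitMin = 10.0
IncTransitMin = 5.0
"""


def ini_to_settings(text, source_path):
    """Build a metadata + instrument_parameters dict from INI text (illustrative)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of parameter names
    parser.read_string(text)

    parameters = {}
    for section in parser.sections():
        # Assumption: parameter names are unique across sections.
        for key, raw in parser.items(section):
            try:
                parameters[key] = float(raw)
            except ValueError:
                parameters[key] = raw  # keep non-numeric values as strings

    return {
        "metadata": {
            "source_ini_file": source_path,
            "generated_on": datetime.now().isoformat(timespec="seconds"),
            "generated_by": "example sketch",
        },
        "instrument_parameters": parameters,
    }


settings = ini_to_settings(ini_text, "/full/path/to/calibration.ini")
print(settings["instrument_parameters"])
```

Recording the source path and timestamp alongside the parameters is what provides the traceability mentioned above: a settings file can always be traced back to the INI file it came from.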

## 2. Main Data Processing Configurations

### Comprehensive Template
Copy and customize the complete pipeline template:

```bash
# Copy the template
cp config/complete_example.yaml my_campaign_pipeline.yaml

# Edit for your specific needs
nano my_campaign_pipeline.yaml
```

### Pipeline Configuration Structure
The pipeline config includes all processing settings:

```yaml
# File paths
paths:
  input_pbp: /path/to/SP2XR_pbp_parquet
  input_hk: /path/to/SP2XR_hk_parquet
  output: /path/to/SP2XR_processed_output
  instrument_config: /path/to/instrument_settings.yaml

# Workflow settings
workflow:
  conc: true              # Calculate concentrations
  BC_hist: true           # BC mass distributions
  scatt_hist: true        # Scattering size distributions
  dt: 60                  # Time resolution (seconds)

# Computing resources
cluster:
  use_local: false        # true for local, false for SLURM
  cores: 16               # CPU cores
  memory: 128GB           # Memory allocation

# Analysis parameters
histo:
  inc:                    # BC mass histograms
    min_mass: 0.3
    max_mass: 400
    n_bins: 50
  scatt:                  # Scattering histograms
    min_D: 100
    max_D: 500
    n_bins: 20

# Calibration parameters
calibration:
  incandescence:
    curve_type: "polynomial"
    parameters: [0.05, 2.047e-07]
  scattering:
    curve_type: "powerlaw"
    parameters: [17.22, 0.169, -1.494]
```

**Key sections to customize:**
- **`paths`** - Update all file and directory paths
- **`calibration`** - Use parameters from your instrument settings
- **`cluster`** - Match your computing environment
- **`workflow.dt`** - Set the time resolution, in seconds, for your analysis
- **`histo`** - Configure the bins for the size and mass distributions
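
Before launching a full run, it can help to sanity-check a customized pipeline config. The sketch below is one possible check, not a feature of the package: `validate_pipeline_config` is a hypothetical helper, and the rules it applies (all paths set, positive `dt`, lower bounds below upper bounds, positive bin counts) are assumptions derived from the structure shown above. The config is written as a dictionary to keep the example dependency-free; in practice it would be loaded from the YAML file.

```python
def validate_pipeline_config(config):
    """Return a list of problems found in a pipeline config dict (empty if none)."""
    problems = []

    # Top-level sections shown in the pipeline template.
    for section in ("paths", "workflow", "cluster", "histo", "calibration"):
        if section not in config:
            problems.append(f"missing section: {section}")

    for key in ("input_pbp", "input_hk", "output", "instrument_config"):
        if not config.get("paths", {}).get(key):
            problems.append(f"paths.{key} is not set")

    dt = config.get("workflow", {}).get("dt")
    if not isinstance(dt, (int, float)) or dt <= 0:
        problems.append("workflow.dt must be a positive number of seconds")

    # Histogram ranges: the lower bound must be below the upper bound.
    bounds = {"inc": ("min_mass", "max_mass"), "scatt": ("min_D", "max_D")}
    for name, (low_key, high_key) in bounds.items():
        histo = config.get("histo", {}).get(name, {})
        low, high = histo.get(low_key), histo.get(high_key)
        if low is None or high is None or low >= high:
            problems.append(f"histo.{name}: {low_key} must be below {high_key}")
        n_bins = histo.get("n_bins")
        if not isinstance(n_bins, int) or n_bins < 1:
            problems.append(f"histo.{name}.n_bins must be a positive integer")

    return problems


# Example config with one deliberate mistake: min_D and max_D are swapped.
config = {
    "paths": {
        "input_pbp": "/path/to/SP2XR_pbp_parquet",
        "input_hk": "/path/to/SP2XR_hk_parquet",
        "output": "/path/to/SP2XR_processed_output",
        "instrument_config": "/path/to/instrument_settings.yaml",
    },
    "workflow": {"conc": True, "BC_hist": True, "scatt_hist": True, "dt": 60},
    "cluster": {"use_local": True, "cores": 16, "memory": "128GB"},
    "histo": {
        "inc": {"min_mass": 0.3, "max_mass": 400, "n_bins": 50},
        "scatt": {"min_D": 500, "max_D": 100, "n_bins": 20},
    },
    "calibration": {},
}

problems = validate_pipeline_config(config)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets you fix every issue in a single editing pass.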