# Configuration Guide
## Overview
SP2-XR uses configuration files to handle different instrument versions, data schemas, calibrations, and processing parameters.
## Configuration System Structure
### Primary Locations
- **`config/`** - Essential templates and examples
- **Auto-generated configs** - Created by the `sp2xr_generate_config.py` script
### Recommended Workflow
1. **Auto-generate** data schema and instrument settings from your actual data
2. **Copy and customize** the pipeline template for your workflow
3. **Validate** with small datasets before full processing
### Configuration File Types
## 1. Configuration Generation Tools
**Features:**
- Automatically detects PbP and HK files (CSV, ZIP, Parquet)
- Analyzes actual data to determine column names and types
- Creates mapping templates for column standardization
- Creates a .yaml file with the instrument settings
### Auto-Generated Schemas
```bash
# Generate schema with instrument settings (recommended)
python scripts/sp2xr_generate_config.py /path/to/your/data \
    --mapping \
    --schema-output my_data_schema.yaml \
    --instrument-output my_instrument_settings.yaml
```
**Benefits of auto-generation:**
- Analyzes actual data files to detect correct column names and types
- Automatically finds and validates INI calibration files
- Creates instrument settings with source traceability
- Handles different SP2-XR instrument versions automatically
### Generated Schema Structure
The auto-generated schema includes:
```yaml
# Data type definitions (detected from your files)
pbp_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... all columns from your data

# Standard column names used by SP2-XR package
pbp_canonical_schema:
  Time (sec): float
  Scatter Size (nm): float
  # ... canonical column definitions

# Column mapping (your files → canonical names)
pbp_column_mapping:
  Time (sec): "Time (sec)"          # Exact match
  Scatter Size (nm): "Size_nm"      # Maps your column name
```
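
As an illustration of how such a mapping could be applied, the sketch below renames the columns of a single record from file-specific names to canonical names. It assumes the mapping is keyed by canonical name with the file's column name as the value, as in the example above. The function name `to_canonical`, the plain-dict record, and the extra `Flag` column are hypothetical and not part of the SP2-XR package.

```python
def to_canonical(record, column_mapping):
    """Rename keys of one record from file column names to canonical names.

    column_mapping is assumed to be {canonical_name: file_column_name}.
    Columns without a mapping entry keep their original name.
    """
    # Invert the mapping so lookups use the name found in the data file.
    file_to_canonical = {
        file_name: canonical for canonical, file_name in column_mapping.items()
    }
    return {file_to_canonical.get(name, name): value for name, value in record.items()}


# Hypothetical mapping and record mirroring the YAML example.
pbp_column_mapping = {
    "Time (sec)": "Time (sec)",       # exact match
    "Scatter Size (nm)": "Size_nm",   # file uses a different name
}
row = {"Time (sec)": 12.5, "Size_nm": 210.0, "Flag": 1}
canonical_row = to_canonical(row, pbp_column_mapping)
```

Unmapped columns are passed through unchanged here. A stricter variant could raise an error instead, which may be preferable when validating a new dataset.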
### Auto-Generated Instrument Settings
The generation script automatically converts INI calibration files to a structured YAML format containing:
- **Metadata**: Source file path, generation timestamp, traceability
- **Instrument parameters**: All settings from INI file
- **Signal saturation**
```yaml
metadata:
  source_ini_file: /full/path/to/calibration.ini
  generated_on: '2024-01-01T12:00:00'
  generated_by: sp2xr_generate_config.py
instrument_parameters:
  ScattTransitMin: 10.0
  IncTransitMin: 5.0
  # ... all INI parameters
```
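
To make the conversion concrete, here is a minimal sketch of how an INI file could be turned into a structure of this shape using only the Python standard library. It is not the implementation used by `sp2xr_generate_config.py`. The function name `ini_to_settings`, the `[Calibration]` section name, and the `Operator` key are assumptions made for illustration, and the sketch assumes the INI file has section headers.

```python
import configparser
from datetime import datetime


def ini_to_settings(ini_text, source_path):
    """Flatten all INI sections into one parameter dict and attach metadata."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(ini_text)

    parameters = {}
    for section in parser.sections():
        for key, raw in parser.items(section):
            # Store numeric values as floats, everything else as text.
            try:
                parameters[key] = float(raw)
            except ValueError:
                parameters[key] = raw

    return {
        "metadata": {
            "source_ini_file": source_path,
            "generated_on": datetime.now().isoformat(timespec="seconds"),
            "generated_by": "ini_to_settings (illustrative sketch)",
        },
        "instrument_parameters": parameters,
    }


# Hypothetical INI content; real calibration files will differ.
example_ini = """
[Calibration]
ScattTransitMin = 10.0
IncTransitMin = 5.0
Operator = lab
"""
settings = ini_to_settings(example_ini, "/full/path/to/calibration.ini")
```

Flattening sections means that identically named keys in different sections overwrite each other. If your INI files reuse key names, keeping one sub-dictionary per section would be safer.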
## 2. Main Data Processing Configurations
### Comprehensive Template
Copy and customize the complete pipeline template:
```bash
# Copy the template
cp config/complete_example.yaml my_campaign_pipeline.yaml
# Edit for your specific needs
nano my_campaign_pipeline.yaml
```
### Pipeline Configuration Structure
The pipeline config includes all processing settings:
```yaml
# File paths
paths:
  input_pbp: /path/to/SP2XR_pbp_parquet
  input_hk: /path/to/SP2XR_hk_parquet
  output: /path/to/SP2XR_processed_output
  instrument_config: /path/to/instrument_settings.yaml

# Workflow settings
workflow:
  conc: true          # Calculate concentrations
  BC_hist: true       # BC mass distributions
  scatt_hist: true    # Scattering size distributions
  dt: 60              # Time resolution (seconds)

# Computing resources
cluster:
  use_local: false    # true for local, false for SLURM
  cores: 16           # CPU cores
  memory: 128GB       # Memory allocation

# Analysis parameters
histo:
  inc:                # BC mass histograms
    min_mass: 0.3
    max_mass: 400
    n_bins: 50
  scatt:              # Scattering histograms
    min_D: 100
    max_D: 500
    n_bins: 20

# Calibration parameters
calibration:
  incandescence:
    curve_type: "polynomial"
    parameters: [0.05, 2.047e-07]
  scattering:
    curve_type: "powerlaw"
    parameters: [17.22, 0.169, -1.494]
```
**Key sections to customize:**
- **`paths`** - Update all file and directory paths
- **`calibration`** - Use parameters from your instrument settings
- **`cluster`** - Match your computing environment
- **`workflow.dt`** - Set appropriate time resolution
- **`histo`** - Configure size/mass distribution bins
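
Before launching a full run, it can help to sanity-check a customized pipeline config. The sketch below works on a config that has already been loaded into a Python dictionary (for example with a YAML parser) and reports obvious problems. The function name `check_pipeline_config` and the specific rules (positive `dt`, minimum below maximum, positive integer `n_bins`) are assumptions for illustration and do not reflect validation performed by the SP2-XR package itself.

```python
REQUIRED_SECTIONS = ("paths", "workflow", "cluster", "histo", "calibration")

# (histogram name, lower-bound key, upper-bound key) as used in the template.
HISTOGRAMS = (("inc", "min_mass", "max_mass"), ("scatt", "min_D", "max_D"))


def check_pipeline_config(cfg):
    """Return a list of problems found; an empty list means none were detected."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in cfg]

    dt = cfg.get("workflow", {}).get("dt")
    if not isinstance(dt, (int, float)) or dt <= 0:
        problems.append("workflow.dt must be a positive number of seconds")

    for name, lo_key, hi_key in HISTOGRAMS:
        hist = cfg.get("histo", {}).get(name, {})
        lo, hi, n_bins = hist.get(lo_key), hist.get(hi_key), hist.get("n_bins")
        if lo is None or hi is None or lo >= hi:
            problems.append(f"histo.{name}: {lo_key} must be smaller than {hi_key}")
        if not isinstance(n_bins, int) or n_bins <= 0:
            problems.append(f"histo.{name}: n_bins must be a positive integer")

    return problems


# Values copied from the template above; paths are placeholders.
template_cfg = {
    "paths": {"input_pbp": "/path/to/SP2XR_pbp_parquet"},
    "workflow": {"conc": True, "dt": 60},
    "cluster": {"use_local": False, "cores": 16},
    "histo": {
        "inc": {"min_mass": 0.3, "max_mass": 400, "n_bins": 50},
        "scatt": {"min_D": 100, "max_D": 500, "n_bins": 20},
    },
    "calibration": {"incandescence": {"curve_type": "polynomial"}},
}
template_problems = check_pipeline_config(template_cfg)
```

Returning a list of messages rather than raising on the first error lets you see every issue in one pass, which is convenient when editing a freshly copied template.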