# Scripts Reference
## Available Scripts
### `scripts/sp2xr_pipeline.py`
Main processing pipeline that orchestrates the complete data analysis workflow.
**Usage:**
```bash
python scripts/sp2xr_pipeline.py --config path/to/config.yaml [options]
```
**Command-line Options:**
- `--config` - Path to YAML configuration file (required)
- `--set KEY=VALUE` - Override config values using dot notation (e.g., `--set dt=60`; see the sketch below)
- Additional cluster configuration options (see `--help`)
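Overrides use dot notation to reach nested keys and, assuming the flag can be repeated as is common for `KEY=VALUE` override flags, several can be combined in one invocation. A minimal sketch (`dt=60` is taken from the option description above; the nested key `cluster.cores` is illustrative):
```bash
# Override config values at run time without editing the YAML file.
# `cluster.cores` is an illustrative key -- substitute keys that
# actually exist in your configuration.
python scripts/sp2xr_pipeline.py --config path/to/config.yaml \
    --set dt=60 \
    --set cluster.cores=16
```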
**Features:**
- Distributed processing with Dask (local or SLURM cluster)
- Automatic time chunking and partition management
- Calibration application and quality flagging
- Distribution calculations and time resampling
- Concentration calculations
- Output partitioned by date and hour (see the layout sketch below)
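The partitioned output typically produces a directory tree along these lines (an illustrative sketch assuming Hive-style `key=value` partition directories; the actual naming may differ):
```
output/
├── date=2024-06-01/
│   ├── hour=00/
│   │   └── part-0.parquet
│   └── hour=01/
│       └── part-0.parquet
└── date=2024-06-02/
    └── ...
```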
### `scripts/sp2xr_csv2parquet.py`
Batch conversion of raw CSV/ZIP files to Parquet format with support for both local and cluster processing.
**Usage:**
```bash
# Local processing (automatic resource detection)
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --local

# SLURM cluster processing
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --cores 32 --memory 64GB --partition general

# Process housekeeping files with custom chunk size
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "hk" \
    --chunk 50 \
    --local
```
**Features:**
- Supports both local and SLURM cluster execution
- Automatic resource detection for local processing
- Configurable batch processing with chunking
- Progress tracking and error handling
- Graceful shutdown with signal handling
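After a batch run, a quick sanity check is to confirm that the target directory contains the expected Parquet files (generic shell commands, independent of the script itself):
```bash
# Count the Parquet files produced by the conversion
find /path/to/parquet/output -name '*.parquet' | wc -l
# Inspect the top of the output tree to confirm the layout
find /path/to/parquet/output -maxdepth 2 -type d | sort | head
```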
### `scripts/sp2xr_apply_calibration.py`
Apply calibration parameters to particle data using Dask + SLURM.
**Usage:**
```bash
python scripts/sp2xr_apply_calibration.py \
    --input /path/to/data.parquet \
    --config /path/to/config.yaml \
    --output /path/to/calibrated.parquet \
    [--cores 32] [--memory 64GB] [--walltime 02:00:00] [--partition daily]
```
**Options:**
- `--input` - Input Parquet file or directory (required)
- `--config` - YAML calibration configuration (required)
- `--output` - Output directory for calibrated Parquet dataset (required)
- `--cores` - Cores per SLURM job (default: 32)
- `--memory` - Memory per job (default: 64GB)
- `--walltime` - Wall-time limit (default: 02:00:00)
- `--partition` - SLURM partition (default: daily)
**Features:**
- Parallel processing with Dask + SLURM cluster
- Automatic scaling and resource management
- Partitioned output by date and hour
### `scripts/sp2xr_generate_config.py`
Generate schema configuration files by automatically detecting SP2XR files in a directory.
**Usage:**
```bash
# Generate basic schema config from current directory
python scripts/sp2xr_generate_config.py .

# Generate config from specific directory
python scripts/sp2xr_generate_config.py /path/to/sp2xr/data

# Generate config with column mapping support (for non-standard column names)
python scripts/sp2xr_generate_config.py /path/to/data --mapping

# Specify custom schema and instrument settings output filenames
python scripts/sp2xr_generate_config.py /path/to/data \
    --schema-output my_schema.yaml \
    --instrument-output my_settings.yaml

# Generate mapping config with custom output names
python scripts/sp2xr_generate_config.py /path/to/data --mapping \
    --schema-output campaign_schema.yaml \
    --instrument-output campaign_settings.yaml

# Use specific files instead of auto-detection
python scripts/sp2xr_generate_config.py . --pbp-file data.pbp.csv --hk-file data.hk.csv
```
**Options:**
- `directory` - Directory containing SP2XR files (PbP and HK files)
- `--schema-output`, `-s` - Output filename for data schema config (default: `config_schema.yaml`)
- `--instrument-output`, `-i` - Output filename for instrument settings config (default: `{schema_output}_instrument_settings.yaml`)
- `--mapping`, `-m` - Generate config with column mapping support (creates canonical column mappings)
- `--pbp-file` - Use a specific PbP file instead of auto-detection
- `--hk-file` - Use a specific HK file instead of auto-detection
**Features:**
- Automatic file detection (searches recursively for PbP and HK files)
- Schema inference from CSV/ZIP/Parquet files
- Column mapping support for non-standard column names
- Automatic INI file detection and conversion to YAML
- Validates INI file consistency across multiple files
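A plausible end-to-end flow, assuming the generated schema config is the config consumed by the pipeline (file names match the defaults listed above; review and edit the generated YAML before use):
```bash
# 1. Generate schema and instrument-settings configs from the raw data
python scripts/sp2xr_generate_config.py /path/to/sp2xr/data
# 2. After reviewing the generated files, run the pipeline against them
python scripts/sp2xr_pipeline.py --config config_schema.yaml
```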
### `scripts/sp2xr_ini2yaml.py`
Convert legacy INI calibration files to YAML format.
**Usage:**
```bash
python scripts/sp2xr_ini2yaml.py input.ini output.yaml
```
**Arguments:**
- `ini` - Input .ini calibration file
- `yaml` - Output .yaml file path
**Features:**
- Converts SP2-XR instrument .ini files to editable YAML format
- Preserves all calibration parameters and settings
### `calibration_workflow.ipynb`
Interactive Jupyter notebook for determining instrument-specific calibration coefficients.
**Purpose:**
- Analyze calibration standards (PSL spheres, Aquadag, etc.) to derive scattering and incandescence calibration curves
- Iterative process with visualization for quality control
- Generate calibration parameters for use in configuration files
**Workflow:**
1. Load calibration standard measurements
2. Plot raw signals vs. known particle properties
3. Fit calibration curves (polynomial, power-law, etc.; see the note below)
4. Export calibration coefficients to YAML configuration
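For reference, a power-law fit here typically takes the form `y = a * x^b`, with `x` a measured peak signal and `y` the known particle property; this is an illustrative form, and the notebook's actual fit functions may differ.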
### `scripts/run_sp2xr_pipeline.sbatch`
SLURM batch job script for running the pipeline on HPC systems.
**Features:**
- Configurable resource allocation
- Automatic scratch directory management
- Module loading and environment activation
- Error and output logging
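The script ships with the repository; as a rough sketch, such a batch file typically looks like the following (all resource values, module names, and paths are placeholders to adapt to your site):
```bash
#!/bin/bash
#SBATCH --job-name=sp2xr_pipeline
#SBATCH --cpus-per-task=32       # placeholder resource request
#SBATCH --mem=64G
#SBATCH --time=02:00:00
#SBATCH --partition=general
#SBATCH --output=sp2xr_%j.out    # stdout log, tagged with the job ID
#SBATCH --error=sp2xr_%j.err     # stderr log, tagged with the job ID

# Optional: point temporary files at node-local scratch (site-specific)
export TMPDIR=${SCRATCH:-/tmp}

# Placeholder environment setup -- adapt to your HPC modules/conda env
module load anaconda
source activate sp2xr

# Run the pipeline with the desired configuration
python scripts/sp2xr_pipeline.py --config path/to/config.yaml
```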