# Scripts Reference

## Available Scripts

### `scripts/sp2xr_pipeline.py`

Main processing pipeline that orchestrates the complete data analysis workflow.

**Usage:**

```bash
python scripts/sp2xr_pipeline.py --config path/to/config.yaml [options]
```

**Command-line Options:**

- `--config` - Path to YAML configuration file (required)
- `--set KEY=VALUE` - Override config values using dot notation (e.g., `--set dt=60`)
- Additional cluster configuration options (see `--help`)

**Features:**

- Distributed processing with Dask (local or SLURM cluster)
- Automatic time chunking and partition management
- Calibration application and quality flagging
- Distribution calculations and time resampling
- Concentration calculations
- Output partitioned by date and hour (see the read-back sketch below)
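
As an illustration of consuming the partitioned output, the sketch below reads it back with Dask. The hive-style `date=`/`hour=` directory names and the output path are assumptions, not confirmed by this reference:

```python
# Sketch: read the pipeline's partitioned Parquet output back into Dask.
# Assumes hive-style partition directories (e.g. date=2024-01-15/hour=00/);
# adjust the filter to match the actual partition column names.
import dask.dataframe as dd

ddf = dd.read_parquet(
    "output/",                               # hypothetical pipeline output directory
    filters=[("date", "==", "2024-01-15")],  # push the partition filter down to the reader
)
print(ddf.head())
```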

### `scripts/sp2xr_csv2parquet.py`

Batch conversion of raw CSV/ZIP files to Parquet format with support for both local and cluster processing.

**Usage:**

```bash
# Local processing (automatic resource detection)
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --local

# SLURM cluster processing
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "PbP" \
    --cores 32 --memory 64GB --partition general

# Process housekeeping files with custom chunk size
python scripts/sp2xr_csv2parquet.py \
    --source /path/to/csv/files \
    --target /path/to/parquet/output \
    --config config.yaml \
    --filter "hk" \
    --chunk 50 \
    --local
```

**Features:**

- Supports both local and SLURM cluster execution
- Automatic resource detection for local processing
- Configurable batch processing with chunking
- Progress tracking and error handling
- Graceful shutdown with signal handling (see the sketch below)
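
Graceful shutdown typically follows the pattern sketched below: trap SIGINT/SIGTERM, finish the chunk in flight, then exit cleanly. This is the general pattern, not the script's actual implementation; `convert_chunk` is a hypothetical stand-in for the per-chunk work:

```python
# Sketch of the graceful-shutdown pattern: a signal handler sets a flag,
# and the batch loop checks it between chunks so the chunk in flight
# always completes. Not the script's actual code.
import signal

shutdown_requested = False

def _request_shutdown(signum, frame):
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGINT, _request_shutdown)
signal.signal(signal.SIGTERM, _request_shutdown)

def convert_chunk(chunk):
    ...  # hypothetical CSV -> Parquet conversion for one batch of files

def process_chunks(chunks):
    for chunk in chunks:
        if shutdown_requested:
            print("Shutdown requested; stopping after the current chunk.")
            break
        convert_chunk(chunk)
```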

### `scripts/sp2xr_apply_calibration.py`

Apply calibration parameters to particle data using Dask + SLURM.

**Usage:**

```bash
python scripts/sp2xr_apply_calibration.py \
    --input /path/to/data.parquet \
    --config /path/to/config.yaml \
    --output /path/to/calibrated.parquet \
    [--cores 32] [--memory 64GB] [--walltime 02:00:00] [--partition daily]
```

**Options:**

- `--input` - Input Parquet file or directory (required)
- `--config` - YAML calibration configuration (required)
- `--output` - Output directory for calibrated Parquet dataset (required)
- `--cores` - Cores per SLURM job (default: 32)
- `--memory` - Memory per job (default: 64GB)
- `--walltime` - Wall-time limit (default: 02:00:00)
- `--partition` - SLURM partition (default: daily)

**Features:**

- Parallel processing with Dask + SLURM cluster (see the sketch below)
- Automatic scaling and resource management
- Partitioned output by date and hour
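
The `--cores`/`--memory`/`--walltime`/`--partition` options map onto the standard `dask_jobqueue` pattern. A minimal sketch of that pattern, assuming `dask_jobqueue` is installed; the script's internals may differ:

```python
# Sketch of the Dask + SLURM pattern behind these options.
# Each SLURM job provides the requested cores/memory; the cluster is
# then scaled to a number of jobs and driven through a Client.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=32,             # --cores
    memory="64GB",        # --memory
    walltime="02:00:00",  # --walltime
    queue="daily",        # --partition
)
cluster.scale(jobs=4)     # illustrative job count

client = Client(cluster)
# ... submit Dask work here, e.g. dask.dataframe operations ...
client.close()
cluster.close()
```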

### `scripts/sp2xr_generate_config.py`

Generate schema configuration files by automatically detecting SP2XR files in a directory.

**Usage:**

```bash
# Generate basic schema config from current directory
python scripts/sp2xr_generate_config.py .

# Generate config from specific directory
python scripts/sp2xr_generate_config.py /path/to/sp2xr/data

# Generate config with column mapping support (for non-standard column names)
python scripts/sp2xr_generate_config.py /path/to/data --mapping

# Specify custom schema and instrument settings output filenames
python scripts/sp2xr_generate_config.py /path/to/data \
    --schema-output my_schema.yaml \
    --instrument-output my_settings.yaml

# Generate mapping config with custom output names
python scripts/sp2xr_generate_config.py /path/to/data --mapping \
    --schema-output campaign_schema.yaml \
    --instrument-output campaign_settings.yaml

# Use specific files instead of auto-detection
python scripts/sp2xr_generate_config.py . --pbp-file data.pbp.csv --hk-file data.hk.csv
```

**Options:**

- `directory` - Directory containing SP2XR files (PbP and HK files)
- `--schema-output`, `-s` - Output filename for the data schema config (default: `config_schema.yaml`)
- `--instrument-output`, `-i` - Output filename for the instrument settings config (default: `{schema_output}_instrument_settings.yaml`)
- `--mapping`, `-m` - Generate config with column mapping support (creates canonical column mappings)
- `--pbp-file` - Use a specific PbP file instead of auto-detection
- `--hk-file` - Use a specific HK file instead of auto-detection

**Features:**

- Automatic file detection (searches recursively for PbP and HK files)
- Schema inference from CSV/ZIP/Parquet files (see the sketch below)
- Column mapping support for non-standard column names
- Automatic INI file detection and conversion to YAML
- Validates INI file consistency across multiple files
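
Schema inference amounts to sampling a file and recording each column's dtype. A minimal sketch of that idea with pandas; the file name and the exact YAML layout the script emits are assumptions:

```python
# Sketch of schema inference: sample a few rows of a CSV and record
# each column's dtype as a name -> type mapping, ready to dump as YAML.
# The real script's output format may differ.
import pandas as pd
import yaml

sample = pd.read_csv("data.pbp.csv", nrows=100)  # hypothetical PbP file
schema = {col: str(dtype) for col, dtype in sample.dtypes.items()}

with open("config_schema.yaml", "w") as f:
    yaml.safe_dump({"columns": schema}, f, sort_keys=False)
```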

### `scripts/sp2xr_ini2yaml.py`

Convert legacy INI calibration files to YAML format.

**Usage:**

```bash
python scripts/sp2xr_ini2yaml.py input.ini output.yaml
```

**Arguments:**

- `ini` - Input `.ini` calibration file
- `yaml` - Output `.yaml` file path

**Features:**

- Converts SP2-XR instrument `.ini` files to editable YAML format (see the sketch below)
- Preserves all calibration parameters and settings
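
The conversion follows the usual configparser-to-YAML pattern sketched below; the script's actual handling of types and sections may differ:

```python
# Sketch of the INI -> YAML pattern: parse sections with configparser,
# then dump the nested dict as YAML. The real script may coerce value
# types or restructure sections differently.
import configparser
import yaml

parser = configparser.ConfigParser()
parser.read("input.ini")

data = {section: dict(parser[section]) for section in parser.sections()}

with open("output.yaml", "w") as f:
    yaml.safe_dump(data, f, sort_keys=False)
```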

### `calibration_workflow.ipynb`

Interactive Jupyter notebook for determining instrument-specific calibration coefficients.

**Purpose:**

- Analyze calibration standards (PSL spheres, Aquadag, etc.) to derive scattering and incandescence calibration curves
- Iterative process with visualization for quality control
- Generate calibration parameters for use in configuration files

**Workflow:**

1. Load calibration standard measurements
2. Plot raw signals vs. known particle properties
3. Fit calibration curves (polynomial, power-law, etc.; see the sketch below)
4. Export calibration coefficients to YAML configuration
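
As an illustration of step 3, a power-law curve can be fit as a straight line in log-log space. The sketch below uses synthetic (diameter, peak signal) pairs; the values and variable names are hypothetical placeholders, not real calibration data:

```python
# Sketch of a power-law calibration fit, signal = a * d**b, done as a
# linear fit in log-log space. The diameters and peak signals below are
# synthetic placeholders.
import numpy as np

diam_nm = np.array([100.0, 150.0, 200.0, 300.0])       # known standard sizes
peak_signal = np.array([1.2e3, 4.1e3, 9.8e3, 3.3e4])   # measured peak heights

# np.polyfit with degree 1 returns [slope, intercept]
b, log_a = np.polyfit(np.log(diam_nm), np.log(peak_signal), 1)
a = np.exp(log_a)
print(f"signal = {a:.3g} * d^{b:.3f}")
```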

### `scripts/run_sp2xr_pipeline.sbatch`

SLURM batch job script for running the pipeline on HPC systems.

**Features:**

- Configurable resource allocation
- Automatic scratch directory management
- Module loading and environment activation
- Error and output logging