Conversion of SP2-XR PbP and HK .csv/.zip files to .parquet

Overview

Different SP2-XR instrument versions may use different column names in their CSV/ZIP data files. To keep downstream processing consistent, the SP2-XR package supports automatic column name standardization during CSV-to-Parquet conversion.

How It Works

  1. Input: Instrument-specific CSV/ZIP files with their original column names
  2. Mapping: A configuration file maps your column names to canonical (standard) column names
  3. Output: Parquet files with standardized column names for consistent downstream processing
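The three stages can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: `standardize_header` is a hypothetical helper, the sample data is made up, and the real converter also writes the result to Parquet, which this sketch leaves out.

```python
import csv
import io

# Canonical name -> column name used in this instrument's file (illustrative values).
pbp_column_mapping = {
    "Time (sec)": "Time_Seconds",
    "Particle Flags": "Flags",
}


def standardize_header(header, mapping):
    """Return the header with file-specific names replaced by canonical ones.

    Hypothetical helper for illustration; not part of the SP2-XR package.
    """
    # Invert the mapping so lookups go from the file's name to the canonical name.
    file_to_canonical = {
        file_name: canonical for canonical, file_name in mapping.items()
    }
    # Columns without a mapping entry keep their original name.
    return [file_to_canonical.get(name, name) for name in header]


# 1. Input: a CSV with instrument-specific column names (in memory for the example).
raw_csv = io.StringIO("Time_Seconds,Flags,Extra\n0.5,1,a\n1.0,0,b\n")
reader = csv.reader(raw_csv)
header = next(reader)

# 2. Mapping: rename the columns to their canonical names.
canonical_header = standardize_header(header, pbp_column_mapping)

# 3. Output: rows keyed by canonical names (the real tool writes Parquet instead).
rows = [dict(zip(canonical_header, row)) for row in reader]
print(canonical_header)
```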

Usage

Step 1: Generate Enhanced Config Template

Use generate_config.py to create a configuration template:

from meta_files.generate_config import generate_mapping_template

# Generate config with column mappings
generate_mapping_template("your_pbp_file.csv", "your_hk_file.csv", "config_with_mapping.yaml")

Step 2: Customize Column Mappings

Open the generated config_with_mapping.yaml and update the column mappings:

pbp_column_mapping:
  # Canonical name -> Your file's column name
  Time (sec): "Time_Seconds"  # Replace with your actual column name
  Particle Flags: "Flags"     # Replace with your actual column name
  Incand Mass (fg): "Mass_fg" # Replace with your actual column name
  # ... etc
  
hk_column_mapping:
  Time Stamp: "Timestamp"     # Replace with your actual column name
  Time (sec): "Time_Seconds"  # Replace with your actual column name
  # ... etc
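Before running the conversion, it can help to check that no placeholder values were left unedited. The sketch below assumes the mapping has already been loaded into a plain dictionary (for example with a YAML parser) and that the file's header is available as a list. `find_unresolved` is a hypothetical helper, not a function provided by the package.

```python
def find_unresolved(header, mapping):
    """List canonical names whose mapped column is absent from the file header.

    Entries mapped to None (YAML null) are treated as intentionally missing.
    """
    present = set(header)
    return [
        canonical
        for canonical, file_name in mapping.items()
        if file_name is not None and file_name not in present
    ]


# Illustrative header and mapping; the last entry still holds a placeholder.
header = ["Time_Seconds", "Flags", "Incand Mass (fg)"]
mapping = {
    "Time (sec)": "Time_Seconds",
    "Particle Flags": "Flags",
    "Incand Mass (fg)": "Mass_fg",
}

unresolved = find_unresolved(header, mapping)
print(unresolved)  # canonical names that still need a valid column name
```

Any name reported here points to a mapping value that does not match a column in the file and should be corrected or set to null.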

Step 3: Convert CSV to Parquet with Mapping

Pass your customized config to the CSV-to-Parquet conversion script:

python scripts/sp2xr_csv2parquet.py --source /path/to/csv --target /path/to/parquet --config config_with_mapping.yaml

Configuration File Structure

The enhanced config file contains several sections:

  • pbp_schema / hk_schema: Data types of the columns in your input files
  • pbp_canonical_schema / hk_canonical_schema: Standard column schemas used by SP2-XR processing
  • pbp_column_mapping / hk_column_mapping: Maps canonical names to your file's column names

Column Mapping Rules

  1. Exact matches: If your column names exactly match canonical names, they're automatically mapped
  2. Custom mapping: Replace placeholder values with your actual column names
  3. Missing columns: Set the mapping value to null, or remove the line, if your data doesn't have that column
  4. Extra columns: Unmapped columns in your files are preserved as-is
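The four rules above can be expressed as a short sketch. This is one possible reading of the rules, not the package's actual code: `build_rename_map` and `standardized_columns` are hypothetical names. The sketch also assumes that an exact match is used whenever a canonical name has no explicit mapping.

```python
def build_rename_map(file_columns, mapping, canonical_names):
    """Return {file column: canonical name} following the four mapping rules."""
    rename = {}
    for canonical in canonical_names:
        file_name = mapping.get(canonical)
        if file_name is None:
            # Rule 3: a null or removed entry means no explicit mapping.
            # Rule 1: an exact match is still picked up automatically.
            if canonical in file_columns:
                rename[canonical] = canonical
            continue
        if file_name in file_columns:
            # Rule 2: custom mapping from the file's name to the canonical name.
            rename[file_name] = canonical
    return rename


def standardized_columns(file_columns, rename):
    """Apply the rename map; unmapped columns keep their names (rule 4)."""
    return [rename.get(column, column) for column in file_columns]


# Illustrative inputs only.
file_columns = ["Time_Seconds", "Particle Flags", "Debug"]
canonical_names = ["Time (sec)", "Particle Flags", "Incand Mass (fg)"]
mapping = {"Time (sec)": "Time_Seconds", "Incand Mass (fg)": None}

rename = build_rename_map(file_columns, mapping, canonical_names)
columns = standardized_columns(file_columns, rename)
print(columns)
```

In this example `Time_Seconds` is renamed through the custom mapping, `Particle Flags` matches its canonical name exactly, `Incand Mass (fg)` is skipped because it is set to null, and `Debug` passes through unchanged.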