Conversion of SP2-XR PbP and HK .csv/.zip files to .parquet
Overview
Different SP2-XR instrument versions may use different column names in their CSV/ZIP data files. To keep data processing consistent, the SP2-XR package can standardize column names automatically during CSV-to-Parquet conversion.
How It Works
- Input: Instrument-specific CSV/ZIP files with their original column names
- Mapping: A configuration file maps your column names to canonical (standard) column names
- Output: Parquet files with standardized column names for consistent downstream processing
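The renaming idea behind these three steps can be sketched with the standard library alone. This is only an illustration, not the package's implementation: the function name `standardize_rows` and the sample data are hypothetical, and the Parquet writing step is left out.

```python
import csv
import io

# Hypothetical mapping in the direction the config uses:
# canonical name -> column name found in the instrument file.
pbp_column_mapping = {
    "Time (sec)": "Time_Seconds",
    "Particle Flags": "Flags",
}

def standardize_rows(csv_text, mapping):
    """Yield one dict per CSV row, with mapped columns renamed to canonical names."""
    # Invert the mapping so file column names can be looked up directly.
    rename = {file_col: canonical for canonical, file_col in mapping.items()}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Columns without a mapping entry keep their original name.
        yield {rename.get(col, col): value for col, value in row.items()}

sample = "Time_Seconds,Flags,Extra\n0.5,1,x\n"
rows = list(standardize_rows(sample, pbp_column_mapping))
print(rows)
```

Inverting the mapping once keeps the per-row work to a dictionary lookup per column.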
Usage
Step 1: Generate Enhanced Config Template
Use generate_config.py to create a configuration template:
from meta_files.generate_config import generate_mapping_template
# Generate config with column mappings
generate_mapping_template("your_pbp_file.csv", "your_hk_file.csv", "config_with_mapping.yaml")
Step 2: Customize Column Mappings
Open the generated config_with_mapping.yaml and update the column mappings:
pbp_column_mapping:
  # Canonical name -> Your file's column name
  Time (sec): "Time_Seconds"      # Replace with your actual column name
  Particle Flags: "Flags"         # Replace with your actual column name
  Incand Mass (fg): "Mass_fg"     # Replace with your actual column name
  # ... etc
hk_column_mapping:
  Time Stamp: "Timestamp"         # Replace with your actual column name
  Time (sec): "Time_Seconds"      # Replace with your actual column name
  # ... etc
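Before running the conversion, it can help to confirm that every column name entered in the mapping really occurs in the file header, since a typo would otherwise go unnoticed. The check below is a minimal sketch; `missing_mapped_columns` is a hypothetical helper and is not part of the SP2-XR package.

```python
import csv
import io

def missing_mapped_columns(header, mapping):
    """Return canonical names whose mapped file column is absent from the header."""
    present = set(header)
    return [
        canonical
        for canonical, file_col in mapping.items()
        # Entries set to null (None) mark columns the file does not have.
        if file_col is not None and file_col not in present
    ]

# Read only the header row of a (here in-memory) HK file.
header = next(csv.reader(io.StringIO("Timestamp,Time_Seconds\n")))

# Hypothetical mapping with a deliberate typo in the second entry.
hk_column_mapping = {"Time Stamp": "Timestamp", "Time (sec)": "Time_Secs"}

problems = missing_mapped_columns(header, hk_column_mapping)
print(problems)
```

An empty list means every mapped name was found in the header.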
Step 3: Convert CSV to Parquet with Mapping
Use your customized config with the CSV to Parquet conversion:
python scripts/sp2xr_csv2parquet.py --source /path/to/csv --target /path/to/parquet --config config_with_mapping.yaml
Configuration File Structure
The enhanced config file contains several sections:
- pbp_schema/hk_schema: Data types for your input files
- pbp_canonical_schema/hk_canonical_schema: Standard column schemas used by SP2-XR processing
- pbp_column_mapping/hk_column_mapping: Maps canonical names to your file's column names
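A quick structural check of a loaded config might look like the sketch below. It assumes the YAML file has already been parsed into a plain dictionary (for example with `yaml.safe_load` from PyYAML); the helper name `missing_sections` is hypothetical, and the section names are the six listed above.

```python
# Section names as listed in this README.
EXPECTED_SECTIONS = (
    "pbp_schema",
    "hk_schema",
    "pbp_canonical_schema",
    "hk_canonical_schema",
    "pbp_column_mapping",
    "hk_column_mapping",
)

def missing_sections(config):
    """Return the expected top-level sections that the config dict lacks."""
    return [name for name in EXPECTED_SECTIONS if name not in config]

# Stand-in for a parsed config file; one section is left out on purpose.
config = {
    "pbp_schema": {},
    "hk_schema": {},
    "pbp_canonical_schema": {},
    "pbp_column_mapping": {},
    "hk_column_mapping": {},
}

missing = missing_sections(config)
print(missing)
```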
Column Mapping Rules
- Exact matches: If your column names exactly match canonical names, they're automatically mapped
- Custom mapping: Replace placeholder values with your actual column names
- Missing columns: Set to null or remove the line if your data doesn't have that column
- Extra columns: Unmapped columns in your files are preserved as-is
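One way to picture these rules is as a first-pass draft of the mapping: canonical names found verbatim in the file map to themselves, the rest start as null until a custom name is filled in, and everything else in the file counts as an extra column. The sketch below is an assumption about how such a draft could be built, not the package's code; `draft_mapping` and the sample column names are hypothetical.

```python
def draft_mapping(file_columns, canonical_names):
    """Return (mapping, extras) for a file header.

    mapping: canonical name -> matching file column, or None if no exact match.
    extras:  file columns not used by any mapping entry.
    """
    present = set(file_columns)
    # Exact matches map to themselves; everything else is left as None (null).
    mapping = {name: (name if name in present else None) for name in canonical_names}
    used = {col for col in mapping.values() if col is not None}
    # Unmapped file columns are kept as-is, in their original order.
    extras = [col for col in file_columns if col not in used]
    return mapping, extras

canonical = ["Time (sec)", "Particle Flags", "Incand Mass (fg)"]
file_columns = ["Time (sec)", "Flags", "Mass_fg"]

mapping, extras = draft_mapping(file_columns, canonical)
print(mapping)
print(extras)
```

In this example, `Flags` and `Mass_fg` show up as extras until they are entered as custom mappings for `Particle Flags` and `Incand Mass (fg)`.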