SP2XR

This repository contains Python functions and template scripts for analyzing SP2-XR data.

Suggested structure for the data

  • Campaign_name
    • SP2XR_files
      • 20200415
        (from here on, the usual SP2XR directory structure applies; there is no need to unzip the files)
    • SP2XR_pbp_parquet
      Directory automatically generated by read_csv_files_with_dask(); it contains the PbP files converted to parquet and organized by date and hour. Individual file names correspond to the original file names.
    • SP2XR_hk_parquet
      Directory automatically generated by read_csv_files_with_dask(); it contains the HK files converted to parquet and organized by date and hour. Individual file names correspond to the original file names.
    • SP2XR_pbp_processed
      Directory automatically generated by process_pbp_parquet. It contains the processed PbP files (calibration applied, various distributions calculated, mixing state calculated, ...). By default, the data are grouped to 1s time resolution.
    • SP2XR_pbp_processed_1min
      Directory automatically generated by resample_to_dt. It contains files at the same processing level as SP2XR_pbp_processed but at the specified time resolution.
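The same structure can be written down as path definitions in Python, which is roughly what the "define paths" block of processing_code.py needs. This is a minimal sketch; the variable names are only illustrative, and the directory names are the placeholders from the list above:

```python
from pathlib import Path

# Placeholder names taken from the list above; adapt them to your campaign.
campaign_dir = Path("Campaign_name")
sp2xr_files = campaign_dir / "SP2XR_files"                # raw SP2-XR output, e.g. SP2XR_files/20200415
pbp_parquet_dir = campaign_dir / "SP2XR_pbp_parquet"      # created by read_csv_files_with_dask()
hk_parquet_dir = campaign_dir / "SP2XR_hk_parquet"        # created by read_csv_files_with_dask()
pbp_processed_dir = campaign_dir / "SP2XR_pbp_processed"  # created by process_pbp_parquet
pbp_processed_1min_dir = campaign_dir / "SP2XR_pbp_processed_1min"  # created by resample_to_dt
```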

Suggested structure for the code

Do you want to use git to track your analysis code?

  • Yes!
    1. Create a repository for the analysis of your specific dataset (from here on this is referred to as the main repository)
    2. Add the SP2XR repository (this repository) to your main repository as a submodule
    3. Copy the template file processing_code.py from the submodule to your main repository.
    4. Modify the template processing_code.py according to your needs (file paths, time resolution, calibration values, ...)
  • No, thanks.
    1. Download this repository and place it in the same directory as your analysis scripts.
    2. Modify the template processing_code.py according to your needs (file paths, time resolution, calibration values, ...)

How to run processing_code.py

The code contains several blocks for the different processing steps. This division helps to troubleshoot possible problems in file reading/writing.
The different blocks correspond to the actions below (a minimal usage sketch follows the list):

  1. Define paths and other variables

  2. From .csv/.zip to parquet for (a) PbP files and (b) HK files

  3. Analysis of the single-particle data (PbP data, not raw traces) at a user-defined time resolution. Operations performed in this block:

    • Read the specified config file (!! for the moment it is assumed that all the files processed in one go have the same config parameters)
    • Apply scattering and incandescence calibration parameters
    • Flag data according to config file parameters (e.g., Incand Transit Time, Incand FWHM, Incand relPeak, ...)
    • Flag data according to mixing state (see dedicated section)
    • Resample the PbP data to the specified time resolution (1s is the usual and suggested value). Some columns are summed (e.g., BC mass for BC mass concentration) and some are counted (e.g., BC mass for BC number concentration, and the optical diameter of purely scattering particles for their number concentration).
    • Resample the flow columns in the HK data to the same time resolution
    • Create a joint pbp_hk file at the specified time resolution
    • Calculate distributions for the different flags. The time resolution is the same as for the computations above. Min, max, and numb_bins for time delay, [BC mass, BC numb], and Scatt numb are defined by the user.
    • Merge pbp_hk and the distributions into one variable and save it as parquet files partitioned by date and hour
  4. Resample the pbp_processed data to another time resolution

  5. From .sp2b to parquet. See notes below for the analysis of the raw traces. This block can usually be skipped.

  6. Process the sp2b.parquet files
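As a rough orientation, the blocks map onto the functions mentioned in this README as in the sketch below. It is only a sketch: the import path is an assumption and the arguments are omitted (shown as `...`); the template processing_code.py contains the actual calls and parameters.

```python
# Sketch only: the function names are from this README, but the import path is
# an assumption and the arguments are omitted; see processing_code.py for the
# real signatures.
from sp2xr import read_csv_files_with_dask, process_pbp_parquet, resample_to_dt  # hypothetical import path

# Block 1: define paths and other variables (see the path sketch above).

# Block 2: convert the raw .csv/.zip PbP and HK files to parquet
# (fills SP2XR_pbp_parquet and SP2XR_hk_parquet).
read_csv_files_with_dask(...)

# Block 3: process the PbP parquet files (calibration, flags, 1s resampling,
# distributions; fills SP2XR_pbp_processed).
process_pbp_parquet(...)

# Block 4: resample the processed data to another time resolution,
# e.g. 1min (fills SP2XR_pbp_processed_1min).
resample_to_dt(...)

# Blocks 5-6 (raw .sp2b traces) can usually be skipped.
```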

Known bugs and missing features

  1. Missing feature: Currently, the code does not take the sampling frequency into account; it assumes that all BC and BC-free particles are recorded in the PbP file.
  2. Bug: When calculating the distributions (histograms) at a time resolution other than 1s, the calculation is wrong! Currently, to obtain correctly calculated distributions you have to process the data at 1s and then average to 1min (see the sketch below).
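Because of bug 2, the suggested route is to let block 3 aggregate the per-particle data at 1s (summing some columns, counting others, as described above) and only then average to 1min with block 4. The idea, illustrated with plain pandas on a toy table (the column names are made up for this example and are not the actual SP2-XR column names):

```python
import numpy as np
import pandas as pd

# Toy per-particle table: one row per detected particle, regularly spaced in
# time only for simplicity (column names are illustrative).
idx = pd.date_range("2020-04-15 00:00:00", periods=10_000, freq="ms")
pbp = pd.DataFrame({"bc_mass_fg": np.random.lognormal(0.5, 1.0, len(idx))}, index=idx)

# Aggregate to 1s: the sum feeds a mass concentration, the count feeds a
# number concentration.
per_second = pbp.resample("1s").agg({"bc_mass_fg": ["sum", "count"]})
per_second.columns = ["bc_mass_sum", "bc_number"]

# Average the 1s product to 1min instead of computing distributions
# directly at 1min.
per_minute = per_second.resample("1min").mean()
```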