Data Analysis Setup and Execution Guide
szola_m edited this page 2024-04-08 13:17:14 +02:00

Welcome to our data analysis guide designed for the Cristallina projects. This page will walk you through setting up and executing a key part of our data processing, focusing on efficiency and reproducibility.

Our script leverages various Python libraries to manage, analyze, and visualize scientific data effectively.

Prerequisites

Before diving into the script, ensure you have the following libraries installed:

- Matplotlib
- NumPy
- SciPy
- Pandas
- Joblib
- tqdm
- SFData
- Our custom library: Cristallina

These can typically be installed via pip, for example:

pip install matplotlib numpy scipy pandas joblib sfdata tqdm

Script

%matplotlib widget
%load_ext autoreload
%autoreload 2

import os
import json
import time
import logging
from pathlib import Path
from collections import defaultdict, deque

import numpy as np
import scipy
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import cm
from tqdm import tqdm
from joblib import Parallel, delayed, Memory

import sfdata
from sfdata import SFProcFile, SFDataFile, SFDataFiles, SFScanInfo

import cristallina as cr
import cristallina.utils as cu

logger = logging.getLogger()

# On-disk cache for computation-heavy functions; adjust the location to your pgroup.
memory = Memory(location="/sf/cristallina/data/p21640/work/joblib", compress=2, verbose=2)

pgroup = "p21640"

Overview of the Script

The script begins by setting up an interactive plotting session with %matplotlib widget (provided by the ipympl package), making it easier to interact with plots directly in Jupyter notebooks or similar environments.

Key components of the script include:

Data Loading and Caching: Utilizes joblib for caching results of computation-heavy functions, speeding up repeat analyses.

Data Processing and Visualization: Employs matplotlib for creating plots, numpy and scipy for numerical operations, and pandas for data manipulation.

Custom Utilities: Uses cristallina library functions for specific data analysis tasks related to our research.
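The caching and parallelism components can be sketched with a minimal, self-contained example. The per-shot analysis function and its inputs below are hypothetical stand-ins, not part of the actual pipeline:

```python
import numpy as np
from joblib import Parallel, delayed

def analyze_shot(seed: int) -> float:
    """Hypothetical stand-in for a per-shot analysis step."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(loc=5.0, scale=0.5, size=10_000)
    return float(signal.mean())

# Fan the per-shot analysis out over worker processes.
means = Parallel(n_jobs=2)(delayed(analyze_shot)(s) for s in range(8))
```

Each delayed call is dispatched to a worker, which is the same pattern the script uses for independent per-run computations.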

Step-by-Step Guide

Step 1: Setting Up Interactive Plotting

To enable interactive plotting in a Jupyter notebook, include the following at the beginning of your script:

%matplotlib widget

Step 2: Importing Libraries

Import the necessary Python libraries as shown in the initial code snippet. This includes both standard libraries like os and json, as well as scientific computing libraries such as numpy and matplotlib.

Step 3: Initialize Logging and Caching

Set up a caching mechanism with joblib to store intermediate results, which can significantly reduce computation time for repeated operations:

from joblib import Memory
memory = Memory(location="/sf/cristallina/data/p21640/work/joblib", compress=2, verbose=2)

Adjust the location parameter to match your directory structure.
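In practice, a computation-heavy function is decorated with memory.cache so that repeat calls with the same arguments are read back from disk instead of recomputed. The function below is a hypothetical stand-in, and the cache path is a local placeholder:

```python
import time
import numpy as np
from joblib import Memory

# Hypothetical local cache directory; on the beamline this would point
# at the pgroup's work area, as shown in Step 3 above.
memory = Memory(location="./joblib_cache_demo", verbose=0)

@memory.cache
def load_and_reduce(run: int) -> float:
    """Hypothetical stand-in for an expensive load-and-reduce step."""
    time.sleep(0.5)  # simulate slow I/O or computation
    rng = np.random.default_rng(run)
    return float(rng.random(1000).sum())

first = load_and_reduce(42)   # computed and written to the on-disk cache
second = load_and_reduce(42)  # identical arguments: served from the cache
```

joblib keys the cache on the function's arguments and its source code, so editing the function body automatically invalidates stale entries.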

Step 4: Data Analysis with Cristallina

The script utilizes custom utilities from the cristallina library for specific data analysis tasks. Ensure you're familiar with its modules and functions for effective use.

Step 5: Processing Data

Leverage the power of numpy, pandas, and matplotlib for analyzing and visualizing data. The script includes examples of loading data, performing computations, and generating plots.
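A minimal sketch of that load/compute/plot cycle, using synthetic scan data rather than real Cristallina output (the energy range, peak model, and output file name are invented for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside Jupyter
import matplotlib.pyplot as plt

# Hypothetical scan: photon energy steps with a noisy Gaussian response.
energies = np.linspace(700.0, 710.0, 51)
rng = np.random.default_rng(0)
intensity = np.exp(-((energies - 705.0) ** 2) / 2.0) + rng.normal(0, 0.01, energies.size)

df = pd.DataFrame({"energy_eV": energies, "intensity": intensity})

# Simple reduction: peak position from the maximum of the measured curve.
peak_energy = float(df.loc[df["intensity"].idxmax(), "energy_eV"])

fig, ax = plt.subplots()
ax.plot(df["energy_eV"], df["intensity"], marker=".")
ax.axvline(peak_energy, linestyle="--")
ax.set_xlabel("Photon energy (eV)")
ax.set_ylabel("Intensity (arb. u.)")
fig.savefig("scan_overview.png")
```

In the real pipeline the DataFrame would be filled from channels loaded via sfdata rather than synthesized, but the reduce-then-plot structure is the same.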

Conclusion

This guide outlines the setup and basic usage of a comprehensive data analysis pipeline using Python. Tailor the script to your specific project needs, focusing on the flexibility and power of the libraries involved. For further customization or troubleshooting, refer to the documentation of the respective libraries or consult with our team.