{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data integration workflow for an experimental campaign\n",
"\n",
"In this notebook, we go through our data integration workflow. This involves the following steps:\n",
"\n",
"1. Specify the data integration task through a YAML configuration file.\n",
"2. Create an integrated HDF5 file of the experimental campaign from the configuration file.\n",
"3. Display the created HDF5 file using a treemap.\n",
"\n",
"## Import libraries and modules\n",
"\n",
"* Execute (or run) the cell below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\python311.zip\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\DLLs\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\Lib\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\n",
"\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\Lib\\site-packages\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\Lib\\site-packages\\win32\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\Lib\\site-packages\\win32\\lib\n",
"c:\\Users\\florez_j\\.conda\\envs\\dash_multi_chem_env\\Lib\\site-packages\\Pythonwin\n",
"c:\\Users\\florez_j\\Documents\\GitLab\\acsmnode\n",
"c:\\Users\\florez_j\\Documents\\GitLab\\acsmnode\\dima\n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"\n",
"# Set up the project root and dima directories\n",
"notebook_dir = os.getcwd()  # Current working directory (assumes running from notebooks/)\n",
"project_path = os.path.normpath(os.path.join(notebook_dir, \"..\"))  # Move up to project root\n",
"dima_path = os.path.normpath(os.path.join(project_path, \"dima\"))  # dima directory inside the project root\n",
"\n",
"# Show the module search path before modifying it\n",
"for item in sys.path:\n",
"    print(item)\n",
"\n",
"if project_path not in sys.path:  # Avoid duplicate entries\n",
"    sys.path.append(project_path)\n",
"    print(project_path)\n",
"if dima_path not in sys.path:\n",
"    sys.path.insert(0, dima_path)\n",
"    print(dima_path)\n",
"\n",
"import dima.visualization.hdf5_vis as hdf5_vis\n",
"import dima.pipelines.data_integration as data_integration\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Specify the data integration task through a YAML configuration file\n",
"\n",
"* Create your configuration file (i.e., a `*.yaml` file) following the example YAML file in the input folder.\n",
"* Set up the input and output directory paths, then execute the cell below.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#yaml_config_file_path = 'dima/input_files/data_integr_config_file_TBR.yaml'\n",
"yaml_config_file_path = '../campaignDescriptor.yaml'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create an integrated HDF5 file of the experimental campaign\n",
"\n",
"* Execute the cell below. It calls `data_integration.run_pipeline` with the previously specified YAML configuration file as its input argument."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"[Start] Data integration :\n",
"Source: ..\\data\\collection_JFJ_2024_2025-03-14_2025-03-14\n",
"Destination: ..\\data\\collection_JFJ_2024_2025-03-14_2025-03-14.h5\n",
"\n",
"Starting data transfer from instFolder: /ACSM_TOFWARE/2024\n",
"Completed transfer for //ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt\n",
"Completed transfer for //ACSM_TOFWARE/2024/ACSM_JFJ_2024_timeseries.txt\n",
"Completed transfer for //ACSM_TOFWARE/2024/Org_data_valid.csv\n",
"Completed transfer for //ACSM_TOFWARE/2024/Org_err_valid.csv\n",
"Completed transfer for //ACSM_TOFWARE/2024/Org_mz_valid.csv\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\florez_j\\Documents\\GitLab\\acsmnode\\dima\\instruments\\readers\\acsm_tofware_reader.py:112: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" df = pd.read_csv(tmp_filename,\n",
"c:\\Users\\florez_j\\Documents\\GitLab\\acsmnode\\dima\\instruments\\readers\\acsm_tofware_reader.py:112: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" df = pd.read_csv(tmp_filename,\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Completed transfer for //ACSM_TOFWARE/2024/Org_time_valid.csv\n",
"[====================================================================================================] 100.0% ...\n",
"Completed data transfer for instFolder: /ACSM_TOFWARE/2024\n",
"[End] Data integration\n"
]
}
],
"source": [
"\n",
"hdf5_file_path = data_integration.run_pipeline(yaml_config_file_path)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['..\\\\data\\\\collection_JFJ_2024_LeilaS_2025-02-22_2025-02-22.h5']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hdf5_file_path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Display the integrated HDF5 file using a treemap\n",
"\n",
"* Execute the cell below. An HTML visual representation of the integrated file should be displayed and stored in the output directory."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/ACSM_TOFWARE\n",
"/ACSM_TOFWARE/2024\n",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt\n",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_timeseries.txt\n",
"/ACSM_TOFWARE/2024/Org_data_valid.csv\n",
"/ACSM_TOFWARE/2024/Org_err_valid.csv\n",
"/ACSM_TOFWARE/2024/Org_mz_valid.csv\n",
"/ACSM_TOFWARE/2024/Org_time_valid.csv\n"
]
},
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"branchvalues": "remainder",
"customdata": [
"<br><br>project: Building FAIR data chains for atmospheric observations in the ACTRIS Switzerland Network<br>experiment: acsm_campaign<br>contact: LeilaS<br>level: 1",
"/ACSM_TOFWARE",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_timeseries.txt",
"/ACSM_TOFWARE/2024/Org_data_valid.csv",
"/ACSM_TOFWARE/2024/Org_err_valid.csv",
"/ACSM_TOFWARE/2024/Org_mz_valid.csv",
"/ACSM_TOFWARE/2024/Org_time_valid.csv"
],
"hovertemplate": "<b>%{label} </b> <br> Count: %{value} <br> Path: %{customdata}",
"labels": [
"/",
"/ACSM_TOFWARE",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_meta.txt",
"/ACSM_TOFWARE/2024/ACSM_JFJ_2024_timeseries.txt",
"/ACSM_TOFWARE/2024/Org_data_valid.csv",
"/ACSM_TOFWARE/2024/Org_err_valid.csv",
"/ACSM_TOFWARE/2024/Org_mz_valid.csv",
"/ACSM_TOFWARE/2024/Org_time_valid.csv"
],
"name": "",
"parents": [
"",
"/",
"/ACSM_TOFWARE",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024",
"/ACSM_TOFWARE/2024"
],
"root": {
"color": "lightgrey"
},
"type": "treemap",
"values": [
1,
1,
6,
1,
1,
1,
1,
1,
1
]
}
],
"layout": {
"height": 600,
"margin": {
"b": 25,
"l": 25,
"r": 25,
"t": 50
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"heatmapgl": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmapgl"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"width": 800
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# hdf5_file_path may be a single path or a list of paths; handle both\n",
"if isinstance(hdf5_file_path, list):\n",
"    for path_item in hdf5_file_path:\n",
"        hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
"else:\n",
"    hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dima.pipelines.metadata_revision as metadata\n",
"import dima.src.hdf5_ops as h5de\n",
"\n",
"# Channel names, without their time columns\n",
"channels1 = ['Chl_11000','NH4_11000','SO4_11000','NO3_11000','Org_11000']\n",
"channels2 = ['FilamentEmission_mA','VaporizerTemp_C','FlowRate_mb','ABsamp']\n",
"\n",
"# Each entry points to a dataset in the HDF5 file and lists its time column followed by the channels\n",
"target_channels = {'location': 'ACSM_TOFWARE/ACSM_JFJ_2024_JantoFeb_timeseries.txt/data_table',\n",
"                   'names': ','.join(['t_start_Buf'] + channels1)}\n",
"diagnostic_channels = {'location': 'ACSM_TOFWARE/ACSM_JFJ_2024_JantoFeb_meta.txt/data_table',\n",
"                       'names': ','.join(['t_base'] + channels2)}\n",
"\n",
"# hdf5_file_path is a list here (see the output above); use the first integrated file\n",
"DataOpsAPI = h5de.HDF5DataOpsManager(hdf5_file_path[0])\n",
"\n",
"DataOpsAPI.load_file_obj()\n",
"DataOpsAPI.append_metadata('/ACSM_TOFWARE/', {'target_channels': target_channels, 'diagnostic_channels': diagnostic_channels})\n",
"\n",
"# Reformat the datetime columns, given their source formats\n",
"DataOpsAPI.reformat_datetime_column('ACSM_TOFWARE/ACSM_JFJ_2024_JantoFeb_timeseries.txt/data_table', 't_start_Buf', src_format='%d.%m.%Y %H:%M:%S.%f')\n",
"DataOpsAPI.reformat_datetime_column('ACSM_TOFWARE/ACSM_JFJ_2024_JantoFeb_meta.txt/data_table', 't_base', src_format='%d.%m.%Y %H:%M:%S')\n",
"\n",
"DataOpsAPI.unload_file_obj()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "dash_multi_chem_env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}