dima/notebooks/demo_data_integration.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data integration workflow of an experimental campaign\n",
    "\n",
    "In this notebook, we will go through our data integration workflow. This involves the following steps:\n",
    "\n",
    "1. Specify the data integration task through a YAML configuration file.\n",
    "2. Create an integrated HDF5 file of the experimental campaign from the configuration file.\n",
    "3. Display the created HDF5 file using a treemap.\n",
    "\n",
    "## Import libraries and modules\n",
    "\n",
    "* Execute (or run) the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nbutils import add_project_path_to_sys_path\n",
    "\n",
    "# Add project root to sys.path\n",
    "add_project_path_to_sys_path()\n",
    "\n",
    "try:\n",
    "    import visualization.hdf5_vis as hdf5_vis\n",
    "    import pipelines.data_integration as data_integration\n",
    "    print(\"Imports successful!\")\n",
    "except ImportError as e:\n",
    "    print(f\"Import error: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Configure Your Data Integration Task\n",
    "\n",
    "1. Based on one of the example `.yaml` files found in the `input_files/` folder, define the input and output directory paths inside the file.\n",
    "\n",
    "2. When working with network drives, create a `.env` file in the root of the `dima/` project with the following line:\n",
    "\n",
    "     ```dotenv\n",
    "     NETWORK_MOUNT=//your-server/your-share\n",
    "     ```\n",
    "3. Execute the cell below.\n",
    "\n",
    "**Note:** Ensure `.env` is listed in `.gitignore` and `.dockerignore`.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "number, initials = 2, 'TBR' # Set as either 2, 'TBR' or 3, 'NG'\n",
    "campaign_descriptor_path = f'../input_files/campaignDescriptor{number}_{initials}.yaml'\n",
    "\n",
    "print(campaign_descriptor_path)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create an integrated HDF5 file of the experimental campaign\n",
    "\n",
    "* Execute the cell below. It calls `data_integration.run_pipeline` with the previously specified YAML configuration file as its input argument and stores the returned HDF5 file path (or list of paths) in `hdf5_file_path`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "hdf5_file_path = data_integration.run_pipeline(campaign_descriptor_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdf5_file_path "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Display the integrated HDF5 file using a treemap\n",
    "\n",
    "* Execute the cell below. A visual representation of the integrated file, in HTML format, should be displayed and stored in the output directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The pipeline may return a single path or a list of paths; handle both cases.\n",
    "if isinstance(hdf5_file_path, list):\n",
    "    for path_item in hdf5_file_path:\n",
    "        hdf5_vis.display_group_hierarchy_on_a_treemap(path_item)\n",
    "else:\n",
    "    hdf5_vis.display_group_hierarchy_on_a_treemap(hdf5_file_path)"
   ]
  },
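  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import src.hdf5_ops as h5de\n",
    "# Use the first file if the pipeline returned a list of paths.\n",
    "first_path = hdf5_file_path[0] if isinstance(hdf5_file_path, list) else hdf5_file_path\n",
    "h5de.serialize_metadata(first_path, folder_depth=3, output_format='yaml')"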
   ]
  },
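  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import src.hdf5_ops as h5de\n",
    "\n",
    "print(hdf5_file_path)\n",
    "# Open the first file if the pipeline returned a list of paths.\n",
    "first_path = hdf5_file_path[0] if isinstance(hdf5_file_path, list) else hdf5_file_path\n",
    "DataOpsAPI = h5de.HDF5DataOpsManager(first_path)\n",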
    "\n",
    "DataOpsAPI.load_file_obj()\n",
    "\n",
    "#DataOpsAPI.reformat_datetime_column('ICAD/HONO/2022_11_22_Channel1_Data.dat/data_table',\n",
    "#                                    'Start Date/Time (UTC)',\n",
    "#                                    '%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S')\n",
    "DataOpsAPI.extract_and_load_dataset_metadata()\n",
    "df = DataOpsAPI.dataset_metadata_df\n",
    "print(df.head())\n",
    "\n",
    "DataOpsAPI.unload_file_obj()\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DataOpsAPI.load_file_obj()\n",
    "\n",
    "DataOpsAPI.append_metadata('/',{'test_attr':'this is a test value'})\n",
    "\n",
    "DataOpsAPI.unload_file_obj()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "multiphase_chemistry_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}