From 08c999f97a4d5520ba8e0e9b26098ed5a6e640db Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Wed, 18 Dec 2024 17:11:12 +0100
Subject: [PATCH] Update

---
 _data/sidebars/merlin7_sidebar.yml            |   2 +
 .../02-How-To-Use-Merlin/transfer-data.md     |   9 +-
 .../merlin7-configuration.md                  |  68 +++++++
 .../slurm-configuration.md                    | 186 ++++++++++++++----
 4 files changed, 218 insertions(+), 47 deletions(-)
 create mode 100644 pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md

diff --git a/_data/sidebars/merlin7_sidebar.yml b/_data/sidebars/merlin7_sidebar.yml
index 5c965e8..df6f9f7 100644
--- a/_data/sidebars/merlin7_sidebar.yml
+++ b/_data/sidebars/merlin7_sidebar.yml
@@ -47,6 +47,8 @@ entries:
   - title: Slurm General Documentation
     folderitems:
     - title: Merlin7 Infrastructure
+      url: /merlin7/merlin7-configuration.html
+    - title: Slurm Configuration
       url: /merlin7/slurm-configuration.html
     - title: Running Slurm Interactive Jobs
      url: /merlin7/interactive-jobs.html
diff --git a/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md b/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
index dd17171..09230a8 100644
--- a/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
+++ b/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
@@ -15,15 +15,16 @@ initiate the transfer from either merlin or the other system, depending on the network
 visibility.
 
 - Merlin login nodes are visible from the PSI network, so direct data transfer
-  (rsync/WinSCP) is generally preferable. This can be initiated from either endpoint.
-- Merlin login nodes can access the internet using a limited set of protocols
-  - SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc)
+  (rsync/WinSCP/sftp) is generally preferable.
+  - Protocols from Merlin7 to PSI may require special firewall rules.
+- Merlin login nodes can access the internet using a limited set of protocols:
   - HTTP-based protocols using ports 80 or 445 (https, WebDav, etc)
   - Protocols using other ports require admin configuration and may only work with
-    specific hosts (ftp, rsync daemons, etc)
+    specific hosts, and may require new firewall rules (ssh, ftp, rsync daemons, etc).
 - Systems on the internet can access the [PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer)
   service `datatransfer.psi.ch`, using ssh-based protocols and [Globus](https://www.globus.org/)
+SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc.) to most PSI servers are, in general, not permitted.
 
 ## Direct transfer via Merlin7 login nodes
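
As a usage sketch for the options described above (host name, paths and user name are placeholders to adapt; only `datatransfer.psi.ch` is taken from the text):

```bash
# From a workstation inside the PSI network, push a local directory to a
# Merlin7 login node with rsync over ssh (login node name and target path
# are placeholders):
rsync -avz --progress ./my_dataset/ $USER@<merlin7-login-node>:/some/target/directory/

# From a system outside PSI, route the transfer through the PSI Data Transfer
# service instead, e.g. interactively with sftp:
sftp $USER@datatransfer.psi.ch
```

diff --git a/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md b/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md
new file mode 100644
index 0000000..da80902
--- /dev/null
+++ b/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md
@@ -0,0 +1,68 @@
+---
+title: Slurm cluster 'merlin7'
+#tags:
+keywords: configuration, partitions, node definition
+#last_updated: 24 May 2023
+summary: "This document describes a summary of the Merlin7 configuration."
+sidebar: merlin7_sidebar
+permalink: /merlin7/merlin7-configuration.html
+---
+
+![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
+
+{{site.data.alerts.warning}}The Merlin7 documentation is Work In Progress.
+Please do not use or rely on this documentation until this becomes official.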
+This applies to any page under https://lsm-hpce.gitpages.psi.ch/merlin7/
+{{site.data.alerts.end}}
+
+This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
+
+## Infrastructure
+
+### Hardware
+
+The current configuration for the _preproduction_ phase (and likely the production phase) is made up as follows:
+
+* 92 nodes in total for Merlin7:
+  * 2 CPU-only login nodes
+  * 77 CPU-only compute nodes
+  * 5 GPU A100 nodes
+  * 8 GPU Grace Hopper nodes
+
+The specification of the node types is:
+
+| Node | CPU | RAM | GRES | Notes |
+| ---- | --- | --- | ---- | ----- |
+| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | | For both the login and CPU-only compute nodes |
+| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | _4x_ NVidia A100 (Ampere, 80GB) | |
+| GH Node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |
+
+### Network
+
+The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
+to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at
+[HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN).
+
+Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
+
+### Storage
+
+Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
+a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).
+
+The appliance is built of several storage servers:
+
+* 2 management nodes
+* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
+* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (GridRAID / RAID6)
+* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)
+
+With an effective storage capacity of:
+
+* 10 PB HDD
+  * value visible on Linux: HDD 9302.4 TiB
+* 162 TB SSD
+  * value visible on Linux: SSD 151.6 TiB
+* 23.6 TiB for metadata
+
+The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
diff --git a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
index f25f279..c939bd3 100644
--- a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
+++ b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
@@ -1,68 +1,168 @@
 ---
-title: Slurm cluster 'merlin7'
+title: Slurm merlin7 Configuration
 #tags:
 keywords: configuration, partitions, node definition
 #last_updated: 24 Mai 2023
-summary: "This document describes a summary of the Merlin7 configuration."
+summary: "This document describes a summary of the Merlin7 Slurm CPU-based configuration."
 sidebar: merlin7_sidebar
 permalink: /merlin7/slurm-configuration.html
 ---
 
-![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
-
-{{site.data.alerts.warning}}The Merlin7 documentation is Work In Progress.
-Please do not use or rely on this documentation until this becomes official.
-This applies to any page under https://lsm-hpce.gitpages.psi.ch/merlin7/
-{{site.data.alerts.end}}
-
 This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
 
-## Infrastructure
+## General configuration
 
-### Hardware
+The **Merlin7 cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.
+* This configuration treats both cores and memory as consumable resources.
+* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
+  to fulfill a job's resource requirements.
 
-The current configuration for the _preproduction_ phase (and likely the production phase) is made up as:
+By default, Slurm will allocate one task per core, which means:
+* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
 
-* 92 nodes in total for Merlin7:
-  * 2 CPU-only login nodes
-  * 77 CPU-only compute nodes
-  * 5 GPU A100 nodes
-  * 8 GPU Grace Hopper nodes
+This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
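
As an illustration of this accounting (a sketch using standard Slurm commands; the task count is an arbitrary example), a 64-task allocation on these nodes should be granted 128 CPUs:

```bash
# Request an interactive allocation with 64 tasks; with one task per core and
# 2 threads per core, Slurm should report NumCPUs=128 for the job.
salloc --cluster=merlin7 --ntasks=64
scontrol show job $SLURM_JOB_ID | grep -o 'NumCPUs=[0-9]*'
```

 
-The specification of the node types is:
+### Default cluster
 
-| Node | CPU | RAM | GRES | Notes |
-| ---- | --- | --- | ---- | ----- |
-| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200Mhz | | For both the login and CPU-only compute nodes |
-| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200Mhz | _4x_ NVidia A100 (Ampere, 80GB) | |
-| GH Node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |
+By default, jobs will be submitted to **`merlin7`**, as it is the primary cluster configured on the login nodes.
+Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
+However, when necessary, one can specify the cluster as follows:
+```bash
+#SBATCH --cluster=merlin7
+```
 
-### Network
+## Slurm nodes definition
 
-The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
-to provide up to 200 Gbit/s throughput between nodes. Further information on Slignshot can be found on at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
-at .
+The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
+This information is essential for understanding how resources are allocated, enabling users to tailor their submission
+scripts accordingly.
 
-Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which available on Merlin7), application can leverage the network seamlessly.
+| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
+| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
+| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
+| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
 
-### Storage
-
-Unlike previous iteration of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead storage for the entire cluster is provided through
-a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).
+Notes on memory configuration:
+* **Memory allocation options:** To request additional memory, use the following options in your submission script:
+  * **`--mem=`**: Allocates memory per node.
+  * **`--mem-per-cpu=`**: Allocates memory per CPU (equivalent to a core thread).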
-Unlike previous iteration of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead storage for the entire cluster is provided through -a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf). + The total memory requested cannot exceed the **`MaxMemPerNode`** value. +* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core, +effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly +adjusted. -The appliance is built of several storage servers: + For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended. + In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads. -* 2 management nodes -* 2 MDS servers, 12 drives per server, 2.9TiB (Raid10) -* 8 OSS-D servers, 106 drives per server, 14.5 T.B HDDs (Gridraid / Raid6) -* 4 OSS-F servers, 12 drives per server 7TiB SSDs (Raid10) +{{site.data.alerts.tip}} +Always verify the Slurm '/var/spool/slurmd/conf-cache/slurm.conf' configuration file for potential changes. +{{site.data.alerts.end}} -With effective storage capacity of: +### User and job limits with QoS -* 10 PB HDD - * value visible on linux: HDD 9302.4 TiB -* 162 TB SSD - * value visible on linux: SSD 151.6 TiB -* 23.6 TiB on Metadata +In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent +overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster +efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific +limits may result in pending jobs even when many nodes are idle due to low activity. -The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC. +On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing +all available resources, which could block other jobs from running. Without job size limits, for instance, a large job +might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable. + +Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These +limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist +effectively. + +To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied +**to specific partitions** in line with the established resource allocation policies. The table below outlines the +various QoS definitions applicable to the merlin7 CPU-based cluster. Here: +* `MaxTRES` specifies resource limits per job. +* `MaxTRESPU` specifies resource limits per user. + +| Name | MaxTRES | MaxTRESPU | Scope | +| --------------: | -----------------: | -----------------: | ---------------------: | +| **normal** | | | partition | +| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | user, partition | +| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition | +| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition | + +Where: +* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job + restrictions. 
+* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
+  user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
+  overriding user-level QoS.
+* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
+  with higher resource needs.
+* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
+  which caters to very short-duration jobs.
+
+For additional details, refer to the [Partitions](/merlin7/slurm-configuration.html#Partitions) section.
+
+{{site.data.alerts.tip}}
+Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command.
+{{site.data.alerts.end}}
+
+## Partitions
+
+This section provides a summary of the partitions available in the `merlin7` CPU cluster.
+
+Key concepts:
+* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
+  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
+  and especially *fair share* can also influence scheduling.
+* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
+  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
+  partitions, where applicable.
+* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
+  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
+  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.
+
+{{site.data.alerts.tip}}
+Always verify partition configurations for potential changes using the 'scontrol show partition' command.
+{{site.data.alerts.end}}
+
+### Public partitions
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **general** | 1-00:00:00 | 7-00:00:00 | 50 | 1 | 1 | cpu_general | merlin |
+| **daily** | 0-01:00:00 | 1-00:00:00 | 63 | 500 | 1 | cpu_daily | merlin |
+| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | merlin |
+
+All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
+Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.
+
+{{site.data.alerts.tip}}
+For jobs running less than one day, submit them to the daily partition.
+For jobs running less than one hour, use the hourly partition.
+These partitions provide higher priority and ensure quicker scheduling compared to general, which has limited node availability.
+{{site.data.alerts.end}}
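
For instance, a short test job that takes advantage of the `hourly` partition could be sketched like this (resource values are arbitrary examples):

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --partition=hourly     # highest scheduling priority for jobs of up to 1 hour
#SBATCH --time=00:30:00
#SBATCH --ntasks=8

# 'hostname' stands in for a real short-running workload
srun hostname
```

+
+The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
+by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first.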
+As a result, access to the **`hourly`** partition might experience delays in such scenarios.
+
+### Private partitions
+
+#### CAS / ASA
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **asa-general** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |
+| **asa-daily** | 0-01:00:00 | 1-00:00:00 | 10 | 1000 | 2 | normal | asa |
+
+#### CNM / Mu3e
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |
+
+#### CNM / MeG
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
+| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
+| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |
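
Finally, to check which partitions, QoS limits and accounts actually apply to your user, the standard Slurm client commands referenced throughout this page can be combined as follows (the account and job script in the last command are placeholders for illustration):

```bash
# Partitions and their limits as currently configured
scontrol show partition

# QoS definitions (per-job and per-user TRES limits)
sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"

# Accounts and QoS your user is associated with
sacctmgr show associations where user=$USER format="Cluster,Account,User,QOS%40"

# Submitting to a private partition only works for members of the matching
# account, e.g. an mu3e member submitting my_job.sh:
sbatch --cluster=merlin7 --account=mu3e --partition=mu3e my_job.sh
```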