Docs update 13.04.2022

2022-04-13 16:03:50 +02:00
parent 9837429824
commit b8ff389a9d
8 changed files with 99 additions and 34 deletions


@ -28,7 +28,7 @@ ls ~/.ssh/id*
For creating **SSH RSA Keys**, one should:
1. Run `ssh-keygen`; a password will be requested twice. You **must remember** this password for the future.
* Due to security reasons, ***always add a password***. Never leave an empty password.
  * Due to security reasons, ***always try protecting it with a password***. The only exception is when running ANSYS software, which in general should not use a password, in order to simplify running the software through Slurm.
* This will generate a private key **id_rsa**, and a public key **id_rsa.pub** in your **~/.ssh** directory.
2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
```bash
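# A minimal sketch, assuming the default key name id_rsa:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
```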


@ -75,6 +75,32 @@ gpu up 7-00:00:00 1-infinite no NO all 8 allocate
</pre>
</details>
### Job accounting
Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
This command is very flexible and can provide a lot of information. For all the available options, please read `man sacct`.
Below, we summarize some examples that can be useful for users:
```bash
# Today's jobs, basic summary
sacct
# Today's jobs, with details
sacct --long
# Jobs since January 1, 2021, 12pm (noon), with details
sacct -S 2021-01-01T12:00:00 --long
# Specific job accounting
sacct --long -j $jobid
# Jobs custom details, without steps (-X)
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
# Jobs custom details, with steps
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
```
### Job efficiency
Users can check how efficient their jobs are. For that, the `seff` command is available.
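As a quick sketch (the job ID below is just a placeholder):
```bash
# Report the CPU and memory efficiency of a finished job
seff 1234567
```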


@ -28,6 +28,7 @@ scancel job_id # to cancel slurm job, job id is the numeric id, seen by the sq
sview # X interface for managing jobs and tracking job run information.
seff # Calculates the efficiency of a job
sjstat # List attributes of jobs under SLURM control
sacct # Show job accounting, useful for checking details of finished jobs.
```
---


@ -69,19 +69,24 @@ The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**,
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#3</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-06]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">1.2TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[07-12]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">768GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3"><b>#3</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-12]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="3"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">1.2TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">768GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[03-18]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[19-24]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
</tbody>
</table>
Each blade contains an NVMe disk, where up to 300GB are dedicated to the O.S. and ~1.2TB are reserved for local `/scratch`.


@ -14,12 +14,14 @@ This documentation shows basic Slurm configuration and options needed to run job
The following table shows the default and maximum resources that can be used per node:
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[101-124] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[201-224] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[301-306] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:--------------------:| ---------:| :--------:| :------: | :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[101-124] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[201-224] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[301-312] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
| merlin-c-[313-318] | 1 core | 44 cores | 1 | 748800 | 748800 | 10000 | N/A | N/A |
| merlin-c-[319-324] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
If nothing is specified, by default each core will use up to 8GB of memory (the public partitions set a `DefMemPerCPU` of 4000MB, and each core runs 2 threads). Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options; the maximum memory allowed is `Max.Mem/Node`.
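For illustration, a minimal batch script sketch raising the memory request (`my_application` is a placeholder, not part of the original documentation):
```bash
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=8000    # request 8000MB per CPU instead of the 4000MB default
srun ./my_application         # placeholder for the real program
```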
@ -57,12 +59,15 @@ Users might need to specify the Slurm partition. If no partition is specified, i
The following *partitions* (also known as *queues*) are configured in Slurm:
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 |
| **daily** | 1 day | 1 day | 67 | 500 | 1 |
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 |
| **gfa-asa** | 1 day | 1 week | 11 | 1000 | 1000 |
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |:------------:|
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 | 4000 |
| **daily** | 1 day | 1 day | 67 | 500 | 1 | 4000 |
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 | 4000 |
| **asa-general** | 1 hour | 2 weeks | unlimited | 1 | 2 | 3712 |
| **asa-daily** | 1 hour | 1 week | unlimited | 500 | 2 | 3712 |
| **asa-ansys** | 1 hour | 90 days | unlimited | 1000 | 4 | 15600 |
| **mu3e** | 1 day | 7 days | unlimited | 1000 | 4 | 3712 |
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
@ -74,8 +79,7 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
* The **`general`** partition is the **default**; selecting a different partition is sketched below this list. It cannot have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
* For **`hourly`** there are no limits.
* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
* **`asa-general`, `asa-daily`, `asa-ansys`, `asa-visas` and `mu3e`** are **private hidden** partitions, belonging to the different experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, these nodes are usually also added to the **`hourly`** partition as extra public resources.
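A minimal sketch of selecting a partition at submission time (`job.sh` is a placeholder batch script):
```bash
# Submit to the hourly partition (the job must fit within its 1 hour limit)
sbatch --partition=hourly job.sh
# Short form, here targeting the daily partition
sbatch -p daily job.sh
```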
{{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
than one hour should be sent to <b>hourly</b>. This ensures that your jobs have higher priority than jobs sent to partitions with lower priority,