Docs update 13.04.2022

2022-04-13 16:03:50 +02:00
parent 9837429824
commit b8ff389a9d
8 changed files with 99 additions and 34 deletions


@ -28,7 +28,7 @@ ls ~/.ssh/id*
For creating **SSH RSA Keys**, one should:
1. Run `ssh-keygen`; a password will be requested twice. You **must remember** this password for the future.
* Due to security reasons, ***always add a password***. Never leave an empty password.
  * Due to security reasons, ***always try protecting it with a password***. The only exception is when running ANSYS software, which in general should not use a password, in order to simplify running the software through Slurm.
* This will generate a private key **id_rsa**, and a public key **id_rsa.pub** in your **~/.ssh** directory.
2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
```bash
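# A minimal sketch, assuming the default key name id_rsa:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
```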


@ -75,6 +75,32 @@ gpu up 7-00:00:00 1-infinite no NO all 8 allocate
</pre>
</details>
### Job accounting
Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
This command is very flexible and can provide a lot of information. For all the available options, please read `man sacct`.
Below, we summarize some examples that can be useful for users:
```bash
# Today's jobs, basic summary
sacct
# Today's jobs, with details
sacct --long
# Jobs since January 1, 2021, 12pm (noon), with details
sacct -S 2021-01-01T12:00:00 --long
# Specific job accounting
sacct --long -j $jobid
# Jobs custom details, without steps (-X)
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
# Jobs custom details, with steps
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
```
### Job efficiency
Users can check how efficient their jobs are. For that, the `seff` command is available.
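As a quick sketch (the job ID below is just a placeholder):
```bash
# Report the CPU and memory efficiency of a finished job
seff 1234567
```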


@ -28,6 +28,7 @@ scancel job_id # to cancel slurm job, job id is the numeric id, seen by the sq
sview # X interface for managing jobs and tracking job run information.
seff # Calculates the efficiency of a job
sjstat # List attributes of jobs under SLURM control
sacct # Show job accounting, useful for checking details of finished jobs.
```
---


@ -69,19 +69,24 @@ The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**,
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#3</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-06]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">1.2TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[07-12]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">768GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3"><b>#3</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-12]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="3"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="3">1.2TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">768GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[03-18]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-3[19-24]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
</tr>
</tbody>
</table>
Each blade contains an NVMe disk, where up to 300GB are dedicated to the O.S. and ~1.2TB are reserved for local `/scratch`.


@ -14,12 +14,14 @@ This documentation shows basic Slurm configuration and options needed to run job
The following table shows the default and maximum resources that can be used per node:
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[101-124] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[201-224] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[301-306] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:--------------------:| ---------:| :--------:| :------: | :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[101-124] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[201-224] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[301-312] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
| merlin-c-[313-318] | 1 core | 44 cores | 1 | 748800 | 748800 | 10000 | N/A | N/A |
| merlin-c-[319-324] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
If nothing is specified, by default each core will use up to 8GB of memory (the public partitions set a `DefMemPerCPU` of 4000MB, and each core runs 2 threads). Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options; the maximum memory allowed is `Max.Mem/Node`.
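For illustration, a minimal batch script sketch raising the memory request (`my_application` is a placeholder, not part of the original documentation):
```bash
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=8000    # request 8000MB per CPU instead of the 4000MB default
srun ./my_application         # placeholder for the real program
```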
@ -57,12 +59,15 @@ Users might need to specify the Slurm partition. If no partition is specified, i
The following *partitions* (also known as *queues*) are configured in Slurm:
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 |
| **daily** | 1 day | 1 day | 67 | 500 | 1 |
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 |
| **gfa-asa** | 1 day | 1 week | 11 | 1000 | 1000 |
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |:------------:|
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 | 4000 |
| **daily** | 1 day | 1 day | 67 | 500 | 1 | 4000 |
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 | 4000 |
| **asa-general** | 1 hour | 2 weeks | unlimited | 1 | 2 | 3712 |
| **asa-daily** | 1 hour | 1 week | unlimited | 500 | 2 | 3712 |
| **asa-ansys** | 1 hour | 90 days | unlimited | 1000 | 4 | 15600 |
| **mu3e** | 1 day | 7 days | unlimited | 1000 | 4 | 3712 |
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
@ -74,8 +79,7 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
* The **`general`** partition is the **default**; selecting a different partition is sketched below this list. It cannot have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
* For **`hourly`** there are no limits.
* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
* **`asa-general`, `asa-daily`, `asa-ansys`, `asa-visas` and `mu3e`** are **private hidden** partitions, belonging to the different experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, these nodes are usually also added to the **`hourly`** partition as extra public resources.
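A minimal sketch of selecting a partition at submission time (`job.sh` is a placeholder batch script):
```bash
# Submit to the hourly partition (the job must fit within its 1 hour limit)
sbatch --partition=hourly job.sh
# Short form, here targeting the daily partition
sbatch -p daily job.sh
```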
{{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
than one hour should be sent to <b>hourly</b>. This ensures that your jobs have higher priority than jobs sent to partitions with lower priority,