Docs update 13.04.2022
parent 9837429824
commit b8ff389a9d
@@ -16,3 +16,7 @@ entries:
 - title: The Merlin Local HPC Cluster
   url: /merlin6/introduction.html
   output: web
+- title: PSI HPC@CSCS
+  url: /CSCS/index.html
+  output: web
+
@@ -103,6 +103,8 @@ entries:
   url: /merlin6/ansys-fluent.html
 - title: ANSYS/MAPDL
   url: /merlin6/ansys-mapdl.html
+- title: ANSYS/HFSS
+  url: /merlin6/ansys-hfss.html
 - title: ParaView
   url: /merlin6/paraview.html
 - title: Support
@@ -2,7 +2,7 @@
 title: PSI HPC@CSCS Admin Overview
 #tags:
 #keywords:
-last_updated: 22 September 2020
+last_updated: 13 April 2022
 #summary: ""
 sidebar: CSCS_sidebar
 permalink: /CSCS/index.html
@@ -10,13 +10,35 @@ permalink: /CSCS/index.html
 
 ## PSI HPC@CSCS
 
-For offering high-end HPC sources to PSI users, AIT has a long standing col-laboration with the national supercomputing centre CSCS (since 2005).
-Some of the resources are procured by central PSI funds while users have the optionsof an additional buy-in at the same rates.
+To offer high-end HPC resources to PSI users, PSI has a long-standing collaboration with
+the national supercomputing centre CSCS (since 2005). Some of the resources are procured by
+central PSI funds, while users have the option of an additional buy-in at the same rates.
 
 ### PSI resources at Piz Daint
 
-The yearly computing resources at CSCS for the PSI projects are 320,000 NH (Node Hours). The yearly storage resources for the PSI projects is a total of 40TB.
-These resources are centrally financed, but in addition experiments can individually purchase more resources.
+The yearly computing resources at CSCS for the PSI projects are 627,000 NH (Node Hours).
+The yearly storage resources for the PSI projects are a total of 80TB. These resources are
+centrally financed, but in addition experiments can individually purchase more resources.
+
+### How to request a PSI project
+
+A survey is sent out in the third quarter of each year. This survey is used to request
+CSCS resources for the upcoming year.
+
+Users registered in the **PSI HPC@CSCS mailing list** <psi-hpc-at-cscs@lists.psi.ch> will
+receive notification and details about the survey, for example:
+* Link to the survey
+* Update of resource changes
+* Other details of the process
+
+Generally, users need to specify in the survey the total resources they intend to use
+next year and also how they would like to split them over the 4 quarters (e.g. 25%, 25%,
+25%, 25%). In general, we provide the possibility to adapt the distribution over the
+course of the year if required. The minimum allocation over a year is 10,000 node hours.
+
+By default, allocated nodes are on the CPU partition of Piz Daint (36 cores per node).
+However, allocations to the GPU partition are also possible (1 x NVIDIA P100 and 12 cores per
+node), but this needs to be explicitly stated in the survey.
 
 ### Piz Daint total resources
 
@@ -27,6 +49,7 @@ References:
 
 ## Contact information
 
-* Contact person at PSI: Marc Caubet Serrabou <marc.caubet@psi.ch>
+* Responsible contacts:
 * Mail list contact: <psi-hpc-at-cscs-admin@lists.psi.ch>
-* Contact Person at CSCS: Angelo Mangili <amangili@psi.ch>
+* Marc Caubet Serrabou <marc.caubet@psi.ch>
+* Derek Feichtinger <derek.feichtinger@psi.ch>
@@ -28,7 +28,7 @@ ls ~/.ssh/id*
 For creating **SSH RSA Keys**, one should:
 
 1. Run `ssh-keygen`; a password will be requested twice. You **must remember** this password for the future.
-* Due to security reasons, ***always add a password***. Never leave an empty password.
+* For security reasons, ***always try to protect it with a password***. The only exception is when running ANSYS software, which in general should not use a password, in order to simplify running the software through Slurm.
 * This will generate a private key **id_rsa**, and a public key **id_rsa.pub** in your **~/.ssh** directory.
 2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
 ```bash
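The fenced block referenced above is truncated by the diff context. As a minimal sketch of the steps it describes, assuming the default key location `~/.ssh/id_rsa`:

```bash
# Generate an RSA key pair; you will be prompted twice for a passphrase.
ssh-keygen -t rsa

# Append the public key to authorized_keys and restrict its permissions.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```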
@@ -75,6 +75,32 @@ gpu up 7-00:00:00 1-infinite no NO all 8 allocate
 </pre>
 </details>
 
+### Job accounting
+
+Users can check detailed information of jobs (pending, running, completed, failed, etc.) with the `sacct` command.
+This command is very flexible and can provide a lot of information. To see all the available options, please read `man sacct`.
+Below, we summarize some examples that can be useful for users:
+
+```bash
+# Today's jobs, basic summary
+sacct
+
+# Today's jobs, with details
+sacct --long
+
+# Jobs from January 1, 2022, 12:00, with details
+sacct -S 2022-01-01T12:00:00 --long
+
+# Specific job accounting
+sacct --long -j $jobid
+
+# Jobs custom details, without steps (-X)
+sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+
+# Jobs custom details, with steps
+sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+```
+
 ### Job efficiency
 
 Users can check how efficient their jobs are. For that, the ``seff`` command is available.
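A short usage sketch for the efficiency check mentioned above (the job ID is illustrative):

```bash
# Report CPU and memory efficiency of a finished job.
seff 123456

# Cross-check the same job with sacct: state, elapsed time, CPU time and peak memory.
sacct -j 123456 --format=JobID,State,Elapsed,TotalCPU,MaxRSS
```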
@@ -28,6 +28,7 @@ scancel job_id # to cancel slurm job, job id is the numeric id, seen by the sq
 sview # X interface for managing jobs and tracking job run information.
 seff # Calculates the efficiency of a job
 sjstat # List attributes of jobs under the SLURM control
+sacct # Show job accounting, useful for checking details of finished jobs.
 ```
 
 ---
@@ -69,19 +69,24 @@ The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**,
 <td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
 </tr>
 <tr style="vertical-align:middle;text-align:center;" ralign="center">
-<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#3</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-06]</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">48</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">1.2TB</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
-</tr>
-<tr style="vertical-align:middle;text-align:center;" ralign="center">
-<td rowspan="1"><b>merlin-c-3[07-12]</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1">768GB</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3"><b>#3</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-12]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">48</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">1.2TB</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="2">768GB</td>
 </tr>
+<tr style="vertical-align:middle;text-align:center;" ralign="center">
+<td rowspan="1"><b>merlin-c-3[13-18]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
+</tr>
+<tr style="vertical-align:middle;text-align:center;" ralign="center">
+<td rowspan="1"><b>merlin-c-3[19-24]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
+</tr>
 </tbody>
 </table>
 Each blade contains an NVMe disk, where up to 300GB are dedicated to the O.S., and ~1.2TB are reserved for local `/scratch`.
@@ -14,12 +14,14 @@ This documentation shows basic Slurm configuration and options needed to run job
 
 The following table shows default and maximum resources that can be used per node:
 
-| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
-|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
-| merlin-c-[001-024] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[101-124] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[201-224] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[301-306] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
+| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
+|:--------------------:| ---------:| :--------:| :------: | :----------:| :-----------:| :-------:| :-------: | :-------: |
+| merlin-c-[001-024] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[101-124] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[201-224] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[301-312] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
+| merlin-c-[313-318] | 1 core | 44 cores | 1 | 748800 | 748800 | 10000 | N/A | N/A |
+| merlin-c-[319-324] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
 
 If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=<mem_in_MB>` and
 `--mem-per-cpu=<mem_in_MB>` options, and maximum memory allowed is `Max.Mem/Node`.
|
|||||||
|
|
||||||
The following *partitions* (also known as *queues*) are configured in Slurm:
|
The following *partitions* (also known as *queues*) are configured in Slurm:
|
||||||
|
|
||||||
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU |
|
||||||
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
|
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |:------------:|
|
||||||
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 |
|
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 | 4000 |
|
||||||
| **daily** | 1 day | 1 day | 67 | 500 | 1 |
|
| **daily** | 1 day | 1 day | 67 | 500 | 1 | 4000 |
|
||||||
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 |
|
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 | 4000 |
|
||||||
| **gfa-asa** | 1 day | 1 week | 11 | 1000 | 1000 |
|
| **asa-general** | 1 hour | 2 weeks | unlimited | 1 | 2 | 3712 |
|
||||||
|
| **asa-daily** | 1 hour | 1 week | unlimited | 500 | 2 | 3712 |
|
||||||
|
| **asa-ansys** | 1 hour | 90 days | unlimited | 1000 | 4 | 15600 |
|
||||||
|
| **mu3e** | 1 day | 7 days | unlimited | 1000 | 4 | 3712 |
|
||||||
|
|
||||||
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority
|
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority
|
||||||
partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU
|
partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU
|
||||||
@ -74,8 +79,7 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
|
|||||||
* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
|
* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
|
||||||
* For **`daily`** this limitation is extended to 67 nodes.
|
* For **`daily`** this limitation is extended to 67 nodes.
|
||||||
* For **`hourly`** there are no limits.
|
* For **`hourly`** there are no limits.
|
||||||
* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
|
* **`asa-general`,`asa-daily`,`asa-ansys`,`asa-visas` and `mu3e`** are **private hidden** partitions, belonging to different experiments owning the machines. **Access is restricted** in all cases. However, by agreement with the experiments, nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
|
||||||
nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
|
|
||||||
|
|
||||||
{{site.data.alerts.tip}}Jobs which would run for less than one day should be always sent to <b>daily</b>, while jobs that would run for less
|
{{site.data.alerts.tip}}Jobs which would run for less than one day should be always sent to <b>daily</b>, while jobs that would run for less
|
||||||
than one hour should be sent to <b>hourly</b>. This would ensure that you have highest priority over jobs sent to partitions with less priority,
|
than one hour should be sent to <b>hourly</b>. This would ensure that you have highest priority over jobs sent to partitions with less priority,
|
||||||
|