Docs update 13.04.2022

caubet_m 2022-04-13 16:03:50 +02:00
parent 9837429824
commit b8ff389a9d
8 changed files with 99 additions and 34 deletions


@@ -16,3 +16,7 @@ entries:
   - title: The Merlin Local HPC Cluster
     url: /merlin6/introduction.html
     output: web
+  - title: PSI HPC@CSCS
+    url: /CSCS/index.html
+    output: web


@@ -103,6 +103,8 @@ entries:
     url: /merlin6/ansys-fluent.html
   - title: ANSYS/MAPDL
     url: /merlin6/ansys-mapdl.html
+  - title: ANSYS/HFSS
+    url: /merlin6/ansys-hfss.html
   - title: ParaView
     url: /merlin6/paraview.html
   - title: Support


@@ -2,7 +2,7 @@
 title: PSI HPC@CSCS Admin Overview
 #tags:
 #keywords:
-last_updated: 22 September 2020
+last_updated: 13 April 2022
 #summary: ""
 sidebar: CSCS_sidebar
 permalink: /CSCS/index.html
@@ -10,13 +10,35 @@ permalink: /CSCS/index.html
 ## PSI HPC@CSCS
 
-For offering high-end HPC sources to PSI users, AIT has a long standing col-laboration with the national supercomputing centre CSCS (since 2005).
-Some of the resources are procured by central PSI funds while users have the optionsof an additional buy-in at the same rates.
+For offering high-end HPC resources to PSI users, PSI has a long-standing collaboration with
+the national supercomputing centre CSCS (since 2005). Some of the resources are procured by
+central PSI funds, while users have the option of an additional buy-in at the same rates.
 
 ### PSI resources at Piz Daint
 
-The yearly computing resources at CSCS for the PSI projects are 320,000 NH (Node Hours). The yearly storage resources for the PSI projects is a total of 40TB.
-These resources are centrally financed, but in addition experiments can individually purchase more resources.
+The yearly computing resources at CSCS for the PSI projects are 627,000 NH (Node Hours).
+The yearly storage resources for the PSI projects total 80TB. These resources are
+centrally financed, but in addition experiments can individually purchase more resources.
+
+### How to request a PSI project
+
+A survey is sent out in the third quarter of each year. This survey is used to request
+CSCS resources for the upcoming year.
+
+Users registered in the **PSI HPC@CSCS mailing list** <psi-hpc-at-cscs@lists.psi.ch> will
+receive notification and details about the survey, for example:
+
+* Link to the survey
+* Updates on resource changes
+* Other details of the process
+
+Generally, users need to specify in the survey the total resources they intend to use
+next year and also how they would like to split them over the 4 quarters (e.g. 25%, 25%,
+25%, 25%). In general, we provide the possibility to adapt the distribution over the
+course of the year if required. The minimum allocation over a year is 10,000 node hours.
+
+By default, allocated nodes are on the CPU partition of Piz Daint (36 cores per node).
+However, allocations on the GPU partition are also possible (1 x NVIDIA P100 and 12 cores per
+node), but this needs to be explicitly stated in the survey.
 
 ### Piz Daint total resources
@@ -27,6 +49,7 @@ References:
 ## Contact information
 
-* Contact person at PSI: Marc Caubet Serrabou <marc.caubet@psi.ch>
+* Responsible contacts:
 * Mail list contact: <psi-hpc-at-cscs-admin@lists.psi.ch>
-* Contact Person at CSCS: Angelo Mangili <amangili@psi.ch>
+* Marc Caubet Serrabou <marc.caubet@psi.ch>
+* Derek Feichtinger <derek.feichtinger@psi.ch>


@@ -28,7 +28,7 @@ ls ~/.ssh/id*
 For creating **SSH RSA Keys**, one should:
 
 1. Run `ssh-keygen`, a password will be requested twice. You **must remember** this password for the future.
-   * Due to security reasons, ***always add a password***. Never leave an empty password.
+   * Due to security reasons, ***always try to protect it with a password***. The only exception is when running ANSYS software, which in general should not use a password, in order to simplify how the software is run through Slurm.
    * This will generate a private key **id_rsa**, and a public key **id_rsa.pub** in your **~/.ssh** directory.
 2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
 ```bash


@@ -75,6 +75,32 @@ gpu up 7-00:00:00 1-infinite no NO all 8 allocate
 </pre>
 </details>
 
+### Job accounting
+
+Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
+This command is very flexible and can provide a lot of information. To check all the available options, please read `man sacct`.
+Below we summarize some examples that can be useful for users:
+
+```bash
+# Today's jobs, basic summary
+sacct
+
+# Today's jobs, with details
+sacct --long
+
+# Jobs since January 1, 2022, 12:00, with details
+sacct -S 2022-01-01T12:00:00 --long
+
+# Accounting for a specific job
+sacct --long -j $jobid
+
+# Custom job details, without steps (-X)
+sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+
+# Custom job details, with steps
+sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+```
+
 ### Job efficiency
 
 Users can check how efficient their jobs are. For that, the ``seff`` command is available.
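For reference, a minimal `seff` invocation looks like the sketch below; the job ID `1234567` is a placeholder for a job that has already finished:

```bash
# Report CPU and memory efficiency of a completed job (placeholder job ID)
seff 1234567
```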


@@ -28,6 +28,7 @@ scancel job_id # to cancel slurm job, job id is the numeric id, seen by the sq
 sview  # X interface for managing jobs and track job run information.
 seff   # Calculates the efficiency of a job
 sjstat # List attributes of jobs under the SLURM control
+sacct  # Show job accounting, useful for checking details of finished jobs.
 ```
 
 ---


@@ -69,19 +69,24 @@ The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**,
 <td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
 </tr>
 <tr style="vertical-align:middle;text-align:center;" ralign="center">
-<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#3</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-06]</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">48</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="2">1.2TB</td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
-</tr>
-<tr style="vertical-align:middle;text-align:center;" ralign="center">
-<td rowspan="1"><b>merlin-c-3[07-12]</b></td>
-<td style="vertical-align:middle;text-align:center;" rowspan="1">768GB</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3"><b>#3</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-3[01-12]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3"><a href="https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html">Intel Xeon Gold 6240R</a></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">48</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="3">1.2TB</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="2">768GB</td>
+</tr>
+<tr style="vertical-align:middle;text-align:center;" ralign="center">
+<td rowspan="1"><b>merlin-c-3[13-18]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
+</tr>
+<tr style="vertical-align:middle;text-align:center;" ralign="center">
+<td rowspan="1"><b>merlin-c-3[19-24]</b></td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
+<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
 </tr>
 </tbody>
 </table>
 
 Each blade contains an NVMe disk, where up to 300GB are dedicated to the O.S., and ~1.2TB are reserved for local `/scratch`.


@@ -14,12 +14,14 @@ This documentation shows basic Slurm configuration and options needed to run job
 The following table shows default and maximum resources that can be used per node:
 
-| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
-|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
-| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
-| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
-| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
-| merlin-c-[301-306] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
+|:------------------:| ---------:| :--------:| :------: | :----------:| :-----------:| :-------:| :-------: | :-------: |
+| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-c-[301-312] | 1 core    | 44 cores  | 2        | 748800      | 748800       | 10000    | N/A       | N/A       |
+| merlin-c-[313-318] | 1 core    | 44 cores  | 1        | 748800      | 748800       | 10000    | N/A       | N/A       |
+| merlin-c-[319-324] | 1 core    | 44 cores  | 2        | 748800      | 748800       | 10000    | N/A       | N/A       |
 
 If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=<mem_in_MB>` and
 `--mem-per-cpu=<mem_in_MB>` options, and the maximum memory allowed is `Max.Mem/Node`.
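As a rough sketch of how these memory options are typically combined in a batch script (the job name, task count, 8000 MB value and `my_application` are illustrative placeholders, not taken from the documentation above):

```bash
#!/bin/bash
#SBATCH --job-name=mem-example   # placeholder job name
#SBATCH --ntasks=16              # placeholder task count
#SBATCH --mem-per-cpu=8000       # raise the per-CPU memory from the default to 8000 MB

srun my_application              # placeholder executable
```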
@@ -57,12 +59,15 @@ Users might need to specify the Slurm partition. If no partition is specified, i
 The following *partitions* (also known as *queues*) are configured in Slurm:
 
-| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
-|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
-| **<u>general</u>** | 1 day        | 1 week   | 50        | 1                   | 1                |
-| **daily**          | 1 day        | 1 day    | 67        | 500                 | 1                |
-| **hourly**         | 1 hour       | 1 hour   | unlimited | 1000                | 1                |
-| **gfa-asa**        | 1 day        | 1 week   | 11        | 1000                | 1000             |
+| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU |
+|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |:------------:|
+| **<u>general</u>** | 1 day        | 1 week   | 50        | 1                   | 1                | 4000         |
+| **daily**          | 1 day        | 1 day    | 67        | 500                 | 1                | 4000         |
+| **hourly**         | 1 hour       | 1 hour   | unlimited | 1000                | 1                | 4000         |
+| **asa-general**    | 1 hour       | 2 weeks  | unlimited | 1                   | 2                | 3712         |
+| **asa-daily**      | 1 hour       | 1 week   | unlimited | 500                 | 2                | 3712         |
+| **asa-ansys**      | 1 hour       | 90 days  | unlimited | 1000                | 4                | 15600        |
+| **mu3e**           | 1 day        | 7 days   | unlimited | 1000                | 4                | 3712         |
 
 \*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
 partitions will usually run first (however, other factors such as **job age** or mainly **fair share** might affect that decision). For the GPU
@@ -74,8 +79,7 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
 * The **`general`** partition is the **default**. It cannot have more than 50 nodes running jobs.
 * For **`daily`** this limitation is extended to 67 nodes.
 * For **`hourly`** there are no limits.
-* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
-  nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
+* **`asa-general`, `asa-daily`, `asa-ansys`, `asa-visas` and `mu3e`** are **private hidden** partitions, belonging to the different experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, nodes are usually added to the **`hourly`** partition as extra resources for public use.
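For reference, selecting one of the partitions above comes down to a single Slurm option; a minimal sketch (the script name and time limit are placeholders):

```bash
# Submit a short job to the hourly partition (highest public PriorityJobFactor)
sbatch --partition=hourly --time=00:45:00 my_job.sh

# The same can be set as a directive inside the batch script:
#SBATCH --partition=daily
```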
 
 {{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
 than one hour should be sent to <b>hourly</b>. This ensures that you have the highest priority over jobs sent to partitions with lower priority,