diff --git a/_data/sidebars/home_sidebar.yml b/_data/sidebars/home_sidebar.yml
index af6daee..68cfa96 100644
--- a/_data/sidebars/home_sidebar.yml
+++ b/_data/sidebars/home_sidebar.yml
@@ -16,3 +16,7 @@ entries:
     - title: The Merlin Local HPC Cluster
       url: /merlin6/introduction.html
       output: web
+    - title: PSI HPC@CSCS
+      url: /CSCS/index.html
+      output: web
+
diff --git a/_data/sidebars/merlin6_sidebar.yml b/_data/sidebars/merlin6_sidebar.yml
index b91e55e..f618950 100644
--- a/_data/sidebars/merlin6_sidebar.yml
+++ b/_data/sidebars/merlin6_sidebar.yml
@@ -103,6 +103,8 @@ entries:
     - title: ANSYS/Fluent
       url: /merlin6/ansys-fluent.html
     - title: ANSYS/MAPDL
       url: /merlin6/ansys-mapdl.html
+    - title: ANSYS/HFSS
+      url: /merlin6/ansys-hfss.html
     - title: ParaView
       url: /merlin6/paraview.html
   - title: Support
diff --git a/pages/CSCS/index.md b/pages/CSCS/index.md
index 013d54d..da12d23 100644
--- a/pages/CSCS/index.md
+++ b/pages/CSCS/index.md
@@ -2,7 +2,7 @@ title: PSI HPC@CSCS Admin Overview
 #tags:
 #keywords:
-last_updated: 22 September 2020
+last_updated: 13 April 2022
 #summary: ""
 sidebar: CSCS_sidebar
 permalink: /CSCS/index.html
@@ -10,13 +10,35 @@ permalink: /CSCS/index.html
 
 ## PSI HPC@CSCS
 
-For offering high-end HPC sources to PSI users, AIT has a long standing col-laboration with the national supercomputing centre CSCS (since 2005).
-Some of the resources are procured by central PSI funds while users have the optionsof an additional buy-in at the same rates.
+To offer high-end HPC resources to PSI users, PSI has maintained a long-standing collaboration with
+the national supercomputing centre CSCS since 2005. Some of the resources are procured by
+central PSI funds, while users have the option of an additional buy-in at the same rates.
 
 ### PSI resources at Piz Daint
 
-The yearly computing resources at CSCS for the PSI projects are 320,000 NH (Node Hours). The yearly storage resources for the PSI projects is a total of 40TB.
-These resources are centrally financed, but in addition experiments can individually purchase more resources.
+The yearly computing resources at CSCS for the PSI projects amount to 627,000 NH (Node Hours).
+The yearly storage resources for the PSI projects amount to a total of 80TB. These resources are
+centrally financed, but experiments can individually purchase additional resources.
+
+### How to request a PSI project
+
+A survey is sent out in the third quarter of each year. This survey is used to request
+CSCS resources for the upcoming year.
+
+Users registered in the **PSI HPC@CSCS mailing list** will
+receive a notification and details about the survey, for example:
+* Link to the survey
+* Updates on resource changes
+* Other details of the process
+
+Generally, users need to specify in the survey the total resources they intend to use
+next year and how they would like to split them over the 4 quarters (e.g. 25%, 25%,
+25%, 25%). In general, we provide the possibility to adapt the distribution over the
+course of the year if required. The minimum allocation over a year is 10,000 node hours.
+
+By default, allocated nodes are on the CPU partition of Piz Daint (36 cores per node).
+However, allocations on the GPU partition are also possible (1 x NVIDIA P100 and 12 cores per
+node), but this needs to be explicitly stated in the survey.
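+
+For illustration, a minimal Slurm batch script on Piz Daint typically selects the node type
+through a Slurm constraint. The account name and executable below are placeholders, and the
+exact constraint and account names to use should be checked in the CSCS documentation:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=example
+#SBATCH --account=psi_project      # placeholder: your PSI project account at CSCS
+#SBATCH --time=01:00:00
+#SBATCH --nodes=1
+#SBATCH --constraint=mc            # CPU (multicore) nodes; use --constraint=gpu for the GPU partition
+
+srun ./my_application              # placeholder executable
+```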
 
 ### Piz Daint total resources
@@ -27,6 +49,7 @@ References:
 
 ## Contact information
 
-* Contact person at PSI: Marc Caubet Serrabou
+* Responsible contacts:
 * Mail list contact:
-* Contact Person at CSCS: Angelo Mangili
+  * Marc Caubet Serrabou
+  * Derek Feichtinger
diff --git a/pages/merlin6/02-How-To-Use-Merlin/ssh-keys.md b/pages/merlin6/02-How-To-Use-Merlin/ssh-keys.md
index 5d2746c..80243b5 100644
--- a/pages/merlin6/02-How-To-Use-Merlin/ssh-keys.md
+++ b/pages/merlin6/02-How-To-Use-Merlin/ssh-keys.md
@@ -28,7 +28,7 @@ ls ~/.ssh/id*
 For creating **SSH RSA Keys**, one should:
 
 1. Run `ssh-keygen`, a password will be requested twice. You **must remember** this password for the future.
-   * Due to security reasons, ***always add a password***. Never leave an empty password.
+   * For security reasons, ***always try to protect it with a password***. The only exception is when running ANSYS software, which in general should use a key without a password to simplify running the software through Slurm.
    * This will generate a private key **id_rsa**, and a public key **id_rsa.pub** in your **~/.ssh** directory.
 2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
 ```bash
diff --git a/pages/merlin6/03-Slurm-General-Documentation/monitoring.md b/pages/merlin6/03-Slurm-General-Documentation/monitoring.md
index a4453d9..bdaa622 100644
--- a/pages/merlin6/03-Slurm-General-Documentation/monitoring.md
+++ b/pages/merlin6/03-Slurm-General-Documentation/monitoring.md
@@ -75,6 +75,32 @@ gpu up 7-00:00:00 1-infinite no NO all 8 allocate
 
+### Job accounting
+
+Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
+This command is very flexible and can provide a lot of information. To see all the available options, please read `man sacct`.
+Below we summarize some examples that can be useful for users:
+
+```bash
+# Today's jobs, basic summary
+sacct
+
+# Today's jobs, with details
+sacct --long
+
+# Jobs since January 1, 2022, 12:00, with details
+sacct -S 2022-01-01T12:00:00 --long
+
+# Accounting for a specific job
+sacct --long -j $jobid
+
+# Jobs custom details, without steps (-X)
+sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+
+# Jobs custom details, with steps
+sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
+```
+
 ### Job efficiency
 
 Users can check how efficient are their jobs. For that, the ``seff`` command is available.
diff --git a/pages/merlin6/03-Slurm-General-Documentation/slurm-basic-commands.md b/pages/merlin6/03-Slurm-General-Documentation/slurm-basic-commands.md
index 92e9bee..289c5bf 100644
--- a/pages/merlin6/03-Slurm-General-Documentation/slurm-basic-commands.md
+++ b/pages/merlin6/03-Slurm-General-Documentation/slurm-basic-commands.md
@@ -28,6 +28,7 @@ scancel job_id # to cancel slurm job, job id is the numeric id, seen by the sq
 sview   # X interface for managing jobs and track job run information.
 seff    # Calculates the efficiency of a job
 sjstat  # List attributes of jobs under the SLURM control
+sacct   # Show job accounting, useful for checking details of finished jobs.
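+
+# sprio complements sacct by showing the factors (job age, fair share, partition priority)
+# that make up the scheduling priority of pending jobs; see `man sprio` for the details.
+sprio   # Show the priority factors of pending jobs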
 ```
 
 ---
diff --git a/pages/merlin6/hardware-and-software-description.md b/pages/merlin6/hardware-and-software-description.md
index 139757a..de19689 100644
--- a/pages/merlin6/hardware-and-software-description.md
+++ b/pages/merlin6/hardware-and-software-description.md
@@ -69,19 +69,24 @@ The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**,
 384GB
-#3
-merlin-c-3[01-06]
-Intel Xeon Gold 6240R
-2
-48
-2
-1.2TB
-384GB
-
-
-merlin-c-3[07-12]
-768GB
+#3
+merlin-c-3[01-12]
+Intel Xeon Gold 6240R
+2
+48
+2
+1.2TB
+768GB
+
+merlin-c-3[13-18]
+1
+
+
+merlin-c-3[19-24]
+2
+384GB
+
 Each blade contains a NVMe disk, where up to 300TB are dedicated to the O.S., and ~1.2TB are reserved for local `/scratch`.
diff --git a/pages/merlin6/slurm-configuration.md b/pages/merlin6/slurm-configuration.md
index 30670ad..9e17723 100644
--- a/pages/merlin6/slurm-configuration.md
+++ b/pages/merlin6/slurm-configuration.md
@@ -14,12 +14,14 @@ This documentation shows basic Slurm configuration and options needed to run job
 
 The following table show default and maximum resources that can be used per node:
 
-| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
-|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
-| merlin-c-[001-024] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[101-124] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[201-224] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
-| merlin-c-[301-306] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
+| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
+|:--------------------:| ---------:| :--------:| :------: | :----------:| :-----------:| :-------:| :-------: | :-------: |
+| merlin-c-[001-024] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[101-124] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[201-224] | 1 core | 44 cores | 2 | 352000 | 352000 | 10000 | N/A | N/A |
+| merlin-c-[301-312] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
+| merlin-c-[313-318] | 1 core | 44 cores | 1 | 748800 | 748800 | 10000 | N/A | N/A |
+| merlin-c-[319-324] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
 
 If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=` and `--mem-per-cpu=` options, and maximum memory allowed is `Max.Mem/Node`.
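+
+For example (an illustrative sketch; the script name and values are placeholders), memory can
+be requested per node or per allocated CPU at submission time:
+
+```bash
+# Request a single task with 16000 MB of memory on its node
+sbatch --ntasks=1 --mem=16000 my_job.sh
+
+# Request 4 tasks with 8000 MB of memory per allocated CPU
+sbatch --ntasks=4 --mem-per-cpu=8000 my_job.sh
+```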
@@ -57,12 +59,15 @@ Users might need to specify the Slurm partition. If no partition is specified, i
 
 The following *partitions* (also known as *queues*) are configured in Slurm:
 
-| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
-|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
-| **general** | 1 day | 1 week | 50 | 1 | 1 |
-| **daily** | 1 day | 1 day | 67 | 500 | 1 |
-| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 |
-| **gfa-asa** | 1 day | 1 week | 11 | 1000 | 1000 |
+| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU |
+|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |:------------:|
+| **general** | 1 day | 1 week | 50 | 1 | 1 | 4000 |
+| **daily** | 1 day | 1 day | 67 | 500 | 1 | 4000 |
+| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 | 4000 |
+| **asa-general** | 1 hour | 2 weeks | unlimited | 1 | 2 | 3712 |
+| **asa-daily** | 1 hour | 1 week | unlimited | 500 | 2 | 3712 |
+| **asa-ansys** | 1 hour | 90 days | unlimited | 1000 | 4 | 15600 |
+| **mu3e** | 1 day | 7 days | unlimited | 1000 | 4 | 3712 |
 
 \*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words,
 jobs sent to higher priority partitions will usually run first (however, other factors such like **job age** or mainly
 **fair share** might affect to that decision). For the GPU
@@ -74,8 +79,7 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
 * The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
 * For **`daily`** this limitation is extended to 67 nodes.
 * For **`hourly`** there are no limits.
-* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
-nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
+* **`asa-general`, `asa-daily`, `asa-ansys`, `asa-visas` and `mu3e`** are **private hidden** partitions, belonging to the different experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, nodes are usually added to the **`hourly`** partition as extra resources for public use.
 
 {{site.data.alerts.tip}}Jobs which would run for less than one day should be always sent to daily, while jobs that would run for less than one hour should be sent to hourly. This would ensure that you have highest priority over jobs sent to partitions with less priority,