initial formatting changes complete

2026-01-06 16:40:15 +01:00
parent f58c1f57b8
commit 7db5d0fd05
81 changed files with 805 additions and 1112 deletions
--- a/docs/merlin6/slurm-general-docs/interactive-jobs.md
+++ b/docs/merlin6/slurm-general-docs/interactive-jobs.md
@@ -57,7 +57,7 @@ a shell (`$SHELL`) at the end of the `salloc` command. In example:
 ```bash
 # Typical 'salloc' call
 #   - Same as running:
-#    'salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL' 
+#    'salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL'
 salloc --clusters=merlin6 -N 2 -n 2

 # Custom 'salloc' call
@@ -155,7 +155,7 @@ srun --clusters=merlin6 --x11 --pty bash
    srun: job 135095591 queued and waiting for resources
    srun: job 135095591 has been allocated resources

-    (base) [caubet_m@merlin-l-001 ~]$ 
+    (base) [caubet_m@merlin-l-001 ~]$

    (base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 --pty bash
    srun: job 135095592 queued and waiting for resources
@@ -198,7 +198,7 @@ salloc --clusters=merlin6 --x11
    salloc: Granted job allocation 135171355
    salloc: Relinquishing job allocation 135171355

-    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11 
+    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11
    salloc: Pending job allocation 135171349
    salloc: job 135171349 queued and waiting for resources
    salloc: job 135171349 has been allocated resources
--- a/docs/merlin6/slurm-general-docs/monitoring.md
+++ b/docs/merlin6/slurm-general-docs/monitoring.md
@@ -166,20 +166,20 @@ sjstat

    Scheduling pool data:
    ----------------------------------------------------------------------------------
-                               Total  Usable   Free   Node   Time      Other          
-    Pool         Memory  Cpus  Nodes   Nodes  Nodes  Limit  Limit      traits         
+                               Total  Usable   Free   Node   Time      Other
+    Pool         Memory  Cpus  Nodes   Nodes  Nodes  Limit  Limit      traits
    ----------------------------------------------------------------------------------
-    test        373502Mb    88      6       6      1  UNLIM 1-00:00:00   
-    general*    373502Mb    88     66      66      8     50 7-00:00:00   
-    daily       373502Mb    88     72      72      9     60 1-00:00:00   
-    hourly      373502Mb    88     72      72      9  UNLIM   01:00:00   
-    gpu         128000Mb     8      1       1      0  UNLIM 7-00:00:00   
-    gpu         128000Mb    20      8       8      0  UNLIM 7-00:00:00   
+    test        373502Mb    88      6       6      1  UNLIM 1-00:00:00
+    general*    373502Mb    88     66      66      8     50 7-00:00:00
+    daily       373502Mb    88     72      72      9     60 1-00:00:00
+    hourly      373502Mb    88     72      72      9  UNLIM   01:00:00
+    gpu         128000Mb     8      1       1      0  UNLIM 7-00:00:00
+    gpu         128000Mb    20      8       8      0  UNLIM 7-00:00:00

    Running job data:
    ---------------------------------------------------------------------------------------------------
-                                                     Time        Time            Time                  
-    JobID    User      Procs Pool      Status        Used       Limit         Started  Master/Other    
+                                                     Time        Time            Time
+    JobID    User      Procs Pool      Status        Used       Limit         Started  Master/Other
    ---------------------------------------------------------------------------------------------------
    13433377 collu_g       1 gpu       PD            0:00    24:00:00             N/A  (Resources)
    13433389 collu_g      20 gpu       PD            0:00    24:00:00             N/A  (Resources)
@@ -249,11 +249,10 @@ sview

 !['sview' graphical user interface](../../images/slurm/sview.png)

-
 ## General Monitoring

-The following pages contain basic monitoring for Slurm and computing nodes. 
-Currently, monitoring is based on Grafana + InfluxDB. In the future it will 
+The following pages contain basic monitoring for Slurm and computing nodes.
+Currently, monitoring is based on Grafana + InfluxDB. In the future it will
 be moved to a different service based on ElasticSearch + LogStash + Kibana.

 In the meantime, the following monitoring pages are available in a best effort
@@ -262,17 +261,17 @@ support:
 ### Merlin6 Monitoring Pages

 * Slurm monitoring:
-   * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
-   * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
-   * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
+    * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
+    * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
+    * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
 * Nodes monitoring:
-   * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
-   * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)
+    * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
+    * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)

 ### Merlin5 Monitoring Pages

 * Slurm monitoring:
-   * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
-   * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
+    * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
+    * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
 * Nodes monitoring:
-   * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)
+    * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)
--- a/docs/merlin6/slurm-general-docs/running-jobs.md
+++ b/docs/merlin6/slurm-general-docs/running-jobs.md
@@ -5,19 +5,19 @@
 Before starting using the cluster, please read the following rules:

 1. To ease and improve *scheduling* and *backfilling*, always try to **estimate and** to **define a proper run time** of your jobs:
-   * Use `--time=<D-HH:MM:SS>` for that.
-   * For very long runs, please consider using ***[Job Arrays with Checkpointing](#array-jobs-running-very-long-tasks-with-checkpoint-files)***
+    * Use `--time=<D-HH:MM:SS>` for that.
+    * For very long runs, please consider using ***[Job Arrays with Checkpointing](#array-jobs-running-very-long-tasks-with-checkpoint-files)***
 2. Try to optimize your jobs to run within **one day** at most. Please consider the following:
-   * Some software can simply scale up by using more nodes while drastically reducing the run time.
-   * Some software can save a specific state so that a second job can start from that state: ***[Job Arrays with Checkpointing](#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
-   * Jobs submitted to **`hourly`** get higher priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
-   * Jobs submitted to **`daily`** get higher priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
+    * Some software can simply scale up by using more nodes while drastically reducing the run time.
+    * Some software can save a specific state so that a second job can start from that state: ***[Job Arrays with Checkpointing](#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
+    * Jobs submitted to **`hourly`** get higher priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
+    * Jobs submitted to **`daily`** get higher priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
 3. It is **forbidden** to run **very short jobs**, as they cause a lot of overhead and can also cause severe problems for the main scheduler.
-   * ***Question:*** Is my job a very short job? ***Answer:*** If it lasts only a few seconds or very few minutes, yes.
-   * ***Question:*** How long should my job run? ***Answer:*** As a *rule of thumb*, 5 minutes starts being acceptable, and 15 minutes or more is preferred.
-   * Use ***[Packed Jobs](#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
+    * ***Question:*** Is my job a very short job? ***Answer:*** If it lasts only a few seconds or very few minutes, yes.
+    * ***Question:*** How long should my job run? ***Answer:*** As a *rule of thumb*, 5 minutes starts being acceptable, and 15 minutes or more is preferred.
+    * Use ***[Packed Jobs](#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
 4. Do not submit hundreds of similar jobs!
-   * Use ***[Array Jobs](#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.
+    * Use ***[Array Jobs](#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.

 !!! tip
    Having a good estimation of the *time* needed by your jobs, a proper way for
@@ -37,6 +37,7 @@ Before starting using the cluster, please read the following rules:
 ## Basic settings

 For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. `man sbatch`, `man srun`, `man salloc`).
+
 Please note that the behaviour of some parameters may change depending on the command used to run jobs (for example, the behaviour of `--exclusive` in `sbatch` differs from that in `srun`).

 In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
@@ -115,20 +116,20 @@ The following template should be used by any user submitting jobs to the Merlin6

 ```bash
 #!/bin/bash
-#SBATCH --cluster=merlin6                 # Cluster name                                        
+#SBATCH --cluster=merlin6                 # Cluster name
 #SBATCH --partition=general,daily,hourly  # Specify one or multiple partitions
-#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended                                
-#SBATCH --output=<output_file>            # Generate custom output file                         
-#SBATCH --error=<error_file>              # Generate custom error  file                         
-#SBATCH --hint=nomultithread              # Mandatory for multithreaded jobs                    
-##SBATCH --exclusive                      # Uncomment if you need exclusive node usage          
-##SBATCH --ntasks-per-core=1              # Only mandatory for multithreaded single tasks       
-                                                                                               
-## Advanced options example                                                                    
-##SBATCH --nodes=1                        # Uncomment and specify #nodes to use                 
-##SBATCH --ntasks=44                      # Uncomment and specify #tasks to use                 
-##SBATCH --ntasks-per-node=44             # Uncomment and specify #tasks per node               
-##SBATCH --cpus-per-task=44               # Uncomment and specify the number of cores per task  
+#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended
+#SBATCH --output=<output_file>            # Generate custom output file
+#SBATCH --error=<error_file>              # Generate custom error  file
+#SBATCH --hint=nomultithread              # Mandatory for multithreaded jobs
+##SBATCH --exclusive                      # Uncomment if you need exclusive node usage
+##SBATCH --ntasks-per-core=1              # Only mandatory for multithreaded single tasks
+
+## Advanced options example
+##SBATCH --nodes=1                        # Uncomment and specify #nodes to use
+##SBATCH --ntasks=44                      # Uncomment and specify #tasks to use
+##SBATCH --ntasks-per-node=44             # Uncomment and specify #tasks per node
+##SBATCH --cpus-per-task=44               # Uncomment and specify the number of cores per task
 ```

 #### Multithreaded jobs template
@@ -241,7 +242,7 @@ strategy:
 #SBATCH --time=7-00:00:00       # each job can run for 7 days
 #SBATCH --cpus-per-task=1
 #SBATCH --array=1-10%1   # Run a 10-job array, one job at a time.
-if test -e checkpointfile; then 
+if test -e checkpointfile; then
     # There is a checkpoint file;
     myprogram --read-checkp checkpointfile
 else
--- a/docs/merlin6/slurm-general-docs/slurm-basic-commands.md
+++ b/docs/merlin6/slurm-general-docs/slurm-basic-commands.md
@@ -9,9 +9,9 @@ information about options and examples.
 Useful Slurm commands:

 ```bash
-sinfo            # to see the name of nodes, their occupancy, 
+sinfo            # to see the name of nodes, their occupancy,
                 # name of slurm partitions, limits (try out with "-l" option)
-squeue           # to see the currently running/waiting jobs in slurm 
+squeue           # to see the currently running/waiting jobs in slurm
                 # (additional "-l" option may also be useful)
 sbatch Script.sh # to submit a script (example below) to Slurm.
 srun <command>   # to submit a command to Slurm. Same options as in 'sbatch' can be used.
@@ -30,7 +30,7 @@ sacct            # Show job accounting, useful for checking details of finished
 ```bash
 sinfo -N -l      # list nodes, state, resources (#CPUs, memory per node, ...), etc.
 sshare -a        # to list shares of associations to a cluster
-sprio -l         # to view the factors that comprise a job's scheduling priority 
+sprio -l         # to view the factors that comprise a job's scheduling priority
                 # add '-u <username>' for filtering user
 ```

--- a/docs/merlin6/slurm-general-docs/slurm-examples.md
+++ b/docs/merlin6/slurm-general-docs/slurm-examples.md
@@ -251,7 +251,6 @@ The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob ca
 this will result in subjob n+1 only being started when job n has finished. It will read the checkpoint file
 if it is present.

-
 ### Packed jobs: running a large number of short tasks

 Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate