Slurm Configuration: Running jobs
Before starting to use the cluster, please read the following rules:

1. Always try to **estimate** and **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * This will ease *scheduling* and *backfilling*.
   * Slurm will schedule the queued jobs more efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***.
2. Try to optimize your jobs for running within **one day**. Please consider the following:
3. It is **forbidden** to run **very short jobs**:
   * Running very short jobs causes a lot of overhead.
   * ***Question:*** Is my job a very short job?
     * ***Answer:*** If it lasts a few seconds, or very few minutes, then yes.
   * ***Question:*** How long should my job run?
     * ***Answer:*** As a *rule of thumb*, anything from 5 minutes starts being acceptable, and from 15 minutes is preferred.
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering similar jobs instead (see the sketch after the tip below).

{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing them to <i>run within one day</i> will help keep the system used fairly and efficiently.
{{site.data.alerts.end}}
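
As an example of gathering related jobs, a minimal array-job sketch could look as follows (the script contents, file names and index range are purely illustrative):

```bash
#!/bin/bash
#SBATCH --array=1-100    # Run 100 related tasks as a single array job
#SBATCH --time=00:30:00  # Run time per array task

# Each array task processes its own input, selected via the array index
srun ./mytask input_${SLURM_ARRAY_TASK_ID}.dat
```
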
## Basic commands for running batch scripts

* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
  * As an alternative, ``mpirun`` and ``mpiexec`` can be used. However, it is ***strongly recommended to use ``srun``*** instead.
* Use **``squeue``** for checking the status of jobs.
* Use **``scancel``** for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use the Linux <b><u>man pages</u></b> (e.g. <i>man sbatch</i>) for checking the available options of the above commands.
{{site.data.alerts.end}}
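
As an illustration, a minimal workflow might look as follows (the script name and job ID are hypothetical):

```bash
# Submit the batch script; Slurm replies with the assigned job ID
sbatch myjob.sh    # e.g. "Submitted batch job 12345"

# Check the status of your queued and running jobs
squeue -u $USER

# Cancel the job, if no longer needed, by its job ID
scancel 12345
```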

## Basic settings

For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters may change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` and ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.

* **Clusters:** For running jobs on the Merlin6 CPU and GPU nodes, users should add the following option:
  ```bash
  #SBATCH --clusters=merlin6
  ```
  Users with proper access can also use the `merlin5` cluster.

* **Partitions:** Except when using the *default* partition, one needs to specify the partition:
  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)

  The partition can be set as follows:
  ```bash
  #SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default'
  ```
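
  For example, a job expected to finish within one hour could choose the ``hourly`` partition (the choice here is purely illustrative):
  ```bash
  #SBATCH --clusters=merlin6
  #SBATCH --partition=hourly
  ```
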
* **[Optional] Disabling shared nodes:** By default, nodes can be shared among jobs from multiple users, while ensuring that the CPU, memory and GPU resources requested by each job are dedicated to it.
  One can request exclusive usage of a node (or set of nodes) with the following option:
  ```bash
  #SBATCH --exclusive  # Only if you want a dedicated node
  ```

* **Time:** It is important to define how long a job should realistically run. This will help Slurm with *scheduling* and *backfilling*, by managing the job queues in a more efficient way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for the `DefaultTime` and `MaxTime` values.
  ```bash
  #SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Cannot exceed the partition 'MaxTime'
  ```
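
  For instance, the limits of a partition can be checked before choosing a value (the partition name here is illustrative):
  ```bash
  # Show the DefaultTime and MaxTime limits of the 'daily' partition
  scontrol show partition daily | grep -E 'DefaultTime|MaxTime'
  ```
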
* **Output and error files:** By default, Slurm generates the standard output and standard error files in the directory from which you submit the batch script:
  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.

  If you want to change the default names, this can be done with the options ``--output`` and ``--error``. For example:
  ```bash
  #SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
  #SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid
  ```
  Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for the full specification of **filename patterns**.
* **Multithreading/No-Multithreading:** Whether a node has multithreading or not depends on the node configuration. By default, nodes with hyperthreading (HT) have it enabled, but one can explicitly request the desired behaviour with the `--hint` option as follows:
  ```bash
  #SBATCH --hint=multithread    # Use extra threads with in-core multi-threading
  #SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading
  ```
  Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (among other options) in addition to `--hint`. Please contact us in case of doubt.
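
  As a minimal sketch (the values are purely illustrative), a hybrid MPI/OpenMP job disabling hyperthreading might combine these options as follows:
  ```bash
  #SBATCH --hint=nomultithread  # One thread per physical core
  #SBATCH --ntasks=4            # 4 MPI tasks
  #SBATCH --cpus-per-task=8     # 8 OpenMP threads per MPI task
  ```
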
### GPU specific settings

The following settings are required for running on the GPU nodes:

* **Slurm account:** When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:
  ```bash
  #SBATCH --account=merlin-gpu  # The account 'merlin-gpu' must be used for GPUs
  ```
* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:
  ```bash
  #SBATCH --gres=gpu  # Always set at least this option when using GPUs
  ```
  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gres` options.
* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying the GPUs as a consumable resource. These are the following:
  * `--gpus`, used *instead of* (but also combinable with) `--gres=gpu`: specifies the total number of GPUs required for the job.
  * `--cpus-per-gpu`: specifies the number of CPUs to be used for each GPU.
  * `--mem-per-gpu`: specifies the amount of memory to be used for each GPU.
  * `--gpus-per-node`, `--gpus-per-socket`, `--gpus-per-task`: specify how many GPUs need to be allocated per node, socket or task.
  * Other advanced options (e.g. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (e.g. *`man sbatch`*) for further information.

  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gpus` options.
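
  As an illustrative sketch (all values are hypothetical), a GPU job using these options might request its resources as follows:
  ```bash
  #SBATCH --account=merlin-gpu  # Mandatory account for GPU jobs
  #SBATCH --gpus=2              # 2 GPUs in total for the job
  #SBATCH --cpus-per-gpu=5      # 5 CPUs for each allocated GPU
  #SBATCH --mem-per-gpu=16000   # 16000 MB of memory for each allocated GPU
  ```
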
#### GPU advanced settings

GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process must be used.

**Until 08.01.2021**, users can define which GPU resources, and *how many per node*, they need with the ``--gres`` option.
Valid ``gres`` options are ``gpu[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus requested per node>``. For example:
```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```

**From 08.01.2021**, `--gres` is not needed anymore (although it can still be used); `--gpus` and other related options should replace it. `--gpus` works in a similar way, but without the need of specifying the `gpu` resource. In other words, the `--gpus` options are ``[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus to use>``. For example:
```bash
#SBATCH --gpus=GTX1080:4  # Use 4 GPUs of type GTX1080
```
This setting can be combined with other options, such as `--gpus-per-node`, to accomplish behaviour similar to `--gres`.
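
For instance, a hypothetical sketch spreading the requested GPUs over two nodes:

```bash
#SBATCH --gpus=GTX1080:4   # 4 x GTX1080 GPUs in total for the job...
#SBATCH --gpus-per-node=2  # ...allocated as 2 GPUs on each of 2 nodes
```
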
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> for the available <i>types</i> and for details of the NUMA node.
{{site.data.alerts.end}}

## Batch script templates

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>    # Specify GPU partition
#SBATCH --gpus="<type>:<number_gpus>"  # You should specify at least 'gpu'
#SBATCH --time=<D-HH:MM:SS>            # Strongly recommended
#SBATCH --output=<output_file>         # Generate custom output file
#SBATCH --error=<error_file>           # Generate custom error file

## Advanced options example
##SBATCH --nodes=1            # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1           # Uncomment and specify the number of tasks to run
##SBATCH --cpus-per-gpu=5     # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000  # Uncomment and specify the amount of memory per GPU (in MB)
##SBATCH --gpus-per-node=2    # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=2  # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=1    # Uncomment and specify the number of GPUs per task
```

## Advanced configurations