Slurm Configuration: Running jobs

Commit 701f5f6d11 (parent a107b80bd0), 2020-12-23 15:36:33 +01:00


@@ -15,7 +15,7 @@ Before starting using the cluster, please read the following rules:
1. Always try to **estimate** and to **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * This will ease *scheduling* and *backfilling*.
   * Slurm will schedule the queued jobs efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***.
2. Try to optimize your jobs for running within **one day**. Please consider the following:
@@ -37,146 +37,117 @@ Before starting using the cluster, please read the following rules:
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.
{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing the jobs to <i>run within one day</i> will help keep the system used fairly and efficiently.
{{site.data.alerts.end}}
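As a sketch of the rules above, a very long run can be split into shorter array tasks gathered in a single job; the script below is only an illustrative template (the program name `mytask` and the array range are hypothetical):

```shell
#!/bin/bash
#SBATCH --clusters=merlin6
#SBATCH --time=1-00:00:00   # job optimized to run within one day
#SBATCH --array=0-99        # gather 100 related runs into a single array job

# 'mytask' is a hypothetical placeholder; each array task processes its own chunk
srun ./mytask --chunk "${SLURM_ARRAY_TASK_ID}"
```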
## Basic commands for running batch scripts
* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking the status of jobs.
* Use **``scancel``** for cancelling/deleting a job from the queue.
{{site.data.alerts.tip}}Use the Linux <b><u>man pages</u></b> (e.g. <i>man sbatch</i>) for checking the available options for the above commands.
{{site.data.alerts.end}}
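As a usage sketch of the commands above (the script name `myjob.sh` and the job ID `1234` are hypothetical):

```shell
sbatch myjob.sh      # submit the batch script; Slurm prints the assigned job ID
squeue -u "$USER"    # check the status of your queued and running jobs
scancel 1234         # cancel/delete job 1234 from the queue
```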
## Basic settings
For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` and ``srun``).
In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
### Common settings
The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more
information about all possible options. Also, do not hesitate to contact us with any questions.
* **Clusters:** For running jobs on the Merlin6 CPU and GPU nodes, users should add the following option:
```bash
#SBATCH --clusters=merlin6
```
Users with proper access can also use the `merlin5` cluster.
* **Partitions:** Except when using the *default* partition, one needs to specify the partition:
  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)

  The partition can be set as follows:
```bash
#SBATCH --partition=<partition_name>  # Partition to use. 'general' is the default
```
* **[Optional] Disabling shared nodes:** By default, nodes can run jobs from multiple users, while ensuring that CPU/memory/GPU resources are dedicated to each job.
One can request exclusive usage of a node (or set of nodes) with the following option:
```bash
#SBATCH --exclusive  # Only if you want a dedicated node
```
* **Time:** It is important to define how long a job should realistically run. This will help Slurm with *scheduling* and *backfilling*, by managing job queues in a more efficient
way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for
the `DefaultTime` and `MaxTime` values.
```bash
#SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Cannot exceed the partition 'MaxTime'
```
* **Output and error files:** By default, Slurm will generate the standard output and standard error files in the directory from where you submit the batch script:
  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.

  If you want to change the default names, it can be done with the options ``--output`` and ``--error``. For example:
```bash
#SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and job ID
#SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and job ID
```
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the full specification of **filename patterns**.
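As an illustration of the two patterns used above (the node name and job ID below are hypothetical), ``%N`` expands to the short hostname of the allocated node and ``%j`` to the job ID. A plain-bash sketch of that expansion, with no Slurm involved:

```shell
# Simulate how the pattern would expand for a hypothetical
# job 1234 running on node merlin-c-018:
pattern="logs/myJob.%N.%j.out"
out="${pattern/\%N/merlin-c-018}"   # %N -> node hostname
out="${out/\%j/1234}"               # %j -> job ID
echo "$out"                         # logs/myJob.merlin-c-018.1234.out
```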
* **Multithreading/No-Multithreading:** Whether a node has multithreading or not depends on the node configuration. By default, HT nodes have HT enabled, and one can ensure this behaviour with the option `--hint` as follows:
```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```
Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to `--hint`. Please contact us in case of doubts.
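For instance, a hybrid MPI/OpenMP job could combine these options as follows. This is only a sketch: the task and core counts as well as the `./hybrid_app` binary are hypothetical:

```shell
#SBATCH --ntasks=4            # 4 MPI tasks
#SBATCH --cpus-per-task=8     # 8 cores per task, for the OpenMP threads
#SBATCH --hint=nomultithread  # one thread per physical core

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_app             # hypothetical MPI/OpenMP binary
```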
### GPU specific settings
The following settings are required for running on the GPU nodes:
* **Slurm account:** When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:
```bash
#SBATCH --account=merlin-gpu # The account 'merlin-gpu' must be used for GPUs
```
* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:
```bash
#SBATCH --gres=gpu # Always set at least this option when using GPUs
```
Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gres` options.
* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying
the GPUs as a consumable resource. These are the following:
  * `--gpus`, *instead of* (but also in addition to) `--gres=gpu`: specifies the total number of GPUs required for the job.
  * `--cpus-per-gpu`, to specify the number of CPUs to be used for each GPU.
  * `--mem-per-gpu`, to specify the amount of memory to be used for each GPU.
  * `--gpus-per-node`, `--gpus-per-socket`, `--gpus-per-task`, to specify how many GPUs per node, socket and/or task need to be allocated.
  * Other advanced options (e.g. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (e.g. *`man sbatch`*) for further information.
Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gpus` options.
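As a sketch of how the new options fit together (the GPU, CPU, and memory figures below are arbitrary examples, not recommendations):

```shell
#SBATCH --account=merlin-gpu  # mandatory Slurm account for GPU nodes
#SBATCH --gpus=2              # 2 GPUs in total for the job
#SBATCH --cpus-per-gpu=5      # 5 CPUs allocated per GPU
#SBATCH --mem-per-gpu=16000   # 16000 MB of memory per GPU
```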
#### GPU advanced settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used.
This is done according to the following rules:
**Until 08.01.2021**, users can define which GPU resources and *how many per node* they need with the ``--gres`` option.
Valid ``gres`` options are ``gpu[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus requested per node>``. For example:
```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```
**From 08.01.2021**, `--gres` is not needed anymore (but can still be used), and `--gpus` and the related options should replace it. `--gpus` works in a similar way, but without
the need of specifying the `gpu` resource. In other words, `--gpus` options are ``[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus to use>``. For example:
```bash
#SBATCH --gpus=GTX1080:4 # Use 4 GPUs with Type=GTX1080
```
This setting can be combined with other options, such as `--gpus-per-node`, in order to accomplish a behaviour similar to `--gres`.
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> for the available <i>Types</i> and for details of the NUMA nodes.
{{site.data.alerts.end}}
## Batch script templates
@@ -232,8 +203,8 @@ The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>    # Specify GPU partition
#SBATCH --gpus="<type>:<number_gpus>"  # Specify the number (and optionally the type) of GPUs
#SBATCH --time=<D-HH:MM:SS>            # Strongly recommended
#SBATCH --output=<output_file>         # Generate custom output file
#SBATCH --error=<error_file>           # Generate custom error file
@@ -242,9 +213,12 @@ The following template should be used by any user submitting jobs to GPU nodes:
## Advanced options example
##SBATCH --nodes=1            # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1           # Uncomment and specify the number of tasks
##SBATCH --cpus-per-gpu=5     # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000  # Uncomment and specify the amount of memory per GPU (in MB)
##SBATCH --gpus-per-node=2    # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=2  # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=1    # Uncomment and specify the number of GPUs per task
```
## Advanced configurations