diff --git a/pages/merlin6/03 Job Submission/running-jobs.md b/pages/merlin6/03 Job Submission/running-jobs.md
index 33dd992..643a3af 100644
--- a/pages/merlin6/03 Job Submission/running-jobs.md
+++ b/pages/merlin6/03 Job Submission/running-jobs.md
@@ -15,7 +15,7 @@ Before starting using the cluster, please read the following rules:
 1. Always try to **estimate and** to **define a proper run time** of your jobs:
    * Use ``--time=`` for that.
-     * This will ease the scheduling.
+     * This will ease *scheduling* and *backfilling*.
      * Slurm will schedule efficiently the queued jobs.
    * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
 2. Try to optimize your jobs for running within **one day**. Please, consider the following:
@@ -27,7 +27,7 @@ Before starting using the cluster, please read the following rules:
 3. Is **forbidden** to run **very short jobs**:
    * Running jobs of few seconds can cause severe problems.
    * Running very short jobs causes a lot of overhead.
-   * ***Question:*** Is my job a very short job?
+   * ***Question:*** Is my job a very short job?
     * ***Answer:*** If it lasts in few seconds or very few minutes, yes.
   * ***Question:*** How long should my job run?
     * ***Answer:*** as the *Rule of Thumb*, from 5' would start being ok, from 15' would preferred.
@@ -37,146 +37,117 @@ Before starting using the cluster, please read the following rules:
 4. Do not submit hundreds of similar jobs!
    * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.

+{{site.data.alerts.tip}}Having a good estimate of the time needed by your jobs, a proper way of running them, and optimizing the jobs to run within one day will all contribute to a fair and efficient use of the system.
+{{site.data.alerts.end}}
+
 ## Basic commands for running batch scripts

-**``sbatch``** is the command used for submitting a batch script to Slurm
- * Use **``srun``**: to run parallel tasks.
-   * As an alternative, ``mpirun`` and ``mpiexec`` can be used. However, ***is strongly recommended to user ``srun``*** instead.
- * Use **``squeue``** for checking jobs status
- * Use **``scancel``** for deleting a job from the queue.
+* Use **``sbatch``** for submitting a batch script to Slurm.
+* Use **``srun``** for running parallel tasks.
+* Use **``squeue``** for checking the job status.
+* Use **``scancel``** for cancelling/deleting a job from the queue.
+
+{{site.data.alerts.tip}}Use the Linux man pages (e.g. man sbatch) for checking the available options of the above commands.
+{{site.data.alerts.end}}

 ## Basic settings

-For a complete list of options and parameters available is recommended to use the **man** pages (``man sbatch``, ``man srun``, ``man salloc``). Please, notice that behaviour for some parameters might change depending on the command (in example, ``--exclusive`` behaviour in ``sbatch`` differs from ``srun``.
+For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
+Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` and in ``srun``).

 In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
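+
+To illustrate the basic commands listed above, a minimal session might look as follows (this is just a sketch: the script name `myjob.sh` and the job ID `12345678` are placeholders):
+
+```bash
+# Submit a batch script to the merlin6 cluster; Slurm prints the assigned job ID
+sbatch --clusters=merlin6 myjob.sh
+
+# Check the status of your own jobs in the merlin6 cluster
+squeue --clusters=merlin6 -u $USER
+
+# Cancel a job, using the job ID reported by sbatch/squeue
+scancel --clusters=merlin6 12345678
+```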
-### Clusters
+### Common settings

-* For running jobs in the **Merlin6** computing nodes, users have to add the following option:
+The following settings are the minimum required for running a job in the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more
+information about all possible options. Also, do not hesitate to contact us with any questions.
+* **Clusters:** For running jobs in the Merlin6 CPU and GPU nodes, users should add the following option:
  ```bash
  #SBATCH --clusters=merlin6
  ```
-* For running jobs in the **Merlin5** computing nodes, users have to add the following options:
+  Users with proper access can also use the `merlin5` cluster.
+* **Partitions:** except when using the *default* partition, one needs to specify the partition:
+  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
+  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)
+  The partition can be set as follows:
  ```bash
-  #SBATCH --clusters=merlin5
+  #SBATCH --partition=<partition> # Partition to use. 'general' is the 'default'
  ```
-
-***For advanced users:*** If you do not care where to run the jobs (**Merlin5** or **Merlin6**) you can skip this setting, however you must make sure that your code can run on both clusters without any problem and you have defined proper settings in your *batch* script.
-
-### Partitions
-
-**Merlin6** contains 4 partitions for general purpose, while **Merlin5** contains 1 single CPU partition (for historical reasons):
-
- * **Merlin6** CPU partitions are 3: ``general``, ``daily`` and ``hourly``.
- * **Merlin6** GPU partition is 1: ``gpu``.
- * **Merlin5** CPU partition is 1: ``merlin``
-
-For Merlin6, if no partition is defined, ``general`` will be the default, while for Merlin5 is ``merlin``. Partitions can be changed by defining the ``--partition`` option as follows:
-
-```bash
-#SBATCH --partition= # Partition to use. 'general' is the 'default' in Merlin6.
-```
-
-Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
-
-### Hyperthreaded vs non-Hyperthreaded jobs
-
-Computing nodes in **merlin6** have hyperthreading enabled: every core is running two threads. It means that for many cases it needs to be disabled and only those multithread-based applications will benefit from that. There are some parameters that users must apply:
-
-* For **hyperthreaded based jobs** users ***must*** specify the following options:
-
+* **[Optional] Disabling shared nodes**: by default, nodes can be shared by jobs from multiple users, while ensuring that the requested CPU/Memory/GPU resources are dedicated to each job.
+  One can request exclusive usage of a node (or set of nodes) with the following option:
  ```bash
-  #SBATCH --hint=multithread # Mandatory for multithreaded jobs
-  #SBATCH --ntasks-per-core=2 # Only needed when a task fits into a core
+  #SBATCH --exclusive # Only if you want a dedicated node
  ```
-
-* For **non-hyperthreaded based jobs** users ***must*** specify the following options:
-
-  ```bash
-  #SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
-  #SBATCH --ntasks-per-core=1 # Only needed when a task fits into a core
-  ```
-
-{{site.data.alerts.tip}} In general, --hint=[no]multithread is a mandatory field.
On the other hand, --ntasks-per-core is only needed when
-one needs to define how a task should be handled within a core, and this setting will not be generally used on Hybrid MPI/OpenMP jobs where multiple cores are needed for a single tasks.
-{{site.data.alerts.end}}
-
-### Shared vs exclusive nodes
-
-The **Merlin5** and **Merlin6** clusters are designed in a way that should allow running MPI/OpenMP processes as well as single core based jobs. For allowing co-existence, nodes are configured by default in a shared mode. It means, that multiple jobs from multiple users may land in the same node. This behaviour can be changed by a user if they require exclusive usage of nodes.
-
-By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way, we fill up first mixed nodes and we ensure that free full resources are available for MPI/OpenMP jobs.
-
-Exclusivity of a node can be setup by specific the ``--exclusive`` option as follows:
-
-```bash
-#SBATCH --exclusive
-```
-
-### Time
-
-There are some settings that are not mandatory but would be needed or useful to specify. These are the following:
-
-* ``--time``: mostly used when you need to specify longer runs in the ``general`` partition, also useful for specifying
-shorter times. **This will affect scheduling priorities**, hence is important to define it (and to define it properly).
-
+* **Time**: it is important to define how long a job is expected to run, as realistically as possible. This will help Slurm with *scheduling* and *backfilling*, by managing the job queues in a more efficient
+way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for the
+`DefaultTime` and `MaxTime` values.
  ```bash
-  #SBATCH --time= # Time job needs to run
+  #SBATCH --time=<time> # Time the job needs to run. Cannot exceed the partition `MaxTime`
  ```
+* **Output and error files**: by default, Slurm will generate standard output and error files in the directory from which you submit the batch script:
+  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
+  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
+
+  If you want to change the default names, it can be done with the options ``--output`` and ``--error``. For example:
+  ```bash
+  #SBATCH --output=logs/myJob.%N.%j.out # Generate an output file per hostname and jobid
+  #SBATCH --error=logs/myJob.%N.%j.err  # Generate an error file per hostname and jobid
+  ```
+  Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the full specification of the available **filename patterns**.
-### Output and Errors
-
-By default, Slurm script will generate standard output and errors files in the directory from where
-you submit the batch script:
-
-* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
-* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
-
-If you want to the default names it can be done with the options ``--output`` and ``--error``. In example:
-
-```bash
-#SBATCH --output=logs/myJob.%N.%j.out # Generate an output file per hostname and jobid
-#SBATCH --error=logs/myJob.%N.%j.err # Generate an errori file per hostname and jobid
-```
-
-Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting a list specification of **filename patterns**.
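+  For instance, with the ``--output``/``--error`` options shown above, a hypothetical job with ID `1234` running on a node named `merlin-c-001` would write `logs/myJob.merlin-c-001.1234.out` and `logs/myJob.merlin-c-001.1234.err`. Note that Slurm does not create missing directories for you, so make sure the `logs` directory exists before submitting:
+  ```bash
+  # Create the output directory (if missing) before submitting the job
+  mkdir -p logs
+  ```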
+* **Multithreading/No-Multithreading:** whether a node has multithreading or not depends on the node configuration. By default, HT nodes have HT enabled, but one can explicitly control this behaviour with the `--hint` option as follows:
+  ```bash
+  #SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
+  #SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
+  ```
+  Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to the `--hint` option. Please contact us in case of doubts.

 ### GPU specific settings

-#### Slurm account
+The following settings are required for running on the GPU nodes:

-When using GPUs, users must switch to the **merlin-gpu** Slurm account in order to be able to run on GPU-based nodes. This is done with the ``--account`` setting as follows:
+* **Slurm account**: When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:
+  ```bash
+  #SBATCH --account=merlin-gpu # The account 'merlin-gpu' must be used for GPUs
+  ```
+* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:
+  ```bash
+  #SBATCH --gres=gpu # Always set at least this option when using GPUs
+  ```
+
+  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gres` options.
+* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying
+the GPUs as a consumable resource. These are the following:
+  * `--gpus`, *instead of* (but also in combination with) `--gres=gpu`: specifies the total number of GPUs required for the job.
+  * `--cpus-per-gpu`, to specify the number of CPUs to be used for each GPU.
+  * `--mem-per-gpu`, to specify the amount of memory to be used for each GPU.
+  * `--gpus-per-node`, `--gpus-per-socket`, `--gpus-per-task`, to specify how many GPUs per node, socket and/or task need to be allocated.
+  * Other advanced options (e.g. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (e.g. *`man sbatch`*) for further information.
+  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gpus` options.
-```bash
-#SBATCH --account=merlin-gpu # The account 'merlin-gpu' must be used
-```
-
-#### GRES
-
-The following options are mandatory settings that **must be included** in your batch scripts:
-
-```bash
-#SBATCH --gres=gpu # Always set at least this option when using GPUs
-```
-
-##### GRES advanced settings
+#### GPU advanced settings

 GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
-must be used. Users can define which GPUs resources they need with the ``--gres`` option.
-Valid ``gres`` options are: ``gpu[[:type]:count]`` where ``type=GTX1080|GTX1080Ti`` and ``count=``
-This would be according to the following rules:
-
-In example:
+must be used.
+**Until 08.01.2021**, users can define which GPU resources and *how many per node* they need with the ``--gres`` option.
+Valid ``gres`` options are: ``gpu[[:type]:count]`` where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of GPUs to use>``.
For example:
```bash
#SBATCH --gres=gpu:GTX1080:4 # Use a node with 4 x GTX1080 GPUs
```
-***Important note:*** Due to a bug in the configuration, ``[:type]`` (i.e. ``GTX1080`` or ``GTX1080Ti``) is not working. Users should skip that and use only ``gpu[:count]``. This will be fixed in the upcoming downtimes as it requires a full restart of the batch system.
+**From 08.01.2021**, `--gres` is not needed anymore (but can still be used), and `--gpus` and other related options should replace it. `--gpus` works in a similar way, but without
+the need of specifying the `gpu` resource. In other words, the valid `--gpus` options are ``[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of GPUs to use>``. For example:
```bash
#SBATCH --gpus=GTX1080:4 # Use 4 GPUs with Type=GTX1080
```
+This setting can be combined with other options, such as `--gpus-per-node`, in order to accomplish a behaviour similar to `--gres`.
+
+{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' for the available GPU Types and for details of the NUMA node.
+{{site.data.alerts.end}}

 ## Batch script templates

@@ -232,8 +203,8 @@ The following template should be used by any user submitting jobs to GPU nodes:
 ```bash
 #!/bin/bash
-#SBATCH --partition=gpu_ # Specify 'general' or 'daily' or 'hourly'
-#SBATCH --gres="gpu::" # You should specify at least 'gpu'
+#SBATCH --partition=<partition> # Specify a GPU partition ('gpu' or 'gpu-short')
+#SBATCH --gpus="<type>:<number>" # You should specify at least the number of GPUs
 #SBATCH --time= # Strongly recommended
 #SBATCH --output= # Generate custom output file
 #SBATCH --error=