Document common statuses

Spencer Bliven
2019-07-29 15:47:53 +02:00
parent b7b52fbdce
commit 51400e382f

@@ -1,5 +1,5 @@
---
title: Running Jobs
#tags:
#keywords:
last_updated: 18 June 2019
@@ -12,7 +12,7 @@ permalink: /merlin6/running-jobs.html
* ``sbatch``: to submit a batch script to Slurm. Use ``squeue`` for checking job status and ``scancel`` for deleting a job from the queue.
* ``srun``: to run parallel jobs in the batch system.
* ``salloc``: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished.
  This is equivalent to an interactive run.
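As a quick illustration of how these commands fit together (the script name ``myjob.sh`` is a placeholder):
```
sbatch myjob.sh     # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER     # check the status of your queued and running jobs
scancel <jobid>     # delete a job from the queue by its job ID
```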
## Running on Merlin5
@@ -24,7 +24,7 @@ but they will need to specify a couple of extra options to their scripts.
#SBATCH --clusters=merlin5
```
By adding ``--clusters=merlin5``, jobs will be sent to the old Merlin5 computing nodes. Also, ``--partition=<merlin|gpu>`` can be specified in
order to use the old Merlin5 partitions.
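In multi-cluster setups, ``squeue`` and ``scancel`` also accept a ``--clusters`` option, so Merlin5 jobs can be inspected the same way; a small sketch, assuming the option behaves as in stock Slurm:
```
squeue --clusters=merlin5 -u $USER    # list your jobs on the Merlin5 cluster
scancel --clusters=merlin5 <jobid>    # cancel a Merlin5 job by its ID
```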
## Running on Merlin6
@@ -35,12 +35,12 @@ In order to run on the **Merlin6** cluster, users have to add the following opti
#SBATCH --clusters=merlin6
```
By adding ``--clusters=merlin6``, jobs will be sent to the Merlin6 computing nodes.
## Shared nodes and exclusivity
The **Merlin6** cluster has been designed to allow running MPI/OpenMP processes as well as single-core jobs. To allow
co-existence, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. This
behaviour can be changed by users who require exclusive usage of nodes, as sketched below.
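A minimal sketch of requesting nodes exclusively (the ``--exclusive`` option is described further below):
```
#SBATCH --exclusive    # request whole nodes; no other jobs will share them
```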
By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way,
@@ -57,8 +57,8 @@ Exclusivity of a node can be set up by specifying the ``--exclusive`` option as fol
By default, a Slurm batch script will generate standard output and error files in the directory from which
you submit it:
* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
If you want to change the default names, this can be done with the options ``--output`` and ``--error``. For example:
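(The following is a sketch; the file names are placeholders, and ``%j`` is the sbatch filename pattern that expands to the job ID, per the **man sbatch** reference below.)
```
#SBATCH --output=myjob-%j.out    # standard output goes to myjob-<jobid>.out
#SBATCH --error=myjob-%j.err     # standard error goes to myjob-<jobid>.err
```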
@@ -72,8 +72,8 @@ Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting
## Partitions
Merlin6 contains the following partitions for general purpose:
* For the CPU these are ``general``, ``daily`` and ``hourly``.
* For the GPU this is ``gpu``.
If no partition is defined, ``general`` will be the default. The partition can be defined with the ``--partition`` option as follows:
@@ -82,7 +82,7 @@ If no partition is defined, ``general`` will be the default. Partition can be de
#SBATCH --partition=<partition_name> # Partition to use. 'general' is the 'default'.
```
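For instance, a sketch targeting the ``hourly`` partition (assuming, from its name, that it is intended for short jobs):
```
#SBATCH --partition=hourly    # use the 'hourly' partition instead of the default 'general'
```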
Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
## CPU-based Jobs Settings
@@ -115,7 +115,7 @@ The following template should be used by any user submitting jobs to CPU nodes:
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=44 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44 # Uncomment and specify #tasks per node
##SBATCH --ntasks-per-core=2 # Uncomment and specify #tasks per core (a.k.a. threads)
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
@@ -139,8 +139,8 @@ The following options are mandatory settings that **must be included** in your b
### Slurm GPU Recommended Settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used. Users can define which GPU resources they need with the ``--gres`` option.
Valid ``gres`` options are: ``gpu[[:type]:count]`` where ``type=GTX1080|GTX1080Ti`` and ``count=<number of gpus to use>``.
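For example, a job needing two GTX1080 cards could request (a sketch following the syntax above; the count is arbitrary):
```
#SBATCH --gres=gpu:GTX1080:2    # request 2 x GTX1080 GPUs
```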
This would be according to the following rules:
@@ -170,3 +170,30 @@ The following template should be used by any user submitting jobs to GPU nodes:
##SBATCH --ntasks-per-node=44 # Uncomment and specify number of tasks per node
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
```
## Job status
The status of submitted jobs can be checked with the `squeue` command:
```
~ $ squeue -u bliven_s
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
134507729 gpu test_scr bliven_s PD 0:00 3 (AssocGrpNodeLimit)
134507768 general test_scr bliven_s PD 0:00 19 (AssocGrpCpuLimit)
134507729 gpu test_scr bliven_s PD 0:00 3 (Resources)
134506301 gpu test_scr bliven_s PD 0:00 1 (Priority)
134506288 gpu test_scr bliven_s R 9:16 1 merlin-g-008
```
Common Statuses:
- *merlin-\** Running on the specified host
- *(Priority)* Waiting in the queue
- *(Resources)* At the head of the queue, waiting for machines to become available
- *(AssocGrpCpuLimit), (AssocGrpNodeLimit)* Job would exceed per-user limitations on
the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
resubmit with fewer resources, or else wait for your other jobs to finish.
- *(PartitionNodeLimit)* Exceeds all resources available on this partition.
Run `scancel` and resubmit to a different partition (`-p`) or with fewer
resources.
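For example, the ``AssocGrpCpuLimit`` job shown above could be removed and resubmitted with fewer resources (the script name ``myjob.sh`` is a placeholder):
```
scancel 134507768             # remove the job exceeding the per-user CPU limit
sbatch --ntasks=8 myjob.sh    # resubmit requesting fewer CPUs
```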