From 51400e382f900932cf789dbff27d3ece8d937fd0 Mon Sep 17 00:00:00 2001
From: Spencer Bliven
Date: Mon, 29 Jul 2019 15:47:53 +0200
Subject: [PATCH] Document common statuses

---
 .../merlin6/03 merlin6-slurm/running-jobs.md | 53 ++++++++++++++-----
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/pages/merlin6/03 merlin6-slurm/running-jobs.md b/pages/merlin6/03 merlin6-slurm/running-jobs.md
index 592c320..accbaf8 100644
--- a/pages/merlin6/03 merlin6-slurm/running-jobs.md
+++ b/pages/merlin6/03 merlin6-slurm/running-jobs.md
@@ -1,5 +1,5 @@
 ---
-title: Running Jobs 
+title: Running Jobs
 #tags:
 #keywords:
 last_updated: 18 June 2019
@@ -12,7 +12,7 @@ permalink: /merlin6/running-jobs.html
 
 * ``sbatch``: to submit a batch script to Slurm. Use ``squeue`` for checking job status and ``scancel`` for deleting a job from the queue.
 * ``srun``: to run parallel jobs in the batch system
-* ``salloc``: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished. 
+* ``salloc``: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished. This is equivalent to an interactive run.
 
 ## Running on Merlin5
 
@@ -24,7 +24,7 @@ but they will need to specify a couple of extra options to their scripts.
 #SBATCH --clusters=merlin5
 ```
 
-By adding ``--clusters=merlin5`` it will send the jobs to the old Merlin5 computing nodes. Also, ``--partition=<partition>`` can be specified in 
+By adding ``--clusters=merlin5``, jobs will be sent to the old Merlin5 computing nodes. Also, ``--partition=<partition>`` can be specified in
 order to use the old Merlin5 partitions.
 
 ## Running on Merlin6
 
@@ -35,12 +35,12 @@ In order to run on the **Merlin6** cluster, users have to add the following opti
 #SBATCH --clusters=merlin6
 ```
 
-By adding ``--clusters=merlin6`` it will send the jobs to the old Merlin6 computing nodes. 
+By adding ``--clusters=merlin6``, jobs will be sent to the Merlin6 computing nodes.
 
 ## Shared nodes and exclusivity
 
 The **Merlin6** cluster has been designed in a way that should allow running MPI/OpenMP processes as well as single core based jobs. For allowing
-co-existence, nodes are configured by default in a shared mode. It means, that multiple jobs from multiple users may land in the same node. This 
+co-existence, nodes are configured by default in a shared mode. This means that multiple jobs from multiple users may land on the same node. This
 behaviour can be changed by a user if they require exclusive usage of nodes.
 
 By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way,
@@ -57,8 +57,8 @@ Exclusivity of a node can be setup by specific the ``--exclusive`` option as fol
 
 By default, the Slurm script will generate standard output and error files in the directory from which you submit the batch script:
 
-* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``. 
-* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``. 
+* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
+* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
 
 If you want to change the default names, it can be done with the options ``--output`` and ``--error``.
 For example:
@@ -72,8 +72,8 @@ Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting
 ## Partitions
 
 Merlin6 contains 6 partitions for general purpose:
-
-  * For the CPU these are ``general``, ``daily`` and ``hourly``. 
+
+  * For the CPU these are ``general``, ``daily`` and ``hourly``.
   * For the GPU this is ``gpu``.
 
 If no partition is defined, ``general`` will be the default. Partition can be defined with the ``--partition`` option as follows:
@@ -82,7 +82,7 @@ If no partition is defined, ``general`` will be the de
 #SBATCH --partition=<partition>   # Partition to use. 'general' is the default.
 ```
 
-Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup. 
+Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about the Merlin6 partition setup.
 
 ## CPU-based Jobs Settings
 
@@ -115,7 +115,7 @@ The following template should be used by any user submitting jobs to CPU nodes:
 
 ## Advanced options example
 ##SBATCH --nodes=1                # Uncomment and specify #nodes to use
-##SBATCH --ntasks=44              # Uncomment and specify #nodes to use 
+##SBATCH --ntasks=44              # Uncomment and specify #tasks to use
 ##SBATCH --ntasks-per-node=44     # Uncomment and specify #tasks per node
 ##SBATCH --ntasks-per-core=2      # Uncomment and specify #tasks per core (a.k.a. threads)
 ##SBATCH --cpus-per-task=44       # Uncomment and specify the number of cores per task
@@ -139,8 +139,8 @@ The following options are mandatory settings that **must be included** in your b
 
 ### Slurm GPU Recommended Settings
 
-GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process 
-must be used. Users can define which GPUs resources they need with the ``--gres`` option. 
+GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but each user process
+must use only one GPU. Users can define which GPU resources they need with the ``--gres`` option.
 Valid ``gres`` options are: ``gpu[[:type]:count]`` where ``type=GTX1080|GTX1080Ti`` and ``count=<number of GPUs>``.
 This would be according to the following rules:
@@ -170,3 +170,30 @@ The following template should be used by any user submitting jobs to GPU nodes:
 ##SBATCH --ntasks-per-node=44     # Uncomment and specify number of tasks per node
 ##SBATCH --cpus-per-task=44       # Uncomment and specify the number of cores per task
 ```
+
+
+## Job status
+
+The status of submitted jobs can be checked with the `squeue` command:
+
+```
+~ $ squeue -u bliven_s
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+         134507729       gpu test_scr bliven_s PD       0:00      3 (AssocGrpNodeLimit)
+         134507768   general test_scr bliven_s PD       0:00     19 (AssocGrpCpuLimit)
+         134507729       gpu test_scr bliven_s PD       0:00      3 (Resources)
+         134506301       gpu test_scr bliven_s PD       0:00      1 (Priority)
+         134506288       gpu test_scr bliven_s  R       9:16      1 merlin-g-008
+```
+
+Common Statuses:
+- *merlin-\** Running on the specified host
+- *(Priority)* Waiting in the queue
+- *(Resources)* At the head of the queue, waiting for machines to become available
+- *(AssocGrpCpuLimit), (AssocGrpNodeLimit)* The job would exceed the per-user limits on
+  the number of simultaneous CPUs/nodes. Use `scancel` to remove the job and
+  resubmit with fewer resources, or else wait for your other jobs to finish.
+- *(PartitionNodeLimit)* The job requests more resources than this partition provides.
+  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
+  resources.
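As a quick illustration of the workflow the new *Job status* section documents, here is a minimal sketch tying together the commands already described in the page (`sbatch`, `squeue`, `scancel`). The script name `myjob.sh` and the job ID are illustrative placeholders, not values taken from the patch.

```bash
# Submit a batch script to the Merlin6 cluster (myjob.sh is a placeholder).
sbatch --clusters=merlin6 --partition=hourly myjob.sh
# sbatch reports the assigned job ID, e.g. "Submitted batch job 134507729"

# List your own jobs: the ST column shows PD (pending) or R (running), and
# NODELIST(REASON) shows the host or one of the waiting reasons listed above.
squeue -u $USER

# If a job is stuck pending (e.g. with AssocGrpCpuLimit), cancel it and
# resubmit with fewer resources, or wait for your other jobs to finish.
scancel 134507729
```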