Document common statuses

Spencer Bliven
2019-07-29 15:47:53 +02:00
parent b7b52fbdce
commit 51400e382f


@@ -1,5 +1,5 @@
---
title: Running Jobs
#tags:
#keywords:
last_updated: 18 June 2019
@@ -12,7 +12,7 @@ permalink: /merlin6/running-jobs.html
* ``sbatch``: to submit a batch script to Slurm. Use ``squeue`` for checking job status and ``scancel`` for deleting a job from the queue.
* ``srun``: to run parallel jobs in the batch system
* ``salloc``: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished. This is equivalent to an interactive run.
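As an illustration, a typical workflow with these commands might look like the following sketch (the script name and job ID are just examples):
```
~ $ sbatch myjob.sh        # Submit the batch script 'myjob.sh'
Submitted batch job 134507000
~ $ squeue -u $USER        # Check the status of your jobs
~ $ scancel 134507000      # Delete the job from the queue if needed
```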
## Running on Merlin5
@@ -24,7 +24,7 @@ but they will need to specify a couple of extra options to their scripts.
#SBATCH --clusters=merlin5
```
By adding ``--clusters=merlin5``, jobs will be sent to the old Merlin5 computing nodes. In addition, ``--partition=<merlin|gpu>`` can be specified in
order to use the old Merlin5 partitions.
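For instance, a minimal batch script targeting Merlin5 could look like the following sketch (the job name and command are placeholders):
```
#!/bin/bash
#SBATCH --clusters=merlin5       # Send the job to the old Merlin5 cluster
#SBATCH --partition=merlin       # Optionally select one of the old Merlin5 partitions
#SBATCH --job-name=test_merlin5  # Example job name

srun hostname                    # Replace with your own command
```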
## Running on Merlin6
@@ -35,12 +35,12 @@ In order to run on the **Merlin6** cluster, users have to add the following opti
#SBATCH --clusters=merlin6
```
By adding ``--clusters=merlin6``, jobs will be sent to the Merlin6 computing nodes.
## Shared nodes and exclusivity
The **Merlin6** cluster has been designed to allow running MPI/OpenMP processes as well as single-core jobs. To allow
co-existence, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. This
behaviour can be changed by users who require exclusive usage of nodes.
By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way,
@@ -57,8 +57,8 @@ Exclusivity of a node can be set up by specifying the ``--exclusive`` option as fol
By default, Slurm will generate the standard output and error files in the directory from which
you submit the batch script:
* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
If you want to change the default names, this can be done with the ``--output`` and ``--error`` options.
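For example, a minimal sketch could be (the file names are only illustrative; ``%j`` expands to the job ID):
```
#SBATCH --output=myjob-%j.out    # Write standard output to 'myjob-<jobid>.out'
#SBATCH --error=myjob-%j.err     # Write standard error to 'myjob-<jobid>.err'
```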
@@ -72,8 +72,8 @@ Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting
## Partitions
Merlin6 contains 6 partitions for general purpose:
* For the CPU these are ``general``, ``daily`` and ``hourly``.
* For the GPU there is ``gpu``.
If no partition is defined, ``general`` will be the default. The partition can be defined with the ``--partition`` option as follows:
@@ -82,7 +82,7 @@ If no partition is defined, ``general`` will be the de
#SBATCH --partition=<partition_name>  # Partition to use ('general' is the default).
```
Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
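To see which partitions are available and the state of their nodes, the ``sinfo`` command can be used, for example (output will vary):
```
~ $ sinfo --clusters=merlin6    # List partitions and node states for the Merlin6 cluster
~ $ sinfo -p daily              # Show only the 'daily' partition
```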
## CPU-based Jobs Settings
@@ -115,7 +115,7 @@ The following template should be used by any user submitting jobs to CPU nodes:
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=44 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44 # Uncomment and specify #tasks per node
##SBATCH --ntasks-per-core=2 # Uncomment and specify #tasks per core (a.k.a. threads)
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
@@ -139,8 +139,8 @@ The following options are mandatory settings that **must be included** in your b
### Slurm GPU Recommended Settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used. Users can define which GPU resources they need with the ``--gres`` option.
Valid ``gres`` options are: ``gpu[[:type]:count]``, where ``type=GTX1080|GTX1080Ti`` and ``count=<number of gpus to use>``.
This would be according to the following rules:
@@ -170,3 +170,30 @@ The following template should be used by any user submitting jobs to GPU nodes:
##SBATCH --ntasks-per-node=44 # Uncomment and specify number of tasks per node
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
```
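As an illustration of the ``--gres`` syntax described above, a job requesting two GTX1080 cards could include the following lines (the counts are only an example):
```
#SBATCH --partition=gpu          # Run on the GPU partition
#SBATCH --gres=gpu:GTX1080:2     # Request 2 GTX1080 GPUs
```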
## Job status
The status of submitted jobs can be checked with the `squeue` command:
```
~ $ squeue -u bliven_s
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
134507729 gpu test_scr bliven_s PD 0:00 3 (AssocGrpNodeLimit)
134507768 general test_scr bliven_s PD 0:00 19 (AssocGrpCpuLimit)
134507729 gpu test_scr bliven_s PD 0:00 3 (Resources)
134506301 gpu test_scr bliven_s PD 0:00 1 (Priority)
134506288 gpu test_scr bliven_s R 9:16 1 merlin-g-008
```
Common Statuses:
- *merlin-\** Running on the specified host
- *(Priority)* Waiting in the queue
- *(Resources)* At the head of the queue, waiting for machines to become available
- *(AssocGrpCpuLimit), (AssocGrpNodeLimit)* Job would exceed per-user limitations on
the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
resubmit with fewer resources, or else wait for your other jobs to finish.
- *(PartitionNodeLimit)* Exceeds all resources available on this partition.
Run `scancel` and resubmit to a different partition (`-p`) or with fewer
resources.
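For example, a pending job that exceeds a limit could be removed and resubmitted with fewer resources (the job ID is taken from the output above; the script name is a placeholder):
```
~ $ scancel 134507768                 # Remove the job exceeding the CPU limit
~ $ sbatch --nodes=4 test_script.sh   # Resubmit it requesting fewer nodes
```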