Document common statuses

2019-07-29 15:47:53 +02:00
parent b7b52fbdce
commit 51400e382f
1 changed files with 40 additions and 13 deletions
--- a/merlin6-slurm/running-jobs.md
+++ b/merlin6-slurm/running-jobs.md
@@ -170,3 +170,30 @@ The following template should be used by any user submitting jobs to GPU nodes:
 ##SBATCH --ntasks-per-node=44               # Uncomment and specify number of tasks per node
 ##SBATCH --cpus-per-task=44                 # Uncomment and specify the number of cores per task
 ```
+
+
+## Job status
+
+The status of submitted jobs can be checked with the `squeue` command:
+
+```
+~ $ squeue -u bliven_s
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+         134507729       gpu test_scr bliven_s PD       0:00      3 (AssocGrpNodeLimit)
+         134507768   general test_scr bliven_s PD       0:00     19 (AssocGrpCpuLimit)
+         134507729       gpu test_scr bliven_s PD       0:00      3 (Resources)
+         134506301       gpu test_scr bliven_s PD       0:00      1 (Priority)
+         134506288       gpu test_scr bliven_s  R       9:16      1 merlin-g-008
+```
+
+Common statuses, as shown in the `NODELIST(REASON)` column:
+- *merlin-\**: Running on the specified host.
+- *(Priority)*: Waiting in the queue behind higher-priority jobs.
+- *(Resources)*: At the head of the queue, waiting for machines to become available.
+- *(AssocGrpCpuLimit), (AssocGrpNodeLimit)*: The job would exceed the per-user limit on
+  the number of CPUs or nodes in use simultaneously. Use `scancel` to remove the job and
+  resubmit with fewer resources, or wait for your other jobs to finish.
+- *(PartitionNodeLimit)*: The job requests more resources than this partition provides.
+  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
+  resources.
+
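When many jobs are pending, it can help to count how many are held back by each reason. The following is a minimal sketch, not part of the Merlin tooling: `summarize_reasons` is a hypothetical helper name, and it assumes the default `squeue` column layout shown above (state in the fifth column, reason in the last) and job names without spaces.

```shell
# Hypothetical helper: count pending jobs per reason.
# Reads default-format `squeue` output on stdin.
summarize_reasons() {
  # Skip the header line, keep only pending (PD) jobs,
  # and tally the last column, which holds the reason.
  awk 'NR > 1 && $5 == "PD" { count[$NF]++ }
       END { for (r in count) print count[r], r }' |
    LC_ALL=C sort -k2   # sort by reason for stable output
}
```

It could be used as `squeue -u "$USER" | summarize_reasons`. Jobs reported under one of the limit reasons are the candidates for `scancel` and resubmission with fewer resources.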