---
title: Running Slurm Scripts
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---
## The rules
Before starting to use the cluster, please read the following rules:

1. Always try to **estimate** and **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that (see the example after this list).
   * This will ease the scheduling.
   * Slurm will schedule the queued jobs more efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***.
2. Try to optimize your jobs for running within **one day**. Please consider the following:
   * Some software can simply scale up by using more nodes while drastically reducing the run time.
   * Some software allows saving a specific state, so that a second job can start from that state.
     * ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
   * Use the **'daily'** partition when you are sure that your job can run within one day:
     * ***'daily'*** **will give you more priority than running in the** ***'general'*** **queue!**
3. It is **forbidden** to run **very short jobs**:
   * Running jobs of a few seconds can cause severe problems.
   * Running very short jobs causes a lot of overhead.
   * ***Question:*** Is my job a very short job?
     * ***Answer:*** If it lasts only a few seconds or very few minutes, yes.
   * ***Question:*** How long should my job run?
     * ***Answer:*** As a *rule of thumb*, from 5 minutes on is acceptable, and from 15 minutes on is preferred.
   * Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
   * For short runs lasting less than 1 hour, please use the **hourly** partition:
     * ***'hourly'*** **will give you more priority than running in the** ***'daily'*** **queue!**
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for bundling similar jobs instead.
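As an illustration of these rules, a minimal batch script header could look as follows (a sketch: the partition, run time, and program name are placeholders to adapt to your job):

```bash
#!/bin/bash
#SBATCH --partition=daily      # the job is expected to finish within one day
#SBATCH --time=0-06:00:00      # estimated run time: 6 hours
#SBATCH --ntasks=1             # a single task

srun ./myprogram               # './myprogram' stands for your executable
```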
## Basic commands for running batch scripts

* **``sbatch``**: to submit a batch script to Slurm.
* **``srun``**: to run parallel tasks within the batch system.
  * As an alternative, ``mpirun`` and ``mpiexec`` can be used. However, it is ***strongly recommended*** to use ``srun`` instead.
* **``squeue``**: to check the status of your jobs.
* **``scancel``**: to delete a job from the queue.
* **``salloc``**: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished.
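For illustration, a typical session with these commands might look like this (the script name and job ID are hypothetical):

```bash
sbatch myjob.sh        # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER        # check the status of your own jobs
scancel 1234567        # delete job 1234567 from the queue, if needed
```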
## Basic settings
For a complete list of available options and parameters, it is recommended to use the **man** pages (``man sbatch``, ``man srun``, ``man salloc``). Please notice that the behaviour of some parameters might change depending on the command (for example, the ``--exclusive`` behaviour in ``sbatch`` differs from the one in ``srun``).

In this chapter we show the basic parameters which are usually needed on the Merlin cluster.
### Clusters
* For running jobs in the **Merlin6** computing nodes, users have to add the following option:
```bash
#SBATCH --clusters=merlin6
```
* For running jobs in the **Merlin5** computing nodes, users have to add the following option:
```bash
#SBATCH --clusters=merlin5
```
***For advanced users:*** If you do not care where your jobs run (**Merlin5** or **Merlin6**), you can skip this setting; however, you must make sure that your code can run on both clusters without any problem, and that you have defined proper settings in your *batch* script.
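In that case, as an alternative to skipping the setting, both clusters can also be listed explicitly; Slurm then submits the job to the cluster providing the earliest expected start time. A sketch, assuming your batch script is valid on both clusters:

```bash
#SBATCH --clusters=merlin5,merlin6   # let Slurm pick the cluster with the earliest start
```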
### Partitions

For Merlin6, if no partition is defined, ``general`` will be the default.
Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
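For example, a partition can be selected as follows (a sketch; pick the partition matching the expected run time of your job, as described in the rules above):

```bash
#SBATCH --partition=daily    # one of: general, daily, hourly
```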
### Hyperthreaded vs non-Hyperthreaded jobs
Computing nodes in **merlin6** have hyperthreading enabled: every core runs two threads. In many cases hyperthreading needs to be disabled, since only multithread-based applications will benefit from it. Users must apply the following parameters, depending on the type of job:

* For **hyperthreaded jobs** users ***must*** specify the following options:
```bash
#SBATCH --ntasks-per-core=2 # Mandatory for multithreaded jobs
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
```
* For **non-hyperthreaded jobs** users ***must*** specify the following options:
```bash
#SBATCH --ntasks-per-core=1 # Mandatory for non-multithreaded jobs
#SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
```
### Shared vs exclusive nodes
The **Merlin5** and **Merlin6** clusters are designed in a way that should allow running MPI/OpenMP processes as well as single-core jobs. To allow co-existence, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. Users can change this behaviour if they require exclusive usage of nodes.
Exclusivity of a node can be set up by specifying the ``--exclusive`` option as follows:
```bash
#SBATCH --exclusive
```
### Time
There are some settings that are not mandatory, but are needed or useful to specify. The most relevant one in this section is the run time limit, which is defined with the ``--time`` option.
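For example, a run time definition could look as follows (a sketch; the value shown is illustrative):

```bash
#SBATCH --time=0-12:30:00    # D-HH:MM:SS, here 12 hours and 30 minutes
```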
### Output and errors

If you want to change the default names, it can be done with the options ``--output`` and ``--error``.

Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the specification of **filename patterns**.
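For instance, custom output and error files using common filename patterns (``%x`` expands to the job name and ``%j`` to the job ID; the ``logs/`` directory is an assumption and must exist before submission):

```bash
#SBATCH --output=logs/%x-%j.out   # stdout, named after the job name and job ID
#SBATCH --error=logs/%x-%j.err    # stderr, named after the job name and job ID
```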
### GPU specific settings
**Merlin6** GPUs are available for all PSI users; however, their usage is restricted to users belonging to the ``merlin-gpu`` account. By default, all users are added to this account (exceptions may apply).
#### Slurm account
When using GPUs, users must switch to the **merlin-gpu** Slurm account in order to be able to run on GPU-based nodes. This is done with the ``--account`` setting as follows:
```bash
#SBATCH --account=merlin-gpu  # The account 'merlin-gpu' must be used
```
#### GRES
The following options are mandatory settings that **must be included** in your batch scripts:
```bash
#SBATCH --gres=gpu  # Always set at least this option when using GPUs
```
##### GRES advanced settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process must be used. Users can define which GPU resources they need with the ``--gres`` option.
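For example (a sketch with illustrative GPU counts; as the note below explains, the ``[:type]`` field is currently not working and should be omitted):

```bash
#SBATCH --gres=gpu:2             # request 2 GPUs of any type
##SBATCH --gres=gpu:GTX1080:2    # typed request; keep commented until the bug below is fixed
```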
***Important note:*** Due to a bug in the configuration, ``[:type]`` (e.g. ``GTX1080`` or ``GTX1080Ti``) is not working. Users should skip it and use only ``gpu[:count]``. This will be fixed in one of the upcoming downtimes, as it requires a full restart of the batch system.
## Batch script templates
### CPU-based jobs templates
The following examples apply to the **Merlin6** cluster.
#### Non-multithreaded jobs template
The following template should be used by any user submitting jobs to CPU nodes:
```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --ntasks-per-core=1 # Mandatory for non-multithreaded jobs
#SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=44                # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
```
#### Multithreaded jobs template
The following template should be used by any user submitting jobs to CPU nodes:
```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --ntasks-per-core=2 # Mandatory for multithreaded jobs
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=88                # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88 # Uncomment and specify the number of cores per task
```
### GPU-based jobs templates
The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/bash
#SBATCH --partition=gpu_<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --gres="gpu:<type>:<number_gpus>" # You should specify at least 'gpu'
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file>  # Generate custom error file
#SBATCH --ntasks-per-core=1 # GPU nodes have hyper-threading disabled
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify number of nodes to use
##SBATCH --ntasks=20       # Uncomment and specify number of tasks to use
##SBATCH --ntasks-per-node=20 # Uncomment and specify number of tasks per node
##SBATCH --cpus-per-task=1 # Uncomment and specify the number of cores per task
```
## Advanced configurations

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.
``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```
This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the "myprogram" executable. Each job will receive the same set of
configurations (e.g. time limit of 8h in the example above).
The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and they will each
have their own output file.
**Note:**
* Do not use such jobs if you have very short tasks, since each array sub-job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
that only 5 sub jobs may ever run in parallel.
You can also use an array job approach to run over all files in a directory, substituting the payload with

``` bash
# Note: bash arrays are zero-based, so use e.g. --array=0-7 to process 8 files
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```
Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.

``` bash
# Note: with --array=0-6 the seven values below are all covered
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```
### Array jobs: running very long tasks with checkpoint files
If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:
``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00 # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
    # There is a checkpoint file; resume the simulation from it.
    myprogram --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    myprogram
fi
```
The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob can ever run in parallel, so
this will result in subjob n+1 only being started when job n has finished. It will read the checkpoint file
if it is present.
### Packed jobs: running a large number of short tasks
Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. you run the short tasks within the loop of a single Slurm job.
You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.
As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the ``myprog`` program 1000 times with
arguments passed from 1 to 1000. With the ``-N1 -n1 -c1 --exclusive``
option, it will ensure that at any point in time only 44
instances are effectively running, each being allocated one CPU. You
can at this point decide to allocate several CPUs or tasks by adapting
the corresponding parameters.
``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44        # defines the number of parallel tasks

for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```
**Note:** The `&` at the end of the `srun` line is needed to not have the script waiting (blocking).
The `wait` command waits for all such background tasks to finish and returns the exit code.

## Job status

The status of submitted jobs can be checked with the `squeue` command:

```bash
$> squeue -u bliven_s
     JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 134507729       gpu test_scr bliven_s PD   0:00      3 (AssocGrpNodeLimit)
 134507768   general test_scr bliven_s PD   0:00     19 (AssocGrpCpuLimit)
 134506301       gpu test_scr bliven_s PD   0:00      1 (Priority)
 134506288       gpu test_scr bliven_s  R   9:16      1 merlin-g-008
```

Common Statuses:

* **merlin-\***: Running on the specified host
* **(Priority)**: Waiting in the queue
* **(Resources)**: At the head of the queue, waiting for machines to become available
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: Job would exceed per-user limitations on
  the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
  resubmit with fewer resources, or else wait for your other jobs to finish.
* **(PartitionNodeLimit)**: Exceeds all resources available on this partition.
  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
  resources.