From 0e30473091ef06e1a904babd9318360ad53e8ea0 Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Wed, 6 May 2026 11:04:04 +0200
Subject: [PATCH] Add packed jobs documentation

---
 .../slurm-examples.md | 180 +++++++++++++++++-
 1 file changed, 176 insertions(+), 4 deletions(-)

diff --git a/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md b/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
index 3fc57b18..aaed6e08 100644
--- a/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
+++ b/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
@@ -1,6 +1,8 @@
# Slurm Examples

-## Single core based job examples
+## Basic examples
+
+### Single core based job examples

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem-per-cpu=8000      # Double the default memory per cpu
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

-## Multi-core based jobs example
+### Multi-core based jobs example

-### Pure MPI
+#### Pure MPI

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=44             # Job will run 44 tasks
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

-### Hybrid
+#### Hybrid

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=8              # Job will run 8 tasks
#SBATCH --cpus-per-task=8       # Request 8 CPUs per task
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

## Advanced examples

### Packed jobs: running many short tasks inside one allocation

Launching a Slurm job has some overhead. If you have hundreds or thousands of short, independent tasks, avoid submitting each one as a separate Slurm job: **this creates unnecessary scheduler load** and often increases the total time your workflow spends waiting in the queue.

A better approach is a **packed job**: request one Slurm allocation with enough CPUs, then run several short tasks in parallel inside that allocation.

!!! tip "When packed jobs are useful"

    - Each task is short
    - The tasks are independent of each other
    - Each task uses one or a small fixed number of CPU cores
    - You want to limit how many tasks run at the same time

!!! danger "When not to use packed jobs"

    Packed jobs are not always the best solution. Consider other approaches if:

    - Each task is long-running
    - Tasks have very different runtimes and need advanced scheduling
    - Each task requires many nodes
    - You need Slurm accounting for each task as a separate job
    - Failed tasks must be automatically retried or tracked individually

    In those cases, a Slurm job array may be more appropriate (see the job array sketch at the end of this page).

!!! warning
    Do not start more parallel tasks than the number of CPUs requested from Slurm. For example, if your job requests `--cpus-per-task=4`, run at most 4 single-core tasks at the same time.

#### Recommended pattern: Control parallelism inside the job script

The following example requests 4 CPUs from Slurm and runs 12 short tasks in total. At most 4 tasks are active at the same time, matching the number of CPUs requested with `--cpus-per-task=4`.

```bash
#!/bin/bash
#SBATCH --job-name=stress_single_job
#SBATCH --partition=hourly
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
#SBATCH --output=stress_single_job_%j.out

set -euo pipefail

TASKS=12
MAX_PARALLEL="${SLURM_CPUS_PER_TASK:-1}"

run_one_task() {
    local idx="$1"

    echo "[$(date '+%F %T')] starting task ${idx} on host $(hostname)"

    # Replace this command with your real workload.
    # This simulates around 20 seconds of single-core CPU work.
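    # If stress-ng is not available on your system, a stand-in such as
    # `sleep 20` exercises the same pattern without the CPU load (this
    # substitute is an illustration, not part of the original example).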
    stress-ng --cpu 1 --timeout 20s --metrics-brief

    echo "[$(date '+%F %T')] finished task ${idx}"
}

active=0

for i in $(seq 1 "${TASKS}"); do
    # The function runs in a background subshell, so no `export -f`
    # or nested `bash -c` invocation is needed.
    run_one_task "${i}" &

    # Use the assignment form instead of ((active+=1)): an ((...))
    # expression that evaluates to zero would abort the script under `set -e`.
    active=$((active + 1))

    if [ "${active}" -ge "${MAX_PARALLEL}" ]; then
        wait -n    # wait for any one background task (requires Bash >= 4.3)
        active=$((active - 1))
    fi
done

wait
echo "All tasks completed"
```

!!! note
    Replace the `stress-ng` command with the command you actually want to run. For example:

    ```bash
    ./myprog "${idx}"
    ```

In this example:

- Slurm allocates one job with 4 CPUs.
- The script launches 12 tasks in total.
- Only 4 tasks run in parallel.
- When one task finishes, the next one starts.
- The job finishes only after all background tasks have completed.

The `&` starts each task in the background. The `wait -n` command waits until one background task finishes before launching more work, and the final `wait` ensures that the script does not exit until all remaining tasks have completed.

!!! tip
    The number of simultaneously running tasks should match the resources requested from Slurm.

    * *For single-threaded tasks*, an example would be:

        ```bash
        #SBATCH --cpus-per-task=8

        MAX_PARALLEL="${SLURM_CPUS_PER_TASK:-1}"
        ```

        This means that up to 8 single-threaded tasks may run at the same time.

    * *For tasks that use multiple threads each*, reduce the number of parallel tasks accordingly.
      For example, if every task uses 4 CPU threads and the job requests 16 CPUs, then run at most 4 tasks in parallel.

        ```bash
        #SBATCH --cpus-per-task=16

        CPUS_PER_WORKER=4
        MAX_PARALLEL=$((SLURM_CPUS_PER_TASK / CPUS_PER_WORKER))
        ```

        You should also make sure that the application itself uses the expected number of threads, for example:

        ```bash
        export OMP_NUM_THREADS="${CPUS_PER_WORKER}"
        ```

#### Alternative pattern: Using `srun` for each packed task

For some workflows it can be useful to launch each internal task with `srun`. This gives Slurm more visibility into each step inside the allocation.

!!! danger

    Using `srun` inside a packed job is valid and gives Slurm visibility of each task as a job step. However, **every `srun` invocation adds Slurm step-management overhead**.
    For many very short tasks, it is usually better to request the required CPUs once and control the parallelism inside the job script using Bash, GNU Parallel, or a similar
    workflow tool (see the GNU Parallel sketch at the end of this page). Use `srun` mainly when you need Slurm to launch MPI tasks, enforce step-level resource isolation, or track each task as a Slurm job step.

```bash
#!/bin/bash
#SBATCH --job-name=packed_srun_example
#SBATCH --partition=hourly
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=packed_srun_example_%j.out

set -euo pipefail

TASKS=12
MAX_PARALLEL="${SLURM_NTASKS:-1}"

active=0

for i in $(seq 1 "${TASKS}"); do
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --exclusive ./myprog "${i}" &
    active=$((active + 1))    # assignment form stays safe under `set -e`

    if [ "${active}" -ge "${MAX_PARALLEL}" ]; then
        wait -n
        active=$((active - 1))
    fi
done

wait
echo "All tasks completed"
```

In this case, the job requests 4 Slurm tasks, and each internal `srun` step consumes one of them. The `--exclusive` option on the `srun` command prevents several job steps from sharing the same allocated CPU resources.

!!! note
    The `--exclusive` option shown here belongs to `srun`. It is not the same as `#SBATCH --exclusive`, which would request exclusive access to whole nodes and is usually not what you want for packed short tasks.
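
#### Alternative pattern: GNU Parallel

The danger note above mentions GNU Parallel as another way to control parallelism inside an allocation. The following is a minimal sketch of the same 12-task workload, assuming that the `parallel` command is available on the compute nodes and reusing the hypothetical `./myprog` binary from the previous examples.

```bash
#!/bin/bash
#SBATCH --job-name=packed_parallel_example
#SBATCH --partition=hourly
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
#SBATCH --output=packed_parallel_example_%j.out

set -euo pipefail

# Load GNU Parallel here if your site provides it as a module.

# GNU Parallel replaces the manual background-job bookkeeping:
# -j caps how many tasks run at once, and {} is replaced by each
# input value read from stdin.
seq 1 12 | parallel -j "${SLURM_CPUS_PER_TASK:-1}" ./myprog {}
```

As with the Bash loop, the value passed to `-j` must not exceed the number of CPUs requested from Slurm.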
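
#### Alternative pattern: Slurm job arrays

For completeness, here is a minimal sketch of the job array alternative mentioned earlier, again using the hypothetical `./myprog` binary. Unlike a packed job, every array element is scheduled, accounted, and can be retried as a separate job.

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=hourly
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --array=1-12%4                    # 12 elements, at most 4 running at once
#SBATCH --output=array_example_%A_%a.out  # %A = job ID, %a = array index

set -euo pipefail

# SLURM_ARRAY_TASK_ID holds the index of the current array element.
srun ./myprog "${SLURM_ARRAY_TASK_ID}"
```

The `%4` suffix in `--array=1-12%4` limits how many elements run simultaneously, mirroring the `MAX_PARALLEL` logic of the packed examples while keeping per-task accounting in Slurm.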