From 0e30473091ef06e1a904babd9318360ad53e8ea0 Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Wed, 6 May 2026 11:04:04 +0200
Subject: [PATCH] Add packed jobs documentation

---
 .../slurm-examples.md | 180 +++++++++++++++++-
 1 file changed, 176 insertions(+), 4 deletions(-)

diff --git a/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md b/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
index 3fc57b18..aaed6e08 100644
--- a/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
+++ b/docs/merlin7/03-Slurm-General-Documentation/slurm-examples.md
@@ -1,6 +1,8 @@
# Slurm Examples

-## Single core based job examples
+## Basic examples
+
+### Single core based job examples

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem-per-cpu=8000      # Double the default memory per cpu
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

-## Multi-core based jobs example
+### Multi-core based jobs example

-### Pure MPI
+#### Pure MPI

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=44             # Job will run 44 tasks
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

-### Hybrid
+#### Hybrid

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=8              # Job will run 8 tasks
#SBATCH --cpus-per-task=8       # Request 8 CPUs per task
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME  # where $MODULE_NAME is a software package in PModules

srun $MYEXEC  # where $MYEXEC is the path to your binary file
```

## Advanced examples

### Packed jobs: running many short tasks inside one allocation

Launching a Slurm job has some overhead. If you have hundreds or thousands of short, independent tasks, avoid submitting each one as a separate Slurm job: **this creates unnecessary scheduler load** and often increases the total time your workflow spends waiting in the queue.

A better approach is a **packed job**: request one Slurm allocation with enough CPUs, then run several short tasks in parallel inside that allocation.

!!! tip "When packed jobs are useful"

    - Each task is short
    - The tasks are independent of each other
    - Each task uses one or a small fixed number of CPU cores
    - You want to limit how many tasks run at the same time

!!! danger "When not to use packed jobs"

    Packed jobs are not always the best solution. Consider other approaches if:

    - Each task is long-running
    - Tasks have very different runtimes and need advanced scheduling
    - Each task requires many nodes
    - You need Slurm accounting for each task as a separate job
    - Failed tasks must be automatically retried or tracked individually

    In those cases, a Slurm job array may be more appropriate (see the job array sketch at the end of this page).

!!! warning
    Do not start more parallel tasks than the number of CPUs requested from Slurm. For example, if your job requests `--cpus-per-task=4`, run at most 4 single-core tasks at the same time.

#### Recommended pattern: Control parallelism inside the job script

The following example requests 4 CPUs from Slurm and runs 12 short tasks in total. At most 4 tasks are active at the same time, matching the number of CPUs requested with `--cpus-per-task=4`.

```bash
#!/bin/bash
#SBATCH --job-name=stress_single_job
#SBATCH --partition=hourly
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
#SBATCH --output=stress_single_job_%j.out

set -euo pipefail

TASKS=12
MAX_PARALLEL="${SLURM_CPUS_PER_TASK:-1}"

run_one_task() {
    local idx="$1"

    echo "[$(date '+%F %T')] starting task ${idx} on host $(hostname)"

    # Replace this command with your real workload.
    # This simulates around 20 seconds of single-core CPU work.
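    # If stress-ng is not available on your system, a stand-in such as
    # `sleep 20` exercises the same pattern without the CPU load (this
    # substitute is an illustration, not part of the original example).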
    stress-ng --cpu 1 --timeout 20s --metrics-brief

    echo "[$(date '+%F %T')] finished task ${idx}"
}

active=0

for i in $(seq 1 "${TASKS}"); do
    # The function runs in a background subshell, so no `export -f`
    # or nested `bash -c` invocation is needed.
    run_one_task "${i}" &

    # Use the assignment form instead of ((active+=1)): an ((...))
    # expression that evaluates to zero would abort the script under `set -e`.
    active=$((active + 1))

    if [ "${active}" -ge "${MAX_PARALLEL}" ]; then
        wait -n    # wait for any one background task (requires Bash >= 4.3)
        active=$((active - 1))
    fi
done

wait
echo "All tasks completed"
```

!!! note
    Replace the `stress-ng` command with the command you actually want to run. For example:

    ```bash
    ./myprog "${idx}"
    ```

In this example:

- Slurm allocates one job with 4 CPUs.
- The script launches 12 tasks in total.
- Only 4 tasks run in parallel.
- When one task finishes, the next one starts.
- The job finishes only after all background tasks have completed.

The `&` starts each task in the background. The `wait -n` command waits until one background task finishes before launching more work, and the final `wait` ensures that the script does not exit until all remaining tasks have completed.

!!! tip
    The number of simultaneously running tasks should match the resources requested from Slurm.

    * *For single-threaded tasks*, an example would be:

        ```bash
        #SBATCH --cpus-per-task=8

        MAX_PARALLEL="${SLURM_CPUS_PER_TASK:-1}"
        ```

        This means that up to 8 single-threaded tasks may run at the same time.

    * *For tasks that use multiple threads each*, reduce the number of parallel tasks accordingly.
      For example, if every task uses 4 CPU threads and the job requests 16 CPUs, then run at most 4 tasks in parallel.

        ```bash
        #SBATCH --cpus-per-task=16

        CPUS_PER_WORKER=4
        MAX_PARALLEL=$((SLURM_CPUS_PER_TASK / CPUS_PER_WORKER))
        ```

        You should also make sure that the application itself uses the expected number of threads, for example:

        ```bash
        export OMP_NUM_THREADS="${CPUS_PER_WORKER}"
        ```

#### Alternative pattern: Using `srun` for each packed task

For some workflows it can be useful to launch each internal task with `srun`. This gives Slurm more visibility into each step inside the allocation.

!!! danger

    Using `srun` inside a packed job is valid and gives Slurm visibility of each task as a job step. However, **every `srun` invocation adds Slurm step-management overhead**.
    For many very short tasks, it is usually better to request the required CPUs once and control the parallelism inside the job script using Bash, GNU Parallel, or a similar
    workflow tool (see the GNU Parallel sketch at the end of this page). Use `srun` mainly when you need Slurm to launch MPI tasks, enforce step-level resource isolation, or track each task as a Slurm job step.

```bash
#!/bin/bash
#SBATCH --job-name=packed_srun_example
#SBATCH --partition=hourly
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=packed_srun_example_%j.out

set -euo pipefail

TASKS=12
MAX_PARALLEL="${SLURM_NTASKS:-1}"

active=0

for i in $(seq 1 "${TASKS}"); do
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --exclusive ./myprog "${i}" &
    active=$((active + 1))    # assignment form stays safe under `set -e`

    if [ "${active}" -ge "${MAX_PARALLEL}" ]; then
        wait -n
        active=$((active - 1))
    fi
done

wait
echo "All tasks completed"
```

In this case, the job requests 4 Slurm tasks, and each internal `srun` step consumes one of them. The `--exclusive` option on the `srun` command prevents several job steps from sharing the same allocated CPU resources.

!!! note
    The `--exclusive` option shown here belongs to `srun`. It is not the same as `#SBATCH --exclusive`, which would request exclusive access to whole nodes and is usually not what you want for packed short tasks.
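
#### Alternative pattern: GNU Parallel

The danger note above mentions GNU Parallel as another way to control parallelism inside an allocation. The following is a minimal sketch of the same 12-task workload, assuming that the `parallel` command is available on the compute nodes and reusing the hypothetical `./myprog` binary from the previous examples.

```bash
#!/bin/bash
#SBATCH --job-name=packed_parallel_example
#SBATCH --partition=hourly
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
#SBATCH --output=packed_parallel_example_%j.out

set -euo pipefail

# Load GNU Parallel here if your site provides it as a module.

# GNU Parallel replaces the manual background-job bookkeeping:
# -j caps how many tasks run at once, and {} is replaced by each
# input value read from stdin.
seq 1 12 | parallel -j "${SLURM_CPUS_PER_TASK:-1}" ./myprog {}
```

As with the Bash loop, the value passed to `-j` must not exceed the number of CPUs requested from Slurm.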
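
#### Alternative pattern: Slurm job arrays

For completeness, here is a minimal sketch of the job array alternative mentioned earlier, again using the hypothetical `./myprog` binary. Unlike a packed job, every array element is scheduled, accounted, and can be retried as a separate job.

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=hourly
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --array=1-12%4                    # 12 elements, at most 4 running at once
#SBATCH --output=array_example_%A_%a.out  # %A = job ID, %a = array index

set -euo pipefail

# SLURM_ARRAY_TASK_ID holds the index of the current array element.
srun ./myprog "${SLURM_ARRAY_TASK_ID}"
```

The `%4` suffix in `--array=1-12%4` limits how many elements run simultaneously, mirroring the `MAX_PARALLEL` logic of the packed examples while keeping per-task accounting in Slurm.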