From 2b901919c83d89285484428653aa49032e9e9b9e Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Tue, 18 Jun 2019 14:03:17 +0200
Subject: [PATCH] Added Running Jobs

---
 _data/sidebars/merlin6_sidebar.yml          |   6 +-
 pages/merlin6/merlin6-slurm/running-jobs.md | 122 ++++++++++++++++++
 .../merlin6-slurm/slurm-configuration.md    |  14 +-
 3 files changed, 134 insertions(+), 8 deletions(-)
 create mode 100644 pages/merlin6/merlin6-slurm/running-jobs.md

diff --git a/_data/sidebars/merlin6_sidebar.yml b/_data/sidebars/merlin6_sidebar.yml
index 3e96cd5..d91ad6a 100644
--- a/_data/sidebars/merlin6_sidebar.yml
+++ b/_data/sidebars/merlin6_sidebar.yml
@@ -29,10 +29,12 @@ entries:
         url: /merlin6/slurm-access.html
   - title: Merlin6 Slurm
     folderitems:
-      - title: Slurm Basic Commands
-        url: /merlin6/slurm-basics.html
       - title: Slurm Configuration
        url: /merlin6/slurm-configuration.html
+      - title: Slurm Basic Commands
+        url: /merlin6/slurm-basics.html
+      - title: Running Jobs
+        url: /merlin6/running-jobs.html
   - title: Support
     folderitems:
       - title: Contact
diff --git a/pages/merlin6/merlin6-slurm/running-jobs.md b/pages/merlin6/merlin6-slurm/running-jobs.md
new file mode 100644
index 0000000..98bcc75
--- /dev/null
+++ b/pages/merlin6/merlin6-slurm/running-jobs.md
@@ -0,0 +1,122 @@
+---
+title: Running Jobs
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/running-jobs.html
+---
+
+## Commands for running jobs
+
+* ``sbatch``: to submit a batch script to Slurm
+  * ``squeue``: for checking the status of your jobs
+  * ``scancel``: for deleting a job from the queue
+* ``srun``: to run parallel jobs in the batch system
+* ``salloc``: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished.
+  * ``salloc`` is equivalent to an interactive run
+
+## Slurm settings
+
+### Shared nodes and exclusivity
+
+The **Merlin6** cluster has been designed in a way that should allow running MPI/OpenMP processes as well as single core based jobs. To allow this
+co-existence, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. Users
+can change this behaviour if they require exclusive usage of nodes.
+
+By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way,
+mixed nodes are filled up first, and fully free nodes remain available for MPI/OpenMP jobs.
+
+Exclusive usage of a node can be requested by specifying the ``--exclusive`` option as follows:
+
+```bash
+#SBATCH --exclusive
+```
+
+### Output and Errors
+
+By default, Slurm will generate the standard output and standard error files in the directory from which
+you submit the batch script:
+
+* standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
+* standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.
+
+If you want to change the default names, this can be done with the ``--output`` and ``--error`` options. For example:
+
+```bash
+#SBATCH --output=logs/myJob.%N.%j.out   # Generate an output file per hostname and jobid
+#SBATCH --error=logs/myJob.%N.%j.err    # Generate an error file per hostname and jobid
+```
+
+Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) to get the full specification of **filename patterns**.
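+
+Putting the above together, a minimal batch script could look as follows. This is only an illustrative sketch: the job
+name, the output file names and the ``hostname`` payload are placeholder examples, not mandatory values:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=myJob           # Example job name, shown by squeue
+#SBATCH --output=myJob.%N.%j.out   # Standard output file
+#SBATCH --error=myJob.%N.%j.err    # Standard error file
+
+# Payload of the job: print the name of the allocated node.
+# Replace this with the real application.
+hostname
+```
+
+Such a script would be submitted with ``sbatch``, monitored with ``squeue`` and, if needed, cancelled with ``scancel``.
+The partition, constraint and further options described in the following sections would normally be added as well.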
+
+### Partitions
+
+Merlin6 contains 3 general purpose partitions: ``general``, ``daily`` and ``hourly``.
+If no partition is defined, ``general`` will be used by default. The partition can be defined with the ``--partition`` option as follows:
+
+```bash
+#SBATCH --partition=<partition_name>   # Name of the Slurm partition to submit to. 'general' is the default.
+```
+
+Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about the Merlin6 partition setup.
+
+### CPU-based Jobs Settings
+
+CPU-based jobs are available for all PSI users. Users must belong to the ``merlin6`` Slurm ``Account`` in order to be able
+to run on CPU-based nodes. All users registered in Merlin6 are automatically included in the ``Account``.
+
+#### Slurm CPU Mandatory Settings
+
+The following options are mandatory settings that **must be included** in your batch scripts:
+
+```bash
+#SBATCH --constraint=mc   # Always set it to 'mc' for CPU jobs.
+```
+
+#### Slurm CPU Recommended Settings
+
+Some settings are not mandatory but may be needed or useful to specify. These are the following:
+
+* ``--time``: mostly used when you need to specify longer runs in the ``general`` partition, also useful for specifying
+shorter times. This may affect scheduling priorities.
+
+  ```bash
+  #SBATCH --time=<time>   # Time the job needs to run
+  ```
+
+### GPU-based Jobs Settings
+
+GPU-based jobs are restricted to BIO users; however, access for other PSI users can be requested on demand. Users must belong to
+the ``merlin6-gpu`` Slurm ``Account`` in order to be able to run on GPU-based nodes. BIO users belonging to any BIO group
+are automatically registered to the ``merlin6-gpu`` account. Other users should request access from the Merlin6 administrators.
+
+#### Slurm GPU Mandatory Settings
+
+The following options are mandatory settings that **must be included** in your batch scripts:
+
+```bash
+#SBATCH --constraint=gpu   # Always set it to 'gpu' for GPU jobs.
+#SBATCH --gres=gpu         # Always set at least this option when using GPUs
+```
+
+#### Slurm GPU Recommended Settings
+
+GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
+must be used. Users can define which GPU resources they need with the ``--gres`` option,
+according to the following rules:
+
+* All machines except ``merlin-g-001`` have up to 4 GPUs. ``merlin-g-001`` has up to 2 GPUs.
+* Two different NVIDIA GPU models exist: ``GTX1080`` and ``GTX1080Ti``.
+
+Valid ``gres`` options are ``gpu[[:type]:count]``, where:
+
+* ``type`` can be ``GTX1080`` or ``GTX1080Ti``
+* ``count`` is the number of GPUs to use
+
+For example:
+
+```bash
+#SBATCH --gres=gpu:GTX1080:4   # Use 4 x GTX1080 GPUs
+```
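+
+Putting the GPU settings together, a job requesting two GTX1080 GPUs could look as follows. This is only an
+illustrative sketch: the job name, time limit and the ``nvidia-smi`` payload are placeholder examples:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=gpuTest       # Example job name
+#SBATCH --constraint=gpu         # Mandatory for GPU jobs
+#SBATCH --gres=gpu:GTX1080:2     # Example: request 2 x GTX1080 GPUs
+#SBATCH --time=01:00:00          # Example: 1 hour run time
+
+# Show the GPUs visible to the job; replace this with the real GPU application.
+nvidia-smi
+```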
diff --git a/pages/merlin6/merlin6-slurm/slurm-configuration.md b/pages/merlin6/merlin6-slurm/slurm-configuration.md
index 59d52fe..fe44097 100644
--- a/pages/merlin6/merlin6-slurm/slurm-configuration.md
+++ b/pages/merlin6/merlin6-slurm/slurm-configuration.md
@@ -2,7 +2,7 @@
 title: Slurm Configuration
 #tags:
 #keywords:
-last_updated: 13 June 2019
+last_updated: 18 June 2019
 #summary: ""
 sidebar: merlin6_sidebar
 permalink: /merlin6/slurm-configuration.html
@@ -43,11 +43,13 @@ Basic usage for the **merlin6** cluster will be detailed here. For advanced usag
 
 The following table show default and maximum resources that can be used per node:
 
-| Nodes                              | Def.#CPUs | Max.#CPUs | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
-|:---------------------------------- | ---------:| ---------:| -----------:| -----------:| ------------:| --------:| --------- | --------- |
-| merlin-c-[001-022,101-122,201-222] | 1 core    | 44 cores  | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
-| merlin-g-[001]                     | 1 core    | 8 cores   | 8000        | 102498      | 102498       | 10000    | 1         | 2         |
-| merlin-g-[002-009]                 | 1 core    | 10 cores  | 8000        | 102498      | 102498       | 10000    | 1         | 4         |
+| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
+|:------------------ | ---------:| ---------:| -------- | -----------:| -----------:| ------------:| --------:| --------- | --------- |
+| merlin-c-[001-022] | 1 core    | 44 cores  | 1        | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-c-[101-122] | 1 core    | 44 cores  | 1        | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-c-[201-222] | 1 core    | 44 cores  | 1        | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
+| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 8000        | 102498      | 102498       | 10000    | 1         | 2         |
+| merlin-g-[002-009] | 1 core    | 10 cores  | 1        | 8000        | 102498      | 102498       | 10000    | 1         | 4         |
 
 If nothing is specified, by default each core will use up to 8GB of memory. More memory per core can be specified
 with the ``--mem=<memory>`` option, and maximum memory allowed is ``Max.Mem/Node``.
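+
+For example, a hypothetical job running on a single node that needs 32000 MB of memory could request it as follows
+(the value is only illustrative):
+
+```bash
+#SBATCH --mem=32000   # Example: request 32000 MB of memory on the node
+```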