NoMachine fix + running jobs

caubet_m 2020-01-22 16:45:35 +01:00
parent 5baf53df77
commit 6169b7a8dc
3 changed files with 127 additions and 74 deletions


@ -1,5 +1,5 @@
---
title: Remote Desktop Access
#tags:
#keywords:
@ -9,9 +9,9 @@ sidebar: merlin6_sidebar
permalink: /merlin6/nomachine.html
---

Users can log in to Merlin through a Linux Remote Desktop session. NoMachine
is a desktop virtualization tool, similar to VNC, Remote Desktop, etc.
It uses the NX protocol to enable a graphical login to remote servers.

## Installation


@ -10,41 +10,72 @@ permalink: /merlin6/running-jobs.html
## Commands for running jobs

* **``sbatch``**: to submit a batch script to Slurm (see the example below)
  * Use **``squeue``** for checking job status
  * Use **``scancel``** for deleting a job from the queue
* **``srun``**: to run parallel jobs in the batch system
* **``salloc``**: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished.
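As a quick illustration of the workflow with these commands (the script name and job ID below are placeholders):

```bash
# Submit a batch script; Slurm replies with "Submitted batch job <jobid>"
sbatch myjob.batch

# Check the status of your own jobs
squeue -u $USER

# Remove a job from the queue (replace 1234567 with your job ID)
scancel 1234567
```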
## Slurm parameters

For a complete list of available options and parameters it is recommended to use the **man** pages (``man sbatch``, ``man srun``, ``man salloc``). Please notice that the behaviour of some parameters might change depending on the command (for example, ``--exclusive`` behaves differently in ``sbatch`` than in ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Running in Merlin5 & Merlin6

* For running jobs on the **Merlin6** computing nodes, users have to add the following option:
```bash
#SBATCH --clusters=merlin6
```

* For running jobs on the **Merlin5** computing nodes, users have to add the following option:

```bash
#SBATCH --clusters=merlin5
```

***For advanced users:*** if you do not care where your jobs run (**Merlin5** or **Merlin6**) you can skip this setting. However, you must then make sure that your code runs on both clusters without problems and that your *batch* script defines settings valid for both. A minimal sketch of such a portable script is shown below.
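In this sketch the job name and the command are placeholders; no ``--clusters`` line is set, so keep resource requests small enough to fit on either cluster:

```bash
#!/bin/bash
#SBATCH --job-name=portable_test        # placeholder job name
#SBATCH --time=00:10:00                 # short run that fits both clusters
#SBATCH --ntasks=1
#SBATCH --output=portable_test-%j.out   # %j expands to the job ID

# Print the node the job actually landed on
hostname
```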

### Partitions

**Merlin6** contains 4 partitions for general purpose, while **Merlin5** contains a single CPU partition (for historical reasons):

* **Merlin6** has 3 CPU partitions: ``general``, ``daily`` and ``hourly``.
* **Merlin6** has 1 GPU partition: ``gpu``.
* **Merlin5** has 1 CPU partition: ``merlin``.

If no partition is defined, ``general`` is the default for Merlin6, while for Merlin5 it is ``merlin``. Partitions can be changed by defining the ``--partition`` option as follows:
```bash
#SBATCH --partition=<partition_name> # Partition to use. 'general' is the 'default' in Merlin6.
```
Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
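To see which partitions are available to you, together with their time limits and node counts, one possible check is the following (a sketch; the exact columns shown depend on the local configuration):

```bash
# Partition name, time limit, node count and availability
sinfo -o "%P %l %D %a"

# The same, restricted to one cluster of the multi-cluster setup
sinfo --clusters=merlin6 -o "%P %l %D %a"
```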
### Enabling/disabling hyperthreading
Computing nodes in **merlin6** have hyperthreading enabled: every core runs two threads. For many use cases hyperthreading should be disabled, as only multithread-aware applications benefit from it. Users must therefore apply one of the following sets of parameters (a combined job example is shown after the two snippets below):
* For **hyperthreaded** jobs users ***must*** specify the following options:
```bash
#SBATCH --ntasks-per-core=2 # Mandatory for multithreaded jobs
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
```
* For **non-hyperthreaded** jobs users ***must*** specify the following options:
```bash
#SBATCH --ntasks-per-core=1 # Mandatory for non-multithreaded jobs
#SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
```
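As an illustration, here is a hedged sketch of a single-task OpenMP job that takes advantage of hyperthreading; the application name is a placeholder and the ``hourly`` partition is chosen only because it suits a short test:

```bash
#!/bin/bash
#SBATCH --partition=hourly       # short test run
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # 8 logical CPUs for one multithreaded task
#SBATCH --ntasks-per-core=2      # allow both hardware threads of each core
#SBATCH --hint=multithread       # mandatory for multithreaded jobs

# Run the (placeholder) OpenMP application with one thread per allocated logical CPU
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_openmp_app
```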
### Shared nodes and exclusivity
The **Merlin5** and **Merlin6** clusters are designed to allow running MPI/OpenMP processes as well as single-core jobs. To allow co-existence, nodes are configured in shared mode by default: multiple jobs from multiple users may land on the same node. This behaviour can be changed by a user if they require exclusive usage of nodes.

By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way, mixed nodes are filled up first, and fully free nodes remain available for MPI/OpenMP jobs.
Exclusive usage of a node can be requested by specifying the ``--exclusive`` option as follows:
@ -52,7 +83,18 @@ Exclusivity of a node can be setup by specific the ``--exclusive`` option as fol
#SBATCH --exclusive
```
### Slurm CPU Recommended Settings

There are some settings that are not mandatory but are useful or important to specify:

* ``--time``: mostly used when you need to specify longer runs in the ``general`` partition, also useful for specifying
  shorter times. **This will affect scheduling priorities**, hence it is important to define it (and to define it properly).
```bash
#SBATCH --time=<D-HH:MM:SS> # Time job needs to run
```
### Output and Errors

By default, a Slurm script will generate standard output and error files in the directory from which
you submit the batch script:
@ -69,38 +111,16 @@ If you want to the default names it can be done with the options ``--output`` an
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) to get the specification of the available **filename patterns**.
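For instance, a hedged sketch using two common filename patterns (the ``logs/`` directory is an assumption and must exist before submission, since Slurm will not create it):

```bash
#SBATCH --output=logs/%x-%j.out   # %x expands to the job name, %j to the job ID
#SBATCH --error=logs/%x-%j.err
```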
## CPU-based Jobs Settings

CPU-based jobs are available for all PSI users. Users must belong to the ``merlin6`` Slurm ``Account`` in order to be able
to run on CPU-based nodes. All users registered in Merlin6 are automatically included in the ``Account``.
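To verify that the association is in place, one possible query is the following (a sketch; the output columns depend on the local accounting configuration):

```bash
# Show the Slurm account associations of the current user
sacctmgr show associations where user=$USER format=cluster,account,user
```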
### Slurm CPU Templates

The following examples apply to the **Merlin6** cluster.

#### Non-multithreaded jobs example
The following template should be used by any user submitting non-multithreaded jobs to CPU nodes:
@ -110,24 +130,49 @@ The following template should be used by any user submitting jobs to CPU nodes:
#SBATCH --time=<D-HH:MM:SS>      # Strictly recommended when using 'general' partition.
#SBATCH --output=<output_file>   # Generate custom output file
#SBATCH --error=<error_file>     # Generate custom error file
#SBATCH --ntasks-per-core=1      # Mandatory for non-multithreaded jobs
#SBATCH --hint=nomultithread     # Mandatory for non-multithreaded jobs
##SBATCH --exclusive             # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1               # Uncomment and specify #nodes to use
##SBATCH --ntasks=44             # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44    # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44      # Uncomment and specify the number of cores per task
```

#### Multithreaded jobs example
The following template should be used by any user submitting multithreaded jobs to CPU nodes:
```bash
#!/bin/sh
#SBATCH --partition=<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS> # Strictly recommended when using 'general' partition.
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --ntasks-per-core=2 # Mandatory for multithreaded jobs
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=88 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88 # Uncomment and specify the number of cores per task
```
## GPU-based Jobs Settings

**Merlin6** GPUs are available for all PSI users; however, usage is restricted to users belonging to the ``merlin-gpu`` account. By default, all users are added to this account (exceptions could apply).

### Merlin6 GPU account

When using GPUs, users must switch to the **merlin-gpu** Slurm account in order to be able to run on GPU-based nodes. This is done with the ``--account`` setting as follows:
```bash
#SBATCH --account=merlin-gpu # The account 'merlin-gpu' must be used
```
### Slurm GPU Mandatory Settings
@ -137,7 +182,7 @@ The following options are mandatory settings that **must be included** in your b
#SBATCH --gres=gpu    # Always set at least this option when using GPUs
```
### Slurm GPU Recommended Settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used. Users can define which GPU resources they need with the ``--gres`` option.
@ -147,9 +192,11 @@ This would be according to the following rules:
For example:

```bash
#SBATCH --gres=gpu:GTX1080:4    # Use a node with 4 x GTX1080 GPUs
```

***Important note:*** due to a bug in the configuration, ``[:type]`` (i.e. ``GTX1080`` or ``GTX1080Ti``) is currently not working. Users should omit it and use only ``gpu[:count]``. This will be fixed in an upcoming downtime, as it requires a full restart of the batch system.
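Until the fix is deployed, a working request therefore omits the type and only gives a count (adjust the number to your needs):

```bash
#SBATCH --gres=gpu:4    # request 4 GPUs of any type on the node
```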

### Slurm GPU Template

The following template should be used by any user submitting jobs to GPU nodes:
@ -157,13 +204,15 @@ The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/sh
#SBATCH --partition=gpu_<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                # Strictly recommended when using 'general' partition.
#SBATCH --output=<output_file>             # Generate custom output file
#SBATCH --error=<error_file>               # Generate custom error file
#SBATCH --gres="gpu:<type>:<number_gpus>"  # You should specify at least 'gpu'
#SBATCH --ntasks-per-core=1                # GPU nodes have hyper-threading disabled
#SBATCH --account=merlin-gpu               # The account 'merlin-gpu' must be used
##SBATCH --exclusive                       # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1                         # Uncomment and specify number of nodes to use
##SBATCH --ntasks=44                       # Uncomment and specify number of tasks to use
@ -176,8 +225,8 @@ The following template should be used by any user submitting jobs to GPU nodes:
The status of submitted jobs can be checked with the `squeue` command:
```bash
$> squeue -u bliven_s
    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
134507729       gpu test_scr bliven_s PD  0:00     3 (AssocGrpNodeLimit)
134507768   general test_scr bliven_s PD  0:00    19 (AssocGrpCpuLimit)
@ -187,13 +236,14 @@ The status of submitted jobs can be check with the `squeue` command:
```

Common Statuses:

* **merlin-\***: Running on the specified host
* **(Priority)**: Waiting in the queue
* **(Resources)**: At the head of the queue, waiting for machines to become available
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: Job would exceed per-user limitations on
  the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
  resubmit with fewer resources, or else wait for your other jobs to finish.
* **(PartitionNodeLimit)**: Exceeds all resources available on this partition.
  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
  resources.
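For jobs stuck in one of the pending states above, it can also help to query Slurm directly for the full job record and an estimated start time (replace ``<jobid>`` with your own numeric job ID):

```bash
# Full details of a single job, including the pending reason
scontrol show job <jobid>

# Estimated start times of your pending jobs
squeue -u $USER --start
```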


@ -25,6 +25,9 @@ sbatch Script.sh # to submit a script (example below) to the slurm.
srun <command>    # to submit a command to Slurm. Same options as in 'sbatch' can be used.
salloc            # to allocate computing nodes. Use for interactive runs.
scancel job_id    # to cancel a slurm job; the job id is the numeric id shown by squeue.
sview             # X interface for managing jobs and tracking job run information.
seff              # Calculates the efficiency of a job.
sjstat            # Lists attributes of jobs under SLURM control.
```
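For instance, once a job has completed, its efficiency can be inspected with ``seff`` (a sketch; replace ``<jobid>`` with the numeric ID reported by ``sbatch`` or ``squeue``):

```bash
seff <jobid>    # summary of CPU and memory efficiency of the finished job
```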
---