Added Merlin6 User Guide
parent 100eb526b4, commit 4f3deaedaf

pages/merlin6-user-guide/contact.md (new file, 62 lines)
@@ -0,0 +1,62 @@

---
layout: default
title: Contact
parent: Merlin6 User Guide
nav_order: 2
---

# Contact
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Support

Basic contact information can also be found in the Message of the Day when logging in to the Merlin login nodes.

Support is provided through:
* [PSI Service Now](https://psi.service-now.com/psisp)
* E-Mail: <merlin-admins@lists.psi.ch>

### PSI Service Now

[PSI Service Now](https://psi.service-now.com/psisp) is the official tool for opening incidents.

The PSI HelpDesk will redirect the incident to the corresponding department, but you can always assign it directly to us
(``Assignment Group['itsm-sci_hpc_loc']``).

### Contact Merlin6 Administrators

An official mailing list is available for contacting the Merlin6 administrators:
* <merlin-admins@lists.psi.ch>
  * This is the official way to contact the Merlin6 administrators.
  * Do not hesitate to contact us with any question, request and/or problem.

---

## Get updated through the Merlin Users list!

It is *strongly* recommended to subscribe to the Merlin Users mailing list:
* <merlin-users@lists.psi.ch>
  * Please subscribe to this list to receive updates about general Merlin6 information, interventions and system improvements useful for users.
  * Users can be subscribed in two ways:
    * [Sympa Link](https://psilists.ethz.ch/sympa/info/merlin-users)
    * Send a request to the admin list: <merlin-admins@lists.psi.ch>

This is the official channel we use to inform users about downtimes, interventions or problems.

---

## The Merlin6 Team

Merlin6 is managed by the [High Performance Computing and Emerging Technologies Group](https://www.psi.ch/de/lsm/hpce-group), which
is one of the groups of the [Laboratory for Scientific Computing and Modelling](https://www.psi.ch/de/lsm).

For more information about our team and contacts, please visit: <https://www.psi.ch/de/lsm/hpce-group>

pages/merlin6-user-guide/introduction.md (new file, 113 lines)
@@ -0,0 +1,113 @@

---
layout: default
title: Introduction
parent: Merlin6 User Guide
nav_order: 1
---

# Introduction
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## About Merlin6

Merlin6 is the official PSI local HPC cluster for development and mission-critical applications. It was built in 2019 and replaces the Merlin5 cluster.

Merlin6 is designed to be extensible: it is technically possible to add more compute nodes and cluster storage without a significant increase in manpower and
operational costs.

Merlin6 is mostly based on CPU resources, but it also contains a small amount of GPU-based resources, which are mostly used by the BIO experiments.

---

## Hardware & Software Description

### Computing Nodes

The new Merlin6 cluster is a homogeneous solution based on *three* HP Apollo k6000 systems. Each HP Apollo k6000 chassis contains 22 HP XL230k Gen10 blades. However,
each chassis can hold up to 24 blades, so it is possible to upgrade with up to 2 additional nodes per chassis.

Each HP XL230k Gen10 blade can contain up to two processors of the latest Intel® Xeon® Scalable Processor family. The hardware and software configuration is the following:
* 3 x HP Apollo k6000 chassis systems, each one:
  * 22 x [HP Apollo XL230k Gen10](https://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=a00016634enw), each one:
    * 2 x *22 core* [Intel® Xeon® Gold 6152 Scalable Processor](https://ark.intel.com/products/120491/Intel-Xeon-Gold-6152-Processor-30-25M-Cache-2-10-GHz-) (2.10-3.70GHz).
    * 12 x 32 GB (384 GB in total) of DDR4 memory clocked at 2666 MHz.
    * Dual Port InfiniBand ConnectX-5 EDR-100Gbps (low latency network); one active port per chassis.
    * 1 x 1.6TB NVMe SSD disk:
      * ~300GB reserved for the O.S.
      * ~1.2TB reserved for local fast scratch ``/scratch``.
    * Software:
      * RedHat Enterprise Linux 7.6
      * [Slurm](https://slurm.schedmd.com/) v18.08
      * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2
  * 1 x [HPE Apollo InfiniBand EDR 36-port Unmanaged Switch](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00016643enw):
    * 24 internal EDR-100Gbps ports (1 port per blade, for internal low latency connectivity)
    * 12 external EDR-100Gbps ports (for external low latency connectivity)
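
The per-node resources that Slurm actually registers (CPUs, memory, local scratch) can be checked directly from a login node; a minimal sketch using standard Slurm commands (``nodename`` is a placeholder):
<pre>
sinfo -N -l                   # one line per node: CPUs, memory, state and partition
scontrol show node nodename   # detailed view of a single node (CPUTot, RealMemory, TmpDisk); replace "nodename" with a real name taken from sinfo
</pre>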

### Login Nodes

Two login nodes are inherited from the previous Merlin5 cluster: ``merlin-l-01.psi.ch``, ``merlin-l-02.psi.ch``. The hardware and software configuration is the following:

* 2 x HP DL380 Gen9, each one:
  * 2 x *16 core* [Intel® Xeon® Processor E5-2697AV4 Family](https://ark.intel.com/products/91768/Intel-Xeon-Processor-E5-2697A-v4-40M-Cache-2-60-GHz-) (2.60-3.60GHz)
    * ``merlin-l-01.psi.ch``: hyper-threading disabled
    * ``merlin-l-02.psi.ch``: hyper-threading enabled
  * 16 x 32 GB (512 GB in total) of DDR4 memory clocked at 2400 MHz.
  * Dual Port InfiniBand ConnectIB FDR-56Gbps (low latency network).
  * Software:
    * RedHat Enterprise Linux 7.6
    * [Slurm](https://slurm.schedmd.com/) v18.08
    * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2

Two new login nodes are available in the new cluster: ``merlin-l-001.psi.ch``, ``merlin-l-002.psi.ch``. The hardware and software configuration is the following:

* 2 x HP DL380 Gen10, each one:
  * 2 x *22 core* [Intel® Xeon® Gold 6152 Scalable Processor](https://ark.intel.com/products/120491/Intel-Xeon-Gold-6152-Processor-30-25M-Cache-2-10-GHz-) (2.10-3.70GHz).
    * Hyper-threading disabled.
  * 24 x 16GB (384 GB in total) of DDR4 memory clocked at 2666 MHz.
  * Dual Port InfiniBand ConnectX-5 EDR-100Gbps (low latency network).
  * Software:
    * [NoMachine Terminal Server](https://www.nomachine.com/)
      * Currently only on: ``merlin-l-001.psi.ch``.
    * RedHat Enterprise Linux 7.6
    * [Slurm](https://slurm.schedmd.com/) v18.08
    * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2
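
Whether hyper-threading is enabled on the login node you are connected to, and how much memory it has, can be verified with standard Linux tools; a minimal sketch:
<pre>
lscpu | grep -E 'Model name|Socket|Core|Thread'   # "Thread(s) per core: 2" means hyper-threading is enabled
free -h                                           # total and available memory
</pre>
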
### Storage

The storage is based on the [Lenovo Distributed Storage Solution for IBM Spectrum Scale](https://lenovopress.com/lp0626-lenovo-distributed-storage-solution-for-ibm-spectrum-scale-x3650-m5).
The solution is equipped with 334 x 10TB disks providing a usable capacity of 2.316 PiB (2.608 PB). The overall solution can provide a maximum read performance of 20GB/s.
* 1 x Lenovo DSS G240, composed of:
  * 2 x ThinkSystem SR650, each one:
    * 2 x Dual Port InfiniBand ConnectX-5 EDR-100Gbps (low latency network).
    * 2 x Dual Port InfiniBand ConnectX-4 EDR-100Gbps (low latency network).
    * 1 x ThinkSystem RAID 930-8i 2GB Flash PCIe 12Gb Adapter
  * 1 x ThinkSystem SR630:
    * 1 x Dual Port InfiniBand ConnectX-5 EDR-100Gbps (low latency network).
    * 1 x Dual Port InfiniBand ConnectX-4 EDR-100Gbps (low latency network).
  * 4 x Lenovo Storage D3284 High Density Expansion Enclosure, each one:
    * Holds 84 x 3.5" hot-swap drive bays in two drawers. Each drawer has three rows of drives, and each row has 14 drives.
    * Each drive bay contains a 10TB Helium 7.2K NL-SAS HDD.
* 2 x Mellanox SB7800 InfiniBand 1U switches for High Availability and fast access to the storage with very low latency. Each one:
  * 36 EDR-100Gbps ports
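
From a login node, the mounted GPFS filesystems and their current usage can be listed with standard tools; a minimal sketch (the exact mount points may differ from this example):
<pre>
df -h -t gpfs    # size, used and available space of all mounted GPFS filesystems
</pre>
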
### Networking

Merlin6 cluster connectivity is based on [InfiniBand](https://en.wikipedia.org/wiki/InfiniBand) technology. This allows fast access to the data with very low latencies, as well as running
extremely efficient MPI-based jobs:
* Connectivity amongst computing nodes on different chassis ensures up to 1200Gbps of aggregated bandwidth.
* Connectivity amongst computing nodes in the same chassis ensures up to 2400Gbps of aggregated bandwidth.
* Communication to the storage ensures up to 800Gbps of aggregated bandwidth.

The Merlin6 cluster currently contains 5 InfiniBand managed switches and 3 InfiniBand unmanaged switches (one per HP Apollo chassis):
* 1 x MSX6710 (FDR) for connecting old GPU nodes, old login nodes and the MeG cluster to the Merlin6 cluster (and storage). No High Availability mode possible.
* 2 x MSB7800 (EDR) for connecting login nodes, storage and other nodes in High Availability mode.
* 3 x HP EDR unmanaged switches, each one embedded in an HP Apollo k6000 chassis.
* 2 x MSB7700 (EDR) as top switches, interconnecting the Apollo unmanaged switches and the managed switches (MSX6710, MSB7800).
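
The InfiniBand port state and rate of the node you are on can be checked with the standard InfiniBand diagnostic tools; a minimal sketch, assuming the infiniband-diags utilities are installed:
<pre>
ibstat      # port state (Active) and rate (100 for EDR, 56 for FDR)
ibstatus    # compact summary of the InfiniBand ports
</pre>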

pages/merlin6-user-guide/merlin6-user-guide.md (new file, 11 lines)
@@ -0,0 +1,11 @@

---
layout: default
title: Merlin6 User Guide
nav_order: 10
has_children: true
permalink: /docs/merlin6-user-guide.html
---

# Merlin6 User Guide

Welcome to the PSI Merlin6 cluster.

pages/merlin6-user-guide/using-merlin6.md (new file, 263 lines)
@@ -0,0 +1,263 @@

---
layout: default
title: Using Merlin6
parent: Merlin6 User Guide
nav_order: 3
---

# Using Merlin6
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Important: Code of Conduct

The basic principle is courtesy and consideration for other users.

* Merlin6 is a shared resource, not your laptop; you are therefore kindly requested to behave in the way you would like to see other users behaving towards you.
* Basic shell programming skills in a Linux/UNIX environment are a must-have requirement for HPC users; proficiency in shell programming is greatly beneficial.
* The login nodes are for development and quick testing:
  * It is **strictly forbidden to run production jobs** on the login nodes.
  * It is **forbidden to run long processes** occupying a big part of the resources.
  * *Any running processes violating these rules will be killed.*
* All production jobs should be submitted using the batch system.
* Make sure that no broken or run-away processes are left behind when your job is done. Keep the process space clean on all nodes.
* Remove files you do not need any more (e.g. core dumps, temporary files) as early as possible. Keep the disk space clean on all nodes.

The system administrator has the right to block access to Merlin6 for an account violating the Code of Conduct, in which case the issue will be escalated to the user's supervisor.

---

## Merlin6 Access

### HowTo: Request Access to Merlin6

* PSI users whose Linux accounts belong to the *svc-cluster_merlin6* group are allowed to use Merlin6 (a quick membership check is shown below).
* Registration for Merlin6 access must be done through [PSI Service Now](https://psi.service-now.com/psisp)
  * Please open it as an Incident request, with subject: ``[Merlin6] Access Request for user '<username>'``
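
A quick way to check whether your account already belongs to that group is to query it with standard tools from any PSI Linux machine; a minimal sketch:
<pre>
id -Gn $USER | tr ' ' '\n' | grep svc-cluster_merlin6   # prints the group name if you are a member
getent group svc-cluster_merlin6                        # lists the group and its members (if the directory service exposes them)
</pre>
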
### HowTo: Access to Merlin6

Use SSH to access the login nodes:
* <tt><b>merlin-l-01.psi.ch</b></tt> ("merlin" '-' 'el' '-' 'zero' 'one')
* <tt><b>merlin-l-02.psi.ch</b></tt> ("merlin" '-' 'el' '-' 'zero' 'two')

Examples:
<pre>
ssh -Y merlin-l-01
ssh -Y bond_j@merlin-l-02.psi.ch
</pre>
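
For convenience, the login node, username and X11 forwarding can also be set once in your SSH client configuration; a minimal sketch (the ``merlin`` alias and the ``bond_j`` username are only examples):
<pre>
# ~/.ssh/config
Host merlin
    HostName merlin-l-01.psi.ch
    User bond_j
    ForwardX11 yes            # like "ssh -X"
    ForwardX11Trusted yes     # like "ssh -Y"
</pre>
With this in place, ``ssh merlin`` is equivalent to the first example above.
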
<!--

### Home and Data Directories

The default quota for the home directory */gpfs/home/$USER* is 10GB.
Until a service for automatic backup of the home directories is announced
to be in production, users are responsible for managing the backups
of their home directories.

The data directories */gpfs/data/$USER* have much larger quotas per user (default is 1TB, extendible on request) than the home directories,
but there is no automatic backup of the data directories.
Users are fully responsible for backup and restore operations
in the data directories.

Command to see your quota on merlin5:
<pre>
/usr/lpp/mmfs/bin/mmlsquota -u $USER --block-size auto merlin5
</pre>

### Scratch disk and Temporary Files

A */scratch* partition of ~50GB is available on each computing node. This partition should be used for creating the temporary files and/or
directories needed by running jobs. Temporary files *must be deleted at the end of the job*.

Example of how to use the */scratch* disk:

<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=4            # you request 4 nodes

<b># Create scratch directory</b>
<i>SCRATCHDIR="/scratch/$(id -un)/${SLURM_JOB_ID}"
mkdir -p ${SCRATCHDIR}</i>
...
<b># Core code, generating temporary files in $SCRATCHDIR</b>
...
<b># Copy final results (whenever needed)</b>
<i>mkdir /gpfs/home/$(id -un)/${SLURM_JOB_ID}</i>
<i>cp -pr /scratch/$(id -un)/${SLURM_JOB_ID}/my_results /gpfs/home/$(id -un)/${SLURM_JOB_ID}</i>
<b># Cleanup temporary data and directories</b>
<i>rm -rf /scratch/$(id -un)/${SLURM_JOB_ID}</i>
<i>rmdir /scratch/$(id -un)</i>
</pre>

### Using Batch System to Submit Jobs to Merlin5

The Slurm Workload Manager is used on Merlin5 to manage and schedule jobs.
Please see "man slurm" and the references therein for more details.
There are many tutorials and howtos on Slurm elsewhere, e.g. at CSCS.
We provide some typical examples for submitting different types of jobs.

Useful Slurm commands:
<pre>
sinfo             # to see the name of nodes, their occupancy, name of slurm partitions, limits (try out with "-l" option)
squeue            # to see the currently running/waiting jobs in slurm (additional "-l" option may also be useful)
sbatch Script.sh  # to submit a script (example below) to the slurm
scancel job_id    # to cancel a slurm job, job_id is the numeric id seen by squeue
</pre>

Other advanced commands:
<pre>
sinfo -N -l       # list nodes, state, resources (number of CPUs, memory per node, etc.), and other information
sshare -a         # to list shares of associations to a cluster
sprio -l          # to view the factors that comprise a job's scheduling priority (add -u <username> for filtering user)
</pre>
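
When a specific job needs closer inspection, two further commands are often useful; a brief sketch (the job id 12345 is only an example):
<pre>
scontrol show job 12345   # detailed information about a pending or running job
sacct -j 12345            # accounting information for a job that has started or finished (if accounting is enabled)
</pre>
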
#### Simple slurm test script (copy-paste the following example into a file Script.sh):
<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=4            # you request 4 nodes

hostname                     # will print one name, since executed on one node
echo
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17
mpirun hostname              # will be executed on all 4 nodes (see above --nodes)
echo
sleep 60                     # useless work occupying 4 merlin nodes
module list
</pre>

Submit the job to slurm and check its status:
<pre>
sbatch Script.sh   # submit this job to slurm
squeue             # check its status
</pre>

#### Advanced slurm test script (copy-paste the following example into a file Script.sh):
<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=2            # number of nodes
#SBATCH --ntasks=24          # number of tasks

hostname                     # will print one name, since executed on one node
echo
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17
mpirun hostname              # will be executed on both allocated nodes (see above --nodes)
echo
sleep 60                     # useless work occupying the allocated nodes
module list
</pre>

In the above example the options *--nodes=2* and *--ntasks=24* are specified. This means that 2 nodes are requested,
and the job is expected to run 24 tasks. Hence, 24 cores are needed for running that job. Slurm will try to allocate 2 nodes
with similar resources, having at least 12 cores/node.

Usually, 2 nodes with 12 cores/node would fit the allocation decision. However, other combinations are possible
(e.g. 2 nodes with 16 cores/node). In this second case, it can happen that other users are running jobs on the allocated
nodes (in this example, up to 4 cores per node could be used by other users' jobs, still leaving at least 12 cores per node
available, which is the minimum number of tasks/cores required by our job).
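
If you prefer to make the distribution explicit instead of leaving it to the scheduler, the number of tasks per node can be fixed as well; a minimal sketch equivalent to the example above:
<pre>
#SBATCH --nodes=2              # 2 nodes
#SBATCH --ntasks-per-node=12   # 12 tasks on each node, 24 tasks in total
</pre>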

In order to ensure exclusive use of the nodes, the option *--exclusive* can be used (see below). This ensures that
the requested nodes are exclusive to the job (no other users' jobs will run on these nodes, and only completely
free nodes will be allocated).

<pre>
#SBATCH --exclusive
</pre>

More advanced configurations can be defined and combined with the previous examples. More information about advanced
options can be found at the following link: https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
advanced configurations unless you are sure of what you are doing.

### Environment Modules

On top of the operating system stack we provide different software using the PSI-developed
Pmodules system. Useful commands:
<pre>
module avail                                      # to see the list of available software provided via pmodules
module load gnuplot/5.2.0                         # to load a specific version of the gnuplot package
module search hdf                                 # try it out to see which versions of the hdf5 package are provided and with which dependencies
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17  # load a specific version of hdf5, compiled with specific versions of gcc and openmpi
module use unstable                               # to get access to packages not yet considered fully stable by the module provider (may be a very fresh version, or not yet tested by the community)
module list                                       # to see which software is loaded in your environment
</pre>

#### Requests for New Software

If you are missing a package or version, please contact us.

### Known Problems and Troubleshooting

#### Paraview, ANSYS and OpenGL

Try to use the X11 (mesa) driver for ParaView and ANSYS instead of OpenGL:
<pre>
module load ANSYS
fluent -driver x11
</pre>

<pre>
module load paraview
paraview --mesa
</pre>

#### Illegal instructions

It may happen that code compiled on one machine cannot be executed on another and throws an exception like "(Illegal instruction)".
Check (with the "hostname" command) which node you are on and compare it with the names from the first item. We have observed a few applications
that cannot run on merlin-c-01..16 because of this problem (note that these machines are more than 5 years old). Hint: you may
choose a particular flavour of machine for your slurm job; check the "--cores-per-socket" option of sbatch:
<pre>
sbatch --cores-per-socket=8 Script.sh   # will filter the selection of machines and exclude the oldest ones, merlin-c-01..16
</pre>

#### Troubleshooting SSH

Use the ssh command with the "-vvv" option and copy and paste (no screenshot please)
the output into your request in Service-Now. Example:

<pre>
ssh -Y -vvv bond_j@merlin-l-01
</pre>

#### Troubleshooting SLURM

If you copy Slurm commands or batch scripts from another cluster,
they may need some changes (often minor) to run successfully on Merlin5.
Examine the error message carefully, especially concerning the options
used in the slurm commands.

Try to submit jobs using the examples given in the section "Using Batch System to Submit Jobs to Merlin5".
If you can successfully run an example for a type of job (OpenMP, MPI) similar to yours,
try to edit the example to run your application.

If the problem remains, then, in your request in Service-Now, describe the problem with the details
needed to reproduce it. Include the output of the following commands:

<pre>
date
hostname
pwd
module list
# All slurm commands used with the corresponding output
</pre>

Do not delete any output and error files generated by Slurm.
Make a copy of the failed job script if you would like to edit it in the meantime.

-->