gitea-pages/pages/merlin6/98 announcements/downtimes.md at c91ff314d8b28153b841d9851849d433cbbb4831


caubet_m 41af2e7a26 Downtime 04.11.2019

2019-10-25 16:17:16 +02:00


| title | last_updated | sidebar | permalink |
|---|---|---|---|
| Downtimes | 28 June 2019 | merlin6_sidebar | /merlin6/downtimes.html |

On the first Monday of each month, the Merlin6 cluster may be subject to interruption due to maintenance. When a downtime is scheduled for the following month, users will be informed at least one week in advance.
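
Because downtimes follow a fixed calendar rule, the candidate date and the latest announcement date can be computed for any month. The following is a minimal sketch; `first_monday` and `announce_by` are hypothetical helper names. It only yields the *candidate* date, since a maintenance on a given first Monday is not guaranteed.

```python
from datetime import date, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the first Monday of the given month (candidate downtime date)."""
    first_day = date(year, month, 1)
    # date.weekday(): Monday == 0, ..., Sunday == 6
    days_until_monday = (0 - first_day.weekday()) % 7
    return first_day + timedelta(days=days_until_monday)


def announce_by(year: int, month: int) -> date:
    """Latest date on which that month's downtime should be announced (one week ahead)."""
    return first_monday(year, month) - timedelta(weeks=1)


if __name__ == "__main__":
    print(first_monday(2019, 11), announce_by(2019, 11))
```

For November 2019 this gives 2019-11-04, which matches the 04.11.2019 downtime listed in the Next Scheduled Downtimes table, with an announcement deadline of 2019-10-28.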

Downtimes will be announced to users through the merlin-users@lists.psi.ch mailing list. A detailed description of the next scheduled interventions will also be available in the Next Scheduled Downtimes section below.

## Scheduled Downtime Draining Policy

Scheduled downtimes that mainly affect the storage or the Slurm configuration may require draining the nodes. When this is required, users will be informed accordingly. Two types of draining are possible:

- soft drain: new jobs may still be queued on the partition, but queued jobs are not allocated nodes and do not start. Jobs already running on the partition continue to run. This is the default drain method.
- hard drain: no new jobs may be queued on the partition (job submission requests are denied with an error message), but jobs already queued on the partition may still be allocated nodes and run.
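
The two drain types differ only in which steps of a job's life stay open: submission, start of queued jobs, and continuation of running jobs. The sketch below encodes the two definitions as a small truth table. `DrainType`, `PartitionRules` and `rules_for` are hypothetical names. The wording of the definitions matches Slurm's documented partition states `DOWN` (soft drain) and `DRAIN` (hard drain), which an administrator would set with `scontrol update PartitionName=<name> State=...`. Whether the Merlin6 administrators apply exactly these states is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DrainType(Enum):
    SOFT = "soft drain"  # wording matches Slurm partition State=DOWN (assumed mapping)
    HARD = "hard drain"  # wording matches Slurm partition State=DRAIN (assumed mapping)


@dataclass(frozen=True)
class PartitionRules:
    accepts_submissions: bool  # can new jobs be queued?
    starts_queued_jobs: bool   # are queued jobs allocated nodes and started?
    keeps_running_jobs: bool   # do already running jobs continue?


def rules_for(drain: Optional[DrainType]) -> PartitionRules:
    """Return what a partition allows under a given drain type (None = not drained)."""
    if drain is None:
        return PartitionRules(True, True, True)
    if drain is DrainType.SOFT:
        # New jobs can still be queued, but nothing new is started.
        return PartitionRules(True, False, True)
    # HARD: submissions are rejected with an error, but the existing queue keeps draining.
    return PartitionRules(False, True, True)
```

In both cases running jobs are left alone; what happens to them when the downtime itself starts is governed by the job policy described below.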

Unless explicitly specified, the default draining policy for each partition will be the following:

- The daily and general partitions will be soft drained 24 hours before the downtime.
- The hourly partition will be soft drained 1 hour before the downtime.
- The gpu partition will be soft drained 1 hour before the downtime.
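
The default schedule is plain subtraction from the downtime start. The sketch below computes each partition's soft-drain start. `DRAIN_LEAD_TIME` and `drain_start` are hypothetical names, and the 08:00 start of the 04.11.2019 downtime is taken from the Next Scheduled Downtimes table. A specific downtime announcement may override these defaults.

```python
from datetime import datetime, timedelta

# Default soft-drain lead times per partition, as listed above.
DRAIN_LEAD_TIME = {
    "general": timedelta(hours=24),
    "daily": timedelta(hours=24),
    "hourly": timedelta(hours=1),
    "gpu": timedelta(hours=1),
}


def drain_start(partition: str, downtime_start: datetime) -> datetime:
    """Return when the given partition is soft drained before a downtime."""
    return downtime_start - DRAIN_LEAD_TIME[partition]


if __name__ == "__main__":
    sd = datetime(2019, 11, 4, 8, 0)  # downtime start, 04.11.2019 from 08:00
    for name in DRAIN_LEAD_TIME:
        print(f"{name:8s} drained from {drain_start(name, sd):%d.%m.%Y %H:%M}")
```

For the 04.11.2019 downtime this gives 03.11.2019 08:00 for general and daily, and 04.11.2019 07:00 for hourly and gpu.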

Finally, by default, any jobs still running when the downtime starts will be killed. In some rare cases, jobs will instead be paused and resumed once the downtime has finished.

## Draining Policy Summary

The following table summarizes the draining policies during a Scheduled Downtime (SD):

| Partition | Drain Policy | Default Drain Type | Default Job Policy |
|---|---|---|---|
| general | 24h before the SD | soft drain | Kill running jobs when the SD starts |
| daily | 24h before the SD | soft drain | Kill running jobs when the SD starts |
| hourly | 1h before the SD | soft drain | Kill running jobs when the SD starts |
| gpu | 1h before the SD | soft drain | Kill running jobs when the SD starts |
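
Combining the drain lead times with the kill-at-start job policy gives a quick check of whether a job will run to completion. This is a minimal sketch, assuming the default policy applies and that the job's expected start time is known (for example from `squeue --start`). `job_outcome` and its return labels are hypothetical. It treats a job as at risk whenever its full time limit would extend past the SD start, even though the job may finish earlier.

```python
from datetime import datetime, timedelta

# Default soft-drain lead times from the summary table (assumed unchanged for this SD).
DRAIN_LEAD_TIME = {
    "general": timedelta(hours=24),
    "daily": timedelta(hours=24),
    "hourly": timedelta(hours=1),
    "gpu": timedelta(hours=1),
}


def job_outcome(partition: str, start: datetime, time_limit: timedelta,
                downtime_start: datetime) -> str:
    """Classify a job under the default draining policy.

    "held"    -> the job would start at or after the soft drain, so it stays pending.
    "at_risk" -> it starts before the drain but, if it uses its full time limit,
                 is still running when the SD starts and would be killed.
    "safe"    -> it starts before the drain and ends by the SD start.
    """
    drain_start = downtime_start - DRAIN_LEAD_TIME[partition]
    if start >= drain_start:
        return "held"
    if start + time_limit > downtime_start:
        return "at_risk"
    return "safe"
```

A consequence of this arithmetic is that a job starting just before its partition is drained, and running no longer than the drain lead time (24h or 1h), still ends before the downtime begins.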

## Next Scheduled Downtimes

The table below describes the next Scheduled Downtime:

| Date | Time | Affected Service(s) | Description |
|---|---|---|---|
| 04.11.2019 | From 08:00 | Login nodes | Upgrade of the HP SPP software stack (hardware related) |
| 04.11.2019 | From 08:00 | Merlin5 storage `/gpfs` | Decommissioning of the Merlin5 storage under `/gpfs` |
| 04.11.2019 | From 08:00 | Login + computing nodes | Permanent unmounting of the Merlin5 GPFS storage `/gpfs` |

Notes:
- Login nodes will have a maintenance window from 8am until approx. 10am.
  - An e-mail will be sent when the login nodes become available again.
- The Merlin5 storage will be decommissioned: all data under `/gpfs` will no longer be available, and `/gpfs` will be unmounted from the login and computing nodes.
  - Please ensure that your data has been migrated to the Merlin6 storage. The following paths will no longer be accessible:
    - `/gpfs/data`
    - `/gpfs/user`
    - `/gpfs/group`
  - Read HowTo: Migrating data from the Merlin5 storage to Merlin6 for more information.
- The batch system will keep running.
  - However, jobs accessing the Merlin5 storage will fail or be killed by the administrators.
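
Since jobs that still touch `/gpfs` will fail after the unmount, it can help to scan batch scripts for leftover references before the downtime. The following is a minimal sketch of such a check. `find_gpfs_references` is a hypothetical helper, not an official Merlin tool. It only finds literal paths written in the script, not paths assembled from variables, read from config files, or reached through symlinks.

```python
import re
import sys
from pathlib import Path

# Absolute paths rooted at /gpfs (e.g. /gpfs/data/..., /gpfs/user/..., /gpfs/group/...).
# The lookbehind skips matches inside other paths such as /data/gpfs_x;
# the lookahead skips look-alike names such as /gpfs_old.
GPFS_PATH = re.compile(r"(?<![\w./-])/gpfs(?![\w.-])(?:/[^\s'\";:]*)?")


def find_gpfs_references(text: str) -> list[tuple[int, str]]:
    """Return (line number, path) pairs for every /gpfs path found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in GPFS_PATH.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Usage: python check_gpfs.py job1.sh job2.sh ...
    for name in sys.argv[1:]:
        for lineno, path in find_gpfs_references(Path(name).read_text()):
            print(f"{name}:{lineno}: {path}")
```

Any reported line points to a path that has to be replaced by its Merlin6 location before the job is submitted again.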
