---
title: Downtimes
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/downtimes.html
---

On the first Monday of each month, the Merlin6 cluster may be subject to interruption due to maintenance.
Users will be informed at least one week in advance when a downtime is scheduled for the following month.

Downtimes will be announced to users through the <merlin-users@lists.psi.ch> mailing list. In addition, a detailed
description of the next scheduled interventions will be available in [Next Scheduled Downtimes](/merlin6/downtimes.html#next-scheduled-downtimes).

---

## Scheduled Downtime Draining Policy

Scheduled downtimes, mostly those affecting the storage and Slurm configurations, may require draining the nodes.
When this is required, users will be informed accordingly. Two different types of draining are possible:

* **soft drain**: new jobs may still be queued on the partition, but queued jobs will not be allocated nodes and run from the partition.
  Jobs already running on the partition continue to run. This is the **default** drain method.
* **hard drain**: no new jobs may be queued on the partition (job submission requests will be denied with an error message),
  but jobs already queued on the partition may be allocated nodes and run.
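
These two behaviours correspond, in our reading, to Slurm's `DOWN` and `DRAIN` partition states (this mapping is an interpretation, not an official statement). You can check the current availability of each partition from a login node; the snippet below is a sketch that falls back to a message on hosts without Slurm client tools:

```shell
# Check partition availability before a scheduled downtime.
# AVAIL column values: up / down / drain / inact.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -o "%12P %10a %10l"    # partition, availability, time limit
else
    echo "sinfo not found: run this on a Merlin6 login node"
fi
```

Once draining begins, the `AVAIL` column of the affected partitions switches from `up` to `down` (soft drain) or `drain` (hard drain).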
Unless explicitly specified, the default draining policy for each partition is the following:

* The **daily** and **general** partitions will be soft drained 24h before the downtime.
* The **hourly** partition will be soft drained 1 hour before the downtime.
* The **gpu** partition will be soft drained 1 hour before the downtime.

Finally, **remaining running jobs will be killed** by default when the downtime starts. In some specific rare cases, jobs will
instead be *paused* and *resumed* once the downtime has finished.
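
Before submitting close to a downtime, you can estimate whether a job's requested walltime fits before the downtime starts. A minimal sketch using GNU `date` (the downtime date and walltime below are example values, not a real schedule):

```shell
# Will a job with the given walltime finish before the downtime starts?
# Example values; adjust DOWNTIME_START and WALLTIME_HOURS to your case.
DOWNTIME_START="2019-11-04 08:00"
WALLTIME_HOURS=24

now=$(date +%s)
start=$(date -d "$DOWNTIME_START" +%s)
end=$(( now + WALLTIME_HOURS * 3600 ))

if [ "$end" -lt "$start" ]; then
    echo "Job should finish before the downtime"
else
    echo "Job may be killed when the downtime starts"
fi
```

Note that on a soft-drained partition a pending job will simply not start until after the downtime, so shortening the requested walltime does not help once draining has begun.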

### Draining Policy Summary

The following table contains a summary of the draining policies during a Scheduled Downtime:

| **Partition** | **Drain Policy**  | **Default Drain Type** | **Default Job Policy**           |
|:-------------:|:-----------------:|:----------------------:|:--------------------------------:|
| **general**   | 24h before the SD | soft drain             | Kill running jobs when SD starts |
| **daily**     | 24h before the SD | soft drain             | Kill running jobs when SD starts |
| **hourly**    | 1h before the SD  | soft drain             | Kill running jobs when SD starts |
| **gpu**       | 1h before the SD  | soft drain             | Kill running jobs when SD starts |

---

## Next Scheduled Downtimes

The table below describes the next Scheduled Downtimes:

| Date         | Time    | Affected Service/s            | Description                                                 |
|:------------:|---------|:------------------------------|:------------------------------------------------------------|
| *04.11.2019* | From 8h | *Login nodes*                 | Upgrade of the HP SPP Software Stack (hardware related)     |
| *04.11.2019* | From 8h | **Merlin5 storage** ``/gpfs`` | Decommission of the Merlin5 storage under ``/gpfs``         |
| *04.11.2019* | From 8h | *Login + computing nodes*     | Permanent unmount of the **Merlin5** GPFS storage ``/gpfs`` |

* **Notes:**
  * Login nodes will have a maintenance window from 8am until approx. 10am.
    * An e-mail will be sent when the login nodes become available again.
  * **Merlin5 storage will be decommissioned**: all data under ``/gpfs`` will no longer be available, and **``/gpfs`` will be unmounted** from the login and computing nodes.
    * Please ensure that your data has been migrated to the Merlin6 storage.
    * ``/gpfs/data`` will not be accessible anymore.
    * ``/gpfs/user`` will not be accessible anymore.
    * ``/gpfs/group`` will not be accessible anymore.
    * Read [HowTo: Migrating data from the Merlin5 storage to Merlin6](https://lsm-hpce.gitpages.psi.ch/merlin6/migrating.html) for more information.
  * The batch system will keep running.
    * However, jobs accessing the Merlin5 storage will die or be killed by the admins.
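
As a quick pre-downtime check, you can preview what would still need copying from the Merlin5 storage with an `rsync` dry run. This is a hypothetical sketch: the source path follows the ``/gpfs/user`` layout listed above, but the Merlin6 destination path is an assumption; consult the migration HowTo for the actual target location.

```shell
# Hypothetical pre-downtime migration preview.
# SRC follows the Merlin5 /gpfs layout; DST is an ASSUMED Merlin6 path --
# check the migration HowTo for the real destination.
SRC="/gpfs/user/$USER/"
DST="/data/user/$USER/"

if [ -d "$SRC" ]; then
    # -a preserves permissions and timestamps; -n (dry run) only previews,
    # so nothing is copied until you remove the -n flag.
    rsync -avn "$SRC" "$DST"
else
    echo "Merlin5 storage not mounted on this host"
fi
```

Run the command without `-n` to perform the actual copy, and verify the transfer completed before the downtime date.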