Doc changes

2021-05-21 12:34:19 +02:00
parent 42d8f38934
commit fcfdbf1344
46 changed files with 447 additions and 528 deletions


@@ -0,0 +1,61 @@
---
title: Downtimes
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/downtimes.html
---
On the first Monday of each month, the Merlin6 cluster may be interrupted for maintenance.
Users will be informed at least one week in advance when a downtime is scheduled for the next month.
Downtimes are announced to users through the <merlin-users@lists.psi.ch> mailing list. In addition, a detailed description
of the next scheduled interventions is available in [Next Scheduled Downtimes](/merlin6/downtimes.html#next-scheduled-downtimes).
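
The scheduling rule above can be worked out with a few lines of date arithmetic. The following is a minimal sketch, not an official tool: the function names `first_monday` and `latest_notice_date` are hypothetical, and it assumes that "at least one week in advance" means seven calendar days before the downtime day.

```python
from datetime import date, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the first Monday of the given month (candidate downtime day)."""
    first = date(year, month, 1)
    # date.weekday() returns 0 for Monday, so this offset is 0..6 days.
    offset = (7 - first.weekday()) % 7
    return first + timedelta(days=offset)


def latest_notice_date(downtime_day: date) -> date:
    """Latest day on which the announcement can still be sent.

    Assumption: 'one week in advance' is read as 7 calendar days.
    """
    return downtime_day - timedelta(days=7)


if __name__ == "__main__":
    day = first_monday(2020, 8)
    print("Candidate downtime day:", day.isoformat())
    print("Announce no later than:", latest_notice_date(day).isoformat())
```

For August 2020 this yields 3 August as the candidate day, so under the assumption above the announcement would be due by 27 July.
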
---
## Scheduled Downtime Draining Policy
Scheduled downtimes, mostly those affecting the storage and Slurm configurations, may require draining the nodes.
When this is required, users will be informed accordingly. Two different types of draining are possible:
* **soft drain**: new jobs may be queued on the partition, but queued jobs may not be allocated nodes and run from the partition.
Jobs already running on the partition continue to run. This will be the **default** drain method.
* **hard drain**: no new jobs may be queued on the partition (job submission requests will be denied with an error message),
but jobs already queued on the partition may be allocated to nodes and run.
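
The difference between the two drain types comes down to two yes/no properties, which the sketch below models. It is purely illustrative: the `DrainType` class and `submission_outcome` function are hypothetical names and not part of Slurm or of the Merlin6 tooling. The wording of the two definitions closely resembles the Slurm partition states `DOWN` (soft) and `DRAIN` (hard) as described in the `scontrol` documentation, but whether the drains are implemented that way on Merlin6 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainType:
    """Hypothetical model of a drain type, following the definitions above."""
    name: str
    accepts_new_jobs: bool    # may new jobs be queued on the partition?
    starts_queued_jobs: bool  # may queued jobs be allocated nodes and run?


SOFT_DRAIN = DrainType("soft drain", accepts_new_jobs=True, starts_queued_jobs=False)
HARD_DRAIN = DrainType("hard drain", accepts_new_jobs=False, starts_queued_jobs=True)


def submission_outcome(drain: DrainType) -> str:
    """Describe what happens to a job submitted while the drain is active."""
    if not drain.accepts_new_jobs:
        return "rejected"
    if not drain.starts_queued_jobs:
        return "queued, not started"
    return "queued"


if __name__ == "__main__":
    for drain in (SOFT_DRAIN, HARD_DRAIN):
        print(f"{drain.name}: new submission is {submission_outcome(drain)}")
```

In practice this means that a job submitted during a soft drain is accepted but waits in the queue, while a job submitted during a hard drain is refused at submission time.
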
Unless explicitly specified, the default draining policy for each partition will be the following:
* The **daily** and **general** partitions will be soft drained 12h before the downtime.
* The **hourly** partition will be soft drained 1 hour before the downtime.
* The **gpu**, **gpu-short** and **gfa-asa** partitions will be soft drained 1 hour before the downtime.
Finally, **any jobs still running will be killed** by default when the downtime starts. In rare cases, jobs will instead be
*paused* and then *resumed* once the downtime has finished.
### Draining Policy Summary
The following table summarizes the draining policies applied during a Scheduled Downtime (SD):
| **Partition** | **Drain Policy** | **Default Drain Type** | **Default Job Policy** |
|:---------------:| -----------------:| ----------------------:| --------------------------------:|
| **general** | 12h before the SD | soft drain | Kill running jobs when SD starts |
| **daily** | 12h before the SD | soft drain | Kill running jobs when SD starts |
| **hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu-short** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **gfa-asa** | 1h before the SD | soft drain | Kill running jobs when SD starts |
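
To see when each partition stops starting new jobs, the lead times from the table above can be subtracted from the downtime start. This is a minimal sketch under stated assumptions: the names `DRAIN_LEAD_TIME`, `drain_start` and `drain_schedule` are hypothetical, and the lead times are the defaults from the table, which may be overridden for a specific downtime.

```python
from datetime import datetime, timedelta

# Default lead times copied from the summary table above (assumed unchanged).
DRAIN_LEAD_TIME = {
    "general": timedelta(hours=12),
    "daily": timedelta(hours=12),
    "hourly": timedelta(hours=1),
    "gpu": timedelta(hours=1),
    "gpu-short": timedelta(hours=1),
    "gfa-asa": timedelta(hours=1),
}


def drain_start(partition: str, downtime_start: datetime) -> datetime:
    """Return the time at which the soft drain of a partition begins."""
    return downtime_start - DRAIN_LEAD_TIME[partition]


def drain_schedule(downtime_start: datetime) -> dict:
    """Return the soft drain start time for every known partition."""
    return {name: drain_start(name, downtime_start) for name in DRAIN_LEAD_TIME}


if __name__ == "__main__":
    # Example downtime start: 3 August 2020 at 8am.
    start = datetime(2020, 8, 3, 8, 0)
    for name, when in drain_schedule(start).items():
        print(f"{name:10s} soft drain from {when:%d.%m.%Y %H:%M}")
```

For a downtime starting on 03.08.2020 at 8am, the **general** and **daily** partitions would be drained from 02.08.2020 at 8pm, and the remaining partitions from 03.08.2020 at 7am. Jobs that cannot finish before the downtime starts should therefore be submitted with this in mind, since running jobs are killed by default.
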
---
## Next Scheduled Downtimes
The table below describes the next Scheduled Downtime:
| From | To | Service | Description |
| ---------------- | ---------------- |:------------:|:----------------------------------------------------------------------- |
| 05.09.2020 8am | 05.09.2020 6pm | <pending> | <pending> |
* **Note**: An e-mail will be sent when the services are fully available.


@@ -0,0 +1,38 @@
---
title: Past Downtimes
#tags:
#keywords:
last_updated: 03 September 2019
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/past-downtimes.html
---
## Past Downtimes: Change Log
### 2020
| From | To | Service | Clusters | Description | Exceptions |
| ---------------- | ---------------- |:------------:|:---------------:|:--------------------------------------------------------------|:-------------------------------------------:|
| 03.08.2020 8am | 03.08.2020 6pm | Archive | merlin6 | Replace old merlin-export-01 with merlin-export-02 | |
| 03.08.2020 8am | 03.08.2020 6pm | RemoteAccess | merlin6 | ra-merlin-0[1,2] Remount merlin-export-02 | |
| 06.07.2020 | 06.07.2020 | All services | merlin5,merlin6 | GPFS v5.0.4-4,OFED v5.0,YFS v0.195,RHEL7.7,Slurm v19.05.7,f/w | |
| 04.05.2020 | 04.05.2020 | Login nodes | merlin6 | Outage. YFS (AFS) update v0.194 and reboot | |
| 04.05.2020 | 04.05.2020 | CN | merlin5 | Outage. O.S. update, OFED drivers update, YFS (AFS) update. | |
| 03.02.2020 9am | 03.02.2020 10am | Slurm | merlin5,merlin6 | Upgrading config [HPCLOCAL-321](https://jira.psi.ch/browse/HPCLOCAL-321) | |
| 10.01.2020 9am | 10.01.2020 6pm | All Services | merlin5,merlin6 | Slurm v18->v19, IB Connected Mode, other. [HPCLOCAL-300](https://jira.psi.ch/browse/HPCLOCAL-300) | |
### Older Downtimes
| From | To | Service | Clusters | Description | Exceptions |
| ---------------- | ---------------- |:------------:|:---------------:|:--------------------------------------------------------------|:-------------------------------------------:|
| 02.09.2019 | 02.09.2019 | GPFS | merlin5,merlin6 | v5.0.2-3 -> v5.0.3-2 | |
| 02.09.2019 | 02.09.2019 | O.S. | merlin5 | RHEL7.4 (rhel-7.4) -> RHEL7.6 (prod-00048) | merlin-g-40, still running RHEL7.4\* |
| 02.09.2019 | 02.09.2019 | O.S. | merlin6 | RHEL7.6 (prod-00030) -> RHEL7.6 (prod-00048) | |
| 02.09.2019 | 02.09.2019 | Infiniband | merlin5 | OFED v4.4 -> v4.6 | merlin-g-40, still running OFED v4.4\* |
| 02.09.2019 | 02.09.2019 | Infiniband | merlin6 | OFED v4.5 -> v4.6 | |
| 02.09.2019 | 02.09.2019 | PModules | merlin5,merlin6 | PModules v1.0.0rc4 -> v1.0.0rc5 | |
| 02.09.2019 | 02.09.2019 | AFS(YFS) | merlin5 | OpenAFS v1.6.22.2-236 -> YFS v188 | merlin-g-40, still running OpenAFS\* |
| 02.09.2019 | 02.09.2019 | AFS(YFS) | merlin6 | YFS v186 -> YFS v188 | |
| 02.09.2019 | 02.09.2019 | O.S. | merlin5 | RHEL7.4 -> RHEL7.6 (prod-00048) | |
| 02.09.2019 | 02.09.2019 | Slurm | merlin5,merlin6 | Slurm v18.08.6 -> v18.08.8 | |