
# Overview

This document describes how to install and operate the *data acquisition (daq)* environment.

The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named _sf-daqbuf-21_ up to _sf-daqbuf-33_.

Each node of the cluster currently runs 2 independent services:
- daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
- daq-query-node.service - used to query data

The machines are installed and managed via ansible scripts and playbooks. To run these scripts, password-less ssh needs to be set up from the machine running them.


# Prerequisites
__Note:__ This is not needed if working on a standard GFA machine like sflca!

To be able to execute the steps and commands outlined in this document, the following prerequisites need to be met:

- [Ansible](https://www.ansible.com) needs to be installed on your machine
- Password-less ssh needs to be set up from your machine
    - Use `ssh-copy-id` to copy your public key to the servers
        ```
        ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
        ```
    - Have a tunneling directive in your ~/.ssh/config like this:
        ```
        Host sfbastion
          Hostname sf-gw.psi.ch
          ControlMaster auto
          ControlPath ~/.ssh/mux-%r@%h:%p
          ControlPersist 8h

        Host sf-data-api* sf-dispatcher-api* sf-daq*
          Hostname %h
          ProxyJump username@sfbastion
        ```

- Sudo rights are needed on databuffer/... servers
- Clone the [sf_databuffer](https://git.psi.ch/archiver_config/sf_databuffer.git) repository and switch to the `operation-tools` folder
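
Because the public key has to be copied to every node, the `ssh-copy-id` step can be scripted. The following is a minimal sketch, assuming the node names follow the `sf-daqbuf-21` to `sf-daqbuf-33` pattern described in the overview. `daqbuf_hosts` is a hypothetical helper that is not part of the repository, and `your-user` remains a placeholder.

```bash
# Hypothetical helper: print the fully qualified names of the DataBuffer
# nodes (sf-daqbuf-21.psi.ch ... sf-daqbuf-33.psi.ch), one per line.
daqbuf_hosts() {
  local i
  for i in $(seq 21 33); do
    printf 'sf-daqbuf-%d.psi.ch\n' "$i"
  done
}

# Usage: copy your public key to every node (asks for the password per host).
# for host in $(daqbuf_hosts); do ssh-copy-id "your-user@$host"; done
```

The loop is left commented out so that sourcing the snippet does not open any connection by itself.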


# Checks

Check the network load of the DataBuffer:
https://metrics.psi.ch/d/1SL13Nxmz/gfa-linux-tabular?orgId=1&var-env=telegraf_cocos&var-host=sf-daqbuf-21.psi.ch&var-host=sf-daqbuf-22.psi.ch&var-host=sf-daqbuf-23.psi.ch&var-host=sf-daqbuf-24.psi.ch&var-host=sf-daqbuf-25.psi.ch&var-host=sf-daqbuf-26.psi.ch&var-host=sf-daqbuf-27.psi.ch&var-host=sf-daqbuf-28.psi.ch&var-host=sf-daqbuf-29.psi.ch&var-host=sf-daqbuf-30.psi.ch&var-host=sf-daqbuf-31.psi.ch&var-host=sf-daqbuf-32.psi.ch&var-host=sf-daqbuf-33.psi.ch&from=now-6h&to=now&refresh=30s

A healthy network load looks something like this:
![image](documentation/DataBufferNetworkLoad.png)

If more than 2 machines do not show this pattern, the DataBuffer has an issue.

Check the memory usage and network load of the ImageBuffer:
https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1

A healthy memory consumption is between 50GB and 250GB; any drop below 50GB indicates a crash.
![image](documentation/ImageBufferMemory.png)


Check the free disk space of the DataBuffer:
http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
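
If the web view is not reachable, disk usage can be cross-checked on the command line. The following is a rough sketch and not part of the repository: `full_mounts` is a hypothetical helper that filters the POSIX-format output of `df -P`, and the 90 percent default is an arbitrary assumption, not a documented limit.

```bash
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose used capacity exceeds the given percentage (default: 90).
# Mount points containing spaces are not handled by this sketch.
full_mounts() {
  local limit="${1:-90}"
  awk -v limit="$limit" '$5 ~ /^[0-9]+%$/ {
      used = $5; sub(/%/, "", used)        # "87%" -> "87"
      if (used + 0 > limit + 0) print $6, $5
  }'
}

# Usage for a single node:
# ssh sf-daqbuf-21.psi.ch df -P | full_mounts 90
```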


Checks on the cluster can be performed via ansible ad-hoc commands:
```bash
ansible databuffer_cluster -m shell -a 'uptime'
```

Check whether the time is synchronized between the machines:
```bash
ansible databuffer_cluster -m shell -a 'date +%s.%N'
```
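
Comparing 13 timestamps by eye is error-prone, so the spread can be computed instead. The following is a minimal sketch: `time_spread` is a hypothetical helper that is not part of the repository. It assumes that each timestamp appears on a line of its own and ignores every other line (for example the per-host header lines printed by ansible).

```bash
# Hypothetical helper: print the spread (max - min, in seconds) of the
# "seconds.nanoseconds" timestamps read from stdin.
time_spread() {
  awk '/^[0-9]+\.[0-9]+$/ {
         t = $1 + 0
         if (n == 0 || t < min) min = t
         if (n == 0 || t > max) max = t
         n++
       }
       END {
         if (n == 0) { print "no timestamps found"; exit 1 }
         printf "%.3f\n", max - min
       }'
}

# Usage against the cluster:
# ansible databuffer_cluster -m shell -a 'date +%s.%N' | time_spread
```

Note that ansible does not run the command on all hosts at exactly the same instant, so the result also contains connection latency and scheduling delays. It is only an upper bound for the real clock offset.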

Check if the ntp synchronization is enabled and running:
```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
```

Check if the tuned service is running:
```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
```

Check the latest 10 lines of the dispatcher node logs:
```bash
ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"
```

Check for failed compactions:
```bash
ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"
```


## Find Sources With Issues

Find sources with bsread-level issues:
https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)

The error numbers indicate the following errors:

3. 0-pulse / 0 globaltime
4. time out of valid timerange
5. duplicate pulse-id
6. pulse-id before last valid pulse-id
7. duplicate globaltimestamp
8. globaltimestamp before last globaltimestamp

To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").

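The error numbers used there are (matching the values `0`, `1` and `2` of the "Connection Errors" filter in the dashboard link above):

0. receiver connected
1. receiver stopped
2. reconnect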
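
When reading raw log entries, the error numbers can be translated with a small lookup. The following is a sketch only: `bsread_error_text` is a hypothetical helper that is not part of the repository. Types 0 to 2 are grouped as connection events on the assumption that they correspond to the `0`, `1`, `2` values of the "Connection Errors" filter in the dashboard link above.

```bash
# Hypothetical helper: translate a bsread.error.type number into the
# description used in this document.
bsread_error_text() {
  case "$1" in
    0|1|2) echo "connection event" ;;   # assumption: see the dashboard filter
    3) echo "0-pulse / 0 globaltime" ;;
    4) echo "time out of valid timerange" ;;
    5) echo "duplicate pulse-id" ;;
    6) echo "pulse-id before last valid pulse-id" ;;
    7) echo "duplicate globaltimestamp" ;;
    8) echo "globaltimestamp before last globaltimestamp" ;;
    *) echo "unknown error type: $1"; return 1 ;;
  esac
}

# Usage:
# bsread_error_text 5
```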


Find channels that are received from more than one source:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))


Find channels that send corrupt MainHeader:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))


# Maintenance

## Restart Procedures
The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.

__Before issuing a restart, check which endstations are attended via one of the following operations panels and check whether a restart is ok with all attended endstations!__

```bash
bash -c 'caqtdm -noMsg  -stylefile sfop.qss -macro chargelim=15  S_OP_overview.ui'
bash -c 'caqtdm -noMsg  -stylefile sfop.qss -macro chargelim=15  S_OP_overview_for_photonics.ui'
bash -c 'caqtdm -noMsg  -stylefile sfop.qss   S_OP_Messages_all_stations_small.ui'
```
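
To make it harder to skip this check by accident, the restart commands in the following sections can be wrapped in a confirmation prompt. This is an optional sketch and not part of the repository; `confirm_and_run` is a hypothetical helper and the wording of the prompt is an assumption.

```bash
# Hypothetical wrapper: run the given command only after an explicit "y".
confirm_and_run() {
  local answer
  printf 'Restart agreed with all attended endstations? [y/N] ' >&2
  read -r answer
  case "$answer" in
    y|Y) "$@" ;;
    *)   echo "Aborted." >&2; return 1 ;;
  esac
}

# Usage (from the operation-tools directory):
# confirm_and_run ansible-playbook restart_dataretrieval.yml
```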

### Restart Data Retrieval
If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This only affects the data retrieval of SwissFEL at the time of the restart; the recording of the data is not interrupted.

- log in to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes
```bash
cd operation-tools
```

- call the restart_dataretrieval script
```bash
ansible-playbook restart_dataretrieval.yml
```

### Restart Data Retrieval All
If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording, __but this restart will, besides SwissFEL, also affect the data retrieval of GLS, Hipa and Proscan__!

- log in to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes
```bash
cd operation-tools
```

- call the restart_dataretrieval_all script
```bash
ansible-playbook restart_dataretrieval_all.yml
```

### Restart ImageBuffer
If the DataBuffer looks healthy but the ImageBuffer seems to be in a buggy state, a restart of only the ImageBuffer can be triggered as follows:

- log in to sf-lca.psi.ch (_sf-lca.psi.ch is the machine in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the repository directory and/or pull the latest changes

```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer
# and/or
git pull
```

- stop the sources belonging to the imagebuffer
```bash
./bufferutils stop --backend sf-imagebuffer
```

- change to the operation-tools directory and call the restart_imagebuffer script
```bash
cd operation-tools
ansible-playbook restart_imagebuffer.yml
```

- Afterwards restart the recording of the image sources:
```bash
cd ..
./bufferutils upload
```

### Restart DataBuffer Cluster
This is the procedure to follow to restart the DataBuffer in an emergency.

After checking whether the restart is really necessary do this:

- log in to sf-lca.psi.ch (_sflca is a cluster in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes

```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer/operation-tools
# and/or
git pull
```

- call the restart_cluster script
```bash
ansible-playbook restart_cluster.yml
```

- Afterwards restart the recording again:
```bash
cd ..
./bufferutils upload
```


## Manual Restart Procedures (Experts Only)

### Restart query-node Services

Restart daq-query-node service:
```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"
```

__Important Note:__ To be able to start the query node processes the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).


### Restart dispatcher-node Services

Restart daq-dispatcher-node service:
```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"
```

This restart should also restart all recordings and reestablish streams. If there are issues, this recording restart can be disabled by setting `dispatcher.local.sources.restart=false` in `/home/daqusr/.config/daq/dispatcher.properties`. Another option is to delete the restart configurations as follows:
```bash
ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
```
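
Before changing anything, it can help to read the current value of the restart property. The following is a minimal sketch that is not part of the repository: `get_property` is a hypothetical helper that only understands plain `key=value` lines without whitespace around the `=`, which is a simplification of the Java properties format.

```bash
# Hypothetical helper: print the value of the last "key=value" line for the
# given key in a properties file; returns non-zero if the key is not set.
get_property() {
  local file="$1" key="$2"
  awk -F= -v key="$key" '
      $1 == key { value = substr($0, length(key) + 2); found = 1 }
      END { if (found) print value; else exit 1 }' "$file"
}

# Usage on a node (the file belongs to daqusr, so sudo may be needed):
# get_property /home/daqusr/.config/daq/dispatcher.properties dispatcher.local.sources.restart
```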

__Note:__ After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).


# Installation

## Prerequisites

To be able to install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details see the top-level [Readme](../Readme.md)).

## Pre-Checks

Make sure that the time is in sync across the machines:
```bash
ansible databuffer_cluster -m shell -a 'date +%s.%N'
```

Check if the ntp synchronization is enabled and running:
```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
```

On the ImageBuffer nodes, check that the MTU size of the 25Gb/s interface is set to 9000:
```bash
ip link
```
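
The full `ip link` output is verbose, so the MTU values can be extracted for a quicker comparison against 9000. The following is a sketch that is not part of the repository: `link_mtus` is a hypothetical helper that parses the one-line format printed by `ip -o link`.

```bash
# Hypothetical helper: read `ip -o link` output on stdin and print
# "<interface> <mtu>" for each interface.
link_mtus() {
  awk '{
      name = $2; sub(/:$/, "", name)       # "eth0:" -> "eth0"
      for (i = 3; i < NF; i++)
        if ($i == "mtu") print name, $(i + 1)
  }'
}

# Usage on an ImageBuffer node:
# ip -o link | link_mtus
```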

On the ImageBuffer nodes, test the connection to the camera servers with iperf3. As all the camera servers only have a 10Gb/s interface, the overall throughput (SUM) should be around 9Gb/s. When testing two servers simultaneously, each stream should show around 9Gb/s.
```bash
# Start iperf server on the camera servers
iperf3 -s
```

```bash
# Check speed to different camera servers via
iperf3 -P 3 -c daqsf-sioc-cs-02
iperf3 -P 3 -c daqsf-sioc-cs-31
iperf3 -P 3 -c daqsf-sioc-cs-73
iperf3 -P 3 -c daqsf-sioc-cs-85
```

Check whether the firmware of all servers is at the same level:
```bash
ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"
```

Check whether the Power Regulator setting in the BIOS is set to __Static High Performance Mode__!
![documentation/BIOSSettings.png](documentation/BIOSSettings.png)

## Steps

Add daqusr user and daq group on all nodes:
```bash
ansible-playbook install_user.yml
```

Install file and memory limits for the _daqusr_ on all nodes:
```bash
ansible-playbook install_limits.yml
```

Install the current JDK on all nodes:
```bash
ansible-playbook install_jdk.yml
```

Install dispatcher node:
```bash
ansible-playbook install_dispatcher_node.yml
```

Install query node:
```bash
ansible-playbook install_query_node.yml
```

The installation of the dispatcher and query nodes does not start the services. See the restart procedures above for how to start them.

## Post-Checks

Check if the tuned service is running:
```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
```

Check whether all CPUs are set to performance:
```bash
ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"
```
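
For use in scripts, the same check can be turned into an exit status instead of a count that has to be read by eye. The following is a minimal sketch that is not part of the repository: `all_performance` is a hypothetical helper that expects one governor name per line on stdin.

```bash
# Hypothetical helper: succeed only if every line read from stdin is
# "performance" (and at least one line was read).
all_performance() {
  awk '$1 != "performance" { bad++ } END { exit (bad > 0 || NR == 0) }'
}

# Usage for a single node:
# ssh sf-daqbuf-21.psi.ch 'cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor' | all_performance && echo OK
```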