sf_databuffer/operation-tools/Readme.md
2023-12-20 11:11:26 +01:00

# Overview
This document provides instructions on how to install and operate the *data acquisition (daq)* environment.
The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named _sf-daqbuf-21_ up to _sf-daqbuf-33_.
Each node of the cluster currently runs 2 independent services:
- daq-dispatcher-node.service - used to dispatch and record data (filestorage cluster)
- daq-query-node.service - used to query data
The machines are installed and managed via Ansible scripts and playbooks. To run these scripts, password-less SSH must be set up from the machine that executes them.
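The node names follow a simple numeric pattern, which is handy for quick ad-hoc scripting outside of Ansible. A minimal sketch (not part of the repository tooling) that enumerates all 13 hostnames:

```bash
# Enumerate the DataBuffer node hostnames sf-daqbuf-21 .. sf-daqbuf-33
for i in $(seq 21 33); do
  echo "sf-daqbuf-${i}.psi.ch"
done
```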
# Prerequisites
__Note:__ This is not needed if working on a standard GFA machine like sflca!
To execute the steps and commands outlined in this document, the following prerequisites need to be met:
- [Ansible](https://www.ansible.com) needs to be installed on your machine
- Password-less SSH needs to be set up from your machine
- Use `ssh-copy-id` to copy your public key to the servers
```
ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
```
- Have a tunneling directive in your ~/.ssh/config like this:
```
Host sfbastion
Hostname sf-gw.psi.ch
ControlMaster auto
ControlPath ~/.ssh/mux-%r@%h:%p
ControlPersist 8h
Host sf-data-api* sf-dispatcher-api* sf-daq*
Hostname %h
ProxyJump username@sfbastion
```
- Sudo rights are needed on databuffer/... servers
- Clone the [sf_databuffer](https://git.psi.ch/archiver_config/sf_databuffer.git) repository and switch to the `operation-tools` folder
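The Ansible commands in this document address the nodes through a `databuffer` host group, defined in the repository's inventory. Purely for illustration (the actual inventory file in `sf_databuffer` is authoritative), such a group definition could look like:

```ini
[databuffer]
sf-daqbuf-[21:33].psi.ch
```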
# Checks
Check the network load of the DataBuffer:
https://metrics.psi.ch/d/1SL13Nxmz/gfa-linux-tabular?orgId=1&var-env=telegraf_cocos&var-host=sf-daqbuf-21.psi.ch&var-host=sf-daqbuf-22.psi.ch&var-host=sf-daqbuf-23.psi.ch&var-host=sf-daqbuf-24.psi.ch&var-host=sf-daqbuf-25.psi.ch&var-host=sf-daqbuf-26.psi.ch&var-host=sf-daqbuf-27.psi.ch&var-host=sf-daqbuf-28.psi.ch&var-host=sf-daqbuf-29.psi.ch&var-host=sf-daqbuf-30.psi.ch&var-host=sf-daqbuf-31.psi.ch&var-host=sf-daqbuf-32.psi.ch&var-host=sf-daqbuf-33.psi.ch&from=now-6h&to=now&refresh=30s
A healthy network load looks something like this:
![image](documentation/DataBufferNetworkLoad.png)
If more than 2 machines do not show this pattern, the DataBuffer has an issue.
Check free disk space on the DataBuffer:
http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
Checks on the cluster can be performed via ansible ad-hoc commands:
```bash
ansible databuffer -m shell -a 'uptime'
```
## Find Sources With Issues
Find sources with bsread-level issues:
https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)
The error numbers indicate the following errors:
3. 0-pulse / 0 globaltime
4. time out of valid timerange
5. duplicate pulse-id
6. pulse-id before last valid pulse-id
7. duplicate globaltimestamp
8. globaltimestamp before last globaltimestamp
To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").
The error numbers used there are:
1. receiver connected
2. receiver stopped
3. reconnect
Find channels that are received from more than one source:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))
Find channels that send corrupt MainHeader:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))
# Maintenance
## Restart Procedures
The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.
__Before issuing a restart, check which endstations are attended via one of the following operations panels, and check whether a restart is OK with all attended endstations!__
```
bash -c 'caqtdm -noMsg -stylefile sfop.qss -macro chargelim=15 S_OP_overview.ui'
bash -c 'caqtdm -noMsg -stylefile sfop.qss -macro chargelim=15 S_OP_overview_for_photonics.ui'
bash -c 'caqtdm -noMsg -stylefile sfop.qss S_OP_Messages_all_stations_small.ui'
```
Inform the sfOperations@psi.ch mailing list before the restart!
### Restart DataBuffer
This is the procedure to follow to restart the DataBuffer in an emergency.
After checking whether the restart is really necessary do this:
- log in to sf-lca.psi.ch (_sflca is a cluster in the machine network!_)
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer/operation-tools
# and/or
git pull
```
- call the restart script
```bash
ansible-playbook restart-all.yml
```
- After waiting ~5 min, restart the recording:
```bash
cd ..
./bufferutils upload
```
__The required waiting time before restarting the recording depends on the amount of cleanup work the DataBuffer has to do after a crash. The worse the crash, the longer you should wait; in most cases 5 min should be enough.__