forked from archiver_config/sf_databuffer
# Overview

This document provides instructions on how to install and operate the *data acquisition (daq)* environment.

The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named _sf-daqbuf-21_ up to _sf-daqbuf-33_.

Each node of the cluster currently runs 2 independent services:

- daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
- daq-query-node.service - used to query data

The machines are installed and managed via Ansible scripts and playbooks. To be able to run these scripts, password-less ssh needs to be set up from the machine running these scripts.

# Prerequisites
__Note:__ This is not needed if working on a standard GFA machine like sflca!

To be able to execute the steps and commands outlined in this document, the following prerequisites need to be met:

- [Ansible](https://www.ansible.com) needs to be installed on your machine
- Password-less ssh needs to be set up from your machine
- Use `ssh-copy-id` to copy your public key to the servers

```
ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
```

- Have a tunneling directive in your ~/.ssh/config like this:

```
Host sfbastion
    Hostname sf-gw.psi.ch
    ControlMaster auto
    ControlPath ~/.ssh/mux-%r@%h:%p
    ControlPersist 8h

Host sf-data-api* sf-dispatcher-api* sf-daq*
    Hostname %h
    ProxyJump username@sfbastion
```

- Sudo rights are needed on databuffer/... servers
- Clone the [sf_databuffer](https://git.psi.ch/archiver_config/sf_databuffer.git) repository and switch to the `operation-tools` folder

# Checks
Check network load DataBuffer:
https://metrics.psi.ch/d/1SL13Nxmz/gfa-linux-tabular?orgId=1&var-env=telegraf_cocos&var-host=sf-daqbuf-21.psi.ch&var-host=sf-daqbuf-22.psi.ch&var-host=sf-daqbuf-23.psi.ch&var-host=sf-daqbuf-24.psi.ch&var-host=sf-daqbuf-25.psi.ch&var-host=sf-daqbuf-26.psi.ch&var-host=sf-daqbuf-27.psi.ch&var-host=sf-daqbuf-28.psi.ch&var-host=sf-daqbuf-29.psi.ch&var-host=sf-daqbuf-30.psi.ch&var-host=sf-daqbuf-31.psi.ch&var-host=sf-daqbuf-32.psi.ch&var-host=sf-daqbuf-33.psi.ch&from=now-6h&to=now&refresh=30s

A healthy network load shows the same regular pattern on all machines. If more than 2 machines do not show this pattern, the DataBuffer has an issue.

Check memory usage and network load ImageBuffer:
https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1

A healthy memory consumption is between 50GB and 250GB; any drop below 50GB indicates a crash.

Check disk free DataBuffer:
http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

Checks on the cluster can be performed via ansible ad-hoc commands:

```bash
ansible databuffer_cluster -m shell -a 'uptime'
```

Check whether the clocks are synchronized between the machines:

```bash
ansible databuffer_cluster -m shell -a 'date +%s.%N'
```
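The command above prints one epoch timestamp per host. As a rough sketch (not part of the repository tooling), the spread between the fastest and slowest clock can be computed by piping the collected timestamp column through awk:

```bash
# Hypothetical helper: reads one epoch timestamp per line (as produced
# by `date +%s.%N`) and prints the maximum pairwise skew in seconds.
max_skew() {
  awk 'NR==1 {min=$1; max=$1}
       {if ($1 < min) min = $1; if ($1 > max) max = $1}
       END {printf "%.3f\n", max - min}'
}

# Example with made-up timestamps from three nodes:
printf '%s\n' 1700000000.100 1700000000.250 1700000000.180 | max_skew   # 0.150
```

A skew of more than a few milliseconds is worth investigating with the chronyd checks below.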
Check if the ntp synchronization is enabled and running:

```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
```

Check if the tuned service is running:

```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
```

Check the latest 10 lines of the dispatcher node logs:

```bash
ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"
```

Check for failed compactions:

```bash
ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"
```
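The `grep -oP` stage of this pipeline extracts the quoted channel name from each matching log line. A self-contained illustration with a made-up log line (SF-DEMO-CHANNEL is not a real channel):

```bash
# Hypothetical log line; the PCRE \K keeps only what follows the quote.
echo "Exception while compacting 'SF-DEMO-CHANNEL'" | grep -oP "'\K[^']+"
# prints: SF-DEMO-CHANNEL
```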
## Find Sources With Issues

Find sources with bsread-level issues:

https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)

The error numbers indicate the following errors:

3. 0-pulse / 0 globaltime
4. time out of valid timerange
5. duplicate pulse-id
6. pulse-id before last valid pulse-id
7. duplicate globaltimestamp
8. globaltimestamp before last globaltimestamp

To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").

The error numbers used there are:

0. receiver connected
1. receiver stopped
2. reconnect
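Taken together, the two lists form one lookup table. A small illustrative helper (not part of the tooling; it assumes the connection-error types are 0-2, matching the values excluded by the dashboard filter):

```bash
# Hypothetical decoder for the bsread error type numbers listed above.
decode_bsread_error() {
  case "$1" in
    0) echo "receiver connected" ;;
    1) echo "receiver stopped" ;;
    2) echo "reconnect" ;;
    3) echo "0-pulse / 0 globaltime" ;;
    4) echo "time out of valid timerange" ;;
    5) echo "duplicate pulse-id" ;;
    6) echo "pulse-id before last valid pulse-id" ;;
    7) echo "duplicate globaltimestamp" ;;
    8) echo "globaltimestamp before last globaltimestamp" ;;
    *) echo "unknown error type $1" ;;
  esac
}

decode_bsread_error 5   # duplicate pulse-id
```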
Find channels that are received from more than one source:

https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))

Find channels that send corrupt MainHeader:

https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))

# Maintenance

## Restart Procedures
The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.

__Before issuing a restart, check which endstations are attended via one of the following operations panels and check whether a restart is ok with all attended endstations!__

```
bash -c 'caqtdm -noMsg -stylefile sfop.qss -macro chargelim=15 S_OP_overview.ui'
bash -c 'caqtdm -noMsg -stylefile sfop.qss -macro chargelim=15 S_OP_overview_for_photonics.ui'
bash -c 'caqtdm -noMsg -stylefile sfop.qss S_OP_Messages_all_stations_small.ui'
```

### Restart Data Retrieval
If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This only affects the data retrieval of SwissFEL at the time of the restart; the recording of the data is not interrupted.

- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes

```bash
cd operation-tools
```

- call the restart_dataretrieval script

```bash
ansible-playbook restart_dataretrieval.yml
```

### Restart Data Retrieval All
If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording, __but this restart will, besides SwissFEL, also affect the data retrieval of GLS, Hipa and Proscan__!

- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes

```bash
cd operation-tools
```

- call the restart_dataretrieval script

```bash
ansible-playbook restart_dataretrieval_all.yml
```

### Restart ImageBuffer
If the DataBuffer looks healthy but the ImageBuffer seems to be in a buggy state, the ImageBuffer alone can be restarted as follows:

- login to sf-lca.psi.ch (_sf-lca.psi.ch is the machine in the machine network!_)
- clone the databuffer repository (if you haven't yet), change to the repository directory and/or pull the latest changes

```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer
# and/or
git pull
```

- stop the sources belonging to the imagebuffer

```bash
./bufferutils stop --backend sf-imagebuffer
```

- change to the operation-tools directory and call the restart_imagebuffer script

```bash
cd operation-tools
ansible-playbook restart_imagebuffer.yml
```

- Afterwards restart the recording of the image sources:

```bash
cd ..
./bufferutils upload
```

### Restart DataBuffer Cluster
|
|
This is the procedure to follow to restart the DataBuffer in an emergency.
|
|
|
|
After checking whether the restart is really necessary do this:
|
|
|
|
- login to sf-lca.psi.ch (_sflca is cluster in the machine network !!!!_)
|
|
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
|
|
|
|
```bash
|
|
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
|
|
cd sf_databuffer/operation-tools
|
|
# and/or
|
|
git pull
|
|
```
|
|
|
|
- call the restart_cluster script
|
|
```bash
|
|
ansible-playbook restart_cluster.yml
|
|
```
|
|
|
|
- Afterwards restart the recording again:
|
|
```bash
|
|
cd ..
|
|
./bufferutils upload
|
|
```
|
|
|
|
|
|
## Manual Restart Procedures (Experts Only)
|
|
|
|
### Restart query-node Services
Restart the daq-query-node service:

```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"
```

__Important Note:__ To be able to start the query node processes, the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

### Restart dispatcher-node Services
Restart the daq-dispatcher-node service:

```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"
```

This restart should also restart all recordings and reestablish streams. If there are issues, this automatic recording restart can be disabled by setting dispatcher.local.sources.restart=false in /home/daqusr/.config/daq/dispatcher.properties. Another option is to delete the restart configurations as follows:

```bash
ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
```
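For reference, the corresponding entry in /home/daqusr/.config/daq/dispatcher.properties would look like this (only the key mentioned above is shown; the rest of the file is untouched):

```properties
# Disable the automatic restart of recordings on dispatcher-node startup
dispatcher.local.sources.restart=false
```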
__Note:__ After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

# Installation

## Prerequisites
To be able to install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details see the toplevel [Readme](../Readme.md)).

## Pre-Checks
Make sure that the time is in sync across the machines:

```bash
ansible databuffer_cluster -m shell -a 'date +%s.%N'
```

Check if the ntp synchronization is enabled and running:

```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
```

On the ImageBuffer nodes check that the MTU size of the 25Gb/s interface is set to 9000:

```bash
ip link
```
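`ip link` prints one `mtu` value per interface; the number can be pulled out with grep. A sketch on a sample line (the interface name ens2f0 is a placeholder, not necessarily the real one):

```bash
# Made-up `ip link` style line; extract the MTU value following "mtu".
line="4: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP"
echo "$line" | grep -oP 'mtu \K[0-9]+'
# prints: 9000
```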
On the ImageBuffer nodes test the connection to the camera servers with iperf3. As all the camera servers only have a 10Gb/s interface, the overall throughput (SUM) should be around 9Gb/s. Testing against two servers simultaneously should show 9Gb/s for each stream.

```
# Start iperf server on the camera servers
iperf3 -s
```

```
# Check speed to different camera servers via
iperf3 -P 3 -c daqsf-sioc-cs-02
iperf3 -P 3 -c daqsf-sioc-cs-31
iperf3 -P 3 -c daqsf-sioc-cs-73
iperf3 -P 3 -c daqsf-sioc-cs-85
```
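The aggregate throughput to compare against the ~9Gb/s target is on the `[SUM]` line of the iperf3 report. An illustrative extraction on a sample line (the numbers are made up, not a real measurement):

```bash
# Sample iperf3 summary line; print the bandwidth figure and its unit,
# which are the two fields before the trailing role column.
sum_line="[SUM]   0.00-10.00  sec  11.0 GBytes  9.42 Gbits/sec  receiver"
echo "$sum_line" | awk '{print $(NF-2), $(NF-1)}'
# prints: 9.42 Gbits/sec
```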
Check whether the firmware of all servers is at the same level:

```bash
ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"
```
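All servers should report the same release date; piping the collected dates through `sort | uniq -c` makes outliers obvious. A sketch with made-up dates (more than one output line means the firmware levels differ):

```bash
# Hypothetical BIOS release dates from three servers.
printf '%s\n' 03/15/2022 03/15/2022 11/01/2021 | sort | uniq -c
```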
Check whether the Power Regulator setting in the BIOS is set to __Static High Performance Mode__!

## Steps
Add the daqusr user and daq group on all nodes:

```bash
ansible-playbook install_user.yml
```

Install file and memory limits for the _daqusr_ on all nodes:

```bash
ansible-playbook install_limits.yml
```

Install the current JDK on all nodes:

```bash
ansible-playbook install_jdk.yml
```

Install the dispatcher node:

```bash
ansible-playbook install_dispatcher_node.yml
```

Install the query node:

```bash
ansible-playbook install_query_node.yml
```

The installation of the dispatcher and query nodes does not start the services. See above for how to start them.

## Post-Checks
Check if the tuned service is running:

```bash
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
```

Check whether all CPUs are set to performance:

```bash
ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"
```
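With `uniq -c`, the healthy case collapses to a single line counting only `performance` entries. An illustrative check on sample output (the governor values below are made up):

```bash
# Hypothetical governor values from three CPUs; a single "performance"
# line is the healthy result.
printf '%s\n' performance performance performance | uniq -c | awk '{print $1, $2}'
# prints: 3 performance
```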