Overview

This document provides instructions on how to install and operate the data acquisition (daq) environment.

The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are in the SwissFEL network and are named sf-daqbuf-21 through sf-daqbuf-33.

Each node of the cluster currently runs 2 independent services:

  • daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
  • daq-query-node.service - used to query data

The machines are installed and managed via ansible scripts and playbooks. To be able to run these scripts, password-less ssh needs to be set up from the machine running them.

Prerequisites

Note: This is not needed if working on a standard GFA machine like sflca!

To be able to execute the steps and commands outlined in this document, the following prerequisites need to be met:

  • Ansible needs to be installed on your machine

  • Password-less ssh needs to be set up from your machine

    • Use ssh-copy-id to copy your public key to the servers
      ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
      
    • Have a tunneling directive in your ~/.ssh/config like this:
      Host sfbastion
        Hostname sf-gw.psi.ch
        ControlMaster auto
        ControlPath ~/.ssh/mux-%r@%h:%p
        ControlPersist 8h
      
      Host sf-data-api* sf-dispatcher-api* sf-daq*
        Hostname %h
        ProxyJump username@sfbastion
      
  • Sudo rights are needed on databuffer/... servers

  • Clone the sf_databuffer repository and switch to the operation-tools folder
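
Once the prerequisites are in place, a quick connectivity check is worthwhile before running any playbook. A minimal sketch, assuming the inventory defines the databuffer_cluster group used throughout this document:

# every node should answer with SUCCESS / "pong"
ansible databuffer_cluster -m ping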

Checks

Check network load DataBuffer: https://metrics.psi.ch/d/1SL13Nxmz/gfa-linux-tabular?orgId=1&var-env=telegraf_cocos&var-host=sf-daqbuf-21.psi.ch&var-host=sf-daqbuf-22.psi.ch&var-host=sf-daqbuf-23.psi.ch&var-host=sf-daqbuf-24.psi.ch&var-host=sf-daqbuf-25.psi.ch&var-host=sf-daqbuf-26.psi.ch&var-host=sf-daqbuf-27.psi.ch&var-host=sf-daqbuf-28.psi.ch&var-host=sf-daqbuf-29.psi.ch&var-host=sf-daqbuf-30.psi.ch&var-host=sf-daqbuf-31.psi.ch&var-host=sf-daqbuf-32.psi.ch&var-host=sf-daqbuf-33.psi.ch&from=now-6h&to=now&refresh=30s

A healthy network load looks something like this (see image).

If more than 2 machines do not show this pattern, the DataBuffer has an issue.

Check memory usage and network load ImageBuffer: https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1

A healthy memory consumption is between 50GB and 250GB; any drop below 50GB indicates a crash (see image).

Check disk free DataBuffer: http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

Checks on the cluster can be performed via ansible ad-hoc commands:

ansible databuffer_cluster -m shell -a 'uptime'

Check whether the time is synchronized between the machines:

ansible databuffer_cluster -m shell -a 'date +%s.%N'
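
The reported timestamps should agree to well under a second. A minimal sketch to eyeball the spread, filtering out ansible's per-host header lines and printing the smallest and largest value (the ad-hoc command runs host by host, so a fraction of a second of skew is expected even with perfect sync):

ansible databuffer_cluster -m shell -a 'date +%s.%N' | grep -v '|' | sort -n | sed -n '1p;$p'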

Check if the ntp synchronization is enabled and running:

ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
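
If a node reports inactive, or the timestamps above drift apart, chrony's own tracking status shows the measured offset to the reference (chronyc ships with the chrony package that provides chronyd):

ansible databuffer_cluster -b -m shell -a "chronyc tracking"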

Check if the tuned service is running:

ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"

Check the latest 10 lines of the dispatcher node logs:

ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"
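
For a quick per-node count of recent exceptions the same pattern can be combined with grep (a sketch; the trailing '|| true' keeps ansible from flagging nodes with zero matches as failed, since grep -c exits non-zero in that case):

ansible databuffer_cluster -b -m shell -a "journalctl -u daq-dispatcher-node.service --since=-1h | grep -ci exception || true"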

Check for failed compactions:

ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"

Find Sources With Issues

Find sources with bsread-level issues: https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)

The error numbers indicate the following errors:

  1. 0-pulse / 0 globaltime
  2. time out of valid timerange
  3. duplicate pulse-id
  4. pulse-id before last valid pulse-id
  5. duplicate globaltimestamp
  6. globaltimestamp before last globaltimestamp

To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").

The error numbers used there correspond to the connection events that the default filter hides (bsread.error.type values 0, 1 and 2):

  0. receiver connected
  1. receiver stopped
  2. reconnect

Find channels that are received from more than one source: https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))

Find channels that send corrupt MainHeader: https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))

Maintenance

Restart Procedures

The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.

Restart Data Retrieval

If there are issues with data retrieval (DataBuffer, ImageBuffer, EPICS Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This will only affect the data retrieval of SwissFEL at the time of the restart; the recording of data will not be interrupted.

  • change to the operation-tools directory
cd operation-tools
  • call the restart_dataretrieval script
ansible-playbook restart_dataretrieval.yml
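
After the playbook finishes it is worth confirming that the query services came back up. A minimal sketch, assuming the playbook restarts the daq-query-node services as part of the retrieval chain:

ansible databuffer_cluster -b -m shell -a "systemctl is-active daq-query-node.service"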

Restart Data Retrieval All

If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording, but besides SwissFEL this restart will also affect the data retrieval of GLS, HIPA and Proscan!

  • change to the operation-tools directory
cd operation-tools
  • call the restart_dataretrieval_all script
ansible-playbook restart_dataretrieval_all.yml

Restart ImageBuffer

If the DataBuffer looks healthy but the ImageBuffer seems to be in a buggy state, a restart of the ImageBuffer alone can be triggered as follows:

  • log in to sf-lca.psi.ch (sf-lca.psi.ch is the machine in the machine network!)
  • clone the databuffer repository (if you haven't yet) and change to the repository directory, or pull the latest changes
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer
# and/or
git pull
  • stop the sources belonging to the ImageBuffer
./bufferutils stop --backend sf-imagebuffer
  • change to the operation-tools directory and call the restart_imagebuffer script
cd operation-tools
ansible-playbook restart_imagebuffer.yml
  • Afterwards restart the recording of the image sources:
cd ..
./bufferutils upload

Restart DataBuffer Cluster

This is the procedure to follow to restart the DataBuffer in an emergency.

After checking whether the restart is really necessary, do this:

  • log in to sf-lca.psi.ch (sflca is the cluster in the machine network!)
  • clone the databuffer repository (if you haven't yet) and change to the operation-tools directory, or pull the latest changes
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer/operation-tools
# and/or
git pull
  • call the restart_cluster script
ansible-playbook restart_cluster.yml
  • Afterwards restart the recording again:
cd ..
./bufferutils upload
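
A quick post-restart check that both services are active on every node (reusing the ad-hoc pattern from the Checks section):

ansible databuffer_cluster -b -m shell -a "systemctl is-active daq-dispatcher-node.service daq-query-node.service"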

Manual Restart Procedures (Experts Only)

Restart query-node Services

Restart daq-query-node service:

ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"

Important Note: To be able to start the query node processes, the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

Restart dispatcher-node Services

Restart daq-dispatcher-node service:

ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"

This restart should also restart all recordings and reestablish streams. If this causes issues, the automatic recording restart can be disabled by setting dispatcher.local.sources.restart=false in /home/daqusr/.config/daq/dispatcher.properties. Another option is to delete the restart configurations as follows:

ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"

Note: After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

Installation

Prerequisites

To be able to install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details see the top-level README).

Pre-Checks

Make sure that the time is in sync between the machines:

ansible databuffer_cluster -m shell -a 'date +%s.%N'

Check if the ntp synchronization is enabled and running:

ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"

On the ImageBuffer nodes, check that the MTU size of the 25Gb/s interface is set to 9000:

ip link
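
The same check can be run across all nodes at once (a sketch; adjust the host pattern if the ImageBuffer nodes live in a separate inventory group):

# nodes where no interface reports mtu 9000 will show up as FAILED (grep exits non-zero)
ansible databuffer_cluster -m shell -a "ip -o link | grep 'mtu 9000'"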

On the ImageBuffer nodes, test the connection to the camera servers with iperf3. As all camera servers only have a 10Gb/s interface, the overall throughput (SUM) should be around 9Gb/s. When testing against two servers simultaneously, each stream should still show around 9Gb/s.

# Start iperf server on the camera servers
iperf3 -s
# Check speed to different camera servers via
iperf3 -P 3 -c daqsf-sioc-cs-02
iperf3 -P 3 -c daqsf-sioc-cs-31
iperf3 -P 3 -c daqsf-sioc-cs-73
iperf3 -P 3 -c daqsf-sioc-cs-85
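
For the simultaneous test mentioned above, the two clients can simply be started in parallel (a sketch; assumes iperf3 -s is already running on both camera servers):

# run both streams at the same time and wait for both to finish
iperf3 -P 3 -c daqsf-sioc-cs-02 &
iperf3 -P 3 -c daqsf-sioc-cs-31 &
wait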

Check whether the firmware of all servers is on the same level:

ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"
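
To spot firmware outliers at a glance, the per-host output can be summarized (a sketch; the grep drops ansible's per-host header lines so only the reported dates are counted):

ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date" | grep -v '|' | sort | uniq -c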

Check whether the Power Regulator setting in the BIOS is set to Static High Performance Mode! (see documentation/BIOSSettings.png)

Steps

Add daqusr user and daq group on all nodes:

ansible-playbook install_user.yml

Install file and memory limits for the daqusr on all nodes:

ansible-playbook install_limits.yml

Install the current JDK on all nodes:

ansible-playbook install_jdk.yml

Install dispatcher node:

ansible-playbook install_dispatcher_node.yml

Install query node:

ansible-playbook install_query_node.yml

The installation of the dispatcher and query nodes does not start the services; see the restart procedures above for how to start them.

Post-Checks

Check if the tuned service is running:

ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"

Check whether all CPUs are set to performance:

ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"
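
If some CPUs report a different governor, the setting is usually controlled by tuned. A minimal sketch that switches all nodes to a performance-oriented profile (the profile name is an assumption and varies between distributions and local tuned configurations):

# apply a performance-oriented tuned profile on all nodes
ansible databuffer_cluster -b -m shell -a "tuned-adm profile throughput-performance"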