Overview
This document describes how to install and operate the data acquisition (DAQ) environment.
The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named sf-daqbuf-21 up to sf-daqbuf-33.
Each node of the cluster currently runs 2 independent services:
- daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
- daq-query-node.service - used to query data
The machines are installed and managed via ansible scripts and playbooks. To be able to run these scripts, password-less ssh needs to be set up from the machine running them.
Prerequisites
Note: This is not needed if working on a standard GFA machine like sflca!
To be able to execute the steps and commands outlined in this document, the following prerequisites need to be met:
- Ansible needs to be installed on your machine
- Password-less ssh needs to be set up from your machine
  - Use ssh-copy-id to copy your public key to the servers
    ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
  - Have a tunneling directive in your ~/.ssh/config like this:
    Host sfbastion
        Hostname sf-gw.psi.ch
        ControlMaster auto
        ControlPath ~/.ssh/mux-%r@%h:%p
        ControlPersist 8h

    Host sf-data-api* sf-dispatcher-api* sf-daq*
        Hostname %h
        ProxyJump username@sfbastion
- Sudo rights are needed on databuffer/... servers
- Clone the ch.psi.daq.databuffer repository and switch to the operation-tools folder
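The ssh-copy-id step has to be repeated for every node. A minimal sketch of a helper that prints the node names, assuming the range sf-daqbuf-21 to sf-daqbuf-33 stated in the overview is still current (the function name daqbuf_nodes is made up for this example):

```shell
# Print the DataBuffer node names, one per line.
# Assumption: the nodes are numbered 21..33 as stated in the overview.
daqbuf_nodes() {
    i=21
    while [ "$i" -le 33 ]; do
        echo "sf-daqbuf-$i"
        i=$((i + 1))
    done
}

# Possible usage (not executed here; replace your-user with your account):
#   for h in $(daqbuf_nodes); do ssh-copy-id "your-user@$h.psi.ch"; done
```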
Checks
A healthy network load looks something like this:

If more than 2 machines do not show this pattern, the DataBuffer has an issue.
Check memory usage and network load ImageBuffer: https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1
A healthy memory consumption is between 50GB and 250GB; any drop below 50GB indicates a crash.

Check disk free DataBuffer: http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
Checks on the cluster can be performed via ansible ad-hoc commands:
ansible databuffer_cluster -m shell -a 'uptime'
Check whether the time is synchronized between the machines:
ansible databuffer_cluster -m shell -a 'date +%s.%N'
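Comparing 13 timestamps by eye is error prone. The following sketch reads the output of the command above on stdin and prints the difference between the largest and smallest timestamp in seconds. It assumes the default ad-hoc output format of ansible (a header line per host followed by the command output); time_spread is a made-up name. The hosts are not queried at exactly the same instant, so the result also contains the ansible execution delay and only a clearly large value is meaningful.

```shell
# Read ansible ad-hoc output on stdin and print max - min of all
# lines that look like a `date +%s.%N` timestamp (seconds.nanoseconds).
time_spread() {
    awk '/^[0-9]+\.[0-9]+$/ {
        t = $1 + 0
        if (n == 0 || t < min) min = t
        if (n == 0 || t > max) max = t
        n++
    }
    END { if (n > 0) printf "%.3f\n", max - min }'
}

# Possible usage (not executed here):
#   ansible databuffer_cluster -m shell -a 'date +%s.%N' | time_spread
```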
Check if the ntp synchronization is enabled and running
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
Check if the tuned service is running:
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
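To reduce the output of the is-active checks above to the hosts that need attention, something like the following filter could be used. It is a sketch that assumes the default ansible ad-hoc output format, where each host starts with a header line ending in ">>" followed by the command output; not_active_hosts is a made-up name.

```shell
# Read ansible ad-hoc output of a `systemctl is-active ...` call on stdin
# and print every host whose first output line is not exactly "active".
not_active_hosts() {
    awk '
        />>$/ { host = $1; pending = 1; next }   # header line: remember host
        pending && NF {                          # first non-empty output line
            if ($0 != "active") print host
            pending = 0
        }'
}

# Possible usage (not executed here):
#   ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd" | not_active_hosts
```

An empty result means that every host reported "active".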
Check latest 10 lines of the dispatcher node logs
ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"
Check for failed compactions
ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"
Find Sources With Issues
Sources with issues can be found like this:
ansible databuffer -m shell -b -a "journalctl -n 5000 -u daq-dispatcher-node.service | grep \" WARN \""
# not used any more
# ansible databuffer -m shell -a "tail -n 5000 /opt/dispatcher_node/latest/logs/data_validation.log | grep \"MainHeader\" | grep -Po \"(?<=')[^.']*(?=')\" | grep tcp | sort | uniq"
To get a detailed report use:
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 500 | sed -e 's#.*tcp://##' | grep -e '\w* - ' | sort | uniq | wc -l"
To get just the list of sources without the reason, use the following (send this list to sf-operation with the request that the people responsible for the sources fix them):
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 5000 | sed -e 's#.*tcp://##' | grep -e '^[^ ]* - ' | sed -e 's# -.*##' | sort | uniq"
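The command above sorts and deduplicates per host, so the same source can show up under several hosts. A small sketch for merging the per-host lists into one list before sending it on, assuming the default ansible ad-hoc header lines ("host | STATUS | rc=N >>"); merge_sources is a made-up name:

```shell
# Read ansible ad-hoc output on stdin, drop the per-host header lines and
# empty lines, and print the remaining lines sorted and deduplicated.
merge_sources() {
    grep -Ev '( \| [A-Z]+ \| rc=[0-9]+ >>$|^[[:space:]]*$)' | sort -u
}

# Possible usage (not executed here): pipe the output of the command
# above into merge_sources.
```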
Maintenance
Emergency Restart Procedures
The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.
Restart Data Retrieval
If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This only affects the data retrieval of SwissFEL at the time of the restart; the recording of data is not interrupted.
- log in to sflca
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
cd operation-tools
- call the restart_dataretrieval script
ansible-playbook restart_dataretrieval.yml
Restart Data Retrieval All
If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording, but besides SwissFEL the restart will also affect the data retrieval of GLS, Hipa and Proscan!
- log in to sflca
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
cd operation-tools
- call the restart_dataretrieval_all script
ansible-playbook restart_dataretrieval_all.yml
Restart DataBuffer Cluster
This is the procedure to follow to restart the DataBuffer in an emergency.
After checking whether the restart is really necessary, do this:
- log in to sflca (sflca is a cluster in the machine network!)
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
git clone git@git.psi.ch:sf_daq/ch.psi.daq.databuffer.git
cd ch.psi.daq.databuffer/operation-tools
# and/or
git pull
- call the restart_cluster script
ansible-playbook restart_cluster.yml
- Afterwards start the recording again - you need to have cloned the sf_daq_sources git repo:
git clone git@git.psi.ch:sf_config/sf_daq_sources.git
cd sf_daq_sources
git pull
./upload.sh
Full Restart Procedure
Stop recording (via the stop.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# cd "sf_daq_sources"
# git pull
./stop.sh
Stop data-api and dispatcher-api
ansible data_api -b -m shell -a "systemctl stop data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx"
Stop daq-dispatcher-node and daq-query-node services:
ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service"
Remove configurations for local stream/recording restarts:
ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
Start daq-dispatcher-node and daq-query-node services:
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service"
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service"
Important Note: It is necessary to bring up the dispatcher node processes first before starting the query node processes!
Start data-api and dispatcher-api
ansible data_api -b -m shell -a "systemctl start data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx"
After starting the dispatcher- and query-nodes wait about 5 minutes until all the cluster discovery processes are finished and the cluster is up and running.
Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# cd "sf_daq_sources"
# git pull
./upload.sh
# reload cache
curl -H "Content-Type: application/json" -X POST -d '{"reload": "true"}' https://data-api.psi.ch/sf/channels/config &>/dev/null
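The order of the steps above matters, so it can help to keep them in one script. The following is only a sketch under several assumptions: it prints the commands by default and executes them only when DRY_RUN=0, the checkout location SOURCES_DIR is hypothetical, the 5 minute wait is implemented as a plain sleep placed where the text mentions it, and the curl output is discarded with -s -o /dev/null instead of the shell redirect. The names step and full_restart are made up for this example.

```shell
# Dry run by default: commands are only printed. Set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical location of the sf_daq_sources checkout.
SOURCES_DIR="${SOURCES_DIR:-$HOME/sf_daq_sources}"
STORES=/home/daqusr/.config/daq/stores

# Print a command and, when DRY_RUN=0, run it and report a failure.
step() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

# Run the full restart in the documented order, stopping at the first failure.
full_restart() {
    step sh -c "cd '$SOURCES_DIR' && ./stop.sh" || return 1
    step ansible data_api -b -m shell -a "systemctl stop data-api.service nginx" || return 1
    step ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx" || return 1
    step ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service" || return 1
    step ansible databuffer_cluster -b -m shell -a "rm -rf $STORES/sources; rm -rf $STORES/streamers" || return 1
    # Dispatcher nodes have to be up before the query nodes are started.
    step ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service" || return 1
    step ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service" || return 1
    step ansible data_api -b -m shell -a "systemctl start data-api.service nginx" || return 1
    step ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx" || return 1
    # Give the cluster discovery about 5 minutes to finish.
    step sleep 300 || return 1
    step sh -c "cd '$SOURCES_DIR' && ./upload.sh" || return 1
    step curl -s -o /dev/null -H "Content-Type: application/json" -X POST \
        -d '{"reload": "true"}' https://data-api.psi.ch/sf/channels/config
}
```

Calling full_restart without further settings only lists the 12 commands, which allows a review before running it with DRY_RUN=0.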
Restart ImageBuffer (dispatcher nodes only)
Stop image recording
# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./stop_images.sh
Restart the involved services
ansible imagebuffer -b -m shell -a "systemctl stop daq-dispatcher-node"
ansible imagebuffer -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
ansible imagebuffer --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node"
ansible dispatcher_api -b -m shell -a "systemctl restart dispatcher-api.service nginx"
ansible data_api -b -m shell -a "systemctl restart data-api.service nginx"
Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./upload.sh
Restart query-node Services
Restart daq-query-node service:
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"
Important Note: To be able to start the query node processes the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
Restart dispatcher-node Services
Restart daq-dispatcher-node service:
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"
This restart should also restart all recordings and reestablish streams. If there are issues, this recording restart can be disabled by setting dispatcher.local.sources.restart=false in /home/daqusr/.config/daq/dispatcher.properties. Another option is to delete the restart configurations as follows:
ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
Note: After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
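Setting the dispatcher.local.sources.restart property mentioned above can be scripted. The following sketch replaces an existing key=value line in a properties file or appends it if the key is missing. The helper name set_property is made up, the key is used as a regular expression (so a dot matches any character), and whether the service has to be restarted to pick up the change is not covered here.

```shell
# usage: set_property FILE KEY VALUE
# Replace the line "KEY=..." in FILE, or append "KEY=VALUE" if it is missing.
set_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # Rewrite via a temporary file so that no sed -i variant is needed.
        tmp=$(mktemp) &&
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Possible usage on a node (not executed here):
#   set_property /home/daqusr/.config/daq/dispatcher.properties dispatcher.local.sources.restart false
```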
Installation
Prerequisites
To be able to install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details see the top-level Readme).
Pre-Checks
Make sure that the time is in sync across the machines:
ansible databuffer_cluster -m shell -a 'date +%s.%N'
Check if the ntp synchronization is enabled and running
ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
On the ImageBuffer nodes check that the MTU size of the 25Gb/s interface is set to 9000
ip link
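To read the MTU of a single interface without scanning the whole listing, the one-line output mode of ip (ip -o link) can be filtered. A minimal sketch; mtu_of is a made-up name and the interface name in the usage line is hypothetical, since the name of the 25Gb/s interface is not given here:

```shell
# Read `ip -o link` output on stdin and print the MTU of the interface
# whose name is passed as first argument.
mtu_of() {
    awk -v ifname="$1" '
        $2 == (ifname ":") {
            for (i = 3; i < NF; i++)
                if ($i == "mtu") { print $(i + 1); exit }
        }'
}

# Possible usage (not executed here; eth0 is a placeholder):
#   ip -o link | mtu_of eth0
```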
On the ImageBuffer nodes test the connection to the camera servers with iperf3. As all the camera servers only have a 10Gb/s interface, the overall throughput (SUM) should be around 9Gb/s. When testing connections to two servers simultaneously, each stream should show about 9Gb/s.
# Start iperf server on the camera servers
iperf3 -s
# Check speed to different camera servers via
iperf3 -P 3 -c daqsf-sioc-cs-02
iperf3 -P 3 -c daqsf-sioc-cs-31
iperf3 -P 3 -c daqsf-sioc-cs-73
iperf3 -P 3 -c daqsf-sioc-cs-85
Check whether the firmware of all servers is at the same level:
ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"
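With 13 hosts, differences in the reported dates are easy to miss. The following sketch groups the hosts by the value they report, so a consistent cluster yields exactly one line. It assumes the default ansible ad-hoc output format (a header line ending in ">>" per host) and uses only the first output line of each host; group_hosts_by_value is a made-up name.

```shell
# Read ansible ad-hoc output on stdin and print one line per distinct
# value in the form "value: host1 host2 ...".
group_hosts_by_value() {
    awk '
        />>$/ { host = $1; next }            # header line: remember host
        NF && host != "" {                   # first output line of that host
            hosts[$0] = hosts[$0] " " host
            host = ""
        }
        END { for (v in hosts) print v ":" hosts[v] }
    ' | sort
}

# Possible usage (not executed here):
#   ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date" | group_hosts_by_value
```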
Check whether the Power Regulator setting in the BIOS is set to Static High Performance Mode!

Steps
Add daqusr user and daq group on all nodes:
ansible-playbook install_user.yml
Install file and memory limits for the daqusr on all nodes:
ansible-playbook install_limits.yml
Install the current JDK on all nodes:
ansible-playbook install_jdk.yml
Install dispatcher node:
ansible-playbook install_dispatcher_node.yml
Install query node:
ansible-playbook install_query_node.yml
The installation of the dispatcher and query nodes does not start the services. See above how to start them.
Post-Checks
Check if the tuned service is running:
ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
Check whether all CPUs are set to performance:
ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"