augustin_s/sf_databuffer

forked from archiver_config/sf_databuffer

Simon Ebner 7e525ec76c moved operation tools from databuffer code to here

2020-12-16 10:30:32 +01:00

Overview

This document provides instructions on how to install and operate the data acquisition (DAQ) environment.

The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named sf-daqbuf-21 up to sf-daqbuf-33.

Each node of the cluster currently runs 2 independent services:

daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
daq-query-node.service - used to query data

The machines are installed and managed via ansible scripts and playbooks. To run these scripts, password-less ssh needs to be set up from the machine running them.

Prerequisites

Note: This is not needed if working on a standard GFA machine like sflca!

To execute the steps and commands outlined in this document, the following prerequisites need to be met:

Ansible needs to be installed on your machine

Password-less ssh needs to be set up from your machine

Use ssh-copy-id to copy your public key to the servers
```
ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
```

Have a tunneling directive in your ~/.ssh/config like this:

Host sfbastion
  Hostname sf-gw.psi.ch
  ControlMaster auto
  ControlPath ~/.ssh/mux-%r@%h:%p
  ControlPersist 8h

Host sf-data-api* sf-dispatcher-api* sf-daq*
  Hostname %h
  ProxyJump username@sfbastion

Sudo rights are needed on databuffer/... servers
Clone the ch.psi.daq.databuffer repository and switch to the operation-tools folder
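
Because the public key has to be copied to every node, the ssh-copy-id step above can be wrapped in a loop. The following is a minimal sketch, assuming the node names sf-daqbuf-21 to sf-daqbuf-33 from the overview and the .psi.ch domain used in the ssh-copy-id example; the function name daqbuf_nodes and the your-user placeholder are illustrative and not part of the repository.

```shell
# Print the fully qualified names of the DataBuffer nodes, one per line.
# Assumption: nodes are numbered 21 to 33 without gaps, as stated in the overview.
daqbuf_nodes() {
  i=21
  while [ "$i" -le 33 ]; do
    echo "sf-daqbuf-${i}.psi.ch"
    i=$((i + 1))
  done
}

# Usage (not executed here): copy your key to each node in turn.
# for host in $(daqbuf_nodes); do
#   ssh-copy-id "your-user@${host}"
# done
```

Keeping the host list in one function means the same list can be reused for other per-node checks.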

Checks

A healthy network load looks something like this:

If more than 2 machines do not show this pattern, the DataBuffer has an issue.

Check memory usage and network load ImageBuffer: https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1

A healthy memory consumption is between 50 GB and 250 GB; any drop below 50 GB indicates a crash.
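
The memory rule above can be written down as a small helper, for example to annotate values read off the dashboard. This is only a sketch: the function name mem_status and its messages are hypothetical, and the value is assumed to be a whole number of GB.

```shell
# Classify a memory consumption value (whole GB) against the thresholds above.
mem_status() {
  gb=$1
  if [ "$gb" -lt 50 ]; then
    echo "below 50 GB: possible crash"
  elif [ "$gb" -le 250 ]; then
    echo "healthy"
  else
    echo "above 250 GB: outside the healthy range"
  fi
}

# Usage (not executed here):
# mem_status 120
```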

Check disk free DataBuffer: http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

Checks on the cluster can be performed via ansible ad-hoc commands:

ansible databuffer_cluster -m shell -a 'uptime'

Check whether the time is synchronized between the machines:

ansible databuffer_cluster -m shell -a 'date +%s.%N'
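
Comparing 13 timestamps by eye is error-prone, so the output of the command above can be reduced to a single number. The sketch below assumes the default ad-hoc output, where each host prints a header line followed by its timestamp; the function name time_spread is hypothetical. Note that ansible does not run the command on all hosts at the same instant, so the result includes the execution skew of ansible itself and is only an upper bound on the real clock difference.

```shell
# Read ansible ad-hoc output on stdin, keep only lines that look like
# epoch timestamps (digits.digits), and print max - min in seconds.
time_spread() {
  awk '/^[0-9]+\.[0-9]+$/ {
         if (n == 0 || $1 < min) min = $1
         if (n == 0 || $1 > max) max = $1
         n++
       }
       END { if (n > 0) printf "%.3f\n", max - min }'
}

# Usage (not executed here):
# ansible databuffer_cluster -m shell -a 'date +%s.%N' | time_spread
```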

Check whether NTP synchronization (chronyd) is enabled and running:

ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"

Check if the tuned service is running:

ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"

Check the latest 10 lines of the dispatcher node logs:

ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"

Check for failed compactions

ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"

Find Sources With Issues

Sources with issues can be found like this:

ansible databuffer -m shell -b -a "journalctl -n 5000 -u daq-dispatcher-node.service | grep \" WARN \""

# not used any more
# ansible databuffer -m shell -a "tail -n 5000 /opt/dispatcher_node/latest/logs/data_validation.log | grep \"MainHeader\" | grep -Po \"(?<=')[^.']*(?=')\" | grep tcp | sort | uniq"

To get a detailed report use:

ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 500 | sed -e 's#.*tcp://##' | grep -e '\w* - ' | sort | uniq | wc -l"

To get just the list of sources without the reason, use the following command (send this list to sf-operation with the request that the people responsible for the sources fix them):

ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 5000 | sed -e 's#.*tcp://##' | grep -e '^[^ ]* - ' | sed -e 's# -.*##' | sort | uniq"
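
To see what each stage of the pipeline above does, it can be run locally on a few sample lines. In this sketch the function name extract_sources and the sample log lines in the usage comment are hypothetical; the real dispatcher log format may differ, and only the sed/grep stages are taken from the command above.

```shell
# Reduce log lines on stdin to a sorted, de-duplicated list of sources.
extract_sources() {
  sed -e 's#.*tcp://##' |   # drop everything up to and including "tcp://"
    grep -e '^[^ ]* - ' |   # keep lines of the form "<source> - <reason>"
    sed -e 's# -.*##' |     # drop the reason, leaving only the source
    sort | uniq
}

# Usage with made-up input (not executed here):
# printf '%s\n' \
#   'WARN tcp://EXAMPLE-SOURCE-A:9999 - example reason' \
#   'INFO unrelated line' | extract_sources
```

Lines without a tcp:// address pass through the first sed unchanged and are then dropped by grep, which is why unrelated log lines do not show up in the list.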

Maintenance

Emergency Restart Procedures

The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.

Restart Data Retrieval

If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This only affects SwissFEL data retrieval at the time of the restart; the recording of data is not interrupted.

log in to sflca
clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes

cd operation-tools

call the restart_dataretrieval script

ansible-playbook restart_dataretrieval.yml

Restart Data Retrieval All

If the method above doesn't work, try restarting all data retrieval services via this procedure. This will not interrupt any data recording, but besides SwissFEL the restart also affects the data retrieval of GLS, Hipa and Proscan!

log in to sflca
clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes

cd operation-tools

call the restart_dataretrieval_all script

ansible-playbook restart_dataretrieval_all.yml

Restart DataBuffer Cluster

This is the procedure to follow to restart the DataBuffer in an emergency.

After checking whether the restart is really necessary do this:

log in to sflca (sflca is a cluster in the machine network!)
clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes

git clone git@git.psi.ch:sf_daq/ch.psi.daq.databuffer.git
cd ch.psi.daq.databuffer/operation-tools
# and/or
git pull

call the restart_cluster script

ansible-playbook restart_cluster.yml

Afterwards start the recording again - you need to have cloned the sf_daq_sources git repo:

git clone git@git.psi.ch:sf_config/sf_daq_sources.git
cd sf_daq_sources
git pull
./upload.sh

Full Restart Procedure

Stop recording (via the stop.sh script at https://git.psi.ch/sf_config/sf_daq_sources)

# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# cd "sf_daq_sources"
# git pull
./stop.sh

Stop data-api and dispatcher-api

ansible data_api -b -m shell -a "systemctl stop data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx"

Stop daq-dispatcher-node and daq-query-node services:

ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service"

Remove configurations for local stream/recording restarts:

ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"

Start daq-dispatcher-node and daq-query-node services:

ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service"
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service"

Important Note: It is necessary to bring up the dispatcher node processes first before starting the query node processes!

Start data-api and dispatcher-api

ansible data_api -b -m shell -a "systemctl start data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx"

After starting the dispatcher and query nodes, wait about 5 minutes until all cluster discovery processes have finished and the cluster is up and running.

Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)

# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# cd "sf_daq_sources"
# git pull
./upload.sh
# reload cache
curl -H "Content-Type: application/json" -X POST -d '{"reload": "true"}' https://data-api.psi.ch/sf/channels/config &>/dev/null
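
Because the order of the ansible steps in this procedure matters, it can help to review them as one sequence before running anything. The following is a sketch only, not a script that exists in the repository: the names run, restart_daq_services and DRY_RUN are hypothetical, and by default the commands are printed rather than executed. It covers only the ansible steps; stop.sh, upload.sh, the cache reload and the wait of about 5 minutes still have to be handled by hand.

```shell
# With DRY_RUN=1 (the default here) each command is printed instead of run.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: the ansible steps of the full restart, in order.
# Stops at the first failing step when run with DRY_RUN=0.
restart_daq_services() {
  run ansible data_api -b -m shell -a "systemctl stop data-api.service nginx" || return 1
  run ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx" || return 1
  run ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service" || return 1
  run ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers" || return 1
  # Dispatcher nodes must be up before the query nodes are started.
  run ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service" || return 1
  run ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service" || return 1
  run ansible data_api -b -m shell -a "systemctl start data-api.service nginx" || return 1
  run ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx" || return 1
}

# Usage (not executed here): review first, then run for real.
# restart_daq_services
# DRY_RUN=0 restart_daq_services
```

Defaulting to a dry run is a deliberate choice for this sketch, since the sequence stops services and deletes the local restart configuration.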

Restart ImageBuffer (dispatcher nodes only)

Stop image recording

# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./stop_images.sh

Restart the involved services

ansible imagebuffer -b -m shell -a "systemctl stop daq-dispatcher-node"
ansible imagebuffer -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
ansible imagebuffer --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node"
ansible dispatcher_api -b -m shell -a "systemctl restart dispatcher-api.service nginx"
ansible data_api -b -m shell -a "systemctl restart data-api.service nginx"

Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)

# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./upload.sh

Restart query-node Services

Restart daq-query-node service:

ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"

Important Note: To be able to start the query node processes the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

Restart dispatcher-node Services

Restart daq-dispatcher-node service:

ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"

This restart should also restart all recordings and reestablish streams. If there are issues, this recording restart can be disabled by setting dispatcher.local.sources.restart=false in /home/daqusr/.config/daq/dispatcher.properties. Another option is to delete the restart configurations as follows:

ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"

Note: After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).

Installation

Prerequisites

To install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details, see the top-level Readme).

Pre-Checks

Make sure that the time is in sync across the machines:

ansible databuffer_cluster -m shell -a 'date +%s.%N'

Check whether NTP synchronization (chronyd) is enabled and running:

ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"

On the ImageBuffer nodes check that the MTU size of the 25Gb/s interface is set to 9000

ip link

On the ImageBuffer nodes, test the connection to the camera servers with iperf3. As all the camera servers only have a 10Gb/s interface, the overall throughput (SUM) should be around 9Gb/s. When testing against two servers simultaneously, each stream should show about 9Gb/s.

# Start iperf server on the camera servers
iperf3 -s

# Check speed to different camera servers via
iperf3 -P 3 -c daqsf-sioc-cs-02
iperf3 -P 3 -c daqsf-sioc-cs-31
iperf3 -P 3 -c daqsf-sioc-cs-73
iperf3 -P 3 -c daqsf-sioc-cs-85

Check whether the firmware of all servers is at the same level:

ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"

Check whether the Power Regulator Settings in the BIOS are set to Static High Performance Mode!

Steps

Add daqusr user and daq group on all nodes:

ansible-playbook install_user.yml

Install file and memory limits for the daqusr on all nodes:

ansible-playbook install_limits.yml

Install the current JDK on all nodes:

ansible-playbook install_jdk.yml

Install dispatcher node:

ansible-playbook install_dispatcher_node.yml

Install query node:

ansible-playbook install_query_node.yml

The installation of the dispatcher and query nodes does not start the services. See above how to start them.

Post-Checks

Check if the tuned service is running:

ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"

Check whether all CPUs are set to performance:

ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"
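
The output of the command above is a list of counts per governor for every host. A small filter can turn that into a pass/fail result. This is a sketch under the assumption that the count lines have the usual `uniq -c` shape (a number followed by the governor name); the function name all_performance is hypothetical.

```shell
# Exit 0 only if every "<count> <governor>" line on stdin reports "performance".
# Other lines (for example ansible host headers) are ignored.
all_performance() {
  awk 'NF == 2 && $1 ~ /^[0-9]+$/ {
         n++
         if ($2 != "performance") bad = 1
       }
       END { if (bad || n == 0) exit 1 }'
}

# Usage (not executed here):
# ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c" | all_performance && echo "all CPUs set to performance"
```

The check also fails when no count lines are found at all, so an empty or failed ansible run is not mistaken for a healthy cluster.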
