diff --git a/operation-tools/.DS_Store b/operation-tools/.DS_Store
new file mode 100644
index 0000000..a5fec24
Binary files /dev/null and b/operation-tools/.DS_Store differ
diff --git a/operation-tools/Readme.md b/operation-tools/Readme.md
new file mode 100644
index 0000000..a620d0e
--- /dev/null
+++ b/operation-tools/Readme.md
@@ -0,0 +1,372 @@
+# Overview
+
+This document provides instructions on how to install and operate the *data acquisition (daq)* environment.
+
+The DataBuffer currently consists of 13 nodes located in the server room next to the control room in WBGB. The nodes are located in the SwissFEL network and are named _sf-daqbuf-21_ up to _sf-daqbuf-33_.
+
+Each node of the cluster currently runs 2 independent services:
+- daq-dispatcher-node.service - used to dispatch and record data (filestorage-cluster)
+- daq-query-node.service - used to query data
+
+The machines are installed and managed via ansible scripts and playbooks. To be able to run these scripts, password-less ssh needs to be set up from the machine running them.
+
+
+# Prerequisites
+__Note:__ This is not needed if working on a standard GFA machine like sflca!
+
+To be able to execute the steps and commands outlined in this document, the following prerequisites need to be met:
+
+- [Ansible](https://www.ansible.com) needs to be installed on your machine
+- Password-less ssh needs to be set up from your machine
+  - Use `ssh-copy-id` to copy your public key to the servers
+    ```
+    ssh-copy-id your-user@sf-daqbuf-xx.psi.ch
+    ```
+  - Have a tunneling directive in your ~/.ssh/config like this:
+    ```
+    Host sfbastion
+      Hostname sf-gw.psi.ch
+      ControlMaster auto
+      ControlPath ~/.ssh/mux-%r@%h:%p
+      ControlPersist 8h
+
+    Host sf-data-api* sf-dispatcher-api* sf-daq*
+      Hostname %h
+      ProxyJump username@sfbastion
+    ```
+
+- Sudo rights are needed on databuffer/... 
servers
+- Clone the [ch.psi.daq.databuffer](https://git.psi.ch/sf_daq/ch.psi.daq.databuffer) repository and switch to the `operation-tools` folder
+
+
+# Checks
+
+Check network load DataBuffer:
+https://metrics.psi.ch/d/1SL13Nxmz/gfa-linux-tabular?orgId=1&var-env=telegraf_cocos&var-host=sf-daqbuf-21.psi.ch&var-host=sf-daqbuf-22.psi.ch&var-host=sf-daqbuf-23.psi.ch&var-host=sf-daqbuf-24.psi.ch&var-host=sf-daqbuf-25.psi.ch&var-host=sf-daqbuf-26.psi.ch&var-host=sf-daqbuf-27.psi.ch&var-host=sf-daqbuf-28.psi.ch&var-host=sf-daqbuf-29.psi.ch&var-host=sf-daqbuf-30.psi.ch&var-host=sf-daqbuf-31.psi.ch&var-host=sf-daqbuf-32.psi.ch&var-host=sf-daqbuf-33.psi.ch&from=now-6h&to=now&refresh=30s
+
+A healthy network load looks something like this:
+![image](documentation/DataBufferNetworkLoad.png)
+
+If more than 2 machines do not show this pattern, the DataBuffer has an issue.
+
+Check memory usage and network load ImageBuffer:
+https://hpc-monitor02.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1
+
+A healthy memory consumption is between 50GB and 250GB; any drop below 50GB indicates a crash. 
+![image](documentation/ImageBufferMemory.png)
+
+
+
+Check disk free DataBuffer:
+http://gmeta00.psi.ch/?r=hour&cs=&ce=&c=sf-daqbuf&h=&tab=m&vn=&hide-hf=false&m=disk_free_percent_data&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
+
+
+
+Checks on the cluster can be performed via ansible ad-hoc commands:
+```bash
+ansible databuffer_cluster -m shell -a 'uptime'
+```
+
+Check whether the clocks are synchronized between the machines:
+```bash
+ansible databuffer_cluster -m shell -a 'date +%s.%N'
+```
+
+Check if the ntp synchronization is enabled and running:
+```bash
+ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
+ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
+```
+
+Check if the tuned service is running:
+```bash
+ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
+```
+
+Check the latest 10 lines of the dispatcher node logs:
+```bash
+ansible databuffer_cluster -b -m shell -a "journalctl -n 10 -u daq-dispatcher-node.service"
+```
+
+Check for failed compactions:
+```bash
+ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service | grep \"Exception while compacting\" | grep -oP \"\\'\K[^\\']+\" | sort | uniq"
+```
+
+
+## Find Sources With Issues
+
+Sources with issues can be found like this:
+
+```bash
+ansible databuffer -m shell -b -a "journalctl -n 5000 -u daq-dispatcher-node.service | grep \" WARN \""
+
+# not used any more
+# ansible databuffer -m shell -a "tail -n 5000 /opt/dispatcher_node/latest/logs/data_validation.log | grep \"MainHeader\" | grep -Po \"(?<=')[^.']*(?=')\" | grep tcp | sort | uniq"
+```
+
+To get a detailed report use:
+```bash
+ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 500 | sed -e 's#.*tcp://##' | grep -e '\w* - ' | sort | uniq | wc -l"
+```
+
+To get just the list of sources without the reason, use the following (send this list to sf-operation with the request that those responsible for the sources fix them): 
+```bash
+ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 5000 | sed -e 's#.*tcp://##' | grep -e '^[^ ]* - ' | sed -e 's# -.*##' | sort | uniq"
+```
+
+
+# Maintenance
+
+## Emergency Restart Procedures
+The following restart procedures are meant for emergency interventions on the SwissFEL DAQ system.
+
+### Restart Data Retrieval
+If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This will only affect the data retrieval of SwissFEL at the time of restart; there will be no interruption in the recording of the data.
+
+- log in to sflca
+- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
+```bash
+cd operation-tools
+```
+
+- call the restart_dataretrieval script
+```bash
+ansible-playbook restart_dataretrieval.yml
+```
+
+### Restart Data Retrieval All
+If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording __but this restart will, besides SwissFEL, also affect the data retrieval of GLS, Hipa and Proscan__!
+
+- log in to sflca
+- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
+```bash
+cd operation-tools
+```
+
+- call the restart_dataretrieval_all script
+```bash
+ansible-playbook restart_dataretrieval_all.yml
+```
+
+
+### Restart DataBuffer Cluster
+This is the procedure to follow to restart the DataBuffer in an emergency. 
+
+After checking whether the restart is really necessary, do this:
+
+- log in to sflca (_sflca is a cluster in the machine network !!!!_)
+- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
+
+```bash
+git clone git@git.psi.ch:sf_daq/ch.psi.daq.databuffer.git
+cd ch.psi.daq.databuffer/operation-tools
+# and/or
+git pull
+```
+
+- call the restart_cluster script
+```bash
+ansible-playbook restart_cluster.yml
+```
+
+- Afterwards start the recording again (you need to have cloned the sf_daq_sources git repo):
+```bash
+git clone git@git.psi.ch:sf_config/sf_daq_sources.git
+cd sf_daq_sources
+git pull
+./upload.sh
+```
+
+
+## Full Restart Procedure
+
+Stop recording (via the stop.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
+```bash
+# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
+# git pull
+# cd "sf_daq_sources"
+./stop.sh
+```
+
+Stop data-api and dispatcher-api
+```bash
+ansible data_api -b -m shell -a "systemctl stop data-api.service nginx"
+ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx"
+```
+
+Stop daq-dispatcher-node and daq-query-node services:
+```bash
+ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service"
+```
+
+Remove configurations for local stream/recording restarts:
+```bash
+ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
+```
+
+Start daq-dispatcher-node and daq-query-node services:
+```bash
+ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service"
+ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service"
+```
+__Important Note:__ It is necessary to bring up the dispatcher node processes first before starting the query node processes! 
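The ordering constraint in the note above can be made explicit in a small wrapper. The following is a minimal sketch and not part of the repository: the function name `start_daq_services` and the `RUN` override are hypothetical, while the ansible arguments are the ones used in the commands above. It starts the dispatcher nodes first and only continues with the query nodes if that step succeeded.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): start the services in the required order.
# RUN is an assumed override for dry runs; RUN=echo prints the commands instead of running them.
start_daq_services() {
    local run="${RUN:-ansible}"
    local service
    # The dispatcher nodes must come up before the query nodes.
    for service in daq-dispatcher-node.service daq-query-node.service; do
        # --forks 1 starts the service on one host at a time, as in the commands above.
        if ! "$run" databuffer_cluster --forks 1 -b -m shell -a "systemctl start ${service}"; then
            echo "Starting ${service} failed, stopping here." >&2
            return 1
        fi
    done
}
```

A dry run with `RUN=echo start_daq_services` prints the two ansible commands in order without touching the cluster.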
+
+Start data-api and dispatcher-api
+```bash
+ansible data_api -b -m shell -a "systemctl start data-api.service nginx"
+ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx"
+```
+
+After starting the dispatcher- and query-nodes wait about 5 minutes until all the cluster discovery processes are finished and the cluster is up and running.
+
+Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
+```bash
+# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
+# git pull
+# cd "sf_daq_sources"
+./upload.sh
+# reload cache
+curl -H "Content-Type: application/json" -X POST -d '{"reload": "true"}' https://data-api.psi.ch/sf/channels/config &>/dev/null
+```
+
+
+## Restart ImageBuffer (dispatcher nodes only)
+
+Stop image recording
+```bash
+# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
+./stop_images.sh
+```
+
+Restart the involved services
+```bash
+ansible imagebuffer -b -m shell -a "systemctl stop daq-dispatcher-node"
+ansible imagebuffer -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
+ansible imagebuffer --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node"
+ansible dispatcher_api -b -m shell -a "systemctl restart dispatcher-api.service nginx"
+ansible data_api -b -m shell -a "systemctl restart data-api.service nginx"
+```
+
+Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
+```bash
+# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
+./upload.sh
+```
+
+
+## Restart query-node Services
+
+Restart daq-query-node service:
+```bash
+ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"
+```
+
+__Important Note:__ To be able to start the query node processes the dispatcher nodes need to be up and running! 
After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
+
+
+## Restart dispatcher-node Services
+
+Restart daq-dispatcher-node service:
+```bash
+ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"
+```
+
+This restart should also restart all recordings and reestablish streams. If there are issues, this recording restart can be disabled by setting dispatcher.local.sources.restart=false in /home/daqusr/.config/daq/dispatcher.properties. Another option is to delete the restart configurations as follows:
+```bash
+ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
+```
+
+__Note:__ After restarting all dispatcher nodes you have to restart the dispatcher-api service as well. A single restart of a Dispatcher Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
+
+
+# Installation
+
+## Prerequisites
+
+To be able to install a new version of the daq system, the binaries need to be built and available in the Maven repository (for details see the top-level [Readme](../Readme.md)).
+
+## Pre-Checks
+
+Make sure that the time is in sync across the machines:
+```bash
+ansible databuffer_cluster -m shell -a 'date +%s.%N'
+```
+
+Check if the ntp synchronization is enabled and running:
+```bash
+ansible databuffer_cluster -b -m shell -a "systemctl is-enabled chronyd"
+ansible databuffer_cluster -b -m shell -a "systemctl is-active chronyd"
+```
+
+On the ImageBuffer nodes check that the MTU size of the 25Gb/s interface is set to 9000:
+```bash
+ip link
+```
+
+On the ImageBuffer nodes test the connection to camera servers with iperf3. As all the camera servers only have a 10Gb/s interface the overall throughput (SUM) should be around 9Gb/s. 
Testing the connection to two servers simultaneously should show 9Gb/s for each stream.
+```
+# Start iperf server on the camera servers
+iperf3 -s
+```
+
+```
+# Check speed to different camera servers via
+iperf3 -P 3 -c daqsf-sioc-cs-02
+iperf3 -P 3 -c daqsf-sioc-cs-31
+iperf3 -P 3 -c daqsf-sioc-cs-73
+iperf3 -P 3 -c daqsf-sioc-cs-85
+```
+
+Check whether the firmware of all servers is at the same level:
+```bash
+ansible databuffer_cluster -b -m shell -a "dmidecode -s bios-release-date"
+```
+
+Check whether the Power Regulator setting in the BIOS is set to __Static High Performance Mode__ !
+![documentation/BIOSSettings.png](documentation/BIOSSettings.png)
+
+## Steps
+
+Add daqusr user and daq group on all nodes:
+```bash
+ansible-playbook install_user.yml
+```
+
+Install file and memory limits for the _daqusr_ on all nodes:
+```bash
+ansible-playbook install_limits.yml
+```
+
+Install the current JDK on all nodes:
+```bash
+ansible-playbook install_jdk.yml
+```
+
+Install dispatcher node:
+```bash
+ansible-playbook install_dispatcher_node.yml
+```
+
+Install query node:
+```bash
+ansible-playbook install_query_node.yml
+```
+
+The installation of the dispatcher and query nodes does not start the services. See above how to start them. 
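Because the playbooks above run in a fixed order, it can help to validate all of them before the first one changes anything. The sketch below is an illustration only: `install_all` and the `RUN` override are hypothetical names, the playbook list is taken from the steps above, and `--syntax-check` is the `ansible-playbook` option that parses a playbook without executing it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): validate, then run the install playbooks in order.
# RUN is an assumed override for dry runs; RUN=echo prints the commands instead of running them.
install_all() {
    local run="${RUN:-ansible-playbook}"
    local playbooks="install_user.yml install_limits.yml install_jdk.yml install_dispatcher_node.yml install_query_node.yml"
    local playbook
    # First pass: parse every playbook without executing it.
    for playbook in $playbooks; do
        "$run" --syntax-check "$playbook" || return 1
    done
    # Second pass: run the playbooks in the documented order, stopping at the first failure.
    for playbook in $playbooks; do
        "$run" "$playbook" || return 1
    done
}
```

The repository's `install_cluster.yml` includes the same five playbooks in the same order, so this wrapper only adds the up-front syntax check.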
+
+## Post-Checks
+
+Check if the tuned service is running:
+```bash
+ansible databuffer_cluster -b -m shell -a "systemctl is-active tuned"
+```
+
+Check whether all CPUs are set to performance:
+```bash
+ansible databuffer_cluster -b -m shell -a "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c"
+```
+
+
+
+
diff --git a/operation-tools/Readme_toreview.docx b/operation-tools/Readme_toreview.docx
new file mode 100644
index 0000000..a5c9db8
Binary files /dev/null and b/operation-tools/Readme_toreview.docx differ
diff --git a/operation-tools/Readme_toreview.md b/operation-tools/Readme_toreview.md
new file mode 100644
index 0000000..4c9902b
--- /dev/null
+++ b/operation-tools/Readme_toreview.md
@@ -0,0 +1,247 @@
+# Troubleshooting
+
+Find sources with issues in the validation log:
+```
+tail -n 10000 /opt/dispatcher_node/latest/logs/data_validation.log | grep "Invalid and drop" | awk -e '{print($4)}' | sort | uniq
+```
+
+# Overview
+The following are the steps to set up the imagebuffer from scratch:
+
+- Install Users
+```bash
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do cat user_settings.sh ../hostlists_daqbufs/env_settings.sh add_user.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+```
+
+- Install Java
+```bash
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do cat user_settings.sh ../hostlists_daqbufs/env_settings.sh install_java.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+```
+
+- Install Query Nodes - ImageBuffer
+```bash
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do cat user_settings.sh ../hostlists_daqbufs/env_settings.sh install_query_node.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+```
+
+- Install Dispatcher Nodes ImageBuffer
+```bash
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do cat user_settings.sh ../hostlists_daqbufs/env_settings.sh install_dispatcher_node.sh | ssh -i ${HOME}/.ssh/id_rsa_daq 
root@${THE_HOST} ; done
+```
+
+- Create Required Config File
+```
+[root@sf-daq-5 /]# cat /home/daqusr/.config/daq/domain.properties
+backend.default=sf-imagebuffer
+
+chown -R daqusr:daq /home/daqusr/.config/daq/domain.properties
+chown -R daqusr:daq /home/daqusr/.config/daq/dispatcher.properties
+```
+
+
+- Restart Services (if needed)
+```
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do echo -e "systemctl stop daq-query-node.service" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do echo -e "systemctl stop daq-dispatcher-node.service" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+```
+
+```
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do echo -e "hostname \n systemctl start daq-dispatcher-node.service \n sleep 20" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+for THE_HOST in $(sort -u ../hostlists_daqbufs/ImageBufferHosts.txt); do echo -e "hostname \n systemctl start daq-query-node.service \n sleep 10" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done
+
+```
+
+
+Monitoring of the system is available via telegraf and grafana:
+https://hpc-monitor01.psi.ch/d/TW0pr_bik/gl2?refresh=30s&orgId=1
+
+
+----
+TODO CONVERT DOCUMENTATION
+
+## Dispatcher
+
+
+
+### Dispatcher Nodes
+
+#### Install
+1. Go to ch.psi.daq.buildall and execute: `./gradlew dropItDispatcherNode -x test`
+2. Log in to the master node and follow [these instructions](Readme.md#clone_git) to set up the git environment.
+3. 
Multihost command: `for THE_HOST in $(sort -u ../hostlists/DispatcherNodeHosts.txt); do cat user_settings.sh ../hostlists/env_settings.sh install_dispatcher_node.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+#### De-Install
+
+Multihost command: `for THE_HOST in $(sort -u ../hostlists/DispatcherNodeHosts.txt); do echo -e "systemctl stop daq-dispatcher-node.service \n systemctl disable daq-dispatcher-node.service \n rm /usr/lib/systemd/system/daq-dispatcher-node.service \n rm -rf /opt/dispatcher_node \n systemctl daemon-reload" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+
+
+### Dispatcher REST Server
+
+#### Install
+1. Go to ch.psi.daq.buildall and execute: `./gradlew dropItDispatcherREST -x test`
+2. Log in to the master node and follow [these instructions](Readme.md#clone_git) to set up the git environment.
+3. Multihost command: `for THE_HOST in $(sort -u ../hostlists/DispatcherRESTHost.txt); do cat user_settings.sh ../hostlists/env_settings.sh install_dispatcher_rest.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+4. Check if the UI is running by using a browser: http://sf-nube-13.psi.ch:8080/
+
+#### De-Install
+
+Multihost command: `for THE_HOST in $(sort -u ../hostlists/DispatcherRESTHost.txt); do echo -e "systemctl stop daq-dispatcher-rest.service \n systemctl disable daq-dispatcher-rest.service \n rm /usr/lib/systemd/system/daq-dispatcher-rest.service \n rm -rf /opt/dispatcher_rest \n systemctl daemon-reload" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+
+
+## Querying
+
+
+
+### Query Nodes
+
+#### Install
+1. Go to ch.psi.daq.buildall and execute: `./gradlew dropItQueryNode -x test`
+2. Log in to the master node and follow [these instructions](Readme.md#clone_git) to set up the git environment.
+3. 
Multihost command: `for THE_HOST in $(sort -u ../hostlists/QueryNodeHosts.txt); do cat user_settings.sh ../hostlists/env_settings.sh install_query_node.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+#### De-Install
+
+Multihost command: `for THE_HOST in $(sort -u ../hostlists/QueryNodeHosts.txt); do echo -e "systemctl stop daq-query-node.service \n systemctl disable daq-query-node.service \n rm /usr/lib/systemd/system/daq-query-node.service \n rm -rf /opt/query_node \n systemctl daemon-reload" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+
+
+
+### Query REST Server
+
+#### Install
+1. Go to ch.psi.daq.buildall and execute: `./gradlew dropItQueryREST -x test`
+2. Log in to the master node and follow [these instructions](Readme.md#clone_git) to set up the git environment.
+3. Multihost command: `for THE_HOST in $(sort -u ../hostlists/QueryRESTHost.txt); do cat user_settings.sh ../hostlists/env_settings.sh install_query_rest.sh | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+#### De-Install
+
+Multihost command: `for THE_HOST in $(sort -u ../hostlists/QueryRESTHost.txt); do echo -e "systemctl stop daq-query-rest.service \n systemctl disable daq-query-rest.service \n rm /usr/lib/systemd/system/daq-query-rest.service \n rm -rf /opt/query_rest \n systemctl daemon-reload" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+
+## DAQLocal
+
+
+
+### DAQLocal
+
+#### Install
+1. Go to ch.psi.daq.buildall and execute: `./gradlew dropItDAQLocal -x test`
+2. Log in to the master node and follow [these instructions](Readme.md#clone_git) to set up the git environment.
+3. 
Multihost command: `cat user_settings.sh ../hostlists/env_settings.sh ../hostlists/stream_sources.sh install_daqlocal.sh | bash`
+
+#### De-Install
+
+Multihost command: `systemctl stop daq-daqlocal.service; systemctl disable daq-daqlocal.service; rm -rf /usr/lib/systemd/system/daq-daqlocal.service; rm -rf /opt/daqlocal`
+
+
+## Helpful Commands
+
+
+Dispatcher Node service:
+`for THE_HOST in $(sort -u ../hostlists/*Host*.txt); do echo -e "echo -e '\n\nHOST:${THE_HOST}' && ls /usr/lib/systemd/system/daq-dispatcher-node* | xargs -n1 basename | xargs -n1 systemctl stop" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+### Miscellaneous
+
+Remove log files:
+`for THE_HOST in $(sort -u ../hostlists/*Host*.txt); do echo -e "find /data_meta -name '*.log*' | grep 'logs' | xargs rm" | ssh -i ${HOME}/.ssh/id_rsa_daq root@${THE_HOST} ; done`
+
+
+Monitor log messages
+
+systemd:
+`journalctl -f -u daq-dispatcher-rest.service`
+
+`journalctl --since=today -u daq-dispatcher-rest.service`
+
+CPU/Disk usage:
+`dstat -d -D sdb1,sda5,total -cm -n`
+
+
+
+### Docker Issues
+
+```
+systemctl stop docker
+rm -rf /var/lib/docker
+systemctl start docker
+systemctl start nginx
+```
+
+### Modify Logging
+
+1. Modify logback-server.xml (e.g. in /opt/dispatcher_node/latest/lib/)
+2. Run JConsole
+   a. /usr/java/latest/bin/jconsole
+   b. /usr/java/latest/bin/jconsole and use `Remote Process` (application needs to be started with `-Dcom.sun.management.jmxremote.port=3334 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false`)
+      a. localhost:3334 for DispatcherNode
+      b. localhost:3335 for DispatcherRest
+      c. localhost:3336 for QueryNode
+      d. localhost:3337 for QueryRest.
+3. Go to ch.qos.logback.classic -> ... -> Operations
+4. Press `reloadDefaultConfiguration`
+
+### Profiling
+
+1. Run `/usr/java/latest/bin/jvisualvm` (use `/usr/java/latest/bin/jvisualvm -J-Dnetbeans.logger.console=true` for debugging).
+2. 
Add `JMX Connection` (see [here](Readme.md#modify_logging) for `hostname:port` settings)
+
+Note: You might need to run `yum install xorg-x11-xauth libXtst`
+
+
+### Folder Crawler
+
+The NAS system takes a very long time to list folders. The current workaround is to use a folder crawler that periodically lists the folder structure and thus keeps it in the cache.
+
+1. `mkdir /home/daqusr/scripts && cp ../scripts/folder_crawler.sh /home/daqusr/scripts && chown -R daqusr:daq /home/daqusr/scripts`
+2. `cp ../hostlists/systemd/folder-crawler.service /etc/systemd/system/ && systemctl enable folder-crawler.service && systemctl daemon-reload`
+3. `cp ../hostlists/systemd/folder-crawler.timer /etc/systemd/system/ && systemctl enable folder-crawler.timer && systemctl daemon-reload && systemctl start folder-crawler.timer`
+
+
+
+
+## Maintenance Utils
+
+### Find Largest Files
+
+`find /data/sf-databuffer/daq_swissfel/daq_swissfel_3 -mtime +5 -printf "%s %n %m %u %g %t %p" \( -type l -printf ' -> %l\n' -o -printf '\n' \) | sort -k1,1 -n`
+
+### Count Disk Usage
+
+`find /data/sf-databuffer/daq_swissfel/daq_swissfel_3 -type f -mtime +5 -printf '%s\n' | awk '{a+=$1;} END {printf "%.1f GB\n", a/2**30;}'`
+
+`find /data/sf-databuffer/daq_swissfel/daq_swissfel_3 -type f -newerct "2000-01-01" ! -newerct "2018-06-27 23:00" -printf '%s\n' | awk '{a+=$1;} END {printf "%.1f GB\n", a/2**30;}'`
+
+### Delete Specific Files (do not forget -empty if needed!!!)
+
+`find /data/sf-databuffer/daq_swissfel/daq_swissfel_3 -type f -mtime +5 -regextype sed -regex '.*LOSS_SIGNAL_RAW.*' -delete`
+
+`find /data/sf-databuffer/daq_swissfel/daq_swissfel_3 -type d -empty -delete`
+
+`find /gpfs/sf-data/sf-imagebuffer/daq_swissfel/daq_swissfel_4 -type f -newerct "2000-01-01" ! 
-newerct "2018-06-25 23:00" -delete`
+
+`find /gpfs/sf-data/sf-imagebuffer/daq_swissfel/daq_swissfel_4 -type d -empty -delete`
+
+### Delete empty files in parallel
+
+```bash
+# use 32 threads
+find /gls_data/gls-archive/daq_local/daq_local_*/byTime/ -maxdepth 1 | tail -n +2 | xargs -I {} -P 32 -n 1 find {} -type f -empty -delete
+```
+
+### Parallel rsync
+
+```bash
+# see: https://stackoverflow.com/a/46611168
+
+# SETUP OPTIONS
+export SRCDIR="/home/maerki_f/Downloads/rsync_test/.snapshot/data/test"
+# export SRCDIR="/gls_data/.snapshot/daily.2018-07-18_0010/gls-archive/daq_local/daq_local_2/byTime"
+export DESTDIR="/home/maerki_f/Downloads/rsync_test/data/test"
+# export DESTDIR="/gls_data/gls-archive/daq_local/daq_local_2/byTime"
+
+
+# use 32 threads
+ls -1 $SRCDIR | xargs -I {} -P 32 -n 1 rsync -auvh --progress $SRCDIR/{} $DESTDIR/
+```
diff --git a/operation-tools/ansible.cfg b/operation-tools/ansible.cfg
new file mode 100755
index 0000000..519228b
--- /dev/null
+++ b/operation-tools/ansible.cfg
@@ -0,0 +1,10 @@
+[defaults]
+# this controls whether an Ansible playbook should prompt for a sudo password by default when sudoing. 
Default: False
+#ask_sudo_pass=True
+
+inventory=inventories/sf
+host_key_checking = False
+
+[ssh_connection]
+# ssh_args = -F ./ssh.cfg -o
+control_path_dir=/tmp/.ansible-${USER}
diff --git a/operation-tools/documentation/BIOSSettings.png b/operation-tools/documentation/BIOSSettings.png
new file mode 100644
index 0000000..8ee19d5
Binary files /dev/null and b/operation-tools/documentation/BIOSSettings.png differ
diff --git a/operation-tools/documentation/DataBufferNetworkLoad.png b/operation-tools/documentation/DataBufferNetworkLoad.png
new file mode 100644
index 0000000..162eb5f
Binary files /dev/null and b/operation-tools/documentation/DataBufferNetworkLoad.png differ
diff --git a/operation-tools/documentation/ImageBufferMemory.png b/operation-tools/documentation/ImageBufferMemory.png
new file mode 100644
index 0000000..7880a1d
Binary files /dev/null and b/operation-tools/documentation/ImageBufferMemory.png differ
diff --git a/operation-tools/fetch_validation_logs.yml b/operation-tools/fetch_validation_logs.yml
new file mode 100644
index 0000000..1362f3d
--- /dev/null
+++ b/operation-tools/fetch_validation_logs.yml
@@ -0,0 +1,13 @@
+- hosts: databuffer_cluster
+  become: true
+  tasks:
+    - name: Specifying a path directly
+      fetch:
+        src: /opt/dispatcher_node/latest/logs/data_validation.log.1.zip
+        dest: /tmp/daq/prefix-{{ inventory_hostname }}
+        flat: yes
+#    - name: Fetch stuff from the remote and save to local
+#      synchronize: src={{ item }} dest=/tmp/daq mode=pull
+#      with_items:
+#        - "/opt/dispatcher_node/latest/logs/data_validation.log"
+#        - "/opt/dispatcher_node/latest/logs/data_validation.log.1.zip"
diff --git a/operation-tools/install_cluster.yml b/operation-tools/install_cluster.yml
new file mode 100644
index 0000000..1c16255
--- /dev/null
+++ b/operation-tools/install_cluster.yml
@@ -0,0 +1,10 @@
+# ensure that autofs is disabled
+# that there is no afs
+
+- include: install_user.yml
+- include: install_limits.yml
+- include: install_jdk.yml
+- include: 
install_dispatcher_node.yml
+- include: install_query_node.yml
+
+# a restart of the machine is required for the limits to be applied
diff --git a/operation-tools/install_dispatcher_node.yml b/operation-tools/install_dispatcher_node.yml
new file mode 100644
index 0000000..352daf3
--- /dev/null
+++ b/operation-tools/install_dispatcher_node.yml
@@ -0,0 +1,67 @@
+- hosts: databuffer_cluster
+  become: true
+  vars:
+    dispatcher_node_version: 1.14.13
+    binaries_install_dir: /opt/databuffer
+  tasks:
+    - name: Create deployment directory - dispatcher_node
+      file:
+        path: "{{binaries_install_dir}}/lib"
+        owner: daqusr
+        group: daq
+        state: directory
+    - name: Download app binary
+      get_url:
+        url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/dispatchernode/{{dispatcher_node_version}}/dispatchernode-{{dispatcher_node_version}}-all.jar
+        dest: "{{binaries_install_dir}}/lib/"
+        owner: daqusr
+        group: daq
+
+    # Deploy systemd unit file for dispatchernode
+    - template:
+        src: templates/daq-dispatcher-node.service.j2
+        dest: /etc/systemd/system/daq-dispatcher-node.service
+    - name: Reload systemd unit files
+      systemd:
+        daemon_reload: yes
+    - name: Make sure the tuned service is enabled and started
+      systemd:
+        enabled: yes
+        state: started
+        name: tuned
+    - name: Make sure the daq-dispatcher-node is enabled
+      systemd:
+        enabled: yes
+        name: daq-dispatcher-node
+
+- hosts: imagebuffer
+  become: true
+  tasks:
+    - name: Creates configuration directory
+      file:
+        path: /home/daqusr/.config/daq
+        owner: daqusr
+        group: daq
+        state: directory
+    - template:
+        src: templates/domain.properties.j2
+        dest: /home/daqusr/.config/daq/domain.properties
+        owner: daqusr
+        group: daq
+        mode: '644'
+
+- hosts: databuffer
+  become: true
+  tasks:
+    - name: Creates configuration directory
+      file:
+        path: /home/daqusr/.config/daq
+        owner: daqusr
+        group: daq
+        state: directory
+    - template:
+        src: templates/domain.properties.j2
+        dest: 
/home/daqusr/.config/daq/domain.properties
+        owner: daqusr
+        group: daq
+        mode: '644'
diff --git a/operation-tools/install_dispatcher_node_jdk8.yml b/operation-tools/install_dispatcher_node_jdk8.yml
new file mode 100644
index 0000000..ce5b3bb
--- /dev/null
+++ b/operation-tools/install_dispatcher_node_jdk8.yml
@@ -0,0 +1,66 @@
+- hosts: databuffer_cluster
+  become: true
+  vars:
+    dispatcher_node_version: 1.14.8
+  tasks:
+    - name: Creates deployment directory
+      file:
+        path: /opt/dispatcher_node/latest/lib
+        owner: daqusr
+        group: daq
+        state: directory
+    - name: Download app binary
+      get_url:
+        url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/dispatchernode/{{dispatcher_node_version}}/dispatchernode-{{dispatcher_node_version}}-all.jar
+        dest: /opt/dispatcher_node/latest/lib/
+        owner: daqusr
+        group: daq
+
+    # Deploy systemd unit file for dispatchernode
+    - template:
+        src: templates/daq-dispatcher-node.service_jdk8.j2
+        dest: /etc/systemd/system/daq-dispatcher-node.service
+    - name: Reload systemd unit files
+      systemd:
+        daemon_reload: yes
+    - name: Make sure the tuned service is enabled and started
+      systemd:
+        enabled: yes
+        state: started
+        name: tuned
+    - name: Make sure the daq-dispatcher-node is enabled
+      systemd:
+        enabled: yes
+        name: daq-dispatcher-node
+
+- hosts: imagebuffer
+  become: true
+  tasks:
+    - name: Creates configuration directory
+      file:
+        path: /home/daqusr/.config/daq
+        owner: daqusr
+        group: daq
+        state: directory
+    - template:
+        src: templates/imagebuffer_domain.properties
+        dest: /home/daqusr/.config/daq/domain.properties
+        owner: daqusr
+        group: daq
+        mode: '644'
+
+- hosts: databuffer
+  become: true
+  tasks:
+    - name: Creates configuration directory
+      file:
+        path: /home/daqusr/.config/daq
+        owner: daqusr
+        group: daq
+        state: directory
+    - template:
+        src: templates/databuffer_domain.properties
+        dest: /home/daqusr/.config/daq/domain.properties
+        owner: daqusr
+        group: daq
+        mode: '644'
diff --git 
a/operation-tools/install_elastic.yml b/operation-tools/install_elastic.yml
new file mode 100644
index 0000000..53f881f
--- /dev/null
+++ b/operation-tools/install_elastic.yml
@@ -0,0 +1,57 @@
+- hosts: imagebuffer
+  become: true
+  tasks:
+    - name: Install https certificate
+      template:
+        src: templates/elastic-stack-ca.pem
+        dest: /etc/pki/tls/certs/elastic-stack-ca.pem
+
+    - name: Install journalbeat
+      yum:
+        name: https://artifacts.elastic.co/downloads/beats/journalbeat/journalbeat-7.3.2-x86_64.rpm
+        state: present
+    - name: Install journalbeat configuration
+      template:
+        src: templates/journalbeat.yml
+        dest: /etc/journalbeat/journalbeat.yml
+
+#    - name: Install auditbeat
+#      yum:
+#        name: https://artifacts.elastic.co/downloads/beats/auditbeat/auditbeat-7.3.2-x86_64.rpm
+#        state: present
+#    - name: Install auditbeat configuration
+#      template:
+#        src: templates/auditbeat.yml
+#        dest: /etc/auditbeat/auditbeat.yml
+#
+#    - name: Install metricbeat
+#      yum:
+#        name: https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.3.2-x86_64.rpm
+#        state: present
+#    - name: Install metricbeat configuration
+#      template:
+#        src: templates/metricbeat.yml
+#        dest: /etc/metricbeat/metricbeat.yml
+#    - name: Install metricbeat system.yml configuration
+#      template:
+#        src: templates/system.yml
+#        dest: /etc/metricbeat/modules.d/system.yml
+
+    - name: Reload systemd unit files
+      systemd:
+        daemon_reload: yes
+    - name: Enable and start journalbeat
+      systemd:
+        enabled: yes
+        state: restarted
+        name: journalbeat
+#    - name: Enable and start metricbeat
+#      systemd:
+#        enabled: yes
+#        state: restarted
+#        name: metricbeat
+#    - name: Enable and start auditbeat
+#      systemd:
+#        enabled: yes
+#        state: restarted
+#        name: auditbeat
diff --git a/operation-tools/install_elastic_heartbeat.yml b/operation-tools/install_elastic_heartbeat.yml
new file mode 100644
index 0000000..56ea291
--- /dev/null
+++ b/operation-tools/install_elastic_heartbeat.yml
@@ -0,0 +1,66 @@
+- hosts: 
dispatcher_api_office + become: true + tasks: + - name: Install https certificate + template: + src: templates/elastic-stack-ca.pem + dest: /etc/pki/tls/certs/elastic-stack-ca.pem + + - name: Install osquery + yum: + name: https://pkg.osquery.io/rpm/osquery-4.0.2-1.linux.x86_64.rpm + state: present + + - name: Install heartbeat + yum: + name: https://artifacts.elastic.co/downloads/beats/heartbeat/heartbeat-7.3.2-x86_64.rpm + state: present + - name: Install heartbeat configuration + template: + src: templates/heartbeat.yml + dest: /etc/heartbeat/heartbeat.yml + - name: Install heartbeat monitors + template: + src: templates/reachable.icmp.yml + dest: /etc/heartbeat/monitors.d/reachable.icmp.yml + + - name: Install metricbeat + yum: + name: https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.3.2-x86_64.rpm + state: present + - name: Install metricbeat configuration + template: + src: templates/metricbeat.yml + dest: /etc/metricbeat/metricbeat.yml + - name: Install metricbeat system.yml configuration + template: + src: templates/system.yml + dest: /etc/metricbeat/modules.d/system.yml + + - name: Install auditbeat + yum: + name: https://artifacts.elastic.co/downloads/beats/auditbeat/auditbeat-7.3.2-x86_64.rpm + state: present + - name: Install auditbeat configuration + template: + src: templates/auditbeat.yml + dest: /etc/auditbeat/auditbeat.yml + + - name: Reload systemd unit files + systemd: + daemon_reload: yes + - name: Enable and start heartbeat + systemd: + enabled: yes + state: restarted + name: heartbeat-elastic + - name: Enable and start metricbeat + systemd: + enabled: yes + state: restarted + name: metricbeat + - name: Enable and start auditbeat + systemd: + enabled: yes + state: restarted + name: auditbeat diff --git a/operation-tools/install_imageapi.yml b/operation-tools/install_imageapi.yml new file mode 100644 index 0000000..7adc66e --- /dev/null +++ b/operation-tools/install_imageapi.yml @@ -0,0 +1,36 @@ +- hosts: imageapi + become: 
true + gather_facts: no + vars: + imageapi_version: 0.0.0.000 + tasks: + - name: mkdir deployment directory + file: + path: /opt/imageapi/{{imageapi_version}}/lib + owner: daqusr + group: daq + state: directory + - name: deploy jar + get_url: + url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/imageapi/{{imageapi_version}}/imageapi-{{imageapi_version}}-all.jar + dest: /opt/imageapi/{{imageapi_version}}/lib + owner: daqusr + group: daq + - template: + src: templates/imageapi.service.j2 + dest: /etc/systemd/system/imageapi.service + - name: mkdir etc + file: + path: /etc/imageapi + group: daq + state: directory + - template: + src: templates/imageapi.application.properties + dest: /etc/imageapi/application.properties + - name: reload systemd daemon + systemd: + daemon_reload: yes + - name: enable service + systemd: + name: imageapi + enabled: yes diff --git a/operation-tools/install_jdk.yml b/operation-tools/install_jdk.yml new file mode 100644 index 0000000..b4717de --- /dev/null +++ b/operation-tools/install_jdk.yml @@ -0,0 +1,7 @@ +- hosts: databuffer_cluster + become: true + tasks: + - name: Install jdk-13 + yum: + name: java-13-openjdk-devel + state: present diff --git a/operation-tools/install_jdk8.yml b/operation-tools/install_jdk8.yml new file mode 100644 index 0000000..3fdf216 --- /dev/null +++ b/operation-tools/install_jdk8.yml @@ -0,0 +1,8 @@ +- hosts: databuffer_cluster + become: true + tasks: + - name: Install jdk from a local file + yum: +# name: https://artifacts.psi.ch/artifactory/releases/jdk-8u162-linux-x64.rpm + name: java-1.8.0-openjdk-devel + state: present diff --git a/operation-tools/install_limits.yml b/operation-tools/install_limits.yml new file mode 100644 index 0000000..4508857 --- /dev/null +++ b/operation-tools/install_limits.yml @@ -0,0 +1,28 @@ +- hosts: databuffer_cluster + become: true + tasks: + - template: + src: templates/90-daq_limits.d.conf + dest: /etc/security/limits.d/90-daq.conf + mode: '644' + - 
template: + src: templates/90-daq_sysctl.d.conf + dest: /etc/sysctl.d/90-daq.conf + mode: '644' + +# - name: Set limits in /etc/security/limits.conf +# shell: | +# echo "daqusr - memlock unlimited" >> /etc/security/limits.conf +# echo "daqusr - nofile 500000" >> /etc/security/limits.conf +# echo "daqusr - nproc 32768" >> /etc/security/limits.conf +# echo "daqusr - as unlimited" >> /etc/security/limits.conf +# unlimited does not work for nofile (user cannot login ???) +# this should actually go into /etc/security/limits.d/99-daq.conf +# - name: Set limits in /etc/sysctl.conf +# shell: | +# echo "" >> /etc/sysctl.conf +# echo "vm.max_map_count = 131072" >> /etc/sysctl.conf +# echo "vm.swappiness = 1" >> /etc/sysctl.conf +# sysctl -p + +# a restart of the machine is required for the limits to be applied diff --git a/operation-tools/install_query_node.yml b/operation-tools/install_query_node.yml new file mode 100644 index 0000000..4818703 --- /dev/null +++ b/operation-tools/install_query_node.yml @@ -0,0 +1,36 @@ +- hosts: databuffer_cluster + become: true + vars: + query_node_version: 1.14.13 + binaries_install_dir: /opt/databuffer + tasks: + - name: Create deployment directory - query_node + file: + path: "{{binaries_install_dir}}/lib" + owner: daqusr + group: daq + state: directory + + - name: Download app binary + get_url: + url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/querynode/{{query_node_version}}/querynode-{{query_node_version}}-all.jar + dest: "{{binaries_install_dir}}/lib/" + owner: daqusr + group: daq + + # Deploy systemd unit file for querynode + - template: + src: templates/daq-query-node.service.j2 + dest: /etc/systemd/system/daq-query-node.service + - name: Reload systemd unit files + systemd: + daemon_reload: yes + - name: Make sure the tuned service is enabled and started + systemd: + enabled: yes + state: started + name: tuned + - name: Make sure the daq-query-node is enabled + systemd: + enabled: yes + name: 
daq-query-node diff --git a/operation-tools/install_query_node_jdk8.yml b/operation-tools/install_query_node_jdk8.yml new file mode 100644 index 0000000..9838c08 --- /dev/null +++ b/operation-tools/install_query_node_jdk8.yml @@ -0,0 +1,34 @@ +- hosts: databuffer_cluster + become: true + vars: + query_node_version: 1.14.8 + tasks: + - name: Create deployment directory + file: + path: /opt/query_node/latest/lib + owner: daqusr + group: daq + state: directory + - name: Download app binary + get_url: + url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/querynode/{{query_node_version}}/querynode-{{query_node_version}}-all.jar + dest: /opt/query_node/latest/lib/ + owner: daqusr + group: daq + + # Deploy systemd unit file for querynode + - template: + src: templates/daq-query-node.service_jdk8.j2 + dest: /etc/systemd/system/daq-query-node.service + - name: Reload systemd unit files + systemd: + daemon_reload: yes + - name: Make sure the tuned service is enabled and started + systemd: + enabled: yes + state: started + name: tuned + - name: Make sure the daq-query-node is enabled + systemd: + enabled: yes + name: daq-query-node diff --git a/operation-tools/install_rest_apis.yml b/operation-tools/install_rest_apis.yml new file mode 100644 index 0000000..f2fe2e2 --- /dev/null +++ b/operation-tools/install_rest_apis.yml @@ -0,0 +1,40 @@ +- hosts: databuffer_cluster + become: true + vars: + binaries_version: 1.14.13 + binaries_install_dir: /opt/databuffer + tasks: + + - name: Download jar - dispatcher_rest + get_url: + url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/dispatcherrest/{{binaries_version}}/dispatcherrest-{{binaries_version}}-all.jar + dest: "{{binaries_install_dir}}/lib/" + owner: daqusr + group: daq + + - name: Download jar - query_rest + get_url: + url: https://artifacts.psi.ch/artifactory/libs-snapshots-local/ch/psi/daq/queryrest/{{binaries_version}}/queryrest-{{binaries_version}}-all.jar + dest: 
"{{binaries_install_dir}}/lib/" + owner: daqusr + group: daq + + # Deploy systemd unit file for dispatchernode + - template: + src: templates/daq-dispatcher-rest.service.j2 + dest: /etc/systemd/system/daq-dispatcher-rest.service + - template: + src: templates/daq-query-rest.service.j2 + dest: /etc/systemd/system/daq-query-rest.service + + - name: Reload systemd unit files + systemd: + daemon_reload: yes + + - systemd: + enabled: yes + name: daq-dispatcher-rest + - systemd: + enabled: yes + name: daq-query-rest + diff --git a/operation-tools/install_user.yml b/operation-tools/install_user.yml new file mode 100644 index 0000000..aa209f1 --- /dev/null +++ b/operation-tools/install_user.yml @@ -0,0 +1,14 @@ +- hosts: databuffer_cluster + become: true + tasks: + - name: Ensure group "daq" exists + group: + name: daq + gid: 1000 + state: present + - name: Add the user 'daqusr' + user: + name: daqusr + uid: 1000 + comment: DAQ User + group: daq diff --git a/operation-tools/inventories/sf b/operation-tools/inventories/sf new file mode 100644 index 0000000..21e3878 --- /dev/null +++ b/operation-tools/inventories/sf @@ -0,0 +1,79 @@ +[data_api_office] +data-api.psi.ch + +[dispatcher_api_office] +dispatcher-api.psi.ch + +[data_api] +sf-data-api.psi.ch +sf-data-api-02.psi.ch + +[dispatcher_api] +sf-dispatcher-api.psi.ch + +[databuffer_cluster:children] +databuffer +imagebuffer + +[imagebuffer] +sf-daq-5.psi.ch +sf-daq-6.psi.ch + +[databuffer] +sf-daqbuf-21.psi.ch +sf-daqbuf-22.psi.ch +sf-daqbuf-23.psi.ch +sf-daqbuf-24.psi.ch +sf-daqbuf-25.psi.ch +sf-daqbuf-26.psi.ch +sf-daqbuf-27.psi.ch +sf-daqbuf-28.psi.ch +sf-daqbuf-29.psi.ch +sf-daqbuf-30.psi.ch +sf-daqbuf-31.psi.ch +sf-daqbuf-32.psi.ch +sf-daqbuf-33.psi.ch + + +[databuffer_cluster:vars] +daq_environment=swissfel + +[imagebuffer:vars] +# This was calculated like this: +# SYSTEM_CPU_COUNT=$(cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l) +# SYSTEM_CORE_PER_CPU_COUNT=$(cat /proc/cpuinfo | grep -o -P 'cpu 
cores\t: [^\n]*' | cut -f2- -d':' | uniq | tr -d ' ') +# # without hyper threading +# SYSTEM_CORE_COUNT=$(( ${SYSTEM_CPU_COUNT} * ${SYSTEM_CORE_PER_CPU_COUNT} )) +number_of_cores=44 + +# This was calculated like this: +# SYSTEM_THREAD_COUNT=$(egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo) +# QUERY_NODE_COMMON_FORK_JOIN_POOL_PARALLELISM="$((2 * ${SYSTEM_THREAD_COUNT}))" +fork_join_pool_parallelism=88 + +backend_default=sf-imagebuffer + +[databuffer:vars] +# This was calculated like this: +# SYSTEM_CPU_COUNT=$(cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l) +# SYSTEM_CORE_PER_CPU_COUNT=$(cat /proc/cpuinfo | grep -o -P 'cpu cores\t: [^\n]*' | cut -f2- -d':' | uniq | tr -d ' ') +# # without hyper threading +# SYSTEM_CORE_COUNT=$(( ${SYSTEM_CPU_COUNT} * ${SYSTEM_CORE_PER_CPU_COUNT} )) +number_of_cores=28 + +# This was calculated like this: +# SYSTEM_THREAD_COUNT=$(egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo) +# QUERY_NODE_COMMON_FORK_JOIN_POOL_PARALLELISM="$((2 * ${SYSTEM_THREAD_COUNT}))" +fork_join_pool_parallelism=112 + +backend_default=sf-databuffer + +[test] +sf-nube-11 +sf-nube-12 + +[iodatatest] +sf-daq-5.psi.ch + +[imageapi] +sf-daq-5.psi.ch diff --git a/operation-tools/inventories/test b/operation-tools/inventories/test new file mode 100644 index 0000000..e41cff5 --- /dev/null +++ b/operation-tools/inventories/test @@ -0,0 +1,41 @@ +[databuffer_cluster:children] +databuffer +imagebuffer + +[databuffer] +sf-nube-11.psi.ch +sf-nube-12.psi.ch +sf-nube-13.psi.ch + +[imagebuffer] +sf-nube-14.psi.ch + + +[databuffer_cluster:vars] +daq_environment=swissfel-test + +[imagebuffer:vars] +# This was calculated like this: +# SYSTEM_CPU_COUNT=$(cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l) +# SYSTEM_CORE_PER_CPU_COUNT=$(cat /proc/cpuinfo | grep -o -P 'cpu cores\t: [^\n]*' | cut -f2- -d':' | uniq | tr -d ' ') +# # without hyper threading +# SYSTEM_CORE_COUNT=$(( ${SYSTEM_CPU_COUNT} * ${SYSTEM_CORE_PER_CPU_COUNT} )) 
+number_of_cores=20 + +# This was calculated like this: +# SYSTEM_THREAD_COUNT=$(egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo) +# QUERY_NODE_COMMON_FORK_JOIN_POOL_PARALLELISM="$((2 * ${SYSTEM_THREAD_COUNT}))" +fork_join_pool_parallelism=80 + +[databuffer:vars] +# This was calculated like this: +# SYSTEM_CPU_COUNT=$(cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l) +# SYSTEM_CORE_PER_CPU_COUNT=$(cat /proc/cpuinfo | grep -o -P 'cpu cores\t: [^\n]*' | cut -f2- -d':' | uniq | tr -d ' ') +# # without hyper threading +# SYSTEM_CORE_COUNT=$(( ${SYSTEM_CPU_COUNT} * ${SYSTEM_CORE_PER_CPU_COUNT} )) +number_of_cores=22 + +# This was calculated like this: +# SYSTEM_THREAD_COUNT=$(egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo) +# QUERY_NODE_COMMON_FORK_JOIN_POOL_PARALLELISM="$((2 * ${SYSTEM_THREAD_COUNT}))" +fork_join_pool_parallelism=82 diff --git a/operation-tools/inventories/twlha b/operation-tools/inventories/twlha new file mode 100644 index 0000000..95d1c06 --- /dev/null +++ b/operation-tools/inventories/twlha @@ -0,0 +1,40 @@ +[data_api_office] +data-api.psi.ch + +[dispatcher_api_office] +dispatcher-api.psi.ch + +[data_api] +#twlha-data-api.psi.ch + +[dispatcher_api] +#twlha-dispatcher-api.psi.ch + +[databuffer_cluster:children] +databuffer + +[databuffer] +twlha-daqbuf-21.psi.ch + +[imagebuffer] + +[databuffer_cluster:vars] +daq_environment=twlha-databuffer + +[databuffer:vars] +# This was calculated like this: +# SYSTEM_CPU_COUNT=$(cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l) +# SYSTEM_CORE_PER_CPU_COUNT=$(cat /proc/cpuinfo | grep -o -P 'cpu cores\t: [^\n]*' | cut -f2- -d':' | uniq | tr -d ' ') +# # without hyper threading +# SYSTEM_CORE_COUNT=$(( ${SYSTEM_CPU_COUNT} * ${SYSTEM_CORE_PER_CPU_COUNT} )) +number_of_cores=28 + +# This was calculated like this: +# SYSTEM_THREAD_COUNT=$(egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo) +# QUERY_NODE_COMMON_FORK_JOIN_POOL_PARALLELISM="$((2 * ${SYSTEM_THREAD_COUNT}))" 
+fork_join_pool_parallelism=112 + +backend_default=twlha-databuffer + +[imagebuffer:vars] +backend_default=twlha-imagebuffer diff --git a/operation-tools/restart_cluster.yml b/operation-tools/restart_cluster.yml new file mode 100644 index 0000000..ffa0be5 --- /dev/null +++ b/operation-tools/restart_cluster.yml @@ -0,0 +1,106 @@ +- name: stop data api + hosts: data_api + become: true + tasks: + - name: stop data-api + systemd: + state: stopped + name: data-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop dispatcher api + hosts: dispatcher_api + become: true + tasks: + - name: stop dispatcher-api + systemd: + state: stopped + name: dispatcher-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop nodes + hosts: databuffer_cluster + become: true + tasks: + - name: stop daq-dispatcher-node + systemd: + state: stopped + name: daq-dispatcher-node + - name: stop daq-query-node + systemd: + state: stopped + name: daq-query-node + - name: Remove sources + file: + path: /home/daqusr/.config/daq/stores/sources + state: absent + - name: Remove streamers + file: + path: /home/daqusr/.config/daq/stores/streamers + state: absent + + +- name: start dispatcher nodes + hosts: databuffer_cluster + become: true + serial: 1 + tasks: + - name: start daq-dispatcher-node + systemd: + state: started + name: daq-dispatcher-node + + +- name: wait for dispatcher nodes to come up + hosts: dispatcher_api + tasks: + - name: sleep for 120 seconds and continue with play + wait_for: + timeout: 120 + + +- name: start query nodes + hosts: databuffer_cluster + become: true + serial: 1 + tasks: + - name: start daq-query-node + systemd: + state: started + name: daq-query-node + + +- name: start data api + hosts: data_api + become: true + tasks: + - name: start data-api + systemd: + state: started + name: data-api + - name: start nginx + systemd: + state: started + name: nginx + + +- name: start dispatcher api + hosts: dispatcher_api + 
become: true + tasks: + - name: start dispatcher-api + systemd: + state: started + name: dispatcher-api + - name: start nginx + systemd: + state: started + name: nginx diff --git a/operation-tools/restart_databuffer_api3.yml b/operation-tools/restart_databuffer_api3.yml new file mode 100644 index 0000000..c3ceb90 --- /dev/null +++ b/operation-tools/restart_databuffer_api3.yml @@ -0,0 +1,38 @@ +- name: Restart API3 Processes + hosts: databuffer + become: true + tasks: + - name: Stop nginx + systemd: + state: stopped + name: nginx + - name: Stop retrieval00 + systemd: + state: stopped + name: retrieval-00 + - name: Stop retrieval01 + systemd: + state: stopped + name: retrieval-01 + - name: Stop retrieval02 + systemd: + state: stopped + name: retrieval-02 + + + - name: Start nginx + systemd: + state: started + name: nginx + - name: Start retrieval00 + systemd: + state: started + name: retrieval-00 + - name: Start retrieval01 + systemd: + state: started + name: retrieval-01 + - name: Start retrieval02 + systemd: + state: started + name: retrieval-02 diff --git a/operation-tools/restart_dataretrieval.yml b/operation-tools/restart_dataretrieval.yml new file mode 100644 index 0000000..8296a3b --- /dev/null +++ b/operation-tools/restart_dataretrieval.yml @@ -0,0 +1,54 @@ +- name: restart dataretrieval + hosts: data_api + become: true + tasks: + - name: stop data-api + systemd: + state: stopped + name: data-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop dispatcher api + hosts: dispatcher_api + become: true + tasks: + - name: stop dispatcher-api + systemd: + state: stopped + name: dispatcher-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: start data api + hosts: data_api + become: true + tasks: + - name: start data-api + systemd: + state: started + name: data-api + - name: start nginx + systemd: + state: started + name: nginx + + +- name: start dispatcher api + hosts: dispatcher_api + become: true + tasks: 
+ - name: start dispatcher-api + systemd: + state: started + name: dispatcher-api + - name: start nginx + systemd: + state: started + name: nginx diff --git a/operation-tools/restart_dataretrieval_all.yml b/operation-tools/restart_dataretrieval_all.yml new file mode 100644 index 0000000..17514b8 --- /dev/null +++ b/operation-tools/restart_dataretrieval_all.yml @@ -0,0 +1,79 @@ +- name: stop data api + hosts: data_api + become: true + tasks: + - name: stop data-api + systemd: + state: stopped + name: data-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop dispatcher api + hosts: dispatcher_api + become: true + tasks: + - name: stop dispatcher-api + systemd: + state: stopped + name: dispatcher-api + - name: stop nginx + systemd: + state: stopped + name: nginx + +- name: restart dispatcher api office + hosts: dispatcher_api_office + become: true + tasks: + - name: restart central dispatcher-api + systemd: + name: dispatcher-api-central + state: restarted + - name: restart nginx + systemd: + state: restarted + name: nginx + +- name: restart data api office + hosts: data_api_office + become: true + tasks: + - name: restart central data-api + systemd: + name: data-api-central + state: restarted + - name: restart nginx + systemd: + name: nginx + state: restarted + +- name: start data api + hosts: data_api + become: true + tasks: + - name: start data-api + systemd: + state: started + name: data-api + - name: start nginx + systemd: + state: started + name: nginx + + +- name: start dispatcher api + hosts: dispatcher_api + become: true + tasks: + - name: start dispatcher-api + systemd: + state: started + name: dispatcher-api + - name: start nginx + systemd: + state: started + name: nginx diff --git a/operation-tools/restart_imageapi.yml b/operation-tools/restart_imageapi.yml new file mode 100644 index 0000000..12aca99 --- /dev/null +++ b/operation-tools/restart_imageapi.yml @@ -0,0 +1,11 @@ +- hosts: imageapi + become: true + gather_facts: no 
+ tasks: + - name: systemd daemon reload + systemd: + daemon_reload: yes + - name: restart service + systemd: + name: imageapi + state: restarted diff --git a/operation-tools/stop_cluster.yml b/operation-tools/stop_cluster.yml new file mode 100644 index 0000000..5012f3a --- /dev/null +++ b/operation-tools/stop_cluster.yml @@ -0,0 +1,48 @@ +- name: stop data api + hosts: data_api + become: true + tasks: + - name: stop data-api + systemd: + state: stopped + name: data-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop dispatcher api + hosts: dispatcher_api + become: true + tasks: + - name: stop dispatcher-api + systemd: + state: stopped + name: dispatcher-api + - name: stop nginx + systemd: + state: stopped + name: nginx + + +- name: stop nodes + hosts: databuffer_cluster + become: true + tasks: + - name: stop daq-dispatcher-node + systemd: + state: stopped + name: daq-dispatcher-node + - name: stop daq-query-node + systemd: + state: stopped + name: daq-query-node + - name: Remove sources + file: + path: /home/daqusr/.config/daq/stores/sources + state: absent + - name: Remove streamers + file: + path: /home/daqusr/.config/daq/stores/streamers + state: absent diff --git a/operation-tools/templates/90-daq_limits.d.conf b/operation-tools/templates/90-daq_limits.d.conf new file mode 100644 index 0000000..2717cf1 --- /dev/null +++ b/operation-tools/templates/90-daq_limits.d.conf @@ -0,0 +1,4 @@ +daqusr - memlock unlimited +daqusr - nofile 500000 +daqusr - nproc 32768 +daqusr - as unlimited diff --git a/operation-tools/templates/90-daq_sysctl.d.conf b/operation-tools/templates/90-daq_sysctl.d.conf new file mode 100644 index 0000000..8deb1b1 --- /dev/null +++ b/operation-tools/templates/90-daq_sysctl.d.conf @@ -0,0 +1,2 @@ +vm.max_map_count = 131072 +vm.swappiness = 1 \ No newline at end of file diff --git a/operation-tools/templates/auditbeat.yml b/operation-tools/templates/auditbeat.yml new file mode 100644 index 0000000..30d075b --- 
/dev/null +++ b/operation-tools/templates/auditbeat.yml @@ -0,0 +1,220 @@ +###################### Auditbeat Configuration Example ######################### + +# This is an example configuration file highlighting only the most common +# options. The auditbeat.reference.yml file from the same directory contains all +# the supported options with more comments. You can use it as a reference. +# +# You can find the full configuration reference here: +# https://www.elastic.co/guide/en/beats/auditbeat/index.html + +#========================== Modules configuration ============================= +auditbeat.modules: + +- module: auditd + # Load audit rules from separate files. Same format as audit.rules(7). + audit_rule_files: [ '${path.config}/audit.rules.d/*.conf' ] + audit_rules: | + ## Define audit rules here. + ## Create file watches (-w) or syscall audits (-a or -A). Uncomment these + ## examples or add your own rules. + + ## If you are on a 64 bit platform, everything should be running + ## in 64 bit mode. This rule will detect any use of the 32 bit syscalls + ## because this might be a sign of someone exploiting a hole in the 32 + ## bit API. + -a always,exit -F arch=b32 -S all -F key=32bit-abi + + ## Executions. + -a always,exit -F arch=b64 -S execve,execveat -k exec + + ## External access (warning: these can be expensive to audit). + -a always,exit -F arch=b64 -S accept,bind,connect -F key=external-access + + ## Identity changes. + -w /etc/group -p wa -k identity + -w /etc/passwd -p wa -k identity + -w /etc/gshadow -p wa -k identity + + ## Unauthorized access attempts. + -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EACCES -k access + -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EPERM -k access + +- module: file_integrity + paths: + - /bin + - /usr/bin + - /sbin + - /usr/sbin + - /etc + +- module: system + datasets: + - host # General host information, e.g. 
uptime, IPs + - login # User logins, logouts, and system boots. + - package # Installed, updated, and removed packages + - process # Started and stopped processes + - socket # Opened and closed sockets + - user # User information + + # How often datasets send state updates with the + # current state of the system (e.g. all currently + # running processes, all open sockets). + state.period: 12h + + # Enabled by default. Auditbeat will read password fields in + # /etc/passwd and /etc/shadow and store a hash locally to + # detect any changes. + user.detect_password_changes: true + + # File patterns of the login record files. + login.wtmp_file_pattern: /var/log/wtmp* + login.btmp_file_pattern: /var/log/btmp* + +#==================== Elasticsearch template setting ========================== +setup.template.settings: + index.number_of_shards: 1 + #index.codec: best_compression + #_source.enabled: false + +#================================ General ===================================== + +# The name of the shipper that publishes the network data. It can be used to group +# all the transactions sent by a single shipper in the web interface. +#name: + +# The tags of the shipper are included in their own field with each +# transaction published. +#tags: ["service-X", "web-tier"] + +# Optional fields that you can specify to add additional information to the +# output. +#fields: +# env: staging + + +#============================== Dashboards ===================================== +# These settings control loading the sample dashboards to the Kibana index. Loading +# the dashboards is disabled by default and can be enabled either by setting the +# options here or by using the `setup` command. +#setup.dashboards.enabled: false + +# The URL from where to download the dashboards archive. By default this URL +# has a value which is computed based on the Beat name and version. For released +# versions, this URL points to the dashboard archive on the artifacts.elastic.co +# website. 
+#setup.dashboards.url: + +#============================== Kibana ===================================== + +# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API. +# This requires a Kibana endpoint configuration. +setup.kibana: + + + host: "https://realstuff.psi.ch:5601" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + # Kibana Host + # Scheme and port can be left out and will be set to the default (http and 5601) + # In case you specify an additional path, the scheme is required: http://localhost:5601/path + # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601 + #host: "localhost:5601" + + # Kibana Space ID + # ID of the Kibana Space into which the dashboards should be loaded. By default, + # the Default Space will be used. + #space.id: + +#============================= Elastic Cloud ================================== + +# These settings simplify using Auditbeat with the Elastic Cloud (https://cloud.elastic.co/). + +# The cloud.id setting overwrites the `output.elasticsearch.hosts` and +# `setup.kibana.host` options. +# You can find the `cloud.id` in the Elastic Cloud web UI. +#cloud.id: + +# The cloud.auth setting overwrites the `output.elasticsearch.username` and +# `output.elasticsearch.password` settings. The format is `<user>:<pass>`. +#cloud.auth: + +#================================ Outputs ===================================== + +# Configure what output to use when sending the data collected by the beat. + +#-------------------------- Elasticsearch output ------------------------------ +output.elasticsearch: + # Array of hosts to connect to. + hosts: ["realstuff.psi.ch:9200"] + + # Optional protocol and basic auth credentials. + protocol: "https" + username: "beats_user" + password: "beats123" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + + # Optional protocol and basic auth credentials. 
+ #protocol: "https" + #username: "elastic" + #password: "changeme" + +#----------------------------- Logstash output -------------------------------- +#output.logstash: + # The Logstash hosts + #hosts: ["localhost:5044"] + + # Optional SSL. By default is off. + # List of root certificates for HTTPS server verifications + #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"] + + # Certificate for SSL client authentication + #ssl.certificate: "/etc/pki/client/cert.pem" + + # Client Certificate Key + #ssl.key: "/etc/pki/client/cert.key" + +#================================ Processors ===================================== + +# Configure processors to enhance or manipulate events generated by the beat. + +processors: + - add_host_metadata: ~ + - add_cloud_metadata: ~ + +#================================ Logging ===================================== + +# Sets log level. The default log level is info. +# Available log levels are: error, warning, info, debug +#logging.level: debug + +# At debug level, you can selectively enable logging only for some components. +# To enable all selectors use ["*"]. Examples of other selectors are "beat", +# "publish", "service". +#logging.selectors: ["*"] + +#============================== Xpack Monitoring =============================== +# auditbeat can export internal metrics to a central Elasticsearch monitoring +# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The +# reporting is disabled by default. + +# Set to true to enable the monitoring reporter. +monitoring.enabled: true + +# Sets the UUID of the Elasticsearch cluster under which monitoring data for this +# Auditbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch +# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch. +#monitoring.cluster_uuid: +monitoring.cluster_uuid: "57-GhvUVR1WM1D-42XEFYg" + +# Uncomment to send the metrics to Elasticsearch. 
Most settings from the +# Elasticsearch output are accepted here as well. +# Note that the settings should point to your Elasticsearch *monitoring* cluster. +# Any setting that is not set is automatically inherited from the Elasticsearch +# output configuration, so if you have the Elasticsearch output configured such +# that it is pointing to your Elasticsearch monitoring cluster, you can simply +# uncomment the following line. +#monitoring.elasticsearch: + +#================================= Migration ================================== + +# This allows to enable 6.7 migration aliases +#migration.6_to_7.enabled: true diff --git a/operation-tools/templates/daq-dispatcher-node.service.j2 b/operation-tools/templates/daq-dispatcher-node.service.j2 new file mode 100644 index 0000000..e4d8492 --- /dev/null +++ b/operation-tools/templates/daq-dispatcher-node.service.j2 @@ -0,0 +1,47 @@ +[Unit] +Description=Dispatcher Node +After=network.target local-fs.target tuned.service + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-13/bin/java --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED \ + --add-opens java.management/sun.management=ALL-UNNAMED \ + -Xms8G \ + -Xmx32G \ + -Xmn2G \ + -Xss256k \ + -DDirectMemoryAllocationThreshold=2KB \ + -XX:MaxDirectMemorySize=64G \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:+ExitOnOutOfMemoryError \ + --add-exports java.base/jdk.internal.ref=ALL-UNNAMED \ + --add-opens java.base/java.nio=ALL-UNNAMED \ + --add-opens java.base/sun.nio.ch=ALL-UNNAMED \ + --add-opens java.base/java.lang=ALL-UNNAMED \ + --add-modules jdk.unsupported \ + -XX:+UnlockExperimentalVMOptions \ + -XX:+UseZGC \ + -XX:ConcGCThreads={{ number_of_cores }} \ + -Djava.util.concurrent.ForkJoinPool.common.parallelism={{fork_join_pool_parallelism}} \ + -Duser.timezone=Europe/Zurich \ + -Dcom.sun.management.jmxremote.port=3334 \ + -Dcom.sun.management.jmxremote.ssl=false \ + -Dcom.sun.management.jmxremote.authenticate=false \ + 
-Dcom.sun.management.jmxremote.local.only=false \ + -jar {{binaries_install_dir}}/lib/dispatchernode-{{dispatcher_node_version}}-all.jar \ + --daq.config.environment={{daq_environment}} +Restart=on-failure +RestartSec=3s +SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=500000 +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +#CPUAccounting=true +#CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/daq-dispatcher-node.service_jdk8.j2 b/operation-tools/templates/daq-dispatcher-node.service_jdk8.j2 new file mode 100644 index 0000000..fcac2c6 --- /dev/null +++ b/operation-tools/templates/daq-dispatcher-node.service_jdk8.j2 @@ -0,0 +1,52 @@ +[Unit] +Description=Dispatcher Node +After=network.target local-fs.target tuned.service + +[Service] +User=daqusr +ExecStart=/usr/java/jdk1.8.0_162/bin/java -XX:+CMSClassUnloadingEnabled \ + -XX:+UseThreadPriorities \ + -Xms8G \ + -Xmx32G \ + -Xmn2G \ + -DDirectMemoryAllocationThreshold=2KB \ + -XX:MaxDirectMemorySize=64G \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:+ExitOnOutOfMemoryError \ + -Xss256k \ + -XX:StringTableSize=1000003 \ + -XX:+UseParNewGC \ + -XX:+UseConcMarkSweepGC \ + -XX:+CMSParallelRemarkEnabled \ + -XX:SurvivorRatio=8 \ + -XX:MaxTenuringThreshold=1 \ + -XX:CMSInitiatingOccupancyFraction=75 \ + -XX:+UseCMSInitiatingOccupancyOnly \ + -XX:+UseTLAB \ + -XX:+PerfDisableSharedMem \ + -XX:CMSWaitDuration=10000 \ + -XX:+CMSParallelInitialMarkEnabled \ + -XX:+CMSEdenChunksRecordAlways \ + -XX:CMSWaitDuration=10000 \ + -XX:+UseCondCardMark \ + -Dcom.sun.management.jmxremote.port=3334 \ + -Dcom.sun.management.jmxremote.ssl=false \ + -Dcom.sun.management.jmxremote.authenticate=false \ + -Dcom.sun.management.jmxremote.local.only=false \ + -jar /opt/dispatcher_node/latest/lib/dispatchernode-{{dispatcher_node_version}}-all.jar \ + --daq.config.environment={{daq_environment}} +Restart=on-failure +RestartSec=3s 
+SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=500000 +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +CPUAccounting=true +CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/daq-dispatcher-rest.service.j2 b/operation-tools/templates/daq-dispatcher-rest.service.j2 new file mode 100644 index 0000000..d0cd665 --- /dev/null +++ b/operation-tools/templates/daq-dispatcher-rest.service.j2 @@ -0,0 +1,46 @@ +[Unit] +Description=Dispatcher REST Server +After=network.target +PartOf=daq-dispatcher-node.service + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-13/bin/java \ + --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED \ + --add-opens java.management/sun.management=ALL-UNNAMED \ + --add-exports java.base/jdk.internal.ref=ALL-UNNAMED \ + --add-opens java.base/java.nio=ALL-UNNAMED \ + --add-opens java.base/sun.nio.ch=ALL-UNNAMED \ + --add-opens java.base/java.lang=ALL-UNNAMED \ + --add-modules jdk.unsupported \ + -XX:+UnlockExperimentalVMOptions \ + -XX:+UseZGC \ + -XX:ConcGCThreads=8 \ + -Djava.util.concurrent.ForkJoinPool.common.parallelism=16 \ + -Duser.timezone=Europe/Zurich \ + -Xms128M \ + -Xmx1G \ + -Xmn64M \ + -Xss256k \ + -DDirectMemoryAllocationThreshold=50MB \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:MaxDirectMemorySize=1G \ + -XX:+ExitOnOutOfMemoryError \ + -jar {{binaries_install_dir}}/lib/dispatcherrest-{{binaries_version}}-all.jar \ + --daq.config.environment={{daq_environment}} \ + --server.port=8081 +Restart=on-failure +RestartSec=3s +SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=infinity +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +CPUAccounting=true +CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/daq-query-node.service.j2 b/operation-tools/templates/daq-query-node.service.j2 new file mode 100644 
index 0000000..dc9a9a6 --- /dev/null +++ b/operation-tools/templates/daq-query-node.service.j2 @@ -0,0 +1,47 @@ +[Unit] +Description=Query Node +After=network.target local-fs.target tuned.service + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-13/bin/java --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED \ + --add-opens java.management/sun.management=ALL-UNNAMED \ + -Xms8G \ + -Xmx16G \ + -Xmn4G \ + -Xss256k \ + -DDirectMemoryAllocationThreshold=2KB \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:+ExitOnOutOfMemoryError \ + --add-exports java.base/jdk.internal.ref=ALL-UNNAMED \ + --add-opens java.base/java.nio=ALL-UNNAMED \ + --add-opens java.base/sun.nio.ch=ALL-UNNAMED \ + --add-opens java.base/java.lang=ALL-UNNAMED \ + --add-modules jdk.unsupported \ + -XX:+UnlockExperimentalVMOptions \ + -XX:+UseZGC \ + -XX:ConcGCThreads={{ number_of_cores }} \ + -Djava.util.concurrent.ForkJoinPool.common.parallelism={{fork_join_pool_parallelism}} \ + -Duser.timezone=Europe/Zurich \ + -XX:MaxDirectMemorySize=64G \ + -Dcom.sun.management.jmxremote.port=3336 \ + -Dcom.sun.management.jmxremote.ssl=false \ + -Dcom.sun.management.jmxremote.authenticate=false \ + -Dcom.sun.management.jmxremote.local.only=false \ + -jar {{binaries_install_dir}}/lib/querynode-{{query_node_version}}-all.jar \ + --daq.config.environment={{daq_environment}} +Restart=on-failure +RestartSec=3s +SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=500000 +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +#CPUAccounting=true +#CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/daq-query-node.service_jdk8.j2 b/operation-tools/templates/daq-query-node.service_jdk8.j2 new file mode 100644 index 0000000..198936f --- /dev/null +++ b/operation-tools/templates/daq-query-node.service_jdk8.j2 @@ -0,0 +1,52 @@ +[Unit] +Description=Query Node +After=network.target local-fs.target tuned.service 
+ +[Service] +User=daqusr +ExecStart=/usr/java/jdk1.8.0_162/bin/java -XX:+CMSClassUnloadingEnabled \ + -XX:+UseThreadPriorities \ + -Xms8G \ + -Xmx16G \ + -Xmn4G \ + -DDirectMemoryAllocationThreshold=2KB \ + -XX:MaxDirectMemorySize=64G \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:+ExitOnOutOfMemoryError \ + -Xss256k \ + -XX:StringTableSize=1000003 \ + -XX:+UseParNewGC \ + -XX:+UseConcMarkSweepGC \ + -XX:+CMSParallelRemarkEnabled \ + -XX:SurvivorRatio=8 \ + -XX:MaxTenuringThreshold=1 \ + -XX:CMSInitiatingOccupancyFraction=75 \ + -XX:+UseCMSInitiatingOccupancyOnly \ + -XX:+UseTLAB \ + -XX:+PerfDisableSharedMem \ + -XX:CMSWaitDuration=10000 \ + -XX:+CMSParallelInitialMarkEnabled \ + -XX:+CMSEdenChunksRecordAlways \ + -XX:CMSWaitDuration=10000 \ + -XX:+UseCondCardMark \ + -Dcom.sun.management.jmxremote.port=3336 \ + -Dcom.sun.management.jmxremote.ssl=false \ + -Dcom.sun.management.jmxremote.authenticate=false \ + -Dcom.sun.management.jmxremote.local.only=false \ + -jar /opt/query_node/latest/lib/querynode-{{query_node_version}}-all.jar \ + --daq.config.environment={{daq_environment}} +Restart=on-failure +RestartSec=3s +SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=500000 +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +CPUAccounting=true +CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/daq-query-rest.service.j2 b/operation-tools/templates/daq-query-rest.service.j2 new file mode 100644 index 0000000..e90c11c --- /dev/null +++ b/operation-tools/templates/daq-query-rest.service.j2 @@ -0,0 +1,46 @@ +[Unit] +Description=Query REST Server +After=network.target +PartOf=daq-query-node.service + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-13/bin/java \ + --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED \ + --add-opens java.management/sun.management=ALL-UNNAMED \ + --add-exports java.base/jdk.internal.ref=ALL-UNNAMED \ + --add-opens 
java.base/java.nio=ALL-UNNAMED \ + --add-opens java.base/sun.nio.ch=ALL-UNNAMED \ + --add-opens java.base/java.lang=ALL-UNNAMED \ + --add-modules jdk.unsupported \ + -XX:+UnlockExperimentalVMOptions \ + -XX:+UseZGC \ + -XX:ConcGCThreads=8 \ + -Djava.util.concurrent.ForkJoinPool.common.parallelism=16 \ + -Duser.timezone=Europe/Zurich \ + -Xms1G \ + -Xmx12G \ + -Xmn1G \ + -Xss256k \ + -DDirectMemoryAllocationThreshold=50MB \ + -DDirectMemoryCleanerThreshold=0.7 \ + -XX:MaxDirectMemorySize=1G \ + -XX:+ExitOnOutOfMemoryError \ + -jar {{binaries_install_dir}}/lib/queryrest-{{binaries_version}}-all.jar \ + --daq.config.environment={{daq_environment}} \ + --server.port=8080 +Restart=on-failure +RestartSec=3s +SuccessExitStatus=143 +StandardOutput=journal +StandardError=journal +OOMScoreAdjust=-500 +LimitNOFILE=infinity +LimitMEMLOCK=infinity +LimitNPROC=infinity +LimitAS=infinity +CPUAccounting=true +CPUShares=2048 + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/domain.properties.j2 b/operation-tools/templates/domain.properties.j2 new file mode 100644 index 0000000..ae64c41 --- /dev/null +++ b/operation-tools/templates/domain.properties.j2 @@ -0,0 +1 @@ +backend.default={{backend_default}} diff --git a/operation-tools/templates/elastic-stack-ca.pem b/operation-tools/templates/elastic-stack-ca.pem new file mode 100644 index 0000000..f99eec9 --- /dev/null +++ b/operation-tools/templates/elastic-stack-ca.pem @@ -0,0 +1,25 @@ +Bag Attributes + friendlyName: ca + localKeyID: 54 69 6D 65 20 31 35 36 38 36 32 32 33 34 39 32 31 31 +subject=/CN=Elastic Certificate Tool Autogenerated CA +issuer=/CN=Elastic Certificate Tool Autogenerated CA +-----BEGIN CERTIFICATE----- +MIIDSjCCAjKgAwIBAgIVALSBEnmcvNWcKOgb37AwpamramBkMA0GCSqGSIb3DQEB +CwUAMDQxMjAwBgNVBAMTKUVsYXN0aWMgQ2VydGlmaWNhdGUgVG9vbCBBdXRvZ2Vu +ZXJhdGVkIENBMB4XDTE5MDkxNjA4MjQ1N1oXDTIyMDkxNTA4MjQ1N1owNDEyMDAG +A1UEAxMpRWxhc3RpYyBDZXJ0aWZpY2F0ZSBUb29sIEF1dG9nZW5lcmF0ZWQgQ0Ew 
+ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCWgsvEjbDXlFE6f1OlRg3o +9K9guQfFtio1S1IR+J8itRTc6QtVJ0YSoTLlArj1ZE5SeqctDUFQIwNCm/vD4/6d +kiUrXUamJW+73g1kJgBWi/kn2oMUAUOerSXNF7Y1vKkCwtG9lQqk4ZMt8dKGd0x0 +5WkVgAORZrTMUNPYK2HIHG3DhsntHe84u8nR7xMZCuYza/mHC42OiCAEDeIu0R6v +zeQQY0tqxKcQE3FzGzv7fKX0FNjW+fFe4F8qqANy/+YsmIfce/iEd/7bOdIizG3V +P5e1W4jORbhTDnbw79rGgyzLHy0yGLn/o95ixXyM/3qO/aaB44KPIJlFBxz9MsM5 +AgMBAAGjUzBRMB0GA1UdDgQWBBRwArjMBG5pxXwo1sWdoY+If3yAzjAfBgNVHSME +GDAWgBRwArjMBG5pxXwo1sWdoY+If3yAzjAPBgNVHRMBAf8EBTADAQH/MA0GCSqG +SIb3DQEBCwUAA4IBAQBNeV3zlAF12/sk4W9icWuuTV2lT6MobTouy0u8zJs4ciQ3 +IGzXR6eGfvqulnVNOc754Ndmdj80WbV/WMnZY32IUsMpCebZkUmjYrSej2vozPWU +rc7AOkran3vicUN6J3OWnoWATo04HH0uJnM0HgP/oqelq0Iu4+5J+DP2OhX2kir0 +OpktbBPOlhogT15Zt1kZTU3RuY1AL3TLSy9pvfB+bfrd7Z2AKJ9rdrSKgboB/gKv +czcNTwvGAW9m9LlwUqTFzwf0Vb/1bSi8Z93+pGzm2s1LmZ6Ubvr1mOZDjcGibMTm +pIepviI2Nzd6DosV6N9VqA7UxZWklCaXvqbTp72c +-----END CERTIFICATE----- diff --git a/operation-tools/templates/heartbeat.yml b/operation-tools/templates/heartbeat.yml new file mode 100644 index 0000000..3847d33 --- /dev/null +++ b/operation-tools/templates/heartbeat.yml @@ -0,0 +1,176 @@ +################### Heartbeat Configuration Example ######################### + +# This file is an example configuration file highlighting only some common options. +# The heartbeat.reference.yml file in the same directory contains all the supported options +# with detailed comments. You can use it for reference. +# +# You can find the full configuration reference here: +# https://www.elastic.co/guide/en/beats/heartbeat/index.html + +############################# Heartbeat ###################################### + +# Define a directory to load monitor definitions from. Definitions take the form +# of individual yaml files. 
+heartbeat.config.monitors: + # Directory + glob pattern to search for configuration files + path: ${path.config}/monitors.d/*.yml + # If enabled, heartbeat will periodically check the config.monitors path for changes + reload.enabled: false + # How often to check for changes + reload.period: 5s + +# Configure monitors inline +#heartbeat.monitors: +#- type: http +# +# # List of urls to query +# urls: ["http://localhost:9200"] +# +# # Configure task schedule +# schedule: '@every 10s' + + # Total test connection and data exchange timeout + #timeout: 16s + +#==================== Elasticsearch template setting ========================== + +setup.template.settings: + index.number_of_shards: 1 + index.codec: best_compression + #_source.enabled: false + +#================================ General ===================================== + +# The name of the shipper that publishes the network data. It can be used to group +# all the transactions sent by a single shipper in the web interface. +#name: + +# The tags of the shipper are included in their own field with each +# transaction published. +#tags: ["service-X", "web-tier"] + +# Optional fields that you can specify to add additional information to the +# output. +#fields: +# env: staging + + +#============================== Kibana ===================================== + +# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API. +# This requires a Kibana endpoint configuration. +setup.kibana: + + # Kibana Host + # Scheme and port can be left out and will be set to the default (http and 5601) + # In case you specify an additional path, the scheme is required: http://localhost:5601/path + # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601 + #host: "localhost:5601" + host: "https://realstuff.psi.ch:5601" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + + # Kibana Space ID + # ID of the Kibana Space into which the dashboards should be loaded.
By default, + # the Default Space will be used. + #space.id: + +#============================= Elastic Cloud ================================== + +# These settings simplify using Heartbeat with the Elastic Cloud (https://cloud.elastic.co/). + +# The cloud.id setting overwrites the `output.elasticsearch.hosts` and +# `setup.kibana.host` options. +# You can find the `cloud.id` in the Elastic Cloud web UI. +#cloud.id: + +# The cloud.auth setting overwrites the `output.elasticsearch.username` and +# `output.elasticsearch.password` settings. The format is `<user>:<pass>`. +#cloud.auth: + +#================================ Outputs ===================================== + +# Configure what output to use when sending the data collected by the beat. + +#-------------------------- Elasticsearch output ------------------------------ +output.elasticsearch: + # Array of hosts to connect to. + # Array of hosts to connect to. + hosts: ["realstuff.psi.ch:9200"] + + # Optional protocol and basic auth credentials. + protocol: "https" + username: "beats_user" + password: "beats123" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + + + # Optional protocol and basic auth credentials. + #protocol: "https" + #username: "elastic" + #password: "changeme" + +#----------------------------- Logstash output -------------------------------- +#output.logstash: + # The Logstash hosts + #hosts: ["localhost:5044"] + + # Optional SSL. By default is off.
+ # List of root certificates for HTTPS server verifications + #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"] + + # Certificate for SSL client authentication + #ssl.certificate: "/etc/pki/client/cert.pem" + + # Client Certificate Key + #ssl.key: "/etc/pki/client/cert.key" + +#================================ Processors ===================================== + +processors: + - add_observer_metadata: + # Optional, but recommended geo settings for the location Heartbeat is running in + #geo: + # Token describing this location + #name: us-east-1a + + # Lat, Lon " + #location: "37.926868, -78.024902" + +#================================ Logging ===================================== + +# Sets log level. The default log level is info. +# Available log levels are: error, warning, info, debug +#logging.level: debug + +# At debug level, you can selectively enable logging only for some components. +# To enable all selectors use ["*"]. Examples of other selectors are "beat", +# "publish", "service". +#logging.selectors: ["*"] + +#============================== Xpack Monitoring =============================== +# heartbeat can export internal metrics to a central Elasticsearch monitoring +# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The +# reporting is disabled by default. + +# Set to true to enable the monitoring reporter. +monitoring.enabled: true + +# Sets the UUID of the Elasticsearch cluster under which monitoring data for this +# Heartbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch +# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch. +#monitoring.cluster_uuid: +monitoring.cluster_uuid: "57-GhvUVR1WM1D-42XEFYg" + +# Uncomment to send the metrics to Elasticsearch. Most settings from the +# Elasticsearch output are accepted here as well. +# Note that the settings should point to your Elasticsearch *monitoring* cluster. 
+# Any setting that is not set is automatically inherited from the Elasticsearch +# output configuration, so if you have the Elasticsearch output configured such +# that it is pointing to your Elasticsearch monitoring cluster, you can simply +# uncomment the following line. +#monitoring.elasticsearch: + +#================================= Migration ================================== + +# This allows to enable 6.7 migration aliases +#migration.6_to_7.enabled: true diff --git a/operation-tools/templates/imageapi.application.properties b/operation-tools/templates/imageapi.application.properties new file mode 100644 index 0000000..4c00e40 --- /dev/null +++ b/operation-tools/templates/imageapi.application.properties @@ -0,0 +1 @@ +server.port=8080 diff --git a/operation-tools/templates/imageapi.service.j2 b/operation-tools/templates/imageapi.service.j2 new file mode 100644 index 0000000..89d48e0 --- /dev/null +++ b/operation-tools/templates/imageapi.service.j2 @@ -0,0 +1,15 @@ +[Unit] +Description=imageapi + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-11/bin/java \ +-Xms512m -Xmx2048m \ +-Dspring.config.location=/etc/imageapi/application.properties \ +-jar /opt/imageapi/{{imageapi_version}}/lib/imageapi-{{imageapi_version}}-all.jar + +Restart=on-failure +RestartSec=10s + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/iodata.service.j2 b/operation-tools/templates/iodata.service.j2 new file mode 100644 index 0000000..507bd85 --- /dev/null +++ b/operation-tools/templates/iodata.service.j2 @@ -0,0 +1,12 @@ +[Unit] +Description=iodata + +[Service] +User=daqusr +ExecStart=/usr/lib/jvm/java-11/bin/java -Dspring.config.location=/etc/iodata/application.properties \ + -jar /opt/iodata/latest/lib/iodata-{{query_node_version}}-all.jar +Restart=on-failure +RestartSec=3s + +[Install] +WantedBy=multi-user.target diff --git a/operation-tools/templates/iodata_application.properties b/operation-tools/templates/iodata_application.properties new file 
mode 100644 index 0000000..3e9039b --- /dev/null +++ b/operation-tools/templates/iodata_application.properties @@ -0,0 +1,8 @@ +rootDir=/gpfs/sf-data/sf-imagebuffer +baseKeyspaceName=daq_swissfel +binSize=3600000 +# binSize=86400000 +# Not used right now +nodeId=1 + +spring.mvc.async.request-timeout = 3600000 \ No newline at end of file diff --git a/operation-tools/templates/journalbeat.yml b/operation-tools/templates/journalbeat.yml new file mode 100644 index 0000000..5aa5d8a --- /dev/null +++ b/operation-tools/templates/journalbeat.yml @@ -0,0 +1,192 @@ +###################### Journalbeat Configuration Example ######################### + +# This file is an example configuration file highlighting only the most common +# options. The journalbeat.reference.yml file from the same directory contains all the +# supported options with more comments. You can use it as a reference. +# +# You can find the full configuration reference here: +# https://www.elastic.co/guide/en/beats/journalbeat/index.html + +# For more available modules and options, please see the journalbeat.reference.yml sample +# configuration file. + +#=========================== Journalbeat inputs ============================= + +journalbeat.inputs: + # Paths that should be crawled and fetched. Possible values files and directories. + # When setting a directory, all journals under it are merged. + # When empty starts to read from local journal. +- paths: [] + + # The number of seconds to wait before trying to read again from journals. + #backoff: 1s + # The maximum number of seconds to wait before attempting to read again from journals. + #max_backoff: 20s + + # Position to start reading from journal. Valid values: head, tail, cursor + seek: cursor + # Fallback position if no cursor data is available. + #cursor_seek_fallback: head + + # Exact matching for field values of events. 
+ # Matching for nginx entries: "systemd.unit=nginx" + #include_matches: [] + include_matches: + - "systemd.unit=daq-dispatcher-node.service" + - "systemd.unit=daq-query-node.service" + - "systemd.unit=imageapi.service" + + # Optional fields that you can specify to add additional information to the + # output. Fields can be scalar values, arrays, dictionaries, or any nested + # combination of these. + #fields: + # env: staging + + +#========================= Journalbeat global options ============================ +#journalbeat: + # Name of the registry file. If a relative path is used, it is considered relative to the + # data path. + #registry_file: registry + +#==================== Elasticsearch template setting ========================== +setup.template.settings: + index.number_of_shards: 1 + #index.codec: best_compression + #_source.enabled: false + +#================================ General ===================================== + +# The name of the shipper that publishes the network data. It can be used to group +# all the transactions sent by a single shipper in the web interface. +#name: + +# The tags of the shipper are included in their own field with each +# transaction published. +#tags: ["service-X", "web-tier"] + +# Optional fields that you can specify to add additional information to the +# output. +#fields: +# env: staging + + +#============================== Dashboards ===================================== +# These settings control loading the sample dashboards to the Kibana index. Loading +# the dashboards is disabled by default and can be enabled either by setting the +# options here or by using the `setup` command. +#setup.dashboards.enabled: false + +# The URL from where to download the dashboards archive. By default this URL +# has a value which is computed based on the Beat name and version. For released +# versions, this URL points to the dashboard archive on the artifacts.elastic.co +# website. 
+#setup.dashboards.url: + +#============================== Kibana ===================================== + +# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API. +# This requires a Kibana endpoint configuration. +setup.kibana: + + # Kibana Host + # Scheme and port can be left out and will be set to the default (http and 5601) + # In case you specify an additional path, the scheme is required: http://localhost:5601/path + # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601 + host: "https://realstuff.psi.ch:5601" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + + # Kibana Space ID + # ID of the Kibana Space into which the dashboards should be loaded. By default, + # the Default Space will be used. + #space.id: + +#============================= Elastic Cloud ================================== + +# These settings simplify using Journalbeat with the Elastic Cloud (https://cloud.elastic.co/). + +# The cloud.id setting overwrites the `output.elasticsearch.hosts` and +# `setup.kibana.host` options. +# You can find the `cloud.id` in the Elastic Cloud web UI. +#cloud.id: + +# The cloud.auth setting overwrites the `output.elasticsearch.username` and +# `output.elasticsearch.password` settings. The format is `<user>:<pass>`. +#cloud.auth: + +#================================ Outputs ===================================== + +# Configure what output to use when sending the data collected by the beat. + +#-------------------------- Elasticsearch output ------------------------------ +output.elasticsearch: + # Array of hosts to connect to. + hosts: ["realstuff.psi.ch:9200"] + pipeline: "imagebuffer-log-pipeline" + + # Optional protocol and basic auth credentials.
+ protocol: "https" + username: "beats_user" + password: "beats123" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + +#----------------------------- Logstash output -------------------------------- +#output.logstash: + # The Logstash hosts + #hosts: ["localhost:5044"] + + # Optional SSL. By default is off. + # List of root certificates for HTTPS server verifications + #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"] + + # Certificate for SSL client authentication + #ssl.certificate: "/etc/pki/client/cert.pem" + + # Client Certificate Key + #ssl.key: "/etc/pki/client/cert.key" + +#================================ Processors ===================================== + +# Configure processors to enhance or manipulate events generated by the beat. + +processors: + - add_host_metadata: ~ + - add_cloud_metadata: ~ + +#================================ Logging ===================================== + +# Sets log level. The default log level is info. +# Available log levels are: error, warning, info, debug +#logging.level: debug + +# At debug level, you can selectively enable logging only for some components. +# To enable all selectors use ["*"]. Examples of other selectors are "beat", +# "publish", "service". +#logging.selectors: ["*"] + +#============================== Xpack Monitoring =============================== +# journalbeat can export internal metrics to a central Elasticsearch monitoring +# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The +# reporting is disabled by default. + +# Set to true to enable the monitoring reporter. +monitoring.enabled: true + +# Sets the UUID of the Elasticsearch cluster under which monitoring data for this +# Journalbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch +# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch. 
+monitoring.cluster_uuid: "57-GhvUVR1WM1D-42XEFYg" + +# Uncomment to send the metrics to Elasticsearch. Most settings from the +# Elasticsearch output are accepted here as well. +# Note that the settings should point to your Elasticsearch *monitoring* cluster. +# Any setting that is not set is automatically inherited from the Elasticsearch +# output configuration, so if you have the Elasticsearch output configured such +# that it is pointing to your Elasticsearch monitoring cluster, you can simply +# uncomment the following line. +#monitoring.elasticsearch: + +#================================= Migration ================================== + +# This allows to enable 6.7 migration aliases +#migration.6_to_7.enabled: true diff --git a/operation-tools/templates/metricbeat.yml b/operation-tools/templates/metricbeat.yml new file mode 100644 index 0000000..cdfd461 --- /dev/null +++ b/operation-tools/templates/metricbeat.yml @@ -0,0 +1,163 @@ +###################### Metricbeat Configuration Example ####################### + +# This file is an example configuration file highlighting only the most common +# options. The metricbeat.reference.yml file from the same directory contains all the +# supported options with more comments. You can use it as a reference. 
+# +# You can find the full configuration reference here: +# https://www.elastic.co/guide/en/beats/metricbeat/index.html + +#========================== Modules configuration ============================ + +metricbeat.config.modules: + # Glob pattern for configuration loading + path: ${path.config}/modules.d/*.yml + + # Set to true to enable config reloading + reload.enabled: false + + # Period on which files under path should be checked for changes + #reload.period: 10s + +#==================== Elasticsearch template setting ========================== + +setup.template.settings: + index.number_of_shards: 1 + index.codec: best_compression + #_source.enabled: false + +#================================ General ===================================== + +# The name of the shipper that publishes the network data. It can be used to group +# all the transactions sent by a single shipper in the web interface. +#name: + +# The tags of the shipper are included in their own field with each +# transaction published. +#tags: ["service-X", "web-tier"] +tags: ["swissfel", "daq", "databuffer", "linux"] + +# Optional fields that you can specify to add additional information to the +# output. +#fields: +# env: staging + + +#============================== Dashboards ===================================== +# These settings control loading the sample dashboards to the Kibana index. Loading +# the dashboards is disabled by default and can be enabled either by setting the +# options here or by using the `setup` command. +#setup.dashboards.enabled: false + +# The URL from where to download the dashboards archive. By default this URL +# has a value which is computed based on the Beat name and version. For released +# versions, this URL points to the dashboard archive on the artifacts.elastic.co +# website. +#setup.dashboards.url: + +#============================== Kibana ===================================== + +# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API. 
+# This requires a Kibana endpoint configuration. +setup.kibana: + + # Kibana Host + # Scheme and port can be left out and will be set to the default (http and 5601) + # In case you specify an additional path, the scheme is required: http://localhost:5601/path + # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601 + host: "https://realstuff.psi.ch:5601" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + + # Kibana Space ID + # ID of the Kibana Space into which the dashboards should be loaded. By default, + # the Default Space will be used. + #space.id: + +#============================= Elastic Cloud ================================== + +# These settings simplify using Metricbeat with the Elastic Cloud (https://cloud.elastic.co/). + +# The cloud.id setting overwrites the `output.elasticsearch.hosts` and +# `setup.kibana.host` options. +# You can find the `cloud.id` in the Elastic Cloud web UI. +#cloud.id: + +# The cloud.auth setting overwrites the `output.elasticsearch.username` and +# `output.elasticsearch.password` settings. The format is `<user>:<pass>`. +#cloud.auth: + +#================================ Outputs ===================================== + +# Configure what output to use when sending the data collected by the beat. + +#-------------------------- Elasticsearch output ------------------------------ +output.elasticsearch: + # Array of hosts to connect to. + hosts: ["realstuff.psi.ch:9200"] + + # Optional protocol and basic auth credentials. + protocol: "https" + username: "beats_user" + password: "beats123" + ssl.certificate_authorities: ["/etc/pki/tls/certs/elastic-stack-ca.pem"] + +#----------------------------- Logstash output -------------------------------- +#output.logstash: + # The Logstash hosts + #hosts: ["localhost:5044"] + + # Optional SSL. By default is off.
+ # List of root certificates for HTTPS server verifications + #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"] + + # Certificate for SSL client authentication + #ssl.certificate: "/etc/pki/client/cert.pem" + + # Client Certificate Key + #ssl.key: "/etc/pki/client/cert.key" + +#================================ Processors ===================================== + +# Configure processors to enhance or manipulate events generated by the beat. + +processors: + - add_host_metadata: ~ + - add_cloud_metadata: ~ + +#================================ Logging ===================================== + +# Sets log level. The default log level is info. +# Available log levels are: error, warning, info, debug +#logging.level: debug + +# At debug level, you can selectively enable logging only for some components. +# To enable all selectors use ["*"]. Examples of other selectors are "beat", +# "publish", "service". +#logging.selectors: ["*"] + +#============================== Xpack Monitoring =============================== +# metricbeat can export internal metrics to a central Elasticsearch monitoring +# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The +# reporting is disabled by default. + +# Set to true to enable the monitoring reporter. +monitoring.enabled: true + +# Sets the UUID of the Elasticsearch cluster under which monitoring data for this +# Metricbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch +# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch. +monitoring.cluster_uuid: "57-GhvUVR1WM1D-42XEFYg" + +# Uncomment to send the metrics to Elasticsearch. Most settings from the +# Elasticsearch output are accepted here as well. +# Note that the settings should point to your Elasticsearch *monitoring* cluster. 
+# Any setting that is not set is automatically inherited from the Elasticsearch +# output configuration, so if you have the Elasticsearch output configured such +# that it is pointing to your Elasticsearch monitoring cluster, you can simply +# uncomment the following line. +#monitoring.elasticsearch: + +#================================= Migration ================================== + +# This allows to enable 6.7 migration aliases +#migration.6_to_7.enabled: true diff --git a/operation-tools/templates/reachable.icmp.yml b/operation-tools/templates/reachable.icmp.yml new file mode 100644 index 0000000..bfd7c62 --- /dev/null +++ b/operation-tools/templates/reachable.icmp.yml @@ -0,0 +1,52 @@ +# These files contain a list of monitor configurations identical +# to the heartbeat.monitors section in heartbeat.yml +# The .example extension on this file must be removed for it to +# be loaded. + +- type: icmp # monitor type `icmp` (requires root) uses ICMP Echo Request to ping + # configured hosts + + # Monitor name used for job name and document type. + #name: icmp + + # Enable/Disable monitor + #enabled: true + + # Configure task schedule using cron-like syntax + schedule: '*/5 * * * * * *' # exactly every 5 seconds like 10:00:00, 10:00:05, ... + + # List of hosts to ping + hosts: ["data-api.psi.ch", "sf-data-api.psi.ch"] + + # Configure IP protocol types to ping on if hostnames are configured. + # Ping all resolvable IPs if `mode` is `all`, or only one IP if `mode` is `any`. + ipv4: true + ipv6: true + mode: any + + # Total running time per ping test. + timeout: 16s + + # Waiting duration until another ICMP Echo Request is emitted. + wait: 1s + + # The tags of the monitors are included in their own field with each + # transaction published. Tags make it easy to group servers by different + # logical properties. + #tags: ["service-X", "web-tier"] + + # Optional fields that you can specify to add additional information to the + # monitor output. 
Fields can be scalar values, arrays, dictionaries, or any nested + # combination of these. + #fields: + # env: staging + + # If this option is set to true, the custom fields are stored as top-level + # fields in the output document instead of being grouped under a fields + # sub-dictionary. Default is false. + #fields_under_root: false +- type: tcp + name: tcp + enabled: true + schedule: '@every 10s' + hosts: ["data-api.psi.ch:22", "sf-data-api.psi.ch:22"] \ No newline at end of file diff --git a/operation-tools/templates/system.yml b/operation-tools/templates/system.yml new file mode 100644 index 0000000..bbd2e23 --- /dev/null +++ b/operation-tools/templates/system.yml @@ -0,0 +1,41 @@ +# Module: system +# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.3/metricbeat-module-system.html + +- module: system + period: 10s + metricsets: + - cpu + - load + - memory + - network + - process + - process_summary + - socket_summary + - entropy + - core + - diskio + - socket + cpu.metrics: ["percentages","normalized_percentages"] + process.include_top_n: + by_cpu: 5 # include top 5 processes by CPU + by_memory: 5 # include top 5 processes by memory + +- module: system + period: 1m + metricsets: + - filesystem + - fsstat + processors: + - drop_event.when.regexp: + system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)' + +- module: system + period: 15m + metricsets: + - uptime + +#- module: system +# period: 5m +# metricsets: +# - raid +# raid.mount_point: '/' diff --git a/operation-tools/uninstall_dispatcher_node.yml b/operation-tools/uninstall_dispatcher_node.yml new file mode 100644 index 0000000..89bbb41 --- /dev/null +++ b/operation-tools/uninstall_dispatcher_node.yml @@ -0,0 +1,20 @@ +- hosts: databuffer_cluster + become: true + tasks: + - name: Make sure the daq-dispatcher-node is stopped and disabled + systemd: + enabled: no + state: stopped + name: daq-dispatcher-node + - name: remove systemd file + file: + path: 
/etc/systemd/system/daq-dispatcher-node.service + state: absent + - name: Reload systemd unit files + systemd: + daemon_reload: yes + + - name: Remove deployment directory + file: + path: /opt/dispatcher_node + state: absent diff --git a/operation-tools/uninstall_query_node.yml b/operation-tools/uninstall_query_node.yml new file mode 100644 index 0000000..6e7457f --- /dev/null +++ b/operation-tools/uninstall_query_node.yml @@ -0,0 +1,21 @@ +- hosts: databuffer_cluster + become: true + tasks: + - name: Make sure the daq-query-node is stopped and disabled + systemd: + enabled: no + state: stopped + name: daq-query-node + - name: remove systemd file + file: + path: /etc/systemd/system/daq-query-node.service + state: absent + - name: Reload systemd unit files + systemd: + daemon_reload: yes + + - name: Remove deployment directory + file: + path: /opt/query_node + state: absent + diff --git a/operation-tools/update_cluster.yml b/operation-tools/update_cluster.yml new file mode 100644 index 0000000..34fe444 --- /dev/null +++ b/operation-tools/update_cluster.yml @@ -0,0 +1,28 @@ +- name: stop nodes + hosts: databuffer_cluster + become: true + tasks: + - name: stop daq-dispatcher-node + systemd: + state: stopped + name: daq-dispatcher-node + - name: stop daq-query-node + systemd: + state: stopped + name: daq-query-node + - name: Remove sources + file: + path: /home/daqusr/.config/daq/stores/sources + state: absent + - name: Remove streamers + file: + path: /home/daqusr/.config/daq/stores/streamers + state: absent + +- import_playbook: uninstall_query_node.yml +- import_playbook: uninstall_dispatcher_node.yml + +- import_playbook: install_query_node_jdk8.yml +- import_playbook: install_dispatcher_node_jdk8.yml + +- import_playbook: restart_cluster.yml
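A check that both services are running again after `update_cluster.yml` has finished could be sketched as follows. This is only an illustration under stated assumptions: the file name `verify_cluster.yml`, the play name, and the task names are hypothetical and not part of this change; the unit names are taken from the playbooks above, and the `service_facts` and `assert` modules are standard Ansible built-ins.

```yaml
# verify_cluster.yml (hypothetical file, not part of this change)
- name: verify nodes
  hosts: databuffer_cluster
  become: true
  tasks:
    # Populates ansible_facts.services, keyed by unit name (e.g. "sshd.service")
    - name: Gather service facts
      ansible.builtin.service_facts:

    # Fails on any host where one of the two units is not in state "running"
    - name: Check that both DAQ services are running
      ansible.builtin.assert:
        that:
          - ansible_facts.services['daq-dispatcher-node.service'].state == 'running'
          - ansible_facts.services['daq-query-node.service'].state == 'running'
        fail_msg: "At least one DAQ service is not running on {{ inventory_hostname }}"
```

Such a play would be run with `ansible-playbook` against the same inventory as the other playbooks in this folder. It only reads service state and changes nothing on the nodes, so it should be safe to repeat.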