2021-03-17 12:15:59 +01:00
parent 1aa631901b
commit 43f2b13186
5 changed files with 142 additions and 124 deletions

View File

@@ -38,7 +38,44 @@ More details on the gitutils command can be found at: https://gitutils.readthedo
# Administration
If there are new changes to this configuration (either through a merge request or a direct commit), the configuration needs to be uploaded to the Data/ImageBuffer. To do so, clone or pull the latest changes from this repository and execute the `./bufferutils upload` script that comes with this repository (you have to be on a machine that has /opt/gfa/python available!).
## Uploading Sources
To upload and start recording of all configured sources, use:
```bash
./bufferutils upload
```
## Checking for labeled sources
```bash
./bufferutils list --label
```
_Note:_ Labeled sources can be individually stopped and/or restarted via the stop/restart subcommands. A label can be attached to more than one source; in that case, stopping or restarting by label affects all sources carrying that label.
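For illustration, a labeled source entry (as the bufferutils helper code sees it) might look like the sketch below; the stream address, backend and label values are made up, only the field names are taken from the repository's filtering code:
```python
# Hypothetical source entry; values are illustrative, field names come from
# the filtering helpers in bufferutils ("stream", "backend", "labels").
source = {
    "stream": "tcp://example-camera:9999",   # stream address of the source
    "backend": "sf-imagebuffer",              # backend the source is recorded to
    "labels": ["example-label"],              # a label may be shared by several sources
}
```
Stopping or restarting `example-label` would then affect every source that lists it under `labels`.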
## Stopping a labeled source
```bash
./bufferutils stop --label <label>
```
## Restarting a labeled source
```bash
./bufferutils restart --label <label>
```
## Stopping sources by backend
Sources of a specific backend can be stopped like this (currently only the "sf-imagebuffer" backend is supported):
```bash
./bufferutils stop --backend sf-imagebuffer
```
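After the underlying issue is resolved, recording of the stopped sources can be resumed by uploading the configuration again (this is the same command shown under Uploading Sources):
```bash
./bufferutils upload
```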
# Configuration Management

View File

@@ -98,8 +98,14 @@ def remove_labeled_source(sources, label):
return {"sources": [x for x in sources["sources"] if "labels" not in x or (label not in x['labels'])]}
def remove_image_source(sources):
return {"sources": [x for x in sources["sources"] if "backend" not in x or x['backend'] != "sf-imagebuffer"]}
def remove_backend_source(sources, backend):
"""
Remove sources from a given backend
:param sources: sources configuration dict
:param backend: name of the backend whose sources should be removed
:return: sources dict excluding the sources of the specified backend
"""
return {"sources": [x for x in sources["sources"] if "backend" not in x or x['backend'] != backend]}
def get_labels(sources):
@@ -122,13 +128,13 @@ def get_labeled_sources(sources, label):
return [x for x in sources["sources"] if "labels" in x and label in x['labels']]
def get_image_sources(sources):
def get_backend_sources(sources, backend):
"""
Get source(s) of a given backend
:param sources: sources configuration dict
:param backend: name of the backend to select
:return: list of source configs belonging to the specified backend
"""
return [x for x in sources["sources"] if "backend" in x and x['backend'] == "sf-imagebuffer"]
return [x for x in sources["sources"] if "backend" in x and x['backend'] == backend]
def read_files(files_dir, file_type):
@@ -211,10 +217,10 @@ def main():
default=None,
help="label that identifies the source(s) to stop")
parser_stop.add_argument('-t',
'--type',
parser_stop.add_argument('-b',
'--backend',
default=None,
help="type of to stop")
help="backend sources to stop")
parser_list = subparsers.add_parser('list',
help="list",
@@ -291,27 +297,26 @@ def main():
# Stopping the removed source(s)
upload_sources_and_policies(sources_new, policies)
elif arguments.type:
type = arguments.type
if type != "image":
logging.warning(f"Type {type} currently not supported")
elif arguments.backend:
backend = arguments.backend
if backend != "sf-imagebuffer":
logging.warning(f"Type {backend} currently not supported")
return
logging.info(f"Stop: {type}")
logging.info(f"Stop: {backend}")
policies = read_files(base_directory / Path("policies"), "policies")
sources = read_files(base_directory / Path("sources"), "sources")
# Only for debugging purposes
image_sources = get_image_sources(sources)
image_sources = get_backend_sources(sources, backend)
for s in image_sources:
logging.info(f"Stop {s['stream']}")
sources_new = remove_image_source(sources)
sources_new = remove_backend_source(sources, backend)
# Stopping the removed source(s)
upload_sources_and_policies(sources_new, policies)
else:
logging.warning("Not yet implemented")
parser_stop.print_usage()
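For reference, a minimal usage sketch of the refactored helpers introduced above; the stream addresses and the second backend name are invented, only `sf-imagebuffer` appears in this repository:
```python
# Minimal sketch; values are illustrative only.
sources = {
    "sources": [
        {"stream": "tcp://camera-host:9999", "backend": "sf-imagebuffer"},
        {"stream": "tcp://bsread-host:9000", "backend": "other-backend", "labels": ["example"]},
    ]
}

# get_backend_sources() returns only the entries of the given backend ...
for s in get_backend_sources(sources, "sf-imagebuffer"):
    print(s["stream"])                          # -> tcp://camera-host:9999

# ... while remove_backend_source() returns a sources dict without them.
remaining = remove_backend_source(sources, "sf-imagebuffer")
print(len(remaining["sources"]))                # -> 1
```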

View File

@@ -36,7 +36,7 @@ To be able to execute the steps and commands outlined in this document following
```
- Sudo rights are needed on databuffer/... servers
- Clone the [ch.psi.daq.databuffer](https://git.psi.ch/sf_daq/ch.psi.daq.databuffer) repository and switch to the `operation-tools` folder
- Clone the [sf_databuffer](https://git.psi.ch/archiver_config/sf_databuffer.git) repository and switch to the `operation-tools` folder
# Checks
@@ -94,38 +94,49 @@ ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.se
```
## Find Sources With issues
## Find Sources With Issues
Sources with issues can be found like this:
Find sources with bsread-level issues:
https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)
```bash
ansible databuffer -m shell -b -a "journalctl -n 5000 -u daq-dispatcher-node.service | grep \" WARN \""
The error numbers indicate the following errors:
# not used any more
# ansible databuffer -m shell -a "tail -n 5000 /opt/dispatcher_node/latest/logs/data_validation.log | grep \"MainHeader\" | grep -Po \"(?<=')[^.']*(?=')\" | grep tcp | sort | uniq"
```
3. 0-pulse / 0 globaltime
4. time out of valid timerange
5. duplicate pulse-id
6. pulse-id before last valid pulse-id
7. duplicate globaltimestamp
8. globaltimestamp before last globaltimestamp
To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").
The error numbers used there are:
1. receiver connected
2. receiver stopped
3. reconnect
Find channels that are received from more than one source:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))
Find channels that send corrupt MainHeader:
https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))
To get a detailed report, use:
```bash
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 500 | sed -e 's#.*tcp://##' | grep -e '\w* - ' | sort | uniq | wc -l"
```
To just get the list of sources without the reason, use the following (send this list to sf-operation with the request that those responsible for the sources fix them):
```bash
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 5000 | sed -e 's#.*tcp://##' | grep -e '^[^ ]* - ' | sed -e 's# -.*##' | sort | uniq"
```
# Maintenance
## Emergency Restart Procedures
## Restart Procedures
The following are restart procedures meant for emergency interventions on the SwissFEL DAQ system.
### Restart Data Retrieval
If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This will only affect the data retrieval of SwissFEL at the time of the restart; there will be no interruption in the recording of the data.
- login to sflca
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet: https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes
```bash
cd operation-tools
```
@@ -138,8 +149,8 @@ ansible-playbook restart_dataretrieval.yml
### Restart Data Retrieval All
If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording, __but, besides SwissFEL, this restart will also affect the data retrieval of GLS, Hipa and Proscan__!
- login to sflca
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet: https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes
```bash
cd operation-tools
```
@@ -149,18 +160,47 @@ cd operation-tools
ansible-playbook restart_dataretrieval_all.yml
```
### Restart ImageBuffer
If the DataBuffer looks healthy but the ImageBuffer seems to be in a buggy state, a restart of only the ImageBuffer can be triggered as follows:
- login to sf-lca.psi.ch (_sf-lca.psi.ch is the machine in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the repository directory and/or pull the latest changes
```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer
# and/or
git pull
```
- stop the sources belonging to the imagebuffer
```bash
./bufferutils stop --backend sf-imagebuffer
```
- change to the `operation-tools` directory and run the restart_imagebuffer playbook
```bash
cd operation-tools
ansible-playbook restart_imagebuffer.yml
```
- Afterwards restart the recording of the image sources:
```bash
cd ..
./bufferutils upload
```
### Restart DataBuffer Cluster
This is the procedure to follow to restart the DataBuffer in an emergency.
After checking whether the restart is really necessary, do this:
- login to sflca (_sflca is cluster in the machine network !!!!_)
- login to sf-lca.psi.ch (_sf-lca.psi.ch is the machine in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes
```bash
git clone git@git.psi.ch:sf_daq/ch.psi.daq.databuffer.git
cd ch.psi.daq.databuffer/operation-tools
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer/operation-tools
# and/or
git pull
```
@@ -170,92 +210,16 @@ git pull
ansible-playbook restart_cluster.yml
```
- Afterwards start the recording again - you need to have cloned the sf_daq_sources git repo:
- Afterwards restart the recording again:
```bash
git clone git@git.psi.ch:sf_config/sf_daq_sources.git
cd sf_daq_sources
git pull
./upload.sh
cd ..
./bufferutils upload
```
## Full Restart Procedure
## Manual Restart Procedures (Experts Only)
Stop recording (via the stop.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
```bash
# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# git pull
# cd "sf_daq_sources"
./stop.sh
```
Stop data-api and dispatcher-api
```bash
ansible data_api -b -m shell -a "systemctl stop data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl stop dispatcher-api.service nginx"
```
Stop daq-dispatcher-node and daq-query-node services:
```bash
ansible databuffer_cluster -b -m shell -a "systemctl stop daq-dispatcher-node.service daq-query-node.service"
```
Remove configurations for local stream/recording restarts:
```bash
ansible databuffer_cluster -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
```
Start daq-dispatcher-node and daq-query-node services:
```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node.service"
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl start daq-query-node.service"
```
__Important Note:__ It is necessary to bring up the dispatcher node processes first before starting the query node processes!
Start data-api and dispatcher-api
```bash
ansible data_api -b -m shell -a "systemctl start data-api.service nginx"
ansible dispatcher_api -b -m shell -a "systemctl start dispatcher-api.service nginx"
```
After starting the dispatcher- and query-nodes, wait about 5 minutes until all the cluster discovery processes are finished and the cluster is up and running.
Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
```bash
# git clone git@git.psi.ch:sf_config/sf_daq_sources.git
# git pull
# cd "sf_daq_sources"
./upload.sh
# reload cache
curl -H "Content-Type: application/json" -X POST -d '{"reload": "true"}' https://data-api.psi.ch/sf/channels/config &>/dev/null
```
## Restart ImageBuffer (dispatcher nodes only)
Stop image recording
```bash
# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./stop_images.sh
```
Restart the involved services
```bash
ansible imagebuffer -b -m shell -a "systemctl stop daq-dispatcher-node"
ansible imagebuffer -b -m shell -a "rm -rf /home/daqusr/.config/daq/stores/sources; rm -rf /home/daqusr/.config/daq/stores/streamers"
ansible imagebuffer --forks 1 -b -m shell -a "systemctl start daq-dispatcher-node"
ansible dispatcher_api -b -m shell -a "systemctl restart dispatcher-api.service nginx"
ansible data_api -b -m shell -a "systemctl restart data-api.service nginx"
```
Start recording (via the upload.sh script at https://git.psi.ch/sf_config/sf_daq_sources)
```bash
# Execute this in the checkout of sf_daq_sources (git@git.psi.ch:sf_config/sf_daq_sources.git)
./upload.sh
```
## Restart query-node Services
### Restart query-node Services
Restart daq-query-node service:
```bash
@@ -265,7 +229,7 @@ ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query
__Important Note:__ To be able to start the query node processes, the dispatcher nodes need to be up and running! After restarting all query nodes, you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
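For example, the data-api (and its nginx) can be restarted with the same ansible pattern used elsewhere in this document:
```bash
ansible data_api -b -m shell -a "systemctl restart data-api.service nginx"
```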
## Restart dispatcher-node Services
### Restart dispatcher-node Services
Restart daq-dispatcher-node service:
```bash

View File

@@ -47,6 +47,8 @@
path: /home/daqusr/.config/daq/stores/streamers
state: absent
# IMPORTANT: It is necessary to bring up the dispatcher node processes first
# before starting the query node processes!
- name: start dispatcher nodes
hosts: databuffer_cluster

View File

@@ -54,7 +54,17 @@
hosts: data_api
become: true
tasks:
- name: start data-api
- name: restart data-api
systemd:
state: restarted
name: data-api
name: data-api
- name: restart dispatcher api
hosts: dispatcher_api
become: true
tasks:
- name: restart dispatcher-api
systemd:
state: restarted
name: dispatcher-api