forked from archiver_config/sf_databuffer
updates
# Administration

If there are new changes to this configuration (either through a merge request or a direct commit), the configuration needs to be uploaded to the Data/ImageBuffer. To do so, clone or pull the latest changes from this repository and execute the `./bufferutils upload` script that comes with this repository (you have to be on a machine that has /opt/gfa/python available!).
## Uploading Sources

To upload and start recording of all configured sources use:

```bash
./bufferutils upload
```
## Checking for labeled sources

```bash
./bufferutils list --label
```

_Note:_ Labeled sources can be individually stopped and/or restarted via the stop/restart subcommands. A label can be attached to more than one source; in that case, a stop or restart affects all sources carrying that label.
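Since a label can match several sources, stopping by label removes all of them at once. A minimal sketch of that filter (the streams and the label are invented for the illustration; the logic mirrors the `remove_labeled_source` helper in the bufferutils script):

```python
# Hypothetical sources in the {"sources": [...]} layout used by the config files.
sources = {"sources": [
    {"stream": "tcp://cam1:9000", "labels": ["cameras"]},
    {"stream": "tcp://cam2:9000", "labels": ["cameras"]},
    {"stream": "tcp://bpm1:9000"},  # unlabeled source, never touched by label ops
]}

def remove_labeled_source(sources, label):
    """Drop every source carrying the given label (all matches, not just one)."""
    return {"sources": [s for s in sources["sources"]
                        if "labels" not in s or label not in s["labels"]]}

remaining = remove_labeled_source(sources, "cameras")
print([s["stream"] for s in remaining["sources"]])  # -> ['tcp://bpm1:9000']
```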
## Stopping a labeled source

```bash
./bufferutils stop --label <label>
```
## Restarting a labeled source

```bash
./bufferutils restart --label <label>
```
## Stopping sources by backend

Sources of a specific backend can be stopped like this (currently only the "sf-imagebuffer" backend is supported):

```bash
./bufferutils stop --backend sf-imagebuffer
```
# Configuration Management
The filter helpers in the `bufferutils` script operate on the parsed sources configuration:

```python
def remove_labeled_source(sources, label):
    return {"sources": [x for x in sources["sources"] if "labels" not in x or (label not in x['labels'])]}


def remove_backend_source(sources, backend):
    """
    Remove sources from a given backend
    :param sources:
    :param backend:
    :return: list of sources excluding the sources from the specified backend
    """
    return {"sources": [x for x in sources["sources"] if "backend" not in x or x['backend'] != backend]}


def get_labeled_sources(sources, label):
    return [x for x in sources["sources"] if "labels" in x and label in x['labels']]


def get_backend_sources(sources, backend):
    """
    Get the source(s) of a given backend
    :param sources:
    :param backend:
    :return: list of source configs belonging to the specified backend
    """
    return [x for x in sources["sources"] if "backend" in x and x['backend'] == backend]
```
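A quick illustration of how the backend filters behave on a toy configuration (the stream entries are invented for the example): a stop by backend first selects the matching sources, then builds a new configuration without them for upload.

```python
# Toy sources file (hypothetical entries) in the {"sources": [...]} layout.
sources = {"sources": [
    {"stream": "tcp://cam1:9000", "backend": "sf-imagebuffer"},
    {"stream": "tcp://bpm1:9000", "backend": "sf-databuffer"},
    {"stream": "tcp://bpm2:9000"},  # no "backend" key at all
]}

def get_backend_sources(sources, backend):
    # Same filter as in the bufferutils script: only entries of that backend.
    return [x for x in sources["sources"] if "backend" in x and x["backend"] == backend]

def remove_backend_source(sources, backend):
    # Entries without a "backend" key are kept, as are other backends.
    return {"sources": [x for x in sources["sources"] if "backend" not in x or x["backend"] != backend]}

selected = get_backend_sources(sources, "sf-imagebuffer")     # sources to stop
remaining = remove_backend_source(sources, "sf-imagebuffer")  # new config to upload
print(len(selected), len(remaining["sources"]))  # -> 1 2
```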
The stop subcommand takes the backend as an argparse option:

```python
    parser_stop.add_argument('-b',
                             '--backend',
                             default=None,
                             help="backend sources to stop")
```
When called with `--backend`, the stop subcommand filters out the matching sources and re-uploads the reduced configuration:

```python
        elif arguments.backend:
            backend = arguments.backend
            if backend != "sf-imagebuffer":
                logging.warning(f"Backend {backend} currently not supported")
                return
            logging.info(f"Stop: {backend}")

            policies = read_files(base_directory / Path("policies"), "policies")
            sources = read_files(base_directory / Path("sources"), "sources")

            # Only for debugging purposes
            image_sources = get_backend_sources(sources, backend)
            for s in image_sources:
                logging.info(f"Stop {s['stream']}")

            sources_new = remove_backend_source(sources, backend)

            # Stopping the removed source(s)
            upload_sources_and_policies(sources_new, policies)

        else:
            logging.warning("Not yet implemented")
            parser_stop.print_usage()
```
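The `-b/--backend` option handling can be exercised in isolation; a minimal sketch of the CLI surface (the parser and subcommand names here are reconstructions for the illustration, not the full bufferutils parser):

```python
import argparse

# Minimal reconstruction of the stop subcommand's argument surface.
parser = argparse.ArgumentParser(prog="bufferutils")
subparsers = parser.add_subparsers(dest="command")
parser_stop = subparsers.add_parser("stop", help="stop sources")
parser_stop.add_argument("-b", "--backend", default=None,
                         help="backend sources to stop")
parser_stop.add_argument("--label", default=None,
                         help="label that identifies the source(s) to stop")

arguments = parser.parse_args(["stop", "--backend", "sf-imagebuffer"])
print(arguments.command, arguments.backend)  # -> stop sf-imagebuffer
```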
- Sudo rights are needed on the databuffer/... servers
- Clone the [sf_databuffer](https://git.psi.ch/archiver_config/sf_databuffer.git) repository and switch to the `operation-tools` folder


# Checks
```bash
ansible -b databuffer -m shell -a "journalctl -n 50000 -u daq-dispatcher-node.service"
```


## Find Sources With Issues

Find sources with bsread level issues:

https://kibana.psi.ch/s/gfa/app/dashboards#/view/1b1e1bb0-ca94-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'Connection%20Errors',disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:bsread.error.type,negate:!t,params:!('0','1','2'),type:phrases,value:'0,%201,%202'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(bsread.error.type:'0')),(match_phrase:(bsread.error.type:'1')),(match_phrase:(bsread.error.type:'2'))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!f,title:'BSREAD%20Errors%20DataBuffer',viewMode:view)

The error numbers indicate the following errors:

3. 0-pulse / 0 globaltime
4. time out of valid timerange
5. duplicate pulse-id
6. pulse-id before last valid pulse-id
7. duplicate globaltimestamp
8. globaltimestamp before last globaltimestamp

To see connection errors as well, temporarily disable the "NOT Connection Errors" filter on the top left (click on the filter and select "Temporarily disable").

The error numbers used there are:

1. receiver connected
2. receiver stopped
3. reconnect


Find channels that are received from more than one source:

https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:'%22This%20is%20usually%20an%20indication%22'),sort:!(!('@timestamp',desc)))

Find channels that send corrupt MainHeader:

https://kibana.psi.ch/s/gfa/app/discover#/view/cb725720-ca89-11ea-bc5d-315bf3957d13?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(columns:!(bsread.error.type,bsread.source,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',key:systemd.unit,negate:!f,params:(query:daq-dispatcher-node.service),type:phrase),query:(match_phrase:(systemd.unit:daq-dispatcher-node.service)))),index:'2e885ee0-c5d0-11ea-82c0-2da95a58e9d4',interval:auto,query:(language:kuery,query:MainHeader),sort:!(!('@timestamp',desc)))

To get a detailed report use:

```bash
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 500 | sed -e 's#.*tcp://##' | grep -e '\w* - ' | sort | uniq | wc -l"
```

To just get the list of sources without the reason, use the following (send this list to sf-operation with the request that the people responsible fix their sources):

```bash
ansible databuffer -m shell -b -a "journalctl -u daq-dispatcher-node.service -n 5000 | sed -e 's#.*tcp://##' | grep -e '^[^ ]* - ' | sed -e 's# -.*##' | sort | uniq"
```
# Maintenance

## Restart Procedures

Following are restart procedures meant for emergency interventions on the SwissFEL DAQ system.

### Restart Data Retrieval

If there are issues with data retrieval (DataBuffer, ImageBuffer, Epics Channel Archiver) but all checks regarding the DataBuffer show normal operation, use this procedure to restart the SwissFEL data retrieval services. This only affects the data retrieval of SwissFEL at the time of the restart; there will be no interruption in the recording of the data.

- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes

```bash
cd operation-tools
```

```bash
ansible-playbook restart_dataretrieval.yml
```

### Restart Data Retrieval All

If the method above doesn't work, try to restart all of the data retrieval services via this procedure. This will not interrupt any data recording __but this restart will, besides SwissFEL, also affect the data retrieval of GLS, Hipa and Proscan__!

- login to sf-lca.psi.ch
- clone the databuffer repository (if you haven't yet - https://git.psi.ch/archiver_config/sf_databuffer.git), change to the `operation-tools` directory and/or pull the latest changes

```bash
cd operation-tools
```

```bash
ansible-playbook restart_dataretrieval_all.yml
```
### Restart ImageBuffer

If the DataBuffer looks healthy but the ImageBuffer seems to be in a buggy state, a restart of just the ImageBuffer can be triggered as follows:

- login to sf-lca.psi.ch (_sf-lca.psi.ch is the machine in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the repository directory and/or pull the latest changes

```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer
# and/or
git pull
```

- stop the sources belonging to the imagebuffer

```bash
./bufferutils stop --backend sf-imagebuffer
```

- change to the operation-tools directory and call the restart_imagebuffer playbook

```bash
cd operation-tools
ansible-playbook restart_imagebuffer.yml
```

- Afterwards restart the recording of the image sources:

```bash
cd ..
./bufferutils upload
```
### Restart DataBuffer Cluster

This is the procedure to follow to restart the DataBuffer in an emergency.

After checking whether the restart is really necessary do this:

- login to sf-lca.psi.ch (_sf-lca.psi.ch is the cluster in the machine network !!!!_)
- clone the databuffer repository (if you haven't yet), change to the operation-tools directory and/or pull the latest changes

```bash
git clone https://git.psi.ch/archiver_config/sf_databuffer.git
cd sf_databuffer/operation-tools
# and/or
git pull
```

```bash
ansible-playbook restart_cluster.yml
```

- Afterwards restart the recording again:

```bash
cd ..
./bufferutils upload
```
## Manual Restart Procedures (Experts Only)
### Restart query-node Services

Restart daq-query-node service:

```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-query-node.service"
```
__Important Note:__ To be able to start the query node processes the dispatcher nodes need to be up and running! After restarting all query nodes you have to restart the data-api service as well. A single restart of a Query Node server should work fine (as there is no complete shutdown of the Hazelcast cluster).
### Restart dispatcher-node Services

Restart daq-dispatcher-node service:

```bash
ansible databuffer_cluster --forks 1 -b -m shell -a "systemctl restart daq-dispatcher-node.service"
```
Excerpt from the ansible restart playbook:

```yaml
      path: /home/daqusr/.config/daq/stores/streamers
      state: absent

# IMPORTANT: It is necessary to bring up the dispatcher node processes first
# before starting the query node processes!

- name: start dispatcher nodes
  hosts: databuffer_cluster
```

```yaml
  hosts: data_api
  become: true
  tasks:
    - name: restart data-api
      systemd:
        state: restarted
        name: data-api


- name: restart dispatcher api
  hosts: dispatcher_api
  become: true
  tasks:
    - name: restart dispatcher-api
      systemd:
        state: restarted
        name: dispatcher-api
```