![Page 1: Real-time monitoring of operational data in the Belle II experiment … · 2020. 10. 22. · for real-time monitoring of the Belle II DAQ system • Working smoothly, making good](https://reader035.vdocuments.us/reader035/viewer/2022071502/6122588a255888673c4108b7/html5/thumbnails/1.jpg)
Real-time monitoring of operational data
in the Belle II experiment
Takuto KUNIGO on behalf of the Belle II collaboration
22nd IEEE Real Time Conference
Poster session-B

SuperKEKB
4 GeV e+ / 7 GeV e- collider
World record achieved in June 2020: L = 2.4 × 10^34 cm^-2 s^-1
Goal: 50 ab^-1 (= Belle × 50)

Belle II detector
[Figure: Belle II detector. Sub-detectors, from inner to outer: Vertex detector, Tracking, PID, Calorimeter, Muon spectrometer. Beams: e- (7 GeV), e+ (4 GeV)]
Search for new physics beyond the SM via high-precision measurements with high-statistics samples of B/D/tau decays

Belle II DAQ system
Many components ➡ need to monitor each component carefully

Monitoring system
System overview
• Input data: Log messages, PC usages, SuperKEKB status, Trigger and DAQ condition, …
• Database: Integrated database ‘Elasticsearch’
• Applications and users:
- Web-interface ‘Kibana’ visualisations (e.g. efficiency): CR shifters, sub-system experts
- Alerting framework (Alerts for ‘ABCD’, check ‘XYWZ’): control room shifters
- Offline analyses (for example, publication-quality plots via ROOT): collaborators
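
As a rough illustration of how input data could reach the integrated database, the sketch below serialises log messages into the newline-delimited JSON body accepted by the Elasticsearch `_bulk` endpoint. The index name `daq-logs-example`, the field names and the host name are assumptions made for this example; the slides do not describe the actual ingestion chain.

```python
import json
from datetime import datetime, timezone


def log_to_doc(timestamp, host, level, message):
    """Turn one log message into a JSON-serialisable document (field names are hypothetical)."""
    return {
        "@timestamp": timestamp.isoformat(),
        "host": host,
        "level": level,
        "message": message,
    }


def to_bulk_ndjson(index_name, docs):
    """Build a _bulk request body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Invented example message, not a real Belle II log line.
docs = [
    log_to_doc(
        datetime(2020, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
        "hlt01",
        "ERROR",
        "example message",
    )
]
payload = to_bulk_ndjson("daq-logs-example", docs)
```

The resulting `payload` string would be sent in an HTTP POST to the `_bulk` endpoint; the transport itself is left out to keep the sketch self-contained.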

1: Visualisations on Kibana
[Screenshot: Kibana dashboard "DAQ efficiency"]
Main links: Main page | HLT | COPPER | Network | Data transfer | Event size | Deadtime | Run registry | DAQ efficiency | Log summary | Daily summary | Error summary
Header for DAQ efficiency
Definition: Confluence page "DAQ efficiency"
1. This is a draft, not finalised yet
2. If you have any comments or suggestions, please post them on the Confluence page or send me an e-mail (Takuto KUNIGO)
Panels: 1. Run time breakdown 2. Run-time fraction 3. Deadtime fraction
[Figure: dashboard panels integrated over the selected time range ("All docs"): fractional ratio [%] (two panels) and dead time [%]]
[Figure: dashboard panels binned in date per 12 hours, 2020-03-22 to 2020-05-31: duration [sec], fractional ratio [%] and dead time [%]]
Legend, machine-time categories: [1] Belle II ph…, [2] Accelera…, [3] Accelera…, [4] Beam inj…, [5] Accelera…
Legend, data-taking categories: (a) HV ramp…, (b) SALS, (c) HV trip o…, (d) Belle II tr…, Running
Legend, deadtime sources: PXD, SVD, CDC, TOP, ARICH, ECL, KLM, TRG, ttlost, COPPER, FIFO full, PAUSE, Pipeline, APV VETO, Injection VE…
Efficiencies integrated over the time-range
① Machine-time breakdown ② Belle-II data-taking breakdown ③ Deadtime breakdown
Efficiencies binned in date (per 12 hours)
① Machine-time breakdown ② Belle-II data-taking breakdown ③ Deadtime breakdown
Plot annotations: "for physics data-taking", "Belle-II running"
Example: DAQ efficiency
Goals
- to categorise the inefficiency sources
- to choose the time range freely
and… many monitoring plots (event size, network traffic, errors, etc.)
Efficiency definitions
1. Luminosity-based
2. Kibana
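
The fractional breakdowns shown on the dashboard can be thought of as durations per category divided by the total time in the selected range. The sketch below shows only that arithmetic; the category names and the numbers are invented, and it does not reproduce either of the two efficiency definitions named above.

```python
def fractional_breakdown(durations_sec):
    """Convert per-category durations [sec] into fractional ratios [%] of the total."""
    total = sum(durations_sec.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {name: 100.0 * duration / total for name, duration in durations_sec.items()}


# Hypothetical 12-hour bin (43200 s); the categories are placeholders.
example_bin = {"running": 36000.0, "restart": 2700.0, "other": 4500.0}
fractions = fractional_breakdown(example_bin)
```

Because the function takes whatever durations fall in the chosen window, the same code serves both the integrated and the per-12-hours views.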

2: Alerting system
Elastalert
• is a third-party tool that allows us to implement alerting functions
• Alert destinations: RocketChat, e-mail, SNS message, etc.
Example: Automatic advice for the control room shifters via RocketChat (chat tool)
Elasticsearch (database) ⬅ Elastalert: checks whether any rules are satisfied
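
To make "check whether any rules are satisfied" concrete, here is a plain-Python sketch of a frequency-style rule: it fires when at least `num_events` matching messages fall within a time window of length `timeframe`. This is a re-implementation of the idea for illustration, not Elastalert's own code or configuration, and the timestamps are invented.

```python
from datetime import datetime, timedelta


def frequency_rule_fires(timestamps, num_events, timeframe):
    """Return True if any window of length `timeframe` holds at least `num_events` timestamps."""
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Shrink the window from the left until it spans at most `timeframe`.
        while current - ordered[start] > timeframe:
            start += 1
        if end - start + 1 >= num_events:
            return True
    return False


# Invented error timestamps: three within 20 s, one much later.
t0 = datetime(2020, 5, 1, 12, 0, 0)
error_times = [t0, t0 + timedelta(seconds=10), t0 + timedelta(seconds=20), t0 + timedelta(seconds=300)]

fires_3_in_60s = frequency_rule_fires(error_times, 3, timedelta(seconds=60))
fires_4_in_60s = frequency_rule_fires(error_times, 4, timedelta(seconds=60))
```

When a rule fires, the alerting tool would then post a message to the configured destination, for example the chat channel read by the shifters.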

3: Offline analyses
[Figure: HLT flow [MB/s] versus recorded luminosity [10^33 cm^-2 s^-1]]
• Basic plot to evaluate the current HLT performance
• Various metrics are stored in Elasticsearch ➡ We try to evaluate the beam-background-induced contribution
Elasticsearch (database) ➡ ROOT files
[Figure: HLT flow [MB/s] versus recorded luminosity [10^33 cm^-2 s^-1], with extrapolation]
Extrapolate to (for example) the design luminosity
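
A minimal sketch of such an extrapolation, assuming for illustration that the HLT flow grows linearly with the recorded luminosity (the slides do not state the fit model). The data points and the value of `target_lumi` are invented placeholders, not Belle II measurements or the actual design luminosity.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Invented points: luminosity in units of 10^33 cm^-2 s^-1, flow in MB/s.
lumi = [2.0, 6.0, 10.0, 14.0, 18.0]
flow = [140.0, 220.0, 300.0, 380.0, 460.0]
slope, intercept = fit_line(lumi, flow)

target_lumi = 100.0  # placeholder value, same units as `lumi`
predicted_flow = extrapolate(slope, intercept, target_lumi)
```

In this picture the intercept would be the luminosity-independent part of the flow, and the slope the part that scales with luminosity; separating a beam-background-induced contribution would need additional inputs.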

Plan: Root cause analysis
Many log messages ➡ Visualisations ➡ Quick diagnosis
Visualisation of error “propagation”: a powerful tool to find the root cause of problems
[Figure: Error/Fatal messages versus timestamp, annotated "Root cause" and "CR shifter noticed". Example only: these are not real Belle II log messages]
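
One simple way to read such a propagation plot programmatically is to order the sources by the time of their first error inside a lookback window before the shifter noticed the problem; the earliest source is then a root-cause candidate. This heuristic, the function name and the messages below are assumptions made for illustration, not the analysis planned by the authors.

```python
from datetime import datetime, timedelta


def propagation_order(messages, noticed_at, lookback):
    """Return (source, first_error_time) pairs sorted by time, within the lookback window."""
    window_start = noticed_at - lookback
    first_seen = {}
    for timestamp, source, _text in messages:
        if window_start <= timestamp <= noticed_at:
            if source not in first_seen or timestamp < first_seen[source]:
                first_seen[source] = timestamp
    return sorted(first_seen.items(), key=lambda item: item[1])


# Invented messages from hypothetical components A, B and C.
noticed = datetime(2020, 5, 1, 12, 5, 0)
messages = [
    (noticed - timedelta(seconds=30), "A", "error in A"),
    (noticed - timedelta(seconds=90), "B", "error in B"),
    (noticed - timedelta(seconds=60), "C", "error in C"),
    (noticed - timedelta(seconds=20), "B", "error in B again"),
    (noticed - timedelta(hours=2), "A", "old, unrelated error"),
]
chain = propagation_order(messages, noticed, timedelta(minutes=10))
root_cause_candidate = chain[0][0]
```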

Plan: connection with recovery actions
[Diagram: two recovery timelines along the time axis]
• Recovery actions by experts: ERROR ➡ Contact (CR => expert) ➡ Full restart (“SALS”, ~3 mins: STOP, ABORT, LOAD, START) ➡ Resume
• Recovery by the CR shifter: ERROR ➡ STOP, START ➡ Resume, with a shorter recovery time
• The alert system automatically detects many problems
• The next step is to connect the error diagnosis with the appropriate recovery actions ➡ If it is implemented in a GUI, the CR shifter can take the recovery action directly, which reduces our recovery time

Schematic view
• Inputs to Elasticsearch: PCs, RunControl processes, Trigger and DAQ condition
• Elastalert checks Elasticsearch, e.g. if ping2COPPER1234 > 100 ms
• Alert messages on RocketChat
• EPICS PVs: ALARM:PXD:ABCD = 0, ALARM:SVD:ABCD = 0, …, ALARM:DAQ:COPPERDOWN:1234 = 1
• Shifter PC, "Currently activated alerts":
- PXD: None
- SVD: None
- ・・・
- COPPERDOWN: Recovery (e.g. restart.sh)
• CR shifters: ① Click ② Resume data-taking
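
The chain in this schematic can be sketched as two small steps: turn a monitored quantity into alarm PV values, then list the active alarms together with a recovery action for the shifter GUI. The PV naming pattern, the 100 ms threshold and `restart.sh` follow the example on the slide; the function names, the second COPPER ID and the lookup table are hypothetical, and no EPICS client is called here.

```python
PING_THRESHOLD_MS = 100.0  # threshold taken from the slide's example condition


def copper_alarm_pvs(ping_ms_by_copper):
    """Map COPPER IDs to alarm PV values: 1 if the ping time exceeds the threshold, else 0."""
    return {
        f"ALARM:DAQ:COPPERDOWN:{copper_id}": int(ping_ms > PING_THRESHOLD_MS)
        for copper_id, ping_ms in ping_ms_by_copper.items()
    }


def active_alerts(pvs, recovery_by_prefix):
    """Return (pv_name, recovery_action) for every PV whose value is 1."""
    active = []
    for pv_name, value in sorted(pvs.items()):
        if value != 1:
            continue
        recovery = None
        for prefix, action in recovery_by_prefix.items():
            if pv_name.startswith(prefix + ":"):
                recovery = action
                break
        active.append((pv_name, recovery))
    return active


# Invented ping times in ms; COPPER 1235 is a made-up healthy neighbour.
pvs = copper_alarm_pvs({"1234": 250.0, "1235": 0.4})
alerts = active_alerts(pvs, {"ALARM:DAQ:COPPERDOWN": "restart.sh"})
```

A GUI on the shifter PC could then show one button per entry in `alerts`, so that a single click starts the listed recovery action.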

Summary
Current status
• We have started using the Elastic Stack for real-time monitoring of the Belle II DAQ system
• Working smoothly, making good contributions to reduce downtime
• Three major applications:
1. Online visualisations on Kibana (e.g. data-taking efficiency)
2. Alerting system
3. Offline analyses
Future plan
• Root cause analysis
- It is worth trying machine learning on the Elastic Stack
• To connect the detection results with quick recovery actions
- Implementation is on-going