towards the integrated alice online-offline monitoring ... · 3.3 tb/s first level processors event...

18
Towards the integrated ALICE Online-Offline monitoring subsystem Adam Wegrzynek for the ALICE Collaboration 9-13 July 2018 | Sofia, BG

Upload: others

Post on 09-Jan-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Towards the integrated ALICE Online-Offline monitoring subsystemAdam Wegrzynekfor the ALICE Collaboration

9-13 July 2018 | Sofia, BG

Page 2: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

ALICE O2

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 2

3.3 TB/s

First Level Processors

Event Processing Nodes

500 GB/s

Storage

100 GB/s

Asynchronous post-processing

19 detectors

9000 fibers

270 nodes

1500 nodes

60 PB storage

Synchronous

Page 3: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Comparison

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 3

▶ Performance requirements

▶ Functional architecture

▶ Experience at CERN

Modular stack1. 2. 3.

Page 4: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

1. collectd▶ System performance metrics

▶ Hardware monitoring

2.▶ Metric routing

3.

▶ In memory data processing

1. Modular stack

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 4

4.▶ Time series database

5. )

▶ Visualization tool

6. Riemann

▶ Alarming

▶ Currently used at INFN Bari, CERN IT

Page 5: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

2. MonALISA

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 5

▶ Distributed data collector infrastructure

▶ Discovery mechanism

▶ Aggregation, filtering, alerts

▶ Real-time data distribution

▶ In memory buffers

▶ SQL database

▶ Currently used by ALICE OfflineCourtesy of Costin Grigoraş

Page 6: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

3. Zabbix

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 6

▶ Agent-server

▶ Push via Zabbix protocol

▶ Community support

▶ Currently used in ALICE HLT and DAQ for computing infrastructure

monitoring

Page 7: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 7

Modular Stack MonALISA Zabbix

Reference OS (CC7) Yes Yes Yes

Documentation Good Insufficient Good

Support and maintenance Yes Yes Yes

Running in isolation Yes Yes Yes

600 kHz rate Yes Yes No

Scalable >>600 kHz Yes Yes No

Handle 100k sources Yes Yes No

Storage size ~30 bytes ~75 bytes 90-500 bytes

Comparison table (1)

Page 8: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 8

Modular Stack MonALISA ZabbixFunctional arch.

System sensors Yes Yes Yes

Metric processing Batch and stream Stream Batch

Historical dashboard Yes Yes Yes

Real-time dashboard No (RFC) Yes (obsolete) No

Alarming Yes Yes Yes

Storage downsampling Yes Yes Yes

Comparison table (2)

Page 9: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Selection

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 9

Modular stack1. 2. 3.

Remains for Grid job monitoring

Page 10: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 10

Storage

Visualization

Real-time, historical

Monitoring backend

Flume

Collection,

processing

Riemann

Alarming

Grafana

Computing node

System

sensors

Application

monitoring

Processing

device

InfluxDB

Spark

Processing

device

Processing

device

Derived

Metric

Process

Monitor

Processing

device

Monitoring lib

CollectD

Modular stack metric flow

Page 11: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Monitoring library▶ Push metrics to a backend

▶Monitors the process

▶ Derived metrics

▶ Tags

▶ AliceO2Group/Monitoring

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 11

myMetric,0 10 1530099250985 hostname=test.cernrole=readoutdetector=TPC….name

type

value timestamp

tags

DerivedMetric

ProcessMonitor

Processingdevice

Monitoring lib

Page 12: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Flume routing

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 12

Courtesy of Gioacchino Vino

HTTP Source

UDP/JSON Source

MonALISASource

CollectD JSON Handler

Channel Selector

Memory Channel

IFQL UDPSink

WebSocketSink

HTTPSink

AvroSink

Page 13: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Spark jobs

▶ Higher level metrics

▶Written in Scala

▶ Operates on Flume events

▶ Configurable

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 13

Flume

Spark

UDPavro

Readout rate of a detector

Sum rate of each detector link

Page 14: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

InfluxDB timeseries storage

▶ Up to 700 kHz writes

▶ Continuous Queries

▶ Retention Policies

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 14

8 data streams

2x SSD drives RAID0, 25 GbE

Downsample high resolution data

(merge 12 points into 1 by applying average)

Drop high resolution data after 30 days

Keep low resolution data for 1 year

Page 15: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Grafana

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 15

Page 16: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Integration with O2 Software▶ Quality Control

▶ Data Processing Layer

Evolution of the ALICE Software Framework for LHC Run 3

Giulio Eulisse, Tuesday 10 July 14:15, Hall 3

▶ Readout

Readout software for the ALICE integrated Online-Offline (O2) system

Filippo Costa, Thursday 12 July 11:00, Hall 3.1

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 16

Page 17: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Conclusion▶ 3 options compared

▶ Modular Stack selected for O2 farm monitoring

▶ Defined interfaces between tools

▶ Deployed in the detector commissioning facilities

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 17

Page 18: Towards the integrated ALICE Online-Offline monitoring ... · 3.3 TB/s First Level Processors Event Processing Nodes 500 GB/s Storage 100 GB/s Asynchronous post-processing 19 detectors

Future steps

▶ Alarming

▶ Define thresholds and patterns

▶ Grafana real-time data source

▶ Display critical metrics in real time

▶ Sensors to custom hardware

▶Monitor status of custom FPGA board

Adam Wegrzynek | CHEP18 | Towards the integrated ALICE Online-Offline monitoring subsystem 18