Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen

CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

HEPiX Fall 2012 Workshop



Page 2: CERN Agile Infrastructure Monitoring

HEPiX Spring 2012
– High Level Architecture
– View of shared architecture

Lemon – LHC Era Monitoring System
– Is Lemon only about “performance monitoring”?
– Why architecture evolution rather than replacement by existing monitoring tool(s)?

Agile Infrastructure for Monitoring
– Shared Infrastructure
– Use cases: Data store, Visualization
– Event processing and management
– Status of the components

Page 3: Lemon – LHC Era Monitoring System

In-house developed, multi-component, client/server-based monitoring system

[Architecture diagram]
• Node Monitoring: Sensors, Monitoring Agent, Local Cache
• Measurement Repository: Repository Backend, Application Server, Oracle Database
• User Interfaces: Lemon CLI (command line tool to access data), Lemon-host-check (command line tool for node exceptions), Web Browser, RRD tool / Python, Apache / PHP
• Connection labels: TCP/UDP, SQL, HTTP

Individually configurable nodes with autonomous recovery actions

Chain of tools based on DB backend

Page 4: Lemon – Performance, application and facility monitoring

• Node monitoring, e.g. CPU Load
• Time-series processing
• Hierarchy clustering: Cluster, Sub-cluster, Node
• On behalf monitoring: Smart Power Distribution Units
• Historical data export

$\text{efficiency} = \frac{\text{ClusterA} - \text{ClusterB}}{\text{ClusterD}}$
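The hierarchy clustering and the derived efficiency metric on this slide can be illustrated with a short Python sketch. The hierarchy contents, the choice of a plain average, and all function names are assumptions made for illustration. The slide does not define ClusterA, ClusterB or ClusterD, so they are treated here as already aggregated cluster-level values.

```python
from statistics import mean

# Hypothetical hierarchy: cluster -> sub-cluster -> node -> latest CPU load sample.
HIERARCHY = {
    "batch": {
        "batch_a": {"node01": 0.5, "node02": 1.5},
        "batch_b": {"node03": 2.0},
    },
}


def subcluster_load(nodes):
    """Average a metric over the nodes of one sub-cluster."""
    return mean(nodes.values())


def cluster_load(subclusters):
    """Average a metric over every node of a cluster, across all sub-clusters."""
    values = [v for nodes in subclusters.values() for v in nodes.values()]
    return mean(values)


def efficiency(cluster_a, cluster_b, cluster_d):
    """Derived metric with the shape shown on the slide: (A - B) / D."""
    if cluster_d == 0:
        raise ValueError("ClusterD must be non-zero")
    return (cluster_a - cluster_b) / cluster_d
```

Averaging is only one possible roll-up; sums or maxima fit the same Cluster, Sub-cluster, Node structure.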

Page 5: Lemon – Service availability and alarming

Node monitoring
• Disk occupancy
• Number of processes
• Log file parse matched

Corrective action on the node
• Run script locally to clean var dir
• After 3rd attempt var occupancy > 90%

var_full alarm: System administrator, Support ticket

Monitoring repository export with guaranteed reliability and data processing, e.g. Service Level Status
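The corrective-action flow on this slide (clean up locally, alarm if occupancy is still above 90% after the third attempt) might look roughly like the sketch below. This is not Lemon's actual implementation. The function names, the callback design, and the alarm label are hypothetical, chosen so the logic can be exercised without touching a real file system.

```python
import shutil


def var_occupancy_percent(path="/var"):
    """Return the used fraction of the file system holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def handle_var_occupancy(read_occupancy, run_cleanup, raise_alarm,
                         threshold=90.0, max_attempts=3):
    """Run a local cleanup up to `max_attempts` times, then alarm.

    `read_occupancy`, `run_cleanup` and `raise_alarm` are caller-supplied
    callables (assumed interfaces). Returns the number of cleanup attempts.
    """
    attempts = 0
    while read_occupancy() > threshold:
        if attempts == max_attempts:
            # Still above the threshold after the last attempt: escalate.
            raise_alarm("var_full")
            break
        run_cleanup()
        attempts += 1
    return attempts
```

On a real node, `read_occupancy` could be `var_occupancy_percent`, and `raise_alarm` would hand over to whatever notifies the system administrator or opens a support ticket.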

Page 6: Lemon Monitoring @ Large scale

Experience
• No single solution replacement

Requirements
• Tool chain
  • e.g. data mining interface different from time series trending
• Flexible migration
  • e.g. compatible with Lemon node client
• Large scale ready
  • Current system: ~11k monitored entities, ~150 metrics/entity
  • Expected scale: ~300k entities
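The scale figures above can be turned into rough metric-stream counts. The snippet below is back-of-envelope arithmetic only. It assumes that the ~150 metrics per entity also holds at the expected scale, which the slide does not state.

```python
# Approximate figures quoted on the slide.
CURRENT_ENTITIES = 11_000
EXPECTED_ENTITIES = 300_000
METRICS_PER_ENTITY = 150  # assumed unchanged at the expected scale


def metric_streams(entities, metrics_per_entity=METRICS_PER_ENTITY):
    """Number of distinct (entity, metric) streams to collect and store."""
    return entities * metrics_per_entity


current_streams = metric_streams(CURRENT_ENTITIES)    # about 1.65 million
expected_streams = metric_streams(EXPECTED_ENTITIES)  # about 45 million
growth_factor = EXPECTED_ENTITIES / CURRENT_ENTITIES  # roughly 27x
```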

Page 7: Agile Infrastructure with performance monitoring

[Architecture diagram] Components shown:
• Lemon agent
• Lemon to messaging
• Message Bus
• Custom script
• Monitoring XYZ
• Visualization and correlation
• Data store
• Cluster processing (high load for >50% of cluster)
• Ticketing
• SMS gateway
• Dashboard
• Operations

Planned Components Views
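To make the message-bus idea concrete, here is a minimal in-process sketch of a cluster-processing consumer that raises an event when more than 50% of a cluster reports high load. The `MessageBus` class is a toy stand-in for a real broker. The topic names, message fields, and the load threshold are assumptions for illustration and do not come from the slides.

```python
from collections import defaultdict


class MessageBus:
    """Tiny in-process stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)


class ClusterLoadProcessor:
    """Publish an event when more than `fraction` of a cluster has high load."""

    def __init__(self, bus, cluster_nodes, load_threshold=10.0, fraction=0.5):
        self.bus = bus
        self.cluster_nodes = cluster_nodes  # cluster name -> set of node names
        self.load_threshold = load_threshold
        self.fraction = fraction
        self.latest = {}  # node name -> most recent load value
        bus.subscribe("metrics.load", self.on_metric)

    def on_metric(self, msg):
        self.latest[msg["node"]] = msg["load"]
        nodes = self.cluster_nodes[msg["cluster"]]
        high = sum(1 for n in nodes
                   if self.latest.get(n, 0.0) > self.load_threshold)
        if high > self.fraction * len(nodes):
            self.bus.publish("events.cluster",
                             {"cluster": msg["cluster"], "high_nodes": high})
```

Producers such as a node agent bridge or a custom script only need to publish to the bus. Consumers such as ticketing or a dashboard subscribe to the event topic, so either side can be swapped without touching the other.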

Page 8: Storing and visualization

[Diagram] Message Bus feeding the possible data store options:
• Oracle: Lemon web, RRD visualization
• NoSQL: Visualization, Data mining (batch processing)
• Splunk: Data mining, Visualization, Correlation

R&D on-going

Page 9: NoSQL-based data store for monitoring

Example from Data Storage Service
• Log parsing and processing based on the NoSQL DB
• Prototyped by CERN IT/DSS
• Shared infrastructure

Page 10: Splunk for data mining/visualization

High precision data mining
• In the current system solved by dedicated exports
• ~1.5 years of Lemon raw data (~4.5 TB in Oracle)
• ~2.5 TB Splunk data with metadata information (~43 billion entries)

One-year period of basic metrics per node: on-the-fly browsing capability with high time granularity

Under testing

Page 11: Example of Splunk Dashboard

Lemon data with entity cluster hierarchy
• Metric - Time - Match entity name
• Sum of running jobs over time split by entities

Under testing
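The aggregation behind such a panel (sum of running jobs over time, split by entity, filtered by an entity-name match) can be sketched in plain Python. This is not a Splunk query. The sample layout, the bucket size, and the entity names in the usage note are assumptions, and the sketch expects at most one sample per source per time bucket so that sums are not double counted.

```python
from collections import defaultdict
from fnmatch import fnmatch


def sum_by_time_and_entity(samples, bucket_seconds=3600, entity_pattern="*"):
    """Sum a metric per (time bucket, entity).

    `samples` is an iterable of (timestamp, entity, running_jobs) tuples,
    with integer timestamps in seconds. Entities whose name does not match
    the shell-style `entity_pattern` are skipped.
    """
    totals = defaultdict(int)
    for timestamp, entity, running_jobs in samples:
        if not fnmatch(entity, entity_pattern):
            continue
        bucket = timestamp - timestamp % bucket_seconds
        totals[(bucket, entity)] += running_jobs
    return dict(totals)
```

For example, calling it with `entity_pattern="lx*"` keeps only entities whose names start with `lx` (a hypothetical naming scheme) and returns one total per hour and entity.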

Page 12: Event processing and management concept

[Concept diagram]
• Monitoring infrastructure: Node monitoring, Metrics, Metric correlation
• Event processing: e.g. Heartbeat checking, e.g. Load over cluster
• Event process: Event record
• Incident process: Incident ticket in the Ticketing system (Service Now)

prototype
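As an illustration of the heartbeat-checking example, the sketch below turns missing heartbeats into event records that a downstream incident process could pick up. The 10-minute timeout, the record fields, and the function names are assumptions and are not taken from the prototype.

```python
def missing_heartbeats(last_seen, now, timeout=600):
    """Return the sorted names of nodes whose last heartbeat is too old.

    `last_seen` maps node name -> timestamp (seconds) of the last heartbeat.
    """
    return sorted(node for node, ts in last_seen.items() if now - ts > timeout)


def to_event_records(nodes, now):
    """Build one event record per node; the field names are hypothetical."""
    return [{"type": "heartbeat_missing", "node": node, "time": now}
            for node in nodes]
```

Keeping detection (`missing_heartbeats`) separate from record creation mirrors the split between event processing and the event/incident processes in the diagram.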

Page 13: Possible use of Splunk for event processing

Alarming: on-the-fly information processing in time windows
• 5 min time window
• if counter > 3: event
• Notification → Splunk → Aggregated Notification

Splunk Automate Monitoring

In production for backup TSM service @CERN
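The windowed alarming rule on this slide (raise an event when the counter exceeds 3 within a 5 min window) can be sketched as a sliding-window counter. This is a generic Python illustration and not how Splunk implements it. The class name is hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class WindowCounter:
    """Count notifications in a sliding time window."""

    def __init__(self, window_seconds=300, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()

    def add(self, timestamp):
        """Record one notification.

        Returns True when the number of notifications inside the window
        ending at `timestamp` exceeds the threshold.
        """
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop notifications that fell out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.threshold
```

A single aggregated notification can then be sent when `add` first returns True, instead of forwarding every individual notification.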

Page 14: Configuration status and transition period

[Diagram]
• Lemon application server (one per data centre)
• Lemon metric management
• Quattor managed node: Quattor configuration
• Puppet managed node: Puppet configuration
• AI monitoring: Metric Management (prototype)

Page 15: Component status

[Architecture diagram] Components shown:
• Lemon agent
• Lemon to messaging
• Apollo
• Custom script
• Monitoring XYZ
• Cluster processing (high load for >50% of cluster)
• Visualization and correlation: Splunk
• Data store: Hadoop
• Ticketing
• SMS gateway
• Dashboard
• Operations

Legend: prototyping/testing/using; planned/R&D on-going

Page 16: Summary

• No single solution replacement of the current Lemon system
• Shared Agile Infrastructure: modular concept
  – covering all the CERN Computer Centre monitoring domains
  – continuous development and deployment
• Transition plan in place
• Steady progress in implementation