Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/ Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure P.Andrade, L.Cons, I.Fedorko , B.Fiorini, A.Iribarren, V.Lefebure, G.Mccance, O.Pera, M.Paladin, I. Reguero, M. Dos Santos, S.Traylen CERN HEPiX Fall 2012 Workshop

Upload: nairi

Post on 23-Feb-2016



TRANSCRIPT

Page 1: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen

CERN

HEPiX Fall 2012 Workshop

Page 2: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure


CERN Agile Infrastructure Monitoring

HEPiX Spring 2012
– High Level Architecture
– View of shared architecture

Lemon – LHC Era Monitoring System
– Is Lemon only about “performance monitoring”?
– Why architecture evolution rather than replacement by existing monitoring tool(s)?

Agile Infrastructure for Monitoring
– Shared Infrastructure
– Use cases: Data store, Visualization
– Event processing and management
– Status of the components

Page 3: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Lemon: LHC Era Monitoring System

In-house developed, multi-component, client/server-based monitoring system

[Architecture diagram. Panels: Node Monitoring, Measurement Repository, User Interfaces. Components: Sensors, Monitoring Agent, Local Cache, Repository, Backend, Application Server, Oracle Database, Lemon CLI (command line tool to access data), Lemon-host-check (command line tool for node exceptions), Web Browser, RRD tool / Python, Apache / PHP. Link labels: SQL, TCP/UDP, HTTP.]

• Individually configurable nodes with autonomous recovery actions
• Chain of tools based on DB backend

Page 4: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Lemon: Performance, application and facility monitoring

• Node monitoring, e.g. CPU load
• Time-series processing

$\mathrm{efficiency} = \frac{\mathrm{Cluster}_A - \mathrm{Cluster}_B}{\mathrm{Cluster}_D}$

• Hierarchy clustering: Cluster > Sub-cluster > Node
• On-behalf monitoring: Smart Power Distribution Units
• Historical data export
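The hierarchy clustering and the derived "efficiency" metric on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the cluster and node names, the sample values, the choice of the mean as the roll-up function, and the reading of the slide's formula as (A − B) / D. It is not Lemon's implementation.

```python
from statistics import mean

# Hypothetical hierarchy: cluster -> sub-cluster -> node -> CPU load samples.
hierarchy = {
    "batch": {
        "batch_a": {"node01": [0.5, 1.5], "node02": [2.0, 4.0]},
        "batch_b": {"node03": [1.0, 1.0]},
    },
}


def node_value(samples):
    """Reduce one node's time series to a single value (here: the mean)."""
    return mean(samples)


def subcluster_value(nodes):
    """Average the per-node values of one sub-cluster."""
    return mean(node_value(s) for s in nodes.values())


def cluster_value(subclusters):
    """Average the per-sub-cluster values of one cluster."""
    return mean(subcluster_value(n) for n in subclusters.values())


def efficiency(a, b, d):
    """Derived metric, assuming the slide's formula reads (A - B) / D."""
    if d == 0:
        raise ZeroDivisionError("denominator cluster metric is zero")
    return (a - b) / d
```

Calling `cluster_value(hierarchy["batch"])` rolls node samples up through the sub-cluster level to a single cluster figure; `efficiency` then combines three such cluster-level figures.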

Page 5: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Lemon: Service availability and alarming

Node monitoring
• Disk occupancy
• Number of processes
• Log file parse matched

Corrective action on the node
• Run script locally to clean the var dir
• After the 3rd attempt, var occupancy still > 90%: var_full alarm to the system administrator, support ticket

Monitoring repository export with guaranteed reliability and data processing
• e.g. Service Level Status
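The corrective-action flow on this slide (clean up locally, escalate to an alarm once the third attempt has not helped) can be sketched as follows. The function name, the callables and the return values are hypothetical; only the 90% threshold, the three attempts and the `var_full` alarm name come from the slide.

```python
def check_var(get_occupancy, clean_var, threshold=90.0, max_attempts=3):
    """Try a local clean-up up to max_attempts times; escalate if var stays full.

    get_occupancy: callable returning the current occupancy in percent.
    clean_var: callable running the local clean-up script.
    Returns (status, attempts_used), where status is "ok" or "var_full".
    """
    attempts = 0
    while get_occupancy() > threshold:
        if attempts == max_attempts:
            # All local attempts used: this is where the alarm would be raised
            # (notify the system administrator, open a support ticket).
            return "var_full", attempts
        clean_var()
        attempts += 1
    return "ok", attempts
```

Passing the probe and the clean-up script as callables keeps the escalation logic testable without touching a real file system.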

Page 6: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Lemon Monitoring @ Large scale

Experience
• No single-solution replacement

Requirements
• Tool chain
  – e.g. data mining interface different from time series trending
• Flexible migration
  – e.g. compatible with the Lemon node client
• Large-scale ready
  – Current system: ~11k monitored entities, ~150 metrics/entity
  – Expected scale: ~300k entities
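To make the scale figures concrete, the short calculation below multiplies them out. It assumes, purely for illustration, that the number of metrics per entity stays at ~150 when the system grows to ~300k entities; the slide does not state this.

```python
# Figures from the slide (all approximate).
current_entities = 11_000
metrics_per_entity = 150
expected_entities = 300_000

# Number of distinct (entity, metric) series to collect and store.
current_series = current_entities * metrics_per_entity
# Assumption: the same ~150 metrics per entity at the expected scale.
expected_series = expected_entities * metrics_per_entity

growth = expected_entities / current_entities

print(f"current:  {current_series:,} series")
print(f"expected: {expected_series:,} series")
print(f"growth:   x{growth:.1f}")
```

Under that assumption the system goes from about 1.65 million to about 45 million metric series, a factor of roughly 27.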

Page 7: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Agile Infrastructure with performance monitoring

[Architecture diagram. Labels: Lemon agent, Lemon to messaging, Message Bus, Custom script, Monitoring XYZ, Visualization and correlation, Data store, Cluster processing (high load for >50% of cluster), Ticketing, SMS gateway, Dashboard, Operations. Legend: Planned, Components, Views.]
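A minimal sketch of the idea behind this slide: samples are published as messages, and a cluster-processing consumer flags a cluster when more than 50% of its nodes report high load. The in-memory `deque` is only a stand-in for the message bus, and the JSON field names, the metric name `load` and the load threshold are assumptions, not the actual Lemon message format.

```python
import json
from collections import deque

bus = deque()  # in-memory stand-in for the message bus


def publish(entity, cluster, metric, value, timestamp):
    """Serialize one sample as JSON and put it on the stand-in bus."""
    bus.append(json.dumps({
        "entity": entity, "cluster": cluster,
        "metric": metric, "value": value, "timestamp": timestamp,
    }))


def clusters_under_high_load(messages, load_threshold=5.0, fraction=0.5):
    """Return clusters where more than `fraction` of nodes exceed the threshold.

    Only the latest "load" sample of each node is considered.
    """
    latest = {}  # (cluster, entity) -> (timestamp, value)
    for raw in messages:
        m = json.loads(raw)
        if m["metric"] != "load":
            continue
        key = (m["cluster"], m["entity"])
        if key not in latest or m["timestamp"] >= latest[key][0]:
            latest[key] = (m["timestamp"], m["value"])
    totals, loaded = {}, {}
    for (cluster, _), (_, value) in latest.items():
        totals[cluster] = totals.get(cluster, 0) + 1
        loaded[cluster] = loaded.get(cluster, 0) + (value > load_threshold)
    return sorted(c for c in totals if loaded[c] / totals[c] > fraction)
```

Because the consumer only sees messages, any producer (a Lemon agent bridge, a custom script) can feed it as long as it emits the same fields.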

Page 8: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure


Storing and visualization

[Diagram: data store and visualization options fed from the Message Bus]
• Oracle: Lemon web, RRD visualization
• NoSQL: Visualization, Data mining (batch processing)
• Splunk: Data mining, Visualization, Correlation

Legend: R&D on-going; Possible options

Page 9: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

NoSQL-based data store for monitoring
Example from Data Storage Service

• Log parsing and processing based on the NoSQL DB
• Prototyped by CERN IT/DSS
• Shared infrastructure

Page 10: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Splunk for data mining/visualization

High-precision data mining, in the current system solved by dedicated exports

• ~1.5 years of Lemon raw data (~4.5 TB in Oracle)
• ~2.5 TB of Splunk data with metadata information (~43 billion entries)
• One-year period of basic metrics on a node: on-the-fly browsing capability with high time granularity

Under testing
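As a rough back-of-the-envelope reading of these figures, the calculation below derives an average size per Splunk entry and an average ingest rate. It assumes that 1 TB is 10^12 bytes, that a year has 365 days, and that the ~43 billion entries cover the same ~1.5 years as the Oracle data; none of this is stated on the slide, so the results are order-of-magnitude estimates only.

```python
# Figures from the slide (all approximate).
oracle_bytes = 4.5e12      # ~4.5 TB in Oracle (assuming 1 TB = 10**12 bytes)
splunk_bytes = 2.5e12      # ~2.5 TB in Splunk
entries = 43e9             # ~43 billion entries
years = 1.5                # assumed to apply to the Splunk data as well

seconds = years * 365 * 24 * 3600

bytes_per_entry = splunk_bytes / entries
entries_per_second = entries / seconds
size_ratio = oracle_bytes / splunk_bytes

print(f"~{bytes_per_entry:.0f} bytes per entry")
print(f"~{entries_per_second:.0f} entries per second on average")
print(f"Oracle / Splunk size ratio: {size_ratio:.1f}")
```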

Page 11: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Example of Splunk Dashboard
Lemon data with entity cluster hierarchy

[Dashboard screenshot. Search inputs: Metric, Time, Match entity name. Panel: Sum of running jobs over time, split by entities.]

Under testing

Page 12: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Event processing and management concept

[Concept diagram. Labels: Node monitoring, Monitoring infrastructure, Metrics, Metric correlation, Event processing (e.g. heartbeat checking, e.g. load over cluster), Event process, Event record, Ticketing system (Service Now), Incident process, Incident ticket. Status label: prototype.]
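The heartbeat-checking example from this concept can be sketched as a small pipeline: a check turns silent entities into event records, and an event record is mapped to an incident ticket. The `Event` class, the 600 s timeout and the ticket fields are hypothetical; in particular, the ticket dictionary is not the Service Now API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event record produced by event processing."""
    entity: str
    kind: str
    timestamp: float


def heartbeat_events(last_seen, now, timeout=600.0):
    """Emit a 'missing_heartbeat' event for every entity silent for > timeout s."""
    return [
        Event(entity, "missing_heartbeat", now)
        for entity, seen in sorted(last_seen.items())
        if now - seen > timeout
    ]


def to_incident_ticket(event):
    """Map an event record to a hypothetical incident-ticket payload."""
    return {
        "short_description": f"{event.kind} on {event.entity}",
        "opened_at": event.timestamp,
    }
```

Keeping the event record separate from the ticket mirrors the slide's split between the event process and the incident process: not every event has to become a ticket.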

Page 13: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Possible use of Splunk for event processing

• Alarming: on-the-fly information processing in time windows
• Rule: if counter > 3 within a 5 min time window, raise an event
• Splunk automated monitoring

[Diagram: Notification → Splunk → Aggregated Notification, along a time axis with a 5 min time window]

In production for the backup (TSM) service @CERN
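The windowed rule on this slide (more than 3 matches within 5 minutes produce one aggregated event) can be illustrated in plain Python. This is a sketch of the counting logic only, not a Splunk search; the class name is hypothetical, and timestamps are assumed to be in seconds and to arrive in non-decreasing order.

```python
from collections import deque


class WindowCounter:
    """Fire once when more than `limit` matches fall inside a time window."""

    def __init__(self, window=300.0, limit=3):
        self.window = window    # window length in seconds (5 min)
        self.limit = limit      # fire when the count exceeds this value
        self.times = deque()    # timestamps of matches still in the window
        self.active = False     # True while the count stays above the limit

    def add(self, timestamp):
        """Record one match; return True only when the limit is first crossed."""
        self.times.append(timestamp)
        # Drop matches that have fallen out of the window.
        while timestamp - self.times[0] >= self.window:
            self.times.popleft()
        over = len(self.times) > self.limit
        fire = over and not self.active
        self.active = over
        return fire
```

Returning `True` only on the crossing, rather than for every match above the limit, is what turns a burst of individual notifications into a single aggregated one.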

Page 14: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Configuration status and transition period

[Diagram. Labels: Lemon application server (one per data centre), Lemon metric management, Quattor managed node, Quattor configuration, Puppet managed node, Puppet, Puppet configuration, AI monitoring, Metric Management. Status label: prototype.]

Page 15: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure

Component status

[Status diagram. Labels: Lemon agent, Lemon to messaging, Apollo, Custom script, Monitoring XYZ, Cluster processing (high load for >50% of cluster), Visualization and correlation: Splunk, Data store: Hadoop, Ticketing, SMS gateway, Dashboard, Operations. Legend: prototyping/testing/using; planned/R&D on-going.]

Page 16: Integrating Lemon Monitoring and Alarming System  with the new CERN Agile Infrastructure


Summary

• No single-solution replacement of the current Lemon system
• Shared Agile Infrastructure: modular concept
  – covering all the CERN Computer Centre monitoring domains
  – continuous development and deployment
• Transition plan in place
• Steady progress in implementation