Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure
P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen
CERN
HEPiX Fall 2012 Workshop
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
CERN Agile Infrastructure Monitoring
HEPiX Spring 2012
– High Level Architecture
– View of shared architecture
Lemon – LHC Era Monitoring System
– Is Lemon only about “performance monitoring”?
– Why architecture evolution rather than replacement by existing monitoring tool(s)?
Agile Infrastructure for Monitoring
– Shared Infrastructure
– Use cases: Data store, Visualization
– Event processing and management
– Status of the components
Lemon LHC Era Monitoring System
In-house developed, multi-component, client/server-based monitoring system
[Architecture diagram; labels recovered from the slide]
Node Monitoring: Sensors, Monitoring Agent, Local Cache
Transport: TCP/UDP, HTTP
Measurement Repository: Application Server, Repository Backend, Oracle Database (SQL)
User Interfaces:
– Lemon CLI (command line tool to access data)
– Lemon-host-check (command line tool for node exceptions)
– Web Browser (RRD tool / Python, Apache / PHP)
• Individually configurable nodes with autonomous recovery actions
• Chain of tools based on DB backend
Lemon: Performance, application and facility monitoring
• Node monitoring, e.g. CPU Load
• Time-series processing
$\mathrm{efficiency} = \dfrac{\mathit{ClusterA} - \mathit{ClusterB}}{\mathit{ClusterD}}$
• Hierarchy clustering: Cluster, Sub-cluster, Node
• On behalf monitoring
• Smart Power Distribution Units
• Historical data export
Lemon: Service availability and alarming
Node monitoring
• Disk occupancy
• Number of processes
• Log file parse matched
Correction action on the node
• Run script locally to clean var dir
• After 3rd attempt var occupancy > 90%
Monitoring repository export with guaranteed reliability and data processing
• e.g. Service Level Status
[Diagram labels: var_full alarm, System administrator, Support ticket]
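The corrective-action flow on this slide (run a local cleanup script, raise the alarm if var occupancy is still above 90% after the third attempt) can be sketched as follows. This is only an illustration under stated assumptions, not the Lemon agent's actual code: the function name `handle_var_occupancy` and the two callables are hypothetical stand-ins for the sensor and the cleanup script.

```python
from typing import Callable


def handle_var_occupancy(
    read_occupancy: Callable[[], float],  # hypothetical sensor: returns % used
    clean_var: Callable[[], None],        # hypothetical local cleanup script
    threshold: float = 90.0,              # 90% taken from the slide
    max_attempts: int = 3,                # "after 3rd attempt" taken from the slide
) -> str:
    """Try local recovery first; return "alarm" only if it keeps failing."""
    for _ in range(max_attempts):
        if read_occupancy() <= threshold:
            return "ok"
        clean_var()  # corrective action run locally on the node
    # All attempts used: re-check once and escalate if still above threshold.
    return "alarm" if read_occupancy() > threshold else "ok"
```

In this sketch the alarm is what would reach the system administrator or become a support ticket; a node that recovers on its own never produces one.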
Lemon Monitoring @ Large scale
Experience
• No single solution replacement
Requirements
• Tools chain
– e.g. data mining interface different from time series trending
• Flexible migration
– e.g. compatible with Lemon node client
• Large scale ready
– Current system: ~11k monitored entities, ~150 metrics/entity
– Expected scale: ~300k entities
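To put the scale figures in perspective, the short calculation below multiplies them out. It assumes, purely for illustration, that the ~150 metrics per entity also holds at the expected scale; the slide does not state this.

```python
# Figures quoted on the slide (all approximate).
CURRENT_ENTITIES = 11_000
METRICS_PER_ENTITY = 150
EXPECTED_ENTITIES = 300_000

# Distinct metric streams handled today.
current_metrics = CURRENT_ENTITIES * METRICS_PER_ENTITY

# Assumption: the per-entity metric count stays the same at the expected scale.
expected_metrics = EXPECTED_ENTITIES * METRICS_PER_ENTITY

# Growth factor in the number of monitored entities.
growth = EXPECTED_ENTITIES / CURRENT_ENTITIES

print(f"current:  {current_metrics:,} metric streams")
print(f"expected: {expected_metrics:,} metric streams")
print(f"growth:   about {growth:.0f}x")
```

Under that assumption the system goes from roughly 1.65 million to 45 million metric streams, a factor of about 27.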
Agile Infrastructure with performance monitoring
[Diagram: Planned Components Views; labels recovered from the slide]
• Lemon agent
• Lemon to messaging
• Custom script
• Monitoring XYZ
• Message Bus
• Data store
• Visualization and correlation
• Cluster processing, e.g. high load for >50% of cluster
• Operations: Ticketing, SMS gateway, Dashboard
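The cluster-processing example on this slide ("high load for >50% of cluster") is a rule evaluated over many nodes rather than one. A minimal sketch of such a rule follows; the function name, the per-node load threshold and the input shape are assumptions made for illustration and are not taken from the presentation.

```python
from typing import Mapping


def cluster_high_load(
    loads: Mapping[str, float],  # node name -> latest load value
    load_threshold: float,       # hypothetical per-node "high load" limit
    fraction: float = 0.5,       # ">50% of cluster" from the slide
) -> bool:
    """Return True when more than `fraction` of the nodes exceed the limit."""
    if not loads:
        return False  # an empty cluster cannot be overloaded
    high = sum(1 for value in loads.values() if value > load_threshold)
    return high / len(loads) > fraction
```

For example, `cluster_high_load({"n1": 9.0, "n2": 8.0, "n3": 1.0}, load_threshold=5.0)` is true because two of the three nodes are above the limit.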
Storing and visualization
[Diagram; labels recovered from the slide]
• Message Bus
• Oracle: Lemon web, RRD visualization
• NoSQL: Visualization, Data mining (batch processing)
• Splunk: Data mining, Visualization, Correlation
Legend: R&D on-going; Possible options
NoSQL-based data store for monitoring
Example from Data Storage Service
• Log parsing and processing based on the NoSQL DB
• Prototyped by CERN IT/DSS
• Shared infrastructure
Splunk for data mining/visualization
• High precision data mining in the current system is solved by dedicated exports
• ~1.5 years of Lemon raw data (~4.5 TB in Oracle)
• ~2.5 TB of Splunk data with metadata information (~43 billion entries)
• One-year period of basic metrics per node: on-the-fly browsing capability with high time granularity
• Under testing
Example of Splunk Dashboard
Lemon data with entity cluster hierarchy
• Metric - Time - Match entity name
• Sum of running jobs over time split by entities
• Under testing
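The dashboard query described here (sum of running jobs over time, split by entities) is a group-and-sum over time buckets. The sketch below shows the same aggregation in plain Python rather than in Splunk's own query language. The tuple layout, the 5-minute bucket size and the assumption of one sample per node per bucket are all illustrative choices, not details from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample layout: (unix_time, cluster, node, running_jobs)
Sample = Tuple[int, str, str, float]


def running_jobs_by_cluster(
    samples: Iterable[Sample], bucket_seconds: int = 300
) -> Dict[Tuple[int, str], float]:
    """Sum running jobs per (time bucket, cluster).

    Assumes at most one sample per node per bucket; otherwise jobs on a
    node would be counted more than once.
    """
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for timestamp, cluster, _node, running_jobs in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        totals[(bucket_start, cluster)] += running_jobs
    return dict(totals)
```

Each key of the result is one point on one line of the chart: the bucket start time gives the x position and the cluster name selects the series.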
Event processing and management concept
[Diagram; labels recovered from the slide]
• Node monitoring
• Monitoring infrastructure
• Metrics, Metric correlation
• Event processing, e.g. heartbeat checking, e.g. load over cluster
• Event process: Event record
• Incident process: Incident ticket
• Ticketing system: Service Now (prototype)
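Heartbeat checking, one of the event-processing examples named on this slide, amounts to flagging entities that have not reported recently. A minimal sketch follows; the function name, the input mapping and the 10-minute timeout are assumptions for illustration only.

```python
from typing import List, Mapping


def missing_heartbeats(
    last_seen: Mapping[str, float],  # node name -> time of last report (s)
    now: float,                      # current time on the same clock (s)
    timeout_seconds: float = 600.0,  # hypothetical silence limit
) -> List[str]:
    """Return the nodes whose last report is older than the timeout."""
    return sorted(
        node for node, seen in last_seen.items() if now - seen > timeout_seconds
    )
```

Each returned node would become an event record; whether it escalates to an incident ticket is a separate decision in the incident process.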
Possible use of Splunk for event processing
• Alarming: on-the-fly information processing in time windows
• Rule shown: if counter > 3 events within a 5 min time window
• Splunk automated monitoring sends a notification (aggregated notification)
• In production for the backup TSM service @CERN
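The windowed rule on this slide (more than 3 events inside a 5-minute window triggers a notification) can be expressed as a small sliding-window counter. The class below is a sketch of that idea in plain Python, not of how Splunk implements it; the class name and method are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class WindowedAlarm:
    """Signal when more than `max_events` fall inside a sliding window."""

    def __init__(self, window_seconds: float = 300.0, max_events: int = 3) -> None:
        self.window_seconds = window_seconds  # 5 min window from the slide
        self.max_events = max_events          # "counter > 3" from the slide
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True if a notification is due."""
        self._times.append(timestamp)
        # Forget events that have dropped out of the window.
        while self._times and self._times[0] <= timestamp - self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_events
```

Counting inside a window is what makes the notification aggregated: several events close together produce one signal instead of one message per event.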
Configuration status and transition period
[Diagram; labels recovered from the slide]
• Lemon application server (one per data centre)
• Lemon metric management
• Quattor managed node: Quattor configuration
• Puppet managed node: Puppet configuration
• AI monitoring: Metric Management (prototype)
Component status
[Diagram; labels recovered from the slide]
• Lemon agent
• Lemon to messaging
• Message bus: Apollo
• Custom script
• Monitoring XYZ
• Cluster processing, e.g. high load for >50% of cluster
• Visualization and correlation: Splunk
• Data store: Hadoop
• Operations: Ticketing, SMS gateway, Dashboard
Legend: prototyping/testing/using; planned/R&D on-going
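The "Lemon to messaging" component forwards metric samples from the agent to the message bus. The slides do not show the message format, so the sketch below only illustrates the general idea of serializing one sample as a self-describing JSON document. Every field name here is hypothetical, and the broker client call that would actually send the message is deliberately left out.

```python
import json
from typing import Union


def to_bus_message(
    entity: str, metric: str, timestamp: int, value: Union[int, float, str]
) -> str:
    """Serialize one metric sample as JSON (hypothetical field names)."""
    payload = {
        "entity": entity,        # monitored node or device
        "metric": metric,        # metric name or identifier
        "timestamp": timestamp,  # sample time, seconds since the epoch
        "value": value,          # measured value
    }
    # sort_keys gives a stable field order, which helps when comparing messages.
    return json.dumps(payload, sort_keys=True)
```

A self-describing message lets each consumer on the bus (data store, visualization, cluster processing) parse samples independently of the producer, which fits the shared, modular design the slides describe.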
Summary
• No single solution replacement of the current Lemon system
• Shared Agile Infrastructure: modular concept
– covering all the CERN Computer Centre monitoring domains
– continuous development and deployment
• Transition plan in place
• Steady progress in implementation