The Agile Infrastructure Project Monitoring
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
The Agile Infrastructure Project
Monitoring
Markus Schulz
Pedro Andrade
Outline
• Monitoring WG and AI
• Today’s Monitoring in IT
• Architecture Vision
• Implementation Plan
• Conclusions
Markus Schulz
Monitoring WG and AI
Introduction
• Motivation
  – Several independent monitoring activities in IT
    • Similar overall approach, different tool-chains, similar limitations
  – High-level services are interdependent
    • Combination of data from different groups necessary, but difficult
  – Understanding performance became more important
    • Requires more combined data and complex analysis
  – Move to a virtualized dynamic infrastructure
    • Comes with complex new requirements on monitoring
• Challenges
  – Find a shared architecture and tool-chain components while preserving our investment in monitoring
• IT Monitoring Working Group
Timeline
Q1 2011
• Creation of Monitoring WG and mandate definition
• Presentations of monitoring status per IT group
Q3 2011
• Presentations of monitoring plans per IT group
• Initial discussion on a shared monitoring architecture
Q4 2011
• Definition of common tools and core user stories
• Agreement on a shared monitoring architecture
Q1 2012
• Preparation of MWG summary report
• Definition of implementation plans in the context of AI
Q2 2012
• Setup of infrastructure and prototype work
  – Import data from several sources into the Analysis Facility
  – Exercise messaging at expected rates and feed the storage system
Pedro Andrade
Today’s Monitoring in IT
Monitoring Applications
| Group | Applications |
|-------|--------------|
| CF | Lemon, LAS, SLS |
| CIS | CDS, Indico |
| CS | Spectrum CA Events, Polling Value, Alarm History, Performance Analysis, Sflow/Nflow, Syslog, Wireless Monitoring |
| DB | Database monitoring, Web applications monitoring, Infrastructure Monitoring |
| DI | Central Security Logging All, Central Security Logging Logins, IP connections log, Deep Packet Inspection, DNS Logs |
| DSS | TSM, AFS, CASTOR Tape, CASTOR Stager |
| ES | Job Monitoring, Site Status Board, DDM Monitoring, Data Popularity, Hammer Cloud, Frontier, Coral |
| GT | SAM-Nagios |
| OIS | SCOM |
| PES | Job Accounting, Fairshare, Job Monitoring, Real-time Job Status, Process Accounting |
Monitoring Data
• Producers
  – 40538
• Input Volume
  – 283 GB per day
• Input Rate
  – 697 M entries per min
  – 2.4 M entries per min without PES/process accounting
• Query Rate
  – 52 M queries per day
  – 3.3 M entries per day without PES/process accounting
Analysis
• Monitoring in IT covers a wide range of resources
  – Hardware, OS, applications, files, jobs, etc.
• Many application-specific monitoring solutions
  – Some are commercial solutions
  – Based on different technologies
• Limited sharing of monitoring data
  – Maybe no sharing, simply duplication of monitoring data
• All monitoring applications have similar needs
  – Publish metric results, aggregate results, alarms, etc.
Architecture Vision
Pedro Andrade
Constraints (Data)
• Large data store aggregating all monitoring data for storage and combined analysis tasks
• Make monitoring data easy to access by everyone!
  – Not forgetting possible security constraints
• Select a simple and well-supported data format
  – Monitoring payload to be schema-free
• Rely on centralized metadata service(s) to discover information about computer center resources
  – Which physical node is running virtual machine A
  – Which virtual machine is running service B
  – Which network link is used by node C
  – … this is becoming more dynamic in the AI
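The three lookup questions above can be pictured with a small sketch. It is only an illustration under stated assumptions: the slides do not describe the metadata service interface, so the in-memory dictionaries, the `resolve` function, and all node, VM, service, and link names below are hypothetical.

```python
# Hypothetical stand-in for the centralized metadata service(s).
# Every mapping and identifier here is invented for illustration.
SERVICE_TO_VM = {"service-b": "vm-a"}       # which VM runs service B
VM_TO_HOST = {"vm-a": "node-c"}             # which physical node runs VM A
NODE_TO_LINK = {"node-c": "link-7"}         # which network link node C uses


def resolve(service: str) -> dict:
    """Follow service -> virtual machine -> physical node -> network link."""
    vm = SERVICE_TO_VM[service]
    host = VM_TO_HOST[vm]
    return {
        "service": service,
        "vm": vm,
        "host": host,
        "link": NODE_TO_LINK.get(host),  # None if no link is recorded
    }


print(resolve("service-b"))
```

In a dynamic virtualized infrastructure these mappings change over time, which is why the slides point to a service rather than static configuration.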
Constraints (Technology)
• Focus on providing well-established solutions for each layer of the monitoring architecture
  – Transport, storage, analysis
• Flexible architecture where a particular technology can be easily replaced by a better one
• Adopt existing tools whenever possible and avoid home-grown solutions
• Follow a tool-chain approach
• Allow a phased transition where existing applications are gradually integrated
User Stories
• User stories were collected from all IT groups and commonalities between them were identified
• To guarantee that different types of user stories were provided, three categories were established:
  – Fast and Furious (FF)
    • Get metrics values for hardware and selected services
    • Raise alarms according to appropriate thresholds
  – Digging Deep (DD)
    • Curation of hardware and network historical data
    • Analysis and statistics on batch job and network data
  – Correlate and Combine (CC)
    • Correlation between usage, hardware, and services
    • Correlation between job status and grid status
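As a rough illustration of the Fast and Furious category, the sketch below compares metric samples against thresholds and reports the ones that exceed them. The sample layout `(node, metric, value)`, the `Threshold` class, and the `raise_alarms` function are assumptions made for this example; the slides do not specify how alarms are evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Upper limit for one metric (hypothetical structure)."""
    metric: str
    limit: float


def raise_alarms(samples, thresholds):
    """Return the (node, metric, value) samples whose value exceeds the limit."""
    limits = {t.metric: t.limit for t in thresholds}
    return [
        (node, metric, value)
        for node, metric, value in samples
        if metric in limits and value > limits[metric]
    ]


samples = [
    ("node-1", "load", 12.0),
    ("node-2", "load", 3.0),
    ("node-1", "temperature", 50.0),  # no threshold defined, so never alarmed
]
print(raise_alarms(samples, [Threshold("load", 10.0)]))
```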
Architecture Overview
[Figure: architecture diagram with the components Sensor, Publisher, Aggregation, Storage Feed, Alarm Feed, Custom Feed, Storage, Analysis, Alarm, Portal, Report, and Application Specific; technologies labelled in the diagram: Apollo, Lemon, Hadoop, Oracle, Splunk]
Architecture Overview
• All components can be changed easily
  – Including the messaging system (standard protocol)
• Messaging and storage as central components
  – Tools connect either to the Messaging or Storage
• Publishers should be kept as simple as possible
  – Data produced either directly on sensor or after a first level of aggregation
• Scalability can be addressed either by horizontally scaling or by adding additional layers
  – Pre-aggregation, pre-processing
  – “Fractal approach”
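One way to read the "fractal approach" is that an aggregation layer emits data in the same shape it consumes, so layers can be stacked. The sketch below shows that idea with mergeable summaries; the summary fields (`count`, `total`, `min`, `max`) and the function names are assumptions for illustration, not something defined in the slides.

```python
def summarize(value):
    """Wrap one raw sample as a single-element summary."""
    return {"count": 1, "total": value, "min": value, "max": value}


def merge(a, b):
    """Combine two summaries of the same metric."""
    return {
        "count": a["count"] + b["count"],
        "total": a["total"] + b["total"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def aggregate(summaries):
    """Merge (metric, summary) pairs per metric.

    The output has the same shape as the input, so the result of one
    aggregation layer can be fed directly into the next one.
    """
    merged = {}
    for metric, summary in summaries:
        merged[metric] = merge(merged[metric], summary) if metric in merged else summary
    return sorted(merged.items())


# Two first-level aggregators (for example one per rack) feed a second level.
rack1 = aggregate([("load", summarize(1.0)), ("load", summarize(3.0))])
rack2 = aggregate([("load", summarize(2.0))])
print(aggregate(rack1 + rack2))
```

Because the summaries merge associatively, aggregating in two levels gives the same result as aggregating all samples at once.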
![Page 17: The Agile Infrastructure Project Monitoring](https://reader030.vdocuments.us/reader030/viewer/2022032805/56813300550346895d99bb9d/html5/thumbnails/17.jpg)
17
Data Format
• The selected message format is JSON
• A simple common schema must be defined to guarantee cross-reference between the data
  – Timestamp
  – Hardware and node
  – Service and applications
  – Payload
• These base elements (tags) require the availability of the metadata service(s) mentioned before
  – This is still under discussion
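A message following the base elements listed above might look like the sketch below. The key names (`timestamp`, `node`, `service`, `payload`) and the helper functions are assumptions chosen for illustration, since the slides state that the common schema is still under discussion. Only the use of JSON and a schema-free payload come from the slides.

```python
import json
import time

# Hypothetical names for the common tags; the real schema was not yet fixed.
REQUIRED_TAGS = {"timestamp", "node", "service", "payload"}


def build_message(node, service, payload, timestamp=None):
    """Wrap a schema-free payload in the common tags."""
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "node": node,
        "service": service,
        "payload": payload,  # free-form: any JSON-serializable value
    }


def encode(message):
    """Serialize a message to JSON text."""
    return json.dumps(message, sort_keys=True)


def decode(text):
    """Parse JSON text and check that all common tags are present."""
    message = json.loads(text)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    missing = REQUIRED_TAGS - set(message)
    if missing:
        raise ValueError(f"missing tags: {sorted(missing)}")
    return message


wire = encode(build_message("node-1", "batch", {"load": 0.7}, timestamp=1330000000))
print(decode(wire))
```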
Messaging Broker
• Two technologies have been identified as the best candidates: Apollo and RabbitMQ
  – Apollo is the successor to ActiveMQ
  – Prior positive experience in IT and the experiments
• Only realistic testing environments can produce reliable performance numbers. The use case of each application must be clearly defined
  – Total number of producers and consumers
  – Size of the monitoring message
  – Rate of the monitoring message
• The trailblazer applications already have very demanding use cases
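The three use-case parameters above combine into a first-order load estimate for the broker. The sketch below is back-of-the-envelope arithmetic only and does not replace the realistic testing the slide calls for. The producer count is the one quoted on the Monitoring Data slide; the message size and per-producer rate are hypothetical values picked for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    producers: int
    message_bytes: int          # assumed average message size
    messages_per_minute: float  # assumed rate per producer

    def messages_per_second(self) -> float:
        """Aggregate message rate arriving at the broker."""
        return self.producers * self.messages_per_minute / 60.0

    def bytes_per_second(self) -> float:
        """Aggregate payload throughput arriving at the broker."""
        return self.messages_per_second() * self.message_bytes


# 40538 producers is taken from the slides; 1 KiB and 1 msg/min are assumptions.
example = UseCase(producers=40538, message_bytes=1024, messages_per_minute=1.0)
print(f"{example.messages_per_second():.0f} msg/s, "
      f"{example.bytes_per_second() / 1e6:.2f} MB/s")
```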
Central Storage and Analysis
• All data is stored in a common location
  – Makes sharing of monitoring data easy
  – Promotes sharing of analysis tools
  – Allows already-processed data to be fed into the system
• NoSQL technologies are the most suitable solutions
  – Focus on column/tabular and document-based solutions
  – Hadoop (from the Cloudera distribution) as first step
Central Storage and Analysis
• Hadoop is a good candidate to start with
  – Prior positive experience in IT and the experiments
  – Map-reduce paradigm is a good match for the use cases
  – Has been used successfully at scale
  – Many different NoSQL solutions use Hadoop as backend
  – Many tools provide export and import interfaces
  – Several related modules available (Hive, HBase)
• Document-based store also considered
  – CouchDB/MongoDB are good candidates
• For some use cases a parallel relational database solution (based on Oracle) could be considered
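To show why the map-reduce paradigm fits monitoring use cases, the sketch below computes the average value per node in three explicit phases. It is plain Python that mimics the paradigm in a single process and is not the Hadoop API; the record layout and function names are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per monitoring record."""
    for record in records:
        yield record["node"], record["value"]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce each group of values to its average."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


records = [
    {"node": "node-1", "value": 1.0},
    {"node": "node-2", "value": 4.0},
    {"node": "node-1", "value": 3.0},
]
print(reduce_phase(shuffle(map_phase(records))))
```

Each record is mapped independently and each key is reduced independently, which is what lets a framework such as Hadoop spread the work over many machines.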
Integrating Closed Solutions
• External (commercial) monitoring
  – Windows SCOM, Oracle EM Grid Control, Spectrum CA
• These data sources must be integrated
  – Injecting final results into the messaging layer
  – Exporting relevant data at an intermediate stage
[Figure: an integrated product covering the Sensor, Transport, Storage, Analysis, and Visualization/Reports layers, with an Export Interface towards Messaging]
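Integration through an export interface could take the form of a small adapter that turns exported records into messages for the messaging layer. The sketch below assumes a CSV export with `time`, `host`, `metric`, and `value` columns and a JSON message with `timestamp`, `node`, `service`, and `payload` keys. Both layouts are hypothetical and do not describe the actual export formats of SCOM, Oracle EM Grid Control, or Spectrum CA.

```python
import csv
import io
import json

# Hypothetical export from a closed product; the column names are assumptions.
SAMPLE_EXPORT = "time,host,metric,value\n1330000000,node-1,cpu,0.5\n"


def export_to_messages(csv_text, source):
    """Yield one JSON message per exported row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield json.dumps(
            {
                "timestamp": float(row["time"]),
                "node": row["host"],
                "service": source,
                "payload": {"metric": row["metric"], "value": float(row["value"])},
            },
            sort_keys=True,
        )


for message in export_to_messages(SAMPLE_EXPORT, "closed-product"):
    print(message)  # a real adapter would publish this to the broker instead
```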
Implementation Plan
Pedro Andrade
Transition Plan
• Moving the existing production monitoring services to a new base architecture is a complex task as these services must be continuously running
• A transition plan was defined and foresees a staged approach where the existing applications gradually incorporate elements of the new architecture
Transition Plan
[Figure: transition diagram marking OLD and NEW parts of the architecture, with the components Publisher, Aggregation, Storage Feed, Alarm Feed, Storage, Analysis, Alarm, Portal, and Report]
Milestones
Monitoring.v1 (Q1 2012)
• AI nodes monitored with Lemon (dependency on Quattor)
• Deployment of Messaging Broker and Hadoop cluster
• Testing of other technologies (Splunk)
Monitoring.v2 (Q2 2012)
• AI nodes monitored with Lemon (no dependency on Quattor)
• Lemon data starts to be published via messaging
Monitoring.v3 (Q4 2012)
• Several clients exploiting the messaging infrastructure
• Messaging consumers for real time alarms and notifications
• Initial data store/analysis for select use cases
Monitoring.v4 (Q4 2013)
• Monitoring data published to the messaging infrastructure
• Large scale data store/analysis on Hadoop cluster
Monitoring v1
• Several meetings organized
  – https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/AgileInfraDocsMinutes
• Short-term tasks identified and tickets created
  – https://agileinf.cern.ch/jira/secure/TaskBoard.jspa
• Work ongoing in four main areas:
  – Messaging broker deployment
  – Hadoop cluster deployment
  – Testing of Splunk with Lemon data
  – Lemon agents running on Puppet
Monitoring v1
• Deployment of the messaging broker
  – Based on Apollo and RabbitMQ
• Three SL6 nodes have been provided
  – 2 nodes for production, 1 node for development
  – Each node will run Apollo and RabbitMQ
• Three applications have been identified to start using/testing the messaging infrastructure
  – OpenStack
  – MCollective
  – Lemon
Monitoring v1
• Testing Splunk with Lemon data
  – Lemon data to be exported from DB (1 day, 1 metric)
  – Data exported into a JSON file and stored in AFS
  – This data will be imported to Splunk
  – Splunk functionality and scalability will be tested
• Started the deployment of a Hadoop cluster
  – Taking the Cloudera distribution
  – Other tools may also be deployed (HBase, Hive, etc.)
  – Hadoop testing using Lemon data (as above) is planned
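The planned export of one day of one Lemon metric into a JSON file could be sketched as below. The row layout `(node, metric_id, timestamp, value)`, the metric identifier, and the function name are assumptions for illustration; the slides do not describe the Lemon database schema or the layout of the exported file.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60


def export_metric(rows, metric_id, day_start):
    """Select one metric over one day and return it as JSON text.

    rows: iterable of (node, metric_id, timestamp, value) tuples
          (hypothetical layout, not the real Lemon schema).
    day_start: Unix timestamp of the start of the day to export.
    """
    selected = [
        {"node": node, "metric_id": mid, "timestamp": ts, "value": value}
        for node, mid, ts, value in rows
        if mid == metric_id and day_start <= ts < day_start + SECONDS_PER_DAY
    ]
    return json.dumps(selected, indent=2)


rows = [
    ("node-1", 101, 1330000000, 0.5),
    ("node-1", 202, 1330000100, 7.0),             # different metric, skipped
    ("node-2", 101, 1330000000 + 90000, 0.9),     # next day, skipped
]
print(export_metric(rows, metric_id=101, day_start=1330000000))
```

The returned text can then be written to a file with a normal `open(path, "w")` call and imported into the tool under test.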
Monitoring v1/v2
• AI nodes monitored with existing Lemon metrics
  – First step
    • Current Lemon sensors/metrics are used for AI nodes
    • Lemon metadata will still be taken from Quattor
    • A solution is defined to get CDB equivalent data
  – Second step
    • Current Lemon sensors/metrics are used for AI nodes
    • Lemon metadata is not taken from Quattor
    • Lemon agents start using the messaging infrastructure
Conclusions
Pedro Andrade
Conclusions
• A monitoring architecture has been defined
  – Promotes sharing of monitoring data between apps
  – Based on a few core components (transport, storage, etc.)
  – Several existing external technologies identified
• A concrete implementation plan has been identified
  – It ensures a smooth transition for today’s applications
  – It enables the new AI nodes to be monitored quickly
  – It allows moving towards a common system
Links
• Monitoring WG TWiki (new location!)
  – https://twiki.cern.ch/twiki/bin/view/MonitoringWG/
• Monitoring WG Report (ongoing)
  – https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
• Agile Infrastructure TWiki
  – https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/
• Agile Infrastructure JIRA
  – https://agileinf.cern.ch/jira/browse/AI
Questions?
Thanks!