![Page 1: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/1.jpg)
Seminar CNAF
Exploiting open source tools to
realize a new monitoring
infrastructure at CERN
Pedro Andrade – CERN IT/CF
![Page 2: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/2.jpg)
Overview
• Agile Infrastructure
• Monitoring Project
• Solutions and Technologies
• Producers
• Transport
• Archive
• Query and Analytics
• Real-time Analytics
• Notifications
3/17/2014 CNAF Seminar 2
![Page 3: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/3.jpg)
Agile Infrastructure
![Page 4: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/4.jpg)
Challenges
• New data centre in Budapest since 2013
• Additional capacity required in view of physics needs
• Local on-site maintenance for installations/repairs
![Page 5: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/5.jpg)
Challenges
• Be ready to handle 15’000 servers
• Increasing users of CERN’s facilities and higher
computing requirements as data rates increase
• Staff numbers are fixed, no more people
• Materials budget decreasing, no more money
• Legacy tools are high maintenance and brittle
• Deploy new services within hours
![Page 6: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/6.jpg)
Challenges
• “We Are Not Special”
• Move to commonly used open source tools
• Focus on strong communities and momentum
• Stop re-inventing tools (“not invented here” syndrome)
• Implement clouds at scale
• Aim for 90% infrastructure virtualised
• Ecosystem solutions rather than writing from scratch
• Request to delivery in a coffee break
![Page 7: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/7.jpg)
Agile Infrastructure
• Activity started in 2012
• Remodel IT services
• Move to a more horizontal approach
• Layered model: IaaS, PaaS, SaaS
• Services, Configuration, Installation, Hardware
• Virtualisation is key
• Improve efficiency
• Operational, Resources
![Page 8: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/8.jpg)
Agile Infrastructure
[Diagram: Agile Infrastructure tool chain. Components shown: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / Elastic Search / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]
![Page 9: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/9.jpg)
Monitoring Project
![Page 10: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/10.jpg)
Challenges
• Several independent monitoring activities in IT
• High level services are interdependent
• Understanding performance more important
• Move to a virtualized dynamic infrastructure
• Preserve our investment in monitoring
Shared architecture & tool-chain components
![Page 11: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/11.jpg)
Objectives
• Deliver solutions for the shared architecture
• Work with all IT monitoring teams
• Deliver simple adoption: PaaS
• Better exploit IT resources
• While at the same time
• Mix and match open source solutions
• Exploit new tools from the Agile Infrastructure
• Retire old tools: Lemon DB, Lemon Web, LAS, etc.
![Page 12: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/12.jpg)
Architecture
![Page 13: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/13.jpg)
Process Improvements
• Establish Agile methodology
• Well defined sprints with clear targets
• Interactive evolution, continuous feedback
• Exploit Open Source tools
• Best fit, large adoption, active community
• Fast to adopt, accept limitations, easily replaced
• Look at DevOps
• Quality Assurance processes
• Continuous Integration processes
![Page 14: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/14.jpg)
Technologies
• Many options available!
![Page 15: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/15.jpg)
Technologies
![Page 16: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/16.jpg)
Producers
![Page 17: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/17.jpg)
Motivation
• Preserve sensors/probes knowledge
• Many years writing sensors for Lemon
• Integrate other data sources
• Most likely service specific monitoring data
Selected Technology: Lemon + Others
![Page 18: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/18.jpg)
Lemon Producer
• Same old lemon agent
• Running in all data centre nodes
• Lemon agent extended with lemon forwarder
• Send notifications to ActiveMQ
• Send metrics to Flume
• Send syslog to Flume
![Page 19: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/19.jpg)
Other Producers
• Must follow common monitoring specification
• Metric v3.0 and Notification v2.0
• Can use monitoring-data-model to create new
metrics and notifications and validate them
• Messages can be sent
• To ActiveMQ using a stomp client
• To Flume gateway using a flume agent
• Planning to evaluate Collectd later this year
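As a rough illustration of the "stomp client" path, the sketch below builds a raw STOMP SEND frame in Python without any client library. The destination name and the notification fields are hypothetical; the real fields are defined by the Notification v2.0 specification, which is not reproduced in these slides.

```python
import json


def build_stomp_send_frame(destination: str, payload: dict) -> bytes:
    """Encode a STOMP SEND frame carrying a JSON body."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header_lines = [
        "SEND",
        f"destination:{destination}",
        "content-type:application/json",
        f"content-length:{len(body)}",
    ]
    # A blank line separates the headers from the body; a NUL byte ends the frame.
    return "\n".join(header_lines).encode("utf-8") + b"\n\n" + body + b"\x00"


# Hypothetical notification and destination, for illustration only.
notification = {"host": "node001", "type": "OS", "state": "open"}
frame = build_stomp_send_frame("/topic/monitoring.notifications", notification)
print(frame)
```

A real producer would write this frame to a socket after a CONNECT handshake with the broker, or simply use an existing STOMP client library.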
![Page 20: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/20.jpg)
Transport
![Page 21: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/21.jpg)
Motivation
• Collect operations data
• Lemon metrics and syslog
• 3rd party applications and services
• Scalable transport layer
• Large data volume
• Easy integration with other technologies
Selected Technology: Flume
![Page 22: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/22.jpg)
Flume
• Distributed service for collecting large data sets
• Robust and fault tolerant
• Horizontally scalable
• Many ready to be used input/output plugins
• Java based, Apache license
• Cloudera is the main contributor
• Using their releases
• Less frequent but more stable releases
![Page 23: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/23.jpg)
Flume
• Flume event
• Payload + set of string headers
• Flume agent
• JVM process hosting “source to sink” flows
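The two concepts above can be modelled in a few lines of Python. This is only a conceptual sketch of an event (payload plus string headers) moving from a source through a channel to a sink; it is not the Flume API, and all names are made up for the example.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Event:
    """Conceptual stand-in for a Flume event: a payload and string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


def run_flow(source_lines, channel: Queue, sink: list) -> None:
    # Source side: wrap each input line in an event and buffer it in the channel.
    for line in source_lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"type": "syslog"}))
    # Sink side: drain the channel and deliver the events to the destination.
    while not channel.empty():
        sink.append(channel.get())


delivered = []
run_flow(["disk full", "link down"], Queue(), delivered)
print(delivered)
```

In Flume itself the source and sink run concurrently inside the agent's JVM and the channel provides the buffering between them; the sequential loop here only shows the direction of the flow.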
![Page 24: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/24.jpg)
Flume
• Many ready-to-be-used plugins
• Sources: Avro, JMS, Spool, Syslog, HTTP, etc.
• Interceptors: decorate events, filter events
• Channels: Memory, File, JDBC
• Sinks: Avro, Thrift, ElasticSearch, HDFS, File, etc.
• Custom sources/sinks can be implemented
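To make the interceptor idea concrete, here is a small Python sketch of a chain that decorates events with a header and filters out unwanted ones. It mirrors the concept only; Flume interceptors are Java classes, and the header name and filter rule below are hypothetical.

```python
def add_header(key, value):
    """Return an interceptor that decorates every event with one header."""
    def interceptor(event):
        headers = dict(event["headers"], **{key: value})
        return {"headers": headers, "body": event["body"]}
    return interceptor


def drop_if_body_contains(text):
    """Return an interceptor that filters out events whose body contains text."""
    def interceptor(event):
        return None if text in event["body"] else event
    return interceptor


def apply_chain(events, interceptors):
    kept = []
    for event in events:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:  # dropped by a filter, stop processing it
                break
        if event is not None:
            kept.append(event)
    return kept


events = [
    {"headers": {}, "body": "sshd: accepted"},
    {"headers": {}, "body": "debug: noise"},
]
kept = apply_chain(events, [add_header("hostgroup", "batch"),
                            drop_if_body_contains("debug")])
print(kept)
```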
![Page 25: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/25.jpg)
Flume
• Routing is static
• On demand subscriptions are not possible
• Requires reconfiguration and restart
• No authN and authZ features
• But secure transport available
• Java process on client side
• Small memory footprint would be nicer
![Page 26: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/26.jpg)
Deployment
• Running Flume 1.3; latest is Flume 1.4
![Page 27: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/27.jpg)
Deployment
• 1st layer: Flume Data publisher
• Deployed in all data centre nodes
• 2nd layer: Flume Gateway
• 20 VMs aggregating events
• 3rd layer: Flume ElasticSearch
• 10 VMs inserting to ElasticSearch
• 3rd layer: Flume Hadoop HDFS
• 10 VMs inserting to Hadoop HDFS
![Page 28: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/28.jpg)
Feedback
• Sizing flume layers needs some tuning
• Available sources/sinks saved a lot of time
![Page 29: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/29.jpg)
Archive
![Page 30: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/30.jpg)
Motivation
• Store operations raw data
• Long term archival required
• Allow future data replay to other tools
• Feed real-time engine
• Offline processing of collected data
• Security data? Syslog data?
Selected Technology: Hadoop/HDFS
![Page 31: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/31.jpg)
Hadoop/HDFS
• Hadoop is a framework that allows the
distributed processing of large data sets
• HDFS is a distributed filesystem designed to
run on commodity hardware
• Suitable for applications with large data sets
• Designed for batch processing, not interactive use
• High throughput preferred to low latency access
![Page 32: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/32.jpg)
Hadoop/HDFS
• Small files not welcome: blocks of 64 MB or 128 MB
• Limit of tens of millions of files per cluster
• NameNode holds the file map in memory
• Transparent compression not available
• Raw text could take much less space
• Real-time data access is not possible
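The small-files limitation can be illustrated with simple arithmetic: every file and every block is an entry that the NameNode keeps in memory. The Python sketch below counts those entries for the same data volume stored as small or as large files. The 6 TB volume and the file sizes are illustrative assumptions, not measurements from the cluster.

```python
def namenode_objects(total_bytes: int, file_bytes: int, block_bytes: int) -> int:
    """Count file entries plus block entries needed to hold total_bytes."""
    files = -(-total_bytes // file_bytes)            # ceiling division
    blocks_per_file = -(-file_bytes // block_bytes)  # ceiling division
    return files + files * blocks_per_file


MB = 1024 ** 2
TB = 1024 ** 4

total = 6 * TB  # illustrative data volume
small = namenode_objects(total, file_bytes=1 * MB, block_bytes=128 * MB)
large = namenode_objects(total, file_bytes=1024 * MB, block_bytes=128 * MB)
print(small, large)
```

With these assumptions the 1 MB files need roughly two hundred times more NameNode entries than the 1 GB files, which is the reason large files are preferred.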
![Page 33: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/33.jpg)
Deployment
• Production cluster
• ~200 TB available in 5 data nodes
• 6.3 TB stored since mid July 2013
• Data organized by hostgroup (cluster)
• Daily jobs to aggregate data by month
• Large files preferred to many small files
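The daily aggregation job could look roughly like the Python sketch below, which merges per-day files of one hostgroup into a single monthly file. It runs against a local temporary directory as a stand-in for HDFS, and the directory layout and file names are assumptions made for the example.

```python
from pathlib import Path
import tempfile


def aggregate_month(root: Path, hostgroup: str, month: str) -> Path:
    """Concatenate the daily files of one hostgroup and month into one file."""
    daily_files = sorted((root / hostgroup).glob(f"{month}-*.log"))
    target = root / hostgroup / f"{month}.log"
    with target.open("wb") as out:
        for daily in daily_files:
            out.write(daily.read_bytes())
    # Remove the small daily files once their content is in the monthly file.
    for daily in daily_files:
        daily.unlink()
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "batch").mkdir()
    (root / "batch" / "2014-03-01.log").write_text("day1\n")
    (root / "batch" / "2014-03-02.log").write_text("day2\n")
    monthly = aggregate_month(root, "batch", "2014-03")
    merged = monthly.read_text()
    remaining = sorted(p.name for p in (root / "batch").iterdir())

print(merged, remaining)
```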
![Page 34: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/34.jpg)
Query & Analytics
![Page 35: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/35.jpg)
Motivation
• Real-time queries based on clear API
• Dynamic dashboards creation
• Rich user-friendly dashboards
• Horizontally scalable and easy to deploy
• Limited data retention policy
• Handle different data types in the same way
Selected Technology: ElasticSearch + Kibana
![Page 36: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/36.jpg)
ElasticSearch
• Distributed RESTful search & analytics engine
• Real time data acquisition and indexing
• Automatically balanced shards and replicas
• Schema free, document oriented (JSON)
• No prior data declaration required
• Automatic data type discovery
• Distributed under Apache license
![Page 37: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/37.jpg)
ElasticSearch
• Full text search
• Apache Lucene is used to provide full text search
• Not only text: integer/long, float/double, boolean, etc.
• RESTful JSON API
$ curl -XGET http://es-search:9200/_cluster/health?pretty=true
{
"cluster_name" : "itmon-es",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 11,
"number_of_data_nodes" : 8,
"active_primary_shards" : 2990,
"active_shards" : 8970,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}
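A response like the one above is easy to consume from a script. The following Python sketch parses a copy of the relevant fields and derives two simple checks; the definition of "healthy" used here is an assumption for the example, not an official rule.

```python
import json

# Subset of the cluster health response shown above.
health_json = """
{
  "cluster_name": "itmon-es",
  "status": "green",
  "number_of_nodes": 11,
  "number_of_data_nodes": 8,
  "active_primary_shards": 2990,
  "active_shards": 8970,
  "unassigned_shards": 0
}
"""

health = json.loads(health_json)

# Total shard copies (primaries plus replicas) per primary shard.
copies_per_primary = health["active_shards"] / health["active_primary_shards"]

# Assumed health rule: green status and no unassigned shards.
is_healthy = health["status"] == "green" and health["unassigned_shards"] == 0

print(copies_per_primary, is_healthy)
```

For this response the ratio is 3 shard copies per primary shard.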
![Page 38: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/38.jpg)
Limitations ElasticSearch
• Requires a lot of RAM, mainly on data nodes
• IO intensive, careful deployment required
• Shards re-initialisation takes some time (~1h)
• Lots of shards and replicas per index, lots of indexes
• Not frequent operation, only after full cluster reboot
• Authentication not built-in (“bricolage”)
• Apache+Shibboleth on top of Jetty plugin
![Page 39: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/39.jpg)
Kibana
• “Make sense of a mountain of logs”
• Designed to analyse logs
• Perfectly fits timestamped data (e.g. metrics)
• Profits from ElasticSearch search/analyse features
• No coding required
• Simply point & click to build your own dashboard
• Fully integrated and supported by
ElasticSearch
• Started as separate project
![Page 40: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/40.jpg)
Kibana
• Built with AngularJS
• JavaScript MVC for client-side rich application
• Developed and maintained by
• No backend: web server delivers static files
• JS directly queries ElasticSearch
• Easy to install and configure
• “git clone” OR “tar -xvzf” OR ElasticSearch plugin
• 1-line config file to point to the ElasticSearch cluster
• Save its own configuration in ElasticSearch
![Page 41: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/41.jpg)
Our Deployment
• Production cluster
• Running ElasticSearch 0.90.7
• 2 master nodes (16GB RAM, 8 cores)
• 1 search node (16GB RAM, 8 cores)
• 8 data nodes (48GB RAM, 24 cores, 500GB SSD)
• Monitoring: ElasticHQ, BigDesk, and Head
• Indexes structure
• One index per day with 30 days TTL
• 10 shards per index, 3 replicas per shard
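The "one index per day with 30 days TTL" scheme can be sketched in Python as follows. The index name pattern with the `itmon` prefix is hypothetical (the slides do not state the real naming), and the function only computes which indexes fall outside the retention window; it does not talk to a cluster.

```python
from datetime import date, timedelta


def index_name(day: date, prefix: str = "itmon") -> str:
    """Build a daily index name; the prefix and pattern are assumptions."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indexes(existing, today: date, ttl_days: int = 30, prefix: str = "itmon"):
    """Return the indexes that fall outside the retention window."""
    keep = {index_name(today - timedelta(days=n), prefix) for n in range(ttl_days)}
    return sorted(name for name in existing if name not in keep)


today = date(2014, 3, 17)
existing = [index_name(today - timedelta(days=n)) for n in (0, 29, 30, 45)]
to_delete = expired_indexes(existing, today)
print(to_delete)
```

Dropping a whole daily index is a cheap way to enforce retention, since no individual documents need to be deleted.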
![Page 42: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/42.jpg)
Our Deployment
• Based on ElasticSearch plugin
• Running v3.pre-4
• Deployed together with search node
• Profits from Jetty authentication
• Different endpoints for AuthN
• Public (read only)
• Private (read write)
![Page 43: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/43.jpg)
Feedback
• Easy to deploy and manage
• Robust, fast, and rich API
• Easy query language (DSL)
• More features with aggregation framework
• Released with ElasticSearch v1.0
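As an illustration of the query DSL combined with the aggregation framework, the Python sketch below builds a request body that restricts documents to a recent time range and counts them per host. The field names `@timestamp` and `host` are assumptions about the document schema, and the body is only built and printed, not sent to a cluster.

```python
import json


def events_per_host(minutes: int) -> dict:
    """Build a search body: recent documents, bucketed by host."""
    return {
        "size": 0,  # return only the aggregation, no individual hits
        "query": {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
        "aggs": {"per_host": {"terms": {"field": "host", "size": 10}}},
    }


body = json.dumps(events_per_host(15), indent=2)
print(body)
```

Such a body would typically be sent to the `_search` endpoint of an index over the RESTful JSON API.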
![Page 44: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/44.jpg)
Feedback
• Easy to deploy and use
• Very cool user interface
• Fits many use cases: text (syslog), metrics (lemon)
• Many “panels” available: tables, charts, hits, etc.
• Very active community and growing
• Feature set still somewhat limited
• Many developments ongoing
![Page 45: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/45.jpg)
Notifications
![Page 46: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/46.jpg)
Motivation
• Modular tools to manage notifications
• Notifications delivered to multiple endpoints
• Automatic SNOW tickets / Central dashboard / etc.
• More efficient handling of notifications
• Enable SMs to improve automation of their services
• Improve routing of SNOW tickets
• Avoid wasting time in multiple (fake) hops
• Expose problems that were previously hidden from SMs
• Allow others to publish/consume notifications
![Page 47: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/47.jpg)
GNI
• General Notifications Infrastructure
• Manage all data centre notifications
• Messaging consumers integrating with other tools
• Multiple notification types: HW, APP, OS, NC
• Notifications delivered as SNOW Incidents
• Incidents assigned to appropriate support unit
• Incidents masking per notification type
• Notifications stored in ElasticSearch
• Visible via a dedicated Kibana dashboard
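The routing and masking behaviour described above might be sketched as below. This is a simplified Python illustration, not the GNI implementation: the support unit names, the incident fields, and the shape of a notification are all hypothetical.

```python
# Hypothetical mapping from notification type to support unit.
SUPPORT_UNITS = {
    "HW": "unit-hw",
    "APP": "unit-app",
    "OS": "unit-os",
    "NC": "unit-nc",
}


def route(notification: dict, masked_types: set):
    """Turn a notification into an incident, or None if its type is masked."""
    ntype = notification["type"]
    if ntype in masked_types:
        return None  # masked: no incident is created
    return {
        "assignment_group": SUPPORT_UNITS[ntype],
        "summary": f"{ntype} problem on {notification['host']}",
    }


incident = route({"type": "HW", "host": "node001"}, masked_types={"NC"})
masked = route({"type": "NC", "host": "node002"}, masked_types={"NC"})
print(incident, masked)
```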
![Page 48: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/48.jpg)
Deployment
• 3 VMs for messaging clients + ES cluster
• Using other IT services: ActiveMQ, SNOW
![Page 49: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/49.jpg)
Real-time Analytics
![Page 50: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/50.jpg)
Motivation
• Real-time analytics engine
• Automatic generation of curated data
• Easy to use under different contexts
• First target is aggregation of notifications
• Online machine learning, ETL, etc.
• Adopt open source tool
• Good candidates: Spark, Storm, ?
• Easy integration with current tools
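To give an idea of what "aggregation of notifications" could mean in practice, here is a plain Python sketch that groups notifications of the same type into fixed time windows. It stands in for whatever engine is eventually selected; the window length and the notification fields are assumptions.

```python
from collections import defaultdict


def aggregate(notifications, window_seconds: int = 300) -> dict:
    """Group affected hosts by notification type and fixed time window."""
    groups = defaultdict(list)
    for n in notifications:
        # Align the timestamp to the start of its window.
        window_start = n["timestamp"] - n["timestamp"] % window_seconds
        groups[(n["type"], window_start)].append(n["host"])
    return {key: sorted(set(hosts)) for key, hosts in groups.items()}


notifications = [
    {"timestamp": 1000, "type": "HW", "host": "node2"},
    {"timestamp": 1100, "type": "HW", "host": "node1"},
    {"timestamp": 1400, "type": "HW", "host": "node3"},
]
summary = aggregate(notifications)
print(summary)
```

One aggregated entry per type and window could then replace many individual notifications, for example a single ticket listing all affected hosts.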
![Page 51: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/51.jpg)
Summary
![Page 52: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/52.jpg)
Summary
| Before | After |
| --- | --- |
| Many central services | More platform services |
| Notifications limited to lemon | Generic notifications producers |
| Inefficient ticket routing | Flexible ticket routing |
| Limited to lemon metrics | Open to any monitoring data |
| Complex data access | Easy data access |
| Central lemon dashboard | Dashboard instances per application |
| Limited offline analytics | Batch analytics in HDFS |
| No real-time analytics | New real-time analytics tools |
![Page 53: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/53.jpg)
Summary
• New shared monitoring architecture
• Being adopted by all IT monitoring activities
• Selected technologies look good
• Flume, ES, Kibana, HDFS
• Happy to get your feedback on these and others
• Don’t forget the cultural changes
• Agile methodology, DevOps, PaaS, etc.
• As important as the technology changes
![Page 54: Seminar CNAF](https://reader030.vdocuments.us/reader030/viewer/2022012717/61afb15e1bcb5f469f467caf/html5/thumbnails/54.jpg)
Thanks!
http://cern.ch/itmon