ELK Wrestling (Leeds DevOps)
DESCRIPTION
Talk I did on log aggregation with the ELK stack at Leeds DevOps. Covers how we process over 800,000 logs per hour at LateRooms, and the cultural changes this has helped drive.

TRANSCRIPT
{
  "title": "ELK Wrestling",
  "author": "Steve Elliott",
  "company": "LateRooms.com",
  "type": "DevOpsLeeds",
  "@timestamp": "2014-10-13T18:30Z"
}
Featuring Live Demo!
Please tweet! Include: “leedsdevops”
Home growing a metrics culture
Needed visibility of live issues
Had trialled off-the-shelf before (Splunk)
Hadn’t gained traction
Wanted the data still
Options...
Tried Splunk
...Bit pricey: you pay for hardware and for the volume of data indexed
Looked at cloud-based options; they were also expensive
It started with Badger...
Logging and Monitoring Project
Locate and implement the tools we needed
Started with Cube for metrics (wouldn’t recommend)
Moved onto logging
Current tooling...
...Lacking
“But it works”
What can we log?
Pretty much anything with a timestamp
Error logs
Web logs
Proxy logs
Releases?
Tweets?
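Since “pretty much anything with a timestamp” qualifies, here is a minimal sketch (function and field names are hypothetical, not from the talk) of emitting a structured log event with an `@timestamp` field, ready for the pipeline:

```python
import json
from datetime import datetime, timezone

def make_event(event_type, message, **fields):
    # Build a structured log line with an @timestamp field,
    # following the Logstash event convention.
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "message": message,
    }
    event.update(fields)
    return json.dumps(event)

line = make_event("web", "GET /hotels 200", status=200)
```

Anything shaped like this, whether it started life as an IIS log, an error, or a tweet, can flow through the same pipeline.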
Logstash
ELK
High level architectural design
Web servers → Queue → Elasticsearch → Dashboards
(plus the rest of Badger)
Real-time search and analytics database
Who’s using it?
...Clever people
Certain other hotel website...
Working with Elasticsearch
● RESTful API
● JSON
● Many libraries to deal with it (a new one: ElasticLinq for C#)
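Because the API is just REST plus JSON, you can talk to it from anything; a hedged sketch (index and field names are assumptions, not from the talk) of building a standard query-DSL request body:

```python
import json

def build_search(term, field="message", size=10):
    # Elasticsearch query DSL: a match query in a standard request body.
    return {"query": {"match": {field: term}}, "size": size}

body = json.dumps(build_search("timeout"))
# Against a live node this would be POSTed to the _search endpoint, e.g.
# (using the third-party `requests` library):
# requests.post("http://localhost:9200/logstash-*/_search", data=body)
```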
Sense Chrome Extension
Clustering
● Excellent distributed features
● Easy to use
● Node self-discovery
● Different node types (Data, Master, Search, Client)
“Live”: SSD
“Archive”: HDD
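One common way to implement a live/archive split like this is Elasticsearch's shard-allocation filtering; a sketch under the assumption (not stated in the talk) that nodes are started with a custom attribute such as `node.box_type: live` or `node.box_type: archive`:

```python
import json

def tier_settings(box_type):
    # Index-level allocation filter: Elasticsearch will move this index's
    # shards onto nodes whose box_type attribute matches.
    return {"index.routing.allocation.require.box_type": box_type}

payload = json.dumps(tier_settings("archive"))
# Applied with a PUT to /<index>/_settings on a live cluster, e.g.:
# requests.put("http://localhost:9200/logstash-2014.10.01/_settings", data=payload)
```

Yesterday's index gets re-tagged from "live" to "archive" and the cluster migrates its shards from SSD to HDD nodes on its own.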
More in-depth architecture
IIS logs, errors and WMI → Collector (e.g. live server) → Queue Forwarder → RabbitMQ → Filter & Forward → Cube (/TSDB) and search/analytics (Elasticsearch)
Logstash
Inputs (e.g. HTTP logs, UDP, error logs, tweets)
Filters (e.g. filter, grok, look up IP, magic…)
Outputs (e.g. UDP, Elasticsearch, Graphite, IRC)
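Those three stages map directly onto a Logstash pipeline config; a minimal sketch (port, `type` value and field names are assumptions, not the talk's actual config):

```
input {
  udp { port => 5140 type => "iislog" }
}
filter {
  grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] }
  geoip { source => "clientip" }   # the "look up IP" step
}
output {
  elasticsearch { host => "localhost" }
}
```

Swap inputs and outputs freely: the filter stage in the middle stays the same whether events arrive over UDP or leave for Graphite.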
Why the Queue?
● Resiliency
● Single source of data for everyone
● Logstash used to recommend RabbitMQ, now they recommend Redis
● We still use RabbitMQ; it works for us
Kibana
● Easy to build dashboards
● Gateway drug to Elasticsearch queries
● Examples!
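Under the hood, a Kibana histogram panel boils down to a time-bucketed query; a hedged sketch of an aggregation-style request body (field names and interval are assumptions):

```python
import json

def errors_per_hour(interval="1h"):
    # Count matching events per time bucket, the query shape behind
    # a dashboard histogram panel.
    return {
        "query": {"match": {"type": "error"}},
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "@timestamp", "interval": interval}
            }
        },
        "size": 0,
    }

body = json.dumps(errors_per_hour())
```

Once people see the query a panel generates, writing their own is a small step, which is the "gateway drug" effect.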
But...
Demo
Mistake: Dashboard Fatigue
Too many dashboards to watch!
Need to do more on alerting
Mistake: Using elasticsearch as a TSDB
Lots of graphs only cared about top-level values; a TSDB (such as Graphite) is a better fit there
Elasticsearch’s use case is more in-depth data analysis
Mistake: Trying to keep too much data
● Nodes going out of memory or disk space is bad
● Long GC pauses can cause nodes to drop
● Can lead to split brain
● More shards = more memory usage; watch your scaling
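Curator (mentioned at the end of the talk) is the usual tool for enforcing retention; purely as an illustration of the idea, a self-contained sketch (names and retention window are hypothetical) that picks out daily `logstash-YYYY.MM.DD` indices older than a cutoff:

```python
from datetime import date, timedelta

def indices_to_drop(index_names, today, keep_days=14):
    # Return daily logstash-YYYY.MM.DD indices older than the retention window.
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in index_names:
        try:
            day = date(*map(int, name.removeprefix("logstash-").split(".")))
        except ValueError:
            continue  # skip anything that is not a daily index
        if day < cutoff:
            old.append(name)
    return old
```

Dropping whole daily indices is cheap; deleting individual documents is not, which is one reason daily indices are the standard pattern.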
Scaling
Hit two bottlenecks:
- Ingestion (solved with SSDs)
- Search (solved by scaling horizontally)
1.4.0 brings stability improvements and should handle OOM better
Other Mistakes
Should have automated sooner (good Chef/Puppet support)
Should have used “normal” Logstash more
More nodes
More awesome??
What went right?
● Free and easy access to data
● Doesn’t need to be on Elasticsearch, but the tooling makes it easy
● Give people access and they’ll seek out the data to drive decisions: start the feedback loop
● Dev/Test instance
ELK in the wild
Data Driven QA
Data Driven...Managering
But wait, there’s more!
Curator, Kibana 4 (woo, aggregations!), alerting, linking logs together…
Too much to cover here!
Thanks for Listening!
More: elasticsearch.org, logstash.net
Blog: www.tegud.net
Twitter: @tegud
GitHub: www.github.com/tegud
Come say hi!