ELK @ LinkedIn
Scaling ELK with Kafka
Introduction
Tin Le ([email protected])
Senior Site Reliability Engineer
Formerly part of Mobile SRE team, responsible for servers
handling mobile apps (IOS, Android, Windows, RIM, etc.)
traffic.
Now responsible for guiding ELK @ LinkedIn as a whole
Problems
● Multiple data centers, ten of thousands of servers,
hundreds of billions of log records
● Logging, indexing, searching, storing, visualizing and
analysing all of those logs all day every day
● Security (access control, storage, transport)
● Scaling to more DCs, more servers, and even more
logs…
● ARRGGGGHHH!!!!!
Solutions
● Commercialo Splunk, Sumo Logic, HP ArcSight Logger, Tibco,
XpoLog, Loggly, etc.
● Open Sourceo Syslog + Grep
o Graylog
o Elasticsearch
o etc.
Criterias
● Scalable - horizontally, by adding more nodes
● Fast - as close to real time as possible
● Inexpensive
● Flexible
● Large user community (support)
● Open source
ELK!
The winner is...
Splunk ???
ELK at LinkedIn
● 100+ ELK clusters across 20+ teams and 6
data centers
● Some of our larger clusters have:o Greater than 32+ billion docs (30+TB)
o Daily indices average 3.0 billion docs (~3TB)
ELK + Kafka
Summary: ELK is a popular open sourced application stack for
visualizing and analyzing logs. ELK is currently being used across
many teams within LinkedIn. The architecture we use is made up of
four components: Elasticsearch, Logstash, Kibana and Kafka.
● Elasticsearch: Distributed real-time search and analytics engine
● Logstash: Collect and parse all data sources into an easy-to-read
JSON format
● Kibana: Elasticsearch data visualization engine
● Kafka: Data transport, queue, buffer and short term storage
What is Kafka?
● Apache Kafka is a high-throughput distributed
messaging system
o Invented at LinkedIn and Open Sourced in 2011
o Fast, Scalable, Durable, and Distributed by Design
o Links for more: http://kafka.apache.org
http://data.linkedin.com/opensource/kafka
Kafka at LinkedIn
● Common data transport
● Available and supported by dedicated team
o 875 Billion messages per day
o 200 TB/day In
o 700 TB/day Out
o Peak Load 10.5 Million messages/s
18.5 Gigabits/s Inbound
70.5 Gigabits/s Outbound
Logging using Kafka at LinkedIn
● Dedicated cluster for logs in each data center
● Individual topics per application
● Defaults to 4 days of transport level retention
● Not currently replicating between data centers
● Common logging transport for all services, languages
and frameworks
ELK Architectural Concerns
● Network Concerns
o Bandwidth
o Network partitioning
o Latency
● Security Concerns
o Firewalls and ACLs
o Encrypting data in transit
● Resource Concerns
o A misbehaving application can swamp production resources
Multi-colo ELK ArchitectureELK Dashboard
13
Services
ELK Search
Clusters
Log
TransportKafka
ELK Search
Clusters
Services
DC1
Services
Kafka
ELK Search
Clusters
DC2
Services
Kafka
ELK Search
Clusters
DC3
Tribes
Corp Data Centers
ELK Search Architecture
Kibana
Elasticsearch
(tribe)
Kafka
Elasticsearch
(master)
Logstash
Elasticsearch
(data node)
Logstash
Elasticsearch
(data node)
Users
Operational Challenges
● Data, lots of it.
o Transporting, queueing, storing, securing,
reliability…
o Ingesting & Indexing fast enough
o Scaling infrastructure
o Which data? (right data needed?)
o Formats, mapping, transformation Data from many sources: Java, Scala, Python, Node.js, Go
Operational Challenges...
● Centralized vs Siloed Cluster Management
● Aggregated views of data across the entire
infrastructure
● Consistent view (trace up/down app stack)
● Scaling - horizontally or vertically?
● Monitoring, alerting, auto-remediating
The future of ELK at LinkedIn
● More ELK clusters being used by even more teams
● Clusters with 300+ billion docs (300+TB)
● Daily indices average 10+ billion docs, 10TB - move to
hourly indices
● ~5,000 shards per cluster
Extra slides
Next two slides contain example logstash
configs to show how we use input pipe plugin
with Kafka Console Consumer, and how to
monitor logstash using metrics filter.
KCC pipe input config
pipe {
type => "mobile"
command => "/opt/bin/kafka-console-consumer/kafka-console-consumer.sh \
--formatter com.linkedin.avro.KafkaMessageJsonWithHexFormatter \
--property schema.registry.url=http://schema-server.example.com:12250/schemaRegistry/schemas \
--autocommit.interval.ms=60000 \
--zookeeper zk.example.com:12913/kafka-metrics \
--topic log_stash_event \
--group logstash1"
codec => “json”
}
Monitoring Logstash metrics
filter {
metrics {
meter => "events"
add_tag => "metric"
}
}
output {
if “metric” in [tags] [
stdout {
codec => line {
format => “Rate: %{events.rate_1m}”
}
}
}