
ELK @ LinkedIn

Scaling ELK with Kafka

Introduction

Tin Le (tinle@linkedin.com)

Senior Site Reliability Engineer

Formerly part of the Mobile SRE team, responsible for the servers handling mobile app traffic (iOS, Android, Windows, RIM, etc.)

Now responsible for guiding ELK @ LinkedIn as a whole

Problems

● Multiple data centers, tens of thousands of servers, hundreds of billions of log records

● Logging, indexing, searching, storing, visualizing, and analyzing all of those logs, all day, every day

● Security (access control, storage, transport)

● Scaling to more DCs, more servers, and even more logs…

● ARRGGGGHHH!!!!!

Solutions

● Commercial

o Splunk, Sumo Logic, HP ArcSight Logger, Tibco, XpoLog, Loggly, etc.

● Open Source

o Syslog + grep

o Graylog

o Elasticsearch

o etc.

Criteria

● Scalable - horizontally, by adding more nodes

● Fast - as close to real time as possible

● Inexpensive

● Flexible

● Large user community (support)

● Open source

The winner is...

Splunk???

ELK!

ELK at LinkedIn

● 100+ ELK clusters across 20+ teams and 6 data centers

● Some of our larger clusters have:

o More than 32 billion docs (30+ TB)

o Daily indices averaging 3.0 billion docs (~3 TB)

ELK + Kafka

Summary: ELK is a popular open-source application stack for visualizing and analyzing logs. ELK is currently used across many teams within LinkedIn. The architecture we use is made up of four components: Elasticsearch, Logstash, Kibana, and Kafka.

● Elasticsearch: Distributed real-time search and analytics engine

● Logstash: Collect and parse all data sources into an easy-to-read JSON format

● Kibana: Elasticsearch data visualization engine

● Kafka: Data transport, queue, buffer, and short-term storage
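
To make the flow concrete, the minimal Logstash pipeline below consumes log events from Kafka and indexes them into Elasticsearch. This is an illustrative sketch, not our production config: it uses the logstash-input-kafka plugin rather than the pipe/console-consumer approach shown in the extra slides, and the ZooKeeper address, topic, and Elasticsearch host are placeholders.

input {
  # Consume log events from Kafka (logstash-input-kafka; zk_connect and
  # topic_id are the Logstash 1.x-era option names).
  kafka {
    zk_connect => "zk.example.com:2181/kafka-logs"
    topic_id   => "app_log"
    group_id   => "logstash_indexer"
  }
}

output {
  # Index into one Elasticsearch index per day of logs.
  elasticsearch {
    host  => "es.example.com"
    index => "logstash-%{+YYYY.MM.dd}"
  }
}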

What is Kafka?

● Apache Kafka is a high-throughput distributed messaging system

o Invented at LinkedIn and open sourced in 2011

o Fast, Scalable, Durable, and Distributed by Design

o For more: http://kafka.apache.org and http://data.linkedin.com/opensource/kafka

Kafka at LinkedIn

● Common data transport

● Available and supported by a dedicated team

o 875 billion messages per day

o 200 TB/day in

o 700 TB/day out

o Peak load: 10.5 million messages/s (18.5 gigabits/s inbound, 70.5 gigabits/s outbound)

Logging using Kafka at LinkedIn

● Dedicated cluster for logs in each data center

● Individual topics per application

● Defaults to 4 days of transport-level retention (see the example below)

● Not currently replicating between data centers

● Common logging transport for all services, languages, and frameworks
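
For illustration only: with stock Kafka tooling, a per-application topic with a 4-day retention override could be created as below. The ZooKeeper address, topic name, and partition/replica counts are hypothetical examples, not LinkedIn's actual settings.

# Create a per-application log topic with a 4-day retention override
# (345,600,000 ms = 4 days). All names and counts here are examples.
kafka-topics.sh --create \
  --zookeeper zk.example.com:2181/kafka-logs \
  --topic myapp_log \
  --partitions 8 --replication-factor 2 \
  --config retention.ms=345600000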

ELK Architectural Concerns

● Network Concerns

o Bandwidth

o Network partitioning

o Latency

● Security Concerns

o Firewalls and ACLs

o Encrypting data in transit

● Resource Concerns

o A misbehaving application can swamp production resources

Multi-colo ELK Architecture

[Architecture diagram: in each data center (DC1, DC2, DC3), LinkedIn services send logs through Kafka (the log transport) into per-DC ELK search clusters; tribe nodes in the corp data centers federate those clusters behind a single ELK dashboard.]

ELK Search Architecture

[Architecture diagram: Kafka feeds Logstash instances, which index into Elasticsearch data nodes coordinated by an Elasticsearch master node; users query through Kibana, which connects to an Elasticsearch tribe node spanning the clusters.]
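
The tribe node is what lets a single Kibana instance search across separate per-DC clusters. As a minimal elasticsearch.yml sketch for a tribe node joining two clusters (tribe nodes are a feature of the Elasticsearch 1.x/2.x releases of this era; the cluster names are illustrative):

# elasticsearch.yml for a tribe node; cluster names are examples
tribe:
  dc1:
    cluster.name: elk-dc1
  dc2:
    cluster.name: elk-dc2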

Operational Challenges

● Data, lots of it.

o Transporting, queueing, storing, securing, reliability…

o Ingesting & indexing fast enough

o Scaling infrastructure

o Which data? (is the right data being captured?)

o Formats, mapping, transformation (see the filter sketch below)

● Data from many sources: Java, Scala, Python, Node.js, Go
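
One place the format problem shows up is in the Logstash filter chain, where differently shaped events get normalized onto a common schema. The sketch below is illustrative only; the "ts" field, the type name, and the service tagging are assumptions, not our production config.

filter {
  # Parse JSON-formatted payloads into structured fields;
  # leave non-JSON events untouched.
  if [type] == "json_log" {
    json {
      source => "message"
    }
  }
  # Normalize the event time into @timestamp, whatever the source.
  date {
    match => ["ts", "ISO8601", "UNIX_MS"]   # "ts" is a hypothetical field
  }
  # Tag every event with a common field for cross-stack search.
  mutate {
    add_field => { "service" => "%{type}" }
  }
}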

Operational Challenges...

● Centralized vs Siloed Cluster Management

● Aggregated views of data across the entire infrastructure

● Consistent view (trace up/down app stack)

● Scaling - horizontally or vertically?

● Monitoring, alerting, auto-remediating

The future of ELK at LinkedIn

● More ELK clusters being used by even more teams

● Clusters with 300+ billion docs (300+TB)

● Daily indices averaging 10+ billion docs (~10 TB); moving to hourly indices (see the sketch below)

● ~5,000 shards per cluster
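
In Logstash terms, the move from daily to hourly indices is just a change to the index pattern in the elasticsearch output. A minimal sketch, with a placeholder host name:

output {
  elasticsearch {
    host  => "es.example.com"   # placeholder ES node
    # Appending the hour gives one index per hour instead of per day,
    # keeping each index small enough to manage at this volume.
    index => "logstash-%{+YYYY.MM.dd.HH}"
  }
}

Smaller, time-bounded indices also make retention cheap: expiring old data becomes an index deletion rather than a document-level purge.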

Extra slides

The next two slides contain example Logstash configs showing how we use the pipe input plugin with the Kafka Console Consumer, and how to monitor Logstash using the metrics filter.

KCC pipe input config

pipe {
  type => "mobile"
  # Shell out to the Kafka Console Consumer (KCC) and read its stdout.
  command => "/opt/bin/kafka-console-consumer/kafka-console-consumer.sh \
    --formatter com.linkedin.avro.KafkaMessageJsonWithHexFormatter \
    --property schema.registry.url=http://schema-server.example.com:12250/schemaRegistry/schemas \
    --autocommit.interval.ms=60000 \
    --zookeeper zk.example.com:12913/kafka-metrics \
    --topic log_stash_event \
    --group logstash1"
  codec => "json"
}

Monitoring Logstash metrics

filter {
  metrics {
    # Count every event flowing through this Logstash instance and
    # periodically emit a summary event tagged "metric".
    meter => "events"
    add_tag => "metric"
  }
}

output {
  # Print only the generated metric events, not the log events themselves.
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "Rate: %{events.rate_1m}"
      }
    }
  }
}
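
With this config, Logstash periodically prints a line such as "Rate: 10234.5" (the metrics filter flushes every 5 seconds by default), where the value is the one-minute moving average of events per second, giving a cheap way to watch indexing throughput.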
