ELK at LinkedIn - Kafka, scaling, lessons learned

ELK @ LinkedIn Scaling ELK with Kafka

Upload: tin-le

Post on 21-Apr-2017


TRANSCRIPT

Page 1: ELK at LinkedIn - Kafka, scaling, lessons learned

ELK @ LinkedIn
Scaling ELK with Kafka

Page 2: ELK at LinkedIn - Kafka, scaling, lessons learned

Introduction

Tin Le ([email protected])
Senior Site Reliability Engineer

Formerly part of the Mobile SRE team, responsible for servers handling mobile app (iOS, Android, Windows, RIM, etc.) traffic.

Now responsible for guiding ELK @ LinkedIn as a whole

Page 3: ELK at LinkedIn - Kafka, scaling, lessons learned

Problems
● Multiple data centers, tens of thousands of servers, hundreds of billions of log records
● Logging, indexing, searching, storing, visualizing and analysing all of those logs all day every day
● Security (access control, storage, transport)
● Scaling to more DCs, more servers, and even more logs…
● ARRGGGGHHH!!!!!

Page 4: ELK at LinkedIn - Kafka, scaling, lessons learned

Solutions
● Commercial
  o Splunk, Sumo Logic, HP ArcSight Logger, Tibco, XpoLog, Loggly, etc.
● Open Source
  o Syslog + Grep
  o Graylog
  o Elasticsearch
  o etc.

Page 5: ELK at LinkedIn - Kafka, scaling, lessons learned

Criteria
● Scalable - horizontally, by adding more nodes
● Fast - as close to real time as possible
● Inexpensive
● Flexible
● Large user community (support)
● Open source

Page 6: ELK at LinkedIn - Kafka, scaling, lessons learned

The winner is...

Splunk ???

ELK!

Page 7: ELK at LinkedIn - Kafka, scaling, lessons learned

ELK at LinkedIn
● 100+ ELK clusters across 20+ teams and 6 data centers
● Some of our larger clusters have:
  o More than 32 billion docs (30+ TB)
  o Daily indices averaging 3.0 billion docs (~3 TB)
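The figures on this slide imply an average document size of roughly one kilobyte. A minimal sketch of that arithmetic, assuming decimal terabytes and treating the "30+" and "32+" values as exact (both are assumptions, not stated in the deck):

```python
# Back-of-the-envelope sizes from the slide's figures.
# Assumption: decimal units (1 TB = 10**12 bytes) and the "+" values taken as exact.
TB = 10**12

def avg_doc_bytes(total_bytes: float, doc_count: float) -> float:
    """Average stored size per document, in bytes."""
    return total_bytes / doc_count

cluster_avg = avg_doc_bytes(30 * TB, 32e9)  # "32 billion docs (30+ TB)"
daily_avg = avg_doc_bytes(3 * TB, 3.0e9)    # "3.0 billion docs (~3 TB)"

# How many average daily indices fit into the larger cluster figure.
days_of_data = 32e9 / 3.0e9

print(f"cluster average: ~{cluster_avg:.0f} bytes/doc")
print(f"daily average:   ~{daily_avg:.0f} bytes/doc")
print(f"days of data:    ~{days_of_data:.1f}")
```

Under these assumptions, a large cluster holds on the order of ten to eleven days of daily indices.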

Page 8: ELK at LinkedIn - Kafka, scaling, lessons learned

ELK + Kafka

Summary: ELK is a popular open source application stack for visualizing and analyzing logs. ELK is currently being used across many teams within LinkedIn. The architecture we use is made up of four components: Elasticsearch, Logstash, Kibana and Kafka.

● Elasticsearch: Distributed real-time search and analytics engine
● Logstash: Collect and parse all data sources into an easy-to-read JSON format
● Kibana: Elasticsearch data visualization engine
● Kafka: Data transport, queue, buffer and short term storage
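To illustrate the "parse into JSON" role that Logstash plays here, the following is a minimal Python sketch, not Logstash itself. The log layout (`<timestamp> <LEVEL> <service> <message>`), the field names, and the `_parse_failure` tag are all hypothetical and chosen only for illustration.

```python
import json
import re

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <service> <message>".
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)

def parse_line(line: str) -> dict:
    """Turn one raw log line into a flat dict that can be indexed as JSON."""
    stripped = line.strip()
    match = LINE_RE.match(stripped)
    if match is None:
        # Keep unparseable lines instead of dropping them, tagged for review.
        return {"message": stripped, "tags": ["_parse_failure"]}
    return match.groupdict()

sample = "2015-04-21T10:15:00Z ERROR mobile-api Connection reset by peer"
doc = parse_line(sample)
print(json.dumps(doc, sort_keys=True))
```

Keeping failed lines with a tag, rather than discarding them, makes parsing problems visible in the same search interface as the logs.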

Page 9: ELK at LinkedIn - Kafka, scaling, lessons learned

What is Kafka?
● Apache Kafka is a high-throughput distributed messaging system
  o Invented at LinkedIn and open sourced in 2011
  o Fast, scalable, durable, and distributed by design
  o Links for more:
    http://kafka.apache.org
    http://data.linkedin.com/opensource/kafka

Page 10: ELK at LinkedIn - Kafka, scaling, lessons learned

Kafka at LinkedIn
● Common data transport
● Available and supported by dedicated team
  o 875 billion messages per day
  o 200 TB/day in
  o 700 TB/day out
  o Peak load
    10.5 million messages/s
    18.5 gigabits/s inbound
    70.5 gigabits/s outbound
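For comparison with the peak figures, the daily totals on this slide can be converted into average rates. A minimal sketch, assuming decimal terabytes and a uniform spread over 24 hours (both assumptions, used only to put the numbers on the same scale):

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # assumption: decimal terabytes

def avg_messages_per_second(messages_per_day: float) -> float:
    """Average message rate if the daily total were spread evenly."""
    return messages_per_day / SECONDS_PER_DAY

def avg_gigabits_per_second(terabytes_per_day: float) -> float:
    """Average bandwidth in Gbit/s for a daily volume given in TB."""
    bits_per_day = terabytes_per_day * TB * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9

avg_msgs = avg_messages_per_second(875e9)
avg_in = avg_gigabits_per_second(200)
avg_out = avg_gigabits_per_second(700)
fan_out = 700 / 200  # bytes read per byte written

print(f"average messages/s: {avg_msgs / 1e6:.1f} million")
print(f"average inbound:    {avg_in:.1f} Gbit/s")
print(f"average outbound:   {avg_out:.1f} Gbit/s")
print(f"read/write ratio:   {fan_out:.1f}x")
```

Under these assumptions the averages come out at roughly 10.1 million messages/s, 18.5 Gbit/s in and 64.8 Gbit/s out, with each byte written being read about 3.5 times.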

Page 11: ELK at LinkedIn - Kafka, scaling, lessons learned

Logging using Kafka at LinkedIn
● Dedicated cluster for logs in each data center
● Individual topics per application
● Defaults to 4 days of transport level retention
● Not currently replicating between data centers
● Common logging transport for all services, languages and frameworks

Page 12: ELK at LinkedIn - Kafka, scaling, lessons learned

ELK Architectural Concerns
● Network Concerns
  o Bandwidth
  o Network partitioning
  o Latency
● Security Concerns
  o Firewalls and ACLs
  o Encrypting data in transit
● Resource Concerns
  o A misbehaving application can swamp production resources

Page 13: ELK at LinkedIn - Kafka, scaling, lessons learned

Multi-colo ELK Architecture

[Diagram. Labels: ELK Dashboard; Tribes; Corp Data Centers; LinkedIn Services; Log Transport. Each of DC1, DC2 and DC3 contains Services, Kafka and ELK Search Clusters.]

Page 14: ELK at LinkedIn - Kafka, scaling, lessons learned

ELK Search Architecture

[Diagram. Labels: Users; Kibana; Elasticsearch (tribe); Elasticsearch (master); Kafka; Logstash (two instances); Elasticsearch (data node) (two instances).]

Page 15: ELK at LinkedIn - Kafka, scaling, lessons learned

Operational Challenges
● Data, lots of it.
  o Transporting, queueing, storing, securing, reliability…
  o Ingesting & indexing fast enough
  o Scaling infrastructure
  o Which data? (right data needed?)
  o Formats, mapping, transformation

Data from many sources: Java, Scala, Python, Node.js, Go

Page 16: ELK at LinkedIn - Kafka, scaling, lessons learned

Operational Challenges...
● Centralized vs Siloed Cluster Management
● Aggregated views of data across the entire infrastructure
● Consistent view (trace up/down app stack)
● Scaling - horizontally or vertically?
● Monitoring, alerting, auto-remediating

Page 17: ELK at LinkedIn - Kafka, scaling, lessons learned

The future of ELK at LinkedIn
● More ELK clusters being used by even more teams
● Clusters with 300+ billion docs (300+ TB)
● Daily indices average 10+ billion docs, 10 TB - move to hourly indices
● ~5,000 shards per cluster
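Moving from daily to hourly indices mostly means adding an hour component to the index name and accepting 24 times as many indices per day. A minimal sketch: the daily pattern matches Logstash's default `logstash-YYYY.MM.dd` naming, while the hourly suffix and the helper names below are assumptions for illustration, not the naming used at LinkedIn.

```python
from datetime import datetime, timedelta, timezone

def index_name(ts: datetime, prefix: str = "logstash", hourly: bool = False) -> str:
    """Build a time-based index name in UTC (hourly suffix is an assumed convention)."""
    fmt = "%Y.%m.%d.%H" if hourly else "%Y.%m.%d"
    return f"{prefix}-{ts.astimezone(timezone.utc).strftime(fmt)}"

def hourly_indices_for_day(day: datetime, prefix: str = "logstash") -> list:
    """List the 24 hourly index names covering one UTC day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return [index_name(start + timedelta(hours=h), prefix, hourly=True) for h in range(24)]

ts = datetime(2015, 4, 21, 13, 5, tzinfo=timezone.utc)
daily = index_name(ts)
hourly = index_name(ts, hourly=True)
names = hourly_indices_for_day(ts)

# Rough size per hourly index, taking the slide's 10 TB/day as exact.
tb_per_hourly_index = 10 / 24

print(daily, hourly, len(names), f"{tb_per_hourly_index:.2f} TB")
```

At 10 TB per day, each hourly index would hold a little over 0.4 TB on average, which keeps individual indices closer to the size of the current ~3 TB daily ones than a 10 TB daily index would.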

Page 18: ELK at LinkedIn - Kafka, scaling, lessons learned

Extra slides

The next two slides contain example Logstash configs showing how we use the pipe input plugin with the Kafka Console Consumer, and how to monitor Logstash using the metrics filter.

Page 19: ELK at LinkedIn - Kafka, scaling, lessons learned

KCC pipe input config

pipe {
  type => "mobile"
  command => "/opt/bin/kafka-console-consumer/kafka-console-consumer.sh \
    --formatter com.linkedin.avro.KafkaMessageJsonWithHexFormatter \
    --property schema.registry.url=http://schema-server.example.com:12250/schemaRegistry/schemas \
    --autocommit.interval.ms=60000 \
    --zookeeper zk.example.com:12913/kafka-metrics \
    --topic log_stash_event \
    --group logstash1"
  codec => "json"
}
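Conceptually, a pipe input with a JSON codec runs a command and treats each line of its standard output as one JSON event. The Python sketch below imitates that behaviour under stated assumptions: `read_json_events` is a hypothetical helper, and a tiny inline Python script stands in for the Kafka console consumer so the example runs without a Kafka cluster.

```python
import json
import subprocess
import sys

def read_json_events(command, event_type):
    """Run `command` and yield one dict per JSON line written to its stdout."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            event["type"] = event_type  # mirrors the `type` setting in the config
            yield event
    finally:
        proc.stdout.close()
        if proc.poll() is None:
            proc.terminate()  # stop a still-running consumer on early exit
        proc.wait()

# Stand-in for the consumer script: prints three JSON lines and exits.
fake_consumer = [
    sys.executable,
    "-c",
    "import json\nfor i in range(3): print(json.dumps({'seq': i}))",
]

events = list(read_json_events(fake_consumer, "mobile"))
print(events)
```

In a real deployment the command never exits on its own, so the consuming loop runs until the process is stopped or the pipe breaks.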

Page 20: ELK at LinkedIn - Kafka, scaling, lessons learned

Monitoring Logstash metrics

filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "Rate: %{events.rate_1m}"
      }
    }
  }
}
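The `rate_1m` value printed above is a per-second event rate over roughly the last minute. A minimal sketch of one way to compute such a figure, using a plain sliding window; this is an illustrative simplification, and the plugin's actual computation may differ (for instance by using a weighted moving average). The class name and its methods are hypothetical.

```python
from collections import deque

class SlidingRateMeter:
    """Per-second event rate over a fixed trailing window (simplified model)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.stamps = deque()

    def mark(self, now: float) -> None:
        """Record one event at time `now` (seconds)."""
        self.stamps.append(now)
        self._evict(now)

    def rate(self, now: float) -> float:
        """Events per second over the window ending at `now`."""
        self._evict(now)
        return len(self.stamps) / self.window

    def _evict(self, now: float) -> None:
        # Drop timestamps that fell out of the trailing window.
        while self.stamps and self.stamps[0] < now - self.window:
            self.stamps.popleft()

meter = SlidingRateMeter(window_seconds=60.0)
for i in range(120):
    meter.mark(i * 0.5)  # two events per second for one minute

busy_rate = meter.rate(59.5)
idle_rate = meter.rate(200.0)
print(f"Rate: {busy_rate}", f"Rate: {idle_rate}")
```

Timestamps are passed in explicitly so the behaviour is deterministic; a real meter would read the clock itself.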