Introduction to Kafka
TRANSCRIPT
Introduction to Kafka
Akash Vacher
2015/12/5
▪ Akash Vacher, SRE, Data Infrastructure Streaming (Bengaluru), LinkedIn
SRE?
▪Site Reliability Engineers
–Administrators
–Architects
–Developers
▪Keep the site running, always
Agenda
▪ Kafka Overview
▪ Some facts and figures
▪ Basic Kafka concepts
▪ Some use cases
▪ Q and A
Kafka Overview
▪ High-throughput distributed messaging system
▪ Kafka guarantees:
– At least once delivery
– Strong ordering
▪ Developed at LinkedIn and open sourced in early 2011
▪ Implemented in Scala and Java
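The "at least once" guarantee above has a practical consequence worth spelling out: a producer retries until it sees an acknowledgement, so a lost ack (rather than a lost message) results in a duplicate in the log. The sketch below is an illustration of that failure mode, not Kafka's actual implementation; the `Broker` class and `ack_lost` flag are hypothetical.

```python
# Illustrative sketch (not Kafka's code) of at-least-once delivery:
# the producer retries until acked, so a lost ack yields a duplicate.

class Broker:
    def __init__(self):
        self.log = []          # append-only message log

    def receive(self, msg, ack_lost=False):
        self.log.append(msg)   # message is durably appended
        return not ack_lost    # the ack may be lost on the way back

def produce(broker, msg, fail_first_ack=False):
    """Retry until an ack arrives: at-least-once semantics."""
    acked = broker.receive(msg, ack_lost=fail_first_ack)
    while not acked:
        acked = broker.receive(msg)  # retry appends a duplicate

broker = Broker()
produce(broker, "event-1")
produce(broker, "event-2", fail_first_ack=True)  # ack lost once
print(broker.log)  # ['event-1', 'event-2', 'event-2']
```

This is why consumers that need effectively-once processing must be idempotent or deduplicate on a key.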
Kafka users
Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Attributes of a Kafka Cluster
• Disk Based
• Durable
• Scalable
• Low Latency
• Finite Retention
Motivation
▪ Unified platform to handle all real time data feeds
▪ High throughput
▪ Stream Processing
▪ Horizontally scalable
Before
After
How is Kafka used at LinkedIn?
▪ Monitoring (inGraphs)
▪ User tracking
▪ Email and SMS notifications
▪ Stream processing (Samza)
▪ Database Replication
Facts and figures
▪ Over 1,300,000,000,000 messages are produced to Kafka every day at LinkedIn
▪ 300 Terabytes of inbound and 900 Terabytes of outbound traffic
▪ 4.5 million messages per second on a single cluster
▪ Kafka runs on ~1300 servers at LinkedIn
Building blocks
The humble log
Anatomy of a topic
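The original slide showed this as a diagram; the same anatomy can be modeled in a few lines. A topic is split into partitions, each an append-only log where every message receives a monotonically increasing offset, and a stable hash of the message key selects the partition. This is an illustrative model, not Kafka's code, and `crc32` stands in for the real (pluggable) partitioner.

```python
# Minimal model of a topic: partitions are append-only logs with
# offsets; same key -> same partition -> strong per-key ordering.
import zlib

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # crc32 is an assumption here; Kafka's partitioner is pluggable
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

topic = Topic(num_partitions=3)
p1, o1 = topic.produce("user-42", "page_view")
p2, o2 = topic.produce("user-42", "click")
assert p1 == p2      # same key lands on the same partition
assert o2 == o1 + 1  # offsets grow monotonically within a partition
```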
Consumer groups
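The consumer-group diagram can be summarized as: each partition is owned by exactly one consumer in the group, so the group collectively reads every partition while sharing the load. Below is a hedged sketch of a round-robin assignment; it is not Kafka's assignor code, and the consumer names are made up.

```python
# Illustrative round-robin assignment of partitions to the members
# of one consumer group: each partition gets exactly one owner.

def assign(partitions, consumers):
    """Deal partitions out to consumers like cards."""
    owners = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        owners[consumers[i % len(consumers)]].append(p)
    return owners

owners = assign(partitions=[0, 1, 2, 3, 4, 5], consumers=["c1", "c2"])
print(owners)  # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

Adding a third consumer to the group would trigger a rebalance that redistributes the six partitions two apiece.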
Bird’s eye view
Kafka in action
Broker AP0
AP1
AP1
AP0 AP0
Consumer
Producer
Zookeeper
Performance recipe
▪ OS page cache
▪ Linear I/O: never fear the file system!
▪ sendfile() system call
▪ Message batching
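The batching item deserves a number. With a fixed per-request overhead, sending N messages in batches of B costs roughly N/B times that overhead instead of N times. The figures below are hypothetical, purely to make the arithmetic concrete.

```python
# Back-of-the-envelope sketch of why message batching helps.
# The 50µs per-request overhead is an assumed, illustrative number.

def batches(messages, batch_size):
    """Group messages into fixed-size batches (last may be short)."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

msgs = [f"m{i}" for i in range(1000)]
per_request_overhead_us = 50           # assumed fixed cost per send

unbatched_cost = len(msgs) * per_request_overhead_us
batched = batches(msgs, batch_size=100)
batched_cost = len(batched) * per_request_overhead_us

print(len(batched), unbatched_cost, batched_cost)  # 10 50000 500
```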
Operating Kafka
▪ Broker Hardware
– Cisco C240, Intel Xeon quad core, 64GB RAM, 14-disk RAID-10
▪ Zookeeper Hardware
– 5 + 1 ensemble, 64GB RAM, 500GB SSD
Operating Kafka
▪ Monitoring
– Under-replicated partitions
– Unclean leader election
– Lag monitoring
– Burrow
▪ Cluster rebalance
– Size-wise rebalance
– Partition-wise rebalance
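The lag monitoring mentioned above measures, per partition, how far a consumer group's committed offset trails the broker's log-end offset; tools like Burrow evaluate this over time. The sketch below shows the core arithmetic only, with made-up offsets for illustration.

```python
# Illustrative consumer-lag calculation: lag per partition is the
# log-end offset minus the committed offset. Offsets are invented.

def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag; a healthy consumer keeps this near zero."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag(
    log_end_offsets={0: 1500, 1: 980, 2: 2100},
    committed_offsets={0: 1500, 1: 950, 2: 1800},
)
print(lag)  # {0: 0, 1: 30, 2: 300}
```

A lag that grows monotonically on one partition usually points at a stuck or slow consumer rather than a producer spike.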
Kafka at LinkedIn
▪ Multiple data centers
▪ Mirror data
▪ Cluster Types
– Tracking
– Metrics
– Queuing
▪ Data transport from applications to Hadoop, and back
Metrics collection
▪ Building Blocks
– Sensors
– RRD
– Front end
▪ Facts & Figures
– 320,000,000 metrics collected per minute
– 530 TB of disk space
– Over 210,000 metrics collected per service
inGraphs
Kafka for database replication: master-slave
Kafka for database replication: multi-master
How Can You Get Involved?
▪ http://kafka.apache.org
▪ Join the mailing lists
– [email protected]
▪ irc.freenode.net - #apache-kafka
▪ Contribute
Questions?