streaming data with apache kafka
TRANSCRIPT
Apache Kafka general info● originated at LinkedIn● open sourced in early 2011● implemented in Scala, client libraries in Java● many other client libraries:
○ C++, Go, JS/Node, Clojure, Ruby, ...
Motivation at LinkedIn
● many disparate data sources and destinations● a need to move data around reliably● existing solutions not adequate
Requirements
● low latency● huge throughput● online and offline consumers● zero data loss● fault-tolerance
Hadoop(HBase)Consumers
Front enduser
eventsProducers Metrics Logs
RDBMS&
NoSQL
Kafka
MonitoringSystems
LogAggregation
Systems
Search engines DWH
"Events": asgeneralization
of real-time msg processing
Terminology
● Kafka broker = serverhosts
● Topic = queueconsists of
● Partition = piece of a sliced queueis mirrored by
● Replica = backup of a Partition
Consumer offsets
● previously kept in Zookeeper● from 0.8 added ability to store offsets in a
special Kafka topic● brokers are dumb - consumers track their
offsets● ZK only needed to keep track of cluster state
Ordered and immutable
● strict ordering within partition● low and high level consumers in 0.8● number of partitions = parallelism● 1 partition = max 1 consumer thread● 1 consumer thread = one or more partitions
Impressive performance
● 3 cheap machines○ 6-core Intel Xeon 2.5GHz, 32GB RAM, 7 x 7200 RPM SATA, Gb ethernet○ 3 async producers (1 per machine, since network became saturated)
● 2M msgs/s● tens of milliseconds latency end-to-end