Apache Kafka at LinkedIn
TRANSCRIPT
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Guozhang Wang
BDTC 2014
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a Lot of Data
• User activity tracking
  – Page views, ad impressions, etc.
• Server logs and metrics
  – Syslogs, request rates, etc.
• Messaging
  – Emails, news feeds, etc.
• Computation-derived data
  – Results of Hadoop / data warehousing, etc.
… and We Build Products on Data
Newsfeed
Recommendation
Recommendation
Search
Metrics and Monitoring
… and a LOT of Monitoring
The Problem:
How to integrate this variety of data and make it available to all products?
Life back in 2010: Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some off-the-shelf systems, but…
What We REALLY Want
• A centralized data pipeline that is
• Elastically scalable
• Durable
• High-throughput
• Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scales out from the ground up
• Persistent to disks
• High throughput (tens of MB/sec per server)
Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 developers + 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers
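One way to picture how clients spread across partitions is key-based assignment: keyed messages map deterministically to a partition (unkeyed ones can be balanced randomly, as the slide notes). A minimal sketch, assuming a hypothetical helper rather than Kafka's actual partitioner API:

```java
// Sketch of key-based partition assignment (hypothetical helper, not
// Kafka's real Partitioner class): messages with the same key always
// land on the same partition, so load spreads across brokers while
// per-key ordering is preserved.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // mask off the sign bit so the modulus is always non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 8);
        int p2 = partitionFor("user-42", 8);
        System.out.println(p1 == p2);  // prints "true": same key, same partition
    }
}
```

Because each partition is served by one broker, adding partitions (and brokers) scales producers and consumers out in parallel.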
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast when cached
3 4 5 6 7 8 9 10 11 12 ...
Producer Write
Consumer1 Reads (offset 7)
Consumer2 Reads (offset 7)
Partition i of Topic A
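The partition pictured above can be sketched as an append-only list where producers write at the tail and each consumer reads forward from its own offset (a sketch in plain Java collections, not Kafka's on-disk segment format):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the log abstraction behind a partition: appends go at the
// tail in O(1), and any consumer can read by offset independently of
// the others, so consumers never contend with each other.
public class LogSketch {
    private final List<String> log = new ArrayList<>();

    int append(String message) {   // producer write at the tail
        log.add(message);
        return log.size() - 1;     // offset assigned to the new message
    }

    String read(int offset) {      // consumer read from a known offset
        return log.get(offset);
    }

    public static void main(String[] args) {
        LogSketch partition = new LogSketch();
        partition.append("m0");
        partition.append("m1");
        // two consumers at different offsets read independently
        System.out.println(partition.read(0) + " " + partition.read(1));  // prints "m0 m1"
    }
}
```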
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in the JVM
• Zero-copy from file to socket (Java NIO)
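The effect of batching on network round trips can be sketched as follows (a hypothetical client-side buffer, not the real producer internals):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of client-side batching: messages accumulate in a buffer
// until the batch is full, then go out in one network round trip
// instead of one request per message.
public class BatchSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    int roundTrips = 0;            // network round trips taken so far

    BatchSketch(int batchSize) { this.batchSize = batchSize; }

    void send(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        roundTrips++;              // the whole batch goes out in one request
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchSketch producer = new BatchSketch(10);
        for (int i = 0; i < 100; i++) producer.send("msg-" + i);
        System.out.println(producer.roundTrips);  // prints "10", not 100
    }
}
```

Compressing a whole batch at once amortizes compression overhead the same way the buffer amortizes the per-request network cost.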
The API (0.8)
Producer:
  send(topic, message)

Consumer:
  Iterable stream = createMessageStreams(…).get(topic)
  for (message : stream) {
    // process the message
  }
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
  – Writes: 97 TB/day
  – Reads: 430 TB/day
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
  – Message structure contract
  – Performance gain
• Workflow
  – Check in schema
  – Auto compatibility check
  – Code review
  – “Ship it!”
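A schema under this workflow might look like the following (a hypothetical page-view event; the record and field names are illustrative, not LinkedIn's actual schema). The nullable referrer field with a default shows the kind of additive change that passes an automatic compatibility check, since old readers can still decode new records:

```json
{
  "type": "record",
  "name": "PageViewEvent",
  "namespace": "com.example.tracking",
  "fields": [
    {"name": "memberId",  "type": "long"},
    {"name": "pageKey",   "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
```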
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open-sourced: https://github.com/linkedin/camus
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
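The audit-trail idea can be sketched as per-tier message counters: each tier (producer, broker, Hadoop, …) counts messages per topic, and matching counts across tiers indicate nothing was lost. This is a hypothetical helper for illustration, not LinkedIn's actual audit service (which also buckets counts by time window):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of tier-level audit counting: compare how many messages each
// tier has seen for a topic; equal counts mean complete delivery.
public class AuditSketch {
    private final Map<String, Long> counts = new HashMap<>();

    void record(String tier, String topic) {
        counts.merge(tier + "/" + topic, 1L, Long::sum);
    }

    boolean matches(String topic, String tierA, String tierB) {
        return counts.getOrDefault(tierA + "/" + topic, 0L)
                     .equals(counts.getOrDefault(tierB + "/" + topic, 0L));
    }

    public static void main(String[] args) {
        AuditSketch audit = new AuditSketch();
        audit.record("producer", "page-views");
        audit.record("hadoop", "page-views");
        System.out.println(audit.matches("page-views", "producer", "hadoop"));  // prints "true"
    }
}
```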
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
  – High availability
  – Reduced latency
• Log compaction (0.8.1)
  – State storage
• Operational tools (0.8.2)
  – Topic management
  – Automated leader rebalance
  – etc.
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients rewrite
  – Remove ZK dependency
  – Even better throughput
• Security
  – More operability, multi-tenancy ready
• Transactional messaging
  – From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Next Maybe You?
Acknowledgements
Questions?
Guozhang Wang
[email protected]
/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
  – Production -> Kafka: tens of milliseconds
  – Kafka -> Hadoop: < 1 minute
  – ETL in Hadoop: ~45 minutes
  – MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution #1: consume directly from Kafka
• Solution #2: storage other than HDFS
  – Spark, Shark
  – Pinot, Druid, FastBit
• Solution #3: stream processing
  – Apache Samza
  – Storm
How Fast Can Kafka Go?
• Bottleneck #1: network bandwidth
  – Producer: ~100 MB/s on 1-Gigabit Ethernet
  – Consumers can be slower due to multiple subscribers
• Bottleneck #2: disk space
  – Data may be deleted before being consumed at peak times
  – Configurable time/size-based retention policy
• Bottleneck #3: ZooKeeper
  – Mainly due to offset commits; will be lifted in 0.9
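The time/size-based retention mentioned above is controlled per broker by settings such as these (the property names are real Kafka broker configs; the values are illustrative):

```properties
# Delete log segments older than 7 days...
log.retention.hours=168
# ...or once a partition exceeds this many bytes (per-partition cap)
log.retention.bytes=107374182400
# Retention is enforced at segment granularity
log.segment.bytes=1073741824
```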
Intra-cluster Replication
• Pick CA (from CAP) within the datacenter (failover < 10 ms)
  – Network partitions are rare
  – Latency is less of an issue
• Separate data replication from consensus
  – Consensus => ZooKeeper
  – Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details: http://www.slideshare.net/junrao/kafka-replication-apachecon2013
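The durability-vs.-latency trade-off is exposed on the 0.8 producer through the `request.required.acks` setting (value shown is illustrative):

```properties
# How many broker acknowledgements the producer waits for:
#   0 = fire-and-forget (lowest latency, weakest durability)
#   1 = leader only
#  -1 = all in-sync replicas (strongest durability)
request.required.acks=1
```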
Replication Architecture
[Diagram: multiple producers and consumers connected to four brokers, with ZooKeeper (ZK) coordinating the cluster]