Apache Kafka at LinkedIn
TRANSCRIPT
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Guozhang Wang
BDTC 2014
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a Lot of Data
• User activity tracking
  – Page views, ad impressions, etc.
• Server logs and metrics
  – Syslogs, request rates, etc.
• Messaging
  – Emails, news feeds, etc.
• Computation-derived data
  – Results of Hadoop / data warehousing, etc.
… and We Build Products on Data
Newsfeed
Recommendation
Recommendation
Search
Metrics and Monitoring
… and a LOT of Monitoring
The Problem:
How to integrate this variety of data and make it available to all products?
Life back in 2010: Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some off-the-shelf systems, but…
What We REALLY Want
• A centralized data pipeline that is
• Elastically scalable
• Durable
• High-throughput
• Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scales out from the ground up
• Persistent to disks
• High throughput (tens of MB/sec per server)
Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 developers + 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers
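One way to picture how clients spread across partitions is key-based assignment: keyed messages map deterministically to a partition (unkeyed ones can be balanced randomly, as the slide notes). A minimal sketch, assuming a hypothetical helper rather than Kafka's actual partitioner API:

```java
// Sketch of key-based partition assignment (hypothetical helper, not
// Kafka's real Partitioner class): messages with the same key always
// land on the same partition, so load spreads across brokers while
// per-key ordering is preserved.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // mask off the sign bit so the modulus is always non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 8);
        int p2 = partitionFor("user-42", 8);
        System.out.println(p1 == p2);  // prints "true": same key, same partition
    }
}
```

Because each partition is served by one broker, adding partitions (and brokers) scales producers and consumers out in parallel.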
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast when cached
3 4 5 6 7 8 9 10 11 12 ...
Producer Write
Consumer1 Reads (offset 7)
Consumer2 Reads (offset 7)
Partition i of Topic A
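The partition pictured above can be sketched as an append-only list where producers write at the tail and each consumer reads forward from its own offset (a sketch in plain Java collections, not Kafka's on-disk segment format):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the log abstraction behind a partition: appends go at the
// tail in O(1), and any consumer can read by offset independently of
// the others, so consumers never contend with each other.
public class LogSketch {
    private final List<String> log = new ArrayList<>();

    int append(String message) {   // producer write at the tail
        log.add(message);
        return log.size() - 1;     // offset assigned to the new message
    }

    String read(int offset) {      // consumer read from a known offset
        return log.get(offset);
    }

    public static void main(String[] args) {
        LogSketch partition = new LogSketch();
        partition.append("m0");
        partition.append("m1");
        // two consumers at different offsets read independently
        System.out.println(partition.read(0) + " " + partition.read(1));  // prints "m0 m1"
    }
}
```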
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in the JVM
• Zero-copy from file to socket (Java NIO)
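The effect of batching on network round trips can be sketched as follows (a hypothetical client-side buffer, not the real producer internals):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of client-side batching: messages accumulate in a buffer
// until the batch is full, then go out in one network round trip
// instead of one request per message.
public class BatchSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    int roundTrips = 0;            // network round trips taken so far

    BatchSketch(int batchSize) { this.batchSize = batchSize; }

    void send(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        roundTrips++;              // the whole batch goes out in one request
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchSketch producer = new BatchSketch(10);
        for (int i = 0; i < 100; i++) producer.send("msg-" + i);
        System.out.println(producer.roundTrips);  // prints "10", not 100
    }
}
```

Compressing a whole batch at once amortizes compression overhead the same way the buffer amortizes the per-request network cost.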
The API (0.8)
Producer:
  send(topic, message)

Consumer:
  Iterable stream = createMessageStreams(…).get(topic)
  for (message : stream) {
    // process the message
  }
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
  – Writes: 97 TB/day
  – Reads: 430 TB/day
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
  – Message structure contract
  – Performance gain
• Workflow
  – Check in schema
  – Auto compatibility check
  – Code review
  – “Ship it!”
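A schema under this workflow might look like the following (a hypothetical page-view event; the record and field names are illustrative, not LinkedIn's actual schema). The nullable referrer field with a default shows the kind of additive change that passes an automatic compatibility check, since old readers can still decode new records:

```json
{
  "type": "record",
  "name": "PageViewEvent",
  "namespace": "com.example.tracking",
  "fields": [
    {"name": "memberId",  "type": "long"},
    {"name": "pageKey",   "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
```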
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open-sourced: https://github.com/linkedin/camus
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
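The audit-trail idea can be sketched as per-tier message counters: each tier (producer, broker, Hadoop, …) counts messages per topic, and matching counts across tiers indicate nothing was lost. This is a hypothetical helper for illustration, not LinkedIn's actual audit service (which also buckets counts by time window):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of tier-level audit counting: compare how many messages each
// tier has seen for a topic; equal counts mean complete delivery.
public class AuditSketch {
    private final Map<String, Long> counts = new HashMap<>();

    void record(String tier, String topic) {
        counts.merge(tier + "/" + topic, 1L, Long::sum);
    }

    boolean matches(String topic, String tierA, String tierB) {
        return counts.getOrDefault(tierA + "/" + topic, 0L)
                     .equals(counts.getOrDefault(tierB + "/" + topic, 0L));
    }

    public static void main(String[] args) {
        AuditSketch audit = new AuditSketch();
        audit.record("producer", "page-views");
        audit.record("hadoop", "page-views");
        System.out.println(audit.matches("page-views", "producer", "hadoop"));  // prints "true"
    }
}
```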
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
  – High availability
  – Reduced latency
• Log compaction (0.8.1)
  – State storage
• Operational tools (0.8.2)
  – Topic management
  – Automated leader rebalance
  – etc.
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients rewrite
  – Remove ZK dependency
  – Even better throughput
• Security
  – More operability, multi-tenancy ready
• Transactional messaging
  – From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Next Maybe You?
Acknowledgements
Questions?
Guozhang Wang
[email protected]
/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
  – Production -> Kafka: tens of milliseconds
  – Kafka -> Hadoop: < 1 minute
  – ETL in Hadoop: ~45 minutes
  – MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution #1: consume directly from Kafka
• Solution #2: storage other than HDFS
  – Spark, Shark
  – Pinot, Druid, FastBit
• Solution #3: stream processing
  – Apache Samza
  – Storm
How Fast Can Kafka Go?
• Bottleneck #1: network bandwidth
  – Producer: ~100 MB/s on 1-Gigabit Ethernet
  – Consumers can be slower due to multiple subscribers
• Bottleneck #2: disk space
  – Data may be deleted before being consumed at peak times
  – Configurable time/size-based retention policy
• Bottleneck #3: ZooKeeper
  – Mainly due to offset commits; will be lifted in 0.9
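The time/size-based retention mentioned above is controlled per broker by settings such as these (the property names are real Kafka broker configs; the values are illustrative):

```properties
# Delete log segments older than 7 days...
log.retention.hours=168
# ...or once a partition exceeds this many bytes (per-partition cap)
log.retention.bytes=107374182400
# Retention is enforced at segment granularity
log.segment.bytes=1073741824
```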
Intra-cluster Replication
• Pick CA (from CAP) within the datacenter (failover < 10 ms)
  – Network partitions are rare
  – Latency is less of an issue
• Separate data replication from consensus
  – Consensus => ZooKeeper
  – Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details: http://www.slideshare.net/junrao/kafka-replication-apachecon2013
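The durability-vs.-latency trade-off is exposed on the 0.8 producer through the `request.required.acks` setting (value shown is illustrative):

```properties
# How many broker acknowledgements the producer waits for:
#   0 = fire-and-forget (lowest latency, weakest durability)
#   1 = leader only
#  -1 = all in-sync replicas (strongest durability)
request.required.acks=1
```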
Replication Architecture
[Diagram: multiple producers and consumers connected to four brokers, with ZooKeeper (ZK) coordinating the cluster]