Apache Kafka at LinkedIn


TRANSCRIPT


Apache Kafka at LinkedIn
Guozhang Wang, Kafka Team, Data Infrastructure
BDTC 2014

About Me

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Roadmap

Q & A

Why Did We Build Kafka?

We Have a Lot of Data

User activity tracking: page views, ad impressions, etc.

Server logs and metrics: syslogs, request rates, etc.

Messaging: emails, news feeds, etc.

Computation-derived data: results of Hadoop / data warehousing, etc.

As a data-serving website, LinkedIn has a lot of data.

.. and We Build Products on Data

Newsfeed

Recommendation

Based on relevance.

Recommendation

Search

Metrics and Monitoring

We have this variety of data, and we need to build all these products around it.

.. and a LOT of Monitoring

The Problem:

How to integrate this variety of data and make it available to all products?

Life Back in 2010:

Point-to-Point Pipelines

Messaging: ActiveMQ
User activity: in-house log aggregation
Logging: Splunk
Metrics: JMX => Zenoss
Database data: Databus, custom ETL

Example: User Activity Data Flow

What We Want

A centralized data pipeline

Apache Kafka

We tried some systems off the shelf, but they did not fly (ActiveMQ, for example).

What We REALLY Want

A centralized data pipeline that is:

Elastically scalable

Durable

High-throughput

Easy to use

A distributed pub-sub messaging system

Scales out from the ground up

Persists messages to disk

High throughput (tens of MB/sec per server)

Apache Kafka

Life Since Kafka in Production

Apache Kafka

Developed and maintained by 5 devs + 2 SREs.

Now you may be wondering why it works so well: for example, why can it be highly durable, persisting data to disk, while still maintaining high throughput?

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Roadmap

Q & A

Key Idea #1:

Data parallelism leads to scale-out.

Produce/consume requests are randomly balanced among brokers.

Distribute Clients across Partitions

Topic = message stream. A topic has partitions, and partitions are distributed across brokers (a minimal partitioning sketch follows).
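To make the data-parallelism idea concrete, here is a minimal, illustrative Java sketch of key-based partitioning. It is not Kafka's actual partitioner (the real clients use their own hash functions), only the idea that a key deterministically maps to one of N partitions, so the same key always lands in the same partition while different keys spread the load across brokers.

public final class PartitionSketch {
    // Illustrative only: map a message key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        // Clear the sign bit so the modulo result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All events for the same member id go to the same partition,
        // while different members spread across the 8 partitions.
        System.out.println("member-42 -> partition " + partitionFor("member-42", 8));
        System.out.println("member-43 -> partition " + partitionFor("member-43", 8));
    }
}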

Key Idea #2:

Disks are fast when used sequentially.

Do not be afraid of disks.

Appends are effectively O(1).

Reads from a known offset are still fast when cached.

Store Messages as a Log

[Diagram: partition i of topic A as an append-only log; the producer writes at the tail while consumer 1 and consumer 2 each read from offset 7.]

File-system caching does the heavy lifting (a toy sketch of the log abstraction follows).
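As a toy illustration of the log abstraction above, the sketch below keeps one partition as an in-memory list: appends go to the tail in O(1), and each consumer simply advances its own offset. The class and method names are invented for illustration; the real log lives on disk and relies on sequential I/O and the file-system cache.

import java.util.ArrayList;
import java.util.List;

public final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Append at the tail and return the offset the message landed at.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // A read is just a lookup at the consumer's current offset.
    String read(long offset) {
        return messages.get((int) offset);
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("page-view-1");
        log.append("page-view-2");
        // Two consumers read independently; neither mutates the log.
        long consumer1Offset = 0, consumer2Offset = 1;
        System.out.println(log.read(consumer1Offset));
        System.out.println(log.read(consumer2Offset));
    }
}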

Key Idea #3:

Batching makes the best use of network and I/O.

Batched send and receive

Batched compression

No message caching in JVM

Zero-copy from file to socket (Java NIO); see the sketch below
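The zero-copy point refers to Java NIO's FileChannel.transferTo, which lets the broker hand bytes from a cached log segment file to a socket without copying them through user space. Below is a minimal standalone sketch, not Kafka's code; the file path, host, and port are placeholders.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class ZeroCopySketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(
                 Paths.get("/tmp/partition-0.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                 new InetSocketAddress("consumer-host", 9092))) {
            long position = 0;
            long remaining = log.size();
            // transferTo may send fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}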

Batch Transfer

And finally, after all these tricks, the client interface we expose to users is very simple.

The API (0.8)

Producer: send(topic, message)

Consumer:

Iterable stream = createMessageStreams().get(topic)

for (message : stream) { // process the message }
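Filling out the pseudocode above, the sketch below shows roughly how the 0.8-era Java producer and high-level consumer were used. It is written from memory of the old kafka.javaapi / kafka.consumer classes (since replaced by org.apache.kafka.clients), so treat the class and property names as a sketch; the broker list, ZooKeeper address, topic, and group id are placeholders.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public final class Kafka08ApiSketch {
    public static void main(String[] args) {
        // Producer: send(topic, message)
        Properties p = new Properties();
        p.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder
        p.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer = new Producer<>(new ProducerConfig(p));
        producer.send(new KeyedMessage<>("page-views", "hello kafka"));
        producer.close();

        // Consumer: iterate over one stream of the topic
        Properties c = new Properties();
        c.put("zookeeper.connect", "zk1:2181"); // placeholder
        c.put("group.id", "example-group");     // placeholder
        ConsumerConnector consumer =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(c));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("page-views", 1));
        ConsumerIterator<byte[], byte[]> it = streams.get("page-views").get(0).iterator();
        while (it.hasNext()) {
            // process the message (blocks forever; consumer.shutdown() omitted here)
            System.out.println(new String(it.next().message()));
        }
    }
}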

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Pipeline deployment

Schema for data cleanliness

O(1) ETL

Auditing for correctness

Roadmap

Q & A

Now I will switch gears and talk a bit about Kafka usage at LinkedIn.

Kafka Usage at LinkedIn

Mainly used for tracking user-activity and metrics data

16 - 32 brokers in each cluster (615+ total brokers)

527 billion messages/day

7500+ topics, 270k+ partitions

Byte rates: writes 97 TB/day, reads 430 TB/day (as of October 21st)
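For a rough sense of scale: 527 billion messages/day averages out to about 6 million messages per second, 97 TB/day of writes is a bit over 1 GB/s sustained, and 430 TB/day of reads is roughly 5 GB/s.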

Kafka Usage at LinkedIn

[Diagram slides: the Kafka pipeline deployment at LinkedIn, including the multi-colo setup.]

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Pipeline deployment

Schema for data cleanliness

O(1) ETL

Auditing for correctness

Roadmap

Q & A

Problems

Hundreds of message types

Thousands of fields

What do they all mean?

What happens when they change?

Standardized Schema on Avro

Schema (an example sketch follows the workflow list):

Message structure contract

Performance gain

Workflow

Check in schema

Auto compatibility check

Code review

Ship it!
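As a concrete, hypothetical example of such a schema contract, the sketch below defines a small page-view record using the Avro Java library. The record and field names are invented for illustration; the optional referrer field with a default value is the kind of additive change that an automatic compatibility check would allow.

import org.apache.avro.Schema;

public final class PageViewSchemaExample {
    // Hypothetical schema; not one of LinkedIn's actual message types.
    static final String PAGE_VIEW_SCHEMA = "{"
        + "\"type\": \"record\","
        + "\"name\": \"PageViewEvent\","
        + "\"fields\": ["
        + "  {\"name\": \"memberId\",  \"type\": \"long\"},"
        + "  {\"name\": \"pageKey\",   \"type\": \"string\"},"
        + "  {\"name\": \"timestamp\", \"type\": \"long\"},"
        // A new optional field with a default keeps old readers compatible.
        + "  {\"name\": \"referrer\",  \"type\": [\"null\", \"string\"], \"default\": null}"
        + "]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(PAGE_VIEW_SCHEMA);
        System.out.println(schema.getFullName() + " has "
            + schema.getFields().size() + " fields");
    }
}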

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Pipeline deployment

Schema for data cleanliness

O(1) ETL

Auditing for correctness

Roadmap

Q & A

Kafka to Hadoop

Hadoop ETL (Camus)

Map/Reduce job does the data load

One job loads all events

~10 minute ETA on average from producer to HDFS

Hive registration done automatically

Schema evolution handled transparently

Open sourced: https://github.com/linkedin/camus

Agenda

Overview of Kafka

Kafka Design

Kafka Usage at LinkedIn

Pipeline deployment

Schema for data cleanliness

O(1) ETL

Auditing for correctness

Roadmap

Q & A

Does It Really Work?

All published messages must be delivered to all consumers (quickly).

Audit Trail

[Audit trail diagram: end-to-end completeness is 99.99%.]

More Features in Kafka 0.8

Intra-cluster replication (0.8.0)

High availability

Reduced latency

Log compaction (0.8.1): state storage

Operational tools (0.8.2): topic management

Automated leader rebalance

etc ..

Check out our page for more: http://kafka.apache.org/

0.8.2:

Delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election

Kafka 0.9

Clients Rewrite

Remove ZK dependency

Even better throughput

Security

More operability, multi-tenancy ready

Transactional messaging

From at-least-once to exactly-once

Check out our page for more: http://kafka.apache.org/

Kafka Users: Next Maybe You?

Non-Java / Scala clients:

C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.

https://cwiki.apache.org/confluence/display/KAFKA/Clients

Python: pure Python implementation with full protocol support; consumer and producer included; GZIP and Snappy compression supported.
C: high-performance C library with full protocol support.
C++: native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
Go (aka golang): pure Go implementation with full protocol support; consumer and producer included; GZIP and Snappy compression supported.
Ruby: pure Ruby; consumer and producer included; GZIP and Snappy compression supported; Ruby 1.9.3 and up (CI runs MRI 2…).
Clojure: Clojure DSL for the Kafka API.
JavaScript (Node.js): pure JavaScript client.
Command line: stdin & stdout.

Acknowledgements


Questions?
Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang

Backup Slides

Real-time Analysis with Kafka

Analytics from Hadoop can be slow:
Production -> Kafka: tens of milliseconds
Kafka -> Hadoop: < 1 minute
ETL in Hadoop: ~45 minutes
MapReduce in Hadoop: maybe hours

Real-time Analysis with Kafka

Solution No. 1: directly consuming from Kafka

Solution No. 2: storage other than HDFS (Spark, Shark, Pinot, Druid, FastBit)

Solution No. 3: stream processing (Apache Samza, Storm)

How Fast Can Kafka Go?

Bottleneck #1: network bandwidth

Producer: ~100 MB/s on 1 Gigabit Ethernet

Consumers can be slower due to multiple subscriptions

Bottleneck #2: disk space

Data may be deleted before it is consumed at peak times; retention is a configurable time/size-based policy

Bottleneck #3: Zookeeper

Mainly due to offset commits; will be lifted in 0.9

Intra-cluster Replication

Pick CA within a datacenter (failover < 10 ms)

Network partitions are rare; latency is less of an issue

Separate data replication and consensus

Consensus => ZooKeeper

Replication => primary-backup (f replicas tolerate f-1 failures)

Configurable ACK (durability vs. latency); see the config sketch after this list

More details:

http://www.slideshare.net/junrao/kafka-replication-apachecon2013
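A small sketch of the configurable-ACK knob as it appeared in the 0.8 producer configuration; the property name is recalled from the old docs, so verify it against your version. Acks of 0 or 1 favor latency, while -1 waits for the full in-sync replica set and favors durability.

import java.util.Properties;

public final class AcksConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092"); // placeholder
        // 0 = fire and forget, 1 = leader ack only, -1 = wait for all in-sync replicas
        props.put("request.required.acks", "-1"); // favor durability over latency
        System.out.println(props);
    }
}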

Replication Architecture

[Diagram: producers and consumers connect to a cluster of brokers, coordinated by ZooKeeper.]
