
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure

Apache Kafka at LinkedIn


About Me



Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Why Did We Build Kafka?


We Have a Lot of Data

• User activity tracking

– Page views, ad impressions, etc.

• Server logs and metrics

– Syslogs, request rates, etc.

• Messaging

– Emails, news feeds, etc.

• Computation derived

– Results of Hadoop / data warehousing, etc.


.. and We Build Products on Data



Newsfeed



Recommendation

HADOOP SUMMIT 2013

People you may know


Recommendation



Search



Metrics and Monitoring


System and application metrics/logging



.. and a LOT of Monitoring


The Problem:

How to integrate this variety of data

and make it available to all products?


Life back in 2010:

Point-to-Point Pipelines


Example: User Activity Data Flow


What We Want

• A centralized data pipeline


Apache Kafka

We tried some off-the-shelf systems, but…


What We REALLY Want

• A centralized data pipeline that is

• Elastically scalable

• Durable

• High-throughput

• Easy to use


Apache Kafka

• A distributed pub-sub messaging system

– Scales out from the ground up

– Persists messages to disk

– High throughput (tens of MB/sec per server)


Life Since Kafka in Production

Apache Kafka

• Developed and maintained by 5 developers + 2 SREs


Agenda



• Kafka Design


• Roadmap

• Q & A

Key Idea #1:

Data-parallelism leads to scale-out


• Produce/consume requests are randomly balanced among brokers


Distribute Clients across Partitions
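
A minimal sketch of how spreading requests over partitions might look. The class name, the round-robin partition-to-broker layout, and the hash-by-key option are illustrative assumptions, not Kafka's actual partitioner.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: mapping messages to partitions and partitions to brokers. */
final class PartitionSketch {
    private final int numPartitions;
    private final int numBrokers;

    PartitionSketch(int numPartitions, int numBrokers) {
        if (numPartitions <= 0 || numBrokers <= 0) {
            throw new IllegalArgumentException("partition and broker counts must be positive");
        }
        this.numPartitions = numPartitions;
        this.numBrokers = numBrokers;
    }

    /** Messages without a key go to a uniformly random partition. */
    int randomPartition() {
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    /** Assumption: messages with a key always land on the same partition. */
    int partitionForKey(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Assumed layout: partitions are spread round-robin over the brokers. */
    int brokerFor(int partition) {
        return partition % numBrokers;
    }
}
```

Because each partition is served independently, adding brokers and partitions adds capacity without coordination between clients.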

Key Idea #2:

Disks are fast when used sequentially


• Appends are effectively O(1)

• Reads from a known offset are still fast when cached


Store Messages as a Log

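[Figure: Partition i of Topic A drawn as a log of consecutive offsets (3 through 12 and onward). The producer writes at the tail; Consumer1 and Consumer2 each read from their own position (both shown reading at offset 7).]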
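
The log idea can be sketched in memory as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, and a real broker appends to files on disk rather than to a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one partition's append-only log. */
final class LogSketch {
    private final List<byte[]> messages = new ArrayList<>();

    /** Appends at the tail and returns the offset assigned to the message. */
    long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to maxMessages starting at offset; reading never modifies the log. */
    List<byte[]> read(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size() || maxMessages < 0) {
            throw new IllegalArgumentException("bad read request at offset " + offset);
        }
        int from = (int) offset;
        int to = (int) Math.min((long) messages.size(), offset + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }
}
```

Each consumer only has to remember its own offset, so two consumers reading at offset 7 see the same data and do not interfere with each other.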

Key Idea #3:

Batching makes best use of network/IO


• Batched send and receive

• Batched compression

• No message caching in JVM

• Zero-copy from file to socket (Java NIO)


Batch Transfer
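
A rough sketch of the file-to-socket path using Java NIO's FileChannel.transferTo. The class name and the sendChunk helper are hypothetical; whether the transfer is truly zero-copy depends on the operating system and on the target being a socket channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: ship a byte range of a log file to a channel. */
final class ZeroCopySketch {
    /** Sends up to count bytes starting at position; returns the number of bytes sent. */
    static long sendChunk(Path logFile, long position, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel log = FileChannel.open(logFile, StandardOpenOption.READ)) {
            long end = Math.min(log.size(), position + count);
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position + sent < end) {
                long n = log.transferTo(position + sent, end - position - sent, target);
                if (n <= 0) {
                    break; // target cannot accept more right now
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

The bytes are never turned into Java objects, which matches the "no message caching in JVM" point: the OS page cache does the caching.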


The API (0.8)

Producer:

send(topic, message)

Consumer:

Iterable stream = createMessageStreams(…).get(topic)

for (message : stream) {
    // process the message
}


Agenda



• Kafka Design


• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A


Kafka Usage at LinkedIn

• Mainly used for tracking user-activity and metrics data

• 16 - 32 brokers in each cluster (615+ total brokers)

• 527 billion messages/day

• 7500+ topics, 270k+ partitions

• Byte rates:

• Writes: 97 TB/day

• Reads: 430 TB/day
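
For a sense of scale, the daily totals above can be turned into average rates. This is back-of-the-envelope arithmetic that assumes decimal terabytes and an even load over the day; peak rates would be higher.

```java
/** Average rates derived from the daily totals on the slide. */
final class UsageMath {
    static final double SECONDS_PER_DAY = 24 * 60 * 60; // 86,400
    static final double TB = 1e12;                      // assumption: decimal terabytes

    static final double MESSAGES_PER_DAY = 527e9;
    static final double WRITE_BYTES_PER_DAY = 97 * TB;
    static final double READ_BYTES_PER_DAY = 430 * TB;

    /** About 6.1 million messages per second. */
    static double messagesPerSecond() {
        return MESSAGES_PER_DAY / SECONDS_PER_DAY;
    }

    /** About 1.12 GB written per second. */
    static double writeBytesPerSecond() {
        return WRITE_BYTES_PER_DAY / SECONDS_PER_DAY;
    }

    /** About 4.98 GB read per second. */
    static double readBytesPerSecond() {
        return READ_BYTES_PER_DAY / SECONDS_PER_DAY;
    }

    /** About 184 bytes written per message. */
    static double averageMessageBytes() {
        return WRITE_BYTES_PER_DAY / MESSAGES_PER_DAY;
    }

    /** Each byte written is read about 4.4 times. */
    static double readFanOut() {
        return READ_BYTES_PER_DAY / WRITE_BYTES_PER_DAY;
    }

    /** Roughly 36 partitions per topic (270k / 7500). */
    static double partitionsPerTopic() {
        return 270_000.0 / 7_500.0;
    }
}
```

The read fan-out above 4 reflects the pub-sub model: the same data is consumed by several subscribers.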


Agenda



• Kafka Design




• O(1) ETL


• Roadmap

• Q & A

Problems

• Hundreds of message types

• Thousands of fields

• What do they all mean?

• What happens when they change?


Standardized Schema on Avro

• Schema

– Message structure contract

– Performance gain

• Workflow

– Check in schema

– Auto compatibility check

– Code review

– “Ship it!”
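
One way to picture the automatic compatibility check is the toy rule below. It is a simplified assumption, not LinkedIn's actual tooling and not the Avro library API: a new schema can still read old messages if every field it adds declares a default value. Field types are ignored here.

```java
import java.util.Map;

/** Hypothetical, simplified compatibility rule for record schemas. */
final class SchemaCheckSketch {
    /**
     * Each map goes from field name to whether that field declares a default.
     * Returns true if a reader using newFields can decode data written with oldFields.
     */
    static boolean canReadOldData(Map<String, Boolean> oldFields, Map<String, Boolean> newFields) {
        for (Map.Entry<String, Boolean> field : newFields.entrySet()) {
            boolean missingInOld = !oldFields.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (missingInOld && !hasDefault) {
                return false; // old messages carry no value for this field
            }
        }
        return true;
    }
}
```

Running a check like this at check-in time rejects a breaking change before any producer ships it.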


Agenda



• Kafka Design




• O(1) ETL


• Roadmap

• Q & A


Kafka to Hadoop


Hadoop ETL (Camus)

• Map/Reduce job does data load

• One job loads all events

• ~10-minute average latency from producer to HDFS

• Hive registration done automatically

• Schema evolution handled transparently

• Open sourced:

– https://github.com/linkedin/camus


Agenda



• Kafka Design




• O(1) ETL


• Roadmap

• Q & A

Does it really work?

“All published messages must be delivered to all consumers (quickly)”

Audit Trail
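
The slide does not spell out how the audit works. One plausible sketch, offered purely as an assumption, counts messages per time window at the producing and consuming tiers and flags windows where the counts disagree. All names below are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical count-based audit: compare produced and consumed counts per window. */
final class AuditSketch {
    private final long windowMs;
    private final Map<Long, Long> produced = new TreeMap<>();
    private final Map<Long, Long> consumed = new TreeMap<>();

    AuditSketch(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.windowMs = windowMs;
    }

    void recordProduced(long timestampMs) {
        produced.merge(timestampMs / windowMs, 1L, Long::sum);
    }

    void recordConsumed(long timestampMs) {
        consumed.merge(timestampMs / windowMs, 1L, Long::sum);
    }

    /** Maps each mismatched window index to (produced - consumed). */
    Map<Long, Long> discrepancies() {
        Map<Long, Long> diff = new TreeMap<>();
        for (Map.Entry<Long, Long> e : produced.entrySet()) {
            long seen = consumed.getOrDefault(e.getKey(), 0L);
            if (seen != e.getValue()) {
                diff.put(e.getKey(), e.getValue() - seen);
            }
        }
        for (Map.Entry<Long, Long> e : consumed.entrySet()) {
            if (!produced.containsKey(e.getKey())) {
                diff.put(e.getKey(), -e.getValue());
            }
        }
        return diff;
    }
}
```

A positive difference suggests loss or delay; a negative one suggests duplicates.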


More Features in Kafka 0.8

• Intra-cluster replication (0.8.0)

– High availability

– Reduced latency

• Log compaction (0.8.1)

– State storage

• Operational tools (0.8.2)

– Topic management

– Automated leader rebalance

– etc.

Check out our page for more: http://kafka.apache.org/


Kafka 0.9

• Clients rewrite

– Remove ZK dependency

– Even better throughput

• Security

– More operability, multi-tenancy ready

• Transactional messaging

– From at-least-once to exactly-once

Check out our page for more: http://kafka.apache.org/


Kafka Users: Maybe You're Next?


Acknowledgements

Questions?

Guozhang Wang


www.linkedin.com/in/guozhangwang


Backup Slides


Real-time Analysis with Kafka

• Analytics from Hadoop can be slow

– Production -> Kafka: tens of milliseconds

– Kafka -> Hadoop: < 1 minute

– ETL in Hadoop: ~45 minutes

– MapReduce in Hadoop: maybe hours


Real-time Analysis with Kafka

• Solution No. 1: consume directly from Kafka

• Solution No. 2: storage other than HDFS

– Spark, Shark

– Pinot, Druid, FastBit

• Solution No. 3: stream processing

– Apache Samza

– Storm


How Fast Can Kafka Go?

• Bottleneck #1: network bandwidth

– Producer: 100 Mb/s for 1 Gig-Ethernet

– Consumers can be slower because of multiple subscribers

• Bottleneck #2: disk space

– Data may be deleted before it is consumed at peak times

– Configurable time/size-based retention policy

• Bottleneck #3: Zookeeper

– Mainly due to offset commits; to be lifted in 0.9
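
A time/size-based retention policy could be sketched as follows. The Segment type, the parameter names, and the oldest-first deletion order are assumptions made for illustration, not the broker's actual code.

```java
import java.util.Deque;

/** Hypothetical retention pass over a partition's log segments, oldest first. */
final class RetentionSketch {
    static final class Segment {
        final long lastModifiedMs;
        final long sizeBytes;

        Segment(long lastModifiedMs, long sizeBytes) {
            this.lastModifiedMs = lastModifiedMs;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Drops head segments that are too old or that push the log over its size limit. */
    static int enforce(Deque<Segment> segments, long nowMs, long retentionMs, long retentionBytes) {
        long totalBytes = 0;
        for (Segment s : segments) {
            totalBytes += s.sizeBytes;
        }
        int deleted = 0;
        while (!segments.isEmpty()) {
            Segment oldest = segments.peekFirst();
            boolean tooOld = nowMs - oldest.lastModifiedMs > retentionMs;
            boolean tooBig = totalBytes > retentionBytes;
            if (!tooOld && !tooBig) {
                break;
            }
            // Note: nothing here checks whether consumers have read the segment.
            segments.removeFirst();
            totalBytes -= oldest.sizeBytes;
            deleted++;
        }
        return deleted;
    }
}
```

Since deletion ignores consumer progress, a consumer that falls far enough behind at peak time can miss data, which is the risk the slide calls out.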


Intra-cluster Replication

• Pick CA (consistency and availability) within the datacenter (failover < 10 ms)

– Network partitions are rare

– Latency is less of an issue

• Separate data replication and consensus

– Consensus => Zookeeper

– Replication => primary-backup (f replicas tolerate f-1 failures)

• Configurable ACK (durability vs. latency)

• More details:

– http://www.slideshare.net/junrao/kafka-replication-apachecon2013
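
The failure-tolerance and ACK trade-off can be sketched as below. The AckMode names and the canAcknowledge helper are hypothetical; they only illustrate the idea that waiting for more replicas buys durability at the cost of latency.

```java
/** Hypothetical sketch of primary-backup replication with configurable acknowledgements. */
final class ReplicationSketch {
    /** Assumed acknowledgement levels a producer might choose from. */
    enum AckMode { NONE, LEADER, ALL_REPLICAS }

    /** With primary-backup, f replicas survive the loss of f - 1 of them. */
    static int toleratedFailures(int replicas) {
        if (replicas < 1) {
            throw new IllegalArgumentException("need at least one replica");
        }
        return replicas - 1;
    }

    /** Decides whether the producer's write can be acknowledged yet. */
    static boolean canAcknowledge(AckMode mode, boolean leaderHasWrite,
                                  int followersWithWrite, int followers) {
        switch (mode) {
            case NONE:
                return true;             // lowest latency, no durability guarantee
            case LEADER:
                return leaderHasWrite;   // lost if the leader fails before followers copy it
            case ALL_REPLICAS:
                return leaderHasWrite && followersWithWrite == followers;
            default:
                throw new AssertionError(mode);
        }
    }
}
```

A quorum-based scheme would need 2f + 1 replicas to tolerate f failures, which is why delegating consensus to Zookeeper and using primary-backup for data keeps the replica count low.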


Replication Architecture

[Figure: producers and consumers connected to a row of four brokers, with ZK (Zookeeper) alongside.]
