Apache Kafka at LinkedIn

Guozhang Wang, KAFKA Team, Data Infrastructure, BDTC 2014

©2013 LinkedIn Corporation. All Rights Reserved.

Upload: guozhang-wang

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Apache Kafka at LinkedIn

Apache Kafka at LinkedIn

Guozhang Wang

BDTC 2014

Page 2: Apache Kafka at LinkedIn

About Me

Page 3: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Page 4: Apache Kafka at LinkedIn

Why Did We Build Kafka?

Page 5: Apache Kafka at LinkedIn

We Have a lot of Data

• User activity tracking
  • Page views, ad impressions, etc.

• Server logs and metrics
  • Syslogs, request-rates, etc.

• Messaging
  • Emails, news feeds, etc.

• Computation derived
  • Results of Hadoop / data warehousing, etc.

Page 6: Apache Kafka at LinkedIn

.. and We Build Products on Data

Page 7: Apache Kafka at LinkedIn

Newsfeed

Page 8: Apache Kafka at LinkedIn

Recommendation

Page 9: Apache Kafka at LinkedIn

Recommendation

Page 10: Apache Kafka at LinkedIn

Search

Page 11: Apache Kafka at LinkedIn

Metrics and Monitoring

Page 12: Apache Kafka at LinkedIn

.. and a LOT of Monitoring

Page 13: Apache Kafka at LinkedIn

The Problem:

How to integrate this variety of data and make it available to all products?

Page 14: Apache Kafka at LinkedIn

Life Back in 2010: Point-to-Point Pipelines

Page 15: Apache Kafka at LinkedIn

Example: User Activity Data Flow

Page 16: Apache Kafka at LinkedIn

What We Want

• A centralized data pipeline

Page 17: Apache Kafka at LinkedIn

Apache Kafka

We tried some off-the-shelf systems, but…

Page 18: Apache Kafka at LinkedIn

What We REALLY Want

• A centralized data pipeline that is

• Elastically scalable

• Durable

• High-throughput

• Easy to use

Page 19: Apache Kafka at LinkedIn

Apache Kafka

• A distributed pub-sub messaging system

• Scale-out from the ground up

• Persistent to disk

• High throughput (tens of MB/sec per server)

Page 20: Apache Kafka at LinkedIn

Life Since Kafka in Production

Apache Kafka

• Developed and maintained by 5 developers and 2 SREs

Page 21: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Page 22: Apache Kafka at LinkedIn

Key Idea #1:

Data-parallelism leads to scale-out

Page 23: Apache Kafka at LinkedIn

Distribute Clients across Partitions

• Produce/consume requests are randomly balanced among brokers
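The balancing idea on this slide can be sketched in a few lines of Java. This is only an illustration, not Kafka's partitioner: the class `PartitionChooser` and its methods are hypothetical names, and the key-hashing variant is an assumption added for comparison rather than something the slide states.

```java
import java.util.Random;

// Hypothetical sketch: choose a partition for each message so that load
// spreads across partitions (and therefore across the brokers hosting them).
public class PartitionChooser {
    private final int numPartitions;
    private final Random random;

    public PartitionChooser(int numPartitions, long seed) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
        this.random = new Random(seed);
    }

    // Unkeyed message: pick any partition at random.
    public int chooseRandom() {
        return random.nextInt(numPartitions);
    }

    // Keyed message (assumption, not from the slide): hash the key so the
    // same key always maps to the same partition.
    public int chooseByKey(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser(8, 42L);
        int[] counts = new int[8];
        for (int i = 0; i < 80_000; i++) {
            counts[chooser.chooseRandom()]++;
        }
        // With random selection the counts should come out roughly even.
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " messages");
        }
    }
}
```

Because each partition can live on a different broker, adding partitions (and brokers) is what lets this scheme scale out.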

Page 24: Apache Kafka at LinkedIn

Key Idea #2:

Disks are fast when used sequentially

Page 25: Apache Kafka at LinkedIn

Store Messages as a Log

• Appends are effectively O(1)

• Reads from a known offset are still fast when cached

[Figure: partition i of topic A drawn as a log of offsets 3 4 5 6 7 8 9 10 11 12 ...; the producer writes at the end; Consumer1 reads at offset 7; Consumer2 reads at offset 7]
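The log abstraction on this slide can be illustrated with a small in-memory sketch. It is a simplification under stated assumptions: `PartitionLog` is a hypothetical class, messages are plain strings, and everything lives in memory, whereas the talk describes a log persisted to disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory sketch of one partition: an append-only list in
// which a message's offset is simply its position.
public class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appending at the tail is amortized O(1); returns the new message's offset.
    public synchronized long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Read up to maxMessages starting at a known offset. The log keeps no
    // per-consumer state; each consumer remembers its own offset.
    public synchronized List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size() || maxMessages < 0) {
            throw new IllegalArgumentException("bad read: offset=" + offset);
        }
        if (offset == messages.size()) {
            return Collections.emptyList(); // caught up with the producer
        }
        int end = (int) Math.min(messages.size(), offset + maxMessages);
        return new ArrayList<>(messages.subList((int) offset, end));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (int i = 0; i < 12; i++) {
            log.append("message-" + i);
        }
        // Two independent consumers can read from the same offset.
        System.out.println("consumer1: " + log.read(7, 2));
        System.out.println("consumer2: " + log.read(7, 3));
    }
}
```

Since reads never modify the log, any number of consumers can read the same range without coordinating with each other.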

Page 26: Apache Kafka at LinkedIn

Key Idea #3:

Batching makes best use of network/IO

Page 27: Apache Kafka at LinkedIn

Batch Transfer

• Batched send and receive

• Batched compression

• No message caching in JVM

• Zero-copy from file to socket (Java NIO)
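The "zero-copy from file to socket" bullet refers to the kind of transfer that Java NIO exposes through `FileChannel.transferTo`. The sketch below is an assumption-laden illustration, not broker code: `ZeroCopySketch` and `sendRange` are hypothetical names, and the target is an in-memory channel so the example runs on its own. The operating system can only skip the copy through the JVM when the target is something like a socket channel.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    // Send `count` bytes of a log file, starting at `position`, to `target`.
    // transferTo may move fewer bytes than requested, so loop until done.
    public static long sendRange(Path logFile, long position, long count,
                                 WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(logFile, StandardOpenOption.READ)) {
            long end = Math.min(source.size(), position + count);
            long sent = 0;
            while (position + sent < end) {
                long n = source.transferTo(position + sent, end - position - sent, target);
                if (n <= 0) {
                    break; // nothing more could be transferred
                }
                sent += n;
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("segment", ".log");
        try {
            Files.write(logFile, "m0|m1|m2|m3".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // A real server would pass a SocketChannel here; the in-memory
            // channel keeps the example self-contained but is not zero-copy.
            long sent = sendRange(logFile, 3, 5, Channels.newChannel(sink));
            System.out.println(sent + " bytes: " + sink.toString("UTF-8"));
        } finally {
            Files.deleteIfExists(logFile);
        }
    }
}
```

Serving a contiguous byte range of the log file directly is also what makes batched reads cheap: a whole batch of messages goes out in one transfer instead of one message at a time.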

Page 28: Apache Kafka at LinkedIn

The API (0.8)

Producer:

send(topic, message)

Consumer:

Iterable stream = createMessageStreams(…).get(topic)

for (message : stream) {
    // process the message
}

Page 29: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn
  • Pipeline deployment
  • Schema for data cleanliness
  • O(1) ETL
  • Auditing for correctness

• Roadmap

• Q & A

Page 30: Apache Kafka at LinkedIn

Kafka Usage at LinkedIn

• Mainly used for tracking user-activity and metrics data

• 16 - 32 brokers in each cluster (615+ total brokers)

• 527 billion messages/day

• 7500+ topics, 270k+ partitions

• Byte rates:
  • Writes: 97 TB/day
  • Reads: 430 TB/day

Page 31: Apache Kafka at LinkedIn

Kafka Usage at LinkedIn

Page 32: Apache Kafka at LinkedIn

Kafka Usage at LinkedIn

Page 33: Apache Kafka at LinkedIn

Kafka Usage at LinkedIn

Page 34: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn
  • Pipeline deployment
  • Schema for data cleanliness
  • O(1) ETL
  • Auditing for correctness

• Roadmap

• Q & A

Page 35: Apache Kafka at LinkedIn

Problems

• Hundreds of message types

• Thousands of fields

• What do they all mean?

• What happens when they change?

Page 36: Apache Kafka at LinkedIn

Standardized Schema on Avro

• Schema
  • Message structure contract
  • Performance gain

• Workflow
  • Check in schema
  • Auto compatibility check
  • Code review
  • “Ship it!”

Page 37: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn
  • Pipeline deployment
  • Schema for data cleanliness
  • O(1) ETL
  • Auditing for correctness

• Roadmap

• Q & A

Page 38: Apache Kafka at LinkedIn

Kafka to Hadoop

Page 39: Apache Kafka at LinkedIn

Hadoop ETL (Camus)

• Map/Reduce job does data load

• One job loads all events

• ~10 minute ETA on average from producer to HDFS

• Hive registration done automatically

• Schema evolution handled transparently

• Open sourced: https://github.com/linkedin/camus

Page 40: Apache Kafka at LinkedIn

Agenda

• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn
  • Pipeline deployment
  • Schema for data cleanliness
  • O(1) ETL
  • Auditing for correctness

• Roadmap

• Q & A

Page 41: Apache Kafka at LinkedIn

Does it really work?

“All published messages must be delivered to all consumers (quickly)”

Page 42: Apache Kafka at LinkedIn

Audit Trail

Page 43: Apache Kafka at LinkedIn

More Features in Kafka 0.8

• Intra-cluster replication (0.8.0)
  • High availability
  • Reduced latency

• Log compaction (0.8.1)
  • State storage

• Operational tools (0.8.2)
  • Topic management
  • Automated leader rebalance
  • etc.

Check out our page for more: http://kafka.apache.org/

Page 44: Apache Kafka at LinkedIn

Kafka 0.9

• Clients Rewrite
  • Remove ZK dependency
  • Even better throughput

• Security
  • More operability, multi-tenancy ready

• Transactional Messaging
  • From at-least-once to exactly-once

Check out our page for more: http://kafka.apache.org/

Page 45: Apache Kafka at LinkedIn

Kafka Users: Maybe You Next?

Page 46: Apache Kafka at LinkedIn

Acknowledgements

Page 47: Apache Kafka at LinkedIn

Questions?

Guozhang [email protected]/in/guozhangwang

Page 48: Apache Kafka at LinkedIn

Backup Slides

Page 49: Apache Kafka at LinkedIn

Real-time Analysis with Kafka

• Analytics from Hadoop can be slow
  • Production -> Kafka: tens of milliseconds
  • Kafka -> Hadoop: < 1 minute
  • ETL in Hadoop: ~45 minutes
  • MapReduce in Hadoop: maybe hours

Page 50: Apache Kafka at LinkedIn

Real-time Analysis with Kafka

• Solution No. 1: directly consuming from Kafka

• Solution No. 2: other storage than HDFS
  • Spark, Shark
  • Pinot, Druid, FastBit

• Solution No. 3: stream processing
  • Apache Samza
  • Storm

Page 51: Apache Kafka at LinkedIn

How Fast Can Kafka Go?

• Bottleneck #1: network bandwidth

  • Producer: 100 MB/s on 1 Gigabit Ethernet

  • Consumers can be slower because of multiple subscribers

• Bottleneck #2: disk space
  • Data may be deleted before it is consumed at peak time
  • Configurable time/size-based retention policy

• Bottleneck #3: ZooKeeper
  • Mainly due to offset commits; will be lifted in 0.9
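The disk-space bottleneck follows from how a time/size-based retention policy works: old data is dropped by age or total size, regardless of whether consumers have read it. The sketch below is a hypothetical illustration of that rule. `RetentionSketch`, its `Segment` type, and the two limits are assumed names for this example and do not mirror Kafka's internal classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of time- and size-based retention over log segments.
public class RetentionSketch {
    static final class Segment {
        final long lastModifiedMs;
        final long sizeBytes;

        Segment(long lastModifiedMs, long sizeBytes) {
            this.lastModifiedMs = lastModifiedMs;
            this.sizeBytes = sizeBytes;
        }
    }

    private final Deque<Segment> segments = new ArrayDeque<>(); // oldest first
    private final long retentionMs;
    private final long retentionBytes;
    private long totalBytes = 0;

    public RetentionSketch(long retentionMs, long retentionBytes) {
        this.retentionMs = retentionMs;
        this.retentionBytes = retentionBytes;
    }

    public void addSegment(long lastModifiedMs, long sizeBytes) {
        segments.addLast(new Segment(lastModifiedMs, sizeBytes));
        totalBytes += sizeBytes;
    }

    // Drop the oldest segments while they are too old or the log is over its
    // size limit. Consumer progress is never consulted, which is why a slow
    // consumer can miss data at peak time. Returns the number dropped.
    public int enforce(long nowMs) {
        int dropped = 0;
        while (!segments.isEmpty()) {
            Segment oldest = segments.peekFirst();
            boolean tooOld = nowMs - oldest.lastModifiedMs > retentionMs;
            boolean tooBig = totalBytes > retentionBytes;
            if (!tooOld && !tooBig) {
                break;
            }
            segments.removeFirst();
            totalBytes -= oldest.sizeBytes;
            dropped++;
        }
        return dropped;
    }

    public int segmentCount() {
        return segments.size();
    }

    public long totalBytes() {
        return totalBytes;
    }

    public static void main(String[] args) {
        RetentionSketch log = new RetentionSketch(1_000, 250);
        log.addSegment(0, 100);
        log.addSegment(500, 100);
        log.addSegment(900, 100);
        System.out.println("dropped: " + log.enforce(1_200));
        System.out.println("remaining segments: " + log.segmentCount());
    }
}
```

Sizing the two limits is a trade-off: longer retention gives slow consumers more time to catch up but needs more disk per broker.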

Page 52: Apache Kafka at LinkedIn

Intra-cluster Replication

• Pick CA (consistency and availability) within a datacenter (failover < 10 ms)
  • Network partitions are rare
  • Latency is less of an issue

• Separate data replication and consensus
  • Consensus => ZooKeeper
  • Replication => primary-backup (f replicas tolerate f-1 failures)

• Configurable ACK (durability vs. latency)

• More details:
  • http://www.slideshare.net/junrao/kafka-replication-apachecon2013

Page 53: Apache Kafka at LinkedIn

Replication Architecture

[Figure: replication architecture diagram showing producers, consumers, four brokers, and ZK (ZooKeeper)]