
Apache Kafka at LinkedIn


About Me


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Why Did We Build Kafka?


We Have a Lot of Data


• User activity tracking

• Page views, ad impressions, etc

• Server logs and metrics

• Syslogs, request-rates, etc

• Messaging

• Emails, news feeds, etc

• Computation-derived data

• Results of Hadoop / data warehousing, etc


… and We Build Products on Data


Newsfeed


Recommendation


People you may know


Search


Metrics and Monitoring


System and application metrics/logging


… and a LOT of Monitoring


The Problem:

How to integrate this variety of data and make it available to all products?


Life back in 2010:

Point-to-Point Pipelines


Example: User Activity Data Flow


What We Want

• A centralized data pipeline


Apache Kafka

We tried some systems off-the-shelf, but…


What We REALLY Want

• A centralized data pipeline that is

• Elastically scalable

• Durable

• High-throughput

• Easy to use


Apache Kafka

• A distributed pub-sub messaging system

• Scales out from the ground up

• Persists messages to disk

• High throughput (tens of MB/sec per server)


Life Since Kafka in Production

Apache Kafka

• Developed and maintained by 5 devs + 2 SREs


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Key Idea #1:

Data-parallelism leads to scale-out


Distribute Clients across Partitions

• Produce/consume requests are randomly balanced among brokers (see the sketch below)
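A minimal sketch of the idea: keyed messages hash to a stable partition, unkeyed messages are spread at random, and load therefore fans out across partitions (and the brokers that own them). The partitionFor helper and hashing scheme below are illustrative, not Kafka's actual partitioner.

import java.util.Arrays;
import java.util.Random;

public class PartitionerSketch {
    // Keyed messages map to a stable partition; unkeyed ones are spread randomly,
    // so produce/consume load is balanced across the brokers owning the partitions.
    static int partitionFor(byte[] key, int numPartitions, Random random) {
        if (key == null) {
            return random.nextInt(numPartitions);                   // random balancing
        }
        return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions; // stable, key-based
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(partitionFor("member-123".getBytes(), 8, random)); // always the same partition
        System.out.println(partitionFor(null, 8, random));                    // varies per call
    }
}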

Key Idea #2:

Disks are fast when used sequentially


Store Messages as a Log

• Appends are effectively O(1)

• Reads from a known offset are still fast when cached (see the sketch below)

[Diagram: partition i of topic A as an append-only log with offsets 3, 4, 5, …, 12; the producer writes at the tail while Consumer1 and Consumer2 each read from offset 7]
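To make the log idea concrete, here is a minimal in-memory sketch of an append-only partition with offset-based reads; it is purely illustrative and says nothing about Kafka's actual on-disk segment format.

import java.util.ArrayList;
import java.util.List;

public class LogSketch {
    private final List<byte[]> messages = new ArrayList<>();

    // Append at the tail and return the new message's offset: effectively O(1).
    public synchronized long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Sequential read starting at a known offset. Each consumer tracks its own
    // offset, so Consumer1 and Consumer2 can read the same partition independently.
    public synchronized List<byte[]> read(long offset, int maxMessages) {
        int from = (int) offset;
        int to = Math.min(messages.size(), from + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }
}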

Key Idea #3:

Batching makes best use of network/IO


Batch Transfer

• Batched send and receive

• Batched compression

• No message caching in the JVM

• Zero-copy from file to socket (Java NIO); see the sketch below
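The zero-copy path relies on Java NIO's FileChannel.transferTo, which lets the kernel move bytes from the page cache straight to the socket without copying them through user space. A minimal sketch under that assumption (not the broker's actual code; the path, host, and port are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    // Stream a log segment file to a consumer's socket without a user-space copy.
    public static void sendFile(String path, String host, int port) throws IOException {
        try (FileChannel file = new FileInputStream(path).getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket); // zero-copy transfer
                position += sent;
                remaining -= sent;
            }
        }
    }
}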


The API (0.8)

Producer:

send(topic, message)

Consumer:

Iterable stream = createMessageStreams(…).get(topic)

for (message : stream) {
    // process the message
}
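Roughly how the 0.8 Java clients look in practice; the broker list, ZooKeeper address, topic name, and group id below are placeholders, and error handling is omitted.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ApiSketch {
    static void produce() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("page-views", "a page-view event")); // send(topic, message)
        producer.close();
    }

    static void consume() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181"); // placeholder ZooKeeper ensemble
        props.put("group.id", "example-group");     // placeholder consumer group
        ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("page-views", 1));
        for (MessageAndMetadata<byte[], byte[]> message : streams.get("page-views").get(0)) {
            System.out.println(new String(message.message())); // process the message
        }
    }
}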


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A


Kafka Usage at LinkedIn

• Mainly used for tracking user-activity and metrics data

• 16-32 brokers in each cluster (615+ total brokers)

• 527 billion messages/day

• 7500+ topics, 270k+ partitions

• Byte rates:

• Writes: 97 TB/day

• Reads: 430 TB/day
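(For scale, those figures average out to roughly 6.1 million messages/sec, about 1.1 GB/sec of writes, and about 5 GB/sec of reads across the clusters: 527e9 messages, 97e12 bytes, and 430e12 bytes divided by the 86,400 seconds in a day.)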


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A

Problems

• Hundreds of message types

• Thousands of fields

• What do they all mean?

• What happens when they change?


Standardized Schema on Avro

• Schema

• Message structure contract

• Performance gain

• Workflow

• Check in schema

• Auto compatibility check

• Code review

• “Ship it!”
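A sketch of what a checked-in schema and its use as a message-structure contract might look like; PageViewEvent and its fields are hypothetical, not one of LinkedIn's actual schemas.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaSketch {
    // A hypothetical tracking-event schema. Real schemas are checked into a shared
    // repository and run through the automatic compatibility check before "Ship it!".
    static final String PAGE_VIEW_SCHEMA =
        "{\"type\":\"record\",\"name\":\"PageViewEvent\",\"fields\":["
      + "{\"name\":\"memberId\",\"type\":\"long\"},"
      + "{\"name\":\"pageKey\",\"type\":\"string\"},"
      + "{\"name\":\"time\",\"type\":\"long\"}]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(PAGE_VIEW_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("memberId", 12345L);
        event.put("pageKey", "profile");
        event.put("time", System.currentTimeMillis());
        System.out.println(event); // the schema is the message-structure contract
    }
}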


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A


Kafka to Hadoop


Hadoop ETL (Camus)

• Map/Reduce job does data load

• One job loads all events

• ~10 minutes on average from producer to HDFS

• Hive registration done automatically

• Schema evolution handled transparently

• Open sourced:

– https://github.com/linkedin/camus


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A

Does it really work?

“All published messages must be delivered to all consumers (quickly)”

Audit Trail
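One common way an audit trail like this works, and the assumption behind this sketch: every tier (producers, downstream consumers, Hadoop loaders) counts the messages it handles per topic and time bucket, publishes those counts to an audit topic, and a monitor compares the tiers' counts for the same bucket to detect loss or lag. The class below is a made-up illustration of the counting side only.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class AuditCounter {
    // Per (topic, time-bucket) message counts for one tier, e.g. 10-minute buckets.
    private static final long BUCKET_MS = 10 * 60 * 1000;
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(String topic, long eventTimeMs) {
        String bucket = topic + "@" + (eventTimeMs / BUCKET_MS);
        counts.computeIfAbsent(bucket, k -> new LongAdder()).increment();
    }

    // Snapshot to publish to the audit topic; another job compares producer-side
    // and consumer-side snapshots for the same bucket.
    public Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        counts.forEach((k, v) -> out.put(k, v.sum()));
        return out;
    }
}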


More Features in Kafka 0.8

• Intra-cluster replication (0.8.0)

• High availability

• Reduced latency

• Log compaction (0.8.1)

• State storage

• Operational tools (0.8.2)

• Topic management

• Automated leader rebalance

• etc ..

Check out our page for more: http://kafka.apache.org/


Kafka 0.9

• Clients Rewrite

• Remove ZK dependency

• Even better throughput

• Security

• More operability, multi-tenancy ready

• Transactional Messaging

• From at-least-once to exactly-once

Check out our page for more: http://kafka.apache.org/


Kafka Users: Maybe You Next?


Acknowledgements

Questions?

Guozhang Wang

guwang@linkedin.com

www.linkedin.com/in/guozhangwang

Backup Slides


Real-time Analysis with Kafka

• Analytics from Hadoop can be slow

• Production -> Kafka: tens of milliseconds

• Kafka -> Hadoop: < 1 minute

• ETL in Hadoop: ~ 45 minutes

• MapReduce in Hadoop: maybe hours


Real-time Analysis with Kafka

• Solution No. 1: directly consuming from Kafka

• Solution No. 2: storage other than HDFS

• Spark, Shark

• Pinot, Druid, FastBit

• Solution No. 3: stream processing

• Apache Samza

• Storm


How Fast Can Kafka Go?

• Bottleneck #1: network bandwidth

• Producer: ~100 MB/s on 1-Gigabit Ethernet

• Consumers can be slower due to multiple subscriptions

• Bottleneck #2: disk space

• Data may be deleted before it is consumed at peak times

• Configurable time/size-based retention policy (see the broker-config sketch below)

• Bottleneck #3: Zookeeper

• Mainly due to offset commits; will be lifted in 0.9
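For the disk-space bottleneck, retention is just broker configuration; a minimal sketch with illustrative values (the keys are standard Kafka broker settings, the values are placeholders):

# server.properties (illustrative values)
# Time-based retention: keep data for 7 days ...
log.retention.hours=168
# ... or until a partition exceeds ~1 GB, whichever comes first
log.retention.bytes=1073741824
# Segment size; old segments are deleted as whole files
log.segment.bytes=536870912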


Intra-cluster Replication

• Pick CA within the datacenter (failover < 10 ms)

• Network partitions are rare

• Latency is less of an issue

• Separate data replication and consensus

• Consensus => ZooKeeper

• Replication => primary-backup (f replicas tolerate f-1 failures)

• Configurable ACK (durability vs. latency); see the producer-config sketch below

• More details:

• http://www.slideshare.net/junrao/kafka-replication-apachecon2013
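A sketch of the configurable-ACK knob on the 0.8 producer (request.required.acks); the values below are the standard ones, shown only to illustrate the durability-versus-latency trade-off:

# 0.8 producer config (illustrative)
# -1 waits for all in-sync replicas (most durable, highest latency);
#  1 waits for the leader only; 0 does not wait at all (lowest latency).
request.required.acks=-1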


Replication Architecture

[Diagram: producers and consumers connect to a cluster of brokers; the brokers use ZooKeeper (ZK) for coordination]
