
Apache Kafka at LinkedIn


About Me


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Why Did We Build Kafka?


We Have a Lot of Data


• User activity tracking

• Page views, ad impressions, etc

• Server logs and metrics

• Syslogs, request-rates, etc

• Messaging

• Emails, news feeds, etc

• Computation-derived data

• Results of Hadoop / data warehousing, etc


… and We Build Products on Data


Newsfeed


Recommendation


People you may know


Search


Metrics and Monitoring


System and application metrics/logging


… and a LOT of Monitoring


The Problem:

How to integrate this variety of data and make it available to all products?


Life back in 2010:

Point-to-Point Pipelines


Example: User Activity Data Flow


What We Want

• A centralized data pipeline


Apache Kafka

We tried some systems off-the-shelf, but…


What We REALLY Want

• A centralized data pipeline that is

• Elastically scalable

• Durable

• High-throughput

• Easy to use


Apache Kafka

• A distributed pub-sub messaging system

• Scales out from the ground up

• Persists messages to disk

• High throughput (tens of MB/sec per server)


Life Since Kafka in Production

Apache Kafka

• Developed and maintained by 5 devs + 2 SREs


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Roadmap

• Q & A

Key Idea #1:

Data-parallelism leads to scale-out


Distribute Clients across Partitions

• Produce/consume requests are randomly balanced among brokers (see the sketch below)
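A minimal sketch of the idea: keyed messages hash to a stable partition, unkeyed messages are spread at random, and load therefore fans out across partitions (and the brokers that own them). The partitionFor helper and hashing scheme below are illustrative, not Kafka's actual partitioner.

import java.util.Arrays;
import java.util.Random;

public class PartitionerSketch {
    // Keyed messages map to a stable partition; unkeyed ones are spread randomly,
    // so produce/consume load is balanced across the brokers owning the partitions.
    static int partitionFor(byte[] key, int numPartitions, Random random) {
        if (key == null) {
            return random.nextInt(numPartitions);                   // random balancing
        }
        return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions; // stable, key-based
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(partitionFor("member-123".getBytes(), 8, random)); // always the same partition
        System.out.println(partitionFor(null, 8, random));                    // varies per call
    }
}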

Key Idea #2:

Disks are fast when used sequentially


Store Messages as a Log

• Appends are effectively O(1)

• Reads from a known offset are still fast when cached (see the sketch below)

[Diagram: partition i of topic A as an append-only log with offsets 3, 4, 5, …, 12; the producer writes at the tail while Consumer1 and Consumer2 each read from offset 7]
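To make the log idea concrete, here is a minimal in-memory sketch of an append-only partition with offset-based reads; it is purely illustrative and says nothing about Kafka's actual on-disk segment format.

import java.util.ArrayList;
import java.util.List;

public class LogSketch {
    private final List<byte[]> messages = new ArrayList<>();

    // Append at the tail and return the new message's offset: effectively O(1).
    public synchronized long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Sequential read starting at a known offset. Each consumer tracks its own
    // offset, so Consumer1 and Consumer2 can read the same partition independently.
    public synchronized List<byte[]> read(long offset, int maxMessages) {
        int from = (int) offset;
        int to = Math.min(messages.size(), from + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }
}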

Key Idea #3:

Batching makes best use of network/IO


Batch Transfer

• Batched send and receive

• Batched compression

• No message caching in the JVM

• Zero-copy from file to socket (Java NIO); see the sketch below
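The zero-copy path relies on Java NIO's FileChannel.transferTo, which lets the kernel move bytes from the page cache straight to the socket without copying them through user space. A minimal sketch under that assumption (not the broker's actual code; the path, host, and port are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    // Stream a log segment file to a consumer's socket without a user-space copy.
    public static void sendFile(String path, String host, int port) throws IOException {
        try (FileChannel file = new FileInputStream(path).getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket); // zero-copy transfer
                position += sent;
                remaining -= sent;
            }
        }
    }
}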


The API (0.8)

Producer:

send(topic, message)

Consumer:

Iterable stream = createMessageStreams(…).get(topic)

for (message : stream) {
    // process the message
}
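Roughly how the 0.8 Java clients look in practice; the broker list, ZooKeeper address, topic name, and group id below are placeholders, and error handling is omitted.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ApiSketch {
    static void produce() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("page-views", "a page-view event")); // send(topic, message)
        producer.close();
    }

    static void consume() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181"); // placeholder ZooKeeper ensemble
        props.put("group.id", "example-group");     // placeholder consumer group
        ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("page-views", 1));
        for (MessageAndMetadata<byte[], byte[]> message : streams.get("page-views").get(0)) {
            System.out.println(new String(message.message())); // process the message
        }
    }
}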


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A


Kafka Usage at LinkedIn

• Mainly used for tracking user-activity and metrics data

• 16-32 brokers in each cluster (615+ total brokers)

• 527 billion messages/day

• 7500+ topics, 270k+ partitions

• Byte rates:

• Writes: 97 TB/day

• Reads: 430 TB/day
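(For scale, those figures average out to roughly 6.1 million messages/sec, about 1.1 GB/sec of writes, and about 5 GB/sec of reads across the clusters: 527e9 messages, 97e12 bytes, and 430e12 bytes divided by the 86,400 seconds in a day.)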


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A

Problems

• Hundreds of message types

• Thousands of fields

• What do they all mean?

• What happens when they change?


Standardized Schema on Avro

• Schema

• Message structure contract

• Performance gain

• Workflow

• Check in schema

• Auto compatibility check

• Code review

• “Ship it!”
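A sketch of what a checked-in schema and its use as a message-structure contract might look like; PageViewEvent and its fields are hypothetical, not one of LinkedIn's actual schemas.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaSketch {
    // A hypothetical tracking-event schema. Real schemas are checked into a shared
    // repository and run through the automatic compatibility check before "Ship it!".
    static final String PAGE_VIEW_SCHEMA =
        "{\"type\":\"record\",\"name\":\"PageViewEvent\",\"fields\":["
      + "{\"name\":\"memberId\",\"type\":\"long\"},"
      + "{\"name\":\"pageKey\",\"type\":\"string\"},"
      + "{\"name\":\"time\",\"type\":\"long\"}]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(PAGE_VIEW_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("memberId", 12345L);
        event.put("pageKey", "profile");
        event.put("time", System.currentTimeMillis());
        System.out.println(event); // the schema is the message-structure contract
    }
}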


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A


Kafka to Hadoop


Hadoop ETL (Camus)

• Map/Reduce job does data load

• One job loads all events

• ~10 minutes on average from producer to HDFS

• Hive registration done automatically

• Schema evolution handled transparently

• Open sourced:

– https://github.com/linkedin/camus


Agenda


• Overview of Kafka

• Kafka Design

• Kafka Usage at LinkedIn

• Pipeline deployment

• Schema for data cleanliness

• O(1) ETL

• Auditing for correctness

• Roadmap

• Q & A

Does it really work?

“All published messages must be delivered to all consumers (quickly)”

Audit Trail
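One common way an audit trail like this works, and the assumption behind this sketch: every tier (producers, downstream consumers, Hadoop loaders) counts the messages it handles per topic and time bucket, publishes those counts to an audit topic, and a monitor compares the tiers' counts for the same bucket to detect loss or lag. The class below is a made-up illustration of the counting side only.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class AuditCounter {
    // Per (topic, time-bucket) message counts for one tier, e.g. 10-minute buckets.
    private static final long BUCKET_MS = 10 * 60 * 1000;
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(String topic, long eventTimeMs) {
        String bucket = topic + "@" + (eventTimeMs / BUCKET_MS);
        counts.computeIfAbsent(bucket, k -> new LongAdder()).increment();
    }

    // Snapshot to publish to the audit topic; another job compares producer-side
    // and consumer-side snapshots for the same bucket.
    public Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        counts.forEach((k, v) -> out.put(k, v.sum()));
        return out;
    }
}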


More Features in Kafka 0.8

• Intra-cluster replication (0.8.0)

• High availability

• Reduced latency

• Log compaction (0.8.1)

• State storage

• Operational tools (0.8.2)

• Topic management

• Automated leader rebalance

• etc ..

Check out our page for more: http://kafka.apache.org/


Kafka 0.9

• Clients Rewrite

• Remove ZK dependency

• Even better throughput

• Security

• More operability, multi-tenancy ready

• Transactional Messaging

• From at-least-once to exactly-once

Check out our page for more: http://kafka.apache.org/


Kafka Users: Maybe You Next?


Acknowledgements

Questions?

Guozhang Wang

guwang@linkedin.com

www.linkedin.com/in/guozhangwang

Backup Slides


Real-time Analysis with Kafka

• Analytics from Hadoop can be slow

• Production -> Kafka: tens of milliseconds

• Kafka -> Hadoop: < 1 minute

• ETL in Hadoop: ~ 45 minutes

• MapReduce in Hadoop: maybe hours


Real-time Analysis with Kafka

• Solution No. 1: directly consuming from Kafka

• Solution No. 2: storage other than HDFS

• Spark, Shark

• Pinot, Druid, FastBit

• Solution No. 3: stream processing

• Apache Samza

• Storm


How Fast Can Kafka Go?

• Bottleneck #1: network bandwidth

• Producer: ~100 MB/s on 1-Gigabit Ethernet

• Consumers can be slower due to multiple subscriptions

• Bottleneck #2: disk space

• Data may be deleted before it is consumed at peak times

• Configurable time/size-based retention policy (see the broker-config sketch below)

• Bottleneck #3: Zookeeper

• Mainly due to offset commits; will be lifted in 0.9
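For the disk-space bottleneck, retention is just broker configuration; a minimal sketch with illustrative values (the keys are standard Kafka broker settings, the values are placeholders):

# server.properties (illustrative values)
# Time-based retention: keep data for 7 days ...
log.retention.hours=168
# ... or until a partition exceeds ~1 GB, whichever comes first
log.retention.bytes=1073741824
# Segment size; old segments are deleted as whole files
log.segment.bytes=536870912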


Intra-cluster Replication

• Pick CA within the datacenter (failover < 10 ms)

• Network partitions are rare

• Latency is less of an issue

• Separate data replication and consensus

• Consensus => ZooKeeper

• Replication => primary-backup (f replicas tolerate f-1 failures)

• Configurable ACK (durability vs. latency); see the producer-config sketch below

• More details:

• http://www.slideshare.net/junrao/kafka-replication-apachecon2013
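A sketch of the configurable-ACK knob on the 0.8 producer (request.required.acks); the values below are the standard ones, shown only to illustrate the durability-versus-latency trade-off:

# 0.8 producer config (illustrative)
# -1 waits for all in-sync replicas (most durable, highest latency);
#  1 waits for the leader only; 0 does not wait at all (lowest latency).
request.required.acks=-1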


Replication Architecture

[Diagram: producers and consumers connect to a cluster of brokers; the brokers use ZooKeeper (ZK) for coordination]
