Apache Kafka at LinkedIn

Jay Kreps Introduction to Apache Kafka

Author: discover-pinterest

Posted on 19-Aug-2014






Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.


  • Jay Kreps Introduction to Apache Kafka
  • The Plan: 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • Apache Kafka
  • A brief history of Apache Kafka
  • Characteristics: scalability of a filesystem (hundreds of MB/sec/server throughput, many TB per server); guarantees of a database (messages strictly ordered, all data persistent); distributed by default (replication, partitioning model)
  • Kafka is about logs
  • What is a log?
  • Logs: pub/sub done right
  • Partitioning
  • Nodes Host Many Partitions
  • Producers Balance Load
  • Consumers Divide Up Partitions
  • End-to-End
  • Kafka at LinkedIn: 175 TB of in-flight log data per colo; replicated to each datacenter; tens of thousands of data producers; thousands of consumers; 7 million messages written/sec; 35 million messages read/sec; Hadoop integration
  • Performance: producer (3x replication), async: 786,980 records/sec (75.1 MB/sec), sync: 421,823 records/sec (40.2 MB/sec); consumer: 940,521 records/sec (89.7 MB/sec); end-to-end latency: 2 ms (median), 14 ms (99.9th percentile)
  • The Plan: 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • Data Integration
  • Maslow's Hierarchy
  • For Data
  • New Types of Data: database data (users, products, orders, etc.); events (clicks, impressions, pageviews, etc.); application metrics (CPU usage, requests/sec); application logs (service calls, errors)
  • New Types of Systems: live stores (Voldemort, Espresso, Graph, OLAP, Search, InGraphs); offline (Hadoop, Teradata)
  • Bad
  • Good
  • Example: User views job
  • Comparing Data Transfer Mechanisms
  • The Plan: 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • Stream Processing
  • Stream processing is a generalization of batch processing
  • Stream Processing = Logs + Jobs
  • Examples: monitoring, security, content processing, recommendations, newsfeed, ETL
  • Frameworks Can Help
  • Samza Architecture
  • Log-centric Architecture
  • Kafka: http://kafka.apache.org | Samza: http://samza.incubator.apache.org | Log Blog: http://linkd.in/199iMwY | Benchmark: http://t.co/40fkKJvanx | Me: http://www.linkedin.com/in/jaykreps, @jaykreps
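
The slides on logs, partitioning, producers, and consumers name the core model without showing it, so here is a rough sketch of how the pieces could fit together: a topic as a set of append-only partitions, records addressed by offset, producers choosing a partition by hashing a key, and consumers splitting the partitions among themselves. This is a toy in-memory illustration, not Kafka's client API. The names `PartitionedLog` and `assign_partitions` are hypothetical, and both the use of `crc32` as the hash and the round-robin assignment are assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Producer side: hash the key so records with the same key land in
        # the same partition (crc32 is an illustrative choice only).
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        # The offset is the record's position within its partition.
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_records=10):
        # Consumer side: read sequentially starting from a known offset.
        return self.partitions[partition][offset:offset + max_records]


def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer, round-robin (assumed policy)."""
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return dict(assignment)
```

For example, `assign_partitions(4, ["a", "b"])` gives consumer `a` partitions 0 and 2 and consumer `b` partitions 1 and 3. Because each partition is read by a single consumer, ordering is preserved within a partition, and a consumer only needs to remember one offset per partition to resume where it left off.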