Apache Kafka at LinkedIn


Posted on 19-Aug-2014






Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza.

This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.


Jay Kreps
Introduction to Apache Kafka

The Plan
  1. What is Apache Kafka?
  2. Kafka and Data Integration
  3. Kafka and Stream Processing

Apache Kafka

A brief history of Apache Kafka

Characteristics
  Scalability of a filesystem
    Hundreds of MB/sec/server throughput
    Many TB per server
  Guarantees of a database
    Messages strictly ordered
    All data persistent
  Distributed by default
    Replication
    Partitioning model

Kafka is about logs

What is a log?

Logs: pub/sub done right

Partitioning

Nodes Host Many Partitions

Producers Balance Load

Consumers Divide Up Partitions

End-to-End

Kafka At LinkedIn
  175 TB of in-flight log data per colo
  Replicated to each datacenter
  Tens of thousands of data producers
  Thousands of consumers
  7 million messages written/sec
  35 million messages read/sec
  Hadoop integration

Performance
  Producer (3x replication):
    Async: 786,980 records/sec (75.1 MB/sec)
    Sync: 421,823 records/sec (40.2 MB/sec)
  Consumer: 940,521 records/sec (89.7 MB/sec)
  End-to-end latency:
    2 ms (median)
    14 ms (99.9th percentile)

Data Integration

Maslow's Hierarchy For Data

New Types of Data
  Database data: Users, products, orders, etc.
  Events: Clicks, Impressions, Pageviews, etc.
  Application metrics: CPU usage, requests/sec
  Application logs: Service calls, errors

New Types of Systems
  Live Stores
    Voldemort
    Espresso
    Graph
    OLAP
    Search
    InGraphs
  Offline
    Hadoop
    Teradata

Bad

Good

Example: User views job

Comparing Data Transfer Mechanisms

Stream Processing

Stream processing is a generalization of batch processing

Stream Processing = Logs + Jobs

Examples
  Monitoring
  Security
  Content processing
  Recommendations
  Newsfeed
  ETL

Frameworks Can Help

Samza Architecture

Log-centric Architecture

Kafka: http://kafka.apache.org
Samza: http://samza.incubator.apache.org
Log Blog: http://linkd.in/199iMwY
Benchmark: http://t.co/40fkKJvanx
Me: http://www.linkedin.com/in/jaykreps @jaykreps
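
The slides on logs and partitioning ("What is a log?", "Producers Balance Load", "Consumers Divide Up Partitions") only give titles, so the following is a minimal in-memory sketch of the ideas they name, not Kafka's implementation or client API. The class and function names (`Partition`, `Topic`, `assign_partitions`), the use of CRC32 for key hashing, and the round-robin assignment are all assumptions chosen for illustration.

```python
from zlib import crc32


class Partition:
    """An append-only log: each record gets the next sequential offset."""

    def __init__(self):
        self.records = []

    def append(self, value):
        # The offset of a record is simply its position in the log.
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        # Consumers read forward from an offset they track themselves.
        return self.records[offset:offset + max_records]


class Topic:
    """A topic split into independent partitions (hypothetical helper)."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def send(self, key, value):
        # Assumption: hash the key with CRC32 to pick a partition, so all
        # records with the same key stay ordered within one partition.
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        offset = self.partitions[index].append(value)
        return index, offset


def assign_partitions(num_partitions, consumers):
    """Divide partitions among consumers round-robin (illustrative only)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    topic = Topic(num_partitions=4)
    partition, offset = topic.send("user-1", "viewed job")
    print(partition, offset, topic.partitions[partition].read(0))
    print(assign_partitions(4, ["consumer-a", "consumer-b"]))
```

Ordering in this sketch holds per partition only, which is why records that must stay ordered relative to each other share a key.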

