apache kafka at linkedin
of 37
/37
Jay Kreps Introduction to Apache Kafka
Embed Size (px)
DESCRIPTION
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.TRANSCRIPT
- Jay Kreps Introduction to Apache Kafka
- The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
- Apache Kafka
- A brief history of Apache Kafka
- Characteristics Scalability of a filesystem Hundreds of MB/sec/server throughput Many TB per server Guarantees of a database Messages strictly ordered All data persistent Distributed by default Replication Partitioning model
- Kafka is about logs
- What is a log?
- Logs: pub/sub done right
- Partitioning
- Nodes Host Many Partitions
- Producers Balance Load
- Consumers Divide Up Partitions
- End-to-End
- Kafka At LinkedIn 175 TB of in-flight log data per colo Replicated to each datacenter Tens of thousands of data producers Thousands of consumers 7 million messages written/sec 35 million messages read/sec Hadoop integration
- Performance Producer (3x replication): Async: 786,980 records/sec (75.1 MB/sec) Sync: 421,823 records/sec (40.2 MB/sec) Consumer: 940,521 records/sec (89.7 MB/sec) End-to-end latency: 2 ms (median) 14 ms (99.9th percentile)
- The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
- Data Integration
- Maslows Hierarchy
- For Data
- New Types of Data Database data Users, products, orders, etc Events Clicks, Impressions, Pageviews, etc Application metrics CPU usage, requests/sec Application logs Service calls, errors
- New Types of Systems Live Stores Voldemort Espresso Graph OLAP Search InGraphs Offline Hadoop Teradata
- Bad
- Good
- Example: User views job
- Comparing Data Transfer Mechanisms
- The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
- Stream Processing
- Stream processing is a generalization of batch processing
- Stream Processing = Logs + Jobs
- Examples Monitoring Security Content processing Recommendations Newsfeed ETL
- Frameworks Can Help
- Samza Architecture
- Log-centric Architecture
- Kafka http://kafka.apache.org Samza http://samza.incubator.apache.org Log Blog http://linkd.in/199iMwY Benchmark: http://t.co/40fkKJvanx Me http://www.linkedin.com/in/jaykreps @jaykreps