i heart log: real-time data and apache kafka
DESCRIPTION
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.TRANSCRIPT
Real-time Data and Apache Kafka
Jay Kreps
I ♥ Log
The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing
Apache Kafka
Abrief
historyof
Kafka
Three principles1. One pipeline to rule them all2. Stream processing >> messaging3. Clusters not servers
Characteristics• Scalability of a filesystem– Hundreds of MB/sec/server throughput–Many TB per server
• Guarantees of a database–Messages strictly ordered– All data persistent
• Distributed by default– Replication– Partitioning model
Kafka At LinkedIn• 175 TB of in-flight log data per colo• Low-latency: ~1.5 ms• Replicated to each datacenter• Tens of thousands of data producers• Thousands of consumers• 7 million messages written/sec• 35 million messages read/sec• Hadoop integration
Open source• Apache Software Foundation• Very healthy usage outside LinkedIn• Broad base of committers• 30 clients in 15 languages• Great ecosystem of supporting tools
The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing
Kafka is about logs
What is a log?
Partitioning
Logs: pub/sub done right
Logs And Distributed Systems
Example: A Fault-tolerant CEO Hash Table
Operations Final State
State-machine Replication
Primary-backup
What use is a log?
The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing
Data Integration
Types of Data• Database data– Users, products, orders, etc
• Events– Clicks, Impressions, Pageviews, etc
• Application metrics– CPU usage, requests/sec
• Application logs– Service calls, errors
Systems at LinkedIn• Live Stores– Voldemort– Espresso– Graph– OLAP– Search– InGraphs
• Offline– Hadoop– Teradata
Bad
Good
Example: User views job
The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing
Stream Processing
Stream processing is ageneralization
of batch processing
Examples• Monitoring• Security• Content processing• Recommendations• Newsfeed• ETL
Stream Processing = Logs + Jobs
Systems Can Help
Samza Architecture
Example: Top Articles By Company
Log-centric Architecture
Kafkahttp://kafka.apache.org
Samzahttp://samza.incubator.apache.org
Log Bloghttp://linkd.in/199iMwY
Mehttp://www.linkedin.com/in/
jaykreps@jaykreps