i heart log: real-time data and apache kafka

41
Real-time Data and Apache Kafka Jay Kreps I Log

Upload: jay-kreps

Post on 19-Aug-2014

1.010 views

Category:

Engineering


3 download

DESCRIPTION

This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.

TRANSCRIPT

Page 1: I Heart Log: Real-time Data and Apache Kafka

Real-time Data and Apache Kafka

Jay Kreps

I ♥ Log

Page 2: I Heart Log: Real-time Data and Apache Kafka

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Page 3: I Heart Log: Real-time Data and Apache Kafka

Apache Kafka

Page 4: I Heart Log: Real-time Data and Apache Kafka
Page 5: I Heart Log: Real-time Data and Apache Kafka

Abrief

historyof

Kafka

Page 6: I Heart Log: Real-time Data and Apache Kafka

Three principles1. One pipeline to rule them all2. Stream processing >> messaging3. Clusters not servers

Page 7: I Heart Log: Real-time Data and Apache Kafka

Characteristics• Scalability of a filesystem– Hundreds of MB/sec/server throughput–Many TB per server

• Guarantees of a database–Messages strictly ordered– All data persistent

• Distributed by default– Replication– Partitioning model

Page 8: I Heart Log: Real-time Data and Apache Kafka

Kafka At LinkedIn• 175 TB of in-flight log data per colo• Low-latency: ~1.5 ms• Replicated to each datacenter• Tens of thousands of data producers• Thousands of consumers• 7 million messages written/sec• 35 million messages read/sec• Hadoop integration

Page 9: I Heart Log: Real-time Data and Apache Kafka

Open source• Apache Software Foundation• Very healthy usage outside LinkedIn• Broad base of committers• 30 clients in 15 languages• Great ecosystem of supporting tools

Page 10: I Heart Log: Real-time Data and Apache Kafka

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Page 11: I Heart Log: Real-time Data and Apache Kafka

Kafka is about logs

Page 12: I Heart Log: Real-time Data and Apache Kafka

What is a log?

Page 13: I Heart Log: Real-time Data and Apache Kafka
Page 14: I Heart Log: Real-time Data and Apache Kafka
Page 15: I Heart Log: Real-time Data and Apache Kafka

Partitioning

Page 16: I Heart Log: Real-time Data and Apache Kafka

Logs: pub/sub done right

Page 17: I Heart Log: Real-time Data and Apache Kafka

Logs And Distributed Systems

Page 18: I Heart Log: Real-time Data and Apache Kafka

Example: A Fault-tolerant CEO Hash Table

Page 19: I Heart Log: Real-time Data and Apache Kafka

Operations Final State

Page 20: I Heart Log: Real-time Data and Apache Kafka
Page 21: I Heart Log: Real-time Data and Apache Kafka

State-machine Replication

Page 22: I Heart Log: Real-time Data and Apache Kafka

Primary-backup

Page 23: I Heart Log: Real-time Data and Apache Kafka

What use is a log?

Page 24: I Heart Log: Real-time Data and Apache Kafka

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Page 25: I Heart Log: Real-time Data and Apache Kafka

Data Integration

Page 26: I Heart Log: Real-time Data and Apache Kafka

Types of Data• Database data– Users, products, orders, etc

• Events– Clicks, Impressions, Pageviews, etc

• Application metrics– CPU usage, requests/sec

• Application logs– Service calls, errors

Page 27: I Heart Log: Real-time Data and Apache Kafka

Systems at LinkedIn• Live Stores– Voldemort– Espresso– Graph– OLAP– Search– InGraphs

• Offline– Hadoop– Teradata

Page 28: I Heart Log: Real-time Data and Apache Kafka

Bad

Page 29: I Heart Log: Real-time Data and Apache Kafka

Good

Page 30: I Heart Log: Real-time Data and Apache Kafka

Example: User views job

Page 31: I Heart Log: Real-time Data and Apache Kafka

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Page 32: I Heart Log: Real-time Data and Apache Kafka

Stream Processing

Page 33: I Heart Log: Real-time Data and Apache Kafka
Page 34: I Heart Log: Real-time Data and Apache Kafka

Stream processing is ageneralization

of batch processing

Page 35: I Heart Log: Real-time Data and Apache Kafka

Examples• Monitoring• Security• Content processing• Recommendations• Newsfeed• ETL

Page 36: I Heart Log: Real-time Data and Apache Kafka

Stream Processing = Logs + Jobs

Page 37: I Heart Log: Real-time Data and Apache Kafka

Systems Can Help

Page 38: I Heart Log: Real-time Data and Apache Kafka

Samza Architecture

Page 39: I Heart Log: Real-time Data and Apache Kafka

Example: Top Articles By Company

Page 40: I Heart Log: Real-time Data and Apache Kafka

Log-centric Architecture

Page 41: I Heart Log: Real-time Data and Apache Kafka

Kafkahttp://kafka.apache.org

Samzahttp://samza.incubator.apache.org

Log Bloghttp://linkd.in/199iMwY

Mehttp://www.linkedin.com/in/

jaykreps@jaykreps