i heart log: real-time data and apache kafka

Post on 19-Aug-2014

1.010 Views

Category:

Engineering

3 Downloads

Preview:

Click to see full reader

DESCRIPTION

This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.

TRANSCRIPT

Real-time Data and Apache Kafka

Jay Kreps

I ♥ Log

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Apache Kafka

Abrief

historyof

Kafka

Three principles1. One pipeline to rule them all2. Stream processing >> messaging3. Clusters not servers

Characteristics• Scalability of a filesystem– Hundreds of MB/sec/server throughput–Many TB per server

• Guarantees of a database–Messages strictly ordered– All data persistent

• Distributed by default– Replication– Partitioning model

Kafka At LinkedIn• 175 TB of in-flight log data per colo• Low-latency: ~1.5 ms• Replicated to each datacenter• Tens of thousands of data producers• Thousands of consumers• 7 million messages written/sec• 35 million messages read/sec• Hadoop integration

Open source• Apache Software Foundation• Very healthy usage outside LinkedIn• Broad base of committers• 30 clients in 15 languages• Great ecosystem of supporting tools

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Kafka is about logs

What is a log?

Partitioning

Logs: pub/sub done right

Logs And Distributed Systems

Example: A Fault-tolerant CEO Hash Table

Operations Final State

State-machine Replication

Primary-backup

What use is a log?

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Data Integration

Types of Data• Database data– Users, products, orders, etc

• Events– Clicks, Impressions, Pageviews, etc

• Application metrics– CPU usage, requests/sec

• Application logs– Service calls, errors

Systems at LinkedIn• Live Stores– Voldemort– Espresso– Graph– OLAP– Search– InGraphs

• Offline– Hadoop– Teradata

Bad

Good

Example: User views job

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Stream Processing

Stream processing is ageneralization

of batch processing

Examples• Monitoring• Security• Content processing• Recommendations• Newsfeed• ETL

Stream Processing = Logs + Jobs

Systems Can Help

Samza Architecture

Example: Top Articles By Company

Log-centric Architecture

Kafkahttp://kafka.apache.org

Samzahttp://samza.incubator.apache.org

Log Bloghttp://linkd.in/199iMwY

Mehttp://www.linkedin.com/in/

jaykreps@jaykreps

top related