i heart log: real-time data and apache kafka

Real-time Data and Apache Kafka

Jay Kreps

I ♥ Log

The Plan1. Apache Kafka2. Logs and Distributed Systems3. Logs and Data Integration4. Logs and Stream Processing

Apache Kafka

Abrief

historyof

Kafka

Three principles1. One pipeline to rule them all2. Stream processing >> messaging3. Clusters not servers

Characteristics• Scalability of a filesystem– Hundreds of MB/sec/server throughput–Many TB per server

• Guarantees of a database–Messages strictly ordered– All data persistent

• Distributed by default– Replication– Partitioning model

Kafka At LinkedIn• 175 TB of in-flight log data per colo• Low-latency: ~1.5 ms• Replicated to each datacenter• Tens of thousands of data producers• Thousands of consumers• 7 million messages written/sec• 35 million messages read/sec• Hadoop integration

Open source• Apache Software Foundation• Very healthy usage outside LinkedIn• Broad base of committers• 30 clients in 15 languages• Great ecosystem of supporting tools

Kafka is about logs

What is a log?

Partitioning

Logs: pub/sub done right

Logs And Distributed Systems

Example: A Fault-tolerant CEO Hash Table

Operations Final State

State-machine Replication

Primary-backup

What use is a log?

Data Integration

Types of Data• Database data– Users, products, orders, etc

• Events– Clicks, Impressions, Pageviews, etc

• Application metrics– CPU usage, requests/sec

• Application logs– Service calls, errors

Systems at LinkedIn• Live Stores– Voldemort– Espresso– Graph– OLAP– Search– InGraphs

• Offline– Hadoop– Teradata

Example: User views job

Stream Processing

Stream processing is ageneralization

of batch processing

Examples• Monitoring• Security• Content processing• Recommendations• Newsfeed• ETL

Stream Processing = Logs + Jobs

Systems Can Help

Samza Architecture

Example: Top Articles By Company

Log-centric Architecture

Kafkahttp://kafka.apache.org

Samzahttp://samza.incubator.apache.org

Log Bloghttp://linkd.in/199iMwY

Mehttp://www.linkedin.com/in/

jaykreps@jaykreps

i heart log: real-time data and apache kafka

Engineering