I Heart Log: Real-time Data and Apache Kafka

Download I Heart Log: Real-time Data and Apache Kafka

Post on 21-Apr-2017

8.148 views

Category:

Engineering

5 download

TRANSCRIPT

Apache Kafka, Continuous Data Flow, and The Unified Log

Real-time Data and Apache KafkaJay KrepsI Log

Who are you?What is this talk about? What is a log and what is it good forExciting topic1

The PlanApache KafkaLogs and Distributed SystemsLogs and Data IntegrationLogs and Stream Processing

2

Apache Kafka

Producers, consumers distributed3

First project was an open source clone of Amazon Dynamo (Project Voldemort)Makes explaining things easier

4

AbriefhistoryofKafka

1 Pipeline for database data1 Pipeline for metrics1 Pipeline for events1 Pipeline for real-time processingNo pipeline for application logs300 ActiveMQ brokers5

Three principlesOne pipeline to rule them allStream processing >> messagingClusters not servers

CharacteristicsScalability of a filesystemHundreds of MB/sec/server throughputMany TB per serverGuarantees of a databaseMessages strictly orderedAll data persistentDistributed by defaultReplicationPartitioning model

10,000 messages/sec * 100 byte messages = ~1MB/sec7

Kafka At LinkedIn175 TB of in-flight log data per coloLow-latency: ~1.5 msReplicated to each datacenterTens of thousands of data producersThousands of consumers7 million messages written/sec35 million messages read/secHadoop integration

Open sourceApache Software FoundationVery healthy usage outside LinkedInBroad base of committers30 clients in 15 languagesGreat ecosystem of supporting tools

200 Kafka-related projects on github1000+ emails/month9

The PlanApache KafkaLogs and Distributed SystemsLogs and Data IntegrationLogs and Stream Processing

10

Kafka is about logs

The log is fundamental abstraction Kafka providesYou can use a log as a drop-in replacement for a messaging system, but it can also do a lot more11

What is a log?

What is a log?Traditional uses?Non-traditional uses12

Time orderedSemi-structured13

List of changesContents of record doesnt matterIndexed by timeNot application log (i.e. text file)

14

Partitioning

Data model of Kafka: A topicPartitions can be spread over machines, replicated15

Logs: pub/sub done right

The whole system is one big distributed system16

Logs And Distributed Systems

Paxos, Zookeeper (Zab), Raft, etc.Traditional databases, Hbase/Bigtable, Spanner, HDFS namenode, etc

Log has two purposes:ReplicationConsistency17

Example: A Fault-tolerant CEO Hash Table

Very important problem18

OperationsFinal State

What if replica is down?Ordering is importantTime is important19

Log is list of changesKey point: can re-create any point-in-timeIn banking: credits and debitsIn software: the version control changelog20

State-machine Replication

State-machine replication: log the incoming requests (logical logging)21

Primary-backup

Log the changed rows (physical logging)22

What use is a log?

Outside of distributed systems internals23

The PlanApache KafkaLogs and Distributed SystemsLogs and Data IntegrationLogs and Stream Processing

24

Data Integration

AKA ETLMany systemsEvent dataMost important problem for data-centric companiesIntegration >> ML25

Types of DataDatabase dataUsers, products, orders, etcEventsClicks, Impressions, Pageviews, etcApplication metricsCPU usage, requests/sec Application logsService calls, errors

Two exacerbating factors26

Systems at LinkedInLive StoresVoldemortEspressoGraphOLAPSearchInGraphsOfflineHadoopTeradata

One-size fits all27

Bad

Database cache coherencyData deployment from HadoopNever get to full connectivity28

Good

Metcalfes lawAll data in multi-subscriber, real-time logsThe company is a big distributed system29

Example: User views job

The PlanApache KafkaLogs and Distributed SystemsLogs and Data IntegrationLogs and Stream Processing

31

Stream Processing

Batch is dominant paradigm for data processing, why?First thing you want when you have real-time data streams is real-time transformations

32

1790Collected data by Networks=>stream processing3,929,214 people$44kHorses and wagons are a high latency, batch channel

33

Stream processing is ageneralizationof batch processing

Service: One input = one outputBatch job: All inputs = all outputsStream computing: any window = output for that window34

ExamplesMonitoringSecurityContent processingRecommendationsNewsfeedETL

Stream Processing = Logs + Jobs

Importance of the logbuffering, multisubscriberOutput goes to a live serving system or another batch processing system (Hadoop, DWH)Examples: RecommendationsEmailMonitoringSecurity36

Systems Can Help

Storm and SamzaAbout process management both integrate with KafkaMapReduce and HDFS37

Samza Architecture

Example: Top Articles By Company

Log-centric Architecture

Kafkahttp://kafka.apache.org

Samzahttp://samza.incubator.apache.org

Log Bloghttp://linkd.in/199iMwY

Mehttp://www.linkedin.com/in/jaykreps@jaykreps