apache kafka, and the rise of stream processing

105
1 Guozhang Wang DataEngConf, Nov 3, 2016 Apache Kafka and the Rise of Stream Processing

Upload: guozhang-wang

Post on 16-Apr-2017

428 views

Category:

Engineering


0 download

TRANSCRIPT

Page 1: Apache Kafka, and the Rise of Stream Processing

1

Guozhang Wang DataEngConf, Nov 3, 2016

Apache Kafka and the Rise of Stream Processing

Page 2: Apache Kafka, and the Rise of Stream Processing

2

What is Kafka, really?

Page 3: Apache Kafka, and the Rise of Stream Processing

3

What is Kafka, Really?

[NetDB 2011]a scalable Pub-sub messaging system..

Page 4: Apache Kafka, and the Rise of Stream Processing

4

What is Kafka, Really?

[NetDB 2011]

[Hadoop Summit 2013]

a scalable Pub-sub messaging system..

a real-time data pipeline..

Page 5: Apache Kafka, and the Rise of Stream Processing

5

Example: Centralized Data Pipeline

KV-Store Doc-Store RDBMSTracking Logs / Metrics

Hadoop / DW Monitoring Rec. Engine Social GraphSearchingSecurity

Apache Kafka

Page 6: Apache Kafka, and the Rise of Stream Processing

6

What is Kafka, Really?

[NetDB 2011]

[Hadoop Summit 2013]

[VLDB 2015]

a scalable Pub-sub messaging system..

a real-time data pipeline..

a distributed and replicated log..

Page 7: Apache Kafka, and the Rise of Stream Processing

7

Example: Data Store Geo-Replication

Apache Kafka

Local Stores

User Apps User Apps

Local Stores

Apache Kafka

Region 2Region 1

write read

append log

mirroring

apply log

Page 8: Apache Kafka, and the Rise of Stream Processing

8

What is Kafka, Really?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

a unified data integration stack.. [CIDR 2015]

Page 9: Apache Kafka, and the Rise of Stream Processing

9

Example: Async. Micro-Services

Page 10: Apache Kafka, and the Rise of Stream Processing

10

Which of the following is true?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

a unified data integration stack.. [CIDR 2015]

Page 11: Apache Kafka, and the Rise of Stream Processing

11

Which of the following is true?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

a unified data integration stack.. [CIDR 2015]

All of them!

Page 12: Apache Kafka, and the Rise of Stream Processing

12

Kafka: Streaming Platform

• Publish / Subscribe• Move data around as online streams

• Store• “Source-of-truth” continuous data

• Process• React / process data in real-time

Page 13: Apache Kafka, and the Rise of Stream Processing

13

Kafka: Streaming Platform

• Publish / Subscribe• Move data around as online streams

• Store• “Source-of-truth” continuous data

• Process• React / process data in real-time

Producer

Producer

Consumer

Consumer

Page 14: Apache Kafka, and the Rise of Stream Processing

14

Kafka: Streaming Platform

• Publish / Subscribe• Move data around as online streams

• Store• “Source-of-truth” continuous data

• Process• React / process data in real-time

Producer

Producer

Consumer

Consumer Connect

Connect

Page 15: Apache Kafka, and the Rise of Stream Processing

15

Page 16: Apache Kafka, and the Rise of Stream Processing

16

Connectors

• 40+ since first release this Feb (0.9+)

• 13 from & partners

Page 17: Apache Kafka, and the Rise of Stream Processing

17

Kafka: Streaming Platform

• Publish / Subscribe• Move data around as online streams

• Store• “Source-of-truth” continuous data

• Process• React / process data in real-time

StreamsProducer

Producer

Consumer

Consumer Connect

Connect

Streams

Page 18: Apache Kafka, and the Rise of Stream Processing

18

Kafka: Streaming Platform

• Publish / Subscribe• Move data around as online streams

• Store• “Source-of-truth” continuous data

• Process• React / process data in real-time

StreamsProducer

Producer

Consumer

Consumer Connect

Connect

Streams

Today’s Talk

Page 19: Apache Kafka, and the Rise of Stream Processing

19

Stream Processing

• A different programming paradigm

• .. that brings computation to unbounded data

• .. with tradeoffs between latency / cost / correctness

Page 20: Apache Kafka, and the Rise of Stream Processing

20

Model Update

Model

Prediction

Training / Feedback Stream

Input Stream (parameters, etc)

Output Stream (scores, categories, etc)

Online Machine Learning

Page 21: Apache Kafka, and the Rise of Stream Processing

21

SalesMessage Queue

Shipment

Orders

Async. Micro-services

Page 22: Apache Kafka, and the Rise of Stream Processing

22

OLAP Engine

Real-time Analytics

Online DB2

Log File Stream / Change Data Capture Stream

Online DB1 Query Results

Page 23: Apache Kafka, and the Rise of Stream Processing

Kafka Streams (0.10+)

• New client library besides producer and consumer

• Powerful yet easy-to-use• Event-at-a-time, Stateful

• Windowing with out-of-order handling

• Highly scalable, distributed, fault tolerant

• and more..23

Page 24: Apache Kafka, and the Rise of Stream Processing

24

Anywhere, anytime

Ok. Ok. Ok. Ok.

Page 25: Apache Kafka, and the Rise of Stream Processing

25

Anywhere, anytime

<dependency>

<groupId>org.apache.kafka</groupId> <artifactId>kafka-streams</artifactId> <version>0.10.0.0</version>

</dependency>

Page 26: Apache Kafka, and the Rise of Stream Processing

26

Anywhere, anytime

War File

Rsync

Puppet/

Chef

YARN

Mesos

Docker

Kuberne

tes

Very Uncool Very Cool

Page 27: Apache Kafka, and the Rise of Stream Processing

27

Simple is Beautiful

Page 28: Apache Kafka, and the Rise of Stream Processing

Kafka Streams DSL

28

public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream(”topic1”);

// count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey(”Counts”);

// write the result table to a new topic counts.to(”topic2”);

// create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }

Page 29: Apache Kafka, and the Rise of Stream Processing

Kafka Streams DSL

29

public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream(”topic1”);

// count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey(”Counts”);

// write the result table to a new topic counts.to(”topic2”);

// create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }

Page 30: Apache Kafka, and the Rise of Stream Processing

Kafka Streams DSL

30

public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream(”topic1”);

// count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey(”Counts”);

// write the result table to a new topic counts.to(”topic2”);

// create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }

Page 31: Apache Kafka, and the Rise of Stream Processing

Kafka Streams DSL

31

public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream(”topic1”);

// count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey(”Counts”);

// write the result table to a new topic counts.to(”topic2”);

// create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }

Page 32: Apache Kafka, and the Rise of Stream Processing

Kafka Streams DSL

32

public static void main(String[] args) { // specify the processing topology by first reading in a stream from a topic KStream<String, String> words = builder.stream(”topic1”);

// count the words in this stream as an aggregated table KTable<String, Long> counts = words.countByKey(”Counts”);

// write the result table to a new topic counts.to(”topic2”);

// create a stream processing instance and start running it KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }

Page 33: Apache Kafka, and the Rise of Stream Processing

33

API, coding

“Full stack” evaluation

Operations, debugging, …

Page 34: Apache Kafka, and the Rise of Stream Processing

34

API, coding

“Full stack” evaluation

Operations, debugging, …

Simple is Beautiful

Page 35: Apache Kafka, and the Rise of Stream Processing

35

Kafka Streams: Key Concepts

Page 36: Apache Kafka, and the Rise of Stream Processing

Stream and Records

36

Key Value Key Value Key Value Key Value

Stream

Record

Page 37: Apache Kafka, and the Rise of Stream Processing

Processor Topology

37

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 38: Apache Kafka, and the Rise of Stream Processing

Processor Topology

38

Stream

Page 39: Apache Kafka, and the Rise of Stream Processing

Processor Topology

39

StreamProcessor

Page 40: Apache Kafka, and the Rise of Stream Processing

Processor Topology

40

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 41: Apache Kafka, and the Rise of Stream Processing

Processor Topology

41

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 42: Apache Kafka, and the Rise of Stream Processing

Processor Topology

42

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 43: Apache Kafka, and the Rise of Stream Processing

Processor Topology

43

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 44: Apache Kafka, and the Rise of Stream Processing

Processor Topology

44

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic3”);

Page 45: Apache Kafka, and the Rise of Stream Processing

Processor Topology

45

Source Processor

Sink Processor

KStream<..> stream1 = builder.stream(

KStream<..> stream2 = builder.stream(

aggregated.to(

Page 46: Apache Kafka, and the Rise of Stream Processing

Processor Topology

46Kafka Streams Kafka

Page 47: Apache Kafka, and the Rise of Stream Processing

Stream Partitions and Tasks

47

Kafka Topic B Kafka Topic A

P1

P2

P1

P2

Page 48: Apache Kafka, and the Rise of Stream Processing

Stream Partitions and Tasks

48

Kafka Topic B Kafka Topic A

Processor TopologyP1

P2

P1

P2

Page 49: Apache Kafka, and the Rise of Stream Processing

Stream Partitions and Tasks

49

Kafka Topic AKafka Topic B

Page 50: Apache Kafka, and the Rise of Stream Processing

Kafka Topic B

Stream Threads

50

Kafka Topic A

MyApp.1 MyApp.2Task2Task1

Page 51: Apache Kafka, and the Rise of Stream Processing

Stream Threads

51

Kafka Topic AKafka Topic B

Task2Task1MyApp.1 MyApp.2

Page 52: Apache Kafka, and the Rise of Stream Processing

Stream Threads

52

Task3MyApp.3

Kafka Topic AKafka Topic B

Task2Task1MyApp.1 MyApp.2

Page 53: Apache Kafka, and the Rise of Stream Processing

Stream Threads

53

Task3

Kafka Topic AKafka Topic B

Task2Task1MyApp.1 MyApp.2 MyApp.3

Page 54: Apache Kafka, and the Rise of Stream Processing

States in Stream Processing

54

• filter

• map

• join

• aggregate

Stateless

Stateful

Page 55: Apache Kafka, and the Rise of Stream Processing

55

Page 56: Apache Kafka, and the Rise of Stream Processing

States in Stream Processing

56

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic2”);

State

Page 57: Apache Kafka, and the Rise of Stream Processing

Kafka Topic B

Task2Task1

States in Stream Processing

57

Kafka Topic A

State State

Page 58: Apache Kafka, and the Rise of Stream Processing

Interactive Queries on States

58Your App (Streams) Kafka

Other AppN

Other App1…

Query Engine

Page 59: Apache Kafka, and the Rise of Stream Processing

Interactive Queries on States

59Your App (Streams) Kafka

Other AppN

Other App1…

Query Engine

Complexity: lots of moving parts

Page 60: Apache Kafka, and the Rise of Stream Processing

Interactive Queries on States (0.10.1+)

60Your App (Streams) Kafka

State

Other AppN

Other App2

Other App1

Page 61: Apache Kafka, and the Rise of Stream Processing

Stream v.s. Table?

61

KStream<..> stream1 = builder.stream(”topic1”);

KStream<..> stream2 = builder.stream(”topic2”);

KStream<..> joined = stream1.leftJoin(stream2, ...);

KTable<..> aggregated = joined.aggregateByKey(...);

aggregated.to(”topic2”);

State

Page 62: Apache Kafka, and the Rise of Stream Processing

62

Tables ≈ Streams

Page 63: Apache Kafka, and the Rise of Stream Processing

63

Page 64: Apache Kafka, and the Rise of Stream Processing

64

Page 65: Apache Kafka, and the Rise of Stream Processing

65

Page 66: Apache Kafka, and the Rise of Stream Processing

The Stream-Table Duality

• A stream is a changelog of a table

• A table is a materialized view at time of a stream

• Example: change data capture (CDC) of databases

66

Page 67: Apache Kafka, and the Rise of Stream Processing

KStream = interprets data as record stream

~ think: “append-only”

KTable = data as changelog stream

~ continuously updated materialized view

67

Page 68: Apache Kafka, and the Rise of Stream Processing

68

alice eggs bob lettuce alice milk

alice lnkd bob googl alice msft

KStream

KTable

User purchase history

User employment profile

Page 69: Apache Kafka, and the Rise of Stream Processing

69

alice eggs bob lettuce alice milk

alice lnkd bob googl alice msft

KStream

KTable

User purchase history

User employment profile

time

“Alice bought eggs.”

“Alice is now at LinkedIn.”

Page 70: Apache Kafka, and the Rise of Stream Processing

70

alice eggs bob lettuce alice milk

alice lnkd bob googl alice msft

KStream

KTable

User purchase history

User employment profile

time

“Alice bought eggs and milk.”

“Alice is now at LinkedIn Microsoft.”

Page 71: Apache Kafka, and the Rise of Stream Processing

71

alice 2 bob 10 alice 3

timeKStream.aggregate()

KTable.aggregate()

(key: Alice, value: 2)

(key: Alice, value: 2)

Page 72: Apache Kafka, and the Rise of Stream Processing

72

alice 2 bob 10 alice 3

time

(key: Alice, value: 2 3)

(key: Alice, value: 2+3)

KStream.aggregate()

KTable.aggregate()

Page 73: Apache Kafka, and the Rise of Stream Processing

73

KStream KTable

reduce() aggregate() …

toStream()

map() filter() join() …

map() filter() join() …

Page 74: Apache Kafka, and the Rise of Stream Processing

74

KTable aggregated

KStream joined

KStream stream1KStream stream2

Updates Propagation in KTable

State

Page 75: Apache Kafka, and the Rise of Stream Processing

75

KTable aggregated

KStream joined

KStream stream1KStream stream2

State

Updates Propagation in KTable

Page 76: Apache Kafka, and the Rise of Stream Processing

76

KTable aggregated

KStream joined

KStream stream1KStream stream2

State

Updates Propagation in KTable

Page 77: Apache Kafka, and the Rise of Stream Processing

77

KTable aggregated

KStream joined

KStream stream1KStream stream2

State

Updates Propagation in KTable

Page 78: Apache Kafka, and the Rise of Stream Processing

78

What about Fault Tolerance?

Page 79: Apache Kafka, and the Rise of Stream Processing

79

Remember?

Page 80: Apache Kafka, and the Rise of Stream Processing

80

StateProcess

StateProcess

StateProcess

Kafka ChangelogFault ToleranceKafka

Kafka Streams

Kafka

Page 81: Apache Kafka, and the Rise of Stream Processing

81

StateProcess

StateProcess Protoco

l

StateProcess

Fault ToleranceKafka

Kafka Streams

Kafka Changelog

Kafka

Page 82: Apache Kafka, and the Rise of Stream Processing

82

StateProcess

StateProcess Protoco

l

StateProcess

Fault Tolerance

StateProcess

KafkaKafka Streams

Kafka Changelog

Kafka

Page 83: Apache Kafka, and the Rise of Stream Processing

83

Page 84: Apache Kafka, and the Rise of Stream Processing

84

Page 85: Apache Kafka, and the Rise of Stream Processing

85

Page 86: Apache Kafka, and the Rise of Stream Processing

86

Page 87: Apache Kafka, and the Rise of Stream Processing

It’s all about Time

• Event-time (when an event is created)

• Processing-time (when an event is processed)

87

Page 88: Apache Kafka, and the Rise of Stream Processing

Event-time 1 2 3 4 5 6 7Processing-time 1999 2002 2005 1977 1980 1983 2015

88

PHAN

TOM

MEN

ACE

ATTA

CK O

F TH

E CL

ON

ES

REV

ENG

E O

F TH

E SI

TH

A N

EW H

OPE

THE

EMPI

RE

STR

IKES

BAC

K

RET

UR

N O

F TH

E JE

DI

THE

FORC

E AW

AKEN

S

Out-of-Order

Page 89: Apache Kafka, and the Rise of Stream Processing

Timestamp Extractor

89

public long extract(ConsumerRecord<Object, Object> record) {

return System.currentTimeMillis();

}

public long extract(ConsumerRecord<Object, Object> record) {

return record.timestamp();

}

Page 90: Apache Kafka, and the Rise of Stream Processing

Timestamp Extractor

90

public long extract(ConsumerRecord<Object, Object> record) {

return System.currentTimeMillis();

}

public long extract(ConsumerRecord<Object, Object> record) {

return record.timestamp();

}

processing-time

Page 91: Apache Kafka, and the Rise of Stream Processing

Timestamp Extractor

91

public long extract(ConsumerRecord<Object, Object> record) {

return System.currentTimeMillis();

}

public long extract(ConsumerRecord<Object, Object> record) {

return record.timestamp();

}

processing-time

event-time

Page 92: Apache Kafka, and the Rise of Stream Processing

Timestamp Extractor

92

public long extract(ConsumerRecord<Object, Object> record) {

return System.currentTimeMillis();

} processing-time

event-time

public long extract(ConsumerRecord<Object, Object> record) {

return ((JsonNode) record.value()).get(”timestamp”).longValue();

}

Page 93: Apache Kafka, and the Rise of Stream Processing

Windowing

93

t…

Page 94: Apache Kafka, and the Rise of Stream Processing

Windowing

94

t…

Page 95: Apache Kafka, and the Rise of Stream Processing

Windowing

95

t…

Page 96: Apache Kafka, and the Rise of Stream Processing

Windowing

96

t…

Page 97: Apache Kafka, and the Rise of Stream Processing

Windowing

97

t…

Page 98: Apache Kafka, and the Rise of Stream Processing

Windowing

98

t…

Page 99: Apache Kafka, and the Rise of Stream Processing

Windowing

99

t…

Page 100: Apache Kafka, and the Rise of Stream Processing

100

• Ordering

• Partitioning &

Scalability

• Fault tolerance

Stream Processing Hard Parts

• State Management

• Time, Window &

Out-of-order Data

• Re-processing

For more details: http://docs.confluent.io/current

Page 101: Apache Kafka, and the Rise of Stream Processing

101

• Ordering

• Partitioning &

Scalability

• Fault tolerance

Stream Processing Hard Parts

• State Management

• Time, Window &

Out-of-order Data

• Re-processing

Simple is Beautiful

For more details: http://docs.confluent.io/current

Page 102: Apache Kafka, and the Rise of Stream Processing

Ongoing Work (0.10.1+)

• Beyond Java APIs

• SQL support, Python client, etc

• End-to-End Semantics (exactly-once)

• … and more

102

Page 103: Apache Kafka, and the Rise of Stream Processing

Take-aways

• Apache Kafka: a centralized streaming platform

• Kafka Streams: stream processing made easy

103

Page 104: Apache Kafka, and the Rise of Stream Processing

Take-aways

• Apache Kafka: a centralized streaming platform

104

Page 105: Apache Kafka, and the Rise of Stream Processing

Take-aways

105

THANKS!

Guozhang Wang | [email protected] | @guozhangwang

CFP Kafka Summit 2017 @ NYC & SFConfluent Webinar: http://www.confluent.io/resources

• Apache Kafka: a centralized streaming platform

• Kafka Streams: stream processing made easy