Apache Kafka and the Rise of Stream Processing


1

Guozhang Wang, DataEngConf, Nov 3, 2016

Apache Kafka and the Rise of Stream Processing

2

What is Kafka, really?

3

What is Kafka, Really?

a scalable Pub-sub messaging system.. [NetDB 2011]

4

What is Kafka, Really?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

5

Example: Centralized Data Pipeline

[Diagram: Apache Kafka as a centralized pipeline: KV-store, doc-store, RDBMS, tracking, and logs / metrics flow in; Hadoop / DW, monitoring, security, rec. engine, searching, and social graph consume out]

6

What is Kafka, Really?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

7

Example: Data Store Geo-Replication

[Diagram: data store geo-replication. In Region 1, user apps write to local stores, which append their log to Apache Kafka; Kafka mirrors the log to Region 2's Kafka, where it is applied to local stores that user apps read.]

8

What is Kafka, Really?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

a unified data integration stack.. [CIDR 2015]

9

Example: Async. Micro-Services

10

Which of the following is true?

a scalable Pub-sub messaging system.. [NetDB 2011]

a real-time data pipeline.. [Hadoop Summit 2013]

a distributed and replicated log.. [VLDB 2015]

a unified data integration stack.. [CIDR 2015]

11


All of them!

12

Kafka: Streaming Platform

• Publish / Subscribe: move data around as online streams

• Store: “source-of-truth” continuous data

• Process: react to / process data in real-time
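To make the publish/subscribe layer concrete, here is a minimal sketch using the plain Kafka producer and consumer clients (not from the slides; the topic name, group id, and broker address are illustrative placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");   // assumed broker address
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// publish: append a record to a topic
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.send(new ProducerRecord<>("page-views", "user42", "/index.html"));
producer.close();

// subscribe: read the stream of records from the same topic
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "demo-group");                // assumed consumer group
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Collections.singletonList("page-views"));
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records)
    System.out.println(record.key() + " -> " + record.value());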

13

Kafka: Streaming Platform

[Diagram: the same three capabilities, with producers publishing into Kafka and consumers subscribing from it]

14

Kafka: Streaming Platform

[Diagram: producers and consumers, plus Connect moving data between Kafka and external systems]

15

16

Connectors

• 40+ connectors since the first release this Feb (0.9+)

• 13 from Confluent & partners
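Connectors are deployed as configuration rather than code; for example, the file-source connector that ships with Kafka's quickstart looks roughly like this in standalone mode (exact contents vary by version):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test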

17

Kafka: Streaming Platform

[Diagram: producers, consumers, Connect, and Streams around Kafka]

18

Kafka: Streaming Platform

[Diagram: the full platform, with Streams highlighted: today's talk]

19

Stream Processing

• A different programming paradigm

• .. that brings computation to unbounded data

• .. with tradeoffs between latency / cost / correctness

20

Online Machine Learning

[Diagram: a model consumes an input stream (parameters, etc.) and a training / feedback stream; model updates produce predictions that are emitted as an output stream (scores, categories, etc.)]

21

Async. Micro-services

[Diagram: sales, orders, and shipment services communicating asynchronously through a message queue]

22

Real-time Analytics

[Diagram: log file streams and change data capture streams from online DBs (DB1, DB2) feed an OLAP engine that serves query results]

Kafka Streams (0.10+)

• New client library besides producer and consumer

• Powerful yet easy-to-use: event-at-a-time, stateful

• Windowing with out-of-order handling

• Highly scalable, distributed, fault tolerant

• and more..

23

24

Anywhere, anytime


25

Anywhere, anytime

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.0</version>
</dependency>

26

Anywhere, anytime

[Diagram: deployment options, from “very uncool” to “very cool”: WAR file, rsync, Puppet/Chef, YARN, Mesos, Docker, Kubernetes]

27

Simple is Beautiful

Kafka Streams DSL

28

public static void main(String[] args) {
    // (assumed setup, elided on the original slide)
    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    KStreamBuilder builder = new KStreamBuilder();

    // specify the processing topology by first reading in a stream from a topic
    KStream<String, String> words = builder.stream("topic1");

    // count the words in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");

    // write the result table to a new topic
    // (production code would pass explicit serdes, e.g. counts.to(Serdes.String(), Serdes.Long(), "topic2"))
    counts.to("topic2");

    // create a stream processing instance and start running it
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}


33

“Full stack” evaluation:

• API, coding

• Operations, debugging, …

34


Simple is Beautiful

35

Kafka Streams: Key Concepts

Stream and Records

36

[Diagram: a stream is an ordered sequence of records; each record is a key-value pair]

Processor Topology

37

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

Processor Topology

38-39

[Diagram: the same code drawn as a topology: a graph of stream processors (nodes) connected by streams (edges)]


Processor Topology

45

• Source processors read records from Kafka topics into the topology: builder.stream(..)

• Sink processors write records back out to Kafka topics: aggregated.to(..)

Processor Topology

46

[Diagram: the processor topology runs inside the Kafka Streams library, consuming from and producing back to Kafka]
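The DSL compiles into such a topology; the same graph can also be wired by hand with the low-level Processor API. A minimal sketch against the 0.10-era TopologyBuilder (not shown in the talk; the processor and topic names are illustrative):

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.TopologyBuilder;

// a processor that upper-cases values and forwards them downstream
public class UpperCaseProcessor implements Processor<String, String> {
    private ProcessorContext context;

    @Override public void init(ProcessorContext context) { this.context = context; }

    @Override public void process(String key, String value) {
        context.forward(key, value.toUpperCase());
    }

    @Override public void punctuate(long timestamp) { }  // no scheduled work

    @Override public void close() { }
}

// wire source -> processor -> sink by hand
TopologyBuilder builder = new TopologyBuilder();
builder.addSource("SOURCE", "topic1")
       .addProcessor("PROCESS", UpperCaseProcessor::new, "SOURCE")
       .addSink("SINK", "topic2", "PROCESS");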

Stream Partitions and Tasks

47

[Diagram: topics A and B each have two partitions, P1 and P2; the partitions are grouped into stream tasks, and each task runs its own copy of the processor topology over its partitions]

Stream Threads

50

[Diagram: tasks run on stream threads inside the application instances MyApp.1 and MyApp.2, with task 1 and task 2 split between them; starting another instance, MyApp.3, triggers a rebalance that migrates one of the tasks onto it]
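Scaling out is just starting more instances of the same application; scaling up inside one instance is a single setting. A minimal sketch, reusing the config object from the earlier example (the thread count is an illustrative value):

// each Streams instance can run several threads, each owning a share of the tasks
config.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);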

States in Stream Processing

54

• Stateless: filter, map

• Stateful: join, aggregate

55

States in Stream Processing

56

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

[Diagram: the join and aggregation steps in each task are backed by a local state store]

States in Stream Processing

57

[Diagram: each task has its own local state store]

Interactive Queries on States

58-59

[Diagram: your Streams app publishes its results back to Kafka, and other apps (App1 … AppN) read them through a separate query engine]

Complexity: lots of moving parts

Interactive Queries on States (0.10.1+)

60

[Diagram: other apps (App1, App2 … AppN) query your Streams app's local state directly]
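A sketch of such a lookup in 0.10.1+, assuming the earlier word-count topology materialized its aggregate in a store named "Counts":

import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// query the running app's local state directly, no external store needed
ReadOnlyKeyValueStore<String, Long> store =
    streams.store("Counts", QueryableStoreTypes.<String, Long>keyValueStore());
Long aliceCount = store.get("alice");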

Stream vs. Table?

61

KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

[Diagram: the KTable "aggregated" is backed by a state store]

62

Tables ≈ Streams

63

64

65

The Stream-Table Duality

• A stream is a changelog of a table

• A table is a materialized view, at a point in time, of a stream

• Example: change data capture (CDC) of databases

66

KStream = interprets data as a record stream

~ think: “append-only”

KTable = interprets data as a changelog stream

~ think: continuously updated materialized view
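In code, the choice is made at read time; a minimal sketch (topic names are illustrative, table() as in the 0.10-era DSL):

// KStream: every record is an independent, appended fact
KStream<String, String> purchases = builder.stream("purchases");

// KTable: records with the same key are updates; only the latest value per key counts
KTable<String, String> profiles = builder.table("profiles");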

67

68

User purchase history (KStream): (alice, eggs) (bob, lettuce) (alice, milk)

User employment profile (KTable): (alice, lnkd) (bob, googl) (alice, msft)

69

Reading the records in order over time:

After the first record: “Alice bought eggs.” (KStream) vs. “Alice is now at LinkedIn.” (KTable)

70

After the last record: “Alice bought eggs and milk.” (KStream) vs. “Alice is now at Microsoft, no longer at LinkedIn.” (KTable)

71

Records over time: (alice, 2) (bob, 10) (alice, 3)

After the first record, KStream.aggregate() and KTable.aggregate() give the same answer: (key: Alice, value: 2)

72

After (alice, 3):

KStream.aggregate(): (key: Alice, value: 2+3), each record adds to the aggregate

KTable.aggregate(): (key: Alice, value: 3), the new record replaces Alice's previous value of 2

73

[Diagram: KStream and KTable each support map(), filter(), join(), …; reduce() and aggregate() turn a KStream into a KTable, and toStream() turns a KTable back into a KStream]

74

Updates Propagation in KTable

[Diagram, shown step by step: a new record arriving on stream1 is joined against stream2, the join result updates the aggregate's state store, and the change propagates downstream out of the KTable "aggregated"]

78

What about Fault Tolerance?

79

Remember?

80

Fault Tolerance

[Diagram, shown step by step: each task's local state is continuously backed up to a changelog topic in Kafka; when an instance fails, the consumer rebalance protocol reassigns its tasks to the surviving instances, which restore the state from the changelog and resume processing]

83

84

85

86

It’s all about Time

• Event-time (when an event is created)

• Processing-time (when an event is processed)

87

Event-time:       1     2     3     4     5     6     7
Processing-time:  1999  2002  2005  1977  1980  1983  2015

88

PHAN

TOM

MEN

ACE

ATTA

CK O

F TH

E CL

ON

ES

REV

ENG

E O

F TH

E SI

TH

A N

EW H

OPE

THE

EMPI

RE

STR

IKES

BAC

K

RET

UR

N O

F TH

E JE

DI

THE

FORC

E AW

AKEN

S

Out-of-Order

Timestamp Extractor

89

// processing-time
public long extract(ConsumerRecord<Object, Object> record) {
    return System.currentTimeMillis();
}

// event-time, from the record's built-in timestamp
public long extract(ConsumerRecord<Object, Object> record) {
    return record.timestamp();
}

// event-time, embedded in the message payload
public long extract(ConsumerRecord<Object, Object> record) {
    return ((JsonNode) record.value()).get("timestamp").longValue();
}

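The extractor is plugged in through configuration; a sketch, where MyEventTimeExtractor is a hypothetical class implementing the TimestampExtractor interface shown above:

// register the extractor with the Streams config
config.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
           MyEventTimeExtractor.class.getName());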

Windowing

93

[Diagram: records on a time axis t, grouped into windows]

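A sketch of a windowed count in the 0.10.0-era DSL, building on the words stream from the earlier example (the exact overloads are an assumption from that API generation; the window name and size are illustrative):

// count records per key over 1-minute windows, keyed by event-time;
// a late-arriving record still updates the window it belongs to
KTable<Windowed<String>, Long> windowedCounts =
    words.countByKey(TimeWindows.of("Counts-1min", 60 * 1000L), Serdes.String());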

100

Stream Processing Hard Parts

• Ordering

• Partitioning & scalability

• Fault tolerance

• State management

• Time, windows & out-of-order data

• Re-processing

For more details: http://docs.confluent.io/current

101


Simple is Beautiful


Ongoing Work (0.10.1+)

• Beyond Java APIs

• SQL support, Python client, etc

• End-to-End Semantics (exactly-once)

• … and more

102

Take-aways

• Apache Kafka: a centralized streaming platform

• Kafka Streams: stream processing made easy


105

THANKS!

Guozhang Wang | guozhang@confluent.io | @guozhangwang

CFP Kafka Summit 2017 @ NYC & SF

Confluent Webinar: http://www.confluent.io/resources

