adventures in timespace - how apache flink handles time and windows

24
Notions of Time Aljoscha Krettek [email protected] @aljoscha How Apache Flink™ Handles Time and Windows

Upload: aljoscha-krettek

Post on 21-Apr-2017

834 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Adventures in Timespace - How Apache Flink Handles Time and Windows

Notions of Time

Aljoscha [email protected]@aljoscha

How Apache Flink™ Handles Time and Windows

Page 2: Adventures in Timespace - How Apache Flink Handles Time and Windows

Adventures in Timespace

Page 3: Adventures in Timespace - How Apache Flink Handles Time and Windows

3

Why Windows*?

*not Microsoft Windows…

Page 4: Adventures in Timespace - How Apache Flink Handles Time and Windows

4

That’s why…

Page 5: Adventures in Timespace - How Apache Flink Handles Time and Windows

5

StreamingBatch

Page 6: Adventures in Timespace - How Apache Flink Handles Time and Windows

6

In Streaming:Arriving data never stops!

Page 7: Adventures in Timespace - How Apache Flink Handles Time and Windows

7

Solution:Put elements into buckets,these are called windows

Page 8: Adventures in Timespace - How Apache Flink Handles Time and Windows

8

Window (5 min)Count

#Hashtags

Just saw #Trump on #CNN, super cool. :D

Trump: 2394Cheese: 12984

Money: 42

Page 9: Adventures in Timespace - How Apache Flink Handles Time and Windows

9

What I didn’t mention• tweets have a

timestamp, their event time

• tweets from across the globe arrive with delay

=> tweets with different timestamps arrive out-of-order

Page 10: Adventures in Timespace - How Apache Flink Handles Time and Windows

Window (5 min)Count

#Hashtags

12:34 (13.10.2015):Just saw #Trump on #CNN, super cool. :D

Trump: 2394Cheese: 12984

Money: 42

These arrive with 3 minutes slack

Form windows based on processing time of the machine.

Processing Time != Event Time

10

Page 11: Adventures in Timespace - How Apache Flink Handles Time and Windows

11

Why do people use this?• easy to implement• low latency• this is what systems

give you (Spark Streaming, Apex, Samza, Storm)*

*not Google Cloud Dataflow

Page 12: Adventures in Timespace - How Apache Flink Handles Time and Windows

12

Lets look at a morecomplex example.

Page 13: Adventures in Timespace - How Apache Flink Handles Time and Windows

13

Window (5 min)Correlate Tweets

and News

something...

These still have 3 min slack.

These have 8 min slack.

12:33 (13.10.2015):Donald Trump speaks at Cheese conference.Processing Time != Event

Time

Page 14: Adventures in Timespace - How Apache Flink Handles Time and Windows

Processing Time != Event Time

=> Mismatch in thetimespace continuum

Page 15: Adventures in Timespace - How Apache Flink Handles Time and Windows

15

Use cases• out-of-order elements• sources with delay• recovery/fault-tolerance• “catching up” with a

stream

Who does it?• Google Cloud

Dataflow• Apache Flink

Page 16: Adventures in Timespace - How Apache Flink Handles Time and Windows

16

How can we do this?

Page 17: Adventures in Timespace - How Apache Flink Handles Time and Windows

17

We need aGlobal Clockthat runs onevent time instead of processing time.

Page 18: Adventures in Timespace - How Apache Flink Handles Time and Windows

18

This is a source

This is our window operator

1

0

0

0 0

1

2

1

2

1

1

This is the current event-time time

2

2

2

2

2

This is a watermark.

Page 19: Adventures in Timespace - How Apache Flink Handles Time and Windows

19

Now, show me the API!

Page 20: Adventures in Timespace - How Apache Flink Handles Time and Windows

20

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(ProcessingTime);

DataStream<Tweet> text = env.addSource(new TwitterSrc());

DataStream<Tuple2<String, Integer>> counts = text .flatMap(new ExtractHashtags()) .keyBy(“name”) .timeWindow(Time.of(5, MINUTES) .apply(new HashtagCounter());

Processing Time

Page 21: Adventures in Timespace - How Apache Flink Handles Time and Windows

21

Event TimeStreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(EventTime);

DataStream<Tweet> text = env.addSource(new TwitterSrc());text = text.assignTimestamps(new MyTimestampExtractor());

DataStream<Tuple2<String, Integer>> counts = text .flatMap(new ExtractHashtags()) .keyBy(“name”) .timeWindow(Time.of(5, MINUTES) .apply(new HashtagCounter());

Page 22: Adventures in Timespace - How Apache Flink Handles Time and Windows

22

TL;DL*• stream data is infinite• windows are helpful• event-time != processing

time• watermarks to the rescue• Flink can do it

*too long, didn’t listen

Page 23: Adventures in Timespace - How Apache Flink Handles Time and Windows

flink.apache.org@ApacheFlink

Page 24: Adventures in Timespace - How Apache Flink Handles Time and Windows

32-35

24-27

20-23

8-110-3

4-7

24Tumbling Windows of 4 Seconds

123412

4

59

9 0

20

20

22212326323321

26

353642