keynote: stephan ewen - stream processing as a foundational paradigm and apache flink's...

Post on 16-Apr-2017

258 Views

Category:

Data & Analytics

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Stream Processing as aFoundational Paradigm and

Apache Flink's approach to itStephan Ewen, Apache Flink PMC, CTO @ data Artisans

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you already have streaming data

4

Streaming Subsumes Batch

5

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition

Streaming Subsumes Batch

6

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition

Stream (low latency)

Stream (high latency)

Streaming Subsumes Batch

7

2016-3-112:00 am

2016-3-11:00 am

2016-3-12:00 am

2016-3-1111:00pm

2016-3-1212:00am

2016-3-121:00am

2016-3-1110:00pm

2016-3-122:00am

2016-3-123:00am…

partition

partition

Stream (low latency)

Batch(bounded stream)Stream (high latency)

Stream Processing Decouples

8

Database(State)

App a

App b

App c

App a

App b

App c

Applications build their own stateState managed centralized

Time Travel

9

Process a period ofhistoric data

partition

partition

Process latest datawith low latency(tail of the log)

Reprocess stream(historic data first, catches up with realtime data)

10

But why has it started so recently?

Stream Processing is taking off.(just look at this year's talks)

11

Latency

Volume/Throughput

State &Accuracy

The combination is what makes

steaming powerful

Only recently available together

12

Latency

Volume/Throughput

State &Accuracy

Exactly-once semanticsEvent time processing

10s of millions evts/secfor stateful applications

Latency down tothe milliseconds

Apache Flink was the first open-source system to eliminate these

tradeoffs

Flink's Approach

13

Stateful Steam Processing

Fluent API, Windows, Event Time

Table API

Stream SQL

Core API

Declarative DSL

High-level Language

Building Block

Stateful Steam Processing

14

Source Filter /Transform

Stateread/write Sink

Stateful Steam Processing

15

Scalable embedded state Access at memory speed &scales with parallel operators

Stateful Steam Processing

16

Re-load state

Reset positionsin input streams

Rolling back computationRe-processing

Stateful Steam Processing

17

Restore to differentprograms

Bugfixes, Upgrades, A/B testing, etc

Versioning the state of applications

18

Savepoint

Savepoint

Savepoint

App. A

App. B

App. C

Time

Savepoint

Flink's Approach

19

Stateful Steam Processing

Fluent API, Windows, Event Time

Table API

Stream SQL

Core API

Declarative DSL

High-level Language

Building Block

Event Time / Out-of-Order

20

1977 1980 1983 1999 2002 2005 2015

Processing Time

EpisodeIV

EpisodeV

EpisodeVI

EpisodeI

EpisodeII

EpisodeIII

EpisodeVII

Event Time

(Stream) SQL & Table API

21

Table API

// convert stream into Tableval sensorTable: Table = sensorData .toTable(tableEnv, 'location, 'time, 'tempF)

// define query on Tableval avgTempCTable: Table = sensorTable .groupBy('location) .window(Tumble over 1.days on 'rowtime as 'w) .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC) .where('location like "room%")

SQL

sensorTable.sql(""" SELECT day, location, avg((tempF - 32) * 0.556) AS avgTempC

FROM sensorData WHERE location LIKE 'room%'GROUP BY day, location

""")

What can you do with that?

22

10 billion events (2TB) processed daily across multiple Flink jobs for the telco network control center.

Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees

Jobs with > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second

Flink's Streams playing at Batch

23

TeraSort

Relational Join

Classic Batch Jobs

GraphProcessing

LinearAlgebra

24

Streaming Technology is already awesome,but what are the next steps?

A.k.a, what can we expect in the "next gen" ?

A lot of things are "next gen" when lookingat the program, so here is my take on it…

"Next Gen"

25

Queryable State

"Next Gen"

26

Elastic ParallelismMaintaining exactly-once

state consistencyNo extra effort for the userNo need to carefully planpartitions

"Next Gen"

27

Terabytes of state inside thestream processor

Maintaining fast checkpoints and recoveryE.g., long histories of windows, large join tablesState at local memory speed

"Next Gen"

28

Full SQL on Streams

Continuous queries, incremental resultsWindows, event time, processing timeConsistent with SQL on bounded data

29

Thank you!

30

Appendix

31

We are hiring!

data-artisans.com/careers

top related