Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It


Stream Processing and Apache Flink®'s approach to it (@StephanEwen)

Apache Flink PMC, CTO @ data Artisans

About me
Database systems: TU Berlin, IBM, Microsoft
Co-bootstrapped the Stratosphere project's runtime
Apache Flink created from a (partial) Stratosphere fork
Apache Flink community members founded data Artisans
Now Flink PMC member and CTO at data Artisans

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you already have streaming data


Streaming Subsumes Batch

[Figure: an event log split into partitions, with records carrying hourly timestamps (2016-3-1 12:00 am through 2016-3-12 3:00 am …)]

Streaming Subsumes Batch

[Figure: the same partitioned log, consumed as a stream with low latency (reading at the tail) and as a stream with high latency (reading further behind)]

Streaming Subsumes Batch

[Figure: the same partitioned log; a low-latency stream reads at the tail, a high-latency stream reads further behind, and a batch job is simply a bounded stream over a finite section of the log]
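To make the point concrete, here is a minimal sketch (not from the talk; the object name, file path, and socket source are illustrative): the same Flink DataStream pipeline counts words over a bounded stream read from a file and over an unbounded stream read from a socket. In practice the unbounded source would typically be a Kafka topic.

import org.apache.flink.streaming.api.scala._

object BoundedVsUnbounded {

  // One pipeline definition, reused for both the bounded and the unbounded case.
  def countWords(lines: DataStream[String]): DataStream[(String, Int)] =
    lines.flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .sum(1)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // "Batch": a bounded stream that ends once the file has been read.
    countWords(env.readTextFile("/tmp/historic-events.txt")).print()

    // "Streaming": an unbounded stream that keeps emitting updated counts.
    countWords(env.socketTextStream("localhost", 9999)).print()

    env.execute("bounded vs unbounded")
  }
}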

Stream Processing Decouples

[Figure: left, applications a, b, and c all sharing a central database; right, applications a, b, and c each keeping their own state]

State managed centrally in a database vs. applications that build their own state

Time Travel

[Figure: a partitioned log read from different offsets]

Process a period of historic data
Process the latest data with low latency (the tail of the log)
Reprocess the stream (historic data first, catching up with real-time data)

[Figure: the classic trade-off triangle between latency, volume/throughput, and state & accuracy]

State & accuracy: exactly-once semantics, event-time processing
Volume/throughput: tens of millions of events per second for stateful applications
Latency: down to the milliseconds

Apache Flink was the first open-source system to eliminate these tradeoffs.

Streaming Architecture Blueprint

[Figure: pipeline of stages: collect, log, analyze, serve & store]

Flink's Approach

[Figure: Flink's API stack]

Building block: Stateful Stream Processing
Core API: fluent API, windows, event time
Declarative DSL: Table API
High-level language: Stream SQL

Stateful Stream Processing

[Figure: a pipeline of Source → Filter/Transform operators (with state read/write) → Sink]

Stateful Stream Processing

Scalable embedded state: accessed at memory speed and scaling with the parallel operators
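As a rough illustration of embedded keyed state (the class and field names below are assumptions, not from the slides), this sketch keeps one running count per key inside a Flink-managed ValueState; the state lives in the operator itself and is partitioned across the parallel instances together with the keys.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Counts events per key using keyed state embedded in the operator.
class CountPerKey extends RichFlatMapFunction[(String, Double), (String, Long)] {

  @transient private var count: ValueState[Long] = _

  override def open(parameters: Configuration): Unit = {
    // State is scoped to the current key and kept by the state backend.
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[Long]("count", classOf[Long], 0L))
  }

  override def flatMap(in: (String, Double), out: Collector[(String, Long)]): Unit = {
    val next = count.value() + 1   // read at local/memory speed
    count.update(next)             // write goes to the embedded state backend
    out.collect((in._1, next))
  }
}

// usage (sketch): sensorReadings.keyBy(_._1).flatMap(new CountPerKey)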

Stateful Stream Processing

Rolling back computation / re-processing:
Re-load the state
Reset the positions in the input streams
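Rollback and re-processing rest on Flink's checkpoints; a minimal sketch of turning them on (the interval is an arbitrary example value):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Snapshot all operator state every 10 seconds. On failure, Flink re-loads
// the latest snapshot and resets the source positions recorded in it
// (e.g., Kafka offsets), then re-processes from there with exactly-once state.
env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)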

Stateful Stream Processing

Restore state to different programs
Versioning the state of applications: bugfixes, upgrades, A/B testing, etc.

[Figure: savepoints taken over time; applications A, B, and C are each started from a different savepoint]

Flink's Approach

[Figure: Flink's API stack again: Stateful Stream Processing (building block), Core API (fluent API, windows, event time), Table API (declarative DSL), Stream SQL (high-level language)]

Event Time / Out-of-Order

[Figure: the Star Wars saga: in processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the order is Episodes IV, V, VI, I, II, III, VII; in event time it is Episodes I through VII]
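A minimal event-time sketch in the DataStream API (the SensorReading type, field names, socket source, and the 10-second out-of-orderness bound are assumptions): timestamps are taken from the events themselves, and watermarks tell Flink how far event time has progressed, so windows close correctly even when events arrive out of order.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class SensorReading(location: String, occurredAt: Long, tempF: Double)

object EventTimeExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Use the timestamps carried inside the events, not the wall clock.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val readings: DataStream[SensorReading] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val Array(loc, ts, temp) = line.split(",")
        SensorReading(loc, ts.toLong, temp.toDouble)
      }
      // Extract event-time timestamps and emit watermarks that tolerate
      // events arriving up to 10 seconds out of order.
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(10)) {
          override def extractTimestamp(r: SensorReading): Long = r.occurredAt
        })

    // Windows are evaluated in event time, so a late reading still falls
    // into the hour in which it actually happened.
    readings
      .keyBy(_.location)
      .timeWindow(Time.hours(1))
      .max("tempF")
      .print()

    env.execute("event-time example")
  }
}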

(Stream) SQL & Table API


Table API

// convert stream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'time, 'tempF)

// define query on Table
val avgTempCTable: Table = sensorTable
  .groupBy('location)
  .window(Tumble over 1.days on 'rowtime as 'w)
  .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")

SQL

sensorTable.sql("""
  SELECT day, location, AVG((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY day, location
""")

What can you do with that?


10 billion events (2 TB) processed daily across multiple Flink jobs for a telco's network control center.

Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees

Jobs with > 20 operators, running on > 5,000 vCores in a 1,000-node cluster, processing millions of events per second

Flink's Streams playing at Batch

[Figure: classic batch jobs running as Flink programs: TeraSort, Relational Join, Graph Processing, Linear Algebra]

What can we expect next?

Queryable State


Streaming Architecture Blueprint

[Figure: updated blueprint: collect, log, analyze & serve, store; other services connect to the analyze & serve stage]

Full SQL on Streams

Continuous queries, incremental results
Windows, event time, processing time
Consistent with SQL on bounded data
https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU
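As a hedged sketch of where this is heading (using the group-window syntax that later Flink releases added to SQL; the table and column names follow the earlier example and are assumptions):

// A continuous, event-time windowed query: one average per room and per day,
// emitted incrementally as the stream progresses.
val avgTempC: Table = tableEnv.sql("""
  SELECT
    TUMBLE_START(rowtime, INTERVAL '1' DAY) AS day,
    location,
    AVG((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY TUMBLE(rowtime, INTERVAL '1' DAY), location
""")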

Elastic Parallelism

Maintaining exactly-once state consistency
No extra effort for the user, no need to carefully plan partitions
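Under the hood, keyed state is organized into key groups; a brief sketch of the two knobs involved (the numbers are arbitrary example values):

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// maxParallelism fixes the number of key groups the state is hashed into,
// so the job can later be rescaled (via a savepoint) to any parallelism up
// to that bound without the user re-partitioning state by hand.
env.setMaxParallelism(128)
env.setParallelism(4)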

Very large state

Terabytes of state inside the stream processor, e.g., long histories of windows, large join tables
Maintaining fast checkpoints and recovery
State at local memory speed
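A minimal sketch of keeping such state off the JVM heap with the RocksDB state backend (the checkpoint URI is an assumption):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// State is kept in an embedded RocksDB instance on local disk, so it can grow
// to many gigabytes or terabytes per node; periodic checkpoints are written
// to the durable path given here.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))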


We are hiring!

data-artisans.com/careers
