keynote: stephan ewen - stream processing as a foundational paradigm and apache flink's...

Stream Processing as a Foundational Paradigm and Apache Flink's approach to it

Stephan Ewen, Apache Flink PMC, CTO @ data Artisans

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you already have streaming data

Streaming Subsumes Batch

[Figure: a partitioned log of hourly events: 2016-3-1 12:00 am, 2016-3-1 1:00 am, 2016-3-1 2:00 am, … 2016-3-11 10:00 pm, 2016-3-11 11:00 pm, 2016-3-12 12:00 am, 2016-3-12 1:00 am, 2016-3-12 2:00 am, 2016-3-12 3:00 am, …]

Stream (low latency)

Stream (high latency)

Batch (bounded stream)
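A minimal sketch of the "batch is a bounded stream" idea, in plain Scala collections rather than any Flink API. The object and method names are hypothetical. The same per-event update function is used for the streaming view (a result after every event) and the batch view (one result at the end of a bounded input).

```scala
// Hypothetical sketch: one incremental operator serves both streaming and batch.
object BoundedStreamSketch {
  // Running count per key, updated one event at a time.
  def update(state: Map[String, Long], key: String): Map[String, Long] =
    state.updated(key, state.getOrElse(key, 0L) + 1L)

  // Streaming view: emit the state after every event (input may be unbounded).
  def streaming(events: Iterator[String]): Iterator[Map[String, Long]] =
    events.scanLeft(Map.empty[String, Long])(update).drop(1)

  // Batch view: the same operator run to the end of a bounded input.
  def batch(events: Seq[String]): Map[String, Long] =
    events.foldLeft(Map.empty[String, Long])(update)
}
```

On a bounded input, the last result of the streaming view equals the batch result.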

Stream Processing Decouples

Database (State)

Applications build their own state

State managed centralized

Time Travel

Process a period of historic data

partition

Process latest data with low latency (tail of the log)

Reprocess stream (historic data first, catches up with realtime data)
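The three access patterns above can be sketched against an in-memory append-only log. This is an illustration only, assuming a single partition held in a `Vector`. The `Event` type, the sample data, and the method names are hypothetical.

```scala
// Hypothetical sketch: reading a partitioned log at different positions.
object LogReplaySketch {
  final case class Event(offset: Long, timestamp: Long, value: Int)

  // An append-only partition; offsets are positions in the log (made-up data).
  val log: Vector[Event] = Vector(
    Event(0, 100, 5), Event(1, 200, 7), Event(2, 300, 1), Event(3, 400, 9))

  // Process a period of historic data: a bounded slice selected by timestamp.
  def historic(from: Long, until: Long): Vector[Event] =
    log.filter(e => e.timestamp >= from && e.timestamp < until)

  // Tail or reprocess: read from a chosen offset up to the current end of the log.
  // Offset 0 reprocesses everything; the latest offset reads only the tail.
  def readFrom(offset: Long): Vector[Event] =
    log.dropWhile(_.offset < offset)
}
```

Reprocessing and low-latency tailing differ only in the starting offset, which is why one streaming program can cover both.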

But why has it started so recently?

Stream Processing is taking off. (just look at this year's talks)

Latency

Volume/Throughput

State & Accuracy

The combination is what makes streaming powerful

Only recently available together

Latency: down to the milliseconds

Volume/Throughput: 10s of millions evts/sec for stateful applications

State & Accuracy: exactly-once semantics, event time processing

Apache Flink was the first open-source system to eliminate these tradeoffs

Flink's Approach

Stream SQL (High-level Language)

Table API (Declarative DSL)

Fluent API, Windows, Event Time (Core API)

Stateful Stream Processing (Building Block)

Source -> Filter / Transform -> State read/write -> Sink

Scalable embedded state

Access at memory speed & scales with parallel operators

Re-load state

Reset positions in input streams

Rolling back computation

Re-processing

Restore to different programs

Bugfixes, Upgrades, A/B testing, etc.

Versioning the state of applications

Savepoint

App. A

App. B

App. C

Savepoint
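A minimal sketch of the rollback idea (re-load state, reset positions in the input stream), in plain Scala and not Flink's actual checkpoint mechanism. The `Snapshot` type and `run` function are hypothetical. A snapshot pairs the operator state with the input position it corresponds to, so resuming from it counts every event exactly once in the state.

```scala
// Hypothetical sketch: state and input position are saved and restored together.
object SavepointSketch {
  // Operator state (a running sum) plus the next input offset to read.
  final case class Snapshot(offset: Int, sum: Long)

  // Consume input from start.offset (inclusive) to `to` (exclusive).
  def run(input: Vector[Int], start: Snapshot, to: Int): Snapshot =
    (start.offset until to).foldLeft(start) { (s, i) =>
      Snapshot(i + 1, s.sum + input(i))
    }
}
```

Restoring the snapshot into a changed program (a bugfix, an upgrade, an A/B variant) works the same way, as long as the new program can read the saved state.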

Flink's Approach

Stream SQL (High-level Language)

Table API (Declarative DSL)

Fluent API, Windows, Event Time (Core API)

Stateful Stream Processing (Building Block)

Event Time / Out-of-Order

Processing Time: 1977, 1980, 1983, 1999, 2002, 2005, 2015

Event Time: Episode IV, Episode V, Episode VI, Episode I, Episode II, Episode III, Episode VII
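The slide's example can be written down as data: each film arrives in release-year order (processing time) but belongs at its episode number (event time). The sketch below is plain Scala, not Flink's event-time API. Grouping into "trilogy" windows is an added assumption for illustration.

```scala
// Sketch of the slide's example: arrival order differs from event-time order.
object EventTimeSketch {
  final case class Film(releaseYear: Int, episode: Int)

  // Films in the order they "arrive" (release year = processing time).
  val arrivals: Vector[Film] = Vector(
    Film(1977, 4), Film(1980, 5), Film(1983, 6),
    Film(1999, 1), Film(2002, 2), Film(2005, 3), Film(2015, 7))

  // Order seen by a processing-time system.
  val byProcessingTime: Vector[Int] = arrivals.map(_.episode)

  // Order reconstructed from event time.
  val byEventTime: Vector[Int] = arrivals.sortBy(_.episode).map(_.episode)

  // Assumed event-time windows of three episodes each (0: I-III, 1: IV-VI, 2: VII).
  val windows: Map[Int, Vector[Film]] = arrivals.groupBy(f => (f.episode - 1) / 3)
}
```

The window for episodes I to III only completes in 2005, long after the window for IV to VI. An event-time system has to tolerate that kind of out-of-order arrival.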

(Stream) SQL & Table API

Table API

// convert stream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'time, 'tempF)

// define query on Table
val avgTempCTable: Table = sensorTable
  .groupBy('location)
  .window(Tumble over 1.days on 'rowtime as 'w)
  .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")

sensorTable.sql("""
  SELECT day, location, avg((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY day, location
""")
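To make the query's meaning concrete, here is the same computation over an in-memory collection. It uses plain Scala collections, not the Table API. The `Reading` type, the integer `day` field, and the sample values in the usage are assumptions. The Table API version converts the average, while the SQL version averages the converted values. Both give the same result because the conversion is linear.

```scala
// Hypothetical sketch: the query's logic on a plain Scala collection.
object AvgTempSketch {
  final case class Reading(location: String, day: Int, tempF: Double)

  // Fahrenheit to Celsius with the slide's constant (0.556 approximates 5/9).
  def toC(f: Double): Double = (f - 32) * 0.556

  // Average temperature in Celsius per (day, location), for "room%" locations only.
  def avgTempC(readings: Seq[Reading]): Map[(Int, String), Double] =
    readings
      .filter(_.location.startsWith("room"))
      .groupBy(r => (r.day, r.location))
      .map { case (key, rs) => key -> toC(rs.map(_.tempF).sum / rs.size) }
}
```

For example, readings of 32 and 212 degrees Fahrenheit in "room1" on the same day average to 122, which converts to about 50.04 with this constant.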

What can you do with that?

10 billion events (2TB) processed daily across multiple Flink jobs for the telco network control center.

Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees

Jobs with > 20 operators, running on > 5000 vCores in a 1000-node cluster, processing millions of events per second

Flink's Streams playing at Batch

TeraSort

Relational Join

Classic Batch Jobs

Graph Processing

Linear Algebra

Streaming Technology is already awesome, but what are the next steps?

A.k.a., what can we expect in the "next gen"?

A lot of things are "next gen" when looking at the program, so here is my take on it…

"Next Gen"

Queryable State

"Next Gen"

Elastic Parallelism

Maintaining exactly-once state consistency

No extra effort for the user

No need to carefully plan partitions
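One way rescaling without hand-planned partitions could work is to hash keys into a fixed number of key groups and assign contiguous ranges of groups to the parallel operator instances. The sketch below is an assumption for illustration and is not taken from the talk. `maxParallelism`, `keyGroup`, and `operatorIndex` are hypothetical names, and 128 is an arbitrary value.

```scala
// Hypothetical sketch: keys map to fixed key groups; groups map to operators.
object KeyGroupSketch {
  // Fixed number of key groups, chosen once (assumed value).
  val maxParallelism: Int = 128

  // A key always lands in the same group, regardless of current parallelism.
  def keyGroup(key: String): Int = Math.floorMod(key.hashCode, maxParallelism)

  // Groups are spread over the current number of parallel operator instances.
  def operatorIndex(group: Int, parallelism: Int): Int =
    group * parallelism / maxParallelism
}
```

Changing the parallelism only moves whole key groups between instances, so state can be redistributed group by group.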

"Next Gen"

Terabytes of state inside the stream processor

Maintaining fast checkpoints and recovery

E.g., long histories of windows, large join tables

State at local memory speed

"Next Gen"

Full SQL on Streams

Continuous queries, incremental results

Windows, event time, processing time

Consistent with SQL on bounded data

Thank you!

Appendix

We are hiring!

data-artisans.com/careers
