an introduction to distributed data...
TRANSCRIPT
![Page 1: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/1.jpg)
An Introduction to Distributed Data StreamingElements and Systems
Paris Carbone<[email protected]> PhD Candidate
KTH Royal Institute of Technology
1
![Page 2: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/2.jpg)
2
![Page 3: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/3.jpg)
2
![Page 4: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/4.jpg)
2
![Page 5: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/5.jpg)
2
how to avoid this?
![Page 6: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/6.jpg)
2
how to avoid this?
![Page 7: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/7.jpg)
2
how to avoid this?
Q
![Page 8: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/8.jpg)
2
how to avoid this?
Q = +
Q
![Page 9: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/9.jpg)
Motivation
3
Q = +
![Page 10: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/10.jpg)
Motivation
3
Q = +
![Page 11: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/11.jpg)
Motivation
3
Q Q
Q = +
![Page 12: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/12.jpg)
Motivation
3
Q Q
Q = +
![Page 13: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/13.jpg)
Motivation
3
Q Q
Q = +
![Page 14: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/14.jpg)
Motivation
4
Q
Standing Query
![Page 15: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/15.jpg)
Motivation
4
Q
Standing Query
![Page 16: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/16.jpg)
Motivation
4
Q
Standing Query
![Page 17: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/17.jpg)
Motivation
4
Q
Standing Query
![Page 18: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/18.jpg)
Preliminaries
• Data Streaming Paradigm
• Incoming data is unbound - continuous arrival
• Standing queries are evaluated continuously
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
5
![Page 19: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/19.jpg)
Data Streams Basics• Events/Tuples : elements of computation - respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
6
f
S1
S2
So
S’1
S’2
![Page 20: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/20.jpg)
Streaming Pipelines
7
stream1
stream2
approximations predictions alerts ……
Q
sources
sinks
![Page 21: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/21.jpg)
Core Abstractions
• Windows
• Synopses (summary state)
• Partitioning
8
![Page 22: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/22.jpg)
Windows
Discussion
Why do we need windows?
9
![Page 23: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/23.jpg)
Windows• We are often interested only in fresh data
• f = “average temperature over the last minute every 20 sec”
• Range: Most data stream processing systems allow window operations on the most recent history (eg. 1 minute, 1000 tuples)
• Slide: The frequency/granularity f is evaluated on a given range
10
#seconds40 80
Average #3
Average #2
0
Average #1
20 60 100
f
W: 1min, 20sec
![Page 24: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/24.jpg)
Window Types
11
#sec40 80
Average #2
0
Average #1
20 60 100
#sec40 80
Average #3
Average #2
0
Average #1
20 60 100
#sec40 80
Average #2
0
Average #1
20 60 100
120
120
Sliding
Tumbling
Jumping
range > slide
range = slide
range < slide
![Page 25: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/25.jpg)
SynopsesWe cannot infinitely store all events seen
• Synopsis: A summary of an infinite stream
• It is in principle any streaming operator state
• Examples: samples, histograms, sketches, state machines…
12
f
sa summary of everything
seen so far1. process t, s 2. update s 3. produce t’
t t’
What about window synopses?
![Page 26: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/26.jpg)
Synopses-Aggregations
• Discussion - Rolling Aggregations
• Propose a synopsis, s=? when
• f= max
• f= ArithmeticMean
• f= stDev
13
![Page 27: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/27.jpg)
Synopses-Approximations
14
• Discussion - Approximate Results
• Propose a synopsis, s=? when
• f= uniform random sample of k records over the whole stream
• f= filter distinct records over windows of 1000 records with a 5% error
![Page 28: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/28.jpg)
Synopses-ML and Graphs
15
• Examples of cool synopses to check out
• Sparsifiers/Spanners - approximating graph properties such as shortest paths
• Change detectors - detecting concept drift
• Incremental decision trees - continuous stream training and classification
![Page 29: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/29.jpg)
Partitioning• One stream operator is not enough
• Data might be too large to process
• e.g. very high input rate, too many stream sources
• State could possibly not fit in memory
16
fs
fs
fs
parallel instances
How do we partition the input streams?
fs
![Page 30: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/30.jpg)
Partitioning• Partitioning defines how we allocate events to each
parallel instance. Typical partitioners are:
• Broadcast
• Shuffle
• Key-based
17
fs
fs
fs
fs
fs
fs
P
P
P
bycolor
![Page 31: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/31.jpg)
Putting Everything Together
18
Fire Detection Pipeline
{area,temp}
{area,smoke} {loc,alert!}
• operators • synopses • windows • partitioning
trigger on detection
trigger periodically
?
![Page 32: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/32.jpg)
Operators
19
As
Fs
Rolling Arithmetic Mean of Temperatures
State Machine-based Fire Alarm
{area,temp} {area,avgTemp}
{alarm}
Src
Sensor Data Sources{area,temp}
Src
{area}Periodic Temperature Updates
Smoke Detections
trivial…
What is the state and its transitions?
![Page 33: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/33.jpg)
Partitioning• We are only interested in correlating smoke and
high temperature within the same area
• Events carry area information so we can partition our computation by area
20
Src Pkey:area
![Page 34: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/34.jpg)
Windowing• Individual sensor data could be potentially faulty
• We need to gather data from all temperature sensors of an area and produce an average
• We want fresh average temperatures
21
Src Pkey:area{area,temp} A
s
As
w
w
w = ?
![Page 35: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/35.jpg)
The Fire Alarm
22
![Page 36: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/36.jpg)
The Fire Alarm
22
Fs
![Page 37: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/37.jpg)
The Fire Alarm
22
Fs
T : avgTemp>40T : avgTemp<40S : Smoke
![Page 38: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/38.jpg)
The Fire Alarm
22
Fs
T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….S : Smoke
![Page 39: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/39.jpg)
The Fire Alarm
22
Fs
T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….
OK
HOT
SMOKE
FIRE
T
T
TS
S
T
T
S : Smoke
![Page 40: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/40.jpg)
The Fire Alarm
22
Fs
T : avgTemp>40T : avgTemp<40 …TTTSTTSTTTT….
OK
HOT
SMOKE
FIRE
T
T
TS
S
T
T
synopsis= 1 state
S : Smoke
![Page 41: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/41.jpg)
Putting Everything Together
23
{area,temp}
{area,smoke}
Src
Src
P
P
As
As
key:area
key:area
w
w
Fs
Fs
Pkey:area
{area, alert}
{area,avg_temp}
{area,smoke}
![Page 42: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/42.jpg)
Systems: The Big Picture
24
Proprietary Open Source
Google DataFlow
IBM Infosphere
Microsoft Azure
Flink
Storm
Samza
Spark
![Page 43: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/43.jpg)
Evolution
25
’95 Materialised
Views
’01 Complex
Event Processing
’03 TelegraphCQ
’03 STREAM
’05 Borealis
’15 User-Defined
Windows
’12 Policy-Based Windowing
’88 Active
DataBases
’88 HiPac
’12 Twitter Storm
’12 IBM
System S
’13 Spark
Streaming
’14 Apache
Flink
’13 Parallel
Recovery
’05 Decentralised
Stream Queries
’05 High Availability
on Streaming
concepts
systems’13
Google Millwheel
’13 Discretized
Streams
’00 Eddies
02 Aurora ’12
Twitter Storm
![Page 44: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/44.jpg)
Programming Models
26
Compositional Declarative
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation
• Expose a high-level API • Operators are higher order
functions on abstract data stream types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
![Page 45: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/45.jpg)
Programming Model Types
27
DStream, DataStream, PCollection…
• Direct access to the execution graph / topology
• Suitable for engineers
• Transformations abstract operator details
• Suitable for engineers and data analysts
![Page 46: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/46.jpg)
Standing Queries with Apache Storm
28
• Step1: Implement input (Spouts) and intermediate operators (Bolts)
• Step 2: Construct a Topology by combining operators
Spout Bolt Bolt
Spouts are the topology sources
The listen to data feeds
Bolts represent all intermediate computation vertices of the topology
They do arbitrary data manipulation
Each operator can emit/subscribe to Streams (computation results)
![Page 47: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/47.jpg)
Example: Topology Definition
29
numbers new_numbers
numbers new_numbers
toFile
![Page 48: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/48.jpg)
Standing Queries with Apache Flink
30
Flink Runtime
Flink Job Graph Builder/Optimiser
Flink Client
Streaming Program
• Operator fusion • Window Pre-aggregates
• Deploy Long Running Tasks • Monitor Execution
![Page 49: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/49.jpg)
Distributed Stream Execution Paradigms
31(Hadoop, Spark) (Spark Streaming)
1) Real Streaming (Distributed Data Flow)
LONG-LIVED TASK EXECUTION STATE IS KEPT INSIDE TASKS
2) Batched Execution
![Page 50: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/50.jpg)
Windows in Action
32
• DStreams are already partitioned in time windows
• Only time windows supported
• Windows decomposed into policies
• Policies can be user-defined too
range
slide
![Page 51: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/51.jpg)
Windows on Storm?
33src-http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
![Page 52: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/52.jpg)
Partitioning in Action
34
forward() shuffle() broadcast() keyBy()
partitionCustom()
shuffleGrouping() allGrouping() fieldsGrouping()
customGrouping()
repartition(num) reduceByKey() updateStateByKey()
no fine-grained control full control
![Page 53: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/53.jpg)
Synopses in Action
35
implementing a rolling max per key
![Page 54: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/54.jpg)
State in Spark?
36
• Streams are partitioned into small batches • There is practically no state kept in workers (stateless) • How do we keep state??
(Spark Streaming)
put new states in output RDDdstream.updateStateByKey(…)
In S’
![Page 55: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/55.jpg)
Implementing the alarm in Flink
37
![Page 56: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/56.jpg)
So everything works
38
{area,temp}
{area,smoke}
Src
Src
P
P
As
As
key:area
key:area
w
w
Fs
Fs
Pkey:area
{area,avg_temp}
{area,smoke}
or…
![Page 57: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/57.jpg)
Unreliable Sources
39
Standing Query
Q
![Page 58: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/58.jpg)
Unreliable Sources
39
Standing Query
Q
![Page 59: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/59.jpg)
Unreliable Sources
39
Standing Query
Q
![Page 60: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/60.jpg)
Unreliable Sources
39
Standing Query
add more sensors
Q
![Page 61: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/61.jpg)
Unreliable Processing
40
Standing Query
Q
![Page 62: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/62.jpg)
Unreliable Processing
40
Standing Query
Q
![Page 63: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/63.jpg)
recovered!
Unreliable Processing
40
Standing Query
Q
![Page 64: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/64.jpg)
recovered!
Unreliable Processing
40
Standing Query
Qlost smoke events
![Page 65: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/65.jpg)
Resilient Brokers
Main Features
• Topic-based partitioned queues
• Strongly consistent offset mapping to records
41
![Page 66: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/66.jpg)
Processing Guarantees• Kafka solves the source consistency problem
• How about the rest of the states of the computation ? (e.g. alert operator state)
• Each system offers different guarantees
42
Guarantees Technique
Storm at least once event dependency tracking
Spark exactly once source upstream backup
Flink exactly once periodic snapshots
![Page 67: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/67.jpg)
43
Q
Standing Query
Mission Accomplished
![Page 68: An Introduction to Distributed Data Streaminglinc.ucy.ac.cy/isocial/images/stories/Events/EIT_i...Event Processing ’03 TelegraphCQ ’03 STREAM ’05 Borealis ’15 User-Defined](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec5d69509d47022543be2c8/html5/thumbnails/68.jpg)
Research Topics at KTH/SICS
• Exactly-Once-Output Guarantees
• State management and auto-scaling
• Streaming ML pipelines
• Streaming Graphs
44