jt@ucsb - on-demand data streaming from sensor nodes and a quick overview of apache flink
TRANSCRIPT
On-Demand Data Streaming from Sensor Nodes
(ACM SoCC 2017)
and
A quick overview of Apache Flink
Presentation at Sep. 30, 2017
University of California, Santa Barbara
About me
• Researcher and PhD candidate at – Technische Universität Berlin (DIMA)
– German Research Center for Artificial Intelligence (DFKI) / (IAM)
• Working with Volker Markl
• Before – Master’s degree in Computer Science (KTH Stockholm and TU Belin)
– Bachelor’s degree in Applied Computer Science (DHBW Stuttgart)
– Four years at IBM in Germany and the USA
Jonas Traub [email protected] [email protected]
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Optimized On-Demand Data
Streaming from Sensor Nodes
Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
Extended Talk for . .
Santa Clara, California,
September 25-27, 2017
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud
Real-time
insights
4
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud
Real-time
insights
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
5
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud
Real-time
insights
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
6
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud
Real-time
insights
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
7
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud
Real-time
insights
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
8
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Problems
Real-time
insights
9
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Problems
Real-time
insights
Streaming all data from billions
of sensors to all applications
with maximal frequencies is impossible
10
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Problems
Real-time
insights
Streaming all data from billions
of sensors to all applications
with maximal frequencies is impossible
Increasing data rates
require expensive
system scale-out.
11
Billions of sensor nodes form a sensor cloud
and provide data streams to analysis systems.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Solutions
12
Tailor Data Streams to the Demand of Applications
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Solutions
13
Tailor Data Streams to the Demand of Applications
• Provide an abstraction to define the data demand of applications.
User-Defined Sampling Functions (UDSFs)
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Solutions
14
Tailor Data Streams to the Demand of Applications
• Provide an abstraction to define the data demand of applications.
• Optimize communication costs while maintaining the result accuracy.
User-Defined Sampling Functions (UDSFs)
Read-Time Optimization
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
The Sensor Cloud – Solutions
15
Tailor Data Streams to the Demand of Applications
• Provide an abstraction to define the data demand of applications.
• Optimize communication costs while maintaining the result accuracy.
• Share sensor reads and data transfer among users and queries.
User-Defined Sampling Functions (UDSFs)
Read-Time Optimization
Multi-Query / Multi-User Optimization
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
16
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
17
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
18
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
19
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
20
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
• Query 2 requires a sample at least every 20 meters
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
21
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
• Query 2 requires a sample at least every 20 meters
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
22
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
• Query 2 requires a sample at least every 20 meters
• Query 3 requires a sample at least every 0.3s.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
23
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
• Query 2 requires a sample at least every 20 meters
• Query 3 requires a sample at least every 0.3s.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example
24
Different Data Data Demands:
• Query 1 adaptively increases sampling rates when accelerating or braking.
• Query 2 requires a sample at least every 20 meters
• Query 3 requires a sample at least every 0.3s.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example - Evaluation
25
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example - Evaluation
26
-57%
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example - Evaluation
27
-57%
-72%
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A Motivating Example - Evaluation
28
1/3 because 3 values per tuple
-57%
-72%
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Architecture Overview
29
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Architecture Overview
30
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Architecture Overview
31
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Architecture Overview
32
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Architecture Overview
33
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Sensor Read Scheduling
34
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions
35
Input:
Sensor read time and value Output:
Next Sensor Read Request
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions
36
Input:
Sensor read time and value
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions
37
Enable adaptive sampling techniques to reduce data transmission
e.g., Adam [Trihinas ‘15], FAST [Fan ‘14], L-SIP [Gaura ’13]
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions - Examples
38
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions - Examples
39
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
User-Defined Sampling Functions - Examples
40
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Sensor Read Fusion
41
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Sensor Read Fusion
42
1) Minimize Sensor Reads and Data Transfer:
Latest possible read time
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Read Time Optimization
43
2) Optimize Sensor Read Times:
● Minimize penalty while executing the minimum number of sensor reads only
● Challenge: assign read requests to sensor reads
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Assigning Read Requests to Sensor Reads
44
Postpone Assign to next Read
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Assigning Read Requests to Sensor Reads
45
Postpone Assign to next Read
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Assigning Read Requests to Sensor Reads
46
Postpone Assign to next Read
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Assigning Read Requests to Sensor Reads
47
Postpone Assign to next Read
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Local Filtering
48
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Local Filtering
49
● Enable adaptive filtering in combination with adaptive sampling
● Enable model-driven data acquisition
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Local Filtering
50
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Evaluation
● Replay sensor data
- from a football match [DEBS Grand Challenge ’13]
- formula 1 telementry data
● Random UDSFs:
- Read in a poisson process (also simulate load peaks)
- In average 1 read per query per second
- Exponentially distributed read time tolerance
- high probability for small tolerances
- small probability for large tolerances
- In average 0.04s read time tolerance
51
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 52
Increasing the number of concurrent queries
• On-Demand scheduling reduces sensor reads and data transfer by up to 87%.
• The # of reads and transfers increases sub-linearly with the # of queries.
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 53
Increasing the number of concurrent queries
• Our read-time optimizer reduces the deviation from desired read times
by up to 69% (preserving the min. # of reads and transfers).
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 54
Increasing read time tolerances
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 55
Increasing read time tolerances
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 56
Query Prioritization (1/2)
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 57
Query Prioritization (2/2)
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17 58
Slack Robustness of Adaptive Sampling
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
Optimized On-Demand Data
Streaming from Sensor Nodes
Wrap-Up:
Tailor Data Streams to the Demand of Applications
• Define data demand: User-Defined Sampling Functions
• Schedule sensor reads and data transfer on-demand
• Optimize read times globally - for all users and queries
Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17
A quick overview of Apache Flink - research summary -
Jonas Traub visiting September 30, 2017
Outline
Apache Flink Primer
• Stratosphere – The origin of Apache Flink
• What is Apache Flink? – Basic System Internals
• The Flink Community
An Apache Flink Research Summary
61
62 © Volker Markl
• Relational Algebra
• Declarativity
• Query Optimization
• Robust Out-of-core
• Scalability
• User-defined
Functions
• Complex Data Types
• Schema on Read
62
Draws on
Database Technology
Draws on
MapReduce Technology
Stratosphere: General Purpose
Programming + Database Execution
63 © Volker Markl
• Relational Algebra
• Declarativity
• Query Optimization
• Robust Out-of-core
• Scalability
• User-defined
Functions
• Complex Data Types
• Schema on Read
• Iterations
• Advanced Dataflows
• General APIs
• Native Streaming
63
Draws on
Database Technology
Draws on
MapReduce Technology
Adds
Stratosphere: General Purpose
Programming + Database Execution
64 © Volker Markl 64
64 © Volker Markl
Apache Flink is an open source platform for scalable batch and stream
data processing.
What is Apache Flink?
http://flink.apache.org
A distributed system that you can use to process data
Like a DBMS but not exactly a DBMS
What kind of data? Data that comes in the form of streams
What kind of processing
Quite flexible. You can use Java/Scala APIs similar to programming with Java collections, the new SQL API, etc
Distributed: runs on many (1000s) of machines and hides this complexity from the user
64
65 © Volker Markl 65
65 © Volker Markl
Basic application architecture
app state
app state
app state
event log
Query
service
Sources of data
(e.g., sensors,
logs, …)
A replayable log of
events with pub/sub
functionality
Processing of
events
Storage and query systems
By courtesy of Kostas Tzoumas 65
66 © Volker Markl 66 © 2013 Berlin Big Data Center • All Rights Reserved
66 © Volker Markl
Technology inside Flink
case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Program
67 © Volker Markl 67 © 2013 Berlin Big Data Center • All Rights Reserved
67 © Volker Markl
Technology inside Flink
case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }
Cost-based
optimizer
Type extraction
stack
Pre-flight (Client) Program
68 © Volker Markl 68 © 2013 Berlin Big Data Center • All Rights Reserved
68 © Volker Markl
Technology inside Flink
case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }
Cost-based
optimizer
Type extraction
stack
Pre-flight (Client) DataSourc
e orders.tbl
Filter
Map DataSourc
e lineitem.tbl
Join Hybrid Hash
build
HT probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
69 © Volker Markl 69 © 2013 Berlin Big Data Center • All Rights Reserved
69 © Volker Markl
Technology inside Flink
case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }
Cost-based
optimizer
Type extraction
stack
Task
scheduling
Recovery
metadata
Pre-flight (Client)
Master Workers
DataSourc
e orders.tbl
Filter
Map DataSourc
e lineitem.tbl
Join Hybrid Hash
build
HT probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
deploy
operators
track
intermediate
results
70 © Volker Markl 70
70 © Volker Markl
Flink community
0
50
100
150
200
250
300
Feb15 Dec15 Dec16
NumberofContributors
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Feb15 Dec15 Dec16
StarsonGitHub
0
200
400
600
800
1000
1200
1400
Feb15 Dec15 Dec16
ForksonGitHub
By courtesy of Kostas Tzoumas
Project Statistics (Updated: Sep 29, 2017)
70
71 © Volker Markl 71
71 © Volker Markl
Companies Using Flink
Apache Flink - Related Publications
System Paper 2015:
System Paper 2014:
72
Apache Flink - Related Publications
System Paper 2015:
System Paper 2014:
73
Apache Flink - Related Publications
System Paper 2015:
System Paper 2014:
74
State Management
VLDB 2017
75
Iterative Processing
VLDB 2012
76
Iterative Processing
VLDB 2012
SIGMOD 2013 77
Fault Tolerance
78
Fault Tolerance
79
Fault Tolerance
80
Fault Tolerance
81
Streaming Window Aggregation
82
Visualization of Streaming Data
83
On-Demand Data Streaming from Sensor Nodes
(ACM SoCC 2017)
and
A quick overview of Apache Flink
Presentation at Sep. 30, 2017
University of California, Santa Barbara