© Volker Markl • © 2013 Berlin Big Data Center • All Rights Reserved
Big Data Management,
Scalable Data Science, and Apache Flink:
Challenges and (some) Solutions
Prof. Dr. Volker Markl
http://www.user.tu-berlin.de/marklv/
http://www.dima.tu-berlin.de
http://www.dfki.de/web/forschung/iam
http://bbdc.berlin
Data & Analysis: Increasingly Complex!
• Data (DM, scalability)
– data volume too large: Volume
– data rate too fast: Velocity
– data too heterogeneous: Variability
– data too uncertain: Veracity
• Analysis (ML, algorithms)
– Reporting: aggregation, selection
– Ad-Hoc Queries: SQL, XQuery
– ETL/ELT: MapReduce
– Data Mining: MATLAB, R, Python
– Predictive/Prescriptive: MATLAB, R, Python
“Data Scientist” – “Jack of All Trades!”
• Application: Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
• Data Science: Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Decoupling, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Stochastic Gradient Descent, Regression, Statistics, Hashing, Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Data Flow, Hardware Adaptation, Indexing, Resource Management, NF2/XQuery, Data Warehouse/OLAP, Real-Time
New Technology to the Rescue!
Big Data Analytics Requires Systems Programming
R/Matlab: 3 million users
Hadoop: 100,000 users
Data Analysis
Statistics
Algebra
Optimization
Machine Learning
NLP
Signal Processing
Image Analysis
Audio-,Video Analysis
Information Integration
Information Extraction
Data Value Chain
Data Analysis Process
Predictive Analytics
Indexing
Parallelization
Communication
Memory Management
Query Optimization
Efficient Algorithms
Resource Management
Fault Tolerance
Numerical Stability
Big Data is now where database systems were in the 70s (prior to relational algebra, query optimization and a SQL-standard)!
People with Big Data Analytics Skills
Declarative languages to the rescue!
Deep Analysis of „Big Data“ is Key!
[Figure: two axes, Small Data to Big Data (3V) and Simple Analysis to Deep Analytics]
• Many new companies and products are emerging to enable deep big data analysis; strong European contenders include Apache Flink, Parstream, and Exasol.
• „New companies“ are the (b)leading users of these technologies, e.g., in the information economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).
• „Traditional big companies“ are following and still determining strategies (Industrie 4.0, Logistics, Telco, etc.).
• Most SMEs are not ready yet to capitalize on Big Data.
Challenge: Technologies for Data Science at the Intersection of
Data Management and Machine Learning
aggregation, relational algebra, UDF, iteration/recursion, linear algebra, graph algebra, etc.
DM, ML
• Feature Engineering
• Representation
• Algorithms (SVM, EM, etc.)
• Declarative Languages
• Automatic Adaptation
• Scalable processing
• Think ML algorithms in a scalable way
• Process (iterative) algorithms in a scalable way
• declarative
Apache Flink –
Big Data Batch
and Stream Processing
Stratosphere: General Purpose Programming + Database Execution
• Draws on Database Technology
– Relational Algebra
– Declarativity
– Query Optimization
– Robust Out-of-core
• Draws on MapReduce Technology
– Scalability
– User-defined Functions
– Complex Data Types
– Schema on Read
• Adds
– Iterations
– Advanced Dataflows
– General APIs
– Native Streaming
What is Apache Flink?
Apache Flink is an open source platform for scalable batch and stream data processing.
http://flink.apache.org
• The core of Flink is a distributed streaming dataflow engine.
• Executing dataflows in parallel on clusters
• Providing a reliable foundation for various workloads
• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers
Technology inside Flink
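    // Transitive closure as a bulk iteration; `edges` is assumed to be an existing DataSet[Path].
    import org.apache.flink.api.scala._
    case class Path(from: Long, to: Long)
    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges)
        .where("to")
        .equalTo("from") { (path, edge) =>
          Path(path.from, edge.to) // extend a known path by one edge
        }
        .union(paths).distinct()
      next
    }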
[Figure: Flink architecture and an example execution plan]
• Pre-flight (Client): Program → Dataflow Graph; type extraction stack; cost-based optimizer
• Master: task scheduling; recovery metadata; deploy operators; track intermediate results
• Workers: memory manager; out-of-core algos; batch & streaming; state & checkpoints
• Example plan: DataSource orders.tbl → Filter → Map; DataSource lineitem.tbl; Join (Hybrid Hash: build HT, probe; hash-part [0] on both inputs); GroupRed (sort, forward)
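The transitive-closure program on this slide can be mimicked on local Scala collections to see what each iteration computes. This is only a sketch under stated assumptions: `TransitiveClosureSketch`, `step`, and `closure` are hypothetical names, plain `Set`s stand in for Flink's `DataSet`, and nothing here is distributed or optimized.

```scala
object TransitiveClosureSketch {
  final case class Path(from: Long, to: Long)

  // One iteration step: extend every known path by one edge (the join on
  // paths.to == edges.from), keep the old paths (union), and deduplicate
  // (distinct, which a Set gives us for free).
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val edgesByFrom = edges.groupBy(_.from)
    val extended = for {
      p <- paths
      e <- edgesByFrom.getOrElse(p.to, Set.empty[Path])
    } yield Path(p.from, e.to)
    extended ++ paths
  }

  // Bulk iteration: apply the step function a fixed number of times,
  // starting from the edges themselves.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))

  def main(args: Array[String]): Unit = {
    val edges = Set(Path(1, 2), Path(2, 3), Path(3, 4))
    closure(edges, 10).toSeq.sortBy(p => (p.from, p.to)).foreach(println)
  }
}
```

On the chain 1 → 2 → 3 → 4 the closure also contains the derived paths 1 → 3, 2 → 4, and 1 → 4.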
Effect of optimization
• The same program gets different execution plans (Execution Plan A, B, C) depending on the setting:
– Run on a sample on the laptop
– Run on large files on the cluster
– Run a month later after the data evolved
• Optimizer choices: Hash vs. Sort; Partition vs. Broadcast; Caching; Reusing partition/sort
Why optimization?
Do you want to hand-tune that?
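To make the "Partition vs. Broadcast" decision concrete, here is a toy cost-based chooser. Everything in it is an assumption for illustration: the names (`JoinStrategySketch`, `networkCost`, `choose`) are hypothetical, and the cost model only counts bytes shipped over the network, which is far simpler than what a real optimizer such as Flink's considers.

```scala
object JoinStrategySketch {
  sealed trait Strategy
  case object BroadcastLeft extends Strategy
  case object BroadcastRight extends Strategy
  case object Repartition extends Strategy

  // Hypothetical cost model: bytes sent over the network.
  // Broadcasting ships one input to every other worker; repartitioning
  // ships roughly (p - 1) / p of both inputs.
  def networkCost(s: Strategy, leftBytes: Long, rightBytes: Long, parallelism: Int): Long =
    s match {
      case BroadcastLeft  => leftBytes * (parallelism - 1)
      case BroadcastRight => rightBytes * (parallelism - 1)
      case Repartition    => (leftBytes + rightBytes) * (parallelism - 1) / parallelism
    }

  // Pick the cheapest strategy for the given input sizes.
  def choose(leftBytes: Long, rightBytes: Long, parallelism: Int): Strategy =
    Seq[Strategy](BroadcastLeft, BroadcastRight, Repartition)
      .minBy(networkCost(_, leftBytes, rightBytes, parallelism))
}
```

With a tiny left input the sketch prefers broadcasting it; once both inputs are large it switches to repartitioning. That is the kind of plan change the previous slide attributes to running on a laptop sample versus the full cluster data.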
DATA STREAMING ANALYSIS
Life of data streams
• Create: create streams from event sources (machines, databases, logs, sensors, …)
• Collect: collect and make streams available for consumption (e.g., Apache Kafka)
• Process: process streams, possibly generating derived streams (e.g., Apache Flink)
Stream Analysis in Flink
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Defining windows in Flink
• Trigger policy
– When to trigger the computation on current window
• Eviction policy
– When data points should leave the window
– Defines window width/size
• E.g., count-based policy
– evict when #elements > n
– start a new window every n-th element
• Built-in: Count, Time, Delta policies
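The interplay of trigger and eviction policies can be sketched with a count-based window over a local iterator. This is not Flink's windowing API: `countWindows` is a hypothetical helper in which `size` plays the role of the eviction policy and `slide` the role of the trigger policy.

```scala
object CountWindowSketch {
  // size  = eviction policy: keep at most `size` elements in the window
  // slide = trigger policy: emit the current window on every `slide`-th element
  def countWindows[A](stream: Iterator[A], size: Int, slide: Int): Iterator[Vector[A]] = {
    require(size > 0 && slide > 0, "size and slide must be positive")
    var buffer = Vector.empty[A]
    var seen = 0L
    stream.flatMap { x =>
      buffer = (buffer :+ x).takeRight(size) // evict when #elements > size
      seen += 1
      if (seen % slide == 0) Iterator.single(buffer) // trigger the computation
      else Iterator.empty
    }
  }

  def main(args: Array[String]): Unit =
    countWindows(Iterator(1, 2, 3, 4, 5, 6), size = 3, slide = 2).foreach(println)
}
```

Separating the two policies is what allows sliding windows (`slide < size`) and tumbling windows (`slide == size`) to share one mechanism.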
Checkpointing / Recovery
• Flink acknowledges batches of records
– Less overhead in failure-free case
– Currently tied to fault tolerant data sources (e.g., Kafka)
• Flink operators can keep state
– State is checkpointed
– Checkpointing and record acks go together
• Exactly-once semantics for state
Checkpointing / Recovery
Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots
• Pushes checkpoint barriers through the data flow
• Before barrier = part of the snapshot
• After barrier = not in snapshot (backup till next snapshot)
[Figure: a barrier travelling along the data stream; operator states labelled "operator checkpoint starting", "checkpoint in progress", and "checkpoint done"]
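The barrier idea can be illustrated with a single stateful operator that keeps a running sum. This is a deliberately reduced sketch, not Flink's implementation: it assumes one input channel (so no barrier alignment is needed), and the names `BarrierSnapshotSketch`, `run`, and `recover` are hypothetical.

```scala
object BarrierSnapshotSketch {
  sealed trait Event
  final case class Record(value: Long) extends Event
  final case class Barrier(checkpointId: Int) extends Event

  // Running-sum operator. When a barrier arrives, the state seen so far is
  // snapshotted: records before the barrier are part of the snapshot,
  // records after it are not.
  def run(events: Seq[Event]): (Long, Map[Int, Long]) =
    events.foldLeft((0L, Map.empty[Int, Long])) {
      case ((sum, snapshots), Record(v))   => (sum + v, snapshots)
      case ((sum, snapshots), Barrier(id)) => (sum, snapshots + (id -> sum))
    }

  // Recovery: restore the snapshot taken at `checkpointId` and replay only
  // the records after that barrier (assumes a replayable source).
  def recover(events: Seq[Event], checkpointId: Int, snapshot: Long): Long =
    events
      .dropWhile(_ != Barrier(checkpointId))
      .drop(1)
      .foldLeft(snapshot) {
        case (sum, Record(v))  => sum + v
        case (sum, _: Barrier) => sum
      }
}
```

Restoring a snapshot and replaying the records behind its barrier yields the same final state as the failure-free run, which is the exactly-once guarantee for state described on the previous slide.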
ITERATIONS IN DATA FLOWS
MACHINE LEARNING ALGORITHMS
Iterate by looping
• for/while loop in client submits one job per iteration step
• Data reuse by caching in memory and/or disk
[Figure: a client submitting a sequence of Step jobs, one per iteration]
Iterate in the Dataflow
[Figure: bulk iteration dataflow. Labels: initial solution, partial solution, step function (operators X and Y), other datasets, Replace, iteration result]
Large-Scale Machine Learning
Factorizing a matrix with 28 billion ratings for recommendations
(Scale of Netflix or Spotify)
More at: http://data-artisans.com/computing-recommendations-with-flink.html
Optimizing iterative programs
• Caching loop-invariant data
• Pushing work „out of the loop“
• Maintain state as index
STATE IN ITERATIONS
GRAPHS AND MACHINE LEARNING
Iterate natively with deltas
[Figure: delta iteration dataflow. Labels: initial solution, initial workset, partial solution, workset, operators A, B, X, Y, other datasets, delta set, Merge deltas, Replace, iteration result]
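A delta iteration can be sketched locally with connected components by min-label propagation: the solution set holds the current label of every vertex, and the workset holds only the vertices whose label changed in the previous round. The names below (`DeltaIterationSketch`, `connectedComponents`) are hypothetical, plain `Map`s stand in for Flink's solution set and workset, and edge endpoints are assumed to be contained in `vertices`.

```scala
object DeltaIterationSketch {
  // Returns the final labels and the number of updated elements per iteration.
  def connectedComponents(vertices: Set[Int], edges: Set[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    // Undirected adjacency lists.
    val neighbours: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    @annotation.tailrec
    def loop(solution: Map[Int, Int], workset: Map[Int, Int], updates: List[Int]): (Map[Int, Int], List[Int]) =
      if (workset.isEmpty) (solution, updates.reverse)
      else {
        // Only vertices in the workset send their label to their neighbours.
        val candidates: Map[Int, Int] = workset.toSeq
          .flatMap { case (v, label) => neighbours.getOrElse(v, Set.empty[Int]).toSeq.map(n => n -> label) }
          .groupBy(_._1)
          .map { case (n, labels) => n -> labels.map(_._2).min }
        // Delta set: vertices whose label actually improves.
        val delta = candidates.filter { case (n, label) => solution.get(n).exists(label < _) }
        // Merge deltas into the solution; the delta becomes the next workset.
        loop(solution ++ delta, delta, delta.size :: updates)
      }

    val initial = vertices.map(v => v -> v).toMap
    loop(initial, initial, Nil)
  }
}
```

The second component of the result records how many elements were updated in each round; in this sketch the count shrinks quickly, and each round only touches the neighbours of changed vertices rather than the whole graph.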
Effect of delta iterations…
[Figure: number of elements updated (y-axis, 0 to 45,000,000) per iteration (x-axis, 1 to 61)]
… very fast graph analysis
… and mix and match ETL-style and graph analysis in one program
Performance competitive with dedicated graph analysis systems
More at: http://data-artisans.com/data-analysis-with-flink.html
Current Benchmark Results
Performed by Yahoo! Engineering, Dec 16, 2015
[..] Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub-second latencies at relatively high throughputs [..]. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency.
Flink achieves highest throughput with competitive low latency!
Source: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Timeline
(Strictly) Flink European Meetups with Member Totals (as of 30.5.16)
City Total Members
Berlin 758
Paris 500
Madrid 384
Stockholm 313
Brussels 279
London 190
Munich 98
Istanbul 56
Meetups By Country Concerning Flink
Apache Flink Meetups Worldwide (Data accurate as of 30.5.16)
• 6326 members strictly focused on Apache Flink (comprising 57%)
• 4771 members broader in scope, including Flink (comprising 43%)
Distribution of (Strictly) Flink Meetup Group Members by Country (as of 30.5.16)
Country Total Members
USA 3184
Germany 856
France 500
Spain 384
Sweden 313
Belgium 279
Brazil 233
UK 190
Taiwan 142
India 139
Turkey 56
Mexico 50
> 13 Companies Using Flink
> 6 Software Projects Using Flink
> 10 Research Institutions Using Flink
Fault tolerance
Pessimistic Recovery:
• Write intermediate state to stable storage
• Restart from such a checkpoint in case of a failure
Problematic:
• High overhead, checkpoint must be replicated to other machines
• Overhead always incurred, even if no failures happen!
How can we avoid this overhead in failure-free cases?
[Figure: runtime per iteration (sec), split into actual work and checkpointing overhead; axis labels 20 and 120]
Optimistic recovery
• Many data mining algorithms are fixpoint algorithms
• Optimistic Recovery: jump to a different state in case of a failure, still converge to solution
• No checkpoints → no overhead in absence of failures!
• Algorithm-specific compensation function must restore state
[Figure: comparison of failure-free execution, pessimistic recovery, and optimistic recovery]
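The idea of compensating instead of checkpointing can be tried out on a small fixpoint algorithm. The sketch below uses connected components by min-label propagation and, on a simulated failure, resets the lost vertices to their initial labels. The names (`OptimisticRecoverySketch`, `compensate`, `run`) and the way failures are injected are assumptions for illustration, not the implementation from the papers cited on the next slide; edge endpoints are assumed to be contained in `vertices`.

```scala
object OptimisticRecoverySketch {
  type Labels = Map[Int, Int]

  // One synchronous step: every vertex takes the minimum label among
  // itself and its neighbours.
  def step(labels: Labels, neighbours: Map[Int, Set[Int]]): Labels =
    labels.map { case (v, label) =>
      v -> (neighbours.getOrElse(v, Set.empty[Int]).map(labels) + label).min
    }

  // Compensation function: lost vertices restart from their own id, which
  // is a valid (if less advanced) state for this fixpoint algorithm.
  def compensate(labels: Labels, lost: Set[Int]): Labels =
    labels ++ lost.map(v => v -> v)

  // failures: iteration number -> vertices whose state is lost after that
  // iteration. Returns the fixpoint and the number of iterations used.
  def run(vertices: Set[Int], edges: Set[(Int, Int)], failures: Map[Int, Set[Int]]): (Labels, Int) = {
    val neighbours: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    @annotation.tailrec
    def loop(labels: Labels, iteration: Int): (Labels, Int) = {
      val next = step(labels, neighbours)
      failures.get(iteration) match {
        case Some(lost)             => loop(compensate(next, lost), iteration + 1)
        case None if next == labels => (labels, iteration)
        case None                   => loop(next, iteration + 1)
      }
    }
    loop(vertices.map(v => v -> v).toMap, 1)
  }
}
```

In this sketch the run with a failure needs one more iteration than the failure-free run but reaches the same fixpoint, and the failure-free run pays no checkpointing cost at all.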
All Roads lead to Rome
If you are interested in more, read our papers:
• Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, Volker Markl: "All roads lead to Rome": optimistic recovery for distributed iterative data processing. CIKM 2013: 1919-1928
• Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios Katsifodimos, Stephan Ewen, Kostas Tzoumas, Volker Markl: Optimistic Recovery for Iterative Dataflows in Action. To appear in SIGMOD 2015
Declarative Data
Processing
and Big Data
A Billion $$$ Mantra...
Declarative Data Processing
• SQL: A simple, high-level language for querying data (Chamberlin ’74).
• Relations: An effective, formal foundation based on relational algebra and calculus (Codd ’71).
• RDBMS: An efficient, low-level execution environment tailored towards the data (Selinger ’79).
With 40+ years of success...
Declarative Data Processing
SQL Relations RDBMS
Is Being Revised
Declarative Data Processing
• SQL → Second-Order Functions
• Relations → Distributed Collections
• RDBMS → Parallel Dataflow Engines
Overall Vision & Next Steps
• First results
– Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016
– Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)
• Next Steps (Fall 2016)
– Open-Source Release
• Vision (Frontend): Multi-model DSL based on type contracts
– Collection Processing: DataBag[A]
– Linear Algebra: Matrix[A], Vector[A]
– Stream Processing: Stream[A]
• Vision (Backend): Target more execution engines
– Column Stores
– GPUs
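What "declarative dataflows" over a collection type could look like can be hinted at with a toy `DataBag`. This is explicitly not Emma's API: the class below is a hypothetical local stand-in that only implements the three methods Scala needs for `for`-comprehensions, and the `Order` and `Item` types are made up for the example.

```scala
object DataBagSketch {
  // Hypothetical stand-in for a distributed collection type.
  final case class DataBag[A](elems: Seq[A]) {
    def map[B](f: A => B): DataBag[B] = DataBag(elems.map(f))
    def flatMap[B](f: A => DataBag[B]): DataBag[B] = DataBag(elems.flatMap(a => f(a).elems))
    def withFilter(p: A => Boolean): DataBag[A] = DataBag(elems.filter(p))
  }

  final case class Order(id: Int, customer: String)
  final case class Item(orderId: Int, price: Double)

  // A join written declaratively as a comprehension. A compiler that sees
  // this whole expression is free to pick a join strategy, instead of the
  // nested loops that this naive local version executes.
  def customerPrices(orders: DataBag[Order], items: DataBag[Item]): DataBag[(String, Double)] =
    for {
      o <- orders
      i <- items
      if i.orderId == o.id
    } yield (o.customer, i.price)
}
```

The user states what to compute (a join with a predicate) and leaves the how to the system, mirroring the SQL, relations, and RDBMS split from the preceding slides.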
Thanks to my team members and students
• Dr. Stephan Ewen
• Sebastian Schelter
• Dr. Kostas Tzoumas
• Dr. Asterios Katsifodimos
• Fabian Hüske
• Alexander Alexandrov
• Max Heimel
and many more members of the Stratosphere Project, the
Berlin Big Data Center, and the Apache Flink community
Evolution of Big Data Platforms
• 1G: Relational Databases
• 2G: Hadoop (Scale-out, Map/Reduce, UDFs)
• 3G: Spark (In-memory Performance and Improved Programming Model)
• 4G: Flink (In-memory + Out of Core Performance, Declarativity, Optimisation of Iterative Algorithms, True Streaming/Lambda)