© Volker Markl © 2013 Berlin Big Data Center • All Rights Reserved
Big Data
Challenges and Opportunities
Volker Markl
http://www.user.tu-berlin.de/marklv/
Talk based on the Vision Paper: Markl, V.: “On Declarative Data Analysis and Data Independence in the Big Data Era,” PVLDB 7(13): 1730-1733 (2014)
More and more data is available to science and business!
Drivers: Cloud Computing, Internet of Services, Internet of Things, Cyber-Physical Systems
Underlying Trends: Connectivity, Collaboration, Computer-generated data
video streams
web archives
sensor data
audio streams
RFID data
simulation data
Data & Analysis: More and More Complex!
Data:
Volume: data volume too large
Velocity: data rate too fast
Variability: data too heterogeneous
Veracity: data too uncertain
Analysis:
Reporting: aggregation, selection
Ad-Hoc Queries: SQL, XQuery
ETL/Integration: map/reduce
Data Mining: Matlab, R, Python
Predictive/Prescriptive: Matlab, R, Python
[Figure labels: ML, DM, scalability, algorithms]
Data-driven applications …
lifecycle management
home automation
health
water management
market research
traffic management
energy management
information marketplaces
e-sciences
… will revolutionize decision making in business and the sciences!
… have great economic potential!
Deep Analysis of Big Data
[Figure: chart with one axis running from Small Data to Big Data (3V) and the other from Simple Analysis to Deep Analytics]
Databases ➤ “Big Data”
• Tables ➤ Tables and unstructured files
– Schema on read
• Parallel ➤ More parallel, commodity, shared clusters
– Mid-query fault tolerance, resource allocation
• SQL ➤ SQL and Java, Scala, Python, you name it
– General object manipulation
• Data Warehousing ➤ Logs, ML, Graphs, also DW
– Iterative processing, user-defined functions
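The "schema on read" idea from the list above can be illustrated with a minimal Python sketch: raw text is stored untouched, and each reader supplies its own schema when it reads. This is not code from the talk; the sample data, the `read_with_schema` helper, and the field names are all hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Iterator

# Raw data is kept as plain text; no schema is enforced when it is written.
RAW_LOG = (
    "2014-05-01,sensor-7,21.5\n"
    "2014-05-01,sensor-9,not-a-number\n"
    "2014-05-02,sensor-7,22.0\n"
)

# A schema maps field names (in column order) to parser functions.
Schema = Dict[str, Callable[[str], object]]


def read_with_schema(raw: str, schema: Schema) -> Iterator[dict]:
    """Apply a schema at read time, skipping rows that do not fit it."""
    names = list(schema)
    for row in csv.reader(io.StringIO(raw)):
        if len(row) != len(names):
            continue  # wrong number of columns for this schema
        try:
            record = {name: schema[name](value) for name, value in zip(names, row)}
        except ValueError:
            continue  # a value could not be parsed under this schema
        yield record


# Two readers interpret the same raw bytes with different schemas.
typed = list(read_with_schema(RAW_LOG, {"day": str, "sensor": str, "temp": float}))
as_text = list(read_with_schema(RAW_LOG, {"day": str, "sensor": str, "temp": str}))
```

With the numeric schema the malformed second row is dropped at read time, while the all-text schema keeps every row. A schema-on-write system would instead have rejected that row at load time.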
Application
Data
Science
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical Programming
Linear Algebra
Stochastic Gradient Descent
Regression
Statistics
Hashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra / SQL
Scalability
Data Analysis Language
Compiler
Memory Management
Memory Hierarchy
Data Flow
Hardware Adaptation
Indexing
Resource Management
NF2 /XQuery
Data Warehouse/OLAP
“Data Scientist” – “Jack of All Trades!”
Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
Big Data Analytics Requires Systems Programming
R/Matlab: 3 million users
Hadoop: 100,000 users
Data Analysis
Statistics
Algebra
Optimization
Machine Learning
NLP
Signal Processing
Image Analysis
Audio and Video Analysis
Information Integration
Information Extraction
Data Value Chain
Data Analysis Process
Predictive Analytics
Indexing
Parallelization
Communication
Memory Management
Query Optimization
Efficient Algorithms
Resource Management
Fault Tolerance
Numerical Stability
Big Data is now where database systems were in the 1970s (prior to relational algebra, query optimization, and a SQL standard)!
People with Big Data Analytics Skills
Declarative languages to the rescue!
Further Challenges for Deep Analysis on Big Data
Low Latency: Trading off virtualization, heterogeneous CPUs, new hardware
Evolving Datasets: First results fast, stream mining
Advanced Data Analysis Programs: Declarative specification and optimization of programs with iteration and state
Engines: One size does not fit all; pluggable engines and libraries
Multi-tenancy: Continuous, workload-aware optimizations
Adaptive Seamless Deployment: Scale from laptop to cluster
Optimizing Access on Raw Data: In-situ data analysis
Markl, V.: “On Declarative Data Analysis and Data Independence in the Big Data Era,” PVLDB 7(13): 1730-1733 (2014)
Introducing Apache Flink
• Declarativity
• Query optimization
• Robust out-of-core
• Scalability
• User-defined functions
• Complex data types
• Schema on read
• Iterations
• Advanced Dataflows
• General APIs
Draws on Database Technology
Draws on MapReduce Technology
Add
Apache Flink: General Purpose Programming + Database Execution
Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014
Apache Flink Stack
http://flink.incubator.apache.org
Apache Flink Project History
• Project started under the name “Stratosphere” in late 2008 as a DFG-funded research unit comprising TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam
• Latest release adds support for YARN and offers Java and Scala APIs
• Fast-growing community of open source users and developers in Europe and worldwide
Data Sets and Operators
[Figure: Program view: Data Set A, Operator X, Data Set B, Operator Y, Data Set C. Parallel Execution view: partitions A (1), A (2), B (1), B (2), C (1), C (2), with two parallel instances each of X and Y.]
Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014
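The relationship between the program view and the parallel execution view can be sketched in plain Python. This is an illustration only, not Flink code: the `partition` and `run_parallel` helpers, the two example operators, and the parallelism of 2 are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Partition = List[int]


def partition(records: List[int], parallelism: int) -> List[Partition]:
    """Split a data set round-robin into `parallelism` partitions."""
    return [records[i::parallelism] for i in range(parallelism)]


def run_parallel(operator: Callable[[Partition], Partition],
                 partitions: List[Partition]) -> List[Partition]:
    """Run one instance of the operator per partition, concurrently."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(operator, partitions))


def operator_x(part: Partition) -> Partition:
    """Hypothetical operator X: square every record."""
    return [r * r for r in part]


def operator_y(part: Partition) -> Partition:
    """Hypothetical operator Y: keep only even records."""
    return [r for r in part if r % 2 == 0]


# Program view: A -> X -> B -> Y -> C.
data_set_a = list(range(1, 9))

# Parallel execution view: A(1), A(2) -> X, X -> B(1), B(2) -> Y, Y -> C(1), C(2).
a_parts = partition(data_set_a, 2)
b_parts = run_parallel(operator_x, a_parts)
c_parts = run_parallel(operator_y, b_parts)

data_set_c = sorted(r for part in c_parts for r in part)
```

The program is written once against whole data sets; the parallelism only changes how many partitions and operator instances exist at run time. Both example operators work record by record, so no data has to move between partitions. Operators such as Join or Reduce would additionally require records to be shipped between instances.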
Rich Set of Operators
[Figure: example dataflow with the nodes Source, Source, Map, Map, Reduce, Reduce, Join, Iterate, Sink]
Operators:
Map, Iterate, Project
Reduce, Delta Iterate, Aggregate
Join, Filter, Distinct
CoGroup, FlatMap, Vertex Update
Union, GroupReduce, Accumulators
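The operator list includes both Iterate and Delta Iterate. As a rough illustration of the idea behind a delta iteration (only elements that changed in the previous step are processed again), here is a plain-Python sketch that labels connected components. It does not use the Flink API; the function name and the example graph are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def connected_components(vertices: Iterable[int],
                         edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Label each vertex with the smallest vertex id in its component."""
    vertices = list(vertices)
    neighbors: Dict[int, Set[int]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: the current component id of every vertex.
    solution = {v: v for v in vertices}
    # Workset: vertices whose id changed in the previous step.
    workset = set(vertices)

    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                # Propagate the smaller id and remember that n changed.
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    next_workset.add(n)
        workset = next_workset  # unchanged vertices are not revisited

    return solution


components = connected_components(
    vertices=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

A bulk iteration would recompute the label of every vertex in every step. Here the workset shrinks as parts of the graph converge, and the loop ends when no label changes.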
Data Flow Flink Program
Program Compiler
Runtime: Hash- and sort-based out-of-core operator implementations, memory management
Flink Optimizer: Picks data shipping and local strategies, operator order
Execution Plan
Job Graph, Execution Graph
Parallel Runtime: Task scheduling, network data transfers, resource allocation
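To give a feel for what "picking a data shipping strategy" can mean, the following Python sketch chooses between broadcasting one join input and repartitioning both. The cost model is a deliberately simplified assumption (estimated bytes sent over the network) and is not the Flink optimizer's actual cost model; the function and strategy names are hypothetical.

```python
from typing import Tuple


def choose_join_shipping(left_bytes: int, right_bytes: int,
                         parallelism: int) -> Tuple[str, int]:
    """Pick the join shipping strategy with the lowest estimated network cost."""
    candidates = {
        # Broadcasting copies the whole input to every parallel instance.
        "broadcast-left": left_bytes * parallelism,
        "broadcast-right": right_bytes * parallelism,
        # Repartitioning sends each record of both inputs at most once.
        "repartition-both": left_bytes + right_bytes,
    }
    strategy = min(candidates, key=candidates.get)
    return strategy, candidates[strategy]


# A tiny input joined with a large one: broadcasting the tiny side is cheapest.
small_vs_large = choose_join_shipping(10, 10_000, parallelism=8)

# Two inputs of similar size: repartitioning both is cheapest.
similar_sizes = choose_join_shipping(5_000, 6_000, parallelism=8)
```

The point of a declarative system is that the user writes the same join in both cases, and the choice is made automatically from size estimates.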
EU-wide Big Data Value Public-Private Partnership
European companies already provide, to some extent, services and solutions along the Big Data Value Chain
However, in both business and science, data use is handled in a fragmented way
Actors along the Big Data Value Chain should cooperate and form the basis of a strong and vibrant data-driven ecosystem to maximise the value created from Big Data
The EU has announced a PPP to bring relevant actors from the Big Data Value Chain together – bigdatavalue.eu
Social & Economic Benefits
The 5 Dimensions of Big Data
Legal Dimension: Ownership, Copyright/IPR, Liability, Insolvency, Privacy
Social Dimension: User Behaviour, Societal Impact, Collaboration
Economic Dimension: Business Models, Benchmarking, Open Source, Deployment Models, Information Pricing
Technology Dimension: Scalable Data Processing, Signal Processing, Statistics/ML, Linguistics, HCI/Visualization
Application Dimension: Data-driven Decision Making, Risk Management, Competitive Intelligence, Digital Humanities, Verticals, Industry 4.0
Systems, Frameworks, Skills, Best-Practices, Tools
Join Us!
We are currently recruiting: Postdoctoral Research Fellows and PhD students
Open call coming soon for a W1 Junior Professorship on Big Data Management
Current open PhD/Postdoc positions at http://www.dima.tu-berlin.de/menue/jobs/
Direct your application to: [email protected]