© 2013 Berlin Big Data Center • All Rights Reserved 1 © Volker Markl
Big Data - Challenges and Opportunities
Volker Markl
http://www.user.tu-berlin.de/marklv/
http://www.dima.tu-berlin.de
http://www.dfki.de/web/forschung/iam
http://bbdc.berlin
Talk based on the vision paper: Markl, V.: “On Declarative Data Analysis and Data Independence in the Big Data Era,” PVLDB 7(13): 1730-1733 (2014)
Agenda
a technology-driven perspective
• About Data Management, Big Data, and Data Scientists
• Apache Flink – the Next Generation of Big Data Analytics Technologies
• About the Berlin Big Data Center
Data & Analysis: Increasingly Complex!
Data
• Volume: data volume too large
• Velocity: data rate too fast
• Variability: data too heterogeneous
• Veracity: data too uncertain
Analysis
• Reporting: aggregation, selection
• Ad-Hoc Queries: SQL, XQuery
• ETL/ELT: MapReduce
• Data Mining: MATLAB, R, Python
• Predictive/Prescriptive: MATLAB, R, Python
[Figure labels: ML (algorithms), DM (scalability)]
Deep Analysis of Big Data
[Figure: one axis runs from Small Data to Big Data (3V), the other from Simple Analysis to Deep Analytics]
Databases ➤ “Big Data”
• Tables ➤ Tables and unstructured files
– Schema on read
• Parallel ➤ More parallel, commodity, shared clusters
– Mid-query fault tolerance, resource allocation
• SQL ➤ SQL and Java, Scala, Python, you name it
– General object manipulation
• Data Warehousing ➤ Logs, ML, Graphs, also DW
– Iterative processing, user-defined functions
Data Analysis with Databases
(… | R | Python | Matlab | Fortran | …) + Database = Data Analysis
• Collect data, store in database
• Extract data from database
• Analyze with tool of choice
• Store results in database
• Need schema
• Memory-bound processing
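The workflow on this slide (collect and store, extract, analyze with a tool of choice, store results) can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names `measurements` and `results`, the function `run_workflow`, and the sample rows are hypothetical and do not come from the talk.

```python
import sqlite3
import statistics


def run_workflow(rows):
    """Collect -> store -> extract -> analyze -> store results (toy example)."""
    con = sqlite3.connect(":memory:")

    # Collect data, store in database: the schema has to be declared up front.
    con.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
    con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

    # Extract data from database: all rows are pulled into client memory,
    # so the analysis step is bounded by the memory of the client.
    extracted = con.execute("SELECT sensor, value FROM measurements").fetchall()

    # Analyze with the tool of choice (here: the statistics module).
    by_sensor = {}
    for sensor, value in extracted:
        by_sensor.setdefault(sensor, []).append(value)
    means = {sensor: statistics.mean(values) for sensor, values in by_sensor.items()}

    # Store results in database.
    con.execute("CREATE TABLE results (sensor TEXT, mean_value REAL)")
    con.executemany("INSERT INTO results VALUES (?, ?)", sorted(means.items()))
    stored = con.execute(
        "SELECT sensor, mean_value FROM results ORDER BY sensor"
    ).fetchall()
    con.close()
    return stored
```

Calling `run_workflow([("a", 1.0), ("a", 3.0), ("b", 5.0)])` returns the per-sensor means read back from the `results` table. The data leaves the database for the analysis step and comes back afterwards, which is the round trip the slide describes.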
Open Source Big Data Stack
[Figure: layered stack]
• High Level Language Compiler: SQL, HiveQL, PigLatin, …
• Apache Hadoop: M/R Job (Java, Scala, …)
• HBase: Put/Get ops
• Hadoop Distributed File System (HDFS): Files/Bytes
• Cluster Management (Mesos, YARN, ZooKeeper, …)
• Apache Flink, GraphLab
Where is the Optimizer?
What about standards?
“Data Scientist” – “Jack of All Trades!”
Application: Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Data Science:
• Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Decoupling, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Stochastic Gradient Descent, Regression, Statistics, Hashing
• Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Data Flow, Hardware Adaptation, Indexing, Resource Management, NF2 / XQuery, Data Warehouse/OLAP, Real-Time
Big Data Analytics Requires Systems Programming
• R/Matlab: 3 million users
• Hadoop: 100,000 users
People with Big Data Analytics Skills
• Data Analysis: Statistics, Algebra, Optimization, Machine Learning, NLP, Signal Processing, Image Analysis, Audio and Video Analysis, Information Integration, Information Extraction, Data Value Chain, Data Analysis Process, Predictive Analytics
• Systems Programming: Indexing, Parallelization, Communication, Memory Management, Query Optimization, Efficient Algorithms, Resource Management, Fault Tolerance, Numerical Stability
Big Data is now where database systems were in the 70s (prior to relational algebra, query optimization and a SQL standard)!
Declarative languages to the rescue!
Introducing Apache Flink – A Success Story from Berlin
• Project started under the name “Stratosphere” in late 2008 as a DFG-funded research unit comprising TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam.
• Apache Open Source Incubation since April 2014, Apache Top Level Project since December 2014
• Fast-growing community of open source users and developers in Europe and worldwide, in academia (e.g., SICS/KTH, INRIA, ELTE) and companies (e.g., ResearchGate, Spotify, Amadeus)
More information: http://flink.apache.org
Apache Flink: General Purpose Programming + Database Execution
Draws on Database Technology
• Declarativity
• Query optimization
• Robust out-of-core
Draws on MapReduce Technology
• Scalability
• User-defined functions
• Complex data types
• Schema on read
Add
• Iterations
• Advanced Dataflows
• General APIs
Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014 http://flink.apache.org
Apache Flink Stack
http://flink.apache.org
[Figure: from Flink Program to parallel execution]
• Flink Program, Program Compiler, Data Flow
• Flink Optimizer: picks data shipping and local strategies, operator order
• Execution Plan, Job Graph, Execution Graph
• Runtime: hash- and sort-based out-of-core operator implementations, memory management
• Parallel Runtime: task scheduling, network data transfers, resource allocation
Built-in vs. driver-based looping
• Driver-based: loop outside the system, in driver program. Iterative program looks like many independent jobs.
• Built-in (Flink): dataflows with feedback edges. System is iteration-aware, can optimize the job.
[Figure: in both variants a Client and a sequence of Steps; the built-in variant shows one dataflow of map, join, red., join]
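The difference between the two looping styles can be sketched in a few lines of Python. `ToyEngine` is a hypothetical stand-in that only counts job submissions; it is not Flink's API, and the names `run_job`, `run_iterative_job` and `halve` are made up for this illustration.

```python
class ToyEngine:
    """Hypothetical stand-in for a dataflow system; it only counts submissions."""

    def __init__(self):
        self.jobs_submitted = 0

    def run_job(self, step, data):
        # One client round trip per call: the engine sees an isolated step.
        self.jobs_submitted += 1
        return step(data)

    def run_iterative_job(self, step, data, iterations):
        # One client round trip in total: the loop is part of the job
        # (a feedback edge), so the engine knows about the whole iteration.
        self.jobs_submitted += 1
        for _ in range(iterations):
            data = step(data)
        return data


def halve(values):
    """Example step function applied in every iteration."""
    return [v / 2 for v in values]


def driver_based(engine, data, iterations):
    # Loop outside the system, in the driver program.
    for _ in range(iterations):
        data = engine.run_job(halve, data)
    return data


def built_in(engine, data, iterations):
    # Loop inside the system, submitted as a single iterative job.
    return engine.run_iterative_job(halve, data, iterations)
```

Both variants compute the same result, but the driver-based loop submits one job per step, while the built-in loop submits a single job. Only in the second case does the system see the whole iteration and get a chance to optimize across steps.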
Effect of optimization
• Run on a sample on the laptop: Execution Plan A
• Run on large files on the cluster: Execution Plan B
• Run a month later after the data evolved: Execution Plan C
Plans differ in: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort
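Why the same program can end up with different execution plans can be illustrated with a deliberately simple cost model for one of the listed choices, partition vs. broadcast. The function `pick_shipping_strategy` and its cost formulas are assumptions made for this sketch; they are not Flink's actual cost model.

```python
def pick_shipping_strategy(left_rows, right_rows, parallelism):
    """Pick the cheaper way to ship data for a parallel join (toy cost model)."""
    # Partition: every row of both inputs is sent over the network once.
    partition_cost = left_rows + right_rows
    # Broadcast: the smaller input is replicated to every parallel instance,
    # the larger input is not moved at all.
    broadcast_cost = min(left_rows, right_rows) * parallelism
    if broadcast_cost < partition_cost:
        return "broadcast", broadcast_cost
    return "partition", partition_cost
```

With a large and a tiny input, for example `pick_shipping_strategy(1_000_000, 100, 8)`, the sketch chooses to broadcast the tiny input. Once both inputs are large, for example `pick_shipping_strategy(1_000_000, 900_000, 8)`, it switches to partitioning. The program stays the same; only the data statistics and the degree of parallelism change.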
Flink Optimizer: Transitive Closure
• What you write is not what is executed
• No need to hardcode execution strategies
• Flink Optimizer decides:
– Pipelines and dam/barrier placement
– Sort- vs. hash-based execution
– Data exchange (partition vs. broadcast)
– Data partitioning steps
– In-memory caching
[Figure, program as written: paths read from HDFS; Iterate with a step function of Join, Union, Distinct producing new paths; replace]
[Figure, optimized plan: Hash Partition on [0]; Hash Partition on [1]; Hybrid Hash Join; Loop-invariant data cached in memory; Group Reduce (Sorted (on [0])); Co-locate JOIN + UNION; Hash Partition on [1]; Forward; Co-locate DISTINCT + JOIN]
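The logical program behind this slide (iterate a step function of join, union and distinct until no new paths appear) can be written as a short single-machine sketch in plain Python. It says what to compute and nothing about partitioning, join algorithms or caching, which is the part the optimizer fills in. The function name `transitive_closure` and the use of Python sets are choices made for this illustration, not Flink code.

```python
def transitive_closure(edges):
    """Iterate join + union + distinct until a fixpoint is reached."""
    edges = set(edges)   # loop-invariant input, reused in every iteration
    paths = set(edges)   # using a set makes "distinct" implicit
    while True:
        # Join: extend every known path (a, b) with an edge (b, c).
        new_paths = {(a, c) for (a, b) in paths for (b2, c) in edges if b == b2}
        # Union + distinct: merge the new paths with the previous result.
        next_paths = paths | new_paths
        if next_paths == paths:
            # No new paths were found, so the closure is complete.
            return paths
        # Replace the partial solution and run the step function again.
        paths = next_paths
```

For the chain `[(1, 2), (2, 3), (3, 4)]` the function also derives `(1, 3)`, `(2, 4)` and `(1, 4)`. The `edges` set plays the role of the loop-invariant data that the optimized plan on the slide keeps cached in memory.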
Introducing the Berlin Big Data Center
http://bbdc.berlin
• Institutions: Fritz-Haber-Institut (Max-Planck-Gesellschaft); ZIB Berlin; DFKI; Beuth Hochschule; Technische Universität Berlin, Data Analytics Lab
• People: Anja Feldmann, Klaus R. Müller, Volker Markl, Odej Kao, Thomas Wiegand, Hans Uszkoreit, Alexander Reinefeld, Christof Schütte, Matthias Scheffler, Stefan Edlich
• Areas: Data Management; Machine Learning; Computer Networks; Distributed Systems; Video Mining; Language Technology; Material Science; Software Engineering; File Systems, Supercomputing; Information-based Medicine
Machine Learning + Data Management = X
• ML: Feature Engineering, Representation, Algorithms (SVM, GPs, etc.)
• DM: Declarative Languages, Automatic Adaptation, Scalable Processing
• Technology X (declarative): think ML algorithms in a scalable way; process iterative algorithms in a scalable way
• Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Isolation, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Regression, Statistics, Hashing
• Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Dataflow, Hardware Adaptation, Indexing, Resource Management, NF2 / XQuery, Data Warehouse/OLAP
Goal: Data Analysis without System Programming!
X = Big Data Analytics – System Programming! (“What”, not “How”)
[Figure: Data Analyst, Technology X, Machine]
• Description of “What”? (declarative specification)
• Description of “How”? (state of the art in scalable data analysis): Hadoop, MPI
Technology X
• Declarative specification
• Automatic optimization, parallelization and hardware adaptation of dataflow and control flow with user-defined functions, iterations and distributed state
• Think ML algorithms in a scalable way
• Process iterative algorithms in a scalable way
• Scalable algorithms and debugging
• Analysis of “data in motion”
• Multimodal analysis
• Numerical stability
• Algorithmic fault tolerance
• Consistent intermediate results
• Software-defined networking
Expected effects
• Larger human base of “data scientists”
• Reduction of “human” latencies
• Cost reduction
Application Examples: Technology Drivers and Validation
• Application examples: marketplace for information; material science; health
• science-based; society-based; economics-based
• hierarchical numerical simulation data; text data flows; multimodal data
• windowing; numerical stability; integration: video, images, text
• Technology X: declarative; think ML algorithms in a scalable way; process iterative algorithms in a scalable way
The 5 Dimensions of Big Data
• Legal Dimension: Ownership, Copyright/IPR, Liability, Insolvency, Privacy
• Social Dimension: User Behaviour, Societal Impact, Collaboration
• Economic Dimension: Business Models, Benchmarking, Open Source, Deployment Models, Information Pricing
• Technology Dimension: Scalable Data Processing, Signal Processing, Statistics/ML, Linguistics, HCI/Visualization
• Application Dimension: Data-driven Decision Making, Risk Management, Competitive Intelligence, Digital Humanities, Verticals, Industry 4.0
Systems, Frameworks, Skills, Best Practices, Tools
http://www.bbdc.berlin/media-press/blog-articles/#c1896
Sovereign Handling of Data – “Enabling Technologies”
• Broadening the base of “data scientists” through technology development → increased usability
• Declarative specification of data analyses
• Debugging of complex dataflows
• Reduction of human latency and system latency → increased efficiency
• Usability through automatic optimization of complex data analyses
• Efficient execution of deep analyses on large data streams
• Information visualization
• Cost reduction
Current open PhD/Postdoc positions at http://www.dima.tu-berlin.de/menue/jobs/ http://www.dfki.de/web/forschung/iam Direct your application to: [email protected]