big data - challenges and opportunities · big data analytics requires systems programming...

1 © Volker Markl

© 2013 Berlin Big Data Center • All Rights Reserved 1 © Volker Markl

Big Data - Challenges and Opportunities

Volker Markl

http://www.user.tu-berlin.de/marklv/

http://www.dima.tu-berlin.de

http://www.dfki.de/web/forschung/iam

http://bbdc.berlin

Talk based on the Vision Paper: Markl, V.: On “Declarative Data Analysis and Data Independence in the Big Data Era“ PVLDB 7(13): 1730-1733 (2014)




http://www.dima.tu-berlin.de/





http://bbdc.berlin/

2 © Volker Markl

• About Data Management, Big Data, and Data Scientists

• Apache Flink – the Next Generation of Big Data Analytics Technologies

• About the Berlin Big Data Center

a technology-driven perspective

Agenda

3 © Volker Markl

3 © 2013 Berlin Big Data Center • All Rights Reserved

3 © Volker Markl

ML

ML

alg

orith

ms

alg

orith

ms

data too uncertain Veracity Data Mining MATLAB, R, Python

Predictive/Prescriptive MATLAB, R, Python

Data & Analysis: Increasingly Complex!

data volume too large Volume

data rate too fast Velocity

data too heterogeneous Variability

Data

Reporting aggregation, selection

Ad-Hoc Queries SQL, XQuery

ETL/ELT MapReduce

Analysis

DM

DM

sca

lab

ility

sca

lab

ility

4 © Volker Markl


4 © Volker Markl

Deep Analysis of Big Data

Small Data Big Data (3V)

De

ep

An

aly

tics

Sim

ple

An

aly

sis

5 © Volker Markl

5 5 © Volker Markl

Databases ➤ “Big Data”

• Tables ➤ Tables and unstructured files

– Schema on read

• Parallel ➤ More parallel, commodity, shared clusters

– Mid-query fault tolerance, resource allocation

• SQL ➤ SQL and Java, Scala, Python, you name it

– General object manipulation

• Data Warehousing ➤ Logs, ML, Graphs, also DW

– Iterative processing, user-defined functions

5

6 © Volker Markl

6 6 © Volker Markl

Data Analysis with Databases

… | R | Python | Matlab | Fortran | …

+

Database

=

Data Analysis

6

Extract data from

database

Analyze with tool of choice

Collect data, store in

database

Store results in database

Need schema

Memory bound

processing

7 © Volker Markl

7 7 © Volker Markl

HBase

Open Source Big Data Stack

7

Apache Hadoop

Hadoop Distributed File System (HDFS)

High Level Language Compiler

SQL HiveQL PigLatin …

M/R Job (Java, Scala, …)

Put/Get ops

Files/Bytes

Cluster Management (Mesos, YARN, ZooKeeper…)

Where is the Optimizer?

Apache Flink

GraphLab

What about standards?

8 © Volker Markl


8 © Volker Markl

Application

Data

Science

Control Flow

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Decoupling

Convergence

Monte Carlo

Mathematical Programming

Linear Algebra

Stochastic Gradient Descent

Regression

Statistics

Hashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra / SQL

Scalability

Data Analysis Language

Compiler

Memory Management

Memory Hierarchy

Data Flow

Hardware Adaptation

Indexing

Resource Management

NF2 /XQuery

Data Warehouse/OLAP

“Data Scientist” – “Jack of All Trades!” Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)

Real-Time

9 © Volker Markl

9 9 © Volker Markl

Big Data Analytics Requires Systems Programming

R/Matlab:

3 million users

Hadoop:

100,000

users

Data Analysis

Statistics

Algebra

Optimization

Machine Learning

NLP

Signal Processing

Image Analysis

Audio-,Video Analysis

Information Integration

Information Extraction

Data Value Chain

Data Analysis Process

Predictive Analytics

Indexing

Parallelization

Communication

Memory Management

Query Optimization

Efficient Algorithms

Resource Management

Fault Tolerance

Numerical Stability Big Data is now where database systems were in the

70s (prior to relational algebra, query optimization

and a SQL-standard)!

People with Big Data

Analytics Skills

Declarative languages to the rescue!

10 © Volker Markl

10 10 © Volker Markl

Introducing Apache Flink – A Success Story

from Berlin

• Project started under the name “Stratosphere” late 2008 as a DFG funded research unit, comprised of TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam.

• Apache Open Source Incubation since April 2014, Apache Top Level Project since December 2014

• Fast growing community of open source users and developers in Europe and worldwide, in academia (e.g., SICS/KTH, INRIA, ELTE) and companies (e.g., Researchgate, Spotify, Amadeus)

More information: http://flink.apache.org

11 © Volker Markl

• Declarativity

• Query optimization

• Robust out-of-core

• Scalability

• User-defined

functions

• Complex data types

• Schema on read

• Iterations

• Advanced

Dataflows

• General APIs

11

Draws on

Database Technology

Draws on

MapReduce Technology

Add

Apache Flink: General Purpose

Programming + Database Execution

Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014 http://flink.apache.org

12 © Volker Markl


12 © Volker Markl

Apache Flink Stack

http://flink.apache.org

http://flink.incubator.apache.org

http://flink.incubator.apache.org

13 © Volker Markl 13 © Volker Markl

13

Data Flow Flink Program

Program Compiler

Runtime Hash- and sort-based out-of-core operator implementations, memory management

Flink Optimizer Picks data shipping and local strategies, operator order

Execution Plan

Job Graph Execution Graph

Parallel Runtime Task scheduling, network data transfers, resource allocation

14 © Volker Markl


Built-in vs. driver-based looping

Step Step Step Step Step

Client

Step Step Step Step Step

Client

map

join

red.

join

Loop outside the system, in driver program Iterative program looks like many independent jobs

Dataflows with feedback edges System is iteration-aware, can optimize the job

14

Flink

15 © Volker Markl


15 © Volker Markl

Effect of optimization

15

Run on a sample on the laptop

Run a month later after the data evolved

Hash vs. Sort Partition vs. Broadcast Caching Reusing partition/sort Execution

Plan A

Execution Plan B

Run on large files on the cluster

Execution Plan C

16 © Volker Markl


Flink Optimizer Transitive Closure

16

HDFS new

Paths

paths

Join Union Distinct

Step function

Iterate Iterate replace

• What you write is not what is executed

• No need to hardcode execution strategies

• Flink Optimizer decides: – Pipelines and dam/barrier placement

– Sort- vs. hash- based execution

– Data exchange (partition vs. broadcast)

– Data partitioning steps

– In-memory caching

Hash Partition on [0]

Hash Partition on [1]

Hybrid Hash Join

Loop-invariant data cached in memory

Group Reduce (Sorted (on [0]))

Co-locate JOIN + UNION Hash Partition on [1]

Forward

Co-locate DISTINCT + JOIN

18 © Volker Markl


18 © Volker Markl

Fritz-Haber-Institut,

Max-Planck-Gesellschaft

ZIB Berlin

DFKI

Beuth Hochschule

Technische Universität Berlin, Data Analytics Lab

Anja Feldmann

Klaus R. Müller

Volker Markl

Odej Kao

Thomas Wiegand

Hans Uszkoreit

Alexander

Reinefeld

Christof

Schütte

Matthias Scheffler

Stefan Edlich

Data Management Machine

Learning

Computer

Networks

Distributed

Systems

Video Mining Language Technology

Material Science

Software Engineering

File Systems,

Supercomputing

Information-

based Medicine

19 © Volker Markl


19 © Volker Markl

ML DM Feature Engineering

Representation

Algorithms (SVM, GPs, etc.)

Think ML-algorithms

in a scalable way

Process iterative

algorithms

in a scalable way

Technology X

Machine Learning + Data Management = X

Declarative Languages

Automatic Adaption

Scalable processing

Control flow

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Isolation Convergence

Monte Carlo

Mathematical Programming

Linear Algebra

Regression

Statistic

Hashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra/SQL

Scalability

Data Analysis Language

Compiler

Memory Management

Memory Hierarchy

Dataflow

Hardware adaption

Indexing

Resource Management

NF2 /XQuery

Data Warehouse/OLAP

declarative

Goal: Data Analysis without

System Programming!

20 © Volker Markl


X = Big Data Analytics – System Programming!

(„What“, not „How“)

Technology X Think ML-algorithms

in a scalable way

Analysis of

“data in motion”

Multimodal

analysis

Numerical

stability

Declarative specification

Automatic optimization,

parallelization and hardware adaption

of dataflow and control flow

with user-defined functions,

iterations and distributed state

Scalable algorithms and debugging

process

Iterative algorithms

in a scalable way

Algorithmic

fault tolerance

Consistent

intermediate results

Software-defined

networking

Description of „How“?

(State of the art in scalable data analysis)

Hadoop, MPI

Larger human base of „data scientists“

Reduction of „human“ latencies

Cost reduction

Description of „What“?

(declarative specification)

Technology X

Data Analyst Machine

21 © Volker Markl


21 © Volker Markl

Application example: Marketplace

for

information

Technology X

science-based society-based economics-based

hierarchical

numerical

simulation data

text data flows multimodal data

windowing

numerical

stability

integration:

video, images, text

Think ML-algorithms

in a scalable way

Declarative

Process iterative algorithms

in a scalable way

Application Examples:

Technology Drivers and Validation

Application Example:

Material

science

Application Example:

Health

22 © Volker Markl


Legal Dimension

Social Dimension

Economic Dimension

Technology Dimension

Application Dimension

Business Models Benchmarking Open Source Deployment Models Information Pricing

Scalable Data Processing Signal Processing Statistics/ML Linguistics HCI/Visualization

The 5 Dimensions of Big Data Ownership Copyright/IPR Liability Insolvancy Privacy

User Behaviour Societal Impact Collaboration

Data-driven Decision Making Risk Management Competitive Intelligence Digital Humanities Verticals Industry 4.0 Systems

Frameworks Skills

Best-Practices Tools

http://www.bbdc.berlin/media-press/blog-articles/#c1896

23 © Volker Markl


Souveräner Umgang mit Daten –

„Enabling Technologies“ • Verbreiterung der Basis von „Data Scientists“ durch

Technologieentwicklung → Erhöhung der Benutzbarkeit

• Deklarative Spezifikation von Datenanalysen

• Debugging von komplexen Datenflüssen

• Reduktion der menschlichen Latenz und Systemlatenz → Erhöhung der Effizienz

• Benutzbarkeit durch automatische Optimierung von komplexen Datenanalysen

• Effiziente Ausführung von tiefen Analysen auf großen Datenströmen

• Informationsvisualisierung

• Kostenreduktion

24 © Volker Markl


Current open PhD/Postdoc positions at http://www.dima.tu-berlin.de/menue/jobs/ http://www.dfki.de/web/forschung/iam Direct your application to: [email protected]

http://www.dima.tu-berlin.de/menue/jobs/








big data - challenges and opportunities · big data analytics requires systems programming...

Documents