
Page 1: Big Data Challenges and Opportunities

Volker Markl

http://www.user.tu-berlin.de/marklv/

Talk based on the Vision Paper: Markl, V.: “On Declarative Data Analysis and Data Independence in the Big Data Era,” PVLDB 7(13): 1730-1733 (2014)

© Volker Markl

© 2013 Berlin Big Data Center • All Rights Reserved

Page 2: More and more data is available to science and business!

Drivers: Cloud Computing, Internet of Services, Internet of Things, Cyber-physical Systems

Underlying Trends: Connectivity, Collaboration, Computer-generated data

video streams

web archives

sensor data

audio streams

RFID data

simulation data

Page 3: Data & Analysis: More and More Complex!

Data

• Volume: data volume too large
• Velocity: data rate too fast
• Variability: data too heterogeneous
• Veracity: data too uncertain

Analysis

• Reporting: aggregation, selection
• Ad-Hoc Queries: SQL, XQuery
• ETL/Integration: map/reduce
• Data Mining: Matlab, R, Python
• Predictive/Prescriptive: Matlab, R, Python

[Figure labels: ML, DM, scalability, algorithms]

Page 4: Data-driven applications …

• lifecycle management
• home automation
• health
• water management
• market research
• traffic management
• energy management
• information marketplaces
• e-sciences

… will revolutionize decision making in business and the sciences!

… have great economic potential!

Page 5: Deep Analysis of Big Data

[Figure axes: Small Data vs. Big Data (3V); Simple Analysis vs. Deep Analytics]

Page 6: Databases ➤ “Big Data”

• Tables ➤ Tables and unstructured files

– Schema on read

• Parallel ➤ More parallel, commodity, shared clusters

– Mid-query fault tolerance, resource allocation

• SQL ➤ SQL and Java, Scala, Python, you name it

– General object manipulation

• Data Warehousing ➤ Logs, ML, Graphs, also DW

– Iterative processing, user-defined functions
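The “schema on read” item above can be illustrated with a minimal Python sketch. The sample lines, the field names, and the `parse` helper are hypothetical and made up for illustration; the point is only that raw text is stored as-is and a structure is imposed when a query reads it.

```python
import json

# Raw records are stored without a declared schema (hypothetical sample data).
RAW_LINES = [
    '{"user": "a", "bytes": "120", "path": "/index"}',
    '{"user": "b", "bytes": "80"}',
    'not json at all',
]


def parse(line):
    """Apply a schema while reading: pick fields, cast types, skip bad rows."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed input is dropped at read time
    if not isinstance(record, dict) or "user" not in record:
        return None
    return {"user": str(record["user"]), "bytes": int(record.get("bytes", 0))}


def read_with_schema(lines):
    """Yield only the rows that fit the schema chosen by this particular query."""
    for line in lines:
        row = parse(line)
        if row is not None:
            yield row


rows = list(read_with_schema(RAW_LINES))
total_bytes = sum(row["bytes"] for row in rows)
```

A different query could read the same raw lines with a different `parse` function, which is the flexibility that a fixed table schema does not offer.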


Page 7: “Data Scientist” – “Jack of All Trades!”

Figure labels: Application; Data Science

Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)

Topics: Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Decoupling, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Stochastic Gradient Descent, Regression, Statistics, Hashing, Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Data Flow, Hardware Adaptation, Indexing, Resource Management, NF2 / XQuery, Data Warehouse/OLAP, Real-Time

Page 8: Big Data Analytics Requires Systems Programming

R/Matlab: 3 million users

Hadoop: 100,000 users

Data Analysis: Statistics, Algebra, Optimization, Machine Learning, NLP, Signal Processing, Image Analysis, Audio and Video Analysis, Information Integration, Information Extraction, Data Value Chain, Data Analysis Process, Predictive Analytics

Systems Programming: Indexing, Parallelization, Communication, Memory Management, Query Optimization, Efficient Algorithms, Resource Management, Fault Tolerance, Numerical Stability

People with Big Data Analytics Skills

Big Data is now where database systems were in the 70s (prior to relational algebra, query optimization and a SQL standard)!

Declarative languages to the rescue!
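As a rough illustration of what “declarative” means here, the sketch below computes the same per-sensor average twice: once with a hand-written loop, where the programmer fixes the execution strategy, and once as a SQL query, where only the desired result is stated and the engine chooses the plan. The sample data and function names are hypothetical; SQLite from the Python standard library stands in for a database engine and is not a Big Data system.

```python
import sqlite3

# Hypothetical sample data: (sensor, reading) pairs.
READINGS = [("s1", 10.0), ("s2", 4.0), ("s1", 14.0), ("s2", 6.0), ("s3", 1.0)]


def average_imperative(readings):
    """Hand-written plan: the programmer fixes how the grouping is executed."""
    sums, counts = {}, {}
    for sensor, value in readings:
        sums[sensor] = sums.get(sensor, 0.0) + value
        counts[sensor] = counts.get(sensor, 0) + 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def average_declarative(readings):
    """Declarative version: state the result wanted; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
    rows = conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
    ).fetchall()
    conn.close()
    return dict(rows)


imperative = average_imperative(READINGS)
declarative = average_declarative(READINGS)
```

Both calls return the same mapping. Only the first one would need rewriting if the data had to be sorted, indexed, or spread over several machines.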

Page 9: Further Challenges for Deep Analysis on Big Data

• Low Latency: Trading off virtualization, heterogeneous CPUs, new hardware
• Evolving Datasets: First results fast, stream mining
• Advanced Data Analysis Programs: Declarative specification and optimization of programs with iteration and state
• Engines: One size does not fit all; pluggable engines and libraries
• Multi-tenancy: Continuous, workload-aware optimizations
• Adaptive Seamless Deployment: Scale from laptop to cluster
• Optimizing Access on Raw Data: In-situ data analysis

Markl, V.: “On Declarative Data Analysis and Data Independence in the Big Data Era,” PVLDB 7(13): 1730-1733 (2014)

Page 10: Introducing Apache Flink

Page 11: Apache Flink: General Purpose Programming + Database Execution

Draws on Database Technology

• Declarativity
• Query optimization
• Robust out-of-core

Draws on MapReduce Technology

• Scalability
• User-defined functions
• Complex data types
• Schema on read

Add

• Iterations
• Advanced Dataflows
• General APIs

Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014

Page 12: Apache Flink Stack

http://flink.incubator.apache.org

Page 13: Apache Flink Project History

• Project started under the name “Stratosphere” in late 2008 as a DFG-funded research unit comprising TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam

• Latest release adds support for YARN, offers Java and Scala APIs

• Fast-growing community of open source users and developers in Europe and worldwide

Page 14: Data Sets and Operators

[Figure: Program view: Data Set A → Operator X → Data Set B → Operator Y → Data Set C. Parallel Execution view: partitions A (1), A (2) → X → B (1), B (2) → Y → C (1), C (2).]

Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics,” VLDB Journal 5/2014
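The relationship between the program view and the parallel execution view can be sketched in plain Python. This is not the Flink API: `split`, `operator_x`, `operator_y`, and `run_parallel` are hypothetical names, and a thread pool only illustrates the structure (one operator instance per partition), not real cluster parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parallelism):
    """Split a data set into `parallelism` partitions, e.g. A -> A(1), A(2)."""
    return [data[i::parallelism] for i in range(parallelism)]


def operator_x(partition):
    """Hypothetical operator X: square every element."""
    return [value * value for value in partition]


def operator_y(partition):
    """Hypothetical operator Y: keep even elements only."""
    return [value for value in partition if value % 2 == 0]


def run_parallel(operator, partitions):
    """Apply one operator instance per partition, as in the parallel view."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(operator, partitions))


data_set_a = list(range(1, 9))                # logical data set A
a_parts = split(data_set_a, 2)                # A(1), A(2)
b_parts = run_parallel(operator_x, a_parts)   # B(1), B(2)
c_parts = run_parallel(operator_y, b_parts)   # C(1), C(2)
data_set_c = sorted(v for part in c_parts for v in part)
```

The program is written once against the logical data sets A, B, and C; the degree of parallelism is a separate choice made at execution time.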

Page 15: Rich Set of Operators

[Figure: example dataflow built from Source, Map, Reduce, Join, Iterate, and Sink operators]

Operators: Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex Update, Accumulators
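To give a feel for what an iteration operator with a changing working set does, here is a plain-Python analogue of a delta iteration that computes connected components. It is a sketch under assumptions, not Flink code: the function name, the graph, and the in-place update style are chosen for illustration only. The idea is that each round touches only the vertices whose label changed in the previous round.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    Mimics a delta iteration: a solution set (labels) plus a workset holding
    only the vertices whose label changed in the previous round.
    """
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {v: v for v in vertices}   # solution set
    workset = set(vertices)             # initially every vertex counts as changed
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if labels[v] < labels[n]:
                    labels[n] = labels[v]   # propagate the smaller label
                    next_workset.add(n)     # n must notify its neighbors next
        workset = next_workset              # empty workset means convergence
    return labels


# Hypothetical graph with components {1, 2, 3}, {4, 5}, and {6}.
components = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

A bulk iteration would instead recompute every label in every round; shrinking the workset is what makes the delta variant cheaper on data where most of the state converges early.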

Page 16

[Figure labels: Flink Program, Program Compiler, Data Flow, Flink Optimizer, Execution Plan, Job Graph, Execution Graph]

• Flink Optimizer: picks data shipping and local strategies, operator order
• Runtime: hash- and sort-based out-of-core operator implementations, memory management
• Parallel Runtime: task scheduling, network data transfers, resource allocation
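The distinction between a data shipping strategy and a local strategy can be sketched with a toy join. Everything here is an assumption made for illustration: the cost model simply counts records sent over the network, the function names are hypothetical, and a real optimizer weighs many more factors. The sketch chooses between hash-repartitioning both inputs and broadcasting the smaller one, then runs a local hash join in each partition.

```python
from collections import defaultdict


def local_hash_join(left, right):
    """Local strategy: build a hash table on `left`, probe with `right`."""
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def hash_partition(records, parallelism):
    """Shipping strategy: route each record to a partition by its key."""
    parts = [[] for _ in range(parallelism)]
    for key, value in records:
        parts[hash(key) % parallelism].append((key, value))
    return parts


def choose_strategy(left_size, right_size, parallelism):
    """Toy cost model (an assumption): count the records sent over the network."""
    repartition_cost = left_size + right_size
    broadcast_cost = min(left_size, right_size) * parallelism
    return "broadcast" if broadcast_cost < repartition_cost else "repartition"


def distributed_join(left, right, parallelism):
    """Join with the cheaper shipping strategy; return (strategy, rows)."""
    strategy = choose_strategy(len(left), len(right), parallelism)
    if strategy == "repartition":
        left_parts = hash_partition(left, parallelism)
        right_parts = hash_partition(right, parallelism)
    elif len(left) <= len(right):
        # Broadcast the smaller left input; the right input stays split as-is.
        left_parts = [left] * parallelism
        right_parts = [right[i::parallelism] for i in range(parallelism)]
    else:
        left_parts = [left[i::parallelism] for i in range(parallelism)]
        right_parts = [right] * parallelism
    rows = []
    for left_part, right_part in zip(left_parts, right_parts):
        rows.extend(local_hash_join(left_part, right_part))
    return strategy, rows


# Hypothetical inputs: a small table of users and a larger table of clicks.
users = [(k, f"user{k}") for k in range(4)]
clicks = [(k % 4, f"click{k}") for k in range(100)]
strategy, joined = distributed_join(users, clicks, parallelism=4)
```

Both strategies produce the same join result; they differ only in how much data moves, which is why the choice can be left to an optimizer instead of being hard-coded in the program.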

Page 17: EU-wide Big Data Value Private Public Partnership

• European companies already provide, to some extent, services and solutions along the Big Data Value Chain
• However, both in businesses and in science, data use is handled in a fragmented way
• Actors along the Big Data Value Chain should cooperate and form the basis of a strong and vibrant data-driven ecosystem to maximise value creation from Big Data
• The EU has announced a PPP to bring relevant actors from the Big Data Value Chain together: bigdatavalue.eu

Social & Economic Benefits

Page 18: The 5 Dimensions of Big Data

• Legal Dimension: Ownership, Copyright/IPR, Liability, Insolvency, Privacy
• Social Dimension: User Behaviour, Societal Impact, Collaboration
• Economic Dimension: Business Models, Benchmarking, Open Source, Deployment Models, Information Pricing
• Technology Dimension: Scalable Data Processing, Signal Processing, Statistics/ML, Linguistics, HCI/Visualization
• Application Dimension: Data-driven Decision Making, Risk Management, Competitive Intelligence, Digital Humanities, Verticals, Industry 4.0

Systems, Frameworks, Skills, Best-Practices, Tools

Page 19: Join Us!

We are currently recruiting:

• Postdoctoral Research Fellows and PhD students
• Soon: open call for a W1 Junior Professorship on Big Data Management

Current open PhD/Postdoc positions at http://www.dima.tu-berlin.de/menue/jobs/

Direct your application to: [email protected]