Big Data and HPC: Two Worlds Apart or Synergetic Future?


Page 1:

Zellescher Weg

Willers-Bau A207

+49 351 - 463 - 35450

Wolfgang E. Nagel ([email protected])

Big Data and HPC: Two Worlds Apart or Synergetic Future?

Center for Information Services and High Performance Computing (ZIH)

Page 2:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 3:

Meaning of Big Data

Big Data slogans and classification

“Big Data: The next frontier for innovation, competition, and productivity” (McKinsey Global Institute)
“Data is the new gold” (Open Data Initiative; the European Commission aims at opening up Public Sector Information)
– The term “Big Data” refers to large amounts of different types of data produced with high velocity from a high number of various types of sources. Handling today's highly variable and real-time datasets requires new tools and methods, such as powerful processors, software and algorithms.
– The term “Open Data” refers to a subset of data, namely to data made freely available for re-use to everyone, for both commercial and non-commercial purposes.
– “Linked Data” is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.
Lecture: Big Data: A data-driven society? Roberto V. Zicari, Goethe University Frankfurt

Page 4:

Motivation: How large is Big Data?
Mostly unstructured data!
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
“Big” does not mean a fixed scale!

Page 5:

Where is data coming from?
Sensor data – production (industry) / tracking (logistics)
Medical data (MRT, patient data, genome) – “personalized medicine”
Financial data
Mobile Networks – sensor arrays and meta information (tracking, customer behavior)
Autonomous systems (e.g. Floating Car Data – FCD)
Social Media / Apps / Platforms (e.g. YouTube)
Weather / Climate / Environmental Data

Page 6:

Characteristics of Big Data
Figure annotation: “more important: extract new content from the database”

Page 7:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 8:

Big Data Processing

Figure: Big Data processing pipeline – Data Collection → Extraction/Cleaning/Annotation → Integration/Aggregation → Analysis/Modelling → Interpretation; cross-cutting challenges: Volume, Velocity, Variety, Veracity, Value, Privacy, Human Interaction.

Page 9:

Scalability

Horizontal vs. vertical scalability

Horizontal scaling:
– Involves distributing (partitioning) the workload across many machines (usually commodity hardware)
– Also referred to as “scale-out”
Vertical scaling:
– Use more processors and attached memory within a single application
– Can also be “distributed” over multiple nodes, but “shared” (shared-memory systems)
– Also referred to as “scale-up”

Page 10:

Horizontal Scaling

Scale-out

Opposite of scale-up
Add further instances of the same kind
No special assumptions
Ideal case: distribute the computing to the data, not vice versa; “separate” partitioned workloads (a single-machine sketch of this data-parallel idea follows below)
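As a loose single-machine analogy for the scale-out idea above (partitioned workloads handled by independent workers with no shared state), here is a minimal Python sketch; the data, the square-summing task, and the choice of four workers are purely illustrative:

from multiprocessing import Pool

def handle_partition(chunk):
    # each worker sees only its own partition of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]   # split the workload

    with Pool(processes=4) as pool:               # "add further instances"
        partial_sums = pool.map(handle_partition, partitions)

    print(sum(partial_sums))                      # combine partial results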

Page 11:

Vertical Scaling

Scale-up

Usually within a single system (e.g. scaling a parallel application)
Use more computing elements in a single application
Often communication-intensive (iterative approaches)
Ideal case: highly parallel simulations

Page 12:

Scale-out vs. Scale-up

Advantages
Horizontal scaling:
– Data parallel (trivial); simply add instances; easy handling
– Multiple copies of data; fault-tolerant mechanisms
– Relatively cheap components; cheap network
Vertical scaling:
– Parallel programming techniques (scale to 10^5–10^6 cores)
– Use of hybrid systems (CPU + accelerators)
– Fast message passing

Page 13:

Scale-out vs. Scale-up

Advantages
Horizontal scaling:
– Data parallel (trivial); simply add instances; easy handling
– Multiple copies of data; fault-tolerant mechanisms
– Relatively cheap components; cheap network
Vertical scaling:
– Parallel programming techniques (scale to 10^5–10^6 cores)
– Use of hybrid systems (CPU + accelerators)
– Fast message passing
Disadvantages
Horizontal scaling:
– Middleware has to manage the complexity (data copies, distributed tasks)
– Iterative techniques hard to apply (performance loss)
Vertical scaling:
– Higher investments
– Limited frameworks; application-centric approach
– Hard to adapt a specific solution to other use cases

Page 14:

Processing Challenges

Not “just” scalability – further challenges:
Data acquisition: “capturing” data; streaming or persistent storage of data
Aligning / integrating data from different sources (e.g., resolving duplicates)
Transformation: bring data into a format suitable for further processing or analysis
Modelling: hypothesis testing or prediction, usually mathematical or statistical
Understanding the output – visualizing and sharing results

Techniques:

– Massive Parallelism

– Huge Data Volumes Storage

– Data Distribution

– High-Speed Networks

– High-Performance Computing

– Task and Thread Management

– Data Mining and Analytics

– Data Retrieval

– Machine Learning

– Data Visualization

Page 15:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 16:

Hadoop 1.0 Ecosystem

“First” software stack based on shared-nothing architectures
Horizontal scaling paradigm; framework support for workload distribution and fault tolerance

Page 17:

MapReduce Example
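The example shown on this slide is not reproduced in the transcript. As a stand-in, here is a minimal pure-Python sketch of the classic MapReduce word count, with the map, shuffle and reduce phases spelled out explicitly (the input lines are invented):

from collections import defaultdict

lines = ["big data and hpc", "big data analytics", "hpc and big data"]

# Map phase: emit (key, value) pairs, here (word, 1).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values of each key.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # e.g. {'big': 3, 'data': 3, ...}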

Page 18:

MapReduce criticism (from the DBMS perspective - 2008)

https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
A giant step backward in the programming paradigm for large-scale data-intensive applications
Not novel at all – it represents a specific implementation of well-known techniques developed nearly 25 years ago
Missing most of the features that are routinely included in current DBMS
Incompatible with all of the tools DBMS users have come to depend on
Missing features and incompatibilities with DBMS:
– Bulk loader – to transform input data in files into a desired format and load it into a DBMS
– Indexing – MapReduce simply has no index, only a “brute force” approach
– Updates – to change the data in the database
– Transactions – to support parallel updates and recovery from failures during updates
– Integrity constraints and references – help to keep garbage out of the database
– Views – so the schema can change without having to rewrite the application program

Page 19:

Limitations and Extensions to MapReduce

Geoffrey Fox – Lecture: Building a Library at the Nexus of High Performance Computing and Big Data, Indiana University

Page 20:

Limitations and Extensions to MapReduce

Geoffrey Fox – Lecture: Building a Library at the Nexus of High Performance Computing and Big Data, Indiana University
Map-Reduce / Classical Hadoop

Page 21:

Limitations and Extensions to MapReduce

Geoffrey Fox – Lecture: Building a Library at the Nexus of High Performance Computing and Big Data, Indiana University
In-Memory Processing

Page 22:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 23:

In-Memory Processing

Processing of data stored in an in-memory database
Historically triggered by business intelligence (BI) needs
Previously: processing based on disk storage and relational databases using the SQL query language, which became too slow with increasing data volumes
Towards data analysis in real time; lifts the limitation of I/O waiting times (usually the show-stopper in I/O-intensive applications)
RAM has come down in cost, so in-memory computing becomes available at larger scale (BI: analytic reports would be started and results would come later, i.e. “coffee break analytics”) https://www-01.ibm.com/software/data/what-is-in-memory-computing.html
Provides the basis for whole algorithm pipelines, not just separate compute tasks:
– Iterations
– Joins

Page 24:

In-Memory Processing

Caching
– e.g. processor caches, file cache, disk cache, permission cache
Replication
– e.g. RAID, Content Distribution Networks (CDN), web/server caches
Prediction – which data will be needed, so it can be prefetched
– Trade-off: bandwidth
– e.g. disk caches, Google Earth
Limitations:
– Caching only works if the working set is small
– Prefetching only works when access patterns are predictable
– Replication is expensive and limited by the receiving machines
Srinath Perera: In-Memory Computing, WSO2 Inc.

Page 25:

In-Memory: Applicability for Big Data

Srinath Perera: In-Memory Computing, WSO2 Inc.

Page 26:

The Framework Apache Spark

Apache Spark offers a Directed Acyclic Graph (DAG) execution engine that supports cyclic data flow and in-memory computing.
Offers over 80 high-level operators to build parallel applications; use it interactively from the Scala, Python and R shells.
Offers a stack of libraries for analytics, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming; combine libraries seamlessly in the same application.
Run Spark standalone or in cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos; access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. (A small PySpark sketch of in-memory reuse follows below.)
http://spark.apache.org/
http://spark-project.org
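A minimal PySpark sketch of the in-memory idea described above: an RDD is cached once and then reused by several actions, so later passes do not re-read the input. The file path and the numeric workload are assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="in-memory-sketch")

# Parse a (hypothetical) text file of numbers and keep the RDD in memory.
numbers = (sc.textFile("hdfs:///data/numbers.txt")
             .map(float)
             .cache())            # mark for in-memory reuse

count = numbers.count()           # first action materializes the cache
mean = numbers.sum() / count      # further passes reuse the cached data
maximum = numbers.max()

print(count, mean, maximum)
sc.stop()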

Page 27:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 28:

Limitations and Extensions to MapReduce

Geoffrey Fox – Lecture: Building a Library at the Nexus of High Performance Computing and Big Data, Indiana University
Stream processing: essential for autonomous driving and sensor data analysis

Page 29:

Stream Processing Alternatives


Page 30:

Stream Processing Alternatives


Page 31:

Stream Processing: Apache Flink

Data stream outputs
http://flink.apache.org/docs/latest/streaming_guide.html
Marton Balassi – data Artisans: “Real-time Stream Processing with Apache Flink”
True streaming with adjustable latency and throughput
Rich functional API exploiting the streaming runtime
Flexible windowing semantics
Exactly-once processing guarantees with (small) state

Page 32:

Stream Processing: Apache Flink - examples

Scala example: Reading from data sources

Marton Balassi – data Artisans: “Real-time Stream Processing with Apache Flink”
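The Scala listing referenced on this slide is not contained in the transcript. As a rough stand-in, the following sketch uses PyFlink's DataStream API (available in newer Flink releases) to read from a source and run a keyed word count; the input collection is invented:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Reading from a data source -- here a small in-memory collection;
# in a real job this would be e.g. a Kafka or socket source.
lines = env.from_collection(["big data and hpc", "big data analytics"])

counts = (lines
          .flat_map(lambda line: [(w, 1) for w in line.split()])  # emit (word, 1)
          .key_by(lambda pair: pair[0])                           # group by word
          .reduce(lambda a, b: (a[0], a[1] + b[1])))              # running sum

counts.print()
env.execute("wordcount-sketch")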

Page 33:

Apache Projects (relevant for Big Data)

Big data (36):
– Airavata, Ambari, Apex, Avro, Bigtop, BookKeeper, Calcite, CouchDB, Crunch, DataFu (Incubating), DirectMemory (in the Attic), Drill, Falcon, Flink, Flume, Giraph, Hama, Helix, Ignite, Kafka, Knox, MetaModel, Oozie, ORC, Parquet, Phoenix, Quarks (Incubating), REEF, Samza, Spark, Sqoop, Storm, Tajo, Tez, VXQuery, Zeppelin
Database (25):
– Accumulo, Cassandra, Cayenne, Cocoon, CouchDB, Curator, Derby, Empire-db, Forrest, Gora, Hadoop, HBase, Hive, Jackrabbit, Lucene Core, Lucene.Net, Lucy, MetaModel, OFBiz, OpenJPA, ORC, Phoenix, Pig, Torque, ZooKeeper
Many more: https://projects.apache.org/projects.html?category

Page 34:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 35:

Big Data Processing – Data Analysis

Figure: Big Data processing pipeline – Data Collection → Extraction/Cleaning/Annotation → Integration/Aggregation → Analysis/Modelling → Interpretation; cross-cutting challenges: Volume, Velocity, Variety, Veracity, Value, Privacy, Human Interaction.

Page 36:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 37:

Machine Learning Algorithms


SVM: Support Vector Machine

HMM: Hidden Markov Models

LDA: Linear Discriminant Analysis

LSA: Latent Semantic Analysis

Page 38:

Machine Learning Algorithms

Unsupervised learning vs. supervised learning:
– Unsupervised learning: unlabeled data; goal: discover classes within the data
– Supervised learning: labeled data; goal: classify unknown data
Example: Iris flower data set (Ronald Fisher, The use of multiple measurements in taxonomic problems, 1936)
Labeled data (supervised): 5.0,3.5,1.6,0.6,Iris-setosa 5.1,3.8,1.9,0.4,Iris-setosa 4.8,3.0,1.4,0.3,Iris-setosa 5.1,3.8,1.6,0.2,Iris-setosa 4.6,3.2,1.4,0.2,Iris-setosa 5.3,3.7,1.5,0.2,Iris-setosa 5.0,3.3,1.4,0.2,Iris-setosa 7.0,3.2,4.7,1.4,Iris-versicolor 6.4,3.2,4.5,1.5,Iris-versicolor 6.9,3.1,4.9,1.5,Iris-versicolor …
Unlabeled data (unsupervised): 5.0,3.5,1.6,0.6 5.1,3.8,1.9,0.4 4.8,3.0,1.4,0.3 5.1,3.8,1.6,0.2 4.6,3.2,1.4,0.2 5.3,3.7,1.5,0.2 5.0,3.3,1.4,0.2 7.0,3.2,4.7,1.4 6.4,3.2,4.5,1.5 6.9,3.1,4.9,1.5 …

Page 39:

Machine Learning Algorithms

Unsupervised Learning (dealing with unlabeled data)
Examples:
– Frequency counts and clustering of words from a text data set
– Identifying patches which are similar with regard to six edaphic and physiographic variables (e.g. 50-year mean monthly temperature) with a spatial clustering technique in the US
– Clustering the Iris data set (a minimal scikit-learn sketch follows below)
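A minimal scikit-learn sketch of the last example (clustering the Iris data set): the algorithm sees only the four measurements per flower, never the species labels; the choice of k-means with three clusters is an assumption for illustration:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                        # 150 x 4 feature matrix, labels unused

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])                  # cluster assigned to each flower
print(kmeans.cluster_centers_)              # one centroid per discovered class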

Page 40:

Machine Learning Algorithms

Supervised Learning (dealing with labeled data)
Examples:
– Classify handwritten digits and images with Deep Learning methods
– Classify text documents with Naïve Bayes (a minimal scikit-learn sketch follows below)
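A minimal scikit-learn sketch of the second example (text classification with Naïve Bayes); the tiny training corpus and its labels are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fall", "team wins the final",
              "shares and bonds rally", "player scores twice"]
train_labels = ["finance", "sports", "finance", "sports"]

# Bag-of-words features followed by a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)                     # learn from labeled data

print(model.predict(["bonds rally after the final"]))   # classify unseen text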

Page 41:

Machine Learning Algorithms

The most popular machine learning algorithms are …

Unsupervised methods/Clustering

– K-Means

– Hierarchical clustering

– Gaussian mixture model

– …

Supervised methods

– Logistic regression

– Decision tree classifier

– Random forest classifier

– Naïve Bayes

– Multilayer perceptron classifier

– Support Vector Machine

– HMM: Hidden Markov Models

– LDA: Linear Discriminant Analysis

– LSA: Latent Semantic Analysis

– …

Page 42:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 43:

Deep Learning

Deep Learning methods are already applied in many different domains.
There is a need for user-friendly frameworks enabling the development of application-dependent deep neural networks.

Page 44:

Deep Learning

Deep neural networks have many adjustable parameters:
– Number of input/output nodes
– Number of hidden layers
– Activation function
– Number of connections
(A minimal Keras sketch follows below.)
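A minimal Keras sketch showing where these adjustable parameters appear in code; the layer sizes, activation choices, and the three-class output are arbitrary assumptions:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(4,)),  # input nodes -> first hidden layer
    Dense(32, activation="relu"),                    # number and size of hidden layers
    Dense(3, activation="softmax"),                  # output nodes (e.g. 3 classes)
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.summary()   # reports the trainable parameters (connections) per layer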

Page 45:

Application - Intraoperative Thermal Imaging

N. Hoffmann et al.: Learning Thermal Process Representations for Intraoperative Analysis of Cortical Perfusion During Ischemic Strokes. Springer International Publishing, 2016.
Imaging neural activity by measuring small temperature variations
Data from University Hospital Dresden
Long-term intraoperative measurements (~10 minutes) and fast preprocessing
Deep learning framework (Keras) to approximate parameters
3000 frames (5.4 GB) every minute (50 Hz sampling rate); ~7000 s on a single node
Parallel implementation of the real-time data processing pipeline with Apache Spark
8 nodes on an HPC cluster: ~30 s, 200x faster
Figure: (a) infarct demarcation in postoperative computed tomography (CT); (b) infarct demarcation in postoperative CT scans and thermal image sequences mapped onto the cortex; (c) heating parameters approximated by the Deep Learning approach, applied to a heating sequence caused by cortical irrigation

Page 46:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 47:

Graph Analytics

“Graphs are everywhere”
Graph = edges + vertices
Potentially very large:
– Facebook: ca. 1.49 billion active users, ca. 340 friends per user
– WWW: ca. 1 billion sites
Martin Junghans: “Scalable Graph Analytics with Gradoop”

Page 48:

Graph Analytics – a short excursus
Can be used as add-ons to analytics frameworks
Some systems: Gradoop, GraphX, GraphLab, Giraph, Pregel, …
Lecture “Gradoop”, Prof. E. Rahm
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
(A small single-node sketch follows below.)
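None of the distributed systems above is shown here; instead, a small single-node Python sketch using networkx illustrates the vertex/edge model and a typical graph analysis (PageRank). The toy "follows" edges are invented:

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("dave", "carol")])   # vertices + edges

ranks = nx.pagerank(g)                 # importance score for each vertex
print(sorted(ranks.items(), key=lambda kv: -kv[1]))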

Page 49:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 50:

Data Analytics Frameworks

There are already several sophisticated frameworks supporting Machine Learning with Big Data, like:

Spark ML

Flink ML

Mahout

H2O

Page 51:

Data Analytics Frameworks

There are already several sophisticated frameworks supporting Deep Learning applications, like:
Keras
Caffe
cuDNN
DeepLearning4j
TensorFlow

Page 52:

The Framework Keras

Deep Learning Framework

Written in Python

Based on Theano/TensorFlow

Convolutional/Recurrent Networks

Parallelization possible

GPU usage possible

Page 53:

Data Analytics Frameworks

Established frameworks for statistical analysis are, e.g.:
R (scripting language with statistics community projects)
MATLAB, GNU Octave, Scilab, …
Python packages (NumPy, pandas, SciPy, …)

ROOT (High Energy Physics package)

SOFA statistics

Torch

Weka

….

Page 54:

Structure

Introduction/Motivation

Big Data Processing

– Classical MapReduce approach

– In-Memory processing

– Stream processing

Big Data Analysis

– Machine Learning Algorithms

– Deep Learning

– Graph Analysis

– Data Analytics Frameworks

Big Data and HPC

Page 55:

Big Data and HPC – convergence patterns

There is no unique fingerprint for “big data applications”
The big V’s of big data (volume, velocity, …) imply very different usage scenarios
Also: there is no unique architecture to serve all facets
Why converge the two worlds?
The rise of algorithms: highly iterative learning models become interesting for many application areas (deep learning)
Many initiatives to overcome the initial MapReduce approach:
– Big Data meets InfiniBand: 56 Gbps FDR InfiniBand, >100x faster than even 10 Gbps Ethernet
– Big Data meets accelerators: switch between CPU/GPU usage at the framework level (e.g. using Apache Spark on GPUs, TensorFlow on CPU/GPU; see the sketch after this list)
– Using virtualization / cloud paradigms to shape environments over bare metal
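A minimal TensorFlow sketch of the framework-level CPU/GPU switch mentioned above (written against the TensorFlow 2 API; the matrix sizes are arbitrary):

import tensorflow as tf

# Pick a device at the framework level; the same code runs on either one.
device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

with tf.device(device):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)

print(device, float(tf.reduce_sum(c)))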

Page 56:

Big Data and HPC – convergence patterns

Requirements to support Big Data workloads on HPC:
Support frameworks: more versatile software stacks
Fast access to data: not just self-produced data (simulation), but also use of 3rd-party data (open data, domain repositories)
Support different data processing paradigms on the very same system:
– Streaming
– Batch
– Iteration
Better support for the evaluation of (temporary) results, e.g. visualization frontends
Service orientation (working environments)

Page 57:

Big Data and HPC – convergence patterns

High Performance Data Analytics

Page 58:

Big Data and HPC – convergence patterns

High Performance Data Analytics

Classical HPC

Page 59:

Big Data and HPC – convergence patterns

High Performance Data Analytics

Add system features via a virtualization layer

Page 60:

Big Data and HPC – convergence patterns

High Performance Data Analytics

Enhance the software stack, up to complete, user-specific software environments

Page 61:

Big Data Analytics and HPC

HPC offers a lot of support for Big Data Analytics

Software frameworks allow users to benefit from HPC support concerning

– Parallelization

– CPU and GPU usage

– Strong computational power per node

– Compute power for visualization

– Streaming

– Storage

– Shared memory

– Usage of specifically adapted hardware

Page 62:

Zellescher Weg 12

Willers-Bau A207

Tel. +49 351 - 463 - 35450

Wolfgang E. Nagel ([email protected])

Center for Information Services and High Performance Computing (ZIH)

Thank You!

Sunna Torge

Rene Jäkel

Ralph Müller-Pfefferkorn