Zellescher Weg 12
Willers-Bau A207
+49 351 - 463 - 35450
Wolfgang E. Nagel ([email protected])
Big Data and HPC: Two Worlds Apart or Synergetic Future?
Center for Information Services and High Performance Computing (ZIH)
2
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
3
Meaning of Big Data
Big Data slogans and classification
“Big Data: The next frontier for innovation, competition, and productivity”
(McKinsey Global Institute)
“Data is the new gold” (Open Data Initiative; the European Commission aims at
opening up public sector information).
– The term “Big Data” refers to large amounts of different types of data
produced with high velocity from a high number of various types of
sources. Handling today's highly variable and real-time datasets requires
new tools and methods, such as powerful processors, software, and
algorithms.
– The term “Open Data” refers to a subset of data, namely data made
freely available for re-use to everyone, for both commercial and
non-commercial purposes.
– “Linked Data” is about using the Web to connect related data that
wasn't previously linked, or using the Web to lower the barriers to linking
data currently linked using other methods.
Lecture: “Big Data: A data-driven society?”, Roberto V. Zicari, Goethe University Frankfurt
4
Motivation: How large is Big Data?
Mostly unstructured data!
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
“Big” does not mean a fixed scale!
5
Where is data coming from?
Sensor data – production (industry) / tracking (logistics)
Medical data (MRT, patient data, genome) – “personalized medicine”
Financial data
Mobile networks – sensor arrays and meta information (tracking, customer behavior)
Autonomous systems (e.g. Floating Car Data – FCD)
Social media / apps / platforms (e.g. YouTube)
Weather / climate / environmental data
…
6
More important: extract new content from the database
Characteristics of Big Data
7
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
8
Big Data Processing
Data
Collection
Integration/
Aggregation
Analysis/
Modelling Interpretation
Extraction/
Cleaning/
Annotation
Volume
Veracity
Velocity
Variety
…
Privacy
Hu
man
In
tera
ctio
n
Value
10
Scalability
Horizontal vs. vertical scalability
Horizontal scaling
– involves distributing (partitioning) the workload across many machines
(usually commodity hardware)
– Also referred to as “scale-out”
Vertical scaling:
– Use more processors and attached memory in a single application
– Also “distributed” over multiple nodes, but “shared”
(shared-memory systems)
– Also referred to as “scale-up”
11
Horizontal Scaling
Scale-out
Opposite to scale-up
Add further instances of the same kind
No special assumptions
Ideal case: distribute computing to data, not vice versa,
“separate” partitioned workloads
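The “distribute computing to data” idea rests on partitioning: each record is routed by its key to the worker that owns the matching data shard, so per-key work needs no cross-worker communication. A minimal sketch in plain Python (the key names and the simple byte-sum hash are illustrative assumptions; real systems such as HDFS add replication and locality-aware scheduling):

```python
def partition(key, n_workers):
    # Deterministically route a record to one of n_workers by its key
    # (a simple, stable hash over the key's bytes)
    return sum(key.encode()) % n_workers

def scatter(records, n_workers):
    # Each worker receives only its own partition of the data
    shards = [[] for _ in range(n_workers)]
    for key, value in records:
        shards[partition(key, n_workers)].append((key, value))
    return shards

records = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]
shards = scatter(records, n_workers=4)
# All records with the same key land on the same worker, so
# per-key aggregation can run locally on each shard.
```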
12
Vertical Scaling
Scale-up
Usually within a system (e.g. scale parallel application)
Use more computing elements in single
application
Often communication intensive (iterative approaches)
Ideal case: highly parallel simulations
13
Scale-out vs. Scale-up
Advantages
Horizontal scaling
– Data parallel (trivial); simply add
instances; easy handling
– Multiple copies of data;
fault-tolerant mechanisms
– Relatively cheap components;
cheap network
Vertical Scaling:
– Parallel programming techniques
(scale to 10^5–10^6 cores)
– Use hybrid systems
(cpu + accelerators)
– Fast message passing
14
Scale-out vs. Scale-up
Disadvantages
Horizontal scaling
– Middleware has to manage
complexities (data copies;
distributed tasks)
– Iterative techniques hard to
apply (performance loss)
Vertical scaling:
– Higher investments
– Limited frameworks;
application-centric approach
– Hard to adapt specific solutions
to other use cases
15
Processing Challenges
Not “just” scalability – further challenges
Data acquisition: “capturing” data, streaming or persistent storage of data
Aligning/integrating data from different sources (e.g., resolving duplicates)
Transformation: bring data into a format suitable for further processing
or analysis
Modelling: hypothesis testing or prediction, usually mathematical or statistical
Understanding the output – visualizing and sharing results
Techniques:
– Massive Parallelism
– Storage of huge data volumes
– Data Distribution
– High-Speed Networks
– High-Performance Computing
– Task and Thread Management
– Data Mining and Analytics
– Data Retrieval
– Machine Learning
– Data Visualization
16
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
17
Hadoop 1.0 Ecosystem
“First” software stack based on shared-nothing architectures
Horizontal scaling paradigm, with framework support for workload distribution
and fault tolerance
18
MapReduce Example
19
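The canonical MapReduce example is word count. As an illustration of the map → shuffle → reduce phases (a plain-Python sketch, not tied to Hadoop or any particular framework; in a real system the shuffle is performed by the framework between distributed mappers and reducers):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the value list of each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["Big Data and HPC", "big data processing"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts: {'big': 2, 'data': 2, 'and': 1, 'hpc': 1, 'processing': 1}
```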
MapReduce criticism (from the DBMS perspective - 2008)
https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
A giant step backward in the programming paradigm for large-scale data
intensive applications
Not novel at all – it represents a specific implementation of well-known
techniques developed nearly 25 years ago
Missing most of the features that are routinely included in current DBMS
Incompatible with all of the tools DBMS users have come to depend on
Missing features and incompatible with DBMS
– Bulk loader – to transform input data in files into a desired format and load it into a
DBMS
– Indexing – MapReduce simply has no indexes, only a “brute force” approach
– Updates – to change the data in the database
– Transactions – to support parallel updates and recovery from failures during updates
– Integrity constraints and references – help to keep garbage out of the database
– Views – so the schema can change without having to rewrite the application
program
20
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
21
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
Map-Reduce/
Classical Hadoop
22
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
In-Memory
Processing
23
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
24
In-Memory Processing
Processing of data held in an in-memory database
Historically triggered by business intelligence (BI) needs
Previously: processing was based on disk storage and relational databases using
the SQL query language, but became too slow with increasing data volumes
Towards data analysis in real time; lifts the limitation of I/O wait times
(usually a show stopper in I/O-intense applications)
RAM has come down in cost, so in-memory computing becomes available at
larger scale (BI: analytic reports would be started and results would come
later, i.e. “coffee break analytics”) https://www-01.ibm.com/software/data/what-is-in-memory-computing.html
Provides the basis for algorithm pipelines, not just separated compute
tasks
– Iterations
– Joins
25
In-Memory Processing
Caching
– E.g. Processor caches, file cache, disk cache,
permission cache
Replication
– E.g. RAID, Content Distribution Networks
(CDN), Web/Server Caches
Prediction – predict which data will be needed and prefetch it
– Trade-off: costs bandwidth
– E.g. disk caches, Google Earth
Limitations:
– Caching works only if the working set is small
– Prefetching works only when access
patterns are predictable
– Replication is expensive and limited by the
receiving machines
Srinath Perera: In-Memory Computing, WSO2 Inc.
26
In-Memory: Applicability for Big Data
Srinath Perera: In-Memory Computing, WSO2 Inc.
27
The Framework Apache Spark
Apache Spark offers a Directed Acyclic
Graph (DAG) execution engine that
supports cyclic data flow and in-memory
computing.
Offers over 80 high-level operators to build
parallel applications; usable interactively
from the Scala, Python, and R shells
Offers a stack of libraries for analytics,
including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark
Streaming; the libraries can be combined
seamlessly in the same application
Runs standalone or in cluster mode,
on EC2, on Hadoop YARN, or on Apache
Mesos; accesses data in HDFS, Cassandra,
HBase, Hive, Tachyon, and any Hadoop
data source
http://spark.apache.org/
http://spark-project.org
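Spark's operator style chains transformations (`flatMap`, `map`, `reduceByKey`) over distributed datasets. The shape of such a pipeline can be imitated in plain Python with lazy generators (an illustration only; there is no cluster behind this sketch, and the input lines are made up):

```python
from itertools import groupby

lines = ["big data and hpc", "big data"]

# flatMap: split lines into words (lazy, like an RDD transformation)
words = (w for line in lines for w in line.split())
# map: pair each word with a count of 1
pairs = ((w, 1) for w in words)
# reduceByKey: group pairs by key, then sum the counts per key;
# nothing above executes until this "action" consumes the generators
counts = {key: sum(c for _, c in group)
          for key, group in groupby(sorted(pairs), key=lambda p: p[0])}
```

In PySpark the same pipeline would be expressed against an RDD and executed in parallel across the cluster, with intermediate results kept in memory between stages.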
28
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
29
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
Stream Processing:
Essential for
Autonomous driving
Sensor data analysis
…
30
Stream Processing Alternatives
[Figure: comparison of stream processing alternatives]
31
Stream Processing: Apache Flink
Data stream outputs
http://flink.apache.org/docs/latest/streaming_guide.html
Marton Balassi (data Artisans): “Real-time Stream Processing with
Apache Flink”
True streaming with adjustable latency
and throughput
Rich functional API exploiting streaming
runtime
Flexible windowing semantics
Exactly-once processing guarantees
with (small) state
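Windowing is central to the stream semantics mentioned above. A tumbling (non-overlapping) count window, emitting one aggregate each time a window closes, can be sketched in plain Python (an illustration of the concept, not Flink's actual API; the sensor readings are invented):

```python
def tumbling_windows(stream, size):
    # Collect events into fixed-size, non-overlapping windows;
    # emit a per-window aggregate (here: the average) on each close
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield sum(window) / len(window)
            window = []

readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # e.g. sensor values
averages = list(tumbling_windows(readings, size=3))
# two windows close: [1,2,3] -> 2.0 and [4,5,6] -> 5.0
```

Flink generalizes this with time-based and sliding windows and manages the window state fault-tolerantly across the cluster.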
33
Stream Processing: Apache Flink - examples
Scala example: Reading from data sources
Marton Balassi (data Artisans): “Real-time Stream Processing with
Apache Flink”
34
Apache Projects (relevant for Big Data)
Big data (36):
– Airavata, Ambari, Apex, Avro, Bigtop, BookKeeper, Calcite, CouchDB,
Crunch, DataFu (Incubating), DirectMemory (in the Attic), Drill, Falcon,
Flink, Flume, Giraph, Hama, Helix, Ignite, Kafka, Knox, MetaModel,
Oozie, ORC, Parquet, Phoenix, Quarks (Incubating), REEF, Samza, Spark,
Sqoop, Storm, Tajo, Tez, VXQuery, Zeppelin
Database (25):
– Accumulo, Cassandra, Cayenne, Cocoon, CouchDB, Curator, Derby,
Empire-db, Forrest, Gora, Hadoop, HBase, Hive, Jackrabbit, Lucene
Core, Lucene.Net, Lucy, MetaModel, OFBiz, OpenJPA, ORC, Phoenix,
Pig, Torque, ZooKeeper
Many more: https://projects.apache.org/projects.html?category
35
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
36
Big Data Processing – Data Analysis
Data
Collection
Integration/
Aggregation
Analysis/
Modelling Interpretation
Extraction/
Cleaning/
Annotation
Volume
Veracity
Velocity
Variety
…
Privacy
Hu
man
In
tera
ctio
n
Value
37
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
38
Machine Learning Algorithms
SVM: Support Vector Machine
HMM: Hidden Markov Models
LDA: Linear Discriminant Analysis
LSA: Latent Semantic Analysis
39
Machine Learning Algorithms
Unsupervised Learning: unlabeled data – goal: discover classes within the data
Supervised Learning: labeled data – goal: classify unknown data
Iris flower data set (Ronald Fisher, “The use of multiple measurements in taxonomic problems”, 1936)
Labeled (supervised): 5.0,3.5,1.6,0.6,Iris-setosa · 5.1,3.8,1.9,0.4,Iris-setosa · 4.8,3.0,1.4,0.3,Iris-setosa · 5.1,3.8,1.6,0.2,Iris-setosa · 4.6,3.2,1.4,0.2,Iris-setosa · 5.3,3.7,1.5,0.2,Iris-setosa · 5.0,3.3,1.4,0.2,Iris-setosa · 7.0,3.2,4.7,1.4,Iris-versicolor · 6.4,3.2,4.5,1.5,Iris-versicolor · 6.9,3.1,4.9,1.5,Iris-versicolor · …
Unlabeled (unsupervised): the same measurements without the species label: 5.0,3.5,1.6,0.6 · 5.1,3.8,1.9,0.4 · …
40
Machine Learning Algorithms
Unsupervised Learning
(dealing with unlabeled data)
Examples:
– Frequency counts and clustering of words from a text data set
– Identifying patches that are similar with regard to six edaphic and physiographic
variables (e.g. 50-year mean monthly temperature) with a spatial clustering
technique in the US
– Clustering the Iris data set
41
Machine Learning Algorithms
Supervised Learning
(dealing with labeled data)
Examples:
– Classify handwritten digits and images with Deep Learning methods
– Classify text documents with Naïve Bayes
42
Machine Learning Algorithms
The most popular machine learning algorithms are …
Unsupervised methods/Clustering
– K-Means
– Hierarchical clustering
– Gaussian mixture model
– …
Supervised methods
– Logistic regression
– Decision tree classifier
– Random forest classifier
– Naïve Bayes
– Multilayer perceptron classifier
– Support Vector Machine
– HMM: Hidden Markov Models
– LDA: Linear Discriminant Analysis
– LSA: Latent Semantic Analysis
– …
43
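To make the unsupervised case concrete, K-Means (the first algorithm in the list) can be sketched in a few lines of Python on toy 1-D data (illustrative only; the data and starting centers are invented, and libraries such as Spark MLlib or scikit-learn provide production implementations):

```python
def kmeans(points, centers, iterations=10):
    # Lloyd's algorithm: alternate assignment and centroid update
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups around 1.0 and 9.0
centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])
```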
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
44
Deep Learning
Deep Learning methods are already applied in many different domains.
There is a need for user-friendly frameworks that enable the development of
application-dependent deep neural networks.
45
Deep Learning
Deep neural networks have many adjustable parameters:
– Number of input/output nodes
– Number of hidden layers
– Activation function
– Number of connections
– …
46
Application - Intraoperative Thermal Imaging
N. Hoffmann et al. Learning Thermal Process Representations for Intraoperative Analysis of Cortical Perfusion During Ischemic
Strokes. Springer International Publishing, 2016.
Imaging neural activity by measuring small
temperature variations
Data from University Hospital Dresden
Long-term intraoperative measurements
(~10 minutes) and fast preprocessing
Deep learning framework (Keras) to approximate
parameters
3000 frames (5.4 GB) every minute
(50 Hz sampling rate), ~7000 s on a single node
Parallel implementation of a real-time data
processing pipeline with Apache Spark
8 nodes on an HPC cluster, ~30 s, 200x faster
(a) Infarct demarcation in postoperative computed
tomography (CT)
(b) Infarct demarcation in postoperative CT scans and
thermal image sequences mapped onto the cortex
(c) Heating parameters approximated by the Deep
Learning approach, applied to a heating sequence
caused by cortical irrigation
47
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
48
Graph Analytics
“Graphs are everywhere”
Graph = vertices + edges
Potentially very large:
– Facebook:
• ca. 1.49 billion active users
• ca. 340 friends per user
– WWW:
• ca. 1 billion sites
Martin Junghans: “Scalable Graph
Analytics with Gradoop”
49
Graph Analytics – a short excursus
Can be used as add-ons to Analytics Frameworks
Some systems: Gradoop, GraphX, GraphLab, Giraph,
Pregel, …
Lecture “Gradoop”, Prof. E. Rahm
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
50
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
51
Data Analytics Frameworks
There are already several sophisticated frameworks supporting Machine
Learning on Big Data, such as
Spark ML
Flink ML
Mahout
H2O
…
52
Data Analytics Frameworks
There are already several sophisticated frameworks supporting Deep Learning
applications like
Keras
Caffe
cuDNN
DeepLearning4j
TensorFlow
53
The Framework Keras
Deep Learning Framework
Written in Python
Based on Theano/TensorFlow
Convolutional/Recurrent Networks
Parallelization possible
GPU usage possible
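The building block that frameworks like Keras automate is the layer: a weighted sum followed by an activation function. One neuron's forward pass, stripped of any framework (the input values, weights, and bias are illustrative assumptions; training would adjust the weights via backpropagation):

```python
import math

def sigmoid(x):
    # Activation function: squash the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # One unit of a dense layer: activation(w · x + b)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

out = neuron(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=0.0)
# z = 0.4*1.0 + (-0.2)*0.5 + 0.0 = 0.3, so out = sigmoid(0.3)
```

A deep network stacks many such layers; the framework's job is to wire them up, run the forward and backward passes efficiently (including on GPUs), and fit the parameters to labeled data.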
54
Data Analytics Frameworks
Established frameworks for statistical analysis include, e.g.:
R (script language with statistics community projects)
MATLAB, GNU Octave, Scilab, …
Python packages (NumPy, pandas, SciPy, …)
ROOT (High Energy Physics package)
SOFA statistics
Torch
Weka
….
55
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
56
Big Data and HPC – convergence patterns
There is no unique fingerprint for “big data applications”
The big V's of big data (volume, velocity, …) imply very different usage scenarios
Also: no unique architecture serves all facets
Why converge the two worlds?
Rise of algorithms: highly iterative learning models (e.g. deep learning) become
interesting for many application areas
Many initiatives to overcome the initial MapReduce approach:
– Big Data meets InfiniBand: 56 Gbps FDR InfiniBand, >100x faster than
even 10 Gbps Ethernet
– Big Data meets accelerators: switch between CPU/GPU usage at the
framework level (e.g. Apache Spark on GPUs, TensorFlow on CPU/GPU)
– Using virtualization/cloud paradigms to shape environments over bare
metal
57
Big Data and HPC – convergence patterns
Requirements to support Big Data workloads on HPC
Support frameworks: more versatile software stacks
Fast access to data: not just self-production of
data (simulation), but also use of 3rd-party data
(open data, domain repositories)
Support different data processing paradigms
on the very same system:
– Streaming
– Batch
– Iteration
Better support for the evaluation of (temporary)
results, e.g. visualization frontends
Service orientation (working environments)
58
Big Data and HPC – convergence patterns
High Performance Data Analytics
59
Big Data and HPC – convergence patterns
High Performance Data Analytics
Classical HPC
60
Big Data and HPC – convergence patterns
High Performance Data Analytics
Add system features via a virtualization layer
61
Big Data and HPC – convergence patterns
High Performance Data Analytics
Enhance the software stack up to completely unique software settings
62
Big Data Analytics and HPC
HPC offers a lot of support for Big Data Analytics
Software frameworks allow users to exploit HPC support for
– Parallelization
– CPU and GPU usage
– Strong computational power per node
– Compute power for visualization
– Streaming
– Storage
– Shared memory
– Usage of specifically adapted hardware
Thank You!
Sunna Torge
Rene Jäkel
Ralph Müller-Pfefferkorn