Zellescher Weg 12
Willers-Bau A207
+49 351 - 463 - 35450
Wolfgang E. Nagel ([email protected])
Big Data and HPC: Two Worlds Apart or Synergetic Future?
Center for Information Services and High Performance Computing (ZIH)
2
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
3
Meaning of Big Data
Big Data slogans and classification
“Big Data: The next frontier for innovation, competition, and productivity”
(McKinsey Global Institute)
“Data is the new gold” (Open Data Initiative; the European Commission aims at
opening up public sector information).
– The term “Big Data” refers to large amounts of different types of data
produced with high velocity from a high number of various types of
sources. Handling today's highly variable and real-time datasets requires
new tools and methods, such as powerful processors, software, and
algorithms.
– The term “Open Data” refers to a subset of data, namely data made
freely available for re-use to everyone, for both commercial and
non-commercial purposes.
– “Linked Data” is about using the Web to connect related data that
wasn't previously linked, or using the Web to lower the barriers to linking
data currently linked using other methods.
Lecture: “Big Data: A data-driven society?”, Roberto V. Zicari, Goethe University Frankfurt
4
Motivation: How large is Big Data?
Mostly unstructured data!
Source: IDC’s Digital Universe study, sponsored by EMC, 2014
“Big” does not mean a fixed scale!
5
Where is data coming from?
Sensor data – production (industry) / tracking (logistics)
Medical data (MRT, patient data, genome) – “personalized medicine”
Financial data
Mobile networks – sensor arrays and meta information (tracking, customer behavior)
Autonomous systems (e.g. Floating Car Data – FCD)
Social media / apps / platforms (e.g. YouTube)
Weather / climate / environmental data
…
6
More important: extract new content from the database
Characteristics of Big Data
7
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
8
Big Data Processing
Data
Collection
Integration/
Aggregation
Analysis/
Modelling Interpretation
Extraction/
Cleaning/
Annotation
Volume
Veracity
Velocity
Variety
…
Privacy
Hu
man
In
tera
ctio
n
Value
10
Scalability
Horizontal vs. vertical scalability
Horizontal scaling
– involves distributing (partitioning) the workload across many machines
(usually commodity hardware)
– Also referred to as “scale-out”
Vertical scaling:
– Use more processors and attached memory in a single application
– Also “distributed” over multiple nodes, but “shared”
(shared-memory systems)
– Also referred to as “scale-up”
11
Horizontal Scaling
Scale-out
Opposite to scale-up
Add further instances of the same kind
No special assumptions
Ideal case: distribute computing to data, not vice versa,
“separate” partitioned workloads
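The “distribute computing to data” idea rests on partitioning: each record is routed by its key to the worker that owns the matching data shard, so per-key work needs no cross-worker communication. A minimal sketch in plain Python (the key names and the simple byte-sum hash are illustrative assumptions; real systems such as HDFS add replication and locality-aware scheduling):

```python
def partition(key, n_workers):
    # Deterministically route a record to one of n_workers by its key
    # (a simple, stable hash over the key's bytes)
    return sum(key.encode()) % n_workers

def scatter(records, n_workers):
    # Each worker receives only its own partition of the data
    shards = [[] for _ in range(n_workers)]
    for key, value in records:
        shards[partition(key, n_workers)].append((key, value))
    return shards

records = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]
shards = scatter(records, n_workers=4)
# All records with the same key land on the same worker, so
# per-key aggregation can run locally on each shard.
```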
12
Vertical Scaling
Scale-up
Usually within a system (e.g. scale parallel application)
Use more computing elements in single
application
Often communication intensive (iterative approaches)
Ideal case: highly parallel simulations
13
Scale-out vs. Scale-up
Advantages
Horizontal scaling
– Data parallel (trivial); simply add
instances; easy handling
– Multiple copies of data;
fault-tolerant mechanisms
– Relatively cheap components;
cheap network
Vertical Scaling:
– Parallel programming techniques
(scale to 10^5–10^6 cores)
– Use hybrid systems
(cpu + accelerators)
– Fast message passing
14
Scale-out vs. Scale-up
Disadvantages
Horizontal scaling
– Middleware has to manage
complexities (data copies;
distributed tasks)
– Iterative techniques hard to
apply (performance loss)
Vertical scaling:
– Higher investments
– Limited frameworks;
application-centric approach
– Hard to adapt specific solutions
to other use cases
15
Processing Challenges
Not “just” scalability – further challenges
Data acquisition: “capturing” data, streaming or persistent storage of data
Aligning/integrating data from different sources (e.g., resolving duplicates)
Transformation: bring data into a format suitable for further processing
or analysis
Modelling: hypothesis testing or prediction, usually mathematical or statistical
Understanding the output – visualizing and sharing results
Techniques:
– Massive Parallelism
– Storage of huge data volumes
– Data Distribution
– High-Speed Networks
– High-Performance Computing
– Task and Thread Management
– Data Mining and Analytics
– Data Retrieval
– Machine Learning
– Data Visualization
16
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
17
Hadoop 1.0 Ecosystem
“First” software stack based on shared-nothing architectures
Horizontal scaling paradigm, with framework support for workload distribution
and fault tolerance
18
MapReduce Example
19
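The canonical MapReduce example is word count. As an illustration of the map → shuffle → reduce phases (a plain-Python sketch, not tied to Hadoop or any particular framework; in a real system the shuffle is performed by the framework between distributed mappers and reducers):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the value list of each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["Big Data and HPC", "big data processing"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts: {'big': 2, 'data': 2, 'and': 1, 'hpc': 1, 'processing': 1}
```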
MapReduce criticism (from the DBMS perspective - 2008)
https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
A giant step backward in the programming paradigm for large-scale data
intensive applications
Not novel at all – it represents a specific implementation of well-known
techniques developed nearly 25 years ago
Missing most of the features that are routinely included in current DBMS
Incompatible with all of the tools DBMS users have come to depend on
Missing features and incompatible with DBMS
– Bulk loader – to transform input data in files into a desired format and load it into a
DBMS
– Indexing – MapReduce simply has no indexes, only a “brute force” approach
– Updates – to change the data in the database
– Transactions – to support parallel updates and recovery from failures during updates
– Integrity constraints and references – help to keep garbage out of the database
– Views – so the schema can change without having to rewrite the application
program
20
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
21
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
Map-Reduce/
Classical Hadoop
22
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
In-Memory
Processing
23
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
24
In-Memory Processing
Processing of data held in an in-memory database
Historically triggered by business intelligence (BI) needs
Previously: processing was based on disk storage and relational databases using
the SQL query language, but became too slow with increasing data volumes
Towards data analysis in real time; lifts the limitation of I/O wait times
(usually a show stopper in I/O-intense applications)
RAM has come down in cost, so in-memory computing becomes available at
larger scale (BI: analytic reports would be started and results would come
later, i.e. “coffee break analytics”) https://www-01.ibm.com/software/data/what-is-in-memory-computing.html
Provides the basis for algorithm pipelines, not just separated compute
tasks
– Iterations
– Joins
25
In-Memory Processing
Caching
– E.g. Processor caches, file cache, disk cache,
permission cache
Replication
– E.g. RAID, Content Distribution Networks
(CDN), Web/Server Caches
Prediction – predict which data will be needed and prefetch it
– Trade-off: costs bandwidth
– E.g. disk caches, Google Earth
Limitations:
– Caching works only if the working set is small
– Prefetching works only when access
patterns are predictable
– Replication is expensive and limited by the
receiving machines
Srinath Perera: In-Memory Computing, WSO2 Inc.
26
In-Memory: Applicability for Big Data
Srinath Perera: In-Memory Computing, WSO2 Inc.
27
The Framework Apache Spark
Apache Spark offers a Directed Acyclic
Graph (DAG) execution engine that
supports cyclic data flow and in-memory
computing.
Offers over 80 high-level operators to build
parallel applications; usable interactively
from the Scala, Python, and R shells
Offers a stack of libraries for analytics,
including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark
Streaming; the libraries can be combined
seamlessly in the same application
Runs standalone or in cluster mode,
on EC2, on Hadoop YARN, or on Apache
Mesos; accesses data in HDFS, Cassandra,
HBase, Hive, Tachyon, and any Hadoop
data source
http://spark.apache.org/
http://spark-project.org
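Spark's operator style chains transformations (`flatMap`, `map`, `reduceByKey`) over distributed datasets. The shape of such a pipeline can be imitated in plain Python with lazy generators (an illustration only; there is no cluster behind this sketch, and the input lines are made up):

```python
from itertools import groupby

lines = ["big data and hpc", "big data"]

# flatMap: split lines into words (lazy, like an RDD transformation)
words = (w for line in lines for w in line.split())
# map: pair each word with a count of 1
pairs = ((w, 1) for w in words)
# reduceByKey: group pairs by key, then sum the counts per key;
# nothing above executes until this "action" consumes the generators
counts = {key: sum(c for _, c in group)
          for key, group in groupby(sorted(pairs), key=lambda p: p[0])}
```

In PySpark the same pipeline would be expressed against an RDD and executed in parallel across the cluster, with intermediate results kept in memory between stages.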
28
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
29
Limitations and Extensions to MapReduce
Geoffrey Fox: Lecture “Building a Library at the Nexus of High
Performance Computing and Big Data”, Indiana University
Stream Processing:
Essential for
Autonomous driving
Sensor data analysis
…
30
Stream Processing Alternatives
[Figure: comparison of stream processing alternatives]
31
Stream Processing: Apache Flink
Data stream outputs
http://flink.apache.org/docs/latest/streaming_guide.html
Marton Balassi (data Artisans): “Real-time Stream Processing with
Apache Flink”
True streaming with adjustable latency
and throughput
Rich functional API exploiting streaming
runtime
Flexible windowing semantics
Exactly-once processing guarantees
with (small) state
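Windowing is central to the stream semantics mentioned above. A tumbling (non-overlapping) count window, emitting one aggregate each time a window closes, can be sketched in plain Python (an illustration of the concept, not Flink's actual API; the sensor readings are invented):

```python
def tumbling_windows(stream, size):
    # Collect events into fixed-size, non-overlapping windows;
    # emit a per-window aggregate (here: the average) on each close
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield sum(window) / len(window)
            window = []

readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # e.g. sensor values
averages = list(tumbling_windows(readings, size=3))
# two windows close: [1,2,3] -> 2.0 and [4,5,6] -> 5.0
```

Flink generalizes this with time-based and sliding windows and manages the window state fault-tolerantly across the cluster.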
33
Stream Processing: Apache Flink - examples
Scala example: Reading from data sources
Marton Balassi (data Artisans): “Real-time Stream Processing with
Apache Flink”
34
Apache Projects (relevant for Big Data)
Big data (36):
– Airavata, Ambari, Apex, Avro, Bigtop, BookKeeper, Calcite, CouchDB,
Crunch, DataFu (Incubating), DirectMemory (in the Attic), Drill, Falcon,
Flink, Flume, Giraph, Hama, Helix, Ignite, Kafka, Knox, MetaModel,
Oozie, ORC, Parquet, Phoenix, Quarks (Incubating), REEF, Samza, Spark,
Sqoop, Storm, Tajo, Tez, VXQuery, Zeppelin
Database (25):
– Accumulo, Cassandra, Cayenne, Cocoon, CouchDB, Curator, Derby,
Empire-db, Forrest, Gora, Hadoop, HBase, Hive, Jackrabbit, Lucene
Core, Lucene.Net, Lucy, MetaModel, OFBiz, OpenJPA, ORC, Phoenix,
Pig, Torque, ZooKeeper
Many more: https://projects.apache.org/projects.html?category
35
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
36
Big Data Processing – Data Analysis
Data
Collection
Integration/
Aggregation
Analysis/
Modelling Interpretation
Extraction/
Cleaning/
Annotation
Volume
Veracity
Velocity
Variety
…
Privacy
Hu
man
In
tera
ctio
n
Value
37
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
38
Machine Learning Algorithms
SVM: Support Vector Machine
HMM: Hidden Markov Models
LDA: Linear Discriminant Analysis
LSA: Latent Semantic Analysis
39
Machine Learning Algorithms
Unsupervised Learning: unlabeled data – goal: discover classes within the data
Supervised Learning: labeled data – goal: classify unknown data
Iris flower data set (Ronald Fisher, “The use of multiple measurements in taxonomic problems”, 1936)
Labeled (supervised): 5.0,3.5,1.6,0.6,Iris-setosa · 5.1,3.8,1.9,0.4,Iris-setosa · 4.8,3.0,1.4,0.3,Iris-setosa · 5.1,3.8,1.6,0.2,Iris-setosa · 4.6,3.2,1.4,0.2,Iris-setosa · 5.3,3.7,1.5,0.2,Iris-setosa · 5.0,3.3,1.4,0.2,Iris-setosa · 7.0,3.2,4.7,1.4,Iris-versicolor · 6.4,3.2,4.5,1.5,Iris-versicolor · 6.9,3.1,4.9,1.5,Iris-versicolor · …
Unlabeled (unsupervised): the same measurements without the species label: 5.0,3.5,1.6,0.6 · 5.1,3.8,1.9,0.4 · …
40
Machine Learning Algorithms
Unsupervised Learning
(dealing with unlabeled data)
Examples:
– Frequency counts and clustering of words from a text data set
– Identifying patches that are similar with regard to six edaphic and physiographic
variables (e.g. 50-year mean monthly temperature) with a spatial clustering
technique in the US
– Clustering the Iris data set
41
Machine Learning Algorithms
Supervised Learning
(dealing with labeled data)
Examples:
– Classify handwritten digits and images with Deep Learning methods
– Classify text documents with Naïve Bayes
42
Machine Learning Algorithms
The most popular machine learning algorithms are …
Unsupervised methods/Clustering
– K-Means
– Hierarchical clustering
– Gaussian mixture model
– …
Supervised methods
– Logistic regression
– Decision tree classifier
– Random forest classifier
– Naïve Bayes
– Multilayer perceptron classifier
– Support Vector Machine
– HMM: Hidden Markov Models
– LDA: Linear Discriminant Analysis
– LSA: Latent Semantic Analysis
– …
43
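To make the unsupervised case concrete, K-Means (the first algorithm in the list) can be sketched in a few lines of Python on toy 1-D data (illustrative only; the data and starting centers are invented, and libraries such as Spark MLlib or scikit-learn provide production implementations):

```python
def kmeans(points, centers, iterations=10):
    # Lloyd's algorithm: alternate assignment and centroid update
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups around 1.0 and 9.0
centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])
```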
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
44
Deep Learning
Deep Learning methods are already applied in many different domains.
There is a need for user-friendly frameworks that enable the development of
application-dependent deep neural networks.
45
Deep Learning
Deep neural networks have many adjustable parameters:
– Number of input/output nodes
– Number of hidden layers
– Activation function
– Number of connections
– …
46
Application - Intraoperative Thermal Imaging
N. Hoffmann et al. Learning Thermal Process Representations for Intraoperative Analysis of Cortical Perfusion During Ischemic
Strokes. Springer International Publishing, 2016.
Imaging neural activity by measuring small
temperature variations
Data from University Hospital Dresden
Long-term intraoperative measurements
(~10 minutes) and fast preprocessing
Deep learning framework (Keras) to approximate
parameters
3000 frames (5.4 GB) every minute
(50 Hz sampling rate), ~7000 s on a single node
Parallel implementation of a real-time data
processing pipeline with Apache Spark
8 nodes on an HPC cluster, ~30 s, 200x faster
(a) Infarct demarcation in postoperative computed
tomography (CT)
(b) Infarct demarcation in postoperative CT scans and
thermal image sequences mapped onto the cortex
(c) Heating parameters approximated by the Deep
Learning approach, applied to a heating sequence
caused by cortical irrigation
47
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
48
Graph Analytics
“Graphs are everywhere”
Graph = vertices + edges
Potentially very large:
– Facebook:
• ca. 1.49 billion active users
• ca. 340 friends per user
– WWW:
• ca. 1 billion sites
Martin Junghans: “Scalable Graph
Analytics with Gradoop”
49
Graph Analytics – a short excursus
Can be used as add-ons to Analytics Frameworks
Some systems: Gradoop, GraphX, GraphLab, Giraph,
Pregel, …
Lecture “Gradoop”, Prof. E. Rahm
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
50
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
51
Data Analytics Frameworks
There are already several sophisticated frameworks supporting Machine
Learning on Big Data, such as
Spark ML
Flink ML
Mahout
H2O
…
52
Data Analytics Frameworks
There are already several sophisticated frameworks supporting Deep Learning
applications like
Keras
Caffe
cuDNN
DeepLearning4j
TensorFlow
53
The Framework Keras
Deep Learning Framework
Written in Python
Based on Theano/TensorFlow
Convolutional/Recurrent Networks
Parallelization possible
GPU usage possible
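The building block that frameworks like Keras automate is the layer: a weighted sum followed by an activation function. One neuron's forward pass, stripped of any framework (the input values, weights, and bias are illustrative assumptions; training would adjust the weights via backpropagation):

```python
import math

def sigmoid(x):
    # Activation function: squash the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # One unit of a dense layer: activation(w · x + b)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

out = neuron(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=0.0)
# z = 0.4*1.0 + (-0.2)*0.5 + 0.0 = 0.3, so out = sigmoid(0.3)
```

A deep network stacks many such layers; the framework's job is to wire them up, run the forward and backward passes efficiently (including on GPUs), and fit the parameters to labeled data.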
54
Data Analytics Frameworks
Established frameworks for statistical analysis include, e.g.:
R (script language with statistics community projects)
MATLAB, GNU Octave, Scilab, …
Python packages (NumPy, pandas, SciPy, …)
ROOT (High Energy Physics package)
SOFA statistics
Torch
Weka
….
55
Structure
Introduction/Motivation
Big Data Processing
– Classical MapReduce approach
– In-Memory processing
– Stream processing
Big Data Analysis
– Machine Learning Algorithms
– Deep Learning
– Graph Analysis
– Data Analytics Frameworks
Big Data and HPC
56
Big Data and HPC – convergence patterns
There is no unique fingerprint for “big data applications”
The big V's of big data (volume, velocity, …) imply very different usage scenarios
Also: no unique architecture serves all facets
Why converge the two worlds?
Rise of algorithms: highly iterative learning models (e.g. deep learning) become
interesting for many application areas
Many initiatives to overcome the initial MapReduce approach:
– Big Data meets InfiniBand: 56 Gbps FDR InfiniBand, >100x faster than
even 10 Gbps Ethernet
– Big Data meets accelerators: switch between CPU/GPU usage at the
framework level (e.g. Apache Spark on GPUs, TensorFlow on CPU/GPU)
– Using virtualization/cloud paradigms to shape environments over bare
metal
57
Big Data and HPC – convergence patterns
Requirements to support Big Data workloads on HPC
Support frameworks: more versatile software stacks
Fast access to data: not just self-production of
data (simulation), but also use of 3rd-party data
(open data, domain repositories)
Support different data processing paradigms
on the very same system:
– Streaming
– Batch
– Iteration
Better support for the evaluation of (temporary)
results, e.g. visualization frontends
Service orientation (working environments)
58
Big Data and HPC – convergence patterns
High Performance Data Analytics
59
Big Data and HPC – convergence patterns
High Performance Data Analytics
Classical HPC
60
Big Data and HPC – convergence patterns
High Performance Data Analytics
Add system features via a virtualization layer
61
Big Data and HPC – convergence patterns
High Performance Data Analytics
Enhance the software stack up to completely unique software settings
62
Big Data Analytics and HPC
HPC offers a lot of support for Big Data Analytics
Software frameworks allow users to exploit HPC support for
– Parallelization
– CPU and GPU usage
– Strong computational power per node
– Compute power for visualization
– Streaming
– Storage
– Shared memory
– Usage of specifically adapted hardware
Thank You!
Sunna Torge
Rene Jäkel
Ralph Müller-Pfefferkorn