The convergence of HPC and Big Data
Intel® Data Analytics Acceleration Library (DAAL)
Roger Philp
Intel HPC Software Workshop Series 2016
HPC Code Modernization for Intel® Xeon and Xeon Phi™
February 17th 2016, Barcelona
Setting the stage for analytics
Data analytics in the age of Big Data
Big Data is converging with High Performance Computing
• Extreme Data Volumes
• Intense Compute Workloads
Gap between current programming and hardware evolution
• More cores/threads, wider vectors, more memory, more storage, faster interconnect
• Many big data applications leave performance on the table: they are not optimized for the underlying hardware
[Figure: hardware trends: more cores, more threads, more memory, more storage, wider SIMD vectors, faster interconnect]
Fraud detection
Detected using Benford’s law
[Figure: two panels, A and B]
Benford’s law
Also called the first digit law
States that in listings, tables of statistics, etc., the leading digit is 1 with probability ∼30%, much greater than the 11.1% expected if all nine possible leading digits (1 to 9) were equally likely
Source: MathWorld
[Figure: bar chart of P(D), the probability of leading digit D, for D = 1 to 9: 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, 4.6%]
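The percentages in the chart match the standard closed form of Benford's law, P(d) = log10(1 + 1/d). The sketch below computes the expected probabilities and compares them with the observed leading digits of a sample. The function names and the sample (the first 1000 powers of two, a sequence known to follow the law closely) are illustrative assumptions and do not come from the slides.

```python
import math
from collections import Counter

def benford_probability(d: int) -> float:
    """Expected probability that the leading digit equals d (1..9)."""
    return math.log10(1 + 1 / d)

def leading_digit(x: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(x)[0])

def leading_digit_frequencies(values):
    """Observed share of each leading digit 1..9 in a list of positive integers."""
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical sample, chosen only for illustration: the first 1000 powers of two.
sample = [2 ** n for n in range(1, 1001)]
observed = leading_digit_frequencies(sample)

for d in range(1, 10):
    print(f"{d}: expected {benford_probability(d):.3f}, observed {observed[d]:.3f}")
```

A fraud screen built on this idea would flag data sets whose observed frequencies deviate strongly from the expected ones; choosing the deviation threshold is outside the scope of this sketch.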
IDC estimates that High Performance Data Analytics has saved PayPal more than $700 million so far (fraud detection)
Big data analytics: problem statement
[Diagram: software stack: Spark* MLlib, Breeze, Netlib-Java, JVM, JNI, Netlib BLAS]
• Runs on state-of-the-art hardware
• Built with a patchwork of math libraries
• Does not exploit hardware performance features
• Limited performance
• Many layers of dependencies
• Low ROI on hardware investment
Big data analytics: desired solution
• Runs on state-of-the-art hardware
• Single library to cover all stages of data analytics
• Fully optimized for the underlying hardware
• Optimized performance
• Simpler development/deployment
• High ROI on hardware investment
Intel® DAAL for HPC and analytics
Introducing Intel® DAAL: Data Analytics Acceleration Library
An industry-leading, end-to-end, IA-based data analytics acceleration library of fundamental algorithms covering all data analysis stages
What Intel DAAL brings to the game:
• Optimized algorithms based upon Intel® Math Kernel Library (Intel® MKL)
• Data serialization primitives to help run in a distributed system
• Speed and some building blocks for preprocessing data
• Supports IA-32 and Intel® 64 architectures
• C++ and Java APIs
• Static and dynamic linking
• A standalone library, also bundled in Intel® Parallel Studio XE 2016
• Windows*, Linux*, and OS X*
• Microsoft Visual Studio* (Windows*) and Eclipse/CDT* (Linux*)
Create faster code… faster
Intel® Parallel Studio XE
• Design, build, verify and tune
• C++, C, Fortran and Java*
Highlights from what’s new for the “2016” edition
• Intel® Data Analytics Acceleration Library
• Vectorization Advisor: custom analysis and advice
• MPI Performance Snapshot: scalable profiling
• Support for the latest standards, operating systems and processors
Intel® Parallel Studio XE 2016 configurations
Licensing
• Full licensing (including Intel® Premier Support): Composer Edition, Professional Edition, Cluster Edition
• Free licensing: Student/Educator, Open Source Contributor, Academic Researcher, Community (Everyone!)
Components
• Intel® C/C++ Compiler (including Intel® Cilk™ Plus)
• Intel® Fortran Compiler
• OpenMP 4.0
• Intel® Threading Building Blocks (C++ only)
• Intel® IPP Library (C/C++ only)
• Intel® Math Kernel Library
• Intel® Data Analytics Acceleration Library
• Intel® MPI Library
• Rogue Wave IMSL Library (Fortran only): Bundled and Add-on, Add-on, Add-on
• Intel® Advisor XE
• Intel® Inspector XE
• Intel® VTune™ Amplifier XE
• Intel® ITAC + MPI Performance Snapshot
What Intel® DAAL “is” and “isn’t”?
It is…
A performance library with C++ and Java APIs optimized for Intel architectures
A collection of common building blocks for constructing high-end solutions in all stages of a data analytics project
Abstracted from communication layers and data sources, to be easily integrated into different analytics platforms
Boosting performance of critical algorithms hence reducing time-to-value of your big data projects
It is not…
A programming environment or a cluster computing framework (like MATLAB, R, or Hadoop)
A black-box solution to tackle domain specific analytics needs
A toolkit or plug-in tied to a particular big data platform
Promoting fancy algorithms as the silver bullet for all your big data needs
How Intel accelerates analytics & machine learning
Intel® Analytics Toolkit for Apache Hadoop* software providing faster time to insight
Enhancing MLlib, Spark*, GraphX
Extending Machine Learning (ML) through Intel® Data Analytics Acceleration Library (Intel® DAAL) and “Trusted Analytics” platform
ML primitives accelerated through Intel® Math Kernel Library (Intel® MKL)
Intel® Xeon®, Intel® Xeon Phi™ processors powering the Data Center
[Diagram: layers numbered 1 to 5, labeled Hardware Platforms, Intel Analytics Toolkit, Intel DAAL, Intel MKL]
Intel® DAAL in the analytics workflow
Data sources: Scientific/Engineering, Web/Social, Business
• Pre-processing: (de-)compression, filtering, normalization
• Transformation: aggregation, dimension reduction
• Analysis: summary statistics, clustering
• Modeling: machine learning, parameter estimation, simulation
• Validation: hypothesis testing, model errors
• Decision making: forecasting, decision trees, etc.
Intel® DAAL in the analytics pipeline
[Diagram: analytics pipeline]
Stages: Ingest & Clean, Engineer Features, Build Graph, Query & Visualize, Learn
Steps: Data, Parse, Prepare Graph data, Basic Analysis, Run Graph/ML Algorithm, Insightful result
Where Intel DAAL applies:
• Modeling and decision making
• Summary statistics
• PCA
• Compression and outlier detection
Key features of Intel® DAAL
Optimized for Intel® Architecture
C++ and Java* API in the first release (more in future)
Flexible interface to build connectors to data sources, HDFS*, Spark* RDD, SQL, etc.
Batch, streaming, and distributed processing modes
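• Batch processing: the data blocks $D_1, \dots, D_k$ are appended into one data set and the result is computed in one pass, $R = F(D_1, \dots, D_k)$
• Streaming processing: blocks arrive one at a time; each block updates a partial state, $S_{i+1} = T(S_i, D_i)$, from which the current result is derived, $R_{i+1} = F(S_{i+1})$
• Distributed processing: each node computes a partial result $R_j$ from its own block $D_j$, and the partial results are combined, $R = F(R_1, \dots, R_k)$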
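The three processing modes can be sketched with a deliberately simple statistic, the mean. This is a minimal illustration in plain Python, not Intel DAAL code: `PartialSum`, `update`, `merge` and `finalize` are hypothetical names standing in for the state S, the update T, the combination of partial results, and the final step F.

```python
from dataclasses import dataclass

@dataclass
class PartialSum:
    """Partial state S: enough to recover the mean of the data seen so far."""
    count: int = 0
    total: float = 0.0

def update(state: PartialSum, block) -> PartialSum:
    """T(S_i, D_i): fold one data block into the running state."""
    return PartialSum(state.count + len(block), state.total + sum(block))

def merge(states) -> PartialSum:
    """Combine partial results R_1..R_k that were computed independently."""
    return PartialSum(sum(s.count for s in states), sum(s.total for s in states))

def finalize(state: PartialSum) -> float:
    """F(S): derive the result (here, the mean) from a state."""
    return state.total / state.count

blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

# Batch: append all blocks, then compute once over the whole data set.
all_rows = [x for block in blocks for x in block]
batch_mean = sum(all_rows) / len(all_rows)

# Streaming: update the state block by block; a result is available after each block.
state = PartialSum()
for block in blocks:
    state = update(state, block)
streaming_mean = finalize(state)

# Distributed: each "node" reduces its own block, then the partial results are merged.
partials = [update(PartialSum(), block) for block in blocks]
distributed_mean = finalize(merge(partials))

print(batch_mean, streaming_mean, distributed_mean)
```

All three modes produce the same mean because the partial state (count and sum) carries everything the final step needs. Algorithms without such a compact state are harder to run in streaming or distributed mode.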
Data transformation and analysis (Intel® DAAL)
• Basic statistics for datasets: statistical moments, quantiles
• Correlation and dependence: cosine distance, correlation distance, variance-covariance matrix
• Matrix factorization: SVD, QR, Cholesky
• Dimensionality reduction: PCA
• Association rule mining (Apriori)
• Outlier detection: univariate, multivariate
Algorithms support streaming and distributed processing in the current release
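As an illustration of two entries in the list, the sketch below computes cosine distance and correlation distance for a pair of vectors, using the common textbook definitions (one minus cosine similarity, and one minus Pearson correlation). Whether Intel DAAL follows exactly these conventions is an assumption; the function names are illustrative and are not the library's API.

```python
import math

def cosine_distance(x, y):
    """1 - cos(angle between x and y). Undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def correlation_distance(x, y):
    """1 - Pearson correlation: cosine distance of the mean-centered vectors.

    Undefined when either vector is constant (zero variance).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))                 # orthogonal vectors
print(correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # perfectly anti-correlated
```

Cosine distance ignores vector length but not offset, while correlation distance ignores both, which is why the two are listed separately.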
Machine learning (Intel® DAAL)
• Supervised learning
  • Regression: linear regression
  • Classification: weak learner, boosting (Ada, Brown, Logit), Naïve Bayes, SVM
• Unsupervised learning: K-Means clustering, EM for GMM
• Collaborative filtering: ALS
Algorithms support streaming and distributed processing in the current release
To be available in future releases
Intel® DAAL workflow
Optimizes the entire workflow
• From data acquisition from SQL* and NoSQL data sources
• To data transformations
• To data analysis, training and prediction
• Connects to physical data (e.g. files, ODBC, etc.)
• Streams data into memory
• Transforms raw data into a numeric representation of a supported layout (Numeric Table)
• Performs filtering (outlier detection)
• Computes basic counters
• Streams in-memory data to the algorithm in blocks to improve data locality
• Converts a variety of numeric formats into a smaller set of numeric formats for effective vectorization
• Transforms the Numeric Table layout into the layout that is most efficient for a given algorithm
• Performs data processing
[Diagram labels: Compression Engine; Serialization and Compression Engine]
Intel® DAAL major component families
• Data processing: optimized analytics building blocks for all data analysis stages, from data acquisition to data mining and machine learning.
• Data modeling: data structures for model representation, and operations to derive model-based predictions and conclusions.
• Data management: interfaces for data representation and access. Connectors to a variety of data sources and data formats, such as HDFS, SQL, CSV, ARFF, and user-defined data source/format.
[Diagram labels: Data sources; Numeric tables; Outliers detection; Compression/Decompression; Serialization/Deserialization]
Data management in Intel® DAAL
Covers raw data acquisition, filtering, normalization, and conversion
• Data sources
  • Define interfaces for access/management of data in raw format and out-of-memory data
  • Stream and transform raw out-of-memory data into numeric in-memory data
• Numeric tables
  • Fundamental component of in-memory numeric data processing
  • Support heterogeneous and homogeneous numeric tables for dense and sparse data
The usage model might also require compression/decompression, etc.
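The data-source idea (stream raw data that does not fit in memory and turn it into numeric in-memory blocks) can be sketched with the Python standard library. This is not the Intel DAAL data source interface: `stream_numeric_blocks` is a hypothetical helper, and the CSV input with a header row is an assumed format.

```python
import csv
import io

def stream_numeric_blocks(text_stream, block_size):
    """Yield blocks of at most block_size rows, each row converted to a list of floats."""
    reader = csv.reader(text_stream)
    next(reader)  # skip the header line with the column names
    block = []
    for row in reader:
        block.append([float(cell) for cell in row])
        if len(block) == block_size:
            yield block
            block = []
    if block:  # last, possibly shorter, block
        yield block

# An in-memory string stands in for a large file or database cursor.
raw = io.StringIO("x,y\n1,2\n3,4\n5,6\n7,8\n9,10\n")
blocks = list(stream_numeric_blocks(raw, block_size=2))
print(blocks)
```

Because the generator holds only one block at a time, the same loop works when the source is far larger than memory; each yielded block plays the role of a small homogeneous numeric table.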
Data processing in Intel® DAAL
Algorithms for data analysis (data mining) and machine learning
• Matrix decompositions, clustering algorithms, PCA, etc.
The following processing modes are supported
• Batch: supported by all algorithms; the entire data set is available in the memory of a single process
• Online: data sets are processed in blocks (streaming), possibly asynchronously
• Distributed: the data set is split into blocks across computation nodes
  • Multi-device computing and data transfer scenarios (e.g., MPI, Hadoop/Spark, etc.)
Data modeling in Intel® DAAL
Structures and operations for data modeling in two stages
• Training: estimates model parameters based on a training data set
• Prediction: uses the trained model to predict the outcome based on new data
Available methods
• Regression: predict the values of dependent variables (responses) by observing independent variables
• Classification: identify to which sub-population (class) a given observation belongs
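The two stages can be sketched with a one-variable linear regression fitted by ordinary least squares. This is an illustration in plain Python under stated assumptions, not Intel DAAL code: `LinearModel`, `train` and `predict` are hypothetical names, and the training data is made up so that it lies exactly on the line y = 2x + 1.

```python
from dataclasses import dataclass

@dataclass
class LinearModel:
    """Model representation: the parameters estimated during training."""
    slope: float
    intercept: float

def train(xs, ys) -> LinearModel:
    """Training stage: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)

def predict(model: LinearModel, xs):
    """Prediction stage: apply the trained model to new observations."""
    return [model.slope * x + model.intercept for x in xs]

# Made-up training set: responses ys observed for independent variable xs.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = train(train_x, train_y)
predictions = predict(model, [4.0, 5.0])
print(model, predictions)
```

Keeping the model as a separate object mirrors the split described above: training produces it once, and prediction can then run on new data without access to the training set.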
More information about Intel® DAAL
https://software.intel.com/en-us/intel-daal