TRANSCRIPT
Optimizing Machine Learning Workloads on Intel® Platforms

Colfax International — colfaxresearch.com
November 2016
colfaxresearch.com/ Welcome © Colfax International, 2013–2016
Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents, and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.
Colfax Research
http://colfaxresearch.com/
§2. Code Modernization
What is Code Modernization?

Code Modernization: optimizing software to better utilize features available in modern computer architectures.

▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Bar chart: "Optimization of NeuralTalk2", performance (images/s) after successive optimization stages: Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache. Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s (28x overall). Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s at the later stages (55x overall).]

Colfax Research Summary Paper
Intel Python Performance

[Bar chart: "Intel Python on Knights Landing Processors (N=5000)", relative performance of CPython/SciPy (baseline 1.0), CPython/NumPy, and Intel Python/SciPy for LU Decomposition, Cholesky Decomposition, Singular Value Decomposition, and DGEMM. CPython/NumPy: 3.5, 3.6, 1.1, 7.0. Intel Python/SciPy: 29.0, 17.0, 8.3, 154.0.]
Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
Three Approaches

High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.

Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.

Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® DAAL, Intel® MKL-DNN, etc.
§3. The High Level Approach
Intel Libraries for Machine Learning

[Bar charts: forward/backward performance of BVLC Caffe vs. Intel-optimized Caffe, minibatch 64. LeNet (CIFAR-10), k img/s: Xeon Phi processor 0.15k (BVLC) vs. 13.27k (Intel); Broadwell Xeon processor 0.75k (BVLC) vs. 25.16k (Intel). VGG16 (ImageNet), img/s: Xeon Phi processor 0.91 (BVLC) vs. 54.40 (Intel); Broadwell Xeon processor 3.82 (BVLC) vs. 28.57 (Intel).]
References for Intel Machine Learning Libraries

▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel® MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ Intel Python (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.
Intel Distribution for Python

[Diagram: software stack. Python packages such as SciPy and frameworks such as Caffe run on the Intel Distribution for Python, which is backed by the Intel Math Kernel Library and Intel DAAL.]
Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
§4. Low Level Approach
Optimization Areas

▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch

[Same "Optimization of NeuralTalk2" chart as in §2: performance (images/s) at each stage — Original, Intel Compiler+MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache; 28x overall on the 2-socket Intel® Xeon® E5-2650 v4, 55x overall on the Intel® Xeon Phi™ 7210 (KNL).]

Colfax Research Summary Paper
Base Torch Performance

[Line chart: computed performance (images/s, 64 threads) vs. batch count (10–60 images); throughput stays below 18 images/s. Time spent by layer: ReLU 66%, Conv 30%, MaxPool 3%, Other <1%.]
Performance After ReLU Optimization

[Line chart: images/s vs. batch count (10–60 images), original vs. ReLU-optimized runs (64 threads); the optimized ReLU layer is about 160x faster. Time spent by layer after optimization: ReLU 1%, Conv 85%, MaxPool 11%, Other 3%.]
FALCON Paper
https://colfaxresearch.com/falcon-library/
Learn More
Colfax Research
http://colfaxresearch.com/
§5. The Middle Ground Approach
Intel MKL and Intel MKL-DNN

Slide credit: Intel Corp.
Stand-alone Example: Convolution

```c
// Creating MKL DNN primitive object
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Creating the needed data buffer
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload
dnnExecute_F32(convFwd, conv_res);
```
For more: Intel MKL documentation on DNN primitives
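What the forward-convolution primitive computes can be sketched in plain Python (illustrative only, not the Intel MKL API; `conv2d_forward` is a name of our own): a single image, single channel, stride 1, no padding.

```python
# Naive version of the forward convolution that a primitive like
# dnnConvolutionCreateForward_F32 implements in optimized, vectorized form.
def conv2d_forward(src, flt):
    H, W = len(src), len(src[0])
    R, S = len(flt), len(flt[0])
    out = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):
        for x in range(W - S + 1):
            acc = 0.0
            for r in range(R):
                for s in range(S):
                    acc += src[y + r][x + s] * flt[r][s]
            out[y][x] = acc
    return out

image  = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # each output picks src[y][x] + src[y+1][x+1]
print(conv2d_forward(image, kernel))  # → [[6.0, 8.0], [12.0, 14.0]]
```

The MKL primitive performs this same arithmetic, but blocked, threaded, and vectorized for the target microarchitecture.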
Example Integration: IntelCaffe

GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp

```cpp
// Grabbing parameters from Caffe Layers
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_   = bottom[0]->height();
width_    = bottom[0]->width();
num_      = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h(); kernel_w_ = pool_param.kernel_w();
// ..... //

// Creating the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );
```
§6. Distributed Memory Computation
"FLOPs Are Cheap"?

Theoretical estimates, Intel® Xeon E5-2697 V3 processor:

Performance = 28 cores × 2.7 GHz × (256/64) vec. lanes × 2 FMA × 2 FPU ≈ 1.2 TFLOP/s
Required Data Rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA Max Bandwidth = 12.5 GB/s ≈ 0.01 TB/s
Ratio = 10/0.01 ≈ 1000 (FLOPs)/(Memory Transferred)

In short, the difficulty of distributed computation: in the time it takes to transfer one data element, a processor can perform thousands of operations on it.
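The back-of-the-envelope estimate above can be checked directly (a sketch; the 28 cores count two 14-core sockets, and 256/64 is the AVX2 register width in double-precision lanes):

```python
# Reproduce the slide's arithmetic: compute throughput vs. interconnect bandwidth.
cores     = 28           # 2 sockets x 14 cores (Xeon E5-2697 v3)
ghz       = 2.7          # clock, in units of 1e9 cycles/s
vec_lanes = 256 // 64    # 256-bit AVX2 registers / 64-bit doubles = 4 lanes
fma_flops = 2            # each FMA counts as a multiply plus an add
fpus      = 2            # two FMA-capable units per core

gflops = cores * ghz * vec_lanes * fma_flops * fpus
print(round(gflops / 1000, 2), "TFLOP/s")   # ~1.21, the slide's 1.2 TFLOP/s

data_rate = (gflops / 1000) * 8             # TB/s if every FLOP streamed 8 bytes
opa       = 12.5 / 1000                     # Omni-Path ~12.5 GB/s, in TB/s
print(round(data_rate / opa))               # ~774; the slide rounds to ~1000
```

The exact ratio comes out near 774; the slide rounds it to the order of magnitude, ~1000 FLOPs per element transferred.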
Distributed Computation for Neural Networks

[Diagram: two distribution schemes, each showing the Forward → Backward → Loss → Update cycle running on node 1 and node 2. Data Parallel: each node runs a full copy of the model on its own data, and gradients are gathered across nodes (gradient transferred, but not data). Model Parallel: the model is split across nodes, which exchange partial results (data transferred, but not gradient).]
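The data-parallel scheme can be sketched in a few lines (illustrative only; real frameworks do the gather/broadcast with MPI or a parameter server, and the names here are our own):

```python
# Data parallelism in miniature: each "node" computes a gradient on its own
# shard of data; gradients are gathered and averaged, and the averaged update
# is applied everywhere so all nodes keep identical weights.
def local_gradient(weights, shard):
    # stand-in for forward + backward propagation: gradient of the mean
    # squared error of a one-parameter model y = w * x
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

def data_parallel_step(weights, shards, lr=0.1):
    grads = [local_gradient(weights, shard) for shard in shards]   # per node
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]              # broadcast

# two nodes, each holding its own shard of (x, y) pairs sampled from y = 3x
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = [0.0]
for _ in range(50):
    w = data_parallel_step(w, shards)
print(round(w[0], 3))  # → 3.0
```

Only the (small) gradient vector crosses node boundaries each step; the (large) training data never moves, which is exactly the trade-off the diagram illustrates.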
Caffe Scaling

Source: Intel® Corporation. (Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family)
Machine Learning Framework: Intel® DAAL
Algorithms in DAAL

Analysis:
▷ Low Order Moments
▷ Quantile
▷ Correlation and Variance
▷ Cosine Distance Matrix
▷ Correlation Distance Matrix
▷ K-Means Clustering
▷ Principal Component Analysis
▷ Cholesky Decomposition
▷ Singular Value Decomposition
▷ QR Decomposition
▷ Expectation-Maximization
▷ Multivariate Outlier Detection
▷ Univariate Outlier Detection
▷ Association Rules
▷ Kernel Functions
▷ Quality Metrics

Training & Prediction:
▷ Regression: Linear/Ridge Regression
▷ Classification: Naive Bayes Classifier, Boosting, SVM, Multi-Class Classifier
▷ Neural Networks
Portal: DAAL page. See also: intro article, CR papers.
Algorithms in DAAL

[Diagram: three processing modes. Batch Mode: one full data set, a full computation, one final result. Online Mode: data sets arrive in sequence, and each pass of the full computation folds a new block into the result. Distributed Mode: each node runs a partial computation on its own data set, and the partial results are combined into the final result.]
Portal: DAAL page. See also: intro article, CR papers.
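The online mode can be illustrated with a hand-rolled version of one of the simplest algorithms listed above, low order moments (the class and method names here are our own, not the DAAL API):

```python
# Online mode in miniature: fold incoming data blocks into a running partial
# result, then finalize. The "moment" computed here is just the mean.
class OnlineMean:
    def __init__(self):
        self.total = 0.0   # partial result: running sum
        self.count = 0     # partial result: running count

    def compute(self, block):
        # consume one data block; the full data set never has to fit in memory
        self.total += sum(block)
        self.count += len(block)

    def finalize(self):
        # produce the final result from the accumulated partial result
        return self.total / self.count

algo = OnlineMean()
for block in ([1.0, 2.0], [3.0], [4.0, 5.0, 6.0]):  # data arriving in pieces
    algo.compute(block)
print(algo.finalize())  # → 3.5
```

The distributed mode follows the same pattern, except that the partial results come from different nodes rather than from successive blocks on one node.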
Communication Framework: MPI
Structure of MPI Applications: Hello World

```cpp
#include "mpi.h"
#include <cstdio>
int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                  // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    // ID of current process
  MPI_Get_processor_name (name, &namelen);  // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);    // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                          // Terminate MPI environment
}
```
Collective Communication: Gather

```c
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);
```

[Diagram: each sender contributes its data; the receiver (root rank) collects all pieces.]
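The semantics of the gather operation can be shown without MPI (a pure-Python stand-in; `gather` and its arguments are our names, not the MPI API):

```python
# MPI_Gather semantics: every rank sends an equal-size buffer, and the
# root rank ends up with all buffers concatenated in rank order.
def gather(send_bufs, root):
    recv = [x for buf in send_bufs for x in buf]  # concatenate in rank order
    return {root: recv}  # only the root rank has a valid receive buffer

send_bufs = [[r, r * 10] for r in range(4)]  # 4 ranks, 2 elements each
print(gather(send_bufs, root=0))  # → {0: [0, 0, 1, 10, 2, 20, 3, 30]}
```

In the neural-network training loop shown later, this is how per-node partial results reach the master node.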
Collective Communication: Broadcast

```c
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);
```

[Diagram: the sender (root rank) pushes its data to every receiver.]
Implementation
Example Distributed Image Processing: DAAL

▷ Algorithm <step1Local> is responsible for the forward/backward propagation.

```cpp
training::Distributed<step1Local> local_net;   // local net algorithm
local_net.compute();                           // forward/backward
part_res = local_net.getPartialResult();       // getting partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                 // Update the weights/bias
```

▷ Algorithm <step2Master> is responsible for accumulating the gradient.

```cpp
training::Distributed<step2Master> master_net; // master net algorithm
master_net.input.add(training::partialResults, // Add partial result
                     0, part_res);
master_net.compute();                          // Accumulate gradients
wbModel = master_net.getPartialResult()        // Get Current Model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();           // Extract weights/bias
```
Example Distributed Image Processing (Part 1)

```cpp
// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial result from the buffer... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();
```
Example Distributed Image Processing (Part 2)

```cpp
// ... Continuing on the master compute ... //

// Extract the weight/bias from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store weights/bias into a buffer (char*) ... //

// Broadcast the weights/bias to all nodes //
MPI_Bcast(.....);

// ... Code to reconstruct the weights/bias from buffer ... //

// Update the weights on local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);
```
Parallel Efficiency

[Chart: scaling of Distributed LeNet vs. theoretical linear scaling, 1–4 nodes. Parallel efficiency: 93% at 2 nodes, 91% at 3 nodes, 87% at 4 nodes.]

Further performance optimizations and model parallelism are coming soon...
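Parallel efficiency is speedup divided by node count, so the percentages in the chart translate back into absolute speedups (assuming, as the chart suggests, that 93%, 91%, and 87% correspond to 2, 3, and 4 nodes):

```python
# Convert the reported parallel efficiencies back into speedups:
# speedup = efficiency * nodes
for nodes, eff in [(2, 0.93), (3, 0.91), (4, 0.87)]:
    print(f"{nodes} nodes: {eff * nodes:.2f}x")
# 2 nodes: 1.86x / 3 nodes: 2.73x / 4 nodes: 3.48x
```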
§7. Final Words
Colfax Research
http://colfaxresearch.com/
Thank you for your Attention!
Join us at Booth #2407 at SC16!