evaluation of k-means data clustering algorithm on intel xeon...
TRANSCRIPT
![Page 1: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/1.jpg)
Evaluation of K-Means Data Clustering
Algorithm on Intel Xeon Phi
Sunwoo Lee, Wei-keng Liao, Ankit Agrawal, Nikos Hardavellas, and Alok Choudhary
EECS Department
Northwestern University
![Page 2: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/2.jpg)
Contents
• Introduction
• Parallel K-Means Algorithm
• Techniques for Utilizing Xeon Phi’s Features
• Evaluation and Conclusion
1
![Page 3: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/3.jpg)
IntroductionIntel Xeon Phi
• Intel Xeon Phi Processor
Many Integrated Core (MIC) architecture
Relatively weak computing cores (1~1.3GHz)
57~72 physical cores with 4-way hardware threading
• Vectorization Capability
Instruction-level parallelism
512-bit wide Vector Processing Units (VPUs)
2
![Page 4: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/4.jpg)
IntroductionK-Means Data Clustering Algorithm
• Clustering Algorithm for Spatial Data
Partitions a dataset into k distinct groups
Each member data belongs to the cluster with the nearest mean
3
Figure Reference: https://en.wikipedia.org/wiki/K-means_clustering#/media/File:Iris_Flowers_Clustering_kMeans.svg
![Page 5: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/5.jpg)
• Loop-based Structure
for( n ∈ N ){ // N: Dataset
A = objects[n];
for( k ∈ K ){ // K: Dimensions
B = cluster_centers[k];
distance = calculate_distance(A, B);
if( min > distance )}
nearest = k;
min = distance;
}
}
Sums[nearest] += A;
}
cluster_centers = calculate_means(Sums);
Parallel K-Means (1/2)
4
The outer-most loop without data dependency;Thread-level parallelization
The inner-most loop;Instruction-level parallelization
Should reduce Sums from multiple processes before calculating averages
![Page 6: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/6.jpg)
Parallel K-Means (2/2)
5
• Multi-level Parallelization Strategy
Distributed Memory
MPI
Distribute dataset to multiple machines
Shared Memory
OpenMP
Assign a subset of the given data to each thread
Instruction-level Parallelism
Single Instruction Multiple Data (SIMD) operations
Auto-vectorization of Intel compiler
![Page 7: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/7.jpg)
Problems
• Naïve implementation of parallel K-Means engenders;
Low vectorization intensity
Low cache hit ratio
• Loop peelings
6
Aligned memory address
data0 data1 data2 data3 data4 data5 …
data0 data1 data2 data3 data4 data5 …
Vectorized from here
Try to vectorize but no guaranteed
![Page 8: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/8.jpg)
Efficient Vectorization Memory Layout
7
• 512-bit aligned contiguous memory spaces
Intel AVX-based VPUs require a memory alignment for optimal
data movement for vector instructions
local_sums
cluster_centers
Tid0Cluster 0
Cluster 0
Tid0Data 0data_objects
Tid0Cluster 1
Cluster 1
Tid0Data 1
…
…
Tid0Data 2
Tid0Cluster k
Cluster k
…
Tid1Cluster 0
Tid0Data t
Tid1Cluster 1
Tid1Data 0
…
Tid1Data 1
Tid1Cluster k
Tid1Data 2
…
… Tid1Data t
Dim0 Dim1 … Dim d
Computing Node 0
Computing Node 1
Computing Node 2
Computing Node P
…
![Page 9: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/9.jpg)
Efficient Vectorization Data Padding (1/2)
• What if dimension is not divisible by VPU width?
E.g., distance calculation with 10-dimensional data points
data0 data1 data2 data3 …
0x12340200 0x12340228
10-dimensional data
Unaligned! Loop peeling from the second iteration!
![Page 10: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/10.jpg)
Efficient Vectorization Data Padding (2/2)
• Pad up each data point so as to make them aligned
• Every distance calculation is vectorized perfectly at the
cost of more memory consumption
9
data0 data1 data2 data3 …
0x12340200 0x12340240
10-dimensional data
Every starting point is aligned well!
24 bytes padding
![Page 11: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/11.jpg)
Cache-aware Reduction
10
• Averaging Coordinates for Calculating New Centroids
When adding two local data, one should be always its own local
data make data stays in cache as long as possible
Read the remote data from threads with close thread IDs
consecutive 4 threads share a L2 cache
Local Data 0
Local Data 1
Local Data 2
Local Data 3
Local Data 0,2
Local Data 1,3
Local Data 0:3
![Page 12: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/12.jpg)
Experiment Setup
• Host System Dual socket Intel Xeon processor, E5-2695 v3
14 physical cores and 2-way SMP per core
2296 MHz frequency of each core
• Xeon Phi MIC architecture, Xeon Phi 7120 (Knight Corner)
61 physical cores and 4-way SMP per core
1238 MHz frequency of each core
• Dataset Synthetic datasets with various dimensions (aligned/unaligned)
4 millions /16 millions of data points with various dimensions
Data generator http://cucis.eecs.northwestern.edu/projects/DMS/MineBenchDownload.html
11
![Page 13: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/13.jpg)
EvaluationPerformance Study with Aligned Dataset
12
Dimension 16 32 64 128 256
Original (Phi) 73.37 87.9 112.79 142.3 179.99
Optimized (Phi) 34.41 27.56 36.55 55.58 102.5
Original (Host) 13.09 24.87 41.07 62.43 118.99
Optimized (Host) 17.18 22.79 36.55 63.89 123.08
0
20
40
60
80
100
120
140
160
180
200
16 32 64 128 256
Tim
ing
(sec
)
Number of Dimensions
Original (Host)
Optimized (Host)
Original (Phi)
Optimized (Phi)
![Page 14: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/14.jpg)
Evaluation Effect of Cache-aware Parallel Reduction
13
Works Computation Reduction
Original (Host) 39.62 0.55
Optimized (Host) 38.71 1.03
Original (Phi) 39.31 14.11
Optimized (Phi) 38.60 0.18
0 10 20 30 40 50 60
Optimized (Phi)
Original (Phi)
Optimized (Host)
Original (Host)
Execution Time (sec)
Computation Reduction
Metric
Original
(Phi)
Optimized
(Phi)
Average CPI per Thread 3.51 2.82
Average CPI per Core 0.88 0.71
Vectorization Intensity 12.33 14.71
L1 Data Access Ratio 31.11 36.24
L2 Data Access Ratio 358.99 15628.93
L1 Hit Ratio 0.83 0.83
L1 TLB Miss Ratio 0.0011 0.0021
Performance Analysis with Intel VTune Amplifier Tools
![Page 15: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/15.jpg)
EvaluationPerformance Study with Unaligned Dataset
14
Dimensions 3 13 16
Original 45.65 64.41 73.37
Optimized 1 46.27 62.96 38.82
Optimized 2 38.12 38.11 38.07
0
10
20
30
40
50
60
70
80
3 13 16
Exec
uti
on
Tim
e (s
ec)
Number of Dimensions
Original
Memory Alignment
Memory Alignment + DataPadding
![Page 16: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/16.jpg)
EvaluationPerformance Study on Multi-Node Supercomputer
15
Dimensions 3 8 13
Original 10.28 10.83 17.81
Optimized 1 9.73 8.55 18.17
Optimized 2 8.58 8.49 9.55
0
5
10
15
20
3 8 13
Exec
uti
on
Tim
e (S
ec)
Number of Dimensions
Original
Memory Alignment
Memory Alignment+ Data Padding
Nodes 1 2 4 8 16 32 64 128 256 512
Original 1074.98 522.45 264.73 133.56 69.83 37.01 21.12 13.55 9.52 8.19
Optimized 539.64 258.82 129.31 65.13 32.92 17.23 11.64 6.59 4.73 4.87
1
10
100
1000
10000
Exec
uti
on
Tim
e (s
ec)
Number of Nodes (Number of Cores)
Original Optimized
![Page 17: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/17.jpg)
EvaluationMulti Node Performance Study
16
Nodes 1 2 4 8 16 32 64 128 256 512
Original 1 2.06 4.06 8.05 15.39 29.04 50.91 79.36 112.86 131.25
Optimized 1 2.08 4.17 8.29 16.39 31.31 46.35 81.83 114.13 110.74
1
2
4
8
16
32
64
128
256
512
1 2 4 8
16
32
64
12
8
25
6
51
2
Spee
du
p
Number of Nodes
Linear
Original
Optimized
1
2
4
8
16
32
64
128
256
512
1024
Tim
ing
(Sec
)
Number of Nodes (Number of Cores)
Comp. (Opt)
Comp. (Orig)
Comm. (Orig)
Comm. (Opt)
![Page 18: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/18.jpg)
Conclusion
• To achieve a good performance on Intel Xeon Phi,
Efficient vectorization with an appropriate memory layout
Loop peelings can be avoided by padding each data point
Auto-vectorization can work better only with some hints
Efficient cache utilization in reduction
• Not only K-Means but also…
Any data mining algorithms with an iterative structure or
sequential memory access pattern
17
![Page 19: Evaluation of K-Means Data Clustering Algorithm on Intel Xeon Phicecsresearch.org/vcl/ASH/index_files/2016/Evaluation_of... · 2016-12-13 · Evaluation of K-Means Data Clustering](https://reader034.vdocuments.us/reader034/viewer/2022042110/5e8b0112056fed14020250f7/html5/thumbnails/19.jpg)
Q&A