TRANSCRIPT
NUMA-aware Matrix-Matrix-Multiplication
Max Reimann, Philipp Otto
About this talk
• Objective: Show how to improve the performance of algorithms on a NUMA system, using matrix-matrix multiplication (MMM) as the example
• Code was written in C with numa.h and pthread.h
• Tested on FSOC:
– ubuntu-0101: 2 nodes, 24 cores
– dl980: 8 nodes, 128 cores
• Compiled with gcc -O3
Naïve Matrix-Matrix-Multiplication
• We will examine MMM for large n × n matrices
• Runtime: 𝒪(n³)
Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png
Naïve MMM implementation
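The implementation on this slide was an image in the original deck; a minimal sketch of the naive triple loop (function name and row-major layout are assumptions, not the authors' code):

```c
#include <stdlib.h>

/* Naive O(n^3) matrix-matrix multiplication: C = A * B.
 * Matrices are stored row-major as contiguous 1-D arrays. */
void mmm_naive(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```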
Performance of Naive vs. MKL
[Chart: execution time in seconds (log scale) for n = 512, 1024, 2048, single core on dl980. Naive: 0.38 / 11.79 / 98.14 s; MKL: 0.02 / 0.13 / 1.02 s]
Intel Math Kernel Library (MKL)
• BLAS: Basic Linear Algebra Subprograms
– The standard interface for linear algebra routines
• MKL:
– Implements BLAS for Intel hardware
– Vectorized and threaded for highest performance
Analysis of Naïve MMM
• Test setup:
– Use the ubuntu-numa machine
– No thread or memory pinning
– Use numatop/pcm
• Performance tools show:
– Unused cores (obvious)
– QPI cannot be fully loaded with one thread
Parallelization I
• How can the work be divided?
– 1. Partition the computation of matrixC by rows or columns
• Problem: all threads need matrixA and matrixB
• Solution:
– Accept the overhead of remote memory accesses, or
– Copy the input/output matrices to the other nodes (preprocessing)
Parallelization – Partition by rows
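The row-partitioned implementation on this slide was an image; a minimal pthread sketch of the idea (thread count, struct and function names are illustrative assumptions; compile with -lpthread):

```c
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4  /* illustrative; the slides use up to 128 cores */

struct task { const double *A, *B; double *C; size_t n, row_begin, row_end; };

/* Each thread computes a contiguous block of rows of C. */
static void *worker(void *arg)
{
    struct task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

void mmm_rows_parallel(const double *A, const double *B, double *C, size_t n)
{
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    size_t chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        size_t b = t * chunk;
        size_t e = b + chunk < n ? b + chunk : n;
        tasks[t] = (struct task){ A, B, C, n, b, e < b ? b : e };
        pthread_create(&tid[t], NULL, worker, &tasks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```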
Parallelization – Partition by rows
[Chart: execution time in seconds (log scale) for n = 512, 1024, 2048 on dl980 with 128 cores. Naive Sequential: 0.38 / 11.79 / 98.14 s; Naive Parallel: 0.05 / 0.26 / 2.54 s; MKL Parallel: 0.19 / 0.27 / 0.28 s]
Parallelization II
• How can the work be divided?
– 2. Partition the computation of matrixC by summands
• Benefit:
– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This allows copying only the needed parts to the other nodes
• Disadvantages:
– matrixB has to be transposed to be able to partition the memory (preprocessing)
– Locking or merging of matrixC is needed
Parallelization II
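The code on this slide was an image; a sketch of the "parallel sum" idea, where each thread accumulates its range of summands into a private partial matrix and the merge is the locking step mentioned above (names and thread count are assumptions; compile with -lpthread):

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4  /* illustrative */

struct sum_task {
    const double *A, *B;
    double *C;              /* shared result matrix */
    pthread_mutex_t *lock;  /* protects the merge into C */
    size_t n, k_begin, k_end;
};

static void *sum_worker(void *arg)
{
    struct sum_task *t = arg;
    size_t n = t->n;
    double *partial = calloc(n * n, sizeof *partial);
    /* summand k is the outer product of column k of A and row k of B */
    for (size_t k = t->k_begin; k < t->k_end; k++)
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                partial[i * n + j] += t->A[i * n + k] * t->B[k * n + j];
    pthread_mutex_lock(t->lock);          /* merge step */
    for (size_t x = 0; x < n * n; x++)
        t->C[x] += partial[x];
    pthread_mutex_unlock(t->lock);
    free(partial);
    return NULL;
}

void mmm_parallel_sum(const double *A, const double *B, double *C, size_t n)
{
    pthread_t tid[NTHREADS];
    struct sum_task tasks[NTHREADS];
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    memset(C, 0, n * n * sizeof *C);
    size_t chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        size_t b = t * chunk;
        size_t e = b + chunk < n ? b + chunk : n;
        tasks[t] = (struct sum_task){ A, B, C, &lock, n, b, e < b ? b : e };
        pthread_create(&tid[t], NULL, sum_worker, &tasks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```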
Performance of "Parallel Sum" Method

[Chart: execution time in seconds (log scale) for n = 512, 1024, 2048, 4096, 8192 on dl980 with 128 cores. Parallel sum: 1.59 / 2.81 / 3.34 / 14.91 / 218.84 s; Naive Parallel: 0.27 / 1.41 / 2.94 / 17.24 / 186.39 s; MKL Parallel: 0.19 / 0.27 / 0.28 / 0.43 / 2.41 s]
Strassen
• Runtime complexity:
– The naive algorithm is 𝒪(n³)
• Can we do better?
– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
– Runtime 𝒪(n^log₂ 7) ≈ 𝒪(n^2.8)
• Algorithms today reach about 𝒪(n^2.37), but are not practical
– Strassen uses only 7 multiplications instead of 8 per recursion step
Matrix definition
For matrices A, B, C with dimension n = 4k, k ∈ ℕ, A, B, C can be viewed as 2×2 block matrices:

A = ( A1,1  A1,2 ; A2,1  A2,2 ),  B = ( B1,1  B1,2 ; B2,1  B2,2 ),  C = ( C1,1  C1,2 ; C2,1  C2,2 )

The conventional algorithm uses 8 (expensive) multiplications:

C1,1 = A1,1 ∙ B1,1 + A1,2 ∙ B2,1    C1,2 = A1,1 ∙ B1,2 + A1,2 ∙ B2,2
C2,1 = A2,1 ∙ B1,1 + A2,2 ∙ B2,1    C2,2 = A2,1 ∙ B1,2 + A2,2 ∙ B2,2
Strassen’s algorithm
Define temporary matrices:

M1 := (A1,1 + A2,2) ∙ (B1,1 + B2,2)
M2 := (A2,1 + A2,2) ∙ B1,1
M3 := A1,1 ∙ (B1,2 − B2,2)
M4 := A2,2 ∙ (B2,1 − B1,1)
M5 := (A1,1 + A1,2) ∙ B2,2
M6 := (A2,1 − A1,1) ∙ (B1,1 + B1,2)
M7 := (A1,2 − A2,2) ∙ (B2,1 + B2,2)

Compose the final matrix (only 7 multiplications!):

C1,1 = M1 + M4 − M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 − M2 + M3 + M6
Strassen - Example
( A1,1  A1,2 ; A2,1  A2,2 ) ∙ ( B1,1  B1,2 ; B2,1  B2,2 ) = ( C1,1  C1,2 ; C2,1  C2,2 )

Substituting the Mi by their terms gives back the original formula:

C1,2 = M3 + M5
     = A1,1B1,2 + A1,2B2,2
     = A1,1B1,2 − A1,1B2,2 + A1,1B2,2 + A1,2B2,2
     = A1,1 ∙ (B1,2 − B2,2) + (A1,1 + A1,2) ∙ B2,2
Strassen - Analysis
• Cost: 7 multiplications and 18 additions per recursion step
– versus 8 multiplications and 4 additions for the naïve algorithm
• Only practical for large matrices (n > 1000)
– Although our results indicate otherwise (later)
• Define a cutoff point for the recursion
– If n is sufficiently small, do naïve multiplication
Strassen - Implementation
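The implementation slide was an image; a sketch of recursive Strassen with the BREAK = 64 cutoff used on the later slides (buffer layout and helper names are assumptions, and n is restricted to powers of two for simplicity):

```c
#include <stdlib.h>
#include <string.h>

#define BREAK 64  /* recursion cutoff, as on the slides */

static void madd(const double *X, const double *Y, double *R, size_t n)
{ for (size_t i = 0; i < n * n; i++) R[i] = X[i] + Y[i]; }

static void msub(const double *X, const double *Y, double *R, size_t n)
{ for (size_t i = 0; i < n * n; i++) R[i] = X[i] - Y[i]; }

static void mmm_naive(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Strassen for n a power of two; falls back to the naive kernel
 * below the cutoff. One flat buffer holds the 8 input blocks,
 * 2 temporaries and the 7 M matrices of this recursion level. */
void strassen(const double *A, const double *B, double *C, size_t n)
{
    if (n <= BREAK) { mmm_naive(A, B, C, n); return; }
    size_t h = n / 2, s = h * h;
    double *buf = malloc(17 * s * sizeof *buf);
    double *A11 = buf,       *A12 = buf + s,   *A21 = buf + 2*s, *A22 = buf + 3*s,
           *B11 = buf + 4*s, *B12 = buf + 5*s, *B21 = buf + 6*s, *B22 = buf + 7*s,
           *T1  = buf + 8*s, *T2  = buf + 9*s, *M   = buf + 10*s;
    /* copy the four blocks of A and B into contiguous h x h matrices */
    for (size_t i = 0; i < h; i++) {
        memcpy(A11 + i*h, A + i*n,         h * sizeof *A);
        memcpy(A12 + i*h, A + i*n + h,     h * sizeof *A);
        memcpy(A21 + i*h, A + (i+h)*n,     h * sizeof *A);
        memcpy(A22 + i*h, A + (i+h)*n + h, h * sizeof *A);
        memcpy(B11 + i*h, B + i*n,         h * sizeof *B);
        memcpy(B12 + i*h, B + i*n + h,     h * sizeof *B);
        memcpy(B21 + i*h, B + (i+h)*n,     h * sizeof *B);
        memcpy(B22 + i*h, B + (i+h)*n + h, h * sizeof *B);
    }
    double *M1 = M, *M2 = M + s, *M3 = M + 2*s, *M4 = M + 3*s,
           *M5 = M + 4*s, *M6 = M + 5*s, *M7 = M + 6*s;
    madd(A11, A22, T1, h); madd(B11, B22, T2, h); strassen(T1, T2, M1, h);
    madd(A21, A22, T1, h);                        strassen(T1, B11, M2, h);
    msub(B12, B22, T2, h);                        strassen(A11, T2, M3, h);
    msub(B21, B11, T2, h);                        strassen(A22, T2, M4, h);
    madd(A11, A12, T1, h);                        strassen(T1, B22, M5, h);
    msub(A21, A11, T1, h); madd(B11, B12, T2, h); strassen(T1, T2, M6, h);
    msub(A12, A22, T1, h); madd(B21, B22, T2, h); strassen(T1, T2, M7, h);
    /* compose C from the Ms */
    for (size_t i = 0; i < h; i++)
        for (size_t j = 0; j < h; j++) {
            size_t x = i * h + j;
            C[i*n + j]         = M1[x] + M4[x] - M5[x] + M7[x];
            C[i*n + j + h]     = M3[x] + M5[x];
            C[(i+h)*n + j]     = M2[x] + M4[x];
            C[(i+h)*n + j + h] = M1[x] - M2[x] + M3[x] + M6[x];
        }
    free(buf);
}
```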
Execution Time: Single-threaded
[Chart: execution time in seconds (log scale) over matrix dimension n = 32 to 2048, single core on dl980; Strassen cutoff BREAK = 64. At n = 512 / 1024 / 2048: Naive 0.38 / 11.79 / 98.14 s; Strassen 0.12 / 0.87 / 6.12 s; MKL 0.02 / 0.13 / 1.02 s]
Parallelization of Strassen I
• Data dependencies:
– The additions inside an Mi have to be done before its multiplication
• e.g. M1 = (A1,1 + A2,2) ∙ (B1,1 + B2,2)
– All Mi have to be calculated before calculating C
• e.g. C1,2 = M3 + M5
• Easiest solution:
– Calculate the Mi in parallel
– Then calculate the Ci,j in parallel
Parallelization of Strassen II
• Level 1 can be scheduled to 7 threads
• Level n can be scheduled to 7ⁿ threads
– Most systems have a power-of-two number of processors
• We used manual parallelization
– 49 distinct functions for the Ms and 16 for the Cs
– Code bloat and not scalable, BUT:
• Automatic parallelization is hard
– Thread load becomes very unbalanced
– Every level needs 7 temporary matrices
• Exponentially rising memory requirements
Execution Time – 49 Threads
[Chart: execution time in seconds (log scale) for n = 512, 1024, 2048, 4096, 8192 on dl980 with 49 cores. Naive: 0.05 / 0.26 / 2.54 / 27.61 / 228.57 s; Strassen: 0.05 / 0.14 / 0.49 / 2.06 / 13.53 s; MKL: 0.19 / 0.27 / 0.28 / 0.44 / 1.84 s]
![Page 24: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/24.jpg)
NUMA-Optimizations
• Try to keep as much memory as possible local, to avoid remote memory accesses
– Remote access is slower by a factor of ~1.4
• Partition data and work depending on the number of nodes and cores
• Pin threads to the nodes that hold the memory they need
• (Consider the topology for other algorithms)
Distributing memory and threads
[Chart: execution time in seconds (log scale) of the parallel naive algorithm on ubuntu-numa0101 with 24 cores, n = 1024 / 2048 / 4096. Distributed memory and threads: 0.34 / 11.39 / 101.12 s; Neither distributed: 0.35 / 18.34 / 182.45 s; Distributed threads: 0.35 / 21.96 / 204.85 s; Distributed memory: 0.37 / 14.33 / 143.44 s]
![Page 26: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/26.jpg)
DEMO
Application of NUMA-Optimizations
• Copy all data to every node:
– Duration of preprocessing: 11.11 s for an 8192×8192 matrix to 8 nodes
• Partition the data and move it to the corresponding nodes:
– Duration of preprocessing: 1.03 s for an 8192×8192 matrix to 8 nodes
• Pin threads to nodes:
– int numa_run_on_node(int node);
Parallelization – Partition by rows: Copying memory to different nodes
Strassen Memory Distribution Effects
[Chart: total execution time in seconds for dimension 16384 on dl980 with 128 cores, split into memory copy / multiplication / result combination, for the variants labeled 6, 7, 8 and distributed: 22.15 / 19.61 / 21.32 / 14.67 s]
![Page 30: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/30.jpg)
Other optimization techniques
• Tiling
• Vectorization
• Scalar replacement
• Precomputation of constants
• (unrolling)
Tiling
• Divide computational work into tiles to leverage cache
• Tile size depends on cache size
• gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
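A minimal sketch of the tiling idea: the loops operate on TILE × TILE blocks so the working set of the inner loops stays in cache (the TILE constant is an assumption; the slides derive the block size from the L1 cache line size via -DCLS):

```c
#include <stddef.h>
#include <string.h>

#ifndef TILE
#define TILE 32  /* assumed tile size; tune to the cache */
#endif

/* Tiled MMM: C = A * B, row-major. The bounds checks let it also
 * handle n that is not a multiple of TILE. */
void mmm_tiled(const double *A, const double *B, double *C, size_t n)
{
    memset(C, 0, n * n * sizeof *C);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* multiply the (ii,kk) tile of A with the (kk,jj) tile of B */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```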
Performance of Tiling

perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

[Charts: cache/TLB miss counts (log scale) and execution time on dl980 with 128 cores. Times: Not Tiled, not Transposed 97 s; Not Tiled, Transposed 39 s; Tiled, not Transposed 13 s; Tiled, Transposed 12 s]
Vectorization
• SIMD: Single Instruction, Multiple Data
• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
• An instruction is simultaneously applied to multiple floats
• Can only operate efficiently on aligned data (16-byte aligned)
• SSE operates on 128-bit registers
– Newer Intel processors have Advanced Vector Extensions (AVX) with 256-bit registers
– The dl980 machine only supports 128-bit operations
Auto-Vectorization
• Can this be done automatically?
– gcc -O3 tries to auto-vectorize
• Only possible for simple statements
Assembler
Aligned Malloc
Example:
• numa_alloc returns addr 0x1232, which is not 16-byte aligned
• We add 15, so addr = 0x1241 (0b1001001000001)
• Now we clear the last 4 bits by ANDing with ~0x0f
• => The result 0x1240 is 16-byte aligned
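The arithmetic above can be sketched in code; plain malloc stands in for numa_alloc here, and stashing the original pointer in front of the aligned block is an added assumption so it can be freed later:

```c
#include <stdint.h>
#include <stdlib.h>

/* 16-byte-aligned allocation following the slide's arithmetic:
 * over-allocate, add 15, and mask off the low 4 bits. */
void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw) return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (addr + 15) & ~(uintptr_t)0x0f;
    ((void **)aligned)[-1] = raw;   /* remember the original pointer */
    return (void *)aligned;
}

void free_aligned16(void *p)
{
    if (p) free(((void **)p)[-1]);
}
```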
![Page 37: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/37.jpg)
Intrinsics
Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
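The slide's intrinsics example was an image; a minimal SSE sketch in its place (the add4 helper is an illustrative assumption): two arrays of 4 floats are added with one 128-bit instruction.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, ... */

/* Add two arrays of 4 floats with a single 128-bit SSE addition.
 * _mm_loadu_ps tolerates unaligned data; _mm_load_ps would require
 * 16-byte alignment. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```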
![Page 38: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/38.jpg)
Use Parallelism for MMM
• We try to construct a 4x4 matrix multiplication
• How to process rows ?
[Diagram: the rows of a matrix are contiguous in memory; a column cannot be loaded in one instruction]
![Page 39: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/39.jpg)
Use parallelism for MMM
• We try to construct a 4x4 matrix multiplication
• How to process rows ?
• Idea: process all elements of row of B in parallel
[Diagram: A1,1 is broadcast and multiplied with the row (B1,1 B1,2 B1,3 B1,4); the partial results for A1,2, A1,3, A1,4 are added up]
![Page 40: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/40.jpg)
4x4 Kernel
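The kernel on this slide was an image; a sketch following the idea of the previous slides: for each row i of A, broadcast A[i][k] and multiply it with the whole k-th row of B (4 floats in one 128-bit register), accumulating row i of C. Function name and the use of unaligned loads are assumptions.

```c
#include <xmmintrin.h>

/* 4x4 SSE kernel: C = A * B for row-major float[16] matrices. */
void kernel4x4(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]);    /* broadcast A[i][k] */
            __m128 brow = _mm_loadu_ps(&B[k * 4]);   /* row k of B */
            acc = _mm_add_ps(acc, _mm_mul_ps(a, brow));
        }
        _mm_storeu_ps(&C[i * 4], acc);               /* row i of C */
    }
}
```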
SSE – Single Threaded
Execution time in seconds (dl980 on 1 core):

n         1024   2048   4096
naiveSSE  0.27   2      20
tiledSSE  0.48   5      41
tiled     2      24     213
naive     11     97     879
![Page 42: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/42.jpg)
Cache Misses of SSE Variants
[Chart: L1 cache misses and dTLB misses (counts up to ~6 × 10⁹) for naiveSSE vs. tiledSSE]
![Page 43: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/43.jpg)
Performance for Small Matrices
[Chart: execution time in seconds (0 to 0.25) for n = 64, 128, 256, 512 on dl980 with 128 cores; series: naiveSSE, tiled, strassen, MKL]
![Page 44: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/44.jpg)
Performance for Large Matrices
[Chart: execution time in seconds for n = 1024, 2048, 4096, 8192 on dl980 with 128 cores; series: naiveSSE, tiled, strassenSSE, MKL; values include 0.17, 0.20, 0.34, 0.39, 0.53, 0.79, 1.20, 1.94, 3.90, 5.09, 7.29 and an off-scale 28.3 s]
![Page 45: NUMA-aware Matrix-Matrix-Multiplication · Performance of Naive vs. MKL 5 0,38 11,79 98,14 0,02 0,13 1,02 0,015625 0,03125 0,0625 0,125 0,25 0,5 1 2 4 8 16 32 64 128 512 1024 2048](https://reader033.vdocuments.us/reader033/viewer/2022041620/5e3e83c27705da4c1940aa0b/html5/thumbnails/45.jpg)
Summary
• Analyze algorithm for bottlenecks
– IO optimization
– Hardware specific optimization
• Cache size
• NUMA architecture
• Specific instructions (SSE)
• Try to minimize remote memory access
• Visualisations can facilitate understanding