TRANSCRIPT
-
NUMA-aware Matrix-Matrix-Multiplication
Max Reimann, Philipp Otto
1
-
About this talk
• Objective: Show how to improve the performance of algorithms on a NUMA system, with MMM as an example
• Code was written in C with numa.h and pthread.h
• Tested on FSOC:
– ubuntu-0101: 2 nodes, 24 cores
– dl980: 8 nodes, 128 cores
• Compiled with gcc -O3
2
-
Naïve Matrix-Matrix-Multiplication
• We will examine MMM for large n x n matrices
• Runtime: 𝒪(n³)
3
Image source: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png
-
Naïve MMM implementation
4
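The naïve triple loop the benchmarks refer to might look like this; a minimal sketch (function name and flat row-major layout are illustrative, not the talk's exact code):

```c
/* Naive O(n^3) matrix-matrix multiplication, C = A * B,
 * for square n x n matrices stored row-major in flat arrays. */
void mmm_naive(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```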
-
Performance of Naive vs. MKL
5
[Chart: execution time in seconds; dl980 on one core]
n      512    1024   2048
Naive  0.38   11.79  98.14
MKL    0.02   0.13   1.02
-
Intel Math Kernel Library (MKL)
• BLAS: Basic Linear Algebra Subprograms
– Standard for Linear Algebra
• MKL:
– Implements BLAS for Intel hardware
– Vectorized and threaded for highest performance
6
-
Analysis of Naïve MMM
• Test setup:
– Use ubuntu-numa machine
– No thread or memory pinning
– Use numatop / pcm
• Performance tools show:
– Unused cores (obvious)
– QPI cannot be fully loaded with one thread
8
-
Parallelization I
• How can the work be divided?
– 1. Partition computation of matrixC by rows or columns
• Problem: All threads need matrixA and matrixB
• Solution:
– Accept the overhead of remote memory access, or
– Copy the input/output matrices to the other nodes (preprocessing)
9
-
Parallelization – Partition by rows
10
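The row partitioning above can be sketched with pthreads; a minimal, hypothetical version (struct and function names are not the original code) in which matrixA and matrixB stay shared, so on a NUMA system threads on other nodes pay for remote access unless the inputs are copied per node first:

```c
#include <pthread.h>

/* Each of nthreads threads computes a contiguous block of rows of C. */
typedef struct {
    const double *A, *B;
    double *C;
    int n, row_begin, row_end;
} mmm_task;

static void *mmm_rows(void *arg) {
    mmm_task *t = (mmm_task *)arg;
    for (int i = t->row_begin; i < t->row_end; i++)
        for (int j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (int k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

void mmm_parallel_rows(const double *A, const double *B, double *C,
                       int n, int nthreads) {
    pthread_t threads[nthreads];
    mmm_task tasks[nthreads];
    int rows = (n + nthreads - 1) / nthreads;   /* rows per thread */
    for (int t = 0; t < nthreads; t++) {
        int begin = t * rows;
        int end = (t + 1) * rows < n ? (t + 1) * rows : n;
        tasks[t] = (mmm_task){A, B, C, n, begin < n ? begin : n, end};
        pthread_create(&threads[t], NULL, mmm_rows, &tasks[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
}
```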
-
Parallelization – Partition by rows
11
[Chart: execution time in seconds; dl980 on 128 cores]
n                 512    1024   2048
Naive Sequential  0.38   11.79  98.14
Naive Parallel    0.05   0.26   2.54
MKL Parallel      0.19   0.27   0.28
-
Parallelization II
• How can the work be divided?
– 2. Partition computation of matrixC by summands
• Benefit:
– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This allows copying only the needed parts to the other nodes
• Disadvantages:
– matrixB has to be transposed so the memory can be partitioned (preprocessing)
– Locking or merging of matrixC is needed
12
-
Parallelization II
13
-
Performance of "Parallel Sum" Method
14
[Chart: execution time in seconds; dl980 on 128 cores]
n               512    1024   2048   4096   8192
Parallel Sum    1.59   2.81   3.34   14.91  218.84
Naive Parallel  0.27   1.41   2.94   17.24  186.39
MKL Parallel    0.19   0.27   0.28   0.43   2.41
-
Strassen
• Runtime complexity:
– Naive algorithm: 𝒪(n³)
• Can we get better?
– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
– Runtime 𝒪(n^(log₂ 7)) ≈ 𝒪(n^2.8)
– Algorithms today reach about 𝒪(n^2.37), but are not practical
– Strassen uses only 7 multiplications instead of 8 per recursion step
15
-
Matrix definition
16
For matrices A, B, C of dimension n = 4k, k ∈ ℕ, A, B, C can be viewed as 2x2 block matrices:

A = [A1,1 A1,2; A2,1 A2,2],  B = [B1,1 B1,2; B2,1 B2,2],  C = [C1,1 C1,2; C2,1 C2,2]

The conventional algorithm uses 8 (expensive) multiplications:

C1,1 = A1,1·B1,1 + A1,2·B2,1    C1,2 = A1,1·B1,2 + A1,2·B2,2
C2,1 = A2,1·B1,1 + A2,2·B2,1    C2,2 = A2,1·B1,2 + A2,2·B2,2
-
Strassen’s algorithm
17
Define temporary matrices:

M1 := (A1,1 + A2,2) · (B1,1 + B2,2)
M2 := (A2,1 + A2,2) · B1,1
M3 := A1,1 · (B1,2 − B2,2)
M4 := A2,2 · (B2,1 − B1,1)
M5 := (A1,1 + A1,2) · B2,2
M6 := (A2,1 − A1,1) · (B1,1 + B1,2)
M7 := (A1,2 − A2,2) · (B2,1 + B2,2)

Compose the final matrix:

C1,1 = M1 + M4 − M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 − M2 + M3 + M6

Only 7 multiplications!
-
Strassen - Example
18
[A1,1 A1,2; A2,1 A2,2] · [B1,1 B1,2; B2,1 B2,2] = [C1,1 C1,2; C2,1 C2,2]

Substituting the Mi by their terms gives back the original formula, for example:

C1,2 = M3 + M5
     = A1,1 · (B1,2 − B2,2) + (A1,1 + A1,2) · B2,2
     = A1,1B1,2 − A1,1B2,2 + A1,1B2,2 + A1,2B2,2
     = A1,1B1,2 + A1,2B2,2
-
Strassen - Analysis
• Cost: 7 multiplications and 18 additions
– vs. 8 multiplications and 4 additions for the naïve block scheme
• Usually considered practical only for large matrices (n > 1000)
– Although our results indicate otherwise (later)
• Define a cutoff point for the recursion
– If n is sufficiently small, do naïve multiplication
19
-
Strassen - Implementation
20
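A compact sequential sketch of the recursion, assuming n is a power of two; names and the cutoff value are illustrative (the talk used BREAK = 64, and additionally runs the seven products on separate threads):

```c
#include <stdlib.h>

#define CUTOFF 2  /* recursion cutoff; the talk found 64 best on dl980 */

static void naive(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

static void madd(const double *X, const double *Y, double *Z, int n) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i];
}
static void msub(const double *X, const double *Y, double *Z, int n) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] - Y[i];
}

/* gather != 0: copy quadrant (qi,qj) of n x n matrix S into h x h D;
 * gather == 0: scatter D back into that quadrant of S. */
static void quad(double *D, double *S, int n, int qi, int qj, int gather) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            if (gather) D[i * h + j] = S[(qi * h + i) * n + qj * h + j];
            else        S[(qi * h + i) * n + qj * h + j] = D[i * h + j];
}

void strassen(double *A, double *B, double *C, int n) {
    if (n <= CUTOFF) { naive(A, B, C, n); return; }
    int h = n / 2, hh = h * h;
    double *W = malloc(sizeof(double) * hh * 17);   /* all scratch space */
    double *A11 = W,          *A12 = W + hh,     *A21 = W + 2 * hh, *A22 = W + 3 * hh;
    double *B11 = W + 4 * hh, *B12 = W + 5 * hh, *B21 = W + 6 * hh, *B22 = W + 7 * hh;
    double *M[7], *T1 = W + 15 * hh, *T2 = W + 16 * hh;
    for (int i = 0; i < 7; i++) M[i] = W + (8 + i) * hh;
    quad(A11, A, n, 0, 0, 1); quad(A12, A, n, 0, 1, 1);
    quad(A21, A, n, 1, 0, 1); quad(A22, A, n, 1, 1, 1);
    quad(B11, B, n, 0, 0, 1); quad(B12, B, n, 0, 1, 1);
    quad(B21, B, n, 1, 0, 1); quad(B22, B, n, 1, 1, 1);
    madd(A11, A22, T1, h); madd(B11, B22, T2, h); strassen(T1, T2, M[0], h);  /* M1 */
    madd(A21, A22, T1, h);                        strassen(T1, B11, M[1], h); /* M2 */
    msub(B12, B22, T2, h);                        strassen(A11, T2, M[2], h); /* M3 */
    msub(B21, B11, T2, h);                        strassen(A22, T2, M[3], h); /* M4 */
    madd(A11, A12, T1, h);                        strassen(T1, B22, M[4], h); /* M5 */
    msub(A21, A11, T1, h); madd(B11, B12, T2, h); strassen(T1, T2, M[5], h);  /* M6 */
    msub(A12, A22, T1, h); madd(B21, B22, T2, h); strassen(T1, T2, M[6], h);  /* M7 */
    madd(M[0], M[3], T1, h); msub(T1, M[4], T1, h); madd(T1, M[6], T1, h);
    quad(T1, C, n, 0, 0, 0);                              /* C11 = M1+M4-M5+M7 */
    madd(M[2], M[4], T1, h); quad(T1, C, n, 0, 1, 0);     /* C12 = M3+M5      */
    madd(M[1], M[3], T1, h); quad(T1, C, n, 1, 0, 0);     /* C21 = M2+M4      */
    msub(M[0], M[1], T1, h); madd(T1, M[2], T1, h); madd(T1, M[5], T1, h);
    quad(T1, C, n, 1, 1, 0);                              /* C22 = M1-M2+M3+M6 */
    free(W);
}
```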
-
Execution Time: Single-threaded
[Chart: execution time in seconds; dl980 on 1 core; strassen: BREAK = 64]
n         32    64    128   256   512   1024   2048
Naive     0.00  0.00  0.01  0.05  0.38  11.79  98.14
Strassen  0.00  0.00  0.00  0.02  0.12  0.87   6.12
MKL       0.00  0.00  0.00  0.00  0.02  0.13   1.02
21
-
Parallelization of Strassen I
• Data dependencies:
– The additions inside each Mi have to be done before its multiplication
• e.g. M1 = (A1,1 + A2,2) · (B1,1 + B2,2)
– All Mi have to be calculated before C
• e.g. C1,2 = M3 + M5
• Easiest solution:
– Calculate the Mi in parallel
– Then calculate the Ci,j in parallel
22
-
Parallelization of Strassen II
• Recursion level 1 can be scheduled to 7 threads; level n to 7ⁿ threads
– But most systems have a power-of-two number of processors
• We used manual parallelization
– 49 distinct functions for the Ms and 16 for the Cs
– Code bloat and not scalable, BUT:
• Automatic parallelization is hard
– Thread load becomes very unbalanced
– Every level needs 7 temporary matrices
• Exponentially rising memory requirements
23
-
Execution Time – 49 Threads
[Chart: execution time in seconds; dl980 on 49 cores]
n         512   1024  2048  4096   8192
Naive     0.05  0.26  2.54  27.61  228.57
Strassen  0.05  0.14  0.49  2.06   13.53
MKL       0.19  0.27  0.28  0.44   1.84
24
-
NUMA-Optimizations
• Try to keep as much memory as possible local, to avoid remote memory access
– Remote access is slower by a factor of ~1.4
• Partition data and work depending on #nodes and #cores
• Pin threads to the nodes holding the memory they need
• (Topology matters for other algorithms)
25
-
Distributing memory and threads
[Chart: execution time in seconds; parallel naive on ubuntu-numa0101 on 24 cores]
n                               1024  2048   4096
Distributed memory and threads  0.34  11.39  101.12
Neither distributed             0.35  18.34  182.45
Distributed threads             0.35  21.96  204.85
Distributed memory              0.37  14.33  143.44
26
-
DEMO
27
-
Application of NUMA-Optimizations
• Copy all data to every node:
– Duration of preprocessing: 11.11 s for an 8192x8192 matrix to 8 nodes
• Partition data and move it to the corresponding nodes:
– Duration of preprocessing: 1.03 s for an 8192x8192 matrix to 8 nodes
• Pin threads to nodes:
– int numa_run_on_node(int node);
28
-
Parallelization – Partition by rows: Copying memory to different nodes
29
-
Strassen Memory Distribution Effects
[Chart: time breakdown (memory copy / multiplication / result combination) for dimension 16384; dl980 on 128 cores]
Totals in seconds: 6: 22.15, 7: 19.61, 8: 21.32, distributed: 14.67
30
-
Other optimization techniques
• Tiling
• Vectorization
• Scalar replacement
• Precomputation of constants
• (unrolling)
31
-
Tiling
• Divide computational work into tiles to leverage the cache
• Tile size depends on cache size
• gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
33
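A minimal sketch of a tiled kernel in this spirit, with the tile size derived from the CLS define mentioned above (names are illustrative, not the talk's code):

```c
#ifndef CLS
#define CLS 64                      /* L1 cache-line size in bytes */
#endif
#define TILE (CLS / sizeof(double)) /* doubles per cache line */

/* Tiled MMM: iterate over TILE x TILE blocks so each inner block's
 * working set stays cache-resident; the innermost loop streams one
 * row of B with unit stride. */
void mmm_tiled(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + (int)TILE && i < n; i++)
                    for (int k = kk; k < kk + (int)TILE && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + (int)TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```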
-
Performance of Tiling
perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048
34
[Chart: execution time in seconds; dl980 on 128 cores]
Not tiled, not transposed: 97
Not tiled, transposed: 39
Tiled, not transposed: 13
Tiled, transposed: 12
-
Vectorization
• SIMD: Single Instruction, Multiple Data
• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
• One instruction is applied to multiple floats simultaneously
• Can only operate efficiently on aligned data (16-byte aligned)
• SSE operates on 128-bit registers
– Newer Intel processors have Advanced Vector Extensions (AVX) with 256-bit registers
– The dl980 machine only supports 128-bit operations
35
-
Auto-Vectorization
• Can this be done automatically?
– gcc -O3 tries to auto-vectorize
– Only possible for simple statements
36
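A loop of the "simple statement" kind gcc -O3 can typically auto-vectorize: unit-stride accesses and no dependences between iterations (the saxpy name is illustrative):

```c
/* y[i] += a * x[i]: independent iterations with unit stride, so the
 * compiler can replace groups of 4 float operations with one SSE op. */
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```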
-
Assembler
37
-
Aligned Malloc
38
Example:
• numa_alloc returns addr 0x1232, which is not 16-byte aligned
• We add 15, so addr = 0x1241 = 0b1001001000001
• Now we clear the last 4 bits by ANDing with ~0x0f
• The result 0x1240 is 16-byte aligned
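The example above can be sketched as a small helper; `aligned_malloc16`/`aligned_free16` are hypothetical names. The original pointer is stashed just before the aligned block so it can be freed later (a detail the slide's example leaves out):

```c
#include <stdlib.h>
#include <stdint.h>

/* Return a 16-byte-aligned block of `size` bytes: over-allocate,
 * bump past a stash slot, add 15, and mask off the low 4 bits. */
void *aligned_malloc16(size_t size) {
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw) return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (addr + 15) & ~(uintptr_t)0x0f;
    ((void **)aligned)[-1] = raw;   /* remember the original pointer */
    return (void *)aligned;
}

void aligned_free16(void *p) {
    if (p) free(((void **)p)[-1]);
}
```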
-
Intrinsics
39
Example
Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
-
Use Parallelism for MMM
• We try to construct a 4x4 matrix multiplication
• How to process rows ?
40
[Diagram: a row is continuous in memory, but a column can't be loaded in one instruction]
-
Use parallelism for MMM
• We try to construct a 4x4 matrix multiplication
• How to process rows ?
• Idea: process all elements of row of B in parallel
41
[Diagram: each element A1,k is broadcast and multiplied with row (Bk,1 Bk,2 Bk,3 Bk,4) in one instruction; the results are added up]
-
4x4 Kernel
42
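The kernel idea from the previous slide can be sketched with SSE intrinsics: broadcast one element of A's row and multiply it against a full row of B in one instruction, accumulating the partial results. Assumes 16-byte-aligned, row-major 4x4 float matrices; names are illustrative, not the talk's code:

```c
#include <xmmintrin.h>

/* 4x4 single-precision MMM kernel: for each row i of A, accumulate
 * A[i][k] * (row k of B) across k, then store the finished row of C. */
void mmm4x4_sse(const float *A, const float *B, float *C) {
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]); /* broadcast A[i][k]  */
            __m128 b = _mm_load_ps(&B[k * 4]);    /* whole row k of B   */
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_store_ps(&C[i * 4], acc);
    }
}
```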
-
SSE – Single Threaded
[Chart: execution time in seconds; dl980 on 1 core]
n         1024  2048  4096
naiveSSE  0.27  2     20
tiledSSE  0.48  5     41
tiled     2     24    213
naive     11    97    879
43
-
Cache Misses of SSE Variants
[Chart: L1 cache misses and dTLB misses for naiveSSE vs. tiledSSE]
44
-
Performance for Small Matrices
[Chart: execution time in seconds (0–0.25 s) for n = 64, 128, 256, 512; series: naiveSSE, tiled, strassen, MKL; dl980 on 128 cores]
45
-
Performance for Large Matrices
[Chart: execution time in seconds for n = 1024, 2048, 4096, 8192; series: naiveSSE, tiled, strassenSSE, MKL; dl980 on 128 cores]
46
-
Summary
• Analyze algorithm for bottlenecks
– IO optimization
– Hardware specific optimization
• Cache size
• NUMA architecture
• Specific instructions (SSE)
• Try to minimize remote memory access
• Visualisations can facilitate understanding
47