18.337 / 6.338: Parallel Computing Project Final Report
Parallelization of Matrix Multiply:
A Look At How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java
Elliotte Kim, Massachusetts Institute of Technology
Class of 2012
Hypothesis:
The duration to compute (n x kn) * (kn x n)
will take at least k times
the duration to compute (n x n) * (n x n),
regardless of parallelization, if the same parallelization method is applied
to both matmuls.
Under ordinary matrix multiplication,
an (n x kn) * (kn x n) matmul
will have k times the number of multiplication operations
of an
(n x n) * (n x n) matmul.
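The operation count above follows directly from the triple loop: each of the n * n output entries sums over the inner dimension, so an (n x kn) * (kn x n) product performs n * n * kn = k * n^3 scalar multiplications. A minimal sketch of ordinary matrix multiply illustrating this; class and method names are illustrative, not taken from the report's code.

```java
// Ordinary (triple-loop) matrix multiply over row-major double[][] matrices.
// Sketch for illustration; names are assumptions, not the report's own code.
public class OrdinaryMatMul {

    // Multiplies an (n x m) matrix A by an (m x p) matrix B.
    // Each of the n*p output entries runs the inner loop m times,
    // so the total scalar multiplication count is n * p * m.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;       // rows of A
        int m = b.length;       // columns of A == rows of B
        int p = b[0].length;    // columns of B
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                double sum = 0.0;
                for (int t = 0; t < m; t++) {
                    sum += a[i][t] * b[t][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    // Multiplication count for (n x kn) * (kn x n): n * n * (k*n) = k * n^3,
    // i.e. exactly k times the n^3 multiplications of the (n x n) * (n x n) case.
    static long multiplicationCount(int n, int k) {
        return (long) n * n * (long) k * n;
    }

    public static void main(String[] args) {
        int n = 1024;
        for (int k = 1; k <= 8; k++) {
            long ratio = multiplicationCount(n, k) / multiplicationCount(n, 1);
            System.out.println("k = " + k + " -> " + ratio + "x multiplications");
        }
    }
}
```

This count is what motivates the hypothesis: if arithmetic alone determined runtime, the k-wide product should take at least k times as long.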
Test Case 1:
Intel Atom N270, 1.6 GHz, 1 core
2 threads/core, 2 threads total
56 KB L1 cache, 512 KB L2 cache
[Chart: Intel Atom N270, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Atom N270, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Test Case 2:
AMD Turion 64 X2, 2.0 GHz, 2 cores
1 thread/core, 2 threads total
128 KB L1 cache per core, 512 KB L2 cache per core
[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Near doubling in performance going from 1 to 2 threads.
Calculation rate slowdown going from k = 3 to k = 4.
Why?
Likely the working set begins requiring L2 cache access at k = 4.
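The "ordinary matrix multiply, 2 threads" runs above can be parallelized in several ways; a common scheme, used here as an assumption about the report's approach, is to give each thread a contiguous band of rows of the output. A minimal sketch, with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Parallel ordinary matrix multiply: C = A * B with the rows of C
// partitioned into contiguous bands, one band per thread.
// A sketch of one plausible scheme, not the report's exact code.
public class ParallelOrdinaryMatMul {

    static double[][] multiply(double[][] a, double[][] b, int numThreads) {
        int n = a.length;       // rows of A
        int m = b.length;       // inner dimension
        int p = b[0].length;    // columns of B
        double[][] c = new double[n][p];
        int band = (n + numThreads - 1) / numThreads;  // rows per thread
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int lo = t * band;
            final int hi = Math.min(n, lo + band);
            Thread worker = new Thread(() -> {
                // Each thread writes a disjoint band of rows, so no locking
                // is needed on the shared output array.
                for (int i = lo; i < hi; i++) {
                    for (int j = 0; j < p; j++) {
                        double sum = 0.0;
                        for (int s = 0; s < m; s++) {
                            sum += a[i][s] * b[s][j];
                        }
                        c[i][j] = sum;
                    }
                }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b, 2);
        System.out.println(c[0][0] + " " + c[1][1]);
    }
}
```

Because each thread still streams the full B matrix, the per-thread working set grows with k, which is consistent with the slowdown observed once the data no longer fits in L1.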
Test Case 3:
Intel Core2 Quad Q6700, 2.66 GHz, 4 cores
1 thread/core, 4 threads total
128 KB L1 cache per core, 2 x 4 MB L2 cache (shared)
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 4 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Near doubling in performance going from 1 to 2 threads.
At 4 threads, increased computation slowdown at k = 4 and 7.
Recoveries at k = 6 and 8.
Effects of the shared cache?
Recursive Matrix Multiply
Breaks a matrix up into 4 smaller submatrices.
Spawns a new thread for each submatrix.
Applied recursively, until a threshold is reached.
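The scheme above can be sketched as follows, assuming square matrices with power-of-two n: each operand is split into four n/2 x n/2 quadrants, the eight quadrant products are computed by parallel subtasks, and recursion stops at a threshold below which the plain triple loop runs. The slides describe spawning a raw thread per submatrix; this sketch uses a ForkJoinPool as the idiomatic Java equivalent. The class names and the threshold value are illustrative, not from the report.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Recursive (divide-and-conquer) matrix multiply, parallelized with
// fork/join tasks. Assumes square matrices with power-of-two size.
public class RecursiveMatMul {
    static final int THRESHOLD = 64;  // assumed cutoff, not from the report

    // Accumulates the product of the size x size block of A at (ar, ac)
    // and the block of B at (br, bc) into the block of C at (cr, cc).
    static class MultiplyTask extends RecursiveAction {
        final double[][] a, b, c;
        final int ar, ac, br, bc, cr, cc, size;

        MultiplyTask(double[][] a, int ar, int ac, double[][] b, int br, int bc,
                     double[][] c, int cr, int cc, int size) {
            this.a = a; this.ar = ar; this.ac = ac;
            this.b = b; this.br = br; this.bc = bc;
            this.c = c; this.cr = cr; this.cc = cc; this.size = size;
        }

        @Override protected void compute() {
            if (size <= THRESHOLD) {  // base case: ordinary triple loop
                for (int i = 0; i < size; i++)
                    for (int j = 0; j < size; j++) {
                        double sum = 0.0;
                        for (int t = 0; t < size; t++)
                            sum += a[ar + i][ac + t] * b[br + t][bc + j];
                        c[cr + i][cc + j] += sum;
                    }
                return;
            }
            int h = size / 2;
            // Phase 1: C11 += A11*B11, C12 += A11*B12, C21 += A21*B11, C22 += A21*B12.
            // The four tasks write disjoint quadrants of C, so they run in parallel.
            invokeAll(new MultiplyTask(a, ar, ac, b, br, bc, c, cr, cc, h),
                      new MultiplyTask(a, ar, ac, b, br, bc + h, c, cr, cc + h, h),
                      new MultiplyTask(a, ar + h, ac, b, br, bc, c, cr + h, cc, h),
                      new MultiplyTask(a, ar + h, ac, b, br, bc + h, c, cr + h, cc + h, h));
            // Phase 2 (after phase 1 completes, since it accumulates into the
            // same quadrants): C11 += A12*B21, C12 += A12*B22, etc.
            invokeAll(new MultiplyTask(a, ar, ac + h, b, br + h, bc, c, cr, cc, h),
                      new MultiplyTask(a, ar, ac + h, b, br + h, bc + h, c, cr, cc + h, h),
                      new MultiplyTask(a, ar + h, ac + h, b, br + h, bc, c, cr + h, cc, h),
                      new MultiplyTask(a, ar + h, ac + h, b, br + h, bc + h, c, cr + h, cc + h, h));
        }
    }

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        ForkJoinPool.commonPool().invoke(new MultiplyTask(a, 0, 0, b, 0, 0, c, 0, 0, n));
        return c;
    }
}
```

A plausible reason this layout performs well on the tested CPUs is that below the threshold each task touches only small, contiguous blocks, so its working set can stay cache-resident regardless of k.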
Recursive Matrix Multiply
[Chart: Intel Atom N270, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 1 to 3 times FASTER than parallel ordinary MatMul
on the Atom processor.
No drastic slowdown in computation rate after k = 1.
Near-linear relationship between calculation times and values of k.
Recursive Matrix Multiply
[Chart: AMD Turion 64 X2 TL-60, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 1.5 to 3.5 times FASTER than parallel ordinary MatMul
on the Turion processor.
No drastic slowdown in computation rate between k = 3 and k = 4.
Near-linear relationship between calculation times and values of k.
Recursive Matrix Multiply
[Chart: Intel Core2 Quad Q6700, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 0.5 to 4 times FASTER than parallel ordinary MatMul
on the Q6700 processor.
Better than k-scaling performance when k = 3, 5, 6, 7, and 8.
Why?
Conclusions
Better than k-scaling can be achieved, though it is uncertain why.
Hardware? Algorithm? A combination of the two?
Further research is required.