18.337 / 6.338: Parallel Computing Project Final Report

Parallelization of Matrix Multiply:
A Look at How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java

Elliotte Kim, Massachusetts Institute of Technology, Class of 2012



A (n x m) * B (m x p) = C (n x p)

Matrix Multiplication

Hypothesis:

Computing (n x kn) * (kn x n) will take at least k times as long as computing (n x n) * (n x n), regardless of parallelization, provided the same parallelization method is applied to both matmuls.

In both cases, the resulting matrix C will be (n x n).

Ordinary Matrix Multiply

Under ordinary matrix multiplication, the (n x kn) * (kn x n) matmul performs k times as many multiplication operations as the (n x n) * (n x n) matmul.
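The op-count claim above follows directly from the ordinary triple-loop algorithm: the inner dimension m contributes a factor of m to the n * m * p multiplications, so doubling m doubles the work. The deck does not show its Java source, so the sketch below is a plain reconstruction of the textbook loop, not the author's exact code.

```java
// Ordinary (triple-loop) matrix multiply: C = A * B,
// where A is n x m and B is m x p. Total multiplications: n * m * p,
// so (n x kn) * (kn x n) does k times the work of (n x n) * (n x n).
public class OrdinaryMatMul {
    public static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                double sum = 0.0;
                for (int x = 0; x < m; x++) {
                    sum += a[i][x] * b[x][j]; // one multiply per (i, j, x)
                }
                c[i][j] = sum;
            }
        }
        return c;
    }
}
```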

Test Case 1:

Intel Atom N270: 1.6 GHz, 1 core, 2 threads/core (2 threads total), 56 KB L1 cache, 512 KB L2 cache

[Chart: Intel Atom N270, Ordinary Matrix Multiply, 1 thread; time (ms) vs. k = 1..8, n = 1024]

[Chart: Intel Atom N270, Ordinary Matrix Multiply, 2 threads; time (ms) vs. k = 1..8, n = 1024]

Test Case 2:

AMD Turion 64 X2: 2.0 GHz, 2 cores, 1 thread/core (2 threads total), 128 KB L1 cache per core, 512 KB L2 cache per core

[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 1 thread; time (ms) vs. k = 1..8, n = 1024]

[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 2 threads; time (ms) vs. k = 1..8, n = 1024]

Observation

Near doubling in performance going from 1 to 2 threads.

Calculation rate slows down going from k = 3 to k = 4.

Why? L2 cache access begins at k = 4.
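The near doubling from 1 to 2 threads is what a row-partitioned parallel version of the ordinary matmul would predict: each thread computes an independent band of rows of C, so there is no synchronization inside the loop. The deck does not show its threading code, so this partitioning scheme is an assumption, not the author's implementation.

```java
// Row-partitioned parallel ordinary matmul: thread t computes
// rows [t*n/nThreads, (t+1)*n/nThreads) of C. Bands are disjoint,
// so the worker threads never write the same cell of C.
public class ParallelMatMul {
    public static double[][] multiply(double[][] a, double[][] b, int nThreads)
            throws InterruptedException {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            int lo = t * n / nThreads;       // first row for this thread
            int hi = (t + 1) * n / nThreads; // one past its last row
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++)
                    for (int j = 0; j < p; j++) {
                        double sum = 0.0;
                        for (int x = 0; x < m; x++) sum += a[i][x] * b[x][j];
                        c[i][j] = sum;
                    }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join(); // wait for every band
        return c;
    }
}
```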

Test Case 3:

Intel Core2 Quad Q6700: 2.66 GHz, 4 cores, 1 thread/core (4 threads total), 128 KB L1 cache per core, 2 x 4 MB shared L2 cache

[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 1 thread; time (ms) vs. k = 1..8, n = 1024]

[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 2 threads; time (ms) vs. k = 1..8, n = 1024]

[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 4 threads; time (ms) vs. k = 1..8, n = 1024]

Observation

Near doubling in performance going from 1 to 2 threads.

At 4 threads, increased computation slowdown at k = 4 and k = 7, with recoveries at k = 6 and k = 8.

Effects of the shared cache?

Ordinary Matrix Multiply

All performance times observed were in accordance with the hypothesis.

The Question

Is there an algorithm that can give better than k scaling?

Recursive Matrix Multiply

Breaks a matrix up into 4 smaller matrices.

Spawns a new thread for each submatrix.

Applies this recursively, until a threshold size is reached.
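The three steps above can be sketched with Java's fork/join framework: each task splits its operands into quadrants, forks subtasks for them, and falls back to the ordinary triple loop once the block size reaches a threshold. The deck does not show its source, so the quadrant bookkeeping, the use of ForkJoinPool, and the THRESHOLD value are all assumptions; this sketch also assumes n is a power of two.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Recursive matrix multiply for square n x n matrices, n a power of two.
// Each call splits A, B, C into quadrants and forks subtasks; the base
// case runs the ordinary triple loop on a small block.
public class RecursiveMatMul {
    static final int THRESHOLD = 64; // assumed cut-off for the base case

    public static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        ForkJoinPool.commonPool().invoke(new MulTask(a, b, c, 0, 0, 0, 0, 0, 0, n));
        return c;
    }

    // Computes c[cr.., cc..] += a[ar.., ac..] * b[br.., bc..] over an n x n block.
    static class MulTask extends RecursiveAction {
        final double[][] a, b, c;
        final int ar, ac, br, bc, cr, cc, n;
        MulTask(double[][] a, double[][] b, double[][] c,
                int ar, int ac, int br, int bc, int cr, int cc, int n) {
            this.a = a; this.b = b; this.c = c;
            this.ar = ar; this.ac = ac; this.br = br; this.bc = bc;
            this.cr = cr; this.cc = cc; this.n = n;
        }
        @Override protected void compute() {
            if (n <= THRESHOLD) { // base case: ordinary matmul on the block
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++) {
                        double sum = 0.0;
                        for (int x = 0; x < n; x++)
                            sum += a[ar + i][ac + x] * b[br + x][bc + j];
                        c[cr + i][cc + j] += sum;
                    }
                return;
            }
            int h = n / 2;
            // Phase 1: C11 += A11*B11, C12 += A11*B12, C21 += A21*B11, C22 += A21*B12.
            // Each task writes a distinct C quadrant, so they run in parallel safely.
            invokeAll(new MulTask(a, b, c, ar, ac, br, bc, cr, cc, h),
                      new MulTask(a, b, c, ar, ac, br, bc + h, cr, cc + h, h),
                      new MulTask(a, b, c, ar + h, ac, br, bc, cr + h, cc, h),
                      new MulTask(a, b, c, ar + h, ac, br, bc + h, cr + h, cc + h, h));
            // Phase 2 (after phase 1 completes): C11 += A12*B21, C12 += A12*B22,
            // C21 += A22*B21, C22 += A22*B22.
            invokeAll(new MulTask(a, b, c, ar, ac + h, br + h, bc, cr, cc, h),
                      new MulTask(a, b, c, ar, ac + h, br + h, bc + h, cr, cc + h, h),
                      new MulTask(a, b, c, ar + h, ac + h, br + h, bc, cr + h, cc, h),
                      new MulTask(a, b, c, ar + h, ac + h, br + h, bc + h, cr + h, cc + h, h));
        }
    }
}
```

One plausible reason this layout runs well on small caches: the base-case blocks are small enough to stay cache-resident while they are reused.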

Recursive Matrix Multiply

[Chart: Intel Atom N270, Recursive Matrix Multiply; time (ms) vs. k = 1..8, n = 1024]

Observation

Recursive MatMul is 1 to 3 times FASTER than parallel Ordinary MatMul on the Atom processor.

No drastic slowdown in computation rate after k = 1.

Near-linear relationship between calculation times and values of k.

Recursive Matrix Multiply

[Chart: AMD Turion 64 X2 TL-60, Recursive Matrix Multiply; time (ms) vs. k = 1..8, n = 1024]

Observation

Recursive MatMul is 1.5 to 3.5 times FASTER than parallel Ordinary MatMul on the Turion processor.

No drastic slowdown in computation rate between k = 3 and k = 4.

Near-linear relationship between calculation times and values of k.

Recursive Matrix Multiply

[Chart: Intel Core2 Quad Q6700, Recursive Matrix Multiply; time (ms) vs. k = 1..8, n = 1024]

Observation

Recursive MatMul is 0.5 to 4 times FASTER than parallel Ordinary MatMul on the Q6700 processor.

Better-than-k scaling performance when k = 3, 5, 6, 7, and 8.

Why?

Conclusions

Better-than-k scaling can be achieved, though it is uncertain why.

Hardware? Algorithm? A combination of the two?

Further research is required.

Conclusions

Algorithmic approach can affect the time required. Hardware can affect the time required.

Faster processors help. More cache helps.

But the best performance is achieved when algorithms can account for the hardware and determine the best approach.
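One concrete way an algorithm can "account for hardware" is to size the recursion cut-off so that the three blocks touched by the base case (the A, B, and C sub-blocks of doubles) fit in a given cache. The sizing rule below is an illustration of that idea, not something measured in this report; the cache size is a parameter because Java has no portable way to query it.

```java
// Pick a power-of-two base-case block size b such that the three
// b x b blocks of doubles (8 bytes each) used by the base case
// fit within the given cache capacity. Illustrative heuristic only.
public class BlockSizer {
    /** Largest power-of-two b with 3 * b * b * 8 bytes <= cacheBytes. */
    public static int blockSize(long cacheBytes) {
        int b = 1;
        while (3L * (2L * b) * (2L * b) * 8L <= cacheBytes) b *= 2;
        return b;
    }
}
```

For the machines above, this rule would suggest 32 x 32 blocks for the Atom's 56 KB L1 and 128 x 128 blocks for the Turion's 512 KB per-core L2.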