18.337 / 6.338: Parallel Computing Project Final Report
Parallelization of Matrix Multiply:
A Look At How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java
Elliotte Kim, Massachusetts Institute of Technology
Class of 2012
Hypothesis:
The duration to compute (n x kn) * (kn x n)
will take at least k times
the duration to compute (n x n) * (n x n),
regardless of parallelization, if the same parallelization method is applied
to both matmuls.
Under ordinary matrix multiplication,
an (n x kn) * (kn x n) matmul
will have k times the number of multiplication operations
of an
(n x n) * (n x n) matmul.
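The operation count above follows directly from the triple loop: each of the n * n output entries sums over the inner dimension, so an (n x kn) * (kn x n) product performs n * n * kn = k * n^3 scalar multiplications. A minimal sketch of ordinary matrix multiply illustrating this; class and method names are illustrative, not taken from the report's code.

```java
// Ordinary (triple-loop) matrix multiply over row-major double[][] matrices.
// Sketch for illustration; names are assumptions, not the report's own code.
public class OrdinaryMatMul {

    // Multiplies an (n x m) matrix A by an (m x p) matrix B.
    // Each of the n*p output entries runs the inner loop m times,
    // so the total scalar multiplication count is n * p * m.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;       // rows of A
        int m = b.length;       // columns of A == rows of B
        int p = b[0].length;    // columns of B
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                double sum = 0.0;
                for (int t = 0; t < m; t++) {
                    sum += a[i][t] * b[t][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    // Multiplication count for (n x kn) * (kn x n): n * n * (k*n) = k * n^3,
    // i.e. exactly k times the n^3 multiplications of the (n x n) * (n x n) case.
    static long multiplicationCount(int n, int k) {
        return (long) n * n * (long) k * n;
    }

    public static void main(String[] args) {
        int n = 1024;
        for (int k = 1; k <= 8; k++) {
            long ratio = multiplicationCount(n, k) / multiplicationCount(n, 1);
            System.out.println("k = " + k + " -> " + ratio + "x multiplications");
        }
    }
}
```

This count is what motivates the hypothesis: if arithmetic alone determined runtime, the k-wide product should take at least k times as long.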
Test Case 1:
Intel Atom N270, 1.6 GHz, 1 core
2 threads/core, 2 threads total
56 KB L1 cache, 512 KB L2 cache
[Chart: Intel Atom N270, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Atom N270, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Test Case 2:
AMD Turion 64 X2, 2.0 GHz, 2 cores
1 thread/core, 2 threads total
128 KB L1 cache per core, 512 KB L2 cache per core
[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: AMD Turion 64 X2 TL-60, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Near doubling in performance going from 1 to 2 threads.
Calculation rate slowdown going from k = 3 to k = 4.
Why?
Likely the working set begins requiring L2 cache access at k = 4.
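The "ordinary matrix multiply, 2 threads" runs above can be parallelized in several ways; a common scheme, used here as an assumption about the report's approach, is to give each thread a contiguous band of rows of the output. A minimal sketch, with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Parallel ordinary matrix multiply: C = A * B with the rows of C
// partitioned into contiguous bands, one band per thread.
// A sketch of one plausible scheme, not the report's exact code.
public class ParallelOrdinaryMatMul {

    static double[][] multiply(double[][] a, double[][] b, int numThreads) {
        int n = a.length;       // rows of A
        int m = b.length;       // inner dimension
        int p = b[0].length;    // columns of B
        double[][] c = new double[n][p];
        int band = (n + numThreads - 1) / numThreads;  // rows per thread
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int lo = t * band;
            final int hi = Math.min(n, lo + band);
            Thread worker = new Thread(() -> {
                // Each thread writes a disjoint band of rows, so no locking
                // is needed on the shared output array.
                for (int i = lo; i < hi; i++) {
                    for (int j = 0; j < p; j++) {
                        double sum = 0.0;
                        for (int s = 0; s < m; s++) {
                            sum += a[i][s] * b[s][j];
                        }
                        c[i][j] = sum;
                    }
                }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b, 2);
        System.out.println(c[0][0] + " " + c[1][1]);
    }
}
```

Because each thread still streams the full B matrix, the per-thread working set grows with k, which is consistent with the slowdown observed once the data no longer fits in L1.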
Test Case 3:
Intel Core2 Quad Q6700, 2.66 GHz, 4 cores
1 thread/core, 4 threads total
128 KB L1 cache per core, 2 x 4 MB L2 cache (shared)
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 1 thread; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 2 threads; computation time in ms vs. k = 1 to 8; n = 1024]
[Chart: Intel Core2 Quad Q6700, Ordinary Matrix Multiply, 4 threads; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Near doubling in performance going from 1 to 2 threads.
At 4 threads, increased computation slowdown at k = 4 and 7.
Recoveries at k = 6 and 8.
Effects of the shared cache?
Recursive Matrix Multiply
Breaks a matrix up into 4 smaller submatrices.
Spawns a new thread for each submatrix.
Applied recursively, until a threshold is reached.
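The scheme above can be sketched as follows, assuming square matrices with power-of-two n: each operand is split into four n/2 x n/2 quadrants, the eight quadrant products are computed by parallel subtasks, and recursion stops at a threshold below which the plain triple loop runs. The slides describe spawning a raw thread per submatrix; this sketch uses a ForkJoinPool as the idiomatic Java equivalent. The class names and the threshold value are illustrative, not from the report.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Recursive (divide-and-conquer) matrix multiply, parallelized with
// fork/join tasks. Assumes square matrices with power-of-two size.
public class RecursiveMatMul {
    static final int THRESHOLD = 64;  // assumed cutoff, not from the report

    // Accumulates the product of the size x size block of A at (ar, ac)
    // and the block of B at (br, bc) into the block of C at (cr, cc).
    static class MultiplyTask extends RecursiveAction {
        final double[][] a, b, c;
        final int ar, ac, br, bc, cr, cc, size;

        MultiplyTask(double[][] a, int ar, int ac, double[][] b, int br, int bc,
                     double[][] c, int cr, int cc, int size) {
            this.a = a; this.ar = ar; this.ac = ac;
            this.b = b; this.br = br; this.bc = bc;
            this.c = c; this.cr = cr; this.cc = cc; this.size = size;
        }

        @Override protected void compute() {
            if (size <= THRESHOLD) {  // base case: ordinary triple loop
                for (int i = 0; i < size; i++)
                    for (int j = 0; j < size; j++) {
                        double sum = 0.0;
                        for (int t = 0; t < size; t++)
                            sum += a[ar + i][ac + t] * b[br + t][bc + j];
                        c[cr + i][cc + j] += sum;
                    }
                return;
            }
            int h = size / 2;
            // Phase 1: C11 += A11*B11, C12 += A11*B12, C21 += A21*B11, C22 += A21*B12.
            // The four tasks write disjoint quadrants of C, so they run in parallel.
            invokeAll(new MultiplyTask(a, ar, ac, b, br, bc, c, cr, cc, h),
                      new MultiplyTask(a, ar, ac, b, br, bc + h, c, cr, cc + h, h),
                      new MultiplyTask(a, ar + h, ac, b, br, bc, c, cr + h, cc, h),
                      new MultiplyTask(a, ar + h, ac, b, br, bc + h, c, cr + h, cc + h, h));
            // Phase 2 (after phase 1 completes, since it accumulates into the
            // same quadrants): C11 += A12*B21, C12 += A12*B22, etc.
            invokeAll(new MultiplyTask(a, ar, ac + h, b, br + h, bc, c, cr, cc, h),
                      new MultiplyTask(a, ar, ac + h, b, br + h, bc + h, c, cr, cc + h, h),
                      new MultiplyTask(a, ar + h, ac + h, b, br + h, bc, c, cr + h, cc, h),
                      new MultiplyTask(a, ar + h, ac + h, b, br + h, bc + h, c, cr + h, cc + h, h));
        }
    }

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        ForkJoinPool.commonPool().invoke(new MultiplyTask(a, 0, 0, b, 0, 0, c, 0, 0, n));
        return c;
    }
}
```

A plausible reason this layout performs well on the tested CPUs is that below the threshold each task touches only small, contiguous blocks, so its working set can stay cache-resident regardless of k.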
Recursive Matrix Multiply
[Chart: Intel Atom N270, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 1 to 3 times FASTER than parallel ordinary MatMul
on the Atom processor.
No drastic slowdown in computation rate after k = 1.
Near-linear relationship between calculation times and values of k.
Recursive Matrix Multiply
[Chart: AMD Turion 64 X2 TL-60, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 1.5 to 3.5 times FASTER than parallel ordinary MatMul
on the Turion processor.
No drastic slowdown in computation rate between k = 3 and k = 4.
Near-linear relationship between calculation times and values of k.
Recursive Matrix Multiply
[Chart: Intel Core2 Quad Q6700, Recursive Matrix Multiply; computation time in ms vs. k = 1 to 8; n = 1024]
Observation
Recursive MatMul is 0.5 to 4 times FASTER than parallel ordinary MatMul
on the Q6700 processor.
Better than k-scaling performance when k = 3, 5, 6, 7, and 8.
Why?
Conclusions
Better than k-scaling can be achieved, though it is uncertain why.
Hardware? Algorithm? A combination of the two?
Further research is required.