Antonio Javier Cuenca Muñoz
Dpto. Ingeniería y Tecnología de Computadores
Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters
Javier Cuenca
Luis Pedro García
Domingo Giménez, Scientific Computation Research Group, University of Murcia, Spain
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, USA
Introduction
Automatically Optimised Linear Algebra Software

Objective: software capable of tuning itself according to the execution environment.

Motivation:
- Non-expert users must take decisions about the computation
- Software should adapt to the continuous evolution of hardware
- Developing efficient code by hand consumes a large quantity of resources
- System computation capabilities are highly variable

Some examples of auto-tuning software: ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...
Automatic Optimisation on Heterogeneous Parallel Systems
Two possibilities on heterogeneous systems:
- HoHe: heterogeneous algorithms (heterogeneous distribution of data)
- HeHo: homogeneous algorithms and heterogeneous assignment of processes: a variable number of processes on each processor, depending on the relative speeds

A mapping of processes to processors must be made, and without a large execution time for the decision taking.
Theoretical models: parameters which represent the characteristics of the system.
The general assignment problem is NP-complete, so heuristic approximations are used.
Our previous HoHo methodology
Routine model: T_EXEC = f(n, SP, AP)

n: problem size
SP: system parameters, the computation and communication characteristics of the system
AP: algorithm parameters: block size, number of processors to use, logical configurations of the processors, ... (with one process per processor)
The AP values are chosen when the routine begins to run.
Our previous HoHo methodology → Our HeHo methodology

Modifications in the routine model:

New AP:
- Number of processes to generate
- Mapping of processes to processors

SP values change. With more than one process per processor, each SP_i in processor i becomes d_i times higher, where d_i is the number of processes assigned to processor i.

Implicit synchronization: the global value of each SP is taken as the maximum value over all the processors. The slowest process forces the other ones to reduce their speed, since they wait for it at the different synchronization points of the routine.
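As a rough sketch of this adjustment (the function name and the toy values are illustrative, not from the slides), the effective value of a system parameter under multiple processes per processor can be computed as:

```python
def effective_sp(base_sp, d):
    """Adjust one system parameter SP for multi-process-per-processor runs.

    base_sp[i]: value of the parameter (e.g. time per flop) on processor i
    d[i]:       number of processes assigned to processor i

    Each SP_i grows d[i] times (processor i is shared by d[i] processes);
    implicit synchronization makes the global value the maximum over the
    processors that actually receive processes.
    """
    per_processor = [s * di for s, di in zip(base_sp, d) if di > 0]
    return max(per_processor)

# Two processors, the second twice as slow; two processes on the fast one:
# both give 2.0, so the global SP value is 2.0.
print(effective_sp([1.0, 2.0], [2, 1]))
```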
Our HeHo methodology: an example of routine model
LU factorisation, parallel version. Model:
SP: system parameters: k3_DGEMM, k3_DTRSM, k2_DGETF2, ts, tw

AP: algorithm parameters:
- b: block size
- P: number of processors
- p: number of processes
- mapping of the p processes on the P processors
- p = r x c: logical configuration of the processes (2D mesh)
T_EXEC = T_ARI + T_COM

T_ARI: arithmetic cost, expressed through k3_DGEMM, k3_DTRSM and k2_DGETF2; the dominant term is (2n^3 / 3p) k3_DGEMM, with the panel (DGETF2) and update (DTRSM) terms divided over the r and c dimensions of the process mesh.

T_COM: communication cost, expressed through the start-up time ts and the word-sending time tw, and depending on n, the block size b, the d_i processes per processor and the r x c mesh.
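A minimal sketch of evaluating a model of this form follows. The term shapes are the standard ones for ScaLAPACK-style blocked LU, but the constants here are illustrative assumptions, not the exact coefficients of the slides' model:

```python
def eet_lu(n, b, p, r, c, sp):
    """Estimated execution time of parallel blocked LU: T_EXEC = T_ARI + T_COM.

    sp holds the system parameters of the slides: k3_dgemm, k3_dtrsm,
    k2_dgetf2 (times per flop) and ts, tw (communication times).
    Coefficients below are assumptions for illustration only.
    """
    t_ari = ((2 * n**3 / (3 * p)) * sp["k3_dgemm"]
             + (n**2 * b / (2 * c)) * sp["k3_dtrsm"]
             + (n**2 * b / (2 * r)) * sp["k2_dgetf2"])
    t_com = (n / b) * (r + c) * sp["ts"] + (n**2 / p) * (r + c) * sp["tw"]
    return t_ari + t_com

# Compare two APs for n = 2048, as the slides do (toy parameter values):
sp = {"k3_dgemm": 1e-9, "k3_dtrsm": 1e-9, "k2_dgetf2": 2e-9,
      "ts": 1e-4, "tw": 1e-7}
print(eet_lu(2048, 32, 8, 2, 4, sp), eet_lu(2048, 64, 8, 1, 8, sp))
```

The search methods later in the deck only need such a model to be cheap to evaluate, so that many (mapping, b, r x c) candidates can be compared at run time.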
Our HeHo methodology: an example of routine model
Platforms:

SUNEt:
- Five SUN Ultra 1, one SUN Ultra 5
- Interconnection network: Ethernet

TORC (Innovative Computing Laboratory): 21 nodes of different types
- Dual and single processors: Pentium II, III and 4, AMD Athlon
- Interconnection networks: FastEthernet, Giganet, Myrinet
Our HeHo methodology: an example of routine model
Theoretical vs. experimental time on SUNEt, n = 2048

AP     Mapping of 8 processes on the 6 processors   Logical topology   Block size
AP 1   (1,1,1,1,1,3)                                2 x 4              32
AP 2   (2,1,1,1,1,2)                                2 x 4              32
AP 3   (2,2,1,1,1,1)                                2 x 4              32
AP 4   (1,1,1,1,1,3)                                2 x 4              64
AP 5   (2,1,1,1,1,2)                                2 x 4              64
AP 6   (2,2,1,1,1,1)                                2 x 4              64
AP 7   (1,1,1,1,1,3)                                1 x 8              32
AP 8   (2,1,1,1,1,2)                                1 x 8              32
AP 9   (2,2,1,1,1,1)                                1 x 8              32
AP 10  (1,1,1,1,1,3)                                1 x 8              64
AP 11  (2,1,1,1,1,2)                                1 x 8              64
AP 12  (2,2,1,1,1,1)                                1 x 8              64

[Bar chart: theoretical vs. experimental time for AP1 to AP12, on a 0-250 scale]
Our HeHo methodology: an example of routine model
Theoretical vs. experimental time on TORC, n = 4096

AP    Mapping of 8 processes on the 19 processors  Logical topology  Block size
AP 1  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)      4 x 2             32
AP 2  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)      8 x 1             32
AP 3  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)      4 x 2             32
AP 4  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)      8 x 1             32
AP 5  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)      4 x 2             32
AP 6  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)      8 x 1             32
AP 7  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)      4 x 2             32
AP 8  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)      8 x 1             32

[Bar chart: theoretical vs. experimental time for the APs above, on a 0-70 scale]
Our HeHo methodology
Our approach: assignment tree
- A limit on the height of the tree (the number of processes) is necessary
- Each node represents a possible solution: a mapping of processes to processors
- The other APs (block size, logical topology) are chosen at each node

[Figure: assignment tree over the P processors; each level down assigns one more process, until p processes have been mapped]
Our HeHo methodology
For each node, EET(node): Estimated Execution Time
Optimization problem: find the node with the lowest EET

LET(node): Lowest Execution Time
GET(node): Greatest Execution Time
LET and GET are lower and upper bounds on the optimum solution of the subtree below the node.

LET and GET limit the number of nodes evaluated:
MEET = min over evaluated nodes of GET(node)
If LET(node) > MEET, do not work below this node.
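The pruning rule above can be sketched as follows; the toy cost function and bound are illustrative assumptions, and complete nodes use their EET as their GET (which is what Method 1 of the next slides does):

```python
def assign(P, p, eet, let):
    """Backtracking over the assignment tree: d[i] = processes on processor i.

    Nodes with all p processes assigned are complete solutions (their EET is
    their GET); MEET is the best GET seen so far, and a node whose lower
    bound LET exceeds MEET is pruned.
    """
    best = [float("inf"), None]          # MEET and the best mapping found

    def extend(d, assigned, start):
        if assigned == p:
            t = eet(d)
            if t < best[0]:
                best[0], best[1] = t, tuple(d)
            return
        if let(d) > best[0]:             # LET(node) > MEET: prune the subtree
            return
        for i in range(start, P):        # combinatorial tree: non-decreasing i
            d[i] += 1
            extend(d, assigned + 1, i)
            d[i] -= 1

    extend([0] * P, 0, 0)
    return best[0], best[1]

# Toy model: processor i needs t[i] seconds per process-load unit; the slowest
# process dominates, plus a small overhead per processor used.
t = [1.0, 1.2, 2.0]
def eet(d):
    return max(d[i] * t[i] for i in range(len(d)) if d[i]) + 0.1 * sum(x > 0 for x in d)
def let(d):
    # The partial cost only grows as processes are added, so it is a valid bound.
    return max((d[i] * t[i] for i in range(len(d)) if d[i]), default=0.0)

print(assign(3, 4, eet, let))
```

Because the partial cost never decreases along a branch, pruned subtrees cannot contain the optimum, so this toy search returns the same mapping as exhaustive enumeration.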
Our HeHo methodology
Automatic searching strategies in the assignment tree:

Method 1: backtracking; GET = EET
Method 2: backtracking; GET obtained with a greedy approach
Method 3: backtracking; GET obtained with a greedy approach; LET obtained with a greedy approach
Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 1: backtracking

GET = EET
LET = LETari + LETcom
LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 2: backtracking

GET = a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
LET = LETari + LETcom
LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
Fewer nodes are analyzed, but the evaluation cost per node increases.
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 3: backtracking

GET = a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
LET = LETari + LETcom
LETari: a greedy approach is used; for each node, the child that least increases the cost of the arithmetic operations is included in the solution, to obtain the lower bound
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
It is possible that a branch to an optimal solution will be discarded.
Our HeHo methodology
Automatic searching strategies in the assignment tree:

Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions

In both methods 4 and 5, to obtain better logical topologies of the processes, the traversal continues (through the best child of each node) until the established maximum level is reached.
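A greedy descent of this kind can be sketched as follows (the cost function is an illustrative toy, not the paper's model): at each level the child with the lowest EET is taken, the descent continues to the maximum level, and the best node seen on the way down is kept.

```python
def greedy_assign(P, max_level, eet):
    """Greedy descent of the assignment tree: at each level add one process
    to the processor whose resulting node has the lowest EET, down to the
    established maximum level; return the best node visited."""
    d = [0] * P
    best_time, best_map = float("inf"), None
    for _ in range(max_level):
        candidates = []
        for i in range(P):               # children: one more process on i
            d[i] += 1
            candidates.append((eet(d), i))
            d[i] -= 1
        time, i = min(candidates)
        d[i] += 1                        # descend through the best child
        if time < best_time:
            best_time, best_map = time, tuple(d)
    return best_time, best_map

# Toy EET: the slowest processor dominates, plus a per-processor overhead.
t = [1.0, 1.2, 2.0]
def eet(d):
    return max(d[i] * t[i] for i in range(len(d)) if d[i]) + 0.1 * sum(x > 0 for x in d)

print(greedy_assign(3, 8, eet))
```

This visits only P nodes per level instead of the whole subtree, which is why methods 4 and 5 have by far the lowest decision times in the result tables, at the price of possibly missing the optimum.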
Experimental Results
Human searching strategies in the assignment tree:

Greedy User (GU): uses ALL the available processors, one process per processor
Conservative User (CU): uses HALF of the available processors, one process per processor
Expert User (EU): uses 1 processor, HALF or ALL of the processors depending on the problem size, one process per processor
Experimental Results
Automatic decisions vs. users, on SUNEt (n = 7680)

Method  Processes mapping  b    Logical topology  Solution time  t.t.t.  Level
1       (1,1,1,1,1,1)      64   2 x 3             718.94         0.02    25
2       (1,1,1,1,1,1)      64   2 x 3             718.94         0.04    25
3       (1,1,1,1,1,1)      64   2 x 3             718.94         0.02    25
4       (1,1,0,0,0,1)      128  1 x 3             887.85         0.0001  25
5       (1,1,0,0,0,1)      128  1 x 3             887.85         0.0005  25
CU      (1,1,0,0,0,1)      128  1 x 3             1047.13
GU      (1,1,1,1,1,1)      64   2 x 3             887.85
EU      (1,1,1,1,1,1)      64   2 x 3             887.85
Experimental Results
Automatic decisions vs. users, on TORC (n = 2048)

Method  Processes mapping                         b   Logical topology  Solution time  t.t.t.  Level
1       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  3 x 5             17.91          3.08    15
2       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  3 x 5             17.91          3.08    15
3       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  4 x 4             15.27          0.06    25
4       (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0)  64  1 x 1             43.16          0.0012  30
5       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  4 x 4             15.27          0.01    30
CU      (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)  64  3 x 3             23.77
GU      (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)  32  1 x 19            33.57
EU      (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)  64  3 x 3             23.77
Simulations
Virtual platforms: variations and/or increases of the real platforms:

mTORC-01: the quantity of 17P4 is increased to 11. Number of processors: 29. Types of processors: 4.
mTORC-02: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20 respectively. Number of processors: 50. Types of processors: 4.
mTORC-03: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10 respectively, and additional processors have been included. Number of processors: 100. Types of processors: 10.
Simulations
Automatic decisions vs. users, on virtual platform mTORC-01 (n = 20000): the quantity of 17P4 is increased to 11; 29 processors of 4 types.

          Met. 1  Met. 2  Met. 3  Met. 4  Met. 5  CU       GU       EU
Solution  666.44  818.82  666.44  666.44  666.44  1322.23  1145.09  1145.09
t.t.t.    20.39   59.45   0.68    0.0007  0.0122
Level     15      15      20      25      25
Experimental Results
Automatic decisions vs. users, on virtual platform mTORC-02 (n = 20000): the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20 respectively; 50 processors of 4 types.

          Met. 1   Met. 2   Met. 3   Met. 4   Met. 5   CU       GU       EU
Solution  3721.98  3791.98  2439.43  1958.43  1500.24  2249.70  2748.36  2249.70
t.t.t.    259.44   792.32   7.46     0.01     0.07
Level     15       15       25       30       30
Experimental Results
Automatic decisions vs. users, on virtual platform mTORC-03 (n = 20000): the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10 respectively, and additional processors have been included; 100 processors of 10 types.

          Met. 1    Met. 2    Met. 3    Met. 4    Met. 5   CU       GU       EU
Solution  10712.55  14532.45  10712.55  10712.55  4333.23  7405.34  5422.87  5422.87
t.t.t.    109.24    169.72    1274.34   0.08     2.34
Level     10        10        5         25       40
Conclusions
Extension of our previous self-optimisation methodology for homogeneous systems.
On heterogeneous systems, new decisions: the number of processes and the mapping of processes to processors.
Good results with parallel LU factorisation.
The same methodology could be applied to other linear algebra routines.