Antonio Javier Cuenca Muñoz
Dpto. Ingeniería y Tecnología de Computadores
Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters
Javier Cuenca
Luis Pedro García
Domingo Giménez, Scientific Computation Research Group, University of Murcia, Spain
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, USA
Introduction
Automatically Optimised Linear Algebra Software

Objective: software capable of tuning itself according to the execution environment.

Motivation:
- Non-expert users must take decisions about the computation
- Software should adapt to the continuous evolution of hardware
- Developing efficient code by hand consumes a large quantity of resources
- System computation capabilities are highly variable

Some examples of auto-tuning software: ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...
Automatic Optimisation on Heterogeneous Parallel Systems
Two possibilities on heterogeneous systems:
- HoHe: heterogeneous algorithms (heterogeneous distribution of data)
- HeHo: homogeneous algorithms and heterogeneous assignment of processes: a variable number of processes on each processor, depending on the relative speeds

A mapping of processes to processors must be made, and without a large execution time for the decision taking.
Theoretical models: parameters which represent the characteristics of the system.
The general assignment problem is NP-complete, so heuristic approximations are used.
Our previous HoHo methodology
Routine model: T_EXEC = f(n, SP, AP)

n: problem size
SP: system parameters, the computation and communication characteristics of the system
AP: algorithm parameters: block size, number of processors to use, logical configurations of the processors, ... (with one process per processor)
The AP values are chosen when the routine begins to run.
Our previous HoHo methodology → Our HeHo methodology

Modifications in the routine model:

New AP:
- Number of processes to generate
- Mapping of processes to processors

SP values change. With more than one process per processor, each SP_i in processor i becomes d_i times higher, where d_i is the number of processes assigned to processor i.

Implicit synchronization: the global value of each SP is taken as the maximum value over all the processors. The slowest process forces the other ones to reduce their speed, since they wait for it at the different synchronization points of the routine.
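As a rough sketch of this adjustment (the function name and the toy values are illustrative, not from the slides), the effective value of a system parameter under multiple processes per processor can be computed as:

```python
def effective_sp(base_sp, d):
    """Adjust one system parameter SP for multi-process-per-processor runs.

    base_sp[i]: value of the parameter (e.g. time per flop) on processor i
    d[i]:       number of processes assigned to processor i

    Each SP_i grows d[i] times (processor i is shared by d[i] processes);
    implicit synchronization makes the global value the maximum over the
    processors that actually receive processes.
    """
    per_processor = [s * di for s, di in zip(base_sp, d) if di > 0]
    return max(per_processor)

# Two processors, the second twice as slow; two processes on the fast one:
# both give 2.0, so the global SP value is 2.0.
print(effective_sp([1.0, 2.0], [2, 1]))
```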
Our HeHo methodology: an example of routine model
LU factorisation, parallel version. Model:
SP: system parameters: k3_DGEMM, k3_DTRSM, k2_DGETF2, ts, tw

AP: algorithm parameters:
- b: block size
- P: number of processors
- p: number of processes
- mapping of the p processes on the P processors
- p = r x c: logical configuration of the processes (2D mesh)
T_EXEC = T_ARI + T_COM

T_ARI: arithmetic cost, expressed through k3_DGEMM, k3_DTRSM and k2_DGETF2; the dominant term is (2n^3 / 3p) k3_DGEMM, with the panel (DGETF2) and update (DTRSM) terms divided over the r and c dimensions of the process mesh.

T_COM: communication cost, expressed through the start-up time ts and the word-sending time tw, and depending on n, the block size b, the d_i processes per processor and the r x c mesh.
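A minimal sketch of evaluating a model of this form follows. The term shapes are the standard ones for ScaLAPACK-style blocked LU, but the constants here are illustrative assumptions, not the exact coefficients of the slides' model:

```python
def eet_lu(n, b, p, r, c, sp):
    """Estimated execution time of parallel blocked LU: T_EXEC = T_ARI + T_COM.

    sp holds the system parameters of the slides: k3_dgemm, k3_dtrsm,
    k2_dgetf2 (times per flop) and ts, tw (communication times).
    Coefficients below are assumptions for illustration only.
    """
    t_ari = ((2 * n**3 / (3 * p)) * sp["k3_dgemm"]
             + (n**2 * b / (2 * c)) * sp["k3_dtrsm"]
             + (n**2 * b / (2 * r)) * sp["k2_dgetf2"])
    t_com = (n / b) * (r + c) * sp["ts"] + (n**2 / p) * (r + c) * sp["tw"]
    return t_ari + t_com

# Compare two APs for n = 2048, as the slides do (toy parameter values):
sp = {"k3_dgemm": 1e-9, "k3_dtrsm": 1e-9, "k2_dgetf2": 2e-9,
      "ts": 1e-4, "tw": 1e-7}
print(eet_lu(2048, 32, 8, 2, 4, sp), eet_lu(2048, 64, 8, 1, 8, sp))
```

The search methods later in the deck only need such a model to be cheap to evaluate, so that many (mapping, b, r x c) candidates can be compared at run time.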
Our HeHo methodology: an example of routine model
Platforms:

SUNEt:
- Five SUN Ultra 1, one SUN Ultra 5
- Interconnection network: Ethernet

TORC (Innovative Computing Laboratory): 21 nodes of different types
- Dual and single processors: Pentium II, III and 4, AMD Athlon
- Interconnection networks: FastEthernet, Giganet, Myrinet
Our HeHo methodology: an example of routine model
Theoretical vs. experimental time on SUNEt, n = 2048

AP     Mapping of 8 processes on the 6 processors   Logical topology   Block size
AP 1   (1,1,1,1,1,3)                                2 x 4              32
AP 2   (2,1,1,1,1,2)                                2 x 4              32
AP 3   (2,2,1,1,1,1)                                2 x 4              32
AP 4   (1,1,1,1,1,3)                                2 x 4              64
AP 5   (2,1,1,1,1,2)                                2 x 4              64
AP 6   (2,2,1,1,1,1)                                2 x 4              64
AP 7   (1,1,1,1,1,3)                                1 x 8              32
AP 8   (2,1,1,1,1,2)                                1 x 8              32
AP 9   (2,2,1,1,1,1)                                1 x 8              32
AP 10  (1,1,1,1,1,3)                                1 x 8              64
AP 11  (2,1,1,1,1,2)                                1 x 8              64
AP 12  (2,2,1,1,1,1)                                1 x 8              64

[Bar chart: theoretical vs. experimental time for AP1 to AP12, on a 0-250 scale]
Our HeHo methodology: an example of routine model
Theoretical vs. experimental time on TORC, n = 4096

AP    Mapping of 8 processes on the 19 processors  Logical topology  Block size
AP 1  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)      4 x 2             32
AP 2  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)      8 x 1             32
AP 3  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)      4 x 2             32
AP 4  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)      8 x 1             32
AP 5  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)      4 x 2             32
AP 6  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)      8 x 1             32
AP 7  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)      4 x 2             32
AP 8  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)      8 x 1             32

[Bar chart: theoretical vs. experimental time for the APs above, on a 0-70 scale]
Our HeHo methodology
Our approach: assignment tree
- A limit on the height of the tree (the number of processes) is necessary
- Each node represents a possible solution: a mapping of processes to processors
- The other APs (block size, logical topology) are chosen at each node

[Figure: assignment tree over the P processors; each level down assigns one more process, until p processes have been mapped]
Our HeHo methodology
For each node, EET(node): Estimated Execution Time
Optimization problem: find the node with the lowest EET

LET(node): Lowest Execution Time
GET(node): Greatest Execution Time
LET and GET are lower and upper bounds on the optimum solution of the subtree below the node.

LET and GET limit the number of nodes evaluated:
MEET = min over evaluated nodes of GET(node)
If LET(node) > MEET, do not work below this node.
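The pruning rule above can be sketched as follows; the toy cost function and bound are illustrative assumptions, and complete nodes use their EET as their GET (which is what Method 1 of the next slides does):

```python
def assign(P, p, eet, let):
    """Backtracking over the assignment tree: d[i] = processes on processor i.

    Nodes with all p processes assigned are complete solutions (their EET is
    their GET); MEET is the best GET seen so far, and a node whose lower
    bound LET exceeds MEET is pruned.
    """
    best = [float("inf"), None]          # MEET and the best mapping found

    def extend(d, assigned, start):
        if assigned == p:
            t = eet(d)
            if t < best[0]:
                best[0], best[1] = t, tuple(d)
            return
        if let(d) > best[0]:             # LET(node) > MEET: prune the subtree
            return
        for i in range(start, P):        # combinatorial tree: non-decreasing i
            d[i] += 1
            extend(d, assigned + 1, i)
            d[i] -= 1

    extend([0] * P, 0, 0)
    return best[0], best[1]

# Toy model: processor i needs t[i] seconds per process-load unit; the slowest
# process dominates, plus a small overhead per processor used.
t = [1.0, 1.2, 2.0]
def eet(d):
    return max(d[i] * t[i] for i in range(len(d)) if d[i]) + 0.1 * sum(x > 0 for x in d)
def let(d):
    # The partial cost only grows as processes are added, so it is a valid bound.
    return max((d[i] * t[i] for i in range(len(d)) if d[i]), default=0.0)

print(assign(3, 4, eet, let))
```

Because the partial cost never decreases along a branch, pruned subtrees cannot contain the optimum, so this toy search returns the same mapping as exhaustive enumeration.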
Our HeHo methodology
Automatic searching strategies in the assignment tree:

Method 1: backtracking; GET = EET
Method 2: backtracking; GET obtained with a greedy approach
Method 3: backtracking; GET obtained with a greedy approach; LET obtained with a greedy approach
Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 1: backtracking

GET = EET
LET = LETari + LETcom
LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 2: backtracking

GET = a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
LET = LETari + LETcom
LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
Fewer nodes are analyzed, but the evaluation cost per node increases.
Our HeHo methodology
Automatic searching strategies in the assignment tree. Method 3: backtracking

GET = a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
LET = LETari + LETcom
LETari: a greedy approach is used; for each node, the child that least increases the cost of the arithmetic operations is included in the solution, to obtain the lower bound
LETcom: obtained assuming the best logical topology of processes that can be reached from this node
It is possible that a branch to an optimal solution will be discarded.
Our HeHo methodology
Automatic searching strategies in the assignment tree:

Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions

In both methods 4 and 5, to obtain better logical topologies of the processes, the traversal continues (through the best child of each node) until the established maximum level is reached.
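A greedy descent of this kind can be sketched as follows (the cost function is an illustrative toy, not the paper's model): at each level the child with the lowest EET is taken, the descent continues to the maximum level, and the best node seen on the way down is kept.

```python
def greedy_assign(P, max_level, eet):
    """Greedy descent of the assignment tree: at each level add one process
    to the processor whose resulting node has the lowest EET, down to the
    established maximum level; return the best node visited."""
    d = [0] * P
    best_time, best_map = float("inf"), None
    for _ in range(max_level):
        candidates = []
        for i in range(P):               # children: one more process on i
            d[i] += 1
            candidates.append((eet(d), i))
            d[i] -= 1
        time, i = min(candidates)
        d[i] += 1                        # descend through the best child
        if time < best_time:
            best_time, best_map = time, tuple(d)
    return best_time, best_map

# Toy EET: the slowest processor dominates, plus a per-processor overhead.
t = [1.0, 1.2, 2.0]
def eet(d):
    return max(d[i] * t[i] for i in range(len(d)) if d[i]) + 0.1 * sum(x > 0 for x in d)

print(greedy_assign(3, 8, eet))
```

This visits only P nodes per level instead of the whole subtree, which is why methods 4 and 5 have by far the lowest decision times in the result tables, at the price of possibly missing the optimum.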
Experimental Results
Human searching strategies in the assignment tree:

Greedy User (GU): uses ALL the available processors, one process per processor
Conservative User (CU): uses HALF of the available processors, one process per processor
Expert User (EU): uses 1 processor, HALF or ALL of the processors depending on the problem size, one process per processor
Experimental Results
Automatic decisions vs. users, on SUNEt (n = 7680)

Method  Processes mapping  b    Logical topology  Solution time  t.t.t.  Level
1       (1,1,1,1,1,1)      64   2 x 3             718.94         0.02    25
2       (1,1,1,1,1,1)      64   2 x 3             718.94         0.04    25
3       (1,1,1,1,1,1)      64   2 x 3             718.94         0.02    25
4       (1,1,0,0,0,1)      128  1 x 3             887.85         0.0001  25
5       (1,1,0,0,0,1)      128  1 x 3             887.85         0.0005  25
CU      (1,1,0,0,0,1)      128  1 x 3             1047.13
GU      (1,1,1,1,1,1)      64   2 x 3             887.85
EU      (1,1,1,1,1,1)      64   2 x 3             887.85
Experimental Results
Automatic decisions vs. users, on TORC (n = 2048)

Method  Processes mapping                         b   Logical topology  Solution time  t.t.t.  Level
1       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  3 x 5             17.91          3.08    15
2       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  3 x 5             17.91          3.08    15
3       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  4 x 4             15.27          0.06    25
4       (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0)  64  1 x 1             43.16          0.0012  30
5       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)  64  4 x 4             15.27          0.01    30
CU      (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)  64  3 x 3             23.77
GU      (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)  32  1 x 19            33.57
EU      (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)  64  3 x 3             23.77
Simulations
Virtual platforms: variations and/or increases of the real platforms:

mTORC-01: the quantity of 17P4 is increased to 11. Number of processors: 29. Types of processors: 4.
mTORC-02: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20 respectively. Number of processors: 50. Types of processors: 4.
mTORC-03: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10 respectively, and additional processors have been included. Number of processors: 100. Types of processors: 10.
Simulations
Automatic decisions vs. users, on virtual platform mTORC-01 (n = 20000): the quantity of 17P4 is increased to 11; 29 processors of 4 types.

          Met. 1  Met. 2  Met. 3  Met. 4  Met. 5  CU       GU       EU
Solution  666.44  818.82  666.44  666.44  666.44  1322.23  1145.09  1145.09
t.t.t.    20.39   59.45   0.68    0.0007  0.0122
Level     15      15      20      25      25
Experimental Results
Automatic decisions vs. users, on virtual platform mTORC-02 (n = 20000): the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20 respectively; 50 processors of 4 types.

          Met. 1   Met. 2   Met. 3   Met. 4   Met. 5   CU       GU       EU
Solution  3721.98  3791.98  2439.43  1958.43  1500.24  2249.70  2748.36  2249.70
t.t.t.    259.44   792.32   7.46     0.01     0.07
Level     15       15       25       30       30
Experimental Results
Automatic decisions vs. users, on virtual platform mTORC-03 (n = 20000): the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10 respectively, and additional processors have been included; 100 processors of 10 types.

          Met. 1    Met. 2    Met. 3    Met. 4    Met. 5   CU       GU       EU
Solution  10712.55  14532.45  10712.55  10712.55  4333.23  7405.34  5422.87  5422.87
t.t.t.    109.24    169.72    1274.34   0.08     2.34
Level     10        10        5         25       40
Conclusions
Extension of our previous self-optimisation methodology for homogeneous systems.
On heterogeneous systems, new decisions: the number of processes and the mapping of processes to processors.
Good results with parallel LU factorisation.
The same methodology could be applied to other linear algebra routines.