
High Performance LU Factorization for Non-dedicated Clusters and the Future Grid

Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)

Background

Computing nodes on clusters and the Grid are shared by multiple applications.

To obtain good performance, HPC applications must cope with:
• Background processes
• Dynamically changing available nodes
• Large latencies on the Grid

Performance limiting factor: background processes

Other processes may run in the background: network daemons, interactive shells, etc.

Many typical applications are written in a synchronous style.

In such applications, a delay on a single node degrades the overall performance.

Performance limiting factor: large latencies on the Grid

In future Grid environments, bandwidth will be sufficient for HPC applications.

Large latencies will remain an obstacle.

Synchronous applications suffer from large latencies.

[Figure annotation: >100ms]

Available nodes change dynamically

Many HPC applications assume that the set of computing nodes is fixed.

If applications support dynamically changing nodes, we can harness computing resources more efficiently!

Goal of this work

An LU factorization algorithm that
• Tolerates background processes & large latencies
• Supports dynamically changing nodes

Techniques:
• Overlapping multiple iterations
• Written in the Phoenix model
• Data mapping for dynamically changing nodes

Result: a fast HPC application on non-dedicated clusters and the Grid

Outline of this talk

• The Phoenix model
• Our LU algorithm
  • Overlapping multiple iterations
  • Data mapping for dynamically changing nodes
• Performance of our LU and HPL
• Related work
• Summary

Phoenix model [Taura et al. 03]

A message passing model for dynamically changing environments
• Concept of virtual nodes
• Virtual nodes as destinations of messages

[Figure: virtual nodes mapped onto physical nodes]
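The virtual-node idea can be sketched as follows. This is an illustration only, not the Phoenix library's API: the class `VirtualNodeTable` and its methods are hypothetical names, and a single in-memory dictionary stands in for whatever routing the real system uses. The point it shows is that a sender addresses a virtual node ID, so handing that ID to another physical node redirects later messages without the sender changing anything.

```python
class VirtualNodeTable:
    """Maps virtual node IDs to the physical node currently holding them."""

    def __init__(self, num_virtual, physical_nodes):
        # Assumed initial assignment: deal virtual nodes out round-robin.
        self.owner = {v: physical_nodes[v % len(physical_nodes)]
                      for v in range(num_virtual)}

    def route(self, virtual_id):
        """Return the physical node that should receive a message for virtual_id."""
        return self.owner[virtual_id]

    def migrate(self, virtual_id, new_physical):
        """Hand a virtual node over to another physical node."""
        self.owner[virtual_id] = new_physical


table = VirtualNodeTable(num_virtual=8, physical_nodes=["p0", "p1"])
before = table.route(5)   # resolved through the table
table.migrate(5, "p2")    # a newly joined physical node takes over virtual node 5
after = table.route(5)    # same destination ID, different physical node
```

The application keeps using the ID 5 throughout; only the table entry changes when a node joins.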

Overview of our LU

Like typical implementations:
• Based on message passing
• The matrix is decomposed into small blocks
• A block is updated by its owner node

Unlike typical implementations:
• Asynchronous data-driven style for overlapping multiple iterations
• Cyclic-like data mapping for any number of nodes, including a dynamically changing one

(Currently, pivoting is not performed)

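LU factorization

    for (k = 0; k < B; k++) {
        A[k][k] = fact(A[k][k]);                        /* diagonal block */
        for (i = k+1; i < B; i++)
            A[i][k] = update_L(A[i][k], A[k][k]);       /* L part */
        for (j = k+1; j < B; j++)
            A[k][j] = update_U(A[k][j], A[k][k]);       /* U part */
        for (i = k+1; i < B; i++)
            for (j = k+1; j < B; j++)
                A[i][j] = A[i][j] - A[i][k] * A[k][j];  /* trail part */
    }

[Figure: block matrix with the diagonal block, L part, U part, and trail part of one iteration marked]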
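To make the loop structure concrete, here is a minimal runnable sketch of the same right-looking factorization, specialized to 1x1 blocks. It is not the authors' implementation: with scalar blocks and a unit-diagonal L, the `fact` and `update_U` steps become no-ops, and the helper names `lu_no_pivot` and `multiply_lu` are chosen here for illustration. Like the slide, it does no pivoting.

```python
def lu_no_pivot(a):
    """Right-looking LU without pivoting on a list-of-lists matrix.

    Returns one matrix holding U on and above the diagonal and the
    strictly lower part of the unit-lower-triangular L below it.
    """
    n = len(a)
    a = [row[:] for row in a]              # work on a copy
    for k in range(n):
        pivot = a[k][k]                    # diagonal: nothing to factor for a 1x1 block
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):          # L part: scale column k by the pivot
            a[i][k] /= pivot
        # U part: row k is already final because L has a unit diagonal
        for i in range(k + 1, n):          # trail part: A[i][j] -= A[i][k] * A[k][j]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


def multiply_lu(lu):
    """Rebuild L * U from the packed result, to check the factorization."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if i == k else lu[i][k]
                out[i][j] += l_ik * lu[k][j]
    return out


matrix = [[4.0, 3.0, 2.0],
          [8.0, 7.0, 9.0],
          [4.0, 6.0, 5.0]]
packed = lu_no_pivot(matrix)
rebuilt = multiply_lu(packed)
```

In the blocked version each scalar operation becomes a small dense kernel on a block, which is where a BLAS library does the work.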

Naïve implementation and its problem

Iterations are separated.

Not tolerant to latencies/background processes!

[Figure: number of executable tasks over time for the k-th, (k+1)-th, and (k+2)-th iterations; legend: Diagonal, U, L, trail]

Latency Hiding Techniques

Overlapping iterations hides latencies
• Computation of the diagonal/L/U parts is advanced
• If computations of the trail parts are separated, only two adjacent iterations are overlapped

There is room for further improvement.

[Figure: timeline of overlapped iterations; axis: time]

Overlapping multiple iterations for more tolerance

We overlap multiple iterations by computing all blocks, including trail parts, asynchronously.

Data-driven style & prioritized task scheduling are used.

[Figure: timeline of multiple overlapped iterations; axis: time]

Prioritized task scheduling

We assign a priority to the updating task of each block.

The k-th update of block Ai,j has a priority of min(i-S, j-S, k) (a smaller number means a higher priority), where S is the desired overlap depth.

We can control overlapping by changing the value of S.
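The priority rule can be tried out with a small sketch. Only the formula min(i-S, j-S, k) comes from the slide; the function names, the use of a binary heap, and the tie-breaking order (by k, then i, then j) are assumptions made here, and dependency tracking between tasks is left out.

```python
import heapq


def priority(i, j, k, S):
    """Priority of the k-th update of block (i, j); smaller runs first."""
    return min(i - S, j - S, k)


def schedule(ready_tasks, S):
    """Order a list of ready (i, j, k) tasks by priority."""
    heap = [(priority(i, j, k, S), k, i, j) for (i, j, k) in ready_tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, k, i, j = heapq.heappop(heap)
        order.append((i, j, k))
    return order


# Two ready tasks: a trail update far from the diagonal in iteration 0,
# and a U-part update near the top-left corner in iteration 1.
ready = [(5, 5, 0), (1, 2, 1)]
order_s0 = schedule(ready, S=0)   # iteration order wins
order_s5 = schedule(ready, S=5)   # the block near the corner is advanced
```

With S=0 the priority of a valid task is just k, so work proceeds iteration by iteration. With S=5 blocks whose row or column index is small get a lower (better) number, so the parts that later iterations depend on are advanced ahead of remaining trail updates.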

Typical data mapping and its problem

Two-dimensional block cyclic distribution

[Figure: matrix blocks mapped cyclically onto a 2 x 3 process grid: P0 P1 P2 / P3 P4 P5]

Good load balance and small communication, but
• The number of nodes must be fixed and factored into two small numbers

How to support dynamically changing nodes?
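For reference, the standard two-dimensional block cyclic rule can be written in a few lines. This is a generic sketch with row-major process ranks, matching the P0..P5 layout on the slide; the function names are chosen here for illustration.

```python
def block_cyclic_owner(i, j, grid_rows, grid_cols):
    """Rank of the process owning block (i, j) on a grid_rows x grid_cols grid."""
    return (i % grid_rows) * grid_cols + (j % grid_cols)


def grid_shapes(num_nodes):
    """All (rows, cols) process grids available for a given node count."""
    return [(p, num_nodes // p)
            for p in range(1, num_nodes + 1) if num_nodes % p == 0]


# Owners of a 4 x 6 block matrix on the 2 x 3 grid from the slide.
owners = [[block_cyclic_owner(i, j, 2, 3) for j in range(6)] for i in range(4)]
shapes_for_6 = grid_shapes(6)
shapes_for_7 = grid_shapes(7)
```

The second helper shows the restriction named above: six nodes allow a 2 x 3 grid, but seven nodes only allow 1 x 7 or 7 x 1, and any change in the node count forces a new grid.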

Our data mapping for dynamically changing nodes

The permutation is common among all nodes.

Original matrix:

A00 A01 A02 A03 A04 A05 A06 A07
A10 A11 A12 A13 A14 A15 A16 A17
A20 A21 A22 A23 A24 A25 A26 A27
A30 A31 A32 A33 A34 A35 A36 A37
A40 A41 A42 A43 A44 A45 A46 A47
A50 A51 A52 A53 A54 A55 A56 A57
A60 A61 A62 A63 A64 A65 A66 A67
A70 A71 A72 A73 A74 A75 A76 A77

Permuted matrix (after the random permutation):

A40 A41 A42 A43 A44 A45 A46 A47
A20 A21 A22 A23 A24 A25 A26 A27
A70 A71 A72 A73 A74 A75 A76 A77
A30 A31 A32 A33 A34 A35 A36 A37
A50 A51 A52 A53 A54 A55 A56 A57
A00 A01 A02 A03 A04 A05 A06 A07
A60 A61 A62 A63 A64 A65 A66 A67
A10 A11 A12 A13 A14 A15 A16 A17

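A hedged sketch of how a permutation shared by all nodes might be produced and used. The slide gives the permuted row order (4, 2, 7, 3, 5, 0, 6, 1) but not how it is generated or how the permuted matrix is divided, so two things here are assumptions: every node derives the permutation from an agreed seed, and contiguous ranges of permuted positions are dealt out to however many nodes are present.

```python
import random

SLIDE_ROW_ORDER = [4, 2, 7, 3, 5, 0, 6, 1]   # row order shown on the slide


def shared_permutation(num_blocks, seed):
    """Permutation every node can compute locally from the same seed (assumed scheme)."""
    rng = random.Random(seed)
    perm = list(range(num_blocks))
    rng.shuffle(perm)
    return perm


def permute_rows(matrix, perm):
    """Reorder block rows so that row perm[p] lands at position p."""
    return [matrix[r] for r in perm]


def owner_of_position(pos, num_blocks, num_nodes):
    """Assumed split: contiguous ranges of permuted positions, for any node count."""
    return pos * num_nodes // num_blocks


labels = [[f"A{i}{j}" for j in range(8)] for i in range(8)]
permuted = permute_rows(labels, SLIDE_ROW_ORDER)
owners_3_nodes = [owner_of_position(p, 8, 3) for p in range(8)]
```

Because the rows are shuffled first, a contiguous split of positions gives each node rows scattered over the original matrix, which is one way to get cyclic-like balance without requiring a particular node count.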

Dynamically joining nodes

• A new node sends a steal message to one of the nodes
• The receiver abandons some virtual nodes and sends the corresponding blocks to the new node
• The new node takes over those virtual nodes and blocks
• For better load balance, the stealing process is repeated

[Figure: animation of blocks moving to the new node, shown on the original and permuted matrices]
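The stealing steps above can be sketched as follows. The slide does not say how many virtual nodes the receiver gives up or how the target is chosen, so the policy here (move half of the victim's surplus over the thief) and the function name `steal` are assumptions for illustration; the transfer of matrix blocks is only indicated by a comment.

```python
def steal(owner, victim, thief):
    """Move half of the victim's surplus over the thief; return the moved virtual nodes."""
    victim_owned = sorted(v for v, p in owner.items() if p == victim)
    thief_count = sum(1 for p in owner.values() if p == thief)
    surplus = (len(victim_owned) - thief_count) // 2
    moved = victim_owned[:max(surplus, 0)]
    for v in moved:
        owner[v] = thief   # the blocks mapped to v would be sent along here
    return moved


def load(owner, node):
    """Number of virtual nodes currently held by a physical node."""
    return sum(1 for p in owner.values() if p == node)


# 16 virtual nodes on two physical nodes; p2 joins and steals twice.
owner = {v: ("p0" if v < 8 else "p1") for v in range(16)}
first = steal(owner, "p0", "p2")
second = steal(owner, "p1", "p2")
loads = {node: load(owner, node) for node in ("p0", "p1", "p2")}
```

After one steal the new node is balanced only against its first victim; the second steal evens things out further, which is why the process is repeated.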

Experimental environments (1)

112-node IBM BladeCenter cluster
• Dual 2.4GHz Xeon: 70 nodes + Dual 2.8GHz Xeon: 42 nodes
• 1 CPU per node is used
• The slower CPU (2.4GHz) determines the overall performance
• Gigabit Ethernet

Experimental environments (2)

High Performance Linpack (HPL) is by Petitet et al.
GOTO BLAS is made by Kazushige Goto (UT-Austin)

                            Ours              HPL
Comm. library               Phoenix library   mpich-1.2.5
Algebra library             GOTO BLAS         GOTO BLAS
Pivoting                    no                row pivoting
Sequential speed (2.4GHz)   2.92 GFlops       3.38 GFlops

• Ours (S=0): don't overlap explicitly
• Ours (S=1): overlap with an adjacent iteration
• Ours (S=5): overlap multiple (5) iterations

Scalability

Ours(S=5) achieves 190 GFlops with 108 nodes (65 times speedup)

• Matrix size N=61440
• Block size NB=240
• Overlap depth S=0 or 5

[Figure: speed (GFlops) vs. # of processors for N=61440; series: Ours(S=0), Ours(S=5), HPL; speedup annotations: x72, x65]
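The 65 times speedup can be reproduced from the numbers in the talk, assuming it is measured against the sequential speed of our LU on one 2.4GHz CPU (2.92 GFlops in the comparison table). The running-time estimate additionally uses the leading-order LU operation count of 2N^3/3, which is a standard figure and not taken from the slides.

```python
sequential_gflops = 2.92   # Ours on one 2.4GHz CPU, from the comparison table
parallel_gflops = 190.0    # Ours(S=5) on 108 nodes
nodes = 108
n = 61440                  # matrix size

speedup = parallel_gflops / sequential_gflops     # about 65
efficiency = speedup / nodes                      # about 0.60
flops = 2 * n**3 / 3                              # leading-order LU operation count
seconds = flops / (parallel_gflops * 1e9)         # estimated factorization time

print(f"speedup {speedup:.1f}, efficiency {efficiency:.2f}, about {seconds:.0f} s")
```

Under these assumptions the parallel efficiency is roughly 60% on 108 nodes.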

Tolerance to background processes (1)

We run LU/HPL with background processes
• We run 3 background processes per randomly chosen node
• The background processes are short term: they move to other random nodes every 10 secs

[Figure: relative speed vs. # of loaded nodes (0, 4, 8, 16); series: Ours(S=0), Ours(S=1), Ours(S=5), HPL]

[Figure: speed (GFlops) vs. # of loaded nodes (0, 4, 8, 16); series: Ours(S=0), Ours(S=1), Ours(S=5), HPL]

Tolerance to background processes (2)

• HPL slows down heavily
• Ours(S=0) and Ours(S=1) also suffer
• By overlapping multiple iterations (S=5), our LU becomes more tolerant!

• 108 nodes for computation
• N=46080

[Figure annotations: -31%, -16%, -36%, -26%]

Tolerance to large latencies (1)

We emulate the future Grid environment with high bandwidth & large latencies
• Experiments are done on a cluster
• Large latencies are emulated by software: +0ms, +200ms, +500ms

[Figure: relative speed vs. added latency (+0, +200, +500 ms); series: Ours(S=0), Ours(S=1), Ours(S=5)]

Tolerance to large latencies (2)

• S=0 suffers by 28%
• Overlapping of iterations makes our LU more tolerant
• Both S=1 and S=5 work well

• 108 nodes for computation
• N=46080

[Figure: speed (GFlops) vs. added latency (+0, +200, +500 ms); series: Ours(S=0), Ours(S=1), Ours(S=5); annotations: -28%, -19%, -20%]

Performance with joining nodes (1)

16 nodes at first, then 48 nodes are added dynamically

[Figure: speed (GFlops) for Fixed-16 (16 nodes), Dynamic (16 to 64 nodes), and Fixed-64 (64 nodes)]

Performance with joining nodes (2)

• Flexibility in the number of nodes is useful for obtaining higher performance
• Compared with Fixed-64, Dynamic suffers migration overhead etc.

• N=30720
• S=5

[Figure annotation: x1.9 faster]

Related Work

Dyn-MPI [Weatherly et al. 03]: an extended MPI library that supports dynamically changing nodes

                            Dyn-MPI                    Our approach
Redist method               Synchronous                Asynchronous
Distribution of 2D matrix   Only the first dimension   Arbitrary (left for the programmers)

Summary

An LU implementation suitable for non-dedicated clusters and the Grid
• Scalable
• Supports dynamically changing nodes
• Tolerates background processes & large latencies

Future Work

• Perform pivoting
  • More data dependencies are introduced
  • Is our LU still tolerant?
• Improve dynamic load balancing
  • Choose better target nodes for stealing
  • Take care of CPU speeds
• Apply our approach to other HPC applications
  • CFD applications

Thank you!
