High Performance LU Factorization for Non-dedicated Clusters
Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)
Background
Computing nodes on clusters and the future Grid are shared by multiple applications.
To obtain good performance, HPC applications must cope with:
Background processes
Dynamically changing sets of available nodes
Large latencies on the Grid
Performance limiting factor: background processes
Other processes may run in the background: network daemons, interactive shells, etc.
Many typical applications are written in a synchronous style.
In such applications, the delay of a single node degrades the overall performance.
Performance limiting factor: large latencies on the Grid
In future Grid environments, bandwidth will accommodate HPC applications,
but large latencies (>100 ms) will remain an obstacle.
Synchronous applications suffer from large latencies.
Available nodes change dynamically
Many HPC applications assume that the set of computing nodes is fixed.
If applications support dynamically changing nodes, we can harness computing resources more efficiently!
Goal of this work
An LU factorization algorithm, written in the Phoenix model, that
Tolerates background processes and large latencies, by overlapping multiple iterations
Supports dynamically changing nodes, by a data mapping for a changing node set
A fast HPC application on non-dedicated clusters and the Grid
Outline of this talk
The Phoenix model
Our LU algorithm: overlapping multiple iterations; data mapping for dynamically changing nodes
Performance of our LU and HPL
Related work
Summary
Phoenix model [Taura et al. 03]
A message passing model for dynamically changing environments.
Concept of virtual nodes: virtual nodes serve as the destinations of messages and are mapped onto physical nodes.
[Figure: virtual nodes mapped onto physical nodes]
Overview of our LU
Like typical implementations:
Based on message passing; the matrix is decomposed into small blocks; a block is updated by its owner node.
Unlike typical implementations:
Asynchronous, data-driven style for overlapping multiple iterations.
Cyclic-like data mapping for any, dynamically changing, number of nodes.
(Currently, pivoting is not performed.)
LU factorization
for (k=0; k<B; k++) {
  A[k][k] = fact(A[k][k]);                     // diagonal
  for (i=k+1; i<B; i++)
    A[i][k] = update_L(A[i][k], A[k][k]);      // L part
  for (j=k+1; j<B; j++)
    A[k][j] = update_U(A[k][j], A[k][k]);      // U part
  for (i=k+1; i<B; i++)
    for (j=k+1; j<B; j++)
      A[i][j] = A[i][j] - A[i][k] * A[k][j];   // trail part
}
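The block algorithm on this slide can be made concrete. Below is a minimal runnable sketch (unpivoted, NumPy-based, assuming the block size divides the matrix size). The names fact, update_L, and update_U come from the slide; the triangular-solve realizations of update_L/update_U are one standard choice, not necessarily the authors' exact kernels.

```python
import numpy as np

def fact(Akk):
    """Unpivoted LU of a diagonal block; L (unit lower) and U packed in one matrix."""
    A = Akk.copy()
    n = A.shape[0]
    for p in range(n):
        A[p+1:, p] /= A[p, p]
        A[p+1:, p+1:] -= np.outer(A[p+1:, p], A[p, p+1:])
    return A

def update_L(Aik, Akk):
    """Solve X * U_kk = A_ik for the L-part block X."""
    U = np.triu(Akk)
    return np.linalg.solve(U.T, Aik.T).T

def update_U(Akj, Akk):
    """Solve L_kk * X = A_kj for the U-part block X."""
    L = np.tril(Akk, -1) + np.eye(Akk.shape[0])
    return np.linalg.solve(L, Akj)

def block_lu(A, nb):
    """Right-looking block LU over B x B blocks of size nb, as in the slide's loop."""
    n = A.shape[0]
    B = n // nb
    for k in range(B):
        K = slice(k * nb, (k + 1) * nb)
        A[K, K] = fact(A[K, K])                      # diagonal
        for i in range(k + 1, B):
            I = slice(i * nb, (i + 1) * nb)
            A[I, K] = update_L(A[I, K], A[K, K])     # L part
        for j in range(k + 1, B):
            J = slice(j * nb, (j + 1) * nb)
            A[K, J] = update_U(A[K, J], A[K, K])     # U part
        for i in range(k + 1, B):
            I = slice(i * nb, (i + 1) * nb)
            for j in range(k + 1, B):
                J = slice(j * nb, (j + 1) * nb)
                A[I, J] -= A[I, K] @ A[K, J]         # trail part
    return A
```

After the loop, the strict lower triangle of A holds L and the upper triangle holds U, so L @ U reconstructs the input (for matrices where pivoting is unnecessary, e.g. diagonally dominant ones).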
Naïve implementation and its problem
Iterations are separated: not tolerant to latencies or background processes!
[Graph: number of executable tasks over time; the k-th, (k+1)-th, and (k+2)-th iterations do not overlap; legend: diagonal, U, L, trail]
Latency hiding techniques
Overlapping iterations hides latencies: the diagonal, L, and U parts are computed ahead.
If the computation of the trail part is kept separate, only two adjacent iterations are overlapped.
There is room for further improvement.
[Graph: number of executable tasks over time, with adjacent iterations overlapped]
Overlapping multiple iterations for more tolerance
We overlap multiple iterations by computing all blocks, including trail parts, asynchronously.
A data-driven style and prioritized task scheduling are used.
[Graph: number of executable tasks over time, with multiple iterations overlapped]
Prioritized task scheduling
We assign a priority to the updating task of each block.
The k-th update of block A[i][j] has priority min(i-S, j-S, k), where a smaller number means a higher priority and S is the desired overlap depth.
We can control the amount of overlap by changing the value of S.
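The priority rule above can be sketched as a ready-queue ordered by min(i-S, j-S, k). The TaskQueue wrapper and its tie-breaking are illustrative; the slide only specifies the priority formula.

```python
import heapq

def priority(i, j, k, s):
    """Priority of the k-th update of block A[i][j]; smaller = more urgent.
    Blocks near the current diagonal and early iterations run first, so at
    most about s iterations run ahead of the oldest unfinished one."""
    return min(i - s, j - s, k)

class TaskQueue:
    """Toy ready-queue: tasks become runnable when their inputs arrive and
    are always dequeued in priority order."""
    def __init__(self, s):
        self.s = s
        self.heap = []
        self.seq = 0                      # tie-breaker for deterministic pops
    def push(self, i, j, k):
        item = (priority(i, j, k, self.s), self.seq, (i, j, k))
        heapq.heappush(self.heap, item)
        self.seq += 1
    def pop(self):
        return heapq.heappop(self.heap)[2]
```

For example, with S=2, the first update (k=0) of a far-away block A[5][5] has priority min(3, 3, 0) = 0 and is scheduled before the second-iteration update (k=1) of the nearby block A[3][3], priority min(1, 1, 1) = 1.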
Typical data mapping and its problem
Two-dimensional block-cyclic distribution: blocks are dealt out cyclically over a process grid (e.g. P0..P5 on a 2x3 grid).
Good load balance and small communication volume, but the number of nodes must be fixed and must factor into two small numbers.
How can dynamically changing nodes be supported?
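The standard mapping on this slide has a one-line formula; this sketch just makes the constraint explicit (node numbering over the grid is illustrative):

```python
def block_cyclic_owner(i, j, P, Q):
    """Owner node of block (i, j) under a P x Q two-dimensional block-cyclic
    distribution; nodes are numbered 0 .. P*Q-1 row-major over the grid."""
    return (i % P) * Q + (j % Q)
```

Note the limitation the slide points out: the node count must equal P*Q exactly, so e.g. 7 nodes cannot form a balanced two-dimensional grid, and the grid cannot grow or shrink while the factorization runs.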
Our data mapping for dynamically changing nodes
Permutation is common among all nodes
[Figure: an 8x8 block matrix A00..A77; a random permutation reorders the block rows (here to the order 4, 2, 7, 3, 5, 0, 6, 1), and the permuted matrix is then distributed cyclically among the nodes]
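The slides show the permutation pictorially rather than as a formula; a row-wise sketch of the idea is below. The seed-sharing trick and the function names are illustrative assumptions, not the authors' stated implementation.

```python
import random

def shared_permutation(nblock_rows, seed=42):
    """A random permutation of block-row indices. Every node computes the same
    permutation from the same seed, so no coordination is needed."""
    perm = list(range(nblock_rows))
    random.Random(seed).shuffle(perm)
    return perm

def owner(i, perm, nnodes):
    """Owner of block row i: permute the row index, then deal the permuted
    rows out cyclically over however many nodes currently exist."""
    return perm[i] % nnodes
```

This works for any node count, and because the permutation is random, the shrinking trailing submatrix of each LU iteration stays spread over all nodes, approximating the load balance of a cyclic distribution.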
Dynamically joining nodes
A new node sends a steal message to one of the existing nodes.
The receiver abandons some virtual nodes and sends the corresponding blocks to the new node.
The new node takes over those virtual nodes and blocks.
For better load balance, the stealing process is repeated.
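The repeated-stealing step can be sketched as follows. The slides only say that stealing is repeated for better balance; the largest-holder victim choice, the half-split heuristic, and the equal-share target below are illustrative assumptions.

```python
def join(nodes):
    """A joining node repeatedly steals virtual nodes from the current largest
    holder until it owns roughly an equal share. `nodes` is a list of per-node
    lists of virtual-node ids; the blocks owned by a virtual node move with it."""
    total = sum(len(v) for v in nodes)
    target = total // (len(nodes) + 1)        # equal share after joining
    new = []
    while len(new) < target:
        victim = max(nodes, key=len)          # steal from the largest holder
        take = min(max(1, len(victim) // 2),  # take about half of the victim,
                   target - len(new))         # but never overshoot the target
        new.extend(victim[:take])
        del victim[:take]
    nodes.append(new)
    return nodes
```

Each steal here corresponds to one steal message and reply in the protocol; repeating it lets the newcomer reach a fair share even when the first victim holds only a small fraction of the virtual nodes.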
Experimental environment (1)
112-node IBM BladeCenter cluster: 70 nodes with dual 2.4 GHz Xeons + 42 nodes with dual 2.8 GHz Xeons.
1 CPU per node is used; the slower CPUs (2.4 GHz) determine the overall performance.
Gigabit Ethernet interconnect.
Experimental environment (2)
High Performance Linpack (HPL) is by Petitet et al.; GOTO BLAS is by Kazushige Goto (UT-Austin).

                            Ours              HPL
  Comm. library             Phoenix library   mpich-1.2.5
  Algebra library           GOTO BLAS         GOTO BLAS
  Pivoting                  none              row pivoting
  Sequential speed (2.4GHz) 2.92 GFlops       3.38 GFlops

Ours (S=0): no explicit overlap. Ours (S=1): overlap with one adjacent iteration. Ours (S=5): overlap multiple (5) iterations.
Scalability
Ours (S=5) achieves 190 GFlops with 108 nodes: a 65x speedup.
• Matrix size N=61440 • Block size NB=240 • Overlap depth S=0 or 5
[Graph: speed (GFlops, 0-250) vs. number of processors (0-120) for Ours(S=0), Ours(S=5), and HPL; speedups of 72x and 65x are annotated]
Tolerance to background processes (1)
We run LU/HPL with background processes: 3 background processes on each of several randomly chosen nodes.
The background processes are short-term: they move to other random nodes every 10 seconds.
[Graphs: relative speed and absolute speed (GFlops) vs. number of loaded nodes (0, 4, 8, 16) for Ours(S=0), Ours(S=1), Ours(S=5), and HPL]
Tolerance to background processes (2)
HPL slows down heavily; Ours(S=0) and Ours(S=1) also suffer.
By overlapping multiple iterations (S=5), our LU becomes more tolerant!
• 108 nodes for computation • N=46080
[Graph annotations: slowdowns of -31%, -16%, -36%, -26%]
Tolerance to large latencies (1)
We emulate a future Grid environment with high bandwidth and large latencies.
Experiments are done on a cluster; large latencies (+0 ms, +200 ms, +500 ms) are emulated in software.
[Graph: relative speed vs. added latency (+0, +200, +500 ms) for Ours(S=0), Ours(S=1), and Ours(S=5)]
Tolerance to large latencies (2)
S=0 suffers a 28% slowdown; overlapping iterations makes our LU more tolerant.
Both S=1 and S=5 work well.
• 108 nodes for computation • N=46080
[Graph: speed (GFlops) vs. added latency for Ours(S=0), Ours(S=1), and Ours(S=5); annotations: -28%, -19%, -20%]
Performance with joining nodes (1)
16 nodes at first; then 48 nodes are added dynamically.
[Graph: speed (GFlops) for the Fixed-16, Dynamic, and Fixed-64 configurations]
Performance with joining nodes (2)
Flexibility in the number of nodes is useful for obtaining higher performance (1.9x faster).
Compared with Fixed-64, Dynamic suffers migration overhead, etc.
• N=30720 • S=5
Related work
Dyn-MPI [Weatherly et al. 03]: an extended MPI library that supports dynamically changing nodes.

                             Dyn-MPI                    Our approach
  Redistribution method      Synchronous                Asynchronous
  Distribution of 2D matrix  Only the first dimension   Arbitrary (left to the programmer)
Summary
An LU implementation suitable for non-dedicated clusters and the Grid:
Scalable
Supports dynamically changing nodes
Tolerates background processes and large latencies

Future work
Perform pivoting: more data dependencies are introduced; is our LU still tolerant?
Improve dynamic load balancing: choose better target nodes for stealing; take CPU speeds into account.
Apply our approach to other HPC applications, e.g. CFD applications.
Thank you!