![Page 1: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/1.jpg)
High Performance LU Factorization for Non-dedicated Clusters and the future Grid
Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)
![Page 2: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/2.jpg)
Background
• Computing nodes on clusters and the Grid are shared by multiple applications
• To obtain good performance, HPC applications must cope with
  • Background processes
  • Dynamically changing available nodes
  • Large latencies on the Grid
![Page 3: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/3.jpg)
Performance limiting factor: background processes
• Other processes may run in the background
  • Network daemons, interactive shells, etc.
• Many typical applications are written in a synchronous style
• In such applications, a delay on a single node degrades the overall performance
![Page 4: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/4.jpg)
Performance limiting factor: large latencies on the Grid
• In future Grid environments, bandwidth will be sufficient for HPC applications
• Large latencies will remain an obstacle
• Synchronous applications suffer from large latencies
[Figure label: latency > 100 ms]
![Page 5: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/5.jpg)
Available nodes change dynamically
• Many HPC applications assume that the set of computing nodes is fixed
• If applications support dynamically changing nodes, we can harness computing resources more efficiently!
![Page 6: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/6.jpg)
Goal of this work
An LU factorization algorithm that
• Tolerates background processes & large latencies
• Supports dynamically changing nodes
• Overlapping multiple iterations
• Written in the Phoenix model
• Data mapping for dynamically changing nodes
A fast HPC application on non-dedicated clusters and the Grid
![Page 7: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/7.jpg)
Outline of this talk
• The Phoenix model
• Our LU algorithm
  • Overlapping multiple iterations
  • Data mapping for dynamically changing nodes
• Performance of our LU and HPL
• Related work
• Summary
![Page 8: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/8.jpg)
Phoenix model [Taura et al. 03]
• A message passing model for dynamically changing environments
  • Concept of virtual nodes
  • Virtual nodes as destinations of messages
[Figure: virtual nodes mapped onto physical nodes]
![Page 9: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/9.jpg)
Overview of our LU
• Like typical implementations,
  • Based on message passing
  • The matrix is decomposed into small blocks
  • A block is updated by its owner node
• Unlike typical implementations,
  • Asynchronous data-driven style for overlapping multiple iterations
  • Cyclic-like data mapping for any, and dynamically changing, number of nodes
• (Currently, pivoting is not performed)
![Page 10: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/10.jpg)
LU factorization
for (k=0; k<B; k++) {
    A[k][k] = fact(A[k][k]);                        // Diagonal
    for (i=k+1; i<B; i++)
        A[i][k] = update_L(A[i][k], A[k][k]);       // L part
    for (j=k+1; j<B; j++)
        A[k][j] = update_U(A[k][j], A[k][k]);       // U part
    for (i=k+1; i<B; i++)
        for (j=k+1; j<B; j++)
            A[i][j] = A[i][j] - A[i][k] * A[k][j];  // Trail part
}
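The loop on this slide can be made concrete with a small sequential sketch. The helper names fact, update_L and update_U follow the slide; their bodies, the trail helper, and the nested-list block layout are assumptions made for illustration and are not the authors' implementation. As on the slides, no pivoting is performed, so every pivot must be nonzero.

```python
def fact(a):
    """Unblocked LU of one diagonal block, no pivoting; L (unit lower) and U share one block."""
    n = len(a)
    a = [row[:] for row in a]
    for p in range(n):
        for r in range(p + 1, n):
            a[r][p] /= a[p][p]                      # multiplier, stored in the L part
            for c in range(p + 1, n):
                a[r][c] -= a[r][p] * a[p][c]
    return a


def update_L(a_ik, lu_kk):
    """Solve X * U_kk = A_ik for X (a block below the diagonal)."""
    x = [row[:] for row in a_ik]
    for row in x:
        for c in range(len(lu_kk)):
            for p in range(c):
                row[c] -= row[p] * lu_kk[p][c]
            row[c] /= lu_kk[c][c]
    return x


def update_U(a_kj, lu_kk):
    """Solve L_kk * X = A_kj for X (a block right of the diagonal); L_kk has a unit diagonal."""
    x = [row[:] for row in a_kj]
    for r in range(len(lu_kk)):
        for p in range(r):
            for c in range(len(x[r])):
                x[r][c] -= lu_kk[r][p] * x[p][c]
    return x


def trail(a_ij, a_ik, a_kj):
    """Trailing update: A_ij - A_ik * A_kj (block matrix product)."""
    return [[a_ij[r][c] - sum(a_ik[r][p] * a_kj[p][c] for p in range(len(a_kj)))
             for c in range(len(a_ij[0]))]
            for r in range(len(a_ij))]


def block_lu(blocks):
    """Right-looking blocked LU over a B x B grid of blocks, mirroring the slide's loop."""
    B = len(blocks)
    A = [row[:] for row in blocks]
    for k in range(B):
        A[k][k] = fact(A[k][k])                                  # Diagonal
        for i in range(k + 1, B):
            A[i][k] = update_L(A[i][k], A[k][k])                 # L part
        for j in range(k + 1, B):
            A[k][j] = update_U(A[k][j], A[k][k])                 # U part
        for i in range(k + 1, B):
            for j in range(k + 1, B):
                A[i][j] = trail(A[i][j], A[i][k], A[k][j])       # Trail part
    return A
```

The result holds L below the diagonal (with an implicit unit diagonal) and U on and above it, so multiplying the two factors back together should reproduce the input matrix.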
![Page 11: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/11.jpg)
Naïve implementation and its problem
• Iterations are separated
• Not tolerant of latencies or background processes!
[Figure: number of executable tasks over time for the k-th, (k+1)-th and (k+2)-th iterations; task types: diagonal, U, L, trail]
![Page 12: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/12.jpg)
Latency Hiding Techniques
• Overlapping iterations hides latencies
  • Diagonal/L/U parts are advanced
• If computations of the trail parts are separated, only two adjacent iterations are overlapped
• There is room for further improvement
[Figure: execution timeline]
![Page 13: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/13.jpg)
Overlapping multiple iterations for more tolerance
• We overlap multiple iterations by computing all blocks, including the trail parts, asynchronously
• Data-driven style & prioritized task scheduling are used
[Figure: execution timeline]
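One way to picture the data-driven style is as a set of per-block update tasks that become executable as soon as their inputs exist, regardless of which iteration they belong to. The sketch below is a sequential simulation under assumptions: the task naming (i, j, k) for "k-th update of block (i, j)" follows the next slide, while the dependency rules are derived here from the plain LU loop and the polling queue stands in for real message arrival. It is not the authors' runtime.

```python
from collections import deque


def deps(i, j, k):
    """Tasks that must finish before the k-th update of block (i, j) can run."""
    need = []
    if k > 0:
        need.append((i, j, k - 1))          # previous update of the same block
    if k < min(i, j):
        need += [(i, k, k), (k, j, k)]      # trail update: needs finished L and U blocks of iteration k
    elif i != j:
        need.append((k, k, k))              # L or U block: needs the factored diagonal block
    return need


def data_driven_order(B):
    """Run every task as soon as its dependencies are done; return the execution order."""
    pending = deque((i, j, k)
                    for i in range(B) for j in range(B)
                    for k in range(min(i, j) + 1))
    done, order = set(), []
    while pending:
        task = pending.popleft()
        if all(d in done for d in deps(*task)):
            done.add(task)
            order.append(task)
        else:
            pending.append(task)            # not ready yet, retry later
    return order
```

For a 3 x 3 block grid, data_driven_order(3) runs the diagonal task of iteration 1, (1, 1, 1), before the trailing update (2, 2, 0) of iteration 0, which is the kind of overlap across iterations the slide describes.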
![Page 14: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/14.jpg)
Prioritized task scheduling
• We assign a priority to the updating task of each block
• The k-th update of block A[i][j] has a priority of min(i-S, j-S, k) (a smaller number means a higher priority), where S is the desired overlap depth
• We can control the degree of overlapping by changing the value of S
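The priority rule can be tried out directly. In the sketch below the formula min(i-S, j-S, k) is taken from the slide; the schedule helper, the heap, and the tie-breaking by (k, i, j) are assumptions added for illustration. Because the k-th update only touches blocks with i >= k and j >= k, S=0 reduces the priority to k, that is, plain iteration order.

```python
import heapq


def priority(i, j, k, S):
    """Priority of the k-th update of block A[i][j]; a smaller value runs first."""
    return min(i - S, j - S, k)


def schedule(ready_tasks, S):
    """Return ready tasks (i, j, k) in the order a priority queue would pop them.

    Ties are broken by (k, i, j); that tie-break is an assumption of this sketch.
    """
    heap = [(priority(i, j, k, S), k, i, j) for (i, j, k) in ready_tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, k, i, j = heapq.heappop(heap)
        order.append((i, j, k))
    return order


# Two ready tasks: a trailing update of iteration 0 and a U-part update of iteration 1.
ready = [(5, 5, 0), (1, 2, 1)]
no_overlap = schedule(ready, S=0)    # iteration 0 work first
deep_overlap = schedule(ready, S=5)  # the iteration 1 panel block jumps ahead
```

With S=5 the block near the top-left corner gets priority -4 and is advanced ahead of the far trailing block, which keeps later iterations supplied with work.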
![Page 15: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/15.jpg)
Typical data mapping and its problem
• Two-dimensional block-cyclic distribution
[Figure: matrix blocks mapped cyclically onto a 2 x 3 grid of nodes: P0 P1 P2 / P3 P4 P5]
• Good load balance and small communication, but
  • The number of nodes must be fixed and factored into two small numbers
• How to support dynamically changing nodes?
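The restriction on the node count is easy to see in a sketch of the usual two-dimensional block-cyclic owner function. The row-major node numbering matches the P0..P5 figure; the function names and the grid_shape helper are hypothetical and only illustrate the idea.

```python
import math


def owner(i, j, P, Q):
    """Node that owns block (i, j) on a P x Q node grid, numbered row by row."""
    return (i % P) * Q + (j % Q)


def grid_shape(num_nodes):
    """Most square P x Q factorization of num_nodes with P <= Q."""
    p = max(d for d in range(1, math.isqrt(num_nodes) + 1) if num_nodes % d == 0)
    return p, num_nodes // p


six = grid_shape(6)     # (2, 3): the grid in the figure
seven = grid_shape(7)   # (1, 7): a prime node count degenerates to a one-dimensional grid
```

A node count such as 7 cannot be factored into two small numbers, and any change in the node count changes P and Q and therefore the owner of almost every block.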
![Page 16: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/16.jpg)
Our data mapping for dynamically changing nodes
Permutation is common among all nodes
Original matrix:
A00 A01 A02 A03 A04 A05 A06 A07
A10 A11 A12 A13 A14 A15 A16 A17
A20 A21 A22 A23 A24 A25 A26 A27
A30 A31 A32 A33 A34 A35 A36 A37
A40 A41 A42 A43 A44 A45 A46 A47
A50 A51 A52 A53 A54 A55 A56 A57
A60 A61 A62 A63 A64 A65 A66 A67
A70 A71 A72 A73 A74 A75 A76 A77

Random permutation

Permuted matrix:
A40 A41 A42 A43 A44 A45 A46 A47
A20 A21 A22 A23 A24 A25 A26 A27
A70 A71 A72 A73 A74 A75 A76 A77
A30 A31 A32 A33 A34 A35 A36 A37
A50 A51 A52 A53 A54 A55 A56 A57
A00 A01 A02 A03 A04 A05 A06 A07
A60 A61 A62 A63 A64 A65 A66 A67
A10 A11 A12 A13 A14 A15 A16 A17
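A minimal sketch of a permutation that is common among all nodes: every node derives the same shuffle from a shared seed, so no node has to broadcast it. The seed-based construction and the contiguous division of the permuted rows among nodes are assumptions of this sketch; the slides show only the permuted matrix and do not spell out how it is divided.

```python
import random


def shared_permutation(num_block_rows, seed):
    """Permutation of block-row indices that every node can recompute from the same seed."""
    rng = random.Random(seed)
    perm = list(range(num_block_rows))
    rng.shuffle(perm)
    return perm


def owner_of_row(orig_row, perm, num_nodes):
    """Owner of an original block row when permuted rows are split into contiguous chunks.

    The contiguous split is an assumption; it works for any number of nodes.
    """
    position = perm.index(orig_row)          # where the row lands in the permuted matrix
    return position * num_nodes // len(perm)


# Row order shown in the slide's permuted matrix: A4*, A2*, A7*, A3*, A5*, A0*, A6*, A1*.
slide_perm = [4, 2, 7, 3, 5, 0, 6, 1]
owners_for_3_nodes = [owner_of_row(r, slide_perm, 3) for r in range(8)]
```

Because neighbouring rows of the original matrix are scattered by the permutation, each contiguous chunk holds rows from all over the matrix, which is what makes the mapping cyclic-like without fixing the node count in advance.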
![Page 17: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/17.jpg)
Dynamically joining nodes
• A new node sends a steal message to one of the nodes
• The receiver abandons some virtual nodes, and sends blocks to the new node
• The new node takes over the virtual nodes and blocks
• For better load balance, the stealing process is repeated
[Figure: original and permuted matrices]
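The joining steps can be simulated with plain dictionaries. In this sketch each physical node owns a list of virtual node ids; how many virtual nodes the receiver abandons (half of the difference) and how the target is chosen (the most loaded node) are assumptions made so the example is deterministic, and the transfer of matrix blocks is not modelled.

```python
def steal(owners, victim, thief):
    """Move virtual nodes from victim to thief until their loads are roughly even."""
    surplus = (len(owners[victim]) - len(owners[thief])) // 2
    for _ in range(max(surplus, 0)):
        owners[thief].append(owners[victim].pop())


def join(owners, new_node, rounds):
    """Add new_node and repeat the stealing process 'rounds' times."""
    owners[new_node] = []
    for _ in range(rounds):
        # Assumption: steal from the currently most loaded node.
        victim = max((n for n in owners if n != new_node),
                     key=lambda n: len(owners[n]))
        steal(owners, victim, new_node)


owners = {"p0": list(range(0, 8)), "p1": list(range(8, 16))}
join(owners, "p2", rounds=2)
loads = {node: len(vnodes) for node, vnodes in owners.items()}
```

After two steals the 16 virtual nodes are split 4 / 6 / 6, which shows why repeating the stealing step gives a better balance than a single steal from one node.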
![Page 18: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/18.jpg)
Experimental environments (1)
• 112-node IBM BladeCenter cluster
  • Dual 2.4GHz Xeon: 70 nodes + Dual 2.8GHz Xeon: 42 nodes
  • 1 CPU per node is used
  • The slower CPU (2.4GHz) determines the overall performance
• Gigabit Ethernet
![Page 19: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/19.jpg)
Experimental environments (2)
• High Performance Linpack (HPL) is by Petitet et al.
• GOTO BLAS is made by Kazushige Goto (UT-Austin)

                            Ours              HPL
Comm. library               Phoenix library   mpich-1.2.5
Algebra library             GOTO BLAS         GOTO BLAS
Pivoting                    no                row pivoting
Sequential speed (2.4GHz)   2.92 GFlops       3.38 GFlops

• Ours (S=0): no explicit overlap
• Ours (S=1): overlap with an adjacent iteration
• Ours (S=5): overlap multiple (5) iterations
![Page 20: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/20.jpg)
Scalability
• Ours(S=5) achieves 190 GFlops with 108 nodes
  • 65 times speedup
• Matrix size N=61440
• Block size NB=240
• Overlap depth S=0 or 5
[Figure: speed (GFlops, 0 to 250) vs. # of processors (0 to 120), N=61440; series: Ours(S=0), Ours(S=5), HPL; annotations: x72, x65]
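The 65 times speedup can be checked against the sequential speed listed in the environment table (2.92 GFlops for our LU on a 2.4GHz CPU). The parallel efficiency computed below is a derived figure, not a number stated on the slides.

```python
def speedup(parallel_gflops, sequential_gflops):
    """Speedup relative to one CPU running the same code."""
    return parallel_gflops / sequential_gflops


def efficiency(parallel_gflops, sequential_gflops, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(parallel_gflops, sequential_gflops) / nodes


ours_speedup = speedup(190.0, 2.92)             # about 65
ours_efficiency = efficiency(190.0, 2.92, 108)  # about 0.60
```

Roughly 60% of the ideal 108-fold speedup is obtained at 108 nodes under these numbers.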
![Page 21: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/21.jpg)
Tolerance to background processes (1)
• We run LU/HPL with background processes
  • We run 3 background processes per randomly chosen node
  • The background processes are short-term
  • They move to other random nodes every 10 secs
![Page 22: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/22.jpg)
Tolerance to background processes (2)
• HPL slows down heavily
• Ours(S=0) and Ours(S=1) also suffer
• By overlapping multiple iterations (S=5), our LU becomes more tolerant!
• 108 nodes for computation
• N=46080
[Figure: speed (GFlops, 0 to 250) and relative speed (0 to 1.2) vs. # of loaded nodes (0, 4, 8, 16); series: Ours(S=0), Ours(S=1), Ours(S=5), HPL; annotations: -31%, -16%, -36%, -26%]
![Page 23: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/23.jpg)
Tolerance to large latencies (1)
• We emulate the future Grid environment with high bandwidth & large latencies
  • Experiments are done on a cluster
  • Large latencies are emulated by software: +0ms, +200ms, +500ms
![Page 24: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/24.jpg)
Tolerance to large latencies (2)
• S=0 suffers by 28%
• Overlapping of iterations makes our LU more tolerant
  • Both S=1 and S=5 work well
• 108 nodes for computation
• N=46080
[Figure: speed (GFlops, 0 to 200) and relative speed (0 to 1.2) vs. added latency (+0, +200, +500 ms); series: Ours(S=0), Ours(S=1), Ours(S=5); annotations: -28%, -19%, -20%]
![Page 25: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/25.jpg)
Performance with joining nodes (1)
16 nodes at first, then 48 nodes are added dynamically
[Figure labels: 16 nodes, 64 nodes]
![Page 26: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/26.jpg)
Performance with joining nodes (2)
• Flexibility in the number of nodes is useful for obtaining higher performance
• Compared with Fixed-64, Dynamic suffers from migration overhead etc.
• N=30720
• S=5
[Figure: speed (GFlops, 0 to 140) for Fixed-16, Dynamic and Fixed-64; annotation: x1.9 faster]
![Page 27: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/27.jpg)
Related Work
Dyn-MPI [Weatherly et al. 03]
• An extended MPI library that supports dynamically changing nodes

                            Dyn-MPI                    Our approach
Redist method               Synchronous                Asynchronous
Distribution of 2D matrix   Only the first dimension   Arbitrary (left to the programmers)
![Page 28: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/28.jpg)
Summary
An LU implementation suitable for non-dedicated clusters and the Grid
• Scalable
• Supports dynamically changing nodes
• Tolerates background processes & large latencies
![Page 29: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/29.jpg)
Future Work
• Perform pivoting
  • More data dependencies are introduced
  • Is our LU still tolerant?
• Improve dynamic load balancing
  • Choose better target nodes for stealing
  • Take CPU speeds into account
• Apply our approach to other HPC applications
  • CFD applications
![Page 30: High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1aff7f8b9ab0599862b7/html5/thumbnails/30.jpg)
Thank you!