IEEE Cluster talk
![Page 1: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/1.jpg)
High Performance Dense Linear System Solver with Soft Error Resilience
Peng Du, Piotr Luszczek, Jack Dongarra
May 3, 2023
![Page 2: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/2.jpg)
Agenda
• Soft error threat to the dense linear solver
  • LU factorization
  • Error propagation
• Error modeling
• Fault tolerant algorithm
• Performance Evaluation
![Page 3: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/3.jpg)
Soft error
• Silent error due to radiation
  • Alpha particle
  • High energy neutron
  • Thermal neutron
• Outbreaks
  • Commercial computing system from Sun Microsystems in 2000
  • ASC Q supercomputer at Los Alamos National Lab in 2003
Bit flip: 0 0 1 1 0 1 0 1 → 1 0 1 1 0 1 0 1
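A single flipped bit can be harmless or catastrophic depending on where it lands. The following is a minimal sketch, assuming IEEE 754 double precision storage; the helper name `flip_bit` is hypothetical and not part of the talk.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

# An 8-bit pattern with its most significant bit flipped.
before = 0b00110101
after = before ^ (1 << 7)
print(f"{before:08b} -> {after:08b}")

# The same kind of event in a double: the damage depends on the bit hit.
print(flip_bit(1.0, 0))    # lowest mantissa bit: a tiny perturbation
print(flip_bit(1.0, 62))   # top exponent bit: 1.0 becomes inf
print(flip_bit(1.0, 63))   # sign bit: 1.0 becomes -1.0
```

Neither case raises an exception or a hardware trap, which is why such errors stay silent unless the algorithm checks for them.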
![Page 4: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/4.jpg)
LU based linear solver
![Page 5: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/5.jpg)
Block LU factorization
Figure labels: GETF2 (panel factorization), TRSM (triangular solve), GEMM (trailing matrix update)
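The three kernels named on this slide can be sketched as one right-looking blocked loop. This is a simplified illustration, not the ScaLAPACK implementation: it runs on one process, uses exact `Fraction` arithmetic, and omits pivoting (the real GETF2 pivots). The function names mirror the kernel names but are hypothetical stand-ins.

```python
from fractions import Fraction

def getf2_nopiv(a, k, kb):
    """Unblocked LU of the panel a[k:, k:k+kb], in place, without pivoting."""
    n = len(a)
    for j in range(k, k + kb):
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]                    # multiplier l_ij
            for c in range(j + 1, k + kb):
                a[i][c] -= a[i][j] * a[j][c]      # update inside the panel only

def trsm_lower_unit(a, k, kb):
    """Solve L11 * U12 = A12 in place; L11 is the panel's unit lower triangle."""
    n = len(a)
    for c in range(k + kb, n):
        for i in range(k + 1, k + kb):
            for p in range(k, i):
                a[i][c] -= a[i][p] * a[p][c]

def gemm_update(a, k, kb):
    """Trailing matrix update A22 -= L21 * U12, in place."""
    n = len(a)
    for i in range(k + kb, n):
        for c in range(k + kb, n):
            for p in range(k, k + kb):
                a[i][c] -= a[i][p] * a[p][c]

def block_lu(matrix, nb):
    """Return packed LU factors (L below the diagonal, U on and above it)."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        getf2_nopiv(a, k, kb)
        trsm_lower_unit(a, k, kb)
        gemm_update(a, k, kb)
    return a

def multiply_factors(lu):
    """Rebuild L*U from the packed factors to check the result."""
    n = len(lu)
    return [[sum((lu[i][p] if p < i else 1) * lu[p][j]
                 for p in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]

A = [[4, 3, 2, 1], [3, 4, 3, 2], [2, 3, 4, 3], [1, 2, 3, 4]]
print(multiply_factors(block_lu(A, 2)) == A)
```

Most of the arithmetic lands in `gemm_update`, which is the reason blocked LU performs well and also the reason an error in the trailing matrix spreads quickly.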
![Page 6: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/6.jpg)
General work flow
(1) Generate checksum for the input matrix as additional columns
(2) Perform LU factorization WITH the additional checksum columns
(3) Solve Ax=b using LU from the factorization (even if a soft error occurs during LU factorization)
(4) Check for soft error
(5) Correct solution x
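Steps (1), (2) and (4) can be sketched on a small matrix. The sketch assumes two checksum columns, the row sums and a weighted row sum with weights `w`, and uses elimination without pivoting in exact arithmetic. The function names, the weights and the fault tuple `(step, i, j, delta)` are illustrative assumptions.

```python
from fractions import Fraction

def encode(a, w):
    """Step (1): append two checksum columns, A*e and A*w, to a copy of A."""
    return [[Fraction(x) for x in row]
            + [Fraction(sum(row)), Fraction(sum(x * wj for x, wj in zip(row, w)))]
            for row in a]

def eliminate(ac, n, fault=None):
    """Step (2): eliminate the n x (n+2) encoded matrix without pivoting.

    `fault` = (step, i, j, delta) adds delta to entry (i, j) right before
    elimination step `step`, mimicking a soft error.
    """
    ac = [row[:] for row in ac]
    for k in range(n):
        if fault and fault[0] == k:
            _, i, j, delta = fault
            ac[i][j] += delta
        for i in range(k + 1, n):
            m = ac[i][k] / ac[k][k]
            for c in range(k, n + 2):
                ac[i][c] -= m * ac[k][c]
    return ac

def checksum_residuals(uc, n, w):
    """Step (4): r = U*e - first checksum column, s = U*w - second one."""
    r = [sum(row[:n]) - row[n] for row in uc]
    s = [sum(x * wj for x, wj in zip(row[:n], w)) - row[n + 1] for row in uc]
    return r, s

A = [[4, 3, 2, 1], [3, 4, 3, 2], [2, 3, 4, 3], [1, 2, 3, 4]]
w = [1, 2, 3, 4]
n = len(A)

clean_r, clean_s = checksum_residuals(eliminate(encode(A, w), n), n, w)
bad_r, bad_s = checksum_residuals(
    eliminate(encode(A, w), n, fault=(1, 2, 3, 1)), n, w)

print("error detected in clean run: ", any(clean_r))
print("error detected in faulty run:", any(bad_r))
```

Because the checksum columns go through the same row operations as the matrix, they stay consistent with U as long as no error occurs, so a nonzero residual signals a fault.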
![Page 7: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/7.jpg)
Why is soft error hard to handle?
• Soft error occurs silently
• Propagation
![Page 8: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/8.jpg)
Example: Error propagation
Error location (using MATLAB notation and 1-based indexing)
Error strikes right before panel factorization of (41:200, 41:60):
Case 1: Error at (35,10), in L area
Case 2: Error at (50,120), in A’ area
Note: Pivoting to the left of the panel is delayed until after error detection and recovery, so that an error in the L area does not move
![Page 9: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/9.jpg)
Case 1: Non-propagating error
Error does not propagate in this case
![Page 10: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/10.jpg)
Case 2: Propagating error
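The two cases can be reproduced at small scale. The sketch below is an illustration under simplifying assumptions: a 5x5 diagonally dominant matrix, right-looking LU without pivoting, exact arithmetic, and a fault injected right before step 2 (0-based). The positions are scaled-down stand-ins for the ones on the slide.

```python
from fractions import Fraction

def lu_inplace(matrix, fault=None):
    """Right-looking LU without pivoting on a copy of `matrix`.

    L (unit diagonal) is stored below the diagonal, U on and above it.
    `fault` = (step, i, j, delta) adds delta to entry (i, j) right before
    elimination step `step`.
    """
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    for k in range(n - 1):
        if fault and fault[0] == k:
            _, i, j, delta = fault
            a[i][j] += delta
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for c in range(k + 1, n):
                a[i][c] -= a[i][k] * a[k][c]
    return a

def corrupted_entries(clean, faulty):
    """Count the entries of the packed factors that differ."""
    return sum(x != y for rc, rf in zip(clean, faulty) for x, y in zip(rc, rf))

A = [[10 if i == j else 1 for j in range(5)] for i in range(5)]
clean = lu_inplace(A)

# Case 1: error in the finished L area (row 1, column 0).
case1 = corrupted_entries(clean, lu_inplace(A, fault=(2, 1, 0, 1)))
# Case 2: error in the trailing matrix (row 3, column 4).
case2 = corrupted_entries(clean, lu_inplace(A, fault=(2, 3, 4, 1)))

print("case 1 corrupted entries:", case1)
print("case 2 corrupted entries:", case2)
```

In case 1 the damaged entry is never read again by the factorization, so it stays isolated. In case 2 the damaged entry feeds later updates, and the number of affected entries grows with the size of the remaining trailing matrix.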
![Page 11: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/11.jpg)
Soft error challenge
![Page 12: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/12.jpg)
Error modeling (for propagating error)
• When?
  • Answer: Doesn’t really matter
![Page 13: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/13.jpg)
Error modeling (for “where”)
$A_t = L_{t-1} P_{t-1} A_{t-1}$
Define an erroneous initial matrix
Input matrix
One step of LU
If no soft error occurs
If soft error occurs at step t
![Page 14: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/14.jpg)
Locate Error
Column j
![Page 15: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/15.jpg)
Recover Ax=b
• Sherman-Morrison formula
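For reference, the Sherman-Morrison formula gives the inverse of a rank-one update of an invertible matrix. The symbols $u$ and $v$ below are generic column vectors, not notation taken from the slides:

```latex
(A + u v^{T})^{-1} = A^{-1} - \frac{A^{-1} u \, v^{T} A^{-1}}{1 + v^{T} A^{-1} u},
\qquad \text{valid when } 1 + v^{T} A^{-1} u \neq 0 .
```

The practical consequence is that a solve with $A + u v^{T}$ needs only solves with $A$, so an existing factorization can be reused.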
![Page 16: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/16.jpg)
Recover Ax=b
Given:
To solve: $Ax = b$
![Page 17: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/17.jpg)
Recover Ax=b
![Page 18: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/18.jpg)
Recover Ax=b
Recall:
Therefore:
![Page 19: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/19.jpg)
Recover Ax=b
Sherman-Morrison
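A rank-one correction of this kind can be sketched as follows. The sketch assumes that the factorization in hand belongs to an erroneous matrix `a_tilde` that differs from the true matrix in a single column `j`, so the true matrix is `a_tilde + u * e_j^T`. The solver `solve` is a small Gauss-Jordan stand-in for the LU solve, and all names are hypothetical.

```python
from fractions import Fraction

def solve(m, rhs):
    """Gauss-Jordan solve without pivoting (stand-in for the LU-based solve)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(m, rhs)]
    for k in range(n):
        pivot = aug[k][k]
        aug[k] = [x / pivot for x in aug[k]]
        for i in range(n):
            if i != k:
                f = aug[i][k]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[k])]
    return [row[n] for row in aug]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def sherman_morrison_solve(a_tilde, u, v, b):
    """Solve (a_tilde + u v^T) x = b using only solves with a_tilde."""
    y = solve(a_tilde, b)          # solution of the erroneous system
    z = solve(a_tilde, u)
    denom = 1 + dot(v, z)
    if denom == 0:
        raise ZeroDivisionError("rank-one update makes the matrix singular")
    scale = dot(v, y) / denom
    return [yi - scale * zi for yi, zi in zip(y, z)]

# True matrix and right-hand side, with known solution [1, 2, 3].
A = [[4, 1, 0], [1, 5, 2], [0, 2, 6]]
b = [6, 17, 22]

# Erroneous matrix: entry (0, 1) was corrupted from 1 to 3.
A_tilde = [[4, 3, 0], [1, 5, 2], [0, 2, 6]]
j = 1
u = [A[i][j] - A_tilde[i][j] for i in range(3)]   # true column minus bad column
v = [1 if i == j else 0 for i in range(3)]        # e_j

x_wrong = solve(A_tilde, b)
x_fixed = sherman_morrison_solve(A_tilde, u, v, b)
print(x_wrong)
print(x_fixed)
```

The correction costs one extra triangular-solve pair and a few vector operations, which is far cheaper than repeating the factorization.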
![Page 20: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/20.jpg)
Recover Ax=b
![Page 21: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/21.jpg)
Recover Ax=b
![Page 22: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/22.jpg)
Recover Ax=b
Needs protection
![Page 23: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/23.jpg)
How to detect and recover from a soft error in L?
• The recovery of Ax=b requires a correct L
• L does not change once produced
  • Static checkpointing for L
• Delay pivoting on L to prevent checksum of L from being invalidated
![Page 24: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/24.jpg)
Checkpointing for L, idea 1
• PDGEMM based checkpointing
• Checkpointing time increases when scaled to more processes and larger matrices
NOT SCALABLE
![Page 25: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/25.jpg)
Checkpointing for L, idea 2
• Local checkpointing
• Each process checkpoints its own local data
• Constant checkpointing time
SCALABLE
![Page 26: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/26.jpg)
Encoding for L
• On each process, for a column of L
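The encoding itself did not survive in this transcript. As an illustration only, the sketch below uses a common weighted checksum scheme that a process could apply to its local part of a column of L: a plain sum and a position-weighted sum, which together locate and repair one corrupted entry. This scheme, the function names and the use of exact arithmetic are assumptions, not the encoding from the talk.

```python
from fractions import Fraction

def encode_column(col):
    """Return (plain sum, position-weighted sum) for one local column of L."""
    plain = sum(col)
    weighted = sum((i + 1) * x for i, x in enumerate(col))
    return plain, weighted

def check_and_repair(col, checksums):
    """Detect and fix at most one corrupted entry of `col`, in place.

    Returns the repaired index, or None if the column matches its checksums.
    """
    plain, weighted = checksums
    d1 = sum(col) - plain
    d2 = sum((i + 1) * x for i, x in enumerate(col)) - weighted
    if d1 == 0 and d2 == 0:
        return None
    if d1 == 0:
        raise ValueError("checksum mismatch is not a single-entry error")
    # For a single error of size d1 at index q, d2 equals (q + 1) * d1.
    pos = Fraction(d2) / Fraction(d1) - 1
    if pos.denominator != 1 or not 0 <= pos < len(col):
        raise ValueError("checksum mismatch is not a single-entry error")
    index = int(pos)
    col[index] -= d1
    return index

column = [Fraction(1, 2), Fraction(1, 3), Fraction(-1, 4), Fraction(2, 5)]
checksums = encode_column(column)        # computed once, when L is produced

damaged = list(column)
damaged[2] += 7                          # simulated soft error
repaired_index = check_and_repair(damaged, checksums)
print(repaired_index, damaged == column)
```

Since each process touches only its own data, the cost does not grow with the number of processes, which matches the constant checkpointing time claimed for local checkpointing. In floating point the comparisons would need a tolerance.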
![Page 27: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/27.jpg)
Kraken Performance
Two 2.6 GHz six-core AMD Opteron processors per node
32x32 MPI processes, 6 threads per process (one per core); 6,144 cores used in total
![Page 28: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/28.jpg)
Kraken Performance
Two 2.6 GHz six-core AMD Opteron processors per node
64x64 MPI processes, 6 threads per process (one per core); 24,576 cores used in total
![Page 29: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/29.jpg)
![Page 30: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/30.jpg)
• Backup slides
![Page 31: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/31.jpg)
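Locate Error
(Figure: $A$ extended to $A_c = [A,\ Ae,\ Aw]$; factorization $P A_c = L U_c$ with $U_c = [U,\ Ue,\ Uw]$; remaining figure labels: c, v, L, U)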
![Page 32: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/32.jpg)
![Page 33: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/33.jpg)
Locate Error
• $w_j$ is the $j$th element of the vector $w$ in the generator matrix
• Component-wise division of $s$ by $r$ reveals $w_j$
• Searching for $w_j$ in $w$ reveals the column of the initial soft error
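The lookup described in these bullets can be sketched as a small function. It assumes that `r` and `s` are the mismatches of the two checksum columns (unweighted and weighted by `w`), that the weights in `w` are distinct, and that arithmetic is exact. The sample residuals are hypothetical values for illustration.

```python
from fractions import Fraction

def locate_error_column(r, s, w):
    """Return the 0-based column of the initial error, or None if r is zero.

    Every nonzero component of r must give the same ratio s_i / r_i, and
    that ratio must be one of the weights in w.
    """
    ratios = {Fraction(si) / Fraction(ri) for ri, si in zip(r, s) if ri != 0}
    if not ratios:
        return None
    if len(ratios) != 1:
        raise ValueError("residuals are not consistent with one erroneous column")
    (wj,) = ratios
    return w.index(wj)      # raises ValueError if the ratio is not a weight

w = [1, 2, 3, 4]
r = [0, 0, 1, Fraction(-5, 6)]
s = [0, 0, 4, Fraction(-10, 3)]
column = locate_error_column(r, s, w)
print(column)
```

The ratio is the same in every affected row because the error scales both checksum mismatches by the same row-dependent factor, leaving only $w_j$ after the division. With floating point data the ratio would have to be matched to the nearest weight within a tolerance.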
![Page 34: Ieee cluster talk](https://reader035.vdocuments.us/reader035/viewer/2022081521/58ee474d1a28ab02188b467b/html5/thumbnails/34.jpg)
Extra Storage
• For an input matrix of size M×N on a P×Q process grid
  • A copy of the original matrix
    • Not necessary when it’s easy to re-generate the required column of the original matrix
  • 2 additional columns: $2 \times M$
  • Each process has 2 rows: $2 \times \frac{N}{Q}$, in total $2 \times P \times N$

$r = \frac{\text{extra storage}}{\text{matrix storage}} = \frac{2M + 2PN}{MN} = \frac{2}{N} + \frac{2P}{M}$
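The storage overhead is easy to evaluate for a concrete configuration. The sketch below counts the two checksum columns (2M entries) and the two local rows per process (2PN entries in total) relative to the M×N matrix, and leaves out the optional copy of the original matrix. The matrix size used is a hypothetical example, not a figure from the talk.

```python
from fractions import Fraction

def extra_storage_ratio(m, n, p):
    """Extra checksum storage divided by matrix storage for an m x n matrix
    on a process grid with p process rows."""
    extra = 2 * m + 2 * p * n      # 2 checksum columns + 2 rows per process
    return Fraction(extra, m * n)

# Hypothetical example: a 100,000 x 100,000 matrix on a 64 x 64 grid.
ratio = extra_storage_ratio(100_000, 100_000, 64)
print(ratio, float(ratio))
```

For a fixed grid the overhead shrinks as the matrix grows, since both terms are inversely proportional to a matrix dimension.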