![Page 1: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/1.jpg)
Process recovery
• Restore computing point
• New MPI_COMM_WORLD by FT-MPI
• Detect the failed member
• Rebuild process id array
• Rebuild worker comm
![Page 2: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/2.jpg)
Communicator Issue
• Working communicator
• Used by ScaLAPACK and BLACS
• Standby communicator
• Used by standby processes
• MPI_COMM_WORLD
• Checkpointing
![Page 3: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/3.jpg)
Who is where?
April 21, 2023 3
![Page 4: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/4.jpg)
Where is the lost data?
• Diskless Checkpointing and Checksum
• Participation
• Checksum: MPI_Reduce(…,MPI_SUM…)
Figure: 14x14 matrix, nb=2, 2x3 grid, global view; standby process, local view
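
The checksum idea on this slide can be sketched without MPI. The snippet below is a minimal simulation, not the presenter's code: each "process" is a plain list, `checksum` stands in for what an MPI_Reduce with MPI_SUM would leave on the standby process, and the names `checksum`, `recover`, and `local_blocks` are hypothetical. It assumes a single lost process and equally sized local blocks.

```python
# Simulated diskless checksum checkpointing (runs without MPI).
# Each worker owns one flattened local block; the standby process stores
# the element-wise sum, as a sum reduction over the workers would.

def checksum(blocks):
    """Element-wise sum of equally sized local blocks."""
    return [sum(values) for values in zip(*blocks)]

def recover(checkpoint, surviving_blocks):
    """Rebuild the single lost block as checkpoint minus the survivors."""
    partial = checksum(surviving_blocks)
    return [c - p for c, p in zip(checkpoint, partial)]

# Six workers (a 2x3 grid), each holding four values.
local_blocks = [[rank * 10 + j for j in range(4)] for rank in range(6)]
checkpoint = checksum(local_blocks)

# Worker 4 fails; its data is rebuilt from the checkpoint and the survivors.
lost_rank = 4
survivors = [b for rank, b in enumerate(local_blocks) if rank != lost_rank]
rebuilt = recover(checkpoint, survivors)
```

Integers keep the example exact; with floating-point data the rebuilt block would match only up to rounding error.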
![Page 5: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/5.jpg)
Restart computation
• Reverse computation
$C_{i-1} = C_i - A_i B_i^T$
Figure labels: C, A, B, i
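
Reverse computation can be illustrated with a small sketch. It assumes the forward step is the rank update $C_i = C_{i-1} + A_i B_i^T$, where $A_i$ and $B_i$ are taken to be single columns; the helpers `outer`, `add`, `forward`, and `reverse` are hypothetical names, not routines from ScaLAPACK or FT-MPI.

```python
# Forward rank-1 updates and their reversal, in plain Python.

def outer(a, b):
    """Outer product a * b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def add(C, U, sign=1):
    """Return C + sign * U, element by element."""
    return [[c + sign * u for c, u in zip(c_row, u_row)]
            for c_row, u_row in zip(C, U)]

def forward(C, A_cols, B_cols, start, stop):
    """Apply updates start, ..., stop-1: C += A_i * B_i^T."""
    for i in range(start, stop):
        C = add(C, outer(A_cols[i], B_cols[i]))
    return C

def reverse(C, A_cols, B_cols, current, target):
    """Undo updates current-1, ..., target: C -= A_i * B_i^T."""
    for i in range(current - 1, target - 1, -1):
        C = add(C, outer(A_cols[i], B_cols[i]), sign=-1)
    return C

A_cols = [[1, 2], [3, 4], [5, 6]]
B_cols = [[1, 0], [2, 1], [0, 3]]
zero = [[0, 0], [0, 0]]

C_healthy = forward(zero, A_cols, B_cols, 0, 2)       # state after 2 steps
C_current = forward(C_healthy, A_cols, B_cols, 2, 3)  # one more step
C_restored = reverse(C_current, A_cols, B_cols, 3, 2)  # roll back that step
```

With integer data the rollback is exact. In floating point, subtracting an update does not in general restore the earlier bits, so a real implementation would have to tolerate rounding differences.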
![Page 6: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/6.jpg)
Restart computation: Failure
Figure label: f
![Page 7: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/7.jpg)
Restart computation
$C_{l-1} = C_l - A_k B_k^T$
Figure labels: i, f, k
![Page 8: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/8.jpg)
Checkpointing Performance
![Page 9: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/9.jpg)
Recovery Performance
![Page 10: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/10.jpg)
Optimization Attempt
• Too many MPI_Reduce calls on small data blocks
• Before: 12 MPI_Reduce, 49+ memcpy
• After: 4 MPI_Reduce, 0 memcpy
• MPI user-defined datatype
• MPI user-defined op in commutative mode
![Page 11: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/11.jpg)
Issues with FT-MPI
• 14x14, nb=2, pxq=1x4, ok
• 140x140, nb=20, pxq=1x4 =>
“snipe_lite.c:490 Problem, connection on socket [19] has failed.
This connect has been closed.”
The problem disappears when all 1x4 processes run on one node
• 1400x1400 =>
“conn 5 chan 4 pending send before flctrl 9999 = 1”
![Page 12: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/12.jpg)
Some ideas from the workshop
• Asynchronous checkpointing (if I don't have a failure, why do I need to stop and checkpoint?)
• Variable checkpointing interval, because failures tend to occur toward the end; reduces checkpointing overhead (t = sqrt(2 * (time to save state) * AMTTI))
• Silent data corruption:
• How do we know we are getting the right answer on a large system and a large problem?
• Different failure model
• Failure guard system
• How to handle failures on hybrid systems (FT for Stan's MAGMA code)
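
The interval formula in the second bullet is easy to evaluate. The helper below is a small sketch with a hypothetical name, `checkpoint_interval`; the slide does not define AMTTI, so it is treated here as a mean time between interrupts expressed in the same unit as the save time (an assumption).

```python
import math

def checkpoint_interval(save_time, amtti):
    """t = sqrt(2 * save_time * amtti), both arguments in the same time unit.

    amtti is assumed to be a mean time between interrupts; the slide
    does not spell the acronym out.
    """
    if save_time <= 0 or amtti <= 0:
        raise ValueError("save_time and amtti must be positive")
    return math.sqrt(2.0 * save_time * amtti)

# Example: 2 minutes to save state, 100 minutes between interrupts.
interval = checkpoint_interval(2.0, 100.0)
```

Because the interval grows only with the square root of AMTTI, a system that is four times more reliable checkpoints half as often.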
![Page 13: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/13.jpg)
Init Checkpointing
Before the loop starts
![Page 14: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/14.jpg)
Rolling
i>=0
![Page 15: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/15.jpg)
Rolling
i==k
?
![Page 16: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/16.jpg)
When something goes wrong silently
C
![Page 17: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/17.jpg)
Recovery
![Page 18: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/18.jpg)
Shadow backup
![Page 19: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/19.jpg)
Algorithm
• At every step i:
• Do the rank-k update
• Every K steps:
• Check for silent errors by comparing the checksum held on the checkpointing processes with one freshly computed by reduction
• On the checkpointing processes, shadow the checkpoints
• On a silent error:
• Identify the ill process
• Checkpointing processes roll back to the shadowed checkpoints
• Kill the ill process, spawn a new one, and put it in the grid
• Surviving processes reverse their computation to the last healthy step
• Recover the data on the new process through the checksum
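
The detection half of this algorithm can be sketched as a single-process simulation. It is an illustration under stated assumptions, not the presenter's implementation: workers are plain lists, each step adds a per-worker increment, the checkpointing process carries its checksum forward by applying the sum of those increments, and `run` and `fresh_checksum` are hypothetical names. Identifying, killing, and respawning the ill process are left out.

```python
def fresh_checksum(blocks):
    """Element-wise sum over all workers, as a sum reduction would give."""
    return [sum(values) for values in zip(*blocks)]

def run(blocks, updates, K, corrupt=None):
    """Run the steps; report where a silent error is detected, else None.

    corrupt is an optional (step, rank) pair at which one value on that
    worker is silently altered.
    """
    blocks = [list(b) for b in blocks]
    carried = fresh_checksum(blocks)           # held by the checkpoint process
    shadow_step, shadow = 0, list(carried)     # last verified checksum
    for step, per_rank in enumerate(updates, start=1):
        # Every worker applies its own update.
        for rank, delta in enumerate(per_rank):
            blocks[rank] = [x + d for x, d in zip(blocks[rank], delta)]
        # The checkpoint process applies the matching update to its checksum.
        carried = [c + d for c, d in zip(carried, fresh_checksum(per_rank))]
        if corrupt is not None and corrupt[0] == step:
            blocks[corrupt[1]][0] += 1         # simulated silent corruption
        if step % K == 0:
            if carried != fresh_checksum(blocks):
                return {"detected_at": step,
                        "roll_back_to": shadow_step,
                        "checksum": shadow}
            shadow_step, shadow = step, list(carried)  # shadow the checkpoint
    return None

blocks = [[1, 2], [3, 4], [5, 6]]
updates = [[[1, 1], [1, 1], [1, 1]] for _ in range(6)]
clean = run(blocks, updates, K=2)
faulty = run(blocks, updates, K=2, corrupt=(3, 1))
```

In the faulty run the corruption at step 3 is caught at the next check, step 4, and the shadow copy from step 2 is the state to roll back to.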
![Page 20: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/20.jpg)
Question
• Are we using too many processes for checkpointing?
• What if a checkpointing process fails while a working process has silently gone wrong?
![Page 21: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/21.jpg)
• Thursday, April 23, 2009
![Page 22: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/22.jpg)
STRSM for GPU
• Why?
• Needed in hybrid routines such as sgetrf
• The CUBLAS version is too slow
• So currently we have to transfer data back from the GPU to the CPU to do strsm, which wastes time
![Page 23: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/23.jpg)
Design -1
1 2 3
4 5 6
A x b
![Page 24: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/24.jpg)
Improvement
• Strsm_kernel is always on the critical path
• 3 strsm_kernel = 3 strtri_kernel + 3 strmv_kernel
• All 3 strtri_kernel can be done in one shot on GPU
• So now 3 strsm_kernel > 1 strtri_kernel + 3 strmv_kernel
• Critical path is reduced
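
The identity behind this improvement, a triangular solve replaced by a triangular inversion followed by a matrix-vector product, can be checked on a small example. The sketch below runs on the CPU in plain Python with exact fractions; `trsv_lower`, `trtri_lower`, and `trmv` are hypothetical stand-ins for the kernels named on the slide, and a lower-triangular matrix is assumed.

```python
from fractions import Fraction

def trsv_lower(L, b):
    """Forward substitution: solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(L)):
        s = sum(L[i][j] * x[j] for j in range(i))
        x.append((b[i] - s) / L[i][i])
    return x

def trtri_lower(L):
    """Invert lower-triangular L by solving against each unit vector."""
    n = len(L)
    cols = [trsv_lower(L, [Fraction(int(i == k)) for i in range(n)])
            for k in range(n)]
    # cols[k] is column k of the inverse; return it in row-major form.
    return [[cols[k][i] for k in range(n)] for i in range(n)]

def trmv(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

L = [[Fraction(2), Fraction(0), Fraction(0)],
     [Fraction(1), Fraction(3), Fraction(0)],
     [Fraction(4), Fraction(5), Fraction(6)]]
b = [2, 7, 32]

x_direct = trsv_lower(L, b)                # the strsm-style solve
x_via_inverse = trmv(trtri_lower(L), b)    # inversion, then multiply
```

Both routes give the same solution. The inversions of separate diagonal blocks do not depend on one another, which is what lets them be grouped into one launch while only the multiplies stay on the critical path.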
![Page 25: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/25.jpg)
Design -2
1 2 3
4 5
……
![Page 26: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/26.jpg)
![Page 27: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/27.jpg)
![Page 28: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/28.jpg)
A few thoughts
What if we could:
(1) Reuse PBLAS/ScaLAPACK
(2) Use as few checkpointing processes as possible
Julien & George's code is:
(1) Trying to build PBLAS/ScaLAPACK from scratch to enable FT
(2) Using m+n extra processes to babysit m*n processes
![Page 29: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/29.jpg)
Checksum
![Page 30: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/30.jpg)
Multiplication
A x B = C
![Page 31: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/31.jpg)
Checkpointing
(1) Reuses PBLAS/ScaLAPACK
(2) Requires 1 extra process for 6 processes, instead of 5 for 6
![Page 32: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/32.jpg)
Detect errors
![Page 33: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/33.jpg)
Locate sick process
• What could go wrong?
• Network card -> sending out wrong data via MPI
• Memory -> doing computation on wrong data
• ALU -> giving wrong result
• Disk -> providing wrong data
• etc…
• The old checksum is good for detecting errors, but is not process-specific
• We could:
• Recompute the local checksum
• Keep a local-only checksum, like the sum of all local matrices
• Or …
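
The local-only checksum option can be sketched as follows. This is one possible reading of the bullet, not a confirmed design: each process keeps the sum of all entries of its local matrix, and recomputing that sum later points at the process whose data changed. `local_checksum` and `find_sick` are hypothetical names, and the stored sums are assumed to be kept where the faulty process cannot corrupt them.

```python
def local_checksum(block):
    """Sum of all entries of one process's local matrix."""
    return sum(sum(row) for row in block)

def find_sick(blocks, stored):
    """Ranks whose recomputed local checksum differs from the stored one."""
    return [rank for rank, block in enumerate(blocks)
            if local_checksum(block) != stored[rank]]

# Four processes, each with a 2x2 local matrix.
blocks = [[[rank, rank + 1], [rank + 2, rank + 3]] for rank in range(4)]
stored = [local_checksum(block) for block in blocks]

healthy = find_sick(blocks, stored)

# Process 2 silently changes one value.
blocks[2][0][1] += 5
sick = find_sick(blocks, stored)
```

A plain sum misses corruptions that cancel out within one process, so this locates a sick process only for errors that change the local total.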
![Page 34: Process recovery](https://reader035.vdocuments.us/reader035/viewer/2022062315/568151e2550346895dc01b97/html5/thumbnails/34.jpg)
A little update with clapack
• Keith's f2c applied to LAPACK 3.2
• Problems solved:
• Substring issue with maxloc
• Malloc on the fly problem
• Status
• Passing the LAPACK test suite
• Extended to LAPACK 3.2.1