process recovery
Process recovery
• Restore computing point
• New MPI_COMM_WORLD by FT-MPI
• Detect failure member
• Rebuild process id array
• Rebuild worker comm
Communicator Issue
• Working communicator
• Used by ScaLAPACK and BLACS
• Standby communicator
• Used by standby processes
• MPI_COMM_WORLD
• Checkpointing
Who is where?
April 21, 2023 3
Where is the lost data?
• Diskless Checkpointing and Checksum
• Participation
• Checksum: MPI_Reduce(…,MPI_SUM…)
[Figure: 14x14 matrix, nb=2, 2x3 grid, global view; standby process, local view]
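A minimal sketch of the checksum idea, assuming a single failure and a sum-based checksum. No MPI is used: each "process" is simulated by a plain list, `checksum` stands in for what MPI_Reduce with MPI_SUM would deliver to the standby process, and all names and data are hypothetical.

```python
# Simulated diskless checkpointing: no MPI is used here. Each "process" is
# just a list of numbers standing in for its local matrix block.

def checksum(blocks):
    """Element-wise sum of all blocks (the role of MPI_Reduce with MPI_SUM)."""
    return [sum(column) for column in zip(*blocks)]

def recover(checkpoint, surviving_blocks):
    """Rebuild the single lost block: checksum minus the survivors' sum."""
    partial = checksum(surviving_blocks)
    return [c - p for c, p in zip(checkpoint, partial)]

# Hypothetical local data for a 2x3 grid (6 working processes).
local_blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
                [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
checkpoint = checksum(local_blocks)      # held by the standby process

failed_rank = 4                          # assumed: exactly one process is lost
survivors = [b for r, b in enumerate(local_blocks) if r != failed_rank]
restored = recover(checkpoint, survivors)
```

With real floating-point data the recovered block can differ from the original by round-off, since it is obtained by subtraction.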
Restart computation
• Reverse computation
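[Equation: update relating two consecutive iterates of C through A_i and B_i^T; signs and operators are missing from the transcript]
[Figure: C, A and B at step i]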
Restart computation: Failure
Restart computation
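[Equation and figure: update relating two consecutive iterates of C (index l) through A_k and B_k^T, with steps i, k and f marked; signs and operators are missing from the transcript]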
Checkpointing Performance
Recovery Performance
Optimization Attempt
• Too many MPI_Reduce on small data blocks
• 12 MPI_Reduce, 49+ memcpy
• 4 MPI_Reduce, 0 memcpy, MPI user-defined datatype, MPI user-defined op in commutative mode
Issues with FT-MPI
• 14x14, nb=2, pxq=1x4, ok
• 140x140, nb=20, pxq=1x4 =>
“snipe_lite.c:490 Problem, connection on socket [19] has failed.
This connect has been closed.”
The problem disappears if all 1x4 processes are on one node
• 1400x1400 =>
“conn 5 chan 4 pending send before flctrl 9999 = 1”
Some ideas from the workshop
• Asynchronous checkpointing (if I don't have a failure, why do I need to stop and do checkpointing?)
• Variable checkpointing interval, because failures tend to occur toward the end of a run; this reduces checkpointing overhead (t = sqrt(2 * (time to save state) * AMTTI))
• Silent data corruption:
• How do we know we are getting the right answer on a large system and a large problem?
• Different failure model
• Failure guard system
• How to handle failures on hybrid systems (FT for Stan's MAGMA code)
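The interval formula in the list above has the form of Young's first-order approximation and can be evaluated directly. The sketch below assumes both quantities are given in the same time unit and reads AMTTI as the application mean time to interrupt; the function name and the sample numbers are hypothetical.

```python
import math

def checkpoint_interval(save_time, amtti):
    """t = sqrt(2 * save_time * AMTTI); both arguments in the same time unit."""
    if save_time <= 0 or amtti <= 0:
        raise ValueError("save_time and amtti must be positive")
    return math.sqrt(2.0 * save_time * amtti)

# Hypothetical numbers: 2 s to save the state, 3600 s mean time to interrupt.
interval = checkpoint_interval(2.0, 3600.0)
```

A variable interval would amount to re-evaluating this function as the estimate of AMTTI changes during the run.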
Init Checkpointing
Before the loop starts
Rolling
i>=0
Rolling
i==k
?
When something goes wrong silently
Recovery
Shadow backup
Algorithm
• At every step i:
• Do the rank-k update
• Every K steps:
• Check for silent errors by comparing the checksum held on the checkpointing processes with one freshly computed by reduction
• On the checkpointing processes, take shadow copies of the checkpoints
• On a silent error:
• Identify the ill process
• Checkpointing processes roll back to the shadowed checkpoints
• Kill the ill process, spawn a new one, and put it in the grid
• Surviving processes reverse-compute back to the last healthy step
• Recover the data on the new process through checksum
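The steps above can be sketched as a toy simulation. This is an illustration under strong assumptions, not the authors' implementation: there is no MPI and no real process spawning, each "process" owns a single integer instead of a matrix block, the ill rank is taken as known, and integers are used so that reverse computation is exact. All names are hypothetical.

```python
# Toy simulation with plain Python integers; no MPI, no real process spawn.
# Each "process" p owns one number c[p]; step i adds a[p][i] * b[i] to it.

def run(a, b, K, corrupt_step=None, corrupt_rank=None):
    """Return the final local values; one silent error may be injected."""
    nproc, nsteps = len(a), len(b)
    c = [0] * nproc                 # local data on the working processes
    chk = 0                         # checksum held by the checkpoint process
    shadow, shadow_step = 0, 0      # last verified checksum and its step
    i = 0
    while i < nsteps:
        # Update every worker, and update the checksum in the same way.
        for p in range(nproc):
            c[p] += a[p][i] * b[i]
        chk += sum(a[p][i] for p in range(nproc)) * b[i]
        if i == corrupt_step:       # inject a single silent error
            c[corrupt_rank] += 1000
            corrupt_step = None
        i += 1
        if i % K != 0 and i != nsteps:
            continue                # not a verification step
        if sum(c) == chk:           # fresh reduction agrees: take a shadow
            shadow, shadow_step = chk, i
            continue
        # Silent error detected. The ill rank is assumed to be known here.
        ill = corrupt_rank
        chk = shadow                # checkpoint rolls back to its shadow
        for p in range(nproc):      # survivors reverse their updates
            if p != ill:
                for j in range(i - 1, shadow_step - 1, -1):
                    c[p] -= a[p][j] * b[j]
        # The replacement process recovers its data from the checksum.
        c[ill] = chk - sum(c[p] for p in range(nproc) if p != ill)
        i = shadow_step             # resume from the last healthy step
    return c

# Hypothetical data: 3 processes, 6 steps, verification every 2 steps.
a = [[1, 2, 3, 4, 5, 6], [2, 0, 1, 3, 1, 2], [5, 5, 5, 1, 1, 1]]
b = [1, 2, 3, 4, 5, 6]
clean = run(a, b, K=2)
repaired = run(a, b, K=2, corrupt_step=3, corrupt_rank=1)
```

In floating point the reverse computation would only restore the healthy state up to round-off, so the comparison against the checksum would need a tolerance.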
Question
• Are we using too many processes for checkpointing?
• What if a checkpointing process starts to fail when a working process has already gone wrong silently?
• Thursday, April 23, 2009
STRSM for GPU
• Why?
• Needed in hybrid routines like sgetrf
• The one from cublas is too slow
• So currently we have to transfer data back from the GPU to the CPU to do strsm, which wastes time
Design -1
[Figure: A x b, with blocks numbered 1 to 6]
Improvement
• Strsm_kernel is always on the critical path
• 3 strsm_kernel = 3 strtri_kernel + 3 strmv_kernel
• All 3 strtri_kernel can be done in one shot on GPU
• So now 3 strsm_kernel > 1 strtri_kernel + 3 strmv_kernel
• Critical path is reduced
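A minimal CPU-side sketch of the idea above, assuming a lower-triangular system with a single right-hand-side vector: every diagonal block is inverted first (the strtri role, independent per block), after which the sequential sweep needs only matrix-vector products (the strmv role). It uses exact fractions and plain lists, not CUDA, and the function names are hypothetical.

```python
from fractions import Fraction

def invert_lower(L):
    """Invert a small lower-triangular block (the strtri role)."""
    n = len(L)
    inv = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = Fraction(1) / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -s / L[i][i]
    return inv

def matvec(M, v):
    """Dense matrix-vector product (the strmv role)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def blocked_solve(L, b, nb):
    """Solve L x = b for lower-triangular L using block size nb."""
    n = len(L)
    starts = range(0, n, nb)
    # Phase 1: invert every diagonal block. The inversions are independent,
    # so on a GPU they could all be launched together.
    inv_diag = {s: invert_lower([row[s:s + nb] for row in L[s:s + nb]])
                for s in starts}
    # Phase 2: the sequential sweep now needs only matrix-vector products.
    x = [Fraction(0)] * n
    for s in starts:
        e = min(s + nb, n)
        rhs = [b[i] - sum(L[i][j] * x[j] for j in range(s))
               for i in range(s, e)]
        x[s:e] = matvec(inv_diag[s], rhs)
    return x

# Hypothetical 4x4 system whose exact solution is [1, 2, 3, 4].
L = [[2, 0, 0, 0], [1, 1, 0, 0], [3, 2, 4, 0], [1, 0, 1, 2]]
b = [2, 3, 19, 12]
x = blocked_solve(L, b, nb=2)
```

Explicit inversion of the diagonal blocks trades some numerical robustness for parallelism, which is worth keeping in mind for ill-conditioned blocks.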
Design -2
[Figure: blocks numbered 1 to 5 ……]
A few thoughts
What if we could:
(1) Reuse pblas/scalapack
(2) Use as few checkpointing processes as possible
Julien & George’s code is:
(1) Trying to build pblas/scalapack from scratch to enable FT
(2) Using m+n extra processes to babysit m*n processes
Checksum
Multiplication
A x B = C
Checkpointing
(1) Get to reuse Pblas/scalapack
(2) Requires 1 extra process for 6 processes, instead of 5 for 6
Detect errors
Locate sick process
• What could go wrong?
• Network card -> sending out wrong data via MPI
• Memory -> doing computation on wrong data
• ALU -> giving wrong result
• Disk -> providing wrong data
• etc…
• The old checksum is good for detecting errors, but it is not process-specific
• We could:
• Recompute the local checksum
• Keep a local-only checksum, like the sum of all local matrices
• Or …
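The local-only checksum option listed above can be sketched as follows, assuming each process stores the sum of its local entries while the data is known to be good and recomputes it later. The simulation uses integers and plain lists; the function names and data are hypothetical, and floating-point data would need a tolerance in the comparison.

```python
def local_checksum(block):
    """Sum of every entry of a process's local matrix block."""
    return sum(sum(row) for row in block)

def find_sick(blocks, stored_sums):
    """Return the ranks whose recomputed local checksum no longer matches."""
    return [rank for rank, block in enumerate(blocks)
            if local_checksum(block) != stored_sums[rank]]

# Hypothetical integer data for three processes.
blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
stored_sums = [local_checksum(b) for b in blocks]

blocks[1][0][1] += 64          # simulate a silent memory error on rank 1
sick = find_sick(blocks, stored_sums)
```

A check of this kind only catches corruption of data at rest; an error introduced during the computation itself would also corrupt any checksum derived from the result.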
A little update with clapack
• Keith’s f2c applied to lapack3.2
• Problems solved:
• Substring issue with maxloc
• Malloc on the fly problem
• Status
• Passing the LAPACK test suite
• Extended to lapack3.2.1