process recovery

Post on 23-Jan-2016

TRANSCRIPT

Process recovery

• Restore computing point

• New MPI_COMM_WORLD by FT-MPI

• Detect the failed member

• Rebuild the process ID array

• Rebuild the worker communicator

Communicator Issue

• Working communicator

• Used by ScaLAPACK and BLACS

• Standby communicator

• Used by standby processes

• MPI_COMM_WORLD

• Checkpointing

Who is where?

April 21, 2023 3

Where is the lost data?

• Diskless Checkpointing and Checksum

• Participation

• Checksum: MPI_Reduce(…,MPI_SUM…)

14x14 matrix, nb=2, 2x3 grid, global view

Standby process, local view
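The checksum idea can be illustrated without MPI. The sketch below is only a simulation under stated assumptions: each worker's local block is a short list, the standby process holds their element-wise sum (the result an MPI_Reduce with MPI_SUM would deliver), and exactly one worker is lost. The names `checksum` and `recover` are hypothetical and do not come from the slides.

```python
# Simulated diskless checkpointing. Every "process" is just a list here;
# checksum() stands in for MPI_Reduce(..., MPI_SUM, ...) onto the standby process.

def checksum(blocks):
    """Element-wise sum of equally sized local blocks."""
    return [sum(column) for column in zip(*blocks)]

def recover(checkpoint, surviving_blocks):
    """Rebuild the single lost block from the checkpoint and the survivors."""
    partial = checksum(surviving_blocks)
    return [c - p for c, p in zip(checkpoint, partial)]

local_blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three worker processes
checkpoint = checksum(local_blocks)                   # kept by the standby process

failed = 1                                            # worker 1 is lost
survivors = [b for rank, b in enumerate(local_blocks) if rank != failed]
restored = recover(checkpoint, survivors)             # data of the lost worker
```

A single sum can restore only one lost block per checkpoint; tolerating more simultaneous failures would need additional, independent checksums.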

Restart computation

• Reverse computation

$C_{i+1} = C_i + A_i B_i^T$

[Figure: C, A and B at step i]

Restart computation: Failure

[Figure: failure marked f]

Restart computation

$C_{l-1} = C_l - A_k B_k^T$

[Figure: steps marked i, failure marked f]
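Reverse computation can be sketched as undoing updates in the opposite order. The example below assumes the forward step is an additive update of C by a product of one column of A with one column of B (rank-1 for brevity, where the slides use a rank-k update), so subtracting the same product restores the previous state. All function names are hypothetical.

```python
def outer_update(C, a, b, sign):
    """Return C + sign * a b^T for vectors a and b given as lists."""
    return [[C[r][c] + sign * a[r] * b[c] for c in range(len(b))]
            for r in range(len(a))]

def forward(C, A_cols, B_cols):
    """Apply one additive update per (a, b) pair, oldest first."""
    for a, b in zip(A_cols, B_cols):
        C = outer_update(C, a, b, +1)
    return C

def reverse(C, A_cols, B_cols, steps):
    """Undo the last `steps` updates, newest first."""
    pairs = list(zip(A_cols, B_cols))
    for a, b in pairs[::-1][:steps]:
        C = outer_update(C, a, b, -1)
    return C

A_cols = [[1, 2], [3, 4], [5, 6]]   # three update steps
B_cols = [[1, 0], [0, 1], [1, 1]]
C0 = [[0, 0], [0, 0]]

C3 = forward(C0, A_cols, B_cols)            # state after step 3
C2 = reverse(C3, A_cols, B_cols, steps=1)   # back to the state after step 2
```

With integer data the reversal is exact; with floating-point data the subtraction can leave rounding differences, which is a property of this sketch and not a claim about the original implementation.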

Checkpointing Performance

Recovery Performance

Optimization Attempt

• Too many MPI_Reduce calls on small data blocks

• 12 MPI_Reduce

• 49+ memcpy

• 4 MPI_Reduce

• 0 memcpy

• MPI user-defined datatype

• MPI user-defined op in commutative mode

Issues with FT-MPI

• 14x14, nb=2, pxq=1x4, ok

• 140x140, nb=20, pxq=1x4 =>

“snipe_lite.c:490 Problem, connection on socket [19] has failed.

This connect has been closed.”

The problem disappears if all 1x4 processes run on one node

• 1400x1400 =>

“conn 5 chan 4 pending send before flctrl 9999 = 1”

Some ideas from the workshop

• Asynchronous checkpointing (if I don't have a failure, why do I need to stop and checkpoint?)

• Variable checkpointing interval, because failures tend to occur toward the end; this reduces checkpointing overhead (t = sqrt(2 * (time to save state) * AMTTI))

• Silent data corruption:

• how do we know we are getting the right answer on a large system with a large problem?

• different failure model

• failure guard system

• how to do it for hybrid system failures (FT for Stan's MAGMA code)
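The interval formula quoted in the list above, t = sqrt(2 * (time to save state) * AMTTI), has the same form as the well-known first-order approximation attributed to Young. A minimal sketch of evaluating it follows; it assumes AMTTI is a mean time to interrupt given in the same time unit as the save time, and the function name is hypothetical.

```python
import math

def checkpoint_interval(save_time, amtti):
    """t = sqrt(2 * (time to save state) * AMTTI); both inputs share one time unit."""
    if save_time <= 0 or amtti <= 0:
        raise ValueError("save_time and amtti must be positive")
    return math.sqrt(2.0 * save_time * amtti)

# Example with made-up numbers: 2 time units per checkpoint,
# 100 time units between interrupts on average.
interval = checkpoint_interval(save_time=2.0, amtti=100.0)
```

The formula makes the trade-off visible: a cheaper checkpoint or a less reliable system both shorten the recommended interval.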

Init Checkpointing

Before the loop starts

Rolling

i>=0

Rolling

i==k?

When something goes wrong silently

Recovery

Shadow backup

Algorithm

• At every step i:

• Do the rank-k update

• At every K steps:

• Check for silent errors by comparing the checksum held on the checkpointing processes with one freshly computed by reduction

• On the checkpointing processors, shadow the checkpoints

• On a silent error:

• Identify the ill process

• Checkpoint processors roll back to shadowed checkpoints

• Kill the ill process, spawn a new one, and put it in the grid

• Surviving processes reverse the computation to the last healthy step

• Recover the data on the new process through checksum
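The detection part of the algorithm above can be simulated in a few lines. This is a sketch under stated assumptions, not the authors' code: the "rank-k update" is replaced by adding 1.0 to every entry, the checkpointing process applies the matching update to its checksum, and the fresh checksum is computed by a plain Python sum that stands in for a reduction. The name `simulate` and its parameters are hypothetical.

```python
def reduce_sum(blocks):
    """Element-wise sum over all worker blocks (stands in for a reduction)."""
    return [sum(col) for col in zip(*blocks)]

def simulate(steps, K, corrupt_at=None):
    """Return (step at which a silent error is detected or None, shadow checkpoint)."""
    workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    checksum = reduce_sum(workers)   # maintained on the checkpointing process
    shadow = list(checksum)          # copy taken at the last clean check
    for i in range(1, steps + 1):
        # Stand-in for the update: every worker adds 1.0 to each entry and
        # the checkpointing process applies the matching update to the checksum.
        workers = [[x + 1.0 for x in w] for w in workers]
        checksum = [c + len(workers) * 1.0 for c in checksum]
        if i == corrupt_at:
            workers[1][0] += 0.5     # silent corruption on worker 1
        if i % K == 0:
            # Exact comparison is safe only because these values are exactly
            # representable; real data would need a tolerance.
            if reduce_sum(workers) != checksum:
                return i, shadow     # detected: roll back to the shadow copy
            shadow = list(checksum)  # clean: refresh the shadow checkpoint
    return None, shadow

detected_step, rollback = simulate(steps=6, K=3, corrupt_at=4)
```

In this run the corruption at step 4 is noticed at the step-6 check, and the shadow copy still holds the checksum verified at step 3, which is the state the checkpointing processors would roll back to.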

Question

• Are we using too many processes for checkpointing?

• What if a checkpointing process starts to fail while a working process has gone wrong silently?

• Thursday, April 23, 2009

STRSM for GPU

• Why?

• Needed in hybrid routines like sgetrf

• The one from CUBLAS is too slow

• So currently we have to transfer data back from the GPU to the CPU to do strsm, which wastes time

Design -1

1 2 3

4 5 6

A x b

Improvement

• Strsm_kernel is always on the critical path

• 3 strsm_kernel = 3 strtri_kernel + 3 strmv_kernel

• All 3 strtri_kernel can be done in one shot on GPU

• So now 3 strsm_kernel > 1 strtri_kernel + 3 strmv_kernel

• Critical path is reduced
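One reading of the improvement above is that the triangular solve is split into inverting the diagonal blocks once (all inversions are independent of each other) and then using only matrix-vector products. The CPU-only sketch below illustrates that reading for a lower-triangular system L x = b with a single right-hand side. It does not reproduce the GPU kernels, and every function name in it is hypothetical.

```python
def invert_lower(T):
    """Invert a small lower-triangular matrix column by column."""
    n = len(T)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for row in range(col, n):
            rhs = 1.0 if row == col else 0.0
            acc = sum(T[row][k] * inv[k][col] for k in range(col, row))
            inv[row][col] = (rhs - acc) / T[row][row]
    return inv

def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def blocked_solve(L, b, nb):
    """Solve L x = b using pre-inverted nb x nb diagonal blocks."""
    n = len(L)
    starts = list(range(0, n, nb))
    # Step 1: invert every diagonal block. No inversion depends on another,
    # so this stage could run in one shot.
    inverses = {s: invert_lower([row[s:s + nb] for row in L[s:s + nb]])
                for s in starts}
    # Step 2: block forward substitution using only multiplications.
    x = [0.0] * n
    for s in starts:
        e = min(s + nb, n)
        rhs = [b[r] - sum(L[r][c] * x[c] for c in range(s)) for r in range(s, e)]
        x[s:e] = matvec(inverses[s], rhs)
    return x

L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 3.0, 4.0, 0.0],
     [1.0, 0.0, 2.0, 2.0]]
b = [2.0, 3.0, 18.0, 15.0]
x = blocked_solve(L, b, nb=2)
```

Only the multiplications in step 2 remain sequential across blocks, which is the sense in which the critical path gets shorter.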

Design -2

1 2 3

4 5

……

A few thoughts

What if we could:

(1) Reuse PBLAS/ScaLAPACK

(2) Use as few checkpointing processors as possible

Julien & George’s code is:

(1) Trying to build PBLAS/ScaLAPACK from scratch to enable FT

(2) Using m+n extra processes to babysit m*n processes

Checksum

Multiplication

A x B = C
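The slide figures are not available, so the following is an assumption about what the checksum multiplication looks like: the classic scheme in which A carries an extra row of column sums and B an extra column of row sums, so that their product carries the checksums of C for free. The helper names are hypothetical.

```python
def matmul(A, B):
    """Plain dense matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add_checksum_row(A):
    """Append a row holding the column sums of A."""
    return A + [[sum(col) for col in zip(*A)]]

def add_checksum_col(B):
    """Append to every row of B its row sum."""
    return [row + [sum(row)] for row in B]

def is_consistent(C_full):
    """Check that the last row and column equal the sums of the data part."""
    data = [row[:-1] for row in C_full[:-1]]
    rows_ok = all(sum(row) == full[-1] for row, full in zip(data, C_full))
    cols_ok = all(sum(col) == chk for col, chk in zip(zip(*data), C_full[-1]))
    return rows_ok and cols_ok

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(add_checksum_row(A), add_checksum_col(B))
```

Because the checksums ride along with the ordinary multiplication, an unmodified multiply routine can be reused, which fits the goal of reusing PBLAS/ScaLAPACK. With floating-point data the equality tests would need a tolerance.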

Checkpointing

(1) Gets to reuse PBLAS/ScaLAPACK

(2) Requires 1 extra process for 6 processes, instead of 5 for 6

Detect errors

Locate sick process

• What could go wrong?

• Network card -> sending out wrong data via mpi

• Memory -> doing computation on wrong data

• ALU -> giving wrong result

• Disk -> providing wrong data

• etc…

• The old checksum is good for detecting errors, but is not process-specific

• We could:

• Recompute the local checksum

• Keep a local-only checksum, like the sum of all local matrices

• Or …
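The local-only checksum option from the list above can be sketched as follows. The assumptions are that each process records the sum of its own local matrix while it is known to be healthy, and that a later recomputation is compared against that record. The names and the tolerance are hypothetical.

```python
def local_checksum(block):
    """Sum of all entries of one process's local matrix."""
    return sum(sum(row) for row in block)

def find_sick(blocks, recorded, tol=1e-9):
    """Ranks whose recomputed local checksum disagrees with the recorded one."""
    return [rank for rank, block in enumerate(blocks)
            if abs(local_checksum(block) - recorded[rank]) > tol]

blocks = [[[1.0, 2.0], [3.0, 4.0]],
          [[5.0, 6.0], [7.0, 8.0]],
          [[9.0, 1.0], [2.0, 3.0]]]
recorded = [local_checksum(b) for b in blocks]   # taken while healthy
blocks[2][0][1] += 1.0                           # silent corruption on rank 2
sick = find_sick(blocks, recorded)
```

Unlike the global checksum, this comparison names the rank that disagrees. It only catches data that changed after the record was taken, so errors that corrupt the data and the record consistently would still go unnoticed.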

A little update with clapack

• Keith’s f2c applied to lapack3.2

• Problems solved:

• Substring issue with maxloc

• Malloc on the fly problem

• Status

• Passing the LAPACK test suite

• Extended to lapack3.2.1
