Process recovery


Page 1: Process recovery

Process recovery

• Restore the computation point

• New MPI_COMM_WORLD by FT-MPI

• Detect the failed member

• Rebuild the process-id array

• Rebuild the worker communicator (a generic sketch follows below)
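The slide shows no code; below is a generic sketch of these steps in standard MPI. It assumes FT-MPI has already delivered a repaired MPI_COMM_WORLD after the failure; my_app_id, nworkers, and proc_ids are illustrative names, not FT-MPI's API.

#include <mpi.h>

MPI_Comm rebuild_after_failure(int my_app_id, int nworkers,
                               int *proc_ids /* length >= comm size */) {
    int rank;
    MPI_Comm worker_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rebuild the process-id array: every surviving or respawned rank
       republishes its application-level id. */
    MPI_Allgather(&my_app_id, 1, MPI_INT, proc_ids, 1, MPI_INT,
                  MPI_COMM_WORLD);

    /* Rebuild the worker communicator: the first nworkers ranks compute,
       the rest stand by. */
    int color = (rank < nworkers) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &worker_comm);
    return worker_comm;
}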

Page 2: Process recovery

Communicator Issue

• Working communicator

• Used by ScaLAPACK and BLACS

• Standby communicator

• Used by standby processes

• MPI_COMM_WORLD

• Checkpointing

Page 3: Process recovery

Who is where?

Page 4: Process recovery

Where is the lost data?

• Diskless checkpointing and checksum

• Participation

• Checksum: MPI_Reduce(…, MPI_SUM, …)

(Figure: 14x14 matrix, nb=2, 2x3 grid, global view; standby process, local view.)
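A minimal sketch of the checksum step named above: each worker contributes its local block to one MPI_Reduce with MPI_SUM, and the standby (checkpoint) process receives the sum. A lost block can later be recovered as the checksum minus the surviving blocks. Names (ckpt_root, local_block) are illustrative.

#include <mpi.h>

/* Each worker contributes its local block; rank ckpt_root in comm
   receives the elementwise sum, i.e. the diskless checkpoint. */
void checkpoint_block(const double *local_block, double *checksum,
                      int nelem, int ckpt_root, MPI_Comm comm) {
    MPI_Reduce(local_block, checksum, nelem, MPI_DOUBLE,
               MPI_SUM, ckpt_root, comm);
}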

Page 5: Process recovery

Restart computation

• Reverse computation

Each iteration applies a rank-k update, $C_i = C_{i-1} - A_i B_i^T$; reverse computation undoes it, $C_{i-1} = C_i + A_i B_i^T$.
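Assuming the formula reconstructed above, here is a small CBLAS sketch of one update and its reversal (illustrative, not the slides' ScaLAPACK code):

#include <cblas.h>

/* Step-i trailing update C := C - A * B^T (C is m x n, A is m x k,
   B is n x k, column-major). */
void apply_update(int m, int n, int k,
                  const double *A, const double *B, double *C) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, n, k, -1.0, A, m, B, n, 1.0, C, m);
}

/* Reverse computation: add the same product back to recover C_{i-1}. */
void reverse_update(int m, int n, int k,
                    const double *A, const double *B, double *C) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, n, k, 1.0, A, m, B, n, 1.0, C, m);
}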

Page 6: Process recovery

Restart computation: Failure

(Figure: a process fails at step f.)

Page 7: Process recovery

Restart computation

(Figure: rolling back from the current step i to the failed step f.)

Reverse computation steps back from iteration $i$ to the failure step $f$, undoing one update at a time: $C_{l-1} = C_l + A_k B_k^T$.

Page 8: Process recovery

Checkpointing Performance


Page 9: Process recovery

Recovery Performance


Page 10: Process recovery

Optimization Attempt

• Too many MPI_Reduce on small data blocks


Before: 12 MPI_Reduce, 49+ memcpy

After: 4 MPI_Reduce, 0 memcpy, via an MPI user-defined datatype and a user-defined op in commutative mode (see the sketch below)
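A hedged sketch of that optimization: describe the strided blocks once with an MPI derived datatype and combine them in a single MPI_Reduce with a user-defined commutative op, instead of many small reduces plus memcpy packing. The block geometry (12 blocks of 64 doubles, stride 256) is illustrative, not from the slides.

#include <mpi.h>

/* Summing function for the vector type built below. It is called with
   *len == 1 instance of that type, so walk its 12 strided blocks of 64
   doubles (stride 256) explicitly; a general op would decode *dt. */
static void block_sum(void *in, void *inout, int *len, MPI_Datatype *dt) {
    double *a = in, *b = inout;
    (void)len; (void)dt;
    for (int blk = 0; blk < 12; blk++)
        for (int j = 0; j < 64; j++)
            b[blk * 256 + j] += a[blk * 256 + j];
}

/* local/global must each hold at least 11*256 + 64 doubles. */
void reduce_blocks(double *local, double *global, MPI_Comm comm) {
    MPI_Datatype blocks;
    MPI_Op sum_op;

    MPI_Type_vector(12, 64, 256, MPI_DOUBLE, &blocks);
    MPI_Type_commit(&blocks);
    MPI_Op_create(block_sum, /* commute = */ 1, &sum_op);

    /* One reduce replaces a dozen small ones; no packing copies. */
    MPI_Reduce(local, global, 1, blocks, sum_op, 0, comm);

    MPI_Op_free(&sum_op);
    MPI_Type_free(&blocks);
}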

Page 11: Process recovery

Issues with FT-MPI

• 14x14, nb=2, pxq=1x4, ok

• 140x140, nb=20, pxq=1x4 =>

“snipe_lite.c:490 Problem, connection on socket [19] has failed.

This connect has been closed.”

The problem disappeared when all 1x4 processes were placed on one node

• 1400x1400 =>

“conn 5 chan 4 pending send before flctrl 9999 = 1”


Page 12: Process recovery

Some ideas from the workshop

• Asynchronous checkpointing (if I don't have a failure, why do I need to stop and do checkpointing?)

• Variable checkpointing interval, because failures tend to occur toward the end... reduce checkpointing overhead (t = sqrt(2 × (time to save state) × AMTTI); a worked example follows this list)

• Silent data corruption:

• how do we know we are getting the right answer on a large system and a large problem?

• different failure model

• failure guard system

• how to do it for a hybrid-system failure (FT for Stan's MAGMA code)
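As a worked example of the interval formula (illustrative numbers, not from the slides): if saving state takes 10 s and the application MTTI is 24 h = 86,400 s, then t = sqrt(2 × 10 × 86,400) ≈ 1,315 s, i.e. a checkpoint roughly every 22 minutes.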


Page 13: Process recovery

Init Checkpointing


Before the loop starts

Page 14: Process recovery

Rolling


(Loop while i >= 0.)

Page 15: Process recovery

Rolling


(Check: i == k?)

Page 16: Process recovery

When something goes wrong silently


Page 17: Process recovery

Recovery


Page 18: Process recovery

Shadow backup


Page 19: Process recovery

Algorithm

• At every step i:

• Do the rank-k update

• Every K steps:

• Check for silent errors by comparing the checksum held on the checkpointing processes against one freshly computed by reduction

• On the checkpointing processes, take shadow checkpoints

• On a silent error:

• Identify the ill process

• Checkpointing processes roll back to the shadowed checkpoints

• Kill the ill process, spawn a new one, and put it in the grid

• Surviving processes reverse-compute to the last healthy step

• Recover the data on the new process through the checksum (a sketch of the whole loop follows below)
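A sketch of the loop above in C; every helper is a hypothetical stand-in for a slide step, not a real API.

/* Hypothetical helpers standing in for the steps above: */
void do_rank_k_update(int i);
int  checksums_agree(int i);        /* stored vs freshly reduced checksum */
int  locate_ill_process(void);
void rollback_shadow_checkpoints(void);
void respawn_process(int rank);     /* kill, spawn, put back in the grid */
int  last_healthy_step(void);
void reverse_compute_to(int step);  /* surviving processes roll back */
void recover_from_checksum(int rank);
void take_shadow_checkpoints(void);

void ft_driver(int nsteps, int K) {
    for (int i = 0; i < nsteps; i++) {
        do_rank_k_update(i);                    /* at every step i */

        if (i % K == K - 1) {                   /* every K steps */
            if (checksums_agree(i)) {
                take_shadow_checkpoints();      /* advance the shadows */
            } else {                            /* silent error detected */
                int sick = locate_ill_process();
                rollback_shadow_checkpoints();
                respawn_process(sick);
                reverse_compute_to(last_healthy_step());
                recover_from_checksum(sick);
            }
        }
    }
}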


Page 20: Process recovery

Question

• Are we using too many processes for checkpointing?

• What if a checkpointing process starts to fail while a working process has gone wrong silently?


Page 21: Process recovery

• Thursday, April 23, 2009


Page 22: Process recovery

STRSM for GPU

• Why?

• Needed in hybrid routines like sgetrf

• The one from CUBLAS is too slow

• So currently we have to transfer data back from the GPU to the CPU to do strsm, which wastes time


Page 23: Process recovery

Design -1


(Diagram: blocks 1-6 of the triangular solve A x = b.)

Page 24: Process recovery

Improvement

• strsm_kernel is always on the critical path

• 3 strsm_kernel = 3 strtri_kernel + 3 strmv_kernel

• All 3 strtri_kernel calls can be done in one shot on the GPU

• So now 3 strsm_kernel > 1 strtri_kernel + 3 strmv_kernel: the replacement is cheaper

• The critical path is reduced (see the sketch below)
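A CPU-side sketch of the substitution, using LAPACK/CBLAS to stand in for the custom GPU kernels the slides describe: invert each triangular block once (strtri), then every former strsm becomes a triangular matrix-vector multiply (strmv).

#include <cblas.h>
#include <lapacke.h>

/* Invert one lower-triangular, non-unit-diagonal n x n block in place.
   Done once per block (the slides batch all three inversions). */
int invert_block(int n, float *T) {
    return LAPACKE_strtri(LAPACK_COL_MAJOR, 'L', 'N', n, T, n);
}

/* Each former strsm call becomes a multiply: x := inv(T) * x, where x
   holds the right-hand side b on entry and the solution on exit. */
void apply_inverse(int n, const float *Tinv, float *x) {
    cblas_strmv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, Tinv, n, x, 1);
}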


Page 25: Process recovery

Design -2


(Diagram: blocks 1, 2, 3 / 4, 5, …)

Page 26: Process recovery


Page 27: Process recovery


Page 28: Process recovery

A few thoughts


What if we could: (1) reuse PBLAS/ScaLAPACK as-is, and (2) use as few checkpointing processes as possible?

Julien & George's code (1) builds PBLAS/ScaLAPACK from scratch to enable FT, and (2) uses m+n extra processes to babysit m*n processes.

Page 29: Process recovery

Checksum


Page 30: Process recovery

Multiplication


A x B = C
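A minimal self-contained C demo of the checksum relation this approach relies on (sizes and data illustrative): appending a checksum row to A and a checksum column to B makes the product C carry its own checksums, so a lost entry of C can be rebuilt from them.

#include <stdio.h>

#define N 3
int main(void) {
    double A[N+1][N], B[N][N+1], C[N+1][N+1] = {{0}};

    /* Fill A and B with arbitrary data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 2*j + 1;
            B[i][j] = 3*i - j + 2;
        }

    /* Checksum row of A (column sums) and checksum column of B (row sums). */
    for (int j = 0; j < N; j++) {
        A[N][j] = 0;
        for (int i = 0; i < N; i++) A[N][j] += A[i][j];
    }
    for (int i = 0; i < N; i++) {
        B[i][N] = 0;
        for (int j = 0; j < N; j++) B[i][N] += B[i][j];
    }

    /* C = A * B on the extended (N+1) x (N+1) shapes. */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    /* The extra row of C now equals the column sums of its data part. */
    double colsum0 = 0;
    for (int i = 0; i < N; i++) colsum0 += C[i][0];
    printf("C checksum row[0] = %g, recomputed = %g\n", C[N][0], colsum0);
    return 0;
}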

Page 31: Process recovery

Checkpointing


(1) We get to reuse PBLAS/ScaLAPACK

(2) Requires only 1 extra process for 6 worker processes, instead of 5 for 6 (the m+n scheme)

Page 32: Process recovery

Detect errors


Page 33: Process recovery

Locate sick process

• What could go wrong?

• Network card -> sending out wrong data via MPI

• Memory -> doing computation on wrong data

• ALU -> giving wrong result

• Disk -> providing wrong data

• etc…

• The old checksum is good for detecting errors, but it is not process-specific

• We could:

• Recompute the local checksum

• Keep a local-only checksum, like the sum of all local matrices (see the sketch after this list)

• Or …
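A minimal sketch of the local-only checksum idea (all names illustrative): each process keeps the sum of its own block from the last checkpoint and recomputes it to test whether it is the sick one.

#include <stddef.h>

/* Running sum of this process's own block. */
double local_checksum(const double *block, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += block[i];
    return s;
}

/* Nonzero if the block no longer matches the checksum recorded at the
   last checkpoint (within a tolerance): this process is the sick one. */
int i_am_sick(const double *block, size_t n, double saved, double tol) {
    double d = local_checksum(block, n) - saved;
    return d > tol || d < -tol;
}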


Page 34: Process recovery

A little update with clapack

• Keith’s f2c applied to lapack3.2

• Problems solved:

• Substring issue with maxloc

• Malloc on the fly problem

• Status

• Passing the LAPACK test suite

• Extended to lapack3.2.1
