Umpire: Making MPI Programs Safe

TRANSCRIPT

Page 1: Umpire: Making MPI Programs Safe

Bronis R. de Supinski and Jeffrey S. Vetter
Center for Applied Scientific Computing
August 15, 2000

Page 2: Umpire

Writing correct MPI programs is hard

Unsafe or erroneous MPI programs
- Deadlock
- Resource errors

Umpire
- Automatically detect MPI programming errors
- Dynamic software testing
- Shared memory implementation

Page 3: Umpire Architecture

[Figure: MPI application tasks 0 through N-1 run within the MPI runtime system. Umpire interposes on each task using the MPI profiling layer; the tasks send transactions via shared memory to the Umpire Manager, which runs the verification algorithms.]

Page 4: Collection system

Calling task
- Use MPI profiling layer
- Perform local checks
- Communicate with manager if necessary
  - Call parameters
  - Return program counter (PC)
  - Call-specific information (e.g., buffer checksum)

Manager
- Allocate Unix shared memory
- Receive transactions from calling tasks
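Interposition works through the standard MPI profiling interface: the tool defines its own MPI_Send, which records the call and then forwards it to the implementation through the PMPI_ entry point. A minimal C sketch (the record_send hook and its contents are hypothetical, not Umpire's actual internals):

    #include <mpi.h>

    /* Hypothetical hook: run local checks and enqueue a transaction
       for the manager (parameters, return PC, buffer checksum). */
    static void record_send(const void *buf, int count, int dest)
    {
        (void)buf; (void)count; (void)dest;  /* sketch only */
    }

    /* The tool's MPI_Send intercepts the application's call, then
       forwards to the real implementation via PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        record_send(buf, count, dest);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }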

Page 5: Manager

Detects global programming errors

Unix shared memory communication

History queues
- One per MPI task
- Chronological lists of MPI operations

Resource registry
- Communicators
- Derived datatypes
- Required for message matching

Performs verification algorithms
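One way to picture a history-queue entry is as a small record per MPI call, kept in a per-task chronological list and cross-referenced against the resource registry. The C struct below is purely illustrative; the field names are assumptions, not Umpire's actual data structures:

    /* Hypothetical shape of one history-queue entry. */
    typedef struct op_record {
        int    task;     /* rank that issued the MPI call           */
        int    op;       /* operation code, e.g. SEND, RECV, BCAST  */
        int    peer;     /* source/destination rank, if any         */
        int    tag;      /* message tag, if any                     */
        int    comm_id;  /* handle into the resource registry       */
        void  *pc;       /* return program counter of the call      */
        struct op_record *next;  /* next-older entry in this queue  */
    } op_record;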

Page 6: Configuration Dependent Deadlock

Unsafe MPI programming practice

Code result depends on:
- MPI implementation limitations
- User input parameters

Classic example code:

    Task 0      Task 1
    MPI_Send    MPI_Send
    MPI_Recv    MPI_Recv
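Fleshed out as a runnable C sketch (the message size is an arbitrary assumption), the pattern completes or deadlocks depending on how much the implementation buffers:

    #include <mpi.h>

    /* Run with exactly 2 tasks. Both send before receiving, so the
       program deadlocks whenever MPI_Send blocks, i.e., whenever the
       message exceeds the implementation's internal buffering. */
    int main(int argc, char **argv)
    {
        enum { N = 1 << 20 };          /* message size: arbitrary */
        static int out[N], in[N];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Send(out, N, MPI_INT, 1 - rank, 0, MPI_COMM_WORLD);
        MPI_Recv(in, N, MPI_INT, 1 - rank, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }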

Page 7: Mismatched Collective Operations

Erroneous MPI programming practice

Simple example code:

    Tasks 0, 1, & 2    Task 3
    MPI_Bcast          MPI_Barrier
    MPI_Barrier        MPI_Bcast

Possible code results:
- Deadlock
- Correct message matching
- Incorrect message matching
- Mysterious error messages
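A runnable C sketch of the mismatch (the rank split and the use of MPI_COMM_WORLD are assumptions; run with exactly four tasks):

    #include <mpi.h>

    /* Tasks 0-2 broadcast and then hit the barrier; task 3 reverses
       the order, mismatching the collective operations. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank < 3) {
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        } else {
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }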

Page 8: Deadlock detection

MPI history queues
- One per task in Manager
- Track MPI messaging operations
  - Items added through transactions
  - Removed when safely matched

Automatically detect deadlocks
- MPI operations only
- Wait-for graph
- Recursive algorithm
- Invoke when queue head changes

Also support timeouts
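In the simplest case, detection reduces to finding a cycle in the wait-for graph whenever the head of a history queue changes. The C sketch below assumes each blocked task waits on exactly one other task; the actual algorithm must also handle collectives and wildcard receives:

    /* waits_for[i] = task blocking task i, or -1 if i is not blocked. */
    enum { NTASKS = 4 };             /* illustrative task count */
    static int waits_for[NTASKS];

    /* Recursively follow wait-for edges from 'task'; callers pass a
       zeroed on_path array. A deadlock exists if the walk revisits
       a task already on the current path. */
    static int has_cycle(int task, int on_path[NTASKS])
    {
        if (task < 0)      return 0;   /* chain ends: no cycle      */
        if (on_path[task]) return 1;   /* back edge: deadlock found */
        on_path[task] = 1;
        return has_cycle(waits_for[task], on_path);
    }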

Page 9: Deadlock Detection Example

[Figure: History queues for four tasks. Tasks 0-2 each record MPI_Bcast followed by MPI_Barrier; Task 3 records only MPI_Barrier. Transactions arrive at the Manager in this order:]

Task 1: MPI_Bcast
Task 0: MPI_Bcast
Task 0: MPI_Barrier
Task 2: MPI_Bcast
Task 3: MPI_Barrier (ERROR! Report it!)
Task 2: MPI_Barrier
Task 1: MPI_Barrier

Page 10: Resource Tracking Errors

Many MPI features require resource allocations
- Communicators, datatypes and requests
- Detect “leaks” automatically

Simple “lost request” example:

    MPI_Irecv (..., &req);
    MPI_Irecv (..., &req);
    MPI_Wait (&req, ...);
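A fuller version of the same bug as a runnable C sketch (buffer sizes, tags, and the sending rank are illustrative assumptions):

    #include <mpi.h>

    /* Run with at least 2 tasks. */
    int main(int argc, char **argv)
    {
        int rank, a[10], b[10];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            MPI_Irecv(a, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            /* Overwriting req loses the handle to the first request,
               which can now never be completed or freed: a leak. */
            MPI_Irecv(b, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* second one only */
        } else if (rank == 0) {
            int x = 0;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }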

Complicated by assignment

Also detect errant writes to send buffers
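One plausible mechanism, consistent with the buffer checksums recorded by the collection system, is to checksum the buffer at MPI_Isend and recheck it at the matching MPI_Wait; the scheme below is a hypothetical sketch, not Umpire's documented implementation:

    /* Simple rolling hash over the send buffer. */
    static unsigned buffer_checksum(const void *buf, int nbytes)
    {
        const unsigned char *p = (const unsigned char *)buf;
        unsigned sum = 0;
        for (int i = 0; i < nbytes; i++)
            sum = sum * 31u + p[i];
        return sum;
    }

    /* At MPI_Isend: store buffer_checksum(buf, nbytes) with the request.
       At the matching MPI_Wait: recompute and compare; a mismatch means
       the program wrote to the send buffer while the send was pending. */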

Page 11: Conclusion

First automated MPI debugging tool
- Detects deadlocks
- Eliminates resource leaks
- Assures correct non-blocking sends

Performance
- Low overhead (21% for sPPM)
- Located deadlock in code set-up

Limitations
- MPI_Waitany and MPI_Cancel
- Shared memory implementation
- Prototype only

Page 12: Future Work

Further prototype testing

Improve user interface

Handle all MPI calls

Tool distribution
- LLNL application group testing
- Exploring mechanisms for wider availability

Detection of other errors
- Datatype matching
- Others?

Distributed memory implementation

Page 13

Work performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory, under Contract W-7405-Eng-48.

UCRL-VG-139184