TRANSCRIPT
Supporting Multi-domain Decomposition for MPI Programs
Laxmikant Kale, Computer Science
18 May 2000
Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
©2000 Board of Trustees of the University of Illinois
Adaptive Strategies
Dynamic behavior of rocket simulation components:
- Burning away of solid fuel
- Adaptive refinement
- Possible external interference on timeshared clusters
This requires adaptive load balancing strategies. Idea: automatically adapt to variations.
- Multi-domain decomposition
- Minimal influence on component applications:
  - They should be written as if stand-alone code
  - Minimal recoding/effort to integrate
Multi-domain decomposition
- Divide the computation into a large number of small pieces
- Typically many more pieces than processors
- Let the system map pieces (chunks) to processors
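The mapping step can be sketched as a simple greedy assignment; this is an illustrative toy (names like `map_chunks` are invented here, not CSAR or Charm++ API), assuming per-chunk load estimates are available:

```python
# A minimal sketch of multi-domain decomposition: the computation is split
# into many more chunks than processors, and the "system" (here, a greedy
# heuristic) chooses the chunk-to-processor mapping. Illustrative only.

def map_chunks(chunk_loads, num_pes):
    """Greedily assign each chunk to the currently least-loaded PE."""
    pe_load = [0.0] * num_pes
    mapping = {}
    # Place heavy chunks first for a better greedy result.
    for chunk, load in sorted(chunk_loads.items(), key=lambda kv: -kv[1]):
        target = min(range(num_pes), key=lambda pe: pe_load[pe])
        mapping[chunk] = target
        pe_load[target] += load
    return mapping, pe_load

if __name__ == "__main__":
    # 13 chunks onto 4 PEs -- far more pieces than processors.
    loads = {c: 1.0 + (c % 3) * 0.5 for c in range(1, 14)}
    mapping, pe_load = map_chunks(loads, 4)
    print(mapping)
    print(pe_load)
```

Because there are many chunks per PE, even a simple heuristic like this keeps per-PE loads close; the real benefit, as the next slides argue, is that the system can *remap* later.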
Multi-domain decomposition: Advantages
- Separation of concerns: the system automates what it can do best
- The system can remap chunks to deal with load imbalances and external interference
- Preserves and encourages modularity: multiple codes can interleave
Multi-domain decomposition
[Figure: thirteen chunks (1-13) are handed to the runtime system (RTS), which maps them onto four processors, PE0-PE3.]
Need Data Driven Execution Support
There are multiple chunks on each processor:
- How should execution interleave between them? Who gets the chance to run next?
Charm++ provides this support:
- Parallel C++ with data-driven objects
- Object groups: a global object with a "representative" on each PE
- Object arrays and object collections
- Asynchronous method invocation
- Prioritized scheduling
- Mature, robust, portable: http://charm.cs.uiuc.edu
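The data-driven, prioritized-scheduling idea can be sketched as follows. This is a toy model in the spirit of the Charm++ scheduler, not its actual implementation; the `Scheduler` and `Chunk` names are invented for illustration:

```python
# Data-driven execution with prioritized scheduling: each message names a
# target object and method; the scheduler repeatedly delivers the
# highest-priority pending message. Illustrative sketch only.

import heapq
import itertools

class Scheduler:
    def __init__(self):
        self._q = []
        self._seq = itertools.count()   # FIFO order among equal priorities

    def send(self, priority, obj, method, *args):
        # Lower number = higher priority.
        heapq.heappush(self._q, (priority, next(self._seq), obj, method, args))

    def run(self):
        log = []
        while self._q:
            _, _, obj, method, args = heapq.heappop(self._q)
            log.append(getattr(obj, method)(*args))
        return log

class Chunk:
    def __init__(self, name):
        self.name = name
    def work(self, step):
        return (self.name, step)

if __name__ == "__main__":
    sched = Scheduler()
    a, b = Chunk("A"), Chunk("B")
    sched.send(5, a, "work", 1)
    sched.send(1, b, "work", 1)   # higher priority: delivered first
    print(sched.run())            # [('B', 1), ('A', 1)]
```

The key property: no chunk blocks the processor; whichever object has a pending (highest-priority) message gets to run next.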
Data driven execution
[Figure: each processor runs a scheduler that picks the next message from its message queue and delivers it to the target object.]
Object based load balancing
Application-induced imbalances in CSE applications:
- Abrupt, but infrequent, or
- Slow, cumulative
- Rarely: frequent, large changes
Principle of persistence (an extension of the principle of locality):
- The behavior of objects, including computational load and communication patterns, tends to persist over time
Measurement-based load balancing:
- Periodic, or triggered
- Automatic
- A framework is already implemented, with strategies that exploit this principle automatically
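The measurement-based idea can be illustrated in a few lines: under the principle of persistence, the times measured in the last period are used as the prediction for the next, and a remapping is computed from them. This sketch (the `rebalance` name and greedy strategy are illustrative, not the framework's actual strategies):

```python
# Measurement-based load balancing sketch: measured per-object times from
# the last period predict the next period; objects are placed greedily and
# any object whose PE changed is a migration. Illustrative only.

def rebalance(placement, measured_time, num_pes):
    """placement: {obj: pe}; measured_time: {obj: seconds}.
    Returns (new_placement, list_of_migrated_objects)."""
    new_placement = {}
    pe_load = [0.0] * num_pes
    for obj in sorted(measured_time, key=measured_time.get, reverse=True):
        pe = min(range(num_pes), key=lambda p: pe_load[p])
        new_placement[obj] = pe
        pe_load[pe] += measured_time[obj]
    migrations = [o for o in placement if placement[o] != new_placement[o]]
    return new_placement, migrations

if __name__ == "__main__":
    # All three objects started on PE 0; measurements show imbalance.
    new, moved = rebalance({"a": 0, "b": 0, "c": 0},
                           {"a": 3.0, "b": 2.0, "c": 1.0}, 2)
    print(new, moved)
```

A real strategy would also weigh the communication graph (next slide) so that heavily communicating objects land together; this sketch uses times alone.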
Object based load balancing
Automatic instrumentation via the Charm RTS:
- Object times
- Object communication graph: number and total size of messages between every pair of objects
Suite of remapping strategies:
- Centralized, for the periodic case (several strategies)
- Distributed strategies, for frequent load balancing and for workstation clusters
In fruitful use for several applications, within CSAR and outside
Example of Recent Speedup Results: Molecular Dynamics on ASCI Red
[Figure: "Speedup on ASCI Red: Apo-A1, 92,000 atoms" — speedup vs. number of processors; x-axis 0-1800 processors, y-axis 0-900 speedup.]
Adaptive objects:
How can they be used in more traditional (MPI) programs?
- AMPI: adaptive, array-based MPI, being used for Rocket Simulation component applications
What other consequences and benefits follow?
- Automatic checkpointing scheme
- Timeshared parallel machines and clusters
Using Adaptive Objects Framework
The framework requires Charm++, while rocket simulation components use MPI. No problem! We have implemented AMPI on top of Charm++:
- User-level threads embedded in objects
- Multiple threads per processor
- Thread migration!
Multiple threads per processor means no global variables:
- MPI programs must be changed a bit to encapsulate globals
- Automatic conversion is possible: future work with Hoeflinger/Padua
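The "encapsulate globals" transformation can be sketched as follows. The names here (`RankState`, `step`) are illustrative, not AMPI's API; the point is only the shape of the change an MPI code needs:

```python
# Sketch of global-variable encapsulation: state that would have been
# global in an MPI program moves into a per-rank object, so many "ranks"
# (user-level threads) can coexist in one address space and migrate.

class RankState:
    """Holds what used to be globals: one instance per MPI 'process'."""
    def __init__(self, rank):
        self.rank = rank
        self.iteration = 0     # formerly: a global iteration counter
        self.residual = 1.0    # formerly: a global residual value

def step(state):
    # Every former global access now goes through the state object.
    state.iteration += 1
    state.residual /= 2.0
    return state.residual

if __name__ == "__main__":
    # Four "ranks" sharing one address space without clashing.
    ranks = [RankState(r) for r in range(4)]
    for st in ranks:
        step(st); step(st)
    print([st.residual for st in ranks])   # each rank's state is independent
```

In Fortran/C the same idea is a module or struct threaded through the call chain; the automatic-conversion work mentioned above would mechanize exactly this rewrite.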
AMPI
Array-based Adaptive MPI:
- Each MPI "process" runs in a user-level thread embedded in a Charm++ object
- All MPI calls are (or will be) supported
- Threads are migratable! They migrate under the control of the load balancer/RTS
Making threads migrate
Thread migration is difficult:
- Pointers within the stack
- Need to minimize user intervention
Solution strategy I: multiple threads (stacks) on each PE, but use only one executable stack; copy the entire stack in and out on each context switch!
Advantages and disadvantages:
- Pointers are safe: migrated threads run in the stack at the same virtual address
- Higher overhead at context switch, even when no migration occurs. How much is this overhead?
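A toy model of strategy I may make the mechanism concrete. A real implementation memcpy's the machine stack; here a `bytearray` stands in for the single shared stack region (all names are illustrative):

```python
# Strategy I sketch: all threads execute in ONE shared stack region, so
# stack addresses never change; the scheduler copies a thread's stack
# contents out and in at each context switch. Toy model, not real threads.

SHARED_STACK = bytearray(64)          # the single executable stack region

def switch(saved_stacks, out_tid, in_tid):
    """Save the running thread's stack, restore the incoming thread's."""
    saved_stacks[out_tid] = bytes(SHARED_STACK)   # copy out (O(stack size))
    SHARED_STACK[:] = saved_stacks[in_tid]        # copy in  (O(stack size))
    return in_tid

if __name__ == "__main__":
    saved = {0: bytes(64), 1: bytes(64)}   # blank stacks for two threads
    SHARED_STACK[:4] = b"AAAA"             # thread 0 writes a frame
    switch(saved, 0, 1)                    # thread 1 now runs...
    SHARED_STACK[:4] = b"BBBB"             # ...in the SAME region
    switch(saved, 1, 0)                    # back to thread 0
    print(SHARED_STACK[:4])                # thread 0's frame is back, intact
```

Both copies in `switch` are proportional to the stack size, which is exactly the overhead the next slide measures.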
Context switching overhead
[Figure: context-switching time in microseconds vs. stack size in bytes (0 to 16384), for two schemes: Stack-Copy and Stack-pointer swap; y-axis 0-50 µs. Stack-copy cost grows with stack size, while the pointer swap stays essentially flat.]
Low overhead migratable threads
Challenge: if the stack size is large, the overhead is large even without migration.
- Possible remedy: reduce stack size
The iso_malloc idea (due to a group at ENS Lyon, France):
- Allocate the same virtual address space for each thread on all processors, when the thread is created
Now:
- Context switching is fast (a pointer swap)
- Migration is still possible, as before
Scalability optimizations: map and unmap, to avoid virtual memory clutter
Application Performance: ROCFLO
[Figure: "ROCFLO with MPI and AMPI" — execution time in seconds vs. processors (1, 2, 4, 8, 16, 32); y-axis 0-1800 s; two series: MPI and AMPI.]
ROCSOLID: scaled problem-size
[Figure: "ROCSOLID with MPI and AMPI" — execution time in seconds vs. processors (1, 8, 32, 64), scaled problem size; y-axis 56-78 s; two series: MPI and AMPI.]
Other applications of adaptive objects
Adaptive objects are a versatile idea:
- Cache performance optimization
- Out-of-core codes: automatic prefetch of each object's method parameters and state before execution
- Flexible checkpointing
- Timeshared parallel machines: effective utilization of resources
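The out-of-core prefetch point follows directly from data-driven execution: the scheduler's queue reveals which object runs next, so its state can be fetched before the method executes. A toy sketch, with a dict standing in for disk and all names invented for illustration:

```python
# Out-of-core prefetch sketch: while one object's method runs, the NEXT
# queued object's state is brought in from backing store, so it is already
# in memory when its turn comes. A real system would fetch asynchronously.

from collections import deque

disk = {"obj1": {"x": 1}, "obj2": {"x": 2}}   # out-of-core object states
memory = {}                                    # in-core cache

def run_queue(queue):
    results = []
    queue = deque(queue)
    while queue:
        obj_id, method = queue.popleft()
        if queue:                              # peek at the next invocation
            nxt = queue[0][0]
            memory.setdefault(nxt, disk[nxt])  # "prefetch" its state
        state = memory.setdefault(obj_id, disk[obj_id])  # demand-load if needed
        results.append(method(state))
    return results

if __name__ == "__main__":
    q = [("obj1", lambda s: s["x"] * 10), ("obj2", lambda s: s["x"] * 10)]
    print(run_queue(q))   # [10, 20]
```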
Checkpointing
When running on a large configuration (e.g., 4096 processors of a 4096-PE system):
- If one PE is down, you can't restart from the last checkpoint!
- In some applications, the number of PEs must be a power of two (or a square, ...)
A solution based on objects:
- Checkpoint objects, not processor images
- Note: the number of objects needs to be a power of two, not the number of PEs
- Restart on a different number of processors!
- Let the adaptive balancer migrate objects to restore efficiency
- Requires several innovations, but the potential is opened by objects
We have a preliminary system implemented for this; it can restart on fewer or more processors.
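The essential trick can be sketched in a few lines: because the checkpoint records object states rather than per-PE memory images, restarting on a different processor count is just a remapping. Illustrative only (the function names and round-robin placement are invented, not the preliminary system's design):

```python
# Object-level checkpointing sketch: the checkpoint is PE-count-independent,
# so a run can restart on a different number of processors; the load
# balancer would then refine the initial placement.

import pickle

def checkpoint(objects):
    """objects: {obj_id: state}. Serialize a PE-count-independent blob."""
    return pickle.dumps(objects)

def restart(blob, num_pes):
    """Restore on num_pes processors with a round-robin initial placement."""
    objects = pickle.loads(blob)
    placement = {oid: i % num_pes for i, oid in enumerate(sorted(objects))}
    return objects, placement

if __name__ == "__main__":
    blob = checkpoint({f"chunk{i}": {"step": 7} for i in range(8)})
    objs, place = restart(blob, 3)       # restart on a DIFFERENT PE count
    print(sorted(set(place.values())))   # all three PEs received objects
```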
Time shared parallel machines
Need the ability to shrink and expand a job to the available number of processors. Again, we can do that with objects:
- Completely vacating processors when needed
- Using a fraction of the power of a desktop workstation
- Already available for Charm++ and AMPI
Need quality-of-service contracts:
- Min/max PEs
- Deadlines
- Performance profile
- Selection of machine: negotiation
- "Faucets" project in progress; will link up with Globus
Contributors
AMPI library: Milind Bhandarkar
Timeshared clusters: Sameer Kumar
Checkpointing: Sameer Paranjpye
Charm++ support: Robert Brunner, Orion Lawlor
ROCFLO/ROCSOLID development and conversion to Charm:
Prasad Alavilli, Eric de Sturler, Jay Hoeflinger, Jim Jiao, Fady Najar, Ali Namazifard, David Padua, Dennis Parsons, Zhe Zhang, Jie Zheng
Laxmikant (Sanjay) V. Kale
Dept. of Computer Science
and
Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
3304 Digital Computer Laboratory
1304 West Springfield Avenue
Urbana, IL 61801 USA
http://charm.cs.uiuc.edu/~kale
telephone: 217-244-0094
fax: 217-333-1910