
Page 1: Supporting Multi-domain decomposition for MPI programs

Laxmikant Kale, Computer Science

18 May 2000

Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
©1999 Board of Trustees of the University of Illinois

Page 2: Adaptive Strategies

Dynamic behavior of rocket simulation components:
Burning away of solid fuel
Adaptive refinement
Possible external interference, e.g. on timeshared clusters
Requires adaptive load balancing strategies
Idea: automatically adapt to variations
Multi-domain decomposition
Minimal influence on component applications:
Should be written as if stand-alone code
Minimal recoding/effort to integrate

Page 3: Multi-domain decomposition

Divide the computation into a large number of small pieces

Typically, much larger than the number of processors

Let the system map pieces (chunks) to processors

Page 4: Multi-domain decomposition: Advantages

Separation of concerns: the system automates what it can do best
The system can remap chunks to deal with load imbalances and external interference
Preserves and encourages modularity: multiple codes can interleave

Page 5: Multi-domain decomposition

[Figure: thirteen chunks, numbered 1 through 13, mapped by the runtime system (RTS) onto four processors, PE0 through PE3]

Page 6: Need Data Driven Execution Support

There are multiple chunks on each processor

How should execution interleave between them? Who gets the chance to run next?
Charm++ provides this support:
Parallel C++ with data-driven objects
Object groups: a global object with a “representative” on each PE
Object arrays and object collections
Asynchronous method invocation
Prioritized scheduling
Mature, robust, portable
http://charm.cs.uiuc.edu
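To make the idea concrete, here is a minimal, plain-C++ sketch of a data-driven scheduling loop: each processor holds a prioritized queue of messages, and whichever chunk the next message targets gets to run. This is only an illustration of the execution model, not the Charm++ implementation; all class and function names are assumptions made up for the sketch.

```cpp
// Minimal sketch of data-driven execution (illustrative only; not Charm++ itself).
// The scheduler repeatedly pops the highest-priority message and invokes the
// target chunk's method, so many chunks interleave on one processor.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Chunk {                       // one piece of the decomposed domain
  int id;
  void compute(int step) { std::printf("chunk %d runs step %d\n", id, step); }
};

struct Message {
  int priority;                      // smaller value = runs earlier
  std::function<void()> invoke;      // bound method invocation on some chunk
  bool operator<(const Message& o) const { return priority > o.priority; }
};

int main() {
  std::vector<Chunk> chunks;         // many more chunks than processors
  for (int i = 0; i < 13; ++i) chunks.push_back({i});

  std::priority_queue<Message> queue;            // per-PE message queue
  for (auto& c : chunks)                         // asynchronous "sends"
    queue.push({c.id % 3, [&c] { c.compute(0); }});

  while (!queue.empty()) {                       // the scheduler loop
    Message m = queue.top(); queue.pop();
    m.invoke();                                  // data-driven: the message picks the chunk
  }
}
```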

Page 7: Data driven execution

[Figure: two processors, each with a scheduler that picks the next message from its message queue and delivers it to the target object]

Page 8: Object based load balancing

Application-induced imbalances in CSE apps:
Abrupt, but infrequent, or
Slow, cumulative
Rarely: frequent, large changes
Principle of persistence:
An extension of the principle of locality
Behavior of objects, including computational load and communication patterns, tends to persist over time
Measurement-based load balancing:
Periodic, or triggered
Automatic
A framework has already been implemented, with strategies that exploit this principle automatically

Page 9: Object based load balancing

Automatic instrumentation via the Charm RTS:
Object times
Object communication graph: number and total size of messages between every pair of objects
Suite of remapping strategies:
Centralized, for the periodic case (several strategies)
Distributed strategies, for frequent load balancing and for workstation clusters
In fruitful use for several applications, within CSAR and outside
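As an illustration of the centralized, measurement-based case, here is a hedged sketch of one possible greedy remapping strategy (an assumption for illustration, not necessarily a strategy from the framework's suite): sort objects by measured load and repeatedly place the heaviest remaining object on the currently least-loaded processor.

```cpp
// Greedy centralized remapping sketch (illustrative; not the framework's code).
// Input: per-object measured times, as the RTS instrumentation would provide.
// Output: a new object-to-processor assignment that evens out measured load.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> remap(const std::vector<double>& objLoad, int numPEs) {
  std::vector<int> order(objLoad.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
  std::sort(order.begin(), order.end(),              // heaviest objects first
            [&](int a, int b) { return objLoad[a] > objLoad[b]; });

  using PE = std::pair<double, int>;                 // (accumulated load, PE id)
  std::priority_queue<PE, std::vector<PE>, std::greater<PE>> pes;  // min-heap
  for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

  std::vector<int> assignment(objLoad.size());
  for (int obj : order) {                            // place each object on the
    PE least = pes.top(); pes.pop();                 // currently least-loaded PE
    assignment[obj] = least.second;
    least.first += objLoad[obj];
    pes.push(least);
  }
  return assignment;
}

int main() {                                         // toy usage example
  std::vector<double> load = {5.0, 1.0, 3.0, 2.0, 4.0, 2.5};
  std::vector<int> map = remap(load, /*numPEs=*/3);
  for (std::size_t i = 0; i < map.size(); ++i)
    std::printf("object %zu -> PE %d\n", i, map[i]);
}
```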

Page 10: Example of Recent Speedup Results: Molecular Dynamics on ASCI Red

[Chart: speedup on ASCI Red for Apo-A1 (92,000 atoms); speedup (y-axis, 0 to 900) versus number of processors (x-axis, 0 to 1800)]

Page 11: Adaptive objects

How can they be used in more traditional (MPI) programs?

AMPI: adaptive, array-based MPI
Being used for Rocket Simulation component applications

What other consequences and benefits follow?

Automatic checkpointing scheme
Timeshared parallel machines and clusters

Page 12: Using Adaptive Objects Framework

The framework requires Charm++, while Rocket simulation components use MPI

No problem! We have implemented AMPI on top of Charm++
User-level threads embedded in objects
Multiple threads per processor
Thread migration!
Multiple threads per processor means:
No global variables
MPI programs must be changed a bit to encapsulate globals (see the sketch below)
Automatic conversion is possible: future work with Hoeflinger/Padua
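A hedged sketch of what "encapsulate globals" means in practice (the variable and type names here are made up for illustration): each former global moves into a per-rank structure that routines receive explicitly, so several MPI "processes" can coexist as threads within one address space.

```cpp
// Sketch of privatizing globals for AMPI-style virtualization (illustrative names).
// Before: file-scope globals shared by the whole process, e.g.
//   double dt;  int step;      // unsafe once several ranks share one process
// After: the former globals live in a per-rank structure, passed explicitly.
#include <mpi.h>
#include <cstdio>

struct RankState {                    // everything that used to be a global
  double dt;
  int    step;
};

static void advance(RankState& s) {   // routines take the state explicitly
  s.step += 1;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  RankState state{0.01, 0};           // one instance per MPI rank (thread)
  advance(state);
  std::printf("rank %d at step %d\n", rank, state.step);

  MPI_Finalize();
  return 0;
}
```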

Page 13: AMPI

Array-based Adaptive MPI
Each MPI “process” runs in a user-level thread embedded in a Charm++ object
All MPI calls are (or will be) supported
Threads are migratable!
They migrate under the control of the load balancer/RTS
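A minimal sketch of the kind of MPI code AMPI runs as migratable threads; the MPI calls themselves are unchanged. The commented-out migration call is an assumption about AMPI's extension API (its exact name has varied across AMPI versions), so it is left as a comment rather than asserted.

```cpp
// Ordinary MPI code; under AMPI each "rank" below runs as a user-level thread
// inside a Charm++ object, and more ranks than physical processors may be used.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  for (int iter = 0; iter < 10; ++iter) {
    double local = rank + iter, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    // Under AMPI, a collective migration point could be placed here so the
    // RTS/load balancer may move this thread to another processor; the call
    // name is version-dependent (an assumption), so it is only sketched:
    // MPI_Migrate();

    if (rank == 0) std::printf("iter %d: sum = %g\n", iter, sum);
  }

  MPI_Finalize();
  return 0;
}
```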

Page 14: Making threads migrate

Thread migration is difficult:
Pointers within the stack
Need to minimize user intervention
Solution strategy I:
Multiple threads (stacks) on each PE
Use only one stack
Copy the entire stack in and out on each context switch! (see the sketch below)
Advantages and disadvantages:
Pointers are safe: migrated threads run in the stack at the same virtual address
Higher overhead at context switch, even when there is no migration
How much is this overhead?
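A hedged, POSIX/Linux-only toy illustration of "solution strategy I": every user-level thread executes in one shared stack region at a fixed address, and the scheduler copies the whole region out to a private buffer when a thread yields and back in before it resumes. This sketch is built on ucontext for brevity and is not the Charm++ threads package; all names and sizes are assumptions.

```cpp
// Toy stack-copy threads (illustration only; not the Charm++ implementation).
// All threads run on ONE shared stack; on every switch the scheduler memcpy's
// the whole region out for the suspended thread and in for the next one, so a
// migrated thread would always find its stack at the same virtual address.
#include <ucontext.h>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

static const std::size_t STACK_SIZE = 64 * 1024;
static char shared_stack[STACK_SIZE];    // the one stack every thread runs on
static ucontext_t sched_ctx;             // the scheduler runs on the main stack

struct UThread {
  ucontext_t ctx;
  std::vector<char> saved;               // copy of shared_stack while suspended
  bool finished = false;
};

static std::vector<UThread> threads(2);
static int current = -1;

static void yield_to_scheduler() {       // called by a running thread
  swapcontext(&threads[current].ctx, &sched_ctx);
}

static void worker() {                    // sample thread body
  for (int i = 0; i < 3; ++i) {
    std::printf("thread %d, step %d\n", current, i);
    yield_to_scheduler();                 // local variables survive via the copy
  }
  threads[current].finished = true;
}

int main() {
  for (std::size_t t = 0; t < threads.size(); ++t) {
    getcontext(&threads[t].ctx);
    threads[t].ctx.uc_stack.ss_sp = shared_stack;   // same region for everyone
    threads[t].ctx.uc_stack.ss_size = STACK_SIZE;
    threads[t].ctx.uc_link = &sched_ctx;
    makecontext(&threads[t].ctx, worker, 0);
    // remember the freshly initialized frame as this thread's saved image
    threads[t].saved.assign(shared_stack, shared_stack + STACK_SIZE);
  }

  bool work_left = true;
  while (work_left) {                                // round-robin scheduler
    work_left = false;
    for (std::size_t t = 0; t < threads.size(); ++t) {
      if (threads[t].finished) continue;
      work_left = true;
      current = (int)t;
      // copy this thread's stack image IN, run it, then copy the image OUT
      std::memcpy(shared_stack, threads[t].saved.data(), STACK_SIZE);
      swapcontext(&sched_ctx, &threads[t].ctx);
      std::memcpy(threads[t].saved.data(), shared_stack, STACK_SIZE);
    }
  }
  return 0;
}
```

Copying the full region on every switch is exactly the overhead the next slide measures against a stack-pointer swap.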

Page 15: Context switching overhead

[Chart: context switching time in microseconds (y-axis, 0 to 50) versus stack size in bytes (x-axis, 0 to 16384), comparing stack-copy threads with stack-pointer-swap threads]

Page 16: Low overhead migratable threads

Challenge: if the stack size is large, the overhead is large even without migration
Possible remedy: reduce stack size
iso_malloc idea (due to a group at ENS Lyon, France):
Allocate virtual address space for each thread on all processors when the thread is created
Now:
Context switching is faster
Migration is still possible, as before
Scalability optimizations: map and unmap, to avoid virtual memory clutter
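A hedged sketch of the iso_malloc idea using POSIX mmap: each thread's stack is mapped at a virtual address reserved for that thread across all processors, so pointers into the stack remain valid after migration. The base address, slot size, and function names are illustrative assumptions, not the actual isomalloc implementation.

```cpp
// Sketch of isomalloc-style stack allocation (illustrative constants and layout).
// Every processor agrees that thread i's stack lives at BASE + i * SLOT, so a
// migrated thread's stack keeps the same virtual address on its new processor.
// Regions are mapped lazily and unmapped after migration to limit VM clutter.
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

static const std::uintptr_t BASE = 0x100000000000ULL;  // assumed-free range (demo only)
static const std::size_t    SLOT = 1 << 20;            // 1 MB of address space per thread

void* map_thread_stack(int threadId) {
  void* want = (void*)(BASE + (std::uintptr_t)threadId * SLOT);
  // MAP_FIXED forces the agreed-upon address; a real system would first reserve
  // the whole range so MAP_FIXED cannot clobber unrelated mappings.
  void* got = mmap(want, SLOT, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  return (got == MAP_FAILED) ? nullptr : got;
}

void unmap_thread_stack(int threadId) {                 // e.g., after migrating away
  munmap((void*)(BASE + (std::uintptr_t)threadId * SLOT), SLOT);
}

int main() {
  void* s = map_thread_stack(7);
  std::printf("thread 7's stack lives at %p on every processor\n", s);
  unmap_thread_stack(7);
}
```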

Page 17: Application Performance: ROCFLO

[Chart: ROCFLO with MPI and AMPI; run time in seconds (y-axis, 0 to 1800) versus number of processors (1, 2, 4, 8, 16, 32), with one series per implementation (MPI, AMPI)]

Page 18: ROCSOLID: scaled problem size

[Chart: ROCSOLID with MPI and AMPI, scaled problem size; run time in seconds (y-axis, 56 to 78) versus number of processors (1, 8, 32, 64), with one series per implementation (MPI, AMPI)]

Page 19: Other applications of adaptive objects

Adaptive objects are a versatile idea:
Cache performance optimization
Out-of-core codes: automatic prefetch of each object and its parameters before a method executes
Flexible checkpointing
Timeshared parallel machines

Effective utilization of resources

Page 20: Checkpointing

When running on a large configuration, e.g. 4096 processors of a 4096-PE system:
If one PE is down, you can’t restart from the last checkpoint!
In some applications the number of PEs must be a power of two (or a perfect square, ...)
A solution based on objects:
Checkpoint objects
Note: the number of objects, not the number of PEs, needs to be a power of two
Restart on a different number of processors!
Let the adaptive balancer migrate objects to restore efficiency
Requires several innovations, but the potential is open with objects
We have a preliminary system implemented for this; it can restart on fewer or more processors
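A hedged, plain-C++ sketch of the object-granularity idea (not the CSAR system): each object checkpoints to its own file keyed by object index, so a restart on a different number of processors simply deals the object files out among whatever PEs are available and lets the load balancer refine the mapping afterwards. File names and the round-robin layout are assumptions made for the sketch.

```cpp
// Sketch of object-granularity checkpointing (illustrative file layout and names).
// Because checkpoints are per object rather than per PE, a restart may use any
// number of processors: each restarted PE reads the objects assigned to it.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct ChunkState {                        // whatever one object needs to resume
  int step = 0;
  std::vector<double> field;
};

static std::string ckptName(int objId) {
  return "ckpt_obj_" + std::to_string(objId) + ".dat";
}

bool checkpoint(int objId, const ChunkState& s) {
  std::FILE* f = std::fopen(ckptName(objId).c_str(), "wb");
  if (!f) return false;
  std::size_t n = s.field.size();
  std::fwrite(&s.step, sizeof s.step, 1, f);
  std::fwrite(&n, sizeof n, 1, f);
  if (n) std::fwrite(s.field.data(), sizeof(double), n, f);
  std::fclose(f);
  return true;
}

bool restore(int objId, ChunkState& s) {
  std::FILE* f = std::fopen(ckptName(objId).c_str(), "rb");
  if (!f) return false;
  std::size_t n = 0;
  bool ok = std::fread(&s.step, sizeof s.step, 1, f) == 1 &&
            std::fread(&n, sizeof n, 1, f) == 1;
  if (ok) {
    s.field.resize(n);
    if (n) ok = std::fread(s.field.data(), sizeof(double), n, f) == n;
  }
  std::fclose(f);
  return ok;
}

int main() {
  // Toy layout: each PE takes objects round-robin, so a restart may use a
  // different numPEs than the run that wrote the checkpoints.
  const int numObjects = 16, myPE = 0, numPEs = 3;
  for (int obj = myPE; obj < numObjects; obj += numPEs) {
    ChunkState s;
    s.field.assign(8, 0.0);
    checkpoint(obj, s);                    // on restart, call restore(obj, s) instead
  }
}
```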

Page 21: Time shared parallel machines

Need the ability to shrink and expand a job to the available number of processors
Again, we can do that with objects:
Completely vacate processors when needed
Use a fraction of the power of a desktop workstation
Already available for Charm++ and AMPI
Need quality-of-service contracts:
Min/max PEs
Deadlines
Performance profile
Selection of machine: negotiation
“Faucets” project in progress

Will link up with Globus

Page 22: Contributors

AMPI library: Milind Bhandarkar
Timeshared clusters: Sameer Kumar
Checkpointing: Sameer Paranjpye
Charm++ support: Robert Brunner, Orion Lawlor
ROCFLO/ROCSOLID development and conversion to Charm:
Prasad Alavilli, Eric de Sturler, Jay Hoeflinger, Jim Jiao, Fady Najar, Ali Namazifard, David Padua, Dennis Parsons, Zhe Zhang, Jie Zheng

Page 23: Laxmikant (Sanjay) V. Kale

Dept. of Computer Science

and

Center for Simulation of Advanced Rockets

University of Illinois at Urbana-Champaign

3304 Digital Computer Laboratory

1304 West Springfield Avenue

Urbana, IL 61801 USA

[email protected]

http://charm.cs.uiuc.edu/~kale

telephone: 217-244-0094

fax: 217-333-1910