TRANSCRIPT
Supporting Multi-domain Decomposition for MPI Programs
Laxmikant Kale, Computer Science
18 May 2000
Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
©2000 Board of Trustees of the University of Illinois
Adaptive Strategies
Dynamic behavior of rocket simulation components:
- Burning away of solid fuel
- Adaptive refinement
- Possible external interference on timeshared clusters
This requires adaptive load balancing strategies. Idea: automatically adapt to variations.
- Multi-domain decomposition
- Minimal influence on component applications:
  - They should be written as if stand-alone code
  - Minimal recoding/effort to integrate
Multi-domain decomposition
- Divide the computation into a large number of small pieces
- Typically many more pieces than processors
- Let the system map pieces (chunks) to processors
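The mapping step can be sketched as a simple greedy assignment; this is an illustrative toy (names like `map_chunks` are invented here, not CSAR or Charm++ API), assuming per-chunk load estimates are available:

```python
# A minimal sketch of multi-domain decomposition: the computation is split
# into many more chunks than processors, and the "system" (here, a greedy
# heuristic) chooses the chunk-to-processor mapping. Illustrative only.

def map_chunks(chunk_loads, num_pes):
    """Greedily assign each chunk to the currently least-loaded PE."""
    pe_load = [0.0] * num_pes
    mapping = {}
    # Place heavy chunks first for a better greedy result.
    for chunk, load in sorted(chunk_loads.items(), key=lambda kv: -kv[1]):
        target = min(range(num_pes), key=lambda pe: pe_load[pe])
        mapping[chunk] = target
        pe_load[target] += load
    return mapping, pe_load

if __name__ == "__main__":
    # 13 chunks onto 4 PEs -- far more pieces than processors.
    loads = {c: 1.0 + (c % 3) * 0.5 for c in range(1, 14)}
    mapping, pe_load = map_chunks(loads, 4)
    print(mapping)
    print(pe_load)
```

Because there are many chunks per PE, even a simple heuristic like this keeps per-PE loads close; the real benefit, as the next slides argue, is that the system can *remap* later.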
Multi-domain decomposition: Advantages
- Separation of concerns: the system automates what it can do best
- The system can remap chunks to deal with load imbalances and external interference
- Preserves and encourages modularity: multiple codes can interleave
Multi-domain decomposition
[Figure: thirteen chunks (1-13) are handed to the runtime system (RTS), which maps them onto four processors, PE0-PE3.]
Need Data Driven Execution Support
There are multiple chunks on each processor:
- How should execution interleave between them? Who gets the chance to run next?
Charm++ provides this support:
- Parallel C++ with data-driven objects
- Object groups: a global object with a "representative" on each PE
- Object arrays and object collections
- Asynchronous method invocation
- Prioritized scheduling
- Mature, robust, portable: http://charm.cs.uiuc.edu
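The data-driven, prioritized-scheduling idea can be sketched as follows. This is a toy model in the spirit of the Charm++ scheduler, not its actual implementation; the `Scheduler` and `Chunk` names are invented for illustration:

```python
# Data-driven execution with prioritized scheduling: each message names a
# target object and method; the scheduler repeatedly delivers the
# highest-priority pending message. Illustrative sketch only.

import heapq
import itertools

class Scheduler:
    def __init__(self):
        self._q = []
        self._seq = itertools.count()   # FIFO order among equal priorities

    def send(self, priority, obj, method, *args):
        # Lower number = higher priority.
        heapq.heappush(self._q, (priority, next(self._seq), obj, method, args))

    def run(self):
        log = []
        while self._q:
            _, _, obj, method, args = heapq.heappop(self._q)
            log.append(getattr(obj, method)(*args))
        return log

class Chunk:
    def __init__(self, name):
        self.name = name
    def work(self, step):
        return (self.name, step)

if __name__ == "__main__":
    sched = Scheduler()
    a, b = Chunk("A"), Chunk("B")
    sched.send(5, a, "work", 1)
    sched.send(1, b, "work", 1)   # higher priority: delivered first
    print(sched.run())            # [('B', 1), ('A', 1)]
```

The key property: no chunk blocks the processor; whichever object has a pending (highest-priority) message gets to run next.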
Data driven execution
[Figure: each processor runs a scheduler that picks the next message from its message queue and delivers it to the target object.]
Object based load balancing
Application-induced imbalances in CSE applications:
- Abrupt, but infrequent, or
- Slow, cumulative
- Rarely: frequent, large changes
Principle of persistence (an extension of the principle of locality):
- The behavior of objects, including computational load and communication patterns, tends to persist over time
Measurement-based load balancing:
- Periodic, or triggered
- Automatic
- A framework is already implemented, with strategies that exploit this principle automatically
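The measurement-based idea can be illustrated in a few lines: under the principle of persistence, the times measured in the last period are used as the prediction for the next, and a remapping is computed from them. This sketch (the `rebalance` name and greedy strategy are illustrative, not the framework's actual strategies):

```python
# Measurement-based load balancing sketch: measured per-object times from
# the last period predict the next period; objects are placed greedily and
# any object whose PE changed is a migration. Illustrative only.

def rebalance(placement, measured_time, num_pes):
    """placement: {obj: pe}; measured_time: {obj: seconds}.
    Returns (new_placement, list_of_migrated_objects)."""
    new_placement = {}
    pe_load = [0.0] * num_pes
    for obj in sorted(measured_time, key=measured_time.get, reverse=True):
        pe = min(range(num_pes), key=lambda p: pe_load[p])
        new_placement[obj] = pe
        pe_load[pe] += measured_time[obj]
    migrations = [o for o in placement if placement[o] != new_placement[o]]
    return new_placement, migrations

if __name__ == "__main__":
    # All three objects started on PE 0; measurements show imbalance.
    new, moved = rebalance({"a": 0, "b": 0, "c": 0},
                           {"a": 3.0, "b": 2.0, "c": 1.0}, 2)
    print(new, moved)
```

A real strategy would also weigh the communication graph (next slide) so that heavily communicating objects land together; this sketch uses times alone.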
Object based load balancing
Automatic instrumentation via the Charm RTS:
- Object times
- Object communication graph: number and total size of messages between every pair of objects
Suite of remapping strategies:
- Centralized, for the periodic case (several strategies)
- Distributed strategies, for frequent load balancing and for workstation clusters
In fruitful use for several applications, within CSAR and outside
Example of Recent Speedup Results: Molecular Dynamics on ASCI Red
[Figure: "Speedup on ASCI Red: Apo-A1, 92,000 atoms" — speedup vs. number of processors; x-axis 0-1800 processors, y-axis 0-900 speedup.]
Adaptive objects:
How can they be used in more traditional (MPI) programs?
- AMPI: adaptive, array-based MPI, being used for Rocket Simulation component applications
What other consequences and benefits follow?
- Automatic checkpointing scheme
- Timeshared parallel machines and clusters
Using Adaptive Objects Framework
The framework requires Charm++, while rocket simulation components use MPI. No problem! We have implemented AMPI on top of Charm++:
- User-level threads embedded in objects
- Multiple threads per processor
- Thread migration!
Multiple threads per processor means no global variables:
- MPI programs must be changed a bit to encapsulate globals
- Automatic conversion is possible: future work with Hoeflinger/Padua
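The "encapsulate globals" transformation can be sketched as follows. The names here (`RankState`, `step`) are illustrative, not AMPI's API; the point is only the shape of the change an MPI code needs:

```python
# Sketch of global-variable encapsulation: state that would have been
# global in an MPI program moves into a per-rank object, so many "ranks"
# (user-level threads) can coexist in one address space and migrate.

class RankState:
    """Holds what used to be globals: one instance per MPI 'process'."""
    def __init__(self, rank):
        self.rank = rank
        self.iteration = 0     # formerly: a global iteration counter
        self.residual = 1.0    # formerly: a global residual value

def step(state):
    # Every former global access now goes through the state object.
    state.iteration += 1
    state.residual /= 2.0
    return state.residual

if __name__ == "__main__":
    # Four "ranks" sharing one address space without clashing.
    ranks = [RankState(r) for r in range(4)]
    for st in ranks:
        step(st); step(st)
    print([st.residual for st in ranks])   # each rank's state is independent
```

In Fortran/C the same idea is a module or struct threaded through the call chain; the automatic-conversion work mentioned above would mechanize exactly this rewrite.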
AMPI
Array-based Adaptive MPI:
- Each MPI "process" runs in a user-level thread embedded in a Charm++ object
- All MPI calls are (or will be) supported
- Threads are migratable! They migrate under the control of the load balancer/RTS
Making threads migrate
Thread migration is difficult:
- Pointers within the stack
- Need to minimize user intervention
Solution strategy I: multiple threads (stacks) on each PE, but use only one executable stack; copy the entire stack in and out on each context switch!
Advantages and disadvantages:
- Pointers are safe: migrated threads run in the stack at the same virtual address
- Higher overhead at context switch, even when no migration occurs. How much is this overhead?
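A toy model of strategy I may make the mechanism concrete. A real implementation memcpy's the machine stack; here a `bytearray` stands in for the single shared stack region (all names are illustrative):

```python
# Strategy I sketch: all threads execute in ONE shared stack region, so
# stack addresses never change; the scheduler copies a thread's stack
# contents out and in at each context switch. Toy model, not real threads.

SHARED_STACK = bytearray(64)          # the single executable stack region

def switch(saved_stacks, out_tid, in_tid):
    """Save the running thread's stack, restore the incoming thread's."""
    saved_stacks[out_tid] = bytes(SHARED_STACK)   # copy out (O(stack size))
    SHARED_STACK[:] = saved_stacks[in_tid]        # copy in  (O(stack size))
    return in_tid

if __name__ == "__main__":
    saved = {0: bytes(64), 1: bytes(64)}   # blank stacks for two threads
    SHARED_STACK[:4] = b"AAAA"             # thread 0 writes a frame
    switch(saved, 0, 1)                    # thread 1 now runs...
    SHARED_STACK[:4] = b"BBBB"             # ...in the SAME region
    switch(saved, 1, 0)                    # back to thread 0
    print(SHARED_STACK[:4])                # thread 0's frame is back, intact
```

Both copies in `switch` are proportional to the stack size, which is exactly the overhead the next slide measures.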
Context switching overhead
[Figure: context-switching time in microseconds vs. stack size in bytes (0 to 16384), for two schemes: Stack-Copy and Stack-pointer swap; y-axis 0-50 µs. Stack-copy cost grows with stack size, while the pointer swap stays essentially flat.]
Low overhead migratable threads
Challenge: if the stack size is large, the overhead is large even without migration.
- Possible remedy: reduce stack size
The iso_malloc idea (due to a group at ENS Lyon, France):
- Allocate the same virtual address space for each thread on all processors, when the thread is created
Now:
- Context switching is fast (a pointer swap)
- Migration is still possible, as before
Scalability optimizations: map and unmap, to avoid virtual memory clutter
Application Performance: ROCFLO
[Figure: "ROCFLO with MPI and AMPI" — execution time in seconds vs. processors (1, 2, 4, 8, 16, 32); y-axis 0-1800 s; two series: MPI and AMPI.]
ROCSOLID: scaled problem-size
[Figure: "ROCSOLID with MPI and AMPI" — execution time in seconds vs. processors (1, 8, 32, 64), scaled problem size; y-axis 56-78 s; two series: MPI and AMPI.]
Other applications of adaptive objects
Adaptive objects are a versatile idea:
- Cache performance optimization
- Out-of-core codes: automatic prefetch of each object's method parameters and state before execution
- Flexible checkpointing
- Timeshared parallel machines: effective utilization of resources
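The out-of-core prefetch point follows directly from data-driven execution: the scheduler's queue reveals which object runs next, so its state can be fetched before the method executes. A toy sketch, with a dict standing in for disk and all names invented for illustration:

```python
# Out-of-core prefetch sketch: while one object's method runs, the NEXT
# queued object's state is brought in from backing store, so it is already
# in memory when its turn comes. A real system would fetch asynchronously.

from collections import deque

disk = {"obj1": {"x": 1}, "obj2": {"x": 2}}   # out-of-core object states
memory = {}                                    # in-core cache

def run_queue(queue):
    results = []
    queue = deque(queue)
    while queue:
        obj_id, method = queue.popleft()
        if queue:                              # peek at the next invocation
            nxt = queue[0][0]
            memory.setdefault(nxt, disk[nxt])  # "prefetch" its state
        state = memory.setdefault(obj_id, disk[obj_id])  # demand-load if needed
        results.append(method(state))
    return results

if __name__ == "__main__":
    q = [("obj1", lambda s: s["x"] * 10), ("obj2", lambda s: s["x"] * 10)]
    print(run_queue(q))   # [10, 20]
```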
Checkpointing
When running on a large configuration (e.g., 4096 processors of a 4096-PE system):
- If one PE is down, you can't restart from the last checkpoint!
- In some applications, the number of PEs must be a power of two (or a square, ...)
A solution based on objects:
- Checkpoint objects, not processor images
- Note: the number of objects needs to be a power of two, not the number of PEs
- Restart on a different number of processors!
- Let the adaptive balancer migrate objects to restore efficiency
- Requires several innovations, but the potential is opened by objects
We have a preliminary system implemented for this; it can restart on fewer or more processors.
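The essential trick can be sketched in a few lines: because the checkpoint records object states rather than per-PE memory images, restarting on a different processor count is just a remapping. Illustrative only (the function names and round-robin placement are invented, not the preliminary system's design):

```python
# Object-level checkpointing sketch: the checkpoint is PE-count-independent,
# so a run can restart on a different number of processors; the load
# balancer would then refine the initial placement.

import pickle

def checkpoint(objects):
    """objects: {obj_id: state}. Serialize a PE-count-independent blob."""
    return pickle.dumps(objects)

def restart(blob, num_pes):
    """Restore on num_pes processors with a round-robin initial placement."""
    objects = pickle.loads(blob)
    placement = {oid: i % num_pes for i, oid in enumerate(sorted(objects))}
    return objects, placement

if __name__ == "__main__":
    blob = checkpoint({f"chunk{i}": {"step": 7} for i in range(8)})
    objs, place = restart(blob, 3)       # restart on a DIFFERENT PE count
    print(sorted(set(place.values())))   # all three PEs received objects
```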
Time shared parallel machines
Need the ability to shrink and expand a job to the available number of processors. Again, we can do that with objects:
- Completely vacating processors when needed
- Using a fraction of the power of a desktop workstation
- Already available for Charm++ and AMPI
Need quality-of-service contracts:
- Min/max PEs
- Deadlines
- Performance profile
- Selection of machine: negotiation
- "Faucets" project in progress; will link up with Globus
Contributors
AMPI library: Milind Bhandarkar
Timeshared clusters: Sameer Kumar
Checkpointing: Sameer Paranjpye
Charm++ support: Robert Brunner, Orion Lawlor
ROCFLO/ROCSOLID development and conversion to Charm:
Prasad Alavilli, Eric de Sturler, Jay Hoeflinger, Jim Jiao, Fady Najar, Ali Namazifard, David Padua, Dennis Parsons, Zhe Zhang, Jie Zheng
Laxmikant (Sanjay) V. Kale
Dept. of Computer Science
and
Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
3304 Digital Computer Laboratory
1304 West Springfield Avenue
Urbana, IL 61801 USA
http://charm.cs.uiuc.edu/~kale
telephone: 217-244-0094
fax: 217-333-1910