
Page 1: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Gengbin Zheng

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

Page 2: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 2

Page 3: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Motivation

Challenges: new-generation parallel applications are
• Dynamically varying: load shifting during execution
• Adaptively refined
• Composed of multi-physics modules

Typical MPI implementations:
• Not naturally suitable for dynamic applications
• Available processor set may not match the algorithm

Alternative: Adaptive MPI (AMPI)
• MPI + Charm++ virtualization: VPs ("Virtual Processors")

AMPI: Adaptive MPI 3

Page 4: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 4

Page 5: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview

Virtualization: MPI ranks → Charm++ threads

AMPI: Adaptive MPI 5

[Figure: MPI "tasks" mapped onto real processors. The MPI tasks are implemented as user-level migratable threads (VPs: virtual processors).]

Page 6: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview (cont.)

AMPI Execution Model:

AMPI: Adaptive MPI 6

• Multiple user-level threads per process

• Typically, one process per physical processor

• Charm++ Scheduler coordinates execution

• Threads (VPs) can migrate across processors

• Virtualization ratio: R = #VP / #P (over-decomposition)

[Figure: a single Charm++ scheduler within one process driving four VPs (P=1, VP=4).]

Page 7: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview (cont.)

AMPI's Over-Decomposition in Practice

AMPI: Adaptive MPI 7

[Figure: MPI with P=4, ranks=4 vs. AMPI with P=4, VP=ranks=16.]

Page 8: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 8

Page 9: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI

Overlap between Computation/Communication

AMPI: Adaptive MPI 9

• Automatically achieved
• When one thread blocks for a message, another thread in the same processor can execute
• The Charm++ scheduler picks the next thread among those that are ready to run

Page 10: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Potentially Better Cache Utilization

AMPI: Adaptive MPI 10

Gains occur when subdomain is accessed repeatedly (e.g. by multiple functions, called in sequence)

[Figure annotation: pieces 1 and 2 might fit in cache, but piece 3 might not fit.]

Page 11: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Thread Migration for Load Balancing

AMPI: Adaptive MPI 11

[Figure: migration of thread 13 between processors.]

Page 12: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Load Balancing in AMPI: MPI_Migrate()

Collective operation informing the load balancer that the threads can be migrated, if needed, to balance the load
• Easy to insert in the code of iterative applications (see the sketch below)
• Leverages the load-balancing framework of Charm++
• Balancing decisions can be based on:
  - Measured parameters: computation load, communication pattern
  - Application-provided information
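A minimal sketch of how MPI_Migrate() (the AMPI extension named above) might be dropped into an iterative application; the loop structure, the migration interval, and compute_step() are illustrative placeholders, not part of AMPI.

  #include <mpi.h>

  void compute_step(void);   /* placeholder for the application's per-iteration work */

  void solver_loop(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          compute_step();
          /* Every 20 iterations, give the Charm++ load balancer a chance
           * to migrate threads; the call is collective over all ranks. */
          if (step % 20 == 19)
              MPI_Migrate();
      }
  }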

AMPI: Adaptive MPI 12

Page 13: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Decoupling of Physical/Virtual Processors

AMPI: Adaptive MPI 13

Problem setup: 3D stencil calculation of size 240³, run on Lemieux.

AMPI runs on any number of PEs (e.g. 19, 33, 105). Native MPI needs P = K³.

[Plot: execution time (sec) vs. number of processors (10 to 1000, log-log scale), comparing Native MPI and AMPI.]

Page 14: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Asynchronous Implementation of Collectives

The collective operation is posted and returns immediately; test/wait for its completion and do useful work in the meantime (see the sketch below):

  MPI_Ialltoall( … , &req);
  /* other computation */
  MPI_Wait(&req, MPI_STATUS_IGNORE);

Other operations available: MPI_Iallreduce, MPI_Iallgather
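A hedged sketch of the overlap pattern above; the buffer sizes, datatypes, and do_local_work() are illustrative, and the argument list follows the MPI-3-style nonblocking signature (AMPI's early extension may have differed slightly).

  #include <mpi.h>

  void do_local_work(void);   /* computation that does not depend on recvbuf */

  void exchange_and_compute(double *sendbuf, double *recvbuf,
                            int count_per_rank, MPI_Comm comm)
  {
      MPI_Request req;

      /* Post the all-to-all; the call returns immediately. */
      MPI_Ialltoall(sendbuf, count_per_rank, MPI_DOUBLE,
                    recvbuf, count_per_rank, MPI_DOUBLE, comm, &req);

      do_local_work();   /* overlap: useful work while the exchange progresses */

      /* Block only when the exchanged data is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }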

Example: 2D FFT benchmark. [Chart: time (ms, 0 to 100) for AMPI and Native MPI on 4, 8, and 16 processors, broken down into 1D FFT, All-to-all, and Wait phases.]

Page 15: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Motivation for Collective Communication Optimization

[Chart: time (ms, 0 to 900) vs. message size (76 to 8076 bytes) for an all-to-all using the Mesh library, showing total "Mesh" time and the "Mesh Compute" portion.]

AMPI: Adaptive MPI 15

Time breakdown of an all-to-all operation using Mesh library

Computation is only a small proportion of the elapsed time. A number of optimization techniques have been developed to improve collective communication performance.

Page 16: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Fault Tolerance via Checkpoint/Restart

• State of the application is checkpointed to disk or memory
• Capable of restarting on a different number of physical processors!
• Synchronous checkpoint, collective call (see the sketch below):
  - In-disk: MPI_Checkpoint(DIRNAME)
  - In-memory: MPI_MemCheckpoint(void)
• Restart:
  - In-disk: charmrun +p4 prog +restart DIRNAME
  - In-memory: automatic restart upon failure detection
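A minimal sketch of periodic checkpointing with the AMPI extensions named above; the checkpoint interval, the directory name "ckpt", and compute_step() are illustrative.

  #include <mpi.h>

  void compute_step(void);   /* placeholder for the application's per-iteration work */

  void run_with_checkpoints(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          compute_step();
          if (step % 100 == 99) {
              /* Collective, synchronous checkpoint to disk... */
              MPI_Checkpoint("ckpt");
              /* ...or, alternatively, an in-memory (double) checkpoint:
               * MPI_MemCheckpoint();
               */
          }
      }
  }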

AMPI: Adaptive MPI 16

Page 17: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 17

Page 18: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Converting MPI Codes to AMPI

AMPI needs its own initialization before user code; the Fortran program entry point becomes MPI_Main:

  Original:               Converted:
  program pgm             subroutine MPI_Main
  ...                     ...
  end program             end subroutine

The C program entry point is handled automatically via mpi.h; include it in the same file as main() if absent (a minimal example follows).

If the code has no global/static variables, this is all that is needed to convert!
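For C, a standard MPI program with no global or static state works unchanged; the sketch below is ordinary MPI code, nothing AMPI-specific, and only needs to be built with ampicc.

  #include <stdio.h>
  #include <mpi.h>

  /* Ordinary MPI code; under AMPI each rank runs as a user-level thread (VP). */
  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }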

AMPI: Adaptive MPI 18

Page 19: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 19

Page 20: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables

Global and static variables are a problem in multi-threaded programs (a similar problem exists in OpenMP):
• Globals/statics have a single instance per process
• They become shared by all threads in the process
Example:

AMPI: Adaptive MPI 20

Thread 1                       Thread 2
var = myid (1)
MPI_Recv()
  (block...)
                               var = myid (2)
                               MPI_Recv()
                                 (block...)
b = var

(time flows downward)

If var is a global/static, an incorrect value is read! (A code sketch follows.)
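A small C sketch of the hazard illustrated above; the variable names and message details are illustrative.

  #include <mpi.h>

  int var;   /* one instance per process: shared by all AMPI threads in it */

  /* Executed by each AMPI rank (user-level thread). If rank 1 blocks in
   * MPI_Recv and rank 2 (in the same process) then assigns var, rank 1
   * later reads rank 2's value instead of its own. */
  void step(int myid, MPI_Comm comm)
  {
      int msg, b;
      var = myid;                                   /* write the shared global */
      MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0,
               comm, MPI_STATUS_IGNORE);            /* thread may block and yield */
      b = var;        /* may observe the other rank's myid */
      (void)b;
  }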

Page 21: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

• General solution: privatize the variables within each thread
• Approaches:

a) Swap global variables

b) Source-to-source transformation via Photran

c) Use TLS scheme (in development)

Specific approach to use must be decided on a case-by-case basis

AMPI: Adaptive MPI 21

Page 22: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

First Approach: Swap global variables
• Leverages ELF, the Executable and Linking Format (e.g. on Linux)
• ELF maintains a Global Offset Table (GOT) for globals
• Switch the GOT contents at thread context switch
• Implemented in AMPI via the build flag -swapglobals
+ No source code changes needed
+ Works with any language (C, C++, Fortran, etc.)
- Does not handle static variables
- Context-switch overhead grows with the number of variables

AMPI: Adaptive MPI 22

Page 23: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Second Approach: Source-to-source transformation
• Move globals/statics to an object, then pass it around
• Automatic solution for Fortran codes: Photran
• A similar idea can be applied to C/C++ codes
+ Totally portable across systems/compilers
+ May improve locality and cache utilization
+ No extra overhead at context switch
- Requires a new implementation for each language

AMPI: Adaptive MPI 23

Page 24: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Example of Transformation: C Program

AMPI: Adaptive MPI 24

[Figure: original C code vs. transformed code.]
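The slide's code listing did not survive the transcript; below is a representative sketch of the transformation, under the assumption that globals are gathered into a struct allocated per rank and passed explicitly. The names (myrank, GlobalVars, compute) are illustrative.

  /* Original code: a process-wide global, shared by all AMPI threads. */
  int myrank;

  void compute(void) {
      /* ... uses myrank ... */
  }

  /* Transformed code (sketch): the global moves into a per-rank object. */
  typedef struct {
      int myrank;
  } GlobalVars;

  void compute_transformed(GlobalVars *g) {
      /* ... uses g->myrank ... */
  }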

Page 25: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Example of Photran Transformation: Fortran Program

AMPI: Adaptive MPI 25

[Figure: original Fortran code vs. Photran-transformed code.]

Page 26: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Photran Transformation Tool
• Eclipse-based IDE, implemented in Java
• Incorporates automatic refactorings for Fortran codes
• Operates on "pure" Fortran 90 programs
• Code transformation infrastructure:
  - Constructs rewriteable ASTs
  - ASTs are augmented with binding information

AMPI: Adaptive MPI 26

Source: Stas Negara & Ralph Johnson

http://www.eclipse.org/photran/

Page 27: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

AMPI: Adaptive MPI 27

Source: Stas Negara & Ralph Johnson

http://www.eclipse.org/photran/

[Screenshot: Photran-AMPI GUI.]

Page 28: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

NAS Benchmark

AMPI: Adaptive MPI 28

Page 29: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

FLASH Results

AMPI: Adaptive MPI 29

Page 30: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Third Approach: TLS scheme (Thread-Local Storage)
• Originally employed for kernel threads
• In C code, variables are annotated with __thread (see the sketch below)
• A modified/adapted gfortran compiler is available
+ Handles both globals and statics uniformly
+ No extra overhead at context switch
- Although popular, not yet a standard for compilers
- Current Charm++ support only for x86 platforms
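A minimal sketch of the annotation the TLS approach relies on; the variable names are illustrative, and __thread is a GCC extension (C11 spells it _Thread_local).

  /* Each (user-level) thread gets its own copy of these variables. */
  __thread int myrank;
  static __thread double partial_sum;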

AMPI: Adaptive MPI 30

Page 31: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Summary of Current Privatization Schemes:
• Program transformation is very portable
• The TLS scheme may become supported on Blue Waters, depending on work with IBM

AMPI: Adaptive MPI 31

Privatization Scheme | X86 | IA64  | Opteron | MacOS | IBM Power | SUN   | IBM BG/P | Cray XT | Windows
Prog. Transf.        | Yes | Yes   | Yes     | Yes   | Yes       | Yes   | Yes      | Yes     | Yes
Swap Globals         | Yes | Yes   | Yes     | No    | No        | Maybe | No       | No      | No
TLS                  | Yes | Maybe | Yes     | No    | Maybe     | Maybe | No       | Yes     | Maybe

Page 32: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

NAS Benchmark

AMPI: Adaptive MPI 32

Page 33: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

FLASH Results

FLASH is a parallel, multi-dimensional code used to study astrophysical fluids.

Many astrophysical environments are highly turbulent, and have structure on scales varying from large scale, like galaxy clusters, to small scale, like active galactic nuclei, in the same system.

AMPI: Adaptive MPI 33

Page 34: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 34

Page 35: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Object Migration

AMPI: Adaptive MPI 35

Page 36: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Object Migration

How do we move work between processors?
• Application-specific methods
  - E.g., move rows of a sparse matrix, elements of an FEM computation
  - Often very difficult for the application
• Application-independent methods
  - E.g., move an entire virtual processor
  - The application's problem decomposition doesn't change

AMPI: Adaptive MPI 36

Page 37: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

How to Migrate a Virtual Processor?

Move all application state to the new processor:
• Stack data
  - Subroutine variables and calls
  - Managed by the compiler
• Heap data
  - Allocated with malloc/free
  - Managed by the user
• Global variables
• Open files, environment variables, etc. (not handled yet!)

AMPI: Adaptive MPI 37

Page 38: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Stack Data

The stack is used by the compiler to track function calls and provide temporary storage:
• Local variables
• Subroutine parameters
• C "alloca" storage

Most of the variables in a typical application are stack data

AMPI: Adaptive MPI 38

Page 39: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

• Without compiler support, we cannot change the stack's address, because we can't fix up the stack's interior pointers (return frame pointer, function arguments, etc.)
• Solution: "isomalloc" addresses
  - Reserve address space on every processor for every thread stack
  - Use mmap to scatter stacks in virtual memory efficiently (a conceptual sketch follows)
  - Idea comes from PM2
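A conceptual sketch of the idea (not AMPI's actual implementation): reserve the same virtual-address range for a given thread's stack on every processor, so interior pointers remain valid after migration. The function name, fixed address, and size are illustrative.

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>

  /* Map an anonymous region at a pre-agreed address. Every processor
   * reserves the same slot for the same thread, so a migrated stack can
   * be copied in without fixing up any pointers. */
  void *reserve_stack_slot(void *agreed_addr, size_t size)
  {
      void *p = mmap(agreed_addr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
  }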

AMPI: Adaptive MPI 39

Page 40: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

AMPI: Adaptive MPI 40

[Diagram: before migration. Processor A's memory (addresses 0x00000000 to 0xFFFFFFFF) holds code, globals, heap, and the stacks of threads 2, 3, and 4; Processor B's memory holds code, globals, heap, and thread 1's stack. Thread 3 is about to migrate from A to B.]

Page 41: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

AMPI: Adaptive MPI 41

[Diagram: after migration. Thread 3's stack now resides in Processor B's memory alongside thread 1's stack, in the same reserved address range; Processor A retains the stacks of threads 2 and 4.]

Page 42: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

Isomalloc is a completely automatic solution:
• No changes needed in the application or compilers
• Just like a software shared-memory system, but with proactive paging

But it has a few limitations:
• Depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
• Depends on unportable mmap behavior: which addresses are safe? (We must guess!) What about Windows? Blue Gene?

AMPI: Adaptive MPI 42

Page 43: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Heap Data

Heap data is any dynamically allocated data:
• C "malloc" and "free"
• C++ "new" and "delete"
• F90 "ALLOCATE" and "DEALLOCATE"

Arrays and linked data structures are almost always heap data

AMPI: Adaptive MPI 43

Page 44: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data

Automatic solution: isomalloc all heap data, just like the stacks!
• "-memory isomalloc" link option
• Overrides malloc/free
• No new application code needed
• Same limitations as isomalloc

Manual solution: the application moves its own heap data
• Need to be able to size the message buffer, pack data into the message, and unpack it on the other side
• The "pup" abstraction does all three

AMPI: Adaptive MPI 44

Page 45: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP

• Same idea as MPI derived types, but the datatype description is code, not data
• Basic contract: "here is my data"
  - Sizing: counts up the data size
  - Packing: copies data into the message
  - Unpacking: copies data back out
  - The same call works for network, memory, disk I/O, ...
• Register a "pup routine" with the runtime
  - F90/C interface: subroutine calls, e.g. pup_int(p,&x);
  - C++ interface: operator| overloading, e.g. p|x;

AMPI: Adaptive MPI 45

Page 46: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP Builtins

Supported PUP datatypes:
• Basic types (int, float, etc.)
• Arrays of basic types
• Unformatted bytes

Extra support in C++:
• Can overload user-defined types: define your own operator|
• Support for pointer-to-parent class: PUP::able interface
• Supports STL vector, list, map, and string: "pup_stl.h"
• Subclass your own PUP::er object

AMPI: Adaptive MPI 46

Page 47: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP C++ Example

AMPI: Adaptive MPI 47

#include "pup.h"
#include "pup_stl.h"

class myMesh {
  std::vector<float> nodes;
  std::vector<int> elts;
public:
  ...
  void pup(PUP::er &p) {
    p|nodes;
    p|elts;
  }
};

Page 48: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP C Example

AMPI: Adaptive MPI 48

typedef struct myMesh {
  int nn, ne;
  float *nodes;
  int *elts;
} myMesh;

void pupMesh(pup_er p, myMesh *mesh) {
  pup_int(p, &mesh->nn);
  pup_int(p, &mesh->ne);
  if (pup_isUnpacking(p)) { /* allocate data on arrival */
    mesh->nodes = malloc(mesh->nn * sizeof(float));
    mesh->elts  = malloc(mesh->ne * sizeof(int));
  }
  pup_floats(p, mesh->nodes, mesh->nn);
  pup_ints(p, mesh->elts, mesh->ne);
  if (pup_isDeleting(p)) { /* free data on departure */
    deleteMesh(mesh);
  }
}

Page 49: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP F90 Example

AMPI: Adaptive MPI 49

TYPE myMesh
  INTEGER :: nn, ne
  REAL*4, ALLOCATABLE :: nodes(:)
  INTEGER, ALLOCATABLE :: elts(:)
END TYPE

SUBROUTINE pupMesh(p, mesh)
  USE MODULE ...
  INTEGER :: p
  TYPE(myMesh) :: mesh
  CALL fpup_int(p, mesh%nn)
  CALL fpup_int(p, mesh%ne)
  IF (fpup_isUnpacking(p)) THEN
    ALLOCATE(mesh%nodes(mesh%nn))
    ALLOCATE(mesh%elts(mesh%ne))
  END IF
  CALL fpup_floats(p, mesh%nodes, mesh%nn)
  CALL fpup_ints(p, mesh%elts, mesh%ne)
  IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
END SUBROUTINE

Page 50: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Global Data

Global data is anything stored at a fixed place:
• C/C++ "extern" or "static" data
• F77 "COMMON" blocks
• F90 "MODULE" data

It is a problem if multiple objects/threads try to store different values in the same place (thread safety). Compilers should make all of these per-thread, but they don't! It is not a problem if everybody stores the same value (e.g., constants).

AMPI: Adaptive MPI 50

Page 51: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Global Data

Automatic solution: keep a separate set of globals for each thread and swap them
• "-swapglobals" compile-time option
• Works on ELF platforms: Linux and Sun
• Just a pointer swap, no data copying needed
• Idea comes from the Weaves framework
• One copy at a time: breaks on SMPs

Manual solution: remove globals
• Makes the code threadsafe
• May make the code easier to understand and modify
• Turns global variables into heap data (for isomalloc or pup)

AMPI: Adaptive MPI 51

Page 52: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

How to Migrate a Virtual Processor?

Move all application state to the new processor:
• Stack data: automatic, with isomalloc stacks
• Heap data: use "-memory isomalloc", or write pup routines
• Global variables: use "-swapglobals", or remove globals entirely

AMPI: Adaptive MPI 52

Page 53: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs

Build Charm++/AMPI if not yet available:
  ./build AMPI <version> <options>    (see README for details)

Build the application with AMPI's scripts:
  <charmdir>/bin/ampicc -o prog prog.c

Run the application via charmrun:
  charmrun +pK prog    (or, equivalently, via the ampirun wrapper: ampirun -np K prog)

MPI’s machinefile ≈ Charm’s nodelist file

+p option: number of physical processors to use

+vp option: number of virtual processors to use

AMPI: Adaptive MPI 53

Page 54: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Multiple VP-to-P mappings are available:
  charmrun +p2 prog +vp8 +mapping <map>
• RR_MAP: Round-Robin (cyclic)
• BLOCK_MAP: Block (default mapping)
• PROP_MAP: Proportional to processors' speeds

Example: VP=8, P=2, map=RR_MAP
  P[0]: VPs 0,2,4,6;  P[1]: VPs 1,3,5,7

Other mappings can be easily added
• Simple AMPI-lib changes needed (examples available)
• The best mapping depends on the application

AMPI: Adaptive MPI 54

Page 55: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Optional: build the application with "isomalloc":
  ampicc -o prog prog.c -memory isomalloc
(special memory allocator that helps in migration)

Run the application with modified stack sizes:
  ampirun -np K prog +vpM +tcharm_stacksize 1000
• Size specified in bytes, valid for each thread
• Default size: 1 MB
• Can be increased or decreased via the command line

AMPI: Adaptive MPI 55

Page 56: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Load Balancer Use:

Link the program with LB modules:
  ampicc -o prog prog.c -module EveryLB

Run the program with one of the available balancers:
  ampirun -np 4 prog +vp16 +balancer <SelectedLB>
  e.g. GreedyLB, GreedyCommLB, RefineLB, etc.

It is possible to define, during execution, when to collect information for the load balancer:
• LBTurnInstrumentOn() and LBTurnInstrumentOff() calls
• Used with the +LBOff, +LBCommOff command-line options

AMPI: Adaptive MPI 56

Page 57: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Performance Analysis: Projections tool (GUI)
  ampicc -o prog prog.c -tracemode projections

AMPI: Adaptive MPI 57

Post-mortem visualization of performance data. [Screenshot: Projections GUI.]

Page 58: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 58

Page 59: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI Status

• Compliance with the MPI-1.1 standard
  - Missing: error handling, profiling interface
• Partial MPI-2 support:
  - Some new functions implemented when needed
  - ROMIO integrated for parallel I/O
  - Major missing features: dynamic process management, language bindings
• Most missing features are documented

Tested periodically via MPICH-1 test-suite

AMPI: Adaptive MPI 59

Page 60: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI Applications

• Rocstar (rocket simulation)
• Fractography3D
• FLASH
• BRAM

AMPI: Adaptive MPI 60

Page 61: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 61

Page 62: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI References

• Charm++ site for manuals: http://charm.cs.uiuc.edu/manuals/
• Papers on AMPI: http://charm.cs.uiuc.edu/research/ampi/index.shtml#Papers
• AMPI source code: part of the Charm++ distribution, http://charm.cs.uiuc.edu/download/
• AMPI's current funding support (indirect):
  - NSF/NCSA Blue Waters (Charm++, BigSim)
  - DoE Colony2 project (Load Balancing, Fault Tolerance)

AMPI: Adaptive MPI 62

Page 63: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Conclusion

• AMPI makes exciting features from Charm++ available to many MPI applications!
• VPs in AMPI are used in BigSim to emulate processors of future machines (see next talk)
• We support AMPI through our regular mailing list: [email protected]
• Feedback on AMPI is always welcome

Thank You!

Questions?

AMPI: Adaptive MPI 63