
Page 1: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Gengbin Zheng

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

Page 2: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 2

Page 3: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Motivation

Challenges: new-generation parallel applications are
• Dynamically varying: load shifting during execution
• Adaptively refined
• Composed of multi-physics modules

Typical MPI implementations:
• Not naturally suitable for dynamic applications
• Available processor set may not match the algorithm

Alternative: Adaptive MPI (AMPI)
• MPI + Charm++ virtualization: VPs ("Virtual Processors")

AMPI: Adaptive MPI 3

Page 4: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 4

Page 5: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview

Virtualization: MPI ranks → Charm++ threads

AMPI: Adaptive MPI 5

[Figure: MPI "tasks" mapped onto real processors. The MPI tasks are implemented as user-level migratable threads (VPs: virtual processors).]

Page 6: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview (cont.)

AMPI Execution Model:

AMPI: Adaptive MPI 6

• Multiple user-level threads per process

• Typically, one process per physical processor

• Charm++ Scheduler coordinates execution

• Threads (VPs) can migrate across processors

• Virtualization ratio: R = #VP / #P (over-decomposition)

[Figure: a single Charm++ scheduler within one process driving four VPs (P=1, VP=4).]

Page 7: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI: Overview (cont.)

AMPI's Over-Decomposition in Practice

AMPI: Adaptive MPI 7

[Figure: MPI with P=4, ranks=4 vs. AMPI with P=4, VP=ranks=16.]

Page 8: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 8

Page 9: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI

Overlap between Computation/Communication

AMPI: Adaptive MPI 9

• Automatically achieved
• When one thread blocks for a message, another thread in the same processor can execute
• The Charm++ scheduler picks the next thread among those that are ready to run

Page 10: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Potentially Better Cache Utilization

AMPI: Adaptive MPI 10

Gains occur when subdomain is accessed repeatedly (e.g. by multiple functions, called in sequence)

[Figure annotation: pieces 1 and 2 might fit in cache, but piece 3 might not fit.]

Page 11: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Thread Migration for Load Balancing

AMPI: Adaptive MPI 11

[Figure: migration of thread 13 between processors.]

Page 12: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Load Balancing in AMPI: MPI_Migrate()

Collective operation informing the load balancer that the threads can be migrated, if needed, to balance the load
• Easy to insert in the code of iterative applications (see the sketch below)
• Leverages the load-balancing framework of Charm++
• Balancing decisions can be based on:
  - Measured parameters: computation load, communication pattern
  - Application-provided information
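A minimal sketch of how MPI_Migrate() (the AMPI extension named above) might be dropped into an iterative application; the loop structure, the migration interval, and compute_step() are illustrative placeholders, not part of AMPI.

  #include <mpi.h>

  void compute_step(void);   /* placeholder for the application's per-iteration work */

  void solver_loop(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          compute_step();
          /* Every 20 iterations, give the Charm++ load balancer a chance
           * to migrate threads; the call is collective over all ranks. */
          if (step % 20 == 19)
              MPI_Migrate();
      }
  }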

AMPI: Adaptive MPI 12

Page 13: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Decoupling of Physical/Virtual Processors

AMPI: Adaptive MPI 13

Problem setup: 3D stencil calculation of size 240³, run on Lemieux.

AMPI runs on any number of PEs (e.g. 19, 33, 105). Native MPI needs P = K³.

[Plot: execution time (sec) vs. number of processors (10 to 1000, log-log scale), comparing Native MPI and AMPI.]

Page 14: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Asynchronous Implementation of Collectives

The collective operation is posted and returns immediately; test/wait for its completion and do useful work in the meantime (see the sketch below):

  MPI_Ialltoall( … , &req);
  /* other computation */
  MPI_Wait(&req, MPI_STATUS_IGNORE);

Other operations available: MPI_Iallreduce, MPI_Iallgather
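A hedged sketch of the overlap pattern above; the buffer sizes, datatypes, and do_local_work() are illustrative, and the argument list follows the MPI-3-style nonblocking signature (AMPI's early extension may have differed slightly).

  #include <mpi.h>

  void do_local_work(void);   /* computation that does not depend on recvbuf */

  void exchange_and_compute(double *sendbuf, double *recvbuf,
                            int count_per_rank, MPI_Comm comm)
  {
      MPI_Request req;

      /* Post the all-to-all; the call returns immediately. */
      MPI_Ialltoall(sendbuf, count_per_rank, MPI_DOUBLE,
                    recvbuf, count_per_rank, MPI_DOUBLE, comm, &req);

      do_local_work();   /* overlap: useful work while the exchange progresses */

      /* Block only when the exchanged data is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }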

Example: 2D FFT benchmark. [Chart: time (ms, 0 to 100) for AMPI and Native MPI on 4, 8, and 16 processors, broken down into 1D FFT, All-to-all, and Wait phases.]

Page 15: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Motivation for Collective Communication Optimization

[Chart: time (ms, 0 to 900) vs. message size (76 to 8076 bytes) for an all-to-all using the Mesh library, showing total "Mesh" time and the "Mesh Compute" portion.]

AMPI: Adaptive MPI 15

Time breakdown of an all-to-all operation using Mesh library

Computation is only a small proportion of the elapsed time. A number of optimization techniques have been developed to improve collective communication performance.

Page 16: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Benefits of AMPI (cont.)

Fault Tolerance via Checkpoint/Restart

• State of the application is checkpointed to disk or memory
• Capable of restarting on a different number of physical processors!
• Synchronous checkpoint, collective call (see the sketch below):
  - In-disk: MPI_Checkpoint(DIRNAME)
  - In-memory: MPI_MemCheckpoint(void)
• Restart:
  - In-disk: charmrun +p4 prog +restart DIRNAME
  - In-memory: automatic restart upon failure detection
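A minimal sketch of periodic checkpointing with the AMPI extensions named above; the checkpoint interval, the directory name "ckpt", and compute_step() are illustrative.

  #include <mpi.h>

  void compute_step(void);   /* placeholder for the application's per-iteration work */

  void run_with_checkpoints(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          compute_step();
          if (step % 100 == 99) {
              /* Collective, synchronous checkpoint to disk... */
              MPI_Checkpoint("ckpt");
              /* ...or, alternatively, an in-memory (double) checkpoint:
               * MPI_MemCheckpoint();
               */
          }
      }
  }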

AMPI: Adaptive MPI 16

Page 17: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 17

Page 18: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Converting MPI Codes to AMPI

AMPI needs its own initialization before user code; the Fortran program entry point becomes MPI_Main:

  Original:               Converted:
  program pgm             subroutine MPI_Main
  ...                     ...
  end program             end subroutine

The C program entry point is handled automatically via mpi.h; include it in the same file as main() if absent (a minimal example follows).

If the code has no global/static variables, this is all that is needed to convert!
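For C, a standard MPI program with no global or static state works unchanged; the sketch below is ordinary MPI code, nothing AMPI-specific, and only needs to be built with ampicc.

  #include <stdio.h>
  #include <mpi.h>

  /* Ordinary MPI code; under AMPI each rank runs as a user-level thread (VP). */
  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }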

AMPI: Adaptive MPI 18

Page 19: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 19

Page 20: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables

Global and static variables are a problem in multi-threaded programs (a similar problem exists in OpenMP):
• Globals/statics have a single instance per process
• They become shared by all threads in the process
Example:

AMPI: Adaptive MPI 20

Thread 1                       Thread 2
var = myid (1)
MPI_Recv()
  (block...)
                               var = myid (2)
                               MPI_Recv()
                                 (block...)
b = var

(time flows downward)

If var is a global/static, an incorrect value is read! (A code sketch follows.)
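A small C sketch of the hazard illustrated above; the variable names and message details are illustrative.

  #include <mpi.h>

  int var;   /* one instance per process: shared by all AMPI threads in it */

  /* Executed by each AMPI rank (user-level thread). If rank 1 blocks in
   * MPI_Recv and rank 2 (in the same process) then assigns var, rank 1
   * later reads rank 2's value instead of its own. */
  void step(int myid, MPI_Comm comm)
  {
      int msg, b;
      var = myid;                                   /* write the shared global */
      MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0,
               comm, MPI_STATUS_IGNORE);            /* thread may block and yield */
      b = var;        /* may observe the other rank's myid */
      (void)b;
  }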

Page 21: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

• General solution: privatize the variables within each thread
• Approaches:

a) Swap global variables

b) Source-to-source transformation via Photran

c) Use TLS scheme (in development)

Specific approach to use must be decided on a case-by-case basis

AMPI: Adaptive MPI 21

Page 22: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

First Approach: Swap global variables
• Leverages ELF, the Executable and Linking Format (e.g. on Linux)
• ELF maintains a Global Offset Table (GOT) for globals
• Switch the GOT contents at thread context switch
• Implemented in AMPI via the build flag -swapglobals
+ No source code changes needed
+ Works with any language (C, C++, Fortran, etc.)
- Does not handle static variables
- Context-switch overhead grows with the number of variables

AMPI: Adaptive MPI 22

Page 23: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Second Approach: Source-to-source transformation
• Move globals/statics to an object, then pass it around
• Automatic solution for Fortran codes: Photran
• A similar idea can be applied to C/C++ codes
+ Totally portable across systems/compilers
+ May improve locality and cache utilization
+ No extra overhead at context switch
- Requires a new implementation for each language

AMPI: Adaptive MPI 23

Page 24: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Example of Transformation: C Program

AMPI: Adaptive MPI 24

[Figure: original C code vs. transformed code.]
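The slide's code listing did not survive the transcript; below is a representative sketch of the transformation, under the assumption that globals are gathered into a struct allocated per rank and passed explicitly. The names (myrank, GlobalVars, compute) are illustrative.

  /* Original code: a process-wide global, shared by all AMPI threads. */
  int myrank;

  void compute(void) {
      /* ... uses myrank ... */
  }

  /* Transformed code (sketch): the global moves into a per-rank object. */
  typedef struct {
      int myrank;
  } GlobalVars;

  void compute_transformed(GlobalVars *g) {
      /* ... uses g->myrank ... */
  }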

Page 25: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Example of Photran Transformation: Fortran Program

AMPI: Adaptive MPI 25

[Figure: original Fortran code vs. Photran-transformed code.]

Page 26: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Photran Transformation Tool
• Eclipse-based IDE, implemented in Java
• Incorporates automatic refactorings for Fortran codes
• Operates on "pure" Fortran 90 programs
• Code transformation infrastructure:
  - Constructs rewriteable ASTs
  - ASTs are augmented with binding information

AMPI: Adaptive MPI 26

Source: Stas Negara & Ralph Johnson

http://www.eclipse.org/photran/

Page 27: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

AMPI: Adaptive MPI 27

Source: Stas Negara & Ralph Johnson

http://www.eclipse.org/photran/

[Screenshot: Photran-AMPI GUI.]

Page 28: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

NAS Benchmark

AMPI: Adaptive MPI 28

Page 29: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

FLASH Results

AMPI: Adaptive MPI 29

Page 30: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Third Approach: TLS scheme (Thread-Local Storage)
• Originally employed for kernel threads
• In C code, variables are annotated with __thread (see the sketch below)
• A modified/adapted gfortran compiler is available
+ Handles both globals and statics uniformly
+ No extra overhead at context switch
- Although popular, not yet a standard for compilers
- Current Charm++ support only for x86 platforms
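A minimal sketch of the annotation the TLS approach relies on; the variable names are illustrative, and __thread is a GCC extension (C11 spells it _Thread_local).

  /* Each (user-level) thread gets its own copy of these variables. */
  __thread int myrank;
  static __thread double partial_sum;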

AMPI: Adaptive MPI 30

Page 31: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Handling Global/Static Variables (cont.)

Summary of Current Privatization Schemes:
• Program transformation is very portable
• The TLS scheme may become supported on Blue Waters, depending on work with IBM

AMPI: Adaptive MPI 31

Privatization Scheme | X86 | IA64  | Opteron | MacOS | IBM Power | SUN   | IBM BG/P | Cray XT | Windows
Prog. Transf.        | Yes | Yes   | Yes     | Yes   | Yes       | Yes   | Yes      | Yes     | Yes
Swap Globals         | Yes | Yes   | Yes     | No    | No        | Maybe | No       | No      | No
TLS                  | Yes | Maybe | Yes     | No    | Maybe     | Maybe | No       | Yes     | Maybe

Page 32: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

NAS Benchmark

AMPI: Adaptive MPI 32

Page 33: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

FLASH Results

FLASH is a parallel, multi-dimensional code used to study astrophysical fluids.

Many astrophysical environments are highly turbulent, and have structure on scales varying from large scale, like galaxy clusters, to small scale, like active galactic nuclei, in the same system.

AMPI: Adaptive MPI 33

Page 34: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 34

Page 35: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Object Migration

AMPI: Adaptive MPI 35

Page 36: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Object Migration

How do we move work between processors?
• Application-specific methods
  - E.g., move rows of a sparse matrix, elements of an FEM computation
  - Often very difficult for the application
• Application-independent methods
  - E.g., move an entire virtual processor
  - The application's problem decomposition doesn't change

AMPI: Adaptive MPI 36

Page 37: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

How to Migrate a Virtual Processor?

Move all application state to the new processor:
• Stack data
  - Subroutine variables and calls
  - Managed by the compiler
• Heap data
  - Allocated with malloc/free
  - Managed by the user
• Global variables
• Open files, environment variables, etc. (not handled yet!)

AMPI: Adaptive MPI 37

Page 38: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Stack Data

The stack is used by the compiler to track function calls and provide temporary storage:
• Local variables
• Subroutine parameters
• C "alloca" storage

Most of the variables in a typical application are stack data

AMPI: Adaptive MPI 38

Page 39: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

• Without compiler support, we cannot change the stack's address, because we can't fix up the stack's interior pointers (return frame pointer, function arguments, etc.)
• Solution: "isomalloc" addresses
  - Reserve address space on every processor for every thread stack
  - Use mmap to scatter stacks in virtual memory efficiently (a conceptual sketch follows)
  - Idea comes from PM2
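A conceptual sketch of the idea (not AMPI's actual implementation): reserve the same virtual-address range for a given thread's stack on every processor, so interior pointers remain valid after migration. The function name, fixed address, and size are illustrative.

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>

  /* Map an anonymous region at a pre-agreed address. Every processor
   * reserves the same slot for the same thread, so a migrated stack can
   * be copied in without fixing up any pointers. */
  void *reserve_stack_slot(void *agreed_addr, size_t size)
  {
      void *p = mmap(agreed_addr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
  }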

AMPI: Adaptive MPI 39

Page 40: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

AMPI: Adaptive MPI 40

[Diagram: before migration. Processor A's memory (addresses 0x00000000 to 0xFFFFFFFF) holds code, globals, heap, and the stacks of threads 2, 3, and 4; Processor B's memory holds code, globals, heap, and thread 1's stack. Thread 3 is about to migrate from A to B.]

Page 41: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

AMPI: Adaptive MPI 41

[Diagram: after migration. Thread 3's stack now resides in Processor B's memory alongside thread 1's stack, in the same reserved address range; Processor A retains the stacks of threads 2 and 4.]

Page 42: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Stack Data

Isomalloc is a completely automatic solution:
• No changes needed in the application or compilers
• Just like a software shared-memory system, but with proactive paging

But it has a few limitations:
• Depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
• Depends on unportable mmap behavior: which addresses are safe? (We must guess!) What about Windows? Blue Gene?

AMPI: Adaptive MPI 42

Page 43: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Heap Data

Heap data is any dynamically allocated data:
• C "malloc" and "free"
• C++ "new" and "delete"
• F90 "ALLOCATE" and "DEALLOCATE"

Arrays and linked data structures are almost always heap data

AMPI: Adaptive MPI 43

Page 44: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data

Automatic solution: isomalloc all heap data, just like the stacks!
• "-memory isomalloc" link option
• Overrides malloc/free
• No new application code needed
• Same limitations as isomalloc

Manual solution: the application moves its own heap data
• Need to be able to size the message buffer, pack data into the message, and unpack it on the other side
• The "pup" abstraction does all three

AMPI: Adaptive MPI 44

Page 45: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP

• Same idea as MPI derived types, but the datatype description is code, not data
• Basic contract: "here is my data"
  - Sizing: counts up the data size
  - Packing: copies data into the message
  - Unpacking: copies data back out
  - The same call works for network, memory, disk I/O, ...
• Register a "pup routine" with the runtime
  - F90/C interface: subroutine calls, e.g. pup_int(p,&x);
  - C++ interface: operator| overloading, e.g. p|x;

AMPI: Adaptive MPI 45

Page 46: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP Builtins

Supported PUP datatypes:
• Basic types (int, float, etc.)
• Arrays of basic types
• Unformatted bytes

Extra support in C++:
• Can overload user-defined types: define your own operator|
• Support for pointer-to-parent class: PUP::able interface
• Supports STL vector, list, map, and string: "pup_stl.h"
• Subclass your own PUP::er object

AMPI: Adaptive MPI 46

Page 47: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP C++ Example

AMPI: Adaptive MPI 47

#include "pup.h"
#include "pup_stl.h"

class myMesh {
  std::vector<float> nodes;
  std::vector<int> elts;
public:
  ...
  void pup(PUP::er &p) {
    p|nodes;
    p|elts;
  }
};

Page 48: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP C Example

AMPI: Adaptive MPI 48

typedef struct myMesh {
  int nn, ne;
  float *nodes;
  int *elts;
} myMesh;

void pupMesh(pup_er p, myMesh *mesh) {
  pup_int(p, &mesh->nn);
  pup_int(p, &mesh->ne);
  if (pup_isUnpacking(p)) { /* allocate data on arrival */
    mesh->nodes = malloc(mesh->nn * sizeof(float));
    mesh->elts  = malloc(mesh->ne * sizeof(int));
  }
  pup_floats(p, mesh->nodes, mesh->nn);
  pup_ints(p, mesh->elts, mesh->ne);
  if (pup_isDeleting(p)) { /* free data on departure */
    deleteMesh(mesh);
  }
}

Page 49: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Heap Data: PUP F90 Example

AMPI: Adaptive MPI 49

TYPE myMesh
  INTEGER :: nn, ne
  REAL*4, ALLOCATABLE :: nodes(:)
  INTEGER, ALLOCATABLE :: elts(:)
END TYPE

SUBROUTINE pupMesh(p, mesh)
  USE MODULE ...
  INTEGER :: p
  TYPE(myMesh) :: mesh
  CALL fpup_int(p, mesh%nn)
  CALL fpup_int(p, mesh%ne)
  IF (fpup_isUnpacking(p)) THEN
    ALLOCATE(mesh%nodes(mesh%nn))
    ALLOCATE(mesh%elts(mesh%ne))
  END IF
  CALL fpup_floats(p, mesh%nodes, mesh%nn)
  CALL fpup_ints(p, mesh%elts, mesh%ne)
  IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
END SUBROUTINE

Page 50: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Global Data

Global data is anything stored at a fixed place:
• C/C++ "extern" or "static" data
• F77 "COMMON" blocks
• F90 "MODULE" data

It is a problem if multiple objects/threads try to store different values in the same place (thread safety). Compilers should make all of these per-thread, but they don't! It is not a problem if everybody stores the same value (e.g., constants).

AMPI: Adaptive MPI 50

Page 51: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Migrate Global Data

Automatic solution: keep a separate set of globals for each thread and swap them
• "-swapglobals" compile-time option
• Works on ELF platforms: Linux and Sun
• Just a pointer swap, no data copying needed
• Idea comes from the Weaves framework
• One copy at a time: breaks on SMPs

Manual solution: remove globals
• Makes the code threadsafe
• May make the code easier to understand and modify
• Turns global variables into heap data (for isomalloc or pup)

AMPI: Adaptive MPI 51

Page 52: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

How to Migrate a Virtual Processor?

Move all application state to the new processor:
• Stack data: automatic, with isomalloc stacks
• Heap data: use "-memory isomalloc", or write pup routines
• Global variables: use "-swapglobals", or remove globals entirely

AMPI: Adaptive MPI 52

Page 53: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs

Build Charm++/AMPI if not yet available:
  ./build AMPI <version> <options>    (see README for details)

Build the application with AMPI's scripts:
  <charmdir>/bin/ampicc -o prog prog.c

Run the application via charmrun:
  charmrun +pK prog    (or, equivalently, via the ampirun wrapper: ampirun -np K prog)

MPI’s machinefile ≈ Charm’s nodelist file

+p option: number of physical processors to use

+vp option: number of virtual processors to use

AMPI: Adaptive MPI 53

Page 54: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Multiple VP-to-P mappings are available:
  charmrun +p2 prog +vp8 +mapping <map>
• RR_MAP: Round-Robin (cyclic)
• BLOCK_MAP: Block (default mapping)
• PROP_MAP: Proportional to processors' speeds

Example: VP=8, P=2, map=RR_MAP
  P[0]: VPs 0,2,4,6;  P[1]: VPs 1,3,5,7

Other mappings can be easily added
• Simple AMPI-lib changes needed (examples available)
• The best mapping depends on the application

AMPI: Adaptive MPI 54

Page 55: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Optional: build the application with "isomalloc":
  ampicc -o prog prog.c -memory isomalloc
(special memory allocator that helps in migration)

Run the application with modified stack sizes:
  ampirun -np K prog +vpM +tcharm_stacksize 1000
• Size specified in bytes, valid for each thread
• Default size: 1 MB
• Can be increased or decreased via the command line

AMPI: Adaptive MPI 55

Page 56: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Load Balancer Use:

Link the program with LB modules:
  ampicc -o prog prog.c -module EveryLB

Run the program with one of the available balancers:
  ampirun -np 4 prog +vp16 +balancer <SelectedLB>
  e.g. GreedyLB, GreedyCommLB, RefineLB, etc.

It is possible to define, during execution, when to collect information for the load balancer:
• LBTurnInstrumentOn() and LBTurnInstrumentOff() calls
• Used with the +LBOff, +LBCommOff command-line options

AMPI: Adaptive MPI 56

Page 57: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Running AMPI Programs (cont.)

Performance Analysis: Projections tool (GUI)
  ampicc -o prog prog.c -tracemode projections

AMPI: Adaptive MPI 57

Post-mortem visualization of performance data. [Screenshot: Projections GUI.]

Page 58: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 58

Page 59: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI Status

• Compliance with the MPI-1.1 standard
  - Missing: error handling, profiling interface
• Partial MPI-2 support:
  - Some new functions implemented when needed
  - ROMIO integrated for parallel I/O
  - Major missing features: dynamic process management, language bindings
• Most missing features are documented

Tested periodically via MPICH-1 test-suite

AMPI: Adaptive MPI 59

Page 60: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI Applications

• Rocstar (rocket simulation)
• Fractography3D
• FLASH
• BRAM

AMPI: Adaptive MPI 60

Page 61: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Outline: Motivation; AMPI: Overview; Benefits of AMPI; Converting MPI Codes to AMPI; Handling Global/Static Variables; Running AMPI Programs; AMPI Status; AMPI References; Conclusion

AMPI: Adaptive MPI 61

Page 62: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

AMPI References

• Charm++ site for manuals: http://charm.cs.uiuc.edu/manuals/
• Papers on AMPI: http://charm.cs.uiuc.edu/research/ampi/index.shtml#Papers
• AMPI source code: part of the Charm++ distribution, http://charm.cs.uiuc.edu/download/
• AMPI's current funding support (indirect):
  - NSF/NCSA Blue Waters (Charm++, BigSim)
  - DoE Colony2 project (Load Balancing, Fault Tolerance)

AMPI: Adaptive MPI 62

Page 63: Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Conclusion

• AMPI makes exciting features from Charm++ available to many MPI applications!
• VPs in AMPI are used in BigSim to emulate processors of future machines (see next talk)
• We support AMPI through our regular mailing list: [email protected]
• Feedback on AMPI is always welcome

Thank You!

Questions?

AMPI: Adaptive MPI 63