Heterogeneous Computing in Charm++

David Kunzman



TRANSCRIPT

Page 1: Heterogeneous Computing in Charm++

Heterogeneous Computing in Charm++

David Kunzman

Page 2: Heterogeneous Computing in Charm++

Motivations

• Performance and Popularity of Accelerators
  – Our work currently focuses on Cell (and Larrabee)
  – Difficult to program accelerators
    • Architecture-specific code (not portable)
    • Many asynchronous events (data movement, multiple cores)

• Heterogeneous Clusters Exist Already
  – Roadrunner at LANL (Opterons and Cells)
  – Lincoln at NCSA (Xeons and GPUs)
  – MariCel at BSC (Powers and Cells)

Page 3: Heterogeneous Computing in Charm++

Goals

• Portability of code
  – Code should be portable between systems with and without accelerators
  – Across homogeneous and heterogeneous clusters
  – Reduce programmer effort

• Allow various pieces of code to be written independently
  – Pieces of code share the accelerator(s)
  – Scheduled by the runtime system automatically

• Naturally extend the existing Charm++ model
  – Same programming model for all hosts and accelerators

Page 4: Heterogeneous Computing in Charm++

Approach

• Make entry methods portable between host and accelerator cores
  – Allows the programmer to write entry method code once and use the same code for all cores
  – Still make use of architecture/core-specific features

• Take advantage of the clear communication boundaries in Charm++
  – Almost all data is encapsulated within chare objects
  – Data is passed between chare objects by invoking entry methods

Page 5: Heterogeneous Computing in Charm++

Extending Charm++

• SIMD Instruction Abstraction
  – To reach any significant fraction of peak, must use SIMD instructions on modern cores
  – Abstract SIMD instructions so code is portable

• Accelerated Entry Methods
  – May execute on accelerators
  – Essentially a standard entry method split into two stages
    • Function body (accelerator or host; limited)
    • Callback function (host; not limited)

Page 6: Heterogeneous Computing in Charm++

SIMD Instruction Abstraction

• Abstract SIMD instructions supported by multiple architectures
  – Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
  – Generic C implementation when no direct architectural support is present
  – Types: vecf, veclf, veci, ...
  – Operations: vaddf, vmulf, vsqrtf, ...

Page 7: Heterogeneous Computing in Charm++

Example Entry Method

entry void accum(int inArrayLen, float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  for (int i = 0; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 8: Heterogeneous Computing in Charm++

Example Entry Method w/ SIMD

entry void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 9: Heterogeneous Computing in Charm++

Accel Entry Method Structure

Invocation (both): chareObj.entryName(… passed parameters …)

Accelerated (interface file):

  entry [accel] void entryName
    ( …passed parameters… )
    [ …local parameters… ]
    { … function body … }
  callback_member_function;

vs.

Standard (interface file):

  entry void entryName
    ( …passed parameters… );

Standard (source file):

  void ChareClass::entryName
    ( …passed parameters … )
  { … function body … }

Page 10: Heterogeneous Computing in Charm++

Example Accelerated Entry Method

entry [accel] void accum(
    int inArrayLen,
    align(sizeof(vecf)) float inArray[inArrayLen])
  [ readOnly  : int localArrayLen <impl_obj->localArrayLen>,
    readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
{
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
} accum_callback;

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 11: Heterogeneous Computing in Charm++

Timeline of Events

• Runtime system…
  – Directs data movement (messages & DMAs)
  – Schedules accelerated entry methods and callbacks

Page 12: Heterogeneous Computing in Charm++

Communication Overlap

• Data movement is automatically overlapped with accelerated entry method execution on the SPEs and entry method execution on the PPE

Page 13: Heterogeneous Computing in Charm++

Handling Host Core Differences

• Automatic modification of application data at communication boundaries
  – Structure of data is known via parameters and Pack-UnPack (PUP) routines
  – During the packing process, add information on how the data is encoded
  – During unpacking, if needed, modify data to match the local architecture

Page 14: Heterogeneous Computing in Charm++

Molecular Dynamics (MD) Code

• Based on the object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
  – Coulomb’s Law
  – Single-precision floating point

• Particles evenly divided between patch objects
  – ~92K particles in 144 patches (similar to the ApoA1 benchmark)

• Compute objects (self and pairwise) compute forces for patch objects

• Patches integrate the combined force data and update particle positions

Page 15: Heterogeneous Computing in Charm++

MD Code Results

• Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
  – 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
  – Better scaling is achieved when the Xeons are present
  – 331.1 GFlop/s (19.82% of peak; serial code is limited to 27.7% of peak on one SPE, assuming that SPE has an infinite local store)

Page 16: Heterogeneous Computing in Charm++

Visualizing MD Code Execution

Page 17: Heterogeneous Computing in Charm++

Summary

• Support for accelerators and heterogeneous execution in Charm++
  – Programming model and runtime system changes
    • Accelerated entry methods
    • SIMD instruction abstraction
    • Automatic modification of application data
    • Visualization support
  – Support
    • Currently supports Cell
    • Adding support for Larrabee
    • Clusters where host cores have different architectures

Page 18: Heterogeneous Computing in Charm++

Future Work

• Dynamic measurement-based load balancing on heterogeneous systems

• Increase support for more accelerators
  – In the process of adding support for Larrabee
  – Increasing support for existing abstractions and/or developing new abstractions

Page 19: Heterogeneous Computing in Charm++

Questions