Heterogeneous Computing in Charm++

David Kunzman



TRANSCRIPT

Page 1: Heterogeneous Computing in Charm++

Heterogeneous Computing in Charm++

David Kunzman

Page 2: Heterogeneous Computing in Charm++

Motivations

• Performance and Popularity of Accelerators
  – Our work currently focuses on Cell (and Larrabee)
  – Difficult to program accelerators
    • Architecture-specific code (not portable)
    • Many asynchronous events (data movement, multiple cores)

• Heterogeneous Clusters Exist Already
  – Roadrunner at LANL (Opterons and Cells)
  – Lincoln at NCSA (Xeons and GPUs)
  – MariCel at BSC (Powers and Cells)

Page 3: Heterogeneous Computing in Charm++

Goals

• Portability of code
  – Code should be portable between systems with and without accelerators
  – Across homogeneous and heterogeneous clusters
  – Reduce programmer effort

• Allow various pieces of code to be written independently
  – Pieces of code share the accelerator(s)
  – Scheduled by the runtime system automatically

• Naturally extend the existing Charm++ model
  – Same programming model for all hosts and accelerators

Page 4: Heterogeneous Computing in Charm++

Approach

• Make entry methods portable between host and accelerator cores
  – Allows the programmer to write entry method code once and use the same code for all cores
  – Still make use of architecture/core-specific features

• Take advantage of the clear communication boundaries in Charm++
  – Almost all data is encapsulated within chare objects
  – Data is passed between chare objects by invoking entry methods

Page 5: Heterogeneous Computing in Charm++

Extending Charm++

• SIMD Instruction Abstraction
  – To reach any significant fraction of peak, must use SIMD instructions on modern cores
  – Abstract SIMD instructions so code is portable

• Accelerated Entry Methods
  – May execute on accelerators
  – Essentially a standard entry method split into two stages
    • Function body (accelerator or host; limited)
    • Callback function (host; not limited)

Page 6: Heterogeneous Computing in Charm++

SIMD Instruction Abstraction

• Abstract SIMD instructions supported by multiple architectures
  – Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
  – Generic C implementation when no direct architectural support is present
  – Types: vecf, veclf, veci, ...
  – Operations: vaddf, vmulf, vsqrtf, ...

Page 7: Heterogeneous Computing in Charm++

Example Entry Method

entry void accum(int inArrayLen, float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  for (int i = 0; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 8: Heterogeneous Computing in Charm++

Example Entry Method w/ SIMD

entry void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 9: Heterogeneous Computing in Charm++

Accel Entry Method Structure

Invocation (both): chareObj.entryName(… passed parameters …)

Accelerated (interface file):

  entry [accel] void entryName
    ( …passed parameters… )
    [ …local parameters… ]
    { … function body … }
  callback_member_function;

vs.

Standard (interface file):

  entry void entryName
    ( …passed parameters… );

Standard (source file):

  void ChareClass::entryName
    ( …passed parameters … )
  { … function body … }

Page 10: Heterogeneous Computing in Charm++

Example Accelerated Entry Method

entry [accel] void accum(
    int inArrayLen,
    align(sizeof(vecf)) float inArray[inArrayLen])
  [ readOnly  : int localArrayLen <impl_obj->localArrayLen>,
    readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
{
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
} accum_callback;

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

Page 11: Heterogeneous Computing in Charm++

Timeline of Events

• Runtime system…
  – Directs data movement (messages & DMAs)
  – Schedules accelerated entry methods and callbacks

Page 12: Heterogeneous Computing in Charm++

Communication Overlap

• Data movement is automatically overlapped with accelerated entry method execution on the SPEs and entry method execution on the PPE

Page 13: Heterogeneous Computing in Charm++

Handling Host Core Differences

• Automatic modification of application data at communication boundaries
  – Structure of data is known via parameters and Pack-UnPack (PUP) routines
  – During the packing process, add information on how the data is encoded
  – During unpacking, if needed, modify data to match the local architecture

Page 14: Heterogeneous Computing in Charm++

Molecular Dynamics (MD) Code

• Based on the object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
  – Coulomb’s Law
  – Single-precision floating point

• Particles evenly divided between patch objects
  – ~92K particles in 144 patches (similar to the ApoA1 benchmark)

• Compute objects (self and pairwise) compute forces for patch objects

• Patches integrate the combined force data and update particle positions

Page 15: Heterogeneous Computing in Charm++

MD Code Results

• Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
  – 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
  – Better scaling is achieved when the Xeons are present
  – 331.1 GFlop/s (19.82% of peak; serial code is limited to 27.7% of peak on one SPE, assuming that SPE has an infinite local store)

Page 16: Heterogeneous Computing in Charm++

Visualizing MD Code Execution

Page 17: Heterogeneous Computing in Charm++

Summary

• Support for accelerators and heterogeneous execution in Charm++
  – Programming model and runtime system changes
    • Accelerated entry methods
    • SIMD instruction abstraction
    • Automatic modification of application data
    • Visualization support
  – Support
    • Currently supports Cell
    • Adding support for Larrabee
    • Clusters where host cores have different architectures

Page 18: Heterogeneous Computing in Charm++

Future Work

• Dynamic measurement-based load balancing on heterogeneous systems

• Increase support for more accelerators
  – In the process of adding support for Larrabee
  – Increasing support for existing abstractions and/or developing new abstractions

Page 19: Heterogeneous Computing in Charm++

Questions