
Battle of the Accelerator Stars

Pavan Balaji

Computer Scientist

Group lead, Programming Models and Runtime Systems

Argonne National Laboratory

[email protected]

http://www.mcs.anl.gov/~balaji


Pavan Balaji, Argonne National Laboratory P2S2 Workshop Panel (09/10/2012)

Accelerators != GPUs

Anything that is built to accelerate a specific type of computation (i.e., not general purpose computation) is an accelerator
– A vector instruction unit is an accelerator
– The double/quad floating point operations on BG/P and BG/Q are accelerators
– An H.264 media decoder sitting on a processor die is an accelerator
– There is no such thing as a “general purpose” accelerator

GPUs are one form of accelerator, but not the only one


Divergence in Accelerator Computing?

Divergence == Increasing difference

No -- There are a lot of different models of accelerator computing today
– NVIDIA/AMD GPUs, FPGAs, AMD’s fused chip architectures, Intel MIC architecture, Intel Xeon, Blue Gene (Yes, they are accelerators too)
– Broadly classified into:
• Decoupled processing/Decoupled memory (think GPUs)
• Coupled processing/Decoupled memory (think AMD Fusion)
• Coupled processing/Coupled memory (think Intel Xeon/MIC, BG/Q)

But the trend is not towards increasing difference, but rather towards convergence
– Vendors and researchers are trying out different options to see what works and what does not

You have to kiss many frogs before you can find your prince!


Who will be the last man standing?

GPUs: Decoupled processing, Decoupled memory

Fused Processors (e.g., AMD Fusion): Coupled processing, Decoupled memory

General Purpose Processors with Accelerator Extensions (e.g., Xeon, MIC, BG/P, BG/Q): Coupled processing, Coupled memory
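The three-way classification above can be written down as a small data structure. This is only an illustrative sketch: the names `Coupling`, `AcceleratorClass`, and `needs_data_staging` are hypothetical, and the rule that decoupled memory implies data staging is taken from the later slides in this talk.

```python
from dataclasses import dataclass
from enum import Enum


class Coupling(Enum):
    COUPLED = "coupled"
    DECOUPLED = "decoupled"


@dataclass(frozen=True)
class AcceleratorClass:
    name: str
    processing: Coupling
    memory: Coupling

    def needs_data_staging(self) -> bool:
        # Assumption from the talk: a separate memory means data has to be
        # copied (staged) between the two memories before it can be used.
        return self.memory is Coupling.DECOUPLED


CLASSES = [
    AcceleratorClass("GPUs", Coupling.DECOUPLED, Coupling.DECOUPLED),
    AcceleratorClass("Fused processors (e.g., AMD Fusion)",
                     Coupling.COUPLED, Coupling.DECOUPLED),
    AcceleratorClass("General purpose processors with accelerator extensions",
                     Coupling.COUPLED, Coupling.COUPLED),
]

# Names of the classes that still pay a data staging cost.
staging = [c.name for c in CLASSES if c.needs_data_staging()]
```

Under this encoding, only the third class avoids staging, which is the property the rest of the talk argues from.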


Quantum mechanical interactions are near-sighted (Walter Kohn)

Traditional quantum chemistry studies lie within the nearsighted range where interactions are dense:

Future quantum chemistry studies expose both short- and long-range interactions:

[Figure: Range of interactions between particles (x-axis: distance; y-axis: interaction strength)]

Note that the figures are phenomenological. Quantum chemistry methods treat correlation using a variety of approaches and have different short/long-range cutoffs.

Courtesy Jeff Hammond, Argonne National Laboratory
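As a rough illustration of why larger systems make the computation sparser, the sketch below counts how many particle pairs fall within a fixed interaction cutoff. The model is a deliberate simplification and entirely an assumption: particles on a 1D line with unit spacing, a hard cutoff, and the hypothetical function name `interacting_pair_fraction`. Real quantum chemistry methods use different cutoffs, as the note above says.

```python
def interacting_pair_fraction(num_particles, cutoff):
    """Fraction of particle pairs whose separation is within the cutoff.

    Hypothetical model: particles sit on a 1D line with unit spacing, and
    any pair farther apart than `cutoff` is treated as negligible.
    """
    if num_particles < 2:
        return 0.0
    total_pairs = num_particles * (num_particles - 1) // 2
    max_separation = min(cutoff, num_particles - 1)
    # There are (num_particles - d) pairs at separation d.
    near_pairs = sum(num_particles - d for d in range(1, max_separation + 1))
    return near_pairs / total_pairs


# A small system fits entirely inside the cutoff: every pair interacts.
small_system = interacting_pair_fraction(num_particles=10, cutoff=10)

# A much larger system with the same cutoff: most pairs are out of range.
large_system = interacting_pair_fraction(num_particles=1000, cutoff=10)
```

With the same cutoff, the small system is fully dense while the large one has only a small fraction of significant pairs, which is the sense in which future studies become "sparse".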


Wind Turbine and Flight Blade Designs

Blades are getting larger with every new design
– With larger blades, the additional lift or torque generated is from the outer regions of the blade
– Air flow from the far-out regions of the blade has lower computational intensity, making the computation more “sparse”


Decoupled Processing/Decoupled Memory (GPUs)

Pros:
– A separate processing unit can be custom built for acceleration
– Faster memory; better designed memory and memory controllers for acceleration

Cons:
– Decoupled from the main processing unit

[Figure: regular CPU cores (control unit, ALUs, cache, DRAM) compared with GPU cores]

Verdict


Coupled Processing/Decoupled Memory

Pros:
– Improved coupling of the processing units and memory allows for much faster synchronization
– Separate memory allows for better optimized memory and memory controllers

Cons:
– The need for data staging does not disappear
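To make the data staging cost concrete, here is a minimal back-of-the-envelope model, not a measurement. It assumes offloaded time is transfer time plus accelerator compute time, and asks how many floating point operations per staged byte are needed before offloading beats staying on the host. All function names and all rates below are hypothetical.

```python
def break_even_intensity(link_bytes_per_s, host_flops_per_s, accel_flops_per_s):
    """Minimum flops per staged byte for offloading to be faster.

    Derived from: bytes/link + flops/accel < flops/host.
    """
    if accel_flops_per_s <= host_flops_per_s:
        # The accelerator never wins if it is not faster than the host.
        return float("inf")
    time_saved_per_flop = 1.0 / host_flops_per_s - 1.0 / accel_flops_per_s
    return 1.0 / (link_bytes_per_s * time_saved_per_flop)


def offload_pays_off(flops, bytes_staged, link_bytes_per_s,
                     host_flops_per_s, accel_flops_per_s):
    """True if the kernel does enough work per staged byte to hide staging."""
    threshold = break_even_intensity(
        link_bytes_per_s, host_flops_per_s, accel_flops_per_s)
    return flops / bytes_staged > threshold


# Hypothetical machine: 8 GB/s link, 100 GFLOP/s host, 1 TFLOP/s accelerator.
LINK, HOST, ACCEL = 8e9, 1e11, 1e12
threshold = break_even_intensity(LINK, HOST, ACCEL)

# A kernel doing 1 flop per staged byte versus one doing 100.
low_intensity = offload_pays_off(1e9, 1e9, LINK, HOST, ACCEL)
high_intensity = offload_pays_off(1e11, 1e9, LINK, HOST, ACCEL)
```

In this model the break-even point depends only on the link bandwidth and the two compute rates, so a faster accelerator with the same link raises the bar very little, while a faster link lowers it directly.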

[Figure: two diagrams, each showing CPU, GPU, CPU Memory, and GPU Memory]

Verdict


General Purpose Processors with Accelerator Extensions

Pros:
– Very fine-grained synchronization (no memory synchronization required; processing synchronization for power constraints)

Cons:
– Unified memory means that specialization is not possible (either in memory or in memory controllers)
– Single die memory constraints

[Figure: example architectures (Intel: MIC; IBM: BG/Q; Tilera: GX; Godson T; Intel: SCC; Dally: Echelon; Chien: 10x10), with the labels “Power Constrained Memory Consistency” and “Extreme Specialization and Power Management”]


Towards On-chip Instruction-level Heterogeneity

Vector units were a form of instruction-level heterogeneity
– Some instructions use vector hardware, some don’t
– Vector instruction units processed the same data that other units processed

Synchronization requirements
– No memory staging requirements
– Theoretically, accelerator units can fit into the same instruction pipeline as general purpose processing

But, there are some practical constraints
– Amount of acceleration is so high that not all hardware can be turned on at the same time (dark silicon with power gating will lead the way)
• So synchronization is not absent, but much more fine-grained (10s of cycles)
– Compilers (with help from users – OpenMP, OpenACC) will have to do some work to coalesce hardware power-gating
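The idea of coalescing power-gating can be sketched with a toy instruction stream. The example assumes that every switch from one functional unit to another costs a fixed power-gating latency, and that the instructions are independent so they can be reordered freely; a real compiler would have to respect data dependences. The unit names, the 20-cycle latency, and both function names are hypothetical.

```python
GATE_LATENCY_CYCLES = 20  # hypothetical; the slide only says "10s of cycles"


def count_unit_switches(instructions):
    """Count how often consecutive instructions use different units."""
    units = [unit for unit, _ in instructions]
    return sum(1 for prev, cur in zip(units, units[1:]) if prev != cur)


def coalesce_by_unit(instructions):
    """Group instructions by unit, keeping first-use order of the units.

    Assumes the instructions are independent of each other.
    """
    groups = {}
    for unit, op in instructions:
        groups.setdefault(unit, []).append((unit, op))
    # Dicts preserve insertion order, so units appear in first-use order.
    return [instr for group in groups.values() for instr in group]


# (unit, operation) pairs that alternate between functional units.
stream = [
    ("scalar", "a"), ("vector", "b"), ("scalar", "c"),
    ("vector", "d"), ("fft", "e"), ("vector", "f"),
]

cycles_before = count_unit_switches(stream) * GATE_LATENCY_CYCLES
cycles_after = count_unit_switches(coalesce_by_unit(stream)) * GATE_LATENCY_CYCLES
```

Grouping by unit leaves one switch per unit boundary, so the gating overhead scales with the number of distinct units used rather than with how often the original code alternates between them.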

Verdict


Summary

Accelerators are of different kinds – GPUs are just one example

Decoupled memory accelerators do not have much of a chance to survive because of data staging requirements
– Fundamentally ill-suited for sparse/fine-grained computations
– Caveat: LINPACK is not a fine-grained computation, so the Top500 might still boast a GPU-like machine

Fine-grained instruction-level heterogeneity is required
– Many architectures are already going in that direction
– BG/Q and Intel MIC’s planned roadmap are in that direction
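The LINPACK caveat can be illustrated by comparing flops per byte for a dense and a sparse kernel. This is a sketch under stated assumptions: dense matrix multiply stands in for a LINPACK-like workload, sparse matrix-vector multiply stands in for a sparse one, vector traffic in the sparse case is ignored, and values are 8-byte doubles with 4-byte column indices. The function names are hypothetical.

```python
def dense_matmul_intensity(n, word_bytes=8):
    """Flops per byte for C = A * B with n x n matrices, each touched once."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * word_bytes
    return flops / bytes_moved


def sparse_matvec_intensity(nnz, word_bytes=8, index_bytes=4):
    """Flops per byte for y = A * x with nnz stored nonzeros.

    Assumption: only the matrix values and column indices are counted.
    """
    flops = 2 * nnz
    bytes_moved = nnz * (word_bytes + index_bytes)
    return flops / bytes_moved


dense_small = dense_matmul_intensity(1024)
dense_large = dense_matmul_intensity(4096)
sparse_small = sparse_matvec_intensity(10 ** 6)
sparse_large = sparse_matvec_intensity(10 ** 8)
```

The dense kernel's work per byte grows with the problem size, so staging can be amortized; the sparse kernel's stays constant and low no matter how large the problem gets.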
