

Page 1

Battle of the Accelerator Stars

Pavan Balaji
Computer Scientist
Group lead, Programming Models and Runtime Systems
Argonne National Laboratory
balaji@mcs.anl.gov
http://www.mcs.anl.gov/~balaji

Page 2

Pavan Balaji, Argonne National Laboratory P2S2 Workshop Panel (09/10/2012)

Accelerators != GPUs

Anything that is built to speed up a specific type of computation (i.e., not general purpose computation) is an accelerator
– A vector instruction unit is an accelerator
– The double/quad floating point operations on BG/P and BG/Q are accelerators
– An H.264 media decoder sitting on a processor die is an accelerator
– There is no such thing as a “general purpose” accelerator

GPUs are one form of accelerator, but not the only one

Page 3

Divergence in Accelerator Computing?

Divergence == Increasing difference

No. There are a lot of different models of accelerator computing today
– NVIDIA/AMD GPUs, FPGAs, AMD’s fused chip architectures, Intel MIC architecture, Intel Xeon, Blue Gene (yes, they are accelerators too)
– Broadly classified into:
• Decoupled processing/Decoupled memory (think GPUs)
• Coupled processing/Decoupled memory (think AMD Fusion)
• Coupled processing/Coupled memory (think Intel Xeon/MIC, BG/Q)

But the trend is not towards increasing difference, but rather towards convergence
– Vendors and researchers are trying out different options to see what works and what does not

You have to kiss many frogs before you can find your prince!

Page 4

Who will be the last man standing?

GPUs
– Decoupled processing
– Decoupled memory

Fused Processors (e.g., AMD Fusion)
– Coupled processing
– Decoupled memory

General Purpose Processors with Accelerator Extensions (e.g., Xeon, MIC, BG/P, BG/Q)
– Coupled processing
– Coupled memory
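The three contenders above can be written down as a small data structure. This is only an illustrative sketch: the names `Coupling`, `AcceleratorClass`, `CLASSES`, and `needs_data_staging` are made up for this example, and the staging rule simply encodes the deck's own argument that decoupled memory implies data staging.

```python
from dataclasses import dataclass
from enum import Enum


class Coupling(Enum):
    COUPLED = "coupled"
    DECOUPLED = "decoupled"


@dataclass(frozen=True)
class AcceleratorClass:
    """One class of accelerator, described by how it couples to the host."""
    name: str
    processing: Coupling
    memory: Coupling
    examples: tuple

    def needs_data_staging(self) -> bool:
        # Assumption taken from the slides: separate memory means data
        # has to be staged between host memory and accelerator memory.
        return self.memory is Coupling.DECOUPLED


# The three classes and their examples as listed on the slide.
CLASSES = (
    AcceleratorClass("GPUs", Coupling.DECOUPLED, Coupling.DECOUPLED,
                     ("NVIDIA/AMD GPUs",)),
    AcceleratorClass("Fused processors", Coupling.COUPLED, Coupling.DECOUPLED,
                     ("AMD Fusion",)),
    AcceleratorClass("General purpose processors with accelerator extensions",
                     Coupling.COUPLED, Coupling.COUPLED,
                     ("Xeon", "MIC", "BG/P", "BG/Q")),
)

for c in CLASSES:
    print(f"{c.name}: processing={c.processing.value}, "
          f"memory={c.memory.value}, staging={c.needs_data_staging()}")
```

Only the last class avoids staging under this rule, which is the distinction the later slides build on.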

Page 5

Quantum mechanical interactions are near-sighted (Walter Kohn)

Traditional quantum chemistry studies lie within the nearsighted range where interactions are dense:

Future quantum chemistry studies expose both short- and long-range interactions:

[Figures: interaction strength versus distance, showing the range of interactions between particles]

Note that the figures are phenomenological. Quantum chemistry methods treat correlation using a variety of approaches and have different short/long-range cutoffs.

Courtesy Jeff Hammond, Argonne National Laboratory

Page 6

Wind Turbine and Flight Blade Designs

Blades are getting larger with every new design
– With larger blades, the additional lift or torque generated comes from the outer regions of the blade
– Air flow from the far-out regions of the blade has lower computational intensity, making the computation more “sparse”

Page 7

Decoupled Processing/Decoupled Memory (GPUs)

Pros:
– A separate processing unit can be custom built for acceleration
– Faster memory; better designed memory and memory controllers for acceleration

Cons:
– Decoupled from the main processing unit

[Figure: regular CPU cores (control unit, ALUs, cache, DRAM) compared with GPU cores]

Verdict

Page 8

Coupled Processing/Decoupled Memory

Pros:
– Improved coupling of the processing units and memory allows for much faster synchronization
– Separate memory allows for better optimized memory and memory controllers

Cons:
– The need for data staging does not disappear
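Why staging matters can be illustrated with a toy cost model. This is a rough sketch under stated assumptions, not a measurement: the function names and all machine parameters (link bandwidth, latency, and compute rates) are hypothetical values chosen only to show the shape of the trade-off.

```python
def offload_time(bytes_staged, flops, link_bytes_per_s, link_latency_s,
                 accel_flops_per_s):
    """Time to stage data to and from separate accelerator memory, plus compute.

    bytes_staged is the total volume moved in both directions.
    """
    staging = 2 * link_latency_s + bytes_staged / link_bytes_per_s
    compute = flops / accel_flops_per_s
    return staging + compute


def host_time(flops, host_flops_per_s):
    """Time to run the same work in place on the host, with no staging."""
    return flops / host_flops_per_s


# Hypothetical machine parameters (assumptions, not vendor figures).
LINK_BYTES_PER_S = 8e9      # host <-> accelerator link bandwidth
LINK_LATENCY_S = 10e-6      # one-way transfer setup latency
ACCEL_FLOPS_PER_S = 1e12    # accelerator compute rate
HOST_FLOPS_PER_S = 1e11     # host compute rate


def offload_wins(bytes_staged, flops):
    """True if staging plus accelerated compute beats computing on the host."""
    t_accel = offload_time(bytes_staged, flops, LINK_BYTES_PER_S,
                           LINK_LATENCY_S, ACCEL_FLOPS_PER_S)
    return t_accel < host_time(flops, HOST_FLOPS_PER_S)


# A coarse-grained, dense kernel amortizes the staging cost...
print("coarse kernel offload wins:", offload_wins(bytes_staged=8e6, flops=1e10))
# ...while a fine-grained kernel is dominated by it.
print("fine kernel offload wins:", offload_wins(bytes_staged=8e3, flops=1e4))
```

With these made-up numbers the coarse kernel benefits from offload while the fine-grained one does not, because the fixed transfer latency alone exceeds the host's compute time.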

[Figure: CPU and GPU, each with its own memory (CPU Memory, GPU Memory)]

Verdict

Page 9

General Purpose Processors with Accelerator Extensions

Pros:
– Very fine-grained synchronization (no memory synchronization required; processing synchronization for power constraints)

Cons:
– Unified memory means that specialization is not possible (either in memory or in memory controllers)
– Single die memory constraints

[Figure: example architectures (Intel: MIC; IBM: BG/Q; Tilera: GX; Godson T; Intel: SCC; Dally: Echelon; Chien: 10x10), with the labels “Power Constrained Memory Consistency” and “Extreme Specialization and Power Management”]

Page 10

Towards On-chip Instruction-level Heterogeneity

Vector units were a form of instruction-level heterogeneity
– Some instructions use vector hardware, some don’t
– Vector instruction units processed the same data that other units processed

Synchronization requirements
– No memory staging requirements
– Theoretically, accelerator units can fit into the same instruction pipeline as general purpose processing

But there are some practical constraints
– Amount of acceleration is so high that not all hardware can be turned on at the same time (dark silicon with power gating will lead the way)
• So synchronization is not absent, but much more fine-grained (10s of cycles)
– Compilers (with help from users, e.g., through OpenMP or OpenACC) will have to do some work to coalesce hardware power-gating
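One way to picture "coalescing" power-gating is as instruction reordering that keeps work for the same hardware unit together, so units are switched on and off less often. The sketch below is a simplified illustration, not a description of any real compiler: `count_unit_switches` and `coalesce` are hypothetical helpers, and the greedy heuristic is just one possible policy.

```python
def count_unit_switches(schedule, unit_of):
    """Count how often consecutive instructions use different hardware units."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:])
               if unit_of[prev] != unit_of[cur])


def coalesce(instrs, unit_of, deps):
    """Greedy dependency-respecting reorder that prefers the active unit.

    instrs:  instruction names in program order
    unit_of: instruction name -> hardware unit it needs
    deps:    instruction name -> set of instructions it must follow
    """
    scheduled = []
    done = set()
    active = None
    while len(scheduled) < len(instrs):
        # Instructions whose dependencies have all been scheduled already.
        ready = [i for i in instrs
                 if i not in done and set(deps.get(i, ())) <= done]
        if not ready:
            raise ValueError("dependency cycle")
        # Prefer an instruction that keeps the current unit powered on.
        same_unit = [i for i in ready if unit_of[i] == active]
        pick = same_unit[0] if same_unit else ready[0]
        scheduled.append(pick)
        done.add(pick)
        active = unit_of[pick]
    return scheduled


# Hypothetical interleaved program: scalar ops s0..s2 and vector ops v0..v2.
program = ["s0", "v0", "s1", "v1", "s2", "v2"]
unit_of = {"s0": "scalar", "s1": "scalar", "s2": "scalar",
           "v0": "vector", "v1": "vector", "v2": "vector"}
deps = {"s1": {"s0"}, "s2": {"s1"}, "v1": {"v0"}, "v2": {"v1"}}

reordered = coalesce(program, unit_of, deps)
print(program, count_unit_switches(program, unit_of))
print(reordered, count_unit_switches(reordered, unit_of))
```

In this example the interleaved order switches units five times, while the coalesced order switches once. A real compiler would also have to weigh latency hiding and register pressure, which this sketch ignores.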

Verdict

Page 11

Summary

Accelerators are of different kinds; GPUs are just one example

Decoupled memory accelerators do not have much of a chance to survive because of data staging requirements
– Fundamentally ill-suited for sparse/fine-grained computations
– Caveat: LINPACK is not a fine-grained computation, so the Top500 might still boast a GPU-like machine

Fine-grained instruction-level heterogeneity is required
– Many architectures are already going in that direction
– BG/Q and Intel MIC’s planned roadmap are in that direction