Battle of the Accelerator Stars
Pavan Balaji
Computer Scientist
Group lead, Programming Models and Runtime Systems
Argonne National Laboratory
http://www.mcs.anl.gov/~balaji
Pavan Balaji, Argonne National Laboratory P2S2 Workshop Panel (09/10/2012)
Accelerators != GPUs
Anything that is built to perform a specific type of computation (i.e., not general-purpose computation) is an accelerator
– A vector instruction unit is an accelerator
– The double/quad floating point operations on BG/P and BG/Q are accelerators
– An H.264 media decoder sitting on a processor die is an accelerator
– There is no such thing as a “general purpose” accelerator
GPUs are one form of accelerator, but not the only one
Divergence in Accelerator Computing?
Divergence == Increasing difference
No. There are a lot of different models of accelerator computing today
– NVIDIA/AMD GPUs, FPGAs, AMD’s fused chip architectures, Intel MIC architecture, Intel Xeon, Blue Gene (Yes, they are accelerators too)
– Broadly classified into:
• Decoupled processing/Decoupled memory (think GPUs)
• Coupled processing/Decoupled memory (think AMD Fusion)
• Coupled processing/Coupled memory (think Intel Xeon/MIC, BG/Q)
But the trend is not towards increasing difference, but rather towards convergence
– Vendors and researchers are trying out different options to see what works and what does not
You have to kiss many frogs before you can find your prince!
Who will be the last man standing?
GPUs: Decoupled processing, Decoupled memory
Fused Processors (e.g., AMD Fusion): Coupled processing, Decoupled memory
General Purpose Processors with Accelerator Extensions (e.g., Xeon, MIC, BG/P, BG/Q): Coupled processing, Coupled memory
Quantum mechanical interactions are near-sighted (Walter Kohn)
Traditional quantum chemistry studies lie within the nearsighted range where interactions are dense:
Future quantum chemistry studies expose both short- and long-range interactions:
[Figure: Range of interactions between particles. Axes: distance, interaction strength.]
Note that the figures are phenomenological. Quantum chemistry methods treat correlation using a variety of approaches and have different short/long-range cutoffs.
Courtesy Jeff Hammond, Argonne National Laboratory
Wind Turbine and Flight Blade Designs
Blades are getting larger with every new design
– With larger blades, the additional lift or torque generated is from the outer regions of the blade
– Air flow from far-out regions of the blade has lower computational intensity, making the computation more “sparse”
Decoupled Processing/Decoupled Memory (GPUs)
Pros:
– A separate processing unit can be custom built for acceleration
– Faster memory; better designed memory and memory controllers for acceleration
Cons:
– Decoupled from the main processing unit
[Figure: regular CPU cores (control unit, ALUs, cache, DRAM) shown next to GPU cores]
Verdict
Coupled Processing/Decoupled Memory
Pros:
– Improved coupling of the processing units and memory allows for much faster synchronization
– Separate memory allows for better-optimized memory and memory controllers
Cons:
– The need for data staging does not disappear
[Figure: CPU and GPU, each with its own memory (CPU Memory, GPU Memory)]
Verdict
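The data staging point can be made concrete with a rough cost model. The sketch below is not from the slides: the function names and every machine parameter (link bandwidth, latency, flop rates) are hypothetical values chosen only for illustration. It compares running a kernel in place on the host against copying the data to a decoupled-memory accelerator, computing there, and copying back.

```python
def offload_time(n_bytes, flops, link_bw, link_latency, accel_rate):
    """Stage data to the accelerator, compute there, and copy results back.

    Simplification: the same number of bytes is copied in and out.
    """
    staging = 2 * (link_latency + n_bytes / link_bw)  # copy in + copy out
    return staging + flops / accel_rate


def host_time(flops, host_rate):
    """Compute in place on the host; no staging needed."""
    return flops / host_rate


# Hypothetical machine parameters (illustrative only, not measurements).
LINK_BW = 8e9         # bytes/s over the host-accelerator link
LINK_LATENCY = 10e-6  # seconds per transfer
ACCEL_RATE = 1e12     # flop/s on the accelerator
HOST_RATE = 1e11      # flop/s on the host

n = 4096
# Dense kernel: matrix multiply, O(n^3) flops on O(n^2) data.
dense = {"n_bytes": 3 * n * n * 8, "flops": 2 * n**3}
# Fine-grained kernel: a couple of flops per element on a small vector.
fine = {"n_bytes": 8 * 1000, "flops": 2 * 1000}

for name, k in (("dense", dense), ("fine-grained", fine)):
    t_off = offload_time(k["n_bytes"], k["flops"],
                         LINK_BW, LINK_LATENCY, ACCEL_RATE)
    t_host = host_time(k["flops"], HOST_RATE)
    print(f"{name}: offload {t_off:.3e} s, host {t_host:.3e} s")
```

Under these assumed numbers the dense kernel amortizes its staging cost and benefits from offload, while the fine-grained kernel spends almost all of its time on staging and is better left on the host.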
General Purpose Processors with Accelerator Extensions
Pros:
– Very fine-grained synchronization (no memory synchronization required; processing synchronization for power constraints)
Cons:
– Unified memory means that specialization is not possible (either in memory or in memory controllers)
– Single-die memory constraints
[Figure: example architectures (Intel: MIC; IBM: BG/Q; Tilera: GX; Godson T; Intel: SCC; Dally: Echelon; Chien: 10x10). Labels: “Power Constrained Memory Consistency”; “Extreme Specialization and Power Management”]
Towards On-chip Instruction-level Heterogeneity
Vector units were a form of instruction-level heterogeneity
– Some instructions use vector hardware, some don’t
– Vector instruction units processed the same data that other units processed
Synchronization requirements
– No memory staging requirements
– Theoretically, accelerator units can fit into the same instruction pipeline as general purpose processing
But there are some practical constraints
– Amount of acceleration is so high that not all hardware can be turned on at the same time (dark silicon with power gating will lead the way)
• So synchronization is not absent, but much more fine-grained (10s of cycles)
– Compilers (with help from users via OpenMP, OpenACC) will have to do some work to coalesce hardware power-gating
Verdict
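To illustrate what "coalescing hardware power-gating" could mean, here is a toy sketch. It is an assumption-laden model, not an actual compiler pass: `count_wakeups`, `coalesce`, and the `WAKE_CYCLES` value are hypothetical, only one unit is assumed to be powered at a time, and all operations are assumed independent so that reordering them is legal.

```python
from itertools import groupby

WAKE_CYCLES = 20  # hypothetical cost of powering a unit up ("10s of cycles")


def count_wakeups(ops):
    """Count runs of consecutive ops on the same unit.

    Each run requires one power-up, assuming only one unit is on at a time.
    """
    return sum(1 for _ in groupby(ops))


def coalesce(ops):
    """Group ops by unit, keeping units in order of first appearance.

    Only valid if the ops are independent of one another (an assumption).
    """
    first_seen = {}
    for op in ops:
        first_seen.setdefault(op, len(first_seen))
    return sorted(ops, key=first_seen.__getitem__)  # sorted() is stable


# An interleaved stream of scalar and vector operations.
stream = ["scalar", "vector", "scalar", "vector", "scalar", "vector"]

before = count_wakeups(stream) * WAKE_CYCLES
after = count_wakeups(coalesce(stream)) * WAKE_CYCLES
print(f"power-up overhead: {before} cycles interleaved, {after} cycles coalesced")
```

In this toy model, grouping operations that use the same unit reduces the number of power-gate transitions. A real compiler would also have to respect data dependences, which is where hints from the user could help.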
Summary
Accelerators are of different kinds; GPUs are just one example
Decoupled memory accelerators do not have much of a chance to survive because of data staging requirements
– Fundamentally ill-suited for sparse/fine-grained computations
– Caveat: LINPACK is not a fine-grained computation, so the Top500 might still boast a GPU-like machine
Fine-grained instruction-level heterogeneity is required
– Many architectures are already going in that direction
– BG/Q and Intel MIC’s planned roadmap are in that direction