single chip heterogeneous computing: the future...
TRANSCRIPT
Single‐Chip Heterogeneous Computing: Does the Future Include Custom Logic,
FPGAs, and GPGPUs?
Eric S. Chung, Peter A. Milder,
James C. Hoe, Ken Mai
Computer Architecture Lab at
Based on Chung, et al. to appear in Proc. Intl. Symposium on Microarchitecture (MICRO), 2010.
Heterogeneous Potpourri• By 2022 (11nm node), a large die (~500mm2) will have
over 10 billion transistors
What will you choose
Big Corelittle core
little core
little core
little core
little core
little core
little core
little core
little coreCustom
Logic
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐2
to put on it?
GPGPU FPGA
Logic
100
g)
ITRS road map 2009
Area density
What we “know” about the future
2009 Intl. Technology Roadmap for Semiconductors
1
10
malize
d t
o 4
0n
m (
log
) y
16X Area Density
Moore’s Law
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐3
0
40 36 32 28 25 22 20 18 16 14 13 11
Norm
Technology Node (nm)
100
g)
ITRS road map 2009
Area density
What we “know” about the future
2009 Intl. Technology Roadmap for Semiconductors
1
10
malize
d t
o 4
0n
m (
log
) ySupply voltage
B t V i ’t li d t V
16X Area Density
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐4
0
40 36 32 28 25 22 20 18 16 14 13 11
Norm
Technology Node (nm)
But VDD isn’t scaling due to Vth
100
g)
ITRS road map 2009
Area density
What we “know” about the future
2009 Intl. Technology Roadmap for Semiconductors
1
10
malize
d t
o 4
0n
m (
log
) ySupply voltageDevice power reduction
Only 4X Lower Power16X Area Density
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐5
0
40 36 32 28 25 22 20 18 16 14 13 11
Norm
Technology Node (nm)Area is free; power is not
Outline• Future is about perf‐per‐watt and op‐per‐Joule
– does the future need more than programmable processors Answer: YesAnswer: Yes
• Where do FPGAs and GPUs stand today?– FPGAs suck at floating‐point Answer: No
– GPUs are power hogs Answer: Not entirely
• Do the future heterogeneous multicores include custom logic FPGAs and GPGPUs?
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐6
logic, FPGAs, and GPGPUs?Answer: . . . . . . . . .
Disclosure: I am a big fan of reconfigurable logic
Energy Efficiency ofProgrammable Processors
technologynormalized
Pentium 4
power(Watt)
PowerPerf1.75
Better to replace 1 of thisby 2 of these; Or N ofthese
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐7
technologynormalizedperformance(op/sec)Energy per Instruction Trends in Intel®
Microprocessors, Grochowski et al., 2006
486
Asymmetric Multicores
• Combine a power‐hungry “fast” core with lots of simple, power‐efficient “ l ”BCE BCE BCE BCE
BCE BCE
BCE BCE
“slow” cores
– fast core for sequential code
– slow cores for parallel sections
• [Hill and Marty, 2008]Fast Core
BCE BCE BCE BCE
BCE
BCE
BCE
BCE
BCE
BCEBCE BCE perfrnf
perff
Speedup
)(1
1
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐8
– solve for optimal die area allocation depending on parallelizability f
• But, up to 95% of energy of
von Neuman processors is “overhead”
[Hameed, et al., ISCA 2010]
C
Base Core Equivalent (BCE) in [Hill and Marty, 2008]
seqseq perfrnperf )(
A hard look at the performance and ops‐per‐Joule of today’s
non‐von Neuman alternativesnon von Neuman alternatives
Dual 12V PCI-E Rails
ATX 3.3V
ATX 12V
ATX 5V
Exposed
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐9
“Core” 12V Rail
PCIe x16 Riser Card
ppower rails
GPUs, FPGAs and all thatCPU GPUs FPGA ASIC
IntelCore i7‐960
NvidiaGTX285
NvidiaGTX480
ATIR5870
XilinxV6‐LX760 Std. Cell
Y 2009 2008 2010 2009 2009 2007Year 2009 2008 2010 2009 2009 2007
Node 45nm 55nm 40nm 40nm 40nm 65nm
Die area 263mm2 470mm2 529mm2 334mm2 ‐ ‐
Clock rate 3.2GHz 1.5GHz 1.4GHz 1.5GHz 0.3GHz ‐
Single‐prec.floating‐point
apps
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐10
M‐M‐MultMKL 10.2.3
multithreaded
CUBLAS 2.3
CUBLAS 3.1 CAL++ hand‐coded
FFTSpiral.net
multithreaded
CUFFT 2.33.0/3.1
CUFFT 3.0 ‐ Spiral.net
Black‐ScholesPARSEC
multithreadedCUDA 2.3 ‐ ‐ hand‐coded
In‐Core Performance and Energy
DeviceGFLOP/sactual
(GFLOP/s)/mm2
normalized to 40nm
GFLOP/Jnormalized to
40nm
MMM
Intel Core i7 (45nm) 96 0.50 1.14
Nvidia GTX285 (55nm) 425 2.40 6.78
Nvidia GTX480 (40nm) 541 1.28 3.52
ATI R5870 (40nm) 1491 5.95 9.87
Xilinx V6‐LX760 (40nm) 204 0.53 3.62
same RTL std cell (65nm) 694 19.28 50.73
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐11
• CPU and GPU benchmarking was compute‐bound; FPGA and Std Cell effectively compute‐bound (no off‐chip I/O)
• Power (switching+leakage) measurements isolated the core from the system
• For detail see [Chung, et al. MICRO 2010]
More of the SameGFLOP/s (GFLOP/s)/mm2 GFLOP/J
Intel Core i7 (45nm) 67 0.35 0.71
Nvidia GTX285 (55nm) 250 1 41 4 2
Mopt/s (Mopts/s)/mm2 Mopts/J
I t l C i7 (45 ) 487 2 52 4 88
FFT‐210
Nvidia GTX285 (55nm) 250 1.41 4.2
Nvidia GTX480 (40nm) 453 1.08 4.3
ATI R5870 (40nm) ‐ ‐ ‐
Xilinx V6‐LX760 (40nm) 380 0.99 6.5
same RTL std cell (65nm) 952 239 90
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐12
Black‐Scholes
Intel Core i7 (45nm) 487 2.52 4.88
Nvidia GTX285 (55nm) 10756 60.72 189
Nvidia GTX480 (40nm) ‐ ‐ ‐
ATI R5870 (40nm) ‐ ‐ ‐
Xilinx V6‐LX760 (40nm) 7800 20.26 138
same RTL std cell (65nm) 25532 1719 642.5
Onward with a Future ofHeterogeneo s M lticoresHeterogeneous Multicores
Cores
?
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐13
?
Custom FPGA GPGPU
Modeling Heterogeneous MulticoresAsymmetric
Fast
BCE BCE BCE BCE
BCE BCE seqseq perfrnf
perff
Speedup
)(1
1
FastCore
BCE
BCE
BCE
BCE
BCE
BCEBCE BCE
U U U U
Heterogeneous
1
[Hill and Marty, 2008] simplifiedf is fraction parallelizablen is total die area in BCE unitsr is fast core area in BCE unitsperfseq(r) is fast core perf. relative to BCE
Base Core Equivalent
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐14
Fast Core
U
U
U
U
U
UU U
)-(-1
1
rnf
perff
Speedup
seq
For the sake of analysis, break the area for GPU/FPGA/etc. into units of U‐cores that are the same size as BCEs. Each U‐core type is characterized by a relative performance µand relative power compared to a BCE
and μ example values
MMM Black‐Scholes FFT‐210
Nvidia Φ 0.74 0.57 0.63
GTX285 μ 3.41 17.0 2.88
NvidiaGTX480
Φ 0.77 ‐ 0.47
μ 1.83 ‐ 2.20
ATIR5870
Φ 1.27 ‐ ‐
μ 8.47 ‐ ‐
Xilinx Φ 0.31 0.26 0.29
On equal area basis, 3.41x performance at 0.74x power relative a BCE
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐15
LX760 μ 0.75 5.68 2.02
CustomLogic
Φ 0.79 4.75 4.96
μ 27.4 482 489
Nominal BCE based on an Intel Atom in‐order processor, 26mm2 in a 45nm process
Modeling Power and Bandwidth Budgets
U U U U
U U
Heterogeneous
-11
ffSpeedup
• The above is based on area alone
• Power or bandwidth budget limits the usable die area
Fast Core
U
U
U
U
U
UU U
)-( rnperfseq
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐16
– if P is total power budget expressed as a multiple of a BCE’s power,
then usable U‐core area
– if B is total memory bandwidth expressed also as a multiple of BCEs,
then usable U‐core area
Prn
μBrn
Combine Model with ITRS Trends
Year 2011 2013 2016 2019 2022
Technology 40nm 32nm 22nm 16nm 11nm
C di b d ( 2) 432 432 432 432 432
2011 fl hi h d
Core die budget (mm2) 432 432 432 432 432
Normalized area (BCE) 19 37 75 149 298
Core power (W) 100 100 100 100 100
Bandwidth (GB/s) 180 198 234 234 252
Rel pwr per device 1X 0.75X 0.5X 0.36X 0.25X
(16x)
(1.4x)
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐17
• 2011 parameters reflect high‐end systems today; future parameters extrapolated from ITRS 2009
• 432mm2 populated by an optimally sized Fast Core and U‐cores of choice
Fast Core
U U U U
U
U
U
U
U
UU U
Single‐Prec. MMMult (f=90%)
3
40 ASIC
15
20
25
30
35
Sp
ee
dup
GPU R5870
FPGA LX760
Asymmetricmulticore
Symmetric
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐18
0
5
10
40nm 32nm 22nm 16nm 11nm
f=0.900
(0) SymCMP(1) AsymCMP
(2) ASIC(3) LX760
(5) R5870
Symmetricmulticore
SymMCAsymMC
ASICFPGA LX760
GPU R5870 Power Bound
Mem Bound
Single‐Prec. MMMult (f=99%)
300 ASIC
100
150
200
250
Sp
ee
dup
GPU R5870
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐19
0
50
40nm 32nm 22nm 16nm 11nm
f=0.990
(0) SymCMP(1) AsymCMP
(2) ASIC(3) LX760
(5) R5870SymMCAsymMC
ASICFPGA LX760
GPU R5870
FPGA LX760
Sym & Asymmulticore
Power Bound
Mem Bound
Single‐Prec. MMMult (f=50%)
8 ASIC/GPU/FPGA
3
4
5
6
7
Sp
ee
dup
Asymmetric
Symmetric
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐20
0
1
2
40nm 32nm 22nm 16nm 11nm
f=0.500
(0) SymCMP(1) AsymCMP
(2) ASIC(3) LX760
(5) R5870SymMCAsymMC
ASICFPGA LX760
GPU R5870 Power Bound
Mem Bound
Single‐Prec. FFT‐1024 (f=99%)
60ASIC/GPU/FPGA
20
30
40
50
Sp
ee
dup
Asymmetric
Symmetric
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐21
0
10
40nm 32nm 22nm 16nm 11nm
f=0.990
(0) SymCMP(1) AsymCMP
(2) ASIC(3) LX760
(4) GTX480SymMCAsymMC
ASICFPGA LX760
GPU R5870 Power Bound
Mem Bound
FFT‐1024 (f=99%)if hypothetical 1TB/sec bandwidth
200ASIC
100
150
Sp
ee
dup
GPU GTX480
FPGA LX760
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐22
0
50
40nm 32nm 22nm 16nm 11nm
f=0.990
(0) SymCMP(1) AsymCMP
(2) ASIC(3) LX760
(4) GTX480SymMCAsymMC
ASICFPGA LX760
GPU R5870
Sym & Asymmulticore
Power Bound
Mem Bound
Performance Scaling Summary• Perf‐per‐Watt and op‐per‐Joule matter
– need to look beyond programmable processors
– GPUs and FPGAs are viable programmable candidates
• GPU/FPGA/custom logic help performance only if
– significant fraction amenable to acceleration, and
– adequate bandwidth to sustain acceleration
3D‐stacked memory could be very helpful!
Wi h d b d id h GPU/FPGA h
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐23
• Without adequate bandwidth, GPU/FPGA catches up with custom logic in achievable performance but remains programmable and flexible
• Custom logic best for maximizing performance under tight energy/power constraints
A quick plug forA quick plug for using FPGAs in computing
check out the CARL 2010 Workshop
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐24
Programming FPGA is Hard?• So is programming CPUs and GPUs at the level of
optimization we are talking about; and they are not “portable”
[www.spiral.net]
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐25
• Whatever the platform, these kernels will be developed by a few and used by many
• Efficiency has to be #1; ease of programming #2check out the CARL 2010 Workshop
So what is the problem?• Building HW kernels for FPGA fabric is relatively easy (at
least for some of us)
Th l i i i i l i• The real pain is in connecting to external system, in particular off‐chip memory
• FPGAs have no native memory architecture
– soft implementations of memory controller and memory hierarchy leave performance (capacity, bandwidth and latency) on the table
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐26
y)
– why can’t I have virtual memory?
• FPGA fabric has no memory abstraction
check out the CARL 2010 Workshop
CoRAM: FPGA Architecture for ComputingNoC
Mem
ory C
ontro
llers an
Mem
ory C
ontro
llers an BRAMs
Cach
es
Cach
es
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐27
nd M
MU
nd M
MU
and DSP slices
CPU CPU
check out the CARL 2010 Workshop
CMU/ECE/CALCM/Chung&Hoe Los Alamos CS Symposium, October 2010, slide‐28
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM
check out the CARL 2010 Workshop