

Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

Eric S. Chung, Peter A. Milder, James C. Hoe, Ken Mai
Computer Architecture Lab at Carnegie Mellon University (CALCM)
Los Alamos CS Symposium, October 2010

Based on Chung et al., to appear in Proc. Intl. Symposium on Microarchitecture (MICRO), 2010.

Heterogeneous Potpourri

• By 2022 (11nm node), a large die (~500mm²) will have over 10 billion transistors.
• What will you choose to put on it?

[Figure: candidate tenants for the die: one Big Core, many little cores, Custom Logic, GPGPU, and FPGA logic]

What we "know" about the future

[Plot, built up over three slides: 2009 Intl. Technology Roadmap for Semiconductors projections for area density, supply voltage, and device power reduction, normalized to 40nm (log scale), vs. technology node (40, 36, 32, 28, 25, 22, 20, 18, 16, 14, 13, 11 nm)]

• ITRS roadmap 2009: 16X area density (Moore's Law continues).
• But VDD isn't scaling, due to Vth.
• The result: only 4X lower device power.
• Area is free; power is not.

Outline

• The future is about perf-per-Watt and ops-per-Joule
  – Does the future need more than programmable processors? Answer: Yes
• Where do FPGAs and GPUs stand today?
  – "FPGAs suck at floating-point." Answer: No
  – "GPUs are power hogs." Answer: Not entirely
• Do future heterogeneous multicores include custom logic, FPGAs, and GPGPUs? Answer: . . . . . . . . .

Disclosure: I am a big fan of reconfigurable logic.

Energy Efficiency of Programmable Processors

[Plot: technology-normalized power (Watts) vs. technology-normalized performance (ops/sec) for Intel processors from the 486 up to the Pentium 4, from "Energy per Instruction Trends in Intel® Microprocessors," Grochowski et al., 2006]

• Empirically, Power ∝ Perf^1.75.
• Better to replace one Pentium 4-class core with two mid-range cores, or with N 486-class cores.
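As a concrete reading of that trend, here is a minimal sketch. The exponent 1.75 is from the slide; the 4x performance point and the assumption of perfectly parallel work are illustrative, not from the talk:

```python
# Sketch: the slide's empirical Power ~ Perf^1.75 trend implies that
# N slow cores deliver the same throughput as one fast core for far
# less power (assuming perfectly parallel work).
ALPHA = 1.75  # exponent from the Grochowski et al. trend

def core_power(perf):
    # Power of one core, normalized so that perf = 1 costs power = 1.
    return perf ** ALPHA

fast = core_power(4.0)      # one core at 4x baseline perf: ~11.3 units
slow = 4 * core_power(1.0)  # four baseline cores, same 4x throughput: 4 units
print(f"one fast core:   {fast:.1f} power units")
print(f"four slow cores: {slow:.1f} power units "
      f"({fast / slow:.2f}x better ops-per-Joule)")
```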

Asymmetric Multicores

• Combine a power-hungry "fast" core with lots of simple, power-efficient "slow" cores
  – the fast core for sequential code
  – the slow cores for parallel sections
• [Hill and Marty, 2008]: with total die area n in Base Core Equivalent (BCE) units and a fast core of area r,

    Speedup = 1 / ( (1 − f) / perf_seq(r) + f / (n − r) )

  – solve for the optimal die-area allocation depending on the parallelizable fraction f
• But up to 95% of the energy of von Neumann processors is "overhead" [Hameed, et al., ISCA 2010]

[Diagram: a Fast Core surrounded by BCEs, the Base Core Equivalent (BCE) of Hill and Marty, 2008]
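A minimal sketch of the simplified Hill-and-Marty model above. The Pollack's-Rule assumption perf_seq(r) = sqrt(r) and the example values of n and f are illustrative assumptions; the slide leaves perf_seq(r) abstract:

```python
import math

# Simplified Hill-and-Marty asymmetric speedup from the slide:
# one fast core of area r BCEs plus (n - r) single-BCE slow cores.
def asym_speedup(f, n, r, perf_seq=math.sqrt):
    # perf_seq defaults to sqrt(r) (Pollack's Rule) -- an assumption.
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (n - r))

# Sweep the fast-core area to find the best die split
# (n = 64 BCEs and f = 0.95 are illustrative values).
n, f = 64, 0.95
best_r = max(range(1, n), key=lambda r: asym_speedup(f, n, r))
print(f"best r = {best_r} BCEs, speedup = {asym_speedup(f, n, best_r):.1f}x")
```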

A hard look at the performance and ops-per-Joule of today's non-von-Neumann alternatives

[Photo: measurement setup, a PCIe x16 riser card with exposed power rails: dual 12V PCI-E rails, ATX 3.3V, ATX 5V, ATX 12V, and the "core" 12V rail]

GPUs, FPGAs, and all that

                 CPU              GPUs                                          FPGA             ASIC
                 Intel            Nvidia         Nvidia       ATI              Xilinx
                 Core i7-960      GTX285         GTX480       R5870            V6-LX760         Std. Cell
Year             2009             2008           2010         2009             2009             2007
Node             45nm             55nm           40nm         40nm             40nm             65nm
Die area         263mm²           470mm²         529mm²       334mm²           –                –
Clock rate       3.2GHz           1.5GHz         1.4GHz       1.5GHz           0.3GHz           –

Single-precision floating-point apps:
M-M-Mult         MKL 10.2.3       CUBLAS 2.3     CUBLAS 3.1   CAL++            hand-coded       same RTL
                 (multithreaded)
FFT              Spiral.net       CUFFT 2.3,     CUFFT 3.0    –                Spiral.net       same RTL
                 (multithreaded)  3.0/3.1
Black-Scholes    PARSEC           CUDA 2.3       –            –                hand-coded       same RTL
                 (multithreaded)

In-Core Performance and Energy

MMM (single precision; (GFLOP/s)/mm² and GFLOP/J normalized to 40nm):

Device                       GFLOP/s (actual)   (GFLOP/s)/mm²   GFLOP/J
Intel Core i7 (45nm)         96                 0.50            1.14
Nvidia GTX285 (55nm)         425                2.40            6.78
Nvidia GTX480 (40nm)         541                1.28            3.52
ATI R5870 (40nm)             1491               5.95            9.87
Xilinx V6-LX760 (40nm)       204                0.53            3.62
same RTL, std cell (65nm)    694                19.28           50.73

• CPU and GPU benchmarking was compute-bound; the FPGA and std cell were effectively compute-bound (no off-chip I/O).
• Power (switching + leakage) measurements isolated the core from the rest of the system.
• For details see [Chung, et al., MICRO 2010].

More of the Same

FFT-2^10 (single precision):

Device                       GFLOP/s   (GFLOP/s)/mm²   GFLOP/J
Intel Core i7 (45nm)         67        0.35            0.71
Nvidia GTX285 (55nm)         250       1.41            4.2
Nvidia GTX480 (40nm)         453       1.08            4.3
ATI R5870 (40nm)             –         –               –
Xilinx V6-LX760 (40nm)       380       0.99            6.5
same RTL, std cell (65nm)    952       239             90

Black-Scholes:

Device                       Mopts/s   (Mopts/s)/mm²   Mopts/J
Intel Core i7 (45nm)         487       2.52            4.88
Nvidia GTX285 (55nm)         10756     60.72           189
Nvidia GTX480 (40nm)         –         –               –
ATI R5870 (40nm)             –         –               –
Xilinx V6-LX760 (40nm)       7800      20.26           138
same RTL, std cell (65nm)    25532     1719            642.5

Onward with a Future of Heterogeneous Multicores

[Diagram: cores plus a question mark: Custom logic? FPGA? GPGPU?]

Modeling Heterogeneous Multicores

Asymmetric ([Hill and Marty, 2008], simplified):

    Speedup = 1 / ( (1 − f) / perf_seq(r) + f / (n − r) )

  – f is the fraction parallelizable
  – n is the total die area in BCE units
  – r is the fast-core area in BCE units
  – perf_seq(r) is the fast core's performance relative to a BCE (Base Core Equivalent)

Heterogeneous:

    Speedup = 1 / ( (1 − f) / perf_seq(r) + f / (μ · (n − r)) )

For the sake of analysis, break the area for the GPU/FPGA/etc. into units of U-cores that are the same size as BCEs. Each U-core type is characterized by a relative performance μ and a relative power Φ compared to a BCE.

[Diagrams: an asymmetric multicore (Fast Core plus BCEs) and a heterogeneous multicore (Fast Core plus U-cores)]
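A sketch of the heterogeneous form above, under the same assumptions as before (perf_seq(r) = sqrt(r); n, r, and f are illustrative). The μ values quoted in the comment come from the next slide's MMM column:

```python
import math

# Heterogeneous speedup: the parallel fraction runs on (n - r) U-cores,
# each delivering mu times a BCE's performance.
def hetero_speedup(f, n, r, mu, perf_seq=math.sqrt):
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (mu * (n - r)))

# MMM mu values from the "Phi and mu example values" slide:
# ATI R5870 = 8.47, Xilinx LX760 = 0.75, custom logic = 27.4.
for name, mu in [("GPU R5870", 8.47), ("FPGA LX760", 0.75), ("custom", 27.4)]:
    print(f"{name:>10}: {hetero_speedup(f=0.99, n=64, r=4, mu=mu):.1f}x")
```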

Φ and μ example values

                    MMM           Black-Scholes   FFT-2^10
                    Φ      μ      Φ      μ        Φ      μ
Nvidia GTX285       0.74   3.41   0.57   17.0     0.63   2.88
Nvidia GTX480       0.77   1.83   –      –        0.47   2.20
ATI R5870           1.27   8.47   –      –        –      –
Xilinx LX760        0.31   0.75   0.26   5.68     0.29   2.02
Custom Logic        0.79   27.4   4.75   482      4.96   489

• Reading the table: on an equal-area basis, the GTX285 delivers 3.41x the MMM performance at 0.74x the power, relative to a BCE.
• Nominal BCE based on an Intel Atom in-order processor, 26mm² in a 45nm process.

Modeling Power and Bandwidth Budgets

• The model above is based on area alone.
• A power or bandwidth budget further limits the usable die area:
  – if P is the total power budget, expressed as a multiple of a BCE's power, then usable U-core area ≤ P / Φ
  – if B is the total memory bandwidth, also expressed as a multiple of a BCE's, then usable U-core area ≤ B / μ

[Diagram: heterogeneous multicore (Fast Core plus U-cores), with Speedup = 1 / ( (1 − f) / perf_seq(r) + f / (μ · (n − r)) )]
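A sketch of how the two budget caps compose with the area budget. The min() composition and the example P and B values are assumptions consistent with the definitions above, not something the slide spells out:

```python
# Usable U-core area under three budgets: area (n - r), power (P / phi),
# and memory bandwidth (B / mu), all in BCE-equivalent units.
def usable_ucore_area(n, r, P, B, phi, mu):
    return min(n - r, P / phi, B / mu)

# Example with the R5870 MMM characteristics (phi = 1.27, mu = 8.47);
# P = 40 and B = 120 BCE-multiples are illustrative budgets.
area = usable_ucore_area(n=64, r=4, P=40, B=120, phi=1.27, mu=8.47)
print(f"usable U-core area: {area:.1f} BCEs")
# min(60, 31.5, 14.2) -> 14.2: this configuration is bandwidth-bound.
```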

Combine Model with ITRS Trends

Year                       2011    2013    2016    2019    2022
Technology                 40nm    32nm    22nm    16nm    11nm
Core die budget (mm²)      432     432     432     432     432
Normalized area (BCE)      19      37      75      149     298     (16x)
Core power (W)             100     100     100     100     100
Bandwidth (GB/s)           180     198     234     234     252     (1.4x)
Rel. power per device      1X      0.75X   0.5X    0.36X   0.25X

• The 2011 parameters reflect high-end systems today; future parameters are extrapolated from ITRS 2009.
• The 432mm² is populated by an optimally sized Fast Core and U-cores of choice.
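To make the scaling sweep concrete, here is a sketch that walks the table's nodes through the model. Converting watts and GB/s into BCE multiples needs per-BCE baselines the slide does not give, so the 5 W and 10 GB/s figures below are purely illustrative assumptions:

```python
import math

# (area in BCEs, bandwidth in GB/s, rel. power per device) per node,
# taken from the table above.
NODES = {
    "40nm": (19, 180, 1.00), "32nm": (37, 198, 0.75),
    "22nm": (75, 234, 0.50), "16nm": (149, 234, 0.36),
    "11nm": (298, 252, 0.25),
}
POWER_W = 100.0        # core power budget from the table
BCE_WATTS_2011 = 5.0   # assumed: watts per BCE at the 2011 node
BCE_BW = 10.0          # assumed: GB/s consumed per BCE of performance

def speedup(f, n, r, mu, phi, P, B):
    usable = min(n - r, P / phi, B / mu)
    return 1.0 / ((1.0 - f) / math.sqrt(r) + f / (mu * usable))

# Sweep the nodes for a GPU-like U-core (R5870 MMM: mu=8.47, phi=1.27).
for node, (n, bw, rel_pwr) in NODES.items():
    P = POWER_W / (BCE_WATTS_2011 * rel_pwr)  # power budget in BCE units
    B = bw / BCE_BW                           # bandwidth budget in BCE units
    print(f"{node}: {speedup(f=0.9, n=n, r=4, mu=8.47, phi=1.27, P=P, B=B):.1f}x")
```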

Single-Prec. MMMult (f = 90%)

[Plot: speedup (0 to 40) vs. technology node (40nm to 11nm), with power-bound and memory-bound regions marked. Curves, from top: ASIC, GPU R5870, FPGA LX760, asymmetric multicore, symmetric multicore.]

Single-Prec. MMMult (f = 99%)

[Plot: same axes, speedup up to 300. The ASIC leads, followed by GPU R5870 and FPGA LX760; the symmetric and asymmetric multicores cluster at the bottom. Power-bound and memory-bound regions marked.]

Single-Prec. MMMult (f = 50%)

[Plot: same axes, speedup up to 8. The ASIC, GPU, and FPGA curves converge near the top, above the asymmetric and symmetric multicores. Power-bound and memory-bound regions marked.]

Single-Prec. FFT-1024 (f = 99%)

[Plot: same axes, speedup up to 60. The ASIC, GPU (GTX480), and FPGA (LX760) curves converge, above the asymmetric and symmetric multicores. Power-bound and memory-bound regions marked.]

FFT-1024 (f = 99%), with a hypothetical 1 TB/sec of bandwidth

[Plot: same axes, speedup up to 200. With the bandwidth bound lifted, the ASIC pulls ahead, followed by GPU GTX480 and FPGA LX760; the symmetric and asymmetric multicores remain at the bottom.]

Performance Scaling Summary

• Perf-per-Watt and ops-per-Joule matter
  – need to look beyond programmable processors
  – GPUs and FPGAs are viable programmable candidates
• GPU/FPGA/custom logic help performance only if
  – a significant fraction of the work is amenable to acceleration, and
  – there is adequate bandwidth to sustain the acceleration (3D-stacked memory could be very helpful!)
• Without adequate bandwidth, the GPU/FPGA catches up with custom logic in achievable performance but remains programmable and flexible
• Custom logic is best for maximizing performance under tight energy/power constraints

A quick plug for using FPGAs in computing

Check out the CARL 2010 Workshop.

Programming FPGAs is Hard?

• So is programming CPUs and GPUs at the level of optimization we are talking about; and those programs are not "portable" [www.spiral.net]
• Whatever the platform, these kernels will be developed by a few and used by many
• Efficiency has to be #1; ease of programming #2

Check out the CARL 2010 Workshop.

So what is the problem?

• Building HW kernels for the FPGA fabric is relatively easy (at least for some of us)
• The real pain is in connecting to the external system, in particular off-chip memory
• FPGAs have no native memory architecture
  – soft implementations of the memory controller and memory hierarchy leave performance (capacity, bandwidth, and latency) on the table
  – why can't I have virtual memory?
• The FPGA fabric has no memory abstraction

Check out the CARL 2010 Workshop.

CoRAM: FPGA Architecture for Computing

[Diagram: FPGA fabric (BRAMs and DSP slices) connected through a NoC and caches to memory controllers and an MMU, alongside CPUs]

Check out the CARL 2010 Workshop.

Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM

Check out the CARL 2010 Workshop.