karu sankaralingam university of wisconsin-madison collaborators: hadi esmaeilzadeh, emily blem,...

Karu SankaralingamUniversity of Wisconsin-Madison

Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee

St. Amant, and Doug Burger

The Dark Silicon Implications for Microprocessors

Multicore Decade?

2

We have relied on multicore scaling for over five years.

How much longer will it be our primary performance scaling technique?

2000 2005 2010 2015

Pentium Extreme

Dual-Core

Core 2 Quad-Core

i7 980x Hex-Core

?

Finding Optimal Multicore Designs

3

Comprehensive design space:Fixed area budgetFixed power budgetTwo sets of CMOS scaling projectionsOptimal core and diverse multicore organizationsParallel benchmarks

For next 5 technology generations, we find the best performing multicore from a

comprehensive design space search for each of the PARSEC benchmarks

Symmetric Multicore Projections

4

Symmetric multicores alone will not sustain the multicore era.

0 2 4 6 8 100

4

8

12

16

20

Target

Symmetric

Year

Sp

eed

up

3.4x in 10 years

18x

Multicore Solutions

5

Asymmetric Topologies

0 2 4 6 8 100

4

8

12

16

20Target

Sym-metric

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20TargetSymmetric

Year

Sp

eed

up

3.5x

Multicore Solutions

6

Dynamic Topologies

[Chakraborty (2008), Suleman et al (2009)]

0 2 4 6 8 100

4

8

12

16

20TargetSym-metricAsym-metric

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20Target

Sym-metric

Year

Sp

eed

up

3.5x

Multicore Solutions

7

Composed/Fused Topologies

[Ipek et al (2007), Kim et al (2007)]

0 2 4 6 8 100

4

8

12

16

20TargetSym-metricAsym-metricDynamic

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20TargetSym-metricAsym-metric

Year

Sp

eed

up

3.7x

Multicore Solutions

8

GPU-Style Cores

0 2 4 6 8 100

4

8

12

16

20Target

Sym-metric

Asym-metric

Dynamic

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20TargetSym-metricAsym-metricDynamic

Year

Sp

eed

up

2.7x

Multicore Era Projections

9

The best designs speed up 14% per year rather than the recent trend of 34% per year

0 2 4 6 8 100

4

8

12

16

20

Target

Composed

Year

Sp

eed

up

Composed

3.7x

18x

Why Diminishing Returns?

10

Transistor area is still scalingVoltage and capacitance scaling have

slowedResult: designs are power, not area,

limited

Overview

11

Conservative Optimistic

Area 32x 32x

Power 4.5x 8.3x

Frequency 1.3x 3.9x

[Borkar 2007]

[ITRS 2010]

Device Scaling Projections

12

From 45 nm to 8 nm:

0 5 10 15 20 25 30 35 400

5

10

15

20

25

30Intel Nehalem AMD Shanghai Intel Core Intel Atom

SPECmark Score

Pow

er

(TD

P,

Watt

s)

Modeling Ideal Core Power/Perf.

13

Atom

Nehalem

0 5 10 15 20 25 300

5

10

15

20

25

30Intel Nehalem AMD Shanghai Intel Core Intel Atom

SPECmark Score

Pow

er

(TD

P,

Watt

s)

Pareto Frontier includes all optimal power/performance points

Repeat using core area for optimal area/performance points

0 5 10 15 20 25 30 35 400

5

10

15

20

25

30

SPECmark Score

Pow

er

(TD

P,

Watt

s)

Combining Device and Core Models

14

45 nm Frontie

r

32 nm Frontie

r

Device Scaling

Overview

15

What belongs in multicore model?

16

Styles

Applications

Topologies

Area & Power / Performance Tradeoffs

Architectures

PARSEC fparallel,Data Use

Number of Threads,

Cache Sizes

Area & Power Budget

Cache & memory latencies, memory

bandwidth

Pareto Frontiers

Multicore Speedup Model

17

MulticoreSpeedup

11-fparallel

Serial Speedup

fparallel

Parallel Speedup

=+

Multicore Performance Model

18

Performance is limited by:

and

Memory bandwidthBWmax / (instructions per byte from

memory)

ComputationNcores (core frequency/CPIexe) core

utilization

[Guz et al, 2009]

Core Utilization Model

19

Core utilization is limited by:

Fraction of Time Core is Ready to IssueNumber of Threads in Core / Number of Threads to

Keep Busy

[Guz et al, 2009]

Multicore Model & Pareto Frontiers

20

0 5 10 15 20 25 30 35 400

5

10

15

20

25

30

100 points

A(q),

P(q)

q

Translating from SPECmark

21

1. From q, find core’s SPECmark speedup

2. Frequency linearly distributed from Atom to Nehalem

3. Recall: model predicts benchmark performance as f(benchmark chars, frequency, CPIexe)

4. Compute CPIexe such that

Benchmark Speedup = SPECmark Speedup

Area and Power Constraints

22

Ncores x A(q) ≤ Area Budget

Ncores x P(q) ≤ Power Budget

Dark silicon = Ncores / # of cores that fit in chip area

Overview

23

Dark Silicon

24

blacksholes bodytrack canneal ferret streamcluster GM0%

20%

40%

60%

80%

100%ITRS Conservative

Perc

en

tag

e D

ark

Silic

on

blacksholes bodytrack canneal ferret streamcluster GM0%

20%

40%

60%

80%

100%ITRS Conservative

Perc

en

tag

e D

ark

Silic

on

Sources of Dark Silicon:Power + Limited Parallelism

At 22 nm:

At 8 nm:

17%

26%

51%

71%

0 2 4 6 8 100

4

8

12

16

20

ITRS: All TopologiesITRS: Symmetric

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20Conservative: Symmet...

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20Conservative: All Topologies

Year

Sp

eed

up

0 2 4 6 8 100

4

8

12

16

20

ITRS: All TopologiesITRS: Symmetric

Year

Sp

eed

up

Overall Performance

25

Target

fparallel = 0.99

18x16x

8x6x3x

ConclusionsMulticore performance gains are

limited

Need at least 18%-40% per generation from architecture alone without

additional power26

Unicore Era

Multicore Era ?

27

Specialization

Shrinking chips

Pervasive

Efficiency

karu sankaralingam university of wisconsin-madison collaborators: hadi esmaeilzadeh, emily blem,...

Documents