karu sankaralingam university of wisconsin-madison collaborators: hadi esmaeilzadeh, emily blem,...
TRANSCRIPT
Karu SankaralingamUniversity of Wisconsin-Madison
Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee
St. Amant, and Doug Burger
The Dark Silicon Implications for Microprocessors
Multicore Decade?
2
We have relied on multicore scaling for over five years.
How much longer will it be our primary performance scaling technique?
2000 2005 2010 2015
Pentium Extreme
Dual-Core
Core 2 Quad-Core
i7 980x Hex-Core
?
Finding Optimal Multicore Designs
3
Comprehensive design space:Fixed area budgetFixed power budgetTwo sets of CMOS scaling projectionsOptimal core and diverse multicore organizationsParallel benchmarks
For next 5 technology generations, we find the best performing multicore from a
comprehensive design space search for each of the PARSEC benchmarks
Symmetric Multicore Projections
4
Symmetric multicores alone will not sustain the multicore era.
0 2 4 6 8 100
4
8
12
16
20
Target
Symmetric
Year
Sp
eed
up
3.4x in 10 years
18x
Multicore Solutions
5
Asymmetric Topologies
0 2 4 6 8 100
4
8
12
16
20Target
Sym-metric
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20TargetSymmetric
Year
Sp
eed
up
3.5x
Multicore Solutions
6
Dynamic Topologies
[Chakraborty (2008), Suleman et al (2009)]
0 2 4 6 8 100
4
8
12
16
20TargetSym-metricAsym-metric
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20Target
Sym-metric
Year
Sp
eed
up
3.5x
Multicore Solutions
7
Composed/Fused Topologies
[Ipek et al (2007), Kim et al (2007)]
0 2 4 6 8 100
4
8
12
16
20TargetSym-metricAsym-metricDynamic
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20TargetSym-metricAsym-metric
Year
Sp
eed
up
3.7x
Multicore Solutions
8
GPU-Style Cores
0 2 4 6 8 100
4
8
12
16
20Target
Sym-metric
Asym-metric
Dynamic
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20TargetSym-metricAsym-metricDynamic
Year
Sp
eed
up
2.7x
Multicore Era Projections
9
The best designs speed up 14% per year rather than the recent trend of 34% per year
0 2 4 6 8 100
4
8
12
16
20
Target
Composed
Year
Sp
eed
up
Composed
3.7x
18x
Why Diminishing Returns?
10
Transistor area is still scalingVoltage and capacitance scaling have
slowedResult: designs are power, not area,
limited
Overview
11
Conservative Optimistic
Area 32x 32x
Power 4.5x 8.3x
Frequency 1.3x 3.9x
[Borkar 2007]
[ITRS 2010]
Device Scaling Projections
12
From 45 nm to 8 nm:
0 5 10 15 20 25 30 35 400
5
10
15
20
25
30Intel Nehalem AMD Shanghai Intel Core Intel Atom
SPECmark Score
Pow
er
(TD
P,
Watt
s)
Modeling Ideal Core Power/Perf.
13
Atom
Nehalem
0 5 10 15 20 25 300
5
10
15
20
25
30Intel Nehalem AMD Shanghai Intel Core Intel Atom
SPECmark Score
Pow
er
(TD
P,
Watt
s)
Pareto Frontier includes all optimal power/performance points
Repeat using core area for optimal area/performance points
0 5 10 15 20 25 30 35 400
5
10
15
20
25
30
SPECmark Score
Pow
er
(TD
P,
Watt
s)
Combining Device and Core Models
14
45 nm Frontie
r
32 nm Frontie
r
Device Scaling
Overview
15
What belongs in multicore model?
16
Styles
Applications
Topologies
Area & Power / Performance Tradeoffs
Architectures
PARSEC fparallel,Data Use
Number of Threads,
Cache Sizes
Area & Power Budget
Cache & memory latencies, memory
bandwidth
Pareto Frontiers
Multicore Speedup Model
17
MulticoreSpeedup
11-fparallel
Serial Speedup
fparallel
Parallel Speedup
=+
Multicore Performance Model
18
Performance is limited by:
and
Memory bandwidthBWmax / (instructions per byte from
memory)
ComputationNcores (core frequency/CPIexe) core
utilization
[Guz et al, 2009]
Core Utilization Model
19
Core utilization is limited by:
Fraction of Time Core is Ready to IssueNumber of Threads in Core / Number of Threads to
Keep Busy
[Guz et al, 2009]
Multicore Model & Pareto Frontiers
20
0 5 10 15 20 25 30 35 400
5
10
15
20
25
30
100 points
A(q),
P(q)
q
Translating from SPECmark
21
1. From q, find core’s SPECmark speedup
2. Frequency linearly distributed from Atom to Nehalem
3. Recall: model predicts benchmark performance as f(benchmark chars, frequency, CPIexe)
4. Compute CPIexe such that
Benchmark Speedup = SPECmark Speedup
Area and Power Constraints
22
Ncores x A(q) ≤ Area Budget
Ncores x P(q) ≤ Power Budget
Dark silicon = Ncores / # of cores that fit in chip area
Overview
23
Dark Silicon
24
blacksholes bodytrack canneal ferret streamcluster GM0%
20%
40%
60%
80%
100%ITRS Conservative
Perc
en
tag
e D
ark
Silic
on
blacksholes bodytrack canneal ferret streamcluster GM0%
20%
40%
60%
80%
100%ITRS Conservative
Perc
en
tag
e D
ark
Silic
on
Sources of Dark Silicon:Power + Limited Parallelism
At 22 nm:
At 8 nm:
17%
26%
51%
71%
0 2 4 6 8 100
4
8
12
16
20
ITRS: All TopologiesITRS: Symmetric
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20Conservative: Symmet...
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20Conservative: All Topologies
Year
Sp
eed
up
0 2 4 6 8 100
4
8
12
16
20
ITRS: All TopologiesITRS: Symmetric
Year
Sp
eed
up
Overall Performance
25
Target
fparallel = 0.99
18x16x
8x6x3x
ConclusionsMulticore performance gains are
limited
Need at least 18%-40% per generation from architecture alone without
additional power26
Unicore Era
Multicore Era ?
27
Specialization
Shrinking chips
Pervasive
Efficiency