2
Explaining The Gap Between ASIC and Custom Power: A Custom Perspective
Andrew Chang, Cadence Design Systems*
William J. Dally, Computer Systems Laboratory, Stanford University
* Work done while the author was at Stanford
5
Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
2. Trade Performance for Power: Larger Range w/ Custom
3. Move to Different Power vs. Performance Curve: More Architectural Choice with Custom
[Figure: Power (vertical axis) vs. Performance (horizontal axis) curves, with arrows 1, 2, and 3 marking the three moves]
6
Dynamic Power Dissipation
$P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
Reduce Vdd: static, dynamic, voltage islands, power gating
Reduce α and/or f: clock gating, block enables, bus encoding, glitch identification and elimination
Reduce Ecircuit: engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
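As a rough illustration of the three levers on this slide, the sketch below evaluates dynamic power as activity factor times C times Vdd squared times f. The activity factor `alpha` is assumed to be the symbol dropped from the slide text, and the capacitance, voltage, and frequency values are hypothetical, not taken from the talk.

```python
def dynamic_power(alpha, c_farads, vdd, f_hz):
    """Return dynamic power in watts: alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz

# Hypothetical block: 1 nF switched capacitance, 10% activity, 1.8 V, 100 MHz.
base = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.8, f_hz=100e6)

# Halving Vdd at the same frequency reduces dynamic power by 4x.
low_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.9, f_hz=100e6)

# Halving the activity factor (for example through clock gating) reduces it by 2x.
gated = dynamic_power(alpha=0.05, c_farads=1e-9, vdd=1.8, f_hz=100e6)

print(base, base / low_vdd, base / gated)
```

The quadratic dependence on Vdd is why voltage reduction heads the list.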
7
Static Power Dissipation
$P_{static} = V_{dd}\,(I_{sub} + I_{ox})$
$I_{sub} = K_1 W e^{-V_t/(n V_\theta)} \left(1 - e^{-V_{gs}/V_\theta}\right)$
$I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{-\alpha t_{ox}/V_{gs}}$
With K1, K2, n, and α experimentally determined
Reduce Vdd: static, dynamic, voltage islands, power gating
Increase effective Vt: substituting high-threshold devices, transistor stacking, static and active body bias
Reduce effective W: reduce number and size of devices in design
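The leakage expressions on this slide can be sketched numerically as follows. This assumes the exponents use the thermal voltage (`v_theta` here) and an empirical fitting parameter (`alpha` here). All constants are hypothetical placeholders, since the slide states only that K1, K2, and n are determined experimentally.

```python
import math

def subthreshold_current(k1, w, vt, n, v_theta, vgs):
    """Isub = K1 * W * exp(-Vt / (n * Vtheta)) * (1 - exp(-Vgs / Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))

def oxide_current(k2, w, vgs, tox, alpha):
    """Iox = K2 * W * (Vgs / tox)^2 * exp(-alpha * tox / Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-alpha * tox / vgs)

def static_power(vdd, i_sub, i_ox):
    """Pstatic = Vdd * (Isub + Iox)."""
    return vdd * (i_sub + i_ox)

# Hypothetical parameters: n = 1.5, thermal voltage of about 26 mV.
N, V_THETA = 1.5, 0.026
low_vt = subthreshold_current(k1=1.0, w=1.0, vt=0.30, n=N, v_theta=V_THETA, vgs=1.2)
high_vt = subthreshold_current(k1=1.0, w=1.0, vt=0.40, n=N, v_theta=V_THETA, vgs=1.2)

# Raising Vt by 100 mV cuts subthreshold leakage by exp(0.1 / (n * Vtheta)).
vt_leakage_ratio = low_vt / high_vt
print(vt_leakage_ratio)
```

With these placeholder values, the exponential dependence on Vt gives roughly an order of magnitude less leakage per 100 mV. That is the motivation for high-threshold devices and body bias.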
10
Which Design Is More Efficient?
0.7um CMOS 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V, Power = 845mW
0.18um CMOS 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V, Power = 1.6mW
11
Talk Outline
Normalized Metric: Ebit
Effect of Architecture
ASIC vs. Custom
Building Blocks
Achievable Energy Efficiency
16b 1024 FFT Example
Answer to “Which Design is More Efficient”
13
Defining Ebit
$E_{bit} = C_{bit} V_{dd}^2$
$C_{bit} = 4 \times 2\,\mathrm{fF/\mu m} \times W_{min}$
Energy needed to write a 1-bit SRAM cell
Approximates minimum useful capacitance
The ratio of Ebit to the energy for a range of circuits remains largely constant with technology scaling
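A minimal sketch of the Ebit definition, assuming Wmin is given in micrometers and Vdd in volts. The example width and supply are hypothetical and are not meant to reproduce any value tabulated in the talk.

```python
def c_bit_fF(w_min_um, cap_per_um_fF=2.0, widths=4):
    """Cbit = 4 * 2 fF/um * Wmin, returned in femtofarads."""
    return widths * cap_per_um_fF * w_min_um

def e_bit_fJ(w_min_um, vdd):
    """Ebit = Cbit * Vdd^2; fF times V^2 gives femtojoules."""
    return c_bit_fF(w_min_um) * vdd ** 2

# Hypothetical process: Wmin = 0.2 um at Vdd = 1.8 V.
example_cbit = c_bit_fF(0.2)
example_ebit = e_bit_fJ(0.2, 1.8)
print(example_cbit, example_ebit)
```

Because Ebit carries both the device capacitance and the supply voltage, dividing a circuit's energy by Ebit removes most of the process dependence.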
14
Technology Scaling for Ebit
The normalized unit of distance equals the M1 pitch
Technology   Area (um2)   Area (normalized units squared)
0.5um        58           18
0.18um       5.7          18
15
Technology Scaling for Nand2
The normalized unit of distance equals the M1 pitch
4 units = 2.24um
8 units = 4.48um
[Figure: NAND2 layout with inputs A and B and output YN]
16
Applying Ebit
Energy                   180nm           130nm          90nm           65nm
Ebit (fJ)                3.3             1.4            0.5            0.36

Relative                 180nm           130nm          90nm           65nm
Ebit                     1               1              1              1
1b FO4                   ~10             ~10            ~10            ~10
1b SP-SRAM               0.3-7           0.3-7          0.3-7          0.3-7
1b RF                    4-20+           4-20+          4-20+          4-20+
1b DFF                   20-30+          15-30+         10-30+         10-30+
1b Nand2                 11-30 (typ 19)  5-30 (typ 14)  5-30 (typ 14)  5-30 (typ 14)
Move 1b 1000 M1 pitches  ~100            ~100           ~100           ~100
Move 1b 1.5mm            268             367            467            714
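To read the relative table in absolute terms, multiply each entry by the Ebit of the corresponding node. The sketch below does this for two entries from the table. The helper name is illustrative only.

```python
# Ebit per technology node, in femtojoules, as listed on the slide.
EBIT_FJ = {"180nm": 3.3, "130nm": 1.4, "90nm": 0.5, "65nm": 0.36}

def absolute_energy_fJ(node, relative_ebits):
    """Convert an energy expressed in Ebits into femtojoules for a given node."""
    return relative_ebits * EBIT_FJ[node]

# Typical 1b NAND2 at 180nm: 19 Ebits.
nand2_180 = absolute_energy_fJ("180nm", 19)

# Moving one bit 1.5mm at 65nm: 714 Ebits.
move_65 = absolute_energy_fJ("65nm", 714)

print(nand2_180, move_65)
```

At 65nm, moving a bit 1.5mm costs about 257 fJ by this conversion, while the NAND2 range of 5-30 Ebits is about 2-11 fJ. Wire energy grows relative to gate energy as technology scales.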
21
Effect of Architecture
ASIC Architecture: 6x Efficiency
NVIDIA GeForceFX (Design Style: ASIC): 400MHz, 125M Transistors, ~20 Watts: 10 GFlops & 13 GB/s
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s
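The 6x figure appears to follow from throughput per watt. A short check using the slide's numbers (the function name is illustrative):

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency expressed as GFlops per watt."""
    return gflops / watts

geforce_fx = gflops_per_watt(10, 20)   # ASIC-style graphics processor
pentium_4 = gflops_per_watt(5, 60)     # custom general-purpose processor

architecture_gain = geforce_fx / pentium_4
print(architecture_gain)
```

The ratio is about 6, so the ASIC-style part is more efficient here despite its less aggressive circuits. The slide attributes that advantage to architecture.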
22
Custom Circuits: 9x (7x) Efficiency
NVIDIA GeForceFX (Design Style: Custom): 400MHz, 125M Transistors, ~3 Watts: 10 GFlops & 13 GB/s, Vdd = 0.65V
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s, Vdd = 1.3V
23
Combined Architecture and Circuits: 40x+ Improvement, but 1.5 Years vs. 3+ Years
NVIDIA GeForceFX (Design Style: Custom): 400MHz, 125M Transistors, ~3 Watts: 10 GFlops & 13 GB/s, Vdd = 0.65V
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s, Vdd = 1.3V
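The combined figure can be checked the same way, assuming efficiency is again measured as GFlops per watt. The projected ~3 W custom implementation is the slide's estimate, and the variable names are illustrative.

```python
# GFlops per watt for each design point on the slides.
asic_gpu = 10 / 20      # GeForceFX as built (ASIC style), ~20 W
custom_gpu = 10 / 3     # GeForceFX projected with custom circuits, ~3 W
custom_cpu = 5 / 60     # Pentium-4 (custom style), ~60 W

circuit_gain = custom_gpu / asic_gpu     # gain from custom circuits alone
combined_gain = custom_gpu / custom_cpu  # architecture and circuits together

print(round(circuit_gain, 2), round(combined_gain, 1))
```

The circuit-only gain comes out near 7x and the combined gain at 40x, in line with the slide titles.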
26
ASIC vs. Custom
ASIC Methods: provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
Custom Methods: offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
Exploits Design Structure
Exploits Circuit Techniques
28
Custom Methods Emphasize Fine-Grain Manual Control + Custom Library
Design Style  Gate Library       Floorplanning/Partitioning  Coarse Placement  Detailed Placement  Coarse Routing  Detailed Routing
Custom        Complex, Specific  Manual                      Manual            Manual              Manual          Manual
ASIC          Simple, Generic    Manual/Automated            Automated         Automated           Automated       Automated
(ASIC: Automated w/ Hints)
Operation and Performance Characterized for the Specific Case
30
ASIC Methods Substitute Coarse-Grain Control: Automation + Generic Library
Design Style  Gate Library       Floorplanning/Partitioning  Coarse Placement  Detailed Placement  Coarse Routing  Detailed Routing
Custom        Complex, Specific  Manual                      Manual            Manual              Manual          Manual
ASIC          Simple, Generic    Manual/Automated            Automated         Automated           Automated       Automated
(ASIC: Automated w/ Hints)
Operation and Performance Characterized for the Typical/Generic Case
31
ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure
Designs reuse similar basic building blocks
Building blocks: 1-10K gates, not 100K+ gates
64-bit adder: 1K gates
64x64 RF: 2K gates
64x64 multiplier: 20K gates
Opportunities to exploit these structures are lost when the design is viewed in large chunks
32
Different Architectures, Similar Building Blocks
[Figure: annotated die photos of two chips, with blocks labeled LC, EX, RF, SRAM, XCVRS, and Bus]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford): Bank 0, Bank 1, CLST 0-2, NIF/ROUTER, MEMORY SWITCH, CLUSTER SWITCH, EMI, LTLB
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford): Cluster0-Cluster7, Microcontroller
33
Significant Structure Exists Within 100K-gates
[Figure: the same two annotated die photos, with blocks labeled LC, EX, RF, SRAM, XCVRS, and Bus]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford): Bank 0, Bank 1, CLST 0-2, NIF/ROUTER, MEMORY SWITCH, CLUSTER SWITCH, EMI, LTLB
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford): Cluster0-Cluster7, Microcontroller
34
Energy of 100K-gate Equivalent
ASIC (N2) = 1400K Ebits (typ)
Custom Logic = 424K Ebits*
SRAM (small) = 1085K Ebits
SRAM (med) = 155K Ebits
SRAM (large) = 50K Ebits
*Based on data extracted from Intel McKinley
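The ASIC figure appears to correspond to 100K NAND2-equivalent gates at the typical 14 Ebits per gate from the relative-energy table. That reading is an assumption, not something the slide states. Under it, the logic advantage of custom can be estimated as:

```python
GATES = 100_000
NAND2_TYP_EBITS = 14          # typical 1b Nand2 energy in Ebits (assumed basis)

asic_ebits = GATES * NAND2_TYP_EBITS   # 1,400,000 Ebits, i.e. 1400K
custom_ebits = 424_000                 # custom logic figure from the slide

logic_advantage = asic_ebits / custom_ebits
print(asic_ebits, round(logic_advantage, 2))
```

The resulting ratio of about 3.3x is in line with the roughly 3x energy advantage cited in the summary.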
35
Exploiting Circuit Techniques
Custom circuits more efficient
Reduced parasitics
1.7x circuit techniques and flops
1.4x libraries
1.4x due to engineering interconnects
Subthreshold Circuits
Low performance but ultra-low power
Requires architecture, gates, memories, CAD tools
36
Relating Power to Performance: CV/I, Idsat, tFO4
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4\,[C_{eff} V_{dd} / I_{dsat}]$, with $K_4 \approx 13.5$
37
Relating Power to Performance: Relating Vdd and Vt to tFO4
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4\,[C_{eff} V_{dd} / I_{dsat}]$, with $K_4 \approx 13.5$
38
Relating Power to Performance: Correlation to Reported Foundry Data
Technology  Node    CV/I est (ps)  CV/I reported (ps)  tFO4 est (ps)
Foundry A   180-nm  3.94           3.70                53
Foundry A   130-nm  2.55           2.17                34
Foundry A   90-nm   1.85           2.04                25
Foundry A   65-nm   1.45           1.00                20
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4\,[C_{eff} V_{dd} / I_{dsat}]$, with $K_4 \approx 13.5$
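The tFO4 column follows from the estimated CV/I column through the K4 factor. A short check using the table values (the function name is illustrative):

```python
K4 = 13.5  # fitting constant from the slide

def t_fo4_ps(cv_over_i_ps):
    """Estimate the FO4 delay in picoseconds from CV/I in picoseconds."""
    return K4 * cv_over_i_ps

# Estimated CV/I per node, taken from the table.
cv_over_i_est = {"180-nm": 3.94, "130-nm": 2.55, "90-nm": 1.85, "65-nm": 1.45}

fo4_estimates = {node: round(t_fo4_ps(cvi)) for node, cvi in cv_over_i_est.items()}
print(fo4_estimates)
```

Rounded to the nearest picosecond, these reproduce the 53, 34, 25, and 20 ps entries.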
39
Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
Technique                   Type     Custom vs. ASIC  Energy  Applies To
Circuit Styles and Flops    Dynamic  1.7              0.815   Logic
Libraries + Vdd Scaling     Dynamic  1.4              0.855   Logic
SRAM Circuits               Dynamic  2                0.95    SRAM
Interconnect + Vdd Scaling  Dynamic  1.4              0.855   Interconnect
40
Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
Technique              Type     Custom vs. ASIC  Energy  Applies To
Bit Encoding           Dynamic  1                0.84    Interconnect
Clock Gating           Dynamic  1                0.84    Chip
Frequency Scaling      Dynamic  1                0.5     Chip
Subthreshold Circuits  Dynamic  N/A              0.062   Chip
41
Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
Technique                        Type    Custom vs. ASIC  Energy  Applies To
Vdd Scaling                      Static  1                0.79    Chip
MT-CMOS                          Static  1                0.5     Chip
Stacking and input state vector  Static  1.4              0.7     Chip
Body Bias                        Static  2                0.5     Chip
Supply Gating                    Static  10               0.1     Chip
Stacking, Body Bias, Supply Gating: typically only one of these three is applied
42
Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
Type         130-nm ASIC (Custom)  90-nm ASIC (Custom)
Net Dynamic  45% (32%)             28% (20%)
Net Static   8% (4%)               20% (10%)
Total        53% (36%)             48% (30%)
130nm uP assumes 80% Dynamic and 20% Static
90nm uP assumes 50% Dynamic and 50% Static
45
16b 1024 point FFT
Generally, k N log N operations (complex multiplies) with pre-computation
Radix-2, Radix-4 etc… implementations
Decimation in time and/or decimation in Frequency
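For a sense of scale, the operation count for the 1024-point case can be sketched as below. The constant k depends on the implementation. The radix-2 butterfly count of (N/2) log2 N is the standard figure for that algorithm and is not taken from the slide.

```python
import math

def fft_op_count(n, k=1.0):
    """Approximate operation count k * N * log2(N) for an N-point FFT."""
    return k * n * math.log2(n)

def radix2_butterflies(n):
    """Number of butterflies in a radix-2 FFT: (N/2) * log2(N), N a power of two."""
    return (n // 2) * int(math.log2(n))

ops_1024 = fft_op_count(1024)
butterflies_1024 = radix2_butterflies(1024)
print(ops_1024, butterflies_1024)
```

Each radix-2 butterfly needs one complex multiply, so a 1024-point transform involves about 5K complex multiplies before any savings from trivial twiddle factors.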
46
Range of Implementations
MIT FFT (2005): 0.18um CMOS, 628K T’s, 10KHz: Architecture and subthreshold circuits, 180mV operation
Spiffee (1999): 0.7um CMOS, 460K T’s, 173MHz: Cached FFT architecture and algorithm, 1.1V operation
SA-1100 (1999): 0.35um CMOS, 2.6M T’s, 74MHz: Commercial embedded processor, custom circuits, 1.5V operation
Imagine (2003): 0.15um CMOS, 22M T’s, 232MHz: Streaming media processor, tiled standard cells, 1.2V operation
Stratix IS25F627C8 (2005): 0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz: Commercial FPGA co-processor
Intel P4 (2003): 0.13um CMOS, 3GHz, SSE: Commercial general-purpose processor, custom circuits, 1.5V operation
TI 'C6416 (2003): 0.13um CMOS, 720MHz: Commercial digital signal processor
47
Ebit Energy 16b 1024 point FFT
Design     Fab (nm)  Vdd (V)  MHz   mW     Cycles
MIT FFT    180       1.8      0.01  1.6    95
Spiffee    700       3.3      173   845    5190
SA-1100    350       2        74    39     31500
Imagine    150       1.5      232   4000   3708
Stratix    130       1.3      275   884    1291
Intel P4   130       1.2      3000  51200  71680
TI 'C6416  130       1.2      720   1200   6526
48
Ebit Energy 16b 1024 point FFT
Design     EDP (rel norm)  Ebit (fJ)  Efft (nJ)  Normalized to Ebit (1e6)  Energy Ratio
MIT FFT    143             3.3        154        47                        1
Spiffee    1               91         25350      277                       6
SA-1100    283             4.2        16601      3953                      85
Imagine    148             2.2        63931      29726                     637
Stratix    24              1.4        4149       2964                      64
Intel P4   12548           1.4        1E+06      873813                    18591
TI 'C6416  27              1.4        10877      7769                      166
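A hedged sketch of how the per-FFT energy and its Ebit normalization might be computed. It assumes Efft = power × cycles / frequency, which is not stated on the slides but reproduces the tabulated Efft for rows such as Spiffee and TI 'C6416. Not every row necessarily follows from the listed power in this way.

```python
def fft_energy_nJ(power_mW, cycles, freq_MHz):
    """Energy per FFT in nanojoules, assuming E = P * (cycles / f)."""
    seconds = cycles / (freq_MHz * 1e6)
    return power_mW * 1e-3 * seconds * 1e9

def normalized_to_ebit_millions(e_fft_nJ, e_bit_fJ):
    """Energy per FFT in units of one million Ebits."""
    return (e_fft_nJ * 1e-9) / (e_bit_fJ * 1e-15) / 1e6

# Spiffee: 845 mW, 5190 cycles, 173 MHz, Ebit = 91 fJ.
spiffee_nJ = fft_energy_nJ(845, 5190, 173)
spiffee_norm = normalized_to_ebit_millions(spiffee_nJ, 91)

# TI 'C6416: 1200 mW, 6526 cycles, 720 MHz, Ebit = 1.4 fJ.
ti_nJ = fft_energy_nJ(1200, 6526, 720)
ti_norm = normalized_to_ebit_millions(ti_nJ, 1.4)

print(round(spiffee_nJ), round(spiffee_norm), round(ti_nJ), round(ti_norm))
```

Dividing by Ebit lets a 0.7um design be compared with a 0.13um design. In raw nanojoules Spiffee uses more than twice the energy of the 'C6416, but in Ebit-normalized terms it uses far less.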
50
Which Design Is More Efficient? Depends on the Metric!
0.7um CMOS 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V, Power = 845mW. EDP 143x better
0.18um CMOS 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V, Power = 1.6mW. Absolute energy 6x better
51
Summary
A normalized metric, Ebit, enables meaningful comparisons across designs and technologies
Custom designers can exploit a wide range of optimizations, enabling architecture with circuits and circuits with architecture
Custom designs can readily achieve a 3x advantage in energy, with the potential for over 10x
Selective application of custom techniques, together with automated support for performance characterization at specific instead of generic operating points, can enable ASIC designers to begin to bridge this power gap.
52
Back-Up Slides
53
ASICs Rely on General Optimization Techniques
Focus: Improve the Average Case
Partitioning: hypergraph min-cut, ratio cut
Solutions: move-based, geometric & combinatorial forms, clustering
Hypergraph H(V,E), E = { e1, e2, … } nets
Vertex & edge weights used to encode costs
[Figure: a circuit with vertices V1-V5 and nets e1-e8, shown alongside its hypergraph representation]
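The objective that min-cut partitioners minimize can be illustrated with a small cut-size evaluation. The netlist below is hypothetical, since the connectivity in the slide's figure is not recoverable. The function only scores a given bipartition and does not search for the best one.

```python
def cut_size(nets, part_a):
    """Count nets (hyperedges) with vertices on both sides of a bipartition."""
    part_a = set(part_a)
    cut = 0
    for vertices in nets.values():
        inside = sum(1 for v in vertices if v in part_a)
        # A net is cut when some, but not all, of its vertices lie in part A.
        if 0 < inside < len(vertices):
            cut += 1
    return cut

# Hypothetical hypergraph: each net lists the vertices (cells) it connects.
example_nets = {
    "e1": ["V1", "V2"],
    "e2": ["V1", "V3"],
    "e3": ["V3", "V4", "V5"],
    "e4": ["V2", "V5"],
}

example_cut = cut_size(example_nets, {"V1", "V2"})
print(example_cut)
```

Move-based heuristics repeatedly shift vertices between the two sides to reduce this count, optionally weighting vertices and nets to encode costs.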
54
Designs with Structure Do Not Exhibit Average Characteristics
64b Multiplier (half-array): Clear Disparity in Resource Usage
[Figure: routing density of the 64b multiplier]