the return of silicon efficiency - page d'accueil / lirmm ... · arith18 montpellier june 07...
TRANSCRIPT
Arith18 Montpellier June 07 [email protected]
AGENDA
• Chip Market• Design Objectives• Modern Transistors• Optimality of Designs• The Design Hull• Implications for Design
Arith18 Montpellier June 07 [email protected]
“The Cheap will Outsell the Good”
Customer Priorities
1. Availability2. Price
Features3. Power
Size
Design Choices
1. Hardware � Software2. Process Technology3. MHz and Volts4. Custom � Synthesized
Arith18 Montpellier June 07 [email protected]
The Chip Market
Most logic chips are processors or processor-centric.
Computers have driven MOS chip evolution – the first“product superclass”.
The second product superclass will be “cellphones”…• Consumer driven – more cost sensitive than computers.• High mobility – more power sensitive than computers.• High (specialized) performance – eg. HSDPA cellular
modem ~50GOPs.
ICs $211bnMOS $174bn
Memory $59bnMicro $54bn Logic $60bnMPU ASSP
SIA 2006
FPGAASICDSPMCU (inc GPU)
0
500
1000
1500
'04 '05 '06 '07 '08 '09
Cellphone units:
Arith18 Montpellier June 07 [email protected]
What’s New(-ish) in Logic Chip Physics?
Small dopant populations (VT variability)
Subthreshold leakage (VT reduction is constrained, so Vdd likewise)
Gate leakage (end of SiO2 Tox scaling)
Channel Stress (engineered and STI-induced)
OPC rules
Well Proximity Effect
Line Edge Roughness (gate and wire)
NBTI (RAM VT drift)
All important for designers, but not (directly) the main influencers of design evolution
Arith18 Montpellier June 07 [email protected]
Top 2 Design Influencers
1. Most power-sensitive designs will set their own supply voltage.
• Design at typical, turn manufacturing speed spread (functionality risk) into power efficiency spread (battery life).
• Offset manufacturing leakage spread against speed spread.
• Dynamically trade performance vs power.
• Disconnect power in standby (efficient regulators are switch-mode).
2. Processors are evolving to displace fixed-function hardware
• Target multiple market sockets per design.
• Hardware-efficient (dynamic re-purposing of resources).
• Mitigation of specification uncertainty.
• Work around bugs by software change.
• Well-practiced idiom for speed.
Arith18 Montpellier June 07 [email protected]
Modern Transistors
Arith18 Montpellier June 07 [email protected]
Modern Transistors
ITRS distinguishes 3 roadmaps…
• High Performance (HP)
• Low Operating Power (LOP)
• Low Standby Power (LSTP)
VT (mV)Tox (nm)Lg (nm)
5241.945LSTP
2851.232LOP
1651.125HP
ITRS2005 for 65nm 2007
“cellphones”
computers
TSMC offers 12 logic T’s at 65nm…
• 2 x HP
• 6 x LOP
• 4 x LSTP
Arith18 Montpellier June 07 [email protected]
65nm LOP DelayN
orm
aliz
ed �
Temp (°C)Vdd (V)
high VT
mid VT
low VT
Arith18 Montpellier June 07 [email protected]
65nm LSTP vs LOP Delay
LST
P �
/ LO
P �
Temp (°C)Vdd (V)
Arith18 Montpellier June 07 [email protected]
65nm LOP Sub-Threshold LeakageN
orm
aliz
ed I S
UB
Temp (°C)Vdd (V)
high VT
mid VT
low VT
Arith18 Montpellier June 07 [email protected]
65nm LSTP vs LOP Sub-Threshold LeakageLS
TP
I SU
B/ L
OP
I SU
B
Temp (°C)Vdd (V
)
Arith18 Montpellier June 07 [email protected]
High Temp ISUB Still Dominates Ig at 65nm (LOP and LSTP)
Temp (°C)Vdd (V
)
(LOP high VT)
Nor
mal
ized
I SU
B, I
g
Arith18 Montpellier June 07 [email protected]
LOP 65nm ISUB vs 90nm ISUB65
nm I S
UB
/ 90n
m I S
UB
Temp (°C)Vdd (V
)
(LOP high VT)
Arith18 Montpellier June 07 [email protected]
LOP 65nm Ig vs 90nm Ig65
nm I g
/ 90n
m I g
Temp (°C)Vdd (V
)
(LOP high VT)
Arith18 Montpellier June 07 [email protected]
Manufacturing Spread: 65nm mid VT LOP DelayN
orm
aliz
ed �
Temp (°C)Vdd (V)
SS
FF
Arith18 Montpellier June 07 [email protected]
Manufacturing Spread: 65nm mid VT LOP ISUBN
orm
aliz
ed I S
UB
Temp (°C)Vdd (V
)
SS
FF
Arith18 Montpellier June 07 [email protected]
Optimality of Design
Arith18 Montpellier June 07 [email protected]
Power, Money and Performance
Excellent recent work on Power vs Performance (among many)…
• Srinivasan, Brooks, Gschwind, Bose, Zyuban, Strenski and Emma, “Optimimum Pipelines for Power and Performance”, Micro ’02.
• Zyuban and Strenski, “Unified Methodology for Resolving Power-Performance Tradeoffs at the Microarchitectural and Circuit Levels”, ISLPED ‘02.
• Harstein, Puzak, “Optimum Power/Performance Pipeline Depth”, Micro ’03.
• Dao, Zeydel, Oklobdzija, “Energy Optimization of Digital Pipelined Systems Using Circuit Sizing and Supply Scaling”, Trans. VLSI Systems ‘05.
But outside the computer market, few chips sell on MHz.
Most designs set out to minimize cost (and power) at fixed performance…
� Decode an MPEG video without glitching.
� Execute an ADSL modem at 8Mbps.
� etc…
Arith18 Montpellier June 07 [email protected]
Design Levers
1. Pipelining
2. Replication (parallelism)
3. Speculation
4. Logical redundancy (eg. carry-skip adder)
5. Arithmetic redundancy (eg. carry-save)
6. Vdd
7. VT
8. tOX
9. Lg
10. Manhattan cell height (transistor sizing)
11. Wire geometry
12. Wire screening
13. Structured layout (wire control)
14. Logic circuit engineering
15. Clock circuit engineering
16. Package engineering
(micro)architecture
transistors
geometry
circuits
Arith18 Montpellier June 07 [email protected]
Electro-Thermal Model
PackageRSUPPLY
LeakageModel
CapacitanceModel
DeviceDelayModel
Package�JA
+•+
N�
freq
TA
TJ
�req
Vpins
Idd
Istat
Idyn
Required Vdd
Pco-packaged
Ptot
(thermal loop)
PipelineModel
Arith18 Montpellier June 07 [email protected]
Pipeline Model
Flux � = nC/τC (T/i/b)
Performance � = KnCf (T/s)
Cost N = K(nC+nS) (T)
Frequency f = 1/τ� (Hz)
nC nS
τC τS
Clock period τ
K “pipettes”…
1 bitAverage nC transistors of combinatorial logic per pipe stage per pipeline bit.
nS transistors per pipeline latch.
Units of τ are “i” � “FO4”; 1i = � seconds.
Arbitrary width/depth mix of pipettes make up the pipelined logic.
RAM
non-pipeline latches
pipelined logic
IO (f
ixed
Vdd
)
Define…
Chip…
Arith18 Montpellier June 07 [email protected]
Reality not Included…
• Performance � (TCOMB/s) tends to over-state effective performance at low τ, because logical redundancy is introduced to speed up logic paths and mitigate latency by speculation.
• Flux � (T/i/b) varies a lot locally; tends to rise at low τ for the same redundancy reason, and because pipe cuts have less freedom to avoid high-flux logic.
PrefixAdder
1 wire/bit
3 wires/bit2 wires/bit
Local �sensitivity
Eg. adders…
01234567
flux
at o
utpu
t
Ladner-Fischer Kogge-Stone
0
5
10
15
20
8b 16b 32b 64b
dela
y (i)
Arith18 Montpellier June 07 [email protected]
Pipeline Flux in Icera’s DXP®
-
1
2
3
4
All logic macros except regfiles
Flux(T/i/b)
� ~ 3-4 for arithmetic and random translation logic.
� ~ 1-2 for steering logic.
Arith18 Montpellier June 07 [email protected]
Required � to meet specified Performance � and Cost N
( )����
���
�
ττΦ+1�
�
���
�
τΨ = ∆
���������� �
��
Specification
Latch delay 4iLatch cost 40TPerformance 10PT/sDesign Cost 10MT
0
5
10
15
20
25
30
35
40
4 9 14 19 24 29 34 39 44 49 54 59 64 69 74
Pipe stage delay τ (number of �’s)
Req
uire
d �
(ps)
� = 2
� = 4
Arith18 Montpellier June 07 [email protected]
Constant Flux Implications (� = 3, τS = 4i)
0
5
10
15
20
25
30
35
8 13 18 23 28 33 38 43 48 53 58 63 68 73 78
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
8 13 18 23 28 33 38 43 48 53 58 63 68 73 78
0
5
10
15
20
25
30
35
0.00 1.00 2.00 3.00 4.00 5.00
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
0.00 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.00
Pipe stage delay τ
Req
uire
d �
(ps)
Com
bina
toria
l fra
ctio
n
Clock frequency (GHz)
Transistors
Delay
Arith18 Montpellier June 07 [email protected]
DXP® Logic Stage Population, by #Transistors
Clock distrib.(mostly invs)
All other logic <0.5%
Regfiles(latch-like)
22%
29%
18%
11%
9%
8%
(mostly pipeline)Manhattan cells are ubiquitous in all digital design methodologies.
Complex cells can be important for performance but account for negligible % of total transistors.
Designs are characterized (capacitance, leakage, delay, area) by a few simple Manhattan stages.
Arith18 Montpellier June 07 [email protected]
The Design Hull
Arith18 Montpellier June 07 [email protected]
Example Design Spec
Performance � = 15PT/sFlux � = 4T/i/bLatch Cost nS = 40TLatch Delay τS = 4iNon-Pipe Logic = 5MTRAM = 1MBLatch Activity = 3x Combinatorial ActivityRAM Activity = 0.2x Combinatorial Activity65nm high VT LOP ProcessAmbient Temp = 85°CMax Die Temp = 125°CPackage �JC = 15°C/WSupply R = 50m�Fixed Parasitic Power = 100mWVdd Limit Range = 0.7V – 1.2V
Arith18 Montpellier June 07 [email protected]
Design Hull
Pipe Stage Depth (i)
Pow
er (m
W)
Pipe Cost (MT)
Vdd Ceiling
Thermal Ceiling
Vdd Floor
from here Min Power Pipes
Arith18 Montpellier June 07 [email protected]
Feasible DesignsP
ipe
Sta
ge D
epth
(i)
Pipe Cost (MT)
Vdd
Cei
ling
Thermal Ceiling (Thermal Ceiling)
Arith18 Montpellier June 07 [email protected]
Minimum Power PipelinesP
ipe
Sta
ge D
epth
ττ ττ(i)
for M
inim
um P
ower
Pipeline Cost (MT)
Vdd
@ fl
oor
Vdd above floor
Arith18 Montpellier June 07 [email protected]
The Cost of Minimizing Power
Min
imum
Fea
sibl
e P
ower
(mW
)
Pipeline Cost (MT)
Clearly a design choice
Arith18 Montpellier June 07 [email protected]
Minimum Power PipelinesP
ower
(mW
)
Pipe Stage Depth ττττ (i)
13MT
15MT
17MT19MT
Arith18 Montpellier June 07 [email protected]
Sensitivity to Pipeline FluxP
ipe
Sta
ge D
epth
ττ ττ(i)
for M
inim
um P
ower
Pipeline Cost (MT)
� = 2
� = 3
� = 4
Arith18 Montpellier June 07 [email protected]
LOP vs LSTP
Min
imum
Fea
sibl
e P
ower
(mW
)
Pipeline Cost (MT)
LOP high VT
LSTP mid VT
Arith18 Montpellier June 07 [email protected]
LOP vs LSTPP
ipe
Sta
ge D
epth
ττ ττ(i)
for M
inim
um P
ower
Pipeline Cost (MT)
LOP high VT
LSTP mid VT
Arith18 Montpellier June 07 [email protected]
Feasible Solutions in Cost-Frequency Space
Clo
ck F
requ
ency
(GH
z)
Pipeline Cost (MT)
65nm LOP high VT
Arith18 Montpellier June 07 [email protected]
Implications for Design
Arith18 Montpellier June 07 [email protected]
Implications for Chip Design
Every logic chip will have control over its supply voltage.
• Integrated regulators?
• Switch off in standby – limited call for LSTP processes?
We can’t minimize cost and power simultaneously.
Good cost-power points require fairly high speed.
• Beyond synthesis and auto-P&R today � more Structured Custom?
Short pipelines expose the basic TA assumption (single path sensitization).
• Wavefront timing analysis tools?
• Voltage agility makes TA sign-off harder.
Arith18 Montpellier June 07 [email protected]
Structured Custom
…is enough for predictable auto-routingNatural cell placement by engineers…