Matt Severson
Qualcomm CDMA Technologies
July 27, 2009
Low Power SOC Design and Automation
Outline
Introduction
Overview of Serra (Qualcomm’s first 45nm tapeout)
Features / Technology / Low Power Techniques
Tradeoffs of Automated vs. Custom Design for Low Power
Memory IP
Standard Cell
Mixed-Vt Design
Clock Power
Clock Gating
Clock Tree Synthesis
Multiple Power Domains
Voltage Scaling
Voltage Islands with Power Gating
Conclusions & Future Directions
Introduction
Power consumption is a key differentiator in wireless
communications products.
Power Consumption Limits
Battery Life
Performance
Feature set
Form Factor
Introduction – Form Factor
[Figure: Phone Surface Temperature Rise Above Ambient. X-axis: Surface Power Density [W/sq-in], log scale from 0.01 to 1. Y-axis: Temperature Rise [C], log scale from 1 to 100.]
Surface Power Densities less than 0.1 W/sq-in: This is the recommended design area
Surface Power Densities between 0.1 and 0.22 W/sq-in: Phone is likely to have local hot spots
Surface Power Densities greater than 0.22 W/sq-in: Phone should be redesigned
Power Densities Increasing
Overheating
Limit Form Factors
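The three regions on the chart can be read as a simple threshold check. The sketch below is illustrative only: the 0.1 and 0.22 W/sq-in limits come from the slide, while the function names and the example phone (3 W over 20 sq-in) are hypothetical.

```python
def surface_power_density(power_w: float, area_sq_in: float) -> float:
    """Surface power density in W/sq-in for a given dissipated power and surface area."""
    if area_sq_in <= 0:
        raise ValueError("area must be positive")
    return power_w / area_sq_in


def classify_density(density: float) -> str:
    """Map a surface power density to the three regions named on the chart."""
    if density < 0.1:
        return "recommended design area"
    if density <= 0.22:
        return "likely local hot spots"
    return "phone should be redesigned"


# Hypothetical phone: 3 W dissipated over 20 sq-in of surface (0.15 W/sq-in).
print(classify_density(surface_power_density(3.0, 20.0)))
```

How the exact boundary values (0.1 and 0.22) are assigned is a choice made here; the slide does not say which region they belong to.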
Introduction - Battery Life
[Figure: Battery Life Analysis. Stacked bars for Device 1 through Device 10 (Verizon™ and AT&T™ phones). Y-axis: % of Total Reviews, 0% to 100%. Legend: 1 Star (Bad), 2 Stars, 3 Stars, 4 Stars, 5 Stars (Good), Expressed Dissatisfaction. Percentage labels in extraction order: 44%, 43%, 21%, 28%, 14%, 26%, 35%, 12%, 19%, 48%. Row labels below the chart: Phone, Chipset, Battery Capacity, Screen Pixels.]
Battery Capacity: 860 mAh, 800 mAh, 800 mAh, 1400 mAh; 930 mAh, 1300 mAh, 1130 mAh, 1500 mAh, 910 mAh, 880 mAh
Screen Pixels: 240 x 320, 240 x 320, 240 x 320, 360 x 480; 240 x 320, 240 x 400, 240 x 320, 240 x 320, 240 x 320, 176 x 220
SERRA: Qualcomm’s First 45nm Tapeout
Serra Feature Set
Modem
o CDMA 1xEV-DO revA & B
o UMTS (includes HSDPA, HSUPA)
o GSM (includes GPRS and EDGE)
o Unified GPS engine for both CDMA and UMTS modes.
Processors
o QDSP4u8 based MDSP core
o ARM11 core with 32KB I/D cache
o ARM926 core with 32KB I/D cache
o QDSP5u4 based ADSP core w/ 256 KB L2 cache
Multimedia
o 24 bit WVGA w/ LCDC (active refresh)
o ATI LT graphics core (Open GL 2.0)
o 22M Triangles/Sec
o 8 Mpixel Camera support
Peripherals
o 2 HS USB interfaces
o MDDI gen 1.5
Serra Physical Characteristics
Die size: 8200.08 x 6500.34 um (53.3 mm^2)
Signal I/Os: 419
Process: tsmc45lp
Metal Layers: 6 (5 thin, 1 thick (4x)) and 1 AP RDL layer
Total # Transistors: 170 Million
Total # RAM bits: 13.7 Mbits
Total # ROM bits: 1.1 Mbits
Static IR Drop: < 10mV (@ Worst case 800 mA)
Leakage: ~450 uA (TT,25c, 1.125V)
671 pin 13x13 NSP Package, 0.5mm ball pitch
Includes Serra (Digital die) + Analog + Memory
Serra Low Power Design Goals
Background
Leakage Power is increasing due to process
45nm Sub-Threshold is worse than 65nm (pA/um)
Gate leakage is increasing.
Junction Diode leakage is increasing.
45nm Process has no HVt transistor.
Simple scaling of Dynamic Power is not Enough.
+ Dynamic Power will scale down with process geometries (-C)(-V)
o However increased wire cap will temper the reduction
- Increased performance demands and more applications (+f) (+C)
- Aggressive Product requirements for battery life
Conclusion:
More aggressive leakage and active power management techniques are required in 45nm
Low Power Priorities / Goals for Serra:
1 Decrease Dynamic Power
2 Maintain the total static leakage power.
3 Keep Active Leakage a “small” percentage of Dynamic Power (< ~15%)
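The (-C)(-V)(+f)(+C) shorthand above follows the usual first-order CMOS relation, dynamic power proportional to C * V^2 * f. A minimal sketch of that relation and of the "< ~15%" active-leakage goal is given below; the function names are hypothetical, and the 1.23 V to 1.125 V step is taken from the 65nm vs. 45nm block comparison later in this deck.

```python
def dynamic_power_scale(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Relative change in dynamic power for relative changes in C, Vdd and f.

    Uses the first-order relation P_dyn ~ C * V^2 * f.
    """
    return c_scale * v_scale ** 2 * f_scale


def active_leakage_within_goal(leakage_w: float, dynamic_w: float,
                               limit: float = 0.15) -> bool:
    """True if active leakage stays below `limit` (about 15% on Serra) of dynamic power."""
    return leakage_w < limit * dynamic_w


# Voltage-only effect of moving from 1.23 V to 1.125 V (C and f held constant).
voltage_only = dynamic_power_scale(1.0, 1.125 / 1.23, 1.0)
print(f"relative dynamic power: {voltage_only:.3f}")
```

Any increase in switched capacitance or frequency multiplies straight into this result, which is why simple scaling alone is not enough.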
Serra Low Power Features
Low Power Multi-Threshold Qualcomm Standard Cell Library
o 2 Vt and 2 Channel Lengths
Low Power Memory
o Power Collapsing of RAM/ROM periphery and core
o Independent Bank Collapsing for Large High Density Memories
Advanced Low Power Clocking
o ~105 Master Clock Domains (I/O or Independent Frequency)
o ~230 Total Clock Domains (Synchronous, Iso-Synchronous, Asynchronous)
o Automatically inserted Fine grained clock gating
o Manually inserted Architectural clock gates
o Static SW control and Dynamic HW control of clock gating.
o Custom Raw Clock Tree Routing
o Low Power CTS with Qualcomm Custom Clock Tree cells.
24 Analog and Pad power domains
2 Digital Power domains
o Independent Voltage Scaling
– Active and Sleep modes
o Power Collapsing
8 Digital Power Islands with Power Gating
All Low Power Features fully Verified
Power Aware simulation
Power Structural Checks
[Figure: Serra Floorplan]
[Figure: Serra Static IR Drop Map]
[Figure: Serra Dynamic IR Drop Map]
DESIGN AUTOMATION
Design Automation is Mandatory
Design complexity
Time to Market is Critical
Fewer design resources required
Quality
o Through Standardized flows and tools
Automated Design tools and flows have several
limitations that affect low power
Many automated tools don’t consider power
Others don’t make the correct tradeoffs between power and
area/timing.
This Presentation focuses on the tradeoffs involved with
several low power techniques used on Serra and the
limitations of automated design for low power
Customized Design for Low Power
Custom design flows and circuits can produce better
results
Lower Power, Higher Speeds, Less Area
Custom design requires more design effort and time
Use customized Design and signoff ONLY in critical areas
Pick areas of customization to get the greatest benefit
Clock Trees
Raw Clock Trees
Raw Clock Dividers
Memory IP
Standard Cell
o Move the customization into IP
o Use automation to insert the IP, check the IP and optimize with IP.
Customization of Raw Clock Network
Raw clock networks are high speed, high power nets from PLLs to dividers
Raw clock dividers are stacked and custom routed. Width and spacing are chosen for optimal clock isolation while maintaining fast transition times.
Use minimal clock buffers to distribute clocks within the network but maintain desired transition delay.
10-input tri-state mux
o Reduces insertion delay and power
Custom Layout of raw dividers
o Reduces critical path delay, voltage noise and optimizes rise/fall times.
~4x reduction in Raw clock Power (Compared to Previous chip)
[Figure: Raw Clock Network fed by PLLs, compared with a Traditional Wide Mux Structure. Legend: Selected Clock Path (green), Non-Selected Active Clocks (red).]
LOW POWER IP
Low Power Memory
Node    Peripheral     Bit-cell array   Sleep with       Sleep without
        with footer    with header      data retention   data retention
90nm    Yes            No               Yes              No
65nm    Yes            No               Yes              No
45nm    Yes            Yes              Yes              Yes
[Figure: Leakage at 90nm, 65nm and 45nm, split into Core Array and Periphery, for the Function, Sleep w/ retention and Sleep w/o retention states, with and without Vdd scaling.]
• Bit-cell leakage is up 6X in 45nm.
• No hVt devices.
• All memories need to have leakage control
• Circuit + System solutions
- Peripheral footer
- Bit cell header
- Vdd scaling
Maintain only the useful data with array header and reduce Vdd during sleep mode to manage the leakage.
Memory Partial Bank Collapse
Power Gating portions of the bit-cell array that are not
needed
Standby/Active Leakage reduction
Some active power reduction since clock/data is gated to
banks that are not accessed.
Requires Proper memory management in SW and FW.
STANDARD CELL POWER REDUCTION
Standard Cell Leakage
45nm Standard Cell Challenges
Ioff increase and no HVT device compared to 65nm
Performance provided by NVT is not required everywhere
Power Gating not possible in all blocks
45nm OPTIONS
Use Longer Channel length NVT device
o Min channel length is 40nm in 45nm tech
Use Stacked NVT devices
o Replace every device with a stack of 2 devices
Length (L)   Spacing (S)   Pitch   Increase
40n          40n           180n
50n          45n           200n    11%
60n          45n           210n    17%
Pitch = L/2 + S + 60n + S + L/2
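The pitch formula reproduces the three table rows directly. The short check below is a sketch; the function and constant names are hypothetical, and only the formula and the L/S values come from the slide.

```python
FIXED_TERM_NM = 60.0  # the constant 60n term in the slide's pitch formula


def gate_pitch_nm(length_nm: float, spacing_nm: float) -> float:
    """Pitch = L/2 + S + 60n + S + L/2, all values in nanometres."""
    return length_nm / 2 + spacing_nm + FIXED_TERM_NM + spacing_nm + length_nm / 2


# Minimum-length device (L = 40n, S = 40n) is the reference for the increase column.
baseline = gate_pitch_nm(40, 40)
for length, spacing in [(40, 40), (50, 45), (60, 45)]:
    pitch = gate_pitch_nm(length, spacing)
    increase = (pitch / baseline - 1) * 100
    print(f"L={length}n S={spacing}n pitch={pitch:.0f}n increase={increase:.0f}%")
```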
QCT45 Leakage Scaling Factor, 65nm to 45nm (TT, 25c)
       H2N     H2NL    N2N    L2L     L2N
N      24.71   9.45    1.61   10.48   0.56
P      27.58   5.68    3.00   26.80   1.51
Ave    26.14   7.57    2.31   18.64   1.03
Simulation Results (TSMC TT, 1.1V, 25C)
[Figure: four bar charts over the candidates NVT 40n, NVT 50n, NVT 60n, NVT 70n and Stacked NVT: Leakage Savings (factor), Delay Increase (factor), Area Increase (%, "Area : Comb."), and Switching Cap Increase (%). One entry is marked "Not Available".]
Standard Cell Leakage
Longer Channel length
Leakage savings = 4.5X (L=50nm), 6.6X (L=60nm)
Delay = 1.18X (L=50nm), 1.32X (L=60nm)
Area increase ~ 5.8% (L=50nm), 6.7% (L=60nm)
Switching Cap increase ~ 16% (L=50nm), 32% (L=60nm)
Stacking is NOT as beneficial
o 8.3X Leakage Reduction
o Steep Delay Penalty 2.9X
o Substantial area penalty
o Switching cap increase ~ 62%
Conclusion
L=40 to L=70 was evaluated
Diminishing returns in leakage savings as L increases
Dynamic power versus Leakage tradeoff
Provide NVT and LVT with two channel lengths in 45nm Library
o Nvt and PNvt (long channel)
o Lvt and PLvt (long channel)
Block Level Leakage Comparison
65nm vs. 45nm (Serra)
Compare the Leakage of 6 Hard Macros with the fewest RTL changes.
o 65nm vs. 45nm Library
o 1.23v vs. 1.125v
Higher Speed blocks (B, E, F)
Benefited from increased speeds in 45nm
Standard Cell Leakage actually decreased
Medium/Slow speed blocks (A, C, D)
Increase in standard cell leakage from 30-66%.
At the block level the ave. leakage increase from 65nm -> 45nm is 1.5x to 2x
[Figure: bar chart of leakage, 0 to 50 uW, for BLK A through BLK F at 65nm and 45nm.]
Block    65nm       45nm       Diff
BLK A    19.10 uW   24.70 uW   29.3%
BLK B    0.39 uW    0.18 uW    -54.4%
BLK C    15.50 uW   20.20 uW   30.3%
BLK D    25.50 uW   42.40 uW   66.3%
BLK E    0.93 uW    0.14 uW    -84.6%
BLK F    46.30 uW   32.00 uW   -30.9%
MTCMOS
MTCMOS - Strategy
Mixed Vt Libraries are used to optimize both leakage and active power
Higher Vt reduces leakage
Use Long-channel devices to provide additional leakage reduction
o Small area and active power penalty
Lower Vt can improve active power
o providing better Isat / Cg ratio
Lower Vt has better relative performance and less delay variation at Low Vdd.
Applications of Low Vt
High Activity Nets
Exclusively for clock trees.
o Less insertion delay, skew, variation, optimal repeater.
o Decreased active power at the cost of leakage.
In blocks that are power gated
In power domains with aggressive voltage scaling
In high performance blocks.
MTCMOS – EDA Limitations
Most EDA tools do a poor job of active power optimization with mixed Vt Libraries
EDA tools DO use Mixed Vt for timing.
Don’t know where to use Low Vt for active power reduction.
o Need usage information to tradeoff leakage and active power
o High activity nets – Requires Vector Information
o Clock Nets – Requires Constraints to restrict cell list
o Blocks that are power gated – Power Intent
o Blocks that are Voltage scaled – Power Intent
Solution
Users decide which Vt is most appropriate.
Users constrain or restrict the tool’s choice of Vt.
Use single Vt for initial runs and multi-Vt for incremental optimization.
MTCMOS – EDA Limitations
Many opportunities for leakage
recovery
o Timing slack on paths
o Excess margin early in the flow
o Pessimistic view of timing
o Optimization cost functions favor
area over leakage
Point tools in the industry that
were created for this purpose.
Special scripts in Timing Signoff
tool.
o Leakage Recovery within PTSI™
Knowing where to use High Vt for leakage power reduction
Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI” SNUG Boston 2008
Misc. Standard Cell Power
Low Clock Power Flip Flops
Additional Flip Flop Family that targets power reduction
o New Clock Topology
o Enables designers and tools to make a power vs. speed tradeoff
o 21% lower clock power, 6% smaller, but 36% slower than regular
flop.
Originally they had the same footprint
o EDA tools did not insert the Low Clock Power Flops unless they
were also smaller. [Cost functions favor Area & Timing]
CLOCK TREE POWER
Clock Tree Power
Clock Tree Power is still a major contributor to total active
power (30-40%)
Clock Architecture and Frequency Planning
Clock Architecture has a huge impact on power
Clock Tree Architecture
Number of PLLs
Independent PLLs
o Increase flexibility and ease of use.
o Optimizes Power for simple use cases
Shared high frequency PLLs
o Lower system power in concurrency modes
o However, SW may not be able to manage very complex frequency plans, which results in power inefficiencies
Clock domain partitioning
Independent Clock Domain Control (HW/SW)
Frequency and Gating
Clock Synchronicity
Asynchronous domains are more flexible
o Increased latency across clock domain boundaries
Synchronous vs. Iso-Synchronous
o Saves power in low performance modes. Costs power in worst case
Clock Gating
Still one of the most effective ways to save active power
Maximize fine-grain clock gating
o Optimize settings for integrated clock gating (w/ PowerCompiler™)
– Almost always a win.
– Beware of increased skew, insertion delay, timing and area impact.
Manually inserted Architectural CGCs
o Gate modules based on Modal and Temporal need.
o Can be HW or SW controlled
Analyze the Clock Gating Percentage (CGP) and Clock Gating Efficiency (CGE) to find new opportunities
Example - Clock Power
Example of a Complex DMA engine with multiple bus masters.
Separated Client Interface clocks
Provided Independent clock frequency control
Div 1-4
Iso-sync / Async support on bridges
o Enables Master/Slave clock division
[Figure: clock tree before and after partitioning. Previous: a single tree of ~1650 buffers driving ~18655 leaf registers. New: five leaf groups (~2947, ~3009, ~2197, ~1540 and ~8962 registers) behind CXC cells, with one branch of ~1650/2 buffers and four branches of ~1650/8 buffers.]
Bus Master Clock Power Comparison (mA)
           90nm   65nm r1   65nm r2   Serra
CI3        0      0         0         0.1979
CI2        0      0         0         0.2232
CI1        0      0         0         0.2307
CI0        0      0         0         0.2507
Baseline   17     8.0578    5.7606    3.377
Low Power CTS
Clock Tree Synthesis (CTS)
Used for the majority of ~250 clock trees on Serra
Small changes to the constraints and cells can dramatically
affect the power of CTS trees.
Custom Clock Cell Design
Clock Buffers, Inverters, Muxes, Gating Cells, Dividers.
o Internal clock routes are minimized to reduce clock power
o Insertion delays are minimized to reduce overall clock power and
skew
Low power CTS
Higher granularity of clock buffers.
Relaxed transition and skew constraints
Modal skew balancing (functional vs. test)
Low Power CTS – EDA Deficiencies
Most CTS tools only consider
Global skew and transition time.
CTS – Example
1) Balance Global Skew
2) 100ps Max Transition
3) Min Area/Power
A. Default Settings
B. Optimized Run
Need tools which
Consider Local vs. Global skew.
Relax clock transition times
Cbuf sizing on paths
Cluster registers that share a clock or clock gate
Insert, clone and de-clone clock gates
Move clock gates up the tree
[Figure: CTS results for A (Default Settings) and B (Optimized Run).]
MULTIPLE POWER DOMAINS
Multiple Power Domains
Power Domain Partitioning
More power regimes from PMIC
o + Better Power Control and Efficiency
o + Independent Voltage Scaling and Power Collapsing
o - Increased PDN impedance and IR Drop
o - Increased Bill Of Materials (BOM)
o - Requires level shifters and resynchronization at boundaries.
o Need a small on-chip regulator with fast response and good
efficiency
IR Drop and PDN impedance
o - Increased Power Density
o - Increased metal resistance
o - Increased IR due to Power Switches
o - IR impact is greater at lower voltages
o - Dynamic IR effects on skew and timing not well modeled.
PDN Impedance
Worst Case IR drop may not be at max frequency and power
An arbitrary PDN network observed from the die looking back towards the VRM is shown below. The PMIC has passive modeling.
Resonance points in the RLC network.
[Figure: PDN impedance magnitude vs. frequency f, log-log, 10^3 to 10^9 on the frequency axis and from 0.01 upward on the impedance axis. Traces: Z_spec(s), Z_network(s), R_spec1, R_spec3, s * L_spec, and 1 / (s * C_spec).]
Potential Peaking due to PMIC & Bulk Cap
Board “Mounting Inductance”
Mid-band region controlled by decap inventory
VOLTAGE SCALING
Static Voltage Scaling (ARM11)
SW selects a performance level based on application and/or SW performance monitors.
HW selects 1 of 8 pre-programmed Frequency-Voltage pairs
Benefits
Active power reduction
o 32%-36% based on lab measurements at 96 MHz
o Limited by Memory Vcc Min
Easier to implement on Applications processor
Drawbacks
Requires separate voltage regulator (vdd_apc)
Less efficient than AVS
o SW Controlled vs. HW controlled AVS
o SW Performance monitors and algorithms lagging
Requires careful characterization of ARM11ss FMAX vs. Voltage
Timing Closure (Corners, Variation, Margins)
[Figure: SVS control block diagram. Labels: PMIC7500, SSBI, svs_cntl with lvl 1 to lvl 8, Current PLevel, Next PLevel, FSM (req/ack), data, addr, stat, Modem ARM Control, ARM9, Peripheral Bridge and MPU, ARM11SS, vdd_apc, vdd_dig, Serra Digital Die.]
Static Voltage Scaling (MSM Top)
Multiple Blocks in the same power regime
The highest required voltage is determined from multiple LUTs. LUT data obtained from FMAX characterization.
Software programs the voltage upon entry / exit of each mode
Drawbacks
SW Complexity
Fixed frequency blocks don’t scale well
Lots of characterization work
Risk of Test escapes
Not all blocks included in characterization.
Freq. (MHz)   Vdd (V)      Freq. (MHz)   Vdd (V)      Freq. (MHz)   Vdd (V)
128           1.15         256           1.20         256           1.20
96            1.10         196           1.15         196           1.15
64            1.05         128           1.10         128           1.10
48            1.00         96            1.05         96            1.05
32            0.975        64            1.00         64            1.00
25            0.95         32            0.975        32            0.95
20            0.95         20            0.950        20            0.95
BLK A BLK B BLK C
Mode1 Requires 1.2v
Mode2 Requires 1.15v
Mode3 Requires .95v
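The "highest required voltage from multiple LUTs" rule can be sketched as a lookup per block followed by a maximum. The LUT values below are the three frequency/voltage tables from this slide; assigning them to BLK A, BLK B and BLK C in that order, the function names, and the per-mode frequencies are assumptions made only for illustration.

```python
# (frequency in MHz, required Vdd in V) per block. The block-to-table mapping is assumed.
LUTS = {
    "BLK A": [(128, 1.15), (96, 1.10), (64, 1.05), (48, 1.00),
              (32, 0.975), (25, 0.95), (20, 0.95)],
    "BLK B": [(256, 1.20), (196, 1.15), (128, 1.10), (96, 1.05),
              (64, 1.00), (32, 0.975), (20, 0.950)],
    "BLK C": [(256, 1.20), (196, 1.15), (128, 1.10), (96, 1.05),
              (64, 1.00), (32, 0.95), (20, 0.95)],
}


def required_vdd(block: str, freq_mhz: float) -> float:
    """Vdd of the slowest characterized LUT entry that still covers freq_mhz."""
    candidates = [(f, v) for f, v in LUTS[block] if f >= freq_mhz]
    if not candidates:
        raise ValueError(f"{block} is not characterized at {freq_mhz} MHz")
    return min(candidates)[1]


def regime_vdd(mode: dict) -> float:
    """Voltage for a shared power regime: the highest requirement of any block."""
    return max(required_vdd(block, freq) for block, freq in mode.items())


# Hypothetical mode: the block frequencies are made up for this example.
print(regime_vdd({"BLK A": 64, "BLK B": 196, "BLK C": 20}))
```

One block running fast holds the whole regime at its voltage, which is the reason fixed frequency blocks limit the benefit.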
Process Monitoring DVS
Increased process variation at 45nm, increased benefits of Process related Dynamic Voltage Scaling.
Measure PM speed.
SW obtains the required voltage from LUT.
LUT is created through Characterization.
Benefits
16% power reduction for TTT, 31% for FFF estimated.
More practical than DVFS on top-level
AVS is too complex to implement on top-level
Drawbacks
Impact on ATE and test time, binning and test escapes.
Characterization effort
Timing Closure (Corners, Margins, Variation)
[Figure: Simulation Data: Fmax vs Vdd for FFF, TTT, SSS parts (Fast, Typical and Slow curves), with the Design Target frequency and pass/fail regions marked.]
Process Bin   Required Vdd (V)
Fast          0.950
Typ.          1.025
Slow          1.125
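A minimal sketch of the process-bin lookup is shown below, together with a first-order V^2 estimate of the dynamic power saved relative to the slow-bin voltage. The bin voltages are from the table above; the names and the V^2-only model are assumptions, so the result need not match the deck's own estimates of 16% (TTT) and 31% (FFF), which come from its characterization.

```python
# Required Vdd per process bin, from the characterization LUT on the slide.
REQUIRED_VDD = {"Fast": 0.950, "Typ.": 1.025, "Slow": 1.125}


def vdd_for_bin(process_bin: str) -> float:
    """Voltage SW would program after the process monitor identifies the bin."""
    return REQUIRED_VDD[process_bin]


def dynamic_power_saving(process_bin: str, worst_case_vdd: float = 1.125) -> float:
    """Fractional dynamic power saving vs. running every part at worst_case_vdd.

    Assumes P_dyn ~ V^2 at fixed frequency and ignores leakage.
    """
    return 1.0 - (vdd_for_bin(process_bin) / worst_case_vdd) ** 2


for name in ("Fast", "Typ.", "Slow"):
    print(f"{name}: {vdd_for_bin(name):.3f} V, saving {dynamic_power_saving(name):.1%}")
```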
Multi-Mode Multi-Corner
45nm designs require MMMC signoff for Hold robustness
Hold closure at several PVT corners.
Functional Mode and Test Mode.
MMMC Implementation is also necessary to achieve lowest power.
Voltage Scaling:
o Optimize Implementation across several PVT corners to make the correct decisions
o MMMC Reduces iterations and effort for closing Setup & Hold.
Power Optimization
o Power corners are different than timing corners.
– Leakage information is very inaccurate at Temperature/Voltage extremes.
– Optimize power for the nominal use modes.
MMMC Requirements
Constraints for each Mode are needed early.
More library characterization required.
o >15 Corners used on Serra
MMMC Limitations
Increased tool run times (Physical Design cycle time)
o Time to market
No mature solution for MMMC throughout the flow
o Used single corner Synthesis and Placement with some iteration
and over-margining on Serra.
o MMMC was used during CTS, Post-CTS opt and final timing
closure.
o Evaluation of Commercial tools
– Evaluated industry leading tools for MMMC synthesis/placement
after Serra. (Still Maturing)
Variation and Margins
Variation is increasing at smaller geometries and lower voltages
Adding fixed margins is detrimental to power
Increased hold buffer insertion
Larger gate sizes
Swapping to lower Vt when not needed.
Use OCV to apply margin only where necessary.
[Figure: Monte Carlo Simulations of Hold Time at 2 VT Corners]
VDD Minimization During Sleep
[Figure: Estimated Leakage Power Reduction by VDD Minimization. X-axis: Supply Voltage (V), 0 to 1.2. Y-axis: Scaled Leakage Power (%), 0% to 100%.]
Minimize voltage of entire
digital die during sleep
Especially beneficial during
short sleep cycles (BT Scan,
QChat)
Estimated 70% leakage power
reduction @ Battery by
lowering the digital voltage
from 1.1V to 0.7V
Minimal SW overhead and fast
restoration.
o All memory and register state
is retained.
Simulated [PNVt TT, 25c]
VDD Minimization (Serra)
Serra Leakage current during
VDD MINIMIZATION
Calculated 64% savings at
Battery
80% efficiency, 3.7V Battery.
Voltage minimization is the
most effective way to control
digital leakage without
complicated SW changes.
$$I_{batt} = \frac{I_{msm} \cdot V_{msm}}{0.8 \cdot V_{batt}}$$
[Figure: Serra (Normalized) Off Leakage Measured (3 TT parts, 25c). Bars: Ave @ MSM (measured) and Ave @ Battery (calculated) at 1.078 V and 0.81 V.]
Voltage    Ave @ MSM   Ave @ Battery
1.078      1.00        0.36
0.81       0.47        0.13
Savings    52.66%      64.43%
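The battery-side numbers follow from the MSM-side measurements with the 80% efficiency and 3.7 V battery stated on the slide. The sketch below redoes that arithmetic from the rounded, normalized table values, so the savings it prints are close to, but not identical to, the slide's 52.66% and 64.43%. Function and variable names are hypothetical.

```python
def battery_current(i_msm: float, v_msm: float,
                    v_batt: float = 3.7, efficiency: float = 0.8) -> float:
    """Current drawn at the battery for a given MSM current and MSM voltage.

    I_batt = (I_msm * V_msm) / (efficiency * V_batt)
    """
    return i_msm * v_msm / (efficiency * v_batt)


# Normalized MSM leakage current at the two supply voltages from the table.
nominal = battery_current(1.00, 1.078)
minimized = battery_current(0.47, 0.81)

print(f"battery current at 1.078 V: {nominal:.2f}")
print(f"battery current at 0.81 V:  {minimized:.2f}")
print(f"savings at battery: {1 - minimized / nominal:.1%}")
```

The saving at the battery exceeds the saving at the MSM because both the current and the voltage drop.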
Full Chip Power Collapse
Full Chip collapse at PMIC regulator
Requires data to be saved and re-boot of all processors
Power Collapse Break-Even Point
Energy Saved (Leakage * Time) > Energy Overhead (Save/Restore);
For 2.56 Sec Slot cycle (Assuming 5mJ of Overhead) Power Collapse is beneficial if leakage is greater than 1.8mA
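The break-even condition can be worked through numerically. In the sketch below the supply voltage is an assumption: the slide does not state it, and 1.1 V is used because it reproduces the quoted threshold of about 1.8 mA for a 2.56 s slot and 5 mJ of overhead. Function names are hypothetical.

```python
def break_even_leakage_ma(overhead_mj: float, sleep_s: float,
                          vdd_v: float = 1.1) -> float:
    """Leakage current (mA) above which power collapse saves energy.

    Energy saved = I_leak * Vdd * t must exceed the save/restore overhead,
    so I_break_even = overhead / (Vdd * t). mJ / (V * s) gives mA.
    """
    return overhead_mj / (vdd_v * sleep_s)


def collapse_saves_energy(leakage_ma: float, overhead_mj: float,
                          sleep_s: float, vdd_v: float = 1.1) -> bool:
    """True if leakage energy over the sleep time exceeds the collapse overhead."""
    return leakage_ma * vdd_v * sleep_s > overhead_mj


# 2.56 s slot cycle with 5 mJ of save/restore overhead, as on the slide.
print(f"break-even leakage: {break_even_leakage_ma(5.0, 2.56):.2f} mA")
```

Shorter sleep intervals or larger overheads push the break-even current up, which is the family of curves the chart plots.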
[Figure: Which consumes less energy, w/ or w/o PC? Leakage current (mA, 0 to 5) vs. Sleep Time (Sec, 0 to 5), with break-even curves for power collapse overheads (PC/OH) of 0.5mJ, 1mJ, 2mJ, 5mJ and 10mJ and the PCH Time Slot (2.56 Sec) marked. An inset labeled Scaled/Estimated Serra Current (mA) carries the values 0.68, 34, 76 and 2,477, 2,508, 2,560.]
POWER GATING
FS/HS Power Gating
Headswitch (HS) and footswitch (FS)
Serially adding a high resistance device between power rails in low leakage mode
Power Switch design is a trade-off between leakage power, IR drop (when FS/HS are on), area, ramp-up noise and time.
Local FS is integrated within logic circuit
Used at Qualcomm since .13u
Minimum impact on design automation flow
Finer control to turn down logic circuits independently
State Maintained
(Leakage savings / Cost) is low.
Global FS/HS are integrated within power grid
Switches shared by all logic in a domain, thus smaller size and lower IR drop is possible.
Can be implemented with finer grid, e.g. global distributed FS/HS (GDFS), or coarse grid, e.g. FS/HS power ring
GDFS is used in 90nm and 65nm designs
Employ a save & restore scheme
[Figure: circuit sketches of a Local Footswitch (between Vdd and Vss) and a Global FS (Vdd, Vss_int, Vss).]
Global FS/HS implementations
FS/HS ring
Less PD effort
Shorter sleep control distribution
Larger IR drop compared with GDFS, especially in flipchip case
IR drop increases quicker when the size of the block increases (cubic w.r.t. the length)
Only Suitable for small size macros
Global distributed FS/HS
Can be modeled as an additional resistance between global and local power mesh
Does not break global mesh
Needs sleep control signal distribution throughout.
Suitable for large size macros
[Figure: FS Ring and GDFS layouts showing the Global PG Mesh and Local PG Mesh, plus a GDFS cell schematic. Labels: En_few_in, En_few_out, En_rest_in, En_rest_out, Mf, Mr, vssfx, Vss.]
Serra GDFS Design
Global Distributed Footswitch (GDFS) was chosen for
leakage reduction vs. area cost
Sleep leakage control
Extra long channel NVT transistors used for footswitch cells
Active leakage control
Turn off macros that are not used in certain operating modes
High Temperature active leakage control
GDFS Design Flow
Use QCOM in-house tools
for inserting isolation cells.
Use existing commercial
tools to insert switches and
connect enables.
Use existing commercial
tools + scripts to insert
isolation cells during
Physical Design.
Pre-place GDFS cells and
do IR drop analysis at early
stage
Further refine the size of
GDFS cells according to
local IR drop
Low Power Verification
Power Aware Verification is Required
o Verify entry to & exit from low power states
o Properly model power collapse
o Verify clamp polarity
Power Aware Simulation tools
o Qualcomm Scripts
o Commercial Tools
Power Structural Checks
o Verify Power domain crossings
– Isolation cells
o Commercial Tools @ 3 design stages
– RTL
– Logical Gate
– Physical Gate
Serra Leakage by State
[Figure: Total Leakage Current by State (TTT, 1.125V, 25C) for the Active, Idle, Dormant and Power Off states, stacked by component: gfs, mem, std, N2, T2, T1. Columns: Estimated. Dots: Measured.]
Measurements on 3 TT Parts
4x reduction in leakage from Active to Power off states. (Excluding Vdd min)
>4x leakage reduction from GDFS.
Power Off Variation
Across 3 typical wafers, mean-to-mean variation is 20%
Variation across dice from the same wafer is ~100%
CONCLUSIONS & FUTURE DIRECTIONS
Conclusions
Power IS a key differentiator and limits the design
Tools and users must Design for Power
Power is a different beast than area/timing
o Power/Energy is highly dependent on use case (Vector)
o Must consider Static and Dynamic Power
o Power varies across PVT.
– Optimization at Nominal
– Constrained by worst case.
Designing for Low Power is all about Tradeoffs.
Many Low Power Techniques exist
o Each needs to be applied with tradeoffs in mind.
o Know the tradeoffs and what your specific goals are.
We still need more techniques to meet the customer
demands. (We are not doing enough)
Conclusions
Design automation is required for Complex SOCs, but Customization in select areas can Dramatically reduce power.
Move Customization into IP
Several deficiencies and limitations exist within EDA tools.
Multi-Mode Multi-Corner Design is a must
EDA tools need to improve and move MMMC up in the flow
Timing and power corners are not the same.
Simultaneous optimization of timing, power and area needed at all stages
High Level Synthesis
RTL Design & Optimization
Logic Synthesis
Clock Tree Synthesis
Place and Route
Timing Closure
Getting the Architecture right is critical to power
High Level System Power Modeling required for making
correct architectural design decisions
Need to have “relatively” accurate power models to make
tradeoffs
Power Vectors
Accurate power estimation and optimization require
accurate activity information
o There is a need for tools/methods which extract activity information
from real SW code running on Emulation and/or system models.
o Benchmarks and verification type vectors are often quite different
from “real” world use cases
Vector-less power estimation and optimization is a very
challenging problem.
o Some have attempted, but no satisfactory solution seen thus far.
References
[1] http://www.nttdocomo.co.jp/english/info/notice/page/090227_00.html
[2] Bruce Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI,” SNUG Boston 2008
[3] Krzysztof A. Kozminski, “Optimization for Leakage Power with PrimeTime,” 2004 San Jose SNUG Conference