
Matt Severson

Qualcomm CDMA Technologies

July 27, 2009

Low Power SOC Design and Automation

Outline

Introduction

Overview of Serra (Qualcomm’s first 45nm tapeout)

Features / Technology / Low Power Techniques

Tradeoffs of Automated vs. Custom Design for Low Power

Memory IP

Standard Cell

Mixed-Vt Design

Clock Power

Clock Gating

Clock Tree Synthesis

Multiple Power Domains

Voltage Scaling

Voltage Islands with Power Gating

Conclusions & Future Directions

Introduction

Power consumption is a key differentiator in wireless

communications products.

Power Consumption Limits

Battery Life

Performance

Feature set

Form Factor


Introduction – Form Factor

[Figure: Phone Surface Temperature Rise Above Ambient. X-axis: Surface Power Density [W/sq-in], 0.01 to 1 (log scale). Y-axis: Temperature Rise [C], 1 to 100 (log scale).]

Surface Power Densities less than 0.1 W/sq-in: This is the recommended design area

Surface Power Densities between 0.1 and 0.22 W/sq-in: Phone is likely to have local hot spots

Surface Power Densities greater than 0.22 W/sq-in: Phone should be redesigned

Power Densities Increasing

Overheating

Limit Form Factors

http://www.huawei.com/mobileweb/en/products/view.do?id=1503
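The three surface power density regions on this slide can be turned into a quick design check. The sketch below is illustrative only: the function name and the example wattage and surface area are assumptions, and only the 0.1 and 0.22 W/sq-in thresholds come from the slide.

```python
def classify_surface_power_density(watts: float, area_sq_in: float) -> str:
    """Map dissipated power and surface area to the slide's three regions."""
    if area_sq_in <= 0:
        raise ValueError("surface area must be positive")
    density = watts / area_sq_in  # W/sq-in
    if density < 0.1:
        return "recommended design area"
    if density <= 0.22:
        return "likely local hot spots"
    return "redesign"


# Hypothetical phone: 3 W spread over 20 sq-in of surface (0.15 W/sq-in).
print(classify_surface_power_density(3.0, 20.0))
```

Shrinking the surface area at constant power moves a design toward the "redesign" region, which is how power limits form factor.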

Introduction - Battery Life

[Figure: Battery Life Analysis. Chart of % of Total Reviews (0% to 100%) for Device 1 through Device 10, broken down by rating: 1 Star (Bad), 2 Stars, 3 Stars, 4 Stars, 5 Stars (Good), plus Expressed Dissatisfaction. Data labels: 44%, 43%, 21%, 28%, 14%, 26%, 35%, 12%, 19%, 48%. Devices are grouped by carrier (Verizon™, AT&T™) and annotated with Phone, Chipset, Battery Capacity and Screen Pixels.]

Battery Capacity / Screen Pixels, first group:
860 mAh, 240 x 320
800 mAh, 240 x 320
800 mAh, 240 x 320
1400 mAh, 360 x 480

Battery Capacity / Screen Pixels, second group:
930 mAh, 240 x 320
1300 mAh, 240 x 400
1130 mAh, 240 x 320
1500 mAh, 240 x 320
910 mAh, 240 x 320
880 mAh, 176 x 220

SERRA: Qualcomm’s First 45nm Tapeout

Serra Feature Set

Modem
o CDMA 1xEV-DO revA & B
o UMTS (includes HSDPA, HSUPA)
o GSM (includes GPRS and EDGE)
o Unified GPS engine for both CDMA and UMTS modes.

Processors
o QDSP4u8 based MDSP core
o ARM11 core with 32KB I/D cache
o ARM926 core with 32KB I/D cache
o QDSP5u4 based ADSP core w/ 256 KB L2 cache

Multimedia
o 24 bit WVGA w/ LCDC (active refresh)
o ATI LT graphics core (Open GL 2.0)
o 22M Triangles/Sec
o 8 Mpixel Camera support

Peripherals
o 2 HS USB interfaces
o MDDI gen 1.5

Serra Physical Characteristics

Die size: 8200.08 x 6500.34 um (53.3 mm^2)

Signal I/Os: 419

Process: tsmc45lp

Metal Layers: 6 (5 thin, 1 thick (4x)) and 1 AP RDL layer

Total # Transistors: 170 Million

Total # RAM bits: 13.7 Mbits

Total # ROM bits: 1.1 Mbits

Static IR Drop: < 10mV (@ Worst case 800 mA)

Leakage: ~450 uA (TT,25c, 1.125V)

671 pin 13x13 NSP Package, 0.5mm ball pitch

Includes Serra (Digital die) + Analog + Memory

Serra Low Power Design Goals

Background

Leakage Power is increasing due to process

45nm Sub-Threshold is worse than 65nm (pA/um)

Gate leakage is increasing.

Junction Diode leakage is increasing.

45nm Process has no HVt transistor.

Simple scaling of Dynamic Power is not Enough.

+ Dynamic Power will scale down with process geometries (-C)(-V)

o However, increased wire cap will temper the reduction

- Increased performance demands and more applications (+f) (+C)

- Aggressive Product requirements for battery life

Conclusion:

More aggressive leakage and active power management techniques are required in 45nm
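The (-C)(-V) and (+f)(+C) annotations above refer to the terms of the standard switching power relation, P = a * C * V^2 * f. The sketch below shows how a frequency increase can cancel a process-driven reduction in capacitance and voltage. All numeric values are hypothetical and chosen only for illustration; none are Serra data.

```python
def dynamic_power(activity: float, cap_farads: float, vdd: float, freq_hz: float) -> float:
    """Standard CMOS switching power: P = a * C * V^2 * f (watts)."""
    return activity * cap_farads * vdd ** 2 * freq_hz


# Hypothetical older-node block: 1 nF switched cap, 1.2 V, 200 MHz.
old = dynamic_power(0.1, 1.0e-9, 1.2, 200e6)

# Hypothetical shrink: 20% less capacitance and 1.1 V, but 1.5x the clock rate.
new = dynamic_power(0.1, 0.8e-9, 1.1, 300e6)

ratio = new / old
print(f"new / old dynamic power = {ratio:.3f}")
```

With these assumed numbers the ratio is close to 1, so the process gains are fully consumed by the higher frequency, which is the point the slide makes about simple scaling not being enough.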

Low Power Priorities / Goals for Serra:

1 Decrease Dynamic Power

2 Maintain the total static leakage power.

3 Keep Active Leakage a “small” percentage of Dynamic Power (< ~15%)


Serra Low Power Features

Low Power Multi-Threshold Qualcomm Standard Cell Library
o 2 Vt and 2 Channel Lengths

Low Power Memory
o Power Collapsing of RAM/ROM periphery and core
o Independent Bank Collapsing for Large High Density Memories

Advanced Low Power Clocking
o ~105 Master Clock Domains (I/O or Independent Frequency)
o ~230 Total Clock Domains (Synchronous, Iso-Synchronous, Asynchronous)
o Automatically inserted fine-grained clock gating
o Manually inserted Architectural clock gates
o Static SW control and Dynamic HW control of clock gating.
o Custom Raw Clock Tree Routing
o Low Power CTS with Qualcomm Custom Clock Tree cells.

24 Analog and Pad power domains

2 Digital Power domains
o Independent Voltage Scaling
– Active and Sleep modes
o Power Collapsing

8 Digital Power Islands with Power Gating

All Low Power Features fully Verified

Power Aware simulation

Power Structural Checks

[Figure: Serra Floorplan]

[Figure: Serra Static IR Drop Map]

[Figure: Serra Dynamic IR Drop Map]

DESIGN AUTOMATION

Design Automation is Mandatory

Design complexity

Time to Market is Critical

Fewer design resources required

Quality

o Through Standardized flows and tools

Automated Design tools and flows have several

limitations that affect low power

Many automated tools don’t consider power

Others don’t make the correct tradeoffs between power and

area/timing.

This Presentation focuses on the tradeoffs involved with

several low power techniques used on Serra and the

limitations of automated design for low power


Customized Design for Low Power

Custom design flows and circuits can produce better

results

Lower Power, Higher Speeds, Less Area

Custom design requires more design effort and time

Use customized Design and signoff ONLY in critical areas

Pick areas of customization to get the greatest benefit

Clock Trees

Raw Clock Trees

Raw Clock Dividers

Memory IP

Standard Cell

o Move the customization into IP

o Use automation to insert the IP, check the IP and optimize with IP.


Customization of Raw Clock Network

Raw clock networks are high speed, high power nets from PLLs to dividers

Raw clock dividers are stacked and custom routed.
o Width and spacing are chosen for optimal clock isolation while maintaining fast transition times.

Use minimal clock buffers to distribute clocks within the network but maintain desired transition delay.

10-input tri-state mux
o Reduces insertion delay and power

Custom Layout of raw dividers
o Reduces critical path delay, voltage noise and optimizes rise/fall times.

~4x reduction in Raw clock Power (Compared to Previous chip)

[Figure: Raw Clock Network fed by four PLLs, shown alongside a Traditional Wide Mux Structure. Selected Clock Path in green; Non-Selected Active Clocks in red.]

LOW POWER IP

Low Power Memory

Process   Peripheral    Bit-cell array   Sleep with       Sleep without
          with footer   with header      data retention   data retention
90nm      Yes           No               Yes              No
65nm      Yes           No               Yes              No
45nm      Yes           Yes              Yes              Yes

[Figure: Leakage of the core array and periphery at 90nm, 65nm and 45nm in Function, Sleep w/ retention and Sleep w/o retention modes, with and without Vdd scaling.]

• Bit-cell leakage is up 6X in 45nm.

• No hVt devices.

• All memories need to have leakage control

• Circuit + System solutions
- Peripheral footer
- Bit cell header
- Vdd scaling

Maintain only the useful data with array header and reduce Vdd during sleep mode to manage the leakage.

Memory Partial Bank Collapse

Power Gating portions of the bit-cell array that are not

needed

Standby/Active Leakage reduction

Some active power reduction since clock/data is gated to

banks that are not accessed.

Requires Proper memory management in SW and FW.


STANDARD CELL POWER REDUCTION

Standard Cell Leakage

45nm Standard Cell Challenges

Ioff increase and no HVT device compared to 65nm

Performance provided by NVT is not required everywhere

Power Gating not possible in all blocks

45nm OPTIONS

Use Longer Channel length NVT device

o Min channel length is 40nm in 45nm tech

Use Stacked NVT devices
o Replace every device with a stack of 2 devices

Length (L)   Spacing (S)   Pitch   Increase
40n          40n           180n
50n          45n           200n    11%
60n          45n           210n    17%

Pitch = L/2 + S + 60n + S + L/2

QCT45 Leakage Scaling Factor, 65nm to 45nm (TT, 25c)

       H2N     H2NL    N2N    L2L     L2N
N      24.71   9.45    1.61   10.48   0.56
P      27.58   5.68    3.00   26.80   1.51
Ave    26.14   7.57    2.31   18.64   1.03
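The pitch formula on this slide, Pitch = L/2 + S + 60n + S + L/2, can be checked against the three table rows. In this sketch the function and parameter names are illustrative; the slide does not name the fixed 60n term, so it is treated as an unnamed constant.

```python
def gate_pitch_nm(length_nm: float, spacing_nm: float, fixed_nm: float = 60.0) -> float:
    """Pitch = L/2 + S + 60n + S + L/2, all values in nanometres."""
    return length_nm / 2 + spacing_nm + fixed_nm + spacing_nm + length_nm / 2


BASE_PITCH = gate_pitch_nm(40, 40)  # minimum channel length row of the table

for length, spacing in [(40, 40), (50, 45), (60, 45)]:
    pitch = gate_pitch_nm(length, spacing)
    increase = (pitch / BASE_PITCH - 1) * 100
    print(f"L={length}n S={spacing}n pitch={pitch:.0f}n increase={increase:.0f}%")
```

The computed pitches reproduce the 180n, 200n and 210n entries, and the increases round to the 11% and 17% shown in the table.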

Simulation Results (TSMC TT, 1.1V, 25C)

[Figure: four bar charts over the candidates NVT 40n, NVT 50n, NVT 60n, NVT 70n and Stacked NVT: Leakage Savings (factor, 0 to 9), Delay Increase (factor, 0 to 3.5), Area Increase (%, 0% to 10%, Area: Comb.) and Switching Cap Increase (%, 0% to 70%). One entry is marked "Not Available".]

Standard Cell Leakage

Longer Channel length

Leakage savings = 4.5X (L=50nm), 6.6X (L=60nm)

Delay = 1.18X (L=50nm), 1.32X (L=60nm)

Area increase ~ 5.8% (L=50nm), 6.7% (L=60nm)

Switching Cap increase ~ 16% (L=50nm), 32% (L=60nm)

Stacking is NOT as beneficial
o 8.3X Leakage Reduction
o Steep Delay Penalty 2.9X
o Substantial area penalty
o Switching cap increase ~ 62%

Conclusion

L=40 to L=70 was evaluated

Diminishing returns in leakage savings as L increases

Dynamic power versus Leakage tradeoff

Provide NVT and LVT with two channel lengths in 45nm Library
o Nvt and PNvt (long channel)
o Lvt and PLvt (long channel)
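The "diminishing returns" conclusion is easier to see when the leakage savings factors are converted into the fraction of the original leakage that remains. The sketch below uses only the savings and delay factors quoted on this slide; the dictionary layout and function name are illustrative.

```python
# Leakage savings factor and delay factor relative to NVT, L=40nm (from the slide).
CANDIDATES = {
    "NVT, 50n": {"leak_savings": 4.5, "delay": 1.18},
    "NVT, 60n": {"leak_savings": 6.6, "delay": 1.32},
    "Stacked NVT": {"leak_savings": 8.3, "delay": 2.9},
}


def remaining_leakage(savings_factor: float) -> float:
    """Fraction of the baseline leakage left after an N-times reduction."""
    return 1.0 / savings_factor


for name, c in CANDIDATES.items():
    left = remaining_leakage(c["leak_savings"])
    print(f"{name}: {left:.1%} of baseline leakage remains, delay x{c['delay']}")

# Extra leakage removed by going from L=50nm to L=60nm, as a share of baseline.
extra_50_to_60 = remaining_leakage(4.5) - remaining_leakage(6.6)
print(f"50n -> 60n removes a further {extra_50_to_60:.1%} of baseline leakage")
```

Moving from 40n to 50n removes most of the baseline leakage, while the next step to 60n removes only a few more percent and costs additional delay and switching capacitance.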

Block Level Leakage Comparison

65nm vs. 45nm (Serra)

Compare the Leakage of 6 Hard Macros with the fewest RTL changes.
o 65nm vs. 45nm Library
o 1.23v vs. 1.125v

Higher Speed blocks (B, E, F)

Benefited from increased speeds in 45nm

Standard Cell Leakage actually decreased

Medium/Slow speed blocks (A, C, D)

Increase in standard cell leakage from 30-66%.

At the block level the ave. leakage increase from 65nm -> 45nm is 1.5x to 2x

[Figure: bar chart of leakage (0 to 50 uW) for BLK A through BLK F, 65nm vs. 45nm.]

Block    65nm       45nm       Diff
BLK A    19.10 uW   24.70 uW   29.3%
BLK B    0.39 uW    0.18 uW    -54.4%
BLK C    15.50 uW   20.20 uW   30.3%
BLK D    25.50 uW   42.40 uW   66.3%
BLK E    0.93 uW    0.14 uW    -84.6%
BLK F    46.30 uW   32.00 uW   -30.9%

MTCMOS

MTCMOS - Strategy

Mixed Vt Libraries are used to optimize both leakage and active power

Higher Vt reduces leakage

Use Long-channel devices to provide additional leakage reduction

o Small area and active power penalty

Lower Vt can improve active powero providing better Isat / Cg ratio

Lower Vt has better relative performance and less delay variation at Low Vdd.

Applications of Low Vt

High Activity Nets

Exclusively for clock trees.o Less insertion delay, skew, variation, optimal repeater.o Decreased active power at the cost of leakage.

In blocks that are power gated

In power domains with aggressive voltage scaling

In high performance blocks.

Most EDA tools do a poor job of active power

optimization with mixed Vt Libraries

EDA tools DO use Mixed Vt for timing.

Don’t know where to use Low Vt for active power reduction.

o Need usage information to tradeoff leakage and active power

o High activity nets – Requires Vector Information

o Clock Nets – Requires Constraints to restrict cell list

o Blocks that are power gated – Power Intent

o Blocks that are Voltage scaled – Power Intent

Solution

Users decide which Vt is most appropriate.

Users constrain or restrict the tool’s choice of Vt.

Use single Vt for initial runs and multi-Vt for incremental

optimization.


MTCMOS – EDA Limitations

Many opportunities for leakage

recovery

o Timing slack on paths

o Excess margin early in the flow

o Pessimistic view of timing

o Optimization cost functions favor

area over leakage

Point tools in the industry that

were created for this purpose.

Special scripts in Timing Signoff

tool.

o Leakage Recovery within PTSI™

Knowing where to use High Vt for leakage power reduction


Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI” SNUG Boston 2008

Misc. Standard Cell Power

Low Clock Power Flip Flops

Additional Flip Flop Family that targets power reduction

o New Clock Topology

o Enables designers and tools to make a power vs. speed tradeoff

o 21% lower clock power, 6% smaller, but 36% slower than regular

flop.

Originally they had the same footprint

o EDA tools did not insert the Low Clock Power Flops unless they

were also smaller. [Cost functions favor Area & Timing]


CLOCK TREE POWER

Clock Tree Power

Clock Tree Power is still a major contributor to total active

power (30-40%)

Clock Architecture and Frequency Planning

Clock Architecture has a huge impact on power

Clock Tree Architecture

Number of PLLs

Independent PLLs
o Increase flexibility and ease of use.
o Optimizes Power for simple use cases

Shared high frequency PLLs
o Lower system power in concurrency modes
o However, SW may not be able to manage very complex frequency plans, which results in power inefficiencies

Clock domain partitioning

Independent Clock Domain Control (HW/SW)

Frequency and Gating

Clock Synchronicity

Asynchronous domains are more flexible
o Increased latency across clock domain boundaries

Synchronous vs. Iso-Synchronous
o Saves power in low performance modes. Costs power in the worst case

Clock Gating

Still one of the most effective ways to save active power

Maximize fine-grain clock gating
o Optimize settings for integrated clock gating (w/ PowerCompiler™)
– Almost always a win.
– Beware of increased skew, insertion delay, timing and area impact.

Manually inserted Architectural CGCs
o Gate modules based on Modal and Temporal need.
o Can be HW or SW controlled

Analyze the Clock Gating Percentage (CGP) and Clock Gating Efficiency (CGE) to find new opportunities

Example - Clock Power

Example of a Complex DMA engine with multiple bus masters.

Separated Client Interface clocks

Provided Independent clock frequency control

Div 1-4

Iso-sync / Async support on bridges
o Enables Master/Slave clock division

[Figure: clock tree comparison. Previous: one tree of ~1650 buffers driving ~18655 leaf registers. New: the tree is split behind CXC blocks into one branch of ~1650/2 buffers driving ~8962 registers and four branches of ~1650/8 buffers each, driving ~2947, ~3009, ~2197 and ~1540 registers.]

Bus Master Clock Power Comparison (mA)

           90nm   65nm r1   65nm r2   Serra
CI3        0      0         0         0.1979
CI2        0      0         0         0.2232
CI1        0      0         0         0.2307
CI0        0      0         0         0.2507
Baseline   17     8.0578    5.7606    3.377
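The comparison table can be totalled per chip to see the net effect of separating the client interface clocks. This sketch assumes the CI0 to CI3 rows are additive on top of the Baseline row (that is, all four client interfaces are active at once); the slide does not state this explicitly, and the variable names are illustrative.

```python
# Clock current in mA per chip, copied from the comparison table.
CLOCK_POWER_MA = {
    "90nm":    {"Baseline": 17.0,   "CI0": 0.0,    "CI1": 0.0,    "CI2": 0.0,    "CI3": 0.0},
    "65nm r1": {"Baseline": 8.0578, "CI0": 0.0,    "CI1": 0.0,    "CI2": 0.0,    "CI3": 0.0},
    "65nm r2": {"Baseline": 5.7606, "CI0": 0.0,    "CI1": 0.0,    "CI2": 0.0,    "CI3": 0.0},
    "Serra":   {"Baseline": 3.377,  "CI0": 0.2507, "CI1": 0.2307, "CI2": 0.2232, "CI3": 0.1979},
}

# Assumption: the rows stack, so the worst case is the sum of all components.
totals = {chip: sum(parts.values()) for chip, parts in CLOCK_POWER_MA.items()}

for chip, total in totals.items():
    print(f"{chip}: {total:.4f} mA ({totals['90nm'] / total:.2f}x below 90nm)")
```

Under that assumption, Serra with every client interface clock running still draws roughly a quarter of the 90nm current, and idle client interfaces can be gated off individually.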

Low Power CTS

Clock Tree Synthesis (CTS)

Used for the majority of ~250 clock trees on Serra

Small changes to the constraints and cells can dramatically

affect the power of CTS trees.

Custom Clock Cell Design

Clock Buffers, Inverters, Muxes, Gating Cells, Dividers.

o Internal clock routes are minimized to reduce clock power

o Insertion delays are minimized to reduce overall clock power and

skew

Low power CTS

Higher granularity of clock buffers.

Relaxed transition and skew constraints

Modal skew balancing (functional vs. test)

Low Power CTS – EDA Deficiencies

Most CTS tools only consider

Global skew and transition time.

CTS – Example

1) Balance Global Skew

2) 100ps Max Transition

3) Min Area/Power

A. Default Settings

B. Optimized Run

Need tools which

Consider Local vs. Global skew.

Relax clock transition times

Cbuf sizing on paths

Cluster registers that share a clock or clock gate

Insert, clone and de-clone clock gates

Move clock gates up the tree

[Figure: clock tree results for A (Default Settings) and B (Optimized Run).]

MULTIPLE POWER DOMAINS

Multiple Power Domains

Power Domain Partitioning

More power regimes from PMIC

o + Better Power Control and Efficiency

o + Independent Voltage Scaling and Power Collapsing

o - Increased PDN impedance and IR Drop

o - Increased Bill Of Materials (BOM)

o - Requires level shifters and resynchronization at boundaries.

o Need a small on-chip regulator with fast response and good

efficiency

IR Drop and PDN impedance

o - Increased Power Density

o - Increased metal resistance

o - Increased IR due to Power Switches

o - IR impact is greater at lower voltages

o - Dynamic IR effects on skew and timing are not well modeled.


PDN Impedance

Worst Case IR drop may not be at max frequency and power

An arbitrary PDN network observed from the die looking back towards the VRM is shown below. The PMIC has passive modeling.

Resonance points in the RLC network.

[Figure: PDN impedance vs. frequency, 1x10^3 to 1x10^9 Hz (log scale), impedance 0.01 to 1 (log scale), comparing Z_network against Z_spec. Z_spec is built from R_spec1, R_spec3, s*L_spec and 1/(s*C_spec). Annotations: Potential Peaking due to PMIC & Bulk Cap; Board “Mounting Inductance”; Mid-band region controlled by decap inventory.]
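The resonance points mentioned on this slide arise wherever an inductive part of the PDN meets a capacitive part. A minimal sketch follows, assuming a two-branch model (a resistive-inductive path toward the regulator in parallel with a decoupling capacitor and its series resistance). The model and every component value are hypothetical, not Serra's PDN.

```python
import math


def resonance_hz(inductance_h: float, capacitance_f: float) -> float:
    """LC resonance frequency: f0 = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def pdn_impedance_ohm(freq_hz: float, r_l: float, l_h: float, r_c: float, c_f: float) -> float:
    """|Z| of a series R-L branch in parallel with a series R-C branch."""
    w = 2.0 * math.pi * freq_hz
    z_l = complex(r_l, w * l_h)           # path back to the regulator
    z_c = complex(r_c, -1.0 / (w * c_f))  # decap with its series resistance
    return abs(z_l * z_c / (z_l + z_c))


# Hypothetical values: 1 nH mounting inductance, 1 uF decap.
R_L, L, R_C, C = 0.010, 1e-9, 0.005, 1e-6
f0 = resonance_hz(L, C)

for f in (f0 / 10, f0, f0 * 10):
    print(f"{f:12.3e} Hz -> {pdn_impedance_ohm(f, R_L, L, R_C, C) * 1e3:.1f} mOhm")
```

The impedance peaks near f0 and is lower on either side, which is why the worst-case IR drop can occur at a load frequency well below the maximum clock rate.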

VOLTAGE SCALING

Static Voltage Scaling (ARM11)

SW selects a performance level based on application and/or SW performance monitors.

HW selects 1 of 8 pre-programmed Frequency-Voltage pairs

Benefits

Active power reduction
o 32%-36% based on lab measurements at 96 MHz
o Limited by Memory Vcc Min

Easier to implement on Applications processor

Drawbacks

Requires separate voltage regulator (vdd_apc)

Less efficient than AVS
o SW Controlled vs. HW controlled AVS
o SW Performance monitors and algorithms lagging

Requires careful characterization of ARM11ss FMAX vs. Voltage

Timing Closure (Corners, Variation, Margins)

[Figure: static voltage scaling block diagram of the Serra Digital Die. Labels: PMIC7500, SSBI, svs_cntl with FSM (req/ack), level registers lvl 1 to lvl 8, Current PLevel, Next PLevel, stat, addr/data buses, Modem ARM Control (ARM9), Peripheral Bridge and MPU, ARM11SS, supplies vdd_apc and vdd_dig.]

Static Voltage Scaling (MSM Top)

Multiple Blocks in the same power regime

The highest required voltage is determined from multiple LUTs. LUT data obtained from FMAX characterization.

Software programs the voltage upon entry / exit of each mode

Drawbacks

SW Complexity

Fixed frequency blocks don’t scale well

Lots of characterization work

Risk of Test escapes
o Not all blocks included in characterization.

Frequency-voltage LUTs (slide labels: BLK A, BLK B, BLK C), in source order:

Freq. (MHz)   Vdd (V)
128           1.15
96            1.10
64            1.05
48            1.00
32            0.975
25            0.95
20            0.95

Freq. (MHz)   Vdd (V)
256           1.20
196           1.15
128           1.10
96            1.05
64            1.00
32            0.975
20            0.950

Freq. (MHz)   Vdd (V)
256           1.20
196           1.15
128           1.10
96            1.05
64            1.00
32            0.95
20            0.95

Mode1 Requires 1.2v

Mode2 Requires 1.15v

Mode3 Requires .95v
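The rule "the highest required voltage is determined from multiple LUTs" can be sketched as a lookup followed by a maximum. The LUT contents below are the three frequency-voltage tables from the slide, but assigning them to BLK A, BLK B and BLK C in this order is an assumption, as are the function names and the example mode.

```python
from bisect import bisect_left

# (frequency in MHz, required Vdd in V), sorted by ascending frequency.
# Which table belongs to which block is assumed for illustration.
LUTS = {
    "BLK A": [(20, 0.95), (25, 0.95), (32, 0.975), (48, 1.00), (64, 1.05), (96, 1.10), (128, 1.15)],
    "BLK B": [(20, 0.950), (32, 0.975), (64, 1.00), (96, 1.05), (128, 1.10), (196, 1.15), (256, 1.20)],
    "BLK C": [(20, 0.95), (32, 0.95), (64, 1.00), (96, 1.05), (128, 1.10), (196, 1.15), (256, 1.20)],
}


def required_vdd(lut, freq_mhz):
    """Voltage of the lowest characterized entry at or above freq_mhz."""
    freqs = [f for f, _ in lut]
    i = bisect_left(freqs, freq_mhz)
    if i == len(lut):
        raise ValueError(f"{freq_mhz} MHz exceeds the characterized range")
    return lut[i][1]


def regime_vdd(luts, mode):
    """A shared power regime must satisfy its most demanding block."""
    return max(required_vdd(luts[block], f) for block, f in mode.items())


# Hypothetical use mode: requested frequency per block in MHz.
mode = {"BLK A": 96, "BLK B": 128, "BLK C": 64}
print(regime_vdd(LUTS, mode))
```

Rounding a request up to the next characterized frequency keeps the chosen voltage inside the range covered by FMAX characterization, at the cost of some unused margin.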

Process Monitoring DVS


Increased process variation at 45nm increases the benefit of process-related Dynamic Voltage Scaling.

Measure PM speed.

SW obtains the required voltage from LUT.

LUT is created through Characterization.

Benefits

16% power reduction for TTT, 31% for FFF estimated.

More practical than DVFS on top-level

AVS is too complex to implement on top-level

Drawbacks

Impact on ATE and test time, binning and test escapes.

Characterization effort

Timing Closure (Corners, Margins, Variation)

[Figure: Simulation Data: Fmax vs Vdd for FFF, TTT, SSS parts. Fast, Typical and Slow curves plotted against frequency, with the Design Target marking the pass/fail boundary.]

Process Bin   Required Vdd (V)
Fast          0.950
Typ.          1.025
Slow          1.125
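The process-bin table can be read as a lookup that software applies after measuring the process monitor. The sketch below pairs that lookup with a first-order V^2 estimate of dynamic power saving relative to the slow-bin voltage. The V^2 model is a simplification introduced here and ignores leakage, so it is not expected to reproduce the 16% and 31% estimates quoted on the slide.

```python
# Required Vdd per process bin, from the table on this slide.
REQUIRED_VDD = {"Fast": 0.950, "Typ.": 1.025, "Slow": 1.125}


def dynamic_power_saving(process_bin: str, worst_case_vdd: float = 1.125) -> float:
    """First-order saving vs. running every part at the slow-bin voltage.

    Assumes dynamic power scales with Vdd^2 at fixed frequency.
    """
    vdd = REQUIRED_VDD[process_bin]
    return 1.0 - (vdd / worst_case_vdd) ** 2


for name in REQUIRED_VDD:
    print(f"{name}: Vdd={REQUIRED_VDD[name]:.3f} V, "
          f"dynamic power saving ~{dynamic_power_saving(name):.1%}")
```

Under this simplified model the typical and fast bins save roughly 17% and 29% of dynamic power, the same order as the slide's estimates.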

Multi-Mode Multi-Corner

45nm designs require MMMC signoff for Hold robustness

Hold closure at several PVT corners.

Functional Mode and Test Mode.

MMMC Implementation is also necessary to achieve lowest power.

Voltage Scaling:

o Optimize Implementation across several PVT corners to make the correct decisions

o MMMC Reduces iterations and effort for closing Setup & Hold.

Power Optimization

o Power corners are different than timing corners.

– Leakage information is very inaccurate at Temperature/Voltage extremes.

– Optimize power for the nominal use modes.

MMMC Requirements

Constraints for each Mode are needed early.

More library characterization required.

o >15 Corners used on Serra

MMMC Limitations

Increased tool run times (Physical Design cycle time)

o Time to market

No mature solution for MMMC throughout the flow

o Used single corner Synthesis and Placement with some iteration

and over-margining on Serra.

o MMMC was used during CTS, Post-CTS opt and final timing

closure.

o Evaluation of Commercial tools

– Evaluated industry leading tools for MMMC synthesis/placement

after Serra. (Still Maturing)


Variation and Margins

Variation is increasing at smaller geometries and lower voltages

Adding fixed margins is detrimental to power

Increased hold buffer insertion

Larger gate sizes

Swapping to lower Vt when not needed.

Use OCV to apply margin only where necessary.

[Figure: Monte Carlo Simulations of Hold Time at 2 VT Corners]

VDD Minimization During Sleep

[Figure: Estimated Leakage Power Reduction by VDD Minimization. X-axis: Supply Voltage (V), 0 to 1.2. Y-axis: Scaled Leakage Power (%), 0% to 100%.]

Minimize voltage of entire

digital die during sleep

Especially beneficial during

short sleep cycles (BT Scan,

QChat)

Estimated 70% leakage power

reduction @ Battery by

lowering the digital voltage

from 1.1V to 0.7V

Minimal SW overhead and fast

restoration.

o All memory and register state

is retained.

Simulated [PNVt TT, 25c]


VDD Minimization (Serra)

Serra Leakage current during VDD MINIMIZATION

Calculated 64% savings at Battery

80% efficiency, 3.7V Battery.

$I_{batt} = \dfrac{I_{msm} \cdot V_{msm}}{0.8 \cdot V_{batt}}$

Voltage minimization is the most effective way to control digital leakage without complicated SW changes.

[Figure: Serra (Normalized) Off Leakage Measured (3 TT parts, 25c). Bars for Ave @ MSM (measured) and Ave @ Battery (calculated), 0.0x to 1.0x, at 1.078 V and 0.81 V.]

Voltage (V)   Ave @ MSM   Ave @ Battery
1.078         1.00        0.36
0.81          0.47        0.13
Savings       52.66%      64.43%
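The "at Battery" column follows from reflecting the MSM-side current through the regulator: battery current equals MSM current times MSM voltage, divided by regulator efficiency times battery voltage. The sketch below applies that relation with the slide's 80% efficiency and 3.7 V battery to the normalized averages in the table; the function name is illustrative.

```python
def battery_current(i_msm: float, v_msm: float,
                    v_batt: float = 3.7, efficiency: float = 0.8) -> float:
    """Current drawn from the battery for a given MSM-side current and voltage."""
    return i_msm * v_msm / (efficiency * v_batt)


nominal = battery_current(1.00, 1.078)   # normalized leakage at 1.078 V
minimized = battery_current(0.47, 0.81)  # normalized leakage at 0.81 V
savings = 1.0 - minimized / nominal

print(f"nominal   @ battery: {nominal:.2f}")
print(f"minimized @ battery: {minimized:.2f}")
print(f"savings   @ battery: {savings:.1%}")
```

The savings at the battery exceed the savings at the MSM because lowering the voltage reduces both the leakage current and the power drawn per unit of current. The small difference from the slide's 64.43% comes from using the rounded 0.47 average as input.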

Full Chip Power Collapse

Full Chip collapse at PMIC regulator

Requires data to be saved and re-boot of all processors

Power Collapse Break-Even Point

Energy Saved (Leakage * Time) > Energy Overhead (Save/Restore);

For 2.56 Sec Slot cycle (Assuming 5mJ of Overhead) Power Collapse is beneficial if leakage is greater than 1.8mA
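The break-even condition above can be rearranged to give the minimum leakage current at which power collapse pays off: I > E_overhead / (V * t_sleep). The slide does not state the supply voltage behind the 1.8mA figure; the sketch below assumes about 1.1 V on the digital rail, which reproduces it, and the function names are illustrative.

```python
def break_even_leakage_ma(overhead_mj: float, sleep_s: float, vdd: float) -> float:
    """Leakage current (mA) above which power collapse saves energy.

    Energy saved = I * V * t, so I_break_even = E_overhead / (V * t).
    mJ / (V * s) gives mA directly.
    """
    return overhead_mj / (vdd * sleep_s)


def power_collapse_saves_energy(leakage_ma: float, sleep_s: float,
                                overhead_mj: float, vdd: float) -> bool:
    """True when leakage energy over the sleep period exceeds the overhead."""
    return leakage_ma * vdd * sleep_s > overhead_mj


ASSUMED_VDD = 1.1  # volts; an assumption, not stated on the slide

for overhead in (0.5, 1.0, 2.0, 5.0, 10.0):
    i_min = break_even_leakage_ma(overhead, 2.56, ASSUMED_VDD)
    print(f"overhead {overhead:4.1f} mJ -> collapse pays off above {i_min:.2f} mA")
```

Shorter sleep periods or larger save/restore overheads raise the break-even current, which is why power collapse is less attractive for short sleep cycles.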

[Figure: "Which consumes less energy? w/ or w/o PC". Break-even curves of Leakage current (mA, 0 to 5) vs. Sleep Time (Sec, 0 to 5) for power collapse overheads (PC/OH) of 0.5mJ, 1mJ, 2mJ, 5mJ and 10mJ, with the PCH Time Slot (2.56 Sec) marked. A second axis shows Scaled/Estimated Serra Current (mA). Data labels: 0.68, 34, 76; 2,477, 2,508, 2,560.]

POWER GATING

FS/HS Power Gating

Headswitch (HS) and footswitch (FS)

Serially adding a high resistance device between power rails in low leakage mode

Power Switch design is a trade-off between leakage power, IR drop (when FS/HS are on), area, ramp-up noise and time.

Local FS is integrated within logic circuit

Used at Qualcomm since .13u

Minimum impact on design automation flow

Finer control to turn down logic circuits independently

State Maintained

(Leakage savings / Cost) is low.

Global FS/HS are integrated within power grid

Switches shared by all logic in a domain, thus smaller size and lower IR drop is possible.

Can be implemented with finer grid, e.g. global distributed FS/HS (GDFS), or coarse grid, e.g. FS/HS power ring

GDFS is used in 90nm and 65nm designs

Employ a save & restore scheme

[Figure: Local Footswitch placed between each logic circuit and Vss, compared with a Global FS placed between Vss_int and Vss; both circuits are supplied from Vdd.]

Global FS/HS implementations

FS/HS ring

Less PD effort

Shorter sleep control distribution

Larger IR drop compared with GDFS, especially in flipchip case

IR drop increases quicker when the size of the block increases (cubic w.r.t. the length)

Only Suitable for small size macros

Global distributed FS/HS

Can be modeled as an additional resistance between global and local power mesh

Does not break global mesh

Needs sleep control signal distribution throughout.

Suitable for large size macros

[Figure: FS Ring vs. GDFS. Labels: FS Ring, Global PG Mesh, Local PG Mesh, GDFS. GDFS cell schematic with transistors Mf (En_few_in / En_few_out) and Mr (En_rest_in / En_rest_out) between vssfx and Vss.]

Serra GDFS Design

Global Distributed Footswitch (GDFS) was chosen for

leakage reduction vs. area cost

Sleep leakage control

Extra long channel NVT transistors used for footswitch cells

Active leakage control

Turn off macros that are not used in certain operating modes

High Temperature active leakage control


GDFS Design Flow

Use QCOM in-house tools

for inserting isolation cells.

Use existing commercial

tools to insert switches and

connect enables.

Use existing commercial

tools + scripts to insert

isolation cells during

Physical Design.

Pre-place GDFS cells and

do IR drop analysis at early

stage

Further refine the size of

GDFS cells according to

local IR drop


Low Power Verification

Power Aware Verification is Required
o Verify entry to & exit from low power states
o Properly model power collapse
o Verify clamp polarity

Power Aware Simulation tools
o Qualcomm Scripts
o Commercial Tools

Power Structural Checks
o Verify Power domain crossings
– Isolation cells
o Commercial Tools @ 3 design stages
– RTL
– Logical Gate
– Physical Gate

Serra Leakage by State

[Figure: Total Leakage Current by State (TTT, 1.125V, 25C) for Active, Idle, Dormant and Power Off, broken down into gfs, mem and std. Columns are estimated; dots (N2, T2, T1) are measured.]

Measurements on 3 TT Parts

4x reduction in leakage from Active to Power off states. (Excluding Vdd min)

>4x leakage reduction from GDFS.

Power Off Variation

Across 3 Typical Wafers, mean to mean is 20%

Variation across dice from same wafer is ~100%

CONCLUSIONS & FUTURE DIRECTIONS


Conclusions

Power IS a key differentiator and limits the design

Tools and users must Design for Power

Power is a different beast than area/timing

o Power/Energy is highly dependent on use case (Vector)

o Must consider Static and Dynamic Power

o Power varies across PVT.

– Optimization at Nominal

– Constrained by worst case.

Designing for Low Power is all about Tradeoffs.

Many Low Power Techniques exist

o Each needs to be applied with tradeoffs in mind.

o Know the tradeoffs and what your specific goals are.

We still need more techniques to meet the customer

demands. (We are not doing enough)


Conclusions

Design automation is required for Complex SOCs, but Customization in select areas can Dramatically reduce power.

Move Customization into IP

Several deficiencies and limitations exist within EDA tools.

Multi-Mode Multi-Corner Design is a must

EDA tools need to improve and move MMMC up in the flow

Timing and power corners are not the same.

Simultaneous optimization of timing, power and area needed at all stages

High Level Synthesis

RTL Design & Optimization

Logic Synthesis

Clock Tree Synthesis

Place and Route

Timing Closure


Getting the Architecture right is critical to power

High Level System Power Modeling required for making

correct architectural design decisions

Need to have “relatively” accurate power models to make

tradeoffs

Power Vectors

Accurate power estimation and optimization require

accurate activity information

o There is a need for tools/methods which extract activity information

from real SW code running on Emulation and/or system models.

o Benchmarks and verification type vectors are often quite different

from “real” world use cases

Vector-less power estimation and optimization is a very

challenging problem.

o Some have attempted, but no satisfactory solution seen thus far.


References

[1] http://www.nttdocomo.co.jp/english/info/notice/page/090227_00.html

[2] Bruce Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI”, SNUG Boston 2008

[3] Krzysztof A. Kozminski, “Optimization for Leakage Power with PrimeTime”, SNUG San Jose 2004
