adaptive techniques for leakage power management in l2 cache peripheral circuits houman homayoun...

Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits

Houman Homayoun, Alex Veidenbaum, and Jean-Luc Gaudiot
Dept. of Computer Science, UC Irvine

[email protected]

Outline

L2 cache power dissipation
Why cache peripherals?
Study recently proposed static approaches to reduce leakage
Propose two adaptive techniques to reduce leakage
Present power, performance, and energy-delay results

L2 Cache and Power

The L2 cache in high-performance processors is large: 2 to 4 MB is common
It is typically accessed relatively infrequently
Thus it dissipates most of its power via leakage
Much of it was in the SRAM cells
Many architectural techniques have been proposed to remedy this
Today, there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has been optimized

[Figure: Pentium M processor die photo, courtesy of intel.com]

Peripherals?!

Data input/output drivers
Address input/output drivers
Row pre-decoder
Wordline drivers
Row decoder
Others: sense amp, bitline pre-charger, memory cells, decoder logic

[Figure: SRAM organization showing address inputs (addr0 to addr3), address input global drivers, predecoder and global wordline drivers, decoder, global and local wordlines, bitlines, sense amp, and global output drivers]

Why Peripherals?

Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements

Cells use high-Vt transistors, while peripherals use transistors with a typical threshold voltage

[Figure: leakage power in pW (log scale, 1 to 100000) of a memory cell compared with inverters INVX, INV2X, INV3X, INV4X, INV5X, INV6X, INV8X, INV12X, INV16X, INV20X, INV24X, and INV32X; annotated ratios: 200X and 6300X]

Leakage Power Components of L2 Cache

SRAM peripheral circuits dissipate more than 90% of the total leakage power

Component                       Share of leakage power
global address input drivers    11%
global data input drivers       14%
global row predecoder           1%
local row decoders              33%
local data output drivers       8%
global data output drivers      25%
others                          8%

Leakage power as a Fraction of Total L2 Power Dissipation

L2 cache leakage power dominates its dynamic power: leakage is above 87% of the total

[Figure: leakage and dynamic power as fractions of total L2 power (0% to 100%) for each SPEC2K benchmark and the average]

Circuit Techniques Address Leakage in SRAM Cell

Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB

All target the SRAM memory cell

Architectural Techniques

Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, read the tag first
Drowsy Cache: keeps cache lines in a low-power state, with data retention
Cache Decay: evicts lines not used for a while, then powers them down
Applying DVS, Gated Vdd, or Gated Vss to the memory cell: much architectural support has been proposed for this

All target the cache SRAM memory cell

Static Architectural Techniques: SM

SM technique (ICCD'07)
Asserts the sleep signal by default
Wakes up the L2 peripherals on an access to the cache
Keeps the cache in the normal state for J cycles (the turn-on period) before returning it to stand-by mode (SM_J)
No wakeup penalty during this period
A larger J leads to lower performance degradation but also lower energy savings
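The SM_J policy can be illustrated with a small cycle-count model. This is a minimal sketch under stated assumptions, not the authors' simulator: the function name `simulate_sm`, the `wakeup_penalty` parameter, and the access trace are hypothetical, and the sketch assumes that every access restarts the J-cycle turn-on period.

```python
def simulate_sm(access_cycles, total_cycles, j, wakeup_penalty=0):
    """Toy model of SM_J: asleep by default, awake for J cycles after an access.

    access_cycles: sorted cycle numbers at which the L2 is accessed (hypothetical trace).
    wakeup_penalty: stall cycles charged per wakeup (hypothetical value).
    Returns (fraction of cycles asleep, number of wakeups, total stall cycles).
    """
    awake_until = 0   # first cycle at which the peripherals are asleep again
    awake_cycles = 0
    wakeups = 0
    for t in access_cycles:
        if t >= awake_until:
            # The access finds the peripherals asleep: one wakeup is needed.
            wakeups += 1
            start = t
        else:
            # The access falls inside the turn-on period: no penalty,
            # the awake window is only extended (assumption).
            start = awake_until
        end = min(t + j, total_cycles)
        if end > start:
            awake_cycles += end - start
        awake_until = max(awake_until, end)
    asleep_fraction = 1 - awake_cycles / total_cycles
    return asleep_fraction, wakeups, wakeups * wakeup_penalty


if __name__ == "__main__":
    trace = [0, 300, 1000]
    for j in (100, 500):
        print(j, simulate_sm(trace, total_cycles=2000, j=j, wakeup_penalty=3))
```

On this toy trace, raising J from 100 to 500 removes one wakeup but shrinks the time spent asleep, which is the trade-off described above.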

Static Architectural Techniques: IM

IM technique (ICCD'07)
Monitors the issue logic and functional units of the processor after an L2 cache miss
Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
De-asserts the sleep signal M cycles before the miss is serviced
No performance loss
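The IM rule can be sketched as a per-cycle counter. This is an illustrative model only: the function name `im_sleep_cycles`, the `busy` trace, and the value of M are hypothetical (the slides give K=10 but no value for M), and the sketch assumes that any renewed processor activity resets the idle counter.

```python
def im_sleep_cycles(busy, service_cycle, k=10, m=3):
    """Count cycles the L2 peripherals spend asleep during one L2 miss.

    busy[c] is True if the issue logic issued or a functional unit executed
    an instruction in cycle c, where cycle 0 is the cycle of the L2 miss.
    service_cycle is the cycle at which the miss is serviced.
    m is a hypothetical early-wakeup margin.
    """
    idle_run = 0
    asleep = 0
    # The sleep signal is de-asserted M cycles before the miss is serviced,
    # so only cycles before (service_cycle - m) can be spent asleep.
    for cycle in range(max(service_cycle - m, 0)):
        if busy[cycle]:
            idle_run = 0          # activity resets the idle counter (assumption)
        else:
            idle_run += 1
        if idle_run > k:          # K idle cycles already observed: sleep is asserted
            asleep += 1
    return asleep


if __name__ == "__main__":
    # Processor drains for 20 cycles after the miss, then stalls until the
    # miss returns after 300 cycles (the memory latency in the simulated machine).
    trace = [True] * 20 + [False] * 280
    print(im_sleep_cycles(trace, service_cycle=300))
```

Because the peripherals are awake again before the data returns, this scheme hides the wakeup latency, which is why no performance loss is expected.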

Simulated Processor Architecture

SimpleScalar 4.0
SPEC2K benchmarks
Compiled with the -O4 flag using the Compaq compiler targeting the Alpha 21264 processor
Fast-forwarded for 3 billion instructions, then fully simulated for 4 billion instructions using the reference data sets

Parameter            Value
L1 I-cache           128KB, 2 cycles
L1 D-cache           128KB, 2 cycles
L2 cache             2MB, 8 way, 20 cycles
Fetch, dispatch      4 wide
Issue                4 way out of order
Memory               300 cycles
Reorder buffer       96 entry
Instruction queue    32 entry
Register file        128 integer and 125 floating point
Load/store queue     32 entry
Branch predictor     64KB entry g-share
Arithmetic unit      4 integer, 4 floating point units
Complex unit         2 INT, 2 FP multiply/divide units
Pipeline             15 cycles

SM Performance Degradation

[Figure: relative performance (92% to 100%) for SM-100, SM-200, SM-500, SM-750, and SM-1500]

More Insight on SM and IM

FLP: the fraction of program execution time during which the L2 cache is in low-power mode, using either IM or SM

The two techniques benefit different benchmarks

[Figure: FLP (0% to 100%) for IM and SM-750, per SPEC2K benchmark and average]

More Insight on SM and IM (Cont.)

In almost half of the benchmarks, the FLP is negligible and there is no leakage reduction opportunity using IM
The majority of load instructions are satisfied within the cache hierarchy
Memory accesses are extremely infrequent
The average FLP period is 26.9%

[Figure: FLP (0% to 100%) for IM and SM-750, per SPEC2K benchmark and average]

Some Observations

For some benchmarks, the SM and IM techniques are both effective: facerec, gap, perlbmk, and vpr

IM works well in almost half of the benchmarks but is ineffective in the other half

SM works well in about half of the benchmarks, but not the same benchmarks as IM

[Figure: FLP (0% to 100%) for IM and SM-750, per SPEC2K benchmark and average]

An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction

Which Technique Is the Best and When ?

Benchmark   DL1 miss rate   L2 miss rate   L1xL2 miss rates x 10K
ammp        0.05            0.19           96.11
applu       0.06            0.66           368.03
apsi        0.03            0.28           75.01
art         0.41            0.00           0.41
bzip2       0.02            0.04           7.09
crafty      0.00            0.01           0.17
eon         0.00            1.00           0.00
equake      0.02            0.67           124.36
facerec     0.03            0.31           86.11
galgel      0.04            0.01           2.11
gap         0.01            0.55           38.54
gcc         0.05            0.04           16.88
gzip        0.01            0.05           3.28
lucas       0.10            0.67           645.73
mcf         0.24            0.43           1023.88
mesa        0.00            0.27           8.02
mgrid       0.04            0.46           165.13
parser      0.02            0.07           13.76
perlbmk     0.01            0.46           22.88
sixtrack    0.01            0.00           0.14
swim        0.09            0.63           561.41
twolf       0.05            0.00           0.16
vortex      0.00            0.23           6.94
vpr         0.02            0.15           33.95
wupwise     0.02            0.68           122.40
Average     0.05            0.31           136.50

Conditions for the L2 to be idle:
There are few L1 misses
Many L2 misses are waiting for memory

The miss rate product (MRP) may be a good indicator of the cache behavior
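The miss rate product is simply the DL1 miss rate times the L2 miss rate, scaled by 10K as in the table. A minimal sketch follows; the function names and the counter-based helper are illustrative and not taken from the slides. The table entries appear to have been computed from unrounded miss rates, so plugging in the rounded rates reproduces them only approximately.

```python
def miss_rate_product(dl1_miss_rate, l2_miss_rate, scale=10_000):
    """MRP = DL1 miss rate x L2 miss rate x 10K."""
    return dl1_miss_rate * l2_miss_rate * scale


def mrp_from_counters(dl1_accesses, dl1_misses, l2_accesses, l2_misses):
    """Same metric from raw event counts (hypothetical counter names)."""
    if dl1_accesses == 0 or l2_accesses == 0:
        return 0.0  # no traffic observed, so the product is taken as zero
    return miss_rate_product(dl1_misses / dl1_accesses, l2_misses / l2_accesses)


if __name__ == "__main__":
    # Rounded rates for mcf and art from the table.
    print(miss_rate_product(0.24, 0.43))   # close to the 1023.88 listed for mcf
    print(miss_rate_product(0.41, 0.00))   # close to the 0.41 listed for art
```

A high MRP indicates frequent L1 misses that also miss in the L2, so the processor often stalls on memory; a low MRP indicates that the L2 is rarely accessed or rarely misses.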

The Adaptive Techniques

Adaptive Static Mode (ASM)
MRP is measured only once, during an initial learning period (the first 100M committed instructions)
MRP > A: IM (A=90)
MRP ≤ A: SM_J
Initial technique: SM_J

Adaptive Dynamic Mode (ADM)
MRP is measured continuously over a K-cycle period (K is 10M) to choose IM or SM for the next 10M cycles
MRP > A: IM (A=100)
A ≥ MRP > B: SM_N (B=200)
otherwise: SM_P
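The two selection rules can be written as threshold functions. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the turn-on periods N and P are not given in the slides, and the technique used during the first ADM interval is a placeholder. With the values printed above (A=100, B=200) the SM_N band would be empty, so the thresholds are passed as parameters and the example uses illustrative values with B below A.

```python
def asm_select(learning_mrp, a=90.0):
    """ASM: one decision, made from the MRP of the learning period."""
    return "IM" if learning_mrp > a else "SM_J"


def adm_select(interval_mrp, a, b):
    """ADM: decision for the next interval from the MRP of the previous one."""
    if b > a:
        raise ValueError("b must not exceed a, otherwise no MRP can select SM_N")
    if interval_mrp > a:
        return "IM"
    if interval_mrp > b:
        return "SM_N"
    return "SM_P"


def adm_schedule(interval_mrps, a, b, initial="SM_P"):
    """Technique used in each interval; `initial` is a hypothetical default
    for the first interval, which has no previous MRP measurement."""
    if not interval_mrps:
        return []
    schedule = [initial]
    for mrp in interval_mrps[:-1]:
        schedule.append(adm_select(mrp, a, b))
    return schedule


if __name__ == "__main__":
    print(asm_select(1023.88))                      # mcf-like MRP selects IM
    print(asm_select(0.41))                         # art-like MRP selects SM_J
    print(adm_schedule([150, 50, 5], a=100, b=20))  # illustrative thresholds
```

The schedule makes the one-interval lag explicit: the MRP measured in interval i only affects the technique used in interval i+1.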

More Insight on ASM and ADM

ASM attempts to find the more effective static technique per benchmark by profiling a small subset of a program

ADM is more complex and attempts to find the more effective static technique at a finer granularity, every 10M cycles, based on profiling the previous interval

ASM Results

ASM_750 makes a good power-performance trade-off with a 44% FLP and an approximately 2% performance loss

[Figure, two panels, each for J=100, J=200, J=500, J=750, and J=1500: FLP Period (axis 20% to 80%) and Performance Loss (axis 92% to 100%)]

Compare ASM with IM and SM

[Figure: fraction of IM and SM contribution for ASM_750 (ASM-IM and ASM-SM, 0% to 100%), per SPEC2K benchmark and average]

For most benchmarks, ASM correctly selects the more effective static technique

Exception: equake

A small subset of the program can be used to identify L2 cache behavior: whether the cache is accessed very infrequently or is idle because the processor is idle

ASM and SM Performance

[Figure: performance of ASM_750 and SM_750 (axis 82% to 100%), per SPEC2K benchmark and average]

No performance loss for ammp, applu, lucas, mcf, mgrid, swim, and wupwise

2X more leakage power reduction and less performance loss compared to static approaches

[Figure: per-benchmark bar chart (axis 84% to 100%) for the SPEC2K benchmarks and the average]

ADM Results

[Figure: IM and SM contribution under ADM (ADM_IM and ADM_SM, 0% to 100%), per SPEC2K benchmark and average]

For many benchmarks, both IM and SM make a noticeable contribution: ADM is effective in combining IM and SM

For some benchmarks, either the IM or the SM contribution is negligible: ADM selects the best static technique

Power Measurement Approach

CACTI-5
Peripheral circuits account for 90% of all the leakage power
The power reduction is 88%

Total dynamic power: N*Eaccess/Texec
N is the total number of accesses (obtained from simulation)
Eaccess is the single-access energy from CACTI-5
Texec is the program execution time

Leakage energy is dissipated on every cycle
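The power accounting can be sketched in a few lines. The dynamic power formula is the one stated above; the leakage savings estimate is an assumption of this sketch, which reads the 88% figure as the reduction in peripheral leakage while the sleep signal is asserted and scales it by the peripheral share (90%) and by the FLP. The function names and the sample numbers for N, Eaccess, and Texec are hypothetical.

```python
def dynamic_power(n_accesses, e_access_joules, t_exec_seconds):
    """Total dynamic power = N * Eaccess / Texec (in watts)."""
    return n_accesses * e_access_joules / t_exec_seconds


def leakage_savings(flp, peripheral_share=0.90, sleep_reduction=0.88):
    """Estimated fraction of total L2 leakage power saved (assumed model).

    flp: fraction of execution time spent in low-power mode.
    peripheral_share: fraction of leakage due to peripheral circuits.
    sleep_reduction: assumed reduction of peripheral leakage while asleep.
    """
    return flp * peripheral_share * sleep_reduction


if __name__ == "__main__":
    # Hypothetical inputs: 1e6 accesses, 2 nJ per access, 1 s of execution.
    print(dynamic_power(1_000_000, 2e-9, 1.0))
    # FLP of 44%, the value reported for ASM_750.
    print(leakage_savings(0.44))
```

Under this simple model, an FLP of 44% corresponds to a leakage saving of roughly 35%, close to the 34% reported for ASM.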

Power Results

[Figure: (a) leakage power savings (0% to 90%) and (b) total energy-delay reduction (-20% to 80%) for ASM and ADM, per SPEC2K benchmark and average]

Leakage reduction using ASM and ADM is 34% and 52%, respectively

The overall energy-delay reduction is 29.4% and 45.5%, respectively, using ASM and ADM

2 to 3X more leakage power reduction and less performance loss compared to static approaches

Conclusion

Study the breakdown of leakage in L2 cache components and show that the peripheral circuits leak considerably
Study the recently proposed IM and SM approaches
Propose a metric (cache miss rate product) to identify which benchmarks work well with each static approach
Propose two adaptive techniques to select the best static approach dynamically
Present power, performance, and energy-delay results: 2 to 3X improvement over recently proposed static techniques
