Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits
Houman Homayoun, Alex Veidenbaum, and Jean-Luc Gaudiot
Dept. of Computer Science, UC Irvine
Outline
L2 Cache Power Dissipation
Why Cache Peripherals?
Study recently proposed static approaches to reduce leakage
Propose two adaptive techniques to reduce leakage
Present power, performance and energy-delay results
L2 Cache and Power
The L2 cache in high-performance processors is large: 2 to 4 MB is common
It is typically accessed relatively infrequently
Thus it dissipates most of its power via leakage
Much of that leakage was in the SRAM cells
Many architectural techniques have been proposed to remedy this
Today, there is also significant leakage in the peripheral circuits of an SRAM (cache)
In part because cell design has been optimized
Pentium M processor die photo (courtesy of intel.com)
Peripherals?!
Data Input/Output Driver
Address Input/Output Driver
Row Pre-decoder
Wordline Driver
Row Decoder
Others: sense-amp, bitline pre-charger, memory cells, decoder logic
[Figure: SRAM block diagram showing the address inputs (addr0 to addr3), address input global drivers, predecoder and global wordline drivers, decoder, global and local wordlines, bitlines, sense amp, and global output drivers]
Why Peripherals?
Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements
Cells use high-Vt transistors, while peripherals use typical-threshold-voltage transistors
[Figure: leakage power in pW (log scale, 1 to 100000) of a memory cell compared with inverters of increasing size, INVX through INV32X; chart annotations: 200X and 6300X]
Leakage Power Components of L2 Cache
SRAM peripheral circuits dissipate more than 90% of the total leakage power
global address input drivers: 11%
global data input drivers: 14%
global row predecoder: 1%
local row decoders: 33%
local data output drivers: 8%
global data output drivers: 25%
others: 8%
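As a quick arithmetic check of the "more than 90%" statement, the shares listed on this slide can be summed. This is only a sketch: it assumes that every category except "others" counts as a peripheral circuit, and the dictionary name is illustrative.

```python
# Leakage shares (percent of total L2 leakage) as listed on the slide.
leakage_share = {
    "global address input drivers": 11,
    "global data input drivers": 14,
    "global row predecoder": 1,
    "local row decoders": 33,
    "local data output drivers": 8,
    "global data output drivers": 25,
    "others": 8,
}

# Assumption: everything except "others" is a peripheral circuit.
peripheral_total = sum(v for k, v in leakage_share.items() if k != "others")
total = sum(leakage_share.values())

print(peripheral_total, total)  # prints "92 100"
```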
Leakage power as a Fraction of Total L2 Power Dissipation
L2 cache leakage power dominates its dynamic power: leakage is above 87% of the total
[Figure: leakage and dynamic power as fractions of total L2 power (0% to 100%) for each SPEC2K benchmark, ammp through wupwise, and the average]
Circuit Techniques Address Leakage in the SRAM Cell
Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB
These target the SRAM memory cell

Architectural Techniques
Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, read the tag first
Drowsy Cache: keeps cache lines in a low-power state, with data retention
Cache Decay: evicts lines not used for a while, then powers them down
Applying DVS, Gated-Vdd, Gated-Vss to the memory cell: many forms of architectural support do that
All target the cache SRAM memory cell
Static Architectural Techniques: SM
SM technique (ICCD’07)
Asserts the sleep signal by default
Wakes up the L2 peripherals on an access to the cache
Keeps the cache in the normal state for J cycles (turn-on period) before returning it to stand-by mode (SM_J)
No wakeup penalty during this period
Larger J leads to lower performance degradation but also lower energy savings
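The effect of the turn-on period J can be illustrated with a small model. The sketch below is not the authors' simulator: the function name and the access trace are hypothetical, and it ignores the wakeup latency itself. It only counts how many cycles the peripherals stay awake when each access opens a J-cycle window.

```python
def sm_low_power_fraction(access_cycles, total_cycles, j):
    """Fraction of cycles the L2 peripherals spend asleep under SM_J.

    access_cycles: sorted cycle numbers at which the L2 is accessed.
    total_cycles:  length of the simulated window, in cycles.
    j:             turn-on period (cycles the cache stays awake after an access).
    """
    awake = 0
    awake_until = 0  # first cycle at which the cache is asleep again
    for t in access_cycles:
        start = max(t, awake_until)     # do not double-count overlapping windows
        end = min(t + j, total_cycles)  # clip the window to the simulated range
        if end > start:
            awake += end - start
        awake_until = max(awake_until, end)
    return 1.0 - awake / total_cycles


# Hypothetical trace: three accesses in a 2000-cycle window.
trace = [0, 50, 1000]
print(sm_low_power_fraction(trace, 2000, j=100))
print(sm_low_power_fraction(trace, 2000, j=500))
```

With this trace the larger J keeps the cache awake longer, so the low-power fraction drops, which mirrors the trade-off stated on the slide.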
Static Architectural Techniques: IM
IM technique (ICCD’07)
Monitors the issue logic and functional units of the processor after an L2 cache miss
Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
De-asserts the sleep signal M cycles before the miss is serviced
No performance loss
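The IM rule can be sketched as a per-cycle trace for a single outstanding L2 miss. This is an illustration under assumptions, not the authors' hardware: the function name, the list-based inputs, and the latching behaviour (sleep stays asserted once triggered) are hypothetical, and M is left as a required argument because the slide does not give its value.

```python
def im_sleep_trace(issued, executed, miss_start, miss_done, m, k=10):
    """Per-cycle sleep signal for one outstanding L2 miss under IM.

    issued[c], executed[c]: instructions issued / executed in cycle c.
    miss_start, miss_done:  cycles at which the miss is detected and serviced.
    m: number of cycles before miss_done at which sleep is de-asserted.
    k: consecutive idle cycles required before asserting sleep (K=10 on the slide).
    """
    n = len(issued)
    sleep = [False] * n
    idle_run = 0
    asleep = False
    wake_at = miss_done - m  # sleep is released from this cycle onward
    for c in range(miss_start, min(wake_at, n)):
        if issued[c] == 0 and executed[c] == 0:
            idle_run += 1
        else:
            idle_run = 0
        if idle_run >= k:
            asleep = True  # assumption: the signal latches until wake_at
        sleep[c] = asleep
    return sleep


# Hypothetical pipeline activity: busy for 5 cycles, then stalled on the miss.
activity = [1] * 5 + [0] * 30
trace = im_sleep_trace(activity, activity, miss_start=5, miss_done=35, m=3)
print(sum(trace))  # number of cycles spent asleep
```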
Simulated Processor Architecture
SimpleScalar 4.0
SPEC2K benchmarks
Compiled with the -O4 flag using the Compaq compiler targeting the Alpha 21264 processor
Fast-forwarded for 3 billion instructions, then fully simulated for 4 billion instructions using the reference data sets

Parameter            Value
L1 I-cache           128KB, 2 cycles
L1 D-cache           128KB, 2 cycles
L2 cache             2MB, 8 way, 20 cycles
Fetch, dispatch      4 wide
Issue                4 way out of order
Memory               300 cycles
Reorder buffer       96 entry
Instruction queue    32 entry
Register file        128 integer and 125 floating point
Load/store queue     32 entry
Branch predictor     64KB entry g-share
Arithmetic unit      4 integer, 4 floating point units
Complex unit         2 INT, 2 FP multiply/divide units
Pipeline             15 cycles
SM Performance Degradation
[Figure: relative performance (92% to 100%) for SM-100, SM-200, SM-500, SM-750 and SM-1500]
More Insight on SM and IM
Fraction of program execution time during which the L2 cache is in low-power mode (FLP), using either IM or SM
The two techniques benefit different benchmarks
[Figure: FLP (0% to 100%) for each SPEC2K benchmark and the average; series: IM, SM-750]
More Insight on SM and IM (Cont.)
In almost half of the benchmarks the FLP is negligible and there is no leakage reduction opportunity using IM
The majority of load instructions are satisfied within the cache hierarchy
Memory accesses are extremely infrequent
The average FLP is 26.9%
[Figure: FLP (0% to 100%) for each SPEC2K benchmark and the average; series: IM, SM-750]
Some Observations
For some benchmarks, the SM and IM techniques are both effective: facerec, gap, perlbmk and vpr
IM works well in almost half of the benchmarks but is ineffective in the other half
SM works well in about half of the benchmarks, but not the same ones as IM
[Figure: FLP (0% to 100%) for each SPEC2K benchmark and the average; series: IM, SM-750]
An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction
Which Technique Is the Best and When ?
Benchmark   DL1 miss rate   L2 miss rate   L1xL2 miss rates x 10K
ammp        0.05            0.19           96.11
applu       0.06            0.66           368.03
apsi        0.03            0.28           75.01
art         0.41            0.00           0.41
bzip2       0.02            0.04           7.09
crafty      0.00            0.01           0.17
eon         0.00            1.00           0.00
equake      0.02            0.67           124.36
facerec     0.03            0.31           86.11
galgel      0.04            0.01           2.11
gap         0.01            0.55           38.54
gcc         0.05            0.04           16.88
gzip        0.01            0.05           3.28
lucas       0.10            0.67           645.73
mcf         0.24            0.43           1023.88
mesa        0.00            0.27           8.02
mgrid       0.04            0.46           165.13
parser      0.02            0.07           13.76
perlbmk     0.01            0.46           22.88
sixtrack    0.01            0.00           0.14
swim        0.09            0.63           561.41
twolf       0.05            0.00           0.16
vortex      0.00            0.23           6.94
vpr         0.02            0.15           33.95
wupwise     0.02            0.68           122.40
Average     0.05            0.31           136.50
Conditions for the L2 to be idle:
There are few L1 misses
Many L2 misses are waiting for memory
The miss rate product (MRP) may be a good indicator of the cache behavior
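The miss rate product follows directly from the table's column header (DL1 miss rate times L2 miss rate, scaled by 10K). A minimal sketch, with a hypothetical function name:

```python
def miss_rate_product(dl1_miss_rate, l2_miss_rate, scale=10_000):
    """MRP = DL1 miss rate x L2 miss rate, scaled by 10K as in the table."""
    return dl1_miss_rate * l2_miss_rate * scale


# Rounded rates taken from the table. The results differ slightly from the
# table's MRP column, which was presumably computed from unrounded rates.
print(round(miss_rate_product(0.24, 0.43), 2))  # mcf
print(round(miss_rate_product(0.02, 0.04), 2))  # bzip2
```

For mcf this gives a value close to the 1023.88 listed in the table, and for bzip2 a value close to 7.09.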
The Adaptive Techniques
Adaptive Static Mode (ASM)
MRP is measured only once, during an initial learning period (the first 100M committed instructions)
MRP > A: use IM (A=90)
MRP ≤ A: use SM_J
Initial technique: SM_J

Adaptive Dynamic Mode (ADM)
MRP is measured continuously over a K cycle period (K is 10M) to choose IM or SM for the next 10M cycles
MRP > A: use IM (A=100)
A ≥ MRP > B: use SM_N (B=200)
otherwise: use SM_P
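The two selection rules can be written as small decision functions. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical. For ADM the thresholds are passed in as arguments, because the values printed on the slide (A=100, B=200) would leave the middle band A ≥ MRP > B empty; the example call uses an illustrative B that is not from the source.

```python
def asm_select(mrp, a=90):
    """ASM: one-time choice after the learning period (A=90 on the slide)."""
    return "IM" if mrp > a else "SM_J"


def adm_select(mrp, a, b):
    """ADM: choice for the next interval, based on the previous interval's MRP.

    Assumes a > b so that the middle band b < mrp <= a is non-empty.
    """
    if a <= b:
        raise ValueError("this sketch expects a > b")
    if mrp > a:
        return "IM"
    if mrp > b:
        return "SM_N"
    return "SM_P"


# MRP values from the miss-rate table: mcf (1023.88) and bzip2 (7.09).
print(asm_select(1023.88), asm_select(7.09))
# Illustrative thresholds only; b=20 is a hypothetical value.
print(adm_select(50, a=100, b=20))
```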
More Insight on ASM and ADM
ASM attempts to find the more effective static technique per benchmark by profiling a small subset of the program
ADM is more complex: it attempts to find the more effective static technique at a finer granularity, every 10M-cycle interval, based on profiling the previous interval
ASM Results
ASM_750 makes a good power-performance trade-off with a 44% FLP and an approximately 2% performance loss
[Figure: two panels for J=100, J=200, J=500, J=750 and J=1500: FLP period (20% to 80%) and performance loss (92% to 100%)]
Compare ASM with IM and SM
[Figure: fraction of IM and SM contribution for ASM_750 (0% to 100%) for each SPEC2K benchmark and the average; series: ASM-IM, ASM-SM]
For most benchmarks, ASM correctly selects the more effective static technique
Exception: equake
A small subset of the program can be used to identify L2 cache behavior: whether it is accessed very infrequently, or it is idle because the processor is idle
ASM and SM Performance
[Figure: relative performance (82% to 100%) for each SPEC2K benchmark and the average; series: ASM_750, SM_750]
No performance loss: ammp, applu, lucas, mcf, mgrid, swim and wupwise
2X more leakage power reduction and less performance loss compared to static approaches
[Figure: per-benchmark chart with a vertical axis from 84% to 100%, covering each SPEC2K benchmark and the average; no series labels survive]
ADM Results
[Figure: IM and SM contribution under ADM (0% to 100%) for each SPEC2K benchmark and the average; series: ADM_IM, ADM_SM]
For many benchmarks, both IM and SM make a noticeable contribution: ADM is effective in combining IM and SM
For some benchmarks, either the IM or the SM contribution is negligible: ADM selects the best static technique
Power Measurement Approach
CACTI-5
Peripheral circuits account for 90% of all the leakage power
The power reduction is 88%
Total dynamic power: N*Eaccess/Texec
N is the total number of accesses (obtained from simulation)
Eaccess is the single access energy from CACTI-5
Texec is the program execution time
Leakage energy is dissipated on every cycle
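The dynamic power formula above translates directly into code. The second function is an assumption rather than something stated on the slide: it estimates the total L2 leakage saving as FLP times the peripheral share of leakage (90%) times the power reduction (88%), reading the 88% as the peripheral leakage reduction while asleep. All names are hypothetical.

```python
def dynamic_power(n_accesses, e_access, t_exec):
    """Total dynamic power = N * Eaccess / Texec (joules / seconds = watts)."""
    return n_accesses * e_access / t_exec


def leakage_saving(flp, peripheral_share=0.90, sleep_reduction=0.88):
    """Estimated fraction of total L2 leakage saved.

    Assumption: saving = FLP x peripheral share x reduction while asleep.
    """
    return flp * peripheral_share * sleep_reduction


# Hypothetical numbers: 1M accesses of 1 nJ each over 0.5 s of execution.
print(dynamic_power(1_000_000, 1e-9, 0.5))
# FLP of 44%, the value reported for ASM_750.
print(round(leakage_saving(0.44), 3))
```

Under this assumed model, the 44% FLP reported for ASM_750 gives roughly 0.35, in the neighbourhood of the 34% leakage reduction reported for ASM.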
Power Results
[Figure: two panels for each SPEC2K benchmark and the average; series: ASM, ADM. (a) leakage power savings (0% to 90%); (b) total energy-delay reduction (-20% to 80%)]
Leakage reduction using ASM and ADM is 34% and 52%, respectively
The overall energy-delay reduction is 29.4% and 45.5%, respectively, using ASM and ADM
2 to 3X more leakage power reduction and less performance loss compared to static approaches
Conclusion
Study the breakdown of leakage in L2 cache components; show that peripheral circuits leak considerably
Study the recently proposed IM and SM approaches
Propose a metric (cache miss rate product) to differentiate the benchmarks that work well with each static approach
Propose two adaptive techniques to select the best static approach dynamically
Present power, performance and energy-delay results: 2 to 3X improvement over recently proposed static techniques