Post on 11-Dec-2015
Advanced Microarchitecture
Lecture 15: Power
2
Basic Power Review
• Power = Voltage × Current
  – Voltage is usually a constant
    • (we’ll talk about voltage scaling later)
  – Current varies
    • Depends on the block (cache vs. ALU vs. decoder …)
    • Depends on the application (int vs. FP vs. multimedia)
    • Depends on the program phase
• Another form:
  – $i = C\,dv/dt \Rightarrow \int v\,i\,dt = \int C\,v\,dv \Rightarrow E = \tfrac{1}{2} C V^2$ (energy per charge or discharge)
  – Power = energy of each capacitor × avg. times (dis)charged / time to (dis)charge
  – $P = \sum_{b \in \text{all blocks}} \tfrac{1}{2} C_b V^2 a_b / t_c = \tfrac{1}{2} V^2 f \sum_b C_b a_b = \tfrac{1}{2} a C V^2 f$
  – $C$ = total capacitance, $a$ = average activity factor
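The per-block sum and the lumped form of the dynamic power equation can be checked numerically. The sketch below is illustrative only: the block capacitances, activity factors, voltage, and frequency are made-up values, and it assumes the "average activity factor" is the capacitance-weighted average, which is what makes the two forms agree.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Lumped dynamic power in watts: P = 1/2 * a * C * V^2 * f."""
    return 0.5 * activity * capacitance_f * vdd ** 2 * freq_hz


def dynamic_power_per_block(blocks, vdd, freq_hz):
    """Per-block form: sum of 1/2 * C_b * V^2 * a_b * f.

    `blocks` is a list of (capacitance_farads, activity_factor) pairs.
    """
    return sum(0.5 * c_b * vdd ** 2 * a_b * freq_hz for c_b, a_b in blocks)


def weighted_activity(blocks):
    """Capacitance-weighted average activity, so that a * C = sum(C_b * a_b)."""
    total_c = sum(c_b for c_b, _ in blocks)
    return sum(c_b * a_b for c_b, a_b in blocks) / total_c


# Hypothetical blocks: (capacitance in farads, activity factor).
example_blocks = [(2e-9, 0.10), (1e-9, 0.30)]
total_capacitance = sum(c_b for c_b, _ in example_blocks)

p_blocks = dynamic_power_per_block(example_blocks, vdd=1.0, freq_hz=2e9)
p_lumped = dynamic_power(weighted_activity(example_blocks), total_capacitance,
                         vdd=1.0, freq_hz=2e9)
print(p_blocks, p_lumped)  # both forms describe the same power
```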
3
Static Power
• We talked about this in Lecture 1
  – Two types of static power
    • Leakage through the channel (sub-threshold conductance)
    • Leakage through the gate/oxide (tunneling)
• $P_{static} = P_{sub} + P_{oxide}$
• $P_{total} = P_{dynamic} + P_{static} = \tfrac{1}{2} a C V^2 f + K_1 W e^{-V_T / (n V_\theta)} (1 - e^{-V / V_\theta}) + K_2 W (V / T_{ox})^2 e^{-\alpha T_{ox} / V}$
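A small sketch of the two static-power terms may help show which knobs matter. It assumes the slide's garbled symbols stand for a thermal voltage (called `v_theta` here) and a fitting constant in the oxide exponent (called `alpha` here, distinct from the activity factor); all numeric values are placeholders, not process data.

```python
import math


def subthreshold_power(k1, width, v_t, n, v_theta, vdd):
    """Channel-leakage term: K1 * W * exp(-V_T / (n * V_theta)) * (1 - exp(-V / V_theta))."""
    return (k1 * width * math.exp(-v_t / (n * v_theta))
            * (1.0 - math.exp(-vdd / v_theta)))


def oxide_power(k2, width, vdd, t_ox, alpha):
    """Gate-oxide term: K2 * W * (V / T_ox)^2 * exp(-alpha * T_ox / V)."""
    return k2 * width * (vdd / t_ox) ** 2 * math.exp(-alpha * t_ox / vdd)


def static_power(k1, k2, width, v_t, n, v_theta, vdd, t_ox, alpha):
    """P_static = P_sub + P_oxide."""
    return (subthreshold_power(k1, width, v_t, n, v_theta, vdd)
            + oxide_power(k2, width, vdd, t_ox, alpha))


# Placeholder parameters: raising V_T from 0.3 V to 0.4 V cuts channel leakage.
low_vt = subthreshold_power(1.0, 1.0, v_t=0.3, n=1.5, v_theta=0.026, vdd=1.0)
high_vt = subthreshold_power(1.0, 1.0, v_t=0.4, n=1.5, v_theta=0.026, vdd=1.0)
print(low_vt / high_vt)  # ratio > 1: the higher threshold leaks less
```

The exponential dependence on the threshold voltage is why the later slides focus on raising VT (stacking, body bias, high-VT devices) to reduce leakage.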
4
Trading Power for Performance
• $P = \tfrac{1}{2} a C V^2 f$ and $f \propto V \Rightarrow P \propto V^3$
• To a first order, $\text{Perf} \propto f \Rightarrow \text{Perf} \propto V$
[Figure: power vs. voltage curve, $P \propto V^3$]
• For a linear decrease in voltage (and performance) … we get a cubic decrease in (dynamic) power consumption
• Rule of thumb: for small $\Delta V / \Delta f$, 1% performance for every 3% power
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf
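The rule of thumb follows directly from the cubic relationship. The sketch below assumes voltage and frequency are scaled by the same factor and that performance tracks frequency; the function names are illustrative.

```python
def relative_dynamic_power(scale):
    """Dynamic power relative to nominal when V and f are both scaled by `scale`."""
    return scale ** 3


def power_saving_percent(perf_drop_percent):
    """Percent of dynamic power saved for a given percent drop in V, f, and performance."""
    scale = 1.0 - perf_drop_percent / 100.0
    return (1.0 - relative_dynamic_power(scale)) * 100.0


for drop in (1, 5, 10):
    print(f"{drop}% slower -> {power_saving_percent(drop):.2f}% less dynamic power")
```

For a 1% slowdown this gives roughly 3% less power; for larger steps the saving falls below three times the slowdown, which is why the rule is stated for small changes only.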
5
Limits of Trading
• Vdd – VT > VNoise Margin
• Vdd cannot be scaled below VT + VNoise Margin
  – Noise can cause a transistor to accidentally switch!
[Figure: power vs. voltage/frequency, with the $P \propto V^3$ region ending at the minimum supply voltage; inset shows Vdd, VT, and Gnd levels with noise]
• Voltage scaling can take the supply voltage down only so far
• Below this, we can only use frequency scaling (decrease f, but keep V constant), which provides only linear power reduction ($\tfrac{1}{2} a C V^2 f$)
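The two regimes can be combined into one piecewise model. This is a sketch under simple assumptions: performance, frequency, and voltage are expressed as fractions of nominal, frequency is proportional to voltage above the floor, and the floor value `v_floor=0.7` is a hypothetical stand-in for VT + VNoise Margin.

```python
def relative_power(perf, v_floor=0.7):
    """Relative dynamic power at a given relative performance level.

    Above the voltage floor, V and f scale together (cubic in perf).
    Below it, V stays at the floor and only f scales (linear in perf).
    """
    if not 0.0 < perf <= 1.0:
        raise ValueError("perf must be in (0, 1]")
    if perf >= v_floor:
        return perf ** 3
    return v_floor ** 2 * perf


for perf in (1.0, 0.85, 0.7, 0.5, 0.35):
    print(f"perf={perf:.2f} -> power={relative_power(perf):.4f}")
```

The two branches meet at `perf == v_floor`, so the curve is continuous but its slope drops from cubic to linear below the floor.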
6
DVFS
• Dynamic Voltage/Frequency Scaling
• Someone tracks performance demands, idleness, etc.
  – “Someone” is typically the OS with hardware support
  – … but you could have a hardware-only approach
    • Under thermal emergencies, the HW takes over regardless of what voltage/frequency the OS asks for
• Goal: consume minimum power necessary while still meeting performance demands
• Can also do just DVS or DFS
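One way to picture the DVFS goal is as a table lookup: choose the lowest operating point that still meets demand, unless the hardware overrides the request. The table of frequency/voltage pairs and the `thermal_emergency` flag below are hypothetical, and real governors are considerably more involved.

```python
# Hypothetical operating points: (frequency in GHz, supply voltage), lowest first.
OPERATING_POINTS = [
    (0.8, 0.80),
    (1.2, 0.90),
    (1.6, 1.00),
    (2.0, 1.10),
]


def pick_operating_point(demand_ghz, thermal_emergency=False):
    """Return the lowest-power point whose frequency meets the demand.

    During a thermal emergency the hardware ignores the request and
    forces the lowest point, mirroring the slide's HW override.
    """
    if thermal_emergency:
        return OPERATING_POINTS[0]
    for point in OPERATING_POINTS:
        if point[0] >= demand_ghz:
            return point
    # Demand exceeds what the chip offers: run at the fastest point.
    return OPERATING_POINTS[-1]


print(pick_operating_point(1.0))                          # (1.2, 0.9)
print(pick_operating_point(1.9, thermal_emergency=True))  # (0.8, 0.8)
```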
7
Clock Gating
• CMOS logic is also called “static” logic:
  – If the inputs don’t change, neither do the outputs (or any other intermediate nodes)
• Therefore, to reduce dynamic power in CMOS circuits, don’t let the inputs change if you don’t need to!
[Figure: a latch feeding a CMOS block, shown without and with clock gating; annotations “Power dissipated” and “Clock gate this block?”]
• Latch doesn’t grab new value, so its output (block’s input) doesn’t change
8
Example: ALU
[Figure: an ALU with parallel units (+, logic, shift, comp, ×) selected by opcode to produce one result, shown without and with clock-gating logic driven by the opcode]
• Without gating: all units consume power, but only one output is useful
• With gating: based on opcode, the logic clock-gates all but the one required unit
• Note: this logic consumes its own power
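The ALU example amounts to decoding the opcode into per-unit clock enables. The sketch below models that decode in software; the unit names follow the figure, while the opcode names and their mapping are hypothetical.

```python
UNITS = ("add", "logic", "shift", "comp", "mul")

# Hypothetical opcode-to-unit mapping for illustration.
OPCODE_TO_UNIT = {
    "ADD": "add",
    "SUB": "add",
    "AND": "logic",
    "OR": "logic",
    "SLL": "shift",
    "CMP": "comp",
    "MUL": "mul",
}


def clock_enables(opcode):
    """Return {unit: enabled} with only the unit the opcode needs left running.

    Unknown opcodes (e.g. a no-op) gate every unit.
    """
    needed = OPCODE_TO_UNIT.get(opcode)
    return {unit: unit == needed for unit in UNITS}


print(clock_enables("MUL"))
```

In hardware this decode must be finished a cycle ahead of the clock edge it gates, and the decoder itself burns some power.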
9
Logic Timing
• To properly clock-gate, you must know you’re going to gate the cycle before (otherwise it’ll be too late as the clock edge will have already arrived)
[Figure: Payload RAM (Opcode, ValueE, ValueL) feeding clock-gating logic and the +, logic, and comp units]
10
Logic Timing
• Not all blocks can be easily gated
  – may be difficult to know whether gating should be applied ahead of time
    • likely true for critical path circuits: e.g., gating select logic is probably difficult since bidders are not known until the last moment
  – computation of gating condition may be complex
    • value-based (is input zero?)
    • multi-value based (are all inputs zero?)
    • multi-condition based (are all RS entries not bidding?)
11
Clock Gating Dynamic Logic
• CMOS logic toggles only when input changes
• Dynamic logic may consume power regardless
[Figure: CMOS NOR gate vs. N-Domino NOR gate with output X; pictures from http://6004.csail.mit.edu/6.371/handouts/L11.pdf]
• If A (or B) equals 1 and does not change, then the sequence is: precharge X to 1, evaluate discharges X to 0, precharge X to 1, evaluate …
• Gating inputs is not enough; need to ensure CLK is disabled.
12
Clock Gating is for Dynamic Power
• Even if gates are not toggling, they continue to leak
[Figure: a CMOS inverter between Vdd and Gnd, shown for input 1 and for input 0; in each case one transistor is on and one is off, with subthreshold leakage and gate leakage paths marked]
13
Reducing Leakage: Stacking
[Figure: a single off transistor (resistance R) vs. two stacked off transistors, showing channel leakage; in the stack the intermediate node sits at V/2, with VB = 0 and VS = V/2, and the path has higher resistance]
• Intermediate node has V > 0
• Higher VSB increases VT
• Higher threshold voltage decreases leakage current
• Higher resistance increases gate latency
14
Body Bias Effect
[Figure: two transistors with body voltage VB and source voltage VS; the one with larger VSB has less channel leakage]
WARNING: This is a GROSSLY simplified explanation!!! If you’re interested in low-power circuits and microarchitecture, you should go read up on some real semiconductor/electronics literature.
15
Dual VT Devices
• Manufacture two types of transistors:
  – Low VT gates: fast, high leakage
  – High VT gates: slow, low leakage (typically 10x less)
  – Designer chooses what kind to use
• Pro:
  – less area than stacking (one high-VT gate = one low-VT gate in area, stacking requires multiple gates)
• Con:
  – Manufacturing process needs to provide two device types
16
Use Only Where Appropriate
• Stacking and higher VT both slow down the gates
• Analyze circuits and…
  – apply one or both techniques to gates not on the critical path
  – apply to longest path if timing permits (i.e., this circuit is not a frequency limiter)
[Figure: circuit annotated with “Critical path gates” and “Stack or use high-VT gates here”]
17
Standby Input Vectors
• The amount of leakage depends on the clock-gated inputs to the gate
[Figure: a two-input gate under each of its four input combinations, with each transistor marked On or Off. The four cases are captioned, in order: 2 off transistors in parallel; 1 off transistor in leakage path; 1 off transistor in leakage path; 2 off transistors in leakage path]
18
Standby Input Vectors
• When clock-gating a block
  – disable latch clock (as usual)
  – load leakage-minimizing input vector (stored elsewhere)
[Figure: clock-gated latches loaded with a standby input vector of all 1s]
• How to determine best input vector for n-input gate?
• Can cause spurious transitions that consume more dynamic power
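For a small gate, one straightforward answer to the "best input vector" question is to try every vector and keep the one with the least leakage. The sketch below does exactly that; the leakage function is a toy model of an n-input NOR gate invented for illustration (parallel off pull-downs leak in proportion to their count, stacked off pull-ups leak less as the stack deepens), not a real device model.

```python
from itertools import product


def toy_nor_leakage(inputs):
    """Toy leakage (arbitrary units) for an n-input NOR gate.

    All inputs 0: every pull-down transistor is off, and they leak in parallel.
    Otherwise: the series pull-up stack has one off transistor per 1 input,
    and a deeper stack of off transistors leaks less.
    """
    ones = sum(inputs)
    if ones == 0:
        return float(len(inputs))
    return 1.0 / ones


def best_standby_vector(n_inputs, leakage_fn):
    """Exhaustively search all 2**n input vectors for the minimum leakage."""
    return min(product((0, 1), repeat=n_inputs), key=leakage_fn)


print(best_standby_vector(2, toy_nor_leakage))  # (1, 1) under this toy model
```

Exhaustive search grows as 2^n, so it only suits individual gates or small blocks; it also ignores the dynamic energy spent loading the vector.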
19
Variant: Embedded Dual-VT
• Instead of at the gate level, choose high-VT vs. low-VT at the transistor level
[Figure: a gate built from a mix of high-VT devices and low-VT devices]
• Can be used if some transitions are more important than others
  – “more important” can be speed or power
• Combine with setting input sleep vectors
  – make the off transistors high-VT if possible to further reduce leakage
20
Power Gating
• If you turn off the power, then the gates can’t leak
[Figure: an inverter between Vdd and Gnd, then the same inverter powered from a virtual Vdd that connects to Vdd through a gating transistor; with the gating transistor off, the leakage path is cut]
• This gating transistor is a beast… it needs to be big enough to supply the necessary current when not gated, and it also needs to be low leakage (high-VT gate)
• Gating transistor is also called a “sleep” transistor
21
Power Gating
[Figure: a block gated only at virtual Vdd vs. a block gated at both virtual Vdd and virtual Gnd]
• After gating, residual charge in the system will continue to leak
• With both virtual Vdd and virtual Gnd gated, both paths are cut off
22
Turn-On/Turn-Off Latency
• Sleep transistors are slow high-VT devices
• Depending on size of block covered by sleep transistor, virtual Vdd/Gnd may have a lot of capacitance to charge/discharge
[Figure: Vdd charging virtual Vdd through resistance R into capacitance C; timeline: ADD inst ready to execute, ALU asleep, delay to wake up ALU, ADD exec]
• Moderate R, large C ⇒ large RC (slow)
• Wakeup delay can cause significant performance penalties when units are unavailable
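A first-order estimate of the wake-up delay treats the sleep transistor and the virtual rail as a simple RC circuit. The sketch below makes that assumption, takes "awake" to mean the rail has reached 90% of Vdd, and uses invented component values.

```python
import math


def wakeup_time(resistance_ohm, capacitance_f, fraction=0.9):
    """Seconds for an RC-charged rail to reach `fraction` of Vdd.

    From v(t) = Vdd * (1 - exp(-t / RC)), so t = -RC * ln(1 - fraction).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -resistance_ohm * capacitance_f * math.log(1.0 - fraction)


def wakeup_cycles(resistance_ohm, capacitance_f, freq_hz, fraction=0.9):
    """Wake-up delay rounded up to whole clock cycles."""
    return math.ceil(wakeup_time(resistance_ohm, capacitance_f, fraction) * freq_hz)


# Hypothetical values: 10 ohm sleep transistor, 1 nF virtual-rail capacitance, 2 GHz clock.
print(wakeup_time(10.0, 1e-9))         # about 23 ns
print(wakeup_cycles(10.0, 1e-9, 2e9))  # 47 cycles
```

Tens of cycles of delay is the reason the following slides look at sending the wake-up signal as early as possible.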
23
Turn-On/Turn-Off Latency
• In some situations, can know early enough ahead
[Figure: crude pipeline (fetch, decode, …, exec) carrying an fadd instruction toward the FPU]
• FP inst decoded ⇒ immediately send wakeup to FPU
• Hopefully by the time the fadd makes it to the OOO core, gets scheduled, and makes it to the FPU, the turn-on has completed
24
Turn-On/Turn-Off Latency
• In some cases it’s much harder
  – pipeline full/stalled (maybe due to D$ miss to main memory)
  – power-off front-end units (fetch, decode, etc.)
  – miss serviced, back-end starts moving again; front-end starts wake up
  – back-end gets starved because front-end wakeup is too slow and can’t refill the pipeline
• But it’s hard to start the power-on early because we don’t know when the memory request will be fulfilled (and whether that will cause the back-end to drain)
25
Turn-On/Turn-Off Power
• (Dis)charging virtual Vdd/Gnd consumes quite a bit of energy/power: $P = \tfrac{1}{2} a C V^2 f$
• Worst case: charge up as soon as you’re done discharging
[Figure: virtual Vdd vs. time; “Go to sleep!” followed immediately by “Done discharging, now wake up!”]
  – We just wasted $2 \times \tfrac{1}{2} \times C_{\text{Virt Vdd}} \times V_{dd}^2$ joules to discharge and then recharge the virtual Vdd
  – And we spent zero cycles fully asleep, so we didn’t save any/much leakage power
26
Turn-On/Turn-Off Power
• Must stay asleep for some time, just to break even!
[Figure: energy consumed vs. time. Without sleeping, leakage energy accumulates steadily. With sleeping: energy to discharge virtual Vdd/Gnd, zero energy consumed while sleeping, then energy to recharge virtual Vdd/Gnd. The crossover marks the minimum sleep time for energy break-even.]
• Too little sleep… ends up costing more energy than doing nothing (extra energy spent)
• Sleep interval > break-even length gives an energy reduction
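The break-even point can be estimated by equating the leakage energy saved while asleep with the energy spent discharging and recharging the virtual rail. The sketch below assumes the transition costs 2 × ½ × C × Vdd² in total, that no energy is consumed while asleep, and that leakage power is constant when awake; the numeric values are invented.

```python
def transition_energy(c_virtual_f, vdd):
    """Joules to discharge and then recharge the virtual rail: 2 * (1/2 * C * Vdd^2)."""
    return 2 * 0.5 * c_virtual_f * vdd ** 2


def break_even_sleep_time(c_virtual_f, vdd, leakage_power_w):
    """Minimum seconds asleep for the saved leakage to repay the transition energy."""
    return transition_energy(c_virtual_f, vdd) / leakage_power_w


def net_energy_saved(sleep_time_s, c_virtual_f, vdd, leakage_power_w):
    """Joules saved by sleeping; negative means sleeping cost more than it saved."""
    return leakage_power_w * sleep_time_s - transition_energy(c_virtual_f, vdd)


# Hypothetical block: 1 nF virtual rail, 1.0 V supply, 100 mW of leakage.
print(break_even_sleep_time(1e-9, 1.0, 0.1))   # about 1e-08 s (10 ns)
print(net_energy_saved(5e-9, 1e-9, 1.0, 0.1))  # negative: too little sleep
print(net_energy_saved(2e-8, 1e-9, 1.0, 0.1))  # positive: net reduction
```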
27
Turn-On/Turn-Off Noise
• Instantly turning on the sleep transistor to recharge virtual Vdd causes a very large current spike (di/dt)
[Figure: water-tank analogy. The tank supplies Ishower; a flush draws Ijohn (the current for recharging virtual Vdd), leaving Ishower − Ijohn and causing a pressure drop.]
• Solution: progressive turn-on; recharge virtual Vdd slowly, which limits Ijohn (i.e., Irecharge) to keep pressure drop (supply noise) under control
• Slowing down recharge increases the performance penalty when recharge is late
28
Example: Intel Core (not Core 2)
• OS power management (OSPM)
  – algorithm monitors CPU load over some window of time
  – computes target performance point, requests from CPU
  – CPU is expected to modify operating voltage/frequency to match OSPM’s request
[Figure: relative power consumption, with a “voltage and frequency scaling” region and a “frequency scaling only” region]
• OS can choose different power saving states (C0 – Cn)
  – C0: active state (no power saving)
  – Ci: higher i ⇒ more power savings, but longer recovery time
http://download.intel.com/technology/itj/2006/volume10issue02/vol10_art03.pdf
29
Example: Core Idle States
• C0: Active
• C1 (processor-centric measures)
  – instruction execution halted, clocks are gated
• C2: CPU does not access bus w/o chipset’s consent
  – allows bus to be put in low-power mode
• C3: CPU disables PLLs (clock generators)
• C4: CPU lowers voltage to minimum level while still being able to retain state (e.g., cache contents)
• DC4: “Deep” C4 (next slide)
30
Example: Core Sleep State
• Upon entering C4, flush L2 cache to main memory
  – Don’t do it all at once!
    • If C4 period is short, then you waste more power due to flushing
    • Can have performance impact on wakeup since cache will be cold
    • Flush only part of the L2 (1/8 to 1/2) by ways
  – once a complete way has been flushed, power gate it with sleep transistors (discussed later)
    • Do this upon each entry into C4 state
• When L2 shrunk to 0 bytes, enter DC4
  – Greatly reduce voltage since there’s no state to retain
    • No need to wake up cache for snoops
    • Chipset directs snoop traffic directly to memory
    • Typically expand cache to minimum of two ways on exit from DC4
31
Example: Core Duo
• Many shared resources
  – PLL, power supply, L2 cache
• Can’t (easily) run cores at different clock speeds with a single PLL
• Can’t run cores at different voltages with a single power supply
• Can’t turn off L2 cache just because one core is idle
• External interface complications
  – OS sees two separate CPUs
    • one C-state per core
  – Platform views the whole processor as a single entity for power management (for C2 state and higher)
[Figure: OS can request C-states on a per-core basis; platform sees only a single C-state (the lower of the two)]
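The coordination rule in the figure reduces to taking the minimum of the per-core requests. The sketch below encodes C-states as integers (0 for C0, 4 for C4), which is an illustrative encoding rather than anything defined by the slides.

```python
def platform_c_state(core_requests):
    """Resolve per-core C-state requests into the single state the platform sees.

    The shallowest (lowest-numbered) request wins, because shared resources
    such as the PLL, power supply, and L2 cache must keep serving the most
    active core.
    """
    if not core_requests:
        raise ValueError("at least one core request is required")
    return min(core_requests)


print(platform_c_state([4, 1]))  # 1: one core in C1 keeps the package out of C4
print(platform_c_state([4, 4]))  # 4: both cores idle, so the package can go deep
```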
32
Turbo-Mode
• If one core is in deep-sleep, it’s not consuming much power
• Idea: use DVFS in reverse to increase voltage/frequency
[Figure: per-core power against the power limit, and relative performance, for both cores in C0 and for core 0 in C0 with core 1 in DC4]
• Deliver more performance when running a single program and not worried about battery life (plugged in to wall)
• “Intel Dynamic Acceleration Technology”
33
Variable VT Devices
• Pros:
  – significant standby leakage reduction
  – memory elements retain state
  – no transistor sizing/partitioning required
  – dynamically tunable VT at runtime
• Cons:
  – requires expensive triple-well fabrication process
  – body-biasing effect decreases with technology scaling
[Figure: stacked transistors with VB = 0 and VS = V/2, next to an NFET with an explicit body-bias terminal VBBN]
• Earlier body-bias effect from stacked transistors was due to higher source voltage (higher VSB increases VT)
• Provide a way to explicitly bias VB
• Setting VBBN < 0 makes VSB > 0 for this NFET
• Since VBBN < 0, also called “reverse biasing”
Kao et al., Embedded Tutorial: Subthreshold Leakage Modeling and Reduction Techniques, ICCAD 2002
34
Body-Biased Cache
• Super-high VT for caches (very slow)
• Use selective forward-body biasing during access to read/write at a reasonable speed
[Figure: cache lines built from very-high-VT devices (very low leakage, slow access speed) with body bias 0; on access, the selected lines are driven to Vfwd-bias; when the access completes, the bias returns to 0]
• VSB < 0 ⇒ VT decreases ⇒ transistors are faster (but consume more power)
• A few cache lines go into high leakage mode, but only very briefly (during access). The rest of the time, the cache consumes very little leakage power.
35
GALS
• Different blocks have different performance needs
  – and this varies in time
• Idea: clock different blocks at different speeds
  – Apply voltage/frequency scaling to blocks/groups-of-blocks
    • e.g., FP units can be slowed down (or maybe even completely turned off) for integer applications
  – Block consumes less power when it doesn’t have to operate in max-performance mode
• GALS = Globally Asynchronous, Locally Synchronous
36
GALS Example
[Figure: Baseline Processor vs. GALS Processor, from http://www.ece.cmu.edu/~dianam/conferences/isca02.pdf]
37
GALS Issues
• How to communicate between clock domains?
[Figure: Asynchronous FIFO Design [Chelcea and Nowick]]
• Timing issues:
  – Producer can clear empty, but it gets cleared on clk2
  – Consumer clears the full signal, but it occurs on clk1
• Voltage issues:
[Figure: in the Vdd1 domain, “0” = 0V and “1” = 0.75V; in the Vdd2 domain the range is 0V to 1.5V, so a “1” from domain 1 (0.75V) is ambiguous: 0.75V = 0/1?]
  – FIFO between domains must “speak” both voltages