15-447 computer architecturefall 2008 © november 24, 2007 nael abu-ghazaleh [email protected]...
15-447 Computer Architecture Fall 2008 ©
November 24, 2007
Nael Abu-Ghazaleh
naelag@cmu.edu
www.qatar.cmu.edu/~msakr/15447-f08
Lecture 27: Power Aware Architecture Design
CS 15-447: Computer Architecture
Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10000) versus year, 1978 to 2006, with trend lines labeled 25%/year, 52%/year, and ??%/year, and a 3X annotation]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
Sea change in chip design: what is emerging?
Three walls
1. ILP Wall:
• Not enough parallelism available in one thread
• Very costly to find more
• Implications: can't continue to grow IPC
• VLIW? SIMD ISA extensions?
2. Memory Wall:
• Growing gap between DRAM and processor speed
• Caching helps, but only so much
• Implications: cache misses are getting more expensive
• Multithreaded processors?
3. Physics/Power Wall:
• Can't continue to shrink devices; running into physical limits
• Power dissipation is also increasing (more today)
• Implications: can't rely on a performance boost from shrinking transistors
• But we will continue to get more transistors
Multithreaded Processors
• What support is needed?
• It can be used to help ILP as well
– Which designs help ILP in the picture to the right?
Power-Efficient Processor Design
Goals:
1. Understand why energy efficiency is important
2. Learn the sources of energy dissipation
3. Overview a selection of approaches to reduce energy
Why Worry About Power?
• Embedded systems:
– Battery life
• High-end processors:
– Cooling (costs $1 per chip per watt if operating at >40 W)
– Power cost: 15 cents per kilowatt-hour (kWh)
• A single 900 W server costs about 100 USD/month to run, not including cooling costs!
– Packaging
– Reliability
Why worry about power: Oak Ridge Lab. Jaguar
• Current highest-performance supercomputer
– 1.3 sustained petaflops (quadrillion FP operations per second)
– 45,000 processors, each a quad-core AMD Opteron
• 180,000 cores!
– 362 terabytes of memory; 10 petabytes of disk space
– Check top500.org for a list of the most powerful supercomputers
• Power consumption? (without cooling)
– 7 megawatts!
– 0.75 million USD/month to power
– There is a green500.org that rates computers based on flops/watt
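The electricity figures quoted for the 900 W server and for Jaguar follow from simple arithmetic. A minimal Python sketch, assuming a 30-day month (720 hours) and the 15 cents/kWh rate quoted earlier; the function name `monthly_cost_usd` is illustrative only:

```python
HOURS_PER_MONTH = 30 * 24      # assumption: 30-day month
RATE_USD_PER_KWH = 0.15        # rate quoted in the lecture


def monthly_cost_usd(power_watts, rate=RATE_USD_PER_KWH, hours=HOURS_PER_MONTH):
    """Electricity cost of running a constant load for one month."""
    return power_watts / 1000.0 * hours * rate


server = monthly_cost_usd(900)      # single 900 W server
jaguar = monthly_cost_usd(7e6)      # 7 MW supercomputer, without cooling
flops_per_watt = 1.3e15 / 7e6       # sustained flops per watt

print(f"server: ${server:.0f}/month")
print(f"Jaguar: ${jaguar / 1e6:.2f}M/month, {flops_per_watt / 1e6:.0f} Mflops/W")
```

Under these assumptions the server comes to roughly 97 USD/month and Jaguar to roughly 0.76 million USD/month, consistent with the rounded numbers on the slides.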
Peak Power in Today’s CPUs
• Alpha 21264: 95 W
• AMD Athlon XP: 67 W
• HP PA-8700: 75 W
• IBM Power 4: 135 W
• Intel Itanium: 130 W
• Intel Xeon: 59 W
Even worse when we consider power density (W/cm²)
Where is This Power Coming From?
• Sources of power consumption in CMOS:
– Dynamic or active power (due to the switching of transistors)
– Short-circuit power
– Leakage power
• High temperature increases power consumption
– Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…
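Power Consumption in CMOS
– Dynamic Power Consumption
• Charging and discharging capacitors
[Figure: CMOS inverter (Vdd, In, Out, load capacitance C) shown for the input transitions 0 -> 1 and 1 -> 0; each transition is labeled $E = CV^2$]
$P = E \cdot f = C V^2 f$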
Dynamic Power Consumption
$P = \alpha \cdot C \cdot V^2 \cdot f$
• $\alpha$, activity factor: how often wires switch
• $V$, supply voltage: has been dropping with successive process generations
• $f$, clock frequency: increasing
• $C$, capacitance: a function of wire length and transistor size
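To make the dependence on each term concrete, here is a small Python sketch of the dynamic power formula (activity factor times capacitance times voltage squared times frequency). The operating point below is hypothetical and chosen only for illustration; it does not describe any real chip:

```python
def dynamic_power(alpha, capacitance_f, vdd_volts, freq_hz):
    """Dynamic (switching) power in watts: P = alpha * C * V^2 * f."""
    return alpha * capacitance_f * vdd_volts ** 2 * freq_hz


# Hypothetical operating point (assumed values, not from the lecture).
base = dynamic_power(alpha=0.2, capacitance_f=50e-9, vdd_volts=1.5, freq_hz=2e9)
half_v = dynamic_power(alpha=0.2, capacitance_f=50e-9, vdd_volts=0.75, freq_hz=2e9)
half_f = dynamic_power(alpha=0.2, capacitance_f=50e-9, vdd_volts=1.5, freq_hz=1e9)

print(f"baseline: {base:.1f} W")
print(f"half voltage:   {half_v / base:.2f} of baseline")
print(f"half frequency: {half_f / base:.2f} of baseline")
```

Halving the voltage cuts dynamic power to a quarter, while halving the frequency only halves it, which is why voltage is the most attractive knob.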
Power Consumption in CMOS
– Short-circuit power
• Both PMOS and NMOS are conducting
[Figure: CMOS inverter with load capacitance C, showing the short-circuit current Isc]
About 2% of the overall power.
Power Consumption in CMOS
– Leakage power: transistors are not perfect switches, and they leak.
[Figure: CMOS inverter with load capacitance C, showing the subthreshold leakage current Isub]
20% now, expect 40% in the next technology and growing
Cooling
• All of the consumed power has to be dissipated
• Done by means of heat pipes, heat sinks, fans, etc.
• Different segments use different cooling mechanisms.
• Costs $1-$3 or more per chip per watt if operating at >40 W
• We may soon need budgets for liquid-cooling or refrigeration hardware.
Voltage Scaling
• Transistors switch more slowly at a lower supply voltage, so the threshold voltage has to drop as well to keep them fast.
• Leakage current grows exponentially as the threshold voltage decreases
• Leakage power goes through the roof
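The exponential dependence can be illustrated with the usual textbook subthreshold model, in which leakage current scales as 10^(-Vth/S), where S is the subthreshold swing. The sketch below is not from the lecture; the swing of 100 mV per decade is an assumed, illustrative value:

```python
def leakage_ratio(delta_vth_volts, swing_v_per_decade=0.1):
    """Relative change in subthreshold leakage when Vth changes by delta_vth.

    Assumes I_sub ~ 10 ** (-Vth / S); S = 0.1 V/decade is an assumed value.
    """
    return 10 ** (-delta_vth_volts / swing_v_per_decade)


for drop_mv in (50, 100, 200):
    ratio = leakage_ratio(-drop_mv / 1000)
    print(f"Vth lowered by {drop_mv} mV -> leakage x{ratio:.1f}")
```

With this assumed swing, every 100 mV taken off the threshold voltage multiplies leakage by ten.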
Technology Scaling: the Enabler
• New process generation every 2-3 years
• Ideal shrink for a 30% reduction in size:
– Voltage scales down by 30%
– Gate delays are shortened by 30%: ~50% frequency gain (500 ps cycle = 2 GHz clock, 333 ps cycle = 3 GHz clock)
– Transistor density increases by 2X
• 0.7X shrink on a side, 2X area reduction
– Capacitance/transistor reduced by 30%
Ideal Process Shrink: the Results
• 2/3 reduction in energy/transition ($CV^2$: $0.7 \times 0.7^2 = 0.34\times$)
• 1/2 reduction in power ($CV^2 f$: $0.7 \times 0.7^2 \times 1.5 = 0.5\times$)
• But twice as many transistors, or more if area increases
• Power density unchanged
Looks good!
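The ideal-shrink numbers can be checked in a few lines. This Python sketch only restates the scaling factors given in the lecture (0.7X capacitance and voltage, 1.5X frequency, 2X transistor density); the variable names are illustrative:

```python
s = 0.7            # linear shrink factor per generation (30% reduction)
cap = s            # capacitance per transistor scales by 0.7
volt = s           # supply voltage scales by 0.7
freq = 1.5         # ~50% frequency gain
density = 2.0      # transistors per unit area (2X)

energy_per_transition = cap * volt ** 2                  # C * V^2
power_per_transistor = energy_per_transition * freq      # C * V^2 * f
power_density = power_per_transistor * density           # per unit area

print(round(energy_per_transition, 2),
      round(power_per_transistor, 2),
      round(power_density, 2))   # prints 0.34 0.51 1.03
```

Power per transistor halves while the number of transistors per unit area doubles, so power density stays essentially flat in the ideal case.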
Process Technology – the Reality*
• Performance does not scale with frequency
– New designs increase frequency by 2X
– New designs use 2X-3X more transistors to get 1.4X-1.8X performance*
• So, every new process generation:
– Power goes up by about 2X (3X transistors * 2X switches * 1/3 energy)
– Leakage power is also increasing
– Power density goes up 30%~80% (2X power / 1.X area)
• Will get worse in future technologies, because voltage will scale down less
*Source: “Power – the Next Frontier: a Microarchitecture Perspective”, Ronny Ronen, keynote speech at PACS’02 Workshop.
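The "about 2X" power figure is the product of the three factors quoted above. The short Python sketch below reproduces it and, as a derived illustration rather than a number from the source, backs out the die-area growth that the quoted 30% to 80% power-density increase would imply:

```python
transistors = 3.0      # upper end of the 2X-3X transistor growth
switching = 2.0        # 2X frequency -> 2X switches per second
energy = 1.0 / 3.0     # ~1/3 energy per transition

power = transistors * switching * energy

# Area growth implied by a 30% and an 80% power-density increase (derived).
implied_area = [power / (1 + growth) for growth in (0.30, 0.80)]

print(round(power, 2), [round(a, 2) for a in implied_area])
```

The implied area factors fall between roughly 1.1X and 1.5X, which matches the "1.X area" shorthand on the slide.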
Ugly Numbers*
| | i486 (0.8 µm) | Pentium 4 (0.18 µm) | Factor |
|---|---|---|---|
| Transistors | 1.2M | 42M | 35x |
| Frequency | 50 MHz | 2000 MHz | 40x |
| Voltage | 5 V | 1.65 V | 1/3x |
| Peak power | 5 W | 100 W | 20x |
| Die size | 0.73 cm² | 2.17 cm² | 3x |
| Power density | 6.8 W/cm² | 46 W/cm² | 7x |
The Bottom Line
• Circuits and process scaling alone can no longer solve all power problems
• SYSTEMS must also be power-aware
– OS
– Compilers
– Architecture
• Techniques at the architectural level are needed to reduce the absolute power dissipation as well as the power density
Microarchitectural Techniques for Power Reduction
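A Superscalar Datapath
[Figure: superscalar datapath with fetch stages (F1, F2), decode/dispatch stages (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to the function units (FU1, FU2, ..., FUm) in the EX stage, load/store queue (LSQ), D-cache, reorder buffer (ROB), architectural register file (ARF), and result/status forwarding buses]
Performance = N*f*IPC
Actually, it’s the whole system, but we focus on the processor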
Microarchitectural Techniques: General Approach
• Dynamic power:
– Reduce the activity factor
– Reduce the switching capacitance (usually not possible)
– Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600 MHz, and its voltage can be dropped from 1.48 V to 0.95 V)
• Leakage power:
– Put some portions of the on-chip storage structures into a low-power stand-by mode, or even completely shut off the power supply to these partitions
– Resizing
• We usually give up some performance to save energy, but how much?
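As a rough worked example of the voltage/frequency knob, the Python sketch below plugs the Pentium M numbers quoted above into the dynamic power relation (power proportional to V² times f). It assumes the activity factor and capacitance stay constant and ignores leakage, so it is an estimate rather than a measured figure:

```python
def relative_dynamic_power(v_new, v_old, f_new, f_old):
    """Ratio of dynamic power after scaling, with alpha and C unchanged."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


ratio = relative_dynamic_power(v_new=0.95, v_old=1.48, f_new=600e6, f_old=1.6e9)
energy_per_cycle = (0.95 / 1.48) ** 2   # energy per switching event scales with V^2

print(f"dynamic power: {ratio:.1%} of the 1.6 GHz / 1.48 V level")
print(f"energy per cycle: {energy_per_cycle:.1%} of the original")
```

Under these assumptions, clock frequency drops to 37.5% of its original value while dynamic power falls to roughly 15%.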
Guideline
• If we reduce voltage, there is a linear drop in maximum frequency (and performance)
• “The cube law”: $P = kV^3$ (~1% V = 3% P)
– If we use voltage scaling we can approximately trade 1% of performance loss for 3% of power reduction.
• Any architectural technique that trades performance for power should do better than that (or at least as well). Otherwise simple voltage scaling can be used to achieve better tradeoffs.
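The guideline can be turned into a simple check. The Python sketch below assumes, as the cube law does, that performance scales linearly with voltage; the two techniques being tested are hypothetical examples, and the function names are illustrative:

```python
def voltage_scaling_power(perf_ratio):
    """Power ratio reachable by voltage/frequency scaling alone (P = k * V^3)."""
    return perf_ratio ** 3


def beats_voltage_scaling(perf_ratio, power_ratio):
    """True if a technique saves at least as much power as the cube law
    would at the same performance level."""
    return power_ratio <= voltage_scaling_power(perf_ratio)


saved = 1 - voltage_scaling_power(0.99)       # power saved for a 1% slowdown
print(f"1% slower via voltage scaling saves {saved:.1%} power")
print(beats_voltage_scaling(0.99, 0.95))      # hypothetical: 1% slower, 5% less power
print(beats_voltage_scaling(0.95, 0.90))      # hypothetical: 5% slower, 10% less power
```

The first hypothetical technique passes the test; the second does not, because voltage scaling alone would save about 14% of the power for the same 5% slowdown.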
Examples: Front-End Throttling
• Speculation is used to increase performance
• Wasted energy if it is wrong
• Can we speculate only when we think we’ll be right?
• Gating: temporarily prevent new instructions from entering the pipeline
• Use gating to avoid speculation beyond branches with low prediction accuracy
– The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long
– Reported: 38% energy savings in the wrong-path instructions with about 1% of IPC loss
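The gating decision described above can be sketched as a counter of unresolved low-confidence branches. This is a minimal Python model, not the published mechanism: the class name, method names, and the threshold value of 2 are all hypothetical, and the confidence estimator itself is left outside the sketch:

```python
class PipelineGate:
    """Gate fetch while too many low-confidence branches are unresolved."""

    def __init__(self, threshold=2):
        self.threshold = threshold        # hypothetical gating threshold
        self.low_conf_in_flight = 0       # unresolved low-confidence branches

    def on_branch_fetched(self, low_confidence):
        if low_confidence:
            self.low_conf_in_flight += 1

    def on_branch_resolved(self, low_confidence):
        if low_confidence and self.low_conf_in_flight > 0:
            self.low_conf_in_flight -= 1

    def fetch_allowed(self):
        return self.low_conf_in_flight < self.threshold


gate = PipelineGate(threshold=2)
gate.on_branch_fetched(low_confidence=True)
first = gate.fetch_allowed()      # one low-confidence branch in flight
gate.on_branch_fetched(low_confidence=True)
second = gate.fetch_allowed()     # threshold reached: fetch is gated
gate.on_branch_resolved(low_confidence=True)
third = gate.fetch_allowed()      # a branch resolved: fetch resumes
print(first, second, third)
```

Fetch stays gated only for as long as the count remains at or above the threshold, which is how the counter determines both when to gate and for how long.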
Front-End Throttling (continued)
• Just-in-Time Instruction Delivery
– Fetch stage is throttled based on the number of in-flight instructions.
– If the number of in-flight instructions exceeds a predetermined threshold, the fetch is throttled
– Threshold is adjusted through the “tuning cycle”
– Reasons for energy savings:
• Fewer instructions are processed along the mispredicted path
• Instructions spend fewer cycles in the issue queue
Energy Reduction in the Register Files
• General solutions:
– Use of multi-banked RFs. Each bank has fewer entries and fewer ports than the monolithic RF.
• Problems:
– Possible bank conflicts -> IPC loss
– Overhead of the port arbitration logic
– Use of smaller cache-like structures to exploit access locality
Energy Reduction in the Register Files
• Value Aging Buffer
– At the time of writeback, the results are written into a FIFO-style cache called the VAB
– The RF is updated only when values are evicted from the VAB.
– In many situations, this can be avoided because a register may be deallocated during its residency in the VAB
– If a register is read from the VAB, there is no need to access the RF.
– Some performance loss due to the sequential access to the VAB and the RF.
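The behavior of the Value Aging Buffer described above can be modeled in a few lines. This is a simplified Python sketch under stated assumptions, not the published design: the class and method names are hypothetical, the buffer is a plain FIFO keyed by physical register, and timing (the sequential VAB-then-RF access) is not modeled:

```python
from collections import OrderedDict


class ValueAgingBuffer:
    """FIFO buffer in front of the register file (RF)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # physical register -> value, oldest first
        self.rf = {}                   # backing register file
        self.rf_writes = 0             # count of energy-costly RF writes

    def writeback(self, reg, value):
        self.entries.pop(reg, None)    # (re)insert as the newest entry
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            old_reg, old_val = self.entries.popitem(last=False)  # evict oldest
            self.rf[old_reg] = old_val     # RF is updated only on eviction
            self.rf_writes += 1

    def deallocate(self, reg):
        self.entries.pop(reg, None)    # value dies in the VAB: no RF write

    def read(self, reg):
        if reg in self.entries:        # VAB hit: no RF access needed
            return self.entries[reg]
        return self.rf[reg]


vab = ValueAgingBuffer(capacity=2)
vab.writeback("P31", 10)
vab.writeback("P32", 20)
vab.deallocate("P31")                  # freed while still in the VAB
vab.writeback("P33", 30)
vab.writeback("P34", 40)               # evicts P32 into the RF
print(vab.rf_writes, vab.read("P32"), vab.read("P34"))
```

In this trace, four results are written back but only one reaches the register file, which is where the energy saving comes from.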
Isolation of short-lived operands
Out-of-Order Execution and In-Order Retirement
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (instruction queue, Ex), and in-order retirement (ROB, ARF)]
Register Renaming
• Used to cope with false data dependencies.
• A new physical register is allocated for EVERY new result
• P6 style: ROB slots serve as physical registers
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, P2, 100
SUB P32, P31, P3
ADD P33, P32, P4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
| Arch. Reg. | Phys. Reg. | Location (0 = ROB, 1 = ARF) |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
| Arch. Reg. | Phys. Reg. | Location (0 = ROB, 1 = ARF) |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 31 | 0 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
| Arch. Reg. | Phys. Reg. | Location (0 = ROB, 1 = ARF) |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 31 | 0 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 32 | 0 |
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
| Arch. Reg. | Phys. Reg. | Location (0 = ROB, 1 = ARF) |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 33 | 0 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 32 | 0 |
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
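The step-by-step table updates shown on these slides can be reproduced with a short simulation. The Python sketch below is a simplified model, assuming six architectural registers, free physical registers (ROB slots) numbered from 31 as in the example, and no retirement or register reclamation; the function name `rename` and the tuple encoding of instructions are illustrative choices:

```python
def rename(program, num_arch_regs=6, first_free=31):
    """Rename destinations P6-style; returns (renamed code, final table).

    The table maps an architectural register number to (physical register,
    location), where location 0 = ROB and 1 = ARF.
    """
    rat = {r: (r, 1) for r in range(num_arch_regs)}

    def lookup(src):
        if isinstance(src, int):              # immediate operand
            return str(src)
        phys, loc = rat[int(src[1:])]
        return f"P{phys}" if loc == 0 else f"R{phys}"

    next_phys = first_free
    renamed = []
    for op, dest, src1, src2 in program:
        s1, s2 = lookup(src1), lookup(src2)   # read sources before remapping dest
        rat[int(dest[1:])] = (next_phys, 0)   # every new result gets a new register
        renamed.append(f"{op} P{next_phys}, {s1}, {s2}")
        next_phys += 1
    return renamed, rat


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed, rat = rename(program)
print("\n".join(renamed))
print(rat[1], rat[5])
```

The output matches the renamed code on the last slide of the sequence, and the final table maps R1 to P33 and R5 to P32, both located in the ROB.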
Short-Lived Values
• Definition: a value is short-lived if its destination register has already been renamed again by the time the result is generated.
• Identified one cycle before the result writeback
• A large percentage of all generated results are short-lived for SPEC 2000 benchmarks.
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code (output of the renamer):
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
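The definition can be expressed as a one-line check against the rename table at the moment a result is produced. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes the LOAD completes after the ADD has already been renamed, so the table holds the final mappings (R1 to P33, R5 to P32):

```python
def is_short_lived(arch_reg, phys_reg, rat):
    """A result is short-lived if its architectural destination has already
    been remapped to a newer physical register when the result is produced."""
    return rat[arch_reg] != phys_reg


# Rename table after all three instructions have been renamed (assumed timing).
rat = {1: 33, 5: 32}

load_result = is_short_lived(1, 31, rat)   # P31: R1 was renamed again by the ADD
sub_result = is_short_lived(5, 32, rat)    # P32: still the current mapping of R5
print(load_result, sub_result)
```

Here the LOAD result in P31 is short-lived, because the only remaining consumers are instructions already in flight, while the SUB result in P32 is not.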
Percentage of Short-Lived Values
[Figure: bar chart of the percentage of short-lived values (axis 0 to 100) for a 96-entry ROB, 4-way processor]
Why Keep Them?
• Reasons for maintaining short-lived values:
– Recovering from branch mispredictions
– Reconstructing precise state if interrupts or exceptions occur
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
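Energy-dissipating Events
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (instruction queue, Ex), and in-order retirement (ROB, ARF), annotated with the energy-dissipating events: two writes and one read]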
Isolating Short-Lived Values: the Idea
[Figure: the same pipeline (in-order front end F, R, D; out-of-order core with instruction queue and Ex; in-order retirement with ROB and ARF) with a small register file, the SRF, added; two writes and one read are annotated]
Write short-lived values into a small dedicated RF (SRF)
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Energy Reduction in Caches
• Dynamically resizable caches
– Dynamically estimates the program requirements and adapts to the required cache size
– Cache is upsized or downsized at the end of periodic intervals based on the value of the cache miss counter
– Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors
– A bit mask is used to specify the number of address bits that are used for indexing into the set
– The cache size always changes by a factor of two
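The index masking and the interval-based resizing decision can be sketched as follows. This is a simplified Python illustration under stated assumptions: the 32-byte line size, the miss thresholds, and the function names are hypothetical, and the handling of tags and of data held in sets that go to sleep is left out:

```python
def set_index(address, active_sets, block_bytes=32):
    """Set index using only as many index bits as the active size needs.

    active_sets must be a power of two, since the size changes by factors
    of two; block_bytes=32 is an assumed line size.
    """
    assert active_sets & (active_sets - 1) == 0, "active_sets must be a power of two"
    mask = active_sets - 1                     # bit mask selecting the index bits
    return (address // block_bytes) & mask


def resize(active_sets, misses, upper, lower, min_sets, max_sets):
    """End-of-interval decision driven by the miss counter (hypothetical thresholds)."""
    if misses > upper and active_sets < max_sets:
        return active_sets * 2                 # upsize: wake the sleeping sets
    if misses < lower and active_sets > min_sets:
        return active_sets // 2                # downsize: higher-numbered sets sleep
    return active_sets


addr = 0x12345
full_index = set_index(addr, 512)
half_index = set_index(addr, 256)
new_size = resize(512, misses=10, upper=1000, lower=100, min_sets=64, max_sets=512)
print(full_index, half_index, new_size)
```

After downsizing from 512 to 256 sets, the same address maps into the lower half of the cache because one fewer index bit is used.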
Energy Reduction within the Execution Units
• Gating off portions of the execution units
– Disables the upper bits of the ALUs when they are not needed (for small operands)
– Energy can be reduced by 54% for integer programs
• Packing multiple narrow-width operations into a single ALU in the same cycle
• Steering instructions to FUs based on criticality information
– Critical instructions are steered to fast, power-hungry execution units; non-critical instructions are steered to slow, power-efficient units
Encoding Addresses for Low Power
• Using Gray code for the addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994)
– Exploits the observation that programs often generate consecutive addresses
– Gray code: there is only a single transition on the address bus when consecutive addresses are accessed
– 37% reduction in the switching activity is reported
– A Gray code encoder is placed at the transmitting end of the bus, and a decoder is needed at the receiving end
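The single-transition property is easy to demonstrate. The Python sketch below uses the standard reflected binary Gray code and counts bit transitions over a purely sequential run of 16 addresses; the helper names are illustrative, and real address traces are of course not purely sequential:

```python
def to_gray(n):
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g):
    """Gray code back to binary (the decoder at the receiving end)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(values):
    """Total number of bus lines that toggle across a sequence of values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


addresses = list(range(16))
binary_toggles = transitions(addresses)
gray_toggles = transitions([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)   # prints 26 15
```

For this sequential run, Gray coding gives exactly one transition per access (15 in total) against 26 for plain binary.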
Encoding Data for Low Power
• Bus-invert encoding
– Uses redundancy to reduce the number of transitions
– Adds one line to the bus to indicate whether the actual data or its complement is transmitted
– If the Hamming distance between the current value and the previous one is less than or equal to n/2 (for n bits), the value is transmitted as such and 0 is transmitted on the extra line.
– Otherwise, the complement of the value is transmitted and the extra line is set to 1
– The average number of bus transitions per clock cycle is lowered by 25% as a result
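The encoding rule above can be sketched directly. In this minimal Python version, the "previous value" is taken to be the word actually driven on the data lines in the previous cycle, which is an assumption about a detail the slide leaves open; the function names are illustrative:

```python
def bus_invert_encode(prev_bus, value, n_bits):
    """Return (word to drive, invert line) for one transfer.

    prev_bus is the word currently on the data lines, as actually driven.
    """
    mask = (1 << n_bits) - 1
    hamming = bin((prev_bus ^ value) & mask).count("1")
    if hamming <= n_bits // 2:
        return value & mask, 0        # send the value as-is
    return ~value & mask, 1           # send the complement, raise the extra line


def bus_invert_decode(word, invert, n_bits):
    """Recover the original value at the receiving end."""
    mask = (1 << n_bits) - 1
    return (~word & mask) if invert else word


word, inv = bus_invert_encode(prev_bus=0x00, value=0xFE, n_bits=8)
decoded = bus_invert_decode(word, inv, 8)
print(hex(word), inv, hex(decoded))
```

Sending 0xFE after 0x00 would toggle seven data lines; with bus-invert the bus carries 0x01 plus the invert line, so only two lines toggle.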
OS and Compiler Techniques
• Can the compiler help?
• Can the OS help?
– E.g., control voltage scaling
– Control turning off devices