Post on 27-Dec-2015
Copyright Agrawal, 2007. ELEC5270/6270 Spring 13, Lecture 8

ELEC 5270/6270 Spring 2013
Low-Power Design of Electronic Circuits
Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr13/course.html

SIA Roadmap for Processors (1999)

Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)         180    130    100    70     50     35
Logic transistors/cm^2    6.2M   18M    39M    84M    180M   390M
Clock (GHz)               1.25   2.1    3.5    6.0    10.0   16.9
Chip size (mm^2)          340    430    520    620    750    900
Power supply (V)          1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)      90     130    160    170    175    183

Source: http://www.semichips.org
Annotation on slide: Untrue predictions.

Power Reduction in Processors

- Hardware methods:
  - Voltage reduction for dynamic power
  - Dual-threshold devices for leakage reduction
  - Clock gating, frequency reduction
  - Sleep mode
- Architecture:
  - Instruction set
  - Hardware organization
- Software methods

Performance Criteria

- Throughput: computations per unit time.
- Performance is the inverse of time: increasing CPU time indicates lower performance.
- Power: computations per watt.
- Energy efficiency: performance/joule.

SPEC CPU2006 Benchmarks

- Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
- Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
- Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
- It takes about 12 days to run all benchmarks on the reference system.
- The CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  - Peak metric: each program is individually optimized (aggressive compilation).
  - Base metric: common optimization for all programs.
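The normalization and averaging described above can be sketched in a few lines. This is only an illustration: the function names and the run times are hypothetical and are not SPEC data.

```python
import math

def spec_ratio(reference_time_s, measured_time_s):
    """SPEC ratio: reference-machine run time divided by measured run time."""
    return reference_time_s / measured_time_s

def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference, measured) run times in seconds for three programs.
run_times = [(9770.0, 500.0), (9650.0, 800.0), (8050.0, 350.0)]
ratios = [spec_ratio(ref, t) for ref, t in run_times]
print("SPEC ratios:", [round(r, 2) for r in ratios])
print("Composite metric:", round(geometric_mean(ratios), 2))
```

The geometric mean is used so that the ranking of two machines does not depend on which machine is chosen as the reference.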

SPEC CINT2006 Results
http://www.spec.org/cpu2006/results/cint2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric: 36.6 base, 39.4 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric: 15.8 base, 19.1 peak

SPEC CFP2006 Results
http://www.spec.org/cpu2006/results/cfp2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric: 42.5 base, 45.8 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric: 17.4 base, 21.5 peak

Other Benchmarks

- LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
- SPECpower_ssj2008 measures the power and performance of a computer system.
  - The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  - http://www.spec.org/benchmarks.html#power

Second Quarter 2010 SPECpower_ssj2008 Results
http://www.spec.org/power_ssj2008/results/res2010q2/

Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  CPU: AMD Opteron 6174, 2.2 GHz
  Number of chips 2, cores 12, threads/core 2
  Total memory 16 GB
  ssj operations @ 100%: 888,819
  Average power @ 100%: 271 W
  Average power @ active idle: 101 W
  Overall ssj operations per watt: 2,355

Second Quarter 2010 SPECpower_ssj2008 Results (continued)
http://www.spec.org/power_ssj2008/results/res2010q2/

May 19, 2010: Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Total memory 12 GB
  ssj operations @ 100%: 914,076
  Average power @ 100%: 244 W
  Average power @ active idle: 62.3 W
  Overall ssj operations per watt: 2,938

Energy SPEC Benchmarks

Energy efficiency mode: Besides the execution time, the energy efficiency of SPEC benchmark programs is also measured. The energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Average power}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

Energy Efficiency

Efficiency averaged over n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

where $\text{Efficiency}_i$ is the efficiency for program i.

Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
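A short worked step may help connect these definitions to energy. Writing $T$ for execution time, $P$ for average power and $E = P\,T$ for the energy consumed (shorthand introduced here, not used on the slides), the per-program efficiency is simply the reciprocal of energy:

```latex
\[
\text{Efficiency}_i \;=\; \frac{1/T_i}{P_i} \;=\; \frac{1}{P_i\,T_i} \;=\; \frac{1}{E_i},
\qquad
\left(\prod_{i=1}^{n} \text{Efficiency}_i\right)^{1/n}
\;=\; \frac{1}{\left(\prod_{i=1}^{n} E_i\right)^{1/n}}
\]
```

Under this reading, the relative efficiency of a computer equals the geometric-mean energy of the reference computer divided by its own geometric-mean energy over the same programs.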

SPEC2000 Relative Energy Efficiency

[Figure: bar chart of relative energy efficiency (vertical axis 0 to 6) for SPECINT2000 and SPECFP2000. Processors: Pentium M @ 1.6/0.6 GHz (energy-efficient processor), Pentium 4-M @ 2.4 GHz (reference), Pentium III-M @ 1.2 GHz. Operating modes: always max. clock; laptop adaptive clock; min. power, min. clock.]

Voltage Scaling

- Dynamic: Reduce voltage and frequency during idle or low-activity periods.
- Static: Clustered voltage scaling
  - Logic on non-critical paths is given a lower voltage.
  - 47% power reduction with 10% area increase reported.
  - M. Igarashi et al., "Clustered Voltage Scaling Techniques for Low-Power Design," Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization

Throughput = Operations / second

[Figure: throughput versus time, with labeled regions for compute-intensive processes (at maximum throughput), low-throughput (background) processes, and system idle.]

Examples of Processes

- Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
- Low throughput: data entry, screen updates, low-bandwidth I/O data transfer.
- Idle: no computation, no expected output.

Effects of Voltage Reduction

- Voltage reduction increases delay and decreases throughput:
  - Slow reduction in throughput at first
  - Rapid reduction in throughput for $V_{DD} \le V_{th}$
  - Time per operation (TPO) increases
- Voltage reduction continues to reduce power consumption:
  - Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)

[Figure: Power, TPO and EPO (vertical axis 0.0 to 1.0) plotted against $V_{DD}/V_{th}$ from 1 to 5.]

Dynamic Voltage and Clock

                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

Example: Find Minimum Energy Mode

- Processor data (rated operation):
  - 2 GHz clock
  - 1.5 volt supply voltage
  - 0.5 volt threshold voltage
  - Power consumption:
    - 50 watts dynamic power
    - 50 watts static power
- Maximum clock frequency for a V volt supply (alpha-power law): $f \propto (V - V_{TH})/V$

Alpha-Power Law Model

Variation of delay with supply voltage:

$$\text{delay} \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}$$

$V_{TH}$ = threshold voltage
$\alpha$ = 1 for short-channel devices, ≈ 2 for long-channel devices

T. Sakurai and A. R. Newton, "Delay analysis of series-connected MOSFET circuits," IEEE Journal of Solid-State Circuits, Vol. 26, pp. 122-131, Feb. 1991.
T. Sakurai and A. R. Newton, "A simple MOSFET model for circuit analysis," IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp. 887-894, Apr. 1991.
T. Sakurai, "High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited)," Proc. IEEE ISCAS, pp. 1487-1490, Chicago, May 1993.
T. Sakurai, "Alpha-Power Law MOS Model," IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4-5, Oct. 2004.
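To get a feel for the alpha-power law, the sketch below evaluates the delay at one supply voltage relative to another. The function name and the voltages used are illustrative assumptions; only the proportionality itself comes from the slide.

```python
def relative_delay(vdd, vth, alpha, vdd_ref):
    """Delay at vdd divided by delay at vdd_ref, using
    delay proportional to VDD / (VDD - VTH)**alpha."""
    if vdd <= vth or vdd_ref <= vth:
        raise ValueError("the model applies only for VDD above VTH")

    def delay(v):
        return v / (v - vth) ** alpha

    return delay(vdd) / delay(vdd_ref)

# Illustrative values: VTH = 0.5 V, supply lowered from 1.5 V to 1.0 V.
for alpha in (1.0, 2.0):
    slowdown = relative_delay(1.0, 0.5, alpha, 1.5)
    print(f"alpha = {alpha}: delay grows by a factor of {slowdown:.3f}")
```

The larger exponent makes delay more sensitive to the same voltage drop, which is why the value of alpha matters when choosing a low-voltage operating point.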

Example Cont.

Dynamic power:

$$P_d = C V^2 f = C\,(1.5)^2 \times 2 \times 10^9 = 50\ \text{W}$$

C = 11.11 nF, capacitance switching per cycle

$$P_d = 11.11\,V^2 f \quad (\text{W, with } f \text{ in GHz})$$

Dynamic energy per cycle:

$$E_d = P_d/f = 11.11\,V^2 \quad (\text{nJ})$$

Example Cont.

Clock frequency:

$$f = k\,\frac{V - V_{TH}}{V} = k\,\frac{1.5 - 0.5}{1.5} = 2\ \text{GHz}$$

k = 3 GHz, a proportionality constant

$$f = \frac{3\,(V - 0.5)}{V}\ \text{GHz}$$

Example Cont.

Static power:

$$P_s = k' V^2 = k'\,(1.5)^2 = 50\ \text{W}$$

k' = 22.22 mho, total leakage conductance

$$P_s = 22.22\,V^2 \quad (\text{W})$$

Static energy per cycle:

$$E_s = P_s/f = \frac{22.22\,V^3}{3\,(V - 0.5)} = \frac{7.41\,V^3}{V - 0.5} \quad (\text{nJ})$$

Example Cont.

Total energy per cycle:

$$E = E_d + E_s = 11.11\,V^2 + \frac{7.41\,V^3}{V - 0.5}$$

To minimize E, set $\partial E/\partial V = 0$:

$$22.22\,V + \frac{7.41\,V^2\,(2V - 1.5)}{(V - 0.5)^2} = 0$$

Dividing by $7.41\,V$ and multiplying by $(V - 0.5)^2$ gives $3(V - 0.5)^2 + V(2V - 1.5) = 0$, that is,

$$5V^2 - 4.5V + 0.75 = 0$$

Solutions of the quadratic equation: V = 0.679 volt, 0.221 volt.
Discard the second solution, which is lower than the threshold voltage of 0.5 volt.

Example: Result

                        Rated mode   Low energy mode   Reduction
Voltage                 1.5 V        0.679 V           54.7%
Clock frequency         2 GHz        791 MHz           60%
Dynamic energy/cycle    25.00 nJ     5.12 nJ           79.52%
Static energy/cycle     25.00 nJ     12.96 nJ          48.16%
Total energy/cycle      50.0 nJ      18.08 nJ          63.84%
Dynamic power           50.0 W       4.05 W            91.90%
Static power            50.0 W       10.25 W           79.50%
Total power             100.0 W      14.30 W           85.70%
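The example can be checked numerically. The sketch below rebuilds the three model constants from the rated operating point, solves the condition dE/dV = 0 for the root above the threshold voltage, and prints the low-energy operating point. The function and variable names are my own; the model (f proportional to (V - V_TH)/V, static power proportional to V^2) is the one stated in the example.

```python
import math

# Rated operating point taken from the example.
V_RATED, F_RATED_GHZ, V_TH = 1.5, 2.0, 0.5
P_DYN_W, P_STAT_W = 50.0, 50.0

C_NF = P_DYN_W / (V_RATED**2 * F_RATED_GHZ)        # switched capacitance, about 11.11 nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)   # frequency constant, 3 GHz
G_MHO = P_STAT_W / V_RATED**2                      # leakage conductance, about 22.22 mho

def freq_ghz(v):
    """Maximum clock frequency at supply v (alpha-power law with alpha = 1)."""
    return K_GHZ * (v - V_TH) / v

def energies_nj(v):
    """Dynamic and static energy per cycle in nJ (W divided by GHz gives nJ)."""
    e_dyn = C_NF * v**2
    e_stat = G_MHO * v**2 / freq_ghz(v)
    return e_dyn, e_stat

def min_energy_voltage():
    """Root of dE/dV = 0 above V_TH.
    The condition reduces to 2C(V - Vth)^2 + g V (2V - 3Vth) = 0 with g = G/K."""
    g = G_MHO / K_GHZ
    a = 2 * C_NF + 2 * g
    b = -(4 * C_NF + 3 * g) * V_TH
    c = 2 * C_NF * V_TH**2
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]
    return [r for r in roots if r > V_TH][0]

v = min_energy_voltage()
e_dyn, e_stat = energies_nj(v)
f = freq_ghz(v)
print(f"V = {v:.3f} V, f = {1000 * f:.0f} MHz")
print(f"E_dyn = {e_dyn:.2f} nJ, E_stat = {e_stat:.2f} nJ, E_total = {e_dyn + e_stat:.2f} nJ")
print(f"P_dyn = {e_dyn * f:.2f} W, P_stat = {e_stat * f:.2f} W, P_total = {(e_dyn + e_stat) * f:.2f} W")
```

Small differences in the last digit relative to hand calculation come from carrying unrounded constants (for example 100/9 nF instead of 11.11 nF).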

Cycle Efficiency

- Cycle efficiency is a rating similar to the maximum clock frequency rating.
- Analogy:
  - Cycle efficiency is similar to miles per gallon (mpg).
  - Maximum clock frequency is similar to miles per hour (mph).
- Reference: A. Shinde and V. D. Agrawal, "Managing Performance and Efficiency of a Processor," Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.

Performance in Time

Performance is measured with respect to a program.

$$\text{Performance} = \frac{1}{\text{Execution time}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Performance in Energy (Efficiency)

Efficiency is measured with respect to a program.

$$\text{Efficiency} = \frac{\text{Performance}}{\text{Power consumption}} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Two Performances

$$\text{Time performance} = \frac{1}{\text{Execution time}}$$

$$\text{Energy performance} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Time Performance or Clock

- Speed of a processor is measured in cycles per second, or clock frequency (f).
- Execution time of a program using C clock cycles = C/f
- Time performance = f/C
- Clock period (1/f) is the time per cycle.

Energy Performance

- Energy efficiency of a processor may be measured in cycles per joule, or cycle efficiency (η).
- Energy dissipated by a program using C clock cycles = C/η
- Energy performance = η/C
- 1/η is the energy per cycle (EPC).

Characterizing Device Technology Speed and Efficiency

- Consider 90nm CMOS technology.
- Use the predictive technology model (PTM).
- Example circuit: eight-bit ripple carry adder.
- Nominal voltage = 1.2 volts.
- Simulation for varying operating conditions (VDD = 100 mV through 1.2 V) using Spice:
  - With random vectors for energy per cycle (EPC = 1/η).
  - With critical path vectors for clock period (1/f).
- Reference: W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45nm Early Design Exploration," IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816-2823, 2006.

Energy per Cycle of 8-Bit Adder

K. Kim, "Ultra Low Power CMOS Design," PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Cycle Time of 8-Bit Adder

K. Kim, "Ultra Low Power CMOS Design," PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M processor

- Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, "Thermal Response to DVFS: Analysis with an Intel Pentium M," Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
- VDD = 1.2 V
- Maximum clock rate = 1.8 GHz
- Critical path delay, td = 1/1.8 GHz = 555.56 ps
- Power consumption = 120 W
- EPC = 120 W/(1.8 GHz) = 66.67 nJ

Cycle Efficiency and Frequency

Example

For a program that executes in 1.8 billion clock cycles:

Voltage   Frequency f       Cycle efficiency η    Execution time   Total energy consumed   Power f/η
VDD       (megacycles/s)    (megacycles/joule)    (seconds)        (joules)                (watts)
1.2 V     1800              15                    1.0              120                     120
0.6 V     277               70                    6.5              25.7                    3.96
200 mV    54.5              660                   33               2.72                    0.083
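The three derived columns follow directly from the two ratings: execution time is C/f, energy is C/η, and power is f/η. A minimal sketch, with a helper name chosen here for illustration and the (f, η) pairs taken from the example:

```python
def run_metrics(cycles, f_hz, eta_cycles_per_joule):
    """Return (execution time in s, energy in J, average power in W)."""
    time_s = cycles / f_hz
    energy_j = cycles / eta_cycles_per_joule
    power_w = f_hz / eta_cycles_per_joule   # equals energy_j / time_s
    return time_s, energy_j, power_w

CYCLES = 1.8e9  # program length in clock cycles

# (frequency in cycles/s, cycle efficiency in cycles/joule) per supply voltage
operating_points = {
    "1.2 V": (1800e6, 15e6),
    "0.6 V": (277e6, 70e6),
    "200 mV": (54.5e6, 660e6),
}

for label, (f, eta) in operating_points.items():
    t, e, p = run_metrics(CYCLES, f, eta)
    print(f"{label:>7}: {t:6.2f} s  {e:7.2f} J  {p:7.3f} W")
```

Lowering the voltage trades execution time for energy: the program takes longer but each cycle costs less, so the total energy falls.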

Cycle Efficiency

- New performance rating: cycle efficiency η; its unit is cycles per joule.
- Clock frequency f, in cycles per second, is a similar rating with respect to time.
- Similarity to other popular ratings:
  - η → mpg
  - f → mph
- The two ratings allow effective time and energy management of an electronic system.

Problem of Process Variation in Nanometer Technologies

[Figure: distribution of the number of chips versus threshold voltage (lower Vth, Vth, higher Vth), bounded by a power specification and a clock specification. Labeled regions: yield loss due to high leakage; yield loss due to slow speed. Operating annotations: higher voltage operation, nominal voltage, lower voltage operation.]

From a presentation: Power Reduction using LongRun2 in Transmeta's Efficeon Processor, by D. Ditzel, May 17, 2006.

Clock Distribution H-Tree

[Figure: H-tree distributing the clock to the flip-flops.]

- Fanout, λ = 4
- Tree depth, $s = \log_{\lambda} N$
- No. of flip-flops = N

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where
  $C_L$ = total load capacitance of N flip-flops
  λ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because:
  (1) The clock is always active.
  (2) It makes two transitions per cycle (α = 2).
  (3) Clock gating is useful; inhibit the clock to unused blocks.

Properties of H-Tree

- Balanced clock skew.
- Small delay and power consumption.
- Requires fine-tuning for complex layout.

Clock Power and Delay

- Unit size buffer or inverter delay = d
- Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
- Total power consumption of clock network (fanout λ = 4):

Flip-flops, N   Clock power per flip-flop   Clock delay
1               P                           d
4               P                           4d
16              1.25P                       8d
64              1.3125P                     12d
256             1.328125P                   16d
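The clock power series can be evaluated for a few tree sizes with the sketch below. Two points are my own assumptions rather than statements from the slides: the tree depth is taken as the smallest s with λ^s ≥ N, and a single flip-flop (depth 0) is still charged one term of the series, since its own load must be driven.

```python
def tree_depth(n_flipflops, fanout=4):
    """Smallest depth s such that fanout**s >= n_flipflops."""
    depth, leaves = 0, 1
    while leaves < n_flipflops:
        leaves *= fanout
        depth += 1
    return depth

def relative_clock_power(n_flipflops, fanout=4):
    """P_clk divided by C_L * VDD^2 * f: sum of 1/fanout**n for n = 0 .. s-1.
    Assumption: at least one term, so that N = 1 gives a factor of 1."""
    stages = max(tree_depth(n_flipflops, fanout), 1)
    return sum(fanout ** -n for n in range(stages))

for n in (1, 4, 16, 64, 256):
    print(f"N = {n:3d}: depth = {tree_depth(n)}, power = {relative_clock_power(n)} x P")
```

With a fanout of 4 the series converges quickly toward 4/3, so the buffers of the distribution tree add at most about a third to the power needed to drive the flip-flop loads themselves.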

Clock Network Examples

                     Alpha 21064      Alpha 21164    Alpha 21264
Technology           0.75μ CMOS       0.5μ CMOS      0.35μ CMOS
Frequency (MHz)      200              300            600
Total capacitance    12.5 nF          -              Clock gating used.
Clock load           3.25 nF          3.75 nF        Total power 80-110 W
Clock power          40%              40% (20 W)     -
Max. clock skew      200 ps (<10%)    90 ps          -

D. W. Bailey and B. J. Benschneider, "Clocking Design and Analysis for a 600-MHz Alpha Microprocessor," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Architecture Level: Pipeline Gating

- A pipeline processor uses speculative execution.
  - Incorrect branch prediction results in pipeline stalls and wasted energy.
- Idea: Stop fetching instructions if a branch hazard is expected:
  - If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.
- Ref.: S. Manne, A. Klauser and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Annual International Symp. Computer Architecture, June 1998.
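The gating rule on this slide can be modeled as a small counter-based controller. The sketch below follows only the slide's wording (count M, threshold N, stall of k cycles); the class name, the reset of the count after gating, and the cycle-level interface are assumptions made for illustration, and the mechanism in the cited paper may differ in detail.

```python
class FetchGate:
    """Toy model: suspend instruction fetch for k cycles once the count of
    incorrect predictions exceeds a threshold N."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n
        self.stall_cycles_k = stall_cycles_k
        self.count_m = 0
        self.stall_left = 0

    def record_misprediction(self):
        self.count_m += 1
        if self.count_m > self.threshold_n:
            self.stall_left = self.stall_cycles_k
            self.count_m = 0  # assumption: the count restarts after gating

    def tick(self):
        """Advance one cycle; return True if fetch is allowed in this cycle."""
        if self.stall_left > 0:
            self.stall_left -= 1
            return False
        return True

gate = FetchGate(threshold_n=2, stall_cycles_k=3)
for _ in range(3):
    gate.record_misprediction()
print([gate.tick() for _ in range(5)])
```

Cycles in which `tick()` returns False are cycles in which no instructions enter the pipeline, which is where the energy saving comes from at the cost of some lost throughput.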

Slack Scheduling

- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:

    add  r0, r1, r2;     (A)
    sub  r3, r4, r5;     (B)
    and  r9, r1, r9;     (C)
    or   r5, r9, r10;    (D)
    xor  r2, r10, r11;   (E)

J. Casmira and D. Grunwald, "Dynamic Instruction Scheduling Slack," Proc. ACM Kool Chips Workshop, Dec. 2000.

Slack Scheduling Example

[Figure: execution order of instructions A, B, C, D and E under slack scheduling and under standard scheduling.]
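One way to see which instruction has slack is to compute, for each instruction, the earliest and latest step at which it can execute without lengthening the schedule. The sketch below does this for the five RISC instructions (A) to (E). It rests on assumptions that the slides do not state: the last operand of each instruction is the destination register, every instruction takes one step, and only read-after-write dependences matter.

```python
# (name, source registers, destination register)
# Assumption: the LAST operand of each instruction is its destination.
PROGRAM = [
    ("A", ("r0", "r1"), "r2"),
    ("B", ("r3", "r4"), "r5"),
    ("C", ("r9", "r1"), "r9"),
    ("D", ("r5", "r9"), "r10"),
    ("E", ("r2", "r10"), "r11"),
]

def true_dependences(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer, deps = {}, {}
    for name, sources, dest in program:
        deps[name] = {last_writer[r] for r in sources if r in last_writer}
        last_writer[dest] = name
    return deps

def slack(program):
    """Latest minus earliest start step for each instruction (unit latency)."""
    deps = true_dependences(program)
    order = [name for name, _, _ in program]
    asap = {}
    for n in order:  # earliest step: one after the latest producer
        asap[n] = 1 + max((asap[d] for d in deps[n]), default=-1)
    last_step = max(asap.values())
    alap = {n: last_step for n in order}
    for n in reversed(order):  # latest step: one before the earliest consumer
        for d in deps[n]:
            alap[d] = min(alap[d], alap[n] - 1)
    return {n: alap[n] - asap[n] for n in order}

print(slack(PROGRAM))
```

Under these assumptions only instruction A can be postponed by a step without delaying E, so it is the candidate for a slower, low-power execution unit.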

Slack Scheduling

[Figure: block diagram with scheduling logic, slack bit, low-power execution units and re-order buffer.]

Power Reduction Example

- Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
- Reduce voltage to 1.5 V, power (5.3x) = 4.9 W
- Eliminate FP, power (3x) = 1.6 W
- Scale 0.75μ → 0.35μ, power (2x) = 0.8 W
- Reduce clock load, power (1.3x) = 0.6 W
- Reduce frequency 200 → 160 MHz, power (1.25x) = 0.5 W
- J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
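The chain of reductions is a running division, which is easy to replay. In the sketch below the voltage factor is computed as (3.45/1.5)^2, consistent with dynamic power scaling as V^2; the remaining factors are the ones quoted on the slide, and the names are illustrative.

```python
# (step, power reduction factor)
STEPS = [
    ("Reduce voltage 3.45 V -> 1.5 V", (3.45 / 1.5) ** 2),  # about 5.3x
    ("Eliminate FP", 3.0),
    ("Scale 0.75u -> 0.35u", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 200.0 / 160.0),     # 1.25x
]

def reduced_power(initial_w, factors):
    """Divide the initial power by each reduction factor in turn."""
    power = initial_w
    for factor in factors:
        power /= factor
    return power

power = 26.0  # Alpha 21064 at 200 MHz and 3.45 V
for label, factor in STEPS:
    power /= factor
    print(f"{label:<34} {factor:5.2f}x -> {power:5.2f} W")
```

The voltage step alone accounts for more than half of the roughly 50x total reduction, measured on a logarithmic scale.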

For More on Microprocessors

- T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
- R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.