
Copyright Agrawal, 2007. ELEC5270/6270 Spring 13, Lecture 8

ELEC 5270/6270 Spring 2013
Low-Power Design of Electronic Circuits

Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr13/course.html

SIA Roadmap for Processors (1999)

Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)         180    130    100    70     50     35
Logic transistors/cm^2    6.2M   18M    39M    84M    180M   390M
Clock (GHz)               1.25   2.1    3.5    6.0    10.0   16.9
Chip size (mm^2)          340    430    520    620    750    900
Power supply (V)          1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)      90     130    160    170    175    183

Source: http://www.semichips.org

Power Reduction in Processors

Hardware methods:
  Voltage reduction for dynamic power
  Dual-threshold devices for leakage reduction
  Clock gating, frequency reduction
  Sleep mode
Architecture:
  Instruction set
  Hardware organization
Software methods

Performance Criteria

Throughput: computations per unit time.
Performance is the inverse of time: increasing CPU time indicates lower performance.
Power: computations per watt.
Energy efficiency: performance/joule.

SPEC CPU2006 Benchmarks

Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
It takes about 12 days to run all benchmarks on the reference system.
CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  Peak metric: each program is individually optimized (aggressive compilation).
  Base metric: common optimization for all programs.

SPEC CINT2006 Results
http://www.spec.org/cpu2006/results/cint2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric 36.6 base, 39.4 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric 15.8 base, 19.1 peak

SPEC CFP2006 Results
http://www.spec.org/cpu2006/results/cfp2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric 42.5 base, 45.8 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric 17.4 base, 21.5 peak

Other Benchmarks

LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
SPECpower_ssj2008 measures power and performance of a computer system.
  The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  http://www.spec.org/benchmarks.html#power

Second Quarter 2010 SPECpower_ssj2008 Results
http://www.spec.org/power_ssj2008/results/res2010q2/

Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  CPU: AMD Opteron 6174, 2.2 GHz
  Number of chips 2, cores 12, threads/core 2
  Total memory 16 GB
  ssj operations @ 100%: 888,819
  Average power @ 100%: 271 W
  Average power @ active idle: 101 W
  Overall ssj operations per watt: 2,355

May 19, 2010: Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads 2
  Total memory 12 GB
  ssj operations @ 100%: 914,076
  Average power @ 100%: 244 W
  Average power @ active idle: 62.3 W
  Overall ssj operations per watt: 2,938
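The two results can be compared at the full-load point with a few lines of Python. This is only a sketch on the figures quoted above: the helper names are made up for illustration, and the ratio computed here is operations per watt at 100% load, which is not the same as the overall ssj operations per watt (that metric aggregates several load levels, including active idle).

```python
# Figures quoted in the two SPECpower_ssj2008 results above.
results = {
    "HP ProLiant DL385 G7 (Opteron 6174)": {
        "ssj_ops": 888_819, "watts": 271.0, "idle_watts": 101.0},
    "Dell PowerEdge R610 (Xeon X5670)": {
        "ssj_ops": 914_076, "watts": 244.0, "idle_watts": 62.3},
}


def ops_per_watt_at_full_load(entry):
    """ssj operations per watt at the 100% load point only (illustrative helper)."""
    return entry["ssj_ops"] / entry["watts"]


def idle_to_full_power_ratio(entry):
    """Fraction of full-load power still drawn at active idle (illustrative helper)."""
    return entry["idle_watts"] / entry["watts"]


for name, entry in results.items():
    print(f"{name}: {ops_per_watt_at_full_load(entry):.0f} ops/W at 100% load, "
          f"active idle draws {idle_to_full_power_ratio(entry):.0%} of full-load power")
```

The gap between the full-load ratio and the overall rating reflects how much power each server still draws when it is lightly loaded or idle.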

Energy SPEC Benchmarks

Energy efficiency mode: besides the execution time, the energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Average power}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

Energy Efficiency

Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

where Efficiency_i is the efficiency for program i.

Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
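As a rough illustration of the averaging, the sketch below computes the per-program efficiency as (1/execution time)/average power, takes the geometric mean over the programs, and divides by the same quantity for a reference machine. All measurements in it are hypothetical numbers chosen for the example, and the function names are not from any SPEC tool.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def energy_efficiency(execution_time_s, average_power_w):
    """(1 / execution time) / average power, i.e. 1 / (energy in joules)."""
    return (1.0 / execution_time_s) / average_power_w


# Hypothetical data: (execution time in s, average power in W) for three programs.
machine = [(10.0, 20.0), (40.0, 25.0), (5.0, 16.0)]
reference = [(20.0, 40.0), (50.0, 40.0), (10.0, 32.0)]

eff_machine = geometric_mean([energy_efficiency(t, p) for t, p in machine])
eff_reference = geometric_mean([energy_efficiency(t, p) for t, p in reference])
relative_efficiency = eff_machine / eff_reference
print(f"relative efficiency = {relative_efficiency:.3f}")
```

Because a geometric mean is used, the relative efficiency equals the geometric mean of the per-program efficiency ratios, so the result does not depend on which machine is taken as the reference for normalizing individual programs.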

SPEC2000 Relative Energy Efficiency

[Figure: relative energy efficiency of three processors, Pentium M @ 1.6/0.6 GHz (energy-efficient processor), Pentium 4-M @ 2.4 GHz (reference), and Pentium III-M @ 1.2 GHz, in three modes: always max. clock, laptop adaptive clock, and min. power at min. clock.]

Voltage Scaling

Dynamic: Reduce voltage and frequency during idle or low activity periods.
Static: Clustered voltage scaling
  Logic on non-critical paths is given a lower voltage.
  47% power reduction with 10% area increase reported.
  M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization

Throughput = Operations / second

[Figure: processor throughput, with labels for compute-intensive processes, low throughput (background) processes, system idle, and maximum throughput.]

Examples of Processes

Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
Idle: no computation, no expected output.

Effects of Voltage Reduction

Voltage reduction increases delay, decreases throughput:
  Slow reduction in throughput at first
  Rapid reduction in throughput for V_DD ≤ V_th
  Time per operation (TPO) increases
Voltage reduction continues to reduce power consumption:
  Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)

[Figure: power and time per operation (TPO) plotted against V_DD / V_th over the range 1 to 5.]

Dynamic Voltage and Clock

                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

Example: Find Minimum Energy Mode

Processor data (rated operation):
  2 GHz clock
  1.5 volt supply voltage
  0.5 volt threshold voltage
  Power consumption:
    50 watts dynamic power
    50 watts static power
Maximum clock frequency for V volt supply (alpha-power law): f ∝ (V – V_TH)/V

Alpha-Power Law Model

Variation of delay with supply voltage:

$$\text{delay} \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}$$

V_TH = threshold voltage
α = 1 for short-channel devices, ≈ 2 for long-channel devices

T. Sakurai and A. R. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE Journal of Solid-State Circuits, Vol. 26, pp.122–131, Feb. 1991.

T. Sakurai and A. R. Newton, “A simple MOSFET model for circuit analysis,” IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp.887–894, Apr. 1991.

T. Sakurai, “High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited),” Proc. IEEE ISCAS, pp.1487–1490, Chicago, May 1993.

T. Sakurai, “Alpha-Power Law MOS Model,” IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4–5, Oct. 2004.
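The alpha-power law can be explored numerically with a small sketch like the one below. It only evaluates the proportionality, so results are expressed relative to a reference supply voltage. The function name and the list of trial voltages are illustrative choices; the threshold voltage (0.5 V), rated supply (1.5 V), rated clock (2 GHz), and α = 1 are taken from the example processor above.

```python
def relative_delay(vdd, vth, alpha, vdd_ref):
    """Delay at vdd divided by delay at vdd_ref, using delay ∝ VDD / (VDD - VTH)^alpha."""
    if vdd <= vth or vdd_ref <= vth:
        raise ValueError("the model applies only above the threshold voltage")
    delay = vdd / (vdd - vth) ** alpha
    delay_ref = vdd_ref / (vdd_ref - vth) ** alpha
    return delay / delay_ref


RATED_CLOCK_GHZ = 2.0  # rated clock of the example processor at 1.5 V

for vdd in (1.5, 1.2, 0.9, 0.7, 0.6):
    slowdown = relative_delay(vdd, vth=0.5, alpha=1.0, vdd_ref=1.5)
    print(f"VDD = {vdd:.1f} V: delay x{slowdown:.2f}, "
          f"max clock about {RATED_CLOCK_GHZ / slowdown:.2f} GHz")
```

With α = 1 the delay grows slowly near the rated voltage and then sharply as the supply approaches the threshold, which is the behavior described under Effects of Voltage Reduction.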

Example Cont.

Dynamic power:

$$P_d = C V^2 f = C\,(1.5)^2 \times 2 \times 10^9 = 50\ \text{W}$$

C = 11.11 nF, capacitance switching/cycle

$$P_d = 11.11\,V^2 f \quad (\text{watts, with } f \text{ in GHz})$$

Dynamic energy per cycle:

$$E_d = P_d / f = 11.11\,V^2 \quad (\text{nJ})$$

Clock frequency:

$$f = k\,\frac{V - V_{TH}}{V} = k\,\frac{1.5 - 0.5}{1.5} = 2\ \text{GHz}$$

k = 3 GHz, a proportionality constant

$$f = \frac{3\,(V - 0.5)}{V}\ \text{GHz}$$

Static power:

$$P_s = k' V^2 = k' (1.5)^2 = 50\ \text{W}$$

k' = 22.22 mho, total leakage conductance

$$P_s = 22.22\,V^2$$

Static energy per cycle:

$$E_s = P_s / f = \frac{22.22\,V^3}{3\,(V - 0.5)} = \frac{7.41\,V^3}{V - 0.5} \quad (\text{nJ})$$

Example Cont.

Total energy per cycle:

$$E = E_d + E_s = 11.11\,V^2 + \frac{7.41\,V^3}{V - 0.5}$$

To minimize E, ∂E/∂V = 0, or

$$5V^2 - 4.5V + 0.75 = 0$$

Solutions of quadratic equation:

V = 0.679 volt, 0.221 volt

Discard the second solution, which is lower than the threshold voltage of 0.5 volt.
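For reference, the intermediate steps behind the quadratic can be written out as follows, using only the energy expression E = 11.11 V² + 7.41 V³/(V – 0.5) given above and the fact that 22.22/7.41 ≈ 3.

$$\frac{\partial E}{\partial V} = 22.22\,V + 7.41\,\frac{3V^2 (V - 0.5) - V^3}{(V - 0.5)^2} = 0$$

Dividing by 7.41 V and multiplying by (V – 0.5)²:

$$3\,(V - 0.5)^2 + 3V(V - 0.5) - V^2 = 0 \;\;\Longrightarrow\;\; 5V^2 - 4.5V + 0.75 = 0$$

$$V = \frac{4.5 \pm \sqrt{4.5^2 - 4 \times 5 \times 0.75}}{2 \times 5} = \frac{4.5 \pm 2.291}{10} \approx 0.679\ \text{V},\ 0.221\ \text{V}$$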

Example: Result

                        Rated mode   Low energy mode   Reduction
Voltage                 1.5 V        0.679 V           54.7%
Clock frequency         2 GHz        791 MHz           60%
Dynamic energy/cycle    25.00 nJ     5.12 nJ           79.52%
Static energy/cycle     25.00 nJ     12.96 nJ          48.16%
Total energy/cycle      50.0 nJ      18.08 nJ          63.84%
Dynamic power           50.0 W       4.05 W            91.90%
Static power            50.0 W       10.25 W           79.50%
Total power             100.0 W      14.30 W           85.70%
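The numbers in this example can be re-derived with a short script. The sketch below rebuilds the three model constants from the rated operating point, solves ∂E/∂V = 0 in closed form, and evaluates energy and power at the minimum-energy voltage. Variable and function names are illustrative; the model itself (P_d = C V² f, P_s = k' V², f = k (V – V_TH)/V) is the one used in the example. Small differences in the last digit relative to the table come from the table using rounded constants.

```python
import math

V_RATED, F_RATED_GHZ, V_TH = 1.5, 2.0, 0.5   # rated operating point
P_DYN_RATED = P_STAT_RATED = 50.0             # watts

C_NF = P_DYN_RATED / (V_RATED ** 2 * F_RATED_GHZ)   # switched capacitance, about 11.11 nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)    # frequency constant, 3 GHz
G_MHO = P_STAT_RATED / V_RATED ** 2                 # leakage conductance, about 22.22 mho


def operating_point(v):
    """Frequency (GHz), energy per cycle (nJ) and power (W) at supply voltage v."""
    f = K_GHZ * (v - V_TH) / v
    e_dyn = C_NF * v ** 2          # nF * V^2 = nJ
    p_stat = G_MHO * v ** 2
    e_stat = p_stat / f            # W / GHz = nJ
    return {"f_GHz": f, "E_dyn_nJ": e_dyn, "E_stat_nJ": e_stat,
            "E_total_nJ": e_dyn + e_stat, "P_dyn_W": e_dyn * f,
            "P_stat_W": p_stat, "P_total_W": e_dyn * f + p_stat}


# dE/dV = 0 for E = C V^2 + r V^3 / (V - VTH), with r = G/K, reduces to
# (2C + 2r) V^2 - (4C + 3r) VTH V + 2C VTH^2 = 0.
r = G_MHO / K_GHZ
a = 2 * C_NF + 2 * r
b = -(4 * C_NF + 3 * r) * V_TH
c = 2 * C_NF * V_TH ** 2
disc = math.sqrt(b * b - 4 * a * c)
roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
v_min = next(v for v in roots if v > V_TH)   # discard the root below threshold

rated, low = operating_point(V_RATED), operating_point(v_min)
print(f"minimum-energy voltage = {v_min:.3f} V")
for key in rated:
    reduction = 100.0 * (1.0 - low[key] / rated[key])
    print(f"{key:>11}: {rated[key]:7.2f} -> {low[key]:6.2f}  ({reduction:.1f}% lower)")
```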

Cycle Efficiency

Cycle efficiency is a rating similar to the maximum clock frequency rating.
Analogy:
  Cycle efficiency is similar to miles per gallon (mpg)
  Maximum clock frequency is similar to miles per hour (mph)

Reference: A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.

Performance in Time

Performance is measured with respect to a program.

$$\text{Performance} = \frac{1}{\text{Execution time}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Performance in Energy (Efficiency)

Efficiency is measured with respect to a program.

$$\text{Efficiency} = \frac{\text{Performance}}{\text{Power consumption}} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Two Performances

Time performance:

$$\text{Time performance} = \frac{1}{\text{Execution time}}$$

Energy performance:

$$\text{Energy performance} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Time Performance or Clock

Speed of a processor is measured in cycles per second or clock frequency (f).
Execution time of a program using C clock cycles = C/f
Time performance = f/C
Clock period (1/f) is the time per cycle.

Energy Performance

Energy efficiency of a processor may be measured in cycles per joule or cycle efficiency (η).
Energy dissipated by a program using C clock cycles = C/η
Energy performance = η/C
1/η is energy per cycle (EPC)

Characterizing Device Technology Speed and Efficiency

Consider 90nm CMOS technology.
Use predictive technology model (PTM).
Example circuit: Eight-bit ripple carry adder.
Nominal voltage = 1.2 volts.
Simulation for varying operating conditions (VDD = 100mV through 1.2V) using Spice:
  With random vectors for energy per cycle (EPC = 1/η).
  With critical path vectors for clock period (1/f).

Reference: W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Early Design Exploration,” IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816–2823, 2006.

Energy per Cycle of 8-Bit Adder

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Cycle Time of 8-Bit Adder

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M processor

Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
VDD = 1.2 V
Maximum clock rate = 1.8 GHz
Critical path delay, td = 1/(1.8 GHz) = 555.56 ps
Power consumption = 120 W
EPC = 120 W/(1.8 GHz) = 66.67 nJ

Cycle Efficiency and Frequency

Example: for a program that executes in 1.8 billion clock cycles.

Voltage VDD   Frequency f          Cycle efficiency η      Execution time (s)   Total energy consumed   Power f/η
1.2 V         1800 megacycles/s    15 megacycles/joule     1.0                  120 joules              120 W
0.6 V         277 megacycles/s     70 megacycles/joule     6.5                  25 joules               3.96 W
200 mV        54.5 megacycles/s    660 megacycles/joule    33                   2.72 joules             0.083 W
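The three derived columns follow directly from the two ratings: execution time = C/f, total energy = C/η, and power = f/η. A minimal sketch, using the frequency and cycle-efficiency values listed for the three supply voltages (the function name is illustrative):

```python
CYCLES = 1.8e9  # program length in clock cycles, from the example

# (supply voltage, clock frequency in cycles/s, cycle efficiency in cycles/joule)
operating_points = [
    ("1.2 V", 1800e6, 15e6),
    ("0.6 V", 277e6, 70e6),
    ("200 mV", 54.5e6, 660e6),
]


def run_cost(cycles, f, eta):
    """Return (execution time in s, energy in J, average power in W)."""
    time_s = cycles / f
    energy_j = cycles / eta
    power_w = f / eta          # equals energy_j / time_s
    return time_s, energy_j, power_w


for label, f, eta in operating_points:
    time_s, energy_j, power_w = run_cost(CYCLES, f, eta)
    print(f"{label:>6}: time = {time_s:5.1f} s, energy = {energy_j:6.2f} J, "
          f"power = {power_w:7.3f} W")
```

Note that power is independent of program length, while time and energy both scale with the cycle count C.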

Cycle Efficiency

New performance rating: cycle efficiency η; the unit is cycles per joule.
Clock frequency f in cycles per second is a similar rating with respect to time.
Similarity to other popular ratings:
  η → mpg
  f → mph
Two ratings allow effective time and energy management of an electronic system.

Problem of Process Variation in Nanometer Technologies

[Figure: spread of chips from lower Vth to higher Vth at nominal voltage, bounded by the power specification and the clock specification. Labels: yield loss due to high leakage; yield loss due to slow speed.]

From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

Clock Distribution H-Tree

Fanout, λ = 4
Tree depth, s = log_λ N
No. of flip-flops = N

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where C_L = total load capacitance of N flip-flops
      λ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because:
(1) The clock is always active.
(2) It makes two transitions per cycle (α = 2).
Clock gating is useful: inhibit the clock to unused blocks.

Properties of H-Tree

Balanced clock skew.
Small delay and power consumption.
Requires fine-tuning for complex layout.

Clock Power and Delay

Unit size buffer or inverter delay = d
Total dynamic power supplied to N flip-flops, P = C_L V_DD^2 f
Total power consumption of clock network:

Flip-flops, N   Clock power per flip-flop   Clock delay
4               P                           4d
16              1.25P                       8d
64              1.3125P                     12d
256             1.328125P                   16d
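The clock power column is the partial sum of the geometric series 1 + 1/λ + 1/λ² + ... from the clock power formula, one term per stage of the tree. The sketch below evaluates it for fanout λ = 4, where s stages serve 4^s flip-flops. The delay rule used here, λ·d per stage, is an assumption read off the 4d, 8d, 12d, 16d pattern in the table, and the function names are illustrative.

```python
def clock_network_power(stages, fanout=4):
    """Clock network power in units of P = C_L * VDD^2 * f (sum of 1/fanout^n)."""
    return sum(1.0 / fanout ** n for n in range(stages))


def flip_flops(stages, fanout=4):
    """Number of flip-flops served by a tree of the given depth."""
    return fanout ** stages


def clock_delay(stages, fanout=4):
    """Delay in units of d, assuming each stage driving `fanout` loads costs fanout * d."""
    return stages * fanout


for s in range(1, 5):
    print(f"N = {flip_flops(s):3d}: power = {clock_network_power(s):.6f} P, "
          f"delay = {clock_delay(s)}d")

# For a deep tree the series approaches fanout / (fanout - 1) = 4/3 for fanout 4.
limit = 4 / (4 - 1)
```

The buffers therefore add at most about one third to the flip-flop load power at fanout 4, however deep the tree is, while the delay keeps growing with depth.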

Clock Network Examples

                     Alpha 21064     Alpha 21164    Alpha 21264
Technology           0.75μ CMOS      0.5μ CMOS      0.35μ CMOS
Frequency (MHz)      200             300            600
Total capacitance    12.5nF
Clock load           3.25nF          3.75nF
Clock power          40%             40% (20W)
Max. clock skew      200ps (<10%)    90ps

Alpha 21264: clock gating used. Total power 80-110W.

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Architecture Level: Pipeline Gating

A pipeline processor uses speculative execution.
  Incorrect branch prediction results in pipeline stalls and wasted energy.
Idea: Stop fetching instructions if a branch hazard is expected:
  If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.

Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
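The gating rule stated above can be mimicked with a toy controller. This is only a sketch of the rule as worded on the slide, not the mechanism of the cited paper: the class name, the per-cycle interface, the choice to reset the counter once fetch is gated, and the misprediction pattern in the usage example are all assumptions made for illustration.

```python
class PipelineGate:
    """Toy fetch-gating controller: gate fetch for k cycles once M exceeds N."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n
        self.stall_cycles_k = stall_cycles_k
        self.mispredict_count_m = 0
        self.stall_remaining = 0

    def record_misprediction(self):
        self.mispredict_count_m += 1
        if self.mispredict_count_m > self.threshold_n:
            self.stall_remaining = self.stall_cycles_k
            self.mispredict_count_m = 0   # assumption: counter restarts after gating

    def fetch_allowed(self):
        """Call once per cycle; returns False while instruction fetch is suspended."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


gate = PipelineGate(threshold_n=2, stall_cycles_k=3)
trace = []
for cycle in range(10):
    if cycle in (0, 1, 2):        # hypothetical mispredictions in cycles 0, 1 and 2
        gate.record_misprediction()
    trace.append(gate.fetch_allowed())
print(trace)
```

In this run the third misprediction pushes M above N = 2, so fetch is suspended for k = 3 cycles and then resumes.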

Slack Scheduling

Application: Superscalar, out-of-order execution:
  An instruction is executed as soon as the required data and resources become available.
  A commit unit reorders the results.
Delay the completion of instructions whose result is not immediately needed.
Example of RISC instructions:
  add  r0, r1, r2;     (A)
  sub  r3, r4, r5;     (B)
  and  r9, r1, r9;     (C)
  or   r5, r9, r10;    (D)
  xor  r2, r10, r11;   (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.

Slack Scheduling Example

[Figure: execution of instructions A, B, and C under standard scheduling and under slack scheduling.]

Slack Scheduling

[Figure: block diagram with labels re-order buffer, slack bit, and low-power execution units.]

Power Reduction Example

Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
Reduce voltage to 1.5V, power (5.3x) = 4.9W
Eliminate FP, power (3x) = 1.6W
Scale 0.75μ → 0.35μ, power (2x) = 0.8W
Reduce clock load, power (1.3x) = 0.6W
Reduce frequency 200 → 160MHz, power (1.25x) = 0.5W

J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
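The chain of reductions can be checked by dividing the starting power by each quoted factor in turn. The sketch below does that without rounding between steps, so its intermediate values differ slightly from the rounded figures listed above. It also checks that the first factor matches the V² dependence of dynamic power. The variable names are illustrative.

```python
START_POWER_W = 26.0  # Alpha 21064 at 200 MHz, 3.45 V

# (step, power reduction factor) as quoted in the example
steps = [
    ("Reduce voltage 3.45 V -> 1.5 V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 um -> 0.35 um", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 1.25),
]

power_w = START_POWER_W
history = [power_w]
for label, factor in steps:
    power_w /= factor
    history.append(power_w)
    print(f"{label:<34} /{factor:<5} -> {power_w:.2f} W")

# Dynamic power scales with V^2, which accounts for the first factor.
voltage_factor = (3.45 / 1.5) ** 2
# Frequency scaling alone accounts for the last factor.
frequency_factor = 200.0 / 160.0
print(f"V^2 factor = {voltage_factor:.2f}, frequency factor = {frequency_factor:.2f}")
```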

For More on Microprocessors

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.
