

Power dissipation limits have emerged as a major constraint in the design of microprocessors. At the low end of the performance spectrum, namely in the world of handheld and portable devices or systems, power has always dominated over performance (execution time) as the primary design issue. Battery life and system cost constraints drive the design team to consider power over performance in such a scenario.

Increasingly, however, power is also a key design issue in the workstation and server markets (see Gowan et al.1). In this high-end arena, the increasing microarchitectural complexities, clock frequencies, and die sizes push the chip-level, and hence the system-level, power consumption to such levels that traditionally air-cooled multiprocessor server boxes may soon need budgets for liquid-cooling or refrigeration hardware. This need is likely to cause a break point, with a step upward, in the ever-decreasing price-performance ratio curve. As such, a design team that considers power consumption and dissipation limits early in the design cycle, and can thereby adopt an inherently lower power microarchitectural line, will have a definite edge over competing teams.

Thus far, most of the work in the area of high-level power estimation has focused on the register-transfer-level (RTL) description in the processor design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported.2-8

Here, we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.

Power-performance efficiency

See the "Power-performance fundamentals" box. The most common (and perhaps obvious) metric to characterize the power-performance efficiency of a microprocessor is a simple ratio, such as MIPS (million instructions per second)/watts (W). This attempts to quantify efficiency by projecting the performance achieved or gained (measured in MIPS) for every watt of power consumed. Clearly, the higher the number, the "better" the machine is.

POWER-AWARE MICROARCHITECTURE: Design and Modeling Challenges for Next-Generation Microprocessors

David M. Brooks, Pradip Bose, Stanley E. Schuster, Hans Jacobson, Prabhakar N. Kudva, Alper Buyuktosunoglu, John-David Wellman, Victor Zyuban, Manish Gupta, Peter W. Cook
IBM T.J. Watson Research Center

THE ABILITY TO ESTIMATE POWER CONSUMPTION DURING EARLY-STAGE DEFINITION AND TRADE-OFF STUDIES IS A KEY NEW METHODOLOGY ENHANCEMENT. OPPORTUNITIES FOR SAVING POWER CAN BE EXPOSED VIA MICROARCHITECTURE-LEVEL MODELING, PARTICULARLY THROUGH CLOCK-GATING AND DYNAMIC ADAPTATION.

0272-1732/00/$10.00 © 2000 IEEE


While this approach seems a reasonable choice for some purposes, there are strong arguments against it in many cases, especially when it comes to characterizing higher end processors. Performance has typically been the key driver of such server-class designs, and cost or efficiency issues have been of secondary importance.

NOVEMBER–DECEMBER 2000

Power-performance fundamentals at the microarchitecture level

At the elementary transistor gate level, we can formulate total power dissipation as the sum of three major components: switching loss, leakage, and short-circuit loss:1-4

PW_device = (1/2) C V_DD V_swing a f + I_leakage V_DD + I_sc V_DD    (A)

Here, C is the output capacitance, V_DD is the supply voltage, f is the chip clock frequency, and a is the activity factor (0 < a < 1) that determines the device switching frequency. V_swing is the voltage swing across the output capacitor. I_leakage is the leakage current, and I_sc is the average short-circuit current. The literature often approximates V_swing as equal to V_DD (or simply V for short), making the switching loss around (1/2)CV^2af. Also, for current ranges of V_DD (say, 1 volt to 3 volts), the switching loss (1/2)CV^2af remains the dominant component.2 So as a first-order approximation for the whole chip we may formulate the power dissipation as

PW_chip = Σ_i (1/2) C_i V_i^2 a_i f_i    (B)

C_i, V_i, a_i, and f_i are unit- or block-specific average values. The summation is taken over all blocks or units i at the microarchitecture level (instruction cache, data cache, integer unit, floating-point unit, load-store unit, register files, and buses). For the voltage range considered, the operating frequency is roughly proportional to the supply voltage; C remains roughly the same if we keep the same design but scale the voltage. If a single voltage and clock frequency is used for the whole chip, the formula reduces to

PW_chip = V^3 Σ_i (K_vi a_i) = f^3 Σ_i (K_fi a_i)    (C)

where the K's are unit- or block-specific constants. If we consider the worst-case activity factor for each unit i, that is, if a_i = 1 for all i, then

PW_chip = K_v V^3 = K_f f^3    (D)

where K_v and K_f are design-specific constants. That equation leads to the so-called cube-root rule.2 This points to the single most efficient method for reducing power dissipation for a processor designed to operate at high frequency: reduce the voltage (and hence the frequency). This is the primary mechanism of power control in Transmeta's Crusoe chip (http://www.transmeta.com). There's a limit, however, on how much V_DD can be reduced (for a given technology), which has to do with manufacturability and circuit reliability issues. Thus, a combination of microarchitecture and circuit techniques to reduce power consumption, without necessarily employing multiple or variable supply voltages, is of special relevance.

Performance basics

The most straightforward metric for measuring performance is the execution time of a representative workload mix on the target processor. We can write the execution time as

T = PL × CPI × CT = PL × CPI × (1/f)    (E)

Here, PL is the dynamic path length of the program mix, measured as the number of machine instructions executed; CPI is the average processor cycles per instruction incurred in executing the program mix; and CT is the processor cycle time (measured in seconds per cycle), whose inverse determines the clock frequency f. Since performance increases with decreasing T, we may formulate performance PF as

PF_chip = K_pf × f = K_pv × V    (F)

Here, the K's are constants for a given microarchitecture-compiler implementation. The K_pf value stands for the average number of machine instructions executed per cycle on the machine being measured. PF_chip in this case is measured in MIPS.

Selecting a suite of publicly available benchmark programs that everyone accepts as being representative of real-world workloads is difficult. Adopting a noncontroversial weighted mix is also not easy. For the commonly used SPEC benchmark suite (http://www.specbench.org), the SPECmark rating for each class is derived as a geometric mean of execution-time ratios for the programs within that class. Each ratio is calculated as the speedup with respect to execution time on a specified reference machine. If we believe in the particular benchmark suite, this method has the advantage of allowing us to rank different machines unambiguously from a performance viewpoint. That is, we can show the ranking as independent of the reference machine used in such a formulation.

References
1. V. Zyuban, "Inherently Lower-Power High-Performance Superscalar Architectures," PhD dissertation, Dept. of Computer Science and Engineering, Univ. of Notre Dame, 2000.
2. M.J. Flynn et al., "Deep-Submicron Microprocessor Design Issues," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 11-22.
3. S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 23-29.
4. R. Gonzalez and M. Horowitz, "Energy Dissipation in General Purpose Microprocessors," IEEE J. Solid-State Circuits, Vol. 31, No. 9, Sept. 1996, pp. 1277-1284.


Specifically, a design team may well choose a higher frequency design point (which meets maximum power budget constraints) even if it operates at a much lower MIPS/W efficiency compared to one that operates at better efficiency but at a lower performance level. As such, (MIPS)2/W or even (MIPS)3/W may be the metric of choice at the high end.

On the other hand, at the lowest end, where battery life (or energy consumption) is the primary driver, designers may want to put an even greater weight on the power aspect than the simplest MIPS/W metric does. That is, they may just be interested in minimizing the power for a given workload run, irrespective of the execution-time performance, provided the latter doesn't exceed some specified upper limit.

The MIPS metric for performance and the watts value for power may refer to average or peak values, derived from the chip specifications. For example, for a 1-gigahertz (10^9 cycles/sec) processor that can complete up to 4 instructions per cycle, the theoretical peak performance is 4,000 MIPS. If the average completion rate for a given workload mix is p instructions/cycle, the average MIPS would equal 1,000 times p. However, when it comes to workload-driven evaluation and characterization of processors, metrics are often controversial. Apart from the problem of deciding on a representative set of benchmark applications, fundamental questions persist about ways to boil down performance into a single (average) rating that's meaningful in comparing a set of machines.

Since power consumption varies depending on the program being executed, the benchmarking issue is also relevant in assigning an average power rating. In measuring power and performance together for a given program execution, we may use a fused metric such as the power-delay product (PDP) or energy-delay product (EDP).9 In general, the PDP-based formulations are more appropriate for low-power, portable systems in which battery life is the primary index of energy efficiency. The MIPS/W metric is an inverse PDP formulation, where delay refers to average execution time per instruction. PDP, being dimensionally equal to energy, is the natural metric for such systems.

For higher end systems (workstations), the EDP-based formulation is more appropriate, since the extra delay factor ensures a greater emphasis on performance. The (MIPS)2/W metric is an inverse EDP formulation. For the highest performance server-class machines, it may be appropriate to weight the delay part even more. This would point to the use of (MIPS)3/W, which is an inverse ED2P formulation. Alternatively, we may use (CPI)3 × W (the cube of cycles per instruction times power) as a direct ED2P metric applicable on a per-instruction basis.

The energy × (delay)2 metric, or performance3/power formula, is analogous to the cube-root rule,10 which follows from constant voltage-scaling arguments (see Equations A through D in the "Power-performance fundamentals" box). Clearly, to formulate a voltage-invariant power-performance characterization metric, we need to think in terms of performance3/power.
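As a toy illustration of how the choice among these fused metrics changes conclusions, the following Python sketch ranks three hypothetical chips under MIPS/W, (MIPS)2/W, and (MIPS)3/W. The MIPS and watts numbers are invented, not measurements of real processors.

```python
# Hypothetical illustration: rank three invented processors under the
# MIPS^x/W family of fused power-performance metrics (x = 1, 2, 3).

chips = {
    # name: (MIPS, watts) -- invented numbers for illustration only
    "low-power": (400, 1.0),
    "midrange": (900, 4.0),
    "high-end": (1600, 12.0),
}

def efficiency(mips, watts, exponent):
    """MIPS^x per watt: x=1 is inverse PDP, x=2 inverse EDP, x=3 inverse ED2P."""
    return mips ** exponent / watts

for x in (1, 2, 3):
    ranking = sorted(chips, key=lambda c: efficiency(*chips[c], x), reverse=True)
    print(f"MIPS^{x}/W ranking: {ranking}")
```

With these numbers, the low-power chip wins under MIPS/W while the high-end chip wins under (MIPS)3/W, which is exactly the point: the "most efficient" processor depends on how heavily the metric weights delay.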

POWER-AWARE MICROARCHITECTURE

IEEE MICRO

Figure 1. Performance-power efficiencies across commercial processor products. The newer processors appear on the left. (a) SpecInt/W, SpecInt2/W, and SpecInt3/W; (b) SpecFp/W, SpecFp2/W, and SpecFp3/W; each plotted as the improvement over the worst performer for the Intel Pentium III, AMD Athlon, HP PA-8600, IBM Power3, Compaq 21264, Motorola PPC740, Intel Celeron, MIPS R12000, Sun USII, and Hal Sparc64 III.


When we are dealing with the SPEC benchmarks, we may evaluate efficiency as (SPECrating)x/W, or (SPEC)x/W for short, where the exponent value x (equaling 1, 2, or 3) may depend on the class of processors being compared.

Figures 1a and 1b show the power-performance efficiency data for a range of commercial processors of approximately the same generation. (See http://www.specbench.org, the Aug. 2000 Microprocessor Report, and individual vendor Web pages, for example, http://www.intel.com/design/pentiumiii/datashts/245264.htm.)

In Figures 1a and 1b, the latest available processor is plotted on the left and the oldest on the right. We used SPEC/W, SPEC2/W, and SPEC3/W as alternative metrics, where SPEC stands for the processor's SPEC rating. (See earlier definition principles.) For each category, the worst performer is normalized to 1, and other processor values are plotted as improvement factors over the worst performer.

The data validates our assertion that, depending on the metric of choice and the target market (determined by workload class and/or the power/cost), the conclusion drawn about efficiency can be quite different. For performance-optimized, high-end processors, the SPEC3/W metric seems fairest, with the very latest Intel Pentium III and AMD Athlon offerings (at 1 GHz) at the top of the class for integer workloads. The older HP PA-8600 (552 MHz) and IBM Power3 (450 MHz) still dominate in the floating-point class. For power-first processors targeted toward integer workloads (such as Intel's mobile Celeron at 333 MHz), SPEC/W seems the fairest.

Note that we've relied on published performance and "max power" numbers; and, because of differences in the methodologies used in quoting the maximum power ratings, the derived rankings may not be completely accurate or fair. This points to the need for standardized methods of reporting maximum and average power ratings for future processors, so customers can compare power-performance efficiencies across competing products in a given market segment.

Microarchitecture-level power estimation

Figure 2 shows a block diagram of the basic procedure used in the power-performance simulation infrastructure (PowerTimer) at IBM Research. Apart from minor adaptations specific to our existing power and performance modeling methodology, at a conceptual level it resembles the recently published methods used in Princeton University's Wattch,4 Penn State's SimplePower,5 and George Cai's model at Intel.8

The core of such models is a classical trace- or execution-driven, cycle-by-cycle performance simulator. In fact, the power-performance models4-6,8 are all built upon Burger and Austin's widely used, publicly available, parameterized SimpleScalar performance simulator.11

We are building a tool set around the existing research- and production-level simulators12,13 used in the various stages of the definition and design of high-end PowerPC processors. The nature and detail of the energy models used in conjunction with the workload-driven cycle simulator determine the key difference for power projections.

During every cycle of the simulated processor operation, the activated (or busy) microarchitecture-level units or blocks are known from the simulation state. Depending on the particular workload (and the execution snapshot within it), a fraction of the processor units and queue/buffer/bus resources are active at any given cycle.

We can use these cycle-by-cycle resource usage statistics, available easily from a trace- or execution-driven performance simulator, to estimate the unit-level activity factors. If accurate energy models exist for each modeled resource, then on any cycle in which unit i is accessed or used, we can estimate the corresponding energy consumed and add it to the net energy spent overall for that unit. So, at the end of a simulation, we can estimate the total energy spent on a unit basis as well as for the whole processor.
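The accounting loop just described can be sketched as follows. This is a minimal illustration, not the actual PowerTimer code; the unit names, per-access energies, and the four-cycle busy trace are all invented.

```python
# Minimal sketch of unit-level energy accounting driven by a cycle-by-cycle
# record of which units are busy. The per-access energies and the trace
# below are invented values for illustration only.

# Assumed per-access energy (joules) for each modeled unit.
ENERGY_PER_ACCESS = {"FXU": 0.4e-9, "FPU": 0.9e-9, "LSU": 0.6e-9}

def accumulate_energy(busy_trace):
    """busy_trace: iterable of per-cycle sets naming the units active that cycle."""
    totals = {unit: 0.0 for unit in ENERGY_PER_ACCESS}
    cycles = 0
    for active_units in busy_trace:
        cycles += 1
        for unit in active_units:
            # Charge unit i its access energy on every cycle it is used.
            totals[unit] += ENERGY_PER_ACCESS[unit]
    return totals, cycles

trace = [{"FXU"}, {"FXU", "LSU"}, set(), {"FPU"}]
totals, cycles = accumulate_energy(trace)
chip_energy = sum(totals.values())
# Average power = total energy / elapsed time; assume a 1-GHz clock (1-ns cycle).
avg_power = chip_energy / (cycles * 1e-9)
print(totals, avg_power)
```

At the end of the run, `totals` gives the per-unit energy breakdown and `avg_power` the whole-chip average, mirroring the unit-level and chip-level estimates described above.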


Figure 2. PowerTimer high-level block diagram. A program executable or trace drives a cycle-by-cycle performance simulator configured by microarchitecture parameters; energy models, configured by circuit/technology parameters, are coupled to the simulator to produce both a performance estimate and a power estimate.


Since the unit-level and overall execution cycle counts are available from the performance simulation, and since the technology implementation parameters (voltage, frequency) are known, we can estimate average and peak power dissipation profiles for the whole chip, in the context of the input workload. Of course, this kind of analysis implicitly assumes that we can use clock-gating techniques at the circuit level (for example, see Tiwari et al.14) to selectively turn off dynamic power consumption in inactive functional units and on-chip storage/communication resources.

Figure 3 shows the unit-level usage data for some of the key functional units, based on a detailed, cycle-accurate simulation of a processor similar in complexity and functionality to a current-generation superscalar such as the Power4 processor.15 The data plot shown was obtained by running simulations using a representative sampled trace repository. This covers a range of benchmark workloads (selections from SPEC95 and TPC-C) commonly used in such designs. The functional units tracked in this graph are the floating-point unit (FPU), integer (fixed-point) unit (FXU), branch unit (BRU), load-store unit (LSU), and "logic on condition register" unit (CRU).

This simulation data indicates the potential power savings gained by selectively "gating off" unused processor segments during the execution of a given workload mix. For example, the FPU is almost totally inactive for gcc, go, ijpeg, li, vortex, and tpcc-db2. The maximum unit use is about 50% (FXU for m88ksim and vortex).

The potential here is especially significant if we consider the distribution of power within a modern processor chip. In an aggressive superscalar processor that doesn't employ any form of clock gating, a design's clock tree, drivers, and clocked latches (pipeline registers, buffers, queues) can consume up to 70% of the total power. Even with the registers and cache structures accounted for separately, the clock-related power for a recent high-performance Intel CPU is reported to be almost half the total chip power.14 Thus, the use of gated clocks to disable latches (pipeline stages) and entire units when possible is potentially a major source of power savings.

Depending on the exact hardware algorithm used to gate off clocks, we can depict different characteristics of power-performance optimization. In theory, for example, if the clock can be disabled for a given unit (perhaps even at the individual pipeline-stage granularity) on every cycle that it's not used, we'd get the best power efficiency without losing any or much performance. However, such idealized clock gating poses many implementation challenges for designers of high-end processors due to considerations such as

• the need to include additional on-chip decoupling capacitance to deal with large current swings (L × [di/dt] effects), where L stands for inductance and di/dt denotes current gradients over time;

• more-complicated timing analysis to deal with additional clock skew and the possibility of additional gate delays on critical timing paths; and

• more effort to test and verify the microprocessor.

Despite these additional design challenges, some commercial high-end processors have begun to implement clock gating at various levels of granularity, for example, the Intel Pentium series and Compaq Alpha 21264. For our server-class microprocessors, the need for pervasive use of clock gating has thus far not outweighed the additional design complexity. For the future, however, we continue to look at methods to implement clock gating at a reduced design cost. Simpler designs may implement gated clocks in localized hot spots or in league with dynamic feedback data. For example, upon detecting that the power-hungry floating-point unit has been inactive over the last, say, 1,000 cycles, the execution pipes within that unit may be gated off in stages, a pipe at a time, to minimize current surges. (See Pant et al.16)
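A controller of the kind just described, counting idle cycles and then gating the FPU's pipes off a stage at a time, might be sketched as follows. This is a hypothetical policy sketch: the threshold, pipe count, and class interface are our invention for illustration, not a description of any shipping design.

```python
# Sketch of a counter-based clock-gating policy: after the FPU has been idle
# for IDLE_THRESHOLD consecutive cycles, gate its execution pipes off one per
# cycle to limit current surges. All parameters are illustrative assumptions.

IDLE_THRESHOLD = 1000
NUM_PIPES = 4

class FpuClockGater:
    def __init__(self):
        self.idle_cycles = 0
        self.gated_pipes = 0  # how many execution pipes are currently clock-gated

    def tick(self, fpu_busy):
        if fpu_busy:
            # Any FPU activity re-enables all pipes and resets the idle counter.
            self.idle_cycles = 0
            self.gated_pipes = 0
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= IDLE_THRESHOLD and self.gated_pipes < NUM_PIPES:
                self.gated_pipes += 1  # stage the shutdown, one pipe per cycle

gater = FpuClockGater()
for _ in range(IDLE_THRESHOLD + 2):
    gater.tick(fpu_busy=False)
print(gater.gated_pipes)  # pipes begin gating once the threshold is crossed
```

Staging the shutdown one pipe per cycle is the point of the design: it spreads the change in supply current over several cycles, easing the L × [di/dt] concern listed above.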

Also, for a given resource such as a register file, enabling only the segment(s) in use, depending on the number of ports that are active at a given time, might conserve power. Alternatively, at a coarser grain, an entire resource may be powered on or off, depending on demand.

Figure 3. Unit-level usage data for a current-generation server-class superscalar processor. The plot shows the execution pipe use ratio (from 0 to about 0.6) of the FXU, FPU, LSU, BRU, and CRU units for the compress, fpppp, gcc, go, ijpeg, li, m88ksim, perl, vortex, and tpcc-db2 workloads.

Of course, even when a unit or part of it is gated off, the disabled part may continue to burn a small amount of static power. To measure the overall advantage, we must carefully account for this and other overhead power consumed by the circuitry added to implement gated-clock designs. In Wattch,4 Brooks et al. report some of the observed power-performance variations with different mechanisms of clock-gating control. Other dynamic adaptation schemes (see the "Adaptive microarchitectures" box) for on-chip resources to conserve power are also of interest in our ongoing power-aware microarchitecture research.

Asynchronous circuit techniques such as interlocked pipeline CMOS (IPCMOS) provide the benefits of fine-grain clock gating as a natural and integrated design aspect. (See the "Low-power, high-performance circuit techniques" box.) Such techniques have the further advantage that current surges are automatically minimized because each stage's clock is


Adaptive microarchitectures

Current-generation microarchitectures are usually not adaptive; that is, they use fixed-size resources and a fixed functionality across all program runs. The size choices are made to achieve the best overall performance over a range of applications. However, an individual application whose requirements are not well matched to this particular hardware organization may exhibit poor performance. Even a single application run may exhibit enough variability during execution to result in uneven use of the chip resources. The result may often be low unit use (see Figure 3 in the main text), unnecessary power waste, and longer access latencies (than required) for storage resources. Albonesi et al.1,2 in the CAP (Complexity-Adaptive Processors) project at the University of Rochester have studied the power and performance advantages of dynamically adapting on-chip resources in tune with changing workload characteristics. For example, cache sizes can be adapted on demand, and a smaller cache size can burn less power while allowing faster access. Thus, the power and performance attributes of a (program, machine) pair can both be enhanced via dynamic adaptation.

Incidentally, a static solution that attempts to exploit the fact that most cache references can be trapped by a much smaller "filter cache"3 can also save power, albeit at the expense of a performance hit caused by the added latency of accessing the main cache.

Currently, as part of the power-aware microarchitecture research project at IBM, we are collaborating with the Albonesi group in implementing aspects of dynamic adaptation in a prototype research processor. We've finished the high-level design (with simulation-based analysis at the circuit and microarchitecture levels) of an adaptive issue queue.4 The dynamic workload behavior controls the queue resizing. The transistor and energy overhead of the added counter-based monitoring and decision control logic is less than 2%.

Another example of dynamic adaptation is one in which on-chip power (or temperature) and/or performance levels are monitored and fed back to control the progress of later instruction processing. Manne et al.5 proposed a mechanism called "pipeline gating" to control the degree of speculation in modern superscalars, which employ branch prediction. This mechanism allows the machine to stall specific pipeline stages when it's determined that the processor is executing instructions in an incorrectly predicted branch path. This determination is made using "confidence estimation" hardware to assess the quality of each branch prediction. Such methods can drastically reduce the count of misspeculated executions, thereby saving power. Manne5 shows that such power reductions can be effected with minimal loss of performance. Brooks and colleagues6 have used power dissipation history as a proxy for temperature in determining when to throttle the processor. This approach can help put a cap on maximum power dissipation in the chip, as dictated by the workload, at acceptable performance levels. It can help reduce the thermal packaging and cooling solution cost for the processor.

References
1. D.H. Albonesi, "The Inherent Energy Efficiency of Complexity-Adaptive Processors," presented at the ISCA-25 Workshop on Power-Driven Microarchitecture, 1998; http://www.ccs.rochester.edu/projects/cap/cappublications.htm.
2. R. Balasubramonian et al., "Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures," to be published in Proc. 33rd Ann. Int'l Symp. Microarchitecture (Micro-33), IEEE Computer Society Press, Los Alamitos, Calif., 2000.
3. J. Kin, M. Gupta, and W. Mangione-Smith, "The Filter Cache: An Energy-Efficient Memory Structure," Proc. IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1997, pp. 184-193.
4. A. Buyuktosunoglu et al., "An Adaptive Issue Queue for Reduced Power at High Performance," IBM Research Report RC 21874, Yorktown Heights, New York, Nov. 2000.
5. S. Manne, A. Klauser, and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Ann. Int'l Symp. Computer Architecture (ISCA-25), IEEE CS Press, 1998, pp. 132-141.
6. D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," to be published in Proc. Seventh Int'l Symp. High-Performance Computer Architecture (HPCA-7), IEEE CS Press, 2001.
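The pipeline-gating idea from the box (stall fetch when too many low-confidence branch predictions are in flight) can be illustrated with a small sketch. The counter threshold and the interface below are invented for illustration; this is not the hardware design of Manne et al.

```python
# Toy sketch of pipeline gating via confidence estimation: track how many
# low-confidence branch predictions are unresolved, and stall fetch when the
# count exceeds a threshold. Threshold and interface are assumptions.

LOW_CONFIDENCE_LIMIT = 2  # stall fetch beyond this many risky branches in flight

class PipelineGate:
    def __init__(self):
        self.low_conf_in_flight = 0

    def on_branch_predicted(self, confident):
        # The confidence estimator tags each prediction as confident or not.
        if not confident:
            self.low_conf_in_flight += 1

    def on_branch_resolved(self, confident):
        if not confident:
            self.low_conf_in_flight -= 1

    def fetch_stalled(self):
        # Gating: too many risky branches means fetch is likely wrong-path work.
        return self.low_conf_in_flight > LOW_CONFIDENCE_LIMIT

gate = PipelineGate()
for confident in (False, False, False):  # three risky branches enter the pipe
    gate.on_branch_predicted(confident)
print(gate.fetch_stalled())  # True: stop fetching until some branches resolve
```

The energy saving comes from the instructions never fetched and executed down wrong paths; the performance cost is small when the stall triggers only on genuinely low-confidence predictions.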


generated locally and asynchronously with respect to other pipeline stages. Thus, techniques such as IPCMOS have the potential of exploiting the unused execution pipe resources (depicted in Figure 3) to the fullest. They burn clocked-latch power strictly on demand at every pipeline stage, without the inductive noise problem.

The usefulness of microarchitectural power estimators hinges on the accuracy of the underlying energy models. We can formulate such models for given functional unit blocks (the integer or floating-point execution data path), storage structures (cache arrays, register files, or buffer space), or communication


Stanley Schuster, Peter Cook, Hans Jacobson, and Prabhakar KudvaCircuit techniques that reduce power consumption while sustaining

high-frequency operation are a key to future power-efficient processors.Here, we mention a couple of areas where we’ve made recent progress.

IPCMOSChip performance, power, noise, and clock synchronization are becom-

ing formidable challenges as microprocessor performances move intothe GHz regime and beyond. Interlocked Pipelined CMOS (IPCMOS),1 anew asynchronous clocking technique, helps address these challenges.

Figure A (reproduced from Schuster et al.1) shows a typical block (D)interlocked with all the blocks with which it interacts. In the forwarddirection, dedicated Valid signals emulate the worst-case path througheach driving block and thus determine when data can be latched withinthe typical block. In the reverse direction, Acknowledge signals indicatethat data has been received by the subsequent blocks and that new datamay be processed within the typical block. In this interlocked approachlocal clocks are generated only when there’s an operation to perform.

Measured results on an experimental chip demonstrate robust oper-ation for IPCMOS at 3.3 GHz under typical conditions and 4.5 GHz underbest-case conditions in 0.18-micron, 1.5-V CMOS technology. Since thelocally generated clocks for each stage are active only when the datasent to a given stage is valid, power is conserved when the logic blocksare idling. Furthermore, with the simplified clock environment it’s possi-ble to design a very simple single-stage latch that can capture and launchdata simultaneously without the danger of a race.

The general concepts of interlocking, pipelining and asynchronousself-timing are not new and have been proposed in a variety of forms.2,3

However the techniques used in those approaches are too slow, espe-cially for macros that receive data from many separate logic blocks. IPC-MOS achieves high-speed interlocking by combining the function of astatic NOR and an input switch to perform a unique cycle-dependentAND function. Every local clock circuit has a strobe circuit that imple-ments the asynchronous interlocking between stages. (See Schuster etal.1 for details). A significant power reduction results when there’s nooperation to perform and the local clocks turn off. The clock transitionsare staggered in time, reducing the peak di/dt—and therefore noise—compared to a conventional approach with a single global clock. The IPC-MOS circuits show robust operation with large variations in power supplyvoltage, operating temperature, threshold voltage, and channel length.

Low-power memory structures

With ever-increasing cache sizes and out-of-order issue structures, the energy dissipated by memory circuitry is becoming a significant part of a microprocessor's total on-chip power consumption.

Random access memories (RAMs) found in caches and register files consume power mainly due to the high-capacitance bitlines on which data is read and written. To allow fast access to large memories, these bitlines are typically implemented as precharged structures with sense amplifiers. Such structures, combined with the high load on the bitlines, tend to consume significant power.

Techniques typically used to improve access time, such as dividing the memory into separate banks, can also serve as effective means to reduce power consumption. Only the sub-bitline in one bank then needs to be precharged and discharged on a read, rather than the whole bitline, saving energy. Dividing often-unused bitfields (for example, immediates) into segments and gating the wordline drive buffers with valid signals associated with these segments may in some situations further save power by avoiding having to drive parts of the high-capacitance wordline and associated bitlines.
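The banking saving described above can be sketched with a toy dynamic-energy estimate (roughly C x V^2 per full precharge/discharge of a bitline). All capacitance, voltage, and geometry numbers below are illustrative assumptions, not figures from this article:

```python
# Toy estimate of read energy for a precharged bitline, before and after
# dividing the array into banks. All parameters are assumed placeholders.

def bitline_energy(cap_farads, vdd=1.5):
    """Energy (J) for one full precharge/discharge cycle, ~ C * V^2."""
    return cap_farads * vdd ** 2

CELL_CAP = 2e-15   # assumed bitline load contributed per cell (2 fF)
ROWS = 1024        # cells sharing one full-length bitline
BANKS = 8          # sub-banks after division

full_bitline = bitline_energy(CELL_CAP * ROWS)
sub_bitline = bitline_energy(CELL_CAP * (ROWS // BANKS))

# Only one sub-bitline swings per read, so energy drops by ~BANKS times.
savings_factor = full_bitline / sub_bitline
```

Under these simplifications the bitline energy scales directly with the fraction of the bitline that actually swings, which is the effect banking exploits.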

Content addressable memories (CAMs)4 store data and provide an associative search of their contents. Such memories are often used to implement register files with renaming support and issue queue structures that require support for operand and instruction

Low-power, high-performance circuit techniques

Figure A. Interlocking at block level. (Blocks A, B, and C each send DATA and VALID signals to a typical block D and receive an ACK signal from it; block D in turn sends DATA and VALID to blocks E and F and receives their ACKs.)


bus structures (instruction dispatch bus or result bus) using either

• circuit-level or RTL simulation of the corresponding structures, with circuit and technology parameters germane to the particular design, or
• analytical models or equations that formulate the energy characteristics in terms of the design parameters of a particular unit or block.

Prior work4,6 discusses the formulation of analytical capacitance equations that describe microarchitecture building blocks. Brooks et al.,4 for example, provides four categories of structures modeled for power: 1) array structures, including caches, register files, and branch prediction tables; 2) content-addressable memory (CAM) structures, including those used in issue queue logic and translation look-aside buffers (TLBs); 3) combinational logic blocks and wires, including the various functional units and buses; and 4) clocking structures, including the corresponding buffers, wires, and capacitive loads.

Brooks et al. validated this methodology for forming energy models by comparing the energy models for array structures against schematic and layout-level circuit extractions of array structures derived from a commercial Intel microprocessor. The models were found to be accurate within 10%. This is similar to the accuracy reported by analytical cache delay models that use a similar methodology. On the other hand, since the core simulation occurred at the microarchitectural abstraction, the speed was 1,000 times faster when compared to existing layout-level power estimation tools.

If detailed power and area measurements for a given chip are available, we can build reasonably accurate energy models based on power density profiles across the various units. We can use such models to project expected power behavior for follow-on processor design points by using standard technology scaling factors. These power-density-based energy models have been used in conjunction with trace-driven simulators for power analysis and trade-off studies in some Intel processors.8 We've developed two types of energy models: 1) power-density-based models for design lines with available power and area measurements from implemented chips, and 2) analytical models in terms of microarchitecture-level design parameters such as issue width, number of physical registers, pipeline stage lengths, misprediction penalties, cache geometry parameters, queue/buffer lengths, and so on.
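A power-density-based model of the first type can be sketched as a simple projection from measured data; the baseline unit power and the scaling factors below are placeholder assumptions for illustration, not values from the article:

```python
# Project a unit's power to a follow-on design point from measured data,
# using standard technology scaling: dynamic power ~ C * V^2 * f, with
# switched capacitance tracking the scaled area.

def project_power(measured_watts, area_scale, vdd_scale, freq_scale):
    """Scale a measured unit power by area, supply voltage, and frequency."""
    return measured_watts * area_scale * vdd_scale ** 2 * freq_scale

# Hypothetical shrink: 0.7x linear dimensions (0.49x area), 0.8x Vdd,
# 1.5x clock frequency, applied to an assumed 10-W measured unit.
baseline_unit_watts = 10.0
projected = project_power(baseline_unit_watts, 0.49, 0.8, 1.5)
```

In a real flow, each unit's measured power/area pair would be scaled this way and the per-unit projections summed against simulated utilization.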

We formulated the analytical energy behav-

NOVEMBER–DECEMBER 2000

tag matching. The high energy dissipation of CAM structures physically limits the size of such memories.4

The high power consumption of CAMs is a result of the way the memory search is performed. The search logic for a CAM entry typically consists of a matchline to which a set of wired-XNOR functions is connected. These XNOR functions implement the bitwise comparison of the memory word and an externally supplied data word. The matchline is precharged high and is discharged if a mismatch occurs. In the use of CAMs for processor structures such as the issue queue, mismatches are predominant. Significant power is thus consumed without performing useful work. In addition, precharge-structured logic requires that the high-capacitance taglines that drive the match transistors for the externally supplied word be cleared before the matchline can be precharged. This results in further waste of power.

AND-structured CAM match logic has been proposed to reduce power dissipation, since it discharges the matchline only if there's actually a match. However, AND logic is inherently slower than wired-XNOR logic for wide memory words. New CAM structures that can offer significantly lower power while offering performance comparable to wired-XNOR CAMs have recently been studied as part of our research in this area.
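The mismatch-dominated behavior described above can be illustrated by counting matchline energy events under the two schemes. This is a sketch under simplified assumptions (one energy event per matchline transition; the tag values are made up):

```python
# Count matchline energy events for one associative search.
# Wired-XNOR scheme: matchline precharged high, discharged on any mismatch.
# AND-structured scheme: matchline switches only on an actual match.

def xnor_discharges(entries, key):
    """Energy events for precharged wired-XNOR match logic."""
    return sum(1 for tag in entries if tag != key)

def and_switches(entries, key):
    """Energy events for AND-structured match logic."""
    return sum(1 for tag in entries if tag == key)

issue_queue_tags = [0x1A, 0x2B, 0x3C, 0x2B, 0x4D, 0x5E, 0x6F, 0x70]  # hypothetical
wakeup_tag = 0x2B

xnor_events = xnor_discharges(issue_queue_tags, wakeup_tag)  # mismatches dominate
and_events = and_switches(issue_queue_tags, wakeup_tag)
```

With mostly mismatching entries, the XNOR scheme burns an event per non-matching entry every search, while the AND scheme pays only for the rare matches, which is the power advantage the text describes.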

References

1. S. Schuster et al., "Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3-4.5 GHz," Proc. 2000 IEEE Int'l Solid-State Circuits Conf. (ISSCC), IEEE Press, Piscataway, N.J., 2000, pp. 292-293.
2. I. Sutherland, "Micropipelines," Comm. ACM, vol. 32, no. 6, ACM Press, New York, June 1989.
3. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.
4. K.J. Schultz, "Content-Addressable Memory Core Cells: A Survey," Integration, the VLSI J., vol. 23, 1997, pp. 171-188.


iors based on simple area determinants and validated the constants with circuit-level simulation experiments using representative test vectors. Data-dependent transition variations aren't considered in our initial energy model formulations, but we will factor them into the next simulator version.

We consider validation an integrated part of the methodology. Our focus is on building models that rate highly for relative accuracy. That is, our goal is to enable designers to explore early-stage microarchitecture options that are inherently superior from a power-efficiency standpoint. Absolute accuracy of early-stage power projection for a future processor isn't as important, so long as tight upper bounds are established early.

Power-performance trade-off results

In our simulation toolkit, we assumed a high-level description of the processor model. We obtained results using our current version of PowerTimer, which works with presilicon performance and energy models developed for future, high-end PowerPC processors.

Base microarchitecture model

Figure 4 shows the high-level organization of our modeled processor. The model details make it equivalent in complexity to a modern, out-of-order, high-end microprocessor (for example, the Power4 processor15). Here, we assume a generic, parameterized, out-of-order superscalar processor model adopted in a research simulator called Turandot.12,13 (Figure 4 is essen-


IEEE MICRO

Figure 4. Processor organization modeled by the Turandot simulator. (The model comprises I-fetch with a branch predictor, L1 I-cache with two-level I-TLB, instruction buffer, decode/expand, and rename/dispatch stages; separate issue queues with issue logic, register read, and execution units for the integer, load/store, floating-point, and branch pipes; a load/store reorder buffer, store queue, miss queue, and cast-out queue; an L1 D-cache with two-level D-TLB, backed by an L2 cache and main memory; and a retirement queue with retirement logic.)


tially reproduced from Moudgill et al.13)

As described in the Turandot validation paper, this research simulator was calibrated against a pre-RTL, detailed, latch-accurate processor model (referred to here as the R-model13). The R-model served as a validation reference in a real processor development project. The R-model is a custom simulator written in C++ (with mixed VHDL "interconnect code"). There is a virtual one-to-one correspondence of signal names between the R-model and the actual VHDL (RTL) model. However, the R-model is about two orders of magnitude faster than the RTL model and considerably more flexible. Many microarchitecture parameters can be varied, albeit within restricted ranges. Turandot, on the other hand, is a classical trace/execution-driven simulator, written in C, which is one to two orders of magnitude faster than the R-model. It supports a much greater number and range of parameter values (see Moudgill et al.13 for details).

Here, we report power-performance results using the same version of the R-model as that used in Moudgill et al.13 That is, we decided to use our developed energy models first in conjunction with the R-model to ensure accurate measurement of the resource use statistics within the machine. To circumvent the simulator speed limitations, we used a parallel workstation cluster (farm). We also postprocessed the performance simulation output and fed the average resource use statistics to the energy models to get the average power numbers. Looking up the energy models on every cycle during the actual simulation run would have slowed the R-model execution even further. While it would've been possible to get instantaneous, cycle-by-cycle energy consumption profiles through such a method, it wouldn't have changed the average power numbers for entire program runs.

Having used the detailed, latch-accurate reference model for our initial energy characterization, we could look at the unit- and queue-level power numbers in detail to understand, test, and refine the various energy models. Currently, we have reverted to using an energy-model-enabled Turandot model for fast CPI versus power trade-off studies with full benchmark traces. Turandot lets us experiment with a wider range and combination of machine parameters.

Our experimental results are based on the SPEC95 benchmark suite and a commercial TPC-C trace. All workload traces were collected on a PowerPC machine. We generated SPEC95 traces using the Aria tracing facility within the MET toolkit.13 We created the SPEC trace repository by using the full reference input set; however, we used sampling to reduce the total trace length to 100 million instructions per benchmark program. In finalizing the choice of exact sampling parameters, the performance team also compared the sampled traces against the full traces in a systematic validation study. The TPC-C trace is a contiguous (unsampled) trace collected and validated by the processor performance team at IBM Austin and is about 180 million instructions long.

Data cache size and effect of scaling techniques

We evaluated the relationship between performance, power, and level 1 (L1) data cache size. We varied the cache size by increasing the number of cache lines per set while leaving the line size and cache associativity constant. Figure 5a,b shows the results of increasing the cache size from the baseline architecture. This is represented by the points corresponding to the current design size; these are labeled 1x on the x-axis in those graphs.

Figure 5a illustrates the variation of relative CPI with the L1 data cache size. (CPI and related power-performance metric values in Figures 5 and 6 are relative to the baseline machine values, that is, points 1x on the x-axes.) Figure 5b shows the variation when considering the metric (CPI)3 × power. From Figure 5b, it's clear that the small CPI benefits of increasing the data cache are outweighed by the increases in power dissipation due to larger caches.

Figure 5b shows the [(CPI)3 × power] metric with two different scaling techniques. The first technique assumes that power scales linearly with the cache size: as the number of lines doubles, the power of the cache also doubles. The second scaling technique is based on data from a study17 of energy optimizations within multilevel cache architectures, which reports cache power dissipation for conventional caches with sizes ranging from 1 Kbyte to 64 Kbytes. In this second technique, which we call non-lin, the cache power is scaled with the ratios presented in that study. The increase in cache power by dou-



bling cache size using this technique is roughly 1.46x, as opposed to the 2x of the simple linear scaling method. Obviously, the choice of scaling technique can greatly affect the results. However, with either scaling choice, conventional cache organizations (that is, cache designs without partial array shutdowns to conserve power) won't scale in a power-efficient manner. Note that the curves shown in Figure 5b assume a given, fixed circuit/technology generation; they show the effect of adding more cache to an existing design.
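The two scaling rules can be applied directly to the (CPI)3 × power metric. The per-doubling power ratios (2x linear, roughly 1.46x non-lin) come from the text; the relative-CPI value below is a placeholder, since the article reports only the plotted curves:

```python
import math

def cache_power_scale(rel_size, per_doubling):
    """Relative cache power after scaling cache size by rel_size
    (a power of two), growing by per_doubling with each doubling."""
    return per_doubling ** math.log2(rel_size)

def cpi3_power(rel_cpi, rel_power):
    # (CPI)^3 x power efficiency metric, as in Figure 5b; lower is better
    return rel_cpi ** 3 * rel_power

# Hypothetical 2% CPI improvement from a 4x cache.
lin = cpi3_power(0.98, cache_power_scale(4, 2.0))      # linear: power 4x
nonlin = cpi3_power(0.98, cache_power_scale(4, 1.46))  # non-lin: ~2.13x
```

Even under the gentler non-lin rule, the metric ends up above the 1.0 baseline, matching the text's conclusion that the small CPI gain cannot pay for the cache power growth.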

Ganged sizing

Out-of-order superscalar processors of the class considered rely on queues and buffers to efficiently decouple instruction execution and increase performance. The pipeline depth and


Figure 5. Variation of performance (a) and power-performance (b) with cache size. (Relative CPI and relative (CPI)3 × power versus relative cache size, 1x to 16x, for SPECfp, SPECint, and TPC-C; panel (b) shows both the linear (lin) and nonlinear (nonlin) power-scaling variants.)

Figure 6. Variation of performance (a) and power-performance (b) with core size (ganged parameters). (Relative CPI and relative (CPI)3 × power versus relative core size, 0.6x to 1.4x, for SPECfp, SPECint, and TPC-C.)


the resource size required to support decoupled execution combine to determine the machine performance. Because of this decoupled execution style, increasing the size of one resource without regard to other machine resources may quickly create a performance bottleneck. Thus, we considered the effects of varying multiple parameters rather than just a single parameter in our modeled processor.

Figure 6a,b shows the effects of varying all of the resource sizes within the processor core. These include issue queues, rename registers, branch predictor tables, memory disambiguation hardware, and the completion table. For the buffers and queues, the number of entries in each resource is scaled by the values specified in the charts (0.6x, 0.8x, 1.2x, and 1.4x). For the instruction cache, data cache, and branch prediction tables, the sizes of the structures are doubled or halved at each data point.

From Figure 6a, we can see that performance increased by 5.5% for SPECfp, 9.6% for SPECint, and 11.2% for TPC-C as the size of the resources within the core was increased by 40% (except for the caches, which are 4x larger). This configuration dissipated 52% to 55% more power than the baseline core. Figure 6b shows that the most power-efficient core microarchitecture lies somewhere between the 1x and 1.2x cores.

Power-efficient microarchitecture design ideas and trends

With the means to evaluate power efficiency during early-stage microarchitecture evaluations, we can propose and characterize processor organizations that are inherently energy efficient. Also, if we use the right efficiency metric in the right context (see earlier discussion), we can compare alternate points and rank them using a single power-performance efficiency number. Here, we first examine the power-performance efficiency trend exhibited by the current regime of superscalar processors. The "SMT/CMP differences and energy efficiency issues" box uses a simple loop-oriented example to illustrate the basic performance and power characteristics of the current superscalar regime. It also shows how follow-on paradigms such as simultaneous multithreading (SMT) may help correct the decline in power-performance efficiency measures.

Single-core, wide-issue, superscalar processor chip paradigm

One school of thought envisages a continued progression along the path of wider, aggressively speculative superscalar paradigms. (See Patt et al.18 for an example.) Researchers continue to innovate in efforts to exploit


SMT/CMP differences and energy-efficiency issues

Consider the floating-point loop kernel shown in Table A. The loop body consists of seven instructions labeled A through G. The final instruction is a conditional branch that causes control to loop back to the top of the loop body. Labels T through Z are used to tag the corresponding instructions for a parallel thread when considering SMT and CMP. The lfdu/stfdu instructions are load/store instructions with update, where the base address register (say, r1, r2, or r3) is updated to hold the newly computed address.

We assumed that the base machine (refer to Figure 4 in the main text) is a four-wide superscalar processor with two load-store units supporting two floating-point pipes. The data cache has two load ports and a separate store port. The two load-store units (LSU0 and LSU1) are fed by a single issue queue, LSQ; similarly, the two floating-point units (FPU0 and FPU1) are fed by a single issue queue, FPQ. In the context of the loop just shown, we essentially focus on the LSU-FPU subengine of the whole processor.

Assume that the following high-level parameters (latency and bandwidth) characterize the base superscalar machine:

• Instruction fetch bandwidth fetch_bw of two times W is eight instructions per cycle.
• Dispatch/decode/rename bandwidth equals W, which is four instructions/cycle; dispatch stalls beyond the first branch scanned in the instruction fetch buffer.
• Issue bandwidth from LSQ (reservation station) lsu_bw of W/2 is two instructions/cycle.
• Issue bandwidth from FPQ fpu_bw of W/2 is two instructions/cycle.
• Completion bandwidth compl_bw of W is four instructions/cycle.
• Back-to-back dependent floating-point operation issue delay fp_delay is one cycle.

Table A. Example loop test case. (Tags T through Z label the corresponding instructions of a parallel thread.)

Tag   Label   Instruction          Description
T     A       fadd fp3, fp1, fp0   add FPR fp1 and fp0; store into target register fp3
U     B       lfdu  fp5, 8(r1)     load FPR fp5 from memory address: 8 + contents of GPR r1
V     C       lfdu  fp4, 8(r3)     load FPR fp4 from memory address: 8 + contents of GPR r3
W     D       fadd fp4, fp5, fp4   add FPR fp5 and fp4; store into target register fp4
X     E       fadd fp1, fp4, fp3   add FPR fp4 and fp3; store into target register fp1
Y     F       stfdu fp1, 8(r2)     store FPR fp1 to memory address: 8 + contents of GPR r2
Z     G       bc loop_top          branch back conditionally to top of loop body


• The best-case load latency, from fetch to write back, is five cycles.
• The best-case store latency, from fetch to writing in the pending store queue, is four cycles. (A store is eligible to complete the cycle after the address-data pair is valid in the store queue.)
• The best-case floating-point operation latency, from fetch to write back, is seven cycles (when the FPQ issue queue is bypassed because it's empty).

Load and floating-point operations are eligible for completion (retirement) the cycle after write back to rename buffers. For simplicity of analysis, assume that the processor uses in-order issue from issue queues LSQ and FPQ.

In our simulation model, superscalar width W is a ganged parameter, defined as follows:

• W = (fetch_bw/2) = disp_bw = compl_bw.
• The number of LSU units (ls_units), FPU units (fp_units), data cache load ports (l_ports), and data cache store ports (s_ports) vary as follows as W is changed: ls_units = fp_units = l_ports = max[floor(W/2), 1]; s_ports = max[floor(l_ports/2), 1].

We assumed a simple analytical energy model in which the power consumed is a function of the parameters W, ls_units, fp_units, l_ports, and s_ports. In particular, the power (PW) in pseudowatts is computed as

PW = W^0.5 + ls_units + fp_units + l_ports + s_ports.

Figure B shows the performance and performance/power ratio variation with superscalar width. The MIPS values are computed from the CPI values, assuming a 1-GHz clock frequency.
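The ganged-width rules and the pseudowatt model can be written out directly. The PW formula follows the box's equation as reconstructed here (W^0.5 plus the unit counts), and the CPI fed into the MIPS conversion is a placeholder rather than the simulated loop CPI:

```python
from math import floor

def ganged_params(w):
    """Apply the ganging rules: ls_units = fp_units = l_ports =
    max[floor(W/2), 1]; s_ports = max[floor(l_ports/2), 1]."""
    ls_units = fp_units = l_ports = max(floor(w / 2), 1)
    s_ports = max(floor(l_ports / 2), 1)
    return ls_units, fp_units, l_ports, s_ports

def pseudowatts(w):
    """PW = W^0.5 + ls_units + fp_units + l_ports + s_ports."""
    ls_units, fp_units, l_ports, s_ports = ganged_params(w)
    return w ** 0.5 + ls_units + fp_units + l_ports + s_ports

def mips_per_pseudowatt(cpi, w, clock_mhz=1000.0):
    mips = clock_mhz / cpi   # MIPS from CPI at the 1-GHz clock in the box
    return mips / pseudowatts(w)

# e.g., the base machine: W = 4 gives (2, 2, 2, 1) units and PW = 9.0
```

Sweeping w over 1 to 12 with simulated CPI values is what produces the efficiency curve of Figure B2.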

The graph in Figure B1 shows that a maximum issue width of four could be used to achieve the best (idealized) CPI performance. However, as shown in Figure B2, from a power-performance efficiency viewpoint (measured as a performance over power ratio in this example), the best-case design is achieved for a width of three. Depending on the sophistication and accuracy of the energy model (that is, how power varies with microarchitectural complexity) and the exact choice of the power-performance efficiency metric, the maximum value point in the curve in Figure B2 will change. However, beyond a certain superscalar width, the power-performance efficiency will diminish continuously. Fundamentally, this is due to the single-thread ILP limit of the loop trace.

Note that the resource sizes (number of rename buffers, reorder buffer size, sizes of various other queues, caches, and so on) are assumed to be large enough that they're effectively infinite for the purposes of our running example. Some of the actual sizes assumed for the base case (W = 4) are

• completion (reorder) buffer size cbuf_size of 32,
• load-store queue size lsq_size of 6,
• floating-point queue size fpq_size of 8, and
• pending store queue size psq_size of 16.

As indicated in the main text, the microarchitectural trends beyond the current superscalar regime are effectively targeted toward the goal of extending the processing efficiency factors. That is, the complexity growth must ideally scale at a slower rate than performance growth. Power consumption is one index of complexity; it also determines packaging and cooling costs. (Verification cost and effort is another important index.) In that sense, striving to ensure that the power-performance efficiency metric of choice is a nondecreasing function of time is a way of achieving complexity-effective designs (see reference 3 in the main text reference list).

Let’s examine one of the paradigms, SMT, discussed in the main textto understand how this may affect our notion of power-performance effi-ciency. Table B shows a snapshot of the steady-state, cycle-by-cycle,resource use profile for our example loop trace executing on the base-line, with a four-wide superscalar machine (with a fp_delay of 1). Note thecompletion (reorder) buffer, CBUF, LSQ, LSU0, LSU1, FPQ, FPU0, FPU1, thecache access ports (C0, C1: read and C3: write), and the pending storequeue. The use ratio for a queue, port, or execution pipe is measured asthe number of valid entries/stages divided by the total for that resource.Recall from Figure B that the steady-state CPI for this case is 0.28.

The data in Table B shows that the steady-state use of the CBUF is 53%, the LSU0/FPU0 pipe is fully used (100%), and the average use of the

Figure B. Loop performance (1) and performance/power variation (2) with issue width. (Steady-state loop CPI versus superscalar width W, and MIPS/pseudowatts versus W, under the simple, analytical energy model.)




LSU1/FPU1 pipe is 50% [(0.6 + 0.4)/2 = 0.5 for FPU1]. The LSQ and FPQ are totally unused (0%) because of the queue bypass facility and the relative lack of dependence stalls. The average read/write port use of the data cache is roughly 33%, and the pending store queue residency is a constant 2 out of 16 (= 0.125, or roughly 13%). Due to fundamental ILP limits, the CPI won't decrease beyond W = 4, while queue/buffer sizes and added pipe use factors will go on decreasing as W is increased. Clearly, power-performance efficiency will be on a downward trend (see Figure B). (Of course, here we assume maximum processor power numbers, without clock gating or dynamic adaptation to bring down power.)

With SMT, assume that we can fetch from two threads (simultaneously, if the instruction cache has two ports, or in alternate cycles if the instruction cache has one port). Suppose two copies of the same loop program (see Table A) are executing as two different threads. So, thread-1 instructions A-B-C-D-E-F-G and thread-2 instructions T-U-V-W-X-Y-Z are simultaneously available for dispatch and subsequent execution on the machine. This facility allows the use factors, and the net throughput performance, to increase without a significant increase in the maximum clocked power. This is because the width isn't increased, but the execution and resource stages or slots can be filled up simultaneously from both threads.

The added complexity in the front end (maintaining two program counters, or fetch streams, and the global register space increase alluded to before) adds a bit to the power. On the other hand, the core execution complexity can be relaxed somewhat without a performance hit. For example, we can increase the fp_delay parameter to reduce core complexity without performance degradation.

Figure C shows the expected performance and power-performance variation with W for the two-thread SMT processor. The power model assumed for the SMT machine is the same as that of the underlying superscalar, except that a fixed fraction of the net power is added to account for the SMT overhead. (Assume the fraction added is linear in the number of threads in an n-thread SMT.)
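That overhead assumption can be captured in one line. The per-thread overhead fraction (alpha) below is a placeholder value, since the box doesn't quote a number:

```python
def smt_power(base_pseudowatts, n_threads, alpha=0.1):
    """SMT power = underlying superscalar power plus a fixed fraction of
    net power per additional thread (linear in the thread count).
    `alpha` is an assumed overhead fraction, not a value from the text."""
    return base_pseudowatts * (1.0 + alpha * (n_threads - 1))

single = smt_power(9.0, 1)    # one thread: no SMT overhead
two_way = smt_power(9.0, 2)   # two threads: +10% assumed overhead
```

Because throughput can nearly double while power grows only by this small fraction, the MIPS/pseudowatt ratio improves, which is what Figure C2 depicts.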

Figure C2 shows that under the assumed model, the power-performance efficiency scales better with W compared with the base superscalar (Figure B2).

In a multiscalar-type CMP machine, different iterations of a loop program could be initiated as separate tasks or threads on different core processors on the same chip. Thus, in a two-way multiscalar CMP, a global task sequencer would issue threads A-B-C-D-E-F-G and T-U-V-W-X-Y-Z, derived from the same user program, in sequence to two cores.

Register values set in one task are forwarded in sequence to dependent instructions in subsequent tasks. For example, the register value in fp1 set by instruction E in task 1 must be communicated to instruction T in task 2. So instruction T must stall in the second processor until the value communication has occurred from task 1. Execution on each processor proceeds speculatively, assuming the absence of load-store address conflicts between tasks. Dynamic memory address disambiguation hardware is required to detect violations and restart task executions as needed.

If the performance can be shown to scale well with the number of tasks, and if each processor is designed as a limited-issue, limited-speculation (low-complexity) core, we can achieve better overall scalability of power-performance efficiency.

Table B. Steady-state, cycle-by-cycle resource use profile (W = 4 superscalar). C: cache access port; CBUF: completion (reorder) buffer; PSQ: pending store queue.

Cycle  CBUF  LSQ  LSU0  LSU1  FPQ  FPU0  FPU1  C0  C1  C3  PSQ
n      0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+1    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+2    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+3    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+4    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+5    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+6    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+7    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+8    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+9    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13

Figure C. Performance (1) and power-performance variation (2) with W for a two-thread SMT. (Steady-state throughput CPI and MIPS/pseudowatts versus superscalar width W, under the simple, analytical energy model, for fp_delay = 1 and fp_delay = 2.)


single-thread instruction-level parallelism (ILP). Value prediction advances (see Lipasti et al.18) promise to break the limits imposed by data dependencies. Trace processors (Smith et al.18) ease the fetch bandwidth bottleneck, which can otherwise impede scalability.

Nonetheless, increasing the superscalar width beyond a certain limit seems to yield diminishing gains in performance, while continuously reducing the performance-power efficiency metric (for example, SPEC3/W). Thus, studies show that for the superscalar paradigm, ever wider issue and more speculative hardware in a core processor cease to be viable beyond a certain complexity point. (See the illustrative example in the "SMT/CMP differences and energy efficiency issues" box.) Such a point certainly seems to be upon us in the processor design industry. In addition to power issues, more complicated, bigger cores add to the verification and yield problems, which all add up to higher cost and delays in time-to-market goals.

Advanced methods in low-power circuit techniques and adaptive microarchitectures can help us extend the superscalar paradigm for a few more years. (See the earlier "Low-power, high-performance circuit techniques" and "Adaptive microarchitectures" boxes.) Other derivative designs, such as those for multicluster processors and SMT, can be more definitive paradigm shifts.

Multicluster superscalar processors

Zyuban6 studied the class of multicluster superscalar processors as a means of extending the power-efficient growth of the basic superscalar paradigm. Here, we provide a very brief overview of the main results and conclusions of Zyuban's work (see also Zyuban and Kogge7). We'll use elements of this work in the modeling and design work in progress within our power-aware microarchitecture project.

As we’ve implied, the desire to extract moreand more ILP using the superscalar approachrequires the growth of most of the centralizedstructures. Among them are instruction fetchlogic, register rename logic, register file,instruction issue window with wakeup andselection logic, data-forwarding mechanisms,and resources for disambiguating memory ref-erences. Zyuban analyzed circuit-level imple-mentation of these structures. His analysis

shows that in most cases, the energy dissipat-ed per instruction grows in a superlinear fash-ion. None of the known circuit techniquessolves this energy growth problem. Given thatthe IPC performance grows sublinearly withissue width (with asymptotic saturation), it’sclear why the classical superscalar path willlead to increasingly power-inefficient designs.
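The argument can be made concrete with a toy model: take per-width power growing superlinearly and IPC saturating sublinearly, and performance per watt must eventually decline. The exponent and saturation limit below are made-up illustrative values, not Zyuban's measurements:

```python
def ipc(width, limit=4.0):
    """Sublinear, saturating IPC model (asymptote at `limit`);
    an assumed illustrative shape, not simulated data."""
    return limit * (1.0 - (1.0 - 1.0 / limit) ** width)

def max_power(width, exponent=1.5):
    """Assumed superlinear growth of maximum clocked power with width."""
    return width ** exponent

def perf_per_watt(width):
    return ipc(width) / max_power(width)

trend = [perf_per_watt(w) for w in range(1, 13)]
```

With any superlinear power exponent and any finite IPC asymptote, the ratio falls off once the IPC curve flattens, mirroring the efficiency decline described here and in Figure B2.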

One way to address the energy growth problem at the microarchitectural level is to replace a classical superscalar CPU with a set of clusters, so that all key energy consumers are split among the clusters. Then, instead of accessing centralized structures as in the traditional superscalar design, instructions scheduled to an individual cluster would access local structures most of the time. The primary advantage of accessing a collection of local structures instead of a centralized one is that the number of ports and entries in each local structure is much smaller. This reduces access latency and lowers power.

Current high-performance processors (for example, the Compaq Alpha 21264 and IBM Power4) certainly have elements of multiclustering, especially in terms of duplicated register files and distributed issue queues. Zyuban proposed and modeled a specific multicluster organization. Simulation results showed that to compensate for the intercluster communication and to improve power-performance efficiency significantly, each cluster must be a powerful out-of-order superscalar machine by itself. This simulation-based study determined the optimal number of clusters and their configurations for a specified efficiency metric (the energy-delay product, EDP).

The multicluster organization yields IPC performance inferior to a classical superscalar with centralized resources (assuming equal net issue width and total resource sizes). The latency overhead of intercluster communication is the main reason behind the IPC shortfall. Another reason is that centralized resources are always better used than distributed ones. However, Zyuban’s simulation data shows that the multicluster organization is potentially more energy-efficient for wide-issue processors, with an advantage that grows with the issue width.7 Given the same power budget, the multicluster organization allows configurations that can deliver higher performance than the best configurations with the centralized design.


VLIW or EPIC microarchitectures

The very long instruction word (VLIW) paradigm has been the conceptual basis for Intel’s future thrust using the Itanium (IA-64) processors. The promise here is (or was) that much of the hardware complexity could be moved to the software (compiler). The latter would use global program analysis to look for available parallelism and present explicitly parallel instruction computing (EPIC) execution packets to the hardware. The hardware could be relatively simple in that it would avoid much of the dynamic unraveling of parallelism necessary in a modern superscalar processor. For inferences regarding performance, power, and efficiency metrics for this microarchitecture class, see the September-October 2000 issue of IEEE Micro, which focuses on the Itanium.

Chip multiprocessing

Server product groups such as IBM’s PowerPC division have relied on chip multiprocessing as the future scalable paradigm. The Power4 design15 is the first example of this trend. Its promise is to build multiple processor cores on the same die or package to deliver scalable solutions at the system level.

Multiple processor cores on a single die can operate in various ways to yield scalable performance in a complexity- and power-efficient manner. In addition to shared-memory chip multiprocessing (see Hammond et al.18), we may consider building a multiscalar processor,19 which can spawn speculative tasks (derived from a sequential binary) to execute concurrently on multiple cores. Intertask communication can occur via register value forwarding or through shared memory. Dynamic memory disambiguation hardware is required to detect memory-ordering violations. On detecting such a violation, the offending task(s) must be squashed and restarted. Multiscalar-type paradigms promise scalable, power-efficient designs, provided the compiler-aided task-partitioning and instruction scheduling algorithms can be improved effectively.

Multithreading

Various flavors of multithreading have been proposed and implemented in the past as a means to go beyond single-thread ILP limits. One recent paradigm that promises to provide a significant boost in performance with a small increase in hardware complexity is simultaneous multithreading.20 In SMT, the processor core shares its execution-time resources among several simultaneously executing threads (programs). Each cycle, instructions can be fetched from one or more independent threads and injected into the issue and execution slots. Because the issue and execution resources can be filled by instructions from multiple independent threads in the same cycle, the per-cycle use of the processor resources can be significantly improved, leading to much greater processor throughput. Compaq has announced that its future 21x64 designs will embrace the SMT paradigm.

For occasions when only a single thread is executing on an SMT processor, the processor behaves almost like a traditional superscalar machine, so the same power reduction techniques are likely to be applicable. When an SMT processor is simultaneously executing multiple threads, however, the per-cycle use of the processor resources should noticeably increase, offering fewer opportunities for power reduction via such traditional techniques as clock gating. An SMT processor designed specifically to execute multiple threads in a power-aware manner, though, provides additional options for power-aware design.
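A simple utilization model illustrates why higher occupancy shrinks the gating opportunity. Here `residual` is the fraction of a unit's active power that survives gating (clock-tree stubs, ungated latches); all numbers are illustrative assumptions, not measurements:

```python
# Sketch: dynamic power of a gated unit. When idle, the unit still burns
# a residual fraction of its active power. All values are illustrative.
def gated_power(p_active, utilization, residual=0.1):
    idle = 1.0 - utilization
    return p_active * (utilization + idle * residual)

# Single-thread run: a 10 W unit busy ~40% of cycles.
print(round(gated_power(10.0, 0.4), 2))   # -> 4.6 (W) under these assumptions

# SMT run: utilization rises to ~80%, so gating recovers much less.
print(round(gated_power(10.0, 0.8), 2))   # -> 8.2 (W)
```

The arithmetic captures the trade described in the text: SMT converts idle (gateable) cycles into useful work, which raises throughput and average power at the same time.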

The SMT processor generally provides a boost in overall throughput performance, and this alone will improve the power-performance ratio for a set of threads, especially in a context (such as server processors) of greater emphasis on the overall (throughput) performance than on low power. Furthermore, a processor specifically designed with SMT in mind can provide even greater power-performance efficiency gains.

Because the SMT processor takes advantage of multiple execution threads, it could be designed to employ far less aggressive speculation in each thread. By relying on instructions from a different thread to provide increased resource use when speculation would have been used in a single-threaded architecture (and accepting higher throughput over the multiple threads rather than single-thread latency performance), the SMT processor can spend more of its effort on nonspeculative instructions. This inherently implies a greater power efficiency per thread; that is, the power expended in the execution of useful instructions weighs better against misspeculated instructions on a per-thread basis. This also implies a somewhat simpler branch unit design (for example, fewer resources devoted to branch speculation) that can further aid in the development by reducing design complexity and verification effort.

SMT implementations require an overhead in terms of additional hardware to maintain a multiple-thread state. This increase in hardware implies some increase in the processor’s per-cycle power requirements. SMT designers must determine whether this increase in processor resources (and thus per-cycle power) can be well balanced by the reduction of other resources (less speculation) and the increase in performance attained across the multiple threads.

Compiler support

Compilers can assist the microarchitecture in reducing the power consumption of programs. As explained in the “Compiler support” box, a number of compiler techniques developed for performance-oriented optimizations can be exploited (usually with minor modifications) to achieve power reduction. Reducing the number of memory accesses, reducing the amount of switching activity in the CPU, and increasing opportunities for clock gating will help here.

Energy-efficient cache architectures

We’ve seen several proposals for power-efficient solutions to the cache hierarchy design.2-5,17 The simplest of these is the filter cache idea first proposed by Kin et al. (referenced in the “Adaptive microarchitectures” box). More recently, Albonesi (see same box) investigated the power-performance trade-offs that can be exploited by dynamically changing the cache sizes and clock frequencies during program execution.
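The filter-cache trade-off can be sketched with expected-value arithmetic: a tiny L0 cache in front of the L1 saves energy on hits but adds latency on misses. All energies, latencies, and hit rates below are hypothetical placeholders, not figures from Kin et al.:

```python
# Sketch of the filter-cache trade-off. Every access probes the small L0;
# misses fall through to the larger, more expensive L1. Units are
# arbitrary and the numbers are illustrative assumptions.
def avg_energy(hit_rate, e_l0=0.1, e_l1=1.0):
    return e_l0 + (1.0 - hit_rate) * e_l1

def avg_latency(hit_rate, t_l0=1.0, t_l1=2.0):
    return t_l0 + (1.0 - hit_rate) * t_l1

for hr in (0.6, 0.8, 0.95):
    print(hr, round(avg_energy(hr), 2), round(avg_latency(hr), 2))
```

The sketch makes the sensitivity obvious: the scheme pays off only when the L0 hit rate is high, which is why the dynamic resizing Albonesi studied (adapting cache configuration to the running program) is a natural refinement.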

Our tutorial-style contribution may help illuminate the issues and trends in this new area of power-aware microarchitectures. Depending on whether the design priority is high performance or low power, and what form of design scaling is used for energy reduction, designers should employ different metrics to compare a given class of competing processors.

As explained by Borkar,21 the static (leakage) device loss will become a much more significant component of total power dissipation in future technologies.


Compiler support
Manish Gupta

Figure D shows a high-level organization of the IBM XL family of compilers. The front ends for different languages generate code in a common intermediate representation called W-code. The Toronto Portable Optimizer (TPO) is a W-code-to-W-code transformer. It performs classical data flow optimizations (such as global constant propagation and dead-code elimination) as well as loop and data transformations to improve data locality (such as loop fusion, array contraction, loop tiling, and unimodular loop transformations).1,2 The TPO also performs parallelization of codes based on both automatic detection of parallelism and use of OpenMP directives for parallel loops and sections. The back end for each architecture performs machine-specific code generation and optimizations.

The loop transformation capabilities of the TPO can be used on array-intensive codes (such as SPECfp-type codes and multimedia codes2) to reduce the number of memory accesses and, hence, reduce the energy consumption. Recent research (such as the work by Kandemir et al.3) indicates that the best loop configuration for performance need not be the best from an energy point of view. Among the optimizations applied by the back end, register allocation and instruction scheduling are especially important for achieving power reduction.
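As a minimal hand-written illustration of how loop fusion plus array contraction cuts memory traffic (a sketch of the general technique, not TPO output), the temporary array `t` and its N writes and N rereads disappear in the fused version:

```python
def unfused(a):
    # Two separate loops communicating through a full temporary array:
    # loop 1 writes N elements of t, loop 2 rereads all N of them.
    t = [0.0] * len(a)
    for i in range(len(a)):
        t[i] = 2.0 * a[i]
    b = [0.0] * len(a)
    for i in range(len(a)):
        b[i] = t[i] + 1.0
    return b

def fused(a):
    # Loop fusion + array contraction: the temporary becomes a scalar,
    # so the N array writes and N array reads of t vanish.
    b = [0.0] * len(a)
    for i in range(len(a)):
        t = 2.0 * a[i]
        b[i] = t + 1.0
    return b
```

Both versions compute the same result; the fused one simply touches memory less, which is the energy win the sidebar describes.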

Power-aware register allocation involves not only reducing the number of memory accesses but also attempting to reduce the amount of switching activity by exploiting any known relationships between variables that are likely to have similar values. We can use power-aware instruction scheduling to increase opportunities for clock gating of CPU components, by trying to group instructions with similar resource use, as long as the impact on performance is kept to a reasonable limit.
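The grouping idea can be illustrated with a toy reordering. The instruction list and unit names below are hypothetical, and the sort is legal only because the instructions are assumed independent; a real scheduler must respect data dependences and latency constraints:

```python
# Toy illustration: reorder independent instructions so that uses of each
# functional unit cluster together, reducing how often its clock gate
# must switch and lengthening the idle stretches that can stay gated.
insts = [("add", "ALU"), ("fmul", "FPU"), ("sub", "ALU"),
         ("fadd", "FPU"), ("or", "ALU"), ("fdiv", "FPU")]

def toggles(schedule, unit):
    # Count off->on / on->off transitions of the unit's clock gate,
    # assuming one instruction issues per cycle.
    busy = [u == unit for _, u in schedule]
    return sum(1 for i in range(1, len(busy)) if busy[i] != busy[i - 1])

# Group by unit (a legal reordering only if the instructions are independent).
grouped = sorted(insts, key=lambda x: x[1])

print(toggles(insts, "FPU"))    # -> 5 gate transitions in interleaved order
print(toggles(grouped, "FPU"))  # -> 1 transition after grouping
```

Fewer gate transitions also helps with the inductive-noise concern that aggressive clock gating raises (see Pant et al.16).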

References
1. U. Banerjee, Loop Transformations for Restructuring Compilers, Kluwer Academic Publishers, Boston, Mass., 1997.
2. C. Kulkarni et al., “Interaction Between Data-Parallel Compilation and Data Transfer and Storage Cost for Multimedia Applications,” Proc. EuroPar 99, Lecture Notes in Computer Science, Vol. 1685, Springer-Verlag, Sep. 1999, pp. 668-676.
3. M. Kandemir et al., “Experimental Evaluation of Energy Behavior of Iteration Space Tiling,” to appear in Proc. 13th Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Springer-Verlag, 2001, pp. 141-156.

Figure D. Architecture of IBM XL compilers (TPO: Toronto Portable Optimizer). The Fortran90 and C/C++ front ends translate source programs into W-code; the TPO transforms W-code to W-code; and the PowerPC and S/390 back ends generate machine code.


This growing static power component will impose many more difficult challenges in identifying energy-saving opportunities in future designs.

The need for robust power-performance modeling at the microarchitecture level will continue to grow with tomorrow’s workload and performance requirements. Such models will enable designers to make the right choices in defining the future generation of energy-efficient microprocessors. MICRO

References
1. M.K. Gowan, L.L. Biro, and D.B. Jackson, “Power Considerations in the Design of the Alpha 21264 Microprocessor,” Proc. IEEE/ACM Design Automation Conf., ACM, New York, 1998, pp. 726-731.
2. 1998 ISCA Workshop on Power-Driven Microarchitecture papers; http://www.cs.colorado.edu/~grunwald/LowPowerWorkshop.
3. 2000 ISCA Workshop on Complexity-Effective Design papers; http://www.ece.rochester.edu/~albonesi/ced00.html.
4. D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA), IEEE Computer Society Press, Los Alamitos, Calif., 2000, pp. 83-94.
5. N. Vijaykrishnan et al., “Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA), 2000, pp. 95-106.
6. V. Zyuban, “Inherently Lower-Power High-Performance Superscalar Architectures,” PhD thesis, Dept. of Computer Science and Engineering, Univ. of Notre Dame, Ind., 2000.
7. V. Zyuban and P. Kogge, “Optimization of High-Performance Superscalar Architectures for Energy Efficiency,” Proc. IEEE Symp. Low Power Electronics and Design, ACM, New York, 2000.
8. Cool Chips Tutorial talks presented at MICRO-32, the 32nd Ann. IEEE Int’l Symp. Microarchitecture, Dec. 1999; http://www.eecs.umich.edu/~tnm/cool.html.
9. R. Gonzalez and M. Horowitz, “Energy Dissipation in General Purpose Microprocessors,” IEEE J. Solid-State Circuits, Vol. 31, No. 9, Sept. 1996, pp. 1277-1284.
10. M.J. Flynn et al., “Deep-Submicron Microprocessor Design Issues,” IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 11-22.
11. D. Burger and T.M. Austin, “The SimpleScalar Toolset, Version 2.0,” Computer Architecture News, Vol. 25, No. 3, June 1997, pp. 13-25.
12. M. Moudgill, J-D. Wellman, and J.H. Moreno, “Environment for PowerPC Microarchitecture Exploration,” IEEE Micro, Vol. 19, No. 3, May/June 1999, pp. 15-25.
13. M. Moudgill, P. Bose, and J. Moreno, “Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration,” Proc. IEEE Int’l Performance, Computing and Communications Conf., IEEE Press, Piscataway, N.J., 1999, pp. 451-457.
14. V. Tiwari et al., “Reducing Power in High-Performance Microprocessors,” Proc. IEEE/ACM Design Automation Conf., ACM, New York, 1998, pp. 732-737.
15. K. Diefendorff, “Power4 Focuses on Memory Bandwidth,” Microprocessor Report, Oct. 6, 1999, pp. 11-17.
16. M. Pant et al., “An Architectural Solution for the Inductive Noise Problem Due to Clock-Gating,” Proc. Int’l Symp. Low-Power Electronics and Design, ACM, New York, 1999.
17. U. Ko, P.T. Balsara, and A.K. Nanda, “Energy Optimization of Multilevel Cache Architectures for RISC and CISC Processors,” IEEE Trans. VLSI Systems, Vol. 6, No. 2, 1998, pp. 299-308.
18. Theme issue, “The Future of Processors,” Computer, Vol. 30, No. 9, Sept. 1997, pp. 37-93.
19. G. Sohi, S.E. Breach, and T.N. Vijaykumar, “Multiscalar Processors,” Proc. 22nd Ann. Int’l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 414-425.
20. D.M. Tullsen, S.J. Eggers, and H.M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. 22nd Ann. Int’l Symp. Computer Architecture, IEEE CS Press, 1995, pp. 392-403.
21. S. Borkar, “Design Challenges of Technology Scaling,” IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 23-29.

Several authors at IBM T.J. Watson Research Center in Yorktown Heights, New York, collaborated on this tutorial. David Brooks, a PhD student at Princeton, worked as a co-op student while at IBM Watson.


Pradip Bose is a research staff member, working on power-aware architectures. He is a senior member of the IEEE. Stanley Schuster is a research staff member, working in the area of low-power, high-speed digital circuits. He is a Fellow of the IEEE. Hans Jacobson, a PhD student at the University of Utah, worked as an intern at IBM Watson. Prabhakar Kudva, a research staff member, works on synthesis and physical design with the Logic Synthesis Group. Alper Buyuktosunoglu, a PhD student at the University of Rochester, New York, worked as an intern at IBM Watson. John-David Wellman is a research staff member, working on high-performance processors and modeling. Victor Zyuban is a research staff member, working on low-power embedded and DSP processors. Manish Gupta is a research staff member and manager of the High Performance Programming Environments Group. Peter Cook is a research staff member and manager of High Performance Systems within the VLSI Design and Architecture Department.

Direct comments about this article to Pradip Bose, IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598; [email protected].
