

Foundations and Trends® in Electronic Design Automation, Vol. 6, No. 2 (2012) 121–216. © 2012 S. Reda and A. N. Nowroz. DOI: 10.1561/1000000022

Power Modeling and Characterization of Computing Devices: A Survey

By Sherief Reda and Abdullah N. Nowroz

Contents

1 Introduction 123

1.1 Computing Substrates 126
1.2 Survey Overview 131
1.3 Summary 133

2 Background: Basics of Power Modeling 134

2.1 Dynamic Power 135
2.2 Static Power 140
2.3 Summary 144

3 Pre-Silicon Power Modeling Techniques 146

3.1 Power Modeling for General-Purpose Processors 147
3.2 Power Modeling for SoC-Based Embedded Systems 160
3.3 Power Modeling for FPGAs 167
3.4 Summary 173


4 Post-Silicon Power Characterization 175

4.1 Power Characterization for Validation and Debugging 176
4.2 Power Characterization for Adaptive Power-Aware Computing 185
4.3 Power Characterization for Software Power Analysis 189
4.4 Summary 196

5 Future Directions 199

Acknowledgments 203

Notations and Acronyms 204

References 207



Sherief Reda¹ and Abdullah N. Nowroz²

¹ Brown University, 182 Hope St, Providence 02912, USA, sherief [email protected]

² Brown University, 182 Hope St, Providence 02912, USA, Abdullah [email protected]

Abstract

In this survey we describe the main research directions in pre-silicon power modeling and post-silicon power characterization. We review techniques in power modeling and characterization for three computing substrates: general-purpose processors, system-on-chip-based embedded systems, and field programmable gate arrays. We describe the basic principles that govern power consumption in digital circuits, and utilize these principles to describe high-level power modeling techniques for designs of the three computing substrates. Once a computing device is fabricated, direct measurements on the actual device reveal a great wealth of information about the device's power consumption under various operating conditions. We describe characterization techniques that integrate infrared imaging with electric current measurements to generate runtime power maps. The power maps can be used to validate design-time power models and to calibrate computer-aided design tools.


We also describe empirical power characterization techniques for software power analysis and for adaptive power-aware computing. Finally, we provide a number of plausible future research directions for power modeling and characterization.


1 Introduction

In the past decade power has emerged as a major challenge to computing advancement. A recent report by the National Research Council (NRC) of the National Academies highlights power as the number one challenge to sustaining historical improvements in computing performance [39]. Power is limiting the performance of both mobile and server computing devices. At one extreme, embedded and portable computing devices operate within power constraints to prolong battery operation. The power budgets of these devices are about tens of milliwatts for some embedded systems (e.g., sensor nodes), 1–2 W for mobile smart phones and tablets, and 15–30 W for laptop computers. At the other extreme, high-end server processors, where performance is the main objective, are increasingly becoming hot-spot limited [46], where increases in performance are constrained by a maximum junction temperature (typically 85°C). Economical air-based cooling techniques limit the total power consumption of server processors to about 100–150 W, and it is the spatial and temporal allocation of the power distribution that leads to hot spots in the die that can compromise the reliability of the device. Because server-based systems are typically deployed in data centers, their aggregate performance becomes power limited [6], where energy costs represent the major portion of the total cost of ownership.


The emergence of power as a major constraint has forced designers to carefully evaluate every architectural and design feature with respect to its performance and power trade-offs. This evaluation requires pre-silicon power modeling tools that can navigate the rich design landscape. Furthermore, runtime constraints on power consumption require power management tools that control a number of runtime knobs that trade off performance and power consumption. Power management techniques that seek to meet a power cap, e.g., as in the case of servers in data centers [6, 37], require either direct power measurements when feasible, or alternatively, runtime power modeling techniques that can substitute for direct characterization. In addition, software power characterization can help tune and restructure algorithms to reduce their power consumption.

The last decade has seen a diversification in possible computing substrates that offer different trade-offs in performance, power, and cost for different applications. These substrates include application-specific custom-fabricated circuits, application-specific circuits implemented in field-programmable gate arrays (FPGAs), general-purpose processors whose functionality is determined by software, general-purpose graphics processing units (GP-GPUs), digital signal processors (DSPs), and system-on-chip (SoC) substrates that combine general-purpose cores with heterogeneous application-specific custom circuits. None of these substrates necessarily dominates the others; rather, each offers certain advantages that depend on the target application and the deployment setting of the computing device. For instance, custom-fabricated circuits outperform their FPGA counterparts in performance and power, but they are more expensive. SoCs offer a higher performance/Watt ratio for a range of applications than general-purpose processors; however, general-purpose processors offer higher throughput for scientific applications. GP-GPUs are also emerging as a strong contender to processors and FPGAs; however, the relative advantage of each of these substrates differs by application [39, 54, 76]. Sorting out the exact trade-offs of all these substrates across different application domains is an active area of research [4, 29, 76]. While power modeling and characterization for these substrates share common concepts, each of these substrates has its own peculiarities.

Page 7: Power Modeling and Characterization of Computing Devices ...display.engin.brown.edu/pubs/NOW-SURVEY.pdf · for software power analysis and for adaptive power-aware computing. Finally,

125

In this survey we will discuss the basic power modeling and characterization concepts that are shared among these substrates as well as the specific techniques that are applicable for each one.

Pre-silicon power modeling and post-silicon power characterization are very challenging tasks. The following factors contribute to these challenges.

(1) Large die areas with billions of transistors and interconnects lead to computational difficulties in modeling.

(2) Input patterns and runtime software applications trigger large variations in power consumption. These variations are computationally impossible to enumerate exhaustively during modeling.

(3) Spatial and temporal thermal variations arising from power consumption trigger large variations in leakage power, which lead to intricate dependencies in power modeling.

(4) Process variabilities that arise during fabrication lead to intra-die and inter-die leakage power variations that are unique to each die. These deviations recast modeling results as educated guesses rather than exact estimates.

(5) Practical limitations on the design of power-delivery networks make it difficult to directly characterize the runtime power consumption of individual circuit blocks.

The objective of this survey is to describe modern research directions for pre-silicon power modeling and post-silicon power characterization. Pre-silicon power modeling tools estimate the power consumption of an input design, and they can be used to create a power-aware design exploration framework, where different design choices are evaluated in terms of their power impact in addition to traditional design objectives such as performance and area. Post-silicon power characterization tools are applied to a fabricated design to characterize its power consumption under various workloads and environmental variabilities. The results of power characterization are useful for power-related debugging issues, calibration of design-time power modeling tools, software-driven power analysis, and adaptive power-aware computing.


Our technical exposition reviews power modeling and characterization techniques of various computing substrates, while emphasizing cross-cutting issues. We also connect the dots between the research results of different research communities, such as circuit designers, computer-aided design (CAD) developers, computer architects, and system designers. Our discussions reveal the shared concepts and the different research angles that have been explored for power modeling and characterization.

1.1 Computing Substrates

1.1.1 General-Purpose Processors

A general-purpose processor is designed to serve a large variety of applications, rather than being highly tailored to one specific application or a class of applications. The design of a general-purpose processor has to be done carefully to yield good performance within the processor's thermal design power (TDP) limit under different kinds of workloads. The TDP limit has forced a significant change in the design of processors. At present, designers aim to increase the processor's total throughput rather than improving single-thread performance. This throughput increase is achieved by using more than one processing core per chip.

Figure 1.1 gives an example of a quad-core processor based on Intel's Core i7 Nehalem architecture. The 64-bit processor features four cores that share an 8 MB L3 cache. The cores can run at up to 3.46 GHz within a 130 W TDP. Each core has a 16-stage pipeline and includes a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB L2 cache. The front-end of the pipeline can fetch up to 16 bytes from the L1 instruction cache. The instructions in the fetched 16 bytes are identified and inserted into an instruction queue. The decoder unit receives its inputs from the instruction queue, and it can decode up to four instructions per cycle into micro-ops. A branch prediction unit with a branch target buffer enables the core to fetch and process instructions before the outcome of a branch is determined. The back-end of the pipeline allocates resources for the micro-ops and renames their source and destination registers to eliminate hazards and to expose instruction-level parallelism.


Fig. 1.1. High-level diagram of Intel Core i7 processor (Nehalem architecture).

The micro-ops are then queued in the re-order buffer until they are ready for execution. The pipeline can dynamically schedule and issue up to six micro-ops per cycle to the execution units as long as the operands and resources are available. The execution units perform loads, stores, scalar integer or floating-point arithmetic, and vector integer or floating-point arithmetic. The results from the execution of micro-ops are stored in the re-order buffer, and results are committed in order only for correct instruction execution paths.

1.1.2 Embedded SoC

SoCs are computational substrates that are targeted at embedded systems and mobile computing platforms for a certain niche of applications. An SoC for a smart phone or a tablet typically consumes less than 1–2 W of power, while delivering the throughput required for applications that include video and audio playback, internet connectivity, and games.


In contrast to a general-purpose processor, an SoC includes, in addition to the general-purpose core(s), application-specific custom hardware (HW) components that can provide the required throughput for the target applications within the power envelope of the embedded system. Because total die area is constrained by cost and yield considerations, the inclusion of application-specific custom HW components must come at the expense of the functionality of the general-purpose core. SoC general-purpose cores are less capable than the ones used in general-purpose processors. They are usually less aggressively pipelined, with limited instruction-level parallelism capabilities and smaller cache sizes.

Figure 1.2 gives an example of an SoC based on nVidia's Tegra platform that has a total power budget of about 250 mW. The SoC features a 32-bit ARM11 general-purpose core that runs at up to 800 MHz. The ARM11 core has an 8-stage pipeline, with single instruction issue and support for out-of-order completion. The L1 data and code caches are 32 KB each, and the size of the L2 cache is 256 KB. The performance specifications of the core are clearly inferior to those of the Core i7. To compensate for the lost general-purpose computing performance, the SoC uses a number of application-specific components to deliver the required performance within its power budget.

Fig. 1.2. Example of nVidia Tegra SoC.


These include an image signal processor that provides image processing functions (e.g., de-noising, sharpening, and color correction) for images captured from embedded cameras. The SoC includes a high-definition audio and video processor for image, video, and audio playback, and a GPU to deliver the required graphics performance for 3-D games. The SoC supports an integrated memory controller, an encryption/decryption accelerator component, and components for communication, such as Universal Asynchronous Receiver/Transmitter (UART), Universal Serial Bus (USB), and High-Definition Multimedia Interface (HDMI). All SoC components communicate with each other using an on-chip communication network, which can take a number of forms, including shared and hierarchical busses, point-to-point busses, and meshes.

1.1.3 Field-Programmable Gate Arrays

Soaring costs associated with fabricating computing circuitry at advanced technology nodes have increased the interest in programmable logic devices that can be configured after fabrication to implement user designs. The most versatile programmable logic currently available is the Field Programmable Gate Array (FPGA). The basic FPGA architecture is an island-style structure, where programmable logic array blocks (LABs) are embedded in a reconfigurable wiring fabric that consists of wires and switch blocks, as illustrated in Figure 1.3. The inputs and outputs of the LABs are connected to the routing fabric through programmable switches. When programmed, these switches determine the exact input and output connections of the LABs. In addition, tens to hundreds of programmable I/O pads are available in the FPGA. In many cases, FPGAs also host heterogeneous dedicated computing resources, such as digital signal processors to implement multiplications, memory blocks to store runtime data, and even full light-weight processor cores.

Each LAB is composed of several basic logic elements (BLEs), where a BLE is made up of a 4-, 5-, or 6-input look-up table (LUT) together with an associated flip-flop. A 4-input LUT can be used to implement any 4-input Boolean function.


Fig. 1.3. Island-style FPGA.

Fig. 1.4. Typical design of a Logic Array Block (LAB) and a basic logic element (BLE).

Figure 1.4(a) illustrates the structure of a BLE, and Figure 1.4(b) illustrates the structure of a LAB. Each BLE can receive its inputs from other BLEs inside its LAB or from other LABs through the reconfigurable wiring fabric. Additional wiring structures in the LAB enable it to propagate arithmetic carry outputs in a fast and efficient way.
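
As a concrete illustration of the look-up table idea, the short Python sketch below models a K-input LUT as 2^K configuration bits indexed by the input values; the function being implemented and the helper names are arbitrary choices for this example, not part of any specific FPGA architecture.

```python
# Sketch of a K-input LUT: the truth table of an arbitrary K-input Boolean
# function is stored in 2^K configuration bits and indexed by the inputs.
def make_lut(config_bits):
    def lut(*inputs):
        index = 0
        for bit in inputs:                 # interpret the inputs as a binary index
            index = (index << 1) | bit
        return config_bits[index]
    return lut

# Configure a 4-input LUT to implement f(a, b, c, d) = (a AND b) XOR (c OR d).
bits = [(a & b) ^ (c | d)
        for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
f = make_lut(bits)
print(f(1, 1, 0, 0))  # prints 1, since (1 AND 1) XOR (0 OR 0) = 1
```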


To implement a computing circuit on an FPGA, it is first necessary to synthesize the input circuit by breaking it up into subcircuits, where each subcircuit is mapped to a BLE. These BLEs are then clustered into groups, where the size of each group is determined by the number of BLEs in a LAB. These clusters are then mapped and placed at the LABs. Finally, routing is conducted to determine the exact routes and switches of the routing fabric used by the circuit. The configuration bits for the logic and routing are stored in SRAM or FLASH memory cells.

While FPGAs are very attractive to computer-system designers due to their post-silicon flexibility, this flexibility comes at the expense of higher design area and power consumption compared to custom circuits that perform the same computing tasks. For example, Kuon and Rose report almost a 35× overhead for using programmable logic over custom logic [68]. However, for low- to mid-volume fabrication, programmable logic is the only economically feasible technology. Along with performance and area, power is also an important factor that must be considered during architectural design exploration of FPGAs. FPGA architectural parameters include segment length, switch block topology, cluster size, and BLE/LAB designs. Choices for these parameters lead to different power, performance, and area trade-offs. Thus, proper evaluation of power consumption is required to help designers and users make correct choices for the FPGA's architecture and programmed designs.

1.2 Survey Overview

The basic techniques for circuit-level power modeling are discussed in Section 2. The power consumption of computing circuits can be described by two components: dynamic power and static power. The section includes discussions on how to estimate each of these components when the design's circuit is available. We will also discuss the various factors that impact these power components, which include circuit design and layout, input patterns, fabrication technology, process variability, and operational temperature. The discussions in Section 2 will form the basis for the techniques discussed in Sections 3 and 4.


In Section 3 we discuss techniques for pre-silicon power modeling. Historically, performance and area were the two main criteria during the design of computing devices. In the past 10–15 years, power has emerged as a third criterion that has to be considered during design. Every architectural feature has to be judged in terms of its performance, area, and power. A typical design space has an exponential number of possible combinations of settings for the various features. Thus, there is a strong need for power modeling methods that enable designers to efficiently explore the design space and to evaluate the impact of various high-level system architectural choices and optimizations on power consumption. These architectural features and choices vary by the medium of the computing substrate. For multi-core processors, the choices include, for example, pipeline depth, instruction issue width, and cache sizes. For SoC-based embedded systems, the choices include the functionality of the custom blocks and the on-chip communication architecture (e.g., network topology, buffer sizes, and transfer modes). In some embedded systems, the boundary between hardware (HW) and software (SW) is fluid, where the choice of implementation (SW or HW) of every component could be decided based on its impact on performance, power, and area. In embedded design environments, it is necessary to have power co-modeling tools that can effectively explore the possible HW/SW implementation choices of every design component and guide designers to the correct choice. FPGA power modeling is also challenging, as the user's design is not known during the design and fabrication of the FPGA. Furthermore, users do not have direct access to the internal circuits of the FPGA. Thus, pre-characterized power models for the different FPGA structures must be estimated during the design of the FPGA and then bundled with the vendor's tools to be used by the end user.

Once a design is implemented and a physical prototype is available for direct measurements, new opportunities become possible. In Section 4, we discuss a number of techniques for post-silicon power characterization. We describe techniques that integrate infrared imaging and direct electric current measurements to develop power mapping techniques that reveal the true power consumption of every design structure. These true power maps can be used to validate pre-silicon design estimates, to calibrate power-modeling CAD tools, and to estimate the impact of variabilities introduced during fabrication.


We also discuss power characterization techniques for adaptive power-aware computing, where power models based on lumped power measurements are used by power management systems to cut down operational margins and to enforce runtime power constraints. Another discussed topic is SW power characterization using instruction-level, architectural-level, and algorithmic-level power models. SW power characterization helps software developers and compiler designers cut down the power consumption of their applications.

1.3 Summary

In this section we have highlighted the importance of power modeling and characterization techniques for modern computing devices. Future computing systems will be constrained by power, and the choices for design features and runtime settings have to be guided by their impact on power consumption as well as traditional objectives such as performance and implementation area.

Computing substrates can come in a number of forms, including custom circuits with fixed functionality, general-purpose processors whose functionality is determined by software applications, SoCs that combine general-purpose processing cores with application-specific custom circuits, and programmable logic that can be used to implement computing circuits in a cost-effective way. These computing forms share some basic power modeling techniques; however, their unique architectural features enable them to utilize efficient large-scale modeling and characterization methods.

Pre-silicon power modeling and post-silicon characterization techniques will be discussed in the remaining sections of this survey. The basic circuit-level power modeling techniques are discussed in Section 2. High-level power modeling techniques for various computing substrates will be discussed in Section 3. In Section 4 we overview different techniques for post-silicon power characterization through physical measurements on a fabricated device. Finally, a number of future research directions are outlined in Section 5.


2 Background: Basics of Power Modeling

Basic transistor and circuit-level power modeling is a mature area of research with numerous existing textbooks and surveys [73, 86, 99, 100]. The objective of this section is to introduce sufficient background to understand the focus of this survey, which spans from pre-silicon high-level power modeling techniques to post-silicon power characterization techniques. Power consumption of digital Complementary Metal Oxide Semiconductor (CMOS) circuits is caused by two mechanisms. The first mechanism is dynamic power, which arises when signals transition their values, and the second mechanism is static power, which causes circuits to dissipate power when no switching activity is occurring. One of the main advantages of using CMOS technology over earlier bipolar technology was that CMOS circuits consumed power only during circuit switching. However, aggressive technology scaling in the past decade has led to a situation where static power is no longer negligible, but rather a significant contributor to total power consumption. This section discusses the basic principles and main challenges of circuit-level power modeling. The techniques discussed in this section provide the basic background required for the next two sections.


2.1 Dynamic Power

Logic gates implemented in CMOS chips use two complementary transistor types, NMOS and PMOS, to build the functionality of each gate. One terminal of PMOS transistors is typically connected to the voltage supply, $V_{DD}$, while one terminal of NMOS transistors is connected to the ground voltage $V_{GND}$. Figure 2.1 gives the schematic of an inverter gate that consists of one NMOS transistor and one PMOS transistor. To understand the operation of the gate, assume that the input voltage is first at logic 0 (i.e., $V_{GND}$). In this case the PMOS transistor is in the on state with a very low resistance (ideally 0), while the NMOS transistor is in the off state with a very high resistance (ideally $\infty$), and a path exists to charge the load capacitance $C_L$ until the output voltage reaches $V_{DD}$. The load capacitance, $C_L$, represents the total capacitance arising from the output diffusion capacitances of the two transistors, the input capacitances of fan-out gates, wiring capacitance, and parasitics. When the input voltage switches to logic 1 (i.e., $V_{DD}$), the PMOS transistor is in the off state with a very high resistance (ideally $\infty$), while the NMOS transistor is in the on state with a very low resistance (ideally 0), and a path exists to discharge the charges on the load capacitance to ground until the output voltage reaches 0. The sum of the energy consumed during the charging and discharging, i.e., the energy per cycle, is equal to $C_L V_{DD}^2$.

Fig. 2.1. CMOS inverter.


The dynamic power consumed by the gate, which is the switching energy per second, is equal to

$$P_{\text{dynamic-gate}} = s\,C_L V_{DD}^2, \qquad (2.1)$$

where $s$ is an activity factor that denotes the number of switching cycles per second. If a circuit has $N$ gates, then the total dynamic power is equal to

$$P_{\text{dynamic}} = \sum_{i}^{N} s_i\,C_{L_i} V_{DD}^2, \qquad (2.2)$$

where $s_i$ and $C_{L_i}$ are the switching activity and load capacitance of gate $i$, respectively. Besides directly contributing to dynamic power, the capacitances also determine the exact propagation delays of the signals, which influence the occurrence of signal glitches. Glitches are unnecessary signal transitions arising from unbalanced path delays at gate inputs, as illustrated in Figure 2.2. In synchronous logic circuits, the clock signal has the highest switching activity, $f$, in the circuit. Thus, the buffer gates along the clock network path have the highest transition frequency. All other gates in the design can switch at a rate that is at most half of $f$. Determining the exact switching activity of each gate is a challenging task because (1) it requires knowledge of the exact sequence of input vectors applied to the circuit; and (2) the exact signal timing information is not accurately available until the final circuit layout is determined. The final layout provides the exact wiring and parasitic capacitances, which are needed to estimate the load capacitances and the propagation delays along the wires.
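
To make Equation (2.2) concrete, the short Python sketch below sums per-gate contributions for a hypothetical three-gate list; the supply voltage, capacitance values, and activity factors are illustrative assumptions rather than data from any real design.

```python
# Minimal sketch of Equation (2.2): total dynamic power as the sum of
# per-gate switching activity times load capacitance times VDD^2.
# All numbers below are made-up illustrative values.

VDD = 1.0  # supply voltage in volts (assumed)

# (switching activity s_i in transitions/second, load capacitance C_Li in farads)
gates = [
    (2.0e8, 5.0e-15),   # e.g., a clock buffer: high activity
    (5.0e7, 8.0e-15),   # e.g., a datapath gate
    (1.0e7, 3.0e-15),   # e.g., a control gate with low activity
]

p_dynamic = sum(s_i * c_li * VDD**2 for s_i, c_li in gates)
print(f"Total dynamic power: {p_dynamic * 1e6:.3f} uW")
```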

Another power component that is incurred during switching is short-circuit power.

Fig. 2.2. Illustration of a glitch arising from a mismatch in arrival times at the inputs of an OR gate.


If the transition edge (from 1 to 0 or 0 to 1) of the input signal is not sharp, there will exist a brief moment of time during which the NMOS and the PMOS transistors are both turned on and current flows from the supply terminal to the ground. Short-circuit power is incurred only when switching activity occurs, which makes it proportional to dynamic power consumption. Its exact value is determined by the slopes, or transition times, of the input and output signals. With proper circuit design, short-circuit power is usually about 10% of the dynamic power [43].

To get a reasonable estimate of switching activity, it is necessary to obtain representative input waveform vectors for the circuit. Two approaches are possible.

(1) Designers can generate input vectors that are derived from knowledge of the semantics and intended functionality of the circuit. Finding the most relevant input vectors is a challenge, especially for computing circuits of general-purpose processors, where software applications can trigger a large range of input vector sequences. While it is possible to construct input vectors that are triggered by some standard benchmark applications, there is always a chance that a new application will trigger non-modeled activity behavior.

(2) If realistic input vectors are not available, then it is possible to construct pseudo-random input sequences that are generated in a way that mimics realistic input vectors. The first input vector can be constructed by assigning each input bit a logic level that depends on the probability of observing a 0 or 1 as an input signal. Then each subsequent input vector is generated from the previous vector with some transition probability for each bit. Input signals of real circuits exhibit spatial and temporal correlations, where signal levels of inputs that are structurally close to each other can exhibit spatial correlations, and input transitions can be correlated in time. Thus, it is desirable to account for these spatiotemporal correlations during the generation of pseudo-random input sequences [88]; a small sketch of such a generator follows.
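
Here is a minimal Python sketch of such a generator, driven by per-bit signal probabilities for the first vector and per-bit transition probabilities for subsequent vectors; the probability values and sequence length are assumptions for the example, and the simple per-bit model ignores the spatial correlations mentioned above.

```python
import random

def generate_vectors(signal_probs, transition_probs, length, seed=0):
    """Generate a pseudo-random input sequence.

    signal_probs[i]     -- probability that bit i is 1 in the first vector
    transition_probs[i] -- probability that bit i flips between consecutive vectors
    """
    rng = random.Random(seed)
    # First vector: sample each bit independently from its signal probability.
    vector = [1 if rng.random() < p else 0 for p in signal_probs]
    sequence = [tuple(vector)]
    # Subsequent vectors: flip each bit with its transition probability.
    for _ in range(length - 1):
        vector = [b ^ 1 if rng.random() < t else b
                  for b, t in zip(vector, transition_probs)]
        sequence.append(tuple(vector))
    return sequence

# Example: 4 inputs, biased toward logic 1, with low switching on the last two bits.
for v in generate_vectors([0.5, 0.7, 0.9, 0.9], [0.5, 0.3, 0.1, 0.05], length=8):
    print(v)
```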

Accurate circuit-level power estimates are obtained using a circuit-level simulator such as SPICE [94].


Given the input voltage waveform vectors and the layout capacitances, SPICE can solve the equations of the circuit to compute the voltage and current signals at all circuit nodes. These signals give the exact switching activities and the total current drawn from the voltage supply. Furthermore, leakage estimation (discussed in Section 2.2) can be realized within SPICE by back-annotating the temperatures of transistors. However, SPICE simulations are computationally feasible only for small circuits.

Speeding-up Power Simulation. To speed up dynamic power estimation, it is possible to partition a circuit into a number of blocks and estimate the power consumption of each block using SPICE. The input vectors and the resultant power estimates from the simulations can then be used to build a power macro model for quick estimation of the block's power within the context of the larger circuit. If the block size is relatively small, then it is possible to store the power estimates of every possible pair of consecutive input vectors. For larger circuit blocks, it is necessary to abstract the input vectors into macro parameters, and for each possible combination of these parameters, the average estimated power value is stored in the corresponding entry of the macro-model look-up table. Candidate parameters include the Hamming distance between consecutive input vectors, the average input signal probability, the average input transition density, and the average spatial correlation coefficients [44]. In macro modeling, deviations from the average value arise because of interactions between the different partitioned blocks and because of the lack of glitch propagation. Another speed-up option is to use fast switch-level logic simulators; however, these simulators cannot incorporate exact timing information, and consequently they tend to underestimate the dynamic power incurred from signal glitches.
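
The following Python sketch illustrates the macro-parameter idea: consecutive input vectors are abstracted into a Hamming distance and a quantized average signal probability, which index a small table of average power values. The table contents and helper names are hypothetical; in practice each entry would be the average of many circuit-level simulations of the block.

```python
def macro_params(prev_vec, curr_vec):
    """Abstract a pair of consecutive input vectors into macro parameters."""
    hamming = sum(a != b for a, b in zip(prev_vec, curr_vec))
    avg_signal_prob = sum(curr_vec) / len(curr_vec)
    # Quantize the signal probability into coarse bins to keep the table small.
    prob_bin = round(avg_signal_prob * 4) / 4
    return hamming, prob_bin

# Hypothetical macro-model table: (Hamming distance, probability bin) -> avg power (mW).
macro_table = {
    (0, 0.5): 0.02, (1, 0.5): 0.11, (2, 0.5): 0.19,
    (1, 0.75): 0.13, (2, 0.75): 0.22, (3, 0.75): 0.30,
}

def lookup_power(prev_vec, curr_vec, default=0.05):
    return macro_table.get(macro_params(prev_vec, curr_vec), default)

print(lookup_power((0, 1, 0, 1), (1, 1, 0, 1)))  # Hamming distance 1, prob 0.75 -> 0.13 mW
```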

Average Power Estimation. It is also possible to speed up dynamic power estimation by directly estimating the average dynamic power. In this case, Equation (2.2) can be approximated by the simpler form

$$P_{\text{dynamic}} = s\,C_{\text{total}} V_{DD}^2, \qquad (2.3)$$

where $s$ is the average switching activity per circuit node and $C_{\text{total}}$ is the total circuit capacitance including the wiring.


In the approximate form, estimating the average dynamic power breaks into two estimation problems: total capacitance and average switching activity.

• Capacitance: the total capacitance is the sum of the gate capacitances and wiring capacitances. If the gate-level implementation is not known, but another representative specification (e.g., Boolean formulae) is available, then it is possible to derive complexity metrics that relate the Boolean formulae to the number of gates [21]. If the final layout is not available, then the wiring capacitance can be estimated using wire-load models and Rent-based statistical techniques [42, 135]. These techniques use parameters such as the number of gates, the number of design partitions, the fan-out distribution, and the wiring complexity to estimate the wire length distribution. However, these estimates quickly lose their accuracy at higher design abstractions [62].

• Switching activity: to calculate the average switching activity, the average signal level and transition probabilities are first assigned to the circuit inputs, and then propagated analytically through the different gates depending on their functions and their input probabilities [93, 98, 101]. Let $s_o$ denote the average switching activity at output node $o$, which is driven by a gate with $k$ inputs, and let $s_i$ denote the average switching activity of the $i$th input. Assuming that the switching activities of the input signals are uncorrelated, it is possible to express $s_o$ as

$$s_o = \sum_{i}^{k} p\!\left(\frac{\partial s_o}{\partial s_i}\right) s_i, \qquad (2.4)$$

where $p\!\left(\frac{\partial s_o}{\partial s_i}\right)$ is the probability that the output transitions due to a transition in the $i$th input. To make this analysis more accurate, it is possible to consider the spatiotemporal correlations among the different input and internal signals [88]. Temporal correlations are of particular importance because they determine the transition rates of signals.


In many cases the objective of power estimation is to explore different design choices at higher levels during architectural and system exploration. At these levels, circuit-level implementations might not be known, and it is necessary to estimate power from high-level specifications. The topic of high-level power modeling for various computing substrates will be the main focus of Section 3.

2.2 Static Power

Static power is the power consumed by transistors when they are not switching. When CMOS logic gates do not switch, they have no electrical path between the supply terminal, $V_{DD}$, and the ground terminals. Thus, static or leakage current was historically negligible; however, aggressive scaling in sub-100 nm technologies has led to a substantial increase in its magnitude. Note that modern computing devices include analog components (e.g., phase-locked loops, sense amplifiers) that incur static DC power consumption. For digital CMOS, there are three main components of leakage current: (1) subthreshold leakage current between the transistor's source and drain; (2) gate leakage between the transistor's channel and gate; and (3) reverse-bias current between the transistor's drain and well [2, 125]. Figure 2.3 illustrates these three leakage components. With the introduction of high-k dielectrics, the main source of leakage current is the subthreshold leakage.

MOS transistors operate by modulating an energy barrier between the source and drain of the transistor.

Fig. 2.3. Three components of leakage current in a MOS transistor.


The height of this energy barrier is called the threshold voltage ($V_{th}$). By increasing the potential difference between the gate and the source, $|V_{GS}|$, it is possible to reduce the energy barrier, enabling more flow of electrical carriers between the source and the drain of the transistor. During the off state, when $V_{GS} = 0$, the average carrier energy is lower than the barrier energy; however, the carriers do not have a uniform energy distribution, and there is a probability that some carriers will have energy higher than the height of the barrier, enabling them to leak from the source to the drain [39]. The probability of a carrier having an energy higher than the average energy drops exponentially with a factor that is proportional to temperature. The subthreshold leakage current can be mathematically expressed as

$$I_{\text{leak}} = I_o\, e^{-\frac{q V_{th}}{\alpha k T}}, \qquad (2.5)$$

where $I_o$ is a constant that depends on the transistor's geometrical dimensions and fabrication technology, $\alpha$ is a number greater than 1, $T$ is the temperature of the transistor, $q$ is the charge of the electrical carrier, and $k$ is the Boltzmann constant [39]. Equation (2.5) reveals a number of sensitivity factors that impact leakage [1].

• Process sensitivity: leakage power depends exponentially on $V_{th}$; for every drop in threshold voltage by 100 mV, the leakage current increases by about a factor of 10. Aggressive scaling in sub-100 nm process technologies has reduced the value of $V_{th}$ enough to make leakage power a significant contributor to the total power consumption. Furthermore, aggressive technology scaling has led to large levels of manufacturing process variations due to statistical fluctuations inherent in the manufacturing process. Manifestations of these variations include gate length variations, line-edge roughness, and dopant fluctuations [14, 106, 123, 146]. These variations impact the threshold voltages of individual transistors, leading to intra-die and inter-die leakage variations. In 2002, Intel released a statistical study that revealed about 20× variations in the leakage of one of its production processors [16].


Fig. 2.4. Thermal distributions of a dual-core processor during runtime.

• Temperature sensitivity: leakage power also depends exponentially on temperature. Besides changes in environmental temperature, internal heat generation arising from power consumption increases the temperature. Furthermore, the large spatial variations in power consumption lead to spatial thermal gradients on the chip, which create non-uniform leakage power variations. Using our laboratory's infrared imaging setup, we illustrate in Figure 2.4 a thermal map of a popular dual-core processor during operation. The thermal maps show non-uniform temperature distributions with gradients reaching up to 14°C.
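
To give a rough numerical feel for these sensitivities, the Python sketch below evaluates Equation (2.5) for a 100 mV drop in threshold voltage and for a temperature increase from 27°C to 85°C; the prefactor $I_o$, the slope factor $\alpha$, and the nominal $V_{th}$ are assumed values chosen only for illustration.

```python
import math

# Evaluate the subthreshold leakage model of Equation (2.5).
# All parameter values below are illustrative assumptions.
q = 1.602e-19      # electron charge (C)
k = 1.381e-23      # Boltzmann constant (J/K)
ALPHA = 1.7        # assumed slope factor (> 1), chosen so 100 mV gives roughly 10x
I0 = 1e-6          # assumed technology/geometry-dependent prefactor (A)

def i_leak(vth, temp_k):
    return I0 * math.exp(-q * vth / (ALPHA * k * temp_k))

base = i_leak(0.30, 300.0)
low_vth = i_leak(0.20, 300.0)     # 100 mV lower threshold voltage
hot = i_leak(0.30, 358.0)         # same Vth, 85 C instead of 27 C

print(f"100 mV Vth drop -> {low_vth / base:.1f}x more leakage")
print(f"27 C to 85 C    -> {hot / base:.1f}x more leakage")
```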

While the leakage current of an individual logic gate also depends on its input vector, the sum of these vector-dependent variations typically averages out for large circuits; furthermore, they are dwarfed by the impact of temperature and $V_{th}$ variations [1]. The sensitivities of leakage power (i.e., $V_{DD} \times I_{\text{leak}}$) to the supply voltage, threshold voltage, and temperature introduce variabilities in leakage modeling.

Modeling Leakage Dependency on Process and Temperature Variations. To model inter-die variations, circuit designers have to simulate the design at a number of corners that capture the worst, typical, and best case scenarios for supply voltage, threshold voltage, temperature, and transistor geometrical dimensions (gate length in particular). The results from the simulation give a range for the leakage current, from the highest leakage at the high-supply-voltage, high-temperature, low-$V_{th}$ corner to the lowest leakage at the low-supply-voltage, low-temperature, high-$V_{th}$ corner.


Fig. 2.5. Corner simulation results (timing in ns and leakage in mW) for a 64-bit adder in 90 nm technology.

Figure 2.5 gives the results of a typical corner simulation for a design, where the delay of the design's critical path is given on the y-axis, and total leakage power is given on the x-axis.¹ The results show a large range (almost 25×) of leakage power variations. The results also show the typical inverse correlation between leakage and delay, where leaky transistors tend to be faster, which creates critical paths with smaller delay.

To model intra-die leakage variations, it is necessary to account for the intra-die variations arising from manufacturing variability and for the spatial and temporal variations in temperature [22, 121, 136]. Intra-die process variability tends to be spatially correlated; that is, devices that are close to each other will likely have more similar characteristics than those that are farther away. If $V_{th}$ is considered a random variable, then the statistical field formed from the $V_{th}$ of all transistors is spatially correlated, where the correlation between two transistors on the die is a function of the distance between them [38, 47].

¹ We simulate a 64-bit ripple-carry adder at 90 nm using the PTM SPICE models.


Fig. 2.6. Temperature-aware leakage power modeling.

Using a representative model that captures the spatial correlations in device parameters, it is possible to statistically compute the intra-die variations in leakage power [22, 48]. Modeling the dependency of intra-die leakage on temperature requires the use of a thermal simulator that can estimate the temperature distribution on the chip and then provide feedback to the leakage power model. Figure 2.6 shows an iterative feedback power modeling technique, where estimates of the leakage and dynamic power components are given as inputs to the thermal simulator. The resultant spatial thermal distribution is fed back to the leakage model to update the leakage power estimates. This feedback loop is repeated a few times until convergence is reached.
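
A minimal sketch of this feedback loop is shown below. It uses a lumped single-node thermal model ($T = T_{\text{ambient}} + R_{\text{th}} P_{\text{total}}$) in place of a full thermal simulator, and all parameter values are assumptions chosen only to demonstrate the iterate-until-convergence structure of Figure 2.6.

```python
import math

# Iterative temperature-aware leakage estimation (structure of Figure 2.6),
# using a lumped single-node thermal model. All values are illustrative assumptions.
T_AMBIENT = 318.0   # K (45 C)
R_TH = 0.5          # K/W, assumed junction-to-ambient thermal resistance
P_DYNAMIC = 40.0    # W, dynamic power (assumed temperature-independent)

def leakage_power(temp_k, p_leak_ref=10.0, t_ref=318.0, vth=0.3, alpha=1.7):
    """Scale a reference leakage value with temperature following Equation (2.5)."""
    q, k = 1.602e-19, 1.381e-23
    return p_leak_ref * math.exp(q * vth / (alpha * k) * (1.0 / t_ref - 1.0 / temp_k))

temp = T_AMBIENT
for it in range(20):
    p_leak = leakage_power(temp)
    new_temp = T_AMBIENT + R_TH * (P_DYNAMIC + p_leak)
    if abs(new_temp - temp) < 0.01:   # convergence check
        break
    temp = new_temp
print(f"Converged after {it + 1} iterations: T = {temp:.1f} K, leakage = {p_leak:.1f} W")
```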

While temperature and process variations have a strong impact on leakage power, they have a weak impact on dynamic power. In particular, temperature and process variabilities lead to timing variabilities along the various paths of a circuit. These timing variabilities can alter the number of circuit glitches, which impacts dynamic power consumption.

2.3 Summary

In this section we discussed the basics of modeling power consumption in CMOS digital circuits. Power consumption consists of two components: dynamic power and static power. Dynamic power is incurred due to switching activity, while static power is consumed by transistors when they do not switch.


For dynamic switching power estimation, it is necessary to know the load capacitances and the signal transition activities of every signal. For dynamic short-circuit power, it is necessary to know the signal transition times of every signal, and for static power, it is necessary to know the threshold voltages and temperatures of the individual transistors.

Power estimation is highly challenging due to many modeling complexities and unknown factors during design time. Dynamic power is highly dependent on input vector sequences, and it is impossible to exhaustively cover all possible sequences, making these estimates vulnerable to deviations arising from unforeseen runtime inputs. Furthermore, process variabilities can impact the signal propagation and transition times along different paths of the circuit, which impacts dynamic power and short-circuit power. Leakage power is highly dependent on temperature and process variabilities. On-die thermal gradients complicate leakage estimation and can lead to deviations between estimates and actual leakage consumption. Furthermore, intra-die and inter-die process variabilities create large variations in leakage power consumption.

The supply voltage is a key parameter in determining the total power consumption. Dynamic power has a quadratic dependency on the supply voltage, while leakage power has a linear dependency. Except for IR drops on the power distribution network, the supply voltage is a relatively easy parameter to estimate. In the past decade, supply and threshold voltages have deviated from ideal scaling, leading to an increase in power consumption. The exponential dependency of leakage power on the threshold voltage has constrained the ability to scale the threshold voltage down with feature size, which in turn stalled supply voltage scaling. This stalling is necessary because transistor current, switching speed, and noise margin are all proportional to the difference between the supply voltage and the threshold voltage. Thus, scaling down the supply voltage at a higher rate than the threshold voltage hurts performance and makes the circuit susceptible to runtime errors.


3 Pre-Silicon Power Modeling Techniques

In this section we discuss pre-silicon power modeling techniques. These techniques are used by designers to evaluate the power consumption of a given design and to explore the device's architectural choices, where different designs are evaluated to assess their performance, power, and area trade-offs. We discuss power modeling techniques for three computing substrates: general-purpose processors, SoC-based embedded systems, and FPGAs. Power modeling for these substrates shares the basic concepts discussed in Section 2; however, their sheer complexity makes direct use of basic circuit power estimation techniques computationally infeasible. We will explore techniques that leverage the unique features of these substrates to customize their high-level power models and obtain accurate yet computationally feasible models. We discuss power modeling techniques for general-purpose processors in Section 3.1; power modeling techniques for SoC-based embedded systems are discussed in Section 3.2; and power modeling techniques for FPGAs are discussed in Section 3.3.


3.1 Power Modeling for General-Purpose Processors

Power dissipation and thermal issues are the most important concerns in modern general-purpose processors. As a result, computer architects develop power-aware designs for processors, which require accurate models to estimate power at an early stage of the design flow. Power estimates help to perform architectural optimizations that improve power and performance. A good amount of research has been done on estimating processor power at an early stage, but accurate and efficient power estimation of general-purpose processors still remains a challenge. As technology advances, transistor miniaturization impacts both dynamic and leakage power, adds functional units to the design, and demands more interconnections between the units. All these factors significantly impact processor power consumption and its distribution, and as a result, it is hard for power estimation tools to maintain accuracy and efficiency.

To analyze and estimate processor power dissipation at an early stage, high-level architectural power modeling combines characterizations from circuit-level simulation data or analytical power models with architectural-level switching activity.

Fig. 3.1. Cycle-by-cycle power estimation framework for a processor.


Figure 3.1 shows a cycle-by-cycle power estimation framework for processors. The framework has the following parts: a compiler to generate machine code from various benchmark applications, a cycle-accurate simulator to generate resource utilization and circuit switching activity information, and various kinds of power models. There are two main types of simulators for processors: trace-driven and execution-driven. Trace-driven simulators record an execution trace and use the recorded trace as input to the simulator to drive the model. In execution-driven simulators, applications run directly on the simulator host, which maintains both application state and architecture state while recording detailed performance statistics. The current trend in power modeling simulation is to use execution-driven simulation because, unlike trace-driven simulation, it can capture the dynamic properties of the application. The dynamic properties of an application (e.g., speculative execution for branch prediction) are crucial for modern processor power modeling. Examples of execution-driven simulators are SimpleScalar [20] and Asim [36], which are used by many popular power modeling tools such as Wattch [19], SimplePower [148], and CAMP [118]. Some simulators can be configured as either trace-driven or execution-driven (e.g., Turandot [95]) [18, 74]. The cycle-level simulator requires application programs, the hardware configuration, and technology parameters as inputs to model the processor power on a cycle-by-cycle basis.
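
The sketch below shows the core accounting step of such a framework: per-cycle activity counts, as a cycle-accurate simulator would report them, are combined with per-access energy values from the unit power models to produce a cycle-by-cycle power trace. The unit names, energy numbers, and activity trace are hypothetical placeholders for what tools such as Wattch derive from their capacitance models.

```python
# Combine per-cycle activity counts with per-access energy models
# to produce a cycle-by-cycle power trace (Wattch-style accounting).
# Unit names, energies, and the activity trace are illustrative assumptions.

CLOCK_HZ = 3.0e9

# Per-access energy of each modeled functional block, in joules (assumed).
unit_energy = {"icache": 0.20e-9, "dcache": 0.25e-9, "alu": 0.10e-9, "regfile": 0.05e-9}

# Activity counts per cycle, as a cycle-accurate simulator would report them.
activity_trace = [
    {"icache": 1, "alu": 2, "regfile": 4},
    {"icache": 1, "dcache": 1, "alu": 1, "regfile": 2},
    {"icache": 1, "alu": 3, "regfile": 6},
]

for cycle, counts in enumerate(activity_trace):
    energy = sum(unit_energy[u] * n for u, n in counts.items())
    power = energy * CLOCK_HZ          # energy per cycle -> average power in that cycle
    print(f"cycle {cycle}: {power:.2f} W")
```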

For dynamic power estimation, there are two tasks: one is to estimate the switching activity and the other is to estimate the capacitances, as discussed in Section 2.1. The cycle-accurate simulator is usually used to generate the architectural activity information (switching activity/vector sequences), which is used with the available power/capacitance models to estimate power on a cycle-by-cycle basis. Various methods are used for building the models at the architectural level. We will discuss three approaches in this section. First, a macro modeling approach builds the power models by estimating the power of existing circuits through measurement or simulation. This method is suitable for newly designed circuits for which no information from a previous implementation is available and for irregularly structured circuits that do not have a specific architectural design template.


Second, an analytical modeling approach is used for circuits that have regular organizations. In that case, the circuit can be divided into smaller sub-circuits, and each can have a separate analytical representation. Third, we will discuss regression-based modeling approaches, which are used to speed up power estimation.

Estimating the power of a general-purpose processor is usually done by dividing the processor into smaller architectural functional blocks. Each of these functional blocks is modeled separately, and the individual block powers are combined to get the total processor power. As shown in Figure 3.1, the processor is mainly divided into the following categories of functional blocks: datapath units, the control unit, memory modules, interconnection/bus wires, and clock distribution networks. Datapath components are adders, multipliers, the register file, pipeline registers, and the arithmetic logic unit (ALU), whereas memory modules are composed of caches, translation lookaside buffers (TLBs), and reorder buffers. Most architectural-level power estimation tools combine both macro modeling and analytical modeling depending on the type of the functional block. In this section we will organize our discussion accordingly; the modeling of complex units (e.g., ALU, control unit) will be discussed in the context of macro modeling, while the modeling of more regular structures (e.g., register file, cache) will be discussed in the context of analytical modeling. We will discuss both the dynamic and leakage power of the functional blocks.

3.1.1 Macro Modeling

Macro modeling is a good candidate for estimating the power of functional blocks that do not have a regular structure, such as control units. The first step in macro modeling is choosing appropriate input parameters through which the power models will be queried. We discuss four different techniques to choose appropriate input parameters to the look-up tables: (1) input patterns, (2) probabilistic parameters, (3) instruction traces, and (4) leakage parameters.

(1) The first technique is to use the input vectors to the functional blocks as the parameters to access the tables. Because the power of any functional block depends on the input pattern transitions, a natural choice for power macro modeling is to store the estimated switching energy as a function of consecutive input vectors. In this case, the tables are basically matrices, where the row index is the previous pattern, the column index is the present pattern, and a matrix entry corresponds to the switching energy. An example of such a table is illustrated in Figure 3.2 for a circuit with two inputs. The stored value can also be the switching capacitance, which is defined as the effective capacitance value corresponding to a certain switching activity. The problem with this approach is that the size of the look-up table grows exponentially as a function of the input vector length. To address this problem, power estimation tools use the following techniques to compress the tables.

Fig. 3.2. Example of a look-up table for macro modeling.

• The circuit can be partitioned into smaller sub-circuits before building the tables. Power values from the sub-circuits can be combined to find the power of the larger circuit, which helps reduce the size of the look-up table. For example, one 4:1 multiplexer can be partitioned into three 2:1 multiplexers. The table for a 4:1 multiplexer has 4096 entries, whereas a 2:1 multiplexer table has only 64 entries [26]. Instead of sampling the input vectors and running a full simulation, it is also possible to sample a small subset of internal nodes in a circuit unit and use this subset to estimate the power dissipation of the unit under a larger range of input statistics [12]. In general, partitioning techniques have the disadvantage that they can introduce errors in power estimation due to their inability to perform glitch propagation. To address this problem, Ragunathan et al. discuss techniques that estimate glitch generation and propagation from the Boolean formulas that specify the control logic [120].

• Various clustering algorithms can be used to compress these tables while keeping the estimation error within tolerable limits. The values in the look-up tables are distributed into different clusters by collapsing closely related input transition vectors and similar power patterns [90]. For example, a clustering algorithm can compress the table for a 16-bit ripple carry adder from 2^32 entries to only 97 entries, but this clustering reduces estimation accuracy as the number of inputs increases [148].

• The dynamic power is directly proportional to the number of transitions at the inputs, so another choice of input parameter is the Hamming distance between the input patterns [64]. The power dissipation of different blocks can be stored in a look-up table indexed by the Hamming distance between consecutive vectors; a minimal sketch of such a table follows this list. If the Hamming distance is used as the input parameter, the table in Figure 3.2 reduces to only 3 entries instead of 16. Though this approach drastically reduces the size of the tables, it is inaccurate in cases where certain input bit transitions trigger transitions of other bits. For example, in the case of a 32-bit adder, a single bit transition means the Hamming distance between the previous and current vectors is one, but this transition can trigger many internal transitions and large power dissipation in the adder due to the propagation of the carry bit.
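The following sketch illustrates a Hamming-distance-indexed macro-model table of the kind described above. The per-distance energy values are hypothetical placeholders; in practice they would come from gate- or transistor-level characterization of the block.

```python
# Sketch of a power macro model look-up table indexed by the Hamming distance
# between consecutive input vectors. The energy values are placeholders that
# would normally come from lower-level characterization of the block.

def hamming(prev_vec, curr_vec):
    """Number of bit positions that toggle between two equal-length bit strings."""
    return sum(a != b for a, b in zip(prev_vec, curr_vec))

# Hypothetical table for a 2-input block: distance 0, 1 or 2 -> switching energy (pJ).
ENERGY_BY_DISTANCE = {0: 0.0, 1: 1.8, 2: 3.1}

def block_energy(vector_sequence):
    """Accumulate switching energy over a sequence of input vectors."""
    total = 0.0
    for prev, curr in zip(vector_sequence, vector_sequence[1:]):
        total += ENERGY_BY_DISTANCE[hamming(prev, curr)]
    return total

print(block_energy(["00", "01", "11", "00"]))  # 1.8 + 1.8 + 3.1 pJ
```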

(2) Probabilistic parameters are also suitable choices of input parameters for macro modeling instead of the exact input vectors. Gupta et al. propose a four-dimensional (4-D) look-up table based on the macro modeling approach [44]. The dimensions of the 4-D model are the average input signal probability, average input transition density, average output transition density, and average input spatial correlation coefficient. As discussed in Section 2, the inputs of a functional unit can be either temporally or spatially correlated, which makes it essential to include the last parameter as an input to the look-up tables.

(3) A third technique is to use the instruction traces from the cycle-accurate simulator as indices to the look-up table. The control unit in the datapath is complex and accepts hundreds of instructions, so it is impractical to construct one single table that stores all input transitions. Because instructions of the same format have similar power dissipation, different instructions can be categorized according to their format [26]. For each format, it is efficient to extract the corresponding power values using a lower-level power simulator and to access them from the look-up table during runtime power estimation.

(4) The fourth technique is relevant for macro modeling of leakage power. Though it is common to use analytical models for circuit leakage estimation as discussed in Section 2.2, macro modeling is also possible. Pre-characterized leakage power values for different functional blocks, obtained from transistor-level simulators, can be stored in a look-up table and used during processor power estimation. The inputs to the look-up tables are all the parameters that impact leakage power, which include device width, temperature, supply voltage, and process variation parameters (mean and variance). Do et al. perform characterization of an SRAM cell (shown in Figure 3.3) by connecting a single cell to a pair of bitlines to quantify leakage components via circuit-level simulations [34]. To improve leakage estimation, a look-up table based leakage current model is used in eCACTI [87]. The look-up tables are built using SPICE simulations of transistors with different widths.

After choosing the appropriate input parameters using any of the four techniques discussed above, the next critical step in macro modeling is to choose a suitable training set for the models. The training set is a representative subset of all possible values of the chosen input parameters. The training set should be sufficiently large to encompass all the possible domains of the parameters. For example, for input vectors, the training set should span all the possible previous and current input vectors; otherwise, the power values stored in the model will not be accurate. The last step in macro modeling is model characterization through simulation, where the power tools use a low-level power simulator (gate-level or transistor-level) to find the corresponding power values for each representative training set and store the values in the look-up tables.

3.1.2 Analytical Modeling

Analytical models are suitable for functional blocks that are organized in a regular way, such as caches, TLBs, register files, queues, and buffers. Due to the regular structure of these functional blocks, a relationship between their parameters can be established to estimate power. Each functional block can be further divided into smaller structures, where each structure is modeled with a separate analytical equation. Cache memory, which can also be represented as an array, is one of the power-hungry modules of a processor. The fraction of the total area and the contribution to total power from memory can be as high as 40% [65, 87]. A number of power estimation tools are dedicated purely to estimating cache power [87, 128, 145]. These tools are used for micro-architectural power modeling to determine the best cache configuration. They take as inputs a few high-level cache parameters and output the best cache configuration. High-level input cache parameters include cache size in bytes, block size in bytes, associativity, data output width in bits, and address bus width in bits. Typically, these models estimate the read and write access times as well as the cache power consumption.

The basic building block of an array structure (e.g., register file or cache) is the static random-access memory (SRAM) cell, which is shown in Figure 3.3. Power modeling of an SRAM cell is done with analytical equations due to its fixed structure. SRAM dynamic power dissipation happens through two lumped capacitances: the wordline capacitance C_WL and the bitline capacitance C_BL. During each read or write, the wordline is charged through the decoder (shown in Figure 3.4), which results in dynamic power consumption. The bitline power consumption is divided into the read phase and the write phase. During a read, the power consumption results from charging and discharging of the bitline capacitance through the pre-charge circuitry and the NMOS access transistors. During a write, current discharges through the write circuitry in the cell, which causes power consumption.

Fig. 3.3. Typical 6T SRAM cell structure.

Fig. 3.4. Wordline and bitline schematic in a register file.

The register file has a very regular internal structure, as shown in Figure 3.4, and it can be represented as an array structure. The basic building block of a register file is the SRAM cell. Besides the SRAM cells, a register file also includes decoders, precharge circuitry, and sense amplifiers. Each of these structures has a separate capacitance equation. Equations (3.1) and (3.2) give the analytical models for the wordline capacitance C_WL and the bitline capacitance C_BL, respectively, in a register file:

C_WL = C_diff + 2 × C_gate × N_col + C_metal × WL_length    (3.1)

C_BL = C_diff + C_diff × N_row + C_metal × BL_length,    (3.2)

where C_diff, C_gate, and C_metal are the diffusion, gate, and metal wire capacitances, respectively, and N_row, N_col, WL_length, and BL_length are the numbers of rows and columns and the lengths of the wordlines and bitlines, respectively. The diffusion capacitance C_diff corresponds to the transistor of the decoder line, the gate capacitance C_gate is estimated from the NMOS transistors used to access the SRAM cells, and the metal wire capacitance C_metal is the load capacitance of the wordline or bitline. The dynamic power of the register file can be estimated using activity information from the cycle simulator combined with the capacitances calculated from the analytical equations. For example, the dynamic power consumption of a bitline can be expressed as

P_dynamic-bitline = r × C_BL × V_DD × V_swing,    (3.3)

where r is the number of reads or writes per second and V_swing is the voltage swing of the bitline [34, 81].
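As a concrete illustration of Equations (3.1)–(3.3), the following sketch computes the wordline and bitline capacitances of a register file array and the resulting bitline dynamic power. All capacitance, geometry, and voltage values are illustrative assumptions rather than characterized technology parameters.

```python
# Minimal sketch of the register-file analytical model of Equations (3.1)-(3.3).
# All capacitance and geometry values below are illustrative assumptions.

C_DIFF = 1.0e-15    # diffusion capacitance per transistor (F), assumed
C_GATE = 2.0e-15    # gate capacitance of an SRAM access transistor (F), assumed
C_METAL = 0.2e-15   # metal wire capacitance per unit length (F/um), assumed

def wordline_cap(n_col, wl_length_um):
    # Equation (3.1): decoder diffusion + two access-gate loads per column + wire.
    return C_DIFF + 2 * C_GATE * n_col + C_METAL * wl_length_um

def bitline_cap(n_row, bl_length_um):
    # Equation (3.2): precharge diffusion + one access-transistor diffusion per row + wire.
    return C_DIFF + C_DIFF * n_row + C_METAL * bl_length_um

def bitline_dynamic_power(accesses_per_s, n_row, bl_length_um, vdd, v_swing):
    # Equation (3.3): P = r * C_BL * V_DD * V_swing.
    return accesses_per_s * bitline_cap(n_row, bl_length_um) * vdd * v_swing

# Example: a 128 x 64 array, 1e9 reads/s, 1.0 V supply, 0.2 V bitline swing.
print(wordline_cap(n_col=64, wl_length_um=80.0))
print(bitline_dynamic_power(1e9, n_row=128, bl_length_um=100.0, vdd=1.0, v_swing=0.2))
```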

The leakage current of the SRAM cell can also be determined analytically [87, 149]. Transistors that are in the off state need to be analyzed during the read or write phase. Due to symmetry in the SRAM cell design, the leakage currents of the two transistors within a pair such as (N1, N2), (N3, N4), and (P1, P2) are the same. For example, when WL = 1 and Bit = 1, transistors N1 and P2 are in the off state, and when WL = 0 and Bit = 1, transistors N1, N3, and P2 are in the off state; as a result they contribute to the leakage current. Due to pair-wise symmetry, the leakage current for WL = 1 and WL = 0 can be expressed by Equation (3.4) regardless of the data stored in the cell:

I_leak = I_leak,N1 + I_leak,P2    for WL = 1,
I_leak = I_leak,N1 + I_leak,N3 + I_leak,P2    for WL = 0,    (3.4)

where I_leak,N1, I_leak,N3, and I_leak,P2 are the leakage currents due to transistors N1, N3, and P2, respectively. The leakage current for each transistor is obtained from the NMOS and PMOS models given earlier in Equation (2.5).

The leakage power of a register file can be formulated using Equation (3.4). If we assume N_row and N_col are the total numbers of rows (wordlines) and columns (bitlines) in the array structure, then Equation (3.4) can be summed over all the rows and columns as in Equation (3.5). During the read phase only one row has WL = 1 and the remaining (N_row − 1) rows have WL = 0:

I_leak,memRD = N_col × N_row × (I_leak,N1 + I_leak,P2) + (N_row − 1) × N_col × I_leak,N3,    (3.5)

where I_leak,memRD is the total leakage current during one read of the array [87].
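The following sketch evaluates the per-cell and array-level read leakage of Equations (3.4) and (3.5). The per-transistor leakage currents are placeholder values; in a real flow they would come from the device models of Equation (2.5) or from SPICE characterization.

```python
# Sketch of the per-cell and array read-leakage model in Equations (3.4)-(3.5).
# Per-transistor leakage currents are assumed placeholder values.

I_LEAK_N1 = 1.0e-9   # off-state leakage of pull-down NMOS N1 (A), assumed
I_LEAK_N3 = 0.8e-9   # off-state leakage of access NMOS N3 (A), assumed
I_LEAK_P2 = 0.5e-9   # off-state leakage of pull-up PMOS P2 (A), assumed

def cell_leakage(wordline_high):
    # Equation (3.4): the access transistor leaks only when the wordline is low.
    if wordline_high:
        return I_LEAK_N1 + I_LEAK_P2
    return I_LEAK_N1 + I_LEAK_N3 + I_LEAK_P2

def array_read_leakage(n_row, n_col):
    # Equation (3.5): one selected row (WL = 1) plus (n_row - 1) unselected rows.
    return (n_col * n_row * (I_LEAK_N1 + I_LEAK_P2)
            + (n_row - 1) * n_col * I_LEAK_N3)

print(array_read_leakage(n_row=128, n_col=64))  # total leakage current (A) during a read
```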

The cache is the main component of a processor's memory system, and it also has a very regular structure similar to the register file. As a result, the power consumption of most memory blocks can be estimated quite accurately using analytical models such as cache access time estimation models. A typical cache is composed of multiple sub-banks in both the tag and data arrays, as illustrated in Figure 3.5.

Fig. 3.5. Typical cache structure.


The configurations of the tag array and data array are usually identical. Other components in the cache structure include bitlines, wordlines, precharge circuitry, row and column multiplexers, the address decoder, comparators, and sense amplifiers. Tools like CACTI estimate the capacitances of all these components by representing the circuit as several equivalent RC networks [128]. The gate and diffusion capacitances are estimated from the individual circuit layouts and combined, using the cache configuration parameters, to find the total capacitance. To reduce the impact of cache misses, modern processors employ multi-level cache designs with L1, L2, and L3 caches. Power tools like AccuPower [114] include the option of modeling caches in separate stages. The tool iteratively estimates the capacitances of the switching nodes using the organization specified by the cache configuration parameters and analytical equations.

3.1.3 Regression-Based Modeling

The long simulation times and the large number of parameters encountered during cycle-accurate power estimation constrain design studies. To overcome this problem, regression-based modeling is used to predict performance and power for various applications running on processors. Regression analysis is a statistical inference method in which the relationship between a dependent variable and one or more independent variables is established. For power estimation, the dependent variable is the power, which is the observed response, and the independent variables are various architectural parameters.

We will discuss two different approaches for regression-based power modeling. Both techniques use offline regression to establish coefficients from a large set of training benchmarks, which can then be used to estimate processor power as shown in Figure 3.6. The first approach is offline trace-driven simulation, where samples are collected to fit a regression model that relates architectural design parameters to processor power. The second approach is to generate activity factors offline and use these activity factors, combined with simulated performance counter information collected over cycle intervals, to estimate runtime power metrics. Both of these techniques estimate only the dynamic power of the processor. The first method can be used to evaluate different design choices for processors at the architectural level (e.g., instruction width, cache size, ALU latency, branch latency), whereas the second method is used to estimate software-based power consumption for a given architectural configuration.

Fig. 3.6. Estimating modeling coefficients by using regression analysis.

In the first approach, Lee et al. [74] use 12 different sets S1, S2, ..., S12, each of which defines a group of specific architectural configurations. The groups include pipeline depth, width, physical registers, reservation stations, cache, main memory, control latency, fixed-point latency, and floating-point latency. Each of these sets describes a specific type of architectural configuration for the processor and has a few parameters, which are varied within a specific range, and power is estimated in each case using simulation. For example, set S2 is the group that describes the width of different functional blocks in the processor, and it has four architectural parameters: instruction width, reorder queue entries, store queue entries, and functional unit count. Each of these parameters is varied within a range and each has a cardinality of 3. For example, the instruction width is varied over 4-bit, 8-bit, and 16-bit values and thus has cardinality 3. Combining all the parameters of S2, the cardinality of the set is |S2| = 3 × 3 × 3 × 3 = 81. The cardinality of S defines the design space, which is the product of all the set cardinalities; i.e., |S| = ∏_{i=1}^{12} |S_i|, where the index i ranges over the 12 sets.


Though the number of sets is only 12, the total cardinality |S| is about one billion, and for a total of 22 benchmarks the whole design space is approximately 22 billion points. Sweeping the design parameter values to consider all the points in this design space is not possible; only 4000 out of the 22 billion samples are considered while building the models, which gives reasonable accuracy with a median error between 4.3% and 22%. The different parameters are called the predictor variables, and the observed responses are performance and power. The relationships between the predictors and the observed responses are captured by the regression model coefficients. These coefficients can be thought of as the expected change in power per unit change in a predictor variable. The observed responses are found via simulation of the benchmarks and are used to formulate the models by regression analysis. Fitting the regression model to the observed responses is commonly done through the method of least squares to find the coefficients. Once established, these coefficients can be used to predict the runtime power without performing runtime simulations. Basic linear regression modeling can be too simple for processor power modeling; to improve accuracy, spline functions are used to capture the non-linear relations between the predictors and the responses [74].

The second approach is to use parameters based on general processor utilization statistics. CAMP [118] performs correlation analysis to choose, among all the utilization statistics, those that correlate best with the architectural activity factors. Examples of parameters that correlate well are instructions per cycle (IPC), speculative IPC (SIPC), load rate (LR), store rate (SR), floating-point load/store rate (FPLd/St), single-instruction multiple-data (SIMD) instruction rate, 64-bit FP instruction rate (FP64), micro-SIPC rate, branch rate (BR), load hit rate, and store hit rate. The processor is divided into different power macros, where a macro is the smallest unit of power computation. The switching activity and the capacitance in the power equation are replaced by the architectural activities and the equivalent capacitance for the architectural events. The activity factor (AF) for each macro can be expressed as a function of the statistics as shown in Equation (3.6):

AF = c1 + c2 × IPC + c3 × SIPC + c4 × LR + c5 × SR + c6 × FPLd/St + c7 × FP64 + c8 × BR,    (3.6)


where c1, ..., c8 are the coefficients of the chosen statistics. By performing regression analysis over several training benchmarks, the corresponding coefficient for each utilization statistic in Equation (3.6) is found for each macro. CAMP requires a set of performance counters that monitor the processor statistics mentioned above. These statistics are collected from the performance counters over intervals of a chosen size, and activity factors are estimated using the previously established coefficients as in Equation (3.6). These activity factors are combined with capacitance models in the simulator to estimate power.
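The following sketch shows how the coefficients of Equation (3.6) could be fitted by ordinary least squares from training data and then used to predict the activity factor of a macro for a new interval. The counter statistics and observed activity factors are synthetic placeholders standing in for data produced by training-benchmark simulation; this is not CAMP's actual implementation.

```python
# Sketch of fitting the activity-factor model of Equation (3.6) by least squares.
# The statistics and observed activity factors are synthetic stand-ins.

import numpy as np

# Each row: [IPC, SIPC, LR, SR, FPLd/St, FP64, BR] for one training interval.
stats = np.array([
    [1.2, 1.4, 0.30, 0.15, 0.02, 0.01, 0.18],
    [0.8, 0.9, 0.25, 0.10, 0.00, 0.00, 0.22],
    [2.1, 2.3, 0.35, 0.20, 0.05, 0.04, 0.12],
    [1.6, 1.8, 0.28, 0.14, 0.01, 0.01, 0.15],
    [0.5, 0.6, 0.20, 0.08, 0.00, 0.00, 0.25],
    [1.9, 2.0, 0.33, 0.18, 0.03, 0.02, 0.14],
    [1.1, 1.2, 0.27, 0.12, 0.01, 0.00, 0.20],
    [2.4, 2.6, 0.38, 0.22, 0.06, 0.05, 0.10],
])
observed_af = np.array([0.42, 0.30, 0.61, 0.48, 0.21, 0.56, 0.38, 0.67])

# Design matrix with an intercept column (c1); the remaining columns give c2..c8.
X = np.column_stack([np.ones(len(stats)), stats])
coeffs, _, _, _ = np.linalg.lstsq(X, observed_af, rcond=None)

def activity_factor(interval_stats):
    """Predict a macro's activity factor from a new interval's counter statistics."""
    return float(coeffs[0] + np.dot(coeffs[1:], interval_stats))

print(activity_factor([1.5, 1.7, 0.29, 0.16, 0.02, 0.01, 0.16]))
```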

3.2 Power Modeling for SoC-Based Embedded Systems

In contrast to general-purpose processors, which are designed for a large range of SW applications, an embedded system is designed for a target application or a range of applications with known characteristics. To create the best design, embedded system designers have to consider the co-optimization of both the SW and HW, where the boundary between the HW and SW domains is fluid and computing functions can be implemented in either domain [8, 31]. An SW/HW power co-estimation framework enables designers to identify the impact of SW/HW partitioning, component selection, and system design choices on the total power consumption [70, 80, 84, 129, 137].

3.2.1 Power-Aware SoC Design Flow

The overall flow of power modeling in SoC-based embedded systems is illustrated in Figure 3.7 [70, 80]. In the first step, a system specification (typically written in SystemC) is partitioned into HW and SW domains. The SW component is compiled with a compiler for the target processor core. Using a library of cores, custom intellectual property (IP) blocks, memories, and on-chip network designs, a fast synthesis tool compiles the HW specifications into representations (e.g., gate, RTL, or architectural design templates) sufficient for power modeling. The macro modeling or analytical power modeling techniques discussed in Section 3.1 can then be applied to these representations. The original SystemC specifications, the outputs of SW compilation, and the outputs of HW synthesis are provided as inputs to the system simulator.


Fig. 3.7. Power modeling flow for SoC designs.

The simulator executes a discrete-event model of the entire system, synchronizing and transferring the required data between the HW and SW power simulators. The SW power simulator uses an instruction-set simulator, as described in Section 3.1, to estimate the power consumption on a cycle-by-cycle basis for the target processor core. The HW power estimator estimates the power consumption of the individual HW components based on their input vectors, commands, and operation modes. By synchronizing and transferring data between the SW and HW components, the system simulator is able to capture the interplay between them. In embedded systems the interactions between software transformations and the parameters of the memory subsystem are particularly important [80]. For example, loop unrolling can reduce loop overhead and improve instruction-level parallelism, but it can lead to a situation where the code no longer fits in the cache. System-level simulation enables designers to assess the implications of loop unrolling and other SW optimizations on the total system power.

At the most basic level, the power consumption of the HW components can be estimated from the input vector sequences and HW gate-level representations using the techniques described in Section 2. However, these techniques become computationally infeasible, especially for large SoC designs. Many of the high-level power modeling techniques described in Section 3.1 for processors are applicable to SoC designs as well. However, the embedded, customized nature of SoC designs enables new modeling techniques. In the rest of this section, we describe some of the techniques that have been proposed in the SoC-based embedded systems literature for fast, high-level power modeling.

3.2.2 Fast High-Level SoC Power Modeling

Early approaches to power modeling of embedded systems considered simple power models for the HW components, where each component is characterized by a number of mode states, each with its own power consumption [129]. For example, a processor could be in an active or a stall mode, and a memory could be in an active, an idle, or a refresh mode. The average power values for each of these modes are first identified and then incorporated within a cycle-accurate system-level simulation, as illustrated in Figure 3.7, to estimate the total power. While such an approach is sufficient for simple embedded systems, advanced SoCs in embedded systems can execute a large range of applications with different power characteristics, which requires more elaborate power models.
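A minimal sketch of such a mode-based model follows. The components, modes, and per-mode power values are illustrative assumptions; in practice they would be pre-characterized for the actual HW components and driven by the mode residencies reported by the system simulator.

```python
# Sketch of a simple mode-based power model: each component has a small set of
# operating modes, each with a pre-characterized average power. All values and
# the example residency schedule are illustrative.

MODE_POWER_W = {
    "cpu":    {"active": 1.20, "stall": 0.45},
    "memory": {"active": 0.80, "idle": 0.10, "refresh": 0.25},
}

def average_power(residency):
    """residency: {component: {mode: fraction of time spent in that mode}}.
    Returns the time-weighted average system power in watts."""
    total = 0.0
    for component, modes in residency.items():
        for mode, fraction in modes.items():
            total += MODE_POWER_W[component][mode] * fraction
    return total

# Example: CPU active 70% of the time, memory mostly idle.
print(average_power({
    "cpu":    {"active": 0.7, "stall": 0.3},
    "memory": {"active": 0.2, "idle": 0.75, "refresh": 0.05},
}))
```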

A reasonable compromise between the high accuracy of waveform-based estimation and the simplicity of mode-based power estimation is to extract key information from the waveforms of input sequences and use this information as input to the power models. Two techniques are possible.

(1) Sequence compaction: it is possible to compact a sequence of input vectors into a smaller sequence that yields almost the same power characteristics. One possible compaction heuristic is to derive the spatiotemporal statistics of the input vectors within a window of cycles and then use these statistics to identify a smaller subset of input vectors that preserves the same spatiotemporal statistics as the original sequence [70].

(2) Sequence abstraction: another possibility is to abstract a sequence of input vectors into commands or fundamental operations that describe the function requested from the component [40, 41]. For example, a UART component can receive a sequence of input vectors that commands it to transmit an input data packet, or a video processing component can receive a sequence of input vectors that commands it to decode, rotate, or transform an input data frame [105]. In this technique, the commands or operations of every HW component are first identified from its functionality. The power consumption of these commands is then estimated using input sequences that correspond to each command and representative input data. The results of the power simulation are then stored in a look-up table that is accessed whenever a future command is sent to the component (a minimal sketch of such a command-indexed table follows this list). To make this technique more accurate, it is possible to model the impact of data dependency and inter-command dependency on the component's power consumption.
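The following sketch shows the command-indexed look-up table idea behind sequence abstraction. The component names, commands, and per-command energies are hypothetical; they would normally be characterized once per component from representative input sequences.

```python
# Minimal sketch of sequence abstraction: per-command energy values for each
# hardware block are pre-characterized once and looked up whenever the system
# simulator issues that command. All names and energies are hypothetical.

# Pre-characterized energy per command, in nanojoules (from offline simulation).
COMMAND_ENERGY_NJ = {
    ("uart", "transmit_packet"): 12.0,
    ("uart", "receive_packet"): 10.5,
    ("video", "decode_frame"): 850.0,
    ("video", "rotate_frame"): 310.0,
}

def trace_energy(command_trace):
    """command_trace: list of (component, command) pairs issued during simulation."""
    return sum(COMMAND_ENERGY_NJ[event] for event in command_trace)

trace = [("uart", "receive_packet"), ("video", "decode_frame"),
         ("video", "rotate_frame"), ("uart", "transmit_packet")]
print(trace_energy(trace), "nJ")
```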

With the use of both techniques, it might be necessary to preserve some key bits in the input waveforms. For example, power management techniques introduce a number of different operational modes for a component, and each of these modes gives a trade-off between throughput and power consumption. Direct gate-level or RTL power estimation using waveform sequences naturally estimates the power under the correct operational mode. Because the correct operational mode is encoded in the signal waveforms, sequence compaction or abstraction can lead to a loss of such key information, potentially yielding wrong power estimates due to mode mismatch. Thus, it is necessary that the waveforms of key input signals are first analyzed to identify the operational mode of the component [53]. The identified mode is then used to steer power modeling to produce the correct estimates. Sometimes it can also be useful to deploy different techniques for different groups of bits. For example, it is better to model the least and most significant bits of datapath components (e.g., adders and multipliers) separately, where the switching activities of the least-significant bits are modeled using a white noise model with an average value, while those of the most-significant bits are modeled more accurately using their spatiotemporal characteristics [30, 73].


3.2.3 Power Modeling of SoC On-Chip Communication

An important SoC component is the system-level on-chip communication subsystem; studies in the literature have shown that it can consume a significant amount of power [61, 69]. System-level on-chip communication is typically referred to as a network-on-chip (NoC) or on-chip network (OCN) [113]. On-chip architectures can take a number of topologies, including shared and hierarchical buses, point-to-point buses, rings, and meshes [109, 113]. The building blocks of an NoC include the FIFO buffers, arbiters, crossbar switches, and bus links. Nodes in shared bus systems are designated as masters or slaves and communicate using transactions on the bus. Estimating the NoC power entails estimating and adding up the dynamic and leakage power of all of its building blocks [143, 61]. Modeling each of these blocks can be achieved using the same empirical synthesis and analytical techniques discussed in Section 2 and Section 3.1. Figure 3.8 gives an overview of an empirical synthesis-based flow for bus-based NoC power estimation [69].

NoCs are usually implemented using a number of typical architectural design templates, and thus analytical techniques are particularly attractive. FIFO buffers, for example, resemble SRAM arrays to a great extent, and thus the techniques discussed in Section 3.1 apply well to them. Next, we describe an analytical power model for a crossbar switch.

Fig. 3.8. Synthesis-based flow for power estimation of a shared-bus NoC [69].


Fig. 3.9. Architectural template for a crossbar matrix design with n input ports and n output ports with 1-bit port width. Not all control signals of the tri-state buffer connectors are labeled.

A crossbar switch is a circuit designed to forward traffic from any of its input ports to any of its output ports depending on the applied control signals. Figure 3.9 illustrates an architectural template for an NoC matrix crossbar switch [143]. The total capacitance of the switch is equal to the sum of three capacitances: the input line capacitances, the output line capacitances, and the control line capacitances. The capacitance of each line can be estimated as follows.

(1) input line capacitance = capacitance of an input buffer + capacitance of the wire, which is proportional to the width of the crossbar + the number of output ports × the input capacitance of a connector.

(2) output line capacitance = capacitance of an output buffer + capacitance of the wire, which is proportional to the length of the crossbar + the number of input ports × the output capacitance of a connector.


(3) control line capacitance = capacitance of an average-length control wire + capacitance of a control connector × port width in bits (assuming the control wires run horizontally).

Using these capacitances together with the data port width and the rates of input and output bit switching, it is possible to estimate the dynamic power consumption. To estimate the crossbar leakage power, it is first necessary to estimate the leakage power of the fundamental circuit elements (e.g., individual transistors and gates) under various input states, as discussed in Section 2. Because each element can be in a different state, the leakage has to be analyzed probabilistically: the average leakage of an element is a weighted sum of its leakage in each of its states, where a state's weight is equal to the probability of being in that state [27]. The total leakage of the crossbar is then the sum of the average leakages of its elements. As discussed in Section 2, leakage power is sensitive to process variations, operating temperature, and voltage. Therefore, it is necessary to re-evaluate the NoC power consumption at different design corners [110].
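The following sketch assembles the three line-capacitance contributions listed above into a lumped crossbar capacitance and a simple dynamic power estimate. The capacitance constants, the way wire length is folded into a per-crossed-port term, and the use of a single average toggle rate are all simplifying assumptions made for illustration.

```python
# Illustrative sketch of the crossbar capacitance model in items (1)-(3) above,
# combined with an average toggle rate to estimate dynamic power.

C_BUF_IN = 2.0e-15         # input buffer capacitance (F), assumed
C_BUF_OUT = 3.0e-15        # output buffer capacitance (F), assumed
C_WIRE_PER_PORT = 4.0e-15  # wire capacitance contributed per crossed port (F), assumed
C_CONN_IN = 1.0e-15        # input capacitance of one tri-state connector (F), assumed
C_CONN_OUT = 1.2e-15       # output capacitance of one tri-state connector (F), assumed
C_CTRL_WIRE = 5.0e-15      # average-length control wire capacitance (F), assumed
C_CONN_CTRL = 0.8e-15      # control-input capacitance of one connector (F), assumed

def input_line_cap(n_out):
    # Item (1): buffer + wire spanning the crossbar + one connector load per output port.
    return C_BUF_IN + C_WIRE_PER_PORT * n_out + n_out * C_CONN_IN

def output_line_cap(n_in):
    # Item (2): buffer + wire spanning the crossbar + one connector load per input port.
    return C_BUF_OUT + C_WIRE_PER_PORT * n_in + n_in * C_CONN_OUT

def control_line_cap(port_width_bits):
    # Item (3): average-length control wire + one connector control input per data bit.
    return C_CTRL_WIRE + C_CONN_CTRL * port_width_bits

def crossbar_dynamic_power(n_in, n_out, port_width_bits, toggle_rate_hz, vdd):
    """Lump all lines (one input/output line per port and per data bit, one control
    line per crosspoint) and apply P = r * C_total * Vdd^2 for an average toggle rate r."""
    c_total = (port_width_bits * (n_in * input_line_cap(n_out)
                                  + n_out * output_line_cap(n_in))
               + n_in * n_out * control_line_cap(port_width_bits))
    return toggle_rate_hz * c_total * vdd ** 2

print(crossbar_dynamic_power(n_in=5, n_out=5, port_width_bits=32,
                             toggle_rate_hz=2.0e8, vdd=1.0))
```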

In Section 3.1, we discussed techniques that use regression macro models to express the power consumption of a processor as a function of its design parameters. This technique applies well to on-chip communication, where the power consumption of an on-chip communication system can be modeled as a weighted linear combination of a number of high-level parameters, such as the number of masters, the number of slaves, the datapath width, and the arbitration policy [13]. The coefficient weights can be found using regression analysis based on results from empirical or analytical power models of the NoC configuration. Executing the regression analysis on every possible NoC configuration is computationally infeasible. Instead, the regression analysis can be executed on a sample of NoC configurations, and the results are used to build a response surface model for each coefficient. The coefficients for the non-modeled configurations are interpolated directly from the constructed response surfaces.

As in the case of computational nodes, the nature of NoC transactions can lead to significant variations in power consumption. For example, Lahiri and Raghunathan [69] show up to 19% power variation depending on the spatial distribution of transactions in a memory-mapped shared bus architecture. Other factors that determine power consumption include the nature of the bus transactions (e.g., sequential versus pipelined) and the encoding of the bus.

3.2.4 Heterogeneous Power Modeling

It is possible to utilize heterogeneous power models, with different accuracies and computational costs, to model the power consumption of SoC components [5, 53, 75]. Each component in an SoC can be modeled using various power models (e.g., mode-based models, gate-level models, or command-based models). The power models vary in accuracy and efficiency, and designers usually have to trade off power estimation accuracy against the computational runtime of the power estimation tools. One possibility is to choose among the power models during simulation depending on the required accuracy or efficiency. For example, if the SoC processing core consumes the largest portion of the total power, then the power model for the core should be more accurate than the power model of a UART that consumes a smaller portion of the total power. Another possibility is to allocate alternative power models that vary in accuracy and efficiency, where the simulation tool customizes the temporal and spatial allocation of computational effort among the power models of the various components. The customization is based on the variation of each component's power consumption and its percentage contribution to the total power.

3.3 Power Modeling for FPGAs

There are two purposes for power modeling of FPGAs. The first is power modeling for FPGA architectural exploration, and the second is power modeling for designs programmed into the FPGA fabric. The first purpose is important for FPGA architects and manufacturers. An FPGA architecture has many parameters, such as the LUT size, the number of BLEs clustered in a LAB, the distribution of wire segment lengths, the switch topology, and the distribution of heterogeneous resources. Architects need to understand the impact of different values of these parameters on the FPGA's final power consumption. These architectural estimates tend to be highly approximate, as detailed designs are not yet available for the FPGA itself. For architectural exploration, the fidelity of ranking the relative impact of different design choices on power is critical. Once an architectural design is chosen, implemented, and fabricated, it is possible to obtain highly accurate power estimates that are calibrated against measurements from the actual FPGA device [107]. Power evaluation for FPGA architectural exploration reduces to estimating the power consumption of a number of representative user designs for every considered FPGA architectural configuration.

The second purpose is relevant for FPGA users, who need to quantify the impact of their design choices on power consumption. Users might also like to conduct their own high-level power-aware design exploration, where the power consumption of different user design choices is estimated and used to choose the best design. In FPGAs, power modeling is split between two designs: the design of the actual FPGA fabric, which is known to the designers who conceived the FPGA architecture, and the design that is programmed into the FPGA fabric by the users. Users do not have access to the detailed FPGA architecture and schematics, and thus power characterization of the different FPGA structures must be conducted by the FPGA designers, who have access to the internal FPGA circuits. The pre-characterization data has to be bundled with the power estimation tools distributed by the FPGA vendor.

3.3.1 Power-Aware FPGA Design Flow

A typical flow for FPGA power estimation is given in Figure 3.10. At the highest abstraction level, the design can be described in a high-level description language such as C, SystemC, or SystemVerilog. Behavioral synthesis is used to compile the high-level description into a register-transfer level (RTL) implementation, which is then mapped to the BLE structure of the FPGA. Clustering packs groups of these BLEs into LABs depending on the connectivity between the BLEs and the size of each LAB. The LABs are then placed in the FPGA and routing is determined.


Fig. 3.10. Power estimation flow.

Using the final routed design, the configuration bits of all the logic and routing blocks are determined, and the design is then ready to be programmed into the FPGA. Power modeling can occur at all of the different stages of the design. High-level power estimates are inherently less accurate than low-level power estimates; however, high-level estimates are very useful in guiding the crucial early stages of the design toward the best choices from the perspective of power and other design considerations. FPGAs can also be programmed with soft processing cores that are implemented as regular designs in the FPGA's reconfigurable fabric. These soft processors add an extra dimension to FPGA power modeling, as their power consumption depends on their SW applications [108]. We will next discuss the various techniques that can be used to model switching activity, capacitances, and leakage at the different design abstractions.

3.3.2 Low-Level Power Modeling

Switching activity analysis for FPGAs can be based on input vector simulation or on probabilistic analysis, as discussed in Section 2.1. The regular structure of FPGAs is particularly attractive for input vector simulation, where it is possible to build a pre-characterized look-up table for a single element such as a BLE and then re-use the table for the simulation of the thousands of BLEs available in the FPGA [77, 78]. Simulating thousands of BLEs is more feasible than direct simulation of millions of gates as in the case of custom designs. To build the look-up table, every possible input pair sequence is applied to a single BLE and a transistor-level SPICE simulation is conducted to estimate the dynamic power of every pair. The exact power consumption cannot be determined during FPGA design, when the programmed design and its inputs are unknown. Thus, the look-up table has to be bundled with the vendor's power estimation tool and later used when the input vectors and the programmed design are known.

Similar to custom circuits, the average switching activity of a BLE can be calculated analytically as a function of the switching activities of the BLE's inputs and the Boolean function programmed into the BLE: the output switching activity is the weighted sum of the input switching activities, and the weight of an input is equal to the probability that a transition occurs at the output of the BLE in response to a transition on that input [99, 107, 116, 127]. Through an iterative approach, the power estimation tool propagates the probabilistic information associated with the input pins through the FPGA design configuration and generates the activities of every internal signal. The procedure is conceptually similar to the probabilistic technique discussed in Section 2.1 in the context of Equation (2.4), except that in the FPGA case the building blocks of the design are the BLEs rather than the gate primitives.
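The sketch below illustrates the weighted-sum activity propagation for a single BLE: for each input, it computes the probability that a toggle on that input propagates to the output of the programmed LUT, given the static signal probabilities of the other inputs, and then combines the input activities accordingly. The two-input AND example and the signal statistics are illustrative assumptions.

```python
# Sketch of BLE output switching-activity estimation as a weighted sum of the
# input activities, where each weight is the probability that an input toggle
# propagates through the programmed LUT.

from itertools import product

def propagation_weight(lut, input_index, signal_probs):
    """Probability that a toggle on input `input_index` toggles the LUT output,
    averaged over the static signal probabilities of the other inputs.
    lut maps tuples of input bits to an output bit (0/1)."""
    k = len(signal_probs)
    weight = 0.0
    for bits in product((0, 1), repeat=k):
        flipped = list(bits)
        flipped[input_index] ^= 1
        if lut[bits] != lut[tuple(flipped)]:
            prob = 1.0
            for i, b in enumerate(bits):
                if i != input_index:
                    prob *= signal_probs[i] if b else 1.0 - signal_probs[i]
            weight += prob
    # Each assignment of the other inputs is visited twice (input = 0 and input = 1).
    return weight / 2.0

def output_activity(lut, signal_probs, input_activities):
    # Output activity = weighted sum of the input activities.
    return sum(propagation_weight(lut, i, signal_probs) * act
               for i, act in enumerate(input_activities))

# Example: a 2-input BLE programmed as an AND gate, with equiprobable inputs.
and_lut = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
print(output_activity(and_lut, signal_probs=[0.5, 0.5], input_activities=[0.2, 0.3]))
```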

The regular structure of FPGAs generally lends itself to fast capacitance characterization. Because the reconfigurable routing resources in FPGAs consume a considerable area, they tend to dominate the FPGA's timing and power [24]. In contrast to custom circuits, which usually have a myriad of wire segments of different sizes, FPGAs have a regular structure in which a few wire segment configurations are replicated across the whole FPGA. Thus, once capacitance estimates are extracted for a single wire segment, the estimates for all instances of the segment are automatically determined. Capacitance estimates can also be verified and calibrated using characterization measurements from the actual FPGA [107]. These capacitance estimates are also bundled with the vendor's FPGA power estimation tool. The user would later use the placement and routing results of the mapped design to identify the exact wire paths and routing resources used in the FPGA.

An FPGA is designed to accommodate a large number of possible designs. These designs might not fully utilize the available programmable logic resources of the FPGA. The average utilization of an FPGA is estimated to be about 75% [140], but this figure can go very low or very high depending on the design, and the dynamic power varies accordingly. Thus, it is necessary to scale any pre-characterized power estimates for the FPGA structures by the actual resource utilization of the programmed design [127].

Due to its high switching activity, the clock network has a significant impact on the FPGA's power consumption. Estimating the power consumption of the FPGA's clock requires knowledge of its topology, the dimensions of its global and local wire segments, and the number of its buffers. If the final FPGA implementation is available, then it is possible to extract good estimates for these network parameters. For early architectural exploration, approximate models can be developed based on the clock network topology, the size of the FPGA, and the target fabrication technology [71, 72, 115]. The number of clock-tree repeaters can also be estimated using approximate RC models [115]. Modern FPGAs usually have more than one clock domain, where global clocks are distributed across the whole chip and local clocks are distributed within a region [71, 72]. Multiple clock domains enable greater user design flexibility and finer-grain control over runtime power consumption. They also introduce reconfigurable multiplexers into the clock signal path to select the appropriate domain for each flip-flop, and the additional power consumption of these multiplexers has to be modeled.

3.3.3 High-Level Power Modeling

During high-level design power exploration and synthesis, the final BLE-based implementation is not available, which complicates power analysis. High-level synthesis is divided into three tasks: scheduling, allocation, and binding. Each of these tasks needs to be guided by power, area, and delay estimates to find the best possible synthesis solution. To find the switching activity at such a high level of abstraction, the design is first expressed as a two-level control data flow graph (CDFG). At the first level, the graph has nodes corresponding to the basic blocks of the design, and at the second level, each block contains a set of operation nodes and edges that represent data dependencies. The switching activity is determined using an iterative, test-vector based functional simulation. For each set of input vectors, the control and the data flow from the inputs through the CDFG until they reach the outputs [24, 25]. The individual CDFG nodes are looked up to find their expected number of logic elements, DSPs, and I/Os, together with their pre-characterized capacitances. For routing power estimation, statistical Rent-based techniques must be used to estimate the wire length distribution as a function of the LAB sizes, the number of their inputs and outputs, and the expected complexity of the programmed design [24, 32]. The statistical wire length distribution is used to calculate the capacitance estimates as well as the timing information using approximate RC models. The timing information is helpful for estimating the switching activity.

3.3.4 FPGA Dynamic Power Breakup

Combining the estimates of the capacitances, resource utilization, and switching activities, the dynamic power of the FPGA can be calculated. Based on various reported power distributions of real circuits, Figure 3.11 gives a typical dynamic power distribution across the FPGA structures [127]. In FPGAs, the reconfigurable routing fabric is expected to consume about 45–55% of the total power, the logic blocks about 20–25%, and the clock and I/O about 25–35%. Short-circuit power is approximately 10% of the dynamic power [115, 127].


Fig. 3.11. Typical FPGA power consumption breakdown.

3.3.5 FPGA Leakage Power Analysis and Breakdown

For leakage power, the simplest estimation method is to use SPICE simulation or analytical derivations based on Equation (2.5) of Section 2 to estimate the leakage of the individual NMOS and PMOS transistors that are part of the reconfigurable fabric of the FPGA. The leakages are then summed over all idle transistors [115, 116]. A more elaborate scheme simulates in SPICE, or derives analytically, the leakage of the individual structures that make up the reconfigurable fabric (e.g., logic blocks, routing switches, and SRAM blocks) [67, 140]. The total leakage power is then the sum, over all structures, of the leakage of an individual structure times the number of structure instances. Since the estimation is conducted on a per-structure basis, it is then possible to explore the impact of different input data states and temperatures on leakage [67, 140]. Resource utilization can impact leakage power estimates as well, because non-utilized reconfigurable blocks have different input states than the utilized blocks [140]. In one study for a 90 nm technology, it was reported that routing structures, logic structures, and SRAM configuration cells contribute 53%, 41%, and 6%, respectively, of the total leakage power [67]. That study assumed that the SRAM configuration cells are manufactured with high Vth. Implementing high Vth for the configuration SRAM cells only increases the FPGA's programming time; it does not impact the runtime performance of user designs.
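The following sketch shows the structure-based summation described above: a per-instance leakage value for each fabric structure, optionally split by whether the instance is utilized, is multiplied by the instance count and summed. All counts and per-instance leakage values are illustrative placeholders.

```python
# Sketch of the structure-based FPGA leakage estimate: per-instance leakage of
# each fabric structure (utilized vs. idle) times its instance count, summed
# over all structures. All numbers are illustrative placeholders.

LEAKAGE_W = {
    "logic_block":    {"used": 3.0e-6, "idle": 2.2e-6},
    "routing_switch": {"used": 1.5e-6, "idle": 1.1e-6},
    "sram_config":    {"used": 0.1e-6, "idle": 0.1e-6},  # assumed high-Vth cells
}

def total_leakage(instance_counts, utilization):
    """instance_counts: {structure: total instances on the device}.
    utilization: {structure: fraction of instances used by the programmed design}."""
    total = 0.0
    for structure, count in instance_counts.items():
        used = count * utilization.get(structure, 0.0)
        idle = count - used
        total += (used * LEAKAGE_W[structure]["used"]
                  + idle * LEAKAGE_W[structure]["idle"])
    return total

print(total_leakage(
    {"logic_block": 20000, "routing_switch": 150000, "sram_config": 4000000},
    {"logic_block": 0.75, "routing_switch": 0.40, "sram_config": 1.0},
))
```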


3.4 Summary

In this section we discussed power modeling techniques for three computing substrates: processors, SoC-based embedded systems, and FPGAs. The sheer complexity of these systems makes power estimation and power-aware architectural exploration a challenging task. We discussed the impact of SW variations on the power consumption of general-purpose processors. To estimate the power consumption of processors, it is necessary to use a cycle-accurate instruction-set simulator that can identify the activity factors of the different processor units. The capacitances of the different units can be identified either from direct extraction of gate-level layouts or from analytical derivations using canonical architectural templates. A number of speed-up techniques were discussed, including the use of regression models.

We also discussed SoC power modeling techniques. Compared to processors, SoCs pose additional difficulties because the boundaries between the HW and SW domains are fluid, and because they incorporate application-specific custom components. We discussed system-level simulation techniques that transfer and synchronize data between the SW and HW components to ensure that each component receives the correct input sequences, which are necessary to estimate its power consumption. We also discussed speed-up techniques that are specific to SoCs. In particular, the application-specific custom components can be associated with a number of high-level operations or commands.

Power modeling for FPGAs is also difficult because it is not possible to know the user's design during the design of the FPGA device itself. FPGA power-aware architectural evaluation involves simulating the power consumption of a number of representative user designs for each possible architectural configuration. We discussed techniques for developing power models of the FPGA's basic building blocks (e.g., logic array blocks and routing switches). These models are later used by the end users to estimate the power consumption of their designs. In some cases, user designs can be programmable processors or SoCs that may run a large range of applications, making the power estimation process co-dependent on the software.


4 Post-Silicon Power Characterization

Once a design is fabricated, the manufactured devices can yield a wealth of power characterization data that can improve various design choices and runtime applications. The produced devices can be directly characterized for their true power consumption under various loading conditions, with no need for the approximate models or simulations used in pre-silicon design. In this section we discuss three main research directions for post-silicon characterization.

(1) Detailed power mapping for validation. Design-time power estimates are approximate. By measuring the electrical current and infrared emissions of a fabricated device, the true power consumption of the circuit structures is revealed. These post-silicon power estimates can be used to validate design-time power estimates, to calibrate the empirical models used by CAD tools, or to re-spin the design if necessary.

(2) Power management for adaptive power-aware computing. Runtime power management systems adapt the performance and power of computing devices depending on runtime constraints. When direct power measurements are not available, it is useful to develop mathematical models that can estimate the total power consumption of a device from its operating characteristics.

(3) Software power analysis. Using the fabricated device, it is possible to conduct lumped power characterization of individual instructions and their interactions, and then use the characterization results to develop models for software power analysis. These models enable software developers and compiler designers to tune their code and algorithms to improve energy efficiency.

4.1 Power Characterization for Validation and Debugging

While computer-aided power analysis tools can provide power consumption estimates for various circuit blocks, these estimates can deviate from the actual power consumption of working silicon chips. There are several reasons why pre-silicon estimates deviate from post-silicon power measurements.

(1) Large input vector space: the exponentially vast number of possible input vector sequences and the billions of transistors implemented in current designs make it impossible to determine the power consumption incurred from every possible input vector sequence or software application.

(2) Errors in coupling capacitance estimation: coupling capacitance between neighboring wires is determined by the exact waveform activity experienced by the wires. The computational infeasibility of simulating every input vector makes estimating the coupling capacitance during design time a difficult process. Wrong estimates for the coupling capacitance impact switching power in the following ways [43]: they lead to incorrect estimates of the incomplete voltage transitions arising from crosstalk noise; they impact signal slew times, which determine short-circuit power; and they impact the signal timing delays, which determine the occurrence of glitches.



(3) Spatiotemporal thermal variations: incorrect estimates for dynamic power imply incorrect estimation of thermal variations, which in turn leads to further deviations in leakage and total power estimates.

(4) Process variations: intra-die and inter-die process variations are unique for each die. Process variations impact leakage power as discussed in Section 2, and they also impact the signal delays along the circuit paths, which determine glitches. Process variations can introduce serious deviations in power consumption compared to pre-silicon design-time estimates.

Because of the aforementioned complexities, pre-silicon power estimates might not be accurate when compared to the real post-silicon characteristics. In many cases the mismatch between pre-silicon and post-silicon characteristics forces major changes in the design and implementation of integrated circuits. In a study from 2005, it was shown that 70% of new designs require at least one design re-spin to fix post-silicon problems and that 20% of these re-spins are due to power and thermal issues [133]. It is likely that these figures are much higher now as chips are more complex with a larger number of transistors.

Post-silicon power characterization of integrated computing devices provides the true spatial and temporal power characteristics under representative loading conditions, such as workloads and dynamic voltage and frequency scaling (DVFS) settings. Figure 4.1 gives an integrated framework, where measurements from infrared imaging equipment together with electrical current measurements are used for power characterization. The electric current measurements provide the total power consumption. If the chip supports multiple independent power supply networks, then it is possible to increase the pool of current measurements. The thermal emissions captured by the infrared imaging system, together with the lumped electrical current measurements, can be inverted to yield high-resolution spatial power maps for the individual circuit blocks.



Fig. 4.1. Using post-silicon power characterization for design validation and CAD tool calibration. P & R stands for placement and routing.

The results of post-silicon power characterization can improve the design process during re-spins or for future designs in the following ways.

• High-level power modeling tools rely on the use of parameters that are estimated from empirical data, as discussed in Section 3 [19, 118]. The post-silicon power characterization results can be used to calibrate and tune high-level pre-silicon power modeling tools.

• To evaluate "what if?" design re-spin questions, the power characterization results can substitute for the power simulator estimates and directly feed the thermal simulator. For example, if the thermal characterization results are unacceptable, then the layout can be changed to reduce the spatial power densities and hot-spot temperatures. The power characterization results are fed together with the new layouts to the thermal simulator to evaluate the impact of the changes.

• The post-silicon power characterization results can also force a re-evaluation of the computing device specifications (e.g., operating frequencies and voltages).



4.1.1 Relationship between Power and Temperature

The relationship between power and temperature is described by the physics of heat transfer. Heat diffusion governs the relationship between power and temperature in the bulk of the die and the associated metal heat spreader, while heat convection governs the transfer of heat from the boundary of the integrated metal spreader/sink to the surrounding fluid medium, which is either air or liquid. The heat diffusion equation is given by

\[
  \nabla \cdot \left( k \, \nabla t(x,y,z) \right) + p(x,y,z) = \rho \, c \, \frac{\partial t(x,y,z)}{\partial t},
  \tag{4.1}
\]

where t(x,y,z) is the temperature at location (x,y,z), p(x,y,z) is the power density at location (x,y,z), ρ is the material density, c is the specific heat density, and k is the thermal conductivity [112]. k, ρ, and c are all functions of location [112]. The power transferred at the boundary by convection is described by Newton's law of cooling, and it is proportional to the temperature difference between the boundary of the heat sink and the ambient temperature. The constant of proportionality is the heat transfer coefficient, which depends on the geometry of the heat sink, the fluid used for heat removal, and its speed [55].
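For reference, the convective boundary condition described above can be written compactly as follows, where the symbols are our own shorthand (q for the heat flux through the boundary, h for the heat transfer coefficient, and t_s and t_amb for the sink-boundary and ambient temperatures) rather than notation defined in the text:

\[
  q = h \left( t_{s} - t_{amb} \right).
\]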

4.1.2 Basics of Infrared Imaging

Any body above absolute zero emits infrared thermal radiation with an intensity that depends on its temperature, its emissivity, and the radiation wavelength. Transistors and interconnects in integrated circuits operate at elevated temperatures due to resistive heating by charge carriers [117]. Modern computing integrated circuits use flip-chip packaging, where the die is flipped over and soldered to the package substrate. By removing the package heat spreader, one can obtain optical access to every device on the die through the silicon backside. It is possible to use infrared-transparent heat removal techniques that have similar thermal characteristics to the original heat spreader [91, 122].¹

Silicon is transparent to infrared emissions with photon energies that are less than its bandgap energy (1.12 eV), which corresponds to wavelengths larger than 1.1 µm. This transparency is ideal from an infrared imaging perspective as it enables the capture of photonic emissions from the devices, which provides valuable information for thermal and power characterization of computing devices operating under realistic loading conditions.

¹ To capture the thermal emissions from an operational die, it is necessary to remove the optical obstruction introduced by traditional heat removal mechanisms (e.g., integrated heat spreader, metal heat sink, and fan), and to substitute them with mechanisms that can remove heat while remaining transparent to infrared emissions. The standard technique is the use of infrared-transparent oil-based heat removal systems [46, 91, 92, 102, 122]. For our experiments, we machined a special sapphire oil-based infrared-transparent heat sink that precisely controls the oil flow over the computing device. Chilled oil is forced into an inlet valve of the sink, flows on top of the die to sweep the heat, and then exits through an outlet valve. The oil maintains its flow using an external pump, and the temperature of the oil is controlled using a thermoelectric cooler. By controlling the oil flow and its temperature, as well as the sapphire window dimensions, it is possible to obtain thermal characteristics similar to those of the original heat removal package [91, 122].

One of the key specifications of an infrared imaging system is its spectral response, which determines the part of the infrared spectrum that the imaging equipment can detect. For the range of temperatures encountered during chip operation, the mid-wave infrared (MWIR) range, which stretches from 3 µm to 5 µm, yields the most sensitive and accurate characterization of thermal emissions. Detecting emissions in the MWIR range requires the use of InSb quantum detectors, which have to be cooled to cryogenic temperatures to ensure sensitivity. As a consequence, high-resolution MWIR imaging systems tend to be fairly expensive. The MWIR imaging system used in our laboratory has a focal plane array of 640 × 512 InSb detectors cooled to 77 K (−196 °C) to reduce the noise in temperature measurements to about 20 mK. The imaging system can capture thermal frames at a rate of up to 380 Hz, and depending on the microscopic lens used, the system can resolve temperatures at a spatial resolution down to 5 × 5 µm².

4.1.3 Thermal to Power Inversion

The objective of post-silicon power characterization is to identify the true power consumption of different blocks of a computing integrated circuit under real loading conditions. While it is possible to isolate the power consumption of a circuit block during testing by using scan chains, such an approach is not capable of characterizing the wide range of possible power consumption maps under true loading conditions, where workloads simultaneously exercise multiple circuit blocks in intricate ways. The post-silicon flow of Figure 4.1 is the most versatile approach to compute the power map from the captured thermal emissions and the external current measurements. In its discretized version, the steady-state form of Equation (4.1) is approximated by the following linear matrix formulation

\[
  R \, p + e = t_m,
  \tag{4.2}
\]

where R is the system's thermal resistance matrix, t_m is a thermal map acquired from the thermal imaging system, e is the noise in the measurements, and p is the desired power map. The objective of post-silicon power mapping is to invert the thermal map t_m to produce p.

Given a computing device, it is necessary to measure the thermal-to-power model matrix R. The matrix R can be estimated on a column-by-column basis as follows. Enabling only one circuit block at a time is mathematically equivalent to setting the vector p equal to [0 0 · · · p_k · · · 0 0]^T, where p_k denotes the power consumption of the enabled kth circuit block and T denotes the transpose operation. The power consumption incurred from enabling only one circuit block can be easily measured using the external current sensing meter. If t_k denotes the steady-state thermal emissions captured from enabling the kth circuit block, then column k of matrix R is equal to t_k/p_k. The described estimation method for the matrix R can be implemented using three different approaches depending on the design [28, 46, 92].

(1) One approach is to use control signals to enable individual circuit blocks, which trigger their power consumption [28]. Software micro benchmarks that selectively stimulate desired circuit blocks can be considered as soft enable signals.

(2) A second approach is to turn off the chip and scan a laser beam with known power density to deliver the power from the outside to the circuit block of interest. The scanning of the laser system can be automated by using a pair of galvo-directing mirrors [46].

(3) A third approach is to use design information to conduct finite-element heat convection and diffusion simulations to estimate the matrix R, by applying power pulses in simulation at the desired locations. The results from the thermal simulations of the finite element model are used to construct the matrix R [46, 63, 92].

Given the matrix R, the next step is to execute the desired workload on the device and to capture the resultant thermal emissions t_m. The power map p can then be estimated as follows:

\[
  p = \arg\min_{p} \; \| R \, p - t_m \|_2^2,
  \tag{4.3}
\]

where \(\| \cdot \|_2\) is the L2 norm, subject to

\[
  \| p \|_1 = \sum_i p_i = p_{total} \quad \text{and} \quad p \ge 0,
  \tag{4.4}
\]

where \(\| \cdot \|_1\) is the L1 norm and p_total is the total power consumption of the chip, which can be measured externally using a digital multimeter. The constraint given by Equation (4.4) reduces the possibility of getting multiple solutions to objective (4.3) in case the matrix R is not well conditioned, and it ensures that the power mapping from infrared imaging is consistent with the electrical current measurements. In practice, any digital multimeter has a tolerance, tol, in its measurements (the tolerance is typically listed in the multimeter's data sheet), and thus it is better to replace the constraint of Equation (4.4) by the following two constraints:

\[
  \| p \|_1 \le p_{total} + tol \quad \text{and} \quad \| p \|_1 \ge p_{total} - tol.
\]

In chips with multiple independent power networks, there will be one measurement for each network, and thus there will be multiple corresponding constraints.
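As a concrete illustration of this inversion step, the following minimal sketch solves the constrained least-squares problem of Equations (4.3) and (4.4) with non-negative least squares. The matrix R, the measured thermal map t_m, and the multimeter reading p_total are assumed to come from the characterization steps described above; the total-power constraint is enforced softly through an extra heavily weighted row, and the synthetic values at the bottom are illustrative only.

    import numpy as np
    from scipy.optimize import nnls

    def estimate_power_map(R, t_m, p_total, weight=1e3):
        """Invert a thermal map into a power map (Equations 4.3-4.4).

        R       : (m x n) thermal resistance matrix, one column per circuit block
        t_m     : (m,)    measured thermal map (stacked pixels)
        p_total : float   total chip power from the external current measurement
        weight  : float   penalty that softly enforces sum(p) close to p_total
        """
        m, n = R.shape
        # Append one heavily weighted row of ones so the least-squares solution
        # also satisfies ||p||_1 ~= p_total; nnls itself enforces p >= 0.
        A = np.vstack([R, weight * np.ones((1, n))])
        b = np.concatenate([t_m, [weight * p_total]])
        p, _residual = nnls(A, b)
        return p

    # Toy usage with a synthetic 4-block example (values are illustrative only).
    rng = np.random.default_rng(0)
    R = rng.uniform(0.5, 1.5, size=(64, 4))       # hypothetical resistance matrix
    p_true = np.array([0.025, 0.0, 0.025, 0.0])   # 25 mW blocks enabled/disabled
    t_m = R @ p_true + 1e-4 * rng.standard_normal(64)
    print(estimate_power_map(R, t_m, p_total=p_true.sum()))

In practice, the hard inequality constraints with the multimeter tolerance can be handled with a general quadratic-programming solver; the weighted-row trick above is only a simple approximation of the same idea.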

To illustrate the effectiveness of post-silicon power mapping, our group constructed a test chip in which the underlying switching activity can be precisely controlled. The test chip consists of 10 × 10 micro-heater blocks organized in a grid fashion. Each block contains nine ring oscillators that can be simultaneously enabled or disabled through a control signal stored in an associated flip-flop. When enabled, the dynamic power consumption of a micro-heater block is 25 mW. The size of each block is about 0.56 mm². To create a desired spatial power map, we program the flip-flops with 100 bits that correspond to the desired power map; thus, the length of the p vector is 100. The design is embedded in a section of a 90 nm Altera Stratix II FPGA. Figure 4.2a shows two usages of the test device where the block heaters are enabled to create dynamic power maps: one map spells the word "POWER" in dynamic power activity, and the other resembles the power maps of SoCs. The captured thermal emissions shown in Figure 4.2b clearly show that the process of heat diffusion blurs the underlying power maps. The power estimates produced by our technique are given in Figure 4.2c, and the rounded power estimates (power is rounded to either 0 mW or 25 mW) are given in Figure 4.2d. Compared to the known power maps, the estimated rounded maps show average errors of 0% and 2.2%, respectively, which clearly demonstrates the effectiveness of post-silicon power mapping.

Fig. 4.2. Examples of using post-silicon power mapping: (a) injected power map, (b) resultant thermal map, (c) estimated power map, (d) estimated rounded power map. Estimated power is in mW.

We also conducted a spatial power mapping of the embedded Nios II/f soft processor [103]. The layout of the processor is given in Figure 4.3a, and the captured thermal emissions from executing the Dhrystone II application are illustrated in Figure 4.3b. The estimated spatial power maps in mW are plotted in Figure 4.3c. The estimated spatial maps augment the floorplan with valuable spatial power density estimates. The revealed detailed power estimation maps can be used to calibrate the estimates from high-level dynamic power modeling tools. Given that the design of the Nios II processor is proprietary, it is not possible for us to match the spatial power consumption estimates to the various functional blocks of the processor. Previous work in the literature reports power mapping results for IBM's PowerPC 970 processor [46] and AMD's Athlon processor [92].

Fig. 4.3. Post-silicon power mapping of a Nios II processor: (a) Nios II/f layout, (b) resultant thermal map (total power: 477 mW), (c) estimated power map in mW.

4.1.4 AC-based Thermal to Power Inversion

A major source of error in power mapping from infrared emissions is heat diffusion, which blurs the underlying power map. A recent research direction advocates the use of AC excitation sources instead of the typical DC excitation sources [104]. AC excitation reduces the amount of spatial heat diffusion as the AC excitation frequency increases [17, 89]. In addition, AC excitation has the benefit of reducing flicker noise, which has a 1/f spectrum [51]. Applying a true sinusoidal AC source to excite a digital circuit is impossible. Instead a square wave is applied, and because a square wave can be represented by a Fourier series whose dominant component is the fundamental frequency, such a technique does not alter the results as long as the acquired infrared emissions are filtered to extract only the fundamental frequency [17]. Creating AC square-wave excitations in digital circuits can be implemented by a number of techniques such as: (1) toggling enable signals of circuit blocks while keeping the operating voltage constant; (2) alternating the voltage supply signal between two values (e.g., 0.9 V and 1 V); or (3) executing workloads that alternate between an activity phase and an inactivity phase. AC-based power mapping reduces power estimation error by more than half [104].
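One way to realize the filtering step mentioned above in software is a lock-in style demodulation: each pixel's temperature trace is correlated with a sine and cosine at the excitation frequency, and only that fundamental component is kept. The sketch below assumes a stack of thermal frames sampled at a known frame rate; the frame rate, excitation frequency, and array sizes are illustrative placeholders rather than values from [104].

    import numpy as np

    def lockin_amplitude(frames, frame_rate_hz, excitation_hz):
        """Per-pixel amplitude of the thermal response at the AC excitation
        frequency, extracted from a stack of infrared frames.

        frames        : (T, H, W) array of thermal frames over time
        frame_rate_hz : acquisition rate of the imaging system
        excitation_hz : fundamental frequency of the square-wave excitation
        """
        T = frames.shape[0]
        t = np.arange(T) / frame_rate_hz
        ref_cos = np.cos(2 * np.pi * excitation_hz * t)
        ref_sin = np.sin(2 * np.pi * excitation_hz * t)
        ac = frames - frames.mean(axis=0)          # remove the DC (mean) level
        # Correlate every pixel trace with the references (sum over time).
        i_comp = np.einsum("t,thw->hw", ref_cos, ac) * 2.0 / T
        q_comp = np.einsum("t,thw->hw", ref_sin, ac) * 2.0 / T
        return np.hypot(i_comp, q_comp)

    # Example: 10 s of frames at 380 Hz with a 2 Hz excitation (placeholder data).
    frames = np.zeros((3800, 64, 64))
    amplitude_map = lockin_amplitude(frames, frame_rate_hz=380.0, excitation_hz=2.0)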

4.2 Power Characterization for Adaptive Power-Aware Computing

Another useful reason for post-silicon power characterization is to enable runtime power management. Runtime power management is concerned with adapting the performance and power of a computing device in response to runtime power objectives and constraints. For example, in portable systems it could be desirable to minimize runtime power consumption under performance constraints that guarantee acceptable operation for a particular application, such as H.264 playback. In the case of server systems, it could be desirable to constrain peak power consumption when there is light demand on the system. In many of these runtime management situations, a computing device can lack the means to directly measure its own power consumption, and thus there is a need to develop lumped power models that can estimate the total power of a device from its operating characteristics.

To develop any lumped power-related model, it is necessary to first collect a large volume of power characterization data. There are generally two approaches used to measure the total electrical current consumption of a computing device or system. In the first approach, the power supply lines are intercepted and a shunt resistor (e.g., Figure 4.4a) is inserted in series with the positive supply line.

Fig. 4.4. Power measurement techniques used for lumped current measurements: (a) shunt resistor, (b) clamp meter.



In contrast to regular resistors, shunt resistors have very low resistance (e.g., 1 mΩ) with high accuracy of about ±0.1%. The low resistance is needed to avoid adding a voltage drop along the supply line. The changes in voltage across the shunt resistor are proportional to the electrical current variations, as dictated by Ohm's law. The second approach uses clamp meters (e.g., Figure 4.4b), which utilize the Hall effect to detect electric current variations in the supply line by measuring the induced magnetic field variations surrounding the supply wire. Clamp meters are less intrusive, but they are less accurate and their measurements tend to be noisy compared to shunt resistors. In both approaches, a digital multimeter or an analog-to-digital device is required to log the measurements of the shunt resistor or clamp meter into the power management system of the computing device.
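As a small worked example of the shunt-resistor approach, the snippet below converts a logged shunt-voltage trace into current and power with Ohm's law; the 1 mΩ resistance, 12 V rail voltage, and sample values are hypothetical, not measurements from the text.

    # Convert shunt-voltage samples (in volts) to current and power.
    R_SHUNT = 0.001          # 1 mOhm shunt inserted in series with the supply line
    V_SUPPLY = 12.0          # nominal voltage of the intercepted rail (assumed)

    v_shunt_samples = [0.0031, 0.0044, 0.0052]   # hypothetical multimeter log
    for v in v_shunt_samples:
        current = v / R_SHUNT            # Ohm's law: I = V / R
        power = current * V_SUPPLY       # power drawn from the rail
        print(f"I = {current:.1f} A, P = {power:.1f} W")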

4.2.1 Power Characterization Using Performance Counters

A popular approach for modeling total power in software is through the use of performance counters [9, 11, 56, 57, 58, 59, 82, 111, 118, 131, 147]. Performance counters are inserted by designers to track the usage of different device blocks. For example, the Intel Core i7 has seven counters per core that can be programmed to monitor tens of different processor events. Examples of such events for processors include the number of retired instructions, the number of cache hits, and the number of correctly predicted branches. Examples of performance counters for GPUs include global/local memory loads and stores, the number of instructions executed, and the number of thread warps with bank conflicts [97]. In GPUs, it is also possible to measure workload signals that track the utilization levels of various GPU units, such as vertex shading units, texture units, and geometric shading units [85]. In an off-line characterization phase, a number of workloads are executed on the device, the values from the performance counters together with measurements of the total power are recorded, and the data are then used to develop and train an empirical mathematical model of the power consumption as a function of the performance counters. During runtime, the values of the performance counters are given as inputs to the model to estimate the total power of any instance of the device.



Fig. 4.5. Correlation between performance counters and Intel Core i7 processor power for the PARSEC benchmarks.

In devising empirical power models, there are two key questions.

(1) Which performance counters should be used as inputs to the model? It is desirable to use few performance counters to avoid overfitting the model and to avoid complicating the assignment of performance counters to events. Popular examples in the literature include: the instructions per cycle and L1 load misses per cycle for the POWER6 processor [58]; and the dispatch stalls, the L2 cache misses, and the number of retired micro-ops for the AMD Phenom [131]. For example, using our own measurements of the PARSEC benchmarks [10] on the Intel Core i7, we give in Figure 4.5 the correlation coefficients between a number of performance counters and the processor's power consumption. The results show a very strong correlation between power and the number of micro-ops retired, the L1 data reads, the total instructions executed, and the L1 instruction misses. Figure 4.6 shows the processor power consumption and the number of retired micro-ops over a period of time, clearly illustrating the correlation between the two values.



Fig. 4.6. Processor power consumption and number of retired micro-ops for the x264 application running on Core i7 (1 thread at 2.67 GHz) for 60 s.

(2) What is the nature of the mathematical model? Most of the proposed models in the literature use linear regression, where the values from the performance counters are weighted and linearly combined. If p(k) denotes the power at time instance k, then in a linear regression model, $p(k) = w_0 + \sum_i w_i \, c_i(k)$, where c_i(k) is the value of counter i at time k, w_i is the weight associated with counter i, and w_0 is a constant offset. The weights are learned offline from the collected characterization data using least-squares estimation. Using the performance counters reported in Figure 4.5, we built a power model for the Core i7 processor. Our results show that the average absolute error in power estimation is about 2.13%.
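A minimal version of such a counter-based regression model can be trained with ordinary least squares, as sketched below. The counter matrix, power trace, and event names are placeholders for whatever an off-line characterization run would record; they are not the exact counters or weights of the Core i7 model reported above.

    import numpy as np

    def fit_power_model(counters, measured_power):
        """Fit p(k) = w0 + sum_i w_i * c_i(k) with least squares.

        counters       : (K, N) array, one row per time sample, one column per counter
        measured_power : (K,)   externally measured total power for each sample
        """
        K = counters.shape[0]
        X = np.hstack([np.ones((K, 1)), counters])   # prepend the constant term w0
        weights, *_ = np.linalg.lstsq(X, measured_power, rcond=None)
        return weights                                # [w0, w1, ..., wN]

    def predict_power(weights, counter_sample):
        """Estimate power at runtime from one vector of counter readings."""
        return weights[0] + counter_sample @ weights[1:]

    # Illustrative training on synthetic characterization data.
    rng = np.random.default_rng(1)
    C = rng.poisson(lam=1e6, size=(500, 3)).astype(float)  # e.g., uops, L1 reads, misses
    p = 20.0 + C @ np.array([2e-5, 1e-5, 5e-5]) + rng.normal(0.0, 0.5, 500)
    w = fit_power_model(C, p)
    print(predict_power(w, C[0]))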

4.2.2 Block-Level Power Characterization

To enable sophisticated power management systems that utilize per-core DVFS or selective clock gating to attain larger reductions in power consumption, it is valuable to estimate the power on a per-block basis [57, 131, 118, 147]. These blocks can be coarse-grain, such as core-level structures, or fine-grain, such as register files, reservation stations, and memory caches. Structures in GPUs include register units, ALUs, texture caches, and global/local memories [49]. One method for estimating the power of a circuit block is to scale its known maximum power consumption by an activity factor that is computed from the measurements of the performance counters. The maximum power of the structure can be known from design-time estimates or identified to a reasonable degree during runtime by crafting micro benchmarks, which are simple programs that execute only one type of instruction within an infinite loop. These micro benchmarks can be written for general-purpose processors [57] and GPUs [49]. They trigger power activity in a small subset of structures and can identify the maximum power consumption of the desired blocks. Another method to estimate the power of a circuit block is to build a submodel for its power consumption based on the performance counter measurements that are relevant for the block. For example, the power consumption of a core should be estimated using only its own performance counter measurements. In both methods, estimation errors can be reduced by ensuring that the sum of the estimated powers of all blocks is equal to the total measured power.
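The per-block scaling and renormalization just described can be expressed compactly as follows; the block names, maximum powers, and activity factors are hypothetical placeholders, and the final rescaling simply forces the per-block estimates to sum to the measured total.

    def block_power_estimates(max_power, activity, measured_total):
        """Scale each block's maximum power by its activity factor, then rescale
        so that the estimates sum to the externally measured total power.

        max_power      : dict block -> maximum power (W), e.g., from micro benchmarks
        activity       : dict block -> activity factor in [0, 1] from counters
        measured_total : float, total power measured during the same interval
        """
        raw = {b: max_power[b] * activity[b] for b in max_power}
        scale = measured_total / sum(raw.values())
        return {b: p * scale for b, p in raw.items()}

    # Hypothetical example with three coarse-grain blocks.
    print(block_power_estimates(
        max_power={"core": 25.0, "L2": 6.0, "memctrl": 4.0},
        activity={"core": 0.6, "L2": 0.3, "memctrl": 0.5},
        measured_total=21.0,
    ))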

Performance counters mainly capture the activity of the processor and memory. In some cases, adaptive power-aware computing needs to occur at full server scale [35, 79, 124]. Besides the processor and memory components, major activity and power consumption also occur in the hard disk and network interface components of server systems. The activity of these components is not logged by the processor's performance counters, and thus it is necessary to collect system-wide statistics that measure the utilization of these components. The measured utilizations together with the performance counters can then be used within a regression framework to model the power consumption of the full server.

4.3 Power Characterization for Software Power Analysis

The power consumption of a computing device is influenced by the SW applications that are executed on it. Once a computing device is in production, its power consumption as a function of the SW applications can be analyzed and used to improve the energy efficiency of SW applications. This characterization is particularly attractive in embedded systems where the software application is known a priori. Hence, tuning the algorithm and/or the compiler can lead to consistent power improvements.

Fig. 4.7. Post-silicon power analysis for software applications.

SW power analysis can be tackled at a number of abstraction levels, as shown in Figure 4.7. At the lowest level, the individual instructions of a processor and their interactions can be characterized and modeled. These models can then be used to analyze the energy consumption of SW applications. A higher-level approach is to abstract the impact of the instructions by looking at the architectural parameters (e.g., instructions per cycle, cache miss rate, and resource utilization) that are triggered by the instruction makeup of an application. Such an approach is not fundamentally different from the power characterization techniques developed for adaptive power-aware computing (discussed in Section 4.2); the difference stems from the target setting and goal, i.e., software power analysis versus adaptive power-aware computing. The highest level of abstraction is to characterize power consumption as a function of the algorithmic parameters that are relevant to the application, such as function argument types and variable sizes. We discuss these techniques in the rest of this section.



4.3.1 Instruction-Level Analysis

One of the main hypotheses in instruction-level power analysis is that it is possible to evaluate the energy consumption of a program by composing the energy consumption of its individual instructions, which are characterized from prior measurements [139]. One of the challenges in such evaluation is that the energy consumption of an individual instruction is also dependent on the state of the processor, which is determined by preceding instructions. To address this issue, the energy cost of an instruction is decomposed into two components: a base cost and an inter-instruction cost. The base cost of an instruction is determined by repeatedly executing multiple instances of the exact same instruction instantiated within a loop, while the processor's power is being externally measured by a digital multimeter. Because the switching activity of a circuit is a function of the changes in its inputs and its state, executing the exact same instruction repeatedly will consume less energy than executing a number of different instructions. Thus, it is necessary to add an inter-instruction energy cost to the base cost. To reduce the intractability of analyzing the impact of all preceding instructions, a number of techniques have been proposed.

• In the simplest case, a fixed cost is added to account for inter-instruction dependencies [139]. Three sources of additional costs can be considered. First, the impact of the immediately preceding instruction can be considered by executing different pairs of instructions repeatedly in a loop and characterizing the additional average energy incurred; second, the energy cost of a pipeline stall can be characterized; and third, the cost of a cache hit and a cache miss can be characterized. The pipeline stall and cache miss costs are multiplied by the expected number of stalls and cache misses, respectively. The characterization results of these inter-instruction dependencies are summed up and added to the base cost [139]; a minimal sketch of this composition appears after this list. This model can be extended to incorporate data-dependent power consumption associated with bus and memory transactions [134].

• A more detailed analysis can use statistical techniques, where the energy consumption of an instruction is characterized by both a mean and a standard deviation [126]. To determine the mean and standard deviation of an instruction, variations of the same instruction can be characterized, including different source and destination registers, different operand values, and different conditions (e.g., cache hit or miss). Using statistical models has the advantage of reporting confidence intervals for the estimated program energy.

• Sometimes it is infeasible to characterize every instruction pair combination. For example, in the ARM7 instruction set architecture, there is a total of 500 basic instruction costs to be measured, and it is practically infeasible to characterize all 500² possible combinations of instruction pairs [130]. Instead, it is possible to first group the instructions into sets based on the resources they utilize, and then measure the inter-set power dependency between instructions from different sets [130]. In a study for the StrongARM SA-1100 processor, four possible sets were identified: instruction, sequential memory access, non-sequential memory access, and internal cycles [132].

• SW power consumption depends on the operating frequency and voltage. For example, we used our measurement setup to measure the average power consumption of the PARSEC applications as a function of the frequency–voltage settings of a quad-core Intel Core i7 processor. Figure 4.8 gives the average power measurements at nine different frequency–voltage settings. The figure shows that there are strong variations in power consumption that depend on the application and the applied frequency–voltage setting. For devices that support multiple frequencies and voltages, it is necessary to characterize the variations in instruction energy consumption as a function of the operating voltage and frequency [132].
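As noted in the first bullet above, the base-plus-inter-instruction composition amounts to a short accumulation over the instruction trace. In the sketch below, the per-instruction base costs, pair overheads, and stall/miss costs are made-up numbers chosen only to show the bookkeeping, not measured values from [139].

    def program_energy(trace, base_cost, pair_cost, stalls, stall_cost, misses, miss_cost):
        """Compose program energy from instruction-level characterization data.

        trace      : list of executed opcodes, in program order
        base_cost  : dict opcode -> base energy per execution (nJ)
        pair_cost  : dict (prev, curr) -> extra inter-instruction energy (nJ)
        stalls     : expected number of pipeline stalls
        misses     : expected number of cache misses
        """
        energy = sum(base_cost[op] for op in trace)
        energy += sum(pair_cost.get((a, b), 0.0) for a, b in zip(trace, trace[1:]))
        energy += stalls * stall_cost + misses * miss_cost
        return energy

    # Hypothetical characterization numbers for a toy three-instruction trace.
    e = program_energy(
        trace=["add", "ld", "add"],
        base_cost={"add": 1.2, "ld": 2.0},
        pair_cost={("add", "ld"): 0.3, ("ld", "add"): 0.25},
        stalls=1, stall_cost=0.8,
        misses=0, miss_cost=9.0,
    )
    print(f"estimated energy = {e:.2f} nJ")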

Fig. 4.8. Average power consumption of the PARSEC applications as a function of frequency–voltage settings. Number of threads = 4. Frequencies start at 1.6 GHz and increase in increments of 133 MHz until 2.67 GHz.

It is possible to integrate the post-silicon power measurements together with pre-silicon power modeling techniques to obtain fine-grain power estimates for each circuit component. Varma et al. describe a SystemC power modeling technique in which the parameters of the power models of the SoC components are obtained from measurements. The models are then used to break up the total power consumption of general applications among the different SoC components [142].

With a more elaborate measurement infrastructure, it is possible to measure the energy consumption on a per-cycle basis [23]. A per-cycle measurement setup enables energy measurements for instructions as they advance through the individual pipeline stages. As discussed in Section 3.1, the Hamming distance between consecutive input vectors is a good estimator of the amount of switching activity. It is possible to generalize such an approach to measure the energy consumption of each pipeline stage (e.g., IF, ID, and EX) as a function of the Hamming distance between consecutive instructions. The Hamming distance between instructions can be controlled by changing the operation codes, the source and destination registers, the memory address, and the immediate operands. Thus, by administering stimuli code that is crafted to create desired Hamming distances between instructions, it is possible to measure the energy consumption of each pipeline stage as a function of the Hamming distance. Figure 4.9 illustrates an example where a base instruction is used to first fill the pipeline stages, and then a test instruction with a prescribed Hamming distance is applied afterwards. A per-cycle energy measurement setup can measure the energy consumption of each stage as the test instruction progresses through the stages. By using different test instructions, it is possible to construct energy characterization tables that give the energy consumption of each pipeline stage as a function of the Hamming distance. These tables can be used to quickly characterize the software energy consumption of the target processor.

Fig. 4.9. Measuring instruction-level energy consumption on a per-cycle, per-stage basis [23].
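A characterization table of this kind could be exercised with a helper like the one below, which computes the Hamming distance between two 32-bit instruction encodings and looks up a per-stage energy. The table values and instruction words are made-up placeholders; real tables would come from the per-cycle measurement setup of [23].

    def hamming_distance(word_a: int, word_b: int) -> int:
        """Number of differing bits between two 32-bit instruction words."""
        return bin((word_a ^ word_b) & 0xFFFFFFFF).count("1")

    def stage_energy(table, stage, prev_insn, insn):
        """Look up the characterized energy of one pipeline stage for the
        Hamming distance between consecutive instruction encodings."""
        return table[stage][hamming_distance(prev_insn, insn)]

    # Hypothetical table: energy (pJ) per stage, indexed by Hamming distance 0..32.
    table = {stage: [10.0 + 0.9 * d for d in range(33)] for stage in ("IF", "ID", "EX")}
    print(stage_energy(table, "EX", 0x00A30293, 0x00B28313))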

Techniques for instruction-level power analysis are also applicable to SW applications executing on soft processors (e.g., MicroBlaze and PicoBlaze from Xilinx, and Nios II from Altera), which are synthesized in FPGAs. Despite their simplicity, these processors also display noticeable variations in their energy consumption as a function of the executed instructions [108], and the discussed characterization techniques can also be used for their instruction-level power analysis.

4.3.2 Architectural Analysis

Some of the technical difficulties in instruction-level power analysis include (1) the large number of measurements that need to be conducted for different base instructions and their inter-dependencies, and (2) handling the intricacies involved with Very Large Instruction Word (VLIW) processors, which are typically deployed as digital signal processors in embedded systems [7]. To reduce the difficulties in instruction-level modeling, it is possible to abstract the impact of instructions on the key architectural parameters of the computing device [52, 60]. These parameters include the parallelism rate (i.e., the number of instructions issued per cycle), the utilization ratios of various processing units, the cache miss rate, and the external memory data rate. Similar to the techniques in adaptive power-aware computing discussed in Section 4.2, it is possible to build regression models that relate the values of these parameters to the SW power consumption. The weights associated with these parameters in the model are learned by applying micro benchmarks that stimulate individual functional blocks on an actual chip while simultaneously measuring the resultant power consumption. To estimate the power consumption of a desired program, the compiled code of the program is first profiled and sampled, and values for the parameters are computed. These values are then fed to the power model to estimate the power consumption, which can then be contrasted with actual power measurements.

4.3.3 Algorithmic Analysis

An alternative approach is to characterize software power at a functional or algorithmic level, where macro models are used to characterize the energy consumption of an algorithm as a function of its parameters [96, 119, 138]. These algorithmic parameters can include, for example, function arguments, return objects, algorithm-specific parameters, and other parameters that capture the size and value of variables. This approach is particularly attractive for embedded systems, which are largely based on re-used software components such as compiler library calls, operating system calls, and popular algorithmic code. By varying the values of the input parameters and characterizing the energy consumption of the algorithm, it is possible to develop a regression model that learns the association between energy consumption and the values of the parameters. Challenges in macro modeling include identifying an appropriate set of parameters and training the model with a representative data set [96].

Software programs frequently call operating system (OS) routines, e.g., file services, process control, and interrupts, to execute required functions. These routines can execute millions of their own instructions and trigger energy consumption in peripheral devices such as the disk drive and network interface cards [33, 45, 79]. Thus, it is necessary to characterize the power consumption of different OS routines under various input scenarios, and then scale the consumption of these routines appropriately depending on the number of their invocations. One of the difficulties in OS routine power characterization is that the power consumption of a routine is dependent on its input arguments, and consequently, characterizing each routine by a single power value could yield wrong estimates. To improve the estimates, it is possible to augment an algorithmic power model with architectural parameters such as instructions per cycle [79].

With a software energy estimator in hand, it is possible to devise a number of SW energy optimizations [33, 139]:

(1) Many of the typical compiler optimization techniques targeted at performance improvement (e.g., reducing stalls, reducing memory accesses, and increasing the cache hit rate) also improve energy consumption, as energy is the product of runtime and power.

(2) Specific energy-related optimizations include: (a) substituting one instruction with another suitable instruction(s) if the new instruction(s) is known to use less power; and (b) reordering instructions, whenever possible, to reduce the sum of Hamming distances between consecutive instruction pairs, which lowers the switching activity.

4.4 Summary

In this section we explored recent research directions in post-silicon power characterization. The main differentiating factor between this section and the previous one is the availability of the actual computing device and the ability to characterize its power consumption. The ability to take measurements from an early prototype or a final production chip can be used to improve the energy efficiency of computing devices.

We explored infrared imaging techniques to characterize the power consumption of every circuit structure. Infrared imaging is the most flexible technique for power mapping as it does not require design investments. Infrared imaging equipment provides high-resolution thermal imaging in space and time, and its primary usage is in design houses and characterization groups trying to gain a deeper understanding of the impact of software and hardware interactions on power consumption. We described a mathematical inversion framework that takes as inputs the thermal maps and lumped current measurements and outputs the underlying power maps. Power mapping results from infrared imaging can be used to improve the design or to provide a detailed power decomposition for different software applications.

In many cases, computing devices need to adapt to their operating conditions during runtime to conserve power. Popular methods to adjust performance and power consumption include dynamic voltage and frequency scaling, nap modes, and sleep modes. Software applications lead to large variations in power consumption, and thus the resultant power and performance characteristics from the enforcement of one setting will vary depending on the application. To provide a guided adjustment to performance and power, it is necessary to physically measure the total power consumption and to provide these measurements to the power management system. In the absence of a measuring device during runtime, the power management system has to use a model that relates the system's usage statistics to its power consumption. We explored in this section regression modeling techniques that can be used to accurately model power consumption given as inputs the measurements from the device's performance counters. These models tend to be fairly accurate, but they require off-line training on a representative data set.

Finally, we discussed techniques where power characterization from a chip can be leveraged to optimize software applications to use less power. This ability is particularly appealing when the software application is fixed, as in embedded systems; a good example would be optimizing an embedded image processing application on the DSP processor of a digital camera. Optimizing software power consumption requires a good model for software power analysis. We discussed three different ways to develop such a model: instruction-level models, architectural-level models, and algorithmic-level models. These models provide varying degrees of granularity, accuracy, and complexity in estimating power consumption. Instruction-level models provide energy estimates of individual instructions, but they require large characterization efforts. Architectural models are conceptually similar to the models used for adaptive power-aware computing, but their end goal is focused on software power characterization and optimization. Algorithmic models provide high-level models that are parameterized by algorithmic parameters.

Page 81: Power Modeling and Characterization of Computing Devices ...display.engin.brown.edu/pubs/NOW-SURVEY.pdf · for software power analysis and for adaptive power-aware computing. Finally,

5 Future Directions

In this survey we discussed power modeling and characterization techniques for computing devices. The discussed techniques span all the way from low-level circuit estimation to high-level power modeling and software power characterization. These techniques are the starting points for future research in power modeling and characterization. We outline four promising future research directions for power modeling and characterization.

1. Power Modeling for Computing Substrate Design Exploration. The current constraints on power consumption have created a wall in which it is not possible to improve single-thread performance along historical trends. Multi-core processors trade off single-thread performance of the individual cores against lower energy per instruction [39]. Seymour Cray once said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?". Indeed, further reduction to the energy consumed per instruction can lead to an overall degradation in multi-core processor performance. Future improvements in energy efficiency can come from customizing the computing substrate to better match the application [29, 39]. Possible options to customize the computing substrate according to the applications include fabricating custom computing systems, implementing custom designs in FPGAs, augmenting existing SoCs and multi-core processors with application-specific circuit units, using GPGPUs, or using a heterogeneous mixture of these different computing substrates. This fluidity in the computing substrate has been dubbed "liquid metal" [54], or perhaps it should be more appropriately called "liquid silicon".

All existing power-aware design exploration frameworks are developed to explore the design decision choices of only a single computing substrate. Fluidity in the computing substrate will lead to a fundamental shift in power-aware design exploration tools: future tools have to explore the computing substrate choices as well as the possible design choices within each substrate. That is, the tools must take the application needs as inputs and explore the power consumption of different computing substrates with different design scenarios for every substrate. Thus, power modeling needs to move to a higher abstraction level where no decision on the computing substrate has yet been made.

2. Power Modeling for Full Platforms. The future of computing points to two clear emerging platforms: mobile computing and data center computing. As mentioned in Section 1, both of these paradigms will be constrained by power. Furthermore, parallel applications with deep interactions will likely be the dominant form of workloads executing on these platforms. Software power modeling techniques have to be integrated with platform power modeling techniques to create large-scale models that can estimate the impact of interactions among applications across the platform on power consumption. For instance, power macromodeling of message passing mechanisms in multi-core systems shows a high dependency on message traffic patterns [83]. Physical interactions among the components of the platform can also have serious implications for power consumption. One example of cross-cutting platform physical interactions is temperature. The power consumption of one chip or system can lead to heat transfer to adjacent chips or systems, which increases their temperature, leakage power, and cooling power consumption. Such phenomena are particularly pronounced in data centers, where large numbers of servers with high power density can lead to thermodynamic interactions that are detrimental to power consumption [144]. Future platform modeling tools have to consider the intricate interactions among workloads and the physical phenomena occurring on the platform.

3. Power Characterization for Trojan Detection. The economics of scaling have led to the prevalent fab-lite model, where design houses contract fabrication facilities to manufacture their designs. In this model, it is possible that malicious circuitry, also known as a Trojan, is inserted into the design during fabrication. A Trojan embedder modifies an original design to enable unauthorized access and possibly to control the chip's computation and communication infrastructure. Inserted Trojans trigger discrepancies between the pre-silicon power models and the post-silicon power characterization results beyond typical manufacturing variations. Existing works in the literature investigate side-channel detection techniques in which electric current measurements are used to detect Trojans [3, 66]. Non-intrusive detection techniques for Trojans will require further drastic improvements in power modeling and characterization techniques. One possible future theme is the use of infrared-based power mapping techniques to generate high-resolution spatiotemporal power maps, which could be contrasted against the design-time power estimates. Because process variations can introduce large variations in leakage and total power measurements, these variations have to be quantified and factored out. One possible solution is to exploit the exponential dependency of leakage on temperature to conduct characterization experiments at multiple temperatures to isolate and identify leakage. Once the leakage variations are estimated, any significant remaining deviations between the post-silicon characterization and the pre-silicon model estimates are indicators of increased Trojan likelihood.

4. Power Characterization for Fine-Grain Adaptive Power-Aware Computing. Future many-core processors will include tens to hundreds of processing cores [15]. Existing research prototypes include 48-core and 80-core chips from Intel Corporation [141, 50]. These many-core processors will be hot-spot limited, and furthermore they are likely to be deployed in data centers that are constrained by power. Data center operation will require power capping techniques to ensure a peak power consumption constraint [6]. To achieve greater energy efficiency, many-core processors will use advanced power management circuits that can control the voltage and frequency of each core independently in a fine-grain manner. Due to the dependency of dynamic power on voltage, controlling the voltage on an individual-core basis will lead to more effective power management. For instance, the experimental Rock Creek processor from Intel incorporates 48 cores, where every six cores are controlled by an independent voltage domain [50]. In future power- and thermal-constrained operation, there will be a need for fine-grain power characterization during runtime, where the power of each core is exactly determined and used towards a global chip-scale optimization policy that maximizes the total throughput under power and thermal constraints. This goal will be challenging because different cores could run applications with different power characteristics, and because traffic intercommunication between applications on different cores can introduce interdependencies that are subtle to model.

The proliferation of various computing substrates, together with the rise of energy consumption as a major societal challenge, implies that power modeling, characterization, and optimization will be a main front in computing systems research. The next decade will see the rise of billions of embedded systems and a large number of planetary-scale data centers. Power modeling and estimation for the computing devices of these systems will face major challenges, and the payoffs from better modeling and optimization will be huge.


Acknowledgments

The authors of this manuscript are partially supported by NSF grant numbers 1115424 and 0952866. The equipment used in the experiments was obtained through DoD ARO grant number W911NF-09-1-0320 and a donation from Altera Corporation. We thank the anonymous reviewers for their valuable feedback.



Notations and Acronyms

Following are the acronyms, in alphabetical order, that are used in the paper.

AC       Alternating current
AF       Activity factor
ALU      Arithmetic logic unit
BL       Bitline in a memory structure
BLE      Basic logic element
CAD      Computer-aided design
CDFG     Control data flow graph
CMOS     Complementary metal oxide semiconductor
DC       Direct current
DSPs     Digital signal processors
DVFS     Dynamic voltage and frequency scaling
EX       Execution stage of a pipelined processor
FIFO     First in, first out
FPGA     Field-programmable gate arrays
FPLd/St  Floating-point load/store rate
GP-GPUs  General-purpose graphical processing units
HDMI     High-definition multimedia interface
HW       Hardware
ID       Instruction decode stage of a pipelined processor
IF       Instruction fetch stage of a pipelined processor
IPC      Instruction per cycle
LAB      Logic array block
LR       Load rate
LUT      Look-up table
MOSFET   Metal oxide semiconductor field effect transistor
MWIR     Mid-wave infrared
NMOS     n-Channel MOSFET
NoC      Network-on-chip
NRC      National research council
OCN      On-chip network
OS       Operating system
PMOS     p-Channel MOSFET
RTL      Register transfer level
SIMD     Single instruction multiple data
SIPC     Speculative instruction per cycle
SoC      System-on-chip
SR       Store rate
SRAM     Static random access memory
SW       Software
TDP      Thermal design power
TLB      Translation lookaside buffer
UART     Universal asynchronous receiver/transmitter
USB      Universal serial bus
VLIW     Very large instruction word
WL       Wordline in a memory structure


We have defined the following notations in the paper.

C_L                 Load capacitance at the output
C_total             Total circuit capacitance
C_WL                Wordline capacitance
C_BL                Bitline capacitance
C_diff              Diffusion capacitance
C_gate              Gate capacitance
C_metal             Metal wire capacitance
e                   Noise in measurements
f                   Clock frequency
I_leak              Subthreshold leakage current
p                   Power map
P_dynamic-bitline   Dynamic power consumption of a bitline
P_dynamic-gate      Dynamic power of a gate
P_dynamic           Total dynamic power of a circuit
p_total             Total power consumption of the chip
R                   System's thermal resistance matrix
s                   Activity factor that denotes the number of switching cycles per second
s                   Average switching activity
s_o                 Input switching activity
t_m                 Thermal map acquired from the thermal imaging system
V_DD                Supply voltage
V_GND               Ground voltage
V_th                Threshold voltage
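As a quick orientation to how these symbols are typically combined (the standard CMOS relation rather than a result specific to this survey; Section 2 develops the forms actually used here), the dynamic power of a gate follows

    P_dynamic-gate = AF · C_L · V_DD^2 · f,

or equivalently P_dynamic-gate = s · C_L · V_DD^2 when s already counts the number of switching events per second.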


References

[1] E. Acar, A. Devgan, R. Rao, Y. Liu, H. Su, S. Nassif, and J. Burns, "Leakage and leakage sensitivity computation for combinational circuits," in International Symposium on Low-Power Electronics, pp. 96–99, 2003.
[2] A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C. H. Kim, "Leakage power analysis and reduction for nanoscale circuits," IEEE Micro, vol. 26, no. 2, pp. 68–80, 2006.
[3] Y. Alkabani, F. Koushanfar, N. Kiyavash, and M. Potkonjak, "Trusted integrated circuits: A nondestructive hidden characteristics extraction approach," in Information Hiding, pp. 102–117, 2008.
[4] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick, "A view of the parallel computing landscape," Communications of the ACM, vol. 52, no. 10, 2009.
[5] N. Bansal, K. Lahiri, A. Raghunathan, and S. T. Chakradhar, "Power monitors: A framework for system-level power estimation using heterogeneous power models," in VLSI Design, pp. 579–585, 2005.
[6] L. A. Barroso and U. Holzle, The Datacenter as a Computer. Morgan and Claypool Publishers, 2009.
[7] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, and R. Zafalon, "A power modeling and estimation framework for VLIW-based embedded systems," ST Journal of System Research, pp. 2.1.1–2.1.10, 2002.
[8] R. A. Bergamaschi et al., "SEAS: A system for early analysis of SoCs," in CODES+ISSS, pp. 150–155, 2003.


[9] R. Bertran, M. Gonzalez, X. Martorel, N. Navarro, and E. Ayguade, "Decomposable and responsive power models for multicore processors using performance counters," in International Conference on Supercomputing, pp. 147–158, 2010.
[10] C. Bienia and K. Li, "PARSEC 2.0: A new benchmark suite for chip multiprocessors," in Proceedings of the Annual Workshop on Modeling, Benchmarking and Simulation, 2009.
[11] W. L. Bircher, M. Valluri, J. Law, and L. K. John, "Runtime identification of microprocessor energy saving opportunities," in International Symposium on Low Power Electronics and Design, pp. 275–280, 2005.
[12] A. Boliolo and L. Benini, "Robust RTL power macromodels," IEEE Transactions on VLSI Systems, vol. 6, no. 4, pp. 578–581, 1998.
[13] A. Bona, V. Zaccaria, and R. Zafalon, "System level power modeling and simulation of high-end industrial network-on-chip," in Design, Automation and Test in Europe, pp. 318–323, 2004.
[14] D. Boning and S. Nassif, "Models of process variations in device and interconnect," in Design of High-Performance Microprocessor Circuits, (A. Chandrakasan, W. J. Bowhill, and F. Cox, eds.), pp. 98–115, IEEE Press, 2001.
[15] S. Borkar, "Thousand core chips — a technology perspective," in Proceedings of Design Automation Conference, pp. 746–749, 2007.
[16] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variations and impact on circuits and microarchitecture," in Proceedings of Design Automation Conference, pp. 338–342, 2003.
[17] O. Breitenstein, W. Warta, and M. Langenkamp, Lock-In Thermography: Basics and Use for Functional Diagnostics of Electronic Components. Springer Verlag, second ed., 2010.
[18] D. Brooks, M. Martonosi, J.-D. Wellman, and P. Bose, "Power-performance modeling and tradeoff analysis for a high end microprocessor," in Power Aware Computing Systems Workshop at ASPLOS-IX, pp. 126–136, 2000.
[19] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," in Proceedings of International Symposium on Computer Architecture, pp. 83–94, 2000.
[20] D. C. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," ACM Computer Architecture News, vol. 25, no. 3, pp. 13–25, 1997.
[21] K. M. Buyuksahin and F. N. Najm, "Early power estimation for VLSI circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 7, pp. 1076–1088, 2005.
[22] H. Chang and S. S. Sapatnekar, "Full-chip analysis of leakage power under process variations, including spatial correlations," in Design Automation Conference, pp. 523–528, 2005.
[23] N. Chang, K. Kim, and H. G. Lee, "Cycle-accurate energy measurement and characterization with a case study of the ARM7TDMI," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 2, pp. 146–154, 2002.
[24] D. Chen, J. Cong, Y. Fan, and L. Wan, "LOPASS: A low-power architectural synthesis system for FPGAs with interconnect estimation and optimization," IEEE Transactions on Very Large Scale Integration Systems, vol. 18, no. 4, pp. 564–577, 2010.


[25] D. Chen, J. Cong, Y. Fan, and Z. Zhang, "High-level power estimation and low-power design space exploration for FPGAs," in Proceedings of Design Automation Conference, pp. 529–534, 2007.
[26] R. Y. Chen, M. J. Irwin, and R. S. Bajwa, "Architecture-level power estimation and design experiments," ACM Transactions on Design Automation of Electronic Systems, vol. 6, no. 1, pp. 50–66, 2001.
[27] X. Chen and L.-S. Peh, "Leakage power modeling and optimization in interconnection networks," in International Symposium on Low-Power Electronics, pp. 90–95, 2003.
[28] R. Cochran, A. Nowroz, and S. Reda, "Post-silicon power characterization using thermal infrared emissions," in International Symposium on Low Power Electronics and Design, pp. 331–336, 2010.
[29] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable domain-specific computing," IEEE Design & Test of Computers, vol. 28, no. 2, pp. 6–14, 2011.
[30] D. Crisu, S. D. Cotofana, S. Vassiliadis, and P. Liuha, "High-level energy estimation for ARM-based SoCs," in Systems, Architectures, Modeling, and Simulation, pp. 168–177, 2004.
[31] J. A. Darringer et al., "Early analysis tools for system-on-a-chip design," IBM Journal of Research and Development, vol. 46, no. 6, pp. 691–707, 2002.
[32] J. A. Davis, V. K. De, and J. D. Meindl, "A stochastic wire-length distribution for gigascale integration (GSI) — Part I. Derivation and validation," IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 580–589, 1998.
[33] R. P. Dick, G. Lakshminarayana, A. Raghunathan, and N. K. Jha, "Power analysis of embedded operating systems," in Design Automation Conference, pp. 312–315, 2000.
[34] M. Q. Do, M. Drazdziulis, P. Larsson-Edefors, and L. Bengtsson, "Parameterizable architecture-level SRAM power model using circuit-simulation backend for leakage calibration," International Symposium on Quality Electronic Design, pp. 557–563, 2006.
[35] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan, "Full-system power analysis and modeling for server environments," in Proceedings of Workshop on Modeling, Benchmarking, and Simulation, pp. 70–77, 2006.
[36] J. Emer, P. Ahuja, E. Borch, A. Klauser, C.-K. Luk, S. Manne, S. S. Mukherjee, H. Patil, S. Wallace, N. Binkert, R. Espasa, and T. Juan, "Asim: A performance model framework," Computer, vol. 35, pp. 68–76, 2002.
[37] X. Fan, W. Weber, and L. Barroso, "Power provisioning for a warehouse-sized computer," International Symposium on Computer Architecture, pp. 13–23, 2007.
[38] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, "Modeling within-die spatial correlation effects for process-design co-optimization," in International Symposium on Quality Electronic Design, pp. 516–521, 2005.
[39] S. H. Fuller and L. I. Millett, The Future of Computing Performance: Game Over or Next Level? The National Academies Press, 2011.
[40] T. Givargis, F. Vahid, and J. Henkel, "A hybrid approach for core-based system-level power modeling," in Asia and South Pacific Design Automation Conference, pp. 141–145, 2000.


[41] T. Givargis, F. Vahid, and J. Henkel, "Instruction-based system-level power evaluation of system-on-a-chip peripheral cores," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 6, pp. 856–862, 2002.
[42] W. Gosti, A. Narayan, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Wireplanning in logic synthesis," in Proceedings of International Conference on Computer-Aided Design, pp. 26–33, 1998.
[43] P. Gupta, A. B. Kahng, and S. Muddu, "Quantifying error in dynamic power estimation of CMOS circuits," Analog Integrated Circuits and Signal Processing, vol. 42, no. 3, pp. 253–264, 2005.
[44] S. Gupta and F. N. Najm, "Power modeling for high-level power estimation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 1, pp. 18–29, 2000.
[45] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan, and M. Kandemir, "Using complete machine simulation for software power estimation: The SoftWatt approach," in International Conference on High-Performance Computer Architecture, pp. 141–150, 2002.
[46] H. Hamann, A. Weger, J. Lacey, Z. Hu, and P. Bose, "Hotspot-limited microprocessors: Direct temperature and power distribution measurements," IEEE Journal of Solid-State Circuits, vol. 42, no. 1, pp. 56–65, 2007.
[47] B. Hargreaves, H. Hult, and S. Reda, "Within-die process variations: How accurately can they be statistically modeled?," in Proceedings of Asia and South Pacific Design Automation Conference, pp. 524–530, 2008.
[48] K. R. Heloue, N. Azizi, and F. N. Najm, "Full-chip model for leakage current estimation considering within-die correlation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 6, pp. 847–887, 2009.
[49] S. Hong and H. Kim, "An integrated GPU power and performance model," in International Symposium on Computer Architecture, pp. 280–289, 2010.
[50] J. Howard et al., "A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 173–183, 2011.
[51] C. Hsieh, C. Wu, F. Jih, and T. Sun, "Focal-plane-arrays and CMOS readout techniques of infrared imaging systems," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, pp. 594–605, 1997.
[52] C.-T. Hsieh, M. Pedram, G. Mehta, and F. Rastgar, "Profile-driven program synthesis for evaluation of system power dissipation," in Design Automation Conference, pp. 576–581, 1997.
[53] C.-W. Hsu et al., "PowerDepot: Integrating IP-based power modeling with ESL power analysis for multi-core SoC designs," in Design Automation Conference, pp. 47–52, 2011.
[54] S. Huang, A. Hormati, D. Bacon, and R. Rabbah, "Liquid metal: Object-oriented programming across the hardware/software boundary," in Proceedings of the European Conference on Object-Oriented Programming, pp. 76–103, 2008.


[55] W. Huang, K. Skadron, S. Gurumurthi, R. J. Ribando, and M. R. Stan, "Differentiating the roles of IR measurement and simulation for power and temperature-aware design," in International Symposium on Performance Analysis of Systems and Software, pp. 1–10, 2009.
[56] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in International Symposium on Microarchitecture, pp. 359–370, 2006.
[57] C. Isci and M. Martonosi, "Runtime power monitoring in high-end processors: Methodology and empirical data," in Proceedings of International Symposium on Microarchitecture, pp. 93–104, 2003.
[58] V. Jimenez et al., "Power and thermal characterization of POWER6 system," in International Conference on Parallel Architectures and Compilation Techniques, pp. 7–18, 2010.
[59] R. Joseph and M. Martonosi, "Run-time power estimation in high performance microprocessors," in International Symposium on Low Power Electronics and Design, pp. 135–140, 2001.
[60] N. Julien, J. Laurent, E. Senn, and E. Martin, "Power consumption modeling and characterization of the TI C6201," IEEE Micro, vol. 23, no. 5, pp. 40–49, 2004.
[61] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," in Design, Automation and Test in Europe, pp. 423–428, 2009.
[62] A. B. Kahng and S. Reda, "A tale of two nets: Studies of wirelength progression in physical design," in System-Level Interconnect Prediction Workshop, pp. 17–34, 2006.
[63] T. Kemper, Y. Zhang, Z. Bian, and A. Shakouri, "Ultrafast temperature profile calculation in IC chips," in International Workshop on Thermal Investigations of ICs and Systems, pp. 133–137, 2006.
[64] N. S. Kim, T. Austin, T. Mudge, and D. Grunwald, Power Aware Computing. Kluwer Academic Publishers, Norwell, 2002.
[65] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, "Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction," in International Symposium on Microarchitecture, pp. 219–230, 2002.
[66] F. Koushanfar and A. Mirhoseini, "A unified framework for multimodal submodular integrated circuits trojan detection," IEEE Transactions on Information Forensics and Security, 2011.
[67] A. Kumar and M. Anis, "An analytical state dependent leakage power model for FPGAs," in Design, Automation and Test in Europe, 2006.
[68] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215, 2007.
[69] K. Lahiri and A. Raghunathan, "Power analysis of system-level on-chip communication architectures," in CODES+ISSS, pp. 236–241, 2004.


[70] M. Lajolo, A. Raghunathan, S. Dey, and L. Lavagno, "Cosimulation-based power estimation for system-on-chip design," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 3, pp. 253–265, 2002.
[71] J. Lamoureux and S. Wilton, "FPGA clock network architecture: Flexibility vs. area and power," in International Symposium on Field Programmable Gate Arrays, pp. 101–108, 2006.
[72] J. Lamoureux and S. Wilton, "On the trade-off between power and flexibility of FPGA clock networks," ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 13, 2008.
[73] P. Landman, "High-level power estimation," in International Symposium on Low-Power Electronics and Design, pp. 29–35, 1996.
[74] B. Lee and D. Brooks, "Accurate and efficient regression modeling for microarchitectural performance and power prediction," in Architectural Support for Programming Languages and Operating Systems, pp. 185–194, 2006.
[75] I. Lee et al., "PowerViP: SoC power estimation framework at transaction level," in Proceedings of Asia and South Pacific Design Automation Conference, pp. 551–558, 2006.
[76] V. Lee et al., "Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU," International Symposium on Computer Architecture, pp. 451–460, 2010.
[77] F. Li, D. Chen, L. He, and J. Cong, "Architecture evaluation for power-efficient FPGAs," in International Symposium on Field Programmable Gate Arrays, pp. 175–184, 2003.
[78] F. Li, D. Chen, L. He, and J. Cong, "Power modeling and characteristics of field programmable gate arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, 2005.
[79] T. Li and L. K. John, "Run-time modeling and estimation of operating system power consumption," in SIGMETRICS, pp. 160–171, 2003.
[80] Y. Li and J. Henkel, "A framework for estimating and minimizing energy dissipation of embedded HW/SW systems," in Design Automation Conference, pp. 188–193, 1998.
[81] X. Liang, K. Turgay, and D. Brooks, "Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques," International Conference on Computer Aided Design, pp. 824–830, 2007.
[82] M. Y. Lim, A. Porterfield, and R. Fowler, "SoftPower: Fine-grain power estimations using performance counters," in International Symposium on High Performance Distributed Computing, pp. 308–311, 2010.
[83] M. Loghi, L. Benini, and M. Poncino, "Power macromodeling of MPSoC message passing primitives," ACM Transactions on Embedded Computing Systems, vol. 6, no. 31, pp. 1–22, 2007.
[84] M. Loghi, M. Poncino, and L. Benini, "Cycle-accurate power analysis for multiprocessor systems-on-a-chip," in Great Lakes Symposium on VLSI, pp. 401–406, 2004.
[85] X. Ma, M. Dong, L. Zhong, and Z. Deng, "Statistical power consumption analysis and modeling for GPU-based computing," in Workshop on Power Aware Computing and Systems (HotPower), 2009.


[86] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation, and optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061–1079, 1998.
[87] M. Mamidipaka and N. Dutt, "eCACTI: An enhanced power estimation model for on-chip caches," Technical Report TR-04-28, CECS, UCI, 2004.
[88] R. Marculescu, D. Marculescu, and M. Pedram, "Probabilistic modeling of dependencies during switching activity analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 2, pp. 73–83, 1998.
[89] E. Marin, "The role of thermal properties in periodic time-varying phenomena," European Journal of Physics, vol. 28, no. 3, pp. 429–445, 2007.
[90] H. Mehta, R. M. Owens, and M. J. Irwin, "Energy characterization based on clustering," in Design Automation Conference, pp. 702–707, 1996.
[91] F. J. Mesa-Martinez, E. Ardestani, and J. Renau, "Characterizing processor thermal behavior," in Architectural Support for Programming Languages and Operating Systems, pp. 193–204, 2010.
[92] F. J. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau, "Measuring performance, power, and temperature from real processors," in Proceedings of International Symposium on Computer Architecture, pp. 1–10, 2007.
[93] J. Monteiro, S. Devadas, A. Ghosh, K. Keutzer, and J. White, "Estimation of average switching activity in combinational logic circuits using symbolic simulation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 1, 1997.
[94] J. Monteiro, R. Patel, and V. Tiwari, "Power analysis and optimization from circuit to register-transfer levels," in EDA for IC Implementation, Circuit Design, and Process Technology, vol. 2, (L. Scheffer, L. Lavagno, and G. Martin, eds.), Taylor & Francis, 2006.
[95] M. Moudgill, J.-D. Wellman, and J. H. Moreno, "Environment for PowerPC microarchitecture exploration," IEEE Micro, vol. 19, pp. 15–25, 1999.
[96] A. Muttreja, A. Raghunathan, S. Ravi, and N. K. Jha, "Automated energy/performance macromodeling of embedded software," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 542–552, 2007.
[97] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in Proceedings of the International Conference on Green Computing, pp. 115–122, 2010.
[98] F. Najm, "Transition density: A new measure of activity in digital circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, no. 2, pp. 310–323, 1993.
[99] F. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 446–455, 1994.
[100] F. Najm, "Power estimation techniques for integrated circuits," in International Conference on Computer Aided Design, pp. 492–499, 1995.


[101] M. Nemani and F. N. Najm, "Towards a high-level power estimation capability," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 588–598, 1996.
[102] A. N. Nowroz, R. Cochran, and S. Reda, "Thermal monitoring of real processors: Techniques for sensor allocation and full characterization," in Design Automation Conference, pp. 56–61, 2010.
[103] A. N. Nowroz and S. Reda, "Thermal and power characterization of field-programmable gate arrays," in International Symposium on Field Programmable Gate Arrays, pp. 111–114, 2011.
[104] A. N. Nowroz, G. Woods, and S. Reda, "Improved post-silicon power modeling using AC lock-in techniques," in Design Automation Conference, pp. 101–106, 2011.
[105] M. Onouchi, T. Yamada, K. Morikawa, I. Mochizuki, and H. Sekine, "A system-level power estimation methodology based on IP-level modeling, power-level adjustment, and power accumulation," in Proceedings of Asia and South Pacific Design Automation Conference, pp. 547–550, 2006.
[106] M. Orshansky and S. Nassif, Design for Manufacturability and Statistical Design: A Constructive Approach. Springer, 2007.
[107] T. Osmulski et al., "A probabilistic power prediction tool for the Xilinx 4000-series FPGA," in Lecture Notes in Computer Science, pp. 776–783, 2000.
[108] J. Ou and V. K. Prasanna, "Rapid energy estimation of computations on FPGA based soft processors," in SOC Conference, pp. 285–288, 2004.
[109] S. Pasricha and N. Dutt, On-Chip Communication Architectures: System on Chip Interconnect. Morgan Kaufmann Publishers, 2008.
[110] S. Pasricha, Y.-H. Park, N. Dutt, and F. J. Jurdahi, "System-level PVT variation-aware power exploration of on-chip communication architectures," ACM Transactions on Design Automation of Electronic Systems, vol. 14, no. 2, pp. 20:1–20:25, 2009.
[111] J. Peddersen and S. Parameswaran, "CLIPPER: Counter-based low impact processor power estimation at run-time," in Asia and South Pacific Design Automation Conference, pp. 890–895, 2007.
[112] M. Pedram and S. Nazarin, "Thermal modeling, analysis, and management in VLSI circuits: Principles and methods," Proceedings of the IEEE, vol. 94, no. 8, pp. 1487–1501, 2006.
[113] L. Peh and N. Jerger, On-Chip Networks. Morgan and Claypool Publishers, 2009.
[114] D. Ponomarev, G. Kucuk, and K. Ghose, "AccuPower: An accurate power estimation tool for superscalar microprocessors," in Design, Automation and Test in Europe Conference and Exhibition, pp. 124–129, 2002.
[115] K. Poon, S. Wilton, and A. Yan, "A detailed power model for field-programmable gate arrays," ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 2, pp. 279–302, 2005.
[116] K. K. W. Poon, "Power estimation for field programmable gate arrays," PhD thesis, The University of British Columbia, August 2002.
[117] E. Pop, S. Sinha, and K. Goodson, "Heat generation and transport in nanometer-scale transistors," Proceedings of the IEEE, vol. 94, no. 8, pp. 1587–1601, 2006.


[118] M. Powell, A. Biswas, J. Emer, and S. Mukherjee, "CAMP: A technique to estimate per-structure power at run-time using a few simple parameters," International Symposium on High Performance Computer Architecture, pp. 289–300, 2009.
[119] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak, "Function-level power estimation methodology for microprocessors," in Design Automation Conference, pp. 810–813, 2000.
[120] A. Ragunathan, S. Dey, and N. K. Jha, "Register-transfer level estimation techniques for switching activity and power consumption," in International Conference on Computer-Aided Design, pp. 158–165, 1996.
[121] R. Rao, A. Srivastava, D. Blaauw, and D. Sylvester, "Statistical estimation of leakage current considering inter- and intra-die process variation," in International Symposium on Low-Power Electronics and Design, pp. 84–89, 2003.
[122] S. Reda, "Thermal and power characterization of real computing devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 1, no. 2, 2011.
[123] S. Reda and S. Nassif, "Analyzing the impact of process variations on parametric measurements: Novel models and applications," in Design, Automation and Test in Europe, pp. 375–380, 2009.
[124] S. Rivoire, P. Ranganathan, and C. Kozyrakis, "A comparison of high-level full-system power models," in Conference on Power Aware Computing and Systems, pp. 1–4, 2008.
[125] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits," Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327, 2003.
[126] J. T. Russell and M. F. Jacome, "Software power estimation and optimization for high performance, 32-bit embedded processors," in International Conference on Computer Design, pp. 328–333, 1998.
[127] L. Shang, A. S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," in International Symposium on Field Programmable Gate Arrays, pp. 157–164, 2002.
[128] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," Technical report, 2001.
[129] T. Simunic, L. Benini, and G. De Micheli, "Cycle-accurate simulation of energy consumption in embedded systems," in Design Automation Conference, pp. 867–872, 1999.
[130] G. Sinevriotis et al., "SOFLOPO: Towards systematic software exploitation for low-power designs," in International Symposium on Low-Power Electronics and Design, 2000.
[131] K. Singh, M. Bhadauria, and S. A. McKee, "Real time power estimation and thread scheduling via performance counters," in Proceedings of the Workshop on Design, Architecture, and Simulation of Chip Multi-Processors, pp. 46–55, 2008.
[132] A. Sinha and A. P. Chandrakasan, "JouleTrack — a web based tool for software energy profiling," in Design Automation Conference, pp. 220–225, 2001.


[133] G. Spirakis, "Designing for 65 nm and beyond: Where is the revolution?," in Electronic Design Process (EDP) Symposium, 2005.
[134] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel, "An accurate and fine grain instruction-level energy model supporting software optimizations," in Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation, 2001.
[135] D. Stroobandt, A Priori Wire Length Estimates for Digital Design. Kluwer Academic Publishers, 2001.
[136] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif, "Full chip leakage estimation considering power supply and temperature variations," in International Symposium on Low Power Electronics and Design, pp. 78–83, 2003.
[137] C. Talarico, J. W. Rozenblit, V. Malhotra, and A. Stritter, "A new framework for power estimation of embedded systems," IEEE Computer, vol. 38, no. 2, pp. 71–78, 2005.
[138] T. K. Tan, A. Raghunathan, G. Lakshminarayana, and N. K. Jha, "High-level software energy macro-modeling," in Design Automation Conference, pp. 605–610, 2001.
[139] V. Tiwari, S. Malik, and A. Wolfe, "Power analysis of embedded software: A first step towards software power minimization," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 437–445, 1994.
[140] T. Tuan and B. Lai, "Leakage power analysis of a 90 nm FPGA," in Custom Integrated Circuits Conference, pp. 57–60, 2003.
[141] S. Vangal et al., "An 80-tile sub-100-W teraflops processor in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29–41, 2008.
[142] A. Varma, E. Debes, I. Kozintsev, P. Klein, and B. Jacob, "Accurate and fast system-level power modeling: An XScale-based case study," ACM Transactions on Embedded Computing Systems, vol. 7, no. 3, pp. 25:1–25:20, 2008.
[143] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik, "Orion: A power-performance simulator for interconnection networks," in International Symposium on Microarchitecture, pp. 294–305, 2002.
[144] Z. Wang, N. Tolia, and C. Bash, "Opportunities and challenges to unify workload, power, and cooling management in data centers," ACM SIGOPS Operating Systems Review, vol. 44, no. 3, pp. 41–46, 2010.
[145] S. J. E. Wilton and N. P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE Journal of Solid-State Circuits, vol. 31, no. 5, pp. 677–688, 1996.
[146] B. Wong, A. Mittal, Y. Cao, and G. W. Starr, Nano-CMOS Circuit and Physical Design. Wiley-Interscience, 2004.
[147] W. Wu, L. Jin, J. Yang, P. Liu, and S. X.-D. Tan, "A systematic method for functional unit power estimation in microprocessors," in Design Automation Conference, pp. 554–557, 2006.
[148] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, "The design and use of SimplePower: A cycle-accurate energy estimation tool," in Design Automation Conference, pp. 340–345, 2000.
[149] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects," Technical report, 2003.