
KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs

Eddie Hung, James J. Davis, Joshua M. Levine, Edward A. Stott, Peter Y. K. Cheung and George A. Constantinides

Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, United Kingdom

E-mail: {e.hung, james.davis06, josh.levine05, ed.stott, p.cheung, g.constantinides}@imperial.ac.uk

Abstract: In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for ‘K’ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close as ±5mW to true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.

1. Introduction

In a world increasingly dominated by system-on-chip (SoC) designs, power efficiency is of ultimate concern due to the dark silicon effect: more transistors can be placed on a die than can be continuously switched. Designers put a large amount of effort into managing this challenge up-front, but many things can change once a system is manufactured and deployed: to simply assume worst-case behaviour incurs significant performance penalties under average conditions. For example, a system may be produced where, due to variation, some modules are more power-efficient than others. An intelligent, self-aware system might independently control the power consumption of each module using dynamic frequency scaling. Tasks could then be mapped to these modules in a way that delivers the best overall performance given the constraints of the power budget, available hardware and work to be done.

Such runtime techniques would be particularly useful for FPGAs, where the shortened design cycles reduce the time available for offline analysis. FPGAs’ reconfigurable hardware makes it more difficult to implement well established techniques, such as power gating, but also offers great opportunities for runtime adaptation. Unfortunately, the self-awareness necessary to deliver this vision is currently missing from the power consumption toolbox: we can measure system-wide power consumption at runtime and forecast per-module contributions at design-time, but we cannot determine such a breakdown online.

1.1. Per-module Online Power Modelling

While power measurement at Vdd pins is common, manufacturing SoCs with per-module power domains is usually impractical due to increased metal and pad costs, particularly for a configurable technology such as the FPGA. A more feasible approach is to instead monitor the switching activity within each module, since switching is a key indicator of dynamic power. Models that forecast power consumption based on predicted switching activity are well established for use at design-time; however, inaccuracies inevitably arise from assumptions made regarding data patterns and operating conditions. Some of these assumptions can be avoided by training a model during commissioning, but, unless the external conditions are static and all the possible system behaviour is captured by the training programme, such a model would be running blindly and errors would begin to accumulate. Instead, what is needed is a means to calculate a runtime power breakdown without relying on a stale model.

Figure 1 illustrates the benefits of an online, activity-based power model, described in this paper, used to estimate power consumption. The plot shows the error between modelled and externally measured power for a system module as its supply voltage is lowered, comparing an online model to an offline version based on ordinary least squares (OLS). As the operating conditions deviate from their nominal values, the error in the offline power model increases, while the online model quickly adapts. The effects of different classes of input data, operating modes and exogenous conditions, such as temperature, voltage and degradation, can all be captured online, with the resulting model being far more useful for runtime control in a dynamic environment than its offline counterpart. Note that while Figure 1 is included here for illustrative purposes, the system from which its data was obtained is that described in Section 6, run on the platform explained in Section 5 using the online algorithm described in Section 4.

Figure 1: Error accumulations in online- vs offline-generated signal activity-to-power models under voltage scaling. (Plot: model-vs-measurement error in % and Vdd voltage in V against iteration, for the offline and online models, with the offline training period marked.)

Using an adaptive online model in an embedded system is more practical than ever thanks to readily available general computing resources. FPGA-SoCs with hardened multicore CPUs are ideal platforms to use since the general-purpose processors can carry out infrequent incremental model updates and optimise the computationally intensive tasks that run in soft logic.

In this paper, we describe the first runtime modelling framework providing a per-module power breakdown in an FPGA system, making the following novel contributions:

• We describe an automated tool flow capable of instrumenting arbitrary HDL for runtime monitoring of signal activity.

• We apply system identification techniques to allow an activity-to-power model for any SoC design to be trained and updated at runtime.

• We experimentally quantify the relationships between model accuracy and the incurred hardware and software overheads.

• We show how knowledge of per-module behaviour can facilitate the power-efficient mapping of tasks to hardware, leading to an over-8% power reduction.

2. Background

The power consumed by a CMOS integrated circuit can be split into static and dynamic components: static power consumption is a property of the process technology and operating conditions, while dynamic power originates from the charging and discharging of circuit nets due to switching activity. Both components can vary at runtime and need to be considered by a model; however, dynamic power is the only component required for a per-module breakdown, as it is the one relevant to runtime decisions regarding clock frequency and scheduling. Equation 1 describes the relationship between dynamic power $P_{\text{dynamic}}$, switching activity $\alpha_i$ on a net i, operating frequency f, intrinsic capacitance $C_i$ and supply voltage $V_{dd}$.

$$P_{\text{dynamic}} = \sum_{i} \alpha_i f C_i V_{dd}^{2} \qquad (1)$$
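As a minimal numerical sketch of Equation 1, the Python below sums the per-net contributions. The function name, the three nets and their activity and capacitance values are hypothetical and chosen only for illustration; they are not data from the system studied here.

```python
def dynamic_power(activities, capacitances, freq_hz, vdd):
    """Equation 1: sum over nets of alpha_i * f * C_i * Vdd^2, in watts."""
    if len(activities) != len(capacitances):
        raise ValueError("one capacitance is needed per net")
    return sum(alpha * freq_hz * cap * vdd ** 2
               for alpha, cap in zip(activities, capacitances))

# Hypothetical nets: toggle probability per cycle and capacitance in farads.
alphas = [0.125, 0.5, 0.25]
caps = [2e-12, 1e-12, 4e-12]

p_nominal = dynamic_power(alphas, caps, freq_hz=200e6, vdd=1.1)
p_scaled = dynamic_power(alphas, caps, freq_hz=200e6, vdd=0.95)
```

Because voltage enters quadratically, the ratio `p_scaled / p_nominal` equals (0.95/1.1)^2 regardless of the per-net values, which is one reason a model trained at a single voltage drifts when the supply is scaled.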

Power estimation tools are widely used at compile-time to ascertain whether designs will meet their power specifications and to inform decisions on packing, thermal management and power supply capacity. These tools initially estimated power consumption by simulating circuits with typical test vectors and later by using vectorless, probabilistic techniques [1]; the latter style has been further applied to FPGAs to facilitate power-aware compilation [2]. As the complexity of FPGA applications has increased, so too has the need for high-level power estimation models for modular systems [3]. These statistical, learning-based models are trained using activity and power estimates as well as resource utilisation, allowing implementations to be compared without performing placement and routing.

Prior work [4] proposed the evaluation of system-wide power consumption through a dynamic power monitoring approach using activity counters. Therein, signals were automatically selected for activity counting and an offline model was developed through simulation to relate these activity counts to power: realtime activity measurements facilitated runtime power estimation. Because the model was statically trained, however, the results were found to have ±15% error. Moreover, observation only of overall system power does not provide sufficient information to deploy adaptive strategies such as dynamic voltage and/or frequency scaling, task migration or power gating.

3. Tool Flow

Our automated tool flow comprises two main stages: signal selection, which identifies nets likely to be correlated with high power consumption, and instrumentation, which adds logic to them to allow their behaviour to be monitored online. In this work, we monitor only synchronous events at rising clock edges on these signals, rather than those introduced asynchronously by glitches; the online model we use will, however, automatically tune itself to add more weight to signals that are prone to glitching. We target systems composed of multiple modules (IP blocks, typically hardware accelerators) described in HDL and assembled in Altera’s Qsys system integration tool. These are transparently instrumented in order to report their own switching activity while remaining functionally identical. We assume a master CPU-slave accelerator model, corresponding to the typical use of newer FPGA-SoC devices: in this work, specifically the Altera Cyclone V SoC family.

The most straightforward approach to instrumenting a module would be to analyse (or simulate) its HDL in order to determine signals of interest before augmenting it with activity counters. Two issues exist with this method: firstly, estimating the power consumption of individual signals based solely on behavioural HDL can be inaccurate since neither physical resources nor the data that exercise them are considered [3] and, secondly, augmentation of the HDL with instruments would likely cause circuit mapping to be different, thereby altering the very thing one wishes to observe. The latter problem is similar to that faced when inserting timing [5] or debugging infrastructure [6], both of which are even more sensitive to any circuit perturbations. Our flow, shown in Figure 2, overcomes these issues to some extent by performing signal selection and inserting instrumentation using placed and routed netlists.

Figure 2: KAPow tool flow. (Flow diagram: for each of Modules 0 to M-1, uninstrumented HDL sources pass through Analysis & Synthesis (quartus_map) to give VQM output, a fully technology-mapped netlist; the Fitter (quartus_fit) and the PowerPlay power analyzer (quartus_pow) yield an ASF signal activity estimate; a Python script combines this with the instrumentation template to produce a VQM plus HDL source with instrumentation stubs, which is instantiated as new IP using Qsys before Analysis & Synthesis continues as normal. All VQM primitives from the previous analysis are fully preserved.)

Some minor preprocessing must be performed on each of the M modules to be instrumented before compilation: each is modified to accommodate eight additional words within its address space for instrumentation control. Following this, the modules’ HDL is compiled and the resultant netlists extracted as Verilog Quartus Mapping (VQM) files [7]. In order to identify the signals that best indicate power consumption, each module is fed through Altera’s design-time power analysis tool, PowerPlay, as described in Section 3.1. Selected signals are augmented with activity counters (detailed in Section 3.2) by modifying the VQMs, which are then substituted for the modules’ original HDL within Qsys. Finally, the system is compiled as normal: importantly, all VQM primitives are fully preserved.

3.1. Signal Selection

The purpose of signal selection is to identify nets in the design that are strongly indicative of their modules’ power consumption. We do this by using PowerPlay to report nets with high estimated switching activity, since these should consume the most power. Altera report that PowerPlay’s power estimates are accurate to within ±20% of true measurements when provided with realistic activity data [8]. The tool makes power consumption estimates by summing the static and dynamic power components contributed by all logic and routing resources based on a detailed database of device characteristics. Dynamic power components are scaled using relevant clock frequencies and switching forecasts from one of two sources: in vectored mode, switching rates are derived from a simulation, while in vectorless mode, they are estimated using statistical methods instead [9]. A vectorless estimation starts from primary inputs and propagates switching rates through all nets by considering the Boolean functions that connect them.

Although vectorless estimation is known to be less accurate than its vectored counterpart, this is the method we adopt for signal selection because it is general-purpose: it is applicable to any circuit, even those without source or testbench code or indicative test vectors. We leave the switching rate for primary inputs at its default value of 12.5%. Since we only use PowerPlay to identify signals to monitor, the accuracy of those signals’ estimated activities is essentially irrelevant as they will be measured at runtime. Thus, the only important factor at this stage is whether they are likely to have a high impact on power consumption. The tool produces a file containing switching rates for all signals, from which we select the N most frequently toggling, where N is a user-selectable parameter.

Figure 3: Simplified view of a Cyclone V ALM. (Diagram: an adaptive LUT (ALUT), two adders with carry in and carry out, and four registers; each register has a D input (adder out) and an SDATA input (scan in) selected by SLOAD (scan enable) through an internal mux, a CLK input and a Q output (scan out).)
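The selection step itself amounts to a top-N ranking, which can be sketched as follows. This is an illustrative sketch only: the net names and toggle rates are hypothetical, and parsing of the tool's activity file is deliberately omitted since its format is not described here.

```python
def select_signals(switching_rates, n):
    """Return the names of the n nets with the highest estimated toggle rate.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(switching_rates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Hypothetical per-net toggle rates (transitions per second).
rates = {
    "fir0|acc[3]": 41.2e6,
    "fir0|acc[0]": 88.9e6,
    "fir0|tap[7]": 12.5e6,
    "fir0|valid": 63.0e6,
}

hot = select_signals(rates, n=2)
```

Only the ordering of the estimates matters here; their absolute values are discarded once the N signals have been chosen.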

We note that PowerPlay’s estimates of power consumption are inherently inaccurate and do not capture any dynamic shifts brought about through changing input data or environmental conditions. However, for the purposes of identifying the ‘hot’ signals at design-time this is not important since the coefficients associated with these signals will be determined and tuned at runtime.

3.2. Instrumentation

At the heart of any instrumentation that counts synchronous events is an efficient counter structure. We use linear feedback shift register (LFSR)-based counters because they are smaller and faster than arithmetic counters: four LFSR bits can be placed in each adaptive logic module (ALM) of the Cyclone V architecture we target, as opposed to two binary counter bits due to the presence of only two full adders. LFSR states do not follow binary counting order; however, since each counter has relatively few bits (justified in Section 5.2), the decoding is easy to perform in software with a lookup table.
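The lookup-table decoding can be sketched in Python as below. The feedback taps (bits 9 and 5) and the use of XNOR feedback, which makes the all-zero state a valid starting point, are assumptions made for this illustration; the taps used in the actual instrumentation are not stated in the text.

```python
WIDTH = 9
TAPS = (9, 5)  # assumed 1-indexed tap positions for a 9-bit maximal-length LFSR

def lfsr_next(state, width=WIDTH, taps=TAPS):
    """Advance a Fibonacci LFSR with XNOR feedback by one step."""
    feedback = 1  # starting from 1 turns the XOR of the taps into an XNOR
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)

def build_decode_table(width=WIDTH, taps=TAPS):
    """Map each reachable LFSR state to the number of steps taken from zero."""
    table = {}
    state, count = 0, 0
    while state not in table:
        table[state] = count
        state = lfsr_next(state, width, taps)
        count += 1
    return table

DECODE = build_decode_table()

# Simulate 37 counted events, then decode the raw register contents.
raw = 0
for _ in range(37):
    raw = lfsr_next(raw)
count = DECODE[raw]
```

Under these assumptions the table has 2^W − 1 entries, covering counts from 0 to 2^W − 2; the all-ones state is never reached and so never needs decoding.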

A low-overhead single-bit scan chain is sufficient to read out the activity counts since the required sampling rate is low. Reading out the counters’ contents also shifts zeroes into the head of the scan chain: an efficient reset method. An advantage of the Cyclone V architecture is that register resources each have two data input ports, as shown in Figure 3: we use one to capture the next state of the counting logic while the second forms the scan chain input, avoiding the use of a soft multiplexer.

Figure 4 shows the full, scan-capable activity counter. When a positive edge is detected and the instrument is enabled (EN = 1), the LFSR’s clock enable (CE) is driven high, causing it to advance to its next state since SLOAD is held permanently high. When the scan chain is in operation (SE = 1), the LFSR does not shift into its next state, taking on the value at its scan input instead.
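The clock-enable logic annotated in Figure 4, CE = EN & ((a & ~b) | SE), can be modelled cycle by cycle as in the sketch below. It assumes that a is the most recently sampled value of the monitored signal and b the value sampled one cycle earlier; the function names and the input trace are hypothetical.

```python
def clock_enable(en, se, a, b):
    """CE = EN & ((a & ~b) | SE), with every signal given as a 0/1 integer."""
    return en & ((a & (1 - b)) | se)

def count_enabled_edges(samples, en=1):
    """Count the cycles in which CE is high while scanning is idle (SE = 0)."""
    prev, events = 0, 0
    for cur in samples:
        events += clock_enable(en, 0, cur, prev)
        prev = cur  # the detector register remembers the last sampled value
    return events

# Hypothetical per-cycle samples of one monitored signal.
trace = [0, 1, 1, 0, 1, 0, 0, 1]
edges = count_enabled_edges(trace)
```

In this model a signal held high contributes only one event, at the cycle in which it rises, and asserting SE with EN high forces CE high so that the chain can shift regardless of the monitored signal.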

Scan chain read-back is accomplished through a FIFO instantiated as part of a template, shown in Figure 5, which wraps each instrumented module. Clock domain crossing, if required, is handled automatically. To take an activity measurement, the counters are enabled for a period of time dictated by an adjustable ‘instrumentation lifetime’ counter, after which the contents of the scan chain can be read out across the system bus via the FIFO.

Figure 4: LFSR-based activity counter. (Diagram: a posedge detector on the signal input, with detector signals a and b, drives the clock enable (CE) of the registers of a linear feedback shift register (LFSR); the registers have SLOAD = 1 and take the scan input (SI) on SDATA, with the chain ending at the scan output (SO). CE = EN & ((a & ~b) | SE), where EN is enable and SE is scan enable.)

Figure 5: Instrumentation template. (Diagram: an IP bus wrapper with registers at offsets 0x00 to 0x18 connects the system bus to the IP kernel, which may be in a separate clock domain, and to the instrumentation template, which uses the same clock as the IP kernel. The template provides reset, instrumentation lifetime, scan enable and FIFO full signals and a FIFO with 32-bit data; the activity counters in the instrumentation stubs are interconnected by a single-bit scan chain. All control and data cross clock domains through synchronisers or dual-clock FIFOs.)

While we have optimised our design for a particular device family, similar activity counters and read-back infrastructure could easily be implemented in alternative FPGAs. Similarly, although our flow was built upon Altera tools, a Vivado-based version targeting Xilinx devices would be largely equivalent in structure.

4. System Identification

Signal activity can be translated into a power estimate using a weighted linear model, with partial system behaviour used to dynamically update the coefficients via methods of system identification [10]. Figure 6 shows a typical system identification setup in which an input vector, a, is fed into both a black box (‘black’ since this technique has no, nor needs any, understanding of the system’s internals) and its model. Coefficient vector $\hat{x}$ is an estimate of x, the latter containing the ‘true’ coefficients of the system. Outputs of the system and model, y and $\hat{y}$, respectively, are used to form an error, e: an adaptive algorithm seeks to tune $\hat{x}$ in order to drive e towards zero for the latest observation as well as those that preceded it. We assume a linear relationship between activities a and measured system power y, as shown in Equation 2.

$$y = a^{T} x \qquad (2)$$

In prior work [4], a non-adaptive offline model requiring multiple (a, y) observation pairs for training was used. A vector of y and matrix of a were formed from these pairs, allowing Equation 2 to be solved in the least squares sense to find $\hat{x}$, which remained fixed during runtime. By contrast, the identification algorithm we use is recursive least squares (RLS) [11], which iteratively updates $\hat{x}$ as new measurements arrive. The RLS update function is of the form $\hat{x}[t] = \hat{x}[t-1] + f(e[t], k[t])$, as defined in Equations 3, 4 and 5. k is the gain vector, containing factors determining the scaling of each coefficient, P the inverse covariance matrix, capturing the partial correlations between input variables, and λ the forgetting factor, a parameter determining the memory of the algorithm. λ = 1 means that all prior observations are given equal weighting (infinite memory) and is used for describing time-invariant systems, whereas lesser values allow prior samples to be assigned exponentially decaying weights.

Figure 6: System identification overview. (Diagram: input vector a drives both the black-box system Σ and the mathematical model $\hat{x}$; the model output $\hat{y}$ is subtracted from the system output y to form e, which feeds the adaptive algorithm that tunes the model. Legend: a, vector of system inputs (in our case, activity counts); y, true output (measured power); $\hat{y}$, model output (predicted power); e, prediction error; $\hat{x}$, vector of model coefficients.)

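$$k[t] = \frac{\lambda^{-1} P[t-1]\, a[t]}{1 + \lambda^{-1} a[t]^{T} P[t-1]\, a[t]} \qquad (3)$$

$$P[t] = \lambda^{-1} P[t-1] - \lambda^{-1} k[t]\, a[t]^{T} P[t-1] \qquad (4)$$

$$\hat{x}[t] = \hat{x}[t-1] + e[t]\, k[t] \qquad (5)$$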
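A minimal pure-Python sketch of one RLS step following Equations 3 to 5 is given below. It is not the implementation used in our experiments: the three-coefficient, noise-free synthetic "system" and the name `rls_update` are assumptions for illustration, and the error is taken as the difference between the measurement and the prediction made with the previous coefficients.

```python
import random

def rls_update(x_hat, P, a, y, lam=0.999):
    """One recursive least squares step following Equations 3 to 5."""
    n = len(a)
    Pa = [sum(P[i][j] * a[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(a[i] * Pa[i] for i in range(n)) / lam
    k = [Pa[i] / (lam * denom) for i in range(n)]            # Equation 3
    e = y - sum(a[i] * x_hat[i] for i in range(n))           # prediction error
    aTP = [sum(a[i] * P[i][j] for i in range(n)) for j in range(n)]
    P_new = [[(P[i][j] - k[i] * aTP[j]) / lam for j in range(n)]
             for i in range(n)]                              # Equation 4
    x_new = [x_hat[i] + e * k[i] for i in range(n)]          # Equation 5
    return x_new, P_new, e

# Synthetic, noise-free system with hypothetical true coefficients.
random.seed(0)
x_true = [2.0, -1.0, 0.5]
x_hat = [0.0, 0.0, 0.0]
P = [[1000.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for _ in range(200):
    a = [random.random() for _ in range(3)]
    y = sum(ai * xi for ai, xi in zip(a, x_true))
    x_hat, P, e = rls_update(x_hat, P, a, y)
```

Each update costs on the order of n^2 operations for n coefficients, since only matrix-vector products and one outer product are required; no matrix inversion is performed at runtime.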

With a linear model, it is straightforward to compute the power contribution of each module m, $\hat{y}_m$, since the model’s coefficients can be partitioned and its individual power estimates computed as shown in Equations 6, 7, 8 and 9. Scalars $a_s$ and $\hat{x}_s$ represent the ‘input’ and coefficient, respectively, for the device’s static power which, for us, also includes the dynamic power consumed by resources that are not correlated with module activity, such as the SoC bus. We keep $a_s$ constant, allowing the model to tune $\hat{x}_s$ over time. The dynamic power consumed by the activity counters themselves is included within the respective modules’ estimates since their own switching is dictated by the behaviour of the module to which each is connected.

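$$a = [\, a_s \;\; a_0 \;\; a_1 \;\; \ldots \;\; a_{M-1} \,] \qquad (6)$$

$$\hat{x} = [\, \hat{x}_s \;\; \hat{x}_0 \;\; \hat{x}_1 \;\; \ldots \;\; \hat{x}_{M-1} \,] \qquad (7)$$

$$\hat{y}_m = a_m^{T} \hat{x}_m \qquad (8)$$

$$\hat{y} = a_s \hat{x}_s + \sum_{m=0}^{M-1} \hat{y}_m \qquad (9)$$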
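The partitioning in Equations 6 to 9 can be illustrated with a short Python sketch. The two-module system, its activity counts and its coefficients (taken to be in mW per count) are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def power_breakdown(a_s, x_s, module_activities, module_coeffs):
    """Return (static estimate, per-module estimates, total), per Equations 8 and 9."""
    per_module = [sum(a * x for a, x in zip(a_m, x_m))
                  for a_m, x_m in zip(module_activities, module_coeffs)]
    static = a_s * x_s
    return static, per_module, static + sum(per_module)

# Hypothetical values: two modules with two activity counters each.
static, modules, total = power_breakdown(
    a_s=1.0, x_s=250.0,
    module_activities=[[100, 40], [10, 300]],
    module_coeffs=[[0.5, 0.25], [1.0, 0.125]],
)
```

Here the per-module estimates are 60 mW and 47.5 mW, and together with the 250 mW static term they sum to the 357.5 mW system-wide estimate that RLS compares against the measured rail power.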

5. Errors and Overheads

Experiments were performed to investigate how model accuracy and overheads change as we modify parameters N, the number of activity counters per module, and W, the width of each counter in bits. The system used to obtain the results in this section was the seven-module design described in detail in Section 6. We targeted the Altera Cyclone V SX SoC development board for all experiments performed in this work, with Quartus II 64-bit 15.0.0 used for compilation. At the development board’s core lies a 5CSXFC6D6F31C6 FPGA-SoC, consisting of two hard ARM Cortex-A9 cores tightly coupled to a 42k-ALM FPGA manufactured on a 28nm low-power process. The board also features two Linear Technology LTC2978 power supply regulators (one each for the CPU cores and FPGA), which we used for taking runtime power measurements.

Figure 7: Relative error by counters per module N. (Plot: model-vs-measurement error in mW against iteration, for numbers of counters N = 8, 16, 32, 64, 128, 256 and 512.)

System identification was implemented entirely in software on the SoC’s hard CPU cores, clocked at their default 925MHz and running Ubuntu 14.04. Communication with hardware modules and their activity counters was accomplished through memory-mapped registers accessed from Linux using mmap(). Experimentation revealed that P[0] = 1000I from starting coefficients $\hat{x}[0] = 0$ gave good results, and we found λ = 0.999 to work well in allowing the algorithm to adapt to changing operating conditions.

5.1. Error by Number of Activity Counters

Figure 7 shows how the absolute error between estimated and measured power consumption varies with the number of activity counters, N, used per module with counters each W = 9 bits wide. Each point is an average of errors across the system’s seven modules, with 32-point windowing applied in order to reduce noise and highlight the models’ trends. In all cases in Figure 7 (and Figures 8 and 11), data collected for the first 7N + 1 (the model’s order) iterations are not shown since the model cannot converge on a single solution for $\hat{x}$ during this period. Larger values of N tended to result in lower relative error but took longer to converge, as one would expect since RLS adapts dynamically: errors within ±10mW were seen with small N (8), while ±5mW was achievable with larger values (≥ 256).

5.2. Error by Counter Width

Figure 8 shows how the relative error varies with the counter width, W, with instrumentation lifetime equal to the number of cycles, $2(2^{W} - 2)$, needed to maximise dynamic range while guaranteeing no overflow. Analysis across the range of N tested shows that the choice of counter width has a significant effect on steady-state error, but not necessarily the rate of convergence. These results indicate that reducing N has a smaller impact on error than decreasing W by the same factor to achieve a similar area gain.

Figure 8: Relative error by counter width W. (Plots: model-vs-measurement error in mW against iteration for counter widths W = 3, 6 and 9, in three panels: (a) N = 8 counters per module, (b) N = 32 counters per module, (c) N = 128 counters per module.)
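The instrumentation lifetime can be checked with a small sketch. It rests on two assumptions that are not spelled out in the text: that a monitored signal can rise at most once every two clock cycles, and that a W-bit counter can represent counts from 0 to 2^W − 2. The function names are hypothetical.

```python
def instrumentation_lifetime(width):
    """Cycles for which a width-bit counter can run without overflowing."""
    return 2 * (2 ** width - 2)

def worst_case_edges(cycles):
    """Rising edges seen on a signal that toggles every cycle, starting low."""
    prev, edges = 0, 0
    for _ in range(cycles):
        cur = 1 - prev
        edges += cur & (1 - prev)  # a rising edge occurs when prev was 0
        prev = cur
    return edges

# Lifetimes for the three counter widths evaluated in this section.
lifetimes = {w: instrumentation_lifetime(w) for w in (3, 6, 9)}
```

For W = 9 this gives a lifetime of 1020 cycles, during which even a signal toggling on every cycle produces 510 rising edges, the largest count assumed representable.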

5.3. Hardware and Design Overheads

Figure 9 shows the overheads in area (ALMs and logic array blocks (LABs)), compilation time and power incurred through adding our instrumentation. As expected, the number of ALMs required increased with N, but the relationship between N and the number of LABs was less clear since the latter is determined by a packing heuristic. Compilation time increases with N, with N = 512 proving especially difficult to compile since it approached the limits of device capacity. The compilation time figure excludes the initial VQM generation because this would be integrated into a single pass in a fully developed tool flow. System power consumption increased with N, reflecting the disturbances and wire length increases created by intrusively attaching activity counters inside each module. These results indicate that N = 8 and W = 9 offer a reasonable tradeoff between relative error (±10mW) and area (ALM) and power overheads of 9% and 3.6% (53mW), respectively.

The total power overhead measurements quoted were taken with activity counters disabled. Experimentation revealed that system-wide measurements could be taken at up to 91 samples per second; with the counters operating at this rate, the average power for W = 9 was found to increase by 3.7%, from 1.70W to 1.76W. This measurement was found to be unaffected by changes in N.

Figure 9: Area, compilation time and power overheads. (Plot against number of counters N, including measured static power and compilation time.)

Figure 10: System identification execution time. (Plot: execution time against number of counters N.)

5.4. Software Overhead

The complexity of the RLS algorithm across various N is captured in Figure 10, the values of which represent the times needed to update the model during each iteration. With N = 8, each system-wide update required 0.6ms to complete, representing around 5% of total CPU time at the maximum sampling rate of 91Hz.
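The CPU utilisation figure follows directly from the update time and the sampling rate, as the short calculation below shows using the rounded values quoted above:

$$0.6\,\text{ms} \times 91\,\text{s}^{-1} = 54.6\,\text{ms of CPU time per second} \approx 5.5\%$$

This is consistent with the figure of around 5% given the rounding of the 0.6ms update time.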

6. Power Breakdown

A system was developed upon which KAPow’s estimation of module-wise power breakdowns could be evaluated. A SoC implementation containing seven functionally identical (in terms of throughput and latency) FIR filter modules was devised, with each module instrumented as described in Section 3. Heterogeneity was introduced by forcing each module to map a different proportion of its multipliers, from all to none, to LUTs rather than DSP blocks; consequently, each module exhibited different power characteristics and, therefore, had instruments attached to different signals. The uninstrumented system occupied 24% of the available ALMs, spread out over 62% of the FPGA’s LABs, along with 94% of block RAM and all DSPs. All modules met timing at 200MHz; however, each was independently clocked by a runtime-adjustable PLL, allowing frequency to be changed dynamically on a per-module basis. The SoC bus and instrumentation controllers ran at 50MHz throughout all experiments conducted in this work and, aside from the experiment performed for Section 1.1, FPGA core voltage remained fixed at 1.1V. N = 8 activity counters were used per module, each W = 9 bits wide; this remained true for all later-described experiments as well.

During each experimental iteration, a different system workload was applied: a clock frequency, one of nine input data sets and one of ten groups of coefficients were selected at random for each filter module. Activity and system-wide power measurements were also taken in each iteration and used to update the model. A true power breakdown was established for comparison on every fifth iteration by taking a system-wide power measurement and then successively clock-gating single modules and repeating power measurements, from which per-module dynamic and system static power consumptions were derived.
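One way in which such a breakdown could be derived from the gating measurements is sketched below. It assumes that clock-gating a module removes all of that module's dynamic power and leaves the rest of the system unaffected; the exact derivation used in our experiments is not detailed in the text, and the measurement values shown are hypothetical.

```python
def measured_breakdown(total_mw, gated_mw):
    """Derive per-module dynamic power and static power from gating measurements.

    total_mw: system power with every module clocked.
    gated_mw: gated_mw[m] is the system power with only module m clock-gated.
    """
    dynamic = [total_mw - gated for gated in gated_mw]
    static = total_mw - sum(dynamic)
    return dynamic, static

# Hypothetical measurements for a three-module system, in mW.
dynamic, static = measured_breakdown(1400, [1250, 1300, 1200])
```

With these numbers the modules account for 150, 100 and 200 mW, leaving 950 mW attributed to the static term, which in our definition also absorbs dynamic power not correlated with module activity.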

The scatter plots in Figure 11 show the close correlations obtained between modelled and true per-module power consumptions. Each data point represents an experimental iteration with an associated power breakdown measurement. Once sufficient training time (750 iterations) had elapsed, the mean absolute error between modelled and measured per-module power consumption was found to be 9.8mW. There was some variation between the different filter implementations: the best (module 4) achieved a mean error of 8.3mW while the worst (module 6) was 12.9mW. For all of the modules, the mean error was small compared to the range in power consumed over their different operating modes. Static power tracking was particularly good, achieving mean absolute error of 4.0mW.

Repeatability experiments on the power breakdown measurements showed deviations in the ±5mW range: the mean absolute measurement noise was 3.9mW, with outliers as large as 14.0mW. The observed model errors were therefore not significantly greater than measurement noise, indicating that the model accuracy approached the limits of what could be measured with our test setup.

7. Static Power Compensation

An experiment was performed to verify KAPow’s ability to compensate for changes in static power consumption, the key determiner of which is temperature. We used the system described in Section 6 and a temperature control rig consisting of a thermoelectric effect heat pump, water cooler and resistance thermometer, allowing the setting and maintaining of device temperature over a wide range. Random workload changes were applied as described in Section 6, with static power estimated using our model and measured after clock-gating all modules. Initial training lasted for 300 iterations, with temperature held at 25°C. Following this, the temperature was increased by 5°C every 150 iterations until it reached the device’s upper temperature corner of 85°C. The results of the experiment are shown in Figure 12, which demonstrates close tracking between model-predicted and measured static power consumption. No correlations were seen between dynamic power estimates and temperature.

8. Task Mapping

We now demonstrate an application in which KAPow was employed to guide the mapping of tasks to modules for system-wide power optimisation. Seven different filtering tasks (input data and coefficients), representing a range of complexities and, consequently, power behaviour, were specified to be executed simultaneously by the seven-module system described in Section 6. Recall that the modules are functionally identical but implementationally different, thus the task → module mapping chosen will influence the system-wide power consumption. Clock frequency remained fixed at 200MHz. The set of all possible mappings contained $^{7}P_{7} = 7! = 5040$ permutations; a histogram (with 2mW bins) of their power consumptions is given in Figure 13. This data suggests that an arbitrary mapping will result in system-wide power consumption of 1420mW, with best and worst cases of 1350 and 1500mW, respectively.

Figure 11: Per-module power estimates versus true values. (Scatter plots of model power in mW against measured power in mW, in eight panels: (a) Static, (b) Module 0, (c) Module 1, (d) Module 2, (e) Module 3, (f) Module 4, (g) Module 5, (h) Module 6.)

Figure 12: Static power tracking within power breakdown during temperature sweep. (Plot: modelled and measured static power in mW, and package temperature in °C, against iteration.)

Figure 13: System power for all task → hardware mappings. (Histogram: density against measured power in mW; median 1423mW.)

Figure 14: Task mapping power breakdown. (Plot: model power in mW against iteration, split into static and Modules 0 to 6, together with system power in mW; the training period is marked, with a training median of 1423mW and an average of 1369mW thereafter.)

Figure 14 shows the results of an experiment run using a simple closed-loop controller that attempted to minimise total power consumption. Training, completed during the first seven iterations, consisted of rotating the tasks across the seven modules to initialise a 7×7 activity table. In each subsequent iteration, activity counts and power measurements were taken as normal to update the model. The controller then used its new coefficients, x, to exhaustively forecast the system-wide power consumption for all 7! mappings (340ms in software), the optimal of which was applied to the hardware. Figure 14 shows that, for around the first 50 iterations, the model continued to adapt before converging on what it believed to be the optimal mapping. In this case, the controller was able to find a mapping that consumed 1369mW of power on average. This is 54mW (3.8%) lower than the median, against a best possible improvement of 75mW (5.3%), and 126mW (8.4%) lower than the worst case, after accounting for the overhead of instrumentation.
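A minimal sketch of the controller's exhaustive forecasting step follows. It makes simplifying assumptions that are not taken from the paper: each module has a single coefficient `x[m]`, the activity table holds one aggregate count `activity[task][m]` per task and module, and static power is a separate constant. The function names `forecast` and `best_mapping` are hypothetical.

```python
from itertools import permutations


def forecast(x_static, x, activity, mapping):
    """Forecast system power (mW) for one mapping under a linear model.

    mapping[m] is the task placed on module m; activity[task][m] is the
    (assumed) aggregate activity count of that task on that module.
    """
    return x_static + sum(
        x[m] * activity[task][m] for m, task in enumerate(mapping)
    )


def best_mapping(x_static, x, activity):
    """Exhaustively search all one-to-one mappings for the lowest forecast."""
    n = len(x)
    return min(
        permutations(range(n)),
        key=lambda mapping: forecast(x_static, x, activity, mapping),
    )


# Toy three-module example (illustrative values only): module 0 is the
# cheapest per unit of activity, and task 2 is the most active.
x = [1.0, 2.0, 3.0]
activity = [[10, 10, 10], [20, 20, 20], [30, 30, 30]]
choice = best_mapping(5.0, x, activity)
print(choice, forecast(5.0, x, activity, choice))
```

In this toy case the search places the most active task on the cheapest module. For seven modules the same loop visits all 5040 mappings, which is the exhaustive search the text describes.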

[Bar chart: stacked bars for Measured, Model and Vectorless, each broken down into Static and Module 0 to Module 6, against Power (mW) spanning 0 to 1400, with an Efficiency (More/Less) indicator. Bar annotations: 199, 208, 251.]

Figure 15: Vectorless, modelled and true power breakdowns

Finally, Figure 15 provides side-by-side comparisons of compile-time vectorless power forecasts from PowerPlay, true measurements and runtime estimates from KAPow. Measurements and runtime estimates are a snapshot taken during the experiment’s 150th iteration. Comparing the ‘Vectorless’ and ‘Measured’ bars, we observe that vectorless estimation predicts approximately equal power behaviour across the modules, while measurement reveals significant variation. Looking now at the ‘Model’ bar, it can be seen that our online modelling is much closer to the measured data: KAPow successfully accounts for implementational and operational differences not foreseen by vectorless analysis.

9. Conclusion

In this paper we presented KAPow, a flow that combines hardware instrumentation with software system identification techniques to estimate per-module breakdowns of power consumption within FPGA-SoC applications. Our approach is based around the monitoring of influential signals within each module, determined at compile-time through vectorless power analysis. We demonstrated that low-overhead activity counting logic can be mapped automatically, transparently and elegantly to FPGAs.

Rather than estimating system-wide power consumption, as in prior work, we measured it directly and applied system identification to train and continuously update a linear power model online. Once running, the model adapts to changes in operating conditions in ways that offline models cannot. Our approach can unburden vendors from the expense of having to pre-characterise a custom power model to fit individual devices and applications, and greatly improves on the accuracy achievable with a one-size-fits-all model. Furthermore, the ability to establish power consumption breakdowns unlocks a whole field of runtime performance optimisation techniques, allowing challenges including process variation and dark silicon to be addressed.
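To make the idea of continuously updating a linear power model concrete, here is a hedged sketch. This section does not restate which update rule is used, so the sketch assumes recursive least squares with a forgetting factor, a standard system identification technique. The class name `OnlinePowerModel`, the forgetting factor, the initial covariance scale and the synthetic data are all illustrative assumptions.

```python
import random


class OnlinePowerModel:
    """Linear model: total power ~= sum(x[i] * a[i]), refined sample by sample."""

    def __init__(self, num_inputs, forgetting=0.99, delta=1e3):
        self.x = [0.0] * num_inputs  # coefficients, initially unknown
        # Large initial covariance expresses low confidence in the zero guess.
        self.P = [[delta if i == j else 0.0 for j in range(num_inputs)]
                  for i in range(num_inputs)]
        self.forgetting = forgetting  # below 1, so old samples fade out

    def update(self, a, measured_mw):
        """Apply one recursive least squares step; return the prior error."""
        n = len(self.x)
        Pa = [sum(self.P[i][j] * a[j] for j in range(n)) for i in range(n)]
        denom = self.forgetting + sum(a[i] * Pa[i] for i in range(n))
        gain = [v / denom for v in Pa]
        error = measured_mw - sum(a[i] * self.x[i] for i in range(n))
        self.x = [self.x[i] + gain[i] * error for i in range(n)]
        # P is symmetric, so the row vector a^T P equals Pa transposed.
        self.P = [[(self.P[i][j] - gain[i] * Pa[j]) / self.forgetting
                   for j in range(n)] for i in range(n)]
        return error

    def breakdown(self, a):
        """Split the estimate into one contribution per input."""
        return [self.x[i] * a[i] for i in range(len(self.x))]


# Synthetic check: a constant input of 1.0 stands in for static power and
# two inputs stand in for module activity counts (made-up coefficients).
rng = random.Random(0)
true_x = [50.0, 2.0, 3.0]
model = OnlinePowerModel(3)
for _ in range(300):
    a = [1.0, rng.uniform(0, 100), rng.uniform(0, 100)]
    model.update(a, sum(c * v for c, v in zip(true_x, a)))
print(model.x, model.breakdown([1.0, 10.0, 20.0]))
```

Only the total power is ever supplied to `update`, yet `breakdown` returns one term per input, which is the per-module decomposition the paragraph refers to. The forgetting factor is what lets such a model follow slow drifts such as temperature changes.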

We evaluated KAPow on a multi-module system to establish its accuracy, adaptability and suitability to inform runtime task mapping for system-wide power optimisation. Module-level power estimates were shown to be accurate to within as little as ±5mW of true measurements, which is of the order of measurement noise within our experimental setup. In our task mapping experiment, a power consumption reduction of over 8% relative to the worst-case mapping was achieved for a 9% area overhead (or ∼2% of the device) and a worst-case power overhead below 4%.

Early indications show that the presented accuracy holds for arbitrary circuits. The largest limitation at present (and, we believe, the most promising avenue for future work) lies in signal selection. Selecting signals based solely upon the activity predicted by vectorless analysis ignores the fact that some may be highly correlated. This is indeed a limitation of our current flow: often, several counters can be eliminated with no effect on the accuracy of the model.
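One way to picture the redundancy among correlated counters is the following sketch, which is not part of the presented flow. It assumes that recorded activity traces are available for each monitored signal and greedily keeps a signal only if its Pearson correlation with every signal already kept stays below a threshold. The signal names, trace values and the 0.95 threshold are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length traces (0.0 if either is flat)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if spread_u == 0 or spread_v == 0:
        return 0.0
    return cov / (spread_u * spread_v)


def prune_correlated(traces, threshold=0.95):
    """Keep each signal only if it is not highly correlated with a kept one."""
    kept = []
    for name, trace in traces.items():
        if all(abs(pearson(trace, traces[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical activity counts per iteration for three monitored signals.
traces = {
    "sig_a": [1, 2, 3, 4, 5],
    "sig_b": [2, 4, 6, 8, 10],  # exactly twice sig_a, so fully redundant
    "sig_c": [5, 1, 4, 2, 3],
}
print(prune_correlated(traces))
```

Here `sig_b` carries no information beyond `sig_a`, so its counter could be dropped without affecting a linear model built on the remaining signals.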

In the future, we would like to experiment with different signal selection methods, including vectored (simulation-based) approaches [1] and by analysing circuit structure using centrality techniques [12]. We will also evaluate the use of asynchronous counters to explicitly capture glitches. Finally, we envisage that a higher-level runtime management layer, in place of the simple controllers used in this work, may allow estimates from KAPow to be combined with application-level parameters to enable finer-grained control.

Acknowledgments

This work was supported by the EPSRC-funded PRiME Project (grant numbers EP/I020357/1 and EP/K034448/1). http://www.prime-project.org

The authors also acknowledge the support of Imagination Technologies and the Royal Academy of Engineering.

References

[1] F. N. Najm, “A Survey of Power Estimation Techniques in VLSI Circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, 1994.

[2] J. H. Anderson and F. N. Najm, “Power Estimation Techniques for FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 10, 2004.

[3] A. Lakshminarayana, S. Ahuja, and S. Shukla, “High-level Power Estimation Models for FPGAs,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2011.

[4] M. Najem, P. Benoit, F. Bruguier, G. Sassatelli, and L. Torres, “Method for Dynamic Power Monitoring on FPGAs,” in International Conference on Field-programmable Logic and Applications (FPL), 2014.

[5] J. M. Levine, E. Stott, G. A. Constantinides, and P. Y. K. Cheung, “Online Measurement of Timing in Circuits: For Health Monitoring and Dynamic Voltage & Frequency Scaling,” in IEEE International Symposium on Field-programmable Custom Computing Machines (FCCM), 2012.

[6] E. Hung and S. J. E. Wilton, “Zero-overhead FPGA Debugging,” Reconfigurable Logic: Architecture, Tools, and Applications, vol. 48, 2015.

[7] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz, “Titan: Enabling Large and Complex Benchmarks in Academic CAD,” in International Conference on Field-programmable Logic and Applications (FPL), 2013.

[8] Altera, “Quartus II Handbook Volume 3: Verification,” http://www.altera.com/literature/hb/qts/qts_qii5v3.pdf.

[9] J. Lamoureux and S. J. E. Wilton, “Activity Estimation for Field-programmable Gate Arrays,” in International Conference on Field-programmable Logic and Applications (FPL), 2006.

[10] L. Ljung, “System Identification,” in Signal Analysis and Prediction, ser. Applied and Numerical Harmonic Analysis. Birkhäuser, 1998.

[11] MathWorks, “Adaptive Filters – MATLAB & Simulink,” http://uk.mathworks.com/help/dsp/adaptive-filters.html.

[12] E. Hung and S. J. E. Wilton, “Scalable Signal Selection for Post-silicon Debug,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, 2013.
