

Page 1: Quantifying the Energy Efficiency of ... - TU Delft

Quantifying the Energy Efficiency of Coordinated Micro-Architectural Adaptation for Multimedia Workloads

Shrirang Yardi and Michael S. Hsiao
The Bradley Department of Electrical and Computer Engineering
Virginia Tech, Blacksburg, VA 24061, USA.
{yardi,mhsiao}@vt.edu

Abstract— Adaptive micro-architectures aim to achieve greater energy efficiency by dynamically allocating computing resources to match the workload performance. The decisions of when to adapt (temporal dimension) and what to adapt (spatial dimension) are taken by a control algorithm based on an analysis of the power/performance tradeoffs in both dimensions. We perform a rigorous analysis to quantify the energy-efficiency limits of fine-grained temporal and coordinated spatial adaptation of multiple architectural resources by casting the control algorithm as a constrained optimization problem. Our study indicates that coordinated adaptation can potentially improve energy efficiency by up to 60% as compared to static architectures and by up to 33% over algorithms that adapt resources in isolation. We also analyze the synergistic application of coarse- and fine-grained adaptation and find modest improvements of up to 18% over optimized dynamic voltage/frequency scaling. Finally, we analyze several previous control algorithms to understand the underlying reasons for their inefficiency.

I. INTRODUCTION

As transistor densities increase rapidly with each new process technology and supply voltage decreases relatively slowly, microprocessor power consumption has become a critical operational constraint. Researchers have mainly used two approaches to reduce microprocessor power. The first is intelligent hardware design with static power-saving techniques (e.g., clock/power gating unused components). The second is to dynamically allocate just enough resources to match the performance requirements of the application.

These adaptive approaches aim to achieve greater energy efficiency by exploiting the variability, or execution slack, which arises due to the diverse execution characteristics of different applications running on static hardware. Examples of such methods include adaptation of micro-architectural structures [1] and system-level adaptation such as dynamic voltage/frequency scaling (DVS) [16], among others.

Adaptive techniques typically exploit two types of execution slack to save energy: temporal slack, which can be exploited by slowing down the processor, and resource slack, which can be exploited by re-sizing or de-activating parts of the processor. The key to adaptation is the control algorithm that decides when to adapt and what to adapt with the goal of achieving energy-efficient operation [7]. Ideally, to maximize energy efficiency, we would like to adapt frequently (temporally fine-grained) over an adaptive space of many resources (spatially coordinated). This scenario of performing fine-grained temporal and coordinated spatial adaptation is a complex multi-dimensional optimization problem. To realize the full potential of such adaptation, it is important to perform a rigorous assessment of its benefits and costs.

This paper performs a detailed, off-line, quantitative analysis of the energy savings when adapting multiple resources within a high-performance general-purpose microprocessor running multimedia workloads. In this context, our goals are to perform a comprehensive exploration of the adaptive design space, quantify the potential efficiency benefits of fine-grained and coordinated adaptation, and identify the limitations of existing techniques. If significant gains are found, this can motivate further analysis and design of more efficient adaptive hardware substrates and control algorithms.

A. Motivation

The following factors have motivated our study:

1) A considerable amount of research has been devoted to the design of control algorithms for micro-architectural adaptation (see Albonesi et al. [1] for a survey). However, due to the challenging multi-dimensionality of the problem, prior techniques are largely ad hoc and have often constrained their analysis in either the temporal or the spatial dimension. Temporal constraints limit micro-architectural responsiveness to workload heterogeneity, and spatial constraints fail to account for interactions between adaptive structures. Only a rigorous and comprehensive exploration of the adaptive design space can provide an accurate picture of the potential efficiency benefits.

2) We focus on multimedia applications because, unlike throughput-oriented workloads (such as SPEC), these applications present a unique set of issues that warrant their detailed study. First, these applications represent a large (and sometimes the only) share of the workloads for increasingly power-hungry mobile devices. Second, these applications have markedly different execution characteristics than throughput workloads, and several multimedia-specific adaptation techniques have been proposed [6], [7], [14]. It is important to analyze such application-specific control algorithms to determine the underlying reasons for their energy (in)efficiency. However, our analysis framework is also applicable to other workload domains.

978-1-4244-2658-4/08/$25.00 ©2008 IEEE 583


3) For multimedia workloads, system-level adaptations (such as DVS) add another dimension to the energy-performance tradeoff space by changing the relative impact of structural adaptation on overall energy efficiency. This added complexity has hindered an integrated analysis of structural and system-level adaptation. As a result, the control algorithms at the two levels have been largely orthogonal. One of our goals is to analyze whether these can be applied in concert and, if so, quantify the potential efficiency gains.

In summary, we believe that an unconstrained, rigorous analysis of micro-architectural adaptivity is critical to overcome the limitations of previous ad hoc approaches. Despite its high computational cost, such comprehensive exploration is crucial to obtain an accurate picture of the potential benefits of coordinated adaptation and to provide insights for designing practical and powerful control algorithms.

B. Contributions

The following summarizes our main contributions:

1. We cast the problem of fine-grained, coordinated structural adaptation as a constrained optimization problem. In particular, we consider adaptation at a temporal granularity of every 1024 instructions across 25920 micro-architectural configurations, spanning a design space of more than 2^57 points. We also consider the problem of integrating fine-grained structural and coarse-grained system adaptation (such as DVS) to identify their relative contributions to overall energy efficiency. The solutions to these models allow us to perform, for the first time, a comprehensive analysis of the benefits of fine-grained temporal, coordinated structural, and integrated structural and system-level adaptation.

2. We apply this framework to assess the benefits of varying degrees of temporal and spatial adaptivity. We find significant energy-efficiency gains of up to 85% (60% on average) over a base, non-adaptive processor without DVS, suggesting significant potential for fine-grained structural adaptation. We observe that these gains result from comprehensive use of the available configurations, suggesting that interaction between adaptive structures is an important factor in realizing these efficiency benefits.

3. We implemented several previously proposed ad hoc control algorithms and analyzed the underlying reasons for their inefficiency. We observed that the best previous algorithm performs 33% worse on average than the optimal due to inefficient exploitation of available slack. We also find that the energy savings and the performance impact of these algorithms are unpredictable even after extensive, application-specific manual tuning.

4. We observe that the amount of temporal slack and its distribution across adaptive intervals are the keys to the energy efficiency achieved by the optimal algorithm. This motivates the design of control algorithms that use temporal slack as a first-class constraint to guide adaptation decisions. We discuss these implications for the design of more efficient adaptive hardware and control algorithms.

II. METHODOLOGY OVERVIEW

A. Definition of Key Terms

Most multimedia applications are real-time and need to process discrete units of data, each termed a frame. The processing of each frame has to be completed within a certain time, termed the deadline. For a given architecture, the difference between the deadline and the actual execution time is the temporal slack for the frame, and it can vary from frame to frame. Further, there can be significant variation in intra-frame resource utilization, termed resource slack, which typically varies over a few hundred cycles. We define an epoch as the temporal granularity at which a structural adaptation is invoked. We consider two classes of adaptations that can be applied at different time scales. First, dynamic voltage/frequency scaling (DVS) is applied at the granularity of an entire frame. Second, micro-architectural adaptation is applied at varying epoch sizes ranging from 1024 (1K) instructions to 1M instructions. Finally, we term our non-adaptive processor the Base architecture.
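These definitions can be made concrete with a minimal sketch (the numbers are hypothetical and the helper names are ours, not from the paper):

```python
# Illustrative sketch of the key terms; all values are hypothetical.
EPOCH_SIZE = 1024  # instructions per epoch (the finest granularity studied)

def temporal_slack(deadline_ms, exec_time_ms):
    """Temporal slack = deadline minus actual execution time; varies per frame."""
    return deadline_ms - exec_time_ms

def num_epochs(instruction_count, epoch_size=EPOCH_SIZE):
    """Number of adaptation intervals (epochs) within one frame (ceiling)."""
    return (instruction_count + epoch_size - 1) // epoch_size

# A frame that needs 25 ms against a 33.9 ms deadline leaves 8.9 ms of slack
# that a controller can spend on slower, lower-energy configurations.
slack = temporal_slack(33.9, 25.0)
epochs = num_epochs(1_000_000)
```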

B. Modeling Methodology

For our off-line analysis, we model adaptation over the lifetime of a single frame as an optimization problem. We chose a frame as the unit for our model based on prior work [7], which shows that several multimedia workloads exhibit significant per-frame execution-time variability. Further, temporal slack is defined in terms of a single frame since each frame (and not the entire application) is associated with a deadline. Finally, modeling global adaptation across the entire application is infeasible since the number of frames can be unbounded.

We propose two models: the first captures structural adaptation (Section III) within a frame, and the other (Section IV) models DVS at the frame granularity in addition to structural adaptation at the epoch granularity. We obtain the information required to solve each problem using cycle-accurate simulations. The solutions provide the per-epoch optimal resource configurations and per-frame voltage/frequency values to execute the frame with minimum energy. This is done for each frame in the workload, and the optimal configurations are fed back to the simulator to obtain actual energy and delay values.

C. Adaptivity Analysis

We quantify efficiency for coordinated spatial adaptation by reconfiguring several structures such as the instruction window size, load/store queue size, number of ALUs, number of FPUs (floating-point units) and the issue width. We study the energy-efficiency trends for different degrees of temporal adaptivity by: (1) varying the epoch size, and (2) varying the amount of temporal slack available using two different sets of frame deadlines. We also implement several previously proposed adaptation techniques to assess the limitations to efficiency caused by constraining the adaptation in the temporal and/or spatial dimensions. Finally, we assess the efficiency gains when fine-grained structural adaptation and DVS are applied together.



Our analysis uses the average energy-per-instruction (EPI), the energy-delay product, and the number of missed deadlines as the primary metrics of efficiency. In addition, we quantify the resource slack consumed by each technique by the diversity of configurations used, and we quantify the temporal slack consumed with the per-epoch energy×delay product.
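The two primary metrics are simple ratios and products; a minimal sketch (function names and numbers are ours, for illustration only):

```python
# Minimal sketch of the efficiency metrics used in the analysis.

def epi(total_energy_joules, instruction_count):
    """Average energy-per-instruction (EPI)."""
    return total_energy_joules / instruction_count

def energy_delay(total_energy_joules, exec_time_seconds):
    """Energy-delay product: penalizes saving energy purely by running slowly."""
    return total_energy_joules * exec_time_seconds

# Hypothetical frame: 2 J consumed over 10M instructions in 5 ms.
frame_epi = epi(2.0, 10_000_000)
frame_ed = energy_delay(2.0, 0.005)
```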

D. Assumptions

For the purpose of this analysis, we assume a soft real-time, single-core system running single-threaded multimedia workloads. We assume that the CPU scheduler uses a single frame as the scheduling unit and that there is no time borrowing between frames. However, we account for the effect of varying system load in a multi-programmed environment by considering two sets of deadlines (further explained in Section V). We also assume that the hardware recognizes the frame boundary, frame type and the deadline. We believe that these assumptions are reasonable for devices running multi-programmed (not multi-threaded) workloads. Finally, we account for structural adaptation overheads but assume zero overhead for performing per-frame DVS.

III. MODELING STRUCTURAL ADAPTATION

A. Problem Formulation

Our approach for modeling fine-grained structural adaptations is based on previous work by Hughes et al. [6] and is described as follows. Within a frame, each epoch i can be run with a different architectural configuration C_j, where j ∈ Arch, the set of all possible configurations. Each configuration has two attributes: a reward, which is the energy saved by using C_j instead of Base, and a cost, which is the performance degradation due to C_j for that epoch. The goal is to determine a single configuration C_j for each epoch i such that these configurations together result in the most energy saved while consuming no more than the available temporal slack for the frame. We characterize the reward in terms of energy-per-instruction (EPI) saved and the cost in terms of the number of additional cycles, both vs. Base, to execute each epoch. Formally, we can state the problem as:

maximize Σ_{i∈N} Σ_{j∈Arch} E_ij · C_ij (1)

subject to:

Σ_{i∈N} Σ_{j∈Arch} S_ij · (C_ij · A(j)) ≤ S_frame, (2)

∀i ∈ N: Σ_{j∈Arch} C_ij = 1, (3)

C_ij = 1 if config j is selected for interval i, 0 otherwise. (4)

Above, E_ij and S_ij are the energy-per-instruction (EPI) saved and the cycles required when using configuration j vs. Base for epoch i. S_frame is the available slack for the frame, and N is the total number of epochs in the frame. A is a map from the value of C_ij to the actual configuration to be used. Eqn. 3 guarantees that exactly one configuration is selected for each epoch by using the decision variable C_ij. The products in Eqns. 1 and 2 define the complete energy-performance tradeoff space for the configurations in Arch. The optimal solution is a vector C* = (C*_1, ..., C*_N) of configurations, one per epoch, that provides the maximum energy savings. This problem is an instance of the well-known multiple-choice knapsack problem (MCKP) and is NP-hard [10]. Note that, since the temporal slack and the number of instructions vary from frame to frame, we have to define one such problem for each frame. We term this problem OPT:FG.
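Since OPT:FG is an MCKP instance, its structure can be illustrated with a small dynamic program over an integer slack budget. This is a toy sketch with made-up rewards and costs: the paper does not specify its solver, and we assume costs are non-negative integers (with Base itself available as a zero-cost, zero-reward choice to guarantee feasibility).

```python
def solve_opt_fg(E, S, slack_budget):
    """Toy multiple-choice knapsack DP for an OPT:FG-like problem.

    E[i][j]: EPI saved (reward) in epoch i under configuration j.
    S[i][j]: extra cycles vs. Base (cost, assumed non-negative integer).
    slack_budget: integer cycle slack available for the whole frame.
    Returns (max_saving, one config index per epoch), or (None, None)
    if no selection fits the budget.
    """
    NEG = float("-inf")
    n = len(E)
    best = [NEG] * (slack_budget + 1)   # best[s]: max saving using s cycles of slack
    best[0] = 0.0
    choice = [[None] * (slack_budget + 1) for _ in range(n)]
    for i in range(n):
        nxt = [NEG] * (slack_budget + 1)
        for s in range(slack_budget + 1):
            if best[s] == NEG:
                continue
            for j, (e, c) in enumerate(zip(E[i], S[i])):
                s2 = s + c
                if s2 <= slack_budget and best[s] + e > nxt[s2]:
                    nxt[s2] = best[s] + e
                    choice[i][s2] = (j, s)  # config j chosen, reached from state s
        best = nxt
    s_star = max(range(slack_budget + 1), key=lambda s: best[s])
    if best[s_star] == NEG:
        return None, None
    picks, s = [], s_star
    for i in reversed(range(n)):            # backtrack one configuration per epoch
        j, s = choice[i][s]
        picks.append(j)
    return best[s_star], picks[::-1]
```

The DP is pseudo-polynomial in the slack budget, which is one reason an exact, off-line analysis of this kind is affordable even though MCKP is NP-hard in general.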

B. Solving OPT:FG

To solve the optimization problem, we need the values of E_ij and S_ij for all frames, all configurations and all epochs. We obtain these values using cycle-accurate instruction-level simulation as follows.

We reconfigure the instruction window (IW), load/store queue (LSQ), number of integer ALUs, number of FPUs (floating-point units) and the issue width, giving |Arch| = 25920 configurations. To reduce the number of simulations and to maintain a balanced design, we adapt the IW and LSQ together and the ALUs and the issue width together. More details about the different adaptive units are provided in Section V. With these constraints, we need to perform 360 simulations for each frame to obtain the values of E_ij and S_ij required to solve the problem. For each application, we profile several frames for all the configurations.

An intuitive idea of the solution can be given as follows. For each epoch, the most energy-efficient configuration is the one that maximizes the tradeoff between EPI saved and cycles used. In other words, since each configuration uses some part of the available temporal slack, C* provides the best way to “distribute” the slack across the frame by exploiting synergistic interactions between the adaptive resources. Finally, to obtain the actual dynamic energy, we simulate each frame using its optimal configurations.

IV. INTEGRATED STRUCTURAL AND SYSTEM ADAPTATION

In the context of soft real-time systems, DVS has long been applied as an effective frame-level technique [15], where the processor voltage/frequency are scaled to save energy while guaranteeing that the deadline is met. One of our goals is to understand the interaction between these adaptations and quantify the potential efficiency benefits of applying them synergistically. As a simple example of interaction between the two algorithms, an aggressive DVS setting may allow the fine-grained algorithm to exercise a wider range of configurations and, conversely, a less aggressive setting may leave little potential to exploit intra-frame variability. This section describes our formulation to determine the optimal way to apply these adaptations.

A. Problem Formulation

The objective is to select a single frequency/voltage for the frame and a single configuration for each epoch within a frame such that, together, they maximize the EPI savings while consuming no more than the available slack for the frame. Eqns. 5-11 formally state the problem. For Eqns. 5-11, Arch, N, C_ij and A have the same definitions as for OPT:FG. V is the set of all possible voltage values (possibly unbounded for a system supporting continuous DVS). D_k is a binary variable that is set to 1 if voltage V(D_k) is selected for the frame, where V maps k to a unique voltage/frequency pair. Eqns. 7 and 8 guarantee that a single voltage value is selected for the entire frame and a single configuration is selected for each epoch. Eqn. 9 shows that, for all k ∈ V, S_frame,k depends on both V(D_k) and on the configuration set C* = (C*_1, ..., C*_N) selected for the frame. We denote this problem OPT:CG+FG.

maximize Σ_{k∈V} Σ_{i∈N} Σ_{j∈Arch} E_kij · D_k · C_ij (5)

subject to:

∀k ∈ V: Σ_{i∈N} Σ_{j∈Arch} S_ij · C_ij ≤ S_frame,k, (6)

Σ_{k∈V} D_k = 1, (7)

∀i ∈ N: Σ_{j∈Arch} C_ij = 1, (8)

where,

∀k ∈ V: S_frame,k = D_k · F(V(D_k), Σ_{i∈N} Σ_{j∈Arch} C_ij · A(j)), (9)

D_k = 1 if voltage k is selected for the frame, 0 otherwise, (10)

C_ij = 1 if config j is selected for interval i, 0 otherwise. (11)

The solution to this problem provides, for each frame, (1) a single voltage/frequency value, V_CG, and (2) the optimal configuration set C*_FG, which together save the most energy while consuming no more than the available slack for the frame. Intuitively, by selecting the best voltage and configuration set, the solution provides the best “split” of the available slack between the two control algorithms.

B. Solving OPT:CG+FG

Since S_frame,k now depends on both the voltage and the configuration set, OPT:CG+FG is a mixed-integer, non-linear problem (MI-NLP) and is intractable even for industrial solvers. One naive heuristic would be to discretize [0,V], effectively decoupling the voltage and configuration selection. This is similar to solving OPT:FG repeatedly with E_ij values scaled for each discrete voltage value. We wish to avoid such decoupling in order to consider the interaction between these adaptations, and we use the following heuristic to accomplish this.

We use the amount of temporal slack as a knob to control the relative aggressiveness (and hence energy efficiency) of the CG and FG parts of OPT:CG+FG as follows. For the candidate frame, let T_base be the execution time for Base and S_max be the maximum available temporal slack. Consider the case when only structural adaptation is performed for some slack S_FG ≤ S_max. This is accomplished by solving OPT:FG with S_frame = S_FG to obtain the minimum-energy configuration set, C*_FG. Let T_FG be the required execution time and IPC_FG the average IPC. It follows that T_FG = T_base + S_FG.

Next, consider the case that DVS is applied in addition to structural adaptation to consume the remaining slack, S_CG = S_max − S_FG. It follows that T_CG = T_FG + S_CG, i.e., T_CG = T_base + (S_FG + S_CG). The minimum frequency required to consume S_CG is then given by f_CG = ICount / (T_CG × IPC_Base) [7].

The goal of OPT:CG+FG then is to determine the best “split” of S_max into S_FG and S_CG such that the energy savings for the frame are maximized. We discretize the interval [0, S_max] into several candidate splits; we use values of 1% to 100% of S_max in steps of 1%. For each split, we calculate S_FG and S_CG, solve OPT:FG to obtain C*_FG, determine T_FG and IPC_FG, and determine f_CG. Finally, we simulate the frame using these values to obtain the split that gives the best energy savings. In summary:

for each frame do
    T_base = frame execution time on Base
    S_max = deadline − T_base
    ICount = instruction count for this frame
    for split = 0.01 to 1 in steps of 0.01 do
        Solve OPT:FG with S_frame = split × S_max to get C*_FG, IPC_FG
        T_CG = T_FG + (S_max × (1 − split))
        f_CG = ICount / (T_CG × IPC_FG)
        EPI_split = EPI at f_CG, V_CG with C*_FG
    end
    Lowest EPI_split gives best V_CG, C*_FG
end

Algorithm 1: Slack-splitting heuristic to solve OPT:CG+FG

The main advantage of the slack-splitting approach over the naive heuristic is that it allows a wider choice in the selection of voltage values, which brings the solution closer to the theoretical optimum. Discrete voltages would limit the voltage choices and consequently the potential benefits.
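The slack-splitting loop of Algorithm 1 can be sketched in code. The OPT:FG solve and the EPI evaluation are stubbed out as caller-supplied callbacks, since both require the cycle-accurate simulator; all names here are ours, not the paper's.

```python
def best_slack_split(t_base, deadline, icount, solve_opt_fg, epi_at):
    """Sketch of the slack-splitting heuristic (after Algorithm 1).

    solve_opt_fg(s_fg) -> (configs, ipc_fg): stand-in for solving OPT:FG
        with S_frame = s_fg and measuring the resulting average IPC.
    epi_at(f_cg, configs) -> float: stand-in for simulating the frame at
        frequency f_cg with the chosen configuration set.
    Returns (best_epi, f_cg, configs, split).
    """
    s_max = deadline - t_base            # maximum temporal slack for the frame
    best = None
    for step in range(1, 101):           # candidate splits: 1% .. 100% of S_max
        split = step / 100.0
        s_fg = split * s_max             # slack spent on structural adaptation
        configs, ipc_fg = solve_opt_fg(s_fg)
        t_fg = t_base + s_fg
        t_cg = t_fg + (s_max - s_fg)     # DVS consumes the remaining slack
        f_cg = icount / (t_cg * ipc_fg)  # minimum frequency that meets the deadline
        e = epi_at(f_cg, configs)
        if best is None or e < best[0]:
            best = (e, f_cg, configs, split)
    return best
```

Note that T_CG always equals the deadline; the split still matters because the configuration set (and hence IPC_FG and the required frequency) changes with the slack given to structural adaptation.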

V. SIMULATION SETUP

We use the execution-driven SimpleScalar (v3.0d) simulator [3] for performance evaluation and the Wattch [2] tool to track dynamic energy consumption. The base, non-adaptive architecture is an aggressive 8-wide out-of-order superscalar processor (parameters summarized in Table I).

Adaptive Structures Modeled: We assume a centralized instruction window but with a separate register file. The window is implemented as a circular FIFO without collapsing and is split into 8-entry segments [13]. We clock-gate the empty and ready entries in the wake-up logic [5]. We assume that the issue width of the core is the sum of all active functional units [14]. When a functional unit is de-activated, we also deactivate the corresponding parts of the instruction selection logic, result bus, and wake-up ports of the instruction window.

Adaptation Overheads: To evaluate the best possible performance of each adaptation algorithm, our study does not model the adaptation overheads for DVS. For structural adaptations, the delay overhead due to small additional hardware such as counters, comparators and control logic is likely to be small. We model a delay of 5 cycles to activate all de-activated components.

TABLE I: BASE PROCESSOR CONFIGURATION

Processor Core
- Processor speed: 2 GHz
- RUU size: 128 instructions
- LSQ size: 64 instructions
- Fetch queue size: 32 instructions
- Fetch width: 8 instructions/cycle
- Decode width: 8 instructions/cycle
- Issue width: out-of-order, 8 instructions/cycle
- Commit width: in-order, 8 instructions/cycle
- Functional units: 6 Int, 4 FP, 2 address gen.
- Int FU latencies: 1/3/20 add/mult/div (pipelined)
- FP FU latencies: 2/4 add/mult (pipelined), 12/24 div/sqrt
- Branch predictor: 4KB bimodal, 32-entry RAS, 6 cycle latency

Memory Hierarchy
- L1 data cache: 64K, 2-way (LRU), 32B blocks, 2 cycle latency
- L1 instruction cache: 64K, 2-way (LRU), 32B blocks, 2 cycle latency
- L2 cache: unified, 2M, 4-way (LRU), 64B blocks, 12 cycle latency
- Main memory latency: 200 cycles
- TLBs: 128 entry, fully associative, 30 cycle miss latency

TABLE II: WORKLOAD DESCRIPTION

App. | Type | Frames | Frame Types | Base IPC | Default Deadline
MPEG2-enc | High bit-rate video codec | 100 | I, P, B | 1.6 | 33.9 ms
MPEG2-dec | High bit-rate video codec | 100 | I, P, B | 2.9 | 1.6 ms
H263-enc | Low bit-rate video codec | 100 | I, P | 1.8 | 20.1 ms
MPEG4-dec | Low bit-rate video codec | 100 | I, P | 3.3 | 10.6 ms
MP3 | Audio | 850 | N/A | 3.6 | 123 µs
Mesa | Rendering | 100 | N/A | 3.3 | 11.5 ms

Power Model: We track dynamic energy using Wattch [2] with parameters scaled for a 0.1 micron technology at 1.2 V. We also model overheads for adaptive structures, such as additional bits for each window entry and transistors for gating unused segments. Experiments with DVS assume a continuous frequency range from 500 MHz up to 2 GHz, with voltage values derived from data for the Intel Pentium M [8]. We assume aggressive conditional clock gating (the “cc3” clocking style in Wattch), where a clock-gated resource consumes idle power equal to 20% of its maximum power [17]. We assume that resources that are de-activated do not consume any power.
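The idle-power assumptions above amount to a three-state model per resource, which can be written down directly (a sketch; the state names are ours):

```python
def resource_power(p_max_watts, state):
    """Per-resource power under the study's gating assumptions:
    active resources draw full power, clock-gated resources draw 20%
    of maximum power (Wattch 'cc3'-style gating), and de-activated
    resources draw nothing.
    """
    factors = {"active": 1.0, "clock_gated": 0.20, "deactivated": 0.0}
    return p_max_watts * factors[state]
```

The gap between the 20% clock-gated floor and the zero-power de-activated state is what gives structural adaptation an energy advantage over gating alone.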

Memory Hierarchy: To minimize the effects of cache behavior, we select the L1 cache size based on prior results from the working-set-size analysis of media applications [7]. We scale the input for each application to ensure a hit rate of at least 99% for the L1 data and instruction caches. To emulate a processor used in a typical hand-held device, we set the L2 cache size similar to that of the Pentium M [8].

Workloads: We consider several single-threaded multimedia benchmarks that include high (mpeg2) and low (h263, mpeg4-avc) bit-rate video codecs, audio decoding (mp3) and graphics rendering. These were derived from the Mediabench [12] suite and other online resources. The MPEG2 workloads consist of three types of frames (I, P, B) while H263/MPEG4 consist of a single I-frame followed by P-frames. Mesa does not have a well-defined notion of a frame, but processes one picture or “scene” at a time. We do not use any multimedia instructions (such as MMX) because we do not have a power model for specialized functional units.

We consider two sets of deadlines as described in [6]. The first set, referred to as the default deadline, is the maximum time required by the base processor to execute all the frames. The second set, referred to as the relaxed deadline, is equal to twice the default deadline. For multi-programmed, soft real-time systems, the two deadlines model the effect of other system load when processing a frame. Table II gives the default deadlines for each workload on the base processor. Due to per-frame execution-time variability, we observe an average temporal slack of 24% and 62% for the default and relaxed deadlines, respectively.

VI. PREVIOUS ADAPTATION ALGORITHMS

We evaluate several previously proposed adaptation algorithms, which we describe in this section. These algorithms operate at fixed time intervals (ranging from 256 cycles to a few thousand cycles) during which each adaptive structure is monitored and certain performance statistics are collected [1]. After each interval, these statistics are compared to a set of thresholds to make an adaptation decision.
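The interval-and-threshold pattern can be sketched as follows; the structure names, thresholds and step sizes here are illustrative, not taken from the cited algorithms:

```python
# Hedged sketch of an interval-based, threshold-driven controller.
# All statistic names, thresholds and resize steps are made up.
def control_step(stats, cfg, th):
    """Called once per fixed interval (e.g., every 2048 cycles)."""
    new_cfg = dict(cfg)
    # Shrink the instruction window if few entries were recently useful.
    if stats["active_window_entries"] < th["iw_count"]:
        new_cfg["iw_size"] = max(16, cfg["iw_size"] - 8)
    # De-activate an ALU if utilization stayed below a threshold.
    if stats["alu_utilization"] < th["fu_util"]:
        new_cfg["num_alus"] = max(1, cfg["num_alus"] - 1)
    # Re-activate an ALU if structural hazards exceeded a threshold.
    if stats["fu_hazards"] > th["fu_hazards"]:
        new_cfg["num_alus"] = min(4, cfg["num_alus"] + 1)
    return new_cfg

cfg = {"iw_size": 64, "num_alus": 3}
stats = {"active_window_entries": 20, "alu_utilization": 50, "fu_hazards": 100}
cfg = control_step(stats, cfg, {"iw_count": 32, "fu_util": 8, "fu_hazards": 40})
```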

We implement the instruction window (termed IW) [5] and functional unit (termed FU) [14] adaptation as examples of per-structure, non-coordinated adaptation algorithms. To evaluate coordinated adaptation, we manually tuned the thresholds for each resource using two methods. First, we randomly chose thresholds for each resource from candidate thresholds and then combined them. Second, we took a cross-product of several individual design points and selected random combinations. We selected the best thresholds to be the ones that execute each application with the least energy while missing at most 5% of the deadlines. This algorithm is termed Manual.
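The tuning procedure above can be sketched as a sampled search over the cross-product of candidate thresholds; the candidates and the cost model below are made up, and `evaluate` stands in for a detailed simulation:

```python
import itertools
import random

def select_thresholds(candidates, evaluate, max_miss=0.05, samples=100):
    """Sample combinations from the cross-product of candidate thresholds
    and keep the least-energy point missing at most 5% of deadlines.
    `evaluate` returns (energy, miss_rate) for one threshold point."""
    space = list(itertools.product(*candidates.values()))
    best, best_energy = None, float("inf")
    for combo in random.sample(space, min(samples, len(space))):
        point = dict(zip(candidates.keys(), combo))
        energy, miss = evaluate(point)
        if miss <= max_miss and energy < best_energy:
            best, best_energy = point, energy
    return best

candidates = {"period": [512, 1024], "util": [4, 8]}

def evaluate(p):  # toy model: longer periods miss fewer deadlines
    return p["period"] / 256 + p["util"], (0.0 if p["period"] >= 1024 else 0.1)

best = select_thresholds(candidates, evaluate, samples=10)
```

Because `samples` exceeds the size of this toy space, the search degenerates to exhaustive cross-product exploration, mirroring the second tuning method.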

For system-level adaptation, we focus on DVS, where the goal is to slow down the processor to save energy while guaranteeing frame deadlines. DVS algorithms operate at the frame level and leverage special multimedia application characteristics to predict performance and guide adaptation decisions. We implement the per-frame adaptation algorithm proposed by Hughes et al. [7] and term it CG (for coarse-grained). Finally, we combine CG and Manual together to perform integrated DVS and structural adaptation and denote this algorithm as CG+FG.

Tuning Effort for Threshold-Based Adaptation Algorithms: Both IW and FU require tuning of three parameters each. For IW, these are: (1) the adaptation period for reducing the window size, (2) the number of periods after which to increase the size, and (3) the counter threshold that triggers the adaptation decision. For FU, these are: (1) the adaptation period, (2) the utilization threshold to trigger de-activation, and (3) the number of hazards to trigger activation.

TABLE III
THRESHOLDS FOR PREVIOUS ADAPTATION ALGORITHMS

App         IW (Period/Num/Count)   FU (Period/Util/Hazards)   Manual (IW1/IW2/IW3, FU1/FU2/FU3)   Missed Deadlines % (IW/FU/Manual)
MPEG2-enc   512/16/32               2048/4/40                  512/5/64, 512/4/40                  2/0/51
MPEG2-dec   4096/16/64              8192/16/120                512/8/64, 512/96/120                0/3/35
H263-enc    1024/4/128              2048/8/40                  1024/8/128, 1024/8/40               30/18/88
MPEG4-dec   256/4/8                 512/32/40                  256/8/8, 256/8/40                   0/0/0
MP3         256/4/16                4096/8/40                  4096/8/8, 4096/8/40                 5/5/10
Mesa        1024/8/64               2048/32/40                 1024/5/128, 1024/4/40               0/0/0
(Periods are in cycles; counts are in instructions.)

We manually evaluated a large number of points in the search space, varying the adaptation period from 256 to 8192 cycles, the window upgrade periods from 2 to 16, segment counter values from 8 to 128, FU utilization values from 4 to 64, and hazard values from 40 to 200. Table III shows the best thresholds for the IW, FU and Manual algorithms. The best thresholds differ across applications and also when the techniques are combined. Combining threshold values randomly always resulted in worse performance than exploring the cross-product of design points. The last three columns list the fraction of deadlines missed by each algorithm for the default deadlines. The deadlines missed by CG and CG+FG were less than 3% in all cases.

This data highlights the large design effort required by prevalent threshold-based approaches. Even after application-specific tuning, their behavior is unpredictable and, as we see in the next sub-section, their energy benefits are limited.

VII. RESULTS FOR STRUCTURAL ADAPTATION

This section presents results for structural adaptation using the default deadlines. We first summarize the potential energy benefits of the different algorithms. We then quantify the sources of inefficiency of previous algorithms based on the manner in which they consume the resource and temporal slack. We find that the efficiency of OPT:FG is a result of judicious temporal slack distribution and a comprehensive use of configuration options.

A. Potential Energy Savings

Table IV summarizes the energy savings for each benchmark, expressed as percentage energy savings over Base for each algorithm, averaged over all frames. For reference, we also list the savings of OPT:FG relative to each algorithm, which we term the energy efficiency gap. This data illustrates the significant energy benefits of exploiting intra-frame variability, with mean potential savings of up to 60%. Energy saved is proportional to the amount of intra-frame variability: benchmarks with lower variability, such as MPEG2-dec and MP3-dec, show modest savings (up to 47%), whereas those with high variability, such as MPEG2-enc and Mesa, show significant savings (up to 85%). In general, algorithms that adapt structures together perform well, with CG+FG showing savings within 13% of OPT:FG. Further tuning can likely improve these savings. However, notice that Manual (12% savings) performs worse than even IW (15% savings). This, coupled with the high miss ratio of Manual (Table III), shows that it is difficult to guarantee performance even if thresholds are extensively hand-tuned for individual applications.

Table IV also quantifies the amount of temporal slack consumed by each algorithm as the slowdown over Base, averaged across all frames. In general, structural adaptation is unable to consume large amounts of temporal slack, which indicates that most potential savings result from exploiting intra-frame resource slack. This has significant implications for coarse-grained algorithms: they can exploit almost the entire temporal slack to save as much energy as possible (detailed in Section VIII).

In what follows, we use the solution of OPT:FG to quantify the underlying sources of inefficiency when constraining adaptation in the spatial and temporal dimensions. We find that the net energy efficiency of OPT:FG results from (1) using all the available configuration space and reconfiguring in moderate to large step sizes between neighboring intervals, and (2) a strategic distribution of the available temporal slack within the frame.

Fig. 1. Magnitude of Configuration Changes. (Box plots of the relative magnitude of parameter change for IW, FU, MAN, CG, CG+FG and OPT; panels: MPEG2-enc, Mesa-Texgen, H263-enc, H264-dec.)

B. Configurations Used

TABLE IV
ENERGY SAVINGS (%), ENERGY EFFICIENCY GAP (%) AND SLOWDOWN FOR DEFAULT DEADLINES

App         Savings (% Base energy)      OPT:FG savings relative to   Slowdown (x Base exec. time)
            IW/FU/MAN/CG/CG+FG/OPT:FG    IW/FU/MAN/CG/CG+FG           MAN/CG/CG+FG/OPT:FG
MPEG2-enc   35/5/42/41/51/71             56/70/51/52/41               1.05/1.06/1.1/1.04
MPEG2-dec   21/-6/17/25/40/47            34/50/18/29/13               1.1/1.25/1.2/1.0
H263-enc    25/-3/31/30/41/52            36/53/31/32/20               1.1/1.1/1.2/1.1
MPEG4-dec   3/-11/-1.2/-9/23/64          63/68/64/67/54               1.21/1.1/1.26/1.1
MP3-dec     3/-20/-19/-15/10/42          41/52/51/50/36               1.0/1.0/1.0/1.0
Mesa        3/-1/3/36/35/85              84/85/85/77/77.5             1.0/1.0/1.14/1.1
Mean        15/-6/12/18/33/60            52/63/50/50/40               1.07/1.09/1.15/1.05

Fig. 2. Per-epoch Energy×Delay Trends. (Energy×Delay improvement (% Base) versus instructions (M) for an MPEG2-enc I-frame and an H263-enc P-frame; curves: IW, FU, Manual, CG, CG+FG, Optimal.)

Figure 1 plots the magnitude of change in parameter values across intervals in terms of step size [11]. The step size is the difference in configuration parameters between successive intervals expressed as a fraction of the total number of configurations. For example, we have 15 choices for instruction window size (16 to 128 entries in steps of 8). If the window size changes from 32 to 64 entries in successive intervals, the relative step size is 0.13. If a parameter changes from its minimum to maximum value, then the step size equals 1. We combine the relative step sizes of each parameter to quantify the change across the configuration space. Figure 1 uses a compact box-plot to illustrate this information.
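The step-size metric can be sketched as follows. How per-parameter fractions are combined is our assumption (averaging over parameters, which reproduces the 0.13 of the example when the other parameter is unchanged); the functional-unit choices are hypothetical:

```python
# Sketch of the relative step-size metric. The combination across
# parameters (an average) is assumed, not taken from the paper.
def relative_step_size(prev, curr, configs):
    steps = []
    for name, choices in configs.items():
        delta = abs(choices.index(curr[name]) - choices.index(prev[name]))
        steps.append(delta / len(choices))  # fraction of configurations
    return sum(steps) / len(steps)

iw_choices = list(range(16, 129, 8))  # the 15 instruction-window sizes
fu_choices = [1, 2, 3, 4]             # hypothetical ALU counts
configs = {"iw": iw_choices, "fu": fu_choices}

# Window grows from 32 to 64 entries; ALU count is unchanged.
step = relative_step_size({"iw": 32, "fu": 2}, {"iw": 64, "fu": 2}, configs)
# (4/15 + 0/4) / 2 ~= 0.13
```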

We observe that OPT:FG exercises parameters spanning the entire configuration space, with a median step size across all workloads of 0.07, a standard deviation of 14% and a maximum of 0.98. In contrast, CG+FG uses a median step size of only 0.01, with a deviation of 2% and a maximum of 0.6. This data suggests that OPT:FG performs relatively gradual parameter changes across intervals, but uses all the available configurations to achieve energy savings. We also calculated the percentage of intervals that use each configuration (not shown due to lack of space) and observe that constraining spatial adaptivity creates bottlenecks by reducing the number of configurations that can be exercised.

In summary, most workloads need to exercise a large fraction of the configuration space in relatively modest step sizes. This suggests diverse requirements of computational resources within each workload, and a need to better match resource sizes to execution characteristics. Comparatively, IW and FU are limited to adapting only a single resource, while CG and CG+FG create bottlenecks by constraining the temporal granularity of adaptation.

C. Temporal Slack Used

OPT:FG strategically chooses the temporal slack to be consumed at each epoch based on the energy-performance tradeoffs across the entire frame. This intelligent “spreading” of temporal slack across the frame guides the selection of per-epoch configurations and ultimately the net energy efficiency. We quantify this behavior using the per-epoch energy×delay (ED) for each control algorithm. Figure 2 plots the per-epoch ED trends for two workloads. For OPT:FG, ED remains almost constant across the entire frame, suggesting that, as per-epoch performance requirements change, OPT:FG selects the configurations that provide the best tradeoff between energy and delay, resulting in high per-frame energy efficiency. The result is a strategic distribution of per-epoch delay which, coupled with spatial adaptation, allows OPT:FG to exploit the differing computation requirements of each epoch in an optimal manner.
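The per-epoch selection OPT:FG performs can be pictured as a multiple-choice knapsack [10]: pick one configuration per epoch so that total delay fits the frame deadline and total energy is minimized. A minimal dynamic-programming sketch with made-up (energy, delay) options and integer delay units — not the paper's actual solver:

```python
def min_energy_schedule(epochs, delay_budget):
    """epochs: for each epoch, a list of (energy, delay) options with
    integer delay units. Returns the least total energy over all
    per-epoch choices whose total delay fits the budget."""
    INF = float("inf")
    dp = [INF] * (delay_budget + 1)  # dp[d]: least energy at total delay d
    dp[0] = 0.0
    for options in epochs:
        new = [INF] * (delay_budget + 1)
        for d, e in enumerate(dp):
            if e == INF:
                continue
            for energy, delay in options:
                if d + delay <= delay_budget:
                    new[d + delay] = min(new[d + delay], e + energy)
        dp = new
    return min(dp)  # INF if the deadline is infeasible

# Two epochs, each with a fast/high-energy and a slow/low-energy option.
epochs = [[(5.0, 1), (2.0, 3)], [(4.0, 1), (1.0, 4)]]
best = min_energy_schedule(epochs, delay_budget=5)  # -> 6.0
```

Loosening the delay budget lets more epochs take the slow/low-energy option, which is exactly the "spreading of temporal slack" behavior described above.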

VIII. RESULTS FOR INTEGRATED STRUCTURAL AND SYSTEM-LEVEL ADAPTATION

This section summarizes the additional energy efficiency when applying per-epoch structural adaptation and per-frame DVS synergistically. The results reported here use an algorithm termed BestDVS as the baseline. BestDVS runs the processor at the lowest possible voltage/frequency that still makes the deadline, without any structural adaptation. Figures 3(a) and 3(b) quantify the efficiency for the default and relaxed deadlines, respectively, expressed as percentage savings over BestDVS. CG and CG+FG show cases where the DVS and structural adaptations are decoupled: CG selects the voltage/frequency and structure sizes once at the start of the frame, while CG+FG additionally performs intra-frame structural adaptation. The data suggests that, for both sets of deadlines, integrating DVS with structural adaptation increases efficiency modestly, with averages of 1.4x for default and 1.18x for relaxed deadlines, respectively. Decoupling DVS and structural adaptation generally leads to fewer savings than DVS alone, by up to 36% (in the case of MPEG4-dec). For default deadlines, we observe that architectural adaptation alone can provide more energy savings than even DVS (up to 1.5x in the case of MPEG2-enc). The additional savings due to structural adaptation shrink for relaxed deadlines, since voltage can be ramped down more aggressively. For reference, Figure 3(c) illustrates the different solutions to OPT:CG+FG using our temporal slack splitting approach when running MPEG2-enc for a single frame. The plot indicates that a split of 18% (18% of total slack consumed by structural adaptation and

rest by DVS) is optimal for this frame to make the frame deadline while saving the most energy.

Fig. 3. Potential Energy Efficiency for Integrated structural adaptation and DVS. (Panels (a) and (b): energy savings relative to BestDVS for BestDVS, OPT:CG+FG, CG, CG+FG and OPT:FG under the default and relaxed deadlines, respectively. Panel (c): energy (% Base) versus temporal slack split (% total slack) for MPEG2-enc, marking the frame deadline and the additional energy savings due to structural adaptation.)
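The sweep underlying Figure 3(c) can be sketched as follows; the energy model is entirely illustrative (structural savings that saturate once resource slack is exhausted), not fitted to our data:

```python
# Toy version of the slack-split sweep: try every split of the total
# temporal slack between structural adaptation and DVS, and keep the
# split with the least energy. The coefficients below are invented.
def energy_at_split(s):
    structural = 0.8 * min(s, 0.18)   # structural savings saturate
    dvs = 0.55 * (1.0 - s)            # DVS savings from remaining slack
    return 1.0 - structural - dvs     # fraction of Base energy

splits = [i / 100 for i in range(101)]  # 0%, 1%, ..., 100% to structural
best_split = min(splits, key=energy_at_split)
```

Under this toy model the minimum lands at the saturation point of the structural savings; in the real system the optimum depends on the per-frame energy/delay tradeoffs of both mechanisms.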

In general, these results indicate that, for the adaptive structures, voltage range and deadlines that we consider, DVS contributes most of the energy savings and structural adaptation only modestly increases them. However, the results for the default deadline lead us to believe that as more structures are added to the adaptive space (for example, adapting pipeline depth, the memory hierarchy, or branch predictors), as voltage scaling margins decrease, and as more load is added to the system, the efficiency gains from combined coarse-grained and fine-grained adaptation would likely increase.

Summary: Our study reveals the significant energy efficiency that can result from fine-grained temporal, coordinated spatial adaptation and integrated structural and system-level adaptations. From a hardware perspective, this indicates that comprehensively adaptive hardware will be required to realize these benefits. From the control algorithm perspective, our findings challenge previous threshold-based algorithms, which constrain spatial adaptivity and thereby create bottlenecks. Finally, we observe that fine-grained temporal adaptivity is better suited to localize energy costs by expending power only during epochs that actually need it, thus reducing waste and increasing the net efficiency.

To make this study more extensive, recent advances in statistical-inference-based techniques [4], [9], [11] can be leveraged. These techniques perform efficient design space exploration by using linear and/or non-linear predictive models to infer processor power/performance from fewer detailed simulations. It will also be interesting to analyze adaptation in the presence of SMT and/or CMP configurations, which are becoming common even in mobile devices.

IX. CONCLUSION

We have presented a detailed analysis of fine-grained temporal and coordinated spatial micro-architectural adaptation by casting adaptation as a combinatorial optimization problem. We also analyze the problem of integrating coarse-grained adaptation with architectural adaptation using a novel optimization model. Solutions to these models have allowed an oracle-based assessment of the potential energy efficiency benefits and an insight into the behavior of ideal control algorithms. The solutions reveal significant efficiency benefits resulting from a judicious use of the available temporal slack and a comprehensive use of the adaptive space. A comparison with several previous algorithms has demonstrated the impracticability of threshold-based algorithms and the loss in efficiency from constraining adaptation in the temporal dimension, the spatial dimension, or both. Although our problem formulations are conceptually simple, the analysis is much more complex due to the high computational cost and multi-dimensionality of the problem. Given the significant potential benefits, our next step is to analyze control algorithm implementation options in terms of their complexity and effectiveness.

REFERENCES

[1] D. H. Albonesi et al. Dynamically tuning processor resources with adaptive processing. IEEE Computer, 36(12):49–58, 2003.

[2] D. Brooks et al. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA, pages 83–94, 2000.

[3] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Tech. Report CS-TR-1996-1308.

[4] S. Eyerman, L. Eeckhout, and K. D. Bosschere. Efficient design space exploration of high performance embedded out-of-order processors. In DATE, pages 351–356, 2006.

[5] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In ISCA, pages 230–239, 2001.

[6] C. J. Hughes and S. V. Adve. A formal approach to frequent energy adaptations for multimedia applications. In ISCA, pages 138–149, 2004.

[7] C. J. Hughes, J. Srinivasan, and S. V. Adve. Saving energy with architectural and frequency adaptations for multimedia applications. In MICRO, pages 250–261, 2001.

[8] Intel Corporation. Intel Pentium M Processor Datasheet.

[9] E. Ipek et al. Efficiently exploring architectural design spaces via predictive modeling. In ASPLOS, 2006.

[10] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.

[11] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.

[12] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In MICRO, pages 330–335, 1997.

[13] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In MICRO, pages 90–101, 2001.

[14] R. Sasanka, C. J. Hughes, and S. V. Adve. Joint local and global hardware adaptations for energy. In ASPLOS, pages 144–155, 2002.

[15] O. S. Unsal and I. Koren. System-level power-aware design techniques in real-time systems. Proceedings of the IEEE, 91(7), July 2003.

[16] M. Weiser, B. B. Welch, A. J. Demers, and S. Shenker. Scheduling for reduced CPU energy. In OSDI, pages 13–23, 1994.

[17] Y.-K. Chen et al. Media applications on Hyper-Threading Technology. Intel Technology Journal, 6(1), February 2003.
