Quantifying the Energy Efficiency of Coordinated Micro-Architectural
Adaptation for Multimedia Workloads
Shrirang Yardi and Michael S. Hsiao
The Bradley Department of Electrical and Computer Engineering
Virginia Tech, Blacksburg, VA 24061. USA.
{yardi,mhsiao}@vt.edu
Abstract—Adaptive micro-architectures aim to achieve greater energy efficiency by dynamically allocating computing resources to match the workload performance. The decisions of when to adapt (temporal dimension) and what to adapt (spatial dimension) are taken by a control algorithm based on an analysis of the power/performance tradeoffs in both dimensions. We perform a rigorous analysis to quantify the energy efficiency limits of fine-grained temporal and coordinated spatial adaptation of multiple architectural resources by casting the control algorithm as a constrained optimization problem. Our study indicates that coordinated adaptation can potentially improve energy efficiency by up to 60% as compared to static architectures and by up to 33% over algorithms that adapt resources in isolation. We also analyze synergistic application of coarse and fine grained adaptation and find modest improvements of up to 18% over optimized dynamic voltage/frequency scaling. Finally, we analyze several previous control algorithms to understand the underlying reasons for their inefficiency.
I. INTRODUCTION
As transistor densities increase rapidly with each new
process technology and supply voltage decreases relatively
slowly, microprocessor power consumption has become a
critical operational constraint. Researchers have mainly used
two approaches to reduce microprocessor power. The first is
intelligent hardware design with static power saving tech-
niques (e.g., clock/power gating of unused components).
The second is to dynamically allocate just enough resources
to match the performance requirements of the application.
These adaptive approaches aim to achieve greater energy-
efficiency by exploiting the variability or execution slack
which arises due to the diverse execution characteristics of
different applications running on static hardware. Examples
of such methods include adaptation of micro-architectural
structures [1] and system-level adaptation such as dynamic
voltage/frequency scaling (DVS) [16], among others.
Adaptive techniques typically exploit two types of exe-
cution slack to save energy: temporal slack which can be
exploited by slowing down the processor and resource slack
which can be exploited by re-sizing or de-activating parts of
the processor. The key to adaptation is the control algorithm
that decides when to adapt and what to adapt with the goal of
achieving energy-efficient operation [7]. Ideally, to maximize
energy efficiency, we would like to adapt frequently (tempo-
rally fine-grained) over an adaptive space of many resources
(spatially coordinated). This scenario of performing fine-
grained temporal and coordinated spatial adaptation is a
complex multi-dimensional optimization problem. To realize
the full potential of such adaptation, it is important to
perform a rigorous assessment of its benefits and costs.
This paper performs a detailed, off-line, quantitative anal-
ysis of the energy savings when adapting multiple resources
within a high-performance general-purpose microprocessor
running multimedia workloads. In this context, our goals
are to perform a comprehensive exploration of the adaptive
design space, quantify the potential efficiency benefits of
fine-grained and coordinated adaptation and identify the
limitations of existing techniques. If significant gains are
found, this can motivate further analysis and design of more
efficient adaptive hardware substrates and control algorithms.
A. Motivation
The following factors have motivated our study:
1) A considerable amount of research has been devoted to
the design of control algorithms for micro-architectural
adaptation (see Albonesi, et al. [1] for a survey).
However, due to the challenging multi-dimensionality
of the problem, prior techniques are largely ad-hoc
and have often constrained their analysis in either the
temporal or spatial dimensions. Temporal constraints
limit micro-architectural responsiveness to workload
heterogeneity and spatial constraints fail to account
for interactions between adaptive structures. Only a
rigorous and comprehensive exploration of the adaptive
design space can provide an accurate idea of the
potential efficiency benefits.
2) We focus on multimedia applications because, unlike
throughput-oriented workloads (such as SPEC), these
applications present a unique set of issues that warrant
their detailed study. First, these applications represent
a large (and sometimes the only) chunk of workloads
for the increasingly power-hungry mobile devices.
Second, these applications have markedly different
execution characteristics than throughput workloads so
that several multimedia-specific adaptation techniques
have been proposed [6], [7], [14]. It is important to
analyze such application-specific control algorithms
to determine the underlying reasons for their energy
(in)efficiency. However, our analysis framework is also
applicable to other workload domains.
978-1-4244-2658-4/08/$25.00 ©2008 IEEE 583
3) For multimedia workloads, system-level adaptations
(such as DVS) add another dimension to the
energy-performance tradeoff space by changing the
relative impact of structural adaptation on the overall
energy efficiency. This added complexity has hindered
an integrated analysis of structural and system-level
adaptation. As a result, the control algorithms at the
two levels have been largely orthogonal. One of our
goals is to analyze if these can be applied in concert
and, if so, quantify the potential efficiency gains.
In summary, we believe that an unconstrained, rigorous
analysis of micro-architectural adaptivity is critical to over-
come the limitations of previous ad-hoc approaches. Despite
its high computational cost, such comprehensive exploration
is crucial to get an accurate idea about the potential benefits
of coordinated adaptation and provide insights for designing
practical and powerful control algorithms.
B. Contributions
The following summarizes our main contributions:
1. We cast the problem of fine-grained, coordinated struc-
tural adaptation as a constrained optimization problem. In
particular, we consider adaptation at a temporal granularity
of every 1024 instructions across 25920 micro-architectural
configurations spanning a design space of greater than
2^57 points. We also consider the problem of integrating
fine-grained structural and coarse-grained system adaptation
(such as DVS) to identify their relative contributions to
overall energy efficiency. The solutions to these models allow
us to perform, for the first time, a comprehensive analysis
of benefits from fine-grained temporal, coordinated structural
and integrated structural and system-level adaptation.
2. We apply this framework to assess the benefits of
varying degrees of temporal and spatial adaptivity. We find
significant energy efficiency gains of up to 85% (60% on
average) over a base, non-adaptive processor without DVS
suggesting significant potential for fine-grained structural
adaptation. We observe that these gains are a result of a
comprehensive use of the available configurations suggesting
that interaction between adaptive structures is an important
factor in realizing these efficiency benefits.
3. We implemented several previously proposed ad-hoc
control algorithms and analyzed the underlying reasons for
their inefficiency. We observed that the best previous algo-
rithm performs 33% worse on average than the optimal due
to inefficient exploitation of available slack. We also find
that the energy savings and the performance impact due
to these algorithms are unpredictable even after extensive,
application-specific manual tuning.
4. We observe that the amount of temporal slack and
its distribution across adaptive intervals are the keys to the
energy efficiency achieved by the optimal algorithm. This
motivates the design of control algorithms that use temporal
slack as a first-class constraint to guide the adaptation
decisions. We discuss such implications for the design of
more efficient adaptive hardware and control algorithms.
II. METHODOLOGY OVERVIEW
A. Definition of Key Terms
Most multimedia applications are real-time and need to
process discrete units of data, termed as a frame. The
processing of each frame has to be completed within a certain
time, termed as the deadline. For a given architecture, the
difference between the deadline and actual execution time is
the temporal slack for the frame and can vary from frame
to frame. Further, there can be significant variation in intra-
frame resource utilization which is termed as resource slack
and typically varies over a few hundred cycles. We define
an epoch as the temporal granularity at which a structural
adaptation is invoked. We consider two classes of adaptations
that can be applied at different time scales. First, dynamic
voltage/frequency scaling (DVS) is applied at the granularity
of an entire frame. Second, micro-architectural adaptation,
is applied at varying epoch sizes ranging from 1024 (1K)
instructions to 1M instructions. Finally, we term our non-
adaptive processor as the Base architecture.
B. Modeling Methodology
For our off-line analysis, we model adaptation over the
lifetime of a single frame as an optimization problem. We
chose a frame as the unit for our model based on prior
work [7] which shows that several multimedia workloads
exhibit significant per-frame execution time variability. Fur-
ther, temporal slack is defined in terms of a single frame
since each frame (and not the entire application) is associated
with a deadline. Finally, modeling global adaptation across
the entire application is infeasible since the number of frames
can be unbounded.
We propose two models: the first captures structural adap-
tation (Section III) within a frame and the other (Section IV)
models DVS at the frame granularity in addition to structural
adaptation at the epoch granularity. We obtain the infor-
mation required to solve each problem using cycle-accurate
simulations. The solutions provide the per-epoch optimal re-
source configurations and per-frame voltage/frequency values
to execute the frame with minimum energy. This is done for
each frame in the workload and the optimal configurations
are fed back to the simulator to obtain actual energy and
delay values.
C. Adaptivity Analysis
We quantify efficiency for co-ordinated spatial adaptation
by reconfiguring several structures such as the instruction
window size, load store queue size, number of ALUs, number
of FPUs (floating-point units) and the issue width. We
study the energy efficiency trends for different degrees of
temporal adaptivity by: (1) varying the epoch-size, and (2)
by varying the amount of temporal slack available using
two different sets of frame deadlines. We also implement
several previously proposed adaptation techniques to assess
limitations to efficiency by constraining the adaptation in
the temporal and/or spatial dimensions. Finally, we assess
the efficiency gains when fine-grained structural and DVS
are applied together.
Our analysis uses the average energy-per-instruction (EPI),
the energy-delay product, and the number of missed dead-
lines as the primary metrics of efficiency. In addition, we
quantify the resource slack consumed by each technique by
the diversity of configurations used and quantify the temporal
slack consumed with the per-epoch energy×delay product.
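As a concrete illustration of these metrics, the following helpers (the names are ours, not the paper's) compute EPI and the energy-delay product:

```python
# Illustrative helpers (names are ours) for the efficiency metrics used in
# this analysis: energy-per-instruction (EPI) and the energy-delay product.

def epi(total_energy_j, instruction_count):
    # Average energy spent per committed instruction (joules/instruction).
    return total_energy_j / instruction_count

def energy_delay_product(total_energy_j, delay_s):
    # Lower is better: penalizes techniques that save energy by running slowly.
    return total_energy_j * delay_s
```

For example, a frame that commits 2M instructions using 0.5 J over 10 ms has an EPI of 250 nJ and an EDP of 5 mJ·s.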
D. Assumptions
For the purpose of this analysis, we assume a soft real-
time, single-core system running single-threaded multimedia
workloads. We assume that the CPU scheduler uses a single
frame as the scheduling unit and that there is no time bor-
rowing between frames. However, we account for the effect
of varying system load in a multi-programmed environment
by considering two sets of deadlines (further explained in
Section V). Finally, we assume that the hardware recognizes
the frame boundary, frame type and the deadline. We believe
that these assumptions are reasonable for devices running
multi-programmed (not multi-threaded) workloads. Lastly,
we account for structural adaptation overheads but assume
zero overhead for performing per-frame DVS.
III. MODELING STRUCTURAL ADAPTATION
A. Problem Formulation
Our approach for modeling fine-grained structural adap-
tations is based on previous work by Hughes, et al. [6]
and is described as follows. Within a frame, each epoch,
i, can be run with a different architectural configuration,
C_j, where j ∈ Arch, the set of all possible configurations.
Each configuration has two attributes: a reward, which is the
energy saved by using C_j instead of Base, and a cost, which
is the performance degradation due to C_j for that epoch.
The goal is to determine a single configuration, C_j, for each
epoch, i, such that these configurations together result in
the most energy saved while consuming no more than the
available temporal slack for the frame. We characterize the
reward in terms of energy-per-instruction (EPI) saved and
the cost in terms of the number of additional cycles, both
vs. Base, to execute each epoch. Formally, we can state the
problem as:
maximize  ∑_{i∈N} ∑_{j∈Arch} E_ij · C_ij                           (1)

subject to:

∑_{i∈N} ∑_{j∈Arch} S_ij · (C_ij · A(j)) ≤ S_frame,                 (2)

∀i ∈ N:  ∑_{j∈Arch} C_ij = 1,                                      (3)

C_ij = 1 if config j is selected for interval i, 0 otherwise       (4)
Above, E_ij and S_ij are the energy-per-instruction (EPI) saved
and the cycles required when using configuration j vs. Base
for epoch i. S_frame is the available slack for the frame and N
is the total number of epochs in the frame. A is a map from
the value of C_ij to the actual configuration to be used. Eqn.
3 guarantees that exactly one configuration is selected for
each epoch by using the decision variable C_ij. The products
in Eqns. 1 and 2 define the complete energy-performance
tradeoff space for the configurations in Arch. The optimal
solution is a vector C* = (C*_1, ..., C*_N) of configurations, one
per epoch, that provides the maximum energy savings. This
problem is an instance of the well-known multiple-choice
knapsack problem (MCKP) and is NP-hard [10]. Note that,
since the temporal slack and number of instructions vary
from frame-to-frame, we have to define one such problem
for each frame. We term this problem as OPT:FG.
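Since OPT:FG is a multiple-choice knapsack, small instances can be solved exactly by dynamic programming over the slack budget. The sketch below is our own illustration (the E and S tables are hypothetical; the paper does not specify its solver), assuming Base appears as a zero-reward, zero-cost configuration so a feasible solution always exists:

```python
# Exact MCKP solver for a small OPT:FG instance via dynamic programming.
# E[i][j]: EPI saved and S[i][j]: extra cycles when epoch i runs config j.
# Exactly one configuration per epoch; total extra cycles <= slack.

def opt_fg(E, S, slack):
    NEG = float("-inf")
    best = [0.0] + [NEG] * slack       # best[c]: max saving using exactly c cycles
    picks = []                         # back-pointer table, one per epoch
    for Ei, Si in zip(E, S):
        nxt = [NEG] * (slack + 1)
        pick = [-1] * (slack + 1)
        for c in range(slack + 1):
            if best[c] == NEG:
                continue
            for j, (e, s) in enumerate(zip(Ei, Si)):
                if c + s <= slack and best[c] + e > nxt[c + s]:
                    nxt[c + s] = best[c] + e
                    pick[c + s] = j
        best, picks = nxt, picks + [pick]
    c = max(range(slack + 1), key=lambda k: best[k])
    total, chosen = best[c], []
    for i in range(len(E) - 1, -1, -1):  # recover C* by backtracking
        j = picks[i][c]
        chosen.append(j)
        c -= S[i][j]
    return total, chosen[::-1]
```

For two epochs with configs (Base, small, tiny) where E = [[0, 5, 8], [0, 4, 9]], S = [[0, 1, 3], [0, 2, 5]] and slack = 4, the solver returns a saving of 9 with configuration vector [1, 1], illustrating how the slack is distributed across epochs rather than spent greedily on one.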
B. Solving OPT:FG
To solve the optimization problem, we need the values of
Ei j and Si j for all frames, all configurations and all epochs.
We obtain these values using cycle-accurate instruction-level
simulation as follows.
We reconfigure the instruction window (IW), load store
queue (LSQ), number of integer ALUs, the number of FPUs
(floating-point units) and the issue width, giving |Arch| =
25920 configurations. To reduce the number of simulations
and to maintain a balanced design, we adapt IW and LSQ
together and the ALUs and the issue width together. More
details about the different adaptive units are provided in
Section V. With these constraints, we need to perform 360
simulations for each frame to obtain the values for Ei j and
Si j required to solve the problem. For each application, we
profile several frames for all the configurations.
An intuitive idea about the solution can be given as fol-
lows. For each epoch, the most energy-efficient configuration
is the one that maximizes the tradeoff between EPI saved
and the cycles used. In other words, since each configuration
uses some part of the available temporal slack, C∗ provides
the best way to “distribute” the slack across the frame
by exploiting synergistic interactions between the adaptive
resources. Finally, to obtain the actual dynamic energy, we
simulate each frame using its optimal configurations.
IV. INTEGRATED STRUCTURAL AND SYSTEM
ADAPTATION
In the context of soft real-time systems, DVS has long
been applied as an effective frame-level technique [15],
where the processor voltage/frequency are scaled to save
energy while guaranteeing that the deadline is met. One
of our goals is to understand the interaction between these
adaptations and quantify the potential efficiency benefits
by applying them synergistically. As a simple example of
interaction between the two algorithms, an aggressive DVS
setting may allow the fine-grained algorithm to exercise a
wider range of configurations and conversely, a less ag-
gressive setting may leave little potential to exploit intra-
frame variability. This section describes our formulation to
determine the optimal way to apply these adaptations.
A. Problem Formulation
The objective is to select a single frequency/voltage for
the frame and a single configuration for each epoch within
a frame such that, together, they maximize the EPI savings
while consuming no more than the available slack for the
frame. Eqns. 5-11 formally state the problem. For Eqns. 5-
11, Arch, N, C_ij and A have the same definitions as for
OPT:FG. V is the set of all possible voltage values (possibly
unbounded for a system supporting continuous DVS). D_k is
a binary variable that is set to 1 if voltage V(D_k) is selected
for the frame, where V maps k to a unique voltage/frequency
pair. Eqns. 7 and 8 guarantee that a single voltage value
is selected for the entire frame and a single configuration
is selected for each epoch. Eqn. 9 shows that, for all k ∈ V,
S_frame,k depends on both V(D_k) and on the configuration set
C* = (C*_1, ..., C*_N) selected for the frame. We denote this
problem as OPT:CG+FG.
maximize  ∑_{k∈V} ∑_{i∈N} ∑_{j∈Arch} E_kij · D_k · C_ij                  (5)

subject to:

∀k ∈ V:  ∑_{i∈N} ∑_{j∈Arch} S_ij · C_ij ≤ S_frame,k                      (6)

∑_{k∈V} D_k = 1,                                                         (7)

∀i ∈ N:  ∑_{j∈Arch} C_ij = 1                                             (8)

where,

∀k ∈ V:  S_frame,k = D_k · F(V(D_k), ∑_{i∈N} ∑_{j∈Arch} C_ij · A(j)),    (9)

∀k ∈ V:  D_k = 1 if voltage k is selected for the frame, 0 otherwise     (10)

C_ij = 1 if config j is selected for interval i, 0 otherwise             (11)
The solution to this problem provides, for each frame, (1)
a single voltage/frequency value, V_CG, and (2) the optimal
configuration set C*_FG, which together save the most energy
while consuming no more than the available slack for the frame.
Intuitively, by selecting the best voltage and configuration
set, the solution provides the best “split” of the available
slack between the two control algorithms.
B. Solving OPT:CG+FG
Since S_frame now depends on both the voltage and the con-
figuration set, OPT:CG+FG is a mixed-integer, non-linear
problem (MI-NLP) and is infeasible even for industrial
solvers. One naive heuristic to solve it would be to discretize
[0, V], effectively decoupling the voltage and configuration
selection. This is similar to solving OPT:FG repeatedly
with E_ij values scaled for each discrete voltage value. We
wish to avoid such decoupling to consider the interaction
between these adaptations and use the following heuristic to
accomplish this.
We use the amount of temporal slack as a knob to control
the relative aggressiveness (and hence energy efficiency) of
the CG and FG parts of OPT:CG+FG as follows. For the
candidate frame, let T_base be the execution time for Base
and S_max be the maximum available temporal slack. Consider
the case when only structural adaptation is performed for
some slack S_FG ≤ S_max. This is accomplished by solving
OPT:FG with S_frame = S_FG to obtain the minimum energy
configuration set, C*_FG. Let T_FG be the required execution
time and IPC_FG be the average IPC. It follows that
T_FG = T_base + S_FG.
Next, consider the case that DVS is applied in addition
to structural adaptation to consume the remaining slack,
S_CG = S_max − S_FG. It follows that T_CG = T_FG + S_CG, i.e.,
T_CG = T_base + (S_FG + S_CG). The minimum frequency required
to consume S_CG is then given by f_CG = ICount / (T_CG × IPC_FG) [7].
The goal of OPT:CG+FG then is to determine the best
“split” of S_max into S_FG and S_CG such that energy savings for
the frame are maximized. We discretize the interval [0, S_max]
into several candidate splits; we use values of 1% to 100%
of S_max in steps of 1%. For each split, we calculate S_FG
and S_CG, solve OPT:FG to obtain C*_FG, determine T_FG and
IPC_FG, and determine f_CG. Finally, we simulate the frame
using these values to obtain the split that gives the best energy
savings. In summary,
for each frame do
    T_base = frame execution time on Base
    S_max = deadline − T_base
    ICount = instruction count for this frame
    for split = 0.01 to 1 in steps of 0.01 do
        S_FG = split × S_max;  S_CG = S_max × (1 − split)
        Solve OPT:FG with S_frame = S_FG to get C*_FG, IPC_FG, T_FG
        T_CG = T_FG + S_CG
        f_CG = ICount / (T_CG × IPC_FG)
        EPI_split = EPI at f_CG, V_CG with C*_FG
    end
    Lowest EPI_split gives the best V_CG, C*_FG
end
Algorithm 1: Slack splitting heuristic to solve OPT:CG+FG
The main advantage of the slack splitting approach over
the naive heuristic is that it allows a wider choice in selection
of voltage values which makes the solution closer to the
theoretical optimum. A discrete voltage would limit the
voltage choices and consequently the potential benefits.
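The slack-splitting loop of Algorithm 1 can be sketched in code as follows; `solve_opt_fg` and `simulate_epi` are hypothetical stand-ins for the OPT:FG solver and the cycle-accurate simulator, and all names here are our own:

```python
# Sketch of the slack-splitting heuristic (Algorithm 1). The callables are
# assumed stand-ins: solve_opt_fg(s_fg) -> (C*_FG, IPC_FG, T_FG) and
# simulate_epi(f_cg, configs) -> measured EPI at that frequency.

def best_split(t_base, deadline, icount, solve_opt_fg, simulate_epi, steps=100):
    s_max = deadline - t_base            # maximum available temporal slack
    best = None
    for k in range(1, steps + 1):
        split = k / steps
        s_fg = split * s_max                     # slack given to structural adaptation
        cfg, ipc_fg, t_fg = solve_opt_fg(s_fg)   # minimum-energy configs at this slack
        t_cg = t_fg + s_max * (1 - split)        # DVS stretches over the remainder
        f_cg = icount / (t_cg * ipc_fg)          # lowest frequency meeting the deadline
        epi = simulate_epi(f_cg, cfg)
        if best is None or epi < best[0]:
            best = (epi, split, f_cg, cfg)
    return best                                  # (EPI, split, f_CG, C*_FG)
```

Each candidate split fixes how much of S_max the structural adaptation may consume; the DVS frequency is then derived so the frame still finishes exactly at the deadline.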
V. SIMULATION SETUP
We use the execution-driven SimpleScalar (v3.0d) simula-
tor [3] for performance evaluation and the Wattch [2] tool to
track dynamic energy consumption. The base, non-adaptive
architecture is an aggressive 8-wide out-of-order superscalar
processor (parameters summarized in Table I).
Adaptive Structures Modeled: We assume a centralized
instruction window but with a separate register file. The
window is implemented as a circular FIFO without collaps-
ing and is split into 8-entry segments [13]. We clock-gate
the empty and ready entries in the wake-up logic [5]. We
assume that the issue width of the core is the sum of all
active functional units [14]. When a functional unit is de-
activated, we also deactivate the corresponding parts of the
instruction selection logic, result bus and wake-up ports of
the instruction window.
Adaptation Overheads: To evaluate the best possible per-
formance of each adaptation algorithm, our study does not
model the adaptation overheads for DVS. For structural
adaptations, the delay overhead due to small additional
TABLE I
BASE PROCESSOR CONFIGURATION
Parameter              Value

Processor Core
Processor speed        2 GHz
RUU Size               128 instructions
LSQ Size               64 instructions
Fetch Queue Size       32 instructions
Fetch Width            8 instructions/cycle
Decode Width           8 instructions/cycle
Issue Width            8 instructions/cycle, out-of-order
Commit Width           8 instructions/cycle, in-order
Functional Units       6 Int, 4 FP, 2 address gen.
Int FU Latencies       1/3/20 add/mult/div (pipelined)
FP FU Latencies        2/4 add/mult (pipelined), 12/24 div/sqrt
Branch Predictor       4KB bimodal, 32-entry RAS, 6 cycle latency

Memory Hierarchy
L1 data cache          64K, 2-way (LRU), 32B blocks, 2 cycle latency
L1 instruction cache   64K, 2-way (LRU), 32B blocks, 2 cycle latency
L2 cache               unified, 2M, 4-way (LRU), 64B blocks, 12 cycle latency
Main memory latency    200 cycles
TLBs                   128 entry, fully associative, 30 cycle miss latency
TABLE II
WORKLOAD DESCRIPTION
App.        Type                       Frames   Frame Types   Base IPC   Default Deadline
MPEG2-enc   High bit-rate video codec  100      I, P, B       1.6        33.9 ms
MPEG2-dec   High bit-rate video codec  100      I, P, B       2.9        1.6 ms
H263-enc    Low bit-rate video codec   100      I, P          1.8        20.1 ms
MPEG4-dec   Low bit-rate video codec   100      I, P          3.3        10.6 ms
MP3         Audio                      850      N/A           3.6        123 µs
Mesa        Rendering                  100      N/A           3.3        11.5 ms
hardware such as counters, comparators and control logic is
likely to be small. We model a delay of 5 cycles to activate
all de-activated components.
Power Model: We track dynamic energy using Wattch [2]
with parameters scaled for the 0.1 micron technology at 1.2V.
We also model overheads for adaptive structures such as
additional bits for each window entry and transistors for
gating unused segments. Experiments with DVS assume a
continuous frequency range from 500 MHz up to 2 GHz with
voltage values derived from data for the Intel Pentium M [8].
We assume aggressive conditional clock-gating (“cc3” clock-
ing style in Wattch) where a clock-gated resource consumes
idle power equal to 20% of its maximum power [17]. We
assume that resources that are de-activated do not consume
any power.
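Under these assumptions, per-unit power accounting reduces to a simple three-state model. The sketch below is our own simplification (the 20% idle factor follows the cc3 clocking style [17]; names are ours):

```python
# Simplified per-unit power accounting under the "cc3" clock-gating
# assumption: clock-gated units still draw 20% of peak power, while
# units de-activated by the adaptation algorithm draw none.

IDLE_FACTOR = 0.2

def unit_power(p_max_w, state):
    if state == "active":
        return p_max_w                 # fully active: peak dynamic power
    if state == "gated":
        return IDLE_FACTOR * p_max_w   # clock-gated but still powered
    if state == "off":
        return 0.0                     # de-activated: no power
    raise ValueError(f"unknown state: {state}")
```

For example, an FPU with a 2 W peak draws 0.4 W when clock-gated and nothing when de-activated.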
Memory Hierarchy: To minimize the effects of cache
behavior, we select the L1 cache size based on prior results
from the working set size analysis of media applications [7].
We scale the input for each application to ensure a hit rate
of at least 99% for the L1 data and instruction caches. To
emulate a processor used in a typical hand-held device, we
set the L2 cache size similar to that of the Pentium M [8].
Workloads: We consider several single-threaded multime-
dia benchmarks that include high (mpeg2) and low (h263,
mpeg4-avc) bit-rate video codecs, audio decoding (mp3) and
graphics rendering. These were derived from the Media-
bench [12] suite and other online resources. The MPEG2
workloads consist of three types of frames (I, P, B) while
H263/MPEG4 consist of a single I-frame followed by P-
frames. Mesa does not have a well-defined notion of a frame,
but processes one picture or “scene” at a time. We do not
use any multimedia instructions (such as MMX) because we
do not have a power model for specialized functional units.
We consider two sets of deadlines as described in [6]. The
first set, referred to as the default deadline, is the maximum
time required by the base processor to execute all the frames.
The second set, referred to as the relaxed deadline, is equal to
twice the default deadline. For multi-programmed, soft real-
time systems, the two deadlines model the effect of other
system load when processing a frame. Table II gives the
default deadlines for each workload on the base processor.
Due to per-frame execution time variability, we observe an
average temporal slack of 24% and 62% for the default and
relaxed deadlines, respectively.
VI. PREVIOUS ADAPTATION ALGORITHMS
We evaluate several previously proposed adaptation algo-
rithms which we describe in this section. These algorithms
operate at fixed time intervals (ranging from 256 cycles to
a few thousand cycles) during which each adaptive structure
is monitored and certain performance statistics are col-
lected [1]. After each interval, these statistics are compared
to a set of thresholds to make an adaptation decision.
We implement the instruction window (termed as IW) [5]
and functional unit (termed as FU) [14] adaptation as exam-
ples of per-structure, non-coordinated adaptation algorithms.
To evaluate co-ordinated adaptation, we manually tuned the
thresholds for each resource using two methods. First, we
randomly chose thresholds for each resource from candidate
thresholds and then combined them. Second, we took a
cross-product of several individual design points and selected
random combinations. We selected the best thresholds to be
ones which execute each application with the least energy
while missing at most 5% of the deadlines. This algorithm
is termed as Manual.
For system-level adaptation, we focus on DVS where the
goal is to slow down the processor to save energy while
guaranteeing frame deadlines. DVS algorithms operate at
the frame-level and leverage special multimedia application
characteristics to predict performance and guide adaptation
decisions. We implement the per-frame adaptation algorithm
proposed by Hughes et al. [7] and term it CG (for coarse-
grained). Finally, we combine CG and Manual together to
perform integrated DVS and structural adaptation and denote
this algorithm as CG+FG.
Tuning Effort for Threshold-Based Adaptation Algo-
rithms: Both IW and FU require tuning of three parameters
each. For IW, these are: (1) adaptation period for reducing
the window size, (2) the number of periods after which to
increase the size and, (3) the counter threshold that triggers
the adaptation decision. For FU, these are: (1) adaptation
period, (2) utilization threshold to trigger de-activation, and
587
TABLE III
THRESHOLDS FOR PREVIOUS ADAPTATION ALGORITHMS
            IW                          FU                        Manual                                  Missed Deadlines
App.        Period   Num      Count    Period   Util   Hazards   IW1    IW2   IW3    FU1    FU2   FU3    IW    FU    Manual
            (cycles) Periods  (instr)  (cycles)                  (cycles)            (cycles)

MPEG2-enc   512      16       32       2048     4      40        512    5     64     512    4     40     2     0     51
MPEG2-dec   4096     16       64       8192     16     120       512    8     64     512    96    120    0     3     35
H263-enc    1024     4        128      2048     8      40        1024   8     128    1024   8     40     30    18    88
MPEG4-dec   256      4        8        512      32     40        256    8     8      256    8     40     0     0     0
MP3         256      4        16       4096     8      40        4096   8     8      4096   8     40     5     5     10
Mesa        1024     8        64       2048     32     40        1024   5     128    1024   4     40     0     0     0
(3) number of hazards to trigger activation. We manually
evaluated a large number of points in the search space by
varying adaptation period value from 256 to 8192, the win-
dow upgrade periods ranging from 2 to 16, segment counter
values from 8 to 128, FU utilization values from 4 to 64 and
hazard values from 40 to 200. Table III shows the thresholds
for the IW, FU and Manual algorithms. The best thresholds
differ across applications and also when the techniques are
combined together. Combining threshold values randomly
always resulted in worse performance than exploring the
cross-product of design points. The last three columns list
the fraction of deadlines missed for each algorithm for the
default deadlines. The deadlines missed for CG and CG+FG
were less than 3% in all cases.
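To make the tuning burden concrete, a controller of this style can be sketched as follows. This is our own simplified illustration, not the exact published IW algorithm [5]; the segment counts and counter semantics are assumptions:

```python
# Simplified sketch of a threshold-based instruction-window controller:
# each adaptation period, shrink the window by one 8-entry segment when
# the occupancy counter falls below `count_thr`; after `num_periods`
# consecutive busy periods, grow it back. Every parameter below is one
# of the knobs that had to be hand-tuned per application.

def iw_controller(occupancy, count_thr, num_periods, min_seg=2, max_seg=16):
    segs, busy_streak, trace = max_seg, 0, []
    for occ in occupancy:              # one occupancy sample per adaptation period
        if occ < count_thr and segs > min_seg:
            segs -= 1                  # de-activate the top 8-entry segment
            busy_streak = 0
        else:
            busy_streak += 1
            if busy_streak >= num_periods and segs < max_seg:
                segs += 1              # re-activate a segment
                busy_streak = 0
        trace.append(segs)
    return trace                       # active segment count over time
```

Even in this toy form, the controller's behavior hinges on three interacting thresholds per structure, which is why the combined Manual search space grows so quickly.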
This data highlights the large design effort required for
prevalent threshold-based approaches. Even after application-
specific tuning, their behavior is unpredictable and (as we see
in the next sub-section), their energy benefits are limited.
VII. RESULTS FOR STRUCTURAL ADAPTATION
This section presents results for structural adaptation using
the default deadlines. We first summarize the potential energy
benefits of different algorithms. We then quantify the sources
of inefficiency of previous algorithms based on the manner in
which they consume the resource and temporal slack. We find
that the efficiency of OPT:FG is a result of judicious temporal
slack distribution and a comprehensive use of configuration
options.
A. Potential Energy Savings
Table IV summarizes the energy savings for each bench-
mark, expressed as percentage energy savings over Base for
each algorithm, averaged over all frames. For reference, we
also list the savings of OPT:FG relative to each algorithm,
which we term as the energy efficiency gap. This data illus-
trates the significant energy benefits by exploiting intra-frame
variability with mean potential savings of up to 60%. Energy
saved is proportional to the amount of intra-frame variability
: benchmarks with lower variability such as MPEG2-dec and
MP3-dec show modest savings (up to 47%), whereas those
with high variability, such as MPEG2-enc and Mesa, show
significant savings (up to 85%). In general, algorithms that
adapt structures together perform well, with CG+FG showing
savings within 13% of OPT:FG. Further tuning can likely
improve these savings. However, notice that Manual (12%
savings) performs worse than even IW (15% savings). This,
coupled with the high miss ratio of Manual (Table III), shows
that it is difficult to guarantee performance even if thresholds
are extensively hand-tuned for individual applications.
Table IV also quantifies the amount of temporal slack
consumed by each algorithm as the slowdown over Base,
averaged across all frames. In general, structural adaptation
is unable to consume large amounts of temporal slack which
indicates that most potential savings result by exploiting
intra-frame resource slack. This has significant implications
for coarse-grained algorithms. These algorithms can exploit
almost the entire temporal slack to save as much energy as
possible (detailed in Section VIII).
In what follows, we use the solution of OPT:FG to
quantify the underlying sources of inefficiency when con-
straining adaptation in the spatial and temporal dimensions.
We find that the net energy efficiency of OPT:FG results from: (1) using the entire available configuration space and reconfiguring in moderate-to-large step sizes between neighboring intervals, and (2) a strategic distribution of the available temporal slack within the frame.
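To make the underlying optimization concrete, the sketch below picks one (energy, delay) configuration per epoch so that the total delay meets a frame deadline while the total energy is minimized. All numbers and names are hypothetical, and the exhaustive search is only a toy model: the paper's formulation is a knapsack-style problem [10] solved at far larger scale.

```python
# Hedged sketch: per-frame control cast as a constrained optimization
# (the paper cites knapsack problems [10]). For each epoch we must pick
# one (energy, delay) configuration; total delay must meet the frame
# deadline; total energy is minimized. Exhaustive search is only a toy.
from itertools import product

def opt_fg(epoch_configs, deadline):
    """epoch_configs: per-epoch lists of (energy, delay) options."""
    best = None
    for choice in product(*epoch_configs):
        delay = sum(d for _, d in choice)
        if delay > deadline:
            continue  # violates the frame deadline
        energy = sum(e for e, _ in choice)
        if best is None or energy < best[0]:
            best = (energy, choice)
    return best

# Two epochs, three hypothetical configurations each:
configs = [[(5, 1), (3, 2), (2, 4)], [(6, 1), (4, 2), (3, 3)]]
print(opt_fg(configs, deadline=5))
```

Note how the cheapest per-epoch option is not always chosen: spending slack in one epoch forces a faster, costlier configuration in another, which is exactly the coupling an oracle like OPT:FG resolves globally.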
Fig. 1. Magnitude of Configuration Changes (box plots of the relative magnitude of parameter change under IW, FU, MAN, CG, CG+FG, and OPT for MPEG2-enc, Mesa-Texgen, H263-enc, and H264-dec).
B. Configurations Used
Figure 1 plots the magnitude of change in parameter values across intervals in terms of step size [11]. The step size is the difference in configuration parameters between successive intervals, expressed as a fraction of the total number of configurations. For example, we have 15 choices for instruction window size (16 to 128 entries in steps of 8); if the window size changes from 32 to 64 entries in successive intervals, the relative step size is 0.13. If a parameter changes from its minimum to its maximum value, the step size equals 1. We combine the relative step sizes of each parameter to quantify the change across the configuration space. Figure 1 presents this information as compact box plots.

TABLE IV
ENERGY SAVINGS (%), ENERGY EFFICIENCY GAP (%), AND SLOWDOWN FOR DEFAULT DEADLINES

           |     Savings (% Base Energy)     | OPT:FG Savings Relative to | Slowdown (x Base Exec. Time)
App.       |  IW   FU  MAN   CG CG+FG OPT:FG |  IW  FU MAN  CG CG+FG      |  MAN    CG CG+FG OPT:FG
MPEG2-enc  |  35    5   42   41    51     71 |  56  70  51  52    41      | 1.05  1.06  1.1    1.04
MPEG2-dec  |  21   -6   17   25    40     47 |  34  50  18  29    13      |  1.1  1.25  1.2    1
H263-enc   |  25   -3   31   30    41     52 |  36  53  31  32    20      |  1.1   1.1  1.2    1.1
MPEG4-dec  |   3  -11 -1.2   -9    23     64 |  63  68  64  67    54      | 1.21   1.1  1.26   1.1
MP3-dec    |   3  -20  -19  -15    10     42 |  41  52  51  50    36      |  1.0   1.0  1.0    1.0
Mesa       |   3   -1    3   36    35     85 |  84  85  85  77  77.5      |  1.0   1.0  1.14   1.1
Mean       |  15   -6   12   18    33     60 |  52  63  50  50    40      | 1.07  1.09  1.15   1.05

Fig. 2. Per-epoch Energy x Delay Trends (Energy x Delay improvement, in % of Base, vs. instructions executed (M) for an MPEG2-enc I-frame and an H263-enc P-frame; algorithms: IW, FU, Manual, CG, CG+FG, Optimal).
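A minimal sketch of one plausible reading of the step-size metric follows. The exact normalization is not fully specified in the text, so the per-parameter normalization by (number of settings - 1) and the averaging across parameters are assumptions; the sketch is not guaranteed to reproduce the quoted 0.13.

```python
# Sketch of the relative step-size metric. Parameters are identified by
# their index into an ordered list of discrete settings; each change is
# normalized by (number of settings - 1) -- an assumption -- and the
# per-parameter step sizes are averaged.

def relative_step(prev_cfg, next_cfg, choices):
    """prev_cfg/next_cfg: parameter -> chosen setting index;
    choices: parameter -> number of discrete settings."""
    steps = [abs(next_cfg[p] - prev_cfg[p]) / (choices[p] - 1)
             for p in choices]
    return sum(steps) / len(steps)

# 15 instruction-window settings (16..128 entries in steps of 8);
# the second (hypothetical) parameter has 4 settings and is unchanged.
choices = {"iw": 15, "alu": 4}
prev = {"iw": 2, "alu": 1}   # 32 entries (index 2)
nxt  = {"iw": 6, "alu": 1}   # 64 entries (index 6)
print(relative_step(prev, nxt, choices))
```

A step from the minimum to the maximum setting of every parameter yields 1.0, matching the extreme case described above.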
We observe that OPT:FG exercises parameters spanning the entire configuration space, with a median step size across all workloads of 0.07, a standard deviation of 14%, and a maximum of 0.98. In contrast, CG+FG uses a step size of only 0.01, with a deviation of 2% and a maximum of 0.6. This data suggests that OPT:FG makes relatively gradual parameter changes across intervals but uses all the available configurations to achieve energy savings. We also calculated the percentage of intervals that use each configuration (not shown due to lack of space) and observe that constraining spatial adaptivity creates bottlenecks by reducing the number of configurations that can be exercised.
In summary, most workloads need to exercise a large fraction of the configuration space in relatively modest step sizes. This suggests diverse computational-resource requirements within each workload, which must be met to better match resource sizes to execution characteristics. In comparison, IW and FU are limited to adapting only a single resource, while CG and CG+FG create bottlenecks by constraining the temporal granularity of adaptation.
C. Temporal Slack Used
OPT:FG strategically chooses the temporal slack to be
consumed at each epoch based on the energy-performance
tradeoffs across the entire frame. This intelligent “spreading”
of temporal slack across the frame guides the selection
of per-epoch configurations and ultimately the net energy
efficiency. We quantify this behavior using the per-epoch
energy×delay (ED) for each control algorithm. Figure 2 plots
the per-epoch ED trends for two workloads. ED remains almost constant across the entire frame, suggesting that, as per-epoch performance requirements change, OPT:FG selects the configurations that provide the best tradeoff between energy and delay, resulting in high per-frame energy efficiency. The result is a strategic distribution of per-epoch delay which, coupled with spatial adaptation, allows OPT:FG to exploit the differing computation requirements of each epoch in an optimal manner.
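The per-epoch ED metric can be sketched as follows; the epoch (energy, delay) pairs are illustrative stand-ins, not the paper's measured data.

```python
# Sketch of the per-epoch Energy x Delay (ED) improvement over Base,
# the quantity plotted in Fig. 2. Epoch measurements are illustrative
# (energy, delay) pairs, not the paper's data.

def ed_improvement(adaptive, base):
    """Per-epoch ED improvement in percent; positive = better than Base."""
    return [100.0 * (1.0 - (e_a * d_a) / (e_b * d_b))
            for (e_a, d_a), (e_b, d_b) in zip(adaptive, base)]

base     = [(2.0, 1.0), (3.0, 1.0), (2.5, 1.0)]   # Base: fixed delay per epoch
adaptive = [(1.2, 1.2), (1.8, 1.3), (1.5, 1.25)]  # adaptive: slower but cheaper
print(ed_improvement(adaptive, base))
```

With these invented numbers the improvement stays in a narrow band across epochs, which is the kind of near-constant ED profile the text attributes to OPT:FG.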
VIII. RESULTS FOR INTEGRATED STRUCTURAL AND
SYSTEM-LEVEL ADAPTATION
This section summarizes the additional energy efficiency
when applying per-epoch structural and per-frame DVS
synergistically. The results reported here use an algorithm
termed BestDVS as the baseline. BestDVS runs the processor at the lowest possible voltage/frequency that still meets the deadline, without any structural adaptation. Figures 3(a)
and 3(b) quantify the efficiency for the default and relaxed
deadlines, respectively, expressed as percentage savings over
BestDVS. CG and CG+FG show cases where the DVS
and structural adaptations are decoupled. CG selects the
voltage/frequency and structure sizes once at the start of the
frame, while CG+FG additionally performs intra-frame structural adaptation. The data suggests that, for both sets of deadlines, integrating DVS with structural adaptation increases efficiency modestly, averaging 1.4x for default and 1.18x for relaxed deadlines. Decoupling DVS and structural adaptation generally yields lower savings than DVS alone, by up to 36% (in the case of MPEG4-dec). For default deadlines, we observe that architectural adaptation alone can provide more energy savings than even DVS (up to 1.5x in the case of MPEG2-enc). The additional savings due to structural adaptation shrink for relaxed deadlines, since the voltage can be ramped down more aggressively. For reference, Figure 3(c)
illustrates the different solutions to OPT:CG+FG using our
temporal slack splitting approach when running MPEG2-enc
for a single frame. The plot indicates that a split of 18% (18% of the total slack consumed by structural adaptation and the rest by DVS) is optimal for this frame: it meets the frame deadline while saving the most energy.

Fig. 3. Potential Energy Efficiency for Integrated Structural Adaptation and DVS. (a) Default deadline and (b) relaxed deadline: energy savings relative to BestDVS (BestDVS, OPT:CG+FG, CG, CG+FG, OPT:FG) across the benchmarks. (c) Energy (% Base) for different temporal slack splits (% of total slack), annotated with the frame deadline and the additional energy savings due to structural adaptation.
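The slack-split sweep behind Figure 3(c) can be sketched as below. The two energy models are invented diminishing-returns curves, not the paper's measured data; only the sweep-and-pick-minimum structure reflects the approach.

```python
# Hypothetical sketch of the temporal-slack-split sweep of Fig. 3(c):
# a fraction `s` of the total slack goes to structural adaptation and
# the rest to DVS; the split with the lowest total energy is kept.
# The energy models below are illustrative stand-ins.

def sweep_splits(total_slack, e_struct, e_dvs, n=101):
    best = None
    for i in range(n):
        s = i / (n - 1)  # fraction of slack given to structural adaptation
        energy = e_struct(s * total_slack) + e_dvs((1 - s) * total_slack)
        if best is None or energy < best[1]:
            best = (s, energy)
    return best

# Illustrative diminishing-returns models (assumptions): more slack
# given to a mechanism lowers the energy that mechanism consumes.
e_struct = lambda slack: 0.30 * (1.0 - slack) ** 2
e_dvs    = lambda slack: 0.40 * (1.0 - slack) ** 3
split, energy = sweep_splits(1.0, e_struct, e_dvs)
print(split, energy)
```

Under these toy curves the sweep settles on an interior split, mirroring the 18% optimum the paper reports for the MPEG2-enc frame (the exact value depends entirely on the energy models).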
In general, these results indicate that for the adaptive
structures, voltage range and deadlines that we consider,
DVS contributes to most of the energy savings and structural
adaptation only modestly increases these savings. However,
results for the default deadline lead us to believe that, as more structures are added to the adaptive space (for example, adapting pipeline depth, the memory hierarchy, or branch predictors), voltage-scaling margins will shrink while more load is placed on the system, so the efficiency gains due to combined coarse-grained and fine-grained adaptation would likely increase.
Summary: Our study reveals the significant energy efficiency that can result from fine-grained temporal and coordinated spatial adaptation, and from integrated structural and system-level adaptation. From a hardware perspective, this indicates that comprehensively adaptive hardware will be required to realize these benefits. From the control-algorithm perspective, our findings challenge previous threshold-based algorithms, which constrain spatial adaptivity and thereby create bottlenecks. Finally, we observe that fine-grained temporal adaptivity is better suited to localizing energy costs, expending power only during the epochs that actually need it, thus reducing waste and increasing the net efficiency.
To make this study more extensive, recent advances in
statistical inference based techniques [4], [9], [11] can be
leveraged. These techniques perform efficient design space
exploration by using linear and/or non-linear predictive models to infer processor power/performance from fewer detailed
simulations. It will also be interesting to analyze adaptation
in the presence of SMT and/or CMP configurations that are
becoming common even for mobile devices.
IX. CONCLUSION
We have presented a detailed analysis of fine-grained
temporal and coordinated spatial micro-architectural adap-
tation by casting adaptation as a combinatorial optimization
problem. We also analyze the problem of integrating coarse-
grained adaptation with architectural adaptation using a novel
optimization model. Solutions to these models have allowed
an oracle-based assessment of the potential energy efficiency
benefits and an insight into the behavior of ideal control algorithms. The solutions reveal significant efficiency benefits resulting from judicious use of the available temporal slack and comprehensive use of the adaptive space. A comparison with
several previous algorithms has demonstrated the impracticability of threshold-based algorithms and the loss in efficiency incurred by constraining adaptation in the temporal dimension, the spatial dimension, or both. Although our problem formulations are
conceptually simple, the analysis is much more complex due
to the high computational cost and multi-dimensionality of
the problem. Given the significant potential benefits, our next
step is to analyze control algorithm implementation options
in terms of their complexity and effectiveness.
REFERENCES
[1] D. H. Albonesi et al. Dynamically tuning processor resources with adaptive processing. IEEE Computer, 36(12):49-58, 2003.
[2] D. Brooks et al. Wattch: a framework for architectural-level power analysis and optimizations. In ISCA, pages 83-94, 2000.
[3] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Tech. Report CS-TR-1996-1308.
[4] S. Eyerman, L. Eeckhout, and K. D. Bosschere. Efficient design space exploration of high performance embedded out-of-order processors. In DATE, pages 351-356, 2006.
[5] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In ISCA, pages 230-239, 2001.
[6] C. J. Hughes and S. V. Adve. A formal approach to frequent energy adaptations for multimedia applications. In ISCA, pages 138-149, 2004.
[7] C. J. Hughes, J. Srinivasan, and S. V. Adve. Saving energy with architectural and frequency adaptations for multimedia applications. In MICRO, pages 250-261, 2001.
[8] Intel Corporation. Intel Pentium M Processor Datasheet.
[9] E. Ipek et al. Efficiently exploring architectural design spaces via predictive modeling. In ASPLOS, 2006.
[10] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.
[11] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.
[12] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In MICRO, pages 330-335, 1997.
[13] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In MICRO, pages 90-101, 2001.
[14] R. Sasanka, C. J. Hughes, and S. V. Adve. Joint local and global hardware adaptations for energy. In ASPLOS, pages 144-155, 2002.
[15] O. S. Unsal and I. Koren. System-Level Power-Aware Design Techniques in Real-Time Systems. Proc. of IEEE, 91(7), Jul 2003.
[16] M. Weiser, B. B. Welch, A. J. Demers, and S. Shenker. Scheduling for reduced CPU energy. In OSDI, pages 13-23, 1994.
[17] Y.-K. Chen et al. Media Applications on Hyper-Threading Technology. Intel Technology Journal, 6(1), February 2003.