
AUTONOMIC MANAGEMENT OF ADAPTIVE

MICROARCHITECTURES

by

Ashutosh Sham Dhodapkar

A dissertation submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

(Electrical Engineering)

at the

UNIVERSITY OF WISCONSIN–MADISON

2004

© Copyright by Ashutosh Sham Dhodapkar 2004

All Rights Reserved


To my parents


ACKNOWLEDGEMENTS

This work would have been impossible without the constant support and encouragement from my parents, Vasudha and Sham Dhodapkar. They had unwavering faith in my abilities, even at times when I did not. I am indebted to them for instilling in me the curiosity essential for solving research problems.

My wife Ritu made my stay in Madison delightful by taking care of the daily chores, in spite of being a graduate student herself. Graduate school would not have been half as much fun without her. During stressful times, Ritu provided a bright outlook and words of encouragement. My brother Chinmay helped maintain the competitive spirit by continuously challenging my knowledge and programming skills. His jokes have been an abundant source of humor.

Jim Smith has been a tremendous influence and a solid role model. I am thankful to him for giving me a chance to work with him when I had almost no background in computer architecture. Jim gave me freedom to pursue research, but at the same time provided gentle guidance when needed. His emphasis on ethics, technical rigor, and precise language has made me a better researcher and engineer.

Discussions with lab mates on topics academic and otherwise were enlightening. Timothy Heil, Subramanya Sastry, and Todd Bezenek provided useful advice during my initial years in graduate school. Sebastien Nussbaum, Ho-Seop Kim, Tejas Karkhanis, and Jason Cantin were good company during late night debugging sessions. Ho-Seop, along with Bruce Orchard, did a wonderful job of keeping the computer systems up and running. Trey Cain helped me set up PharmSim and provided 24x7 software support. Jason Cantin helped debug the coherence protocol. Many other students, past and present, have shared the joys and frustrations of graduate school. I refrain from listing their names, lest I forget some.

Professors Mark Hill, Mikko Lipasti, Michael Schulte, and Gurindar Sohi are gratefully acknowledged for serving on my committee and enduring my preliminary exam and final defense. Their suggestions and criticisms have been helpful in improving this dissertation. Guri Sohi was an excellent teacher in ECE 552, which sparked my interest in computer architecture.

I would like to thank all my family and friends for keeping my social life busy over the past six years. I have been fortunate to go to school in a town as beautiful as Madison. The third floor of College Library, with couches and a beautiful view of the lake, provided an ideal location for writing this dissertation. Fall colors and beer at the Terrace and area microbreweries helped keep spirits high.

Finally, I would like to acknowledge the various funding agencies that made this research possible. This work was supported by SRC grants 2000-HJ-782 and 2001-HJ-902, NSF grants EIA-0071924 and CCR-9900610, an Intel Foundation Ph.D. fellowship, and equipment donations from Intel and IBM. Any opinions, findings, and conclusions or recommendations expressed in this dissertation are those of the author and do not necessarily reflect the views of the funding agencies.


ABSTRACT

Microarchitectural resource requirements vary across programs and even within programs, as they go through distinct phases of execution. Adaptive microarchitectures can adjust to changing program requirements to provide better power/performance characteristics. The efficiency of the tuning algorithm that governs the adaptation process is key to achieving benefits from such microarchitectures.

We propose a class of generic tuning algorithms that use program phase information to guide the tuning process. These algorithms improve upon previously proposed periodic algorithms by reducing unnecessary tunings and reconfigurations, which are the main source of performance loss associated with the tuning process.

Phase changes are detected dynamically using a light-weight profiling mechanism called the instruction working set signature. The signature is a lossy-compressed representation of the working set, 32-128 bytes in size. In addition to detecting phase changes, signatures can be used to estimate the working set size and identify recurring phases. These properties can be exploited by tuning algorithms to achieve better resource savings and lower performance loss.

We propose three working set signature based tuning algorithms. Each of these algorithms triggers tuning only when a phase change is detected. The first one, called the basic tuning algorithm, performs a trial and error search for the best configuration whenever a new phase is detected. The second one, the signature density based algorithm, directly configures units whose performance depends on working set size. Working set size is estimated from signature density, i.e., the number of ones in the signature. The last one, the history based algorithm, is similar to the basic algorithm but stores the results of tuning in a phase table. When a phase repeats, configuration information is read from the table, bypassing the trial and error process. On average, the best performing algorithms achieve 53%, 30%, 18%, and 48% resource savings, respectively, for the I-cache, D-cache, L2-cache, and branch predictor. The associated performance loss is close to 1%.

An algorithm for managing adaptive microarchitectures with multiple configurable units is proposed. This algorithm uses a novel apportioning technique to decouple the tuning processes of individual units while still meeting a tight performance loss tolerance. On average, this algorithm is able to achieve 25%, 17%, 9%, and 30% resource savings, respectively, for the I-cache, D-cache, L2-cache, and branch predictor. The associated performance loss is 1.5%.

We propose the use of co-designed virtual machine software to implement the tuning algorithms. Based on full system simulation, we conclude that such a software implementation is perfectly viable, leading to less than 0.3% performance loss.


Contents

Acknowledgements
Abstract
Contents
List of Figures
List of Tables

1 Introduction
1.1 Phase Based Tuning Algorithms
1.2 Autonomic Management
1.3 Thesis Statement and Contributions
1.4 Dissertation Outline

2 Related Work
2.1 Program Phase Detection
2.2 Multi-configuration Hardware
2.3 Hardware/Software co-Design

3 Program Phase Detection
3.1 Defining Program Phases
3.1.1 Similarity
3.1.2 Implementation Independence
3.1.3 Phase Recurrence
3.2 Detecting Phase Changes
3.2.1 Model: Instruction Working Set
3.2.2 Difference Metric: Relative Working Set Distance
3.2.3 Working Set Signatures
3.3 Design Space Exploration
3.3.1 Methodology
3.3.2 Difference Threshold
3.3.3 Granularity
3.3.4 Signature Size
3.3.5 Sampling Rate
3.3.6 Hash Function
3.4 Limitations of the Instruction Working Set Model
3.5 Summary

4 Comparison of Phase Detection Techniques
4.1 Comparison Metrics
4.2 Performance with Unbounded Resources
4.2.1 Sensitivity and False Positives
4.2.2 Stability and Average Phase Length
4.2.3 Performance Variance
4.2.4 Correlation
4.3 Hardware Implementations
4.3.1 Design Space
4.3.2 Comparison of Hardware Mechanisms
4.3.3 Conditional Branch Counter
4.4 Other Considerations
4.4.1 Hardware Complexity
4.4.2 Recurring Phase Identification
4.4.3 Estimating Working Set Size
4.5 Summary

5 Tuning Algorithms
5.1 Tuning Goals
5.2 Simulation Setup
5.2.1 Benchmarks
5.2.2 Baseline configuration
5.2.3 Reconfiguration Mechanisms
5.3 Tuning Individual Units
5.3.1 Signature Based Tuning Algorithms
5.3.2 Evaluation Methodology
5.3.3 Reconfiguration Overheads
5.3.4 Tuning Low-Overhead Units
5.3.5 Tuning High-Overhead Units
5.3.6 Comparison with Periodic Tuning Algorithms
5.3.7 Optimizing Granularity
5.3.8 Per-Benchmark Static Tuning
5.4 Tuning Multiple Units
5.4.1 Apportioning Algorithm
5.4.2 Evaluation
5.5 Implementation Issues
5.6 Summary

6 The Micro-OS
6.1 Co-Designed Virtual Machines
6.2 Memory Protection
6.3 Implementation ISA
6.3.1 Registers
6.3.2 Instructions
6.3.3 Interrupts
6.4 Overheads
6.5 Summary

7 Conclusions
7.1 Phase Detection
7.2 Tuning Algorithms
7.3 Future Directions

Bibliography


List of Figures

1.1 Program phases for SPEC CPU2000 benchmark apsi
1.2 Adaptive microarchitecture managed by a micro-OS
3.1 Phase resolution using different profiling interval lengths
3.2 Phase nomenclature
3.3 The working set signature
3.4 Working set vs. working set signature
3.5 ROC curves averaged over all benchmarks
3.6 ROC curves averaged over integer and floating-point benchmarks
3.7 Resolution of phases as a function of the granularity
3.8 Phase stability as a function of the granularity
3.9 Variation of signature fidelity with signature size
3.10 Variation of signature density with signature size
3.11 Variation of working set size with granularity
3.12 Signature density for different granularities
3.13 Theoretical signature density vs. working set size
3.14 Variation of signature fidelity with sampling rate
3.15 The folded-XOR hash function
3.16 Variation of signature fidelity using the folded-XOR hash function
3.17 Limitations of the working set model
4.1 ROC curves for various techniques, significant performance change: 2%
4.2 ROC curves for various techniques, significant performance change: 10%
4.3 Stability and average phase length for working set and BBV based techniques
4.4 Stability and average phase length for branch counter based technique
4.5 Average performance variance within phases for various techniques
4.6 Correlation between BBV and instruction working set based techniques
4.7 The accumulator table update mechanism
4.8 Design space of phase detection techniques
4.9 Correlation of hardware schemes with corresponding unbounded schemes
5.1 Execution time breakup for SPEC CPU2000 benchmarks
5.2 Performance achieved by the baseline microarchitecture
5.3 Performance loss due to a smaller instruction cache
5.4 Performance loss due to a smaller data cache
5.5 Performance loss due to a smaller L2-cache
5.6 Performance loss due to a smaller branch predictor
5.7 State machine for the basic tuning algorithm
5.8 State machine for the signature density based tuning algorithm
5.9 State machine for the history based tuning algorithm
5.10 Schematic illustrating transient misses caused by reconfiguration
5.11 System quiesce overheads for various cache reconfigurations
5.12 System quiesce overheads for various L2-cache reconfigurations
5.13 Line invalidation overheads for cache reconfiguration
5.14 Extra transient misses caused by reconfiguration
5.15 Cost of misses and mis-predicts
5.16 Stability and average phase length for a granularity of 100K instructions
5.17 Savings and performance loss for the basic tuning algorithm
5.18 Savings and performance loss for the signature density based tuning algorithm
5.19 Savings comparison between the signature density and oracle algorithm
5.20 Phase recurrence for SPEC CPU2000 benchmarks
5.21 Savings and performance loss for the history based tuning algorithm
5.22 Comparison of savings achieved by the basic and history based algorithms
5.23 Analytical model based savings for basic and history based algorithms
5.24 Percentage of tunings resulting in various configurations being chosen
5.25 Stability and average phase length for a granularity of 5M instructions
5.26 Savings and performance loss for L2-cache tuning
5.27 Reduction in tunings and reconfigurations using the history algorithm
5.28 Branch predictor resource savings using basic and periodic algorithms
5.29 Performance loss for branch predictor using basic and periodic algorithms
5.30 Average savings and performance loss for various phase based algorithms
5.31 Savings for optimized and non-optimized signature density algorithms
5.32 Comparison of savings achieved by dynamic and per-benchmark static tuning
5.33 Multiple-unit tuning algorithm
5.34 Resource savings achieved by the multiple-unit tuning algorithm
5.35 Performance loss caused by multiple-unit tuning algorithm
6.1 Co-Designed Virtual Machine
6.2 Memory layout and protection mechanism
6.3 Control transfer while handling architected and non-architected interrupts
6.4 Additional signature comparisons performed by the history based algorithm


List of Tables

3.1 Compilers and flags used for building benchmarks for the Alpha ISA
3.2 Microarchitecture parameters
5.1 Compilers and flags used to build benchmarks for PowerPC-AIX
5.2 Virtual address ranges for various parts of software in AIX based systems
5.3 Baseline microarchitecture parameters
5.4 Available configurations for the multi-configuration units
5.5 Summary of reconfiguration overheads for various multi-configuration units
5.6 Number of tunings and reconfigurations for the basic I-cache tuning algorithm
5.7 Number of reconfigurations for the basic and signature density algorithms
5.8 Various statistics for the history based algorithm
5.9 Savings and performance loss for tuning low-overhead units
5.10 Average savings and performance loss for various tuning algorithms
5.11 Optimal tuning intervals
5.12 Summary of average savings and performance loss for single unit tuning
6.1 Registers specific to the I-ISA
6.2 Instructions specific to the I-ISA
6.3 Micro-OS overheads for the basic algorithm

Chapter 1

INTRODUCTION

General-purpose microprocessors are used in a wide variety of applications ranging from game consoles to web servers. Programs running on these processors have widely varying microarchitectural resource requirements. Thus, general-purpose processors are designed with a microarchitecture that provides a good power-performance trade-off, on average, across a spectrum of workloads. This, however, means that the microarchitecture is often sub-optimal for a specific program or a specific execution phase of a program.

For some programs (or program phases), the microarchitecture may be over-designed, i.e., it may provide more resources than required. This leads to unnecessary power dissipation. For example, consider an application whose instruction working set fits in an 8KB cache. A microarchitecture with a 64KB cache provides no additional performance for this application, but does dissipate extra power.

Performance may also be sub-optimal for certain programs due to design mismatch, i.e., when the microarchitectural parameters are highly sub-optimal for a given program. For instance, consider a program accessing 4B words with a stride of 16. If the microarchitecture has a 64B data cache line, the cache lines are highly under-utilized, leading to lower performance compared to a microarchitecture with a 4B cache line.

Finally, the microarchitecture may be highly optimized for the wrong applications. Microprocessor design cycles are about five years long. Thus, designers have to predict the “killer” applications (i.e., very important applications) of the future. If the predictions are wrong, the microarchitecture may lack the edge it was designed to have.


Adaptive microarchitectures are a promising solution to each of these problems. Such microarchitectures can dynamically adapt to changing program requirements in order to improve performance and/or reduce power dissipation. The enabling technology for adaptive microarchitectures is multi-configuration hardware. Multi-configuration hardware provides a set of configurations with different microarchitectural parameters, any one of which can be selected at run-time. For example, a multi-configuration cache may provide configurations with different sizes. An adaptive microarchitecture employing such a cache may reduce power dissipation by dynamically selecting the smallest cache configuration that is big enough to hold the program’s working set.
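To make this concrete, the sketch below shows one way a tuning policy might choose among a small set of cache configurations given an estimate of the working set size. The available sizes and the selection rule are illustrative assumptions, not the dissertation's actual interface.

```c
/* Minimal sketch (assumed sizes, not the thesis interface): choose the
 * smallest cache configuration large enough to hold an estimated
 * working set, falling back to the largest configuration otherwise. */
#include <stddef.h>

static const size_t cache_sizes_kb[] = {8, 16, 32, 64}; /* hypothetical configs */
#define NUM_CONFIGS (sizeof cache_sizes_kb / sizeof cache_sizes_kb[0])

size_t pick_cache_config(size_t working_set_kb)
{
    for (size_t i = 0; i < NUM_CONFIGS; i++)
        if (cache_sizes_kb[i] >= working_set_kb)
            return i;            /* smallest configuration that fits */
    return NUM_CONFIGS - 1;      /* working set exceeds every size   */
}
```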

Adaptive microarchitectures are a logical progression in the evolution of microprocessors. As microprocessors have evolved, they have employed increasing levels of dynamic adaptivity in the form of caching, out-of-order issue mechanisms, branch prediction, and prefetching. In one way or another, each of these mechanisms exploits dynamic program information to improve performance. Adaptive microarchitectures are the next step, where dynamic program information guides reconfiguration of one or more units to improve performance and/or reduce power dissipation.

1.1. Phase Based Tuning Algorithms

Reconfiguration of an adaptive microarchitecture is governed by a tuning algorithm, and an efficient tuning algorithm is essential for deriving maximum benefits from such microarchitectures. Several tuning algorithms have been proposed in the literature [1–4]. Most of these are ad hoc and specific to the multi-configuration hardware being reconfigured.

Previously proposed tuning algorithms use a periodic trial and error process to find the best configuration. As a consequence of the trial and error process, each tuning attempt is associated with multiple reconfigurations. Microarchitecture reconfigurations lead to performance loss, partly due to the reconfiguration overhead and partly due to temporary loss of performance-critical implementation state. Avoiding unnecessary tunings is thus key to an efficient tuning algorithm.

[Figure 1.1. Instruction cache miss rates vs. time (measured in terms of 100K-instruction intervals) for three cache sizes (2KB, 8KB, and 32KB). The benchmark is apsi from the SPEC CPU2000 suite. Phases are labeled A, B, B, C.]

Programs go through phases of execution wherein their performance is relatively stable. This behavior can be leveraged to design efficient tuning algorithms. Figure 1.1 illustrates the phase behavior of the SPEC CPU2000 benchmark apsi, using the variation of I-cache miss rates with time. Clearly, the program goes through distinct phases of execution (labeled for convenience) with differing microarchitectural resource requirements. For example, in phase A, the application requires a large instruction cache and shows a significant reduction in miss rate going from 8KB to 32KB. However, in phase B, it requires a much smaller instruction cache, and even an 8KB instruction cache suffices.


Since program behavior does not change within a phase, results of tuning performed at the beginning of a phase remain valid throughout the phase. Thus, if a tuning algorithm can detect these phase changes accurately, it can significantly reduce the number of tunings. Also, since phases repeat in time (for example, phase B), tuning results can be reused. This essentially bypasses the repeated trial and error process, further reducing reconfigurations.

We propose a light-weight hardware profiling technique called the working set signature to dynamically detect program phase changes. We also propose a generic class of signature-based tuning algorithms that trigger tuning only on phase changes. Furthermore, history based tuning algorithms are proposed, which reuse tuning results for recurring phases. These algorithms tune for a phase once using trial and error and save the best found configuration in a table. When the phase repeats, the configuration is reinstated from the table, bypassing the trial and error process.
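The C sketch below illustrates the shape of such a history based algorithm. Everything here (the table size, the caller-supplied measure() profiling hook, and the use of exact signature equality for matching) is a simplifying assumption for illustration; the actual algorithms and the signature comparison they use are developed in Chapters 3 and 5.

```c
/* Hedged sketch of a history based tuning loop: tune by trial and error
 * the first time a phase is seen, then reuse the stored result when the
 * phase recurs. Types and parameters are illustrative. */
#include <stdint.h>

typedef uint64_t sig_t;          /* stand-in for a 32-128 byte signature */
#define TABLE_SIZE   32
#define NUM_CONFIGS  4

struct entry { sig_t sig; int best_config; int valid; };
static struct entry phase_table[TABLE_SIZE];

/* Trial and error: try every configuration and keep the one with the
 * lowest measured loss. measure(c) is a caller-supplied profiling hook. */
static int tune_by_trial_and_error(double (*measure)(int))
{
    int best = 0;
    double best_loss = measure(0);
    for (int c = 1; c < NUM_CONFIGS; c++) {
        double loss = measure(c);
        if (loss < best_loss) { best_loss = loss; best = c; }
    }
    return best;
}

/* Called on a detected phase change; returns the configuration to apply.
 * Exact signature equality stands in for the real similarity test. */
int on_phase_change(sig_t sig, double (*measure)(int))
{
    for (int i = 0; i < TABLE_SIZE; i++)
        if (phase_table[i].valid && phase_table[i].sig == sig)
            return phase_table[i].best_config;   /* recurring: reuse result */

    int cfg = tune_by_trial_and_error(measure);  /* new phase: tune once */
    for (int i = 0; i < TABLE_SIZE; i++)
        if (!phase_table[i].valid) {             /* remember for next time */
            phase_table[i] = (struct entry){ sig, cfg, 1 };
            break;
        }
    return cfg;
}
```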

1.2. Autonomic Management

Tuning algorithms, such as the ones proposed in this work, may be too sophisticated for an efficient hardware-only implementation. A software implementation may be more appropriate, since it provides flexibility and obviates the need for complex specialized hardware. This software could be run on a special co-processor or implemented in the operating system (OS) via low-level device drivers.

The co-processor approach has almost zero performance overhead because the tuning algorithm is run on a separate processor. However, it adds an additional piece of hardware to the system, which increases system cost and power dissipation.

The OS implementation approach has the advantage that no extra hardware is needed. The downside is that it borrows processor cycles to run the tuning algorithm, which can lead to some performance overhead. However, there are other issues that make an OS implementation less desirable.

First, the tuning algorithms are best designed by hardware designers because they have intimate knowledge of the microarchitecture. Incorporation of these algorithms into the OS requires interaction between hardware and OS vendors. This interaction can adversely affect time to market, since OS modifications must be in place at the time the processor first ships. Moreover, the hardware vendors have to create device drivers for the various operating systems in the market.

The bigger problem with an OS implementation is that it requires hardware implementation-dependent information to be exposed to software. This usually means implementation-specific changes to the instruction set architecture (ISA), which has historically been shown to be a problem. Moreover, ISA changes are slow, which can curtail continuous enhancement of the tuning mechanisms.

We believe that tuning algorithms should be an integral part of the processor. In other words, processor resources must be managed autonomically¹. Co-designed virtual machines (VMs) are the enabling technology for implementing such autonomic processors. A co-designed VM consists of a thin layer of software that is concealed from all conventional software, including the OS. The VM software is co-designed with the hardware (by the hardware designers) and thus provides an ideal medium for implementing the tuning algorithms.

The virtual machine monitor (VMM) is the core of the VM software. The VMM has access to a special ISA called the implementation-ISA (I-ISA). The I-ISA contains instructions that can be used to profile and configure the adaptive microarchitecture. The I-ISA is invisible to the external world and thus can be modified at will by the hardware designers.

¹ Autonomic: not controlled by others or by outside forces; independent. Source: The American Heritage Dictionary of the English Language, Fourth Edition.

[Figure 1.2. An adaptive microarchitecture managed by a micro-OS (mOS). The dotted lines represent feedback from profiling hardware; the tuning knobs represent the configurable parameters of the multi-configuration units.]

Figure 1.2 illustrates the paradigm using an example adaptive microarchitecture. The VMM acts as a micro-OS (mOS) that manages the various microarchitecture resources, much in the same way a conventional OS manages system resources. Unlike a conventional OS, the mOS does not receive explicit resource requests from programs, but instead uses profiling hardware to guide tuning.

The profiling hardware contains counters for tracking various performance characteristics such as branch mis-prediction rates, cache miss rates, and cycles per instruction (CPI). It also contains a mechanism for detecting program phase changes. The mOS tunes the microarchitecture in response to profile information using a set of configuration-control registers. The profiling hardware and configuration-control registers are visible only to the mOS and are not accessible to conventional software.
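As a rough illustration of how mOS software might use these facilities, the sketch below reads hypothetical performance counters and writes a hypothetical configuration-control register. All of the names and the placeholder policy are assumptions; the actual I-ISA is described in Chapter 6, and the real tuning algorithms in Chapter 5.

```c
#include <stdint.h>

/* Hypothetical I-ISA accessors; only the mOS can execute these. */
enum perf_counter { PC_ICACHE_MISSES, PC_INSTRUCTIONS };
enum config_reg   { CFG_ICACHE_SIZE_KB };

extern uint64_t read_perf_counter(enum perf_counter c);          /* profile read */
extern void     write_config_reg(enum config_reg r, uint64_t v); /* reconfigure  */
extern int      phase_change_detected(void);  /* phase detection hardware */

/* Invoked on a profiling interrupt: act only on phase changes. */
void mos_tuning_handler(void)
{
    if (!phase_change_detected())
        return;                               /* stable phase: no action */

    uint64_t instr = read_perf_counter(PC_INSTRUCTIONS);
    if (instr == 0)
        return;
    uint64_t mpki = read_perf_counter(PC_ICACHE_MISSES) * 1000 / instr;

    /* Placeholder policy: grow the I-cache when misses are high,
     * shrink it when they are low. */
    write_config_reg(CFG_ICACHE_SIZE_KB, mpki > 10 ? 64 : 16);
}
```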

The mOS can also perform other OS-like functions such as saving/restoring implementation state across context switches. For example, branch predictor state can be saved/restored across context switches [5]. This eliminates interference caused by multiple contexts sharing a predictor, and can lead to significant performance savings in applications with a high context switch rate. However, we do not study this aspect of the mOS in this dissertation.

1.3. Thesis Statement and Contributions

The thesis of my research is that program phase information can be leveraged to implement efficient tuning algorithms. Furthermore, such algorithms can be implemented via co-designed virtual machines with minimal hardware and performance overhead.

This dissertation defends the thesis by demonstrating three key points. First, I show that program phase changes can be detected, and recurring phases can be identified, using a low-overhead profiling mechanism called the working set signature. Second, I show that tuning algorithms based on working set signatures can significantly reduce the reconfigurations and performance overheads associated with the tuning process. Finally, I show that implementing signature based tuning algorithms in co-designed virtual machine software is a low-overhead, and thus viable, alternative.

The main contributions of this work are as follows:

1. I propose a new hardware profiling mechanism, viz. working set signatures, to identify working sets and estimate working set size. Working set signatures are implementation-independent and can be used to detect program phase changes and identify recurring phases on-the-fly.

2. I define a set of metrics which can be used to compare different techniques for detecting program phase changes.

3. I propose two new tuning algorithms based on working set signatures. The first algorithm uses signatures to estimate working set size and directly configure units whose performance depends on this attribute. The second algorithm uses signatures to identify recurring phases and reuses configuration information to bypass the tuning process.

4. I propose a simple analytical model that provides insights into the functioning and performance trade-offs of tuning algorithms.

1.4. Dissertation Outline

The next chapter describes previous work related to various parts of our research. Chapter 3 describes techniques for detecting program phase changes and identifying recurring phases. It also presents a detailed design space exploration of working set signatures. Chapter 4 presents a comparison of signature based phase detection techniques with other proposed techniques. Chapter 5 describes and evaluates a variety of signature based tuning algorithms for reducing power dissipation. Chapter 6 describes the micro-OS architecture and evaluates the associated overheads. Chapter 7 concludes the dissertation with a summary of results.

Chapter 2

RELATED WORK

This chapter discusses work relevant to various parts of our research. The first section discusses related work in the area of program phase detection. The following sections describe multi-configuration hardware and tuning algorithms. The last section describes work relevant to hardware/software co-design.

2.1. Program Phase Detection

Working set models were used to explain memory paging behavior as early as 1967. Denning and Schwartz [6, 7] use a working set model to understand the relationship between working set size and miss rates, and present offline algorithms for estimating these quantities. They also use the working set model to characterize LRU page replacement algorithms.

Program phase behavior was observed and studied in the early 1970s [8–12]. These studies were performed in the context of memory paging behavior. Program phases were often referred to as program localities or regimes. Denning [8] presents a model for program execution based on phase transitions. Phases are defined in terms of segments of information accessed over specific periods of time. Denning and Kahn [9] explain program behavior in terms of a macro model that captures phase behavior, and a micro model that captures behavior within phases. Phase behavior is represented using a Markov-like model which is validated experimentally. Batson and Madison [11, 12] propose similar phase based models for program behavior and experimentally validate the models using Algol 60 programs. They also observe the hierarchical behavior of phases, i.e., higher-level phases composed of lower-level fine-grain phases. Phase models have also been extended to understand file referencing behavior in applications. Majumdar and Bunt [13] present a characterization of the phase behavior of file references in UNIX applications.

More recently, there has been an interest in the phase behavior of applications at microarchitectural granularities. The goal is to dynamically detect phase changes in order to guide tuning algorithms. Balasubramonian et al. [1] propose the use of a conditional branch counter to (dynamically) detect phase changes. Phase changes are detected by comparing dynamic branch counts of consecutive intervals. If the difference exceeds a threshold, a phase change is detected. The threshold is varied dynamically throughout program execution to control sensitivity. We proposed the use of instruction working set signatures to detect phase changes and identify recurring phases [14, 15]. Working set signatures are lossy-compressed representations of the working set. Phase changes are detected by comparing signatures of consecutive intervals using a metric called the relative signature distance. If the relative signature distance exceeds a pre-defined threshold, a phase change is detected. Unlike the previous scheme, the threshold is not varied. Huang et al. [16] use a hardware based subroutine call stack to identify program phases. The call stack monitors the time spent in various phases. If the time spent in a subroutine is greater than a preset threshold, it is identified as a major phase. We presented a comparison of the various phase detection techniques using a set of metrics [17].

Sherwood et al. [18, 19] leverage phase behavior to speed up microarchitectural simulation. Detailed simulation is performed for various distinct phases of the program. The performance is then weighted by the recurrence ratio for each unique phase. A reduction in simulation time can be achieved if there is a high recurrence of phases, as is normally the case. They propose the use of basic block vectors (BBVs) to detect program phase changes and identify recurring phases. BBVs track execution frequencies of the basic blocks touched in a sampling interval. Phase changes are detected by comparing the Manhattan distance between consecutive BBVs against a preset threshold. In later work, they approximate BBVs in hardware using an accumulator table containing a few large counters [20]. The accumulator table is used to detect phase changes dynamically and predict future phase behavior.

HP Dynamo [21], a run-time dynamic optimization system, detects phase changes to flush stale translations from its cache. Dynamo stores optimized traces of the program, called fragments, in a fragment cache. In steady state, most of the instructions are fetched from this fragment cache. A sharp increase in the fragment formation rate is used as an indicator of a phase change. When this happens, stale fragments are flushed from the cache, making room for new ones.

Hind et al. [22] present a formal analysis of the “phase shift” problem. They show that the intuitive notion of phases is not well-defined, but that the phase shift problem can be captured by means of an abstract problem with two parameters – granularity and similarity. They present a concrete instance of the abstract problem and show that a unique solution exists. We borrow some nomenclature from their work, in order to be consistent.

2.2. Multi-configuration Hardware

In the literature, multi-configuration hardware is sometimes referred to as reconfigurable hardware. We make a distinction between the two terms because the latter is traditionally used in the context of FPGA based reconfigurable hardware. A variety of multi-configuration units have been proposed over recent years.

– Configurable caches and TLBs, which can be configured to save power by disabling ways [23, 24], sets [4, 25], and specific cache lines [26], or by putting certain lines in a low-leakage mode [27, 28]. Performance can be improved by adapting the line size [29], associativity [1], or fetch sizes [30] to changing program reference patterns.

– Configurable memory resources, such as cache memory which can be divided among levels in the cache hierarchy to improve performance [1], or configured for other uses such as instruction reuse [31]. Memory buffer resources can be used for stream buffers or victim buffers, depending on the current needs of the program [29].

– Configurable branch predictors, in which the global history register length can be varied to improve performance [32]. Parts of the branch predictor may also be disabled to reduce power dissipation.

– Configurable issue windows, reorder buffers, and load-store queues, where parts of these structures can be disabled when there is low instruction level parallelism, thus reducing power dissipation [2, 33–35].

– Configurable pipelines, where power can be optimized by throttling parts of the pipeline [34, 36–38], by disabling portions of clustered microarchitectures [3], by varying the pipeline between in-order, out-of-order, and pipeline gating [39], or by dynamically choosing between high and low speed functional units [40].

Of course, these various methods are not mutually exclusive, and in practice a combination of adaptive techniques will likely be used in the same processor [16, 34, 35, 41]. This leads to a fairly complex optimization problem, especially if the methods interact with one another. Huang et al. [16, 41] describe a general framework and algorithms that are intended to deal with processors containing several configurable units.

2.3. Hardware/Software co-Design

There is a large body of work in the area of hardware/software co-design for embedded processors. However, it is largely unrelated to our application since most of it deals with design flow and electronic design automation (EDA). In this section, we mainly focus on work similar or related to mOS design.

The IBM DAISY and BOA projects [42, 43] and the Transmeta Crusoe processor [44] demonstrated the practicality of using co-designed VMs. Both DAISY and Crusoe use VM technology for transparent runtime binary translation from conventional ISAs (PowerPC and x86, respectively) to proprietary VLIW I-ISAs. Crusoe also implements resource management algorithms in VM software. This technology, known as LongRun, reduces power dissipation using dynamic voltage scaling [45]. In this respect, the technology is very similar to the mOS.

The IBM S/390 microprocessor uses highly privileged subroutines called millicode to implement complex ESA/390 instructions [46]. Millicode has access not only to all implemented ESA/390 instructions but also to special instructions used to access specific hardware. Millicode is stored at a fixed location in the hardware storage and uses a separate set of general purpose and control registers. The mOS differs from millicode in certain key aspects. The mOS can potentially provide much richer functionality than millicode, including several OS-like functions. Unlike millicode, the mOS is completely hidden from the OS, leading to greater flexibility and portability. Finally, the mOS uses the architected general purpose registers for execution, leading to minimal hardware overhead.

Alpha microprocessors use PALcode [47] to implement low-level hardware support functions such as power-up initialization and TLB fill routines. PALcode aims at hiding low-level implementation details from the OS. Unlike the mOS, PALcode is part of the architectural specification and is accessible through special instructions. The OS is fully aware of the code and has to reserve part of the address space for it.

Chapter 3

PROGRAM PHASE DETECTION

The existence of program phases has intuitive appeal, but in practice, program phases are not easily determined, or even defined. Intuitively, a program phase is a contiguous interval of program execution in which the program behavior remains relatively unchanged. Program phases may be observed through changes in program characteristics such as memory usage [48], or performance characteristics such as cache miss rates (Figure 1.1), instructions per cycle (IPC), etc. [49].

Program phases are a manifestation of the static program structure and dynamic control flow. They are a result of changing patterns of program behavior as control passes from loop to loop and through procedure call trees. As a consequence of nested program control structures, program phases have a fractal-like, self-similar behavior. High level (long duration) phases are typically composed of low level (short duration) phases, which are in turn composed of even shorter phases, until eventually phases are composed of individual instructions. In essence, there are no absolute phases. Phase behavior can only be observed with respect to a certain profiling interval, i.e., an interval over which program characteristics are observed.

Figure 3.1 illustrates this point. Figures 3.1a, b, and c plot miss rates averaged over profiling intervals of 1M, 100K, and 10K instructions, respectively. Figure 3.1b is a magnification of the shaded region of Figure 3.1a, while Figure 3.1c is a magnification of the shaded region of Figure 3.1b. Clearly, the observed phase behavior depends on the profiling interval – the shorter the interval, the finer the phase resolution. For example, the long plateaus seen with a profiling interval of 1M instructions (Figure 3.1a) reveal much more intricate phase behavior when observed with a shorter profiling interval of 10K instructions (Figure 3.1c).

[Figure 3.1. Variation of instruction cache miss rates with time (measured in terms of 10K instruction intervals) for SPEC CPU2000 benchmark mgrid; instruction cache size is 2KB. a) Miss rate averaged over 1 million instructions, x-axis range: 0 – 1 billion instructions; b) miss rate averaged over 100,000 instructions, x-axis range: 0 – 250 million instructions (shaded region in a); c) miss rate averaged over 10,000 instructions, x-axis range: 138 – 142 million instructions (shaded region in b).]

3.1. Defining Program Phases

It follows from the discussion above that a phase can only be defined with respect to a certain granularity and a notion of similarity. The granularity τ is the size of the profiling interval. The similarity σ is a boolean function that determines whether two intervals of program execution are similar.

The similarity function σ(Pi, Pj) : P × P → {0, 1} takes as inputs profiles Pi and Pj, corresponding to intervals i and j respectively, and outputs a boolean value indicating whether the two intervals are similar or dissimilar. The profile P is the set of program characteristics observed over a profiling interval. Examples of P include the instruction working set, the set of basic blocks, the number of conditional branches, etc.

A program phase φ(τ, σ) is defined as a maximal region of program execution, consisting of one or more contiguous intervals of size τ, such that adjacent intervals are determined to be similar by the function σ.


The profiling intervals may be either overlapping or non-overlapping. We use non-overlapping intervals, given the complexity of implementing overlapping intervals in hardware. Since σ is applied to adjacent intervals, gradual changes in program behavior can go undetected. In other words, there could be a sequence of intervals such that each interval is similar to its adjacent intervals, but the first and the last interval in the sequence may be dissimilar. However, this is not a major concern since changes in program behavior are generally abrupt [11].

A series of two or more similar intervals (which may not be maximal) is defined to be a stable region. Dissimilar adjacent intervals indicate a phase change. A region of program execution wherein the phase changes every interval is defined as an unstable region. An unstable region is essentially a series of unity-length phases, which may arise due to misalignment between the profiling interval and natural phase boundaries¹. Figure 3.2 elucidates the nomenclature pictorially.

[Figure 3.2. Program execution is depicted as a series of non-overlapping profiling intervals (rectangles); intervals with the same color are similar. The figure labels a phase, a phase change, a stable region, an unstable region, and the granularity.]

3.1.1. Similarity

The similarity function σ is completely defined by the following three parameters.

1. Model (µ): Defines what profile information is to be compared². More precisely, the model defines the elements of the profile P. Models proposed in the literature include conditional branch count [1], instruction working set [15], basic block execution frequencies [18], and procedure call stacks [16]. Each of these models varies in the amount of information contained. Chapter 4 presents a comparison of some proposed models.

2. Difference Metric (δ): Specifies how the difference between profiles of two intervals is to be computed. The difference metric δ(Pi, Pj) : (P × P) → Q is a function that maps profile tuples (Pi, Pj) to a positive real number. Difference metrics proposed include simple arithmetic difference [1], relative distance [15], and Manhattan distance [18]. The difference metric used depends upon the model, i.e., not all difference metrics can be applied to all models. For example, a simple arithmetic difference may not be applicable to the procedure call stack model.

3. Difference Threshold (δth): Specifies a threshold for comparison of profiles. The similarity function σ(Pi, Pj) returns 1 (similar) if δ(Pi, Pj) ≤ δth, and 0 (dissimilar) if δ(Pi, Pj) > δth. As we shall discuss later, the appropriate difference threshold depends on the sensitivity required.

¹ This is discussed in more detail in Section 3.3.3.
² It should be noted that our terminology differs from that used by Hind et al. [22].

3.1.2. Implementation Independence

A program phase can be either implementation dependent or implementation independent, depending on the definition of granularity and similarity. For example, if the granularity is defined in terms of the number of instructions and the similarity is based on the number of committed branches, the phases observed will be the same irrespective of the machine used for program execution. On the other hand, if the granularity is defined in terms of clock cycles and the similarity is based on IPC, the phases observed will depend on the clock frequency and the microarchitecture of the machine used for program execution.


We consider program phases that are implementation independent. Implementation-independent phases are an intrinsic property of the program and have the following advantages.

– Mechanisms for detecting phase changes can be applied unmodified to any system capable of executing the program.

– In the context of tuning algorithms, implementation independence helps in differentiating performance changes due to reconfiguration from those due to changes in program behavior.

3.1.3. Phase Recurrence

Since programs contain loops that execute the same instructions repeatedly, program phases often recur in time. Phase recurrence at various granularities is visible in Figure 3.1. In order to identify recurring phases, a history of previously observed phases has to be maintained. A phase is said to be recurring if it is similar to one of the previously observed phases.

Comparison of phases is different from comparison of intervals because a phase may be composed of multiple intervals. There are several ways in which phases could be compared. One way is to compute the union of the profiles of the individual intervals belonging to the phase and use that as an input to the similarity function σ. Another way is to perform a pair-wise comparison of intervals using σ. Since our goal is to identify recurring phases dynamically, we use a very simple comparison – two phases are considered similar if their corresponding second intervals are similar according to σ.

This choice is based on a specific mechanism used in the proposed tuning algorithms (discussed in more detail in Chapter 5). The algorithm maintains a phase-table that stores the profiles of previously observed phases. When execution enters a stable region, the profile of the current interval is compared against the stored profiles to check for recurrence. Since a stable region is at least two intervals long, it can be detected only at the end of the second interval. Thus, the profile used for comparison corresponds to the second interval.

If a similar profile is found, then the phase is identified as a recurring phase. If not, the current profile is entered into the phase-table. For a given granularity and similarity, a phase-table contains a certain number of unique phases that we call static phases. Every dynamic instance of a static phase (including the first) is called a dynamic phase. Since phases recur, a program may have a small number of static phases but a very large number of dynamic phases.
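As a rough illustration of this phase-table mechanism, the lookup performed at the end of the second interval of a stable region might look like the sketch below. This is a software sketch, not the evaluated hardware; the table capacity, the bit-vector profile layout (see Section 3.2.3), and the helper names are assumptions.

    #include <string.h>

    #define SIG_WORDS         32   /* assume a 128B bit-vector profile */
    #define MAX_STATIC_PHASES 64   /* assumed table capacity */

    typedef struct { unsigned int bits[SIG_WORDS]; } profile_t;

    extern int similar(const profile_t *a, const profile_t *b, double delta_th);

    static profile_t table[MAX_STATIC_PHASES];
    static int num_static;   /* number of static phases seen so far */

    /* Returns the static phase index; allocates a new entry on a miss. */
    int lookup_phase(const profile_t *p, double delta_th)
    {
        for (int i = 0; i < num_static; i++)
            if (similar(&table[i], p, delta_th))
                return i;                     /* recurring dynamic phase */
        if (num_static == MAX_STATIC_PHASES)
            return -1;                        /* table full; caller decides */
        memcpy(&table[num_static], p, sizeof *p);
        return num_static++;                  /* first instance of a new static phase */
    }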

3.2. Detecting Phase Changes

3.2.1. Model: Instruction Working Set

Since program phases are rooted in the static structure of the program, using an instruction-related model is intuitive. In the past, researchers have used page working set models to describe program behavior [12]. We use the instruction working set model instead. The instruction working set of a program is defined as the set of instruction addresses touched over a granularity-sized profiling interval. Using instruction working sets rather than page working sets allows phase detection at a much finer granularity. This is important for the applications where the phase detection technique will be employed.

Other working set models, such as branch working sets and procedure working sets, could also be used. However, we chose the instruction working set model because it is the most general and does not require any decoding hardware. One advantage of using the branch working set model is that the working set is smaller. The instruction working set size can also be reduced by shifting instruction addresses by a few bits. In fact, since about one in five instructions is a branch [50], shifting the instruction addresses by five bits (assuming 4B instructions) approximates the branch working set.

Procedure working sets are even smaller. However, they cannot be used to resolve phase behavior within procedures. This can be a significant shortcoming for many of the floating-point benchmarks, which perform several computations within a single procedure. Chapter 4 compares the phase detection efficiency of the instruction, branch, and procedure working set models.

3.2.2. Difference Metric: Relative Working Set Distance

An exact comparison of working sets (working sets refer to instruction working sets unless noted otherwise) may not be appropriate, because the same phase may not always touch exactly the same instructions in each profiling interval. There is some level of "noise", partly due to mismatch between the natural phase boundaries and the granularity, and partly due to small differences in execution. We use a measure of similarity that we call the relative working set distance, which is defined as

δ = (‖Wi ∪ Wj‖ − ‖Wi ∩ Wj‖) / ‖Wi ∪ Wj‖    (3.1)

where Wi and Wj are the working sets being compared. A large δ indicates a working set change, whereas a small δ indicates no change. At the extremes, δ = 0 when the working sets are identical, and δ = 1 when they are totally different.

Phase changes are detected by comparing the relative working set distance with a difference threshold δth; δ > δth implies a phase change.
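For reference, Eqn. 3.1 can be computed directly when working sets are kept as sorted arrays of distinct addresses; the following is a minimal C sketch under that assumed representation, not the hardware mechanism proposed below.

    #include <stddef.h>

    /* Relative working set distance (Eqn. 3.1) between two working sets
     * held as sorted arrays of distinct instruction addresses. */
    double ws_distance(const unsigned long *wi, size_t ni,
                       const unsigned long *wj, size_t nj)
    {
        size_t i = 0, j = 0, inter = 0;
        while (i < ni && j < nj) {            /* merge to count |Wi ∩ Wj| */
            if (wi[i] == wj[j]) { inter++; i++; j++; }
            else if (wi[i] < wj[j]) i++;
            else j++;
        }
        size_t uni = ni + nj - inter;          /* |Wi ∪ Wj| */
        return uni ? (double)(uni - inter) / (double)uni : 0.0;
    }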


3.2.3. Working Set Signatures

Since working sets can be large, representing and manipulating complete working sets is impractical for our application. Consequently, we propose a lossy compressed representation of the working set that we call a working set signature. The working set signature is an n-bit vector formed by mapping working set elements (i.e., instruction addresses) into n buckets using a randomizing hash function (see Figure 3.3).

Figure 3.3. Mechanism for collecting working set signatures. The program counter (PC), shifted by b bits, is hashed into an n-bit vector and the corresponding entry is set. The vector is cleared at the beginning of each interval.

In order to reduce the effective working set size, we consider cache-block-sized elements formed by shifting the instruction addresses by a few bits (5–7). The size of the bit-vector is of the order of 32–128 bytes. The bit-vector is cleared at the beginning of every interval to remove stale working set information.
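In software form, the collection mechanism of Figure 3.3 reduces to a few operations per committed PC. The sketch below assumes a 1K-bit vector, a shift of 6 bits, and an abstract randomizing hash; the constants and names are illustrative only.

    #define SIG_BITS 1024                /* n: signature size in bits */
    #define SHIFT    6                   /* b: cache-block-sized elements */

    static unsigned int sig[SIG_BITS / 32];       /* the bit-vector */

    extern unsigned int hash(unsigned long key);  /* randomizing hash */

    /* Set the bucket selected by the shifted, hashed PC. */
    void sig_update(unsigned long pc)
    {
        unsigned int idx = hash(pc >> SHIFT) % SIG_BITS;
        sig[idx / 32] |= 1u << (idx % 32);
    }

    /* Clear the vector at the beginning of each interval. */
    void sig_clear(void)
    {
        for (int i = 0; i < SIG_BITS / 32; i++)
            sig[i] = 0;
    }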

Like working sets, signatures are compared using a difference metric called the relative signature distance. For signatures Si and Sj, the relative signature distance is defined as

δ = ‖Si ⊕ Sj‖ / ‖Si + Sj‖    (3.2)

i.e., (ones count of XOR)/(ones count of OR). Phase changes are detected by comparing the relative signature distance against a threshold δth.
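Eqn. 3.2 amounts to two population counts over the signature words; a minimal sketch follows (GCC's __builtin_popcount is used purely for brevity and is an assumption about the toolchain).

    /* Relative signature distance (Eqn. 3.2):
     * popcount(Si XOR Sj) / popcount(Si OR Sj). */
    double sig_distance(const unsigned int *si, const unsigned int *sj, int words)
    {
        unsigned long xor_bits = 0, or_bits = 0;
        for (int w = 0; w < words; w++) {
            xor_bits += __builtin_popcount(si[w] ^ sj[w]);
            or_bits  += __builtin_popcount(si[w] | sj[w]);
        }
        return or_bits ? (double)xor_bits / (double)or_bits : 0.0;
    }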


Figure 3.4. a) Relative working set distance vs. relative signature distance, and b) working set size vs. signature density, for SPEC CPU2000 benchmark gzip, using a 1K bit-vector and 100K instruction granularity.

Figure 3.4a shows, qualitatively, the signature fidelity, i.e., how closely signatures track the true instruction working set. The figure plots relative working set distances between consecutive intervals versus the corresponding relative signature distances, for SPEC CPU2000 benchmark gzip. Other benchmarks show similar behavior. The high degree of correlation between the relative signature distances and relative working set distances shows that signatures track working sets accurately. There is some dispersion in the plot due to hash collisions when forming signatures. However, the dispersion is negligible – indicating that using signatures for detecting phase changes is nearly as accurate as using full working sets.

Working set signatures have an interesting property: they can be used to estimate the working set size. The signature density or fill-factor, i.e., the fraction of bits set in the signature, is probabilistically related to the true working set size. When k random keys are hashed into n buckets, the fill-factor f is given by

f = 1 − (1 − 1/n)^k    (3.3)
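Inverting Eqn. 3.3 makes the size estimate explicit: an observed fill-factor f implies a working set of approximately

k = ln(1 − f) / ln(1 − 1/n) ≈ −n ln(1 − f)

elements, where the approximation holds for large n.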


Figure 3.4b shows the true working set size plotted against the signature density for SPEC CPU2000 benchmark gzip. The plot closely tracks the theoretical relationship. This property of working set signatures may be used to instantly configure (i.e., without trial and error) multi-configuration units such as instruction caches, whose performance is a direct function of the instruction working set size.

3.3. Design Space Exploration

The working set signature based phase detection mechanism has several design parameters, which must be tuned for optimal performance. However, evaluating the performance of a phase detection mechanism is complicated because there are no absolute phases, and thus there is no golden standard that the technique can be compared against. Consequently, we evaluate the phase detection mechanism on the basis of certain properties that are relevant to tuning algorithms. This section evaluates the effects of varying the difference threshold, granularity, signature size, sampling rate, and hash function of working set signatures.

3.3.1. Methodology

Data reported in this section was collected using a modified version of the SimpleScalar tool-set [51] and benchmarks from the SPEC CPU2000 suite [52]. Benchmarks facerec and sixtrack could not be run due to shortcomings of the simulation environment. The benchmarks were compiled for the Alpha EV6 ISA using base-level optimizations. Reference input sets were used. The compiler flags used were the same as those reported for DEC AlphaServer ES40 benchmark results (Table 3.1).

For most data, functional simulation was used and benchmarks were run for the first 20 billion instructions. For performance data such as IPC, an out-of-order processor model was used and benchmarks were run for the first 15 billion instructions. Microarchitecture parameters for the processor are summarized in Table 3.2.

Table 3.1. Compilers and flags used for building the SPEC CPU2000 benchmarks for the Alpha ISA.

SYSTEM:
  AlphaServer ES40, 4-way 21264 500MHz, 4GB RAM
  Digital UNIX V4.0 (Rev. 1229)

COMPILERS:
  C   - DEC C V5.9-005
  CXX - DIGITAL C++ V6.1-027
  F77 - DIGITAL Fortran 77 V5.2-171-428BH
  F90 - DIGITAL Fortran 90 Compiler V5.2-705

FLAGS:
  Integer Baseline Optimization
    OPTIMIZE    = -v -arch ev6 -non_shared
    COPTIMIZE   = -fast
    CXXOPTIMIZE = -O2
  Floating-Point Baseline Optimization
    OPTIMIZE    = -v -arch ev6 -non_shared
    COPTIMIZE   = -fast -O4
    FOPTIMIZE   = -O5

Table 3.2. Microarchitecture parameters.

Processor core     4-wide fetch/decode/issue/commit
                   64-entry RUU, 32-entry LSQ
                   4 ALUs, 1 multiplier, 4 FP ALUs, 1 FP multiplier
Branch prediction  4K-entry gshare, 10-bit global history
                   2K-entry, 2-way BTB
                   32-entry RAS
Memory subsystem   I- and D-cache: 32KB, 2-way, 64B line, latency: 1 cycle
                   L2 cache: 512KB, 4-way, 128B line, latency: 6 cycles
                   Memory: 16B wide, latency: 100 cycles


3.3.2. Difference Threshold

The choice of difference threshold largely dictates the sensitivity of the phase detection mechanism. Sensitivity is defined as the fraction of intervals with significant performance changes (with respect to the preceding interval) that are also indicated as phase changes by the phase detection mechanism [53]. Of course, a significant performance change has to be defined in order to precisely define sensitivity. Consider an example program execution consisting of 1000 intervals, of which 100 intervals show a significant performance change with respect to the preceding interval. If the phase detection mechanism indicates a phase change in 75 of these 100 intervals, the sensitivity is 75%. If it indicates a change for all 100 intervals, then the sensitivity is 100%.

If a phase change were to be indicated for each of the 1000 intervals, the sensitivity would still be 100%, because all the significant performance changes were in fact detected. Thus, we must also consider the flip side of sensitivity: the fraction of false positives [53]. The fraction of false positives is the fraction of intervals where the performance shows no significant change but the phase detection technique indicates a phase change. Continuing with the previous example – of the 900 intervals where no significant performance change occurs, if the phase detection scheme indicates a change in 90 intervals, then the fraction of false positives is 10%. In the extreme case where all intervals are indicated as phase changes, there are 100% false positives.
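These definitions translate directly into counts over per-interval flags; the small sketch below (a hypothetical helper, with the flags assumed to be precomputed per interval) makes the bookkeeping explicit. Sweeping the difference threshold and plotting the resulting (false positives, sensitivity) pairs produces the ROC curves used later in this section.

    /* Compute sensitivity and false positives (in percent) from per-interval
     * flags: perf_change[i] = 1 if interval i shows a significant performance
     * change, phase_change[i] = 1 if a phase change was indicated. */
    void roc_point(const int *perf_change, const int *phase_change, int n,
                   double *sensitivity, double *false_pos)
    {
        int detected = 0, spurious = 0, pos = 0, neg = 0;
        for (int i = 0; i < n; i++) {
            if (perf_change[i]) { pos++; detected += phase_change[i]; }
            else                { neg++; spurious += phase_change[i]; }
        }
        *sensitivity = pos ? 100.0 * detected / pos : 0.0;
        *false_pos   = neg ? 100.0 * spurious / neg : 0.0;
    }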

High sensitivity is desirable in tuning algorithms because it exposes more tuning opportunities. On the other hand, a large fraction of false positives can cause unnecessary reconfigurations, which can lead to significant performance loss. Thus, a good phase detection technique should have high sensitivity and a small fraction of false positives.

Unfortunately, sensitivity and false positives are typically at odds with each other. We use Receiver Operating Characteristic (ROC) analysis to study the effect of varying the difference threshold. ROC analysis is a widely used technique for analyzing medical tests [54], which have sensitivity and false positive trade-offs similar to the phase detection problem. The ROC curve is essentially a plot of sensitivity versus false positives for various difference thresholds.

Figure 3.5. ROC curves averaged over the SPEC CPU2000 benchmarks. a) CPI changes of more than 2% are considered significant. b) CPI changes of more than 10% are considered significant. Complete instruction working sets are used. Granularity is 10M instructions. Points along each curve correspond to difference thresholds (δth) ranging from 0% to 90%.

Figure 3.5 shows two ROC curves, each with a different definition of a significant performance change. Consider Figure 3.5a, which defines a CPI change of 2% to be a significant performance change. Both sensitivity and false positives increase with decreasing difference threshold because, as the threshold is reduced, even minor fluctuations in CPI are noticed. Redefining a CPI change of 10% to be significant (Figure 3.5b) leads to qualitatively similar results, except that a higher sensitivity is achieved for a given number of false positives. This is to be expected because a 10% change in CPI is more easily detected than a 2% change.

The maximum sensitivity (δth = 0%) in both cases is less than 100%, which means that some performance changes go undetected. This is mainly due to the shortcomings of the instruction working set model described in Section 3.2.1. The high number of false positives does not necessarily indicate shortcomings of the phase detection technique, because some phase changes may indeed lead to no perceivable performance changes.

Figure 3.6. ROC curves averaged over all SPEC CPU2000 benchmarks (avg), only the integer benchmarks (int_avg), and only the floating-point benchmarks (float_avg). Complete instruction working sets are used. Granularity is 10M instructions. CPI changes of more than 2% are considered significant.

Figure 3.6 shows ROC curves averaged over the integer and floating-point benchmarks separately. The integer benchmarks show a higher number of false positives than the floating-point benchmarks. This is explained by the relatively irregular code paths followed in integer benchmarks, which lead to changes in the instruction working set but no significant change in performance.

The definition of a significant performance change, as well as the optimal difference threshold, depends upon the requirements of the tuning algorithm. A threshold near the knee of the ROC curve is desirable because, beyond the knee, a small increase in sensitivity comes at the expense of a large number of false positives. In the rest of this section, we use a threshold of 5% because it works well for the tuning algorithms studied in Chapter 5.


3.3.3. Granularity

Program phase behavior arises as control passes through procedures and nested loop structures. Consequently, each program has certain natural phase boundaries, and ideally, phase changes should be detected at these boundaries. Due to the presence of natural phase boundaries, the choice of granularity affects the way phases are resolved. Figure 3.7 illustrates this using an example program execution.

It is evident that using a granularity of 100K or 10M instructions leads to long phases (in terms of the number of intervals), while an interval of 1M instructions leads to a phase change every interval. In fact, each 10M phase is composed of several 100K phases, and its performance characteristics would be the average of the performance characteristics of these 100K phases. In practice, a perfect alignment as depicted in Figure 3.7 is rare, and thus program executions consist of a series of stable regions separated by unstable regions.

Figure 3.7. A and B are instruction working sets touched in 100K-instruction intervals. The figure shows program execution (AAAAAAAAAABBBBBBBBBBAAAAAAAAAABBBBB...) in terms of working sets and how different granularities resolve program phases: a granularity of 100K instructions yields long phases, 1M yields unstable behavior (a phase change every interval), and 10M yields very long phases.

Attempting to tune and reconfigure in unstable regions can lead to unpredictable, non-optimal results. Consequently, the tuning algorithms described in Chapter 5 do not perform tuning while in unstable regions. Tuning algorithms can thus benefit if a large part of program execution is spent in stable regions. We quantify this using a metric called stability, which is defined as the fraction of intervals that belong to stable regions. Figure 3.8 shows the percentage of time spent in stable regions by the SPEC CPU2000 benchmarks for different granularities.

Figure 3.8. Time spent in stable regions as a function of granularity (100K, 1M, 10M, and 100M instructions) for the SPEC CPU2000 integer and floating-point benchmarks. Complete working sets are used. The difference threshold is 5%.

Clearly, there is no single optimal granularity across benchmarks, because each benchmark has different natural phase boundaries. We found that for maximum benefit, the tuning algorithm should start with a small interval (e.g., 100K instructions) and expand the interval until behavior is relatively stable. Chapter 5 describes this in more detail.


3.3.4. Signature Size

The signature size is an important design parameter because it directly affects the hardware and software overheads associated with tuning algorithms (Chapter 5 discusses this in more detail). A small signature is desirable because it leads to lower overheads. However, a smaller signature can become saturated by relatively small working sets, leading to increased aliasing. Increased aliasing reduces fidelity because some phase changes go undetected.

Figure 3.9. Correlation between relative signature distances and the corresponding relative working set distances, for signature sizes ranging from 32B to 512B. The instruction addresses are shifted by 6 bits. Granularity is 1M instructions.


Figure 3.9 compares signature fidelity for four different signature sizes. The fidelity is measured quantitatively by correlating relative signature distances with the corresponding relative working set distances. Correlations are calculated using the Pearson product-moment correlation coefficient,

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² Σ(y − ȳ)² ).

Correlations are a good measure of signature fidelity because a high correlation indicates that the relative signature distance is very close to the relative working set distance.
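The correlation computation itself is standard; a self-contained C sketch of the coefficient as used here is given below for reference.

    #include <math.h>

    /* Pearson product-moment correlation between relative signature
     * distances x[] and relative working set distances y[]. */
    double pearson(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / sqrt(sxx * syy);
    }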

Clearly, the correlations fall off as the signature size is reduced. However, signatures as small as 32B show more than 95% correlation in most cases. For certain benchmarks, such as vortex and perl, the correlations fall off rapidly with decreasing signature size. This can be attributed to their relatively large working sets, which saturate smaller signatures.

The average signature density corresponding to the different signature sizes is shown in Figure 3.10. Evidently, benchmarks showing low fidelity in Figure 3.9 have very dense, i.e., saturated, signatures. For example, benchmark vortex, which shows low fidelity with a 32B signature, has a signature density > 90%.

One apparent anomaly is that the floating-point benchmarks fma3d and apsi show densities as high as crafty and gcc, respectively, yet demonstrate very high fidelity. This anomaly can be explained by analyzing the phase behavior of these benchmarks. Benchmarks fma3d and apsi have large working sets that saturate the signatures, but the working sets do not change much with time. Thus, the signature density does not have much of an impact. On the other hand, integer benchmarks show highly irregular behavior, i.e., their working sets change often. Highly saturated signatures cannot detect these changes and thus show low fidelity.

Figure 3.10. Average signature density for signature sizes ranging from 32B to 512B, for the SPEC CPU2000 integer and floating-point benchmarks. The instruction addresses are shifted by 6 bits. Granularity is 1M instructions.

In the previous section, we argued that each benchmark has a different optimal granularity, at which it demonstrates maximum stability. Intuitively, the working set size should increase with granularity, indicating that larger signatures must be used at larger granularities in order to avoid saturation. However, the increase in working set size with granularity is not as bad as it might first appear (see Figure 3.11), because programs typically repeat tasks several times, which means that working sets do not change much over time. On average, the working set increases by 4x for a 1000x increase in granularity.

Figure 3.11. Average working set size for different granularities (100K, 1M, 10M, and 100M instructions). The instruction addresses are shifted by 6 bits.

Figure 3.12 shows the theoretical density (using Eqn. 3.3) for a 128B signature, corresponding to 100K and 100M instruction granularities. Clearly, a 128B signature works well for most benchmarks – showing < 50% signature density even for a granularity of 100M instructions. However, certain benchmarks like gcc that have a large instruction footprint may require a larger signature to faithfully represent the working set. For such benchmarks, the mOS can detect signature saturation dynamically and increase the signature size accordingly.

Figure 3.12. Theoretical signature density corresponding to 100K and 100M instruction granularities, computed using Eqn. 3.3 for a 128B signature.

Figure 3.13. Theoretical signature density vs. working set size for 128B and 512B signatures. PCs are shifted by 6 bits.

Certain workloads such as databases may have much larger instruction working sets compared to the SPEC CPU2000 benchmarks. Signature density for such workloads can be estimated from their working set size using Eqn. 3.3. Figure 3.13 shows the theoretical relationship between working set size (in kilobytes) and signature density, for two different signature sizes (128B and 512B). The signatures are generated by shifting PCs by 6 bits before hashing them into the bit-vector. As expected, the 128B signature gets saturated with smaller working sets compared to the 512B signature. If a 90% filled signature is considered saturated, then the 128B signature is saturated by a 150KB working set, while a 512B signature is saturated by a 600KB working set. This means that, roughly, the relative increase in signature size has to be the same as the relative increase in working set size in order to maintain the same signature density.

A more accurate relation can be derived from Eqn. 3.3, as follows. Consider a working set that leads to a signature density of f for an n1-bit signature. A working set N times larger requires a larger signature of n2 bits in order to maintain the same density f. The relationship between n2 and n1 is given by

n2 = 1 / (1 − (1 − 1/n1)^(1/N)).    (3.4)
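For example, with a 1024-bit (128B) signature and a working set that grows by N = 4, Eqn. 3.4 gives n2 = 1/(1 − (1 − 1/1024)^(1/4)) ≈ 4096 bits, i.e., almost exactly N·n1; for practical signature sizes the required signature growth is essentially linear in the working set growth.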

3.3.5. Sampling Rate

Generating signatures by sampling each and every committed instruction address may not be feasible in a multi-GHz superscalar processor without unduly increasing circuit power and complexity. Shifting the instruction addresses by 5–7 bits reduces the pressure on the hardware. Periodic sampling of PCs can further relax timing constraints and thus make a low-power implementation of the mechanism possible.

Collecting working set data just indicates whether an element was touched or not; it does not track the number of times an element was touched. The intuition behind periodic sampling is that since instructions are repeated several times within a profiling interval, there is a large probability that, on average, each instruction address gets sampled at least once. Of course, certain rarely executed instructions may be left out, but more importantly, if the period of sampling matches the period of a loop, then a significant fraction of the working set may be left out. This can happen in floating-point benchmarks with tight inner loops. As a solution, we introduce a certain amount of randomness into the system by picking 1 in N PCs randomly, where N is the sampling period. This is done by collecting committed PCs in an N-entry buffer. When the buffer is full, one of the PCs is picked at random from the buffer and hashed into the signature. The buffer is then cleared to collect the next N PCs.
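A minimal software sketch of this buffered 1-in-N scheme follows; rand() stands in for whatever cheap pseudo-random source an implementation would actually provide, and the sampling period is an assumed constant.

    #include <stdlib.h>

    #define N 8                         /* sampling period */

    static unsigned long buf[N];
    static int fill;

    extern void sig_update(unsigned long pc);   /* hash PC into the signature */

    /* Called once per committed instruction. */
    void sample_pc(unsigned long pc)
    {
        buf[fill++] = pc;
        if (fill == N) {
            sig_update(buf[rand() % N]);   /* pick one of the N PCs at random */
            fill = 0;                      /* restart for the next N PCs */
        }
    }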

Figure 3.14. Correlation between relative signature and working set distances for perfect sampling (every PC) and sampling rates of 1 in 4, 8, and 16 PCs. Granularity is 1M instructions. Signature size is 128B.

Figure 3.14 shows the correlation between relative signature distances and relative working set distances for four different sampling rates. It is clear that information loss due to sparse sampling leads to some loss in fidelity. However, the loss is minimal – a sampling rate of 1 in 8 PCs leads to more than 90% correlation in all benchmarks. For a 4-wide superscalar machine, this sampling rate translates to approximately 1 sample every 2 to 4 cycles.

3.3.6. Hash Function

In all previous sections, the hash function used to generate signatures is a pseudo-random function implemented with the random() function provided by libc. Pseudo-random hash functions such as random() are too complex to be efficiently implemented in multi-GHz processors. As a practical replacement, we propose the use of a simple folded-XOR hash function, which splits the PC into three parts and XORs them together (Figure 3.15).

Figure 3.15. The folded-XOR hash function. The PC is divided into three parts, which are XORed together to generate the index.
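In C, the folded-XOR index computation is a handful of shifts and XORs. The sketch below is illustrative only: it folds the whole (shifted) address in k-bit fields rather than exactly three parts, which generalizes the idea of Figure 3.15 to any index width.

    /* Fold a (shifted) PC into a k-bit index, where 2^k is the number
     * of signature buckets (k < 32 assumed). */
    unsigned int folded_xor(unsigned long key, unsigned int k)
    {
        unsigned int idx = 0;
        while (key) {
            idx ^= (unsigned int)key & ((1u << k) - 1);   /* take low k bits */
            key >>= k;                                    /* next field */
        }
        return idx;
    }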

Figure 3.16 shows the correlation between relative signature distances and relative working set distances for four different sampling rates, using the folded-XOR hash function. The correlations are almost the same as those seen in Figure 3.14, which uses the pseudo-random hash function. This clearly shows that, for our application, a simple folded-XOR hash function works as well as a pseudo-random hash function.

Figure 3.16. Correlation between relative signature and working set distances for sampling rates of 1 in 1, 4, 8, and 16 PCs, using a folded-XOR hash function. Granularity is 1M instructions. Signature size is 128B.

3.4. Limitations of the Instruction Working Set Model

Since program phases are rooted in the static structure of the program, using the instruction working set model makes intuitive sense. However, there are situations where the instruction working set model fails to detect changes in program behavior. The code snippets in Figure 3.17 illustrate two such cases; assume that the granularity is such that each invocation of transform() or getRec() spans one profiling interval.

Consider code snippet a. If loop1_body and loop2_body have distinct behaviors, then an invocation of transform() where loop1_body contributes most of the execution time will behave differently from an invocation where loop2_body contributes more. However, the instruction working set model cannot differentiate between the two invocations because it does not weight the frequency of executed instructions.

In code snippet b, the data working set size, and consequently the data cache behavior, depends on the size of the database db. Since the instruction working set model does not take data accesses into account, it cannot differentiate accesses to two different databases – which may have completely different behavior.

Although several such cases can be contrived, empirical evidence provided in this chapter shows that the instruction working set model does capture most of the phase changes and almost all of the major phase changes.


a)
    void transform() {
        int j = 0, k = 0;
        while (j++ < N1) {
            loop1_body;
        }
        while (k++ < N2) {
            loop2_body;
        }
    }

b)
    rec *getRec(rec **db, int key) {
        rec *r = db[hash(key)];
        while (r) {
            if (key == r->key)
                return r;
            r = r->next;
        }
        return NULL;
    }

Figure 3.17. Code snippets illustrating the limitations of the instruction working set model.

3.5. Summary

In summary, we find that instruction working set signatures are efficient at detecting program phase changes. Signatures as small as 128B provide more than 95% correlation between relative signature and relative working set distances. However, signatures can get saturated as the granularity increases. This can be handled by a simple run-time algorithm that increases the signature size in such scenarios.

Periodic sampling and a simple hash function can help reduce the complexity of the signature generation mechanism. We find that a simple folded-XOR hash function works nearly as well as a pseudo-random hash function. Also, sampling as few as one in eight instructions works fairly well when compared with perfect sampling. This roughly amounts to one sample every two cycles on a 4-way superscalar processor. The implication for circuit design is that slower, less power-hungry transistors can be used to implement the signature generation mechanism.


Chapter 4

COMPARISON OF PHASE DETECTION TECHNIQUES

This chapter compares instruction working set based phase detection techniques with techniques based on branch working sets, procedure working sets, basic block vectors (BBVs) [20], and conditional branch counters [1].

Data reported in this chapter was collected using SimpleScalar [51]. An out-of-order processor model was used. Details of the benchmarks and processor model are provided in Tables 3.1 and 3.2, respectively, in Section 3.3.1. Results are averaged over all benchmarks.

4.1. Comparison Metrics

Comparing different phase detection techniques is complicated because there is no golden standard that techniques can be compared against. Thus, we compare phase detection techniques using a variety of metrics that have some practical appeal. We use the following metrics, in addition to the metrics described in Chapter 3, i.e., sensitivity, false positives, and stability.

1. Average phase length: defined as the average length of phases in terms of the number of profiling intervals. The average phase length is an important metric because most tuning algorithms use trial-and-error mechanisms to arrive at the optimal configuration. That is, they simply try a series of different configurations and determine the best one. These algorithms require several intervals at the beginning of a stable phase to complete the tuning process. If phases are short (i.e., a small number of intervals), tuning may never complete. Also, short phases make it difficult to amortize the reconfiguration overheads associated with the tuning process.

Note that two programs with the same stability can have different average phase lengths. For example, if a program runs for 1000 intervals divided into two length-500 stable phases, the stability is 100% and the average phase length is 500 intervals. However, if the phase changes at the end of every other interval, the stability is still 100% but the average phase length is two. These metrics should therefore be used in conjunction with each other.

2. Performance variance: defined as the average variance in performance within phases. Performance is measured in terms of cycles per instruction (CPI). Performance variance is an important metric because most tuning algorithms are based on the assumption that performance is uniform within a phase. A good phase detection method should be able to resolve phases with a relatively small variance in performance compared to the variance across the whole program. A small variance is an indicator that the phase detection mechanism is detecting phase boundaries correctly.

3. Correlation between techniques: defined as the fraction of intervals for which two techniques agree on the presence or absence of a phase change. Correlation between phase detection techniques can be useful for comparing their relative ability to detect phase changes. If the techniques are highly correlated, then the technique with the simplest implementation is preferable. In the absence of high correlation, the choice of technique should be based on one or more of the metrics defined above and other advantages associated with the technique and where it is being applied.


4.2. Performance with Unbounded Resources

Before comparing hardware based phase detection techniques, we evaluate their limits by comparing techniques based on unbounded working sets, BBVs, and conditional branch counters. We equalize the granularity of these techniques to 10 million instructions. In the course of this research, we tried other granularities and did not find any qualitative difference in our conclusions.

The instruction working set based technique has been described in Section 3.2. Branch and procedure working set based techniques are similar, except that branch and procedure call instruction addresses (respectively) are sampled instead of all instruction addresses.

Sherwood et al. [18] define a BBV to be a set of counters, each of which counts the number of times a static basic block is entered in a given execution interval. In later work [20], they approximate the BBV with an array of counters, where each counter tracks the number of instructions executed by a basic block in a given execution interval. In this study, we use the latter definition of a BBV, as it relates more closely to the hardware implementation. The difference metric used for comparing BBVs is the Manhattan distance, given by

δ = Σn ‖counti,n − countj,n‖    (4.1)

where the subscripts i and j represent the intervals to be compared, each distinct value of n represents a unique basic block, and the sum runs over all static basic blocks. Phase changes are defined with respect to a difference threshold δth.

Conditional branch counts are compared using simple arithmetic differences. In order to compare the techniques, we normalize the BBV and conditional branch count differences to 100%. This is done by dividing the differences by the maximum possible difference, which is 2τ for BBVs and τ for branch counts, where τ is the granularity.
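A sketch of the normalized BBV comparison follows; the array layout is an assumption, and the normalization relies on the fact that each interval's counters sum to τ, so the distance is bounded by 2τ.

    /* Normalized Manhattan distance (Eqn. 4.1), in percent, between the
     * per-block instruction counts of two intervals of granularity tau. */
    double bbv_distance(const unsigned long *ci, const unsigned long *cj,
                        int nblocks, unsigned long tau)
    {
        unsigned long d = 0;
        for (int n = 0; n < nblocks; n++)
            d += ci[n] > cj[n] ? ci[n] - cj[n] : cj[n] - ci[n];
        return 100.0 * (double)d / (double)(2 * tau);
    }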


4.2.1. Sensitivity and False Positives

Figure 4.1 shows ROC curves for the different phase detection techniques. These curves are based on the assumption that a significant CPI change is one of more than 2%, i.e., the sensitivity is computed as the fraction of intervals where a CPI change of more than 2% is indicated as a phase change. False positives are computed as the fraction of intervals where the CPI changes by less than 2% but a phase change is indicated.

Figure 4.1. ROC curves for the various phase detection techniques. CPI changes of more than 2% are considered significant. Difference thresholds (δth) increase from right to left. IWSET, BWSET, and PWSET represent the instruction, branch, and procedure working set based techniques, respectively; BR_CNT represents the conditional branch counter based technique. The right-hand plot shows a magnified view of the shaded part of the left plot; the circled points (δth = 4%, 2%, and 0.08%) are chosen for comparison.

In order to compare the different techniques, we choose difference thresholds corresponding to the knees of the curves. Moreover, we also try to equalize the false positives to make clear comparisons among methods. We arrive at difference thresholds of 4% for the BBV, instruction, and branch working set techniques; 2% for the procedure working set technique; and 0.08% for the conditional branch counter based technique. This choice of thresholds leads to about 30% false positives for each technique.

It is evident that BBVs perform the best – with a sensitivity of 82% – followed by the conditional branch counter (74%) and working set techniques (70%). As discussed before, the working set techniques sometimes do not perform well because they do not keep track of instruction (or branch/procedure) execution frequencies. Consequently, the maximum sensitivity achievable (δth = 0) by the instruction working set technique is limited to 81%.

Amongst the working set methods, the procedure based method shows slightly lower sensitivity than the other two. This is expected because it fails to detect phase changes within procedures. Results show that the procedure based working set method achieves a maximum sensitivity of only 68%, compared to 81% achieved by the instruction working set method. This is a fundamental problem with procedure based phase detection methods.

Figure 4.2. ROC curves for the various phase detection techniques when CPI changes of more than 10% are considered significant. Difference thresholds (δth) increase from right to left.

Redefining a significant CPI change leads to qualitatively similar results for the BBV and working set based techniques. Figure 4.2 shows ROC curves assuming that a CPI change of 10% (rather than 2%) is significant. The curves are similar to those in Figure 4.1, except that each technique achieves higher sensitivity for a given number of false positives, and the difference in sensitivity between the BBV and working set methods decreases. Interestingly, in this case, the working set based techniques work better than the conditional branch counter based technique if the fraction of false positives is limited to less than 40%. This means that the working set based techniques are more efficient at detecting major phase changes.


4.2.2. Stability and Average Phase Length

Figure 4.3 shows, for the BBV and working set methods, how stability and average phase length vary with respect to the difference threshold. Figure 4.4 shows the same for the conditional branch counter method. Clearly, the stability of each method increases with the difference threshold, because fewer changes are detected due to reduced sensitivity (see Figure 4.1). For the difference thresholds chosen for comparison in the previous section (circled in the figures), the working set based methods achieve slightly greater stability (64%) compared to the BBV (62%) and conditional branch counter (63%) based schemes.

Figure 4.3. Variation of stability and average phase length with respect to the difference threshold, for the BBV and the instruction, branch, and procedure working set methods. The phase length is shown in terms of the number of intervals. The circled points correspond to the difference thresholds used for comparison.

The average phase length roughly increases with the difference threshold, because small perturbations in program behavior are not indicated as phase changes. For the chosen difference thresholds, the instruction and branch working set techniques lead to 30% longer phases on average compared to BBVs, and 38% longer phases on average compared to the conditional branch counter technique. Given that the stability shown by each of these techniques is similar, using BBVs or branch counts leads to a larger number of shorter phases. This may not be desirable for tuning algorithms with large performance overheads associated with reconfiguration.


Figure 4.4. Stability (primary axis) and average phase length (secondary axis) with respect to the difference threshold, for the conditional branch counter based technique. The circled points correspond to the difference thresholds used for comparison.

4.2.3. Performance Variance

Although large average phase lengths are desirable, the performance stability within a phase is also important. Figure 4.5 shows the percent coefficient of variance (standard deviation/average) in CPI within stable phases, averaged over all benchmarks. The difference thresholds used were the ones arrived at in Section 4.2.1. Each of the techniques achieved less than 2% variance in CPI within stable phases, as compared to a 116% CPI variance across all intervals (i.e., including unstable regions).

Using BBVs leads to the least performance variance (0.65%) within phases. The instruction and branch working set techniques achieve just under 1% variance. The procedure working set technique leads to 1.4% variance, further establishing the limitations of procedure based techniques. Interestingly, the conditional branch counter technique performs the worst, with a CPI variance of 1.6%, although its sensitivity is higher than that of any of the working set techniques. This means that, compared to the working set techniques, it detects more of the relatively small phase changes and misses some of the larger ones.


Figure 4.5. Average CPI variance within phases. Difference thresholds used were 4% for the BBV and the instruction and branch working set techniques, 2% for the procedure working set technique, and 0.08% for the conditional branch counter technique.

4.2.4. Correlation

Because BBVs perform better than the other techniques on most metrics, we compute the correlation between the BBV technique and each of the other techniques. The difference thresholds used were the ones arrived at in Section 4.2.1. The instruction and branch working set schemes show 85% correlation, while the procedure working set and conditional branch counter based schemes show 80% and 83% correlation, respectively. Since each of the techniques agrees with the BBV technique more than 80% of the time, an important question is – do they agree on most of the major phase changes?

To answer this question, we compare the BBV and instruction working set techniques in more detail (Figure 4.6). The first set of four bars in Figure 4.6 shows, respectively, the percent of time the two techniques agree on the presence or absence of a phase change (agree), the percent of time they disagree (disagree), the percent of time the BBV technique indicates a change but the instruction working set technique does not (bbv_change_only), and the percent of time the instruction working set technique indicates a change but the BBV technique does not (wset_change_only). The next four sets of bars show the average relative change in performance metrics (CPI; data and L2 cache miss rates; branch mis-prediction rate) for each of these four events (agree, disagree, bbv_change_only, wset_change_only). We do not include instruction cache miss rates because, being close to zero, they cause very large relative changes that cannot be well represented on this graph. Note that for the performance metrics, the agree case consists only of events where both techniques indicate a phase change, i.e., it does not contain cases where both agree that there is no phase change.

Figure 4.6. A comparison of the BBV and instruction working set schemes. The first set of four bars (correlation) shows the fraction of time that the two techniques agree, disagree, only the BBV detects a change, and only the working set detects a change. The remaining four sets show the relative change in CPI, data and L2 cache miss rates, and branch mis-prediction rates for each of the four types of events described above.

As seen from the figure, the BBV and the instruction working set technique agree about 85% of the time. The 15% of the time they disagree is split roughly equally between the two disagreement cases. Focusing on the first two bars in each set of performance metrics, it is evident that the relative performance change seen when both methods agree on a change is much higher than the relative change seen when they disagree. This means that most of the major phase changes are detected by both methods.

In cases where the two techniques disagree, the relative change in performance seen when only the BBV indicates a phase change (third bar in each set) is higher than the relative change seen when only the working set method indicates a change (fourth bar in each set). This means that the BBV based technique is better at detecting changes in performance. This is expected because the BBV inherently contains much more information than the instruction working set. However, it should be noted that this happens less than 9% of the time on average.

4.3. Hardware Implementations

The previous section compared the performance of the BBV, working set, and conditional branch count techniques using unbounded hardware resources. In this section, we compare the performance of hardware implementations. The hardware implementation of a conditional branch counter is equivalent to its unbounded implementation for the interval sizes studied, and thus requires no further evaluation.

Figure 4.7. Accumulator table update mechanism. The branch PC is hashed into the table and the corresponding counter is incremented by the number of instructions committed since the last branch. The hash function used is pseudo-random.

Working sets can be represented in hardware using working set signatures (Section 3.2.3).

BBVs can be approximated using an array of accumulators (counters) called the accumulator

table [20]. The accumulator table (Figure 4.7) is an array of counters indexed by hashing

branch PCs. Whenever a branch PC is encountered, the corresponding counter is incremented

by the number of instructions committed since the last branch. The accumulator table collects

samples over a fixed interval of instructions and is reset at the beginning of each interval. To

Page 65: Ashutosh Sham Dhodapkar - jes.ece.wisc.eduAUTONOMIC MANAGEMENT OF ADAPTIVE MICROARCHITECTURES by Ashutosh Sham Dhodapkar A dissertation submitted in partial fulllment of the requirements

50

prevent overflow, each accumulator is made large enough to be able to count up to the number

of instructions in the interval. For example, if the interval is 10 million instructions, then each

accumulator is 24-bits wide. Phase changes are detected by comparing consecutive arrays

using Manhattan distance (see Section 4.2).
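To make the mechanism concrete, the following Python sketch mimics the accumulator table update and the Manhattan distance comparison described above. The table size, hash function, and normalization of the distance are illustrative assumptions, not the exact hardware design.

```python
# A minimal software model (not the exact hardware) of accumulator
# table based phase detection: branch PCs index an array of counters,
# each counter accumulates the instructions committed since the last
# branch, and consecutive interval snapshots are compared using
# Manhattan distance.

NUM_ENTRIES = 32          # illustrative size (32/128/1024 entries studied)
INTERVAL = 10_000_000     # instructions per sampling interval

def hash_pc(pc: int) -> int:
    # Stand-in for the pseudo-random hardware hash of the branch PC.
    return ((pc >> 2) * 2654435761) % NUM_ENTRIES

def collect_interval(branch_stream):
    """branch_stream yields (branch_pc, insns_since_last_branch) pairs
    for one interval; returns the accumulator table snapshot."""
    table = [0] * NUM_ENTRIES
    for pc, insn_count in branch_stream:
        table[hash_pc(pc)] += insn_count
    return table

def phase_change(prev, curr, threshold=0.04) -> bool:
    # Manhattan distance between consecutive tables; dividing by twice
    # the interval length is one plausible normalization, since two
    # tables that each sum to INTERVAL can differ by at most 2*INTERVAL.
    dist = sum(abs(a - b) for a, b in zip(prev, curr))
    return dist / (2 * INTERVAL) > threshold
```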

4.3.1. Design Space

We study three different sizes for each of the hardware based phase detection techniques.

The sizes for instruction/branch signatures (512B, 128B, 64B) are similar to those studied in

Section 3.3.4. Procedure signature sizes were chosen to be 128B, 32B and 8B because pro-

cedure working sets are much smaller compared to corresponding instruction/branch working

sets. Accumulator table sizes studied (1024, 128, 32 entries) are similar to those used in previ-

ous work [20].

Figure 4.8. The position of each of the techniques in the design space is shown. The X-axis represents the number of counters used to capture information; the Y-axis shows the number of bits used in each counter. The plotted points are BBV (∞, ∞), working set (∞, 1), accumulator table (32, ∞), working set signature (1024, 1), and branch counter (1, ∞).

These hardware techniques, along with the unbounded cases considered in the previous

section, span a wide design space as shown in Figure 4.8. Each technique can be categorized in


terms of the number of counters and the number of bits in each counter. The unbounded BBV

contains maximum information, with an unbounded number of counters, each with an unbounded number

of bits; the accumulator tables have a small number of unbounded counters; and finally the

conditional branch counter based scheme has one unbounded counter, although it counts only

conditional branches. The working set based techniques form a similar spectrum, albeit with a

larger number of single-bit counters. It should be noted that accumulator table counters have a

bounded number of bits, but they are considered unbounded because they are large enough to

prevent overflows for a given granularity.

4.3.2. Comparison of Hardware Mechanisms

Figure 4.9 shows the correlation of each of the hardware based techniques with the cor-

responding unbounded case, i.e., instruction working set signatures are compared to complete

instruction working sets, accumulator tables to BBVs, etc. The difference thresholds used are

4% for the instruction and branch working set signatures, 2% for the procedure working set

signature, and 4% for the accumulator table (see Section 4.2.1).

Figure 4.9. The correlation of each hardware scheme with the corresponding unbounded scheme is shown. The figure shows correlations for three different sizes of instruction and branch working set signatures (512B, 128B, 64B), procedure working set signatures (128B, 32B, 8B), and accumulator tables (1024, 128, 32 entries).


It is evident that the hardware schemes are highly correlated with their corresponding

unbounded schemes. As the number of bits/entries is reduced, the correlations drop off mainly

due to increased aliasing. However, the smallest size hardware structure still correlates more

than 90% of the time (in each case) with the unbounded scheme.

It is worth comparing the accumulator table with an equivalent instruction working set

signature. We consider the smallest accumulator table, i.e., 32 entries, since it shows reasonably

high correlation with BBVs. To prevent overflow for a granularity of 10 million instructions,

each of the accumulators should be at least 24-bits wide. This translates to a total of 32*24 =

768 bits. This is comparable in area to a 128B signature taking into account the extra sense

amplifiers and fast adder(s) required by the accumulator table. A comparison between the 32-

entry accumulator table and the 128B instruction working set signature showed results similar to

those given in Figure 4.6, and thus they are not repeated. The two hardware techniques correlate 87%

of the time and the change in performance metrics when the techniques agree is much higher

than the change when they disagree. This means that both techniques agree on major phase

changes. This is not surprising given the fact that both these techniques are highly correlated

to their unbounded cases and a similar observation was made there.

4.3.3. Conditional Branch Counter

Sherwood et al. [20] evaluated the performance of accumulator tables by using a metric

called the visible phase difference. The visible phase difference is the ratio of the phase dif-

ference (Manhattan distance) observed using the accumulator table, to the phase difference

observed using unbounded BBVs. The visible phase difference of the unbounded BBV is

100%.

The accumulator table size for their algorithms was chosen to be 32 entries, because the


visible phase difference achieved by using 32 entries is 72%. However, this is not necessarily

a good metric to use because phase changes are detected based on a difference threshold and

as long as the phase difference is above threshold, it does not matter what the visible phase

difference is. As an example, if the difference threshold is 10% and the unbounded BBV

shows a phase difference of 90%, it does not matter if the phase difference achieved by the

accumulator table is 80% or 25% because a phase change is detected in both cases. This

explains why the 32-counter method agrees with the unbounded BBV 93% of the time even

though results from [20] show that it achieves a visible phase difference of only 70%.

This means that perhaps an even smaller number of counters can provide reasonable phase

detection ability. In fact, the conditional branch counter, which is an extreme example with a

single counter, works quite well, correlating 83% of the time with the unbounded BBV scheme.

4.4. Other Considerations

Because the hardware schemes discussed in the previous section correlate (agree) most of

the time, the decision to use a particular technique may be based on other considerations such

as hardware complexity and additional attributes that may be useful for tuning algorithms.

4.4.1. Hardware Complexity

The hardware used in the conditional branch counter scheme is clearly the simplest and

warrants no further discussion. The working set signature requires a 1-bit wide RAM array

with one read/write port. The instruction sampling hardware samples each instruction address

(at most 4 in a 4-wide superscalar) and hashes it to get the signature bit to be set. As discussed

before, one possible optimization is to employ periodic sampling of one instruction address

every two to four cycles. As shown in Section 3.3.5, the periodic sampling technique works rea-


sonably well because the signature only tracks the number of static instructions touched and

not the number of times they were touched. This can simplify the hardware significantly and

make it amenable to a slow-transistor implementation, thereby saving power.

The accumulator table uses a 24-bit wide RAM array, with one read and one write port.

Separate read and write ports may be needed for throughput reasons. The sampling hardware

is more complex than that used in the working set signatures as it has to analyze the retire

stream to detect positions of branches and increment counters appropriately. Moreover, since

the Manhattan distance is based on instruction counts, dropping samples may not be advisable,

thus making fast hardware essential. Additionally, the accumulator table also requires a fast

24-bit adder to update the accumulators.

It is clear that the accumulator table is more complex and less power efficient compared

with the signature method. However, it should also be noted that neither of these schemes

would form an appreciable fraction of hardware in a modern microprocessor, and thus their

small contribution to power dissipation/complexity may not be a concern.

4.4.2. Recurring Phase Identification

The ability to identify recurring phases is a desirable attribute in phase detection tech-

niques. This property can be exploited in tuning algorithms to reuse previously found optimal

configurations. This can eliminate a significant fraction of reconfigurations, leading to per-

formance improvements. In the next chapter, we show that a tuning algorithm based on re-

curring phase identification reduces more than 80% of reconfigurations for some benchmarks.

This reduces average performance loss associated with tuning by 2% (absolute). Working set

signatures and BBVs [20] have been shown to identify recurring program phases. Whether

conditional branch counters can be used to identify recurring phases remains to be shown.


4.4.3. Estimating Working Set Size

Working set signatures have an added advantage that they can be used to estimate the work-

ing set size (see Section 3.2.3). For multi-configuration units whose performance is directly

related to the instruction working set size, signatures can be used to determine the optimal

configuration without going through a trial and error tuning process. In the next chapter, we

show an algorithm that leverages this property to achieve close to oracle efficiency for tuning

the instruction cache and branch predictor.

Of course, in order to make use of this property, the signature should capture the same

working set that the unit performance is dependent on. Thus, instruction working set signatures

can be used to configure instruction caches and branch predictors, but not data caches.

4.5. Summary

The BBV based technique provides better sensitivity and lower performance variation in

phases compared to the other techniques. The instruction and branch working set techniques

have similar performance on each of the metrics described. These techniques are less sensitive

than the BBV technique mainly because working sets contain less information compared to

BBVs. However, the instruction working set technique provides slightly higher stability and

achieves 30% longer phases on average compared to the BBV technique. This can benefit trial

and error based tuning algorithms. On average, the BBV and instruction working set schemes

agree on phase changes 85% of the time. Of the 15% of the time they disagree, the BBV is more

efficient at detecting important performance changes. As an auxiliary result, we show that

procedure working set based techniques do not perform quite as well as the other working set

based methods. This is mainly due to their inability to detect phase changes within procedures.

One of the surprising results of this study is that a simple conditional branch counter


scheme performs quite well and agrees with the unbounded BBV scheme 83% of the time.

However, it does lead to shorter average phase lengths and higher performance variance within

phases compared with the BBV and working set schemes. This indicates that the branch

counter based technique fails to detect some of the major phase changes.

We find that the hardware schemes, i.e., working set signatures and the accumulator table,

approximate their corresponding unbounded cases (working sets and BBVs) very closely, corre-

lating more than 90% of the time even for the smallest structures considered. Also, equivalent

sized instruction working set signatures and accumulator tables agree on phase changes 87%

of the time.

Given the high correlation between these techniques, the choice of technique may be

guided by other considerations. While the conditional branch counter is the simplest to im-

plement, signatures and accumulator tables can be used to identify recurring phases – leading

to more efficient tuning algorithms. Signatures also provide the added advantage that they

can be used to estimate certain working set sizes and immediately configure the corresponding

microarchitectural units such as caches.

Finally, in this work, we dealt with a very large design space composed of several variables

including sampling intervals, difference thresholds, and hardware sizes. Admittedly, the results

therefore represent a very small slice of the design space. On the other hand, in the process of

conducting this research we did simulate a large number of variations and found the results to

be qualitatively similar to those reported here.


Chapter 5

TUNING ALGORITHMS

Tuning is the process of adjusting system parameters in order to make the system more effective.

In the context of microarchitectures, this may mean improving performance and/or power effi-

ciency. Traditionally, microarchitectures have been tuned statically, i.e., at design time – using

simulation based approaches. The goal is to provide a good power/performance trade-off on

average, over a variety of workloads. However, as discussed before, this can lead to

power/performance inefficiencies for certain programs or certain execution phases of a program.

A promising solution to this problem is to dynamically tune the microarchitecture, i.e.,

reconfigure it on-the-fly to adapt to changing program characteristics. A key aspect of such

adaptive microarchitectures is the algorithm that governs the tuning process. Several ad hoc

periodic tuning algorithms have been proposed in the past [2–4]. We propose a new class

of tuning algorithms that leverage program phase information to improve upon such ad hoc

algorithms. These algorithms are generic, i.e., algorithmic parameters are not manually tuned to

any particular benchmark. The proposed algorithms adjust automatically to program behavior

using dynamic program phase information. This is a desirable quality in an algorithm because

manually tuning to benchmarks is not an option in the general purpose computing paradigm.

This chapter presents a characterization of the properties of adaptive systems, and a simple

model for understanding tuning algorithms. We study algorithms with the goal of reducing

power without causing significant performance loss. The algorithms are applied to two types

of multi-configuration units viz. low-overhead and high-overhead units. Specific instances are

chosen to represent these general types of units: the instruction cache, data cache, and branch

predictor represent low-overhead units, while the L2-cache represents high-overhead units.


5.1. Tuning Goals

Tuning algorithms should have a well defined goal which might be performance improve-

ment, power reduction, or both. The goal of tuning in this dissertation is to reduce power

dissipation without losing more than 2% performance. Performance is measured in terms of

cycles per instruction (CPI). Albeit arbitrary, we feel that up to a 2% performance loss can be

justified if it gives reasonable power savings.

Power dissipation can be reduced, without losing performance, by dynamically disabling

parts of the microarchitecture that do not contribute to performance. For example, if the

program is executing a very long tight loop, most of the branch predictor can be disabled

without affecting performance. The key enabling technology for such optimizations is multi-

configuration hardware (see Section 2.2). Multi-configuration hardware is designed such that

parts of the hardware can be disabled dynamically via a combination of clock gating and power

gating. Clock gating, i.e., disabling the clock signal, eliminates signal toggling and thus reduces

dynamic power dissipation [55]. Power gating, on the other hand, reduces static leakage power

by using a sleep transistor [4, 56] to cut-off power to the disabled hardware.

Although dynamic power has been the dominant source of power dissipation in the past,

static leakage power is becoming increasingly important. In fact, it may account for more than

50% of the processor power dissipation in future technology generations [57]. Static leakage

power depends on several factors such as number of transistors (area), source voltage, transistor

size, junction-temperature and circuit design. We aim at reducing static leakage power by

reducing the effective number of transistors – by switching off large parts of the chip when not

required. This technique is orthogonal to other methods for reducing static leakage power such

as improving process technology, using smaller transistors, etc.

We study algorithms for tuning relatively large multi-configuration hardware structures


such as caches and the branch predictor. We chose to tune these units because they account

for more than 65% of transistors in current processors such as the Pentium M [58] and the

fraction is likely to grow in the future [59]. Moreover, reconfiguration of some of these units is

associated with very high overheads (thousands to several hundred thousand cycles), which makes the tuning

algorithms more challenging.

5.2. Simulation Setup

Preliminary studies of tuning algorithms [14, 15, 60, 61] were performed using modified

versions of the SimpleScalar [51] and Alphasim [62] simulators. However, for studies reported

in this dissertation, we used PharmSim [63] – a full-system execution-driven timing simulator.

The reason for moving to PharmSim was to measure reconfiguration overheads for a realistic

memory sub-system and to ensure correctness of the mOS memory protection and interrupt

mechanisms in the presence of the operating system.

PharmSim is a PowerPC ISA [64] based simulator that augments SimOS-PPC [65, 66],

a full-system emulator, with detailed models for the processor and memory sub-system. The

simulator emulates the entire system including processor, disks, consoles, Ethernet, etc., and

boots up a slightly modified version of the IBM AIX 4.3.1 OS. The processor modeled is

a conventional out-of-order superscalar processor, with a Sun Gigaplane-XB [67] like cache

coherence protocol. Several modifications were made to PharmSim to enable execution of

virtual machine software and to implement reconfiguration mechanisms. These are described

in later sections along with descriptions of the various mechanisms.


5.2.1. Benchmarks

The tuning algorithms were studied using the SPEC CPU2000 benchmarks. The binaries

were compiled using base optimizations. Table 5.1 lists the compilers and flags used to build

the benchmarks. The flags used are similar to those used for reporting results for the IBM

RS/6000 Model S80 machines. For most floating-point benchmarks, we also added the non-

standard option -qarch=com.

Table 5.1. Compilers and flags used to build the SPEC CPU2000 benchmarks for the PowerPC-AIX environment.

SYSTEM:
IBM RS/6000 Model S80, RS64 III (450 MHz), AIX 4.3.3

COMPILERS:
C   - IBM VAC++ 5.0.2.0 invoked as cc
CXX - IBM VAC++ 5.0.2.0 invoked as xlC
F77 - IBM xl Fortran 7.1.0.2 invoked as xlf
F90 - IBM xl Fortran 7.1.0.2 invoked as xlf90

FLAGS:
Integer Baseline Optimization
COPTIMIZE   = -O4
CXXOPTIMIZE = -qpdf1/pdf2 -O3 -qarch=ppc -qtune=rs64b

Floating-Point Baseline Optimization
# wupwise sixtrack
COPTIMIZE = -O5 -lmass
FOPTIMIZE = -O5 -lmass
# swim mgrid applu mesa art equake ammp lucas fma3d apsi
COPTIMIZE = -O5 -qarch=com
FOPTIMIZE = -O3 -qarch=com

The -qarch=com option was added in order to speed up simulation of floating-point

benchmarks. When running in an X86-Linux environment, PharmSim executes certain floating-

point instructions such as fsqrt on a central PowerPC-AIX machine using remote procedure


calls. This significantly slows down simulations for most floating-point benchmarks. The

-qarch=com flag suppresses generation of these instructions by directing the compiler to

generate instructions common to various flavors of the POWER and PowerPC ISAs.

Figure 5.1. Fraction of execution time spent in various parts of software, i.e., user code, shared library code, kernel code (non-idle), and idle loop (waiting for I/O). Each benchmark is associated with four stacks. The stacks show the execution time breakup for the first 5, 10, 15, and 20 billion instructions respectively. Benchmarks using train inputs have been marked with N.

Reference input sets were used, unless the benchmark spent a large fraction of the first 20

billion instructions in an idle loop, waiting for I/O. In such cases train inputs were used. For

wupwise and apsi, train inputs could not be used because the benchmark run completed too

quickly.


Figure 5.1 shows, for each benchmark, the fraction of execution time spent in various

parts of software, i.e., user code, shared library code, kernel code (non-idle), and the idle loop

(waiting for I/O). This data was collected using knowledge of virtual address ranges for various

parts of software in IBM AIX based systems [68]. Table 5.2 summarizes these virtual address

ranges.

Table 5.2. Virtual address ranges for various parts of software in IBM AIX based systems.

code             virtual address range
user             0x10000000 - 0x2fffffff
shared library   0xd0000000 - 0xdfffffff
kernel           0x10000000 - 0x2fffffff
idle loop        0x00025270 - 0x0002533c

As expected, almost all SPEC CPU2000 benchmarks spend more than 95% of the time in

user and shared library code. Thus, most of the phase behavior results reported in Chapter 3 are

generally applicable.

5.2.2. Baseline Configuration

The baseline microarchitecture was chosen to reflect current generation commercial mi-

croprocessors such as the AMD Opteron [69], IBM POWER4 [70], Intel Pentium 4 [71] and

Intel Pentium M [72] processors. Table 5.3 summarizes important parameters of the baseline

microarchitecture and Figure 5.2 shows its performance in terms of CPI.

Although the baseline resembles a state-of-the-art commercial microprocessor, some of

the parameters were adjusted to ensure that the microarchitecture was not over-aggressive for

the benchmarks studied. For example, an 8K-entry branch predictor was used instead of a

16K-entry branch predictor used in POWER4, since the latter did not provide a significant performance

improvement (i.e., at least 2% in CPI) on any benchmark. (It should be noted that the POWER4 branch

predictor uses 1 bit per entry, as opposed to the 2 bits per entry used by conventional predictors.)

An over-aggressive baseline can artificially improve tuning results. We ensure that the

baseline configurations (sizes) for each of the units being tuned, i.e., I-cache, D-cache, L2-cache,

and branch predictor, are not too aggressive. Figures 5.3 - 5.6 show the performance loss due to

independently reducing the size of each of these units. The smaller configurations considered

are those provided by the multi-configuration units (discussed in the next section). For each

unit, there is at least one benchmark that suffers significant performance loss due to the next

smaller size available, which clearly indicates that the baseline microarchitecture is justified.

Table 5.3. Baseline microarchitecture parameters.

Processor core
Fetch/Decode/Issue/Commit   4-wide
Queues                      16 entry fetch queue
                            32 entry issue queue
                            64 entry load/store queue
                            128 entry re-order buffer
Functional Units            Integer: 4 ALU, 2 Multipliers
                            Floating-point: 2 ALU, 2 Multipliers
Front-end pipeline depth    12 stages
Branch Prediction           Combining predictor: 8K entry meta, gshare (10-bit global history), bi-modal predictors
                            4K entry, 4-way branch target buffer
                            64 entry return address stack

Memory sub-system
Cache hierarchy             Inclusive
Level-1 Instruction cache   64KB, 2-way, 64B line, writeback, 1 cycle latency
Level-1 Data cache          64KB, 2-way, 64B line, writeback, 1 cycle latency
Level-2 Unified cache       2MB, 16-way, 64B line, writeback, 7 cycle latency
Write buffer                32 entry
Memory                      DRAM
Size                        128 MB
Latency                     200 processor clock cycles


Figure 5.2. Performance (in terms of CPI) achieved by the baseline microarchitecture.

Figure 5.3. Performance loss due to smaller sized instruction cache configurations, relative to the baseline configuration. The baseline cache is 64KB 2-way set associative. The smaller cache configurations have the same associativity, but fewer sets.

5.2.3. Reconfiguration Mechanisms

The adaptive microarchitecture studied employs four multi-configuration units viz. in-

struction and data caches, L2-cache, and branch predictor. Each of the four multi-configuration

units can be configured into four different sizes as shown in Table 5.4. The instruction and data

caches are reconfigured by disabling sets, while the L2-cache is reconfigured by disabling ways

(i.e. changing associativity). Structures are disabled using a combination of clock gating (for

reducing dynamic power) and power gating (for reducing static leakage power).


Figure 5.4. Performance loss due to smaller sized data cache configurations, relative to the baseline configuration. The baseline cache is 64KB 2-way set associative. The smaller cache configurations have the same associativity, but fewer sets.

Figure 5.5. Performance loss due to smaller sized L2-cache configurations, relative to the baseline configuration. The baseline cache is 2MB 16-way set associative. The smaller cache configurations have the same number of sets, but lower associativity.

Table 5.4. Available configurations for the multi-configuration units.

Unit                Configurations             Mechanism
Instruction cache   2KB, 8KB, 32KB, 64KB       varying number of sets
Data cache          8KB, 16KB, 32KB, 64KB      varying number of sets
L2-cache            256KB, 512KB, 1MB, 2MB     varying associativity
Branch predictor    1K, 2K, 4K, 8K entry       varying number of entries


Figure 5.6. Performance loss due to smaller sized branch predictor configurations, relative to the baseline configuration. The baseline branch predictor has 8K entries.

The branch predictor does not contain architected state and so disabling predictor entries

does not lead to correctness issues. However, it does lead to performance loss since the pre-

dictor has to be warmed up when the entries are enabled. Caches, on the other hand, contain

architected state and thus reconfiguring caches requires certain additional steps to maintain

correctness. All valid cache lines belonging to disabled sets (or ways) have to be invalidated.

Also, if the mapping of a cache line changes after reconfiguration, it has to be invalidated.

The cache coherence protocol ensures that dirty lines are written back to the lower level of the

cache hierarchy. In case of an inclusive cache hierarchy, the coherence protocol also causes

invalidations and subsequent writebacks in the higher level of the hierarchy.

Simplicity was the main goal while implementing reconfiguration mechanisms. This led

to increased reconfiguration overheads, but we prefer to handle the extra overhead in the tuning

algorithms rather than add extra hardware complexity. In order to avoid subtle race conditions, all

reconfigurations are performed after the pipeline is drained and the system is quiesced, i.e., no

events are in progress. Most of the mechanisms required for supporting reconfiguration are

present in modern microarchitectures. Mechanisms for draining the pipeline are required to

implement instructions such as system calls. Also, most cache coherence protocols support


cache line invalidates to implement instructions such as dcbf (Data Cache Block Flush, which copies modified cache blocks to main storage and invalidates the copy in the data cache) in the PowerPC ISA [64].

Invalidation of cache lines can be performed by software by issuing cache flush instruc-

tions such as dcbf. However, to keep performance overheads low, we chose to implement

the functionality in hardware. The additional hardware required is minimal because the state

machine is quite simple. To be conservative, only one valid cache line is invalidated every

cycle.

The direct reconfiguration overhead is composed of 1) time to quiesce the system, 2) la-

tency of turning on/off transistors, and 3) in case of caches, time to invalidate cache lines.

While we model the overheads associated with system quiesce and invalidation of cache lines

fairly accurately, we do not model the latency to turn on/off cache lines. This is mainly be-

cause the latency is highly dependent upon the underlying circuits and the data is not readily

available. Tschanz et al. [73] report that the latency to turn off a particular ALU circuit is less

than a microsecond and the latency to turn it on is a few nanoseconds. Latencies of this order

are easily masked by other overheads and do not affect accuracy of the results.
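As a rough illustration of how these three components combine, the sketch below totals them for a single cache reconfiguration. All of the input numbers are placeholders rather than measured values, and the one-line-per-cycle invalidation rate is the conservative rate described earlier.

```python
# Illustrative composition of the direct reconfiguration overhead for a
# cache: quiesce time + electrical (turn on/off) latency + invalidation
# time. Inputs are placeholders, not measured values.

def direct_overhead_cycles(quiesce_cycles: int,
                           electrical_cycles: int,
                           lines_invalidated: int) -> int:
    # Invalidation proceeds conservatively at one valid line per cycle,
    # so its cost equals the number of lines to invalidate.
    return quiesce_cycles + electrical_cycles + lines_invalidated

# Example: shrinking a 64KB cache with 64B lines (1024 lines) to 32KB
# invalidates at most 512 lines; with a quiesce on the order of 100
# cycles and negligible electrical latency, the direct overhead is a
# few hundred cycles.
print(direct_overhead_cycles(100, 0, 512))
```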

Writebacks due to invalidation are the main source of performance overhead while recon-

figuring caches. Reconfiguration by changing associativity has the advantage that it does not

change the mapping of cache lines. This leads to reduced direct overhead because 1) only dis-

abled cache lines have to be invalidated, and 2) while increasing associativity, no invalidations

(and writebacks) need to be performed. The disadvantage is that the largest configuration may

require a very high associativity which may increase cache access latency. This is not desirable

in L1 caches since it can have an adverse effect on clock cycle time and/or pipeline depth.

We reconfigure the L2-cache by changing associativity because it has a very high reconfigura-

tion overhead and is more tolerant to a slight increase in the cache access latency. Moreover,


highly associative caches can be made more power efficient as demonstrated by the Pentium

M processor [58].

For reasons discussed above, L1 caches are reconfigured by changing the number of sets.

Changing the number of sets changes mapping of cache lines which leads to additional invali-

dates (and writebacks) thereby increasing the direct reconfiguration overhead. However, this is

not a major concern for L1 caches since the overheads are relatively low to begin with. There is

also an associated area (and power) overhead because a larger number of tag bits are required

to support the smallest configuration. Assuming a 48-bit physical address and a physically

tagged instruction cache with configurations shown in Tables 5.3 and 5.4, the area overhead of

extra tag bits is 0.1%.

5.3. Tuning Individual Units

This section describes and evaluates algorithms for tuning a single multi-configuration

unit. The proposed algorithms use program phase information to guide tuning. This helps in

reducing the number of tunings (and reconfigurations), which is key to implementing efficient

tuning algorithms. Before describing the algorithms, it is worth pointing out the shortcomings

of previously proposed algorithms.

1. Some algorithms perform tuning at pre-established fixed intervals of time [2–4]. This

may lead to unnecessary reconfigurations since tuning may be performed even though

the program phase may not have changed. Moreover, if the program phase changes while

tuning is underway, it goes undetected and may lead to instabilities in the algorithm.

2. Many algorithms detect phase changes by monitoring some performance metric such as

miss-rate [4]. Such algorithms are intrinsically unstable because they cannot differen-


tiate performance changes due to phase transitions, from performance changes due to

reconfiguration.

3. None of the previously proposed algorithms keep a record of previous phases – leading

to unnecessary tuning every time a phase recurs.

4. Finally, some of the proposed algorithms use parameters that are manually tuned to in-

dividual applications [4].

The proposed algorithms solve, or at least alleviate, all of these problems. These algo-

rithms are based on the observation that performance of programs is relatively constant within

a phase. So, the number of reconfigurations can be minimized by tuning only at phase boundaries.

Moreover, since phases recur in time, the tuning results can be reused for recurring phases,

thereby further reducing reconfigurations.

5.3.1. Signature Based Tuning Algorithms

Basic Algorithm

Figure 5.7 shows the state machine for the basic tuning algorithm. The algorithm is in-

voked at the end of each granularity-sized tuning interval. Phase changes are detected by

comparing the relative signature distance (δ) between the previous and the current interval

against a difference threshold δth; δ > δth indicates a potential phase change. Since working

set signatures are implementation-independent, phase changes can be detected irrespective of

the configuration.

The tuning algorithm has three states: UNSTABLE – when the phase behavior is unstable

i.e. the phase is in transition, TUNING – when the phase behavior is stable (i.e. the phase does


not change across intervals) and different configurations are being explored, and STABLE –

when the phase behavior is stable and the best configuration has been found and enabled.

The algorithm starts in the UNSTABLE state (with configuration maximized) and stays in

it until a stable region is detected (δ ≤ δth). Tuning is not performed in unstable regions since

the results are nondeterministic. When a stable region is detected, the algorithm transitions to the

TUNING state and starts searching for the best configuration using a trial and error mechanism.

While in TUNING, the algorithm tries a different configuration every interval and uses

profile information such as CPI to decide which configuration works best. When found, this

configuration is enabled and the algorithm transitions to STABLE state and stays there until a

phase transition occurs.

Figure 5.7. State machine for the basic tuning algorithm. The state machine has three states: UNSTABLE, TUNING, and STABLE. State transitions are performed at the end of every granularity-sized tuning interval. δ represents the relative signature distance while δth is the difference threshold.
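The following Python sketch captures the state machine of Figure 5.7. The interface to the multi-configuration unit (set_max_config, try_next_config, and so on) is a hypothetical placeholder, not the actual implementation.

```python
# Minimal sketch of the basic three-state tuning algorithm; the unit
# interface below is hypothetical.

UNSTABLE, TUNING, STABLE = range(3)

class BasicTuner:
    def __init__(self, unit, delta_th=0.04):
        self.unit = unit            # multi-configuration unit being tuned
        self.delta_th = delta_th    # difference threshold
        self.state = UNSTABLE
        self.unit.set_max_config()  # start with the configuration maximized

    def end_of_interval(self, delta, cpi):
        """Invoked at the end of each granularity-sized tuning interval
        with the relative signature distance and the interval's CPI."""
        if delta > self.delta_th:            # potential phase change
            self.state = UNSTABLE
            self.unit.set_max_config()
        elif self.state == UNSTABLE:         # phase has stabilized
            self.state = TUNING
            self.unit.try_first_config()
        elif self.state == TUNING:           # trial-and-error search
            self.unit.record_result(cpi)
            if self.unit.more_configs_to_try():
                self.unit.try_next_config()
            else:
                self.unit.enable_best_config()
                self.state = STABLE
        # in STABLE there is nothing to do until the phase changes
```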


Signature Density Based Algorithm

Performance of caches and branch predictors is a strong function of the corresponding

working set sizes because working set size governs the capacity misses [74, 75]. Instruction

cache performance depends on instruction working set size. Similarly, data cache and branch

predictor performance depends on data and branch working set sizes respectively. A cache (or

branch predictor) that is too small to hold the working set can cause a large number of capacity

misses leading to performance loss.

Working set size can be estimated from signature density as shown in Section 3.2.3. This

estimate can be used to configure the corresponding unit to be just big enough to contain the

working set of the program. If the working set is small, this can lead to significant power

savings without any performance loss. Moreover, since the trial and error process is bypassed,

there are fewer reconfigurations and less time is spent in non-optimal configurations.

Figure 5.8. State machine for the signature density based tuning algorithm. Unlike the basic algorithm, the state machine has only two states – STABLE and UNSTABLE.

Figure 5.8 shows the state machine for the signature density based algorithm. The algo-

rithm has only two states – STABLE and UNSTABLE. It starts in UNSTABLE state and remains

there until phase behavior stabilizes (δ < δth). When a stable region is detected, the unit is

configured using the estimated working set size and the algorithm moves to STABLE state. It

transitions back to UNSTABLE state when the phase changes.

The signature used for phase detection tracks the instruction working set, and thus can be


used to configure the instruction cache. The instruction working set size Siwset (bytes) can be estimated from the signature density f as follows:

$$S_{iwset} = \frac{\log(1-f)}{\log\left(1 - \frac{1}{n}\right)} \cdot 2^b \qquad (5.1)$$

where n is the signature size and b is the number of bits by which the PCs are shifted.

The signature density can also be used to roughly estimate the branch working set. The branch working set size Sbwset (number of branches) is given by

$$S_{bwset} = \frac{S_{iwset}}{IW_{size} \cdot BB_{size}} \qquad (5.2)$$

where IWsize is the instruction word size in bytes (for ISAs with multiple instruction word sizes, e.g. x86, this is the average instruction word size) and BBsize is the average basic block size (number of instructions).
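The sketch below transcribes Eqns. 5.1 and 5.2 into Python. The example parameter values (signature size, shift amount, density, instruction word size, and basic block size) are illustrative assumptions.

```python
import math

# Estimate instruction and branch working set sizes from the signature
# density f, following Eqns. 5.1 and 5.2.

def instruction_wset_bytes(f: float, n: int, b: int) -> float:
    # Eqn. 5.1: S_iwset = [log(1 - f) / log(1 - 1/n)] * 2^b
    return math.log(1.0 - f) / math.log(1.0 - 1.0 / n) * (2 ** b)

def branch_wset_entries(s_iwset: float, iw_size: float, bb_size: float) -> float:
    # Eqn. 5.2: S_bwset = S_iwset / (IW_size * BB_size)
    return s_iwset / (iw_size * bb_size)

# Example: a 1024-bit signature (n = 1024), PCs shifted by b = 5 bits,
# signature density f = 0.25, 4-byte instructions, and an average basic
# block of 5 instructions (all illustrative numbers).
s = instruction_wset_bytes(0.25, 1024, 5)
print(round(s), round(branch_wset_entries(s, 4, 5)))
```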

The signature density algorithm is not directly applicable to data caches because to calcu-

late data working set size, data working set signatures are needed. We found that even when

data working set signatures are used, the algorithm does not work well. The main reason is

that the data working set sometimes changes within a program phase. This is expected because

a single piece of code can successively access different elements of a large data set. A simple

example is a loop that reads elements of a large array.

One possible solution to this problem is to use the data working set signatures to detect

phase changes. However, our studies showed the stability of such phases is very low making

the algorithms ineffective. Thus, we use the signature density algorithm to tune the instruction

cache and branch predictor.


History Based Algorithm

The history based algorithm improves upon the basic tuning algorithm by reusing tun-

ing results for recurring phases. The algorithm maintains the tuning history in a phase table.

Specifically, the phase table keeps a record of working set signatures and associated optimal

configurations for previously tuned phases. The phase table can be implemented in either hard-

ware or software, but we prefer software, since it provides flexibility and requires no special

hardware.

Figure 5.9. State machine for the history based tuning algorithm. The state machine is similar to that of the basic tuning algorithm, except for the phase table lookup and update.

Figure 5.9 shows the state machine for the history based algorithm. It is similar to the

basic algorithm except it includes a phase table lookup before tuning and a phase table update

after tuning. The algorithm starts in the UNSTABLE state with configuration maximized. When

phase behavior stabilizes (δ < δth), a phase table lookup is performed to see if the phase was

tuned in the past. The lookup is performed by comparing the current signature with the stored

signatures using the relative signature distance metric. If a match is found (δ < δth), the optimal

configuration is read and enabled, and the algorithm transitions to STABLE state. If no match


is found, then the algorithm moves to TUNING. When tuning completes, the phase table is

updated with the working set signature and optimal configuration and the algorithm moves to

STABLE state.

Trial and error based tuning algorithms, by their very nature, spend time in non-optimal

configurations – leading to performance loss. The history based algorithm eliminates time

spent in non-optimal configurations when the tuning results are found in the phase table. This,

in addition to reducing the direct and indirect overheads, can lead to significant performance

improvement over the basic algorithm.

Phase table lookups (in software) can lead to some performance overhead. However, this

can be minimized through efficient data structures such as hash tables. Moreover, the phase ta-

ble entries can be augmented with other information such as signature density to make lookups

more efficient.
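A minimal Python sketch of such a software phase table is shown below. The signature representation and the exact form of the relative signature distance are illustrative assumptions, and a simple linear scan stands in for the more efficient hash table lookup suggested above.

```python
# Sketch of the software phase table: each entry pairs a working set
# signature with the optimal configuration found when the phase was
# last tuned.

def rel_signature_distance(sig_a: int, sig_b: int) -> float:
    # One plausible form of the relative signature distance between two
    # bit-vector signatures (stored here as ints): the number of bits
    # that differ divided by the number of bits set in either signature.
    differ = bin(sig_a ^ sig_b).count("1")
    either = bin(sig_a | sig_b).count("1")
    return differ / max(either, 1)

class PhaseTable:
    def __init__(self, delta_th=0.04):
        self.delta_th = delta_th
        self.entries = []  # list of (signature, best_config) pairs

    def lookup(self, sig):
        """Return the stored configuration for a matching phase
        (distance below threshold), or None if the phase is new."""
        for stored_sig, config in self.entries:
            if rel_signature_distance(sig, stored_sig) < self.delta_th:
                return config
        return None

    def update(self, sig, best_config):
        # Record a tuning result so recurring phases can skip tuning.
        self.entries.append((sig, best_config))
```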

5.3.2. Evaluation Methodology

The tuning algorithms were applied to the instruction cache, data cache, L2-cache, and

branch predictor. In order to shorten evaluation time, we collected detailed profile data for

different configurations and evaluated tuning algorithms by processing the profile data offline.

This methodology also provided deeper insights into the functioning of the various tuning

algorithms.

Each benchmark was run for the first 4 billion instructions. Since there are four multi-

configuration units with four different configurations each, 13 simulations (1 baseline + 4 units

x 3 smaller configurations) were run per benchmark to cover the configuration space. Profile

data was collected at intervals of 100K instructions in a way that data for larger intervals (e.g.

1M instructions) can be accurately constructed.


In order to ensure that each of the 13 simulations executed the same instructions, the num-

ber of branches and memory access instructions committed were checked at 100K instruction

intervals. No inconsistencies were found because PharmSim uses full-system checkpoints and

interrupt events are aligned with committed instructions.

Power savings achieved by various algorithms are compared using the average unit size

metric. The lower the average unit size achieved, the larger the power savings. Average unit size is a

good measure of relative power savings because static power dissipation is proportional to ac-

tive area. In fact, most microarchitectural power estimation tools [76–80] effectively compute

power dissipation by scaling the average unit size by a power factor. The performance loss is

measured in terms of percentage increase in CPI. The CPI is adjusted to account for various

reconfiguration overheads, as discussed in the next section. The reconfiguration overheads are

computed separately as described in Section 5.3.3.

Modeling Performance Loss

The tuning process may lead to performance losses due to 1) trying sub-optimal configura-

tions, i.e., configurations which cause more than 2% CPI increase, and 2) overheads associated

with the reconfiguration process itself. The first cause of performance loss is intrinsic to trial-

and-error based algorithms. By their very definition, these algorithms have to try sub-optimal

configuration(s) before they select an optimal configuration. It should be noted that “optimal”

does not refer to the global optimal, but just the optimal amongst a given set of configurations.

The second cause of performance loss, reconfiguration overhead, can be classified into direct

and indirect overheads. The direct overhead can be attributed to the following.

1. Quiesce overhead: The time required to drain the pipeline and quiesce the system. This

overhead can vary widely because it is highly dependent upon the state of the pipeline


when reconfiguration is scheduled. For units which can be reconfigured without draining

the pipeline, the quiesce overhead is zero.

2. Electrical latency: The minimum time required to charge/discharge the effective capaci-

tance of the reconfigured hardware structure, and to stabilize the circuits. The electrical

latency is largely governed by the underlying process and circuit technology and the size

of the structure being reconfigured.

3. Invalidation overhead: This overhead applies mainly to caches and is the time required

to invalidate disabled cache lines. The invalidation overhead depends on the number of

cache lines disabled and the bandwidth available to writeback lines to lower levels of the

cache hierarchy.

The indirect overhead is attributed to the extra cycles incurred due to cold misses asso-

ciated with warming up a unit after reconfiguration. The indirect overhead essentially de-

pends upon the amount of implementation state that needs to be re-acquired after reconfigura-

tion. Figure 5.10 illustrates the miss transients caused by reconfiguration. Sizing down a unit

(L → S) flushes out some state belonging to the working set of the program. This causes the

miss rate to increase beyond the steady state miss rate for the smaller configuration. While

sizing up a unit (S → L), the miss rate is already higher than the steady state miss rate for the

larger configuration. In both cases, the miss rate eventually converges with the steady state,

after the unit warms up.

The net performance overhead (∆CPI) associated with tuning is given by

$$\Delta CPI = \Delta CPI_{subopt} + \frac{1}{N}\sum_{i=1}^{N_R}\left(\tau_{direct,i} + \tau_{indirect,i}\right) \qquad (5.3)$$

where ∆CPIsubopt is the CPI increase due to being in sub-optimal configurations, N is the total number of committed instructions, NR is the number of reconfigurations caused by a tuning algorithm, and τdirect,i and τindirect,i are the direct and indirect overheads associated with the ith reconfiguration.

Figure 5.10. The schematic shows transient misses resulting from reconfiguration. L → S and S → L denote transitions from a large to a small configuration and vice versa. The gray lines show the steady-state misses in each configuration. The shaded area represents the extra misses incurred while warming up the unit.

Assuming that a reconfiguration is performed at the end of every tuning interval, the frac-

tion NR/N reduces to 1/T , where T is the length of the tuning interval. While this seems to

be a simplistic assumption, it often holds because program phases are very short. Based on this

approximation, Eqn. 5.3 is reduced to

$$\Delta CPI = \Delta CPI_{subopt} + \frac{\tau_{avg}}{T} \qquad (5.4)$$

where τavg is the average reconfiguration overhead.

∆CPIsubopt is an inherent property of the tuning algorithm and cannot be easily controlled.

The choice of tuning interval (T ) is thus the key algorithmic parameter that controls the per-

formance overheads associated with tuning. Given a performance loss tolerance ∆CPItol (e.g.

2%), the tuning interval T must be chosen such that

$$\tau_{avg}/T \leq \Delta CPI_{tol}. \qquad (5.5)$$

This is an intuitive result which says that the tuning interval should be large enough to amortize

reconfiguration overheads. In other words, the CPI contribution of reconfiguration overheads


(∆CPIrecon) should be less than the specified performance loss tolerance. For example, if the

specified tolerance is 2% and the reconfiguration overhead is 1000 cycles on average, then the

tuning interval should be at least 50K instructions.
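The worked example above amounts to the following one-line calculation; the assumption that one overhead cycle maps to one instruction (i.e., a baseline CPI of about 1) is ours.

```python
# Smallest tuning interval (instructions) that keeps the CPI
# contribution of reconfiguration overheads within the tolerance,
# per Eqn. 5.5; assumes a baseline CPI of ~1 so that cycles map
# one-to-one to instructions.

def min_tuning_interval(tau_avg_cycles: float, tol: float = 0.02) -> float:
    return tau_avg_cycles / tol

print(min_tuning_interval(1000))  # -> 50000.0 instructions
```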

Of course, the tuning interval can be made arbitrarily large to reduce the impact of recon-

figuration overheads. However, a small tuning interval is desirable because it provides more

opportunities for saving power. For example, consider a program that alternates between two

phases with widely varying instruction working set sizes. Assume that the phases are 1M in-

structions long, each, and one phase requires a 64KB I-cache whereas the other requires only a

2KB I-cache for good performance. A tuning algorithm with a 100K instruction interval may

be able to exploit the opportunity to give almost 50% power savings while the same algorithm

with a 2M instruction interval may give no power savings without hurting performance.

Classifying Multi-configuration Units

The CPI contribution of reconfiguration overheads can be made arbitrarily small by in-

creasing the tuning interval. However, the CPI contribution of the penalty associated with

being in a sub-optimal configuration, Psubopt, cannot be controlled in a similar fashion. (Note that ∆CPIsubopt is the CPI contribution of Psubopt; the former depends upon the fraction of time a unit is in a sub-optimal configuration, while the latter does not.) We

classify units into low- and high-overhead units based on this penalty.

For a given unit, Psubopt can be roughly gauged from the performance loss incurred by

statically configuring the unit to a smaller size. For the microarchitecture studied here, Psubopt

can be gauged from Figures 5.3 - 5.6. Based on these figures, a smaller I-cache (32KB) leads

to a 1.4% CPI increase, a smaller D-cache (32KB) leads to a 1% CPI increase, a smaller L2-cache

(1MB) leads to an 8.6% CPI increase, and a smaller branch predictor (4K) leads to a 0.9% CPI increase.

Since Psubopt for I-cache, D-cache, and branch predictor is relatively small compared to


the performance loss tolerance (2%), we classify these units as low-overhead units. Psubopt for

the L2-cache is significantly higher than 2%; thus, it is classified as a high-overhead unit.

5.3.3. Reconfiguration Overheads

The direct and indirect reconfiguration overheads were estimated via profiling runs. Intuitively, these overheads are a function of the particular transition, i.e., the current and next configurations. Thus, the overheads were computed on a per-transition (i → j) basis, by averaging

over all benchmarks. The notation i → j represents a reconfiguration (or transition) from

configuration i to j. i and j take integral values from 0 through 3, each representing progres-

sively larger configurations from Table 5.4. For example, reconfiguration 3 → 0 represents a

reconfiguration from 64KB to 2KB for the I-cache and 2MB to 256KB for the L2-cache.

Direct Overheads

The direct overheads were estimated via simulations run for 1 billion instructions with

forced reconfigurations every 5M instructions. For each reconfiguration, the next configuration

was chosen at random in order to cover the entire space.

Figure 5.11. System quiesce overheads for various I-cache, D-cache, and L2-cache reconfigurations. The overhead is measured in terms of number of cycles. (Bar chart: cycles for each transition i → j and the average, with separate bars for the I-cache, D-cache, and L2-cache.)


System quiesce overheads (τq) for various I-cache, D-cache, and L2-cache reconfigurations are shown in Figure 5.11. Overheads for the branch predictor are the same as those for the

I-cache, and are not shown in the figure. The overheads for I-cache and D-cache reconfigura-

tion are similar and more or less independent of the transition. In case of L2-cache reconfigura-

tions, the overhead is higher mainly due to transitions from smaller configurations because the

probability of the pipeline waiting for a memory access to return is higher for smaller L2-cache

configurations. This leads to more time spent draining the pipeline and quiescing the system.

The overhead also varies considerably across benchmarks. The variation is most striking in the case of L2-cache reconfiguration. Figure 5.12 shows the minimum and maximum overheads for

each transition – averaged over individual benchmarks. The quiesce overhead can be as high

as 2000 cycles in certain benchmarks with a large number of L2-cache misses.

Figure 5.12. System quiesce overheads for various L2-cache reconfigurations, averaged over all benchmarks. The lines show the minimum and maximum overheads when averaged over individual benchmarks.

Figure 5.13 shows the cache line invalidation overhead (τi) for the I-cache, D-cache, and

L2-cache. The invalidation overhead is zero in case of the branch predictor since it does not

contain architected state. In general, the invalidation overhead increases with the number of

cache lines disabled. As discussed before, I and D-cache reconfigurations from a smaller to


larger size are also associated with an overhead because lines whose mappings change in the

new configuration have to be flushed.

Figure 5.13. Cache line invalidation overheads for various I-cache, D-cache, and L2-cache reconfigurations, averaged over all benchmarks. The lines show the minimum and maximum overheads when averaged over individual benchmarks. (The I-cache and D-cache panels are in cycles; the L2-cache panel is in thousands of cycles.)

I-cache line invalidation does not cause writebacks, but due to our conservative assumption of invalidating one line per cycle, it suffers an overhead similar to that of the D-cache. In practice, the I-cache can be flash invalidated, reducing the overhead to almost zero. On average, the invalidation overhead for the I-cache and D-cache is less than 1000 cycles, while that for the L2-cache can be as high as 150,000 cycles.


Indirect Overheads

The indirect overhead is caused by the additional misses incurred just after reconfiguration

as shown in Figure 5.10. The number of extra misses incurred during warmup (misseswarmup)

is given by the area of the shaded regions in the figure. The indirect overhead (τindirect) can be

estimated from misseswarmup using the relation

$$\tau_{indirect} = misses_{warmup} \cdot P_{miss} \qquad (5.6)$$

where Pmiss is the average miss penalty, i.e., the cost of a miss in terms of cycles. misseswarmup for various units is estimated by periodically forcing reconfigurations from configuration 3 → 2 and vice versa. misseswarmup for other transitions (X → Y) is estimated by scaling these numbers using the relation

$$misses_{warmup}(X, Y) = misses_{warmup}(3, 2) \cdot \frac{|size(X) - size(Y)|}{|size(3) - size(2)|} \qquad (5.7)$$

where size(X) is the unit size corresponding to configuration X. The number of extra misses is computed every 5K instructions by subtracting the steady state misses for the particular

configuration. Figure 5.14 shows the extra transient misses caused by reconfiguring various

units. The curves look similar to the ones shown in Figure 5.10. The number of extra misses

accrued over a given tuning interval can be computed from these curves.
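A short sketch of how Equations 5.6 and 5.7 combine to estimate the indirect overhead of an arbitrary transition is shown below. All names are ours, and the measured 3 → 2 warmup-miss count and the miss penalty used in the example are placeholders, not values from this study.

    def warmup_misses(x, y, measured_3_to_2, size):
        """Scale the measured 3->2 warmup misses to a transition x->y (Eqn. 5.7).
        `size` maps a configuration index to its unit size."""
        return measured_3_to_2 * abs(size[x] - size[y]) / abs(size[3] - size[2])

    def indirect_overhead(x, y, measured_3_to_2, size, miss_penalty):
        """Indirect reconfiguration overhead in cycles (Eqn. 5.6)."""
        return warmup_misses(x, y, measured_3_to_2, size) * miss_penalty

    # Hypothetical example with the I-cache sizes (in KB) and a 7-cycle penalty.
    icache_kb = {0: 2, 1: 16, 2: 32, 3: 64}
    cycles = indirect_overhead(3, 0, measured_3_to_2=100, size=icache_kb,
                               miss_penalty=7)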

The miss penalties were also estimated empirically, via simulation. Unfortunately, we

could not simulate perfect caches and branch predictors with PharmSim. Thus, we calculated

the miss penalty using the following relation

$$P_{miss} = \frac{Cycles_L - Cycles_S}{Misses_L - Misses_S} \qquad (5.8)$$

where the subscripts L and S denote the largest and smallest configuration, respectively, from those shown in Table 5.4.


Figure 5.14. Number of extra transient misses vs. time, caused by reconfiguring the I-cache, D-cache, L2-cache, and branch predictor. Time is measured in terms of number of instructions. The results are averaged over all benchmarks.

In order to reduce inaccuracies, we computed the penalties every 100K instructions and filtered out points where the difference in misses was less than 400. For L1 caches, misses that subsequently miss in the L2-cache can increase the penalty of an L1 miss. We artificially filter out L2 misses by reducing the DRAM access latency to 1 cycle in the simulations used to compute the cost of L1 misses.
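A minimal sketch of this estimation procedure, assuming per-100K-instruction samples of cycle and miss counts from paired runs of the largest and smallest configurations (function and parameter names are ours):

    def miss_penalty(samples, min_miss_delta=400):
        """Average miss penalty per Eqn. 5.8. Each sample is a tuple
        (cycles_large, cycles_small, misses_large, misses_small) taken over the
        same 100K instructions; samples whose miss difference is below
        `min_miss_delta` are filtered out, as described in the text."""
        points = [(cl - cs) / (ml - ms)
                  for cl, cs, ml, ms in samples
                  if abs(ml - ms) >= min_miss_delta]
        return sum(points) / len(points) if points else None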

Figure 5.15. Average cost for various cache misses and branch mis-predicts. The cost is measured in terms of number of processor clock cycles. (Per-benchmark panels: I-cache, D-cache, and L2-cache miss penalties, and branch mis-predict penalty.)

Figure 5.15 shows the cost of various misses and mis-predicts. The results roughly match those predicted by the analytical models presented by Karkhanis and Smith [81]. The average I-cache miss penalty is ≈7 cycles. This is expected since the I-cache miss penalty equals the

L2-cache access (and hit) latency, which is 7 cycles (see Table 5.3).

Unlike I-cache misses, D-cache misses can be hidden by out-of-order execution and thus the penalty is much lower. Figure 5.15 shows that the D-cache miss penalty varies widely across benchmarks depending upon the amount of exploitable ILP, and is ≈3 cycles on average. L2-

cache misses can also be hidden to some extent and thus the L2-cache miss penalty is typically

lower than the DRAM access latency (200 cycles). Like the D-cache miss penalty, the L2-cache

miss penalty varies across benchmarks and is ≈150 cycles, on average.

The branch mis-prediction penalty is largely determined by the front-end pipeline depth

(12 stages) but is typically larger than that. This is partly due to lost issue opportunity prior to

detecting a mis-prediction and partly due to pipeline refill time [81]. Figure 5.15 shows that

the penalty is more or less constant across benchmarks, ≈16 cycles on average.

Table 5.5. Average quiesce, invalidation, and indirect overheads for various multi-configuration units. The intervals shown are those used to compute the indirect overhead. The interval is in terms of number of instructions, while the overheads are in terms of number of cycles. The indirect overhead is the average for 3 → 2 transitions.

Unit               Interval  Quiesce  Invalidation  Indirect  Total
Instruction cache  100K      100      200           1200      1500
Data cache         100K      100      300           200       800
L2-cache           5M        200      45K           125K      170K
Branch predictor   100K      100      –             3200      3300

Table 5.5 summarizes the approximate system quiesce, invalidation, and warmup over-

heads for the various units. The warmup overheads are computed from data presented in Fig-

ures 5.14 and 5.15. The intervals chosen to compute the warmup overhead are large enough

to warm up the units. To compute the L2-cache warmup overhead, we use a conservative

200 cycle miss penalty instead of 150 cycles seen on average. While the I-cache, D-cache,

and branch predictor reconfiguration overheads are roughly similar (a few thousand cycles), the L2-cache reconfiguration overhead is much higher (a few hundred thousand cycles).


5.3.4. Tuning Low-Overhead Units

Given a performance loss tolerance of 2%, the tuning interval is roughly determined using Equation 5.5. Assuming a CPI of 1 for all applications, we choose a tuning interval of 100K instructions. At first glance, a 100K interval seems too short because 1) the CPI of some applications may be less than 1, which can lead to more than 2% performance loss, and 2) even if the CPI is 1, the I-cache and branch predictor may suffer more than 2% performance loss in the worst case. However, considering the fact that the signature based tuning algorithms reduce reconfigurations by tuning only on phase boundaries, this may not be such a bad choice.

Figure 5.16. Stability and average phase length using a granularity of 100K instructions. δth = 5%, signature is 512B with perfect sampling, pseudo-random hash function, and a shift amount of 6.

Phase changes are detected using a 512B signature. The difference threshold (δth) is set

to 5%. The signature is generated using perfect sampling and a pseudo-random hash function

with a shift amount of 6. Figure 5.16 shows the stability and average phase length observed for


the various benchmarks, using these parameters. The stability is 60% on average, but for some benchmarks, such as crafty, it is as low as 7%. This has implications for the efficiency of tuning algorithms, as discussed in the next section.

Basic Algorithm

Figure 5.17 shows the percent savings in average unit size and the corresponding performance loss (percent increase in CPI) for the basic tuning algorithm. Results for each of the low-overhead units, viz. the I-cache, D-cache, and branch predictor, are shown. The performance loss due to the reconfiguration overheads is shown in black.

Consider I-cache tuning first. The basic algorithm works well, achieving 18% and 25% savings, respectively, for integer and floating-point benchmarks. The performance loss is ≈1% on average, and never exceeds 2% for any benchmark. The algorithm successfully allocates

just enough resources to a benchmark to keep performance degradation less than the speci-

fied tolerance. For benchmarks such as crafty and gcc which require the largest configuration

(see Figure 5.3), resource savings are minimal. Benchmarks such as bzip2 whose instruction

working set fits in the smallest configuration, show >50% resource savings.

A significant fraction of the performance loss (≈50% on average) is attributed to reconfiguration overheads. Table 5.6 summarizes the number of tunings, i.e., the number of times the trial and error tuning process is initiated, and the number of reconfigurations performed by the basic I-cache tuning algorithm. Since tuning is initiated at the beginning of each phase, the number of tunings is the same as the number of dynamic phases. The number of reconfigurations is larger than the number of tunings because each tuning leads to more than one reconfiguration.

A comparison of Figure 5.17 and Table 5.6 shows that the larger the number of reconfigurations, the larger the performance loss due to reconfiguration overheads.


Figure 5.17. Percent savings in average unit size and corresponding performance loss (percent increase in CPI) for the basic tuning algorithm. Each cluster of three bars shows the results for the I-cache, D-cache, and branch predictor.


The number of reconfigurations mainly depends upon the number of tunings, which varies widely across programs

due to differences in phase behavior. For example, programs with high phase stability and low

average phase length (e.g. twolf) have a large number of tunings (and reconfigurations) while

programs with unstable phase behavior (e.g. crafty) have very few tunings. The number of re-

configurations also depends upon the configuration chosen most often for a given benchmark.

For benchmarks such as sixtrack, whose instruction working set fits in the smallest configura-

tion (see Figure 5.3), each tuning attempt leads to several reconfigurations (≈4, on average)

before the best configuration is chosen. However, for benchmarks such as crafty, which require

the largest I-cache configuration, each tuning attempt leads to a few reconfigurations (≈2, on

average).

Table 5.6. Number of tunings and reconfigurations for the basic I-cache tuning algorithm. The numbers are shown as percentages of the total number of tuning intervals, which is 40,000.

Benchmark  Tunings  Reconfigs      Benchmark  Tunings  Reconfigs
gzip       11.27    36.38          wupwise    7.12     16.51
vpr        12.49    27.68          swim       3.34     9.87
gcc        5.98     13.15          mgrid      12.31    34.48
mcf        14.38    34.65          applu      8.95     22.80
crafty     2.72     5.46           mesa       12.35    34.68
parser     11.47    27.81          art        12.57    28.02
eon        16.75    33.84          equake     10.18    33.65
perl       8.94     18.24          ammp       15.16    35.78
gap        7.90     26.13          lucas      6.29     24.37
vortex     1.49     3.84           fma3d      6.79     23.84
bzip2      10.07    33.34          sixtrack   3.18     14.46
twolf      19.54    45.41          apsi       14.01    30.30

The basic algorithm works well, but does not provide as much savings as could be expected

from Figure 5.3. This is primarily due to low phase stability (60% on average), which causes

the algorithm to default to the maximum configuration about 40% of the time, on average.

The effect is apparent with a comparison of bzip2 and lucas. Both benchmarks have similar

baseline performance and their instruction working sets are small enough to fit in the smallest


I-cache configuration. However, lucas has close to half the stability of bzip2 and thus shows

much less resource savings. The other reason for lower savings is the trial and error process.

The algorithm has to go through several intermediate configurations before it reaches the best

one. While this leads to lower savings for benchmarks such as bzip2, it also leads to some

savings for benchmarks such as gcc, which require the largest configuration.

The tuning algorithm may not perform well if phase changes go undetected, as is the case

in benchmark swim. Benchmark swim has an instruction working set small enough to fit in

the smallest cache configuration. It also has higher stability than bzip2, but the savings are not

as high as bzip2 because with a δth of 5%, some of the phase changes in swim go undetected

leading to lost tuning opportunities. A simple way to fix this problem is to force tuning if a

phase gets longer than a preset length threshold. We describe algorithms with such backoff

mechanisms in [60, 61].

The algorithm performs similarly for the D-cache, with resource savings of 19% and 26%, respectively, for integer and floating-point benchmarks, and an average performance loss of ≈0.7%, which is lower than for I-cache tuning. This is partly due to lower reconfiguration

overheads associated with the D-cache, and partly because most benchmarks tolerate a smaller

D-cache better than a smaller I-cache (see Figures 5.3 and 5.4). The performance loss is below

the 2% tolerance for all benchmarks except mgrid. In mgrid, the phases are noisy, i.e., there is significant performance variation within a phase. This causes a sub-optimal configuration to

be chosen sometimes, which leads to significant performance loss. It should be noted that the

performance loss is computed using worst case quiesce overheads. Using average overheads,

performance loss for mgrid barely exceeds 2%.

Branch predictor tuning leads to average savings of 20% and 26% respectively, for integer

and floating-point benchmarks. Performance loss is ≈1.4% on average. Unlike I-cache and


D-cache tuning, most of the performance loss is due to reconfiguration overheads. This is to be expected, given the relatively high reconfiguration overhead (see Table 5.5). The performance loss exceeds the 2% tolerance limit for five benchmarks: eon, gap, bzip2, twolf, and apsi. A closer

look at the phase behavior of the benchmarks reveals different reasons for performance loss in

different benchmarks.

– Benchmarks eon, twolf, and apsi have high stability (>75%), but relatively small phases

(≤5 intervals). This causes a very large number of reconfigurations (one every two in-

tervals) leading to high performance overhead.

– Benchmark gap has high stability and long phases, leading to fewer reconfigurations. But it loses about 1.3% performance just due to trying out smaller configurations. This, along with the reconfiguration overhead, causes the performance loss to exceed 2%.

– Benchmark bzip2 loses more than 2% performance mainly because it has a low CPI

(≈0.5), and a 100K interval is too small to amortize reconfiguration overheads. This is

apparent by comparing bzip2 and equake. Both benchmarks have similar stability and

average phase lengths (see Figure 5.16) leading to similar number of reconfigurations.

Moreover, both the benchmarks perform well even with the smallest branch predictor

configuration (see Figure 5.6). The only difference is that equake has a CPI close to 1

which causes its performance loss to be much less than 2%.

In summary, the main problem with the basic tuning algorithm is repeated trial and error

tuning. For benchmarks with short phases, this leads to 1) lower power savings, because the

algorithm has to try intermediate configurations before it reaches the best one, and 2) higher

performance loss, due to the large number of reconfigurations. This becomes more apparent in

branch predictor tuning, where the reconfiguration overheads are relatively high.


Signature Density Based Algorithm

The signature density based algorithm solves one of the problems mentioned above, i.e.,

it eliminates the trial and error process by choosing the best configuration based on signature

density. This causes fewer reconfigurations and consequently, lower performance loss and

larger savings. As mentioned before, this algorithm is applied to I-cache and branch predictor

tuning only.

Working set size estimation using Equation 5.1 requires multiple arithmetic operations

including two logarithm computations. This can be avoided by comparing signature density

against certain preset thresholds to arrive at the best configuration. The technique is best ex-

plained by means of an example. Consider the I-cache which supports configurations of 2KB,

16KB, 32KB, and 64KB. We compute three preset thresholds f0, f1 and f2 corresponding to

working set sizes 2KB, 16KB, and 32KB. At run-time, the tuning algorithm compares the sig-

nature density f against these thresholds and the best configuration is chosen as follows: 2KB

if (f ≤ f0), 16KB if (f0 < f ≤ f1), 32KB if (f1 < f ≤ f2), and 64KB if (f > f2). This

technique essentially replaces the working set size computation with a few compares.
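The comparison logic is simple enough to sketch directly; the threshold values are assumed to have been precomputed as described above, and the names below are ours.

    def choose_config(density, thresholds):
        """Pick the smallest configuration whose preset density threshold covers
        the observed signature density; default to the largest otherwise.
        `thresholds` is the ascending list (f0, f1, f2) from the text."""
        for config, f in enumerate(thresholds):
            if density <= f:
                return config
        return len(thresholds)  # f > f2: use the largest configuration

For the I-cache example, choose_config(f, (f0, f1, f2)) returns 0 (2KB) if f ≤ f0, 1 (16KB) if f0 < f ≤ f1, 2 (32KB) if f1 < f ≤ f2, and 3 (64KB) otherwise.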

In order to avoid performance loss for corner cases, the preset thresholds are computed

using 75% of the cache (or predictor) sizes. In other words, the chosen configuration is such

that the estimated working set size is only 75% of the unit size. This provided an estimation

error margin of about 30%. Of course, it leads to slightly lower resource savings in some cases

but we prefer this to losing more than 2% performance.

In order to estimate branch working set size, we set IWsize = 4 and BBsize = 4. IWsize

is set based on the PowerPC ISA [64]. BBsize is set conservatively based on the observation

that the average basic block size for SPEC CPU2000 benchmarks is about five instructions

[50]. Using BBsize = 4 can lead to lower resource savings for some benchmarks. A simple


mechanism to improve savings is to use profiling counters to estimate the average basic block

size. However, even with our simplistic assumptions, the resource savings are significant.

Thus, in favor of simplicity, we do not explore such mechanisms.

The algorithm presented in Section 5.3.1 is modified slightly to improve resource savings.

The original algorithm sets the configuration to maximum size when it moves into the UNSTA-

BLE state. Instead, the modified algorithm sets the configuration to an estimated size using the

working set signature. This leads to some additional resource savings with only slightly higher

performance loss.

Figure 5.18. Percent savings in average unit size and corresponding performance loss (percent increase in CPI) for the signature density based tuning algorithm. Each pair of bars shows the results for the I-cache and branch predictor.

Figure 5.18 shows the average resource savings and associated performance loss for tun-

ing the I-cache and branch predictor. Since the algorithm performs equally well in both cases,

we focus the discussion on I-cache tuning. The signature density algorithm performs extremely well, saving more than 90% of resources for some benchmarks. On average, the algorithm achieves 53% savings in average unit size, with ≈0.7% performance loss on average. The performance loss never exceeds the 2% limit.

Most of the benefits come from bypassing the trial and error process. This leads to less

time spent in sub-optimal configurations, leading to larger resource savings. Also, there are

fewer reconfigurations, which leads to lower reconfiguration overhead. Table 5.7 shows the

number of reconfigurations for the basic and signature density algorithms. The signature den-

sity algorithm leads to a 76% reduction in reconfigurations, on average. For certain benchmarks, such as eon, a reduction as high as 99% is achieved.

A comparison with Figure 5.3 shows that the algorithm successfully disables cache blocks

for benchmarks that do not benefit from them. Thus, the savings are almost zero for bench-

marks such as crafty and eon which require the largest size cache for performance, and as high

as 93% for benchmarks such as bzip2 which perform well even with the smallest cache.

Figure 5.19. Percent savings in average unit size for the signature density based tuning algorithm and the oracle algorithm.

The efficiency of the tuning algorithm can be judged by comparing it to an oracle algo-

rithm. We implement an oracle algorithm that chooses, at the beginning of every interval, the

smallest configuration that leads to less than 2% performance loss compared to the baseline.

Figure 5.19 compares the savings achieved by the signature density based algorithm with those

achieved by the oracle algorithm.
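Conceptually, the oracle reduces to the following per-interval selection; this is a sketch, and knowing the per-configuration CPIs for the upcoming interval is exactly what makes it an oracle rather than a realizable algorithm.

    def oracle_config(cpi_by_config, baseline_cpi, tolerance=0.02):
        """Pick the smallest configuration whose CPI for the interval is within
        `tolerance` of the baseline (largest-configuration) CPI.
        `cpi_by_config` lists CPIs from the smallest to the largest config."""
        for config, cpi in enumerate(cpi_by_config):
            if cpi <= baseline_cpi * (1.0 + tolerance):
                return config
        return len(cpi_by_config) - 1  # fall back to the largest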


Table 5.7. Number of reconfigurations for the basic and signature density based algorithms, for I-cache and branch predictor tuning. The percent reduction in reconfigurations for each case is also shown. The total number of tuning intervals is 40,000.

                       I-cache                      Branch predictor
Benchmark   Basic   Density  Red. (%)      Basic   Density  Red. (%)
gzip        14550   5733     61            13913   4516     68
vpr         11072   3808     66            11092   4997     55
gcc         5259    1530     71            5235    2391     54
mcf         13860   5261     62            14430   5753     60
crafty      2184    47       98            2325    1089     53
parser      11125   2998     73            11190   4653     58
eon         13535   198      99            18699   6700     64
perl        7294    726      90            8875    3577     60
gap         10450   2162     79            11005   3162     71
vortex      1537    523      66            1574    596      62
bzip2       13337   4588     66            13368   4029     70
twolf       18163   1212     93            18157   7817     57
wupwise     6604    2422     63            7078    2848     60
swim        3947    2586     34            3980    1340     66
mgrid       13791   1299     91            13586   4847     64
applu       9118    4380     52            8764    3580     59
mesa        13872   1468     89            14472   4939     66
art         11209   5923     47            11247   5027     55
equake      13458   1979     85            13505   4072     70
ammp        14311   1544     89            14370   6063     58
lucas       9747    5011     49            9747    2514     74
fma3d       9537    1282     87            9817    2717     72
sixtrack    5784    1460     75            2782    715      74
apsi        12119   968      92            14070   5602     60
average     10260   2437     76            10730   3975     63

The signature density based algorithm achieves savings close to the oracle algorithm. In

fact, there are cases such as gap and apsi where the former achieves more savings. This can

be explained as follows. The signature density based algorithm tunes only at the beginning of

a phase. However, phases are sometimes associated with a certain amount of noise. If a small

configuration is chosen at the beginning of a phase and the CPI increases within the phase, the

oracle algorithm transitions to a larger configuration but the signature density based algorithm

does not. This leads to larger savings for the latter. Since most significant CPI changes are


detected by the phase detection mechanism (see Section 3.3.2), the performance loss incurred

is negligible.

History Based Algorithm

The history based algorithm bypasses the trial and error process for recurring phases, by

reusing results of previous tunings. While the signature density algorithm achieves the same

effect, the history algorithm is more generic and applies to any multi-configuration unit.
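At its core, the history based algorithm keeps a phase table mapping a phase's working set signature to the configuration found by a previous tuning. The sketch below is a simplification: signatures are modeled as sets of hashed working-set elements rather than the bit vectors actually used, and the full algorithm is the one described in Section 5.3.1.

    def sig_distance(a, b):
        """Relative distance between two signatures, modeled here as sets."""
        union = a | b
        return len(a ^ b) / len(union) if union else 0.0

    def lookup_phase(phase_table, signature, delta_th=0.05):
        """Return the saved configuration for a recurring phase, or None if the
        phase is new (in which case trial and error tuning is initiated)."""
        for saved_sig, config in phase_table:
            if sig_distance(signature, saved_sig) <= delta_th:
                return config
        return None

    def record_phase(phase_table, signature, config):
        """Save a completed tuning result for later reuse."""
        phase_table.append((signature, config))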

Figure 5.20. Phase recurrence for various benchmarks. Signature size = 512B, δth = 5%.

Figure 5.20 shows phase recurrence for various benchmarks. Phase recurrence is com-

puted as the ratio of the number of dynamic and static phases. To measure recurrence, a history

of previously seen signatures is maintained. On a phase change, the current signature is compared against signatures in the history. If a match is found, i.e., the signature difference is within δth = 5%, the phase is identified as a recurring phase. If not, it is identified as a new static phase and its signature is added to the history.
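Recurrence measurement can be sketched the same way, again with set-based signatures standing in for the bit-vector ones (all names are ours):

    def sig_distance(a, b):
        """Relative distance between two signatures, modeled here as sets."""
        union = a | b
        return len(a ^ b) / len(union) if union else 0.0

    def phase_recurrence(dynamic_phase_signatures, delta_th=0.05):
        """Recurrence = (# dynamic phases) / (# static phases). The input holds
        one signature per detected dynamic phase, in order of appearance."""
        static = []
        for sig in dynamic_phase_signatures:
            if not any(sig_distance(sig, s) <= delta_th for s in static):
                static.append(sig)  # first occurrence: a new static phase
        return (len(dynamic_phase_signatures) / len(static)) if static else 0.0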

Phase recurrence varies widely across benchmarks from almost zero for crafty to about

500 for lucas. In general, the recurrence is higher for floating-point benchmarks because their

phases are composed of well behaved nested loops that repeat in time. The average phase


recurrence over all benchmarks is 200. This suggests that there is significant potential for a

history based algorithm.

Figure 5.21. Percent savings in average unit size and corresponding performance loss (percent increase in CPI) for the history based tuning algorithm. Each cluster of three bars shows the results for the I-cache, D-cache, and branch predictor.


Figure 5.21 shows the resource savings achieved and the associated performance loss for

the history based algorithm. The I-cache tuning algorithm achieves 11% and 24% savings

respectively for the integer and floating-point benchmarks. The average performance loss is

≈0.3% and the performance loss stays within the tolerance limit for all benchmarks. For D-

cache tuning, the algorithm achieves 8% and 25% savings respectively for the integer and

floating-point benchmarks. The average performance loss is ≈0.5%, but the performance loss exceeds the threshold for benchmark mgrid. As explained before, this is due to noise associated with mgrid’s phases, which leads to a sub-optimal configuration being chosen. The branch

predictor tuning algorithm achieves 13% and 22% savings on integer and floating-point bench-

marks, respectively. The average performance loss is ≈0.6% and the performance loss never

exceeds the specified threshold.

On average, the history algorithm causes lower performance loss than the basic algorithm,

due to fewer reconfigurations and less time spent in sub-optimal configurations. Both these benefits are a result of fewer trial and error based tuning attempts, due to successful phase table lookups for recurring phases. Table 5.8 summarizes various statistics for the history based

I-cache tuning algorithm. The recurrence ratios for the various benchmarks differ from those

shown in Figure 5.20 because the phase table is updated with the signature only if tuning

completes. The history algorithm is able to bypass 86% of the trial and error tuning attempts,

on average, leading to 64% reduction in reconfigurations compared to the basic algorithm.

While less time spent in sub-optimal configurations leads to lower performance loss, it also translates into lower resource savings for many benchmarks. To understand this behavior,

we study I-cache tuning in some detail. D-cache and branch predictor tuning results can be

explained along similar lines.

A comparison of savings (for I-cache tuning) achieved by the basic and history based al-


Table 5.8. The table shows, for the history based I-cache tuning algorithm, the number of static and dynamic phases, recurrence ratio, number of tunings, number of successful phase table lookups, percentage of trial and error attempts bypassed, number of reconfigurations, and percent reduction in reconfigurations with respect to the basic tuning algorithm. The total number of tuning intervals is 40,000.

Benchmark  Static  Dynamic  Recur-  Tunings  Lookups  Bypass  Reconfig-  Reduc.
           Phases  Phases   rence                     (%)     urations   (%)
gzip       11      3734     339     3734     3622     97      5707       53
vpr        43      4997     116     4997     4847     97      3545       68
gcc        264     2391     9       2391     1100     46      3089       41
mcf        17      5753     338     5753     5638     98      3256       77
crafty     148     1089     7       1089     229      21      1704       22
parser     243     4287     18      4287     3130     73      4266       59
eon        12      6700     558     6700     6566     98      200        99
perl       34      3577     105     3577     2897     81      1533       79
gap        12      2915     243     2915     2857     98      786        92
vortex     11      596      54      596      453      76      710        54
bzip2      19      4029     212     4029     3747     93      7158       46
twolf      13      7817     601     7817     7739     99      1149       94
wupwise    13      2848     219     2848     2335     82      1214       82
swim       8       1081     135     1081     1070     99      1478       54
mgrid      11      4922     447     4922     4577     93      6488       53
applu      9       3580     398     3580     1289     36      5987       34
mesa       11      4939     449     4939     4890     99      5540       60
art        11      5027     457     5027     4926     98      2048       82
equake     16      4072     255     4072     4031     99      5915       56
ammp       21      6063     289     6063     5820     96      882        94
lucas      3       2514     838     2514     2489     99      5032       48
fma3d      12      2717     226     2717     2690     99      4463       53
sixtrack   15      1272     85      1272     1247     98      1731       70
apsi       31      5602     181     5602     4930     88      3903       68
average    41      3855     274     3855     3463     86      3241       64

gorithms (see Figure 5.22) reveals that the latter achieves better savings on some benchmarks,

and worse savings on others. This can be qualitatively explained by means of a rough ana-

lytical model for resource savings. For simplicity, assume that the stability is 100% and each

attempted tuning completes. Also, assume that the length of each phase is L and L > 3. The

savings for the basic algorithm are given by

$$\frac{0.03L + 1.53}{L}\,C_0 + \frac{0.125L + 1.16}{L}\,C_1 + \frac{0.5L + 0.13}{L}\,C_2 + \frac{L - 0.5}{L}\,C_3 \qquad (5.9)$$


Figure 5.22. Comparison of I-cache resource savings achieved by the basic and history based algorithms.

where C0..C3 represent the percentages of completed tunings that result in configurations 0..3

being chosen. The sizes for these configurations are shown in Table 5.4. The coefficients for

C0..C3 are derived from the sequence of reconfigurations that leads to the particular configu-

ration being chosen. For example, to choose configuration 3, the sequence of reconfigurations

is 3, 2, 3. Since configuration 2 is half the size of 3, the coefficient for C3 is (L−0.5)/L. Other

coefficients are derived along similar lines.

While modeling the savings for the history based algorithm, we neglect the one-time trial

and error process for each static phase. Based on this assumption, the savings are given by

$$0.03\,C_0 + 0.125\,C_1 + 0.5\,C_2 + C_3 \qquad (5.10)$$

Given percentages C0..C3, one can predict which algorithm provides better savings. In-

tuitively, the history based algorithm works better for benchmarks where most of the tunings

result in the smallest configuration being chosen because the algorithm jumps to the small-

est configuration without going through the larger intermediate configurations. In benchmarks

where the largest configuration is chosen most of the time, the basic algorithm achieves greater

savings because every tuning causes a smaller configuration to be tried before the larger one is

chosen.
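For reference, Equations 5.9 and 5.10 can be evaluated directly. The sketch below (function names are ours) reproduces the kind of comparison plotted in Figure 5.23:

    def model_basic(c, length):
        """Value of Eqn. 5.9, given the fractions c = (C0, C1, C2, C3) of
        completed tunings choosing each configuration (summing to 1) and the
        phase length in tuning intervals (length > 3)."""
        coeff = [(0.03 * length + 1.53) / length,
                 (0.125 * length + 1.16) / length,
                 (0.5 * length + 0.13) / length,
                 (length - 0.5) / length]
        return sum(k * ci for k, ci in zip(coeff, c))

    def model_history(c):
        """Value of Eqn. 5.10, which neglects the one-time trial and error
        process for each static phase."""
        return sum(k * ci for k, ci in zip((0.03, 0.125, 0.5, 1.0), c))

    # Scenario 2 of Figure 5.23: only configurations 3 and 0 are ever chosen.
    for c3 in (0.2, 0.5, 0.8):
        c = (1.0 - c3, 0.0, 0.0, c3)
        print(c3, model_basic(c, length=10), model_history(c))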


Figure 5.23. Savings for the basic and history based algorithms computed using the analytical model. Two simple scenarios are chosen: 1) either configuration 3 or 2 is chosen, i.e., C2 = 100 − C3 (basic-c2, hist-c2), and 2) either configuration 3 or 0 is chosen, i.e., C0 = 100 − C3 (basic-c0, hist-c0). Break-even points for the two scenarios are shown with the vertical lines.

Figure 5.23 shows this behavior for two simplistic scenarios: 1) either configuration 3 or configuration 2 is chosen, i.e., C2 = 100 − C3, and 2) either configuration 3 or configuration 0 is chosen, i.e., C0 = 100 − C3. As expected, the basic algorithm has an edge over the history based algorithm if C3 is high. The break-even point depends upon the relative sizes of the two configurations. As expected, the break-even point is lower (30%) for the first scenario and higher (80%) for the second one.

Figure 5.24. Percentage of tunings resulting in various configurations being chosen. C0..C3 are the percentages of completed tunings that result in configurations 0..3 being chosen. For each pair, the left and right bars show the percentages for the basic and history based algorithm respectively.


Figure 5.24 shows the percentages C0..C3 for the basic and history based algorithms. In

most cases, the values for the two algorithms are similar, which means that tuning results can be

effectively reused. In some cases such as bzip2, the values differ significantly. This can happen

if a sub-optimal configuration gets chosen (and saved) due to noise associated with phases. One

solution is to tune each phase a few times and save the most frequently chosen configuration.

Another solution is to flush phase history periodically (every few billion instructions) to get rid

of spurious tuning results. We do not explore these techniques in this dissertation.

Given the values of C0..C3 (Figure 5.24), results shown in Figure 5.22 can be largely

explained using the model. The basic algorithm works better for most integer benchmarks

because the largest configuration is chosen most of the time. The history based algorithm works

better for benchmarks such as swim and lucas because the smallest configuration is chosen a

fairly large number of times (C0 > 20%). Other results are also easily explained using the

analytical models. The only major anomaly is bzip2 where the history based algorithm should

work better (C0 > 70%). However, as explained before, the history based algorithm locks in a

spurious configuration leading to reduced savings.

Summary

Table 5.9 summarizes the performance loss, savings, and number of violations (∆CPI >

2%) for each of the algorithms. Clearly, the signature density based algorithm performs the best

for I-cache and branch predictor tuning. It provides maximum savings within the specified

tolerance, with zero violations. In case of D-cache tuning, both the basic and history based

algorithms lead to one violation (mgrid). On average, the basic algorithm performs better

because it provides more savings.

Based on the results summarized in Table 5.9, the history based algorithm does not seem


to have a solid advantage over the basic algorithm. The main feature of the history based algorithm is that it performs fewer reconfigurations compared to the basic algorithm. Unfortunately, the

reduction in reconfigurations does not provide significant benefits for these units because the re-

configuration overheads are relatively low. In fact, the basic algorithm seems to perform better solely because it tries sub-optimal configurations and the penalty of being in such configurations is not high. The next section shows that this seeming advantage turns into a weakness when this penalty increases, and the history based algorithm becomes an attractive alternative.

Table 5.9. Average performance loss, average savings, and number of tolerance violations (∆CPI > 2%) for each of the algorithms.

I-cache
                        Savings (%)       ∆CPI (%)
Algorithm            Int   FP   Avg    Int   FP    Avg    Violations
Basic                18    25   22     1.08  0.94  1.01   0
Signature Density    44    62   53     0.87  0.50  0.68   0
History              11    24   18     0.33  0.40  0.36   0

D-cache
Algorithm            Int   FP   Avg    Int   FP    Avg    Violations
Basic                19    26   23     0.59  0.70  0.65   1
History              8     25   17     0.20  0.61  0.40   1

Branch Predictor
Algorithm            Int   FP   Avg    Int   FP    Avg    Violations
Basic                20    26   23     1.42  0.94  1.18   5
Signature Density    41    54   48     0.70  0.24  0.47   0
History              13    22   18     0.68  0.55  0.62   0


5.3.5. Tuning High-Overhead Units

Given the high overheads associated with L2-cache tuning (see Table 5.5), we choose a

tuning interval of 5M instructions based on Equation 5.5. The 5M instruction interval leads

to an average stability of 55% and average phase length of 5 intervals, across the benchmarks

(see Figure 5.25).

Figure 5.26 shows the percent savings in average unit size and the corresponding performance loss (percent increase in CPI) for the basic and history based tuning algorithms. As with the low-overhead units, worst case overheads for each transition are used to compute performance loss. The performance loss due to the reconfiguration overheads is shown in black.

Figure 5.25. Stability and average phase length using a granularity of 5M instructions. δth = 5%, signature is 512B with perfect sampling, pseudo-random hash function, and a shift amount of 6.

The basic algorithm achieves more than 40% savings on some benchmarks with average

savings of 19% across benchmarks. The average savings are low because most benchmarks

show significant performance loss even with a 1MB L2-cache (see Figure 5.5). A comparison


Figure 5.26. Percent savings in average unit size and corresponding performance loss (percent increase in CPI) for L2-cache tuning algorithms. The left and right bars in each pair correspond to the basic and history based algorithms respectively.

with Figure 5.5 shows that the algorithm is quite effective at disabling L2-cache resources for

benchmarks that do not benefit from them.

The average performance loss caused by the basic algorithm is 3.5% – which is well over

the specified tolerance of 2%. While part of this loss can be attributed to reconfiguration

overheads (black bars), most of it is due to time spent in sub-optimal configurations. This effect

is most evident in benchmark twolf, which suffers severe performance degradation (78%) even

in configuration 2 – the second largest configuration. In fact, twolf is a well behaved benchmark

with long phases (average phase length of 10) and a high stability (86%), which leads to fewer

tunings compared to other benchmarks. But each tuning attempt leads to a performance loss

so high that it cannot be amortized even over its long phases.

The history based algorithm improves upon the basic algorithm by causing only 1.3%

performance degradation on average. The performance loss barely exceeds the 2% tolerance


in case of three benchmarks. As expected, the algorithm provides slightly lower savings of

17%. The reduction in savings is mainly due to the elimination of unnecessary reconfigurations, as discussed in the previous section.

Figure 5.27. Shown, for the history based algorithm, are 1) the percentage of tuning attempts where the optimal configuration is found in the phase table, and 2) the reduction in reconfigurations relative to the basic algorithm.

It is evident (from Figure 5.26) that the lower performance loss is mainly due to less time

spent in sub-optimal configurations. Figure 5.27 shows the percentage of tuning attempts where

the optimal configuration is found in the phase table. This basically indicates the fraction of

time the trial and error process is bypassed.

The optimal configuration is found in the phase table more than 90% of the time for some

benchmarks and about 63% of the time on average. For some benchmarks such as applu, this

number is almost zero. This is because applu is highly unstable (almost zero stability) at a


granularity of 5M instructions and thus there are almost no tunings. Fewer trial and error based

tuning attempts also lead to fewer reconfigurations (shown in the same figure) which in turn

contribute to the lower performance loss.

5.3.6. Comparison with Periodic Tuning Algorithms

This section compares the phase based tuning algorithms with a purely periodic algorithm.

The periodic tuning algorithm is invoked at fixed instruction intervals, and searches for the best

configuration using trial and error. It starts with the subsystem configuration set to maximum

(i.e. the one that should yield maximum performance), and measures the CPI as the processor

runs for one (or more) intervals. Then, it tries a smaller configuration in successive intervals

and selects the smallest one that leads to less than 2% CPI increase relative to the recorded

CPI. To prevent repeated reconfigurations, the algorithm stays in the selected configuration

for a fixed interval of time after which, it transitions back to the maximum configuration and

repeats the tuning process.

In essence, the periodic algorithm can be defined in terms of two parameters: the tuning interval, i.e., the period of invocation of the algorithm, and stable intervals, i.e., the number of intervals the algorithm stays in the best found configuration. The periodic algorithm does not

require a notion of a program phase or program phase changes. However, it deals with phase

changes by defaulting to the maximum configuration after stable intervals and repeating the

tuning process.
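One invocation of this search can be sketched as follows. The structure and names are ours; measure_cpi is a hypothetical hook that runs the processor for one tuning interval in the given configuration and returns the observed CPI.

    def periodic_tune_once(measure_cpi, num_configs, tol=0.02):
        """Start at the maximum configuration, record its CPI, then try
        successively smaller configurations, keeping the smallest one whose
        CPI stays within `tol` of the recorded CPI."""
        best = num_configs - 1
        reference_cpi = measure_cpi(best)
        for config in range(num_configs - 2, -1, -1):
            if measure_cpi(config) <= reference_cpi * (1.0 + tol):
                best = config
            else:
                break  # stop shrinking at the first violation
        return best  # stay here for `stable intervals`, then repeat

After stable intervals have elapsed in the selected configuration, the algorithm transitions back to the maximum configuration and invokes this routine again.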

For evaluation, the tuning interval is chosen using Equation 5.5. Like phase based algo-

rithms, a 100K interval works well for tuning the I-cache, D-cache, and branch predictor, and

a 5M interval works well for the L2-cache. stable intervals is chosen empirically, by trying

various values and choosing the one that provides best savings under the 2% performance loss


constraint. We find that stable intervals = 8 works best for all units, leading to tuning once

every 10 - 13 intervals. This is not surprising because, on average, the benchmarks have a

stability of 60% (Figure 5.16) which leads to a new phase every 10 intervals (stability = 60%

=> 6 stable + 4 unstable intervals).

Table 5.10. Average savings and performance loss (CPI increase) for the periodic and phase based algorithms.

Savings (%)
Unit              Periodic  Basic  History  Density
I-cache           26.2      21.5   17.8     53.0
D-cache           29.2      22.7   16.5     -
Branch predictor  29.1      22.9   17.6     47.4
L2-cache          27.9      19.1   16.7     -

CPI increase (%)
Unit              Periodic  Basic  History  Density
I-cache           1.05      1.01   0.36     0.68
D-cache           0.58      0.65   0.40     -
Branch predictor  1.21      1.18   0.62     0.47
L2-cache          3.27      3.44   1.32     -

Table 5.10 shows the average resource savings and performance loss for the periodic al-

gorithm and the various phase based algorithms. However, we only compare the periodic algorithm and the basic tuning algorithm. The efficiency of these two algorithms, relative to the other algorithms, is very similar, making other comparisons less interesting.

On average, the periodic algorithm provides more savings compared to the basic algorithm. To explain this behavior, we show the savings achieved by the two algorithms for branch predictor tuning (see Figure 5.28). A comparison with Figure 5.16 shows that most of the benefits for the periodic algorithm come from benchmarks with low stability, such as gcc, crafty, and applu, because for such benchmarks the basic algorithm spends most of the time in the largest configuration.

The performance loss for both algorithms is very similar (see Figure 5.29). However, the

basic algorithm suffers more loss due to reconfiguration overhead compared to the periodic


Figure 5.28. Average savings for branch predictor tuning using the basic and periodic algorithms.

algorithm. This is to be expected because, on average, the periodic algorithm tunes slightly

less often than the basic algorithm. The periodic algorithm loses more performance due to being in a sub-optimal configuration because it has no notion of program phases, which can lead to tuning while the phase is in transition. This can potentially cause a sub-optimal configuration to be chosen. However, the penalty of being in a sub-optimal configuration is not very high for the branch predictor, and thus the performance loss is minimal.

Figure 5.29. Performance loss for branch predictor tuning using basic and periodic algorithms. [Bar chart: CPI increase (%) per benchmark, with the portion due to reconfiguration overhead indicated.]


5.3.7. Optimizing Granularity

The low resource savings achieved by the basic and history based algorithms are due to the fact that they default to the largest configuration when phase behavior becomes unstable. Thus, high stability is desirable for these tuning algorithms. Stability depends upon the granularity, as shown in Section 3.3.3. If a granularity leading to high stability can be found, better savings can be achieved.

A simple run-time algorithm can try a few different granularities over time and choose the best one for tuning. Since a smaller tuning interval provides more tuning opportunities, the algorithm should start with a small granularity and increase it until reasonably stable behavior is found, as sketched below. Of course, the smallest granularity is bounded by Equation 5.5 based on the specified tolerance and reconfiguration overheads.
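A minimal C sketch of such a search follows; the candidate list, the stability measurement, and the acceptance threshold are illustrative assumptions, not the procedure used to produce Table 5.11.

    #include <stddef.h>

    /* Pick the smallest candidate granularity (sorted ascending) that is at
     * least min_gran (the Equation 5.5 bound) and whose measured stability
     * meets the target; otherwise fall back to the most stable candidate. */
    long pick_granularity(const long *gran, const double *stb, size_t n,
                          long min_gran, double target)
    {
        long best = min_gran;
        double best_stb = -1.0;
        for (size_t i = 0; i < n; i++) {
            if (gran[i] < min_gran)
                continue;                    /* below the overhead bound */
            if (stb[i] >= target)
                return gran[i];              /* smallest acceptable wins */
            if (stb[i] > best_stb) { best_stb = stb[i]; best = gran[i]; }
        }
        return best;
    }

For a low-overhead unit the candidates would be 100K, 500K, 1M, 5M, and 10M instructions, with the stabilities measured over successive trial windows.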

Due to limited simulation time, we do not explore such run-time algorithms in this section. Rather, we ran the algorithm off-line to find the best granularity for each benchmark. However, we have previously presented a run-time algorithm for optimizing the granularity for adaptive prefetching [82]. Table 5.11 shows the best granularities and the corresponding stability and average phase length for the various benchmarks. For low-overhead units, the algorithm chooses between granularities of 100K, 500K, 1M, 5M, and 10M instructions. For high-overhead units, the algorithm chooses between granularities of 5M and 10M instructions. Granularities over 10M instructions are not considered because the resulting number of tuning intervals is insufficient for analysis.

Figure 5.30 shows the average savings and performance loss for the various algorithms,

using optimal and non-optimal granularities. Consider the basic tuning algorithm first. The

optimized algorithm (i.e. using optimal granularities) provides 3.8%, 7.7%, 7.2%, and 1.4%

additional savings for I-cache, D-cache, branch predictor, and L2-cache tuning respectively.


Table 5.11. Optimal granularities (amongst a finite set) for tuning low and high overhead units. Gran. represents the granularity in terms of number of instructions, Stb. is the stability in percent, and Len. is the average phase length in terms of number of intervals.

             Low-overhead          High-overhead
Benchmark    Gran.  Stb.  Len.     Gran.  Stb.  Len.
gzip         100K   71    6        10M    65    4
vpr          100K   77    6        5M     65    4
gcc          10M    20    3        10M    20    3
mcf          100K   80    5        5M     67    7
crafty       5M     29    2        5M     29    2
parser       5M     44    4        5M     44    4
eon          10M    78    5        10M    78    5
perl         5M     85    9        5M     78    9
gap          100K   86    10       5M     76    4
vortex       10M    82    6        10M    82    6
bzip2        100K   84    8        5M     81    5
twolf        5M     86    10       5M     86    10
wupwise      5M     88    10       5M     88    10
swim         100K   96    28       5M     20    2
mgrid        100K   76    6        5M     74    6
applu        500K   72    4        5M     0     6
mesa         5M     81    10       5M     81    10
art          100K   62    4        10M    70    4
equake       100K   86    8        10M    61    5
ammp         100K   69    4        5M     67    5
lucas        5M     90    12       5M     90    12
fma3d        100K   82    12       5M     45    2
sixtrack     500K   92    15       5M     50    7
apsi         5M     87    8        5M     87    8

In most cases, the optimized algorithm also leads to lower performance loss. Most of the additional savings come from benchmarks twolf, wupwise, applu, and lucas. The additional savings are mainly due to the increased stability, which is apparent from comparing Table 5.11 and Figure 5.16.

The history based algorithm also benefits from this optimization for similar reasons. In

this case, the optimized algorithm provides 2.2%, 7.9%, 7.2%, and 0.8% additional savings for

I-cache, D-cache, branch predictor, and L2-cache tuning respectively.

The increased stability should not lead to larger savings for the signature density algorithm because it does not default to the maximum configuration. In fact, results show that the signature density based algorithm achieves significantly lower savings with the optimization turned

on. This is primarily due to lost tuning opportunities. As the tuning interval increases, the al-

gorithm can not take advantage of finer grained changes in the working set. The configuration

is set based on the signature density which is determined by the largest working set observed

over the tuning interval.

Figure 5.31 compares the savings for the optimized and non-optimized I-cache tuning

algorithms. Only those benchmarks are shown for which the tuning intervals differ. Evidently,

lost tuning opportunities can lead to as much as 89% lower savings for the optimized algorithm.

Figure 5.30. Average savings and performance loss for various phase based algorithms. The suffix opt

means that optimal granularities from Table 5.11 have been used.

Page 128: Ashutosh Sham Dhodapkar - jes.ece.wisc.eduAUTONOMIC MANAGEMENT OF ADAPTIVE MICROARCHITECTURES by Ashutosh Sham Dhodapkar A dissertation submitted in partial fulllment of the requirements

113

Figure 5.31. I-cache resource savings for the optimized and non-optimized signature density algorithms. Only those benchmarks are shown for which the tuning intervals differ. [Bar chart: savings (%) per benchmark, non-optimized vs. optimized.]

5.3.8. Per-Benchmark Static Tuning

While the focus of this dissertation is on general-purpose processors, the dynamic tuning algorithms can also be applied to application specific processors. However, application specific processors can be tuned statically, at design time, for the particular application at hand. This section compares the resource savings provided by the dynamic tuning algorithms with those provided by per-benchmark static tuning. As discussed before, static tuning to individual benchmarks may not be possible in the general-purpose computing paradigm, but the comparison does provide some insights into the behavior of these benchmarks.

We compare savings of the best performing dynamic algorithms with per-benchmark static

tuning. The best performing dynamic tuning algorithm for the I-cache and branch predic-

tor is the signature density based algorithm. The optimized basic algorithm works best for

D-cache tuning, and the optimized history based algorithm works best for L2-cache tuning.

Per-benchmark static tuning is performed by selecting the smallest configuration, per bench-

mark, that provides less than 2% CPI increase with respect to the baseline configuration. The

configuration is fixed throughout the execution of the benchmark.

Figure 5.32. Comparison of savings achieved by dynamic and per-benchmark static tuning. [Four bar charts: savings (%) per benchmark for the I-cache, D-cache, branch predictor, and L2-cache, static vs. dynamic.]

Figure 5.32 shows savings achieved by dynamic and per-benchmark static tuning, for the various units. For the I-cache, dynamic tuning works better than static tuning, as expected. Dynamic tuning is able to exploit changes in resource requirements within the program, as it


goes through different phases of execution. This is exemplified by benchmark gzip, which is composed of alternating phases with widely differing resource requirements. For gzip, dynamic tuning achieves 30% additional resource savings compared to static tuning. However, in general, the improvements provided by dynamic tuning are small. This can be explained on the basis of Figure 5.24, which shows that, for most benchmarks, dynamic tuning ends up selecting a single configuration most of the time. This means that resource requirements do not change much across phases of a given program, thereby making static tuning very effective.

In case of the branch predictor, dynamic tuning does not perform as well as static tuning.

This is mainly due to the approximate basic block size used to configure the branch predictor,

which leads to conservative savings. As mentioned before, savings can be improved by using

additional profiling hardware to compute the basic block size dynamically.

For D-cache and L2-cache tuning, static tuning outperforms dynamic tuning in almost

all benchmarks. This is mainly attributed to the fact that the basic and history based algo-

rithms default to the maximum configuration in unstable regions. This leads to lower savings,

especially for the D-cache, where a smaller configuration can be used, for most benchmarks,

without causing any performance loss (see Figure 5.4).

5.4. Tuning Multiple Units

Tuning multiple units simultaneously is complex because the total number of possible configurations increases rapidly with the number of units. An adaptive microarchitecture with N multi-configuration units has a total of ∏_{i=1}^{N} M_i configurations, where M_i is the number of possible configurations for the i-th unit. For the microarchitecture studied in this dissertation, this amounts to 256 possible configurations (four units with four possible configurations each, 4^4 = 256).

A naive extension of the trial and error based single-unit algorithms may not work because the exact order in which the configurations should be tried is not apparent. Even if the order

is apparent, tuning may never complete given the large number of configurations to be tried

and the relatively short program phases. It can be argued that certain units such as caches may

be tuned independently by examining their miss rates. However, the exact contribution of the

miss rate to the overall performance varies from one program to another, and even within a

program. This can lead to highly unpredictable performance loss or very conservative power savings.

We propose a tuning algorithm that apportions tolerable performance loss amongst the multi-configuration units and decouples their tuning processes. This way, the algorithm reduces worst case tuning time from ∏_{i=1}^{N} M_i intervals to just max_i(M_i) intervals. The next section

describes the algorithm while the following section evaluates it for an adaptive microarchitec-

ture that employs four multi-configuration units viz. I-cache, D-cache, L2-cache, and branch

predictor. As before, the goal of tuning is to reduce power dissipation of each unit without

degrading overall CPI by more than 2%.

5.4.1. Apportioning Algorithm

The apportioning algorithm is based on two key ideas: 1) CPI and miss rates of various

units remain relatively constant within a phase, and 2) the performance penalties associated

with miss events of individual units can be linearly added to compute the overall performance

loss. While this might seem overly simplistic, it works reasonably well for tuning algorithms.

Recently, Karkhanis and Smith [81] have made similar observations and explained them using

an analytical model.

Figure 5.33a shows the state machine for the algorithm. The overall algorithm is similar to the basic tuning algorithm. The algorithm starts in the UNSTABLE state with individual unit configurations set to maximum.

Figure 5.33. Figure a shows the state machine for the overall tuning algorithm. δ represents the relative signature distance and δ_th the difference threshold. The TUNE states represent independent trial and error based tuning for individual units. Figure b shows the tuning algorithm for individual units, which is based on the misses allocated in the overall algorithm. [State diagrams: (a) UNSTABLE, with configurations set to MAX, transitions on a new phase to apportioning cycles and allocating misses, then to independent TUNE UNIT-1 ... TUNE UNIT-N states, and finally to STABLE; (b) try a configuration each interval, increase it if the allocated misses are exceeded, otherwise reduce it, and stop once a configuration repeats.]

When a new phase is detected, the algorithm apportions tolerable performance loss (∆CPI) amongst the multi-configuration units, and allocates additional tolerable misses (∆miss_i) to individual units as follows:

    ∆miss_i = (∆CPI · CPI · T) / (N · P_i)    (5.11)

where the subscript i denotes the i-th unit, T is the tuning interval, N is the number of multi-configuration units, and P_i is the miss penalty for the i-th unit. In essence, the equation computes the number of extra tolerable cycles per interval, divides it equally amongst the various units, and computes the number of additional misses tolerable based on the average cost of a miss.
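As a worked example of Equation 5.11, the following C fragment computes the per-unit miss allocations using the empirically derived miss penalties quoted in Section 5.4.2 (7, 4, 200, and 16 cycles); the baseline CPI of 1.0 and the 5M instruction interval are assumptions made purely for illustration.

    #include <stdio.h>

    int main(void)
    {
        const char  *unit[]    = { "I-cache", "D-cache", "L2-cache", "branch predictor" };
        const double penalty[] = { 7.0, 4.0, 200.0, 16.0 };  /* cycles per miss */
        const double dCPI = 0.02;   /* 2% tolerable CPI increase            */
        const double cpi  = 1.0;    /* assumed baseline CPI (illustrative)  */
        const double T    = 5e6;    /* tuning interval, in instructions     */
        const int    N    = 4;      /* number of multi-configuration units  */

        for (int i = 0; i < N; i++) {
            double dmiss = (dCPI * cpi * T) / (N * penalty[i]);
            printf("%-16s: %.0f additional tolerable misses per interval\n",
                   unit[i], dmiss);
        }
        return 0;
    }

Under these assumptions the L2-cache, with its 200-cycle penalty, is allocated only 0.02 · 1.0 · 5M / (4 · 200) ≈ 125 additional misses per interval, while the D-cache is allocated 6250.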

Following miss allocation, the tuning process is decoupled, i.e., each unit is independently tuned under the constraint that its additional misses stay below the allocated misses (see Figure 5.33b). Tuning for a unit stops as soon as the minimal configuration satisfying the constraint is found. Like the single-unit algorithms, the overall algorithm transitions to the UNSTABLE state whenever a phase change is detected. When this happens, individual tuning is stopped and all unit configurations are set to maximum.
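The decoupled per-unit search of Figure 5.33b can be sketched as follows; the data structure and the convergence details are our illustrative reading of the figure rather than code from the dissertation.

    /* One trial step per tuning interval for a single unit (illustrative).
     * Configurations form an ordered set, 0 = smallest, max_config = largest. */
    typedef struct {
        int  config;
        int  max_config;
        long allocated;   /* miss budget from Equation 5.11            */
        unsigned tried;   /* bitmask of configurations already tried   */
        int  stable;      /* set once the search has converged         */
    } unit_t;

    void tune_step(unit_t *u, long extra_misses)
    {
        if (u->stable)
            return;
        /* Budget exceeded: grow the unit; otherwise try shrinking it. */
        int next = (extra_misses > u->allocated) ? u->config + 1
                                                 : u->config - 1;
        if (next < 0) next = 0;
        if (next > u->max_config) next = u->max_config;
        if (u->tried & (1u << next)) {      /* config tried before: stop */
            u->stable = 1;
            if (extra_misses > u->allocated)
                u->config = next;           /* settle on the larger config */
            return;
        }
        u->tried |= 1u << next;
        u->config = next;
    }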

Like single-unit tuning, the overall tuning algorithm can be extended to take advantage

of phase recurrence. Such a history based algorithm performs a phase table lookup when

execution enters a new phase. If configuration information for the phase is found in the table,

the tuning process is bypassed. If not, tuning is performed and the table is updated when tuning

completes.
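A minimal sketch of the phase table lookup follows; it assumes the table maps a working set signature to the tuned configurations, and that matching uses the relative signature distance, taken here as the ratio of ones in the XOR of two signatures to ones in their union. The table layout and the use of GCC's __builtin_popcount are illustrative assumptions.

    #include <stdint.h>

    #define SIG_WORDS 32   /* 128B signature */
    #define TABLE_MAX 32

    typedef struct {
        uint32_t sig[SIG_WORDS];  /* signature captured when the phase was tuned */
        int cfg[4];               /* tuned configs: I$, D$, L2, branch predictor */
        int valid;
    } phase_entry_t;

    static phase_entry_t table[TABLE_MAX];

    static double rel_distance(const uint32_t *a, const uint32_t *b)
    {
        long nx = 0, nu = 0;
        for (int i = 0; i < SIG_WORDS; i++) {
            nx += __builtin_popcount(a[i] ^ b[i]);
            nu += __builtin_popcount(a[i] | b[i]);
        }
        return nu ? (double)nx / (double)nu : 0.0;
    }

    /* Returns the matching entry, or NULL, in which case the caller runs the
     * trial and error search and records the result when tuning completes. */
    phase_entry_t *phase_lookup(const uint32_t *sig, double delta_th)
    {
        for (int i = 0; i < TABLE_MAX; i++)
            if (table[i].valid && rel_distance(table[i].sig, sig) < delta_th)
                return &table[i];
        return 0;
    }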

5.4.2. Evaluation

The baseline microarchitecture (Table 5.3) and multi-configuration units (Table 5.4) used

for evaluating the algorithm are the same as those used for the single-unit algorithms. The only differ-

ence is that, unlike the single-unit case, all multi-configuration units are tuned concurrently.

The choice of the overall algorithm and the tuning interval is decided by the L2-cache since

it has the maximum reconfiguration overhead. Based on results of the previous section, we use

an optimized history-based algorithm as the overall tuning algorithm. The tuning intervals

are shown in Table 5.11. Empirically computed average miss penalties (see Section 5.3.3.2)

are used for allocating tolerable misses. The miss penalties used are 7, 4, 200, and 16 cycles

respectively for the I-cache, D-cache, L2-cache, and branch predictor. Conservative (larger)

penalties are used for D-cache and L2-cache, given their large variance.

Figure 5.34 shows the resource savings achieved by the multiple-unit tuning algorithm. On average, the algorithm achieves 25%, 17%, 9%, and 30% savings respectively for the I-cache, D-cache, L2-cache, and branch predictor.

Figure 5.34. Resource savings achieved by the multiple-unit tuning algorithm. [Four bar charts: savings (%) per benchmark for the I-cache, D-cache, L2-cache, and branch predictor.]

The savings for the I-cache and branch predictor are

quite good. It would seem that better savings could be achieved using the signature density

based algorithm. However, as explained in Section 5.3.7, the signature density algorithm does

not function well for large tuning granularities.

The savings for the D-cache and L2-cache are low mainly because we have used conser-

vative miss penalties. This leads to fewer misses being allocated to these units. Using smaller

miss penalties improves results for most benchmarks, but can lead to significant performance

degradation for some. By using conservative penalties, we trade off power savings for lower

performance loss.

Figure 5.35. Performance loss caused by the multiple-unit tuning algorithm. [Bar chart: CPI increase (%) per benchmark.]

The average performance loss (Figure 5.35) caused by the algorithm is 1.5%. The low performance loss is an indication that the apportioning algorithm works. However, the performance loss does exceed the threshold significantly for benchmarks such as crafty and parser. This is mainly due to the low stability and short phases of these benchmarks, which lead to repeated tuning. One way to avoid performance loss in such benchmarks is to use a filtering policy that does not initiate tuning for very short phases [60, 61]. However, such algorithms tend to be ad hoc and are thus not considered here.


5.5. Implementation Issues

Long reconfiguration latencies for units such as the L2-cache can stall the processor for thousands of clock cycles. This should be considered while designing the system, or else certain I/O devices or other processors in the system may conclude that the processor is faulty. Mechanisms for handling long latency stalls are incorporated in current systems. For example, processors such as the Pentium 4 [71] include thermal throttling mechanisms which can cause the processor to stall for relatively long durations of time. Systems based on such processors have mechanisms in place to handle such situations.

In this dissertation, tuning algorithms have been evaluated using uni-programmed workloads, i.e., a single program runs on the processor. In practice, microprocessors run multi-programmed workloads, where the OS runs several different programs by periodically context switching between them. All the tuning algorithms, except the history based algorithm, are applicable to multi-programmed workloads unmodified. In fact, the benchmarks studied do perform context switches between the user and kernel mode, following system calls. The context switches are detected as phase changes, thereby triggering the tuning process.

For the history based algorithm, it might be beneficial to maintain a separate phase table

for each process. The mOS can detect context switches and save/restore the phase table cor-

responding to a process. This prevents a process from using the optimal configurations found

for another process, which can potentially lead to performance loss or reduced power savings.

An important issue is the ability to identify contexts without relying directly on conventional

software. This can be done in at least two ways. One way is to rely on architected control reg-

isters that contain context-specific information. For example, different contexts will typically

use different page tables, so an architected control register that points to the process’ page table

can often be used to distinguish the context. Another approach is for the mOS to hash the PC


and some of the architected register values to provide a context identifier at the time a context

is switched out. This hash value can be used by the mOS as a way of identifying saved contexts

when a new context is loaded. A key point is that the phase table contains implementation state

and does not affect the correctness of the program. That is, if the mOS occasionally loads in the

wrong phase table, the worst thing that can happen is performance loss or lower power savings.
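A hypothetical context-identifier hash along these lines is sketched below; the register selection and the mixing constants are illustrative choices, not the dissertation's design.

    #include <stdint.h>

    /* Mix the PC with a few architected register values captured at the time
     * the context is switched out (illustrative). */
    uint64_t context_id(uint64_t pc, const uint64_t *gpr, int n)
    {
        uint64_t h = pc;
        for (int i = 0; i < n; i++) {
            h ^= gpr[i];
            h *= 0x9e3779b97f4a7c15ULL;  /* multiplicative mixing */
            h ^= h >> 29;
        }
        return h;
    }

Because the phase table is implementation state, an occasional collision of such identifiers is harmless: the worst outcome is the performance loss or reduced power savings noted above.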

Finally, one of the issues with the history based algorithm is that it relies on the fact

that previously found optimal configurations are applicable to every recurring instance of the

phase. As discussed before, noise associated with phase behavior can sometimes lead to a sub-

optimal configuration being recorded in the phase table. Also, the performance characteristics

of a phase can vary gradually with time due to changes in the data working set. Both these

phenomena can lead to sub-optimal power/performance if the configurations are reused over

and over. A simple solution to both problems is to periodically flush the history and recapture the program behavior. This may have to be done anyway, given the limited amount of memory available to the mOS.

5.6. Summary

For single-unit tuning, the signature density based algorithm works best for I-cache and branch predictor tuning, the optimized basic algorithm works best for D-cache tuning, and the optimized history based algorithm works best for L2-cache tuning. The average savings and performance loss are summarized in Table 5.12.

Given the high overheads associated with tuning the L2-cache, the only viable multi-unit

tuning algorithm is the history based apportioning algorithm. This algorithm achieves 25%,

17%, 9%, and 30% savings respectively for the I-cache, D-cache, L2-cache, and branch pre-

dictor. The associated performance loss is 1.5%, on average.


Table 5.12. Average savings and performance loss (CPI increase) for the best performing single unit tuning algorithms.

Unit               Savings (%)   ∆CPI (%)   Algorithm
I-cache            53            0.68       Signature density
D-cache            30            0.72       Basic optimized
Branch predictor   48            0.47       Signature density
L2-cache           18            1.12       History optimized

For application specific processors, per-benchmark static tuning provides more savings, on

average, compared to dynamic tuning. Part of the reason is inefficiencies associated with the

dynamic tuning algorithms. However, even if these inefficiencies are removed, the additional

benefits provided by dynamic tuning are small. This is mainly attributed to the low variability

in resource requirements across phases of a given program.


Chapter 6

THE MICRO-OS

Most of the tuning algorithms described in the previous chapter can be implemented in

hardware. However, this requires additional specialized hardware, mainly for computing the

relative signature distance. While the power and area overhead of such hardware may not be

high, a hardware implementation does restrict flexibility. Furthermore, sophisticated and com-

plex algorithms such as history-based tuning may be impractical for hardware implementation.

Hardware/software co-design is an attractive solution to implement sophisticated tuning

algorithms. This chapter describes the design and implementation of a micro-OS (mOS) –

a thin layer of co-designed software that manages the adaptive microarchitecture. The mOS

is developed by hardware designers in conjunction with hardware. This gives the designers

flexibility to implement parts of the algorithm in hardware and parts of it in software. The

partitioning is based on the usual trade-off of speed (hardware) versus flexibility (software).

The mOS relies on virtual machine (VM) technology to make it completely transparent

to all conventional software including the OS. This feature provides hardware designers with

the ability to enhance the mOS continually, without interacting with OS designers. Enhanced

versions can be easily applied to previously deployed hardware via mechanisms similar to

firmware upgrades or microcode patches.

Transparency to conventional software is a key requirement of the mOS. Transparency can

be a significant advantage when the processor and OS are developed by different vendors. The

following sections describe co-designed virtual machines and how transparency is achieved via

a combination of memory protection, special op-codes, and trap handling mechanisms.


6.1. Co-Designed Virtual Machines

A co-designed virtual machine is a system that implements an ISA via a combination of

hardware and software, designed concurrently as part of a coordinated effort. Typically, the

VM implements a conventional ISA (e.g. PowerPC), which we call the virtual ISA (V-ISA)

while the underlying hardware implements a proprietary ISA, called the implementation ISA

(I-ISA). The VM software provides the necessary coupling element to map V-ISA binaries on

to I-ISA hardware. The IBM DAISY [42] and BOA [43] projects, and the Transmeta Cru-

soe [44] processor pioneered the use of co-designed VMs to build general-purpose micropro-

cessors.

Figure 6.1. A co-designed virtual machine. [Block diagram: applications and the operating system sit above the virtual ISA in visible memory; the virtual machine monitor and its data sit in concealed memory above the implementation ISA, which is realized by hardware with profiling and configurable units.]

Figure 6.1 depicts a generic co-designed VM. A key component of VM architecture is

concealed memory – a portion of physical memory completely hidden from all conventional

software. This memory is reserved for the VM software and any attempt to access it is invalid;

the same as if the physical memory were not present.

The core of the VM, the virtual machine monitor (VMM), runs in a privilege level that

supersedes all conventional privilege levels including the supervisor mode. We call this priv-

ilege level the VMM mode. The VMM has ultimate control over the microprocessor and can


take control periodically based on a micro-timer or when certain events such as interrupts oc-

cur. The transfer of control to the VMM and back takes place transparently (without altering

any architected state) and thus, the OS maintains an illusion that it is in direct control of the

processor. This property of the VM, called complete virtualization, guarantees that the OS as

well as applications developed for the V-ISA can run unchanged on the VM.

Co-designed VMs are typically used to provide binary compatibility since the I-ISAs are

designed to achieve specific goals such as high performance [42,43] or low power [44]. We use

the technology to implement configuration management functions transparently. Transmeta’s

LongRun technology [45] is an example of such use of co-designed VMs. For our application,

the V-ISA and I-ISA are the same (PowerPC) except that the latter provides additional in-

structions to access profiling and configuration control hardware. These instructions can only

be executed in VMM mode. Any attempt to execute them in other privilege levels causes an

illegal instruction trap.

6.2. Memory Protection

The mOS code is contained in an on/off-chip ROM, similar to microcode. The binary

image may be compressed to reduce space requirements. However, unlike microcode, the mOS

requires read/write memory in order to maintain its data structures. Thus, it is decompressed

and moved into main memory at boot-time. Moving to main memory not only allows read/write

access but also allows the mOS to use the processor’s memory hierarchy, which is optimized

for performance. There are three possible mechanisms to conceal this memory segment from

conventional software.

1. Modifying the BIOS: The OS can request the memory size from the BIOS using special

interrupts. For example, in a typical x86 system, this can be achieved by sending interrupt


0x15 with function code 0x88 in register ah. Memory can be effectively concealed if

the BIOS is modified to return a smaller size.

2. Modifying parameters stored on the memory module: Many modern memory mod-

ules are compliant with the JEDEC (Joint Electron Device Engineering Council) stan-

dard [83]. JEDEC compliant memory modules have an EEPROM containing various

memory parameters such as size, access speed, and configuration. This EEPROM can

be accessed using the I2C (Inter Integrated Circuit) serial interface to find out available

system memory. Memory can be concealed by modifying the contents of the EEPROM.

3. Using a bounds check mechanism: The mOS can detect installed system memory and program a bounds register in the memory management unit with a smaller value at pre-boot time, i.e., before conventional boot. All physical addresses can be checked against this register. If an address exceeds the bound, execution traps into the mOS, which in turn generates the appropriate response indicating an illegal physical memory address.

After reading through Linux kernel code1, we found that Linux, and other operating systems, do not depend upon JEDEC compliant modules and do not trust the BIOS. They detect available memory by reading and writing information to memory and checking for consistency. In fact, this is the same mechanism that the BIOS uses to detect installed system memory at boot time. Thus, the only robust alternative for protecting the concealed memory is to implement a physical memory bounds check. This has the added advantage that the BIOS, which is typically distributed by third party vendors, does not have to be modified.

Figure 6.2 illustrates the protection mechanism. The mOS executes in VMM mode. Ad-

dress translation and bounds checking is disabled in VMM mode thus providing the mOS with

[Footnote 1: We did not have access to AIX kernel code, so we used Linux as a reference.]

Figure 6.2. Memory layout and protection mechanisms. Translation from virtual address (VA) to physical address (PA) is turned off in VMM mode. An extra bounds check mechanism is added to the translation when not in VMM mode. [Diagram: the mOS accesses concealed memory at the top of total physical memory with no translation; OS and applications go through VA-to-PA translation plus a bounds check, with out-of-bound accesses trapping into the mOS.]

access to the entire physical address space. In the user/supervisor modes, physical memory

bounds checking is enabled, and any access to concealed memory leads to a trap into the mOS.

However, if this occurs, it indicates an error in the OS for placing an invalid physical address

in the page table.
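The following C fragment models this bounds check as we understand it; the variable names and the trap interface are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t vm_bound;  /* top of visible physical memory */

    /* Programmed by the mOS at pre-boot time, before conventional boot. */
    void mos_preboot_init(uint64_t installed_bytes, uint64_t concealed_bytes)
    {
        vm_bound = installed_bytes - concealed_bytes;
    }

    /* Conceptually performed by the MMU on every physical access. */
    bool phys_access_ok(uint64_t paddr, bool vmm_mode)
    {
        if (vmm_mode)
            return true;           /* the mOS sees all physical memory */
        return paddr < vm_bound;   /* out of bound: trap into the mOS  */
    }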

It should be noted that the mOS does not have a backing store, and thus has to reside in a limited amount of memory. This approach has been shown to be viable by the Transmeta Crusoe processor. Transmeta Crusoe based systems reserve 16MB of main memory [84] for the Code Morphing software (i.e., the VM software). Most of this memory is used for the binary translation system, which includes the translation cache. By comparison, the mOS is light-weight and can fit in a much smaller amount of memory.

6.3. Implementation ISA

Implementation of the mOS requires certain changes to the V-ISA. However, these changes

are part of the I-ISA and not visible to conventional software. Conventional software compiled

for the V-ISA can run unmodified on the processor. In this work, the V-ISA is PowerPC and


the I-ISA is a derivative of PowerPC with a few additional instructions, registers, and interrupts

specified.

6.3.1. Registers

Three types of registers are specified in the I-ISA – scratch registers, control registers, and

profiling registers. Of course, these are in addition to those specified in the PowerPC ISA. All

I-ISA specific registers are mapped to a generic namespace VMR as shown in Table 6.1.

Table 6.1. Registers specific to the I-ISA. Each special register corresponds to one register in the generic name space VMR.

Register Name    Generic Name
VMRG0 - VMRG3    VMR[0 - 3]
VMSR             VMR[4]
VMSRR0           VMR[5]
VMDEC            VMR[6]
VMPR0 - VMPR5    VMR[7 - 12]

The scratch registers are used by the mOS to preserve architected state during a context

switch into VMM mode. Scratch registers are essential for any kind of context switch. For

example, on a context switch from user to supervisor mode, the OS uses special registers

such as SPRG0 - SPRG3 in the PowerPC ISA to save/restore state. These registers can not

be accessed in user mode. The mOS, of course, can not use these registers because the OS has

full access to them. Thus, we define four scratch registers: VMRG0 - VMRG3, as part of the

I-ISA. These are mainly used to save and restore the stack pointer (SP) and table of contents

pointer (TOC) [85].

Control registers, as the name suggests, are used to control the behavior of the machine.

We define three control registers VMSR, VMSRR0, and VMDEC. VMSR maintains the status of

the virtual machine which includes the VMM mode bit, exception codes, and configuration

information for the various multi-configuration units. A set VMM mode bit indicates that the


machine is in VMM mode. VMSRR0 is used to facilitate the atomic context switch between

VMM mode and other modes. VMDEC is set up by the mOS to generate a micro-timer interrupt

(MTIMER) after a given number of instructions. VMDEC is decremented on every committed

instruction and raises the MTIMER interrupt when it crosses zero. This provides a mechanism

for the mOS to periodically grab control of the machine, even if it is in supervisor mode. An

interrupt mechanism based on cycle counts instead of instruction counts may be desirable for

many applications. However, we do not implement this because it is not needed for the tuning

algorithms studied.

The profiling registers are used for counting various events of interest to the tuning algo-

rithm. We define registers VMPR0 - VMPR5 to track instructions, cycles, cache misses, and

branch mis-predicts.

6.3.2. Instructions

The I-ISA specifies three types of instructions – to read/write VMM registers (VMRs),

read/write special hardware, and to return from VMM mode. The format and semantics of

these instructions are summarized in Table 6.2.

Table 6.2. Instructions specific to the I-ISA. The formats are specific to the PowerPC ISA [64].

Instruction          PowerPC Format   Semantics
mtvmr VMR, RS        XFX              VMR ← (RS)
mfvmr RT, VMR        XFX              RT ← (VMR)
rddev RT, RA, DEV    X                RT ← DEV((RA))
wrdev RS, RA, DEV    X                DEV((RA)) ← (RS)
rfvm                 XL               NPC ← (VMSRR0);
                                      clear VMM mode bit in VMSR;
                                      trigger reconfiguration if needed

Instructions mtvmr and mfvmr are used to move values to and from the VMM registers.

The instructions are similar to the mtspr and mfspr instructions specified in the PowerPC


ISA. Values can be moved from VMRs to general purpose registers (GPRs) and vice versa, but

not between two VMRs. Writing to VMR[4], the virtual machine status register, can lead to an unintentional context switch.

Instructions rddev and wrdev are used for reading from and writing to a "device", which in this context means a microarchitectural unit. The instructions read/write a data word2 starting at a specified offset in the device. The offsets are word aligned. For our application, only one device, the working set signature, is specified. Other devices (e.g., the branch predictor) may be specified for applications such as saving and restoring implementation state [5].

The rfvm instruction is used to atomically context switch to the user or supervisor mode, depending on the state of the machine before control was transferred to the mOS. The instruction does three things. First, it copies the contents of VMSRR0 to the next PC (NPC). Second, it resets the VMM mode bit in VMSR. Finally, it compares the configurations specified in VMSR with the current configurations and triggers reconfiguration if needed. If reconfiguration is triggered, the machine is stalled until reconfiguration completes and the system is quiesced.
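The three steps of rfvm can be modeled in simulator-style C as follows; the structure, bit position, and field mask are illustrative assumptions rather than the actual PharmSim implementation.

    #include <stdint.h>

    typedef struct {
        uint64_t npc;          /* next PC                                 */
        uint64_t vmsrr0;       /* saved return PC                         */
        uint64_t vmsr;         /* VM status: mode bit + requested configs */
        uint64_t current_cfg;  /* configurations currently in effect      */
    } vm_state_t;

    #define VMM_MODE_BIT (1ULL << 63)  /* illustrative bit position */
    #define CFG_MASK     0xffULL       /* illustrative config field */

    void do_rfvm(vm_state_t *s)
    {
        s->npc  = s->vmsrr0;                 /* 1: return to interrupted code */
        s->vmsr &= ~VMM_MODE_BIT;            /* 2: leave VMM mode             */
        uint64_t want = s->vmsr & CFG_MASK;  /* 3: requested vs. current      */
        if (want != (s->current_cfg & CFG_MASK)) {
            /* stall until reconfiguration completes and the system quiesces */
            s->current_cfg = want;
        }
    }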

6.3.3. Interrupts

The I-ISA defines a MTIMER interrupt to allow the mOS to periodically gain control of

the processor. The interrupt is similar to the PowerPC decrementer interrupt. However, it has a higher priority than all PowerPC interrupts except the system reset and machine check interrupts. When raised, the interrupt causes NPC to be copied into VMSRR0, the VMM mode bit to be set, the MTIMER exception code to be copied into VMSR, and control to be transferred to

offset 0x0 in the concealed memory. Control can be transferred back to conventional software

using the rfvm instruction.

[Footnote 2: The data word is 64 bits in 64-bit mode and 32 bits otherwise.]


Figure 6.3. Control transfer while handling architected and non-architected interrupts. [Diagrams: non-architected interrupts transfer control from the application to the mOS in concealed memory and back, bypassing the OS; architected interrupts are first vectored to the mOS, which redirects them to the OS and regains control on the return path.]

In a co-designed VM environment, interrupts can be classified into architected (part of the

V-ISA) and non-architected (specific to the I-ISA) interrupts. The non-architected interrupts

are invisible to the OS and thus the mOS handles them and transfers control back to the inter-

rupted process without OS intervention or knowledge (see Figure 6.3). The MTIMER interrupt

is an example of a non-architected interrupt. Other non-architected interrupts such as “phase

change” or “performance below threshold” may be implemented to guide tuning algorithms.

The mOS may also intercept architected interrupts in order to detect context switches and

subsequently save/restore critical implementation state [5]. To achieve this, the hardware is de-

signed to transfer control to the mOS on all interrupts (see Figure 6.3). The mOS performs the

required tasks and redirects the interrupt to the appropriate location in the OS. While returning

from an interrupt, control flow is again vectored to the mOS which may subsequently transfer

control to the user mode process. The mOS ensures that none of the registers are clobbered so

that this process takes place transparently. In the context of tuning algorithms, the mOS can

preserve tuning history across context switches. This may provide some benefits in the case of

high-overhead units. However, we do not explore this mechanism in the dissertation.


6.4. Overheads

The run-time overhead of the mOS was evaluated using simulation on PharmSim. This involved two main steps: compiling the mOS, and implementing virtualization mechanisms in PharmSim.

The mOS uses instructions that are specific to the I-ISA. Therefore, it can not be com-

piled using the standard PowerPC compilers. We modified the GNU assembler, gas, to

compile for the I-ISA. Instructions specific to the I-ISA were represented either by extend-

ing available PowerPC instructions, or by using illegal instructions [64]. Specifically, mtvmr

and mfvmr were represented using mtspr and mfspr respectively, but with special register

names. rfvm, rddev, and wrdev were represented using illegal extended opcodes corre-

sponding to primary opcodes 19 and 31.

The mOS is hand-coded in 32-bit PowerPC (extended) assembly. It is assembled into an

AIXCOFF-RS6000 object, which is read by PharmSim. Since PharmSim uses checkpoints, we

overlay the mOS binary onto the memory checkpoint. This way, no new checkpoints have to

be created. Several changes were made to PharmSim in order to implement virtualization. These

are mainly associated with the memory protection and interrupt mechanism described in the

previous section.

The mOS overheads for the basic algorithm are shown in Table 6.3. The overheads are

computed with a tuning interval of 100K instructions and using benchmark gcc. There is some variance across benchmarks, but it is negligible for all practical purposes. The overheads

also increase slightly with the tuning interval because some of the mOS code and data gets

flushed out of the L2-cache. However, the relative overheads become negligible.

The overhead is of the order of 1% for the 128B signature and 3% for the 512B signature.

Most of the overhead (> 90%) is attributed to computing the relative signature distance. The


Table 6.3. mOS overheads for the basic algorithm with a tuning interval of 100K instructions. Overheads are shown in terms of cycles per invocation of the algorithm. The last two columns show the overheads using a population-count instruction. δ is the relative signature distance.

                 Without population-count         With population-count
Signature size   Net overhead   δ computation     Net overhead   δ computation
128B             845            772               229            154
256B             1582           1507              354            280
512B             3056           2982              610            537

high overheads associated with relative signature distance computation are due to the population-count routine. This is in spite of using a very efficient population-count algorithm that runs in O(log N) time.
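For reference, the classic O(log N) population count takes the following form for a 32-bit word. The distance routine below it assumes the relative signature distance is the ratio of ones in the XOR of two signatures to ones in their union; that formulation, and counting a 128B signature as 32 word counts, are our reading of the mechanism.

    #include <stdint.h>

    /* SWAR population count: sums adjacent 1-, 2-, and 4-bit fields, then
     * folds the byte sums together -- a constant number of steps per word. */
    static inline uint32_t popcount32(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x + (x >> 4)) & 0x0f0f0f0fu;
        return (x * 0x01010101u) >> 24;
    }

    /* Relative signature distance over two 128B (32-word) signatures. */
    double rel_distance(const uint32_t *s1, const uint32_t *s2)
    {
        long nx = 0, nu = 0;
        for (int i = 0; i < 32; i++) {
            nx += popcount32(s1[i] ^ s2[i]);
            nu += popcount32(s1[i] | s2[i]);
        }
        return nu ? (double)nx / (double)nu : 0.0;
    }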

One way to reduce overheads is to implement a population-count instruction in hardware.

This requires specialized hardware. However, it should be noted that a population-count in-

struction is useful for other applications such as cryptography and is available in some ar-

chitectures (e.g. CTPOP in ALPHA AXP [86]). The overheads with the population count

instruction are shown in the last two columns of Table 6.3. The instruction clearly helps –

reducing the overheads to < 0.3% for the 128B signature.

The signature density algorithm adds < 5% overhead to the basic tuning algorithm. This

overhead is due to the extra signature population-count at the beginning and end of a phase.

The history based algorithm performs more signature comparisons than the basic algorithm

due to phase table lookups. The overhead due to the additional comparisons depends on 1)

the average number of phase table lookups per tuning interval, and 2) the average number of

signature comparisons performed per lookup.

Figure 6.4 combines the two statistics to show the number of additional signature comparisons per interval performed by the history based algorithm. The results are for L2-cache tuning since the history based algorithm would most likely be used in that case. On average, the history based algorithm performs only 0.2 additional signature comparisons per tuning interval.

Figure 6.4. Number of additional signature comparisons per interval, performed by the history based algorithm. The data shown is for L2-cache tuning with a 5M instruction tuning interval. [Bar chart: comparisons per interval, 0.0 to 1.0, for each benchmark.]

This number is low because 1) phase table lookups are performed only once per dynamic phase, and 2) the number of comparisons per lookup is low – 2, on average. The few additional comparisons add < 20% extra overhead compared to the basic algorithm.

As far as memory requirements are concerned, the static mOS image is less than 2KB in

size. The dynamic requirements are an issue only in the history based algorithm. Our results

for L2-cache tuning show that the maximum phase table size is 20 entries – for benchmark

parser. Using 512B signatures, this leads to a 10KB phase table. Of course, the phase table

size increases with time, but the increase should not be rapid given the large recurrence ratio.

If this is still a problem, a simple FIFO replacement policy will be effective at limiting table

size.

6.5. Summary

In summary, implementing flexible tuning algorithms in the mOS is a viable alternative to

hardware-only implementations. The overall memory and run-time performance overheads of


the mOS are negligible. The static memory image size is less than 2KB, while the dynamic

memory requirements are estimated to be less than 10KB. For a 100K instruction tuning inter-

val, the run-time overhead of the mOS is less than 0.3% using 128B working set signatures.

This overhead further decreases as the tuning interval increases.

One of the issues is the added verification complexity introduced due to the virtual ma-

chine. We believe that the additional complexity is minimal given that most of the mechanisms

for memory protection and interrupt handling are present in modern microprocessors. Also,

unlike binary translation systems such as Transmeta Crusoe, the mOS does not touch any ar-

chitected state, thus lowering the possibility of serious bugs.


Chapter 7

CONCLUSIONS

Microarchitectural resource requirements vary across programs and even within programs

– as they go through distinct phases of execution. Adaptive microarchitectures can adjust

to changing program requirements to provide better power/performance characteristics. Effi-

ciency of the tuning algorithm that governs the adaptation process is key to achieving benefits

from such microarchitectures.

We focus on a class of tuning algorithms that reduce power dissipation without causing

significant performance loss. A CPI increase of > 2% is considered significant. This class of

algorithms is a small subset of the various possible algorithms, but nonetheless an important

one since performance is still the primary goal in general-purpose microprocessor design.

This dissertation defends the thesis that program phase information can be leveraged to

implement efficient tuning algorithms. Furthermore, such algorithms can be implemented via

co-designed virtual machines with minimal hardware and performance overhead. Chapter 3

shows that phase changes can be detected and recurring phases can be identified, using a light-

weight profiling mechanism called the working set signature. Chapter 5 shows that tuning

algorithms based on working set signatures can outperform previously proposed periodic algo-

rithms in terms of resource savings achieved for a given performance loss tolerance. Finally,

Chapter 6 shows that the working set signature based algorithms can be implemented in co-

designed VM software with minimal performance overhead.

The following sections summarize our contributions and results in more detail.


7.1. Phase Detection

We show that observed phase behavior can be precisely defined based on two parame-

ters – granularity and similarity. We propose the instruction working set signature – a lossy-

compressed representation of the instruction working set, as a light-weight profiling mecha-

nism to detect program phase changes dynamically, in hardware. In addition to detecting phase

changes, signatures can be used to estimate the working set size and identify recurring phases.

These properties can be exploited by tuning algorithms to achieve better resource savings and

lower performance loss.

Working set signatures are generated by hashing committed instruction PCs into a bit vec-

tor. Since the signature generation mechanism is off the critical path, it can be implemented in

slow low-power transistors. Our studies show that signatures as small as 128B in size are effec-

tive at detecting phase changes in SPEC CPU2000 benchmarks. This makes the area overhead

of implementing signatures negligible. The signature mechanism can be simplified by using a

simple hash function and periodic sampling. We find that a folded-XOR hash function and a

periodic sampling rate of one in eight instructions works quite well. This roughly translates to

sampling one instruction every two cycles for a four-way superscalar processor.
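A sketch of this mechanism in C follows; the 1-in-8 sampling and the folded-XOR hash follow the description above, while the fold width and bit indexing are illustrative details.

    #include <stdint.h>

    #define SIG_BITS  1024              /* 128B signature */
    #define IDX_BITS  10                /* log2(SIG_BITS) */

    static uint32_t signature[SIG_BITS / 32];

    /* Fold the PC into a SIG_BITS-entry index by XORing 10-bit chunks. */
    static uint32_t fold_xor(uint64_t pc)
    {
        uint32_t h = 0;
        pc >>= 2;                       /* instructions are word aligned */
        while (pc) {
            h ^= (uint32_t)pc & (SIG_BITS - 1);
            pc >>= IDX_BITS;
        }
        return h;
    }

    /* Called at commit; samples one in eight instructions. */
    void signature_update(uint64_t pc)
    {
        static unsigned count;
        if ((count++ & 7) != 0)
            return;
        uint32_t bit = fold_xor(pc);
        signature[bit >> 5] |= 1u << (bit & 31);
    }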

Comparison of different phase detection techniques is complicated because there is no

ideal that techniques can be compared against. We propose a methodology and a set of metrics

to facilitate such comparisons. The metrics of sensitivity, false-positives, stability, average

phase length, variance, and correlation, are chosen since they have a practical appeal in the

context of dynamic optimization systems such as adaptive microarchitectures.

We find that instruction working set signatures are slightly less sensitive to performance

changes compared to basic block vectors (BBV). This is mainly because working sets contain

less information compared to BBVs. However, instruction working set signatures provide


higher stability and achieve 30% longer phases on average compared to the BBV technique.

This can benefit trial and error based tuning algorithms. Branch working set signatures perform

similar to instruction working set signatures on each of the metrics. Procedure working set

signatures, on the other hand, do not perform quite as well mainly due to their inability to

detect phase changes within procedures.

One of the surprising results of this study is that a simple conditional branch counter is

quite effective at detecting phase changes. However, there is higher performance variance

within phases compared with the BBV and working set based methods. This indicates that the

branch counter based technique fails to detect some of the major phase changes. Moreover, it

is not clear whether a conditional branch counter can be used to detect recurring phases.

A unique feature of the working set signature is that the signature density (number of ones)

can be used to estimate working set size. This feature can be exploited in I-cache and branch

predictor algorithms to achieve close to oracle resource savings. Considering this benefit, we

conclude that working set signatures are better suited for tuning algorithms than BBVs or

conditional branch counters.

7.2. Tuning Algorithms

We propose phase based tuning algorithms for adaptive microarchitectures employing sin-

gle as well as multiple configurable units. The units considered for tuning are the I-cache,

D-cache, L2-cache, and branch predictor. Although specific units are studied, the algorithms

are applicable to any unit. In order to draw generic conclusions, units are grouped into low-

overhead and high-overhead units based on the penalties associated with being in a sub-optimal

configuration. The I-cache, D-cache, and branch predictor are categorized as low-overhead

units while the L2-cache is categorized as a high-overhead unit.


We propose three generic working set signature based tuning algorithms. Each of these al-

gorithms triggers tuning when a phase change is detected. The basic tuning algorithm performs

a trial and error search whenever a new phase is detected. It defaults to the largest config-

uration in unstable regions. The signature density based algorithm directly configures units

whose performance depends on working set size. Working set size is estimated from signature

density. The history based algorithm is similar to the basic algorithm but stores the results of

tuning in a phase table. When a phase repeats, configuration information is read from the table

– bypassing the trial and error process.
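
The sketch below captures the control flow of the history based algorithm; the basic
algorithm is the same trial and error search without the phase table, and the signature
density based algorithm replaces the search with a direct size estimate. The configuration
ordering, the signature match threshold, and the performance-acceptability test are
illustrative assumptions.

    # Sketch of the history based tuning algorithm. Configurations are
    # assumed ordered from smallest to largest; the 0.5 match threshold
    # and the perf_ok test are illustrative.
    MATCH_THRESH = 0.5

    def rel_distance(a, b):
        ones = lambda x: bin(x).count("1")
        return ones(a ^ b) / max(1, ones(a | b))

    class HistoryTuner:
        def __init__(self, configs):
            self.configs = configs   # e.g. cache sizes, small to large
            self.table = []          # phase table: (signature, config)
            self.trial = None        # current index while searching

        def on_phase_change(self, sig):
            for stored, cfg in self.table:
                if rel_distance(sig, stored) < MATCH_THRESH:
                    return cfg       # phase repeats: bypass the search
            self.trial = 0           # new phase: start trial and error
            return self.configs[0]

        def on_trial_result(self, sig, perf_ok):
            """Called once per interval while a search is in progress."""
            if not perf_ok and self.trial < len(self.configs) - 1:
                self.trial += 1      # try the next larger configuration
                return self.configs[self.trial]
            best = self.configs[self.trial]
            self.table.append((sig, best))   # remember for this phase
            self.trial = None
            return best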

The signature density based algorithm is the best performing algorithm – achieving close

to oracle savings for negligible performance loss. However, it is applicable only to I-cache and

branch predictor tuning. Of the basic and history based algorithms, the former is more suitable

for low-overhead units and the latter for high-overhead units. On average, the basic algorithm

provides better savings than the history based algorithm. However, it causes significant perfor-

mance degradation for high-overhead units. This behavior is explained by the repeated trial and

error search performed by the basic algorithm. While this brings down the average resource

usage, it also increases performance degradation due to time spent in sub-optimal configura-

tions. The performance degradation is tolerable for low-overhead units but quite significant for

high-overhead units.

We find that a well designed periodic tuning algorithm provides better savings than the

basic tuning algorithm for comparable performance loss. This is because the latter defaults to

the largest configuration when execution enters an unstable region. The basic tuning algorithm

can be improved if the tuning interval is chosen such that stability is maximized. A simple

run-time algorithm can be used to find such an interval. With this optimization, the basic

algorithm outperforms the periodic algorithm. The optimization also benefits the history based

algorithm for similar reasons. However, the signature density based algorithm gives lower

savings because it loses tuning opportunities at larger tuning intervals.
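
The interval-selection idea can be sketched as follows: candidate tuning intervals are scored
by the stability they yield, here taken to be the fraction of consecutive signatures that
match, and the most stable candidate is adopted. The candidate set, the match threshold, and
the signatures_at helper are hypothetical.

    # Sketch of run-time tuning-interval selection. signatures_at(k) is a
    # hypothetical helper returning the signature trace observed when k
    # base intervals are merged into one tuning interval.
    def rel_distance(a, b):
        ones = lambda x: bin(x).count("1")
        return ones(a ^ b) / max(1, ones(a | b))

    def stability(sigs, thresh=0.5):
        """Fraction of consecutive intervals whose signatures match."""
        if len(sigs) < 2:
            return 1.0
        stable = sum(1 for a, b in zip(sigs, sigs[1:])
                     if rel_distance(a, b) < thresh)
        return stable / (len(sigs) - 1)

    def best_interval(signatures_at, candidates=(1, 2, 4, 8, 16)):
        return max(candidates, key=lambda k: stability(signatures_at(k)))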

In summary, the signature density based algorithm works best for I-cache and branch pre-

dictor tuning, the optimized basic algorithm works best for D-cache tuning, and the optimized

history based algorithm works best for L2-cache tuning. On average, the algorithms achieve

resource savings of 53%, 30%, 18%, and 48% by tuning the I-cache, D-cache, L2-cache, and

branch predictor, respectively. The corresponding average performance degradations are 0.7%,

0.7%, 1.1%, and 0.5%.

For application specific processors, per-benchmark static tuning provides more savings, on

average, than dynamic tuning. Part of the reason lies in inefficiencies associated with the

dynamic tuning algorithms. However, even if these inefficiencies are removed, the additional

benefit provided by dynamic tuning is small. This is mainly attributed to the low variability

in resource requirements across phases of a given program.

For tuning multiple units simultaneously, an algorithm that apportions tolerable perfor-

mance loss amongst the units works well. Such an algorithm provides 25%, 17%, 9%, and

30% savings respectively for the I-cache, D-cache, L2-cache, and branch predictor. The asso-

ciated performance loss is 1.5%, on average.
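
The apportioning idea can be sketched as follows, assuming each unit's trial runs yield a
list of (configuration, measured slowdown) pairs ordered from smallest to largest
configuration. The equal split and the 2% overall budget are assumptions made for the
example, not the exact policy evaluated here.

    # Sketch of apportioning a tolerable performance-loss budget among
    # multiple configurable units. unit_trials maps each unit name to a
    # list of (config, measured_loss) pairs ordered small to large.
    def apportion(unit_trials, budget=0.02):
        share = budget / len(unit_trials)
        choice = {}
        for unit, trials in unit_trials.items():
            fits = [cfg for cfg, loss in trials if loss <= share]
            # smallest configuration within the unit's share, falling
            # back to the largest configuration otherwise
            choice[unit] = fits[0] if fits else trials[-1][0]
        return choice

With a 2% budget and four units, for instance, each unit may consume at most a 0.5% slowdown,
so a unit keeps shrinking its configuration only while its measured loss stays within that
share.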

Finally, we show that implementing signature based algorithms in the mOS is a viable

alternative. The overall memory and run-time performance overheads of the mOS are negligi-

ble. The static memory image size is less than 2KB, while the dynamic memory requirements

are estimated to be less than 10KB. For a 100K instruction tuning interval, the run-time over-

head of the mOS is less than 0.3% using 128B working set signatures. This overhead further

decreases as the tuning interval increases.

7.3. Future Directions

We have studied program phase behavior in the context of single-threaded, uniprocessor

workloads. Multi-threaded applications are becoming increasingly prevalent, and thus, study-

ing phase behavior of such applications is an important area of future research. A related area

of research is phase based tuning algorithms for multi-threaded processors such as simulta-

neous multi-threading (SMT) processors. These processors have multiple threads of control

active simultaneously, but resources are typically shared. Phase based algorithms for tuning

such processors can either track phases and resource usage of the various threads indepen-

dently, or track the composite behavior. The advantages and disadvantages of these two

approaches remain to be explored.

This dissertation demonstrates the feasibility of autonomic resource management using

a mOS. The main advantage of the mOS is that it hides low-level implementation-specific

details from the OS. This can potentially improve time to market because interaction between

hardware and OS vendors is eliminated. With the advent of chip multiprocessors (CMP) and

SMT, several implementation-specific details, such as the types of cores and the number of

threads on each core, may have to be exposed to the OS in order to achieve good performance. If any of

these parameters are modified by hardware designers, the OS scheduling policies have to be

changed to take advantage of the new microarchitecture. In fact, in some cases, not changing

the policies could lead to performance loss. This can adversely impact time to market.

The mOS can be used to hide such microarchitectural changes from the OS. For example,

the OS might be informed of the total number of threads on the processor but not the exact

configuration, i.e., the number of cores and the number of threads on each core. The mOS can then

map the threads onto the microarchitecture using profile-guided scheduling algorithms. In

case of heterogeneous CMPs, the algorithms can schedule threads for improved performance

and/or reduced power dissipation depending upon the dynamic program environment. Such

algorithms have a rich design space, and understanding them is an important area of research.

As mentioned before, the mOS can improve performance by saving/restoring performance-

critical implementation state on context switches. The mOS can also maintain a list of critical

implementation state such as TLB entries and L2-cache blocks, and trigger prefetches when it

detects a context switch. This can lead to significant performance improvement in applications

such as web servers, where the context-switch rate is high. While we presented a preliminary

study of this functionality of the mOS [5], a more detailed study is required to understand the

various trade-offs.

BIBLIOGRAPHY

[1] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, “Memory

hierarchy reconfiguration for energy and performance in general purpose architectures,”

Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitec-

ture, pp. 245–257, Dec. 2000.

[2] D. Folegnani and A. Gonzalez, “Energy-effective issue logic,” Proceedings of the 28th

Annual International Symposium on Computer Architecture, pp. 230–239, Jun. 2001.

[3] R. Bahar and S. Manne, “Power and energy reduction via pipeline balancing,” Proceed-

ings of the 28th Annual International Symposium on Computer Architecture, pp. 218–229,

Jun. 2001.

[4] S. H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. N. Vijaykumar, “An integrated

circuit/architecture approach to reducing leakage in deep submicron high-performance I-

caches,” Proceedings of the 7th International Symposium on High Performance Computer

Architecture, pp. 147–157, Jan. 2001.

[5] A. S. Dhodapkar and J. E. Smith, “Saving and restoring contexts via co-designed virtual

machines,” Workshop on Complexity Effective Design (WCED), held in conjunction with

the 28th Annual International Symposium on Computer Architecture, Jun. 2001.

[6] P. Denning, “The working set model for program behavior,” Proceedings of the 1st ACM

Symposium on Operating Systems Principles, pp. 15.1–15.12, 1967.

[7] P. Denning and S. Schwartz, “Properties of the working-set model,” Communications of

the ACM, vol. 15(3), pp. 191–198, Mar. 1972.

[8] P. Denning, “On modelling the behavior of programs,” Proceedings of the AFIPS Conf.

40 (SJCC 1972), pp. 937–944, 1972.

[9] P. Denning and K. Kahn, “A study of program locality and lifetime functions,” Proceed-

ings of the 5th ACM Symposium on Operating Systems Principles, pp. 207–216, 1975.

[10] W. F. Freiberger, U. Grenander, and P. D. Sampson, “Patterns in program references,”

IBM Journal of Research and Development, vol. 19(3), pp. 230–243, May 1975.

[11] A. Batson and W. Madison, “Measurements of major locality phases in symbolic refer-

ence strings,” Proceedings of the International Symposium on Computer Performance and

Modelling, Measurement and Evaluation, ACM SIGMETRICS and IFIP WG7.3, pp. 75–

84, Mar. 1976.

[12] W. Madison and A. Batson, “Characteristics of program localities,” Communications of

the ACM, vol. 19(5), pp. 285–294, May 1976.

[13] S. Majumdar and R. Bunt, “Measurement and analysis of locality phases in file referenc-

ing behaviour,” Proceedings of the 1986 ACM SIGMETRICS joint International Confer-

ence on Computer Performance Modelling, Measurement, and Evaluation, pp. 180–192,

1986.

[14] J. E. Smith and A. S. Dhodapkar, “Dynamic microarchitecture adaptation via co-designed

virtual machines,” 2002 IEEE Solid-State Circuits Conference, Digest of Technical Pa-

pers, pp. 198–199, Feb. 2002.

[15] A. S. Dhodapkar and J. E. Smith, “Managing multi-configuration hardware via dynamic

working set analysis,” Proceedings of the 29th Annual International Symposium on Com-

puter Architecture, pp. 233–244, May 2002.

[16] M. Huang, J. Renau, and J. Torrellas, “Positional adaptation of processors: application to

energy reduction,” Proceedings of the 30th Annual International Symposium on Computer

Architecture, pp. 157–168, Jun. 2003.

[17] A. S. Dhodapkar and J. E. Smith, “Comparison of program phase detection techniques,”

Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitec-

ture, pp. 217–227, Dec. 2003.

[18] T. Sherwood, E. Perelman, and B. Calder, “Basic block distribution analysis to find pe-

riodic behavior and simulation,” Proceedings of the 2001 International Conference on

Parallel Architectures and Compilation Techniques, pp. 3–14, Sep. 2001.

[19] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing

large scale program behavior,” Proceedings of the 10th International Conference on Ar-

chitectural Support for Programming Languages and Operating Systems, pp. 45–57, Oct.

2002.

[20] T. Sherwood, S. Sair, and B. Calder, “Phase tracking and prediction,” Proceedings of

the 30th Annual International Symposium on Computer Architecture, pp. 336–347, Jun.

2003.

[21] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: A transparent dynamic optimization

system,” Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language

Design and Implementation, pp. 1–12, May 2000.

[22] M. J. Hind, V. T. Rajan, and P. F. Sweeney, “Phase shift detection: A problem classifica-

tion,” Tech. Rep. RC22887, IBM T. J. Watson Research Center, Yorktown Heights, NY,

Aug. 2003.

[23] D. H. Albonesi, “Dynamic IPC/clock rate optimization,” Proceedings of the 25th Annual

International Symposium on Computer Architecture, pp. 282–292, Jun. 1998.

[24] D. H. Albonesi, “Selective cache ways: On-demand cache resource allocation,” Pro-

ceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture,

pp. 248–259, Nov. 1999.

[25] H. Zhou, M. C. Toburen, E. Rotenberg, and T. M. Conte, “Adaptive mode control: a

static-power-efficient cache design,” Proceedings of the 2001 International Conference

on Parallel Architectures and Compilation Techniques, pp. 61–70, Sep. 2001.

[26] S. Kaxiras, Z. Hu, and M. Martonosi, “Cache decay: exploiting generational behavior to

reduce cache leakage power,” Proceedings of the 28th Annual International Symposium

on Computer Architecture, pp. 240–251, Jun. 2001.

[27] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy caches: simple

techniques for reducing leakage power,” Proceedings of the 29th Annual International

Symposium on Computer Architecture, pp. 148–157, May. 2002.

[28] L. Li, I. Kadayif, Y.-F. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and A. Siva-

subramaniam, “Leakage energy management in cache hierarchies,” Proceedings of the

2002 International Conference on Parallel Architectures and Compilation Techniques,

pp. 131–140, Sep. 2002.

[29] A. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji, “Adapting cache line size to

application behavior,” Proceedings of the 13th International Conference on Supercom-

puting, pp. 145–154, Jun. 1999.

[30] T. L. Johnson and W. W. Hwu, “Run-time adaptive cache hierarchy via reference analy-

sis,” Proceedings of the 24th Annual International Symposium on Computer Architecture,

pp. 315–326, Jun. 1997.

[31] P. Ranganathan, S. Adve, and N. Jouppi, “Reconfigurable caches and their application

to media processing,” Proceedings of the 27th Annual International Symposium on Com-

puter Architecture, pp. 214–224, Jun. 2000.

[32] T. Juan, S. Sanjeevan, and J. Navarro, “Dynamic history-length fitting: a third level of

adaptivity for branch prediction,” Proceedings of the 25th Annual International Sympo-

sium on Computer Architecture, pp. 155–166, Jun. 1998.

[33] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose, P. Cook, and D. Albonesi, “An

adaptive issue queue for reduced power at high performance,” Workshop on Power-Aware

Computer Systems (PACS), held in conjunction with the 9th International Conference on

Architectural Support for Programming Languages and Operating Systems, Nov. 2000.

[34] A. Buyuktosunoglu, T. Karkhanis, D. H. Albonesi, and P. Bose, “Energy efficient co-

adaptive instruction fetch and issue,” Proceedings of the 30th Annual International Sym-

posium on Computer Architecture, pp. 147–156, Jun. 2003.

[35] D. Ponomarev, G. Kucuk, and K. Ghose, “Reducing power requirements of instruction

scheduling through dynamic allocation of multiple datapath resources,” Proceedings of

the 34th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 90–101,

Dec. 2001.

[36] S. Manne, A. Klauser, and D. Grunwald, “Pipeline gating: speculation control for energy

reduction,” Proceedings of the 25th Annual International Symposium on Computer

Architecture, pp. 132–141, Jun. 1998.

[37] T. Karkhanis, J. E. Smith, and P. Bose, “Saving energy with just in time instruction de-

livery,” Proceedings of the 2002 International Symposium on Low Power Electronics and

Design, pp. 178–183, Aug. 2002.

[38] J. L. Aragon, J. Gonzalez, and A. Gonzalez, “Power-aware control speculation through

selective throttling,” Proceedings of the 9th International Symposium on High Perfor-

mance Computer Architecture, pp. 103–112, Feb. 2003.

[39] S. Ghiasi, J. Casmira, and D. Grunwald, “Using IPC variations in workloads with exter-

nally specified rates to reduce power consumption,” Workshop on Complexity Effective

Design (WCED), held in conjunction with the Proceedings of the 27th Annual Interna-

tional Symposium on Computer Architecture, Jun. 2000.

[40] J. S. Seng, E. S. Tune, and D. M. Tullsen, “Reducing power with dynamic critical path

information,” Proceedings of the 34th Annual IEEE/ACM International Symposium on

Microarchitecture, pp. 114–123, Dec. 2001.

[41] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas, “A framework for dynamic energy

efficiency and temperature management,” Proceedings of the 33rd Annual IEEE/ACM

International Symposium on Microarchitecture, pp. 202–213, Dec. 2000.

[42] K. Ebcioglu and E. Altman, “DAISY: dynamic compilation for 100% architecture com-

patibility,” Tech. Rep. RC20538, IBM T. J. Watson Research Center, Yorktown Heights,

NY, Aug. 1996.

[43] E. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritts, P. Ledak, D. Ap-

penzeller, C. Agricola, and Z. Filan, “BOA: The architecture of a binary translation pro-

cessor,” Tech. Rep. RC21665, IBM T. J. Watson Research Center, Yorktown Heights, NY,

Dec. 2000.

[44] A. Klaiber, “The technology behind Crusoe processors.” Transmeta Corporation White

Paper, Website http://www.transmeta.com/corporate/pressroom/whitepapers.html, Jan. 2000.

[45] M. Fleischmann, “LongRun power management - dynamic power management for

Crusoe processors.” Transmeta Corporation White Paper, Website http://www.transmeta.com/corporate/pressroom/whitepapers.html, Jan. 2001.

[46] C. F. Webb and J. S. Liptay, “A high-frequency custom CMOS S/390 microprocessor,”

IBM Journal of Research and Development, vol. 41(4/5), pp. 463–474, Jul. 1997.

[47] “PALcode for Alpha microprocessors system design guide.” Digital Equipment Corpora-

tion, May 1996.

[48] J. L. Henning, “SPEC CPU2000 memory footprint.” Website http://www.spec.org/cpu2000/analysis/memory.

[49] T. Sherwood and B. Calder, “Time varying behavior of programs,” Tech. Rep. UCSD-

CS99-630, University of California - San Diego, Aug. 1999.

[50] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach.

San Francisco, CA: Morgan Kaufmann, third ed., 2003.

[51] D. Burger and T. Austin, “The SimpleScalar tool set, version 2.0,” Tech. Rep. 1342, Uni-

versity of Wisconsin - Madison, Computer Sciences Department, Jun. 1997.

[52] J. L. Henning, “SPEC CPU2000: Measuring CPU performance in the new millennium,”

IEEE Computer, vol. 33(7), pp. 28–35, Jul. 2000.

[53] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun, “Confidence estimation for spec-

ulation control,” Proceedings of the 25th Annual International Symposium on Computer

Architecture, pp. 122–131, Jul. 1998.

[54] M. S. Pepe, The statistical evaluation of medical tests for classification and prediction.

Oxford University Press, 2003.

[55] Q. Wu, M. Pedram, and X. Wu, “Clock-gating and its application to low power design of

sequential circuits,” IEEE Transactions on Circuits and Systems-I: Fundamental Theory

and Applications, vol. 47(3), pp. 415–420, Mar. 2000.

[56] M. D. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, “Gated-Vdd: A

circuit technique to reduce leakage in cache memories,” Proceedings of the 2000 Inter-

national Symposium on Low Power Electronics and Design, pp. 90–95, Jul. 2000.

[57] S. Borkar, “Low power design challenges for the decade,” Proceedings of the ASP-DAC

2001, pp. 293–296, Jan. 2001.

[58] K. Krewell, “Pentium M hits the street,” Microprocessor Report, Mar. 2003.

[59] “Intel describes billion-transistor four-core Itanium processor,” EE Times, Oct. 16 2002.

[60] A. S. Dhodapkar and J. E. Smith, “Tuning reconfigurable microarchitectures for power

efficiency,” 11th Reconfigurable Architectures Workshop (RAW 2004), held in conjunction

with the 18th International Parallel and Distributed Processing Symposium, Apr. 2004.

[61] A. S. Dhodapkar and J. E. Smith, “Tuning adaptive microarchitectures,” To appear in the

International Journal on Embedded Systems.

[62] R. Desikan, D. Burger, and S. Keckler, “Measuring experimental error in microproces-

sor simulation,” Proceedings of the 28th Annual International Symposium on Computer

Architecture, pp. 266–277, Jun. 2001.

[63] H. W. Cain, K. M. Lepak, B. A. Schwartz, and M. H. Lipasti, “Precise and accurate pro-

cessor simulation,” The 5th Workshop on Computer Architecture Evaluation using Com-

mercial Workloads (CAECW), held in conjunction with the 8th International Symposium

on High Performance Computer Architecture, Feb. 2002.

[64] The PowerPC Architecture: A Specification for a New Family of RISC Processors. San

Francisco, CA: Morgan Kaufmann, second ed., 1994.

[65] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, “Complete computer system

simulation: The SimOS approach,” IEEE Parallel & Distributed Technology: Systems &

Technology, vol. 3(4), pp. 34–43, Dec. 1995.

[66] T. Keller, A. M. Maynard, R. Simpson, and P. Bohrer, “SimOS-PPC: ARL’s

full system simulation project.” Website http://www.cs.utexas.edu/users/cart/simOS.

[67] A. Charlesworth, A. Phelps, R. Williams, and G. Gilbert, “Gigaplane-XB: Extending the

ultra enterprise family,” Proceedings of Hot Interconnects Symposium V, Aug. 1997.

[68] Private communication with Mikko Lipasti, University of Wisconsin - Madison.

[69] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway, “The AMD Opteron processor

for multiprocessor servers,” IEEE Micro, vol. 23(2), pp. 66–76, 2003.

[70] J. M. Tendler, J. S. Dodson, J. S. Fields Jr., H. Le, and B. Sinharoy, “POWER4 system

microarchitecture,” IBM Journal of Research and Development, vol. 46(1), pp. 5–26, 2002.

[71] G. Hinton et al., “The microarchitecture of the Pentium 4 processor,” Intel Technology

Journal, 1st quarter, 2001.

[72] S. Gochman et al., “Intel Pentium M processor: Microarchitecture and performance,”

Intel Technology Journal, vol. 7(2), 2003.

[73] J. W. Tschanz, S. G. Narendra, Y. Ye, B. A. Bloechel, S. Borkar, and V. De, “Dynamic

sleep transistor and body bias for active leakage power control of microprocessors,” IEEE

Journal of Solid-State Circuits, vol. 38(11), pp. 1838–1845, Nov. 2003.

[74] M. D. Hill, Aspects of cache memory and instruction buffer performance. PhD thesis,

University of California, Berkeley, 1987.

[75] P. Michaud, A. Seznec, and R. Uhlig, “Trading conflict and capacity aliasing in condi-

tional branch predictors,” Proceedings of the 24th Annual International Symposium on

Computer Architecture, pp. 292–303, Jun. 1997.

[76] A. S. Dhodapkar, C. H. Lim, G. Cai, and R. Daasch, “TEM2P2EST: A thermal enabled

multi-model power/ performance estimator,” Workshop on Power-Aware Computer Sys-

tems (PACS), held in conjunction with the 9th International Conference on Architectural

Support for Programming Languages and Operating Systems, Nov. 2000.

[77] B. Falsafi and T. N. Vijaykumar, eds., Power-Aware Computer Systems: First Interna-

tional Workshop, PACS 2000, Cambridge, MA, USA, November 12, 2000, Revised Papers,

vol. 2008 of Lecture Notes in Computer Science. Springer-Verlag Heidelberg, 2001.

[78] G. Cai, A. S. Dhodapkar, and J. E. Smith, “Integrated performance, power, and thermal

modeling,” Journal of Circuits, Systems and Computers, vol. 11(6), pp. 659–676, Dec.

2002.

[79] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for architectural-level

power analysis and optimizations,” Proceedings of the 27th Annual International Sympo-

sium on Computer Architecture, pp. 83–94, Jun. 2000.

[80] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye, “Energy-driven

integrated hardware-software optimizations using SimplePower,” Proceedings of the 27th

Annual International Symposium on Computer Architecture, pp. 85–106, Jun. 2000.

[81] T. S. Karkhanis and J. E. Smith, “A first-order superscalar processor model,” Proceedings

of the 31st Annual International Symposium on Computer Architecture, pp. 338–349, Jun.

2004.

[82] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, “AC/DC: An adaptive data cache

prefetcher,” Proceedings of the 2004 International Conference on Parallel Architectures

and Compilation Techniques, pp. 135–145, Sep. 2004.

[83] Joint Electron Device Engineering Council. Website http://www.jedec.org.

[84] “Crusoe processor Model TM5800 product brief.” Transmeta Corporation Technical

Document, Website http://www.transmeta.com/developers/techdocs.html.

[85] “AIX assembler language reference.” Website http://www16.boulder.ibm.com/pseries/en_US/aix_assem/alangref/alangreftfrm.htm.

[86] Alpha AXP architecture reference manual. Newton, MA: Digital Press, second ed., 1995.