The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
ORCHESTRATING THE COMPILER AND
MICROARCHITECTURE FOR REDUCING CACHE ENERGY
A Thesis in
Computer Science and Engineering
by
Jie Hu
© 2004 Jie Hu
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August 2004
The thesis of Jie Hu was reviewed and approved∗ by the following:
Vijaykrishnan Narayanan
Associate Professor of Computer Science and Engineering
Thesis Adviser
Chair of Committee

Mary Jane Irwin
A. Robert Noll Chair of Engineering
Professor of Computer Science and Engineering

Mahmut Kandemir
Associate Professor of Computer Science and Engineering

Yuan Xie
Assistant Professor of Computer Science and Engineering

Richard R. Brooks
Research Associate/Department Head, Applied Research Laboratory
Industrial and Manufacturing Engineering

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering
∗Signatures are on file in the Graduate School.
Abstract
Cache memories are widely employed in modern microprocessor designs to bridge
the growing speed gap between the processor and the off-chip main memory, which
constitutes the major performance bottleneck in computer systems. Consequently, caches
consume a significant fraction of the transistor budget and chip die area in microprocessors,
in both low-end embedded systems and high-end server systems. As a major consumer
of on-chip transistors, and thus also of the power budget, cache memory deserves a new
and complete study of its performance and energy behavior, as well as new techniques
for designing cache memories for next-generation microprocessors.
This thesis focuses on developing compiler and microarchitecture techniques for
designing energy-efficient caches, targeting both dynamic and leakage energy. It makes
four major contributions towards energy-efficient cache architectures. First, a detailed
cache behavior characterization was performed for both array-based embedded applications
and general-purpose applications. The insights obtained from this study suggest that
(1) different applications, or different code segments within a single application, have
very different cache demands in terms of performance and energy; (2) program execution
footprints (instruction addresses) can be highly predictable and usually have a narrow
scope during a particular execution phase, especially for embedded applications; and
(3) accesses to the instruction cache exhibit high sequentiality. Second, a technique
called compiler-directed cache polymorphism (CDCP) is proposed. CDCP analyzes the
data reuse exhibited by loop nests in order to extract the cache demands and determine
the best data cache configuration for each code segment, so as to achieve the best
performance and energy behavior. Third, this thesis presents a redesigned processor
datapath that captures and utilizes the predictable execution footprint to reduce energy
consumption in instruction caches. Finally, this thesis work addresses the increasing
leakage concern in the instruction cache by exploiting cache hotspots during phase
execution and the sequentiality exhibited in the execution footprint.
To my Mom and Dad
and
To Kai Chen
Table of Contents

List of Tables
List of Figures
Acknowledgments

Chapter 1. Introduction
1.1 Caches as the Bridge between Processor and Memory
1.2 Basics on Cache Energy Consumption
1.3 Thesis Statement
1.4 Challenges in This Work
1.4.1 Determine Cache Resource Demands
1.4.2 Redesign Instruction-Supply Mechanism
1.4.3 Application Sensitive Leakage Control
1.5 Contributions
1.6 Thesis Roadmap

Chapter 2. Related Work
2.1 Compiler Optimizations
2.1.1 Addressing Dynamic Energy Consumption
2.1.2 Managing Cache Leakage
2.2 Architectural and Microarchitectural Schemes
2.2.1 Using Additional Smaller Caches
2.2.2 Changing Load Capacitance of Cache Access
2.2.3 Improving the Fetch Mechanism
2.2.4 Reducing Leakage in Caches
2.3 Circuit and Device Techniques

Chapter 3. Experimental Models
3.1 Simulation Frameworks
3.1.1 Evaluating Compiler Schemes
3.1.2 Evaluating Microarchitectural Schemes
3.2 Benchmarks and Input Sets

Chapter 4. Characterizing Application and Cache Behavior
4.1 Data Cache Demands for Performance
4.2 Instruction Execution Footprint
4.3 Accessing Behavior in Instruction Cache

Chapter 5. Analyzing Data Reuse for Cache Energy Reduction
5.1 Introduction
5.2 Array-Based Codes
5.2.1 Representation for Programs
5.2.2 Representation for Loop Nests
5.2.3 Representation for Array References
5.3 Cache Behavior
5.3.1 Cache Misses
5.3.2 Data Reuse and Data Locality
5.4 Algorithms for Cache Polymorphism
5.4.1 Compiler-Directed Cache Polymorphism
5.4.2 Formal Description of Program Hierarchies
5.4.3 Array References and Uniform Reference Sets
5.4.4 Algorithm for Reuse Analysis
5.4.4.1 Self-Reuse Analysis
5.4.4.2 Group-Reuse Analysis
5.4.5 Simulating the Footprints of Reuse Spaces
5.4.6 Computation and Optimization of Cache Configurations for Loop Nests
5.4.7 Global Level Cache Polymorphism
5.4.8 An Example
5.5 Experiments
5.5.1 Simulation Framework
5.5.2 Selected Cache Configurations
5.5.3 Simulation Results
5.6 Discussions and Summary

Chapter 6. Reusing Instructions for Energy Efficiency
6.1 Introduction
6.2 Modified Issue Queue Design
6.2.1 Detecting Reusable Loop Structures
6.2.2 Buffering Reusable Instructions
6.2.2.1 Buffering Strategy
6.2.2.2 Handling Procedure Calls
6.2.3 Optimizing Loop Buffering Strategy
6.2.4 Reusing Instructions in the Issue Queue
6.2.5 Restoring Normal State
6.3 Distribution of Dynamic Loop Code
6.4 Experiments
6.5 Impact of Compiler Optimizations
6.6 Discussions and Summary

Chapter 7. Managing Instruction Cache Leakage
7.1 Introduction
7.2 Existing Approaches: Where Do They Stumble?
7.3 Using Hotspots and Sequentiality in Managing Leakage
7.3.1 HSLM: HotSpot Based Leakage Management
7.3.1.1 Protecting Program Hotspots
7.3.1.2 Detecting New Program Hotspots
7.3.2 JITA: Just-In-Time Activation
7.4 Design Space Exploration
7.5 Experimental Evaluation
7.5.1 Experiment Setup
7.5.2 Experimental Results
7.5.3 Sensitivity Analysis
7.6 Discussions and Summary

Chapter 8. Conclusions and Future Work
8.1 Conclusions
8.2 Future Work

References
List of Tables

3.1 Base configurations of the simulated processor and memory hierarchy for evaluating microarchitectural schemes.
3.2 Array-based benchmarks used in the experiments.
3.3 Benchmarks from SPEC2000 used in the experiments.
5.1 Cache configurations generated by Algorithm 4 for the example nest.
5.2 Running time of Algorithm 4 for each benchmark.
5.3 Cache configurations for each loop nest in the benchmarks: Shade vs. CDCP.
5.4 Energy consumption (microjoules) of the L1 data cache for each loop nest in the benchmarks, with the configurations in Table 5.3: Shade vs. CDCP.
7.1 Leakage control schemes evaluated: turn-off mechanisms.
7.2 Leakage control schemes evaluated: turn-on mechanisms.
7.3 Technology and energy parameters for the simulated processor given in Table 3.1.
List of Figures

1.1 Typical memory hierarchy in modern computer systems.
1.2 Leakage current paths in an SRAM cell. The bitline leakage flows through the access transistor Nt2, while the cell leakage flows through transistors N1 and P2.
4.1 Cache performance behavior as the data cache size increases from 1KB to 1024KB. All cache configurations use a fixed block size of 32 bytes and a fixed associativity of 4 ways.
4.2 Dynamic instruction address distribution at runtime for a set of array-intensive codes. A sampling rate of 1/500 means that only one instruction among 500 dynamic instructions is sampled for its address (PC).
4.3 The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The rightmost bar in each plot corresponds to sequential lengths larger than 32 cache lines. Continued in Figure 4.4.
4.4 (Continued from Figure 4.3) The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The last two plots show the average distributions for the integer and floating-point benchmarks used, respectively.
5.1 Format for a program.
5.2 Format for a loop nest.
5.3 Overview of compiler-directed cache polymorphism (CDCP).
5.4 Intermediate format of source codes produced by the generator.
5.5 Example code: a loop nest.
5.6 An example: array-based code.
5.7 Cache performance comparison for configurations at a block size of 16: Shade vs. CDCP.
5.8 Cache performance comparison for configurations at a block size of 32: Shade vs. CDCP.
5.9 Cache performance comparison for configurations at a block size of 64: Shade vs. CDCP.
5.10 A breakdown of the cache performance comparison at the granularity of each loop for benchmarks adi, aps, bmcm, and tsf. Configurations for all three cache block sizes (16, 32, and 64 bytes) are compared: Shade vs. CDCP.
5.11 A breakdown of the cache performance comparison at the granularity of each loop for benchmarks eflux, tomcat, vpenta, and wss. Configurations for all three cache block sizes (16, 32, and 64 bytes) are compared: Shade vs. CDCP.
6.1 (a) The datapath diagram and (b) the pipeline stages of the modeled baseline superscalar microprocessor. Parts in dotted lines are augmented for the new design.
6.2 State machine for the issue queue.
6.3 The new issue queue with augmented components supporting instruction reuse.
6.4 An example of a non-bufferable loop that is an outer loop in this code piece.
6.5 Dynamic instruction distribution with respect to loop sizes.
6.6 Percentages of the total execution cycles during which the pipeline front-end has been gated, with different issue queue sizes: 32, 64, 128, and 256 entries.
6.7 Access reduction and energy reduction in the instruction cache, branch predictor, instruction decoder, and issue queue.
6.8 The overall power reduction compared to a baseline microprocessor using the conventional issue queue, at different issue queue sizes.
6.9 Performance impact of reusing instructions at different issue queue sizes.
6.10 Impact of compiler optimizations on instruction cache accesses.
6.11 Impact of compiler optimizations on overall energy saving.
6.12 Impact of compiler optimizations on performance degradation.
7.1 (a) A simple loop with two portions; (b) bank mapping for the loop given in (a).
7.2 Leakage control circuitry supporting Just-in-Time Activation (JITA).
7.3 Microarchitecture for the HotSpot based Leakage Management (HSLM) scheme. Note that the outputs of the AND gates go to the set inputs of the mask latches.
7.4 The ratio of cycles during which cache lines are in active mode over the entire execution time (active ratio).
7.5 Breakdown of turn-offs in scheme DHS-Bk-PA.
7.6 Leakage energy reduction with respect to the Base scheme.
7.7 The leakage energy breakdown (an average over fourteen SPEC2000 benchmarks).
7.8 Ratio of activations on instruction cache hits.
7.9 The ratio of effective preactivations performed by JITA over the total activations incurred during the entire simulation.
7.10 Performance degradation with respect to the Base scheme.
7.11 Energy-delay product (EDP), in J·s.
7.12 Impact of the sampling window size on leakage control scheme DHS-Bk-PA.
7.13 Impact of the hotness threshold on leakage control scheme DHS-Bk-PA.
7.14 Impact of the subbank size on leakage control scheme DHS-Bk-PA.
7.15 Impact of cache associativity: IPC degradation (left), leakage energy reduction (right).
Acknowledgments
I would like to take this special moment to thank the many people who guided
and helped me through these four invaluable years of Ph.D. life at Penn State. First
and foremost, my heartfelt gratitude goes to my thesis advisor, Dr. Vijaykrishnan
Narayanan. He introduced me to this most exciting research area and supervised my
research with great enthusiasm. He was always there, ready to help whenever I had
difficulties or was in a quandary with my research. I will never forget those sincere and
free discussions on a wide variety of topics. Working with him has brought me lifelong
benefits.
I am also grateful and indebted to Dr. Mary Jane Irwin and Dr. Mahmut Kandemir
for their great suggestions and advice, enlightening discussions, and the happy time
we worked together. I thank my other committee members, Dr. Richard Brooks and
Dr. Yuan Xie, for their insightful commentary on my work.
I feel very lucky to have worked with many of our wonderful MDL members. My
work at MDL would not have been such a great joy without those friends, who never
hesitated to stop by for brainstorming or chit-chat. I would like to thank you all,
especially Wei Zhang, Vijay Degalahal, Avanti Nadgir, Feihui Li, Yuh-Fang Tsai,
Jooheung Lee, and Soontae Kim.
Finally and most importantly, I owe my deepest gratitude and thanks to my
family. Mom, Dad, and Jiaorong, your unconditional support made all this possible.
My dear wife, Kai Chen, you are always by my side, supporting me, encouraging me,
and sharing every joy and every anguish; your love has made these Ph.D. years a
fantasy.
Chapter 1
Introduction
Cache memories are widely employed in modern microprocessor designs to bridge
the growing speed gap between the processor and the off-chip main memory, which
constitutes the major performance bottleneck in computer systems. Consequently, caches
consume a significant fraction of the transistor budget and chip die area in microprocessors,
in both low-end embedded systems and high-end server systems. As a major consumer
of on-chip transistors, and thus also of the power budget, cache memory deserves a new
and complete study of its performance and energy behavior, as well as new techniques
for designing cache memories for next-generation microprocessors.
1.1 Caches as the Bridge between Processor and Memory
The computer memory system adopts a hierarchical design that takes advantage
of reference locality (in both code and data) and exploits the cost/performance tradeoffs
of different memory technologies. The growing speed gap between the fast processor and
the slow memory has made the memory hierarchy increasingly important: microprocessor
speed has improved by around 60% per year since 1987, while memory performance has
improved by less than 10% per year since 1980. As pointed out in [28], microprocessors
designed in 1980 often had no caches, while in 1995 microprocessors often integrated
two levels of caches. Figure 1.1 shows the typical memory hierarchy present in modern
computer systems. The closer a memory structure is to the CPU, the faster, smaller,
and more expensive it is. Level-one caches usually operate at the same clock frequency
as the processor core.
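The performance benefit of this hierarchy can be illustrated with the standard average memory access time (AMAT) model. The numbers below are illustrative assumptions, not measurements from this thesis:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus the weighted miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative (assumed) values: a 1-cycle L1 hit, 5% miss rate,
# and a 100-cycle main-memory penalty.
with_cache = amat(1, 0.05, 100)   # 6.0 cycles on average
without_cache = 100               # every access goes to main memory
print(with_cache, without_cache)
```

Even with a modest hit rate, the average access time stays close to the fast cache's latency rather than the slow memory's, which is exactly why each level of the hierarchy pays for itself.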
Fig. 1.1. Typical memory hierarchy in modern computer systems (CPU registers, L1 caches, L2/L3 caches, main memory, and disk storage).
In today's microprocessor designs, for both low-end embedded microprocessors
and high-performance general-purpose microprocessors, larger on-chip caches are always
preferred for the sake of performance. Embedded microprocessors in general have a
much simpler memory hierarchy, with only one level of on-chip data and instruction
caches; e.g., the StrongArm SA-110 processor [56] has only level-one caches, which
account for 94% of its total transistor budget. Continuous technology scaling enables
large and multilevel cache structures to be integrated with the processor core on a
single die. For the latest processors, a large fraction of the transistor budget and die
area is dedicated to the cache structures, e.g., 90% in the Alpha 21364 processor [9]
and 86% (93% in the second generation) in the Itanium 2 processor [57][70].
Meanwhile, excessive power/energy consumption has become one of the major
impediments in designing future microprocessors as semiconductor technology continues
to scale down. Low-power design is important not only in battery-powered embedded
systems, but also in desktop PCs, workstations, and even servers, due to the increasing
packaging and cooling costs. As on-chip caches dominate the transistor budget, they
are a major contributor to the dynamic and leakage power consumption of processors.
For example, the data cache and instruction cache consume 43% of the total dynamic
power in the DEC StrongArm SA-110 [56] and 22% in an IBM PowerPC microprocessor
[10]. Leakage power is projected to account for 70% of the cache power budget in 70nm
technology [20]. Thus, optimizing the power/energy consumption of caches is of
first-class importance in microprocessor design.
1.2 Basics on Cache Energy Consumption
Cache energy consumption consists of two parts: dynamic energy Edyn and leakage
energy Eleak, as shown in Equation 1.1.

E = Edyn + Eleak (1.1)
The dynamic energy consumption of a device can be modeled as Equation 1.2,

Edyn = CL × VDD² × P0→1 (1.2)

where CL is the load capacitance of the device, VDD is the supply voltage, and P0→1 is the
probability that the device switches. Clearly, optimizing dynamic energy consumption
can attack one or more of these three parameters: reducing the number of switching
devices, lowering the supply voltage, or reducing the switching probability. The equation
also shows that reducing the supply voltage has a quadratic effect on dynamic energy
consumption. However, a lower supply voltage leads to slower circuits.
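The quadratic effect of VDD in Equation 1.2 can be checked directly; the capacitance and switching values below are arbitrary illustrative assumptions:

```python
def dynamic_energy(c_load, vdd, p_switch):
    """Dynamic energy per cycle: E_dyn = C_L * VDD^2 * P_(0->1)."""
    return c_load * vdd**2 * p_switch

# Halving VDD cuts dynamic energy by 4x (the quadratic effect),
# whereas halving C_L or the switching probability cuts it only by 2x.
base = dynamic_energy(1e-12, 1.2, 0.5)     # illustrative values
low_vdd = dynamic_energy(1e-12, 0.6, 0.5)
assert abs(base / low_vdd - 4.0) < 1e-9
```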
At the microarchitecture level, the Cacti cache model [63] is used in this thesis to
derive the dynamic energy consumption of an access for a given cache configuration.
The energy consumption comes from two parts: the tag portion and the data array
portion. In the tag portion, energy is consumed in the address decoder, wordlines,
bitlines, sense amplifiers, comparators, mux drivers, and output drivers. Similarly, in
the data portion, energy is consumed in the address decoder, wordlines, bitlines, sense
amplifiers, and output drivers. The data bitlines and sense amplifiers are responsible
for the majority of the energy consumption in low-associativity caches [63].
On-chip caches constitute the major portion of the processor's transistor budget
and account for a significant share of leakage, which can be derived from Equation 1.3.
Leakage current is a combination of subthreshold leakage current and gate-oxide leakage
current: Ileak = Isub + Iox.

Eleak = VDD × Ileak × t (1.3)
Fig. 1.2. Leakage current paths in an SRAM cell. The bitline leakage flows through the access transistor Nt2, while the cell leakage flows through transistors N1 and P2.
Figure 1.2 illustrates the various leakage current paths in a typical SRAM cell.
The current through the access transistor Nt2 from the bitline is referred to as bitline
leakage, while the current flowing through transistors N1 and P2 is cell leakage. Both
bitline and cell leakage result from subthreshold conduction: current flowing from
source to drain even when the gate-source voltage is below the threshold voltage Vth.
The following equation, developed in [16], shows how the subthreshold leakage
current depends on the threshold voltage and the supply voltage:

Isub = K1 × W × e^(−Vth/(n·Vθ)) × (1 − e^(−VDD/Vθ)) (1.4)

K1 and n are experimentally derived, W is the gate width, and Vθ in the exponents is
the thermal voltage. At room temperature, Vθ is about 25 mV; it increases linearly with
temperature. Equation 1.4 suggests two ways to reduce Isub: reducing the supply voltage
VDD or increasing the threshold voltage Vth. However, for an SRAM cell, lowering
VDD may destroy the stored state, while using high-Vth transistors increases the access
latency of SRAM cells.
On the other hand, gate-oxide leakage Iox is projected to be dramatically reduced
once high-k dielectric gate insulators reach mainstream production [1]. Thus, this thesis
does not consider gate leakage further.
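As a rough numerical illustration of Equation 1.4, the sketch below plugs in assumed values for K1, n, Vθ, and the voltages (none of them from this thesis) to show the exponential sensitivity of Isub to Vth:

```python
import math

def i_sub(k1, width, vth, vdd, n=1.5, v_theta=0.025):
    """Subthreshold leakage current per Equation 1.4.

    k1 and n are empirical fitting constants, width is the gate width,
    and v_theta is the thermal voltage (~25 mV at room temperature).
    All constants used here are illustrative assumptions.
    """
    return k1 * width * math.exp(-vth / (n * v_theta)) * (1.0 - math.exp(-vdd / v_theta))

base = i_sub(k1=1e-6, width=1.0, vth=0.30, vdd=1.0)
high_vth = i_sub(k1=1e-6, width=1.0, vth=0.40, vdd=1.0)
# Raising Vth by 100 mV cuts leakage exponentially
# (here by e^(0.1/0.0375), roughly 14x).
assert high_vth < base / 10
```

This exponential dependence is why high-Vth transistors are so effective against leakage, and also why they cannot be used freely on latency-critical SRAM paths.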
1.3 Thesis Statement
Caches were first introduced to form the memory hierarchy in computer systems
for the sake of performance. Technology advances have since turned increasing
power/energy consumption into a major constraint in designing future microprocessors.
Due to their large share of the processor's transistor budget, die area, and energy
budget, caches are an ideal target for optimizing microprocessor energy behavior. With
the principal role of caches in mind, any cache-based energy optimization should be
carefully weighed so as not to jeopardize performance noticeably.
This thesis explores the implicit connection between an application's characteristics
and its cache behavior, and tries to answer how this information can be used
to reduce both dynamic and leakage energy consumption in caches, and how compiler
and microarchitectural schemes can be orchestrated for this purpose. More specifically,
this work focuses on four main problems. First, given an application, what information
should be extracted and what characteristics should be studied for cache energy
optimization? Second, how should this information be extracted: at compile time or at
run time? Third, how can the compiler and the microarchitecture utilize the analytical
results from the second step? Finally, how can one justify that the information from
the first step is the right information for the purpose of this thesis work?
1.4 Challenges in This Work
There are several major challenges in achieving the objective of this thesis work.
As discussed in Section 1.2, dynamic energy can be reduced by attacking one or more
of the three factors CL, VDD, and P0→1 in Equation 1.2. Lowering VDD is suggested as
the most effective way due to its quadratic effect on energy consumption. The major
problem associated with this approach is that lowering VDD also slows the circuit. Notice
that instruction fetch and data loading are on the critical path of the processor datapath
pipeline. Increasing the access latency of the level-one instruction cache or data cache
is not acceptable even for energy optimization, which leaves two other options: reducing
CL or reducing P0→1. Now, the question is what can be done with these two factors
with respect to the principal role of caches.
There is a similar problem when optimizing leakage (subthreshold) energy in
caches. Equation 1.4 tells us that increasing the threshold voltage Vth reduces the
subthreshold leakage current. However, a high Vth results in a longer access latency
to the SRAM cell. Reducing the supply voltage VDD to zero eliminates the subthreshold
leakage current (Isub = 0), but the content or state stored in the SRAM cell is also
destroyed, so a later access to the cell results in voltage recovery and an access to
lower-level caches. Lowering the supply voltage VDD to a particular point that yields a
significant leakage reduction while still retaining the content is more attractive in terms
of performance: a later access to the cell only needs to restore the supply voltage to the
regular level before performing the access. This voltage restoration incurs both
performance and energy penalties. Now, the tricky question is how to maximize the benefit
of leakage reduction at the least cost in performance.
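One way to frame this tradeoff is as a break-even computation: a state-preserving low-voltage mode pays off only when a cache line stays idle long enough for the leakage savings to outweigh the wake-up cost. All numbers below are hypothetical illustrations, not parameters from this thesis:

```python
def net_saving(idle_cycles, leak_active, leak_drowsy, wakeup_energy):
    """Net energy saved by putting a cache line into a state-preserving
    low-VDD mode for idle_cycles, then paying a wake-up cost on reuse.
    Energies are in arbitrary units per cycle / per event."""
    return idle_cycles * (leak_active - leak_drowsy) - wakeup_energy

# Break-even idle time: the mode pays off only for idle periods longer
# than wakeup_energy / (leak_active - leak_drowsy) cycles.
break_even = 2.0 / (0.10 - 0.02)   # 25 cycles with these assumed values
assert abs(net_saving(25, 0.10, 0.02, 2.0)) < 1e-9
assert net_saving(100, 0.10, 0.02, 2.0) > 0.0
```

Turning a line off too eagerly loses energy (and performance) to wake-ups; turning it off too lazily leaks. The schemes in this thesis can be viewed as ways of predicting which side of this break-even point a line is on.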
This thesis proposes a new design methodology for energy-efficient caches based
on first understanding the applications: their resource demands, execution footprints,
cache behavior, etc. These application-specific characteristics can then be utilized to
guide energy optimization strategies at compile time, at run time, or both.
1.4.1 Determine Cache Resource Demands
Different applications may have different resource requirements. For example,
some applications are computation intensive while others are I/O intensive. For the
same reason, the actual data cache size demanded also differs from one application to
another. On the other hand, most of today's microprocessors implement caches in fixed
sizes that are normally quite large for the sake of performance. However, large caches
are wasted on applications that cannot fully utilize them, as the caches are implemented
in a rigid manner. For example, not all the loops in a given array-based application can
take advantage of a large on-chip cache. Also, working with a fixed cache configuration
can increase energy consumption in loops where the best required configuration (from
the performance angle) is smaller than the default (fixed) one. This is because a larger
cache can result in a larger per-access energy.
This research proposes a novel approach [33][32] in which an optimizing compiler
analyzes the application code and decides the best cache configuration (from a given
objective viewpoint) for different parts of the application code. The caches are then
dynamically reconfigured according to these compiler-determined configurations during
the course of execution, so that the configurations match the dynamic characteristics
of the running application. This is called compiler-directed cache polymorphism (CDCP)
in this work. This approach differs from previous research on reconfigurable caches,
such as [4] and [62], in that it does not depend on dynamic feedback information.
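A minimal sketch of the selection step CDCP performs, under the simplifying assumption that the compiler's reuse analysis has already produced per-configuration miss estimates for a loop nest. The configuration names, miss counts, and energies below are hypothetical:

```python
def pick_config(predicted_misses, energy_per_access, slack=0.05):
    """Pick the most energy-frugal cache configuration whose predicted
    miss count is within `slack` of the best configuration.

    predicted_misses:  {config_name: estimated misses for this loop nest}
    energy_per_access: {config_name: per-access energy (grows with size)}
    """
    best = min(predicted_misses.values())
    # Configurations that are (almost) as good as the best, performance-wise.
    ok = [c for c, m in predicted_misses.items() if m <= best * (1 + slack)]
    # Among those, choose the one with the cheapest per-access energy.
    return min(ok, key=lambda c: energy_per_access[c])

misses = {"4KB": 1200, "8KB": 1010, "16KB": 1000, "32KB": 1000}
energy = {"4KB": 1.0, "8KB": 1.4, "16KB": 2.0, "32KB": 2.9}
print(pick_config(misses, energy))   # "8KB": near-best misses, lower energy
```

The key point mirrors the text: once the miss count plateaus, growing the cache further only inflates per-access energy, so the smallest near-best configuration wins.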
1.4.2 Redesign Instruction-Supply Mechanism
Notice that CDCP actually targets the parameter CL for dynamic energy reduction
in the data cache in the embedded application domain. This idea is certainly applicable
to the instruction cache as well. However, this thesis explores a more aggressive
approach that redesigns the instruction-supply mechanism, optimizing the parameter
P0→1 for dynamic energy reduction in the instruction cache.
10
The proposed issue queue [38] has a mechanism to dynamically detect and identify
reusable instructions, particularly instructions belonging to tight loops; such code typi-
cally dominates the execution time of array-based embedded applications. Once reusable
instructions are detected, the issue queue buffers them and reschedules the
buffered instructions for subsequent execution. Special care must be taken
to guarantee that the reused instructions are register-renamed in the original program
order. The instructions are thus supplied by the issue queue itself rather than by the fetch
unit, and there is no need to perform instruction cache accesses, branch prediction, or in-
struction decoding. Consequently, the front-end of the datapath pipeline, i.e., the pipeline
stages before register renaming, can be gated for energy savings during instruction-reuse
mode.
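A rough first-order model of the fetch savings this enables is sketched below; the function name, buffer capacity, and cost model are illustrative assumptions, not the thesis design.

```python
def icache_fetches(loop_len, iterations, buffer_capacity):
    """Instruction-cache fetches needed for a loop under issue-queue reuse
    (illustrative first-order model).

    The first iteration must be fetched and buffered; if the loop body fits
    in the issue-queue buffer, all later iterations are supplied by the
    issue queue itself, so the fetch/decode front-end can be gated."""
    if loop_len <= buffer_capacity:
        return loop_len               # fetch once, replay thereafter
    return loop_len * iterations      # too large to buffer: fetch every time
```

Under this model a 10-instruction loop iterated 100 times needs only 10 fetches instead of 1000, which is why tight loops are the prime target for reuse.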
1.4.3 Application Sensitive Leakage Control
A good leakage management scheme needs to appropriately balance the leakage
energy penalty incurred in keeping a cache line turned on after its current use against
the transition energy overhead (for turning a cache line back on) and the
performance loss that will be incurred if and when that cache line is accessed again. In
order to strike this balance, it is important that the management approach track both
the spatial and the temporal locality of instruction cache accesses. Existing leakage control
approaches track and exploit only one or the other of these forms of locality.
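One way to phrase this balance is a simple break-even model: turning a line off pays only if the line stays idle long enough for the saved leakage to exceed the turn-on cost. The sketch below uses hypothetical unit-less numbers, not measured values from this thesis.

```python
def worth_turning_off(idle_cycles, leak_per_cycle, transition_energy):
    """Net energy benefit of turning a cache line off for an idle period:
    leakage saved while off minus the energy to turn the line back on
    (illustrative first-order model; ignores the performance-loss term)."""
    saved = idle_cycles * leak_per_cycle
    return saved - transition_energy

def break_even_cycles(leak_per_cycle, transition_energy):
    """Idle time beyond which turn-off pays for itself."""
    return transition_energy / leak_per_cycle
```

A periodic scheme with a fixed decay interval effectively guesses this break-even point once for all lines; the application-sensitive scheme proposed here instead tries to turn lines off near their actual last use.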
The leakage management proposed in this work exploits both forms of locality,
building on two main characteristics of instruction access patterns: program execution
is mainly confined to program hotspots, and instructions exhibit a sequential access
pattern [34][44]. To exploit this behavior, this thesis proposes a HotSpot-based
Leakage Management (HSLM) approach that detects and protects cache lines containing
program hotspots from inadvertent turn-off, detects shifts in the program hotspot, and
turns cache lines off closer to their last use instead of waiting for a period to expire.
This scheme is specifically oriented toward detecting new loop-based hotspots. Next,
this work presents a Just-in-Time Activation (JITA) scheme that exploits the sequential
access pattern of instruction caches by predictively activating the next cache line when
the current cache line is accessed.
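A minimal sketch of the JITA idea follows, assuming a drowsy line costs one wake-up stall when accessed cold; the class and penalty model are illustrative, not the proposed hardware.

```python
class DrowsyICache:
    """Toy model of a drowsy instruction cache with just-in-time activation:
    when line i is accessed, line i+1 is predictively woken so a sequential
    fetch does not stall on a drowsy-to-active transition."""

    def __init__(self, num_lines):
        self.active = [False] * num_lines   # all lines start drowsy
        self.wakeup_stalls = 0

    def access(self, line):
        if not self.active[line]:
            self.wakeup_stalls += 1         # pay the wake-up penalty
            self.active[line] = True
        # JITA: pre-activate the next sequential line.
        nxt = line + 1
        if nxt < len(self.active) and not self.active[nxt]:
            self.active[nxt] = True
```

In this model a purely sequential run of N cold lines incurs a single stall instead of N, which is exactly the case the instruction-cache sequentiality data in Chapter 4 motivates.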
1.5 Contributions
By understanding, capturing, and utilizing the static and dynamic characteristics
of application code, this thesis provides a comprehensive solution for optimizing
the energy efficiency of caches, including both dynamic and static energy. Specifically,
four major contributions have been made in this thesis research.
• A detailed cache behavior characterization for both array-based embedded applica-
tions and general-purpose applications was performed. The insights obtained from
this study suggest that (1) different applications, or different code segments within a
single application, have very different cache demands in terms of performance
and energy; (2) program execution footprints (instruction addresses) can
be highly predictable and usually have a narrow scope during a particular execution
phase, especially for embedded applications; and (3) high sequentiality is present in
accesses to the instruction cache.
• A technique called compiler-directed cache polymorphism (CDCP) was proposed.
CDCP analyzes the data reuse exhibited by loop nests to extract the cache demands
and determine the best data cache configurations for different code segments,
achieving the best performance and optimized energy behavior, and reconfigures
the data cache with these determined configurations at runtime.
• This thesis presents a redesigned processor datapath that captures and utilizes the pre-
dictable execution footprint to reduce energy consumption in the instruction cache,
as well as in other processor components as a side benefit. The issue queue proposed
here is capable of rescheduling instructions buffered in the issue queue itself, thereby
avoiding instruction streaming from the pipeline front-end and significantly
reducing energy consumption in the instruction cache.
• This thesis proposes hotspot-based leakage management (HSLM) and just-in-time
activation (JITA) strategies to manage instruction cache leakage in an appli-
cation-sensitive fashion. The scheme, employing these two strategies in addition
to periodic and spatially based (bank-switch) turn-off, provides a significant im-
provement in leakage energy savings in the instruction cache (while also considering
overheads incurred in the rest of the processor) over previously proposed
schemes [45][78].
1.6 Thesis Roadmap
The rest of this thesis is organized as follows. An overview of related work
is presented in Chapter 2. The experimental framework for this thesis is detailed in
Chapter 3. Chapter 4 studies the data cache and instruction cache behavior of a set
of array-based embedded applications and a set of general-purpose applications that are
used throughout this thesis. Chapter 5 proposes the compiler-directed cache poly-
morphism technique to optimize dynamic energy consumption in the data cache while
guaranteeing near-optimal performance in the embedded application domain. A more
aggressive scheme than cache reconfiguration, scheduling reusable instructions within the
issue queue, is proposed in Chapter 6 to achieve significant energy reduction in the instruc-
tion cache as well as in other components of the front-end of the datapath pipeline. Chapter 7
develops two new strategies, namely hotspot-based leakage management and just-in-time
activation, to attack leakage in the instruction cache in a more effective and application-
aware manner. Finally, Chapter 8 concludes this thesis and outlines directions
for future research.
Chapter 2
Related Work
There has been substantial prior research on energy optimizations (both dynamic and
leakage) in caches (both data and instruction caches). This research spans
multiple levels of the microprocessor design flow, from high-level compiler optimiza-
tions, through architectural and microarchitectural schemes, to low-level circuit optimizations
and physical device designs.
2.1 Compiler Optimizations
2.1.1 Addressing Dynamic Energy Consumption
In the domain of embedded systems design, most of the work focuses on the be-
havior of array references in loop nests, as loop nests are the most important part of
array-intensive media and signal processing programs. In most cases, the
computation performed in loop nests dominates the execution time of these programs.
Thus, the behavior of the loop nests determines both the performance and the energy behavior
of applications. Previous research (e.g., [53]) shows that the performance of loop nests is
directly influenced by the cache behavior of array references. Also, energy consumption
is a major design constraint in embedded systems [73][55][25]. Consequently, determin-
ing a suitable combination of cache memory configuration and optimized software is a
challenging problem in the embedded design world.
The conventional approach to this problem is to employ compiler opti-
mization techniques [27][7][51][53][64][68][76][3] that modify the program be-
havior so that it becomes more compatible with the underlying cache
configuration. Current locality-oriented compiler techniques generally work under the
assumption of a fixed cache memory architecture, and there are several problems with
this approach. First, these compiler-directed modifications are sometimes not effective
when data/control dependences prevent the necessary program transformations. Second,
the available cache space sometimes cannot be utilized efficiently, because the static cache
configuration does not match the differing requirements of different programs and/or of
different portions of the same program. Third, most current compiler techniques
(adapted from the scientific compilation domain) do not take energy issues into account.
2.1.2 Managing Cache Leakage
In [78][79], an optimizing compiler is used to analyze the program and insert explicit
cache line turn-off instructions. This scheme demands sophisticated program analysis
and requires ISA modifications to implement cache line turn-on/off instructions.
In addition, this approach is only applicable when the source code of the application
being optimized is available. In [78], instructions are inserted only at the end of loop
constructs; hence, this technique does not work well if a lot of time is spent within
the same loop. In such cases, periodic schemes may be able to transition portions of
the loop that have already executed into a drowsy mode. Further, when only selected
portions of a loop are used, the entire loop is kept in an active state. Finally, inserting
the turn-off instructions after a fast-executing loop placed inside an outer loop can cause
performance and energy problems due to premature turn-offs.
2.2 Architectural and Microarchitectural Schemes
2.2.1 Using Additional Smaller Caches
Generally, smaller caches consume less power due to their lower capacitance. To
reduce the power consumption in the pipeline front-end, the stage-skip pipeline [31][8] in-
troduces a small decoded instruction buffer (DIB) that temporarily stores decoded loop
instructions; reusing them allows instruction fetching and decoding to be stopped for power
reduction. The DIB is controlled by a special loop-evoking instruction and requires ISA
modification. Loop caches [47][5] dynamically detect loop structures and buffer loop
instructions or decoded loop instructions in an additional loop cache for later reuse. A
preloaded loop cache is proposed in [26] using profiling information: loops dominating
the execution time are preloaded into the loop cache during system reset based on static
profiling. More generally, filter caches [46][71] use smaller level zero caches (between the
level one cache and the datapath) to capture tight spatial/temporal locality in cache accesses,
thus reducing the power consumption in the larger level one caches. However, filter caches
usually incur a significant performance loss due to the low hit rate in the smaller level zero caches.
2.2.2 Changing Load Capacitance of Cache Access
An alternative approach to the cache behavior problem is to use re-
configurable cache structures and dynamically modify the cache configuration (at specific
program points) to meet the execution profile of the application at hand. This approach
has the potential to address the problem in cases where optimizing the application code
alone fails. However, previous research in this area, such as [4], [62], and [77], is mainly
focused on the implementation and employment mechanisms of such designs, and
lacks software-based techniques to direct dynamic cache reconfigurations.
Way-prediction and selective direct-mapping were used for dynamic energy re-
duction in set-associative caches in [61]. In this scheme, only the predicted cache way is
accessed, without probing all cache ways simultaneously; the other ways are accessed only
after a way misprediction.
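A common realization of this idea is MRU-based way prediction; the sketch below is an illustrative model of the probe-count savings, not necessarily the exact scheme of [61].

```python
class WayPredictedCache:
    """Toy model of way prediction for a set-associative cache: only the
    predicted (most recently used) way is probed first; the remaining ways
    are probed only on a way misprediction, trading an occasional extra
    probe for far fewer way reads on correct predictions."""

    def __init__(self, num_sets, num_ways):
        self.num_ways = num_ways
        self.mru = [0] * num_sets      # predicted way per set
        self.ways_probed = 0

    def access(self, set_idx, actual_way):
        self.ways_probed += 1          # probe the predicted way first
        if self.mru[set_idx] != actual_way:
            # Misprediction: probe the remaining ways (counted here as
            # one extra parallel probe) and update the prediction.
            self.ways_probed += 1
            self.mru[set_idx] = actual_way
```

The energy win follows directly: a conventional 4-way access reads all four ways every time, whereas a correct prediction here reads one.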
2.2.3 Improving the Fetch Mechanism
Controlling instruction fetch has a significant impact on the energy consumption
of the instruction cache, e.g., speculation control for pipeline gating [52]. The trace cache
was first studied for its energy efficiency in [37][35][36]. The sequential trace cache achieves
superior power behavior at the cost of a large performance degradation compared to the con-
ventional trace cache architecture [37]. A compiler-based selective trace cache
(SLTC) [35] utilizes profiling information to statically determine the fetch direction,
either to the trace cache or to the instruction cache. The direction-prediction-based trace cache
(DPTC) proposed in [36] is independent of compiler optimizations or code layout op-
timizations: it provides a pure hardware scheme to implement this selective fetch
rather than a profile-based software scheme. It also avoids any impact on the current ISA,
which makes it independent of the underlying platform. Both SLTC and
DPTC obtain the energy benefits of the sequential access mechanism and, due to their
selective access, performance very close to that of a conventional trace cache.
2.2.4 Reducing Leakage in Caches
The leakage current is a function of the supply voltage and the threshold voltage.
It can be controlled either by reducing the supply voltage or by increasing the threshold
voltage. However, both mechanisms have an impact on cache access times. Thus, a common
approach is to apply them dynamically when a cache line is not currently in
use. The DRI-icache [60][59] uses performance feedback to dynamically resize the cache
using the Gated-Vdd technique. However, this resizing is at a very coarse granularity.
Using the same circuit technique, cache decay [41] exploits the generational behavior
of cache lines and turns a cache line off after its decay period expires, controlling cache
leakage at the fine granularity of individual cache lines. The drowsy cache [20] periodically
transitions all cache lines to a drowsy mode based on a multiplexed supply voltage. Drowsy
cache lines retain their contents; however, they must be restored to the active supply
voltage before any access, which incurs both performance and
energy overhead. This technique was adapted to the instruction cache in [45] and augmented
with a bank-based strategy for cache line turn-off and predictive cache line activation.
Although oriented toward reducing the performance penalty, this bank-based scheme suffers from
dynamic energy overhead due to turning on a whole cache subbank.
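As a first-order illustration of the supply/threshold dependence mentioned at the start of this subsection, the standard subthreshold-leakage approximation (a textbook model, not a formula from this thesis) can be written as:

```latex
% First-order subthreshold leakage model (illustrative):
% leakage power grows with V_dd and falls exponentially as V_th rises.
P_{leak} \approx V_{dd} \cdot I_{leak}, \qquad
I_{leak} \propto e^{-V_{th}/(n V_T)} \left( 1 - e^{-V_{dd}/V_T} \right),
```

where $V_T = kT/q$ is the thermal voltage and $n$ is the subthreshold slope factor. This is why lowering $V_{dd}$ (drowsy mode) or raising $V_{th}$ (Gated-Vdd, dual-$V_{th}$ cells) cuts leakage so effectively, at the cost of slower or temporarily inaccessible cells.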
2.3 Circuit and Device Techniques
Existing techniques that control cache leakage use three main styles of circuit
primitives for reducing cache leakage energy. The first approach involves the Gated-
Vdd [60] techniques employed in [59][41] that use an additional NMOS sleep transistor
connected between the memory storage cell and the ground of the power supply. This
sleep transistor is turned off to reduce leakage but results in data stored in the memory
array being lost. A modification of this GND-gating scheme that can still retain data
was used in [49][2]. The second type of circuit primitive is based on a multiplexed supply
voltage for the cache lines [20]. When a reduced supply voltage is selected the leakage
can be controlled and the cache line is said to be in a drowsy state (retaining its value).
However, cache lines in a drowsy state cannot be accessed and need to be brought back
to the active state (operating at the normal supply voltage). This transition from drowsy
voltage to normal voltage requires a single cycle or multi-cycle wakeup time. The third
approach to reducing leakage energy while minimizing performance penalties relies on
selectively decreasing the threshold voltage of the cache lines that are accessed while
maintaining a higher threshold voltage for all other cache lines [43]. There have also
been approaches to designing memory cells with dual threshold voltages to minimize
leakage when storing the preferred data value (zero or one) [6].
Chapter 3
Experimental Models
This chapter presents the experimental models used in this thesis. Two different
simulation frameworks are used to evaluate, respectively, the compiler schemes and the
microarchitectural schemes proposed for reducing cache energy (both dynamic and leakage
energy). Two sets of benchmarks, from the array-intensive embedded application domain
and the general-purpose application domain, are used for this
purpose.
3.1 Simulation Frameworks
3.1.1 Evaluating Compiler Schemes
The SUIF compiler, version 1.0 [69], is used as the framework to implement the compiler
algorithms proposed in this work. It has two major parts: the kernel and the toolkit. The SUIF
compiler kernel defines the intermediate representation of programs, provides methods to
operate on the intermediate representation, and interfaces between different
compiler passes. The SUIF toolkit consists of compiler passes built on top of the
kernel, including Fortran and ANSI C front ends for Fortran/C-to-SUIF translation,
a SUIF-to-Fortran/C translator, a data dependence analyzer, a basic parallelizer, a
loop-level locality optimizer, and a visual SUIF code browser. The compiler
algorithms proposed in Chapter 5 are implemented as several independent SUIF passes.
The Shade cache simulator [18] is augmented to cooperate with the SUIF compiler to
perform cache simulation and runtime reconfiguration. The SUIF compiler takes the program
source code and converts it to the SUIF intermediate representation. It then executes the
SUIF passes that implement the proposed compiler algorithms and generates a profile
required by Shade. The Shade cache simulator then uses this profile to carry out the
corresponding actions during its simulation.
3.1.2 Evaluating Microarchitectural Schemes
To evaluate the proposed microarchitectural schemes, a superscalar processor sim-
ulator, SimpleScalar 3.0 [12], is used as the base for developing the microarchitectural simulators
required in this work. SimpleScalar performs cycle-accurate simulation of modern proces-
sors and implements a six-stage processor pipeline: Fetch, Dispatch, Schedule, Execute,
Writeback, and Commit.
In the experiments conducted in this work, a contemporary microprocessor similar
to the Alpha 21264 microprocessor is modeled. The base configurations of the processor
and memory hierarchy are given in Table 3.1.
3.2 Benchmarks and Input Sets
Two sets of benchmarks, from the array-intensive application domain and the general-
purpose application domain, are used in the experiments. Table 3.2 lists a set of nine
array-based benchmarks. The second and third columns give the number of arrays
and loop nests manipulated by each benchmark, respectively. The fourth column
Processor Core
  Issue Queue       64 entries
  Load/Store Queue  32 entries
  Reorder Buffer    64 entries
  Fetch Width       4 instructions per cycle
  Decode Width      4 instructions per cycle
  Issue Width       4 instructions per cycle, out of order
  Commit Width      4 instructions per cycle
  Function Units    4 IALU, 1 IMULT/IDIV, 4 FALU, 1 FMULT/FDIV, 2 Memports
  Branch Predictor  Bimodal, 2048 entries; 512-set 4-way BTB; 8-entry RAS
Memory Hierarchy
  L1 ICache         32KB, 1 way, 32B blocks, 1 cycle latency
  L1 DCache         32KB, 4 ways, 32B blocks, 1 cycle latency
  L2 UCache         256KB, 4 ways, 64B blocks, 8 cycle latency
  Memory            80 cycles first chunk, 8 cycles rest, 8B bus width
  TLB               4 way, ITLB 64 entry, DTLB 128 entry, 30 cycle miss penalty

Table 3.1. Base configurations of the simulated processor and memory hierarchy for evaluating microarchitectural schemes.
Benchmark  Arrays  Nests  Brief Description                  Source
adi        6       2      Alternate Direction Integral       Livermore
aps        17      3      Mesoscale Hydro Model              Perfect Club
bmcm       11      3      Molecular Dynamics of Water        Spec92/NASA
btrix      29      7      Block tridiagonal matrix solution  Spec92/NASA
eflux      5       6      Mesh Computation                   Perfect Club
tomcat     9       8      Mesh Generation                    Spec95
tsf        1       4      Array-based Computation            Perfect Club
vpenta     9       8      Inverts 3 matrix pentadiagonals    Spec92/NASA
wss        10      7      Molecular Dynamics of Water        Perfect Club

Table 3.2. Array-based benchmarks used in the experiments.
describes the main function of each benchmark. The sources of the benchmarks are given
in the last column.
Benchmark  Input Set                    Description
gzip       input.source 60              Compression
vpr        net.in arch.in place.in      FPGA Circuit Placement and Routing
gcc        scilab.i                     C Programming Language Compiler
mcf        inp.in                       Combinatorial Optimization
parser     2.1.dict -batch ref.in       Word Processing
perlbmk    splitmail.pl                 PERL Programming Language
gap        -q -m 192M ref.in            Group Theory, Interpreter
vortex     bendian1.raw                 Object-oriented Database
bzip2      input.source                 Compression
twolf      ref                          Place and Route Simulator
wupwise    wupwise.in                   Physics / Quantum Chromodynamics
mesa       -frames 1000 mesa.in         3-D Graphics Library
art        c756hel.in a10.img hc.img    Image Recognition / Neural Networks
equake     inp.in                       Seismic Wave Propagation Simulation

Table 3.3. Benchmarks from SPEC2000 used in the experiments.
In addition to the array-based benchmarks, a set of ten integer and four floating-
point applications from the SPEC2000 benchmark suite is used to evaluate the leak-
age control strategies proposed in Chapter 7. Their PISA-version binaries and reference
inputs are used in the experiments. During simulation, each of these
SPEC2000 benchmarks is first fast-forwarded for half a billion instructions and then simu-
lated in detail for the next half billion committed instructions. Table 3.3 gives the names,
input sets, and function descriptions of the fourteen SPEC2000 benchmarks used in
this work.
Chapter 4
Characterizing Application and Cache Behavior
This thesis develops compiler and microarchitectural techniques for
cache energy reduction based on an understanding of application characteristics
and cache behavior. In this chapter, three critical properties of a given application
and its cache behavior, namely the cache resource demands for performance, the program
execution footprint, and the instruction cache access behavior, are identified, extracted,
and analyzed in the context of cache energy optimization.
4.1 Data Cache Demands for Performance
General-purpose high-performance microprocessor designs, such as the Alpha 21364
[9] and the HP PA-8800, incorporate large, fixed-size level one caches
to accommodate the workloads of different applications. However, this design is in-
herently conservative and achieves only average performance. Notice
that the cache resource demands of different applications' workloads differ signifi-
cantly from one another. Even within a single application, different code segments
may have very different cache requirements due to the different functions
performed and the different data manipulated. A fixed-size cache design leads to very
inefficient utilization of cache resources when the demands of applications are much
smaller than the configured caches. On the other hand, performance suffers seriously
if an application demands a much larger cache than the existing one.
Such a design in embedded microprocessors can be disastrous in terms of cost,
energy consumption, and performance. Larger caches result in higher energy consump-
tion, which is a major constraint in embedded systems design. Smaller caches may cause
performance problems (e.g., missed deadlines) and also incur additional energy overhead
due to accesses to lower levels of the memory hierarchy. Note that embedded applica-
tions usually manipulate massive amounts of data and are very complex in the way the data
are accessed and processed. Thus, it is paramount to study the cache resource demands
of embedded applications and use these characteristics to direct energy optimizations in
caches, especially the level one data cache.
In this section, this application characterization is performed on the set of array-
based embedded applications listed in Table 3.2 in Chapter 3. Since loop
nests constitute the major portion of the code in these benchmarks, this study analyzes the cache performance
behavior as the cache configuration varies for each loop nest within a given benchmark.
Figure 4.1 presents a set of analysis results, where the data cache size varies from 1 KB
to 1024 KB while the cache block size is fixed at 32 bytes and the set-associativity is fixed
at 4 ways for all cache configurations. In each plot, the x axis represents the data cache size
and the y axis the data cache miss rate.
Several important observations emerge from Figure 4.1. First, for a given
cache configuration, most loops within a single application have very different cache
performance behavior. For example, in benchmark aps, the miss rates in an 8 KB data cache
for its three loops are 0.99%, 2.07%, and 16.11%, respectively. In a few cases, when two loops
[Figure 4.1 appears here: eight plots, (a) adi, (b) aps, (c) bmcm, (d) eflux, (e) tomcat, (f) tsf, (g) vpenta, (h) wss. In each plot, the x axis is the data cache size (1K to 1M) and the y axis is the data cache miss rate, with one curve per loop.]

Fig. 4.1. Cache performance behavior as data cache size increases from 1KB to 1024KB. All cache configurations use a fixed block size of 32 bytes and a fixed associativity of 4 ways.
are very similar in code structure and data access pattern, they may exhibit the same
behavior, e.g., loops 1 and 2 in bmcm and loops 2 and 5 in eflux. Second, the variation
of cache behavior with increasing cache size differs among loops. For some loops, the
miss rate does not improve even when the cache size increases
from 1 KB to 1024 KB, e.g., loops 2, 3, 5, 6, and 7 in benchmark vpenta. However, for most
loops, the cache performance improves significantly as the cache size increases, e.g.,
all loops but loop 4 in eflux. Finally and most importantly, every loop has a performance
saturating point, either at some sharp-turning point or at the very first point (the smallest
cache size). For loops with sharp-turning points, increasing the cache size before those
points may or may not improve performance; e.g., the performance of loops 1, 4,
and 8 in vpenta does not improve as the cache size increases from 1 KB to 32 KB.
However, the sharp-turning point itself brings a significant performance improvement,
e.g., at the 128 KB cache size for loops 1 and 8 in vpenta. Further, increasing the cache
size beyond this sharp-turning point yields very minor or no performance benefit.
These findings are important: the saturating point is proposed as the
optimal cache configuration for a particular loop in terms of performance and energy
consumption. Schemes aiming to optimize cache energy for these embedded applications
should develop approaches to capture these optimal points of their loops.
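A simple way to extract such a saturating point from a measured miss-rate curve is sketched below; the improvement threshold and the data values are illustrative assumptions, not results from Figure 4.1.

```python
def saturating_cache_size(sizes_kb, miss_rates, threshold=0.01):
    """Return the smallest cache size beyond which no larger configuration
    improves the miss rate by at least `threshold` (absolute).

    This captures the per-loop 'performance saturating point': the point
    past which extra cache capacity buys little or nothing."""
    for i, size in enumerate(sizes_kb):
        # Does any larger configuration still help noticeably?
        if all(miss_rates[i] - m < threshold for m in miss_rates[i + 1:]):
            return size
    return sizes_kb[-1]
```

In a CDCP-style flow, a value like this (derived analytically by the compiler rather than by simulation) would become the configuration chosen for that loop.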
4.2 Instruction Execution Footprint
This thesis proposes to redesign the instruction supply mechanism for dynamic
energy optimization in instruction caches, oriented toward embedded systems. This proposal
is premised on a good understanding of the runtime behavior of instruction footprints
for typical array-based embedded applications.
This section analyzes the distribution of dynamic instructions
with respect to their PC addresses for the set of array-based embedded applications
used in this work. Figure 4.2 gives the PC address profiling results for a group of eight
benchmarks. In each plot, the x axis represents the sampling point during execution,
and the y axis the instruction address space. The sampling rate for each benchmark is
given with the x-axis label; e.g., "/500" means one sampling point is taken for every 500
dynamic instructions.
For these embedded applications, as shown in Figure 4.2, the dynamic instruction
footprint has a very regular pattern. Some patterns are very uniform, as
in aps, btrix, and wss; others consist of several phases, as in eflux and
vpenta. Each phase executes for a certain amount of time and within a narrow address
space. These dynamic characteristics can certainly be utilized by
a reconfigurable instruction cache for energy reduction. This thesis research explores a
more aggressive microarchitectural design to exploit this predictable dynamic behavior
for energy optimization in the instruction cache as well as in other components of the
datapath front-end, such as the branch predictor and the decoder.
4.3 Accessing Behavior in Instruction Cache
This section studies the distribution of accesses to the instruction cache for a set
of benchmarks from the SPEC2000 benchmark suite. The accesses are characterized
at the granularity of individual cache lines. Sequentially accessing instructions within a single
[Figure 4.2 appears here: eight plots, (a) adi, (b) aps, (c) btrix, (d) eflux, (e) tomcat, (f) tsf, (g) vpenta, (h) wss. In each plot, the x axis is the instruction fetch sampling point and the y axis is the instruction address space.]

Fig. 4.2. Dynamic instruction address distribution at runtime for a set of array-intensive codes. A sampling rate (/500) means only one instruction among 500 dynamic instructions is sampled for its address (PC).
cache line is counted as only one access to the instruction cache in this study. The
instruction cache configuration is given in Table 3.1. The goal of this study is to provide
a quantitative analysis of the sequentiality of cache accesses (at cache line granularity) for
general-purpose applications. These quantitative characteristics can
later be utilized to guide energy optimizations in instruction caches.
The characterization of this cache-line access distribution is performed with re-
spect to the sequential length, which is defined as the number of consecutively accessed
cache lines with increasing set index in the same cache way. For example, a consecutive
access to five cache lines (in the same cache way) with set indices 5, 6, 7, 8, 10 has a
sequential length of four for the first four cache-line accesses, and the sequentiality is
broken after the fourth cache-line access; cache line 10 starts a new sequential-length
count. Note that a sequence of accesses with sequential length N only realizes N−1
sequential accesses, i.e., only cache lines 5, 6, and 7 have sequential (followed-up) accesses in
the previous example. The access to cache line 5 is said to achieve a sequential access since
the next consecutive access is to cache line 6. A preset series of sequential lengths, from
1 to 32 plus larger than 32, is used for this distribution characterization. Figure 4.3 and
Figure 4.4 present this cache-line access distribution with respect to sequential length for
each benchmark, along with averages for the integer and floating-point benchmarks
used in this study. The rightmost bar (<) in each plot corresponds to accesses
with sequential length larger than 32 cache lines.
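The sequential-length definition above can be sketched as a small routine over a trace of set indices; the function name is illustrative.

```python
def sequential_lengths(set_indices):
    """Split a cache-line access trace (set indices within one cache way)
    into runs of consecutively increasing indices and return each run's
    sequential length, matching the definition used in this study."""
    if not set_indices:
        return []
    lengths, run = [], 1
    for prev, cur in zip(set_indices, set_indices[1:]):
        if cur == prev + 1:
            run += 1            # sequentiality continues
        else:
            lengths.append(run)  # sequentiality broken: close this run
            run = 1
    lengths.append(run)
    return lengths
```

For the trace 5, 6, 7, 8, 10 this yields run lengths 4 and 1, and, as noted above, a run of length N realizes N−1 sequential accesses.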
From these two figures, one can observe that most cache line accesses happen
in sequential mode. Accesses with a sequential length of one (i.e., without a sequential
access) account for only a very small portion of the overall accesses, e.g., benchmark gcc
[Figure 4.3 panels: bar charts of L1 ICache Access Distr. w.r.t. Sequential Length (x-axis: 4, 8, 12, 16, 20, 24, 28, 32, <; y-axis: 0–100%) for (a) gzip, (b) vpr, (c) gcc, (d) mcf, (e) parser, (f) perlbmk, (g) gap, (h) vortex.]
Fig. 4.3. The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The rightmost bar (<) in each plot corresponds to those with sequential length larger than 32 cache lines. Continued in Figure 4.4.
[Figure 4.4 panels: bar charts of L1 ICache Access Distr. w.r.t. Sequential Length (same axes as Figure 4.3) for bzip2, twolf, wupwise, mesa, art, equake, and the averages for the integer and floating-point benchmarks.]
Fig. 4.4. (Continued from Figure 4.3) The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The last two plots show the average distribution for the integer benchmarks and floating-point benchmarks used, respectively.
has the largest such percentage, 6.5%. Overall, for integer benchmarks, most
instruction cache line accesses occur in sequences with sequential length from 2 to 16,
with certain lengths peaking for particular benchmarks, such as 6 in gzip and 14 in gap.
For floating-point benchmarks, this distribution appears more irregular and tends toward
much larger sequential lengths in instruction cache accesses; e.g., more than 75% of
cache accesses have a sequential length of more than 32 cache lines in wupwise. The
last two plots in Figure 4.4 give the average distribution for integer benchmarks and
floating-point benchmarks. In general, the percentage decreases as the sequential length
increases for integer benchmarks, and floating-point benchmarks achieve longer
sequential lengths than integer benchmarks.
Notice that the last access of a sequence with sequential length N does not have a
sequential (follow-up) access. On average, 78% of instruction cache line accesses
achieve sequential access in the SPEC2000 integer benchmarks, and 87% in the
SPEC2000 floating-point benchmarks used in this study. This quantitative result is
used in Chapter 7 to explore performance-aware leakage management.
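The "N−1 out of N" accounting can be checked with a small sketch (a hypothetical helper, not from the thesis): given the lengths of the sequential runs in a trace, the fraction of accesses that achieve sequential access is:

```python
def sequential_fraction(run_lengths):
    """A run of length N contains N cache-line accesses; the last one has
    no follow-up, so only N - 1 of them achieve sequential access."""
    total = sum(run_lengths)
    followed = sum(n - 1 for n in run_lengths)
    return followed / total
```

For the earlier example (one run of length four, one of length one), 3 of 5 accesses achieve sequential access.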
Chapter 5
Analyzing Data Reuse for Cache Energy Reduction
5.1 Introduction
Most of today's microprocessor systems include several special architectural features
(e.g., large on-chip caches) that use a significant fraction of the on-chip transistors.
These complex and energy-hungry features are meant to be applicable across different
application domains. However, because they are implemented in a rigid manner, they
are wasted on applications that cannot fully utilize them. For example, not all the
loops in a given array-based embedded application can take advantage of a large
on-chip cache. Also, working with a fixed cache configuration can increase energy
consumption in loops where the best required configuration (from the performance
angle) is smaller than the default (fixed) one, because a larger cache can incur a
larger per-access energy.
This thesis work proposes a strategy where an optimizing compiler decides the
best cache configuration (from a given objective viewpoint) for each nest in the
application code. The approach focuses on array-based applications and reconfigures
the cache dynamically between nests, since loop nests are the most important part
of array-intensive media and signal processing programs. In most cases, the
computation performed in loop nests dominates the execution time of these programs.
Thus, the behavior of the loop nests determines both the performance and energy
behavior of applications. Previous research (e.g., [53]) shows that the performance of loop nests is
directly influenced by the cache behavior of array references. Also, energy consumption
is an important design constraint in embedded systems [73, 55, 25]. Consequently,
determining a suitable combination of cache memory configuration and optimized
software is a challenging problem in the embedded design world.
Classical compiler optimizations assume a fixed cache architecture and modify
the program to take best advantage of it. In some cases, this may not be the best
strategy because each nest might work best with a different cache configuration, and
transforming a nest for a given fixed cache configuration may not be possible due to
data and control dependences. Working with a fixed cache configuration can also
increase energy consumption in loops where the best required configuration is smaller
than the default (fixed) one. This thesis work takes an alternate approach and modifies
the cache configuration for each nest depending on the access pattern exhibited by the
nest. This technique is called compiler-directed cache polymorphism (CDCP). More
specifically, this chapter makes the following contributions. First, it presents an
approach for analyzing the data reuse properties of loop nests. Second, it gives
algorithms to simulate the footprints of array references in their reuse space. This
simulation approach is much more efficient than classical cycle-based simulation
techniques as it simulates only the data reuse space. Third, based on the reuse
analysis, it presents an optimization algorithm to compute the best cache configuration
for each loop nest. The experimental results show that CDCP is very effective in
finding near-optimal data cache configurations for different nests in array-intensive
applications.
Section 5.2 reviews the basic concepts, notions, and representations for array-based
codes. In Section 5.3, concepts related to cache behavior such as cache misses,
interferences, data reuse, and data locality are analyzed. Section 5.4 introduces the
compiler-directed cache polymorphism technique and presents a complete set of
algorithms to implement it. The experimental results are presented in Section 5.5 to
show the effectiveness of this technique. Finally, Section 5.6 concludes the chapter
with a summary.
5.2 Array-Based Codes
This work particularly targets array-based codes. Since the performance
of loop nests dominates the overall performance of array-based codes, optimizing loop
nests is particularly important for achieving the best performance in many embedded
signal and video processing applications. Optimizing data locality (so that the majority
of data references are satisfied from the cache instead of main memory) can improve
the performance and energy efficiency of loop nests in the following ways. First, it can
significantly reduce the number of misses in the data cache, thus avoiding frequent
accesses to lower levels of the memory hierarchy. Second, by reducing the number of
accesses to the lower levels of the memory hierarchy, the increased cache hit rate
promotes the energy efficiency of the entire memory system. This section discusses
some basic notions about array-based codes: loop nests, array references, as well as
some assumptions to be made.
5.2.1 Representation for Programs
It is assumed that the application code to be optimized has the format shown in
Figure 5.1.
#include <header.h>
· · ·
Global Declaration Section of Arrays;
· · ·
main(int argc, char *argv[ ])
{
    · · ·
    Loop Nest No. 0;
    · · ·
    Loop Nest No. 1;
    ...
    Loop Nest No. l;
    · · ·
}
Fig. 5.1. Format for a program.
Assumption 1. Each array in the application code being optimized is declared in the
global declaration section of the program. The arrays declared in the global section can
be referenced by any loop in the code.
This assumption is necessary for the algorithms discussed in the following
sections. In the optimization stage of computing the cache configuration for the loop
nests, Assumption 1 ensures an exploitable relative base address for each array involved.
Since loop nests are the main structures in array-based programs, program code
between loop nests can be neglected. It is also assumed that each nest is independent
of the others. That is, as shown in Figure 5.1, the application contains a number of
independent nests, and no inter-nest data reuse is accounted for. This assumption can
be relaxed to achieve potentially more effective utilization of reconfigurable caches;
this will be addressed in future research. Note that several compiler optimizations such
as loop fusion, fission, and code sinking can be used to bring a given application code
into the format shown in Figure 5.1.
Assumption 2. All loop nests are at the same program lexical level, the global level.
There is no inter-nesting between any pair of nests.
Assumption 3. All nests in the code are perfectly-nested, i.e., all array operations and
array references only occur at the innermost loop.
These assumptions, while not vital for this analysis, make the implementation easier.
5.2.2 Representation for Loop Nests
In this work, loop nests form the boundaries at which dynamic cache reconfigurations
occur. Figure 5.2 shows the format for a loop nest.
for (i1 = l1; i1 ≤ u1; i1 += s1)
  for (i2 = l2; i2 ≤ u2; i2 += s2)
    · · ·
      for (in = ln; in ≤ un; in += sn)
      {
        · · · AR1[f1,1(~i)][f1,2(~i)] · · · [f1,d1(~i)] · · · ;
        · · · AR2[f2,1(~i)][f2,2(~i)] · · · [f2,d2(~i)] · · · ;
        ...
        · · · ARr[fr,1(~i)][fr,2(~i)] · · · [fr,dr(~i)] · · · ;
      }

Fig. 5.2. Format for a loop nest.
In this format, ~i stands for the loop index vector, ~i = (i1, i2, · · · , in)T . Notations
lj , uj , and sj are the corresponding lower bound, upper bound, and stride for loop
index ij , where j = 1, 2, · · · , n. AR1, AR2, · · · , ARr correspond to different instances
of array references in the nest. Note that these may be the same or different references
to the same array, or references to different arrays. Function fj,k(~i) is the subscript
(expression) function (of ~i) for the kth subscript of the jth array reference, where
j = 1, 2, · · · , r, k = 1, 2, · · · , dj , and dj is the number of dimensions of the
corresponding array.
5.2.3 Representation for Array References
In a loop nest with loop index vector ~i, a reference ARj to an array with m
dimensions is expressed as:
ARj [fj,1(~i)][fj,2(~i)] · · · [fj,m(~i)].
It is assumed that the subscript expression functions fj,k(~i) are affine functions of the
enclosing loop indices and loop-invariant constants. A row-major storage layout is
assumed for all arrays, as in the C language. Assuming that the loop index vector has
depth n, that is, ~i = (i1, i2, · · · , in)T , where n is the number of loops in the nest, an array
reference can be represented as:

\[
\begin{pmatrix} f_{j,1}\\ f_{j,2}\\ \vdots\\ f_{j,m} \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
\begin{pmatrix} i_1\\ i_2\\ \vdots\\ i_n \end{pmatrix}
+
\begin{pmatrix} c_1\\ c_2\\ \vdots\\ c_m \end{pmatrix}
\tag{5.1}
\]
The vector on the left side of the above equation is called the array reference
subscript vector and is denoted ~f . The matrix is defined as the access matrix and is
denoted A. The rightmost vector is known as the constant offset vector ~c. Thus, the
above equation can also be written as [74]:
~f = A~i + ~c. (5.2)
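Equation 5.2 can be made concrete with a small sketch (NumPy-based; the variable names and helper function are illustrative, not part of the thesis framework). It uses a reference of the form a[i + 2∗k][2∗j + 2][l] in a nest with index vector (i, j, k, l):

```python
import numpy as np

# Access matrix A and constant offset vector c for a[i + 2*k][2*j + 2][l]:
A = np.array([[1, 0, 2, 0],   # subscript 1: i + 2k
              [0, 2, 0, 0],   # subscript 2: 2j, plus offset 2
              [0, 0, 0, 1]])  # subscript 3: l
c = np.array([0, 2, 0])

def subscripts(i_vec):
    """Evaluate f = A*i + c (Equation 5.2) at one loop iteration."""
    return A @ np.asarray(i_vec) + c
```

At iteration (i, j, k, l) = (1, 2, 3, 4) this yields the subscript vector (7, 6, 4).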
5.3 Cache Behavior
This section reviews a few fundamental concepts about cache behavior. As noted
earlier, in array-intensive applications, cache behavior is largely determined by the
footprints of the data manipulated by loop nests.
5.3.1 Cache Misses
There are three types of cache misses: compulsory (cold) misses, capacity misses,
and conflict (interference) misses. Different types of misses influence the performance
of a program in different ways. Compulsory misses cannot be avoided (using software
techniques alone) and usually form only a small fraction of the total cache misses.
Capacity misses can be reduced by increasing the cache size or by optimizing the
application code. Note that, in fully-associative caches, only capacity misses and cold
misses can exist. However, most of the data caches used in current embedded systems
are implemented as set-associative or direct-mapped caches in order to achieve high
speed, low power, and low implementation cost. Thus, for these caches, interference
misses (also known as conflict misses) can dominate the cache behavior, particularly
for array-based codes. Previous research [72] has identified different kinds of cache
interferences in numerical (array-based) codes: self-interferences and cross-interferences.
The interference misses can also be grouped into temporal interferences and spatial
interferences. It should be stressed that since cache interferences occur in a highly
irregular manner, they are very difficult to capture accurately and costly to estimate
[15]. Ghosh et al. proposed cache miss equations [23, 24] as an analytical framework
to compute potential cache misses and direct code optimizations for achieving good
cache behavior.
5.3.2 Data Reuse and Data Locality
Data reuse and data locality concepts for scientific array-based applications are
discussed in detail in [74], among others. Basically, there are two types of data reuse:
temporal reuse and spatial reuse. In a given loop nest, if a reference accesses the same
memory location across different loop iterations, this is termed temporal reuse; if the
reference accesses the same cache block (not necessarily the same memory location),
this is called spatial reuse. In fact, temporal reuse can be considered a special case of
spatial reuse. If different references access the same memory location, a group-temporal
reuse is said to exist; whereas if different references access the same cache block, it is
termed group-spatial reuse. Note that group reuse occurs only among different
references to the same array in a loop nest. When the reused data item is found in the
cache, the reference is said to exhibit locality. This means that data reuse does not
guarantee data locality: a data reuse can be converted into locality only by catching
the reused item in cache. Classical loop-oriented compiler techniques try to achieve
this by modifying the loop access patterns and/or the array layout in memory.
5.4 Algorithms for Cache Polymorphism
In a cache-based embedded system, the performance and energy behavior of loop
nests are largely determined by their cache behavior. Thus, optimizing the cache
behavior of loop nests is of utmost importance for satisfying the high-performance and
energy-efficiency demands of array-based codes.
There are at least two kinds of approaches for achieving acceptable cache behavior.
The conventional way is to employ compiler algorithms that optimize loops using
transformations such as interchange, reversal, skewing, and tiling, or that transform
the data layouts (i.e., array layout in memory) to match the array access pattern. As
mentioned earlier, an alternative approach is to modify the underlying cache
architecture depending on the program access pattern. Recent research [40] explores
the potential benefits of the second approach. The strategy presented in [40] is based
on exhaustive simulation. It simulates each loop nest of an array-based program
separately with all possible cache configurations, and then determines the best cache
configuration from the viewpoint of a given objective (e.g., optimizing memory energy
or reducing cache misses). These configurations can then be applied dynamically at
runtime. The main drawback of this simulation-based strategy is that it is time
consuming and can consider only a fixed set of cache configurations. Typically,
simulating each nest with all possible cache configurations makes this approach
unsuitable for very large embedded applications. In this section, an alternative way is
presented for determining suitable cache configurations for different sections (nests)
of a given code.
5.4.1 Compiler-Directed Cache Polymorphism
The existence of cache interferences is the main factor that degrades the performance
of a loop nest. Cache interferences disrupt the cache behavior of a loop nest by
preventing data reuse from being converted into locality. Note that both
self-interferences and cross-interferences can prevent a data item from being reused
from the cache. The objective is then to determine cache configurations that help
reduce interferences. The basic idea behind compiler-directed cache polymorphism
(CDCP) is to analyze the source code of an array-based program and determine the
data reuse characteristics of its loop nests at compile time, and then to compute a
suitable (near-optimal) cache configuration for each loop nest to exploit the data
locality implied by its reuse. The near-optimal cache configuration determined for each
nest is expected to eliminate most of the interference misses while keeping the cache
size and associativity under control. In this way, it optimizes execution time and
energy at the same time. In fact, increasing either cache capacity or associativity
further (i.e., expanding the configuration determined by CDCP) only increases energy
consumption. In this approach, the source code is not modified (obviously, it can be
optimized before the algorithms are run; the point is that no further code modifications
are performed for the sake of cache polymorphism).
[Block diagram showing the CDCP toolchain components: source codes and the GCC/SUIF/SCC front ends feeding the intermediate code generator; the intermediate codes feeding the loop nest optimizer with its array sorter, uniform-set constructor, reuse analyser (reuse vectors), and reuse-space simulator (array bitmaps); the resulting cache configurations and cache block size driving the cache reconfiguration mechanism; and the Shade cache simulator, Cacti, and an energy model producing the performance/energy results.]
Fig. 5.3. Overview of compiler-directed cache polymorphism (CDCP).
At a very high level, this approach can be described as follows. First, it uses the
compiler to transform the source code into an intermediate format (IF), which
represents the array-based program in a regular hierarchical format. In the second
step, each loop nest is processed as a basic element for cache configuration. In each
loop nest, the references of each array are assigned to different uniform reference sets.
References belonging to the same uniform set have the same access matrix. Each
uniform set is then analyzed to determine the reuse it exhibits over different loop
levels. Then, for each array, an algorithm is used to simulate the footprints of the
reuse space within the layout space of that array. Following this, a loop-nest level
algorithm optimizes the cache configurations while ensuring data locality. Finally,
code is generated such that these dynamic cache configurations are activated at
runtime (at appropriate points in the application code). Figure 5.3 shows an overview
of this approach.
5.4.2 Formal Description of Program Hierarchies
This subsection shows how an array-based program is represented in its IF. The
intermediate code generator follows the hierarchy of the source program code to
generate an explicit hierarchical tree structure representing the original code. The
hierarchy goes from the root (which represents the main program) down to loop nests,
arrays, and array references at the leaves. Each node and leaf contains all the
information needed by this approach, as extracted from the original code. Thus, such
a tree structure functionally represents the original code in full scope. This
intermediate program format is
shown in Figure 5.4.
[Tree: Program at the root; children loop nest 0, loop nest 1, loop nest 2, …, loop nest n; each loop nest has children array 0, array 1, array 2, …, array m; each array has children array ref 0, array ref 1, array ref 2, …, array ref l.]
Fig. 5.4. Intermediate format of source codes produced by the generator.
5.4.3 Array References and Uniform Reference Sets
An array reference is at the lowest level of the IF. As explained earlier, each array
reference can be expressed using ~f = A~i + ~c, where ~f is the subscript vector, A is the
access matrix,~i is the loop index vector, and ~c is the constant vector. All the information
about a reference is stored in the array reference leaf, array node and its parent loop-nest
node of the intermediate codes. Consider the piece of code shown in Figure 5.5.
for (i = 0; i ≤ N1; i++)
  for (j = 0; j ≤ N2; j++)
    for (k = 0; k ≤ N3; k++)
      for (l = 0; l ≤ N4; l++)
      {
        a[i + 2∗k][2∗j + 2][l] = a[i + 2∗k][2∗j][l];
        b[j][k + l][i] = a[2∗i][k][l];
      }

Fig. 5.5. Example code – a loop nest.
In the intermediate code format, the first reference to array a is represented by
the following access matrix Aa and constant offset vector ~ca:

\[
A_a = \begin{pmatrix} 1 & 0 & 2 & 0\\ 0 & 2 & 0 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix},\qquad
\vec{c}_a = \begin{pmatrix} 0\\ 2\\ 0 \end{pmatrix}.
\]
The reference to array b in this code fragment is likewise represented by its access
matrix Ab and constant offset vector ~cb:

\[
A_b = \begin{pmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 1\\ 1 & 0 & 0 & 0 \end{pmatrix},\qquad
\vec{c}_b = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}.
\]
The definition of a uniform reference set is very similar to that of a uniformly
generated set [22]. If two references to an array have the same access matrix and differ
only in their constant offset vectors, these two references are said to belong to the
same uniform reference set. Constructing uniform reference sets for an array provides
an efficient way to analyze the data reuse for that array, because all references in a
uniform reference set have the same data access pattern and data reuse characteristics.
Identifying uniform reference sets also allows group reuse to be captured easily.
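A minimal sketch of this grouping step (the function name and data layout are illustrative assumptions): references are keyed by their access matrix, so references sharing a matrix land in the same uniform reference set.

```python
def uniform_reference_sets(refs):
    """Group (access_matrix, offset_vector) pairs into uniform reference
    sets: same access matrix -> same set; offsets stay distinct."""
    sets = {}
    for A, c in refs:
        key = tuple(map(tuple, A))   # hashable form of the matrix
        sets.setdefault(key, []).append(c)
    return list(sets.values())
```

For the three references to array a in Figure 5.5, the two a[i + 2∗k][...][l] references share one access matrix and form one uniform set, while a[2∗i][k][l] forms its own.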
5.4.4 Algorithm for Reuse Analysis
The following sections use a bottom-up approach to introduce the algorithms
implementing CDCP. First, algorithms analyzing data reuse, including self-reuse and
group-reuse, are provided for each uniform reference set. After that, an array-level
algorithm is given to obtain the footprints of all the reuses in a given nest. This
algorithm is called by a loop-nest level algorithm, which simulates the reuse behavior
of arrays in a memory space and computes near-optimal cache configurations to
exploit data reuse with the minimum cache capacity/associativity. Finally, a global or
program-level algorithm integrates the results from each loop nest and makes all
necessary changes to activate the selected cache reconfigurations at runtime.
5.4.4.1 Self-Reuse Analysis
Before the reuse analysis, all references to an array in a given nest are first
classified into several uniform reference sets. Self-reuses (both temporal and spatial)
are analyzed at uniform-set granularity. The algorithm works on access matrices and
is given in Algorithm 1.
The algorithm checks each loop index variable from the innermost loop to the
outermost loop to see whether it occurs in the subscript expressions of the references.
If the jth loop index variable ij does not occur in any subscript expression, then all
elements in the jth column of the corresponding access matrix are 0. This means that
the iterations of the jth loop do not affect the memory location accessed, i.e., the
array reference has self-temporal reuse in the jth loop. If the index variable occurs
only in the lowest (fastest-changing) dimension (i.e., the mth dimension), the distance
between contiguous loop iterations is checked. In the algorithm, s[CLP] is the stride
of the CLPth loop, BK_SZ is the cache block size, and ELMT_SZ is the size of an
array element. If the distance (A[CDN][CLP] ∗ s[CLP]) between two contiguous
iterations of the reference being analyzed is within a cache block, the reference has
spatial reuse at this loop level.
Algorithm 1 Self-Reuse Analysis
INPUT: access matrix Am∗n of a uniform reference set, array node, loop-nest node, a given cache block size BK_SZ
OUTPUT: self-reuse pattern vector ~SRPn of this uniform set

Initialize self-reuse pattern vector: ~SRPn = ~0;
Set current loop level CLP to the innermost loop: CLP = n;
repeat
  Set current dimension level CDN to the highest dimension: CDN = 0;
  Set index occurring flag IOF = FALSE;
  repeat
    if element A[CDN][CLP] ≠ 0 in the access matrix then
      /∗ the CLPth index variable appears in the expression of the CDNth subscript ∗/
      Set IOF = TRUE; Break;
    end if
    Move to the next lower dimension level;
  until CDN passes the lowest dimension
  if IOF == FALSE then
    /∗ the CLPth index variable does not occur in any subscript expression ∗/
    Set SRP[CLP] = TEMP-REUSE;
  else if CDN == m then
    /∗ the index variable occurs only in the lowest dimension ∗/
    if A[CDN][CLP] ∗ s[CLP] < BK_SZ/ELMT_SZ then
      Set SRP[CLP] = SPAT-REUSE;
    end if
  end if
  Go up to the next higher loop level;
until CLP passes the outermost loop level
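The core checks of Algorithm 1 can be sketched as follows. This is a simplified reconstruction: it assumes element-count distances, unit conventions, and label strings of my choosing, and it omits the array/loop-nest node bookkeeping.

```python
def self_reuse(A, strides, blk_elems):
    """Per-loop-level self-reuse classification over an access matrix A
    (m dimensions x n loops), following Algorithm 1's two tests."""
    m, n = len(A), len(A[0])
    srp = ['NONE'] * n
    for lp in range(n - 1, -1, -1):               # innermost -> outermost
        col = [A[d][lp] for d in range(m)]
        if all(e == 0 for e in col):
            srp[lp] = 'TEMP-REUSE'                # index absent everywhere
        elif all(A[d][lp] == 0 for d in range(m - 1)):
            # Index occurs only in the fastest-changing dimension:
            # spatial reuse if consecutive iterations stay in one block.
            if abs(A[m - 1][lp]) * strides[lp] < blk_elems:
                srp[lp] = 'SPAT-REUSE'
    return srp
```

For Ab from Figure 5.5 with unit strides and an 8-element cache block, only loop i (whose index appears only in b's lowest dimension) is classified as having self-spatial reuse, matching the text's observation about array b.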
5.4.4.2 Group-Reuse Analysis
Group reuses only exist among references in the same uniform reference set.
Group-temporal reuse occurs when different references access the same data location
across loop iterations, while group-spatial reuse occurs when different references access
the same cache block in the same or different loop iterations. Algorithm 2 exploits a
simplified version of group reuse which only exists in a single loop level.
When a group-spatial reuse is found at a particular loop level, Algorithm 2 first
checks whether this level has group-temporal reuse for other pairs of references. If it
does not, the level is marked to indicate that a group-spatial reuse exists; otherwise,
the current reuse found is simply omitted. A vector ~GRPn is defined as the group-reuse
vector. Each element of ~GRPn records the type of group reuse at its corresponding
loop level. For group-temporal reuse found at some loop level, the element
corresponding to that level in ~GRPn is directly set to indicate group-temporal reuse.
Now, for each array and each of its uniform reference sets in a particular loop
nest, the reuse information at each loop level can be collected using Algorithm 1 and
Algorithm 2. For the example code in Figure 5.5, references to array a have self-spatial
reuse at loop level l, self-temporal reuse at loop level j, and group reuse at loop level j.
The reference to array b has self-spatial reuse at loop level i.
Note that, in contrast to most of the previous work on reuse analysis (e.g., [74]),
this approach is simpler and computes the required reuse information without solving
a system of equations.
Algorithm 2 Group-Reuse Analysis
INPUT: a uniform reference set with A and its constant offset vectors, array node, loop-nest node, a given cache block size BK_SZ
OUTPUT: group-reuse pattern vector ~GRPn of this uniform set

Initialize group-reuse pattern vector: ~GRPn = ~0;
for all pairs of constant vectors ~c1 and ~c2 do
  if ~c1 and ~c2 differ only at the jth element then
    /∗ set the initial reuse distance at the jth dimension ∗/
    Set init_dist = |c1[j] − c2[j]|;
    Check the jth row of access matrix A;
    Find the first occurring loop index variable (non-zero element) starting from the innermost loop, say ik;
    if k < 1 then
      /∗ no index variable occurs in the dimension ∗/
      if j == m and init_dist < BK_SZ/ELMT_SZ then
        /∗ these two references have group-spatial reuse at each loop iteration ∗/
        Continue;
      end if
    else
      Check the kth column of access matrix A;
      if ik occurs only in the jth dimension then
        if j == m then  /∗ m is the lowest dimension of the array ∗/
          if init_dist % A[m][k] == 0 then
            Set GRP[k] = TEMP-REUSE;
          else if GRP[k] == 0 then
            Set GRP[k] = SPAT-REUSE;
          end if
        else  /∗ j < m ∗/
          if init_dist % A[j][k] == 0 then
            Set GRP[k] = TEMP-REUSE;
          end if
        end if
      end if
    end if
  end if
end for
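The lowest-dimension distance test used by Algorithm 2 can be illustrated with a hedged sketch. This covers only the branch where no loop index occurs in the differing dimension (the k < 1 case); the function name is made up for illustration.

```python
def group_spatial_same_block(c1, c2, blk_elems):
    """Two references in one uniform set enjoy group-spatial reuse at every
    iteration when they differ only in the lowest-dimension offset and that
    distance fits within one cache block (distance in array elements)."""
    diffs = [k for k in range(len(c1)) if c1[k] != c2[k]]
    if len(diffs) != 1 or diffs[0] != len(c1) - 1:
        return False                 # must differ only in dimension m
    return abs(c1[-1] - c2[-1]) < blk_elems
```

The two references to a in Figure 5.5 differ in the second (not the lowest) dimension, so this particular test rejects them.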
5.4.5 Simulating the Footprints of Reuse Spaces
The next step in this approach is to transform the data reuses detected above
into data locality. The idea is to make the data cache capacity large enough to hold all
the data in the (detected) reuse spaces of the arrays. Note that the data that are out
of reuse space are not necessary to be kept in cache after the first reference since such
data do not exhibit reuse. As discussed earlier, the cache interferences can significantly
affect the overall performance of a nest. Thus, the objective of this technique is to
find a near-optimal cache configuration that can reduce or eliminate the majority of the
cache interferences within a given nest. An informal definition of near-optimal cache
configuration is as follows.
Definition 1. A ’near-optimal cache configuration’ is the one with the smallest capacity
and associativity that approximates the locality that can be obtained using a very large
and fully-associative cache. Intuitively, any increase in either cache size or associativity
over this configuration would not produce any significant improvement.
In order to find such a near-optimal cache configuration, the cache behavior in
the reuse space must be known. This section provides an algorithm that simulates the
exact footprints (memory addresses) of array references in their reuse spaces.
Suppose, for a given loop index vector ~i, an array reference with a particular value
of ~i = (i1, i2, · · · , in)T can be expressed as follows:

\[
f(\vec{i}) = SA + Cof_1\, i_1 + Cof_2\, i_2 + \cdots + Cof_n\, i_n. \tag{5.3}
\]
Here, SA is the starting address of the array reference, which is different from the base
address (the memory address of the first array element) of the array; it is the constant
part of the above equation. Suppose that the size of each array element is elmt_sz, the
depth of dimension is m, the dimensional bound vectors (defining the scope of each
array dimension) are \vec{dl}_m = (dl_1, dl_2, · · · , dl_m)^T and
\vec{du}_m = (du_1, du_2, · · · , du_m)^T, and the constant offset vector is
~c = (c_1, c_2, · · · , c_m)^T. SA can be derived from the following equation:

\[
SA = elmt\_sz \times \sum_{j=1}^{m} \Big( \prod_{k=j+1}^{m+1} dd_k \Big) c_j,
\qquad
dd_k = \begin{cases} 1, & k = m+1 \\ du_k - dl_k, & k \le m \end{cases}
\tag{5.4}
\]
Cof_j (j = 1, 2, · · · , n) denotes the integrated coefficient of the jth loop index
variable. Suppose that the access matrix is m by n. Then Cof_j can be derived as
follows:

\[
Cof_j = elmt\_sz \times \sum_{l=1}^{m} \Big( \prod_{k=l+1}^{m+1} dd_k \Big) a_{lj},
\qquad
dd_k = \begin{cases} 1, & k = m+1 \\ du_k - dl_k, & k \le m \end{cases}
\tag{5.5}
\]
Note that, with Equation 5.3, the address of an array reference at a particular
loop iteration can be calculated as an offset in the layout space of the array. The
algorithm provided in this section uses these formulations to simulate the footprints
of array references at each loop iteration within their reuse spaces. The following two
observations give some basis for how to simulate the reuse spaces.
Observation 1. In order to realize the reuse carried by the innermost loop, only one
cache block is needed for this array reference.
Observation 2. In order to realize the reuse carried by a non-innermost loop, the mini-
mum number of cache blocks needed for this array reference is the number of cache blocks
visited by the loops nested inside it.
Since all subscript functions are assumed to be affine, for any array reference, the
footprint patterns in the reuse space are identical across iterations of the loop level that
carries the reuse. Thus, only the first iteration of the loop whose reuse is currently
being exploited needs to be simulated. For example, if loop level j in loop vector ~i
carries the reuse being exploited, the simulation space is defined as SMS_j = (i1 = l1,
i2 = l2, · · · , ij = lj, ij+1, · · · , in), in which each ik with k > j varies from its lower
bound lk to its upper bound uk.
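As an illustration of the simulation space SMS_j, the sketch below pins loops 1..j at their lower bounds and sweeps all inner loops. For brevity, the upper bounds are treated as exclusive here (unlike the inclusive bounds of Fig. 5.6); all names are illustrative.

```python
# Enumerate the simulation space SMS_j of an n-deep loop nest:
# loops 1..j fixed at their lower bounds, loops j+1..n swept fully.
from itertools import product

def simulation_space(j, lower, upper):
    """Yield the index vectors of SMS_j (upper bounds exclusive)."""
    n = len(lower)
    fixed = tuple(lower[:j])                          # i_1..i_j = l_1..l_j
    inner = [range(lower[k], upper[k]) for k in range(j, n)]
    for rest in product(*inner):                      # sweep the inner loops
        yield fixed + rest

# Reuse carried at loop level j = 2 of a 4-deep nest: only k and l vary.
pts = list(simulation_space(2, [0, 0, 0, 0], [4, 4, 4, 4]))
```

The number of simulated iterations is the product of the inner loops' trip counts, which is exactly the iteration space Algorithm 3 walks when the reuse is carried by a non-innermost loop.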
Algorithm 3 first calls Algorithms 1 and 2, given on pages 9 and 10, respectively.
After that, it simulates the footprints of the most significant reuse space for an array in
a particular loop nest. The most significant reuse space is defined by the highest reuse
level (considering both self reuse and group reuse) among all uniform reference sets of
the given array. If an array only has reuse at the innermost loop level, Algorithm 3 only
needs to calculate the footprints once, by setting all loop index variables to their lower
bounds. Otherwise, Algorithm 3 computes the footprints of all references to this array
within the iteration space defined by the reuse space SMS. These footprints are marked
using an array-level bitmap.
Algorithm 3 Simulating Footprints in Reuse Spaces
INPUT: an array node, a loop-nest node, a given cache block size: BK_SZ
OUTPUT: an array-level bitmap for footprints
  Initialize the array size AR_SZ in number of cache blocks;
  Allocate an array-level bitmap ABM with size AR_SZ and initialize ABM to zeros;
  Initialize the highest reuse level RS_LEV = n; /* n is the depth of the loop nest */
  for each uniform reference set do
    Call Algorithm 1 for self-reuse analysis;
    Call Algorithm 2 for group-reuse analysis;
    Set URS_LEV = highest reuse level of this set;
    if RS_LEV > URS_LEV then
      /* current set has the highest reuse level */
      Set RS_LEV = URS_LEV;
    end if
  end for
  if RS_LEV == n then
    /* only the innermost loop has reuse, or no reuse exists */
    /* simulate one iteration of the innermost loop (loop n) */
    for all references of this array do
      Set ~i = ~l; /* use only the lower bounds */
      Apply Equation 5.3 to get the reference address f(~i);
      Convert to a block id: bk_id = f(~i)/BK_SZ;
      Set the array bitmap: ABM[bk_id] = VISITED;
    end for
  else
    /* simulate all iterations of loops inner to RS_LEV */
    for all loop indexes ij, j > RS_LEV do
      Vary the value of ij from its lower bound to its upper bound;
      for all references of this array do
        Apply Equation 5.3 to get the reference address f(~i);
        Convert to a block id: bk_id = f(~i)/BK_SZ;
        Set the array bitmap: ABM[bk_id] = VISITED;
      end for
    end for
  end if
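The footprint-marking core of Algorithm 3 can be sketched as follows: every reference address generated while sweeping the loops inner to the reuse level is mapped to a cache block and recorded in the array bitmap. For brevity, this sketch uses simple one-dimensional address functions in place of the full Equation 5.3; all names are illustrative.

```python
# Mark, in an array-level bitmap ABM, every cache block touched by a set of
# references while the inner loops are swept (the else-branch of Algorithm 3).
from itertools import product

def footprint_bitmap(addr_fns, inner_ranges, array_bytes, bk_sz):
    """Return ABM where ABM[b] == 1 iff cache block b was visited."""
    abm = [0] * ((array_bytes + bk_sz - 1) // bk_sz)
    for it in product(*inner_ranges):      # all inner-loop iterations
        for f in addr_fns:                 # all references to this array
            abm[f(it) // bk_sz] = 1        # bk_id = f(i) / BK_SZ
    return abm

# Two references, a[2*k] and a[2*k+1], over k in [0, 8), on an int a[64]
# array with 16-byte cache blocks:
abm = footprint_bitmap([lambda it: 4 * (2 * it[0]),
                        lambda it: 4 * (2 * it[0] + 1)],
                       [range(8)], 64 * 4, 16)
```

In this toy case the two references together touch bytes 0–60, so only the first four of the sixteen blocks are marked; the rest of the bitmap stays zero, which is exactly the information Algorithm 4 later overlays across arrays.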
5.4.6 Computation and Optimization of Cache Configurations for Loop Nests
In previous subsections, the reuse spaces of each array in a particular loop nest
have been determined and their footprints have been simulated in the layout space of
each array. After executing Algorithm 3, each array has a bitmap indicating the cache
blocks that have been visited by the iterations in its reuse space. As discussed earlier,
cache interference can disturb these reuses and prevent array references from realizing
data locality across loop iterations. Thus, an algorithm that reduces this interference,
and thereby improves data locality within the reuse spaces, is crucial.
This subsection provides a loop-nest level algorithm to capture cache interfer-
ences among different arrays accessed within a loop nest. This algorithm tries to map
the reuse space of each array into a linear memory space. At the same time, the degree
of conflicts (number of interferences among different arrays) at each cache block is stored
in a loop-nest level bitmap. Since the self-interference of each array is already solved
by Algorithm 3 using an array bitmap, this algorithm mainly focuses on reducing the
group-interference that might occur among different arrays. As is well known, one of the
most effective ways to avoid interference is to increase the associativity of the data cache,
which is the approach taken by this algorithm. Based on the definition of a near-optimal
cache configuration, this algorithm tries to find the smallest data cache with the smallest
associativity that significantly reduces cache conflicts. The detailed procedure, which
computes and optimizes the cache configuration, is given as Algorithm 4.
Algorithm 4 Compute and Optimize Cache Configurations for Loop Nests
INPUT: a loop-nest node, the global list of arrays declared, lower bound of block size: BK_SZ_LB, upper bound of block size: BK_SZ_UB
OUTPUT: optimal cache configurations at different BK_SZ
  Set BK_SZ = BK_SZ_LB; /* lower bound */
  repeat
    for each array in this loop nest do
      Call Algorithm 3 to get the array bitmap ABM;
    end for
    Create and initialize a loop-nest level bitmap LBM, with size LBM_size being the
      smallest 2^n that is ≥ the size of the largest array (in blocks);
    for each array bitmap ABM do
      /* map ABM into the loop-nest bitmap LBM at the array's relative base
         address base_addr, accumulating the degree of conflict at each block */
      for block_id < array_size do
        LBM[(block_id + base_addr) % LBM_size] += ABM[block_id];
      end for
    end for
    Set assoc = the largest degree of conflict in LBM;
    Set cache_sz = assoc ∗ LBM_size;
    Set the optimal cache conf. to the current cache conf.;
    for assoc < assoc_upper_bound do
      Halve the number of sets of the current cache: LBM_size /= 2;
      for i ≤ LBM_size do
        LBM[i] += LBM[i + LBM_size];
      end for
      Set assoc = the highest value of LBM[i], i ≤ LBM_size;
      Set cache_size = assoc ∗ LBM_size;
      if assoc < assoc_upper_bound and cache_size < optimal_cache_size then
        Set the optimal cache conf. to the current cache conf.;
      end if
    end for
    Output the optimal cache conf. at BK_SZ;
    Double the block size: BK_SZ ∗= 2;
  until BK_SZ > BK_SZ_UB /* upper bound */
For a given loop nest, Algorithm 4 starts with the cache block size (BK_SZ) at its
lower bound (e.g., 16 bytes) and goes up to its upper bound (e.g., 64 bytes). For a given
BK_SZ, it first applies Algorithm 3 to obtain the array bitmap ABM of each array. After
that, it allocates a loop-nest level bitmap LBM for all arrays within this nest, whose size
is the smallest power of two that is greater than or equal to the largest array size.
All ABMs are remapped into this LBM at their relative array base addresses. The
value of each entry in the LBM indicates the degree of conflict at a particular cache
block. Following this, the optimization is carried out by halving the size of the LBM
and remapping it. Note that the largest entry in the LBM gives the smallest cache
associativity needed to avoid interference in the corresponding cache block. This process
stops when the upper bound on associativity is reached. The near-optimal cache
configuration at block size BK_SZ is computed as the one with the smallest cache size
as well as the smallest associativity.
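The LBM construction and set-halving loop of Algorithm 4 can be sketched as follows: `overlay` accumulates the per-array bitmaps at their relative base addresses, and each `fold` halves the number of sets, so the maximum entry after each fold is the associativity required to absorb all conflicts at that size. The names and the toy bitmaps are illustrative.

```python
# Sketch of Algorithm 4's core: build the loop-nest bitmap LBM, then
# repeatedly halve the number of sets and remap.

def overlay(abms_with_bases, lbm_size):
    """Accumulate array bitmaps into an LBM of lbm_size blocks.
    Each entry counts how many arrays touch that cache set."""
    lbm = [0] * lbm_size
    for abm, base in abms_with_bases:      # base is in cache blocks
        for b, hit in enumerate(abm):
            lbm[(b + base) % lbm_size] += hit
    return lbm

def fold(lbm):
    """Halve the number of sets: blocks b and b + size/2 now conflict,
    so their conflict counts add."""
    half = len(lbm) // 2
    return [lbm[i] + lbm[i + half] for i in range(half)]

# Two fully-used 8-block arrays at bases 0 and 8; LBM size 8 (a power of two).
lbm = overlay([([1] * 8, 0), ([1] * 8, 8)], 8)
lbm2 = fold(lbm)                           # 4 sets: required associativity doubles
```

Each fold trades sets for associativity at a fixed capacity, which is why the algorithm can scan the (size, associativity) space without re-simulating the footprints.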
5.4.7 Global Level Cache Polymorphism
The compiler-directed cache polymorphism technique does not make changes to
the source code. Instead, it uses the compiler only for parsing the source code and
generating internal code in an intermediate format that is local to the algorithms. A
global (program-level) algorithm, Algorithm 5, is presented in this section to implement
compiler-directed cache polymorphism.
Algorithm 5 first generates the intermediate format of the original code and collects
global information about the arrays in the source code. After that, it applies Algorithm
4 to each loop nest and obtains the near-optimal cache configuration for each of
Algorithm 5 Global Level Cache Polymorphism
INPUT: source code (.spd)
OUTPUT: performance data and the cache configurations for each loop nest
  Initialize the cache-configuration list: CCL;
  Use one SUIF pass to generate the intermediate code format;
  Construct a global list of the arrays declared, with their relative base addresses;
  for each loop nest do
    for each array in this loop nest do
      Construct uniform reference sets for all its references;
    end for
    Call Algorithm 4 to optimize the cache configurations for this loop nest;
    Store the configurations in the CCL;
  end for
  for each block size do
    Activate the reconfiguration mechanism for each loop nest using its configuration
      from the CCL;
    Output the performance data as well as the cache configuration of each loop nest;
  end for
them. These configurations are stored in the cache-configuration list (CCL). Each loop
nest has a corresponding node in the CCL, which has its near-optimal cache configu-
rations at different block sizes. After the nest-level optimization has been performed,
Algorithm 5 activates the cache reconfiguration, where a modified version of the Shade
simulator is used. During the simulation, Shade is directed to use the near-optimal cache
configurations in CCL for each loop nest before its execution. The performance data of
each loop nest under different cache configurations is generated as output.
Current cache reconfiguration mechanisms can only vary cache size and cache
associativity at a fixed cache block size, so the cache optimization is performed separately
for each (fixed) cache block size. This means that the algorithms in this work suggest a
near-optimal cache configuration for each loop nest at a given block size. Section 5.5
presents experimental results verifying the effectiveness of this technique.
#define N 8
int a[N][N][N], b[N][N][N];
int N1 = 4, N2 = 4, N3 = 4, N4 = 4;

main()
{
    int i, j, k, l;
    for (i = 0; i <= N1; i++)
        for (j = 0; j <= N2; j++)
            for (k = 0; k <= N3; k++)
                for (l = 0; l <= N4; l++) {
                    a[i + k][j + 2][l] = a[i + k][j][l];
                    b[j][k + l][i] = a[2 * i][k][l];
                }
}
Fig. 5.6. An Example: Array-based code.
5.4.8 An Example
This subsection focuses on the example code in Figure 5.6 to demonstrate how
CDCP works. For simplicity, this code only contains a single nest (with four loops and
four references).
Algorithm 5 starts with a SUIF pass that converts the source code shown above
into the intermediate format, in which the program node has only one loop-nest node.
The loop-nest node is represented by its index vector ~i = (i, j, k, l)^T, with an index
lower bound vector ~il = (0, 0, 0, 0)^T, an upper bound vector ~iu = (N1, N2, N3, N4)^T,
and a stride vector ~is = (1, 1, 1, 1)^T. Within the nest, arrays a and b have references
AR_a1, AR_a2, AR_a3, and AR_b, which are represented using access matrices and
constant vectors as follows:
A_a1 = [1 0 1 0; 0 1 0 0; 0 0 0 1],  ~c_a1 = (0, 2, 0)^T
A_a2 = [1 0 1 0; 0 1 0 0; 0 0 0 1],  ~c_a2 = (0, 0, 0)^T
A_a3 = [2 0 0 0; 0 0 1 0; 0 0 0 1],  ~c_a3 = (0, 0, 0)^T
A_b  = [0 1 0 0; 0 0 1 1; 1 0 0 0],  ~c_b  = (0, 0, 0)^T
Also, a global array list is generated as <a, b>. Then, for array a, references AR_a1
and AR_a2 are grouped into one uniform reference set, and AR_a3 is put into another
set. Array b, on the other hand, has only a single uniform reference set.
After that, Algorithm 4 is invoked. It starts processing from the smallest cache
block size BK_SZ, say 16 bytes, and uses Algorithm 3 to obtain the array bitmaps
ABM_a for array a and ABM_b for array b at this BK_SZ. Within Algorithm 3,
Algorithms 1 and 2 are first called to analyze the reuse characteristics of a given array. In
this example, these algorithms compute the self-reuse pattern SRP = (0, 0, 0, 1)^T and
the group-reuse pattern GRP = (0, 1, 0, 0)^T for the first uniform set of array a. These
two patterns indicate that this set has self-spatial reuse at loop level l and group-temporal
reuse at loop level j. For the second uniform set of array a, the algorithm returns
SRP = (0, 2, 0, 1)^T, which indicates self-spatial reuse at level l and self-temporal reuse
at level j. The reference to array b has self-spatial reuse at level i, corresponding to its
self-reuse pattern SRP = (1, 0, 0, 0)^T. The highest reuse level of each array is then
used by Algorithm 3 to generate the ABM for its footprints in the reuse space. Assuming
that an integer is 4 bytes in size, both ABM_a and ABM_b have 128 bits, as shown
below.
ABM_a:
  0-31:   1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  32-63:  1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  64-95:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ABM_b:
  0-31:   1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  32-63:  1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  64-95:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
These two ABMs are then passed from Algorithm 3 to Algorithm 4. In turn,
Algorithm 4 creates a loop-nest bitmap LBM, its size being equal to the largest array
size (the larger of the ABMs), and re-maps ABM_a and ABM_b into the LBM. Since
array a has its relative base address at byte 0 and array b at byte 2048, the resulting
LBM is as follows:

  0-31:   2 0 2 0 2 0 2 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  32-63:  2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  64-95:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The maximum value of the entries in the LBM indicates the number of interferences
among the different arrays in the nest; thus, it is the least associativity required to
avoid this interference. In this example, Algorithm 4 starts from a cache associativity
of 2 to compute the near-optimal cache configuration. At each iteration, the size of
LBM is halved and the LBM is re-mapped until the resulting associativity reaches
the upper bound, e.g., 16. Then, the algorithm outputs the smallest cache size with
the smallest associativity as the near-optimal configuration at this block size BK_SZ.
For this example, the near-optimal cache configuration is a 2KB 2-way set-associative
cache with a 16-byte block size. The LBM after optimization looks as follows:
  0-31:   2 0 2 0 2 0 2 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  32-63:  2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
Algorithm 4 then proceeds to compute the near-optimal cache configurations for
larger cache block sizes by doubling the previous block size. When the block size reaches
its upper bound (e.g., 64 bytes), the algorithm stops, and passes all the near-optimal
configurations at different block sizes to Algorithm 5. The cache configurations (in
this example) computed by Algorithm 4 at different block sizes are given in Table 5.1.
On receiving these configurations, Algorithm 5 activates Shade (a fast instruction-set
simulator) to simulate the example code (executable) with these cache configurations.
The performance data is then generated as the output of Algorithm 5.
Block Size (B)   Number of Sets   Associativity   Cache Size (B)
16               64               2               2048
32               32               2               2048
64               16               2               2048

Table 5.1. Cache configurations generated by Algorithm 4 for the example nest.
5.5 Experiments
5.5.1 Simulation Framework
This section presents simulation results to verify the effectiveness of the CDCP
technique. The technique has been implemented using the SUIF compiler [69] and Shade
[18]. Eight array-based benchmarks from Table 3.2 are used in this simulation work. In
each benchmark, loop nests dominate the overall execution time.

The main goal here is to compare the cache configurations returned by the CDCP
scheme with those obtained through a scheme based on exhaustive simulation (using
Shade). Three different block (line) sizes are considered: 16, 32, and 64 bytes. Note
that this part of the work specifically targets on-chip L1 caches.
5.5.2 Selected Cache Configurations
This subsection first applies an exhaustive simulation method using the Shade
simulator. For this method, the original program codes are divided into a set of small
programs, each program having a single nest. Shade simulates these loop nests individ-
ually with all possible L1 data cache configurations within the following ranges: cache
sizes from 1K to 128K, set-associativity from 1 way to 16 ways, and block size at 16, 32
and 64 bytes. The number of data cache misses is used as the metric for comparing
performance. The optimal cache configuration at a certain cache block size is the smallest
one, in terms of both cache size and set associativity, whose performance (the number
of misses) cannot be improved significantly (i.e., the number of misses cannot be reduced
by more than 1%) by increasing the cache size and/or set associativity. The left portion
of Table 5.3 shows the optimal cache configurations (as selected by Shade) for each loop
nest in different benchmarks, as well as at different cache block sizes.
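The selection rule described above can be sketched as follows, simplifying "cannot be improved by more than 1%" to "miss count within 1% of the best observed"; the data and names are illustrative.

```python
# Pick the smallest (cache_size, assoc) among exhaustively simulated
# configurations whose miss count is within 1% of the best result.

def pick_optimal(results):
    """results: list of (cache_size, assoc, misses) tuples."""
    best = min(m for _, _, m in results)           # misses of the best config
    threshold = best * 1.01                        # within 1% of the best
    good = [(sz, a, m) for sz, a, m in results if m <= threshold]
    return min(good, key=lambda t: (t[0], t[1]))   # smallest size, then assoc

# Toy data: a 4K 2-way cache already comes within 1% of the 128K result,
# so it is selected even though larger caches miss slightly less.
configs = [(1024, 1, 9000), (4096, 2, 1010), (65536, 4, 1005),
           (131072, 16, 1000)]
```

This captures why the exhaustively selected configurations in Table 5.3 are often larger than the CDCP ones: the exhaustive search keeps growing the cache until the 1% improvement threshold is no longer met anywhere.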
Benchmark   Running Time (s)   Benchmark   Running Time (s)
adi.c       10.491             aps.c       1.638
bmcm.c      2.609              eflux.c     0.296
tomcat.c    1.544              tsf.c       0.809
vpenta.c    4.009              wss.c       0.148

Table 5.2. Running time of Algorithm 4 for each benchmark.
The CDCP technique takes the original source code in the SUIF .spd format and
applies Algorithm 5 to generate the near-optimal cache configurations for each loop nest
in the source code. It does not perform any instruction-level simulation for configuration
optimization; thus, it is expected to be very fast in finding the near-optimal cache
configurations. Indeed, Table 5.2 gives the actual running time of Algorithm 4 for each
benchmark. The execution engine of CDCP (a modified version of Shade) directly applies
these cache configurations to activate the reconfiguration mechanisms dynamically. The
cache configurations determined by CDCP are shown in the right part of Table 5.3.
To sum up, Table 5.3 gives, for each loop nest in a given benchmark, the optimal cache
configurations from Shade and the near-optimal cache configurations from the CDCP
technique at block sizes of 16, 32, and 64 bytes. A notation such as 8k4w is used to
indicate an 8-Kbyte 4-way set-associative cache. In this table, B means bytes, K denotes
kilobytes, and M indicates megabytes.
From Table 5.3, it can be observed that CDCP has the ability to determine cache
capacities at byte granularity. In most cases, the cache configuration determined by
CDCP is smaller than or equal to the one determined by exhaustive simulation. This is
because the exhaustive simulation strategy searches for an optimal cache configuration
that reduces cache conflicts as much as possible, whereas CDCP tries to determine a
cache configuration that avoids the majority of cache conflicts within the reuse spaces
of the different arrays rather than over the whole memory space of all arrays. The next
subsection presents simulation results showing that the CDCP technique is effective in
determining near-optimal cache configurations by exploiting these dominant cache
conflicts within the reuse spaces.
Benchmark Shade CDCP
adi 16 32 64 16 32 64
1 1k4w 1k4w 1k4w 64B4w 128B4w 256B4w
2 16k16w 16k16w 16k16w 16k16w 16k16w 16k16w
aps 16 32 64 16 32 64
1 2k4w 4k8w 64k4w 2k8w 4k4w 8k8w
2 16k8w 16k16w 32k16w 16k4w 16k8w 32k8w
3 4k2w 4k8w 8k8w 2k16w 4k8w 8k8w
bmcm 16 32 64 16 32 64
1 1k8w 2k8w 4k8w 64B1w 128B1w 256B1w
2 1k8w 2k8w 4k8w 64B2w 128B4w 256B1w
3 32k4w 64k4w 128k4w 32k4w 64k4w 128k4w
eflux 16 32 64 16 32 64
1 16k4w 32k4w 64k4w 2k8w 4k4w 8k8w
2 16k8w 32k4w 64k4w 8k4w 16k2w 32k4w
3 128k16w 256k2w 256k2w 128k8w 256k2w 256k2w
4 2k8w 2k8w 4k8w 128B4w 256B2w 256B4w
5 16k16w 32k4w 64k4w 8k16w 16k8w 32k4w
6 128k16w 256k1w 256k2w 128k8w 256k2w 256k2w
tomcat 16 32 64 16 32 64
1 1k2w 1k1w 1k1w 32B2w 64B2w 128B1w
2 1k1w 1k1w 1k1w 32B1w 64B1w 128B2w
3 128k4w 256k4w 256k16w 64k1w 128k2w 256k2w
4 1k2w 1k4w 2k8w 32B2w 64B2w 128B1w
5 64k8w 128k8w 256k16w 64k1w 128k2w 256k2w
6 1k2w 1k4w 2k4w 64B4w 128B4w 256B2w
7 64k4w 128k8w 128k8w 32k4w 64k8w 128k16w
8 32k1w 128k2w 128k4w 32k1w 64k2w 128k4w
tsf 16 32 64 16 32 64
1 4k4w 8k1w 8k1w 4k1w 4k1w 4k1w
2 1M1w 1M1w 1M1w 1M1w 1M1w 1M1w
3 4k4w 4k16w 8k4w 4k1w 4k1w 4k1w
4 1M1w 1M1w 1M1w 1M1w 1M1w 1M1w
vpenta 16 32 64 16 32 64
1 64k1w 128k1w 256k1w 64k1w 128k1w 256k8w
2 1k8w 2k4w 2k8w 128B8w 256B8w 512B8w
3 1k4w 2k2w 2k8w 256B4w 512B2w 1k2w
4 128k8w 256k8w 512k2w 128k2w 256k8w 512k2w
5 1k4w 2k4w 4k2w 256B4w 512B2w 1k2w
6 1k2w 2k2w 2k8w 128B8w 256B4w 512B8w
7 1k2w 1k2w 1k16w 64B1w 128B2w 256B4w
8 64k8w 128k2w 256k2w 64k1w 128k1w 256k1w
wss 16 32 64 16 32 64
1 4k4w 8k4w 8k16w 2k2w 4k4w 8k8w
2 1k8w 2k8w 4k4w 64B4w 128B4w 256B4w
3 1k2w 1k2w 1k2w 64B2w 128B4w 256B4w
4 64k4w 64k4w 64k4w 64k2w 64k2w 64k2w
5 4k4w 8k8w 16k8w 2k4w 4k4w 8k8w
6 1k2w 1k2w 1k2w 32B2w 64B1w 128B2w
7 2k8w 4k4w 4k4w 64B4w 128B1w 256B2w
Table 5.3. Cache configurations for each loop nest in the benchmarks: Shade vs. CDCP.
5.5.3 Simulation Results
Notice that an underlying reconfigurable cache is assumed for this research. Since
loop nests dominate the performance and energy consumption in array-based appli-
cations, the cache reconfigurations, which take place at loop-nest boundaries, incur
negligible performance/energy overhead. Reconfiguration is performed at the coarse
granularity of changing the cache associativity and cache size. This reconfiguration
involves enabling/disabling cache sub-banks that are normally present in a cache
architecture. The impact on cache access time will be negligible in designs that exploit
this property. Note that the reconfiguration required in this work disables the unused
portions of the cache, as opposed to the more complex reconfigurable caches that divide
the cache memory into multiple partitions used for different purposes [62]. A simple
cache flushing scheme is applied during cache reconfiguration.
In this part of the experiments, both sets of cache configurations for each loop
nest given in Table 5.3 are simulated. All configurations from CDCP with a cache
size of less than 1K are simulated with a 1K cache size, with the other parameters
unmodified. For ease of comparison, the performance is shown as the cache hit rate
instead of the miss rate. Figure 5.7 gives the performance comparison between Shade
(exhaustive simulation) and CDCP using a block size of 16 bytes.
The observation from Figure 5.7 is that, for benchmarks adi.c, aps.c, bmcm.c,
tsf.c, and wss.c, the results obtained from Shade and CDCP are very close. On the
other hand, Shade outperforms CDCP in benchmarks eflux.c, tomcat.c and vpenta.c.
On the average, CDCP achieves 98% of the optimal performance using optimal cache
Fig. 5.7. Cache performance comparison (data cache hit rate) for configurations at a block size of 16: Shade vs. CDCP.
Fig. 5.8. Cache performance comparison (data cache hit rate) for configurations at a block size of 32: Shade vs. CDCP.
configurations. Figures 5.8 and 5.9, on the other hand, show the results obtained for
block sizes of 32 and 64 bytes.
Fig. 5.9. Cache performance comparison (data cache hit rate) for configurations at a block size of 64: Shade vs. CDCP.
It should be noted that, for most benchmarks, the performance difference between
Shade and CDCP decreases as the block size is increased to 32 and 64 bytes. In particular,
for benchmarks adi.c, aps.c, bmcm.c, tsf.c, vpenta.c, and wss.c, the performance of
the configurations determined by the two approaches is almost the same. For the other
benchmarks, eflux.c and tomcat.c, Shade consistently outperforms CDCP when the
block size is 32 or 64 bytes. On the average, the performance difference is reduced to
1.1% and 0.7% at 32-byte and 64-byte block sizes, respectively.
Fig. 5.10. A breakdown of the cache performance comparison at the granularity of each loop for benchmarks adi, aps, bmcm, and tsf. Configurations at all three cache block sizes (16, 32, and 64 bytes) are compared: Shade vs. CDCP.
Fig. 5.11. A breakdown of the cache performance comparison at the granularity of each loop for benchmarks eflux, tomcat, vpenta, and wss. Configurations at all three cache block sizes (16, 32, and 64 bytes) are compared: Shade vs. CDCP.
For a more detailed study, a breakdown of the performance comparison at the
loop nest level is given in Figure 5.10 for benchmarks adi, aps, bmcm, and tsf, and in
Figure 5.11 for benchmarks eflux, tomcat, vpenta, and wss. For each loop of a given
benchmark, the optimal cache configurations from Shade exhaustive simulation and the
near-optimal cache configurations from CDCP at all three cache block sizes (16, 32,
and 64 bytes) are compared. In each group of six bars for a loop, the left two bars are
the cache hit rates of the configurations at a 16-byte block size from Shade and CDCP,
the middle two bars are for the configurations at a 32-byte block size, and the last two
are for the configurations at a 64-byte block size. From Figure 5.10, the cache
configurations computed by CDCP for each loop at different block sizes achieve the
same cache performance as the optimal ones from Shade. One exception is loop 3 of
aps at a block size of 16 bytes; however, this performance difference disappears for
cache configurations at a 32-byte or 64-byte block size. The same behavior occurs in
benchmarks vpenta and wss in Figure 5.11. Figure 5.11 also shows a noticeable
performance gap between the configurations from Shade and CDCP for benchmarks
eflux and tomcat; this gap diminishes for configurations with larger block sizes. The
results from the loop-nest level comparison show that the CDCP technique is very
effective in finding near-optimal cache configurations for loop nests in these benchmarks,
especially at block sizes of 32 and 64 bytes (the most common block sizes used in
embedded processors). Since CDCP is analysis-based rather than simulation-based, it
is expected to be even more attractive for codes with large input sizes.
Data cache hit rate is a reliable metric for comparing the performance of the Shade-
based and CDCP-based configurations. However, the impact of degraded cache perfor-
mance (i.e., an increased miss rate) on overall processor performance can sometimes be
amortized by other factors such as control dependences, data dependences, and resource
conflicts. Thus, the overall performance degradation when using CDCP (in comparison
to Shade) can be smaller than the cache hit rate degradation. That is, the estimate
given here of the performance impact of the CDCP-selected cache configurations is
pessimistic.
From the energy perspective, the Cacti power model [63] is used to compute the
energy consumption of the L1 data cache for each loop nest of the benchmarks at the
different cache configurations listed in Table 5.3. A 0.18 micron technology is assumed
for all the cache configurations. Since cache reconfiguration is performed at the
granularity of a loop nest, the energy consumed during reconfiguration is negligible
compared to the energy consumed during the execution of the loop nests (experimental
results show that the energy impact is less than 0.1% even if the reconfiguration energy
cost is assumed to be 1000 times that of a single cache access). The detailed energy
consumption figures are given in Table 5.4.
From these experimental results, it can be concluded that (i) the CDCP strategy
generates performance results competitive with exhaustive simulation, and (ii) in general
it results in much lower energy consumption than a configuration selected by exhaustive
simulation. Consequently, this approach strikes a balance between performance and
power consumption.

1 Energy estimation is not available from Cacti for very small cache configurations.
Benchmark Shade CDCP
adi 16 32 64 16 32 64
1 318.6 287.4 -1
318.6 287.4 -
2 12154.4 13164.5 16753.6 12154.4 13164.5 16753.6
aps 16 32 64 16 32 64
1 322.3 771.7 540.1 661.2 335.4 822.0
2 125599.5 279985.9 368764.9 65461.7 122847.2 145962.2
3 7907.7 33273.5 34697.7 64275.4 33273.5 34697.7
bmcm 16 32 64 16 32 64
1 314.6 342.9 393.4 31.7 30.5 31.1
2 314.6 342.9 393.4 83.0 155.2 31.1
3 26826.7 32203.8 36989.1 26826.7 32203.8 36989.1
eflux 16 32 64 16 32 64
1 366.7 386.4 433.3 648.4 320.1 776.6
2 1068.8 610.3 700.1 534.8 301.7 598.5
3 2366.1 727.5 749.6 1220.7 727.5 749.6
4 310.2 321.7 370.7 146.0 77.0 -
5 2326.5 636.5 731.2 2399.6 1121.7 624.5
6 2573.0 682.0 821.3 1323.3 795.5 821.3
tomcat 16 32 64 16 32 64
1 895.0 280.4 260.0 895.0 748.4 260.0
2 28.4 27.5 28.1 28.4 27.5 74.3
3 66507.5 86086.0 350675.9 26846.9 40767.0 83199.2
4 78.1 147.5 - 78.1 77.1 29.5
5 25678.1 27508.1 79570.4 9448.6 14978.6 25989.1
6 80.8 152.7 167.6 152.8 152.7 86.5
7 9461.3 18865.2 25190.9 9647.7 21984.0 57050.0
8 2051.1 5050.0 8406.6 2051.1 4046.2 8406.6
tsf 16 32 64 16 32 64
1 160.9 38.5 41.4 34.7 34.7 35.9
2 6263.6 9501.5 14293.2 6263.6 9501.5 14293.2
3 163.5 787.9 173.9 35.2 35.2 42.5
4 6234.3 9452.6 14226.7 6234.3 9452.6 14226.7
vpenta 16 32 64 16 32 64
1 4111.6 5130.1 9029.6 4111.6 5130.1 22364.9
2 350.7 184.6 - 350.7 - -
3 189.4 102.3 - 189.4 97.7 98.6
4 77075.1 90080.5 100849.2 27835.4 90080.5 100849.2
5 188.4 216.9 108.7 188.4 97.4 98.3
6 99.1 101.7 - - 185.8 -
7 90.2 89.0 - 32.7 89.0 -
8 36158.0 13557.1 26934.3 8994.2 12456.5 21512.0
wss 16 32 64 16 32 64
1 268.8 279.9 1610.6 138.6 261.1 624.0
2 288.5 317.1 168.1 143.9 143.6 -
3 75.1 74.1 74.9 75.1 141.8 -
4 22641.6 23665.1 22935.2 13274.4 13051.5 14560.2
5 326.7 672.8 775.3 325.7 319.6 756.3
6 74.8 73.8 74.6 74.8 27.6 74.6
7 302.8 155.6 166.6 142.4 27.9 75.1
Table 5.4. Energy consumption (microjoules) of the L1 data cache for each loop nest in the benchmarks, with the configurations in Table 5.3: Shade vs. CDCP.
5.6 Discussions and Summary
In this chapter, a new technique, compiler-directed cache polymorphism (CDCP),
is proposed for optimizing the data locality of array-based embedded applications while
keeping the energy consumption under control. In contrast to many previous techniques
that modify a given code for a fixed cache architecture, this technique is based on modifying
(reconfiguring) the cache architecture dynamically between loop nests. A set of
algorithms is presented in this chapter that (collectively) allows the compiler to select a
near-optimal cache configuration for each nest of a given application. The experimental
results obtained using a set of array-intensive applications reveal that this approach gen-
erates competitive performance results and consumes much less energy (when compared
to an exhaustive simulation based framework).
CDCP can be extended in several directions. First, high-level transformation
algorithms [21] can be incorporated to convert pointer-based code into array code
before applying CDCP. Second, cache polymorphism can be applied at granularities
smaller than loop nests. Finally, combining CDCP with loop- and data-based compiler
optimizations, so that hardware and software are optimized in a coordinated manner,
is a promising topic.
Chapter 6
Reusing Instructions for Energy Efficiency
6.1 Introduction
Advancing technology has increased the speed gap between on-chip caches and the
datapath. Even in current technology, the access latency of the level one instruction
cache can hardly be maintained within one cycle (e.g., two cycles for accessing the
trace cache in the Pentium 4 [30]). In this case, a pipelined instruction cache must
be implemented to supply instructions every cycle. As a result, the pipeline depth of
the datapath front-end increases (e.g., 6 stages in the Pentium 4 [30]). Sophisticated
branch predictors employed in the latest microprocessors also consume considerable
power [58], which further increases the power contribution of the pipeline front-end.
Previous research utilized small instruction buffers to capture tight loop code,
such as the decoded instruction buffer (DIB) [31][8], the decoded loop cache [5], and
the decoded filter cache [71], to reduce energy consumption in the instruction cache
and decoder. The loop cache [47] and the filter cache [46] are more general approaches
to reducing energy consumption in level one caches.
The dynamic instruction footprint analysis performed in Chapter 4 for a set of
array-based embedded applications shows that these applications have very regular
behavior: their execution proceeds in one or more phases, and within a particular
phase the instruction footprint spans only a very limited range of the address space.
This dynamic characteristic can be exploited to design either reconfigurable
instruction caches or smaller instruction buffers that capture the phase execution
for energy optimization. This thesis explores a more aggressive approach that
utilizes this dynamic application behavior to optimize the energy consumption of the
instruction cache as well as other components in the datapath front-end.
This chapter proposes a new issue queue design that is capable of instruction
reuse. The proposed issue queue has a mechanism to dynamically detect and identify
reusable instructions, particularly instructions belonging to tight loops. Once reusable
instructions are detected, the issue queue switches its operation mode to buffer these
instructions. In contrast to conventional issue logic, buffered instructions are not removed
from the issue queue after they are issued. After the buffering is finished, the issue queue
is then switched to an operation mode to reuse those buffered reusable instructions.
During this mode, issued (buffered) instructions keep occupying their entries in the issue
queue and are reused in later cycles. A special mechanism employed by the issue queue
guarantees that the reused instructions are register-renamed in the original program
order. Thus, the instructions are supplied by the issue queue itself rather than the
fetch unit. There is no need to perform instruction cache access, branch prediction, or
instruction decoding. Consequently, the front-end of the datapath pipeline, i.e., the
pipeline stages before register renaming, can be gated during this instruction reuse mode. This
thesis proposes this design as a solution to effectively address the power/energy problem
in the front-end of the pipeline. Since no instruction is entering or leaving the issue
queue in this mode, the power consumption in the issue queue is also reduced due to the
reduced activities.
As embedded microprocessor designs move to superscalar architectures for high
performance, such as the SandCraft MIPS64 embedded processor [66], this work
targets an out-of-order multi-issue superscalar processor rather than the simple
in-order single-issue processors that have been the focus of previous research on
loop caches. Different from previous research [31][47][5][46][71], the scheme proposed
here eliminates the need for an additional instruction buffer for loop caching and
utilizes the existing issue queue resources. It automatically unrolls loops in the
issue queue to reduce inter-iteration dependences, instead of buffering only one
iteration of the loop in a small DIB or loop cache. Further, there is no need for ISA
modification as in [31]. Note that the concept and purpose of instruction reuse in
this thesis also differ from those in [67]: the proposed scheme speculatively reuses
the decoded instructions buffered in the issue queue to avoid instruction streaming
from the instruction cache, rather than speculatively reusing the result of a previous
instance of an instruction for performance as in [67].
Results using array-intensive codes show that the pipeline front-end can be gated
for up to 82% of the total execution cycles, providing an energy reduction of 70% in
the instruction cache, 30% in the branch predictor, and 17.5% in the issue queue, at
a small performance cost. Further, the impact of compiler optimizations on this new
issue queue is investigated. The results indicate that optimized code can further
improve the gated rate (the percentage of gated cycles in the total execution cycles)
of the pipeline front-end, and thus the overall power savings.
The detailed issue queue design is presented in Section 6.2. Section 6.3 studies the
dynamic instruction distribution of a set of array-intensive codes. Section 6.4
describes the experimental framework and provides the evaluation results. A study of
the impact of compiler optimizations on the proposed scheme is conducted in
Section 6.5. Finally, Section 6.6 summarizes this chapter.
6.2 Modified Issue Queue Design
In this section, the detailed design of the proposed issue queue is elaborated. The
design is based on a superscalar architecture with a separate issue queue and reorder
buffer (ROB); the datapath model is similar to that of the MIPS R10000 [75], except
that it uses a unified issue queue instead of separate integer and floating-point
queues. The baseline datapath pipeline is given in Figure 6.1.
[Figure: (a) datapath with instruction cache, decoder, register rename/map, issue
queue (with LRL, loop-detection/reuse control, and gate signals), reorder buffer
(ROB), load/store queue, register file, integer and floating-point function units,
address calculation, and data cache; (b) pipeline stages Fetch, Decode, Rename,
Issue, Reg Read, Execute, DcacheAcc, WriteBack, Commit.]
Fig. 6.1. (a) The datapath diagram and (b) pipeline stages of the modeled baseline
superscalar microprocessor. Parts in dotted lines are augmented for the new design.
The fetch unit fetches instructions from the instruction cache and performs branch
prediction and next PC generation. Fetched instructions are then sent to the decoder
for decoding. Decoded instructions are register-renamed and dispatched into the issue
queue. At the same time, each instruction is allocated an entry in the ROB in program
order. Instructions with all source operands ready are woken up, selected for issue
to appropriate available function units for execution, and removed from the issue queue.
The status of the corresponding ROB entry will be updated as the instruction proceeds.
The results coming either from the function units or the data cache are written back to
the register file. Instructions in ROB are committed in order.
Reusable instructions are mainly those belonging to loop structures that are
repeatedly executed, and the proposed issue queue is designed to reuse the
instructions in these loop structures. The new design consists of four parts: a loop
structure detector, a mechanism to buffer the reusable instructions within the issue
queue, a scheduling mechanism to reuse the buffered instructions in program order,
and a recovery scheme from the reuse state back to the normal state. The dotted
parts in Figure 6.1 show the augmented logic for this new design.
6.2.1 Detecting Reusable Loop Structures
To enable loop detection, additional logic is added to check for conditional branch
instructions and direct jump instructions that may form the last instruction of a loop
iteration. The loop detector performs two checks for these instructions: (1) whether it
is a backward branch/jump; (2) whether the static distance from the current instruction
to the target instruction is no larger than the issue queue size.
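As a rough sketch, the two checks above can be expressed as follows. This is a minimal Python model; the queue size, the 4-byte instruction size, and the function name are illustrative assumptions, not from the thesis.

```python
# Decode-stage loop detector sketch. ISSUE_QUEUE_SIZE and the fixed
# 4-byte instruction size are illustrative assumptions.

ISSUE_QUEUE_SIZE = 64  # entries (assumed baseline)

def is_capturable_loop(branch_pc: int, target_pc: int, inst_size: int = 4) -> bool:
    """Return True if a branch/jump at branch_pc targeting target_pc
    closes a loop small enough to fit in the issue queue."""
    # Check 1: the transfer must be backward.
    if target_pc >= branch_pc:
        return False
    # Check 2: the static distance (in instructions, inclusive of the
    # branch itself) must not exceed the issue queue size.
    distance = (branch_pc - target_pc) // inst_size + 1
    return distance <= ISSUE_QUEUE_SIZE
```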
Loop detection can be performed either at the decode stage or at stages after the
execute stage. If detection takes place post-execution, the detector can be certain
whether an instruction closes a loop by comparing the computed target address with
the current instruction address. However, this has several drawbacks. First, the
detection may come too late for small tight loops. Second, deciding when to start
buffering the detected loop can be complex. Third, the ROB has to keep the address
information for each instruction in flight in order to perform the detection. On the
other hand, performing loop detection at the decode stage using the predicted target
address has several advantages. First, loop buffering can start immediately after a
loop is detected. Second, since the instruction fetch buffer is very small (e.g., 4 or 8
entries), adding address information incurs little hardware overhead. Further, the
target address of a direct jump is available at the decode stage and can be used
directly for this purpose. With these tradeoffs in mind, loop detection is performed
at the decode stage in this design rather than at a later stage.
6.2.2 Buffering Reusable Instructions
After a loop is detected and determined to be capturable (loop size less than or
equal to the issue queue size) by the issue queue, two dedicated registers Rloophead and
Rlooptail are used to record the addresses of the starting and ending instructions of the
loop iteration. A two-bit register Riqstate is utilized to indicate the current state of the
issue queue (00-Normal, 01-Loop Buffering, 11-Code Reuse, 10-not used). A complete
state transition diagram of the issue queue is given in Figure 6.2. The issue queue state
is then changed from Normal to Loop Buffering state. In the following cycle, the issue
[Figure: state machine with states Normal, Loop_Buffering, and Code_Reuse;
transitions: capturable loop detected and buffering started (Normal to
Loop_Buffering), buffering finished (Loop_Buffering to Code_Reuse), buffering
revoked or misprediction recovery (back to Normal).]
Fig. 6.2. State machine for the issue queue.
queue starts to buffer instructions as the second iteration begins. The new issue queue
is augmented as illustrated in Figure 6.3.
Specifically, each entry is augmented with a classification bit indicating whether
the instruction belongs to a loop being buffered, and an issue state bit indicating
whether a buffered instruction has been issued. The logical register numbers of each
buffered instruction are stored in the logical register list (LRL). For an issue queue
of 64 entries, the additional hardware cost of these augmented components is about
136 bytes of storage (= (1 bit + 1 bit + 15 bits for three logical register numbers)
* 64 / 8).
After the issue queue enters the Loop Buffering state, buffering a reusable
instruction requires several operations as the instruction is renamed and queued into
the issue queue: the classification bit is set, the issue state bit is reset to zero, and
the logical register numbers of all the operands are recorded in the logical register
list. With the classification bit set, the instruction will not be removed from the
issue queue even after it has been issued. Note that a collapsing design is used for
the issue queue.
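The state transitions of Figure 6.2 can be modeled as a small table-driven sketch. The bit encodings follow the text (00-Normal, 01-Loop Buffering, 11-Code Reuse); the event names and the dictionary representation are illustrative assumptions.

```python
# Table-driven sketch of the Riqstate state machine. Encodings follow
# the text; event names are illustrative.

NORMAL, LOOP_BUFFERING, CODE_REUSE = 0b00, 0b01, 0b11

TRANSITIONS = {
    (NORMAL, "capturable_loop_detected"): LOOP_BUFFERING,
    (LOOP_BUFFERING, "buffering_finished"): CODE_REUSE,
    (LOOP_BUFFERING, "buffering_revoked"): NORMAL,  # e.g., loop exit
    (LOOP_BUFFERING, "misprediction"): NORMAL,      # recovery
    (CODE_REUSE, "misprediction"): NORMAL,          # recovery
}

def next_state(state: int, event: str) -> int:
    # Unlisted (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```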
The following two subsections address two important issues concerning the buffering:
when to terminate instruction buffering and how to handle procedure calls within a loop.
6.2.2.1 Buffering Strategy
There are at least two strategies for deciding when to stop buffering and promote
to the Code Reuse state. The first is to buffer only one iteration of the loop. This
scheme is simple to implement and enables more instructions to be reused from the
issue queue, because it stops instruction fetch from the instruction cache and enters
the Code Reuse state much earlier (at the beginning of the third iteration). In
contrast, the second strategy tries to buffer multiple iterations of the loop according
to the free entries available in the issue queue. The buffering logic uses an additional
counter to record the size of the current iteration and to predict the size of the next.
After buffering one iteration, a decision is made whether the remaining issue queue
can hold another iteration by comparing the counter value with the number of free
entries. If so, the buffering continues; otherwise, the state of the issue queue is
switched from Loop Buffering to Code Reuse, and the front end of the pipeline is
then gated. This strategy automatically unrolls the loop to exploit more
instruction-level parallelism, which is essentially how the original issue queue
operates, and it uses the issue queue resources more effectively than the first
strategy, especially for small loops. Although the second strategy does not gate the
pipeline front-end as early as the first, it is chosen in this work for the sake of
performance. If the execution exits the loop (checked against Rloophead and
Rlooptail) during the buffering state, the buffering is revoked and the issue queue
switches back to the Normal state.
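Under the second strategy, the number of iterations that end up buffered follows from repeatedly applying the free-entry check. A minimal sketch, assuming whole iterations of a fixed predicted size (the function name and parameters are illustrative):

```python
# Automatic unrolling under the second strategy: keep buffering while
# another whole iteration fits in the remaining free entries.

def buffered_iterations(iteration_size: int, free_entries: int) -> int:
    """Number of whole iterations buffered before the issue queue is
    promoted from Loop Buffering to Code Reuse."""
    if iteration_size <= 0:
        return 0
    count = 0
    # The iteration-size counter is compared with the free entries
    # after each completed iteration.
    while iteration_size <= free_entries:
        free_entries -= iteration_size
        count += 1
    return count
```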
6.2.2.2 Handling Procedure Calls
Note that the loop detector knows neither the existence nor the sizes of procedure
calls within a detected loop, because detection uses only one iteration and happens at
the end of the first iteration. If the procedure is small, the issue queue should be
managed to capture both the loop and the procedure; otherwise, it may not be possible
to buffer the loop. The strategy for dealing with procedure calls works as follows.
During the Loop Buffering state, if a procedure call instruction is decoded, buffering
simply continues. If the issue queue fills up before the loop-ending instruction is
encountered, meaning the procedure is too large to be captured by the issue queue,
the buffering is revoked and the issue queue returns to the Normal state. Otherwise,
the counter value (the size of the current iteration including procedure calls) is
compared with the number of free entries in the issue queue to decide whether to
promote to the Code Reuse state or to continue buffering.
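The combined decision logic during buffering, including the procedure-call case, might be sketched as follows. This is a simplified model; the action names and parameters are illustrative assumptions.

```python
# Combined buffering decision for the Loop Buffering state, including
# procedure-call handling. Parameter and action names are illustrative.

def buffering_step(occupied: int, capacity: int,
                   at_loop_end: bool, iteration_size: int) -> str:
    """Return the next action: 'revoke', 'reuse' (promote to Code
    Reuse), or 'buffer' (keep buffering)."""
    if occupied > capacity:
        # A procedure within the loop used up the queue: not capturable.
        return "revoke"
    if at_loop_end:
        # At the iteration boundary, continue only if another iteration
        # (including any procedure-call instructions) still fits.
        free = capacity - occupied
        return "buffer" if iteration_size <= free else "reuse"
    return "buffer"
```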
6.2.3 Optimizing Loop Buffering Strategy
Since the innermost loop dominates the execution of a loop nest, buffering outer
loops is not worthwhile and should be avoided. Loops containing large procedure
calls may also be non-bufferable, and loop detection has no advance information
about this. Any buffering started for such loops will soon be revoked once the
revoking condition is met. This will incur
[Figure: issue queue entries indexed 0 through 63, each augmented with a
classification bit (1 bit), an issue state bit (1 bit), and a 15-bit logical register
list (rs, rt, rd); a reuse pointer scans the buffered entries between Rloophead and
Rlooptail in one direction.]
Fig. 6.3. The new issue queue with augmented components supporting instruction reuse.
state thrashing between Loop Buffering and Normal. Thus, an optimization scheme for
loop detection and buffering is proposed in this section.
To avoid this state thrashing, a small non-bufferable loop table (NBLT) holding
the most recently identified non-bufferable loops (e.g., 8 loops) is introduced. The
NBLT is implemented as a CAM and maintained as a FIFO queue. Each entry holds a
valid bit and the address of the loop-ending instruction. If a detected loop appears
in the NBLT, it is identified as non-bufferable and no buffering is attempted for it;
otherwise, the issue queue switches to the Loop Buffering state. During the
Loop Buffering state, if an inner loop is detected, or the execution exits the current
loop, or a procedure call within the loop causes the issue queue to become full before
the loop end is met, the current loop is identified as non-bufferable and registered in
the NBLT. Figure 6.4 shows an example of a non-bufferable loop. With this
optimization, the issue queue can avoid most attempts to buffer non-bufferable loops.
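A software model of the NBLT might look like the following sketch. The deque-based FIFO and the method names are illustrative assumptions; the hardware described above would use a CAM for the lookup.

```python
# NBLT sketch: an 8-entry FIFO keyed by the loop-ending instruction's
# address. Illustrative model only; hardware uses a CAM lookup.

from collections import deque

class NBLT:
    def __init__(self, entries: int = 8):
        self.fifo = deque(maxlen=entries)  # oldest entry evicted first

    def is_non_bufferable(self, loop_end_pc: int) -> bool:
        # CAM lookup in hardware; a linear membership test here.
        return loop_end_pc in self.fifo

    def register(self, loop_end_pc: int) -> None:
        # Called when buffering is revoked: inner loop detected, early
        # loop exit, or a procedure call filled the issue queue.
        if loop_end_pc not in self.fifo:
            self.fifo.append(loop_end_pc)
```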
[Figure: a MIPS assembly fragment with an innermost (bufferable) loop nested inside
an outer (non-bufferable) loop, using instructions such as slti, addiu, addu, subu,
sw, beq, and bne.]
Fig. 6.4. An example of a non-bufferable loop that is an outer loop in this code piece.
6.2.4 Reusing Instructions in the Issue Queue
After the reusable instructions of a loop have been successfully buffered, the state
of the issue queue is switched to Code Reuse. A gating signal is then sent to the fetch
unit and the instruction decoder. In the following cycles, the issue queue starts to supply
instructions itself by reusing the buffered instructions already in the issue queue. Thus,
the instruction streaming from the instruction cache is no longer needed and the pipeline
front-end is then completely gated.
During instruction scheduling, the classification bit of a ready-to-issue instruction
is checked at issue time. If the bit is not set (i.e., the instruction is not reusable),
the instruction is removed from the issue queue after being issued. Otherwise, the
instruction keeps its entry in the issue queue after issue, and its issue state bit is set
to indicate that this buffered instruction has been issued. The issue queue collapses
each cycle if any hole is created by the removal of an issued instruction.
The issue queue uses a reuse pointer to scan the buffered instructions
unidirectionally for instructions to be reused in the next cycle. The pointer is
initialized to point to the first buffered instruction. In each cycle, the issue state
bits of the first n (equal to the issue width) instructions starting from the entry
indicated by the reuse pointer are checked. If the first m (m ≤ n) bits are set,
meaning these m instructions have been issued and can be reused, the logical register
numbers of these instructions are fetched from the logical register list and sent to
the renaming logic. The reuse pointer then advances by m and scans instructions for
the next cycle. Renamed instructions update their corresponding entries in the issue
queue; only the register information and the ROB pointer of each instruction are
updated in this case. Register renaming is needed in both this scheme and
conventional issue queues, and hence is not an overhead. After the last buffered
instruction is reused, the reuse pointer is automatically reset to the position of the
first buffered instruction. This process repeats until a branch misprediction is
detected, due either to the execution exiting the loop or to the execution taking a
different path within the loop. The state of the issue queue is then switched back to
the Normal state.
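The per-cycle scan can be sketched as follows. This is a simplified, illustrative model: the wrap-around of the pointer stands in for the reset to the first buffered instruction, and the function name and signature are assumptions.

```python
# Per-cycle reuse scan sketch: count the leading run of set issue-state
# bits among the next issue_width buffered entries.

def scan_reusable(issue_state_bits, reuse_ptr: int, issue_width: int):
    """Return (m, new_reuse_ptr): m instructions are re-dispatched to
    the renaming logic this cycle, and the pointer advances by m."""
    total = len(issue_state_bits)
    m = 0
    # Stop at the first not-yet-issued instruction or at issue_width.
    while m < issue_width and issue_state_bits[(reuse_ptr + m) % total]:
        m += 1
    return m, (reuse_ptr + m) % total
```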
Note that dynamic branch prediction is avoided during the Code Reuse state. Branch
instructions are statically predicted using the previous dynamic prediction outcome
from the Loop Buffering state. This static prediction scheme works very well for
loops, since branches within loops are normally highly biased in one direction. The
static prediction is still verified after the branch instruction completes execution,
and the issue queue exits the Code Reuse state if the prediction is found to be
incorrect during this verification.
6.2.5 Restoring Normal State
When an ongoing buffering is revoked, any instruction that is buffered
(classification bit = 1) and already issued (issue state bit = 1) is immediately
removed from the issue queue. All classification bits are then cleared, and the issue
queue state is switched back to Normal. If a misprediction is detected at the
writeback stage while the issue queue is in the Loop Buffering state, a conventional
recovery is carried out, removing instructions newer than the branch from the issue
queue and the ROB and restoring the registers, followed by revoking the current
buffering state. If a misprediction is detected in the Code Reuse state, it may be due
to an earlier branch outside the current loop, a branch within the loop taking a
different path, or the execution exiting the current loop. In this case, a conventional
branch misprediction recovery is initiated, followed by the revoking process. The
gating signal is also reset when restoring the Normal state. It should be noted that
the new issue queue has no impact on exception handling.
6.3 Distribution of Dynamic Loop Code
Although the phase pattern of the execution footprint was extracted and analyzed in
Chapter 4, leading to the proposed new instruction supply mechanism, further
information about each phase is needed to guide the actual implementation of this
new issue queue, such as its required size. This section studies the dynamic
instruction distribution with respect to the size of the loop code in which an
instruction resides.
Three types of loop structure are profiled for this study: loops (without any
constraint), innermost loops, and innermost loops without any procedure calls. As
discussed in the previous sections, outer loops are not considered bufferable, while
innermost loops without procedure calls are the most likely to be bufferable. If
instructions from the last type of loop dominate the dynamic instruction stream, the
new issue queue can maximize the opportunity for loop buffering and instruction
reuse. Figure 6.5 gives the dynamic instruction distribution for a set of array-based
embedded applications. The figure shows that the majority of dynamic instructions,
more than 90%, are from loop code. Loop code sizes vary from fewer than 16
instructions to between 128 and 256 instructions, which requires issue queues of
different sizes if the loop code is to be captured. The figure also clearly shows that
dynamic instructions from innermost loops without procedure calls dominate
execution, while instructions located between the innermost and outer loops account
for only a negligible portion. This confirms that the proposed issue queue will be
very effective in capturing a significant number of reusable instructions.
6.4 Experiments
The proposed issue queue was modeled upon SimpleScalar 3.0 [12] and the power
model is derived from Wattch [11]. The baseline configuration for the simulated processor
is given in Table 3.1. A set of array-intensive applications listed in Table 3.2 are used to
evaluate the new issue queue.
[Figure: eight panels, (a) adi, (b) aps, (c) btrix, (d) eflux, (e) tomcat, (f) tsf,
(g) vpenta, (h) wss; each plots the percent of dynamic instructions (0-100%) against
loop code size (16 to 2048 instructions) for three series: Loop, In-Loop, and
In-Loop w/o Call.]
Fig. 6.5. Dynamic instruction distribution w.r.t. loop sizes.
[Figure: bar chart over benchmarks adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss,
and their average, showing the pipeline front-end gated rate (in cycles) for issue
queue sizes IQ-32, IQ-64, IQ-128, and IQ-256.]
Fig. 6.6. Percentages of the total execution cycles that the pipeline front-end has
been gated with different issue queue sizes: 32, 64, 128, and 256 entries.
Two factors affect the effectiveness of the proposed issue queue design: the loop
structure and the issue queue size. A large loop structure cannot be completely
buffered in a small issue queue. This section evaluates the impact of the issue queue
size by varying it from 32 to 256 entries, as suggested by the analysis presented in
the previous section. In these experiments, the ROB size is set equal to the issue
queue size, and the load/store queue size is half that of the issue queue. An
eight-entry NBLT is used to optimize the loop detection, which helps reduce the
buffering revoke rate from around 40% to below 1%.
Once the issue queue enters the Code Reuse state, the pipeline front-end is gated.
Figure 6.6 shows the percentage of the total execution cycles for which the front-end
of the pipeline is gated due to instruction reuse, for issue queues of different sizes.
Benchmarks aps, tsf, and wss achieve very high gated percentages even with small
issue queues because of their small loop structures, while others, such as adi, btrix,
eflux, tomcat, and vpenta, work well only with large issue queues. An interesting
observation from this figure is that increasing the issue queue size does not always
improve the ability to perform pipeline gating (e.g., see tsf and wss); the main
reason is that a larger issue queue unrolls and buffers more iterations of the loop,
delaying instruction reuse and pipeline gating. On average, the gated fraction of the
pipeline front-end increases from 42% to 82% as the issue queue size increases.
A gated pipeline front-end leads to activity reduction in the instruction cache,
branch predictor, and instruction decoder. As the issue queue size is increased from
32 to 256 entries, instruction cache accesses are reduced on average by 42% to 82%
(Figure 6.7(a)), branch predictions and updates by 50% to 76% (Figure 6.7(c)), and
instruction decoding by 46% to 84% (Figure 6.7(e)). Figures 6.7(b), (d), and (f)
show the corresponding energy reductions: 35% to 70% in the instruction cache, 19%
to 30% in the branch predictor, and 12% to 17.5% in the issue queue, as the issue
queue size increases from 32 to 256 entries. The energy reduction in the issue queue
comes from the partial update (only register information and the ROB pointer are
updated) during the instruction reuse state, in contrast to removing and inserting
instructions in a conventional issue queue.
The energy reduction of the entire processor for each benchmark at different issue
queue sizes is shown in Figure 6.8. The overall energy saving is up to 20.5%. For
benchmarks adi and btrix, the overall energy increases at some configurations. On
average, the energy reduction improves from 6.7% to 7.8% as the issue queue size
increases. The performance impact of the new issue queue is illustrated in Figure 6.9.
[Figure: six bar charts over the benchmarks and their average, for issue queue sizes
IQ-32, IQ-64, IQ-128, and IQ-256: (a) access reduction in Icache, (b) energy
reduction in Icache, (c) access reduction in bpred, (d) energy reduction in bpred,
(e) reduction in instruction decoding, (f) energy reduction in issue queue.]
Fig. 6.7. Access reduction and energy reduction in the instruction cache, branch
predictor, instruction decoder, and issue queue.
[Figure: bar chart of overall energy savings per benchmark and on average, for
IQ-32, IQ-64, IQ-128, and IQ-256.]
Fig. 6.8. The overall energy reduction compared to a baseline microprocessor using
the conventional issue queue, at different issue queue sizes.
The average performance loss ranges from 0.2% (32-entry issue queue) to 4%
(256-entry issue queue). Note that the performance of the new issue queue is
compared to that of a conventional issue queue with the same number of entries.
The performance degradation is mainly due to the issue queue not being fully
utilized (only an integer number of loop iterations is buffered). In benchmark btrix,
the execution is dominated by a loop of 90 instructions, which results in low
utilization of a 128-entry or 256-entry issue queue in the Code Reuse state and
consequently a noticeable performance loss (around 12%), as seen in Figure 6.9.
From Figures 6.8 and 6.9, the 32-entry issue queue (IQ-32) takes the best advantage
of instruction reuse in terms of performance and overall energy saving.
[Figure: bar chart of performance (IPC) degradation per benchmark and on average,
for IQ-32, IQ-64, IQ-128, and IQ-256.]
Fig. 6.9. Performance impact of reusing instructions at different issue queue sizes.
6.5 Impact of Compiler Optimizations
Some benchmarks, such as adi, btrix, eflux, tomcat, and vpenta, have large loop
structures that can hardly be captured with a small issue queue (e.g., 32 or 64
entries). Compiler optimizations, especially loop transformations, can play an
important role in restructuring these loops. This section specifically focuses on loop
distribution [42] to reduce the size of the loop body.
After applying loop distribution, as shown in Figure 6.10, the new issue queue
starts to schedule reusable instructions for benchmarks adi and btrix, and buffers
more reusable loop code in benchmarks eflux, tomcat, and vpenta. However, loop
distribution has little effect on benchmarks aps, tsf, and wss, since their reuse rates
are already very high. On average, the reduction of instruction cache accesses
improves from 51%
[Figure: bar chart of the reduction in instruction cache accesses per benchmark and
on average, for original and optimized code.]
Fig. 6.10. Impact of compiler optimizations on instruction cache accesses.
to 88% after this compiler optimization, which results in a corresponding additional
energy reduction in the instruction cache.
Figure 6.11 shows the overall energy comparison between the optimized code
(with loop distribution applied) and the non-optimized code, both simulated with the
baseline configuration (64-entry issue queue). The average energy reduction of the entire
processor increases from 6.7% to 11.1% with the optimized code, at the cost of a slightly
larger average performance loss, from 1% to 2.4%, as shown in Figure 6.12. This
improvement in power reduction results from the increased percentage of gated cycles
(from 48% to 86% on average; detailed numbers omitted for brevity) when executing the
optimized code.
[Bar chart: overall energy savings, −2% to 18%, for each benchmark and avg; one bar each for the original and the optimized code.]
Fig. 6.11. Impact of compiler optimizations on overall energy saving.
[Bar chart: performance (IPC) degradation, 0 to 8%, for each benchmark and avg; one bar each for the original and the optimized code.]
Fig. 6.12. Impact of compiler optimizations on performance degradation.
6.6 Discussion and Summary
In many recent embedded microprocessors, code compression is used in the instruction
cache to minimize the required memory size due to cost and space constraints [48].
Such a compression scheme can also be used with this new instruction supply mechanism
without any negative impact, since the instructions under reuse are decoded (and thus
also decompressed) and buffered in the instruction issue queue. Conversely, this
instruction reusing scheme can significantly improve the energy and performance behavior
of code compression, since no (compressed) instruction is fetched from the instruction
cache, and hence no decompression is needed, during the code reusing state.
This redesigned processor datapath also opens opportunities to optimize the
energy consumption in the clock distribution network and in the bus between the
instruction cache and the datapath, since the datapath front-end is gated during
instruction reusing. The instruction cache can also be turned off or transitioned to
drowsy mode for leakage energy reduction while the issue queue is reusing buffered
instructions.
To summarize, this chapter proposes a new issue queue design capable of
buffering dynamically detected reusable instructions and reusing them directly from the
issue queue. The front-end of the pipeline is completely gated when the issue queue
enters the instruction reusing state, invoking no activity in the instruction cache,
branch predictor, or instruction decoder. Consequently, this leads to a significant
energy reduction in these components and a considerable overall energy reduction.
The experimental evaluation also shows that compiler optimizations
(loop transformations) can further gear the code towards a given issue queue size and
improve these energy savings.
Chapter 7
Managing Instruction Cache Leakage
7.1 Introduction
Static energy consumption due to leakage current is an important concern in
future technologies [13]. As the threshold voltage continues to scale and the number
of transistors on the chip continues to increase, managing leakage current will become
more and more important. As on-chip caches are the major portion of the processor’s
transistor budget, they account for a significant share of the leakage power consumption.
In fact, leakage is projected to account for 70% of the cache power budget in 70nm
technology [45].
The leakage current is a function of the supply voltage and the threshold voltage.
It can be controlled by either reducing the supply voltage or by increasing the threshold
voltage. However, this has an impact on the cache access times. Thus, a common
approach is to use these mechanisms dynamically when a cache line is not currently in
use. Existing techniques that control cache leakage utilize three main styles of circuit
primitives for reducing cache leakage energy, namely, Gated-Vdd [60], multiplexed supply
voltage for cache lines [20], and dynamic Vt SRAM [43]. The approach in [29] targets
bitline leakage and hence does not utilize any of these three primitives.
This chapter focuses on reducing the leakage energy in the instruction cache. A
good leakage management scheme needs to balance appropriately the energy penalty of
leakage incurred in keeping a cache line turned on after its current use with the overhead
associated with the transition energy (for turning on a cache line) and performance loss
that will be incurred if and when that cache line is accessed again. In order to strike
this balance, it is important that the management approach tracks both the spatial
and temporal locality of instruction cache accesses. Existing leakage control approaches
track and exploit one or the other of these forms of locality. For example, the drowsy
cache scheme [20] (designed originally for the data caches) periodically transitions all
cache lines to a drowsy mode assuming accesses to cache lines are confined to a specific
time period. Hence, it tends to focus mainly on temporal locality. Due to the use of
fixed periods, it also does not adapt well to changes in temporal locality. This can
be important to capture as straightline code has very little temporal locality, while
instructions in loops have significant temporal locality. Further, this scheme does not
support the sequential nature of instruction accesses well and incurs wakeup latencies
when a new cache line is accessed in sequential code.
The approach proposed for drowsy instruction caches in [45] focuses on spatial
locality. Here, turn-off1 is applied when execution shifts away from a specified spatial region.
Specifically, this scheme turns off a bank of cache lines when execution shifts to a new
bank. This scheme is well suited for capturing the sequential and repetitive behavior of
program execution confined to a small instruction address space. However, this scheme is
agnostic as to the extent of spatial locality in that it turns on only at the fixed granularity
of a bank. If execution is in a small, long-running loop, the extent of spatial locality
is small and most of the cache lines in a bank may never be accessed. A finer granularity
of leakage control (at the cache line instead of the bank level) would provide more
adaptability to the different extents of spatial locality. Further, this scheme frequently
turns banks on and off when the instructions accessed in a given phase are not tightly
clustered in one portion of the address space. Previous research shows that program
hotspots can be scattered all over the address space [54]. A common example would be
a method call made within a loop, with the method being located in a separate bank.

1Turn-off is used here to refer to a transition to the drowsy state, and turn-on to refer to
waking up to the normal (active) state.
The leakage management scheme proposed in this chapter focuses on being able to
exploit both forms of locality and exploits two main characteristics of instruction access
patterns: program execution is mainly confined in program hotspots and instructions
exhibit a sequential access pattern. It is observed that a significant part of the execution
is spent in specific program hotspots (identification of hotspots is explained in Section
7.3). This percentage is found to be 82% on the average for the SPEC2000 benchmark
suite. In order to exploit this behavior, this work proposes a HotSpot based Leakage
Management (HSLM) approach that is used in two different ways. First, it is used for
detecting and protecting cache lines containing program hotspots from inadvertent turn-
off. HSLM is particularly useful in reducing performance and energy penalties associated
with unnecessarily turning off actively used cache lines. It can provide some adaptability
to simple periodic or spatial schemes that are program behavior agnostic. Second, HSLM
can be used to detect a shift in program hotspot and to turn off cache lines closer to their
last use instead of waiting for a period to expire. This scheme is specifically oriented
to detect new loop-based hotspots. Next, a Just-in-Time Activation (JITA) scheme is
presented that exploits sequential access pattern for instruction caches by predictively
activating the next cache line when the current cache line is accessed.
The experiments show that the combination of HSLM and JITA strategies can
make periodic schemes quite effective in terms of both performance and energy reduc-
tion. This work further proposes a scheme that combines both periodic and spatial
based turn-off (to capture both temporal and spatial locality) in an application sensitive
fashion by using the hotspot information. This scheme when combined with the JITA
scheme is shown to provide the best energy savings as compared to existing approaches.
Specifically, the evaluation of this scheme using SPEC2000 benchmarks shows that it
provides 22% and 49% more leakage energy savings in the instruction cache (while con-
sidering overheads incurred in the rest of the processor as well) as compared to pure
periodic and spatial schemes. Further, it also provides 29% more leakage energy savings
in the instruction cache as compared to a recently proposed instruction cache leakage
scheme based on compiler analysis [78].
Section 7.2 provides a more detailed view of factors influencing leakage reduction
and how the new approach proposed in this work relates to existing schemes. Section 7.3
details the implementation of the HSLM and JITA strategies. Section 7.4 explores
different leakage management approaches that combine HSLM and JITA. An evaluation
of the different schemes is performed in Section 7.5. Finally, Section 7.6 summarizes this
chapter.
7.2 Existing Approaches: Where Do They Stumble?
Previous approaches that target reducing cache leakage energy consumption can
be broadly categorized into three groups: (i) those that base their leakage management
decisions on some form of performance feedback (e.g., cache miss rate) [59], (ii) those that
manage cache leakage in an application insensitive manner (e.g., periodically turning off
cache lines) [20, 41, 45], and (iii) those that use feedback from the program behavior [41,
80, 78].
The approach in category (i) is inherently coarse-grained in managing leakage, as
it turns off large portions of the cache based on a performance feedback that does
not specifically capture cache line usage patterns. For example, this approach may
indicate that 25% of the cache can be turned off because of a very good hit rate, but it
provides no guidance on which 75% of the cache lines are going to be used in
the near future.
The major drawback of the approaches in category (ii) is that they turn off cache
lines independent of the instruction access pattern. An example of such a scheme is the
periodic cache line turn-off proposed in [20]. The success of this strategy depends on how
well the selected period reflects the rate at which the instruction working set changes.
Specifically, the optimum period may change not only across applications but also within
the different phases of the application itself. In such cases, one can either keep cache lines
in the active state longer than necessary, or turn off cache lines that hold the current
instruction working set, thereby impacting performance and wasting energy. Note that
trying to address the first problem by decreasing the period will exacerbate the second
problem. On the plus side, this approach is simple and has very little implementation
overhead.
[Diagram: a simple loop with two portions, (A) and (B), whose code maps across two cache banks, Bank I and Bank II.]

Fig. 7.1. (a) A simple loop with two portions; (b) bank mapping for the loop given in (a).
Another example of a fixed scheme in category (ii) is the technique proposed in
[45]. This technique adopts a bank based strategy, where when execution moves from
one bank to another, the hardware turns off the former and turns on the latter. To
illustrate some of the potential drawbacks of this bank-based strategy, a simple loop
structure is shown in Figure 7.1(a). Let us assume that this loop structure is mapped
on to a two-bank cache architecture as shown in Figure 7.1(b). The first problem is
that while the execution is in part (A) of the loop, the entire bank I is kept in the
active state. Consequently, all cache lines in this bank, save for the ones that hold part
(A) of the loop, waste leakage energy. While this energy wastage can be reduced with very
small bank sizes, increasing the number of banks beyond a point incurs latency penalties
(due to decoding overheads); 4KB is a typical bank size. The second problem
with this approach becomes clear when one considers the execution of the entire loop in
Figure 7.1(a). Assuming that this loop contains no other loop, one can expect frequent
transitions from part (A) to part (B) and vice versa. Note that this leads to frequent
bank turn-offs/ons, thereby increasing the energy overhead of the execution. By using
compiler directives it might be possible to align some loops to bank boundaries
(this also assumes that the compiler knows the bank structure). However, in a typical large
application, it is likely that there are some loops that are divided across bank boundaries.
Note that reducing the bank size (to eliminate the first problem) aggravates the second
problem. Note that while this simple example uses a loop to illustrate the idea,
frequent bank transitions can also occur due to procedure calls (which might be quite
numerous in applications written in languages such as Java). A typical scenario would
be a small procedure that is located in one bank but is frequently invoked by procedures
residing in different banks.
Another technique in category (ii) is the cache decay-based approach (its adaptive
variant falls in category (iii)) proposed by Kaxiras et al. [41]. In this technique, a small
counter is attached to each cache line which tracks its access frequency. If a cache line
is not accessed for a certain number of cycles, it is placed into the leakage saving mode.
While this technique tries to capture the usage frequency of cache lines, it does not
directly predict the cache line access pattern. Consequently, a cache line whose counter
saturates is turned off even if it is going to be accessed in the next cycle. Since it is also
a periodic approach, choosing a suitable decay interval is crucial if it is to be successful.
In fact, the problems associated with selecting a good decay interval are similar to those
associated with selecting a suitable turn-off period in [20]. Consequently, this scheme
can also keep a cache line in the active state until the next decay interval arrives even if
the cache line is not going to be used in the near future. Finally, since each cache line is
tracked individually, this scheme has more overhead.
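The decay mechanism can be sketched behaviorally as follows; the threshold value and the names are illustrative, not taken from [41]:

```python
DECAY_THRESHOLD = 4  # ticks without access before turn-off (illustrative value)

class DecayLine:
    def __init__(self):
        self.counter = 0
        self.active = True

    def access(self):
        # Any access resets the decay counter and keeps the line awake.
        self.counter = 0
        self.active = True

    def tick(self):
        # Each decay interval, the per-line counter is incremented; once
        # it saturates, the line is placed in the leakage-saving mode --
        # even if it happens to be accessed in the very next cycle.
        if self.active:
            self.counter += 1
            if self.counter >= DECAY_THRESHOLD:
                self.active = False
```

The per-line counter is exactly what gives this scheme its adaptability, and also its extra overhead relative to a single global period.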
The approaches in category (iii) attempt to manage cache lines in an application-
sensitive manner. The adaptive version of the cache-decay scheme [41] tailors the decay
interval for the cache lines based on their access patterns. It starts out with the
smallest decay interval for each cache line to aggressively turn off cache lines, and
increases the decay interval when it learns that cache lines were turned off prematurely.
The scheme learns about premature turn-offs by leaving the tags on at all times. The
approach in [80] also uses tag information to adapt leakage management.
In [78], an optimizing compiler is used to analyze the program to insert explicit
cache line turn-off instructions. This scheme demands sophisticated program analysis
and modification support, and needs modifications in the ISA to implement cache line
turn-on/off instructions. In addition, this approach is only applicable when the source
code of the application being optimized is available. In [78], instructions are inserted
only at the end of loop constructs and, hence, this technique does not work well if a lot
of time is spent within the same loop. In these cases, periodic schemes may be able to
transition portions of the loop that are already executed into a drowsy mode. Further,
when only select portions of a loop are used, the entire loop is kept in an active state.
Finally, inserting the turn-off instructions after a fast executing loop placed inside an
outer loop can cause performance and energy problems due to premature turn-offs.
Another important limitation of existing leakage control schemes is that most of
the techniques only focus on a turn-off mechanism and activate turned-off cache lines
(or banks) only when accessed. Due to the sequential nature of instruction cache access
patterns, this is a significant shortcoming of the existing techniques. A notable exception
to this is the predictive bank turn-on scheme employed in [45]. Also, almost all exist-
ing schemes focus either on temporal locality (using counters) or spatial locality (using
address space).
7.3 Using Hotspots and Sequentiality in Managing Leakage
Having analyzed the shortcomings of directly applying existing approaches to
instruction cache leakage management, the goal of this work is to support a turn-off
scheme that is sensitive to program behavior changes and that captures both temporal
and spatial locality changes. Further, a predictive turn-on mechanism is needed to
support the sequentiality of instruction cache accesses. However, the granularity of
predictive turn-on should be kept as small as possible so that the cache lines are turned
on if and only if they are needed.
In this work, two mechanisms are proposed to support leakage management of
instruction caches. First, it proposes a HotSpot based Leakage Management scheme
(HSLM) that tracks program behavior. Second, it proposes a Just-in-Time Activation
(JITA) scheme for the next cache line, exploiting the sequentiality of code accesses.
7.3.1 HSLM: HotSpot Based Leakage Management
Previous research shows that a program execution typically occurs in phases.
Each phase can be identified by a set of instructions that exhibit high temporal locality
during the course of execution of the phase. Two important observations made by
previous research are that phases can share instructions and that the instructions in
a given phase need not be tightly clustered together in one portion of the address
space. In fact, they can be scattered all over the address space, as noted in [54].
Typically, when execution enters a new phase, it spends a certain number of cycles in
it. When this number is high, one can refer to that phase as a hotspot. Since branch
behavior is an important factor in shaping the instruction access behavior, the hotspot
is tracked using a branch predictor in this work. While branch predictors have been
used for optimizing programs in the past (e.g., see [54]), to our knowledge, this is
the first study that employs branch predictors for reducing leakage energy consumption.
Detecting program hotspots brings two advantages. First, it identifies which
cache lines are going to be the most active ones and prevents them from being
turned off. Second, cache lines can be turned off if they hold instructions that do not
belong to a newly detected hotspot.
7.3.1.1 Protecting Program Hotspots
The proposed leakage management approach builds on the drowsy cache tech-
nique [20] that periodically transitions all cache lines to drowsy mode by issuing a global
turn-off signal which sets register Q of leakage control circuitry in Figure 7.2. A global
(modulo-N) counter is used to control the periodic turn-off. In order to protect the cache
lines containing the program hotspots from inadvertent turn-off, each drowsy cache cir-
cuit is augmented with a local voltage control mask bit (VCM). If this mask bit is set,
the corresponding cache line will mask the influence of the global turn-off signal and
prevent turn-off. In order to identify execution within hotspots, the information from
the branch target buffer (BTB) is augmented and utilized as explained in detail in the
next paragraph. Once the program is identified to be within a program hotspot (or not),
the global mask bit (GM) (Figure 7.3) is set (reset). When this global mask bit is set,
the voltage control mask bit of all cache lines accessed is set to one to indicate that these
cache lines form the program hotspot. In a set-associative cache, the voltage control
mask bit is set based on the tag match results of the cache access and is performed only
for the way that actually services the request. The voltage control mask bits are reset
on cache line replacements.
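The interaction between the global turn-off signal and the per-line voltage control mask bits can be sketched behaviorally (invented names; this models the intent, not the circuit):

```python
ACTIVE, DROWSY = "active", "drowsy"

class CacheLine:
    def __init__(self):
        self.state = DROWSY   # every line starts drowsy before its first use
        self.vcm = False      # local voltage control mask (VCM) bit

def access(line, global_mask_bit):
    # An access wakes the line; while the global mask bit (GM) is set,
    # i.e. execution is inside a hotspot, the accessed line's VCM bit is
    # set to mark it as part of the hotspot.
    line.state = ACTIVE
    if global_mask_bit:
        line.vcm = True

def global_turn_off(lines):
    # Periodic global turn-off: a line whose VCM bit is set masks the
    # signal and stays active; all other lines transition to drowsy mode.
    for line in lines:
        if not line.vcm:
            line.state = DROWSY

def on_replacement(line):
    # VCM bits are reset when a cache line is replaced.
    line.vcm = False
```

Protected hotspot lines thus survive the global signal, while everything else is turned off in one step.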
[Circuit diagram: the row decoder drives active-mode and drowsy-mode word lines through a word line gate; a latch Q (set by the global turn-off signal gated by the VCM bit, reset on access or preactivation) switches the cache line's power line between 1V (active) and 0.3V (drowsy).]
Fig. 7.2. Leakage control circuitry supporting Just-in-Time Activation (JITA).
The hotspot detection mechanism tracks the branch behavior information using
the BTB. The BTB entries are augmented to collect the execution frequencies of basic
blocks. Compared to the conventional BTB entry, the augmented structure includes
three additional fields: the valid bit (vbit) for target address, an access counter for
the target basic block (tgt cnt), and an access counter for the fall-through basic block
(fth cnt). This new structure is shown in Figure 7.3. The valid bit indicates whether
the current value of target address is valid or not. The valid bit is needed as a new
entry can be added to the augmented BTB by both taken and non-taken branches. If
the new entry is introduced when the branch is taken (not taken), the valid bit is set
to one (zero). The access counter for the target (fall-through) basic block records how
many times the branch is predicted as taken (not-taken). These counters are accessed
and updated during each branch prediction according to the outcome of the prediction.
The value of the target/fall-through counter shows the frequency of the target/fall-
through basic block fetched within a given sampling window and is compared with a
predefined threshold Tacc to determine the hotness of the corresponding basic block.
Each counter in the BTB has log(Tacc) + 1 bits. The counters are initially set to zero
when a new BTB entry is created. During a branch prediction, if the BTB hits, the
corresponding counter is read out according to the outcome of the prediction and then
incremented. Next, the most significant bit of the corresponding counter is checked to
determine the hotness of the basic block starting at the target/fall-through address. If
this bit is set, it means that the next (target or fall-through) basic block has exceeded
the threshold of Tacc accesses and subsequent fetches are part of a program hotspot.
The global mask bit is set to capture this detection of a program hotspot, and it sets the
voltage control mask bit of all accessed cache lines as long as the global masking signal
is set. In a set-associative cache, the mask bit is set based on the tag match results of
the cache access, and only for the way that actually services the request. The mask bits
are reset on cache line replacement. The global mask bit is reset when the most
significant bit of the access counter for a subsequent BTB lookup is not set, or when a
BTB miss occurs.

[Diagram: augmented BTB entries (vbit, target_addr, tgt_cnt, fth_cnt) indexed by the PC; the BTB hit, branch taken, and way select signals drive the global mask bit (GM) and the per-line VCM bits of the instruction cache leakage control circuitry.]

Fig. 7.3. Microarchitecture for the HotSpot based Leakage Management (HSLM) scheme. Note that the outputs of the AND gates go to the set inputs of the mask latches.
When a sampling window expires (determined by zeroing of the global counter),
several initialization operations take place. First, a global turn-off signal is issued to
turn off all cache lines except those with their voltage control mask bit set (mask bits of
cache lines in hotspots are set to disable voltage scaling). Second, a global reset signal
resets all voltage control mask bits. This is performed to track variations in program
hotness over time. Third, all the access counters in the BTB are shifted right by one bit
to halve their access counts. This reduces the weight given to accesses from earlier
periods when determining hotness. Subsequently, a new sampling
window begins and the operations repeat.
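The BTB counter bookkeeping described above can be modeled as follows (a behavioral sketch; Tacc and the field names follow the text, but the class itself is illustrative):

```python
TACC = 8                       # hotness threshold (a power of two, illustrative)
CNT_BITS = TACC.bit_length()   # log2(Tacc) + 1 bits per counter
CNT_MAX = (1 << CNT_BITS) - 1
MSB = 1 << (CNT_BITS - 1)      # MSB becomes 1 once Tacc accesses are seen

class BTBEntry:
    def __init__(self):
        self.tgt_cnt = 0   # accesses to the target basic block
        self.fth_cnt = 0   # accesses to the fall-through basic block

def predict(entry, taken):
    # On a BTB hit, increment the counter selected by the predicted
    # direction, then report whether the next basic block is "hot"
    # (most significant bit of the counter set).
    if taken:
        entry.tgt_cnt = min(entry.tgt_cnt + 1, CNT_MAX)
        return bool(entry.tgt_cnt & MSB)
    entry.fth_cnt = min(entry.fth_cnt + 1, CNT_MAX)
    return bool(entry.fth_cnt & MSB)

def window_expire(entry):
    # At the end of each sampling window, shift every counter right by
    # one bit so accesses from older windows carry half the weight.
    entry.tgt_cnt >>= 1
    entry.fth_cnt >>= 1
```

With Tacc = 8, a basic block is flagged hot on its eighth predicted fetch within the (decayed) window.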
7.3.1.2 Detecting New Program Hotspots
One of the drawbacks of periodic approaches is that cache lines can be turned off
only when a preset period expires. It would be more beneficial if older cache lines can
be turned off immediately when a shift in hotspot is detected. The current approach
proposed in this work is specifically targeted at identifying a shift of the program hotspot
to a new loop. Specifically, in this dynamic turn-off scheme, if the current target counter
in the BTB entry of a predicted taken branch indicates that the target basic block is
in a hotspot (the most significant bit of the counter is “1”) and if the target address
is smaller than the current program counter value, the scheme assumes that the program
is executing a loop within a hotspot. At this point, a global turn-off signal is issued and all cache
lines except those holding hotspots are switched to drowsy mode. In the schemes evaluated
in this work, a periodic turn-off is always used in addition to the dynamic loop-based turn-off to account
for cases where the execution remains within the same loop for a long time or when there
are few loop constructs.
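Under these assumptions, the loop-shift test reduces to two checks per predicted-taken branch (a sketch with invented parameter names):

```python
def loop_hotspot_shift(tgt_cnt_msb_set, target_addr, pc):
    # A predicted-taken branch whose target is (a) already marked hot
    # (the MSB of its tgt_cnt counter is set) and (b) at a lower address
    # than the branch itself is taken to be the backward branch of a
    # loop-based hotspot: a global turn-off can be issued immediately,
    # without waiting for the periodic window to expire.
    return tgt_cnt_msb_set and target_addr < pc
```

This costs only an extra comparison against the program counter on top of the hotness check that HSLM already performs.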
7.3.2 JITA: Just-In-Time Activation
In many applications, sequentiality is the norm in code execution. For example,
optimizations such as loop unrolling and superblock and hyperblock formation increase
the sequentiality of the code [14, 17, 50]. The sequential nature of code can be used to
predict the next cache line that will be accessed and mask the penalty for transitioning
a cache line from drowsy to active mode by completing the transition just in time for
the access. Specifically, this work proposes JITA, a scheme that preactivates the next cache line.
The leakage control circuitry that also supports preactivation for a direct-mapped
cache is illustrated in Figure 7.2. When the current cache line is being accessed, the
voltage control bit for the next cache line (next index) is reset, thereby transitioning it
to the active state. Thus, when the next fetch cycle occurs and there is code sequentiality,
the next required cache line is already in the active mode and ready for access. However,
this preactivation scheme is not successful when a taken branch occurs or when the next
address falls in a different memory bank. While the same circuit can be employed for
a set-associative cache, it would lead to activating the cache lines in all the ways of the
same set (Approach 1). In order to avoid this, way prediction information associated
with the next cache line is used to activate only the cache line of a selected way. In this
scheme (Approach 2), each cache set has n bits, one for each of the n ways, where the
set bit corresponds to the way that provided the data when the cache set was accessed
previously. This scheme is found to work well since programs spend a major part of their
time in program hotspots.
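For a direct-mapped cache, JITA can be modeled behaviorally as below (a sketch; a real implementation would also account for taken branches and bank boundaries, as noted above):

```python
DROWSY, ACTIVE = 0, 1

def fetch(states, index, wakeup_penalty=1):
    # Access line `index`; if it is still drowsy, pay the wakeup penalty.
    stall = wakeup_penalty if states[index] == DROWSY else 0
    states[index] = ACTIVE
    # Just-in-Time Activation: preactivate the next sequential line so
    # that, under sequential fetch, it is already active when needed.
    nxt = index + 1
    if nxt < len(states):
        states[nxt] = ACTIVE
    return stall

# Sequential fetch over an all-drowsy region: only the first access stalls.
states = [DROWSY] * 8
stalls = sum(fetch(states, i) for i in range(8))
```

Every access hides the wakeup latency of its successor, so a sequential run through drowsy lines pays the transition penalty exactly once.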
7.4 Design Space Exploration
Schemes      Turn-off Mechanism                                        Granularity of Turn-off
Base         -                                                         -
Drowsy-Bk    Switch banks                                              Bank
Loop         Instruction                                               Entire cache
FHS          Periodic + Not Hot                                        Entire cache
FHS-PA       Periodic + Not Hot                                        Entire cache
DHS-PA       Periodic + Hot backward branch + Not Hot                  Entire cache
DHS-Bk-PA    Periodic + Hot backward branch + Switch banks + Not Hot   Entire cache

Table 7.1. Leakage control schemes evaluated: turn-off mechanisms.
Table 7.1 shows the turn-off mechanisms and granularity of the different ap-
proaches evaluated. Table 7.2 shows the turn-on mechanisms and granularity of these
approaches. All the approaches considered, except the Drowsy-Bk scheme, turn on at the
cache line granularity and turn off using a global signal to all cache lines. By contrast,
the Drowsy-Bk approach turns on and turns off at the bank granularity.
Schemes      Turn-on Mechanism                  Granularity of Turn-on
Base         When accessed                      Cache line
Drowsy-Bk    Bank prediction                    Bank
Loop         When accessed                      Cache line
FHS          When accessed                      Cache line
FHS-PA       When previous line is accessed     Cache line
DHS-PA       When previous line is accessed     Cache line
DHS-Bk-PA    When previous line is accessed     Cache line

Table 7.2. Leakage control schemes evaluated: turn-on mechanisms.
In all cases, including Base, a cache line is assumed to be in drowsy mode before its
first access. The Loop and Drowsy-Bk schemes are used here for comparative purposes.
The FHS (Fixed HotSpot) scheme is a variant of the drowsy scheme [20], augmented
with the hotspot protection scheme described in Section 7.3.1.1 to avoid turning off hot cache
lines. If the span of execution in a hotspot is longer than the fixed period for turn-off,
the FHS scheme will gain because of the masking. Shorter periods of turn-off are useful
when executing straightline code while longer periods of turn-off are desirable for long
executing loops. The FHS scheme helps to balance these. However, this scheme does
have a shortcoming (as compared to Drowsy) in that it can delay the turn-off of cache
lines that belonged to an older hotspot because of the masking. The FHS-PA scheme is
similar to FHS but uses the JITA scheme to predictively turn on the next cache line.
The Drowsy-Bk scheme employs a turn-off policy that is based on the assumption
that bank access changes indicate a shift in locality. The reactivation energy may involve
both the transition energy for changing the supply voltage for the cache line as well as an
additional energy expended in the rest of the system due to the performance penalties
associated with wakeup.
The DHS (Dynamic HotSpot) scheme is built on top of the FHS scheme. In addition
to the periodic global cache line turn-offs, global turn-off signals are also
issued when a new loop-based hotspot is detected. This scheme also employs the hotspot
detection for protecting cache lines containing program hotspots. This scheme can turn
off unused cache lines before the fixed period is reached by detecting that execution will
remain in the new loop-based hotspot. This approach is specifically useful when there
are straightline code segments sandwiched between loops. The DHS scheme also incurs a
penalty due to the masking that can delay the turn-off of cache lines that belonged to an
older hotspot until the identification of a new hotspot or the expiration of an additional
period as compared to a periodic scheme that employs no masking. The DHS-PA scheme
employs the JITA strategy on top of the DHS scheme.
All schemes considered so far are oriented towards identifying either spatial or
temporal locality changes. The final approach, DHS-Bk-PA, attempts to identify both of
these. Specifically, it issues a global turn-off at fixed periods, when execution shifts to a
new bank or when a new loop hotspot is detected. Further, it employs the mask bits set
using hotspot detection to protect active cache lines and the JITA scheme for predictive
cacheline turn-on.
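The combined turn-off condition of DHS-Bk-PA can be stated compactly (a sketch with invented predicate names):

```python
def global_turnoff_due(period_expired, bank_switched, loop_hotspot_detected):
    # DHS-Bk-PA issues a global turn-off on any of three triggers:
    # a fixed-period expiry (temporal locality), a shift to a new bank
    # (spatial locality), or a newly detected loop-based hotspot.
    # Hotspot lines with their VCM bits set still mask the signal.
    return period_expired or bank_switched or loop_hotspot_detected
```

Each trigger covers a failure mode of the others, which is why the combination outperforms the pure periodic and pure spatial schemes in the evaluation below.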
The turn-on mechanisms of the proposed schemes can be classified broadly as
those that are activated on access (incurring transition latency) and those that are pre-
dictively activated. The schemes denoted with a PA suffix employ the JITA strategy.
Predictive turn-on strategies are not without their drawbacks. When a wrong prediction
is made, they not only incur the performance penalty (also associated with techniques
that have no prediction) but also the energy cost for activating the wrong cache line(s).
7.5 Experimental Evaluation
This section evaluates the leakage control schemes described in the previous section. First, it describes the simulation parameters. Next, it compares the energy, performance, and energy-delay results of the different schemes. Finally, a sensitivity analysis is performed for the DHS-Bk-PA scheme.
7.5.1 Experiment Setup
The experimental environment is described in Chapter 3. The experiments are conducted using our simulator, developed based on SimpleScalar 3.0 [12]. A set of ten integer and four floating-point applications from the SPEC2000 benchmark suite, compiled to PISA binaries and run with the reference inputs, is used in these experiments. Table 7.3 gives the technology and energy parameters used in this work. The energy parameters assume drowsy control of individual cache lines and are based on the circuit in [20].
The energy model is as follows:
Eenergy = Edrowsy + Eactive + Edatapath+dcache + Eoverhead (7.1)
Eoverhead = Eturnon + Eextraturnon + Ebtbcounters + Emisc (7.2)
Emisc = Econtrolbits + Ewaypredictor (7.3)
The total leakage energy Eenergy of the instruction cache with leakage management schemes is composed of four parts: the leakage energy Edrowsy consumed by cache lines in drowsy mode, the leakage energy Eactive consumed by cache lines in active mode, the increased leakage energy Edatapath+dcache consumed in the datapath and data cache due to the extra cycles incurred by leakage control, and the overhead energy Eoverhead for implementing the leakage control schemes. Here, the instruction cache is assumed to consume one-third of the leakage energy of the whole processor, with the remainder expended in the datapath and data cache. The overhead energy Eoverhead
includes the transition energy Eturnon for activating a drowsy cache line to active mode, the extra transition energy Eextraturnon due to unnecessary turn-ons resulting from predictive cacheline turn-on schemes, the dynamic energy Ebtbcounters consumed in the BTB counters introduced for HSLM, and the miscellaneous energy Emisc due to voltage control mask bits and, if used, a way predictor in set-associative caches. The transition delay for activating an entire bank is assumed to be one cycle in all schemes, based on the use of a separate voltage controller associated with each cache line. A bank activation causes all cache lines in the bank to switch from the reduced voltage to the normal voltage; hence, the energy for transitioning a bank is proportional to the number of cache lines in the bank. The dynamic energy for Ebtbcounters is calculated using Cacti 3.0 [65] at 70nm technology. Since the BTB counters have a very high percentage of zero bits (an average of 95%, due to saturation or to never being touched), these counters are implemented using asymmetric-Vt SRAM cells [6]. The optimized cells consume only 1/10th of the original leakage when storing zeros; thus, the additional leakage due to BTB counters is very small.
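The model of Eqs. (7.1)-(7.3) can be made concrete with a short numerical sketch. The per-line energy constants below come from Table 7.3, while the cycle and event counts passed in are hypothetical inputs for illustration, not measured data.

```python
# Per-line parameters from Table 7.3, converted from pJ to J.
ACTIVE_LEAKAGE = 0.417e-12    # J per cache line per cycle (active mode)
DROWSY_LEAKAGE = 0.0663e-12   # J per cache line per cycle (drowsy mode)
TURNON_ENERGY = 25.6e-12      # J per drowsy-to-active transition

def total_leakage_energy(active_line_cycles, drowsy_line_cycles,
                         turnons, extra_turnons,
                         e_datapath_dcache, e_btb, e_misc):
    """Eenergy = Edrowsy + Eactive + Edatapath+dcache + Eoverhead, where
    Eoverhead = Eturnon + Eextraturnon + Ebtbcounters + Emisc (Eqs. 7.1-7.2).
    active_line_cycles / drowsy_line_cycles: total line-cycles spent in
    each mode; turnons / extra_turnons: transition event counts."""
    e_active = active_line_cycles * ACTIVE_LEAKAGE
    e_drowsy = drowsy_line_cycles * DROWSY_LEAKAGE
    e_overhead = (turnons + extra_turnons) * TURNON_ENERGY + e_btb + e_misc
    return e_drowsy + e_active + e_datapath_dcache + e_overhead
```

Note how the transition term dominates only when turn-on events are frequent relative to the residency times, which is the trade-off the sampling-window sensitivity study examines later.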
Technology and Energy Parameters
Feature Size: 70nm
Supply Voltage: 1.0V
Clock Speed: 1.0GHz
L1 cache line Leakage in Active Mode: 0.417pJ/cycle
L1 cache line Leakage in Drowsy Mode: 0.0663pJ/cycle
Transition (drowsy to active) Energy: 25.6pJ
Transition (drowsy to active) Latency: 1 cycle
Dynamic Energy per BTB Counter (5 bits): 0.96pJ/transaction

Simulation Parameters
Window Size: 2048 cycles
Hotness Threshold (Tacc): 16
Subbank Size: 4K Bytes

Table 7.3. Technology and energy parameters for the simulated processor given in Table 3.1.
[Figure: per-benchmark active ratio (gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg) for the Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, and DHS-Bk-PA schemes; y-axis: Ratio of Cache Lines in Active Mode (in Cycles), 0-100%.]
Fig. 7.4. The ratio of cycles that cache lines are in active mode over the entire execution time (Active ratio).
7.5.2 Experimental Results
The effectiveness of a leakage control scheme depends critically on how many
cache lines it can place in the drowsy mode. In order to evaluate this, the active ratio
(see Figure 7.4), which is defined as the average percentage of cache lines that are active
throughout the program execution, is measured. A smaller active ratio indicates the
potential for larger savings. However, overheads or performance penalties can impact
this potential. In measuring this ratio, the instruction cache is assumed to initially be
in a drowsy mode and each cache line is activated only when first accessed. On average, the DHS-Bk-PA scheme achieves the lowest active ratio (around 4.5%), while the active ratio for the Base scheme is 66.2%. Observe that DHS-Bk-PA employs the most aggressive turn-off scheme. Among the leakage optimization schemes, FHS-PA and Drowsy-Bk have the largest active ratios (12.7% and 12.5%, respectively). In the Drowsy-Bk scheme, this happens because all lines in a bank are turned on irrespective of whether they will be accessed; in FHS-PA, masking can delay turn-off. In vortex, the active ratio of Loop is much higher (around 59%) because the benchmark contains few loop constructs.
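For clarity, the active-ratio metric can be expressed as a short Python sketch; the per-cycle counts used below are illustrative values, not data from the simulations.

```python
def active_ratio(active_counts, total_lines):
    """Average fraction of cache lines in active mode over execution.
    active_counts[c] = number of lines in active mode during cycle c;
    total_lines = total number of cache lines in the instruction cache."""
    return sum(active_counts) / (len(active_counts) * total_lines)
```

A smaller value directly bounds the achievable leakage savings, since only lines counted here pay the full active-mode leakage per cycle.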
Figure 7.5 breaks down the turn-offs in the DHS-Bk-PA scheme into three categories: periodic turn-off, dynamic turn-off, and bankswitch turn-off. Each category accounts for a significant portion: on average, periodic turn-off accounts for 35.8%, dynamic turn-off for 25.3%, and the rest (38.9%) is contributed by bankswitch turn-off. This confirms that all three turn-off mechanisms are important for leakage control.
[Figure: per-benchmark breakdown of cacheline turn-offs into Periodic, Dynamic, and Bankswitch categories, 0-100%.]
Fig. 7.5. Breakdown of turn-offs in scheme DHS-Bk-PA.
[Figure: per-benchmark leakage energy (with overhead) reduction, -20% to 80%, for the Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, and DHS-Bk-PA schemes.]
Fig. 7.6. Leakage energy reduction w.r.t the Base scheme.
Next, it is examined whether the low active ratio really translates into energy savings. Figure 7.6 presents the total leakage energy reduction of all leakage control schemes compared to the Base scheme. This evaluation depends on the overhead leakage incurred in the rest of the chip, excluding the instruction cache. In order to capture different processor configurations and underlying circuit styles, the contribution of the instruction cache leakage is varied from 10-30% of overall on-chip leakage. DHS-Bk-PA, which has the smallest active ratio, also has the best energy behavior. Further, HSLM and JITA help to reduce additional overhead energy for this scheme. Hence, it achieves an average energy reduction of 63% over Base, 49% over Drowsy-Bk, and 29% over Loop. When the instruction cache contributes 10% of overall on-chip leakage, these energy reductions are 59% over Base, 44% over Drowsy-Bk, and 50% over Loop (not shown in the figure for brevity). Focusing on an anomalous trend in Figure 7.6, benchmark wupwise exhibits very different energy behavior. Except for the FHS (0.3% reduction) and Loop (0% reduction) schemes, all other schemes increase the energy consumption, with the Drowsy-Bk scheme increasing it by 19%. This results from the small footprint of this benchmark, which touches only 77 cache lines of the same bank (out of 128 lines for the configuration given in Table 7.3).
In order to take a closer look at the energy behavior of the different schemes, Figure 7.7 provides a more detailed breakdown (averaged over all benchmarks). For Base, the leakage energy is due to the leakage consumed by drowsy cache lines (before a cache line is first accessed) and by active cache lines. Loop and FHS have a noticeable portion of energy from additional datapath and data cache leakage due to performance degradation. In contrast, Drowsy-Bk has very little energy overhead from performance loss because it predictively turns on banks. However, turn-on
[Figure: per-scheme (Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA) leakage energy (with overhead) breakdown into Drowsy Leakage, Active Leakage, Datapath+DCache, Turn On, Extra Turn On, and BTB Counter components, in J.]
Fig. 7.7. The leakage energy breakdown (an average for fourteen SPEC2000 benchmarks).
energy for the Drowsy-Bk scheme is significant, as it turns all lines in a bank in one
cycle. Further, the extra turn-on energy (activating the wrong subbank) is around 20%
of the turn-on overhead energy which accounts for 40.7% for the total leakage energy.
Even without accounting for the significant turn-on penalty for the Drowsy-Bk scheme,
DHS-Bk-PA (considering all overheads except turn-on) achieves 23% more energy savings
from Drowsy-Bk. This is because of the high-active ratio for the Drowsy-Bk scheme. The
FHS and FHS-PA also have a major portion of their energy budget consumed in active
cache lines due to their high active ratio. Further, the BTB counter overhead is minimal
since the saturated counters do not incur any additional dynamic activity as their clocks
are gated once the most significant bit turns to a one (until it is reset). Only the most
significant bit of these saturated counters is used even when reading to identify hotspots.
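The behavior of these saturating counters can be sketched as follows. This is an illustrative Python model under assumed names (the interface is ours, not from the thesis); it captures the two properties described above: increments are gated once the MSB is set, and only the MSB is read for hotspot identification.

```python
class SaturatingCounter:
    """5-bit BTB access counter with MSB-based clock gating."""
    WIDTH = 5
    MSB = 1 << (WIDTH - 1)   # 0b10000 == 16

    def __init__(self):
        self.value = 0
        self.transitions = 0  # dynamic events that consume energy

    def increment(self):
        if self.value & self.MSB:   # clock gated: no dynamic activity
            return
        self.value += 1
        self.transitions += 1

    def is_hot(self):
        # Only the most significant bit is read to identify hotspots.
        return bool(self.value & self.MSB)

    def reset(self):
        self.value = 0
```

Because increments stop once the MSB is set, a heavily accessed counter pays dynamic energy only for its first 16 increments per sampling window, which is why the BTB counter overhead remains minimal.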
[Figure: per-benchmark ratio of activations on cache hits, 0-100%, for the Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, and DHS-Bk-PA schemes.]
Fig. 7.8. Ratio of activations on instruction cache hits.
[Figure: per-benchmark ratio of preactivations over total activations, 0-100%, for the FHS-PA, DHS-PA, and DHS-Bk-PA schemes.]
Fig. 7.9. The ratio of effective preactivations performed by JITA over total activations incurred during the entire simulation.
Next, a metric defined as the activation ratio is measured to highlight the performance penalty of accessing drowsy cache lines. This ratio gives the percentage of cache line activations made on a cache hit out of the total number of activations performed. A larger value indicates a higher performance penalty; activations on cache misses do not incur any additional penalty. Figure 7.8 shows the results. For Loop and FHS, this value is 79.5% and 83% on average, respectively. The use of JITA reduces this number to 7.6%, 11%, and 12.4% for the FHS-PA, DHS-PA, and DHS-Bk-PA schemes. While JITA is successful in reducing the activation penalty, it still incurs penalties when it fails due to taken branches or jumps to drowsy cache lines. For the Drowsy-Bk scheme, this metric is not very useful, as it activates many unnecessary cache lines when turning on an entire bank. To provide more insight into why JITA works so well in FHS-PA, DHS-PA, and DHS-Bk-PA, Figure 7.9 shows the percentage of effective preactivations performed by JITA over the total activations incurred during execution for these three schemes.
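The JITA mechanism itself is simple enough to sketch. The Python model below is illustrative (the trace and names are hypothetical, and for simplicity every demand activation is counted as occurring on a hit): on each fetch, the next sequential line is predictively woken, so straight-line code avoids wakeup stalls, while a taken branch into a drowsy line still pays the penalty.

```python
def simulate_jita(fetch_trace, num_lines):
    """fetch_trace: sequence of cache line indices fetched in order.
    Returns (demand_activations, total_activations): demand activations
    are the penalized wakeups (the activation-ratio numerator), total
    additionally counts JITA preactivations."""
    drowsy = [True] * num_lines
    demand = total = 0
    for line in fetch_trace:
        if drowsy[line]:            # JITA missed this line: wakeup stall
            drowsy[line] = False
            demand += 1
            total += 1
        nxt = line + 1
        if nxt < num_lines and drowsy[nxt]:   # just-in-time preactivation
            drowsy[nxt] = False
            total += 1
    return demand, total
```

For a purely sequential trace, only the very first line incurs a demand activation; every subsequent line has already been preactivated, which mirrors the low activation ratios reported for the PA schemes.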
Figure 7.10 shows how this activation penalty translates into actual performance. The Base scheme (not shown) performs best, as it incurs no performance penalties except for the initial activation of untouched cache lines. Among the remaining schemes, Drowsy-Bk performs best, incurring a degradation of only 0.56%. The Loop scheme incurs the highest average degradation, 15.4%, because it has the highest number of accesses to drowsy cache lines. In contrast, the hotspot protection in FHS reduces this penalty to 5.2%, and using JITA further reduces it to 0.7% for the FHS-PA scheme. DHS-Bk-PA, the scheme with the best energy behavior, suffers an average degradation of 2.3%. Note, however,
[Figure: per-benchmark performance (IPC) degradation, 0-35%, for the Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, and DHS-Bk-PA schemes.]
Fig. 7.10. Performance degradation w.r.t the Base scheme.
[Figure: per-benchmark energy (J) * delay (s) product for the Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, and DHS-Bk-PA schemes.]
Fig. 7.11. Energy delay (J*s) product (EDP).
that DHS-Bk-PA remains the best scheme in terms of energy in spite of the additional energy overhead incurred due to its performance penalty.
Finally, the energy-delay products (EDP) of the schemes are presented in Figure 7.11. Note that the overhead energy (see Figure 7.7) has been included. The results show that the DHS-Bk-PA scheme performs best: it achieves the smallest EDP value, with an average reduction of 62.63% over Base, and of 48.3% and 37.7% over Drowsy-Bk and Loop, respectively.
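As a reminder of how the EDP comparison works, the metric multiplies total energy (in joules) by execution time (in seconds), so a scheme only wins if its energy savings outweigh any slowdown. The sketch below uses made-up values, not the measured results.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product (J*s)."""
    return energy_joules * delay_seconds

def edp_reduction(base, scheme):
    """Fractional EDP reduction of `scheme` relative to `base`;
    each argument is an (energy, delay) pair."""
    return 1.0 - edp(*scheme) / edp(*base)
```

For example, a scheme that uses 0.06 J instead of 0.16 J at equal delay yields a 62.5% EDP reduction, but the same energy saving at a 10% slowdown would yield less.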
7.5.3 Sensitivity Analysis
This section investigates the impact of different parameters, reporting results only for the best performing scheme, DHS-Bk-PA, and highlighting only the key aspects influencing the other schemes. For clarity and due to space limitations, a representative set of three benchmarks, parser, bzip2, and equake, is used. The baseline configuration for DHS-Bk-PA is the same as in the above experiments: a 2K-cycle window for hotness sampling and the global turn-off interval, a hotness threshold (Tacc, used in HSLM) of 16 accesses, and a 4KB subbank size. In each of the following experiments, only one parameter is varied while the other two remain unchanged.
The sampling window plays an important role in leakage control schemes with periodic turn-off. If the window shrinks, cache lines can be placed in drowsy mode faster, potentially implying more leakage reduction. However, it may incur more performance loss due to more activations on cache hits, and consequently more transition energy and more leakage in the datapath and data cache. On the other hand,
[Figure: (a) IPC and (b) total leakage energy (J) for parser, bzip2, and equake as the window size varies from 0.5K to 8K cycles.]
Fig. 7.12. Impact of sampling window size on leakage control scheme DHS-Bk-PA.
increasing the window puts cache lines into drowsy mode less frequently, which reduces the opportunity for leakage saving, but the performance loss and overhead energy will be much smaller. The masking bits used in hotspot protection schemes can, however, mask the negative impact of unnecessary turn-offs. A very small sampling window close to Tacc can nevertheless prevent cache lines from being established as hotspots and nullify the benefits of masking. Figure 7.12 shows the impact of sampling window size on the performance and leakage energy of scheme DHS-Bk-PA. The performance impact is very slight (IPC degradation is just 1.4%) when the period decreases from 8K cycles to 0.5K cycles, compared to the Drowsy scheme, for which IPC degrades by 9.9% for the same period decrease. The leakage energy also decreases as the window shrinks when using DHS-Bk-PA, but the reduction flattens once the window size reaches 1K cycles, as there is less remaining potential to turn off cache lines. In contrast, the energy consumption of the Drowsy scheme increases due to higher overheads when the period is reduced from 8K to 0.5K cycles; for example, benchmark parser increases energy by 10.5%. These results show that DHS-Bk-PA scales very well with window size.
[Figure: (a) IPC and (b) total leakage energy (J) for parser, bzip2, and equake as the hotness threshold varies from 4 to 64.]
Fig. 7.13. Impact of hotness threshold on leakage control scheme DHS-Bk-PA.
The hotness threshold controls how easily a cache line can be established as a hotspot within a given sampling window. A smaller threshold puts more cache lines into hotspots and prevents them from being turned off; this helps maintain high performance but hurts leakage energy saving. On the other hand, a larger threshold favors energy saving but might degrade performance. Figure 7.13 shows this impact when the threshold varies from 4 to 64. Reducing the threshold from 64 to 4 improves IPC by 1.2% but increases the leakage energy consumption by 5.1%. Also note that increasing the threshold beyond 64 makes it approach the sampling window interval and does not help the schemes much.
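The hotspot-establishment rule itself is straightforward; a minimal Python sketch follows, with illustrative names and an illustrative access trace (the thesis specifies the policy, not this interface).

```python
def find_hotspots(window_accesses, t_acc=16):
    """window_accesses: list of cache line indices accessed during one
    sampling window. A line whose access count reaches the hotness
    threshold t_acc is established as a hotspot; its mask bit would
    then protect it from the global turn-off."""
    counts = {}
    for line in window_accesses:
        counts[line] = counts.get(line, 0) + 1
    return {line for line, n in counts.items() if n >= t_acc}
```

Lowering t_acc admits more lines into the hotspot set (protecting performance at the cost of leakage), which is exactly the trade-off shown in Figure 7.13.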
[Figure: (a) IPC and (b) total leakage energy (J) for parser, bzip2, and equake as the subbank size varies from 0.5KB to 8KB.]
Fig. 7.14. Impact of subbank size on leakage control scheme DHS-Bk-PA.
The subbank size affects the turn-offs caused by bank switches. A smaller bank size might present more opportunity for energy optimization, but it also introduces false phase changes when monitoring bank switches (for example, more loops can be split across bank boundaries), which incurs more energy and performance overhead. Figure 7.14 shows the impact of subbank size on scheme DHS-Bk-PA: energy consumption increases and IPC degrades when banks smaller than 2KB are used. Note that very small banks are in any case undesirable, as they increase decoding overheads.
Next, the impact of cache associativity is studied using Approaches 1 and 2 described in Section 7.3.2. Approach 1 gives preference to performance and eliminates performance penalties due to way prediction at the expense of energy. Approach 2 uses way prediction to reduce energy consumption and can potentially incur performance penalties from additional activations on way mispredictions. Figure 7.15 presents the performance degradation and leakage energy reduction (compared to the Base scheme of the corresponding cache configuration) for five instruction cache configurations: a direct-mapped cache (DM), a 2-way associative cache using Approach 1 (2-Way), a 2-way associative cache using Approach 2 (2-Way + WP), a 4-way associative cache using Approach 1 (4-Way), and a 4-way associative cache using Approach 2 (4-Way + WP). With Approach 2, the performance degradation increases with associativity, as way-prediction accuracy drops; consequently, the leakage energy reduction decreases as well. In contrast, Approach 1 achieves much better performance at the cost of higher leakage energy, especially for caches with more ways.
[Figure: (a) IPC degradation (0-7%) and (b) leakage energy reduction (55-80%) for parser, bzip2, and equake across the DM, 2-Way, 2-Way+WP, 4-Way, and 4-Way+WP configurations.]
Fig. 7.15. Impact of cache associativity. IPC degradation (left), leakage energy reduction (right).
7.6 Discussions and Summary
This work focused on the leakage management of instruction caches. The leakage management premise is the ability to identify changes in spatial and temporal locality, and it exploits two main characteristics of instruction access patterns: program execution is mainly confined to program hotspots, and instructions exhibit a sequential access pattern. Two strategies, HotSpot-based Leakage Management (HSLM) and Just-in-Time Activation (JITA), have been devised to exploit these two characteristics.
Specifically, HSLM protects cache lines containing program hotspots from being turned off and dynamically identifies shifts in program hotspots. JITA predictively activates the next cache line to mitigate the performance penalty incurred in waking up drowsy cache lines. These schemes were combined with existing approaches that exploit either the spatial or the temporal locality of instruction cache accesses. The evaluation shows that it is important to consider shifts in both spatial and temporal locality in order to optimize the leakage energy consumed by instruction caches. Further, using the program behavior captured by HSLM helps avoid some of the overheads of managing leakage in an application-agnostic fashion and also helps to detect shifts in program hotspots dynamically. Finally, JITA is a simple and effective scheme for masking the performance penalties associated with waking up drowsy cache lines and permits fine-grain leakage management at the cache line level.
DHS-Bk-PA, one of the leakage management schemes explored in this work, is the most effective in terms of both energy reduction and energy-delay metrics among all the schemes explored (including recently proposed instruction cache leakage management techniques). It aggressively combines HSLM, JITA, and both spatially and temporally triggered cache line turn-off. In DHS-Bk-PA, the spatial, temporal, and HSLM hotspot detection mechanisms aggressively reduce cache leakage, while HSLM hotspot protection and JITA mitigate the performance and energy overheads associated with aggressive cache line turn-off. With the increasing focus on reducing leakage energy as technology scales, and with the incorporation of larger and larger on-chip caches, such cache leakage control schemes will be vital in future processor generations.
Chapter 8
Conclusions and Future Work
8.1 Conclusions
Energy consumption has become an increasing concern and one of the major constraints in microprocessor design. Due to their large share of the transistor budget, on-chip caches contribute significantly to processor energy consumption, in terms of both dynamic and static energy. This thesis work started by exploring the relationship between an application's characteristics and its cache behavior, and how the properties of this relationship can be utilized by either compiler or microarchitectural schemes to reduce the energy consumption (both dynamic and leakage) in caches. Building on that, this thesis proposed several techniques that orchestrate compiler and microarchitectural support to attack cache energy consumption in an application-sensitive way.
More specifically, this thesis research made the following four major contributions towards a new approach and design methodology for designing highly energy-efficient on-chip memory hierarchies:
• A detailed cache behavior characterization for both array-intensive embedded applications and general-purpose applications was performed in this work. Three critical properties of an application and its cache behavior, namely cache resource demands for performance, program execution footprint, and instruction cache access behavior, have been identified, highlighted, extracted, and analyzed in the context of cache energy optimization. The insights obtained from this study suggest that (1) different applications, or different code segments within a single application, have very different cache demands in the context of performance and energy, (2) program execution footprints (instruction addresses) can be highly predictable and usually have a narrow scope during a particular execution phase, especially for array-intensive applications, and (3) accesses to the instruction cache exhibit high sequentiality.
• Inspired by the findings from the above study, the compiler-directed cache polymorphism (CDCP) technique proposed in this thesis implements an optimizing compiler that is capable of analyzing the cache behavior (i.e., data reuse) of the application code and determining the cache configuration that best matches this behavior, achieving the best performance and optimized energy behavior. The cache is then directed to reconfigure dynamically at runtime with the cache configurations determined by CDCP. This technique mainly focuses on the new role of a compiler interacting with reconfigurable cache architectures. Experimental results show that CDCP provides competitive performance and lower energy consumption in the data cache compared to an oracle scheme that uses optimal cache configurations obtained from exhaustive simulation.
• Utilizing the dynamic behavior of the instruction footprint observed at runtime in a set of array-based embedded applications, this thesis proposed a new issue queue design that restructures the instruction supply mechanism in conventional microprocessors. This scheme captures and utilizes the predictable execution footprint to reduce energy consumption in the instruction cache, with additional savings in other processor components as a side benefit. The proposed issue queue is capable of rescheduling buffered instructions within the queue itself, thus avoiding instruction streaming from the pipeline front-end and significantly reducing energy consumption in the instruction cache.
• Further, two techniques, hotspot-based leakage management (HSLM) and just-in-time activation (JITA), are proposed in this work to manage the leakage in the instruction cache in an application-sensitive fashion. HSLM not only protects program hotspots from being inadvertently turned off, but also switches old hotspots into drowsy mode as soon as a phase change is detected. JITA exploits the sequential nature of accesses to the instruction cache and preactivates the cache line following the currently accessed one, thus hiding the one-cycle drowsy wakeup penalty. The scheme employing these two strategies, in addition to periodic and spatially based (bank switch) turn-off, provides a significant improvement in leakage energy savings in the instruction cache (while also accounting for overheads incurred in the rest of the processor) over previously proposed schemes [45][78].
8.2 Future Work
This thesis research has raised a number of potential new ideas and topics for fu-
ture research work in the areas of low-power systems design, high-performance computer
architecture, and reliable power-efficient systems.
Power consumption has become one of the critical limiters to integration in modern microprocessors and might stall current technology advancement if left unaddressed. Excessive power consumption not only increases the packaging and cooling cost of microprocessors exponentially, but also demands extremely costly machine-room design for data centers, which spend millions of dollars every year on power supply and heat removal systems. In my future work, I would like to extend my research theme from circuits, architecture, and compilers to operating systems and applications, and from the processor core to on-chip systems, main memory, disks, and disk arrays in data centers. My goal for this research is to build an infrastructure seamlessly embodying power optimizations at different system levels for different system components, intended to have a significant impact on low-power research in both industry and academia.
The traditional design paradigm for microprocessor architectures might become an impediment to sustaining the performance improvement delivered by advancing VLSI technologies. I believe that the new generation of technology necessitates new design methodologies and philosophies. The wakeup-free instruction scheduler [19][39] is a good example of new computer architectures for future microprocessors. An insight from this research is that the complexity and timing of large centralized components of the processor are becoming obstacles to performance improvement driven by faster clock speeds. A promising research topic is to reconsider the datapath and design new architectures that partition these centralized components. By avoiding centralized design, each distributed part would be self-managed, self-adaptive, and self-activated, so that the structure scales well with technology, in terms of both complexity and performance.
This research has indicated that behavior characterization is crucial for guiding effective optimizations of a particular system component. Given the rapidly diverging speeds of the memory hierarchy and the processor core, an everlasting question is how to improve the performance of the memory hierarchy. This question is not new, but it never stops challenging architects. Specifically, my interest in this area is to use sophisticated behavior characterization to direct performance optimization. The questions then are: what memory behavior should be characterized beyond the generation behavior? How can the characterization be performed efficiently, and at what level, compiler or architecture? And how can the behavior characteristics be utilized for performance improvement? These questions raise many interesting research topics for high performance memory systems.
Another issue that is capturing increasing attention in industry and academia is system reliability under further technology scaling and aggressive power optimization strategies. The transistor noise margin is shrinking because the supply voltage scales down much faster than the transistor threshold voltage Vth, making circuits more vulnerable to noise. Supply noise worsens with runtime dynamic/leakage power optimizations. A dropping Qcritical, due to lower supply voltage and smaller node capacitance, makes the chip more susceptible to soft errors. All these trends reduce the reliability of future products, yet commercial servers cannot afford the downtime caused by reliability problems. Since we are now conducting fundamental research on understanding the basics of soft errors, my specific interest for future work is to model the impact of soft errors on different components in the processor, as well as the error propagation path and distance. This research is intended to provide a fundamental understanding of how, when, and where to detect possible errors, and how to perform the follow-up correction/recovery effectively. Another very interesting research direction is designing reliable systems from unreliable components with schemes such as self-adaptive reconfiguration and time/space redundancy. Reliability in conjunction with low-power systems design opens a broad research area for future work.
Projecting to high performance computer architectures, I have been interested in and plan to investigate power and reliability issues in the context of multithreaded architectures, chip multiprocessor architectures (CMP), and network-on-chip architectures (NOC). In the networking domain, I am particularly interested in power management during different operation modes in sensor networks. The power optimizations I plan to explore span from sensor hardware and communication protocols to the specific applications.
References
[1] International technology roadmap for semiconductors, semiconductor industry as-
sociation. http://public.itrs.net, 2001.
[2] A. Agarwal, H. Hai, and K. Roy. DRG-Cache: A data retention gated-ground cache for low power. In Proc. of the ACM/IEEE Design Automation Conference, pages 473–478, June 2002.
[3] Nawaaz Ahmed, Nikolay Mateev, and Keshav Pingali. Synthesizing transformations
for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000
International Conference on Supercomputing, pages 141–152, Santa Fe, New Mexico,
May 2000.
[4] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In
Proc. of the 32nd Annual International Conference on Microarchitecture, 1999.
[5] T. Anderson and S. Agarwala. Effective hardware-based two-way loop cache for
high performance low power processors. In IEEE Int’l Conf. on Computer Design,
2000.
[6] N. Azizi, A. Moshovos, and F. N. Najm. Low-leakage asymmetric-cell SRAM. In Proc. of the 2002 International Symposium on Low Power Electronics and Design, Monterey, CA, 2002.
143
[7] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-
performance computing. ACM Computing Surveys, 26(4):345–420, 1994.
[8] Raminder S. Bajwa et al. Instruction buffering to reduce power in processors for sig-
nal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
5(4):417–424, December 1997.
[9] P. Bannon. Alpha 21364: A scalable single-chip SMP. Microprocessor Forum, October 1998.
[10] R. Bechade et al. A 32b 66MHz 1.8W microprocessor. In Proc. of International
Solid-State Circuits Conference, 1994.
[11] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-
level power analysis and optimizations. In Proc. International Symposium on High-
Performance Computer Architecture, 2000.
[12] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical
report, University of Wisconsin-Madison, June 1997.
[13] J. A. Butts and G. Sohi. A static power model for architects. In Proc. the 33rd
Annual International Symposium on Microarchitecture, December 2000.
[14] S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-
jam. In Proc. the 29th Annual Hawaii International Conference on System Sciences,
pages 183–192, Maui, HI, January 1996.
[15] Jacqueline Chame. Compiler Analysis of Cache Interference and its Applications to
Compiler Optimizations. PhD thesis, Dept. of Computer Engineering, University of
Southern California, 1997.
[16] Anantha Chandrakasan, William J. Bowhill, and Frank Fox, editors. Design of
High-Performance Microprocessor Circuits. IEEE Press, 2001.
[17] P. P. Chang, N. J. Warter, S. Mahlke, W. Y. Chen, and W-M. W. Hwu. Three
superblock scheduling models for superscalar and superpipelined processors. Tech-
nical Report CRHC-91-29, Center for Reliable and High-Performance Computing,
University of Illinois, Urbana, IL, 1991.
[18] B. Cmelik and D. Keppel. Shade: a fast instruction-set simulator for execution
profiling. In Proc. of the 1994 ACM SIGMETRICS Conf. on the Measurement
and Modeling of Computer Systems, pages 128–137, May 1994.
[19] D. Ernst, A. Hamel, and T. Austin. Cyclone: A broadcast-free dynamic instruction
scheduler with selective replay. In Proceedings of the 30th Annual International
Symposium on Computer Architecture, June 2003.
[20] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple
techniques for reducing leakage power. In Proc. the 29th International Symposium
on Computer Architecture, Anchorage, AK, May 2002.
[21] B. Franke and M. F. P. O'Boyle. Array recovery and high-level transformations for
DSP applications. ACM Transactions on Embedded Computing Systems (TECS),
2(2):132 – 162, May 2003.
[22] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory
management by global program transformation. Journal of Parallel and Distributed
Computing, 5(5):587–616, October 1988.
[23] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical repre-
sentation of cache misses. In Proc. of the 11th International Conference on Super-
computing (ICS-97), July 1997.
[24] S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transforma-
tions with caches of arbitrary associativity. In Proceedings of the 8th International
Conference on Architectural Support for Programming Languages and Operating
Systems, pages 228–239, San Jose, CA, October 1998.
[25] T. Givargis, J. Henkel, and F. Vahid. Interface and cache power exploration for core-
based embedded systems. In Proceedings of International Conference on Computer
Aided Design (ICCAD), pages 270–273, November 1999.
[26] A. Gordon-Ross, S. Cotterell, and F. Vahid. Exploiting fixed programs in embedded
systems: A loop cache example. IEEE Computer Architecture Letters, 2002.
[27] D. Grunwald, B. G. Zorn, and R. Henderson. Improving the cache locality of memory
allocation. In Proceedings of the ACM SIGPLAN’93 Conference on Programming
Language Design and Implementation (PLDI), pages 177–186, Albuquerque, New
Mexico, 1993.
[28] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, 3rd edition, 2002.
[29] S. Heo, K. Barr, M. Hampton, and K. Asanović. Dynamic fine-grain leakage
reduction using leakage-biased bitlines. In Proc. the 29th International Symposium on
Computer Architecture, Anchorage, AK, May 2002.
[30] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel.
The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Q1 2001
Issue, Feb. 2001.
[31] M. Hiraki et al. Stage-skip pipeline: A low power processor architecture using
a decoded instruction buffer. In Proc. International Symposium on Low Power
Electronics and Design, 1996.
[32] J. S. Hu, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Analyzing data reuse
for cache reconfiguration. To appear in ACM Transactions on Embedded Computing
Systems.
[33] J. S. Hu, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, H. Saputra, and W. Zhang.
Compiler-directed cache polymorphism. In Proc. of ACM SIGPLAN Joint Con-
ference on Languages, Compilers, and Tools for Embedded Systems (LCTES’02)
and Software and Compilers for Embedded Systems (SCOPES'02), pages 165–174,
Berlin, Germany, June 19–21, 2002.
[34] J. S. Hu, A. Nadgir, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Exploiting
program hotspots and code sequentiality for instruction cache leakage management.
In Proc. of the International Symposium on Low Power Electronics and Design
(ISLPED'03), pages 402–407, Seoul, Korea, August 25–27, 2003.
[35] J. S. Hu, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Selective trace cache:
A low power and high performance fetch mechanism. Technical Report CSE-02-
016, Department of Computer Science and Engineering, The Pennsylvania State
University, 2002.
[36] J. S. Hu, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Using dynamic branch
behavior for power-efficient instruction fetch. In Proc. of IEEE Computer Society
Annual Symposium on VLSI (ISVLSI 2003), pages 127–132, Tampa, Florida,
February 20–21, 2003.
[37] J. S. Hu, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Power-efficient trace
caches. In Proc. of the 5th Design Automation and Test in Europe Conference
(DATE’02), March 2002.
[38] J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin. Scheduling
reusable instructions for power reduction. In Proc. of the Conference on Design,
Automation and Test in Europe Conference (DATE’04), Paris, France, February
16–20, 2004.
[39] Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin. Exploring wakeup-free instruc-
tion scheduling. In Proc. of the International Symposium on High Performance
Computer Architecture (HPCA-10), pages 232–241, Madrid, Spain, February 14–18,
2004.
[40] I. Kadayif, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and J. Ramanujam. Mor-
phable cache architectures: potential benefits. In ACM Workshop on Languages,
Compilers, and Tools for Embedded Systems (LCTES’01), June 2001.
[41] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: exploiting generational behav-
ior to reduce cache leakage power. In Proc. the 28th International Symposium on
Computer Architecture, Sweden, June 2001.
[42] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In
Proc. the 6th ACM International Conference on Supercomputing (ICS'92), Washington, DC, 1992.
[43] H. Kim and K. Roy. Dynamic Vt SRAMs for low leakage. In Proc. ACM International
Symposium on Low Power Design, pages 251–254, August 2002.
[44] N. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin,
M. Kandemir, and N. Vijaykrishnan. Leakage current: Moore’s law meets static
power. IEEE Computer Special Issue on Power- and Temperature-Aware Computing,
pages 68–75, December 2003.
[45] N. Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage
power reduction using dynamic voltage scaling and cache sub-bank prediction. In
Proc. the 35th Annual International Symposium on Microarchitecture, November
2002.
[46] J. Kin et al. The filter cache: An energy efficient memory structure. In Proc.
International Symposium on Microarchitecture, 1997.
[47] L. H. Lee, B. Moyer, and J. Arends. Instruction fetch energy reduction using loop
caches for embedded applications with small tight loops. In Proc. International
Symposium on Low Power Electronics and Design, 1999.
[48] Haris Lekatsas and Wayne Wolf. SAMC: A code compression algorithm for embedded
processors. IEEE Transactions on CAD, 18(12):1689–1701, December 1999.
[49] L. Li et al. Leakage energy management in cache hierarchies. In Proc. the 11th Inter-
national Conference on Parallel Architectures and Compilation Techniques, Septem-
ber 2002.
[50] S. A. Mahlke et al. Effective compiler support for predicated execution using the
hyperblock. In Proc. the 25th Annual International Symposium on Microarchitecture,
1992.
[51] N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. In
Proceedings of the 24th International Conference on Parallel Processing (ICPP’95),
pages II:19–28, Oconomowoc, Wisconsin, August 1995.
[52] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for
energy reduction. In Proc. the 25th Annual International Symposium on Computer
Architecture, pages 132–141, June 1998.
[53] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality
with loop transformations. ACM Transactions on Programming Languages and
Systems, 18(4):424–453, July 1996.
[54] M. C. Merten et al. An architectural framework for runtime optimization. IEEE
Transactions on Computers, 50(6):567–589, June 2001.
[55] T. Simunic, L. Benini, and G. De Micheli. Energy-efficient design of battery-powered
embedded systems. In Proceedings of International Symposium on Low Power Elec-
tronics and Design, pages 212–217, August 1999.
[56] J. Montanaro et al. A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. Digital
Technical Journal, Digital Equipment Corporation, 9, 1997.
[57] Samuel D. Naffziger and Gary Hammond. The implementation of the next-generation
64b Itanium™ microprocessor. In Proceedings of ISSCC, February 2002.
[58] Dharmesh Parikh, Kevin Skadron, Yan Zhang, Marco Barcella, and Mircea R. Stan.
Power issues related to branch prediction. In Proc. the 8th International Symposium
on High-Performance Computer Architecture (HPCA’02), February 2002.
[59] M. D. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Reducing leakage
in a high-performance deep-submicron instruction cache. IEEE Transactions on
VLSI, 9(1), February 2001.
[60] Michael Powell, Se-Hyun Yang, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar.
Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memo-
ries. In Proc. the International Symposium on Low Power Electronics and Design
(ISLPED ’00), pages 90–95, July 2000.
[61] Michael D. Powell, Amit Agarwal, T. N. Vijaykumar, Babak Falsafi, and Kaushik
Roy. Reducing set-associative cache energy via way-prediction and selective direct-
mapping. In Proceedings of the 34th annual ACM/IEEE international symposium
on Microarchitecture, pages 54–65, 2001.
[62] P. Ranganathan, S. Adve, and N. P. Jouppi. Reconfigurable caches and their appli-
cation to media processing. In Proc. of the 27th Annual International Symposium
on Computer Architecture, pages 214–224, June 2000.
[63] G. Reinman and N. P. Jouppi. CACTI 2.0: An integrated cache timing and power
model. Technical report, Compaq Western Research Laboratory, 1999.
[64] G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance archi-
tectures. In Proceedings of the 1998 International Conference on Supercomputing,
pages 353–360, Melbourne, Australia, July 1998.
[65] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An integrated cache timing, power,
and area model. Technical report, Compaq Computer Corporation, August 2001.
[66] Silicon Strategies. SandCraft MIPS64 embedded processor hits 800 MHz.
http://www.siliconstrategies.com, 2002.
[67] Avinash Sodani and Gurindar S. Sohi. Dynamic instruction reuse. In Proc. the 24th
Annual International Symposium on Computer Architecture (ISCA-97), June 1997.
[68] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In
Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and
Implementation, Atlanta, GA, May 1999.
[69] Stanford Compiler Group. The SUIF Library, version 1.0 edition, 1994.
[70] Jason Stinson and Stefan Rusu. A 1.5GHz third generation Itanium® 2 processor.
In Proc. of the 40th Conference on Design Automation, pages 706–709, 2003.
[71] W. Tang, R. Gupta, and A. Nicolau. Power savings in embedded processors through
decode filter cache. In Proc. Design and Test in Europe Conference, 2002.
[72] O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proc. of
ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems,
1994.
[73] V. Tiwari, S. Malik, A. Wolfe, and M.T.C. Lee. Instruction level power analysis
and optimization of software. Journal of VLSI Signal Processing, 13(2):1–18, 1996.
[74] M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. of the SIGPLAN '91
Conference on Programming Language Design and Implementation, pages 30–44, 1991.
[75] K. C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–40,
April 1996.
[76] Q. Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level
memory hierarchies. In Proceedings of the SIGPLAN '00 Conference on Program-
ming Language Design and Implementation, Vancouver, Canada, June 2000.
[77] Chuanjun Zhang, Frank Vahid, and Walid Najjar. A highly configurable cache
architecture for embedded systems. In Proceedings of the 30th annual international
symposium on Computer architecture, pages 136–146, 2003.
[78] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin.
Compiler-directed instruction cache leakage optimization. In Proc. the 35th Annual
International Symposium on Microarchitecture, November 2002.
[79] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin.
Reducing instruction cache energy consumption using a compiler-based strategy.
ACM Transactions on Architecture and Code Optimization (TACO), 1(1):3 – 33,
2004.
[80] H. Zhou, M. C. Toburen, E. Rotenberg, and T. M. Conte. Adaptive mode control:
a static power-efficient cache design. In Proc. the 2001 International Conference on
Parallel Architectures and Compilation Techniques, September 2001.
Vita
Jie Hu was born in Ninghai, Zhejiang, China on July 8, 1975. He graduated from
Ninghai High School of Zhejiang Province in 1993. He received his B.E. degree in com-
puter science and engineering from Beijing University of Aeronautics and Astronautics
in 1997. He ranked first in his class and was recommended by his department to the
graduate school at Peking University with the graduate admission exams waived. In
2000, he married Ms. Kai Chen. In the same year, he received his M.E. degree in signal
and information processing from Peking University. Immediately after that, he enrolled
in the Ph.D. program in computer science and engineering at the Pennsylvania State
University. Since August 2000, he has been a graduate assistant in the same department.
Jie Hu is a member of IEEE, ACM, and ACM SIGARCH.