COBRA: A Framework for Continuous Profiling and Binary Re-Adaptation
Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 EECS Building
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 08-016
Jinpyo Kim, Wei-chung Hsu, and Pen-chung Yew
May 09, 2008
© Jinpyo Kim February 2008
Abstract
Dynamic optimizers have been shown to improve the performance and power efficiency
of single-threaded applications. Multithreaded applications running on CMP, SMP and
cc-NUMA systems also exhibit opportunities for dynamic binary optimization. However,
because existing dynamic optimizers are designed primarily for single-threaded
programs, they lack efficient schemes for monitoring multiple threads and for
supporting thread-specific or system-wide optimizations that target the collective
behavior of multiple threads. Monitoring and collecting profiles from multiple
threads exposes optimization opportunities not only for single-core systems, but
also for multi-core systems that include interconnection networks and cache
coherence protocols. Detecting global phases of multithreaded programs and selecting
appropriate optimizations by considering interactions between threads, such as
coherence misses, distinguish the dynamic binary optimizer presented in this thesis
from prior dynamic optimizers for single-threaded programs.
This thesis presents COBRA (Continuous Binary Re-Adaptation), a dynamic binary
optimization framework for single-threaded and multithreaded applications. It
includes components for collective monitoring and dynamic profiling, profile and
trace management, code optimization, and code deployment. The monitoring component
collects hot branches and performance information from multiple working threads
with the support of the OS and hardware performance monitors, and sends the data
to the dynamic profiler. The dynamic profiler accumulates performance-bottleneck
profiles, such as cache miss information, along with hot branch traces. The
optimizer generates newly optimized binary traces and stores them in the code
cache. The profiler and optimizer interact closely with each other to achieve more
effective code layout and fewer data cache miss stalls. The continuous profiling
component monitors only the performance behavior of optimized binary traces and
generates feedback that determines the efficiency of the optimizations, guiding
continuous re-optimization. The framework is currently implemented on Itanium 2
based CMP, SMP and cc-NUMA systems.
This thesis proposes a new phase detection scheme and hardware support, designed
especially for dynamic optimizations, that effectively identify and accurately
predict program phases by exploiting program control flow information. This scheme
applies not only to single-threaded programs, but even more efficiently to
multithreaded programs. Our proposed phase detection scheme effectively identifies
dynamic intervals: contiguous, variable-length intervals aligned with dynamic code
regions that show distinct single-threaded and parallel program phase behavior.
Two efficient phase-aware runtime program monitoring schemes are implemented in
our COBRA framework: sampled Basic Block Vector (BBV)-based and sampled Hot
Working Set (HWSET)-based program phase detection. We show that the sampled
HWSET-based scheme achieves higher phase coverage and longer stable phases than
the sampled BBV-based scheme. We also propose dynamic code region (DCR)-based
program phase detection hardware for dynamic optimization systems, and show that
the proposed hardware exhibits the characteristics desired of a phase detector
for dynamic optimization.
This thesis also proposes a persistent dynamic profile management scheme for
continuous re-optimization. The code-region-based profile manager stores dynamic
control flow information, including hot paths and loops, and classifies profiles
according to an entropy calculated from the frequency vectors of taken branches
and load latencies. This profile characterization and classification minimizes
the explosion of persistent runtime profiles and the overhead of profile
collection for continuous re-optimization.
We implemented two dynamic compiler optimizations to reduce the impact of coherent
memory accesses in the OpenMP NAS parallel benchmarks, and use these benchmarks to
show how COBRA can adaptively choose appropriate optimizations in response to
observed changes in runtime program behavior. The optimizations improve the
performance of the OpenMP NAS parallel benchmarks (BT, SP, LU, FT, MG, CG) by up
to 15% with an average of 4.7% on a 4-way Itanium 2 SMP server, and by up to 68%
with an average of 17.5% on an SGI Altix cc-NUMA system.
Dedicated to my parents and my wife
Acknowledgments
It is time to thank my family, advisors, and friends for all the support and love
that made it possible to finish this work. First of all, I would like to thank my
wife, Aejung Min, for dedicating her time to taking care of daily chores and our
two playful sons. It was always fun to spend time with my lovely first son,
Donghyun Kim, who has grown up nicely and become a playful and smart 4th grader.
I was always happy to see a big smile from my second son, Kevin D. Kim, who was
born in the US and has grown into a happy and healthy 3-year-old. The strong
support of my family kept me on track to finish this work.
I would like to thank my parents, Jong-Kyu Kim and Kee-Hyun Park, for believing
in me in whatever I was doing and for their strong support all along. My father,
especially, has been a good mentor and friend in my life. Through his own life,
he showed me how a person can grow to be responsible and loving toward their family.
Professor Pen-Chung Yew has been a great advisor on my work and on every decision
made during my graduate studies. I would like to thank him for spending his
invaluable time discussing every detail of my work and giving me thoughtful
suggestions. He has been a definite role model for me as a productive researcher
and professor.
Professor Wei-Chung Hsu gave me the opportunity to work on the dynamic
optimization project and has been a great co-advisor. He has been an energetic
leader of the project and a technically sound debater on every bit of detail when
discussing research ideas with students. I have been really fortunate to have had
the chance to work with him and learn the nuts and bolts of compiler optimization
techniques.
Sreekumar V. Kodakara, a fellow graduate student, has been a good research
collaborator and dear friend throughout my thesis work. Numerous days and nights
of hard work on papers could not take away his humor and smile, which made our
collaboration an enjoyable and delightful experience.
I would like to thank the fellow graduate students in the DYNOPT research group,
namely Howard Chen, Jiwei Lu, Sourabh Joshi, Ananth Lingamneni, Abhinav Das, and
Lao Fu. Group discussions with them greatly influenced my thesis work. I would
also like to thank the fellow graduate students in the Aggassiz research group,
namely Tong Chen, Shengyue Wang, Xiaoru Dai, Jin Lin, Venkatesan Packirisamy,
Kiran S. Yellajyosula, and Jin Woo Jung.
I would like to thank Professor David J. Lilja for providing insightful
suggestions on the collaborative work with Sreekumar V. Kodakara and on my thesis
work. Professor Mats Heimdahl is gratefully acknowledged for serving on my
committee and giving me suggestions to improve my thesis.
Finally, I would like to acknowledge the funding agencies and companies that
supported this work. This work was supported in part by National Science
Foundation grant no. EIA-0220021, Intel, HP, Sun, and the Minnesota Supercomputing
Institute. It was also supported in part by an IT National Scholarship from the
Ministry of Information and Communication of the Korean government.
Contents
Chapter 1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Phase Detection and Prediction for Dynamic Optimizations . 3
1.1.2 Profile Characterization and Classification . . . . . . . . . . . 4
1.1.3 Optimizing Coherent Misses via Binary Re-Adaptation . . . . 5
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Phase Detection and Prediction for Dynamic Optimization . . 6
1.2.2 Profile Characterization and Classification . . . . . . . . . . . 7
1.2.3 Optimizing Coherent Misses via Binary Re-Adaptation . . . . 8
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 2 Related Works 12
2.1 Dynamic Optimizations for Parallel and Multithreaded Programs . . 12
2.2 Phase Detection and Prediction . . . . . . . . . . . . . . . . . . . . . 13
2.3 Profile Characterization and Classification . . . . . . . . . . . . . . . 15
2.4 Reducing Coherent Misses . . . . . . . . . . . . . . . . . . . . . . . . 16
Chapter 3 COBRA: A Continuous Binary Re-Adaptation Framework 17
3.1 COBRA System Architecture . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Startup Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Monitoring Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Optimizer Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 4 Phase-Aware Runtime Program Monitoring 23
4.1 Phase Detection for Dynamic Optimization Systems . . . . . . . . . . 24
4.2 Extended Calling Context Tree . . . . . . . . . . . . . . . . . . . . . 24
4.3 Dynamic Code Region . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.1 Dynamic Code Region Analysis . . . . . . . . . . . . . . . . . 26
4.3.2 Stable and Transition Intervals . . . . . . . . . . . . . . . . . 28
4.4 Sampling-based Program Phase Tracking . . . . . . . . . . . . . . . . 30
4.4.1 Sampled BBV-based Program Phase Detection . . . . . . . . . 30
4.4.2 Sampled HWSET-based Program Phase Detection . . . . . . . 31
4.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Global Program Phase on Multithreaded Programs . . . . . . . . . . 35
4.5.1 Global Program Phase . . . . . . . . . . . . . . . . . . . . . . 35
4.5.2 Exploiting Global Program Phase . . . . . . . . . . . . . . . . 36
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Chapter 5 Hardware Support for Program Phase Tracking 40
5.1 Dynamic Code Region (DCR): A Unit of Monitoring and Re-optimization 40
5.1.1 Tracking Dynamic Code Region as a Phase . . . . . . . . . . . 40
5.1.2 Correlation between Dynamic Code Regions and Program Per-
formance Behaviors . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 DCR-based Phase Tracking and Prediction Hardware . . . . . . . . . 44
5.2.1 Identifying function calls and loops in the hardware . . . . . . 44
5.2.2 Hardware Description . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.3 Handling special cases . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . 50
5.3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4.1 Results for Phase Detection Hardware . . . . . . . . . . . . . 52
5.4.2 Comparison with BBV Technique . . . . . . . . . . . . . . . . 56
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 6 Continuous and Persistent Profile Management 63
6.1 Continuous and Persistent Profile-Guided Optimization . . . . . . . . 63
6.2 Similarity of Sampled Profiles . . . . . . . . . . . . . . . . . . . . . . 65
6.2.1 Similarity Metric . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.2 Accuracy of Persisted Profiles . . . . . . . . . . . . . . . . . . 67
6.3 Entropy-Based Profile Characterization . . . . . . . . . . . . . . . . . 68
6.3.1 Information Entropy: A Metric for Profile Characterization . . 68
6.3.2 Entropy-Based Adaptive Profiler . . . . . . . . . . . . . . . . 70
6.4 Entropy-Based Profile Classification . . . . . . . . . . . . . . . . . . . 70
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 72
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 7 Optimizing Coherent Misses via Binary Re-Adaptation 80
7.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2 Optimizing Coherent Misses . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.1 Impact on Execution Time . . . . . . . . . . . . . . . . . . . . 92
7.4.2 Impact on L3 Cache Misses . . . . . . . . . . . . . . . . . . . 94
7.4.3 Impact on Memory Bus Transactions . . . . . . . . . . . . . . 94
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Chapter 8 Conclusions and Future Works 98
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Figures
3.1 COBRA framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 The startup sequence of 4-threaded OpenMP program with COBRA
framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Partial ECCT of gzip and gcc is shown. The grey nodes are the roots
of the sub-trees that form the DCRs and the rectangles mark the DCRs 25
4.2 Sampled BBV-based Program Phase Detection . . . . . . . . . . . . . 30
4.3 Sampled HWSET-based Program Phase Detection . . . . . . . . . . . 31
4.4 Phase coverage and stable phase of sampled BBV-based phase detec-
tion scheme on SPEC CPU2000 benchmarks . . . . . . . . . . . . . . 32
4.5 Phase coverage and stable phase of sampled HWSET-based phase de-
tection scheme on SPEC CPU2000 benchmarks . . . . . . . . . . . . 33
4.6 Comparison of BBV-based and HWSET-based program phase detection 34
4.7 Global program phase behavior on multithreaded programs . . . . . . 37
4.8 Performance scale-up of OpenMP swim on Itanium 2 4-way Mckinley
and 8-way Montecito machine . . . . . . . . . . . . . . . . . . . . . . 38
4.9 Performance of SPEC OMP2001 benchmarks in the different CPU fre-
quencies on Intel Core 2 Quad processor . . . . . . . . . . . . . . . . 38
5.1 An example code and its corresponding ECCT representation. Three
dynamic code regions are identified in the program and are marked by
different shades in the tree. . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Visualizing phase change in bzip2. (a) Change of average CPI dur-
ing program execution. Each point in the graph is the average CPI
observed over an 1-million-instruction interval. (b) Tracking Phase
changes over the time (1 million instruction interval) in bzip2 using
dynamic code regions. The Y-axis shows phase ID. . . . . . . . . . . 43
5.3 Assembly code of a function call (a) and loop (b). The target address
of the branch instruction is the start of the loop and the PC address
of the branch instruction is the end of the loop. . . . . . . . . . . . . 44
5.4 Conditions checked in the phase detection hardware . . . . . . . . . . 46
5.5 Schematic diagram of the hardware phase detector . . . . . . . . . . . 47
5.6 Recursion structure in code and the content of hardware stack during
recursion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.7 Weighted average of the CoV of CPI for different configurations of the
phase detection hardware . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.8 The performance comparison of last-phase predictor and Markov pre-
dictor in detecting the phase of the next interval for 32/64 configuration. 56
5.9 BBV-based phase tracking hardware . . . . . . . . . . . . . . . . . . 57
5.10 Comparison between BBV- and DCR-based phase detection hardware
on the performance of a 256-entry Markov Predictor in predicting the
phase ID of the next interval. A 32-entry accumulator table/hardware
stack and a 64-entry phase signature table were used. The first 2
columns for each benchmark are for BBV method using threshold val-
ues of 10% and 40% of one million instructions, respectively. . . . . . 60
5.11 Comparison of the weighted average of the CoV of CPI between BBV-
and DCR-based phase detection schemes. A 32-entry accumulator ta-
ble/hardware stack and a 64-entry phase signature tables were used.
The first 2 columns for each benchmark are for a threshold value of
10% and 40% of one million instructions respectively. . . . . . . . . . 61
6.1 Continuous profile-guided optimization model . . . . . . . . . . . . . 64
6.2 Convergence of merged profiles of gcc with 200.i input set . . . . . . 67
6.3 Relative frequency distribution of PC address samples (gcc, gzip) . . 69
6.4 Entropy-based profile classification . . . . . . . . . . . . . . . . . . . 71
6.5 Convergence of merged profiles of SPEC CPU2000 benchmarks . . . . 73
6.6 Accuracy of entropy-based adaptive profiler on SPECJBB ver. 1.01 . 75
7.1 OpenMP DAXPY C source code . . . . . . . . . . . . . . . . . . . . 80
7.2 icc compiler generated Itanium assembly code for DAXPY kernel . . 81
7.3 Normalized execution time of OpenMP DAXPY kernel on 4-way Ita-
nium 2 SMP server . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.4 Speedup of coherent memory access optimization on OpenMP NPB
benchmarks. The performance of prefetch version (optimized by Intel com-
piler) is normalized to 1 as the baseline. . . . . . . . . . . . . . . . . . 93
7.5 Number of L3 misses on OpenMP NPB benchmarks . . . . . . . . . . 95
7.6 Number of memory transactions on the system bus on OpenMP NPB
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Tables
4.1 Top 5 dynamic code region statistics on SPEC2000 CPU benchmarks 29
5.1 Number of phases detected for different configurations of the phase
detection hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Average Length of phase detected for different configurations of the
phase detection hardware. . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Comparison of the number of phases detected between BBV- and DCR-
based phase detection schemes. A 32-entry accumulator table/hardware
stack and a 64-entry phase signature table were used. The first 2
columns for each benchmark correspond to a threshold value of 10%,
40% of one million instructions, respectively. . . . . . . . . . . . . . . 59
5.4 Comparison of the average phase length between BBV- and DCR-based
phase detection schemes. A 32-entry accumulator table/hardware stack
and a 64-entry phase signature table were used. The first 2 columns
for each benchmark correspond to a threshold value of 10%, 40% of
one million instructions, respectively. . . . . . . . . . . . . . . . . . . 59
6.1 Entropy of SPEC CPU2000 INT benchmarks . . . . . . . . . . . . . . 74
6.2 Entropy of SPEC CPU2000 FP benchmarks . . . . . . . . . . . . . . 74
6.3 Performance improvement (%) from PGO on vortex with multiple in-
put sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Performance improvement (%) from PGO on vpr with multiple input
sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Performance improvement (%) from PGO on gzip with multiple input
sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1 The number of loops and prefetches in compiler generated OpenMP
NPB binaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 1
Introduction
As we enter the era of multi-core and many-core systems that can integrate from
two to hundreds of processing units on a single chip, immense computing resources
such as processing cores, large memories, and I/O are readily available. These
resources can be exploited with the support of operating systems (OS), compilers,
and thread libraries. The OS supports task-level parallelism for concurrently
executing processes by providing fair and efficient scheduling of shared system
resources. Compilers support automatic parallelization and programmer-annotated
parallelization such as OpenMP, making multithreaded programs relatively easier
to write; compiler-generated parallel code executes as lightweight threads.
Programmers can also write multithreaded programs with thread libraries such as
the Linux pthread library. The increasing development and use of multithreaded
programs poses a huge challenge for compiler optimizations due to the dynamic
behavior of multithreaded programs.
Compiler optimizations have to improve not only the performance of each thread,
but also the overall performance of the multithreaded application. Because the
number of threads and the data working set size change during parallel execution,
dynamic parallel program behavior makes it difficult for a static compiler to
generate a high-performance binary for multiprocessor systems. To cope with this
problem, adaptive dynamic optimizations could be applied during various stages of
program development and deployment, such as in system libraries, algorithms, and
compilation. Recently, with the advent of profile-guided optimizations using
Hardware Performance Monitors (HPM), re-optimizing the binary at runtime has
proven to be a promising approach: it can adapt a binary to changing program
behavior, data working set sizes, and system configurations.
To deploy compiler optimizations efficiently at runtime for multithreaded
programs, this thesis presents an HPM-based continuous profiling and optimization
framework called COBRA (Continuous Binary Re-Adaptation). We investigate the use
of program phases to precisely detect and predict changing program behavior by
exploiting program control flow information such as loops and function calls, and
we propose software- and hardware-based phase detection and prediction schemes
for dynamic optimizations. COBRA manages runtime profiles in a persistent manner
to enable continuous re-optimization. We propose using information entropy to
characterize dynamic profiles, and show that it can also be effectively applied
to profile classification. Finally, we implemented dynamic re-optimization of
data prefetching to minimize unnecessary coherence misses in multithreaded
applications.
1.1 Problem Statement
This thesis addresses the following problems: phase detection and prediction for
dynamic optimizations, profile characterization and classification, and optimizing co-
herent misses via dynamic optimization.
1.1.1 Phase Detection and Prediction for Dynamic Optimizations
Understanding and predicting a program’s execution phase is crucial to dynamic
optimizations and dynamically adaptable systems. Accurate classification of pro-
gram behavior creates many optimization opportunities for adaptive reconfigurable
microarchitectures, dynamic optimization systems, efficient power management, and
accelerated architecture simulation [22, 49, 51, 23, 53, 37, 8, 14, 7].
Dynamically adaptable systems [22, 51, 29, 41] have exploited the phase behavior
of programs to adaptively reconfigure microarchitectural structures such as the
cache size. A dynamic optimization system optimizes the program binary at runtime
using code transformations to increase program execution efficiency; dynamic
binary translation also falls into this category. In such systems, program phase
behavior has been exploited for dynamic profiling and code cache management
[37, 8, 26, 43, 12, 15, 14]. For example, the performance of code cache
management relies on how well the system tracks changes in the instruction
working set.
Current dynamic optimization systems continuously track program phase changes
either by sampling performance counters or by instrumenting the code. For
sampling-based profiling systems, the sampling rate usually dominates the
overhead. While a low sampling rate avoids high profiling overhead, it can also
miss optimization opportunities and yield an unstable system in which
reproducibility is compromised. Program phase detection and prediction can
control the profiling overhead more efficiently by adaptively adjusting the
sampling rate or applying burst instrumentation. For example, if the program
execution is in a stable phase, profiling overhead can be minimized (e.g., by
lowering the sampling rate), while a new phase would trigger burst profiling.
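As a concrete illustration of this policy, the sketch below shows one way a phase-aware profiler might adjust its sampling interval. It is a minimal sketch under assumed constants and function names of our own invention, not COBRA's actual control loop.

```python
# Hypothetical phase-aware sampling-rate controller (illustrative only).
# On a phase change, fall back to dense "burst" profiling; while the phase
# stays stable, back off exponentially toward a low sampling rate.

BASE_INTERVAL = 100_000    # assumed burst interval: one sample per 100K instructions
MAX_INTERVAL = 1_000_000   # assumed lowest rate, used deep inside a stable phase

def next_sampling_interval(current_interval, phase_is_new):
    """Return the sampling interval for the next profiling window."""
    if phase_is_new:
        return BASE_INTERVAL                       # burst: profile the new phase densely
    return min(current_interval * 2, MAX_INTERVAL)  # stable: halve the sampling rate

# A stable run backs off 100K -> 200K -> 400K ... until capped at 1M,
# and any phase change snaps the interval back to 100K.
```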
A phase detection technique developed for dynamically adaptable systems is less
applicable to a dynamic optimization system, because collecting performance
characteristics at regular time intervals with arbitrary boundaries in the
instruction stream is not as useful as gathering performance profiles of
instruction streams that are aligned with program control structures. We
therefore introduce the concept of a Dynamic Code Region (DCR) and use it to
model program execution phases. A DCR is a node and all of its child nodes
(i.e., the subtree rooted at that node) in the extended calling context tree
(ECCT) of the program; the ECCT is an extension of the calling context tree
(CCT) proposed by Ammons et al. [4] with the addition of loop nodes.
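The ECCT and DCR notions can be sketched in a few lines. The following is an illustrative model only (the class, field, and method names are ours, not COBRA's): nodes represent either calls or loops, and a DCR is the subtree rooted at a node.

```python
# Illustrative ECCT sketch: a calling context tree extended with loop nodes,
# where a dynamic code region (DCR) is a node plus its entire subtree.

class ECCTNode:
    def __init__(self, name, kind):
        self.name = name          # function or loop identifier
        self.kind = kind          # "call" or "loop"
        self.children = []

    def add_child(self, name, kind):
        child = ECCTNode(name, kind)
        self.children.append(child)
        return child

    def dcr(self):
        """The dynamic code region rooted here: this node and all descendants."""
        region = [self]
        for child in self.children:
            region.extend(child.dcr())
        return region

# Example tree: main() calls foo(), which contains a loop enclosing bar().
root = ECCTNode("main", "call")
foo = root.add_child("foo", "call")
loop = foo.add_child("foo.loop1", "loop")
loop.add_child("bar", "call")

# The DCR rooted at foo covers foo, its loop, and bar.
print([n.name for n in foo.dcr()])   # ['foo', 'foo.loop1', 'bar']
```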
Based on this study of DCR-based phase detection, this thesis proposes a
sampling-based software phase detection and prediction scheme for dynamic
optimizations. Furthermore, we propose efficient phase detection and prediction
hardware.
1.1.2 Profile Characterization and Classification
To enable advanced profile-guided optimizations (PGO) in a dynamic compiler and
binary optimization system, dynamic profiles are usually collected through
sampling and runtime instrumentation. The re-compilation process relies on
accurate HPM (Hardware Performance Monitor)-sampled profiles accumulated over
several executions of the application program. HPM-sampled profiles can capture
precise runtime performance events, such as cache misses and resource
contention, that enable more effective runtime or offline optimizations
[39, 16, 37]. Sampling-based profile management has thus become an essential
part of a continuous profile-guided optimization framework. However, many
production compilers still depend on instrumentation-based profiles for their
PGO, because some optimizations, such as complete loop unrolling, require
information like a loop's iteration count that may not be obtained accurately
from sampling-based profiles.
To obtain more accurate profiles at a low sampling frequency, sampled profiles
can be merged across multiple runs and stored on disk. Due to the statistical
nature of sampling, the quality of sampled profiles is greatly affected by the
sampling rate. As the sampling frequency increases, more samples are collected
and the accuracy of the sampled profile improves; unfortunately, the sampling
overhead also increases. A high sampling frequency causes more interrupts,
requires more memory to store sampled data, and incurs more disk I/O to keep the
profile data persistent. With a fixed number of runs, the challenge for the
runtime profiler is to determine a sampling rate that still yields high-quality
profiles with minimum sampling overhead. This thesis introduces the use of
information entropy to characterize dynamic profiles, and shows that it is also
effective for profile classification.
1.1.3 Optimizing Coherent Misses via Binary Re-Adaptation
Compiler optimizations for memory accesses have become extremely important in
the face of ever-increasing memory latency. Larger cache memories and data
prefetching have proven very effective in reducing cache misses and hiding cache
miss latency. Processors without hardware data prefetchers, such as the Intel
Itanium, rely on effective compiler-generated prefetches to minimize the
performance impact of long memory latencies; consequently, modern compilers for
such processors are very aggressive in generating data cache prefetch
instructions.
Aggressive data cache prefetching can be very effective for applications such as
dense matrix-oriented numerical codes, since their memory access patterns are
highly predictable on single-processor systems. In a multiprocessor environment
with multi-level caches, however, cache behavior becomes less predictable
because it depends heavily on system bus contention and on the coherence misses
generated by both true-sharing and false-sharing data accesses. This thesis
proposes a dynamic binary re-adaptation technique to minimize unnecessary cache
misses caused by aggressive data prefetching.
1.2 Thesis Contributions
The contributions of the thesis are given below.
1.2.1 Phase Detection and Prediction for Dynamic Optimization
We introduce dynamic intervals: contiguous, variable-length intervals aligned
with dynamic code regions. In traditional compiler analysis [42], interval
analysis is used to identify regions in the control flow graph at compilation
time. We define dynamic intervals as instruction streams that are aligned with
code regions and that exhibit distinct phase behavior at runtime. Dynamic
intervals, as program phases, can be easily identified by tracking dynamic code
regions. We track higher-level control structures such as loops and procedure
calls during program execution; by tracking these structures, we can effectively
detect changes in dynamic code regions, and hence phase changes. Intuitively,
this works because programs exhibit different phase behaviors as control
transfers through procedures, nested loop structures, and recursive functions.
In [34], tracking loops and procedures was reported to yield phase tracking
accuracy comparable to the Basic Block Vector (BBV) method [51, 35], which
supports our observation.
We also propose dynamic code region (DCR)-based phase tracking hardware for
dynamic optimization systems. We track the code signatures of procedure calls
and loops using a special hardware stack, and compare them against previously
seen code signatures to identify dynamic code regions. We show that the detected
dynamic code regions correlate well with the phases observed during program
execution.
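A software analogue of this stack-based tracker might look as follows. This is a hedged sketch of the idea only (the class and method names are invented), not the hardware design evaluated in Chapter 5: push an identifier on call or loop entry, pop on exit, and look the current nesting path up in a table of previously seen signatures to assign a phase ID.

```python
# Illustrative software model of stack-based DCR phase tracking: the stack
# mirrors the hardware stack of active calls/loops, and the signature table
# maps previously seen nesting paths to phase IDs.

class PhaseTracker:
    def __init__(self):
        self.stack = []            # active procedure calls and loops
        self.signatures = {}       # code signature -> phase ID
        self.next_id = 0

    def enter(self, region):       # procedure call or loop entry
        self.stack.append(region)

    def leave(self):               # matching return or loop exit
        self.stack.pop()

    def current_phase(self):
        sig = tuple(self.stack)    # code signature = current nesting path
        if sig not in self.signatures:
            self.signatures[sig] = self.next_id   # new region -> new phase
            self.next_id += 1
        return self.signatures[sig]

tracker = PhaseTracker()
tracker.enter("main")
tracker.enter("solver_loop")
a = tracker.current_phase()        # first time this region is seen -> new ID
tracker.leave()
tracker.enter("solver_loop")
b = tracker.current_phase()        # same nesting path again -> same phase ID
assert a == b
```

Re-entering the same loop under the same calling context reproduces the same signature, so the tracker reports a recurring phase rather than a new one.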
The primary contributions on phase detection and prediction are as follows:
• We showed that dynamic intervals, which correspond to dynamic code regions
aligned with the boundaries of procedure calls and loops, can accurately
represent program behavior.
• We proposed new phase tracking hardware that consists of a simple stack and a
phase signature table. Compared with previously proposed schemes, this structure
detects a smaller number of phases, and the detected phases are longer. It also
gives more accurate predictions of the next execution phase.
1.2.2 Profile Characterization and Classification
We propose using “information entropy” to determine adaptive sampling rates for
automated profile collection and processing that can efficiently support continuous
re-optimization in a pre-JIT environment. The information entropy is a good way
to summarize the frequency distribution into a single number [18]. Since a sampled
profile in our study is a frequency profile of collected PC addresses, the information
entropy of the profile is well suited for characterizing program behaviors in which we
are interested.
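For illustration, the entropy of a sampled PC profile can be computed as follows. This is a sketch with names of our own choosing, not COBRA's implementation:

```python
import math
from collections import Counter

def profile_entropy(pc_samples):
    """Shannon entropy (in bits) of a frequency profile of sampled PC addresses."""
    counts = Counter(pc_samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A profile dominated by one hot address has lower entropy than a flat one.
hot = [0x4000] * 97 + [0x4010, 0x4020, 0x4030]
flat = list(range(100))
assert profile_entropy(hot) < profile_entropy(flat)
```

A peaked profile (a few hot addresses) thus collapses to a small entropy value, while a flat profile approaches log2 of the number of distinct addresses.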
In practice, a program has multiple input data sets and may exhibit different program
behaviors for each particular input set. Hence, the entropy of a profile could be
different according to the input used. Even though it is difficult to predict which
input set is to be used in each run, if the execution time of the program is sufficiently
long, the sampling rate can be adjusted accordingly using the entropy information
collected during execution. On the other hand, if the execution time is very short,
the overhead from a high sampling rate will be insignificant since we conduct a small
number of runs for sample collection.
In the presence of multiple input sets, existing profile-guided optimization (PGO)
schemes simply merge the profiles collected from all input sets. PGO based on the
merged profile might miss some opportunities for performance gains from
specialized optimizations more suitable for certain input sets. We show that the
information entropy can be used to classify profiles with similar behavior. The
classified profiles allow the optimizer to generate specially optimized versions for
particular input sets.
The primary contributions on profile characterization and classification are as follows:
• We show that highly accurate profiles can be obtained efficiently by merging a
number of profiles collected over repeated executions with low sampling rates.
We demonstrate this approach by using the SPEC2000 benchmarks.
• We also show that a simple characterization of profiles using information entropy
can be used to automatically set the sampling rate for the next profiling run.
On SPECjbb2000, our adaptive profiler obtains a very accurate profile (a 94.5%
match with the baseline profile) with only 8.7% of the samples needed when
using 1M-instruction sampling intervals.
• We show that the entropy of a profile could be used to classify different program
behaviors according to different input sets and to generate classified profiles for
targeted optimizations.
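One way entropy could drive the sampling rate for the next run is sketched below. The thresholds, scale factors, and the direction of adjustment are purely illustrative; they are not the values used in this thesis:

```python
def next_sampling_rate(entropy, base_rate, low=2.0, high=6.0):
    """Choose the sampling rate for the next profiling run from the entropy
    of the current profile. Thresholds and scale factors are illustrative."""
    if entropy < low:
        # A few hot addresses dominate: a sparse sample already captures them.
        return base_rate // 4
    if entropy > high:
        # Behavior is spread widely: sample more densely for the same accuracy.
        return base_rate * 4
    return base_rate
```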
1.2.3 Optimizing Coherent Misses via Binary Re-Adaptation
We implemented two different dynamic binary optimizations in COBRA. The first
optimization uses dynamic profile information to select appropriate prefetch hints
related to coherent memory accesses. As more processing cores and larger cache
memories are being integrated on chip, coherent memory accesses could limit the
scalability of parallel programs. If a program experiences frequent coherent misses
due to truly-shared and falsely-shared data, even larger caches cannot help to reduce
such bus accesses. Cache-coherent L2 write misses can lead to L3 misses, especially
in invalidation-based cache coherence protocols. The Itanium 2 supports the .excl hint
for the lfetch instruction, which prefetches a cache line in exclusive state instead of
the usual shared state. This can remove the cost of requesting the exclusive state at
the actual write operation. However, the effectiveness of such hints largely depends
on the program's runtime behavior.
The second optimization reduces the aggressiveness of prefetching. Modern compilers
have been very aggressive in generating data prefetch instructions to hide potential
large memory latency from cache misses for each thread. However, such aggressive
prefetching in a thread could exert tremendous stress on the system bus if most of its
prefetches turn out to be useless or unnecessary. This might have no effect on a single-core
system, but could have a devastating effect on a multi-core system. Using dynamic
profiling at runtime, we could identify and eliminate those unnecessary prefetches
from a processor and free up the bus and memory bandwidth for other processors.
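The decision of which prefetch sites to drop can be sketched as follows. The per-site counters and the 10% usefulness threshold are hypothetical, not COBRA's actual heuristic:

```python
def useless_prefetches(prefetch_stats, min_issued=1000, useful_ratio=0.10):
    """Return prefetch sites (PC addresses) whose prefetched lines are rarely
    used, making them candidates for being patched into NOPs."""
    return [pc for pc, (issued, useful) in prefetch_stats.items()
            if issued >= min_issued and useful / issued < useful_ratio]

stats = {0x100: (5000, 4500),   # almost always useful: keep
         0x200: (8000, 200)}    # mostly wasted bandwidth: candidate for NOP
assert useless_prefetches(stats) == [0x200]
```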
To demonstrate the feasibility and potential benefits of the COBRA framework, we
use the OpenMP NAS parallel benchmarks on a 4-way SMP server and an SGI Altix
cc-NUMA system. The contributions on optimizing coherent misses are as follows:
• Using an OpenMP version of the DAXPY kernel, we show that statically compiled
binaries cannot provide consistent performance in the presence of a
changing runtime environment. A runtime binary optimizer can adapt the
binary better to the changing runtime behavior.
• To the best of our knowledge, COBRA is the first implementation of an HPM-based
runtime binary optimization framework for multithreaded applications.
We discuss the trade-offs in the design of a robust and scalable runtime binary
optimizer, which include: thread monitoring, dynamic profiling, trace management,
system-wide dynamic compiler optimizations, and code deployment.
• We implemented two dynamic compiler optimizations to reduce the impact of
coherent memory accesses in OpenMP NAS parallel benchmarks. The opti-
mizations improve the performance of OpenMP NAS parallel benchmarks (BT,
SP, LU, FT, MG, CG) up to 15% with an average of 4.7% on a 4-way Itanium 2
SMP server, and up to 68% with an average of 17.5% on an SGI Altix cc-NUMA
system.
1.3 Outline of Thesis
Chapter 2 describes the related work on dynamic optimizations for parallel and
multithreaded programs, phase detection and prediction, profile characterization and
classification, and optimizing coherent misses on cache coherent multiprocessor sys-
tems.
Chapter 3 describes implementation details of a runtime binary optimization frame-
work, called COBRA (Continuous Binary Re-Adaptation). We describe the major
functional components of the COBRA framework and its startup model, and then
describe the optimizer thread and the monitoring threads in more detail.
Chapter 4 describes efficient phase-aware runtime program monitoring schemes implemented
in our COBRA framework. We investigate the use of control flow information,
such as loops and function calls, in order to identify repetitive program behavior.
We describe sampled Basic Block Vector (BBV)-based and Hot
Working Set (HWSET)-based program phase detection schemes. The sampled HWSET-based
scheme shows higher phase coverage and longer stable phases than the
sampled BBV-based scheme.
Chapter 5 describes our proposed Dynamic Code Region (DCR)-based program phase
detector for dynamic optimization. We show that our proposed hardware exhibits the
desired characteristics of a phase detector for dynamic optimization systems.
Chapter 6 describes techniques to characterize and classify dynamic profiles for dy-
namic compilation and optimization systems. We show that simple characterization
of the profile with information entropy can effectively guide the sampling rate for a
profiler. The entropy-based approach provides a good foundation for continuous pro-
filing management and effective profile guided optimization in a dynamic compilation
environment.
Chapter 7 describes runtime binary re-adaptation techniques that improve the per-
formance of some OpenMP parallel programs by reducing the aggressiveness of data
prefetching and using exclusive hints for prefetch instructions.
Chapter 8 concludes this work and describes future work.
Chapter 2
Related Work
Dynamic optimization has been used in the context of dynamic compilation and
optimization systems such as Java Virtual Machine [7, 13], runtime binary trans-
lation [24, 47, 52, 31] and optimization [8, 37, 12, 15, 11, 20, 59]. Prior runtime
binary optimization systems [8, 37, 12, 15, 11, 20, 59] were developed to improve the
performance of single-threaded applications. In contrast, COBRA is designed to concurrently
monitor multiple threads and to optimize the binary on multiprocessors; its
design for thread monitoring, profile processing, and trace management is therefore
significantly different from that of binary optimizers for single-threaded applications. Furthermore,
optimization decisions are based on profiles collected from multiple threads, or mul-
tiple runs, to determine if a system-wide optimization is needed.
2.1 Dynamic Optimizations for Parallel and Multithreaded
Programs
ADAPT [57] is a generic compiler-supported framework for high-level adaptive
program optimizations. The ADAPT compiler accepts user-supplied heuristics and
generates a complete runtime system to apply these heuristics dynamically. ADAPT
is applicable to both serial and parallel programs. However, given the variety of
options and the importance of high performance for parallel programs, ADAPT is
particularly well suited to these types of applications.
Thomas et al. [54] proposed a general framework for adaptive algorithm selection and
used it on the Standard Template Adaptive Parallel Library (STAPL) [5]. When
STAPL is first installed on the system, statically available information about the
architecture and the environment is collected. Performance characteristics for the
algorithmic options available in the library are then computed. This data is stored
in a repository, and machine learning techniques are used to determine the tests that
will be used at run-time for selecting an algorithmic option. At run-time, necessary
performance characteristics are collected and then a decision about which algorithmic
option to use is made.
ATLAS [58, 21] is a linear algebra library generator that makes use of domain-specific
algorithmic information. It generates platform-optimized Basic Linear Algebra
Subroutine (BLAS) routines by searching over blocking strategies, operation schedules,
and degrees of unrolling. SPIRAL [45] automatically generates high-performance code
that is tuned to the given platform. SPIRAL formulates the tuning as an optimiza-
tion problem and exploits the domain-specific mathematical structure of the trans-
formation algorithms to implement a feedback-driven optimizer. SPIRAL generates
high-performance code for a broad set of DSP transformations, including the discrete
Fourier transformations, other trigonometric transformations, filter transformations,
and discrete wavelet transformations.
2.2 Phase Detection and Prediction
In previous work, researchers have studied phase behavior to dynamically reconfig-
ure microarchitecture and re-optimize binaries. In order to detect the change of
program behavior, metrics representing program runtime characteristics were col-
lected [22, 49, 51, 23, 48, 37]. If the difference of metrics, or code signature, between
two intervals exceeds a given threshold, a phase change is detected. The stability of a
phase can be determined by using performance metrics (such as CPI, cache misses,
and branch misprediction) [23, 37], similarity of code execution profiles (such as in-
struction working set, basic block vector) [22, 49, 51], data access locality (such
as data reuse distance) [48], and indirect metrics (such as entropy) [37].
Our work uses calling context in an extended calling context tree (ECCT) as a signa-
ture to distinguish distinct phases. Similar techniques were used for locating reconfig-
uration points to reduce CPU power consumption [41, 28], where the calling context
was analyzed on the instrumented profiles. Our proposed phase tracking hardware
could effectively find similar reconfiguration points; for example, it is useful for phase-aware
power management in embedded processors. Huang et al. [29] also propose
to track calling context by using a hardware stack for microarchitecture adaptation
in order to reduce processor power consumption. W. Liu and M. Huang [36] propose
to exploit program repetition to accelerate detailed microarchitecture simulation by
examining procedure calls and loops in the simulated instruction streams. Hind
et al. [27] identified two major parameters (granularity and similarity) that capture
the essence of phase shift detection problems.
In dynamic optimization systems [8], it is important to maximize the amount of time
spent in the code cache because trace regeneration overhead is relatively high and
may offset performance gains from optimized traces [26]. Dynamo [8] used a preemptive
flushing policy for code cache management, which detected a program phase change
and flushed the entire code cache. This policy is more effective than a policy that
simply flushes the entire code cache when it is full. Accurate phase change detection
would enable more efficient code cache management. ADORE [37, 14] used sampled
PC centroid to track instruction working set and coarse-grain phase changes.
Nagpurkar et al. [43] proposed a flexible hardware-software scheme for efficient remote
profiling on networked embedded devices. It relies on the extraction of meta-information
from executing programs in the form of phases, and then uses this information
to guide intelligent online sampling and to manage the communication of those samples.
They used a BBV-based hardware phase tracker that was proposed in [51] and
enhanced in [35].
2.3 Profile Characterization and Classification
Savari and Young [46] introduced an approach, based on information theory, to
analyze, compare, and combine profiles. They showed how to merge two profiles from
the same program to more effectively guide compiler optimizations, and that an
information-entropy-based hybrid profile works better than other profile-blending
methods. In our work, we use information entropy to adaptively select sampling
rates and to characterize and classify profiles instead of combining them.
Kistler and Franz [32] proposed using a frequency (edge and path) vector to
compare the similarity among profiles. Their proposed similarity metric is based on
the geometric angle and distance between two vectors. Their goal was to determine
whether a program's execution has changed enough to trigger re-optimization. In our
work, we use the Manhattan distance between two profiles as a similarity metric. Our
method is more efficient than Kistler's approach.
Sun et al. [53] showed that information entropy computed over performance events,
such as L2 misses, can be a good metric for tracking changes in program phase behavior.
Our approach used information entropy based on frequency profiles to adaptively
determine sampling rates.
2.4 Reducing Coherent Misses
Collard et al. [17] proposed using system-wide hardware performance monitors, called
SWIFT, to detect pairs of instructions that cause false sharing. The profiles of
false sharing can be fed back into the compiler to enable the LDBIAS and FPBIAS
optimizations. If LDBIAS and FPBIAS are used instead of ordinary load instructions,
the cache line is fetched in exclusive state instead of shared state. FPBIAS is used
for loading floating-point data; LDBIAS is used for all other load operations. In
order to carefully separate out the benefits of prefetching, they excluded the use of the
lfetch.excl instruction. In contrast, we focus on the selective use of the lfetch.excl
instruction to optimize coherent memory accesses.
Tullsen and Eggers [55] pointed out that prefetching can negatively affect bus uti-
lization, overall cache miss rates, memory latencies, and data sharing. They examined
the sources of cache misses in light of several different prefetching strategies
and pinpointed the causes of the performance changes. They simulated the effects
of a set of compiler-directed prefetching strategies, namely NP (no prefetching),
PREF (prefetching), EXCL (exclusive prefetch), LPD (long prefetch distance), and
PWS (prefetch write-shared data more aggressively), on a bus-based multiprocessor.
These prefetching strategies can be implemented in a static compiler, or be applied
when precise runtime profiles are available. In our work, we compare three prefetching
strategies, namely PREF (baseline), NP, and EXCL, on the Itanium 2 processor. The
NP strategy is implemented by turning lfetch instructions into NOP instructions. The
EXCL strategy is implemented by adding the .excl hint to lfetch instructions. The
PREF strategy is used in the optimized binary generated by the optimizing compiler.
Chapter 3
COBRA: A Continuous Binary
Re-Adaptation Framework
In prior work [8, 37, 15, 20, 59], most dynamic optimization systems, such as
Dynamo [8] and ADORE [37], were developed to improve the performance of single-threaded
applications. In order to explore the potential benefit of dynamic optimizations
on multithreaded applications, we proposed a runtime binary optimization
framework, called COBRA (Continuous Binary Re-Adaptation). It is currently implemented
on an Itanium 2 based 4-way SMP server and an SGI Altix cc-NUMA system.
COBRA collects dynamic profiles from each thread using HPM and analyzes them to
find system-wide performance bottlenecks. Currently, the performance events mon-
itored include coherent memory accesses and system bus contention, in addition to
typical performance events for single threaded execution. The aggregated dynamic
profiles are fed into a runtime optimizer to generate optimized binary traces. These
optimized binary traces are stored in a trace cache in the same address space as the
binary program being optimized. The binary program is then patched and redirected
to the optimized traces during execution.
Figure 3.1: COBRA framework
COBRA (COntinuous Binary Re-Adaptation) is implemented as a shared library on
Linux and could be automatically preloaded before other shared libraries are loaded at
the program startup time. Since COBRA is designed to concurrently monitor multiple
threads on multiprocessors, its design for thread monitoring, profile processing, and
trace management is significantly different from that of binary optimizers for single-threaded
applications such as ADORE. Furthermore, optimization decisions are based
on profiles collected from multiple threads to determine if a system-wide optimization
is warranted.
3.1 COBRA System Architecture
Figure 3.1 illustrates the major functional blocks of the COBRA framework. It in-
cludes components for monitoring, profiling, trace management, code optimization
and code deployment. The monitoring component collects performance information
with the support of the OS and the hardware performance monitors, and sends the
data to the profiler. The profiler gathers and processes various sampled HPM
data, such as data cache misses, branch histories, and other event data. The trace
management component maintains prospective binary traces that can be optimized.
The optimizer generates new optimized binary traces and stores them in the code
cache. The profiler and optimizer interact closely with each other in order to reduce
I-cache misses and the impact of data cache misses more efficiently.
As shown in Figure 3.1, two types of supporting threads are invoked for a multi-
threaded program. One is an optimizer thread (shown in Figure 3.2) that orchestrates
profile collection and runtime optimizations. This thread is created during program
startup time. The other is a group of monitoring threads that monitors worker
threads. A monitoring thread is created when a worker thread is forked. If an appli-
cation program executes with four threads, one optimizer thread and four monitoring
threads will be created by COBRA.
3.2 Startup Model
Figure 3.2 illustrates the startup sequence of a 4-threaded OpenMP parallel program
running under COBRA. In Linux, when a program starts to run, the dynamic linking
loader invokes a libc entry-point routine called __libc_start_main, within which the
main function is called. COBRA is preloaded as a shared library and provides a function
wrapper for __libc_start_main that redirects control to an initialization routine and
spawns the optimizer thread before starting the application program.
Figure 3.2: The startup sequence of a 4-threaded OpenMP program with the COBRA framework
The functions of the two types of threads are explained in the following sections.
3.3 Monitoring Threads
The code optimizer in the COBRA framework relies mainly on accurate dynamic
profiles collected by the monitoring threads. The monitoring threads continuously
sample the performance counters and record cache miss events to guide binary optimizations.
On the Itanium 2 processor, hundreds of processor performance events,
including CPU cycles, the number of retired instructions, and stall cycles for each back-end
instruction pipeline stage, can be monitored; four of them can be monitored
concurrently. To build hot traces for binary trace optimizations, the monitoring threads
also sample the Branch Trace Buffer (BTB), which keeps track of the four address pairs
from the last four taken branches and their targets.
Each monitoring thread tracks signals from the perfmon [2] sampling kernel drivers.
Sampled data are stored in the kernel buffer initially, and when the kernel buffer
is full, a signal is raised to the monitoring thread. Once it catches a signal, it stores
the content of performance counters from the kernel memory area to a user memory
area, called User Sampling Buffer (USB). Each sample consists of a sample index,
Program Counter (PC) address, process ID, thread ID, processor ID, four perfor-
mance counters, eight BTB entries, data cache miss instruction address, miss latency,
and miss data cache line address. The process ID, thread ID and processor ID are
used to tag each sample for a better and more precise understanding of each thread
in the multi-threaded application. The four performance counters could be used to
track performance bottlenecks. For example, using the number of L2 and L3 misses
per 1000 instructions could track the changes in cache miss patterns for detecting
changes in data working sets and their access behavior. The eight BTB entries are
used for building hot execution traces for later optimizations. The data cache miss
instruction, data address, and miss latency are accumulated to pinpoint the exact
instructions that caused the most cache misses. We used this information to find the
delinquent loads [37, 19].
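The sample layout and the delinquent-load aggregation described above might be modeled as in the following sketch; the class, field, and function names are ours, not COBRA's:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One User Sampling Buffer entry, with the fields listed above."""
    index: int
    pc: int
    pid: int
    tid: int
    cpu: int
    counters: tuple   # four performance counter values
    btb: tuple        # eight Branch Trace Buffer entries
    miss_pc: int      # data cache miss instruction address
    miss_latency: int
    miss_line: int    # missed data cache line address

def delinquent_loads(samples, top=3):
    """Accumulate miss latency per missing instruction and return the
    addresses responsible for the most total cache-miss latency."""
    total = {}
    for s in samples:
        total[s.miss_pc] = total.get(s.miss_pc, 0) + s.miss_latency
    return sorted(total, key=total.get, reverse=True)[:top]
```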
3.4 Optimizer Thread
The optimization thread orchestrates the overall initialization, trace selection, op-
timization, and trace cache management. Notably, there is only one optimization
thread in our initial implementation. This design choice simplifies the implementation
and enables centralized control over the multiple monitoring threads.
At startup, all hardware performance counters are initialized by the perfmon sampling
kernel device driver. The list of available processors is registered in a shared memory
area. The Kernel Sampling Buffer is also allocated in the shared memory area. The
memory pages allocated to the Kernel Sampling Buffer may reside in different process-
ing nodes. We rely on the OS to migrate memory pages into relevant processor nodes.
For example, the SGI Altix cc-NUMA system uses a first-touch policy to pin a memory
page to the first processor that accesses it. This scheme works well
if each thread initializes its own portion of the Kernel Sampling Buffer during the
initialization phase.
Trace selection highly depends on the type of optimization applied to the collected
traces. Since our current optimizations mainly focus on adapting data prefetching
on hot loops that consume most of the execution time, trace formation and selection
algorithms are tuned to discover hot loops and leading execution paths to the loops.
The BTB profiles from Itanium 2’s HPM are particularly useful to build loop traces
with relatively infrequent sampling, which keeps the overall overhead low.
Chapter 4
Phase-Aware Runtime Program
Monitoring
Programs spend most of their execution time in frequently executed function calls and
loops. Repeated control flows in a program tend to show similar and stable performance
behavior across the entire execution [35]. This stable performance behavior is considered
program phase behavior. Detecting program phase changes is crucial
to minimizing unnecessary triggering of re-optimization in dynamic optimization systems.
In previous work [22, 49, 51, 23, 35], a program phase is defined as a set of
intervals within a program’s execution that have similar behavior and performance
characteristics, regardless of their temporal adjacency. The execution of a program
was divided into equally sized non-overlapping intervals. An interval is a contiguous
portion (i.e., a time slice) of the execution of a program. Metrics representing pro-
gram runtime characteristics were calculated for every interval. If the difference in
the metrics between two adjacent intervals exceeds a given threshold, a phase change
is assumed. Phase classification partitions a set of intervals into phases with similar
behavior. Phase prediction foretells the phase for the next interval of execution.
4.1 Phase Detection for Dynamic Optimization Systems
Dynamic optimization systems in general have four major components: a phase
detector, a profiler, an optimizer, and a controller. The phase detector tracks
changes in program behavior and predicts the future behavior of the program. De-
pending on the optimizations being targeted, the profiler and the optimizer could
add significant overhead to the overall system. Also, if the code cache is not man-
aged effectively, significant overheads could also occur due to trace re-generation and
re-optimization [26, 25, 10]. The characteristics of the detected phases have a direct
impact on this overhead; some of them are detailed below.
Dynamic optimization systems prefer longer phases. If the phase detector is overly
sensitive, it may trigger profiling and optimization operations too frequently and cause
performance degradation. Phase prediction accuracy is essential to avoid bringing
unrelated traces into the code cache. The code cache management system would also
require information about the phase boundaries to precisely identify the code regions
for the phase. The information about the code structures can be used by the profiler
and the optimizer to identify the code regions for optimization. Finally, it is generally
beneficial to have a small number of phases as long as we can capture most important
program behavior; this is because a small number of phases allow the phase detector
to identify longer phases and to predict the next phase more accurately. It should
be noted that dynamic optimization systems can trade some variability within each
phase for a longer phase length and a higher predictability, as these factors determine
the overhead of the system.
4.2 Extended Calling Context Tree
In order to identify code regions consisting of frequently executed function calls and
loops, we first instrument the whole program execution and represent it as a large
Figure 4.1: Partial ECCT of (a) gzip and (b) gcc. The grey nodes are the roots of the sub-trees that form the DCRs and the rectangles mark the DCRs.
single tree, called the Extended Calling Context Tree (ECCT). The Calling Context Tree
(CCT) was first proposed by Ammons et al. [4]. A CCT is a directed graph G=(N,E),
where N is the set of nodes that represent procedures in the program and E is the set of
edges that connect the procedures. For example, if a procedure proc1 is called
from another procedure proc2, the graph will include two nodes proc1 and proc2 with
a directed edge connecting proc2 to proc1. In CCT, only those procedures that are
called during the execution of the program are present. Nodes representing procedures
in CCT are context sensitive. If a procedure proc1 is called from procedures proc2 and
proc3, the graph will contain two different nodes for proc1. Creating unique nodes for
procedures in each context makes the graph a tree. We added nodes that represent
loops to the CCT, and call the result the Extended Calling Context Tree (ECCT). All properties of
the nodes representing procedures in CCT are also applicable to loop nodes in ECCT.
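A minimal sketch of this context-sensitive node construction (the class and method names are ours):

```python
class Node:
    """An ECCT node: a procedure or loop in one specific calling context."""
    def __init__(self, name):
        self.name = name      # e.g. "proc1 (func)" or "main (loop)"
        self.children = {}    # one child node per callee/loop, per context
        self.insns = 0        # cumulative retired instructions (annotation)

    def child(self, name):
        # The same procedure reached through two different contexts gets two
        # distinct nodes, which is what makes the graph a tree.
        return self.children.setdefault(name, Node(name))

root = Node("main (func)")
p2, p3 = root.child("proc2 (func)"), root.child("proc3 (func)")
# proc1 called from proc2 and from proc3 yields two separate nodes.
assert p2.child("proc1 (func)") is not p3.child("proc1 (func)")
```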
Figures 4.1(a) and 4.1(b) show part of the ECCT obtained for gzip and gcc
respectively. The nodes that carry a (func) or (loop) label are procedure and loop nodes
respectively. Each node in ECCT is annotated with statistical information about the
execution of the node and all the nodes in the subtree with the node as the root.
In our experiments, the cumulative number of dynamic instructions executed in the
node and all the nodes under it is recorded. For example, in Figure 4.1(a), the node
spec_compress will have the total number of instructions retired in zip and clear_bufs
in addition to the instructions retired in spec_compress itself. We used this annotated
information for dynamic code region analysis.
4.3 Dynamic Code Region
A Dynamic Code Region (DCR) is defined as a node in the ECCT together with all the nodes in the
subtree of that node. In Figure 4.1(b), for example, schedule_block (func) and its child
nodes schedule_block (loop), free_pending_lists (func), and sched_analyze (func) can
be grouped together and considered as one dynamic code region. Depending on the
target application, the ECCT can be analyzed to identify a set of DCRs
that have desirable characteristics.
4.3.1 Dynamic Code Region Analysis
Dynamic code region analysis is an algorithm that automatically identifies non-overlapping
DCRs with high code coverage and relatively stable behavior. The input
to the algorithm is an ECCT of the program that is annotated with cumulative num-
ber of retired instructions in each node and the required total coverage for the final set
of DCRs. We define coverage to be the ratio of the sum of the dynamic instruction
counts in each DCR to the dynamic instruction count of the root node of ECCT. In
ECCT, the instructions that correspond to statements other than a procedure call or
a loop in the parent node (e.g., if-then-else conditions and other statements) are not included in
the cumulative instruction count of the child node. Thus the sum of the cumulative
instruction counts of all child nodes will be less than the cumulative instruction count
stored in the parent node. We specify the coverage value to restrict the algorithm
to find a set of DCR’s whose cumulative coverage will be greater than the specified
value.
The search algorithm is iterative. It uses a set to keep track of the DCRs
discovered so far. The set is initialized with the root
node. During each iteration of the search, one node is removed from the set and its
child nodes are added to the set. Next, the coverage of the set is calculated. If the
coverage is greater than the coverage specified by the user, the algorithm proceeds
to the next iteration of the search. If coverage is less than the specified coverage,
the algorithm backtracks. It removes the child nodes that are added to the set in
the current iteration and adds the parent node back to the set. This node is then
marked and will not be selected in future iterations of the search. This completes one
iteration of the search algorithm. When no unmarked nodes remain in the set, the
algorithm terminates. The nodes in the set at termination are the
root nodes of the sub-trees that form the final set of DCRs. To keep the search from
going deeper and deeper into one region of the tree, the node in the set that is closest
to the root is given higher priority when selecting the node to expand.
Figures 4.1(a) and 4.1(b) show the DCR's obtained for gzip and gcc, respectively. The root node of each DCR is marked grey, and the rectangles mark the nodes included in each DCR. Since each node is visited no more than once during the search, the complexity of the search algorithm is linear in the number of nodes in the tree.
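The iterative search described above can be sketched as follows. This is a minimal sketch, not the COBRA implementation: the Node class and the find_dcrs name are illustrative, and we assume each node already carries its cumulative retired-instruction count and children.

```python
# Sketch of the iterative DCR search over an ECCT. Node and find_dcrs
# are hypothetical names; .count holds cumulative retired instructions.

class Node:
    def __init__(self, count, children=None):
        self.count = count              # cumulative retired instructions
        self.children = children or []
        self.depth = 0                  # filled in by find_dcrs

def find_dcrs(root, min_coverage):
    """Return the root nodes of non-overlapping DCRs whose combined
    coverage stays above min_coverage (a fraction of root.count)."""
    # Annotate depth so nodes closer to the root are expanded first.
    work = [(root, 0)]
    while work:
        n, d = work.pop()
        n.depth = d
        work.extend((c, d + 1) for c in n.children)

    current, marked = {root}, set()
    while True:
        # Pick the shallowest unmarked node that can be expanded.
        candidates = [n for n in current if n not in marked and n.children]
        if not candidates:
            break
        node = min(candidates, key=lambda n: n.depth)
        trial = (current - {node}) | set(node.children)
        coverage = sum(n.count for n in trial) / root.count
        if coverage >= min_coverage:
            current = trial             # keep the refinement
        else:
            marked.add(node)            # backtrack; never expand this node again
    return current
```

With a strict coverage requirement the search backtracks early and returns coarser regions; relaxing the requirement lets it descend to finer loops and calls.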
The final set of DCR's may contain nodes that contribute insignificantly to the overall execution of the program. For example, nodes representing the C library functions executed before main would be present in the final set. We prune such functions from the final DCR set; this slightly reduces coverage but does not affect the accuracy of the technique.
Table 4.1 shows the coverage and the Coefficient of Variation (CoV) of CPI for the top 5 DCR's obtained for the benchmark programs evaluated in this work using the Dynamic Code Region Analysis algorithm. CoV is the ratio of the standard deviation to the mean; a higher CoV implies higher performance variability within the DCR. The DCR's are listed in descending order of size for each benchmark. The table shows that a small number of DCR's cover a large portion of the program execution: the top three DCR's cover over 90% of dynamically retired instructions in gzip, vpr, mcf, vortex, crafty, eon, mesa and ammp. A dynamic optimization system can therefore focus on a few dynamic code regions and still achieve high execution-time coverage on the SPEC CPU2000 benchmarks.
4.3.2 Stable and Transition Intervals
The CoV of CPI for each DCR shown in Table 4.1 represents the performance variance of the intervals belonging to that DCR. If the CoV of CPI for a DCR is lower than 0.5, its intervals are considered stable; all intervals can thus be classified into stable and transition intervals. During transition intervals, the dynamic optimization system does not trigger binary trace generation and re-optimization, in order to minimize unnecessary translation and optimization overhead. For example, perlbmk has a relatively large instruction footprint and few loops, so it offers neither large coverage nor much optimization benefit; in such benchmarks, detecting transition intervals precisely reduces unnecessary optimization overhead. Stable intervals are contiguous intervals that a phase detector classifies into the same program phase; long runs of contiguous stable intervals are good targets for dynamic optimization.
Table 4.1: Top 5 dynamic code region statistics on SPEC2000 CPU benchmarks

Benchmark           Function name              DCR type  Coverage (%)  CPI CoV
gzip-source         spec_compress              func      88.12         0.16
                    spec_uncompress            func      11.79         0.42
                    spec_reset                 func       0.07         1.06
                    spec_load                  loop       0.01         0.13
vpr-route           try_route                  loop      97.59         0.35
                    alloc_and_load_rr_graph    func       0.79         0.04
                    print_route                loop       0.52         0.01
                    check_rr_graph             func       0.28         0.26
                    get_tok                    func       0.22         0.13
gcc-166             life_analysis              loop      33.47         0.19
                    schedule_block             func      15.53         0.50
                    cse_basic_block            func      12.43         0.28
                    life_analysis              loop       9.17         0.17
                    global_conflicts           loop       6.55         0.94
mcf-ref             primal_net_simplex         func      57.19         0.34
                    price_out_impl             func      41.66         0.39
                    sscanf                     func       0.66         0.05
                    flow_cost                  func       0.28         0.60
                    fgets                      func       0.07         0.05
perlbmk-splitmail1  Perl_pp_substcont          func      38.26         0.96
                    Perl_pp_subst              func      32.81         0.97
                    Perl_pp_helem              func       3.06         0.29
                    Perl_pp_match              func       2.56         0.32
                    Perl_pp_sassign            func       2.22         0.34
perlbmk-splitmail2  Perl_pp_substcont          func      38.72         0.57
                    Perl_pp_subst              func      23.33         0.49
                    Perl_pp_helem              func       5.76         0.18
                    Perl_pp_match              func       4.69         0.20
                    Perl_pp_sassign            func       1.97         0.21
vortex-lendian1     BMT_DeleteParts            func      67.15         0.41
                    BMT_CreateParts            func      10.39         0.38
                    BMT_LookUpParts            func       9.73         0.16
                    BMT_CreateParts            loop       5.05         0.41
                    BMT_CommitParts            func       4.61         0.10
bzip2-program       sortIt                     func      25.35         0.32
                    getAndMoveToFrontDecode    loop      19.58         0.13
                    generateMTFValues          loop      16.76         0.13
                    loadAndRLEsource           loop      16.43         0.19
                    sendMTFValues              loop      10.40         0.10
crafty-ref          Iterate                    func      99.99         0.05
                    InitializeAttackBoards     func       0.003        0.05
eon-cook            ggBRDF                     func      49.61         0.02
                    ggMaterialRecord           func      43.67         0.02
                    ggSpectrumf                func       2.13         0.02
                    ggSpectrumT0               func       1.94         0.02
                    ggSpectrum                 func       1.06         0.02
swim-ref            calc1                      loop      38.21         0.15
                    calc2                      loop      30.70         0.47
                    calc3                      loop      18.40         0.32
                    MAIN                       loop      12.27         0.22
                    inital                     loop       0.15         0.01
mesa-ref            gl_render_vb               func      92.93         0.16
                    shade_vertices             func       4.03         0.17
                    viewport_map_vertices      func       0.83         0.17
                    project_and_cliptest       func       0.80         0.18
                    transform_points           loop       0.50         0.18
ammp-ref            fv_update_nonbon           func      80.50         0.17
                    f_nonbon                   loop      12.87         0.73
                    divdf3                     func       2.14         0.70
                    divdf3                     func       1.12         0.70
                    sqrt                       func       0.95         0.70
Figure 4.2: Sampled BBV-based Program Phase Detection
4.4 Sampling-based Program Phase Tracking
Current processors do not directly support phase tracking in hardware, not even the Intel Itanium processors, which provide the most advanced hardware performance monitors. Hence, our dynamic optimization framework relies on sampling-based software phase detection. We implemented Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based phase detectors as software modules in our framework.
4.4.1 Sampled BBV-based Program Phase Detection
T. Sherwood et al. [50] proposed a novel method to automatically characterize large-scale program behavior using BBV-based clustering. This technique has been further studied to show the correlation between program code signatures and actual behavior, through simulation [33] and through measurement on real machines [44, 6].
We implemented a sampled BBV-based software phase detector functionally similar to the BBV-based hardware phase detector proposed by T. Sherwood [51]. Figure 4.2 shows the sampled BBV-based program phase detection scheme. Branch addresses are collected through periodic sampling: we capture four taken branches at a time on the Itanium 2 processor and store them in a kernel buffer. Once the kernel buffer is full, one BBV is computed from the branch addresses in the buffer. A hash function maps each branch target address to the frequency counter in the BBV that is incremented. The phase table is then updated.
Figure 4.3: Sampled HWSET-based Program Phase Detection
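The BBV construction just described can be sketched as follows. This is an illustrative sketch, not the COBRA module: the vector size, the hash, and the Manhattan-distance threshold in classify are assumed parameters.

```python
# Sketch of building a sampled BBV from a full kernel buffer of
# taken-branch target addresses, then classifying it against a phase
# table. BBV_SIZE and the 0.5 threshold are illustrative choices.

BBV_SIZE = 32

def compute_bbv(branch_targets, size=BBV_SIZE):
    bbv = [0] * size
    for addr in branch_targets:
        # Hash the branch target to pick which frequency counter to bump.
        bbv[hash(addr) % size] += 1
    return bbv

def classify(bbv, phase_table, threshold=0.5):
    """Return a phase ID for bbv, adding a new table entry on a miss.
    Matching uses a normalized Manhattan distance, as in the BBV
    phase-classification literature."""
    total = sum(bbv) or 1
    norm = [x / total for x in bbv]
    for pid, sig in enumerate(phase_table):
        if sum(abs(a - b) for a, b in zip(norm, sig)) < threshold:
            return pid
    phase_table.append(norm)
    return len(phase_table) - 1
```

Normalizing by the total sample count makes the signature insensitive to small variations in buffer fill level between intervals.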
4.4.2 Sampled HWSET-based Program Phase Detection
A Hot Working Set (HWSET) is a list of hot branch addresses. Instead of maintaining a frequency vector of branches, the HWSET detector maintains a sorted list of hot branches. Figure 4.3 shows the sampled HWSET-based program phase detection scheme. At every sampling interval, branch addresses are collected and stored in the kernel buffer. When the buffer is full, the branch addresses are sorted by frequency, and a HWSET signature is computed from the top branches.
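A HWSET signature can be sketched as the top-N branch addresses by sample frequency. This is an illustrative sketch; the cutoff N is an assumed parameter, not the exact COBRA setting.

```python
# Sketch of a HWSET signature: the hottest N branch addresses, kept as
# a sorted tuple so that identical hot working sets produce identical,
# hashable signatures regardless of sample ordering.

from collections import Counter

def hwset_signature(branch_addrs, top_n=8):
    counts = Counter(branch_addrs)
    hottest = [addr for addr, _ in counts.most_common(top_n)]
    return tuple(sorted(hottest))
```

Because the signature discards exact frequencies, two intervals that execute the same hot branches in different proportions still map to the same phase, which is what gives HWSET its coarser, more stable classification than BBV.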
4.4.3 Experimental Results
Data are collected from the phase detection modules of our COBRA framework while running the benchmarks on a 1.0 GHz Itanium 2 server. The SPEC CPU2000 benchmarks (12 integer benchmarks, 14 floating point benchmarks) used in our experiments are compiled with the Intel icc compiler (version 9.1) at the O3 optimization level. The integer benchmarks are gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf. Many programs in the integer benchmarks have complex control flows even
Figure 4.4: Phase coverage and stable phase of sampled BBV-based phase detection scheme on SPEC CPU2000 benchmarks
in hot code regions. The floating point benchmarks are wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, and apsi. Most of the floating point benchmarks are loop-intensive programs.
In our study, we examine phase coverage and the stable phase ratio. Phase coverage is the phase table hit ratio: whenever a sampling buffer fills, a phase signature is computed, and if the same signature is found in the phase table, a phase table hit counter is incremented. At the end of execution, the phase table hit ratio is computed. The stable phase ratio is equivalent to a last-phase prediction ratio: whenever a new phase signature is computed, it is compared with the previous signature, and if contiguous intervals have the same phase ID, the stable phase counter is incremented.
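The two metrics can be sketched in a few lines; the function below is an illustrative model of the bookkeeping, not the in-kernel code, and it omits table eviction for simplicity.

```python
# Sketch of the two metrics: phase coverage is the phase-table hit
# ratio over all computed signatures; the stable-phase ratio counts
# how often consecutive signatures map to the same phase ID.

def phase_metrics(signatures, table_capacity=64):
    table, hits, stable, prev = {}, 0, 0, None
    for sig in signatures:
        if sig in table:
            hits += 1
        elif len(table) < table_capacity:
            table[sig] = len(table)   # assign a new phase ID (no eviction here)
        pid = table.get(sig)
        if pid is not None and pid == prev:
            stable += 1
        prev = pid
    n = len(signatures)
    return hits / n, stable / n
```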
4.4.3.1 Results on Sampled BBV-based Program Phase Detection
Figure 4.4 shows phase coverage and stable phase of sampled BBV-based phase detec-
tion scheme on SPEC CPU2000 benchmarks. In this experiment, the BBV signature
table has 32 entries and the phase table has 64 entries. Most floating point benchmarks, except mesa and apsi, show a phase table hit ratio above 80%. In contrast, most integer benchmarks, except gzip, vpr, mcf and bzip2, show less than a 30% phase
Figure 4.5: Phase coverage and stable phase of sampled HWSET-based phase detection scheme on SPEC CPU2000 benchmarks
table hit ratio.
4.4.3.2 Results on Sampled HWSET-based Program Phase Detection
Figure 4.5 shows phase coverage and stable phase of sampled HWSET-based phase
detection scheme on SPEC CPU2000 benchmarks. The phase table has 64 entries.
Most floating point benchmarks, except mesa and apsi, show a phase table hit ratio above 90%, and most integer benchmarks, except crafty, eon, perlbmk and twolf, show a phase table hit ratio above 45%.
4.4.3.3 Comparison of BBV-based and HWSET-based Program Phase Detection
Figure 4.6 shows comparison between BBV-based and HWSET-based program phase
detection. In this experiment, the BBV signature table has 32 entries and the phase
table has 64 entries. The HWSET-based program phase detection shows an average 18.2% higher phase table hit ratio than BBV-based program phase detection, and an average 12.1% higher stable phase ratio. Another metric for comparing the two schemes is phase homogeneity improvement: if a phase detection scheme works well, we expect higher phase homogeneity
(a) Phase table hit ratio
(b) Stable phase ratio
(c) Homogeneity improvement from phase detection
Figure 4.6: Comparison of BBV-based and HWSET-based program phase detection
improvement. As shown in Figure 4.6(c), HWSET-based program phase detection shows an average 2.9% higher phase homogeneity improvement than the BBV-based scheme. We therefore conclude that HWSET-based phase detection is the better of the two sampling-based program phase detection schemes.
4.5 Global Program Phase on Multithreaded Programs
As more multithreaded programs are written to exploit parallelism and concurrency on multi-core and multiprocessor systems, efficient monitoring of these programs becomes an important problem for dynamic optimizers. Multithreaded programs can suffer performance problems due to contention for shared system resources such as the shared L2 cache, the memory subsystem and the system interconnection network. Precise monitoring and profiling could open up new optimization opportunities for a dynamic optimizer on multithreaded programs. In this section, we describe how to extend our sampling-based phase monitoring to multithreaded programs.
4.5.1 Global Program Phase
A global program phase represents collective multithreaded program behavior. In order to observe changes in the program behavior of all concurrent threads, we use global timer-based periodic sampling. At every global monitoring interval, each thread's phase signature is collected, and the set of per-thread phase signatures is formed into a one-dimensional vector that serves as the global phase signature. Per-thread monitoring and collective global phase monitoring are implemented in the COBRA framework as follows.
1. Per-thread monitoring: we track arbitrarily invoked threads and create a code signature from sampled execution paths (taken branches) or hot instruction pointer addresses. The monitor also annotates each code signature with performance characteristics such as cache misses and CPI. The periodic sampling of each per-thread monitor relies on its processor's timeout counter.
2. Collective global phase monitoring: thread-wise monitoring uses a system-wide timer to periodically monitor the collective performance impact of concurrent threads. We use a global phase vector consisting of each thread's code signature.
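Forming and classifying the global phase vector can be sketched as follows; global_phase_id and its table are illustrative names, not the COBRA implementation.

```python
# Sketch of the collective step: at each global timer tick, the
# per-thread phase signatures are concatenated into one vector, and
# the vector itself is looked up in a global phase table.

def global_phase_id(per_thread_sigs, global_table):
    """Map a tuple of per-thread signatures to a global phase ID,
    creating a new ID on first sight of the vector."""
    vec = tuple(per_thread_sigs)       # fixed thread order matters
    if vec not in global_table:
        global_table[vec] = len(global_table)
    return global_table[vec]
```

Because the vector keeps threads in a fixed order, the same collective behavior always yields the same global phase ID, which lets the optimizer recognize recurring system-wide phases.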
Figure 4.7 illustrates global program phase behavior for various multithreaded applications. Every 10 million cycles, a HWSET-based phase signature is computed in each thread. Each distinct phase is encoded as a different color so that global phase changes can be observed over time. Three different types of multithreaded programs are studied on a 4-way Itanium 2 server. In the OpenMP swim benchmark shown in Figure 4.7(a), the threads show similar program behavior: since the inner loops of swim are parallelized and the execution paths of the parallelized loops are alike, the threads' performance behaviors resemble each other. In the multithreaded BLAST program shown in Figure 4.7(b), each thread behaves differently, but a relatively small set of phases repeats, because a small number of hot loops are executed by the concurrent threads. The SPECjbb2005 benchmark is executed on a Java Virtual Machine (JVM), so its execution time is spent in JVM threads, whose behaviors differ greatly from each other, as shown in Figure 4.7(c).
4.5.2 Exploiting Global Program Phase
If every thread accesses a shared system resource intensively, performance scale-up with an increasing number of threads can be limited.
Figure 4.8 shows the performance scale-up of OpenMP swim on a 4-way Itanium 2 McKinley machine and an 8-way Montecito machine. Even though the number of threads is increased from
(a) OpenMP parallelized swim
(b) Multithreaded BLAST (blastn)
(c) SPECjbb2005 benchmarks
Figure 4.7: Global program phase behavior on multithreaded programs
Figure 4.8: Performance scale-up of OpenMP swim on Itanium 2 4-way McKinley and 8-way Montecito machines
Figure 4.9: Performance of SPEC OMP2001 benchmarks at different CPU frequencies on an Intel Core 2 Quad processor
one thread to eight threads on the 8-way Montecito server, no performance improvement is obtained. Since a single thread already uses up the sustainable memory bandwidth, additional threads only add contention on the system bus and memory subsystem. Dynamic thread throttling, guided by global program phase monitoring, could reduce this unnecessary shared resource contention.
Figure 4.9 shows the performance of the SPEC OMP2001 benchmarks at different CPU frequencies on an Intel Core 2 Quad processor. Most benchmarks reduce their execution time as the CPU frequency increases from 1.6 GHz to 2.4 GHz. Two benchmarks (swim and applu), however, show little performance improvement with 2 and 4 threads. This suggests that dynamic thread throttling should be combined with dynamic voltage/frequency scaling (DVFS) to achieve optimal power and performance for multithreaded programs.
4.6 Summary
We describe efficient phase-aware runtime program monitoring schemes implemented in our COBRA framework. We investigate the use of control flow information, such as loops and function calls, to identify repetitive program behavior. We describe sampled Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based program phase detection schemes; the sampled HWSET-based scheme shows larger phase coverage and longer stable phases than the sampled BBV-based scheme. We also describe how to extend our sampling-based phase monitoring to multithreaded programs. Our preliminary data indicate that dynamic thread throttling is a promising technique for achieving optimal power and performance trade-offs when used under a dynamic optimizer.
Chapter 5
Hardware Support for Program
Phase Tracking
This chapter describes hardware support for program phase tracking.
5.1 Dynamic Code Region (DCR): A Unit of Monitoring and
Re-optimization
5.1.1 Tracking Dynamic Code Region as a Phase
In this work, we propose phase tracking hardware that only tracks functions and loops
in the program. The hardware consists of a stack and a phase history table. The idea
of using a hardware stack is based on the observation that any path from the root node
to a node representing a dynamic code region in the Extended Calling Context Tree
(ECCT) can be represented as a stack of function calls and loops. To illustrate this
observation, we use an example program and its corresponding ECCT in Figure 5.1.
In this example, we assume that each of loop0, loop1 and loop3 executes for a long
period of time, and represents dynamic code regions that are potential targets for
Figure 5.1: An example code and its corresponding ECCT representation. Three dynamic code regions are identified in the program and are marked by different shades in the tree.
optimizations. The sequence of function calls which leads to loop1 is main() →
func1() → loop1. Thus, if we maintain a runtime stack of the called functions and
executed loops while loop1 is executing, we would have main(), func1() and loop1
on it. Similarly, as shown in Figure 5.1, the content of the stack for code region 2
would be main() and loop0, while for code region 3 it would be main(), func3() and
loop3. The stack can uniquely identify the calling context of a code region, and thus
can be used as a signature of the code region. For example, the code region loop3 can be identified by the signature main() → func3() → loop3 on the stack. Code regions in Figure 5.1 are formed at runtime, which is why each is called a Dynamic Code Region (DCR). A stable DCR is a subtree of the ECCT whose calling context remains stable during a monitoring interval, typically one million instructions in our study.
The phase signature table stores the stack signatures extracted from the stack. It also
assigns a phase ID to each signature. The hardware can be programmed to check for the current phase by comparing a subset of the stack with the signatures stored in the phase signature table. If there is a match, the phase ID associated with the matching signature is returned; otherwise, a new entry is created in the table and a new phase ID is assigned to it. The details of the table's fields and its operation are presented in Section 5.2.2.
5.1.2 Correlation between Dynamic Code Regions and Program Perfor-
mance Behaviors
In Figure 5.2(a), the CPI calculated over every 1-million-instruction interval of bzip2 is plotted. We then identified the DCR for each one-million-instruction interval and assigned a distinct phase ID to each DCR; these IDs are plotted over time in Figure 5.2(b). Comparing the CPI graph in Figure 5.2(a) with the phase ID graph in Figure 5.2(b), it can be seen that the CPI variation in the program correlates strongly with changes in DCR's. This shows that the DCR's of a program reflect its performance behavior and track the boundaries of behavior changes. Although Basic Block Vectors (BBVs) show a similar correlation, DCR's give code regions aligned with procedures and loops, which yields higher accuracy in phase tracking and also makes code optimization easier.
The several horizontal lines in Figure 5.2(b) show that a small number of DCR's are repeatedly executed during those periods. Most DCR's seen in this program are loops. More specifically, phase ID 6 is a loop in loadAndRLEsource, phase ID 10 is a loop in sortIt, phase ID 17 is a loop in generateMTFValues, and phase ID 31 is a loop in getAndMoveToFrontDecode.
(a) CPI change over execution time
(b) Phase change over execution time
Figure 5.2: Visualizing phase change in bzip2. (a) Change of average CPI during program execution; each point is the average CPI observed over a 1-million-instruction interval. (b) Phase changes over time (1-million-instruction intervals) in bzip2, tracked using dynamic code regions; the Y-axis shows the phase ID.
(a) function call (b) loop
Figure 5.3: Assembly code of a function call (a) and loop (b). The target address of the branch instruction is the start of the loop and the PC address of the branch instruction is the end of the loop.
5.2 DCR-based Phase Tracking and Prediction Hardware
We have discussed why DCR can be used to track program execution phases. In
this section, we propose a relatively simple hardware structure to track DCR during
program execution.
5.2.1 Identifying function calls and loops in the hardware
5.2.1.1 Detecting Function Calls
Function calls and their returns are identified by call and ret instructions in the
binary. Most modern architectures have included call/ret instructions. On detecting
a call instruction (see Figure 5.3(a)), the PC of the call instruction and the target
address of the called function are pushed onto the hardware stack. On detecting a
return instruction, they are popped off the stack. A special case to consider when detecting function calls and returns is recursion; in Section 5.2.3, we describe a technique to handle recursions.
5.2.1.2 Detecting Loops
Loops can be detected using backward branches. A branch which jumps to an address
that is lower than the PC of the branch instruction is a backward branch. The target
of a backward branch is the start of the loop and the PC of the backward branch
instruction is the end of the loop. This is illustrated in Figure 5.3(b). These two
addresses represent the boundaries of a loop. Code re-positioning transformations
can introduce backward branches that are not loop branches. Such branches may
temporarily be put on the stack and then get removed quickly. On identifying a loop,
the two addresses marking the loop boundaries are pushed onto the stack. To detect a
loop, we only need to detect the first iteration of the loop. In order to prevent pushing
these two addresses onto the stack multiple times in the subsequent iterations, the
following check is performed. On detecting a backward branch, the top of the stack
is checked to see if it holds a loop. If so, the addresses stored at the top of the stack are compared to those of the detected loop. If the addresses match, we have detected
an iteration of a loop which is already on the stack. A loop exit occurs when the
program branches out to an address outside the loop boundaries. On a loop exit, the
loop node is popped out of the stack. The conditions checked in the hardware are
summarized in Figure 5.4.
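The loop-related checks can be sketched in software as follows. This is an illustrative model of the conditions only: the stack holds (start, end) loop-bound pairs, function entries are omitted, and the names are hypothetical.

```python
# Sketch of the loop checks: push loop bounds on the first backward
# branch, suppress pushes for subsequent iterations, and pop on a
# branch that leaves the loop's address range.

def on_backward_branch(stack, target, branch_pc):
    """target is the loop start; branch_pc is the loop end."""
    top = stack[-1] if stack else None
    if top == (target, branch_pc):
        return                             # another iteration of the loop on top
    stack.append((target, branch_pc))      # first iteration: push loop bounds

def on_branch_out(stack, target):
    # A branch to an address outside the top loop's bounds is a loop
    # exit; pop until the target falls inside the loop on top.
    while stack and not (stack[-1][0] <= target <= stack[-1][1]):
        stack.pop()
```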
5.2.2 Hardware Description
The schematic diagram of the hardware is shown in Figure 5.5. The central part
of the phase detector is a hardware stack and a signature table. Each entry in the
hardware stack consists of four fields. The first two fields hold different information
for functions and loops. In the case of a function, the first and second fields are used
to store the PC address of the call instruction and the PC of the called function,
respectively. The PC address of the call instruction is used in handling recursions. In
the case of a loop, the first two fields are used to store the start and the end address
Figure 5.4: Conditions checked in the phase detection hardware
of the loop. The third field stores a one-bit value called the stable stack bit. This
bit is used to track the signature of the dynamic code region. At the start of every
interval, the stable stack bit of every entry in the stack that holds a function or a loop is set to '1'. The stable stack bit is set to zero for any entry pushed onto or popped off the stack during the interval. At the end of the interval, the entries at the bottom of the stack whose stable stack bit is still '1' are those that were not popped during the interval. This set of entries forms the signature of the code region to which execution was restricted in the current interval. At the end of the interval, this signature is compared against all signatures stored in the phase signature table.
The phase signature table holds the stack signatures seen in the past and their associated phase IDs. On a match, the phase ID of the matching entry in the signature table is returned as the current phase ID. If no match is found, a new entry is created
Figure 5.5: Schematic diagram of the hardware phase detector
in the signature table with the current signature, and a new phase ID is assigned to it. If there are no free entries available in the signature table, the least recently used entry is evicted to create space for the new entry. The fourth field of each stack entry is a one-bit value called the recursion bit, which is used when handling recursions; its use is explained in Section 5.2.3.1. The check logic shown in Figure 5.5 implements the algorithm presented in Figure 5.4.
The configurable parameters of the hardware are the number of entries in the stack and the number of entries in the phase signature table. In the results section, we show that a 32-entry stack and a 64-entry phase signature table are sufficient to track the phase changes of many programs.
(a) self-recursive calls (b) general recursive calls
Figure 5.6: Recursion structure in code and the content of the hardware stack during recursion.
5.2.3 Handling special cases
Recursion and long stack signatures are two conditions under which the hardware stack might overflow. In this section, we describe how our hardware handles these cases.
5.2.3.1 Recursions
In our phase detection technique, all functions that form a cycle in the call graph
(i.e., they are recursive calls) are considered as members of the same dynamic code
region. Figure 5.6 shows two recursive call structures and their corresponding stack contents during execution. The simplest and most common type of recursion, shown in Figure 5.6(a), is a function calling itself; the dynamic code region for this recursion contains just func1(). A more complicated recursion structure is shown in Figure 5.6(b), where func1() calls func2(), which in turn calls func1(), causing a recursion. This dynamic code region contains func1() and func2(), with func1() forming the boundary of the recursion.
In our hardware, all recursions are detected by checking the content of the stack. A
recursion is detected when the address of the function being called is already present
in an entry on the stack. This check assumes that an associative search of the stack is
done during every push operation. Since the number of entries on the stack is small,
the associative search hardware would be feasible.
To avoid stack overflow during a recursion, no push operation is performed after a
recursion is detected. The recursion bit is set to ’1’ for the entry corresponding to
the function marking the boundary of the recursion, e.g. func1() in the examples
shown in Figure 5.6. Since we no longer push entries onto the stack, we cannot pop entries on detecting a return instruction until execution leaves the recursion cycle. This is detected when a return instruction jumps to a function outside the recursion cycle: all function entries in the stack that lie below the entry whose recursion bit is set are outside the cycle. After a recursion is detected, the return address of every subsequent return instruction is checked against these entries. On a match, all entries above the matched entry are flushed, and normal stack operation resumes.
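The recursion handling above can be sketched in software. This is a loose illustrative model: the dictionary-based stack and function names are hypothetical, and the real hardware compares return target addresses rather than callee names.

```python
# Sketch of recursion handling: a push whose callee is already on the
# stack marks the recursion boundary and is suppressed; returns then
# only take effect when they land below that boundary.

def push_call(stack, callee):
    for entry in stack:
        if entry["callee"] == callee:      # associative search of the stack
            entry["recursion"] = True      # mark the recursion boundary
            return False                   # suppress the push
    stack.append({"callee": callee, "recursion": False})
    return True

def pop_return(stack, return_target):
    if any(e["recursion"] for e in stack):
        # Inside a recursion cycle: only a return into an entry below the
        # boundary flushes the entries above it and resumes normal operation.
        for i, e in enumerate(stack):
            if e["callee"] == return_target:
                del stack[i + 1:]
                e["recursion"] = False
                return
    elif stack:
        stack.pop()                        # normal return outside recursion
```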
5.2.3.2 Hardware Stack Overflow
Recursion is not the only case in which a stack overflow could occur. If the stack
signature of a dynamic code region has more elements than the stack can hold, the
stack would overflow. We did not encounter any stack overflow with a 32-entry stack, but if one did occur, it would be handled much like the recursion case described earlier. On a stack overflow, no further push operations are performed. The address to which control transfers on a return instruction is checked for a match against an address in the stack; on a match, all entries above the matched entry are removed from the stack and normal stack operation resumes.
5.3 Evaluation
In this section, we describe the evaluation methodology, the metrics used to evaluate
our hardware and the benchmarks.
5.3.1 Evaluation Methodology
Pin and pfmon [3, 40, 2] were used in our experiments to evaluate the effectiveness of the phase detector. Pin is a dynamic instrumentation framework developed at Intel for Itanium processors [2]. We developed a Pin tool with custom instrumentation routines to detect function calls and returns and backward branches for loops, and to maintain a runtime stack. The benchmark programs were instrumented using this customized Pin tool. For every interval of one million instructions, the content of the stack was dumped into a trace file, which was then analyzed by programs simulating the phase detection and phase prediction hardware.
pfmon is a tool that reads the performance counters of the Itanium processor [2]. We use CPI as the overall performance metric to analyze the variability within detected phases, and we modified pfmon to obtain the CPI for every one million instructions. Because these measurements were taken on a real machine, random noise may cause variation in the measured data. To minimize such effects, data collection was repeated three times, and the average of the three runs was used for all measurements. The CPI values were then matched with the phase information obtained from the customized Pin tool to get the variability information.
All data were collected on a 900 MHz Itanium 2 processor with a 1.5 MB L3 cache, running Red Hat Linux with kernel version 2.4.18-e37.
5.3.2 Metrics
The metrics used in our study are the number of distinct phases, the average phase
length, Coefficient of Variance (CoV) of CPI, and the accuracy of next phase pre-
diction. The number of distinct phases corresponds to the number of dynamic code
regions detected in the program. Average phase length of a benchmark program gives
the average number of contiguous intervals classified into a phase. It is calculated
by taking the sum of the number of contiguous intervals classified into a phase di-
vided by the total number of phases detected in the program. Coefficient of Variation
quantifies the variability of program performance behavior and is given by
CoV = σ / µ    (5.1)
where σ is the standard deviation and µ is the mean of the CPI. CoV provides a relative measure of the dispersion of the data compared to the mean; a smaller CoV for the performance metric within a phase implies that the phase is more stable. We present a weighted average of the CoV over the different phases detected in each program. The formula for the weighted average of the CoV for each benchmark is
CoV = Σi (ni · CoVi) / Σi ni (5.2)

where ni is the number of intervals in phase i and CoVi is the CoV of the performance
metric in phase i. We use weighted average of the CoV to give more weight to the
CoV of phases that have more intervals (i.e., longer execution times) and hence, better
represent the CoV observed in the program.
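Equations (5.1) and (5.2) can be combined into a short routine. The following is an illustrative sketch; the function name and the dictionary representation of per-phase CPI samples are ours, not from the thesis:

```python
import statistics

def weighted_cov(phases):
    """Interval-weighted average CoV (Eq. 5.2); `phases` maps a phase ID to
    the list of per-interval CPI values classified into that phase."""
    num = den = 0.0
    for cpis in phases.values():
        n = len(cpis)
        mu = statistics.fmean(cpis)
        sigma = statistics.pstdev(cpis)  # population std-dev, as in Eq. 5.1
        num += n * (sigma / mu)          # n_i * CoV_i
        den += n                         # n_i
    return num / den
```

For instance, a phase whose CPI samples are [1.0, 3.0] has a per-phase CoV of 0.5, and a perfectly stable phase contributes 0.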
The next-phase prediction accuracy is the number of intervals whose phase ID was
correctly predicted by the hardware divided by the total number of intervals in the
program.
5.3.3 Benchmarks
Ten benchmarks from the SPEC CPU2000 benchmark suite (8 integer and 2 floating
point benchmarks) were evaluated. These benchmarks were selected for this study
because they are known to have interesting phase behavior and are challenging for
phase classification [51]. Reference input sets were used for all benchmarks. Three
integer benchmarks, namely gzip, bzip2 and vpr, were evaluated with two different input
sets to illustrate the effect of input sets on the performance of the phase detection
hardware. A total of 13 benchmark and input-set combinations were evaluated. All
benchmarks were compiled using gcc (version 3.4) at the O3 optimization level.
5.4 Experimental Results
In this section, we present the evaluation results of our phase classification and predic-
tion hardware. In Section 5.4.1, we explore the design space of our hardware predictor
and present the results of our analysis. In Section 5.4.2, we compare the results of
our hardware scheme to the basic block vector (BBV) scheme described in [51, 35]. A
customized Pin tool was used to get the basic block vectors. These vectors were an-
alyzed off-line using our implementation of the phase detection mechanism described
in [51, 35] to generate the phase IDs for each interval. The metrics we use for
comparison include: the total number of phases detected, the average phase length,
the accuracy of next phase prediction and the CoV of the CPI within each phase.
5.4.1 Results for Phase Detection Hardware
There are two configurable parameters in our hardware. They are the size of the stack
and the size of the phase signature table. We evaluated four different configurations
Table 5.1: Number of phases detected for different configurations of the phase detection hardware.

benchmarks   16/16   32/64   64/64   infinite
ammp           800      53      53        53
bzip2 1       2856      99      99        99
bzip2 2       1278      87      87        87
crafty          27      27      27        27
eon             22      22      22        22
gcc            430     337     337       173
gzip 1          58      48      48        48
gzip 2          45      42      42        42
mcf            157      55      55        55
mesa            50      37      37        37
perl            28      28      28        28
vpr 1           92      91      91        91
vpr 2           27      27      27        27
median       58.00   48.00   48.00     48.00
for the hardware. The sizes of the stack and the phase signature table were
set to 16/16, 32/64, 64/64 and infinite/infinite, respectively.
5.4.1.1 Number of Phases and Phase Length
Table 5.1 shows the number of phases detected using four different configurations of
our phase detection hardware across different benchmark programs. The last row is
the median of the number of phases detected across all programs. We chose the median
to eliminate the effect of outliers in the data. For each benchmark program, there
are 4 columns which correspond to 16/16, 32/64, 64/64 and infinite/infinite hardware
configurations, respectively. It should be noted that except in the case of 16/16, the
number of phases detected for all programs is very close to or exactly the same as
that of using infinite hardware.
Table 5.2 shows the average phase length for four different configurations of our
hardware across benchmark programs. The last row is the median of the phase length
across all programs. A similar trend is seen in both Table 5.1 and Table 5.2, which is
expected. Except for gcc, in all other programs, the 32/64 and 64/64 configurations
Table 5.2: Average length of the phases detected for different configurations of the phase detection hardware.

benchmarks      16/16      32/64      64/64    infinite
ammp           985.43   15027.77   15027.77   15027.77
bzip2 1         68.75    1984.60    1984.60    1984.60
bzip2 2        125.73    1845.51    1845.51    1845.51
crafty       10314.54   10314.54   10314.54   10314.54
eon          10029.18   10029.18   10029.18   10029.18
gcc             22.75      47.71      47.71     460.90
gzip 1        2020.51    2450.40    2450.40    2450.40
gzip 2        1273.73    1273.73    1273.73    1273.73
mcf            683.36    1950.67    1950.67    1950.67
mesa         10780.59   13402.89   13402.89   13402.89
perl          2138.23    3579.22    3579.22    3579.22
vpr 1         1601.36    1618.96    1618.96    1618.96
vpr 2         7018.19    7018.19    7018.19    7018.19
median        1601.36    2450.40    2450.40    2450.40
have exactly the same phase length as that in the configuration with an infinite
number of entries. On average, the phase length is about 2450 one-million-instruction
intervals, which roughly corresponds to 2.5 seconds of real execution time on a
900 MHz Itanium-2 machine.
5.4.1.2 Performance variance within the same phase
Figure 5.7 shows the variation of the weighted average of Coefficient of Variation
(CoV) on CPI for different benchmarks and hardware configurations. The y-axis
shows the weighted average of CoV and the x-axis shows different benchmark pro-
grams. The last 4 bars show the average of the CoV for different hardware configura-
tions. It should be noted that the CoV for the infinite hardware is similar to that of
the 32/64 and 64/64 configurations. On average, the CoV for the phases detected by our
hardware is around 15%, which is close to the CoV of the BBV method with a 40% threshold
value.
Figure 5.7: Weighted average of the CoV of CPI for different configurations of the phase detection hardware
5.4.1.3 Phase Prediction Accuracy
Figure 5.8 shows the performance of a simple last-phase predictor and the Run Length
Encoding Markov predictor [51] for predicting the phase ID of the next interval.
The y-axis is the correct prediction ratio and the x-axis is the different benchmark
programs. A last-phase predictor is one which predicts the phase of the next interval
to be the same as the current one. Thus a last-phase predictor always predicts stable
phase behavior. The Markov predictor can be used to predict a phase change and the
new phase ID. On average, the simple last-phase predictor and the Markov predictor
correctly predict the next phase ID 80% and 84.5% of the time, respectively. Except
in the case of mcf and mesa, the performance difference between the Markov predictor
and the last-phase predictor is less than 5%, which indirectly indicates that
the phase detector detects phases that are longer and more stable. It is known that
SPEC2000 benchmarks have relatively stable phases. To evaluate the effectiveness on
predicting phase changes, we need to expand the test to more real world applications.
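The two predictors can be modeled in software roughly as follows. This is a hedged sketch: the RLE Markov predictor is simplified here to a table keyed on (current phase ID, run length), falling back to the last-phase prediction on a table miss; the hardware predictor in [51] is more elaborate.

```python
def evaluate_predictors(phase_ids):
    """Return (last-phase accuracy, Markov accuracy) over a phase-ID trace.
    The last-phase predictor always predicts "no change"; the simplified
    RLE Markov predictor remembers what followed each (phase, run length)."""
    last_hits = markov_hits = 0
    table = {}                    # (phase, run length) -> observed next phase
    prev, run = phase_ids[0], 1
    for actual in phase_ids[1:]:
        if actual == prev:
            last_hits += 1        # last-phase predictor was right
        if table.get((prev, run), prev) == actual:
            markov_hits += 1      # Markov predictor was right
        table[(prev, run)] = actual
        run = run + 1 if actual == prev else 1
        prev = actual
    n = len(phase_ids) - 1
    return last_hits / n, markov_hits / n
```

On a periodic trace such as `[0, 0, 1, 0, 0, 1, 0, 0, 1]`, the Markov model learns the run-length pattern and outperforms the last-phase predictor, mirroring the gap seen for mcf and mesa.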
Figure 5.8: Performance comparison of the last-phase predictor and the Markov predictor in predicting the phase of the next interval for the 32/64 configuration.
5.4.1.4 Discussion
There is one underlying theme in the results presented above: the performance of
the 32/64 or 64/64 phase detector is very similar to that of hardware with an infinite
amount of resources. This makes the phase detection hardware very cost effective. In our
analysis of the design space, we found that the maximum nesting levels of functions
and loops in the programs, after handling recursions, were always less than 32 for
SPEC benchmark programs. Thus a 32-entry hardware stack would be sufficient to
capture the phase signature without overflowing. For the phase signature
table, a larger table helps reduce potential overflow and the need to
evict signature entries when overflow occurs. Except for gcc, which has a larger code
base, a 64-entry table is sufficient to capture all phases.
5.4.2 Comparison with BBV Technique
In this section, we compare the performance of our phase detection hardware with the
phase detection hardware based on BBV [51, 35]. The BBV-based phase detection
hardware [51, 35] is shown in Figure 5.9. There are two tables in the hardware
structure, namely the accumulator table which stores the basic block vector and the
Figure 5.9: BBV-based phase tracking hardware
signature table which stores the basic block vectors seen in the past. These structures
are similar in function to our hardware stack and signature tables, respectively. To
make a fair comparison, we compare our 32/64 configuration against a BBV-based
hardware which has 32 entries in the accumulator table and 64 entries in the signature
table. In the BBV-based method, a phase change is detected by comparing the
Manhattan distance between the two vectors to a threshold value: if the distance
exceeds the threshold, a phase change is signaled.
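The BBV phase-change test can be sketched as follows, assuming each basic-block vector maps a basic-block ID to the number of instructions it executed during the interval (this dictionary representation and the function name are ours):

```python
def bbv_phase_change(prev_bbv, cur_bbv, interval=1_000_000, threshold=0.10):
    """Return True if the Manhattan distance between two basic-block vectors
    exceeds `threshold` * `interval`; 10% of a one-million-instruction
    interval is the threshold used in the original BBV paper."""
    blocks = set(prev_bbv) | set(cur_bbv)
    distance = sum(abs(prev_bbv.get(b, 0) - cur_bbv.get(b, 0)) for b in blocks)
    return distance > threshold * interval
```

Raising the threshold from 10% to 40% makes the same pair of vectors less likely to trigger a phase change, which is exactly why the 40% configuration reports far fewer phases in Table 5.3.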
We compare our phase detector and BBV based detector using four parameters
namely, number of phases detected, phase length, predictability of phases, and sta-
bility of the performance within each phase. We compare our results with those of
the BBV technique for two threshold values, namely 10% and 40% of a one-million-
instruction interval. The original BBV paper [51] sets the threshold value to be 10%
of the interval. It should be noted that the phases detected in the BBV-based tech-
nique may not be aligned with the code structures. Aligned phases are desirable for
dynamic binary re-optimization systems. We still compare against the BBV method
because it is a well accepted method for detecting phases.
5.4.2.1 Number of phases and phase length
Table 5.3 compares the number of phases detected by the BBV technique and our
phase detection technique. For the BBV technique, there are 2 columns for each
benchmark that correspond to a threshold value of 10% and 40% of one million in-
structions, respectively. In the BBV technique, as we increase the threshold
value, small differences between the basic block vectors will not cause a phase change.
Hence, a smaller number of phases is detected as we go from 10% to 40% threshold
value. Recall that, in a dynamic binary optimization system, on detecting a new
phase, the profiler will start profiling the code, which might cause a significant over-
head. Hence, for such systems a smaller number of phases with a longer per phase
length is desirable. We can see that in the original BBV technique with 10% thresh-
old, the number of phases detected is 100 times more than those detected in the
DCR-based technique. In the BBV technique, as we go from 10% to 40% threshold
value, the number of phases detected becomes smaller, as expected. But even at
40%, the number of phases detected by the BBV technique is about twice that detected
by our technique.
Table 5.4 shows the average phase length of the BBV technique and our phase de-
tection technique. The trend in the data is similar to that seen in Table 5.3, which
is expected. The median phase length of our technique is about 100 times that of
the BBV technique with a 10% threshold value, and about twice that of the BBV
technique with a 40% threshold value. Although in the case of eon
and mesa the phase length for BBV with 40% threshold value is three times that of
DCR technique, these programs are known to have trivial phase behavior. The larger
difference is due to the number of phases detected.
Table 5.3: Comparison of the number of phases detected between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The two BBV columns correspond to threshold values of 10% and 40% of one million instructions, respectively.

Benchmarks   BBV-10%   BBV-40%   DCR-32/64
ammp           13424       122          53
bzip2 1        35154      1796          99
bzip2 2        37847      1469          87
crafty         20111        20          27
eon               38         7          22
gcc             2650       599         337
gzip 1          8328       182          48
gzip 2          4681        77          42
mcf             5507        88          55
mesa             945        15          37
perl            8036       201          28
vpr 1           3136       105          91
vpr 2             51        27          27
median          5507       105          48
Table 5.4: Comparison of the average phase length between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The two BBV columns correspond to threshold values of 10% and 40% of one million instructions, respectively.

Benchmarks   BBV-10%    BBV-40%   DCR-32/64
ammp           58.83    6472.27    15027.77
bzip2 1         5.59     111.51     1984.60
bzip2 2         4.24     108.30     1845.51
crafty         14.57   15073.26    10314.54
eon          6128.94   31520.43    10029.18
gcc            19.70      86.84      158.15
gzip 1         13.90     629.34     2450.40
gzip 2         11.11     678.23     1273.73
mcf            19.52    1219.19     1950.67
mesa          517.82   32899.13    13402.89
perl           13.76     497.59     3579.22
vpr 1          47.17    1398.50     1618.96
vpr 2        3715.51    7018.19     7018.19
median         19.52    1219.19     2450.40
Figure 5.10: Comparison between BBV- and DCR-based phase detection hardware on the performance of a 256-entry Markov predictor in predicting the phase ID of the next interval. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The first 2 columns for each benchmark are for the BBV method using threshold values of 10% and 40% of one million instructions, respectively.
5.4.2.2 Phase Prediction Accuracy
Figure 5.10 compares the performance of the 256-entry Markov Predictor using BBV
technique and our phase detection technique. Except for eon and vpr 1, the Markov
predictor using our phase detector predicts the next phase better. On average, using
our phase detection technique, the Markov predictor predicts the correct next phase
ID 84.9% of the time. Using the BBV-based technique, the average correct prediction
ratios are 42.9% and 73.3% for the 10% and 40% thresholds, respectively.
5.4.2.3 Performance Variance within the Same Phase
Figure 5.11 compares the weighted average of the CoV of CPI for phases detected by
the BBV technique and by our phase detection technique. The last three bars give the
average CoV. From the figure we can see that BBV-10% has the least average CoV
value. This is because the number of phases detected by BBV-10% is much higher
than the number of phases detected by BBV-40% or our technique. In the case of
BBV-10%, the variability of CPI gets divided into many phases, thus reducing the
variability observed per phase. On average the CoV of our phase detection hardware
Figure 5.11: Comparison of the weighted average of the CoV of CPI between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The first 2 columns for each benchmark are for threshold values of 10% and 40% of one million instructions, respectively.
is 14.7% while it is 12.57% for the BBV-40%. Although the average variability of our
technique is greater than that of BBV-40%, the numbers are comparable. In fact, for bzip2 1,
crafty, gzip 1, gzip 2, mesa, vpr 1 and vpr 2, the CoV of the DCR-based technique is
less than or equal to the CoV observed with BBV-40%. For ammp, gcc, mcf and
perl the performance variation is higher in the dynamic code regions detected. The
higher performance variation within each dynamic code region may be due to change
in control flow as in the case of gcc or change in data access patterns as in the case
of mcf.
5.5 Summary
From the above discussions we can conclude that our hardware detects a smaller
number of phases, has a longer average phase length, achieves higher phase prediction
accuracy, and detects phases aligned with the code structure, all of which are
desirable characteristics of a phase detector for a dynamic binary re-optimization
system. The CoV of the phases
detected in our technique is slightly higher but comparable to that observed in the
BBV technique with 40% threshold value. The phase difference is detected using an
absolute comparison of phase signatures, which makes the hardware simpler and the
decision easier to make. The 32/64 hardware configuration performs similar to an
infinite sized hardware, which makes it cost effective and easier to design.
Chapter 6
Continuous and Persistent Profile
Management
This chapter describes techniques to characterize and classify dynamic profiles for
dynamic compilation and optimization systems.
6.1 Continuous and Persistent Profile-Guided Optimization
JIT compilers have been widely used to achieve near native code performance by
eliminating interpretation overhead [13, 7]. However, JIT compilation time is still
a significant part of execution time for large programs. To cope with this
problem, recently released language runtime virtual machines, such as the Java
Virtual Machine (JVM) and the Microsoft Common Language Runtime (CLR), provide an
Ahead-Of-Time (AOT) compiler, sometimes called a pre-JIT compiler, to generate native
binaries and store them in a designated area before the execution of frequently used
applications. This approach can mitigate runtime compilation overhead [9].
A pre-JIT compiler [56] can afford more time-consuming profile-guided optimizations
Figure 6.1: Continuous profile-guided optimization model
(PGO) compared to a JIT compiler because compilation time in a pre-JIT compiler
may not be a part of execution time. In order to enable advanced profile-guided
optimizations (PGO) in a dynamic compiler, dynamic profiles are usually collected
through sampling and runtime instrumentation by high-level language virtual ma-
chines. With the deployment of Pre-JIT compilers, automatic continuous profiling
and re-optimization, as shown in Figure 6.1, becomes a viable option for the man-
aged runtime execution environment. For example, with the introduction of a Pre-JIT
compiler on the recent Microsoft CLR [9], the continuous PGO framework shown in
Figure 6.1 has become a feasible optimization model. The pre-JIT compiler compiles
MSIL code into native machine code and stores it on the disk. The re-compilation
does not occur during program execution but, instead, is invoked as an offline low
priority process. The re-compilation process relies on accurate HPM (Hardware Per-
formance Monitor)-sampled profiles that are accumulated over several executions of
the application program. HPM-sampled profiles could provide more precise runtime
performance events, such as cache misses and resource contentions, that allow more
effective runtime or offline optimizations [39, 16, 37].
However, the perturbation from collecting and processing profiles must be minimized to
justify the performance gain from re-optimization [60]. This cost includes runtime
overhead and the use of system resources such as memory and disk space. With HPM
available in most recent microprocessors, sampling-based runtime profiling provides an
attractive alternative to instrumentation-based profiling: it avoids the perturbation
caused by instrumentation code while still capturing precise runtime performance
events, such as cache misses and resource contentions, that allow more effective
runtime or offline optimizations [39, 16, 37]. A sampling-based profile manager has
thus become an essential component in a continuous profile-guided optimization
framework such as the one shown in Figure 6.1. However, many production compilers
still depend on instrumentation-based profiles for their PGO. This is because some
existing optimizations, such as complete loop unrolling, require the iteration count
of a loop, and this type of information may not be accurately generated from
sampling-based profiles.
In order to obtain more accurate profiles with a low sampling frequency, sampled
profiles could be merged and stored across multiple runs on the disk. Due to the
statistical nature of sampling, the quality of sampled profiles is greatly affected by the
sampling rate. As the sampling frequency is increased, more samples are collected and
the accuracy of sampled profiles is improved. Unfortunately, the sampling overhead
is also increased. A high sampling frequency would cause more interrupts, require
more memory space to store sampled data, and more disk I/O activities to transfer
persistent profile data. With a fixed number of runs, the challenge for the runtime
profiler is to determine the ideal sampling rate so that high-quality profiles can
still be obtained with minimal sampling overhead.
6.2 Similarity of Sampled Profiles
A sampling-based profiler collects a frequency profile of sampled PC addresses rather
than the edge-count profiles obtained through instrumentation. Zhang et al. [60]
showed that an accurate edge profile can be deduced from the frequency profile of
sampled PC addresses, and that the optimization results from using the deduced edge
profile are comparable to those from using the actual edge profile obtained through
instrumentation.
6.2.1 Similarity Metric
In order to evaluate the similarity between a sampled profile and the “complete pro-
file”, we define a “similarity metric” between the two profiles. Since our profile is a
list of frequency counts of distinct sampled PC addresses, it can be represented as a
1-dimensional vector. To compute the linear distance between two vectors, we use
the Manhattan distance, shown in the following equation, as the similarity metric:
S = (2.0 − Σi |ai − bi|) / 2.0

where ai and bi are the relative frequencies of the ith distinct PC address in the
two profiles. Here ai and bi refer to the same PC address, and either may be zero
when no matching samples are collected. If the two profiles are identical
(ai = bi for all i), S becomes 1.
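The similarity computation follows directly from the definition; the dictionary representation of a profile (PC address mapped to relative frequency) is an assumption of this sketch:

```python
def similarity(p, q):
    """Similarity S between two profiles, each a mapping from PC address to
    relative frequency (the values in each profile sum to 1)."""
    addrs = set(p) | set(q)
    return (2.0 - sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in addrs)) / 2.0
```

Identical profiles yield S = 1, and profiles with no PC address in common yield S = 0, matching the intended range of the metric.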
6.2.1.1 Baseline Profile to Determine Similarity of Sampled Profiles
Instead of using instrumented profiles, we use a merged profile generated from very
high frequency sampling rates over multiple runs as the baseline profile. One way to
merge them is to sum up the frequency of each PC address across all sampled profiles.
We collected sampled profiles three times using each of six different sampling rates
(one sample every 634847, 235381, 87811, 32973, 34512, and 32890 instructions), and
generated a baseline profile for each benchmark program by merging the 18 profiles.
In every sampling interval, one PC address is collected, so each sampled profile
holds a frequency count for each distinct PC address. Hence, we can compute a
normalized frequency for each distinct PC address by dividing its frequency count
by the total sample count. We mask off the 4 low-order bits of the PC address to
approximate a basic block, i.e. use the masked address as the starting address of an
Figure 6.2: Convergence of merged profiles of gcc with 200.i input set
approximated basic block instead of distinct PC addresses within the approximated
basic block. The obtained frequency is the frequency of the approximated basic block.
The obtained baseline profile is very close to the instrumentation-based complete
profile: the similarity ranges from 0.95 to 0.98 under our similarity metric for
the SPEC CPU2000 benchmarks.
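The merging and address-masking steps described above can be sketched as follows; the function name and the raw-profile representation (PC address mapped to sample count) are assumptions of this sketch:

```python
from collections import Counter

def merge_profiles(profiles, mask_bits=4):
    """Merge raw sampled profiles (PC address -> sample count) into one
    normalized baseline profile. The low-order bits of each PC are masked
    off so nearby samples collapse into an approximated basic block."""
    merged = Counter()
    mask = ~((1 << mask_bits) - 1)
    for prof in profiles:
        for pc, count in prof.items():
            merged[pc & mask] += count       # accumulate per basic block
    total = sum(merged.values())
    return {pc: c / total for pc, c in merged.items()}  # relative frequency
```

For example, samples at 0x4001 and 0x4012 fall into the approximated basic blocks starting at 0x4000 and 0x4010, respectively.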
6.2.2 Accuracy of Persisted Profiles
As the similarity between the baseline profile and the instrumented complete profile
reflects how “accurately” the baseline profile mimics the complete profile, we use
“accuracy” and “similarity” interchangeably in the rest of this chapter. Intuitively,
the accuracy of merged profiles improves as the number of samples increases.
Figure 6.2 shows that merged profiles are more accurate (compared to the baseline
profile) than a single profile instance, labeled R2R (Run-to-Run) in the figure,
for 176.gcc with the 200.i input.
In Figure 6.2, sampled profiles are cumulatively merged along repeated runs on the
same input set. For example, at the 10th run, the merged profile is the sum of the
1st through 10th profiles. The y-axis shows the similarity (S) between the baseline
profile and the merged profile with three different sampling rates (one sample every
1M, 10M, 100M instructions).
We can observe two interesting facts. First, most of the improvement in accuracy
comes from the first three to six runs. Second, after that the improvement curve
flattens. Since we cannot afford too many runs at high sampling rates, we
need to adapt sampling rates according to the program behavior. We address how to
automatically reduce the sampling rate through profile characterization in the next
section.
6.3 Entropy-Based Profile Characterization
In this section, we show that an appropriate sampling rate could be determined by
using the information entropy.
6.3.1 Information Entropy: A Metric for Profile Characterization
An appropriate sampling rate could be determined according to the program behavior.
Since our frequency profile can be represented as a statistical distribution, where each
distinct PC address has a probability that is the number of its occurrences divided
by the overall number of samples, we can use the equation to quantify the shape
of statistical distribution. In this work, we use information entropy as defined by
Shannon [18] to characterize the sampled profiles (for example, “flat” or “skewed”).
The information entropy is defined as follows:
E = Σi Pi · log(1 / Pi)

where Pi is the relative frequency probability of the ith distinct PC address.
If the program has a large footprint and a complex control flow, a large number of
distinct PC addresses will be collected. On the other hand, if the program has a
(a) gcc (E=10.12); (b) gzip (E=5.67)
Figure 6.3: Relative frequency distribution of PC address samples (gcc, gzip)
small number of hot spots (or hot loops), the sampled profile will have a small
number of distinct PC addresses, which leads to a low entropy value. This property
can be used to determine an appropriate sampling rate for the next run. The two
example programs shown in Figure 6.3 clearly show that entropy distinguishes “flat”
profiles from “skewed” profiles. gcc in Figure 6.3(a) shows a relatively “flat”
distribution and a high entropy value (E=10.12). In contrast, gzip in Figure 6.3(b)
shows a “skewed” distribution and a low entropy value (E=5.67).
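The entropy computation follows directly from the definition above. One caveat: the thesis does not state the logarithm base, so base 2 is assumed in this sketch:

```python
import math

def profile_entropy(profile):
    """Shannon entropy of a normalized profile (PC address -> relative
    frequency probability P_i). Base 2 is an assumption; the text does
    not state the logarithm base."""
    return sum(p * math.log2(1.0 / p) for p in profile.values() if p > 0)
```

A profile spread uniformly over 32 addresses has entropy 5 bits ("flat"), while a profile concentrated on a single hot address has entropy 0 ("skewed").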
6.3.2 Entropy-Based Adaptive Profiler
This subsection describes the implementation of an entropy-based adaptive profiler.
The application of entropy heuristics in the adaptive profiler is as follows:
1. When an application is loaded and ready to run, check if there is already a
profile for this application. If not, i.e. this is the first time the application
executes, start with a low sampling rate.
2. After the program terminates, compute the information entropy of the profile.
Categorize the profile into one of three ranges of entropy values. Our
data shows that the following three ranges are sufficient: low [0, 5),
medium [5, 8), and high [8, ∞).
3. When an application is loaded, if the entropy is known, set the sampling rate
according to the entropy: a high entropy uses a high sampling rate, a medium
entropy uses a medium sampling rate and a low entropy uses a low sampling
rate.
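The three-step heuristic above might be sketched as follows; the concrete sampling intervals (instructions per sample) are illustrative assumptions, not values from the thesis:

```python
def pick_sampling_interval(entropy):
    """Map a stored profile's entropy to a sampling interval (instructions
    per sample). The entropy ranges follow the heuristic above; the
    interval values themselves are illustrative."""
    if entropy is None:       # step 1: first run, no stored profile yet
        return 100_000_000    # start with a low sampling rate
    if entropy < 5:           # low entropy: a few hot spots dominate
        return 100_000_000
    if entropy < 8:           # medium entropy
        return 10_000_000
    return 1_000_000          # high entropy: flat profile, sample densely
```

A smaller interval means more frequent samples, so high-entropy programs such as gcc get the dense sampling they need while gzip-like programs keep the overhead low.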
6.4 Entropy-Based Profile Classification
In practice, a program shows different program behaviors when different input sets
are used. The entropy of their profiles will also change along with the changed input
sets. For example, perlbmk has 7 inputs. The entropy of each input ranges from
5.52 to 9.05. For multiple inputs, if the program executes for a long enough time,
its sampling rate can be adjusted using entropy during the execution. It is more
important to understand the impact of different profiles on PGO. For multiple inputs,
this section describes how entropy can be used to classify different profiles.
Figure 6.4 shows the workflow of entropy-based profile classification. In our profile
classification framework, we use a k-means clustering technique to identify similar profiles.

Figure 6.4: Entropy-based profile classification

If the maximum number of clusters is set to 3 (k = 3), incoming profiles are
classified and merged into three persistent profiles (A, B, C). In Figure 6.4,
three profiles with entropy values in a similar range (E = 6.46, 6.30, 6.68) are
classified and merged into a single persistent profile A. One profile with entropy
E = 8.42 is merged into persistent profile B, and another with entropy E = 5.52 is
merged into persistent profile C.
When the recompilation for PGO is invoked, the controller determines whether the
classified profiles are merged into one profile or one particular profile is selected for
PGO. If the similarity (S) of classified profiles is very low, it means that the profiles
come from disjoint code regions. In this case, it is better to combine the profiles. vpr
is such a case. The profile from place and the profile from route are disjoint from
each other. Otherwise, we choose the profile that was merged from the majority of
runs.
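The classification step can be sketched as a simple online one-dimensional clustering on entropy. This greedy scheme, with a hypothetical merge-radius parameter `width`, is a simplification of the k-means clustering the framework uses:

```python
def classify_profile(entropy, clusters, max_k=3, width=0.5):
    """Assign a profile to the entropy cluster whose centroid lies within
    `width`; otherwise open a new cluster (up to max_k). `clusters` is a
    list of [centroid, count] pairs, updated in place; returns the index."""
    for i, (centroid, count) in enumerate(clusters):
        if abs(entropy - centroid) <= width:
            clusters[i][0] = (centroid * count + entropy) / (count + 1)
            clusters[i][1] = count + 1
            return i
    if len(clusters) < max_k:
        clusters.append([entropy, 1])
        return len(clusters) - 1
    # all max_k clusters taken: fall back to the nearest centroid
    nearest = min(range(len(clusters)),
                  key=lambda i: abs(entropy - clusters[i][0]))
    clusters[nearest][1] += 1
    return nearest
```

Fed the entropy values from Figure 6.4 (6.46, 6.30, 6.68, 8.42, 5.52), this sketch reproduces the grouping into three persistent profiles A, B, and C.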
We could generate multiple specialized binaries, each customized for one specific type
of profiles. However, it is difficult to predict what would be the incoming input data
set for the next run. In this work, we introduce three different types of profiles that
could lead to different profile management and feedback strategies.
6.5 Experiments
6.5.1 Experimental Setup
For our experiments, we collected sampled profiles with the Intel SEP (Sampling
Enabling Products) tool, version 1.5.252, running on a 3.0 GHz Pentium 4 with the
Windows XP operating system. The SPEC CPU2000 INT and FP benchmarks are used to
evaluate the convergence of merged profiles and the effectiveness of using entropy to
characterize different sampled profiles. The SPEC CPU2000 benchmarks are compiled
with the Intel icc compiler, version 8.0, at the O3 optimization level. The SPECjbb
1.01 benchmark is used to show how entropy can be effectively used in an adaptive
profile manager.
In the SPEC CPU2000 INT benchmarks, vpr, vortex, gzip, gcc, perlbmk and bzip2
have multiple input sets. We used these six benchmarks for our experiments. These
benchmarks are compiled with the Intel icc compiler ver. 8.0 at the O3 optimization
level, and measured on a 1 GHz Itanium-2 machine. The profile feedback
uses the Intel icc profile-guided optimization. In our experiments, the maximum
number of clusters is set to 3 (MaxK = 3).
6.5.2 Experimental Results
6.5.2.1 Accuracy of Merged Profiles
Figure 6.5 shows that persisted profiles converge to the baseline profile at
different rates depending on program behavior, for the SPEC CPU2000 (INT, FP)
benchmarks at a sampling rate of one sample every 100M instructions. For the SPEC
CPU2000 INT benchmarks, shown in Figure 6.5(a), most benchmarks quickly converge
above a similarity of 0.9 (i.e., more than 90% similar to the baseline profile)
after the initial five runs, except for five benchmarks (gcc, vortex, perlbmk,
(a) INT; (b) FP
Figure 6.5: Convergence of merged profiles of SPEC CPU2000 benchmarks
crafty, eon). Since these benchmarks have relatively complex control flows and
large instruction footprints, they need a higher sampling rate to achieve a targeted
accuracy within a limited number of runs. For the SPEC CPU2000 FP benchmarks,
shown in Figure 6.5(b), most benchmarks converge quickly above 0.9 similarity to the
baseline after the initial five runs, except for three benchmarks (lucas, applu, fma3d).
Table 6.1: Entropy of SPEC CPU2000 INT benchmarks
benchmark   entropy   benchmark   entropy   benchmark   entropy   benchmark   entropy
gzip        5.67      vpr         4.87      gcc         10.12     mcf         4.45
crafty      9.52      parser      7.79      eon         7.95      perlbmk     8.46
gap         7.62      vortex      8.02      bzip2       7.50      twolf       7.24
Table 6.2: Entropy of SPEC CPU2000 FP benchmarks
benchmark   entropy   benchmark   entropy   benchmark   entropy   benchmark   entropy
wupwise     6.43      swim        4.86      mgrid       5.09      applu       8.26
mesa        7.85      galgel      5.86      art         4.01      equake      5.98
facerec     6.44      ammp        6.95      lucas       9.26      fma3d       9.16
sixtrack    5.19      apsi        7.91
6.5.2.2 Entropy-Based Profile Characterization
Table 6.1 shows the entropy of the SPEC CPU2000 INT benchmarks. Interestingly,
the entropies of these benchmarks cluster into three ranges. Two
programs (vpr, mcf ) show a low entropy (0 ≤ E < 5). Four programs (gcc, crafty,
perlbmk, vortex ) show a high entropy (E ≥ 8). The remaining programs have a medium
entropy (5 ≤ E < 8). The four programs that show high entropy exactly match
the programs, shown in Figure 6.5(a), that need higher sampling rates to achieve a
targeted accuracy.
Table 6.2 shows the entropy of the SPEC CPU2000 FP benchmarks. Two programs
(swim, art) show a low entropy (0 ≤ E < 5). Three programs (lucas, applu, fma3d)
show a high entropy (E ≥ 8). The three programs that show high entropy also exactly
match the programs, shown in Figure 6.5(b), that need a higher sampling rate. These
results strongly suggest that entropy is a good metric for selecting the sampling rate for
the SPEC CPU2000 benchmarks (INT, FP).
We could start sampling at a high frequency for all programs to obtain more accurate
profiles. However, that would incur unnecessarily high overhead. Based on
our entropy-based characterization and our observation of the convergence of merged
profiles, only seven programs among the 26 SPEC CPU2000 programs need a sampling rate higher
[Figure: similarity (S) vs. number of runs (1-10) for sampling rates of one sample per 1M, 10M, and 100M instructions, and for the self-adaptive profiler using the entropy and delta heuristics.]
Figure 6.6: Accuracy of entropy-based adaptive profiler on SPECjbb ver. 1.01
than one sample per 100M instructions. Hence, it is more cost-effective to start
with a low sampling rate and adjust it according to the profile entropy
collected at runtime.
6.5.2.3 Adaptive Profiler
Figure 6.6 shows the results of using entropy in an adaptive profiler for the SPECjbb
ver. 1.01 benchmark written for the Microsoft .NET platform. After the first run of
the program, the profiler increases the sampling rate from one sample per 100M
instructions to one sample per 10M instructions according to the measured entropy.
In practice, we may not have the baseline profile needed to compute the similarity metric (S).
Instead, we can use the delta (∆) of the similarity (S) between the current cumulative profile and
the previous one. If ∆S is small enough (for example, ∆S = 0.005), we
consider the convergence curve of the merged profile to have flattened. Depending on
the number of runs, the profiler can stop or continue to collect profiles. In Figure 6.6,
∆S falls below the given threshold (∆S < 0.005) at the 7th run. Since we expect
6 to 8 runs in this experiment, the profiler decides to stop profile collection.
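The stopping rule can be sketched as follows. The similarity metric S is defined in an earlier chapter; the overlap form used here, the sum of bucket-wise minima of two normalized profiles, is one common choice and an assumption for this sketch only.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Overlap similarity between two normalized profiles p and q over the
 * same n buckets: S = sum(min(p_i, q_i)). S = 1 means identical
 * distributions; S = 0 means disjoint ones. This particular form is
 * an assumption; the thesis defines S in an earlier chapter. */
double similarity(const double *p, const double *q, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (p[i] < q[i]) ? p[i] : q[i];
    return s;
}

/* Delta-S stopping rule: without a baseline profile, track S between
 * consecutive cumulative profiles and stop collecting once the
 * convergence curve flattens (change below the 0.005 threshold from
 * the text). */
int should_stop_profiling(double s_prev, double s_curr, double threshold) {
    return fabs(s_curr - s_prev) < threshold;
}
```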
When we compare the profiles generated from our adaptive profiler with those from
a sampling rate of one sample per 1M instructions, our profile is quite accurate
(S = 0.945, i.e., 94.5% similar to the baseline) with only 8.7% of the samples taken. Our
profile is only 3% less accurate than the profile generated at a sampling
rate of one sample per 1M instructions. Since an edge profile can be deduced from
this frequency profile, as explained earlier, a 3% difference in accuracy will not lead to
any significant difference in the accuracy of the deduced edge profile.
6.5.2.4 Entropy-Based Profile Classification
We found that there are three types of program behavior. In type I, the program
behavior does not change much across input sets, so the entropies of the
profiles from different inputs are similar to each other. The vortex program is an
example. Its sampled profiles are classified and merged into one baseline profile. This
is the simplest case.
Table 6.3 shows the performance improvement from PGO on vortex with multiple input
sets. Each column of the table corresponds to a different input set; for convenience,
columns are numbered by input set, so the first column is for the lendian1 input. Each
row presents the performance improvement of the binary generated using feedback
profiles. The baseline used to compute the improvement is the binary generated
without PGO. For example, feedback(1) is the binary generated using
the profile from the lendian1 input, and feedback(self) is the binary generated
using the profile from the same input on which it is measured. The feedback(self) row
shows the full potential of PGO. For vortex, feedback(classified) uses one profile
merged from all profiles.
In type II, program behavior changes significantly with the input
data set, so the sampled profiles from different inputs are dissimilar. The vpr program
is an example. Since the entropies of its two sampled profiles fall into different ranges, they
Table 6.3: Performance improvement (%) from PGO on vortex with multiple input sets
                        1:lendian1   2:lendian2   3:lendian3   average
feedback:(1)            27.64        30.84        27.83        28.77
feedback:(2)            28.03        31.26        26.92        28.74
feedback:(3)            28.02        30.02        31.26        29.77
feedback:(self)         27.64        31.26        31.26        30.06
feedback:(classified)   27.30        31.39        26.25        28.31
Entropy                 7.80         8.19         7.77
Table 6.4: Performance improvement (%) from PGO on vpr with multiple input sets
                        1:place   2:route   average
feedback:(1)            4.72      -3.00     0.86
feedback:(2)            -6.76     8.59      0.92
feedback:(self)         4.72      8.59      6.66
feedback:(classified)   8.96      8.88      8.92
Entropy                 6.83      4.82
are classified into two different profiles. Since the similarity (S) of the profiles is very
low (S < 0.4), the profile manager combines them into one profile when used for
PGO. Combining disjoint profiles is generally beneficial since it increases the
code coverage.
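The grouping decision described for the three types can be summarized in a small decision function. Only the S < 0.4 disjointness threshold comes from the text; the 0.9 merge threshold is illustrative.

```c
#include <assert.h>

/* Sketch of the profile manager's grouping rule across the three
 * profile types: near-identical profiles (type I) are merged into one
 * baseline; near-disjoint profiles (type II, S < 0.4 per the text) are
 * still combined for PGO because their union raises code coverage;
 * everything in between (type III) is kept in separate per-group
 * profiles. The 0.9 merge threshold is an assumption. */
typedef enum { MERGE_SIMILAR, COMBINE_DISJOINT, KEEP_SEPARATE } group_action;

group_action classify_pair(double s) {
    if (s >= 0.9) return MERGE_SIMILAR;    /* type I: same behavior */
    if (s < 0.4)  return COMBINE_DISJOINT; /* type II: union raises coverage */
    return KEEP_SEPARATE;                  /* type III: per-group profiles */
}
```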
Table 6.4 shows the performance improvement from PGO on vpr with different input
sets. The feedback(1) binary suffers a 3.0% slowdown compared to the baseline binary
when input set 2 is used, and the feedback(2) binary loses 6.76% performance when input set
1 is used. This shows that PGO can degrade performance if the profile used for feedback
is not generated from a representative input set. The feedback(classified) binary uses a
profile merged from the two sampled profiles. Interestingly, it performs 2.26% better
than the feedback(self) binary. This may be because the merged profile provides increased
code coverage that yields slightly better analysis results for compiler optimizations,
or because some heuristics used in compiler optimizations are sensitive
to the path frequency distribution.
In type III, the profiles can be classified into several groups of similar profiles.
Table 6.5: Performance improvement (%) from PGO on gzip with multiple input sets
                        1:source   2:log   3:graphic   4:random   5:program   average
feedback:(1)            4.52       4.17    7.87        5.75       4.04        5.27
feedback:(2)            5.05       4.77    9.73        11.24      5.01        7.16
feedback:(3)            -6.97      -8.02   6.29        10.35      -8.89       -1.45
feedback:(4)            -0.28      -4.86   7.24        14.49      1.65        3.65
feedback:(5)            5.62       4.68    13.40       12.56      5.06        8.26
feedback:(self)         4.52       4.77    6.29        14.49      5.06        7.03
feedback:(classified)   6.38       4.95    6.29        12.48      5.06        7.03
feedback:(1,2,4)        6.38       4.95    12.32       12.48      6.38        8.50
Entropy                 5.71       5.87    6.49        5.92       4.98
gzip, perlbmk, gcc and bzip2 fall into this category.
Table 6.5 shows the performance improvement from PGO on gzip with different input
sets. Three profiles (1:source, 2:log, 4:random) are merged into one profile, while the
profiles from 3:graphic and 5:program are classified into separate profiles.
The profiles from the three text inputs are thus classified into the same group. Table 6.5
indicates that entropy-based classification works quite well: feedback(1,2,4) performs
1.47% better than feedback(self).
From the above results, we can see that entropy is a good metric for classifying sampled
profiles for PGO. The binaries generated from classified profiles always perform similarly
to or better than the feedback(self) binaries. It should also be noted that, in our
experiments, the feedback(classified) binaries never caused any slowdown relative
to the binaries generated without PGO for any input set. In
contrast, feedback(3) in gzip, for example, is slower than the binary generated
without PGO for three inputs (1:source, 2:log, 5:program).
6.6 Summary
We show that highly accurate profiles can be obtained by merging a number of profiles
collected over repeated executions at a relatively low sampling frequency. We
also show that a simple characterization of the profile with information entropy can
effectively guide the sampling rate of a profiler. Using the SPECjbb2000 benchmark,
our adaptive profiler obtains a very accurate profile (94.5% similar to the baseline
profile) with only 8.7% of the samples taken at a sampling rate of one sample per 1M instructions.
Furthermore, we show that information entropy can be used to classify profiles
obtained from different input sets. The profile entropy-based approach provides
a good foundation for continuous profiling management.
Chapter 7
Optimizing Coherent Misses via
Binary Re-Adaptation
This chapter describes runtime binary re-adaptation techniques that improve the
performance of some OpenMP parallel programs by reducing the aggressiveness of data
prefetching and by using exclusive hints for prefetch instructions.
7.1 Motivating Example
First, let us use an OpenMP version of the DAXPY kernel, shown in Figure 7.1,
as an example to illustrate how memory access behavior changes with
different input data sets and different numbers of threads. The source code is compiled
with the Intel icc compiler ver. 9.1 with the -O2 -openmp options. ARRAY_SZ is varied
for (j = 0; j < 1000000; j++)
#pragma omp parallel for
  for (i = 0; i < ARRAY_SZ; i++) {
    y[i] = y[i] + a * x[i];
  }
Figure 7.1: OpenMP DAXPY C source code
...
lfetch.nt1 [r10]  // prefetch y[0]+648
lfetch.nt1 [r11]  // prefetch y[0]+520
lfetch.nt1 [r14]  // prefetch y[0]+392
lfetch.nt1 [r15]  // prefetch y[0]+264
lfetch.nt1 [r16]  // prefetch y[0]+136
lfetch.nt1 [r17]  // prefetch y[0]+8
...
.b1_22:
{ .mii
(p16) ldfd f32=[r2],8        // load x[i], i++
      nop.i 0
      nop.i 0 }
{ .mmb
(p16) ldfd f38=[r33]         // load y[i]
(p16) lfetch.nt1 [r43]       // prefetch x[i]+1200, y[i]+1200
      nop.b 0 ;; }
{ .mfi
(p23) stfd [r40]=f46         // store y[i]
(p21) fma.d f44=f6,f37,f43   // y[i] + a*x[i]
(p16) add r41=16,r43 }       // increment lfetch address
{ .mib
(p16) add r32=8,r33          // increment y[i] address
      nop.i 0
      br.ctop.sptk .b1_22 ;; } // inner for loop (SWP)
Figure 7.2: icc compiler generated Itanium assembly code for DAXPY kernel
to create different data working set sizes from 128K to 2M bytes. The number of
working threads is varied from 1 to 4. Each thread is bound to a different processor.
Figure 7.2 shows the Itanium assembly code generated by the Intel icc compiler. Before
entering the software-pipelined loop (.b1_22), the generated code issues 6 prefetches for
the initial cache line of y[0] and the subsequent five cache lines. Then, in the loop, the
code issues one prefetch instruction per iteration for both arrays x[] and y[], using the
rotating registers to alternately change the prefetch target addresses. This prefetching is
very aggressive: it targets 9 cache lines ahead of the current array references. Toward
the end of loop execution, this prefetch instruction starts to fetch unnecessary cache
lines that will be modified by neighboring processors, and would therefore trigger
unnecessary coherent misses. For example, with a 128KB data working
set split between the two arrays x[] and y[], each array holds 64KB of data. When running with 4
threads, each thread's partition of an array is 16KB. Since the L2 cache line size on Itanium 2 is 128
[Figure: scalability of the DAXPY kernel on the 4-way Itanium 2 machine; normalized execution time vs. data working set size (128K, 512K, 2M) for 1, 2, and 4 threads; panel (a) prefetch vs. noprefetch, panel (b) prefetch vs. prefetch.excl.]
Figure 7.3: Normalized execution time of OpenMP DAXPY kernel on 4-way Itanium 2 SMP server
bytes, 9 unnecessary cache lines amount to about 1KB. Therefore, a significant portion of
the data is unnecessarily shared between processors due to aggressive prefetching.
Considering that scientific applications typically have many more arrays partitioned
among the working threads, larger data working sets would
show similar unnecessary sharing. This example shows that even when an expert
programmer carefully avoids data sharing while writing explicitly multithreaded
applications, unnecessary coherent misses can still occur due to aggressive compiler
prefetch optimizations.
In Figure 7.3(a), we compare two versions of the binaries. The baseline binary
generated by the Intel icc compiler contains lfetch instructions; in the noprefetch version,
the lfetch instructions are changed to NOP instructions. Figure 7.3(a) shows the
normalized execution time of the two versions. The x axis shows different working
set sizes, counting both arrays x[] and y[].
The Itanium 2 processor used in our experiments has a 256KB L2 cache and a 1.5MB
L3 cache. Across the three working set sizes (128KB, 512KB, and
2MB), the two versions exhibit very different behaviors. With 2MB, the prefetch version
always performs better than the noprefetch version when running with 1, 2, and 4
threads. As expected, prefetching works effectively in the presence of frequent cache
misses.
With the smallest 128KB working set, the data fit within the 256KB L2 cache.
When only one thread is present, the two versions show little performance
difference since no cache misses occur after initialization. Since the
noprefetch version avoids the unnecessary data sharing between processors caused
by the aggressive data prefetching shown in Figure 7.2, it runs 35% faster than the
baseline prefetch version when running with 2 threads, and 52% faster with 4 threads.
The prefetch version running with 4 threads suffers significantly from L2 OZQ FULL
stalls. On Itanium 2, prefetch requests are placed in the L2 memory queue (OZQ)
with other load operations and do not retire until the requested cache lines are filled.
Even though loads can retire quickly when they hit in L2, prefetch requests stay longer
in the queue due to the long latency of coherent misses and eventually fill the L2
memory queue. This causes the slowdown of the prefetch version running with
multiple threads; the noprefetch version significantly reduces the L2 OZQ FULL
stall cycles.
Such unnecessary coherent misses could be avoided by more careful optimizations. For
example, the compiler could use conditional prefetches to nullify the prefetches if the
addresses are outside the intended range. However, conditional prefetch generation is
more expensive, since it requires one more register, one more compare instruction, two
more add operations and at least one additional bundle. Unless the static compiler
has a very accurate profile indicating precisely which prefetches are likely to cause
this problem, the compiler will not generate such conditional prefetches.
The compiler can also generate multi-version code to select the noprefetch version
when the remaining iteration count is small. This is to avoid the performance degra-
dation from unnecessary coherence misses caused by aggressive prefetching. When
the remaining iteration count is large, the benefit of data prefetching could outweigh
the downside of prefetching-induced coherence misses.
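Such a multi-version loop can be sketched in C. The prefetch distance (9 cache lines of 16 doubles each, following the DAXPY example above) and the use of the GCC-style __builtin_prefetch as a stand-in for the Itanium lfetch instruction are assumptions for illustration; this is not the actual compiler-generated code.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __GNUC__
#define __builtin_prefetch(addr, rw, loc) ((void)0) /* non-GCC fallback */
#endif

/* 9 cache lines x 128 bytes / 8-byte doubles = 144 elements ahead,
 * matching the prefetch distance in the DAXPY example. */
#define PREFETCH_DISTANCE 144

/* Multi-version DAXPY sketch: use the prefetch version only while the
 * prefetch target stays inside this thread's partition, and fall back
 * to a no-prefetch version for the remaining (or short) iteration
 * range, so prefetches never walk past the partition boundary into
 * lines owned by a neighboring thread. */
void daxpy_multiversion(double *y, const double *x, double a, size_t n) {
    size_t i = 0;
    if (n > PREFETCH_DISTANCE) {
        /* prefetch version: target index i + PREFETCH_DISTANCE < n */
        for (; i < n - PREFETCH_DISTANCE; i++) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 1);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 1);
            y[i] = y[i] + a * x[i];
        }
    }
    /* noprefetch version for the tail (or a small iteration count) */
    for (; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```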
If the data working set of each array is large enough to mitigate the impact
of unnecessary coherent misses, the overall slowdown may be
relatively small. Such coherent misses happen whether or not the data working set
fits in the L2 or L3 caches. However, since the amount of unnecessarily shared data
is fixed (about 1KB in our example), the relative cost of the coherent misses depends on
the data working set size of the loop.
Figure 7.3(b) shows the impact of using lfetch.excl instruction on OpenMP DAXPY
kernel shown in Figure 7.1. The lfetch.excl instruction prefetches a cache line in the
Exclusive state instead of the Shared state. When a prefetch operation with .excl
hint misses the cache, it requests the cache line in the Exclusive state. If a store
operation soon follows the prefetch operation, it will not trigger an invalidation. For
each working set size from 128KB to 2MB, the performance is normalized to the 1-thread
prefetch version. With a 128KB working set, data accesses hit in the L2 cache, and the
lfetch.excl optimization makes no difference with a single thread. However, the lfetch.excl version
runs 18% faster than the baseline prefetch version when running with 2 threads, and
14% faster with 4 threads. With a 512KB working set, the data no longer fit in one single
L2 cache, but each thread's 128KB share fits in its processor's L2 cache when the program
is running with 4 threads. When we increase the number of threads from 2 to 4, the
overhead of coherent misses starts to outweigh the benefit of prefetching; therefore,
the lfetch.excl version runs 7% faster than the baseline prefetch version when running
with 4 threads.
Since the use of lfetch.excl can increase the number of L2 writebacks, it can
lengthen the latency of store instructions. This is why the lfetch.excl version is
sometimes slower than the baseline prefetch version. With a 2MB working set,
the data sharing effect is relatively small because aggressive prefetching results in
sharing of only the last 10 cache lines. In this case, due to the increased L2 cache
writebacks, using lfetch.excl slows the program down. As this
example shows, correctly applying lfetch.excl instructions can be very challenging for
a static compiler, which is why .excl prefetch hints are usually used only in numeric
libraries written by expert programmers. A dynamic optimizer, however, has more
accurate information to guide the use of such prefetch hints.
This example clearly shows that a single binary generated by one of the most advanced
optimizing compilers cannot always provide good performance under different execution
conditions, and the performance opportunities left unexploited can be
significant. This is rather different from the single-processor scenario, where aggressive
data prefetching is usually considered beneficial with little downside, which
is why most compilers perform aggressive data prefetching by default. As our
example shows, unwanted prefetches can cause coherent misses and thus substantially
slow down execution. It is difficult for programmers to analyze and evaluate
the performance impact caused by changes in data working sets and in the number
of threads/processors. A runtime binary optimizer such as COBRA can identify
performance bottlenecks and hot spots through continuous performance monitoring,
and effectively tune performance via runtime code optimization.
7.2 Optimizing Coherent Misses
As shown in the previous section, it is very challenging to statically optimize scalable
OpenMP parallel applications because the data working sets and the
contention for shared data can change at runtime. Collard et al. [17] proposed system-wide
monitors, called SWIFT, to find pairs of instructions involved in false sharing.
The compiler then uses SWIFT-based profiles to tune the .bias hint for the identified
load instructions in order to reduce unnecessary coherent traffic on the system bus.
Some recent processors provide special instructions to optimize cache coherent mem-
ory accesses. However, due to the lack of runtime profiling support to pinpoint the
instructions that cause such unnecessary memory coherent traffic, these instructions
are rarely used in static compiler optimizations. Itanium 2 supports .bias hint for
integer load instructions. When a load operation with .bias hint misses the cache, it
requests the cache line in the exclusive state, i.e. it will invalidate all of the existing
copies of the cache line, instead of the regular shared state. If a store operation soon
follows the load operation, and it writes to the same cache line, it will not trigger a
coherent bus transaction to invalidate the cache lines in other processors. The .bias
hint is not supported for control- and data-speculative loads (ld.s and ld.a), the load
check (ld.c), the load with acquire semantics (ld.acq), and floating-point loads. Therefore,
the use of the .bias hint is very limited. The Itanium 2 processor also provides the .excl hint
for the lfetch prefetch instruction, which prefetches a cache line in
the Exclusive state instead of the Shared state. Depending on the sequence of load
and store operations, the use of the .excl hint might lead to more system bus transactions
because shared cache lines are invalidated. Therefore, this type of optimization relies
heavily on accurate runtime profiles.
On Itanium 2, several hardware performance counters relate to coherent
bus events. For example, BUS_RD_HIT, BUS_RD_HITM, and BUS_RD_INVAL_ALL_HITM
record the snooping responses from other processors to bus transactions initiated
by the monitoring processor [1]. The counter for the BUS_MEMORY event
monitors the total number of bus transactions. If we divide the
sum of coherent bus events by the total number of bus transactions, we could estimate
the ratio of coherent memory accesses to all bus transactions. We could use this ratio
to decide whether to apply the optimization for coherent cache misses. Other processors,
such as the IBM Power3, also support monitoring of cache coherence events.
The PM_SNOOP_L2_E_OR_S_TO_I and PM_SNOOP_M_TO_I events can be used
to measure the total number of L2 cache invalidations.
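The ratio computation can be sketched directly from the counter values. The counter names follow the Itanium 2 event list cited above; the 50% trigger threshold is an assumption for illustration only.

```c
#include <assert.h>
#include <math.h>

/* Estimate the fraction of bus traffic due to cache coherence:
 * divide the sum of the coherent snoop events (BUS_RD_HIT,
 * BUS_RD_HITM, BUS_RD_INVAL_ALL_HITM) by the total number of bus
 * transactions (BUS_MEMORY), as described in the text. */
double coherent_ratio(unsigned long bus_rd_hit,
                      unsigned long bus_rd_hitm,
                      unsigned long bus_rd_inval_all_hitm,
                      unsigned long bus_memory) {
    if (bus_memory == 0) return 0.0;
    unsigned long coherent =
        bus_rd_hit + bus_rd_hitm + bus_rd_inval_all_hitm;
    return (double)coherent / (double)bus_memory;
}

/* Trigger the coherent-miss optimization when coherence traffic
 * dominates; the 0.5 threshold is illustrative, not from the text. */
int should_optimize_coherence(double ratio) {
    return ratio > 0.5;
}
```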
On Itanium 2 systems, once we detect intensive coherent misses, we could use Data
Event Address Registers (DEARs) to pinpoint which instructions caused most of
coherent cache misses. The DEAR can be used to monitor any of L1 data cache
load misses, FP load misses, L1 data TLB misses, or ALAT (Advanced Load Address
Table) misses. Each DEAR sample contains an instruction address that caused the
cache miss, its data address, and the associated latency. The DEAR can be programmed
to filter out unwanted events. For example, since the L3 cache hit latency on Itanium 2 is 12
cycles, we can filter out on-chip L2 cache misses that hit in the L3 cache by programming
the DEAR to track events with latency greater than 12 cycles. This filtering scheme
could avoid selecting those memory loads that cause L2 cache misses but are satisfied
by L3 cache hits. Still, we need another filter to separate loads with long latency
caused by coherent memory accesses from those that are served by the memory. We
found that on the Itanium 2 server the latency of a coherent miss is usually much
greater than that of a memory load: memory access latencies are usually
between 120-150 cycles, while coherent miss latencies can exceed 180-200 cycles.
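The two-level filtering of DEAR samples can be sketched as follows. The 12-cycle and 180-cycle cutoffs come from the latencies quoted above; the sample layout is illustrative and not the actual DEAR register format.

```c
#include <assert.h>

/* Two-level latency filter over DEAR samples, following the text:
 * (1) drop events with latency <= 12 cycles (satisfied by the L3
 *     cache on Itanium 2), and
 * (2) separate coherent misses from plain memory loads by latency,
 *     since ordinary memory accesses take roughly 120-150 cycles
 *     while coherent misses exceed 180-200 cycles. The 180-cycle
 *     cutoff is an assumption drawn from those quoted ranges. */
typedef struct {
    unsigned long pc;        /* instruction address of the missing load */
    unsigned long data_addr; /* data address that missed */
    unsigned latency;        /* observed miss latency in cycles */
} dear_sample;

enum miss_kind { FILTERED_OUT, MEMORY_MISS, COHERENT_MISS };

enum miss_kind classify_dear(const dear_sample *s) {
    if (s->latency <= 12) return FILTERED_OUT;   /* L3 hit */
    if (s->latency >= 180) return COHERENT_MISS; /* cache-to-cache */
    return MEMORY_MISS;                          /* served by memory */
}
```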
We may hide the long memory latency either by inserting data prefetches or by
scheduling the cache-missing load far away from its actual use. Prefetch insertion is
easier to apply since the prefetch instruction is non-binding and can be scheduled
freely. Furthermore, prefetch instructions are merely hints; they do not affect the
correctness of the code. However, we need to find the prefetch instructions that are
associated with the load instructions. Our heuristic is based on the fact that prefetch
instructions are usually generated inside a loop or at the entry point of a loop. Therefore,
we try to discover the loops that contain the loads found through the two-level
filtering scheme mentioned above. On Itanium 2, using the BTB to capture the last 4 taken
branches and their target addresses, we can easily discover the loop boundaries and
determine the PC addresses of lfetch instructions within the identified boundaries.
Finally, we can apply optimizations to the identified prefetch instructions.
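The loop-discovery step can be sketched as below. The data structures are illustrative; the actual implementation would decode Itanium bundles to locate lfetch instructions rather than consult a precomputed PC list.

```c
#include <assert.h>
#include <stddef.h>

/* The BTB reports recent taken branches with their targets. A
 * backward taken branch (target before the branch itself) delimits a
 * loop [target_pc, branch_pc]; lfetch instructions whose PCs fall
 * inside those bounds are the optimization candidates. The structs
 * here are hypothetical, not the COBRA implementation. */
typedef struct { unsigned long branch_pc, target_pc; } btb_entry;

/* Return 1 and fill the loop bounds if this is a backward branch. */
int loop_bounds_from_btb(const btb_entry *e,
                         unsigned long *lo, unsigned long *hi) {
    if (e->target_pc >= e->branch_pc) return 0; /* not a loop branch */
    *lo = e->target_pc;
    *hi = e->branch_pc;
    return 1;
}

/* Collect the PCs of known lfetch instructions inside [lo, hi]. */
size_t lfetches_in_loop(const unsigned long *lfetch_pcs, size_t n,
                        unsigned long lo, unsigned long hi,
                        unsigned long *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (lfetch_pcs[i] >= lo && lfetch_pcs[i] <= hi)
            out[k++] = lfetch_pcs[i];
    return k;
}
```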
7.3 Experimental Setup
Our experimental data are collected on a 4-processor Itanium 2 server and an SGI
Altix system. We used 8 processors of the SGI Altix system for our experiments. On
the 4-processor Itanium 2 SMP server, four processors are connected via a front-side
bus (6.4GB/sec) that supports a MESI (also called Illinois protocol) cache coherence
protocol. On the SGI Altix, two processors are connected via a front-side bus to
form a computing node. All of the 2-processor nodes are connected by a fat-tree
interconnection network. The Intel icc/ifort compiler ver. 9.1 is used to compile the NAS
Parallel Benchmarks with the -O3 and -openmp options.
The NAS Parallel Benchmark (NPB) [30] consists of five kernels and three simulated
CFD applications (BT, SP, LU) derived from several important aerophysics appli-
cations. The five kernels (FT, MG, CG, EP, IS) mimic the computational core of
five numeric methods used in CFD applications. The simulated CFD applications
reproduce much of data movement and computation found in full CFD codes.
The description of eight NAS parallel benchmarks is as follows:
• BT is a simulated CFD application that uses an implicit algorithm to solve
3-D compressible Navier-Stokes equations. The finite-difference solution to the
problem is based on an Alternating Direction Implicit (ADI) approximate fac-
torization that decouples the x,y and z dimensions.
• SP is a simulated CFD application that has a similar structure to BT. The
finite-difference solution is based on a Beam-Warming approximate factorization
that decouples the x,y, and z dimensions.
• LU is a simulated CFD application that uses a symmetric successive over-
relaxation (SSOR) method to solve a seven-block-diagonal system resulting
from finite-difference discretization of the 3-D Navier-Stokes equations by splitting it
into block lower and upper triangular systems.
• FT contains the computational kernel of a 3-D Fast Fourier Transform (FFT)-
based spectral method. FT performs three one-dimensional (1-D) FFTs, one
for each dimension.
• MG uses a V-cycle multi-grid method to compute the solution of the 3-D scalar
Poisson equation. The algorithm works continuously on a set of grids that are
made between coarse and fine grids. It tests both short and long distance data
movement.
• CG uses a Conjugate Gradient method to compute an approximation to the
smallest eigenvalue of a large, sparse, unstructured matrix.
• EP is an embarrassingly parallel benchmark. It generates pairs of Gaussian
random deviates according to a specific scheme. The goal is to establish the
reference point for the peak performance of a given platform.
• IS is the integer sort kernel.
The NPB benchmarks are implemented in High Performance Fortran (HPF), OpenMP,
and the Message Passing Interface (MPI) to accommodate various parallel machines. The
OpenMP version of NPB is used in our experiments. OpenMP uses a set
of compiler directives that guide the compiler in exploiting loop-level parallelism.
Cache coherent memory accesses can limit the scalability of OpenMP programs since
computations inside a loop are distributed based on the loop index range regardless
of data locations. The NPB benchmarks provide five data sets (S, W, A, B, C), from the
smallest (S) to the largest (C). Since 60-70% of memory accesses in the
smallest data set (S) are coherent memory accesses, we use the smallest
data set (S) in our experiments to evaluate the effectiveness of optimizations on
coherent memory accesses. As the data set size increases,
the proportion of coherent memory accesses decreases.
Table 7.1 shows the number of loops and prefetches generated by the icc compiler
in the OpenMP NPB binaries. On Itanium, br.ctop and br.wtop are branches used
Table 7.1: The number of loops and prefetches in compiler-generated OpenMP NPB binaries
benchmarks   lfetch   br.ctop   br.cloop   br.wtop
BT           140      34        32         0
SP           276      67        22         0
LU           184      61        19         0
FT           258      45        9          8
MG           419      66        34         4
CG           433      69        29         2
EP           17       1         4          1
IS           76       19        13         2
in software-pipelined (SWP) loops. The br.cloop branch is used in counted loops. The
compiler generates several hundred prefetches in most of the benchmarks except EP
and IS. It is infeasible to tune every prefetch instruction manually due to the large
number of candidate prefetches.
7.4 Experimental Results
To understand the impact of the two optimizations (noprefetch, prefetch.excl) on different
system architectures, we examined the execution time, L3 misses, and the number of
system memory bus transactions. The overall execution time of the parallel programs is
based on wall-clock time. The L3 misses and the number of memory bus transactions
are highly correlated because L3 misses must be serviced by bus transactions.
Since the IS and EP benchmarks do not show any long-latency coherent misses on either
machine, we exclude them from our final results.
Three different prefetch strategies are studied in our experiments.
• prefetch: This is our baseline for evaluating the effect of our prefetch optimiza-
tions for coherent cache misses. The prefetch version is chosen as the baseline
because recent optimizing compilers aggressively generate prefetches even at the
commonly used -O2 optimization level. Our baseline binaries are compiled at
the highest optimization level (-O3) of the Intel compiler.
• noprefetch: This optimization selectively reduces the aggressiveness of prefetch-
ing to remove unnecessary coherent cache misses. Our runtime profiler guides
the optimizer to select prefetches in a few loops and turn them into NOP in-
structions.
• prefetch.excl: This optimization also selectively chooses prefetch instructions
that cause long latency coherent misses and applies .excl hint on the selected
prefetches.
The noprefetch strategy is very effective when the data working set fits in the processor
caches and many coherent misses are caused by aggressive prefetching. However, it
needs precise runtime profiles to avoid removing effective prefetches, which would result
in performance loss.
7.4.1 Impact on Execution Time
Figure 7.4 shows the performance improvement from the two optimizations (noprefetch,
prefetch.excl) on the OpenMP NPB benchmarks. The speedup achieved with the noprefetch
optimization on the 4-way SMP server was up to 15% with an average of 4.7%, and with
the lfetch.excl optimization it was up to 8% with an average of 2.7%, as shown in Figure 7.4(a).
Since the penalty of coherent misses is much higher on cc-NUMA machines
than on SMP machines, we obtained a higher performance improvement from
the two optimizations on the SGI Altix. The speedup achieved with the noprefetch
optimization on the SGI Altix cc-NUMA system was up to 68% with an average of 17.5%, and
with the lfetch.excl optimization it was up to 18% with an average of 8.5%, as shown in
Figure 7.4(b).
Intuitively, replacing prefetch instructions with NOP instructions could slow down
program execution because load latency increases. However, it should be noted
that our noprefetch optimization does not blindly replace prefetch instructions with
[Figure: speedup relative to the prefetch baseline on bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and their average; panel (a) 4 threads on the 4-way SMP node (noprefetch up to 15%, average 4.7%; prefetch.excl up to 8%, average 2.7%); panel (b) 8 threads on the SGI Altix cc-NUMA machine (noprefetch up to 68%, average 17.5%; prefetch.excl up to 18%, average 8.5%).]
Figure 7.4: Speedup of coherent memory access optimization on OpenMP NPB benchmarks. The performance of the prefetch version (optimized by the Intel compiler) is normalized to 1 as the baseline.
93
NOP instructions. It uses the filtering mechanism detailed in Section 7.2 to filter
out instructions that cause frequent L3 misses when the L2 miss ratio is low. Thus
a large portion of the memory transactions optimized by noprefetch are those
transactions related to coherent memory accesses. This filtering heuristic
allows us to minimize the negative impact of the optimization on performance.
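A minimal sketch of this filtering heuristic, assuming per-instruction miss counters such as those the Itanium 2 PMU can attribute through sampling (the function name and thresholds are illustrative assumptions):

```python
# Hypothetical sketch of the filtering heuristic: an instruction whose L3
# misses are frequent even though its L2 miss ratio is low suggests, on
# Itanium 2, that the L3 misses stem from coherence traffic rather than
# cache capacity. Thresholds here are made up for illustration.

def likely_coherent(l2_miss_ratio, l3_misses_per_kinst,
                    l2_threshold=0.02, l3_threshold=5.0):
    """Return True if the miss pattern points at coherent memory accesses."""
    return l2_miss_ratio < l2_threshold and l3_misses_per_kinst > l3_threshold
```

Only instructions passing this test become candidates for noprefetch or prefetch.excl, which is how the heuristic avoids disturbing prefetches that are covering ordinary capacity misses.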
The prefetch.excl optimization is less effective than the noprefetch optimization. Even
though it improves the performance of instruction sequences that
contain load operations followed by store operations to the same cache line, it could
still fetch unnecessary cache lines from other processors.
7.4.2 Impact on L3 Cache Misses
Figure 7.5 shows the impact of the two optimizations (noprefetch, prefetch.excl) on
L3 misses. On Itanium 2, coherent cache misses can lead to L3 misses. When
coherent memory accesses make up a significant portion of L3 cache misses, a
substantial reduction in L3 misses indicates that we have reduced unnecessary coherent misses.
On the SP and the CG benchmarks, L3 misses have been substantially reduced by
the noprefetch version. The reduction is as high as 29.9% for SP and 39.5% for CG,
on the 4-way SMP server, as shown in Figure 7.5(a). On the SGI Altix system, we
have also observed near 20% reduction of L3 misses from the noprefetch version for
BT, SP and CG, as shown in Figure 7.5(b).
7.4.3 Impact on Memory Bus Transactions
Figure 7.6 shows the impact of the two optimizations (noprefetch, prefetch.excl) on
the number of memory transactions on the system bus. Since L3 misses are directly
translated into memory transactions on the system bus, the number of memory trans-
actions is highly correlated with L3 misses. Hence, Figure 7.6 is closely correlated to
[Figure 7.5: Number of L3 misses on OpenMP NPB benchmarks, normalized to the baseline. (a) 4 threads running on a 4-way SMP node; (b) 8 threads running on the SGI Altix cc-NUMA machine.]
[Figure 7.6: Number of memory transactions on the system bus on OpenMP NPB benchmarks, normalized to the baseline. (a) 4 threads running on a 4-way SMP node; (b) 8 threads running on the SGI Altix cc-NUMA machine.]
Figure 7.5.
7.5 Summary
We have shown that, with the OpenMP NAS parallel benchmarks, COBRA can adap-
tively select appropriate optimization techniques based on changing runtime program
behaviors to achieve significant speedups. Coherent memory accesses caused by data
sharing often limit the scalability of multithreaded applications. Using COBRA, the
performance of some OpenMP parallel programs can be improved by dynamically
reducing the aggressiveness of data prefetching and using exclusive hints for prefetch
instructions.
Chapter 8
Conclusions and Future Work
Dynamic compilation and binary optimization have become an increasingly important part
of program runtime environments, as static binaries generated by traditional compil-
ers struggle to deliver good performance across processor revisions, system architec-
tures, and the diverse program behaviors induced by different input sets.
This thesis proposes a dynamic binary optimization framework and addresses several
key problems: phase-aware program monitoring, hardware support for phase
detection, profile characterization and classification for continuous profiling, and dy-
namic binary re-adaptation for multithreaded programs.
8.1 Conclusions
Runtime dynamic optimizers have been shown to improve the performance and power efficiency
of single-threaded applications. Multithreaded applications running on SMP, CMP,
and cc-NUMA systems pose new challenges and opportunities for runtime dynamic bi-
nary optimizers. This thesis introduces COBRA (Continuous Binary Re-Adaptation),
a runtime binary optimization framework. A prototype has been implemented on Ita-
nium 2 based SMP and cc-NUMA systems.
We investigated the use of control flow information, such as loops and function calls,
to identify repetitive program behavior as program phases. In the course of this study
on using control flow as a phase signature, we implemented efficient phase-aware
runtime program monitoring schemes in our COBRA framework. We describe sam-
pled Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based program
phase detection schemes. The sampled HWSET-based phase detection scheme
shows higher phase coverage and longer stable phases than the sampled BBV-
based scheme. We have evaluated the effectiveness of our
DCR-based phase tracking hardware on a set of SPEC benchmark programs with
known phase behaviors, and have shown that it exhibits the desirable
characteristics of a phase detector for dynamic optimization systems. The hardware
is simple and cost-effective, and the phase sequence it detects can be
accurately predicted using simple prediction techniques.
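As one concrete illustration, BBV-based phase detection typically compares normalized basic-block execution frequency vectors between sampling intervals. A sketch under that common formulation follows; the function names and the distance threshold are illustrative, not the exact COBRA scheme:

```python
# Sketch of BBV-style phase comparison: each interval's profile is a map
# from basic-block ID to execution count; two intervals belong to the same
# phase if their normalized vectors are close in Manhattan distance.
# The 0.5 threshold is an illustrative choice, not COBRA's tuned value.

def normalize(bbv):
    """Convert raw counts to execution-frequency fractions."""
    total = float(sum(bbv.values()))
    return {k: v / total for k, v in bbv.items()}

def manhattan_distance(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

def same_phase(bbv_a, bbv_b, threshold=0.5):
    return manhattan_distance(normalize(bbv_a), normalize(bbv_b)) < threshold

# Illustrative interval profiles.
loop_a  = {"bb1": 90, "bb2": 10}
loop_a2 = {"bb1": 85, "bb2": 15}   # same loop, slightly different mix
loop_b  = {"bb3": 50, "bb4": 50}   # a different code region entirely
```

Two intervals dominated by the same loop yield a small distance and are classified as one stable phase, while disjoint working sets produce the maximal distance of 2.0.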
We show that highly accurate profiles can be obtained by merging a number of pro-
files collected over repeated executions at a relatively low sampling frequency. We also
show that a simple characterization of the profile with information entropy can effec-
tively guide the sampling rate of a profiler. Using the SPECjbb2000 benchmark, our
adaptive profiler obtains a very accurate profile (94.5% similar to the baseline profile)
with only 8.7% of the samples at a sampling rate of one sample per 1M instructions.
Furthermore, we show that information entropy can be used to classify
profiles obtained from different input sets. The profile entropy-based approach pro-
vides a good foundation for continuous profile management and effective PGO in a
dynamic compilation environment.
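The entropy-guided sampling idea can be sketched as follows. The entropy formula is the standard Shannon entropy over the profile's sample distribution; the `pick_sampling_interval` policy and its thresholds are illustrative assumptions, not the thesis's exact mechanism:

```python
import math

def profile_entropy(counts):
    """Shannon entropy (bits) of a profile's sample distribution.
    counts: map from program counter (or block ID) to sample count."""
    total = float(sum(counts.values()))
    probs = (c / total for c in counts.values() if c > 0)
    return -sum(p * math.log2(p) for p in probs)

def pick_sampling_interval(entropy, low=1.5, base=100_000):
    """Illustrative policy: a low-entropy profile is concentrated in a few
    hot spots, so a sparser sampling interval already captures it well."""
    return base * 10 if entropy < low else base

# A flat profile (16 equally hot sites) has maximal entropy, log2(16) = 4.
flat = {f"pc{i}": 1 for i in range(16)}
# A skewed profile with one dominant hot spot has near-zero entropy.
hot = {"pc0": 97, "pc1": 1, "pc2": 1, "pc3": 1}
```

The intuition matches the SPECjbb2000 result above: when most samples fall on a few hot regions, far fewer samples suffice to reproduce an accurate profile.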
We have shown that, with the OpenMP NAS parallel benchmarks, COBRA can adap-
tively select appropriate optimization techniques based on changing runtime program
behaviors to achieve significant speedups. Coherent memory accesses caused by data
sharing often limit the scalability of multithreaded applications. Using COBRA, the
performance of some OpenMP parallel programs can be improved by dynamically
reducing the aggressiveness of data prefetching and using exclusive hints for prefetch
instructions.
8.2 Future Work
We have several directions to extend our work in the future.
First, we can extend the implementation of our runtime binary optimization frame-
work to support multiple platforms, including Intel Pentium and IBM PowerPC. The cur-
rent implementation only supports Intel Itanium processors. Itanium processors have
gone through several generations: Itanium 1, Itanium 2, and Montecito. Each
processor implements its hardware performance monitors slightly differently, sup-
porting different performance events and different numbers of performance counters
and event registers. The current COBRA implementation therefore maintains a separate
version for each processor revision. Similarly, the ADORE binary optimizer has independent
implementations for Intel Itanium [37] and Sun SPARC processors [38]. Generic support
for the performance counters of different processors and revisions is left as a future
enhancement of the framework.
Second, we plan to further investigate the correlation between sampled code signatures and
performance behaviors in large-scale multithreaded programs. Understanding
the performance bottlenecks in such large-scale applications is an increasingly diffi-
cult problem. The sampled code signature, used as a phase ID, serves as a tag
to classify performance behavior. Tagging performance characteristics in this way
opens the possibility of building a performance database that enables
statistical analysis of performance bottlenecks over time. The sampled code
signature can be collected from kernel and library code in addition to the user's own
code.
Third, we plan to further investigate hardware support for phase detection by augmenting
existing performance counters. Many recent processors already provide hardware
counters that record recently taken branches. Our proposed phase detection needs addi-
tional branch-type checking logic, a hardware stack, and a phase signature table. We believe
that adding phase tracking hardware to current hardware performance counters
would greatly reduce the overhead of software phase detection in dynamic optimizers.
Fourth, current static and dynamic compilers and dynamic binary optimizers each use their
own proprietary profile formats. If compilation and dynamic binary optimization are to
coexist and maximize the benefit of continuous optimization in future program runtime
environments, support for a generic profile format is an important piece of
functionality.
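As a purely hypothetical illustration of what such a generic, tool-neutral profile format might look like (every field name below is an assumption, not a proposed standard):

```python
import json

# Hypothetical self-describing profile record that a static compiler and a
# dynamic binary optimizer could both consume. Field names are illustrative.
profile = {
    "format_version": 1,
    "target": {"arch": "ia64", "binary": "a.out"},
    "sampling": {"event": "BRANCH_EVENT", "interval": 100_000},
    "samples": [
        {"pc": "0x4000a0", "count": 1824},
        {"pc": "0x4000b0", "count": 311},
    ],
}

# A text-based serialization makes the profile portable between tools.
blob = json.dumps(profile)
restored = json.loads(blob)
```

The key design point is self-description: the record carries its own sampling event and interval, so a consumer can interpret counts without out-of-band knowledge of the producer.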
Finally, as recent processors support voltage/frequency scaling for better power
efficiency and thermal control, dynamic optimization needs to consider not only
performance improvement but also power efficiency in its optimization decisions.
We plan to investigate integrating a power/performance projection model into
runtime optimization.
Bibliography
[1] Intel Itanium Processor Reference Manual for Software Development. http:
//www.intel.com/design/itanium/manuals.htm.
[2] pfmon - HP Performance Monitoring Tool. http://www.hpl.hp.com/research/
linux/perfmon.
[3] PIN - A Dynamic Binary Instrumentation Tool. http://rogue.colorado.edu/
Pin.
[4] Ammons, G., Ball, T., and Larus, J. R. Exploiting hardware performance
counters with flow and context sensitive profiling. In Proceedings of the ACM
SIGPLAN‘97 Conference on Programming Language Design and Implementation
(PLDI) (June 1997).
[5] An, P., Jula, A., Rus, S., Saunders, S., Smith, T., Tanase, G., Amato,
N., and Rauchwerger, L. STAPL: An adaptive, generic parallel program-
ming library for C++. In Workshop on Languages and Compilers for Parallel
Computing (LCPC) (Cumberland Falls, Kentucky, August 2001).
[6] Annavaram, M., Rakvic, R., Polito, M., Bouguet, J., Hankins, R.,
and Davies, B. The Fuzzy Correlation between Code and Performance Pre-
dictability. In the 37th International Symposium on Microarchitecture (December
2004).
[7] Arnold, M., Fink, S., Grove, D., Hind, M., and Sweeney, P. F.
Adaptive optimization in the Jalapeno JVM. In 15th Conference on Object-
Oriented Programming, Systems, Languages, and Applications (OOPSLA’00)
(2000), pp. 47–65.
[8] Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: A transparent
dynamic optimization system. In Proceedings of the ACM SIGPLAN conference
on Programming language design and implementation (PLDI’00) (June 2000).
[9] Bosworth, G. PreJIT in the CLR. In 2nd Workshop on Managed Runtime
Environments (MRE‘04) (2004).
[10] Bruening, D., and Amarasinghe, S. Maintaining consistency and bounding
capacity of software code caches. pp. 74–85.
[11] Bruening, D., Duesterwald, E., and Amarasinghe, S. Design and im-
plementation of a dynamic optimization framework for Windows. In the 4th
Workshop of Feedback-Directed and Dynamic Optimization (Austin, TX, 2001).
[12] Bruening, D., Garnett, T., and Amarasinghe, S. An Infrastructure for
Adaptive Dynamic Optimization. In Proceedings of 1st International Symposium
on Code Generation and Optimization (CGO’03) (2003), pp. 265–275.
[13] Burke, M., Choi, J.-D., Fink, S., Grove, D., Hind, M., Sarkar, V.,
Serrano, M., Sreedhar, V., Srinivasan, H., and Whaley, J. The
Jalapeno dynamic optimizing compiler for Java. In Proc. ACM 1999 Java Grande
Conference (1999), pp. 129–141.
[14] Chen, H., Lu, J., Hsu, W.-C., and Yew, P.-C. Continuous Adaptive
Object-Code Re-optimization Framework. the 9th Asia-Pacific Computer Sys-
tems Architecture Conference (2004).
[15] Chen, W.-K., Lerner, S., Chaiken, R., and Gillies, D. Mojo: A dynamic
optimization system. In the 3rd Workshop of Feedback-Directed and Dynamic
Optimization (2000), pp. 81–90.
[16] Choi, Y., Knies, A., Vedaraman, G., and Williamson, J. Design and
experience: Using the Intel Itanium 2 processor performance monitoring unit to
implement feedback optimization. In EPIC2 Workshop (2002).
[17] Collard, J.-F., Jouppi, N., and Yehia, S. System-Wide Performance Mon-
itors and their Application to the Optimization of Coherent Memory Accesses.
In Proc. Intl. Symp. on Prin. and Practice of Parallel Prog. (PPoPP) (Chicago,
IL, June 2005).
[18] Cover, T., and Thomas, J. Elements of Information Theory. John Wiley
and Sons, 1991.
[19] Das, A., Lu, J., Chen, H., Kim, J., Yew, P.-C., Hsu, W.-C., and Chen,
D.-Y. Performance of Runtime Optimization on BLAST. In Proceedings of
the Third Annual IEEE/ACM International Symposium on Code Generation and
Optimization (March 2005).
[20] Deaver, D., Gorton, R., and Rubin, N. Wiggins/redstone: An on-line
program specializer. In Hot Chips 11 Conf. (Palo Alto, CA, 1999).
[21] Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E., Petitet, A.,
Vuduc, R., Whaley, C., and Yelick, K. Self-Adapting linear algebra
algorithms and software. Proceedings of the IEEE 93, 2 (February 2005), 293–
312.
[22] Dhodapkar, A., and Smith, J. Managing multi-configuration hardware via
dynamic working set analysis. In 29th Annual International Symposium on Com-
puter Architecture (May 2002).
[23] Duesterwald, E., Cascaval, C., and Dwarkadas, S. Characterizing and
predicting program behavior and its variability. In International Conference on
Parallel Architectures and Compilation Techniques (October 2003).
[24] Ebcioglu, K., and Altman, E. DAISY: Dynamic compilation for 100%
architectural compatibility. In Proc. 24th Annual International Symposium on
Computer Architecture (1997), pp. 26–37.
[25] Hazelwood, K., and Smith, J. E. Exploring Code Cache Eviction Granu-
larities in Dynamic Optimization Systems. In second Annual IEEE/ACM Inter-
national Symposium on Code Generation and Optimization (March 2004).
[26] Hazelwood, K., and Smith, M. Generational cache management of code
traces in dynamic optimization systems. In 36th International Symposium on
Microarchitecture (December 2003).
[27] Hind, M., Rajan, V., and Sweeney, P. Phase shift detection: a problem
classification. IBM Research Report RC-22887, 45–57.
[28] Hsu, C.-H., and Kremer, U. The design, implementation and evaluation of a
compiler algorithm for CPU energy reduction. In Proceedings of ACM SIGPLAN
Conference on Programming Language Design and Implementation (June 2003).
[29] Huang, M., Renau, J., and Torrellas, J. Positional adaptation of pro-
cessors: Application to energy reduction, June 2003.
[30] Jin, H., Frumkin, M., and Yan, J. The OpenMP Implementation of NAS
Parallel Benchmarks and Its Performance. NAS Technical Report NAS-99-011
(October 1999).
[31] Kim, H., and Smith, J. Dynamic binary translation for accumulator-oriented
architectures. pp. 25–35.
[32] Kistler, T., and Franz, M. Computing the Similarity of Profiling Data. In
Proc. Workshop on Feedback-Directed Optimization (1998).
[33] Lau, J., Perelman, E., and Calder, B. Selecting Software Phase Markers
with Code Structure Analysis. In Proceedings of the International Symposium
on Code Generation and Optimization (CGO2006) (March 2006).
[34] Lau, J., Schoenmackers, S., and Calder, B. Structures for Phase Classi-
fication. In IEEE International Symposium on Performance Analysis of Systems
and Software (March 2004).
[35] Lau, J., Schoenmackers, S., and Calder, B. Transition Phase Classifica-
tion and Prediction. In the 11th International Symposium on High Performance
Computer Architecture (February 2005).
[36] Liu, W., and Huang, M. EXPERT: expedited simulation exploiting program
behavior repetition. In Proceedings of the 18th annual International Conference
on Supercomputing (June 2004).
[37] Lu, J., Chen, H., Fu, R., Hsu, W.-C., Othmer, B., and Yew, P.-C.
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimiza-
tion System. In Proceedings of the 36th Annual International Symposium on
Microarchitecture (December 2003).
[38] Lu, J., Das, A., Hsu, W.-C., Nguyen, K., and Abraham, S. G.
Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor. In
Proceedings of the 38th Annual IEEE/ACM international Symposium on Mi-
croarchitecture (2005).
[39] Luk, C.-K., et al. Ispike: A post-link optimizer for the Intel Itanium 2 architec-
ture. In Proceedings of the 2nd International Symposium on Code Generation and
Optimization (CGO) (2004).
[40] Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G.,
Wallace, S., Reddi, V. J., and Hazelwood, K. Pin: Building customized
program analysis tools with dynamic instrumentation. In Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and Implementation
(PLDI'05) (June 2005), pp. 190–200.
[41] Magklis, G., Scott, M. L., Semeraro, G., Albonesi, D. A., and Drop-
sho, S. Profile-based Dynamic Voltage and Frequency Scaling for a Multiple
Clock Domain Microprocessor. In Proceedings of the International Symposium
on Computer Architecture (June 2003).
[42] Muchnick, S. Advanced Compiler Design and Implementation. Morgan Kauf-
man, 1997.
[43] Nagpurkar, P., Krintz, C., and Sherwood, T. Phase-Aware Remote
Profiling. In the third Annual IEEE/ACM International Symposium on Code
Generation and Optimization (March 2005).
[44] Patil, H., Cohn, R., Charney, M., Kapoor, R., Sun, A., and
Karunanidhi, A. Pinpointing representative portions of large Intel Itanium
programs with dynamic instrumentation. In MICRO-37 (December 2004).
[45] Puschel, M., Moura, J. M. F., Johnson, J. R., Padua, D., Veloso,
M. M., Singer, B. W., Xiong, J., Franchetti, F., Gacic, A., Voro-
nenko, Y., Chen, K., Johnson, R. W., and Rizzolo, N. SPIRAL: Code
Generation for DSP Transforms. Proceedings of the IEEE 93, 2 (February 2005).
[46] Savari, S., and Young, C. Comparing and Combining Profiles. In Journal
of Instruction Level Parallelism (2004).
[47] Scott, K., Kumar, N., Velusamy, S., Childers, B., and Soffa, M.
Retargetable and reconfigurable software dynamic translation. In International
Symposium on Code Generation and Optimization (CGO ’03) (2003), pp. 36–47.
[48] Shen, X., Zhong, Y., and Ding, C. Locality phase prediction. In Inter-
national Conference on Architectural Support for Programming Languages and
Operating Systems (2004).
[49] Sherwood, T., Perelman, E., and Calder, B. Basic block distribution
analysis to find periodic behavior and simulation points in applications. In In-
ternational Conference on Parallel Architectures and Compilation Techniques
(September 2001).
[50] Sherwood, T., Perelman, E., Hamerly, G., and Calder, B. Automat-
ically characterizing large scale program behavior. In Architectural Support for
Programming Language and Operating Systems - X (October 2002).
[51] Sherwood, T., Sair, S., and Calder, B. Phase tracking and prediction. In
30th Annual International Symposium on Computer Architecture (June 2003).
[52] Srivastava, A., Edwards, A., and Vo, H. Vulcan: Binary translation in a
distributed environment. Microsoft Research Technical Report, MSR-TR-2001-
50 (2001).
[53] Sun, M., Daly, J., Wang, H., and Shen, J. Entropy-based Characterization
of Program Phase Behaviors. In the 7th Workshop on Computer Architecture
Evaluation using Commercial Workloads (February 2004).
[54] Thomas, N., Tanase, G., Tkachyshyn, O., Perdue, J., Amato, N. M.,
and Rauchwerger, L. A Framework for Adaptive Algorithm Selection in
STAPL. In PPoPP’05 (Chicago,Illinois, June 2005).
[55] Tullsen, D. M., and Eggers, S. J. Limitations of Cache Prefetching on a
Bus-Based Multiprocessor. In Proc. of 20th International Symposium on Com-
puter Architecture(ISCA) (May 1993), pp. 278–288.
[56] Vaswani, K., and Srikant, Y. Dynamic recompilation and profile-guided
optimisations for a .NET JIT compiler. IEE Proc.-Softw. 150, 5 (2003), 296–
302.
[57] Voss, M. J., and Eigenmann, R. High-Level Adaptive Program Optimization
with ADAPT. ACM SIGPLAN Notices 32, 7 (July 2001), 93–102.
[58] Whaley, R., Petitet, A., and Dongarra, J. Automated empirical opti-
mization of software and the ATLAS project. Parallel Comput. 27, 1-2 (2001),
3–35.
[59] Zhang, W., Calder, B., and Tullsen, D. M. An event-driven multi-
threaded dynamic optimization framework. pp. 87–98.
[60] Zhang, X., Wang, Z., Gloy, N., Chen, J., and Smith, M. System support
for automatic profiling and optimization. In Proc. 16th ACM Symp. Operating
Systems Principles (1997), pp. 15–26.