COBRA: A Framework for Continuous Profiling and Binary Re-Adaptation
Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 EECS Building
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 08-016
Jinpyo Kim, Wei-chung Hsu, and Pen-chung Yew
May 09, 2008
© Jinpyo Kim February 2008
Abstract
Dynamic optimizers have been shown to improve the performance and power efficiency
of single-threaded applications. Multithreaded applications running on CMP, SMP and
cc-NUMA systems also exhibit opportunities for dynamic binary optimization. However,
because existing dynamic optimizers are designed primarily for single-threaded
programs, they lack efficient schemes for monitoring multiple threads and for
supporting thread-specific or system-wide optimizations that target the collective
behavior of multiple threads. Monitoring and collecting profiles from multiple
threads exposes optimization opportunities not only for single-core systems, but
also for multi-core systems that include interconnection networks and cache
coherence protocols. Detecting global phases of multithreaded programs and selecting
appropriate optimizations by considering interactions between threads, such as
coherence misses, distinguish the dynamic binary optimizer presented in this thesis
from prior dynamic optimizers for single-threaded programs.
This thesis presents COBRA (Continuous Binary Re-Adaptation), a dynamic binary
optimization framework for single-threaded and multithreaded applications. It
includes components for collective monitoring and dynamic profiling, profile and
trace management, code optimization, and code deployment. The monitoring component
collects hot branches and performance information from multiple working threads
with the support of the OS and hardware performance monitors, and sends the data
to the dynamic profiler. The dynamic profiler accumulates performance-bottleneck
profiles, such as cache miss information, along with hot branch traces. The
optimizer generates newly optimized binary traces and stores them in the code
cache. The profiler and optimizer interact closely with each other to achieve more
effective code layout and fewer data cache miss stalls. The continuous profiling
component monitors only the performance behavior of optimized binary traces and
generates feedback that determines the efficiency of the optimizations, guiding
continuous re-optimization. The framework is currently implemented on Itanium 2
based CMP, SMP and cc-NUMA systems.
This thesis proposes a new phase detection scheme and hardware support, designed
especially for dynamic optimizations, that effectively identify and accurately
predict program phases by exploiting program control flow information. This scheme
applies not only to single-threaded programs, but even more efficiently to
multithreaded programs. Our proposed phase detection scheme effectively identifies
dynamic intervals: contiguous, variable-length intervals aligned with dynamic code
regions that show distinct single-threaded and parallel program phase behavior.
Two efficient phase-aware runtime program monitoring schemes are implemented in
our COBRA framework: sampled Basic Block Vector (BBV)-based and sampled Hot
Working Set (HWSET)-based program phase detection. We show that the sampled
HWSET-based scheme achieves higher phase coverage and longer stable phases than
the sampled BBV-based scheme. We also propose dynamic code region (DCR)-based
program phase detection hardware for dynamic optimization systems, and show that
the proposed hardware exhibits the characteristics desired of a phase detector
for dynamic optimization.
This thesis also proposes a persistent dynamic profile management scheme for
continuous re-optimization. The code-region-based profile manager stores dynamic
control flow information, including hot paths and loops, and classifies profiles
according to an entropy calculated from the frequency vectors of taken branches
and load latencies. This profile characterization and classification minimizes
the explosion of persistent runtime profiles and the overhead of profile
collection for continuous re-optimization.
We implemented two dynamic compiler optimizations to reduce the impact of coherent
memory accesses in the OpenMP NAS parallel benchmarks, and use these benchmarks to
show how COBRA can adaptively choose appropriate optimizations in response to
observed changes in runtime program behavior. The optimizations improve the
performance of the OpenMP NAS parallel benchmarks (BT, SP, LU, FT, MG, CG) by up
to 15% with an average of 4.7% on a 4-way Itanium 2 SMP server, and by up to 68%
with an average of 17.5% on an SGI Altix cc-NUMA system.
Dedicated to my parents and my wife
Acknowledgments
It is time to thank my family, advisors, and friends for all the support and love
that made it possible to finish this work. First of all, I would like to thank my
wife, Aejung Min, for dedicating her time to taking care of daily chores and our
two playful sons. It was always fun to spend time with my lovely first son,
Donghyun Kim, who has grown up nicely and become a playful and smart 4th grader.
I was always happy to see a big smile from my second son, Kevin D. Kim, who was
born in the US and has grown into a happy and healthy 3-year-old. The strong
support of my family kept me on track to finish this work.
I would like to thank my parents, Jong-Kyu Kim and Kee-Hyun Park, for believing
in me in whatever I was doing and for their strong support all along. My father,
especially, has been a good mentor and friend in my life. Through his own life,
he showed me how a person can grow to be responsible and loving toward their family.
Professor Pen-Chung Yew has been a great advisor on my work and on every decision
made during my graduate studies. I would like to thank him for spending his
invaluable time discussing every detail of my work and giving me thoughtful
suggestions. He has been a definite role model for me as a productive researcher
and professor.
Professor Wei-Chung Hsu gave me the opportunity to work on the dynamic
optimization project and has been a great co-advisor. He has been an energetic
leader of the project and a technically sound debater on every bit of detail when
discussing research ideas with students. I have been really fortunate to have had
the chance to work with him and learn the nuts and bolts of compiler optimization
techniques.
Sreekumar V. Kodakara, a fellow graduate student, has been a good research
collaborator and dear friend throughout my thesis work. Numerous days and nights
of hard work on papers could not take away his humor and smile, which made our
collaboration an enjoyable and delightful experience.
I would like to thank the fellow graduate students in the DYNOPT research group,
namely Howard Chen, Jiwei Lu, Sourabh Joshi, Ananth Lingamneni, Abhinav Das, and
Lao Fu. Group discussions with them greatly influenced my thesis work. I would
also like to thank the fellow graduate students in the Aggassiz research group,
namely Tong Chen, Shengyue Wang, Xiaoru Dai, Jin Lin, Venkatesan Packirisamy,
Kiran S. Yellajyosula, and Jin Woo Jung.
I would like to thank Professor David J. Lilja for providing insightful
suggestions on the collaborative work with Sreekumar V. Kodakara and on my thesis
work. Professor Mats Heimdahl is gratefully acknowledged for serving on my
committee and giving me suggestions to improve my thesis.
Finally, I would like to acknowledge the funding agencies and companies that
supported this work. This work was supported in part by National Science
Foundation grant no. EIA-0220021, Intel, HP, Sun, and the Minnesota Supercomputing
Institute. It was also supported in part by an IT National Scholarship from the
Ministry of Information and Communication of the Korean government.
Contents
Chapter 1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Phase Detection and Prediction for Dynamic Optimizations . 3
1.1.2 Profile Characterization and Classification . . . . . . . . . . . 4
1.1.3 Optimizing Coherent Misses via Binary Re-Adaptation . . . . 5
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Phase Detection and Prediction for Dynamic Optimization . . 6
1.2.2 Profile Characterization and Classification . . . . . . . . . . . 7
1.2.3 Optimizing Coherent Misses via Binary Re-Adaptation . . . . 8
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 2 Related Works 12
2.1 Dynamic Optimizations for Parallel and Multithreaded Programs . . 12
2.2 Phase Detection and Prediction . . . . . . . . . . . . . . . . . . . . . 13
2.3 Profile Characterization and Classification . . . . . . . . . . . . . . . 15
2.4 Reducing Coherent Misses . . . . . . . . . . . . . . . . . . . . . . . . 16
Chapter 3 COBRA: A Continuous Binary Re-Adaptation Framework 17
3.1 COBRA System Architecture . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Startup Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Monitoring Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Optimizer Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 4 Phase-Aware Runtime Program Monitoring 23
4.1 Phase Detection for Dynamic Optimization Systems . . . . . . . . . . 24
4.2 Extended Calling Context Tree . . . . . . . . . . . . . . . . . . . . . 24
4.3 Dynamic Code Region . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.1 Dynamic Code Region Analysis . . . . . . . . . . . . . . . . . 26
4.3.2 Stable and Transition Intervals . . . . . . . . . . . . . . . . . 28
4.4 Sampling-based Program Phase Tracking . . . . . . . . . . . . . . . . 30
4.4.1 Sampled BBV-based Program Phase Detection . . . . . . . . . 30
4.4.2 Sampled HWSET-based Program Phase Detection . . . . . . . 31
4.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Global Program Phase on Multithreaded Programs . . . . . . . . . . 35
4.5.1 Global Program Phase . . . . . . . . . . . . . . . . . . . . . . 35
4.5.2 Exploiting Global Program Phase . . . . . . . . . . . . . . . . 36
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Chapter 5 Hardware Support for Program Phase Tracking 40
5.1 Dynamic Code Region (DCR): A Unit of Monitoring and Re-optimization 40
5.1.1 Tracking Dynamic Code Region as a Phase . . . . . . . . . . . 40
5.1.2 Correlation between Dynamic Code Regions and Program Per-
formance Behaviors . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 DCR-based Phase Tracking and Prediction Hardware . . . . . . . . . 44
5.2.1 Identifying function calls and loops in the hardware . . . . . . 44
5.2.2 Hardware Description . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.3 Handling special cases . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . 50
5.3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4.1 Results for Phase Detection Hardware . . . . . . . . . . . . . 52
5.4.2 Comparison with BBV Technique . . . . . . . . . . . . . . . . 56
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 6 Continuous and Persistent Profile Management 63
6.1 Continuous and Persistent Profile-Guided Optimization . . . . . . . . 63
6.2 Similarity of Sampled Profiles . . . . . . . . . . . . . . . . . . . . . . 65
6.2.1 Similarity Metric . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.2 Accuracy of Persisted Profiles . . . . . . . . . . . . . . . . . . 67
6.3 Entropy-Based Profile Characterization . . . . . . . . . . . . . . . . . 68
6.3.1 Information Entropy: A Metric for Profile Characterization . . 68
6.3.2 Entropy-Based Adaptive Profiler . . . . . . . . . . . . . . . . 70
6.4 Entropy-Based Profile Classification . . . . . . . . . . . . . . . . . . . 70
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 72
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 7 Optimizing Coherent Misses via Binary Re-Adaptation 80
7.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2 Optimizing Coherent Misses . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.1 Impact on Execution Time . . . . . . . . . . . . . . . . . . . . 92
7.4.2 Impact on L3 Cache Misses . . . . . . . . . . . . . . . . . . . 94
7.4.3 Impact on Memory Bus Transactions . . . . . . . . . . . . . . 94
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Chapter 8 Conclusions and Future Works 98
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Figures
3.1 COBRA framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 The startup sequence of 4-threaded OpenMP program with COBRA
framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Partial ECCT of gzip and gcc is shown. The grey nodes are the roots
of the sub-trees that form the DCRs and the rectangles mark the DCRs 25
4.2 Sampled BBV-based Program Phase Detection . . . . . . . . . . . . . 30
4.3 Sampled HWSET-based Program Phase Detection . . . . . . . . . . . 31
4.4 Phase coverage and stable phase of sampled BBV-based phase detec-
tion scheme on SPEC CPU2000 benchmarks . . . . . . . . . . . . . . 32
4.5 Phase coverage and stable phase of sampled HWSET-based phase de-
tection scheme on SPEC CPU2000 benchmarks . . . . . . . . . . . . 33
4.6 Comparison of BBV-based and HWSET-based program phase detection 34
4.7 Global program phase behavior on multithreaded programs . . . . . . 37
4.8 Performance scale-up of OpenMP swim on Itanium 2 4-way Mckinley
and 8-way Montecito machine . . . . . . . . . . . . . . . . . . . . . . 38
4.9 Performance of SPEC OMP2001 benchmarks in the different CPU fre-
quencies on Intel Core 2 Quad processor . . . . . . . . . . . . . . . . 38
5.1 An example code and its corresponding ECCT representation. Three
dynamic code regions are identified in the program and are marked by
different shades in the tree. . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Visualizing phase change in bzip2. (a) Change of average CPI dur-
ing program execution. Each point in the graph is the average CPI
observed over an 1-million-instruction interval. (b) Tracking Phase
changes over the time (1 million instruction interval) in bzip2 using
dynamic code regions. The Y-axis shows phase ID. . . . . . . . . . . 43
5.3 Assembly code of a function call (a) and loop (b). The target address
of the branch instruction is the start of the loop and the PC address
of the branch instruction is the end of the loop. . . . . . . . . . . . . 44
5.4 Conditions checked in the phase detection hardware . . . . . . . . . . 46
5.5 Schematic diagram of the hardware phase detector . . . . . . . . . . . 47
5.6 Recursion structure in code and the content of hardware stack during
recursion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.7 Weighted average of the CoV of CPI for different configurations of the
phase detection hardware . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.8 The performance comparison of last-phase predictor and Markov pre-
dictor in detecting the phase of the next interval for 32/64 configuration. 56
5.9 BBV-based phase tracking hardware . . . . . . . . . . . . . . . . . . 57
5.10 Comparison between BBV- and DCR-based phase detection hardware
on the performance of a 256-entry Markov Predictor in predicting the
phase ID of the next interval. A 32-entry accumulator table/hardware
stack and a 64-entry phase signature table were used. The first 2
columns for each benchmark are for BBV method using threshold val-
ues of 10% and 40% of one million instructions, respectively. . . . . . 60
5.11 Comparison of the weighted average of the CoV of CPI between BBV-
and DCR-based phase detection schemes. A 32-entry accumulator ta-
ble/hardware stack and a 64-entry phase signature tables were used.
The first 2 columns for each benchmark are for a threshold value of
10% and 40% of one million instructions respectively. . . . . . . . . . 61
6.1 Continuous profile-guided optimization model . . . . . . . . . . . . . 64
6.2 Convergence of merged profiles of gcc with 200.i input set . . . . . . 67
6.3 Relative frequency distribution of PC address samples (gcc, gzip) . . 69
6.4 Entropy-based profile classification . . . . . . . . . . . . . . . . . . . 71
6.5 Convergence of merged profiles of SPEC CPU2000 benchmarks . . . . 73
6.6 Accuracy of entropy-based adaptive profiler on SPECJBB ver. 1.01 . 75
7.1 OpenMP DAXPY C source code . . . . . . . . . . . . . . . . . . . . 80
7.2 icc compiler generated Itanium assembly code for DAXPY kernel . . 81
7.3 Normalized execution time of OpenMP DAXPY kernel on 4-way Ita-
nium 2 SMP server . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.4 Speedup of coherent memory access optimization on OpenMP NPB
benchmarks. The performance of prefetch version (optimized by Intel com-
piler) is normalized to 1 as the baseline. . . . . . . . . . . . . . . . . . 93
7.5 Number of L3 misses on OpenMP NPB benchmarks . . . . . . . . . . 95
7.6 Number of memory transactions on the system bus on OpenMP NPB
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Tables
4.1 Top 5 dynamic code region statistics on SPEC2000 CPU benchmarks 29
5.1 Number of phases detected for different configurations of the phase
detection hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Average Length of phase detected for different configurations of the
phase detection hardware. . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Comparison of the number of phases detected between BBV- and DCR-
based phase detection schemes. A 32-entry accumulator table/hardware
stack and a 64-entry phase signature table were used. The first 2
columns for each benchmark correspond to a threshold value of 10%,
40% of one million instructions, respectively. . . . . . . . . . . . . . . 59
5.4 Comparison of the average phase length between BBV- and DCR-based
phase detection schemes. A 32-entry accumulator table/hardware stack
and a 64-entry phase signature table were used. The first 2 columns
for each benchmark correspond to a threshold value of 10%, 40% of
one million instructions, respectively. . . . . . . . . . . . . . . . . . . 59
6.1 Entropy of SPEC CPU2000 INT benchmarks . . . . . . . . . . . . . . 74
6.2 Entropy of SPEC CPU2000 FP benchmarks . . . . . . . . . . . . . . 74
6.3 Performance improvement (%) from PGO on vortex with multiple in-
put sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Performance improvement (%) from PGO on vpr with multiple input
sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Performance improvement (%) from PGO on gzip with multiple input
sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1 The number of loops and prefetches in compiler generated OpenMP
NPB binaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 1
Introduction
As we enter the era of multi-core and many-core systems that can integrate from
two to hundreds of processing units on a single chip, immense computing resources
such as processing cores, large memories, and I/O are readily available. These
resources can be exploited with the support of operating systems (OS), compilers,
and thread libraries. The OS supports task-level parallelism for concurrently
executing processes by providing fair and efficient scheduling of shared system
resources. Compilers support automatic parallelization and programmer-annotated
parallelization such as OpenMP, making multithreaded programs relatively easier
to write; compiler-generated parallel code executes as lightweight threads.
Programmers can also write multithreaded programs with thread libraries such as
the Linux pthread library. The increasing development and use of multithreaded
programs poses a huge challenge for compiler optimizations due to the dynamic
behavior of multithreaded programs.
Compiler optimizations have to improve not only the performance of each thread,
but also the overall performance of the multithreaded application. Because the
number of threads and the data working set size change during parallel execution,
dynamic parallel program behavior makes it difficult for a static compiler to
generate a high-performance binary for multiprocessor systems. To cope with this
problem, adaptive dynamic optimizations could be applied during various stages of
program development and deployment, such as in system libraries, algorithms, and
compilation. Recently, with the advent of profile-guided optimizations using
Hardware Performance Monitors (HPM), re-optimizing the binary at runtime has
proven to be a promising approach: it can adapt a binary to changing program
behavior, data working set sizes, and system configurations.
To deploy compiler optimizations efficiently at runtime for multithreaded
programs, this thesis presents an HPM-based continuous profiling and optimization
framework called COBRA (Continuous Binary Re-Adaptation). We investigate the use
of program phases to precisely detect and predict changing program behavior by
exploiting program control flow information such as loops and function calls, and
we propose software- and hardware-based phase detection and prediction schemes
for dynamic optimizations. COBRA manages runtime profiles in a persistent manner
to enable continuous re-optimization. We propose using information entropy to
characterize dynamic profiles, and show that it can also be effectively applied
to profile classification. Finally, we implemented dynamic re-optimization of
data prefetching to minimize unnecessary coherence misses in multithreaded
applications.
1.1 Problem Statement
This thesis addresses the following problems: phase detection and prediction for
dynamic optimizations, profile characterization and classification, and optimizing co-
herent misses via dynamic optimization.
1.1.1 Phase Detection and Prediction for Dynamic Optimizations
Understanding and predicting a program’s execution phase is crucial to dynamic
optimizations and dynamically adaptable systems. Accurate classification of pro-
gram behavior creates many optimization opportunities for adaptive reconfigurable
microarchitectures, dynamic optimization systems, efficient power management, and
accelerated architecture simulation [22, 49, 51, 23, 53, 37, 8, 14, 7].
Dynamically adaptable systems [22, 51, 29, 41] have exploited the phase behavior
of programs to adaptively reconfigure microarchitectural structures such as the
cache size. A dynamic optimization system optimizes the program binary at runtime
using code transformations to increase program execution efficiency; dynamic
binary translation also falls into this category. In such systems, program phase
behavior has been exploited for dynamic profiling and code cache management
[37, 8, 26, 43, 12, 15, 14]. For example, the performance of code cache
management relies on how well the system tracks changes in the instruction
working set.
Current dynamic optimization systems continuously track program phase changes
either by sampling performance counters or by instrumenting the code. For
sampling-based profiling systems, the sampling rate usually dominates the
overhead. While a low sampling rate avoids high profiling overhead, it can also
miss optimization opportunities and yield an unstable system in which
reproducibility is compromised. Program phase detection and prediction can
control the profiling overhead more efficiently by adaptively adjusting the
sampling rate or applying burst instrumentation. For example, if the program
execution is in a stable phase, profiling overhead can be minimized (e.g., by
lowering the sampling rate), while a new phase would trigger burst profiling.
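As a concrete illustration of this policy, the sketch below shows one way a phase-aware profiler might adjust its sampling interval. It is a minimal sketch under assumed constants and function names of our own invention, not COBRA's actual control loop.

```python
# Hypothetical phase-aware sampling-rate controller (illustrative only).
# On a phase change, fall back to dense "burst" profiling; while the phase
# stays stable, back off exponentially toward a low sampling rate.

BASE_INTERVAL = 100_000    # assumed burst interval: one sample per 100K instructions
MAX_INTERVAL = 1_000_000   # assumed lowest rate, used deep inside a stable phase

def next_sampling_interval(current_interval, phase_is_new):
    """Return the sampling interval for the next profiling window."""
    if phase_is_new:
        return BASE_INTERVAL                       # burst: profile the new phase densely
    return min(current_interval * 2, MAX_INTERVAL)  # stable: halve the sampling rate

# A stable run backs off 100K -> 200K -> 400K ... until capped at 1M,
# and any phase change snaps the interval back to 100K.
```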
A phase detection technique developed for dynamically adaptable systems is less
applicable to a dynamic optimization system, because collecting performance
characteristics at regular time intervals with arbitrary boundaries in the
instruction stream is not as useful as gathering performance profiles of
instruction streams that are aligned with program control structures. We
therefore introduce the concept of a Dynamic Code Region (DCR) and use it to
model program execution phases. A DCR is a node and all of its child nodes
(i.e., the subtree rooted at that node) in the extended calling context tree
(ECCT) of the program; the ECCT is an extension of the calling context tree
(CCT) proposed by Ammons et al. [4] with the addition of loop nodes.
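The ECCT and DCR notions can be sketched in a few lines. The following is an illustrative model only (the class, field, and method names are ours, not COBRA's): nodes represent either calls or loops, and a DCR is the subtree rooted at a node.

```python
# Illustrative ECCT sketch: a calling context tree extended with loop nodes,
# where a dynamic code region (DCR) is a node plus its entire subtree.

class ECCTNode:
    def __init__(self, name, kind):
        self.name = name          # function or loop identifier
        self.kind = kind          # "call" or "loop"
        self.children = []

    def add_child(self, name, kind):
        child = ECCTNode(name, kind)
        self.children.append(child)
        return child

    def dcr(self):
        """The dynamic code region rooted here: this node and all descendants."""
        region = [self]
        for child in self.children:
            region.extend(child.dcr())
        return region

# Example tree: main() calls foo(), which contains a loop enclosing bar().
root = ECCTNode("main", "call")
foo = root.add_child("foo", "call")
loop = foo.add_child("foo.loop1", "loop")
loop.add_child("bar", "call")

# The DCR rooted at foo covers foo, its loop, and bar.
print([n.name for n in foo.dcr()])   # ['foo', 'foo.loop1', 'bar']
```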
Based on this study of DCR-based phase detection, this thesis proposes a
sampling-based software phase detection and prediction scheme for dynamic
optimizations. Furthermore, we propose efficient phase detection and prediction
hardware.
1.1.2 Profile Characterization and Classification
To enable advanced profile-guided optimizations (PGO) in a dynamic compiler and
binary optimization system, dynamic profiles are usually collected through
sampling and runtime instrumentation. The re-compilation process relies on
accurate HPM (Hardware Performance Monitor)-sampled profiles accumulated over
several executions of the application program. HPM-sampled profiles can capture
precise runtime performance events, such as cache misses and resource
contention, that enable more effective runtime or offline optimizations
[39, 16, 37]. Sampling-based profile management has thus become an essential
part of a continuous profile-guided optimization framework. However, many
production compilers still depend on instrumentation-based profiles for their
PGO, because some optimizations, such as complete loop unrolling, require
information like a loop's iteration count that may not be obtained accurately
from sampling-based profiles.
To obtain more accurate profiles at a low sampling frequency, sampled profiles
can be merged across multiple runs and stored on disk. Due to the statistical
nature of sampling, the quality of sampled profiles is greatly affected by the
sampling rate. As the sampling frequency increases, more samples are collected
and the accuracy of the sampled profile improves; unfortunately, the sampling
overhead also increases. A high sampling frequency causes more interrupts,
requires more memory to store sampled data, and incurs more disk I/O to keep the
profile data persistent. With a fixed number of runs, the challenge for the
runtime profiler is to determine a sampling rate that still yields high-quality
profiles with minimum sampling overhead. This thesis introduces the use of
information entropy to characterize dynamic profiles, and shows that it is also
effective for profile classification.
1.1.3 Optimizing Coherent Misses via Binary Re-Adaptation
Compiler optimizations for memory accesses have become extremely important in
the face of ever-increasing memory latency. Larger cache memories and data
prefetching have proven very effective in reducing cache misses and hiding cache
miss latency. Processors without hardware data prefetchers, such as the Intel
Itanium, rely on effective compiler-generated prefetches to minimize the
performance impact of long memory latencies; consequently, modern compilers for
such processors are very aggressive in generating data cache prefetch
instructions.
Aggressive data cache prefetching can be very effective for applications such as
dense matrix-oriented numerical codes, since their memory access patterns are
highly predictable on single-processor systems. In a multiprocessor environment
with multi-level caches, however, cache behavior becomes less predictable
because it depends heavily on system bus contention and on the coherence misses
generated by both true-sharing and false-sharing data accesses. This thesis
proposes a dynamic binary re-adaptation technique to minimize unnecessary cache
misses caused by aggressive data prefetching.
1.2 Thesis Contributions
The contributions of the thesis are given below.
1.2.1 Phase Detection and Prediction for Dynamic Optimization
We introduce dynamic intervals: contiguous, variable-length intervals aligned
with dynamic code regions. In traditional compiler analysis [42], interval
analysis is used to identify regions in the control flow graph at compilation
time. We define dynamic intervals as instruction streams that are aligned with
code regions and that exhibit distinct phase behavior at runtime. Dynamic
intervals, as program phases, can be easily identified by tracking dynamic code
regions. We track higher-level control structures such as loops and procedure
calls during program execution; by tracking these structures, we can effectively
detect changes in dynamic code regions, and hence phase changes. Intuitively,
this works because programs exhibit different phase behaviors as control
transfers through procedures, nested loop structures, and recursive functions.
In [34], tracking loops and procedures was reported to yield phase tracking
accuracy comparable to the Basic Block Vector (BBV) method [51, 35], which
supports our observation.
We also propose dynamic code region (DCR)-based phase tracking hardware for
dynamic optimization systems. We track the code signatures of procedure calls
and loops using a special hardware stack, and compare them against previously
seen code signatures to identify dynamic code regions. We show that the detected
dynamic code regions correlate well with the phases observed during program
execution.
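A software analogue of this stack-based tracker might look as follows. This is a hedged sketch of the idea only (the class and method names are invented), not the hardware design evaluated in Chapter 5: push an identifier on call or loop entry, pop on exit, and look the current nesting path up in a table of previously seen signatures to assign a phase ID.

```python
# Illustrative software model of stack-based DCR phase tracking: the stack
# mirrors the hardware stack of active calls/loops, and the signature table
# maps previously seen nesting paths to phase IDs.

class PhaseTracker:
    def __init__(self):
        self.stack = []            # active procedure calls and loops
        self.signatures = {}       # code signature -> phase ID
        self.next_id = 0

    def enter(self, region):       # procedure call or loop entry
        self.stack.append(region)

    def leave(self):               # matching return or loop exit
        self.stack.pop()

    def current_phase(self):
        sig = tuple(self.stack)    # code signature = current nesting path
        if sig not in self.signatures:
            self.signatures[sig] = self.next_id   # new region -> new phase
            self.next_id += 1
        return self.signatures[sig]

tracker = PhaseTracker()
tracker.enter("main")
tracker.enter("solver_loop")
a = tracker.current_phase()        # first time this region is seen -> new ID
tracker.leave()
tracker.enter("solver_loop")
b = tracker.current_phase()        # same nesting path again -> same phase ID
assert a == b
```

Re-entering the same loop under the same calling context reproduces the same signature, so the tracker reports a recurring phase rather than a new one.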
The primary contributions on phase detection and prediction are as follows:
• We showed that dynamic intervals, which correspond to dynamic code regions
aligned with the boundaries of procedure calls and loops, can accurately
represent program behavior.
• We proposed new phase tracking hardware that consists of a simple stack and a
phase signature table. Compared with previously proposed schemes, this structure
detects a smaller number of phases, and the detected phases are longer. It also
gives more accurate predictions of the next execution phase.
1.2.2 Profile Characterization and Classification
We propose using “information entropy” to determine adaptive sampling rates for
automated profile collection and processing that can efficiently support continuous
re-optimization in a pre-JIT environment. The information entropy is a good way
to summarize the frequency distribution into a single number [18]. Since a sampled
profile in our study is a frequency profile of collected PC addresses, the information
entropy of the profile is well suited for characterizing program behaviors in which we
are interested.
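For illustration, the entropy of a sampled PC profile can be computed as follows. This is a sketch with names of our own choosing, not COBRA's implementation:

```python
import math
from collections import Counter

def profile_entropy(pc_samples):
    """Shannon entropy (in bits) of a frequency profile of sampled PC addresses."""
    counts = Counter(pc_samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A profile dominated by one hot address has lower entropy than a flat one.
hot = [0x4000] * 97 + [0x4010, 0x4020, 0x4030]
flat = list(range(100))
assert profile_entropy(hot) < profile_entropy(flat)
```

A peaked profile (a few hot addresses) thus collapses to a small entropy value, while a flat profile approaches log2 of the number of distinct addresses.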
In practice, a program has multiple input data sets and may exhibit different program
behaviors for each particular input set. Hence, the entropy of a profile could be
different according to the input used. Even though it is difficult to predict which
input set is to be used in each run, if the execution time of the program is sufficiently
long, the sampling rate can be adjusted accordingly using the entropy information
collected during execution. On the other hand, if the execution time is very short,
the overhead from a high sampling rate will be insignificant since we conduct a small
number of runs for sample collection.
In the presence of multiple input sets, existing profile-guided optimization (PGO)
schemes simply merge the profiles collected from all input sets. PGO based on the
merged profile might miss some opportunities for performance gains from
specialized optimizations more suitable for certain input sets. We show that the
information entropy can be used to classify profiles with similar behavior. The
classified profiles allow the optimizer to generate specially optimized versions for
particular input sets.
The primary contributions on profile characterization and classification are as follows:
• We show that highly accurate profiles can be obtained efficiently by merging a
number of profiles collected over repeated executions with low sampling rates.
We demonstrate this approach by using the SPEC2000 benchmarks.
• We also show that a simple characterization of profiles using information entropy
can be used to automatically set the sampling rate for the next profiling run.
On SPECjbb2000, our adaptive profiler obtains a very accurate profile (a 94.5%
match with the baseline profile) with only 8.7% of the samples needed when
using 1M-instruction sampling intervals.
• We show that the entropy of a profile could be used to classify different program
behaviors according to different input sets and to generate classified profiles for
targeted optimizations.
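One way entropy could drive the sampling rate for the next run is sketched below. The thresholds, scale factors, and the direction of adjustment are purely illustrative; they are not the values used in this thesis:

```python
def next_sampling_rate(entropy, base_rate, low=2.0, high=6.0):
    """Choose the sampling rate for the next profiling run from the entropy
    of the current profile. Thresholds and scale factors are illustrative."""
    if entropy < low:
        # A few hot addresses dominate: a sparse sample already captures them.
        return base_rate // 4
    if entropy > high:
        # Behavior is spread widely: sample more densely for the same accuracy.
        return base_rate * 4
    return base_rate
```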
1.2.3 Optimizing Coherent Misses via Binary Re-Adaptation
We implemented two different dynamic binary optimizations in COBRA. The first
optimization uses dynamic profile information to select appropriate prefetch hints
related to coherent memory accesses. As more processing cores and larger cache
memories are being integrated on chip, coherent memory accesses could limit the
scalability of parallel programs. If a program experiences frequent coherent misses
due to truly-shared and falsely-shared data, even larger caches cannot help to reduce
such bus accesses. Cache-coherent L2 write misses can lead to L3 misses, especially
in invalidation-based cache coherence protocols. The Itanium 2 supports the .excl hint
for the lfetch instruction, which prefetches a cache line in exclusive state instead of
the usual shared state. This can remove the cost of requesting the exclusive state at
the actual write operation. However, the effectiveness of such hints largely depends
on the program's runtime behavior.
The second optimization reduces the aggressiveness of prefetching. Modern compilers
have been very aggressive in generating data prefetch instructions to hide potential
large memory latency from cache misses for each thread. However, such aggressive
prefetching in a thread could exert tremendous stress on the system bus if most of its
prefetches turn out to be useless or unnecessary. This might have no effect on a single-core
system, but could have a devastating effect on a multi-core system. Using dynamic
profiling at runtime, we could identify and eliminate those unnecessary prefetches
from a processor and free up the bus and memory bandwidth for other processors.
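The decision of which prefetch sites to drop can be sketched as follows. The per-site counters and the 10% usefulness threshold are hypothetical, not COBRA's actual heuristic:

```python
def useless_prefetches(prefetch_stats, min_issued=1000, useful_ratio=0.10):
    """Return prefetch sites (PC addresses) whose prefetched lines are rarely
    used, making them candidates for being patched into NOPs."""
    return [pc for pc, (issued, useful) in prefetch_stats.items()
            if issued >= min_issued and useful / issued < useful_ratio]

stats = {0x100: (5000, 4500),   # almost always useful: keep
         0x200: (8000, 200)}    # mostly wasted bandwidth: candidate for NOP
assert useless_prefetches(stats) == [0x200]
```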
To demonstrate the feasibility and potential benefits of the COBRA framework, we
use the OpenMP NAS parallel benchmarks on a 4-way SMP server and an SGI Altix
cc-NUMA system. The contributions on optimizing coherent misses are as follows:
• Using an OpenMP version of the DAXPY kernel, we show that statically compiled
binaries cannot provide consistent performance in the presence of a
changing runtime environment. A runtime binary optimizer can adapt the
binary better to the changing runtime behavior.
• To the best of our knowledge, COBRA is the first implementation of an HPM-based
runtime binary optimization framework for multithreaded applications.
We discuss the trade-offs in the design of a robust and scalable runtime binary
optimizer, which include: thread monitoring, dynamic profiling, trace management,
system-wide dynamic compiler optimizations, and code deployment.
• We implemented two dynamic compiler optimizations to reduce the impact of
coherent memory accesses in OpenMP NAS parallel benchmarks. The opti-
mizations improve the performance of OpenMP NAS parallel benchmarks (BT,
SP, LU, FT, MG, CG) up to 15% with an average of 4.7% on a 4-way Itanium 2
SMP server, and up to 68% with an average of 17.5% on an SGI Altix cc-NUMA
system.
1.3 Outline of Thesis
Chapter 2 describes the related work on dynamic optimizations for parallel and
multithreaded programs, phase detection and prediction, profile characterization and
classification, and optimizing coherent misses on cache coherent multiprocessor sys-
tems.
Chapter 3 describes implementation details of a runtime binary optimization frame-
work, called COBRA (Continuous Binary Re-Adaptation). We describe the major
functional components of the COBRA framework and its startup model, and then
describe the optimizer thread and the monitoring threads in more detail.
Chapter 4 describes efficient phase-aware runtime program monitoring schemes implemented
in our COBRA framework. We investigate the use of control flow information,
such as loops and function calls, in order to identify repetitive program behavior.
We describe sampled Basic Block Vector (BBV)-based and Hot
Working Set (HWSET)-based program phase detection schemes. The sampled HWSET-based
scheme shows higher phase coverage and longer stable phases than the
sampled BBV-based scheme.
Chapter 5 describes our proposed Dynamic Code Region (DCR)-based program phase
detector for dynamic optimization. We show that our proposed hardware exhibits the
desired characteristics of a phase detector for dynamic optimization systems.
Chapter 6 describes techniques to characterize and classify dynamic profiles for dy-
namic compilation and optimization systems. We show that simple characterization
of the profile with information entropy can effectively guide the sampling rate for a
profiler. The entropy-based approach provides a good foundation for continuous pro-
filing management and effective profile guided optimization in a dynamic compilation
environment.
Chapter 7 describes runtime binary re-adaptation techniques that improve the per-
formance of some OpenMP parallel programs by reducing the aggressiveness of data
prefetching and using exclusive hints for prefetch instructions.
Chapter 8 concludes this work and describes future work.
Chapter 2
Related Work
Dynamic optimization has been used in the context of dynamic compilation and
optimization systems such as Java Virtual Machine [7, 13], runtime binary trans-
lation [24, 47, 52, 31] and optimization [8, 37, 12, 15, 11, 20, 59]. Prior runtime
binary optimization systems [8, 37, 12, 15, 11, 20, 59] were developed to improve the
performance of single-threaded applications. In contrast, COBRA is designed to concurrently
monitor multiple threads and to optimize the binary on multiprocessors; its
design for thread monitoring, profile processing, and trace management is therefore
significantly different from that of binary optimizers for single-threaded applications. Furthermore,
optimization decisions are based on profiles collected from multiple threads, or mul-
tiple runs, to determine if a system-wide optimization is needed.
2.1 Dynamic Optimizations for Parallel and Multithreaded
Programs
ADAPT [57] is a generic compiler-supported framework for high-level adaptive
program optimizations. The ADAPT compiler accepts user-supplied heuristics and
generates a complete runtime system to apply these heuristics dynamically. ADAPT
is applicable to both serial and parallel programs. However, given the variety of
options and the importance of high performance for parallel programs, ADAPT is
particularly well suited to these types of applications.
Thomas et al. [54] proposed a general framework for adaptive algorithm selection and
used it on the Standard Template Adaptive Parallel Library (STAPL) [5]. When
STAPL is first installed on the system, statically available information about the
architecture and the environment is collected. Performance characteristics for the
algorithmic options available in the library are then computed. This data is stored
in a repository, and machine learning techniques are used to determine the tests that
will be used at run-time for selecting an algorithmic option. At run-time, necessary
performance characteristics are collected and then a decision about which algorithmic
option to use is made.
ATLAS [58, 21] is a linear algebra library generator that makes use of domain-specific
algorithmic information. It generates platform-optimized Basic Linear Algebra
Subroutine (BLAS) routines by searching over blocking strategies, operation schedules,
and degrees of unrolling. SPIRAL [45] automatically generates high-performance code
that is tuned to the given platform. SPIRAL formulates the tuning as an optimiza-
tion problem and exploits the domain-specific mathematical structure of the trans-
formation algorithms to implement a feedback-driven optimizer. SPIRAL generates
high-performance code for a broad set of DSP transformations, including the discrete
Fourier transformations, other trigonometric transformations, filter transformations,
and discrete wavelet transformations.
2.2 Phase Detection and Prediction
In previous work, researchers have studied phase behavior to dynamically reconfig-
ure microarchitecture and re-optimize binaries. In order to detect the change of
program behavior, metrics representing program runtime characteristics were col-
lected [22, 49, 51, 23, 48, 37]. If the difference of metrics, or code signature, between
two intervals exceeds a given threshold, a phase change is detected. The stability of a
phase can be determined by using performance metrics (such as CPI, cache misses,
and branch misprediction) [23, 37], similarity of code execution profiles (such as in-
struction working set, basic block vector) [22, 49, 51], data access locality (such
as data reuse distance) [48], and indirect metrics (such as entropy) [37].
Our work uses calling context in an extended calling context tree (ECCT) as a signa-
ture to distinguish distinct phases. Similar techniques were used for locating reconfig-
uration points to reduce CPU power consumption [41, 28], where the calling context
was analyzed on the instrumented profiles. Our proposed phase tracking hardware
could effectively find similar reconfiguration points; for example, it is useful for phase-aware
power management in embedded processors. Huang et al. [29] also propose
to track calling context by using a hardware stack for microarchitecture adaptation
in order to reduce processor power consumption. W. Liu and M. Huang [36] propose
to exploit program repetition to accelerate detailed microarchitecture simulation by
examining procedure calls and loops in the simulated instruction streams. Hind
et al. [27] identified two major parameters (granularity and similarity) that capture
the essence of phase shift detection problems.
In dynamic optimization systems [8], it is important to maximize the amount of time
spent in the code cache because trace regeneration overhead is relatively high and
may offset performance gains from optimized traces [26]. Dynamo [8] used a preemptive
flushing policy for code cache management, which detected a program phase change
and flushed the entire code cache. This policy is more effective than a policy that
simply flushes the entire code cache when it is full. Accurate phase change detection
would enable more efficient code cache management. ADORE [37, 14] used sampled
PC centroid to track instruction working set and coarse-grain phase changes.
Nagpurkar et al. [43] proposed a flexible hardware-software scheme for efficient remote
profiling on networked embedded devices. It relies on the extraction of meta-information
from executing programs in the form of phases, and then uses this information
to guide intelligent online sampling and to manage the communication of those samples.
They used a BBV-based hardware phase tracker that was proposed in [51] and
enhanced in [35].
2.3 Profile Characterization and Classification
Savari and Young [46] introduced an approach, based on information theory, to
analyze, compare, and combine profiles. They showed how to merge two profiles from
the same program to more effectively guide compiler optimizations, and that an
information-entropy-based hybrid profile works better than other profile-blending
methods. In our work, we use information entropy to adaptively select sampling
rates and to characterize and classify profiles instead of combining them.
Kistler and Franz [32] proposed using a frequency (edge and path) vector to
compare the similarity among profiles. Their proposed similarity metric is based on
the geometric angle and distance between two vectors. Their goal was to determine
whether a program's execution has changed enough to trigger re-optimization. In our
work, we use the Manhattan distance between two profiles as a similarity metric. Our
method is more efficient than Kistler's approach.
Sun et al. [53] showed that information entropy computed over performance events,
such as L2 misses, can be a good metric for tracking changes in program phase behavior.
Our approach used information entropy based on frequency profiles to adaptively
determine sampling rates.
2.4 Reducing Coherent Misses
Collard et al. [17] proposed using system-wide hardware performance monitors, called
SWIFT, to detect pairs of instructions that cause false sharing. The profiles of
false sharing can be fed back into the compiler to enable the LDBIAS and FPBIAS
optimizations. If LDBIAS and FPBIAS are used instead of ordinary load instructions,
the cache line is fetched in exclusive state instead of shared state. FPBIAS is used
for loading floating-point data; LDBIAS is used for all other load operations. In
order to carefully separate out the benefits of prefetching, they excluded the use of the
lfetch.excl instruction. In contrast, we focus on the selective use of the lfetch.excl
instruction to optimize coherent memory accesses.
Tullsen and Eggers [55] pointed out that prefetching can negatively affect bus uti-
lization, overall cache miss rates, memory latencies, and data sharing. They examined
the sources of cache misses in light of several different prefetching strategies
and pinpointed the causes of the performance changes. They simulated the effects
of a set of compiler-directed prefetching strategies, namely NP (no prefetching),
PREF (prefetching), EXCL (exclusive prefetch), LPD (long prefetch distance), and
PWS (prefetch write-shared data more aggressively), on a bus-based multiprocessor.
These prefetching strategies can be implemented in a static compiler, or be applied
when precise runtime profiles are available. In our work, we compare three prefetching
strategies, namely PREF (baseline), NP, and EXCL, on the Itanium 2 processor. The
NP strategy is implemented by turning lfetch instructions into NOP instructions. The
EXCL strategy is implemented by adding the .excl hint to lfetch instructions. The
PREF strategy is used in the optimized binary generated by the optimizing compiler.
Chapter 3
COBRA: A Continuous Binary
Re-Adaptation Framework
In prior work [8, 37, 15, 20, 59], most dynamic optimization systems, such as
Dynamo [8] and ADORE [37], were developed to improve the performance of single-threaded
applications. In order to explore the potential benefit of dynamic optimizations
on multithreaded applications, we proposed a runtime binary optimization
framework, called COBRA (Continuous Binary Re-Adaptation). It is currently implemented
on an Itanium 2 based 4-way SMP server and an SGI Altix cc-NUMA system.
COBRA collects dynamic profiles from each thread using HPM and analyzes them to
find system-wide performance bottlenecks. Currently, the performance events mon-
itored include coherent memory accesses and system bus contention, in addition to
typical performance events for single threaded execution. The aggregated dynamic
profiles are fed into a runtime optimizer to generate optimized binary traces. These
optimized binary traces are stored in a trace cache in the same address space as the
binary program being optimized. The binary program is then patched and redirected
to the optimized traces during execution.
Figure 3.1: COBRA framework
COBRA (COntinuous Binary Re-Adaptation) is implemented as a shared library on
Linux and could be automatically preloaded before other shared libraries are loaded at
the program startup time. Since COBRA is designed to concurrently monitor multiple
threads on multiprocessors, its design for thread monitoring, profile processing, and
trace management is significantly different from that of binary optimizers for single-threaded
applications such as ADORE. Furthermore, optimization decisions are based
on profiles collected from multiple threads to determine if a system-wide optimization
is warranted.
3.1 COBRA System Architecture
Figure 3.1 illustrates the major functional blocks of the COBRA framework. It in-
cludes components for monitoring, profiling, trace management, code optimization
and code deployment. The monitoring component collects performance information
with the support of the OS and the hardware performance monitors, and sends the
data to the profiler. The profiler gathers and processes various sampled HPM
data, such as data cache misses, branch histories, and other event data. The trace
management component maintains prospective binary traces that can be optimized.
The optimizer generates new optimized binary traces and stores them in the code
cache. The profiler and optimizer interact closely with each other in order to reduce
I-cache misses and the impact of data cache misses more efficiently.
As shown in Figure 3.1, two types of supporting threads are invoked for a multi-
threaded program. One is an optimizer thread (shown in Figure 3.2) that orchestrates
profile collection and runtime optimizations. This thread is created during program
startup time. The other is a group of monitoring threads that monitors worker
threads. A monitoring thread is created when a worker thread is forked. If an appli-
cation program executes with four threads, one optimizer thread and four monitoring
threads will be created by COBRA.
3.2 Startup Model
Figure 3.2 illustrates the startup sequence of a 4-threaded OpenMP parallel program
running under COBRA. In Linux, when a program starts to run, the dynamic linking
loader invokes a libc entry-point routine called __libc_start_main, within which the
main function is called. COBRA is preloaded as a shared library and provides a function
wrapper for __libc_start_main that redirects control to an initialization routine and
spawns the optimizer thread before starting the application program.
Figure 3.2: The startup sequence of a 4-threaded OpenMP program with the COBRA framework
The functions of the two types of threads are explained in the following sections.
3.3 Monitoring Threads
The code optimizer in the COBRA framework relies mainly on accurate dynamic
profiles collected by the monitoring threads. The monitoring threads continuously
sample the performance counters and record cache miss events to guide binary optimizations.
On the Itanium 2 processor, hundreds of processor performance events,
including CPU cycles, the number of retired instructions, and stall cycles for each back-end
instruction pipeline stage, can be monitored; four of them can be monitored
concurrently. To build hot traces for binary trace optimizations, the monitoring threads
also sample the Branch Trace Buffer (BTB), which keeps track of the four address pairs
from the last four taken branches and their targets.
Each monitoring thread tracks signals from the perfmon [2] sampling kernel drivers.
Sampled data are stored in the kernel buffer initially, and when the kernel buffer
is full, a signal is raised to the monitoring thread. Once it catches a signal, it stores
the content of performance counters from the kernel memory area to a user memory
area, called User Sampling Buffer (USB). Each sample consists of a sample index,
Program Counter (PC) address, process ID, thread ID, processor ID, four perfor-
mance counters, eight BTB entries, data cache miss instruction address, miss latency,
and miss data cache line address. The process ID, thread ID and processor ID are
used to tag each sample for a better and more precise understanding of each thread
in the multi-threaded application. The four performance counters could be used to
track performance bottlenecks. For example, using the number of L2 and L3 misses
per 1000 instructions could track the changes in cache miss patterns for detecting
changes in data working sets and their access behavior. The eight BTB entries are
used for building hot execution traces for later optimizations. The data cache miss
instruction, data address, and miss latency are accumulated to pinpoint the exact
instructions that caused the most cache misses. We used this information to find the
delinquent loads [37, 19].
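The sample layout and the delinquent-load aggregation described above might be modeled as in the following sketch; the class, field, and function names are ours, not COBRA's:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One User Sampling Buffer entry, with the fields listed above."""
    index: int
    pc: int
    pid: int
    tid: int
    cpu: int
    counters: tuple   # four performance counter values
    btb: tuple        # eight Branch Trace Buffer entries
    miss_pc: int      # data cache miss instruction address
    miss_latency: int
    miss_line: int    # missed data cache line address

def delinquent_loads(samples, top=3):
    """Accumulate miss latency per missing instruction and return the
    addresses responsible for the most total cache-miss latency."""
    total = {}
    for s in samples:
        total[s.miss_pc] = total.get(s.miss_pc, 0) + s.miss_latency
    return sorted(total, key=total.get, reverse=True)[:top]
```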
3.4 Optimizer Thread
The optimization thread orchestrates the overall initialization, trace selection, op-
timization, and trace cache management. Notably, there is only one optimization
thread in our initial implementation. This design choice simplifies the implementation
and enables centralized control over the multiple monitoring threads.
At startup, all hardware performance counters are initialized by the perfmon sampling
kernel device driver. The list of available processors is registered in a shared memory
area. The Kernel Sampling Buffer is also allocated in the shared memory area. The
memory pages allocated to the Kernel Sampling Buffer may reside in different process-
ing nodes. We rely on the OS to migrate memory pages into relevant processor nodes.
For example, the SGI Altix cc-NUMA system uses a first-touch policy to pin a memory
page to the first processor that accesses it. This scheme works well
if each thread initializes its own portion of the Kernel Sampling Buffer during the
initialization phase.
Trace selection highly depends on the type of optimization applied to the collected
traces. Since our current optimizations mainly focus on adapting data prefetching
on hot loops that consume most of the execution time, trace formation and selection
algorithms are tuned to discover hot loops and leading execution paths to the loops.
The BTB profiles from Itanium 2’s HPM are particularly useful to build loop traces
with relatively infrequent sampling, which keeps the overall overhead low.
Chapter 4
Phase-Aware Runtime Program
Monitoring
Programs spend most of their execution time in frequently executed function calls and
loops. Repeated control flows in a program tend to show similar and stable performance
behavior across the entire execution [35]. This stable performance behavior is considered
program phase behavior. Detecting program phase changes is crucial
to minimizing unnecessary triggering of re-optimization in dynamic optimization systems.
In previous work [22, 49, 51, 23, 35], a program phase is defined as a set of
intervals within a program’s execution that have similar behavior and performance
characteristics, regardless of their temporal adjacency. The execution of a program
was divided into equally sized non-overlapping intervals. An interval is a contiguous
portion (i.e., a time slice) of the execution of a program. Metrics representing pro-
gram runtime characteristics were calculated for every interval. If the difference in
the metrics between two adjacent intervals exceeds a given threshold, a phase change
is assumed. Phase classification partitions a set of intervals into phases with similar
behavior. Phase prediction foretells the phase for the next interval of execution.
4.1 Phase Detection for Dynamic Optimization Systems
Dynamic optimization systems in general have four major components: a phase
detector, a profiler, an optimizer, and a controller. The phase detector tracks
changes in program behavior and predicts the future behavior of the program. De-
pending on the optimizations being targeted, the profiler and the optimizer could
add significant overhead to the overall system. Also, if the code cache is not man-
aged effectively, significant overheads could also occur due to trace re-generation and
re-optimization [26, 25, 10]. The characteristics of the detected phases have a direct
impact on this overhead; some of them are detailed below.
Dynamic optimization systems prefer longer phases. If the phase detector is overly
sensitive, it may trigger profiling and optimization operations too frequently and cause
performance degradation. Phase prediction accuracy is essential to avoid bringing
unrelated traces into the code cache. The code cache management system would also
require information about the phase boundaries to precisely identify the code regions
for the phase. The information about the code structures can be used by the profiler
and the optimizer to identify the code regions for optimization. Finally, it is generally
beneficial to have a small number of phases as long as we can capture most important
program behavior; this is because a small number of phases allow the phase detector
to identify longer phases and to predict the next phase more accurately. It should
be noted that dynamic optimization systems can trade some variability within each
phase for a longer phase length and a higher predictability, as these factors determine
the overhead of the system.
4.2 Extended Calling Context Tree
In order to identify code regions consisting of frequently executed function calls and
loops, we first instrument the whole program execution and represent it as a large
Figure 4.1: Partial ECCT of (a) gzip and (b) gcc. The grey nodes are the roots of the sub-trees that form the DCRs and the rectangles mark the DCRs.
single tree, called the Extended Calling Context Tree (ECCT). The Calling Context Tree
(CCT) was first proposed by Ammons et al. [4]. A CCT is a directed graph G=(N,E),
where N is the set of nodes that represent procedures in the program and E is the set of
edges that connect the procedures. For example, if a procedure proc1 is called
from another procedure proc2, the graph will include two nodes proc1 and proc2 with
a directed edge connecting proc2 to proc1. In CCT, only those procedures that are
called during the execution of the program are present. Nodes representing procedures
in CCT are context sensitive. If a procedure proc1 is called from procedures proc2 and
proc3, the graph will contain two different nodes for proc1. Creating unique nodes for
procedures in each context makes the graph a tree. We added nodes that represent
loops to the CCT, and call the result the Extended Calling Context Tree (ECCT). All properties of
the nodes representing procedures in CCT are also applicable to loop nodes in ECCT.
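A minimal sketch of this context-sensitive node construction (the class and method names are ours):

```python
class Node:
    """An ECCT node: a procedure or loop in one specific calling context."""
    def __init__(self, name):
        self.name = name      # e.g. "proc1 (func)" or "main (loop)"
        self.children = {}    # one child node per callee/loop, per context
        self.insns = 0        # cumulative retired instructions (annotation)

    def child(self, name):
        # The same procedure reached through two different contexts gets two
        # distinct nodes, which is what makes the graph a tree.
        return self.children.setdefault(name, Node(name))

root = Node("main (func)")
p2, p3 = root.child("proc2 (func)"), root.child("proc3 (func)")
# proc1 called from proc2 and from proc3 yields two separate nodes.
assert p2.child("proc1 (func)") is not p3.child("proc1 (func)")
```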
Figures 4.1(a) and 4.1(b) show part of the ECCT obtained for gzip and gcc
respectively. The nodes that carry a (func) or (loop) label are procedure and loop nodes
respectively. Each node in ECCT is annotated with statistical information about the
execution of the node and all the nodes in the subtree with the node as the root.
In our experiments, the cumulative number of dynamic instructions executed in the
node and all the nodes under it is recorded. For example, in Figure 4.1(a), the node
spec_compress will have the total number of instructions retired in zip and clear_bufs
in addition to the instructions retired in spec_compress itself. We used this annotated
information for dynamic code region analysis.
4.3 Dynamic Code Region
A Dynamic Code Region (DCR) is defined as a node in the ECCT together with all the nodes in the
subtree of that node. In Figure 4.1(b), for example, schedule_block (func) and its child
nodes schedule_block (loop), free_pending_lists (func), and sched_analyze (func) can
be grouped together and considered as one dynamic code region. Depending on the
target application, the ECCT can be analyzed to identify a set of DCRs
that have desirable characteristics.
4.3.1 Dynamic Code Region Analysis
Dynamic code region analysis is an algorithm that automatically identifies non-overlapping
DCRs with high code coverage and relatively stable behavior. The input
to the algorithm is an ECCT of the program that is annotated with cumulative num-
ber of retired instructions in each node and the required total coverage for the final set
of DCRs. We define coverage to be the ratio of the sum of the dynamic instruction
counts in each DCR to the dynamic instruction count of the root node of ECCT. In
ECCT, the instructions that correspond to statements other than a procedure call or
a loop in the parent node (e.g., if-then-else conditions and other statements) are not included in
the cumulative instruction count of the child node. Thus the sum of the cumulative
instruction counts of all child nodes will be less than the cumulative instruction count
stored in the parent node. We specify the coverage value to restrict the algorithm
to find a set of DCR’s whose cumulative coverage will be greater than the specified
value.
The search algorithm is iterative. It uses a set to keep track of the DCRs
discovered so far. The set is initialized with the root
node. During each iteration of the search, one node is removed from the set and its
child nodes are added to the set. Next, the coverage of the set is calculated. If the
coverage is greater than the coverage specified by the user, the algorithm proceeds
to the next iteration of the search. If coverage is less than the specified coverage,
the algorithm backtracks. It removes the child nodes that are added to the set in
the current iteration and adds the parent node back to the set. This node is then
marked and will not be selected in future iterations of the search. This completes one
iteration of the search algorithm. When no unmarked nodes remain in the set, the
algorithm terminates. The nodes in the set at termination are the
root nodes of the sub-trees that form the final set of DCRs. To keep the search from
going deeper and deeper into one region of the tree, the node in the set that is closest
to the root is given higher priority when selecting the node to expand.
Figures 4.1(a) and 4.1(b) show the DCR's obtained for gzip and gcc, respectively. The root node of each DCR is marked grey, and the rectangles mark the nodes included in each DCR. Since each node is visited no more than once during the search, the complexity of the search algorithm is linear in the number of nodes in the tree.
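The iterative search described above can be sketched as follows. This is a minimal sketch, not the COBRA implementation: the Node class and the find_dcrs name are illustrative, and we assume each node already carries its cumulative retired-instruction count and children.

```python
# Sketch of the iterative DCR search over an ECCT. Node and find_dcrs
# are hypothetical names; .count holds cumulative retired instructions.

class Node:
    def __init__(self, count, children=None):
        self.count = count              # cumulative retired instructions
        self.children = children or []
        self.depth = 0                  # filled in by find_dcrs

def find_dcrs(root, min_coverage):
    """Return the root nodes of non-overlapping DCRs whose combined
    coverage stays above min_coverage (a fraction of root.count)."""
    # Annotate depth so nodes closer to the root are expanded first.
    work = [(root, 0)]
    while work:
        n, d = work.pop()
        n.depth = d
        work.extend((c, d + 1) for c in n.children)

    current, marked = {root}, set()
    while True:
        # Pick the shallowest unmarked node that can be expanded.
        candidates = [n for n in current if n not in marked and n.children]
        if not candidates:
            break
        node = min(candidates, key=lambda n: n.depth)
        trial = (current - {node}) | set(node.children)
        coverage = sum(n.count for n in trial) / root.count
        if coverage >= min_coverage:
            current = trial             # keep the refinement
        else:
            marked.add(node)            # backtrack; never expand this node again
    return current
```

With a strict coverage requirement the search backtracks early and returns coarser regions; relaxing the requirement lets it descend to finer loops and calls.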
The final set of DCR's may contain nodes that contribute insignificantly to the overall execution of the program. For example, nodes representing the C library functions executed before main would be present in the final set. We prune such functions from the final DCR set; this slightly reduces coverage but does not affect the accuracy of the technique.
Table 4.1 shows the coverage and the Coefficient of Variation (CoV) of CPI for the top 5 DCR's obtained for the benchmark programs evaluated in this work using the Dynamic Code Region Analysis algorithm. CoV is the ratio of the standard deviation to the mean; a higher CoV implies higher performance variability within the DCR. The DCR's are listed in descending order of size for each benchmark. The table shows that a small number of DCR's cover a large portion of the program execution: the top three DCR's cover over 90% of dynamically retired instructions in gzip, vpr, mcf, vortex, crafty, eon, mesa and ammp. A dynamic optimization system can therefore focus on a few dynamic code regions and still achieve high execution-time coverage on the SPEC CPU2000 benchmarks.
4.3.2 Stable and Transition Intervals
The CoV of CPI for each DCR shown in Table 4.1 represents the performance variance of the intervals belonging to that DCR. If the CoV of CPI for a DCR is lower than 0.5, its intervals are considered stable; all intervals can thus be classified into stable and transition intervals. During transition intervals, the dynamic optimization system does not trigger binary trace generation and re-optimization, in order to minimize unnecessary translation and optimization overhead. For example, perlbmk has a relatively large instruction footprint and few loops, so it offers neither large coverage nor much optimization benefit; in such benchmarks, detecting transition intervals precisely reduces unnecessary optimization overhead. Stable intervals are contiguous intervals that a phase detector classifies into the same program phase; long runs of contiguous stable intervals are good targets for dynamic optimization.
Table 4.1: Top 5 dynamic code region statistics on SPEC2000 CPU benchmarks

Benchmark           Function name              DCR type  Coverage (%)  CPI CoV
gzip-source         spec_compress              func      88.12         0.16
                    spec_uncompress            func      11.79         0.42
                    spec_reset                 func       0.07         1.06
                    spec_load                  loop       0.01         0.13
vpr-route           try_route                  loop      97.59         0.35
                    alloc_and_load_rr_graph    func       0.79         0.04
                    print_route                loop       0.52         0.01
                    check_rr_graph             func       0.28         0.26
                    get_tok                    func       0.22         0.13
gcc-166             life_analysis              loop      33.47         0.19
                    schedule_block             func      15.53         0.50
                    cse_basic_block            func      12.43         0.28
                    life_analysis              loop       9.17         0.17
                    global_conflicts           loop       6.55         0.94
mcf-ref             primal_net_simplex         func      57.19         0.34
                    price_out_impl             func      41.66         0.39
                    sscanf                     func       0.66         0.05
                    flow_cost                  func       0.28         0.60
                    fgets                      func       0.07         0.05
perlbmk-splitmail1  Perl_pp_substcont          func      38.26         0.96
                    Perl_pp_subst              func      32.81         0.97
                    Perl_pp_helem              func       3.06         0.29
                    Perl_pp_match              func       2.56         0.32
                    Perl_pp_sassign            func       2.22         0.34
perlbmk-splitmail2  Perl_pp_substcont          func      38.72         0.57
                    Perl_pp_subst              func      23.33         0.49
                    Perl_pp_helem              func       5.76         0.18
                    Perl_pp_match              func       4.69         0.20
                    Perl_pp_sassign            func       1.97         0.21
vortex-lendian1     BMT_DeleteParts            func      67.15         0.41
                    BMT_CreateParts            func      10.39         0.38
                    BMT_LookUpParts            func       9.73         0.16
                    BMT_CreateParts            loop       5.05         0.41
                    BMT_CommitParts            func       4.61         0.10
bzip2-program       sortIt                     func      25.35         0.32
                    getAndMoveToFrontDecode    loop      19.58         0.13
                    generateMTFValues          loop      16.76         0.13
                    loadAndRLEsource           loop      16.43         0.19
                    sendMTFValues              loop      10.40         0.10
crafty-ref          Iterate                    func      99.99         0.05
                    InitializeAttackBoards     func       0.003        0.05
eon-cook            ggBRDF                     func      49.61         0.02
                    ggMaterialRecord           func      43.67         0.02
                    ggSpectrumf                func       2.13         0.02
                    ggSpectrumT0               func       1.94         0.02
                    ggSpectrum                 func       1.06         0.02
swim-ref            calc1                      loop      38.21         0.15
                    calc2                      loop      30.70         0.47
                    calc3                      loop      18.40         0.32
                    MAIN                       loop      12.27         0.22
                    inital                     loop       0.15         0.01
mesa-ref            gl_render_vb               func      92.93         0.16
                    shade_vertices             func       4.03         0.17
                    viewport_map_vertices      func       0.83         0.17
                    project_and_cliptest       func       0.80         0.18
                    transform_points           loop       0.50         0.18
ammp-ref            fv_update_nonbon           func      80.50         0.17
                    f_nonbon                   loop      12.87         0.73
                    divdf3                     func       2.14         0.70
                    divdf3                     func       1.12         0.70
                    sqrt                       func       0.95         0.70
Figure 4.2: Sampled BBV-based Program Phase Detection
4.4 Sampling-based Program Phase Tracking
Current processors do not directly support phase tracking in hardware, not even the Intel Itanium processors, which provide the most advanced hardware performance monitors. Hence, our dynamic optimization framework relies on sampling-based software phase detection. We implemented Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based phase detectors as software modules in our framework.
4.4.1 Sampled BBV-based Program Phase Detection
T. Sherwood et al. [50] proposed a novel method to automatically characterize large-scale program behavior using BBV-based clustering. This technique has been further studied to show the correlation between program code signatures and actual behavior, through simulation [33] and through measurement on real machines [44, 6].
We implemented a sampled BBV-based software phase detector functionally similar to the BBV-based hardware phase detector proposed by T. Sherwood [51]. Figure 4.2 shows the sampled BBV-based program phase detection scheme. Branch addresses are collected through periodic sampling: we capture four taken branches at a time on the Itanium 2 processor and store them in a kernel buffer. Once the kernel buffer is full, one BBV is computed from the branch addresses in the buffer. A hash function maps each branch target address to the frequency counter in the BBV that is incremented. The phase table is then updated.
Figure 4.3: Sampled HWSET-based Program Phase Detection
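The BBV construction just described can be sketched as follows. This is an illustrative sketch, not the COBRA module: the vector size, the hash, and the Manhattan-distance threshold in classify are assumed parameters.

```python
# Sketch of building a sampled BBV from a full kernel buffer of
# taken-branch target addresses, then classifying it against a phase
# table. BBV_SIZE and the 0.5 threshold are illustrative choices.

BBV_SIZE = 32

def compute_bbv(branch_targets, size=BBV_SIZE):
    bbv = [0] * size
    for addr in branch_targets:
        # Hash the branch target to pick which frequency counter to bump.
        bbv[hash(addr) % size] += 1
    return bbv

def classify(bbv, phase_table, threshold=0.5):
    """Return a phase ID for bbv, adding a new table entry on a miss.
    Matching uses a normalized Manhattan distance, as in the BBV
    phase-classification literature."""
    total = sum(bbv) or 1
    norm = [x / total for x in bbv]
    for pid, sig in enumerate(phase_table):
        if sum(abs(a - b) for a, b in zip(norm, sig)) < threshold:
            return pid
    phase_table.append(norm)
    return len(phase_table) - 1
```

Normalizing by the total sample count makes the signature insensitive to small variations in buffer fill level between intervals.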
4.4.2 Sampled HWSET-based Program Phase Detection
A Hot Working Set (HWSET) is a list of hot branch addresses. Instead of maintaining a frequency vector of branches, the HWSET detector maintains a sorted list of hot branches. Figure 4.3 shows the sampled HWSET-based program phase detection scheme. At every sampling interval, branch addresses are collected and stored in the kernel buffer. When the buffer is full, the branch addresses are sorted by frequency, and a HWSET signature is computed from the top branches.
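A HWSET signature can be sketched as the top-N branch addresses by sample frequency. This is an illustrative sketch; the cutoff N is an assumed parameter, not the exact COBRA setting.

```python
# Sketch of a HWSET signature: the hottest N branch addresses, kept as
# a sorted tuple so that identical hot working sets produce identical,
# hashable signatures regardless of sample ordering.

from collections import Counter

def hwset_signature(branch_addrs, top_n=8):
    counts = Counter(branch_addrs)
    hottest = [addr for addr, _ in counts.most_common(top_n)]
    return tuple(sorted(hottest))
```

Because the signature discards exact frequencies, two intervals that execute the same hot branches in different proportions still map to the same phase, which is what gives HWSET its coarser, more stable classification than BBV.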
4.4.3 Experimental Results
Data are collected from the phase detection modules of our COBRA framework while running the benchmarks on a 1.0 GHz Itanium 2 server. The SPEC CPU2000 benchmarks (12 integer benchmarks, 14 floating point benchmarks) used in our experiments are compiled with the Intel icc compiler (version 9.1) at the O3 optimization level. The integer benchmarks are gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, and twolf. Many programs in the integer benchmarks have complex control flows even
Figure 4.4: Phase coverage and stable phase of sampled BBV-based phase detection scheme on SPEC CPU2000 benchmarks
in hot code regions. The floating point benchmarks are wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, and apsi. Most of the floating point benchmarks are loop-intensive programs.
In our study, we examine phase coverage and the stable phase ratio. Phase coverage is the phase table hit ratio: whenever a sampling buffer fills, a phase signature is computed, and if the same signature is found in the phase table, a phase table hit counter is incremented. At the end of execution, the phase table hit ratio is computed. The stable phase ratio is equivalent to a last-phase prediction ratio: whenever a new phase signature is computed, it is compared with the previous signature, and if contiguous intervals have the same phase ID, the stable phase counter is incremented.
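The two metrics can be sketched in a few lines; the function below is an illustrative model of the bookkeeping, not the in-kernel code, and it omits table eviction for simplicity.

```python
# Sketch of the two metrics: phase coverage is the phase-table hit
# ratio over all computed signatures; the stable-phase ratio counts
# how often consecutive signatures map to the same phase ID.

def phase_metrics(signatures, table_capacity=64):
    table, hits, stable, prev = {}, 0, 0, None
    for sig in signatures:
        if sig in table:
            hits += 1
        elif len(table) < table_capacity:
            table[sig] = len(table)   # assign a new phase ID (no eviction here)
        pid = table.get(sig)
        if pid is not None and pid == prev:
            stable += 1
        prev = pid
    n = len(signatures)
    return hits / n, stable / n
```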
4.4.3.1 Results on Sampled BBV-based Program Phase Detection
Figure 4.4 shows phase coverage and stable phase of sampled BBV-based phase detec-
tion scheme on SPEC CPU2000 benchmarks. In this experiment, the BBV signature
table has 32 entries and the phase table has 64 entries. Most floating point benchmarks, except mesa and apsi, show a phase table hit ratio above 80%. In contrast, most integer benchmarks, except gzip, vpr, mcf and bzip2, show less than a 30% phase
Figure 4.5: Phase coverage and stable phase of sampled HWSET-based phase detection scheme on SPEC CPU2000 benchmarks
table hit ratio.
4.4.3.2 Results on Sampled HWSET-based Program Phase Detection
Figure 4.5 shows phase coverage and stable phase of sampled HWSET-based phase
detection scheme on SPEC CPU2000 benchmarks. The phase table has 64 entries.
Most floating point benchmarks, except mesa and apsi, show a phase table hit ratio above 90%, and most integer benchmarks, except crafty, eon, perlbmk and twolf, show a phase table hit ratio above 45%.
4.4.3.3 Comparison of BBV-based and HWSET-based Program Phase Detection
Figure 4.6 shows comparison between BBV-based and HWSET-based program phase
detection. In this experiment, the BBV signature table has 32 entries and the phase
table has 64 entries. The HWSET-based program phase detection shows an average 18.2% higher phase table hit ratio than BBV-based program phase detection, and an average 12.1% higher stable phase ratio. Another metric for comparing the two schemes is phase homogeneity improvement: if a phase detection scheme works well, we expect higher phase homogeneity
(a) Phase table hit ratio
(b) Stable phase ratio
(c) Homogeneity improvement from phase detection
Figure 4.6: Comparison of BBV-based and HWSET-based program phase detection
improvement. As shown in Figure 4.6(c), HWSET-based program phase detection shows an average 2.9% higher phase homogeneity improvement than the BBV-based scheme. We therefore conclude that HWSET-based phase detection is the better of the two sampling-based program phase detection schemes.
4.5 Global Program Phase on Multithreaded Programs
As more multithreaded programs are written to exploit parallelism and concurrency on multi-core and multiprocessor systems, efficient monitoring of these programs becomes an important problem for dynamic optimizers. Multithreaded programs can suffer performance problems due to contention for shared system resources such as the shared L2 cache, the memory subsystem and the system interconnection network. Precise monitoring and profiling could open up new optimization opportunities for a dynamic optimizer on multithreaded programs. In this section, we describe how to extend our sampling-based phase monitoring to multithreaded programs.
4.5.1 Global Program Phase
A global program phase represents collective multithreaded program behavior. In order to observe changes in the program behavior of all concurrent threads, we use global timer-based periodic sampling. At every global monitoring interval, each thread's phase signature is collected, and the set of per-thread phase signatures is formed into a one-dimensional vector that serves as the global phase signature. Per-thread monitoring and collective global phase monitoring are implemented in the COBRA framework as follows.
1. Per-thread monitoring: we track arbitrarily invoked threads and create a code signature from sampled execution paths (taken branches) or hot instruction pointer addresses. The monitor also annotates each code signature with performance characteristics such as cache misses and CPI. The periodic sampling of each per-thread monitor relies on its processor's timeout counter.
2. Collective global phase monitoring: thread-wise monitoring uses a system-wide timer to periodically monitor the collective performance impact of concurrent threads. We use a global phase vector consisting of each thread's code signature.
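Forming and classifying the global phase vector can be sketched as follows; global_phase_id and its table are illustrative names, not the COBRA implementation.

```python
# Sketch of the collective step: at each global timer tick, the
# per-thread phase signatures are concatenated into one vector, and
# the vector itself is looked up in a global phase table.

def global_phase_id(per_thread_sigs, global_table):
    """Map a tuple of per-thread signatures to a global phase ID,
    creating a new ID on first sight of the vector."""
    vec = tuple(per_thread_sigs)       # fixed thread order matters
    if vec not in global_table:
        global_table[vec] = len(global_table)
    return global_table[vec]
```

Because the vector keeps threads in a fixed order, the same collective behavior always yields the same global phase ID, which lets the optimizer recognize recurring system-wide phases.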
Figure 4.7 illustrates global program phase behavior for various multithreaded applications. Every 10 million cycles, a HWSET-based phase signature is computed in each thread. Each distinct phase is encoded as a different color so that global phase changes can be observed over time. Three different types of multithreaded programs are studied on a 4-way Itanium 2 server. In the OpenMP swim benchmark shown in Figure 4.7(a), the threads show similar program behavior: since the inner loops of swim are parallelized and the execution paths of the parallelized loops are alike, the threads' performance behaviors resemble each other. In the multithreaded BLAST program shown in Figure 4.7(b), each thread behaves differently, but a relatively small set of phases repeats, because a small number of hot loops are executed by the concurrent threads. The SPECjbb2005 benchmark is executed on a Java Virtual Machine (JVM), so its execution time is spent in JVM threads, whose behaviors differ greatly from each other, as shown in Figure 4.7(c).
4.5.2 Exploiting Global Program Phase
If every thread accesses a shared system resource intensively, performance scale-up with an increasing number of threads can be limited.
Figure 4.8 shows the performance scale-up of OpenMP swim on a 4-way Itanium 2 McKinley machine and an 8-way Montecito machine. Even though the number of threads is increased from
(a) OpenMP parallelized swim
(b) Multithreaded BLAST (blastn)
(c) SPECjbb2005 benchmarks
Figure 4.7: Global program phase behavior on multithreaded programs
Figure 4.8: Performance scale-up of OpenMP swim on Itanium 2 4-way McKinley and 8-way Montecito machines
Figure 4.9: Performance of SPEC OMP2001 benchmarks at different CPU frequencies on an Intel Core 2 Quad processor
one thread to eight threads on the 8-way Montecito server, no performance improvement is obtained. Since a single thread already uses up the sustainable memory bandwidth, additional threads only add contention on the system bus and memory subsystem. Dynamic thread throttling, guided by global program phase monitoring, could reduce this unnecessary shared resource contention.
Figure 4.9 shows the performance of the SPEC OMP2001 benchmarks at different CPU frequencies on an Intel Core 2 Quad processor. Most benchmarks reduce their execution time as the CPU frequency increases from 1.6 GHz to 2.4 GHz. Two benchmarks (swim and applu), however, show little performance improvement with 2 and 4 threads. This suggests that dynamic thread throttling should be combined with dynamic voltage/frequency scaling (DVFS) to achieve optimal power and performance for multithreaded programs.
4.6 Summary
We describe efficient phase-aware runtime program monitoring schemes implemented in our COBRA framework. We investigate the use of control flow information, such as loops and function calls, to identify repetitive program behavior. We describe sampled Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based program phase detection schemes; the sampled HWSET-based scheme shows larger phase coverage and longer stable phases than the sampled BBV-based scheme. We also describe how to extend our sampling-based phase monitoring to multithreaded programs. Our preliminary data indicate that dynamic thread throttling is a promising technique for achieving optimal power and performance trade-offs when used under a dynamic optimizer.
Chapter 5
Hardware Support for Program
Phase Tracking
This chapter describes hardware support for program phase tracking.
5.1 Dynamic Code Region (DCR): A Unit of Monitoring and
Re-optimization
5.1.1 Tracking Dynamic Code Region as a Phase
In this work, we propose phase tracking hardware that only tracks functions and loops
in the program. The hardware consists of a stack and a phase history table. The idea
of using a hardware stack is based on the observation that any path from the root node
to a node representing a dynamic code region in the Extended Calling Context Tree
(ECCT) can be represented as a stack of function calls and loops. To illustrate this
observation, we use an example program and its corresponding ECCT in Figure 5.1.
In this example, we assume that each of loop0, loop1 and loop3 executes for a long
period of time, and represents dynamic code regions that are potential targets for
Figure 5.1: An example code and its corresponding ECCT representation. Three dynamic code regions are identified in the program and are marked by different shades in the tree.
optimizations. The sequence of function calls which leads to loop1 is main() →
func1() → loop1. Thus, if we maintain a runtime stack of the called functions and
executed loops while loop1 is executing, we would have main(), func1() and loop1
on it. Similarly, as shown in Figure 5.1, the content of the stack for code region 2
would be main() and loop0, while for code region 3 it would be main(), func3() and
loop3. The stack can uniquely identify the calling context of a code region, and thus
can be used as a signature of the code region. For example, the code region loop3 can be identified by the signature main() → func3() → loop3 on the stack. Code regions in Figure 5.1 are formed at runtime, which is why each is called a Dynamic Code Region (DCR). A stable DCR is a subtree of the ECCT whose calling context remains stable during a monitoring interval, typically one million instructions in our study.
The phase signature table stores the stack signatures extracted from the stack. It also
assigns a phase ID to each signature. The hardware can be programmed to check for the current phase by comparing a subset of the stack with the signatures stored in the phase signature table. If there is a match, the phase ID associated with the matching signature is returned; otherwise, a new entry is created in the table and a new phase ID is assigned to it. The details of the table's fields and its operation are presented in Section 5.2.2.
5.1.2 Correlation between Dynamic Code Regions and Program Perfor-
mance Behaviors
In Figure 5.2(a), the CPI calculated over every 1-million-instruction interval of bzip2 is plotted. We then identified the DCR for each one-million-instruction interval and assigned a distinct phase ID to each DCR; these IDs are plotted over time in Figure 5.2(b). Comparing the CPI graph in Figure 5.2(a) with the phase ID graph in Figure 5.2(b), it can be seen that the CPI variation in the program correlates strongly with changes in DCR's. This shows that the DCR's of a program reflect its performance behavior and track the boundaries of behavior changes. Although Basic Block Vectors (BBVs) show a similar correlation, DCR's give code regions aligned with procedures and loops, which yields higher accuracy in phase tracking and also makes code optimization easier.
The several horizontal lines in Figure 5.2(b) show that a small number of DCR's are repeatedly executed during those periods. Most DCR's seen in this program are loops. More specifically, phase ID 6 is a loop in loadAndRLEsource, phase ID 10 is a loop in sortIt, phase ID 17 is a loop in generateMTFValues, and phase ID 31 is a loop in getAndMoveToFrontDecode.
(a) CPI change over execution time
(b) Phase change over execution time
Figure 5.2: Visualizing phase change in bzip2. (a) Change of average CPI during program execution; each point is the average CPI observed over a 1-million-instruction interval. (b) Phase changes over time (1-million-instruction intervals) in bzip2, tracked using dynamic code regions; the Y-axis shows the phase ID.
(a) function call (b) loop
Figure 5.3: Assembly code of a function call (a) and loop (b). The target address of the branch instruction is the start of the loop and the PC address of the branch instruction is the end of the loop.
5.2 DCR-based Phase Tracking and Prediction Hardware
We have discussed why DCR can be used to track program execution phases. In
this section, we propose a relatively simple hardware structure to track DCR during
program execution.
5.2.1 Identifying function calls and loops in the hardware
5.2.1.1 Detecting Function Calls
Function calls and their returns are identified by call and ret instructions in the
binary. Most modern architectures have included call/ret instructions. On detecting
a call instruction (see Figure 5.3(a)), the PC of the call instruction and the target
address of the called function are pushed onto the hardware stack. On detecting a
return instruction, they are popped off the stack. A special case to consider when detecting function calls and returns is recursion; in Section 5.2.3, we describe a technique to handle recursions.
5.2.1.2 Detecting Loops
Loops can be detected using backward branches. A branch which jumps to an address
that is lower than the PC of the branch instruction is a backward branch. The target
of a backward branch is the start of the loop and the PC of the backward branch
instruction is the end of the loop. This is illustrated in Figure 5.3(b). These two
addresses represent the boundaries of a loop. Code re-positioning transformations
can introduce backward branches that are not loop branches. Such branches may
temporarily be put on the stack and then get removed quickly. On identifying a loop,
the two addresses marking the loop boundaries are pushed onto the stack. To detect a
loop, we only need to detect the first iteration of the loop. In order to prevent pushing
these two addresses onto the stack multiple times in the subsequent iterations, the
following check is performed. On detecting a backward branch, the top of the stack
is checked to see if it holds a loop. If so, the addresses stored at the top of the stack are compared to those of the detected loop. If the addresses match, we have detected
an iteration of a loop which is already on the stack. A loop exit occurs when the
program branches out to an address outside the loop boundaries. On a loop exit, the
loop node is popped out of the stack. The conditions checked in the hardware are
summarized in Figure 5.4.
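The loop-related checks can be sketched in software as follows. This is an illustrative model of the conditions only: the stack holds (start, end) loop-bound pairs, function entries are omitted, and the names are hypothetical.

```python
# Sketch of the loop checks: push loop bounds on the first backward
# branch, suppress pushes for subsequent iterations, and pop on a
# branch that leaves the loop's address range.

def on_backward_branch(stack, target, branch_pc):
    """target is the loop start; branch_pc is the loop end."""
    top = stack[-1] if stack else None
    if top == (target, branch_pc):
        return                             # another iteration of the loop on top
    stack.append((target, branch_pc))      # first iteration: push loop bounds

def on_branch_out(stack, target):
    # A branch to an address outside the top loop's bounds is a loop
    # exit; pop until the target falls inside the loop on top.
    while stack and not (stack[-1][0] <= target <= stack[-1][1]):
        stack.pop()
```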
5.2.2 Hardware Description
The schematic diagram of the hardware is shown in Figure 5.5. The central part
of the phase detector is a hardware stack and a signature table. Each entry in the
hardware stack consists of four fields. The first two fields hold different information
for functions and loops. In the case of a function, the first and second fields are used
to store the PC address of the call instruction and the PC of the called function,
respectively. The PC address of the call instruction is used in handling recursions. In
the case of a loop, the first two fields are used to store the start and the end address
Figure 5.4: Conditions checked in the phase detection hardware
of the loop. The third field stores a one-bit value called the stable stack bit. This
bit is used to track the signature of the dynamic code region. At the start of every
interval, the stable stack bit of every entry in the stack that holds a function or a loop is set to '1'. The stable stack bit is set to zero for any entry pushed onto or popped off the stack during the interval. At the end of the interval, the entries at the bottom of the stack whose stable stack bit is still '1' are those that were not popped during the interval. This set of entries forms the signature of the code region to which execution was restricted in the current interval. At the end of the interval, this signature is compared against all signatures stored in the phase signature table.
The phase signature table holds the stack signatures seen in the past and their associated phase IDs. On a match, the phase ID of the matching entry in the signature table is returned as the current phase ID. If no match is found, a new entry is created
Figure 5.5: Schematic diagram of the hardware phase detector
in the signature table with the current signature, and a new phase ID is assigned to it. If there are no free entries available in the signature table, the least recently used entry is evicted to create space for the new entry. The fourth field of each stack entry is a one-bit value called the recursion bit, which is used when handling recursions; its use is explained in Section 5.2.3.1. The check logic shown in Figure 5.5 implements the algorithm presented in Figure 5.4.
The configurable parameters of the hardware are the number of entries in the stack and the number of entries in the phase signature table. In the results section, we show that a 32-entry stack and a 64-entry phase signature table are sufficient to track the phase changes of many programs.
(a) self-recursive calls (b) general recursive calls
Figure 5.6: Recursion structure in code and the content of the hardware stack during recursion.
5.2.3 Handling special cases
Recursion and long stack signatures are two conditions under which the hardware stack might overflow. In this section, we describe how our hardware handles these cases.
5.2.3.1 Recursions
In our phase detection technique, all functions that form a cycle in the call graph
(i.e., they are recursive calls) are considered as members of the same dynamic code
region. Figure 5.6 shows two recursive call structures and their corresponding stack contents during execution. The simplest and most common type of recursion, shown in Figure 5.6(a), is a function calling itself; the dynamic code region for this recursion contains just func1(). A more complicated recursion structure is shown in Figure 5.6(b), where func1() calls func2(), which in turn calls func1(), causing a recursion. This dynamic code region contains func1() and func2(), with func1() forming the boundary of the recursion.
In our hardware, all recursions are detected by checking the content of the stack. A
recursion is detected when the address of the function being called is already present
in an entry on the stack. This check assumes that an associative search of the stack is
done during every push operation. Since the number of entries on the stack is small,
the associative search hardware would be feasible.
To avoid stack overflow during a recursion, no push operation is performed after a
recursion is detected. The recursion bit is set to ’1’ for the entry corresponding to
the function marking the boundary of the recursion, e.g. func1() in the examples
shown in Figure 5.6. Since we no longer push entries onto the stack, we cannot pop entries on detecting a return instruction until execution leaves the recursion cycle. This is detected when a return instruction jumps to a function outside the recursion cycle: all function entries in the stack that lie below the entry whose recursion bit is set are outside the cycle. After a recursion is detected, the return address of every subsequent return instruction is checked against these entries. On a match, all entries above the matched entry are flushed, and normal stack operation resumes.
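The recursion handling above can be sketched in software. This is a loose illustrative model: the dictionary-based stack and function names are hypothetical, and the real hardware compares return target addresses rather than callee names.

```python
# Sketch of recursion handling: a push whose callee is already on the
# stack marks the recursion boundary and is suppressed; returns then
# only take effect when they land below that boundary.

def push_call(stack, callee):
    for entry in stack:
        if entry["callee"] == callee:      # associative search of the stack
            entry["recursion"] = True      # mark the recursion boundary
            return False                   # suppress the push
    stack.append({"callee": callee, "recursion": False})
    return True

def pop_return(stack, return_target):
    if any(e["recursion"] for e in stack):
        # Inside a recursion cycle: only a return into an entry below the
        # boundary flushes the entries above it and resumes normal operation.
        for i, e in enumerate(stack):
            if e["callee"] == return_target:
                del stack[i + 1:]
                e["recursion"] = False
                return
    elif stack:
        stack.pop()                        # normal return outside recursion
```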
5.2.3.2 Hardware Stack Overflow
Recursion is not the only case in which a stack overflow could occur. If the stack
signature of a dynamic code region has more elements than the stack can hold, the
stack would overflow. We did not encounter any stack overflow with a 32-entry stack, but if one did occur, it would be handled much like the recursion case described earlier. On a stack overflow, no further push operations are performed. The address to which control transfers on a return instruction is checked for a match against an address in the stack; on a match, all entries above the matched entry are removed from the stack and normal stack operation resumes.
5.3 Evaluation
In this section, we describe the evaluation methodology, the metrics used to evaluate
our hardware and the benchmarks.
5.3.1 Evaluation Methodology
Pin and pfmon [3, 40, 2] were used in our experiments to evaluate the effectiveness of the phase detector. Pin is a dynamic instrumentation framework developed at Intel for Itanium processors [2]. We developed a Pin tool with custom instrumentation routines to detect function calls and returns and backward branches for loops, and to maintain a runtime stack. The benchmark programs were instrumented using this customized Pin tool. For every interval of one million instructions, the content of the stack was dumped into a trace file, which was then analyzed by programs simulating the phase detection and phase prediction hardware.
pfmon is a tool that reads the performance counters of the Itanium processor [2]. We use CPI as the overall performance metric to analyze the variability within detected phases, and we modified pfmon to obtain the CPI for every one million instructions. Because these measurements were taken on a real machine, random noise may cause variation in the measured data. To minimize such effects, data collection was repeated three times, and the average of the three runs was used for all measurements. The CPI values were then matched with the phase information obtained from the customized Pin tool to get the variability information.
All data were collected on a 900 MHz Itanium 2 processor with a 1.5 MB L3 cache, running Red Hat Linux with kernel version 2.4.18-e37.
5.3.2 Metrics
The metrics used in our study are the number of distinct phases, the average phase
length, Coefficient of Variance (CoV) of CPI, and the accuracy of next phase pre-
diction. The number of distinct phases corresponds to the number of dynamic code
regions detected in the program. Average phase length of a benchmark program gives
the average number of contiguous intervals classified into a phase. It is calculated
by taking the sum of the number of contiguous intervals classified into a phase di-
vided by the total number of phases detected in the program. Coefficient of Variation
quantifies the variability of program performance behavior and is given by
CoV = σ / µ    (5.1)
where σ is the standard deviation and µ is the mean of the CPI. CoV provides a relative measure of the dispersion of the data compared to the mean; a smaller CoV for the performance metric within a phase implies that the phase is more stable. We present a weighted average of the CoV over the different phases detected in each program. The formula for the weighted average of the CoV for each benchmark is
CoV = Σi (ni · CoVi) / Σi ni (5.2)

where ni is the number of intervals in phase i and CoVi is the CoV of the performance
metric in phase i. We use weighted average of the CoV to give more weight to the
CoV of phases that have more intervals (i.e., longer execution times) and hence, better
represent the CoV observed in the program.
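Equations (5.1) and (5.2) can be combined into a short routine. The following is an illustrative sketch; the function name and the dictionary representation of per-phase CPI samples are ours, not from the thesis:

```python
import statistics

def weighted_cov(phases):
    """Interval-weighted average CoV (Eq. 5.2); `phases` maps a phase ID to
    the list of per-interval CPI values classified into that phase."""
    num = den = 0.0
    for cpis in phases.values():
        n = len(cpis)
        mu = statistics.fmean(cpis)
        sigma = statistics.pstdev(cpis)  # population std-dev, as in Eq. 5.1
        num += n * (sigma / mu)          # n_i * CoV_i
        den += n                         # n_i
    return num / den
```

For instance, a phase whose CPI samples are [1.0, 3.0] has a per-phase CoV of 0.5, and a perfectly stable phase contributes 0.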
The next-phase prediction accuracy is the number of intervals whose phase ID was
correctly predicted by the hardware divided by the total number of intervals in the
program.
5.3.3 Benchmarks
Ten benchmarks from the SPEC CPU2000 benchmark suite (8 integer and 2 floating
point benchmarks) were evaluated. These benchmarks were selected for this study
because they are known to have interesting phase behavior and are challenging for
phase classification [51]. Reference input sets were used for all benchmarks. Three
integer benchmarks, namely gzip, bzip2 and vpr, were evaluated with two different input
sets to illustrate the effect of input sets on the performance of the phase detection
hardware. A total of 13 benchmark and input-set combinations were evaluated. All
benchmarks were compiled using gcc (version 3.4) at the O3 optimization level.
5.4 Experimental Results
In this section, we present the evaluation results of our phase classification and predic-
tion hardware. In Section 5.4.1, we explore the design space of our hardware predictor
and present the results of our analysis. In Section 5.4.2, we compare the results of
our hardware scheme to the basic block vector (BBV) scheme described in [51, 35]. A
customized Pin tool was used to get the basic block vectors. These vectors were an-
alyzed off-line using our implementation of the phase detection mechanism described
in [51, 35] to generate the phase IDs for each interval. The metrics we use for
comparison include: the total number of phases detected, the average phase length,
the accuracy of next phase prediction and the CoV of the CPI within each phase.
5.4.1 Results for Phase Detection Hardware
There are two configurable parameters in our hardware. They are the size of the stack
and the size of the phase signature table. We evaluated four different configurations
Table 5.1: Number of phases detected for different configurations of the phase detection hardware.

benchmarks   16/16   32/64   64/64   infinite
ammp           800      53      53        53
bzip2 1       2856      99      99        99
bzip2 2       1278      87      87        87
crafty          27      27      27        27
eon             22      22      22        22
gcc            430     337     337       173
gzip 1          58      48      48        48
gzip 2          45      42      42        42
mcf            157      55      55        55
mesa            50      37      37        37
perl            28      28      28        28
vpr 1           92      91      91        91
vpr 2           27      27      27        27
median       58.00   48.00   48.00     48.00
for the hardware. The sizes of the stack and the phase signature table were
set to 16/16, 32/64, 64/64 and infinite/infinite, respectively.
5.4.1.1 Number of Phases and Phase Length
Table 5.1 shows the number of phases detected using four different configurations of
our phase detection hardware across different benchmark programs. The last row is
the median of the number of phases detected across all programs. We chose the median
to eliminate the effect of outliers in the data. For each benchmark program, there
are 4 columns which correspond to 16/16, 32/64, 64/64 and infinite/infinite hardware
configurations, respectively. It should be noted that except in the case of 16/16, the
number of phases detected for all programs is very close to or exactly the same as
that of using infinite hardware.
Table 5.2 shows the average phase length for four different configurations of our
hardware across benchmark programs. The last row is the median of the phase length
across all programs. A similar trend is seen in both Table 5.1 and Table 5.2, which is
expected. Except for gcc, in all other programs, the 32/64 and 64/64 configurations
Table 5.2: Average length of the phases detected for different configurations of the phase detection hardware.

benchmarks      16/16      32/64      64/64    infinite
ammp           985.43   15027.77   15027.77   15027.77
bzip2 1         68.75    1984.60    1984.60    1984.60
bzip2 2        125.73    1845.51    1845.51    1845.51
crafty       10314.54   10314.54   10314.54   10314.54
eon          10029.18   10029.18   10029.18   10029.18
gcc             22.75      47.71      47.71     460.90
gzip 1        2020.51    2450.40    2450.40    2450.40
gzip 2        1273.73    1273.73    1273.73    1273.73
mcf            683.36    1950.67    1950.67    1950.67
mesa         10780.59   13402.89   13402.89   13402.89
perl          2138.23    3579.22    3579.22    3579.22
vpr 1         1601.36    1618.96    1618.96    1618.96
vpr 2         7018.19    7018.19    7018.19    7018.19
median        1601.36    2450.40    2450.40    2450.40
have exactly the same phase length as that in the configuration with an infinite
number of entries. On average, the phase length is about 2450 one-million-instruction
intervals, which roughly corresponds to 2.5 seconds of real execution time on a
900 MHz Itanium-2 machine.
5.4.1.2 Performance variance within the same phase
Figure 5.7 shows the variation of the weighted average of Coefficient of Variation
(CoV) on CPI for different benchmarks and hardware configurations. The y-axis
shows the weighted average of CoV and the x-axis shows different benchmark pro-
grams. The last 4 bars show the average of the CoV for different hardware configura-
tions. It should be noted that the CoV for the infinite hardware is similar to that of
the 32/64 and 64/64 configurations. On average, the CoV for the phases detected by our
hardware is around 15%, which is close to the CoV of the BBV method with a 40% threshold
value.
Figure 5.7: Weighted average of the CoV of CPI for different configurations of the phase detection hardware
5.4.1.3 Phase Prediction Accuracy
Figure 5.8 shows the performance of a simple last-phase predictor and the Run Length
Encoding Markov predictor [51] for predicting the phase ID of the next interval.
The y-axis is the correct prediction ratio and the x-axis is the different benchmark
programs. A last-phase predictor is one which predicts the phase of the next interval
to be the same as the current one. Thus a last-phase predictor always predicts stable
phase behavior. The Markov predictor can be used to predict a phase change and the
new phase ID. On average, the simple last-phase predictor and the Markov predictor
correctly predict the next phase ID 80% and 84.5% of the time, respectively. Except
in the case of mcf and mesa, the performance difference between the Markov predictor
and the last-phase predictor is less than 5%, which indirectly indicates that
the phase detector detects phases that are longer and more stable. It is known that
SPEC2000 benchmarks have relatively stable phases. To evaluate the effectiveness on
predicting phase changes, we need to expand the test to more real world applications.
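The two predictors can be modeled in software roughly as follows. This is a hedged sketch: the RLE Markov predictor is simplified here to a table keyed on (current phase ID, run length), falling back to the last-phase prediction on a table miss; the hardware predictor in [51] is more elaborate.

```python
def evaluate_predictors(phase_ids):
    """Return (last-phase accuracy, Markov accuracy) over a phase-ID trace.
    The last-phase predictor always predicts "no change"; the simplified
    RLE Markov predictor remembers what followed each (phase, run length)."""
    last_hits = markov_hits = 0
    table = {}                    # (phase, run length) -> observed next phase
    prev, run = phase_ids[0], 1
    for actual in phase_ids[1:]:
        if actual == prev:
            last_hits += 1        # last-phase predictor was right
        if table.get((prev, run), prev) == actual:
            markov_hits += 1      # Markov predictor was right
        table[(prev, run)] = actual
        run = run + 1 if actual == prev else 1
        prev = actual
    n = len(phase_ids) - 1
    return last_hits / n, markov_hits / n
```

On a periodic trace such as `[0, 0, 1, 0, 0, 1, 0, 0, 1]`, the Markov model learns the run-length pattern and outperforms the last-phase predictor, mirroring the gap seen for mcf and mesa.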
Figure 5.8: Performance comparison of the last-phase predictor and the Markov predictor in predicting the phase of the next interval for the 32/64 configuration.
5.4.1.4 Discussion
There is one underlying theme in the results presented above: the performance of
the 32/64 or 64/64 phase detector is very similar to that of hardware with an infinite
amount of resources. This makes the phase detection hardware very cost effective. In our
analysis of the design space, we found that the maximum nesting levels of functions
and loops in the programs, after handling recursions, were always less than 32 for
SPEC benchmark programs. Thus a 32-entry hardware stack would be sufficient to
capture the phase signature without overflowing. For the phase signature
table, a larger table helps reduce potential overflow and the need to
evict signature entries when overflow occurs. Except for gcc, which has a larger code
base, a 64-entry table is sufficient to capture all phases.
5.4.2 Comparison with BBV Technique
In this section, we compare the performance of our phase detection hardware with the
phase detection hardware based on BBV [51, 35]. The BBV-based phase detection
hardware [51, 35] is shown in Figure 5.9. There are two tables in the hardware
structure, namely the accumulator table which stores the basic block vector and the
Figure 5.9: BBV-based phase tracking hardware
signature table which stores the basic block vectors seen in the past. These structures
are similar in function to our hardware stack and signature tables, respectively. To
make a fair comparison, we compare our 32/64 configuration against a BBV-based
hardware which has 32 entries in the accumulator table and 64 entries in the signature
table. In the BBV-based method, a phase change is detected by comparing the
Manhattan distance between the two vectors to a threshold value: if the distance
exceeds the threshold, a phase change is signaled.
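The BBV phase-change test can be sketched as follows, assuming each basic-block vector maps a basic-block ID to the number of instructions it executed during the interval (this dictionary representation and the function name are ours):

```python
def bbv_phase_change(prev_bbv, cur_bbv, interval=1_000_000, threshold=0.10):
    """Return True if the Manhattan distance between two basic-block vectors
    exceeds `threshold` * `interval`; 10% of a one-million-instruction
    interval is the threshold used in the original BBV paper."""
    blocks = set(prev_bbv) | set(cur_bbv)
    distance = sum(abs(prev_bbv.get(b, 0) - cur_bbv.get(b, 0)) for b in blocks)
    return distance > threshold * interval
```

Raising the threshold from 10% to 40% makes the same pair of vectors less likely to trigger a phase change, which is exactly why the 40% configuration reports far fewer phases in Table 5.3.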
We compare our phase detector and BBV based detector using four parameters
namely, number of phases detected, phase length, predictability of phases, and sta-
bility of the performance within each phase. We compare our results with those of
the BBV technique for two threshold values, namely 10% and 40% of a one-million-
instruction interval. The original BBV paper [51] sets the threshold value to be 10%
of the interval. It should be noted that the phases detected in the BBV-based tech-
nique may not be aligned with the code structures. Aligned phases are desirable for
dynamic binary re-optimization systems. We still compare against the BBV method
because it is a well accepted method for detecting phases.
5.4.2.1 Number of phases and phase length
Table 5.3 compares the number of phases detected by the BBV technique and our
phase detection technique. For the BBV technique, there are 2 columns for each
benchmark that correspond to a threshold value of 10% and 40% of one million in-
structions, respectively. In the BBV technique, as we increase the threshold
value, small differences between the basic block vectors will not cause a phase change.
Hence, a smaller number of phases is detected as we go from 10% to 40% threshold
value. Recall that, in a dynamic binary optimization system, on detecting a new
phase, the profiler will start profiling the code, which might cause a significant over-
head. Hence, for such systems a smaller number of phases with a longer per phase
length is desirable. We can see that in the original BBV technique with 10% thresh-
old, the number of phases detected is 100 times more than those detected in the
DCR-based technique. In the BBV technique, as we go from 10% to 40% threshold
value, the number of phases detected becomes smaller, as expected. But even at
40%, the number of phases detected by the BBV technique is about twice that detected
by our technique.
Table 5.4 shows the average phase length of the BBV technique and our phase de-
tection technique. The trend in the data is similar to that seen in Table 5.3, which
is expected. The median phase length of our technique is about 100 times that of
the BBV technique with a 10% threshold value, and about twice that of the BBV
technique with a 40% threshold value. Although in the case of eon
and mesa the phase length for BBV with 40% threshold value is three times that of
DCR technique, these programs are known to have trivial phase behavior. The larger
difference is due to the number of phases detected.
Table 5.3: Comparison of the number of phases detected between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The two BBV columns correspond to threshold values of 10% and 40% of one million instructions, respectively.

Benchmarks   BBV-10%   BBV-40%   DCR-32/64
ammp           13424       122          53
bzip2 1        35154      1796          99
bzip2 2        37847      1469          87
crafty         20111        20          27
eon               38         7          22
gcc             2650       599         337
gzip 1          8328       182          48
gzip 2          4681        77          42
mcf             5507        88          55
mesa             945        15          37
perl            8036       201          28
vpr 1           3136       105          91
vpr 2             51        27          27
median          5507       105          48
Table 5.4: Comparison of the average phase length between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The two BBV columns correspond to threshold values of 10% and 40% of one million instructions, respectively.

Benchmarks   BBV-10%    BBV-40%   DCR-32/64
ammp           58.83    6472.27    15027.77
bzip2 1         5.59     111.51     1984.60
bzip2 2         4.24     108.30     1845.51
crafty         14.57   15073.26    10314.54
eon          6128.94   31520.43    10029.18
gcc            19.70      86.84      158.15
gzip 1         13.90     629.34     2450.40
gzip 2         11.11     678.23     1273.73
mcf            19.52    1219.19     1950.67
mesa          517.82   32899.13    13402.89
perl           13.76     497.59     3579.22
vpr 1          47.17    1398.50     1618.96
vpr 2        3715.51    7018.19     7018.19
median         19.52    1219.19     2450.40
Figure 5.10: Comparison between BBV- and DCR-based phase detection hardware on the performance of a 256-entry Markov predictor in predicting the phase ID of the next interval. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The first 2 columns for each benchmark are for the BBV method using threshold values of 10% and 40% of one million instructions, respectively.
5.4.2.2 Phase Prediction Accuracy
Figure 5.10 compares the performance of the 256-entry Markov Predictor using BBV
technique and our phase detection technique. Except for eon and vpr 1, the Markov
predictor using our phase detector predicts the next phase better. On average, using
our phase detection technique, the Markov predictor predicts the correct next phase
ID 84.9% of the time. Using the BBV-based technique, the average correct prediction
ratios are 42.9% and 73.3% for the 10% and 40% thresholds, respectively.
5.4.2.3 Performance Variance within the Same Phase
Figure 5.11 compares the weighted average of the CoV of CPI for phases detected by
the BBV technique and by our phase detection technique. The last three bars give the
average CoV. From the figure we can see that BBV-10% has the least average CoV
value. This is because the number of phases detected by BBV-10% is much higher
than the number of phases detected by BBV-40% or our technique. In the case of
BBV-10%, the variability of CPI gets divided into many phases, thus reducing the
variability observed per phase. On average the CoV of our phase detection hardware
Figure 5.11: Comparison of the weighted average of the CoV of CPI between BBV- and DCR-based phase detection schemes. A 32-entry accumulator table/hardware stack and a 64-entry phase signature table were used. The first 2 columns for each benchmark are for threshold values of 10% and 40% of one million instructions, respectively.
is 14.7% while it is 12.57% for the BBV-40%. Although the average variability of our
technique is greater than that of BBV-40%, the numbers are comparable. In fact, for bzip2 1,
crafty, gzip 1, gzip 2, mesa, vpr 1 and vpr 2, the CoV of the DCR-based technique is
less than or equal to the CoV observed with BBV-40%. For ammp, gcc, mcf and
perl the performance variation is higher in the dynamic code regions detected. The
higher performance variation within each dynamic code region may be due to change
in control flow as in the case of gcc or change in data access patterns as in the case
of mcf.
5.5 Summary
From the above discussions we can conclude that our hardware detects a smaller
number of phases, has a longer average phase length, achieves higher phase prediction
accuracy, and detects phases aligned with the code structure, all of which are
desirable characteristics of a phase detector for a dynamic binary re-optimization
system. The CoV of the phases
detected in our technique is slightly higher but comparable to that observed in the
BBV technique with 40% threshold value. The phase difference is detected using an
absolute comparison of phase signatures, which makes the hardware simpler and the
decision easier to make. The 32/64 hardware configuration performs similar to an
infinite sized hardware, which makes it cost effective and easier to design.
Chapter 6
Continuous and Persistent Profile
Management
This chapter describes techniques to characterize and classify dynamic profiles for
dynamic compilation and optimization systems.
6.1 Continuous and Persistent Profile-Guided Optimization
JIT compilers have been widely used to achieve near native code performance by
eliminating interpretation overhead [13, 7]. However, JIT compilation time is still
a significant part of execution time for large programs. To cope with this
problem, recently released language runtime virtual machines, such as the Java
Virtual Machine (JVM) and the Microsoft Common Language Runtime (CLR), provide an
Ahead-Of-Time (AOT) compiler, sometimes called a pre-JIT compiler, to generate native
binaries and store them in a designated area before the execution of frequently used
applications. This approach can mitigate runtime compilation overhead [9].
A pre-JIT compiler [56] can afford more time-consuming profile-guided optimizations
Figure 6.1: Continuous profile-guided optimization model
(PGO) compared to a JIT compiler because compilation time in a pre-JIT compiler
may not be a part of execution time. In order to enable advanced profile-guided
optimizations (PGO) in a dynamic compiler, dynamic profiles are usually collected
through sampling and runtime instrumentation by high-level language virtual ma-
chines. With the deployment of Pre-JIT compilers, automatic continuous profiling
and re-optimization, as shown in Figure 6.1, becomes a viable option for the man-
aged runtime execution environment. For example, with the introduction of a Pre-JIT
compiler on the recent Microsoft CLR [9], the continuous PGO framework shown in
Figure 6.1 has become a feasible optimization model. The pre-JIT compiler compiles
MSIL code into native machine code and stores it on the disk. The re-compilation
does not occur during program execution but, instead, is invoked as an offline low
priority process. The re-compilation process relies on accurate HPM (Hardware Per-
formance Monitor)-sampled profiles that are accumulated over several executions of
the application program. HPM-sampled profiles could provide more precise runtime
performance events, such as cache misses and resource contentions, that allow more
effective runtime or offline optimizations [39, 16, 37].
However, the perturbation from collecting and processing profiles must be minimized to
justify the performance gain from re-optimization [60]. This cost includes runtime
overhead and the use of system resources such as memory and disk space. With HPM
available in most recent microprocessors, sampling-based runtime profiling provides an
attractive alternative to instrumentation-based profiling: it avoids the perturbation
caused by instrumentation code while still capturing precise runtime performance
events, such as cache misses and resource contentions, that allow more effective
runtime or offline optimizations [39, 16, 37]. A sampling-based profile manager has
thus become an essential component in a continuous profile-guided optimization
framework such as the one shown in Figure 6.1. However, many production compilers
still depend on instrumentation-based profiles for their PGO. This is because some
existing optimizations, such as complete loop unrolling, require the iteration count
of a loop, and this type of information may not be accurately generated from
sampling-based profiles.
In order to obtain more accurate profiles with a low sampling frequency, sampled
profiles could be merged and stored across multiple runs on the disk. Due to the
statistical nature of sampling, the quality of sampled profiles is greatly affected by the
sampling rate. As the sampling frequency is increased, more samples are collected and
the accuracy of sampled profiles is improved. Unfortunately, the sampling overhead
is also increased. A high sampling frequency would cause more interrupts, require
more memory space to store sampled data, and more disk I/O activities to transfer
persistent profile data. With a fixed number of runs, the challenge for the runtime
profiler is to determine the ideal sampling rate so that high-quality profiles can
still be obtained with minimal sampling overhead.
6.2 Similarity of Sampled Profiles
A sampling-based profiler collects a frequency profile of sampled PC addresses rather
than the edge-count profiles obtained through instrumentation. Zhang et al. [60]
showed that an accurate edge profile can be deduced from the frequency profile of
sampled PC addresses, and that the optimization results from using the deduced edge
profile are comparable to those from using the actual edge profile obtained through
instrumentation.
6.2.1 Similarity Metric
In order to evaluate the similarity between a sampled profile and the “complete pro-
file”, we define a “similarity metric” between the two profiles. Since our profile is a
list of frequency counts of distinct sampled PC addresses, it can be represented as a
1-dimensional vector. To compute the linear distance between two vectors, we use
the Manhattan distance, shown in the following equation, as the similarity metric:
S = (2.0 − Σi |ai − bi|) / 2.0

where ai and bi are the relative frequencies of the ith distinct PC address in the
two profiles. Here ai and bi refer to the same PC address, and either may be zero
when no matching samples are collected. If the two profiles are identical
(ai = bi for all i), S becomes 1.
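The similarity computation follows directly from the definition; the dictionary representation of a profile (PC address mapped to relative frequency) is an assumption of this sketch:

```python
def similarity(p, q):
    """Similarity S between two profiles, each a mapping from PC address to
    relative frequency (the values in each profile sum to 1)."""
    addrs = set(p) | set(q)
    return (2.0 - sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in addrs)) / 2.0
```

Identical profiles yield S = 1, and profiles with no PC address in common yield S = 0, matching the intended range of the metric.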
6.2.1.1 Baseline Profile to Determine Similarity of Sampled Profiles
Instead of using instrumented profiles, we use a merged profile generated from very
high frequency sampling rates over multiple runs as the baseline profile. One way to
merge them is to sum up the frequency of each PC address across all sampled profiles.
We collected sampled profiles three times using each of six different sampling rates
(one sample every 634847, 235381, 87811, 32973, 34512, and 32890 instructions), and
generated a baseline profile for each benchmark program by merging the 18 profiles.
In every sampling interval, one PC address is collected, so each sampled profile
holds a frequency count for each distinct PC address. Hence, we can compute a
normalized frequency for each distinct PC address by dividing its frequency count
by the total sample count. We mask off the 4 low-order bits of the PC address to
approximate a basic block, i.e. use the masked address as the starting address of an
Figure 6.2: Convergence of merged profiles of gcc with 200.i input set
approximated basic block instead of distinct PC addresses within the approximated
basic block. The obtained frequency is the frequency of the approximated basic block.
The obtained baseline profile is very close to the instrumentation-based complete
profile: the similarity ranges from 0.95 to 0.98 under our similarity metric for
the SPEC CPU2000 benchmarks.
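The merging and address-masking steps described above can be sketched as follows; the function name and the raw-profile representation (PC address mapped to sample count) are assumptions of this sketch:

```python
from collections import Counter

def merge_profiles(profiles, mask_bits=4):
    """Merge raw sampled profiles (PC address -> sample count) into one
    normalized baseline profile. The low-order bits of each PC are masked
    off so nearby samples collapse into an approximated basic block."""
    merged = Counter()
    mask = ~((1 << mask_bits) - 1)
    for prof in profiles:
        for pc, count in prof.items():
            merged[pc & mask] += count       # accumulate per basic block
    total = sum(merged.values())
    return {pc: c / total for pc, c in merged.items()}  # relative frequency
```

For example, samples at 0x4001 and 0x4012 fall into the approximated basic blocks starting at 0x4000 and 0x4010, respectively.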
6.2.2 Accuracy of Persisted Profiles
As the similarity between the baseline profile and the instrumented complete profile
reflects how “accurately” the baseline profile mimics the complete profile, we use
“accuracy” and “similarity” interchangeably in the rest of this chapter. Intuitively,
the accuracy of merged profiles improves as the number of samples increases.
Figure 6.2 shows that merged profiles are more accurate (compared to the baseline
profile) than a single profile instance, labeled R2R (Run-to-Run) in the figure,
for 176.gcc with the 200.i input.
In Figure 6.2, sampled profiles are cumulatively merged along repeated runs on the
same input set. For example, at the 10th run, the merged profile is the sum of the
1st through 10th profiles. The y-axis shows the similarity (S) between the baseline
profile and the merged profile with three different sampling rates (one sample every
1M, 10M, 100M instructions).
We can observe two interesting facts. First, most of the improvement in accuracy
comes from the first three to six runs. Second, after that the improvement curve
flattens. Since we cannot afford too many runs at high sampling rates, we
need to adapt sampling rates according to the program behavior. We address how to
automatically reduce the sampling rate through profile characterization in the next
section.
6.3 Entropy-Based Profile Characterization
In this section, we show that an appropriate sampling rate could be determined by
using the information entropy.
6.3.1 Information Entropy: A Metric for Profile Characterization
An appropriate sampling rate could be determined according to the program behavior.
Since our frequency profile can be represented as a statistical distribution, where each
distinct PC address has a probability that is the number of its occurrences divided
by the overall number of samples, we can use the equation to quantify the shape
of statistical distribution. In this work, we use information entropy as defined by
Shannon [18] to characterize the sampled profiles (for example, “flat” or “skewed”).
The information entropy is defined as follows:
E = Σi Pi · log(1 / Pi)

where Pi is the relative frequency probability of the ith distinct PC address.
If the program has a large footprint and a complex control flow, a large number of
distinct PC addresses will be collected. On the other hand, if the program has a
(a) gcc (E=10.12); (b) gzip (E=5.67)
Figure 6.3: Relative frequency distribution of PC address samples (gcc, gzip)
small number of hot spots (or hot loops), the sampled profile will have a small
number of distinct PC addresses, which leads to a low entropy value. This property
can be used to determine an appropriate sampling rate for the next run. The two
example programs shown in Figure 6.3 clearly show that entropy distinguishes “flat”
profiles from “skewed” profiles. gcc in Figure 6.3(a) shows a relatively “flat”
distribution and a high entropy value (E=10.12). In contrast, gzip in Figure 6.3(b)
shows a “skewed” distribution and a low entropy value (E=5.67).
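The entropy computation follows directly from the definition above. One caveat: the thesis does not state the logarithm base, so base 2 is assumed in this sketch:

```python
import math

def profile_entropy(profile):
    """Shannon entropy of a normalized profile (PC address -> relative
    frequency probability P_i). Base 2 is an assumption; the text does
    not state the logarithm base."""
    return sum(p * math.log2(1.0 / p) for p in profile.values() if p > 0)
```

A profile spread uniformly over 32 addresses has entropy 5 bits ("flat"), while a profile concentrated on a single hot address has entropy 0 ("skewed").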
6.3.2 Entropy-Based Adaptive Profiler
This subsection describes the implementation of an entropy-based adaptive profiler.
The application of entropy heuristics in the adaptive profiler is as follows:
1. When an application is loaded and ready to run, check if there is already a
profile for this application. If not, i.e. this is the first time the application
executes, start with a low sampling rate.
2. After the program terminates, compute the information entropy of the profile.
Categorize the profile into one of three ranges of entropy values. Our
data shows that the following three ranges are sufficient: low [0, 5),
medium [5, 8), and high [8, ∞).
3. When an application is loaded, if the entropy is known, set the sampling rate
according to the entropy: a high entropy uses a high sampling rate, a medium
entropy uses a medium sampling rate and a low entropy uses a low sampling
rate.
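The three-step heuristic above might be sketched as follows; the concrete sampling intervals (instructions per sample) are illustrative assumptions, not values from the thesis:

```python
def pick_sampling_interval(entropy):
    """Map a stored profile's entropy to a sampling interval (instructions
    per sample). The entropy ranges follow the heuristic above; the
    interval values themselves are illustrative."""
    if entropy is None:       # step 1: first run, no stored profile yet
        return 100_000_000    # start with a low sampling rate
    if entropy < 5:           # low entropy: a few hot spots dominate
        return 100_000_000
    if entropy < 8:           # medium entropy
        return 10_000_000
    return 1_000_000          # high entropy: flat profile, sample densely
```

A smaller interval means more frequent samples, so high-entropy programs such as gcc get the dense sampling they need while gzip-like programs keep the overhead low.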
6.4 Entropy-Based Profile Classification
In practice, a program shows different program behaviors when different input sets
are used. The entropy of their profiles will also change along with the changed input
sets. For example, perlbmk has 7 inputs. The entropy of each input ranges from
5.52 to 9.05. For multiple inputs, if the program executes for a long enough time,
its sampling rate can be adjusted using entropy during the execution. It is more
important to understand the impact of different profiles on PGO. For multiple inputs,
this section describes how entropy can be used to classify different profiles.
Figure 6.4 shows the workflow of entropy-based profile classification. In our profile
classification framework, we use a k-means clustering technique to identify similar profiles.

Figure 6.4: Entropy-based profile classification

If the maximum number of clusters is set to 3 (k = 3), incoming profiles are
classified and merged into three persistent profiles (A, B, C). In Figure 6.4,
three profiles with entropy values in a similar range (E = 6.46, 6.30, 6.68) are
classified and merged into a single persistent profile A. One profile with entropy
E = 8.42 is merged into persistent profile B, and another with entropy E = 5.52 is
merged into persistent profile C.
When the recompilation for PGO is invoked, the controller determines whether the
classified profiles are merged into one profile or one particular profile is selected for
PGO. If the similarity (S) of classified profiles is very low, it means that the profiles
come from disjoint code regions. In this case, it is better to combine the profiles. vpr
is such a case. The profile from place and the profile from route are disjoint from
each other. Otherwise, we choose the profile that was merged from the majority of
runs.
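The classification step can be sketched as a simple online one-dimensional clustering on entropy. This greedy scheme, with a hypothetical merge-radius parameter `width`, is a simplification of the k-means clustering the framework uses:

```python
def classify_profile(entropy, clusters, max_k=3, width=0.5):
    """Assign a profile to the entropy cluster whose centroid lies within
    `width`; otherwise open a new cluster (up to max_k). `clusters` is a
    list of [centroid, count] pairs, updated in place; returns the index."""
    for i, (centroid, count) in enumerate(clusters):
        if abs(entropy - centroid) <= width:
            clusters[i][0] = (centroid * count + entropy) / (count + 1)
            clusters[i][1] = count + 1
            return i
    if len(clusters) < max_k:
        clusters.append([entropy, 1])
        return len(clusters) - 1
    # all max_k clusters taken: fall back to the nearest centroid
    nearest = min(range(len(clusters)),
                  key=lambda i: abs(entropy - clusters[i][0]))
    clusters[nearest][1] += 1
    return nearest
```

Fed the entropy values from Figure 6.4 (6.46, 6.30, 6.68, 8.42, 5.52), this sketch reproduces the grouping into three persistent profiles A, B, and C.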
We could generate multiple specialized binaries, each customized for one specific type
of profiles. However, it is difficult to predict what would be the incoming input data
set for the next run. In this work, we introduce three different types of profiles that
could lead to different profile management and feedback strategies.
6.5 Experiments
6.5.1 Experimental Setup
For our experiments, we collected sampled profiles with the Intel SEP (Sampling
Enabling Products) tool, version 1.5.252, running on a 3.0 GHz Pentium 4 with the
Windows XP operating system. The SPEC CPU2000 INT and FP benchmarks are used to
evaluate the convergence of merged profiles and the effectiveness of using entropy to
characterize different sampled profiles. The SPEC CPU2000 benchmarks are compiled
with the Intel icc compiler, version 8.0, at the O3 optimization level. The SPECjbb
1.01 benchmark is used to show how entropy can be effectively used in an adaptive
profile manager.
In the SPEC CPU2000 INT benchmarks, vpr, vortex, gzip, gcc, perlbmk and bzip2
have multiple input sets. We used these six benchmarks for our experiments. These
benchmarks are compiled with the Intel icc compiler ver. 8.0 at the O3 optimization
level, and measured on a 1 GHz Itanium-2 machine. The profile feedback
uses the Intel icc profile-guided optimization. In our experiments, the maximum
number of clusters is set to 3 (MaxK = 3).
6.5.2 Experimental Results
6.5.2.1 Accuracy of Merged Profiles
Figure 6.5 shows that persisted profiles converge to the baseline profile at
different rates depending on program behavior, for the SPEC CPU2000 (INT, FP)
benchmarks at a sampling rate of one sample every 100M instructions. For the SPEC
CPU2000 INT benchmarks, shown in Figure 6.5(a), most benchmarks quickly converge
above a similarity of 0.9 (i.e., more than 90% similar to the baseline profile)
after the initial five runs, except for five benchmarks (gcc, vortex, perlbmk,
(a) INT; (b) FP
Figure 6.5: Convergence of merged profiles of SPEC CPU2000 benchmarks
crafty, eon). Since these benchmarks have relatively complex control flows and
large instruction footprints, they need a higher sampling rate to achieve a targeted
accuracy within a limited number of runs. For the SPEC CPU2000 FP benchmarks,
shown in Figure 6.5(b), most benchmarks converge quickly above 0.9 similarity to the
baseline after the initial five runs, except for three benchmarks (lucas, applu, fma3d).
Table 6.1: Entropy of SPEC CPU2000 INT benchmarks
benchmark   entropy   benchmark   entropy   benchmark   entropy   benchmark   entropy
gzip        5.67      vpr         4.87      gcc         10.12     mcf         4.45
crafty      9.52      parser      7.79      eon         7.95      perlbmk     8.46
gap         7.62      vortex      8.02      bzip2       7.50      twolf       7.24
Table 6.2: Entropy of SPEC CPU2000 FP benchmarks
benchmark   entropy   benchmark   entropy   benchmark   entropy   benchmark   entropy
wupwise     6.43      swim        4.86      mgrid       5.09      applu       8.26
mesa        7.85      galgel      5.86      art         4.01      equake      5.98
facerec     6.44      ammp        6.95      lucas       9.26      fma3d       9.16
sixtrack    5.19      apsi        7.91
6.5.2.2 Entropy-Based Profile Characterization
Table 6.1 shows the entropy of the SPEC CPU2000 INT benchmarks. Interestingly,
the entropies of these benchmarks cluster into three ranges. Two
programs (vpr, mcf ) show a low entropy (0 ≤ E < 5). Four programs (gcc, crafty,
perlbmk, vortex ) show a high entropy (E ≥ 8). The remaining programs have a medium
entropy (5 ≤ E < 8). The four programs that show high entropy exactly match
the programs, shown in Figure 6.5(a), that need higher sampling rates to achieve a
targeted accuracy.
Table 6.2 shows the entropy of the SPEC CPU2000 FP benchmarks. Two programs
(swim, art) show a low entropy (0 ≤ E < 5). Three programs (lucas, applu, fma3d)
show a high entropy (E ≥ 8). The three programs that show high entropy also exactly
match the programs, shown in Figure 6.5(b), that need a higher sampling rate. These
results strongly suggest that entropy is a good metric for selecting the sampling rate for
the SPEC CPU2000 benchmarks (INT, FP).
We could start sampling at a high frequency for all programs to obtain more accurate
profiles. However, that would incur unnecessarily high overhead. Based on
our entropy-based characterization and our observation of the convergence of merged
profiles, only seven programs among the 26 SPEC CPU2000 programs need a sampling rate higher
[Figure: similarity (S) vs. number of runs (1-10) for sampling rates of one sample per 1M, 10M, and 100M instructions, and for the self-adaptive profiler using the entropy and delta heuristics.]
Figure 6.6: Accuracy of entropy-based adaptive profiler on SPECjbb ver. 1.01
than one sample per 100M instructions. Hence, it is more cost-effective to start
with a low sampling rate and adjust it according to the profile entropy
collected at runtime.
6.5.2.3 Adaptive Profiler
Figure 6.6 shows the results of using entropy in an adaptive profiler for the SPECjbb
ver. 1.01 benchmark written for the Microsoft .NET platform. After the first run of
the program, the profiler increases the sampling rate from one sample per 100M
instructions to one sample per 10M instructions according to the measured entropy.
In practice, we may not have the baseline profile needed to compute the similarity metric (S).
Instead, we can use the delta (∆) of the similarity (S) between the current cumulative profile and
the previous one. If ∆S is small enough (for example, ∆S = 0.005), we
consider the convergence curve of the merged profile to have flattened. Depending on
the number of runs, the profiler can stop or continue to collect profiles. In Figure 6.6,
∆S falls below the given threshold (∆S < 0.005) at the 7th run. Since we expect
6 to 8 runs in this experiment, the profiler decides to stop profile collection.
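The stopping rule can be sketched as follows. The similarity metric S is defined in an earlier chapter; the overlap form used here, the sum of bucket-wise minima of two normalized profiles, is one common choice and an assumption for this sketch only.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Overlap similarity between two normalized profiles p and q over the
 * same n buckets: S = sum(min(p_i, q_i)). S = 1 means identical
 * distributions; S = 0 means disjoint ones. This particular form is
 * an assumption; the thesis defines S in an earlier chapter. */
double similarity(const double *p, const double *q, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (p[i] < q[i]) ? p[i] : q[i];
    return s;
}

/* Delta-S stopping rule: without a baseline profile, track S between
 * consecutive cumulative profiles and stop collecting once the
 * convergence curve flattens (change below the 0.005 threshold from
 * the text). */
int should_stop_profiling(double s_prev, double s_curr, double threshold) {
    return fabs(s_curr - s_prev) < threshold;
}
```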
When we compare the profiles generated from our adaptive profiler with those from
a sampling rate of one sample per 1M instructions, our profile is quite accurate
(S = 0.945, i.e., 94.5% similar to the baseline) with only 8.7% of the samples taken. Our
profile is only 3% less accurate than the profile generated at a sampling
rate of one sample per 1M instructions. Since an edge profile can be deduced from
this frequency profile, as explained earlier, a 3% difference in accuracy will not lead to
any significant difference in the accuracy of the deduced edge profile.
6.5.2.4 Entropy-Based Profile Classification
We found that there are three types of program behavior. In type I, the program
behavior does not change much across input sets, so the entropies of the
profiles from different inputs are similar to each other. The vortex program is an
example. Its sampled profiles are classified and merged into one baseline profile. This
is the simplest case.
Table 6.3 shows the performance improvement from PGO on vortex with multiple input
sets. Each column of the table corresponds to a different input set; for convenience,
columns are numbered by input set, so the first column is for the lendian1 input. Each
row presents the performance improvement of the binary generated using feedback
profiles. The baseline used to compute the improvement is the binary generated
without PGO. For example, feedback(1) is the binary generated using
the profile from the lendian1 input, and feedback(self) is the binary generated
using the profile from the same input on which it is measured. The feedback(self) row
shows the full potential of PGO. For vortex, feedback(classified) uses one profile
merged from all profiles.
In type II, program behavior changes significantly with the input
data set, so the sampled profiles from different inputs are dissimilar. The vpr program
is an example. Since the entropies of its two sampled profiles fall into different ranges, they
Table 6.3: Performance improvement (%) from PGO on vortex with multiple input sets
                        1:lendian1   2:lendian2   3:lendian3   average
feedback:(1)            27.64        30.84        27.83        28.77
feedback:(2)            28.03        31.26        26.92        28.74
feedback:(3)            28.02        30.02        31.26        29.77
feedback:(self)         27.64        31.26        31.26        30.06
feedback:(classified)   27.30        31.39        26.25        28.31
Entropy                 7.80         8.19         7.77
Table 6.4: Performance improvement (%) from PGO on vpr with multiple input sets
                        1:place   2:route   average
feedback:(1)            4.72      -3.00     0.86
feedback:(2)            -6.76     8.59      0.92
feedback:(self)         4.72      8.59      6.66
feedback:(classified)   8.96      8.88      8.92
Entropy                 6.83      4.82
are classified into two different profiles. Since the similarity (S) of the profiles is very
low (S < 0.4), the profile manager combines them into one profile when used for
PGO. Combining disjoint profiles is generally beneficial since it increases the
code coverage.
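The grouping decision described for the three types can be summarized in a small decision function. Only the S < 0.4 disjointness threshold comes from the text; the 0.9 merge threshold is illustrative.

```c
#include <assert.h>

/* Sketch of the profile manager's grouping rule across the three
 * profile types: near-identical profiles (type I) are merged into one
 * baseline; near-disjoint profiles (type II, S < 0.4 per the text) are
 * still combined for PGO because their union raises code coverage;
 * everything in between (type III) is kept in separate per-group
 * profiles. The 0.9 merge threshold is an assumption. */
typedef enum { MERGE_SIMILAR, COMBINE_DISJOINT, KEEP_SEPARATE } group_action;

group_action classify_pair(double s) {
    if (s >= 0.9) return MERGE_SIMILAR;    /* type I: same behavior */
    if (s < 0.4)  return COMBINE_DISJOINT; /* type II: union raises coverage */
    return KEEP_SEPARATE;                  /* type III: per-group profiles */
}
```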
Table 6.4 shows the performance improvement from PGO on vpr with different input
sets. The feedback(1) binary suffers a 3.0% slowdown compared to the baseline binary
when input set 2 is used, and the feedback(2) binary loses 6.76% performance when input set
1 is used. This shows that PGO can degrade performance if the profile used for feedback
is not generated from a representative input set. The feedback(classified) binary uses a
profile merged from the two sampled profiles. Interestingly, it performs 2.26% better
than the feedback(self) binary. This may be because the merged profile provides increased
code coverage that yields slightly better analysis results for compiler optimizations,
or because some heuristics used in compiler optimizations are sensitive
to the path frequency distribution.
In type III, the profiles can be classified into several groups of similar profiles.
Table 6.5: Performance improvement (%) from PGO on gzip with multiple input sets
                        1:source   2:log   3:graphic   4:random   5:program   average
feedback:(1)            4.52       4.17    7.87        5.75       4.04        5.27
feedback:(2)            5.05       4.77    9.73        11.24      5.01        7.16
feedback:(3)            -6.97      -8.02   6.29        10.35      -8.89       -1.45
feedback:(4)            -0.28      -4.86   7.24        14.49      1.65        3.65
feedback:(5)            5.62       4.68    13.40       12.56      5.06        8.26
feedback:(self)         4.52       4.77    6.29        14.49      5.06        7.03
feedback:(classified)   6.38       4.95    6.29        12.48      5.06        7.03
feedback:(1,2,4)        6.38       4.95    12.32       12.48      6.38        8.50
Entropy                 5.71       5.87    6.49        5.92       4.98
gzip, perlbmk, gcc and bzip2 fall into this category.
Table 6.5 shows the performance improvement from PGO on gzip with different input
sets. Three profiles (1:source, 2:log, 4:random) are merged into one profile, while the
profiles from 3:graphic and 5:program are classified into separate profiles.
The profiles from the three text inputs are thus classified into the same group. Table 6.5
indicates that entropy-based classification works quite well: feedback(1,2,4) performs
1.47% better than feedback(self).
From the above results, we can see that entropy is a good metric for classifying sampled
profiles for PGO. The binaries generated from classified profiles always perform similarly
to or better than the feedback(self) binaries. It should also be noted that, in our
experiments, the feedback(classified) binaries never caused any slowdown relative
to the binaries generated without PGO for any input set. In
contrast, feedback(3) in gzip, for example, is slower than the binary generated
without PGO for three inputs (1:source, 2:log, 5:program).
6.6 Summary
We show that highly accurate profiles can be obtained by merging a number of profiles
collected over repeated executions at a relatively low sampling frequency. We
also show that a simple characterization of the profile with information entropy can
effectively guide the sampling rate of a profiler. Using the SPECjbb2000 benchmark,
our adaptive profiler obtains a very accurate profile (94.5% similar to the baseline
profile) with only 8.7% of the samples taken at a sampling rate of one sample per 1M instructions.
Furthermore, we show that information entropy can be used to classify profiles
obtained from different input sets. The profile entropy-based approach provides
a good foundation for continuous profiling management.
Chapter 7
Optimizing Coherent Misses via
Binary Re-Adaptation
This chapter describes runtime binary re-adaptation techniques that improve the
performance of some OpenMP parallel programs by reducing the aggressiveness of data
prefetching and by using exclusive hints for prefetch instructions.
7.1 Motivating Example
First, let us use an OpenMP version of the DAXPY kernel, shown in Figure 7.1,
as an example to illustrate how memory access behavior changes with
different input data sets and different numbers of threads. The source code is compiled
with the Intel icc compiler ver. 9.1 with the -O2 -openmp options. ARRAY_SZ is varied
for (j = 0; j < 1000000; j++)
#pragma omp parallel for
  for (i = 0; i < ARRAY_SZ; i++) {
    y[i] = y[i] + a * x[i];
  }
Figure 7.1: OpenMP DAXPY C source code
...
lfetch.nt1 [r10]  // prefetch y[0]+648
lfetch.nt1 [r11]  // prefetch y[0]+520
lfetch.nt1 [r14]  // prefetch y[0]+392
lfetch.nt1 [r15]  // prefetch y[0]+264
lfetch.nt1 [r16]  // prefetch y[0]+136
lfetch.nt1 [r17]  // prefetch y[0]+8
...
.b1_22:
{ .mii
(p16) ldfd f32=[r2],8        // load x[i], i++
      nop.i 0
      nop.i 0 }
{ .mmb
(p16) ldfd f38=[r33]         // load y[i]
(p16) lfetch.nt1 [r43]       // prefetch x[i]+1200, y[i]+1200
      nop.b 0 ;; }
{ .mfi
(p23) stfd [r40]=f46         // store y[i]
(p21) fma.d f44=f6,f37,f43   // y[i] + a*x[i]
(p16) add r41=16,r43 }       // increment lfetch address
{ .mib
(p16) add r32=8,r33          // increment y[i] address
      nop.i 0
      br.ctop.sptk .b1_22 ;; } // inner for loop (SWP)
Figure 7.2: icc compiler generated Itanium assembly code for DAXPY kernel
to create different data working set sizes from 128K to 2M bytes. The number of
working threads is varied from 1 to 4. Each thread is bound to a different processor.
Figure 7.2 shows the Itanium assembly code generated by the Intel icc compiler. Before
entering the software-pipelined loop (.b1_22), the generated code issues 6 prefetches for
the initial cache line of y[0] and the subsequent five cache lines. Then, in the loop, the
code issues one prefetch instruction per iteration for both arrays x[] and y[], using the
rotating registers to alternately change the prefetch target addresses. This prefetching is
very aggressive: it targets 9 cache lines ahead of the current array references. Toward
the end of loop execution, this prefetch instruction starts to fetch unnecessary cache
lines that will be modified by neighboring processors, and would therefore trigger
unnecessary coherent misses. For example, with a 128KB data working
set split between the two arrays x[] and y[], each array holds 64KB of data. When running with 4
threads, each thread's partition of an array is 16KB. Since the L2 cache line size on Itanium 2 is 128
[Figure: scalability of the DAXPY kernel on the 4-way Itanium 2 machine; normalized execution time vs. data working set size (128K, 512K, 2M) for 1, 2, and 4 threads; panel (a) prefetch vs. noprefetch, panel (b) prefetch vs. prefetch.excl.]
Figure 7.3: Normalized execution time of OpenMP DAXPY kernel on 4-way Itanium 2 SMP server
bytes, 9 unnecessary cache lines amount to about 1KB. Therefore, a significant portion of
the data is unnecessarily shared between processors due to aggressive prefetching.
Considering that scientific applications typically have many more arrays partitioned
among the working threads, larger data working sets would
show similar unnecessary sharing. This example shows that even when an expert
programmer carefully avoids data sharing while writing explicitly multithreaded
applications, unnecessary coherent misses can still occur due to aggressive compiler
prefetch optimizations.
In Figure 7.3(a), we compare two versions of the binaries. The baseline binary
generated by the Intel icc compiler contains lfetch instructions; in the noprefetch version,
the lfetch instructions are changed to NOP instructions. Figure 7.3(a) shows the
normalized execution time of the two versions. The x axis shows different working
set sizes, counting both arrays x[] and y[].
The Itanium 2 processor used in our experiments has a 256KB L2 cache and a 1.5MB
L3 cache. Across the three working set sizes (128KB, 512KB, and
2MB), the two versions exhibit very different behaviors. With 2MB, the prefetch version
always performs better than the noprefetch version when running with 1, 2, and 4
threads. As expected, prefetching works effectively in the presence of frequent cache
misses.
With the smallest 128KB working set, the data fit within the 256KB L2 cache.
When only one thread is present, the two versions show little performance
difference since no cache misses occur after initialization. Since the
noprefetch version avoids the unnecessary data sharing between processors caused
by the aggressive data prefetching shown in Figure 7.2, it runs 35% faster than the
baseline prefetch version when running with 2 threads, and 52% faster with 4 threads.
The prefetch version running with 4 threads suffers significantly from L2 OZQ FULL
stalls. On Itanium 2, prefetch requests are placed in the L2 memory queue (OZQ)
with other load operations and do not retire until the requested cache lines are filled.
Even though loads can retire quickly when they hit in L2, prefetch requests stay longer
in the queue due to the long latency of coherent misses and eventually fill the L2
memory queue. This causes the slowdown of the prefetch version running with
multiple threads; the noprefetch version significantly reduces the L2 OZQ FULL
stall cycles.
Such unnecessary coherent misses could be avoided by more careful optimizations. For
example, the compiler could use conditional prefetches to nullify the prefetches if the
addresses are outside the intended range. However, conditional prefetch generation is
more expensive, since it requires one more register, one more compare instruction, two
more add operations and at least one additional bundle. Unless the static compiler
has a very accurate profile indicating precisely which prefetches are likely to cause
this problem, the compiler will not generate such conditional prefetches.
The compiler can also generate multi-version code to select the noprefetch version
when the remaining iteration count is small. This is to avoid the performance degra-
dation from unnecessary coherence misses caused by aggressive prefetching. When
the remaining iteration count is large, the benefit of data prefetching could outweigh
the downside of prefetching-induced coherence misses.
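Such a multi-version loop can be sketched in C. The prefetch distance (9 cache lines of 16 doubles each, following the DAXPY example above) and the use of the GCC-style __builtin_prefetch as a stand-in for the Itanium lfetch instruction are assumptions for illustration; this is not the actual compiler-generated code.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __GNUC__
#define __builtin_prefetch(addr, rw, loc) ((void)0) /* non-GCC fallback */
#endif

/* 9 cache lines x 128 bytes / 8-byte doubles = 144 elements ahead,
 * matching the prefetch distance in the DAXPY example. */
#define PREFETCH_DISTANCE 144

/* Multi-version DAXPY sketch: use the prefetch version only while the
 * prefetch target stays inside this thread's partition, and fall back
 * to a no-prefetch version for the remaining (or short) iteration
 * range, so prefetches never walk past the partition boundary into
 * lines owned by a neighboring thread. */
void daxpy_multiversion(double *y, const double *x, double a, size_t n) {
    size_t i = 0;
    if (n > PREFETCH_DISTANCE) {
        /* prefetch version: target index i + PREFETCH_DISTANCE < n */
        for (; i < n - PREFETCH_DISTANCE; i++) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 1);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 1);
            y[i] = y[i] + a * x[i];
        }
    }
    /* noprefetch version for the tail (or a small iteration count) */
    for (; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```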
If the data working set of each array is large enough to mitigate the impact
of unnecessary coherent misses, the overall slowdown may be
relatively small. Such coherent misses happen whether or not the data working set
fits in the L2 or L3 caches. However, since the amount of unnecessarily shared data
is fixed (about 1KB in our example), the relative cost of the coherent misses depends on
the data working set size of the loop.
Figure 7.3(b) shows the impact of using lfetch.excl instruction on OpenMP DAXPY
kernel shown in Figure 7.1. The lfetch.excl instruction prefetches a cache line in the
Exclusive state instead of the Shared state. When a prefetch operation with .excl
hint misses the cache, it requests the cache line in the Exclusive state. If a store
operation soon follows the prefetch operation, it will not trigger an invalidation. For
each working set size from 128KB to 2MB, the performance is normalized to the 1-thread
prefetch version. With a 128KB working set, data accesses hit in the L2 cache, and the
lfetch.excl optimization makes no difference with a single thread. However, the lfetch.excl version
runs 18% faster than the baseline prefetch version when running with 2 threads, and
14% faster with 4 threads. With a 512KB working set, the data no longer fit in one single
L2 cache, but each thread's 128KB share fits in its processor's L2 cache when the program
is running with 4 threads. When we increase the number of threads from 2 to 4, the
overhead of coherent misses starts to outweigh the benefit of prefetching; therefore,
the lfetch.excl version runs 7% faster than the baseline prefetch version when running
with 4 threads.
Since the use of lfetch.excl can increase the number of L2 writebacks, it can
lengthen the latency of store instructions. This is why the lfetch.excl version is
sometimes slower than the baseline prefetch version. With a 2MB working set,
the data sharing effect is relatively small because aggressive prefetching results in
sharing of only the last 10 cache lines. In this case, due to the increased L2 cache
writebacks, using lfetch.excl slows the program down. As this
example shows, correctly applying lfetch.excl instructions can be very challenging for
a static compiler, which is why .excl prefetch hints are usually used only in numeric
libraries written by expert programmers. A dynamic optimizer, however, has more
accurate information to guide the use of such prefetch hints.
This example clearly shows that a single binary generated by one of the most advanced
optimizing compilers cannot always provide good performance under different execution
conditions, and the performance opportunities left unexploited can be
significant. This is rather different from the single-processor scenario, where aggressive
data prefetching is usually considered beneficial with little downside, which
is why most compilers perform aggressive data prefetching by default. As our
example shows, unwanted prefetches can cause coherent misses and thus substantially
slow down execution. It is difficult for programmers to analyze and evaluate
the performance impact caused by changes in data working sets and in the number
of threads/processors. A runtime binary optimizer such as COBRA can identify
performance bottlenecks and hot spots through continuous performance monitoring,
and effectively tune performance via runtime code optimization.
7.2 Optimizing Coherent Misses
As shown in the previous section, it is very challenging to statically optimize scalable
OpenMP parallel applications because the data working sets and the
contention for shared data can change at runtime. Collard et al. [17] proposed system-wide
monitors, called SWIFT, to find pairs of instructions involved in false sharing.
The compiler then uses SWIFT-based profiles to tune the .bias hint for the identified
load instructions in order to reduce unnecessary coherent traffic on the system bus.
Some recent processors provide special instructions to optimize cache coherent mem-
ory accesses. However, due to the lack of runtime profiling support to pinpoint the
instructions that cause such unnecessary memory coherent traffic, these instructions
are rarely used in static compiler optimizations. Itanium 2 supports .bias hint for
integer load instructions. When a load operation with .bias hint misses the cache, it
requests the cache line in the exclusive state, i.e. it will invalidate all of the existing
copies of the cache line, instead of the regular shared state. If a store operation soon
follows the load operation, and it writes to the same cache line, it will not trigger a
coherent bus transaction to invalidate the cache lines in other processors. The .bias
hint is not supported for control- and data-speculative loads (ld.s and ld.a), the load
check (ld.c), the load with acquire semantics (ld.acq), and floating-point loads. Therefore,
the use of the .bias hint is very limited. The Itanium 2 processor also provides the .excl hint
for the lfetch prefetch instruction, which prefetches a cache line in
the Exclusive state instead of the Shared state. Depending on the sequence of load
and store operations, the use of the .excl hint might lead to more system bus transactions
because shared cache lines are invalidated. Therefore, this type of optimization relies
heavily on accurate runtime profiles.
On Itanium 2, several hardware performance counters relate to coherent
bus events. For example, BUS_RD_HIT, BUS_RD_HITM, and BUS_RD_INVAL_ALL_HITM
record the snooping responses from other processors to bus transactions initiated
by the monitoring processor [1]. The counter for the BUS_MEMORY event
monitors the total number of bus transactions. If we divide the
sum of coherent bus events by the total number of bus transactions, we could estimate
the ratio of coherent memory accesses to all bus transactions. We could use this ratio
to decide whether to apply the optimization for coherent cache misses. Other processors,
such as the IBM Power3, also support monitoring of cache coherence events.
The PM_SNOOP_L2_E_OR_S_TO_I and PM_SNOOP_M_TO_I events can be used
to measure the total number of L2 cache invalidations.
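The ratio computation can be sketched directly from the counter values. The counter names follow the Itanium 2 event list cited above; the 50% trigger threshold is an assumption for illustration only.

```c
#include <assert.h>
#include <math.h>

/* Estimate the fraction of bus traffic due to cache coherence:
 * divide the sum of the coherent snoop events (BUS_RD_HIT,
 * BUS_RD_HITM, BUS_RD_INVAL_ALL_HITM) by the total number of bus
 * transactions (BUS_MEMORY), as described in the text. */
double coherent_ratio(unsigned long bus_rd_hit,
                      unsigned long bus_rd_hitm,
                      unsigned long bus_rd_inval_all_hitm,
                      unsigned long bus_memory) {
    if (bus_memory == 0) return 0.0;
    unsigned long coherent =
        bus_rd_hit + bus_rd_hitm + bus_rd_inval_all_hitm;
    return (double)coherent / (double)bus_memory;
}

/* Trigger the coherent-miss optimization when coherence traffic
 * dominates; the 0.5 threshold is illustrative, not from the text. */
int should_optimize_coherence(double ratio) {
    return ratio > 0.5;
}
```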
On Itanium 2 systems, once we detect intensive coherent misses, we could use Data
Event Address Registers (DEARs) to pinpoint which instructions caused most of
coherent cache misses. The DEAR can be used to monitor any of L1 data cache
load misses, FP load misses, L1 data TLB misses, or ALAT (Advanced Load Address
Table) misses. Each DEAR sample contains an instruction address that caused the
cache miss, its data address, and the associated latency. The DEAR can be programmed
to filter out unwanted events. For example, since the L3 cache hit latency on Itanium 2 is 12
cycles, we can filter out on-chip L2 cache misses that hit in the L3 cache by programming
the DEAR to track events with latency greater than 12 cycles. This filtering scheme
could avoid selecting those memory loads that cause L2 cache misses but are satisfied
by L3 cache hits. Still, we need another filter to separate loads with long latency
caused by coherent memory accesses from those that are served by the memory. We
found that on the Itanium 2 server the latency of a coherent miss is usually much
greater than that of a memory load: memory access latencies are usually
between 120-150 cycles, while coherent miss latencies can exceed 180-200 cycles.
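The two-level filtering of DEAR samples can be sketched as follows. The 12-cycle and 180-cycle cutoffs come from the latencies quoted above; the sample layout is illustrative and not the actual DEAR register format.

```c
#include <assert.h>

/* Two-level latency filter over DEAR samples, following the text:
 * (1) drop events with latency <= 12 cycles (satisfied by the L3
 *     cache on Itanium 2), and
 * (2) separate coherent misses from plain memory loads by latency,
 *     since ordinary memory accesses take roughly 120-150 cycles
 *     while coherent misses exceed 180-200 cycles. The 180-cycle
 *     cutoff is an assumption drawn from those quoted ranges. */
typedef struct {
    unsigned long pc;        /* instruction address of the missing load */
    unsigned long data_addr; /* data address that missed */
    unsigned latency;        /* observed miss latency in cycles */
} dear_sample;

enum miss_kind { FILTERED_OUT, MEMORY_MISS, COHERENT_MISS };

enum miss_kind classify_dear(const dear_sample *s) {
    if (s->latency <= 12) return FILTERED_OUT;   /* L3 hit */
    if (s->latency >= 180) return COHERENT_MISS; /* cache-to-cache */
    return MEMORY_MISS;                          /* served by memory */
}
```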
We may hide the long memory latency either by inserting data prefetches or by
scheduling the cache-missing load far away from its actual use. Prefetch insertion is
easier to apply since the prefetch instruction is non-binding and can be scheduled
freely. Furthermore, prefetch instructions are merely hints; they do not affect the
correctness of the code. However, we need to find the prefetch instructions that are
associated with the load instructions. Our heuristic is based on the fact that prefetch
instructions are usually generated inside a loop or at the entry point of a loop. Therefore,
we try to discover the loops that contain the loads found through the two-level
filtering scheme mentioned above. On Itanium 2, using the BTB to capture the last 4 taken
branches and their target addresses, we can easily discover the loop boundaries and
determine the PC addresses of lfetch instructions within the identified boundaries.
Finally, we can apply optimizations to the identified prefetch instructions.
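The loop-discovery step can be sketched as below. The data structures are illustrative; the actual implementation would decode Itanium bundles to locate lfetch instructions rather than consult a precomputed PC list.

```c
#include <assert.h>
#include <stddef.h>

/* The BTB reports recent taken branches with their targets. A
 * backward taken branch (target before the branch itself) delimits a
 * loop [target_pc, branch_pc]; lfetch instructions whose PCs fall
 * inside those bounds are the optimization candidates. The structs
 * here are hypothetical, not the COBRA implementation. */
typedef struct { unsigned long branch_pc, target_pc; } btb_entry;

/* Return 1 and fill the loop bounds if this is a backward branch. */
int loop_bounds_from_btb(const btb_entry *e,
                         unsigned long *lo, unsigned long *hi) {
    if (e->target_pc >= e->branch_pc) return 0; /* not a loop branch */
    *lo = e->target_pc;
    *hi = e->branch_pc;
    return 1;
}

/* Collect the PCs of known lfetch instructions inside [lo, hi]. */
size_t lfetches_in_loop(const unsigned long *lfetch_pcs, size_t n,
                        unsigned long lo, unsigned long hi,
                        unsigned long *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (lfetch_pcs[i] >= lo && lfetch_pcs[i] <= hi)
            out[k++] = lfetch_pcs[i];
    return k;
}
```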
7.3 Experimental Setup
Our experimental data are collected on a 4-processor Itanium 2 server and an SGI
Altix system. We used 8 processors of the SGI Altix system for our experiments. On
the 4-processor Itanium 2 SMP server, four processors are connected via a front-side
bus (6.4GB/sec) that supports a MESI (also called Illinois protocol) cache coherence
protocol. On the SGI Altix, two processors are connected via a front-side bus to
form a computing node. All of the 2-processor nodes are connected by a fat-tree
interconnection network. The Intel icc/ifort compiler ver. 9.1 is used to compile the NAS
Parallel Benchmarks with the -O3 and -openmp options.
The NAS Parallel Benchmark (NPB) [30] consists of five kernels and three simulated
CFD applications (BT, SP, LU) derived from several important aerophysics appli-
cations. The five kernels (FT, MG, CG, EP, IS) mimic the computational core of
five numeric methods used in CFD applications. The simulated CFD applications
reproduce much of data movement and computation found in full CFD codes.
The description of eight NAS parallel benchmarks is as follows:
• BT is a simulated CFD application that uses an implicit algorithm to solve
3-D compressible Navier-Stokes equations. The finite-difference solution to the
problem is based on an Alternating Direction Implicit (ADI) approximate fac-
torization that decouples the x,y and z dimensions.
• SP is a simulated CFD application that has a similar structure to BT. The
finite-difference solution is based on a Beam-Warming approximate factorization
that decouples the x,y, and z dimensions.
• LU is a simulated CFD application that uses a symmetric successive over-
relaxation (SSOR) method to solve a seven-block-diagonal system resulting
from finite-difference discretization of the 3-D Navier-Stokes equations by splitting it
into block lower and upper triangular systems.
• FT contains the computational kernel of a 3-D Fast Fourier Transform (FFT)-
based spectral method. FT performs three one-dimensional (1-D) FFTs, one
for each dimension.
• MG uses a V-cycle multi-grid method to compute the solution of the 3-D scalar
Poisson equation. The algorithm works continuously on a set of grids that are
made between coarse and fine grids. It tests both short and long distance data
movement.
• CG uses a Conjugate Gradient method to compute an approximation to the
smallest eigenvalue of a large, sparse, unstructured matrix.
• EP is an embarrassingly parallel benchmark. It generates pairs of Gaussian
random deviates according to a specific scheme. The goal is to establish the
reference point for the peak performance of a given platform.
• IS is the integer sort kernel.
The NPB benchmarks are implemented in High Performance Fortran (HPF), OpenMP,
and the Message Passing Interface (MPI) to accommodate various parallel machines. The
OpenMP version of NPB is used in our experiments. OpenMP uses a set
of compiler directives that guide the compiler in exploiting loop-level parallelism.
Cache coherent memory accesses can limit the scalability of OpenMP programs since
computations inside a loop are distributed based on the loop index range regardless
of data locations. The NPB benchmarks provide five data sets (S, W, A, B, C), from the
smallest (S) to the largest (C). Since 60-70% of memory accesses in the
smallest data set (S) are coherent memory accesses, we use the smallest
data set (S) in our experiments to evaluate the effectiveness of optimizations on
coherent memory accesses. As the data set size increases,
the proportion of coherent memory accesses decreases.
Table 7.1 shows the number of loops and prefetches generated by the icc compiler
in the OpenMP NPB binaries. On Itanium, br.ctop and br.wtop are branches used
Table 7.1: The number of loops and prefetches in compiler-generated OpenMP NPB binaries
benchmarks   lfetch   br.ctop   br.cloop   br.wtop
BT           140      34        32         0
SP           276      67        22         0
LU           184      61        19         0
FT           258      45        9          8
MG           419      66        34         4
CG           433      69        29         2
EP           17       1         4          1
IS           76       19        13         2
in software-pipelined (SWP) loops. The br.cloop branch is used in counted loops. The
compiler generates several hundred prefetches in most of the benchmarks except EP
and IS. It is infeasible to tune every prefetch instruction manually due to the large
number of candidate prefetches.
7.4 Experimental Results
To understand the impact of the two optimizations (noprefetch, prefetch.excl) on different
system architectures, we examined the execution time, L3 misses, and the number of
system memory bus transactions. The overall execution time of the parallel programs is
based on wall-clock time. The L3 misses and the number of memory bus transactions
are highly correlated because L3 misses must be serviced by bus transactions.
Since the IS and EP benchmarks do not show any long-latency coherent misses on either
machine, we exclude them from our final results.
Three different prefetch strategies are studied in our experiments.
• prefetch: This is our baseline for evaluating the effect of our prefetch optimiza-
tions for coherent cache misses. The prefetch version is chosen as the baseline
because recent optimizing compilers aggressively generate prefetches even at the
commonly used -O2 optimization level. Our baseline binaries are compiled at
the highest optimization level (-O3) of the Intel compiler.
• noprefetch: This optimization selectively reduces the aggressiveness of prefetch-
ing to remove unnecessary coherent cache misses. Our runtime profiler guides
the optimizer to select prefetches in a few loops and turn them into NOP in-
structions.
• prefetch.excl: This optimization also selectively chooses prefetch instructions
that cause long latency coherent misses and applies .excl hint on the selected
prefetches.
The noprefetch strategy is very effective when the data working set fits in the processor
caches and many coherent misses are caused by aggressive prefetching. However, it
needs precise runtime profiles to avoid removing effective prefetches, which would result
in performance loss.
7.4.1 Impact on Execution Time
Figure 7.4 shows the performance improvement from the two optimizations (noprefetch,
prefetch.excl) on the OpenMP NPB benchmarks. The speedup achieved with the noprefetch
optimization on the 4-way SMP server was up to 15% with an average of 4.7%, and with
the lfetch.excl optimization it was up to 8% with an average of 2.7%, as shown in Figure 7.4(a).
Since the penalty of coherent misses is much higher on cc-NUMA machines
than on SMP machines, we obtained a higher performance improvement from
the two optimizations on the SGI Altix. The speedup achieved with the noprefetch
optimization on the SGI Altix cc-NUMA system was up to 68% with an average of 17.5%, and
with the lfetch.excl optimization it was up to 18% with an average of 8.5%, as shown in
Figure 7.4(b).
Intuitively, replacing prefetch instructions with NOP instructions could slow down
program execution because load latency increases. However, it should be noted
that our noprefetch optimization does not blindly replace prefetch instructions with
[Figure: speedup relative to the prefetch baseline on bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, and their average; panel (a) 4 threads on the 4-way SMP node (noprefetch up to 15%, average 4.7%; prefetch.excl up to 8%, average 2.7%); panel (b) 8 threads on the SGI Altix cc-NUMA machine (noprefetch up to 68%, average 17.5%; prefetch.excl up to 18%, average 8.5%).]
Figure 7.4: Speedup of coherent memory access optimization on OpenMP NPB benchmarks. The performance of the prefetch version (optimized by the Intel compiler) is normalized to 1 as the baseline.
93
NOP instructions. It uses the filtering mechanism detailed in Section 7.2 to filter
out instructions that cause frequent L3 misses when the L2 miss ratio is low. Thus
a large portion of the memory transactions optimized by noprefetch are those
transactions related to coherent memory accesses. This filtering heuristic
allows us to minimize the negative impact of the optimization on performance.
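A minimal sketch of this filtering heuristic, assuming per-instruction miss counters such as those the Itanium 2 PMU can attribute through sampling (the function name and thresholds are illustrative assumptions):

```python
# Hypothetical sketch of the filtering heuristic: an instruction whose L3
# misses are frequent even though its L2 miss ratio is low suggests, on
# Itanium 2, that the L3 misses stem from coherence traffic rather than
# cache capacity. Thresholds here are made up for illustration.

def likely_coherent(l2_miss_ratio, l3_misses_per_kinst,
                    l2_threshold=0.02, l3_threshold=5.0):
    """Return True if the miss pattern points at coherent memory accesses."""
    return l2_miss_ratio < l2_threshold and l3_misses_per_kinst > l3_threshold
```

Only instructions passing this test become candidates for noprefetch or prefetch.excl, which is how the heuristic avoids disturbing prefetches that are covering ordinary capacity misses.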
The prefetch.excl optimization is less effective than the noprefetch optimization. Even
though it improves the performance of instruction sequences that
contain load operations followed by store operations to the same cache line, it could
still fetch unnecessary cache lines from other processors.
7.4.2 Impact on L3 Cache Misses
Figure 7.5 shows the impact of the two optimizations (noprefetch, prefetch.excl) on
L3 misses. On Itanium 2, coherent cache misses can lead to L3 misses. When
coherent memory accesses make up a significant portion of L3 cache misses, a
substantial reduction in L3 misses indicates that we have reduced unnecessary coherent misses.
On the SP and the CG benchmarks, L3 misses have been substantially reduced by
the noprefetch version. The reduction is as high as 29.9% for SP and 39.5% for CG,
on the 4-way SMP server, as shown in Figure 7.5(a). On the SGI Altix system, we
have also observed near 20% reduction of L3 misses from the noprefetch version for
BT, SP and CG, as shown in Figure 7.5(b).
7.4.3 Impact on Memory Bus Transactions
Figure 7.6 shows the impact of the two optimizations (noprefetch, prefetch.excl) on
the number of memory transactions on the system bus. Since L3 misses are directly
translated into memory transactions on the system bus, the number of memory trans-
actions is highly correlated with L3 misses. Hence, Figure 7.6 is closely correlated to
[Figure 7.5: Number of L3 misses on OpenMP NPB benchmarks, normalized to the baseline. (a) 4 threads running on a 4-way SMP node; (b) 8 threads running on the SGI Altix cc-NUMA machine.]
[Figure 7.6: Number of memory transactions on the system bus on OpenMP NPB benchmarks, normalized to the baseline. (a) 4 threads running on a 4-way SMP node; (b) 8 threads running on the SGI Altix cc-NUMA machine.]
Figure 7.5.
7.5 Summary
We have shown that, with the OpenMP NAS parallel benchmarks, COBRA can adap-
tively select appropriate optimization techniques based on changing runtime program
behaviors to achieve significant speedups. Coherent memory accesses caused by data
sharing often limit the scalability of multithreaded applications. Using COBRA, the
performance of some OpenMP parallel programs can be improved by dynamically
reducing the aggressiveness of data prefetching and using exclusive hints for prefetch
instructions.
Chapter 8
Conclusions and Future Work
Dynamic compilation and binary optimization have become an increasingly important part
of program runtime environments, as static binaries generated by traditional compil-
ers struggle to deliver good performance across processor revisions, system architec-
tures, and the diverse program behaviors induced by different input sets.
This thesis proposes a dynamic binary optimization framework and addresses several
key problems: phase-aware program monitoring, hardware support for phase
detection, profile characterization and classification for continuous profiling, and dy-
namic binary re-adaptation for multithreaded programs.
8.1 Conclusions
Runtime dynamic optimizers have been shown to improve the performance and power efficiency
of single-threaded applications. Multithreaded applications running on SMP, CMP,
and cc-NUMA systems pose new challenges and opportunities for runtime dynamic bi-
nary optimizers. This thesis introduces COBRA (Continuous Binary Re-Adaptation),
a runtime binary optimization framework. A prototype has been implemented on Ita-
nium 2 based SMP and cc-NUMA systems.
We investigated the use of control flow information, such as loops and function calls,
to identify repetitive program behavior as program phases. In the course of this study
on using control flow as a phase signature, we implemented efficient phase-aware
runtime program monitoring schemes in our COBRA framework. We describe sam-
pled Basic Block Vector (BBV)-based and Hot Working Set (HWSET)-based program
phase detection schemes. The sampled HWSET-based phase detection scheme
shows higher phase coverage and longer stable phases than the sampled BBV-
based scheme. We have evaluated the effectiveness of our
DCR-based phase tracking hardware on a set of SPEC benchmark programs with
known phase behaviors, and have shown that it exhibits the desirable
characteristics of a phase detector for dynamic optimization systems. The hardware
is simple and cost-effective, and the phase sequence it detects can be
accurately predicted using simple prediction techniques.
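As one concrete illustration, BBV-based phase detection typically compares normalized basic-block execution frequency vectors between sampling intervals. A sketch under that common formulation follows; the function names and the distance threshold are illustrative, not the exact COBRA scheme:

```python
# Sketch of BBV-style phase comparison: each interval's profile is a map
# from basic-block ID to execution count; two intervals belong to the same
# phase if their normalized vectors are close in Manhattan distance.
# The 0.5 threshold is an illustrative choice, not COBRA's tuned value.

def normalize(bbv):
    """Convert raw counts to execution-frequency fractions."""
    total = float(sum(bbv.values()))
    return {k: v / total for k, v in bbv.items()}

def manhattan_distance(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

def same_phase(bbv_a, bbv_b, threshold=0.5):
    return manhattan_distance(normalize(bbv_a), normalize(bbv_b)) < threshold

# Illustrative interval profiles.
loop_a  = {"bb1": 90, "bb2": 10}
loop_a2 = {"bb1": 85, "bb2": 15}   # same loop, slightly different mix
loop_b  = {"bb3": 50, "bb4": 50}   # a different code region entirely
```

Two intervals dominated by the same loop yield a small distance and are classified as one stable phase, while disjoint working sets produce the maximal distance of 2.0.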
We show that highly accurate profiles can be obtained by merging a number of pro-
files collected over repeated executions at a relatively low sampling frequency. We also
show that a simple characterization of the profile with information entropy can effec-
tively guide the sampling rate of a profiler. Using the SPECjbb2000 benchmark, our
adaptive profiler obtains a very accurate profile (94.5% similar to the baseline profile)
with only 8.7% of the samples at a sampling rate of one sample per 1M instructions.
Furthermore, we show that information entropy can be used to classify
profiles obtained from different input sets. The profile entropy-based approach pro-
vides a good foundation for continuous profile management and effective PGO in a
dynamic compilation environment.
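The entropy-guided sampling idea can be sketched as follows. The entropy formula is the standard Shannon entropy over the profile's sample distribution; the `pick_sampling_interval` policy and its thresholds are illustrative assumptions, not the thesis's exact mechanism:

```python
import math

def profile_entropy(counts):
    """Shannon entropy (bits) of a profile's sample distribution.
    counts: map from program counter (or block ID) to sample count."""
    total = float(sum(counts.values()))
    probs = (c / total for c in counts.values() if c > 0)
    return -sum(p * math.log2(p) for p in probs)

def pick_sampling_interval(entropy, low=1.5, base=100_000):
    """Illustrative policy: a low-entropy profile is concentrated in a few
    hot spots, so a sparser sampling interval already captures it well."""
    return base * 10 if entropy < low else base

# A flat profile (16 equally hot sites) has maximal entropy, log2(16) = 4.
flat = {f"pc{i}": 1 for i in range(16)}
# A skewed profile with one dominant hot spot has near-zero entropy.
hot = {"pc0": 97, "pc1": 1, "pc2": 1, "pc3": 1}
```

The intuition matches the SPECjbb2000 result above: when most samples fall on a few hot regions, far fewer samples suffice to reproduce an accurate profile.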
We have shown that, with the OpenMP NAS parallel benchmarks, COBRA can adap-
tively select appropriate optimization techniques based on changing runtime program
behaviors to achieve significant speedups. Coherent memory accesses caused by data
sharing often limit the scalability of multithreaded applications. Using COBRA, the
performance of some OpenMP parallel programs can be improved by dynamically
reducing the aggressiveness of data prefetching and using exclusive hints for prefetch
instructions.
8.2 Future Work
We have several directions to extend our work in the future.
First, we can extend the implementation of our runtime binary optimization frame-
work to support multiple platforms, including Intel Pentium and IBM PowerPC. The cur-
rent implementation only supports Intel Itanium processors. Itanium processors have
gone through several generations: Itanium 1, Itanium 2, and Montecito. Each
processor implements its hardware performance monitors slightly differently, sup-
porting different performance events and different numbers of performance counters
and event registers. The current COBRA implementation therefore maintains a separate
version for each processor revision. Similarly, the ADORE binary optimizer has independent
implementations for Intel Itanium [37] and Sun SPARC processors [38]. Generic support
for the performance counters of different processors and revisions is left as a future
enhancement of the framework.
Second, we plan to further investigate the correlation between sampled code signatures and
performance behaviors in large-scale multithreaded programs. Understanding
the performance bottlenecks in such large-scale applications is an increasingly diffi-
cult problem. The sampled code signature, used as a phase ID, serves as a tag
to classify performance behavior. Tagging performance characteristics in this way
opens the possibility of building a performance database that enables
statistical analysis of performance bottlenecks over time. The sampled code
signature can be collected from kernel and library code in addition to the user's own
code.
Third, we plan to further investigate hardware support for phase detection by augmenting
existing performance counters. Many recent processors already provide hardware
counters that record recently taken branches. Our proposed phase detection needs addi-
tional branch-type checking logic, a hardware stack, and a phase signature table. We believe
that adding phase tracking hardware to current hardware performance counters
would greatly reduce the overhead of software phase detection in dynamic optimizers.
Fourth, current static and dynamic compilers and dynamic binary optimizers each use their
own proprietary profile formats. If compilation and dynamic binary optimization are to
coexist and maximize the benefit of continuous optimization in future program runtime
environments, support for a generic profile format is an important piece of
functionality.
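As a purely hypothetical illustration of what such a generic, tool-neutral profile format might look like (every field name below is an assumption, not a proposed standard):

```python
import json

# Hypothetical self-describing profile record that a static compiler and a
# dynamic binary optimizer could both consume. Field names are illustrative.
profile = {
    "format_version": 1,
    "target": {"arch": "ia64", "binary": "a.out"},
    "sampling": {"event": "BRANCH_EVENT", "interval": 100_000},
    "samples": [
        {"pc": "0x4000a0", "count": 1824},
        {"pc": "0x4000b0", "count": 311},
    ],
}

# A text-based serialization makes the profile portable between tools.
blob = json.dumps(profile)
restored = json.loads(blob)
```

The key design point is self-description: the record carries its own sampling event and interval, so a consumer can interpret counts without out-of-band knowledge of the producer.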
Finally, as recent processors support voltage/frequency scaling for better power
efficiency and thermal control, dynamic optimization needs to consider not only
performance improvement but also power efficiency in its optimization decisions.
We plan to investigate integrating a power/performance projection model into
runtime optimization.
Bibliography
[1] Intel Itanium Processor Reference Manual for Software Development. http:
//www.intel.com/design/itanium/manuals.htm.
[2] pfmon - HP Performance Monitoring Tool. http://www.hpl.hp.com/research/
linux/perfmon.
[3] PIN - A Dynamic Binary Instrumentation Tool. http://rogue.colorado.edu/
Pin.
[4] Ammons, G., Ball, T., and Larus, J. R. Exploiting hardware performance
counters with flow and context sensitive profiling. In Proceedings of the ACM
SIGPLAN‘97 Conference on Programming Language Design and Implementation
(PLDI) (June 1997).
[5] An, P., Jula, A., Rus, S., Saunders, S., Smith, T., Tanase, G., Amato,
N., and Rauchwerger, L. STAPL: An adaptive, generic parallel program-
ming library for C++. In Workshop on Languages and Compilers for Parallel
Computing (LCPC) (Cumberland Falls, Kentucky, August 2001).
[6] Annavaram, M., Rakvic, R., Polito, M., Bouguet, J., Hankins, R.,
and Davies, B. The Fuzzy Correlation between Code and Performance Pre-
dictability. In the 37th International Symposium on Microarchitecture (December
2004).
[7] Arnold, M., Fink, S., Grove, D., Hind, M., and Sweeney, P. F.
Adaptive optimization in the Jalapeno JVM. In 15th Conference on Object-
Oriented Programming, Systems, Languages, and Applications (OOPSLA’00)
(2000), pp. 47–65.
[8] Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: A transparent
dynamic optimization system. In Proceedings of the ACM SIGPLAN conference
on Programming language design and implementation (PLDI’00) (June 2000).
[9] Bosworth, G. PreJIT in the CLR. In 2nd Workshop on Managed Runtime
Environments (MRE‘04) (2004).
[10] Bruening, D., and Amarasinghe, S. Maintaining consistency and bounding
capacity of software code caches. pp. 74–85.
[11] Bruening, D., Duesterwald, E., and Amarasinghe, S. Design and im-
plementation of a dynamic optimization framework for Windows. In the 4th
Workshop of Feedback-Directed and Dynamic Optimization (Austin, TX, 2001).
[12] Bruening, D., Garnett, T., and Amarasinghe, S. An Infrastructure for
Adaptive Dynamic Optimization. In Proceedings of 1st International Symposium
on Code Generation and Optimization (CGO’03) (2003), pp. 265–275.
[13] Burke, M., Choi, J.-D., Fink, S., Grove, D., Hind, M., Sarkar, V.,
Serrano, M., Sreedhar, V., Srinivasan, H., and Whaley, J. The
Jalapeno dynamic optimizing compiler for Java. In Proc. ACM 1999 Java Grande
Conference (1999), pp. 129–141.
[14] Chen, H., Lu, J., Hsu, W.-C., and Yew, P.-C. Continuous Adaptive
Object-Code Re-optimization Framework. the 9th Asia-Pacific Computer Sys-
tems Architecture Conference (2004).
[15] Chen, W.-K., Lerner, S., Chaiken, R., and Gillies, D. Mojo: A dynamic
optimization system. In the 3rd Workshop of Feedback-Directed and Dynamic
Optimization (2000), pp. 81–90.
[16] Choi, Y., Knies, A., Vedaraman, G., and Williamson, J. Design and
experience: Using the Intel Itanium 2 processor performance monitoring unit to
implement feedback optimization. In EPIC2 Workshop (2002).
[17] Collard, J.-F., Jouppi, N., and Yehia, S. System-Wide Performance Mon-
itors and their Application to the Optimization of Coherent Memory Accesses.
In Proc. Intl. Symp. on Prin. and Practice of Parallel Prog. (PPoPP) (Chicago,
IL, June 2005).
[18] Cover, T., and Thomas, J. Elements of Information Theory. John Wiley
and Sons, 1991.
[19] Das, A., Lu, J., Chen, H., Kim, J., Yew, P.-C., Hsu, W.-C., and Chen,
D.-Y. Performance of Runtime Optimization on BLAST. In Proceedings of
the Third Annual IEEE/ACM International Symposium on Code Generation and
Optimization (March 2005).
[20] Deaver, D., Gorton, R., and Rubin, N. Wiggins/redstone: An on-line
program specializer. In Hot Chips 11 Conf. (Palo Alto, CA, 1999).
[21] Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E., Petitet, A.,
Vuduc, R., Whaley, C., and Yelick, K. Self-Adapting linear algebra
algorithms and software. Proceedings of the IEEE 93, 2 (February 2005), 293–
312.
[22] Dhodapkar, A., and Smith, J. Managing multi-configuration hardware via
dynamic working set analysis. In 29th Annual International Symposium on Com-
puter Architecture (May 2002).
[23] Duesterwald, E., Cascaval, C., and Dwarkadas, S. Characterizing and
predicting program behavior and its variability. In International Conference on
Parallel Architectures and Compilation Techniques (October 2003).
[24] Ebcioglu, K., and Altman, E. DAISY: Dynamic compilation for 100%
architectural compatibility. In Proc. 24th Annual International Symposium on
Computer Architecture (1997), pp. 26–37.
[25] Hazelwood, K., and Smith, J. E. Exploring Code Cache Eviction Granu-
larities in Dynamic Optimization Systems. In second Annual IEEE/ACM Inter-
national Symposium on Code Generation and Optimization (March 2004).
[26] Hazelwood, K., and Smith, M. Generational cache management of code
traces in dynamic optimization systems. In 36th International Symposium on
Microarchitecture (December 2003).
[27] Hind, M., Rajan, V., and Sweeney, P. Phase shift detection: a problem
classification. IBM Research Report RC-22887, 45–57.
[28] Hsu, C.-H., and Kremer, U. The design, implementation and evaluation of a
compiler algorithm for CPU energy reduction. In Proceedings of ACM SIGPLAN
Conference on Programming Language Design and Implementation (June 2003).
[29] Huang, M., Renau, J., and Torrellas, J. Positional adaptation of pro-
cessors: Application to energy reduction, June 2003.
[30] Jin, H., Frumkin, M., and Yan, J. The OpenMP Implementation of NAS
Parallel Benchmarks and Its Performance. NAS Technical Report NAS-99-011
(October 1999).
[31] Kim, H., and Smith, J. Dynamic binary translation for accumulator-oriented
architectures. pp. 25–35.
[32] Kistler, T., and Franz, M. Computing the Similarity of Profiling Data. In
Proc. Workshop on Feedback-Directed Optimization (1998).
[33] Lau, J., Perelman, E., and Calder, B. Selecting Software Phase Markers
with Code Structure Analysis. In Proceedings of the International Symposium
on Code Generation and Optimization (CGO2006) (March 2006).
[34] Lau, J., Schoenmackers, S., and Calder, B. Structures for Phase Classi-
fication. In IEEE International Symposium on Performance Analysis of Systems
and Software (March 2004).
[35] Lau, J., Schoenmackers, S., and Calder, B. Transition Phase Classifica-
tion and Prediction. In the 11th International Symposium on High Performance
Computer Architecture (February 2005).
[36] Liu, W., and Huang, M. EXPERT: expedited simulation exploiting program
behavior repetition. In Proceedings of the 18th annual International Conference
on Supercomputing (June 2004).
[37] Lu, J., Chen, H., Fu, R., Hsu, W.-C., Othmer, B., and Yew, P.-C.
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimiza-
tion System. In Proceedings of the 36th Annual International Symposium on
Microarchitecture (December 2003).
[38] Lu, J., Das, A., Hsu, W.-C., Nguyen, K., and Abraham, S. G.
Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor. In
Proceedings of the 38th Annual IEEE/ACM international Symposium on Mi-
croarchitecture (2005).
[39] Luk, C.-K., et al. Ispike: A post-link optimizer for the Intel Itanium 2 architec-
ture. In Proceedings of the 2nd International Symposium on Code Generation and
Optimization (CGO) (2004).
[40] Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G.,
Wallace, S., Reddi, V. J., and Hazelwood, K. Pin: Building customized
program analysis tools with dynamic instrumentation. In Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and Implementation
(PLDI'05) (June 2005), pp. 190–200.
[41] Magklis, G., Scott, M. L., Semeraro, G., Albonesi, D. A., and Drop-
sho, S. Profile-based Dynamic Voltage and Frequency Scaling for a Multiple
Clock Domain Microprocessor. In Proceedings of the International Symposium
on Computer Architecture (June 2003).
[42] Muchnick, S. Advanced Compiler Design and Implementation. Morgan Kauf-
man, 1997.
[43] Nagpurkar, P., Krintz, C., and Sherwood, T. Phase-Aware Remote
Profiling. In the third Annual IEEE/ACM International Symposium on Code
Generation and Optimization (March 2005).
[44] Patil, H., Cohn, R., Charney, M., Kapoor, R., Sun, A., and
Karunanidhi, A. Pinpointing representative portions of large Intel Itanium
programs with dynamic instrumentation. In MICRO-37 (December 2004).
[45] Puschel, M., Moura, J. M. F., Johnson, J. R., Padua, D., Veloso,
M. M., Singer, B. W., Xiong, J., Franchetti, F., Gacic, A., Voro-
nenko, Y., Chen, K., Johnson, R. W., and Rizzolo, N. SPIRAL: Code
Generation for DSP Transforms. Proceedings of the IEEE 93, 2 (February 2005).
[46] Savari, S., and Young, C. Comparing and Combining Profiles. In Journal
of Instruction Level Parallelism (2004).
[47] Scott, K., Kumar, N., Velusamy, S., Childers, B., and Soffa, M.
Retargetable and reconfigurable software dynamic translation. In International
Symposium on Code Generation and Optimization (CGO ’03) (2003), pp. 36–47.
[48] Shen, X., Zhong, Y., and Ding, C. Locality phase prediction. In Inter-
national Conference on Architectural Support for Programming Languages and
Operating Systems (2004).
[49] Sherwood, T., Perelman, E., and Calder, B. Basic block distribution
analysis to find periodic behavior and simulation points in applications. In In-
ternational Conference on Parallel Architectures and Compilation Techniques
(September 2001).
[50] Sherwood, T., Perelman, E., Hamerly, G., and Calder, B. Automat-
ically characterizing large scale program behavior. In Architectural Support for
Programming Language and Operating Systems - X (October 2002).
[51] Sherwood, T., Sair, S., and Calder, B. Phase tracking and prediction. In
30th Annual International Symposium on Computer Architecture (June 2003).
[52] Srivastava, A., Edwards, A., and Vo, H. Vulcan: Binary translation in a
distributed environment. Microsoft Research Technical Report, MSR-TR-2001-
50 (2001).
[53] Sun, M., Daly, J., Wang, H., and Shen, J. Entropy-based Characterization
of Program Phase Behaviors. In the 7th Workshop on Computer Architecture
Evaluation using Commercial Workloads (February 2004).
[54] Thomas, N., Tanase, G., Tkachyshyn, O., Perdue, J., Amato, N. M.,
and Rauchwerger, L. A Framework for Adaptive Algorithm Selection in
STAPL. In PPoPP’05 (Chicago,Illinois, June 2005).
[55] Tullsen, D. M., and Eggers, S. J. Limitations of Cache Prefetching on a
Bus-Based Multiprocessor. In Proc. of 20th International Symposium on Com-
puter Architecture(ISCA) (May 1993), pp. 278–288.
[56] Vaswani, K., and Srikant, Y. Dynamic recompilation and profile-guided
optimisations for a .NET JIT compiler. IEE Proc.-Softw. 150, 5 (2003), 296–
302.
[57] Voss, M. J., and Eigenmann, R. High-Level Adaptive Program Optimization
with ADAPT. ACM SIGPLAN Notices 32, 7 (July 2001), 93–102.
[58] Whaley, R., Petitet, A., and Dongarra, J. Automated empirical opti-
mization of software and the ATLAS project. Parallel Comput. 27, 1-2 (2001),
3–35.
[59] Zhang, W., Calder, B., and Tullsen, D. M. An event-driven multi-
threaded dynamic optimization framework. pp. 87–98.
[60] Zhang, X., Wang, Z., Gloy, N., Chen, J., and Smith, M. System support
for automatic profiling and optimization. In Proc. 16th ACM Symp. Operating
Systems Principles (1997), pp. 15–26.