PARALLEL PROGRAMMING METHODOLOGY AND ENVIRONMENT
FOR THE SHARED MEMORY PROGRAMMING MODEL
A Thesis
Submitted to the Faculty
of
Purdue University
by
Insung Park
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
December ����
To my beloved grandmother
ACKNOWLEDGMENTS
First, I'd like to thank my grandmother, whom I have not seen for more than
two years, and whom I will never see again. She fled to South Korea with four
little daughters during the Korean War and started a new life in an unfamiliar place
with her bare hands. Her courage, perseverance, and endurance have led to my
existence. Over the years in graduate school, she has always been on my side, lending
a sympathetic ear and doing her best to keep me sane. I wish I could see her just one
more time.
I'd like to thank my advisor, Dr. Rudolf Eigenmann, for his encouragement and
advice during my research. His insightful comments and constructive suggestions are
greatly appreciated. I also express my gratitude to my graduate committee members,
Dr. José A. B. Fortes, Dr. Howard J. Siegel, and Dr. Elias Houstis, for their time
and advice.
My deepest love goes to my parents and my two brothers, In Jun and In Kwon.
I can never thank them enough for the never-ending support that has carried me
through my research. Through the ups and downs of life, their love and
encouragement have given me the strength to go on. I am also grateful to
my aunts, uncles, and cousins, who have never hidden their pride in me and their concern
for my well-being.
The fresh and valuable perspectives that the members of our research group have
provided are greatly appreciated. Among them, Mike, Seon, Brian, and Vishal have
made extra efforts to help me with my research, which I deeply acknowledge.
Mike, Natalie, and Nicholas deserve special mention for always being there for me.
I cherish them as my brother, sister, and nephew. Without them, I would not have
made it this far. I believe one of the reasons God led me here is to meet them. I also
value my to-be-lifelong friendship with Seon, Young, and their precious daughter
Arden. The numerous evenings I have spent with all these friends are precious to me.
I appreciate many of my Korean friends here at Purdue. Especially, I extend my
thanks to Jong-hyeok and Je-Ho. Life here has been joyous and fun because of
them. Thanks are also due to their wives, who have fed this single, hungry graduate
student countless times. I'd also like to mention In Sung, Jae Hyung, Yonghee, Soon
Keon, Heon, Seungmoon, Soohong, Jang Won, Il, Jung Min, Hun Soo, Woon Young,
Jong Sun, Se Hyun, and their families.

Lastly, I send my best regards to Joon Sook and her family. I wish them happiness.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  1.1 Motivation
    1.1.1 State of parallel computing
    1.1.2 Open issues in the shared memory programming model
    1.1.3 Need for parallel programming environment
  1.2 Thesis Organization
2 BACKGROUND
  2.1 Parallel Programming Concepts, Terminology, and Notations
  2.2 Parallelization in the Shared Memory Programming Model
    2.2.1 Introduction
    2.2.2 History of parallel shared memory directives
    2.2.3 Shared memory program execution
    2.2.4 Automatic parallelization
  2.3 Parallelization in the Message Passing Programming Model
    2.3.1 MPI and PVM
    2.3.2 HPF
    2.3.3 Visual parallel programming systems
  2.4 Parallel Programming and Optimization Methodology
    2.4.1 Shared memory programming methodology
    2.4.2 Message passing programming methodology
  2.5 Tools
    2.5.1 Program development and optimization
    2.5.2 Instrumentation
    2.5.3 Performance visualization and evaluation
    2.5.4 Guidance
  2.6 Utilizing Web Resources for Parallel Programming
  2.7 Conclusions
3 SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY
  3.1 Introduction: Scope, Audience, and Metrics
    3.1.1 Scope of the proposed methodology
    3.1.2 Target audience
    3.1.3 Metrics: understanding overheads
  3.2 Parallel Program Optimization Methodology
    3.2.1 Instrumenting the program
    3.2.2 Getting serial execution time
    3.2.3 Running a parallelizing compiler
    3.2.4 Manually optimizing programs
    3.2.5 Getting optimized execution time
    3.2.6 Finding and resolving performance problems
  3.3 Conclusions
4 TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY
  4.1 Design Objectives
  4.2 Ursa Minor: Performance Evaluation Tool
    4.2.1 Functionality
    4.2.2 Internal organization of the Ursa Minor tool
    4.2.3 Database structure and data format
    4.2.4 Summary
  4.3 InterPol: Interactive Tuning Tool
    4.3.1 Overview
    4.3.2 Functionality
    4.3.3 Summary
  4.4 Other Tools in Our Toolset
    4.4.1 Polaris: parallelizing compiler
    4.4.2 InterAct: performance monitoring and steering tool
    4.4.3 Max/P: parallelism analysis tool
  4.5 Integration with Methodology
    4.5.1 Tool support in each step
    4.5.2 Other useful utilities
  4.6 The Parallel Programming Hub and Ursa Major
    4.6.1 Parallel Programming Hub: globally accessible integrated tool environment
    4.6.2 Ursa Major: making a repository of knowledge available to the worldwide audience
  4.7 Conclusions
5 EVALUATION
  5.1 Methodology Evaluation: Case Studies
    5.1.1 Manual tuning of ARC2D
    5.1.2 Evaluating a parallelizing compiler on a large application
    5.1.3 Interactive compilation
    5.1.4 Performance advisor: hardware counter data analysis
    5.1.5 Performance advisor: simple techniques to improve performance
  5.2 Efficiency of the Tool Support
    5.2.1 Facilitating the tasks in parallel programming
    5.2.2 General comments from users
  5.3 Comparison with Other Parallel Programming Environments
  5.4 Comparison of Ursa Major and the Parallel Programming Hub
  5.5 Conclusions
6 CONCLUSIONS
  6.1 Summary
  6.2 Directions for Future Work
LIST OF REFERENCES
VITA
LIST OF TABLES

Overhead categories of the speedup component model.
Optimization technique application criteria.
A detailed breakdown of the performance improvement due to each technique.
Common tasks in parallel programming.
Time (in seconds) taken to perform the tasks without our tools.
Time (in seconds) taken to perform the tasks with our tools.
Feature comparison of parallel programming environments.
Workload distribution on resources with our network-based tools.
LIST OF FIGURES

The structure of an SMP.
An Origin 2000 system: (a) topology and (b) structure of a single node board.
Simple parallelization with OpenMP.
Screenshot of the CODE visual programming system.
The timeline graph from NTV.
The graphs generated by AIMS.
The graphs generated by Pablo.
Typical parallel program development cycle.
Overview of the proposed methodology.
Scalar privatization: (a) the original loop and (b) the same loop after privatizing variable X.
Array privatization: (a) the original loop and (b) the same loop after privatizing array A.
Scalar reduction: (a) the original loop and (b) the same loop after recognizing reduction variable SUM.
Array reduction: (a) the original loop and (b) the same loop after recognizing reduction array A.
Induction variable recognition: (a) the original loop and (b) the same loop after replacing induction variable X.
Scheduling modification: (a) the original loop and (b) the same loop after modifying the scheduling by pushing parallel constructs inside the loop nest. In (b), the inner loop is executed in parallel; thus processors access array elements that are at least one stride apart.
Padding: (a) the original loop and (b) the same loop after padding extra space into the arrays.
Load balancing: (a) the original loop and (b) the same loop after changing to an interleaved scheduling scheme. By changing the scheduling from static to dynamic, an unbalanced load can be distributed more evenly.
Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b), another loop has been added to assign smaller blocks to each processor. The data are likely to remain in the cache when they are needed again.
Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest.
Software pipelining and loop unrolling: (a) the original loop, (b) the same loop with software pipelining (instructions are interleaved across iterations, and a preamble and postamble have been added), and (c) the same loop unrolled.
Original loop SHALOW do… in program SWIM.
Parallel version of SHALOW do… in program SWIM.
Optimized version of SHALOW do… in program SWIM.
Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on execution time, the user inspects the percentage of three major loops (ACTFOR do…, ACTFOR do…, RESTAR do…) using a pie chart generator (bottom left). Computing the speedup with the Expression Evaluator reveals that the speedup for RESTAR do… is poor, so the user is examining more detailed information on the loop.
Structure view of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the "Find" utility, the user sets the view to subroutine ACTFOR and opens the source view for the parallelized loop ACTFOR do….
The user interface of Merlin in use. Merlin provides solutions to the detected problems. This example shows the problems addressed in loop ACTFOR DO… of program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer.
The internal structure of a Merlin "map". The Problem Domain corresponds to general performance problems. The Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data.
Building blocks of the Ursa Minor tool and their interactions.
The database structure of Ursa Minor.
An overview of InterPol. Three main modules interact with users through a Graphical User Interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls Polaris modules.
User interface of InterPol: (a) the main window and (b) the Compiler Builder.
Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC benchmark SWIM.
Tool support for the parallel programming methodology.
Ursa Minor usage on the Parallel Programming Hub.
Interaction provided by the Ursa Major tool.
The (a) execution time and (b) speedup of the various versions of ARC2D (Mod 1: loop interchange; Mod 2: STEPFY do… modification; Mod 3: STEPFX do… modification; Mod 4: FILERX do… modification; Mod 5: YPENTA do… modification; Mod 6: modifications on XPENTA, YPENT2, and XPENT2).
Contents of the Program Builder during an example usage of the InterPol tool: (a) the input program and (b) the output from the default Polaris compiler configuration.
Contents of the Program Builder during an example usage of the InterPol tool: (c) the output after placing an additional dead-code elimination pass prior to inlining and (d) the program after manually parallelizing subroutine two.
Performance analysis of the loop STEPFX DO… in program ARC2D. The graph on the left shows the overhead components in the original, serial code. The graphs on the right show the speedup component model for the parallel code variants before and after loop interchanging is applied. Each component of this model represents the change in the respective overhead category relative to the serial program. Merlin is able to generate the information shown in these graphs.
Speedup achieved by applying the performance map. The speedup is with respect to a one-processor run with serial code on a Sun Enterprise system. Each graph shows the cumulative speedup when applying each technique.
Overall times to finish all tasks.
The response time of UM-Applet and UM-ParHub on (a) a networked PC, (b) a networked workstation, and (c) a dialup PC.
The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching.
ABSTRACT
Park, Insung. Ph.D., Purdue University, December ����. Parallel Programming Methodology and Environment for the Shared Memory Programming Model. Major Professor: Rudolf Eigenmann.
The easy programming model of the shared memory paradigm possesses many
attributes desirable to novice programmers. However, there has not been a good
methodology with which programmers can navigate the difficult task of program
parallelization and optimization. It is becoming increasingly difficult to achieve good
performance without experience and intuition. Guiding methodologies must define
easy-to-follow steps for programming and tuning multiprocessor applications. In
addition, a parallel programming environment must acknowledge the time-consuming
steps in the parallelization and tuning process and support users in their efforts.

We propose a parallel programming methodology for the shared memory model
and a set of tools designed to assist users in accordance with the methodology. Our
research addresses the questions of "what" to do in parallel program development and
tuning, "how" to do it, and "where" to do it. Our main contribution is to provide a
comprehensive programming environment in which both novice and advanced users
can perform performance tuning in an efficient and straightforward manner. Our
effort differs from other parallel programming environments in that (1) it integrates
most stages of parallel programming tasks based on a common methodology and
(2) it addresses issues that have not been attempted in previous efforts. We have used
network computing technology so that programmers worldwide can benefit from our
work. Through a series of evaluations, we found that our programming environment
provides a methodology that works well with parallel applications and that
our tools provide efficient support to both novice and advanced programmers.
1. INTRODUCTION

1.1 Motivation

1.1.1 State of parallel computing
Multiprocessor machines exist in many different architectures. Among
them, shared memory machines have been receiving much attention recently. This is mainly
due to the fact that the shared memory architecture offers an easy programming
model and that the techniques for parallelizing programs for this class of machines
are well established and can be automated.

Today, affordable new multiprocessor workstations and PCs are attracting an
increasing number of users; consequently, many of these new programmers are inexperienced
and desire an easier programming model to harness the power of parallel computing.
These aspects draw more attention to shared memory machines in two ways. First,
most newly developed parallel computers are shared memory machines or compatible
with the shared memory programming model. Second, the aforementioned easy
programming model, with the help of parallelizing compilers, requires relatively little
experience to develop parallel programs.
The effort in industry toward the standardization of a programming model
makes shared memory machines more appealing. The lack of a standardized parallel
language had been a problem with the shared memory model. It often required
programmers to learn a new set of language constructs whenever there was a need to
port programs across platforms. To make matters worse, the difference among these
native dialects in their ability to express parallelism was significant enough that in
many cases a considerable change had to be made to the program code itself, going
beyond direct directive translation. There have been several attempts to provide
standard parallel languages, which will be discussed in Chapter 2, but they failed to
gain attention from the parallel computing community in general.
The recent parallel language standard for shared memory multiprocessor
machines, OpenMP […], promises an attractive interface for those programmers who
wish to exploit parallelism explicitly. The OpenMP standard resolves the portability
problem and is expected to attract more programmers and computer vendors in the
high performance computing area.
1.1.2 Open issues in the shared memory programming model
There are, however, open issues to be addressed. Perhaps the most serious of all is
the lack of a good programming methodology for these types of machines. In contrast
to several efforts to establish a methodology for other programming models […],
no known literature speaks of a programming and tuning methodology
for the shared memory model. A programmer who is to develop a parallel program
faces a number of challenging questions. What are the known techniques for
parallelizing this program? What information is available for the program at hand?
How much speedup can be expected from this program? What are the limitations
on the parallelization of this program? It usually takes substantial experience to find
the answers to such questions. Most general programmers do not have the time and
resources to acquire this experience.
We believe that the absence of a programming methodology can be attributed to three
reasons. First, many advanced parallel programmers are used to programming in
terms of "application level" parallelism. By this we mean the study of the underlying
physics and algorithms to find parallelism residing at that level. It is indeed an
effective method if it succeeds, because in some cases the scope of the resulting parallelism
is wider than the finer grain parallelism of the directive-based programming model,
resulting in less synchronization overhead. However, this approach requires significant
effort to understand the underlying physics, and it is prone to human error. It is
not rare that a programmer realizes, at a later stage of development, that
the algorithm that he or she thought to be parallel is actually sequential. If the person
parallelizing a program is not the programmer who wrote it, the required
effort doubles, as understanding the program has to precede parallelization.
Furthermore, depending on the problem that programmers wish to solve, the underlying
algorithms and physical models vary significantly, making a systematic approach
to parallel application design difficult. A programmer who is used to this approach
has to tackle each problem case by case, relying on intuition and experience.
In contrast to the "application level" approach, there is a "program level" parallelism
approach: an effort to find parallelism based on the source code
and how it is written. By focusing only on repetitive computing constructs (loops), this
approach allows automatic recognition of parallelism and possible transformations.
Numerous research projects have addressed the issues of identifying parallelism and
applying the corresponding transformations, which can be incorporated into
compilers […]. Nevertheless, these are not parallel programming methodologies
by themselves. These researchers address only one part of parallel program
development: parallelization. A complete parallel programming methodology has to
encompass all development stages, including parallelization, evaluation, tuning,
and so on.

The second reason for the lack of a methodology for the shared memory
architecture stems from the significant aid provided by parallelizing compilers.
Many inexperienced programmers expect a significant speedup after running a
parallelizing compiler. Indeed, such compilers simplify the process considerably. However, running a
parallelizing compiler does not necessarily achieve high performance. To achieve
optimal performance from a program, many factors often have to be considered,
including both machine-dependent and machine-independent parameters, underlying algorithms,
and so on. As shown in […], without proper consideration of these effects, the resulting
performance may even degrade. We believe that there is room for a systematic
way to provide users with guidelines and remedies that can be incorporated into a
structured methodology.
Finally, there are some aspects of the shared memory model that make it hard
to develop a general methodology. As mentioned above, the shared memory model
offers an easy programming interface. This does not mean that obtaining good
performance is easy as well. Unlike some other programming models, such as a message
passing scheme where a programmer explicitly dictates synchronization and the sending
and receiving of messages, important events such as multiple processors writing to a
shared variable or false sharing are not readily visible to users in the shared memory
model. Furthermore, these effects are hard, if not impossible, to measure without
introducing significant overhead. Therefore, if the performance is not satisfactory,
inexperienced programmers have difficulty finding what caused the problem. The increasing
number of Non-Uniform Memory Access (NUMA) machines adds more complexity,
because these machines introduce another variable to consider, namely memory latency. The
shared memory programming model provides an easy, transparent means of
expressing parallelism, but the price is that parallel performance optimization requires
significant time and resources. A good methodology should be general enough to cover
a variety of architectures and applications, but flexible enough to help programmers
pinpoint the bottlenecks and resolve the problems in a specific situation.
1.1.3 Need for parallel programming environment
With the gaining momentum of the shared memory architecture, a methodology
for the shared memory model is needed. The shared memory model provides a
simple user interface; what we do not have is an equally easy way to produce good
performance. Such a methodology has to consist of structured guidelines that encompass the whole process
of program development while providing useful tips with which users can navigate
the difficult steps. As there are a variety of issues to deal with, it has to be
general without losing its utility when applied to real environments.
A good methodology does not suffice without proper support from tools. Listing
the tasks that need to be completed is of little help to programmers if all
those tasks must be accomplished manually with only the basic utilities available on
the target machine. During an optimization process, programmers face challenges in
analysis and performance data management, incremental application of parallelization
and optimization techniques, performance measurement and monitoring, and problem
identification and devising remedies. Each of these tasks poses a significant burden
on programmers, and without any help, each can be time-consuming.
This leads to the need for supporting facilities for the underlying methodology.
These facilities need to address the difficult and time-consuming steps specified by the
methodology and provide functionality that accelerates these steps. Together, the
methodology and the tools should be able to make up for the lack of experience
among novice programmers wherever it is needed most, such as in analysis, diagnosis,
and the formation of solutions. We acknowledge the many tools designed for the purpose of
helping programmers, but the majority of them focus on specific aspects or
environments of the program development process and are not based on a methodology. We believe
that providing a more comprehensive and actively guiding toolset is possible with
current technology.
Another problem with current tools is their accessibility. If useful tools cannot
be easily found and used, the effort to develop them is wasted.
Furthermore, as more diverse multiprocessors find their users, compatibility
has become an important factor in a tool's applicability. As the existing
programming models converge to the OpenMP standard, tool developers should consider this
problem. With emerging network technology and new portable languages such
as Java, we already have the basic framework enabling more accessible parallel
programming tools.
We present here our results on the subject of a parallel programming methodology
and supporting tools. We have developed a methodology that has worked well in
various environments and a set of tools that address difficult tasks in the shared
memory model. Combining the methodology and the supporting tools we developed,
programmers can now follow a structured approach toward optimal performance
with the support of efficient tools. This optimization paradigm is available to a
general audience through the Purdue University Network Computing Hub (PUNCH) […]
and a Java applet application, allowing our methodology and tool support to reach
many users throughout the globe.
1.2 Thesis Organization
Chapter 2 gives a brief overview of the history of and background on parallel
programming, focusing on methodologies and programming tools. Chapter 3
presents our proposed methodology for these issues, and the supporting tools
developed for the methodology are summarized in Chapter 4. Chapter 5 discusses
the evaluation process and its results. Chapter 6 concludes the thesis.
2. BACKGROUND
In this chapter, we examine previous efforts in developing programming
methodologies and tools for parallel programming targeted at the two well-known
programming models: the shared memory and the distributed memory models. Our
research can be summarized as building a comprehensive programming environment by
(1) designing a good programming methodology, (2) providing a toolset that supports
it, and (3) making our results available to a wide audience. From this perspective, we
discuss general concepts in parallel programming, methodologies and tools proposed
by other researchers, and previous efforts toward more accessible data repositories
and parallel programming tools.
2.1 Parallel Programming Concepts, Terminology, and Notations
Parallelism exists in many forms. In this thesis, we consider "parallel processing",
in which multiple processors take part in executing a single program. Other parallel
schemes, such as instruction-level parallelism or vector architectures, are not the target
of our research. There are two major multiprocessor architecture categories: SIMD
(Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data).
Among these, we focus on the MIMD architecture, which is the most commonly used
architecture these days.
The MIMD category consists of two types of machines: shared memory machines and distributed memory machines. The physical memory of a shared memory architecture may itself be centralized or distributed, further dividing the architecture into the Uniform Memory Access (UMA) architecture and the Non-Uniform Memory Access (NUMA) architecture. Some distinguish them by using the terms Symmetric MultiProcessor (SMP) architecture and Distributed Shared Memory (DSM) architecture, respectively. DSM machines seek to resolve the limited capacity of shared memory buses, which prevents scaling to a large number of processors on a conventional SMP architecture. Figure 2.1 shows a typical "flat" SMP architecture with four processors. By contrast, the architecture shown in Figure 2.2 is that of a Cray Origin 2000 system, which is a DSM machine.
[Figure: CPU 1, CPU 2, ..., CPU P, each with an external cache, connected to a shared main memory.]
Fig. 2.1. The structure of an SMP.
From the programmer's point of view, there are two main models for programming on parallel machines: the shared memory programming model and the message passing programming model. There are other programming models that target a cluster of SMP machines [?] or parallel logic environments [?], but they are not widely used and will not be discussed in detail.
The shared memory model and the message passing programming model share the same basic concept: threads. A single process forks multiple threads that independently execute portions of a program. The difference between these two is how threads access memory. In the shared memory model, multiple processors share a single memory space, so processors can read or write to the shared space regardless of where the data actually reside. The notion of "shared" and "private" data becomes important. Shared data are visible to all processors participating in the parallel execution. Communication between processors takes place in the form of reading and writing shared data. Private data, on the other hand, are local to each processor and cannot be accessed by other processors.

[Figure: routers interconnecting node boards; each node board holds two CPUs with external caches, a Hub ASIC with XIO links, and local memory and directory.]
Fig. 2.2. An Origin 2000 system: (a) topology and (b) structure of a single node board.
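The shared/private distinction can be sketched outside of any directive language. The following Python fragment is an illustrative sketch of the model, not OpenMP; the names `worker` and `shared_total` are ours. Threads share one address space, so each thread accumulates into a private local variable and communicates only by writing to shared data under a lock.

```python
import threading

shared_total = [0]            # shared data: visible to every thread
lock = threading.Lock()

def worker(chunk):
    private_sum = 0           # private data: local to this thread only
    for x in chunk:
        private_sum += x
    with lock:                # communication = writing to shared data
        shared_total[0] += private_sum

# four threads, each owning a strided portion of the iteration space
threads = [threading.Thread(target=worker, args=(range(i, 100, 4),))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, `shared_total[0]` holds the sum 0 + 1 + ... + 99, regardless of which thread executed which portion.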
By contrast, in the message passing scheme, processors do not share memory. All data are private to the processor that owns them. The message passing scheme requires that each processor be aware of which processor owns what data; thus, if there is a need to read or write a data item that belongs to another processor, the item has to be explicitly sent and received.
These two models provide high-level constructs for easier programming. The shared memory model offers directive languages, with which a user specifies whether certain loops can be executed in parallel. Users can also program directly with threads with the help of thread libraries. In the message passing model, parallel constructs typically come in the form of a library of functions. The library includes functions for sending and receiving messages, synchronization, initialization, and grouping. The Message Passing Interface (MPI) [?] and Parallel Virtual Machine (PVM) [?] are important standards implemented in such libraries. The parallel programmer's task in the message passing model is to incorporate these functions into parallel algorithms. Programmers need to devise ways to split data, communicate, and synchronize, and to write or modify the program based on the design.
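The split-communicate-combine pattern just described can be sketched with Python's standard multiprocessing module. This is an illustrative sketch of the message passing style, not the MPI API; `parallel_sum` and the pipe layout are our own choices. Each process owns a private slice of the data, and the only way a value crosses a process boundary is an explicit send with a matching receive.

```python
import multiprocessing as mp

def worker(conn, chunk):
    # all data here are private to this process; results must be sent explicitly
    conn.send(sum(chunk))
    conn.close()

def parallel_sum(data, nproc=4):
    chunks = [data[i::nproc] for i in range(nproc)]   # split the data
    parents, procs = [], []
    for chunk in chunks:
        parent, child = mp.Pipe()
        p = mp.Process(target=worker, args=(child, chunk))
        p.start()                                     # each worker runs in its own address space
        parents.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in parents)      # explicit receives
    for p in procs:
        p.join()                                      # synchronize
    return total
```

The bookkeeping visible even in this toy example (who owns which slice, which pipe pairs with which process) is exactly the burden the text attributes to the message passing style.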
Although the shared memory programming model is primarily intended for programming on shared memory machines, and the message passing model for programming on distributed memory machines, this mapping between programming models and architectures is not binding. Many modern parallel computers are compatible with both programming models, although their hardware design takes specifically one form or the other. There is still no general agreement as to which architecture and which programming model are more effective, and it is not likely that either will prevail over the other in the near future.
Here, we focus on parallelization in the shared memory model. Although we view parallel program development in terms of programming models, we will keep in mind the effects of specific hardware implementations on program performance, as various machine-dependent parameters play significant roles in program execution. We would like our approach to parallel programming to address some of these hardware-related issues.
2.2 Parallelization in the Shared Memory Programming Model
2.2.1 Introduction
The focus of the shared memory programming model is on loops. Loops are the most common means of expressing repetitive computing patterns in a program. The concept of thread execution does not restrict parallelism to the loop level, but the high-level directive languages provided by the shared memory programming model mainly deal with ways to specify parallel loop execution. By exploiting parallelism among loop iterations, the shared memory model often achieves a significant performance gain.
In the shared memory programming model, a programmer specifies parallel execution by annotating the source code with directives. Typically, directives consist of one or more lines indicating serial/parallel execution, variable types (shared, private, and reduction), the scheduling scheme, and a conditional construct (the IF directive). Communication and synchronization among processors are implicit inside parallel sections, meaning that those operations are transparent and do not show up in the source code. Also, parallelization is localized; in other words, parallelizing one section of code has no logical effect* on the rest of the program. The transparent synchronization and localized parallel sections of the shared memory programming model offer an easy scheme to work with, especially for inexperienced programmers. Figure 2.3 shows a portion of code taken from an example program in [?] that computes π before and after parallelization using OpenMP. Lines starting with !$OMP indicate directives. The directive PARALLEL DO indicates that the loop has no loop-carried dependences and may be executed in parallel. The directives PRIVATE and SHARED tell the compiler that the variables in the following parentheses are private or shared, respectively. The directive REDUCTION(+: SUM) indicates that the variable SUM is a summation reduction variable and requires special care for parallel execution. Examining the details of OpenMP is beyond the scope of this thesis; more information can be found in [?].
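The reduction pattern of Figure 2.3 can also be expressed in a thread-library style rather than with directives. The sketch below is our own Python rendition, assuming the integrand F(x) = 4/(1 + x²) so that the sum approximates π; it gives each worker a private partial sum and combines the partial sums at the end, which is exactly what the REDUCTION clause asks the compiler to arrange.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return 4.0 / (1.0 + x * x)   # assumed integrand; its integral over [0,1] is pi

def pi_parallel(n, workers=4):
    w = 1.0 / n
    def partial(indices):        # each worker keeps a private partial sum
        return sum(w * f(w * (i - 0.5)) for i in indices)
    chunks = [range(k + 1, n + 1, workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(partial, chunks))  # the reduction step
```

Note that X = W * (I - 0.5) from the figure reappears as `w * (i - 0.5)`: the per-iteration value is private, while the accumulated SUM is the shared reduction variable.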
By narrowing the main concern to loops, the shared memory model has enabled an impressive advance in parallelization and optimization techniques. Well-known techniques for parallelization include advanced data dependence analysis, induction variable substitution, reduction variable recognition, privatization, and so on. In addition, there are locality enhancement techniques that specifically target the shared memory architecture, such as blocking/tiling and load balancing. Most of these techniques have been incorporated into modern parallelizers, which will be presented in Section 2.2.4.
2.2.2 History of parallel shared memory directives
As mentioned in the introduction, until the late 1990s the shared memory model suffered from the lack of a standard language. Computers from different vendors came with their own sets of directives for expressing parallelism, and compilers did not understand any directives other than their own. There have been a few initiatives to resolve this problem. In 1987, an informal industry group called the Parallel Computing Forum (PCF) was formed to address the issue of standardizing loop parallelism
*Cache effects can affect the performance of the code outside the parallel section.
      ...
      W = 1.0d0/N
      SUM = 0.0d0
      DO I=1,N
        X = W * (I - 0.5d0)
        SUM = SUM + F(X)
      ENDDO
      PI = W * SUM
      ...

(a) Original sequential code

      ...
      W = 1.0d0/N
      SUM = 0.0d0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W),
!$OMP& REDUCTION(+: SUM)
      DO I=1,N
        X = W * (I - 0.5d0)
        SUM = SUM + F(X)
      ENDDO
      PI = W * SUM
      ...

(b) After transformation

Fig. 2.3. Simple parallelization with OpenMP.
in Fortran. The group remained active for three years, after which its final report was published. After PCF was dissolved, a subcommittee, X3H5, authorized by ANSI, was formed to establish an independent language model for shared memory programming in Fortran and C. However, interest was eventually lost, and the proposed standards were abandoned, leaving the last revision labeled X3H5 Revision M [?]. There were also commercial portable directive sets, such as the KAP/Pro directive set from Kuck and Associates (KAI) [?]. However, since native compilers support only their own directives, portability could only be achieved by transforming the directives into thread-based code and compiling the resulting code with native compilers. Overall, all these efforts failed to gain the attention of the general parallel computing community.
In 1997, spurred by the rekindled popularity of shared memory machines, Silicon Graphics Inc. (SGI) and several major high performance computer vendors initiated
the eort to establish a new standard directive language� The proposed directive
language� named OpenMP ��� embraces the previous standardization eorts and
added a few new concepts for more expressiveness� Unlike previous attempts� this
is an industry�wise eort to resolve a practical problem� so it is likely to result in
a successful standard that is supported by the majority of new and existing high
performance computers� It seems safe to say that OpenMP ensures the future of the
shared memory architecture and the programming model by adding portability across
platforms�
2.2.3 Shared memory program execution
Once an executable is generated by compiling a program with directives, programmers can run it as they would run any sequential program. In fact, an OpenMP program starts out as a sequential program and engages other processors as OpenMP parallel constructs are encountered. The user has a number of controls over parallel execution, typically in the form of environment variables. The most important of them is the environment variable that sets the number of processors participating in the execution of parallel code sections. For programmers who are used to the message passing programming model, it is important to note that no configuration scripts or setups are necessary to execute an OpenMP program.
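OMP_NUM_THREADS is the usual environment variable for this control in OpenMP implementations. The sketch below is an illustrative Python stand-in for an OpenMP runtime, not the runtime itself; it shows the typical convention of reading the variable once at startup and falling back to a serial run when it is unset.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# OMP_NUM_THREADS is the standard OpenMP control; default to a serial run
nthreads = int(os.environ.get("OMP_NUM_THREADS", "1"))

def parallel_region(work_items, fn):
    """Execute fn over work_items with the configured number of threads."""
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        return list(ex.map(fn, work_items))
```

Running the same executable under `OMP_NUM_THREADS=8` simply widens the pool; no job scripts or configuration files are involved.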
2.2.4 Automatic parallelization
As the techniques for identifying parallelism and parallelizing loops advance, it is a natural course of action to incorporate them into a compiler so that the whole process takes place without the programmer's intervention. The apparent advantage of using a parallelizing compiler is that the conversion of a given serial program into parallel form is done mechanically by the tool, relieving programmers from worrying about parallelization details. As the impact of parallelizing compilers is significant, especially for the shared memory programming model, a reasonable methodology should consider their role in parallel program development. Thus, we briefly discuss the general aspects of parallelizing compilers in this section.
The eort to automate parallelization process starts from vectorizers of the ���s
� � �
and ���s� The most important vectorizers among them are the Parafrase compiler
from the University of Illinois ����� PFC parallelizing compiler developed at Rice
University ���� and the PTRAN compiler from IBM�s T� J� Watson Research Labo�
ratory ����� They laid the foundation for the modern parallelizers� Most of the general
techniques for vectorizing arrays within loops remain in the parallelizing compilers
these days ����
Today, all shared memory multiprocessor machines are equipped with their own parallelizers, and there have been several efforts in academia to create a new generation of state-of-the-art parallelizing compilers for the shared memory programming model. Two of the noticeable recent efforts in this field are the Polaris parallelizing compiler developed at the University of Illinois [?] and Purdue University, and the SUIF (Stanford University Intermediate Format) parallelizing compiler from Stanford University [?]. Both were built upon their own infrastructures (bases for Polaris and kernels for SUIF), which were designed to help researchers working on compiler technology. The focus of the SUIF compiler is on parallelizing the C language. With such techniques as global data and computation decomposition, communication optimization, array privatization, interprocedural parallelization, and pointer analysis, SUIF boasts an impressive performance gain on many programs.
Polaris, as a compiler, includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. The Polaris infrastructure provides useful facilities for analyzing and manipulating Fortran programs, which can provide useful information regarding a program's structure and its potential parallelism. Polaris has played a major role in our previous efforts in methodology and tool research, and it will continue to be a major part of our future research. The details of the role of Polaris in our research will be discussed in a later chapter.
2.3 Parallelization in the Message Passing Programming Model
2.3.1 MPI and PVM
Both MPI [?] and PVM [?] provide message passing infrastructures for parallel programs running on distributed memory machines. Ever since the introduction of the first distributed memory machine, the Cosmic Cube from Caltech, in the early 1980s, researchers and programmers who saw the potential of distributed memory computers struggled amid conflicting supporting interfaces, until Oak Ridge National Laboratory's PVM system and a joint US-Europe initiative for a standard message passing interface (eventually named MPI) arrived on the scene. These two interfaces were accepted by the majority of people involved in parallel computing on distributed memory machines and were successfully ported to a variety of multiprocessor systems, including shared memory machines [?].
These two systems take the form of libraries rather than separate language constructs. The libraries consist of functions and subroutines for synchronization and for sending and receiving messages across processors. Users insert calls to these routines to control the parallel execution of a program. This required programmers to change their way of thinking: they had to be the "masters" that explicitly take care of data distribution, communication, and other parallelization details. Nevertheless, their performance on some distributed memory machines was impressive.
The message passing programming model is well suited for distributed systems with a large number of processors. By carefully controlling the interaction among processors, some applications that do not require heavy communication are able to scale well as the number of processors increases. Another advantage of PVM and MPI is that they enable a cluster of heterogeneous uniprocessor systems to behave like one supercomputer. Good performance of the message passing model, however, often relies on one critical factor: network latency. The time to transfer a message from one processor to another ranges from a hundred to a million clock cycles. If the application at hand requires frequent communication among participating processors, the resulting performance gain can be seriously limited even on the fastest networks of today, let alone on a cluster of uniprocessors connected by simple network cables. This problem spawned numerous research efforts regarding data parallelism and work distribution on distributed memory machines, which we will not discuss any further.
Another drawback of the message passing interface is its aforementioned low-level programming style. The amount of bookkeeping for data transfer and synchronization can grow to an intolerable level, and it is entirely up to the programmer to ensure correct execution [?]. Furthermore, the tricks and tweaks needed to obtain high performance may be overwhelming to inexperienced programmers. Even worse, in this programming model the effort to parallelize a program generally starts from analyzing the underlying physics, making it difficult for programmers other than the original authors to parallelize a program. Overall, learning these interfaces is not particularly difficult, but designing a parallel program that achieves good performance is.
2.3.2 HPF
Many people thought that the message passing programming style was at too low a level to appeal to a general audience [?]. For this reason, a group of researchers at Rice University attempted to provide higher-level constructs for programming on distributed memory machines. The results are Fortran D [?] and its successor, High Performance Fortran (HPF) [?], both sets of extensions to Fortran. The HPF programming model looks similar to the shared memory model in that it focuses on loop parallelism controlled by directives added in front of loops. In addition, it provides directives for distributing data onto distributed memory systems. HPF translators generate a message passing program based on these directives. Compared to message passing functions, these directives let programmers specify array distribution without burdening them with tedious bookkeeping details.
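What a block data-distribution directive asks the translator to compute amounts to plain index arithmetic. The following is our own Python illustration of a BLOCK-style mapping; `block_ranges` is a hypothetical helper, not part of HPF.

```python
def block_ranges(n, p):
    """Index ranges owned by each of p processors under a BLOCK distribution."""
    size = (n + p - 1) // p           # ceiling(n / p) elements per processor
    return [(k * size, min(n, (k + 1) * size)) for k in range(p)]
```

For a 10-element array on 4 processors this yields [(0, 3), (3, 6), (6, 9), (9, 10)]; any access to an element outside a processor's own range is what the translator must turn into generated send/receive code.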
However, compared to the shared memory programming model, HPF lacks important constructs such as loop-private arrays, and, most of all, the performance of HPF programs is not as good as that of programs written directly in MPI or PVM. So far, only a handful of compilers and systems fully support HPF.
2.3.3 Visual parallel programming systems
A different approach to simplifying the user interface of the message passing programming model is to achieve an even higher level of abstraction by adopting the visual programming model of systems such as Visual C++ [?] and Visual Basic [?]. The goal of such research efforts is to develop visual programming environments in which programmers use nodes and arcs to design and implement parallel applications. They opt for a more efficient way of designing and implementing parallel programs; performance evaluation and tuning are not their main concern. Visual programming systems such as HeNCE [?], Enterprise [?], CODE [?], GRAPNEL [?], P-RIO [?], and Visper [?] belong to this category.
In contrast to the traditional coding model, these systems call for a different paradigm for writing parallel programs. Conventional programming language constructs are replaced with visual entities, although programmers are often required to provide some form of textual description to specify the details needed for the intended functionality. These systems include not only new programming models but also supporting tools that actually allow programmers to use them. These tools usually come with a set of templates to help programmers design parallel programs. Figure 2.4 shows a screenshot of the CODE visual parallel programming system.
The advantage of these visual parallel programming systems is an efficient representation of complex program structures and parallel constructs. Generally, programmers have less difficulty grasping the parallel nature of programs using these tools. In addition, the systems reduce debugging time by providing utilities for automatic translation of parallel constructs. However, the tasks of splitting data and coordinating communication are still left to programmers.
Fig. 2.4. Screenshot of the CODE visual programming system.

2.4 Parallel Programming and Optimization Methodology
As explained in Section 2.1, the parallel constructs provided by the shared memory programming model and the message passing programming model take significantly different forms. Hence, the corresponding programming methodologies have taken distinct paths.
2.4.1 Shared memory programming methodology
In the shared memory model, parallelism is specified with directives that have no effect on program semantics. Tasks are distributed based on loop iterations, and the key aspects of parallelizing shared memory programs are to detect loop-carried data dependences and to identify shared and private data in each iteration. This can be done by static, program-level analysis. Therefore, the methodology for the shared memory programming model is, at the highest level, to examine loops in a serial code region to detect parallelism and to determine shared and private variables. There are publications and lecture notes addressing programming on shared memory machines [?]. They present concepts and notations, explain directives, and discuss parallelization techniques and dependence test criteria. However, they do not offer an overall strategy or a procedural methodology for performance optimization. One exception is [?]. This document, specifically aimed at optimization for Origin 2000 machines, devotes a section to tuning parallel code for the Origin. The section consists of architecture-specific techniques that are useful in further improving parallel performance. However, compared to the detailed single-processor tuning description in the same text, the parallel performance tuning material only serves to complement the single-processor case. Also, the document lacks performance problem definitions and a performance evaluation description for parallel programs.
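The central test this methodology relies on, detecting loop-carried dependences, can be made concrete with a small experiment. The sketch below is our own Python illustration; `scale` and `prefix` are hypothetical loops. A loop whose iterations touch only their own elements gives the same answer in any iteration order, while a loop that reads a value written by an earlier iteration does not, and therefore cannot be run in parallel as written.

```python
def scale(a, order):
    b = list(a)
    for i in order:
        b[i] = b[i] * 2               # touches only b[i]: no loop-carried dependence
    return b

def prefix(a, order):
    b = list(a)
    for i in order:
        if i > 0:
            b[i] = b[i] + b[i - 1]    # reads b[i-1]: loop-carried dependence
    return b

data = [1, 2, 3, 4]
forward = list(range(len(data)))
backward = list(reversed(forward))

same = scale(data, forward) == scale(data, backward)        # order-insensitive: parallelizable
differs = prefix(data, forward) != prefix(data, backward)   # order-sensitive: must stay serial
```

A parallelizing compiler reaches the same verdict statically, by proving (or failing to prove) that no iteration reads a location written by another.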
An alternative way of expressing programs in the shared memory model is to use threads. In this scheme, the programmer packages program sections that can execute concurrently into subroutines and spawns these subroutines as parallel threads. Thread parallelism is at a lower level than directive parallelism; in fact, compilers translate a directive-parallel program into a thread-parallel program as an intermediate compilation step. Advanced parallel programmers sometimes prefer thread parallelism because it can offer more control over parallel program execution. Usually, this comes at the cost of a higher programming effort. A brief description of shared memory programming with multithreading is given in [?].
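The thread scheme amounts to the following pattern, sketched here in Python rather than a native thread library; `section_a` and `section_b` stand in for program sections the programmer has packaged as subroutines. Explicit spawn and join calls replace the parallel regions a compiler would otherwise insert for directives.

```python
import threading

results = {}

def section_a():                       # one concurrently executable section
    results["a"] = sum(range(50))

def section_b():                       # another, packaged as its own subroutine
    results["b"] = sum(range(50, 100))

ta = threading.Thread(target=section_a)
tb = threading.Thread(target=section_b)
ta.start()                             # spawn: the programmer's responsibility
tb.start()
ta.join()                              # join: explicit synchronization
tb.join()
total = results["a"] + results["b"]
```

The extra control (which sections run, when they are joined) is exactly what advanced programmers gain, and the extra calls are the higher programming effort the text mentions.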
2.4.2 Message passing programming methodology
Although HPF provides a directive-based programming model for the message passing model, programming methodologies found in the literature focus on an application-level approach using library functions. General methodologies for programming with message passing libraries are described in [?]. In [?], the authors employ the application-level approach ("application-driven development" in this book), in which they first categorize a given problem as one of five classes (synchronous applications, loosely synchronous applications, embarrassingly parallel applications, asynchronous problems, and metaproblems). To this end, the book provides readers with many example algorithms common in scientific computing. Based on the category of the target problem, the book lists possible parallel algorithms and suitable parallel machines. In [?], parallel program design consists of four stages: partitioning, communication, agglomeration, and mapping. Partitioning and communication are the tasks of distributing data and coordinating task execution, respectively. In the agglomeration stage, the combined parallel structures (data distribution and communication) are evaluated; if necessary, smaller tasks are combined into larger tasks to improve performance or to reduce development cost. Finally, in the mapping stage, each task is assigned to a processor in a manner that attempts to satisfy the design goals. Since parallel constructs are integrated into the program source in the message passing model, program design becomes an important part of parallel programming. This book also gives a detailed description of the process of evaluating parallel performance.
There are two different approaches to abstracting parallel programming in the message passing model using mathematical notations. One is based on parallel program archetypes or programming paradigms [?]. These are abstract notations that combine computation structure, parallelization strategy, and templates for dataflow and communication. Programmers are given a set of parallel program archetypes or programming paradigms. They then identify an appropriate element within the set that matches the problem they are trying to solve. Finally, they implement the actual program using the parallel structure or the template stated by that element. Using this methodology, programmers can save the time and effort of designing an appropriate parallel structure for a given problem. Once they identify the right parallel program archetype or programming paradigm, the implementation becomes simpler. This scheme works well for scientific computing, in which a set of well-known algorithms is used across many applications. In the other approach, programmers begin with a conceptual or formal description of a given problem and find an appropriate parallel structure for the algorithm through a series of suggested analysis processes [?]. This method is more algorithm-specific, and its applicability is even narrower.
2.5 Tools
In this section, we briefly introduce the tools that have been developed to help programmers write and tune parallel programs. As the task of developing a well-performing parallel program is very challenging, numerous tools have been built to help programmers. Some have been made public for a general audience, and some were used only within small research groups. Among the public tools, only a few gained attention from the parallel computing community, and even fewer were actually used by other researchers and programmers.
We present here some of the major efforts in developing parallel programming tools. Due to the sheer number of tools, we have divided them into four categories based on their functionality: program development and optimization, instrumentation, performance visualization and evaluation, and guidance. We will examine their advantages and shortcomings and discuss possible improvements. It should be noted that, in this section, we do not cover tools designed to assist with other aspects of developing parallel programs, such as serial program coding and parallel program debugging. There are numerous general program coding and editing tools. Some of the efforts in parallel program debugging include the portable debugger for parallel and distributed programs [?], Panorama [?], TotalView [?], and Assure [?]. For the tools relevant to our research, we present a detailed comparison in a later chapter.
2.5.1 Program development and optimization
In this section, we focus on tools specifically designed for program parallelization and optimization. The objective of these tools is to optimize the performance of existing programs by helping users apply various techniques. In addition to supporting manual modifications, these tools generally have automated optimization utilities that make it easy for programmers to apply the techniques to selected parts of a program. We begin with the tools for the shared memory model.
Faust is an ambitious project started at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois in the late 1980s [?]. The tool supports many aspects of programming parallel machines, providing facilities for project database management, automatic program restructuring and editing, graphic browsers for call graphs, and an event display tool for performance evaluation. It is an environment that covers a wide range of parallel programming stages, such as coding, parallelization, and performance tuning. Its emphasis on project management allows it to support a major portion of the entire program development cycle.
The Start/Pat parallel programming toolkit was developed at Georgia Tech to support the programming and debugging of parallel programs [?]. It consists of a static analyzer, Start, and an interactive parallelizer, Pat. Its main concern is parallelization; general code optimization is not supported.
Parascope is an extension of the Rn programming environment developed at Rice University [?]. Like Start/Pat, the focus of Parascope is the automatic or interactive restructuring of sequential programs into parallel form. It integrates an editor, a compiler, and a parallel debugger. The automatic transformation is conducted based on data dependence information collected by their previous tool, PTool. It provides convenient facilities for parallelization and code transformation.
Faust, Start/Pat, and Parascope are important milestones in the effort to build interactive optimization tools for parallel programs. Unfortunately, their developers have stopped maintaining these tools, and their target architectures or programming models have been abandoned. Nonetheless, their pioneering work laid the groundwork for the current generation of interactive optimizers.
PTOPP (Practical Tools for Optimizing Parallel Programs) is a set of tools for efficient optimization and parallelization developed at CSRD [?]. It was designed based on the experience gained through the optimization of applications for the Alliant FX/8 and the Cedar machine. This toolset stays at the UNIX operating system level and provides some interaction through facilities built upon the Emacs editor. Facilities are provided for execution time analysis, convenient database and file management of performance data, and a flexible interface with extensive configurability. The PTOPP toolset does not include an interactive parallelization utility, but the Polaris compiler can be invoked through its interface.
Our research effort actually started out by expanding the PTOPP utilities to integrate static analysis data from a parallelizing compiler with simulation and performance data, which were missing from the previous version. PTOPP is a set of useful tools that help make parallel programming easier, but the core need of novice programmers, namely their lack of experience, was not addressed in this project.
SUIF Explorer is an interactive optimization tool developed at Stanford University [?]. It utilizes the SUIF compiler infrastructure [?] for automatic parallelization. This tool comes with a basic performance evaluation facility: based on profile data generated from program runs, it can sort execution times to identify dominant code segments. In addition, it displays the static analysis data gathered from running the SUIF parallelizing compiler. Perhaps the highlight of the tool is its "program slicing" capability. Using this technique, SUIF Explorer allows users to select certain lines in a program source and displays the sections of code that may be affected by changes made to those lines. This utility, combined with the automatic parallelization module, provides an interactive way of tackling the task of tuning parallel programs.
Visual KAP for OpenMP [?] is a commercial interactive tool from Kuck and Associates Inc. It performs automatic parallelization on program files. However, it lacks support for manual optimization and finer-grained tuning. FORGExplorer [?] is another commercial interactive parallelization tool, from Applied Parallel Research Inc. Like most of the tools presented in this section, FORGExplorer is capable of automatically parallelizing code sections while presenting users with static analysis data such as call graphs and control and data flow diagrams.
There are a couple of important optimization tools for the message passing programming model. The Fortran D Editor [?] is a graphical editor for Fortran D that provides users with information on the parallelism and communications in a program. It obtains data dependence, communication, and data layout information through a direct interface to the Fortran D compiler and displays the information during editing sessions. This is useful knowledge in developing message passing programs, but the Fortran D Editor lacks support for automatic parallelization. Converting directive-based data parallel languages to message passing programs is challenging as it is, and automatic parallelization of sequential programs with data parallel directives has not been successful.
The same applies to CAPTools, a programming tool for the message passing model from the University of Greenwich in London [?]. The parallelization process here is semi-automatic: through a series of interactions, users make decisions on which sections should be parallelized and how to distribute work and data. CAPTools constructs a data dependence graph for the target section and uses this graph in the subsequent automatic parallelization phase. If CAPTools needs more information from users, it asks questions through the user interface. Recently, a new front end for the shared memory model using OpenMP has been added, but the details are not available as of this writing.
Instrumentation
Instrumentation is a means of obtaining performance data and is usually part of most visualization and evaluation tools' functionality. In this section, we examine general mechanisms for instrumentation in the shared memory and message passing models and discuss a few instrumentation utilities that deserve special attention.
The main concern in parallel program instrumentation varies depending on the programming paradigm. In the shared memory model, where communication between processors is fast and frequent, reducing the instrumentation overhead is an important issue. On the message passing side, an often overwhelming amount of performance data becomes a problem. To this end, some researchers have incorporated a real-time summarization utility or non-uniform instrumentation, which will be discussed later in this section. Both of these issues conflict with the ultimate goal of instrumentation: obtaining as much performance data as possible.
As mentioned in Chapter [?], detailed instrumentation of shared memory programs is not feasible without significant perturbation. Hence, most instrumentation utilities rely on simple timing information, and the task of shared memory program instrumentation is mainly inserting calls to timing routines. A problem that often arises is that timing routine calls in nested code regions cause significant overhead.
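The insertion of timing calls, and the nesting problem just mentioned, can be sketched as follows. This is a minimal illustration under our own naming, not the mechanism of any particular tool.

```python
# Sketch: timer-based region instrumentation for shared memory programs.
# Each instrumented region is bracketed by start/stop calls; a stack
# handles nesting. All names here are our own illustration.
import time
from collections import defaultdict

class RegionTimer:
    def __init__(self):
        self.totals = defaultdict(float)   # accumulated seconds per region
        self.counts = defaultdict(int)     # invocation count per region
        self._stack = []                   # open (region, start_time) frames

    def start(self, region):
        self._stack.append((region, time.perf_counter()))

    def stop(self):
        region, t0 = self._stack.pop()
        self.totals[region] += time.perf_counter() - t0
        self.counts[region] += 1

timer = RegionTimer()
timer.start("outer_loop")
for _ in range(1000):
    timer.start("inner_loop")   # per-iteration calls: this is where
    timer.stop()                # nested instrumentation overhead piles up
timer.stop()
print(timer.counts["inner_loop"])  # 1000
```

Note that the thousand start/stop calls inside the loop are themselves charged to the outer region's measured time, which is precisely the nested-region overhead problem described above.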
At its foundation, the Polaris compiler [?] is a parallelizing compiler, but it provides a powerful instrumentation utility for shared memory programs. Polaris offers several different strategies for instrumentation that allow users to control the amount and the targets of instrumentation. Recently, a new library that supports hardware counters [?] has been made compatible with the Polaris instrumentation utility. Other optimization tools capable of instrumentation include SUIF Explorer [?], FORGExplorer [?], and GuideView [?].
In the message passing programming model, the data needed for visualization and animation are traces, and several trace formats exist: the IBM PE tracing format [?], the PVM tracing format [?], the ParaGraph format [?], Pablo's SDDF (Self-Defining Data Format) [?], and the VAMPIR format [?] are some examples. The difference between these is mainly the size of the trace files. Most visualization tools for the message passing model introduced in the next section use one of these well-known formats.
Since the parallel constructs in the message passing model are libraries of functions, instrumentation takes place by intercepting these calls. For additional information, a series of checkpoints is inserted for status feedback. Instrumenting these checkpoints is relatively simple, but the resulting trace data may be unmanageably large. AIMS [?] tries to resolve this problem by automatically identifying important regions. Paradyn's approach is unique in that its instrumentation and monitoring utility enables dynamically adjustable instrumentation by providing an on-line summarization facility [?]. VAMPIR [?] offers more compact trace formats. More details on AIMS, Paradyn, and VAMPIR are available in the next section.
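Interception of library calls can be sketched as a thin wrapper that emits a trace record around each communication call. `raw_send` and the record layout below are hypothetical stand-ins, not any real library's interface.

```python
# Sketch: trace-based instrumentation by call interception. The real
# communication routine is wrapped so each call also appends a trace
# record. `raw_send` stands in for the actual library function.
import time

trace = []  # a real tool would stream these records to a trace file

def raw_send(dest, payload):
    return len(payload)  # stand-in for the underlying library call

def traced_send(dest, payload):
    t0 = time.perf_counter()
    result = raw_send(dest, payload)        # forward to the real routine
    trace.append({"event": "send", "dest": dest,
                  "bytes": len(payload),
                  "elapsed": time.perf_counter() - t0})
    return result

traced_send(1, b"hello")
traced_send(2, b"worlds!")
print([(r["dest"], r["bytes"]) for r in trace])  # [(1, 5), (2, 7)]
```

Because every send and receive produces a record, trace volume grows with message count, which is exactly why the tools above resort to region selection, on-line summarization, or compact formats.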
The developers of TAU [?] at the University of Oregon chose a different approach to program instrumentation. TAU is a toolset designed for profiling, tracing, and visualizing parallel program performance. TAU's instrumentation utility can generate either timing profiles or trace files depending on the target application. When timing profiles are generated, static viewers are used to present summary information; for trace files, a trace visualizer is used. The instrumentation library is developed for multiple languages such as C, C++, Fortran, HPF, and Java, thus significantly broadening its applicability. However, the instrumentation process is manual: users need to specify which functions should be instrumented and associate them with a set of groups. For very large programs, this can be very cumbersome, especially when users have little knowledge of the program at hand.
Performance visualization and evaluation
Performance visualization refers to the transformation of numeric performance data into meaningful graphical representations. Visualization helps users gain insight into the behavior of parallel programs so that they can better understand the programs and improve their performance. Performance visualization is often a stepping stone to performance evaluation and problem identification. Performance visualization can be either dynamic or static. Dynamic visualization tools use graphical animation to illustrate the dynamic behavior of the program under consideration; the animation can take place either during program execution or after program termination through trace simulation. Static visualization displays a summary of performance characteristics in charts and graphs.
GuideView from the KAP/Pro toolset [?] is a typical static visualization tool. However, it targets the shared memory model and does not use traces; an instrumented run-time library generates and summarizes timing information. Using charts and graphs, GuideView illustrates what each processor is doing at various levels of detail via a hierarchical summary. Its intuitive, color-coded displays make it easy to assess the target application's performance. However, due to the high overheads incurred by the instrumentation, the resulting graphs may not reflect accurate real-time performance. The Fortran D Editor [?], SUIF Explorer [?], FORGExplorer [?], and DEEP/MPI [?] are also capable of graphical presentation of performance data, but their uses are limited to simple displays of the execution time of code blocks. DEEP/MPI targets MPI programs but does not provide a display of traces; instead, it shows resource usage and timing charts.
RACY from the TAU project [?] has performance viewing utilities consisting of a tabular text report and several static charts. The displayed information involves mostly timing profiles. As mentioned above, the TAU instrumentation utility is capable of generating trace files for message passing programs; instead of writing their own trace viewer, the developers decided to use VAMPIR [?], which is also discussed in this section.
As for the static display of traces, NTV [?] summarizes traces from message passing program execution and presents users with summary charts and timeline graphs, as shown in Figure [?]. This type of graph helps users understand the load distribution, stalls, and communication structure of the program. PMA from the Annai Tool Environment [?] is a graphic utility similar to NTV; Annai integrates this information with its source viewer for easier reference. XMPI from the LAM project [?] offers a similar view, although its main goal is the debugging of MPI programs. TraceView is a pioneering work in timeline display [?]; it generates timeline graphs for both shared memory and message passing programs through different runtime libraries. In both cases, trace files are used. However, its graphics are not as refined as those listed above, and the displayed data for shared memory programs are limited due to the nature of the shared memory programming model.
Fig. [?]. The timeline graph from NTV.
ParaGraph [?], Upshot [?], AIMS [?], Scope [?], and VAMPIR [?] are tools for animated postmortem visualization of program behavior based on trace simulation. The advantage of trace simulation is that the speed of the graphic animation can be adjusted (with the exception of ParaGraph), so that events that are difficult to observe in real time can be slowed down for better understanding. ParaGraph was a pioneering effort in performance visualization from the University of Illinois. The tool is visually elaborate, but its practical value is limited by a few missing features, such as the ability to set the speed of replay and the lack of appropriate annotation. Furthermore, the target and the framework of the graphic presentation are pre-determined by the developers, so users have little freedom to view other aspects of program behavior from different perspectives. Upshot has a feature to adjust speed, but it lacks features such as a dynamic call graph or a communication diagram. AIMS is an automated instrumentation and monitoring system from NASA; it displays dynamic program behavior through animated and summary views. AIMS adds a modeling module that provides a means of estimating how the program would behave if the execution environment were modified. Figure [?] shows a screenshot of AIMS in use. The goal of Scope is extensibility: Scope allows users more freedom to arrange performance data into new displays. VAMPIR adds a zoom utility, allowing users to examine performance data at varying levels of detail. All these tools target message passing programs.
Pablo [?], Paradyn [?], XPVM [?], PVaniM [?], and Falcon [?] can animate the behavior of a program while it is running. This monitoring capability is achieved by periodically updating graphs and charts with newly available runtime data from the executing application. However, events that occur frequently for a very short period of time cannot be traced and displayed. For this reason, XPVM and PVaniM have utilities to play back the generated traces, and the other tools generate summary statistics. Even so, visualizing important events during the execution of a shared memory program in an animated fashion is not feasible, because these events, such as writes to shared variables, happen too frequently and in too great a number. These tools visualize the events during message passing program execution.
Pablo, a performance evaluation tool developed at the University of Illinois, is perhaps the most successful tool currently in use [?]. It uses adaptive instrumentation control to reduce the perturbation of instrumentation as it executes. The resulting trace files are used to produce graphical displays of the program performance. Pablo also has a sonification utility and 3-D support that convey more information to its users through a multimedia experience. The combined effort with the Fortran D Editor [?] now allows Pablo to integrate performance data with a program development environment. However, the lack of appropriate annotation and a complex visual interface impose a steep learning curve on users. Figure [?] presents a snapshot of Pablo's graphical data presentation.

Fig. [?]. The graphs generated by AIMS.
The Paradyn Parallel Performance Measurement Tool, developed at the University of Wisconsin at Madison, is characterized by an instrumentation scheme that dynamically controls overheads by monitoring the cost of data collection [?]. The basic paradigm of instrumentation, execution, and visualization is the same as that of Pablo, but due to the dynamic nature of its instrumentation scheme, the tool is particularly useful when the application at hand is very large or long-running. The tool also contains a visualization facility that generates real-time tables and histograms, although it is not as extensive as that of Pablo.

Fig. [?]. The graphs generated by Pablo.
XPVM is a graphical user interface for PVM that displays both real-time and postmortem animations of message traffic and machine utilization by PVM applications [?]. While an application is running, XPVM displays a space-time diagram of the parallel tasks, showing when they are computing, communicating, or idle. XPVM stores events in a trace file that can be replayed and paused to analyze the behavior of a completed execution.
PVaniM specifically targets network computing environments [?]. The performance factors that are unique to networked environments require careful consideration in performance visualization. PVaniM addresses these network issues, such as possible heterogeneity, low network bandwidth, and clock skew, in its design. Its playback utility also adds to its usefulness by allowing users to examine details that may have been missed during real-time monitoring.
The principal aspects of Falcon are its abstractions and accompanying tools for the analysis of application-specific program information and on-line steering [?]. The term "application-specific" means that users choose which aspects of dynamic behavior to monitor and steer, beyond a predetermined set of parameters. In addition, Falcon provides support for the on-line graphical display of the information being monitored. The Falcon developers used the POLKA system [?] for its animated and static performance views.
The metrics supported by these animation tools include CPU utilization, memory usage, floating point operations, message size, and so on. They help programmers identify the bottlenecks in the execution of message passing programs. The advantage of these types of tools lies in providing different views of program execution by visualizing the temporal behavior of the target program. This is particularly important when processor communication is relatively sparse and visible, as in the message passing programming model, where bottleneck identification easily leads to well-known resolution techniques such as a different data distribution, combining messages, or algorithm modification.
The ability to monitor real-time performance presents opportunities for performance steering. To this end, the developers of Pablo, Paradyn, PVaniM, and Falcon have implemented performance steering facilities. In fact, the main focus of Falcon has been performance steering from the beginning of its development. Typically, users provide or select a set of parameters that they want to manipulate during program execution, and they are able to do so at various checkpoints inserted into the target program. Performance steering is not our concern in this research, so we will not go into any more detail.
Finally, CUMULVS [?] takes a different approach to performance visualization. As an extension to PVM, CUMULVS is a library of functions that users can insert into programs to visualize the behavior of a parallel program in real time. The instrumentation task is shifted to programmers, but this gives users the flexibility to choose what type of data they want to view. The CUMULVS data collection utility can be used with several front-end visualization systems. CUMULVS also supports program steering through checkpoints.
Guidance
The term "performance guidance" is used in many different contexts in the parallel programming field. Generally, it means taking a more active role in helping programmers overcome the obstacles in tuning programs. With so many available tools for the instrumentation and visualization of raw data, the task of extracting meaningful information is becoming increasingly burdensome. In this section, we discuss several tools that support this functionality. Accommodating novice programmers and automating the performance evaluation process are important issues in parallel programming, and they are among the focuses of our research. However, we found only a few efforts addressing these subjects.
SUIF Explorer's Parallelization Guru bases its analysis on two metrics: parallelism coverage and parallelism granularity [?]. These metrics are computed and updated when programmers make changes to a program and run it. It sorts profile data in decreasing order to bring programmers' attention to the most time consuming sections of the program. It is also capable of analyzing data dependence information and highlighting the sections that need to be examined by its users.
The Paradyn Performance Consultant [?] discovers performance problems by searching through the space defined by its own search model. The search process is fully automatic, but manual refinements to direct the search are possible as well. The result is presented to the users through graphical displays. DEEP/MPI [?] features a similar advisor that gives textual information about message passing program performance. The DEEP/MPI advisor's analysis is hard-coded, and the analysis is limited to subroutines or functions.
PPA [?] proposes a different approach to tuning message passing programs. Unlike the Parallelization Guru, the Performance Consultant, and DEEP/MPI, which base their analysis on runtime data and traces, PPA analyzes a program source and uses a deductive framework to derive the algorithmic concept from the program structure. Compared to other programming tools, the suggestions provided by PPA are more detailed and assertive; the solution for one example in [?] was to replace an inefficient algorithm.
The Parallelization Guru, the Performance Consultant, and DEEP/MPI basically tell the user where the problem is, whereas the expert system in PPA takes the programming environment a step toward an active guiding system. However, the knowledge base for the expert system relies on understanding the underlying algorithm through pattern matching, and an expert system that understands the full variety of parallel algorithms is nearly impossible to build. Due to the complexity required, problem identification is done by other tools and hand analysis, and the suggestions provided by the tool consider only parallel constructs, which also limits its usage. Because of the lack of performance evaluation and tuning support, PPA cannot be considered a programming environment, but the effort to develop a performance guiding tool is worth noting.
Utilizing Web Resources for Parallel Programming
One of our objectives is to reach a general audience with our methodology, tools, and optimization study results. We have taken the Internet computing approach to address this issue. Thus, we focus our attention on previous efforts that attempted to utilize the Web to provide a programming environment and to establish on-line repositories.
Many of the systems and technologies that currently allow computing on the Web support a single tool or a relatively small set of tools. They include PUNCH [?], MOL [?], NetSolve [?], Ninf [?], RCS [?], VNC [?], WinFrame [?], Globus [?], and Legion [?]. More detailed descriptions of these systems are found in [?].
As for benchmark repositories, several Web tools offer performance numbers for various benchmarks [?]. Typically, the presented data are timing numbers such as overall program performance or specific timings of communication in message passing systems. Extensive characteristics of the measured programs are usually not part of the on-line databases; the user has to obtain such information, which is often necessary for interpreting the numbers, from separate sources. Furthermore, these repositories do not provide information gathered by other tools, such as compilers or simulators, and consequently they do not support the comparison or the combined presentation of performance aspects and program characteristics.
Our effort to resolve these problems with the previous research efforts unfolds in two ways. First, we have used PUNCH, a network computing infrastructure [?], to construct an integrated, Web-accessible, and efficient parallel programming tool environment. PUNCH allows remote users to execute unmodified tools on its resource nodes. More detailed descriptions of PUNCH are found in Section [?]. Second, our results on performance enhancement with various applications have been made accessible through an Applet-based browser, which allows not only examining the raw data but also manipulating and reasoning about the information. This facility is explained in more detail in Section [?].
Conclusions
Thus far, we have studied general concepts and paradigms in parallel programming. We have also looked at general trends in parallel programming models and supporting tools. We have learned that there have been numerous attempts to aid parallel programmers through various tools. However, these tools are generally not based on a programming methodology and tend to focus on one specific aspect of the optimization process. In addition, a brief discussion has been given on enhancing tool accessibility via the Web.
It seems that tools supporting the shared memory model place more emphasis on static analysis and automatic code transformation, while those supporting the message passing model mainly focus on performance visualization. This is not surprising, considering that the shared memory model enables structured program-level parallelism but makes instrumentation expensive, whereas in the message passing model, events are relatively explicit and sparse, but automatic parallelization is difficult.
Several tools have attempted the integration of different aspects of parallel programming. Pablo and the Fortran D Editor [?] opt for the integration of program optimization and performance visualization, but their visualization utilities, although highly versatile, are difficult to comprehend and offer little to help programmers in deductive reasoning. The Fortran D Editor's lack of automatic parallelization capability also limits its utility, especially among novice programmers. SUIF Explorer [?] and FORGExplorer [?] have a similar goal, but their performance analysis utilities serve only the complementary purpose of directing programmers to time-consuming code regions. The KAP/Pro Toolset [?] consists of useful tools but does not support manual tuning. The focus of the Annai Tool Project [?] is limited to the aspects of parallelization, debugging, and performance monitoring. Faust [?] may be the most comprehensive environment to date, encompassing code optimization and performance evaluation. However, many aspects of Faust are not suitable for modern parallel machines, and it is no longer maintained by its developers. Also, there is the issue of active user guidance, which none of the optimization tools supports. Apart from the missing functionality, the problems with these tools (and most other tools discussed in this chapter) are the lack of continuous support, system compatibility, scalability (the effort to add new tools or features), and accessibility (not being available and being difficult to learn).
The quality of the visualization of the performance and structure of parallel programs provided by today's tools has reached an impressive level. Almost every aspect of parallel program execution can be viewed in user-friendly displays. Parallel execution events and resource utilization summaries are presented via colorful graphs, charts, animation, and even sound effects. We believe that the next step in assisting programmers in performance evaluation should be support for the comprehension of and deductive reasoning about performance data. As the user base of affordable parallel machines keeps expanding, this aspect of performance evaluation becomes increasingly important.
"A lot of smart people are developing parallel tools that smart users just won't use." This sentence, quoted from [?], summarizes well some of the problems with tool development over the years. Many tools have ended their lives unused by anyone other than their developers. Perhaps this is because the tool developers focused their attention only on specific stages in parallel program development, disregarding the big picture. In many cases, the developers created the tool that they thought would be useful based on their experience in their own environment. Another reason could be the lack of effort from developers in providing convenient access to their tools. The conventional approach to promoting tool usage has always been telling users what the tool can do and explaining what to do with it; not enough consideration has been put into actually letting users try the tools. We advocate the importance of a programming and optimization methodology once more, because knowing exactly what must be done at each stage of parallel program development leads to an effort to understand and appreciate the tool functionality that fits users' needs. With active motivation to reach a larger audience with an integrated methodology and a toolset, we may have a better chance.
SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY
In this chapter, we outline our proposal for a methodology for the shared memory programming model. We believe that the programming style of this model allows a systematic approach to program tuning that is far more detailed and organized than the simple descriptions found in general guidelines. The programmers' task in this scheme is to follow the steps suggested by the guidelines and apply the appropriate techniques.
Introduction: Scope, Audience, and Metrics
Before presenting the methodology, we first discuss its scope and target audience, as well as the metrics used in the methodology.
Scope of the proposed methodology
Figure [?] shows a typical shared memory program development cycle. The software design and implementation part, inside the dashed box, has been simplified in this figure. The issues in these stages include planning, design, coding, testing, and debugging. This is quite a complex topic, and there is a sophisticated set of methodologies, remedies, metrics, and tools for helping programmers in this matter [?]. We will not discuss general software engineering issues any further in this proposal.
In this research, we focus on the parallelizing and tuning process (the box enclosing parallelization/tuning, program compilation, program execution, and performance evaluation). We assume that programmers have a working serial program. Developing a sequential program is orthogonal to parallel processing, and we assume that most programmers follow one of the existing software engineering practices. Our effort attempts to resolve the difficulties and problems associated with parallelizing and optimizing sequential programs.

Fig. [?]. Typical parallel program development cycle.

Also, notice that we do not consider the application level approach (explained in Chapter [?]) to parallel program development. Finding parallelism at the algorithm level and incorporating it while writing a program is a different subject, in that it requires a new perspective on examining algorithms, identifying parallelism, dividing and balancing tasks, and incorporating them into the source code. As pointed out in the introduction, the sheer number of variables in this approach is so large that finding a systematic programming methodology would be extremely difficult. Some tips can be found in the literature, such as [?], as well as in some of the programming methodologies introduced in Chapter [?].
Target audience
We assume that our target programmers are familiar with programming and compilation: they should be able to write, debug, compile, and run a sequential program. Also, they should know at least the basics of how parallel processing works in the shared memory programming model. It helps to understand the underlying shared memory architecture, because certain machine dependent parameters have a significant impact on program performance. To follow our methodology, it is not necessary to be an experienced parallel programmer; however, even for experienced programmers, the methodology serves as an efficient strategy for parallel programming.
We divide our target audience into two groups: novice and advanced programmers. The word "novice" means new to parallel programming, not to programming in general. The novice programmer group consists of those given the task of parallelizing a sequential program or writing a parallel program without much prior experience with the process. They resort to a methodology mainly for the guidelines and suggestions that make up for their lack of experience. They need to get a feeling for what the available techniques are and how they can be applied. The supporting tools must take this into account to make the learning curve as smooth as possible.
The need of advanced parallel programmers lies in the supporting utilities. The methodology aids them in efficiently structuring the approach they have already been taking. They have a good idea of what tasks have to be done in each stage, and they desire effective tools that accelerate tedious tasks. They would like the tools to be flexible, so that they can configure them to fit the specific tasks of their interest.
Metrics: understanding overheads
Performance evaluation is an important stage in parallel programming. The evaluation process consists of finding performance problems and possible techniques for improvement. Finding problems requires definitions of performance problems; in other words, programmers should know which phenomena constitute performance problems. Without definitions, problems cannot be found. Metrics are used to formalize performance problems.
In our methodology, the performance evaluation process begins with identifying dominant and problematic code sections. A metric system provides a means of efficiently identifying bottlenecks in the presence of a possibly large amount of information. As overhead analysis is a critical part of the methodology, we introduce in this section a couple of perspectives on parallel program performance and the related metrics. The main attention of these systems goes to "overhead".
One common way to view performance overhead is described well in [?], in which a programmer needs to identify two factors contributing to the overall overhead: parallelization and spreading overheads. Our tuning strategy in the proposed methodology is based on this overhead model.
Parallelization overhead. This refers to the overhead introduced by transforming a program into parallel form. It is often identified by comparing the execution time of the serial version with that of the parallel version run on one processor. The main reason for this overhead is that the code inevitably gets augmented for parallelization.

The parallelization overhead of a parallel loop is computed as

T_parallelization = T_1-processor parallel execution - T_serial execution

The factors that contribute to parallelization overhead are listed below.
1. Instructions needed for parallel execution. The instructions for tasks such as fork, join, and barriers are necessary for parallel execution. These increase the code size and cause unavoidable overhead.

2. Instructions needed for code transformation. Some parallelization techniques require code changes that may incur overhead. For instance, the reduction technique requires a separate preamble and postamble, and the induction technique may introduce a complicated expression in each iteration that was not part of the original code.

3. Inefficient optimization. Code-generating compilers perform fewer optimizations on a parallel code section (compared to the original, serial code), leading to less efficient code.
Parallelization overhead may be amortized if the loop runs significantly longer than the overhead time. On the other hand, frequent invocation of a very small parallel loop can cause serious degradation in performance.
Spreading overhead The execution model of a shared memory architecture is basically as follows: at the beginning of a program, a process forks multiple threads, and the master thread among them wakes the others up whenever it encounters a parallel section. The time to wake the other threads is an unavoidable overhead. Spreading overhead usually increases as more processors are used in program execution.
The spreading overhead is computed as

T_spreading(P) = T_parallel execution(P) - T_1-processor parallel execution / P

where P denotes the number of processors.
Some of the reasons for spreading overhead are given below.
1. Startup latency. This refers to the time to initiate parallel execution on multiple threads. Naturally, the more threads run, the larger the overhead. One way to reduce it is to merge adjacent parallel regions into one, making each parallel section as large as possible.
2. Memory congestion. Because data are shared in a common memory, heavy traffic on the memory bus may slow down parallel execution. One possible remedy is to increase the locality of loops to reduce bus traffic.
3. Coherence traffic. Sharing data also requires coordination, which adds overhead for legitimate data invalidation.
4. False sharing. Depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent unnecessary invalidations.
5. Load imbalance. Tasks may be unevenly distributed over multiple processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited by the remainder.
Another perspective on overhead is provided in ���� ����. Hardware counters available on most modern machines provide detailed statistics on the dynamic behavior of parallel programs. Yet the measured values do not necessarily translate into parallel programming terms. The proposed model defines four overhead components (memory stalls, processor stalls, code overhead, and thread management overhead) based on the hardware counter data. Each component is clearly defined, and the possible contributing factors and remedies are also given. This model provides more detailed insight into the overhead characteristics of parallel loops. For instance, a loop may exhibit small parallelization and spreading overheads, while memory or processor stalls indicate a problem. We have just begun to explore this new system, and more work needs to be done to incorporate it into tool development. The drawback of this model is that obtaining the necessary data is tedious and very time-consuming. The traditional parallelization and spreading overhead model still serves as the primary measure for performance analysis for many programmers, and it will continue to do so in the near future.
��� Parallel Program Optimization Methodology
In the past, we have participated in several research efforts in parallelizing programs for different target architectures ���� ��� ���. At first, we belonged to the category of novice programmers. After a great deal of trial and error, we developed a structured approach to successful parallelization of programs. As the number of programs that we dealt with increased, our general methodology went through several stages of adjustment and improvement. Finally, we felt the need to write it down so that a wider range of programmers could benefit from the efficiency it provides. Thus, we started the process of refining our methodology to improve both its efficiency and its practicality.
Figure ��� shows an overview of the parallelization and optimization steps outlined by our proposed methodology. There are two feedback loops in the diagram. The first serves to adjust the instrumentation overhead. The second is the actual optimization loop, consisting of the application of new techniques and their evaluation.
Our methodology envisions the following tasks when porting an application program to a parallel machine and tuning its performance. We start by identifying the most time-consuming code section of the program, optimize its performance using several recipes, and then repeat this process with the next most important code section. The most important code blocks for parallel execution in our programming paradigm are loops. Hence, we profile the program execution time on a loop-by-loop basis. We do this by instrumenting the program with calls to timer functions. The timing profile not only allows us to identify the most important code sections, but also to monitor the program's performance improvements as we convert it from a serial to a parallel program. However, as the diagram shows, programmers may need to adjust the amount of profiling due to the accompanying overhead. The first step of performance optimization is to apply a parallelizing compiler. If no such tool is available, or if we are not satisfied with the resulting performance, we can apply program transformations by hand. We will describe a number of such techniques. The following sections describe all these steps in detail.
����� Instrumenting program
Instrumentation is a means to obtain performance data. Typically, in the shared memory model, profiling routines that record the necessary data are inserted into the code. As a result, one or more profiles are generated at the end of the program execution. There are other methods to instrument a program at the assembly level, which we do not consider in this research. Program instrumentation is an important step in optimizing program performance: the profiles from instrumented program runs provide the basis for performance evaluation and optimization. It should be determined beforehand what types of code blocks should be instrumented. In the directive-based shared memory programming model, loops are usually the basic blocks for instrumentation, because they are the basic sections considered for parallelization. The metrics for measurement can vary, but they should conform to the goal of the optimization. There are utilities for measuring various aspects of program execution; the most widely used measure is execution time.

[Flowchart: Instrumenting Program -> Getting Serial Execution Time -> Running Parallelizing Compiler -> Manually Optimizing Program -> Getting Optimized Execution Time -> Speedup Evaluation -> Finding and Resolving Performance Problems. If the speedup is unsatisfactory, the optimization and evaluation steps repeat; a second feedback loop reduces the instrumentation overhead; the process ends when the speedup is satisfactory.]

Fig. ���. Overview of the proposed methodology.
As the first step, programmers should instrument the serial program. The purpose of this step is to understand the distribution of execution time within the program and to identify the code segments worth the optimization effort. Therefore, it is desirable to obtain as much timing data as possible throughout the target program. For instance, programmers may decide to instrument all the loops in a given program.
Unfortunately, most instrumentation methods introduce overhead. This has to be considered very carefully, because it not only affects the program's performance, but can also skew the execution profile so that the programmer targets the wrong program sections. Our methodology suggests the following remedies.
- Programmers should make sure that they run the program both with and without instrumentation. They should proceed only after they have verified that the perturbation is small.
- In order to reduce overhead, programmers should remove instrumentation from innermost loops (innermost code sections, in general). They may need to find out the overhead per call of the instrumentation library. If the initial profile shows code sections whose average execution times are less than two orders of magnitude larger than this overhead, the corresponding instrumentation should be removed.
- Programmers should add instrumentation after they run the code through a parallelizing compiler. Compilers can usually apply fewer optimizations in the presence of many subroutine calls, and source-level instrumentation generally takes the form of inserted subroutine calls. If an assembly-level instrumentation tool is available, this is less of a problem.
- Programmers should be careful when adding instrumentation inside a parallel loop or region. Instrumentation libraries may assume that these function calls are made from serial program sections only.
- It is desirable for programmers to make sure that the instrumented code segments in the optimized program match those instrumented in the sequential program, so that side-by-side comparisons can be made in the performance evaluation stage.
There is an obvious dilemma: if programmers remove too many instrumentation points, the profile becomes less useful. They should leave the instrumentation in place at least for all those program sections that they may later try to tune.
����� Getting serial execution time
Program execution may be affected by many factors: processor speed, architecture, operating system, system load, network load such as file IO requests, and so on. The program resulting from this optimization process is subject to all these factors. However, to accurately measure the effect (whether positive or negative) of the techniques applied during the optimization process, it is very important to eliminate these external factors during instrumented program runs. One way to ensure an uninterrupted environment is to use "single-user time", a period during which only one user is allowed on the system. In this way, programmers can avoid unnecessary overheads caused by context switching, external file IO, and so on.
����� Running parallelizing compiler
Parallelizing compilers can analyze the input program, detect parallelism, and automatically generate appropriate directives for the detected parallel regions. Parallelizing compilers relieve parallel programmers of the task of parallelizing all loops manually. They are especially useful when the loops under consideration have complex structures for which human analysis is cumbersome. State-of-the-art parallelizing compilers include many advanced techniques for parallelization and optimization.
It is important to note that relying entirely on parallelizing compilers for optimization may not result in optimal performance. Compilers base the techniques that they apply on static analysis of the input program, which may not accurately reflect the program's dynamic behavior. Modeling the dynamic characteristics of programs is very difficult. For this reason, programmer intervention may be necessary to achieve near-optimal performance: compensating for the compiler's lack of knowledge of the dynamic behavior of a program is the key to obtaining good performance.
Nonetheless, running a parallelizing compiler is a good starting point. It can save programmers a significant amount of time that would otherwise be spent analyzing all the loops in a program. For novice programmers, manually parallelizing loops may be cumbersome to begin with. In addition, most parallelizing compilers are capable of generating a listing of their static analysis results, which may provide programmers with valuable information on various code sections.
In our methodology, we do not assume that programmers necessarily have access to a parallelizing compiler. If no compiler is available, the first set of techniques to apply should be those for parallelization, described in the next section.
����� Manually optimizing programs
Manual optimization allows users to make up for the compiler's shortcomings. If a programmer has run a parallelizing compiler, the static analysis information generated by the compiler (in the form of listing files) can help the programmer better understand the problems at hand. Running instrumented programs offers insight into the program's dynamic behavior. Combined with the programmer's knowledge of the underlying algorithm and physics, these data provide vital clues for improving the performance.
In our methodology, we have divided various well-known techniques into four categories: parallelization techniques, parallel performance optimization techniques, serial performance optimization techniques, and other techniques. Parallelization techniques involve parallelizing code segments. Parallel performance optimization techniques may improve the performance of already parallel sections. Serial performance optimization techniques aim to improve the performance of code sections whether they are serial or parallel; some of them may result in a super-linear speedup if they are not also applied to the serial program that serves as the performance reference point. Locality enhancement techniques are typical examples. The techniques in the "other" category have no effect on performance by themselves; however, they may enable other, previously inapplicable techniques. The benefits of the techniques described below can vary significantly with the underlying machine. The judgment about which techniques to apply to a given program should be based on accurate performance evaluation, which is discussed in the subsequent section.
We give brief descriptions of the techniques that we have used to improve program performance. More detailed descriptions and theoretical background can be found in ��� ��� ����.
Parallelization techniques
Privatization Privatization seeks to remove spurious data dependences. Often, scalar variables and arrays are used as temporary storage within an iteration of a loop; if a private copy of such a variable is provided to each iteration, the loop may be parallelized. More conservatively, a single copy may be provided to each of the participating processors. For example, in Figure ���, variable X is used as temporary storage within a loop. By allowing separate copies of X for all participating processors, seemingly serial code can be executed in parallel. In some cases, the temporary storage may be an array, as shown in Figure ���.
Reduction Scalar reductions are recurrences of the form sum = sum + expr, where expr is a loop-variant expression and sum is a scalar variable. Loops that contain such recurrences cannot be executed in parallel without being restructured, since values are accumulated into the variable sum. One way of addressing this situation is to calculate local sums on each processor and combine these sums at the completion of the loop. Figure ��� shows an example of such a scalar
(a)
DO I = 1, n
   X = ...
   ... = X
ENDDO

(b)
!$OMP PARALLEL DO PRIVATE(X)
DO I = 1, n
   X = ...
   ... = X
ENDDO

Fig. ���. Scalar privatization: (a) the original loop and (b) the same loop after privatizing variable X.
(a)
DO I = 1, n
   DO J = 1, m
      A(J) = ...
   ENDDO
   DO J = 1, m
      ... = A(J) ...
   ENDDO
ENDDO

(b)
!$OMP PARALLEL DO PRIVATE(J, A)
DO I = 1, n
   DO J = 1, m
      A(J) = ...
   ENDDO
   DO J = 1, m
      ... = A(J) ...
   ENDDO
ENDDO

Fig. ���. Array privatization: (a) the original loop and (b) the same loop after privatizing array A.
reduction operation and its transformed version in OpenMP. OpenMP provides a construct for identifying reduction operations of type addition, multiplication, maximum, and minimum.
(a)
DO I = 1, n
   sum = sum + A(I)
ENDDO

(b)
!$OMP PARALLEL DO SHARED(A)
!$OMP+ REDUCTION(+: SUM)
DO I = 1, n
   sum = sum + A(I)
ENDDO

Fig. ���. Scalar reduction: (a) the original loop and (b) the same loop after recognizing reduction variable SUM.
In addition to scalar reductions, array reductions must be addressed, as it has been shown that array reduction recognition is one of the most important transformations in real applications. Array reductions, like scalar reductions, are summations; however, they are of the form A(ind) = A(ind) + expr, where the value of the subscript ind of A cannot be determined at compile time. Therefore, local sums must be accumulated for each element of A and combined at the loop's completion. Figure ��� shows such a reduction operation. The constant No_Of_Procs holds the number of participating processors, and the function call Get_My_Id() returns the identification of the processor executing the iteration. The two additional loops for initialization and final summation are called the preamble and the postamble, respectively.
Induction Induction variables are variables that form a recurrence in the enclosing loop. Figure ��� shows an example of a simple induction expression as well as its transformed form, which has no loop-carried dependences. Induction variable substitution must first recognize variables of this form and then substitute them with a closed-form solution.
(a)
DO I = 1, n
   A(ind) = A(ind) + B(I)
ENDDO

(b)
DO I = 1, No_Of_Procs
   DO J = 1, Elements_In_A
      A2(J, I) = 0
   ENDDO
ENDDO
!$OMP PARALLEL DO SHARED(A2, B, No_Of_Procs)
DO I = 1, n
   A2(ind, Get_My_Id()) = A2(ind, Get_My_Id()) + B(I)
ENDDO
DO J = 1, Elements_In_A
   DO I = 1, No_Of_Procs
      A(J) = A(J) + A2(J, I)
   ENDDO
ENDDO

Fig. ���. Array reduction: (a) the original loop and (b) the same loop after recognizing reduction array A.
(a)
X = 0
DO I = 1, n
   X = X + 2*I
   A(X) = ...
ENDDO

(b)
!$OMP PARALLEL DO SHARED(A)
DO I = 1, n
   A(I*(I+1)) = ...
ENDDO

Fig. ���. Induction variable recognition: (a) the original loop and (b) the same loop after replacing induction variable X.
This transformation allows the original loop shown in Figure ���(a) to be executed in parallel. Unfortunately, if there are many enclosing loops and complex induction variables, the closed-form induction expressions may become rather expensive to compute. If these expressions are used often, they can introduce significant overhead.
Handling IO If the IO statements within a loop are necessary for program execution and the order of the IO statements has to be preserved among loop iterations, the loop cannot be parallelized. In other cases, the loop can still be parallelized by using one of the following methods.
- If the IO is not absolutely necessary, it can simply be removed. For instance, if the IO was inserted for debugging purposes or as execution status reports, deleting the IO statements will not affect the execution.
- In cases where IO is needed to report the status of an array, the loop may be distributed into two loops: one for computation and the other for IO. The resulting loop containing only IO cannot be parallelized, but the loop containing only computation may be parallelizable.
Handling subroutine and function calls If a loop contains a subroutine or function call, most parallelizing compilers make the conservative decision not to parallelize it. The programmer has to make sure that the subroutine or function has no side effects before manually parallelizing such a loop.
Also, depending on the implementation of the parallel constructs, parallel sections inside a function or subroutine that is already running in parallel may have unexpected effects. If a programmer decides to execute a subroutine or function within a parallel block, it is advisable to remove the parallel constructs within that subroutine or function. Another possible solution is to inline the called function or subroutine, if its size is reasonably small. More details on inlining are presented later in this section.
Parallel performance optimization techniques
Parallelization introduces overhead that clearly affects execution time. Programmers must be aware that parallelization may even degrade the performance of some code sections. We presented the parallelization and spreading overhead model in Section ����. The techniques listed below aim to further improve the performance of already parallel code sections. They mainly seek to reduce the overhead introduced by parallelization.
Serialization In many cases, the effect of an optimization is not entirely predictable. Furthermore, if programmers use a parallelizing compiler, the compiler may cause some code sections to perform worse. Sometimes, parallelizing a code segment simply does not pay off. For instance, if the execution time of a loop is of the same order as the parallelization overhead, its parallel execution is likely to perform worse than the serial version. If no other techniques can further improve the parallel section, simply removing the parallel directives can at least prevent degradation.
This technique is highly machine-dependent. The benefit of parallelization depends on many machine parameters: cache and memory size, bandwidth, processor speed, IO efficiency, and the operating system. If the target program is to be used on various architectures, programmers should make a cautious decision as to which segments should be converted back to serial, based on a study of those architectures. A useful strategy is to serialize those loops or code sections whose timing profiles show no improvement from any parallelization and tuning attempts. It is also advisable to monitor the performance of those loops whose execution time is less than an order of magnitude larger than the fork/join overhead. The fork/join overhead can be measured as the difference in execution time of an empty parallel loop between parallel and serial execution.
It should be noted that serialization itself can have a negative impact. The idea of serialization is to restore a code segment to its original state, but due to cache effects, the execution may slow down compared to the same code section in the untouched version. For instance, a small serial loop right between two large parallel loops may cause significant cache misses due to the distribution of the data across the caches.
Handling false sharing Depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent invalidations. This may be prevented by applying one of the two techniques described below.
- Programmers may try to modify the array access patterns by scheduling tasks that access adjacent regions on the same processor. An example is given in Figure ���.
- Another solution is padding. By adding empty data items to a shared array, one may avoid false sharing by separating data into individual cache lines. However, this may have negative effects due to the increase in data size. Figure ��� shows an example of padding. It should be noted that changing array declarations can have global and interprocedural effects: all uses of the modified arrays must be changed to use the new dimensions.
Scheduling A directive language usually comes with several options for scheduling. Scheduling in parallel programming means telling the underlying machine how
(a)
!$OMP PARALLEL
!$OMP DO
DO I = 1, 3
   DO J = 1, N
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

(b)
!$OMP PARALLEL
DO I = 1, 3
!$OMP DO
   DO J = 1, N
      A(I, J) = B(I, J)
   ENDDO
!$OMP END DO NOWAIT
ENDDO
!$OMP END PARALLEL

Fig. ���. Scheduling modification: (a) the original loops and (b) the same loops after moving the work-sharing construct inside the loop nest. In (b), the inner loop is executed in parallel, so the processors access array elements that are at least one chunk of iterations apart.
(a)
REAL A(4, N), B(4, N)
...
!$OMP PARALLEL
!$OMP DO
DO J = 1, N
   DO I = 1, 4
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

(b)
REAL A(32, N), B(32, N)
...
!$OMP PARALLEL
!$OMP DO
DO J = 1, N
   DO I = 1, 4
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

Fig. ���. Padding: (a) the original loops and (b) the same loops after padding extra space into the arrays, so that each column begins on its own cache line.
the tasks should be distributed across the processors. In the Fortran case, if a loop iterates from 1 to 100, multiple processors allow many ways to split the iterations. Depending on the loop structure, scheduling can make a significant difference in performance. Locality and false sharing are the two most important factors affected by employing different scheduling schemes. The OpenMP directive language provides four different options for scheduling ���. Some scheduling schemes incur more overhead due to the required bookkeeping. Programmers are advised to examine the loop structure before trying a different scheduling mechanism.
- static: Each processor is assigned a contiguous chunk of iterations. If the amount of work in each iteration is approximately the same, and there are enough iterations for an equal distribution, this scheduling will do fine.
- dynamic: A processor is assigned the next iteration as it becomes available. This is useful if the loop has varying amounts of work per iteration. The overhead is usually higher than that of static scheduling, but if the program is to run in a multi-user environment, its better load balancing properties can improve performance.
- guided: The same as dynamic scheduling, but the number of iterations dispatched to each processor decreases as the loop progresses.
- runtime: The scheduling decision is deferred until runtime. The value of the environment variable OMP_SCHEDULE determines the scheduling scheme.
Load balancing Unevenly distributed tasks cause stalls on some processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited by the remainder of the number of iterations over the number of processors. There is no solution for this case other than trying to parallelize outer loops. If the imbalance is caused by uneven work within the loop body (such as an outer parallel loop with an inner triangular loop), dynamic scheduling may result in better performance. Figure ��� shows an example of load balancing by changing the scheduling.
(a)
!$OMP PARALLEL DO
!$OMP+ SCHEDULE(STATIC)
DO I = 1, N
   DO J = 1, I
      ...
   ENDDO
ENDDO

(b)
!$OMP PARALLEL DO
!$OMP+ SCHEDULE(DYNAMIC)
DO I = 1, N
   DO J = 1, I
      ...
   ENDDO
ENDDO

Fig. ���. Load balancing: (a) the original loop and (b) the same loop after changing the scheduling scheme. By changing the scheduling from static to dynamic, the unbalanced load can be distributed more evenly.
Blocking/tiling If the data accessed by each iteration of a loop exceed the data cache size of the processor and the data are reused across iterations, many cache misses occur. Blocking/tiling splits the data needed by each iteration so that they fit into one processor's cache. This technique is particularly useful in large matrix manipulations. Obviously, machine parameters come into play for this technique to be successful: knowing the machine's cache size will help determine the right block size. Blocking and tiling are basically locality enhancement techniques. Figure ��� shows how blocking/tiling can be applied.
In Figure ���, the entire B array is referenced in each iteration of the I loop. If the 2*N + N*N references within each iteration of the I loop exceed the cache size, then each access to a new line of array B will be a cache miss. Tiling the K and J loops allows smaller sections of B to be accessed repeatedly before moving on to another section, decreasing the references within the I loop to 2*BLK + BLK*BLK. If BLK is small enough, then each line of B will see only one cache miss during the execution of the entire nest.
(a)
DO I = 1, N
   DO K = 1, N
      DO J = 1, N
         C(J, I) = A(K, I) * B(J, K) + C(J, I)
      ENDDO
   ENDDO
ENDDO

(b)
DO KK = 1, N, BLK
   DO JJ = 1, N, BLK
      DO I = 1, N
         DO K = KK, MIN(KK+BLK-1, N)
            DO J = JJ, MIN(JJ+BLK-1, N)
               C(J, I) = A(K, I) * B(J, K) + C(J, I)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

Fig. ���. Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b), outer loops have been added to process smaller blocks at a time; the data are likely to remain in the cache when they are needed again.
Serial performance optimization techniques
Sometimes programmers inadvertently write inefficient code. For those who are not familiar with performance issues, it is not unusual to write code that works against good performance. There are simple techniques that enhance the performance of a code segment (whether it is serial or parallel) without altering its intended functionality. The techniques listed below aim to enhance the locality of program data, resulting in better cache performance, or to reduce stalls. They are mainly machine-independent; for instance, enhancing locality always helps. If the dominant code segments in the target program are inherently serial, the following techniques may be good candidates for improving the performance without parallelization.
Loop interchange Loop interchange is a simple technique that interchanges the loops of a loop nest. The array access patterns determined by the loop order can have a drastic effect on the resulting performance. Of the two code segments shown in Figure ���, the first has poor locality because it accesses the array with a stride of N. The second loop, on the other hand, performs better because of its stride-1 access.
(a)
DO I = 1, N
   DO J = 1, M
      A(I, J) = B(I, J)
   ENDDO
ENDDO

(b)
DO J = 1, M
   DO I = 1, N
      A(I, J) = B(I, J)
   ENDDO
ENDDO

Fig. ���. Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest.
Loop interchange is a simple technique that may result in a large performance gain. Programmers should be aware, however, that loop interchange is not always legal. In the presence of backward data dependences in a loop (e.g., A(i, j) = A(i-1, j+1) + B(i, j)), interchange violates the dependence of the original code.
Loop fusion This is the opposite of loop distribution, described below. If multiple loops have the same iteration range, they can be merged, provided that doing so does not violate any dependences between them. Fusion generally increases locality, because it allows processors to reuse the data that are already in their caches. However, fusion may cause the data size to exceed the cache size, which degrades performance. Also, as a side effect, if fusion is applied to parallel loops, it decreases the number of synchronization barriers and reduces both parallelization and spreading overhead. Programmers should be aware that loop fusion is not always legal, even when the iteration spaces match.
Software pipelining and/or loop unrolling In some compute-intensive loops, data dependences across nearby iterations may cause pipeline stalls. This is more frequent with floating-point operations, which take a number of CPU cycles. One way to alleviate this problem is software pipelining or loop unrolling. Loop unrolling does not have a direct effect on reducing dependence stalls, but it allows the back-end compiler to interleave dependent instructions.
However, unlike software pipelining, which may create a loop-carried dependence, an unrolled loop can still be executed in parallel if the original loop is parallel. As a side effect, unrolled loops have fewer synchronization barriers when executed in parallel. These techniques allow more cycles between dependent instructions, so stalls are reduced. Hardware counters often have facilities for measuring dependence stalls. Figure ��� shows a simple loop before and after applying software pipelining and unrolling.
Other performance-enhancing techniques
Loop distribution Loop distribution refers to splitting a loop into multiple loops with smaller tasks. This technique may reduce the grain size of parallelism; however, it can enable other transformations. Figure ��� shows an actual code section found in the program SWIM from the SPEC ���� benchmark suite ����.
    (a) original loop:

          DO I = 1, N
            ...
            C = A(I) * B(I)
            D(I) = C
          ENDDO

    (b) software pipelined:

          C = A(1) * B(1)
          DO I = 2, N
            ...
            D(I-1) = C
            C = A(I) * B(I)
          ENDDO
          D(N) = C

    (c) unrolled by two:

          DO I = 1, N, 2
            ...
            C = A(I) * B(I)
            D(I) = C
            ...
            C = A(I+1) * B(I+1)
            D(I+1) = C
          ENDDO

Fig. Software pipelining and loop unrolling: (a) the original loop, (b) the same loop software pipelined (instructions are interleaved across iterations, and a preamble and postamble have been added), and (c) the same loop unrolled by two.
The outer loop is parallel. Adding appropriate directives, we get the parallelized version shown below.
As mentioned above in the locality enhancement section, the nested loops in this code segment would be a good candidate for loop interchange due to the column-major attribute of Fortran. However, the one line right after the nested loop prevents applying the technique. By splitting the outer loop into two and interchanging the nested loops, we get the code shown in the last figure below, which performs significantly better than the previous two versions.
Subroutine inlining Inlining replaces a call to a subroutine with the code contained within the subroutine itself. This procedure, also called "inline expansion," can have several beneficial effects. The most obvious of these is the removal of the calling overhead. This is particularly true when a call is embedded within a small loop, so that the overhead would otherwise be incurred in each loop iteration. More importantly, however, in the context of parallelizing compilers, additional optimizations and transformations may be facilitated by this transformation.
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
 3500 CONTINUE
      ENDDO

Fig. Original loop SHALOW_do3500 in program SWIM.
With procedure calls inlined, the procedure's code may be optimized within the context of the call site. With site-specific information now available, other transformations may become possible, which in turn may facilitate yet other optimizations. This may allow some instances of a procedure to be executed in parallel, even if it is not parallelizable at every call site.
The downside of inline expansion is the increase in code size, which can be significant if full inlining is performed. This may cause many instruction cache misses. Also, with the increase in code size comes an increase in compilation time, since each instance of the inlined code is now optimized separately. Often full inlining is not practical, and so heuristics are developed for its application.
Deadcode elimination Deadcode elimination is an optimization technique that removes unnecessary code from a program. The direct effect of deadcode elimination is decreased execution time: code that has no effect on the output of a program is removed, and thus the time spent executing this portion of the application is eliminated. Again, there is the additional benefit that deadcode
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
 3500 CONTINUE
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig. Parallel version of SHALOW_do3500 in program SWIM.
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
 3500 CONTINUE
      ENDDO
!$OMP END DO
!$OMP DO
      DO icheck = 1, MIN0(m, n), 1
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig. Optimized version of SHALOW_do3500 in program SWIM.
elimination may enable other optimizations (e.g., an imperfect loop nest can become a perfect loop nest after deadcode elimination).
Getting optimized execution time
As described earlier, using "single-user time" is important to reduce external perturbation factors. In parallel programs, these factors may cause significant inaccuracies and variations in execution time because of the unpredictable nature of other user processes.
Finding and resolving performance problems
Finding dominant regions Programmers should focus on dominant code segments based on the measured data. Instrumented program runs usually generate profiles containing the measured data. From these files, programmers should find the major code blocks that consume most of the execution time. With tool support, this task can be simplified.
Dominant program sections may change as a result of the program tuning process. After each iteration of this process, programmers should reevaluate the most time-consuming (or the most problematic, depending on the metrics) code sections. Other program sections may have become the point of biggest return on further time investment.
Identifying problems and finding remedies When dominant code sections are found, programmers should figure out possible improvements to those segments. First, the status of the segments should be understood: "Is the code section parallel?" and "Is the speedup acceptable?" are the questions that should be answered before looking for the right remedies. Computing the overheads discussed earlier can be of significant help to this end. Performance analysis is a difficult part of performance tuning. In the next chapter, we present our effort to facilitate performance analysis through tool support.
• Code not parallel: Even advanced parallelizing compilers such as the Polaris compiler cannot detect all possible parallelism. There are mainly two reasons for this. First, the target code uses algorithmic techniques that a parallelizing compiler cannot analyze. Second, the data dependences within the code cannot be determined without examining the input data, so the parallelizing compiler makes the conservative decision not to parallelize it.
For the first case, programmers may be able to find parallelism. For example, if a reduction variable is not recognized by a parallelizing compiler, programmers can parallelize the code section with the proper reduction directives. Programmers may need to study the underlying algorithm for this task. Parallelization techniques were presented earlier in this chapter.
For the second case, programmers may be able to make up for the lack of information about the input data. For instance, if the reason for not parallelizing a code section is that the compiler cannot determine that certain array accesses do not overlap, programmers can simply parallelize the code manually. If a conditional exit within a loop only occurs under a fatal error condition, ignoring it and parallelizing the loop will not affect correct execution.
If the programmer cannot find any way to parallelize a given code, replacing the algorithm with a parallel counterpart may be possible. There are parallel algorithms for some inherently serial algorithms, such as random number generation and linear recurrences.
Finally, even if none of these techniques is possible, programmers should try enhancing the locality of the code. Some locality-enhancing techniques can make a drastic difference in performance; several such techniques were listed earlier in this chapter.
• Speedup not acceptable: For parallel code segments, there are several reasons for poor speedup, including poor locality and parallelization and/or spreading overhead. Spreading overhead may be incurred by poor locality. Programmers should try to enhance locality and reduce overhead. Problems with data locality may be detected if a hardware counter is available on the target machine. A large number of stalls or a high data cache miss ratio is a good indication of poor locality. Remedial techniques were described earlier in this chapter.
Conclusions
The ultimate objective of our research is to answer "what" and "how" in a parallel optimization process. The proposed methodology is designed to tell programmers "what must be done." We have divided the program optimization process into several steps with feedback loops. Each step defines specific tasks for programmers to accomplish. We have also listed common analyses and techniques that are needed. There is a clear goal in each stage, and the condition for its achievement is clearly defined. In this way, our methodology provides significant guidance to programmers in optimizing parallel applications.
The methodology described above has been devised empirically. All of the analyses and techniques have helped us improve the performance of scientific and engineering applications. However, figuring out exactly which technique will improve performance is still a difficult subject and requires further study. Performance prediction and modeling have not been successful in general cases. In the next chapter, we introduce our experience-based approach to resolving this issue. We support our methodology with a set of tools, which is our approach to answering the question "how." These supporting tools are the topic of the next chapter.
TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY
As previously mentioned, the main advantages of a methodical approach to parallel programming are that it is (1) efficient and (2) easy to apply without advanced experience. The proposed methodology outlines this systematic endeavor towards good performance. However, the individual steps listed in the methodology can be time-consuming and tedious.
Parallel programmers without access to parallel programming tools have relied on text editors, shells, and compilers. Programmers write a program using text editors and generate an executable with the resident compilers. All other tasks, such as managing files, examining performance figures, searching for problems, and incorporating solutions, can be achieved using these traditional tools. However, considerable effort and good intuition are needed for file organization and performance diagnostics. Even with parallelizing compilers, these tasks still remain for the users to deal with. In fact, most users end up writing small helper scripts for these tasks.
Tools designed specifically for the development and tuning of parallel programs step in where traditional tools reach their limits. In general, these tools provide interactivity and an adequate user interface for incorporating user knowledge to further improve program performance. The previous efforts discussed earlier mainly focus on two aspects of functionality: automation and visualization. Automatic utilities simplify analyzing very complex program structures. Visualization utilities allow users to view and interpret a large amount of static analysis information and performance data in an efficient manner. Still, we feel that certain functionalities, which could be of great help to programmers, have been largely ignored by tool developers.
Based on user feedback and the specifics of our methodology, we have set our design goals, which are listed in the next section. Then we discuss in detail the tools that we have developed and/or included in our programming environment.
We also present our effort to reach a general audience with our tools through the World Wide Web. Finally, we describe how these tools fit into our methodology and help programmers in the tuning process.
Design Objectives
Consistent support for the methodology This is the main goal of our research. We examine the steps in the methodology and find time-consuming programming chores that call for additional aid. Some tasks are tedious and may be automated. Others require complex analysis and cumbersome reasoning, so assisting utilities are needed. If these are properly addressed with tool support, programmers can achieve greater performance with ease. The integration of the methodology and the tool support significantly increases efficiency and productivity.
Support for deductive reasoning Current performance visualization systems offer a variety of utilities for viewing a large amount of data from many different perspectives. Understanding data patterns and locating problems, however, are still left to users. In addition to providing raw information, advanced tools must help filter and abstract a potentially very large amount of data. Instead of providing a fixed number of options for data presentation, offering the ability to freely manipulate data, and even to compute a new set of meaningful results, can serve as the basis for users' deductive reasoning.
Active guidance system Tuning programs requires dealing with numerous different instances of code segments. Categorizing these variants and finding the right remedies demand sufficient experience on the programmers' part. The transfer of such knowledge from experienced to novice programmers has always been a problem in the parallel programming community. It usually takes novice programmers a significant amount of time and effort to gain adequate expertise in parallel programming. We believe that it is possible to address this issue systematically using today's technology.
Program characteristics visualization and performance evaluation The task of improving program performance starts with examining the performance and analysis data and finding room for improvement. The ability to scroll through these data and visualize what they imply is critical in this task. Tables, graphs, and charts are a common way of expressing a large data set for easy comprehension. However, one of the pitfalls that researchers easily fall into is presenting too much information in a myriad of windows without proper annotations. A good tool should be able to draw the user's attention to what is important.
Integration of static analysis with performance evaluation Most tools published so far focus on only one of these two types of data. However, as mentioned earlier, good performance only comes from considering both aspects. It is important to identify the relationship between the data from both sides and have them available for easy analysis. Without considering performance data, static program optimization can even degrade performance. Likewise, without static analysis data, optimization based only on performance data may be of marginal benefit.
Interactive and modular compilation The usual black-box-oriented use of compiler tools has limits in efficiently incorporating users' knowledge of program algorithms and dynamic behavior. For example, although the compiler detects a value-specific data dependence, the user may know that for every reasonable program input the values are such that the dependence does not occur. In other cases, users may know that the array sections accessed in different loop iterations do not overlap. Furthermore, certain program transformations may make a substantial performance difference but are applicable to very few programs, and hence are not built into a compiler's repertoire. If a user can find the reason why a loop was not parallelized automatically, a small modification may be applied that ensures parallel execution. For these reasons, manual code modifications in addition to automatic parallelization are often necessary to achieve good performance, and tools should provide a convenient mechanism for incorporating manual tuning. Another drawback of conventional compilers is their limited support for incremental tuning. The localized effect of parallel directives in the shared memory programming model allows users to focus on small portions of code for possible improvement. Hence, compiler support for incremental tuning is also an important goal in our tool design.
Data Management This is the basic need in successfully optimizing various applications. Data management refers to the task of organizing data files, maintaining the storage for the gathered data, and making the data easy to retrieve for quick comparison and manipulation. A unified space for experimental data with clean interfaces helps not only the developers themselves but also the combined effort among research groups, by allowing simple access to related databases.
Accessibility Although the importance of advanced tools for all software development is evident, many available tools remain unused. A major reason is that the process of searching for tools with the needed capabilities, then downloading and installing them on locally available platforms and resources, is very time-consuming. In order to evaluate and find an appropriate tool, this process may need to be repeated many times. Using today's network computing technology, tool accessibility can be greatly enhanced.
Portability For disseminating a new tool to the user community, it is important that it be easy to install on new platforms. In addition, a tool has to be flexible in the data formats it can read, such that it can adapt to the tools (compilers and performance analyzers) available on the local platform.
Configurability Satisfying the general users of a tool can only be achieved by allowing them to configure the tool to their liking. By having configurability as one of our design goals, many users' preferences can be incorporated into the tool usage without individually addressing each of them.
Flexibility Flexibility is an important characteristic of general tools. We have seen many cases in which new types of performance data needed to be incorporated into the picture for a better understanding of a program's behavior. Furthermore, we would like to keep the applicability of the tool open for tasks beyond performance tuning.
In the next few sections, we introduce the tools in our methodology-support toolbox. We present overviews of the tools, as well as their detailed structure and functionality where needed. We also include the look and feel of these tools from the end users' point of view.
Ursa Minor Performance Evaluation Tool
Often the programmer's intervention into automatic optimization is necessary to achieve near-optimal parallel program performance. To aid programmers in this process, we have developed a performance evaluation tool, Ursa Minor (User Responsive System for the Analysis, Manipulation, and Instrumentation of New Optimization Research). The main goal of Ursa Minor is performance optimization through interactive integration of performance evaluation with static program analysis information. With this tool, performance anomalies such as poor speedup and high cache miss ratios are easily identified on a loop-by-loop basis via a graphical user interface. Overhead components are computed instantly. This information is combined with static program information, such as array access patterns or loop nest structure, to give a better understanding of the problems at hand.
Ursa Minor complements the Polaris compiler in its support for OpenMP parallel programming in that it understands the compiler's output. It collects and combines information from various sources, and its graphical interface provides selective views and combinations of the gathered data. Ursa Minor consists of a database utility, a visualization system for both performance data and program structure, a source searching and viewing tool, and a file management module. Ursa Minor also provides users with powerful utilities for manipulating and restructuring input data to serve as the basis for the users' deductive reasoning. In addition, it takes performance evaluation one step further by means of an active performance guidance system called Merlin. Ursa Minor can present to the user and reason about many different types of data (e.g., compilation results, timing profiles, hardware counter information), making it widely applicable to different kinds of program optimization scenarios.
Functionality
Here, we describe the functionality of Ursa Minor and what it can do for programmers. A typical performance evaluation process consists of visualizing performance, identifying problems or anomalies, finding the causes, and devising the corresponding remedies. Programmers need to visualize and compare the performance data from different trials, ruminate over them, compute derived values, examine the runtime environment for the causes of possible problems, and search for solutions. We have designed practical utilities to assist programmers in this process and integrated them into Ursa Minor.
Performance data and program structure visualization
The Ursa Minor tool presents information to the user through two main display windows: the Table View and the Structure View. The Table View shows the data as text entries that relate to "program units," which can be subroutines, functions, loops, blocks, or any entities that a user defines. The Structure View is designed to visualize the program structure under consideration. A user interacts with the tool by choosing menu items or mouse-clicking.
The Table View displays data such as the average execution time, the number of invocations of code sections, cache misses, and text indicating whether loops are serial or parallel. Generally, the entries can be of type integer, floating-point number, or string. Users can manipulate the presented data through the various features this view provides. This is the main view that provides the means for modifying and augmenting the underlying database. Access to the other modules of Ursa Minor takes place through this view. The Table View is a tabbed folder that contains one or more labeled tabs. Each tab corresponds to a "program unit group," which means a group of data of a similar type. For instance, the folder labeled "LOOPS" contains all the data regarding loops in a given program. When reading predefined data inputs such as timing files and Polaris listing files, Ursa Minor generates predefined program unit groups (e.g., LOOPS, PROGRAM, CALLSTRUCTURE, etc.). Users can create their own groups with their own input files using the proper format.
A user can rearrange columns, delete columns, and sort the entries alphabetically or based on execution time. The bar graph on the right side shows an instant normalized graph of a numeric column. After each program run, the newly collected information is included as additional columns in the Table View. Users can examine these numbers side by side as they see fit. In this way, performance differences can be inspected immediately for each individual loop as well as for the overall program. The effects of program modifications on other program sections become obvious as well. A modification may change the relative importance of loops, so that sorting them by their newest execution time yields a new most time-consuming loop on which the programmer has to focus next. The figure below shows the Table View of Ursa Minor in use.
Various features make the Table View easier to use and more accessible. Users can set a display threshold for each column so that an item that is less than a certain quantity is displayed in a different color. This feature allows users to effortlessly identify code sections with poor speedup, for instance. One or more rows and columns can be selected so that they can be manipulated as a whole. Data that would not fit into a table cell, such as the compiler's explanation for why a loop is not parallel, can be displayed in a separate window with one mouse click. Finally, Ursa Minor is capable of generating pie charts and bar graphs for a selected column or row for instant visualization of numeric data.
Fig. Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on execution time, the user inspects the percentage of the three major loops (in subroutines ACTFOR and RESTAR) using the pie chart generator (bottom left). Computing the speedup with the Expression Evaluator reveals that the speedup for the RESTAR loop is poor, so the user is examining more detailed information on the loop.
Another view of Ursa Minor provides the calling structure of a given program, which includes subroutine, function, and loop nest information, as shown in the figure below. Each rectangle represents a subroutine, function, or loop. The rectangles are color-coded so that more information is conveyed to the user visually. For example, parallel loops are represented by green rectangles, and serial loops by red rectangles. Clicking on one of these rectangles will display the corresponding source code. In the figure, the user is inspecting a loop in subroutine ACTFOR in this way. Rectangles positioned to the right are nested program units; thus, if unit A has unit B inside, the rectangle representing B will be placed to the right of the rectangle for A. If one wants a wider view of the program structure, the user can zoom in and out. This display helps in understanding a program's structure for tasks such as interchanging loops or finding outer or inner candidate parallel loops.
Expression Evaluator
The ability to compute values derived from the raw performance data is critical in analyzing the gathered information. For instance, the average timing value over different runs, speedup, parallel efficiency, and the percentage of the execution time of code sections with respect to the overall execution time of the program are common metrics used by many programmers. Instead of adding individual utilities to compute these values, we have added the Expression Evaluator for user-entered expressions. We have provided a set of built-in mathematical functions for numeric, relational, and logical operations. Nested operators are allowed, and any reasonable combination of these functions is supported. The Expression Evaluator has a pattern matching capability as well, so the selection of a data set for evaluation becomes simple. The Expression Evaluator also provides users with query functions that apprehend static analysis data from a parallelizing compiler. These functions can be combined with the mathematical functions, allowing queries such as "loops that are parallel and whose speedups fall below a given value" or "loops that contain I/O and whose execution time exceeds a given fraction of the overall execution time." For example, after users have identified parallel loops with poor speedup, they may want to compute the cache miss ratios of those
Fig. Structure view of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the "Find" utility, the user has set the view to subroutine ACTFOR and opened the source view for a parallelized loop.
loops, or their parallelization overheads. Instead of leaving the reasoning process to users, Ursa Minor guides users through the deductive steps. The Expression Evaluator is a powerful utility that allows manipulating and restructuring the input data to serve as the basis for users' deductive reasoning through a common spreadsheet-like interface.
The Merlin performance advisor
As previously mentioned, identifying performance bottlenecks and finding the right remedies take experience and intuition, which novice programmers usually lack. Acquiring this expertise requires many trials and studies. Even for those programmers who have experienced peers, the transfer of knowledge from advanced programmers to novice programmers takes time and effort.
We believe that tools can be of considerable use in addressing this problem. We have used a combination of the aforementioned Expression Evaluator and a knowledge database to create a framework for the easy transfer of experience. Merlin is an automatic performance data analyzer that allows experienced programmers to tell novice programmers how to diagnose and improve many types of performance problems. Its objective is to provide guidelines and suggestions to inexperienced programmers based on the accumulated knowledge of advanced programmers.
The figure below shows an instance of the Merlin user interface. Merlin is activated when a user clicks "Run Performance Advisor for This Row" in the row popup menu. It consists of an analysis text area, an advice text area, and buttons. The analysis text area displays the diagnosis that Merlin has performed on the selected program unit. The advice text area provides Merlin's solutions to the detected problems, with examples if any. Each diagnosis and the corresponding advice are paired with an identification number (e.g., Analysis 1 and Solution 1). Users can also load a different map at any time.
Merlin differs from conventional spreadsheet macros in that it is capable of comprehending the static analysis data generated by a parallelizing compiler. Merlin can take into account numeric performance data as well as program information such
Fig. The user interface of Merlin in use. Merlin provides solutions to the detected problems. This example shows the problems addressed in a loop of subroutine ACTFOR in program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer.
as parallel loops, or the existence of I/O statements or function calls within a code block, and so on. This allows a comprehensive analysis based on both the performance and static data available for the code section under consideration.
Merlin navigates through a knowledge-based database ("map") that contains information on diagnoses and solutions for various performance symptoms. Experienced programmers write maps based on their knowledge, and novice programmers can view the suggestions made by the experienced programmers by activating Merlin. As shown in the figure below, a map consists of three "domains." The elements in the Problem Domain correspond to general performance problems from the viewpoint of programmers. They represent situations such as poor speedup, a large number of stalls, and non-parallel loops, depending on the performance data types targeted by Merlin. The Diagnostics Domain depicts possible causes of the problems, such as floating-point dependences and data cache overflow. Finally, the Solution Domain contains remedial techniques; typical examples are serialization, loop interchange, tiling, and loop unrolling. These elements are linked by "conditions." Conditions are logical expressions representing an analysis of the data. If a condition evaluates to true, the corresponding link is taken, and the element in the next domain pointed to by the link is explored. Merlin invokes the Expression Evaluator for the evaluation of these expressions. A Merlin map is written in the Generic Data Format described later in this chapter, and it is loaded into Ursa Minor as an instance of an Ursa Minor database. A more detailed description of Merlin is available in the literature.
Merlin enables multiple cause-effect analyses of performance and static data. It fetches the data specified by the map from the Ursa Minor tool, performs the listed operations, and follows the links whose conditions are true. There are no restrictions on the number of elements and conditions within each domain, and each link is followed independently. Hence, multiple perspectives can easily be incorporated into one map. For instance, memory stalls may be caused by poor locality, but they could also indicate a floating-point dependence. In this way, Merlin considers all possibilities separately and presents an inclusive set of solutions to users. At the same time, the remedies
Fig. ���. The internal structure of a Merlin "map". The Problem Domain corresponds to general performance problems, the Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data.
suggested by Merlin assist users in "learning by examples". Merlin enables users to gain expertise in an efficient manner by listing performance data analysis steps and many example solutions given by experienced programmers.
Merlin is able to work with any map as long as the map is in the correct format. Therefore, the intended focus of performance evaluation may shift depending on the interest of the user group. For instance, the default map that comes with Merlin focuses on parallel optimization of programs. Should a map that focuses on architecture be developed and used instead, the response of Merlin will reflect that intention. The Ursa Minor environment does not limit its usage to parallel programming.
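The map traversal described above, evaluating each condition independently and following every link whose condition holds, can be sketched in a few lines. This is a minimal illustration under assumed map contents, not the actual Merlin implementation; the element names, condition expressions, and data fields below are invented for the example.

```python
# Hypothetical sketch of a Merlin-style map: elements in the Problem,
# Diagnostics, and Solution domains are linked by condition expressions
# that are evaluated against the collected performance data. Every link
# whose condition holds is followed independently.

def traverse(links, data, start):
    """Follow every link out of `start` whose condition holds on `data`."""
    results = []
    for src, cond, dst in links:
        if src == start and cond(data):
            results.append(dst)
            results.extend(traverse(links, data, dst))
    return results

# Each link: (source element, condition on the data, target element).
links = [
    ("poor speedup", lambda d: d["memory_stalls"] / d["cycles"] > 0.2,
     "poor locality"),
    ("poor speedup", lambda d: d["fp_dependence"],
     "floating point dependence"),
    ("poor locality", lambda d: True, "loop interchange / tiling"),
    ("floating point dependence", lambda d: True, "serialization"),
]

data = {"memory_stalls": 400, "cycles": 1000, "fp_dependence": False}
print(traverse(links, data, "poor speedup"))
# → ['poor locality', 'loop interchange / tiling']
```

Because each link is tested independently, a map with several plausible diagnoses for one problem simply yields several suggested remedies, mirroring the "inclusive set of solutions" behavior described above.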
Other functionality
During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. For example, timing information becomes available from various program runs, structural information of the program is gathered from the code documentation, and compilers offer a large amount of program analysis information. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. The bookkeeping effort accompanying this procedure is often overwhelming. Ursa Minor provides an organized solution to this problem. All the data regarding the tuning of a specific program are integrated into one compact database. Easy access to the database supported by the tool allows users convenient views and manipulation of the data without having to deal with numerous files.
Ursa Minor also supports inter-group logs. Sharing the performance data and optimization results among team members is important. Group members can share the databases generated by others by specifying one location for a data repository. When a member decides to share a database with other members, Ursa Minor adds a log entry with the information regarding that particular database in the repository. In this way, group members do not have to ask others to send the database in order to examine the data. The repository has all the information about the database that the member wants to share.
Configurability is one way to ensure that the tool adapts well to many users' environments and preferences. The Ursa Minor user interface is configurable. Users can change the look of the display views and many other functionalities. Most functions can be mapped to keyboard shortcuts, allowing advanced users to speed up their tasks.
Learning how to use a new tool has always been a nuisance to many programmers. As tools become complex and versatile, reading a manual is cumbersome by itself. Some successful commercial applications in word processing or games have employed an "on-line tutorial" approach: an embedded module steps through some of the basic functions of the program and tells users how to use them. We have incorporated such a module into Ursa Minor. Our interactive demo session allows users to explore important features of the tool with input data prepared by the developers. In addition, this demo session automates some of the steps so that users can quickly look through them.
����� Internal Organization of the Ursa Minor tool
Fig. ���. Building blocks of the Ursa Minor tool and their interactions. [Figure: the Database Manager mediates between the database (static data such as data dependence and structure analysis results; dynamic data such as performance numbers, runtime environment, and hardware counter values) and the GUI Manager with its Table View and Structure View, the Expression Evaluator, the Merlin Performance Advisor, the user, and other tools and spreadsheets.]
Figure ��� illustrates the interaction between Ursa Minor modules and various data files. The Database Manager handles interaction between the database and the other modules. Depending upon users' requests, it fetches the required data items or creates or modifies database entities. The GUI Manager coordinates the various windows and views and controls the process of handling user actions. It also takes care of data consistency between the database and the display windows. The Expression Evaluator is a facility that allows users to perform spreadsheet-like, user-typed commands on the current database. This module parses the command, applies the operations, and updates the views accordingly. Finally, Merlin is a guidance system capable of automatically conducting performance analysis and finding solutions.
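To make the Expression Evaluator's role concrete, the sketch below applies a spreadsheet-style formula element-wise to columns of a small table. The column names and command form are invented for illustration; the tool's actual command syntax is described elsewhere in this chapter.

```python
# Toy "expression evaluator": compute a new column of a table from an
# expression over existing columns, the way a spreadsheet formula would.
# Column names and the derived metric are illustrative only.

table = {
    "serial_time":   [10.0, 4.0, 2.0],   # one entry per code section
    "parallel_time": [2.5, 2.0, 2.0],
}

def evaluate(table, target, fn):
    """Add column `target`, computed row by row from the existing columns."""
    n = len(next(iter(table.values())))
    table[target] = [fn({c: col[i] for c, col in table.items()})
                     for i in range(n)]

# Equivalent of a user-typed command like "speedup = serial_time / parallel_time".
evaluate(table, "speedup",
         lambda row: row["serial_time"] / row["parallel_time"])
print(table["speedup"])
# → [4.0, 2.0, 1.0]
```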
Internally, Ursa Minor stores information in an Ursa Minor/Major Database (UMD). The UMD is a storage unit that holds the collective information about a program, its execution results in a certain system environment, or any other pertinent data that users include. This database can be stored in different formats, including a plain text file, which can optionally be inspected with an editor and printed. Furthermore, a database can be saved in a format that can be read by commercial spreadsheets, providing a richer set of data manipulation functions and graphical representations.
The Ursa Minor tool is written in ������ lines of Java. Thus, any platform on which the Java runtime environment is available can be used to run the tool. It uses the basic Java language with standard APIs, which enhances the portability of the tool. Object orientation in Java allows a relatively easy addition of new types of data to the database. The windowing toolkits and utilities provide a good environment for prototyping user interfaces, which enabled us to focus on the design of the tool functionality. Furthermore, Java, with its network support, is a useful language for realizing another goal of this project: making the gathered program, compilation, and performance results available to users worldwide. This goal has been realized in the Ursa Major tool, which is discussed in Section �� ��.
����� Database structure and data format
Ursa Minor maintains an organized database structure to store data. Inside the Ursa Minor database, data items are stored in one of four types: integer, floating point number, string, and long string. For the most part, the database module does not care what kind of information it holds. This is, of course, good programming practice, but more importantly, it helps ensure the flexibility and configurability of the entire tool. Certain modules do understand data semantics, such as the Structure View and the query functions in the Expression Evaluator, but the lack of the required data does not prevent use of the tool.
At the bottom of the structure is the "Program Unit". This is the basic storage unit that maps to an entity such as a loop, a subroutine, a code block, and so on. These units belong to a larger entry called a "Program Unit Group". Usually, Program Unit Groups are labeled loops, subroutines, etc., depending on the Program Units that they keep. These groups are combined into a "Session", which logically maps to a database for one optimization study. Sessions are managed by the Ursa Minor database manager, the module that handles database accesses. Figure �� shows a design schematic for the database.
Fig. �� . The database structure of Ursa Minor. [Figure: a Session contains Program Unit Groups such as Loops, Subroutines, and Functions; each group contains Program Units such as Loop 1, Loop 2, Loop 3; each unit stores typed fields, e.g., Integer: Number of invocations, Float: Average Execution Time, Float: Overall Execution Time, Float: Number of Cycles, Float: Memory Stalls, String: Serial or Parallel, Long String: Nested Units.]
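The Session / Program Unit Group / Program Unit hierarchy just described can be mirrored in a short structural sketch. This is not the tool's actual Java implementation; the unit name and field values below are examples only.

```python
# Sketch of the UMD hierarchy: a Session holds Program Unit Groups
# (e.g. "Loops"), which hold Program Units (e.g. individual loops)
# storing typed fields. The field names follow the examples in the
# figure; the loop name and values are invented.

class ProgramUnit:
    def __init__(self, name):
        self.name = name
        self.fields = {}          # field name -> (type tag, value)

    def put(self, key, type_tag, value):
        self.fields[key] = (type_tag, value)

class Session:
    def __init__(self):
        self.groups = {}          # group label -> {unit name -> ProgramUnit}

    def unit(self, group, name):
        """Fetch a unit, creating its group and the unit on first access."""
        units = self.groups.setdefault(group, {})
        return units.setdefault(name, ProgramUnit(name))

session = Session()
loop = session.unit("Loops", "MAIN_do10")   # hypothetical loop name
loop.put("Number of invocations", "Integer", 12)
loop.put("Average Execution Time", "Float", 0.034)
loop.put("Serial or Parallel", "String", "Parallel")
print(session.groups["Loops"]["MAIN_do10"].fields["Serial or Parallel"])
# → ('String', 'Parallel')
```

Note how the database layer stores only (type tag, value) pairs; interpreting what a field means is left to modules such as the Structure View, matching the type-agnostic design described above.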
Ursa Minor is capable of reading several different types of data files that are generated by the other tools listed in this chapter. Performance data ("sum" files) are generated when Polaris-instrumented executables run. Polaris listing files are generated when Polaris attempts parallelization of a program and contain static analysis information. When Ursa Minor reads these files, it parses them in a predefined way and creates the appropriate program unit groups. Users of the tool do not need to concern themselves with data types or formats when loading these files. Also, Ursa Minor can read and write using the Java serialization utility, which stores the database in a compact data file. Adding or removing data from the loaded database is as simple as clicking a menu.
In order to provide more flexibility, we have defined the "Generic Data Format", which can handle a wide variety of data. Using this text-based format, users can input almost any type of data with any data structure. This format allows users to create program unit groups of their own and arrange data as they see fit. This feature greatly enhances the applicability of Ursa Minor and fulfills one of the design goals: flexibility.
����� Summary
Ursa Minor supports the methodology presented in the previous chapter by providing utilities that mitigate many tasks in the performance evaluation stage. It integrates static analysis and performance data by means of a database with structure-based entities that hold many different types of data. With its support for deductive reasoning, active guidance, and data management through configurable and flexible utilities, Ursa Minor offers significant aid to parallel programmers in need of a performance evaluation tool.
Ursa Minor has been installed on the Parallel Programming Hub ����, allowing access by remote users all over the world. Users can quickly evaluate the tool with ease or utilize it extensively for production use. By combining Ursa Minor with other utilities on the Hub in support of the methodology, we are drawing closer to our goal of a comprehensive programming environment. The Parallel Programming Hub is discussed in detail in Section �� ��.
��� InterPol Interactive Tuning Tool
Good performance from a program is usually achieved by an incremental tuning and evaluation process. The term "incremental" applies to both the applied techniques and the modified code segments. Conventional batch-oriented compilers are of limited help to programmers in this task. Often, selecting target regions and choosing optimization techniques are done by slicing a program and manipulating compiler options manually. The accompanying tasks of file management and learning about compiler options are often overwhelming to programmers.
Advanced parallelizing compilers provide a large list of available techniques for program parallelization and optimization. These techniques are usually controlled by switches or command line options that may not be intuitive or user-friendly. The ability to select optimization techniques and even re-order their application would provide flexibility in exploring various combinations of techniques on different sections of code. In addition, this would offer a playground for those interested in studying compiler techniques.
InterPol is an interactive utility that allows users to target program segments and apply optimization techniques selectively ����. It allows users to build their own compiler from the numerous optimization modules available in a parallelizing compiler infrastructure. It is also capable of incorporating manual changes made by users. Meanwhile, InterPol keeps track of the entire program that users want to optimize, relieving programmers of file and version management tasks. In this way, programmers are free to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. During the optimization process, the tool can display static analysis information generated by the underlying compiler, which can help users in further optimizing the program.
����� Overview
Figure ��� illustrates the major components of InterPol. Users select code regions using the Program Builder and arrange optimization techniques through the Compiler Builder. The Compilation Engine takes inputs from these builders, executes the selected compiler modules, and displays the output program. If the user wants to keep the modified code segments, the output goes back into the Program Builder. Instead of running the Compilation Engine, users may choose to make changes to the code manually. All of these actions are controlled by a graphical user interface. Users are able to store the current program variant at any point in the optimization process.
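The flow between the three modules can be sketched abstractly: the Compilation Engine applies the user-arranged list of passes to a selected region, and the result is merged back into the Program Builder's current program variant. The "passes" below are placeholders standing in for Polaris modules, not actual Polaris pass names.

```python
# Abstract sketch of the InterPol flow: the Program Builder holds the
# current program text, the Compiler Builder supplies an ordered list
# of passes, and the Compilation Engine applies the passes to a region
# and merges the result back. Pass names here are placeholders.

class ProgramBuilder:
    def __init__(self, lines):
        self.lines = list(lines)          # current program variant

    def replace(self, start, end, new_lines):
        self.lines[start:end] = new_lines  # merge transformed region back

def compilation_engine(region_lines, passes):
    """Apply the user-arranged passes, in order, to the selected region."""
    for p in passes:
        region_lines = p(region_lines)
    return region_lines

# Placeholder "passes" standing in for Polaris modules.
normalize = lambda ls: [l.strip() for l in ls]
mark_parallel = lambda ls: ["!$OMP PARALLEL DO"] + ls

program = ProgramBuilder(["      DO i = 1, n", "      END DO"])
out = compilation_engine(program.lines[0:2], [normalize, mark_parallel])
program.replace(0, 2, out)
print(program.lines[0])
# → !$OMP PARALLEL DO
```

Because the pass list is just data, a different ordering or selection of passes (a different "custom-built compiler") can be applied to each region, which is the flexibility the text describes.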
����� Functionality
Figure ����a shows the graphical user interface offered by InterPol. Target code segments and the corresponding transformed versions are visible in separate areas.
Fig. ���. An overview of InterPol. Three main modules interact with users through a Graphical User Interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls Polaris modules.
Static analysis information is given in another area whenever a user activates the compiler. Finally, the Program Builder interface provides an instant view of the current version of the target program. InterPol is written in Java.
The underlying parallelization and optimization tool is the Polaris compiler infrastructure ����. Various Polaris modules form the building blocks for a custom-designed parallelizing compiler. InterPol is capable of stacking up these modules in any order. Polaris also comes with several different data dependence test modules, which can likewise be arranged by InterPol. Overall, more than �� modules are available for application. Users have the freedom to choose any blocks in any order. Executing this custom-built compiler is as simple as clicking a menu, and the result is displayed immediately on the graphical user interface. Figure ����b shows the Compiler Builder interface in InterPol. More detailed configuration is also possible through InterPol's Polaris switch interface, which controls the behavior of the individual passes.
Fig. ���. User interface of InterPol: (a) the main window and (b) the Compiler Builder.
The Program Builder keeps and displays the up-to-date version of the whole program. Users select program segments from this module, apply the automatic optimizations set up by the Compiler Builder, and/or add manual changes. The Compiler Builder is accessible at any point, so users can apply entirely different sets of techniques to different regions. The current version of the program is always shown in the Program Builder interface for easy examination. Through this continuous process of tuning optimized program segments, users always stay in the loop, observing and modifying program transformations step by step.
During the optimization process, InterPol can display program analysis results generated by running Polaris modules. These include data dependence test results, induction and reduction variables, etc. This provides a basis for further optimization. Programmers incorporate their knowledge of the underlying algorithm, compensating for the compiler's limited knowledge of the program's dynamic behavior and input data.
����� Summary
InterPol seeks to assist programmers by providing highly flexible utilities for both automatic and manual optimization. For those who are not familiar with the techniques available from parallelizing compilers, the tool provides greater insight into the effects of code transformations. By combining the Ursa Minor performance evaluation tool with InterPol, we hope to create a complete programming environment.
��� Other Tools in Our Toolset
The functionality of Ursa Minor and InterPol, combined with the Polaris instrumentation module, covers all the aspects of the methodology discussed in Chapter �. Later, in Section ���, we describe how these tools provide comprehensive support for the methodology. In this section, we present a set of complementary tools in our toolset, which were developed in related projects. The main goals of these tools do not necessarily match the issues that we would like to address in this research, but they provide additional information and grant control over other aspects of program development. These tools have been either developed or modified at Purdue University.
����� Polaris parallelizing compiler
The Polaris parallelizing compiler ���� is a source-to-source restructurer, developed at the University of Illinois and Purdue University. Polaris automatically finds parallelism and inserts appropriate parallel directives into input programs. Polaris includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. In addition, the current Polaris tool is able to generate OpenMP parallel directives ��� and apply locality optimization techniques such as loop interchange and tiling.
As demonstrated previously ���� ����, the Polaris compiler has successfully improved the performance of many programs on various target machines. Polaris provides a good starting point for parallelizing and optimizing Fortran programs. For advanced programmers, it can save substantial time that would otherwise be spent tuning loops that can be automatically parallelized. For novice programmers, manually parallelizing those loops would be cumbersome to begin with. In addition, Polaris can provide a listing file with the results of static program analysis, which may give programmers valuable information on various code sections.
InterPol, described above, provides easy, interactive access to the Polaris parallelizing compiler. InterPol is even capable of restructuring the optimization modules within Polaris. If InterPol is not available, Polaris can serve as an alternative, allowing fast parallelization of the programs at hand. Polaris is installed on the Parallel Programming Hub, available to programmers all over the world.
����� InterAct performance monitoring and steering tool
InterAct is a toolset that allows interactive instrumentation and tuning of OpenMP programs ����. This toolset provides a simple interface and API that allow users to quickly identify performance bottlenecks through on-line monitoring of program performance and to explore solutions through experimentation with user-defined tunable variables. The Polaris parallelizing compiler has been modified to annotate sequential Fortran programs with OpenMP shared-memory directives, as well as to insert calls to the instrumentation library. The instrumentation library collects both timings and hardware counter events, transparently managing low-level details such as overflows. To manage the hardware counters, the OpenMP Performance Counter Library (OMPcl) has been developed to accurately collect events within the multithreaded OpenMP environment.
InterAct provides a graphical user interface (GUI) to monitor program behavior, as well as to dynamically change instrumentation, environmental settings, and critical program variables during execution. It supports visualization of collected data, dynamic instrumentation, interactive modification of the number of threads used by the application, interactive selection of the runtime library used for managing parallel threads, and interactive modification of global variables that are registered by the target application. These global variables can be compiler- or user-inserted and are used to control the behavior and/or performance of the application. The toolset provides a socket interface between the application and the GUI that allows monitoring to be done either locally or remotely. Figure ��� shows a screenshot of InterAct in use for the study of the dynamic behavior of the SWIM benchmark.
Fig. ���. Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC����� SWIM benchmark.
����� Max/P parallelism analysis tool
A compiler is able to analyze the static behavior of a program. It can find characteristics of a program that are true for all possible input data sets and target machines. In contrast, dynamic evaluation of a program can provide insights into characteristics and behaviors that may go undetected by static analysis methods. Of great interest is understanding the dynamic behavior of parallelism, one of the most dominant factors in performance.
Max/P is a Polaris-based tool, developed at Purdue University ����. It evaluates the inherent parallelism of a program at runtime. The inherent parallelism is defined as the ratio of the total number of operations in a program, or program section, to the number of operations along the critical path. The critical path is the longest path in the program's dataflow graph, which is computed by Max/P during program execution. The tool can find the minimum execution time of a program assuming the availability of an unlimited number of parallel processors. It shows the maximum parallelism as an upper estimate of the potential performance gain that a user can expect from aggressively optimizing the code.
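The definition above, total operations divided by the operations along the critical path of the dataflow graph, can be computed directly on a small example. The four-node graph here is invented for illustration; the actual tool derives the graph during program execution.

```python
# Inherent parallelism on a toy dataflow graph: the total operation
# count divided by the operation count along the critical (longest)
# dependence chain. Each node is one operation; edges are dependences.

def critical_path(deps, node, memo=None):
    """Operations on the longest dependence chain ending at `node`."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = 1 + max((critical_path(deps, p, memo)
                              for p in deps[node]), default=0)
    return memo[node]

# a and b are independent; c consumes both; d consumes c.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
total_ops = len(deps)
longest = max(critical_path(deps, n) for n in deps)
print(total_ops / longest)
# → 1.3333333333333333  (4 operations, critical path a->c->d of length 3)
```

With unlimited processors, the graph could finish in 3 steps instead of 4, so the inherent parallelism of 4/3 bounds the speedup any optimization of this fragment can achieve.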
��� Integration with Methodology
In this section, we examine how we envision the combined methodology-plus-tools scenario. First, we discuss how these tools facilitate the steps listed in Chapter �. Then we focus on other features of the tools that help programmers throughout the tuning process.
����� Tool support in each step
Our tools have been designed and modified with the parallel programming methodology in mind. Figure ��� gives an overview of how these tools can be of use in each step of the methodology introduced in the previous chapter. Ursa Minor mainly contributes to the performance evaluation stages. InterPol and Polaris offer aid in the parallelization and manual tuning stages. Additional help in executing target programs is available through InterAct. In the following, we revisit each step in the methodology and discuss the roles of our tools.
Instrumenting program
The Polaris tool offers an instrumentation module as one of its passes. Users can activate this module using a set of switches. In this way, users can generate instrumented versions of both parallel and serial programs. Polaris provides several switches for the instrumentation of the execution time of loops. These switches dictate the
Fig. ���. Tool support for the parallel programming methodology. [Figure: the methodology steps (Instrumenting Program; Getting Serial Execution Time; Running Parallelizing Compiler; Manually Optimizing Program; Getting Optimized Execution Time; Speedup Evaluation; Finding and Resolving Performance Problems, with loops back to reduce instrumentation overhead or retune until satisfactory) are annotated with the supporting tools: the Polaris instrumentor and hardware counters, Polaris and InterPol, InterAct, and the Ursa Minor views, Expression Evaluator, and Merlin.]
types of code blocks that are instrumented and how nested sections are instrumented. By carefully controlling the switches, users can add all the necessary timing functions without excessive overhead.
Combined with the OpenMP Performance Counter Library �PCL� ����, Polaris can instrument a program so that each run generates a profile containing various performance data measured by a hardware counter on the instrumented code segments. This library is available on many modern machines. There are more than �� types of measurements available, including the number of cycles, instruction and data cache hits, the number of reads and writes, instruction counts, dependency stalls, and so on. It is capable of generating a data file that can be read by Ursa Minor for further analysis.
As noted in the methodology, it is important to record the execution time of the uninstrumented program. This serves as the basis for measuring the perturbation that instrumentation introduces. A simple UNIX command such as "time" may provide such a timing number.
Getting serial execution time
Running an instrumented serial version is typically done through the UNIX command line, usually via a simple command line interface. Instrumentation generates some form of record containing the timing information for the instrumented code segments. For example, an executable instrumented by the Polaris instrumentation utility generates a file that looks like the following:
RESTAR�do�� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do�� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
ACTFOR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
ACTFOR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
OVERALL time � �� ������ � � � � � �
The tabular section shows the average (AVE), minimum (MIN), maximum (MAX), and cumulative total (TOT) time spent in each instrumented segment. The last line shows the overall execution time of the entire program. This file can be directly read by the Ursa Minor tool for analysis.
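A summary file of this shape is straightforward to consume programmatically. The sketch below assumes a `name : AVE = x MIN = y MAX = z TOT = w` layout with a final `OVERALL` line, approximating the listing above; the exact delimiters and loop names in real Polaris output may differ.

```python
import re

# Sketch of a parser for Polaris-style timing summaries. The layout is
# assumed, not exact: each line carries a block name followed by
# AVE/MIN/MAX/TOT values, and a final OVERALL line gives the total time.

LINE = re.compile(
    r"(?P<name>\S+)\s*:\s*AVE\s*=\s*(?P<ave>[\d.]+)\s*"
    r"MIN\s*=\s*(?P<min>[\d.]+)\s*MAX\s*=\s*(?P<max>[\d.]+)\s*"
    r"TOT\s*=\s*(?P<tot>[\d.]+)")

def parse_profile(text):
    """Return ({block name: {ave, min, max, tot}}, overall time or None)."""
    blocks, overall = {}, None
    for line in text.splitlines():
        if line.startswith("OVERALL"):
            nums = re.findall(r"[\d.]+", line)
            overall = float(nums[0]) if nums else None
        else:
            m = LINE.match(line.strip())
            if m:
                blocks[m.group("name")] = {
                    k: float(m.group(k))
                    for k in ("ave", "min", "max", "tot")}
    return blocks, overall

sample = """ACTFOR_do500 : AVE = 0.120 MIN = 0.100 MAX = 0.150 TOT = 1.200
OVERALL time = 42.5"""
blocks, overall = parse_profile(sample)
print(blocks["ACTFOR_do500"]["tot"], overall)
# → 1.2 42.5
```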
Running parallelizing compiler
This is the step in which users attempt parallelization by running automatic utilities. Its main goals are (1) to utilize an automatic parallelizer to optimize complex loops and possibly gain static analysis results from the compiler, and (2) to save time by automating the parallelization of small, inconsequential loops. Therefore, the target in this case is usually the entire program. Furthermore, most parallelizers with interprocedural analysis capability work best when an entire program is given as input. Polaris, as a batch-oriented program, performs well for this purpose. InterPol is also capable of handling this task.
Manually optimizing programs
Any text editor can be used to manually modify programs. Several UNIX commands are useful for manipulating programs. An example is "fsplit", which splits subroutines and functions into different files. However, InterPol is specifically designed for the process of manual tuning. InterPol allows programmers to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. Some of the manual techniques that users may consider are presented in Chapter �.
Getting optimized execution time
In the shared-memory model, programmers can invoke a parallel program just as they execute a serial program. Typically, there are certain environment variables that need to be set beforehand. For example, on Solaris machines, the environment variable OMP_NUM_THREADS determines the number of processors to be used. If programmers used the Polaris compiler for instrumentation, a summary file is generated after each run.
InterAct allows interactive instrumentation and tuning of OpenMP programs. Its ability to dynamically change runtime parameters (tile size, unrolling factor) provides a testbed for finding the optimal set of techniques. Monitoring and changing hardware counter instrumentation make the instrumentation process more efficient.
Finding and resolving performance problems
Programmers need utilities for assembling and sorting data. Identifying performance problems requires a considerable amount of examination and hand analysis. Finding solutions often requires experience in program optimization studies.
Ursa Minor provides tools that assist parallel programmers in effectively evaluating performance. Its graphical interface provides selective views and combinations of timing information together with the program structure and static analysis data. Users can put together a table, open a Structure View, draw charts, perform spreadsheet-type operations, and examine source code. Ursa Minor manages the information within its own database; thus, data management that might otherwise have required significant file and version control becomes simple.
Identifying dominant loops is very simple with Ursa Minor. Users can load timing profiles and sort the entries through the column popup menu. If a user creates a pie chart, the most time-consuming loops are displayed, with the entire circle representing the total execution time. The bar graph on the right shows an instant view of normalized numeric data.
An important task in tuning program performance is to evaluate whether an applied program modification produces an acceptable result. This involves computing various metrics, such as speedup and parallel efficiency, and examining program analysis information. The built-in mathematical functions allow users to manipulate the data. The static analysis information generated by the Polaris compiler is also managed within the Ursa Minor database. For code segments that require manual tuning, this information provides vital clues. Static analysis information, as well as the source code viewer, can be pulled up at any time with simple menu clicks, so users can make a comprehensive diagnosis of the problems at hand.
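The two metrics just mentioned are computed here as commonly defined: speedup is serial time over parallel time, and parallel efficiency is speedup over the number of processors. This is a standalone sketch with example numbers, not the tool's own built-in function set.

```python
# Common tuning metrics: speedup and parallel efficiency. This mirrors
# the kind of calculation the Expression Evaluator's built-in functions
# support, but is an independent sketch with invented example timings.

def speedup(serial_time, parallel_time):
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, num_procs):
    return speedup(serial_time, parallel_time) / num_procs

# A loop taking 8.0 s serially and 2.5 s on 4 processors:
s = speedup(8.0, 2.5)        # 3.2x faster
e = efficiency(8.0, 2.5, 4)  # 0.8, i.e. 80% of ideal linear scaling
print(s, e)
# → 3.2 0.8
```

An efficiency well below 1.0 after a modification is exactly the kind of signal that sends a user back to the static analysis data for a diagnosis.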
Ursa Minor does more than just present data. It is capable of actively analyzing the data and giving advice. When users run Merlin, it extracts the necessary information and applies diagnosis techniques to find the right solutions. As mentioned previously, the decisions that Ursa Minor makes rely on the Merlin map, which is typically provided by advanced parallel programmers. In this way, the knowledge of experienced programmers can easily be used by novice programmers. The fact that a map can contain a variety of functions that apply to any type of data widens the usage of Merlin in many different fields of study.
����� Other useful utilities
In addition, the toolset provides functionality for tasks that are not specifically tied to the methodology steps.
When programmers are given an application to optimize, they usually start out by examining the source code. Basic knowledge about the program structure, such as the large subroutines or functions, their algorithms, callees, and callers, tremendously helps programmers later in the tuning stage. The algorithms employed by program modules, although not necessary for following the methodology, may be of importance, especially when programmers need to attempt replacing algorithms.
Programs written by others are generally harder to understand. Different coding styles make it difficult to capture the underlying composition of individual program modules. The Structure View of Ursa Minor addresses this problem by presenting users with an intuitive, color-coded view of the program structure. A simple click pulls up the source view when a closer examination is desired. This can save a significant amount of the users' time.
As the size and complexity of applications grow at an exponential rate these days,
the subject of performance steering is getting more attention. Performance steering
may come in handy in both the development stage and production use. For instance,
finding the right parameters for convergence criteria in the application development
stage can be tricky, so the ability to set or reset relevant variables during program
execution could prove advantageous in experimenting with different values.
Also, an application may be able to simulate many different aspects of a target object,
but users may be interested in only one aspect. In this case, performance steering
can save time and resources by restricting the simulation. The interest of InterAct
lies along this line. The primary use of InterAct in our study has been finding
the optimal combination of optimization-related parameters (e.g., tile size, unrolling
factor) for a given application. For long-running programs, InterAct allows
fine control over variables such as the simulation step size and the number of iterations.
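The kind of parameter steering described above can be sketched as a simple search loop. The `run_simulation` function and its cost model below are hypothetical stand-ins for an instrumented run, not InterAct's actual API:

```python
# Hypothetical sketch of steering-style parameter tuning: try several
# values of an optimization-related parameter (here, a tile size) and
# keep the one with the best simulated run time. The cost model is a
# made-up stand-in for a real instrumented execution.

def run_simulation(tile_size):
    # Stand-in cost model: too-small tiles pay loop overhead,
    # too-large tiles overflow the cache.
    overhead = 100.0 / tile_size
    cache_penalty = max(0, tile_size - 64) * 0.5
    return 10.0 + overhead + cache_penalty

def steer(candidates):
    """Return the candidate tile size with the lowest run time."""
    timings = {t: run_simulation(t) for t in candidates}
    best = min(timings, key=timings.get)
    return best, timings

best, timings = steer([8, 16, 32, 64, 128])
print(best)  # the tile size a steering session would settle on
```

A real steering tool would of course adjust the parameter in a live run rather than re-executing the program for each candidate value.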
When more than one person is involved in an optimization project, communication
between group members becomes problematic. The data that one person generates
may not be easily accessible or compatible with the tools used by others. Other
members in the group may want to focus on different perspectives, but the information
from one researcher may not be formatted or arranged in a compatible way. Sharing a
manipulatable database opens up the possibility of all the members having access to
a set of compatible databases relevant to individual tasks. At the same time, group
members can reason about the data gathered by other members, focusing on the
aspects that they are interested in. Ursa Minor enables an efficient and meaningful
way of sharing research results.
Finally, the growing popularity of multiprocessor workstations and high-performance
PCs is leading to a substantial increase in non-expert users and programmers
of this machine class. Such users need new programming paradigms; perhaps most
importantly, they need good examples to learn from. We have extended our effort
to support "parallel programming by examples" through Web-accessible tools and a
database repository. This is the topic of the next section.
��� The Parallel Programming Hub and Ursa Major
Although the importance of advanced tools for all software development is evident,
many available tools remain unused. This is mainly due to the limited accessibility
of tools. We have developed a set of tools for parallel programmers, and the
Internet provided an opportunity to make our tools more accessible to parallel
programmers worldwide. Here, we present two separate outcomes that resulted from
our effort to reach a wider audience with our tools. The Parallel Programming Hub
is an on-going project to provide a globally accessible, integrated environment that
hosts parallelizing compilers, program analyzers, and interactive performance tuning
tools ����. Users can access and run these tools with common Web browsers. Ursa
Major is an Applet-based application that enables visualization and manipulation
of the performance and static analysis data of various parallel applications that have
been studied at Purdue University ���. Its goal is to make a repository of program
information available via the World-Wide Web.
����� Parallel Programming Hub: a globally accessible, integrated tool environment
Programming tools are of paramount importance for efficient software development.
However, despite several decades of tool research and development, there is a
drastic contrast between the large number of existing tools and those actually used
by ordinary programmers. We believe that there are two main reasons for this
situation. The first reason is that a programmer, in order to benefit from new tools,
will typically have to go through one or several tedious efforts of searching,
downloading, installing, and resolving platform incompatibilities before the tools can even
be learned and their use can be evaluated. The second reason is that, even if the
value of a number of tools has been established, they often use different terminology,
diverse user interfaces, and incompatible data exchange formats; hence they are not
integrated.
Through the combined efforts of many researchers, we have created the Parallel
Programming Hub, a new parallel programming tool environment that is (1) accessible
and executable "anytime, anywhere" through standard Web browsers and (2)
integrated in that it provides tools that adhere to a common methodology for parallel
programming and performance tuning. The Parallel Programming Hub addresses
these two issues. It contributes solutions in the following way. First, the Parallel
Programming Hub makes available a growing number of tools "on the Web", where
they are accessible and executable through standard Web browsers. The Parallel
Programming Hub makes no restrictions on the type of tools that can be added. A
new tool can be installed without modification, providing the original graphical user
interface and, if necessary, being served directly off of the home site of a proprietary
provider. Nevertheless, the authorized user can access the tool via standard Web
browsers.
Our methodology is supported by the Parallel Programming Hub, which includes the
Polaris parallelizing compiler, the Max�P parallelism analysis tool, and the Ursa
Minor performance evaluation and visualization tool, all described in previous
sections. In addition, an increasing number of tools are being made available through
the Parallel Programming Hub. Currently, the Trimaran environment �� ��
for instruction-level parallelism (ILP) and the SUIF parallelizing compiler ���� are
accessible. Authorized users can access a number of common support tools such
as Matlab, Mentor Graphics, GNU Octave, and StarOffice. Figure �� shows a
screenshot of Ursa Minor in use on the Parallel Programming Hub.
On the surface, the Parallel Programming Hub is a set of web pages through
which users can run various parallel programming tools. Underneath this interface
is an elaborate network computing infrastructure called the Purdue University
Network Computing Hub (PUNCH). PUNCH is an infrastructure that supports
network-accessible, demand-based computing ���. It allows users to access and run unmodified
tools via standard Web browsers. PUNCH allows tools to be written in any language
and does not require the source code or object code of the applications it hosts. This
feature allows a wide variety of tools to be included.
When a user invokes a tool on PUNCH, the resource management unit determines
an appropriate platform out of a resource pool and executes the tool on it. The
smart resource management unit maintains resource usage at an optimal level. It
also enables the system to be highly scalable, making sure that PUNCH performs
well under widely varying numbers of users, tools, and resource nodes.
Fig. ��. Ursa Minor usage on the Parallel Programming Hub.
PUNCH is logically divided into discipline-specific "Hubs". Currently, PUNCH
consists of four Hubs that contain tools from semiconductor technology, VLSI design,
computer architecture, and parallel programming. These Hubs contain over thirty
tools from eight universities and four vendors, and serve more than five hundred
users from Purdue, the US, and Europe. PUNCH has been accessed ��� million times
since it became operational in ����.
Upon registering, a user is given an account and disk space that is accessible as
long as the user is on PUNCH. The execution of tools via PUNCH takes place in
UNIX "shadow" accounts that are managed by the network computing infrastructure.
This shadow account structure allows the addition of user accounts to the Parallel
Programming Hub without requiring the setup of individual accounts by a UNIX
system administrator. PUNCH keeps all user files in a master account and maintains a
pool of shadow accounts that are allocated dynamically for users at runtime. Input
files for interactive programs such as Ursa Minor are transferred on demand from
master to shadow accounts via a system call tracing program (based on the UFO
prototype ����) that implements a user-level virtual file system on top of the FTP
protocol. This system is transparent to users; thus all file transactions appear to be
normal disk I/O.
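The on-demand transfer scheme can be illustrated with a toy stand-in: files live in a "master" store and are copied into a "shadow" store only on first access. This sketch uses in-memory dictionaries in place of the FTP transport and system call tracing that the real infrastructure relies on:

```python
# Toy model of on-demand file transfer between a master account and a
# shadow account. Real PUNCH intercepts file system calls and fetches
# over FTP; here both stores are plain dictionaries.

class ShadowFS:
    def __init__(self, master):
        self.master = master      # authoritative file store
        self.shadow = {}          # files materialized on demand
        self.fetches = 0          # how many on-demand transfers occurred

    def open(self, path):
        if path not in self.shadow:          # first access: fetch
            self.shadow[path] = self.master[path]
            self.fetches += 1
        return self.shadow[path]             # later accesses: local

master = {"/home/user/input.dat": b"timing data"}
fs = ShadowFS(master)
fs.open("/home/user/input.dat")   # triggers a transfer
fs.open("/home/user/input.dat")   # served from the shadow copy
print(fs.fetches)                 # one fetch despite two opens
```

The key property mirrored here is transparency: the caller uses the same `open` interface whether or not a transfer is needed.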
The immediate advantage of having an integrated network-based tool environment
is substantial savings in users' effort and resources. The Parallel Programming Hub
eliminates the time needed to search for, download, and install tools, and it greatly supports
users in learning a tool through uniform documentation, on-line tutorials, and tools
that speak a common terminology. A typical tool access time for first-time users of
the ParHub is on the order of a minute, including authentication and navigating to
the right tool. This contrasts with download and installation times at least an
order of magnitude larger. Even much larger efforts become necessary if tools need
to be adapted to local platforms.
A novel aspect of the ParHub's underlying technology is that it represents not only
an actual "information grid", but also includes the necessary portals for its end users.
One vision is that future users can access software tools via any local platform, from a
palmtop to a powerful workstation. Compute power and file space are provided "on the
Web". Mobility is provided in that these resources are accessible transparently from
any access point. The described infrastructure represents a significant step towards
this vision.
����� Ursa Major: making a repository of knowledge available to the worldwide audience
A core need for advancing the state of the art of computer systems is performance
evaluation and the comparison of results with those obtained by others. To this end,
many test applications have been made publicly available for study and benchmarking
by both researchers and industry. Although a large body of measurements obtained
from these programs can be found in the literature and in public data repositories, it is
usually extremely difficult to combine them into a form meaningful for new purposes.
In part this is because the data are not readily available (i.e., they have to be extracted
from several papers), and they have to undergo substantial re-categorization and
transformation. In addressing this issue, the Ursa Major project ��� is creating
a comprehensive database of information.
Many tools can gather raw program and performance information and present it
to users, which is a starting point for answering the questions above. However, in
addition to providing raw information, advanced tools must help filter and abstract
a potentially very large amount of data.
Ursa Major addresses the described issues by providing an instrument with
which application, machine, and performance information can be obtained from various
sources and displayed in an interactive viewer attached to the World-Wide
Web. It provides a repository for this information and assists users in its abstraction
and comprehension. Industrial benchmarkers may be interested in "one single
number" for machine comparisons; programmers may be interested in transformations
that can improve the performance of an application; computer architects may
want to compare their cache measurements with those obtained by their peers. Ursa
Major provides hooks for their needs, and it includes instruments for the underlying
data mining task.
Ursa Major is an Applet-based application that enables visualization and manipulation
of the performance and static analysis data of various parallel applications
that have been studied at Purdue University. The goal of Ursa Major is to make
a repository of program information available via the World-Wide Web. Ursa Major
has its origin in the Ursa Minor tool, providing almost identical functionality.
Because we chose Java as an implementation language, it was natural to combine
these resources with the rapidly advancing Internet technology and, in this way,
allow users at remote sites to access our experimental data. Typically, in response to
a user interaction, it fetches from the repository a program database that represents
a specific parallel programming case study. It then displays it using Ursa Minor's
visualization utilities. Due to the Applet's security constraints, local disk access is
not supported by Ursa Major. Figure ��� shows an overall view of the interactions
between Ursa Major, a user, and the Ursa Major repository (UMR).
Fig. ���. Interaction provided by the Ursa Major tool: the user interacts with the Ursa Major Applet (downloaded from a remote server), which downloads databases from the Ursa Major Repository (UMR) and presents them in views such as the Loop Table View and Call Graph View.
The data repository is being constructed from the results gathered in various
research projects. Currently it consists of the characteristics of a number of programs, the
results of compiler analyses of these programs, their performance numbers on diverse
architectures, and the data generated in several simulator runs. Individual databases
in the repository are in the Generic Data Format described in Section �����. One
issue in designing the repository was to define storage schemes that make it easy for
users to find information entered by other users. To this end, the repository structure
consists of extensions on file and directory names indicating data such as program
names, platforms, compilers, optimizations, and parallel languages. To be flexible,
these extensions are not hard-coded. Instead, they are described in a configuration
file that is read by Ursa Major at the start of a session.
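A minimal sketch of such an extension scheme, with invented field names and values: a configuration file would list the ordered fields, and each database name is split into its metadata accordingly. The field order is hard-coded here only for illustration.

```python
# Sketch of decoding repository file names whose dot-separated
# extensions carry metadata. In the real tool the field order comes
# from a configuration file read at session start; it is hard-coded
# here, and all names and values are illustrative only.

FIELDS = ["program", "platform", "compiler", "language"]

def parse_db_name(name):
    """Map a name like 'ARC2D.SPARC.polaris.openmp' to its metadata."""
    parts = name.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected database name: " + name)
    return dict(zip(FIELDS, parts))

meta = parse_db_name("ARC2D.SPARC.polaris.openmp")
print(meta["platform"])   # SPARC
```

Because the fields are data rather than code, adding a new dimension (say, an input data set) only requires extending the configuration, not the tool.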
Ursa Major supports a user model of "parallel programming by examples", and
it serves as a program and benchmark database for high performance computing. It
integrates information available from performance analysis tools, compilers, simulators,
and source programs to a degree not provided by previous tools. Ursa Major
can be executed on the World-Wide Web, from which a growing repository of
information can be viewed. Through continuous updates to the repository, we envision
Ursa Major becoming the first place to look for performance data.
The emergence of the Parallel Programming Hub presents an interesting opportunity
to compare these two network-based tools. Although their goals are distinct,
Ursa Minor on the Parallel Programming Hub and Ursa Major provide users
with the same visualization utilities for viewing performance and static analysis data.
The Parallel Programming Hub enables Ursa Minor to load and manipulate user
inputs from remote sites. On the other hand, it lacks support for access to a
centralized repository. A detailed performance comparison in terms of response
time is given in the next chapter.
��� Conclusions
Our effort to create a parallel programming environment has resulted in a parallel
program development and tuning methodology and a set of tools. We have developed
the tools with the design goals in mind to provide an integrated, flexible, accessible,
portable, and configurable tool environment that conforms to the underlying
methodology. Our toolset integrates static program analysis with performance evaluation,
while supporting data visualization and interactive compilation. Data management
is also simplified with our tools.
To give access to these tools to as many users as possible and to disseminate our
performance databases of various applications as widely as possible, we have used a
network computing infrastructure. In addition, we are currently building a database
repository that enables the visualization and manipulation of performance results
through a Java Applet application.
Here, we conclude the presentation of our methodology and tool efforts. The
introduced methodology addresses "what" in parallel programming. The toolset described
in this chapter has been designed and implemented based on our experience and
design goals, and aims to answer "how". Finally, with the extra effort to promote the
tools and to reach a wider audience, we have attempted to solve the question "where".
The methodology and the tools are useless if they are not effective in actual parallel
programming and performance tuning processes. The obvious next step is to evaluate
the benefits of these tools as well as the methodology, hence answering "how well"
they work. This is the topic of the next chapter.
�� EVALUATION
Evaluating a methodology and tools is difficult. This is largely due to two problems
associated with the topic. First, the desirable characteristics of a methodology and
supporting tools, such as efficiency and effectiveness, cannot be measured easily,
especially in quantitative ways. It is very challenging to establish a set of metrics for such
measures. Secondly, the goal of developing a methodology and supporting tools is to
assist users; thus, in determining the efficiency of a methodology and supporting tools,
the users' willingness and knowledge towards them become critical factors. Having
a large user community would help judge their value. Even then, however, creating
controlled experiments to obtain quantitative feedback is very difficult.
These are the main reasons that many tool efforts in parallel programming have
ignored the evaluation aspect. The majority of publications related to parallel
programming tools do not include quantitative evaluations. Even general descriptions
of user feedback, such as "response to the Sigma editor has been good" ����, are
seldom found. Some of them demonstrate the usage of tools via descriptive case
studies � �� � ��� �� ����. Publications focusing on programming methodology have
taken the same approach ��� �� �� �� and give several examples of how their proposed
scheme can be applied to actual programming practices. One notable evaluation effort
is found in the SUIF Explorer publication ����, in which a performance improvement
attempted by a user is summarized in detail. Whether it accurately reflects the
efficiency of the tool is arguable, but as the only quantitative measurement for tool
evaluation, their effort is noteworthy.
In this chapter, we attempt to achieve a fair and accurate evaluation as follows. In
Section ���, we give a series of case studies to demonstrate the usage of our
methodology and tool support. A detailed description of each parallelization and tuning
process is given in the section. These case studies serve to show the applicability
of the methodology and the functionality of the tools. In Section ���, we evaluate the
tool functionality by analyzing and comparing the tasks accomplished with and
without the tools. Also, we summarize the comments from users in this section. The
comparison of our tools with other parallel programming environments is given in
Section ���. Finally, we discuss tool accessibility as the result of adopting the
network computing facilities in Section ���. Conclusions are given last.
��� Methodology Evaluation Case Studies
����� Manual tuning of ARC2D
In this section, we present a case study illustrating the manual tuning process of
the program ARC2D from the Perfect benchmark suite ����. This case study was
presented in ����. In this study, a programmer tried to improve the performance of
the program beyond that achieved by the Polaris parallelizing compiler. The target
machine is a HyperSPARC workstation with � processors.
Polaris was able to parallelize almost all loops in ARC2D. However, the speedup
of the resulting executable was only ��� on � processors. Using Ursa Minor's
Structure View and sorting utility, the programmer was able to find three loops to
which loop interchange could be applied: FILERX do��, XPENTA do�, and XPENT2 do�.
After the loop nests were interchanged in these loops, the total program execution time
decreased by �� seconds, increasing the speedup from ��� to ���.
As a result of this modification, the dominant program sections changed. The
programmer re-evaluated the most time-consuming loops using the Expression
Evaluator to compute new speedups and the percentage of loop execution time over the
total time. The most time-consuming loop was now the STEPFY do��� nest, which
consumed ��% of the new parallel execution time. The programmer examined the
nest with the source viewer and noticed two things: (1) there were many adjacent
parallel regions, and (2) the parallel loops were not always distributing the same
dimension of the work array. The programmer merged all of the adjacent parallel
regions in the nest into a single parallel region. The new parallel region consisted of
four consecutive parallel loops. The first two nests were single loops that distributed
the work array across its innermost dimension. The second two nests were doubly
nested and distributed the work array across its second innermost dimension. The
effect of these changes was two-fold. First, the merging of regions should eliminate
parallel loop fork/join overhead. Second, the normalization of the distributions within
the subroutine should improve locality. After this change, the speedup of the loop
improved from �� to ����.
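The effect of merging adjacent parallel loops can be sketched independently of OpenMP: two passes over the same index range become one pass, so a single fork/join pair is paid instead of two. The arrays and loop bodies below are illustrative, not ARC2D's:

```python
# Illustration of fusing two adjacent loops over the same range.
# In the parallel program each loop is its own parallel region, so
# fusing them replaces two fork/join pairs with one; the computed
# results are identical either way.

N = 100
a = [0.0] * N
b = [0.0] * N

# Before: two separate loops (two parallel regions).
for i in range(N):
    a[i] = i * 2.0
for i in range(N):
    b[i] = a[i] + 1.0

a2 = [0.0] * N
b2 = [0.0] * N
# After: one fused loop (a single parallel region). Fusion is legal
# here because b2[i] depends only on a2[i], which is computed earlier
# in the same iteration.
for i in range(N):
    a2[i] = i * 2.0
    b2[i] = a2[i] + 1.0

assert a == a2 and b == b2
```

The locality benefit of the normalization step is analogous: when consecutive loops traverse the array in the same order, data brought into the cache by one loop is reused by the next.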
The programmer was able to apply the same techniques (fusion and normalization)
to the next � most time-consuming loops (STEPFX do���, FILERX do��, and
YPENTA do�). These modifications resulted in a speedup gain from ��� to ����. Finally,
the programmer applied the same techniques to the next most time-consuming
sections, XPENTA, YPENT2, and XPENT2, according to the newly computed profiles and
speedups. The speedup improved to ���. The programmer felt that the point of
diminishing returns had been reached and halted the optimization.
Fig. ��. The (a) execution time and (b) speedup of the various versions of ARC2D
(Mod1: loop interchange; Mod2: STEPFY do��� modification; Mod3: STEPFX do���
modification; Mod4: FILERX do�� modification; Mod5: YPENTA do� modification;
Mod6: modification on XPENTA, YPENT2, and XPENT2).
In summary, applying loop interchange, parallel region merging, and distribution
normalization yielded an increase from the out-of-the-box speedup of ��� to
a speedup of ���. This corresponds to a ��% decrease in execution time. Figure ��
shows the improvements in the total program performance as each optimization was
applied. Ursa Minor allowed the user to quickly identify the loop structure of the
program and sort the loops to identify the most time-consuming code sections. After
each modification, the user was able to add the new timing data from the modified
program runs, re-calculate the speedup, and see whether an improvement was worthwhile.
����� Evaluating a parallelizing compiler on a large application
In one research project, a user is enabling the Polaris compiler to work effectively
with large codes (on the order of at least ����� lines) ����. These codes have many
levels of abstraction and are very modular, making it difficult to link performance
and parallelization bottlenecks to their causes. Ursa Minor was used with the
SPECseis application suite ����, a set of codes that perform seismic processing, as
a basic GUI to help manage the thousands of lines of code and hundreds of loop
timings, as well as to direct the compiler developer toward enabling Polaris to recognize
more parallelism.
Ursa Minor allows the user to easily pick out the significant portions of the code
(in terms of execution time) and to find their callers and callees. We found that the
implementation of the finite-differencing scheme, which was a landmark in the history
of seismic processing, takes only �% of the total time. The accompanying correction
routine, which compensates for the errors that accrue with the finite-difference
approximation, takes ��% of the total execution time. The correction routine performs
an FFT, applies the error equations, and transforms the data back from the frequency
domain.
Besides the ability to quickly and easily locate the major components of the
execution time, the user found Ursa Minor helpful to the compiler developer in analyzing
the effectiveness of compilation techniques. One key benefit of using Ursa Minor
for performance evaluation is the ability to apply the Expression Evaluator to both
the run-time performance and the compile-time analysis. Polaris was able to
parallelize loops which contributed only �% of the execution time. The user used Ursa
Minor to determine why certain key loops were not parallelized (a feature requiring
one mouse click) in order to add techniques that address these issues. The SEICFT
routine performs a �D FFT on a frequency slice. The routine contains while loops,
which are not parallelized by Polaris.
With Ursa Minor, the user was also able to work with the application as a
whole to determine what factors influence automatic parallelization across the entire
code. We can do so using the commands provided in the Ursa Minor tool. In
particular, Ursa Minor revealed that inlining or interprocedural analysis is a crucial
parallelism enabler for parallelizing compilers when dealing with large, modular
codes. Eight out of the top ten loops (for the first seismic phase) have subroutine
calls within them.
����� Interactive compilation
The use of a parallelizing compiler as an interactive tool can benefit users in many
ways. Users can incorporate the feedback from the compiler during compilation and
add appropriate modifications to the source. An incremental use of such a tool
simplifies code management and debugging as well, because the code changes made
by users are localized. In addition, the ability to "build" a parallelizing compiler (as
described in the previous chapter) allows users to experiment with different compiler
techniques, so that they can learn more about the techniques and their effects.
We present a case study in ���� to demonstrate the functionality of InterPol.
A user parallelized the small example program shown in Figure ���(a). Figure ���(b)
shows the code after simply being run through the default Polaris configuration with
the inlining switch set to inline subroutines of � statements or less. Two important
results can be seen: (1) subroutine one is not inlined due to the inlining pass executing
prior to deadcode elimination, and (2) the loops in subroutine two are not found to
be parallel because of subscripted array subscripts, which the Polaris compiler cannot
analyze. Figure ���(c) shows the resulting program after adding a deadcode pass prior
to the inlining pass in the Compiler Builder, and running the main program and
subroutine one from Figure ���(a) through this "new" compiler. Finally, in Figure ���(d),
(a)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
      DO I = 1, 100
        CALL ONE(A,B,I)
        C(I) = I
      ENDDO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE ONE(A,B,I)
      REAL A(100,100),B(100,100)
      INTEGER DEADCODE
      DEADCODE = 1
      DEADCODE = 2
      DEADCODE = 3
      DEADCODE = 4
      DEADCODE = 5
      DO J = 1,100
        A(J,I) = 1
        B(J,I) = 1
      ENDDO
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END
(b)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
      DO I = 1, 100
        CALL ONE(A,B,I)
        C(I) = I
      ENDDO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE ONE(A,B,I)
      REAL A(100,100),B(100,100)
!$OMP PARALLEL DO
      DO J = 1,100
        A(J,I) = 1
        B(J,I) = 1
      ENDDO
!$OMP END PARALLEL DO
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END
Fig. ���. Contents of the Program Builder during an example usage of the
InterPol tool: (a) the input program and (b) the output from the default Polaris
compiler configuration.
(c)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1,100
          A(J,I) = 1
          B(J,I) = 1
        ENDDO
        C(I) = I
      ENDDO
!$OMP END PARALLEL DO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END

(d)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1,100
          A(J,I) = 1
          B(J,I) = 1
        ENDDO
        C(I) = I
      ENDDO
!$OMP END PARALLEL DO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
!$OMP END PARALLEL DO
      END
Fig. ���. Contents of the Program Builder during an example usage of the
InterPol tool: (c) the output after placing an additional deadcode elimination
pass prior to inlining and (d) the program after manually parallelizing subroutine
two.
the user has selected only subroutine two, parallelized it by hand, and included this
modified version in the Program Builder. Through simple interactions with InterPol,
the user was able to take a code for which Polaris was only able to parallelize
a single innermost loop, and parallelize both of its outermost loops.
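The pass-ordering effect in this case study can be mimicked with a toy pipeline: an "inliner" with a size threshold fails when dead statements inflate a routine, and succeeds once a deadcode pass runs first. The statement threshold and routine contents below are invented for illustration, not Polaris's actual settings:

```python
# Toy compiler pipeline showing why pass order matters. A routine
# padded with dead assignments exceeds the inliner's size threshold;
# running deadcode elimination first shrinks it below the threshold.

INLINE_LIMIT = 4  # hypothetical "inline subroutines of <= 4 statements"

def deadcode(stmts):
    # Drop the (obviously dead) assignments to the unused variable.
    return [s for s in stmts if not s.startswith("DEADCODE")]

def can_inline(stmts):
    # "Inlining" succeeds only if the routine is small enough.
    return len(stmts) <= INLINE_LIMIT

routine_one = ["DEADCODE = 1", "DEADCODE = 2", "DEADCODE = 3",
               "A(J,I) = 1", "B(J,I) = 1"]

inlined_without = can_inline(routine_one)            # inline first: fails
inlined_with = can_inline(deadcode(routine_one))     # deadcode first: succeeds
print(inlined_without, inlined_with)
```

A Compiler Builder, as described in the previous chapter, lets the user rearrange such passes without rebuilding the compiler by hand.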
����� Performance advisor: hardware counter data analysis
In this case study, given in �� ��, we discuss a performance map that uses the
speedup component model introduced in ����. The model fully accounts for the gap
between the measured speedup and the ideal speedup in each parallel program section.
This model assumes execution on a shared-memory multiprocessor and requires that
each parallel section be fully characterized using hardware performance monitors to
gather detailed processor statistics. Hardware monitors are now available on most
commodity processors.
With hardware counter and timer data loaded into Ursa Minor, users can simply
click on a loop in the Ursa Minor table view and activate Merlin. Merlin
then lists the numbers corresponding to the various overhead components responsible
for the speedup loss in each code section. The displayed values for the components
show overhead categories in a form that allows users to easily see why a parallel region
does not exhibit the ideal speedup of p on p processors. Merlin then identifies the
dominant components in the loops under inspection and suggests techniques that
may reduce these overheads. An overview of the speedup component model and its
implementation as a Merlin map is given below.
Performance map description
The objective of our performance map is to be able to fully account for the
performance losses incurred by each parallel program section on a shared-memory
multiprocessor system. We categorize overhead factors into four main components. Table ��
shows the categories and their contributing factors.
Memory stalls reflect latencies incurred due to cache misses, memory access times,
and network congestion. Merlin will calculate the cycles lost due to these overheads.
If the percentage of time lost is large, locality-enhancing software techniques will be
Table ��
Overhead categories of the speedup component model.

Overhead category  | Contributing factor | Description                                                  | Measured with
Memory stalls      | IC miss             | Stall due to I-Cache miss.                                   | HW Cntr
                   | Write stall         | The store buffer cannot hold additional stores.              | HW Cntr
                   | Read stall          | An instruction in the execute stage depends on an earlier    | HW Cntr
                   |                     | load that is not yet completed.                              |
                   | RAW load stall      | A read needs to wait for a previously issued write to the    | HW Cntr
                   |                     | same address.                                                |
Processor stalls   | Mispred. stall      | Stall caused by branch misprediction and recovery.           | HW Cntr
                   | Float dep. stall    | An instruction needs to wait for the result of a floating    | HW Cntr
                   |                     | point operation.                                             |
Code overhead      | Parallelization     | Added code necessary for generating parallel code.           | computed
                   | Code generation     | More conservative compiler optimizations for parallel code.  | computed
Thread management  | Fork/join           | Latencies due to creating and terminating parallel sections. | timers
                   | Load imbalance      | Wait time at join points due to uneven workload              | timers
                   |                     | distribution.                                                |
suggested. These techniques include optimizations such as loop interchange, loop
tiling, and loop unrolling. We found in ���� that loop interchange and loop unrolling
are among the most important techniques.
Processor stalls account for delays incurred internally to the processor. These include
branch mispredictions and floating point dependence stalls. Although it is difficult
to address these stalls directly at the source level, loop unrolling and loop fusion, if
properly applied, can remove branches and give more freedom to the backend compiler
to schedule instructions. Therefore, if processor stalls are a dominant factor in a loop's
performance, Merlin will suggest that these two techniques be considered.
Code overhead corresponds to the time taken by instructions not found in the
original serial code. A positive code overhead means that the total number of cycles,
excluding stalls, consumed across all processors executing the parallel code
is larger than the number used by a single processor executing the equivalent serial
section. These added instructions may have been introduced when parallelizing the
program (e.g., by substituting an induction variable) or by a more conservative
parallel code generating compiler. If code overhead causes performance to degrade
below that of the original code, Merlin will suggest serializing the code
section.
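The code overhead component described here can be derived directly from cycle counts. A minimal sketch, with invented function and variable names and made-up cycle numbers (the thesis obtains the actual counts from the hardware counter):

```python
def code_overhead(parallel_cycles, parallel_stalls, serial_cycles, serial_stalls):
    """Cycles spent in added instructions: the non-stall cycles summed over
    all processors of the parallel code, minus the non-stall cycles of the
    equivalent serial section."""
    busy_parallel = sum(c - s for c, s in zip(parallel_cycles, parallel_stalls))
    busy_serial = serial_cycles - serial_stalls
    return busy_parallel - busy_serial

# Example with made-up numbers for a 4-processor run:
overhead = code_overhead([300, 310, 305, 295], [40, 50, 45, 35], 1000, 120)
assert overhead == (1210 - 170) - 880  # 160 cycles of added code
```

A positive result indicates added instructions; a negative result, as in the ARC2D example discussed below, indicates that the parallel code executes fewer non-stall cycles than the serial version.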
Thread management accounts for latencies incurred at the fork and join points of
each parallel section. It includes the times for creating or notifying waiting threads, for
passing parameters to them, and for executing barrier operations. It also includes the
idle times spent waiting at barriers, which are due to unbalanced thread workloads.
We measure these latencies directly through timers before and after each fork and each
join point. Thread management latencies can be reduced through highly optimized
runtime libraries and through improved balancing schemes for threads with uneven
workloads. Merlin will suggest improved load balancing if this component is large.
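The timer-based measurement can be sketched as follows. The class and region names are hypothetical; a real implementation would read the timers immediately around the runtime library's fork and join calls:

```python
import time

class ForkJoinTimer:
    """Accumulates fork/join latency per parallel region."""
    def __init__(self):
        self.latency = {}

    def record(self, region, start, end):
        # Add the elapsed time of one fork (or join) event to the region's total.
        self.latency[region] = self.latency.get(region, 0.0) + (end - start)

timer = ForkJoinTimer()

t0 = time.perf_counter()
# ... fork point: threads would be created or notified here ...
t1 = time.perf_counter()
timer.record("STEPFX_do1", t0, t1)   # hypothetical region name

assert timer.latency["STEPFX_do1"] >= 0.0
```
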
Ursa Minor combined with this Merlin map displays (1) the measured performance
of the parallel code relative to the serial version, (2) the execution overheads
of the serial code in terms of stall cycles reported by the hardware monitor, and
(3) the speedup component model for the parallel code. We will discuss details of
the analysis where necessary to explain effects. However, for the full analysis with
detailed overhead factors and a larger set of programs we refer the reader to ����.
Experiment

For our experiment we translated the original source into OpenMP parallel form
using the Polaris parallelizing compiler ����. The source program is the Perfect
Benchmark ARC2D, which is parallelized to a high degree by Polaris.

We performed our measurements on a Sun Enterprise ���� with six ���-MHz
UltraSPARC processors, each with a �-KB L1 data cache and a �-MB unified L2
cache. Each code variant was compiled by the Sun v��� Fortran �� compiler with
the flags -xtarget=ultra�, -xcache=������, and -O�. For hardware performance
measurements, we used the available hardware counter (TICK register) ����.

ARC2D consists of many small loops, each of which has an average execution time
of a few milliseconds. Figure ��� shows the overheads in the loop STEPFX DO��� of the
original code, and the speedup component graphs generated before and after applying
a loop interchange transformation.

Fig. ���: Performance analysis of the loop STEPFX DO��� in program ARC2D. The
graph on the left shows the overhead components in the original, serial code. The
graphs on the right show the speedup component model for the parallel code
variants on � processors before and after loop interchanging is applied. Each
component of this model represents the change in the respective overhead category
relative to the serial program. Merlin is able to generate the information shown in
these graphs.
Merlin calculates the speedup component model using the data collected by a
hardware counter, and displays the speedup component graph. Merlin applies the
following map using the speedup component model: if the memory stall appears in
performance graphs of both the serial code and the Polaris-parallelized code, then apply
loop interchange. Following this suggested recipe, the user tries loop interchanging, which
results in significant, now superlinear speedup. The loop-interchange graph on the
right of Figure ��� shows that the memory stall component has become negative, which means that
there are fewer stalls than in the original, serial program. The negative component
explains why there is a superlinear speedup.
The speedup component model further shows that the code overhead component
has drastically decreased from the original parallelized program. The code is even
more efficient than in the serial program, further contributing to the superlinear
speedup.
In this example, the use of the performance map for the speedup component model
has significantly reduced the time spent by a user analyzing the performance of the
parallel program. It has helped explain both the sources of overheads and the sources
of superlinear speedup behavior.
����� Performance advisor: simple techniques to improve performance
In this section, we present a performance map based solely on execution timings
and static compiler information. Such a map requires program characterization data
that a novice user can easily obtain. In the study that we did in ����, a map is
designed to advise novice programmers in improving the performance of programs
achieved by a parallelizing compiler such as Polaris ����. In this case study, we
assume that novice programmers have used a parallelizing compiler as the first step to
optimize the performance of the target program and that its static analysis
information is available. The performance map presented in this section aims at improving
this initial performance.
Our goal in this study is to provide users with a set of simple techniques that
may help enhance the performance of a parallel program based on data that can be
easily generated. This includes timing and static program analysis data. Based on
our experiences with parallel programs, we have chosen three techniques that are (1)
easy to apply and (2) may yield considerable performance gain. These techniques
are serialization, loop interchange, and loop fusion. They are applicable to loops,
which are often the focus of the shared memory programming model. All of these
techniques are present in modern compilers. However, compilers may not have enough
knowledge to apply them most profitably ����, and some code sections may need small
modifications before the techniques become applicable automatically.
Performance map description

We have devised criteria for the application of these techniques, which are shown
in Table ���. If the speedup of a parallel loop is less than 1, we assume that the loop
is too small for parallelization or that it requires extensive modification. Serializing it
prevents performance degradation. Loop interchange may be used to improve locality
by increasing the number of stride-1 accesses in a loop nest. Loop interchange is
commonly applied by optimizers; however, our case study shows many examples of
opportunities missed by the backend compiler. Loop fusion can likewise be used to
increase both granularity and locality. The criteria shown in Table ��� represent
simple heuristics and do not attempt to be an exact analysis of the benefits of each
technique. We simply assumed a speedup threshold of ��� for applying loop
fusion.

Table ���: Optimization technique application criteria.

Technique         Criterion
Serialization     speedup < 1
Loop Interchange  # of stride-1 accesses < # of non-stride-1 accesses
Loop Fusion       speedup < ���
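These criteria can be expressed in a few lines of code. In the sketch below, the fusion threshold is a stand-in constant, since it is only an assumed heuristic value, and the function name and inputs are invented for illustration:

```python
def suggest(speedup, stride1_accesses, other_accesses,
            fusion_threshold=1.5):  # assumed threshold, for illustration
    """Apply the performance map's simple heuristics to one parallel loop."""
    suggestions = []
    if speedup < 1.0:
        suggestions.append("serialize")          # loop too small to parallelize
    if stride1_accesses < other_accesses:
        suggestions.append("loop interchange")   # improve locality
    if speedup < fusion_threshold:
        suggestions.append("loop fusion")        # increase granularity
    return suggestions

assert suggest(0.8, 10, 30) == ["serialize", "loop interchange", "loop fusion"]
assert suggest(2.0, 30, 10) == []
```
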
Experiment

We have applied these techniques based on the criteria presented above. We have
used a Sun Enterprise ���� with six ���-MHz UltraSPARC processors. The OpenMP
code is generated by the Polaris OpenMP backend. The results on five programs
are shown. They are SWIM and HYDRO2D from SPEC95, SWIM from SPEC2000, and
ARC2D and MDG from the Perfect Benchmarks. We have incrementally applied these
techniques starting from serialization. Figure ��� shows the speedup achieved by the
techniques. The improvement in execution time ranges from ��� for fusion in ARC2D
to ���� for loop interchange in SWIM (SPEC2000). For HYDRO2D, application of the Merlin
suggestions did not noticeably improve performance.
Fig. ���: Speedup achieved by applying the performance map. The speedup is with
respect to a one-processor run with serial code on a Sun Enterprise ���� system. Each
graph shows the cumulative speedup when applying each technique.

Among the codes with large improvement, SWIM from SPEC2000 benefits most
from loop interchange. It was applied under the suggestion of Merlin to the most
time-consuming loop, SHALOW DO����. Likewise, the main technique that improved
the performance in ARC2D was loop interchange. MDG consists of two large loops
and numerous small loops. Serializing these small loops was the sole reason for the
performance gain. Table ��� shows a detailed breakdown of how often the techniques
were applied and their corresponding benefit.
Using this map, considerable speedups are achieved with relatively small effort.
Novice programmers can simply run Merlin to see the suggestions made by the
map. The map can be updated flexibly without modifying Merlin. Thus, if new
techniques show potential or the criteria need revision, expert programmers can
easily incorporate the changes.
��� Efficiency of the Tool Support
In order to quantitatively evaluate the efficiency of the tool support, we have
performed an experiment with the help of actual tool users. We prepared a set of
small tasks that are commonly done by parallel programmers, and asked users to
accomplish these tasks with and without our tools. In addition, we have asked the users
of the tools a series of questions to gather their opinions on the tools and their usage.
The questions targeted the functionality of the tools as well as general comments on
the methodology. We present the results in the following sections.

Table ���: A detailed breakdown of the performance improvement due to each technique.

Benchmark        Technique      Number of Modifications / Improvement
ARC2D            Serialization  �����
                 Interchange    �� ��
                 Fusion         �� ��� �
HYDRO2D          Serialization  �� �����
                 Interchange    � ����
                 Fusion         � ���
MDG              Serialization  �� ����
                 Interchange    � ����
                 Fusion         � ����
SWIM (SPEC95)    Serialization  � ����
                 Interchange    � ����
                 Fusion         ���
SWIM (SPEC2000)  Serialization  � ����
                 Interchange    � ����
                 Fusion         � ���
����� Facilitating the tasks in parallel programming

Common tasks in parallel programming

The main objective of the experiment is to produce quantitative measures for the
efficiency of the tools' functionality. To this end, we have selected 10 tasks that are
commonly performed by parallel programmers using parallel directives. These tasks
are listed in Table ���.

Table ���: Common tasks in parallel programming.

task 1:  compute the speedup of the given program on � processors in terms of the serial execution time.
task 2:  find the most time-consuming loop based on the serial execution time.
task 3:  find the inner and outer loops of that loop.
task 4:  find the caller(s) of the subroutine containing the most time-consuming loop.
task 5:  compute the parallelization and spreading overhead of that loop on � processors.
task 6:  compute the parallel efficiency of the second most time-consuming loop on � processors.
task 7:  export profiles to a spreadsheet to create a total execution time chart
         (on a varying number of processors) containing � of the most time-consuming loops.
task 8:  count the loops whose speedups are below �.
task 9:  count the loops that are parallel and whose speedups are below �.
task 10: compute the parallel coverage and the expected speedup based on Amdahl's Law.
Task 1: compute the speedup of the target program. The speedup of the
entire program is perhaps the most frequently used metric in computational
engineering. The changes made (parallelization or any other type of optimization) are
evaluated by the speedup gain in program execution time. The instrumentation to
measure program execution time is simple, and any calculator can be used to compute
this number.
Task 2: find the most time-consuming code sections. Finding the dominant
code sections using profiles is the most important task in performance tuning. Most
users would look into the summary files generated from program execution with a text
editor. In this case, users would have to run a text editor (menu clicking or typing the
command on a shell) and find the most time-consuming loop in the file. Looking for
the largest quantity among many numbers would take a significant amount of time,
at best on the order of minutes. Some users suggested using the "sort" command
available from UNIX as follows:

$ cat name.sum | sort -r -k �

This produces a sorted list of summary file entries quickly, but users have to remember
the column number to sort by, and the amount of text to type is not trivial. Moreover,
if multiple files need to be presented for comparison, the sorting command cannot be
used. By contrast, using the Ursa Minor tool, the task can be accomplished by (1)
activating the tool (typing "UM"), (2) loading the profile (menu clicking), and (3)
sorting based on the column the user chooses (popup menu clicking).
Task 3: find inner and outer loops of a specific loop. Increasing the granularity
of parallel execution is an important technique for improving parallel performance.
This involves looking into the inner or outer loops of the loop under consideration. There
are no other tools that explicitly support this task. Programmers would have to use
a text editor to find the loop and examine the source to figure out the loop nest. The
Structure View of Ursa Minor significantly simplifies this task. Users only need to
load the compiler listing file (menu clicking, scrolling, and mouse clicking), find the
section (scrolling or using the "Find" feature), and look at the display.
Task 4: find the caller(s) of a specific subroutine. The presence of function
or subroutine calls may cause the parallelizing compiler to abandon optimizing loops.
Users' knowledge of the target program can be of great use in such cases. Finding
the callers and callees of a subroutine or a function is an essential task in optimizing
nested subroutines and loops with subroutine calls. Normally, programmers would
have to examine the program source to accomplish this task. UNIX utilities such
as "grep" can be useful. The Structure View from Ursa Minor provides one-click
support for finding "parents" and "children" of selected code sections.
Task 5: compute overheads. Identifying performance problems requires defining
first what the problems are. Metrics such as parallelization and spreading
overheads are frequently used variables in the problem definitions. Consequently,
computing these metrics is a critical step in locating performance problems. One of the
conventional methods of computing the overheads involves a calculator. When users
need to compute overheads for multiple code sections, a commercial spreadsheet or
special-purpose scripts can provide an easier way. The mathematical functions
provided by Ursa Minor also support the derivation of new metrics from the existing
data. This set of functions specifically targets parallel programming, so many of the
metrics commonly used in parallel programming are included in the set. In the
current version, however, the parallelization and spreading overheads are not directly
supported.
Task 6: compute parallel efficiencies. Parallel efficiency is another widely used
measure for evaluating parallel performance. Parallel efficiency E(P) on P processors
is defined as

    E(P) = Tserial / (P · Tparallel(P))                         (���)

Users can compute this number using a calculator or a spreadsheet. Ursa Minor
provides a function that computes parallel efficiency.
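As a small illustration, the definition can be coded directly; the timing numbers below are made up:

```python
def parallel_efficiency(t_serial, t_parallel, p):
    """E(P) = Tserial / (P * Tparallel(P))."""
    return t_serial / (p * t_parallel)

# A perfectly scaling section has efficiency 1.0:
assert parallel_efficiency(8.0, 2.0, 4) == 1.0
# Half the ideal speedup gives efficiency 0.5:
assert parallel_efficiency(8.0, 4.0, 4) == 0.5
```
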
Task 7: export profiles to a spreadsheet to create charts. An integrated toolset
offers an advantage in that exchanging files is easier. Data files typically take one
form or another, and converting them into a form that other tools understand may
not be trivial. Commercial spreadsheets do a good job of importing text-based tabular
data files such as timing profiles and can create a variety of graphs. Combining multiple
summary files becomes difficult, however. Without Ursa Minor, users would have
to create a comma-separated file using Awk or Sed scripts. Adding profiles and
arranging data for exporting are frequently used features of Ursa Minor; often, this
can be done within a minute. In addition, Ursa Minor can create charts
from any columns or rows that a user selects.
Task 8: count loops that have problems. This is another example that
emphasizes the perspective on the overall performance. Users should be able to view
the resulting performance in terms of large blocks of code sections, and that means
dealing with multiple loops that dominate the overall performance. There is no
direct support for this task in either Ursa Minor or commercial spreadsheets, but a
sequence of operations can accomplish the task.
Task 9: count parallel loops that have problems. The combined analysis
of performance data and static program data such as compiler listings is more efficient
in locating performance problems. This task is a simple example of such a
case. Depending on the focus of the optimization (parallel optimization or general
locality optimization), combining the information on the parallel nature of code blocks
with their performance figures is much more efficient than dealing with each aspect
separately. Conventional tools do not support this approach. The query functions
available in Ursa Minor are designed specifically to help users comprehend the two
different kinds of data in the same context.
Task 10: compute the expected speedup based on Amdahl's law. This
task represents a multi-step process of performance evaluation. Amdahl's law
provides a simple performance model that can be used to evaluate actual performance.
Computing the expected speedup based on Amdahl's law requires computing the
parallel coverage of the target program and several steps of computation. This task
was selected to test how users use the tools to accomplish a rather complex goal. Users
are expected to use a combination of tools for this task.
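The computation itself can be sketched as follows; the coverage fraction and processor count are invented example values:

```python
def amdahl_speedup(parallel_coverage, p):
    """Expected speedup when a fraction `parallel_coverage` of the serial
    execution time runs perfectly in parallel on p processors."""
    return 1.0 / ((1.0 - parallel_coverage) + parallel_coverage / p)

# E.g., 95% parallel coverage on 4 processors (made-up numbers):
s = amdahl_speedup(0.95, 4)
assert abs(s - 1.0 / (0.05 + 0.95 / 4)) < 1e-12
assert s < 4  # the serial fraction bounds the achievable speedup
```
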
Task 1 is a simple calculation, so users are expected to use either a calculator
or the Expression Evaluator from Ursa Minor with comparable efficiency. Task 2
evaluates the table manipulation utilities (sorting and rearranging) for performance
data. Tasks 3 and 4 target the efficiency of the Structure View and the utilities that
it provides. The Expression Evaluator is the main target for evaluation in tasks 5
and 6. Task 7 tests the ability to rearrange tabular data and export them to other
spreadsheet applications. The rest of the tasks (8, 9, and 10) attempt to evaluate
the combined usage of multiple utilities (sorting, the Expression Evaluator, query
functions, the static information viewer, and the display option control) provided by
Ursa Minor.
Experiment

We have asked four users to participate in this experiment. They were asked
to perform these tasks one by one. Two different datasets were prepared for the
experiment. These datasets contain timing profiles of FLO52Q from the Perfect
Benchmarks ���� under two different environments. Thus, the number of data items
is the same in both datasets, but the profile numbers are different. First, these users
were asked to perform the tasks without our tools. Users were allowed to use any
scripts that they had written previously. Then, they performed the tasks using our
tools with the other dataset.

The time to activate tools (spreadsheet, Ursa Minor, and so on) and load input
files was counted separately as "loading time". The reason for this is that when users
perform these individual tasks separately under different environments, the loading
time needs to be added to the time taken to finish each task. Since the users performed
the tasks in one session, they needed to activate the tools only once. The time to convert
data files for different tools is also included in the loading time. Hence, the loading
time also reflects the level of integration of the tools.
The four users who participated represent different classes of users. User 1 is an expert
performance analyst who has written many special-purpose scripts to perform various
jobs. These scripts do tabularizing, sorting, etc. User 1 does use our tools but relies
more on these scripts. User 2 has also been working on performance evaluation for
a while and is considered an expert as well. He uses only basic UNIX commands
rather than scripts. However, his skills with the basic UNIX commands are very good,
so he can perform a complex task without taking much time. User 2 started using
our tools only recently. User 3 is also an expert performance analyst, but his main
target programs are not shared memory programs. He has been using our tools for a
long time, but with distributed memory programs. Finally, user 4 is a novice parallel
programmer. His experience with parallel programs is limited compared to the
others. He has read our methodology and tried to use our tools in his benchmarking
research.
Table ���: Time (in seconds) taken to perform the tasks without our tools.

          user 1  user 2  user 3  user 4  average
task 1    �       �       ��      ��
task 2    �       �       �       ��      ����
task 3    ��      ��      ��      ��      ����
task 4    ��      �       ��      ��
task 5    ��      ��      ��      ��      �
task 6    ��      ��      ��      ��      �����
task 7    ��      ���     ��      ���     ����
task 8    �       ���     �       ���     ���
task 9    ���     ���     ���     ��      �����
task 10   ���     ���     ���     ��      ���
loading   �       ��      ��      �       ������
total     ���     ����    ����    �����   ����
Table ��� shows the time for these users to perform the assigned tasks. Users �, �,
and � decided that tasks � and � could not be performed within a reasonable time, so
they gave estimated times instead. All of the users used a commercial spreadsheet
later in the session, but user 4, the novice programmer, started doing the tasks after
he set up the spreadsheet and imported the input files. User 1 used his scripts for
many of the tasks.

As the second part of the experiment, the users were allowed to use our tools to
perform the tasks. The results are shown in Table ���. User 1 used a combination of
a spreadsheet and Ursa Minor to perform tasks �, �, and �. The others used a
spreadsheet for task � only. User � was not sure that he could finish task � even with
our tool support, so he gave an estimated time.
Table ���: Time (in seconds) taken to perform the tasks with our tools.

          user 1  user 2  user 3  user 4  average
task 1    �       �       ��      �       ���
task 2    �       �       �       ����
task 3    �       �       �       ���
task 4    �       �       �       ����
task 5    ��      ��      ��      ��
task 6    �       �       �       �       �
task 7    ��      ��      ���     ��      ����
task 8    �       ��      ��      ��      ��
task 9    ��      ��      �       ��      ��
task 10   �       ���     ��      ��      ������
loading   ��      ��      ��      ��      �����
total     ��      ���     ���     ����    �����
As can be seen from these tables, our tool support considerably improves the time to perform
common parallel programming tasks. Figure ��� shows the overall times
to finish all the tasks. As can be seen in the figure, our tool support not only
saves time, but also makes the process easier for novice programmers, resulting in
comparable times to perform the tasks when using our tools. The work speedups for
the users are ���, ����, ����, and ���, respectively.

The strength of our approach lies not only in the fact that the tools offer efficient
ways of performing these individual tasks, but also in that these features are provided
in an integrated toolset. This is demonstrated by the savings in the loading time
in our experiment. Users do not have to deal with several tools and commands.
There is no need to open the same file in many different tools. For instance,
users can open the Structure View to inspect the program layout and examine and
restructure the performance data from the same database. Taking this advantage
into consideration, our tool support becomes even more appealing.

Fig. ���: Overall times to finish all 10 tasks.
����� General comments from users

We summarize users' comments on various tool features in this section. Users have
responded very positively to the Structure View of Ursa Minor. We have received
comments such as "There is no alternative that I know of that gives as good of an
overview of the program structure quickly," or "If I am looking at a new program, one
that I am unfamiliar with, I almost always look at its structure with Ursa Minor
to get a feel for its layout." Although not specified in the methodology, many users
examine program sources before they begin working on optimization. The Structure
View is offering vital help to those users.

The Table View has gotten good reviews as well. One response was "The Table
View is good. I like its ability to combine multiple types of data." In addition, users
liked the bar graph at the right side of the Table View, which visualizes numeric data
instantly. The Expression Evaluator also proves to be very useful, allowing users to
compute different metrics on demand. One user listed "integration of tools in a parallel
performance specific manner" as one of the reasons for using our tools. However, some
users were not fully content with the cumbersome interface to move, swap, and arrange
columns. Also, the limited graphing capabilities were pointed out as one of the weak
points of Ursa Minor. Overall, the many versatile features provided by Ursa Minor
are greatly appreciated by users.
InterPol is still relatively new to users and has not been used much.
Furthermore, we feel that there remain issues to be resolved with respect to documentation
and user interface. Consequently, we did not get much feedback from users. As
InterPol gets more recognition from users with an improved interface and documentation,
we anticipate that users will actively utilize the tool and return to us with quality feedback.

As the tools evolve in a need-driven way, the feedback from the user community
will provide invaluable direction for the next generation of our tool family. We
expect future upgrades of the tools to incorporate users' opinions. For instance,
the weakness in the GUI can be resolved with newly available Java technology.
Developers need to monitor users' needs and wishes constantly to keep up with
current state-of-the-art parallel programming practices. Keeping the tool design
projects and users' application characterization efforts close together will ensure the
practicality of our tools in the future.
��� Comparison with Other Parallel Programming Environments

In Chapter �, we have listed several parallel programming environments: Pablo
and the Fortran D editor ����, SUIF Explorer ����, FORGExplorer ����, the KAP/Pro
Toolset ����, the Annai Tool Project ����, DEEP/MPI ����, and Faust ����. We present
in this section a more detailed comparison of our toolset with these environments.
Table ��� shows the availability of features in these environments. The parallelization
utility available from the Pablo/Fortran D Editor is actually semi-automatic.

Table ���: Feature comparison of parallel programming environments. The features
compared are: performance data visualization, program structure visualization,
compiler analysis output, automatic parallelization, interactive compilation,
support for reasoning, automatic analysis/guidance, and debugging.

Pablo/Fortran D Editor   4 of the features
SUIF Explorer            5 of the features
FORGExplorer             3 of the features
KAP/Pro Toolset          3 of the features
Annai Project            2 of the features
DEEP/MPI                 3 of the features
Faust                    5 of the features
Ursa Minor/InterPol      7 of the features (all but debugging)

Other than the debugging capability, the Ursa Minor/InterPol pair covers all of
the functionalities listed in the table. In addition, our environment has unique features
not available from the others. Ursa Minor's ability to freely manipulate and
restructure performance data is unprecedented among programming environments.
Furthermore, Ursa Minor allows performance data to be integrated with static analysis
data through a set of mathematical and query functions. A performance guidance
system such as Merlin has not been attempted by the others, either. SUIF Explorer's
Parallelization Guru only points to important target code sections. DEEP/MPI's
advisor is limited to hard-coded procedure-level analysis, so detailed diagnosis of
smaller code blocks is not possible. InterPol allows users to "build" their own
parallelizing compiler; no such feature is available in other tools. Overall, the Ursa
Minor/InterPol toolset offers the most versatile and flexible features to date.
Perhaps the most outstanding aspect of our toolset is its accessibility. As opposed
to most other environments, which have ceased to exist or are no longer supported, Ursa
Minor exists in Web-accessible form. Any user with an Internet connection can use
the tool with the help of complete on-line documentation. Such a quality is not easily
found in most tool development projects. The topic of the next section is the efficiency
of our tools placed on the World Wide Web.
��� Comparison of Ursa Major and the Parallel Programming Hub

In an effort to reach a larger audience with our tools, we have used network
computing concepts to implement an on-line tuning data repository (Ursa Major)
and a Web-executable integrated tool environment (the Parallel Programming Hub).
Ursa Major is an Applet-based data visualization and manipulation tool for a
repository of optimization studies. The Parallel Programming Hub allows users to
access and run tools without the hassle of searching, downloading, and installing
them.

The Parallel Programming Hub contains Ursa Minor, and Ursa Major uses
many components from the Ursa Minor tool and provides almost identical
functionality. This presents an interesting opportunity to compare and evaluate different
approaches to network computing. In this section we compare the efficiency of Ursa
Minor on the Parallel Programming Hub and Ursa Major. We provide qualitative
and quantitative measures. With this comparison, we attempt to provide directions for
the next generation of on-line tools. This work was presented in ����.

Batch-oriented tools run as efficiently on the Parallel Programming Hub as on
local platforms. In fact, thanks to the PUNCH system's powerful underlying machine
resources, most users' tools have faster response times on the Hub. Interactive tools
need closer inspection.
A typical tool interaction with Ursa Minor causes the tool to fetch from a
repository a program database that represents a specific parallel programming case
study. It then performs various operations on this database and displays the results
using Ursa Minor's visualization utilities. Table ��� shows how server, client, and
file operations are invoked by various tasks of the tool.

Table ���: Workload distribution on resources with our network-based tools.

Task                   Ursa Minor                        Ursa Major
application execution  server                            client Applet
database load          local disk IO + server            network transfer + client Applet
display                network transfer + client (VNC)   client Applet

In a typical interactive tool session, a user loads input files, runs computing
utilities on the data, and adds more files for further manipulation. From this scenario,
we chose three tool operations. We have measured the time taken to load a database,
perform a simple spreadsheet-like operation on the data, and search and display a
portion of the source code. The database load is an example of loading input data, while
spreadsheet command evaluation is representative of computing on the data. The source
search operation requires a simple search through a source code. Interestingly, these
three operations exhibit different patterns in resource usage. For Ursa Major, the
database load operation requires downloading the database, parsing it, and updating
the display appropriately. Hence, it exercises both networking and computing
capabilities. The second operation, evaluation of a spreadsheet command, performs
a mathematical operation on the data that the Applet has already downloaded, so it
only involves computing on the client machine. The search operation mainly relies on
networking. A source file is not part of the database; hence it has to be downloaded
separately. For Ursa Minor, data transfer over the network is replaced by file IO.
However, the response to a user action has to be updated on the display of the remote
client machine.
We chose two different databases for this experiment, representing a small and a
large application study, respectively. The first database contains tuning information
for the program BDNA from the Perfect Benchmarks ����. The database size is
about �� Kbytes, and the accompanying source file is about ��� Kbytes. We consider
this to be a small database. The second database contains information about the
parallelization of the RETRAN code ����, which represents a large power plant
simulation application. The database we used is ��� Kbytes in size, and the size of
the source is about ��� Mbytes.
Finally, we chose three machines on which we measured the tool response times.
"Networked PC" is a PC with a ���MHz Pentium II and � Mbytes of memory. Its operating system is Windows NT, and it is connected to the Internet through a � Mbps Ethernet card. "Dialup PC" is a home PC with a ���MHz Pentium II processor and � Mbytes of memory. Its operating system is Windows��, and it connects to the Internet through a ����K modem via a local ISP. The third machine, "Networked Workstation", is an UltraSPARC workstation with a � MHz processor and � Mbytes of memory. Its operating system is SunOS v�� , and its network bandwidth is � Mbps.
We measured the response time of the three operations at �-hour intervals over several days using a Netscape browser v���. We inserted timing functions for Ursa Major and used an external wall clock for Ursa Minor on the Parallel Programming Hub. We made � measurements for each case. The average times are shown in Figure ���, which displays the response time in seconds on the three machines for the three measured tool operations. "rt-load" refers to the response time to load the RETRAN database; "rt-eval" and "rt-search" refer to the time to perform spreadsheet command evaluation and source search, respectively. The data tags with prefix "bd" refer to the same operations on the BDNA database.
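The measurement loop behind these numbers is simple: run each operation repeatedly and average the wall-clock times. The sketch below is a hypothetical Python stand-in for this procedure (the actual experiment used timing functions inside the Java tools and an external clock); the `fake_database_load` operation is invented for illustration.

```python
import time

def measure(operation, repetitions=5):
    """Run an operation repeatedly and return the average
    wall-clock response time in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)

# Stand-in for a "database load": parse a chunk of comma-separated text.
def fake_database_load():
    data = "item,value\n" * 10000
    rows = [line.split(",") for line in data.splitlines()]
    return len(rows)

avg = measure(fake_database_load)
print(f"average response time: {avg:.4f} s")
```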
Overall, the networked PC exhibits the shortest response times for all operations. On this machine, the response times of Ursa Minor and Ursa Major are in the same vicinity. However, downloading a large program source significantly increases the response time of the search operation, despite the Ethernet connection. In the case of Ursa Minor, files are read through file IO within the server, so the network is not a dominating factor. The dialup PC displays adequate response times except for the search operation with Ursa Major; the network bottleneck is even more pronounced in this case. The networked workstation does not suffer substantially from its network connection, but its slow processor and relatively inefficient implementation of the Java Virtual Machine (JVM) make it the worst performing platform among the three.
Fig. ���. The response time of UM-Applet and UM-ParHub on (a) a networked PC, (b) a networked workstation, and (c) a dialup PC.

The response time on the three different machines for each operation, as shown in Figure ���, offers a different perspective. We only present the data for the operations on the RETRAN database, because those on the BDNA database show similar trends and the characteristics are more pronounced in the RETRAN case. The response time of Ursa Minor shows no noticeable variation across the three machines except on the dialup PC, where the spreadsheet command evaluation takes more than twice as long as on the others. This operation is not time-consuming, so a screen update becomes a factor with the slow modem connection. For Ursa Major, the platform becomes the deciding factor. If the network is slow, the search operation degrades; for compute-intensive operations, the machine speed and the quality of the JVM determine the response time. In all, the Hub-based tool performs better than the Applet-based version.
Fig. ���. The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching.
Our experiments show that the Parallel Programming Hub offers users a fast and stable solution to interactive network computing. The network transmits only the user's actions (pressing buttons and clicking a mouse) to and from the server, so neither the network nor the processor speed had much impact on tool usage in our experiment. By contrast, Applet-based tools rely on the client machine for computation and on the network for data transfer. Thus, if the amount of data is large or the client machine is slow, the resulting operations take considerably longer. The two networked machines we used are located within the Purdue network; we expect these performance characteristics to be even more pronounced on geographically distributed machines.
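This tradeoff can be captured by a first-order cost model: the Applet pays for transferring the data, the Hub pays only for transferring GUI events and screen updates. The sketch below is illustrative only; the sizes, bandwidth, and compute times are invented, not measured values from the experiment.

```python
def applet_response(data_bytes, bandwidth_bps, compute_s):
    # Applet: download the data, then compute locally on the client.
    return data_bytes * 8 / bandwidth_bps + compute_s

def hub_response(event_bytes, bandwidth_bps, compute_s):
    # Hub (VNC-style): only GUI events and screen updates cross the
    # network; computation happens on the server.
    return event_bytes * 8 / bandwidth_bps + compute_s

# A large source file over a modem strongly favors the Hub.
applet = applet_response(data_bytes=3_500_000, bandwidth_bps=56_000, compute_s=0.5)
hub = hub_response(event_bytes=20_000, bandwidth_bps=56_000, compute_s=0.5)
print(f"applet: {applet:.1f} s, hub: {hub:.1f} s")
```

The model reproduces the qualitative finding: the Hub's cost is nearly independent of the data size, while the Applet's cost grows with it.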
Although not as responsive as the Hub-based Ursa Minor, Ursa Major serves a distinct purpose. The accumulated repository of tuning studies helps users all over the world study the results from other researchers and compare results across different platforms. Users with above-average machines can take advantage of quick response times by running the application on them. The slow screen updates and sluggish mouse control that a slow network connection may cause for Ursa Minor are not a problem with Ursa Major.

An increasing number of users from all over the world are taking advantage of the Parallel Programming Hub. Ursa Minor itself has been accessed � �� times since it became operational in March of ����. As the hub adds tools and gains more recognition in the worldwide parallel programming community, we expect the number of accesses to grow at a faster rate.
��� Conclusions
In this chapter, we have evaluated the proposed methodology and its tool support. We have presented several case studies showcasing the usage of the tools in various parallelization and tuning studies. In many studies we did at Purdue, the proposed approach to performance tuning has resulted in considerable improvement in the end results. Many features provided by the tools are actively used by programmers, and, most of all, they are contained within an integrated tool environment.

In addition, we have focused on small individual tasks and shown how the tools can effectively assist users by simplifying time-consuming chores and making difficult obstacles more approachable. The sample tasks we used are commonly performed in all tuning studies, and users save considerable time and effort by using our tools. The experimental results show that our tools provide efficient support for many common tasks in parallel programming. In particular, the Expression Evaluator offers significant aid in deriving new data and computing metrics. Another unique feature, the Merlin performance advisor, simplifies the task of performance analysis considerably, as shown in the case studies.
Finally, we have evaluated the efficiency of the two different frameworks that we used to broaden the user community for our tools through network computing. Overall, the Hub-based Ursa Minor exhibited fast and uniform response times, especially in cases where large data transfers are required. On the other hand, Ursa Major does not suffer from sluggish control when the network is slow, but the time to transfer the requested data depends on the size of the database. Nevertheless, the purposes of these two tools are distinct, and both offer significant aid to parallel programmers worldwide.

As mentioned in the beginning, evaluating a methodology and tools is challenging work. This chapter represents our attempt to find ways to do so in both qualitative and quantitative terms. We would like to point out that this is not the end of our work towards a comprehensive parallel programming environment. Continuous feedback from its user community will help improve the tools' service to a wide range of parallel programmers.
�� CONCLUSIONS
��� Summary
When we first started out as novice parallel programmers, we had little experience in the area. Every problem that we encountered seemed formidable and impossible to resolve. We had to resort to experts for almost every task in the optimization process; we did not know what to do, or how to do it, at practically every step of the way. After a long period of trial and error, we developed our own paradigm for parallelizing and tuning programs. As our methodology was refined over the years, the tasks became routine, and, most of all, we were seldom puzzled or frustrated by seemingly unexpected results. The methodology gave us the confidence that we could always find the cause of unexpected anomalies and explain the phenomena.
As more members joined our group, however, another problem arose. New members of the group experienced just about the same frustration and dismay as we had. There were no publications that spoke of a parallel tuning methodology in terms that both expert and novice programmers could comprehend. Our experience had not yet been documented, and the tools that intimately support it did not exist. Part of the motivation for this work stems from the need to address this problem.

Now, with the proposed methodology and tools, we believe that the framework for a structured approach to parallel programming is firmly in place. With the gaining momentum of the shared memory programming model, we feel that many users could benefit from this environment. Such a comprehensive approach, covering a wide range of tasks in parallel programming, has not been attempted previously.
The specific contribution of the work presented in this thesis is a unified framework for our approach to parallel program development. This includes a parallel programming methodology and a set of tools that support this underlying practice. Our work accomplishes this by achieving the following goals that we set out earlier.
Structured Parallel Programming Methodology The methodology described in Chapter � lists the tasks that need to be performed in each step and detailed suggestions that users may consider. Users obtain significant guidance because the objective of each stage is clear. At the same time, the methodology is applicable regardless of the underlying platform, the algorithms used by the target program, or even the tools that programmers use. It is well organized and easy to follow, even for novice programmers.
Integrated Use of Parallelizing Compilers and Evaluation Tools The combined use of Ursa Minor and InterPol or Polaris achieves this. Code segments are labeled as "Program Units" that work across both of these tools. Profile data provides insights into the dynamic behavior of the program at hand, which in turn can be used to further improve performance. Through interactive use of these tools, which speak the same terminology, programmers get a clearer understanding of the program.
Integration of Static Analysis Information and Performance Data Ursa Minor's ability to search and display the source significantly helps users understand a program's structure. In addition, Ursa Minor understands the compiler's findings and combines them into the same picture. The query functions available in Ursa Minor allow users to combine static analysis data with performance data in meaningful ways.
Support for Users' Deductive Reasoning One of the greatest strengths of the Ursa Minor tool is its support for users' deductive reasoning. The Expression Evaluator enables reasoning about the data in numerous ways. Users can compute any metric without modifying or updating the tool. The newly created data can be manipulated and visualized like any other data, so that the tool stays with users throughout their reasoning process.
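The idea of computing arbitrary metrics over existing data can be illustrated with a minimal sketch (hypothetical Python, not the actual Expression Evaluator; the column names and values are invented): an expression over named columns is evaluated element-wise, and the result simply becomes another column.

```python
# Invented example data: per-section execution times.
columns = {
    "serial_time":   [10.0, 8.0, 12.0],
    "parallel_time": [ 2.5, 4.0,  3.0],
}

def evaluate(expression, columns):
    """Evaluate an arithmetic expression element-wise over named
    columns, returning a new column of the same length."""
    length = len(next(iter(columns.values())))
    result = []
    for i in range(length):
        env = {name: values[i] for name, values in columns.items()}
        result.append(eval(expression, {"__builtins__": {}}, env))
    return result

# A derived metric becomes ordinary data, ready for display or reuse.
columns["speedup"] = evaluate("serial_time / parallel_time", columns)
print(columns["speedup"])   # [4.0, 2.0, 4.0]
```

The key property is the last line: the derived column is indistinguishable from loaded data, which is what lets the tool follow the user's reasoning without being modified.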
Potential of Automatic Performance Evaluation Merlin has shown the potential of automatic analysis of performance and static data. It eases the "transfer of experience" from advanced to novice programmers, and tedious analysis steps can be greatly simplified.
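In essence, such a "transfer of experience" encodes expert rules that map observed symptoms to suggested remedies. The following is a hypothetical sketch of that structure (the real Merlin maps relate performance and static analysis data inside Ursa Minor; the rule thresholds and metric names here are invented):

```python
# Hypothetical advisor rules: each pairs a symptom test on per-loop
# metrics with an expert suggestion.
RULES = [
    (lambda m: m["speedup"] < 1.0,
     "Loop slows down in parallel: consider serializing it."),
    (lambda m: m["speedup"] < 0.5 * m["processors"] and m["iterations"] < 100,
     "Low speedup on a small loop: parallel startup overhead may dominate."),
]

def advise(metrics):
    """Return the suggestions whose symptom tests fire on these metrics."""
    return [advice for test, advice in RULES if test(metrics)]

loop = {"speedup": 0.8, "processors": 4, "iterations": 40}
for suggestion in advise(loop):
    print(suggestion)
```

Because the rules live in data rather than in the tool's code, an expert can extend the rule set without touching the analysis engine, which is the property that makes the approach attractive for novices.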
Global Accessibility Having Ursa Minor on the Parallel Programming Hub has opened the door for programmers worldwide to evaluate and use the tool without worrying about searching, downloading, and installing, and compatibility issues are nonexistent. Also, Ursa Minor provides the global parallel programming community with a database of parallel programming studies that can be easily manipulated and visualized.
��� Directions for Future Work
Many promising directions for further work suggest themselves.
Support for Other Parallel Programming Languages and Models As the concept of parallel programming spreads across many programming languages, the ability to support other general-purpose languages such as Java or C++ would promote tool usage even further. The structure of the Ursa Minor database is not limited to Fortran and can support these languages. However, a few language-sensitive features would have to be reworked. In addition, automatic instrumentation and its accompanying tasks (the code segment naming scheme and the incorporation of compiler listings) need careful consideration. Supporting other programming models can be significantly more difficult: radically different parallel constructs and programming styles call for a new methodology to begin with. It will be interesting to see if and how the program-level approach to parallel programming can be applied to other programming models.
Support for Program Execution Traces The shared memory programming model inherently poses problems for parallel trace generation. Processor communications are implicit and frequent, so generating accurate traces is difficult. However, selecting the right events and performing moderate summarization can make it feasible. Timeline analysis is often critical in identifying problems such as load imbalance.
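Moderate summarization can be sketched as aggregating raw events into fixed time intervals, so that load imbalance remains visible on a timeline without storing every event. The Python below is a hypothetical illustration; the event record fields (processor, timestamp, work) are invented.

```python
from collections import defaultdict

def summarize(events, interval):
    """Collapse (processor, timestamp, work) records into per-interval,
    per-processor totals -- enough to spot load imbalance."""
    buckets = defaultdict(float)
    for proc, timestamp, work in events:
        buckets[(int(timestamp // interval), proc)] += work
    return dict(buckets)

# Processor 1 does far more work than processor 0 in the first second.
events = [(0, 0.1, 5.0), (1, 0.2, 20.0), (0, 0.9, 5.0), (1, 1.4, 10.0)]
print(summarize(events, interval=1.0))
```

The summary grows with the number of intervals and processors rather than with the number of events, which is what makes tracing implicit, frequent shared-memory communication tractable.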
Parallel Program Debugging Parallel program debugging is an entirely different field of study, and many challenging tasks would have to be planned for and accomplished. As a programming environment, however, the addition of debugging capability to the toolset would greatly enhance its applicability.
On-line Generation of Data Files Further integration of Ursa Minor, the Polaris parallelizing compiler, and the runtime environment will yield an even more comprehensive environment. Supporting parallelization, compilation, and execution through a single tool would provide a highly integrated perspective and make parallel programming most approachable for novice programmers. The possibility of running and monitoring parallel execution from a remote machine has been shown by InterAct. Issues such as single-user time and Ursa Minor's portability need to be resolved first.
Getting More Information from Compilers There is still plenty of information that is kept internal to a parallelizing compiler. Extracting more useful data from a compiler and presenting it to users would have to be the top priority for the ongoing evaluation/optimization tool project.
Visual Development of Merlin Maps Merlin is still in its infancy and needs more feedback and refinement. Foremost of all is the interface for developing a map. Although Merlin maps are well structured in format, programmers currently rely on conventional text editors to create a map. A better, possibly graphical, user interface would make expert programmers' jobs much easier.
Global Information Exchange among Parallel Programmers Ursa Major has demonstrated the possibility of global communication and cooperation among parallel programmers worldwide. The obvious next step would be the exchange of performance data among remote parallel programming and computer systems researchers. With the proper support from the Ursa Major tool, such as the ability to submit a database, this is a definite possibility. The integrated toolset of the Parallel Programming Hub will continue to promote the usage of our databases. Advances in technology are usually the result of such combined efforts.
LIST OF REFERENCES
�� L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computing in Science and Engineering, ������–���, January ����.
��� B. L. Massingill. A structured approach to parallel programming: Methodology and models. In Proc. of ��th IPPS/SPDP'�� Workshops, Held in Conjunction with the ��th International Parallel Processing Symposium and �th Symposium on Parallel and Distributed Processing, pages ���–���, ����.
��� P. B. Hansen. Model programs for computational science: a programming methodology for multicomputers. Concurrency: Practice and Experience, ��������–����, August ����.
��� T. Rauber and G. Runger. Deriving structured parallel implementations for numerical methods. Microprocessing and Microprogramming, ���–�����–���, April �� �.
��� S. Gorlatch. From transformations to methodology in parallel program development: a case study. Microprocessing and Microprogramming, ���–�����–����, April �� �.
� � Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Company, �� �.
��� Michael J. Wolfe. Optimizing Compilers for Supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, October ����.
��� Utpal Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, ����.
��� Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic program parallelization. Proceedings of the IEEE, ������–����, February ����.
��� Dror E. Maydan, John L. Hennessy, and Monica S. Lam. Efficient and exact data dependence analysis. In Proc. of ACM SIGPLAN '�� Conference on Programming Language Design and Implementation, Ontario, Canada, June ���.
�� Paul M. Petersen and David A. Padua. Static and dynamic evaluation of data dependence techniques. IEEE Transactions on Parallel and Distributed Systems, �����–���, November �� �.
��� Michael J. Voss. Portable loop-level parallelism for shared memory multiprocessor architectures. Master's thesis, School of ECE, Purdue University, October ����.
��� Nirav H. Kapadia and José A. B. Fortes. On the design of a demand-based network-computing system: The Purdue University network computing hubs. In Proc. of IEEE Symposium on High Performance Distributed Computing, pages �–��, Chicago, IL, ����.
��� D. A. Bader and J. JaJa. SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, �������–���, July ����.
��� B. Buttarazzi. A methodology for parallel structured programming in logic environments. International Journal of Mini and Microcomputers, ������–� �, ����.
� � Message Passing Interface Forum. MPI: A message-passing interface standard. Technical report, University of Tennessee, Knoxville, Tennessee, May ����.
��� A. Beguelin, J. Dongarra, A. Geist, R. Manchek, S. Otto, and J. Walpole. PVM: Experiences, current status and future direction. In Proc. of Supercomputing '��, pages � �–� �, November ����.
��� ANSI. X�H Parallel Extensions for Fortran, X�H�����SD, Revision m edition, April ����.
��� Kuck and Associates, Champaign, IL. Guide Reference Manual, version �� edition, September �� �.
���� David J. Kuck. The effects of program restructuring, algorithm change, and architecture choice on program performance. In Proc. of International Conference on Parallel Processing, pages ��–��, St. Charles, Ill., August ����.
��� Randy Allen and Ken Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, ��������–����, October ����.
���� F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante. An overview of the PTRAN analysis system for multiprocessing. Journal of Parallel and Distributed Computing, ����� �– ��, October ����.
���� William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel programming with Polaris. IEEE Computer, ��������–���, December �� �.
���� M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, ��������–���, December �� �.
���� Anthony J. G. Hey. High-performance computing: past, present, and future. Computing and Control Engineering Journal, ������–���, February ����.
�� � R. W. Numrich, J. L. Steidel, B. H. Johnson, B. D. de Dinechin, G. Elsesser, G. Fischer, and T. MacDonald. Definition of the F�� extension to Fortran ��. In Proc. of the Workshop on Languages and Compilers for Parallel Computing, pages ���–�� , Springer-Verlag, August ����.
���� R. von Hanxleden, K. Kennedy, and J. Saltz. Value-based distributions in Fortran D. In Proc. of International Conference on High-Performance Computing and Networking, pages ���–���, Springer-Verlag, April ����.
���� High Performance Fortran Forum. High Performance Fortran language specification, version ���. Technical report, Rice University, Houston, Texas, May ����.
���� Microsoft. Visual C++, ����. http://msdn.microsoft.com/visualc/.
���� Microsoft. Visual Basic, ����. http://msdn.microsoft.com/vbasic/.
��� A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, R. Wade, and V. Sunderam. HeNCE: Graphical development tools for network-based concurrent computing. In Proc. of Scalable High Performance Computing Conference, pages ��–� , April ����.
���� J. Schaeffer, D. Szafron, G. Lobe, and I. Parsons. The Enterprise model for developing distributed applications. IEEE Parallel and Distributed Technology, ������–� , January–March ����.
���� P. Newton and J. C. Browne. The CODE ��� graphical parallel programming language. In Proc. of International Conference on Supercomputing, pages �–��, July ����.
���� P. Kacsuk, G. Dozsa, and T. Fadgyas. Designing parallel programs by the graphical language GRAPNEL. Microprocessing and Microprogramming, ���–��� ��– ��, April �� �.
���� O. Loques, J. Leite, and E. V. Carrera. P-RIO: a modular parallel-programming environment. IEEE Concurrency, �����–���, January–March ����.
�� � N. Stankovic and K. Zhang. Visual programming for message-passing systems. International Journal of Software Engineering and Knowledge Engineering, ��������–����, August ����.
���� Barr E. Bauer. Practical Parallel Programming. Academic Press, ����.
���� Silicon Graphics, Inc. Performance Tuning Optimization for Origin���� and Onyx�, ����. http://techpubs.sgi.com/library/manuals/����������������html�O����Tuning���html.
���� Boston University. Introduction to Parallel Processing on SGI Shared Memory Computers, ����. http://scv.bu.edu/SCV/Tutorials/SMP/.
���� University of Illinois at Urbana-Champaign. CSE���/CS���/ECE���, ����. http://www.cse.uiuc.edu/cse���/.
��� University of California at Berkeley. U.C. Berkeley CS�� Home Page: Applications of Parallel Computers, ����. http://HTTP.CS.Berkeley.EDU/~demmel/cs� ��.
���� Geoffrey C. Fox, Roy D. Williams, and Paul C. Messina. Parallel Computing Works. Morgan Kaufmann Publishers, ����.
���� Ian Foster. Designing and Building Parallel Programs. Addison Wesley, ����.
���� D. Cheng and R. Hood. A portable debugger for parallel and distributed programs. In Proc. of Supercomputing '��, pages ���–���, November ����.
���� J. May and F. Berman. Retargetability and extensibility in a parallel debugger. Journal of Parallel and Distributed Computing, ��������–���, June �� �.
�� � Pallas. TotalView, ����. http://www.pallas.de/pages/totalv.htm.
���� Kuck and Associates, Inc. KAP/Pro Toolset, ����. http://www.kai.com.
���� Vincent Guarna Jr., Dennis Gannon, David Jablonowski, Allen Malony, and Yogesh Gaur. Faust: An integrated environment for the development of parallel programs. IEEE Software, ������–���, July ����.
���� Bill Appelbe, Kevin Smith, and Charles McDowell. Start/Pat: A parallel-programming toolkit. IEEE Software, ������–���, July ����.
���� V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok. The ParaScope editor: An interactive parallel programming tool. In Proc. of Supercomputing Conference, pages ���–���, ����.
��� M. W. Hall, T. J. Harvey, K. Kennedy, N. McIntosh, K. S. McKinley, J. D. Oldham, M. H. Paleczny, and G. Roth. Experiences using the ParaScope editor: An interactive parallel programming tool. In Proc. of Principles and Practices of Parallel Programming, pages ��–��, May ����.
���� Rudolf Eigenmann and Patrick McClaughry. Practical tools for optimizing parallel programs. In Proc. of the ���� Simulation Multiconference on the High Performance Computing Symposium, pages �– �, March ����.
���� W. Liao, A. Diwan, R. P. Bosch Jr., A. Ghuloum, and M. S. Lam. SUIF Explorer: An interactive and interprocedural parallelizer. In Proc. of the �th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages ��–��, August ����.
���� Applied Parallel Research, Inc. Forge Explorer, ����. http://www.apri.com.
���� Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, ������� –���, August ����.
�� � V. S. Adve, J. Mellor-Crummey, M. Anderson, K. Kennedy, J. C. Wang, and D. A. Reed. An integrated compilation and performance analysis environment for data parallel programs. In Proc. of Supercomputing Conference, pages ���–���, ����.
���� S. P. Johnson, C. S. Ierotheou, and M. Cross. Automatic parallel code generation for message passing on distributed memory systems. Parallel Computing, ����������–����, February �� �.
���� S. P. Johnson, P. F. Leggett, C. S. Ierotheou, E. W. Evans, and M. Cross. Computer Aided Parallelisation Tools (CAPTools) Tutorials. Parallel Processing Research Group, University of Greenwich, October ����. CAPTools Version ���Beta.
���� Central Institute for Applied Mathematics. PCL, The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors, November ����.
� �� Louis Lopez. The NAS Trace Visualizer (NTV) Rel. ��� User's Guide. NASA, September ����.
� � Michael T. Heath and Jennifer A. Etheridge. Visualizing the performance of parallel programs. IEEE Software, �������–���, September ���.
� �� Université de Marne-la-Vallée. PGPVM�, ����. http://phalanstere.univ-mlv.fr/~sv/PGPVM�/.
� �� Daniel A. Reed. Experimental performance analysis of parallel systems: Techniques and open problems. In Proc. of the �th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages ��–��, ����.
� �� W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach. VAMPIR: visualization and analysis of MPI resources. Supercomputer, ���� �–���, January �� �.
� �� J. Yan, S. Sarukkai, and P. Mehra. Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit. Software: Practice and Experience, ���������–� �, April ����.
� � Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn parallel performance measurement tool. IEEE Computer, �������–� �, November ����.
� �� S. Shende, A. D. Malony, J. Cuny, K. Lindlan, P. Beckman, and S. Karmesin. Portable profiling and tracing for parallel scientific applications using C++. In Proc. of ACM SIGMETRICS Symposium on Parallel and Distributed Tools, pages ��–���, August ����.
� �� Pacific-Sierra Research. DEEP/MPI: Development Environment for MPI Programs, Parallel Program Analysis and Debugging, ����. http://www.psrv.com/deep_mpi_top.html.
� �� B. J. N. Wylie and A. Endo. Annai/PMA multi-level hierarchical parallel program performance engineering. In Proc. of International Workshop on High-Level Programming Models and Supportive Environments, pages ��– ��, �� �.
���� LAM Team, University of North Dakota. XMPI, A Run/Debug GUI for MPI, ����. http://www.mpi.nd.edu/lam/software/xmpi/.
��� A. D. Malony, D. H. Hammerslag, and D. J. Jablonowski. TraceView: a trace visualization tool. IEEE Software, ������–���, September ���.
���� Michael T. Heath. Performance visualization with ParaGraph. In Proc. of the Second Workshop on Environments and Tools for Parallel Scientific Computing, pages ��–���, May ����.
���� E. Lusk. Visualizing parallel program behavior. In Proc. of Simulation Multiconference on the High Performance Computing Symposium, pages ���–���, April ����.
���� Y. Arrouye. Scope: an extensible interactive environment for the performance evaluation of parallel systems. Microprocessing and Microprogramming, ���–��� ��– ��, April �� �.
���� J. A. Kohl and G. A. Geist. The PVM ��� tracing facility and XPVM �g. In Proc. of the Twenty-Ninth Hawaii International Conference on System Sciences, pages ���–���, January �� �.
�� � B. Topol, J. T. Stasko, and V. Sunderam. PVaniM: A tool for visualization in network computing environments. Concurrency: Practice and Experience, ��������–����, December ����.
���� G. Weiming, G. Eisenhauer, K. Schwan, and J. Vetter. Falcon: On-line monitoring for steering parallel programs. Concurrency: Practice and Experience, ������ ��–�� �, August ����.
���� J. T. Stasko and E. Kraemer. A methodology for building application-specific visualizations of parallel programs. Journal of Parallel and Distributed Computing, ��������–� ��, June ����.
���� G. A. Geist II, J. A. Kohl, and P. M. Papadopoulos. CUMULVS: Providing fault tolerance, visualization, and steering of parallel applications. International Journal of Supercomputer Applications, �������–����, Fall ����.
���� K. C. Li and K. Zhang. Tuning parallel programs through automatic program analysis. In Proc. of Second International Symposium on Parallel Architectures, Algorithms, and Networks, pages ���–���, June �� �.
��� A. Reinefeld, R. Baraglia, T. Decker, J. Gehring, D. Laforenza, F. Ramme, T. Romke, and J. Simon. The MOL project: An open, extensible metacomputer. In Proc. of the ��� IEEE Heterogeneous Computing Workshop, pages �–��, ����.
���� H. Casanova and J. Dongarra. NetSolve: a network enabled server for solving computational science problems. International Journal of Supercomputer Applications, ������–����, Fall ����.
���� M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi. Ninf: a network-based information library for global world-wide computing infrastructure. In Proc. of High-Performance Computing and Networking, International Conference and Exhibition, pages ��–���, April ����.
���� P. Arbenz, W. Gander, and M. Oettli. The Remote Computation System. Parallel Computing, ��������–����, October ����.
���� T. Richardson, Q. Stafford-Fraser, K. R. Wood, and A. Hopper. Virtual network computing. IEEE Internet Computing, ������–���, January–February ����.
�� � Citrix. ICA technical paper, �� �. http://www.citrix.com/products/ica.asp.
���� I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, �����–���, Summer ����.
���� A� S� Grimshaw and W� A� Wulf� The Legion vision of a worldwide virtualcomputer� Communications of the ACM� �������!��� January ����
���� Insung Park� Nirav H� Kapadia� Renato J� Figueiredo� Rudolf Eigenmann� andJos�e A� B� Fortes� Towards an integrated� web�executable parallel program�ming tool environment� To appear in the Proc� of SC��High PerformanceNetworking and Computing� �����
���� B� LaRose� The development and implementation of a performance databaseserver� Technical Report CS������� University of Tennessee� August ����
��� The University of Southampton� GRAPHICAL BENCHMARK INFORMA�TION SERVICE �GBIS�� ���� http���www�ccg�ecs�soton�ac�uk�gbis�papiani�new�gbis�html�
���� Cherri M� Pancake and Curtis Cook� What users need in parallel tool support�Survey results and analysis� In Proc� of Scalable High Performance ComputingConference� pages ��!��� March ����
���� Roger S� Pressman� Software Engineering� a Practitioner�s Approach� McGraw�Hill� Inc�� New York� NY� ����
���� Peter Pacheco� Parallel Programming with MPI� Morgran Kaufman Publishers��� �
���� D� Culler� J� P� Singh� and A� Gupta� Parallel Computer Architecture� MorgranKaufman Publishers� ����
�� � Rudolf Eigenmann� Toward a methodology of optimizing programs for high�performance computers� In Proc� of ACM International Conference on Super�computing� pages ��!� � Tokyo� Japan� July ����
���� Seon Wook Kim and Rudolf Eigenmann� Detailed� quantitative analysis ofshared�memory parallel programs� Technical Report ECE�HPCLab������� HP�CLAB� School of ECE� Purdue University� �����
���� Seon Wook Kim, Michael J. Voss, and Rudolf Eigenmann. Performance analysis of parallel compiler backends on shared-memory multiprocessors. In Proc. of the Tenth Workshop on Compilers for Parallel Computers, pages ���–���, January ����.
���� Rudolf Eigenmann, Insung Park, and Michael J. Voss. Are parallel workstations the right target for parallelizing compilers? In Lecture Notes in Computer Science, No. ����: Languages and Compilers for Parallel Computing, pages ���–���, March ����.
���� Michael J. Voss, Insung Park, and Rudolf Eigenmann. On the machine-independent target language for parallelizing compilers. In Proc. of the Sixth Workshop on Compilers for Parallel Computers, Aachen, Germany, December �� �.
��� Insung Park, Michael J. Voss, and Rudolf Eigenmann. Compiling for the new generation of high-performance SMPs. Technical Report ECE-HPCLab-�� ���, HPCLAB, School of ECE, Purdue University, November �� �.
���� Lynn Pointer. Perfect: Performance evaluation for cost-effective transformations, Report �. Technical Report � ��, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, March ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Interactive compilation and performance analysis with Ursa Minor. In Proc. of the Workshop on Languages and Compilers for Parallel Computing, pages �–� �. Springer-Verlag, August ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Parallel programming and performance evaluation with the Ursa tool family. International Journal of Parallel Programming, � ������–� �, November ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Supporting users' reasoning in performance evaluation and tuning of parallel applications. To appear in Proc. of the Twelfth IASTED International Conference on Parallel and Distributed Computing and Systems, November 2000.
�� � Seon Wook Kim, Insung Park, and Rudolf Eigenmann. A performance advisor tool for novice programmers in parallel programming. To appear in the Proc. of the Workshop on Languages and Compilers for Parallel Computing, 2000.
���� Stefan Kortmann, Insung Park, Michael Voss, and Rudolf Eigenmann. Interactive and modular optimization with InterPol. In Proc. of the 2000 International Conference on Parallel and Distributed Processing Techniques and Applications, pages � �–� ��, June 2000.
���� Michael J. Voss, Kwok Wai Yau, and Rudolf Eigenmann. Interactive instrumentation and tuning of OpenMP programs. Technical Report ECE-HPCLab-�����, HPCLAB, ����.
���� Seon-Wook Kim and Rudolf Eigenmann. MaxP: Detecting the Maximum Parallelism in a Fortran Program. HPCLAB, ����.
��� Insung Park and Rudolf Eigenmann. Ursa Major: Exploring web technology for design and evaluation of high-performance systems. In Proc. of the International Conference on High Performance Computing and Networking, pages ���–���, Berlin, Germany, April ����. Springer-Verlag.
�� T. Nakra, R. Gupta, and M. L. Soffa. Value prediction in VLIW machines. In Proc. of the 26th International Symposium on Computer Architecture, pages ���–� ��, May 1999.
��� Trimaran Homepage. Trimaran Manual, ����. http://www.trimaran.org/docs.html.
��� A. D. Alexandrov, M. Ibel, K. E. Schauser, and C. J. Scheiman. UFO: A personal global file system based on user-level extensions to the operating system. ACM Transactions on Computer Systems, 16(3):207–233, August 1998.
��� Rudolf Eigenmann and Siamak Hassanzadeh. Benchmarking with real industrial applications: The SPEC High-Performance Group. IEEE Computational Science & Engineering, �����–���, Spring �� �.
��� David L. Weaver and Tom Germond. The SPARC Architecture Manual, Version 9. SPARC International, Inc., PTR Prentice Hall, Englewood Cliffs, NJ �� ��, ����.
� � T. J. Downar, Jen-Ying Wu, J. Steill, and R. Janardhan. Parallel and serial applications of the RETRAN-�� power plant simulation code using domain decomposition and Krylov subspace methods. Nuclear Technology, ���:���–��, February ����.
VITA
Insung Park was born on February ��, ����, in Seoul, South Korea. He received
his B.S. degree in control and instrumentation engineering from Seoul National
University in February of ��� and his M.S. degree in electrical engineering from
Virginia Polytechnic Institute and State University, Blacksburg, Virginia, in ����.
He successfully defended his Ph.D. research in August of ���� at the School of
Electrical and Computer Engineering at Purdue University and was awarded the
Ph.D. in December of the same year.
From ��� to ����, Insung Park served as a system administrator of the electrical
engineering departmental workstation laboratory. During his M.S. study, he
developed a partial scan design tool, BELLONA. As a Ph.D. student at Purdue,
he designed and implemented a parallel programming environment consisting of a
programming methodology and a set of tools.
He is a member of the honor society of Phi Kappa Phi.