PARALLEL PROGRAMMING METHODOLOGY AND ENVIRONMENT
FOR THE SHARED MEMORY PROGRAMMING MODEL
A Thesis
Submitted to the Faculty
of
Purdue University
by
Insung Park
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
December ����
To my beloved grandmother
ACKNOWLEDGMENTS
First, I'd like to thank my grandmother, whom I have not seen for more than
two years, and whom I will never see again. She fled to South Korea with four
little daughters during the Korean War and started a new life in an unfamiliar place
with her bare hands. Her courage, perseverance, and endurance have led to my
existence. Over the years in graduate school, she has always been on my side, lending
a sympathetic ear and doing her best to keep me sane. I wish I could see her just one
more time.
I'd like to thank my advisor, Dr. Rudolf Eigenmann, for his encouragement and
advice during my research. His insightful comments and constructive suggestions are
greatly appreciated. I also express my gratitude to my graduate committee members,
Dr. José A. B. Fortes, Dr. Howard J. Siegel, and Dr. Elias Houstis, for their time
and advice.
My deepest love goes to my parents and my two brothers, In Jun and In Kwon.
I can never thank them enough for the never-ending support that has carried me
through my research. Through the ups and downs of life, their love and
encouragement have given me the strength to go on. I am also grateful to
my aunts, uncles, and cousins, who have never hidden their pride in me and their concern
for my well-being.
The fresh and valuable perspectives that the members of our research group have
provided are greatly appreciated. Among them, Mike, Seon, Brian, and Vishal have
made extra efforts to help me with my research, which I deeply acknowledge.
Mike, Natalie, and Nicholas deserve special mention for always being there for me.
I cherish them as my brother, sister, and nephew. Without them, I would not have
made it this far. I believe one of the reasons God led me here is to meet them. I also
value my to-be-lifelong friendship with Seon, Young, and their precious daughter
Arden. The numerous evenings I have spent with all these friends are precious to me.
I appreciate many of my Korean friends here at Purdue. Especially, I extend my
thanks to Jong-hyeok and Je-Ho. Life here has been joyous and fun because of
them. Thanks are also due to their wives, who have fed this single, hungry graduate
student countless times. I'd also like to mention In Sung, Jae Hyung, Yonghee, Soon
Keon, Heon, Seungmoon, Soohong, Jang Won, Il, Jung Min, Hun Soo, Woon Young,
Jong Sun, Se Hyun, and their families.

Lastly, I send my best regards to Joon Sook and her family. I wish them happiness.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  1.1 Motivation
    1.1.1 State of parallel computing
    1.1.2 Open issues in the shared memory programming model
    1.1.3 Need for parallel programming environment
  1.2 Thesis Organization
2 BACKGROUND
  2.1 Parallel Programming Concepts, Terminology, and Notations
  2.2 Parallelization in the Shared Memory Programming Model
    2.2.1 Introduction
    2.2.2 History of parallel shared memory directives
    2.2.3 Shared memory program execution
    2.2.4 Automatic parallelization
  2.3 Parallelization in the Message Passing Programming Model
    2.3.1 MPI and PVM
    2.3.2 HPF
    2.3.3 Visual parallel programming systems
  2.4 Parallel Programming and Optimization Methodology
    2.4.1 Shared memory programming methodology
    2.4.2 Message passing programming methodology
  2.5 Tools
    2.5.1 Program development and optimization
    2.5.2 Instrumentation
    2.5.3 Performance visualization and evaluation
    2.5.4 Guidance
  2.6 Utilizing Web Resources for Parallel Programming
  2.7 Conclusions
3 SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY
  3.1 Introduction: Scope, Audience, and Metrics
    3.1.1 Scope of the proposed methodology
    3.1.2 Target audience
    3.1.3 Metrics: understanding overheads
  3.2 Parallel Program Optimization Methodology
    3.2.1 Instrumenting the program
    3.2.2 Getting serial execution time
    3.2.3 Running a parallelizing compiler
    3.2.4 Manually optimizing programs
    3.2.5 Getting optimized execution time
    3.2.6 Finding and resolving performance problems
  3.3 Conclusions
4 TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY
  4.1 Design Objectives
  4.2 Ursa Minor: Performance Evaluation Tool
    4.2.1 Functionality
    4.2.2 Internal organization of the Ursa Minor tool
    4.2.3 Database structure and data format
    4.2.4 Summary
  4.3 InterPol: Interactive Tuning Tool
    4.3.1 Overview
    4.3.2 Functionality
    4.3.3 Summary
  4.4 Other Tools in Our Toolset
    4.4.1 Polaris: parallelizing compiler
    4.4.2 InterAct: performance monitoring and steering tool
    4.4.3 Max/P: parallelism analysis tool
  4.5 Integration with Methodology
    4.5.1 Tool support in each step
    4.5.2 Other useful utilities
  4.6 The Parallel Programming Hub and Ursa Major
    4.6.1 Parallel Programming Hub: globally accessible integrated tool environment
    4.6.2 Ursa Major: making a repository of knowledge available to the worldwide audience
  4.7 Conclusions
5 EVALUATION
  5.1 Methodology Evaluation: Case Studies
    5.1.1 Manual tuning of ARC2D
    5.1.2 Evaluating a parallelizing compiler on a large application
    5.1.3 Interactive compilation
    5.1.4 Performance advisor: hardware counter data analysis
    5.1.5 Performance advisor: simple techniques to improve performance
  5.2 Efficiency of the Tool Support
    5.2.1 Facilitating the tasks in parallel programming
    5.2.2 General comments from users
  5.3 Comparison with Other Parallel Programming Environments
  5.4 Comparison of Ursa Major and the Parallel Programming Hub
  5.5 Conclusions
6 CONCLUSIONS
  6.1 Summary
  6.2 Directions for Future Work
LIST OF REFERENCES
VITA
LIST OF TABLES

Overhead categories of the speedup component model.
Optimization technique application criteria.
A detailed breakdown of the performance improvement due to each technique.
Common tasks in parallel programming.
Time (in seconds) taken to perform the tasks without our tools.
Time (in seconds) taken to perform the tasks with our tools.
Feature comparison of parallel programming environments.
Workload distribution on resources with our network-based tools.
LIST OF FIGURES

The structure of an SMP.
An Origin 2000 system: (a) topology and (b) structure of a single node board.
Simple parallelization with OpenMP.
Screenshot of the CODE visual programming system.
The timeline graph from NTV.
The graphs generated by AIMS.
The graphs generated by Pablo.
Typical parallel program development cycle.
Overview of the proposed methodology.
Scalar privatization: (a) the original loop and (b) the same loop after privatizing variable X.
Array privatization: (a) the original loop and (b) the same loop after privatizing array A.
Scalar reduction: (a) the original loop and (b) the same loop after recognizing reduction variable SUM.
Array reduction: (a) the original loop and (b) the same loop after recognizing reduction array A.
Induction variable recognition: (a) the original loop and (b) the same loop after replacing induction variable X.
Scheduling modification: (a) the original loop and (b) the same loop after modifying the scheduling by pushing parallel constructs inside the loop nest. In (b), the inner loop is executed in parallel; thus processors access array elements that are at least one stride apart.
Padding: (a) the original loop and (b) the same loop after padding extra space into the arrays.
Load balancing: (a) the original loop and (b) the same loop after changing to an interleaved scheduling scheme. By changing the scheduling from static to dynamic, an unbalanced load can be distributed more evenly.
Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b), another loop has been added to assign smaller blocks to each processor. The data are likely to remain in the cache when they are needed again.
Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest.
Software pipelining and loop unrolling: (a) the original loop, (b) the same loop with software pipelining (instructions are interleaved across iterations, and a preamble and postamble have been added), and (c) the same loop unrolled.
Original loop SHALOW do… in program SWIM.
Parallel version of SHALOW do… in program SWIM.
Optimized version of SHALOW do… in program SWIM.
Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on execution time, the user inspects the percentage of three major loops (ACTFOR do…, ACTFOR do…, RESTAR do…) using a pie chart generator (bottom left). Computing the speedup with the Expression Evaluator reveals that the speedup for RESTAR do… is poor, so the user is examining more detailed information on the loop.
Structure view of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the "Find" utility, the user sets the view to subroutine ACTFOR and opens the source view for the parallelized loop ACTFOR do….
The user interface of Merlin in use. Merlin provides solutions to the detected problems. This example shows the problems addressed in loop ACTFOR DO… of program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer.
The internal structure of a Merlin "map". The Problem Domain corresponds to general performance problems. The Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data.
Building blocks of the Ursa Minor tool and their interactions.
The database structure of Ursa Minor.
An overview of InterPol. Three main modules interact with users through a Graphical User Interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls Polaris modules.
User interface of InterPol: (a) the main window and (b) the Compiler Builder.
Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC benchmark SWIM.
Tool support for the parallel programming methodology.
Ursa Minor usage on the Parallel Programming Hub.
Interaction provided by the Ursa Major tool.
The (a) execution time and (b) speedup of the various versions of ARC2D (Mod 1: loop interchange; Mod 2: STEPFY do… modification; Mod 3: STEPFX do… modification; Mod 4: FILERX do… modification; Mod 5: YPENTA do… modification; Mod 6: modifications on XPENTA, YPENT2, and XPENT2).
Contents of the Program Builder during an example usage of the InterPol tool: (a) the input program and (b) the output from the default Polaris compiler configuration.
Contents of the Program Builder during an example usage of the InterPol tool: (c) the output after placing an additional dead-code elimination pass prior to inlining and (d) the program after manually parallelizing subroutine two.
Performance analysis of the loop STEPFX DO… in program ARC2D. The graph on the left shows the overhead components in the original, serial code. The graphs on the right show the speedup component model for the parallel code variants before and after loop interchanging is applied. Each component of this model represents the change in the respective overhead category relative to the serial program. Merlin is able to generate the information shown in these graphs.
Speedup achieved by applying the performance map. The speedup is with respect to a one-processor run with serial code on a Sun Enterprise system. Each graph shows the cumulative speedup when applying each technique.
Overall times to finish all tasks.
The response time of UM-Applet and UM-ParHub on (a) a networked PC, (b) a networked workstation, and (c) a dialup PC.
The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching.
ABSTRACT
Park, Insung. Ph.D., Purdue University, December ����. Parallel Programming Methodology and Environment for the Shared Memory Programming Model. Major Professor: Rudolf Eigenmann.
The easy programming model of the shared memory paradigm possesses many
attributes desirable to novice programmers. However, there has not been a good
methodology with which programmers can navigate the difficult task of program
parallelization and optimization. It is becoming increasingly difficult to achieve good
performance without experience and intuition. Guiding methodologies must define
easy-to-follow steps for programming and tuning multiprocessor applications. In
addition, a parallel programming environment must acknowledge the time-consuming
steps in the parallelization and tuning process and support users in their efforts.

We propose a parallel programming methodology for the shared memory model
and a set of tools designed to assist users in accordance with the methodology. Our
research addresses the questions of "what" to do in parallel program development and
tuning, "how" to do it, and "where" to do it. Our main contribution is to provide a
comprehensive programming environment in which both novice and advanced users
can perform performance tuning in an efficient and straightforward manner. Our
effort differs from other parallel programming environments in that (1) it integrates
most stages of parallel programming tasks based on a common methodology and
(2) it addresses issues that have not been attempted in previous efforts. We have used
network computing technology so that programmers worldwide can benefit from our
work. Through a series of evaluations, we found that our programming environment
provides a methodology that works well with parallel applications and that
our tools provide efficient support to both novice and advanced programmers.
1. INTRODUCTION

1.1 Motivation

1.1.1 State of parallel computing
Multiprocessor machines exist in many different architectures. Among
them, shared memory machines have been receiving much attention recently. This is mainly
due to the fact that the shared memory architecture offers an easy programming
model and that the techniques for parallelizing programs for this class of machines
are well established and can be automated.

Today, affordable new multiprocessor workstations and PCs are attracting an
increasing number of users; consequently, many of these new programmers are inexperienced
and desire an easier programming model to harness the power of parallel computing.
These aspects draw more attention to shared memory machines in two ways. First,
most newly developed parallel computers are shared memory machines or compatible
with the shared memory programming model. Second, the aforementioned easy
programming model, with the help of parallelizing compilers, requires relatively little
experience to develop parallel programs.
The effort in industry toward the standardization of a programming model
makes shared memory machines more appealing. The lack of a standardized parallel
language had been a problem with the shared memory model. It often required
programmers to learn a new set of language constructs whenever there was a need to
port programs across platforms. To make matters worse, the difference among these
native dialects in their ability to express parallelism was significant enough that in
many cases a considerable change had to be made to the program code itself, going
beyond direct directive translation. There have been several attempts to provide
standard parallel languages, which will be discussed in Chapter 2, but they failed to
gain attention from the parallel computing community in general.
The recent parallel language standard for shared memory multiprocessor
machines, OpenMP […], promises an attractive interface for those programmers who
wish to exploit parallelism explicitly. The OpenMP standard resolves the portability
problem and is expected to attract more programmers and computer vendors in the
high performance computing area.
1.1.2 Open issues in the shared memory programming model
There are, however, open issues to be addressed. Perhaps the most serious of all is
the lack of a good programming methodology for these types of machines. In contrast
to several efforts to establish a methodology for other programming models […],
no known literature speaks of a programming and tuning methodology
for the shared memory model. A programmer who is to develop a parallel program
faces a number of challenging questions. What are the known techniques for
parallelizing this program? What information is available for the program at hand?
How much speedup can be expected from this program? What are the limitations
on the parallelization of this program? It usually takes substantial experience to find
the answers to such questions. Most general programmers do not have the time and
resources to acquire this experience.
We believe that the absence of a programming methodology can be attributed to three
reasons. First, many advanced parallel programmers are used to programming in
terms of "application level" parallelism. By this we mean the study of the underlying
physics and algorithms to find parallelism residing at that level. It is indeed an
effective method if it succeeds, because in some cases the scope of the resulting parallelism
is wider than the finer grain parallelism of the directive-based programming model,
resulting in less synchronization overhead. However, this approach requires significant
effort to understand the underlying physics, and it is prone to human error. It is
not rare that a programmer realizes, at a later stage of development, that
the algorithm that he or she thought to be parallel is actually sequential. If the person
parallelizing a program is not the programmer who wrote it, the required
effort doubles, as understanding the program has to precede parallelization.
Furthermore, depending on the problem that programmers wish to solve, the underlying
algorithms and physical models vary significantly, making a systematic approach
to parallel application design difficult. A programmer who is used to this approach
has to tackle each problem case by case, relying on intuition and experience.
In contrast to the "application level" approach, there is a "program level" parallelism
approach: an effort to find parallelism based on the source code
and how it is written. By focusing only on repetitive computing constructs (loops), this
approach allows automatic recognition of parallelism and possible transformations.
Numerous research projects have addressed the issues of identifying parallelism and
applying the corresponding transformations, which can be incorporated into
compilers […]. Nevertheless, these are not parallel programming methodologies
by themselves. These researchers address only one part of parallel program
development: parallelization. A complete parallel programming methodology has to
encompass all development stages, including parallelization, evaluation, tuning,
and so on.

The second reason for the lack of a methodology for the shared memory
architecture stems from the significant aid provided by parallelizing compilers.
Many inexperienced programmers expect a significant speedup after running a
parallelizing compiler. Indeed, such compilers simplify the process considerably. However, running a
parallelizing compiler does not necessarily achieve high performance. To achieve
optimal performance from a program, many factors often have to be considered,
including both machine-dependent and machine-independent parameters, underlying algorithms,
and so on. As shown in […], without proper consideration of these effects, the resulting
performance may even degrade. We believe that there is room for a systematic
way to provide users with guidelines and remedies that can be incorporated into a
structured methodology.
Finally, there are some aspects of the shared memory model that make it hard
to develop a general methodology. As mentioned above, the shared memory model
offers an easy programming interface. This does not mean that obtaining good
performance is easy as well. Unlike some other programming models, such as a message
passing scheme where a programmer explicitly dictates synchronization and the sending
and receiving of messages, important events such as multiple processors writing to a
shared variable or false sharing are not readily visible to users in the shared memory
model. Furthermore, these effects are hard, if not impossible, to measure without
introducing significant overhead. Therefore, if the performance is not satisfactory,
inexperienced programmers have difficulty finding what caused the problem. The increasing
number of Non-Uniform Memory Access (NUMA) machines adds more complexity,
because these machines introduce another variable to consider, namely memory latency. The
shared memory programming model provides an easy, transparent means of
expressing parallelism, but the price is that parallel performance optimization requires
significant time and resources. A good methodology should be general enough to cover
a variety of architectures and applications, but flexible enough to help programmers
pinpoint the bottlenecks and resolve the problems in a specific situation.
1.1.3 Need for parallel programming environment
With the gaining momentum of the shared memory architecture, a methodology
for the shared memory model is needed. The shared memory model provides a
simple user interface; what we do not have is an equally easy way to produce good
performance. Such a methodology has to consist of structured guidelines that encompass the whole process
of program development while providing useful tips with which users can navigate
the difficult steps. As there are a variety of issues to deal with, it has to be
general without losing its utility when applied to real environments.
A good methodology does not suffice without proper support from tools. Listing
the tasks that need to be completed is of little help to programmers if all
those tasks must be accomplished manually with only the basic utilities available on
the target machine. During an optimization process, programmers face challenges in
analysis and performance data management, incremental application of parallelization
and optimization techniques, performance measurement and monitoring, and problem
identification and devising remedies. Each of these tasks poses a significant burden
on programmers, and without any help, each can be time-consuming.
This leads to the need for supporting facilities for the underlying methodology.
These facilities need to address the difficult and time-consuming steps specified by the
methodology and provide functionality that accelerates these steps. Together, the
methodology and the tools should be able to make up for the lack of experience
among novice programmers wherever it is needed most, such as in analysis, diagnosis,
and the formation of solutions. We acknowledge the many tools designed for the purpose of
helping programmers, but the majority of them focus on specific aspects or
environments of the program development process and are not based on a methodology. We believe
that providing a more comprehensive and actively guiding toolset is possible with
current technology.
Another problem with current tools is their accessibility. If useful tools cannot
be easily found and used, the effort to develop them is wasted.
Furthermore, as more diverse multiprocessors find their users, compatibility
has become an important factor in a tool's applicability. As the existing
programming models converge to the OpenMP standard, tool developers should consider this
problem. With emerging network technology and new portable languages such
as Java, we already have the basic framework enabling more accessible parallel
programming tools.
We present here our results on the subject of a parallel programming methodology
and supporting tools. We have developed a methodology that has worked well in
various environments and a set of tools that address difficult tasks in the shared
memory model. Combining the methodology and the supporting tools we developed,
programmers can now follow a structured approach toward optimal performance
with the support of efficient tools. This optimization paradigm is available to a
general audience through the Purdue University Network Computing Hub (PUNCH) […]
and a Java applet application, allowing our methodology and tool support to reach
many users throughout the globe.
1.2 Thesis Organization
Chapter 2 gives a brief overview of the history of and background on parallel
programming, focusing on methodologies and programming tools. Chapter 3
presents our proposed methodology for these issues, and the supporting tools
developed for the methodology are summarized in Chapter 4. Chapter 5 discusses
the evaluation process and its results. Chapter 6 concludes the thesis.
2. BACKGROUND
In this chapter, we examine previous efforts in developing programming
methodologies and tools for parallel programming targeted at the two well-known
programming models: the shared memory and the distributed memory models. Our
research can be summarized as building a comprehensive programming environment by
(1) designing a good programming methodology, (2) providing a toolset that supports
it, and (3) making our results available to a wide audience. From this perspective, we
discuss general concepts in parallel programming, methodologies and tools proposed
by other researchers, and previous efforts toward more accessible data repositories
and parallel programming tools.
2.1 Parallel Programming Concepts, Terminology, and Notations
Parallelism exists in many forms. In this thesis, we consider "parallel processing",
in which multiple processors take part in executing a single program. Other parallel
schemes, such as instruction-level parallelism or vector architectures, are not the target
of our research. There are two major multiprocessor architecture categories: SIMD
(Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data).
Among these, we focus on the MIMD architecture, which is the most commonly used
architecture these days.
The MIMD category consists of two types of machines: shared memory machines and distributed memory machines. The physical memory of a shared memory architecture may itself be centralized or distributed, further dividing the architecture into the Uniform Memory Access (UMA) architecture and the Non-Uniform Memory Access (NUMA) architecture. Some distinguish them by using the terms Symmetric MultiProcessor (SMP) architecture and Distributed Shared Memory (DSM) architecture, respectively. DSM machines seek to resolve the limited capacity of shared memory buses, which prevents scaling to a large number of processors on a conventional SMP architecture. Figure 2.1 shows a typical "flat" SMP architecture with four processors. By contrast, the architecture shown in Figure 2.2 is that of a Cray Origin 2000 system, which is a DSM machine.
[Figure: CPU 1, CPU 2, ..., CPU P, each with an external cache, connected to a shared main memory.]
Fig. 2.1. The structure of an SMP.
From the programmer's point of view, there are two main models for programming on parallel machines: the shared memory programming model and the message passing programming model. There are other programming models that target a cluster of SMP machines [?] or parallel logic environments [?], but they are not widely used and will not be discussed in detail.
The shared memory model and the message passing programming model share the same basic concept: threads. A single process forks multiple threads that independently execute portions of a program. The difference between these two is how threads access memory. In the shared memory model, multiple processors share a single memory space, so processors can read or write to the shared space regardless of where the data actually reside. The notion of "shared" and "private" data becomes important. Shared data are visible to all processors participating in the parallel execution. Communication between processors takes place in the form of reading and writing shared data. Private data, on the other hand, are local to each processor and cannot be accessed by other processors.

[Figure: routers interconnecting node boards; each node board holds two CPUs with external caches, a Hub ASIC with XIO links, and local memory and directory.]
Fig. 2.2. An Origin 2000 system: (a) topology and (b) structure of a single node board.
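The shared/private distinction can be sketched outside of any directive language. The following Python fragment is an illustrative sketch of the model, not OpenMP; the names `worker` and `shared_total` are ours. Threads share one address space, so each thread accumulates into a private local variable and communicates only by writing to shared data under a lock.

```python
import threading

shared_total = [0]            # shared data: visible to every thread
lock = threading.Lock()

def worker(chunk):
    private_sum = 0           # private data: local to this thread only
    for x in chunk:
        private_sum += x
    with lock:                # communication = writing to shared data
        shared_total[0] += private_sum

# four threads, each owning a strided portion of the iteration space
threads = [threading.Thread(target=worker, args=(range(i, 100, 4),))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, `shared_total[0]` holds the sum 0 + 1 + ... + 99, regardless of which thread executed which portion.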
By contrast, in the message passing scheme, processors do not share memory. All data are private to the processor that owns them. The message passing scheme requires that each processor be aware of which processor owns what data; thus, if there is a need to read or write a data item that belongs to another processor, the item has to be explicitly sent and received.
These two models provide high-level constructs for easier programming. The shared memory model offers directive languages, with which a user specifies whether certain loops can be executed in parallel. Users can also program directly with threads with the help of thread libraries. In the message passing model, parallel constructs typically come in the form of a library of functions. The library includes functions for sending and receiving messages, synchronization, initialization, and grouping. The Message Passing Interface (MPI) [?] and Parallel Virtual Machine (PVM) [?] are important standards implemented in such libraries. The parallel programmer's task in the message passing model is to incorporate these functions into parallel algorithms. Programmers need to devise ways to split data, communicate, and synchronize, and to write or modify the program based on the design.
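The split-communicate-combine pattern just described can be sketched with Python's standard multiprocessing module. This is an illustrative sketch of the message passing style, not the MPI API; `parallel_sum` and the pipe layout are our own choices. Each process owns a private slice of the data, and the only way a value crosses a process boundary is an explicit send with a matching receive.

```python
import multiprocessing as mp

def worker(conn, chunk):
    # all data here are private to this process; results must be sent explicitly
    conn.send(sum(chunk))
    conn.close()

def parallel_sum(data, nproc=4):
    chunks = [data[i::nproc] for i in range(nproc)]   # split the data
    parents, procs = [], []
    for chunk in chunks:
        parent, child = mp.Pipe()
        p = mp.Process(target=worker, args=(child, chunk))
        p.start()                                     # each worker runs in its own address space
        parents.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in parents)      # explicit receives
    for p in procs:
        p.join()                                      # synchronize
    return total
```

The bookkeeping visible even in this toy example (who owns which slice, which pipe pairs with which process) is exactly the burden the text attributes to the message passing style.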
Although the shared memory programming model is primarily intended for programming on shared memory machines, and the message passing model for programming on distributed memory machines, this mapping between programming models and architectures is not binding. Many modern parallel computers are compatible with both programming models, although their hardware design takes specifically one form or the other. There is still no general agreement as to which architecture and which programming model are more effective, and it is not likely that either will prevail over the other in the near future.
Here, we focus on parallelization in the shared memory model. Although we view parallel program development in terms of programming models, we will keep in mind the effects of specific hardware implementations on program performance, as various machine-dependent parameters play significant roles in program execution. We would like our approach to parallel programming to address some of these hardware-related issues.
2.2 Parallelization in the Shared Memory Programming Model
2.2.1 Introduction
The focus of the shared memory programming model is on loops. Loops are the most common means of expressing repetitive computing patterns in a program. The concept of thread execution does not restrict parallelism to the loop level, but the high-level directive languages provided by the shared memory programming model mainly deal with ways to specify parallel loop execution. By exploiting parallelism among loop iterations, the shared memory model often achieves a significant performance gain.
In the shared memory programming model, a programmer specifies parallel execution by annotating the source code with directives. Typically, directives consist of one or more lines indicating serial/parallel execution, variable types (shared, private, and reduction), the scheduling scheme, and a conditional construct (the IF directive). Communication and synchronization among processors are implicit inside parallel sections, meaning that those operations are transparent and do not show up in the source code. Also, parallelization is localized; in other words, parallelizing one section of code has no logical effect* on the rest of the program. The transparent synchronization and localized parallel sections of the shared memory programming model offer an easy scheme to work with, especially for inexperienced programmers. Figure 2.3 shows a portion of code taken from an example program in [?] that computes π before and after parallelization using OpenMP. Lines starting with !$OMP indicate directives. The directive PARALLEL DO indicates that the loop has no loop-carried dependences and may be executed in parallel. The directives PRIVATE and SHARED tell the compiler that the variables in the following parentheses are private or shared, respectively. The directive REDUCTION(+: SUM) indicates that the variable SUM is a summation reduction variable and requires special care for parallel execution. Examining the details of OpenMP is beyond the scope of this thesis; more information can be found in [?].
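The reduction pattern of Figure 2.3 can also be expressed in a thread-library style rather than with directives. The sketch below is our own Python rendition, assuming the integrand F(x) = 4/(1 + x²) so that the sum approximates π; it gives each worker a private partial sum and combines the partial sums at the end, which is exactly what the REDUCTION clause asks the compiler to arrange.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return 4.0 / (1.0 + x * x)   # assumed integrand; its integral over [0,1] is pi

def pi_parallel(n, workers=4):
    w = 1.0 / n
    def partial(indices):        # each worker keeps a private partial sum
        return sum(w * f(w * (i - 0.5)) for i in indices)
    chunks = [range(k + 1, n + 1, workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(partial, chunks))  # the reduction step
```

Note that X = W * (I - 0.5) from the figure reappears as `w * (i - 0.5)`: the per-iteration value is private, while the accumulated SUM is the shared reduction variable.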
By narrowing the main concern to loops, the shared memory model has enabled an impressive advance in parallelization and optimization techniques. Well-known techniques for parallelization include advanced data dependence analysis, induction variable substitution, reduction variable recognition, privatization, and so on. In addition, there are locality enhancement techniques that specifically target the shared memory architecture, such as blocking/tiling and load balancing. Most of these techniques have been incorporated into modern parallelizers, which will be presented in Section 2.2.4.
2.2.2 History of parallel shared memory directives
As mentioned in the introduction, until the late 1990s the shared memory model suffered from the lack of a standard language. Computers from different vendors came with their own sets of directives for expressing parallelism, and compilers did not understand any directives other than their own. There have been a few initiatives to resolve this problem. In 1987, an informal industry group called the Parallel Computing Forum (PCF) was formed to address the issue of standardizing loop parallelism
*Cache effects can affect the performance of the code outside the parallel section.
      ...
      W = 1.0d0/N
      SUM = 0.0d0
      DO I=1,N
        X = W * (I - 0.5d0)
        SUM = SUM + F(X)
      ENDDO
      PI = W * SUM
      ...

(a) Original sequential code

      ...
      W = 1.0d0/N
      SUM = 0.0d0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W),
!$OMP& REDUCTION(+: SUM)
      DO I=1,N
        X = W * (I - 0.5d0)
        SUM = SUM + F(X)
      ENDDO
      PI = W * SUM
      ...

(b) After transformation

Fig. 2.3. Simple parallelization with OpenMP.
in Fortran. The group remained active for three years, after which its final report was published. After PCF was dissolved, a subcommittee, X3H5, authorized by ANSI, was formed to establish an independent language model for shared memory programming in Fortran and C. However, interest was eventually lost, and the proposed standards were abandoned, leaving the last revision labeled X3H5 Revision M [?]. There were also commercial portable directive sets, such as the KAP/Pro directive set from Kuck and Associates (KAI) [?]. However, since native compilers support only their own directives, portability could only be achieved by transforming the directives into thread-based code and compiling the resulting code with native compilers. Overall, all these efforts failed to gain the attention of the general parallel computing community.
In 1997, spurred by the rekindled popularity of shared memory machines, Silicon Graphics Inc. (SGI) and several major high performance computer vendors initiated
the eort to establish a new standard directive language� The proposed directive
language� named OpenMP ��� embraces the previous standardization eorts and
added a few new concepts for more expressiveness� Unlike previous attempts� this
is an industry�wise eort to resolve a practical problem� so it is likely to result in
a successful standard that is supported by the majority of new and existing high
performance computers� It seems safe to say that OpenMP ensures the future of the
shared memory architecture and the programming model by adding portability across
platforms�
2.2.3 Shared memory program execution
Once an executable is generated by compiling a program with directives, programmers can run it as they would run any sequential program. In fact, an OpenMP program starts out as a sequential program and engages other processors as OpenMP parallel constructs are encountered. The user has a number of controls over parallel execution, typically in the form of environment variables. The most important of them is the environment variable that sets the number of processors participating in the execution of parallel code sections. For programmers who are used to the message passing programming model, it is important to note that no configuration scripts or setups are necessary to execute an OpenMP program.
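OMP_NUM_THREADS is the usual environment variable for this control in OpenMP implementations. The sketch below is an illustrative Python stand-in for an OpenMP runtime, not the runtime itself; it shows the typical convention of reading the variable once at startup and falling back to a serial run when it is unset.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# OMP_NUM_THREADS is the standard OpenMP control; default to a serial run
nthreads = int(os.environ.get("OMP_NUM_THREADS", "1"))

def parallel_region(work_items, fn):
    """Execute fn over work_items with the configured number of threads."""
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        return list(ex.map(fn, work_items))
```

Running the same executable under `OMP_NUM_THREADS=8` simply widens the pool; no job scripts or configuration files are involved.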
2.2.4 Automatic parallelization
As the techniques for identifying parallelism and parallelizing loops advance, it is a natural course of action to incorporate them into a compiler so that the whole process takes place without the programmer's intervention. The apparent advantage of using a parallelizing compiler is that the conversion of a given serial program into parallel form is done mechanically by the tool, relieving programmers from worrying about parallelization details. As the impact of parallelizing compilers is significant, especially for the shared memory programming model, a reasonable methodology should consider their role in parallel program development. Thus, we briefly discuss the general aspects of parallelizing compilers in this section.
The eort to automate parallelization process starts from vectorizers of the ���s
� � �
and ���s� The most important vectorizers among them are the Parafrase compiler
from the University of Illinois ����� PFC parallelizing compiler developed at Rice
University ���� and the PTRAN compiler from IBM�s T� J� Watson Research Labo�
ratory ����� They laid the foundation for the modern parallelizers� Most of the general
techniques for vectorizing arrays within loops remain in the parallelizing compilers
these days ����
Today, all shared memory multiprocessor machines are equipped with their own parallelizers, and there have been several efforts in academia to create a new generation of state-of-the-art parallelizing compilers for the shared memory programming model. Two of the noticeable recent efforts in this field are the Polaris parallelizing compiler developed at the University of Illinois [?] and Purdue University, and the SUIF (Stanford University Intermediate Format) parallelizing compiler from Stanford University [?]. Both were built upon their own infrastructures (bases for Polaris and kernels for SUIF), which were designed to help researchers working on compiler technology. The focus of the SUIF compiler is on parallelizing the C language. With such techniques as global data and computation decomposition, communication optimization, array privatization, interprocedural parallelization, and pointer analysis, SUIF boasts an impressive performance gain on many programs.
Polaris, as a compiler, includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. The Polaris infrastructure provides useful facilities for analyzing and manipulating Fortran programs, which can provide useful information regarding a program's structure and its potential parallelism. Polaris has played a major role in our previous efforts in methodology and tool research, and it will continue to be a major part of our future research. The details of the role of Polaris in our research will be discussed in a later chapter.
2.3 Parallelization in the Message Passing Programming Model
2.3.1 MPI and PVM
Both MPI [?] and PVM [?] provide message passing infrastructures for parallel programs running on distributed memory machines. Ever since the introduction of the first distributed memory machine, the Cosmic Cube from Caltech, in the early 1980s, researchers and programmers who saw the potential of distributed memory computers struggled amid conflicting supporting interfaces, until Oak Ridge National Laboratory's PVM system and a joint US-Europe initiative for a standard message passing interface (eventually named MPI) arrived on the scene. These two interfaces were accepted by the majority of people involved in parallel computing on distributed memory machines and were successfully ported to a variety of multiprocessor systems, including shared memory machines [?].
These two systems take the form of libraries rather than separate language constructs. The libraries consist of functions and subroutines for synchronization and for sending and receiving messages across processors. Users insert calls to these routines to control the parallel execution of a program. This required programmers to change their way of thinking: they had to be the "masters" that explicitly take care of data distribution, communication, and other parallelization details. Nevertheless, their performance on some distributed memory machines was impressive.
The message passing programming model is well suited for distributed systems with a large number of processors. By carefully controlling the interaction among processors, some applications that do not require heavy communication are able to scale well as the number of processors increases. Another advantage of PVM and MPI is that they enable a cluster of heterogeneous uniprocessor systems to behave like one supercomputer. Good performance of the message passing model, however, often relies on one critical factor: network latency. The time to transfer a message from one processor to another ranges from a hundred to a million clock cycles. If the application at hand requires frequent communication among participating processors, the resulting performance gain can be seriously limited even on the fastest networks of today, let alone on a cluster of uniprocessors connected by simple network cables. This problem spawned numerous research efforts regarding data parallelism and work distribution on distributed memory machines, which we will not discuss any further.
Another drawback of the message passing interface is its aforementioned low-level programming style. The amount of bookkeeping for data transfer and synchronization can grow to an intolerable level, and it is entirely up to the programmer to ensure correct execution [?]. Furthermore, the tricks and tweaks needed to obtain high performance may be overwhelming to inexperienced programmers. Even worse, in this programming model the effort to parallelize a program generally starts from analyzing the underlying physics, making it difficult for programmers other than the original authors to parallelize a program. Overall, learning these interfaces is not particularly difficult, but designing a parallel program that achieves good performance is.
2.3.2 HPF
Many people thought that the message passing programming style was at too low a level to appeal to a general audience [?]. For this reason, a group of researchers at Rice University attempted to provide higher-level constructs for programming on distributed memory machines. The results are Fortran D [?] and its successor, High Performance Fortran (HPF) [?], both sets of extensions to Fortran. The HPF programming model looks similar to the shared memory model in that it focuses on loop parallelism controlled by directives added in front of loops. In addition, it provides directives for distributing data onto distributed memory systems. HPF translators generate a message passing program based on these directives. Compared to message passing functions, these directives let programmers specify array distribution without burdening them with tedious bookkeeping details.
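What a block data-distribution directive asks the translator to compute amounts to plain index arithmetic. The following is our own Python illustration of a BLOCK-style mapping; `block_ranges` is a hypothetical helper, not part of HPF.

```python
def block_ranges(n, p):
    """Index ranges owned by each of p processors under a BLOCK distribution."""
    size = (n + p - 1) // p           # ceiling(n / p) elements per processor
    return [(k * size, min(n, (k + 1) * size)) for k in range(p)]
```

For a 10-element array on 4 processors this yields [(0, 3), (3, 6), (6, 9), (9, 10)]; any access to an element outside a processor's own range is what the translator must turn into generated send/receive code.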
However, compared to the shared memory programming model, HPF lacks important constructs such as loop-private arrays, and, most of all, the performance of HPF programs is not as good as that of programs written directly in MPI or PVM. So far, only a handful of compilers and systems fully support HPF.
2.3.3 Visual parallel programming systems
A different approach to simplifying the user interface of the message passing programming model is to achieve an even higher level of abstraction by adopting the visual programming model of systems such as Visual C++ [?] and Visual Basic [?]. The goal of such research efforts is to develop visual programming environments in which programmers use nodes and arcs to design and implement parallel applications. They opt for a more efficient way of designing and implementing parallel programs; performance evaluation and tuning are not their main concern. Visual programming systems such as HeNCE [?], Enterprise [?], CODE [?], GRAPNEL [?], P-RIO [?], and Visper [?] belong to this category.
In contrast to the traditional coding model, these systems call for a different paradigm for writing parallel programs. Conventional programming language constructs are replaced with visual entities, although programmers are often required to provide some form of textual description to specify the details needed for the intended functionality. These systems include not only new programming models but also supporting tools that actually allow programmers to use them. These tools usually come with a set of templates to help programmers design parallel programs. Figure 2.4 shows a screenshot of the CODE visual parallel programming system.
The advantage of these visual parallel programming systems is an efficient representation of complex program structures and parallel constructs. Generally, programmers have less difficulty grasping the parallel nature of programs using these tools. In addition, the systems reduce debugging time by providing utilities for automatic translation of parallel constructs. However, the tasks of splitting data and coordinating communication are still left to programmers.
Fig. 2.4. Screenshot of the CODE visual programming system.

2.4 Parallel Programming and Optimization Methodology
As explained in Section 2.1, the parallel constructs provided by the shared memory programming model and the message passing programming model take significantly different forms. Hence, the corresponding programming methodologies have taken distinct paths.
2.4.1 Shared memory programming methodology
In the shared memory model, parallelism is specified with directives that have no effect on program semantics. Tasks are distributed based on loop iterations, and the key aspects of parallelizing shared memory programs are to detect loop-carried data dependences and to identify shared and private data in each iteration. This can be done by static, program-level analysis. Therefore, the methodology for the shared memory programming model is, at the highest level, to examine loops in a serial code region to detect parallelism and to determine shared and private variables. There are publications and lecture notes addressing programming on shared memory machines [?]. They present concepts and notations, explain directives, and discuss parallelization techniques and dependence test criteria. However, they do not offer an overall strategy or a procedural methodology for performance optimization. One exception is [?]. This document, specifically aimed at optimization for Origin 2000 machines, devotes a section to tuning parallel code for the Origin. The section consists of architecture-specific techniques that are useful in further improving parallel performance. However, compared to the detailed single-processor tuning description in the same text, the parallel performance tuning material only serves to complement the single-processor case. Also, the document lacks performance problem definitions and a performance evaluation description for parallel programs.
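The central test this methodology relies on, detecting loop-carried dependences, can be made concrete with a small experiment. The sketch below is our own Python illustration; `scale` and `prefix` are hypothetical loops. A loop whose iterations touch only their own elements gives the same answer in any iteration order, while a loop that reads a value written by an earlier iteration does not, and therefore cannot be run in parallel as written.

```python
def scale(a, order):
    b = list(a)
    for i in order:
        b[i] = b[i] * 2               # touches only b[i]: no loop-carried dependence
    return b

def prefix(a, order):
    b = list(a)
    for i in order:
        if i > 0:
            b[i] = b[i] + b[i - 1]    # reads b[i-1]: loop-carried dependence
    return b

data = [1, 2, 3, 4]
forward = list(range(len(data)))
backward = list(reversed(forward))

same = scale(data, forward) == scale(data, backward)        # order-insensitive: parallelizable
differs = prefix(data, forward) != prefix(data, backward)   # order-sensitive: must stay serial
```

A parallelizing compiler reaches the same verdict statically, by proving (or failing to prove) that no iteration reads a location written by another.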
An alternative way of expressing programs in the shared memory model is to use threads. In this scheme, the programmer packages program sections that can execute concurrently into subroutines and spawns these subroutines as parallel threads. Thread parallelism is at a lower level than directive parallelism; in fact, compilers translate a directive-parallel program into a thread-parallel program as an intermediate compilation step. Advanced parallel programmers sometimes prefer thread parallelism because it can offer more control over parallel program execution. Usually, this comes at the cost of a higher programming effort. A brief description of shared memory programming with multithreading is given in [?].
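The thread scheme amounts to the following pattern, sketched here in Python rather than a native thread library; `section_a` and `section_b` stand in for program sections the programmer has packaged as subroutines. Explicit spawn and join calls replace the parallel regions a compiler would otherwise insert for directives.

```python
import threading

results = {}

def section_a():                       # one concurrently executable section
    results["a"] = sum(range(50))

def section_b():                       # another, packaged as its own subroutine
    results["b"] = sum(range(50, 100))

ta = threading.Thread(target=section_a)
tb = threading.Thread(target=section_b)
ta.start()                             # spawn: the programmer's responsibility
tb.start()
ta.join()                              # join: explicit synchronization
tb.join()
total = results["a"] + results["b"]
```

The extra control (which sections run, when they are joined) is exactly what advanced programmers gain, and the extra calls are the higher programming effort the text mentions.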
2.4.2 Message passing programming methodology
Although HPF provides a directive-based programming model for the message passing model, programming methodologies found in the literature focus on an application-level approach using library functions. General methodologies for programming with message passing libraries are described in [?]. In [?], the authors employ the application-level approach ("application-driven development" in this book), in which they first categorize a given problem as one of five classes (synchronous applications, loosely synchronous applications, embarrassingly parallel applications, asynchronous problems, and metaproblems). To this end, the book provides readers with many example algorithms common in scientific computing. Based on the category of the target problem, the book lists possible parallel algorithms and suitable parallel machines. In [?], parallel program design consists of four stages: partitioning, communication, agglomeration, and mapping. Partitioning and communication are the tasks of distributing data and coordinating task execution, respectively. In the agglomeration stage, the combined parallel structures (data distribution and communication) are evaluated; if necessary, smaller tasks are combined into larger tasks to improve performance or to reduce development cost. Finally, in the mapping stage, each task is assigned to a processor in a manner that attempts to satisfy the design goals. Since parallel constructs are integrated into the program source in the message passing model, program design becomes an important part of parallel programming. This book also gives a detailed description of the process of evaluating parallel performance.
There are two different approaches to abstracting parallel programming in the message passing model using mathematical notations. One is based on parallel program archetypes or programming paradigms [?]. These are abstract notations that combine computation structure, parallelization strategy, and templates for dataflow and communication. Programmers are given a set of parallel program archetypes or programming paradigms. They then identify an appropriate element within the set that matches the problem they are trying to solve. Finally, they implement the actual program using the parallel structure or the template stated by that element. Using this methodology, programmers can save the time and effort of designing an appropriate parallel structure for a given problem. Once they identify the right parallel program archetype or programming paradigm, the implementation becomes simpler. This scheme works well for scientific computing, in which a set of well-known algorithms is used across many applications. In the other approach, programmers begin with a conceptual or formal description of a given problem and find an appropriate parallel structure for the algorithm through a series of suggested analysis processes [?]. This method is more algorithm-specific, and its applicability is even narrower.
2.5 Tools
In this section, we briefly introduce the tools that have been developed to help programmers write and tune parallel programs. As the task of developing a well-performing parallel program is very challenging, numerous tools have been built to help programmers. Some have been made public for a general audience, and some were used only within small research groups. Among the public tools, only a few gained attention from the parallel computing community, and even fewer were actually used by other researchers and programmers.
We present here some of the major efforts in developing parallel programming tools. Due to the sheer number of tools, we have divided them into four categories based on their functionality: program development and optimization, instrumentation, performance visualization and evaluation, and guidance. We will examine their advantages and shortcomings and discuss possible improvements. It should be noted that, in this section, we do not cover tools designed to assist with other aspects of developing parallel programs, such as serial program coding and parallel program debugging. There are numerous general program coding and editing tools. Some of the efforts in parallel program debugging include the portable debugger for parallel and distributed programs [?], Panorama [?], TotalView [?], and Assure [?]. For the tools relevant to our research, we present a detailed comparison in a later chapter.
2.5.1 Program development and optimization
In this section, we focus on tools specifically designed for program parallelization and optimization. The objective of these tools is to optimize the performance of existing programs by helping users apply various techniques. In addition to supporting manual modifications, these tools generally have automated optimization utilities that make it easy for programmers to apply the techniques to selected parts of a program. We begin with the tools for the shared memory model.
Faust is an ambitious project started at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois in the late 1980s [?]. The tool supports many aspects of programming parallel machines, providing facilities for project database management, automatic program restructuring and editing, graphic browsers for call graphs, and an event display tool for performance evaluation. It is an environment that covers a wide range of parallel programming stages, such as coding, parallelization, and performance tuning. Its emphasis on project management allows it to support a major portion of the entire program development cycle.
The Start/Pat parallel programming toolkit was developed at Georgia Tech to support the programming and debugging of parallel programs [?]. It consists of a static analyzer, Start, and an interactive parallelizer, Pat. Its main concern is parallelization; general code optimization is not supported.
Parascope is an extension of the Rn programming environment developed at Rice University [?]. Like Start/Pat, the focus of Parascope is the automatic or interactive restructuring of sequential programs into parallel form. It integrates an editor, a compiler, and a parallel debugger. The automatic transformation is conducted based on data dependence information collected by their previous tool, PTool. It provides convenient facilities for parallelization and code transformation.
Faust, Start/Pat, and Parascope are important milestones in the effort to build interactive optimization tools for parallel programs. Unfortunately, their developers have stopped maintaining these tools, and their target architectures or programming models have been abandoned. Nonetheless, their pioneering work laid the groundwork for the current generation of interactive optimizers.
PTOPP (Practical Tools for Optimizing Parallel Programs) is a set of tools for efficient optimization and parallelization developed at CSRD [?]. It was designed based on the experience gained through the optimization of applications for the Alliant FX/8 and the Cedar machine. This toolset stays at the UNIX operating system level and provides some interaction through facilities built upon the Emacs editor. Facilities are provided for execution time analysis, convenient database and file management of performance data, and a flexible interface with extensive configurability. The PTOPP toolset does not include an interactive parallelization utility, but the Polaris compiler can be invoked through its interface.
Our research effort actually started out by expanding the PTOPP utilities to integrate static analysis data from a parallelizing compiler with simulation and performance data, which were missing from the previous version. PTOPP is a set of useful tools that help make parallel programming easier, but the core need of novice programmers, namely their lack of experience, was not addressed in this project.
SUIF Explorer is an interactive optimization tool developed at Stanford University [?]. It utilizes the SUIF compiler infrastructure [?] for automatic parallelization. This tool comes with a basic performance evaluation facility: based on profile data generated from program runs, it can sort execution times to identify dominant code segments. In addition, it displays the static analysis data gathered from running the SUIF parallelizing compiler. Perhaps the highlight of the tool is its "program slicing" capability. Using this technique, SUIF Explorer allows users to select certain lines in a program source and displays the sections of code that may be affected by changes made to those lines. This utility, combined with the automatic parallelization module, provides an interactive way of tackling the task of tuning parallel programs.
Visual KAP for OpenMP [?] is a commercial interactive tool from Kuck and Associates Inc. It performs automatic parallelization on program files. However, it lacks support for manual optimization and finer-grained tuning. FORGExplorer [?] is another commercial interactive parallelization tool, from Applied Parallel Research Inc. Like most of the tools presented in this section, FORGExplorer is capable of automatically parallelizing code sections while presenting users with static analysis data such as call graphs and control and data flow diagrams.
There are a couple of important optimization tools for the message passing programming model. The Fortran D Editor [?] is a graphical editor for Fortran D that provides users with information on the parallelism and communications in a program. It obtains data dependence, communication, and data layout information through a direct interface to the Fortran D compiler and displays the information during editing sessions. This is useful knowledge in developing message passing programs, but the Fortran D Editor lacks support for automatic parallelization. Converting directive-based data parallel languages to message passing programs is challenging as it is, and automatic parallelization of sequential programs with data parallel directives has not been successful.
The same applies to CAPTools, a programming tool for the message passing model from the University of Greenwich in London [?]. The parallelization process here is semi-automatic: through a series of interactions, users make decisions on which sections should be parallelized and how to distribute work and data. CAPTools constructs a data dependence graph for the target section and uses this graph in the subsequent automatic parallelization phase. If CAPTools needs more information from users, it asks questions through the user interface. Recently, a new front end for the shared memory model using OpenMP has been added, but the details are not available as of this writing.
Instrumentation
Instrumentation is a means of obtaining performance data and is usually part of most visualization and evaluation tools' functionality. In this section, we examine general mechanisms for instrumentation in the shared memory and message passing models and discuss a few instrumentation utilities that deserve special attention.
The main concern in parallel program instrumentation varies depending on the programming paradigm. In the shared memory model, where communication between processors is fast and frequent, reducing the instrumentation overhead is an important issue. On the message passing side, an often overwhelming amount of performance data becomes a problem. To this end, some researchers have incorporated a real-time summarization utility or non-uniform instrumentation, which will be discussed later in this section. Both of these issues conflict with the ultimate goal of instrumentation: obtaining as much performance data as possible.
As mentioned in Chapter [?], detailed instrumentation of shared memory programs is not feasible without significant perturbation. Hence, most instrumentation utilities rely on simple timing information, and the task of shared memory program instrumentation is mainly inserting calls to timing routines. A problem that often arises is that timing routine calls in nested code regions cause significant overhead.
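The insertion of timing calls, and the nesting problem just mentioned, can be sketched as follows. This is a minimal illustration under our own naming, not the mechanism of any particular tool.

```python
# Sketch: timer-based region instrumentation for shared memory programs.
# Each instrumented region is bracketed by start/stop calls; a stack
# handles nesting. All names here are our own illustration.
import time
from collections import defaultdict

class RegionTimer:
    def __init__(self):
        self.totals = defaultdict(float)   # accumulated seconds per region
        self.counts = defaultdict(int)     # invocation count per region
        self._stack = []                   # open (region, start_time) frames

    def start(self, region):
        self._stack.append((region, time.perf_counter()))

    def stop(self):
        region, t0 = self._stack.pop()
        self.totals[region] += time.perf_counter() - t0
        self.counts[region] += 1

timer = RegionTimer()
timer.start("outer_loop")
for _ in range(1000):
    timer.start("inner_loop")   # per-iteration calls: this is where
    timer.stop()                # nested instrumentation overhead piles up
timer.stop()
print(timer.counts["inner_loop"])  # 1000
```

Note that the thousand start/stop calls inside the loop are themselves charged to the outer region's measured time, which is precisely the nested-region overhead problem described above.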
At its foundation, the Polaris compiler [?] is a parallelizing compiler, but it provides a powerful instrumentation utility for shared memory programs. Polaris offers several different strategies for instrumentation that allow users to control the amount and the targets of instrumentation. Recently, a new library that supports hardware counters [?] has been made compatible with the Polaris instrumentation utility. Other optimization tools capable of instrumentation include SUIF Explorer [?], FORGExplorer [?], and GuideView [?].
In the message passing programming model, the data needed for visualization and animation are traces, and several trace formats exist: the IBM PE tracing format [?], the PVM tracing format [?], the ParaGraph format [?], Pablo's SDDF (Self-Defining Data Format) [?], and the VAMPIR format [?] are some examples. The difference between these is mainly the size of the trace files. Most visualization tools for the message passing model introduced in the next section use one of these well-known formats.
Since the parallel constructs in the message passing model are libraries of functions, instrumentation takes place by intercepting these calls. For additional information, a series of checkpoints is inserted for status feedback. Instrumenting these checkpoints is relatively simple, but the resulting trace data may be unmanageably large. AIMS [?] tries to resolve this problem by automatically identifying important regions. Paradyn's approach is unique in that its instrumentation and monitoring utility enables dynamically adjustable instrumentation by providing an on-line summarization facility [?]. VAMPIR [?] offers more compact trace formats. More details on AIMS, Paradyn, and VAMPIR are available in the next section.
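Interception of library calls can be sketched as a thin wrapper that emits a trace record around each communication call. `raw_send` and the record layout below are hypothetical stand-ins, not any real library's interface.

```python
# Sketch: trace-based instrumentation by call interception. The real
# communication routine is wrapped so each call also appends a trace
# record. `raw_send` stands in for the actual library function.
import time

trace = []  # a real tool would stream these records to a trace file

def raw_send(dest, payload):
    return len(payload)  # stand-in for the underlying library call

def traced_send(dest, payload):
    t0 = time.perf_counter()
    result = raw_send(dest, payload)        # forward to the real routine
    trace.append({"event": "send", "dest": dest,
                  "bytes": len(payload),
                  "elapsed": time.perf_counter() - t0})
    return result

traced_send(1, b"hello")
traced_send(2, b"worlds!")
print([(r["dest"], r["bytes"]) for r in trace])  # [(1, 5), (2, 7)]
```

Because every send and receive produces a record, trace volume grows with message count, which is exactly why the tools above resort to region selection, on-line summarization, or compact formats.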
The developers of TAU [?] at the University of Oregon chose a different approach to program instrumentation. TAU is a toolset designed for profiling, tracing, and visualizing parallel program performance. TAU's instrumentation utility can generate either timing profiles or trace files depending on the target application. When timing profiles are generated, static viewers are used to present summary information; for trace files, a trace visualizer is used. The instrumentation library is developed for multiple languages such as C, C++, Fortran, HPF, and Java, thus significantly broadening its applicability. However, the instrumentation process is manual: users need to specify which functions should be instrumented and associate them with a set of groups. For very large programs, this can be very cumbersome, especially when users have little knowledge of the program at hand.
Performance visualization and evaluation
Performance visualization refers to the transformation of numeric performance data into meaningful graphical representations. Visualization helps users gain insight into the behavior of parallel programs so that they can better understand the programs and improve their performance. Performance visualization is often a stepping stone to performance evaluation and problem identification. Performance visualization can be either dynamic or static. Dynamic visualization tools use graphical animation to illustrate the dynamic behavior of the program under consideration; the animation can take place either during program execution or after program termination through trace simulation. Static visualization displays a summary of performance characteristics in charts and graphs.
GuideView from the KAP/Pro toolset [?] is a typical static visualization tool. However, it targets the shared memory model and does not use traces; an instrumented run-time library generates and summarizes timing information. Using charts and graphs, GuideView illustrates what each processor is doing at various levels of detail via a hierarchical summary. Its intuitive, color-coded displays make it easy to assess the target application's performance. However, due to the high overheads incurred by the instrumentation, the resulting graphs may not reflect accurate real-time performance. The Fortran D Editor [?], SUIF Explorer [?], FORGExplorer [?], and DEEP/MPI [?] are also capable of graphical presentation of performance data, but their uses are limited to simple displays of the execution time of code blocks. DEEP/MPI targets MPI programs but does not provide a display of traces; instead, it shows resource usage and timing charts.
RACY from the TAU project [?] has performance viewing utilities consisting of a tabular text report and several static charts. The displayed information involves mostly timing profiles. As mentioned above, the TAU instrumentation utility is capable of generating trace files for message passing programs; instead of writing their own trace viewer, the developers decided to use VAMPIR [?], which is also discussed in this section.
As for the static display of traces, NTV [?] summarizes traces from message passing program execution and presents users with summary charts and timeline graphs, as shown in Figure [?]. This type of graph helps users understand the load distribution, stalls, and communication structure of the program. PMA from the Annai Tool Environment [?] is a graphic utility similar to NTV; Annai integrates this information with its source viewer for easier reference. XMPI from the LAM project [?] offers a similar view, although its main goal is the debugging of MPI programs. TraceView is a pioneering work in timeline display [?]; it generates timeline graphs for both shared memory and message passing programs through different runtime libraries. In both cases, trace files are used. However, its graphics are not as refined as those listed above, and the displayed data for shared memory programs are limited due to the nature of the shared memory programming model.
Fig. [?]. The timeline graph from NTV.
ParaGraph [?], Upshot [?], AIMS [?], Scope [?], and VAMPIR [?] are tools for animated postmortem visualization of program behavior based on trace simulation. The advantage of trace simulation is that the speed of the graphic animation can be adjusted (with the exception of ParaGraph), so that events that are difficult to observe in real time can be slowed down for better understanding. ParaGraph was a pioneering effort in performance visualization from the University of Illinois. The tool is visually elaborate, but its practical value is limited by a few missing features, such as the ability to set the speed of replay and the lack of appropriate annotation. Furthermore, the target and the framework of the graphic presentation are pre-determined by the developers, so users have little freedom to view other aspects of program behavior from different perspectives. Upshot has a feature to adjust speed, but it lacks features such as a dynamic call graph or a communication diagram. AIMS is an automated instrumentation and monitoring system from NASA; it displays dynamic program behavior through animated and summary views. AIMS adds a modeling module that provides a means of estimating how the program would behave if the execution environment were modified. Figure [?] shows a screenshot of AIMS in use. The goal of Scope is extensibility: Scope allows users more freedom to arrange performance data into new displays. VAMPIR adds a zoom utility, allowing users to examine performance data at varying levels of detail. All these tools target message passing programs.
Pablo [?], Paradyn [?], XPVM [?], PVaniM [?], and Falcon [?] can animate the behavior of a program while it is running. This monitoring capability is achieved by periodically updating graphs and charts with newly available runtime data from the executing application. However, events that occur frequently for a very short period of time cannot be traced and displayed. For this reason, XPVM and PVaniM have utilities to play back the generated traces, and the other tools generate summary statistics. Even so, visualizing important events during the execution of a shared memory program in an animated fashion is not feasible, because these events, such as writes to shared variables, happen too frequently and in too great a number. These tools visualize the events during message passing program execution.
Pablo, a performance evaluation tool developed at the University of Illinois, is perhaps the most successful tool currently in use [?]. It uses adaptive instrumentation control to reduce the perturbation of instrumentation as it executes. The resulting trace files are used to produce graphical displays of the program performance. Pablo also has a sonification utility and 3-D support that convey more information to its users through a multimedia experience. The combined effort with the Fortran D Editor [?] now allows Pablo to integrate performance data with a program development environment. However, the lack of appropriate annotation and a complex visual interface impose a steep learning curve on users. Figure [?] presents a snapshot of Pablo's graphical data presentation.

Fig. [?]. The graphs generated by AIMS.
The Paradyn Parallel Performance Measurement Tool, developed at the University of Wisconsin at Madison, is characterized by an instrumentation scheme that dynamically controls overheads by monitoring the cost of data collection [?]. The basic paradigm of instrumentation, execution, and visualization is the same as that of Pablo, but due to the dynamic nature of its instrumentation scheme, the tool is particularly useful when the application at hand is very large or long-running. The tool also contains a visualization facility that generates real-time tables and histograms, although it is not as extensive as that of Pablo.

Fig. [?]. The graphs generated by Pablo.
XPVM is a graphical user interface for PVM that displays both real-time and postmortem animations of message traffic and machine utilization by PVM applications [?]. While an application is running, XPVM displays a space-time diagram of the parallel tasks, showing when they are computing, communicating, or idle. XPVM stores events in a trace file that can be replayed and paused to analyze the behavior of a completed execution.
PVaniM specifically targets network computing environments [?]. The performance factors that are unique to networked environments require careful consideration in performance visualization. PVaniM addresses these network issues, such as possible heterogeneity, low network bandwidth, and clock skew, in its design. Its playback utility also adds to its usefulness by allowing users to examine details that may have been missed during real-time monitoring.
The principal aspects of Falcon are its abstractions and accompanying tools for the analysis of application-specific program information and on-line steering [?]. The term "application-specific" means that users choose which aspects of dynamic behavior to monitor and steer, beyond a predetermined set of parameters. In addition, Falcon provides support for the on-line graphical display of the information being monitored. The Falcon developers used the POLKA system [?] for its animated and static performance views.
The metrics supported by these animation tools include CPU utilization, memory usage, floating point operations, message size, and so on. They help programmers identify the bottlenecks in the execution of message passing programs. The advantage of these types of tools lies in providing different views of program execution by visualizing the temporal behavior of the target program. This is particularly important when processor communication is relatively sparse and visible, as in the message passing programming model, where bottleneck identification easily leads to well-known resolution techniques such as a different data distribution, combining messages, or algorithm modification.
The ability to monitor real-time performance presents opportunities for performance steering. To this end, the developers of Pablo, Paradyn, PVaniM, and Falcon have implemented performance steering facilities. In fact, the main focus of Falcon has been performance steering from the beginning of its development. Typically, users provide or select a set of parameters that they want to manipulate during program execution, and they are able to do so at various checkpoints inserted into the target program. Performance steering is not our concern in this research, so we will not go into any more detail.
Finally, CUMULVS [?] takes a different approach to performance visualization. As an extension to PVM, CUMULVS is a library of functions that users can insert into programs to visualize the behavior of a parallel program in real time. The instrumentation task is shifted to programmers, but this gives users the flexibility to choose what type of data they want to view. The CUMULVS data collection utility can be used with several front-end visualization systems. CUMULVS also supports program steering through checkpoints.
Guidance
The term "performance guidance" is used in many different contexts in the parallel programming field. Generally, it means taking a more active role in helping programmers overcome the obstacles in tuning programs. With so many available tools for the instrumentation and visualization of raw data, the task of extracting meaningful information is becoming increasingly burdensome. In this section, we discuss several tools that support this functionality. Accommodating novice programmers and automating the performance evaluation process are important issues in parallel programming, and they are among the focuses of our research. However, we found only a few efforts addressing these subjects.
SUIF Explorer's Parallelization Guru bases its analysis on two metrics: parallelism coverage and parallelism granularity [?]. These metrics are computed and updated when programmers make changes to a program and run it. It sorts profile data in decreasing order to bring programmers' attention to the most time consuming sections of the program. It is also capable of analyzing data dependence information and highlighting the sections that need to be examined by its users.
The Paradyn Performance Consultant [?] discovers performance problems by searching through the space defined by its own search model. The search process is fully automatic, but manual refinements to direct the search are possible as well. The result is presented to the users through graphical displays. DEEP/MPI [?] features a similar advisor that gives textual information about message passing program performance. The DEEP/MPI advisor's analysis is hard-coded, and the analysis is limited to subroutines or functions.
PPA [?] proposes a different approach to tuning message passing programs. Unlike the Parallelization Guru, the Performance Consultant, and DEEP/MPI, which base their analysis on runtime data and traces, PPA analyzes a program source and uses a deductive framework to derive the algorithmic concept from the program structure. Compared to other programming tools, the suggestions provided by PPA are more detailed and assertive; the solution for one example in [?] was to replace an inefficient algorithm.
The Parallelization Guru, the Performance Consultant, and DEEP/MPI basically tell the user where the problem is, whereas the expert system in PPA takes the programming environment a step toward an active guiding system. However, the knowledge base for the expert system relies on understanding the underlying algorithm through pattern matching, and an expert system that understands the full variety of parallel algorithms is nearly impossible to build. Due to the complexity required, problem identification is done by other tools and hand analysis, and the suggestions provided by the tool consider only parallel constructs, which also limits its usage. Because of the lack of performance evaluation and tuning support, PPA cannot be considered a programming environment, but the effort to develop a performance guiding tool is worth noting.
Utilizing Web Resources for Parallel Programming
One of our objectives is to reach a general audience with our methodology, tools, and optimization study results. We have taken the Internet computing approach to address this issue. Thus, we focus our attention on previous efforts that attempted to utilize the Web to provide a programming environment and to establish on-line repositories.
Many of the systems and technologies that currently allow computing on the Web support a single tool or a relatively small set of tools. They include PUNCH [?], MOL [?], NetSolve [?], Ninf [?], RCS [?], VNC [?], WinFrame [?], Globus [?], and Legion [?]. More detailed descriptions of these systems are found in [?].
As for benchmark repositories, several Web tools offer performance numbers for various benchmarks [?]. Typically, the presented data are timing numbers such as overall program performance or specific timings of communication in message passing systems. Extensive characteristics of the measured programs are usually not part of the on-line databases; the user has to obtain such information, which is often necessary for interpreting the numbers, from separate sources. Furthermore, these repositories do not provide information gathered by other tools, such as compilers or simulators, and consequently they do not support the comparison or the combined presentation of performance aspects and program characteristics.
Our effort to resolve these problems with the previous research efforts unfolds in two ways. First, we have used PUNCH, a network computing infrastructure [?], to construct an integrated, Web-accessible, and efficient parallel programming tool environment. PUNCH allows remote users to execute unmodified tools on its resource nodes. More detailed descriptions of PUNCH are found in Section [?]. Second, our results on performance enhancement with various applications have been made accessible through an Applet-based browser, which allows not only examining the raw data but also manipulating and reasoning about the information. This facility is explained in more detail in Section [?].
Conclusions
Thus far, we have studied general concepts and paradigms in parallel programming. We have also looked at general trends in parallel programming models and supporting tools. We have learned that there have been numerous attempts to aid parallel programmers through various tools. However, these tools are generally not based on a programming methodology and tend to focus on one specific aspect of the optimization process. In addition, a brief discussion has been given on enhancing tool accessibility via the Web.
It seems that tools supporting the shared memory model place more emphasis on static analysis and automatic code transformation, while those supporting the message passing model mainly focus on performance visualization. This is not surprising, considering that the shared memory model enables structured program-level parallelism but makes instrumentation expensive, whereas in the message passing model, events are relatively explicit and sparse, but automatic parallelization is difficult.
Several tools have attempted the integration of different aspects of parallel programming. Pablo and the Fortran D Editor [?] opt for the integration of program optimization and performance visualization, but their visualization utilities, although highly versatile, are difficult to comprehend and offer little to help programmers in deductive reasoning. The Fortran D Editor's lack of automatic parallelization capability also limits its utility, especially among novice programmers. SUIF Explorer [?] and FORGExplorer [?] have a similar goal, but their performance analysis utilities serve only the complementary purpose of directing programmers to time-consuming code regions. The KAP/Pro Toolset [?] consists of useful tools but does not support manual tuning. The focus of the Annai Tool Project [?] is limited to the aspects of parallelization, debugging, and performance monitoring. Faust [?] may be the most comprehensive environment to date, encompassing code optimization and performance evaluation. However, many aspects of Faust are not suitable for modern parallel machines, and it is no longer maintained by its developers. Also, there is the issue of active user guidance, which none of the optimization tools supports. Apart from the missing functionality, the problems with these tools (and most other tools discussed in this chapter) are the lack of continuous support, system compatibility, scalability (the effort to add new tools or features), and accessibility (not being available and being difficult to learn).
The quality of the visualization of the performance and structure of parallel programs provided by today's tools has reached an impressive level. Almost every aspect of parallel program execution can be viewed in user-friendly displays. Parallel execution events and resource utilization summaries are presented via colorful graphs, charts, animation, and even sound effects. We believe that the next step in assisting programmers in performance evaluation should be support for the comprehension of and deductive reasoning about performance data. As the user base of affordable parallel machines keeps expanding, this aspect of performance evaluation becomes increasingly important.
"A lot of smart people are developing parallel tools that smart users just won't use." This sentence, quoted from [?], summarizes well some of the problems with tool development over the years. Many tools have ended their lives unused by anyone other than their developers. Perhaps this is because the tool developers focused their attention only on specific stages in parallel program development, disregarding the big picture. In many cases, the developers created the tool that they thought would be useful based on their experience in their own environment. Another reason could be the lack of effort from developers in providing convenient access to their tools. The conventional approach to promoting tool usage has always been telling users what the tool can do and explaining what to do with it; not enough consideration has been put into actually letting users try the tools. We advocate the importance of a programming and optimization methodology once more, because knowing exactly what must be done at each stage of parallel program development leads to an effort to understand and appreciate the tool functionality that fits users' needs. With active motivation to reach a larger audience with an integrated methodology and a toolset, we may have a better chance.
SHARED MEMORY PROGRAM OPTIMIZATION METHODOLOGY
In this chapter, we outline our proposal for a methodology for the shared memory programming model. We believe that the programming style of this model allows a systematic approach to program tuning that is far more detailed and organized than the simple descriptions found in general guidelines. The programmers' task in this scheme is to follow the steps suggested by the guidelines and apply the appropriate techniques.
Introduction: Scope, Audience, and Metrics
Before presenting the methodology, we first discuss its scope and target audience, as well as the metrics used in the methodology.
Scope of the proposed methodology
Figure [?] shows a typical shared memory program development cycle. The software design and implementation part, inside the dashed box, has been simplified in this figure. The issues in these stages include planning, design, coding, testing, and debugging. This is quite a complex topic, and there is a sophisticated set of methodologies, remedies, metrics, and tools for helping programmers in this matter [?]. We will not discuss general software engineering issues any further in this proposal.
In this research, we focus on the parallelizing and tuning process (the box enclosing parallelization/tuning, program compilation, program execution, and performance evaluation). We assume that programmers have a working serial program. Developing a sequential program is orthogonal to parallel processing, and we assume that most programmers follow one of the existing software engineering practices. Our effort attempts to resolve the difficulties and problems associated with parallelizing and optimizing sequential programs.

Fig. [?]. Typical parallel program development cycle.

Also, notice that we do not consider the application level approach (explained in Chapter [?]) to parallel program development. Finding parallelism at the algorithm level and incorporating it while writing a program is a different subject, in that it requires a new perspective on examining algorithms, identifying parallelism, dividing and balancing tasks, and incorporating them into the source code. As pointed out in the introduction, the sheer number of variables in this approach is so large that finding a systematic programming methodology would be extremely difficult. Some tips can be found in the literature, such as [?], as well as in some of the programming methodologies introduced in Chapter [?].
Target audience
We assume that our target programmers are familiar with programming and compilation: they should be able to write, debug, compile, and run a sequential program. Also, they should know at least the basics of how parallel processing works in the shared memory programming model. It helps to understand the underlying shared memory architecture, because certain machine dependent parameters have a significant impact on program performance. To follow our methodology, it is not necessary to be an experienced parallel programmer; however, even for experienced programmers, the methodology serves as an efficient strategy for parallel programming.
We divide our target audience into two groups: novice and advanced programmers. The word "novice" means new to parallel programming, not to programming in general. The novice programmer group consists of those given the task of parallelizing a sequential program or writing a parallel program without much prior experience with the process. They resort to a methodology mainly for the guidelines and suggestions that make up for their lack of experience. They need to get a feeling for what the available techniques are and how they can be applied. The supporting tools must take this into account to make the learning curve as smooth as possible.
The need of advanced parallel programmers lies in the supporting utilities. The methodology aids them in efficiently structuring the approach they have already been taking. They have a good idea of what tasks have to be done in each stage, and they desire effective tools that accelerate tedious tasks. They would like the tools to be flexible, so that they can configure them to fit the specific tasks of their interest.
Metrics: understanding overheads
Performance evaluation is an important stage in parallel programming. The evaluation process consists of finding performance problems and possible techniques for improvement. Finding problems requires definitions of performance problems; in other words, programmers should know which phenomena constitute performance problems. Without definitions, problems cannot be found. Metrics are used to formalize performance problems.
In our methodology, the performance evaluation process begins with identifying dominant and problematic code sections. A metric system provides a means of efficiently identifying bottlenecks in the presence of a possibly large amount of information. As overhead analysis is a critical part of the methodology, we introduce in this section a couple of perspectives on parallel program performance and the related metrics. The main attention of these systems goes to "overhead".
One common way to view performance overhead is described well in [?], in which a programmer needs to identify two factors contributing to the overall overhead: parallelization and spreading overheads. Our tuning strategy in the proposed methodology is based on this overhead model.
Parallelization overhead. This refers to the overhead introduced by transforming a program into parallel form. It is often identified by comparing the execution time of the serial version with that of the parallel version run on one processor. The main reason for this overhead is that the code inevitably gets augmented for parallelization.

The parallelization overhead of a parallel loop is computed as

T_parallelization = T_1-processor parallel execution - T_serial execution

The factors that contribute to parallelization overhead are listed below.
1. Instructions needed for parallel execution. The instructions for tasks such as fork, join, and barriers are necessary for parallel execution. These increase the code size and cause unavoidable overhead.

2. Instructions needed for code transformation. Some parallelization techniques require code changes that may incur overhead. For instance, the reduction technique requires a separate preamble and postamble, and the induction technique may introduce a complicated expression in each iteration that was not part of the original code.

3. Inefficient optimization. Code-generating compilers perform fewer optimizations on a parallel code section (compared to the original, serial code), leading to less efficient code.
Parallelization overhead may be amortized if the loop runs significantly longer than the overhead time. On the other hand, frequent invocation of a very small parallel loop can cause serious degradation in performance.
Spreading overhead The execution model of a shared memory architecture is basically as follows: at the beginning of a program, a process forks multiple threads, and the master thread among them wakes the others up whenever it encounters a parallel section. The time to wake the other threads is an unavoidable overhead. Spreading overhead usually increases as more processors are used in program execution.
The spreading overhead is computed as

T_spreading(P) = T_parallel execution(P) - T_1-processor parallel execution / P

where P denotes the number of processors.
Some of the reasons for spreading overhead are given below.
1. Startup latency. This refers to the time to initiate parallel execution on multiple threads. Naturally, the more threads run, the larger the overhead. One way to reduce it is to merge adjacent parallel regions into one, making each parallel section as large as possible.
2. Memory congestion. Because data are shared in a common memory, heavy traffic on the memory bus may slow down parallel execution. One possible remedy is to increase the locality of loops to reduce bus traffic.
3. Coherence traffic. Sharing data also requires coordination, which adds overhead for legitimate data invalidation.
4. False sharing. Depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent unnecessary invalidations.
5. Load imbalance. Tasks may be unevenly distributed over multiple processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited by the remainder.
Another perspective on overhead is provided in ���� ����. Hardware counters available on most modern machines provide detailed statistics on the dynamic behavior of parallel programs. Yet the measured values do not necessarily translate into parallel programming terms. The proposed model defines four overhead components (memory stalls, processor stalls, code overhead, and thread management overhead) based on the hardware counter data. Each component is clearly defined, and the possible contributing factors and remedies are also given. This model provides more detailed insight into the overhead characteristics of parallel loops. For instance, a loop may exhibit small parallelization and spreading overheads, while memory or processor stalls indicate a problem. We have just begun to explore this new system, and more work needs to be done to incorporate it into tool development. The drawback of this model is that obtaining the necessary data is tedious and very time-consuming. The traditional parallelization and spreading overhead model still serves as the primary measure for performance analysis for many programmers, and it will continue to do so in the near future.
��� Parallel Program Optimization Methodology
In the past, we have participated in several research efforts in parallelizing programs for different target architectures ���� ��� ���. At first, we belonged to the category of novice programmers. After a great deal of trial and error, we developed a structured approach to successful parallelization of programs. As the number of programs that we dealt with increased, our general methodology went through several stages of adjustment and improvement. Finally, we felt the need to write it down so that a wider range of programmers could benefit from the efficiency it provides. Thus, we started the process of refining our methodology to improve both its efficiency and its practicality.
Figure ��� shows an overview of the parallelization and optimization steps outlined by our proposed methodology. There are two feedback loops in the diagram. The first serves to adjust the instrumentation overhead. The second is the actual optimization loop, consisting of the application of new techniques and their evaluation.
Our methodology envisions the following tasks when porting an application program to a parallel machine and tuning its performance. We start by identifying the most time-consuming code section of the program, optimize its performance using several recipes, and then repeat this process with the next most important code section. The most important code blocks for parallel execution in our programming paradigm are loops. Hence, we profile the program execution time on a loop-by-loop basis. We do this by instrumenting the program with calls to timer functions. The timing profile not only allows us to identify the most important code sections, but also to monitor the program's performance improvements as we convert it from a serial to a parallel program. However, as the diagram shows, programmers may need to adjust the amount of profiling due to the accompanying overhead. The first step of performance optimization is to apply a parallelizing compiler. If no such tool is available, or if we are not satisfied with the resulting performance, we can apply program transformations by hand. We will describe a number of such techniques. The following sections describe all these steps in detail.
����� Instrumenting program
Instrumentation is a means to obtain performance data. Typically, in the shared memory model, profiling routines that record the necessary data are inserted into the code. As a result, one or more profiles are generated at the end of the program execution. There are other methods to instrument a program at the assembly level, which we do not consider in this research. Program instrumentation is an important step in optimizing program performance: the profiles from instrumented program runs provide the basis for performance evaluation and optimization. It should be determined beforehand what types of code blocks should be instrumented. In the directive-based shared memory programming model, loops are usually the basic blocks for instrumentation, because they are the basic sections considered for parallelization. The metrics for measurement can vary, but they should conform to the goal of the optimization. There are utilities for measuring various aspects of program execution; the most widely used measure is execution time.

[Flowchart: Instrumenting Program -> Getting Serial Execution Time -> Running Parallelizing Compiler -> Manually Optimizing Program -> Getting Optimized Execution Time -> Speedup Evaluation -> Finding and Resolving Performance Problems. If the speedup is unsatisfactory, the optimization and evaluation steps repeat; a second feedback loop reduces the instrumentation overhead; the process ends when the speedup is satisfactory.]

Fig. ���. Overview of the proposed methodology.
As the first step, programmers should instrument the serial program. The purpose of this step is to understand the distribution of execution time within the program and to identify the code segments worth the optimization effort. Therefore, it is desirable to obtain as much timing data as possible throughout the target program. For instance, programmers may decide to instrument all the loops in a given program.
Unfortunately, most instrumentation methods introduce overhead. This has to be considered very carefully, because it not only affects the program's performance, but can also skew the execution profile so that the programmer targets the wrong program sections. Our methodology suggests the following remedies.
- Programmers should make sure that they run the program both with and without instrumentation. They should proceed only after they have verified that the perturbation is small.
- In order to reduce overhead, programmers should remove instrumentation from innermost loops (innermost code sections, in general). They may need to find out the overhead per call of the instrumentation library. If the initial profile shows code sections whose average execution times are less than two orders of magnitude larger than this overhead, the corresponding instrumentation should be removed.
- Programmers should add instrumentation after they run the code through a parallelizing compiler. Compilers can usually apply fewer optimizations in the presence of many subroutine calls, and source-level instrumentation generally takes the form of inserted subroutine calls. If an assembly-level instrumentation tool is available, this is less of a problem.
- Programmers should be careful when adding instrumentation inside a parallel loop or region. Instrumentation libraries may assume that these function calls are made from serial program sections only.
- It is desirable for programmers to make sure that the instrumented code segments in the optimized program match those instrumented in the sequential program, so that side-by-side comparisons can be made in the performance evaluation stage.
There is an obvious dilemma: if programmers remove too many instrumentation points, the profile becomes less useful. They should leave the instrumentation in place at least for all those program sections that they may later try to tune.
����� Getting serial execution time
Program execution may be affected by many factors: processor speed, architecture, operating system, system load, network load such as file IO requests, and so on. The program resulting from this optimization process is subject to all these factors. However, to accurately measure the effect (whether positive or negative) of the techniques applied during the optimization process, it is very important to eliminate these external factors during instrumented program runs. One way to ensure an uninterrupted environment is to use "single-user time", a period during which only one user is allowed on the system. In this way, programmers can avoid unnecessary overheads caused by context switching, external file IO, and so on.
����� Running parallelizing compiler
Parallelizing compilers can analyze the input program, detect parallelism, and automatically generate appropriate directives for the detected parallel regions. Parallelizing compilers relieve parallel programmers of the task of parallelizing all loops manually. They are especially useful when the loops under consideration have complex structures for which human analysis is cumbersome. State-of-the-art parallelizing compilers include many advanced techniques for parallelization and optimization.
It is important to note that relying entirely on parallelizing compilers for optimization may not result in optimal performance. Compilers base the techniques that they apply on static analysis of the input program, which may not accurately reflect the program's dynamic behavior. Modeling the dynamic characteristics of programs is very difficult. For this reason, programmer intervention may be necessary to achieve near-optimal performance: compensating for the compiler's lack of knowledge of the dynamic behavior of a program is the key to obtaining good performance.
Nonetheless, running a parallelizing compiler is a good starting point. It can save programmers a significant amount of time that would otherwise be spent analyzing all the loops in a program. For novice programmers, manually parallelizing loops may be cumbersome to begin with. In addition, most parallelizing compilers are capable of generating a listing of their static analysis results, which may provide programmers with valuable information on various code sections.
In our methodology, we do not assume that programmers necessarily have access to a parallelizing compiler. If no compiler is available, the first set of techniques to apply should be those for parallelization, described in the next section.
����� Manually optimizing programs
Manual optimization allows users to make up for the compiler's shortcomings. If a programmer has run a parallelizing compiler, the static analysis information generated by the compiler (in the form of listing files) can help the programmer better understand the problems at hand. Running instrumented programs offers insight into the program's dynamic behavior. Combined with the programmer's knowledge of the underlying algorithm and physics, these data provide vital clues for improving the performance.
In our methodology, we have divided various well-known techniques into four categories: parallelization techniques, parallel performance optimization techniques, serial performance optimization techniques, and other techniques. Parallelization techniques involve parallelizing code segments. Parallel performance optimization techniques may improve the performance of already parallel sections. Serial performance optimization techniques aim to improve the performance of code sections whether they are serial or parallel; some of them may result in a super-linear speedup if they are not also applied to the serial program that serves as the performance reference point. Locality enhancement techniques are typical examples. The techniques in the "other" category have no effect on performance by themselves; however, they may enable other, previously inapplicable techniques. The benefits of the techniques described below can vary significantly with the underlying machine. The judgment about which techniques to apply to a given program should be based on accurate performance evaluation, which is discussed in the subsequent section.
We give brief descriptions of the techniques that we have used to improve program performance. More detailed descriptions and theoretical background can be found in ��� ��� ����.
Parallelization techniques
Privatization Privatization seeks to remove spurious data dependences. Often, scalar variables and arrays are used as temporary storage within an iteration of a loop; if a private copy of such a variable is provided to each iteration, the loop may be parallelized. More conservatively, a single copy may be provided to each of the participating processors. For example, in Figure ���, variable X is used as temporary storage within a loop. By allowing separate copies of X for all participating processors, seemingly serial code can be executed in parallel. In some cases, the temporary storage may be an array, as shown in Figure ���.
Reduction Scalar reductions are recurrences of the form sum = sum + expr, where expr is a loop-variant expression and sum is a scalar variable. Loops that contain such recurrences cannot be executed in parallel without being restructured, since values are accumulated into the variable sum. One way of addressing this situation is to calculate local sums on each processor and combine these sums at the completion of the loop. Figure ��� shows an example of such a scalar
(a)
DO I = 1, n
   X = ...
   ... = X
ENDDO

(b)
!$OMP PARALLEL DO PRIVATE(X)
DO I = 1, n
   X = ...
   ... = X
ENDDO

Fig. ���. Scalar privatization: (a) the original loop and (b) the same loop after privatizing variable X.
(a)
DO I = 1, n
   DO J = 1, m
      A(J) = ...
   ENDDO
   DO J = 1, m
      ... = A(J) ...
   ENDDO
ENDDO

(b)
!$OMP PARALLEL DO PRIVATE(J, A)
DO I = 1, n
   DO J = 1, m
      A(J) = ...
   ENDDO
   DO J = 1, m
      ... = A(J) ...
   ENDDO
ENDDO

Fig. ���. Array privatization: (a) the original loop and (b) the same loop after privatizing array A.
reduction operation and its transformed version in OpenMP. OpenMP provides a construct for identifying reduction operations of type addition, multiplication, maximum, and minimum.
(a)
DO I = 1, n
   sum = sum + A(I)
ENDDO

(b)
!$OMP PARALLEL DO SHARED(A)
!$OMP+ REDUCTION(+: SUM)
DO I = 1, n
   sum = sum + A(I)
ENDDO

Fig. ���. Scalar reduction: (a) the original loop and (b) the same loop after recognizing reduction variable SUM.
In addition to scalar reductions, array reductions must be addressed, as it has been shown that array reduction recognition is one of the most important transformations in real applications. Array reductions, like scalar reductions, are summations; however, they are of the form A(ind) = A(ind) + expr, where the value of the subscript ind of A cannot be determined at compile time. Therefore, local sums must be accumulated for each element of A and combined at the loop's completion. Figure ��� shows such a reduction operation. The constant No_Of_Procs holds the number of participating processors, and the function call Get_My_Id() returns the identification of the processor executing the iteration. The two additional loops for initialization and final summation are called the preamble and the postamble, respectively.
Induction Induction variables are variables that form a recurrence in the enclosing loop. Figure ��� shows an example of a simple induction expression as well as its transformed form, which has no loop-carried dependences. Induction variable substitution must first recognize variables of this form and then substitute them with a closed-form solution.
(a)
DO I = 1, n
   A(ind) = A(ind) + B(I)
ENDDO

(b)
DO I = 1, No_Of_Procs
   DO J = 1, Elements_In_A
      A2(J, I) = 0
   ENDDO
ENDDO
!$OMP PARALLEL DO SHARED(A2, B, No_Of_Procs)
DO I = 1, n
   A2(ind, Get_My_Id()) = A2(ind, Get_My_Id()) + B(I)
ENDDO
DO J = 1, Elements_In_A
   DO I = 1, No_Of_Procs
      A(J) = A(J) + A2(J, I)
   ENDDO
ENDDO

Fig. ���. Array reduction: (a) the original loop and (b) the same loop after recognizing reduction array A.
(a)
X = 0
DO I = 1, n
   X = X + 2*I
   A(X) = ...
ENDDO

(b)
!$OMP PARALLEL DO SHARED(A)
DO I = 1, n
   A(I*(I+1)) = ...
ENDDO

Fig. ���. Induction variable recognition: (a) the original loop and (b) the same loop after replacing induction variable X.
This transformation allows the original loop shown in Figure ���(a) to be executed in parallel. Unfortunately, if there are many enclosing loops and complex induction variables, the closed-form induction expressions may become rather expensive to compute. If these expressions are used often, they can introduce significant overhead.
Handling IO If the IO statements within a loop are necessary for program execution and the order of the IO statements has to be preserved among loop iterations, the loop cannot be parallelized. In other cases, the loop can still be parallelized by using one of the following methods.
- If the IO is not absolutely necessary, it can simply be removed. For instance, if the IO was inserted for debugging purposes or as execution status reports, deleting the IO statements will not affect the execution.
- In cases where IO is needed to report the status of an array, the loop may be distributed into two loops: one for computation and the other for IO. The resulting loop containing only IO cannot be parallelized, but the loop containing only computation may be parallelizable.
Handling subroutine and function calls If a loop contains a subroutine or function call, most parallelizing compilers make the conservative decision not to parallelize it. The programmer has to make sure that the subroutine or function has no side effects before manually parallelizing such a loop.
Also, depending on the implementation of the parallel constructs, parallel sections inside a function or subroutine that is already running in parallel may have unexpected effects. If a programmer decides to execute a subroutine or function within a parallel block, it is advisable to remove the parallel constructs within that subroutine or function. Another possible solution is to inline the called function or subroutine, if its size is reasonably small. More details on inlining are presented later in this section.
Parallel performance optimization techniques
Parallelization introduces overhead that clearly affects execution time. Programmers must be aware that parallelization may even degrade the performance of some code sections. We presented the parallelization and spreading overhead model in Section ����. The techniques listed below aim to further improve the performance of already parallel code sections. They mainly seek to reduce the overhead introduced by parallelization.
Serialization In many cases, the effect of an optimization is not entirely predictable. Furthermore, if programmers use a parallelizing compiler, the compiler may cause some code sections to perform worse. Sometimes, parallelizing a code segment simply does not pay off. For instance, if the execution time of a loop is of the same order as the parallelization overhead, its parallel execution is likely to perform worse than the serial version. If no other techniques can further improve the parallel section, simply removing the parallel directives can at least prevent degradation.
This technique is highly machine-dependent. The benefit of parallelization depends on many machine parameters: cache and memory size, bandwidth, processor speed, IO efficiency, and the operating system. If the target program is to be used on various architectures, programmers should make a cautious decision as to which segments should be converted back to serial, based on a study of those architectures. A useful strategy is to serialize those loops or code sections whose timing profiles show no improvement from any parallelization and tuning attempts. It is also advisable to monitor the performance of those loops whose execution time is less than an order of magnitude larger than the fork/join overhead. The fork/join overhead can be measured as the difference in execution time of an empty parallel loop between parallel and serial execution.
It should be noted that serialization itself can have a negative impact. The idea of serialization is to restore a code segment to its original state, but due to cache effects, the execution may slow down compared to the same code section in the untouched version. For instance, a small serial loop right between two large parallel loops may cause significant cache misses due to the distribution of the data across the caches.
Handling false sharing Depending on the cache line size, data that are needed by only one processor may spread over other processors' caches, causing frequent invalidations. This may be prevented by applying one of the two techniques described below.
- Programmers may try to modify the array access patterns by scheduling tasks that access adjacent regions on the same processor. An example is given in Figure ���.
- Another solution is padding. By adding empty data items to a shared array, one may avoid false sharing by separating data into individual cache lines. However, this may have negative effects due to the increase in data size. Figure ��� shows an example of padding. It should be noted that changing array declarations can have global and interprocedural effects: all uses of the modified arrays must be changed to use the new dimensions.
Scheduling A directive language usually comes with several options for scheduling. Scheduling in parallel programming means telling the underlying machine how
(a)
!$OMP PARALLEL
!$OMP DO
DO I = 1, 3
   DO J = 1, N
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

(b)
!$OMP PARALLEL
DO I = 1, 3
!$OMP DO
   DO J = 1, N
      A(I, J) = B(I, J)
   ENDDO
!$OMP END DO NOWAIT
ENDDO
!$OMP END PARALLEL

Fig. ���. Scheduling modification: (a) the original loops and (b) the same loops after moving the work-sharing construct inside the loop nest. In (b), the inner loop is executed in parallel, so the processors access array elements that are at least one chunk of iterations apart.
(a)
REAL A(4, N), B(4, N)
...
!$OMP PARALLEL
!$OMP DO
DO J = 1, N
   DO I = 1, 4
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

(b)
REAL A(32, N), B(32, N)
...
!$OMP PARALLEL
!$OMP DO
DO J = 1, N
   DO I = 1, 4
      A(I, J) = B(I, J)
   ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL

Fig. ���. Padding: (a) the original loops and (b) the same loops after padding extra space into the arrays, so that each column begins on its own cache line.
the tasks should be distributed across the processors. In the Fortran case, if a loop iterates from 1 to 100, multiple processors allow many ways to split the iterations. Depending on the loop structure, scheduling can make a significant difference in performance. Locality and false sharing are the two most important factors affected by employing different scheduling schemes. The OpenMP directive language provides four different options for scheduling ���. Some scheduling schemes incur more overhead due to the required bookkeeping. Programmers are advised to examine the loop structure before trying a different scheduling mechanism.
- static: Each processor is assigned a contiguous chunk of iterations. If the amount of work in each iteration is approximately the same, and there are enough iterations for an equal distribution, this scheduling will do fine.
- dynamic: A processor is assigned the next iteration as it becomes available. This is useful if the loop has varying amounts of work per iteration. The overhead is usually higher than that of static scheduling, but if the program is to run in a multi-user environment, its better load balancing properties can improve performance.
- guided: The same as dynamic scheduling, but the number of iterations dispatched to each processor decreases as the loop progresses.
- runtime: The scheduling decision is deferred until runtime. The value of the environment variable OMP_SCHEDULE determines the scheduling scheme.
Load balancing Unevenly distributed tasks cause stalls on some processors. In cases where the number of iterations is small and cannot be distributed evenly, the expected speedup is limited by the remainder of the number of iterations over the number of processors. There is no solution for this case other than trying to parallelize outer loops. If the imbalance is caused by uneven work within the loop body (such as an outer parallel loop with an inner triangular loop), dynamic scheduling may result in better performance. Figure ��� shows an example of load balancing by changing the scheduling.
(a)
!$OMP PARALLEL DO
!$OMP+ SCHEDULE(STATIC)
DO I = 1, N
   DO J = 1, I
      ...
   ENDDO
ENDDO

(b)
!$OMP PARALLEL DO
!$OMP+ SCHEDULE(DYNAMIC)
DO I = 1, N
   DO J = 1, I
      ...
   ENDDO
ENDDO

Fig. ���. Load balancing: (a) the original loop and (b) the same loop after changing the scheduling scheme. By changing the scheduling from static to dynamic, the unbalanced load can be distributed more evenly.
Blocking/tiling If the data accessed by each iteration of a loop exceed the data cache size of the processor and the data are reused across iterations, many cache misses occur. Blocking/tiling splits the data needed by each iteration so that they fit into one processor's cache. This technique is particularly useful in large matrix manipulations. Obviously, machine parameters come into play for this technique to be successful: knowing the machine's cache size will help determine the right block size. Blocking and tiling are basically locality enhancement techniques. Figure ��� shows how blocking/tiling can be applied.
In Figure ���, the entire B array is referenced in each iteration of the I loop. If the 2*N + N*N references within each iteration of the I loop exceed the cache size, then each access to a new line of array B will be a cache miss. Tiling the K and J loops allows smaller sections of B to be accessed repeatedly before moving on to another section, decreasing the references within the I loop to 2*BLK + BLK*BLK. If BLK is small enough, then each line of B will see only one cache miss during the execution of the entire nest.
(a)
DO I = 1, N
   DO K = 1, N
      DO J = 1, N
         C(J, I) = A(K, I) * B(J, K) + C(J, I)
      ENDDO
   ENDDO
ENDDO

(b)
DO KK = 1, N, BLK
   DO JJ = 1, N, BLK
      DO I = 1, N
         DO K = KK, MIN(KK+BLK-1, N)
            DO J = JJ, MIN(JJ+BLK-1, N)
               C(J, I) = A(K, I) * B(J, K) + C(J, I)
            ENDDO
         ENDDO
      ENDDO
   ENDDO
ENDDO

Fig. ���. Blocking/tiling: (a) the original loop and (b) the same loop after applying tiling to split the matrices into smaller tiles. In (b), outer loops have been added to process smaller blocks at a time; the data are likely to remain in the cache when they are needed again.
Serial performance optimization techniques
Sometimes programmers inadvertently write inefficient code. For those who are not familiar with performance issues, it is not unusual to write code that works against good performance. There are simple techniques that enhance the performance of a code segment (whether it is serial or parallel) without altering its intended functionality. The techniques listed below aim to enhance the locality of program data, resulting in better cache performance, or to reduce stalls. They are mainly machine-independent; for instance, enhancing locality always helps. If the dominant code segments in the target program are inherently serial, the following techniques may be good candidates for improving the performance without parallelization.
Loop interchange Loop interchange is a simple technique that interchanges the loops of a loop nest. The array access patterns determined by the loop order can have a drastic effect on the resulting performance. Of the two code segments shown in Figure ���, the first has poor locality because it accesses the array with a stride of N. The second loop, on the other hand, performs better because of its stride-1 access.
(a)
DO I = 1, N
   DO J = 1, M
      A(I, J) = B(I, J)
   ENDDO
ENDDO

(b)
DO J = 1, M
   DO I = 1, N
      A(I, J) = B(I, J)
   ENDDO
ENDDO

Fig. ���. Loop interchange: (a) a loop with poor locality and (b) the same loop with better locality after interchanging the loop nest.
Loop interchange is a simple technique that may result in a large performance gain. Programmers should be aware, however, that loop interchange is not always legal. In the presence of backward data dependences in a loop (e.g., A(i, j) = A(i-1, j+1) + B(i, j)), interchange violates the dependence of the original code.
Loop fusion This is the opposite of loop distribution, described below. If multiple loops have the same iteration range, they can be merged, provided that doing so does not violate any dependences between them. Fusion generally increases locality, because it allows processors to reuse the data that are already in their caches. However, fusion may cause the data size to exceed the cache size, which degrades performance. Also, as a side effect, if fusion is applied to parallel loops, it decreases the number of synchronization barriers and reduces both parallelization and spreading overhead. Programmers should be aware that loop fusion is not always legal, even when the iteration spaces match.
Software pipelining and/or loop unrolling In some compute-intensive loops, data dependences across nearby iterations may cause pipeline stalls. This is more frequent with floating-point operations, which take a number of CPU cycles. One way to alleviate this problem is software pipelining or loop unrolling. Loop unrolling does not have a direct effect on reducing dependence stalls, but it allows the back-end compiler to interleave dependent instructions.
However, unlike software pipelining, which may create a loop-carried dependence, an unrolled loop can still be executed in parallel if the original loop is parallel. As a side effect, unrolled loops have fewer synchronization barriers when executed in parallel. These techniques allow more cycles between dependent instructions, so stalls are reduced. Hardware counters often have facilities for measuring dependence stalls. Figure ��� shows a simple loop before and after applying software pipelining and unrolling.
Other performance-enhancing techniques
Loop distribution Loop distribution refers to splitting a loop into multiple loops with smaller tasks. This technique may reduce the grain size of parallelism; however, it can enable other transformations. Figure ��� shows an actual code section found in the program SWIM from the SPEC ���� benchmark suite ����.
    (a) original loop:

          DO I = 1, N
            ...
            C = A(I) * B(I)
            D(I) = C
          ENDDO

    (b) software pipelined:

          C = A(1) * B(1)
          DO I = 2, N
            ...
            D(I-1) = C
            C = A(I) * B(I)
          ENDDO
          D(N) = C

    (c) unrolled by two:

          DO I = 1, N, 2
            ...
            C = A(I) * B(I)
            D(I) = C
            ...
            C = A(I+1) * B(I+1)
            D(I+1) = C
          ENDDO

Fig. Software pipelining and loop unrolling: (a) the original loop, (b) the same loop software pipelined (instructions are interleaved across iterations, and a preamble and postamble have been added), and (c) the same loop unrolled by two.
The outer loop is parallel. Adding appropriate directives, we get the parallelized version shown below.
As mentioned above in the locality enhancement section, the nested loops in this code segment would be a good candidate for loop interchange due to the column-major attribute of Fortran. However, the one line right after the nested loop prevents applying the technique. By splitting the outer loop into two and interchanging the nested loops, we get the code shown in the last figure below, which performs significantly better than the previous two versions.
Subroutine inlining Inlining replaces a call to a subroutine with the code contained within the subroutine itself. This procedure, also called "inline expansion," can have several beneficial effects. The most obvious of these is the removal of the calling overhead. This is particularly true when a call is embedded within a small loop, so that the overhead would otherwise be incurred in each loop iteration. More importantly, however, in the context of parallelizing compilers, additional optimizations and transformations may be facilitated by this transformation.
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
 3500 CONTINUE
      ENDDO

Fig. Original loop SHALOW_do3500 in program SWIM.
With procedure calls inlined, the procedure's code may be optimized within the context of the call site. With site-specific information now available, other transformations may become possible, which in turn may facilitate yet other optimizations. This may allow some instances of a procedure to be executed in parallel, even if it is not parallelizable at every call site.
The downside of inline expansion is the increase in code size, which can be significant if full inlining is performed. This may cause many instruction cache misses. Also, with the increase in code size comes an increase in compilation time, since each instance of the inlined code is now optimized separately. Often full inlining is not practical, and so heuristics are developed for its application.
Deadcode elimination Deadcode elimination is an optimization technique that removes unnecessary code from a program. The direct effect of deadcode elimination is decreased execution time: code that has no effect on the output of a program is removed, and thus the time spent executing this portion of the application is eliminated. Again, there is the additional benefit that deadcode
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
 3500 CONTINUE
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig. Parallel version of SHALOW_do3500 in program SWIM.
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(JCHECK,ICHECK)
!$OMP DO
!$OMP+REDUCTION(+:vcheck,ucheck,pcheck)
      DO icheck = 1, mnmin, 1
        DO jcheck = 1, mnmin, 1
          pcheck = pcheck + ABS(pnew(icheck, jcheck))
          ucheck = ucheck + ABS(unew(icheck, jcheck))
          vcheck = vcheck + ABS(vnew(icheck, jcheck))
 4500   CONTINUE
        ENDDO
 3500 CONTINUE
      ENDDO
!$OMP END DO
!$OMP DO
      DO icheck = 1, MIN0(m, n), 1
        unew(icheck, icheck) = unew(icheck, icheck)
     *    * (MOD(icheck, 100) / 100.)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Fig. Optimized version of SHALOW_do3500 in program SWIM.
elimination may enable other optimizations (e.g., an imperfect loop nest can become a perfect loop nest after deadcode elimination).
Getting optimized execution time
As described earlier, using "single-user time" is important to reduce external perturbation factors. In parallel programs, these factors may cause significant inaccuracies and variations in execution time because of the unpredictable nature of other user processes.
Finding and resolving performance problems
Finding dominant regions Programmers should focus on dominant code segments based on the measured data. Instrumented program runs usually generate profiles containing the measured data. From these files, programmers should find the major code blocks that consume most of the execution time. With tool support, this task can be simplified.
Dominant program sections may change as a result of the program tuning process. After each iteration of this process, programmers should reevaluate the most time-consuming (or the most problematic, depending on the metrics) code sections. Other program sections may have become the point of biggest return on further time investment.
Identifying problems and finding remedies When dominant code sections are found, programmers should figure out possible improvements to those segments. First, the status of the segments should be understood: "Is the code section parallel?" and "Is the speedup acceptable?" are the questions that should be answered before looking for the right remedies. Computing the overheads discussed earlier can be of significant help to this end. Performance analysis is a difficult part of performance tuning. In the next chapter, we present our effort to facilitate performance analysis through tool support.
• Code not parallel: Even advanced parallelizing compilers such as the Polaris compiler cannot detect all possible parallelism. There are mainly two reasons for this. First, the target code uses algorithmic techniques that a parallelizing compiler cannot analyze. Second, the data dependences within the code cannot be determined without examining the input data, so the parallelizing compiler makes the conservative decision not to parallelize it.
For the first case, programmers may be able to find parallelism. For example, if a reduction variable is not recognized by a parallelizing compiler, programmers can parallelize the code section with the proper reduction directives. Programmers may need to study the underlying algorithm for this task. Parallelization techniques were presented earlier in this chapter.
For the second case, programmers may be able to make up for the lack of information about the input data. For instance, if the reason for not parallelizing a code section is that the compiler cannot determine that certain array accesses do not overlap, programmers can simply parallelize the code manually. If a conditional exit within a loop only occurs under a fatal error condition, ignoring it and parallelizing the loop will not affect correct execution.
If the programmer cannot find any way to parallelize a given code, replacing the algorithm with a parallel counterpart may be possible. There are parallel algorithms for some inherently serial algorithms, such as random number generation and linear recurrences.
Finally, even if none of these techniques is possible, programmers should try enhancing the locality of the code. Some locality-enhancing techniques can make a drastic difference in performance; several such techniques were listed earlier in this chapter.
• Speedup not acceptable: For parallel code segments, there are several reasons for poor speedup, including poor locality and parallelization and/or spreading overhead. Spreading overhead may be incurred by poor locality. Programmers should try to enhance locality and reduce overhead. Problems with data locality may be detected if a hardware counter is available on the target machine. A large number of stalls or a high data cache miss ratio is a good indication of poor locality. Remedial techniques were described earlier in this chapter.
Conclusions
The ultimate objective of our research is to answer "what" and "how" in a parallel optimization process. The proposed methodology is designed to tell programmers "what must be done." We have divided the program optimization process into several steps with feedback loops. Each step defines specific tasks for programmers to accomplish. We have also listed common analyses and techniques that are needed. There is a clear goal in each stage, and the condition for its achievement is clearly defined. In this way, our methodology provides significant guidance to programmers in optimizing parallel applications.
The methodology described above has been devised empirically. All of the analyses and techniques have helped us improve the performance of scientific and engineering applications. However, figuring out exactly which technique will improve performance is still a difficult subject and requires further study. Performance prediction and modeling have not been successful in general cases. In the next chapter, we introduce our experience-based approach to resolving this issue. We support our methodology with a set of tools, which is our approach to answering the question "how." These supporting tools are the topic of the next chapter.
TOOL SUPPORT FOR PROGRAM OPTIMIZATION METHODOLOGY
As previously mentioned, the main advantages of a methodical approach to parallel programming are that it is (1) efficient and (2) easy to apply without advanced experience. The proposed methodology outlines this systematic endeavor towards good performance. However, the individual steps listed in the methodology can be time-consuming and tedious.
Parallel programmers without access to parallel programming tools have relied on text editors, shells, and compilers. Programmers write a program using text editors and generate an executable with the resident compilers. All other tasks, such as managing files, examining performance figures, searching for problems, and incorporating solutions, can be achieved using these traditional tools. However, considerable effort and good intuition are needed for file organization and performance diagnostics. Even with parallelizing compilers, these tasks still remain for the users to deal with. In fact, most users end up writing small helper scripts for these tasks.
Tools designed specifically for the development and tuning of parallel programs step in where traditional tools reach their limits. In general, these tools provide interactivity and an adequate user interface for incorporating user knowledge to further improve program performance. The previous efforts discussed earlier mainly focus on two aspects of functionality: automation and visualization. Automatic utilities simplify analyzing very complex program structures. Visualization utilities allow users to view and interpret a large amount of static analysis information and performance data in an efficient manner. Still, we feel that certain functionalities, which could be of great help to programmers, have been largely ignored by tool developers.
Based on user feedback and the specifics of our methodology, we have set our design goals, which are listed in the next section. Then we discuss in detail the tools that we have developed and/or included in our programming environment.
We also present our effort to reach a general audience with our tools through the World Wide Web. Finally, we describe how these tools fit into our methodology and help programmers in the tuning process.
Design Objectives
Consistent support for the methodology This is the main goal of our research. We examine the steps in the methodology and find time-consuming programming chores that call for additional aid. Some tasks are tedious and may be automated. Others require complex analysis and cumbersome reasoning, so assisting utilities are needed. If these are properly addressed with tool support, programmers can achieve greater performance with ease. The integration of the methodology and the tool support significantly increases efficiency and productivity.
Support for deductive reasoning Current performance visualization systems offer a variety of utilities for viewing a large amount of data from many different perspectives. Understanding data patterns and locating problems, however, are still left to users. In addition to providing raw information, advanced tools must help filter and abstract a potentially very large amount of data. Instead of providing a fixed number of options for data presentation, offering the ability to freely manipulate data, and even to compute a new set of meaningful results, can serve as the basis for users' deductive reasoning.
Active guidance system Tuning programs requires dealing with numerous different instances of code segments. Categorizing these variants and finding the right remedies demand sufficient experience on the programmers' part. The transfer of such knowledge from experienced to novice programmers has always been a problem in the parallel programming community. It usually takes novice programmers a significant amount of time and effort to gain adequate expertise in parallel programming. We believe that it is possible to address this issue systematically using today's technology.
Program characteristics visualization and performance evaluation The task of improving program performance starts with examining the performance and analysis data and finding room for improvement. The ability to scroll through these data and visualize what they imply is critical in this task. Tables, graphs, and charts are a common way of expressing a large data set for easy comprehension. However, one of the pitfalls that researchers easily fall into is presenting too much information in a myriad of windows without proper annotations. A good tool should be able to draw the user's attention to what is important.
Integration of static analysis with performance evaluation Most tools published so far focus on only one of these two types of data. However, as mentioned earlier, good performance only comes from considering both aspects. It is important to identify the relationship between the data from both sides and have them available for easy analysis. Without considering performance data, static program optimization can even degrade performance. Likewise, without static analysis data, optimization based only on performance data may be of marginal benefit.
Interactive and modular compilation The usual black-box-oriented use of compiler tools has limits in efficiently incorporating users' knowledge of program algorithms and dynamic behavior. For example, although the compiler detects a value-specific data dependence, the user may know that for every reasonable program input the values are such that the dependence does not occur. In other cases, users may know that the array sections accessed in different loop iterations do not overlap. Furthermore, certain program transformations may make a substantial performance difference but are applicable to very few programs, and hence are not built into a compiler's repertoire. If a user can find the reason why a loop was not parallelized automatically, a small modification may be applied that ensures parallel execution. For these reasons, manual code modifications in addition to automatic parallelization are often necessary to achieve good performance, and tools should provide a convenient mechanism for incorporating manual tuning. Another drawback of conventional compilers is their limited support for incremental tuning. The localized effect of parallel directives in the shared memory programming model allows users to focus on small portions of code for possible improvement. Hence, compiler support for incremental tuning is also an important goal in our tool design.
Data Management This is the basic need in successfully optimizing various applications. Data management refers to the task of organizing data files, maintaining the storage for the gathered data, and making the data easy to retrieve for quick comparison and manipulation. A unified space for experimental data with clean interfaces helps not only the developers themselves but also the combined effort among research groups, by allowing simple access to related databases.
Accessibility Although the importance of advanced tools for all software development is evident, many available tools remain unused. A major reason is that the process of searching for tools with the needed capabilities, then downloading and installing them on locally available platforms and resources, is very time-consuming. In order to evaluate and find an appropriate tool, this process may need to be repeated many times. Using today's network computing technology, tool accessibility can be greatly enhanced.
Portability For disseminating a new tool to the user community, it is important that it be easy to install on new platforms. In addition, a tool has to be flexible in the data formats it can read, such that it can adapt to the tools (compilers and performance analyzers) available on the local platform.
Configurability Satisfying the general users of a tool can only be achieved by allowing them to configure the tool to their liking. By having configurability as one of our design goals, many users' preferences can be incorporated into the tool usage without individually addressing each of them.
Flexibility Flexibility is an important characteristic of general tools. We have seen many cases in which new types of performance data needed to be incorporated into the picture for a better understanding of a program's behavior. Furthermore, we would like to keep the applicability of the tool open for tasks beyond performance tuning.
In the next few sections, we introduce the tools in our methodology-support toolbox. We present overviews of the tools, as well as their detailed structure and functionality where needed. We also include the look and feel of these tools from the end users' point of view.
Ursa Minor Performance Evaluation Tool
Often the programmer's intervention into automatic optimization is necessary to achieve near-optimal parallel program performance. To aid programmers in this process, we have developed a performance evaluation tool, Ursa Minor (User Responsive System for the Analysis, Manipulation, and Instrumentation of New Optimization Research). The main goal of Ursa Minor is performance optimization through interactive integration of performance evaluation with static program analysis information. With this tool, performance anomalies such as poor speedup and high cache miss ratios are easily identified on a loop-by-loop basis via a graphical user interface. Overhead components are computed instantly. This information is combined with static program information, such as array access patterns or loop nest structure, to give a better understanding of the problems at hand.
Ursa Minor complements the Polaris compiler in its support for OpenMP parallel programming in that it understands the compiler's output. It collects and combines information from various sources, and its graphical interface provides selective views and combinations of the gathered data. Ursa Minor consists of a database utility, a visualization system for both performance data and program structure, a source searching and viewing tool, and a file management module. Ursa Minor also provides users with powerful utilities for manipulating and restructuring input data to serve as the basis for the users' deductive reasoning. In addition, it takes performance evaluation one step further by means of an active performance guidance system called Merlin. Ursa Minor can present to the user and reason about many different types of data (e.g., compilation results, timing profiles, hardware counter information), making it widely applicable to different kinds of program optimization scenarios.
Functionality
Here, we describe the functionality of Ursa Minor and what it can do for programmers. A typical performance evaluation process consists of visualizing performance, identifying problems or anomalies, finding the causes, and devising the corresponding remedies. Programmers need to visualize and compare the performance data from different trials, ruminate over them, compute derived values, examine the runtime environment for the causes of possible problems, and search for solutions. We have designed practical utilities to assist programmers in this process and integrated them into Ursa Minor.
Performance data and program structure visualization
The Ursa Minor tool presents information to the user through two main display windows: the Table View and the Structure View. The Table View shows the data as text entries that relate to "program units," which can be subroutines, functions, loops, blocks, or any entities that a user defines. The Structure View is designed to visualize the program structure under consideration. A user interacts with the tool by choosing menu items or mouse-clicking.
The Table View displays data such as the average execution time, the number of invocations of code sections, cache misses, and text indicating whether loops are serial or parallel. Generally, the entries can be of type integer, floating-point number, or string. Users can manipulate the presented data through the various features this view provides. This is the main view that provides the means for modifying and augmenting the underlying database. Access to the other modules of Ursa Minor takes place through this view. The Table View is a tabbed folder that contains one or more labeled tabs. Each tab corresponds to a "program unit group," which means a group of data of a similar type. For instance, the folder labeled "LOOPS" contains all the data regarding loops in a given program. When reading predefined data inputs such as timing files and Polaris listing files, Ursa Minor generates predefined program unit groups (e.g., LOOPS, PROGRAM, CALLSTRUCTURE, etc.). Users can create their own groups with their own input files using the proper format.
A user can rearrange columns, delete columns, and sort the entries alphabetically or based on execution time. The bar graph on the right side shows an instant normalized graph of a numeric column. After each program run, the newly collected information is included as additional columns in the Table View. Users can examine these numbers side by side as they see fit. In this way, performance differences can be inspected immediately for each individual loop as well as for the overall program. The effects of program modifications on other program sections become obvious as well. A modification may change the relative importance of loops, so that sorting them by their newest execution time yields a new most time-consuming loop on which the programmer has to focus next. The figure below shows the Table View of Ursa Minor in use.
Various features make the Table View easier to use and more accessible. Users can set a display threshold for each column so that an item that is less than a certain quantity is displayed in a different color. This feature allows users to effortlessly identify code sections with poor speedup, for instance. One or more rows and columns can be selected so that they can be manipulated as a whole. Data that would not fit into a table cell, such as the compiler's explanation for why a loop is not parallel, can be displayed in a separate window with one mouse click. Finally, Ursa Minor is capable of generating pie charts and bar graphs for a selected column or row for instant visualization of numeric data.
Fig. Main view of the Ursa Minor tool. The user has gathered information on program BDNA. After sorting the loops based on execution time, the user inspects the percentage of the three major loops (in subroutines ACTFOR and RESTAR) using the pie chart generator (bottom left). Computing the speedup with the Expression Evaluator reveals that the speedup for the RESTAR loop is poor, so the user is examining more detailed information on the loop.
Another view of Ursa Minor provides the calling structure of a given program, which includes subroutine, function, and loop nest information, as shown in the figure below. Each rectangle represents a subroutine, function, or loop. The rectangles are color-coded so that more information is conveyed to the user visually. For example, parallel loops are represented by green rectangles, and serial loops by red rectangles. Clicking on one of these rectangles will display the corresponding source code. In the figure, the user is inspecting a loop in subroutine ACTFOR in this way. Rectangles positioned to the right are nested program units; thus, if unit A has unit B inside, the rectangle representing B will be placed to the right of the rectangle for A. If one wants a wider view of the program structure, the user can zoom in and out. This display helps in understanding a program's structure for tasks such as interchanging loops or finding outer or inner candidate parallel loops.
Expression Evaluator
The ability to compute values derived from the raw performance data is critical in analyzing the gathered information. For instance, the average timing value over different runs, speedup, parallel efficiency, and the percentage of the execution time of code sections with respect to the overall execution time of the program are common metrics used by many programmers. Instead of adding individual utilities to compute these values, we have added the Expression Evaluator for user-entered expressions. We have provided a set of built-in mathematical functions for numeric, relational, and logical operations. Nested operators are allowed, and any reasonable combination of these functions is supported. The Expression Evaluator has a pattern matching capability as well, so the selection of a data set for evaluation becomes simple. The Expression Evaluator also provides users with query functions that apprehend static analysis data from a parallelizing compiler. These functions can be combined with the mathematical functions, allowing queries such as "loops that are parallel and whose speedups fall below a given value" or "loops that contain I/O and whose execution time exceeds a given fraction of the overall execution time." For example, after users have identified parallel loops with poor speedup, they may want to compute the cache miss ratios of those
Fig. Structure view of the Ursa Minor tool. The user is looking at the Structure View generated for program BDNA. Using the "Find" utility, the user has set the view to subroutine ACTFOR and opened the source view for a parallelized loop.
loops, or their parallelization overheads. Instead of leaving the reasoning process to users, Ursa Minor guides users through the deductive steps. The Expression Evaluator is a powerful utility that allows manipulating and restructuring the input data to serve as the basis for users' deductive reasoning through a common spreadsheet-like interface.
The Merlin performance advisor
As previously mentioned, identifying performance bottlenecks and finding the right remedies take experience and intuition, which novice programmers usually lack. Acquiring this expertise requires many trials and studies. Even for those programmers who have experienced peers, the transfer of knowledge from advanced programmers to novice programmers takes time and effort.
We believe that tools can be of considerable use in addressing this problem. We have used a combination of the aforementioned Expression Evaluator and a knowledge database to create a framework for the easy transfer of experience. Merlin is an automatic performance data analyzer that allows experienced programmers to tell novice programmers how to diagnose and improve many types of performance problems. Its objective is to provide guidelines and suggestions to inexperienced programmers based on the accumulated knowledge of advanced programmers.
The figure below shows an instance of the Merlin user interface. Merlin is activated when a user clicks "Run Performance Advisor for This Row" in the row popup menu. It consists of an analysis text area, an advice text area, and buttons. The analysis text area displays the diagnosis that Merlin has performed on the selected program unit. The advice text area provides Merlin's solutions to the detected problems, with examples if any. Each diagnosis and the corresponding advice are paired with an identification number (e.g., Analysis 1 and Solution 1). Users can also load a different map at any time.
Merlin differs from conventional spreadsheet macros in that it is capable of comprehending the static analysis data generated by a parallelizing compiler. Merlin can take into account numeric performance data as well as program information such
Fig. The user interface of Merlin in use. Merlin provides solutions to the detected problems. This example shows the problems addressed in a loop of subroutine ACTFOR in program BDNA. The button labeled "Ask Merlin" activates the analysis. The "View Source" button opens the source viewer for the selected code section. The "ReadMe for Map" button pulls up the ReadMe text provided by the performance map writer.
as parallel loops, or the existence of I/O statements or function calls within a code block, and so on. This allows a comprehensive analysis based on both the performance and static data available for the code section under consideration.
Merlin navigates through a knowledge-based database ("map") that contains information on diagnoses and solutions for various performance symptoms. Experienced programmers write maps based on their knowledge, and novice programmers can view the suggestions made by the experienced programmers by activating Merlin. As shown in the figure below, a map consists of three "domains." The elements in the Problem Domain correspond to general performance problems from the viewpoint of programmers. They represent situations such as poor speedup, a large number of stalls, and non-parallel loops, depending on the performance data types targeted by Merlin. The Diagnostics Domain depicts possible causes of the problems, such as floating-point dependences and data cache overflow. Finally, the Solution Domain contains remedial techniques; typical examples are serialization, loop interchange, tiling, and loop unrolling. These elements are linked by "conditions." Conditions are logical expressions representing an analysis of the data. If a condition evaluates to true, the corresponding link is taken, and the element in the next domain pointed to by the link is explored. Merlin invokes the Expression Evaluator for the evaluation of these expressions. A Merlin map is written in the Generic Data Format described later in this chapter, and it is loaded into Ursa Minor as an instance of an Ursa Minor database. A more detailed description of Merlin is available in the literature.
Merlin enables multiple cause-effect analyses of performance and static data. It fetches the data specified by the map from the Ursa Minor tool, performs the listed operations, and follows the links whose conditions are true. There are no restrictions on the number of elements and conditions within each domain, and each link is followed independently. Hence, multiple perspectives can easily be incorporated into one map. For instance, memory stalls may be caused by poor locality, but they could also indicate a floating-point dependence. In this way, Merlin considers all possibilities separately and presents an inclusive set of solutions to users. At the same time, the remedies
Fig. ���. The internal structure of a Merlin "map". The Problem Domain corresponds to general performance problems, the Diagnostics Domain depicts possible causes of the problems, and the Solution Domain contains suggested remedies. Conditions are logical expressions representing an analysis of the data.
suggested by Merlin assist users in "learning by examples". Merlin enables users to gain expertise in an efficient manner by listing performance data analysis steps and many example solutions given by experienced programmers.
Merlin is able to work with any map as long as the map is in the correct format. Therefore, the intended focus of performance evaluation may shift depending on the interest of the user group. For instance, the default map that comes with Merlin focuses on parallel optimization of programs. Should a map that focuses on architecture be developed and used instead, the response of Merlin will reflect that intention. The Ursa Minor environment does not limit its usage to parallel programming.
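The map traversal described above, evaluating each condition independently and following every link whose condition holds, can be sketched in a few lines. This is a minimal illustration under assumed map contents, not the actual Merlin implementation; the element names, condition expressions, and data fields below are invented for the example.

```python
# Hypothetical sketch of a Merlin-style map: elements in the Problem,
# Diagnostics, and Solution domains are linked by condition expressions
# that are evaluated against the collected performance data. Every link
# whose condition holds is followed independently.

def traverse(links, data, start):
    """Follow every link out of `start` whose condition holds on `data`."""
    results = []
    for src, cond, dst in links:
        if src == start and cond(data):
            results.append(dst)
            results.extend(traverse(links, data, dst))
    return results

# Each link: (source element, condition on the data, target element).
links = [
    ("poor speedup", lambda d: d["memory_stalls"] / d["cycles"] > 0.2,
     "poor locality"),
    ("poor speedup", lambda d: d["fp_dependence"],
     "floating point dependence"),
    ("poor locality", lambda d: True, "loop interchange / tiling"),
    ("floating point dependence", lambda d: True, "serialization"),
]

data = {"memory_stalls": 400, "cycles": 1000, "fp_dependence": False}
print(traverse(links, data, "poor speedup"))
# → ['poor locality', 'loop interchange / tiling']
```

Because each link is tested independently, a map with several plausible diagnoses for one problem simply yields several suggested remedies, mirroring the "inclusive set of solutions" behavior described above.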
Other functionality
During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. For example, timing information becomes available from various program runs, structural information of the program is gathered from the code documentation, and compilers offer a large amount of program analysis information. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. The bookkeeping effort accompanying this procedure is often overwhelming. Ursa Minor provides an organized solution to this problem. All the data regarding the tuning of a specific program are integrated into one compact database. Easy access to the database supported by the tool allows users convenient views and manipulation of the data without having to deal with numerous files.
Ursa Minor also supports inter-group logs. Sharing the performance data and optimization results among team members is important. Group members can share the databases generated by others by specifying one location for a data repository. When a member decides to share a database with other members, Ursa Minor adds a log entry with the information regarding that particular database in the repository. In this way, group members do not have to ask others to send the database in order to examine the data. The repository has all the information about the database that the member wants to share.
Configurability is one way to ensure that the tool adapts well to many users' environments and preferences. The Ursa Minor user interface is configurable. Users can change the look of the display views and many other functionalities. Most functions can be mapped to keyboard shortcuts, allowing advanced users to speed up their tasks.
Learning how to use a new tool has always been a nuisance to many programmers. As tools become complex and versatile, reading a manual is cumbersome by itself. Some successful commercial applications in word processing or games have employed an "on-line tutorial" approach: an embedded module steps through some of the basic functions of the program and tells users how to use them. We have incorporated such a module into Ursa Minor. Our interactive demo session allows users to explore important features of the tool with input data prepared by the developers. In addition, this demo session automates some of the steps so that users can quickly look through them.
����� Internal Organization of the Ursa Minor tool
Fig. ���. Building blocks of the Ursa Minor tool and their interactions. [Figure: the Database Manager mediates between the database (static data such as data dependence and structure analysis results; dynamic data such as performance numbers, runtime environment, and hardware counter values) and the GUI Manager with its Table View and Structure View, the Expression Evaluator, the Merlin Performance Advisor, the user, and other tools and spreadsheets.]
Figure ��� illustrates the interaction between Ursa Minor modules and various data files. The Database Manager handles interaction between the database and the other modules. Depending upon users' requests, it fetches the required data items or creates or modifies database entities. The GUI Manager coordinates the various windows and views and controls the process of handling user actions. It also takes care of data consistency between the database and the display windows. The Expression Evaluator is a facility that allows users to perform spreadsheet-like, user-typed commands on the current database. This module parses the command, applies the operations, and updates the views accordingly. Finally, Merlin is a guidance system capable of automatically conducting performance analysis and finding solutions.
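To make the Expression Evaluator's role concrete, the sketch below applies a spreadsheet-style formula element-wise to columns of a small table. The column names and command form are invented for illustration; the tool's actual command syntax is described elsewhere in this chapter.

```python
# Toy "expression evaluator": compute a new column of a table from an
# expression over existing columns, the way a spreadsheet formula would.
# Column names and the derived metric are illustrative only.

table = {
    "serial_time":   [10.0, 4.0, 2.0],   # one entry per code section
    "parallel_time": [2.5, 2.0, 2.0],
}

def evaluate(table, target, fn):
    """Add column `target`, computed row by row from the existing columns."""
    n = len(next(iter(table.values())))
    table[target] = [fn({c: col[i] for c, col in table.items()})
                     for i in range(n)]

# Equivalent of a user-typed command like "speedup = serial_time / parallel_time".
evaluate(table, "speedup",
         lambda row: row["serial_time"] / row["parallel_time"])
print(table["speedup"])
# → [4.0, 2.0, 1.0]
```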
Internally, Ursa Minor stores information in an Ursa Minor/Major Database (UMD). The UMD is a storage unit that holds the collective information about a program, its execution results in a certain system environment, or any other pertinent data that users include. This database can be stored in different formats, including a plain text file, which can optionally be inspected with an editor and printed. Furthermore, a database can be saved in a format that can be read by commercial spreadsheets, providing a richer set of data manipulation functions and graphical representations.
The Ursa Minor tool is written in ������ lines of Java. Thus, any platform on which the Java runtime environment is available can be used to run the tool. It uses the basic Java language with standard APIs, which enhances the portability of the tool. Object orientation in Java allows a relatively easy addition of new types of data to the database. The windowing toolkits and utilities provide a good environment for prototyping user interfaces, which enabled us to focus on the design of the tool functionality. Furthermore, Java, with its network support, is a useful language for realizing another goal of this project: making the gathered program, compilation, and performance results available to users worldwide. This goal has been realized in the Ursa Major tool, which is discussed in Section �� ��.
����� Database structure and data format
Ursa Minor maintains an organized database structure to store data. Inside the Ursa Minor database, data items are stored in one of four types: integer, floating point number, string, and long string. For the most part, the database module does not care what kind of information it holds. This is, of course, good programming practice, but more importantly, it helps ensure the flexibility and configurability of the entire tool. Certain modules do understand data semantics, such as the Structure View and the query functions in the Expression Evaluator, but the lack of the required data does not prevent use of the tool.
At the bottom of the structure is the "Program Unit". This is the basic storage unit that maps to an entity such as a loop, a subroutine, a code block, and so on. These units belong to a larger entry called a "Program Unit Group". Usually, Program Unit Groups are labeled loops, subroutines, etc., depending on the Program Units that they keep. These groups are combined into a "Session", which logically maps to a database for one optimization study. Sessions are managed by the Ursa Minor database manager, the module that handles database accesses. Figure �� shows a design schematic for the database.
Fig. �� . The database structure of Ursa Minor. [Figure: a Session contains Program Unit Groups such as Loops, Subroutines, and Functions; each group contains Program Units such as Loop 1, Loop 2, Loop 3; each unit stores typed fields, e.g., Integer: Number of invocations, Float: Average Execution Time, Float: Overall Execution Time, Float: Number of Cycles, Float: Memory Stalls, String: Serial or Parallel, Long String: Nested Units.]
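The Session / Program Unit Group / Program Unit hierarchy just described can be mirrored in a short structural sketch. This is not the tool's actual Java implementation; the unit name and field values below are examples only.

```python
# Sketch of the UMD hierarchy: a Session holds Program Unit Groups
# (e.g. "Loops"), which hold Program Units (e.g. individual loops)
# storing typed fields. The field names follow the examples in the
# figure; the loop name and values are invented.

class ProgramUnit:
    def __init__(self, name):
        self.name = name
        self.fields = {}          # field name -> (type tag, value)

    def put(self, key, type_tag, value):
        self.fields[key] = (type_tag, value)

class Session:
    def __init__(self):
        self.groups = {}          # group label -> {unit name -> ProgramUnit}

    def unit(self, group, name):
        """Fetch a unit, creating its group and the unit on first access."""
        units = self.groups.setdefault(group, {})
        return units.setdefault(name, ProgramUnit(name))

session = Session()
loop = session.unit("Loops", "MAIN_do10")   # hypothetical loop name
loop.put("Number of invocations", "Integer", 12)
loop.put("Average Execution Time", "Float", 0.034)
loop.put("Serial or Parallel", "String", "Parallel")
print(session.groups["Loops"]["MAIN_do10"].fields["Serial or Parallel"])
# → ('String', 'Parallel')
```

Note how the database layer stores only (type tag, value) pairs; interpreting what a field means is left to modules such as the Structure View, matching the type-agnostic design described above.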
Ursa Minor is capable of reading several different types of data files that are generated by the other tools listed in this chapter. Performance data ("sum" files) are generated when Polaris-instrumented executables run. Polaris listing files are generated when Polaris attempts parallelization of a program and contain static analysis information. When Ursa Minor reads these files, it parses them in a predefined way and creates the appropriate program unit groups. Users of the tool do not need to concern themselves with data types or formats when loading these files. Also, Ursa Minor can read and write using the Java serialization utility, which stores the database in a compact data file. Adding or removing data from the loaded database is as simple as clicking a menu.
In order to provide more flexibility, we have defined the "Generic Data Format", which can handle a wide variety of data. Using this text-based format, users can input almost any type of data with any data structure. This format allows users to create program unit groups of their own and arrange data as they see fit. This feature greatly enhances the applicability of Ursa Minor and fulfills one of the design goals: flexibility.
����� Summary
Ursa Minor supports the methodology presented in the previous chapter by providing utilities that mitigate many tasks in the performance evaluation stage. It integrates static analysis and performance data by means of a database with structure-based entities that hold many different types of data. With its support for deductive reasoning, active guidance, and data management through configurable and flexible utilities, Ursa Minor offers significant aid to parallel programmers in need of a performance evaluation tool.
Ursa Minor has been installed on the Parallel Programming Hub ����, allowing access by remote users all over the world. Users can quickly evaluate the tool with ease or utilize it extensively for production use. By combining Ursa Minor with other utilities on the Hub in support of the methodology, we are drawing closer to our goal of a comprehensive programming environment. The Parallel Programming Hub is discussed in detail in Section �� ��.
��� InterPol Interactive Tuning Tool
Good performance from a program is usually achieved by an incremental tuning and evaluation process. The term "incremental" applies to both the applied techniques and the modified code segments. Conventional batch-oriented compilers are of limited help to programmers in this task. Often, selecting target regions and choosing optimization techniques are done by slicing a program and manipulating compiler options manually. The accompanying tasks of file management and learning about compiler options are often overwhelming to programmers.
Advanced parallelizing compilers provide a large list of available techniques for program parallelization and optimization. These techniques are usually controlled by switches or command line options that may not be intuitive or user-friendly. The ability to select optimization techniques and even re-order their application would provide flexibility in exploring various combinations of techniques on different sections of code. In addition, this would offer a playground for those interested in studying compiler techniques.
InterPol is an interactive utility that allows users to target program segments and apply optimization techniques selectively ����. It allows users to build their own compiler from the numerous optimization modules available in a parallelizing compiler infrastructure. It is also capable of incorporating manual changes made by users. Meanwhile, InterPol keeps track of the entire program that users want to optimize, relieving programmers of file and version management tasks. In this way, programmers are free to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. During the optimization process, the tool can display static analysis information generated by the underlying compiler, which can help users in further optimizing the program.
����� Overview
Figure ��� illustrates the major components of InterPol. Users select code regions using the Program Builder and arrange optimization techniques through the Compiler Builder. The Compilation Engine takes inputs from these builders, executes the selected compiler modules, and displays the output program. If the user wants to keep the modified code segments, the output goes back into the Program Builder. Instead of running the Compilation Engine, users may choose to make changes to the code manually. All of these actions are controlled by a graphical user interface. Users are able to store the current program variant at any point in the optimization process.
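The flow between the three modules can be sketched abstractly: the Compilation Engine applies the user-arranged list of passes to a selected region, and the result is merged back into the Program Builder's current program variant. The "passes" below are placeholders standing in for Polaris modules, not actual Polaris pass names.

```python
# Abstract sketch of the InterPol flow: the Program Builder holds the
# current program text, the Compiler Builder supplies an ordered list
# of passes, and the Compilation Engine applies the passes to a region
# and merges the result back. Pass names here are placeholders.

class ProgramBuilder:
    def __init__(self, lines):
        self.lines = list(lines)          # current program variant

    def replace(self, start, end, new_lines):
        self.lines[start:end] = new_lines  # merge transformed region back

def compilation_engine(region_lines, passes):
    """Apply the user-arranged passes, in order, to the selected region."""
    for p in passes:
        region_lines = p(region_lines)
    return region_lines

# Placeholder "passes" standing in for Polaris modules.
normalize = lambda ls: [l.strip() for l in ls]
mark_parallel = lambda ls: ["!$OMP PARALLEL DO"] + ls

program = ProgramBuilder(["      DO i = 1, n", "      END DO"])
out = compilation_engine(program.lines[0:2], [normalize, mark_parallel])
program.replace(0, 2, out)
print(program.lines[0])
# → !$OMP PARALLEL DO
```

Because the pass list is just data, a different ordering or selection of passes (a different "custom-built compiler") can be applied to each region, which is the flexibility the text describes.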
����� Functionality
Figure ����a shows the graphical user interface offered by InterPol. Target code segments and the corresponding transformed versions are visible in separate areas.
Fig. ���. An overview of InterPol. Three main modules interact with users through a Graphical User Interface. The Program Builder handles file I/O and keeps track of the current program variant. The Compiler Builder allows users to arrange optimization modules in Polaris. The Compilation Engine combines the user selections from the other two modules and calls Polaris modules.
Static analysis information is given in another area whenever a user activates the compiler. Finally, the Program Builder interface provides an instant view of the current version of the target program. InterPol is written in Java.
The underlying parallelization and optimization tool is the Polaris compiler infrastructure ����. Various Polaris modules form the building blocks for a custom-designed parallelizing compiler. InterPol is capable of stacking up these modules in any order. Polaris also comes with several different data dependence test modules, which can likewise be arranged by InterPol. Overall, more than �� modules are available for application. Users have the freedom to choose any blocks in any order. Executing this custom-built compiler is as simple as clicking a menu, and the result is displayed immediately on the graphical user interface. Figure ����b shows the Compiler Builder interface in InterPol. More detailed configuration is also possible through InterPol's Polaris switch interface, which controls the behavior of the individual passes.
Fig. ���. User interface of InterPol: (a) the main window and (b) the Compiler Builder.
The Program Builder keeps and displays the up-to-date version of the whole program. Users select program segments from this module, apply the automatic optimizations set up by the Compiler Builder, and/or add manual changes. The Compiler Builder is accessible at any point, so users can apply entirely different sets of techniques to different regions. The current version of the program is always shown in the Program Builder interface for easy examination. Through this continuous process of tuning optimized program segments, users always stay in the loop, observing and modifying program transformations step by step.
During the optimization process, InterPol can display program analysis results generated by running Polaris modules. These include data dependence test results, induction and reduction variables, etc. This provides a basis for further optimization. Programmers incorporate their knowledge of the underlying algorithm, compensating for the compiler's limited knowledge of the program's dynamic behavior and input data.
����� Summary
InterPol seeks to assist programmers by providing highly flexible utilities for both automatic and manual optimization. For those who are not familiar with the techniques available from parallelizing compilers, the tool provides greater insight into the effects of code transformations. By combining the Ursa Minor performance evaluation tool with InterPol, we hope to create a complete programming environment.
��� Other Tools in Our Toolset
The functionality of Ursa Minor and InterPol, combined with the Polaris instrumentation module, covers all the aspects of the methodology discussed in Chapter �. Later, in Section ���, we describe how these tools provide comprehensive support for the methodology. In this section, we present a set of complementary tools in our toolset, which were developed in related projects. The main goals of these tools do not necessarily match the issues that we would like to address in this research, but they provide additional information and grant control over other aspects of program development. These tools have been either developed or modified at Purdue University.
����� Polaris parallelizing compiler
The Polaris parallelizing compiler ���� is a source-to-source restructurer, developed at the University of Illinois and Purdue University. Polaris automatically finds parallelism and inserts appropriate parallel directives into input programs. Polaris includes advanced capabilities for array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. In addition, the current Polaris tool is able to generate OpenMP parallel directives ��� and apply locality optimization techniques such as loop interchange and tiling.
As demonstrated previously ���� ����, the Polaris compiler has successfully improved the performance of many programs on various target machines. Polaris provides a good starting point for parallelizing and optimizing Fortran programs. For advanced programmers, it can save substantial time that would otherwise be spent tuning loops that can be automatically parallelized. For novice programmers, manually parallelizing those loops would be cumbersome to begin with. In addition, Polaris can provide a listing file with the results of static program analysis, which may give programmers valuable information on various code sections.
InterPol, described above, provides easy, interactive access to the Polaris parallelizing compiler. InterPol is even capable of restructuring the optimization modules within Polaris. If InterPol is not available, Polaris can serve as an alternative, allowing fast parallelization of the programs at hand. Polaris is installed on the Parallel Programming Hub, available to programmers all over the world.
����� InterAct performance monitoring and steering tool
InterAct is a toolset that allows interactive instrumentation and tuning of OpenMP programs ����. This toolset provides a simple interface and API that allow users to quickly identify performance bottlenecks through on-line monitoring of program performance and to explore solutions through experimentation with user-defined tunable variables. The Polaris parallelizing compiler has been modified to annotate sequential Fortran programs with OpenMP shared-memory directives, as well as to insert calls to the instrumentation library. The instrumentation library collects both timings and hardware counter events, transparently managing low-level details such as overflows. To manage the hardware counters, the OpenMP Performance Counter Library (OMPcl) has been developed to accurately collect events within the multithreaded OpenMP environment.
InterAct provides a graphical user interface (GUI) to monitor program behavior, as well as to dynamically change instrumentation, environmental settings, and critical program variables during execution. It supports visualization of collected data, dynamic instrumentation, interactive modification of the number of threads used by the application, interactive selection of the runtime library used for managing parallel threads, and interactive modification of global variables that are registered by the target application. These global variables can be compiler- or user-inserted and are used to control the behavior and/or performance of the application. The toolset provides a socket interface between the application and the GUI that allows monitoring to be done either locally or remotely. Figure ��� shows a screenshot of InterAct in use for the study of the dynamic behavior of the SWIM benchmark.
Fig. ���. Monitoring the example application through the InterAct interface. The main window shows the characterization data of the major loops in the SPEC����� SWIM benchmark.
����� Max/P parallelism analysis tool
A compiler is able to analyze the static behavior of a program. It can find characteristics of a program that are true for all possible input data sets and target machines. In contrast, dynamic evaluation of a program can provide insights into characteristics and behaviors that may go undetected by static analysis methods. Of great interest is understanding the dynamic behavior of parallelism, one of the most dominant factors in performance.
Max/P is a Polaris-based tool, developed at Purdue University ����. It evaluates the inherent parallelism of a program at runtime. The inherent parallelism is defined as the ratio of the total number of operations in a program, or program section, to the number of operations along the critical path. The critical path is the longest path in the program's dataflow graph, which is computed by Max/P during program execution. The tool can find the minimum execution time of a program assuming the availability of an unlimited number of parallel processors. It shows the maximum parallelism as an upper estimate of the potential performance gain that a user can expect from aggressively optimizing the code.
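The definition above, total operations divided by the operations along the critical path of the dataflow graph, can be computed directly on a small example. The four-node graph here is invented for illustration; the actual tool derives the graph during program execution.

```python
# Inherent parallelism on a toy dataflow graph: the total operation
# count divided by the operation count along the critical (longest)
# dependence chain. Each node is one operation; edges are dependences.

def critical_path(deps, node, memo=None):
    """Operations on the longest dependence chain ending at `node`."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = 1 + max((critical_path(deps, p, memo)
                              for p in deps[node]), default=0)
    return memo[node]

# a and b are independent; c consumes both; d consumes c.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
total_ops = len(deps)
longest = max(critical_path(deps, n) for n in deps)
print(total_ops / longest)
# → 1.3333333333333333  (4 operations, critical path a->c->d of length 3)
```

With unlimited processors, the graph could finish in 3 steps instead of 4, so the inherent parallelism of 4/3 bounds the speedup any optimization of this fragment can achieve.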
��� Integration with Methodology
In this section, we examine how we envision the combined methodology-plus-tools scenario. First, we discuss how these tools facilitate the steps listed in Chapter �. Then we focus on other features of the tools that help programmers throughout the tuning process.
����� Tool support in each step
Our tools have been designed and modified with the parallel programming methodology in mind. Figure ��� gives an overview of how these tools can be of use in each step of the methodology introduced in the previous chapter. Ursa Minor mainly contributes to the performance evaluation stages. InterPol and Polaris offer aid in the parallelization and manual tuning stages. Additional help in executing target programs is available through InterAct. In the following, we revisit each step in the methodology and discuss the roles of our tools.
Instrumenting program
The Polaris tool offers an instrumentation module as one of its passes. Users can activate this module using a set of switches. In this way, users can generate instrumented versions of both parallel and serial programs. Polaris provides several switches for the instrumentation of the execution time of loops. These switches dictate the
Fig. ���. Tool support for the parallel programming methodology. [Figure: the methodology steps (Instrumenting Program; Getting Serial Execution Time; Running Parallelizing Compiler; Manually Optimizing Program; Getting Optimized Execution Time; Speedup Evaluation; Finding and Resolving Performance Problems, with loops back to reduce instrumentation overhead or retune until satisfactory) are annotated with the supporting tools: the Polaris instrumentor and hardware counters, Polaris and InterPol, InterAct, and the Ursa Minor views, Expression Evaluator, and Merlin.]
types of code blocks that are instrumented and how nested sections are instrumented. By carefully controlling the switches, users can add all the necessary timing functions without excessive overhead.
Combined with the OpenMP Performance Counter Library �PCL� ����, Polaris can instrument a program so that each run generates a profile containing various performance data measured by a hardware counter on the instrumented code segments. This library is available on many modern machines. There are more than �� types of measurements available, including the number of cycles, instruction and data cache hits, the number of reads and writes, instruction counts, dependency stalls, and so on. It is capable of generating a data file that can be read by Ursa Minor for further analysis.
As noted in the methodology, it is important to record the execution time of the uninstrumented program. This serves as the basis for measuring the perturbation that instrumentation introduces. A simple UNIX command such as "time" may provide such a timing number.
Getting serial execution time
Running an instrumented serial version is typically done through the UNIX command line, usually via a simple command line interface. Instrumentation generates some form of record containing the timing information for the instrumented code segments. For example, an executable instrumented by the Polaris instrumentation utility generates a file that looks like the following:
RESTAR�do�� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do�� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
RESTAR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
ACTFOR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
ACTFOR�do��� � AVE� � ������ MIN� � ������ MAX� � ������ TOT� � ������
OVERALL time � �� ������ � � � � � �
The tabular section shows the average (AVE), minimum (MIN), maximum (MAX), and cumulative total (TOT) time spent in each instrumented segment. The last line shows the overall execution time of the entire program. This file can be directly read by the Ursa Minor tool for analysis.
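A summary file of this shape is straightforward to consume programmatically. The sketch below assumes a `name : AVE = x MIN = y MAX = z TOT = w` layout with a final `OVERALL` line, approximating the listing above; the exact delimiters and loop names in real Polaris output may differ.

```python
import re

# Sketch of a parser for Polaris-style timing summaries. The layout is
# assumed, not exact: each line carries a block name followed by
# AVE/MIN/MAX/TOT values, and a final OVERALL line gives the total time.

LINE = re.compile(
    r"(?P<name>\S+)\s*:\s*AVE\s*=\s*(?P<ave>[\d.]+)\s*"
    r"MIN\s*=\s*(?P<min>[\d.]+)\s*MAX\s*=\s*(?P<max>[\d.]+)\s*"
    r"TOT\s*=\s*(?P<tot>[\d.]+)")

def parse_profile(text):
    """Return ({block name: {ave, min, max, tot}}, overall time or None)."""
    blocks, overall = {}, None
    for line in text.splitlines():
        if line.startswith("OVERALL"):
            nums = re.findall(r"[\d.]+", line)
            overall = float(nums[0]) if nums else None
        else:
            m = LINE.match(line.strip())
            if m:
                blocks[m.group("name")] = {
                    k: float(m.group(k))
                    for k in ("ave", "min", "max", "tot")}
    return blocks, overall

sample = """ACTFOR_do500 : AVE = 0.120 MIN = 0.100 MAX = 0.150 TOT = 1.200
OVERALL time = 42.5"""
blocks, overall = parse_profile(sample)
print(blocks["ACTFOR_do500"]["tot"], overall)
# → 1.2 42.5
```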
Running parallelizing compiler
This is the step in which users attempt parallelization by running automatic utilities. Its main goals are (1) to utilize an automatic parallelizer to optimize complex loops and possibly gain static analysis results from the compiler, and (2) to save time by automating the parallelization of small, inconsequential loops. Therefore, the target in this case is usually the entire program. Furthermore, most parallelizers with interprocedural analysis capability work best when an entire program is given as input. Polaris, as a batch-oriented program, performs well for this purpose. InterPol is also capable of handling this task.
Manually optimizing programs
Any text editor can be used to manually modify programs. Several UNIX commands are useful for manipulating programs. An example is "fsplit", which splits subroutines and functions into different files. However, InterPol is specifically designed for the process of manual tuning. InterPol allows programmers to apply selected techniques to specific regions, change code manually, and generate a working version of the entire program without exiting the tool. Some of the manual techniques that users may consider are presented in Chapter �.
Getting optimized execution time
In the shared-memory model, programmers can invoke a parallel program just as they execute a serial program. Typically, there are certain environment variables that need to be set beforehand. For example, on Solaris machines, the environment variable OMP_NUM_THREADS determines the number of processors to be used. If programmers used the Polaris compiler for instrumentation, a summary file is generated after each run.
InterAct allows interactive instrumentation and tuning of OpenMP programs. Its ability to dynamically change runtime parameters (tile size, unrolling factor) provides a testbed for finding the optimal set of techniques. Monitoring and changing hardware counter instrumentation make the instrumentation process more efficient.
Finding and resolving performance problems
Programmers need utilities for assembling and sorting data. Identifying performance problems requires a considerable amount of examination and hand analysis. Finding solutions often requires experience in program optimization studies.
Ursa Minor provides tools that assist parallel programmers in effectively evaluating performance. Its graphical interface provides selective views and combinations of timing information together with the program structure and static analysis data. Users can put together a table, open a Structure View, draw charts, perform spreadsheet-type operations, and examine source code. Ursa Minor manages the information within its own database; thus, data management that might otherwise have required significant file and version control becomes simple.
Identifying dominant loops is very simple with Ursa Minor. Users can load timing profiles and sort the entries through the column popup menu. If a user creates a pie chart, the most time-consuming loops are displayed, with the entire circle representing the total execution time. The bar graph on the right shows an instant view of normalized numeric data.
An important task in tuning program performance is to evaluate whether an applied program modification produces an acceptable result. This involves computing various metrics, such as speedup and parallel efficiency, and examining program analysis information. The built-in mathematical functions allow users to manipulate the data. The static analysis information generated by the Polaris compiler is also managed within the Ursa Minor database. For code segments that require manual tuning, this information provides vital clues. Static analysis information, as well as the source code viewer, can be pulled up at any time with simple menu clicks, so users can make a comprehensive diagnosis of the problems at hand.
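The two metrics just mentioned are computed here as commonly defined: speedup is serial time over parallel time, and parallel efficiency is speedup over the number of processors. This is a standalone sketch with example numbers, not the tool's own built-in function set.

```python
# Common tuning metrics: speedup and parallel efficiency. This mirrors
# the kind of calculation the Expression Evaluator's built-in functions
# support, but is an independent sketch with invented example timings.

def speedup(serial_time, parallel_time):
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, num_procs):
    return speedup(serial_time, parallel_time) / num_procs

# A loop taking 8.0 s serially and 2.5 s on 4 processors:
s = speedup(8.0, 2.5)        # 3.2x faster
e = efficiency(8.0, 2.5, 4)  # 0.8, i.e. 80% of ideal linear scaling
print(s, e)
# → 3.2 0.8
```

An efficiency well below 1.0 after a modification is exactly the kind of signal that sends a user back to the static analysis data for a diagnosis.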
Ursa Minor does more than just present data. It is capable of actively analyzing the data and giving advice. When users run Merlin, it extracts the necessary information and applies diagnosis techniques to find the right solutions. As mentioned previously, the decisions that Ursa Minor makes rely on the Merlin map, which is typically provided by advanced parallel programmers. In this way, the knowledge of experienced programmers can easily be used by novice programmers. The fact that a map can contain a variety of functions that apply to any type of data widens the usage of Merlin in many different fields of study.
����� Other useful utilities
In addition, the toolset provides functionality for tasks that are not specifically tied to the methodology steps.
When programmers are given an application to optimize, they usually start out by examining the source code. Basic knowledge about the program structure, such as the large subroutines or functions, their algorithms, callees, and callers, tremendously helps programmers later in the tuning stage. The algorithms employed by program modules, although not necessary for following the methodology, may be of importance, especially when programmers need to attempt replacing algorithms.
Programs written by others are generally harder to understand. Different coding styles make it difficult to capture the underlying composition of individual program modules. The Structure View of Ursa Minor addresses this problem by presenting users with an intuitive, color-coded view of the program structure. A simple click pulls up the source view when a closer examination is desired. This can save a significant amount of the users' time.
As the size and complexity of applications grow at an exponential rate these days,
the subject of performance steering is getting more attention. Performance steering
may come in handy in both the development stage and production use. For instance,
finding the right parameters for convergence criteria in the application development
stage can be tricky, so the ability to set or reset relevant variables during program
execution could prove advantageous in experimenting with different values.
Also, an application may be able to simulate many different aspects of a target object,
but users may be interested in only one aspect. In this case, performance steering
can save time and resources by restricting the simulation. The interest of InterAct
lies along this line. The primary use of InterAct in our study has been finding
the optimal combination of optimization-related parameters (e.g., tile size, unrolling
factor) for a given application. For long-running programs, InterAct allows
fine control over variables such as the simulation step size and the number of iterations.
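The kind of parameter steering described above can be sketched as a simple search loop. The `run_simulation` function and its cost model below are hypothetical stand-ins for an instrumented run, not InterAct's actual API:

```python
# Hypothetical sketch of steering-style parameter tuning: try several
# values of an optimization-related parameter (here, a tile size) and
# keep the one with the best simulated run time. The cost model is a
# made-up stand-in for a real instrumented execution.

def run_simulation(tile_size):
    # Stand-in cost model: too-small tiles pay loop overhead,
    # too-large tiles overflow the cache.
    overhead = 100.0 / tile_size
    cache_penalty = max(0, tile_size - 64) * 0.5
    return 10.0 + overhead + cache_penalty

def steer(candidates):
    """Return the candidate tile size with the lowest run time."""
    timings = {t: run_simulation(t) for t in candidates}
    best = min(timings, key=timings.get)
    return best, timings

best, timings = steer([8, 16, 32, 64, 128])
print(best)  # the tile size a steering session would settle on
```

A real steering tool would of course adjust the parameter in a live run rather than re-executing the program for each candidate value.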
When more than one person is involved in an optimization project, communication
between group members becomes problematic. The data that one person generates
may not be easily accessible or compatible with the tools used by others. Other
members in the group may want to focus on different perspectives, but the information
from one researcher may not be formatted or arranged in a compatible way. Sharing a
manipulatable database opens up the possibility of all the members having access to
a set of compatible databases relevant to individual tasks. At the same time, group
members can reason about the data gathered by other members, focusing on the
aspects that they are interested in. Ursa Minor enables an efficient and meaningful
way of sharing research results.
Finally, the growing popularity of multiprocessor workstations and high-performance
PCs is leading to a substantial increase in non-expert users and programmers
of this machine class. Such users need new programming paradigms; perhaps most
importantly, they need good examples to learn from. We have extended our effort
to support "parallel programming by examples" through Web-accessible tools and a
database repository. This is the topic of the next section.
��� The Parallel Programming Hub and Ursa Major
Although the importance of advanced tools for all software development is evident,
many available tools remain unused. This is mainly due to the limited accessibility
of tools. We have developed a set of tools for parallel programmers, and the
Internet provided an opportunity to make our tools more accessible to parallel
programmers worldwide. Here, we present two separate outcomes that resulted from
our effort to reach a wider audience with our tools. The Parallel Programming Hub
is an on-going project to provide a globally accessible, integrated environment that
hosts parallelizing compilers, program analyzers, and interactive performance tuning
tools ����. Users can access and run these tools with common Web browsers. Ursa
Major is an Applet-based application that enables visualization and manipulation
of the performance and static analysis data of various parallel applications that have
been studied at Purdue University ���. Its goal is to make a repository of program
information available via the World-Wide Web.
����� Parallel Programming Hub: a globally accessible, integrated tool environment
Programming tools are of paramount importance for efficient software development.
However, despite several decades of tool research and development, there is a
drastic contrast between the large number of existing tools and those actually used
by ordinary programmers. We believe that there are two main reasons for this
situation. The first reason is that a programmer, in order to benefit from new tools,
will typically have to go through one or several tedious efforts of searching,
downloading, installing, and resolving platform incompatibilities before the tools can even
be learned and their use can be evaluated. The second reason is that, even if the
value of a number of tools has been established, they often use different terminology,
diverse user interfaces, and incompatible data exchange formats; hence they are not
integrated.
Through the combined efforts of many researchers, we have created the Parallel
Programming Hub, a new parallel programming tool environment that is (1) accessible
and executable "anytime, anywhere" through standard Web browsers and (2)
integrated in that it provides tools that adhere to a common methodology for parallel
programming and performance tuning. The Parallel Programming Hub addresses
these two issues. It contributes solutions in the following way. First, the Parallel
Programming Hub makes available a growing number of tools "on the Web", where
they are accessible and executable through standard Web browsers. The Parallel
Programming Hub makes no restrictions on the type of tools that can be added. A
new tool can be installed without modification, providing the original graphical user
interface and, if necessary, being served directly off of the home site of a proprietary
provider. Nevertheless, the authorized user can access the tool via standard Web
browsers.
Our methodology is supported by the Parallel Programming Hub, which includes the
Polaris parallelizing compiler, the Max�P parallelism analysis tool, and the Ursa
Minor performance evaluation and visualization tool, all described in previous
sections. In addition, an increasing number of tools are being made available through
the Parallel Programming Hub. Currently, the Trimaran environment �� ��
for instruction-level parallelism (ILP) and the SUIF parallelizing compiler ���� are
accessible. Authorized users can access a number of common support tools such
as Matlab, Mentor Graphics, GNU Octave, and StarOffice. Figure �� shows a
screenshot of Ursa Minor in use on the Parallel Programming Hub.
On the surface, the Parallel Programming Hub is a set of web pages through
which users can run various parallel programming tools. Underneath this interface
is an elaborate network computing infrastructure called the Purdue University
Network Computing Hub (PUNCH). PUNCH is an infrastructure that supports
network-accessible, demand-based computing ���. It allows users to access and run unmodified
tools via standard Web browsers. PUNCH allows tools to be written in any language
and does not require the source code or object code of the applications it hosts. This
feature allows a wide variety of tools to be included.
When a user invokes a tool on PUNCH, the resource management unit determines
an appropriate platform out of a resource pool and executes the tool on it. The
smart resource management unit maintains resource usage at an optimal level. It
also enables the system to be highly scalable, making sure that PUNCH performs
well under widely varying numbers of users, tools, and resource nodes.
Fig. ��. Ursa Minor usage on the Parallel Programming Hub.
PUNCH is logically divided into discipline-specific "Hubs". Currently, PUNCH
consists of four Hubs that contain tools from semiconductor technology, VLSI design,
computer architecture, and parallel programming. These Hubs contain over thirty
tools from eight universities and four vendors, and serve more than five hundred
users from Purdue, the US, and Europe. PUNCH has been accessed ��� million times
since it became operational in ����.
Upon registering, a user is given an account and disk space that is accessible as
long as the user is on PUNCH. The execution of tools via PUNCH takes place in
UNIX "shadow" accounts that are managed by the network computing infrastructure.
This shadow account structure allows the addition of user accounts to the Parallel
Programming Hub without requiring the setup of individual accounts by a UNIX
system administrator. PUNCH keeps all user files in a master account and maintains a
pool of shadow accounts that are allocated dynamically for users at runtime. Input
files for interactive programs such as Ursa Minor are transferred on demand from
master to shadow accounts via a system call tracing program (based on the UFO
prototype ����) that implements a user-level virtual file system on top of the FTP
protocol. This system is transparent to users; thus all file transactions appear to be
normal disk I/O.
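The on-demand transfer scheme can be illustrated with a toy stand-in: files live in a "master" store and are copied into a "shadow" store only on first access. This sketch uses in-memory dictionaries in place of the FTP transport and system call tracing that the real infrastructure relies on:

```python
# Toy model of on-demand file transfer between a master account and a
# shadow account. Real PUNCH intercepts file system calls and fetches
# over FTP; here both stores are plain dictionaries.

class ShadowFS:
    def __init__(self, master):
        self.master = master      # authoritative file store
        self.shadow = {}          # files materialized on demand
        self.fetches = 0          # how many on-demand transfers occurred

    def open(self, path):
        if path not in self.shadow:          # first access: fetch
            self.shadow[path] = self.master[path]
            self.fetches += 1
        return self.shadow[path]             # later accesses: local

master = {"/home/user/input.dat": b"timing data"}
fs = ShadowFS(master)
fs.open("/home/user/input.dat")   # triggers a transfer
fs.open("/home/user/input.dat")   # served from the shadow copy
print(fs.fetches)                 # one fetch despite two opens
```

The key property mirrored here is transparency: the caller uses the same `open` interface whether or not a transfer is needed.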
The immediate advantage of having an integrated network-based tool environment
is substantial savings in users' effort and resources. The Parallel Programming Hub
eliminates the time needed to search for, download, and install tools, and it greatly supports
users in learning a tool through uniform documentation, on-line tutorials, and tools
that speak a common terminology. A typical tool access time for first-time users of
the ParHub is on the order of a minute, including authentication and navigating to
the right tool. This contrasts with download and installation times at least an
order of magnitude larger. Even much larger efforts become necessary if tools need
to be adapted to local platforms.
A novel aspect of the ParHub's underlying technology is that it represents not only
an actual "information grid", but also includes the necessary portals for its end users.
One vision is that future users can access software tools via any local platform, from a
palmtop to a powerful workstation. Compute power and file space are provided "on the
Web". Mobility is provided in that these resources are accessible transparently from
any access point. The described infrastructure represents a significant step towards
this vision.
����� Ursa Major: making a repository of knowledge available to the worldwide audience
A core need for advancing the state of the art of computer systems is performance
evaluation and the comparison of results with those obtained by others. To this end,
many test applications have been made publicly available for study and benchmarking
by both researchers and industry. Although a large body of measurements obtained
from these programs can be found in the literature and in public data repositories, it is
usually extremely difficult to combine them into a form meaningful for new purposes.
In part this is because the data are not readily available (i.e., they have to be extracted
from several papers), and they have to undergo substantial re-categorization and
transformation. In addressing this issue, the Ursa Major project ��� is creating
a comprehensive database of information.
Many tools can gather raw program and performance information and present it
to users, which is a starting point for answering the questions above. However, in
addition to providing raw information, advanced tools must help filter and abstract
a potentially very large amount of data.
Ursa Major addresses the described issues by providing an instrument with
which application, machine, and performance information can be obtained from various
sources and displayed in an interactive viewer attached to the World-Wide
Web. It provides a repository for this information and assists users in its abstraction
and comprehension. Industrial benchmarkers may be interested in "one single
number" for machine comparisons; programmers may be interested in transformations
that can improve the performance of an application; computer architects may
want to compare their cache measurements with those obtained by their peers. Ursa
Major provides hooks for their needs, and it includes instruments for the underlying
data mining task.
Ursa Major is an Applet-based application that enables visualization and manipulation
of the performance and static analysis data of various parallel applications
that have been studied at Purdue University. The goal of Ursa Major is to make
a repository of program information available via the World-Wide Web. Ursa Major
has its origin in the Ursa Minor tool, providing almost identical functionality.
Because we chose Java as an implementation language, it was natural to combine
these resources with the rapidly advancing Internet technology and, in this way,
allow users at remote sites to access our experimental data. Typically, in response to
a user interaction, it fetches from the repository a program database that represents
a specific parallel programming case study. It then displays it using Ursa Minor's
visualization utilities. Due to the Applet's security constraints, local disk access is
not supported by Ursa Major. Figure ��� shows an overall view of the interactions
between Ursa Major, a user, and the Ursa Major repository (UMR).
Fig. ���. Interaction provided by the Ursa Major tool: the user interacts with the Ursa Major Applet (downloaded from a remote server), which downloads databases from the Ursa Major Repository (UMR) and presents them in views such as the Loop Table View and Call Graph View.
The data repository is being constructed from the results gathered in various
research projects. Currently it consists of the characteristics of a number of programs, the
results of compiler analyses of these programs, their performance numbers on diverse
architectures, and the data generated in several simulator runs. Individual databases
in the repository are in the Generic Data Format described in Section �����. One
issue in designing the repository was to define storage schemes that make it easy for
users to find information entered by other users. To this end, the repository structure
consists of extensions on file and directory names indicating data such as program
names, platforms, compilers, optimizations, and parallel languages. To be flexible,
these extensions are not hard-coded. Instead, they are described in a configuration
file that is read by Ursa Major at the start of a session.
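A minimal sketch of such an extension scheme, with invented field names and values: a configuration file would list the ordered fields, and each database name is split into its metadata accordingly. The field order is hard-coded here only for illustration.

```python
# Sketch of decoding repository file names whose dot-separated
# extensions carry metadata. In the real tool the field order comes
# from a configuration file read at session start; it is hard-coded
# here, and all names and values are illustrative only.

FIELDS = ["program", "platform", "compiler", "language"]

def parse_db_name(name):
    """Map a name like 'ARC2D.SPARC.polaris.openmp' to its metadata."""
    parts = name.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected database name: " + name)
    return dict(zip(FIELDS, parts))

meta = parse_db_name("ARC2D.SPARC.polaris.openmp")
print(meta["platform"])   # SPARC
```

Because the fields are data rather than code, adding a new dimension (say, an input data set) only requires extending the configuration, not the tool.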
Ursa Major supports a user model of "parallel programming by examples", and
it serves as a program and benchmark database for high performance computing. It
integrates information available from performance analysis tools, compilers, simulators,
and source programs to a degree not provided by previous tools. Ursa Major
can be executed on the World-Wide Web, from which a growing repository of
information can be viewed. Through continuous updates to the repository, we envision
Ursa Major becoming the first place to look for performance data.
The emergence of the Parallel Programming Hub presents an interesting opportunity
to compare these two network-based tools. Although their goals are distinct,
Ursa Minor on the Parallel Programming Hub and Ursa Major provide users
with the same visualization utilities for viewing performance and static analysis data.
The Parallel Programming Hub enables Ursa Minor to load and manipulate user
inputs from remote sites. On the other hand, it lacks support for access to a
centralized repository. A detailed performance comparison in terms of response
time is given in the next chapter.
��� Conclusions
Our effort to create a parallel programming environment has resulted in a parallel
program development and tuning methodology and a set of tools. We have developed
the tools with the design goals in mind to provide an integrated, flexible, accessible,
portable, and configurable tool environment that conforms to the underlying
methodology. Our toolset integrates static program analysis with performance evaluation,
while supporting data visualization and interactive compilation. Data management
is also simplified with our tools.
To give access to these tools to as many users as possible and to disseminate our
performance databases of various applications as widely as possible, we have used a
network computing infrastructure. In addition, we are currently building a database
repository that enables the visualization and manipulation of performance results
through a Java Applet application.
Here, we conclude the presentation of our methodology and tool efforts. The
introduced methodology addresses "what" in parallel programming. The toolset described
in this chapter has been designed and implemented based on our experience and
design goals, and aims to answer "how". Finally, with the extra effort to promote the
tools and to reach a wider audience, we have attempted to solve the question "where".
The methodology and the tools are useless if they are not effective in actual parallel
programming and performance tuning processes. The obvious next step is to evaluate
the benefits of these tools as well as the methodology, hence answering "how well"
they work. This is the topic of the next chapter.
�� EVALUATION
Evaluating a methodology and tools is difficult. This is largely due to two problems
associated with the topic. First, the desirable characteristics of a methodology and
supporting tools, such as efficiency and effectiveness, cannot be measured easily,
especially in quantitative ways. It is very challenging to establish a set of metrics for such
measures. Secondly, the goal of developing a methodology and supporting tools is to
assist users; thus, in determining the efficiency of a methodology and supporting tools,
the users' willingness and knowledge towards them become critical factors. Having
a large user community would help judge their value. Even then, however, creating
controlled experiments to obtain quantitative feedback is very difficult.
These are the main reasons that many tool efforts in parallel programming have
ignored the evaluation aspect. The majority of publications related to parallel
programming tools do not include quantitative evaluations. Even general descriptions
of user feedback, such as "response to the Sigma editor has been good" ����, are
seldom found. Some of them demonstrate the usage of tools via descriptive case
studies � �� � ��� �� ����. Publications focusing on programming methodology have
taken the same approach ��� �� �� �� and give several examples of how their proposed
scheme can be applied to actual programming practices. One notable evaluation effort
is found in the SUIF Explorer publication ����, in which a performance improvement
attempted by a user is summarized in detail. Whether it accurately reflects the
efficiency of the tool is arguable, but as the only quantitative measurement for tool
evaluation, their effort is noteworthy.
In this chapter, we attempt to achieve a fair and accurate evaluation as follows. In
Section ���, we give a series of case studies to demonstrate the usage of our
methodology and tool support. A detailed description of each parallelization and tuning
process is given in the section. These case studies serve to show the applicability
of the methodology and the functionality of the tools. In Section ���, we evaluate the
tool functionality by analyzing and comparing the tasks accomplished with and
without the tools. Also, we summarize the comments from users in this section. The
comparison of our tools with other parallel programming environments is given in
Section ���. Finally, we discuss tool accessibility as the result of adopting the
network computing facilities in Section ���. Conclusions are given last.
��� Methodology Evaluation Case Studies
����� Manual tuning of ARC2D
In this section, we present a case study illustrating the manual tuning process of
the program ARC2D from the Perfect benchmark suite ����. This case study was
presented in ����. In this study, a programmer tried to improve the performance of
the program beyond that achieved by the Polaris parallelizing compiler. The target
machine is a HyperSPARC workstation with � processors.
Polaris was able to parallelize almost all loops in ARC2D. However, the speedup
of the resulting executable was only ��� on � processors. Using Ursa Minor's
Structure View and sorting utility, the programmer was able to find three loops to
which loop interchange could be applied: FILERX do��, XPENTA do�, and XPENT2 do�.
After the loop nests were interchanged in these loops, the total program execution time
decreased by �� seconds, increasing the speedup from ��� to ���.
As a result of this modification, the dominant program sections changed. The
programmer re-evaluated the most time-consuming loops using the Expression
Evaluator to compute new speedups and the percentage of loop execution time over the
total time. The most time-consuming loop was now the STEPFY do��� nest, which
consumed ��% of the new parallel execution time. The programmer examined the
nest with the source viewer and noticed two things: (1) there were many adjacent
parallel regions, and (2) the parallel loops were not always distributing the same
dimension of the work array. The programmer merged all of the adjacent parallel
regions in the nest into a single parallel region. The new parallel region consisted of
four consecutive parallel loops. The first two nests were single loops that distributed
the work array across its innermost dimension. The second two nests were doubly
nested and distributed the work array across its second innermost dimension. The
effect of these changes was two-fold. First, the merging of regions should eliminate
parallel loop fork/join overhead. Second, the normalization of the distributions within
the subroutine should improve locality. After this change, the speedup of the loop
improved from �� to ����.
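The effect of merging adjacent parallel loops can be sketched independently of OpenMP: two passes over the same index range become one pass, so a single fork/join pair is paid instead of two. The arrays and loop bodies below are illustrative, not ARC2D's:

```python
# Illustration of fusing two adjacent loops over the same range.
# In the parallel program each loop is its own parallel region, so
# fusing them replaces two fork/join pairs with one; the computed
# results are identical either way.

N = 100
a = [0.0] * N
b = [0.0] * N

# Before: two separate loops (two parallel regions).
for i in range(N):
    a[i] = i * 2.0
for i in range(N):
    b[i] = a[i] + 1.0

a2 = [0.0] * N
b2 = [0.0] * N
# After: one fused loop (a single parallel region). Fusion is legal
# here because b2[i] depends only on a2[i], which is computed earlier
# in the same iteration.
for i in range(N):
    a2[i] = i * 2.0
    b2[i] = a2[i] + 1.0

assert a == a2 and b == b2
```

The locality benefit of the normalization step is analogous: when consecutive loops traverse the array in the same order, data brought into the cache by one loop is reused by the next.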
The programmer was able to apply the same techniques (fusion and normalization)
to the next � most time-consuming loops (STEPFX do���, FILERX do��, and
YPENTA do�). These modifications resulted in a speedup gain from ��� to ����. Finally,
the programmer applied the same techniques to the next most time-consuming
sections, XPENTA, YPENT2, and XPENT2, according to the newly computed profiles and
speedups. The speedup improved to ���. The programmer felt that the point of
diminishing returns had been reached and halted the optimization.
Fig. ��. The (a) execution time and (b) speedup of the various versions of ARC2D
(Mod1: loop interchange; Mod2: STEPFY do��� modification; Mod3: STEPFX do���
modification; Mod4: FILERX do�� modification; Mod5: YPENTA do� modification;
Mod6: modification on XPENTA, YPENT2, and XPENT2).
In summary, applying loop interchange, parallel region merging, and distribution
normalization yielded an increase from the out-of-the-box speedup of ��� to
a speedup of ���. This corresponds to a ��% decrease in execution time. Figure ��
shows the improvements in the total program performance as each optimization was
applied. Ursa Minor allowed the user to quickly identify the loop structure of the
program and sort the loops to identify the most time-consuming code sections. After
each modification, the user was able to add the new timing data from the modified
program runs, re-calculate the speedup, and see whether an improvement was worthwhile.
����� Evaluating a parallelizing compiler on a large application
In one research project, a user is enabling the Polaris compiler to work effectively
with large codes (on the order of at least ����� lines) ����. These codes have many
levels of abstraction and are very modular, making it difficult to link performance
and parallelization bottlenecks to their causes. Ursa Minor was used with the
SPECseis application suite ����, a set of codes that perform seismic processing, as
a basic GUI to help manage the thousands of lines of code and hundreds of loop
timings, as well as to direct the compiler developer toward enabling Polaris to recognize
more parallelism.
Ursa Minor allows the user to easily pick out the significant portions of the code
(in terms of execution time) and to find their callers and callees. We found that the
implementation of the finite-differencing scheme, which was a landmark in the history
of seismic processing, takes only �% of the total time. The accompanying correction
routine, which compensates for the errors that accrue with the finite-difference
approximation, takes ��% of the total execution time. The correction routine performs
an FFT, applies the error equations, and transforms the data back from the frequency
domain.
Besides the ability to quickly and easily locate the major components of the
execution time, the user found Ursa Minor helpful to the compiler developer in analyzing
the effectiveness of compilation techniques. One key benefit of using Ursa Minor
for performance evaluation is the ability to apply the Expression Evaluator to both
the run-time performance and the compile-time analysis. Polaris was able to
parallelize loops which contributed only �% of the execution time. The user used Ursa
Minor to determine why certain key loops were not parallelized (a feature requiring
one mouse click) in order to add techniques that address these issues. The SEICFT
routine performs a �D FFT on a frequency slice. The routine contains while loops,
which are not parallelized by Polaris.
With Ursa Minor, the user was also able to work with the application as a
whole to determine what factors influence automatic parallelization across the entire
code. We can do so using the commands provided in the Ursa Minor tool. In
particular, Ursa Minor revealed that inlining or interprocedural analysis is a crucial
parallelism enabler for parallelizing compilers when dealing with large, modular
codes. Eight out of the top ten loops (for the first seismic phase) have subroutine
calls within them.
����� Interactive compilation
The use of a parallelizing compiler as an interactive tool can benefit users in many
ways. Users can incorporate the feedback from the compiler during compilation and
add appropriate modifications to the source. An incremental use of such a tool
simplifies code management and debugging as well, because the code changes made
by users are localized. In addition, the ability to "build" a parallelizing compiler (as
described in the previous chapter) allows users to experiment with different compiler
techniques, so that they can learn more about the techniques and their effects.
We present a case study in ���� to demonstrate the functionality of InterPol.
A user parallelized the small example program shown in Figure ���(a). Figure ���(b)
shows the code after simply being run through the default Polaris configuration with
the inlining switch set to inline subroutines of � statements or less. Two important
results can be seen: (1) subroutine one is not inlined due to the inlining pass executing
prior to deadcode elimination, and (2) the loops in subroutine two are not found to
be parallel because of subscripted array subscripts, which the Polaris compiler cannot
analyze. Figure ���(c) shows the resulting program after adding a deadcode pass prior
to the inlining pass in the Compiler Builder, and running the main program and
subroutine one from Figure ���(a) through this "new" compiler. Finally, in Figure ���(d),
(a)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
      DO I = 1, 100
        CALL ONE(A,B,I)
        C(I) = I
      ENDDO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE ONE(A,B,I)
      REAL A(100,100),B(100,100)
      INTEGER DEADCODE
      DEADCODE = 1
      DEADCODE = 2
      DEADCODE = 3
      DEADCODE = 4
      DEADCODE = 5
      DO J = 1,100
        A(J,I) = 1
        B(J,I) = 1
      ENDDO
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END
(b)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
      DO I = 1, 100
        CALL ONE(A,B,I)
        C(I) = I
      ENDDO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE ONE(A,B,I)
      REAL A(100,100),B(100,100)
!$OMP PARALLEL DO
      DO J = 1,100
        A(J,I) = 1
        B(J,I) = 1
      ENDDO
!$OMP END PARALLEL DO
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END
Fig. ���. Contents of the Program Builder during an example usage of the
InterPol tool: (a) the input program and (b) the output from the default Polaris
compiler configuration.
(c)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1,100
          A(J,I) = 1
          B(J,I) = 1
        ENDDO
        C(I) = I
      ENDDO
!$OMP END PARALLEL DO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
      END

(d)
      PROGRAM EXAMPLE
      REAL A(100,100),B(100,100)
      REAL C(100)
      INTEGER I
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1,100
          A(J,I) = 1
          B(J,I) = 1
        ENDDO
        C(I) = I
      ENDDO
!$OMP END PARALLEL DO
      CALL TWO(A,B,C)
      WRITE (*,*) A
      WRITE (*,*) B
      END

      SUBROUTINE TWO(A,B,C)
      REAL A(100,100), B(100,100)
      REAL C(100)
!$OMP PARALLEL DO
      DO I = 1, 100
        DO J = 1, 100
          A(C(J),C(I)) = I+J
          B(C(J),C(I)) = I+J
        ENDDO
      ENDDO
!$OMP END PARALLEL DO
      END
Fig. ���. Contents of the Program Builder during an example usage of the
InterPol tool: (c) the output after placing an additional deadcode elimination
pass prior to inlining and (d) the program after manually parallelizing subroutine
two.
the user has selected only subroutine two, parallelized it by hand, and included this
modified version in the Program Builder. Through simple interactions with InterPol,
the user was able to take a code for which Polaris was only able to parallelize
a single innermost loop, and parallelize both of its outermost loops.
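The pass-ordering effect in this case study can be mimicked with a toy pipeline: an "inliner" with a size threshold fails when dead statements inflate a routine, and succeeds once a deadcode pass runs first. The statement threshold and routine contents below are invented for illustration, not Polaris's actual settings:

```python
# Toy compiler pipeline showing why pass order matters. A routine
# padded with dead assignments exceeds the inliner's size threshold;
# running deadcode elimination first shrinks it below the threshold.

INLINE_LIMIT = 4  # hypothetical "inline subroutines of <= 4 statements"

def deadcode(stmts):
    # Drop the (obviously dead) assignments to the unused variable.
    return [s for s in stmts if not s.startswith("DEADCODE")]

def can_inline(stmts):
    # "Inlining" succeeds only if the routine is small enough.
    return len(stmts) <= INLINE_LIMIT

routine_one = ["DEADCODE = 1", "DEADCODE = 2", "DEADCODE = 3",
               "A(J,I) = 1", "B(J,I) = 1"]

inlined_without = can_inline(routine_one)            # inline first: fails
inlined_with = can_inline(deadcode(routine_one))     # deadcode first: succeeds
print(inlined_without, inlined_with)
```

A Compiler Builder, as described in the previous chapter, lets the user rearrange such passes without rebuilding the compiler by hand.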
����� Performance advisor: hardware counter data analysis
In this case study, given in �� ��, we discuss a performance map that uses the
speedup component model introduced in ����. The model fully accounts for the gap
between the measured speedup and the ideal speedup in each parallel program section.
This model assumes execution on a shared-memory multiprocessor and requires that
each parallel section be fully characterized using hardware performance monitors to
gather detailed processor statistics. Hardware monitors are now available on most
commodity processors.
With hardware counter and timer data loaded into Ursa Minor, users can simply
click on a loop in the Ursa Minor table view and activate Merlin. Merlin
then lists the numbers corresponding to the various overhead components responsible
for the speedup loss in each code section. The displayed values for the components
show overhead categories in a form that allows users to easily see why a parallel region
does not exhibit the ideal speedup of p on p processors. Merlin then identifies the
dominant components in the loops under inspection and suggests techniques that
may reduce these overheads. An overview of the speedup component model and its
implementation as a Merlin map is given below.
Performance map description
The objective of our performance map is to be able to fully account for the
performance losses incurred by each parallel program section on a shared-memory
multiprocessor system. We categorize overhead factors into four main components. Table ��
shows the categories and their contributing factors.
Memory stalls reflect latencies incurred due to cache misses, memory access times,
and network congestion. Merlin will calculate the cycles lost due to these overheads.
If the percentage of time lost is large, locality-enhancing software techniques will be
Table ��
Overhead categories of the speedup component model.

Overhead category  | Contributing factor | Description                                                  | Measured with
Memory stalls      | IC miss             | Stall due to I-Cache miss.                                   | HW Cntr
                   | Write stall         | The store buffer cannot hold additional stores.              | HW Cntr
                   | Read stall          | An instruction in the execute stage depends on an earlier    | HW Cntr
                   |                     | load that is not yet completed.                              |
                   | RAW load stall      | A read needs to wait for a previously issued write to the    | HW Cntr
                   |                     | same address.                                                |
Processor stalls   | Mispred. stall      | Stall caused by branch misprediction and recovery.           | HW Cntr
                   | Float dep. stall    | An instruction needs to wait for the result of a floating    | HW Cntr
                   |                     | point operation.                                             |
Code overhead      | Parallelization     | Added code necessary for generating parallel code.           | computed
                   | Code generation     | More conservative compiler optimizations for parallel code.  | computed
Thread management  | Fork/join           | Latencies due to creating and terminating parallel sections. | timers
                   | Load imbalance      | Wait time at join points due to uneven workload              | timers
                   |                     | distribution.                                                |
suggested. These techniques include optimizations such as loop interchange, loop
tiling, and loop unrolling. We found in ���� that loop interchange and loop unrolling
are among the most important techniques.
Processor stalls account for delays incurred internally to the processor. These include
branch mispredictions and floating point dependence stalls. Although it is difficult
to address these stalls directly at the source level, loop unrolling and loop fusion, if
properly applied, can remove branches and give more freedom to the backend compiler
to schedule instructions. Therefore, if processor stalls are a dominant factor in a loop's
performance, Merlin will suggest that these two techniques be considered.
Code overhead corresponds to the time taken by instructions not found in the
original serial code. A positive code overhead means that the total number of cycles,
excluding stalls, consumed across all processors executing the parallel code
is larger than the number used by a single processor executing the equivalent serial
section. These added instructions may have been introduced when parallelizing the
program (e.g., by substituting an induction variable) or by a more conservative
parallel code generating compiler. If code overhead causes performance to degrade
below that of the original code, Merlin will suggest serializing the code
section.
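The code overhead component described here can be derived directly from cycle counts. A minimal sketch, with invented function and variable names and made-up cycle numbers (the thesis obtains the actual counts from the hardware counter):

```python
def code_overhead(parallel_cycles, parallel_stalls, serial_cycles, serial_stalls):
    """Cycles spent in added instructions: the non-stall cycles summed over
    all processors of the parallel code, minus the non-stall cycles of the
    equivalent serial section."""
    busy_parallel = sum(c - s for c, s in zip(parallel_cycles, parallel_stalls))
    busy_serial = serial_cycles - serial_stalls
    return busy_parallel - busy_serial

# Example with made-up numbers for a 4-processor run:
overhead = code_overhead([300, 310, 305, 295], [40, 50, 45, 35], 1000, 120)
assert overhead == (1210 - 170) - 880  # 160 cycles of added code
```

A positive result indicates added instructions; a negative result, as in the ARC2D example discussed below, indicates that the parallel code executes fewer non-stall cycles than the serial version.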
Thread management accounts for latencies incurred at the fork and join points of
each parallel section. It includes the times for creating or notifying waiting threads, for
passing parameters to them, and for executing barrier operations. It also includes the
idle times spent waiting at barriers, which are due to unbalanced thread workloads.
We measure these latencies directly through timers before and after each fork and each
join point. Thread management latencies can be reduced through highly optimized
runtime libraries and through improved balancing schemes for threads with uneven
workloads. Merlin will suggest improved load balancing if this component is large.
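The timer-based measurement can be sketched as follows. The class and region names are hypothetical; a real implementation would read the timers immediately around the runtime library's fork and join calls:

```python
import time

class ForkJoinTimer:
    """Accumulates fork/join latency per parallel region."""
    def __init__(self):
        self.latency = {}

    def record(self, region, start, end):
        # Add the elapsed time of one fork (or join) event to the region's total.
        self.latency[region] = self.latency.get(region, 0.0) + (end - start)

timer = ForkJoinTimer()

t0 = time.perf_counter()
# ... fork point: threads would be created or notified here ...
t1 = time.perf_counter()
timer.record("STEPFX_do1", t0, t1)   # hypothetical region name

assert timer.latency["STEPFX_do1"] >= 0.0
```
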
Ursa Minor combined with this Merlin map displays (1) the measured performance
of the parallel code relative to the serial version, (2) the execution overheads
of the serial code in terms of stall cycles reported by the hardware monitor, and
(3) the speedup component model for the parallel code. We will discuss details of
the analysis where necessary to explain effects. However, for the full analysis with
detailed overhead factors and a larger set of programs we refer the reader to ����.
Experiment

For our experiment we translated the original source into OpenMP parallel form
using the Polaris parallelizing compiler ����. The source program is the Perfect
Benchmark ARC2D, which is parallelized to a high degree by Polaris.

We performed our measurements on a Sun Enterprise ���� with six ���-MHz
UltraSPARC processors, each with a �-KB L1 data cache and a �-MB unified L2
cache. Each code variant was compiled by the Sun v��� Fortran �� compiler with
the flags -xtarget=ultra�, -xcache=������, and -O�. For hardware performance
measurements, we used the available hardware counter (TICK register) ����.

ARC2D consists of many small loops, each of which has an average execution time
of a few milliseconds. Figure ��� shows the overheads in the loop STEPFX DO��� of the
original code, and the speedup component graphs generated before and after applying
a loop interchange transformation.

Fig. ���: Performance analysis of the loop STEPFX DO��� in program ARC2D. The
graph on the left shows the overhead components in the original, serial code. The
graphs on the right show the speedup component model for the parallel code
variants on � processors before and after loop interchanging is applied. Each
component of this model represents the change in the respective overhead category
relative to the serial program. Merlin is able to generate the information shown in
these graphs.
Merlin calculates the speedup component model using the data collected by a
hardware counter, and displays the speedup component graph. Merlin applies the
following map using the speedup component model: if the memory stall appears in
performance graphs of both the serial code and the Polaris-parallelized code, then apply
loop interchange. Following this suggested recipe, the user tries loop interchanging, which
results in significant, now superlinear speedup. The loop-interchange graph on the
right of Figure ��� shows that the memory stall component has become negative, which means that
there are fewer stalls than in the original, serial program. The negative component
explains why there is a superlinear speedup.
The speedup component model further shows that the code overhead component
has drastically decreased from the original parallelized program. The code is even
more efficient than in the serial program, further contributing to the superlinear
speedup.
In this example, the use of the performance map for the speedup component model
has significantly reduced the time spent by a user analyzing the performance of the
parallel program. It has helped explain both the sources of overheads and the sources
of superlinear speedup behavior.
����� Performance advisor: simple techniques to improve performance
In this section, we present a performance map based solely on execution timings
and static compiler information. Such a map requires program characterization data
that a novice user can easily obtain. In the study that we did in ����, a map is
designed to advise novice programmers in improving the performance of programs
achieved by a parallelizing compiler such as Polaris ����. In this case study, we
assume that novice programmers have used a parallelizing compiler as the first step to
optimize the performance of the target program and that its static analysis
information is available. The performance map presented in this section aims at improving
this initial performance.
Our goal in this study is to provide users with a set of simple techniques that
may help enhance the performance of a parallel program based on data that can be
easily generated. This includes timing and static program analysis data. Based on
our experiences with parallel programs, we have chosen three techniques that are (1)
easy to apply and (2) may yield considerable performance gain. These techniques
are serialization, loop interchange, and loop fusion. They are applicable to loops,
which are often the focus of the shared memory programming model. All of these
techniques are present in modern compilers. However, compilers may not have enough
knowledge to apply them most profitably ����, and some code sections may need small
modifications before the techniques become applicable automatically.
Performance map description

We have devised criteria for the application of these techniques, which are shown
in Table ���. If the speedup of a parallel loop is less than 1, we assume that the loop
is too small for parallelization or that it requires extensive modification. Serializing it
prevents performance degradation. Loop interchange may be used to improve locality
by increasing the number of stride-1 accesses in a loop nest. Loop interchange is
commonly applied by optimizers; however, our case study shows many examples of
opportunities missed by the backend compiler. Loop fusion can likewise be used to
increase both granularity and locality. The criteria shown in Table ��� represent
simple heuristics and do not attempt to be an exact analysis of the benefits of each
technique. We simply assumed a speedup threshold of ��� for applying loop
fusion.

Table ���: Optimization technique application criteria.

Technique         Criterion
Serialization     speedup < 1
Loop Interchange  # of stride-1 accesses < # of non-stride-1 accesses
Loop Fusion       speedup < ���
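These criteria can be expressed in a few lines of code. In the sketch below, the fusion threshold is a stand-in constant, since it is only an assumed heuristic value, and the function name and inputs are invented for illustration:

```python
def suggest(speedup, stride1_accesses, other_accesses,
            fusion_threshold=1.5):  # assumed threshold, for illustration
    """Apply the performance map's simple heuristics to one parallel loop."""
    suggestions = []
    if speedup < 1.0:
        suggestions.append("serialize")          # loop too small to parallelize
    if stride1_accesses < other_accesses:
        suggestions.append("loop interchange")   # improve locality
    if speedup < fusion_threshold:
        suggestions.append("loop fusion")        # increase granularity
    return suggestions

assert suggest(0.8, 10, 30) == ["serialize", "loop interchange", "loop fusion"]
assert suggest(2.0, 30, 10) == []
```
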
Experiment

We have applied these techniques based on the criteria presented above. We have
used a Sun Enterprise ���� with six ���-MHz UltraSPARC processors. The OpenMP
code is generated by the Polaris OpenMP backend. The results on five programs
are shown. They are SWIM and HYDRO2D from SPEC95, SWIM from SPEC2000, and
ARC2D and MDG from the Perfect Benchmarks. We have incrementally applied these
techniques starting from serialization. Figure ��� shows the speedup achieved by the
techniques. The improvement in execution time ranges from ��� for fusion in ARC2D
to ���� for loop interchange in SWIM (SPEC2000). For HYDRO2D, application of the Merlin
suggestions did not noticeably improve performance.
Fig. ���: Speedup achieved by applying the performance map. The speedup is with
respect to a one-processor run with serial code on a Sun Enterprise ���� system. Each
graph shows the cumulative speedup when applying each technique.

Among the codes with large improvement, SWIM from SPEC2000 benefits most
from loop interchange. It was applied under the suggestion of Merlin to the most
time-consuming loop, SHALOW DO����. Likewise, the main technique that improved
the performance in ARC2D was loop interchange. MDG consists of two large loops
and numerous small loops. Serializing these small loops was the sole reason for the
performance gain. Table ��� shows a detailed breakdown of how often the techniques
were applied and their corresponding benefit.
Using this map, considerable speedups are achieved with relatively small effort.
Novice programmers can simply run Merlin to see the suggestions made by the
map. The map can be updated flexibly without modifying Merlin. Thus, if new
techniques show potential or the criteria need revision, expert programmers can
easily incorporate the changes.
��� Efficiency of the Tool Support
In order to quantitatively evaluate the efficiency of the tool support, we have
performed an experiment with the help of actual tool users. We prepared a set of
small tasks that are commonly done by parallel programmers, and asked users to
accomplish these tasks with and without our tools. In addition, we have asked the users
of the tools a series of questions to gather their opinions on the tools and their usage.
The questions targeted the functionality of the tools as well as general comments on
the methodology. We present the results in the following sections.

Table ���: A detailed breakdown of the performance improvement due to each technique.

Benchmark        Technique      Number of Modifications / Improvement
ARC2D            Serialization  �����
                 Interchange    �� ��
                 Fusion         �� ��� �
HYDRO2D          Serialization  �� �����
                 Interchange    � ����
                 Fusion         � ���
MDG              Serialization  �� ����
                 Interchange    � ����
                 Fusion         � ����
SWIM (SPEC95)    Serialization  � ����
                 Interchange    � ����
                 Fusion         ���
SWIM (SPEC2000)  Serialization  � ����
                 Interchange    � ����
                 Fusion         � ���
����� Facilitating the tasks in parallel programming

Common tasks in parallel programming

The main objective of the experiment is to produce quantitative measures for the
efficiency of the tools' functionality. To this end, we have selected 10 tasks that are
commonly performed by parallel programmers using parallel directives. These tasks
are listed in Table ���.

Table ���: Common tasks in parallel programming.

task 1:  compute the speedup of the given program on � processors in terms of the serial execution time.
task 2:  find the most time-consuming loop based on the serial execution time.
task 3:  find the inner and outer loops of that loop.
task 4:  find the caller(s) of the subroutine containing the most time-consuming loop.
task 5:  compute the parallelization and spreading overhead of that loop on � processors.
task 6:  compute the parallel efficiency of the second most time-consuming loop on � processors.
task 7:  export profiles to a spreadsheet to create a total execution time chart
         (on a varying number of processors) containing � of the most time-consuming loops.
task 8:  count the loops whose speedups are below �.
task 9:  count the loops that are parallel and whose speedups are below �.
task 10: compute the parallel coverage and the expected speedup based on Amdahl's Law.
Task 1: compute the speedup of the target program. The speedup of the
entire program is perhaps the most frequently used metric in computational
engineering. The changes made (parallelization or any other type of optimization) are
evaluated by the speedup gain in program execution time. The instrumentation to
measure program execution time is simple, and any calculator can be used to compute
this number.
Task 2: find the most time-consuming code sections. Finding the dominant
code sections using profiles is the most important task in performance tuning. Most
users would look into the summary files generated from program execution with a text
editor. In this case, users would have to run a text editor (menu clicking or typing the
command on a shell) and find the most time-consuming loop in the file. Looking for
the largest quantity among many numbers would take a significant amount of time,
at best on the order of minutes. Some users suggested using the "sort" command
available from UNIX as follows:

$ cat name.sum | sort -r -k �

This produces a sorted list of summary file entries quickly, but users have to remember
the column number to sort by, and the amount of text to type is not trivial. Moreover,
if multiple files need to be presented for comparison, the sorting command cannot be
used. By contrast, using the Ursa Minor tool, the task can be accomplished by (1)
activating the tool (typing "UM"), (2) loading the profile (menu clicking), and (3)
sorting based on the column the user chooses (popup menu clicking).
Task 3: find inner and outer loops of a specific loop. Increasing the granularity
of parallel execution is an important technique for improving parallel performance.
This involves looking into the inner or outer loops of the loop under consideration. There
are no other tools that explicitly support this task. Programmers would have to use
a text editor to find the loop and examine the source to figure out the loop nest. The
Structure View of Ursa Minor significantly simplifies this task. Users only need to
load the compiler listing file (menu clicking, scrolling, and mouse clicking), find the
section (scrolling or using the "Find" feature), and look at the display.
Task 4: find the caller(s) of a specific subroutine. The presence of function
or subroutine calls may cause the parallelizing compiler to abandon optimizing loops.
Users' knowledge of the target program can be of great use in such cases. Finding
the callers and callees of a subroutine or a function is an essential task in optimizing
nested subroutines and loops with subroutine calls. Normally, programmers would
have to examine the program source to accomplish this task. UNIX utilities such
as "grep" can be useful. The Structure View from Ursa Minor provides one-click
support for finding "parents" and "children" of selected code sections.
Task 5: compute overheads. Identifying performance problems requires defining
first what the problems are. Metrics such as parallelization and spreading
overheads are frequently used variables in the problem definitions. Consequently,
computing these metrics is a critical step in locating performance problems. One of the
conventional methods of computing the overheads involves a calculator. When users
need to compute overheads for multiple code sections, a commercial spreadsheet or
special-purpose scripts can provide an easier way. The mathematical functions
provided by Ursa Minor also support the derivation of new metrics from the existing
data. This set of functions specifically targets parallel programming, so many of the
metrics commonly used in parallel programming are included in the set. In the
current version, however, the parallelization and spreading overheads are not directly
supported.
Task 6: compute parallel efficiencies. Parallel efficiency is another widely used
measure for evaluating parallel performance. Parallel efficiency E(P) on P processors
is defined as

    E(P) = Tserial / (P · Tparallel(P))                         (���)

Users can compute this number using a calculator or a spreadsheet. Ursa Minor
provides a function that computes parallel efficiency.
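As a small illustration, the definition can be coded directly; the timing numbers below are made up:

```python
def parallel_efficiency(t_serial, t_parallel, p):
    """E(P) = Tserial / (P * Tparallel(P))."""
    return t_serial / (p * t_parallel)

# A perfectly scaling section has efficiency 1.0:
assert parallel_efficiency(8.0, 2.0, 4) == 1.0
# Half the ideal speedup gives efficiency 0.5:
assert parallel_efficiency(8.0, 4.0, 4) == 0.5
```
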
Task 7: export profiles to a spreadsheet to create charts. An integrated toolset
offers an advantage in that exchanging files is easier. Data files typically take one
form or another, and converting them into a form that other tools understand may
not be trivial. Commercial spreadsheets do a good job of importing text-based tabular
data files such as timing profiles and can create a variety of graphs. Combining multiple
summary files becomes difficult, however. Without Ursa Minor, users would have
to create a comma-separated file using Awk or Sed scripts. Adding profiles and
arranging data for exporting are frequently used features of Ursa Minor; often, this
can be done within a minute. In addition, Ursa Minor can create charts
from any columns or rows that a user selects.
Task 8: count loops that have problems. This is another example that
emphasizes the perspective on the overall performance. Users should be able to view
the resulting performance in terms of large blocks of code sections, and that means
dealing with multiple loops that dominate the overall performance. There is no
direct support for this task in either Ursa Minor or commercial spreadsheets, but a
sequence of operations can accomplish the task.
Task 9: count parallel loops that have problems. The combined analysis
of performance data and static program data such as compiler listings is more efficient
in locating performance problems. This task is a simple example of such a
case. Depending on the focus of the optimization (parallel optimization or general
locality optimization), combining the information on the parallel nature of code blocks
with their performance figures is much more efficient than dealing with each aspect
separately. Conventional tools do not support this approach. The query functions
available in Ursa Minor are designed specifically to help users comprehend the two
different kinds of data in the same context.
Task 10: compute the expected speedup based on Amdahl's law. This
task represents a multi-step process of performance evaluation. Amdahl's law
provides a simple performance model that can be used to evaluate actual performance.
Computing the expected speedup based on Amdahl's law requires computing the
parallel coverage of the target program and several steps of computation. This task
was selected to test how users use the tools to accomplish a rather complex goal. Users
are expected to use a combination of tools for this task.
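The computation itself can be sketched as follows; the coverage fraction and processor count are invented example values:

```python
def amdahl_speedup(parallel_coverage, p):
    """Expected speedup when a fraction `parallel_coverage` of the serial
    execution time runs perfectly in parallel on p processors."""
    return 1.0 / ((1.0 - parallel_coverage) + parallel_coverage / p)

# E.g., 95% parallel coverage on 4 processors (made-up numbers):
s = amdahl_speedup(0.95, 4)
assert abs(s - 1.0 / (0.05 + 0.95 / 4)) < 1e-12
assert s < 4  # the serial fraction bounds the achievable speedup
```
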
Task 1 is a simple calculation, so users are expected to use either a calculator
or the Expression Evaluator from Ursa Minor with comparable efficiency. Task 2
evaluates the table manipulation utilities (sorting and rearranging) for performance
data. Tasks 3 and 4 target the efficiency of the Structure View and the utilities that
it provides. The Expression Evaluator is the main target for evaluation in tasks 5
and 6. Task 7 tests the ability to rearrange tabular data and export them to other
spreadsheet applications. The rest of the tasks (8, 9, and 10) attempt to evaluate
the combined usage of multiple utilities (sorting, the Expression Evaluator, query
functions, the static information viewer, and the display option control) provided by
Ursa Minor.
Experiment

We have asked four users to participate in this experiment. They were asked
to perform these tasks one by one. Two different datasets were prepared for the
experiment. These datasets contain timing profiles of FLO52Q from the Perfect
Benchmarks ���� under two different environments. Thus, the number of data items
is the same in both datasets, but the profile numbers are different. First, these users
were asked to perform the tasks without our tools. Users were allowed to use any
scripts that they had written previously. Then, they performed the tasks using our
tools with the other dataset.

The time to activate tools (spreadsheet, Ursa Minor, and so on) and load input
files was counted separately as "loading time". The reason for this is that when users
perform these individual tasks separately under different environments, the loading
time needs to be added to the time taken to finish each task. Since the users performed
the tasks in one session, they needed to activate the tools only once. The time to convert
data files for different tools is also included in the loading time. Hence, the loading
time also reflects the level of integration of the tools.
The four users who participated represent different classes of users. User 1 is an expert
performance analyst who has written many special-purpose scripts to perform various
jobs. These scripts do tabularizing, sorting, etc. User 1 does use our tools but relies
more on these scripts. User 2 has also been working on performance evaluation for
a while and is considered an expert as well. He uses only basic UNIX commands
rather than scripts. However, his skills with the basic UNIX commands are very good,
so he can perform a complex task without taking much time. User 2 started using
our tools only recently. User 3 is also an expert performance analyst, but his main
target programs are not shared memory programs. He has been using our tools for a
long time, but with distributed memory programs. Finally, user 4 is a novice parallel
programmer. His experience with parallel programs is limited compared to the
others. He has read our methodology and tried to use our tools in his benchmarking
research.
Table ���: Time (in seconds) taken to perform the tasks without our tools.

          user 1  user 2  user 3  user 4  average
task 1    �       �       ��      ��
task 2    �       �       �       ��      ����
task 3    ��      ��      ��      ��      ����
task 4    ��      �       ��      ��
task 5    ��      ��      ��      ��      �
task 6    ��      ��      ��      ��      �����
task 7    ��      ���     ��      ���     ����
task 8    �       ���     �       ���     ���
task 9    ���     ���     ���     ��      �����
task 10   ���     ���     ���     ��      ���
loading   �       ��      ��      �       ������
total     ���     ����    ����    �����   ����
Table ��� shows the time for these users to perform the assigned tasks. Users �, �,
and � decided that tasks � and � could not be performed within a reasonable time, so
they gave estimated times instead. All of the users used a commercial spreadsheet
later in the session, but user 4, the novice programmer, started doing the tasks after
he set up the spreadsheet and imported the input files. User 1 used his scripts for
many of the tasks.

As the second part of the experiment, the users were allowed to use our tools to
perform the tasks. The results are shown in Table ���. User 1 used a combination of
a spreadsheet and Ursa Minor to perform tasks �, �, and �. The others used a
spreadsheet for task � only. User � was not sure that he could finish task � even with
our tool support, so he gave an estimated time.
Table ���: Time (in seconds) taken to perform the tasks with our tools.

          user 1  user 2  user 3  user 4  average
task 1    �       �       ��      �       ���
task 2    �       �       �       ����
task 3    �       �       �       ���
task 4    �       �       �       ����
task 5    ��      ��      ��      ��
task 6    �       �       �       �       �
task 7    ��      ��      ���     ��      ����
task 8    �       ��      ��      ��      ��
task 9    ��      ��      �       ��      ��
task 10   �       ���     ��      ��      ������
loading   ��      ��      ��      ��      �����
total     ��      ���     ���     ����    �����
As can be seen from these tables, our tool support considerably improves the time to perform
common parallel programming tasks. Figure ��� shows the overall times
to finish all the tasks. As can be seen in the figure, our tool support not only
saves time, but also makes the process easier for novice programmers, resulting in
comparable times to perform the tasks when using our tools. The work speedups for
the users are ���, ����, ����, and ���, respectively.

The strength of our approach lies not only in the fact that the tools offer efficient
ways of performing these individual tasks, but also in that these features are provided
in an integrated toolset. This is demonstrated by the savings in the loading time
in our experiment. Users do not have to deal with several tools and commands.
There is no need to open the same file in many different tools. For instance,
users can open the Structure View to inspect the program layout and examine and
restructure the performance data from the same database. Taking this advantage
into consideration, our tool support becomes even more appealing.

Fig. ���: Overall times to finish all 10 tasks.
����� General comments from users

We summarize users' comments on various tool features in this section. Users have
responded very positively to the Structure View of Ursa Minor. We have received
comments such as "There is no alternative that I know of that gives as good of an
overview of the program structure quickly," or "If I am looking at a new program, one
that I am unfamiliar with, I almost always look at its structure with Ursa Minor
to get a feel for its layout." Although not specified in the methodology, many users
examine program sources before they begin working on optimization. The Structure
View is offering vital help to those users.

The Table View has gotten good reviews as well. One response was "The Table
View is good. I like its ability to combine multiple types of data." In addition, users
liked the bar graph at the right side of the Table View, which visualizes numeric data
instantly. The Expression Evaluator also proves to be very useful, allowing users to
compute different metrics on demand. One user listed "integration of tools in a parallel
performance specific manner" as one of the reasons for using our tools. However, some
users were not fully content with the cumbersome interface to move, swap, and arrange
columns. Also, the limited graphing capabilities were pointed out as one of the weak
points of Ursa Minor. Overall, the many versatile features provided by Ursa Minor
are greatly appreciated by users.
InterPol is still relatively new to users and has not been used much.
Furthermore, we feel that there remain issues to be resolved with respect to documentation
and user interface. Consequently, we did not get much feedback from users. As
InterPol gets more recognition from users with an improved interface and documentation,
we anticipate that users will actively utilize the tool and return to us with quality feedback.

As the tools evolve in a need-driven way, the feedback from the user community
will provide invaluable direction for the next generation of our tool family. We
expect future upgrades of the tools to incorporate users' opinions. For instance,
the weakness in the GUI can be resolved with newly available Java technology.
Developers need to monitor users' needs and wishes constantly to keep up with
current state-of-the-art parallel programming practices. Keeping the tool design
projects and users' application characterization efforts close together will ensure the
practicality of our tools in the future.
��� Comparison with Other Parallel Programming Environments

In Chapter �, we have listed several parallel programming environments: Pablo
and the Fortran D editor ����, SUIF Explorer ����, FORGExplorer ����, the KAP/Pro
Toolset ����, the Annai Tool Project ����, DEEP/MPI ����, and Faust ����. We present
in this section a more detailed comparison of our toolset with these environments.
Table ��� shows the availability of features in these environments. The parallelization
utility available from the Pablo/Fortran D Editor is actually semi-automatic.

Table ���: Feature comparison of parallel programming environments. The features
compared are: performance data visualization, program structure visualization,
compiler analysis output, automatic parallelization, interactive compilation,
support for reasoning, automatic analysis/guidance, and debugging.

Pablo/Fortran D Editor   4 of the features
SUIF Explorer            5 of the features
FORGExplorer             3 of the features
KAP/Pro Toolset          3 of the features
Annai Project            2 of the features
DEEP/MPI                 3 of the features
Faust                    5 of the features
Ursa Minor/InterPol      7 of the features (all but debugging)

Other than the debugging capability, the Ursa Minor/InterPol pair covers all of
the functionalities listed in the table. In addition, our environment has unique features
not available from the others. Ursa Minor's ability to freely manipulate and
restructure performance data is unprecedented among programming environments.
Furthermore, Ursa Minor allows performance data to be integrated with static analysis
data through a set of mathematical and query functions. A performance guidance
system such as Merlin has not been attempted by the others, either. SUIF Explorer's
Parallelization Guru only points to important target code sections. DEEP/MPI's
advisor is limited to hard-coded procedure-level analysis, so detailed diagnosis of
smaller code blocks is not possible. InterPol allows users to "build" their own
parallelizing compiler; no such feature is available in other tools. Overall, the Ursa
Minor/InterPol toolset offers the most versatile and flexible features to date.
Perhaps the most outstanding aspect of our toolset is its accessibility. As opposed
to most other environments, which have ceased to exist or are no longer supported, Ursa
Minor exists in Web-accessible form. Any user with an Internet connection can use
the tool with the help of complete on-line documentation. Such a quality is not easily
found in most tool development projects. The topic of the next section is the efficiency
of our tools placed on the World Wide Web.
��� Comparison of Ursa Major and the Parallel Programming Hub

In an effort to reach a larger audience with our tools, we have used network
computing concepts to implement an on-line tuning data repository (Ursa Major)
and a Web-executable integrated tool environment (the Parallel Programming Hub).
Ursa Major is an Applet-based data visualization and manipulation tool for a
repository of optimization studies. The Parallel Programming Hub allows users to
access and run tools without the hassle of searching, downloading, and installing
them.

The Parallel Programming Hub contains Ursa Minor, and Ursa Major uses
many components from the Ursa Minor tool and provides almost identical
functionality. This presents an interesting opportunity to compare and evaluate different
approaches to network computing. In this section we compare the efficiency of Ursa
Minor on the Parallel Programming Hub and Ursa Major. We provide qualitative
and quantitative measures. With this comparison, we attempt to provide directions for
the next generation of on-line tools. This work was presented in ����.

Batch-oriented tools run as efficiently on the Parallel Programming Hub as on
local platforms. In fact, thanks to the PUNCH system's powerful underlying machine
resources, most users' tools have faster response times on the Hub. Interactive tools
need closer inspection.
A typical tool interaction with Ursa Minor causes the tool to fetch from a
repository a program database that represents a specific parallel programming case
study. It then performs various operations on this database and displays the results
using Ursa Minor's visualization utilities. Table ��� shows how server, client, and
file operations are invoked by various tasks of the tool.

Table ���: Workload distribution on resources with our network-based tools.

Task                   Ursa Minor                        Ursa Major
application execution  server                            client Applet
database load          local disk IO + server            network transfer + client Applet
display                network transfer + client (VNC)   client Applet

In a typical interactive tool session, a user loads input files, runs computing
utilities on the data, and adds more files for further manipulation. From this scenario,
we chose three tool operations. We have measured the time taken to load a database,
perform a simple spreadsheet-like operation on the data, and search and display a
portion of the source code. The database load is an example of loading input data, while
spreadsheet command evaluation is representative of computing on the data. The source
search operation requires a simple search through a source code. Interestingly, these
three operations exhibit different patterns in resource usage. For Ursa Major, the
database load operation requires downloading the database, parsing it, and updating
the display appropriately. Hence, it exercises both networking and computing
capabilities. The second operation, evaluation of a spreadsheet command, performs
a mathematical operation on the data that the Applet has already downloaded, so it
only involves computing on the client machine. The search operation mainly relies on
networking. A source file is not part of the database; hence it has to be downloaded
separately. For Ursa Minor, data transfer over the network is replaced by file IO.
However, the response to a user action has to be updated on the display of the remote
client machine.
We chose two different databases for this experiment, representing a small and a
large application study, respectively. The first database contains tuning information
for the program BDNA from the Perfect Benchmarks ����. The database size is
about �� Kbytes, and the accompanying source file is about ��� Kbytes. We consider
this to be a small database. The second database contains information about the
parallelization of the RETRAN code ����, which represents a large power plant
simulation application. The database we used is ��� Kbytes in size, and the size of
the source is about ��� Mbytes.
Finally, we chose three machines on which we measured the tool response times.
"Networked PC" is a PC with a ���MHz Pentium II and � Mbytes of memory. Its operating system is Windows NT, and it is connected to the Internet through a � Mbps Ethernet card. "Dialup PC" is a home PC with a ���MHz Pentium II processor and � Mbytes of memory. Its operating system is Windows��, and it connects to the Internet through a ����K modem via a local ISP. The third machine, "Networked Workstation", is an UltraSPARC workstation with a � MHz processor and � Mbytes of memory. Its operating system is SunOS v�� , and its network bandwidth is � Mbps.
We measured the response time of the three operations at �-hour intervals over several days using a Netscape browser v���. We inserted timing functions for Ursa Major and used an external wall clock for Ursa Minor on the Parallel Programming Hub. We made � measurements for each case. The average times are shown in Figure ���, which displays the response time in seconds on the three machines for the three measured tool operations. "rt-load" refers to the response time to load the RETRAN database; "rt-eval" and "rt-search" refer to the time to perform spreadsheet command evaluation and source search, respectively. The data tags with prefix "bd" refer to the same operations on the BDNA database.
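The measurement loop behind these numbers is simple: run each operation repeatedly and average the wall-clock times. The sketch below is a hypothetical Python stand-in for this procedure (the actual experiment used timing functions inside the Java tools and an external clock); the `fake_database_load` operation is invented for illustration.

```python
import time

def measure(operation, repetitions=5):
    """Run an operation repeatedly and return the average
    wall-clock response time in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)

# Stand-in for a "database load": parse a chunk of comma-separated text.
def fake_database_load():
    data = "item,value\n" * 10000
    rows = [line.split(",") for line in data.splitlines()]
    return len(rows)

avg = measure(fake_database_load)
print(f"average response time: {avg:.4f} s")
```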
Overall, the networked PC exhibits the shortest response times for all operations. On this machine, the response times of Ursa Minor and Ursa Major are in the same vicinity. However, downloading a large program source significantly increases the response time of the search operation, despite the Ethernet connection. In the case of Ursa Minor, files are read through file IO within the server, so the network is not a dominating factor. The dialup PC displays adequate response times except for the search operation with Ursa Major; the network bottleneck is even more pronounced in this case. The networked workstation does not suffer substantially from its network connection, but its slow processor and relatively inefficient implementation of the Java Virtual Machine (JVM) make it the worst performing platform among the three.
Fig. ���. The response time of UM-Applet and UM-ParHub on (a) a networked PC, (b) a networked workstation, and (c) a dialup PC.

The response time on the three different machines for each operation, as shown in Figure ���, offers a different perspective. We only present the data for the operations on the RETRAN database, because those on the BDNA database show similar trends and the characteristics are more pronounced in the RETRAN case. The response time of Ursa Minor shows no noticeable variation across the three machines except on the dialup PC, where the spreadsheet command evaluation takes more than twice as long as on the others. This operation is not time-consuming, so a screen update becomes a factor with the slow modem connection. For Ursa Major, the platform becomes the deciding factor. If the network is slow, the search operation degrades; for compute-intensive operations, the machine speed and the quality of the JVM determine the response time. In all, the Hub-based tool performs better than the Applet-based version.
Fig. ���. The response time of the three operations on the RETRAN database: (a) loading, (b) spreadsheet command evaluation, and (c) source searching.
Our experiments show that the Parallel Programming Hub offers users a fast and stable solution to interactive network computing. The network transmits only the user's actions (pressing buttons and clicking a mouse) to and from the server, so neither the network nor the processor speed had much impact on tool usage in our experiment. By contrast, Applet-based tools rely on the client machine for computation and on the network for data transfer. Thus, if the amount of data is large or the client machine is slow, the resulting operations take considerably longer. The two networked machines we used are located within the Purdue network; we expect these performance characteristics to be even more pronounced on geographically distributed machines.
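This tradeoff can be captured by a first-order cost model: the Applet pays for transferring the data, the Hub pays only for transferring GUI events and screen updates. The sketch below is illustrative only; the sizes, bandwidth, and compute times are invented, not measured values from the experiment.

```python
def applet_response(data_bytes, bandwidth_bps, compute_s):
    # Applet: download the data, then compute locally on the client.
    return data_bytes * 8 / bandwidth_bps + compute_s

def hub_response(event_bytes, bandwidth_bps, compute_s):
    # Hub (VNC-style): only GUI events and screen updates cross the
    # network; computation happens on the server.
    return event_bytes * 8 / bandwidth_bps + compute_s

# A large source file over a modem strongly favors the Hub.
applet = applet_response(data_bytes=3_500_000, bandwidth_bps=56_000, compute_s=0.5)
hub = hub_response(event_bytes=20_000, bandwidth_bps=56_000, compute_s=0.5)
print(f"applet: {applet:.1f} s, hub: {hub:.1f} s")
```

The model reproduces the qualitative finding: the Hub's cost is nearly independent of the data size, while the Applet's cost grows with it.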
Although not as responsive as the Hub-based Ursa Minor, Ursa Major serves a distinct purpose. The accumulated repository of tuning studies helps users all over the world study the results from other researchers and compare results across different platforms. Users with above-average machines can take advantage of quick response times by running the application on them. The slow screen updates and sluggish mouse control that a slow network connection may cause for Ursa Minor are not a problem with Ursa Major.

An increasing number of users from all over the world are taking advantage of the Parallel Programming Hub. Ursa Minor itself has been accessed � �� times since it became operational in March of ����. As the hub adds tools and gains more recognition in the worldwide parallel programming community, we expect the number of accesses to grow at a faster rate.
��� Conclusions
In this chapter, we have evaluated the proposed methodology and its tool support. We have presented several case studies showcasing the usage of the tools in various parallelization and tuning studies. In many studies we did at Purdue, the proposed approach to performance tuning has resulted in considerable improvement in the end results. Many features provided by the tools are actively used by programmers, and, most of all, they are contained within an integrated tool environment.

In addition, we have focused on small individual tasks and shown how the tools can effectively assist users by simplifying time-consuming chores and making difficult obstacles more approachable. The sample tasks we used are commonly performed in all tuning studies, and users save considerable time and effort by using our tools. The experimental results show that our tools provide efficient support for many common tasks in parallel programming. In particular, the Expression Evaluator offers significant aid in deriving new data and computing metrics. Another unique feature, the Merlin performance advisor, simplifies the task of performance analysis considerably, as shown in the case studies.
Finally, we have evaluated the efficiency of the two different frameworks that we used to broaden the user community for our tools through network computing. Overall, the Hub-based Ursa Minor exhibited fast and uniform response times, especially in cases where large data transfers are required. On the other hand, Ursa Major does not suffer from sluggish control when the network is slow, but the time to transfer the requested data depends on the size of the database. Nevertheless, the purposes of these two tools are distinct, and both offer significant aid to parallel programmers worldwide.

As mentioned in the beginning, evaluating a methodology and tools is challenging work. This chapter represents our attempt to find ways to do so in both qualitative and quantitative terms. We would like to point out that this is not the end of our work towards a comprehensive parallel programming environment. Continuous feedback from its user community will help improve the tools' service to a wide range of parallel programmers.
�� CONCLUSIONS
��� Summary
When we first started out as novice parallel programmers, we had little experience in the area. Every problem that we encountered seemed formidable and impossible to resolve. We had to resort to experts for almost every task in the optimization process; we did not know what to do, or how to do it, at practically every step of the way. After a long period of trial and error, we developed our own paradigm for parallelizing and tuning programs. As our methodology was refined over the years, the tasks became routine, and, most of all, we were seldom puzzled or frustrated by seemingly unexpected results. The methodology gave us the confidence that we could always find the cause of unexpected anomalies and explain the phenomena.
As more members joined our group, however, another problem arose. New members of the group experienced just about the same frustration and dismay as we had. There were no publications that spoke of a parallel tuning methodology in terms that both expert and novice programmers could comprehend. Our experience had not yet been documented, and the tools that intimately support it did not exist. Part of the motivation for this work stems from the need to address this problem.

Now, with the proposed methodology and tools, we believe that the framework for a structured approach to parallel programming is firmly in place. With the gaining momentum of the shared memory programming model, we feel that many users could benefit from this environment. Such a comprehensive approach, covering a wide range of tasks in parallel programming, has not been attempted previously.
The specific contribution of the work presented in this thesis is a unified framework for our approach to parallel program development. This includes a parallel programming methodology and a set of tools that support this underlying practice. Our work accomplishes this by achieving the following goals that we set out earlier.
Structured Parallel Programming Methodology The methodology described in Chapter � lists the tasks that need to be performed in each step and detailed suggestions that users may consider. Users obtain significant guidance because the objective of each stage is clear. At the same time, the methodology is applicable regardless of the underlying platform, the algorithms used by the target program, or even the tools that programmers use. It is well organized and easy to follow, even for novice programmers.
Integrated Use of Parallelizing Compilers and Evaluation Tools The combined use of Ursa Minor and InterPol or Polaris achieves this. Code segments are labeled as "Program Units" that work across both of these tools. Profile data provides insights into the dynamic behavior of the program at hand, which in turn can be used to further improve performance. Through interactive use of these tools, which speak the same terminology, programmers get a clearer understanding of the program.
Integration of Static Analysis Information and Performance Data Ursa Minor's ability to search and display the source significantly helps users understand a program's structure. In addition, Ursa Minor understands the compiler's findings and combines them into the same picture. The query functions available in Ursa Minor allow users to combine static analysis data with performance data in meaningful ways.
Support for Users' Deductive Reasoning One of the greatest strengths of the Ursa Minor tool is its support for users' deductive reasoning. The Expression Evaluator enables reasoning about the data in numerous ways. Users can compute any metric without modifying or updating the tool. The newly created data can be manipulated and visualized like any other data, so that the tool stays with users throughout their reasoning process.
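The idea of computing arbitrary metrics over existing data can be illustrated with a minimal sketch (hypothetical Python, not the actual Expression Evaluator; the column names and values are invented): an expression over named columns is evaluated element-wise, and the result simply becomes another column.

```python
# Invented example data: per-section execution times.
columns = {
    "serial_time":   [10.0, 8.0, 12.0],
    "parallel_time": [ 2.5, 4.0,  3.0],
}

def evaluate(expression, columns):
    """Evaluate an arithmetic expression element-wise over named
    columns, returning a new column of the same length."""
    length = len(next(iter(columns.values())))
    result = []
    for i in range(length):
        env = {name: values[i] for name, values in columns.items()}
        result.append(eval(expression, {"__builtins__": {}}, env))
    return result

# A derived metric becomes ordinary data, ready for display or reuse.
columns["speedup"] = evaluate("serial_time / parallel_time", columns)
print(columns["speedup"])   # [4.0, 2.0, 4.0]
```

The key property is the last line: the derived column is indistinguishable from loaded data, which is what lets the tool follow the user's reasoning without being modified.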
Potential of Automatic Performance Evaluation Merlin has shown the potential of automatic analysis of performance and static data. It eases the "transfer of experience" from advanced to novice programmers, and tedious analysis steps can be greatly simplified.
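In essence, such a "transfer of experience" encodes expert rules that map observed symptoms to suggested remedies. The following is a hypothetical sketch of that structure (the real Merlin maps relate performance and static analysis data inside Ursa Minor; the rule thresholds and metric names here are invented):

```python
# Hypothetical advisor rules: each pairs a symptom test on per-loop
# metrics with an expert suggestion.
RULES = [
    (lambda m: m["speedup"] < 1.0,
     "Loop slows down in parallel: consider serializing it."),
    (lambda m: m["speedup"] < 0.5 * m["processors"] and m["iterations"] < 100,
     "Low speedup on a small loop: parallel startup overhead may dominate."),
]

def advise(metrics):
    """Return the suggestions whose symptom tests fire on these metrics."""
    return [advice for test, advice in RULES if test(metrics)]

loop = {"speedup": 0.8, "processors": 4, "iterations": 40}
for suggestion in advise(loop):
    print(suggestion)
```

Because the rules live in data rather than in the tool's code, an expert can extend the rule set without touching the analysis engine, which is the property that makes the approach attractive for novices.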
Global Accessibility Having Ursa Minor on the Parallel Programming Hub has opened the door for programmers worldwide to evaluate and use the tool without worrying about searching, downloading, and installing, and compatibility issues are nonexistent. Also, Ursa Minor provides the global parallel programming community with a database of parallel programming studies that can be easily manipulated and visualized.
��� Directions for Future Work
Many promising directions for further work suggest themselves.
Support for Other Parallel Programming Languages and Models As the concept of parallel programming spreads across many programming languages, the ability to support other general-purpose languages such as Java or C++ would promote tool usage even further. The structure of the Ursa Minor database is not limited to Fortran and can support these languages. However, a few language-sensitive features would have to be reworked. In addition, automatic instrumentation and its accompanying tasks (the code segment naming scheme and the incorporation of compiler listings) need careful consideration. Supporting other programming models can be significantly more difficult: radically different parallel constructs and programming styles call for a new methodology to begin with. It will be interesting to see if and how the program-level approach to parallel programming can be applied to other programming models.
Support for Program Execution Traces The shared memory programming model inherently poses problems for parallel trace generation. Processor communications are implicit and frequent, so generating accurate traces is difficult. However, selecting the right events and performing moderate summarization can make it feasible. Timeline analysis is often critical in identifying problems such as load imbalance.
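Moderate summarization can be sketched as aggregating raw events into fixed time intervals, so that load imbalance remains visible on a timeline without storing every event. The Python below is a hypothetical illustration; the event record fields (processor, timestamp, work) are invented.

```python
from collections import defaultdict

def summarize(events, interval):
    """Collapse (processor, timestamp, work) records into per-interval,
    per-processor totals -- enough to spot load imbalance."""
    buckets = defaultdict(float)
    for proc, timestamp, work in events:
        buckets[(int(timestamp // interval), proc)] += work
    return dict(buckets)

# Processor 1 does far more work than processor 0 in the first second.
events = [(0, 0.1, 5.0), (1, 0.2, 20.0), (0, 0.9, 5.0), (1, 1.4, 10.0)]
print(summarize(events, interval=1.0))
```

The summary grows with the number of intervals and processors rather than with the number of events, which is what makes tracing implicit, frequent shared-memory communication tractable.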
Parallel Program Debugging Parallel program debugging is an entirely different field of study, and many challenging tasks would have to be planned for and accomplished. As a programming environment, however, the addition of debugging capability to the toolset would greatly enhance its applicability.
On-line Generation of Data Files Further integration of Ursa Minor, the Polaris parallelizing compiler, and the runtime environment will yield an even more comprehensive environment. Supporting parallelization, compilation, and execution through a single tool would provide a highly integrated perspective and make parallel programming most approachable for novice programmers. The possibility of running and monitoring parallel execution from a remote machine has been shown by InterAct. Issues such as single-user time and Ursa Minor's portability need to be resolved first.
Getting More Information from Compilers There is still plenty of information that is kept internal to a parallelizing compiler. Extracting more useful data from a compiler and presenting it to users would have to be the top priority for the ongoing evaluation/optimization tool project.
Visual Development of Merlin Maps Merlin is still in its infancy and needs more feedback and refinement. Foremost of all is the interface for developing a map. Although Merlin maps are well structured in format, programmers currently rely on conventional text editors to create a map. A better, possibly graphical, user interface would make expert programmers' jobs much easier.
Global Information Exchange among Parallel Programmers Ursa Major has demonstrated the possibility of global communication and cooperation among parallel programmers worldwide. The obvious next step would be the exchange of performance data among remote parallel programming and computer systems researchers. With the proper support from the Ursa Major tool, such as the ability to submit a database, this is a definite possibility. The integrated toolset of the Parallel Programming Hub will continue to promote the usage of our databases. Advances in technology are usually the result of such combined efforts.
LIST OF REFERENCES
�� L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computing in Science and Engineering, ������–���, January ����.
��� B. L. Massingill. A structured approach to parallel programming: Methodology and models. In Proc. of ��th IPPS/SPDP'�� Workshops, Held in Conjunction with the ��th International Parallel Processing Symposium and �th Symposium on Parallel and Distributed Processing, pages ���–���, ����.
��� P. B. Hansen. Model programs for computational science: a programming methodology for multicomputers. Concurrency: Practice and Experience, ��������–����, August ����.
��� T. Rauber and G. Runger. Deriving structured parallel implementations for numerical methods. Microprocessing and Microprogramming, ���–�����–���, April �� �.
��� S. Gorlatch. From transformations to methodology in parallel program development: a case study. Microprocessing and Microprogramming, ���–�����–����, April �� �.
� � Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Company, �� �.
��� Michael J. Wolfe. Optimizing Compilers for Supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, October ����.
��� Utpal Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, ����.
��� Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic program parallelization. Proceedings of the IEEE, ������–����, February ����.
��� Dror E. Maydan, John L. Hennessy, and Monica S. Lam. Efficient and exact data dependence analysis. In Proc. of ACM SIGPLAN '�� Conference on Programming Language Design and Implementation, Ontario, Canada, June ���.
�� Paul M. Petersen and David A. Padua. Static and dynamic evaluation of data dependence techniques. IEEE Transactions on Parallel and Distributed Systems, �����–���, November �� �.
��� Michael J. Voss. Portable loop-level parallelism for shared memory multiprocessor architectures. Master's thesis, School of ECE, Purdue University, October ����.
��� Nirav H. Kapadia and José A. B. Fortes. On the design of a demand-based network-computing system: The Purdue University network computing hubs. In Proc. of IEEE Symposium on High Performance Distributed Computing, pages �–��, Chicago, IL, ����.
��� D. A. Bader and J. JaJa. SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, �������–���, July ����.
��� B. Buttarazzi. A methodology for parallel structured programming in logic environments. International Journal of Mini and Microcomputers, ������–� �, ����.
� � Message Passing Interface Forum. MPI: A message-passing interface standard. Technical report, University of Tennessee, Knoxville, Tennessee, May ����.
��� A. Beguelin, J. Dongarra, A. Geist, R. Manchek, S. Otto, and J. Walpole. PVM: Experiences, current status and future direction. In Proc. of Supercomputing '��, pages � �–� �, November ����.
��� ANSI. X�H Parallel Extensions for Fortran, X�H�����SD, Revision m edition, April ����.
��� Kuck and Associates, Champaign, IL. Guide Reference Manual, version �� edition, September �� �.
���� David J. Kuck. The effects of program restructuring, algorithm change, and architecture choice on program performance. In Proc. of International Conference on Parallel Processing, pages ��–��, St. Charles, Ill., August ����.
��� Randy Allen and Ken Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, ��������–����, October ����.
���� F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante. An overview of the PTRAN analysis system for multiprocessing. Journal of Parallel and Distributed Computing, ����� �– ��, October ����.
���� William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel programming with Polaris. IEEE Computer, ��������–���, December �� �.
���� M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, ��������–���, December �� �.
���� Anthony J. G. Hey. High-performance computing: past, present, and future. Computing and Control Engineering Journal, ������–���, February ����.
�� � R. W. Numrich, J. L. Steidel, B. H. Johnson, B. D. de Dinechin, G. Elsesser, G. Fischer, and T. MacDonald. Definition of the F�� extension to Fortran ��. In Proc. of the Workshop on Languages and Compilers for Parallel Computing, pages ���–�� , Springer-Verlag, August ����.
���� R. von Hanxleden, K. Kennedy, and J. Saltz. Value-based distributions in Fortran D. In Proc. of International Conference on High-Performance Computing and Networking, pages ���–���, Springer-Verlag, April ����.
���� High Performance Fortran Forum. High Performance Fortran language specification, version ���. Technical report, Rice University, Houston, Texas, May ����.
���� Microsoft. Visual C++, ����. http://msdn.microsoft.com/visualc/.
���� Microsoft. Visual Basic, ����. http://msdn.microsoft.com/vbasic/.
��� A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, R. Wade, and V. Sunderam. HeNCE: Graphical development tools for network-based concurrent computing. In Proc. of Scalable High Performance Computing Conference, pages ��–� , April ����.
���� J. Schaeffer, D. Szafron, G. Lobe, and I. Parsons. The Enterprise model for developing distributed applications. IEEE Parallel and Distributed Technology, ������–� , January–March ����.
���� P. Newton and J. C. Browne. The CODE ��� graphical parallel programming language. In Proc. of International Conference on Supercomputing, pages �–��, July ����.
���� P. Kacsuk, G. Dozsa, and T. Fadgyas. Designing parallel programs by the graphical language GRAPNEL. Microprocessing and Microprogramming, ���–��� ��– ��, April �� �.
���� O. Loques, J. Leite, and E. V. Carrera. P-RIO: a modular parallel-programming environment. IEEE Concurrency, �����–���, January–March ����.
�� � N. Stankovic and K. Zhang. Visual programming for message-passing systems. International Journal of Software Engineering and Knowledge Engineering, ��������–����, August ����.
���� Barr E. Bauer. Practical Parallel Programming. Academic Press, ����.
���� Silicon Graphics, Inc. Performance Tuning Optimization for Origin���� and Onyx�, ����. http://techpubs.sgi.com/library/manuals/����������������html�O����Tuning���html.
���� Boston University. Introduction to Parallel Processing on SGI Shared Memory Computers, ����. http://scv.bu.edu/SCV/Tutorials/SMP/.
���� University of Illinois at Urbana-Champaign. CSE���/CS���/ECE���, ����. http://www.cse.uiuc.edu/cse���/.
��� University of California at Berkeley. U.C. Berkeley CS�� Home Page: Applications of Parallel Computers, ����. http://HTTP.CS.Berkeley.EDU/~demmel/cs� ��.
���� Geoffrey C. Fox, Roy D. Williams, and Paul C. Messina. Parallel Computing Works. Morgan Kaufmann Publishers, ����.
���� Ian Foster. Designing and Building Parallel Programs. Addison Wesley, ����.
���� D. Cheng and R. Hood. A portable debugger for parallel and distributed programs. In Proc. of Supercomputing '��, pages ���–���, November ����.
���� J. May and F. Berman. Retargetability and extensibility in a parallel debugger. Journal of Parallel and Distributed Computing, ��������–���, June �� �.
�� � Pallas. TotalView, ����. http://www.pallas.de/pages/totalv.htm.
���� Kuck and Associates, Inc. KAP/Pro Toolset, ����. http://www.kai.com.
���� Vincent Guarna Jr., Dennis Gannon, David Jablonowski, Allen Malony, and Yogesh Gaur. Faust: An integrated environment for the development of parallel programs. IEEE Software, ������–���, July ����.
���� Bill Appelbe, Kevin Smith, and Charles McDowell. Start/Pat: A parallel-programming toolkit. IEEE Software, ������–���, July ����.
���� V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok. The ParaScope editor: An interactive parallel programming tool. In Proc. of Supercomputing Conference, pages ���–���, ����.
��� M. W. Hall, T. J. Harvey, K. Kennedy, N. McIntosh, K. S. McKinley, J. D. Oldham, M. H. Paleczny, and G. Roth. Experiences using the ParaScope editor: An interactive parallel programming tool. In Proc. of Principles and Practices of Parallel Programming, pages ��–��, May ����.
���� Rudolf Eigenmann and Patrick McClaughry. Practical tools for optimizing parallel programs. In Proc. of the ���� Simulation Multiconference on the High Performance Computing Symposium, pages �– �, March ����.
���� W. Liao, A. Diwan, R. P. Bosch Jr., A. Ghuloum, and M. S. Lam. SUIF Explorer: An interactive and interprocedural parallelizer. In Proc. of the �th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages ��–��, August ����.
���� Applied Parallel Research, Inc. Forge Explorer, ����. http://www.apri.com.
���� Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, ������� –���, August ����.
�� � V. S. Adve, J. Mellor-Crummey, M. Anderson, K. Kennedy, J. C. Wang, and D. A. Reed. An integrated compilation and performance analysis environment for data parallel programs. In Proc. of Supercomputing Conference, pages ���–���, ����.
���� S. P. Johnson, C. S. Ierotheou, and M. Cross. Automatic parallel code generation for message passing on distributed memory systems. Parallel Computing, ����������–����, February �� �.
���� S. P. Johnson, P. F. Leggett, C. S. Ierotheou, E. W. Evans, and M. Cross. Computer Aided Parallelisation Tools (CAPTools) Tutorials. Parallel Processing Research Group, University of Greenwich, October ����. CAPTools Version ���Beta.
���� Central Institute for Applied Mathematics. PCL, The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors, November ����.
� �� Louis Lopez. The NAS Trace Visualizer (NTV) Rel. ��� User's Guide. NASA, September ����.
� � Michael T. Heath and Jennifer A. Etheridge. Visualizing the performance of parallel programs. IEEE Software, �������–���, September ���.
� �� Université de Marne-la-Vallée. PGPVM�, ����. http://phalanstere.univ-mlv.fr/~sv/PGPVM�/.
� �� Daniel A. Reed. Experimental performance analysis of parallel systems: Techniques and open problems. In Proc. of the �th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages ��–��, ����.
� �� W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach. VAMPIR: visualization and analysis of MPI resources. Supercomputer, ���� �–���, January �� �.
� �� J. Yan, S. Sarukkai, and P. Mehra. Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit. Software: Practice and Experience, ���������–� �, April ����.
� � Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn parallel performance measurement tool. IEEE Computer, �������–� �, November ����.
� �� S. Shende, A. D. Malony, J. Cuny, K. Lindlan, P. Beckman, and S. Karmesin. Portable profiling and tracing for parallel scientific applications using C++. In Proc. of ACM SIGMETRICS Symposium on Parallel and Distributed Tools, pages ��–���, August ����.
� �� Pacific-Sierra Research. DEEP/MPI: Development Environment for MPI Programs, Parallel Program Analysis and Debugging, ����. http://www.psrv.com/deep_mpi_top.html.
� �� B. J. N. Wylie and A. Endo. Annai/PMA multi-level hierarchical parallel program performance engineering. In Proc. of International Workshop on High-Level Programming Models and Supportive Environments, pages ��– ��, �� �.
���� LAM Team, University of North Dakota. XMPI, A Run/Debug GUI for MPI, ����. http://www.mpi.nd.edu/lam/software/xmpi/.
��� A. D. Malony, D. H. Hammerslag, and D. J. Jablonowski. TraceView: a trace visualization tool. IEEE Software, ������–���, September ���.
���� Michael T. Heath. Performance visualization with ParaGraph. In Proc. of the Second Workshop on Environments and Tools for Parallel Scientific Computing, pages ��–���, May ����.
���� E. Lusk. Visualizing parallel program behavior. In Proc. of Simulation Multiconference on the High Performance Computing Symposium, pages ���–���, April ����.
���� Y. Arrouye. Scope: an extensible interactive environment for the performance evaluation of parallel systems. Microprocessing and Microprogramming, ���–��� ��– ��, April �� �.
���� J. A. Kohl and G. A. Geist. The PVM ��� tracing facility and XPVM �g. In Proc. of the Twenty-Ninth Hawaii International Conference on System Sciences, pages ���–���, January �� �.
�� � B. Topol, J. T. Stasko, and V. Sunderam. PVaniM: A tool for visualization in network computing environments. Concurrency: Practice and Experience, ��������–����, December ����.
���� G. Weiming, G. Eisenhauer, K. Schwan, and J. Vetter. Falcon: On-line monitoring for steering parallel programs. Concurrency: Practice and Experience, ������ ��–�� �, August ����.
���� J. T. Stasko and E. Kraemer. A methodology for building application-specific visualizations of parallel programs. Journal of Parallel and Distributed Computing, ��������–� ��, June ����.
���� G. A. Geist II, J. A. Kohl, and P. M. Papadopoulos. CUMULVS: Providing fault tolerance, visualization, and steering of parallel applications. International Journal of Supercomputer Applications, �������–����, Fall ����.
���� K. C. Li and K. Zhang. Tuning parallel programs through automatic program analysis. In Proc. of Second International Symposium on Parallel Architectures, Algorithms, and Networks, pages ���–���, June �� �.
��� A. Reinefeld, R. Baraglia, T. Decker, J. Gehring, D. Laforenza, F. Ramme, T. Romke, and J. Simon. The MOL project: An open, extensible metacomputer. In Proc. of the ��� IEEE Heterogeneous Computing Workshop, pages �–��, ����.
���� H. Casanova and J. Dongarra. NetSolve: a network enabled server for solving computational science problems. International Journal of Supercomputer Applications, ������–����, Fall ����.
���� M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi. Ninf: a network-based information library for global world-wide computing infrastructure. In Proc. of High-Performance Computing and Networking, International Conference and Exhibition, pages ��–���, April ����.
���� P. Arbenz, W. Gander, and M. Oettli. The Remote Computation System. Parallel Computing, ��������–����, October ����.
���� T. Richardson, Q. Stafford-Fraser, K. R. Wood, and A. Hopper. Virtual network computing. IEEE Internet Computing, ������–���, January–February ����.
�� � Citrix. ICA technical paper, �� �. http://www.citrix.com/products/ica.asp.
���� I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, �����–���, Summer ����.
���� A� S� Grimshaw and W� A� Wulf� The Legion vision of a worldwide virtualcomputer� Communications of the ACM� �������!��� January ����
���� Insung Park� Nirav H� Kapadia� Renato J� Figueiredo� Rudolf Eigenmann� andJos�e A� B� Fortes� Towards an integrated� web�executable parallel program�ming tool environment� To appear in the Proc� of SC��High PerformanceNetworking and Computing� �����
���� B� LaRose� The development and implementation of a performance databaseserver� Technical Report CS������� University of Tennessee� August ����
��� The University of Southampton� GRAPHICAL BENCHMARK INFORMA�TION SERVICE �GBIS�� ���� http���www�ccg�ecs�soton�ac�uk�gbis�papiani�new�gbis�html�
���� Cherri M� Pancake and Curtis Cook� What users need in parallel tool support�Survey results and analysis� In Proc� of Scalable High Performance ComputingConference� pages ��!��� March ����
���� Roger S� Pressman� Software Engineering� a Practitioner�s Approach� McGraw�Hill� Inc�� New York� NY� ����
���� Peter Pacheco� Parallel Programming with MPI� Morgran Kaufman Publishers��� �
���� D� Culler� J� P� Singh� and A� Gupta� Parallel Computer Architecture� MorgranKaufman Publishers� ����
�� � Rudolf Eigenmann� Toward a methodology of optimizing programs for high�performance computers� In Proc� of ACM International Conference on Super�computing� pages ��!� � Tokyo� Japan� July ����
���� Seon Wook Kim and Rudolf Eigenmann� Detailed� quantitative analysis ofshared�memory parallel programs� Technical Report ECE�HPCLab������� HP�CLAB� School of ECE� Purdue University� �����
���� Seon Wook Kim, Michael J. Voss, and Rudolf Eigenmann. Performance analysis of parallel compiler backends on shared-memory multiprocessors. In Proc. of the Tenth Workshop on Compilers for Parallel Computers, pages ���–���, January ����.
���� Rudolf Eigenmann, Insung Park, and Michael J. Voss. Are parallel workstations the right target for parallelizing compilers? In Lecture Notes in Computer Science, No. ����: Languages and Compilers for Parallel Computing, pages ���–���, March ����.
���� Michael J. Voss, Insung Park, and Rudolf Eigenmann. On the machine-independent target language for parallelizing compilers. In Proc. of the Sixth Workshop on Compilers for Parallel Computers, Aachen, Germany, December �� �.
��� Insung Park, Michael J. Voss, and Rudolf Eigenmann. Compiling for the new generation of high-performance SMPs. Technical Report ECE-HPCLab-�� ���, HPCLAB, School of ECE, Purdue University, November �� �.
���� Lynn Pointer. Perfect: Performance evaluation for cost-effective transformations, Report �. Technical Report � ��, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, March ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Interactive compilation and performance analysis with Ursa Minor. In Proc. of the Workshop on Languages and Compilers for Parallel Computing, pages �–� �. Springer-Verlag, August ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Parallel programming and performance evaluation with the Ursa tool family. International Journal of Parallel Programming, � ������–� �, November ����.
���� Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Supporting users' reasoning in performance evaluation and tuning of parallel applications. To appear in Proc. of the Twelfth IASTED International Conference on Parallel and Distributed Computing and Systems, November 2000.
�� � Seon Wook Kim, Insung Park, and Rudolf Eigenmann. A performance advisor tool for novice programmers in parallel programming. To appear in the Proc. of the Workshop on Languages and Compilers for Parallel Computing, 2000.
���� Stefan Kortmann, Insung Park, Michael Voss, and Rudolf Eigenmann. Interactive and modular optimization with InterPol. In Proc. of the 2000 International Conference on Parallel and Distributed Processing Techniques and Applications, pages � �–� ��, June 2000.
���� Michael J. Voss, Kwok Wai Yau, and Rudolf Eigenmann. Interactive instrumentation and tuning of OpenMP programs. Technical Report ECE-HPCLab-�����, HPCLAB, ����.
���� Seon-Wook Kim and Rudolf Eigenmann. MaxP: Detecting the Maximum Parallelism in a Fortran Program. HPCLAB, ����.
��� Insung Park and Rudolf Eigenmann. Ursa Major: Exploring web technology for design and evaluation of high-performance systems. In Proc. of the International Conference on High Performance Computing and Networking, pages ���–���, Berlin, Germany, April ����. Springer-Verlag.
�� T. Nakra, R. Gupta, and M. L. Soffa. Value prediction in VLIW machines. In Proc. of the 26th International Symposium on Computer Architecture, pages ���–� ��, May 1999.
��� Trimaran Homepage. Trimaran Manual, ����. http://www.trimaran.org/docs.html.
��� A. D. Alexandrov, M. Ibel, K. E. Schauser, and C. J. Scheiman. UFO: A personal global file system based on user-level extensions to the operating system. ACM Transactions on Computer Systems, 16(3):207–233, August 1998.
��� Rudolf Eigenmann and Siamak Hassanzadeh. Benchmarking with real industrial applications: The SPEC High-Performance Group. IEEE Computational Science & Engineering, �����–���, Spring �� �.
��� David L. Weaver and Tom Germond. The SPARC Architecture Manual, Version 9. SPARC International, Inc., PTR Prentice Hall, Englewood Cliffs, NJ �� ��, ����.
� � T. J. Downar, Jen-Ying Wu, J. Steill, and R. Janardhan. Parallel and serial applications of the RETRAN-�� power plant simulation code using domain decomposition and Krylov subspace methods. Nuclear Technology, ���:���–��, February ����.
VITA
Insung Park was born on February ��, ����, in Seoul, South Korea. He received
his B.S. degree in control and instrumentation engineering from Seoul National
University in February of ��� and his M.S. degree in electrical engineering from
Virginia Polytechnic Institute and State University, Blacksburg, Virginia, in ����.
He successfully defended his Ph.D. research in August of ���� at the School of
Electrical and Computer Engineering at Purdue University and was awarded the
Ph.D. in December of the same year.
From ��� to ����, Insung Park served as a system administrator of the electrical
engineering departmental workstation laboratory. During his M.S. study, he
developed a partial scan design tool, BELLONA. As a Ph.D. student at Purdue,
he designed and implemented a parallel programming environment consisting of a
programming methodology and a set of tools.
He is a member of the honor society of Phi Kappa Phi.