An Open64-based Framework for Analyzing Parallel Applications
TRANSCRIPT
An Open64-based Framework for Analyzing Parallel Applications
Laksono Adhianto (1), Barbara Chapman (2)
(1) Department of Computer Science, Rice University
(2) Department of Computer Science, University of Houston
Open64 Workshop, 2008
L. Adhianto and B. Chapman Analyzing Parallel Applications
Outline
1 Introduction: Motivation and Objectives
2 Methodology: Approach, implementation and applications
3 Conclusion
Motivation: Analyzing Complex Applications
Murphy’s law:
Program complexity grows until it exceeds the capabilities of the programmer who must maintain it.
[Figure: from a simple sequential "Hello World" program . . . to a complex large-scale application (image courtesy of JavaWorld).]
Objectives
We need an infrastructure for:
- Understanding large-scale parallel MPI/OpenMP applications.
- Performance modeling and prediction.
- Program optimization and program correctness verification.
Approach
- Extract the program skeleton based on compiler analysis.
- Retrieve information on communication latency and parallelization overhead from microbenchmarks.
Methodology
Using compiler technology to analyze the source code and microbenchmarks to probe the system profile.
Analysis = Application_Signature ⊗ System_Profile
Application signature: characterizes the fundamental aspects of an application, independent of the machine where it executes (definition borrowed from the PERC SciDAC project).
System profile: characteristics of the platform where the application will be executed.
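The ⊗ operator can be read as pairing each machine-independent count from the signature with a machine-dependent cost from the profile. A minimal sketch of that idea, with hypothetical struct fields and function names of our own (not FOMPI's actual representation):

```c
#include <assert.h>

/* Application signature: machine-independent counts (hypothetical fields). */
typedef struct { double flops, loads, mpi_msgs; } app_signature_t;

/* System profile: machine-dependent costs measured by microbenchmarks. */
typedef struct { double sec_per_flop, sec_per_load, sec_per_msg; } sys_profile_t;

/* Analysis = signature (x) profile: pair each count with its unit cost. */
double predict_time(app_signature_t a, sys_profile_t p) {
    return a.flops    * p.sec_per_flop
         + a.loads    * p.sec_per_load
         + a.mpi_msgs * p.sec_per_msg;
}

/* Example: 1e9 flops at 1 ns, 1e8 loads at 5 ns, 100 messages at 10 us. */
double predict_example(void) {
    app_signature_t a = { 1e9, 1e8, 100.0 };
    sys_profile_t   p = { 1e-9, 5e-9, 1e-5 };
    return predict_time(a, p);   /* 1.0 + 0.5 + 0.001 = 1.501 s */
}
```

The point of the split is that re-targeting the analysis to a new machine only requires re-running the microbenchmarks, not re-analyzing the application.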
Application Signature vs. System Profile
Application signature: independent of the system configuration.
System profile: independent of the application programs.
FOMPI Framework
FOMPI: Framework for analyzing OpenMP and MPI applications.
Analysis = Application_Signature ⊗ System_Profile
[Figure: FOMPI framework — source code, Eclipse with the OpenUH plug-in, FOMPI View and FOMPI Analyzer; the OpenUH compiler produces an executable and an XML representation (the application signature); microbenchmarks and a database produce the system profile; a load balancing library links into the executable.]
References: parallelization [1, 5], OpenMP tools [7, 6], compiler [8], modeling [4, 3], autoscoping [2].
The OpenUH Compiler
[Figure: the OpenUH compiler pipeline. C/C++ and Fortran 77/90 front ends lower OpenMP/MPI source code through the WHIRL levels (Very High, High, Middle, Low, Very Low), passing through IPA (Inter-Procedural Analyzer), LNO (Loop Nest Optimizer), WOPT (global scalar optimizer) and RVI2 (register-level optimizer); the code generator (CG) emits object files, while the WHIRL2C/F IR-to-source translators emit source code and the XML code representation (the application signature) for a native compiler.]
OpenUH is rich in analyses: data dependence, inter-procedural analysis, array region analysis, . . . We have extended OpenUH to generate the application signature.
The signature contains MPI routines, OpenMP constructs, loops, estimated execution times, cache access patterns, . . .
Call Graph
[Figure: generated call graph of a NAS multi-zone benchmark — nodes such as MAIN__, adi_, x_solve__, y_solve__, z_solve__, compute_rhs__, exch_qbc__, initialize_, exact_rhs__ and verify_, each annotated with its aggregated execution time, and edges annotated with call frequencies.]
A traditional call graph is not scalable for large-scale applications containing millions of lines of code.
G_c = ⟨N_c, E_c, s_c⟩

Aggregated execution time:

    N_c^i = N_c^x + Σ_{i=0}^{n} ( N_{c_i}^i × E_{c_i}^f )

The inclusive execution time of a unit is the sum of its exclusive time and the total time of its call sites.
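The aggregated-cost rule can be computed bottom-up over the graph: a node's inclusive time is its exclusive time plus the frequency-weighted inclusive time of each callee. A minimal sketch with our own struct layout (not FOMPI's representation), assuming an acyclic call graph:

```c
#include <assert.h>

#define MAX_CALLS 8

/* One call-graph node: exclusive cost plus its call sites. */
typedef struct node {
    double exclusive;                 /* N^x: time spent in the unit itself */
    int ncalls;                       /* number of call sites               */
    struct node *callee[MAX_CALLS];   /* callee of each call site           */
    double freq[MAX_CALLS];           /* E^f: call frequency of each site   */
} node_t;

/* Inclusive time: exclusive time plus frequency-weighted callee times.
   Assumes no recursion (acyclic call graph). */
double inclusive(const node_t *n) {
    double t = n->exclusive;
    for (int i = 0; i < n->ncalls; i++)
        t += n->freq[i] * inclusive(n->callee[i]);
    return t;
}

/* Tiny worked example: a root costing 1.0 calls a leaf costing 2.0,
   three times, so its inclusive time is 1.0 + 3 x 2.0 = 7.0. */
double example_inclusive(void) {
    node_t leaf = { 2.0, 0, {0}, {0} };
    node_t root = { 1.0, 1, { &leaf }, { 3.0 } };
    return inclusive(&root);
}
```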
Control-Flow Graph
G_f = ⟨N_f, E_f, s_f⟩

Inclusive cost:

    N_f^i = N_{f_l}^x + Σ_{i=0}^{n} ( E_{f_i} × N_{f_i}^i ),       inside a loop;
            N_{f_b}^x + max_{i=0}^{n} Σ_{j=0}^{m_i} N_{f_{ij}}^i,  branches;
            N_f^x,                                                 otherwise.

Exclusive cost:

    N_{f_l}^x = t_comp^ser = E_f × (t_machine + t_overhead + t_cache),  serial;
                t_comp^par = t_comp^ser / n_t + Σ t_comp^unpar + Σ O,   parallel;
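The two exclusive-cost cases can be sketched directly from the model above; the function and parameter names here are our own, not FOMPI's API:

```c
#include <assert.h>

/* Serial loop body: trip count E_f times the per-iteration machine,
   overhead and cache costs — t_ser = E_f * (t_machine + t_overhead + t_cache). */
double loop_cost_serial(double trips, double t_machine,
                        double t_overhead, double t_cache) {
    return trips * (t_machine + t_overhead + t_cache);
}

/* Parallel loop: serial time divided over n_threads, plus the cost of the
   unparallelizable part and the parallelization overheads (the sum of O). */
double loop_cost_parallel(double t_serial, int n_threads,
                          double t_unpar, double t_overheads) {
    return t_serial / n_threads + t_unpar + t_overheads;
}
```

For example, a 64-trip loop with unit costs 1 + 2 + 3 gives 384 time units serially; on 4 threads with 10 units of unparallelized work and 6 units of overhead it costs 384/4 + 10 + 6 = 112.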
Summary
- FOMPI provides portable and scalable analysis and modeling with no program execution needed.
- Based on the application signature from the compiler and the system profile from microbenchmarks.
- Applications include: performance modeling, program understanding, OpenMP generation and MPI load imbalance reduction.
- Open64's extensibility is needed: WHIRL, analysis and transformation.
SUIF, GCC GEM Framework, LLVM . . .
Contributions
Our contributions:
- Compiler extension to extract the application signature
- Microbenchmark extensions for more OpenMP coverage
- Using the compiler and microbenchmarks for program analyses and modeling
- Scalable call graph and control-flow graph
- Runtime library for reducing load imbalance
Acknowledgments
John Mellor-Crummey (Rice University)
HPCTools members and alumni
TLC2, PSTL lab
UH, CS@UH
PModels, DOE and NSF
For Further Reading I
L. Adhianto, F. Bodin, B. Chapman, L. Hascoet, A. Kneer, D. Lancaster, I. Wolton, and M. Wirtz. Tools for OpenMP application development: the POST project. Concurrency: Practice and Experience, 12(12):1177–1191, 2000.

Laksono Adhianto and Barbara Chapman. Autoscoping support for OpenMP compiler. In Workshop on Tools and Compilers for Hardware Acceleration, 2006.
For Further Reading II
Laksono Adhianto and Barbara Chapman. Performance modeling of communication and computation in hybrid MPI and OpenMP applications. International Conference on Parallel and Distributed Systems (ICPADS), Workshop on Performance Modeling and Analysis of Communication (PMAC), 2:3–8, 2006.

Laksono Adhianto and Barbara Chapman. Performance modeling of communication and computation in hybrid MPI and OpenMP applications. Simulation Modelling Practice and Theory, 15(4):481–491, 2007.
For Further Reading III
Laksono Adhianto and Michael Leuschel. Strategy for improving memory locality reuse and exploiting hidden parallelism. In Indonesian Students Scientific Meeting (ISSM), Manchester, UK, August 2001.

Barbara Chapman, Oscar Hernandez, Lei Huang, Tien-Hsiung Weng, Zhenying Liu, Laksono Adhianto, and Yi Wen. Dragon: An Open64-based interactive program analysis tool for large applications. In 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2003.
For Further Reading IV
Barbara Chapman, Tien-Hsiung Weng, Oscar Hernandez, Zhenying Lui, Lei Huang, Yi Wen, and Laksono Adhianto. Cougar: Interactive tool for cluster computing. In Proceedings of the 6th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI'2002). The International Institute of Informatics and Systemics, 2002.

OpenUH. http://www.cs.uh.edu/~openuh.
Application Signature: Example from NAS FT
if (timers_enabled) call timer_start(7)
do k = 1, n3
...
call Swarztrauber(...)
...
end do
<if line="109">
<ifthen line="109">
<kids>
<call name="timer_start__" line="111"/>
</kids>
</ifthen>
</if>
<loop line="112" opcode="OPC_DO_LOOP" >
<header>
<index>K</index>
<lowerbound>1</lowerbound>
<upperbound>64</upperbound>
<increment>+1</increment>
</header>
<cost>
<iterations>64</iterations>
<average>2.69514</average>
<machine>4.9152e+06</machine>
<cache>1.12331e+08</cache>
<overhead>1.06522e+08</overhead>
<total>2.23768e+08</total>
</cost>
<parallel>
<status>None</status>
<reason>Call swarztrauber_ on line 122.</reason>
<scope>
<private> K</private>
<shared> X PLANE</shared>
</scope>
</parallel>
<kids>
. . . .
<call name="swarztrauber_" line="122"/>
[Diagram: source code → OpenUH compiler → application signature.]
Application Signature: Scalability
Stored in an XML file: the XML Program Representation.
Designed with interoperability and scalability in mind.
Benchmark   LOC       loops    XML       Tags      Tags/Loop   XML/LOC
ammp        10068     272      11872     6767      24.88       1.18
applu       2555      120      3497      2585      21.54       1.37
apsi        4386      278      9582      6815      24.51       2.18
art         1570      72       2672      1799      24.99       1.70
equake      1121      70       2338      1650      23.57       2.09
gafort      720       72       2450      1748      24.28       3.40
mgrid       1023      48       1338      1009      21.02       1.31
swim        275       24       677       505       21.04       2.46
wupwise     1018      43       2096      1419      33.00       2.06
Average     2526.22   111.00   4508.00   3257.00   24.31       1.97
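The derived columns of the table are simple ratios: Tags/Loop = Tags ÷ loops and XML/LOC = XML ÷ LOC (e.g. for ammp, 6767 / 272 ≈ 24.88 and 11872 / 10068 ≈ 1.18). A quick check:

```c
#include <assert.h>
#include <math.h>

/* Derived ratio columns of the scalability table. */
double tags_per_loop(double tags, double loops) { return tags / loops; }
double xml_per_loc(double xml_size, double loc)  { return xml_size / loc; }
```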
Eclipse
Eclipse is an open source platform based on IBM VisualAge Micro Edition. We have developed the FOMPI View plugin as the main user interface.
- Accessible from any Eclipse perspective.
- Can generate call graphs, control-flow graphs, . . .
MPI Load Imbalance
Unbalanced program: each MPI process has the same number of OpenMP threads.
Balanced program: adjust the number of OpenMP threads according to the workload.
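The balancing idea can be sketched as giving each MPI process a thread count proportional to its measured load. This is our own illustration of the policy, not the LBL library's actual implementation:

```c
#include <assert.h>

/* Assign each process a share of total_threads proportional to its load,
   keeping at least one thread per process; leftover threads (from integer
   truncation) go to the most loaded process. Hypothetical sketch. */
void balance_threads(const double *load, int nprocs,
                     int total_threads, int *threads) {
    double sum = 0.0;
    for (int i = 0; i < nprocs; i++) sum += load[i];
    int assigned = 0;
    for (int i = 0; i < nprocs; i++) {
        threads[i] = (int)(total_threads * load[i] / sum);
        if (threads[i] < 1) threads[i] = 1;   /* every process keeps a thread */
        assigned += threads[i];
    }
    int heaviest = 0;
    for (int i = 1; i < nprocs; i++)
        if (load[i] > load[heaviest]) heaviest = i;
    while (assigned < total_threads) { threads[heaviest]++; assigned++; }
}

/* Example: loads 1, 1, 2 with 8 threads total gives 2, 2, 4. */
int example_threads(int which) {
    double load[3] = { 1.0, 1.0, 2.0 };
    int t[3];
    balance_threads(load, 3, 8, t);
    return t[which];
}
```

A real runtime would then apply the counts with omp_set_num_threads before the compute phase of each iteration.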
MPI Load Imbalance
Assumption: the application includes a main iterative loop containing:
- Large computation
- Significant communication

Approach:
1 Statically determine the main iterative loop.
2 Insert the load balancing library (LBL) calls at the beginning and at the end of the loop.
MPI Load Imbalance
Original code:

    #include <lbl.h>
    int MainFunction() {
        ...
        /* Main iteration */
        while (iter < MAX_ITER) {
            ...
            Do_Computation();
            ...
            Do_Communication();
        }
        /* end of main iteration */
        ...
    }

With load balancing library:

    #include <lbl.h>
    int MainFunction() {
        LBL_SetSkipIterations(40);
        LBL_SetThreshold(30);
        LBL_Init();
        /* Main iteration */
        while (iter < MAX_ITER) {
            LBL_LoadDetection();
            ...
            Do_Computation();
            ...
            Do_Communication();
        }
        /* end of main iteration */
        LBL_Finalize();
    }
MPI Load Imbalance
[Plot: original code with load imbalance — execution time (seconds) of processes 0–3 over 1000 iterations, spread between roughly 50 and 75 seconds.]
MPI Load Imbalance
[Plot: using the load balancing library — execution time (seconds) of processes 0–3 over 1000 iterations, between roughly 35 and 70 seconds.]