An Open64-based Framework for Analyzing Parallel Applications
TRANSCRIPT
An Open64-based Framework for Analyzing Parallel Applications
Laksono Adhianto (1), Barbara Chapman (2)
(1) Department of Computer Science, Rice University
(2) Department of Computer Science, University of Houston
Open64 Workshop, 2008
L. Adhianto and B. Chapman Analyzing Parallel Applications
Outline
1 Introduction: Motivation and Objectives
2 Methodology: Approach, implementation and applications
3 Conclusion
Motivation: Analyzing Complex Applications
Murphy’s law:
Program complexity grows until it exceeds the capabilities of the programmer who must maintain it.
[Figure: from a simple sequential "Hello World" program . . . to a complex large-scale application (image courtesy of JavaWorld).]
Objectives
We need an infrastructure for:
- Understanding large-scale parallel MPI/OpenMP applications.
- Performance modeling and prediction.
- Program optimization and program correctness verification.
Approach
- Extract the program skeleton based on compiler analysis.
- Retrieve information on communication latency and parallelization overhead from microbenchmarks.
Methodology
Using compiler technology to analyze the source code and microbenchmarks to probe the system profile.
Analysis = Application_Signature ⊗ System_Profile
Application signature: characterizes the fundamental aspects of an application, independent of the machine where it executes (definition borrowed from the PERC SciDAC project).
System profile: characteristics of the platform where the application will be executed.
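The ⊗ operator can be read as pairing each machine-independent count from the signature with a machine-dependent cost from the profile. A minimal sketch of that idea, with hypothetical struct fields and function names of our own (not FOMPI's actual representation):

```c
#include <assert.h>

/* Application signature: machine-independent counts (hypothetical fields). */
typedef struct { double flops, loads, mpi_msgs; } app_signature_t;

/* System profile: machine-dependent costs measured by microbenchmarks. */
typedef struct { double sec_per_flop, sec_per_load, sec_per_msg; } sys_profile_t;

/* Analysis = signature (x) profile: pair each count with its unit cost. */
double predict_time(app_signature_t a, sys_profile_t p) {
    return a.flops    * p.sec_per_flop
         + a.loads    * p.sec_per_load
         + a.mpi_msgs * p.sec_per_msg;
}

/* Example: 1e9 flops at 1 ns, 1e8 loads at 5 ns, 100 messages at 10 us. */
double predict_example(void) {
    app_signature_t a = { 1e9, 1e8, 100.0 };
    sys_profile_t   p = { 1e-9, 5e-9, 1e-5 };
    return predict_time(a, p);   /* 1.0 + 0.5 + 0.001 = 1.501 s */
}
```

The point of the split is that re-targeting the analysis to a new machine only requires re-running the microbenchmarks, not re-analyzing the application.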
Application Signature vs. System Profile
Application signature: independent of the system configuration.
System profile: independent of the application programs.
FOMPI Framework
FOMPI: Framework for analyzing OpenMP and MPI applications.
Analysis = Application_Signature ⊗ System_Profile
[Figure: FOMPI framework — source code, Eclipse with the OpenUH plug-in, FOMPI View and FOMPI Analyzer; the OpenUH compiler produces an executable and an XML representation (the application signature); microbenchmarks and a database produce the system profile; a load balancing library links into the executable.]
References: parallelization [1, 5], OpenMP tools [7, 6], compiler [8], modeling [4, 3], autoscoping [2].
The OpenUH Compiler
[Figure: the OpenUH compiler pipeline. C/C++ and Fortran 77/90 front ends lower OpenMP/MPI source code through the WHIRL levels (Very High, High, Middle, Low, Very Low), passing through IPA (Inter-Procedural Analyzer), LNO (Loop Nest Optimizer), WOPT (global scalar optimizer) and RVI2 (register-level optimizer); the code generator (CG) emits object files, while the WHIRL2C/F IR-to-source translators emit source code and the XML code representation (the application signature) for a native compiler.]
OpenUH is rich in analyses: data dependence, inter-procedural analysis, array region analysis, . . . We have extended OpenUH to generate the application signature.
The signature contains MPI routines, OpenMP constructs, loops, estimated execution times, cache access patterns, . . .
Call Graph
[Figure: generated call graph of a NAS multi-zone benchmark — nodes such as MAIN__, adi_, x_solve__, y_solve__, z_solve__, compute_rhs__, exch_qbc__, initialize_, exact_rhs__ and verify_, each annotated with its aggregated execution time, and edges annotated with call frequencies.]
A traditional call graph is not scalable for large-scale applications containing millions of lines of code.
G_c = ⟨N_c, E_c, s_c⟩

Aggregated execution time:

    N_c^i = N_c^x + Σ_{i=0}^{n} ( N_{c_i}^i × E_{c_i}^f )

The inclusive execution time of a unit is the sum of its exclusive time and the total time of its call sites.
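The aggregated-cost rule can be computed bottom-up over the graph: a node's inclusive time is its exclusive time plus the frequency-weighted inclusive time of each callee. A minimal sketch with our own struct layout (not FOMPI's representation), assuming an acyclic call graph:

```c
#include <assert.h>

#define MAX_CALLS 8

/* One call-graph node: exclusive cost plus its call sites. */
typedef struct node {
    double exclusive;                 /* N^x: time spent in the unit itself */
    int ncalls;                       /* number of call sites               */
    struct node *callee[MAX_CALLS];   /* callee of each call site           */
    double freq[MAX_CALLS];           /* E^f: call frequency of each site   */
} node_t;

/* Inclusive time: exclusive time plus frequency-weighted callee times.
   Assumes no recursion (acyclic call graph). */
double inclusive(const node_t *n) {
    double t = n->exclusive;
    for (int i = 0; i < n->ncalls; i++)
        t += n->freq[i] * inclusive(n->callee[i]);
    return t;
}

/* Tiny worked example: a root costing 1.0 calls a leaf costing 2.0,
   three times, so its inclusive time is 1.0 + 3 x 2.0 = 7.0. */
double example_inclusive(void) {
    node_t leaf = { 2.0, 0, {0}, {0} };
    node_t root = { 1.0, 1, { &leaf }, { 3.0 } };
    return inclusive(&root);
}
```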
Control-Flow Graph
G_f = ⟨N_f, E_f, s_f⟩

Inclusive cost:

    N_f^i = N_{f_l}^x + Σ_{i=0}^{n} ( E_{f_i} × N_{f_i}^i ),       inside a loop;
            N_{f_b}^x + max_{i=0}^{n} Σ_{j=0}^{m_i} N_{f_{ij}}^i,  branches;
            N_f^x,                                                 otherwise.

Exclusive cost:

    N_{f_l}^x = t_comp^ser = E_f × (t_machine + t_overhead + t_cache),  serial;
                t_comp^par = t_comp^ser / n_t + Σ t_comp^unpar + Σ O,   parallel;
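The two exclusive-cost cases can be sketched directly from the model above; the function and parameter names here are our own, not FOMPI's API:

```c
#include <assert.h>

/* Serial loop body: trip count E_f times the per-iteration machine,
   overhead and cache costs — t_ser = E_f * (t_machine + t_overhead + t_cache). */
double loop_cost_serial(double trips, double t_machine,
                        double t_overhead, double t_cache) {
    return trips * (t_machine + t_overhead + t_cache);
}

/* Parallel loop: serial time divided over n_threads, plus the cost of the
   unparallelizable part and the parallelization overheads (the sum of O). */
double loop_cost_parallel(double t_serial, int n_threads,
                          double t_unpar, double t_overheads) {
    return t_serial / n_threads + t_unpar + t_overheads;
}
```

For example, a 64-trip loop with unit costs 1 + 2 + 3 gives 384 time units serially; on 4 threads with 10 units of unparallelized work and 6 units of overhead it costs 384/4 + 10 + 6 = 112.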
Summary
- FOMPI provides portable and scalable analysis and modeling with no program execution needed.
- Based on the application signature from the compiler and the system profile from microbenchmarks.
- Applications include: performance modeling, program understanding, OpenMP generation and MPI load imbalance reduction.
- Open64's extensibility is needed: WHIRL, analysis and transformation.
SUIF, GCC GEM Framework, LLVM . . .
Contributions
Our contributions:
- Compiler extension to extract the application signature
- Microbenchmark extensions for more OpenMP coverage
- Using the compiler and microbenchmarks for program analyses and modeling
- Scalable call graph and control-flow graph
- Runtime library for reducing load imbalance
Acknowledgments
John Mellor-Crummey (Rice University)
HPCTools members and alumni
TLC2, PSTL lab
UH, CS@UH
PModels, DOE and NSF
For Further Reading I
L. Adhianto, F. Bodin, B. Chapman, L. Hascoet, A. Kneer, D. Lancaster, I. Wolton, and M. Wirtz. Tools for OpenMP application development: the POST project. Concurrency: Practice and Experience, 12(12):1177–1191, 2000.

Laksono Adhianto and Barbara Chapman. Autoscoping support for OpenMP compiler. In Workshop on Tools and Compilers for Hardware Acceleration, 2006.
For Further Reading II
Laksono Adhianto and Barbara Chapman. Performance modeling of communication and computation in hybrid MPI and OpenMP applications. International Conference on Parallel and Distributed Systems (ICPADS), Workshop on Performance Modeling and Analysis of Communication (PMAC), 2:3–8, 2006.

Laksono Adhianto and Barbara Chapman. Performance modeling of communication and computation in hybrid MPI and OpenMP applications. Simulation Modelling Practice and Theory, 15(4):481–491, 2007.
For Further Reading III
Laksono Adhianto and Michael Leuschel. Strategy for improving memory locality reuse and exploiting hidden parallelism. In Indonesian Students Scientific Meeting (ISSM), Manchester, UK, August 2001.

Barbara Chapman, Oscar Hernandez, Lei Huang, Tien-Hsiung Weng, Zhenying Liu, Laksono Adhianto, and Yi Wen. Dragon: An Open64-based interactive program analysis tool for large applications. In 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2003.
For Further Reading IV
Barbara Chapman, Tien-Hsiung Weng, Oscar Hernandez, Zhenying Lui, Lei Huang, Yi Wen, and Laksono Adhianto. Cougar: Interactive tool for cluster computing. In Proceedings of the 6th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI'2002). The International Institute of Informatics and Systemics, 2002.

OpenUH. http://www.cs.uh.edu/~openuh.
Application Signature: Example from NAS FT
if (timers_enabled) call timer_start(7)
do k = 1, n3
...
call Swarztrauber(...)
...
end do
<if line="109">
<ifthen line="109">
<kids>
<call name="timer_start__" line="111"/>
</kids>
</ifthen>
</if>
<loop line="112" opcode="OPC_DO_LOOP" >
<header>
<index>K</index>
<lowerbound>1</lowerbound>
<upperbound>64</upperbound>
<increment>+1</increment>
</header>
<cost>
<iterations>64</iterations>
<average>2.69514</average>
<machine>4.9152e+06</machine>
<cache>1.12331e+08</cache>
<overhead>1.06522e+08</overhead>
<total>2.23768e+08</total>
</cost>
<parallel>
<status>None</status>
<reason>Call swarztrauber_ on line 122.</reason>
<scope>
<private> K</private>
<shared> X PLANE</shared>
</scope>
</parallel>
<kids>
. . . .
<call name="swarztrauber_" line="122"/>
[Diagram: source code → OpenUH compiler → application signature.]
Application Signature: Scalability
Stored in an XML file: the XML Program Representation.
Designed with interoperability and scalability in mind.
Benchmark   LOC       loops    XML       Tags      Tags/Loop   XML/LOC
ammp        10068     272      11872     6767      24.88       1.18
applu       2555      120      3497      2585      21.54       1.37
apsi        4386      278      9582      6815      24.51       2.18
art         1570      72       2672      1799      24.99       1.70
equake      1121      70       2338      1650      23.57       2.09
gafort      720       72       2450      1748      24.28       3.40
mgrid       1023      48       1338      1009      21.02       1.31
swim        275       24       677       505       21.04       2.46
wupwise     1018      43       2096      1419      33.00       2.06
Average     2526.22   111.00   4508.00   3257.00   24.31       1.97
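The derived columns of the table are simple ratios: Tags/Loop = Tags ÷ loops and XML/LOC = XML ÷ LOC (e.g. for ammp, 6767 / 272 ≈ 24.88 and 11872 / 10068 ≈ 1.18). A quick check:

```c
#include <assert.h>
#include <math.h>

/* Derived ratio columns of the scalability table. */
double tags_per_loop(double tags, double loops) { return tags / loops; }
double xml_per_loc(double xml_size, double loc)  { return xml_size / loc; }
```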
Eclipse
Eclipse is an open source platform based on IBM VisualAge Micro Edition. We have developed the FOMPI View plugin as the main user interface.
- Accessible from any Eclipse perspective.
- Can generate call graphs, control-flow graphs, . . .
MPI Load Imbalance
Unbalanced program: each MPI process has the same number of OpenMP threads.
Balanced program: adjust the number of OpenMP threads according to the workload.
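The balancing idea can be sketched as giving each MPI process a thread count proportional to its measured load. This is our own illustration of the policy, not the LBL library's actual implementation:

```c
#include <assert.h>

/* Assign each process a share of total_threads proportional to its load,
   keeping at least one thread per process; leftover threads (from integer
   truncation) go to the most loaded process. Hypothetical sketch. */
void balance_threads(const double *load, int nprocs,
                     int total_threads, int *threads) {
    double sum = 0.0;
    for (int i = 0; i < nprocs; i++) sum += load[i];
    int assigned = 0;
    for (int i = 0; i < nprocs; i++) {
        threads[i] = (int)(total_threads * load[i] / sum);
        if (threads[i] < 1) threads[i] = 1;   /* every process keeps a thread */
        assigned += threads[i];
    }
    int heaviest = 0;
    for (int i = 1; i < nprocs; i++)
        if (load[i] > load[heaviest]) heaviest = i;
    while (assigned < total_threads) { threads[heaviest]++; assigned++; }
}

/* Example: loads 1, 1, 2 with 8 threads total gives 2, 2, 4. */
int example_threads(int which) {
    double load[3] = { 1.0, 1.0, 2.0 };
    int t[3];
    balance_threads(load, 3, 8, t);
    return t[which];
}
```

A real runtime would then apply the counts with omp_set_num_threads before the compute phase of each iteration.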
MPI Load Imbalance
Assumption: the application includes a main iterative loop containing:
- Large computation
- Significant communication

Approach:
1 Statically determine the main iterative loop.
2 Insert the load balancing library (LBL) calls at the beginning and at the end of the loop.
MPI Load Imbalance
Original code:

    #include <lbl.h>
    int MainFunction() {
        ...
        /* Main iteration */
        while (iter < MAX_ITER) {
            ...
            Do_Computation();
            ...
            Do_Communication();
        }
        /* end of main iteration */
        ...
    }

With load balancing library:

    #include <lbl.h>
    int MainFunction() {
        LBL_SetSkipIterations(40);
        LBL_SetThreshold(30);
        LBL_Init();
        /* Main iteration */
        while (iter < MAX_ITER) {
            LBL_LoadDetection();
            ...
            Do_Computation();
            ...
            Do_Communication();
        }
        /* end of main iteration */
        LBL_Finalize();
    }
MPI Load Imbalance
[Plot: original code with load imbalance — execution time (seconds) of processes 0–3 over 1000 iterations, spread between roughly 50 and 75 seconds.]
MPI Load Imbalance
[Plot: using the load balancing library — execution time (seconds) of processes 0–3 over 1000 iterations, between roughly 35 and 70 seconds.]