Loop-Based Automated Performance Analysis

Eli [email protected]

Computer Sciences Department, University of Wisconsin-Madison

Madison, WI 53706, USA


Motivation

• Automated performance analysis
  – Ongoing work, APART

• Previous work: Callgraph, Deepstart
  – Faster, more efficient searching

• This work: better localize performance problems
  – Report performance data at finer granularity


Motivation (Cont.)

• Function granularity works well
  – Don’t overload user w/ fine-grain data

• Why is a function a bottleneck?
  – Large function w/ multiple bottlenecks
  – Small function called repeatedly in a loop

• Idea: search inside bottleneck functions


Performance Consultant (PC)

• For code, PC searches the callgraph
  – Breadth-first search
  – Prune non-bottleneck functions

• Introduce a new callgraph level that
  – Is a logical unit of computation
  – Improves granularity
  – Partitions functions for searching
  – Keeps search space manageable (scalability)
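A minimal sketch of the breadth-first search with pruning described on this slide. It assumes a toy callgraph stored as a dictionary and a hypothetical `is_bottleneck` test that compares an inclusive CPU fraction against a threshold. The node names, the measurements, and the 0.2 threshold are illustrative assumptions, not the Performance Consultant's actual data structures or defaults.

```python
from collections import deque


def search_callgraph(callgraph, root, is_bottleneck):
    """Breadth-first search that only expands nodes testing as bottlenecks."""
    found = []
    queue = deque([root])
    visited = {root}
    while queue:
        node = queue.popleft()
        if not is_bottleneck(node):
            continue  # prune: children of a non-bottleneck are never tested
        found.append(node)
        for child in callgraph.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return found


# Hypothetical callgraph and inclusive CPU fractions (illustrative only).
callgraph = {"main": ["f1", "f2"], "f1": ["g"], "f2": ["h"]}
cpu_fraction = {"main": 1.0, "f1": 0.7, "f2": 0.1, "g": 0.6, "h": 0.05}

bottlenecks = search_callgraph(
    callgraph, "main", lambda name: cpu_fraction[name] > 0.2
)
```

Here `f2` fails the test, so its callee `h` is never instrumented. Adding a loop level to the graph would give the same search more, smaller nodes to test or prune.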


Loops in the Callgraph

[Figure: two diagrams side by side. "Callgraph": nodes main, f1, f2. "Callgraph w/ loops": nodes main, loop 1, f1, f2.]


Why Loops?

• Loops may be bottlenecks themselves
  – Especially in scientific and long-running applications

• Loops are natural sources of parallelism
  – Compilers/HW exploit
  – OpenMP PARALLEL DO, loop unrolling/fusion
  – Provide feedback as to the effectiveness of these optimizations


Why Loops?

• Loops logically decompose functions
  – Natural hierarchy (name by nesting)

• We instrument loops in binary
  – Binary is what actually executes
  – Typically can correlate PC results w/ original source
  – Difficult w/ basic block, instruction granularity
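To illustrate "name by nesting", here is a small sketch that derives hierarchical names from a nested loop tree. The representation (each loop is a list of its directly nested loops) and the `loop_1.2` naming pattern are assumptions chosen for illustration; the slides do not specify the exact naming scheme.

```python
def name_loops(loop_tree, prefix="loop_"):
    """Return hierarchical names for a nested loop tree, outermost first."""
    names = []
    for index, nested in enumerate(loop_tree, start=1):
        name = f"{prefix}{index}"
        names.append(name)
        # Nested loops extend the parent's name with a dotted suffix.
        names.extend(name_loops(nested, prefix=name + "."))
    return names


# Hypothetical function with two outermost loops; the first contains
# two nested loops, the second contains none.
function_loops = [[[], []], []]
names = name_loops(function_loops)
```

A name such as `loop_1.2` identifies both the loop and its parent, which is what lets a loop node slot into the callgraph hierarchy under its enclosing function.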


What’s new?

• Loop-level performance data is not new
  – Existing tools: DPOMP, HPCView, SvPablo
  – Edge instrumentation in EEL and OM

• Integrate loops into automated search

• Techniques to instrument loops on-the-fly

• Technical challenges doing this efficiently
  – Especially on IA32 (AMD64/EM64T)

• Results for some MPI/OpenMP applications


Binary Loop Instrumentation

Instrumentation points (numbered 1–4 in the original figure): Entry, Exit, Begin iteration, End iteration

Source:

    do {
        ...
        if (x > 100)
            break;
        ...
    } while (x < y);

Binary:

    LP:   inc %edx
          inc %eax
          cmp $0x64,%eax
          jg  DONE
          inc %edx
          inc %eax
          cmp %edx,%eax
          jl  LP
    DONE:
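The four instrumentation points can be related to the loop's control-flow graph. The sketch below is one plausible placement, not necessarily the one used by the tool: entry points are edges from outside the loop into its header, end-of-iteration points are back edges to the header, exit points are edges leaving the loop, and begin-of-iteration is the header block itself. The block names (`PRE`, `B1`, `B2`, `DONE`) are hypothetical labels for the basic blocks of the do-while example on this slide.

```python
def loop_instrumentation_points(edges, loop_blocks, header):
    """Classify CFG edges of one loop into candidate instrumentation points."""
    entry = [(s, d) for s, d in edges if s not in loop_blocks and d == header]
    end_iter = [(s, d) for s, d in edges if s in loop_blocks and d == header]
    exits = [(s, d) for s, d in edges if s in loop_blocks and d not in loop_blocks]
    return {
        "entry": entry,          # executed once per loop execution
        "begin_iter": [header],  # executed at the top of every iteration
        "end_iter": end_iter,    # back edges: one per completed iteration
        "exit": exits,           # every way control can leave the loop
    }


# CFG of the do-while example: B1 ends at `jg DONE`, B2 ends at `jl LP`.
edges = [
    ("PRE", "B1"),   # fall into the loop
    ("B1", "B2"),    # break not taken
    ("B1", "DONE"),  # break taken (jg DONE)
    ("B2", "B1"),    # back edge (jl LP)
    ("B2", "DONE"),  # loop condition false, fall through
]
points = loop_instrumentation_points(edges, {"B1", "B2"}, "B1")
```

The example has two exit edges, one for the `break` and one for the loop condition, which is why exit instrumentation is attached to edges rather than to a single block.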


New Instrumentation Techniques

• Traditional function, edge instrumentation

• Function relocation, previously
  – Function entry, exit, callsites

• For loops, may relocate function again
  – Ensure enough padding around basic blocks which need to be instrumented
  – Avoid trap-based instrumentation


Loop-based Search Strategy

• PC uses loops as steps in its refinement


Loop-based Search Strategy

• Inclusive metric, instrument loop entry/exit

• If a node is a bottleneck, instrument
  – Function: outermost loops and call sites
  – Loop: nested loops and call sites

• # of PC experiments
  – More total experiments possible w/ loops
  – But loops can help prune search
    • E.g. loops which contain multiple call sites
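A minimal sketch of the refinement step on this slide, under one reading of it: each node lists only its directly contained loops and call sites, so a function refines to its outermost loops plus the call sites outside any loop, and a loop refines to its nested loops plus the call sites in its own body. The dictionary layout and all names are hypothetical.

```python
def next_experiments(node):
    """Names of the nodes to instrument once `node` tests as a bottleneck."""
    loops = [loop["name"] for loop in node["loops"]]
    return loops + list(node["calls"])


# Hypothetical function `main`: one outermost loop holding two call sites
# and a nested loop, plus one call site outside any loop.
inner = {"name": "loop_1.1", "loops": [], "calls": ["f3"]}
outer = {"name": "loop_1", "loops": [inner], "calls": ["f1", "f2"]}
main = {"name": "main", "loops": [outer], "calls": ["f4"]}

after_main = next_experiments(main)
after_loop = next_experiments(outer)
```

If `loop_1` does not test as a bottleneck, `f1`, `f2`, and `f3` are never instrumented: two experiments below `main` instead of four. That is the pruning effect the last bullet refers to.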


Test Applications

Name   Type                      Lines  KB     Funcs  Loops
ALARA  Sequential, C++           21K    6,382  718    598
DRACO  Sequential, F90           72K    2,516  898    5,477
OM3    MPI, C, 8 nodes           3K     88     28     202
SPhot  MPI/OpenMP, F90, 8 nodes  3K     895    31     106


Results

• Loops were frequently bottlenecks
  – 10 total leaf-level function bottlenecks
  – 7 of these contained loop bottlenecks

• Bottleneck functions had many loops
  – Especially true for Fortran applications
  – OM3: 1 function, 83% CPU, 90 loops
  – Good results, even when code not modular
  – Correlate loop w/ source using call sites


Bottlenecks (ALARA)

[Figure: chart for ALARA. Axis labels: Seconds, Bottlenecks. Series: Loop PC (functions + loops), Loop PC (functions), PC (functions).]


Bottlenecks (SPhot)

[Figure: chart for SPhot. Axis labels: Seconds, Bottlenecks. Series: Loop PC (functions + loops), Loop PC (functions), PC (functions).]


Summary

• Not much overhead
  – Avoid trap-based instrumentation
  – Only instrument loops of bottleneck functions

• Find bottlenecks at similar rate
  – Loop-aware search finds more bottlenecks, since there are more in total to find

• More precise results
  – Little change in search time
  – Similar rates of experimentation


Loop-Based Automated Performance Analysis

[email protected]

http://www.paradyn.org

http://www.dyninst.org
