Loop-Based Automated Performance Analysis

Eli Collins
[email protected]
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706 USA

Motivation

• Automated performance analysis
  – Ongoing work, APART
• Previous work: Callgraph, Deepstart
  – Faster, more efficient searching
• This work: better localize performance problems
  – Report performance data at finer granularity

Motivation (Cont.)

• Function granularity works well
  – Don’t overload user w/ fine-grain data
• Why is a function a bottleneck?
  – Large function w/ multiple bottlenecks
  – Small function called repeatedly in a loop
• Idea: search inside bottleneck functions

Performance Consultant (PC)

• For code, PC searches the callgraph
  – Breadth First Search
  – Prune non-bottleneck functions
• Introduce a new callgraph level that
  – Is a logical unit of computation
  – Improves granularity
  – Partitions functions for searching
  – Keeps search space manageable (scalability)
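The breadth-first search with pruning described above can be sketched as follows. This is a minimal illustration, not the Performance Consultant's actual code: the callgraph, the CPU fractions, the 0.2 threshold, and the name `search_callgraph` are all hypothetical.

```python
from collections import deque


def search_callgraph(callgraph, root, is_bottleneck):
    """Breadth-first search that expands only nodes found to be bottlenecks."""
    found = []    # nodes that tested as bottlenecks
    tested = []   # every node an experiment was run on, in order
    queue = deque([root])
    seen = {root}
    while queue:
        node = queue.popleft()
        tested.append(node)
        if not is_bottleneck(node):
            continue  # prune: callees of a non-bottleneck are never tested
        found.append(node)
        for callee in callgraph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return found, tested


# Hypothetical callgraph and made-up inclusive CPU fractions.
callgraph = {"main": ["f1", "f2"], "f1": ["g1"], "f2": ["g2"]}
cpu_fraction = {"main": 1.0, "f1": 0.7, "f2": 0.1, "g1": 0.6, "g2": 0.05}
THRESHOLD = 0.2  # assumed bottleneck threshold

found, tested = search_callgraph(
    callgraph, "main", lambda node: cpu_fraction[node] > THRESHOLD
)
print(found)   # ['main', 'f1', 'g1']
print(tested)  # ['main', 'f1', 'f2', 'g1']
```

In this toy run `f2` is tested and rejected, so `g2` is never instrumented; that is the pruning step that keeps the search space manageable.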

Loops in the Callgraph

[Figure: two callgraphs side by side. "Callgraph": nodes main, f1, f2. "Callgraph w/ loops": nodes main, f1, loop 1, f1, f2.]

Why Loops?

• Loops may be bottlenecks themselves
  – Especially in scientific and long-running applications
• Loops are natural sources of parallelism
  – Compilers/HW exploit
  – OpenMP PARALLEL DO, loop unrolling/fusion
  – Provide feedback as to the effectiveness of these optimizations

Why Loops? (Cont.)

• Loops logically decompose functions
  – Natural hierarchy (name by nesting)
• We instrument loops in binary
  – Binary is what actually executes
  – Typically can correlate PC results w/ original source
  – Difficult w/ basic block, instruction granularity
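"Name by nesting" can be illustrated with a short sketch that walks a loop nest and gives each loop a hierarchical name. The `loop_1.2` naming pattern and the nested-list representation are assumptions made for this example; the slides do not state the exact scheme.

```python
def name_loops(loops, parent=""):
    """Name each loop by its position in the nesting hierarchy.

    `loops` is a list with one entry per loop at this level; each entry
    is itself the list of loops nested directly inside that loop.
    """
    names = []
    for index, nested in enumerate(loops, start=1):
        name = f"{parent}.{index}" if parent else f"loop_{index}"
        names.append(name)
        names.extend(name_loops(nested, name))
    return names


# Hypothetical function: two outermost loops; the first holds two nested
# loops, and the second of those holds one more.
function_loops = [[[], [[]]], []]
loop_names = name_loops(function_loops)
print(loop_names)
# ['loop_1', 'loop_1.1', 'loop_1.2', 'loop_1.2.1', 'loop_2']
```

A name such as `loop_1.2.1` carries its own position in the hierarchy, which is what makes loops easier to report than anonymous basic blocks or instructions.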

What’s new?

• Loop-level performance data is not new
  – Existing tools: DPOMP, HPCView, SvPablo
  – Edge instrumentation in EEL and OM
• Integrate loops into automated search
• Techniques to instrument loops on-the-fly
• Technical challenges doing this efficiently
  – Especially on IA32 (AMD64/EM64T)
• Results for some MPI/OpenMP applications

Binary Loop Instrumentation

Instrumentation points (numbered 1-4 in the figure): Entry, Exit, Begin iter., End iter.

Source:

    do {
        ...
        if (x > 100)
            break;
        ...
    } while (x < y);

Binary:

    LP:   inc %edx
          inc %eax
          cmp $0x64,%eax
          jg  DONE
          inc %edx
          inc %eax
          cmp %edx,%eax
          jl  LP
    DONE:
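One plausible way to locate the four instrumentation points is to classify control-flow edges relative to the loop. The sketch below applies that idea to a hand-built control-flow graph for the assembly on this slide. The block names `PRE` and `TAIL`, the edge list, and the mapping of "End iter." to back edges are assumptions for illustration, not the tool's actual implementation.

```python
def loop_instrumentation_points(edges, loop_blocks, header):
    """Classify CFG edges into the four loop instrumentation points."""
    entry = [(s, d) for s, d in edges if s not in loop_blocks and d == header]
    end_iter = [(s, d) for s, d in edges if s in loop_blocks and d == header]
    exits = [(s, d) for s, d in edges if s in loop_blocks and d not in loop_blocks]
    return {
        "entry": entry,        # edges entering the loop from outside
        "begin_iter": header,  # top of the loop body, reached every iteration
        "end_iter": end_iter,  # back edges to the header
        "exit": exits,         # edges leaving the loop, including the break
    }


# Hypothetical CFG: PRE falls into LP; LP is the block ending in `jg DONE`;
# TAIL is the block ending in `jl LP`.
edges = [
    ("PRE", "LP"),
    ("LP", "DONE"),    # break: x > 100
    ("LP", "TAIL"),
    ("TAIL", "LP"),    # back edge: x < y
    ("TAIL", "DONE"),  # fall through when the loop condition fails
]
points = loop_instrumentation_points(edges, {"LP", "TAIL"}, "LP")
print(points["entry"])     # [('PRE', 'LP')]
print(points["end_iter"])  # [('TAIL', 'LP')]
print(points["exit"])      # [('LP', 'DONE'), ('TAIL', 'DONE')]
```

The two exit edges show why exit instrumentation has to be placed on edges rather than at a single address: the `break` and the failed loop condition leave the loop from different blocks.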

New Instrumentation Techniques

• Traditional function, edge instrumentation
• Function relocation, previously
  – Function entry, exit, callsites
• For loops, may relocate function again
  – Ensure enough padding around basic blocks which need to be instrumented
  – Avoid trap-based instrumentation
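The padding concern can be made concrete with a small decision sketch. On x86, a `jmp rel32` occupies 5 bytes (opcode E9 plus a 32-bit displacement) while an `int3` trap occupies 1 byte, so a block shorter than 5 bytes cannot hold a jump without overwriting its neighbor. The decision rule, the function name `choose_patch`, and the block sizes below are hypothetical and only illustrate why relocation with padding avoids traps.

```python
JUMP_REL32_BYTES = 5  # x86 `jmp rel32`: opcode E9 + 32-bit displacement
TRAP_BYTES = 1        # x86 `int3`: opcode CC


def choose_patch(block_bytes, can_relocate):
    """Pick how to divert control from a basic block to instrumentation."""
    if block_bytes >= JUMP_REL32_BYTES:
        return "jump"      # a jump fits in place
    if can_relocate:
        return "relocate"  # copy the function, padding the short block
    return "trap"          # last resort: 1-byte trap, handled via a signal


# Hypothetical block sizes in bytes.
print(choose_patch(7, can_relocate=True))   # jump
print(choose_patch(3, can_relocate=True))   # relocate
print(choose_patch(3, can_relocate=False))  # trap
```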

Loop-based Search Strategy

• PC uses loops as steps in its refinement

Loop-based Search Strategy (Cont.)

• Inclusive metric, instrument loop entry/exit
• If a node is a bottleneck, instrument
  – Function: outermost loops and call sites
  – Loop: nested loops and call sites
• # of PC experiments
  – More total experiments possible w/ loops
  – But loops can help prune search
    • E.g. loops which contain multiple call sites
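The trade-off in experiment counts can be seen in a toy comparison. The function `solve`, its two loops, the call sites, and the set of "hot" nodes are all made up for illustration; only the refinement rule (test a node, and expand its loops and call sites only if it is a bottleneck) follows the slide.

```python
def count_experiments(root, children, is_bottleneck):
    """Count nodes tested when only bottleneck nodes are refined."""
    experiments = 0
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        experiments += 1
        if is_bottleneck(node):
            frontier.extend(children.get(node, []))
    return experiments


# Hypothetical function `solve`: loop_1 contains three call sites,
# loop_2 contains one.
with_loops = {
    "solve": ["solve/loop_1", "solve/loop_2"],
    "solve/loop_1": ["a", "b", "c"],
    "solve/loop_2": ["kernel"],
}
without_loops = {"solve": ["a", "b", "c", "kernel"]}

# Case 1: only loop_2 and its callee are bottlenecks.
hot = {"solve", "solve/loop_2", "kernel"}
pruned_with = count_experiments("solve", with_loops, lambda n: n in hot)
pruned_without = count_experiments("solve", without_loops, lambda n: n in hot)
print(pruned_with, pruned_without)  # 4 5

# Case 2: everything is a bottleneck.
all_with = count_experiments("solve", with_loops, lambda n: True)
all_without = count_experiments("solve", without_loops, lambda n: True)
print(all_with, all_without)  # 7 5
```

In the first case one experiment on `loop_1` replaces three experiments on its call sites; in the second the loop level adds two experiments, which matches the observation that more total experiments are possible with loops.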

Test Applications

Name    Type                        Lines   KB      Funcs   Loops
ALARA   Sequential, C++             21K     6,382   718     598
DRACO   Sequential, F90             72K     2,516   898     5,477
OM3     MPI, C, 8 nodes             3K      88      28      202
SPhot   MPI/OpenMP, F90, 8 nodes    3K      895     31      106

Results

• Loops were frequently bottlenecks
  – 10 total leaf-level function bottlenecks
  – 7 of these contained loop bottlenecks
• Bottleneck functions had many loops
  – Especially true for Fortran applications
  – OM3: 1 function, 83% CPU, 90 loops
  – Good results, even when code not modular
  – Correlate loop w/ source using call sites

Bottlenecks (ALARA)

[Figure: plot with axes labeled Seconds and Bottlenecks; recovered tick marks run from 0 to 20. Series: Loop PC (functions + loops), Loop PC (functions), PC (functions).]

Bottlenecks (SPhot)

[Figure: plot with axes labeled Seconds and Bottlenecks; recovered tick marks run from 0 to 20 and from 0 to 42. Series: Loop PC (functions + loops), Loop PC (functions), PC (functions).]

Summary

• Not much overhead
  – Avoid trap-based instrumentation
  – Only instrument loops of bottleneck functions
• Find bottlenecks at similar rate
  – Loop-aware finds more, more in total to find
• More precise results
  – Little change in search time
  – Similar rates of experimentation

Loop-Based Automated Performance Analysis

[email protected]

http://www.paradyn.org

http://www.dyninst.org