Loop-Based Automated Performance Analysis
Eli Collins
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706 USA
Motivation
• Automated performance analysis
  – Ongoing work, APART
• Previous work: Callgraph, Deepstart
  – Faster, more efficient searching
• This work: better localize performance problems
  – Report performance data at finer granularity
Motivation (Cont.)
• Function granularity works well
  – Don’t overload user w/ fine-grain data
• Why is a function a bottleneck?
  – Large function w/ multiple bottlenecks
  – Small function called repeatedly in a loop
• Idea: search inside bottleneck functions
Performance Consultant (PC)
• For code, PC searches the callgraph
  – Breadth First Search
  – Prune non-bottleneck functions
• Introduce a new callgraph level that
  – Is a logical unit of computation
  – Improves granularity
  – Partitions functions for searching
  – Keeps search space manageable (scalability)
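The callgraph search described on this slide can be sketched as a breadth-first traversal that expands only the nodes whose measured metric exceeds a threshold. This is a minimal illustration, not the Performance Consultant's implementation: the `callgraph` dictionary, the `cpu_fraction` measurements, and the 0.2 threshold are all invented for the example.

```python
from collections import deque

def find_bottlenecks(callgraph, cpu_fraction, root, threshold=0.2):
    """Breadth-first search that prunes non-bottleneck nodes.

    callgraph    -- hypothetical dict: node name -> list of callee names
    cpu_fraction -- hypothetical dict: node name -> measured inclusive CPU share
    threshold    -- assumed cutoff above which a node counts as a bottleneck
    """
    bottlenecks = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if cpu_fraction.get(node, 0.0) < threshold:
            continue  # pruned: the callees of this node are never examined
        bottlenecks.append(node)
        for callee in callgraph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return bottlenecks

# Made-up callgraph and measurements, for illustration only.
callgraph = {"main": ["f1", "f2"], "f1": ["g"], "f2": ["h"]}
cpu_fraction = {"main": 1.0, "f1": 0.7, "f2": 0.05, "g": 0.6, "h": 0.04}
result = find_bottlenecks(callgraph, cpu_fraction, "main")
print(result)
```

Because `f2` falls below the threshold, its callee `h` is never measured, which is what keeps the search space manageable. A loop level would add nodes between a function and its call sites in the same structure.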
Loops in the Callgraph
[Figure: "Callgraph" (nodes main, f1, f2) shown beside "Callgraph w/ loops" (nodes main, loop 1, f1, f2).]
Why Loops?
• Loops may be bottlenecks themselves
  – Especially in scientific and long-running applications
• Loops are natural sources of parallelism
  – Compilers/HW exploit
  – OpenMP PARALLEL DO, loop unrolling/fusion
  – Provide feedback as to the effectiveness of these optimizations
Why Loops?
• Loops logically decompose functions
  – Natural hierarchy (name by nesting)
• We instrument loops in binary
  – Binary is what actually executes
  – Typically can correlate PC results w/ original source
  – Difficult w/ basic block, instruction granularity
What’s new?
• Loop-level performance data is not new
  – Existing tools: DPOMP, HPCView, SvPablo
  – Edge instrumentation in EEL and OM
• Integrate loops into automated search
• Techniques to instrument loops on-the-fly
• Technical challenges doing this efficiently
  – Especially on IA32 (AMD64/EM64T)
• Results for some MPI/OpenMP applications
Binary Loop Instrumentation
• Instrumentation points (numbered 1-4 in the figure): Entry, Exit, Begin iter., End iter.

Source loop:

    do {
        ...
        if (x > 100)
            break;
        ...
    } while (x < y);

Compiled loop:

    LP:   inc %edx
          inc %eax
          cmp $0x64,%eax
          jg  DONE
          inc %edx
          inc %eax
          cmp %edx,%eax
          jl  LP
    DONE:
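One way to read the four instrumentation points is in terms of control-flow edges. The sketch below is an assumption about that mapping, not the tool's API: it splits the assembly above into hypothetical basic blocks (`PRE`, `LP`, `B2`, `DONE`) and classifies each edge relative to the loop body.

```python
def classify_loop_edges(edges, header, body):
    """Classify control-flow edges relative to one loop.

    edges  -- list of (source_block, target_block) pairs
    header -- the block at the top of the loop
    body   -- set of blocks belonging to the loop (header included)
    """
    points = {"entry": [], "exit": [], "back": [], "internal": []}
    for src, dst in edges:
        if src not in body and dst == header:
            points["entry"].append((src, dst))     # control enters the loop
        elif src in body and dst not in body:
            points["exit"].append((src, dst))      # control leaves the loop
        elif src in body and dst == header:
            points["back"].append((src, dst))      # one iteration ends, the next begins
        elif src in body and dst in body:
            points["internal"].append((src, dst))  # stays inside one iteration
    return points

# Hypothetical block names for the listing above:
#   LP = first inc/inc/cmp/jg group, B2 = second inc/inc/cmp/jl group.
edges = [
    ("PRE", "LP"),    # fall into the loop
    ("LP", "DONE"),   # jg DONE (the break)
    ("LP", "B2"),     # fall-through past jg
    ("B2", "LP"),     # jl LP (next iteration)
    ("B2", "DONE"),   # fall-through when the loop condition fails
]
points = classify_loop_edges(edges, header="LP", body={"LP", "B2"})
print(points)
```

In this reading the loop has one entry edge but two exit edges, because the `break` and the failed loop condition leave from different blocks, so covering loop exit takes more than one insertion point.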
New Instrumentation Techniques
• Traditional function, edge instrumentation
• Function relocation, previously
  – Function entry, exit, callsites
• For loops, may relocate function again
  – Ensure enough padding around basic blocks which need to be instrumented
  – Avoid trap-based instrumentation
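Why padding matters can be illustrated with a size check. On IA32 a near `jmp rel32` occupies 5 bytes while an `int3` trap occupies 1, so a basic block shorter than 5 bytes cannot hold a jump to instrumentation code in place. The sketch below is a simplified assumption about that decision, not the tool's actual logic; the block names and sizes are invented.

```python
JMP_REL32_SIZE = 5  # bytes: opcode E9 plus a 32-bit displacement on IA32

def plan_instrumentation(block_sizes, targets):
    """Decide, per target block, whether a jump fits in place.

    block_sizes -- hypothetical dict: block name -> size in bytes
    targets     -- names of the blocks that need instrumentation
    Returns a dict: block name -> "jump-in-place" or "relocate-with-padding".
    """
    plan = {}
    for block in targets:
        if block_sizes[block] >= JMP_REL32_SIZE:
            plan[block] = "jump-in-place"
        else:
            # Too small for a 5-byte jump: relocate the function and pad
            # the block instead of falling back to a 1-byte trap.
            plan[block] = "relocate-with-padding"
    return plan

# Invented block sizes, for illustration only.
plan = plan_instrumentation({"LP": 9, "B2": 3, "DONE": 12}, ["LP", "B2"])
needs_relocation = any(v == "relocate-with-padding" for v in plan.values())
print(plan, needs_relocation)
```

A real rewriter also has to respect instruction boundaries and branch targets inside the overwritten bytes, which this sketch ignores.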
Loop-based Search Strategy
• PC uses loops as steps in its refinement
Loop-based Search Strategy
• Inclusive metric, instrument loop entry/exit
• If a node is a bottleneck, instrument
  – Function: outermost loops and call sites
  – Loop: nested loops and call sites
• # of PC experiments
  – More total experiments possible w/ loops
  – But loops can help prune search
    • E.g. loops which contain multiple call sites
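The refinement rule on this slide can be sketched as a function that returns what to instrument next once a node tests as a bottleneck. The classes, the loop names, and the choice to list only call sites that sit directly in a function or loop (so that deeper call sites are reached through their enclosing loop) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Loop:
    name: str
    nested_loops: List["Loop"] = field(default_factory=list)
    call_sites: List[str] = field(default_factory=list)  # callees called directly in this loop

@dataclass
class Function:
    name: str
    outermost_loops: List[Loop] = field(default_factory=list)
    call_sites: List[str] = field(default_factory=list)  # callees called outside any loop

def refine(node):
    """Return the names to instrument next once `node` tests as a bottleneck."""
    if isinstance(node, Function):
        loops = node.outermost_loops   # function: outermost loops and call sites
    else:
        loops = node.nested_loops      # loop: nested loops and call sites
    return [loop.name for loop in loops] + list(node.call_sites)

# Hypothetical program structure; loop names follow their nesting.
inner = Loop("loop_1.1", call_sites=["f2"])
outer = Loop("loop_1", nested_loops=[inner], call_sites=["f1"])
main = Function("main", outermost_loops=[outer], call_sites=["init"])

first = refine(main)    # next experiments if main is a bottleneck
second = refine(outer)  # next experiments if loop_1 is a bottleneck
print(first, second)
```

If `loop_1` turns out not to be a bottleneck, the call sites to `f1` and `f2` need no experiments at all, which is how a loop containing several call sites can prune the search.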
Test Applications
Name    Type                        Lines   KB      Funcs   Loops
ALARA   Sequential, C++             21K     6,382   718     598
DRACO   Sequential, F90             72K     2,516   898     5,477
OM3     MPI, C, 8 nodes             3K      88      28      202
SPhot   MPI/OpenMP, F90, 8 nodes    3K      895     31      106
Results
• Loops were frequently bottlenecks
  – 10 total leaf-level function bottlenecks
  – 7 of these contained loop bottlenecks
• Bottleneck functions had many loops
  – Especially true for Fortran applications
  – OM3: 1 function, 83% CPU, 90 loops
  – Good results, even when code not modular
  – Correlate loop w/ source using call sites
Bottlenecks (ALARA)
[Chart: axes Seconds and Bottlenecks; series Loop PC (functions + loops), Loop PC (functions), PC (functions).]
Bottlenecks (SPhot)
[Chart: axes Seconds and Bottlenecks; series Loop PC (functions + loops), Loop PC (functions), PC (functions).]
Summary
• Not much overhead
  – Avoid trap-based instrumentation
  – Only instrument loops of bottleneck functions
• Find bottlenecks at similar rate
  – Loop-aware search finds more, since there are more in total to find
• More precise results
  – Little change in search time
  – Similar rates of experimentation
Loop-Based Automated Performance Analysis
http://www.paradyn.org
http://www.dyninst.org