A Recipe for Performance Analysis and Tuning in Parallel Computing Environments (Note 20), 18 October 2001 (Rev. 21 October 2001) (Note 19), Jack K. Horner, SAIC and LANL/CCN-8, [email protected]


Page 1

A Recipe for Performance Analysis and Tuning in Parallel Computing Environments (Note 20)

18 October 2001 (Rev. 21 October 2001) (Note 19)

Jack K. Horner
SAIC and LANL/CCN-8

[email protected]

Page 2

Overview

• Objectives
• Some trades in approaches to performance analysis
• The recipe
• Some open issues in performance analysis
• Notes and references

Page 3

Objectives

To provide
– a tutorial-level workflow for parallel performance analysis and tuning, based on hands-on use on large parallel codes
– in terms of this workflow, a high-level survey of representative production-class (mainly COTS) parallel performance analysis tools (Note 10)

– a list of open issues in tool evolution

Page 4

Some trades in approaches to performance analysis (1 of 2)

• Modeling-oriented
– comprehensive in intent
– but often requires long development lead time

• Measurement-oriented
– problem-specific by default; may lead to suboptimal results globally
– but can often produce some useful results with little investment

Page 5

Some trades in approaches to performance analysis (2 of 2)

• The two approaches are interdependent
– calibration of an explicit performance model requires some performance measurement, taken as truth data
– any interpretation of a performance measurement is (at least implicitly) based on some performance model (Note 12)

Page 6

The recipe (performed in order) (Note 17)

• First, in single-thread-of-control mode
– Get the right answers
– Profile the program’s execution
– Use existing, tuned library code for performance-critical single-thread functionality

• Then, in multiple-thread-of-control mode
– Get the right answers
– Let the compiler optimize what it can
– Profile the program’s execution
– Optimize memory utilization

Page 7

IN SINGLE THREAD OF CONTROL MODE

Page 8

Get the right answers (1 of 2)

• Seems embarrassingly obvious, but
– often underfunded
– well-calibrated software project effort and schedule estimation models (Note 13) predict this activity can easily consume half of the project effort
– outside the existing calibration envelope, may require a V&V effort as large as the rest of the project (Note 7)

Page 9

Get the right answers (2 of 2)

• Comparing results from multiple compilers (with associated math libraries) can help expose compiler- or math-library-based numeric bugs (Note 6)

• Use modern quality-assurance tools and practices (Note 16)

Page 10

Profile the program’s execution (1 of 5)

• Most vendor-supplied profilers are, on the whole, at least as good as “home-grown” timers/profilers; none are perfect

• In any case, the relation between source and machine instructions may be many-to-many, making at least instruction-level accounting problematic

• Comparing the output of multiple profilers is best (Note 3)

Page 11

Profile the program’s execution (2 of 5)

• Low-level (machine/OS-event-oriented)
– Performance Data API (PAPI) implementations (multiplatform; Ref. [4])
– perfex (SGI/IRIX; Ref. [1])
– DCPI (Note 1), ProfileMe (Compaq/Tru64; Refs. [2], [16])
– jm_status (IBM/AIX; Ref. [19])
– Xkstat (Sun/Solaris; Ref. [20]) (Note 15)

Page 12

Profile the program’s execution (3 of 5)

SpeedShop + prof (SGI O2K/IRIX 6.5.x) example (sage):

t01:~/sage20010624 [5] > mpirun -np 4 ssrun -pcsamp sage.x timing_c.input
t01:~/sage20010624 [12] > prof sage.x.pcsamp.f603300
-------------------------------------------------------------------------
[prof header material deleted here]
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index]    secs      %   cum.%  samples  function (dso: file, line)

[1]     186.140  29.8%  29.8%   18614   TOKEN_GS_R8 (sage.x:module_token.f90, 6270)
[2]     136.220  21.8%  51.7%   13622   CSR_CG_SOLVER (sage.x:module_matrix.f90, 304)
[material deleted here]

Page 13

Profile the program’s execution (4 of 5)

• Source-traceback profiling
– TAU (multiplatform; Ref. [8])
– HPCView (multiplatform; Ref. [6]) (Note 4)
– SpeedShop + prof (SGI/IRIX; Ref. [1])
– atom (Compaq/Tru64; Ref. [17])
– prof/gprof (Sun Enterprise; Ref. [3])

Page 14

Profile the program’s execution (5 of 5)

• General dynamic instrumentation of executables
– Paradyn/DynInst/DPCL (multiplatform; Refs. [12], [13], [14])

Page 15

Use existing, tuned library code for single-thread performance-critical regions

• Unless library tuning IS your project, you will be hard-pressed to improve on existing, well-calibrated, tuned service-level code

• Examples of good math libraries
– NAG
– ACM

Page 16

IN MULTIPLE THREAD OF CONTROL MODE

Page 17

Get the right answers

• See previous rant on this topic (Note 18)

Page 18

Profile the program’s execution (1 of 3)

• Hypotheses about parallel performance bottlenecks that are not based on actual measurement are
– like ants at a picnic (many and distracting)
– often enough, spectacularly wrong

• As a first step, use profilers mentioned in the “in single-thread mode” section of this presentation: most are parallel-tolerant

Page 19

Profile the program’s execution (2 of 3)

perfex (SGI O2K/IRIX 6.5.x) example (sage, 4 MPI processes):

t01:~/sage20010624 [16] > mpirun -np 4 perfex -mp -a -x -y -o myperf.txt sage.x timing_c.input
t01:~/sage20010624 [18] > vi myperf.txt
[material deleted here]
Statistics
===========================================================================
Graduated instructions/cycle ............................................. 0.583431
[material deleted here]
L1 Cache Line Reuse ...................................................... 16.610574
L2 Cache Line Reuse ......................................................  4.691712
L1 Data Cache Hit Rate ...................................................  0.943216
L2 Data Cache Hit Rate ...................................................  0.824306
Time not making progress (probably waiting on memory) / Total time .......  0.792328
[material deleted here]
MFLOPS (average per process) .............................................  5.650960

Page 20

Profile the program’s execution (3 of 3)

• Apply communication-oriented profilers
– MPI
• vampir/Guide (multiplatform; Refs. [8], [9])
• upshot/nupshot (multiplatform; Ref. [10])
• Ref. [11] is a good review, if a little dated
– OpenMP
• KAPro Toolset (multiplatform; Ref. [9])

Page 21

Let the compiler optimize what it can

• Give your compiler a fighting chance by providing it reasonable optimization opportunities (see Refs. [1] and [3])

• Modern optimizing compilers typically produce binaries that outperform most “hand-tuned” code

• Custom tuning at the source level almost invariably yields performance that does not port across platforms

Page 22

Optimize memory utilization (1 of 3)

• Proximity of process and memory can affect performance in distributed-shared-memory systems
– In most of LANL’s large simulators, provided a process is not memory-starved, placement does NOT account for more than a few percent of wallclock (Note 5)
– Applications that perform frequent large memory moves (Note 9) tend to be sensitive to placement

Page 23

Optimize memory utilization (2 of 3)

• Profile memory utilization
– dmalloc (various platforms; Ref. [7])
– perfex, dlook, dprof, nodememusg (SGI/IRIX; Refs. [1], [18])
– jm_status (IBM/AIX; Ref. [19])
– Xkstat (Sun/Solaris; Ref. [20]) (Note 15)

• High cache-miss, TLB-miss, or swap rates are the hallmark of suboptimal memory utilization

Page 24

Optimize memory utilization (3 of 3)

• In distributed-memory architectures, size each process’s memory requirements to ~50% of the local main-memory block associated with the process

• Use placement tools (Note 2) where available

• Beware of (sometimes unadvertised) interactions between load-sharing runtimes and placement utilities (Note 8)

Page 25

Some open issues in performance analysis (1 of 4)

• Accounting attribution is not always what it seems to be (the user needs too much knowledge of tool internals)

• Concurrent presentation of metrics is not well supported in COTS

• There is no standard model for cross-platform performance comparisons

Page 26

Some open issues in performance analysis (2 of 4)

• C++ support, especially for templates and mangled names, is limited in COTS

• Hardware support for performance metrics varies greatly among platforms

• COTS performance analysis tools are often optimized around the vendor’s hardware-diagnostic interests

Page 27

Some open issues in performance analysis (3 of 4)

• COTS products provide little support for dynamically identifying “performance groups” -- collections of blocks that act in concert and dominate wallclock

• The mapping between the science-application-domain expert’s and the computer scientist’s views of the code is not supported in COTS (TAU does provide support)

Page 28

Some open issues in performance analysis (4 of 4)

• Tool literature is not centralized -- a well-maintained database with hyperlinks would be welcome (candidate for a Parallel Tools Consortium (Ref. [23]) project?) (Note 14)

• Commodity-market incentives to address any of the above on speculation are very low, especially for large systems (Note 11)

Page 29

Notes (1 of 8)

1. As of 18 October 2001, it is unclear whether DCPI will be supported on the LANL Compaq Q (30 TeraOps) system.

2. Such as SGI’s dplace (Ref. [1]).

3. HPCView (Ref. [6]), for example, provides a convenient way to juxtapose multiple metrics.

4. HPCView presumes the existence of prof-like files produced by software other than HPCView.

Page 30

Notes (2 of 8)

5. J. K. Horner and P. Gustafson, Los Alamos National Laboratory, unpublished results using Ref. [15].

6. W. R. Boland, D. L. Harris, and J. K. Horner, “A comparison of Fortran 90 compiler features on the LANL SGI Origin2000”, presentation at SGI Quarterly Review, Fall 1998. (NAG, Compaq, HP, and Lahey/Fujitsu Fortran compilers are good crosschecks.)

7. For example, the LANL Advanced Hydrodynamic Facility (AHF), with a nominal cost of ~$10^10.

Page 31

Notes (3 of 8)

8. For example, between LSF and dplace under SGI/IRIX.

9. For example, streaming visualization systems.

10. If I failed to mention your favorite tool, let me know, and I’ll try to include it in an update.

Page 32

Notes (4 of 8)

11. Most of today’s supercomputers are effectively networked commodity-class SMP nodes. Nominally, an ASCI-class supercomputer configured this way costs ~$10^8. The manufacturers of such systems are also typically in the PC/workstation market. 10% of the workstation/PC market, a nominal vendor market share, is ~$10^10 per year. On average, a successful ASCI-class computer vendor could hope to sell one supercomputer system every three years. Thus the revenue from the sale of an ASCI-class supercomputer represents only ~0.3% of the annual revenues of a nominal COTS workstation/PC vendor.

Page 33

Notes (5 of 8)

12. In the sense of Ref. [21], pp. 18-20.

13. Such as Revised Intermediate COCOMO (REVIC), Ref. [22].

14. Federico Bassetti of LANL has an outstanding LANL-internal web page dedicated to this objective.

15. Xkstat, like many system monitoring utilities, provides system-wide, not process-specific metrics.

Page 34

Notes (6 of 8)

16. Getting the right answers requires a system of procedures and practices that optimize the probability of getting the right answers. These include, but are not limited to, tool-assisted configuration management, requirements definition, logical and physical design documentation, testing, and user documentation (Ref. [24]). Thanks to Bill Spangenberg, Richard Barrett, and Michael Ham of LANL for reminding me of the importance of this point.

17. See Ref. [3] for an outstanding tutorial on most of these topics.

Page 35

Notes (7 of 8)

18. Thanks to Larry Cox for chiding me on this point. In the original version of this presentation, I omitted explicit mention of this topic under the “multiple-thread-of-control” heading.

19. Based on comments from participants in the Los Alamos Computer Science Institute (LACSI) Symposium 2001, Workshop on Parallel Tools, 18 October 2001, Santa Fe, NM.

Page 36

Notes (8 of 8)

20. Los Alamos National Laboratory identifier LA-UR-01-5827. This work was supported in part by University of California/Los Alamos National Laboratory Subcontract 22237-001-01 4x with Science Applications International Corporation. This document is not claimed to represent the views of the U.S. Government or its contractors.

Page 37

References (1 of 7)

[1] SGI, Origin2000 and Onyx2 Performance Tuning and Optimization Guide, Document Number 007-3430-003, URL http://techpubs.sgi.com/library/.

[2] Compaq, Compaq (formerly Digital) Continuous Profiling Interface (DCPI), URL http://www.tru64unix.compaq.com/dcpi/.

[3] P. Mucci and K. London, Application Optimization and Tuning of Cache Based Uniprocessors, SMPs, and MPPs, Including OpenMP and MPI, URL http://www.cs.utk.edu/~mucci/MPPopt.html.

Page 38

References (2 of 7)

[4] S. Browne (Moore) et al., Ptools Project: Performance Data API (PAPI), URL http://www.cs.utk.edu/~browne/ptools98/perfAPI.

[5] A. Malony, S. Shende et al., TAU, URL http://www.cs.oregon.edu/research/paracomp/proj/tau.

[6] J. Mellor-Crummey, R. Fowler, and G. Marin, HPCView; contact [email protected].

[7] G. Watson, dmalloc, URL http://dmalloc.com.

Page 39

References (3 of 7)

[8] Pallas, vampir, URL http://www.pallas.com/pages/vampir.htm.

[9] KAI Software, KAPro Toolset, URL http://www.kai.com.

[10] Argonne National Laboratory, upshot, URL http://www-fp.mcs.anl.gov/~lusk/upshot.

[11] S. Browne, K. London, J. Dongarra, “Review of Performance Analysis Tools for MPI Parallel Programs”, URL http://www.cs.utk.edu/~browne/perftools-review.

Page 40

References (4 of 7)

[12] B. Miller et al., Paradyn, URL http://www.cs.wisc.edu/paradyn.

[13] (IBM-sponsored), Dynamic Probe Class Library (DPCL) Users Manual, available at URL http://www.cs.wisc.edu/paradyn/DPCL.

[14] J. Hollingsworth et al., Dyninst, URL http://www.dyninst.org.

Page 41

References (5 of 7)

[15] P. Gustafson and J. Horner, LANL, Memory Utilization Tracking Tool (MUTT) Version 1.1, 1999 (no longer supported).

[16] J. Dean et al., “ProfileMe: hardware support for instruction-level profiling on out-of-order processors”, Proceedings of the 30th Symposium on Microarchitecture (Micro-30), December 1997, URL ftp://ftp.digital.com/pub/DEC/SRC/publications/weihl/micro30.ps.

Page 42

References (6 of 7)

[17] Compaq, atom, URL http://www.tru64unix.compaq.com/developerstoolkit/#atom.

[18] J. Horner, nodememusg, LANL, contact [email protected].

[19] A. Baker, “CU Boulder’s SP2”, URL http://www-ugrad.cs.colorado.edu/~csci4576/SP2/introsp.html#Useful.

[20] Xkstat, URL http://www.sunperf.com/perfmontools.html.

Page 43

References (7 of 7)

[21] C. C. Chang and H. J. Keisler, Model Theory, North-Holland, 1990.

[22] U.S. Air Force Cost Analysis Agency, Revised Intermediate COCOMO, Version 9.2, URL http://www.hq.af.mil/afcaa/models/REVIC92.EXE.

[23] URL http://www.ptools.org.

[24] URL http://www.softstarsystems.com/faq.htm.