PerfLib Overview Jeff Brown LANL/CCS-7 July 28, 2010

Page 1: PerfLib Overview Jeff Brown LANL/CCS-7 July 28, 2010

PerfLib Overview

Jeff Brown
LANL/CCS-7
July 28, 2010

Page 2

PerfLib is ...

a performance measurement and analysis tool which consists of two components:

- a run-time library (libperfrt), and
- a post-processing library, scripts, and utility programs

Page 3

How to use PerfLib

1. instrument the code
   - leverage existing timing infra (rage, flag, partisn, ...)
   - instrumentation moves with the code to new systems
   - calipers – must be “well formed” (beware alt. returns)
2. build the code linking the run-time library
   - minor modifications to the build system
3. run code to collect data
   - trigger data collection via:
     - environment variable settings, or
     - library calls
4. post process to analyse results (runpp, perfpp, etc.)
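The "well formed" requirement in step 1 presumably means that every caliper start is matched by a stop in properly nested order, so an exit path that skips the stop (such as an alternate return) would leave the call tree unbalanced. The sketch below illustrates that idea only. `CallTree` and `caliper` are hypothetical names invented for this example and are not PerfLib's API; the slides do not show the library calls.

```python
import time
from contextlib import contextmanager


class CallTree:
    """Hypothetical stand-in for a timing library; not PerfLib's API."""

    def __init__(self):
        self.stack = []    # names of the calipers that are currently open
        self.totals = {}   # call path (tuple of names) -> [calls, seconds]

    @contextmanager
    def caliper(self, name):
        self.stack.append(name)
        path = tuple(self.stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Runs on normal exit, early return, or exception, so the
            # matching "stop" is never skipped and nesting stays balanced.
            elapsed = time.perf_counter() - start
            entry = self.totals.setdefault(path, [0, 0.0])
            entry[0] += 1
            entry[1] += elapsed
            self.stack.pop()


tree = CallTree()


def hydro(skip):
    with tree.caliper("hydro"):
        if skip:
            return "early"   # the early exit still closes the caliper
        with tree.caliper("cdt"):
            pass
        return "done"


with tree.caliper("cycle"):
    hydro(skip=True)
    hydro(skip=False)

for path, (calls, seconds) in tree.totals.items():
    print(" > ".join(path), calls, f"{seconds:.6f} s")
```

Tying the stop to scope exit is one way to get well-formed calipers for free. In a code with explicit start/stop calls, the same guarantee has to be checked by hand on every return path.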

Page 4

Performance Data Collected

Profiling:
- timing
- MPI (via profiling interface – no additional instr. req.)
- hardware counters (e.g. flops, cache, tlb, etc.)
- memory allocation (RSS footprint)
- IO

Tracing:
- timing/MPI
- memory allocation

Page 5

Post Processing

perfpp: a do-it-all script that discovers what data was collected and generates all possible reports/plots

runpp: just generates the reports

some sample reports and plots ...

Page 6

PerfLib Overhead Tracked:
- ~2% for timing/MPI
- 12% for hardware counters
- 21% for memory allocation

Page 7

PerfLib Header Report for 64 PE Comet Impact Run on tu

PERFlib version 3.0
Data path: /scratch2/jeffb/tu/PerfTrack/3.0/crestone/timing/user_problems/comet/2.56km/48/64/20100324
Data directory: comet-20100324

Performance data header:
  run date: 20100324
  OS version: Linux
  code version: xrage.1003.00
  compiled with: unknown
  MPI version: /usr/projects/packages/openmpi/tu133/openmpi-intel-1.3.3
  problem name: comet.input
  hosts: 64 processes running on (all processors with 2.30GHz cpu, 512 KB level 2 cache, 1024 TLB entries, 32189.25GB physical memory)
    tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain
    tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain
    tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain
    tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain, tua043:tua043.localdomain
    tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain
    tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain
    tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain
    tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain, tua043:tua044.localdomain
    tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain
    tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain
    tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain
    tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain, tua043:tua046.localdomain
    tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain
    tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain
    tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain
    tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain, tua043:tua051.localdomain

Profile metrics:
  Time (wall clock) enabled (PAPI timer)
  Counters (hardware performance data) not enabled (to enable, setenv PERF_PROFILE_COUNTERS)
  Memory not enabled (to enable, setenv PERF_PROFILE_MEMORY)
  MPI enabled
  IO enabled
Trace metrics:
  Memory not enabled (to enable, setenv PERF_TRACE_MEMORY)

Performance data dump frequency - every 10 cycles
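The header report names three environment variables that switch on optional data collection. As a rough sketch of how a launch script might set them before starting an instrumented run, the helper below builds an environment dictionary. The variable names come from the report; the helper `perf_env`, its keyword names, and the use of an empty value (mirroring a bare csh `setenv NAME`) are assumptions made for illustration, since the slides do not say whether PerfLib checks for a particular value.

```python
import os

# Variable names are taken from the header report; the short keys are
# made up for this example.
PERF_SWITCHES = {
    "counters": "PERF_PROFILE_COUNTERS",
    "memory": "PERF_PROFILE_MEMORY",
    "trace_memory": "PERF_TRACE_MEMORY",
}


def perf_env(*metrics, base=None):
    """Return a copy of the environment with the requested switches set."""
    env = dict(os.environ if base is None else base)
    for metric in metrics:
        # Assumption: an empty value is enough, as with a bare `setenv NAME`.
        env[PERF_SWITCHES[metric]] = ""
    return env


env = perf_env("counters", "memory", base={"PATH": "/usr/bin"})
print(sorted(env))
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when launching the instrumented executable.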

Page 8

Cycle: 7234, MPI rank: 0, all instrumentation levels
dumping elapsed time performance data since start of run (inclusive nested calling tree)
routines with average time/call > 10 us after 1000 calls
skipping routines with < 3% wall clock time and < 1e+08 bytes sent/rcvd and < 1e+06 bytes written

controller 100.00%; 177.89 s(177.89 s); 1 call
+ controller_0 99.92%; 177.74 s(177.74 s); 1 call
+ . controller_3 96.71%; 172.04 s(172.04 s); 96 calls (1.792 s/call avg, 1.324 s min, 3.130 s max)
+ . | cycle 94.46%; 168.04 s(168.04 s); 95 calls (1.769 s/call avg, 1.578 s min, 2.005 s max)
+ . | . hydro 65.71%; 116.89 s(116.89 s); 95 calls (1.230 s/call avg, 1.210 s min, 1.363 s max)
+ . | . + cdt 10.13%; 18.025 s(18.025 s); 285 calls (0.063 s/call avg, 0.056 s min, 0.072 s max)
+ . | . + . xmeos 6.10%; 10.859 s(10.859 s); 285 calls (0.038 s/call avg, 0.035 s min, 0.046 s max)
+ . | . + . token_allreduce 3.29%; 5.847 s(5.847 s); 285 calls (0.021 s/call avg, 0.015 s min, 0.028 s max)
+ . | . + . | MPI_Allreduce 3.28%; 5.842 s(5.842 s); 285 calls (0.020 s/call avg, 0.015 s min, 0.027 s max)
      2.23 KB sent (avg: 8 B, BW: 0.000372168 MB/s); 2.23 KB rcvd (avg: 8 B, BW: 0.000372168 MB/s)
+ . | . + hydro_lanl_1 55.53%; 98.774 s(98.774 s); 190 calls (0.520 s/call avg, 0.507 s min, 0.586 s max)
+ . | . + . d_common 19.79%; 35.206 s(35.206 s); 3420 calls (0.010 s/call avg, 8739 us min, 0.043 s max)
+ . | . + . | . MPI_Irecv 0.02%; 0.038 s(0.038 s); 17064 calls (0.000002 s/call avg, 1 us min, 21 us max)
      99046.69 KB rcvd (avg: 5943.73 B, BW: 2516 MB/s)
+ . | . + . | (d_common exclusive) 16.12%; 28.673 s(28.673 s);
+ . | . + . h_1_advect_vol 5.53%; 9.829 s(9.829 s); 190 calls (0.052 s/call avg, 0.050 s min, 0.060 s max)
+ . | . + . | d_common_vec 4.48%; 7.965 s(7.965 s); 190 calls (0.042 s/call avg, 0.041 s min, 0.047 s max)
+ . | . + . | . (d_common_vec exclusive) 4.29%; 7.639 s(7.639 s);
+ . | . + . d_fvol 8.18%; 14.558 s(14.558 s); 1140 calls (0.013 s/call avg, 0.011 s min, 0.028 s max)
+ . | . + . | (d_fvol exclusive) 7.26%; 12.912 s(12.912 s);
+ . | . + . seteng 5.77%; 10.272 s(10.272 s); 190 calls (0.054 s/call avg, 0.054 s min, 0.061 s max)
+ . | . + . (hydro_lanl_1 exclusive) 11.26%; 20.025 s(20.025 s);
+ . | . calscr 3.32%; 5.906 s(5.906 s); 95 calls (0.062 s/call avg, 0.061 s min, 0.075 s max)
+ . | . freeze_restore 3.97%; 7.066 s(7.066 s); 95 calls (0.074 s/call avg, 0.071 s min, 0.086 s max)
+ . | . recon 17.63%; 31.362 s(31.362 s); 95 calls (0.330 s/call avg, 0.154 s min, 0.470 s max)
+ . | . + cdt 3.04%; 5.414 s(5.414 s); 95 calls (0.057 s/call avg, 0.056 s min, 0.060 s max)
+ . | . + cell_get 3.18%; 5.650 s(5.650 s); 31480 calls (0.000179 s/call avg, 51 us min, 2269 us max)
+ . | . + . token_get 3.04%; 5.400 s(5.400 s); 31480 calls (0.000172 s/call avg, 47 us min, 2263 us max)
+ . | . + . | MPI_Issend 0.08%; 0.150 s(0.150 s); 47160 calls (0.000003 s/call avg, 1 us min, 69 us max)
      144710.39 KB sent (avg: 3142.14 B, BW: 943.477 MB/s)
+ . | . cdt 3.24%; 5.763 s(5.763 s); 95 calls (0.061 s/call avg, 0.060 s min, 0.063 s max)
+ . | . + . | pwrite 0.04%; 0.067 s(0.067 s); 13 calls (0.005147 s/call avg, 4546 us min, 5696 us max)
      65.00 MB written @ 971.3671 MB/s

call tree stats: depth: 11 nodes: 794
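For readers who want to post-process a timing report like this one outside of runpp or perfpp, the following is a minimal sketch of a line parser. It assumes the report is laid out with one routine per line and that each depth marker (`+`, `.`, `|`) is followed by a single space. The regular expression and the function `parse_line` are written for this example from the sample lines shown here; they are not part of PerfLib.

```python
import re

# Pattern for one routine line of the timing report.
LINE = re.compile(
    r"^(?P<prefix>(?:[+.|] )*)"              # depth markers such as "+ . | "
    r"(?P<name>\w+|\(\w+ exclusive\)) "      # routine or "(routine exclusive)"
    r"(?P<pct>\d+\.\d+)%; "                  # share of wall clock time
    r"(?P<secs>\d+\.\d+) s\(\d+\.\d+ s\);"   # elapsed seconds
    r"(?: (?P<calls>\d+) calls?)?"           # call count, absent on exclusive lines
)


def parse_line(line):
    """Return the fields of one report line, or None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    return {
        "depth": len(m["prefix"]) // 2,
        "name": m["name"],
        "pct": float(m["pct"]),
        "secs": float(m["secs"]),
        "calls": int(m["calls"]) if m["calls"] else None,
    }


row = parse_line(
    "+ . | . hydro 65.71%; 116.89 s(116.89 s); 95 calls "
    "(1.230 s/call avg, 1.210 s min, 1.363 s max)"
)
print(row)
```

Lines that carry only MPI or IO byte counts do not match the pattern and come back as `None`, so a caller can attach them to the preceding routine or ignore them.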

Tuesday night comet impact run on tu

timing report filtered at 3%

rank 0, last cycle cumulative

Page 9

memory allocation report

Cycle: 7234, MPI rank: 0, all instrumentation levels
dumping memory allocation performance data since start of run (inclusive nested calling tree)
physical memory on node: 32189.25 MBytes (hostname: tua023:tua023.localdomain)
memory allocated (rss): 102.95 MBytes (44.8359 MBytes allocated prior to 1st instrumentation point)
minimum free memory: 26665.55 MBytes (82.84%)
total page faults: 71 (22 page faults prior to 1st instrumentation point)

dumping memory allocation performance data (rss growth) - inclusive nested calling tree
skipping routines with < 3% allocated memory (rss growth) and < 3% page faults

controller 100.00%, 58.1094 Mbytes, 49 page faults
+ controller_0 97.22%, 56.4922 Mbytes, 47 page faults
+ . tread 34.43%, 20.0078 Mbytes, 2 page faults
+ . | pio_open 16.61%, 9.65234 Mbytes
+ . | . universal_file_read_common 15.45%, 8.97656 Mbytes
+ . | . + bulkio_read_s 8.69%, 5.05078 Mbytes
+ . | . + . bulkio_read_d 8.69%, 5.05078 Mbytes
+ . | . + . | pread 8.60%, 5 Mbytes
+ . | . + token_bcast 6.72%, 3.90234 Mbytes
+ . | resize 15.51%, 9.01172 Mbytes, 1 page faults
+ . | . (resize exclusive) 15.49%, 9.00391 Mbytes, 1 page faults
+ . restart 30.14%, 17.5156 Mbytes, 10 page faults
+ . | bldint 10.43%, 6.05859 Mbytes
+ . | . resize 6.60%, 3.83594 Mbytes
+ . | . + (resize exclusive) 6.60%, 3.83594 Mbytes
+ . | seteng 5.17%, 3.00391 Mbytes
+ . | (restart exclusive) 10.78%, 6.26562 Mbytes, 8 page faults
+ . controller_3 29.69%, 17.25 Mbytes, 21 page faults
+ . | cycle 27.42%, 15.9336 Mbytes
+ . | . hydro 25.16%, 14.6211 Mbytes
+ . | . + hydro_lanl_1 22.63%, 13.1484 Mbytes
+ . | . + . d_common 4.45%, 2.58594 Mbytes
+ . | . + . | (d_common exclusive) 4.30%, 2.5 Mbytes
+ . | . + . h_1_advect_vol 7.32%, 4.25391 Mbytes
+ . | . + . | d_common_vec 7.31%, 4.24609 Mbytes
+ . | . + . | . (d_common_vec exclusive) 7.10%, 4.125 Mbytes
+ . | . + . seteng 4.43%, 2.57422 Mbytes
+ . | . + . | d_common_vec 4.43%, 2.57422 Mbytes
+ . | . + . | . (d_common_vec exclusive) 4.43%, 2.57422 Mbytes
+ . | . + . (hydro_lanl_1 exclusive) 6.06%, 3.52344 Mbytes

call tree stats: depth: 10 nodes: 438
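The percentages in the memory report appear to be each routine's rss growth divided by the growth at the root of the tree (58.1094 Mbytes for `controller`), and the filter line says a routine is skipped only when it falls under 3% on both allocated memory and page faults. The sketch below reproduces that arithmetic for a few entries. The function names are made up for illustration, and treating a missing page-fault figure as zero is an assumption.

```python
def share(value, total):
    """Percentage of the root total accounted for by one routine."""
    return 100.0 * value / total


def keep(mbytes, faults, root_mbytes, root_faults, threshold=3.0):
    """Keep a routine unless it is under the threshold on both measures."""
    return (share(mbytes, root_mbytes) >= threshold
            or share(faults, root_faults) >= threshold)


ROOT_MBYTES, ROOT_FAULTS = 58.1094, 49   # the "controller" line of the report

print(round(share(20.0078, ROOT_MBYTES), 2))    # tread
print(round(share(5.0, ROOT_MBYTES), 2))        # pread
print(keep(20.0078, 2, ROOT_MBYTES, ROOT_FAULTS))
print(keep(1.0, 1, ROOT_MBYTES, ROOT_FAULTS))   # under 3% on both measures
```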

Page 10

flop data integrated into call tree

Cycle: 7234, MPI rank: 0, all instrumentation levels
dumping elapsed time performance data since start of run (inclusive nested calling tree)
routines with average time/call > 10 us after 1000 calls
skipping routines with < 3% wall clock time and < 1e+08 bytes sent/rcvd and < 1e+06 bytes written

Peak Mflops: 4600.00
Mflops: 0.82 ( 0.02% of peak)

controller 100.00%; 201.30 s(201.30 s); 1 call
      164972896 FP_INS, 0.820 Mf/s, 0.072 s ( 0.04%)
      141760805912 L1_DCA, 704 L1_DCA/us, 0.00116374 f/L1_DCA
      2686875312 L1_DCM, 13 L1_DCM/us, 0.0613995 f/L1_DCM, 1.90% L1_DCM/L1_DCA
      623965879 L2_DCM, 3 L2_DCM/us, 0.264394 f/L2_DCM, 23.22% L2_DCM/L2_DCA
+ controller_0 99.46%; 200.20 s(200.20 s); 1 call
      164972808 FP_INS, 0.824 Mf/s, 0.072 s ( 0.04%)
      140023030844 L1_DCA, 699 L1_DCA/us, 0.00117818 f/L1_DCA
      2657992526 L1_DCM, 13 L1_DCM/us, 0.0620667 f/L1_DCM, 1.90% L1_DCM/L1_DCA
      623906059 L2_DCM, 3 L2_DCM/us, 0.264419 f/L2_DCM, 23.47% L2_DCM/L2_DCA
+ . controller_3 96.56%; 194.38 s(194.38 s); 96 calls (2.025 s/call avg, 1.745 s min, 3.419 s max)
      164674809 FP_INS, 0.847 Mf/s, 0.072 s ( 0.04%)
      133280135424 L1_DCA, 686 L1_DCA/us, 0.00123555 f/L1_DCA
      2527669073 L1_DCM, 13 L1_DCM/us, 0.0651489 f/L1_DCM, 1.90% L1_DCM/L1_DCA
      621185981 L2_DCM, 3 L2_DCM/us, 0.265097 f/L2_DCM, 24.58% L2_DCM/L2_DCA
+ . | cycle 93.81%; 188.84 s(188.84 s); 95 calls (1.988 s/call avg, 1.738 s min, 2.199 s max)
      132328667 FP_INS, 0.701 Mf/s, 0.058 s ( 0.03%)
      129778334303 L1_DCA, 687 L1_DCA/us, 0.00101965 f/L1_DCA
      2468965744 L1_DCM, 13 L1_DCM/us, 0.0535968 f/L1_DCM, 1.90% L1_DCM/L1_DCA
      617737747 L2_DCM, 3 L2_DCM/us, 0.214215 f/L2_DCM, 25.02% L2_DCM/L2_DCA
+ . | . hydro 59.07%; 118.91 s(118.91 s); 95 calls (1.252 s/call avg, 1.228 s min, 1.379 s max)
      78559433 FP_INS, 0.661 Mf/s, 0.034 s ( 0.03%)
      71964911416 L1_DCA, 605 L1_DCA/us, 0.00109164 f/L1_DCA
      1414105604 L1_DCM, 12 L1_DCM/us, 0.0555541 f/L1_DCM, 1.96% L1_DCM/L1_DCA
      487050183 L2_DCM, 4 L2_DCM/us, 0.161296 f/L2_DCM, 34.44% L2_DCM/L2_DCA
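The derived figures on each counter line can be reproduced from the raw counts. The sketch below does this for the `controller` line, under two assumptions: the flop rate is FP_INS divided by the routine's elapsed time, and L2 data cache accesses equal L1 data cache misses (the report prints an L2_DCM/L2_DCA ratio without listing L2_DCA, and this assumption matches the printed percentages). The function `derived` is a hypothetical helper, not PerfLib code.

```python
def derived(fp_ins, l1_dca, l1_dcm, l2_dcm, seconds, peak_mflops):
    """Recompute the per-routine ratios shown on the counter lines."""
    mflops = fp_ins / seconds / 1e6
    return {
        "mflops": mflops,
        "pct_peak": 100.0 * mflops / peak_mflops,
        "flops_per_l1_access": fp_ins / l1_dca,
        "flops_per_l2_miss": fp_ins / l2_dcm,
        "l1_miss_pct": 100.0 * l1_dcm / l1_dca,
        # Assumption: every L1 data cache miss becomes one L2 access.
        "l2_miss_pct": 100.0 * l2_dcm / l1_dcm,
    }


# Raw values from the "controller" line of the report.
m = derived(
    fp_ins=164972896,
    l1_dca=141760805912,
    l1_dcm=2686875312,
    l2_dcm=623965879,
    seconds=201.30,
    peak_mflops=4600.00,
)
for key, value in m.items():
    print(f"{key}: {value:.6g}")
```

Rounded to the report's precision, these come out to 0.82 Mflops, 0.02% of peak, a 1.90% L1 miss rate, and a 23.22% L2 miss rate.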

Page 11

Run Time by Package by Rank

Page 12

Memory Allocation by Routine by Rank

shows the effect of io processors (bulkio)

Page 13

Rage Performance Tracking: Run time by Routine by Date (Code Version)

Page 14

Rage Performance Tracking: tu vs. yr

Page 15

Cache Performance by Routine

Page 16

TLB Hit Rate by Routine

Page 17

%peak by flops/memory reference

(idea for this from Mack Kenamond Shavano L2 MS talk)

Page 18

Flop Rate by Routine by Rank

Page 19

Memory Trace (rss) by Rank

Page 20

Documentation, support, etc.

deployed on
- LANL systems: lobo, tu, yr, hu, rt, rr
- LLNL systems: purple, bgl/dawn, linux clusters
- SNL systems: redstorm

Integrated into major ASC codes at LANL and LLNL

Basis for potential JOWOG performance code comparisons

/usr/projects/codeopt/PERF/4.0/doc/
- ReadMe
- GettingStarted

My contact info: (505) 665-4655, [email protected]