
© 2007 IBM Corporation

Using Perfpmr, AIX VMM Page Replacement and System p Performance / AIX v6.1(Part 1)

IBM AIX



Agenda

This deep-dive course covers the following topics:
Part 1: Deep Dive perfpmr tool
Part 2: VMM page replacement
Part 3: System performance / IBM AIX 6.1
► New tuning parameters

This course applies to students who want to know about PerfPMR, how to generate reports and what parameters apply. It also helps with determining when there is a performance issue, what to review and what the possible causes of the issue can be. This course assumes students are familiar with IBM AIX® performance tools.



Performance Data Collection

PERFPMR consists of a set of utilities that collect the information needed to analyze performance issues.

PERFPMR is downloadable from a public FTP site:
► Not distributed on AIX media
► ftp ftp.software.ibm.com using anonymous ftp
► cd /aix/tools/perftools/perfpmr/perfXX (where XX is the AIX release)
► Get the compressed .tar file in that directory and install it using the directions in the provided README file
► PERFPMR is updated periodically, so it’s advisable to check the FTP site for the most recent version

The bos.perf packages (from AIX install media) must also be installed.

ftp://ftp.software.ibm.com/



Running PERFPMR

Once PERFPMR has been installed, you can run it in any directory
► To determine the amount of space needed, estimate at least 20MB per logical CPU plus an extra 50MB of space
► Run “perfpmr.sh <# of seconds>” at the time when the performance problem is occurring
► A pair of 5-second traces are collected first
► Then various monitoring tools are run for the duration of time specified as a parameter to perfpmr.sh
► After this, tprof, filemon, iptrace, tcpdump data is collected
► Finally, system configuration data is collected
► Data can be tar’d up and sent to testcase.software.ibm.com with the filename having the pmr# in it
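The space rule of thumb above (at least 20MB per logical CPU plus 50MB) can be sketched as a small calculation. This is only an illustration: the function name is hypothetical and is not part of PERFPMR, and the result is a lower bound rather than an exact requirement.

```python
def perfpmr_space_mb(logical_cpus, per_cpu_mb=20, extra_mb=50):
    """Lower-bound disk space estimate (MB) from the rule of thumb above.

    perfpmr_space_mb is a hypothetical helper name; the 20MB and 50MB
    figures are the ones quoted in the course text.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return logical_cpus * per_cpu_mb + extra_mb


# Print the estimate for a few illustrative partition sizes.
for cpus in (2, 8, 64):
    print(f"{cpus} logical CPUs -> at least {perfpmr_space_mb(cpus)} MB")
```

The -P option of perfpmr.sh, described in the syntax section, previews the disk space needed and is the more direct check on a real system.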



Syntax

perfpmr.sh [-PQDIgfnpsc][-F file][-x file][-d sec] monitor_seconds

-P preview only - show scripts to run and disk space needed
-Q don't run lsattr,lslv,lspv commands in order to save time
-D run perfpmr the original way without a perfpmr cfg file
-I get lock instrumented trace also
-g do not collect gennames output.
-f if gennames is run, specify gennames -f.
-n used if no netstat or nfsstat desired.
-p used if no pprof collection desired while monitor.sh running.
-s used if no svmon desired.
-c used if no configuration information is desired.
-F file use file as the perfpmr cfg file - default is perfpmr.cfg
-x file only execute file found in perfpmr installation directory
-d sec sec is time to wait before starting collection period; default is delay_seconds 0

monitor_seconds is for the monitor collection period in seconds



PERFPMR shell scripts

aiostat.sh Collects AIO information into a report called aiostat.int.

config.sh Collects configuration information into a report called config.sum.

emstat.sh time Builds a report called emstat.int on emulated instructions. The time parameter must be greater than or equal to 60

filemon.sh time Builds a report called filemon.sum on file I/O. The time parameter does not have any restrictions.

iostat.sh time Builds two reports on I/O statistics: a summary report called iostat.sum and an interval report called iostat.int. The time parameter must be greater than or equal to 60.

iptrace.sh time Builds a raw Internet Protocol (IP) trace report on network I/O called iptrace.raw. You can convert the iptrace.raw file to a readable ipreport file called iptrace.int using the iptrace.sh -r command. The time parameter does not have any restrictions

lparstat.sh Builds a report on logical partitioning information. Two files are created: lparstat.int and lparstat.sum.

monitor.sh time Invokes system performance monitors and collects interval and summary reports:



PERFPMR shell scripts (Cont.)

mpstat.sh Builds a report on Logical processor information into a report called mpstat.int

netstat.sh [-r] time Builds a report on network configuration and use called netstat.int containing entstat -d of the Ethernet interfaces, netstat -in, netstat -m, netstat -rn, netstat -rs, netstat -s, netstat -D, and netstat -an before and after monitor.sh was run.

nfsstat.sh time Builds a report on NFS configuration and use called nfsstat.int containing nfsstat -m and nfsstat -csnr before and after nfsstat.sh was run. The time parameter must be greater than or equal to 60.

pprof.sh time Builds a file called pprof.trace.raw that can be formatted with the pprof.sh -r command. The time parameter does not have any restrictions.

ps.sh time Builds reports on process status (ps). ps.sh creates the following files:
► psa.elfk: A ps -elfk listing after ps.sh was run.
► psb.elfk: A ps -elfk listing before ps.sh was run.
► ps.int: Active processes before and after ps.sh was run.
► ps.sum: A summary report of the changes between when ps.sh started and finished. This is useful for determining what processes are consuming resources.
The time parameter must be greater than or equal to 60.



PERFPMR shell scripts (Cont.)

sar.sh time Builds reports on sar. sar.sh creates the following files:
► sar.int: Output of commands sadc 10 7 and sar -A
► sar.sum: A sar summary over the period sar.sh was run.
The time parameter must be greater than or equal to 60.

svmon.sh Builds a report on svmon data into two files svmon.out and svmon.out.S

tcpdump.sh int.time The int. parameter is the name of the interface; for example, tr0 is token-ring. Creates a raw trace file of a TCP/IP dump called tcpdump.raw. To produce a readable tcpdump.int file, use the tcpdump.sh -r command.

tprof.sh time Creates a tprof summary report called tprof.sum. Used for analyzing processor use of processes and threads. You can also specify a program to profile with the tprof.sh -p program 60 command, which profiles the executable called program for 60 seconds.

trace.sh time Creates the raw trace files (trace*) from which an ASCII trace report can be generated using the trcrpt command

vmstat.sh time Builds reports on vmstat: a vmstat interval report called vmstat_v and a vmstat_s summary report. The time parameter must be greater than or equal to 60.



PERFPMR configuration file for perfpmr scripts

perfpmr.cfg This is the perfpmr configuration file, which includes the following scripts:

♦ perfpmr_tool = trace.sh
♦ perfpmr_tool = monitor.sh
♦ perfpmr_tool = iptrace.sh
♦ perfpmr_tool = tcpdump.sh
♦ perfpmr_tool = filemon.sh
♦ perfpmr_tool = tprof.sh
♦ perfpmr_tool = netpmon.sh
♦ perfpmr_tool = config.sh

Note: Bigger trace buffers may not be a viable option for systems that are tight on memory, because trace buffers are pinned memory.



Example

root@nkeung /home/nam/perfpmr/test: > ../perfpmr.sh 300
…

10:40:30-08/01/07 : PERFPMR: executing perfpmr_trace -k 10e,254,116,117 -L20000000 -I 5

…
TRACE.SH: Starting trace for 5 seconds

/bin/trace -k 492,10e,254,116,117 -f -n -C all -d -L 20000000 -T 20000000 -a
TRACE.SH: Data collection started
TRACE.SH: Data collection stopped
TRACE.SH: Binary trace data is in file trace.raw

…
TRACE.SH: Enabling locktrace

lock tracing enabled for all classes
TRACE.SH: Starting trace for 5 seconds

/bin/trace -j 106,10C,10E,112,113,134,139,465,46D,606,607,608,609 -f -n -C all -d -L 2000000 -T 20000000 -ao trace.raw.lock

TRACE.SH: Data collection started
TRACE.SH: Data collection stopped
TRACE.SH: Trace stopped
TRACE.SH: Binary trace data is in file trace.raw.lock



Example (Cont.)

10:42:48-08/01/07 : PERFPMR: executing perfpmr_monitor -h -I 0 -N 0 -S 0 3
MONITOR: Capturing final lsps, svmon, and vmstat data
MONITOR: Generating reports....
MONITOR: Network reports are in netstat.int and nfsstat.int
MONITOR: Monitor reports are in monitor.int and monitor.sum

10:49:47-08/01/07 : PERFPMR: executing perfpmr_filemon -T 60000000 60

FILEMON: Starting filesystem monitor for 60 seconds....
10:50:55-08/01/07 : filemon completed
10:50:55-08/01/07 : PERFPMR: executing perfpmr_tprof 60

TPROF: Tprof report is in tprof.sum
10:52:00-08/01/07 : config.sh begin

CONFIG.SH: Generating SW/HW configuration
10:53:38-08/01/07 : config.sh completed

PERFPMR: Data collection complete.



Example (Cont.)

root@nkeung /home/nam/perfpmr/test: > ls

aiostat.int gennames.out mpstat.int tprof.csyms trace.syms

config.sum getevars.out netstat.int tprof.ctrc tunables_lastboot

crontab_l instfix.out nfsstat.int tprof.out tunables_lastboot.log

devtree.out iostat.Dl objrepos tprof.sum tunables_nextboot

errlog iptrace.raw perfpmr.int trace.crash.inode unix.what

errpt_a lparstat.int pile.out trace.fmt vfs.kdb

errtmplt lparstat.l pprof.trace.raw trace.inode vmstat_s.out

etc_filesystems lparstat.sum psa.elfk trace.j2.inode vmstat_s.p.after

etc_inittab lslpp.Lc psb.elfk trace.maj_min2lv vmstat_s.p.before

etc_rc lsps.after psemo.after trace.mount vmstat_v.after

etc_security_limits lsps.before psemo.before trace.nm vmstat_v.before

fastt.out lsrset.out sar.bin trace.raw vmstati.after

fcstat.after mem_details_dir svmon.after trace.raw-0 vmstati.before

fcstat.before mempools.out svmon.after.S trace.raw-1 vnode.kdb

filemon.sum mempools.save svmon.before trace.raw.lock w.int

genkex.out monitor.int svmon.before.S trace.raw.lock-0 xmwlm.070731

genkld.out monitor.sum tcpdump.raw trace.raw.lock-1 xmwlm.070801



Postprocessing raw data

TRACE report (monitors statistics of user and kernel subsystems in detail)
► The trcrpt command reads the trace log, formats the trace entries, and writes a report to standard output.

CPU Usage Reporting Tool
► The CPU Usage Reporting Tool (curt) takes an AIX trace file as input and produces a number of statistics related to CPU utilization and process/thread activity.

SPLAT report (Simple Performance Lock Analysis Tool)
► splat is a software tool which post-processes AIX trace files to produce kernel simple and complex lock usage reports.

tprof report
► The tprof command reports processor usage for individual programs and the system as a whole.

I/O performance report (monitor.int and monitor.sum)

System call report



Case study – Ethernet Transmit Lock

curt.out

Hypervisor Calls Summary
------------------------

Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime HCALL (Cal

(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)

======== =========== ====== ======== ======== ======== ======== ========= ========= =========

625 169.1873 6.86% 0.2707 0.0014 2.8567 311.8368 0.4989 0.0077 3.7711 H_CONFER((unknown) 210178)

117 38.2630 1.55% 0.3270 0.0005 2.7299 57.9804 0.4956 0.0005 2.7299 H_CONFER((unknown) 2f2230)

H_CONFER (hconfer) is a hypervisor call that is used in a shared partition to confer processor cycles from one virtual processor to another specific virtual processor in the same partition, when the second virtual processor needs the excess cycles that the first one has. For example, assume one virtual processor is holding a lock and does not have enough cycles to release that lock. If another virtual processor needs that lock and has excess cycles, it confers those cycles to the first processor through this call.



Case study – Ethernet Transmit Lock (Continue)

SPLAT.out

10 max entries, Summary sorted by Percent CPU spin hold time:

T Acqui- Wait

y sitions or Locks or Percent Holdtime

p or Trans- Passes Real Real Comb

Lock Name, Class, or Address e Passes Spins form %Miss %Total / CSec CPU Elapse Spin

********************************** * ******* ****** ****** ******* ******** ********* ******** ******** ********

F10001005D00A680 D 1934 1933 0 49.9871 0.1373 364.302 0.2149 4.7678 0.0425

[AIX SIMPLE Lock] ADDRESS: F10001005D00A680 KEX: unknown

======================================================================================

| Trans- | | Percent Held ( 0.241491s )

Type: | Miss Spin form Busy | Secs Held | Real Real Comb Real

Disabled | Rate Count Count Count |CPU Elapsed | CPU Elapsed Spin Wait

| 49.987 1933 0 0 |0.011407 0.011514 | 0.21 4.77 0.04 0.00

--------------------------------------------------------------------------------------

Total Acquisitions: 1934 |SpinQ Min Max Avg |Krlocks SpinQ Min Max Avg

Acq. holding krlock: 0 |Depth 0 2 0 |Depth 0 0 0

--------------------------------------------------------------------------------------



Case study – Ethernet Transmit Lock (Continue)

Lock Activity (mSecs) - Interrupts Disabled

SIMPLE Count Minimum Maximum Average Total

+++++++ ++++++ ++++++++++++++ ++++++++++++++ ++++++++++++++ ++++++++++++++

LOCK 1934 0.003430 0.182701 0.005955 11.406502

SPIN 1933 0.000568 0.164352 0.001174 2.256266

Acqui- Miss Spin Transf. Busy Percent Held of Total Time

Function Name sitions Rate Count Count Count CPU Elapse Spin Transf. Return Address Start Address Offset

^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^

.goent_output 1933 49.99 1932 0 0 0.21 4.69 0.04 0.00 00000000040B6E10 00000000040AA200 0000CC10

Acqui- Miss Spin Transf. Busy Percent Held of Total Time Process

ThreadID sitions Rate Count Count Count CPU Elapse Spin Transf. ProcessID Name

~~~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~~ ~~~~~~~~~~~~~

1753227 18 50.00 18 0 0 0.48 0.05 0.77 0.00 2093076 oraclehrvst1

3121321 32 50.00 32 0 0 0.63 0.08 0.39 0.00 2105404 oraclehrvst1

4186123 1 50.00 1 0 0 0.98 0.00 0.23 0.00 2666654 tee

2093311 1 50.00 1 0 0 22.47 0.08 0.16 0.00 528490 hats_nim



Case study – Ethernet Transmit Lock (continue)

tprof.sum

Total Ticks For All Processes (KERNEL) = 702

Subroutine Ticks % Source Address Bytes

========== ===== ====== ====== ======= =====

h_cede_end_point 361 45.87 hcalls.s 2d4ea8 8

.unlock_enable_mem 264 33.55 64/low.s 930c 1f4

.waitproc 26 3.30 ../../../../../src/bos/kernel/proc/dispatch.c 42e28 57c

.trchook64 9 1.14 trchka64.s 1d418 220

pcs_glue 6 0.76 vmvcs.s 2e3c74 c4

h_confer_end_point 2 0.25 hcalls.s 2d4ed0 8

.enable 2 0.25 misc.s ec398 10



Case study – Mutex Lock

Profile: /usr/lib/libc.a[shr.o]

Total Ticks For All Processes (/usr/lib/libc.a[shr.o]) = 7143

Subroutine Ticks % Source Address Bytes

========== ===== ====== ====== ======= =====

.free_y 1599 1.67 ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c 7a2fc 6dc

.leftmost 1127 1.17 ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c 79d44 124

.splay 1019 1.06 ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c 79868 3b4

.fetch_and_addlp 658 0.69 atomic_op.s 86d78 1

.malloc_y 598 0.62 ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c 7b688 604

._doprnt 255 0.27 ../../../../../../../src/bos/usr/ccs/lib/libc/doprnt.c f0cc 6748

Profile: /usr/lib/libpthreads.a[shr.o]

Total Ticks For All Processes (/usr/lib/libpthreads.a[shr.o]) = 5448

Subroutine Ticks % Source Address Bytes

========== ===== ====== ====== ======= =====

.global_unlock_ppc_mp 2374 2.47 pth_locks_ppc_mp.s 22a3c c4

.global_lock_ppc_mp 1194 1.24 pth_locks_ppc_mp.s 2293c c4

._mutex_lock 905 0.94 ../../../../../../../../src/bos/usr/ccs/lib/libpthreads/pth_mutex.c 18bc 348

.pthread_mutex_unlock 425 0.44 ../../../../../../../../src/bos/usr/ccs/lib/libpthreads/pth_mutex.c 1fc8 17c



Case study – Mutex Lock (continue)

Modify the trcfmt file: add the following to the end of the trace.fmt file (requires the debug libpthread library).

020 1.0 "PTHREADS" \"mutex_lock " $D1 $D2 $D3 $D4 $D5

040 1.0 "PTHREADS" \"mutex " lockaddr=$D4 $D1 $D2 $D3

030 1.0 "PTHREADS" \"mutex_unlock " $D1 $D2 $D3 $D4 $D5

The 020 and 040 trace hooks together show a pthread mutex lock, whereas the 030 and 040 hooks together show an unlock. The five parameters in 020 and 030 are the top five routines on the stack. The first parameter in 040 is the lock address, and the last three parameters of the 040 hook are the 6th, 7th, and 8th stack addresses.




040 scp_server 27 410070 6755063 0.582794 PTHREADS mutex lockaddr=20444B50 1009E44C 1009C8D8 10038FAC
020 scp_server 32 410070 7184789 0.582797 PTHREADS mutex_lock D0300EFC D02FF57C D01C5DB0 100173F0 100174E0
040 scp_server 32 410070 7184789 0.582797 PTHREADS mutex lockaddr=20444B50 10017454 1009E6D8 10038BF8
030 scp_server 46 410070 6594563 0.582799 PTHREADS mutex_unlock D03027A8 D02FFCDC D01C82BC 1009E7E4 10038F0C
040 scp_server 46 410070 6594563 0.582800 PTHREADS mutex lockaddr=20444B50 1002F690 1002928C D1BD7288
020 scp_server 27 410070 6755063 0.582804 PTHREADS mutex_lock D0300EFC D02FF57C D01C5DB0 D01D0F14 100173C8
040 scp_server 27 410070 6755063 0.582805 PTHREADS mutex lockaddr=20444B50 100174E0 1009E460 1009C8D8
030 scp_server 4 410070 6598669 0.582805 PTHREADS mutex_unlock D03027A8 D02FFCDC D01C82BC 1009E5E4 10038F0C
040 scp_server 4 410070 6598669 0.582806 PTHREADS mutex lockaddr=20444B50 1002F690 1002928C D1BD7288
020 scp_server 38 410070 7131397 0.582807 PTHREADS mutex_lock D0300EFC D02FF57C D01C5DB0 D01D0F14 100173C8
040 scp_server 38 410070 7131397 0.582807 PTHREADS mutex lockaddr=20444B50 100174E0 1009E44C 10038DDC
030 scp_server 2 410070 6750965 0.582811 PTHREADS mutex_unlock D03014F4 D02FF57C D01C5DB0 D01D0F14 D1CBA968
040 scp_server 2 410070 6750965 0.582812 PTHREADS mutex lockaddr=20444B50 D1CB6EAC D1CC72C4 D1CC1C94
020 scp_server 46 410070 6594563 0.582813 PTHREADS mutex_lock D03022EC D02FFCDC D01C82BC 1009D0AC 1009E808
020 scp_server 32 410070 7184789 0.582813 PTHREADS mutex_lock D0300EFC D02FF57C D01C5DB0 100173F0 100174E0
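A trace excerpt like the one above is easier to read once each 020 (lock) or 030 (unlock) record is paired with the 040 record that carries the lock address. The following is a minimal sketch of that pairing, not a tool shipped with PERFPMR. It assumes that the fifth whitespace-separated field is the thread ID and that the 040 record for an operation is the next 040 record from the same thread; neither assumption is stated in the course, and the function name is hypothetical.

```python
from collections import Counter

# Four records copied from the trace excerpt in this course.
SAMPLE = """\
020 scp_server 32 410070 7184789 0.582797 PTHREADS mutex_lock D0300EFC D02FF57C D01C5DB0 100173F0 100174E0
040 scp_server 32 410070 7184789 0.582797 PTHREADS mutex lockaddr=20444B50 10017454 1009E6D8 10038BF8
030 scp_server 46 410070 6594563 0.582799 PTHREADS mutex_unlock D03027A8 D02FFCDC D01C82BC 1009E7E4 10038F0C
040 scp_server 46 410070 6594563 0.582800 PTHREADS mutex lockaddr=20444B50 1002F690 1002928C D1BD7288
"""


def count_mutex_ops(report):
    """Count (lock address, operation) pairs in formatted PTHREADS records.

    Assumption: field 5 is the thread ID, and a 040 record belongs to the
    most recent 020/030 record seen for that thread.
    """
    pending = {}        # thread ID -> "lock" or "unlock" awaiting its 040 record
    counts = Counter()
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[6] != "PTHREADS":
            continue    # not one of the three hook formats
        hook, tid = fields[0], fields[4]
        if hook == "020":
            pending[tid] = "lock"
        elif hook == "030":
            pending[tid] = "unlock"
        elif hook == "040" and tid in pending and fields[8].startswith("lockaddr="):
            address = fields[8].split("=", 1)[1]
            counts[(address, pending.pop(tid))] += 1
    return counts


for (address, operation), total in sorted(count_mutex_ops(SAMPLE).items()):
    print(address, operation, total)
```

A lock address that dominates the counts across many threads, as 20444B50 does in the excerpt, is a candidate for the contended mutex.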




Use sym.sh with the symbol data to resolve the addresses in a 020 stack trace:

./sym.sh D0300EFC D02FF57C D01C5DB0 100173F0 100174E0

D0300EFC </usr/lib/libc.a[shr.o] .free_y>

D02FF57C </usr/lib/libc.a[shr.o] .free_common>

D01C5DB0 </usr/lib/libC.a[ansicore_32.o] .operator>

100173F0

100174E0

./sym.sh D03022EC D02FFCDC D01C82BC 1009D0AC 1009E808

D03022EC </usr/lib/libc.a[shr.o] .malloc_y>

D02FFCDC </usr/lib/libc.a[shr.o] .malloc_common_53_36>

D01C82BC </usr/lib/libC.a[ansicore_32.o] .operator>

1009D0AC

1009E808



Case study – shm lock

Report from splat.out

10 max entries, Summary sorted by Percent CPU spin hold time:

T Acqui- Wait

y sitions or Locks or Percent Holdtime

p or Trans- Passes Real Real Comb

Lock Name, Class, or Address e Passes Spins form %Miss %Total / CSec CPU Elapse Spin

********************************** * ******* ****** ****** ******* ******** ********* ******** ******** ********

F00000002FF48C98 C 74582 255 0 0.3407 2.2872 2284.374 184.9558 90932.0141 33.7647

lock_shm S 25870 25864 1246 49.9942 0.7933 792.373 2.7038 86.6720 9.2270

0000000002A65750 D 3500 1178 3 25.1817 0.1073 107.202 0.0445 1.4113 0.1217

0000000002A65A68 D 2534 1177 1 31.7165 0.0777 77.614 0.0480 1.5246 0.0724

0000000002A65A28 D 887 451 1 33.7070 0.0272 27.168 0.0224 0.7103 0.0614

0000000002A659D0 D 3773 726 0 16.1369 0.1157 115.563 0.0476 1.5147 0.0232

F100010052ECF6D8 C 13084 997 132 7.0805 0.4012 400.750 0.1719 5.5278 0.0142

00000000010A5438 D 51737 4690 0 8.3116 1.5866 1584.654 0.3101 11.6915 0.0094

lock_sem_undo S 17169 37 0 0.2150 0.5265 525.870 0.0353 1.1238 0.0075

ipintrq_qarray_lock D 66924 2058 0 2.9834 2.0523 2049.817 0.0933 2.9916 0.0053



Case study – shm lock (continue)

Report from curt.out

System Calls Summary

--------------------

Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)

(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)

======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================

14595 2619.7518 6.99% 0.1795 0.0524 1.1668 31358.9807 2.1486 0.0524 331.9492 shmdt(14a0070)

14590 1835.2794 4.90% 0.1258 0.0067 1.0218 38748.6944 2.6558 0.0067 332.0727 shmat(14a0088)

28595 368.4303 0.98% 0.0129 0.0029 1.0036 916.5396 0.0321 0.0029 56.1218 kwrite(149e138)

42642 332.9538 0.89% 0.0078 0.0019 0.0327 74828.0660 1.7548 0.0019 879.2453 kread(149e180)



Case study – application mutex lock

Millennium threads contention summary in POWER5+ 40way

There is one point of contention limiting the scalability of the application. The many threads of the sec_server, tdb_server, srv_drver, scp_server, and cpm_srvcachema processes are calling an application routine called IPC_ReadMessage, which calls _IPC_ReadMessage, which calls IPC_GetMessage, which calls a routine in /cerner/w_standard/rev008_wh/aixrs6000/libcmb.a called cmb_hiber(). cmb_hiber calls pthread_cond_wait to wait on a condition ...

The application contention on the pthread condition in turn also causes kernel contention on the event list (resulting in long times for the thread_unlock system call).

What would be useful to know is if all these various processes need to wait on the same condition variable (in IPC_GetMessage).



TPROF example

Configuration information

System: AIX 5.3 Node: rvn1 Machine: 00CC12CE4C00

Tprof command was:

tprof -E -skeulz -x sleep 30

Trace command was:

/usr/bin/trace -ad -M -L 1719908352 -T 500000 -j 000,00A,001,002,003,38F,005,006,134,139,5A2,5A5,465,2FF, -o -

Total Samples = 84268

Traced Time = 30.01s (out of a total execution time of 30.01s)

Performance Monitor based reports:

Processor name: POWER6

Sampling interval: 10ms

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Process FREQ Total Kernel User Shared Other

======= ==== ===== ====== ==== ====== =====

/cerner/…/srv_drvr 976 28105 3892 0 24210 3

/cerner/…/cpm_srvscript 212 17642 3039 43 14549 11

wait 32 14991 14991 0 0 0

oraclehrvst1 320 14751 3080 11442 210 19

/usr/bin/amqzlaa0 1504 3053 1694 65 1281 13



TPROF example (Cont.)

Total Ticks For All Processes (KERNEL) = 15310

Subroutine Ticks % Source Address Bytes

========== ===== ====== ====== ======= =====

.dispatch 558 0.66 ../../../../../src/bos/kernel/proc/dispatch.c 45694 195c

.disable_lock 464 0.55 64/low.s 9004 2fc

._check_lock 452 0.54 64/low.s 3420 40

.unlock_enable_mem 424 0.50 64/low.s 930c 1f4

.fetch_and_add 391 0.46 64/low.s 9b00 80

.simple_unlock 385 0.46 64/low.s 9900 18

Total Ticks For All Processes (SH-LIBs) = 43188

Shared Object Ticks % Address Bytes

============= ===== ====== ======= =====

/usr/lib/libc.a[shr.o] 10558 12.53 d0331800 213698

/cerner/…/libcclora.a[shobjcclora.o] 9312 11.05 d2549260 14a66f

/usr/lib/libpthreads.a[shr.o] 7791 9.25 d0656180 254b4

/cerner/…/libsrvdata.a[libsrvdata.o] 4882 5.79 d1b8e240 95150

/oracle/product/9.2.0.5/../libclntsh.a[shr.o] 2242 2.66 d273c2a0 883d72


I/O Tuning – iostat -aD

# iostat -a -D
System configuration: lcpu=2 drives=3 paths=1 vdisks=1

Adapter:
scsi0 xfer: bps tps bread bwrtn
            0.0 0.0 0.0 0.0

Paths/Disk:
hdisk0_path0 xfer: %tm_act bps tps bread bwrtn
                   0.0 0.0 0.0 0.0 0.0
             read: rps avgserv minserv maxserv timeouts fails
                   0.0 0.0 0.0 0.0 0 0
             write: wps avgserv minserv maxserv timeouts fails
                   0.0 0.0 0.0 0.0 0 0
             queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
                   0.0 0.0 0.0 0.0 0.0 0

Vadapter:
vsci0 xfer: tps bread bwrtn partition-id
            0.0 0.0 0.0 ####
      read: avgserv minserv maxserv
            0.0 0.0 0.0
      write: avgserv minserv maxserv
            0.0 0.0 0.0
      queue: avgtime mintime maxtime avgsqsz qfull
            0.0 0.0 0.0 0.0 0

Disk:
hdisk10 xfer: %tm_act bps tps bread bwrtn
              0.0 0.0 0.0 0.0 0.0
        read: rps avgserv minserv maxserv timeouts fails
              0.0 0.0 0.0 0.0 0 0
        write: wps avgserv minserv maxserv timeouts fails
              0.0 0.0 0.0 0.0 0 0
        queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
              0.0 0.0 0.0 0.0 0.0 0

► Read/write IOPS are reported as rps/wps
► Sections are reported for the PV, the virtual adapter, and the paths
► Use the -l option for wide column, one device per line format



hdisk1 xfer: %tm_act bps tps bread bwrtn

87.7 62.5M 272.3 62.5M 823.7

read: rps avgserv minserv maxserv timeouts fails

271.8 9.0 0.2 168.6 0 0

write: wps avgserv minserv maxserv timeouts fails

0.5 4.0 1.9 10.4 0 0

queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull

1.1 0.0 14.1 0.2 1.2 2374

I/O Tuning – iostat -D

Virtual adapter’s extended throughput report (-D)

Metrics related to transfers (xfer:)
► tps: Indicates the number of transfers per second issued to the adapter.
► recv: The total number of responses received from the hosting server to this adapter.
► sent: The total number of requests sent from this adapter to the hosting server.
► partition id: The partition ID of the hosting server, which serves the requests sent by this adapter.

Adapter Read/Write Service Metrics (read:)
► avgserv: Indicates the average time. Default is in milliseconds.
► minserv: Indicates the minimum time. Default is in milliseconds.
► maxserv: Indicates the maximum time. Default is in milliseconds.

Adapter Wait Queue Metrics (queue:)
► avgtime: Indicates the average time spent in wait queue. Default is in milliseconds.
► mintime: Indicates the minimum time spent in wait queue. Default is in milliseconds.
► maxtime: Indicates the maximum time spent in wait queue. Default is in milliseconds.
► avgwqsz: Indicates the average wait queue size (requests waiting to be sent to the disk).
► avgsqsz: Indicates the average service queue size. Can’t exceed queue_depth for the disk.
► sqfull: Indicates the number of times the service queue becomes full. If this is often > 0, then increase queue_depth.

All -D outputs are rates, except sqfull, which is an interval delta.

Previously, these service times were available only from filemon.
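The sqfull guidance ("if this is often > 0, then increase queue_depth") can be turned into a simple check over several iostat -D intervals. This is a minimal sketch under stated assumptions: the course does not define "often", so the 50% threshold is a placeholder, the function name is hypothetical, and all sample values except the hdisk1 figure of 2374 are invented for illustration.

```python
def sqfull_is_frequent(sqfull_per_interval, min_fraction=0.5):
    """Return True when sqfull was non-zero in more than min_fraction of intervals.

    min_fraction is an assumed stand-in for "often"; tune it to taste.
    """
    samples = list(sqfull_per_interval)
    if not samples:
        return False
    busy = sum(1 for value in samples if value > 0)
    return busy / len(samples) > min_fraction


# 2374 is the hdisk1 sqfull value shown in this course; the rest are made up.
print(sqfull_is_frequent([2374, 1900, 0, 2210]))
print(sqfull_is_frequent([0, 0, 3, 0]))
```

A True result only suggests that queue_depth is worth reviewing; the storage vendor's supported maximum still bounds any increase.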



Application IO characteristics

Random IO
► Typically small (4-32 KB)
► Measure and size with IOPS
► Usually disk actuator limited

Sequential IO
► Typically large (32 KB and up)
► Measure and size with MB/s
► Usually limited on the interconnect to the disk actuators

To determine application IO characteristics
► Use filemon

# filemon -o /tmp/filemon.out -O lv,pv -T 500000; sleep 90; trcstop

► Check for trace buffer wraparounds, which may invalidate the data; if they occur, run filemon with a larger -T value or a shorter sleep
► Use lvmstat to get LV IO statistics
► Use iostat to get PV IO statistics
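The size ranges above suggest a rough first-pass classification: divide throughput by the transfer rate to get an average IO size, then compare it with the 32 KB boundary. This is a minimal sketch, assuming that average size alone is a usable hint; the function names and the sample figures are hypothetical, and the boundary value belongs to both ranges in the text, so it is treated here as sequential.

```python
def avg_io_kb(throughput_kb_per_s, transfers_per_s):
    """Average IO size in KB, from throughput and transfer rate."""
    if transfers_per_s <= 0:
        raise ValueError("transfers_per_s must be positive")
    return throughput_kb_per_s / transfers_per_s


def io_pattern_hint(size_kb, boundary_kb=32.0):
    """Rough hint only: small IOs look random, large IOs look sequential."""
    if size_kb < boundary_kb:
        return "random-like: size with IOPS"
    return "sequential-like: size with MB/s"


# Hypothetical workloads, not taken from the course data.
for kb_per_s, tps in ((4000.0, 500.0), (60000.0, 250.0)):
    size = avg_io_kb(kb_per_s, tps)
    print(f"{size:.1f} KB average -> {io_pattern_hint(size)}")
```

The read and write sequence counts in the filemon detailed report are a more direct measure of randomness than size alone.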



Filemon

To find out what files, logical volumes, and disks are most active, run the following command as root:

# filemon -u -O all -o /tmp/fmon.out; sleep 30; trcstop

In 30 seconds, a report is created in /tmp/fmon.out. Check for most active segments, logical volumes, and physical volumes in this report. Check for reads and writes to paging space to determine if the disk activity is true application I/O or is due to paging activity. Check for files and logical volumes that are particularly active. If these are on a busy physical volume, moving some data to a less busy disk may improve performance.

The Most Active Segments report lists the most active files by file system and inode. The mount point of the file system and inode of the file can be used with the ncheck command to identify unknown files:

# ncheck -i <inode> <mount point>

This report is useful in determining if the activity is to a file system (segtype = persistent), the JFS log (segtype = log), or to paging space (segtype = working).

# find /var/mqm -inum 30762
/var/mqm/qmgrs/CERN!RVN1!HRVSTA/queues/CERN!SSREQ!PM!REG/q

By examining the reads and read sequences counts, you can determine if the access is sequential or random. As the read sequences count approaches the reads count, file access is more random. The same applies to the writes and write sequences.
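The comparison of sequence counts with operation counts can be expressed as a single ratio: the closer it is to 1.0, the more random the access. This is a minimal sketch with a hypothetical function name; the input figures are the reads and writes of the /dev/rms09_lv detailed report shown later in this course.

```python
def sequence_ratio(operations, sequences):
    """Ratio of sequences to operations; values near 1.0 mean random access."""
    if operations == 0:
        return 0.0
    return sequences / operations


# Figures from the /dev/rms09_lv detailed report in this course.
read_ratio = sequence_ratio(23999, 19478)
write_ratio = sequence_ratio(350, 348)
print(f"reads:  {read_ratio:.3f}")
print(f"writes: {write_ratio:.3f}")
```

Both ratios are well above 0.8 for that volume, which points to mostly random access.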



Using filemon To Determine Bottleneck

# filemon -u -O lf,pv -o fmon.out
# dd if=/test/100m bs=1024k of=/dev/null
# trcstop
# more fmon.out

Thu Apr 17 09:11:53 2003
System: AIX nkeung Node: 5 Machine: 000D80CD4C00

Cpu utilization: 6.9%

Most Active Files
------------------------------------------------------------------------
#MBs #opns #rds #wrs file volume:inode
------------------------------------------------------------------------
101.0 1 101 0 100m /dev/jfslv:23
101.0 1 0 100 null
3.0 0 385 0 pid=270570_fd=20960
0.2 1 62 0 unix /dev/hd3:10
0.0 1 2 0 ksh.cat /dev/hd2:41634
0.0 1 2 0 cmdtrace.cat /dev/hd2:41498



filemon summary reports

Summary reports at PV and LV layers

Most Active Logical Volumes
------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
1.00 10551264 5600 17600.8 /dev/rms09_lv /RMS/bormspr0/oradata07
1.00 6226928 7584 10394.4 /dev/rms06_lv /RMS/bormspr0/oradata04
1.00 128544 3315168 5741.5 /dev/rms04_lv /RMS/bormspr0/oracletemp
1.00 13684704 38208 22879.4 /dev/rms02_lv /RMS/bormspr0/oradata01
0.99 11798800 16480 19698.9 /dev/rms03_lv /RMS/bormspr0/oradata02
0.99 600736 7760 1014.5 /dev/rms13_lv /RMS/bormspr0/oradata11
0.98 6237648 128 10399.9 /dev/oraloblv01 /RMS/bormspr0/oralob01
0.96 0 3120 5.2 /dev/hd8 jfslog
0.55 38056 104448 237.6 /dev/rms041_lv /RMS/bormspr0/oraredo
0.48 2344656 3328 3914.6 /dev/rms11_lv /RMS/bormspr0/oradata09

Most Active Physical Volumes
------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
1.00 3313059 4520 5531.2 /dev/hdisk66 SAN Volume Controller Device
1.00 7563668 22312 12647.6 /dev/hdisk59 SAN Volume Controller Device
1.00 53691 1868096 3204.1 /dev/hdisk61 SAN Volume Controller Device
1.00 11669 6478 30.3 /dev/hdisk0 N/A
1.00 6247484 4256 10423.1 /dev/hdisk77 SAN Volume Controller Device
1.00 6401393 10016 10689.3 /dev/hdisk60 SAN Volume Controller Device
1.00 5438693 3128 9072.8 /dev/hdisk69 SAN Volume Controller Device



filemon detailed reports

Detailed reports at PV and LV layers (only for one LV shown)

Similar reports for each PV

VOLUME: /dev/rms09_lv description: /RMS/bormspr0/oradata07
reads: 23999 (0 errs)
read sizes (blks): avg 439.7 min 16 max 2048 sdev 814.8
read times (msec): avg 85.609 min 0.139 max 1113.574 sdev 140.417
read sequences: 19478
read seq. lengths: avg 541.7 min 16 max 12288 sdev 1111.6
writes: 350 (0 errs)
write sizes (blks): avg 16.0 min 16 max 16 sdev 0.0
write times (msec): avg 42.959 min 0.340 max 289.907 sdev 60.348
write sequences: 348
write seq. lengths: avg 16.1 min 16 max 32 sdev 1.2
seeks: 19826 (81.4%)
seek dist (blks): init 18262432, avg 24974715.3 min 16 max 157270944 sdev 44289553.4
time to next req(msec): avg 12.316 min 0.000 max 537.792 sdev 31.794
throughput: 17600.8 KB/sec
utilization: 1.00

Average IO sizes
► Blks are 512 bytes in AIX
► 439 x 512 = ~219KB average size
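The block-to-KB conversion used for the average IO size can be written out as follows. This is a minimal sketch: the helper name is hypothetical, and the only facts it relies on are the 512-byte block size and the averages from the detailed report.

```python
BLOCK_BYTES = 512  # filemon reports sizes in 512-byte blocks


def blocks_to_kb(blocks):
    """Convert a size in 512-byte blocks to KB."""
    return blocks * BLOCK_BYTES / 1024


# Averages from the /dev/rms09_lv detailed report in this course.
print(f"average read:  {blocks_to_kb(439.7):.2f} KB")
print(f"average write: {blocks_to_kb(16.0):.2f} KB")
```

The same helper applies to the min, max, and sequence-length figures, which are also reported in blocks.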



Using filemon

Look at PV summary report
► Look for balanced IO across the disks
► Lack of balance may be a data layout problem
  ♦ Depends upon PV to physical disk mapping
  ♦ LVM mirroring scheduling policy also affects balance for reads
► IO service times in the detailed report are more definitive on data layout issues
  ♦ Dissimilar IO service times across PVs indicate IOs are not balanced across physical disks

Look at most active LVs report
► Look for busy file system logs
► Look for file system logs serving more than one file system



topas - new LPAR screen

Split screen accessible from -L or the "L" command
► upper section shows a subset of lparstat metrics
► lower section shows sorted list of logical processors with mpstat columns

Interval: 2 Logical Partition: aix Sat Mar 13 09:44:48 2004
Poolsize: 3.0 Shared SMT ON Online Memory: 8192.0
Entitlement: 2.5 Mode: Capped Online Logical CPUs: 4
Online Virtual CPUs: 2
%user %sys %wait %idle physc %entc %lbusy app vcsw phint %hypv hcalls
47.5 32.5 7.0 13.0 2.0 80.0 100.0 1.0 240 150 5.0 1500
==============================================================================
logcpu minpf majpf intr csw icsw runq lpa scalls usr sys wt idl pc lcsw
cpu0 1135 145 134 78 60 2 95 12345 10 65 15 10 0.6 120
cpu1 998 120 104 92 45 1 89 4561 8 67 25 0 0.4 120
cpu2 2246 219 167 128 72 3 92 76300 20 50 20 10 0.5 120
cpu3 2167 198 127 62 43 2 94 1238 18 45 15 22 0.5 120

Notable metrics
► %hypv and hcalls: percentage of time in hypervisor and number of calls made
► pc: fraction of physical processor consumed by a logical processor
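The figures in the topas screen above are internally consistent, which helps when reading pc, physc, and %entc together: the per-logical-CPU pc values add up to physc, and physc divided by the entitlement gives %entc. The sketch below only re-derives the numbers shown on the slide; the function name is hypothetical and this is not a description of how topas computes its output.

```python
def entc_percent(physc, entitlement):
    """Percentage of entitlement consumed, from physc and entitled capacity."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * physc / entitlement


# pc column for cpu0..cpu3 and the Entitlement value from the topas screen.
LOGICAL_CPU_PC = [0.6, 0.4, 0.5, 0.5]
ENTITLEMENT = 2.5

physc = sum(LOGICAL_CPU_PC)
print(f"physc = {physc:.1f}")
print(f"%entc = {entc_percent(physc, ENTITLEMENT):.1f}")
```

With an entitlement of 2.5 and physc of 2.0, the partition is using 80% of its entitled capacity, matching the %entc column.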



New lparstat command

Shows partition level

Three modes
► information (-i)
  ♦ shows static configuration information
► detailed hypervisor (-H)
  ♦ shows breakdown of hypervisor time by hcall type
► monitoring mode (default)

Monitoring mode metrics
► CPU utilization (%user, %sys, %idle, %wait)
► percentage spent in hypervisor (%hypv) and number of hcalls (hcalls) [both optional]
► additional shared mode only metrics
  ♦ Physical Processor Consumed (physc)
  ♦ Percentage of Entitlement Consumed (%entc)
  ♦ Logical CPU Utilization (%lbusy)
  ♦ Available Pool Processors (app)
  ♦ number of virtual context switches (vcsw): virtual processor hardware preemptions
  ♦ number of phantom interrupts (phint): interrupts received for other partitions

37


New lparstat info example
# lparstat -i
Node Name                                  : sq07
Partition Name                             : sq07_aix53lpar
Partition Number                           : 2
Type                                       : Shared-SMT
Mode                                       : Uncapped
Entitled Capacity                          : 100
Partition Group-ID                         : 32770
Shared Pool ID                             : 0
Online Virtual CPUs                        : 1
Maximum Virtual CPUs                       : 40
Minimum Virtual CPUs                       : 1
Online Memory                              : 1536 MB
Maximum Memory                             : 2048 MB
Minimum Memory                             : 1024 MB
Variable Capacity Weight                   : 128
Minimum Capacity                           : 10
Maximum Capacity                           : 400
Capacity Increment                         : 1
Maximum Dispatch Latency                   : 0
Maximum Physical CPUs in system            : 4
Active Physical CPUs in system             : 4
Active CPUs in Pool                        : -
Unallocated Capacity                       : 0
Physical CPU Percentage                    : 100.00%
Unallocated Weight                         : 127
Minimum Virtual Processor Required Capacity: 10
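
Because the -i report is a list of "name : value" pairs, it is easy to load into a lookup table for scripting. The sketch below is illustrative only: `parse_lparstat_info` is a hypothetical helper, the sample text is a subset of the slide's output, and values are kept as strings since their units vary.

```python
def parse_lparstat_info(text: str) -> dict[str, str]:
    """Split each 'name : value' line at the first colon into a dict entry."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            info[key.strip()] = value.strip()
    return info


# A few lines copied from the example output above
sample = """\
Node Name : sq07
Type : Shared-SMT
Mode : Uncapped
Online Virtual CPUs : 1
Online Memory : 1536 MB
Minimum Virtual Processor Required Capacity: 10
"""

info = parse_lparstat_info(sample)
print(info["Mode"], "|", info["Online Memory"])  # Uncapped | 1536 MB
```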

38


New mpstat command
Shows detailed logical processor information
► default mode shows
● utilization metrics (%user, %sys, %idle, %wait)
● major and minor page faults (with and without disk I/O)
● number of syscalls and interrupts
● dispatcher metrics: number of migrations, voluntary and involuntary context switches, logical processor affinity (percentage of redispatches inside MCM), run queue size
● fraction of processor consumed [SMT or shared mode only]
● percentage of entitlement consumed [shared mode only]
● number of logical context switches [shared mode only] (hardware preemptions)
► -d shows detailed software and hardware dispatcher metrics
► -i shows detailed interrupt metrics
► -s option shows SMT utilization

39


New mpstat command example SMT mode

mpstat -s (example: shows SMT utilization)
mpstat -s

Proc0       Proc2       Proc4       Proc6
80%         78%         75%         82%         [shared mode only]
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7
40%   40%   68%   10%   35%   40%   41%   41%
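
In the mpstat -s sample, each processor's figure is the sum of its two SMT threads. A small check of that reading, assuming cpu0/cpu1 belong to Proc0, cpu2/cpu3 to Proc2, and so on (the pairing is an assumption based on the column layout):

```python
# Per-processor utilization (%) from the mpstat -s sample
proc_util = {"Proc0": 80, "Proc2": 78, "Proc4": 75, "Proc6": 82}

# Per-thread utilization (%), paired with the processor assumed to own them
thread_util = {
    "Proc0": (40, 40),  # cpu0, cpu1
    "Proc2": (68, 10),  # cpu2, cpu3
    "Proc4": (35, 40),  # cpu4, cpu5
    "Proc6": (41, 41),  # cpu6, cpu7
}

for proc, threads in thread_util.items():
    total = sum(threads)
    status = "ok" if total == proc_util[proc] else "mismatch"
    print(f"{proc}: {threads[0]}% + {threads[1]}% = {total}% ({status})")
```

Proc2 shows an uneven split (68% and 10%), which is the kind of imbalance between threads that the -s view makes visible.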

smtctl
This system is SMT capable
SMT is currently enabled
SMT boot mode is not set
Processor 0 has 2 SMT threads
SMT thread 0 is bound with processor 0
SMT thread 1 is bound with processor 1
…
lsattr -El proc0
frequency   1904000000     Processor Speed       False
smt_enabled true           Processor SMT enabled False
smt_threads 2              Processor SMT threads False
state       enable         Processor state       False
type        PowerPC_POWER5 Processor type        False

40


topas - main screen update
New cpu section metrics on physical processing resources consumed
► Physc: amount consumed in fractional number of processors
► %Entc: amount consumed in percentage of entitlement

Topas Monitor for host:    specweb8             EVENTS/QUEUES    FILE/TTY
Sat Mar 13 09:47:18 2004   Interval:  2         Cswitch      50  Readch        0
                                                Syscall      47  Writech      34
Kernel    0.0   |                            |  Reads         0  Rawin         0
User      0.0   |                            |  Writes        0  Ttyout       34
Wait      0.0   |                            |  Forks         0  Igets         0
Idle    100.0   |############################|  Execs         0  Namei         1
Physc =  0.01          %Entc=   1.2             Runqueue    0.0  Dirblk        0
                                                Waitqueue   0.0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out
en0       0.1      1.0     1.0     0.0     0.1  PAGING           MEMORY
lo0       0.0      0.0     0.0     0.0     0.0  Faults        0  Real,MB    8191
                                                Steals        0  % Comp      5.4
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp   1.6
hdisk0    0.0      0.0     0.0     0.0     0.0  PgspOut       0  % Client    1.6
hdisk2    0.0      0.0     0.0     0.0     0.0  PageIn        0
hdisk3    0.0      0.0     0.0     0.0     0.0  PageOut       0  PAGING SPACE
hdisk1    0.0      0.0     0.0     0.0     0.0  Sios          0  Size,MB     512
                                                                 % Used      0.6
Name            PID  CPU%  PgSp Owner           NFS (calls/sec)  % Free     99.3
IBM.CSMAg     13180   0.0   1.6 root            ServerV2       0
syncd          9366   0.0   0.5 root            ClientV2       0   Press:
prngd         22452   0.0   0.3 root            ServerV3       0   "h" for help
psgc           2322   0.0   0.0 root            ClientV3       0   "q" to quit
pilegc         2580   0.0   0.0 root

New metrics are added automatically when running in shared mode
CPU utilization metrics are calculated using new PURR-based data and formula when running in SMT or shared mode

41


topas - CEC monitoring screen
Split screen accessible from -C or the "C" command

► upper section shows CEC-level metrics
► lower section shows sorted lists of shared and dedicated partitions

CEC configuration info retrieved from HMC or specified on command line

42


Automatic Performance Metric recording
Local metrics recordings

► Uses xmwlm daemon
● automatically started from inittab
● 15 sec sampling frequency
● 5 minutes recording frequency
► Keeps 7 days' worth of data
► Recordings include most of topas local data
● except process and WLM data
► Disk space required
● system with a low number of disks: 2 MB/day
● 10 MB/day for every 100 disks
● Stored in /etc/perf/daily
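
The disk space figures above can be turned into a rough sizing estimate for /etc/perf/daily. The sketch below is only one reading of the slide: it assumes the 2 MB/day figure acts as a floor, that usage scales linearly at 10 MB/day per 100 disks, and that the full 7 days are retained. The function names are hypothetical.

```python
RETENTION_DAYS = 7        # "7 days worth of data"
FLOOR_MB_PER_DAY = 2.0    # system with a low number of disks
MB_PER_DAY_PER_DISK = 10.0 / 100  # 10 MB/day for every 100 disks


def daily_recording_mb(num_disks: int) -> float:
    """Estimated MB written per day (assumed: linear in disks, with a floor)."""
    return max(FLOOR_MB_PER_DAY, num_disks * MB_PER_DAY_PER_DISK)


def retained_recording_mb(num_disks: int, days: int = RETENTION_DAYS) -> float:
    """Estimated MB held on disk once the retention window is full."""
    return daily_recording_mb(num_disks) * days


for disks in (10, 100, 300):
    print(disks, daily_recording_mb(disks), retained_recording_mb(disks))
# 10 disks  ->  2.0 MB/day,  14.0 MB retained
# 100 disks -> 10.0 MB/day,  70.0 MB retained
# 300 disks -> 30.0 MB/day, 210.0 MB retained
```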

43


Automatic Performance Metric recording
CEC-wide metrics recording

► Implemented as topas -R option
● records all topas -C metrics
● works independently and in parallel from topas
► 10 sec sampling frequency, 60 seconds recording frequency
► Installed by default in AIX 5.3 TL6; before then
● Must be "installed" in inittab manually in one of the partitions in the CEC
# /usr/lpp/perfagent/config_topas.sh add

44


Topas recordings – topasout
Exports data and generates reports

Input source for WLE (Workload Estimator)
► free alternative to System p PM services (aka PTX provider)
► Provides peak weekly utilization information in XML file
● peak 2 hours of week (cpu, memory, filesystem, disk I/O totals)

Formats
► Spreadsheet and csv formats
► nmon_analyser format
● http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nmonanalyser

45


Example: topasout - detailed local report in text

46


New Configure Topas Option setup in SMIT
smitty

Performance & Resource Scheduling
Move cursor to desired item and press Enter.
  Resource Status & Monitors
  Analysis Tools
  Resource Controls
  Schedule Jobs
  Workload Manager
  Enterprise Workload Management
  Resource Set Management
  Tuning Kernel & Network Parameters
  Simultaneous Multi-Threading Processor Mode
  Configure Topas Options

Fastpath: smitty topas

47


topas recordings - 5.3 TL6 update
SMIT panels

► setup access to partitions not on local subnet
► turn on/off CEC and local recordings
► display recording status
► generate reports
● to file
● to printer
● to stdout
► eliminates
● need to know file location and names
● topasout syntax

48


Using "topas -R" to Record Cross Partition Performance

(Must be running AIX 5.3 TL5 Service Pack 4 with APAR IY87993 or newer)
This option records the performance of all partitions on a physical server
The command is run on just one server

Start Recording
► # /usr/lpp/perfagent/config_topas.sh add
► The performance data is logged to: /etc/perf/topas_cec.YYMMDD

Stop Recording
► # /usr/lpp/perfagent/config_topas.sh delete
► Rename the /etc/perf/topas_cec.YYMMDD logfile so that a restart does not corrupt it

Summary Report
► # topasout -R summary /etc/perf/topas_cec.070805

#Report: CEC Summary --- hostname: localhost version:1.1
Start:08/05/07 17:24:43 Stop:08/05/07 17:37:43 Int: 5 Min Range: 13 Min
Partition Mon: 2 UnM: 0 Shr: 2 Ded: 0 Cap: 0 UnC: 2
-CEC------ -Processors------------------------- -Memory (GB)------------
Time  ShrB DedB Mon UnM Avl UnA Shr Ded PSz APP Mon UnM Avl UnA InU
17:29 0.01 0.00 0.3 0.0 0.0   0 0.3   0 2.0 2.0 3.0 0.0 0.0 0.0 1.4
17:34 0.01 0.00 0.3 0.0 0.0   0 0.3   0 2.0 2.0 3.0 0.0 0.0 0.0 0.9
17:37 0.01 0.00 0.3 0.0 0.0   0 0.3   0 2.0 2.0 3.0 0.0 0.0 0.0 0.9
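
The YYMMDD suffix in the log file name follows the recording date, so the example file topas_cec.070805 corresponds to the 08/05/07 start date in the report header. A minimal sketch of building that path for a given day (the helper name `cec_log_path` is hypothetical; the directory and prefix are taken from the slide):

```python
from datetime import date

CEC_LOG_PREFIX = "/etc/perf/topas_cec."  # location given on the slide


def cec_log_path(day: date) -> str:
    """Return the CEC recording file path for a given day (YYMMDD suffix)."""
    return CEC_LOG_PREFIX + day.strftime("%y%m%d")


# Recording shown in the summary report: started 08/05/07 (5 August 2007)
print(cec_log_path(date(2007, 8, 5)))  # /etc/perf/topas_cec.070805
```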

49


Using "topas -R" to Record Cross Partition Performance (Cont)

Detailed Report
► # topasout -R detailed /etc/perf/topas_cec.070805

Time: 17:29:42 ------------------------------------------------------------
Partition Info    Memory (GB)         Processors
Monitored  : 2    Monitored  : 3.0    Monitored  : 0.3    Shr Physcl Busy: 0.01
UnMonitored: 0    UnMonitored: 0.0    UnMonitored: 0.0    Ded Physcl Busy: 0.00
Shared     : 2    Available  : 0.0    Available  : 0.0
Dedicated  : 0    UnAllocated: 0.0    Unallocated: 0.0    Hypervisor
Capped     : 0    Consumed   : 1.4    Shared     : 0.3    Virt Cntxt Swtch: 808
UnCapped   : 2                        Dedicated  : 0.0    Phantom Intrpt  : 0
                                      Pool Size  : 2.0
                                      Avail Pool : 2.0
Host       OS  M Mem InU Lp Us Sy Wa Id PhysB Ent %EntC Vcsw PhI
-------------------------------------shared-------------------------------------
gilmore    A53 U 2.0 0.9  4  1  1  0 97  0.01 0.2  4.46  482   0
localhost  A53 U 1.0 0.4  2  0  2  0 97  0.00 0.1  4.48  325   0
------------------------------------dedicated----------------------------------

For more information, see the AIX /usr/lpp/perfagent/README.perfagent.tools

50


SVMON command
An analysis tool for virtual memory

Purpose:
Captures a snapshot of the current state of memory.

The display of information can be analyzed using four different reports:
► global [-G]
► process [-P]
► segment [-S]
► detailed segment [-D]

Example
# svmon -G
               size      inuse       free        pin    virtual
memory     32636928   12958574   19678354    1453011    5842282
pg space    2097152      22267

               work       pers       clnt
pin         1452772        239          0
in use      5842282     834564    6281728

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -   12679102      22267    1253811    5562810
m  64 KB          -      17467          0      12450      17467

51


Correlating vmstat and svmon output

# vmstat 1 4
System configuration: lcpu=32 mem=127488MB

kthr    memory              page              faults        cpu
----- ---------------- ------------------------ ------------ -----------
 r  b     avm      fre  re  pi  po  fr  sr  cy  in   sy   cs us sy id wa
 0  0 5842256 19678557   0   0   0   0   0   0   6 1223 1251  0  0 99  0
 0  0 5842255 19678558   0   0   0   0   0   0   5 1026 1200  0  0 99  0
 0  0 5842253 19678560   0   0   0   0   0   0   4  853 1130  0  0 99  0
 2  0 5842253 19678560   0   0   0   0   0   0   5 1046 1218  0  0 99  0

# svmon -G
               size      inuse       free        pin    virtual
memory     32636928   12958401   19678527    1453001    5842265
pg space    2097152      22267

               work       pers       clnt
pin         1452762        239          0
in use      5842265     834408    6281728

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -   12678929      22267    1253801    5562793
m  64 KB          -      17467          0      12450      17467
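
The correlation can be checked numerically from the figures on this slide. In this sketch, vmstat's avm is compared with svmon's virtual and vmstat's fre with svmon's free, all counted in 4 KB pages. The small differences are presumably because the two commands were run moments apart. The variable and function names are illustrative.

```python
PAGE_4K_BYTES = 4096
PAGES_4K_PER_64K = 16  # one 64 KB page spans sixteen 4 KB frames


def pages_to_mb(pages: int, page_bytes: int = PAGE_4K_BYTES) -> float:
    """Convert a page count to megabytes (1 MB = 2**20 bytes)."""
    return pages * page_bytes / 2**20


# svmon -G "memory" line and per-page-size "inuse" column from the slide
svmon = {"size": 32636928, "inuse": 12958401, "free": 19678527, "virtual": 5842265}
inuse_4k, inuse_64k = 12678929, 17467

# Last vmstat sample from the slide
vmstat = {"avm": 5842253, "fre": 19678560}

print(pages_to_mb(svmon["size"]))                       # 127488.0, matches mem=127488MB
print(svmon["inuse"] + svmon["free"] == svmon["size"])  # True
print(inuse_4k + inuse_64k * PAGES_4K_PER_64K)          # 12958401, the memory inuse total
print(svmon["virtual"] - vmstat["avm"])                 # 12 pages apart
print(vmstat["fre"] - svmon["free"])                    # 33 pages apart
```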

52


Correlating ps and svmon output
