
Front cover

Power Systems for AIX IV: Performance Management (Course code AN51)

Student Notebook, ERC 2.0, V5.4


Trademarks

IBM® and the IBM logo are registered trademarks of International Business Machines Corporation.

The following are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide:

DB2® HACMP™ System i™ System p™ System x™ System z™

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

June 2010 edition

The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2010. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users — Documentation related to restricted rights — Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.


Contents

Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Course description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Unit 1. Performance analysis and tuning overview . . . . 1-1
  Unit objectives . . . . 1-2
  What exactly is performance? . . . . 1-3
  What is a performance problem? . . . . 1-6
  What are benchmarks? . . . . 1-7
  Components of system performance . . . . 1-10
  Factors that influence performance . . . . 1-12
  Performance metrics and baseline measurement . . . . 1-13
  Trade-offs and performance approach . . . . 1-16
  Performance analysis flowchart . . . . 1-18
  Impact of virtualization . . . . 1-20
  The performance management team . . . . 1-22
  Performance analysis tools . . . . 1-24
  Performance tuning tools . . . . 1-26
  AIX tuning commands . . . . 1-28
  Types of tunables . . . . 1-29
  Tunable parameter categories . . . . 1-30
  Tunables command options and files . . . . 1-31
  Tuning commands -L option . . . . 1-34
  Stanza file format . . . . 1-36
  File control commands for tunables . . . . 1-38
  Checkpoint (1 of 2) . . . . 1-42
  Checkpoint (2 of 2) . . . . 1-43
  Exercise 1: Work with tunables files . . . . 1-44
  Unit summary . . . . 1-45

Unit 2. Data collection . . . . 2-1
  Unit objectives . . . . 2-2
  Performance problem description . . . . 2-3
  Collecting performance data . . . . 2-5
  Installing PerfPMR . . . . 2-6
  Capturing data with PerfPMR . . . . 2-8
  PerfPMR report types . . . . 2-12
  Generic report contents (1 of 3) . . . . 2-14
  Generic report contents (2 of 3) . . . . 2-20
  Generic report contents (3 of 3) . . . . 2-21
  Formatting PerfPMR raw traces . . . . 2-22
  When to run PerfPMR . . . . 2-25


  The topas command . . . . 2-26
  The nmon and nmon_analyser tools . . . . 2-29
  The AIX nmon command . . . . 2-31
  Checkpoint . . . . 2-32
  Exercise 2: Data collection . . . . 2-33
  Unit summary . . . . 2-34

Unit 3. Monitoring, analyzing, and tuning CPU usage . . . . 3-1
  Unit objectives . . . . 3-3
  CPU monitoring strategy . . . . 3-4
  Processes and threads . . . . 3-6
  The life of a process . . . . 3-8
  Run queues . . . . 3-10
  Process and thread priorities (1 of 2) . . . . 3-13
  Process and thread priorities (2 of 2) . . . . 3-15
  nice/renice examples . . . . 3-17
  Viewing process and thread priorities . . . . 3-20
  Boosting an important process with nice . . . . 3-22
  Usage penalty and decay rates . . . . 3-23
  Priorities: What to do? . . . . 3-26
  AIX workload partitions (WPAR): Review . . . . 3-27
  System WPAR and application WPAR . . . . 3-30
  Target shares . . . . 3-32
  Limits . . . . 3-34
  WPAR resource management . . . . 3-36
  wlmstat command syntax . . . . 3-38
  Context switches . . . . 3-41
  User mode versus system mode . . . . 3-43
  Timing commands . . . . 3-45
  Monitoring CPU usage with vmstat . . . . 3-47
  sar command . . . . 3-49
  Locating dominant processes . . . . 3-51
  tprof output . . . . 3-53
  What is simultaneous multi-threading? . . . . 3-54
  SMT scheduling and CPU utilization . . . . 3-58
  System wide CPU reports (old and new) . . . . 3-60
  Viewing CPU statistics with SMT . . . . 3-61
  POWER7 CPU statistics with SMT4 . . . . 3-62
  Processor virtualization . . . . 3-63
  Performance management with virtualization . . . . 3-68
  CPU statistics in an SPLPAR (1 of 2) . . . . 3-71
  CPU statistics in an SPLPAR (2 of 2) . . . . 3-73
  Checkpoint . . . . 3-74
  Exercise 3: Monitoring, analyzing, and tuning CPU usage . . . . 3-75
  Unit summary . . . . 3-76

Unit 4. Virtual memory performance monitoring and tuning . . . . 4-1
  Unit objectives . . . . 4-3


  Memory hierarchy . . . . 4-4
  Virtual and real memory . . . . 4-8
  Major VMM functions . . . . 4-10
  VMM terminology . . . . 4-12
  Free list and page replacement . . . . 4-14
  When to steal pages based on free pages . . . . 4-16
  Free list statistics . . . . 4-17
  Displaying memory usage (1 of 2) . . . . 4-19
  Displaying memory usage (2 of 2) . . . . 4-22
  What type of pages are stolen? . . . . 4-23
  Values for page types and classifications . . . . 4-26
  What types of pages are in real memory? . . . . 4-28
  Is memory over committed? . . . . 4-30
  Memory leaks . . . . 4-32
  Detecting a memory leak with vmstat . . . . 4-34
  Detecting a memory leak with ps gv . . . . 4-35
  Active memory sharing: Hierarchy . . . . 4-36
  Active memory sharing: Loaning and stealing . . . . 4-37
  Displaying memory usage with AMS . . . . 4-39
  Active Memory Expansion (AME) . . . . 4-41
  AME statistics (1 of 2) . . . . 4-43
  AME statistics (2 of 2) . . . . 4-45
  Active Memory Expansion tuning . . . . 4-48
  Recommendations . . . . 4-50
  Managing memory demand . . . . 4-52
  Checkpoint (1 of 2) . . . . 4-55
  Checkpoint (2 of 2) . . . . 4-56
  Exercise 4: Virtual memory analysis and tuning . . . . 4-57
  Unit summary . . . . 4-58

Unit 5. Physical and logical volume performance . . . . 5-1
  Unit objectives . . . . 5-2
  Overview . . . . 5-3
  I/O stack . . . . 5-5
  Individual disks versus disk arrays . . . . 5-7
  Disk groups . . . . 5-8
  LVM attributes that affect performance . . . . 5-10
  LVM mirroring . . . . 5-14
  LVM mirroring scheduling policies . . . . 5-18
  Displaying LV fragmentation . . . . 5-20
  Using iostat . . . . 5-22
  What is iowait? . . . . 5-26
  LVM pbufs . . . . 5-28
  Viewing and changing LVM pbufs . . . . 5-29
  I/O request disk queuing . . . . 5-32
  Using iostat -D . . . . 5-34
  sar -d . . . . 5-36
  Using filemon (1 of 2) . . . . 5-38


  Using filemon (2 of 2) . . . . 5-41
  Managing uneven disk workloads . . . . 5-43
  Adapter and multipath statistics . . . . 5-47
  Monitoring adapter I/O throughput . . . . 5-49
  Monitoring multiple paths (1 of 2) . . . . 5-51
  Monitoring multiple paths (2 of 2) . . . . 5-53
  Checkpoint . . . . 5-54
  Exercise 5: I/O Performance . . . . 5-55
  Unit summary . . . . 5-56

Unit 6. File system performance monitoring and tuning . . . . 6-1
  Unit objectives . . . . 6-2
  File system I/O layers . . . . 6-3
  File system performance factors . . . . 6-4
  How to measure file system performance . . . . 6-7
  How to measure read throughput . . . . 6-9
  How to measure write throughput . . . . 6-10
  Using iostat . . . . 6-12
  Using filemon (1 of 3) . . . . 6-14
  Using filemon (2 of 3) . . . . 6-16
  Using filemon (3 of 3) . . . . 6-17
  Fragmentation and performance . . . . 6-19
  Determine fragmentation using fileplace . . . . 6-21
  Reorganizing the file system . . . . 6-23
  Using defragfs . . . . 6-25
  JFS and JFS2 logs . . . . 6-27
  Creating additional JFS and JFS2 logs . . . . 6-29
  Sequential read-ahead . . . . 6-32
  Tuning file syncs . . . . 6-36
  Sequential write-behind . . . . 6-38
  Random write-behind . . . . 6-40
  JFS2 random write-behind example . . . . 6-42
  File system buffers and VMM I/O queue . . . . 6-44
  Tuning file system buffers . . . . 6-46
  VMM file I/O pacing . . . . 6-48
  The pro and con of VMM file caching . . . . 6-52
  JFS and JFS2 release-behind . . . . 6-54
  Normal I/O versus direct I/O (DIO) . . . . 6-57
  Using direct I/O (DIO) . . . . 6-58
  Checkpoint (1 of 3) . . . . 6-60
  Checkpoint (2 of 3) . . . . 6-61
  Checkpoint (3 of 3) . . . . 6-62
  Exercise 6: File system performance . . . . 6-63
  Unit summary . . . . 6-64

Unit 7. Network performance . . . . 7-1
  Unit objectives . . . . 7-2
  What affects network performance? . . . . 7-3


  Document your environment . . . . 7-5
  Measuring network performance . . . . 7-7
  Network services processing . . . . 7-10
  Network memory . . . . 7-13
  Memory statistics with netstat -m . . . . 7-16
  Socket flow control (TCP) . . . . 7-18
  TCP acknowledgement and retransmission . . . . 7-20
  TCP flow control and probes . . . . 7-22
  netstat -p tcp . . . . 7-24
  TCP socket buffer tuning (1 of 2) . . . . 7-28
  TCP socket buffer tuning (2 of 2) . . . . 7-30
  Interface specific network options . . . . 7-34
  Nagle’s algorithm . . . . 7-36
  UDP buffer overflow . . . . 7-39
  netstat -p udp . . . . 7-41
  Fragmentation and segmentation . . . . 7-43
  Intermediate network MTU restrictions . . . . 7-44
  TCP maximum segment size . . . . 7-45
  Fragmentation and IP input queue . . . . 7-48
  netstat -p ip . . . . 7-51
  Interface and hardware flow . . . . 7-55
  Transmit queue overflows . . . . 7-57
  Adapter configuration conflicts . . . . 7-61
  Receive pool buffer errors . . . . 7-65
  Network traces . . . . 7-67
  Network trace examples . . . . 7-69
  Checkpoint (1 of 3) . . . . 7-70
  Checkpoint (2 of 3) . . . . 7-71
  Checkpoint (3 of 3) . . . . 7-72
  Exercise 7: Network performance . . . . 7-73
  Unit summary . . . . 7-74

Unit 8. NFS performance . . . . 8-1
  Unit objectives . . . . 8-2
  NFS tuning concepts . . . . 8-3
  NFS versions . . . . 8-6
  Transport layers used by NFS . . . . 8-9
  NFS request path . . . . 8-10
  NFS performance related daemons . . . . 8-12
  nfsstat -s . . . . 8-15
  NFS statistics using netpmon -O nfs . . . . 8-17
  Server tuning with nfso . . . . 8-18
  nfsstat -c . . . . 8-22
  nfsstat -m . . . . 8-25
  Client commit-behind tuning . . . . 8-26
  Client attribute cache tuning . . . . 8-28
  NFS I/O pacing, release-behind, and DIO . . . . 8-30
  Checkpoint (1 of 2) . . . . 8-32


  Checkpoint (2 of 2) . . . . 8-33
  Exercise 8: NFS performance tuning . . . . 8-34
  Unit summary . . . . 8-35

Unit 9. Performance management methodology . . . . 9-1
  Unit objectives . . . . 9-2
  Factors that can affect performance . . . . 9-3
  Determine type of problem . . . . 9-5
  Trade-offs and performance approach . . . . 9-7
  Performance analysis flowchart . . . . 9-9
  CPU performance flowchart . . . . 9-10
  Memory performance flowchart . . . . 9-12
  Disk/File system performance flowchart . . . . 9-15
  Network performance flowchart (1 of 3) . . . . 9-17
  Network performance flowchart (2 of 3) . . . . 9-19
  Network performance flowchart (3 of 3) . . . . 9-21
  NFS performance flowchart: Servers . . . . 9-23
  NFS performance flowchart: Clients . . . . 9-25
  Checkpoint . . . . 9-27
  Exercise 9: Summary exercise . . . . 9-28
  Unit summary . . . . 9-29

Appendix A. Checkpoint solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1


Trademarks

The reader should recognize that the following terms, which appear in the content of this training document, are official trademarks of IBM or other companies:

IBM® is a registered trademark of International Business Machines Corporation.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both:

Active Memory™ AIX® AIX 5L™ DB2® eServer™ Enterprise Storage Server® GPFS™ HACMP™ Iterations® Lotus Notes® Lotus® Micro-Partitioning™ Notes® POWER® POWER5™ POWER6® POWER7™ POWER Hypervisor™ PowerVM™ Redbooks® System Storage™ Tivoli® WebSphere® 1-2-3® 400®

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Linux® is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX® is a registered trademark of The Open Group in the United States and other countries.

Other product and service names might be trademarks of IBM or other companies.


Course description

Power Systems for AIX IV: Performance Management

Duration: 5 days

Purpose

Develop the skills to measure, analyze, and tune common performance issues on IBM POWER systems running AIX 6.

Learn about performance management concepts and techniques, and how to use basic AIX tools to monitor, analyze, and tune an AIX 6 system. The course covers how virtualization technologies such as the PowerVM environment and workload partitions affect AIX performance management. Monitoring and analyzing tools discussed in this course include vmstat, iostat, sar, tprof, svmon, filemon, netstat, lvmstat, and topas. Tuning tools include schedo, vmo, ioo, no, and nfso.

The course also covers how to use Performance Problem Reporting (PerfPMR) to capture a variety of performance data for later analysis.

Each lecture is reinforced with extensive hands-on lab exercises which provide practical experience.

Audience

• AIX technical support personnel

• Performance benchmarking personnel

• AIX system administrators

Prerequisites

Students attending this course are expected to have basic AIX system administration skills. These skills can be obtained by attending the following courses:

- AU14/Q1314 AIX 5L System Administration I: Implementation

or

- AN12 Power Systems for AIX II: Implementation and Administration


It is very helpful to have a strong background in TCP/IP networking to support the network performance portion of the course. These skills can be built or reinforced by attending:

- AU07/Q1307 AIX 5L Configuring TCP/IP

or

- AN21 TCP/IP for AIX Administrators

It is also very helpful to have a strong background in PowerVM (particularly micro-partitioning and the role of the Virtual I/O Server). These skills can be built or reinforced by attending:

- AU73 System p LPAR and Virtualization I: Planning and Configuration

or

- AN30 Power Virtualization I: Implementing Dual VIOS & IVE

Objectives

On completion of this course, students should be able to:

- Define performance terminology

- Describe the methodology for tuning a system

- Identify the set of basic AIX tools to monitor, analyze, and tune a system

- Use AIX tools to determine common bottlenecks in the Central Processing Unit (CPU), Virtual Memory Manager (VMM), Logical Volume Manager (LVM), internal disk Input/Output (I/O), and network subsystems

- Use AIX tools to demonstrate techniques to tune the subsystems


Agenda

Day 1

Unit 1 - Performance analysis and tuning overview
Exercise 1
Unit 2 - Data collection
Exercise 2
Unit 3 - Monitoring, analyzing, and tuning CPU usage
Exercise 3, parts 1 and 2

Day 2

Exercise 3, parts 3, 4, and 5
Unit 4 - Virtual memory performance monitoring and tuning
Exercise 4
Student's choice: optional exercise from Exercise 3 or 4

Day 3

Unit 5 - Physical and logical volume performance
Exercise 5
Unit 6 - File system performance, topic 1
Exercise 6, parts 1, 2, and 3

Day 4

Unit 6 - File system performance, topic 2
Exercise 6, part 4
Unit 7 - Network performance
Exercise 7
Student's choice: optional exercise from Exercises 3, 4, 5, or 6

Day 5

Unit 8 - NFS performance
Exercise 8
Unit 9 - Performance management methodology
Exercise 9
Student's choice: optional exercises from Exercises 3, 4, 5, 6, or 7


Unit 1. Performance analysis and tuning overview

What this unit is about

This unit defines performance terminology and gives you a set of tools to analyze and tune a system. It also discusses the process for tuning a system.

What you should be able to do

After completing this unit, you should be able to:

• Describe the following performance terms:

- Throughput, response time, benchmark, metric, baseline, performance goal

• List performance components

• Describe the performance tuning process

• List tools available for analysis and tuning

How you will check your progress

Accountability:

• Checkpoint
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


Figure 1-1. Unit objectives AN512.0

Notes:

Introduction

The objectives in the visual above state what you should be able to do at the end of this unit.


Unit objectives

After completing this unit, you should be able to:
• Describe the following performance terms:
  – Throughput, response time, benchmark, metric, baseline, performance goal
• List performance components
• Describe the performance tuning process
• List tools available for analysis and tuning


Figure 1-2. What exactly is performance? AN512.0

Notes:

Introduction

The performance of a computer system is a different notion from the performance of, say, a car or an actor. Computer system performance describes how well the system responds to user requests and how much work the system can do in a certain amount of time; in other words, performance depends on a combination of throughput and response time. Performance is also affected by outside factors such as the network, other machines, and even the environment.

The graphic in the visual above illustrates that the performance of the system will likely have a pattern and if you understand this pattern, it will make performance management easier. For example, the 4 O’Clock Panic, as shown in the visual above, is the busiest time in this system’s day. This is not a good time to schedule additional workload, but it is a great time to monitor the system for potential bottlenecks.


What exactly is performance?

• Performance is the major factor on which the productivity of a system depends
• Performance is dependent on a combination of:
  – Throughput
  – Response time
• Acceptable performance is based on expectations:
  – Expectations are the basis for quantitative performance goals

[Figure: daily load profile from 7 a.m. to 6 p.m., marked with the Morning Crunch, Lunch Dip, 4 O'clock Panic, and 5 O'clock Cliff]


Throughput

Throughput is a measure of the amount of work over a period of time. Examples include database transactions per minute, kilobytes of a file transferred per second, kilobytes of a file read or written per second, and Web server hits per minute.

Response time

Response time is the elapsed time from when a request is submitted until the response to that request is returned. Examples include how long a database query takes, how long it takes to echo characters to the terminal, or how long it takes to access a Web page.

Throughput and response time are related. Sometimes you can have higher throughput at the cost of response time or better response time at the cost of throughput. So, acceptable performance is based on reasonable throughput combined with reasonable response time. Sometimes a decision has to be made as to which is more important: throughput or response time. Typically, user response time is more important since we humans are probably more impatient than a computer program.
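Both metrics can be observed with nothing more than standard AIX commands. A minimal sketch, assuming a large test file exists at /tmp/bigfile (a hypothetical path): the elapsed (real) time that timex reports is the response time of the whole request, and the bytes transferred divided by that time is the throughput.

    # Time a sequential read of a (hypothetical) test file; timex
    # prints real, user, and sys times when the command completes.
    timex dd if=/tmp/bigfile of=/dev/null bs=64k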

Expectations

Acceptable performance is based on our expectations. Our expectations can be based on benchmarks (custom written benchmarks or industry standard benchmarks), computer systems modeling, prior experience with the same or similar systems, or maybe even wishful thinking. Acceptable response times are relative to the system, the application, and the expectations of the users. For example, a 5-second response time to initiation of a transaction might seem slow in a fast-paced retail environment but quite normal in a small town bank.

Setting performance goals

Determining expectations is typically the starting point for setting performance goals. Performance goals are often stated in terms of specific throughput and response times. There may be different goals for different applications on the same system.

Example performance goals stated in terms of throughput or response times:

- The average database transaction response time should always be less than 3 seconds.

- The nightly backup job must finish by 6:00 a.m.
- The nightly accounting batch job must finish all 48 runs by 6:00 a.m. each day.

Performance goals are being met; now what?

When performance goals are being met, system administrators must still monitor the systems to determine if there are any upward trends which show that at some point in


the future the goals will not be met. Know your performance goals and your baseline (what the system is doing now), and then you can spot a trend and estimate when a problem will occur.


Figure 1-3. What is a performance problem? AN512.0

Notes:

Overview

Support personnel need to determine when a reported problem is a functional problem or a performance problem. When a system is not producing the correct results or if a system or network is down, then this is a functional problem. An application or a system with a memory leak has a functional problem.

Sometimes functional problems lead to performance problems. In these cases, rather than tune the system, it is more important to determine the root cause of the problem and fix it.


What is a performance problem?

• Functional problem:
  – An application, hardware, or network is not behaving correctly
• Performance problem:
  – The functions of the application, hardware, or network are being achieved, but the speed of those functions is slow
• A functional problem can lead to a performance problem:
  – Networks or name servers that are down (functional problem) can slow down communication (performance problem)
  – A memory leak (functional problem) can cause a paging problem (performance problem)


Figure 1-4. What are benchmarks? AN512.0

Notes:

Introduction

Benchmarks are used to evaluate and compare the performance of computer systems. They are useful because they remove other variables which might make results unreliable.

Benchmark tests must be similar to a customer application to predict the performance of the application or to use a benchmark result as the base for sizing. For example, a Java SPECjbb benchmark result is specific to Java performance and does not give any information about the NFS server performance of a computer system.

Benchmarks are also used in software development to identify regression in performance after code changes or enhancements.


What are benchmarks?

• Benchmarks are standardized, repeatable tests
  – Unlike real production workloads, which change constantly
• Benchmarks use a representative set of programs and data
• Benchmarks serve as a basis for:
  – Evaluation
  – Comparison
• Benchmarks include:
  – Industry standard benchmarks
  – Customer benchmarks


Industry standard benchmarks

Industry standard benchmarks use a representative set of programs and data designed to evaluate and compare computer and software performance for a specific type of work load like CPU intensive applications or database work loads. Each industry standard benchmark has rules and requirements to which all platforms have to adhere.

The following table lists some of the industry standard benchmarks and the type of application each is used for:

Benchmark (application type): Notes

SPECint, SPECfp (single-user technical)
  Updated in certain years, so you will see SPECint95, SPECfp95, and so forth. These are CPU-intensive applications with a heavy emphasis on integer or floating-point calculations.

TPC-C (online transaction processing)
  Simulates network environments with a large number of attached terminals running complex workloads. Typically used as a database benchmark.

TPC-D, TPC-H (decision support)
  Executes sets of queries against a standard database with large volumes of data and a high degree of complexity for answering critical business questions. TPC-D is obsolete as of 4/6/99.

SPECjbb (Java)
  Evaluates the performance of server-side Java. It emulates a 3-tier system focusing on the middle tier.

PLBwire, PLBsurf, Xmark, Viewperf, X11perf (graphics and CAD)
  Demonstrates relative performance across platforms/systems using real applications (2-D design, 3-D wireframe, 3-D solid modeling, 3-D animation, and low-end simulations).

SPEC SFS (NFS)
  Measures the throughput supported by an NFS server for a given response time.

SPECweb, WebStone (Web server)
  Focuses on server performance and measures the ability of the server to service HTTP requests.

NotesBench (Lotus Notes)
  Measures the maximum number of users supported, the average response time, and the number of Notes transactions per minute.

AIM (general commercial)
  Tests real office automation applications, memory management, integer and floating-point calculations, disk I/O, and multitasking.


Customer benchmarks

Customer benchmarks include customer-specific applications that are not measured through industry standard benchmarks, as well as simple benchmarks like network or file system throughput tests done with standard UNIX commands.

Since industry benchmarks often do not accurately match a customer's workload characteristics or mix, the best way to determine how a particular combination of hardware, software, and tuning changes will affect the performance of the customer's applications is to run a standardized mix of the customer's own unique workload.
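In its simplest form, such a customer benchmark can just time a representative job several times and compare the elapsed times before and after a change. A minimal sketch, where run_workload.sh is a hypothetical driver script for the customer's application:

    # Run the (hypothetical) workload five times; timex writes its
    # real/user/sys report to standard error.
    i=1
    while [ $i -le 5 ]
    do
        timex ./run_workload.sh 2>> runtimes.log
        i=`expr $i + 1`
    done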


Figure 1-5. Components of system performance AN512.0

Notes:

Introduction

The performance of a computer system depends on four main components: CPU, memory, I/O, and network.

Both hardware and software contribute to the entire system performance. You should not depend on very fast hardware as the sole contributor of system performance. Very efficient software on average hardware can cause a system to perform much better (and probably be less costly) than poor software on very fast hardware.

CPU resources

The speed of a processor, more commonly known as clock speed in megahertz, as well as the number of processors, has an impact on performance. Kernel software that controls the use of the CPU plays a large role in performance.


Components of system performance

• Central processing unit (CPU) resources
  – Processor speed and number of processors
  – Performance of software that controls CPU scheduling
• Memory resources
  – Random access memory (RAM) speed, amount of memory, and caches
  – Virtual Memory Manager (VMM) performance
• I/O resources
  – Disk latencies, number of disks and I/O adapters
  – Device driver and kernel performance
• Network resources
  – Network adapter performance and the physical network itself
  – Software performance of network applications


Memory resources

Memory, both real and virtual, is sometimes the biggest factor in an application's performance. The memory latencies (RAM speeds), the design of the memory subsystem, size of memory caches, and the Virtual Memory Manager (VMM) kernel software contribute to the performance of a computer system.

I/O resources

I/O performance contributes heavily to system performance as well. I/O resources are referred to here as I/O related to disk activities, including disks and disk adapters.

Network resources

While not all systems rely on network performance, some systems’ main performance component is network related: the network media used (adapters and wiring) as well as the networking software.

Logical resources

Sometimes the constraining resource is not anything physical. There are logical resources in the software design that can become a bottleneck. Examples are queues and buffers, which are limited in size, and pools of control blocks. While the AIX defaults for these are usually large enough for most systems, there are situations where they may need to be further tuned.
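For example, whether these logical resources are constraining I/O can be checked quickly from the counters that AIX maintains; a minimal sketch (the "blocked" counters appear in vmstat -v output):

    # Show I/Os that were blocked because a logical resource
    # (pbuf, psbuf, or fsbuf) was not available.
    vmstat -v | grep -i blocked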


Figure 1-6. Factors that influence performance AN512.0

Notes:

As server performance is distributed throughout each server component and type of resource, it is essential to identify the most important factors or bottlenecks that will affect the performance of a particular activity. Detecting the bottleneck within a server system depends on a range of factors such as those shown in the visual.

A bottleneck is a particular performance issue that is throttling the throughput of the system. It could be in any of the subsystems: CPU, memory, or I/O, including network I/O. The graphic in the visual above illustrates that there may be several performance bottlenecks on a system, and some may not be discovered until other, more constraining bottlenecks are found and resolved.


Factors that influence performance

• Detecting the bottleneck(s) within a server system depends on a range of factors such as:
  – Software application(s) workload
  – Speed and amount of available resources
  – Configuration of the server hardware
  – Configuration parameters of the operating system
  – Network configuration and topology

[Figure: a pipe whose narrow sections (bottlenecks) limit the overall throughput]


Figure 1-7. Performance metrics and baseline measurement AN512.0

Notes:

Introduction

One way to gauge performance is perception. For example, you might ask the question, “Does the system respond to us in a reasonable amount of time?” But if the system does not, then what do we do? That is where performance analysis tools play a role. These tools are programs that collect and report on various performance metrics. Whatever system components the application touches, the corresponding metrics must be analyzed.

There is a difference between overall system utilization and the performance of a given application. An objective to fully utilize a system may conflict with an objective to optimize the response time of a critical application. There can be spare CPU capacity and yet an individual application can be CPU constrained. Sometimes low utilization is itself a sign of trouble; for example, an application locking mechanism may be constraining use of the physical resources.


Performance metrics and baseline measurement

• Performance is measured through analysis tools
• System utilization versus single program performance
• Metrics that are measured include:
  – CPU utilization
  – Memory utilization and paging
  – Disk I/O
  – Network I/O
• Each metric can be subdivided into finer details
• Create a baseline measurement to compare against in the future


CPU utilization metrics

CPU utilization can be split into %user, %system, %idle, and %IOwait. Other CPU metrics can include the length of the run queues, process/thread dispatches, interrupts, and lock contention statistics.

The main CPU metric is the percent utilization. High CPU utilization is not necessarily a bad thing, as some might think. However, the reason for the utilization must be investigated to see whether it can be lowered. In the cases of %idle and %IOwait, the CPU is actually idle. The CPU is being utilized only in the first two cases (user + system).
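As an illustration, this utilization breakdown can be sampled from the command line; a minimal sketch:

    # Five samples at 2-second intervals; sar shows the %usr, %sys,
    # %wio, and %idle columns, and vmstat adds the run queue length (r).
    sar -u 2 5
    vmstat 2 5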

Memory paging metrics

Memory metrics include virtual memory paging statistics, file paging statistics, and cache and TLB miss rates.
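A quick look at paging activity might use the standard tools, for example:

    # Cumulative VMM counters since boot (page-ins, page-outs, page
    # faults), plus a system-wide snapshot of memory use.
    vmstat -s
    svmon -G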

Disk I/O metrics

Disk metrics include disk throughput (kilobytes read/written), disk transactions (transactions per second), disk adapter statistics, disk queues (if the device driver and tools support them), and elapsed time caused by various disk latencies. The type of disk access, random versus sequential, can also have a big impact on response times.
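These disk metrics can be sampled as in the sketch below (hdisk0 is only an example device name):

    # Per-disk KB read/written and transfers per second; -D adds
    # extended statistics such as service times and queue depths.
    iostat 2 3
    iostat -D hdisk0 2 3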

Network I/O metrics

Network metrics include network adapter throughput, protocol statistics, transmission statistics, network memory utilization, and much more.
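A few of these network metrics, as a sketch (ent0 is an example adapter name):

    # Protocol counters (for example, TCP retransmissions) and
    # network memory (mbuf) usage.
    netstat -p tcp
    netstat -m
    # Adapter statistics, including transmit queue overflows.
    entstat -d ent0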

Baseline measurement

You should create a baseline measurement when your system is running well and under a normal load. This will give you a guideline to compare against when your system seems to have performance problems.

Performance problems are usually reported right after a change to system hardware or software. Unless there is a baseline measurement to compare against before the change, quantification of the problem is impossible.
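A baseline capture can be as simple as saving the standard reports to dated files while the system is healthy and under normal load. A minimal sketch, assuming /perf/baseline is a directory created for this purpose:

    # Capture a dated baseline of CPU, memory, disk, and network data.
    D=`date +%Y%m%d`
    vmstat 2 10 > /perf/baseline/vmstat.$D
    iostat 2 10 > /perf/baseline/iostat.$D
    sar -u 2 10 > /perf/baseline/sar.$D
    netstat -m  > /perf/baseline/netstat.$D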


System changes can affect performance

Changes to any of the following can affect performance:

- Hardware configuration - Adding, removing, or changing configurations such as how the disks are connected

- Operating system - Installing or updating a fileset, installing PTFs, and changing parameters

- Applications - Installing new versions and fixes or configuring or changing data placement

- Application tuning - Tuning options in the operating system, database or an application

You should measure the performance before and after each change. A change may involve a single tuning parameter or a group of parameters that must be changed together.

Another option is to run the measurements at regular intervals (for example, once a month) and save the output. When a problem is found, the previous capture can be used for comparison.
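As a sketch of this approach (the script name, directory, and schedule below are assumptions for illustration, not part of AIX or PerfPMR):

# cat /usr/local/bin/baseline.sh
#!/bin/ksh
# Capture a dated baseline of the key metrics (illustrative sketch only)
DIR=/var/perf/baseline/$(date +%Y%m%d)
mkdir -p $DIR
vmstat 5 12 > $DIR/vmstat.out    # CPU and memory, one minute of samples
iostat 5 12 > $DIR/iostat.out    # disk I/O, one minute of samples
netstat -i  > $DIR/netstat.out   # network interface counters
lsps -s     > $DIR/lsps.out      # paging space usage

A crontab entry such as the following would then capture a baseline at 02:00 on the first of every month:

0 2 1 * * /usr/local/bin/baseline.sh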


Figure 1-8. Trade-offs and performance approach AN512.0

Notes:

Trade-offs

There are many trade-offs related to performance tuning that should be considered. The key is to ensure there is a balance between them.

The trade-offs are:

- Cost versus performance: In some situations, the only way to improve performance is by using more or faster hardware. But ask the question, “Does the additional cost result in a proportional increase in performance?”

- Conflicting performance requirements: If there is more than one application running simultaneously, there may be conflicting performance requirements.

- Speed versus functionality: Resources may be increased to improve a particular area, but serve as an overall detriment to the system. Also, you may need to make choices when configuring your system for speed versus maximum scalability.


Trade-offs and performance approach

• Trade-offs must be considered, such as:
  – Cost versus performance
  – Conflicting performance requirements
  – Speed versus functionality
• Performance may be improved using a methodical approach:
  1. Understanding the factors which can affect performance
  2. Measuring the current performance of the server
  3. Identifying a performance bottleneck
  4. Changing the component which is causing the bottleneck
  5. Measuring the new performance of the server to check for improvement



Methodology

Performance tuning is one aspect of performance management. The definition of performance tuning sounds simple and straightforward, but it is actually a complex process.

Performance tuning involves managing your resources. Resources could be logical (queues, buffers, and so forth) or physical (real memory, disks, CPUs, network adapters, and so forth). Resource management involves the various tasks listed here. We will examine each of these tasks later.

Tuning always must be done based on performance analysis. While there are recommendations as to where to look for performance problems, what tools to use, and what parameters to change, what works on one system may not work on another. So there is no cookbook approach available for performance tuning that will work for all systems.

Experiences with tuning may range from the informal to the very formal where reports and reviews are done prior to changes being made. Even for informal tuning actions, it is essential to plan, gather data, develop a recommendation, implement, and document.


Figure 1-9. Performance analysis flowchart AN512.0

Notes:

Tuning is a process

The flowchart in the visual above can be used for performance analysis and it illustrates that tuning is an iterative process. We will be following this flowchart throughout our course.

The starting point for this flowchart is the Normal Operations box. The first piece of data you need is a performance goal. Only by having a goal, or a set of goals, can you tell if there is a performance problem. The goals may be something like a specific response time for an interactive application or a specific length of time in which a batch job needs to complete. Tuning without a specific goal could in fact lead to the degradation of system performance.

Once you decide there is a performance problem and you analyze and tune the system, you must then go back to the performance goals to evaluate whether more tuning needs to occur.


Performance analysis flowchart

The flowchart (summarized here in text) starts at normal operations: monitor system performance and check it against requirements, then ask “Is there a performance problem?” If yes, check in turn whether the system is CPU bound, memory bound, I/O bound, or network bound, and take the corresponding actions for the first test that applies; if none of the tests apply, run additional tests and take actions based on those. After any action, ask “Does performance meet the stated goals?” If yes, return to normal operations; if no, continue the analysis.


Additional tests

The additional tests that you perform at the bottom right of the flowchart relate to the four previous categories of resource contention. If the specific bottleneck is well hidden, or you missed something, then you must keep testing to figure out what is wrong. Even when you think you’ve found a bottleneck, it’s a good idea to do additional tests to identify more detail or to make sure one bottleneck is not masquerading as another. For example, you may find a disk bottleneck, but in reality it’s a memory bottleneck causing excessive paging.


Figure 1-10. Impact of virtualization AN512.0

Notes:

Working with an AIX operating system in a logical partition changes how we approach performance management.

On a traditional single operating system server, all of the resources are local to and dedicated to that OS.

When AIX runs in a logical partition, the resources may all be virtualized. While this virtualization and sharing of resources can provide better utilization and lower costs, it also requires an awareness of the factors beyond the immediate AIX operating system.

Some network and I/O tuning requires you to examine and tune physical adapters. In a virtualized environment, those adapters could reside at the virtual I/O server (VIOS), and you would have to go to that VIOS partition to complete that part of the work.

Each of the resources could be shared with other LPARs. Thus workloads on these other LPARs could affect the resource availability in your LPAR (depending on how the virtualization facilities are configured).


Impact of virtualization

• Virtualization affects how you manage AIX performance:
  – The memory and processor capacity is determined by the Power Hypervisor
  – Memory and processors may be shared
  – Physical adapters to storage or to the network may be in the virtual I/O server

(Diagram: logical partitions using virtual Ethernet, virtual SCSI, and dedicated or shared processors and memory, connected through the Power Hypervisor and Virtual I/O Server to the physical network, physical processors, physical memory, and physical storage.)


Additional training, beyond this course, in PowerVM configuration and performance tuning is strongly recommended for administrators who are working with LPARs which are virtualized.


Figure 1-11. The performance management team AN512.0

Notes:

Overview

Managing application performance is not something an AIX administrator can do in isolation. Some studies have shown that the greatest performance improvements can be found in improving the design and coding of the application or the manner in which the data is organized. Increasing physical resources requires working with capital planning. The performance may be constrained by components that are controlled by the network administrator or storage subsystem administrator. Performance bottlenecks may be isolated to the performance of other servers that you depend upon, such as name servers or file servers. An upgrade of equipment may require changes to the power and cooling in the machine room. Newer Power processor-based systems can suppress performance in order to stay within designated heat and power limits.

Most significantly, with the PowerVM environment in which most AIX systems run, the resources on which performance depends are virtualized. The amount and manner in which processor, memory, and I/O capacity is provisioned to the logical partition that is running AIX has a great influence on the performance of the applications in that partition.

The performance management team

(Diagram: the AIX admin at the center of a team that includes:)
• Application design
• Virtualization management (LPARs, HMC)
• Facility management (heat, power)
• Network facility (switches, routers)
• Network services (DNS, NFS, NIM)
• Physical upgrades (memory, cores, adapters)
• Storage subsystem (SAN, storage arrays)
• AIX Support Line



Performance management is an area requiring partnerships with many other areas and the personnel who administer those areas.


Figure 1-12. Performance analysis tools AN512.0

Notes:

CPU analysis tools

CPU metrics analysis tools include:

- vmstat, iostat, sar, lparstat and mpstat which are packaged with bos.acct

- ps which is in bos.rte.control

- cpupstat which is part of bos.rte.commands

- gprof and prof which are in bos.adt.prof

- time (built into the various shells) or timex which is part of bos.acct

- emstat and alstat are emulation and alignment tools from bos.perf.tools

- netpmon, tprof, locktrace, curt, splat, and topas are in bos.perf.tools

- trace and trcrpt which are part of bos.sysmgt.trace

- truss is in bos.sysmgt.ser_aids

Performance analysis tools

(Slide table: analysis tools by system component)
• CPU: vmstat, iostat, sar, ps, lparstat, mpstat, cpupstat, smtctl, time, timex, tprof, gprof, prof, emstat, alstat, netpmon, locktrace, curt, splat, topas, nmon, performance toolbox, trace, trcrpt, truss
• Memory: vmstat, lsps, svmon, filemon, lparstat, topas, nmon, performance toolbox, trace, trcrpt, truss
• I/O subsystem: iostat, vmstat, lsps, lspv, lslv, lsvg, lvmstat, lsattr, lsdev, filemon, fileplace, topas, nmon, performance toolbox, trace, trcrpt, truss
• Network subsystem: netstat, entstat, nfsstat, nfs4cl, ifconfig, lsattr, netpmon, iptrace, ipreport, tcpdump, topas, nmon, performance toolbox, trace, trcrpt, truss


- smtctl is in bos.rte.methods

- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr
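As a first pass at CPU analysis using two of the tools above (illustrative invocations; tprof options vary somewhat by release, so check the man page on your level):

# mpstat 2 3              (utilization broken down per logical CPU)
# tprof -ske -x sleep 60  (profile the whole system for 60 seconds)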

Memory subsystem analysis tools

Some of the memory metric analysis tools are:

- vmstat which is packaged with bos.acct

- lsps which is part of bos.rte.lvm

- topas, svmon and filemon are part of bos.perf.tools

- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr

- trace and trcrpt which are part of bos.sysmgt.trace

- lparstat is part of bos.acct

I/O subsystem analysis tools

I/O metric analysis tools include:

- iostat and vmstat are packaged with bos.acct

- lsps, lspv, lsvg, lslv and lvmstat are in bos.rte.lvm

- lsattr and lsdev are in bos.rte.methods

- topas, filemon, and fileplace are in bos.perf.tools

- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr

- trace and trcrpt which are part of bos.sysmgt.trace

Network subsystem analysis tools

Network metric analysis tools include:

- lsattr and netstat which are part of bos.net.tcp.client

- nfsstat and nfs4cl as part of bos.net.nfs.client

- topas and netpmon are part of bos.perf.tools

- ifconfig as part of bos.net.tcp.client

- iptrace and ipreport are part of bos.net.tcp.server

- tcpdump which is part of bos.net.tcp.server

- Performance toolbox tools such as xmperf, 3dmon which are part of perfmgr

- trace and trcrpt which are part of bos.sysmgt.trace
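If one of these tools is missing on a system, lslpp can confirm which fileset delivers it and whether that fileset is installed. For example:

# lslpp -w /usr/bin/topas   (reports the owning fileset, bos.perf.tools)
# lslpp -l bos.perf.tools   (shows whether the fileset is installed, and at what level)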


Figure 1-13. Performance tuning tools AN512.0

Notes:

CPU tuning tools

CPU tuning tools include:

- nice, renice, and setpri modify priorities. nice and renice are in the bos.rte.control fileset. setpri is a command available with the perfpmr package.

- schedo modifies scheduler algorithms (in the bos.perf.tune fileset).

- bindprocessor binds processes to CPUs (in the bos.mp fileset).

- chdev modifies certain system tunables (in the bos.rte.methods fileset).

- bindintcpu can bind an adapter interrupt to a specific CPU (in the devices.chrp.base.rte fileset).

- procmon is in bos.perf.gtools.

Performance tuning tools

(Slide table: tuning tools by system component)
• CPU: nice, renice, schedo, bindprocessor, bindintcpu, chdev, wlm, wpar
• Memory: vmo, ioo, chps, mkps, chdev, wlm, wpar
• I/O subsystem: vmo, ioo, lvmo, chlv, migratepv, reorgvg, chdev
• Network subsystem: no, nfso, ifconfig, chdev

• The most important tool is matching resources to demand:
  – Spreading workload (over time and between components or systems)
  – Allocating the correct additional resource
  – Managing the demand


Memory tuning tools

Memory tuning tools include:

- vmo and ioo for various VMM, file system, and LVM parameters (in bos.perf.tune fileset)

- chps and mkps modify paging space attributes (in bos.rte.lvm fileset)

- fdpr rearranges basic blocks in an executable so that memory footprints become smaller and cache misses are reduced (in perfagent.tools fileset)

- chdev modifies certain system tunables (in bos.rte.methods fileset)

I/O tuning tools

I/O tuning tools include:

- vmo and ioo modify certain file system and LVM parameters (in bos.perf.tune fileset).

- chdev modifies system tunables such as disk and disk adapter attributes (in bos.rte.methods fileset)

- migratepv moves logical volumes from one disk to another (in bos.rte.lvm fileset)

- lvmo displays or sets pbuf tuning parameters (in bos.rte.lvm fileset)

- chlv modifies logical volume attributes (in bos.rte.lvm fileset)

- reorgvg moves logical volumes around on a disk (in bos.rte.lvm fileset)

Network tuning tools

Network tuning tools include:

- no modifies network options (in bos.net.tcp.client fileset)

- nfso modifies NFS options (in bos.net.nfs.client fileset)

- chdev modifies network adapter attributes (in bos.rte.methods fileset)

- ifconfig modifies network interface attributes (in bos.net.tcp.client fileset)
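For example (an illustrative sketch; the interface name en0 and the attribute values are examples only):

# no -o tcp_nodelayack=1    (change a network option in the running kernel)
# chdev -l en0 -a mtu=1500  (change an interface attribute through the ODM)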


Figure 1-14. AIX tuning commands AN512.0

Notes:

Overview

The tuning options are actually kept in structures in kernel memory. To assist in reestablishing these kernel values at each system reboot, the tunable values are stored in files in the directory /etc/tunables. The tunables commands can update these files, update the kernel, or both. For ease of use, the tunable commands can be invoked via SMIT (smitty tuning), Web-based System Manager, or pconsole.

AIX tuning commands

• Tunable commands include:
  – vmo manages Virtual Memory Manager tunables
  – ioo manages I/O tunables
  – schedo manages CPU scheduler/dispatcher tunables
  – no manages network tunables
  – nfso manages NFS tunables
  – raso manages reliability, availability, serviceability tunables
• Tunables are the parameters the tuning commands manipulate
• Tunables can be managed from:
  – SMIT
  – Web-based System Manager
  – Command line
• All tunable commands have the same syntax


Figure 1-15. Types of tunables AN512.0

Notes:

Beginning with AIX 6.1, many of the tunables are considered restricted. Restricted tunables should not be modified unless told to do so by AIX development or support professionals.

The restricted tunables are not displayed, by default.

When migrating to AIX 6.1, the old tunable values will be kept. However, any restricted tunables that are not at their default AIX 6.1 value will cause an error log entry.

Types of tunables

• There are two types of tunables (AIX 6.1):
  – Restricted tunables
    • Restricted tunables should not be changed without approval from AIX development or AIX Support!
    • A dynamic change will show a warning message
    • A permanent change must be confirmed
    • Permanent changes will cause an error log entry at boot time
  – Non-restricted tunables
    • Can have restricted tunables as dependencies
• Migration from AIX 5.3 to AIX 6.1 will keep the old tunable values
  – Recommendation: review them and consider changing to the AIX 6 defaults

Restricted tunables should NOT be changed without approval from AIX Development or AIX Support!


Figure 1-16. Tunable parameter categories AN512.0

Notes:

Types of tunable parameters

All the tunable parameters manipulated by the tuning commands (vmo, ioo, schedo, no, nfso and raso) have been classified into these categories:

Dynamic The parameter can be changed any time

Static The parameter can never be changed

Reboot The parameter can only be changed during boot

Bosboot The parameter can only be changed by running bosboot and rebooting the machine

Mount Changes to the parameter are only effective for future file systems or directory mounts

Incremental Parameter can only be incremented, except at boot

Connect Changes to the parameter are only effective for future socket connections

Tunable parameter categories

• The tunable parameters manipulated by the tuning commands have been classified into the following categories:
  – Dynamic
  – Static
  – Reboot
  – Bosboot
  – Mount
  – Incremental
  – Connect


Figure 1-17. Tunables command options and files AN512.0

Notes:

Introduction

The parameter values tuned by vmo, schedo, ioo, no, and nfso are stored in files in /etc/tunables.

Tunables files currently support six different stanzas: one for each of the tunable commands (schedo, vmo, ioo, no and nfso), plus a special info stanza. The five stanzas schedo, vmo, ioo, no and nfso contain tunable parameters managed by the corresponding command (see the command's man pages for the complete parameter lists).

The value can either be a numerical value or the literal word DEFAULT, which is interpreted as this tunable's default value. It is possible that some stanzas contain values for non-existent parameters (in the case a tunable file was copied from a machine running a different version of AIX and one or more tunables do not exist on the current system).

Tunables command options and files

• The /etc/tunables directory contains:
  – nextboot (overrides to defaults)
  – lastboot (values established at last boot)
  – lastboot.log (log of last boot actions)

• To list tunables:
  # command -a            (summary list of tunables)
  # command -L            (long listing of tunables)
  # command -h tunable    (full description of a tunable)
  (Note: the -F option forces the display of restricted tunables)

• To change tunables:
  # command -o tunable=value       (update the kernel only)
  # command -o tunable=value -r    (update nextboot only)
  # command -o tunable=value -p    (update the kernel and nextboot)
  # command -d tunable             (reset a single tunable to its default)
  # command -D                     (reset all tunables to the defaults)
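As a concrete illustration with a real tunable (minfree, which also appears in the sample nextboot file later in this unit):

# vmo -o minfree           (display the current value)
# vmo -o minfree=960       (change it in the running kernel only)
# vmo -p -o minfree=960    (change it now and record it in /etc/tunables/nextboot)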


nextboot file

This file is automatically applied at boot time and contains only the tunables to be changed, not the full set of parameters; in other words, it holds all tunable settings that have been made permanent. The bosboot command also gets the value of Bosboot type tunables from this file.

lastboot

This file is automatically generated at boot time. It contains the full set of tunable parameters, with their values after the last boot. Default values are marked with # DEFAULT VALUE. Static parameters are marked STATIC in the file.

This file can be very useful as a problem determination tool. For example, it will identify an error that prevented a requested change from being effective at reboot.

lastboot.log

This should be the only file in /etc/tunables that is not in the stanza format described here. It is automatically generated at boot time, and contains the logging of the creation of the lastboot file, that is, any parameter change made is logged. Any change which could not be made (possible if the nextboot file was created manually and not validated with tuncheck) is also logged. (tuncheck will be covered soon.)

Tuning command syntax

The vmo, ioo, schedo, no and nfso commands have similar syntax:

command [-p|-r] {-o Tunable[=Newvalue]}
command [-p|-r] {-d Tunable}
command [-p|-r] -D
command [-p|-r] -a {-F}
command -h Tunable
command -L [Tunable] {-F}
command -x [Tunable] {-F}

The descriptions of the flags are:

Flag  Description
-p    Makes the change apply to both current and reboot values
-r    Forces the change to go into effect on the next reboot
-o    Displays or sets individual parameters
-d    Resets an individual tunable to its default value
-D    Resets all tunables to default values
-a    Displays all parameters
-h    Displays help information for a tunable


-L    Lists attributes of one or all tunables; includes current value, default value, value to be set at next reboot, minimum possible value, maximum possible value, unit, type, and dependencies

-x    Lists characteristics of one or all tunables, one per line, using a spreadsheet-type format

-F    Forces the display of restricted parameters


Figure 1-18. Tuning commands -L option AN512.0

Notes:

Overview of the -L option

The -L option of the tunable commands (vmo, ioo, schedo, no and nfso) can be used to print out the attributes of a single tunable or all the tunables.

The output of the command with the -L option shows the current value, default value, value to be set at next reboot, minimum possible value, maximum possible value, unit, type, and dependencies.
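For example, to look at a single tunable rather than the full list (both forms follow the syntax shown earlier):

# vmo -L minfree   (attribute listing for minfree only)
# vmo -x minfree   (the same characteristics in spreadsheet-type format)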

Types of tunable parameters

All the tunable parameters manipulated by the tuning commands (no, nfso, vmo, ioo, and schedo) have been classified into these categories:

Dynamic      The parameter can be changed at any time
Static       The parameter can never be changed
Reboot       The parameter can only be changed during boot

Tuning commands -L option

# vmo -L
NAME                 CUR     DEF     BOOT    MIN   MAX     UNIT       TYPE
     DEPENDENCIES
--------------------------------------------------------------------------------
...<part of output omitted>...
maxfree              1088    1088    1088    16    367001  4KB pages  D
     minfree
     memory_frames
--------------------------------------------------------------------------------
maxperm              386241          386241                           S
--------------------------------------------------------------------------------
maxpin               370214          370214                           S
--------------------------------------------------------------------------------
maxpin%              80      80      80      1     100     % memory   D
     pinnable_frames
     memory_frames
--------------------------------------------------------------------------------
memory_frames        448K            448K                  4KB pages  S
--------------------------------------------------------------------------------
minfree              960     960     960     8     367001  4KB pages  D
     maxfree
     memory_frames
--------------------------------------------------------------------------------
...<rest of output omitted>


Bosboot      The parameter can only be changed by running bosboot and rebooting the machine
Mount        Changes to the parameter are only effective for future file system or directory mounts
Incremental  The parameter can only be incremented, except at boot time
Connect      Changes to the parameter are only effective for future socket connections

For parameters of type Bosboot, whenever a change is performed, the tuning commands automatically prompt the user and ask whether they want to execute the bosboot command. For parameters of type Connect, the tuning commands automatically restart the inetd daemon.

The following table shows each command and the tunable types that it supports:

	vmo ioo schedo no nfso
Dynamic (D)	x x x x x
Static (S)	x x x
Reboot (R)	x x
Bosboot (B)	x
Mount (M)	x x
Incremental (I)	x x x
Connect (C)	x

Tunable flag and type issues

Any change (with -o, -d or -D) to a parameter of type Mount will result in a message being displayed to warn the user that the change is only effective for future mount operations.

Any change (with -o, -d or -D flags) to a parameter of type Connect will result in inetd being restarted, and a message warning the user that the change is only effective for future socket connections.

Any attempt to change (with -o, -d or -D) a parameter of type Bosboot or Reboot without -r will result in an error message.

Any attempt to change (with -o, -d or -D but without -r) the current value of a parameter of type Incremental with a new value smaller than the current value will result in an error message.


Figure 1-19. Stanza file format AN512.0

Notes:

Stanza file format (nextboot and lastboot)

The tunables files contain one or more sections, called “stanzas”. A stanza is started by a line containing the stanza name followed by a colon (:). There is no marking for the end of a stanza; it simply continues until another stanza starts. Each stanza contains a set of parameter/value pairs, one pair per line. The values are surrounded by double quotes ("), and an equal sign (=) separates the parameter name from its value.

A parameter/value pair must belong to a stanza; it has no meaning outside of one. Two parameters sharing the same name but belonging to different stanzas are considered to be different parameters. If a parameter appears several times in a stanza, only its first occurrence is used; subsequent occurrences are ignored. Similarly, if a stanza appears multiple times in the file, only the first occurrence is used. Everything following a number sign (#) is considered a comment and ignored. Heading and trailing blanks are also ignored.

Stanza file format

Example of a nextboot file:

info:
        Description = "Tuning changes made July 2009"
        AIX_level = "6.1.2.3"
        Kernel_type = "MP64"
        Last_validation = "2009-07-28 15:31:24 CDT (current)"
vmo:
        minfree = "4000"
        maxfree = "4128"
        nokilluid = "4"
ioo:
        j2_maxPageReadAhead = "128"
        j2_nRandomCluster = "4"
        j2_nRandomWrite = "8"
        j2_nPagesPerWriteBehindCluster = "64"
no:
        tcp_nodelayack = "1"
        tcp_sendspace = "65536"
        tcp_recvspace = "65536"
nfso:
        nfs_v3_vm_bufs = "15000"


There are six possible stanzas for each file:

- info

- schedo

- vmo

- ioo

- no

- nfso

The info stanza is used to store information about the purpose of the tunable file and the level of AIX on which it was validated. Any parameter is acceptable in this stanza, however, some fields have a special meaning:

Field              Meaning

Description        A character string describing the tunable file. SMIT displays this field in the file selection box.

Kernel_type        Possible values are:
                   • UP - Uniprocessor kernel, N/A on AIX 5L V5.3 and later
                   • MP - Multiprocessor kernel, N/A on AIX 6 and later
                   • MP64 - 64-bit multiprocessor kernel
                   This field is automatically updated by tunsave and tuncheck (on success only).

Last_validation    The most recent date and time this file was validated, and the type of validation. Possible values are:
                   • current - File has been validated against the current context
                   • reboot - File has been validated against the nextboot context
                   This field is automatically updated by tunsave and tuncheck (on success only).

Logfile_checksum   The checksum of the lastboot.log file matching this tunables file. This field is present only in the lastboot file.


Figure 1-20. File control commands for tunables AN512.0

Notes:

Introduction

There are five commands which are used to control files that contain tunables. These commands take as an argument the filename to use and the commands will assume that the filename is in the /etc/tunables directory.

tuncheck command

The tuncheck command validates a tunables file. All tunables listed in the specified file are checked for range and dependencies. If a problem is detected, a warning is issued.

There are two types of validation:

- Against the current context: This checks to see if the file could be applied immediately. Tunables not listed in the file are interpreted as current values. The checking fails if a tunable of type Incremental is listed with a smaller value than its current value; it also fails if a tunable of type Bosboot or Reboot is listed with a different value than its current value.

File control commands for tunables

• Commands to manipulate the tunables files in /etc/tunables are:
  – tuncheck: used to validate the parameter values in a file
  – tunrestore: changes tunables based on parameters in a file
  – tunsave: saves tunable values to a stanza file
  – tundefault: resets tunable parameters to their default values
  – tunchange: unconditionally updates values in a file



- Against the next boot context: This checks to see if the file could be applied during a reboot, that is, if it could be a valid nextboot file. Decreasing a tunable of type Incremental is allowed. If a tunable of type Bosboot or Reboot is listed with a different value than its current value, a warning is issued but the checking does not fail.

Additionally, warnings are issued if the file contains unknown stanzas, or unknown tunables in a known stanza. However, that does not make the checking fail.

Upon success, the AIX_level, Kernel_type and Last_validation fields in the info stanza of the checked file are updated.

The syntax for the tuncheck command is:

tuncheck [-p | -r ] -f Filename

where:

Flag         Description

-f Filename  Specifies the name of the tunable file to be checked. If it does not contain the '/' (forward slash) character, the name is relative to the /etc/tunables directory.

-p           Checks Filename in both current and boot contexts. This is equivalent to running tuncheck twice, one time without any flag and one time with the -r flag.

-r           Checks Filename in a boot context.

If -p or -r is not specified, Filename is checked against the current context.

tunrestore command

The tunrestore command is used to change all tunable parameters to values stored in a specified file.

The syntax for the tunrestore command is:

tunrestore [-r] -f Filename
tunrestore -R

where:

Flag         Description

-f Filename  Immediately applies Filename. All tunables listed in Filename are set to the value defined in this file. Tunables not listed in Filename are kept unchanged. Tunables explicitly set to DEFAULT are set to their default value.


-r -f Filename  Applies Filename for the next boot. This is achieved by checking the specified file for inconsistencies (the equivalent of running tuncheck on it) and copying it over to /etc/tunables/nextboot. If bosboot is necessary, the user will be offered the chance to run it.

-R              Is only used during reboot. All tunables that are not already set to the value defined in the nextboot file are modified. Tunables not listed in the nextboot file are forced to their default value. All actions, warnings, and errors are logged into /etc/tunables/lastboot.log. tunrestore -R can only be called from /etc/inittab.

Additionally, a tunable file called /etc/tunables/lastboot is automatically generated. That file has all the tunables listed with numerical values. The values representing default values are marked with the comment DEFAULT VALUE. Its info stanza includes the checksum of the /etc/tunables/lastboot.log file to make sure pairs of lastboot/lastboot.log files can be identified.

tunsave command

The tunsave command saves the current state of the tunables parameters in a file.

The syntax for the tunsave command is:

tunsave [ -a | -A ] -f | -F Filename [ -d Description ]

where:


Flag            Description

-a              Saves all tunable parameters, including those that are currently set to their default value. These parameters are saved with the special value DEFAULT.

-A              Saves all tunable parameters, including those that are currently set to their default value. These parameters are saved numerically, and a comment, # DEFAULT VALUE, is appended to the line to flag them.

-d Description  Specifies the text to use for the Description field. Special characters must be escaped or quoted inside the Description field.

-f Filename     Specifies the name of the tunable file where the tunable parameters are saved. If Filename already exists, an error message is displayed. If it does not contain the '/' (forward slash) character, the Filename is relative to /etc/tunables.


-F Filename     Specifies the name of the tunable file where the tunable parameters are saved. If Filename already exists, the existing file is overwritten. If it does not contain the '/' (forward slash) character, the Filename is relative to /etc/tunables.

If Filename does not already exist, a new file is created. If it already exists, an error message prints unless the -F flag is specified, in which case the existing file is overwritten.

tundefault command

The tundefault command resets all tunable parameters to their default values. It launches all the tuning commands (ioo, vmo, schedo, no and nfso) with the -D flag. This resets all the AIX tunable parameters to their default value, except for parameters of type Bosboot and Reboot, and parameters of type Incremental set at values bigger than their default value, unless -r was specified. Error messages are displayed for any parameter change impossible to make.

The syntax for the tundefault command is:

tundefault [ -r | -p ]

where:


Flag  Description

-r    Defers the reset to default values until the next reboot. This clears stanza(s) in the /etc/tunables/nextboot file, and if necessary, proposes bosboot and warns that a reboot is needed.

-p    Makes the changes permanent: resets all the tunable parameters to their default values and updates the /etc/tunables/nextboot file.
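Putting the file control commands together (an illustrative session; the file name mytunes and the description text are hypothetical):

# tunsave -f mytunes -d "Settings before application upgrade"
# tuncheck -r -f mytunes     (validate the file against the next boot context)
# tunrestore -f mytunes      (apply the saved values immediately)
# tundefault -p              (permanently reset all tunables to their defaults)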


Figure 1-21. Checkpoint (1 of 2) AN512.0

Notes:

Checkpoint (1 of 2)

1. Use these terms with the following statements: benchmarks, metrics, baseline, performance goals, throughput, response time

  a. Performance is dependent on a combination of ____________ and ___________________.
  b. Expectations can be used as the basis for _______________.
  c. These are standardized tests used for evaluation. ________________________
  d. You need to know this to be able to tell if your system is performing normally. _______________________
  e. These are collected by analysis tools. ___________________


Figure 1-22. Checkpoint (2 of 2) AN512.0

Notes:

Checkpoint (2 of 2)

2. The four components of system performance are:
  –
  –
  –
  –

3. After tuning a resource or system parameter and monitoring the outcome, what is the next step in the tuning process? __________________________________________________________

4. The six tuning options commands are:
  –
  –
  –
  –
  –
  –


Figure 1-23. Exercise 1: Work with tunables files AN512.0

Notes:


Exercise 1: Work with tunables files

• List the attributes of tunables

• Validate the tunable parameters

• Examine the tunables files

• Reset tunables to their default values


Figure 1-24. Unit summary AN512.0

Notes:

Unit summary

This unit covered:
• The following performance terms:
  – Throughput, response time, benchmark, metric, baseline, performance goal
• Performance components
• The performance tuning process
• Tools available for analysis and tuning


Unit 2. Data collection

What this unit is about

This unit describes how to define a performance problem, then use tools such as the PerfPMR utility, topas, and nmon to collect performance data.

What you should be able to do

After completing this unit, you should be able to:

• Describe a performance problem

• Install PerfPMR

• Collect performance data using PerfPMR

• Describe the use of the following tools:

- topas

- nmon

How you will check your progress

Accountability:

• Checkpoint
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


Figure 2-1. Unit objectives AN512.0

Notes:


Unit objectives

At the end of this unit, you should be able to:

• Describe a performance problem

• Install PerfPMR

• Collect performance data using PerfPMR

• Describe the use of the following tools:
  – topas
  – nmon


Figure 2-2. Performance problem description AN512.0

Notes:

What should a customer do?

If a performance problem exists, the customer should contact their local support center to open a Problem Management Report (PMR). They should include as much background of the problem as possible. Then, collect and analyze the data.

What typically happens?

It is quite common for support personnel to receive a problem report that says only that someone has a performance problem on the system, along with some data to analyze. That little information is not enough to accurately determine the nature of a performance problem.

An analogy would be a patient that visits a doctor, tells the doctor that she or he is sick, and then expects an immediate diagnosis. The doctor could run many tests on the patient, gathering data such as blood tests, x-rays, and so forth, and may even find interesting results. However, these results may have nothing to do with the problem that the patient is reporting.

Performance problem description

• When someone reports a performance problem:
  – It is not enough to just gather data and analyze it
  – You must know the nature of the problem
  – Otherwise, you may waste a lot of time analyzing data which may have nothing to do with the problem being reported
• How can you find out the nature of the problem?
  – Ask many questions regarding the performance problem



As such, a performance problem is the same. The data could show 100% CPU utilization and a high run queue, but that may have nothing to do with the cause of the performance problem. Take, for example, a system where users are logged in from remote terminals over a network that goes over several routers. The users may report that the system is slow. Data could show that the CPU is very heavily utilized. But the real problem could be that the characters get echoed after long delays on their terminals due to packets getting lost on the network (which could be caused by failing routers or overloaded networks) and may have nothing to do with the CPU utilization on the machine. If, on the other hand, the complaint was that a batch job on the system was taking a long time to run, then CPU utilization or I/O bandwidth may be related. It is very important to get as much detail as possible before even attempting to collect or analyze data.

Questions to ask

Ask many questions regarding the performance problem:

- Can the problem be demonstrated with the execution of a specific command or sequence of events? (that is, ls /slow/fs, or ping xxxx, …). If not, describe the least complex example of the problem.

- Is the slow performance intermittent? Does the system get slow and then run normally for a while? Does it occur at certain times of the day or in relation to some specific activity?

- Is everything slow or just some things?

- What aspect is slow? For example, time to echo a character or elapsed time to complete a transaction or time to paint the screen?

- When did the problem start occurring? Was it that way since the system was first installed or went into production? Did anything change on the system before the problem occurred (such as adding more users or migrating additional data to the system)?

- If the problem involves a client/server, can the problem be demonstrated when run just locally on the server (network versus server issue).

- If network related, what are the network segments like (including media information such as 100 Mbps, half-duplex, and so forth) and what routers are between the client/server application?

- What vendor applications are running on the system and are they involved in the performance issue?

- What is the impact of the performance problem to the users?

- Are there any entries in the error log?


Figure 2-3. Collecting performance data AN512.0

Notes:

Overview

It is important to collect a variety of data that show statistics regarding the various system components. In order to make this easy, a set of tools supplied in a package called PerfPMR is available on a public ftp site. The following URL can be used to download your version using a Web browser:

ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr

The goal

The goal is to collect a good base of information that can be used by AIX technical support specialists or development lab programmers to get started in analyzing and solving the performance problem. This process may need to be repeated after analysis of the initial set of data is completed.

Collecting performance data

• The data may be from just one system or from multiple systems
• Gather a variety of data
• To make this simple, a set of tools supplied in a package called PerfPMR is available
• PerfPMR is downloadable from a public website:
  ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
  – Choose the appropriate version based on the AIX release
  – PerfPMR may be updated for added functionality on an ongoing basis
  – Download a new copy if your copy is back level
• Be sure to collect the performance data while the problem is occurring!


Figure 2-4. Installing PerfPMR AN512.0

Notes:

Download PerfPMR

Obtain the latest version of PerfPMR from the Web site ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr.

The PerfPMR package is distributed as a compressed tar file.

Install PerfPMR

The following assumes you are installing the PerfPMR version for AIX 6.1, the tar file is in /tmp, and the tar file is named perf61.tar.Z.

1. Login as root or use the su command to obtain root authority

Installing PerfPMR

• Download the latest PerfPMR version from the website
• Read about the PerfPMR process in the README file
• Install PerfPMR:
  – Log in as root
  – Create the directory: /tmp/perf61
  – Extract the shell scripts out of the compressed tar file
  – Install the shell scripts (using the Install script)


2. Create a perf61 directory and change to that directory (this example assumes the directory created is under /tmp):

# mkdir /tmp/perf61
# cd /tmp/perf61

3. Extract the shell scripts out of the compressed tar file:

# zcat /tmp/perf61.tar.Z | tar -xvf -

4. Install the shell scripts:

# sh ./Install

A link will be created in /usr/bin to the perfpmr.sh script.

The PerfPMR process is described in a README file provided in the PerfPMR package.


Figure 2-5. Capturing data with PerfPMR AN512.0

Notes:

Data collection directory

Create a data collection directory and cd into this directory. Allow at least 12 MB/processor of unused space in whatever file system is used. Use the df command to verify the file system has at least 30000 blocks in the Free column (30000 512 byte blocks = 15 MB).

Do not collect data in a remotely mounted file system since iptrace may hang.

If there is not enough space in the file system, perfpmr.sh will print a message similar to:

perfpmr.sh: There may not be enough space in this filesystem
perfpmr.sh: Make sure there is at least 44 Mbytes

Capturing data with PerfPMR

• Create a directory to collect the PerfPMR data
• Run perfpmr.sh 600 to collect the standard data
  – It will run considerably longer than 600 seconds
  – Do not terminate it before it finishes
• perfpmr.sh runs specialized scripts to collect the data
• perfpmr.sh will collect information by:
  – Running a kernel trace (trace.sh) for 5 seconds
  – Gathering 600 seconds of general system performance data (passed to the monitor.sh script)
  – Collecting hardware and software configuration information
  – Running trace-based utilities (for example: filemon, tprof)
  – Running network traces
• Lengths of execution are controlled by perfpmr.cfg
• Answer the questions in PROBLEM.INFO


Preparing for PerfPMR

The following filesets should be installed before running perfpmr.sh:

- bos.acct

- bos.sysmgt.trace

- bos.perf.tools

- bos.net.tcp.server

- bos.adt.include

- bos.adt.samples

Running PerfPMR

To run PerfPMR, type in the command perfpmr.sh. One of the scripts perfpmr.sh calls is monitor.sh. monitor.sh calls several scripts to run performance monitoring commands. By default, each of these performance monitoring commands called by monitor.sh will collect data for 10 minutes (600 seconds). This default time can be changed by specifying the number of seconds to run as the first parameter to perfpmr.sh. For example, perfpmr.sh 300 will collect data for 5 minutes (300 seconds). The minimum time is 60 seconds.
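Putting this together with the data collection directory guidance above, a typical collection session looks like the following (the directory name is just an example; any local file system with enough free space will do):

# mkdir /tmp/perfdata
# cd /tmp/perfdata
# perfpmr.sh 600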

Some of the flags for perfpmr.sh are:

-P Preview only. Show scripts to run and disk space needed

-Q Do not run lsattr, lslv, or lspv commands in order to save time

-I Get lock instrumented trace also

-g Do not collect gennames output

-f If gennames is run, specify gennames -f

-n Used if no netstat or nfsstat desired

-p Used if no pprof collection desired while monitor.sh running

-s Used if no svmon desired

-c Used if no configuration information is desired

-d sec sec is time to wait before starting collection period (default is 0)

Data collected

By default, the perfpmr.sh script provided will:

- Immediately collect a 5 second trace (trace.sh 5)

- Collect 600 seconds of general system performance data using interval tools such as vmstat, iostat, emstat, and sar (monitor.sh 600)


- Collect hardware and software configuration information using commands such as uname -m, lsps -a, lsdev -C, mount and df (config.sh)

In addition, if it finds the following programs available in the current execution path, it will:

- Collect 10 seconds of tcpdump information (tcpdump.sh 10)

- Collect 10 seconds of iptrace information (iptrace.sh 10)

- Collect 60 seconds of filemon information (filemon.sh 60)

- Collect 60 seconds of tprof information (tprof.sh 60)

You can also run the PerfPMR scripts individually. If you run a script as an argument to perfpmr.sh with the -x flag (for example, perfpmr.sh -x tprof.sh), you do not need to know where PerfPMR was installed or give the full path name; the perfpmr.sh command is automatically known to the system.

For HACMP users, it is generally recommended that the HACMP deadman switch interval be lengthened while performance data is being collected to avoid accidental failovers.

PROBLEM.INFO

The text file in the data collection directory, PROBLEM.INFO, asks many questions that help give a more complete picture of the problem. This background information gives the person trying to solve the problem a better understanding of what is going wrong.

Some examples of the questions in PROBLEM.INFO are:

- Can you append more detail on the simplest, repeatable example of the problem?

That is, can the problem be demonstrated with the execution of a specific command or sequence of events? (For example, ls /slow/fs takes 60 seconds, or a binary mode ftp put from one specific client only runs at 20 KB/second.)

If not, describe the least complex example of the problem.

Is the execution of AIX commands also slow?

- Is this problem a case of something that had worked previously (that is, before an upgrade) and now does not run properly?

If so, describe any other recent changes (that is, workload, number of users, networks, configuration, and so forth).

- Is this a case of a application/system/hardware that is being set up for the first time?

If so, what performance is expected and on what is it based?


More PerfPMR information

To learn more about using PerfPMR and where to send the data, read the README file that comes in the PerfPMR tar file. Also, the beginning of each script file contains a usage message explaining the parameters for that script.


Figure 2-6. PerfPMR report types AN512.0

Notes:

PerfPMR output files

PerfPMR collects its data into many different files. The types of files created are:

- *.int files are from commands that collect the data at intervals over time. For example, data collected from vmstat, iostat, sar, lparstat, mpstat, netstat and nfsstat.

- *.sum files contain data that is collected once. There is also a file called monitor.sum that contains statistics that are averaged from the monitor.int files.

- *.out files contain the output from a command that is run only once.

- *.before files contain information for commands run at the beginning of the monitoring period. One file that does not follow this convention is psb.elfk that contains the ps -elfk output before the monitoring period.


PerfPMR report types

• The primary report types are:
  .int     Data collected at intervals over time
  .sum     Averages or differences during the time
  .out     One-time output from various commands
  .before  Data collected before the monitoring time
  .after   Data collected after the monitoring time
  .raw     Binary files for input into other commands
• Most report file names are self-explanatory.
  – For example: vmstat.int or sar.int
• More generic names are not as obvious.
  – For example: monitor.sum or config.sum


- *.after files contain information for commands run at the end of the monitoring period. One file that does not follow this convention is psa.elfk that contains the ps -elfk output after the monitoring period.

- *.raw files are binary files from utilities like trace, iptrace, and tcpdump. These files can be processed to create the ASCII report file by using the -r flag with the shell program. For example, iptrace.sh -r.

The .int data is most useful for metrics analysis. The .sum data is most useful for overall or configuration types of data. The .before and .after data are metrics captured before the test case begins and at the end of the test interval. These are good for determining a starting value and a delta for what occurred over the life of the test interval.


Figure 2-7. Generic report contents (1 of 3) AN512.0

Notes:

Overview

AIX Support frequently changes and enhances the PerfPMR tool. It is recommended that you download a fresh copy periodically (at least every three months), and always before using it to document an open PMR.

The description provided in this course may not be up to date.

monitor.sum contents

monitor.sum contains output from the following files:

- ps.sum

- sar.sum

- iostat.sum

- vmstat.sum


Generic report contents (1 of 3)

• monitor.int:
  – ps -efk listings (before and after)
  – sar -A interval report
  – iostat interval report
  – vmstat interval report
  – emstat interval report
• monitor.sum:
  – ps -efk “deltas” from before to after
  – sar interval reports
  – vmstat interval averages
  – vmstat -s “deltas” from before to after


Additional reports from monitor.sh

The following are additional reports from monitor.sh:

- netstat.int contains output from various netstat commands

- nfsstat.int contains output from nfsstat -m and nfsstat -csnr

- lsps.before and lsps.after contain output from lsps -a and lsps -s

- vmstati.before and vmstati.after contain output from vmstat -i

- vmstat_v.before and vmstat_v.after contain output from vmstat -v

- svmon.before and svmon.after contain output from svmon -G and svmon -Pns

- svmon.before.S and svmon.after.S contain output from svmon -lS


Capturing before data

The monitor.sh script captures initial data by invoking the following commands and scripts:

- lsps -a and lsps -s output into the lsps.before file

- vmstat -i output into the vmstati.before file

- vmstat -v output into the vmstat_v.before file

- svmon.sh (for output see the svmon.sh section below)

Capturing after data

The following commands and scripts capture the data after the measurement period:

- lsps -a and lsps -s output into the lsps.after

- vmstat -i output into the vmstati.after

- vmstat -v output into the vmstat_v.after

- svmon.sh (for output see the svmon.sh section below)

svmon.sh

The svmon command captures and analyzes a snapshot of virtual memory. The svmon commands that the svmon.sh script invokes are:

- svmon -G which gathers general memory usage information.

- svmon -Pns which gathers memory usage statistics for all active processes. It includes non-system segments (n) and system segments (s).


- svmon -lS which gathers memory usage statistics for defined segments (S). For each segment, the -l flag lists the process identifiers that use the segment and, depending on the type of report, the entity name to which each process belongs. For special segments, a label is displayed instead of the list of process identifiers.

The following files are created:

- svmon.before contains the svmon -G and svmon -Pns information at the beginning of data collection

- svmon.before.S contains the svmon -lS information at the beginning of data collection

- svmon.after contains the svmon -G and svmon -Pns information at the end of data collection

- svmon.after.S contains the svmon -lS information at the end of data collection
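As a minimal sketch, the same snapshots could be captured manually with the underlying commands (the redirection targets mimic the PerfPMR file names):

# svmon -G > svmon.before
# svmon -Pns >> svmon.before
# svmon -lS > svmon.before.S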


Starting system monitors

The monitor.sh script invokes the following scripts to monitor system data for the amount of time given in the perfpmr.sh or monitor.sh command.

- nfsstat.sh (unless the -n flag was used)

- netstat.sh (unless the -n flag was used)

- ps.sh

- vmstat.sh

- emstat.sh (unless the -e flag was used)

- mpstat.sh (unless the -m flag was used)

- lparstat.sh (unless the -l flag was used)

- sar.sh

- iostat.sh

- pprof.sh (unless the -p flag was used)

netstat.sh

The netstat subcommand symbolically displays the contents of various network-related data structures for active connections.

The netstat.sh script builds a report on network configuration and use called netstat.int containing tokstat -d of the token-ring interfaces, entstat -d of the Ethernet interfaces, netstat -in, netstat -m, netstat -rn, netstat -rs, netstat -s, netstat -D, and netstat -an before and after monitor.sh was run. You can reset the Ethernet and token-ring statistics and rerun this report by running netstat.sh -r 60. The time parameter must be greater than or equal to 60.

nfsstat.sh

The nfsstat command displays statistical information about the Network File System (NFS) and Remote Procedure Call (RPC) calls.

The nfsstat.sh script builds a report on NFS configuration and use called nfsstat.int containing nfsstat -m and nfsstat -csnr before and after nfsstat.sh was run. The time parameter must be greater than or equal to 60.

ps.sh

The ps command shows the current status of processes.

The ps.sh script builds reports on process status (ps). The following files are created:

- psa.elfk contains a ps -elfk listing after ps.sh was run.

- psb.elfk contains a ps -elfk listing before ps.sh was run.

- ps.int contains the active processes before and after ps.sh was run.

- ps.sum contains a summary of the changes between when ps.sh started and finished. This is useful for determining what processes are consuming resources.

The time parameter must be greater than or equal to 60.

vmstat.sh

The vmstat subcommand displays virtual memory statistics.

The vmstat.sh script builds three reports with vmstat:

- Interval report called vmstat.int

- Summary report called vmstat.sum

- Report with the absolute count of paging activities called vmstat_s.out (vmstat -s)

The time parameter must be greater than or equal to 60.
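For example, to rerun just the vmstat collection for ten minutes (an illustrative duration that satisfies the 60-second minimum):

# perfpmr.sh -x vmstat.sh 600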

emstat.sh

The emstat command shows emulation exception statistics.

The emstat.sh script builds a report called emstat.int on emulated instructions. The time parameter must be greater than or equal to 60.


mpstat.sh

The mpstat command collects and displays performance statistics for all logical CPUs in the system.

The mpstat.sh script builds a report called mpstat.int with performance statistics for all logical CPUs in the system.

The time parameter must be greater than or equal to 60.

lparstat.sh

The lparstat command reports logical partition (LPAR) related information and statistics.

The lparstat.sh script builds two reports on logical partition (LPAR) related information and statistics:

- Interval report called lparstat.int

- Summary report called lparstat.sum

The time parameter must be greater than or equal to 60.

sar.sh

The sar command collects, reports, or saves system activity information.

The sar.sh script builds reports using sar. The following files are created:

- sar.int contains output of commands sadc 10 7 and sar -A

- sar.sum is a sar summary over the period sar.sh was run

The time parameter must be greater than or equal to 60.

iostat.sh

The iostat command reports CPU statistics, asynchronous input/output (AIO) and input/output statistics for the entire system, adapters, tty devices, disks and CD-ROMs.

The iostat.sh script builds two reports on I/O statistics:

- Interval report called iostat.int

- Summary report called iostat.sum

The time parameter must be greater than or equal to 60.

pprof.sh

The pprof command reports CPU usage of all kernel threads over a period of time.


The pprof.sh script builds a file called pprof.trace.raw that can be formatted with the pprof.sh -r command. The time parameter does not have any restrictions.


Figure 2-8. Generic report contents (2 of 3) AN512.0

Notes:

The purpose of the config.sum file is to provide static information about the configured environment. It identifies information about the adapters and devices being used, the LVM and file systems defined, the paging spaces, inter-process communications, network configuration, the current tuning parameters, memory environment, error log contents, and more.


Generic report contents (2 of 3)

• config.sum:
  – uname -m
  – lscfg -l mem\*; lscfg -vp
  – lsps -a; lsps -s
  – ipcs -Smqsa
  – lsdev -C
  – LVM information:
    • lspv; lspv -l
    • lsvg rootvg; lsvg -l rootvg
    • lslv (for each LV)
  – lsattr -E for many devices, including:
    • Adapters
    • Interfaces
    • Logical volumes
    • Disks
    • Volume groups
    • sys0


Figure 2-9. Generic report contents (3 of 3) AN512.0

Notes:

This visual continues the summary of the config.sum file information.


Generic report contents (3 of 3)

• config.sum (continued):
  – Filesystem information:
    • mount, lsfs -q, df
  – Network information:
    • netstat reports, ifconfig -a
  – Tunables listings:
    • no, nfso, schedo, vmo, ioo, lvmo, raso
  – vmstat -v
  – errctrl -q
  – kdb information:
    • Memory, filesystems, and more
  – System auditing status
  – Environment variables
  – Error report
  – And more


Figure 2-10. Formatting PerfPMR raw traces AN512.0

Notes:

Kernel trace

Because trace can collect huge amounts of data, the trace executed in perfpmr.sh will only run for five seconds (by default). This may not be enough time to collect trace data if the problem is not occurring during that time. In this case, you should run the trace by itself for a period of 15 seconds when the problem is present.

The command trace.sh 15 will run a trace for 15 seconds.

The trace.sh script issues a trcstop command to stop any trace that may already be running. Remember that only one trace can be running at a time.

Trace files created

The trace.sh script creates one raw trace file per CPU. The files are called trace.raw-0, trace.raw-1, and so forth for each CPU. Another raw trace file called trace.raw is also generated. This is a master file that has information that ties in the other CPU-specific traces.


Formatting PerfPMR raw traces

• Kernel trace files:
  – Creates one raw trace file per CPU (trace.raw-1, trace.raw-2, and so on)
  – To merge kernel trace files together:
    # trcrpt -C all -r trace.raw > trace.r
  – To get a trace report:
    # ./trace.sh -r
• Network trace files:
  – To create a readable IP trace report file:
    # /tmp/perf61/iptrace.sh -r
  – To create a readable tcpdump report file:
    # /tmp/perf61/tcpdump.sh -r


To merge the trace files together to form one raw trace file, run the following command:

trcrpt -C all -r trace.raw > trace.r

The -C all flag specifies that all CPUs should be used. The -r flag outputs unformatted (raw) trace entries and writes the contents of the trace log to standard output, by default. The example redirects the output to a file named trace.r. The trace.r file can be used as input into other trace-based utilities such as curt and splat.
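For example, assuming the raw trace files are in the current directory and the curt and splat utilities are installed, a typical post-processing sequence might be (the output file names are illustrative):

# trcrpt -C all -r trace.raw > trace.r
# curt -i trace.r -o curt.out
# splat -i trace.r -o splat.out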

Creating a trace report with trace.sh

An ASCII trace report can be generated by running perfpmr.sh -x trace.sh -r. This command creates a file called trace.int that contains the readable trace used for analyzing performance problems. The trace.sh -r command will produce a report for all trace events. The trcrpt command can also be used.

The trace.nm file is the output of the trcnm command which is needed to postprocess the trace.raw file on a different system. The trace.fmt file is a copy of /etc/trcfmt. There are additional trace files which are used to list the contents of the i-node table and the listing of /dev so that i-node lock contention statistics can be viewed.

The iptrace utility

The iptrace utility provides interface-level packet tracing for Internet protocols.

iptrace.sh

When perfpmr.sh is run, it checks to see if the iptrace command is installed. If it is, then the iptrace.sh script is invoked. iptrace will run for a default of 10 seconds. The iptrace.sh script can also be run directly.

The iptrace.sh script builds a raw Internet Protocol (IP) trace report on network I/O called iptrace.raw. You can convert the iptrace.raw file to a readable IP report file called iptrace.int using the perfpmr.sh -x iptrace.sh -r command.

Caution should be used when running iptrace. It can use large amounts of disk space.

The tcpdump utility

The tcpdump utility dumps information about network traffic. It prints the headers of packets on a specified network interface.

tcpdump.sh

When perfpmr.sh is run, it checks to see if the tcpdump command is installed. If it is, then the tcpdump.sh script is invoked. The tcpdump command will run for a default of 10 seconds. The tcpdump.sh script can also be run directly.


The tcpdump.sh script creates a raw trace file of a TCP/IP dump called tcpdump.raw. To produce a readable tcpdump.int file, use the tcpdump.sh -r command.


Figure 2-11. When to run PerfPMR AN512.0

Notes:

Overview

PerfPMR should be installed when the system is initially set up and tuned. Then, you can get a baseline measurement from all the performance tools. When you suspect a performance problem, PerfPMR can be run again and the results compared with the baseline measurement.

It is also recommended that you run PerfPMR before and after hardware and software changes. If your system is performing well, then you upgrade it and begin to have problems, it is difficult to identify the problem without a baseline to compare against.


When to run PerfPMR

• OK. So now that I know all about PerfPMR and the data it collects, when do I need to run it?
  – When your system is running under load and is performing correctly, so you can get a baseline
  – Before you add hardware or upgrade your software
  – When you think you have a performance problem
• It is better to have PerfPMR installed on a system before you need it rather than try to install it after the performance problem starts!


Figure 2-12. The topas command AN512.0

Notes:

Overview

The topas command reports selected statistics about the activity on the local system.

Why does AIX have the topas command? Similar tools that provide comparable capabilities are available on other operating systems and the Internet, but they are not supported on AIX.

The topas tool is in the bos.perf.tools fileset. The path to the tool is /usr/bin/topas. This tool can be used to provide a full screen of a variety of performance statistics.

The topas tool displays a continually changing screen of data rather than a sequence of interval samples, as displayed by such tools as vmstat and iostat. Therefore, topas is most useful for online monitoring and the other tools are useful for gathering detailed performance monitoring statistics for analysis.


The topas command

(The visual shows a sample topas screen: on the left, CPU utilization (Kernel, User, Wait, Idle) with Physc and %Entc values, followed by Network, Disk, FileSystem, and per-process statistics; on the right, the EVENTS/QUEUES, FILE/TTY, PAGING, MEMORY, PAGING SPACE, NFS, and WPAR subsections.)


If you are running topas in a partition and execute a dynamic LPAR command that changes the system configuration, then topas must be stopped and restarted to view accurate data.

Output sections

The topas command can show many performance statistics at the same time. The output consists of two fixed parts and a variable section.

The top several lines at the left of the display show the name of the system topas runs on, the date and time of the last observation, and the monitoring interval.

The second fixed part fills the rightmost 25 positions of the display. It contains six subsections of statistics: EVENTS/QUEUES, FILE/TTY, PAGING, MEMORY, PAGING SPACE, and NFS.

The variable part of the topas display can have one to five subsections. If more than one subsection displays, they are always shown in the following order: CPU utilization, network interfaces, physical disks, workload management classes, and processes.

When the topas command is started, it displays all subsections that are to be monitored. The exception to this is the workload management (WLM) classes subsection, which is displayed only when WLM is active. Each of these subsections can be toggled on or off with the appropriate subcommand.

Syntax and options

The topas options are:

-d  Specifies the maximum number of disks shown. If this number exceeds the number of disks installed, the latter is used. If this argument is omitted, a default of 2 is used. If a value of zero is specified, no disk information is displayed.

-h  Displays help information.

-i  Sets the monitoring interval in seconds. The default is 2 seconds.

-n  Specifies the maximum number of network interfaces shown. If this number exceeds the number of network interfaces installed, the latter is used. If this argument is omitted, a default of 2 is assumed. If a value of zero is specified, no network information is displayed.

-p  Specifies the maximum number of processes shown. If this argument is omitted, a default of 20 is assumed. If a value of zero is specified, no process information is displayed. Retrieval of process information constitutes the majority of the topas overhead.

-w  Specifies the number of monitored Workload Manager (WLM) classes. If this argument is omitted, a default of 2 is assumed.

-c  Specifies the number of monitored CPUs. If this argument is omitted, a default of 2 is assumed.

-P  Shows a full screen of processes.

-W  Shows only WLM data on the screen.
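For example, a hypothetical invocation that samples every 5 seconds, shows up to 4 disks and 30 processes, and suppresses the network subsection might be:

# topas -i 5 -d 4 -p 30 -n 0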


Subcommands

While topas is running, it accepts one-character subcommands. Each time the monitoring interval elapses, the program checks for one of the following subcommands and responds to the action requested.

The subcommands are:


a  Shows all the variable subsections being monitored. Pressing the a key always returns topas to the main initial display.

c  Pressing the c key repeatedly toggles the CPU subsection between the cumulative report, off, and a list of the busiest CPUs.

d  Pressing the d key repeatedly toggles the disk subsection between the busiest disks list, off, and total disk activity for the system.

f  Moving the cursor over a WLM class and pressing f shows the list of top processes in the class at the bottom of the screen (WLM display only).

h  Toggles between the help screen and the main display.

n  Pressing the n key repeatedly toggles the network interfaces subsection between the busiest interfaces list, off, and total network activity.

p  Pressing the p key toggles the hot processes subsection on and off.

P  Toggles to the full-screen process display.

q  Quits the program.

r  Refreshes the screen.

w  Pressing the w key toggles the workload management (WLM) classes subsection on and off.

W  Toggles to the full-screen WLM class display.



Figure 2-13. The nmon and nmon_analyser tools AN512.0

Notes:

Introduction

Like topas, the nmon tool is helpful in presenting important performance tuning information on one screen and dynamically updating it.

Another tool, the nmon_analyser, takes files produced by nmon and turns them into spreadsheets containing high quality graphs ready to cut and paste into performance reports. The tool also produces analysis for ESS and EMC subsystems. It is available for both Lotus 1-2-3 and Microsoft Excel.

The nmon tool and the nmon_analyser tool are free, but are NOT SUPPORTED by IBM. No warranty is given or implied, and you cannot obtain help with it from IBM.


The nmon and nmon_analyser tools

• nmon (Nigel’s Monitor)
  – Similar in concept to topas
  – nmon not supported by IBM
  – nmon functionality integrated into topas (AIX 6, AIX 5.3 TL9, VIOS 1.2)
  – topas_nmon fully supported by AIX Support
• nmon can be run in the following modes:
  – Interactive
  – Data recording (good for trends and capacity planning)
• nmon_analyser
  – Graphing using Excel spreadsheets
  – Uses topasout or nmon output
  – Not supported by IBM (no warranty)
  – Obtained from www.ibm.com/developerworks/aix


Obtaining the nmon tools

The nmon functionality that is incorporated into topas is not exactly the same as the nmon tool provided by Nigel Griffiths. Below is information on obtaining Nigel’s nmon tool and the nmon_analyser.

The nmon tool can be obtained from: http://www.ibm.com/developerworks/eserver/articles/nmon.html

The nmon_analyser tool can be obtained from: http://www.ibm.com/developerworks/eserver/articles/nmon_analyser/index.html

You can FTP the tool (nmonXX.tar.Z) from the agreement and download page.

Read the README.txt file for more information about which version of nmon to run on your particular operating system version. You also need to know if your AIX kernel is 32-bit or 64-bit. If you use the wrong one, nmon will simply tell you or fail to start (no risk).

The README.txt file also has information on how to run and use the nmon tool.

The nmon_analyser tool is designed to work with the latest version of nmon but is also tested with older versions for backwards compatibility. The tool is updated whenever nmon is updated and at irregular intervals for new functionality.


Figure 2-14. The AIX nmon command AN512.0

Notes:

The graphic shows an example of what the nmon display can look like. This example shows four panels that were selected for display: CPU, memory, network, and disk. As with topas, the displayed values are updated dynamically on an interval.

nmon has wide variety of statistics panels which can be individually selected (or deselected) for display through the use of single key strokes.

Pressing the h key will provide a list of the nmon subcommands.

This nmon mode of the topas command can be accessed either by executing the nmon command, or by using the ~ (tilde) key to toggle between topas mode and nmon mode.
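As an illustrative sketch of recording mode (check your version's help screen for the exact flags), the following records 120 snapshots taken every 30 seconds into a file suitable for later analysis with nmon_analyser:

# nmon -f -s 30 -c 120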


The AIX nmon command

(The visual shows a sample topas_nmon screen with four panels selected for display: CPU utilization per logical CPU with entitled-capacity and virtual-processor bars, memory and paging statistics, network interface throughput, and per-disk I/O rates.)


Figure 2-15. Checkpoint AN512.0

Notes:


Checkpoint

1. What is the difference between a functional problem and a performance problem?
   _________________________________________________________________
   _________________________________________________________________

2. What is the name of the supported tool used to collect reports with a wide variety of performance data? ________________

3. True / False: You can individually run the scripts that perfpmr.sh calls.

4. True / False: You can dynamically change the topas and nmon displays.


Figure 2-16. Exercise 2: Data collection AN512.0

Notes:


Exercise 2: Data collection

• Install PerfPMR

• Collect performance data using PerfPMR

• Use topas and nmon to monitor the system


Figure 2-17. Unit summary AN512.0

Notes:


Unit summary

This unit covered:

• Defining a performance problem

• Installing PerfPMR

• Collecting performance data using PerfPMR

• The use of the following tools:
  – topas
  – nmon


Unit 3. Monitoring, analyzing, and tuning CPU usage

What this unit is about

This unit identifies the tools to help determine CPU bottlenecks. It also demonstrates techniques to tune CPU-related issues on your system.

What you should be able to do

After completing this unit, you should be able to:

• Describe processes and threads

• Describe how process priorities affect CPU scheduling

• Manage process CPU utilization with either

- nice and renice commands

- WPAR resource controls

• Use the output of the following AIX tools to determine symptoms of a CPU bottleneck:

- vmstat, sar, ps, topas, tprof, nmon

• Correctly interpret CPU statistics in various environments including where:

- Simultaneous multi-threading (SMT) is enabled

- LPAR is using a shared processor pool

How you will check your progress

Accountability:

• Checkpoint
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-5977 AIX 5L Workload Manager (WLM) (Redbook)


SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


SG24-7940 Introduction to Advanced POWER Virtualization on IBM p5 Servers, Introduction and basic configuration (Redbook)

SG24-5768 IBM eServer p5 Virtualization Performance Considerations (Redbook)

CPU monitoring and tuning article:

http://www-128.ibm.com/developerworks/eserver/articles/aix5_cpu/


Figure 3-1. Unit objectives AN512.0

Notes:

Introduction

The objectives in the visual above state what you should be able to do at the end of this unit.


Unit objectives

After completing this unit, you should be able to:
• Describe processes and threads
• Describe how process priorities affect CPU scheduling
• Manage process CPU utilization with either:
  – nice and renice commands
  – WPAR resource controls
• Use the output of the following AIX tools to determine symptoms of a CPU bottleneck:
  – vmstat, sar, ps, topas, tprof
• Correctly interpret CPU statistics in various environments including where:
  – Simultaneous multi-threading (SMT) is enabled
  – LPAR is using a shared processor pool


Figure 3-2. CPU monitoring strategy AN512.0

Notes:

Overview

This flowchart illustrates the CPU-specific monitoring and tuning strategy. If the system is not meeting the CPU performance goal, you need to find the root cause for why the CPU subsystem is constrained. It may be simply that the system needs more physical CPUs, but it could also be because of errant applications or processes gone awry. If the system is behaving normally but is still showing signs of a CPU bottleneck, tuning strategies may help to get the most out of the CPU resources.

Monitoring usage and compare with goal(s)

For any tuning strategy it is important to know the baseline performance statistics for a particular system and what the performance goals are. Then you can compare current statistics to see if they are abnormal or not meeting the goal(s). Be sure to take baseline measurements over time to spot any troubling trends.


CPU monitoring strategy

(The visual shows a flowchart: monitor CPU usage and compare with goals. If CPU usage is high, locate the dominant processes; if their behavior is not normal, kill the abnormal processes, otherwise tune the applications and the operating system. If CPU usage is low but the CPU is not supposed to be idle, determine the cause of the idle time by tracing, then fix or tune the application or system.)


High CPU usage

If you spot unusually high CPU usage when monitoring, the next question to ask is, “What processes are accumulating CPU time?” “Are they supposed to be accumulating so much CPU time?” If they are, then perhaps there are some tuning strategies you can use to tune the application or the operating system to make sure that important processes get the CPU they need to meet the performance goal.
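One quick way to list the processes currently accumulating the most CPU time is a sketch such as the following (tprof gives a more precise picture):

# ps -e -o pcpu,pid,user,comm | sort -rn | head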

Idle CPU

Another scenario is that you are not meeting performance goals and the CPUs are fairly idle or not working as much as they should. This points to a bottleneck in another area of the computer system.


Figure 3-3. Processes and threads AN512.0

Notes:

Process

A process is the entity that the operating system uses to control the use of system resources. A process is started by a command, shell program or another process.

Process properties include the process ID, process group ID, user ID, group ID, environment, current working directory, file descriptors, signal actions, and statistics such as resource usage. These properties are defined in /usr/include/sys/proc.h.

Thread

Each process is made up of one or more kernel threads. A thread is a single sequential flow of control. A single-threaded process can only handle one operation at a time, sequentially. Multiple threads of control allow an application to overlap operations, such as reading from a terminal and writing to a file. AIX schedules and dispatches CPU resources at the thread level.


Processes and threads

(The visual shows a single-threaded process, whose one thread runs on a single CPU, and a multi-threaded process, whose three threads can run in parallel on several CPUs, with both processes competing for CPU, memory, and disk resources.)


In general, when we refer to threads in this course, we will be referring to the kernel threads within a process.

An application could also be designed to have user-level threads (also known as pthreads) which are scheduled for work by the application itself or by the pthreads scheduler in the pthreads shared library (libpthreads.a). These user threads may be mapped to one or more kernel threads depending on the libpthreads thread policy used.

Multiple threads of control also allow an application to service requests from multiple users at the same time. Threads provide these capabilities without the added overhead of multiple processes such as those created through fork and exec system calls.

Rather than duplicating the environment of a parent process, as is done via fork and exec, all threads within a process use the same address space and can communicate with each other through variables. Threads synchronize their operations via mutex (mutual exclusion) variables.

Kernel thread properties are: stack, scheduling policy, scheduling priority, pending signals, blocked signals, and some thread-specific data. These thread properties are defined in /usr/include/sys/thread.h.

AIX Version 4 introduced the use of threads to control processor time consumption, but most of the system management tools still refer to the process in which a thread is running, rather than the thread itself.


Figure 3-4. The life of a process AN512.0

Notes:

Introduction

A process can exist in a number of states during its lifetime.

I (Idle) state

Before a process is created, it needs a slot in the process and thread tables; at this stage it is in the SNONE state.

While a process is undergoing creation, waiting for resources (memory) to be allocated, it is in the SIDL state.


The life of a process

(The visual shows the process/thread states over time: SNONE and SIDL during creation (ps state I), then the active state (A) in which threads cycle among running, ready to run (R), sleeping (S), and stopped (T), and finally the zombie state (Z) after exit.)


A (active) state

When a process is in an A state, one or more of its threads are in the R (ready-to-run) state. Threads of a process in this state have to contend for the CPU with all other ready-to-run threads.

Only one thread can have the use of the CPU at a time; this is the running thread for that processor. With SMP models, there are several processors, each of which would be running a different thread, as part of the same process, or as independent threads of different processes.

A thread will be in an S state if a thread is waiting on an event or I/O. Instead of wasting CPU time, it sleeps and relinquishes control of the CPU. When the I/O is completed, the thread is awakened and placed in the ready-to-run state, where it must again compete with other ready-to-run threads for the CPU.

A thread may be stopped via the SIGSTOP signal, and started again via the SIGCONT signal; while suspended it is in the T state. This has nothing to do with performance management.

Z (zombie) state

The Z state: When a process dies (exits), it becomes a zombie. A zombie occupies a slot in the process table and thread table, but no other resources. As such, zombies are seldom a performance issue; they exist for a very short time, until their parent process receives a signal to say they have terminated. Parent processes that are programmed to ignore this signal, or that die before the child processes they created, can leave zombies on the system. Such an application, if long running, can with time fill up the process table to an unacceptable level. One way to remove zombies is to reboot the system, but this is not always a solution. You should investigate why the parent process is not cleaning up its zombies. The application developer may need to modify the program code to include a SIGCHLD handler that reads the exit status of the child processes.


Figure 3-5. Run queues AN512.0

Notes:

Run queues

When there are more threads ready to run than there are CPUs to run them, the threads are queued up in run queues. Each run queue is divided further into queues that are priority ordered (one queue per priority number). However, when we discuss run queues, we shall refer to a run queue as the structure that contains all of the priority-ordered queues.

Each CPU has its own run queue. Additionally, there is another run queue called the global run queue.

There are 256 priority levels (for each run queue). Prior to AIX 5L V5.1, AIX had 128 queues.


Run queues

(The visual shows a global run queue plus one run queue per CPU, each with 256 priority levels (0-255) holding prioritized threads. Threads are placed on the global run queue at initial dispatch, when schedo -o fixed_pri_global is set for fixed-priority threads, or when the environment variable RT_GRQ=ON is exported.)


Global run queue

The global run queue is searched before a local run queue to see which thread has the best priority. When a thread is created (assuming it is not bound to a CPU), it is placed on the global run queue.

A thread can be forced to stay on a global run queue if the environment variable RT_GRQ is set to ON. This can improve performance for threads that are running SCHED_OTHER (the default scheduling policy) and are interrupt driven. However, this could also be detrimental because of the cache misses, so use this feature with caution.

Threads that are running fixed priority will be placed on the global run queue if schedo -o fixed_pri_global=1 is run.
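For example (myapp is a hypothetical program), either of the following places threads on the global run queue:

# schedo -o fixed_pri_global=1
# export RT_GRQ=ON; myapp

The first affects only fixed-priority threads; the second affects the threads of the program started with the variable set.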

CPU run queues

There is a run queue structure for each CPU as well as a global run queue. The per-CPU run queues are called local run queues. When a thread has been running on a CPU, it will tend to stay on that CPU’s run queue. If that CPU is busy, then the thread can be dispatched to another idle CPU and will be assigned to that CPU’s run queue. This is because idle CPUs look for more work to do and will check the global run queue and then the other local run queues for a thread with a favored priority.

The dispatcher picks the best priority thread in the run queue when a CPU is available. When a thread is first created, it is assigned to the global run queue. It stays on that queue until assigned to a local run queue (when it’s dispatched to a CPU). If all CPUs are busy, the thread stays on the local run queue even if there are worse priority threads on other CPUs.

Run queue statistics

The average number of threads in the run queue can be seen in the first column of vmstat output. If you divide this number by the number of CPUs, you get the average number of runnable threads per CPU. If this value is greater than one, these threads must wait their turn for the CPU. Having runnable threads in the queue does not necessarily mean that performance delays will be noticed, because timeslicing between threads in the queue is normal; it may be perfectly normal on your system to see several runnable threads per CPU. The number of runnable threads should be only one factor in your analysis. If performance goals are being met, having many runnable threads in the queue may mean only that your system has a lot of threads but is working through them efficiently.
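For example, a minimal check might be (the interval and count are arbitrary):

# vmstat 5 3
# bindprocessor -q

Divide the average value in the r column of the vmstat output by the number of logical processors listed by bindprocessor -q to estimate the runnable threads per CPU.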

CPU scheduling policies

The default CPU scheduling policy is SCHED_OTHER. There are other CPU scheduling policies that can be set on a per-thread basis. SCHED_OTHER penalizes high-CPU usage processes and is a non-fixed priority policy. This means that the priority value changes over time (and quickly) based on the CPU usage penalty and the nice value.

Ratio of runnable threads to logical CPUs

It is important to note that priority values only affect which thread is dispatched from a given dispatching queue. If there are two threads, each running on a different CPU, and you want one of them to obtain a higher percentage of cycles on the system, then priority values will have no effect. A single-threaded application cannot use more than one logical CPU. It is only when there are more runnable threads competing for cycles than there are logical CPUs to serve them that the priority values have a significant effect.


Figure 3-6. Process and thread priorities (1 of 2) AN512.0

Notes:

What is a priority?

A priority is a number assigned to a thread used to determine the order of scheduling when multiple threads are runnable. A process priority is the most favored priority of any one of its threads. The initial process/thread priority is inherited from the parent process.

The kernel maintains a priority value (sometimes termed the scheduling priority) for each thread. The priority value is a positive integer and varies inversely with the importance of the associated thread. That is, a smaller priority value indicates a more important thread. When the scheduler is looking for a thread to dispatch, it chooses the dispatchable thread with the smallest priority value.

A thread can be fixed-priority or nonfixed-priority. The priority value of a fixed-priority thread is constant, while the priority value of a nonfixed-priority thread can change depending on its CPU usage.


Process and thread priorities (1 of 2)

(The visual shows the priority value scale from 0, the best or most favored, to 255, the worst or least favored, which is reserved for the wait thread. Real-time priorities fall below 40; user priorities start at 40. For a SCHED_OTHER thread, the effective priority value is the initial priority plus a CPU usage penalty. The initial priority has a base of 40, and the amount over 40 depends upon the nice value. The CPU usage penalty increases with CPU usage, and some CPU usage is forgiven each second (by half, by default).)


Priority values

Priority numbers range from 0-255 in AIX. Priority 255 is reserved for the wait/idle kernel thread.

Real-time thread priorities are lower than 40. Real-time applications should run with a fixed priority and a numerical value less than 40 so that they are more favored than other applications.


Figure 3-7. Process and thread priorities (2 of 2) AN512.0

Notes:

Thread changing its priority

There are two system calls that allow users to make individual processes or threads to be scheduled with fixed priority. The setpri() system call is process-oriented and thread_setsched() is thread-oriented. Only a root-owned thread can change the priority to fixed priority (or to a more favored priority).

Priority changed by a user

A user can use the nice and renice commands to change the priority of a process and its associated threads. A user can also use a program that calls the thread_setsched() or setpri() system calls to change the priority. Only the root user can change the priority to a more favored priority.


Process and thread priorities (2 of 2)

• Priorities control which threads get cycles:
  – If more runnable threads than CPUs
  – A process or thread can have a fixed or variable priority.
  – Fixed priorities can only be set by a root process.
• Variable priority is the default scheduling policy:
  – Called SCHED_OTHER
  – Penalizes compute-intensive threads to prevent the starvation of other threads
  – New threads and woken daemons have a brief advantage over running processes.
  – Initial priority can be changed by a user:
    • nice command can be used when a process is started
    • renice command can be used for a running process
    • Default nice value is 20 (foreground), 24 (background)


Other methods to change priority

The kernel scheduler can change the priorities over time through its scheduling algorithms. The Workload Manager can also change the priorities of processes and threads in order to fit the requirements of the workload classes.

Nice value

The nice value is a priority adjustment factor used by the system to calculate the current priority of a running process. The nice value is added to the base user priority of 40 for non-fixed priority threads and is irrelevant for fixed priority threads. The nice value of a thread is set when the thread is created and is constant over the life of the thread unless changed with a system call or the renice command.

You can use the ps command with the -l flag to view a command's nice value. The nice value appears under the NI heading in the ps command output. If the nice value in ps is --, the process is running at a fixed priority.

The default nice value is 20 and therefore an effective priority of 60. This is because the nice value is added to the user base priority value of 40.

Some shells (such as ksh) will automatically add a nice value of 4 to the default nice value if a process is started in the background (using &). For example, if you executed program & from a ksh, this program will automatically be started with a nice value of 24. For example, if a program was preceded by a nice command such as the following, it will be started with a nice value of 34:

nice -n 10 program &
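To verify the result, something like the following can be used (myprog is a hypothetical program name); the NI column of the ps output shows 34:

# nice -n 10 myprog &
# ps -el | grep myprog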

The at command automatically adds a nice value of 2 to the programs it executes.

With the use of multiple processor run queues and their load balancing mechanism, nice or renice values might not have the expected effect on thread priorities because less favored priorities might have equal or greater run time than favored priorities. Threads requiring the expected effects of nice or renice should be placed on the global run queue.


Figure 3-8. nice/renice examples AN512.0

Notes:

nice command

The nice command lets you run a command at a priority lower (or higher) than the command's normal priority.

The syntax of the nice command is:

nice [ - Increment| -n Increment ] Command [ Argument ... ]

The Command parameter is the name of any executable file on the system. For the Increment, you can specify a positive or negative number. Positive increment values reduce priority. Negative increment values increase priority. Only users with root authority can specify a negative increment. If you do not specify an Increment value, the nice command defaults to an increment of 10.

The nice value can range from 0 to 39, with 39 being the lowest priority. For example, if a command normally runs at a priority of 20, specifying an increment of 10 runs the command at a lower priority, 30, and the command will probably run slower.


nice/renice examples

• nice examples:

  Command           Action                               Relative priority
  nice -10 foo      Add 10 to current nice value         Lower priority (disfavored)
  nice -n 10 foo    Add 10 to current nice value         Lower priority (disfavored)
  nice --10 foo     Subtract 10 from current nice value  Higher priority (favored)
  nice -n -10 foo   Subtract 10 from current nice value  Higher priority (favored)

• renice examples:

  Command               Action                               Relative priority
  renice 10 -p 563      Add 10 to default nice value         Lower priority (disfavored)
  renice -n 10 -p 563   Add 10 to current nice value         Lower priority (disfavored)
  renice -10 -p 563     Subtract 10 from default nice value  Higher priority (favored)
  renice -n -10 -p 563  Subtract 10 from current nice value  Higher priority (favored)


The nice command does not return an error message if you attempt to increase a command's priority without the appropriate authority. Instead, the command's priority is not changed, and the system starts the command as it normally would. Specifying a nice value larger than the maximum allowed by nice causes the effective nice value to be the maximum value allowed by nice.

Examples:

Command           Action                               Relative priority
nice -10 foo      Add 10 to current nice value         Lower priority (disfavored)
nice -n 10 foo    Add 10 to current nice value         Lower priority (disfavored)
nice --10 foo     Subtract 10 from current nice value  Higher priority (favored)
nice -n -10 foo   Subtract 10 from current nice value  Higher priority (favored)

renice command

The renice command alters the nice value of a specific process, all processes with a specific user ID, or all processes with a specific group ID.

The syntax of the renice command is:

renice [[-n Increment] | Increment]] [-g|-p|-u] ID...

If you do not have root user authority, you can only reset the priority of processes you own and can only increase their priority within the range of 0 to 20, with 20 being the lowest priority. If you have root user authority, you can alter the priority of any process and set the increment to any value in the range -20 to 20. The specified Increment changes the priority of a process in the following ways:

1 to 20 Runs the specified process with worse priority than the base priority

0 Sets priority of the specified processes to the base scheduling priority

-20 to -1 Runs the specified processes with better priority than the base priority

The way the increment value is used depends on whether the -n flag is specified. If -n is specified, then the increment value is added to the current nice value. If the -n flag is not specified, then the increment value is added to the default value of 20 to get the effective nice value.

Nice values are reduced by using negative increment values and increased by using positive increment values.

Examples:

Command                Action                                         Relative Priority
renice 10 -p 5632      Add 10 to the default nice value (20)         Lower priority (disfavored)
renice -n 10 -p 5632   Add 10 to current nice value                  Lower priority (disfavored)
renice -10 -p 5632     Subtract 10 from the default nice value (20)  Higher priority (favored)
renice -n -10 -p 5632  Subtract 10 from current nice value           Higher priority (favored)


Figure 3-9. Viewing process and thread priorities AN512.0

Notes:

Viewing process priorities

To view the process priorities of all processes, simply run the command: ps -el.

To view the process priorities of all processes including kernel processes: ps -elk.

The -L <PIDlist> option generates a list of the descendants of each PID that has been passed to it in the PIDlist variable. The descendants of all of the given PIDs are printed in the order in which they appear in the process table.

The priority is listed under the PRI column. If the value under NI is --, this indicates that it is a fixed priority process.

The processes in the visual above with a PRI of 16 are the most important: swapper, lrud, and wlmsched. Notice the process with the least important priority: wait.


Viewing process and thread priorities

$ ps -ekl
     F S UID    PID  PPID   C PRI NI ADDR       SZ WCHAN            TTY  TIME CMD
   303 A   0      0     0 120  16 -- 19004190  384                  -    1:08 swapper
200003 A   0      1     0   0  60 20 21298480  720                  -    0:04 init
   303 A   0   8196     0   0 255 -- 1d006190  384                  -   10:08 wait
   303 A   0  12294     0   2  17 -- 1008190   448                  -    0:00 sched
   303 A   0  16392     0   0  16 -- 500a190   704 f100080009791c70 -    0:17 lrud

$ ps -L 483478 -l
     F S UID    PID   PPID C PRI NI ADDR       SZ WCHAN            TTY   TIME CMD
200001 A   0 295148 483478 0  68 24 177a5480  176 f100060006358fb0 pts/1 0:00 sleep
200001 A   0 438352 483478 0  68 24 13847480  176 f1000600063589b0 pts/1 0:00 sleep
200001 A   0 442538 483478 0  60 20 2b7db480  740                  pts/1 0:00 ps
240005 A   0 483478 356546 0  60 20 1d840480  836                  pts/1 0:00 ksh

$ ps -kmo THREAD -p 16392
USER   PID  PPID   TID ST CP PRI SC WCHAN            F    TT BND COMMAND
root 16392     0     - A   0  16  4 f100080009791c70 303   -   - lrud
   -     -     - 16393 S   0  16  1 f100080009791c70 1004  -   - -
   -     -     - 45079 S   0  16  1 -                1004  -   - -
   -     -     - 49177 S   0  16  1 -                1004  -   - -
   -     -     - 53275 S   0  16  1 -                1004  -   - -


CPU usage column

Another column in the ps output is important: the C, or CPU usage, column. This represents the CPU utilization of a process or thread, incremented each time the system clock ticks and the process or thread is found to be running. How this value is used to calculate a process's ongoing priority is covered in a few pages.

Viewing thread priorities

To view the thread priorities of all threads, simply run the command: ps -emo THREAD

To view the thread priorities of all threads including kernel threads: ps -ekmo THREAD

To view the thread priorities of all threads within a specific process: ps -mo THREAD -p <PID>

The priority is listed under the PRI column.

Process IDs are even numbers and thread IDs are odd numbers.

CPU usage column

The CP or CPU usage column in the visual above is the same as the C column on the last visual. It represents the CPU utilization of the thread, incremented each time the system clock ticks and the thread is found to be running. How this value is used to calculate a thread's ongoing priority is covered in a few pages.


Figure 3-10. Boosting an important process with nice AN512.0

Notes:

The visual shows a graph generated by applying the AIX scheduling algorithm to two threads with different nice numbers; one had a default foreground nice of 20 while the other had a preferred nice of zero.

The preferred thread runs without interference from the other thread until the CPU usage penalty raises the preferred thread's PRI value to equal that of the other thread. At that point, they start taking turns executing until the one-second timer expires. Once a second, the CPU usage statistics are adjusted by the sched_D factor. By default, this reduces the CPU usage by half. This, in turn, reduces the penalty. At the beginning of the next one-second interval, the preferred thread once again has an advantage and runs without interference from the other thread. But this time it takes fewer ticks for the CPU usage penalty to raise the running thread's PRI value to match the other thread's. Once again, they take turns until the one-second period ends.

How quickly the penalty accumulates is affected by the sched_R tunable.

Tuning the sched_R and sched_D tunables will affect how long a running thread maintains its initial priority advantage and how much of the CPU penalty is forgiven each second.


[The visual, "Boosting an important process with nice: Reducing nice to zero," plots priority values (roughly 35 to 90) against clock ticks (0 to 700) for two threads, one with nice=0 and one with nice=20. An upward-sloping line indicates an increasing CPU usage penalty, and thus that the thread is getting cycles. The angle of the slope is affected by sched_R; the amount the penalty drops each second is affected by sched_D.]


Figure 3-11. Usage penalty and decay rates AN512.0

Notes:

Overview

As non-fixed priority threads accumulate CPU ticks, their priorities will worsen so that a system of fairness is enforced. Threads that are new or have not run recently can obtain the CPU before threads that are dominating the CPU. This system of fairness is implemented by using the priority, the nice value, the CPU usage of the thread, and some tunable kernel parameters (sched_R and sched_D). These parameters are represented as R and D in the visual and in the following text. The CPU usage value can be seen in the output of ps -ekmo THREAD as seen on the last visual.

As the units of CPU time increase, the priority decreases (the PRI value increases). You can give additional control over the priority calculation by setting new values for R and D.


Usage penalty and decay rates

• Rate at which a thread is penalized is proportional to:
  – CPU usage (incremented when the thread is running at a clock tick)
  – The CPU-penalty-to-recent-CPU-usage ratio (R/32; the default R value is 16)
• Rate at which CPU usage is decayed (once per second): CPU usage * D/32 (the default D value is 16)
• Tuning the penalty rate: schedo -o sched_R=value
  – Increasing it will magnify the penalty for dominant threads
  – Decreasing it allows a dominant thread to run longer (R=0: no penalty)
• Tuning the decay rate: schedo -o sched_D=value
  – Increasing it will decay less (D=32, a ratio of 1: no decay at all)
  – Decreasing it will decay more (D=0: zeros out the usage each second)
• Remember: This affects all threads (global in impact)


Priority calculation process

The details of the formula are less important than understanding that there is a penalty for CPU usage, and that penalty has more impact if the nice value is greater than 20. In fact, the impact of the penalty is proportional to how far the nice value deviates from the default value of 20.

Here is the actual priority value formula:

Priority = x_nice + (Current CPU ticks * R/32 * ((x_nice + 4)/64))

Where:

p_nice = nice value + base priority

If p_nice > 60

then x_nice = (p_nice * 2) - 60

else x_nice = p_nice
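As a worked example: for a thread with the default nice value of 20 and a base priority of 40, p_nice = 60, so x_nice = 60. After the thread accumulates 100 CPU ticks with R at its default of 16, Priority = 60 + (100 * 16/32 * (60 + 4)/64) = 60 + 50 = 110. The following minimal ksh sketch (our own illustration, not an AIX utility; it assumes the standard base priority of 40) reproduces the same integer arithmetic:

compute_prio() {
    # usage: compute_prio <nice> <cpu_ticks> <R>
    typeset nice=$1 ticks=$2 R=$3
    typeset p_nice=$(( nice + 40 ))       # base priority of 40 assumed
    typeset x_nice
    if (( p_nice > 60 )); then
        x_nice=$(( p_nice * 2 - 60 ))
    else
        x_nice=$p_nice
    fi
    # penalty = ticks * R/32, scaled by (x_nice + 4)/64
    print $(( x_nice + (ticks * R / 32) * (x_nice + 4) / 64 ))
}

compute_prio 20 100 16    # prints 110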

CPU penalty

The CPU penalty is calculated by multiplying the CPU usage by the CPU-penalty-to-recent-CPU-usage ratio. This is represented by R/32. The default value of R is 16, so by default the CPU penalty will be the CPU usage times a ratio of 1/2.

The first part of the priority value formula (Current CPU ticks * R/32) represents the penalty part of the calculation.

The CPU usage value of a given thread is incremented by 1 each time that thread is in control of the CPU when the timer interrupt occurs (every 10 milliseconds). Its initial value is 0. Priority is calculated on a per thread basis. To see a thread’s CPU usage penalty, use the ps -emo THREAD command and look at the CP column. The priority is shown in the PRI column.

The CPU usage value for a process is displayed as the C column in the ps command output. The maximum value of CPU usage is 120. Note that a process’s CPU usage can exceed 120 since it is the sum of the CPU usage of its threads.

Tuning the CPU-penalty-to-recent-CPU-usage factor

The CPU penalty ratio is expressed as R/32 where R is 16 by default and the values for R can range from 0 to 32.

This factor can be changed dynamically by a root user through the command schedo -o sched_R=<value>. Smaller values of R make the nice value a bigger factor in the equation (that is, the nice value has a bigger impact on the priority of the thread). This makes it easier for foreground processes to compete. Larger values of R make the nice value have less of an impact on the priority of the thread.
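For example, to halve the default penalty accumulation rate so that a favored thread keeps its initial advantage longer (the value 8 is purely illustrative, not a recommendation), then display and later restore the tunable:

# schedo -o sched_R=8
# schedo -o sched_R
# schedo -d sched_R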


Aging or decaying the CPU usage

As the CPU usage increases for a process or thread, the numerical priority also increases thus making its scheduling priority worse. Over time, a thread’s priority can get so bad that on a system with a lot of runnable threads, it may never get to run unless the thread’s priority is increased. The mechanism which allows the thread to eventually become more favored again is known as CPU aging (or the usage decay factor).

Once per second, a kernel process called swapper wakes up and ages the CPU usage for all threads in the system. It then recalculates priorities according to the algorithm described in the visual above.

The recent-CPU-usage-decay factor is expressed as D/32 where D is 16 by default. The values for D can range from 0 to 32. The formula for recalculation is as follows:

CPU usage = old_CPU_usage * D/32

Tuning the recent-CPU-usage-decay factor

You can have additional control over the priority calculation by setting new values for both R and D. Decreasing the D value enables foreground processes to avoid competition with background processes for a longer time. Higher values of D penalize CPU intensive threads more and can be useful in an environment which has a mix of interactive user threads and CPU-intensive batch job threads.

The default for D is 16 which decays short-term CPU usage by 1/2 (16/32) every second.

This factor can be changed dynamically by a root user through the command schedo -o sched_D=<value>.
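For example, an illustrative (not recommended) setting that retains only a quarter of the accumulated usage each second, forgiving CPU-intensive threads more quickly; -d restores the default:

# schedo -o sched_D=8
# schedo -d sched_D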


Figure 3-12. Priorities: What to do? AN512.0

Notes:

Overview

The visual gives some guidelines when tuning the priorities of processes.

In addition to the suggestions above, tuning the workload could be as easy as using the at, cron, or batch commands to schedule less important jobs for off-shift hours.
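For example (the script path is hypothetical), instead of renicing a low-importance report, it could simply be deferred until 10:00 p.m.:

# echo "/home/batch/nightly_report.sh" | at 2200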

Note: It is possible to use the renice command to make threads so unfavored that they will never be able to run.


Priorities: What to do?

• If CPU resources are already constrained, the setting of priorities can help allocate more CPU resources to the more important processes:
  – Decrease the priority value on the most important processes
  – Increase the priority value for the least important processes
• If the most important processes use a lot of CPU time, you could change the CPU usage priority decay rate:
  – Configure the CPU aging/decay (sched_D) and the CPU usage penalty (sched_R) options with schedo
• Consider using WLM or WPARs to manage CPU resources


Figure 3-13. AIX workload partitions (WPAR): Review AN512.0

Notes:

Introduction

Workload Partition (WPAR) is a software-based virtualization capability of AIX 6 that reduces the number of AIX operating system images that need to be maintained when consolidating multiple workloads on a single server. WPARs provide a way for clients to run multiple applications inside the same instance of an AIX operating system while providing security and administrative isolation between applications. WPARs complement logical partitions and can be used in conjunction with logical partitions if desired. WPARs can improve administrative efficiency by reducing the number of AIX operating system instances that must be maintained, can increase the overall utilization of systems by consolidating multiple workloads on a single system, and are designed to improve cost of ownership.

WPARs allow users to create multiple software-based partitions on top of a single AIX instance. This approach enables high levels of flexibility and capacity utilization for applications executing heterogeneous workloads, and simplifies patching and other operating system maintenance tasks.


AIX workload partitions (WPAR): Review

• WPARs reduce administration
  – By reducing the number of AIX images to maintain
• Each WPAR is isolated
  – Appears as a separate instance of AIX
  – Regulated share of system resources
  – May have unique network and file systems
  – Separate administrative and security domain
• WPARs can be relocated
  – Load balancing
  – Server maintenance

[The visual shows a single AIX 6 instance hosting five workload partitions: Application Server, Web Server, Billing, Test, and BI.]


WPARs provide unique partitioning values

• Smaller number of OS images to maintain

• Performance efficient partitioning through sharing of application text and kernel data and text

• Fine-grain partition resource controls

• Simple, lightweight, centralized partition administration

WPARs enable multiple instances of the same application to be deployed across partitions.

• Many WPARs running DB2, WebSphere, or Apache in the same AIX image.

• Different capability from other partitioning technologies.

• Greatly increases the ability to consolidate workloads because often the same application is used to provide different business services.

• Enables the consolidation of separate discrete workloads that require separate instances of databases or applications into a single system or LPAR.

• Reduces costs through optimized placement of work loads between systems to yield the best performance and resource utilization.

WPAR technology enables the consolidation of diverse workloads on a single server, increasing server utilization rates

• Hundreds of WPARs can be created, far exceeding the capability of other partitioning technologies.

• WPARs support fast provisioning and fast resource adjustments in response to both normal and unexpected demands. WPARs can be created and resource controls modified in seconds.

• WPAR resource controls enable the over-provisioning of resources. If a WPAR is below allocated levels, the unused allocation is automatically available to other WPARs.

• WPARs support the live migration of a partition in response to normal/unexpected demands.

• All of the above capabilities enable more consolidation on a single server or LPAR.

WPARs enable development, test, and production cycles of one workload to be placed on a single system.

• Different levels of applications (production1, production2, test1, test2) may be deployed in separate WPARs.

• Quick and easy roll out and roll back to production environments.

• Reduced costs through the sharing of hardware resources.

• Reduced costs through the sharing of software resources such as the operating system, databases, and tools.


A WPAR supports the control and the management of its resources: CPU, memory, and processes. That means that you can assign specific fractions of CPU and memory to each WPAR; this is done by WLM running on the partition.

Most resource controls are similar to those supported by the Workload Manager. You can specify shares_CPU, which is the number of processor shares available for a workload partition, or you can specify minimum and maximum percentages. The same is true for memory utilization. There are also WPAR limits for run-away situations (for example: total processes).

When you create a WPAR, a WLM class is created (having the same name as the WPAR). All processes running in the partition inherit this classification. You can see the statistics and classes using the wlmstat command, which has been enhanced to display WPAR statistics: wlmstat -@ 2 shows the WPAR classes. Note that you cannot use WLM inside the WPAR to manage its resources.


Figure 3-14. System WPAR and application WPAR AN512.0

Notes:

System workload partition

System workload partitions are autonomous virtual system environments with their own private root file systems, users and groups, login, network space, and administrative domain.

A system WPAR represents a partition within the operating system isolating runtime resources such as memory, CPU, user information, or file system to specific application processes. Each system WPAR has its own unique set of users, groups and network addresses. The systems administrator accesses the WPAR via the administrator console or via regular network tools such as telnet or ssh. Inter-process communication for a process in a WPAR is restricted to those processes in the same WPAR.

System workload partitions provide a complete virtualized OS environment, where multiple services and applications run. It takes longer to create a system WPAR compared to an application WPAR as it builds its file systems. The system WPAR is removed only when requested. It has its own root user, users, and groups, and own system services like inetd, cron, syslog, and so forth.


System WPAR and application WPAR

• System WPAR
  – Autonomous virtual system environment
    - Shared file systems (with the global environment): /usr and /opt
    - Private file systems for the WPAR's own use: /, /var, and /tmp
    - Unique set of users, groups, and network addresses
  – Can be accessed via:
    - Network protocols (for example: telnet or ssh)
    - Log in from the global environment using the clogin command
  – Can be stopped and restarted
• Application WPAR
  – Isolates an individual application
  – Lightweight; quick to create and remove
    - Created with the wparexec command
    - Removed when stopped
    - Stopped when the application finishes
  – Shares file systems and devices with the global environment
  – No user login capabilities


A system WPAR does not share writable file systems with other workload partitions or the global environment. It is integrated with the role-based access control (RBAC).

Application workload partition

• Normal WPAR except that there is no file system isolation

• Login not supported

• Internal mounts not supported

• Target: Lightweight process group for mobility

Application workload partitions do not provide the highly virtualized system environment offered by system workload partitions, rather they provide an environment for segregation of applications and their resources to enable checkpoint, restart and relocation at the application level.

The application WPAR represents a shell or an envelope around a specific application process or processes which leverage shared system resources. It is lightweight (that is, quick to create and remove, and does not take lots of resources) since it uses the global environment's file system and device resources. Once the application process or processes are finished, the WPAR is stopped. The user cannot log in to the application WPAR using telnet or ssh from the global environment. If you need to access the application in some way, this must be achieved by some application-provided mechanism. All file systems are shared with the global environment. If an application is using devices, it uses global environment devices.

The wparexec command builds and starts an application workload partition, or creates a specification file to simplify the creation of future application workload partitions.

An application workload partition is an isolated execution environment that might have its own network configuration and resource control profile. Although the partition shares the global environment file system space, the processes running therein are only visible to other processes in the same partition. This isolated environment allows process monitoring, gathering of resource, accounting, and auditing data for a predetermined cluster of applications.

The wparexec command invokes and monitors a single application within this isolated environment. The wparexec command returns synchronously with the return code of this tracked process only when all of the processes in the workload partition terminate. For example, if the tracked process creates a daemon and exits with the 0 return code, the wparexec command blocks until the daemon and all of its children terminate, and then exits with the 0 return code, regardless of the return code of the daemon or its children.
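For example (the WPAR name and command are arbitrary illustrations), the following starts an application WPAR around a single command, blocks until the command finishes, and then the WPAR is stopped and removed:

# wparexec -n apptest /usr/bin/sleep 60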


Figure 3-15. Target shares AN512.0

Notes:

Shares determine the target (or desired) amount of resource allocation that the WPARs are entitled to (calculated as a percentage of total system resource). The shares represent how much of a particular resource a WPAR should get, relative to the other active WPARs. Shares are not coded as the absolute percentages of the total system resources, but each of the share values indicates the relative proportion of the resource usage.

For example, in the upper graphic of this visual (all WPARs active), the total of the entitled shares is 60. The intended target for the W1 WPAR is therefore 1/6 (or 10/60), for the W2 WPAR it is 1/3 (or 20/60), and so forth.

If a WPAR is the only active WPAR, its target is 100% of the amount of resource available to the LPAR.

A WPAR’s target percentage for a particular resource is simply its number of shares divided by the total number of active shares.

If limits are also being used, the target is limited to the configured range [minimum, soft maximum]. If the calculated target is outside this range, it is set to the appropriate upper or lower bound (see Resource Limits).


Target shares

• Shares are a relative amount of resource entitlement
• Target percentage is calculated based on active shares only

[The visual shows WPAR share assignments of 10 (W1), 20 (W2), and 30 (W3). With all WPARs active, the targets are 16.6%, 33.3%, and 50%. With WPAR W3 inactive, W1 and W2 adjust to 33.3% and 66.6%.]


The number of active shares is the total number of shares of all WPARs that have at least one active process in them. Since the number of active shares is dynamic, so is the target.

Each share value can be between 1 and 65535.

By default, a WPAR's resource shares are not defined, which means that the WPAR does not have any WLM-based target percentage and (unless limits are defined) is effectively unregulated by WLM.

Shares are automatically self-adjusting percentages. For example, in the lower graphic of this visual, although the total of the entitled shares is 60, the actual sum of the active WPARs' shares is 30, since the W3 WPAR, with a share value of 30, is inactive. The adjusted proportions for the classes are then 1/3 (or 10/30) for W1 and 2/3 (or 20/30) for W2.

Within a WPAR, individual processes compete for the WPAR's resources using traditional AIX mechanisms. For example, with the CPU resource, the process priority value determines which thread is dispatched to a CPU, and the process nice number affects the initial priority of the process and its threads. These same mechanisms are used for resource contention between WPARs if no WLM-based resource usage control has been implemented in the WPAR definitions.


Figure 3-16. Limits AN512.0

Notes:

The resource allocation can also be controlled by limits. The WPAR resource limits define the maximum and the minimum amount of resource that can be allocated to a WPAR as a percentage of the total system resources.

Resource limits allow the administrator to have more control over resource allocation.

These limits are specified as percentages and are relative to the amount of resource available to the LPAR.

There are three types of limits for percentage-based regulation:

Minimum

This is the minimum amount of a resource that should be made available to the WPAR. If the actual WPAR consumption is below this value, the WPAR is given highest priority access to the resource. The possible values are 0 to 100, with 0 being the default (if unspecified).


Limits

• Maximum limits restrict resources
  – Soft maximum
  – Hard maximum
• Minimum limits guarantee resources
• Limits take precedence over shares

[The visual shows CPU and MEM bars marked, from bottom to top, with the minimum limit, the normal range, the soft maximum limit, and the hard maximum limit.]


Soft maximum

This is the maximum amount of a resource that a WPAR can consume when there is contention for that resource. If the WPAR consumption exceeds this value, the WPAR is given the lowest priority. If there is no contention for the resource (from other WPARs), the WPAR is allowed to consume as much as it wants. The possible values are 1 to 100, with 100 being the default (if unspecified).

Hard maximum

This is the maximum amount of a resource that a WPAR can consume, even when there is no contention. If the WPAR reaches this limit, it is not allowed to consume any more of the resource until its consumption percentage falls below the limit. The possible values are 1 to 100, with 100 being the default (if unspecified).

Class resource limits follow some basic rules.

• Resource limits take precedence over WPAR share values.

• The minimum limit must be less than or equal to the soft maximum limit.

• The soft maximum limit must be less than or equal to the hard maximum limit.

• The sum of the minimum limits of all WPARs cannot exceed 100.

These are the only constraints that WLM places on resource limit values.

When a WPAR with a hard memory limit has reached this limit and requests more pages, the VMM page replacement algorithm (LRU) is initiated and steals pages from the limited WPAR, thereby lowering its number of pages below the hard maximum, before handing out new pages. This behavior is correct, but extra paging activity, which can take place even where there are plenty of free pages available, impacts the general performance of the system. Minimum memory limits for other WPARs are recommended before imposing a hard memory maximum for any WPAR.

Since WPARs under their minimum have the highest priority, the sum of the minimums should be kept to a reasonable level, based on the resource requirements of the other WPARs.

For physical memory, setting a minimum memory limit provides some protection for the memory pages of the WPAR's processes. A WPAR should not have pages stolen when it is below its minimum limit unless all the active WPARs are below their minimum limit and one of them requests more pages. Setting a memory minimum limit for a WPAR with primarily interactive jobs helps make sure that their pages will not all have been stolen between consecutive interactions (even when memory is tight) and improves response time.

Attention: Using hard maximum limits can have a significant impact on system or application performance if not used appropriately. Since imposing hard limits can result in unused system resources, in most cases, soft maximum limits are more appropriate.


Figure 3-17. WPAR resource management AN512.0

Notes:

Resource controls can be established at WPAR creation or they can be modified later.

These attributes can be set either through the command line (as shown), through SMIT, or using the WPAR Manager GUI. Here are some common command line attribute keywords:

- active: Even if you have set resource controls, you can enable or disable their enforcement using this attribute.

- shares_CPU: This is the number of shares for calculating the target percentage for CPU usage.

- shares_memory: This is the number of shares used to calculate the target percentage for memory usage.

- CPU: This value has three fields (note use of semicolon to delimit last field):

• The first is the minimum percentage (default is 0%)

• The second is the soft maximum (default is 100%)

• The third is the hard maximum (default is 100%)


WPAR resource management

• Define at WPAR creation, or later change WPAR attributes:

  wparexec -R attribute=value
  mkwpar -R attribute=value
  chwpar -R attribute=value

• Common attribute keywords:
  – active={ yes | no }
  – shares_CPU=<number of shares>
  – shares_memory=<number of shares>
  – CPU=m%-SM%;HM% (default: 0%-100%;100%)
  – memory=m%-SM%;HM% (default: 0%-100%;100%)


- Memory: This value has three fields with the same format and defaults as CPU limits.
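Putting these attributes together, here is a sketch (the WPAR name and percentages are illustrative). Note that the semicolon in the CPU and memory values must be quoted so that the shell does not treat it as a command separator:

# mkwpar -n billing -R active=yes shares_CPU=20 CPU='10%-50%;80%'
# chwpar -R CPU='10%-60%;90%' billing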

WLM overview

WPARs use the AIX Workload Manager (WLM) mechanisms to control resource usage without having to understand the complexities of defining and configuring WLM classes. Each WPAR is treated as a special WLM class.

WLM gives system administrators more control over how the scheduler allocates resources to processes. Using WLM, you can prevent different classes of jobs from interfering with each other and you can allocate resources based on the requirements of different groups of users.

Typically, WLM is used on a system where the CPU resources are constrained or at least occasionally constrained. If there is an overabundance of CPU resources for all of the system’s workload, then prioritization of workload is not important unless there are other factors such as user support agreements that have specific resource requirements.

With WLM, you can create different classes of service for jobs, as well as specify attributes for those classes. These attributes specify minimum and maximum amounts of CPU, physical memory, and disk I/O throughput to be allocated to a class. WLM then assigns jobs automatically to classes using class assignment rules provided by a system administrator. These assignment rules are based on the values of a set of attributes for a process. The system administrator or a privileged user can also manually assign jobs to classes, overriding the automatic assignment.

Classes

A WLM class is a collection of processes and their associated threads. A class has a single set of resource-limitation values and target shares.

CPU resource control

WLM allows management of resources in two ways: as a percentage of available resources or as total resource usage. Threads of type SCHED_OTHER in a class can be controlled on a percentage basis. Fixed-priority threads are non-adjustable. Therefore, they cannot be altered, and they can exceed the processor usage target.

If processor time is the only resource that you are interested in regulating, you can choose to run WLM in active mode for processor and passive mode for all other resources. This mode is called cpu only mode.


Figure 3-18. wlmstat command syntax AN512.0

Notes:

The syntax options for the wlmstat command are:

• -c

- Shows only CPU statistics.

• -m

- Shows only physical memory statistics.

• -b

- Shows only disk I/O statistics.

• -B device

- Displays statistics for the given disk I/O device. Statistics for all the disks accessed by the class are displayed by passing an empty string (-B "").

• -T


wlmstat command syntax

• Syntax (adjusted for WPAR-relevant options):

  wlmstat -@ [-c | -m | -b] [-B device] [-T] [-w] [interval] [count]

• Output from wlmstat:

  # wlmstat -@ 3 2
  CLASS   CPU  MEM DKIO
  wpar11 0.02 9.37 0.00
  TOTAL  0.02 9.37 0.00

  CLASS   CPU  MEM DKIO
  wpar11 0.01 9.37 0.00
  TOTAL  0.01 9.37 0.00


- Returns the total numbers for resource utilization since each class was created (or WLM started). The units are:

• Number of CPU ticks per CPU (seconds) used by each class

• Number of memory pages multiplied by the number of seconds used by each class

• Number of 512 byte blocks sent/received by a class for all the disk devices accessed.

• -a

- Delivers absolute figures (relative to the total amount of the resource available to the whole system) for subclasses, with a 0.01 percent resolution. By default, the figures shown for subclasses are a percentage of the amount of the resource used by the superclass, with a 1 percent resolution. For instance, if a superclass has a CPU target of 7 percent and the CPU percentage shown by wlmstat without -a for a subclass is 5 percent, wlmstat with -a will show the CPU percentage for the subclass as 0.35 percent.

• -w

- Displays the memory high-water mark, that is, the maximum number of pages that a class had in memory since the class was created (or WLM started).

• -v

- Shows most of the attributes concerning the class. The output includes internal parameter values intended for AIX support personnel. Here is a list of some attributes which may be interesting for users:

- CLASS - Class name.

- tr - tier number from 0...9.

- i - Value of the inheritance attribute: 0 = no, 1 = yes.

- #pr - Number of processes in the class. If a class has no processes assigned to it, the value in the other columns may not be significant.

- CPU - CPU utilization of the class in percent.

- MEM - Physical memory utilization of the class in percent.

- DKIO - Disk I/O bandwidth utilization for the class in percent.

- sha - Number of shares. If no shares are defined ("-"), then sha = -1.

- min - Resource minimum limit in percent.

- smx - Resource soft maximum limit in percent.

- hmx - Resource hard maximum limit in percent.

- des - Desired percentage target calculated by WLM from the number of shares, in percent.

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

© Copyright IBM Corp. 2010 Unit 3. Monitoring, analyzing, and tuning CPU usage 3-39

Page 134: AN512STUD

Student Notebook

- npg - Number of memory pages owned by the class.

- The other columns are for internal use only and bear no meaning for the administrator or end users. This format is better suited for use with a resource selection (-c, -m, or -b); otherwise, the lines might be too long to fit on a display terminal.

• interval

- Specifies an interval in seconds (defaults to 1).

• count

- Specifies how many times wlmstat prints a report (defaults to 1).
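For example, staying within the syntax shown on the visual, the following reports only CPU statistics for the WPAR classes every 5 seconds, three times:

# wlmstat -@ -c 5 3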


Figure 3-19. Context switches AN512.0

Notes:

Overview

A context switch (also known as a process switch or thread switch) occurs when the thread being dispatched to a CPU is different from the thread that previously ran on that CPU. Context switches occur for various reasons. The most common reason is that a thread has used up its timeslice or has gone to sleep waiting on a resource (such as waiting on an I/O to complete or waiting on a lock), and another thread takes its place.

The context switch statistics are available through multiple tools including: sar, nmon, topas, and vmstat. For sar, the -w flag provides the context switch statistics.
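For example, to report the context switch rate with sar at 5-second intervals, three times:

# sar -w 5 3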

What to look for

High context switch rates may be an indication of a resource contention issue such as application or kernel lock contention.


Context switches

• A context switch is when one thread is taken off a CPU and another thread is dispatched onto the same CPU.
• Context switches are normal for multi-processing systems:
  – What is abnormal? Check against baseline
  – High context switch rate is often an indication of lock contention
• Use vmstat, sar, or topas to see context switches
• Example:

# vmstat 1 5

System configuration: lcpu=2 mem=1024MB ent=0.35

kthr    memory              page                    faults          cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b    avm   fre re pi po   fr    sr cy   in    sy    cs us sy id wa   pc    ec
 2  2 198332  8637  0  0  0 6064  6994  0 2708 43494 12767 10 78  3  9 0.34  97.5
 0  2 198337  8458  0  0  0 8159  8563  0 2800 24281 13703 10 80  2  8 0.37 106.8
 0  2 198337  8057  0  0  0 6757  5112  2 1217 12276  6283 10 69  3 17 0.29  83.6
 0  2 198337  8101  0  0  0 7869  7891  0  816 14836  4747  7 49  5 39 0.21  58.9
 0  2 198337  8097  0  0  0 6298 10914  1  617  8112  2654  6 42 23 29 0.18  50.1
 0  2 198337  8059  0  0  0 7104  8946  0  886  6440  3952  9 47 18 26 0.21  59.3


The rate is given in switches per second. It’s not uncommon to see the context switch rate be approximately the same as the device interrupt rate (the in column in vmstat).

The scheduler performs a context switch when:

- A thread has to wait for a resource (voluntarily)

- A “higher priority” thread wakes up (involuntarily)

- The thread has used up its timeslice (10 ms by default)

vmstat and the initial interval report line

In AIX 5L V5.3 and later, vmstat displays a system configuration line, which appears as the first line displayed after the command is invoked. Following the configuration information, the vmstat command only reports current intervals. As a result, the first line of output is not written until the end of the first interval, and it is meaningful.

Prior to AIX 5L V5.3, when running vmstat in interval mode, the first interval of the report provided statistics accumulated since the boot of the operating system. As such, it did not represent the current problem situation, since it was diluted by a long prior period of normal operation. As a result, administrators would ignore this non-meaningful first interval, and many scripts would filter out the first period reported by vmstat.


Figure 3-20. User mode versus system mode AN512.0

Notes:

Modes overview

User time is simply the percentage of time the CPUs spend executing code in the applications or shared libraries. System time is the percentage of time the CPUs execute kernel code. System time can accumulate because applications execute system calls (which take them into the kernel), because kernel threads that execute only in kernel mode are running, or because interrupt handler code is being run. When using monitoring tools, add the user and the system CPU utilization percentages together to see the total CPU utilization.

The use of a system call by a user mode process allows a kernel function to be called from user mode. This is considered a mode switch. Mode switching is when a thread switches from user mode to kernel or system mode. Switching from user to system mode and back again is normal for applications. System mode does not just represent operating system housekeeping functions.


User mode versus system mode

• User mode:
  – User mode is when a thread is executing its own application code or shared library code
  – Time spent in user mode is reflected as %user time in the output of commands such as vmstat, topas, iostat, and sar
• System mode:
  – System mode is when the CPU is executing code in the kernel
  – CPU time spent in kernel mode is reflected as system time in the output of commands such as vmstat, topas, iostat, and sar
  – Context switch time, system calls, device interrupts, NFS I/O, and anything else in the kernel is counted as system time


Mode switches should be differentiated from the context switches seen in the output of vmstat (cs column) and sar (cswch/s).


Uempty

Figure 3-21. Timing commands AN512.0

Notes:

Timing commands

Use the timing commands to understand the performance characteristics of a single program and its synchronous children. The output from /usr/bin/time and timex are in seconds. The output of the Korn shell’s built-in time command is in minutes and seconds. The C shell’s built-in time command is in yet another format.

The output of /usr/bin/timex with no parameters is identical to that of /usr/bin/time. However, with additional parameters /usr/bin/timex is capable of displaying process accounting data for the command and its children. The -p and -s options on timex allow data from accounting (-p) and sar (-s) to be accessed and reported. A -o option reports on blocks read or written.
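For example, the following times a simple copy and appends sar-style activity data for the measured interval (the file names are arbitrary):

# timex -s cp /etc/hosts /tmp/hosts.copy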

The timex command is available through SMIT on the Analysis Tools menu, found under Performance and Resource Scheduling.


Timing commands

• Time commands show:
  – Elapsed time
  – CPU time spent in user mode
  – CPU time spent in system mode

# /usr/bin/time <command> <command arguments>
real 9.30
user 3.10
sys  1.20

# /usr/bin/timex <command> <arguments>
real 26.08
user 26.02
sys   0.06

# time <command> <arguments>
real 0m10.07s
user 0m3.00s
sys  0m2.07s


If you do not invoke time with the full path, then you could be executing your shell's built-in time command, so its output could be in a different format than that of /usr/bin/time. Since the ksh and csh built-in time commands have less overhead (saving a fork/exec of a separate time command), it is preferable to use the built-in time commands.

Interpreting the output

Comparing the user+sys CPU time to the real time may give you an idea of whether the application is CPU bound or I/O bound. The difference between the real time and the sum of user+sys is how much time the application spent sleeping (either waiting on I/O, for locks, or for some other resource like the CPU). The sum of user+sys may exceed the real time if a process is multi-threaded, because the real time is the time from start to finish of the process, but user+sys is the sum of the CPU time of each of its threads.


Figure 3-22. Monitoring CPU usage with vmstat . AN512.0

Notes:

Overview

Using vmstat with intervals during the execution of a workload will provide information on paging space activity, real memory use, and CPU utilization. vmstat data can be retrieved from the PerfPMR monitor.int file.

A vmstat -t flag will cause the report to show timestamps. An example is:

# vmstat -t 5 3

System configuration: lcpu=2 mem=512MB

kthr    memory            page                  faults       cpu        time
----- ------------- --------------------- ----------- ---------- ------
 r  b   avm    fre re pi po fr sr cy  in   sy  cs us sy id wa hr mi se
 1  1 62247 845162  0  0  0  0  0  0 327 9511 401  0  1 98  0 22:31:35
 8  0 62254 845155  0  0  0  0  0  0 329  811 633 99  0  0  0 22:31:40
 8  0 62353 845056  0  0  0  0  0  0 331 1387 637 99  0  0  0 22:31:45


Monitoring CPU usage with vmstat

• Runnable threads (r) shows the total number of runnable threads:
  – A high number could simply mean your system is efficiently running lots of threads; compare it to the size of the lcpu count
  – If the high number is abnormal, look at what processes are running and whether total CPU utilization is higher than normal
• If us + sy approaches 100%, then there may be a system CPU bottleneck:
  – Compare interrupt, system call, and context switch rates to baseline
  – Identify code that is dominating CPU usage

# vmstat 5 3   (dedicated processor LPAR)

System configuration: lcpu=2 mem=512MB

kthr    memory              page                   faults          cpu
----- ------------- ---------------------- --------------- ------------
 r  b    avm    fre re pi po fr sr cy   in    sy   cs us sy id wa
19  2 127005 758755  0  0  0  0  0  0 1692 10464 1070 48 52  0  0
19  2 127096 758662  0  0  0  0  0  0 1397 71452 1059 28 72  0  0
19  2 127100 758656  0  0  0  0  0  0 1361 72624 1001 28 72  0  0


CPU related information

Pertinent vmstat column headings and their descriptions for CPU usage are:

r - Average number of kernel threads runnable during the interval

b - Average number of kernel threads placed in the wait queue (waiting for I/O)

in - Device interrupts per second

sy - System calls per second

cs - Context switches per second

us - % of CPU time spent in user mode

sy - % of CPU time spent in system mode

id - % of time CPUs were idle

wa - % of time CPUs were idle and there was at least one I/O in progress

What to look for

If the user time (us) is abnormally high, then application profiling may need to be done.

If the system time (sy) is abnormally high, then kernel profiling (trace and/or tprof) may need to be done.

If idle (id) or wait time (wa) is high, then you must determine if that is to be expected or not.

Use vmstat -I to also see file in and file out rates which can show how quickly free pages are being used and give some idea of the demand for free pages.
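For example:

# vmstat -I 5 3

The -I report adds the fi and fo columns (file page-ins and page-outs per second) to the output.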

Since the r column stands for the number of runnable threads, in order to determine what a high value is, you must know the number of CPUs. An r value of 2 on a 12-way system means that 10 of the CPUs are probably idle. If r divided by the number of CPUs is very large, it means threads are waiting for CPUs. This is not necessarily bad if the performance goals are being met and the system is running the threads quickly. It is important not to get too concerned about the actual number of runnable threads; it is a statistic that, by itself, does not necessarily point to a bottleneck. Remember that the count of runnable threads includes both the currently running threads and the threads waiting in the dispatch queue.


Figure 3-23. sar command AN512.0

Notes:

Introduction

The sar command is the System Activity Report tool and is standard for UNIX systems. The sar command can collect data in real-time and postprocess the data in real-time or after the fact. sar data can be retrieved from the PerfPMR monitor.int file.

sar -P command

The syntax of the sar command using the -P flag is:

sar [-P CPUID [,...] | ALL] <Interval> <Count>

If the -P flag is given, the sar command reports activity which relates to the specified processor or processors. If -P ALL is given, the sar command reports statistics for each individual processor, followed by system-wide statistics in the row that starts with the hyphen. Without the -P flag, the sar command reports system-wide (global among all processors) statistics, which are calculated as averages for values expressed as percentages or sums.


sar command

• Reports system activity information from selected cumulative activity counters (the example system has 4 processors, SMT off)

# sar -P ALL 5 1

System configuration: lcpu=4

15:01:19 cpu %usr %sys %wio %idle
15:01:24   0    0    2    0    98
           1    0    5    0    95
           2  100    0    0     0
           3  100    0    0     0
           -   50    2    0    48

# sar -q 5 3

System configuration: lcpu=4

19:31:42 runq-sz %runocc swpq-sz %swpocc
19:31:47     1.0     100     1.0     100
19:31:52     2.0     100     1.0     100
19:31:57     1.0     100     1.0     100

Average      1.3      95     1.0      95


sar -q command

sar -q reports queue statistics. The following values are displayed:

- runq-sz: Reports the average number of kernel threads in the run queue

- %runocc: Reports the percentage of the time the run queue is occupied (this field is subject to error)

- swpq-sz: Reports the average number of kernel threads waiting to be paged in

- %swpocc: Reports the percentage of the time the swap queue is occupied (this field is subject to error)

A blank value in any column indicates that the associated queue is empty.

The -q option can indicate whether you just have many jobs running (runq-sz) or have a potential paging bottleneck. If paging is the problem, run vmstat. Large swap queue lengths indicate significant competing disk activity or a lot of paging due to insufficient memory.

A large number of runnable threads does not necessarily indicate a CPU bottleneck. If the performance goals are being met and the system is running the threads quickly, then it does not matter if this number seems high.

sar initial interval report line

In AIX 5L V5.3 and later, sar displays a system configuration line, which appears as the first line displayed after the command is invoked. If a configuration change is detected during a command execution iteration, a warning line will be displayed before the data, followed by a new configuration line and the header.

The first interval of the command output is now meaningful and does not represent statistics collected from system boot. Internal to the command, the first interval is never displayed, and therefore there may be a slightly longer wait for the first displayed interval to appear. Scripts that discard the first interval should function as before.

The topas command

The topas command output is a convenient way to see many different system statistics in one view. It will display top processes by CPU-usage, the CPU usage statistics, context switches (Cswitch), the run queue value, and the wait queue value. Once topas has started, press lowercase c twice to see per-CPU statistics.


Figure 3-24. Locating dominant processes AN512.0

Notes:

Overview

To locate the processes dominating CPU usage, there are tools such as the standard ps and the AIX-specific tool, tprof.

Using the ps command

The ps command, run periodically, will display the CPU time under the TIME column and the ratio of CPU time to real time under the % CPU column. Keep in mind that the CPU usage shown is the average CPU utilization of the process since it was first created. Therefore, if a process consumes 100% of the CPU for five seconds and then sleeps for the next five seconds, the ps report at the end of ten seconds would report 50% CPU time. This can be misleading because right now the process is not actually using any CPU time.


Locating dominant processes

• What processes are currently using the most CPU time?
  – Run the ps command periodically:

    # ps aux
    USER   PID   %CPU %MEM   SZ  RSS TTY    STAT STIME     TIME COMMAND
    root   31996 15.4  0.0  188  468 pts/12 A    10:41:31  0:04 -ksh
    user3  36334  3.0  0.0  320  456 pts/19 A    10:40:50  0:02 tstprog
    user2  47864  1.4  3.0 2576 5676 pts/23 A    08:41:16  1:40 /usr/sbin/re
    user5  63658  0.2  3.0 2036 5120 pts/23 A    09:18:11  0:11 /usr/bin/dd
    user1  35108  0.2  4.0 4148 6584 pts/17 A    Jul 26   16:24 looper
    root   60020  0.1  0.0  324  680 pts/14 A    Jul 26   16:24 looper

  – Run tprof over a time period: # tprof -x sleep 60
  – Use other tools such as topas
• The problem may not be one or a few processes dominating the CPU; it could be the sum of many processes


The example on the visual uses the ps aux flags which will display:

- a   Information about all processes with terminals
- u   User-oriented information
- x   Processes without a controlling terminal in addition to processes with a controlling terminal

Thread-related information can also be shown. The -m flag displays threads associated with processes using extra lines. You must use the -o flag with the THREAD field specifier to display extra thread-related columns. Examples are:

# ps [-e][-k] -mo THREAD [-p <pid>]    # thread is not bound
USER  PID   PPID  TID   ST CP PRI SC WCHAN F      TT    BND COMMAND
root  20918 20660 -     A  0  60  1  -     240001 pts/1 -   -ksh
-     -     -     20005 S  0  60  1  -     400    -     -   -

# ps [-e][-k] -mo THREAD [-p <pid>] # thread is bound

USER  PID   PPID  TID   ST CP PRI SC WCHAN F      TT    BND COMMAND
root  21192 20918 -     A  86 64  1  8b06c 200001 pts/1 0   bndprog
-     -     -     20279 S  0  64  1  8b06c 420    -     0   -

It may be very common to see a kproc using CPU time. When there are no threads that are runnable for a time slice, the scheduler assigns the CPU time for that time slice to this kproc, which is known as the idle or wait kproc. SMP systems will have an idle kproc for each processor.

A more accurate way of gauging CPU usage is with the tprof command.

Using the tprof command

The ps command takes a snapshot. To gather data over a time period, use tprof. It can be used to locate the CPU-dominant processes, and then allow you to further analyze which portion of a particular program is using the most CPU time. The -x option specifies a program to execute at the start of the trace period; when the program stops, the trace stops. While this can be used to measure a particular program execution, in most cases it is simply used to control the trace period. For this purpose the sleep command works well. For example, to monitor the system for 5 minutes, give the -x option a sleep command with an argument of 300 seconds (-x sleep 300). After this period is completed, tprof will generate a file called sleep.prof (AIX 5L V5.2 and later) or _prof.all (prior to AIX 5L V5.2), which will show the most dominant processes listed in order of the highest CPU percentage (starting with AIX 5L V5.2) or using the most CPU ticks (before AIX 5L V5.2).
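A minimal profiling session based on the description above might look like the following (the 300-second interval is arbitrary; sleep.prof is the report name on AIX 5L V5.2 and later):

# tprof -x sleep 300
# more sleep.prof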


Figure 3-25. tprof output AN512.0

Notes:

Overview

This output lists the processes/threads that were running when the clock interrupt occurred. tprof uses the trace facility to record the instruction address register value whenever the clock interrupt occurs (every 10 ms). The report lists processes in descending order of CPU usage.

The file generated will be command.prof where command is the command given with the -x flag.

Report format

The top part of the report contains a summary of all the processes on the system. This is useful for characterizing CPU usage of a system according to process names when there are multiple copies of a program running. The second part of the report shows each thread that executed during the monitoring period.


tprof output
Process           Freq  Total Kernel  User Shared Other
=======           ====  ===== ======  ==== ====== =====
cpuprog              1  50.29  47.77  0.35   2.17  0.00
wait                 2  49.50  49.50  0.00   0.00  0.00
/usr/sbin/syncd      1   0.14   0.14  0.00   0.00  0.00
/usr/bin/tprof       1   0.02   0.00  0.00   0.02  0.00
/usr/bin/trcstop     1   0.02   0.02  0.00   0.00  0.00
/usr/bin/sleep       1   0.01   0.01  0.00   0.00  0.00
IBM.ERrmd            1   0.01   0.01  0.00   0.00  0.00
rmcd                 1   0.01   0.01  0.00   0.00  0.00
=======           ====  ===== ======  ==== ====== =====
Total                9 100.00  97.46  0.35   2.20  0.00

Process            PID   TID   Total Kernel  User Shared Other
=======            ===   ===   ===== ======  ==== ====== =====
cpuprog          16378 33133   50.29  47.77  0.35   2.17  0.00
wait               516   517   44.75  44.75  0.00   0.00  0.00
wait               774   775    4.75   4.75  0.00   0.00  0.00
/usr/sbin/syncd   6200  8257    0.14   0.14  0.00   0.00  0.00
/usr/bin/tprof   15306 32051    0.02   0.00  0.00   0.02  0.00
/usr/bin/trcstop 14652 32975    0.02   0.02  0.00   0.00  0.00
IBM.ERrmd         9922 24265    0.01   0.01  0.00   0.00  0.00
rmcd              6718  8009    0.01   0.01  0.00   0.00  0.00
/usr/bin/sleep   14650 32973    0.01   0.01  0.00   0.00  0.00
=======            ===   ===   ===== ======  ==== ====== =====
Total                          100.00  97.46  0.35   2.20  0.00

Total Samples = 12381
Total Elapsed Time = 61.90s


Figure 3-26. What is simultaneous multi-threading? AN512.0

Notes:

Introduction

Simultaneous multi-threading is the ability of a single physical processor to concurrently dispatch instructions from more than one hardware thread. There are multiple hardware threads per processor. Instructions from any of the threads can be fetched by the processor in a given cycle.

The number of hardware threads per processor depends upon the version of the processor chip. The POWER5 and POWER6 chips support two hardware threads per core. The POWER7 chips support four hardware threads per core.

Simultaneous multi-threading also allows instructions from one thread to utilize all the execution units if the other thread encounters a long latency event. For instance, when one of the threads has a cache miss, another thread can continue to execute.

Each hardware thread is supported as a separate logical processor by the operating system. So, a system with one physical processor is configured by AIX as a logical two-way.


What is simultaneous multi-threading?
• Multiple hardware threads can run on one physical processor at the same time.
  – A processor appears as two or four logical CPUs (lcpu) to AIX.
  – Beneficial for most commercial environments
  – Computing intensive applications often do not benefit
  – SMT affects how we read performance statistic reports.
• SMT is enabled by default with max number of logical CPUs
  – Can change between SMT2 and SMT4 (POWER7 only)
  – Can disable or enable
    • smtctl -m {on|off}
    • smtctl -t #SMT

(Diagram: at the physical layer, one physical CPU with Hardware Thread0 and Hardware Thread1; at the AIX layer, these appear as Logical CPU0 and Logical CPU1.)


For POWER5 or POWER6-based systems with N physical processors, with SMT enabled, many performance tools will report N*2 logical processors. With POWER7-based systems, many performance tools will report N*4 logical processors.

Simultaneous multi-threading is enabled by default on POWER5-based and later systems, and some monitoring tools in this environment will show two or four times as many processors as physical processors in the system. For example, sar -P ALL will show the logical processors in this configuration, which will be two or four times the number of physical processors installed.

For most commercial environments, simultaneous multi-threading can be slightly to greatly beneficial to performance. For a specific workload environment, test performance with it enabled and compare it to when it is disabled to see if simultaneous multi-threading will be a benefit. Highly compute-intensive environments may not see a gain, and in fact could see a slight degradation in performance, particularly with workloads where multiple threads are competing for the same CPU execution units.
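A hedged sketch of such a comparison on a test partition, using the smtctl syntax described in the next section (your_workload is a hypothetical stand-in for your own benchmark):

# smtctl -m off -w now
# time ./your_workload
# smtctl -m on -w now
# time ./your_workload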

Modifying simultaneous multi-threading with the smtctl command

The smtctl command provides privileged users and applications the ability to control utilization of processors with simultaneous multi-threading support. With this command, you can enable or disable simultaneous multi-threading system-wide, either immediately or the next time the system boots.

The smtctl command syntax is:

smtctl [ -m off | on [ -w boot | now ]]

smtctl [ -t #SMT [ -w boot | now ]]

where:

-m off Sets simultaneous multi-threading mode to disabled.

-m on Sets simultaneous multi-threading mode to enabled.

-t #SMT Sets the number of simultaneous threads per processor

-w boot Makes the simultaneous multi-threading mode change effective on the next and subsequent reboots. (You must run the bosboot command before the next system reboot).

-w now Makes the simultaneous multi-threading mode change immediately but will not persist across reboot.

If neither the -w boot nor the -w now option is specified, then the mode change is made both now and when the system is rebooted.

Note, the smtctl command does not rebuild the boot image. If you want your change to persist across reboots, the bosboot -a command must be used to rebuild the boot image. The boot image has been extended to include an indicator that controls the default simultaneous multi-threading mode.
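For example, to disable simultaneous multi-threading persistently (a sketch based on the syntax above; test before using in production), the mode change and the boot image rebuild can be combined:

# smtctl -m off -w boot
# bosboot -a

The change then takes effect at the next reboot.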


Issuing the smtctl command with no options will display the current simultaneous multi-threading settings.

Modifying simultaneous multi-threading (SMT) with SMIT

Start the smit command with no options, and then use the following menu path to get to the main simultaneous multi-threading panel: Performance & Resource Scheduling -> Simultaneous Multi-Threading Processor Mode.

The fastpath to this screen is smitty smt.

There are two options on this screen:

List SMT Mode Settings
Change SMT Mode

The Change SMT Mode screen gives the following options:

SMT Mode

Options are: enable and disable

SMT Change Effective:

Options are: Now and subsequent boots, Now, and Only on subsequent boots

(At the time of this writing, SMIT has not been updated to provide a dialogue panel capable of changing the number of simultaneous threads).


smtctl command output (no parameters)

(The following example is from an LPAR configured to use two hardware threads per processor)

# smtctl

This system is SMT capable.

SMT is currently enabled.

SMT boot mode is set to enabled.
SMT threads are bound to the same physical processor.

proc0 has 2 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0

proc2 has 2 SMT threads.
Bind processor 2 is bound with proc2
Bind processor 3 is bound with proc2

The smtctl command with no options reports the following information:

SMT Capability Indicates whether the processors in the system are capable of simultaneous multi-threading

SMT Mode Shows the current runtime simultaneous multi-threading mode (disabled or enabled)

SMT Boot Mode Shows the current boot time simultaneous multi-threading mode (disabled or enabled)

SMT Bound Indicates whether the simultaneous multi-threading threads are bound on the same physical or virtual processor

SMT Threads Shows the number of simultaneous multi-threading threads per physical or virtual processor


Figure 3-27. SMT scheduling and CPU utilization AN512.0

Notes:

The additional throughput offered by SMT is not needed until there are dispatchable threads waiting for a processor. If AIX scheduling is not concerned with processor affinity issues (as when it initially dispatches a process), it will avoid having two software threads on the same processor. For example, it will tend to dispatch work to all of the primary hardware threads before it dispatches any work to the secondary hardware threads. An individual thread will generally run better without a second thread to compete with, because it avoids the occasional contention for the same execution unit within the processor that would happen if both hardware threads were being used.

As a result of this AIX scheduling preference, when using multiple dedicated processors, you will tend to see a pattern where (for SMT2) every other logical CPU is almost idle, when looking at a system that has a moderate CPU load.

AIX runs a wait kproc (kernel process) thread on any logical CPU which does not have any other dispatchable threads. When the primary thread has some real work to do and the secondary thread has only the wait kproc, there can still be some contention for execution units. To avoid that, AIX will “snooze” the secondary thread after a short time of only running the wait kproc.


SMT scheduling and CPU utilization
• Processor has multiple hardware (H/W) threads
  – AIX will tend to first dispatch software (S/W) threads on the low order H/W threads of all processors
    • primary is most preferred, next secondary, and so forth
  – Runs wait kproc on (or snoozes) idle H/W threads
    • If tertiary and quaternary threads are snoozed (POWER7), mode is dynamically reduced to SMT2.
• For system wide reports, such as lparstat and vmstat:
  – Prior to AIX6 TL4:
    • If only a single H/W thread was busy, processor reported as 100% utilized.
    • This could be misleading since idle H/W threads have capacity.
  – AIX6 TL4 and later:
    • Potential capacity of unused H/W threads is reported as idle time for the processor.
• Best picture of H/W thread utilization given by per-lcpu reports, such as sar -P and mpstat


When this happens the processor is running only one hardware thread and there is no contention for execution units. A similar mechanism occurs with SMT4. If not utilized, the third and fourth hardware threads will be snoozed, and the scheduling restricted to the primary and secondary threads.

Keeping a processor busy with a single thread when SMT is enabled is the same utilization as a single thread keeping the processor busy when SMT is not enabled; in other words, it is 100% utilized. But, with SMT enabled, we can actually do much more work by dispatching a second thread. Before AIX6 TL4, performance statistics reports that only show overall system utilization, such as iostat or vmstat, would report 100% utilization when only the primary hardware threads were being fully utilized, even though there was actually spare capacity on the secondary hardware threads that could be used. Since AIX6 TL4, the system CPU utilization factors in the potential capacity of the unused hardware threads. This is reflected in higher idle statistics (idle plus iowait) and lower in-use statistics (user plus sys).

In order to see the complete CPU utilization picture, we need to use a report that shows the utilization of each individual logical CPU. For this you can use either sar -P ALL or mpstat. If you see that there are logical CPUs which are underutilized, we know that there is extra capacity available to use.
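Both reports take an interval and a count. A hedged pair of invocations (mpstat's -s flag, which summarizes the utilization of the SMT threads under each physical or virtual processor, is as documented for AIX 5.3 and later; verify on your level):

# sar -P ALL 2 3
# mpstat -s 2 3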

Later, we will see that AIX in a micro partition LPAR will show a slightly different SMT behavior.


Figure 3-28. System wide CPU reports (old and new) AN512.0

Notes:

The visual illustrates the difference between AIX6 technology level 3 and AIX6 technology level 5. Both examples show a single, significantly CPU-intensive thread running on a system configured with two hardware threads per processor and only one processor provided.

Before AIX6 TL4, the lparstat report shows there to be very little idle capacity, leading us to believe that we will need to add more processor capacity. Yet, we know that we can still run an additional thread on that same processor, providing even more throughput.

That same situation in AIX6 TL4 (and later) shows a significant amount of idle capacity. This is reflecting the potential capacity of the idle hardware thread on that processor.

It must be remembered that this extra capacity can only be used by additional threads. If the application is designed with a single process and a single thread, then that application will not be able to use that potential extra capacity.

In either method of reporting CPU utilization, it is informative to examine a report showing the utilization of each individual hardware thread.


System wide CPU reports (old and new)

# lparstat 2 3

System configuration: type=Shared mode=Capped smt=On lcpu=2 mem=768MB psize=4 ent=0.30

%user %sys %wait %idle physc %entc lbusy  vcsw phint
----- ---- ----- ----- ----- ----- ----- ----- -----
 98.2  0.8   0.0   1.0  0.30  99.9  51.4   618     0
 97.9  0.8   0.0   1.3  0.30  99.9  48.7   602     1
 98.8  0.6   0.0   0.7  0.30  99.9  50.0   450     1

# lparstat 2 3

System configuration: type=Shared mode=Capped smt=On lcpu=2 mem=768MB psize=4 ent=0.30

%user %sys %wait %idle physc %entc lbusy  vcsw phint
----- ---- ----- ----- ----- ----- ----- ----- -----
 77.3  2.7   0.0  20.0  0.30  99.8  53.2  1196     0
 76.7  2.8   0.0  20.5  0.30  99.8  51.8  1216     0
 77.0  2.6   0.0  20.4  0.30  99.8  52.2  1206     0

Example of lparstat in AIX6 TL3 (first report)
Example of lparstat in AIX6 TL5 (second report)


Figure 3-29. Viewing CPU statistics with SMT AN512.0

Notes:

Introduction

The visual above shows a system running with the same workload, first with SMT disabled, then with it enabled. Notice that the logical CPU number doubled with SMT enabled. Also notice the new statistic of physc or physical CPU consumed with SMT enabled. This shows how much of a CPU was consumed by the logical processor.

In the example in the visual above, we see activity in both sar examples with two physical processors. The physc column is misleading in a way, since in this environment with dedicated processors in a logical partition, the two logical processors which make up one physical processor have a physical processor consumption which always adds up to 100%, or 1.00 processors. Looking at this second sar output, there is user and system activity on logical CPU 3, and the other three logical processors are mostly idle. Just by looking at the output of this second sar, you can tell that logical CPUs 1 and 3 are on the same physical CPU (0.05 + 0.96 add up to approximately 1.00) and logical CPUs 0 and 2 are on the same physical CPU.


Viewing CPU statistics with SMT
# sar -P ALL 2 2

AIX frodo21 3 5 00C30BFE4C00 06/11/06

System configuration: lcpu=2

16:40:30 cpu %usr %sys %wio %idle
16:40:32   0    0    0    0   100
           1   24   76    0     0
           -   12   38    0    50

# sar -P ALL 2 2

AIX frodo21 3 5 00C30BFE4C00 06/11/06

System configuration: lcpu=4

16:40:43 cpu %usr %sys %wio %idle physc
16:40:45   0    4   12    0    84  0.56
           1    0    1    0    99  0.05
           2    0    0    0   100  0.44
           3   27   69    0     4  0.96
           -   14   36    0    50  2.01

Example of sar -P ALL with SMT disabled (first report)
Example of sar -P ALL with SMT enabled (second report)


Figure 3-30. POWER7 CPU statistics with SMT4 AN512.0

Notes:

The visual shows examples of both the smtctl and sar commands on a POWER7-based system.

The smtctl report clearly shows the four SMT threads bound to proc0.

The sar report illustrates how we can view the utilization of each of these logical CPUs.

The principles are basically the same as with SMT2. Remember that the objective is to increase throughput by running more threads in parallel, rather than to improve the performance of any single thread.


POWER7 CPU statistics with SMT4
# smtctl
This system is SMT capable.
SMT is currently enabled.
SMT boot mode is set to enabled.
SMT threads are bound to the same virtual processor.

proc0 has 4 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
Bind processor 2 is bound with proc0
Bind processor 3 is bound with proc0

# sar -P ALL 2 1

AIX sys304_114 1 6 00F606034C00 05/20/10

System configuration: lcpu=4 ent=0.30 mode=Capped

22:08:13 cpu %usr %sys %wio %idle physc %entc
22:08:15   0    0   23    0    76  0.02   6.3
           1  100    0    0     0  0.13  43.1
           2   66    1    0    32  0.04  13.1
           3   97    0    0     3  0.11  37.4
           -   88    2    0    10  0.30  99.9


Figure 3-31. Processor virtualization AN512.0

Notes:

Introduction

This visual gives an overview of many of the aspects of processor virtualization that we need to consider in AIX tuning of CPU performance.

In this example, there are 8 physical processors. Starting from left to right, there are six processors in the shared processor pool and two processors dedicated to a partition.

Moving up in the visual, we see that there are two shared pool LPARs (SPLPAR), each with four virtual processors, and one dedicated LPAR with two physical processors allocated.

Optional features

Many of the processor concepts in this unit are optional features that must be purchased. For example, the ability to have Capacity on Demand (CoD) processors is a separate, orderable feature. Also, the PowerVM standard edition (Advanced POWER Virtualization) feature must be purchased to use Micro-Partitioning and shared and virtual processors.


Processor virtualization
(Diagram: two shared pool LPARs, each with virtual processors and logical CPUs, mapped onto a shared pool of processors; alongside them, a dedicated processor LPAR whose idle cycles can be donated to the shared pool.)



Shared processors

Shared processors are physical processors, which are allocated to partitions on a time slice basis. Any physical processor in the Shared Processor Pool can be used to meet the execution needs of any partition using the Shared Processor Pool. There is only one Shared Processor Pool for POWER5 processor-based systems. With POWER6 and later, multiple Shared Processor Pools can be configured.

A partition may be configured to use either dedicated processors or shared processors, but not both.

Processing units

When a partition is configured, you assign it an amount of processing units. This is referred to as the processor entitlement for the LPAR. A partition must have a minimum entitlement of one tenth of a processor; after that requirement has been met, you can configure processing units at the granularity of one hundredth of a processor.

Capped versus uncapped shared pool LPARs

When a partition is configured as using the shared processor pool, it can have its entitlement either capped or uncapped. When capped, the LPAR cannot use more than its current entitlement. When uncapped, it is allowed to use processor capacity above its current entitlement as long as other LPARs do not need those cycles to do work within their own entitlement.
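From within AIX, the partition's entitlement, capped or uncapped mode, and virtual processor count can be confirmed with lparstat -i; a minimal sketch (field names from memory; check the full output on your system):

# lparstat -i

Look for the Entitled Capacity, Mode, and Online Virtual CPUs fields.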

Benefits to using shared processors

Here are some benefits of using shared processors:

• The processing power from a number of physical processors can be utilized simultaneously, which can increase performance for multiple partitions.

• Processing power can be allocated in sub-processor units, in increments as small as one hundredth of a processor, for configuration flexibility.

• Uncapped partitions can be used to take advantage of excess processing power not being used by other partitions.

Disadvantage of using shared processors

A disadvantage of using shared processors is that because multiple partitions use the same physical processors, there is overhead from context switches on the processors. A context switch occurs when a process or thread that is running on a processor is interrupted (or finishes), and a different process or thread runs on that processor. The overhead is in the copying of each job’s data from memory into the processor cache. This overhead is normal and even happens at the operating system level within a partition. However, there is added context switch overhead when the Hypervisor dispatches virtual processors onto physical processors in a time-slice manner between partitions.

Micro-partitions

The term micro-partition is used to refer to partitions that are using the shared processor pool. This is because the partition does not use processing power in whole processor units, but can be assigned a fractional allocation in units equivalent to hundredths of a processor.

Shared processor logical partition (SPLPAR)

In documentation, you might see the acronym SPLPAR for shared processor logical partition, and it simply means a partition utilizing shared processors.

Virtual processors

The virtual processor setting allows you to control the number of threads your partition can run simultaneously. The example shows six physical processors in the shared pool, and there are eight virtual processors configured in the two partitions.

The number of virtual processors is what the operating system thinks it has for physical processors. The number of virtual processors is independently configurable for each partition using shared processors.

Dedicated processor versus shared processor partition performance

Dedicated processors provide better performance than capped shared processors because of reduced processor cache misses and reduced latency. Dedicated processor partitions have the added advantage of memory affinity; that is, when the partition is activated, an attempt is made to assign physical memory that is local to the dedicated processors, thereby reducing latency issues.

However, a partition using dedicated processors cannot take advantage of using excess shared pool capacity as you can with an uncapped partition using the shared processor pool. Performance could be better with the uncapped processors if there is excess capacity in the shared pool that can be used.

Configuring the virtual processor number on shared processor partitions is one way to increase (or reduce) the performance for a partition.

The virtual processor setting for a partition can be changed dynamically.


Virtual processor folding

Starting with AIX V5.3 maintenance level 3, the kernel scheduler has been enhanced to dynamically increase and decrease the use of virtual processors in conjunction with the instantaneous load of the partition, as measured by the physical utilization of the partition. This is a function of the AIX V5.3 operating system (also of AIX 6.1) and not a Hypervisor call.

If there are too many virtual processors for the load on the partition, every time slice, the Hypervisor will cede excess cycles. This works well, but it only works within a dispatch cycle. At the next dispatch cycle, the Hypervisor distributes entitled capacity and must cede the virtual processor again if there is no work. The VP folding feature, which puts the virtual processor to sleep across dispatch cycles, improves performance by reducing the Hypervisor workload, by decreasing context switches, and by improving cache affinity.

When virtual processors are deactivated, they are not dynamically removed from the partition as with DLPAR. The virtual processor is no longer a candidate to run on or receive unbound work; however, it can still run bound jobs. The number of online logical processors and online virtual processors that are visible to the user or applications does not change. There are no impacts to the middleware or the applications running on the system because the active and inactive virtual processors are internal to the system.

Enable/Disable VP folding

The schedo command is used to dynamically enable, disable, or tune the VP folding feature. It is enabled (set to 0) by default.

Typically, this feature should remain enabled. The disable function is available for comparison reasons and in case any tools or packages encounter issues due to this feature.

Configuring vpm_xvcpus

Every second, the kernel scheduler evaluates the number of virtual processors in a partition based on their utilization. If the number of virtual processors needed to accommodate the physical utilization of the partition is less than the current number of enabled virtual processors, one virtual processor is disabled. If the number of virtual processors needed is greater than the current number of enabled virtual processors, one or more (disabled) virtual processors are enabled. Threads that are attached to a disabled virtual processor are still allowed to run on it.

To determine if folding is enabled (0=enabled; -1=disabled):

# schedo -a | grep vpm_xvcpus
       vpm_xvcpus = 0

To disable, set the value to -1 (To enable, set it to 0.):


# schedo -o vpm_xvcpus=-1

Tuning VP folding

Besides enabling and disabling virtual processor folding, the vpm_xvcpus parameter can be set to an integer to tune how the VP folding feature will react to a change in workload. For example, the following command sets the vpm_xvcpus parameter to 1:

# schedo -o vpm_xvcpus=1

Now when the system determines the correct number of virtual processors needed, it will add one more to that amount. So if the partition needs four virtual processors, setting vpm_xvcpus to 1 results in five virtual processors being enabled.

Dedicated LPAR processor donation

POWER6 and POWER7-based systems allow a dedicated processor logical partition to donate its idle processor cycles to the shared processor pool.

The new processor function provides the ability for partitions that normally run as “dedicated processor” partitions to contribute unused processor capacity to the shared processor pool. This support will allow that un-needed capacity to be “donated” to uncapped partitions instead of being wasted as idle cycles in the dedicated partition.

This feature ensures the opportunity for maximum processor utilization throughout the system.


Figure 3-32. Performance management with virtualization AN512.0

Notes:

Partnering with the managed system administrator

A major component in performance management is providing the correct amount and type of resource for the demand. In a single operating system server, you would do this by adding more physical resource. In an LPAR environment, you may be able to provide the additional resource by having it dynamically allocated via the HMC. When using shared pool LPARs, you can have the processor entitlement and the number of virtual processors increased. Often the number of virtual processors is already set to the largest value you would practically want, so that often does not need to be changed. You might also want to run uncapped, if not already in that mode. Remember that the server administrator sees a larger picture, involving optimizing the performance of all LPARs, taking into consideration the characteristics and priority of the various applications.


Performance management with virtualization
• Processor allocation can be very dynamic
  – Work with HMC administrator to adjust capacity and VP allocations or other LPAR virtualization attributes
• AIX actual processor usage varies
  – AIX can cede or donate cycles that it cannot use
  – If uncapped, AIX LPAR can use more than its entitlement
• Traditional usr, sys, wait, idle percentages are not stable
  – Calculated as percentage of actual cycles used (including wait kproc)
  – The denominator for the calculation can constantly change
• Need to factor in actual processor utilization (physc, %entc)
• Uncapped execution above entitlement:
  – Better system resource utilization
  – Performance can vary due to other LPARs’ processor demands


Physical processor utilization variability

In a traditional non-partitioned server, the consumption of processor cycles stays constant; either a useful thread is running or the wait kproc is running, and the processor is kept busy. A dedicated processor LPAR behaves the same way. The traditional UNIX statistics assume that this is the situation and are focused on how these cycles are used: user mode, system mode, or wait kproc (with or without threads waiting on a resource). The four categories always add up to 100%, and this 100% matches the actual execution time on the physical processor.

When we use micro-partitions, the situation changes significantly. If AIX detects that it is just wasting cycles with the wait kproc spinning in a loop, it will cede that VP (and thus the underlying physical processor) back to the hypervisor, which can then dispatch a different LPAR on that processor. Even though AIX has a processor entitlement, it may not be running anything on the physical processor for a period of time, not even a wait kproc. On the other hand, if an AIX partition is uncapped, it may execute for many more cycles than its entitlement would appear to provide. The traditional CPU statistics (user, sys, wait, idle) will still add up to 100%, but this is 100% of the time this LPAR used a physical processor; and, as you know, this time can vary.

Problems with focusing on only the traditional CPU statistics

The traditional CPU utilization statistics are calculated by dividing the amount of time a logical CPU is executing in a given mode (for example, user mode) by the total execution time (including the wait kproc execution). If you are examining statistics with AIX running in a shared processor LPAR (or a dedicated processor LPAR with the ability to donate while active), then the denominator of this calculation (the time spent executing on the physical processor) can change from one interval to the next. Even if the workload in AIX is constant, we may see these percentages fluctuating as other LPARs demand their entitlement and deny this LPAR capacity above its own entitlement. In a lower demand situation, the denominator of this calculation can be so small that very small thread executions can appear as fairly large percentages. Without knowing the actual physical processor utilization, these percentages can be very confusing and even misleading.

The physical processor utilization statistics

To assist with this situation, the statistics commands were modified to display the actual physical processor utilization by a logical CPU. The two additional statistics are physc and %entc. The physc statistic identifies how much time a logical CPU spent executing on a physical processor, expressed as a number of processing units, where 1.00 represents the capacity of one physical processor. The %entc statistic reports essentially the same information, except that it is expressed as a percentage of the LPAR's entitled capacity. It is important to examine these statistics to know the true processor utilization and to place the traditional statistics in context.
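As a worked example (numbers chosen to match the SPLPAR reports later in this unit): %entc = physc / entitlement x 100, so a logical CPU consuming physc = 1.00 in a partition entitled to 0.35 processing units reports %entc = 1.00 / 0.35 x 100, or roughly 286%, which is why values far above 100% appear for uncapped partitions.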


Better utilization but possibly inconsistent performance

The main advantage of running uncapped SP-LPARs is to better utilize the processor resources. With dedicated processors, if one LPAR is not utilizing its one-processor allocation while another LPAR needs more than its one-processor allocation, nothing can be done to transfer that capacity; and the few capabilities that might help (such as shutting down a fairly inactive LPAR or using DLPAR to transfer entire processors) are not suited to relatively short-term, transient situations. With SP-LPARs, the excess capacity that is not used by one LPAR is immediately available to another LPAR that needs it.

The problem you may face is that the users can get spoiled with excellent performance which depends upon this excess capacity (above and beyond the entitlement which the application is assigned). Later, when other LPARs start to utilize their entitlements, the excess capacity disappears and users then see performance which is more appropriate for the defined processor entitlement. This can be a trend over time as other LPARs slowly increase their workload, or it could be seen as fluctuations in performance as different LPARs hit their peak demand at different times.

The key here is to clearly identify the acceptable performance and request an LPAR processor entitlement which should provide that level of performance. The user should be made to understand that you may, at times, provide much better performance than stated in the service level agreement, but that this is not guaranteed.


Figure 3-33. CPU statistics in an SPLPAR (1 of 2) AN512.0

Notes:

Overview

The examples shown are from an SPLPAR which has an entitlement of 0.35 processing units and an allocation of two virtual processors. This LPAR is the only LPAR which has any significant work, so the shared pool of eight processors has plenty of excess capacity. The examples illustrate how the CPU statistics change as the number of single-threaded jobs is increased.

Idle LPAR

Of course, there is no such thing as a totally idle LPAR, but there are no user jobs running in this LPAR. It may at first glance appear that lcpu 0 is fairly busy, but a glance at the physc values shows that the processor utilization is so low that it displays as zero. Even the %entc is a fraction of a percent of the LPAR's entitlement. The %usr + %sys may be 71%, but that is 71% of almost nothing.


CPU statistics in an SPLPAR (1 of 2)
# sar -P ALL 2 2

System configuration: lcpu=4 ent=0.35 mode=Uncapped

22:28:15 cpu %usr %sys %wio %idle physc %entc
22:28:17   0   16   55    0    29  0.00   0.9
           1    0   25    0    75  0.00   0.3
           2    0   23    0    77  0.00   0.3
           3    0   25    0    75  0.00   0.3
           U    -    -    0    98  0.34  98.3
           -    0    1    0    99  0.01   1.7

# sar -P ALL 2 2

System configuration: lcpu=4 ent=0.35 mode=Uncapped

22:39:20 cpu %usr %sys %wio %idle physc %entc
22:39:22   0   11   53    0    35  0.00   0.8
           1    0   25    0    75  0.00   0.3
           2   13   46    0    42  0.00   0.4
           3    0   21   79     0  1.00 285.1
           -    0   21   79     0  1.00 286.5

No jobs running (first report)
One job running (second report)


One job running

Looking at just the traditional statistics, it would appear that lcpu 0 and lcpu 2 are busy, with 64% and 59% utilization. Once again, you need to look at the physc statistic to see the true situation. Those two logical CPUs once again have very low utilization, so low that they are reported as zero processing units.

The lcpu which is reported as using an entire processing unit is lcpu 3. Furthermore, since an entire processing unit is much more than the allocated processor entitlement for the LPAR, the %entc is far above 100%. The execution is divided between system mode thread execution and the wait kproc.


Figure 3-34. CPU statistics in an SPLPAR (2 of 2) AN512.0

Notes:

Two jobs running

Staying focused on the physc statistic, you can see that lcpu 1 and lcpu 3 are each fully utilizing a physical processor, while the other two logical CPUs are almost entirely idle (despite one of them having significant %usr + %sys). The selection of logical CPUs is not random. Because SMT is enabled, these logical CPUs are mapped to the primary hardware threads of the two processors.

Three jobs running

With the addition of one more job and with the primary hardware threads busy, the AIX scheduler starts to use the secondary hardware threads. The processor is still 100% busy, but now the usage is prorated between the two threads sharing it, using SMT.


CPU statistics in an SPLPAR (2 of 2)
# sar -P ALL 2 2

System configuration: lcpu=4 ent=0.35 mode=Uncapped

00:50:02 cpu %usr %sys %wio %idle physc %entc
00:50:04   0    8   47    0    45  0.00   0.6
           1    0   16   84     0  1.00 284.7
           2   13   45    0    41  0.00   0.4
           3   18   82    0     0  1.00 285.1
           -    9   49   42     0  2.00 570.8

# sar -P ALL 2 2

System configuration: lcpu=4 ent=0.35 mode=Uncapped

00:52:26 cpu %usr %sys %wio %idle physc %entc
00:52:28   0   16   51    0    33  0.00   0.6
           1    0   14   86     0  1.00 284.7
           2   16   84    0     0  0.50 142.3
           3   16   84    0     0  0.50 143.2
           -    8   49   43     0  2.00 570.8

Two jobs running (first report)
Three jobs running (second report)


Figure 3-35. Checkpoint AN512.0

Notes:


Checkpoint
1. What is the difference between a process and a thread? ______________________
2. The default scheduling policy is called: _________________
3. The default scheduling policy applies to fixed or non-fixed priorities? _________________
4. Priority numbers range from ____ to ____.
5. True/False: The higher the priority number, the more favored the thread will be for scheduling.
6. List at least two tools to monitor CPU usage: ______________________
7. List at least two tools to determine what processes are using the CPUs: ______________________


Figure 3-36. Exercise 3: Monitoring, analyzing, and tuning CPU usage AN512.0

Notes:


Exercise 3: Monitoring, analyzing, and tuning CPU usage
• Observing the run queue
• Use nice numbers to control process priorities
• Analyze CPU statistics in multiple environments (including SMT and SP-LPAR), including locating a dominant process
• Use WPAR resource controls (optional)
• Use schedo to modify the scheduler algorithms (optional)
• Use PerfPMR data to examine CPU usage


Figure 3-37. Unit summary AN512.0

Notes:


Unit summary

This unit covered:
• Processes and threads
• How process priorities affect CPU scheduling
• Managing process CPU utilization with either
  – nice and renice commands
  – WPAR resource controls
• Using the output of the following AIX tools to determine symptoms of a CPU bottleneck:
  – vmstat, sar, ps, topas, tprof
• Correctly interpreting CPU statistics in various environments, including where:
  – Simultaneous multi-threading (SMT) is enabled
  – LPAR is using a shared processor pool


Unit 4. Virtual memory performance monitoring and tuning

What this unit is about

This unit describes virtual memory concepts including page replacement. It also explains how to analyze and tune the virtual memory manager (VMM).

What you should be able to do

After completing this unit, you should be able to:

• Define basic virtual memory concepts and what issues affect performance

• Describe, analyze, and tune page replacement

• Identify memory leaks

• Use the virtual memory management (VMM) monitoring and tuning tools

• Analyze memory statistics in an Active Memory Sharing (AMS) environment

• Describe the role of Active Memory Expansion and interpret the related AME statistics

How you will check your progress

Accountability:

• Checkpoints • Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


Active Memory Expansion: Overview and Usage Guide (Whitepaper)


Figure 4-1. Unit objectives AN512.0

Notes:


Unit objectives

After completing this unit, you should be able to:

• Define basic virtual memory concepts and what issues affect performance

• Describe, analyze, and tune page replacement

• Identify memory leaks

• Use the virtual memory management (VMM) monitoring and tuning tools

• Analyze memory statistics in Active Memory Sharing (AMS) and Active Memory Expansion (AME) environments


Figure 4-2. Memory hierarchy AN512.0

Notes:

Registers

The instructions and data that the CPU processes are fetched from memory. Memory comes in several layers with the top layers being the most expensive but the fastest. The top layer consists of registers which are high speed storage cells that can contain 32-bit or 64-bit instructions or data. However, there is a limited number of registers on each CPU chip.

Caches

Caches are at the next level and themselves can be split into multiple levels. Level 1 (L1) cache is the fastest and smallest (due to cost) and is usually on the CPU chip. If the CPU can find the instruction or data it needs from the L1 cache, then access time can be as little as 1 clock cycle. If it’s not in L1, then the CPU can attempt to find the instruction or data in the L2 cache (if it exists), but this could take 7-10 CPU cycles. The advantage is that L2 caches can be megabytes in size whereas the L1 is typically 32-256 KB.


Memory hierarchy
(Diagram: registers at the top; below them L1, L2, and L3 cache; then real memory (RAM); and, at the bottom, disk drives (persistent storage).)


L3 caches are even less expensive; while not as fast as L2 cache, they are significantly faster than main memory access.

Real memory (RAM)

Once the virtual address is found in random access memory (RAM), the item is fetched, typically at a cost of 200-500 CPU cycles.

Disk

If the address is not in RAM, then a page fault occurs and the data is retrieved from the hard disk. This is the slowest method but the cheapest. It’s the slowest for the following reasons:

- The disk controller must be directed to access the specified blocks (queuing delay)

- The disk arm must seek to the correct cylinder (seek latency)

- The read/write heads must wait until the correct block rotates under them (rotational latency)

- The data must be transmitted to the controller (transmission time) and then conveyed to the application program (interrupt-handling time)

Cached storage arrays also have an influence on the performance level of persistent storage. Through the storage subsystem’s caching of data, the apparent response time on some I/O requests can reflect a memory to memory transfer over the fibre channel, masking any mechanical delays (seek and rotational latency) in accessing the data.

Newer storage subsystems are offering solid state drives (SSD), also referred to as flash storage. While still much slower than the system memory and more expensive than traditional disk drives, the performance provided is significantly better than drives requiring mechanical movement to access a spinning disk. These can be used either as an alternative to disk drives (for predetermined file systems which require optimal access times) or as another layer in the memory hierarchy by using hierarchical storage systems that automatically keep frequently used data on the SSD.

A disk access can cost hundreds of thousands of CPU cycles.

If the CPU is stalled because it is waiting on a memory fetch of an instruction or data item from real memory, then the CPU is still considered as being in busy state. If the instruction or data is being fetched from disk or a remote machine, then the CPU is in I/O wait state (I/O wait also includes waits for network I/O).

Hardware hierarchy overview

When a program runs, it makes its way up the hardware and operating system hierarchies, more or less in parallel. Each level on the hardware side is scarcer and more expensive than the one below it. There is contention for resources among programs and time spent in transition from one level to the next. Usually, the time required to move from one hardware level to another consists primarily of the latency of the lower level, that is, the time from the issuing of a request to the receipt of the first data.

Disks are the slowest hardware operation

By far the slowest operation that a running program does (other than waiting on a human keystroke) is to obtain code or data from a disk. Disk operations are necessary for read or write requests for programs. System tuning activities frequently turn out to be hunts for unnecessary disk I/O or searches for disk bottlenecks, since disk operations are the slowest operations. For example, can the system be tuned to reduce paging? Is one disk too busy, causing higher seek times, because it has multiple file systems with a lot of activity?

Real memory

Random access memory (RAM) access is fast compared to disk, but much more expensive per byte. Operating systems try to keep program code and data that are in use in RAM. When the operating system begins to run out of free RAM, it needs to make decisions about what types of pages to write out to disk. Virtual memory is the ability of a system to use disk space as an extension of RAM to allow for more efficient use of RAM.

Paging and page faults

If the operating system needs to bring a page into RAM that has been written to disk or has not been brought in yet, a page fault occurs, and the execution of the program is suspended until the page has been read in from disk. Paging is a normal part of the operation of a multi-processing system. Paging becomes a performance issue when free RAM is short and pages which are in memory are paged-out and then paged back in again causing process threads to wait for slower disk operations. How virtual memory works will be covered in another unit of this course.
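A quick, hedged check for this condition is to watch the pi (page-ins from paging space) and po (page-outs to paging space) columns of vmstat over several intervals (the interval and count are arbitrary):

# vmstat 2 5

Sustained non-zero pi and po values, together with a small fre (free list) value, suggest that memory is over-committed.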

Caches

To minimize the number of times the program has to experience the RAM latency, systems incorporate caches for instructions and data. If the required instruction or data is already in the cache (a cache hit), it is available to the processor on the next cycle (that is, no delay occurs); otherwise, a cache miss occurs. If a given access is both a TLB miss and a cache miss, both delays occur consecutively.

Depending on the hardware architecture, there are two or three levels of cache, usually called L1, L2, and L3. If a particular storage reference results in an L1 miss, then L2 is checked. If L2 generates a miss, then the reference goes to the next level, either L3, if it is present, or RAM.


Pipeline and registers

A pipelined, superscalar architecture allows for the simultaneous processing of multiple instructions, under certain circumstances. Large sets of general-purpose registers and floating-point registers make it possible to keep considerable amounts of the program's data in registers, rather than continually storing and reloading the data.


Figure 4-3. Virtual and real memory AN512.0

Notes:

Overview

Virtual memory is a method by which the real memory appears larger than its true size. The virtual memory system is composed of the real memory plus physical disk space where portions of a file that are not currently in use are stored.

Virtual memory segments

Virtual address space is divided into segments. A segment is a 256 MB, contiguous portion of the virtual memory address space into which a data object can be mapped.

Process addressability to data is managed at the segment (or object) level so that a segment can be shared between processes or maintained as private. For example, processes share code segments yet have separate private data segments.


Virtual and real memory

(Diagram: virtual memory shown as segments 0 through n, each 256 MB in size and divided into pages; each page maps to a page frame in real memory, to disk storage, or to both.)
• Virtual memory mapped to:
  • Real memory
  • Disk storage
  • Both
• Page sizes of: 4 KB and 64 KB
• Configurable pools of 16 MB and 16 GB pages


Pages, page frames, and pages on disk

Virtual memory segments are divided into pages. AIX supports four different page sizes:

• 4 KB - traditional and most commonly used

• 64 KB - used mostly by the AIX, but easily used by applications

• 16 MB - mostly used in HPC environments; requires AIX allocation of a pool

• 16 GB - mostly used in HPC environments; requires server configuration of a pool

You will mostly see only the 4 KB and 64 KB page sizes. This course will not cover the configuration or use of the larger page sizes.

Similarly, physical memory is divided (by default) into 4096 byte (4 KB) page frames. A 64 KB page frame is, essentially, 16 adjacent 4KB page frames managed as a single unit. The system hardware and firmware is designed to support physical access of entire 64 KB page frames.

Each virtual memory page which has been touched is mapped to either a memory page frame or a location on disk.

For file caching, the segment’s pages are mapped to an opened file in a file system. As the process reads a file, memory page frames are allocated and the file data is paged-in to memory. When a process writes to the file, by default, a memory page frame is allocated, the data is copied to memory, and eventually the memory contents are paged-out to the file to which the process was writing. Note that the virtual memory page could have its data on disk only (in the file), in memory only (written but not yet paged-out to the file), or be stored both in memory and on disk.

For application private memory work space, the application stores data in a given virtual memory page and this gets written to a page frame in memory. If AIX needs this memory for other uses, it may steal that memory page frame. This would require that the contents be saved by paging it out to paging space on disk, since it does not have a persistent location in a file in a file system. Later, if the process references that virtual segment page, the paging space page will be paged-in to a memory page frame. Note that the paging space page is not freed when the contents are paged-in to memory. The data is kept in the paging space; this way, if that page needs to be stolen again we already have an allocation in paging space to hold it, and if the page is not modified, we will not need to page-out at all since the paging space already has the data. Once again, the virtual memory page may be stored in memory only, or in paging space only, or in both.
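Because a stolen working segment page keeps its allocation in paging space, watching paging space utilization is a useful companion check; a minimal sketch using the standard lsps command:

    # Show each paging space device with its size and percent used
    lsps -a
    # Show the combined total across all paging spaces
    lsps -s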


Figure 4-4. Major VMM functions AN512.0

Notes:

Overview

The virtual memory manager (VMM) coordinates and manages all the activities associated with the virtual memory system. It is responsible for allocating real memory page frames and resolving references to pages that are not currently in real memory.

Free list

The VMM maintains a list of unallocated page frames that it uses to satisfy page faults, called the free list.

In most environments, the VMM must occasionally add to the free list by stealing some page frames owned by running processes. The virtual memory pages whose page frames are to be reassigned are selected by the VMM’s page stealer. The VMM thresholds determine the number of frames reassigned.

• To manage memory, the virtual memory manager (VMM):
  – Manages the allocation of page frames
  – Resolves references to virtual memory pages that are not currently in RAM

• To accomplish these functions, the VMM:
  – Maintains a free list of available page frames
  – Uses a page replacement algorithm to determine which allocated real memory page frames will be stolen to add to the free list

• The page replacement daemon is called lrud.
  – It is a multi-threaded process.
  – It is also referred to as the page stealer.
  – Some memory (such as the 16 MB and 16 GB page pools) is not LRU-able (not managed by lrud) and cannot be stolen.

• Memory is divided into one or more memory pools.
  – There is one lrud thread for each memory pool.
  – Each memory pool has its own free list managed by that lrud.


When a process exits, its working storage is freed up immediately and its associated memory frames are put back on the free list. However, any files the process may have opened can stay in memory.

When a file system is unmounted, any cached file pages are freed.

Intent of the page replacement algorithm

The main intent of the page replacement algorithm is to ensure that there are enough pages on the free list to satisfy memory allocation requests. The next most important intent is to try to select page frames to steal which are unlikely to be referenced again. It also ensures that computational pages are given fair treatment. For example, the sequential reading of a long data file into memory should not cause the loss of program text pages that are likely to be used again soon.

Least Recently Used Daemon (lrud)

On a multiprocessor system, page replacement is done through the lrud kernel process. Page stealing occurs when the VMM page replacement algorithm selects a currently allocated real memory page frame to be placed on the free list. Since that page frame is currently associated with a virtual memory segment of a process, we refer to this as stealing the page frame from that segment.

Memory pools

The lrud kernel process is multi-threaded with one thread per memory pool. Real memory is split into memory pools based on the number of CPUs and the amount of RAM.

While the memory statistics and memory tuning parameters often use global values, internally the lrud daemon threads use per-memory-pool values, with the statistic reports often showing values that are totals from all of the memory pools. This will become significant later when we start to analyze the performance statistics.
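Since the reported statistics are global but lrud works per memory pool, a first step in any analysis is to ask the system how many pools it has; a minimal sketch:

    # The vmstat -v long listing includes a "memory pools" count
    vmstat -v | grep -i "memory pools"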


Figure 4-5. VMM terminology AN512.0

Notes:

Overview

Virtual memory is divided into three types of segments that reflect where the data is being stored:

- Persistent segments (for local JFS filesystems)

- Client segments (for remote, Enhanced JFS, or CD-ROM filesystems)

- Working segments (backed by paging space)

The segment types differ mainly in the function they fulfill and in the way they are backed to external storage when paging occurs.

Segments of a process include program text and process private segments. The program text segments can be persistent or client, depending on where the executable resides: if the executable is on a JFS filesystem, the type is persistent; otherwise, it is client. The private segments are working segments containing the data for the process (for example, global variables, allocated memory, and the stack).

• Segment types:
  – Persistent: file caching for JFS
  – Client: file caching for all other file systems, such as JFS2 or NFS
  – Working: private memory allocations

• Segment classification:
  – Computational: working segments and program text (binary object executables)
  – Non-computational (file memory): persistent segments and client segments

(The visual shows a process and its threads addressing segments: program text (persistent or client), data files (persistent or client), process private stack and data (working), and shared library data (working).)


Segments can be shared by multiple processes. For example, processes can share code segments yet have private data segments.

Persistent segments

The pages of a persistent segment have permanent storage locations on disk. Files containing data or executable programs are mapped to persistent segments.

When the VMM needs to reclaim a page from a persistent segment that has been modified, it writes the modified information to the permanent disk storage location.

If the VMM chooses to steal a persistent segment page frame which has not been modified, then no I/O is required. If the page is referenced again later, then a new copy is read in from its permanent disk storage location.

Client segments

The client segments are used for all filesystem file caching except for JFS and GPFS. (GPFS uses its own mechanism.) Examples of filesystems cached in client segments are remote file pages (NFS), CD-ROM file pages, Enhanced JFS file pages and Veritas VxFS file pages. Compressed file systems use client pages for the compress and decompress activities.

Working segments

Working segments are transitory and exist only during their use by a process. They have no permanent disk storage location and are therefore stored on disk paging space if their page frames are stolen. Process stack and data regions are mapped to working segments, as are the kernel text segment, the kernel extension text segments, as well as the shared library text and data segments. The term text here refers to the binary object code for an executable program; it does not cover the source code or the executable files which are interpreted, such as executable shell scripts.

Computational versus file memory

Computational memory, also known as computational pages, consists of the pages that belong to working storage segments or program text (executable files) segments.

File memory, also known as file pages or non-computational memory, consists of the remaining pages. These are usually pages from permanent data files in persistent storage (persistent or client segments).

The classification of memory as computational or non-computational becomes important in the next few slides, when we look at how the VMM decides which page frames to steal when the free list runs low.


Figure 4-6. Free list and page replacement AN512.0

Notes:

Introduction

A process requires real memory pages to execute.

Memory is allocated out of the free list. Memory is allocated either as working storage or as file caching. As the free list gets short, VMM steals page frames that are already allocated. Applications with root authority can pin critical pages, preventing them from being stolen. Before a page is stolen, VMM needs to be sure that the data is stored in persistent storage (ultimately a disk, even when using an NFS file system).

If the page frame to be stolen was read from a file system and never modified, then nothing has to be saved. If it is working storage that was previously paged out to paging space and has not been modified since then, once again, it does not need to be paged out again. Note that when we page-in from paging space we do not delete the data from the paging space.

• Memory requests are satisfied off the free list.

• Memory frames are stolen from allocated pages to replenish the free list.

• Recently referenced and pinned pages are not stealable.

• If a stolen page frame is not backed with matching contents on disk (dirty):
  – A working segment page is saved to paging space (paging space page-out)
  – A persistent or client segment page is saved to the file system (file page-out)

• Later access requires a page-in from disk.

(The visual shows real memory divided into pinned memory, which cannot be stolen, non-pinned working storage and file cache, and the free list; working storage pages out to paging space, persistent pages to JFS, and client pages to JFS2, NFS, and other file systems.)


If there is no matching copy of the data on disk, the contents of the page frame need to be paged out, either to paging space (for working segment pages) or to a file system (for persistent or client segment pages). Pages that do not have a current copy on disk are often referred to as dirty pages.

The most common way for frames to be placed on the free list is either for the owning process to explicitly free the allocated memory or for VMM to free the frames allocated to a process when the process terminates.

For file caching, once allocated in memory, the memory tends to stay allocated until stolen to fill the free list. This is discussed in more detail later.

When a process references a virtual memory page that is on disk (because it either has been paged out or has yet to be read in), the referenced page must be paged in.

When a process allocates and updates a working segment page while there is not much memory on the free list, currently allocated page frames may have to be stolen to maintain the free list. This can result in one or more pages being paged out. If an application later accesses a stolen page frame, it will need to be paged in from disk. This requires I/O traffic which, as you know, is much slower than normal memory access. If the free list is totally empty, the process will hang waiting on the memory request (which in turn is waiting on the I/O request), seriously delaying the progress of the process.

The VMM uses the page stealer to steal page frames that have not been recently referenced, and thus are unlikely to be referenced in the near future. Page frames which have recently been referenced are not considered stealable and are skipped when scanning for pages to steal.

A successful page stealer allows the operating system to keep enough processes active in memory to keep the CPU busy.

There are some page frames that cannot be stolen. These are called pinned page frames or pinned memory. In the visual, the term pinned memory covers both non-LRUable memory (reserved memory which the lrud daemon does not manage) and memory that has been allocated and pinned by a process to prevent it from being paged out by the lrud daemon.


Figure 4-7. When to steal pages based on free pages AN512.0

Notes:

Stealing based on the number of free pages

The VMM tries to keep the free list size greater than or equal to minfree so it can supply page frames to requestors immediately, without forcing them to wait for page steals and the accompanying I/O to complete. When the free list size falls below minfree, the page stealer runs.

When the free list reaches maxfree number of pages, the page stealer stops.

• Begin stealing when the number of free pages in a mempool is less than minfree (default: 960).

• Stop stealing when the number of free pages in a mempool is equal to maxfree (default: 1088).


Figure 4-8. Free list statistics AN512.0

Notes:

minfree and maxfree vmo parameters

The following vmo parameters are used to make sure there are at least a minimum number of pages on the free list:

- minfree Minimum acceptable number of real memory page frames on the free list. When the size of the free list falls below this number, the VMM begins stealing pages.

- maxfree Maximum size to which the free list will grow by VMM page stealing. The size of the free list may exceed this number as a result of processes terminating and freeing their working segment pages or the deletion of files that have pages in memory.

The minfree and maxfree values are for each memory pool. Prior to AIX 5L V5.3, the minfree and maxfree values were divided up among the memory pools.

• vmo reports the free list thresholds (per mempool and page size):
  – minfree (default 960 pages)
  – maxfree (default 1088 pages)

• vmstat reports:
  – The global size of the free list (all mempools): the fre column of the vmstat interval report, and the free pages statistic of vmstat -v
  – The free list size for each page size: vmstat -P ALL; vmstat -P 4KB; vmstat -P 64KB
  – The number of mempools: the memory pools statistic of vmstat -v

• Multiply minfree and maxfree by the number of memory pools before comparing them to the per page size free list statistics.

• A short free list does not prove there is a shortage of available memory.
  – Memory may be filled with easily stolen file cache pages.


In addition, the thresholds can be triggered by only one page size free list getting short. Thus, you should use vmstat -P ALL to see the free list values for each of the page sizes.

vmstat statistics

System administrators often ask if their free list is short. Both the vmstat iterative report and the vmstat -v report provide a statistic on the size of the free list. It is common to compare these to the minfree and maxfree thresholds. The correct way to do this is to first multiply the thresholds by the number of memory pools to obtain the total of the thresholds for all memory pools. The report that should be used is vmstat -P ALL.

While there is a formula for how many mempools AIX will create given the number of CPUs and the amount of real memory, this formula can change and it is better to ask the system how many memory pools it has. The vmstat -v report provides a count of the number of memory pools.
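Putting that together, a small sketch of the arithmetic (it assumes the "tunable = value" output format of vmo -o and the vmstat -v listing shown later in this unit):

    # Per-pool thresholds
    vmo -o minfree -o maxfree
    # Number of memory pools on this system
    pools=$(vmstat -v | awk '/memory pools/ {print $1}')
    # Effective global minimum is the per-pool minfree times the pool count
    minfree=$(vmo -o minfree | awk '{print $3}')
    echo "global minfree = $((pools * minfree))"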

What does a short free list mean?

In AIX, you should not use the free list size as your primary indication of memory availability. This is because AIX, by default, uses memory for caching file system files. And since the file stays cached in memory even after the program that used it terminates, AIX memory quickly fills up with page frames of files that may not be referenced again for a long time. This memory is easily stolen. If these file pages were never modified, then VMM does not even have to page out the contents before stealing them to be placed on the free list. As a result it is common to see the free list size down near the minfree and maxfree thresholds.

Some system administrators will add the file pages statistic to the free list statistic to get a better idea of how much memory is truly available. The current version of the svmon global report provides a statistic that is designed to fill this need; it is labeled the available value.

If the free list is constantly below minfree and even approaches or reaches zero, then that may indicate that VMM is having trouble stealing page frames fast enough to maintain the free list. That may be a trigger to look at other statistics to fully understand the situation.


Figure 4-9. Displaying memory usage (1 of 2) AN512.0

Notes:

The vmstat -I and svmon -G commands

The vmstat command reports virtual memory statistics. The -I option includes I/O oriented information, including fi (file page ins/second) and fo (file page outs/second).

The svmon -G command gives an overall picture of memory use.

Breakdown of real memory

The size field of the svmon -G output shows the total amount of real memory on the system. The following svmon -G fields show how the memory is being used:

- The free field displays the number of free memory frames

- The work field in the in use row displays the number of memory frames containing working segment pages

- The pers field in the in use row displays the number of memory frames containing persistent segment pages

- The clnt field in the in use row displays the number of memory frames containing client segment pages

These four fields add up to the total real memory. For example:

# vmstat 5

System configuration: lcpu=4 mem=512MB ent=0.80

kthr    memory                       page                          faults          cpu
----- ----------- ------------------------------------ --------------- -----------------------
 r  b    avm   fre  re  pi  po    fr    sr cy   in    sy   cs us sy id wa   pc    ec
 2  1 152282  2731   0   0   0 14390 50492  1  479 10951 4525  1 34 52 14 0.29  36.7
 1  1 152283  2669   0   0   0 13843 45696  1  599  9910 4872  1 34 52 14 0.29  36.2
 1  1 152283  2738   0   0   0 14616 49573  1  503 10445 4716  1 34 52 13 0.29  36.6
 0  1 152280  2639   0   0   0 13802 46128  1  375 11108 7984  1 38 49 11 0.33  40.9

# svmon -G -O pgsz=off

Unit: page
--------------------------------------------------------------------------
              size   inuse    free     pin  virtual available mmode
memory      131072  128431    2641   82554   159754      4993   Ded
pg space    131072   49897

              work    pers    clnt   other
pin          73930       0       0    8624
in use      115268       0   13163

Computational memory

In the vmstat output, avm stands for active virtual memory and not available memory. The avm value in the vmstat output and the virtual value in the svmon -G output indicate the active number of 4 KB virtual memory pages in use at that time. (Active meaning that the virtual address has a page frame assigned to it.) The vmstat avm column will give the same figures as the virtual column of svmon except in the case where deferred page space allocation is used. In that case, svmon shows the number of pages actually paged out to paging space, whereas vmstat shows the number of virtual pages accessed but not necessarily paged out.

In the svmon -G report, if no paging has occurred, then the virtual value will be equal to the work field in the in use row. But if paging has occurred, then you cannot make that assertion.

The avm (vmstat) and virtual (svmon -G) numbers will grow as more processes get started and/or existing processes allocate more working storage. Likewise, the numbers will shrink as working segment pages are released. They can be released in two ways:

- Owning process can explicitly free them

- Kernel will automatically free them when the process terminates

The avm (vmstat) and virtual (svmon -G) statistics do not include file pages.

Free frames

The fre value in the vmstat output and the free field in the svmon -G output indicate the average amount of memory (in units of 4KB) that is currently on the free list. When an application terminates, all of its working pages are immediately returned to the free list. Its persistent pages (files), however, remain in RAM and are not added back to the free list until they are stolen by the VMM for use by other programs. Persistent pages are also freed if the corresponding file is deleted or the file system is unmounted.

For these reasons, the fre value may not indicate all the real memory that can be readily available for use by processes. If a page frame is needed, then persistent pages previously referenced by terminated applications are among the first to be stolen and placed on the free list.


Paging rates

The fi and fo fields show the file page ins and file page outs per second. This represents I/O to and from a filesystem.

The pi and po fields show the paging space page ins and paging space page outs for working pages.

Scanning rates

The number of pages scanned is shown in the vmstat sr field. The number of pages stolen (or freed) is shown in the vmstat fr field. The ratio of scanned to freed represents relative memory activity: it starts at 1 and increases as memory contention increases. It is interpreted as having to scan sr pages in order to find fr pages to free.
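As a rough way to watch that ratio over time, the vmstat output can be post-processed (a minimal sketch; it assumes the default column layout in which fr is field 8 and sr is field 9):

    # Print the sr/fr ratio for each 5-second sample, skipping the headers
    vmstat 5 12 | awk 'NR > 3 && $8 > 0 { printf("sr/fr = %.1f\n", $9 / $8) }'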


Figure 4-10. Displaying memory usage (2 of 2) AN512.0

Notes:

The vmstat command has an option (-P) which accepts a value of 4KB, 64KB, or ALL. This option shows the memory related statistics broken down by page size.

This is important since the lrud threads will trigger page stealing whenever either page size's free amount (in a given mempool) falls below the minfree threshold. In most cases, it is the 4 KB page size statistics that show the low free amount and the page stealing activity.

# vmstat -P 64KB 5

System configuration: mem=512MB

 pgsz            memory                           page
----- -------------------------- ------------------------------------
         siz   avm   fre    re    pi    po    fr    sr  cy
  64K   2046  1992    91     0     0     0     0     0   0
  64K   2046  1992    91     0     0     0     0     0   0
  64K   2046  1992    91     0     0     0     0     0   0
  64K   2046  1992    91     0     0     0     0     0   0

# vmstat -P 4KB 5

System configuration: mem=512MB

 pgsz            memory                           page
----- -------------------------- ------------------------------------
         siz    avm    fre    re    pi    po     fr     sr  cy
   4K  98336 120801   1107     0     0     0  10887  16625   0
   4K  98336 120799   1141     0     0     0  14754  25080   0
   4K  98336 120798   1145     0     0     0  11466  17164   0
   4K  98336 120798   1139     0     0     0  14808  25154   0


Figure 4-11. What type of pages are stolen? AN512.0

Notes:

Overview

The decision of which page frames to steal when the free list is low is crucial to the system performance. VMM wants to select page frames which are not likely to be referenced in the near future. When the page stealer is running frequently (due to a low free list), the record of what has been recently referenced is very short term. Just because a working page frame has not been referenced since the last lrud scan does not mean it will not soon be referenced again.

There are usually many more memory pages being used for file cache that are unlikely to be re-referenced than there are computational memory pages that will not be re-referenced. There is also likely to be a higher cost to stealing a working segment page frame as compared to a persistent or client segment page, given the probability that file cache contents (often read from disk but not modified) will not need to be paged out. Also, if file cache pages are paged out due to page stealing, this is I/O that eventually would be needed anyway (to flush the application's writes to disk).

(The visual summarizes what types of pages are stolen, by numperm range:)

• numperm < minperm (minperm is computed from minperm%, default 3%): steals the least recently used pages, file or computational, regardless of lru_file_repage.

• numperm between minperm and maxperm:
  – With lru_file_repage = 0 (the AIX6 default): tries to steal only file pages (non-computational, either persistent or client).
  – With lru_file_repage = 1: if the file repage rate is greater than the computational repage rate, steals computational pages; otherwise steals file pages.

• numperm > maxperm (maxperm is computed from maxperm%, default 90%): tries to steal only file pages.

File cache location is optimized with page_steal_method=1 (list-based). On the visual, the computed minperm and maxperm thresholds carry the static (S) icon, while maxperm%, lru_file_repage, and page_steal_method carry the restricted (R) icon.


The default AIX6 behavior aggressively chooses to steal from file cache rather than computational memory.

Note that the numeric thresholds (maxperm and minperm) are calculated from tuning parameters which are percentages. As such, the numeric thresholds are not modified using the vmo command; they are classified as static (as indicated by the S icon next to them on the visual). Of the discussed percentages, you should only modify minperm%. The maxperm% and lru_file_repage parameters are restricted in AIX 6 and later (as indicated by the R icon next to them on the visual).
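To review the current settings without changing anything, the restricted tunables can still be displayed (a minimal sketch; on AIX 6, vmo -F forces restricted tunables into the listing):

    # Display the perm thresholds and lru_file_repage, including restricted tunables
    vmo -F -a | grep -E "minperm%|maxperm%|lru_file_repage"
    # minperm% is the only one of these you should normally change, for example:
    #   vmo -p -o minperm%=5    (the -p flag makes the change persist across reboots)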

numperm < minperm

If the percentage of RAM occupied by file pages falls below minperm, then any page (file or computational) that has not been referenced can be selected for free list replacement.

lru_file_repage=0 (default in AIX6 and later) and numperm > minperm

The page stealer tries to steal only file pages when the file page percentage (numperm) is above the minperm threshold.

Since the main purpose of the page stealer's scan is to locate file cache pages, the AIX6 default method for locating them is a list-based method (page_steal_method=1): all of the file cache pages are on a list. The alternative (and the default in AIX 5L V5.3) is to sequentially search through the page frame table (page_steal_method=0) looking for pages which are file cache. This alternative is less efficient.

Note

The lru_file_repage parameter is an AIX6 Restricted tunable. It should not be changed unless instructed to do so by AIX Support.

If working in an AIX 5L V5.3 environment, it is generally recommended that you set lru_file_repage=0.

lru_file_repage=1 and numperm > maxperm

If the percentage of RAM occupied by file pages rises above maxperm, then the preference is to try and steal file pages.

Note: In the two cases above, it is stated that only file pages will be stolen, but there are some circumstances where computational pages can and will be stolen when numperm is in these ranges. One example is the situation where there are no file pages in a stealable state; that is, the file pages are so heavily referenced that none can be found with the reference bit turned off. This can drive the free list to 0, and the VMM may then start stealing computational pages. This is why setting minfree to a value high enough to start reclaiming pages early is so important.

lru_file_repage=1 and numperm is between minperm and maxperm

If the percentage of RAM occupied by file pages is between minperm and maxperm, then page replacement may steal file or computational pages depending on the repage rates of computational versus non-computational (file) pages.

There are two types of page faults:

- New page fault which occurs when there is no record of the page having been referenced

- Repage fault which occurs when a page is stolen using lrud and then is re-referenced and has to be read back in

When lru_file_repage=1, if the value of the file repage counter is higher than the value of the computational repage counter, then computational pages (which are the working storage) are selected for replacement. If the value of the computational repage counter exceeds the value of the file repage counter, then file pages are selected for replacement.

Experiments have shown that the effective result is that both file and computational pages are stolen roughly equally when numperm is in this range.


Figure 4-12. Values for page types and classifications AN512.0

Notes:

The file pages value

The file pages value, in the vmstat report, is the number of non-computational (file memory) pages in use. This is not the number of persistent pages in memory because persistent pages that hold program text (executable files) are considered computational pages.

The numperm percentage value

The numperm percentage value is the file pages value divided by the amount of manageable memory (the lruable memory value) and expressed as a percentage.

The client pages value

The client pages value, in the vmstat report, is the number of non-computational (file memory) pages in use which are in client segments. This is not the number of client segment pages in memory, because client pages that hold program text (executable files) are considered computational pages.

The numclient percentage value

The numclient percentage value is the client pages value divided by the amount of manageable memory (the lruable memory value) and expressed as a percentage.

The pers value

The pers value, in the svmon report, is the number of persistent segment pages. It includes both computational and non-computational pages that are in persistent segments (JFS).

The clnt value

The clnt value, in the svmon report, is the number of client pages. It includes both computational and non-computational pages that are in client segments (such as JFS2).


Figure 4-13. What types of pages are in real memory? AN512.0

Notes:

Type of workload

In a particular workload, it might be worthwhile to emphasize the avoidance of stealing file cache memory. In another workload, keeping computational segment pages in memory might be more important. To get the file cache (and other statistics), use the vmstat -v command. If PerfPMR was run, the output is in vmstat_v.before and vmstat_v.after.

What to look for

If your system is primarily I/O intensive, you will want to have more file caching, as long as it does not result in computational pages being stolen and paged to paging space. Consider this example:

# vmstat -v
       131072 memory pages
       109312 lruable pages
         2625 free pages
            1 memory pools
        82309 pinned pages
         80.0 maxpin percentage
          3.0 minperm percentage
         90.0 maxperm percentage
         21.7 numperm percentage
        23737 file pages
          0.0 compressed percentage
            0 compressed pages
         21.7 numclient percentage
         90.0 maxclient percentage
        23737 client pages
            0 remote pageouts scheduled
           28 pending disk I/Os blocked with no pbuf
        47304 paging space I/Os blocked with no psbuf
         2484 filesystem I/Os blocked with no fsbuf
            0 client filesystem I/Os blocked with no fsbuf
          215 external pager filesystem I/Os blocked with no fsbuf

In the displayed example, the 512 MB of memory is used by:

- Pinned memory (about 320 MB, leaving 192 MB for lrud to manage)

- The free list, which (for both page sizes) needs a minimum of almost 7.5 MB and will attempt to increase to about 8.5 MB

- File cache, which is only about 92 MB

- Computational pages, which use the rest (about 92 MB)

Note that the numclient percentage does not come near the maxclient threshold, thus any page stealing is the result of a short free list.

If you are seeing page stealing (as in the previous vmstat iterative report), this must be because the memory is overcommitted. The example has I/O intensive processes trying to do massive amounts of file caching with only 92 MB of memory available to them.

Remember that the priority use of memory is in the following order:

i. Non-LRUable memory

ii. LRUable but pinned memory

iii. Free list between minfree and maxfree

iv. Computational memory

v. File cache memory

Due to lru_file_repage=0, the last two items are only equal in priority when file caching is less than 3% of lruable memory.

The real problem here is that we do not have enough real memory. The minimum memory for AIX is 512 MB, which is the allocation seen in this example. If more memory were to be added, the rate of page stealing would likely go down and the amount of file cache memory would go up.


Figure 4-14. Is memory over committed? AN512.0

Notes:

What happens when memory is over committed?

A successful page replacement algorithm keeps the memory pages of all currently active processes in RAM, while the memory pages of inactive processes are paged out. However, when RAM is over committed, it becomes difficult to choose pages to be paged out because they will be re-referenced in the near future by currently running processes. The result is that pages that will soon be referenced still get paged out and then paged in again later. When this happens, continuous paging in and paging out may occur. This is referred to as paging space thrashing or simply page thrashing. The system spends most of its time paging in and paging out instead of executing useful instructions, and none of the active processes make any significant progress.

How do you know if memory is over committed?

If the vmstat reports show a high volume of paging space page-ins and page-outs, then it is quite clear that memory is overcommitted. But sometimes memory is overcommitted and a high volume of file cache page stealing is impacting I/O performance; or neither is happening yet, but you are at high risk of one or the other happening. Examination of the svmon report can be helpful.

Memory is considered overcommitted if the number of pages currently in use (the sum of the virtual pages and the file cache pages) exceeds the real memory pages available. For example:

# svmon -G -O unit=MB

Unit: MB
-----------------------------------------------------------------
              size    inuse     free      pin  virtual available mmode
memory      512.00   501.79     10.2   320.82   596.55      26.9   Ded
pg space    512.00   197.62

              work     pers     clnt    other
pin         287.13        0        0     33.7
in use      445.98        0     55.8

  Virtual pages    = 596.55 MB
+ File cache pages =  55.8 MB
------------------------------
Total pages in use = 652.35 MB versus 512 MB of real memory

Use the svmon -G command to get the amount of memory being used and compare that to the amount of real memory. To do this:

- The total amount of real memory is shown in the memory size field.

- The amount of memory being used is the total of:

• The virtual pages shown in the memory virtual field.

• The persistent pages shown in the in use pers field.

• The client pages shown in the in use clnt field.

- Officially, if the amount of memory being used is greater than the amount of real memory, your memory is overcommitted.

The example in the visual is officially overcommitted.
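Scripted, the same arithmetic might look like the following sketch (it assumes the svmon -G -O unit=MB field layout shown above):

    # pages in use = virtual (working) + pers + clnt (file cache)
    svmon -G -O unit=MB | awk '
        /^memory/ { size = $2; virtual = $6 }
        /^in use/ { pers = $4; clnt = $5 }
        END       { printf("in use: %.2f MB  vs  real memory: %.2f MB\n",
                           virtual + pers + clnt, size) }'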

There is also an available field. This statistic is intended to identify how much memory might be available. Because of the tendency of AIX to cache as much file content as possible, leaving the free list fairly small, this available statistic is a much better single measurement of the memory situation than the size of the free list. While it is not clearly documented how this value is calculated, it is affected by the amount of cache memory. In situations where the system has little new memory demand and memory is filled with file caching, the available statistic can show a rather large number even though the free list is fairly short.


Figure 4-15. Memory leaks AN512.0

Notes:

What is a memory leak?

A memory leak occurs when a process allocates memory, uses it, but never releases it. Memory leaks typically occur in long running programs. Over time, the process will either:

- Allocate all of its addressable virtual memory which may cause the process to abort

- Fill up memory with unused computational pages, resulting in increased stealing of non-computational memory, until the numperm is reduced below minperm at which point some computational memory is paged to paging space.

- Use up the paging space, causing the kernel to take protective measures to preserve the integrity of the operating system by killing processes to avoid running out of paging space.

- Cause pinned memory (if a pinned memory leak) to grow to the maxpin threshold. Even before that happens, the system may experience significant page thrashing as processes fight over the remaining unpinned memory. If the maxpin threshold is reached, the results are unpredictable, since they depend on which processes are requesting pinned memory; it could result in a system hang or crash.

Detecting a memory leak

Three commands that help to detect a potential memory leak are vmstat, ps gv, and svmon -P.

Dealing with a memory leak

The best solution is to fix the coding errors in the program that is leaking memory. That is not always possible, especially in the short term.

The common solution is to periodically quiesce and stop the program, and then start it back up. This causes all computational memory allocated by that program to be freed and placed back on the free list. All related paging space is also freed.


Figure 4-16. Detecting a memory leak with vmstat AN512.0

Notes:

What to look for with vmstat

The classic indicator of a memory leak is a steady increase in the active virtual memory (avm column for vmstat).

In addition, you may notice a steady increase in the amount of paging space being used. However, on a large memory system it may take some time to notice this effect.
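A simple way to collect that trend is to sample avm on a schedule (a minimal sketch; it assumes avm is the third field of the vmstat report):

    # Log a timestamped avm sample once a minute; a steady climb that
    # never falls back is the classic leak signature
    while true; do
        avm=$(vmstat 1 2 | tail -1 | awk '{print $3}')
        echo "$(date +%H:%M:%S) avm = $avm"
        sleep 60
    done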

• By using the vmstat command to monitor the virtual memory usage, a memory leak would be detected by a continual increase in the avm number over time:

# vmstat 3 10

System configuration: lcpu=2 mem=3792MB

kthr    memory              page                    faults       cpu
----- -------------- ------------------------ ----------- -----------
 r  b    avm    fre  re  pi  po  fr  sr  cy  in  sy  cs us sy id wa
 0  0 136079 817842   0   0   0   0   0   0  81 518 191  2  1 97  0
 0  0 137402 816518   0   0   0   0   0   0  50 172 179  0  0 99  0
 0  0 139322 814598   0   0   0   0   0   0  65 176 182  1  1 98  0
 0  0 141190 812730   0   0   0   0   0   0  65 477 183  1  0 99  0
 0  0 143350 810570   0   0   0   0   0   0  82 174 194  2  0 98  0
 0  0 145513 808407   0   0   0   0   0   0  88 172 180  1  0 99  0
 0  0 146313 807607   0   0   0   0   0   0  50 161 173  0  1 98  0
 0  0 146319 807601   0   0   0   0   0   0   4 459 172  0  0 99  0
 0  0 146319 807601   0   0   0   0   0   0   4 146 169  0  0 99  0
 0  0 146319 807601   0   0   0   0   0   0   2 232 202  0  0 99  0


Figure 4-17. Detecting a memory leak with ps gv AN512.0

Notes:

Using ps gv to find the offending process

Isolating a memory leak can be a difficult task because the programming error may exist in an application program, a kernel process or the kernel (for example, kernel extension, device driver, filesystem, and so forth).

To find the offending process, look for a growing delta in the SIZE field between multiple ps vg runs. The SIZE column is the virtual size of the data section of the process (in 1 KB units), which represents the private process memory requirements. If this number is increasing over time, then this is a memory leak process candidate.

This is not an absolute rule. The growth in the virtual size may be a normal trend of increased workload.
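The comparison can be automated with two snapshots and awk (a minimal sketch; the file names are illustrative, and it assumes SIZE is the sixth column of ps gv output as in the example below):

    ps gv > /tmp/ps.1              # first snapshot
    sleep 3600                     # let some time go by
    ps gv > /tmp/ps.2              # second snapshot
    # Report each PID whose SIZE (virtual data size, in KB) has grown
    awk 'NR == FNR { size[$1] = $6; next }
         ($1 in size) && $6 + 0 > size[$1] + 0 {
             printf("%s %s grew %d KB\n", $1, $NF, $6 - size[$1]) }' \
        /tmp/ps.1 /tmp/ps.2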

• After a suspected memory leak has been established using the vmstat command, the next step is to identify the offending process:
  – Capture the output of a ps gv command
  – Let some time go by
  – Capture a second set of output with the ps gv command

• The SIZE columns from the two sets of data are compared to see which programs' heaps have grown.

# ps gv
   PID TTY STAT  TIME PGIN  SIZE   RSS   LIM TSIZ TRS %CPU %MEM COMMAND
...
315632 pts/0  A  0:00    1  9008  9016 32768    2   8  0.0  1.0 ./exmem
...

<some time later>

# ps gv
   PID TTY STAT  TIME PGIN  SIZE   RSS   LIM TSIZ TRS %CPU %MEM COMMAND
...
315632 pts/0  A  0:00    1 51324 51332 32768    2   8  0.0  8.0 ./exmem
...


Figure 4-18. Active memory sharing: Hierarchy AN512.0

Notes:

With POWER6 or POWER7 servers and the proper level of firmware and software, PowerVM Enterprise Edition allows the creation of a shared memory pool. An LPAR can be created to either use dedicated memory (allocated memory out of physical memory at activation) or use shared memory (allocated memory out of the shared pool at activation). The physical memory allocated to the shared memory pool is not available to be allocated to any LPAR using dedicated memory. The function and management of the shared memory pool is referred to as Active Memory Sharing (AMS).

AMS allows the memory sharing LPARs to overcommit their allocations. Thus, even though the example in the visual has only 4 GB in the shared pool, the total of the AMS LPARs' allocations is 6 GB. If the three LPARs simultaneously need more logical memory than the physical memory of the shared pool, some memory contents will need to be paged out, either to the individual AIX LPARs' paging spaces or to the AMS paging spaces maintained in the VIOS partition. AMS works best when the partitions complement each other in their patterns of memory usage; in other words, when one LPAR has high memory demand, another LPAR has low memory demand.

(The visual shows physical memory divided between dedicated memory LPARs, with allocations of 2 GB, 1 GB, 2 GB, and 4 GB, and a Power Hypervisor (phype) shared memory pool. Three shared memory (AMS) LPARs, each with realmem=2GB, overcommit the shared pool, which is backed by AMS paging spaces in the VIOS.)


Figure 4-19. Active memory sharing: Loaning and stealing AN512.0

Notes:

Shared memory is expected to be overcommitted; the total of all the AMS LPARs' entitlements cannot be simultaneously backed by physical memory. The real memory that is backed by physical memory is reported in the pmem field of the vmstat -h report. In the following discussion, an AIX LPAR which needs more physical memory is referred to as the requesting LPAR, and the AIX LPAR from which physical memory is taken is referred to as the donating LPAR.

As a result of real memory not being backed by physical memory, when an AIX process accesses a page, there may not be a physical page frame to assign to it.

- To avoid that (or in response to that situation), AIX will request that the Power Hypervisor (phype) provide physical memory frames for the logical memory frames it is using as its real memory.

- The hypervisor will then assign physical memory frames to the requesting AIX partition.



- In order for the Power Hypervisor to fulfill these types of requests, it may need to request that some other partition loan it some physical memory frames which are currently backing AIX real memory frames.

- The donating LPAR will likely take frames off the free list to satisfy the request and may need to steal memory to replenish the free list. It may even need to page-out current memory contents to its own paging space to do this. These are considered loaned page frames.

- If the donating LPAR does not loan frames, then the hypervisor may steal page frames from the donating AIX LPAR. AIX provides hints to the hypervisor to help the phype choose which page frames to steal. If the frame chosen is dirty (modified contents not stored on disk), then the hypervisor will save the contents to a paging space provided for this purpose in the Virtual I/O Server.

AIX collaborates with the Power Hypervisor to help with hypervisor paging. In response to hypervisor requests, AIX checks once a second to determine whether the hypervisor needs memory. If it does, AIX will free up logical memory pages (which become loaned pages) and give them to the hypervisor. The loaning policy is tunable via the vmo ams_loan_policy tunable in AIX. The default is to loan frames. One could configure AIX not to loan frames, but that is not generally advisable: AIX can be much more intelligent about which page frames to loan than the phype can be about which frames to steal, even with the AIX-provided hints.

In an AIX LPAR which is using AMS, there are three possible situations for any given page frame.

• The page frame is backed by physical memory. This is included in the (vmstat -h) pmem value.

• AIX has loaned the frame off the free list. This may have required lrud page stealing. The (svmon) free+inuse total would decrease and the loaned value would increase. The (vmstat -h) pmem value would decrease.

• AIX provided hints to the Power Hypervisor (phype) about which page frames are least critical and the phype stole the page frame. The (svmon) free, inuse, and loaned values would be unaffected. The (vmstat -h) pmem value would decrease.

In a dedicated memory LPAR, the (svmon) size field would equal the total of inuse and free. In an active shared memory LPAR, the (svmon) size field would equal the total of inuse, free, and loaned.

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

4-38 AIX Performance Management © Copyright IBM Corp. 2010

Page 209: AN512STUD

Student NotebookV5.4

Uempty

Figure 4-20. Displaying memory usage with AMS AN512.0

Notes:

The vmstat and svmon commands have new fields to display AMS related information.

For vmstat there is a new flag (-h) to request hypervisor page information. When the new option is used with an iterative monitoring mode, vmstat shows four new fields on each iteration:

• hpi - Number of hypervisor page-ins.

• hpit - Time spent in hypervisor page-ins in milliseconds.

• pmem - Amount of physical memory that is backing the logical memory of partitions. The value is measured in gigabytes.

• loan - The percentage of memory loaned

In addition, the initial system configuration line of the vmstat report has two new fields:

• mmode - The memory mode (dedicated or shared).

• mpsz - The size of the shared memory pool

# vmstat -h 2

System configuration: lcpu=4 mem=1024MB ent=0.30 mmode=shared mpsz=1.50GB

kthr    memory              page                faults       cpu                     hypv-page
----- ----------- ------------------------ ------------ ----------------------- -------------------------
 r  b    avm   fre  re  pi  po  fr  sr cy  in  sy  cs us sy id wa   pc   ec  hpi hpit pmem loan
 0  0 130019 35592   0   0   0   0   0  0   1  85 220  0  1 98  0 0.01  3.0    0    7 0.60 0.26
 0  0 130020 35591   0   0   0   0   0  0   0  15 208  0  1 99  0 0.01  2.2    0    0 0.60 0.26
 0  0 130020 35674   0   0   0   0   0  0   0  19 198  0  1 99  0 0.01  2.2    0    0 0.60 0.26
 0  0 130021 35673   0   0   0   0   0  0   3  66 207  0  1 99  0 0.01  2.6    0    0 0.60 0.26

Note: pmem is in units of gigabytes; loan is a percentage of real memory.

# vmstat -hv | egrep -i 'loan|virtual'
             2612 Virtualized Partition Memory Page Faults
             7756 Time resolving virtualized partition memory page faults
            64674 Number of 4k page frames loaned
               24 Percentage of partition memory loaned

# svmon -G -O unit=MB,pgsz=on

Unit: MB
-------------------------------------------------------------------------------------
              size    inuse     free      pin  virtual available   loaned mmode
memory     1024.00   624.05   140.13   258.11   509.06    211.14   259.82  Shar
pg space   1536.00     4.09

              work     pers     clnt    other
pin         209.94        0        0     48.2
in use      509.06        0   114.99


When the -h option is used in combination with the -v option, there are four new counters:

• Time resolving virtualized partition memory page faults - The total time that the virtual partition is blocked to wait for the resolution of its memory page fault. The time is measured in seconds, with millisecond granularity.

• Virtualized partition memory page faults - The total number of virtual partition memory page faults that are recorded for the virtualized partition.

• Number of 4 KB page frames loaned - The number of the 4 KB pages of the memory that is loaned to the hypervisor in the partition.

• Percentage of partition memory loaned - The percentage of the memory loaned to the hypervisor in the partition.

When you request the svmon global report in an AMS environment (and you specify any -O option), you get two new fields:

• loaned - The amount of memory loaned

• mmode - The mode of memory, in this case: Shar

Traditionally, the inuse plus free statistics added up to the size statistic. Memory was either in use or it was free. (Of course, svmon collects statistics at different points in time, so it is possible for the displayed statistics to not add up exactly to the total real memory.)

With the ability of Active Memory Sharing, we have a new category of memory which is neither inuse nor on the free list: loaned memory. Thus, on a POWER6-based machine the new formula is:

size = inuse + free + loaned

If the formula is applied to the example in the visual, we get:

1024 = 624.05 + 140.13 + 259.82,

which is correct.


Figure 4-21. Active Memory Expansion (AME) AN512.0

Notes:

Active Memory Expansion (AME) is a separately licensed feature of the POWER7-based servers. By compressing part of the virtual memory data, more effective memory is made available. How aggressively memory is compressed is determined by the expansion factor, which is a characteristic of the logical partition. The system administrator initially selects an expansion factor recommended by the amepat planning tool. The expansion factor is either defined in the partition profile or modified dynamically using DLPAR. Multiplying the partition’s allocated memory by the expansion factor provides the target amount of expanded memory. This target amount of expanded memory is what AIX sees as its real memory amount.
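For example, using the numbers in the visual: 20 GB of allocated (true) memory multiplied by an expansion factor of 1.5 gives 20 GB * 1.5 = 30 GB of target expanded memory, which is what AIX reports as its real memory size.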

The LPAR’s allocated memory (also referred to as the true memory or logical memory), is divided into an uncompressed pool and a compressed pool. The sizes of these pools are dynamically adjusted by AIX depending on the situation.

To compress a page of data, it is paged-out to the compressed pool. When accessed, the compressed page is paged-in from the compressed pool.


Active Memory Expansion (AME)
• Memory allocated to the LPAR is the logical or true memory
  – Compressed pool (comprsd)
  – Uncompressed pool (ucomprsd)
  – Data in true ucomprsd is paged-out (co) to the comprsd pool to compress it.
  – Data in true comprsd is paged-in (ci) to the ucomprsd pool to uncompress it.
  – Data in pinned and file cache pages is not compressed.
  – Works better when a smaller percentage of allocated and touched memory is reaccessed.
• Expanded memory seen by applications
  – The target size for expanded memory is the AIX real memory.
  – Deficit (dxm): expanded memory does not reach the target.
  – Poor compression ratio or insufficient data to compress can result in a deficit.
• HMC administrator sets expansion factor in the partition profile:
  exp_factor * true_mem = target_exp_mem
• Based on recommendations of the amepat planning tool during a normal system load.

[Figure: The LPAR's actual logical memory (20 GB) is divided into true ucomprsd and true comprsd pools. The LPAR's expanded memory (uncompressed memory plus compressed memory, as expanded) reaches 28 GB, short of the 30 GB target expansion (expansion factor = 1.5); the difference is the memory deficit.]


The only virtual memory eligible for paging to the compressed pool is pages which are unpinned and in working segments.

There is CPU overhead to compressing and uncompressing the memory. An application which is constantly accessing all of its data will generate much more of this compression and decompression overhead than one which is only re-accessing a small portion of that memory during a short period.

While AIX sees the target expansion as the amount of real memory, that amount of memory may not be effectively available. When the sum of the real uncompressed memory and the expansion of compressed memory is less than the target expanded memory, the difference is referred to as the deficit.

Different circumstances can result in a deficit.

- The application data may not compress well

- The amount of memory with data which is not pinned and not used for file caching may not be enough to support the target

- A system with low memory load will not have enough working storage memory to compress. In that situation, a deficit is normal and not a problem.

If AMS is used in combination with AME, AIX may use memory compression as a method to free up some true memory (logical memory) to loan to the shared memory environment.


Figure 4-22. AME statistics (1 of 2) AN512.0

Notes:

The vmstat command has a -c option for displaying information about memory expansion and compression. In the header line, it identifies both the real memory (expansion target) and the true memory (logical memory allocated to the partition). The header line also identifies, in the memory mode field, that the partition is using expanded memory.

The vmstat iterative report lines, when using the -c option, provide five new columns.

- csz - the true size of the compressed memory pool

- cfr - the true size of the compressed memory pool which is currently available (does not hold compressed data)

- dxm - the size of the memory deficit

- ci - the number of page-ins per second from the compressed memory pool

- co - the number of page-outs per second to the compressed memory pool

The lparstat command has a -c option for displaying information about memory expansion and compression.


AME statistics (1 of 2)

# vmstat -c 2

System Configuration: lcpu=4 mem=1536MB tmem=768MB ent=0.30 mmode=dedicated-E

kthr         memory                               page
----- ----------------------------------- ------------------------
 r  b    avm    fre   csz  cfr   dxm   ci    co  pi  po
 0  0 194622 195970  6147 4425     0    1     0   0   0
 4  1 215937 172723 14477 2725     0   25 11856   0   0
 2  1 225693 144092 20630 2619 27947    9  7472   0   0
 0  0 225693 143013 20630 2552 29021  115   127   0   0
 0  0 225693 143006 20630 2554 29022    5     0   0   0

# lparstat -c 2 1000

System configuration: type=Shared mode=Capped mmode=Ded-E smt=4 lcpu=4 mem=1536MB tmem=768MB psize=16 ent=0.30

%user %sys %wait %idle physc %entc lbusy  vcsw phint %xcpu xphysc  dxm
----- ---- ----- ----- ----- ----- ----- ----- ----- ----- ------ ----
  0.5  1.7   0.0  97.8  0.01   4.1   0.0   194     0   3.0 0.0004   85
  0.1  1.4   0.0  98.5  0.01   3.0   0.0   195     0   0.0 0.0000  301
  1.2 25.7   0.6  62.4  0.13  44.9  22.7   294     0  53.7 0.0723    0
  5.0 26.2   6.0  62.8  0.12  41.3  19.0   629     0  59.9 0.0741    0
  0.4  1.7   3.4  94.5  0.01   4.8   0.2   608     0   1.7 0.0002    0


In the header line, it identifies both the real memory (expansion target) and the true memory (logical memory allocated to the partition). The header line also identifies, in the memory mode field, that the partition is using expanded memory.

The lparstat iterative report lines, when using the -c option, provide three new columns.

- %xcpu - the xphysc value as a percentage of the physc value. In other words, how much of the logical partition's total processor utilization is used for the AME overhead.

- xphysc - the amount of processor capacity that is used to execute data compression and decompression for AME.

- dxm - the size of the memory deficit
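For example, in the third interval of the lparstat report above, xphysc / physc = 0.0723 / 0.13 ≈ 0.56; allowing for physc being rounded to two decimal places, this is consistent with the reported %xcpu of 53.7. Roughly half of the partition's processor consumption in that interval was AME compression overhead.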


Figure 4-23. AME statistics (2 of 2) AN512.0

Notes:

The svmon global report has an option (summary=ame) which provides AME related details.

Below the memory line of real memory global statistics, two new rows are displayed which show the breakdown of real inuse memory into compressed and uncompressed categories. These are measurements of the amount of expanded (or effective) memory, as would be seen by the applications. They add up to the total real inuse memory.

In the section which provides columns by type of segment, the working storage column shows a breakdown into compressed and uncompressed categories. The uncompressed value is only for working storage; it obviously does not include file caching such as client segment storage.

A new section on True Memory statistics is provided.

Separate statistics for compressed and uncompressed memory are provided under the following columns:


AME statistics (2 of 2)

# svmon -G -O summary=ame,unit=MB
Unit: MB
--------------------------------------------------------------------------------
              size     inuse      free       pin   virtual available  mmode
memory     1536.00   1532.06      3.94    282.36   1607.78      0.19  Ded-E
ucomprsd         -    584.77         -
comprsd          -    947.29         -
pg space   2048.00      58.6

              work      pers      clnt     other
pin         246.14         0         0      36.2
in use     1530.13         0      1.93
ucomprsd    582.84
comprsd     947.29

--------------------------------------------------------------------------------
True Memory: 768.00

             CurSz  %Cur    TgtSz  %Tgt    MaxSz  %Max CRatio
ucomprsd    588.56 76.64   590.22 76.85        -     -      -
comprsd     179.44 23.36   177.78 23.15   416.84 54.28   5.37

              txf   cxf   dxf   dxm
AME          2.00  2.00  0.00     0

# svmon -G -O summary=longame,unit=MB -i 5
Allows long single-line iterations


- CurSz - current true sizes of compressed and uncompressed, which added together will equal the total True Memory size.

- %Cur - CurSz expressed as a percentage of total True Memory

- TgtSz - target sizes of true compressed and uncompressed memory pools which are calculated to be needed in order to reach the target expanded memory size.

- %Tgt - TgtSz expressed as a percentage of total True Memory

- MaxSz - maximum allowed size of true compressed memory (there are vmo command tunables which affect this)

- %Max - MaxSz expressed as a percentage of total True Memory

- CRatio - current compression ratio

- txf - target memory expansion factor

- cxf - current memory expansion factor

- dxf - deficit factor to reach the target expansion factor (txf - cxf)

- dxm - deficit memory to reach the target expansion
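Relating these factors to the example output above: txf = expanded memory size / true memory size = 1536.00 / 768.00 = 2.00. Since the current expansion factor cxf is also 2.00, dxf = txf - cxf = 0.00 and there is no memory deficit (dxm = 0).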

The svmon command also has a summary=longame option which will provide AME related statistics in a single long line that is good for iterative monitoring. Below is example output (the line is so long that a very small font is needed to fit the page):

# svmon -G -O summary=longame,unit=MB -i 5
Unit: MB
---------------------------------------------------------------------------------------------
                                  Active Memory Expansion
---------------------------------------------------------------------------------------------
    Size   Inuse    Free  DXMSz UCMInuse CMInuse   TMSz  TMFr   CPSz  CPFr  txf  cxf    CR
 1536.00 1218.76  317.24      0   618.19  600.57 768.00  4.64 145.18  11.6 2.00 2.00  4.49
 1536.00 1339.77  196.23      0   595.04  744.72 768.00  3.76 169.20  17.3 2.00 2.00  4.90
 1536.00 1339.81  196.19      0   594.90  744.91 768.00  3.91 169.20  17.5 2.00 2.00  4.91
 1536.00 1339.18  196.82      0   595.51  743.67 768.00  3.29 169.20  17.3 2.00 2.00  4.89
 1536.00 1459.82   76.2       0   578.64  881.18 768.00  4.14 185.21  16.6 2.00 2.00  5.22
 1536.00 1460.50   75.5       0   579.02  881.48 768.00  3.77 185.21  16.4 2.00 2.00  5.22
 1536.00 1459.86   76.1       0   578.21  881.65 768.00  4.58 185.21  16.3 2.00 2.00  5.21
 1536.00 1532.50    3.50      0   571.44  961.05 768.00  3.34 193.22  14.4 2.00 2.00  5.37

The svmon longame statistics are:

- DXMSz - size of the memory deficit

- UCMInuse - size of uncompressed memory which is in use.

- CMInuse - size of compressed memory which is in use (measured as an amount of expanded memory)

- TMSz - size of true memory pool (logical, allocated memory)

- TMFr - size of true memory which is free

- CPSz - true size of compressed memory pool


- CPFr - true size of compressed memory pool which is free (The AIX Performance Tools manual states this field to be the size of the uncompressed pool, but this author believes that to be a mistake in the manual).

With Active Memory Expansion, we have a new category of memory which is neither inuse, free, nor loaned: deficit memory. Thus, on a POWER7-based machine, the new formula is:

size = inuse + free + loaned + deficit
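Checking this against the first interval of the longame report above (a dedicated-memory partition, so no memory is loaned, and DXMSz is zero): 1536.00 = 1218.76 + 317.24 + 0 + 0.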


Figure 4-24. Active Memory Expansion tuning AN512.0

Notes:

The amepat planning tool tries to model the Active Memory Expansion behavior, given data collected during a selected time span. That modeling is not a perfect prediction and the character of the system activity can change from what was collected for planning. As a result, it is a good practice to monitor how AME is working after deployment and make any adjustments.

AME is designed to allow a trade-off between memory and CPU. If memory is the bottleneck and there is excess CPU capacity, then that excess processing capacity can be used to relieve some of the memory constraint. It is possible for the CPU overhead of AME to cause the CPU capacity to become the major performance constraint. Monitoring the overall CPU utilization, and how much of that is used for AME compression, can identify situations where either the target expansion factor needs to be reduced or the partition might benefit from a larger processor entitlement.

AME should not show a persistent memory deficit while the partition is under heavy memory demand load. A deficit is an indication that AME is unable to effectively reach the configured memory expansion factor (given the amount of compressible memory and the compressibility of the data).


Active Memory Expansion tuning
• Actual performance needs to be monitored
  – Planning tool recommendations are modeled estimates
  – Application memory characteristics may have changed:
    • Amount of memory that is file cache or pinned memory
    • Proportion of allocated memory that is repeatedly accessed
    • Compressibility of the data
• Monitor CPU overhead:
  – AME is a trade-off between memory and CPU resources
  – May need larger CPU allocation to support large expansion factor
  – May need less aggressive expansion factor to avoid excessive CPU load
• Monitor memory deficit:
  – If seeing consistent deficit while under load, notify HMC administrator
  – May be appropriate to reduce expansion factor until deficit is eliminated
  – Deficit under light loads is normal
• Once deficit is zero at an appropriate expansion factor:
  – Follow traditional memory management methods
  – May need to increase true memory
  – May need to manage memory demand


In that circumstance, it is recommended to reduce the memory expansion factor until there is no deficit displayed. If AIX needs more memory than AME can effectively provide (resulting in paging space activity), then you need to use the traditional methods of memory management: either increase the allocated (true) memory or reduce the memory demand.

Note that a deficit under light loads is normal.


Figure 4-25. Recommendations AN512.0

Notes:

Initial recommendations

These recommendations are starting points for tuning. Additional tuning may be required. The AIX defaults work well for over 95% of the installed systems.

The objectives in tuning these limits are to ensure the following:

- Any activity that has critical response time objectives can always get the page frames it needs from the free list.

- The system does not experience unnecessarily high levels of I/O because of premature stealing of pages to expand the free list.

The best recommendation is that unless you can identify a memory performance problem, do not “tune” anything!

The second recommendation involves the changes made to the defaults for vmo tunables in AIX6 (especially the lru_file_repage, minperm, and maxperm changes). If on AIX 5.3, a good starting point is to set the vmo values to the AIX6 defaults.


Overall Recommendations
• If memory is overcommitted and impacting performance:
  – Add logical memory (for example, use DLPAR to increase allocation)
  – Reduce demand (especially wasteful demand)
  – Consider sharing memory using AMS
  – Consider implementing expanded memory with AME
• The primary tuning recommendation is: if it is not broken, do not fix it!
• The AIX6 vmo tuning defaults are already well tuned for most systems
  – Use of outdated tuning recommendations can cause problems
  – If back-leveled at AIX 5L V5.3, use the AIX6 default vmo parameter values as a starting point
• If the free list is driven to zero or sustained below minfree, increasing minfree and maxfree may be beneficial.
  – maxfree = minfree + (maxpgahead or j2_maxPageReadAhead)
• Increasing minperm% may be beneficial when working segments dominate, if:
  – Computational allocations are due to wasteful application memory management, perhaps even a memory leak
  – and I/O performance is being impacted


If at AIX 6 or later, be very careful about applying outdated tuning recommendations; they will likely degrade your performance.

minfree and maxfree parameters

If bursts of activity are frequently driving the free list to well below the minfree value (and even to zero), then it can be beneficial to increase these thresholds.

The difference between the maxfree and minfree parameters should always be equal to or greater than the value of the maxpgahead ioo parameter, if you are using JFS. For Enhanced JFS, the difference between the maxfree and minfree parameters should always be equal to or greater than the value of the j2_maxPageReadAhead parameter. If you are using both JFS and Enhanced JFS, then the difference should be greater than or equal to the larger of the two file systems' page-ahead values.
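As an illustrative sketch of checking these relationships before making a change (the values shown are examples, not recommendations):

# vmo -L minfree maxfree       # current values, defaults, and valid ranges
# ioo -o maxpgahead            # JFS maximum read-ahead, in 4 KB pages
# ioo -o j2_maxPageReadAhead   # JFS2 maximum read-ahead, in 4 KB pages

The difference maxfree - minfree can then be compared against the larger of the two read-ahead values before committing a change with vmo -p.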

Making parameter changes

Some changes should be strung together on a single command line, in the order given, to avoid possible error messages:

- minfree and maxfree

For example, vmo -p -o minfree=1024 -o maxfree=1536.


Figure 4-26. Managing memory demand AN512.0

Notes:

Overview

As with any resource type, the most important factor in performance management is balancing resource with demand. If you cannot afford more memory, then you need to reduce demand. Remember that the resource and demand balance varies from one time period to the next and from server to server. Peak demand on one machine may be in the middle of the afternoon, while peak demand on another machine may be early in the morning. If possible, shift work to off-peak periods. If that is not possible, shift the work to a server which has its peak demand at a different time. The AIX6 ability to relocate live workload partitions to a different server makes this easy. The PowerVM based Live Partition Mobility is another way to do this dynamically. Both methods also support static relocation, which involves first shutting down either the workload partition or the logical partition.


Managing memory demand
• Shift high memory load applications to:
  – a lower memory demand time slot
  – a machine with underutilized memory
• Adjust applications to:
  – Only allocate memory that is really needed
  – Only pin memory when necessary
• Fix memory leaks or periodically cycle the leaking application
• Consider use of direct I/O (no file system caching), if:
  – Application does its own caching
  – Application does random access or optimizes its own sequential access
• Consider using file system release-behind mount options, if:
  – Files in the file system are not soon reaccessed, thus not benefiting from sustained caching in memory
• Unmount file systems when not being used


Tuning applications

Some applications can be configured to adjust the amount of computational memory they allocate. The most common example of this is databases. Databases often allocate massive amounts of computational memory to cache database storage contents. Sometimes the amount is excessive; given enough computational memory, they will cache data that has a low hit ratio (not accessed often). If the system is memory constrained, this is a wasteful use of memory. Note that this application-managed caching is considered computational storage by AIX. You should work with the application administrators (in our example, that would be the database administrator) to determine the appropriate amount of computational memory to be allocated.

Memory leaks

Another way in which applications can waste memory is by having a coding bug which allocates memory, loses track of that allocation, and does not free that memory. The next time the application needs memory it allocates new memory and then loses track of that allocation. This is referred to as a memory leak. The application, with each iteration of its logic, keeps allocating memory but never frees any (and is not using it). This tends to fill up memory. Over the long term, the application needs to have the coding error fixed. In the short term, you need to periodically quiesce the application, terminate the process (all threads), and then restart the application. This is often referred to as cycling the application. When the application is terminated, AIX will then free up all the memory owned by the application.
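A minimal sketch of confirming a suspected leak (the process ID and sampling interval are placeholders): sample the process size periodically and look for growth that never levels off:

# while true
> do
>   date; ps v 245762 | tail -1    # watch the SIZE (KB) column over time
>   sleep 600
> done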

Direct I/O

AIX caches file contents in memory for two main reasons:

- It assumes that the file will be re-referenced, and wants to avoid disk I/O by keeping it in memory

- It allows sequential disk operations to be optimized. (This involves read-ahead and write-behind mechanisms, which will be covered later.)

For applications which do not need these benefits, you can tell the file system to do no automatic memory caching. The best example, again, is the database. Most database engines will do their own caching of data. Since the database engine has a better understanding of data access patterns, it can use this intelligence to manage what data is kept in its private cache. There is no reason to cache data both for the file system (persistent or client segment) and in the database engine's computational memory. In addition, database access tends to be random access of relatively small blocks rather than sequential. As such, it does not benefit as much from the file system sequential access mechanisms.


These applications can usually be configured to request Direct I/O which eliminates the file system memory caching. Details on Direct I/O will be covered later.
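As a hedged example (the mount point is a placeholder), direct I/O can also be requested for every file in a JFS2 file system with the dio mount option:

# mount -o dio /dbdata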

Unmounting and release-behind

Some applications benefit greatly from the file system I/O optimization that depends on file caching, but do not reaccess file data blocks very frequently. They write a file but then do not reaccess it for a long time (it may be the next day or perhaps not until the end of an accounting period). Or they access a large file just once for generating an end of month report, but do not read it again until the end of the year.

AIX does not know that the file is not likely to be reaccessed, so it tries to keep it cached in memory even after the application terminates. If the files which follow this access pattern can be placed in one or more special file systems, then we can tell the kernel file system services to not keep these files cached in memory once the immediate application access is complete. If the file system is mounted with a release-behind option, then once a file's data is delivered to the application (on a read) or written to disk (on a write), AIX frees that memory to be placed on the free list.
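For example (the mount points are placeholders), JFS2 accepts release-behind options at mount time: rbr (release-behind after a read), rbw (release-behind after a write), and rbrw (both):

# mount -o rbr /archive
# mount -o rbw /logs
# mount -o rbrw /scratch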

Another alternative, for applications that only run at certain times, is to place all of that application’s files in a separate file system. Then only mount the filesystem when the application is running. When the application completes its run, unmount the filesystem. AIX frees all file cache memory contents related to that file system when it is unmounted.


Figure 4-27. Checkpoint (1 of 2) AN512.0

Notes:


Checkpoint (1 of 2)

1. What are the three virtual memory segment types? _____________, _____________, and _____________

2. What type of segments are paged out to paging space? __________________

3. What are the two classifications of memory (for the purpose of choosing which pages to steal)? __________________ and ___________________

4. What is the name of the kernel process that implements the page replacement algorithm? _______


Figure 4-28. Checkpoint (2 of 2) AN512.0

Notes:


Checkpoint (2 of 2)

5. List the vmo parameter that matches the description:

a. Specifies the minimum number of frames on the free list when the VMM starts to steal pages to replenish the free list _______

b. Specifies the number of frames on the free list at which page stealing stops ______________

c. Specifies the point below which the page stealer will steal file or computational pages regardless of repaging rates ___________

d. Specifies whether or not to consider repage rates when deciding what type of page to steal ________________


Figure 4-29. Exercise 4: Virtual memory analysis and tuning AN512.0

Notes:


Exercise 4: Virtual memory analysis and tuning

• Use VMM monitoring and tuning tools to analyze memory over-commitment, file caching, and page stealing

• Use PerfPMR to examine memory data
• Identify a memory leak
• Use WPAR manager resource controls
• Examine statistics in an active memory sharing (AMS) environment


Figure 4-30. Unit summary AN512.0

Notes:


Unit summary

This unit covered:

• Basic virtual memory concepts and what issues affect performance
• Describing, analyzing, and tuning page replacement
• Identifying memory leaks
• Using the virtual memory management (VMM) monitoring and tuning tools
• Analyzing memory statistics in Active Memory Sharing (AMS) and Active Memory Expansion (AME) environments


Unit 5. Physical and logical volume performance

What this unit is about

This unit describes the issues related to physical and logical volume performance. It shows you how to use performance tools and how to configure your disks, adapters, and logical volumes for optimal performance.

What you should be able to do

After completing this unit, you should be able to:

• Identify factors related to physical and logical volume performance

• Use performance tools to identify I/O bottlenecks

• Configure logical volumes for optimal performance

How you will check your progress

Accountability:

• Checkpoint • Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)

SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Figure 5-1. Unit objectives AN512.0

Notes:


Unit objectives

After completing this unit, you should be able to:

• Identify factors related to physical and logical volume performance

• Use performance tools to identify I/O bottlenecks

• Configure logical volumes for optimal performance


Figure 5-2. Overview AN512.0

Notes:

As with any performance analysis, the bottom line is whether the applications are performing to the expectations (or needs) of the application users. For this, there are often metrics provided by the application which are more appropriate than the operating system metrics.

If the performance is not what you expect, then you need to look at where you can make improvements. An obvious starting point is the devices that hold the data. What is the expected response time and throughput for the disk drive or storage subsystem? Do the actual measurements meet those expectations? If the storage is performing to its specifications but is still too slow, are there better and faster storage solutions you can invest in?

Every device has its limits on how many requests per second or how much data per second it can handle. The problem may be that the storage device is saturated. For that, you either need to get a better device or see if you can shift some of that load to another device. It is not uncommon to find expensive new hardware underutilized while the older hardware is being pushed to its limits. The same principle can be applied to any component in the path. It may be that the storage adapter or the PCI bus is saturated.


Overview
• Does I/O response and throughput meet expectations?
• If not, what is the cause of the degraded performance?
  – Disk subsystem underperforming to specifications
  – Saturated disk subsystem, storage adapter, or adapter bus
  – Lack of I/O overlap processing (serialization)
  – Shortage of LVM logical resources
  – Fragmentation and improper block sizes
  – File system issues
  – Shortage of CPU or memory resources
• Can the workload be managed?
  – Load balancing workload across adapters or disks
  – Isolate contending workloads to different resources
  – I/O pacing
  – Shift of where and when certain programs run


Again, shifting an adapter to a different bus, or spreading traffic across more adapters, can often have a significant payback.

Within AIX, there are layers to the storage architecture, and each layer has queues and control blocks which can be short of what is needed. These pools and queues can often be increased; be careful to understand the consequences.

One of the common principles of performance is that the bigger the unit of work, the more efficient the processing. The size of the application's read and write requests, LVM stripe unit sizes, the configuration of the file system block sizes and mechanisms, and the logical track group size of the volume group all can have an effect. Even if we do all of this correctly, fragmentation can break up the flow of processing the data requests.

There are many file system mechanisms that can affect performance and these will be covered in the next unit.

Always remember that overall performance can be affected by memory and CPU constraints. It is necessary to look at the whole picture. Some efforts to improve I/O performance can have a negative effect on these other factors.

Moving to the demand side of the situation, sometimes you need to manage the workload within the constraints of the resources. The techniques of spreading workloads across more buses, more adapters, or more disks are part of this. But, sometimes you need to identify which programs are generating the I/O load and decide what should continue to run on this server at this time and which work should be shifted. The workload might be moved to another server or delayed until a time slot when the resources are not saturated. Some applications are designed to work in clustered environments with transactions automatically load balanced between the nodes.

I/O pacing is a file system technique which can pace (slow down) batch I/O loads to give interactive requests better response time.


Figure 5-3. I/O stack AN512.0

Notes:

Overview

The Logical Volume Manager provides the following features:

- Translation of logical disk addresses to physical disk addresses

- Dynamic allocation of disk space

- Mirroring of data (at the logical volume level) for increased availability

- Bad block relocation

The heart of the LVM is the Logical Volume Device Driver (LVDD).

When a process requests a disk read or write, the operation involves the file system, VMM and LVM. But, these layers are transparent to the process.


I/O stack
  Application
  Logical file system
  File systems (JFS, JFS2, NFS, others)
  VMM (file caching)
  LVM (logical volume)
  Disk Device Drivers (physical volume)
  Adapter Device Drivers
  Disk Subsystem (optional)
  Disk
(Raw LV, Raw Disk, and DIO access paths enter this stack below the layers they bypass.)


File system I/O

When an application issues an I/O, it may be to a file in a file system. If so, then the file system may either send the I/O to the VMM or directly to the LVM. The VMM will send the I/O to the LVM. The LVM sends the I/O to the disk driver which will communicate with the adapter layer. With the file system I/O path, the application can benefit from the VMM read-ahead and write-behind algorithms but may run into performance issues because of inode locking at the file system level.

Raw LVM I/O

The application can also access the logical volume directly. This is called raw LVM I/O. Database applications often do this for performance reasons. With this type of access, you avoid two layers (VMM and the file system). You also avoid having to acquire the inode lock on a file.

Raw disk I/O

An application can bypass the Logical Volume Manager altogether by issuing the I/O directly to the disk device. This is normally not recommended since you lose the easier system management provided by the LVM. However, this does avoid going through one more layer (LVM). Issuing I/O this way can be a good test of a disk’s capabilities, thus taking VMM, file system, and LVM out of the picture.

The disk device itself may be a simple physical disk drive or it could be a logical hdisk that is comprised of several physical disks (such as in a RAID array) or the disk device could be just a path specification to get to the actual logical or physical disk (as is the case for vpath or powerpath devices).

Using a raw device

When using a raw device (LVM or disk I/O), the mechanism bypasses the normal AIX I/O processing. To use this, the character device must be used. Failure to use the character device will cause the I/O to be buffered by LVM and can result in up to a 90% reduction in throughput. The character device name always begins with an r. So, accessing /dev/hdisk1 will be buffered by LVM, but /dev/rhdisk1 will bypass all buffering. This is the only correct way to access a raw device for performance reasons.
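As an illustrative sketch (the device name, block size, and count are placeholders; a read-only test is safe, but double-check the if and of operands), the difference can be observed with a simple sequential read test:

# dd if=/dev/rhdisk1 of=/dev/null bs=256k count=4000   # character (raw) device
# dd if=/dev/hdisk1 of=/dev/null bs=256k count=4000    # block device (buffered by LVM)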


Figure 5-4. Individual disks versus disk arrays AN512.0

Notes:

For traditional disks, AIX controls where data is located through the AIX Logical Volume Manager (LVM) intra-policies. Poor placement of data can result in increased access arm movement and rotational delays getting to the desired sector.

For disk arrays, the controller decides where to place the data, most commonly striping the data across the disks with distributed parity (RAID5). In this environment, it makes no difference whether the logical volume is on the outer edge or in the center, since the disk array administrator is controlling where it is actually placed.

Another major difference is that almost all disk arrays have their own data caching. This allows the controller to collect many write requests and then optimize how and when the data is written to physical disk. It also allows the controller to recognize patterns of sequential access and to anticipate the next request by sequentially reading ahead. When the next host read request comes in, the data is already in the cache and ready to be transferred to the host.


Individual disks versus disk arrays
• Disks: AIX controls the position of data on the platter, in bands from the outside inward: (outer) edge, (outer) middle, center, inner middle, inner edge.
• Disk arrays: the array controller spreads the data; AIX sees each LUN as an hdisk.


Figure 5-5. Disk groups AN512.0

Notes:

In the traditional disk environment it is common to spread data between disks, both for availability and for performance. I/O requests sent to one disk do not interfere with requests sent to another disk. The requests can be processed in parallel with less workload on each of the disks. This is a great way to deal with a single saturated disk.

In the disk array environment, it is possible that two hdisks are actually LUNs allocated out of the same disk array. If this was an LVM mirroring situation, this would make nonsense out of the availability planning: the disk array crashes and both hdisks are gone. To help avoid this, AIX allows you to formally define mirror groups. A mirror group is a group of hdisks which actually reside on the same disk array. The mirror group functionality allows the system to reject attempts to mirror where copies would be in the same mirror group. Even when not mirroring, it is important to be aware of these relationships. To convey this, the course will use the term disk groups to define hdisks which are LUNs from the same disk array. There is no need or benefit to defining a formal mirror group if not doing LVM mirroring.


Disk groups
• Common to balance the workload between hdisks
• A disk group is a group of LUNs on the same disk array
• Similar in concept to AIX mirror groups for availability
• If hdisks are in the same disk group, there is no real workload balancing (example: LUN1 and LUN2)
• To balance workload with SANs, choose disks in different disk groups (for example: LUN1 and LUN3)


Spreading data between hdisks in the same disk group would have no benefit. A single disk array would still be a single point of resource contention. But, spreading data between hdisks which are in different disk groups can still be very beneficial.

A similar concept can occur in a virtual SCSI environment with the LPARs' virtual disks being backed by logical volumes that are on a single disk at the VIOS server. They appear as two hdisks but, in reality, they are out of the same physical disk, which acts as a single point of resource contention. This is why most VIOS administrators back virtual disks with physical volumes. But it is important in that situation to coordinate with the VIOS administrator and the SAN storage administrator to understand ultimately which virtual disks are in the same disk group and which are not.


Figure 5-6. LVM attributes that affect performance AN512.0

Notes:

Inter-physical volume allocation policy

The Inter-Physical Volume Allocation policy specifies which strategy should be used for choosing physical devices to allocate the physical partitions of a logical volume. The choices are:

- MINIMUM (the default)

- MAXIMUM

The MINIMUM option indicates the number of physical volumes used to allocate the required physical partitions. This is generally the policy to use to provide the greatest reliability, without having copies, to a logical volume. The MINIMUM option can be interpreted in one of two different ways, based on whether the logical volume has multiple copies or not:

- Without Copies: The MINIMUM option indicates one physical volume should contain all the physical partitions of this logical volume.


LVM attributes that affect performance
• Disk band locality issues – affecting only individual disks
  – Position on physical volume (intra-policy)
  – Active mirror write consistency (cache on outer edge)
  – Logical volume fragmentation (for random I/O loads)
• Managing location across hdisks – disks and disk groups
  – Range of physical volumes (inter-policy)
  – Maximum number of physical volumes to use
  – Number of copies of each logical partition (LVM mirroring)
• Extra I/O traffic – affects all I/O environments
  – Enabling write verify
  – Active mirror write consistency (extra write to update MWC cache before each data write)


If the allocation program must use two or more physical volumes, it uses the minimum number possible, remaining consistent with the other parameters.

- With Copies: The MINIMUM option indicates that as many physical volumes as there are copies should be used. Otherwise, the minimum number of physical volumes possible are used to hold all the physical partitions. At all times, the constraints imposed by other parameters such as the strict option are observed. (The strict allocation policy allocates each copy of a logical partition on a separate physical volume.)

These definitions are applicable when extending or copying an existing logical volume. For example, the existing allocation is counted to determine the number of physical volumes to use in the minimum with copies case.

The MAXIMUM option indicates the number of physical volumes used to allocate the required physical partitions. The MAXIMUM option intends, considering other constraints, to spread the physical partitions of this logical volume over as many physical volumes as possible. This is a performance-oriented option and should be used with copies to improve availability. If an uncopied logical volume is spread across multiple physical volumes, the loss of any physical volume containing a physical partition from that logical volume is enough to cause the logical volume to be incomplete.

To specify an inter-physical policy use the -e argument of the mklv command. The options are:

- x - Allocate across the maximum number of physical volumes

- m - Allocate the logical partitions across the minimum number of physical volumes

The mklv -u UpperBound argument sets the maximum number of physical volumes for new allocation. The value of the Upperbound variable should be between one and the total number of physical volumes in the volume group.

For example, to create a logical volume with 4 logical partitions that are spread across three physical volumes:

# mklv -u 3 -e m datavg 4

Intra-physical volume allocation policy

The Intra-Physical Volume Allocation policy specifies what strategy should be used for choosing physical partitions on a physical volume. The five general strategies are:

- EDGE

- INNER EDGE

- MIDDLE

- INNER MIDDLE

- CENTER


The Intra-Physical Volume Allocation policy has no effect on a caching disk subsystem; it only applies to real physical drives. It also applies when setting up VIO devices from the server.

Physical partitions are numbered consecutively, starting with number one, from the outer-most edge to the inner-most edge.

The EDGE and INNER EDGE strategies specify allocation of partitions to the edges of the physical volume. These partitions have the slowest average seek times, which generally result in longer response times for any application that uses them. Outer edge (EDGE) on disks produced since the mid 1990s can hold more sectors per track so that the outer edge is faster for sequential I/O.

The MIDDLE and INNER MIDDLE strategies specify to stay away from the edges of the physical volume and out of the center when allocating partitions. These strategies allocate reasonably good locations for partitions with reasonably good average seek times. Most of the partitions on a physical volume are available for allocation using this strategy.

The CENTER strategy specifies allocation of partitions to the center section of each physical volume. These partitions have the fastest average seek times, which generally result in the best response time for any application that uses them. There are fewer partitions on a physical volume that satisfy the CENTER strategy than any other general strategy.

To specify an intra-physical policy use the -a argument of the mklv command. The options are:

- e - Edge (Outer edge)

- m - Middle (Outer middle)

- c - Center

- im - Inner middle

- ie - Inner edge

For example, to create a logical volume in the center of the disk with 4 logical partitions:

# mklv -a c datavg 4

The Intra-Physical Volume Allocation policy may or may not matter depending on whether a disk or disk subsystem is being used for the storage. This is because AIX has no real control over what part of the real disk in the subsystem is actually used.

Other miscellaneous LV attributes:

Various other factors can be controlled when creating a logical volume:

- Allocate each logical partition copy on a separate PV specifies the strictness policy to follow. A value of 'yes' means the strict allocation policy will be used, which means no copies of a logical partition are permitted to reside on the same physical volume.


A value of 'no' means the nonstrict policy is used, which allows copies of logical partitions to reside on the same physical volume. A value of superstrict uses an allocation policy that ensures that no partition from one mirror copy resides on the same physical volume that has partitions from another mirror copy of the logical volume.

- Relocate LV during reorganization specifies whether to allow the relocation of the logical volume during reorganization. For striped logical volumes, the relocate parameter must be set to no (the default for striped logical volumes). Depending on your installation you may want to relocate your logical volume.

- Write verify sets an option for the disk to do whatever its “write verify” procedure is. Exactly what happens is up to the disk vendor. This is implemented completely in the disk. This will negatively impact performance.

- Logical volume serialization serializes overlapping I/Os. When serialization is enabled, it will force serialization on concurrent writes to the same disk block. Most applications, like file systems and databases, do their own serialization, so LV serialization should be turned off. The default for new logical volumes is off. Enabling this parameter can degrade I/O performance. Operations are serialized if they are issued to the same data block.


Figure 5-7. LVM mirroring AN512.0

Notes:

Introduction

LVM mirroring is a form of disk mirroring. Rather than an entire disk being mirrored, the individual logical volume can be mirrored. LVM mirroring is turned on for a logical volume when the number of copies is greater than one.

Number of copies

The number of copies could be:

- One: No mirror

- Two: Double mirror which protects against a single point of failure

- Three: Triple mirror which protects against multiple disk failures

Mirroring helps with high-availability because in case a disk fails, there would be another disk with the same data (the copy of the mirrored logical volume).


LVM mirroring
• LVM mirroring provides software mirroring for either individual logical volumes or all logical volumes in a volume group
• Mirror write consistency (MWC)
  – Ensures all copies are the same after a crash
  – Active MWC records mirror write activity in a MWC cache (MWCC) on the outer edge of the hdisk.
  – The logical volume location can cause excessive access arm movement between the LV and the MWCC on every write.
  – Passive MWC (big VG only) does not use a MWC cache
• Mirroring may benefit read performance but could degrade write performance if mirror write consistency is on or active


Copies on a physical volume

When creating mirrored copies of logical volumes, use only one copy per disk. If you had the copies on the same disk and the disk fails, mirroring would not have helped with high-availability. In a SAN storage environment, use only one copy per mirror group.

Scheduling policies

There are several scheduling policies that are available when a logical volume is mirrored. The appropriate policy is chosen based on the availability requirements and the performance characteristics of the workloads accessing the logical volume.

Performance impact

Mirroring may have a performance impact since it does involve writing two to three copies of the data. Mirroring also adds to the cost because of the necessity for additional physical disk drives. In the case of reads, mirroring may help performance.

Mirror write consistency

The LVM always ensures mirrored copies of a logical volume are consistent during normal I/O processing. To ensure consistency of mirrored copies, for every write to a logical volume, the LVM generates a write request for every mirror copy. A problem can occur if the system crashes before all the copies are written. If Active Mirror Write Consistency recovery is requested for a logical volume, the LVM keeps additional information to allow recovery of these inconsistent mirrors. Mirror Write Consistency recovery should be performed for most mirrored logical volumes. Logical volumes, such as paging space, should not have MWC on since the data in the paging space logical volume is not reused when the system is rebooted.

Caching disk subsystems

When mirroring on a caching disk subsystem, AIX does not know how to make sure each copy of the logical volume is on a separate storage system. It is the responsibility of the administrator to take care of this, by requesting that the related LUNs be allocated out of separate disk arrays. Defining mirror groups will allow AIX to assist with this, since it will refuse to allocate two copies in the same mirror group.

Because the writes are cached, MWC has less effect on a caching disk subsystem. If the disk subsystem writes are slow (over 5 ms), MWC may have a significant impact on the performance of the system. This is because the MWC writes are synchronous and must be completed before the writing of the actual data.


Mirror write consistency

The Mirror Write Consistency (MWC) record consists of one sector. It identifies which logical partitions may be inconsistent if the system is not shut down correctly. When the volume group is varied back online, this information is used to make the logical partitions consistent again. Since the MWC control sector is on the edge of the disk, performance may be improved if the mirrored logical volume is also on the edge.

Active MWC

With active MWC, mirrored writes do not return until the Mirror Write Consistency check cache has been updated. This can have a negative effect on write performance. This cache holds approximately 62 entries. Each entry represents a logical track group (LTG). The LTG size is a configurable attribute of the volume group.

When a write occurs to a mirrored logical volume, the cache is checked to see if the write is in the same Logical Track Group (LTG) as one of the LTGs in the cache. If that LTG is there, then no consistency write is done. If not, then the cache is updated with this LTG entry and a consistency check record is written to each disk in the volume group that contains this logical volume. The MWC does not guarantee that the absolute latest write is made available to the user. MWC only guarantees that the images on the mirrors are identical.

You may choose to turn off MWC as long as the system administrator sets auto_varyon to false and does a syncvg -f on the volume group after rebooting. However, recovery from a crash will take much longer since ALL partitions will have to be resync’d. MWC gives the advantage of fast recovery when it is turned on.

Passive MWC

Passive MWC is available for logical volumes that are part of a big volume group.

A normal volume group is a collection of 1 to 32 physical volumes of varying sizes and types. A big volume group can have from 1 to 128 physical volumes. The mkvg -B option is used to create a big volume group. Most large volume groups are configured as scalable volume groups and thus cannot use passive MWC.

Active MWC’s disadvantage is the write performance penalty (which can be substantial in the case of random writes). However, it provides for fast recovery at reboot time if a crash occurred. By disabling active MWC, the write penalty is eliminated but after boot, the entire logical volume has to be resync’d by hand (using syncvg -f) before users can use that logical volume (autovaryon must be off). However, with passive MWC, not only is the write penalty eliminated but the administrator does not have to do the syncvg or set autovaryon off. Instead, the system will automatically resync the entire logical volume if it detects that the system was not shut down properly. This resyncing is done in the background. The disadvantage is that reads may be slower until the partitions are resync’d.


The passive option may be chosen in SMIT or with the mklv or chlv commands when creating or changing a logical volume. Just like with active MWC, these options take effect only if the logical volume is mirrored.
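As a sketch of those commands, again using illustrative names (datalv, datavg):

# mklv -y datalv -c 2 -w p datavg 10 (create a two-copy LV with passive MWC)
# chlv -w y datalv (switch an existing LV to active MWC)
# lslv datalv | grep 'MIRROR WRITE' (verify the current MWC setting)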


Figure 5-8. LVM mirroring scheduling policies AN512.0

Notes:

Introduction

The scheduling policy determines how reads and writes to mirrored logical volumes are handled.

Parallel

The default policy is parallel which balances reads between the disks. When a read occurs, the LVM will initiate the read from the primary copy if the disk which contains that copy is not busy. If that disk is busy, then the disk with the secondary copy is checked. If that disk is also busy, then the read is initiated on the disk which has the least number of outstanding I/Os. In the parallel policy, the writes are written out in parallel (concurrently). The LVM write is not considered complete until all copies have been written.


LVM mirroring scheduling policies
• Parallel (default):
  – Read I/Os will be sent to the least busy disk that has a mirrored copy
  – Write I/Os will be sent to all copies concurrently
• Sequential:
  – Read I/Os will only be sent to the primary copy
  – Write I/Os will be done in sequential order, one copy at a time
• Parallel write/sequential read:
  – Read I/Os will only be sent to the primary copy
  – Write I/Os will be sent to all copies concurrently
• Parallel write/round-robin read:
  – Read I/Os are round-robin’d between the copies
  – Write I/Os will be sent to all copies concurrently


In a caching disk subsystem, the parallel policy works well for highly random data access with relatively small data transfers (128 KB or less).

Sequential

With the sequential policy, all reads will only go to the primary copy. If the read operation is unsuccessful, the next copy is read, and then the primary copy is fixed by turning the read operation into a write operation with hardware relocation specified on the call to the physical device driver.

Writes will occur serially with the primary copy updated first and when that is completed, the secondary copy is updated, and then the tertiary copy (if triple mirroring is enabled).

Parallel write/sequential read

With the parallel write/sequential read policy, writes will go to the copies concurrently (just like the parallel policy) but the reads are only from the primary copy (like the sequential policy). This policy tends to get better performance when you’re doing sequential reads from the application.

This is highly beneficial when dealing with caching disk subsystems that are intelligent enough to perform their own internal read ahead. By using this, all the reads go to one copy of the disk and the subsystem sees them as sequential and internally performs read ahead. Without this, the reads can be sent to different copies of the logical volume and the disk system is unlikely to see much sequential read activity. Therefore, it will not perform its own internal read ahead. The same problem can occur with fragmented files as you cross over the boundary between fragments.

Parallel write/round-robin read

With the parallel write/round-robin read policy, writes will go to the copies concurrently (just like in parallel policy). The reads, however, are initiated from a copy in round-robin order. The first time it could be from the primary, the next read from the secondary copy, the next read back to the primary, the next to the secondary copy, and so on. This results in equal utilization across the copies in the case of reads when there is never more than one outstanding I/O at a time. This policy can hurt sequential read performance however since reads are broken up between the copies.
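As a brief sketch of inspecting and changing the policy (the logical volume name datalv is illustrative), the mklv and chlv -d values are p (parallel), s (sequential), ps (parallel write/sequential read), and pr (parallel write/round-robin read):

# lslv datalv | grep 'SCHED POLICY' (show the current scheduling policy)
# chlv -d ps datalv (switch to parallel write/sequential read)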


Figure 5-9. Displaying LV fragmentation AN512.0

Notes:

Using lslv -l

The lslv -l output shows several characteristics of the logical volume. The PerfPMR config.sum file lists the output of lslv for each logical volume.

The COPIES column shows the disks where the physical partitions reside. There are three columns, the first column is for the primary copy, the second column is for the secondary copy (if mirroring is enabled) and the third column is for the tertiary copy (if mirroring is enabled).

The IN BAND column shows the percentage of the partitions that met the intra-policy criteria.

The DISTRIBUTION column shows the locations of the physical partitions of this logical volume as numbers separated by a colon (:). Each of these numbers represents an intra-policy location: the first column is edge, then middle, then center, then inner-middle, and then inner-edge.


Displaying LV fragmentation
# lslv -l lv01

lv01:/mydata
PV        COPIES            IN BAND    DISTRIBUTION
hdisk0    024:000:000       95%        000:001:023:000:000

# lslv -p hdisk0 lv01

hdisk0:lv01:/mattfs
USED USED USED USED USED USED USED USED USED USED     1-10
USED USED USED USED USED USED USED USED USED 0153    11-20
0154 0155 0156 USED USED USED USED USED USED USED    21-30
USED USED USED USED USED USED 0157 0158 0159 0161    31-40
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE    41-50
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE    51-60
FREE FREE FREE FREE FREE FREE                        61-66

USED USED USED USED USED USED USED USED USED USED    67-76
USED USED USED USED USED USED USED USED USED USED    77-86
USED USED USED USED USED USED USED USED USED USED    87-96
USED USED USED USED USED USED USED 0160 USED USED    97-106
USED USED USED USED USED USED USED USED USED USED   107-116
USED USED USED USED USED USED USED USED FREE FREE   117-126
FREE FREE FREE FREE FREE                             127-131

USED USED USED USED 0001 0002 0003 0004 0005 0006   132-141
0007 0008 0009 0010 0011 0012 0013 0014 0015 0016   142-151
0017 0018 0019 0020 0021 0022 0023 0024 0025 0026   152-161
0027 0028 0029 0030 0031 0032 0033 0034 0035 0036   162-171


Of the remaining percentage outside the IN BAND value, the partitions may be on a different part of the disk and may be fragmented. On the other hand, even if the partitions were all in band, that does not guarantee that they are not fragmented. Therefore, the lslv -p data should be looked at next.

Using lslv -p

Logical volume fragmentation occurs if logical partitions are not contiguous across the disk. The lslv -p command shows the logical volume allocation map for the physical volume given.

The state of the partition is listed as one of the following:

- USED indicates that the physical partition at this location is used by a logical volume other than the one specified with lslv -p.

- FREE indicates that this physical partition is not used by any logical volume.

- STALE indicates that the specified partition is no longer consistent with other partitions. The system lists the logical partition number with a question mark if the partition is stale.

- Where it shows a number, this indicates the logical partition number of the logical volume specified with the lslv -p command.

Logical volume intra-policy:

The intra policy that the logical volume will use in allocating storage can be seen in the lslv listing of the logical volume attributes:

# lslv lv01
LOGICAL VOLUME:     lv01                   VOLUME GROUP:   datavg
LV IDENTIFIER:      0001d2ba00004c00000000f98ba97636.5    PERMISSION:  read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            32512                  PP SIZE:        32 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                224                    PPs:            224
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       center                 UPPER BOUND:    32
MOUNT POINT:        /mydata                LABEL:          /mydata
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO


Figure 5-10. Using iostat AN512.0

Notes:

Introduction

The iostat command is used for monitoring system input/output device load by observing the time the physical disks are active in relation to their average transfer rates. It does not provide data for file systems or logical volumes. The iostat command generates reports that can be used to change the system configuration to better balance the input/output load between physical disks and adapters.

Data collection

iostat displays a system configuration line, which appears as the first line displayed after the command is invoked. If a configuration change is detected during a command execution iteration, a warning line will be displayed before the data which is then followed by a new configuration line and the header.


Using iostat
# iostat 5

System configuration: lcpu=2 drives=3 paths=2 vdisks=0

tty:      tin     tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.0     86.8              56.8    43.2   0.0     0.0

Disks:    % tm_act    Kbps      tps     Kb_read   Kb_wrtn
hdisk0    99.7        7676.0    248.9   4260      72500
hdisk1    0.0         0.0       0.0     0         0
cd0       0.0         0.0       0.0     0         0

tty:      tin     tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.0     97.1              57.7    42.3   0.0     0.0

Disks:    % tm_act    Kbps      tps     Kb_read   Kb_wrtn
hdisk0    98.5        8381.7    261.8   4420      79648
hdisk1    0.0         0.0       0.0     0         0
cd0       0.0         0.0       0.0     0         0


The iostat command now reports only current intervals, so the first interval of the command output is meaningful and does not represent statistics collected since system boot. Internal to the command, the true first interval is never displayed, and therefore there may be a slightly longer wait for the first displayed interval to appear. Scripts that discard the first interval should function as before.

Disk I/O statistics since last reboot are not collected by default. Prior to AIX 5L V5.3, the first line of output displayed the message: Disk History Since Boot Not Available.

When iostat is run without an interval, it attempts to show only the statistics since last reboot. Because that history is not collected by default (this is configurable using the iostat attribute of the sys0 device), it displays this message:

Disk History Since Boot Not Available

To check current settings, enter the following command:

# lsattr -E -l sys0 -a iostat

To enable this data collection, enter the following command:

# chdev -l sys0 -a iostat=true

Report data

There are two sections to the iostat report. By default, both are displayed. You can restrict the report to only one of the sections. The sections are:

- tty and CPU utilization (iostat -t specifies the tty/CPU utilization report only)

- Disk utilization (iostat -d specifies the disk utilization report only)

The columns in the disk utilization report are:

- Disks lists the disk name.

- %tm_act specifies the percentage of time during that interval that the disk had at least one I/O in progress. A drive is active during data transfer and command processing, such as seeking to a new location.

- Kbps indicates the throughput of that disk during the interval in kilobytes per second. This is the sum of Kb_read plus Kb_wrtn, divided by the number of seconds in the reporting interval.

- tps indicates the number of physical disk transactions per second during that monitoring period. A transfer is an I/O request to the physical disk. Multiple logical requests can be combined into a single I/O request to the disk. A transfer is of indeterminate size.

- Kb_read indicates the kilobytes of data read during that interval.

- Kb_wrtn indicates the kilobytes of data written on that disk during the interval.


When running PerfPMR (in the iostat.sh script), this information is put in the monitor.int file.

The flag, -D, provides the following additional information:

- Metrics related to disk transfers

- Disk read service metrics

- Disk write service metrics

- Disk wait queue service metrics

The -l flag (lowercase L) can be used with the -D flag to provide a long listing, which makes it easier to read.

iostat -Dl data is collected with PerfPMR (in the iostat.sh script) and put in the iostat-Dl.out file.

What to look for

Taken alone, there is no unacceptable value for any of the fields because statistics are too closely related to application characteristics, system configuration, and types of physical disk drives and adapters. Therefore, when evaluating data, you must look for patterns and relationships. The most common relationship is between disk utilization and data transfer rate.

To draw any valid conclusions from this data, you must understand the application's disk data access patterns (sequential, random, or a combination), and the type of physical disk drives and adapters on the system.

For example, if an application reads and writes sequentially, you should expect a high disk transfer rate when you have a high disk busy rate. (Kb_read and Kb_wrtn can confirm an understanding of an application's read and write behavior but they provide no information on the data access patterns.)

Generally, you do not need to be concerned about a high disk busy rate as long as the disk transfer rate is also high. However, if you get a high disk busy rate and a low data transfer rate, you may have a fragmented logical volume, file system, or individual file.

The average physical I/O size can be calculated by dividing the Kbps value by the tps value.
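For example, in the first interval of the visual, hdisk0 shows 7676.0 Kbps at 248.9 tps, which works out to an average physical I/O size of about 31 KB (7676.0 / 248.9 ≈ 30.8).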

What is a high data-transfer rate? That depends on the disk drive and the effective data-transfer rate for that drive.

What can you do?

The busier a disk is, the more likely it is that I/Os to that disk will have to wait longer. You may get higher performance by moving some of that disk’s activity to another disk or by spreading the I/O across multiple disk drives. In our example, hdisk0 is receiving the majority of the workload, while hdisk1 appears under-utilized.


Some disk I/O performance might be gained by allocating more logical volumes and file system blocks to hdisk1, but you should first examine what is on each of the disks. It is also important to find out what kind of disk I/O is taking place on hdisk0. The filemon tool will help us determine this.


Figure 5-11. What is iowait? AN512.0

Notes:

Introduction

To summarize it in one sentence, iowait is the percentage of time the CPU is idle AND there is at least one I/O in progress. At any point in time, each CPU can be in one of four states:

- user

- sys

- idle

- iowait

Performance tools such as vmstat, iostat, and sar print out these four states as a percentage. The sar tool can print out the states on a per CPU basis (-P flag) but most other tools print out the average values across all the CPUs. Since these are percentage values, the four state values should add up to 100%.


What is iowait?
• iowait is a form of idle time
• The iowait statistic is simply the percentage of time the CPU is idle AND there is at least one I/O still in progress (started from that CPU)
• The iowait value seen in the output of commands like vmstat, iostat, and topas is the iowait percentages across all CPUs averaged together
• High I/O wait does not mean that there is definitely an I/O bottleneck
• Zero I/O wait does not mean that there is not an I/O bottleneck
• A CPU in I/O wait state can still execute threads if there are any runnable threads


Example

For a single CPU system with one thread running that does exactly 5 ms of computation and then a read that takes 5 ms, the I/O wait would be 50%.

If we were to add a second thread (with the same mix of computation and I/O) onto the system, then we would have 100% user, 0% system, 0% idle and 0% I/O wait.

If we were to next add a third thread, nothing in the statistics will change, but because of the contention, we may see a drop in overall throughput.


Figure 5-12. LVM pbufs AN512.0

Notes:

What are LVM pbufs?

The pbufs are pinned memory buffers used to hold I/O requests and control pending disk I/O requests at the LVM layer. One pbuf is used for each individual I/O request, regardless of the amount of data that is going to be transferred. If the system runs out of pbufs, then the LVM I/O waits until one of these buffers has been released due to another I/O being completed.

LVM pbuf pool

Prior to AIX 5L V5.3, the pbuf pool was a system wide resource. Starting with AIX 5L V5.3, the LVM assigns and manages one pbuf pool per volume group. AIX creates extra pbufs when a new physical volume is added to a volume group.


LVM pbufs
• LVM pbufs are used to hold I/O requests and control pending disk I/O requests at the LVM layer
• One LVM pbuf is used for each individual I/O
• Insufficient pbufs can result in blocked I/O
• LVM pbufs use pinned memory
• LVM pbuf pool:
  – One pbuf pool per volume group (AIX 5L V5.3 and later)
  – Automatically scales as disks are added to volume group
  – pbufs per disk is tunable:
    • For each volume group
    • With a global default


Figure 5-13. Viewing and changing LVM pbufs AN512.0

Notes:

Viewing and changing pbufs

The lvmo command provides support for the pbuf pool related administrative tasks. The syntax for the lvmo command is:

lvmo [-a] [-v VGName] -o Tunable [ =NewValue ]


Viewing and changing LVM pbufs
• Viewing LVM pbuf information:
  # lvmo -v <vg_name> -a
  vgname = rootvg
  pv_pbuf_count = 512
  total_vg_pbufs = 1024
  max_vg_pbuf_count = 16384
  pervg_blocked_io_count = 0
  global_blocked_io_count = 0
• Changing LVM pbufs:
  – Global tuning:
    # ioo -o pv_min_pbuf=<new value>
  – Tuning for each volume group:
    # lvmo -v <vg_name> -o pv_pbuf_count
    # lvmo -v <vg_name> -o max_vg_pbuf_count


The lvmo -a command is used to display pbuf and blocked I/O statistics and the settings for pbuf tunables (system wide or volume group specific). Sample output:

# lvmo -a
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0
global_blocked_io_count = 0

If the vgname is not provided as an option to the lvmo command, it defaults to rootvg.

The definitions for the fields in this report are:

- vgname: Volume group name specified with the -v option.

- pv_pbuf_count: The number of pbufs that are added when a physical volume is added to the volume group.

- total_vg_pbufs: Current total number of pbufs available for the volume group.

- max_vg_pbuf_count: The maximum number of pbufs that can be allocated for the volume group.

- pervg_blocked_io_count: Number of I/Os that were blocked due to lack of free pbufs for the volume group.

- pv_min_pbuf: The minimum number of pbufs that are added when a physical volume is added to any volume group.

- global_blocked_io_count: Number of I/Os that were blocked due to lack of free pbufs for all volume groups.

When changing settings, the lvmo command can only change the LVM pbuf tunables that are for specific volume groups. These are:

- pv_pbuf_count - The number of pbufs that will be added when a physical volume is added to the volume group. Takes effect immediately at run time. The default value is 256 for the 32-bit kernel and 512 for the 64-bit kernel.

- max_vg_pbuf_count - The maximum number of pbufs that can be allocated for the volume group. Takes effect after the volume group has been varied off and varied on again.

The system wide parameter pv_min_pbuf is tunable with the ioo command. It sets the minimum number of pbufs that will be added when a physical volume is added to any volume group.


If both the volume group specific parameter pv_pbuf_count, and the system wide parameter pv_min_pbuf are configured, the larger value takes precedence over the smaller.

Making changes permanent

To make changes permanent with the ioo command, use the -p flag.
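For example, to raise the global pbuf minimum and have the change survive a reboot (the value is illustrative, not a recommendation):

# ioo -p -o pv_min_pbuf=1024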

The lvmo command does not have any options to make changes permanent.


Figure 5-14. I/O request disk queuing AN512.0

Notes:

Understanding the queues in I/O processing is essential to understanding the I/O statistics and their significance.

Some storage devices can only handle one request at a time while others can handle a large number of overlapping storage requests. If a host sends more overlapping requests than the storage device can handle, the device will likely reject the extra requests. This is very undesirable because the host then has to go through error recovery and retransmit those requests.

In order to avoid this, the AIX disk definition has a queue_depth attribute, which limits how many overlapping I/O requests can be sent to the disk. If more requests than this number arrive at the disk device driver, they are queued on the wait queue. The requests that have been sent on to the storage adapter are queued on the disk device driver's service queue until each request is completed. The service queue cannot grow any larger than the queue_depth limit.

The requests that arrive at the storage adapter device driver are handled by command elements. The adapter has a limited number of command elements, configured by the num_cmd_elements attribute of the adapter definition.


I/O request disk queuing
(Diagram: on the AIX host, I/O requests queue at the disk device driver in a wait queue and a service queue, limited by the disk queue_depth; sent requests then flow through the adapter, whose maximum is num_cmd_elements, out to the disk, and the I/O results return along the same path.)
• Check with the vendor to find out the recommended queue depth
• Best starting point: install the vendor provided filesets before discovering the disks; the queue_depth default will be appropriately set
• If not using a vendor fileset, the disk description will contain "other"


If more requests arrive at the adapter than this limit, they are rejected.

The requests are transmitted to the storage device, where they are processed, and the result is returned to the host. The time it takes for a transmitted request to be completed is an important measurement of the performance of the storage device and of the connection it flows over. A completed response is handled by the storage adapter, which notifies the disk device driver. The adapter command element is freed up to service another request. The disk device driver processes the completion and notifies the LVM (assuming an application is not doing direct raw disk I/O). The control block representing the request on the service queue is then freed.
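A short sketch for inspecting and adjusting these limits (the device names and value are illustrative; follow your storage vendor's recommendation for queue_depth):

# lsattr -E -l hdisk5 -a queue_depth (show the disk's current queue depth)
# chdev -l hdisk5 -a queue_depth=16 -P (update the ODM; takes effect at the next reboot)
# lsattr -E -l fcs0 -a num_cmd_elements (show the adapter's command element limit)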


Figure 5-15. Using iostat -D AN512.0

Notes:

The iostat -D report gives more detail than the iostat default disk information.

For the read and write metrics it provides:

- rps: Indicates the number of read transfers per second.

- avgserv: Indicates the average service time per read transfer. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- minserv: Indicates the minimum read service time. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- maxserv: Indicates the maximum read service time. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- timeouts: Indicates the number of read timeouts per second.

- fails: Indicates the number of failed read requests per second.

For the queue metrics it provides:


Using iostat -D
# iostat -D hdisk1 5

hdisk1   xfer:  %tm_act     bps      tps     bread    bwrtn
                0.0         0.0      0.0     0.0      0.0
         read:  rps   avgserv  minserv  maxserv  timeouts  fails
                0.0   0.0      0.0      0.0      0         0
         write: wps   avgserv  minserv  maxserv  timeouts  fails
                0.0   0.0      8.7      8.7      0         0
         queue: avgtime  mintime  maxtime  avgwqsz  avgsqsz  sqfull
                0.0      0.0      0.0      0.0      0.0      0.0
--------------------------------------------------------------------------------
hdisk1   xfer:  %tm_act     bps      tps     bread    bwrtn
                33.9        25.6M    101.2   6.3M     19.2M
         read:  rps   avgserv  minserv  maxserv  timeouts  fails
                24.3  11.8     3.5      20.4     0         0
         write: wps   avgserv  minserv  maxserv  timeouts  fails
                76.9  9.3      3.9      19.1     0         0
         queue: avgtime  mintime  maxtime  avgwqsz  avgsqsz  sqfull
                41.8     0.0      258.8    4.0      1.0      99.4
--------------------------------------------------------------------------------
hdisk1   xfer:  %tm_act     bps      tps     bread    bwrtn
                100.0       70.3M    353.2   19.4M    50.9M
         read:  rps   avgserv  minserv  maxserv  timeouts  fails
                79.5  9.8      0.2      20.4     0         0
         write: wps   avgserv  minserv  maxserv  timeouts  fails
                273.8 8.0      2.2      26.0     0         0
         queue: avgtime  mintime  maxtime  avgwqsz  avgsqsz  sqfull
                33.2     0.0      258.8    12.0     2.0      347.0


- avgtime: Indicates the average time spent by a transfer request in the wait queue. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- mintime: Indicates the minimum time spent by a transfer request in the wait queue. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- maxtime: Indicates the maximum time spent by a transfer request in the wait queue. Different suffixes are used to represent the unit of time. Default is in milliseconds.

- avgwqsz: Indicates the average wait queue size.

- avgsqsz: Indicates the average service queue size.

- sqfull: Indicates the number of times per second that the service queue becomes full (that is, the disk is not accepting any more service requests). In the example on the visual, avgwqsz and sqfull climb sharply as the disk saturates, a sign that requests are backing up behind the disk's queue_depth limit.


Figure 5-16. sar -d AN512.0

Notes:

Overview

The -d option of sar provides real time disk I/O statistics.

The fields listed by sar -d are:

- %busy - Reports the portion of time device was busy servicing a transfer request.

- avque - Reports the average number of requests waiting to be sent to the disk. This statistic is a good indicator if an I/O bottleneck exists. (Before AIX 5L V5.3, it reported the instantaneous number of requests sent to disk but not completed yet.)

- r+w/s - The number of read/write transfers from or to device.

- blks/s - The number of blocks transferred per second, in 512-byte units.

- avwait - The average time (in milliseconds) that transfer requests waited idly on the queue for the device. Prior to AIX 5L V5.3, this was not supported.


sar -d
# sar -d 1 3

AIX train43 3 5 0009330F4C00    11/05/04

System configuration: lcpu=2 drives=3

22:54:39   device    %busy   avque   r+w/s   Kbs/s   avwait   avserv

22:54:40   hdisk1    5       1.4     18      807     9.5      7.5
           hdisk0    0       0.0     0       0       0.0      0.0
           cd0       0       0.0     0       0       0.0      0.0

22:54:41   hdisk1    100     151.7   405     26039   91.2     7.4
           hdisk0    0       0.0     0       0       0.0      0.0
           cd0       0       0.0     0       0       0.0      0.0

22:54:42   hdisk1    66      104.9   224     16740   22.9     9.0
           hdisk0    0       0.0     0       0       0.0      0.0
           cd0       0       0.0     0       0       0.0      0.0

Average    hdisk1    42      64.5    161     10896   30.9     6.0
           hdisk0    0       0.0     0       0       0.0      0.0
           cd0       0       0.0     0       0       0.0      0.0


If you see large numbers in the avwait column, try to distribute the workload across other disks.

- avserv - The average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device. Prior to AIX 5L V5.3, this was not supported.

Note: %busy is the same as %tm_act in iostat. r+w/s is equal to tps in iostat.


Figure 5-17. Using filemon (1 of 2) AN512.0

Notes:

Overview

The filemon command uses the trace facility to obtain a detailed picture of I/O activity during a time interval on the various layers of file system utilization, including the logical file system, virtual memory segments, LVM, and physical disk layers. Data can be collected on all the layers, or some of the layers. The default is to collect data on the virtual memory segments, LVM, and physical disk layers.

The report begins with a summary of the I/O activity for each of the levels (the Most Active sections) and ends with detailed I/O activity for each level (Detailed sections). Each section is ordered from most active to least active.

The logical file I/O includes reads, writes, opens, and seeks, which may or may not result in actual physical I/O depending on whether or not the files are already buffered in memory. Statistics are kept by file.


Using filemon (1 of 2)
# filemon -O lv,pv -o fmon.out
# dd if=/dev/rhdisk0 of=/dev/null bs=32k count=100
# dd if=/dev/zero of=/tmp/junk bs=32k count=100
# trcstop
# more fmon.out

Fri Nov 5 23:16:10 2004
System: AIX ginger Node: 5 Machine: 00049FDF4C00

Cpu utilization: 5.7%

Most Active Logical Volumes
-----------------------------------------------------------------
util    #rblk   #wblk   KB/s      volume      description
-----------------------------------------------------------------
0.28    0       6144    14238.3   /dev/hd3    /tmp
0.06    0       8       18.5      /dev/hd8    jfs2log

Most Active Physical Volumes
-----------------------------------------------------------------
util    #rblk   #wblk   KB/s      volume       description
-----------------------------------------------------------------
0.78    6400    3592    23155.8   /dev/hdisk0  N/A


The virtual memory data contains physical I/O (paging) between segments and disk. Statistics are per segment.

Since it uses the trace facility, the filemon command can be run only by the root user or by a member of the system group. Note that if filemon shows dropped events, the data is not reliable; filemon should be rerun specifying larger buffer sizes.

When running PerfPMR, the filemon data is in the filemon.sum file.

Only data for those files opened after the filemon command was started will be collected, unless you specify the -u flag.

Running filemon

Data can be collected on all the layers, or layers can be specified with the -O layer option. Valid -O options are:

- lf - Monitor file I/O

- vm - Monitor virtual memory I/O

- lv - Monitor logical volume I/O

- pv - Monitor physical volume I/O

- all - Monitor all (lf, vm, lv, and pv)

By default, filemon runs in the background while other applications are running and being monitored. When the trcstop command is issued, filemon stops and generates its report. You may want to issue nice -n -20 trcstop to stop filemon since filemon is currently running at priority 40.

Only the top 20 logical files and segments are reported unless the -v (verbose) flag is used.

To produce filemon output from a previously collected AIX trace, use the -i option with an AIX trace file and the -n option with a file that contains gennames output (gennames -f must be used for filemon offline profiling).

Example

The visual shows an example of:

1. Starting filemon and redirecting the output to the fmon.out file

2. Issuing some I/O intensive commands

3. Stopping filemon with trcstop

4. Examining the fmon.out file

This report shows logical volume activity information and expands on the physical disk information from iostat by displaying a description of the disk. This is useful for determining whether the greatest activity is on your fastest disks.


The reason that the logical volume utilizations are no more than 28% but the hdisk utilization is 78% is because the first dd command is reading directly from the hdisk and bypassing the LVM.


Figure 5-18. Using filemon (2 of 2) AN512.0

Notes:

What to look for

The physical volume statistics can be used to determine physical disk access patterns.

The seek distance shows how far the disk had to seek. The longer the distance, the longer it takes. If the majority of the reads and writes required seeks, you may have fragmented files and/or more than one busy file system on the same physical disk. If the number of reads and writes approaches the number of sequences, physical disk access is more random than sequential.

As the number of seek operations approaches the number of reads or writes (look at the corresponding seek type), the data access becomes less sequential and more random.


Using filemon (2 of 2)
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk0  description: N/A
reads:                  100     (0 errs)
  read sizes (blks):    avg 64.0     min 64      max 64        sdev 0.0
  read times (msec):    avg 0.952    min 0.518   max 12.906    sdev 1.263
  read sequences:       1
  read seq. lengths:    avg 6400.0   min 6400    max 6400      sdev 0.0
writes:                 15      (0 errs)
  write sizes (blks):   avg 239.5    min 8       max 256       sdev 61.9
  write times (msec):   avg 5.572    min 3.716   max 12.736    sdev 2.618
  write sequences:      2
  write seq. lengths:   avg 1796.0   min 8       max 3584      sdev 1788.0
seeks:                  2       (1.7%)
  seek dist (blks):     init 0,
                        avg 7284988.0  min 324392   max 14245584  sdev 6960596.0
  seek dist (%tot blks):init 0.00000,
                        avg 20.49320   min 0.91254  max 40.07386  sdev 19.58066
time to next req(msec): avg 1.565    min 0.581   max 28.526    sdev 3.090
throughput:             23155.8 KB/sec
utilization:            0.78


Example

This visual shows that there were 100 reads where the average size was 64 blocks or 32 KB (since each block is 512 bytes). It also gives the average time in milliseconds to complete the disk read (as well as min, max, and standard deviation). The number of sequences compared to the number of reads indicates how sequential the I/Os were. One sequence with 100 reads means it was fully sequential. A large number of seeks indicates either fragmentation or random I/O.


Figure 5-19. Managing uneven disk workloads AN512.0

Notes:

Moving a logical volume to another physical disk

One way to solve an I/O bottleneck is to see if placement of different logical volumes across multiple physical disks is possible. First, you would have to determine if a particular physical disk is being heavily utilized using the iostat or filemon commands. Second, determine if there are multiple logical volumes being accessed on that same physical disk. If so, then you can move one logical volume from that physical disk to another physical disk. Of course, if you move it to a much slower disk, your performance may be worse than having two logical volumes on the same fast disk. The moving of a logical volume can be easily accomplished by using the migratepv command. A logical volume can be moved or migrated even while it’s in use. The syntax of migratepv for moving a logical volume is:

migratepv -l lvname source_disk destination_disk


Managing uneven disk workloads
• Using the previous monitoring tools:
  – Identify saturated hdisks.
  – Identify underused or unused hdisks (separate disk groups).
• Use the migratepv command to move logical volumes from one physical disk to another, to even the load:
  migratepv -l lvname source_disk destination_disk
• Set LV range to max with an upper bound and reorganize:
  chlv -e m -u upperbound logicalvolume
  reorgvg VolumeGroup LogicalVolume
• Convert to using LVM striping across the candidate disks:
  – Use best practices in defining the LVM striping
  – Back up data, redefine LVs, and restore data
• Micro-manage the position of hot physical partitions:
  – Use lvmstat to identify hot logical partitions
  – Use migratelp to move individual logical partitions to even the load
  – Once you go down this path, you may need to continually monitor and shift the hotspots manually


Spreading traffic with LV inter-policies

Sometimes you need to spread traffic for a single logical volume, in which case moving an entire LV to another disk may not be the right solution. In that case you can use what is called “poor man’s striping”. This involves spreading the logical partitions evenly across multiple disks. This works well if the random I/O demand is evenly spread across the logical volume.

To do this, you need to set the particular logical volume's inter-policies as follows:

- Set the range to maximum ( -e m)

- Set the upper bound to the number of disks you wish to use (-u #).

Then, you run reorgvg against that logical volume. It will move the logical partitions to spread them as equally as possible between the candidate disks.
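A sketch of that sequence, assuming a logical volume datalv in volume group datavg to be spread across up to four disks (the names and count are illustrative):

# chlv -e m -u 4 datalv (inter-policy maximum, upper bound of four disks)
# reorgvg datavg datalv (redistribute the partitions across the candidate disks)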

The problem here is that you cannot name which particular disks to use. If you need to constrain it to particular disk, then you need to have the logical volume in its own volume group with only those disks. That may require a backup and restore. Or you may be able to copy it to a new volume group using the cplv command.

Spreading traffic with LVM striping

If the previous methods do not properly do the job, it is likely because the LV has hot spots in its data distribution and the hot LVM physical partitions are still mostly on one disk. The easiest way to spread this load is to use LVM striping across the disks; a command sketch follows the list below. Striping works because the LVM stripe unit is so much smaller than the physical partition size. But if you do stripe, it needs to be done efficiently:

- The more physical volumes used the more spread the load.

- Avoid the adapters from being the single point of contention.

- Avoid other uses for the disks. A good way to do this is to create a separate volume group for striped logical volumes.

- Set a stripe-unit size of 64 KB. Setting too small a stripe unit size will fragment the I/O and impact performance. 64 KB has been found to be optimal in most situations.

- If doing sequential reads or random reads of a very large size, set the maximum page ahead (see read-ahead discussion in the next unit) to 16 times the number of disk drives. This causes page-ahead to be done in units of the stripe-unit size (64 KB) times the number of disk drives, resulting in the reading of one stripe unit from each disk drive for each read-ahead operation.

- Have the application use read and write sizes which are a multiple of the stripe unit size, or even better (if practical) the sizes equal to the full stripe (64 KB times the number of disk drives)

- Modify maxfree, using the ioo command, to accommodate the change in the maximum page ahead value (maxfree = minfree + <max page ahead>).
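As a sketch of creating such a striped logical volume under those guidelines (the volume group, logical volume, disk names, and sizes are all illustrative):

# mkvg -y stripevg hdisk2 hdisk3 hdisk4 hdisk5 (a volume group dedicated to striping)
# mklv -y stripedlv -S 64K stripevg 64 hdisk2 hdisk3 hdisk4 hdisk5

The list of physical volumes on the mklv command sets the stripe width, and -S sets the 64 KB stripe-unit size.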


Moving logical partitions

There may also be the case where a disk has a single very large logical volume on it. In this case, moving the entire logical volume to an equivalent disk would not help. You could check to see if individual partitions are accessed heavily. For example, with a large partition size and a database on a raw logical volume with too small of a database buffer cache, the individual physical partition may be accessed heavily. The command lvmstat in AIX can be used to check for this. To move an individual partition, a command called migratelp is available.

The syntax of migratelp is:

migratelp lvname/lpartnum[/copynum] destpv[/ppartnum]

migratelp moves the specified logical partition lpartnumber of the logical volume lvname to the destpv physical volume. If the destination physical partition ppartnum is specified it will be used. Otherwise, a destination partition is selected using the intra-allocation policy of the logical volume. By default, the first mirror copy of the logical partition in question is migrated. A value of 1, 2 or 3 can be specified for copynum to migrate a particular mirror copy.

Examples:

- To move the first logical partition of logical volume lv00 to hdisk1:

# migratelp lv00/1 hdisk1

- To move second mirror copy of the third logical partition of logical volume hd2 to hdisk5:

# migratelp hd2/3/2 hdisk5

Migrating physical partitions

Rather than migrating entire logical volumes from one disk to another in an attempt to rebalance the workload, if we can identify the individual hot logical partitions, then we can focus on migrating just those to another disk. The lvmstat utility can be used to monitor the utilization of individual logical partitions of a logical volume. By default, statistics are not kept on a per partition basis. These statistics can be enabled with the lvmstat -e option. You can enable statistics for:

- All logical volumes in a volume group with lvmstat -e -v vgname

- Per logical volume basis with lvmstat -e -l lvname

The first report generated by lvmstat provides statistics concerning the time since the system was booted. Each subsequent report covers the time since the previous report. All statistics are reported each time lvmstat runs. The report consists of a header row followed by a line of statistics for each logical partition or logical volume depending on the flags specified.


lvmstat syntax

The syntax of lvmstat is:

lvmstat {-l|-v} Name [-e|-d][-F][-C][-c Count][-s][Interval [Iterations]]

If the -l flag is specified, Name is the logical volume name, and the statistics are for the physical partitions of this logical volume. The mirror copies of the logical partitions are considered individually for the statistics reporting. They are listed in descending order of number of I/Os (iocnt) to the partition.

The Interval parameter specifies the amount of time, in seconds, between each report. The first report contains statistics for the time since the volume group startup. Each subsequent report contains statistics collected during the interval since the previous report.

If the Count parameter is specified, only the top Count lines of the report are generated. For a logical volume if Count is 10, only the 10 busiest partitions are identified.

If the Iterations parameter is specified in conjunction with the Interval parameter, then only that many iterations are run. If no Iterations parameter is specified, lvmstat generates reports continuously.

If Interval is used to run lvmstat more than once, no reports are printed if the statistics did not change since the last run. A single period (.) is printed instead.

Statistics can be disabled using the -d option.
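A minimal usage sketch (the logical volume name datalv is illustrative):

# lvmstat -e -l datalv (enable per-partition statistics)
# lvmstat -l datalv 5 3 (three reports at five-second intervals)
# lvmstat -d -l datalv (disable statistics collection when finished)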


Figure 5-20. Adapter and multipath statistics AN512.0

Notes:

There are two scenarios where the storage adapters figure into the I/O performance analysis:

- There are multiple disks connected to the same adapter

- There are multiple paths to the same disk.

While none of the individual disks may be saturated by the I/O workload, the total I/O for all disks may overload the storage adapter. If that occurs, you may want to add another adapter and move some of the disks to it. Remember that too many adapters with too much traffic can, in turn, overload the PCI bus to which they are connected. Alternatively, you might migrate the data to other disks that already use a different storage adapter. Also avoid setting queue_depth values so high that the total of all the queue depths exceeds the number of command elements on the adapter.

If we can have multiple disks using a single adapter in a directly attached disk environment, that is even more true for the SAN storage environment, where you could expect to have dozens of hdisks with a single fibre channel adapter as their parent.


Adapter and multipath statistics
(Diagram: an AIX host with a scsi3 adapter and two fibre channel adapters, fcs0 and fcs1, each connected through its own SAN switch to a storage subsystem presenting LUN 1 and LUN 2; the hdisks shown include hdisk4, hdisk5, and hdisk8.)


Beyond that, it is common to have multiple FC adapters zoned to access the same LUNs. In that environment there is an extra layer of path management software to handle adapter load balancing and failover. Rather than having one fibre channel adapter sit in idle standby to handle failover situations, many installations configure both adapters to carry I/O traffic, thus increasing the throughput capacity. The path control software should load balance properly to ensure this. If one of the adapters is not functioning correctly, or the load balancing software is not configured correctly, the desired bandwidth will not be realized.


Figure 5-21. Monitoring adapter I/O throughput AN512.0

Notes:

Adapter throughput

The -a option to iostat will combine the disks statistics to the adapter to which they are connected. The adapter throughput will simply be the sum of the throughput of each of its connected devices. With the -a option, the adapter will be listed first, followed by its devices and then followed by the next adapter, followed by its devices, and so on. The adapter throughput values can be used to determine if any particular adapter is approaching its maximum bandwidth or to see if the I/O is balanced across adapters.

System throughput

In addition, there is also a -s flag that will show system throughput. This is the sum of all the adapter’s throughputs. The system throughput numbers can be used to see if you are approaching the maximum throughput for the system bus.


Monitoring adapter I/O throughput
• iostat -a shows adapter throughput
• Disks are listed following the adapter to which they are attached

# iostat -a

System configuration: lcpu=2 drives=3

tty:      tin     tout    avg-cpu:  % user  % sys  % idle  % iowait
          0.2     22.6              8.6     45.2   45.8    0.4

Adapter:   Kbps     tps    Kb_read   Kb_wrtn
scsi0      131.7    4.2    128825    2618720

Disks:    % tm_act    Kbps     tps    Kb_read   Kb_wrtn
hdisk1    0.0         0.2      0.0    3194      0
hdisk0    1.2         131.6    4.2    125631    2618720


For example:

# iostat -s 1

System configuration: lcpu=2 drives=3

tty:      tin     tout     avg-cpu:  % user  % sys  % idle  % iowait
          0.0     583.0              48.5    6.0    0.0     45.5

System: train33.beaverton.ibm.com
          Kbps       tps      Kb_read   Kb_wrtn
          15984.0    339.0    0         15984

Disks:    % tm_act    Kbps       tps      Kb_read   Kb_wrtn
hdisk1    0.0         0.0        0.0      0         0
hdisk0    100.0       15984.0    339.0    0         15984
cd0       0.0         0.0        0.0      0         0


Figure 5-22. Monitoring multiple paths (1 of 2) AN512.0

Notes:

The iostat command’s -m option allows us to see the I/O traffic, for one or more disks, broken down by path. The example shown has two fibre channel adapters, both zoned to have access to the same LUNs. Path0 through Path3 belong to one adapter and Path4 through Path7 to the other.

In the example, it is clear that all of the traffic is being sent over just one of the adapters. An investigation, in this case, would show that the disk attribute of algorithm was set to fail_over instead of round_robin.
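With the generic AIX MPIO path control module, that condition can be checked and corrected as follows (the disk name is illustrative; vendor path management software may use its own commands and attribute names):

# lsattr -E -l hdisk5 -a algorithm (show the current load balancing algorithm)
# chdev -l hdisk5 -a algorithm=round_robin -P (update the ODM; takes effect at the next reboot)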

It should be noted that the vendor specific path management software often has better tools for examining this than more generic tools such as iostat.

Each adapter will have multiple paths for different routing options in the SAN switch fabric.

In order to correlate the path IDs shown in this report to the available adapters, we need to use the lspath command:

# lspath -F "name parent path_id status" -l hdisk5

hdisk5 fscsi0 0 Enabled
hdisk5 fscsi0 1 Enabled
hdisk5 fscsi0 2 Enabled
hdisk5 fscsi0 3 Enabled
hdisk5 fscsi1 4 Enabled
hdisk5 fscsi1 5 Enabled
hdisk5 fscsi1 6 Enabled
hdisk5 fscsi1 7 Enabled


Monitoring multiple paths (1 of 2)
# iostat -md hdisk5 5 1

System configuration: lcpu=2 drives=22 ent=0.30 paths=82 vdisks=13

Disks:    % tm_act     Kbps       tps     Kb_read   Kb_wrtn
hdisk5    99.8         76090.7    399.2   382064    377892

Paths:    % tm_act     Kbps       tps     Kb_read   Kb_wrtn
Path7     0.0          0.0        0.0     0         0
Path6     0.0          0.0        0.0     0         0
Path5     0.0          0.0        0.0     0         0
Path4     0.0          0.0        0.0     0         0
Path3     0.0          0.0        0.0     0         0
Path2     0.0          0.0        0.0     0         0
Path1     0.0          0.0        0.0     0         0
Path0     99.6         76077.9    399.2   382064    377764




Figure 5-23. Monitoring multiple paths (2 of 2) AN512.0

Notes:

This report looks very similar to the one displayed earlier, except that you can see the same disks shown under both the fcs0 adapter and the fcs1 adapter. This report is a little easier to understand, since you do not need to figure out which adapter is associated with which path ID.


Monitoring multiple paths (2 of 2)
# iostat -ad hdisk5 hdisk12 10

System configuration: lcpu=4 drives=23 paths=163 vdisks=1 tapes=0

Adapter:   Kbps       tps     Kb_read   Kb_wrtn
fcs0       76407.9    628.1   385904    377984

Disks:    % tm_act    Kbps       tps     Kb_read   Kb_wrtn
hdisk12   0.0         0.0        0.0     0         0
hdisk5    99.6        76407.9    628.1   385904    377984

Adapter:   Kbps       tps     Kb_read   Kb_wrtn
fcs1       0.0        0.0     0         0

Disks:    % tm_act    Kbps       tps     Kb_read   Kb_wrtn
hdisk12   0.0         0.0        0.0     0         0
hdisk5    0.0         0.0        0.0     0         0


Figure 5-24. Checkpoint AN512.0

Notes:


Checkpoint

1. True/False: When you see two hdisks on your system, you know they represent two separate physical disks.

2. List two commands that will provide real time disk I/O statistics.

3. Identify and define the default mirroring scheduling policy.

4. What tool allows you to observe the time the physical disks are active in relation to their average transfer rates by monitoring system input/output device loads?


Figure 5-25. Exercise 5: I/O Performance AN512.0

Notes:


Exercise 5: I/O Performance

• Use the filemon command
• Locate and fix I/O bottlenecks with the following tools:
  – vmstat
  – iostat
  – sar
  – lvmstat
  – filemon


Figure 5-26. Unit summary AN512.0

Notes:


Unit summary

This unit covered:
• Identifying factors related to physical and logical volume performance
• Using performance tools to identify I/O bottlenecks
• Configuring logical volumes for optimal performance


Unit 6. File system performance monitoring and tuning

What this unit is about

This unit describes the issues related to file system I/O performance. It shows you how to use performance tools to monitor and tune file system I/O performance.

What you should be able to do

After completing this unit, you should be able to:

• List characteristics of the file systems that apply to performance

• Describe how file fragmentation affects file system I/O performance

• Use the filemon tool to evaluate file system performance

• Tune:

- JFS logs

- Release-behind

- Read-ahead

- Write-behind

• Identify resource bottlenecks for file systems

How you will check your progress

Accountability:

• Checkpoints
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)


Figure 6-1. Unit objectives AN512.0

Notes:


Unit objectives

After completing this unit, you should be able to:
• Describe guidelines for accurate file system measurements
• Describe how file fragmentation affects file system I/O performance
• Use the filemon tool to evaluate file system performance
• Tune:
  - JFS and JFS2 logs
  - Release-behind
  - Read-ahead
  - Write-behind

• Identify resource bottlenecks for file systems


Figure 6-2. File system I/O layers AN512.0

Notes:

Overview

There are a number of layers involved in file system storage and retrieval. It’s important to understand what performance issues are associated with each layer. The management tools used to monitor file system activity can provide data on each of these layers.

The effect of a file’s physical disk placement on I/O performance diminishes when the file is buffered in memory. When a file is opened in AIX, it is mapped to a persistent (JFS) or client (JFS2) data segment in virtual memory. The segment represents a virtual buffer for the file. The file’s blocks map directly to segment pages. The VMM manages the segment pages, reading file blocks into segment pages upon demand (as they are accessed). There are several circumstances that cause the VMM to write a page back to its corresponding block in the file on disk.


File system I/O layers

• Logical File System - local or NFS
• Virtual Memory Manager - paging
• Logical Volume Manager - disk space management
• Physical Disk I/O - hardware dependent


Figure 6-3. File system performance factors AN512.0

Notes:

Overview

There is a theory that anything that starts out with perfect order will, over time, become disordered due to outside forces. This concept certainly applies to file systems. The longer a file system is used, the more likely it will become fragmented. Also, the dynamic allocation of resources (for example, extending a logical volume) contributes to the disorder. File system performance is also affected by physical considerations like the:

- Types of disks and number of adapters

- Amount of memory for file buffering

- Amount of local versus remote file access

- Pattern and amount of file access by applications


File system performance factors

• Proper performance management at lower layers:
  - LVM logical volume and physical volume
  - Adapters, paths, and storage subsystem
• Large reads and writes at all layers:
  - Large application read and write sizes
  - Multiple of file system block size
  - Manage file fragmentation
  - Avoid small discontiguous reads at the LV layer
• Avoid significant impacts from journal logging:
  - Concurrent file access locking and serialization
  - Physical seeks to log on each write
• Manage file caching if overcommitted memory
• Avoid JFS file compression option


Issues of fragmentation

With fragmentation, sequential file access will no longer find contiguous physical disk blocks. Random access may not find physically contiguous logical records and will have to access more widely dispersed data. In both cases, seek time for file access grows. Both JFS and JFS2 attach a virtual memory segment to do I/O. As a result, file data becomes cached in memory and disk fragmentation does not affect access to the VMM cached data.

File system CPU overhead

Each read or write operation on a file system is done through system calls. System calls for reads and writes define the size of the operation, that is, number of bytes. The smaller the operation the more system calls are needed to read or write the entire file. Therefore, more CPU time is spent making the system calls. The read or write size should be a multiple of the file system block size to reduce the amount of CPU time spent per system call.

Fragment size

The following discussion uses JFS fragments to illustrate the concept, but the same principles apply equally to small JFS2 block sizes.

As many whole fragments (or blocks) as necessary are used to store a file or directory’s data.

Consider that we have chosen to use a JFS fragment size of 4 KB, and we are attempting to store file data which only partially fills a JFS fragment. Potentially, the amount of unused or wasted space in the partially filled fragment can be high. For example, if only 500 bytes are stored in this fragment, then 3596 bytes will be wasted. However, if a smaller JFS fragment size (for example 512 bytes) was used, the amount of wasted disk space would be greatly reduced to only 12 bytes. Therefore, it is better to use small fragment sizes if efficient use of available space is required.

Although small fragment sizes can be beneficial in reducing disk space wastage, this can have an adverse effect on disk I/O activity. For a file with a size of 4 KB stored in a single fragment of 4 KB, only one disk I/O operation would be required to either read or write the file. If the choice of the fragment size was 512 bytes, eight fragments would be allocated to this file, and for a read or write to complete, several additional disk I/O operations (disk seeks, data transfers, and allocation activity) would be required. Therefore, for file systems which use a fragment size of 4 KB, the number of disk I/O operations will be far less than for file systems which employ a smaller fragment size.

Fragments are allocated contiguously or not at all.
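The fragment size is chosen when the file system is created and cannot be changed afterward. As a hedged sketch (the names fslv01, datavg, and /small are hypothetical), a JFS file system intended for many small files could be created with 512-byte fragments:

# mklv -t jfs -y fslv01 datavg 4
# crfs -v jfs -d fslv01 -m /small -a frag=512
# mount /small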


Compression

Compression can be used for JFS file systems with a fragment size less than 4 KB. It uses the Lempel-Ziv (LZ) algorithm, which replaces subsequent occurrences of a given string with a pointer to the first occurrence. On average, a 50% savings in disk space is realized.

Compression can be specified when creating the file system through SMIT: System Storage Management (Physical & Logical Storage) -> File Systems -> Add / Change / Show / Delete File Systems -> Journaled File Systems -> Add a Journaled File System -> Add a Compressed Journaled File System.

Or, use one of the following commands:

- crfs -a compress=LZ <other options>

- mkfs -o compress=LZ <other options>

JFS compression performance considerations

In addition to increased disk I/O activity and free space fragmentation problems, file systems using data compression have the following performance considerations:

- Degradation in file system usability arising as a direct result of the data compression/decompression activity

- All logical blocks in a compressed file system, when modified for the first time, will be allocated 4096 bytes of disk space, and this space will subsequently be reallocated when the logical block is written to disk

- In order to perform data compression, approximately 50 CPU cycles per byte are required, and about 10 CPU cycles per byte for decompression

- The JFS compression kproc (jfsc) runs at a fixed priority of 30 so that while compression/decompression is occurring, the CPU that this kproc is running on may not be available to other processes unless they run at a better priority

Compression is not supported for the JFS2 (J2) file systems.


Figure 6-4. How to measure file system performance AN512.0

Notes:

Idle system

File system operations require system resources such as CPU, memory, and I/O. The result of a file system performance measurement will not be accurate if one or more of these resources are in use by other applications.

System management tools

The same applies if one or more of these resources is managed and/or the statistics are gathered with system management tools like Workload Manager (WLM). Those tools should be turned off.

I/O subsystems

I/O subsystems, such as Enterprise Storage Server (ESS), can share disk space among several systems. The available bandwidth might not be enough to achieve maximum file system performance if the I/O subsystem is used by other systems during the performance measurement; thus, it should not be shared.


How to measure file system performance

• General guidelines for accurate measurements:
  - System has to be idle
  - System management tools like Workload Manager should be turned off
  - Storage subsystems should not be shared with other systems
  - Files must not be in AIX file cache or storage subsystem cache for a read throughput measurement
  - Writes must go to the file system disk and not just be written to AIX memory



Read measurement

When a file is cached in memory, a read throughput measurement does not give any information about the file system throughput since no physical operation on the file system takes place. The best way to assure that a file is not cached in memory is to unmount and then mount the file system on which the file is located. You may need to work with the storage subsystem administrator to ensure that the subsystem cache is empty of the data you will be reading.
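A minimal sketch of an uncached read measurement (the file system /fs and file file1 are hypothetical); unmounting and remounting discards the AIX file cache for that file system before the timed read:

# unmount /fs
# mount /fs
# time dd if=/fs/file1 of=/dev/null bs=1024k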

Write measurement

A write throughput measurement does not give any information about file system performance if nothing is written out to disk. Unless the application opens files in such a way that it does not use file system buffers (such as direct I/O), then each write to a file is done in memory and is written out to disk by either a syncd or a write-behind algorithm. The write-behind algorithm should always be used and tuned for a write throughput measurement.


Figure 6-5. How to measure read throughput AN512.0

Notes:

Utilities

The dd command is a good utility to measure the throughput of a file system since it allows you to specify the exact size for reads or writes as well as the number of operations. When the dd command is started, it creates a second dd process. One dd process is used to read and the other to write. This allows dd to provide a continuous data flow on an SMP machine.

Example

The time command shows the amount of time it took to complete the read.

The read throughput in this example is about 2272 MB per second (1000 MB / 0.44 seconds real time).


How to measure read throughput

• Useful tools for file system performance measurements are dd and time

• Example of a read throughput measurement with the dd command:

# time dd if=/fs/file1 of=/dev/null bs=1024k
1000+0 records in
1000+0 records out

real 0m0.44s
user 0m0.01s
sys  0m0.45s


Figure 6-6. How to measure write throughput AN512.0

Notes:

Overview

Writes to a file are done in memory (unless direct I/O, asynchronous I/O, or synchronous I/O is used) and will be written out to disk through syncd or the write-behind algorithm. If the application is not issuing fsync() periodically, then it is necessary that the file system sequential write-behind mechanism be enabled (the default). Otherwise, the process could complete with a large amount of data still in memory and not yet written to disk.

With write_behind, most of the data will be written to disk, but up to 128 KB (by default) of data could be left unwritten to the disk; thus, a large amount of data should be used to make that 128 KB a small percentage of the measurement. Write_behind will be discussed in more detail later.

Placing a sync command before the final date command neither helps nor hurts the measurement in the default sequential write-behind environment. But, the final sync command is necessary if you disable sequential write-behind. Note that the sync command is asynchronous; it returns without waiting for the data to be confirmed as written to disk. But if processing a large amount of data, the unrecorded amount will not be significant.


How to measure write throughput

• Ensure sequential write_behind is enabled (default)
• Use large amounts of data
• Example of a write throughput measurement with the dd command:

# sync; sync; date; dd if=/dev/zero of=/fs/file1 bs=1024k count=800; date; sync; date
Mon Mar 15 16:48:35 CET 2010
800+0 records in.
800+0 records out.
Mon Mar 15 16:48:46 CET 2010
Mon Mar 15 16:48:46 CET 2010



Example

The first set of sync commands flush all modified file pages in memory to disk.

The time between the first and the second date commands is the amount of time the dd command took to write the file into memory and to process the disk I/O triggered by the write-behind mechanism; in effect, this period covers writing and committing almost all of the data to disk.

The time between the second and the third date commands is the time it took the sync command to schedule any remaining dirty pages to be written to disk. Note that the sync command will terminate without waiting for all its writes to be committed. For the default write-behind environment, this is a very short amount of time.

If the write-behind mechanism had been disabled, and there was sufficient memory to cache the written data, the dd command elapsed time would have been much shorter (perhaps 5 seconds for this example) and the elapsed time for the final sync command would have been much longer (around 12 seconds).

The time between the first and third date command is the total amount of time it took to write the file to disk.

In this example, dd completed after 11 seconds (16:48:46 - 16:48:35) and wrote 72.7 MB per second.
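The same measurement can be scripted. A minimal sketch (the target path /fs/file1 is hypothetical), using the ksh built-in SECONDS timer in place of the date arithmetic:

#!/usr/bin/ksh
# Hedged sketch: time the write plus the trailing sync, as in the example above.
sync; sync                                   # flush previously dirtied pages first
SECONDS=0                                    # reset the ksh elapsed-time counter
dd if=/dev/zero of=/fs/file1 bs=1024k count=800
sync                                         # schedule any remaining write-behind residue
print "800 MB written in $SECONDS seconds"   # throughput is 800 / elapsed, in MB/s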


Figure 6-7. Using iostat AN512.0

Notes:

Example

The iostat command might help you see if something is going wrong.

The example uses the -f option which adds per file system statistics. This allows you to see which file systems have the heaviest traffic and to see the statistics for individual file systems.

The output of the iostat taken during the dd read operation shows a higher number of transactions per second (tps) than you would expect. The average block size in this sample was about 8 KB (calculated by Kbps / tps).

Both the high number of transactions per second and the small block size point to a problem. You would expect to see larger I/Os and fewer transactions per second, specifically with a sequential read, because of the file system read-ahead mechanism that will be covered later.


Using iostat

# iostat -f /test 1 > ios.out &
[1] 245906
# dd if=/test/file2 bs=1024k of=/dev/null
100+0 records in.
100+0 records out.
# kill %1
# egrep "/test|FS Name" ios.out

FS Name:   % tm_act   Kbps      tps      Kb_read   Kb_wrtn
/test      0.0        0.0       0.0      0         0
FS Name:   % tm_act   Kbps      tps      Kb_read   Kb_wrtn
/test      12.0       3268.0    410.0    3268      0
FS Name:   % tm_act   Kbps      tps      Kb_read   Kb_wrtn
/test      72.0       17308.0   2163.0   17308     0
FS Name:   % tm_act   Kbps      tps      Kb_read   Kb_wrtn
/test      69.0       17248.0   2155.0   17248     0
FS Name:   % tm_act   Kbps      tps      Kb_read   Kb_wrtn
/test      53.0       14576.0   1598.0   14576     0


Since iostat gives only a general overview of the I/O activity, you need to continue your analysis with a tool like filemon, which provides more detailed information.


Figure 6-8. Using filemon (1 of 3) AN512.0

Notes:

Example

This example demonstrates how to use the filemon command to analyze the file system performance issue as seen with the iostat command in the last visual. The visual on this page shows the logical file output (lf) from the filemon report. Output is ordered by #MBs read and written to a file.

By default, the logical file reports are limited to the top 20. If the verbose flag (-v) is added, activity for all files would be reported. The -u flag is used to generate reports on files opened prior to the start of the trace daemon.

Look for the most active files to see usage patterns. If they are dynamic files, they may need to be backed up and restored. The Most Active Files section shows the file2 file (read by the dd command) as the most active file, with one open and 101 reads.

The number of writes (#wrs) is 1 less than the number of reads (#rds), because end-of-file has been reached.


Using filemon (1 of 3)

# filemon -u -O lf,pv -o fmon.out
# dd if=/test/file2 bs=1024k of=/dev/null
# trcstop
# more fmon.out
Wed Nov 10 13:24:34 2004
System: AIX train21 Node: 5 Machine: 0001D2BA4C00

Cpu utilization: 6.9%

Most Active Files
---------------------------------------------------------------------
#MBs    #opns   #rds    #wrs    file                    volume:inode
---------------------------------------------------------------------
101.0   1       101     0       file2                   /dev/jfslv:23
100.0   1       0       100     null
3.0     0       385     0       pid=270570_fd=20960
0.2     1       62      0       unix                    /dev/hd3:10
0.0     0       60      51      pid=208964_fd=14284
0.0     0       205     107     pid=249896_fd=17736
0.0     0       0       102     pid=282802_fd=20162


If the trace does not capture the open call, filemon does not know the name the file was opened with, so it records only the file descriptor number. When the trace does not have the process name, it saves the PID instead.


Figure 6-9. Using filemon (2 of 3) AN512.0

Notes:

Detailed File Stats report

The Detailed File Stats report is based on the activity on the interface between the application and the file system. As such, the number of calls and the size of the reads or writes reflects the application calls. The read sizes and write sizes will give you an idea of how efficiently your application is reading and writing information.

In this example, the report shows the average read size is approximately 1 MB, which matches the block size specified on the dd command on the previous visual.

The size used by an application has performance implications. For sequentially reading a large file, a larger read size will result in fewer read requests and thus lower CPU overhead to read the entire file. When specifying an application’s read or write block size, using values which are a multiple of the page size (which is 4 KB) is recommended.


Using filemon (2 of 3)

------------------------------------------------------------------------
Detailed File Stats
------------------------------------------------------------------------

FILE: /test/file2  volume: /dev/jfslv (/test)  inode: 23
opens:                  1
total bytes xfrd:       105906176
reads:                  101     (0 errs)
  read sizes (bytes):   avg 1048576.0  min 1048576  max 1048576  sdev 0.0
  read times (msec):    avg 30.401     min 0.005    max 38.883   sdev 3.681

FILE: /dev/null
opens:                  1
total bytes xfrd:       104857600
writes:                 100     (0 errs)
  write sizes (bytes):  avg 1048576.0  min 1048576  max 1048576  sdev 0.0
  write times (msec):   avg 0.005      min 0.004    max 0.022    sdev 0.002


Figure 6-10. Using filemon (3 of 3) AN512.0

Notes:

Detailed Physical Volume Stats report

As contrasted with the Detailed File Stats report, the Detailed Physical Volume Stats report shows the activity at the disk device driver level. This report shows the actual number and size of the reads and writes to the disk device driver. The file system uses VMM caching. The default unit of work in VMM is the 4 KB page. But, rather than writing or reading one page at a time, the file system tends to group work together to read or write multiple pages at a time. This grouping of work can be seen in the physical volume read and write sizes provided in this report.

Note that the sizes are expressed in blocks, where a block is the traditional UNIX block size of 512 bytes. To translate the sizes to KBs, divide the number by 2.


Using filemon (3 of 3)

------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk1  description: N/A
reads:                  6326    (0 errs)
  read sizes (blks):    avg 16.6     min 8        max 64        sdev 9.5
  read times (msec):    avg 0.301    min 0.100    max 6.057     sdev 0.172
  read sequences:       3125
  read seq. lengths:    avg 33.5     min 32       max 4832      sdev 85.9
seeks:                  3125    (49.4%)
  seek dist (blks):     init 4634960,
                        avg 243.2    min 32       max 659488    sdev 11796.7
  seek dist (%tot blks):init 13.03848,
                        avg 0.00068  min 0.00009  max 1.85519   sdev 0.03318
time to next req(msec): avg 2.148    min 0.191    max 10516.272 sdev 132.204
throughput:             2574.3 KB/sec
utilization:            0.09


Example

In this report, the minimum read size was 4 KB, which matches the VMM page size. The average size approximately matches the 8 KB size that was calculated from the iostat report; iostat and this filemon report are both reporting the disk device driver activity. The maximum size was 32 KB. Generally, more work per read is better.

The example in the visual shows 3125 seeks and 6326 reads on hdisk1. You would not expect to see any seeks here, since dd reads the data sequentially. The file could be fragmented on the file system or partially cached in real memory (partial caching will defeat the sequential read-ahead mechanism). A simple test for this would be an unmount, a mount, and another dd to see if the behavior changes. Generally, the fewer the number of seeks, the better the performance.


Figure 6-11. Fragmentation and performance AN512.0

Notes:

Overview

While an operating system’s file is conceptually a sequential and contiguous string of bytes, the physical reality might be very different. File fragmentation arises from appending to a file while other applications are also writing to the files in the same area. A file system is considered fragmented when its available space consists of large numbers of small chunks of space, making it impossible to write out a new file in contiguous blocks.

Access to fragmented files may result in a large number of seeks and longer I/O response times (seek latency dominates I/O response time). For example, if the file is accessed sequentially, a file placement that consists of many, widely separated extents requires more seeks than a placement that consists of one or a few large contiguous extents. If the file is accessed randomly, a placement that is widely dispersed requires longer seeks than a placement in which the file’s blocks are close together.


Fragmentation and performance

(Visual: the blocks of a logical file, numbered 1 through 6, appear contiguous at the logical file layer, are interspersed with i-nodes in the physical file system, and end up scattered in the physical disk allocation.)


The i-nodes and indirect blocks are part of the file system. They are placed at various points throughout the file system. This is desirable since it helps keep i-nodes physically close to data blocks. The disadvantage is that the i-nodes and indirect blocks contribute to the file fragmentation.


Figure 6-12. Determine fragmentation using fileplace AN512.0

Notes:

Overview

The fileplace tool displays the placement of a file’s blocks within a logical or physical volume(s). fileplace expects an argument containing the name of the file to examine. This tool can be used to detect file fragmentation.

By default, fileplace sends its output to the display, but the output can be redirected to a file via normal shell redirection.

fileplace accepts the following options:

-l Displays the file’s placement in terms of logical volume blocks (default).

-p Displays the file’s placement in terms of physical volume blocks for the physical volumes that contain the file. Mirroring data is included if the logical volume is mirrored. The -p flag is mutually exclusive with the -l flag.

-i Displays the indirect blocks (if any) for the file. This option is not available for JFS2 files.


Determine fragmentation using fileplace

# fileplace -pv file1

File: file1  Size: 1048576000 bytes  Vol: /dev/hd1
Blk Size: 4096  Frag Size: 4096  Nfrags: 256000
Inode: 28834  Mode: -rw-r--r--  Owner: root  Group: system

Physical Addresses (mirror copy 1)                             Logical Extent
----------------------------------                             ----------------
04075296-04076063 hdisk8    768 frags    3145728 Bytes,  0.3%  00077056-00077823
04077600-04082719 hdisk8   5120 frags   20971520 Bytes,  2.0%  00079360-00084479
04084512-04085023 hdisk8    512 frags    2097152 Bytes,  0.2%  00086272-00086783
04088864-04089119 hdisk8    256 frags    1048576 Bytes,  0.1%  00090624-00090879
04089632-04172831 hdisk8  83200 frags  340787200 Bytes, 32.5%  00091392-00174591
04173088-04173855 hdisk8    768 frags    3145728 Bytes,  0.3%  00174848-00175615
04175648-04176671 hdisk8   1024 frags    4194304 Bytes,  0.4%  00177408-00178431
04202784-04219423 hdisk8  16640 frags   68157440 Bytes,  6.5%  00204544-00221183
04223264-04224287 hdisk8   1024 frags    4194304 Bytes,  0.4%  00225024-00226047
04260384-04407071 hdisk8 146688 frags  600834048 Bytes, 57.3%  00262144-00408831

256000 frags over space of 331776 frags: space efficiency = 77.2%
10 extents out of 256000 possible: sequentiality = 100.0%


-v Displays more information, such as space efficiency and sequentiality.

Note that a logical block may be composed of a number of fragments.

Example

The example in the visual demonstrates how to use fileplace to determine whether a file is fragmented.

The report generated by the -pv options displays the file’s placement in terms of physical volume blocks for the physical volumes. The verbose part of the report is one of the most important sections since it displays the efficiency and sequentiality of the file.

Range of fragments (R) is calculated as the (highest assigned address - lowest assigned address + 1).

Number of fragments (N) is the total number of fragments. File space efficiency is calculated as the number of non null fragments (N) divided by the range of fragments (R) assigned to the file and multiplied by 100, or (N/R) * 100.

Sequential efficiency is defined as 1 minus the number of gaps (nG) divided by the number of possible gaps (nPG), or (1 - (nG/nPG)) * 100. The number of possible gaps is calculated as nPG = N - 1.

In this example, the file is not very fragmented.

Higher sequentiality provides better sequential file access.
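Applying these formulas to the example in the visual: N = 256000 fragments over a range of R = 331776 fragments gives a space efficiency of (256000 / 331776) * 100, or about 77.2%. The file occupies 10 extents, so nG = 9 gaps out of nPG = 256000 - 1 = 255999 possible gaps, and the sequentiality is (1 - 9/255999) * 100, which rounds to 100.0%. Both results match the summary lines of the fileplace report.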


Figure 6-13. Reorganizing the file system AN512.0

Notes:

Overview

File system fragmentation can be alleviated by backing up the problem files, deleting them, and then restoring them, provided that there is enough contiguous free space. This reloads each file sequentially and reduces fragmentation.

Using the copy command to attempt problem file defragmentation can be dangerous, due to the possibility of making significant inode changes such as ownership, permission, and date stamps. In addition, use of the copy command can result in the un-sparsing of sparse files.

If the file system has very little free space, or only fragmented free space, then the entire file system needs to be backed up, the entire file system contents deleted, and the entire file system restored. This will both defragment the free space and defragment the individual files.


Reorganizing the file system

• After identifying a fragmented file system, reduce the fragmentation by:

1. Backing up the files (by name) in that file system
2. Deleting the contents of the file system (or recreating it with mkfs)
3. Restoring the contents of the file system

• Some file systems should not be reorganized because the data is either transitory (for example, /tmp), or does not change that much (for example, / and /usr)


Some file systems or logical volumes should not be reorganized because the data is either transitory (for example, /tmp), does not change much (for example, /usr and /), or is not in a file system format (for example, the log).

Backing up the file system

Back up the file system by file name. If you back up the file system by i-node instead of by name, the restore command puts the files back in their original places, which would not solve the problem. The commands to back up the file system are:

1. # cd /filesystem

2. # find . -print | backup -ivf backup_filename

This command creates a backup file (in a different file system), containing all of the files in the file system that is to be reorganized. If disk space on the system is limited, you can use tape to back up the file system.

3. # cd /

4. # unmount /filesystem

5. # mkfs /filesystem

You can also use tar or pax (rather than backup/restore) and back up by name.

Restoring the file system

To restore the contents, run the following:

1. # mount /filesystem

2. # cd /filesystem

3. Restore the data, as follows:

# restore -xvf backup_filename >/dev/null

Standard output is redirected to /dev/null to avoid displaying the name of each of the files that were restored, which is time-consuming.


Figure 6-14. Using defragfs AN512.0

Notes:

Overview

If a JFS file system has been created with a fragment size smaller than 4 KB, it becomes necessary after a period of time to query the amount of scattered unusable fragments. If many small fragments are scattered, it makes it difficult to find available contiguous free space.

To recover these small, scattered spaces, use smit or the defragfs command. Some free space must be available for the defragmentation procedure to be used. The file system must be mounted for read-write.

For JFS2, the defragfs command focuses on the number of free runs (the number of contiguous free space extents) in used allocation groups.


Using defragfs

• If the file system has only fragmented free space, then new file allocations are automatically fragmented.
• This usually occurs when the file system is almost full.
• To attempt online defragmentation of a file system, use one of the following:
  - smit dejfs
  - smit dejfs2
  - defragfs command
• This may be only partially effective if there is not enough free space to work with.
• In JFS, online defragmentation is primarily intended for scattered free fragments when using small fragment sizes.


defragfs syntax

The defragfs command line syntax is:

defragfs /fs       (to perform)
defragfs -q /fs    (to query)
defragfs -r /fs    (to report)

A query will display, for JFS, the current state of the file system:

- Number of free fragments

- Number of allocated fragments

- Number of free spaces shorter than a block

- Number of free fragments in short free spaces

A query will display, for JFS2, the current state of the file system:

- Total allocation groups

- Allocation groups skipped - entirely free

- Allocation groups that are candidates for defragmenting

- Average number of free runs in candidate allocation groups


Figure 6-15. JFS and JFS2 logs AN512.0

Notes:

Overview

JFS and JFS2 use a database journaling technique to maintain a consistent file system structure. This involves duplicating transactions that are made to file system metadata to the circular file system log. File system metadata includes the superblock, i-nodes, indirect data pointers, and directories.

When pages in memory are actually written to disk by a sync() or fsync() system call, commit records are written to the log to indicate that the data is now on disk. All I/Os to the log are synchronous. Log transactions occur in the following situations:

- File is created or deleted

- write() occurs for a file opened with O_SYNC and the write causes a new disk block allocation

- fsync() or sync() is called

- Write causes an indirect or double-indirect block to be allocated (JFS)


JFS and JFS2 logs

• A special logical volume called the log device records modifications to file system metadata, prior to writing the metadata changes to disk. After the changes are written to disk, commit records are written to the log.
• By default, each volume group has a single JFS log and a single JFS2 log, shared by all file systems in that VG.
• Log device updates can:
  - Serialize file updates due to locking for the log update
  - Introduce extra seeks into the disk write pattern
• I/O statistics from the filemon utility can identify heavily used log volumes and increased seeks.
• Can mount -o log=NULL, if integrity is not a concern.


Location of the log

File system logs enable rapid and clean recovery of file systems if a system goes down. However, there may be a performance trade-off here. If an application is doing synchronous I/O or is creating and/or removing many files in a short amount of time, then there may be a lot of I/O going to the log logical volume. If both the log logical volume and the file system logical volume are on the same physical disk, then this could cause an I/O bottleneck. The recommendation would be to migrate the log device to another physical disk (this is especially useful for NFS servers).

JFS2 file systems have an option to have an inline log. An inline log allows you to create the log within the same data logical volume. With an inline log, each JFS2 file system can have its own log device without having to share this device. The space used can be much less than the physical partition size and the location is implicitly in close proximity with the file system using it. Inline logs are really used for availability. Performance is further improved if you have a dedicated log volume on a different disk.

Recording statistics about I/Os to the log

Information about I/Os to the log can be recorded using the filemon command. If you notice that a file system and its log device are both heavily utilized, it may be better to put each one on a separate physical disk (assuming that there is more than one disk in that volume group). This can be done using the migratepv command or via SMIT.

Avoiding file system journal logging

In most situations you need to have file system journal logging for the integrity of your file system. But there are a few situations where integrity is not a concern and the I/O can run much faster without the logging. One example would be when the file system is being recovered from a backup. If there is a failure, you would simply repeat the recovery. Another example is compile scratch space; if there is a failure you would just rerun the compile.

For these situations, you may choose to mount the JFS2 filesystem with an option of log=NULL. Just remember to remount without this option before using the filesystem for a purpose that requires integrity!

JFS also has a mount option that provides the same capability:

mount -o nointegrity.
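For example, a hedged usage sketch (the mount point /scratch is hypothetical; the file system must be unmounted before remounting with either option):

# mount -o log=NULL /scratch       (JFS2 file system)
# mount -o nointegrity /scratch    (JFS file system)

When the scratch work is finished, unmount and remount without the option to restore journaling.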


Figure 6-16. Creating additional JFS and JFS2 logs AN512.0

Notes:

Overview

In the following discussion, references to the journal log apply to both JFS and JFS2 journal log devices.

Placing the log logical volume on a physical volume different from your most active file system's logical volume will increase parallel resource usage, assuming that the I/O pattern on that file system causes journal log transactions. If there is more than one file system in the same volume group causing journal log transactions, you may get better performance by creating a separate journal log for each of these file systems. The downside is that with one journal log per file system you are potentially faced with wasted storage, since the smallest each journal log can be is one physical partition.

The performance of disk drives differs. So, try to create a logical volume for a hot file system on a fast drive (possibly one with fast write cache). If using a caching storage subsystem, the seek effects may be less of a concern due to the implementation of write caching.


Creating additional JFS and JFS2 logs

• Additional log devices in a volume group, for a given file system type, may improve performance if multiple file systems are competing for the default log device.

• Placing a file system and its log on different disks may also reduce costly physical seeks.

• What to do:
  - Create a new JFS or JFS2 log logical volume
  - Unmount the file system
  - Format the log: logform -V vfstype /dev/LogName
  - Change the file system to use the new log device: chfs -a log=/dev/lvname <FileSystem>
  - Mount the file system



Creating a new log logical volume

Create a new file system log logical volume, as follows:

# mklv -t jfslog -y LVname VGname 1 PVname

or

# mklv -t jfs2log -y LVname VGname 1 PVname

or

# smitty mklv

Another way to create the log on a separate volume is to:

i. Initially define the volume group with a single physical volume

ii. Define a logical volume within the new volume group (this causes the allocation of the volume group JFS log to be on the first physical volume)

iii. Add the remaining physical volumes to the volume group

iv. Define the high-utilization file systems (logical volumes) on the newly added physical volumes

The default journal log size is 1 logical partition. For small partition sizes, this size may be insufficient for certain file systems (such as very large file systems or file systems with a lot of files being created and/or deleted). If there is a high rate of journal log transactions, then a small log could actually degrade performance because I/O activities will be stopped until the transactions can be committed. In this case, an error log entry regarding JFS log wait is recorded. If you want to increase the size of a journal log, you must first unmount the file systems that use the log device. You can then increase the size of the logical volume used by the log device, then format it using the logform command before mounting the file systems again.

Formatting the log

Format the log as follows:

# /usr/sbin/logform -V vfstype /dev/LogName

For JFS2 logs, the logical volume type used is jfs2log instead of jfslog.

Also, when using logform on a JFS2 log, specify logform -V jfs2.


Modifying /etc/filesystems and the LVCB

You can use the chfs command to modify the file system stanza in /etc/filesystems and also the logical volume control block and specify the new log volume for that file system. For example: chfs -a log=/dev/LVname /filesystemname.
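Putting the steps together, a minimal end-to-end sketch for giving a JFS2 file system its own dedicated log (the names loglv01, datavg, hdisk3, and /data are hypothetical):

# mklv -t jfs2log -y loglv01 datavg 1 hdisk3
# unmount /data
# logform -V jfs2 /dev/loglv01
# chfs -a log=/dev/loglv01 /data
# mount /data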


Figure 6-17. Sequential read-ahead AN512.0

Notes:

Overview

The VMM tries to anticipate the future need for pages of a sequential file by observing the pattern in which a program is accessing the file. When the program accesses two successive pages of the file, the VMM assumes that the program will continue to access the file sequentially, and the VMM schedules additional sequential reads of the file. These reads are overlapped with the program processing, and will make the data available to the program sooner than if the VMM had waited for the program to access the next page before initiating the I/O.

The visual uses JFS as the example, but the same principles apply to JFS2.


Sequential read-ahead

JFS: minpgahead=2, maxpgahead=8 (defaults)
JFS2: j2_minPageReadAhead=2, j2_maxPageReadAhead=128 (defaults)
Set via ioo; increase for striped logical volumes and large sequential I/O.

(Visual: each access at the start of a read-ahead group schedules a progressively larger read:)
Page 0 accessed: page 0 read in from disk
Page 1 accessed: pages 2-3 read in from disk
Page 2 accessed: pages 4-7 read in from disk
Page 4 accessed: pages 8-15 read in from disk
Page 8 accessed: pages 16-23 read in from disk


Sequential read-ahead thresholds

The number of pages to be read ahead in a JFS file system is determined by the two VMM thresholds:

- minpgahead

Note: this is an AIX6 Restricted tunable.

Number of pages read ahead when the VMM first detects the sequential access pattern. If the program continues to access the file sequentially, the next read-ahead will be for 2 times minpgahead, the next for 4 times minpgahead, and so on until the number of pages reaches maxpgahead.

- maxpgahead

Maximum number of pages the VMM will read ahead in a sequential file.

The number of pages to read ahead on a JFS2 file system is determined by the two thresholds:

- j2_minPageReadAhead

- j2_maxPageReadAhead

The distance between minfree and maxfree relative to maxpgahead or j2_maxPageReadAhead should also take into account the number of threads that might be doing maxpgahead or j2_maxPageReadAhead reads at a time. IBM’s current policy is that maxfree = minfree + (#_of_CPUs * maxpgahead or j2_maxPageReadAhead). Without this, it is too easy to drive the free list to 0 and start paging working storage pages.
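For instance, with illustrative numbers: minfree of 960, 4 CPUs, and the JFS2 default j2_maxPageReadAhead of 128 gives maxfree = 960 + (4 * 128) = 1472.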

How sequential read-ahead works

The first access to a file causes:

- The first page to be read in for JFS file systems

- Two pages to be read in for JFS2 file systems

When the next page is accessed, that page plus minpgahead additional pages are read in. Subsequent accesses to the first page of a group of read-ahead pages result in a doubling of the pages read in, up to maxpgahead (JFS) or j2_maxPageReadAhead (JFS2).

If the program were to deviate from the sequential-access pattern and access a page of the file out of order, sequential read-ahead would be terminated. It would be resumed with minpgahead (JFS) or j2_minPageReadAhead (JFS2) pages if the VMM detected that the program resumed sequential access.

Another situation where the pattern would be broken is when the file was previously cached in memory and memory overcommit has caused random blocks of the cached file contents to be stolen.


JFS example

The visual shows an example of sequential read-ahead for a JFS file system.

In this example, minpgahead is 2 and maxpgahead is 8 (the defaults). The program is processing the file sequentially.

Following is the sequence of steps in the example:

1. The first access to the file causes the first page (page 0) of the file to be read. At this point, the VMM makes no assumptions about random or sequential access.

2. When the program accesses the first byte of the next page (page 1), with no intervening accesses to other pages of the file, the VMM concludes that the program is accessing sequentially. It schedules minpgahead (2) additional pages (pages 2 and 3) to be read. Thus the access causes a total of 3 pages to be read.

3. When the program accesses the first byte of the next page that has been read ahead (page 2), the VMM doubles the page-ahead value to 4 and schedules pages 4 through 7 to be read.

4. When the program accesses the first byte of the next page that has been read ahead (page 4), the VMM doubles the page-ahead value to 8 and schedules pages 8 through 15 to be read.

5. When the program accesses the first byte of the next page that has been read ahead (page 8), the VMM determines that the page-ahead value is equal to maxpgahead and schedules pages 16 through 23 to be read.

The VMM continues reading maxpgahead pages when the program accesses the first byte of the previous group of read-ahead pages until the file ends.

Changing the sequential read-ahead thresholds

If you are thinking of changing the read-ahead values, keep in mind:

- The values should be powers of 2 (from the set: 0, 1, 2, 4, 8, 16, and so on). The use of other values may have adverse performance or functional effects.

• Values should be powers of 2 because of the doubling algorithm of the VMM.

• If the max page ahead value exceeds the capabilities of a disk device driver, the largest read size stays at 64 KB (16 pages).

• Higher values of the maximum page ahead can be used in systems where the sequential performance of striped logical volumes is of paramount importance.

- A j2_minPageReadAhead value of 0 effectively turns off the mechanism. This can adversely affect performance. However, it can be useful in some cases where I/O is random, but the size of the I/Os cause the VMM's read-ahead algorithm to take effect. For example, with a database block size of 8 KB and a read pattern that is purely random, the j2_minPageReadAhead value of 2 would cause a total of 4 pages to be read for each 8 KB block instead of two 4 KB pages. Another case where turning off page-ahead is useful is the case of NFS reads on files that are locked. On these types of files, read-ahead pages are typically flushed by NFS so that reading ahead is not helpful. NFS and the VMM will automatically turn off VMM read-ahead if it is operating on a locked file.



- The buildup of the read-ahead value from the j2_minPageReadAhead to j2_maxPageReadAhead is quick enough that for most file sizes there is no advantage to increasing j2_minPageReadAhead.

- When an application uses a large read size, the pattern fast-starts with reading 8 blocks at a time before starting the doubling pattern.

For JFS, the minpgahead and maxpgahead values can be changed with:

- ioo (-o minpgahead and -o maxpgahead)

- But minpgahead is a restricted tunable; do not modify without the direction of AIX Support.

For JFS2, the j2_minPageReadAhead and j2_maxPageReadAhead values can be changed with:

- ioo (-o j2_minPageReadAhead and -o j2_maxPageReadAhead)
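For example, a hedged sketch of raising the JFS2 maximum read-ahead for a striped logical volume (512 pages is an illustrative value; the -p flag applies the change now and records it in /etc/tunables/nextboot so it persists across reboots):

# ioo -p -o j2_maxPageReadAhead=512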


Figure 6-18. Tuning file syncs AN512.0

Notes:

Overview

When an application writes to a file, the data is stored in a memory segment which is mapped to the file. The memory page frames are marked as being modified, until they have been written to disk. These are referred to as dirty pages. When the VMM chooses to steal a modified page frame, it needs to page-out (write) the modified contents to disk. In the case of a file, the modified contents are paged to the file it is mapped to. If the VMM waits until a page frame is stolen to write the changes to the file, it might result in a very long delay and the writes would be individual pages scattered throughout the file. So, there are some other mechanisms which are used to force these dirty pages to be written to disk.

If too many pages accumulate before one of these conditions occurs, then when pages do get flushed by the syncd daemon, the i-node lock is obtained and held until all dirty pages have been written to disk. During this time, threads trying to access that file will get blocked because the i-node lock is not available. Remember that the syncd daemon currently flushes all dirty pages of a file, but one file at a time. On systems with a large amount of memory and large numbers of pages getting modified, high peaks of I/Os can occur when the syncd daemon flushes the pages.


Tuning file syncs

• JFS file writes will be stored in memory, but not written to disk, until any one of the following happens:
  - Free list page replacement steals a dirty file page, forcing a page-out to disk for that one page
  - The syncd daemon flushes pages at scheduled intervals
  - The sync command or fsync() call is issued
  - The write-behind mechanism is triggered
• An i-node lock is obtained and held while dirty pages are being written to disk, which can be a performance issue.
• Tuning options:
  - Tune sequential write-behind (on by default)
  - Turn on and tune random write-behind
  - Increase the frequency of the syncd daemon



Tunable options

There are three options to tune file syncs:

- Tune the sequential write-behind.

- Enable and tune random write-behind.

- This blocking effect can also be minimized by increasing the frequency of syncs by the syncd daemon. Use of the write-behind mechanisms is a much better solution than increasing the syncd frequency. To modify the syncd frequency, change /sbin/rc.boot where it invokes the syncd daemon, then reboot the system for it to take effect. For the running system, kill the syncd daemon and restart it with the new seconds value.

Caution

Caution should be exercised when changing the syncd time on systems with more than 16-24 GB of memory being used for the file system cache and not running AIX 5L V5.3 or later. On AIX 5L V5.1 and V5.2, syncd looks at each page in the file system cache to determine if it has been modified. As the file system cache grows large, this can cause other additional problems. With AIX 5L V5.3, a linked list of dirty pages is kept. On AIX 5L V5.1 and V5.2, because we do not have a linked list of dirty pages, syncd will use more and more system CPU to scan for dirty pages. It then sleeps for the syncd sleep time and starts again. This can and does consume a huge amount of CPU on systems with large amounts of memory used for the file cache.


Figure 6-19. Sequential write-behind AN512.0

Notes:

Overview

To increase write performance, limit the number of dirty file pages in memory, reduce system overhead, and minimize disk fragmentation, the file system implements a mechanism called write-behind. The file system organizes each file into clusters. The size of a cluster is 16 KB (4 pages) for JFS and 128 KB (32 pages), by default, for JFS2.

In JFS, written pages are cached until numclust 16 KB clusters have accumulated. The cached clusters are written to disk as soon as the application writes to the next sequential cluster. Note that the data that triggers this write-behind of cached data is not itself written immediately; it must wait for the numclust threshold to be passed again.

In JFS2, the concept is similar, except that the threshold amount to be cached before being written to disk is a single cluster of a tunable size: (j2_nPagesPerWriteBehindCluster).

Sequential write-behind
• Files are divided into clusters:
  - 4 pages (16 KB) for JFS (fixed cluster size)
  - 32 pages (128 KB) for JFS2 (default; tunable)
• Dirty pages of a file are not written to disk until the program writes the first byte beyond the threshold
• Tuning JFS file systems:
  - Threshold number of clusters is tunable with the ioo -o numclust parameter
• Tuning JFS2 file systems:
  - Single cluster threshold
  - Number of pages per cluster is tunable with ioo -o j2_nPagesPerWriteBehindCluster


To distribute the I/O activity more efficiently than either doing synchronous writes or waiting for the syncd to run, sequential write-behind is enabled by default. Without this feature, pages would stay in memory until the syncd daemon runs. This could cause I/O bottlenecks and possibly increased fragmentation of the file.

The write-behind threshold is on a per-file basis, which causes pages to be written to disk before the syncd daemon runs. The I/O is spread more evenly throughout the workload.

There are two types of write-behind: sequential and random.

Tuning sequential write-behind

For JFS, the size of a cluster is 16 KB (4 pages). The number of clusters that the VMM uses as a threshold is tunable. The default is one cluster. You can delay write-behind by increasing the numclust parameter. This will allow small writes to get coalesced into larger batches of writes so that you can get better disk write throughput. By setting numclust to a larger value, this allows for coalescing of smaller logical I/Os into larger physical I/Os. Change the numclust parameter using:

ioo -o numclust

For JFS2, the number of pages per cluster is the tunable value (rather than the number of clusters with JFS). The default is 32 pages (128 KB). This can be changed by using:

ioo -o j2_nPagesPerWriteBehindCluster

To disable write-behind for JFS2, set j2_nPagesPerWriteBehindCluster to 0.
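As a hedged illustration (the values 8 and 64 are arbitrary examples, not recommendations):

# ioo -o numclust=8                              # JFS: delay write-behind until 8 clusters (128 KB) accumulate
# ioo -p -o j2_nPagesPerWriteBehindCluster=64    # JFS2: 64 pages (256 KB) per cluster; -p makes it persistent
# ioo -o j2_nPagesPerWriteBehindCluster=0        # JFS2: disable sequential write-behind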


Figure 6-20. Random write-behind AN512.0

Notes:

Overview

There may be applications that perform a lot of random I/O, that is, the I/O pattern does not meet the requirements of the sequential write-behind algorithm and thus the dirty pages do not get written to disk until the syncd daemon runs. If the application has modified many pages in memory, this could cause a very large number of pages to be written to disk when the syncd daemon issues a sync() system call.

The write-behind feature provides a mechanism such that when the number of randomly written dirty pages in memory for a given file exceeds a defined threshold, these pages are then scheduled to be written to disk.

Random write-behind
• Can be used to prevent too many random-write dirty pages from accumulating in RAM, so that when syncd does a flush, there is not a large amount of I/O sent to the disks
• This is disabled by default
• Random write-behind writes modified pages in memory to disk after reaching tunable thresholds:
  - JFS threshold:
    • maxrandwrt (maximum number of dirty pages)
  - JFS2 thresholds:
    • j2_nRandomCluster (separation in number of clusters)
    • j2_maxRandomWrite (maximum number of dirty pages)
    • The random cluster size is fixed at 16 KB (4 pages)


JFS threshold

The parameter maxrandwrt specifies a threshold (in 4 KB pages) for random writes to accumulate in RAM before subsequent pages trigger them to be flushed to disk by the write-behind algorithm. The random write-behind threshold is on a per-file basis. The default value is 0, indicating that random write-behind is disabled.

Setting this value to 128 would mean that once 128 random page writes have occurred, any subsequent random write causes the previous writes to be written to the disk. The first set of pages and the last page written will be flushed after a sync.

This threshold is tunable by using: ioo -o maxrandwrt
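A hedged example (128 is an illustrative value, not a universal recommendation):

# ioo -o maxrandwrt=128      # flush a file's random dirty pages once 128 have accumulated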

JFS2 thresholds

In the JFS2 random write-behind algorithm, writes are considered random if two consecutive writes are separated by more than a tunable number of clusters.

There are two thresholds for JFS2 file systems:

- The parameter j2_nRandomCluster specifies the distance apart (in clusters) that writes have to exceed in order for them to be considered random by JFS2's random write-behind algorithm. The cluster size in this context is a fixed 16 kilobytes. The default is 0, which means that any non-sequential writes to different clusters are considered random writes. (The default of 0 applies with bos.mp.5.1.0.15 and later.) The threshold is tunable by using:

• ioo -o j2_nRandomCluster

- The parameter, j2_maxRandomWrite, specifies a threshold for random writes to accumulate in RAM before subsequent pages are flushed to disk by JFS2’s write-behind algorithm. The random write-behind threshold is on a per-file basis. The default value is 0 indicating that random write-behind is disabled. The threshold is tunable by using:

• ioo -o j2_maxRandomWrite
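As a sketch (again, the values are examples only), both JFS2 thresholds can be set together:

# ioo -o j2_maxRandomWrite=128 -o j2_nRandomCluster=0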


Figure 6-21. JFS2 random write-behind example AN512.0

Notes:

Example

The example in the visual demonstrates when a write is considered to be random.

In this example:

- j2_nRandomCluster is set to 4

- j2_maxRandomWrite is set to 1

Thus, two consecutive writes must be at least 16 pages apart to be considered random. This 16-page separation comes from the following calculation:

j2_nRandomCluster * 16 KB; which is 4 * 16 KB = 64 KB = 16 pages

The first two consecutive writes, one in page number 4 and the other in page number 12, are not considered to be random because the actual separation is 8 pages (12 - 4), which does not exceed the j2_nRandomCluster requirement.

JFS2 random write-behind example
• The first two consecutive writes are not considered to be random since they are not j2_nRandomCluster clusters apart
• The second two consecutive writes are considered to be random since they are more than j2_nRandomCluster clusters apart

[Diagram: a page-number axis (0, 8, 16, 24, 32, ...) showing the first two consecutive writes closer together than j2_nRandomCluster * 4 pages, and the second two consecutive writes farther apart than that; settings: j2_nRandomCluster=4, j2_maxRandomWrite=1]


The second two consecutive writes, one in page number 7 and the other in page number 28 are considered to be random because the actual separation is 21 pages (28 - 7), which is more than 16 pages. In this case, page number 7 will be written out to disk.


Figure 6-22. File system buffers and VMM I/O queue AN512.0

Notes:

Overview

Each read or write to a file system requires resources such as file system buffers. One or more file system buffers are needed for a single request. The number of file system buffers needed mainly depends on the number of pages to read or write and the file system itself:

- JFS usually needs a single file system buffer per request, which can consist of multiple pages

- JFS2 needs one file system buffer per page

When enough file system buffers are available, read or write requests can be sent to the file system pager device which will queue the I/Os to the LVM.

If the system runs out of file system buffers, read and write requests will be queued on the VMM I/O queue. The VMM then will queue the request to the file system pager once enough file system buffers become available.

File system buffers and VMM I/O queue

[Diagram: read() and write() calls enter the VMM as page faults and write-behind requests. If a file system buffer is available, the request is passed to the pager, then to the LVM, and then to the disk device driver; if not, the request waits on the VMM I/O queue until a file system buffer becomes available.]


Performance issue

The number of read/write requests on the VMM I/O queue can become quite large in an environment with heavy file system activity. As a result of this, the average response time can increase significantly.

For example, if there are already 1000 write requests to one single file system on the VMM I/O queue, and the disk subsystem can perform 100 writes per second, a single read queued at the end of the VMM I/O queue would return after more than 10 seconds.

Note: It is possible to fill up the entire memory available for file system caching with pages that have outstanding physical I/Os (queued in VMM). The system appears to hang or has a very long response time for any command that requires file access. Such a situation can be avoided by the proper use of I/O pacing.


Figure 6-23. Tuning file system buffers AN512.0

Notes:

Overview

JFS file system buffers are allocated when a file system is mounted. The number of file system buffers allocated is defined by the ioo parameter:

- numfsbufs

Increasing the initial number of JFS file system buffers can increase performance if there are many simultaneous or large I/Os to a single file system. However, if there is mainly write activity to a file system, increasing the number of file system buffers might not avoid a file system buffer shortage. I/O pacing should be used in such a case.

JFS2 file system buffers have an initial allocation when a file system is mounted, but then dynamically increase on demand by a tunable amount. The related ioo parameters are:

- j2_nBufferPerPagerDevice (for the initial allocation; Restricted tunable)

- j2_dynamicBufferPreallocation (for the increase amount; not restricted)

Tuning file system buffers
• To determine if there is a file system buffer shortage, use vmstat -v and note the rate of change between displays:

# vmstat -v
<...output omitted>
    0 pending disk I/Os blocked with no pbuf
 7801 paging space I/Os blocked with no psbuf
 2740 filesystem I/Os blocked with no fsbuf
  794 client filesystem I/Os blocked with no fsbuf
    0 external pager filesystem I/Os blocked with no fsbuf

• Increasing the number of file system buffers can increase performance if there is a high rate of blocked I/Os
• Do not increase without good reason; the buffers use pinned pages
• To change the number of file system buffers, use ioo:
  - JFS: numfsbufs=<#buffers>
  - JFS2: j2_dynamicBufferPreallocation=<#buffers>


If there are bursts of JFS2 file system activity, the normal rate of fsbuf increase may not react fast enough. A larger j2_dynamicBufferPreallocation may help in that situation.

Determining a file system buffer shortage

Use vmstat -v. Look for the following lines in the output:

- file system I/Os blocked with no fsbuf

Refers to the number of waits on JFS file system buffers

- client file system I/Os blocked with no fsbuf

Refers to the number of waits on NFS and VxFS (Veritas) file system buffers

- external pager filesystem I/Os blocked with no fsbuf

Refers to the number of waits on JFS2 file system buffers

It is normal and acceptable to have periodic transitory fsbuf shortages, so the displayed count may be fairly large without representing a problem. If the system has unsatisfactory I/O performance, compare two displays of vmstat -v and calculate the rate of change over the intervening time period. If there is a high rate of change, then modifying the discussed tunables might help. Do not increase them without good reason; they take up valuable pinned memory.

Changing the number of file system buffers

To change the number of file system buffers use:

For JFS: ioo -o numfsbufs=<# buffers>

This tunable requires an unmount and mount of the file system to be effective.

For JFS2: ioo -o j2_dynamicBufferPreallocation=<# buffers>

This tunable is effective immediately. There is no need to unmount and mount the file system.
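A hedged way to estimate the rate of change and then apply the JFS2 tunable (the interval and value are illustrative only):

# vmstat -v | grep fsbuf                         # first sample
# sleep 300
# vmstat -v | grep fsbuf                         # second sample; subtract to get the 5-minute delta
# ioo -p -o j2_dynamicBufferPreallocation=32     # only if the delta is consistently high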


Figure 6-24. VMM file I/O pacing AN512.0

Notes:

Overview

Disk I/O pacing is used to prevent I/O-intensive programs that generate very large amounts of output from dominating the system's I/O facilities and causing the response times of less demanding programs to deteriorate. Disk I/O pacing enforces per-segment (which effectively means per-file) high- and low-water marks on the sum of all pending I/Os.

When a process tries to write to a file that already has pending write requests at or above the high-water mark, the process is put to sleep until enough I/Os have completed to make the number of pending writes less than or equal to the low-water mark. The logic of I/O request handling does not change. The output from high-volume processes is slowed down somewhat.

VMM file I/O pacing
• A VMM file cache algorithm which paces the amount of write I/O to a file, trading throughput for response time
• Prevents any one thread from dominating system resources
• Tuned with system-wide sys0 object attributes maxpout and minpout:
  - AIX 5L defaults: minpout=0, maxpout=0 (disabled)
  - AIX6 defaults: minpout=4096, maxpout=8193
• Can be specified per file system via the mount options maxpout and minpout

[Diagram: a scale of pending I/O requests showing the minpout low-water mark, the maxpout - 1 high point, and the delta between them]


Controlling system wide parameters

There are two parameters that control the system wide I/O pacing:

- maxpout: High-water mark that specifies the maximum number of pending I/Os to a file

- minpout: Low-water mark that specifies the point at which programs that have reached maxpout can resume writing to the file

The high- and low-water marks can be set by:

- smit -> System Environments -> Change / Show Characteristics of Operating System (smitty chgsys) and then entering the number of 4KB pages for the high- and low-water marks

- chdev -l sys0 -a maxpout=NewValue
- chdev -l sys0 -a minpout=NewValue

Controlling per file system options

In AIX 5L V5.3 and later, I/O pacing can be tuned on a per file system basis. There are cases when some file systems, like database file systems, require different values than other file systems, like temporary file systems.

This tuning is done when using the mount command:

# mount -o minpout=40,maxpout=60 /fs

Another way to do this is to use SMIT or edit the /etc/filesystems.
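As a sketch, the equivalent /etc/filesystems stanza might look like the following (the file system and logical volume names are hypothetical, and the stanza is abbreviated):

/fs:
        dev     = /dev/fslv00
        vfs     = jfs2
        mount   = true
        options = rw,minpout=40,maxpout=60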

Default and recommended values

In AIX6, the default value for the high-water mark is 8193 and for the low-water mark is 4096. (Prior to AIX6, these both defaulted to 0, thus disabling I/O pacing.) Changes to the maxpout and minpout values take effect immediately and remain in place until they are explicitly changed.

While I/O pacing is enabled by default in AIX6, the values are set rather large to manage only the worst situations of an over-dominant batch I/O job. Depending on your situation, you may benefit by making them smaller.

It is a good idea to make the value of maxpout (and also the difference between maxpout and minpout) large enough that it is greater than the write-behind amounts. This way, sequential write-behind will not be suspended due to I/O pacing. For example, for JFS the maxpout number of pages should be greater than (4*numclust). For JFS2, the maxpout number of pages should be greater than j2_maxRandomWrite.
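To make that concrete with a hypothetical JFS setting: with numclust=8, sequential write-behind flushes 4 * 8 = 32 pages at a time, so maxpout would need to be greater than 32; by the rule explained below, 33 (a multiple of 4, plus 1) would be a good choice.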

Using JFS as an example, the recommended value for maxpout should be (a multiple of 4) + 1 so that it works well with the VMM write-behind feature. The reason this works well is for the following interaction:

1. The write-behind feature sends the previous four pages to disk when a logical write occurs to the first byte of the fifth page (JFS with default numclust=1).


2. If the pacing high-water mark (maxpout) were a multiple of 4 (say, 8), a process would hit the high-water mark when it requested a write that extended into the ninth page. It would then be put to sleep before the write-behind algorithm had a chance to detect that the fourth dirty page was complete and the four pages were ready to be written.

3. The process would then sleep with four full pages of output until its outstanding writes fell below the pacing low-water mark (minpout).

4. If on the other hand, the high-water mark had been set to 9, write-behind would get to schedule the four pages for output before the process was suspended.

While enabling VMM I/O pacing may improve response time for certain workloads, the workloads generating the large amounts of I/O will be slowed down because the processes are put to sleep periodically instead of continuously streaming the I/Os. Disk-I/O pacing can improve interactive response time in some situations where foreground or background programs that write large volumes of data are interfering with foreground requests. If not used properly, however, it can reduce throughput excessively.

Example 1

The figure on the visual presents the minpout and maxpout VMM file I/O pacing values. A thread writing to the file goes to sleep once the number of outstanding write I/Os, which includes pages that are sent to the disk and those queued in the VMM I/O queue, reaches the maxpout threshold. The thread is woken up when the number of outstanding I/Os is minpout or less.

VMM file I/O pacing should be used to avoid a large number of read/write requests on the VMM I/O queue which can cause new read/write requests to take many seconds to complete, since they are put at the end of the VMM I/O queue.

Example 2

The effect of pacing on performance can be demonstrated with an experiment that consists of starting a vi editor session on a new file while another process is writing a very large file with the dd command. If the high-water mark were set to 0, the logical writes from the dd command could consistently arrive faster than they could be physically written, and a large queue would build up.

Each I/O started by the vi session must wait its turn in the queue before the next I/O can be issued, and thus the vi command is not able to complete its needed I/O until after the dd command finishes. The following table shows the elapsed seconds for dd execution and vi initialization with different pacing parameters.

High Water   Low Water   Throughput (sec)   vi (sec)
    0            0            49.8          finished after dd
   33           24            23.8          no delay
  129           64            37.6          no delay
  257          128            44.8          no delay
  513          256            48.5          no delay
  769          640            48.3          < 3
 1025          256            49.0          < 1
 1025          384            49.3          3
 1025          896            47.8          3 to 10

It is important to notice that the dd duration is always longer when pacing is set. Pacing sacrifices some throughput on I/O-intensive programs to improve the response time of other programs.

Remember that these results are for one particular environment. It requires experimentation in the actual target environment with your actual applications to find out what values work best for you.

The challenge for a system administrator is to choose settings that result in a throughput and response-time trade-off that is consistent with the organization's priorities. It may be that a 3 second response time is acceptable and you need to optimize the batch processing. For a different organization, the response time is paramount and they can accept some delay in batch job completion.


Figure 6-25. The pro and con of VMM file caching AN512.0

Notes:

The main advantage to using file caching is the avoidance of costly re-reads to disk. Once a file block has been cached in memory, an application file read for that file is quickly handled through a memory read.

Even if there is no expected re-read of that data in the near future, the file caching is necessary for the read-ahead and write-behind mechanisms which were just covered. Both mechanisms result in the coalescing of many smaller I/O requests into fewer larger requests for contiguous (or at least clustered) requests. For sequential read-ahead, there is the additional benefit of generating overlapping disk I/O requests, without requiring an application to be re-written to use asynchronous I/O processing. This provides a significant throughput improvement.

The main disadvantage is the memory and CPU overhead. The path length of the kernel logic for read and write service calls and for interrupt handlers is significantly increased to handle the file caching. Normally, this is an acceptable trade-off for the listed benefits. With the current memory defaults, a large file cache overcommit of memory does not usually result in any computational pages being paged out, but it can trigger a high volume of file cache page stealing. This page stealing can then significantly diminish the potential advantages of the file caching.

The pro and con of VMM file caching
• Pro:
  - Later reads do not require disk I/O
  - Sequential read-ahead mechanism provides:
    • Overlapping of disk I/O (without application AIO)
    • Larger disk reads from grouping of read requests
  - Write-behind mechanism allows larger and more efficient writes to disk due to coalescing
• Con:
  - CPU overhead
    • Longer path length of kernel logic
    • Extra load is a concern if CPU constrained
  - Memory overhead
    • Large amount of non-computational memory
    • Page stealing disrupts and diminishes benefits


Figure 6-26. JFS and JFS2 release-behind AN512.0

Notes:

Overview

Release-behind is a mechanism under which pages are freed as soon as they are either committed to permanent storage (by writes) or delivered to an application (by reads). This solution addresses a scaling problem when performing large amounts of sequential I/O on files whose pages may not need to be re-accessed in the near future.

Release-behind only applies to sequential I/O, so random I/O pages are cached. In addition, the pages used for read-ahead are also cached until they are delivered to the application.

When writing a large file without using release-behind, writes will go very fast whenever there are available pages on the free list. When the number of pages drops to minfree, VMM uses its Least Recently Used (LRU) algorithm to find candidate pages to release and reuse. Because the LRU daemon examines frames one at a time, acquiring and releasing multiple locks, it can take too long to release enough frames to allow the I/O to continue at full speed. This lock contention can cause a sharp performance degradation.

JFS and JFS2 release-behind
• Over-committed memory results in page stealing:
  - Inefficient single page writes
  - Disruption of sequential read-ahead
• When you know the data will not be re-read in the near future, use release-behind to free file cache memory
  - Reduced page stealing and less disruption
• With release-behind, sequential I/O pages are freed as soon as:
  - They are committed to permanent storage (by writes)
  - They are delivered to an application (by reads)
• Enabled by mounting a file system with one of the following options:
  - rbr    Release-behind when reading
  - rbw    Release-behind when writing
  - rbrw   Release-behind when reading and writing

Furthermore, when file cache is stolen, the sequential read-ahead and write-behind mechanisms can be disrupted. Dirty pages can be paged out one page at a time, rather than being allowed to accumulate and be written in longer sequences. The read-ahead mechanism is disrupted when previously read-ahead pages are stolen; the file read is forced to re-read the stolen block from disk, which then resets the read-ahead algorithm. Random I/O will not be impacted as much as sequential I/O.

Enabling release-behind

You enable this mechanism by specifying one of the following flags to the mount command:

- rbr

Mount file system with the release-behind-when-reading capability. When sequential reading of a file in this file system is detected, the real memory pages used by the file will be released once the pages are copied to internal buffers.

- rbw

Mount file system with the release-behind-when-writing capability. When sequential writing of a file in this file system is detected, the real memory pages used by the file will be released once the pages have been written to disk.

- rbrw

Mount file system with both the release-behind-when-reading and release-behind-when-writing capabilities.

Release-behind side-effect

A side-effect of using the release-behind mechanism is that you will notice an increase in CPU utilization for the same read or write throughput rate, compared to not using release-behind. This is due to the work of freeing pages, which would normally be handled at a later time by the LRU daemon. On the other hand, the overhead of freeing the pages needs to be done sometime; this is a shift of when that occurs.

Note that, with release-behind, all file page accesses result in disk I/O since file data is only briefly cached by VMM before being released. The exception to this is if the pages are the result of the read-ahead mechanism. In that case, they are cached by VMM until the application reads them into its private segment; then release-behind frees those pages.

Files with contents that are expected to be read soon after writing should not use release-behind-when-writing. Files with contents that are expected to be re-read within a relatively short period of time should not use release-behind-when-reading. Since this is managed through mount options, you should plan to segregate your files into file systems where the mount options can be appropriate to the files in each file system. Alternatively, you may remount the file system with different options according to how the file system will be used. For example, a normal mount during the day and a release-behind mount on third shift during batch report processing (if appropriate).
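A hedged sketch of that remount approach (the mount point is hypothetical):

# umount /reports
# mount -o rbrw /reports     # third-shift batch mount with release-behind
# umount /reports
# mount /reports             # return to the normal mount for daytime use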


Figure 6-27. Normal I/O versus direct I/O (DIO) AN512.0

Notes:

Direct I/O uses all the facilities of the file system and the logical volume manager, except that it does not do any file caching. The entire VMM file caching layer is skipped.

Normal I/O writes will store the data in the VMM file cache and tell the application the write is complete. The application can then do something else asynchronously while the write is being processed. Later actions, such as the syncd running, flush the data to the disk. Normal I/O reads attempt to read the file from the file cache. If there is a cache miss, then it triggers a read to disk. But if the data is already in memory from an earlier read or write, the read request is quickly completed through a memory-to-memory transfer. The read-ahead mechanism provides asynchronous or overlapped processing of reads. Cached files tend to stay in memory until the pages are stolen or the file system is unmounted.

With direct I/O, the write request needs to be processed all the way to a commit to disk before the application is told it is completed. The data goes directly from the application's private memory to the storage adapter for writing to disk. A DIO read request is immediately sent out to the disk; when it completes, the data is stored directly in the application's private memory without first being stored in the file system's file cache.

Normal I/O versus direct I/O (DIO)

[Diagram: with normal I/O, application requests pass through the file system, the VMM, and the LVM to physical disk I/O; with DIO, requests pass from the application through the file system directly to the LVM and physical disk I/O, bypassing the VMM.]


Figure 6-28. Using direct I/O (DIO) AN512.0

Notes:

Since the data flows directly from and to the application’s private memory, without VMM caching, there is a great reduction in memory usage. This is a common reason for database engines to use DIO, since they maintain their own cache of data in computational storage. Having eliminated the entire VMM caching layer, the amount of code to be executed in handling filesystem I/O requests is greatly reduced. Both of these benefits can be valuable, especially if the system is memory or CPU constrained. On the other hand, all the benefits of file caching that were previously discussed are lost: quick access to previous read data, read-ahead overlap processing and request coalescing, and finally the write-behind data coalescing and asynchronous completions.

Direct I/O should only be used with applications which are known to be DIO compliant. DIO has rules that require the block sizes and file positions to be multiples of the page size (4 KB). If these rules are violated, the DIO request is demoted, which is a very bad thing. DIO demotion means the I/O request that violated the rules reverts to file caching and all of its overhead, but without the read-ahead and write-behind benefits.

Using direct I/O (DIO)
• I/O requests are between disk and the application's private memory; no VMM caching
  - Reduces memory load
  - Reduces CPU load, shortens kernel service path length
  - Loses the significant benefits of VMM caching
  - All normal reads and writes are synchronous
• Application should be specially written to use DIO
  - Files opened with the O_DIRECT flag
  - Complies with DIO block size and alignment rules
  - Data caching (if needed) in application computational memory
  - Large read and write sizes for sequential I/O
  - Asynchronous I/O with AIO service calls
• Non-DIO access demotes DIO requests
  - Commonly copy or backup commands
  - Solved by using the dio mount option
  - Only use dio mounts with DIO compliant applications


It is best to use applications which are specifically written for using DIO. In addition to being compliant with the DIO rules, these applications will compensate for the loss of the read-ahead and write-behind mechanisms. For example, they will use very large reads and writes for sequential processing (rather than small reads and writes that rely on read-ahead or write-behind to coalesce them into larger disk I/O requests). If the application expects to re-read data, it will manage its own intelligent caching of data, often with much better hit ratios than what the file system VMM caching can achieve. To improve effective throughput, these applications will often use the AIX Asynchronous I/O (AIO) facility by issuing aio_read and aio_write calls, which tell the kernel service how to notify the application of later completion while immediately returning control to that application to do work asynchronous to the processing of the I/O request. These applications, which are written for DIO, will open the file in DIO mode. As a result, the system administrator does not need to do anything special to enable this capability.

A problem will sometimes develop where the administrator wants to work with a file that is currently being processed by an application using DIO. The administrator will use a utility such as the cp command or the backup command to read from the file. The utility opens the file normally and does not request DIO processing. When any process requests normal I/O processing of a file, then all processing is done normally (VMM file caching). For the process that is requesting DIO, this is referred to as a demotion. When DIO requests are demoted, they incur all of the costs of file caching with little of the benefit; the I/O will not use read-ahead or write-behind.

A common solution to this situation, is to use the DIO option of the mount command. When the file system is mounted with this option, all I/O requests to the files will be treated as DIO requests. That can allow the utility to run without causing demotions. The catch is that the utility must be DIO compliant.
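A minimal sketch of that mount option (the mount point is hypothetical):

# umount /dbfs
# mount -o dio /dbfs         # all file access in /dbfs is now treated as DIO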

The danger of doing a DIO mount is that there may be other files in that file system which were never intended to be processed with DIO. In other words, the programs that use these files are not even DIO compliant, much less written to run well with DIO. For these other files, performance can be seriously impacted. One of the worst things you can do is mount file systems containing executable programs and libraries using DIO.


Figure 6-29. Checkpoint (1 of 3) AN512.0

Notes:


Checkpoint (1 of 3)

1. True/False File fragmentation can result in a sequential read pattern of many small reads with seeks between them.

2. True/False When measuring file system performance, I/O subsystems should not be shared.

3. Two commands to measure read throughput are:_________ and __________

4. The _____________ command can be used to determine if there is fragmentation.


Figure 6-30. Checkpoint (2 of 3) AN512.0

Notes:


Checkpoint (2 of 3)

5. What tunable functions exist to flush out modified file pages, based on a threshold of the number of dirty pages in memory?

6. What is the difference between JFS and JFS2 random write-behind?

________________________________________________________________________________________________________________________


Figure 6-31. Checkpoint (3 of 3) AN512.0

Notes:


Checkpoint (3 of 3)

7. List factors that may impact performance when files are fragmented:

8. What commands can be used to determine if there is a file system performance problem?

9. What is the relationship between file system buffers and the VMM I/O queue?
______________________________________________________________________________________


Figure 6-32. Exercise 6: File system performance AN512.0

Notes:


Exercise 6: File system performance

• Monitor and fix file fragmentation

• Using release-behind

• Using DIO (optional)


Figure 6-33. Unit summary AN512.0

Notes:


Unit summary

This unit covered:
• Guidelines for accurate file system measurements
• How file fragmentation affects file system I/O performance
• Using the filemon tool to evaluate file system performance
• Tuning:
  - JFS and JFS2 logs
  - Release-behind
  - Read-ahead
  - Write-behind
• Identifying resource bottlenecks for file systems


Unit 7. Network performance

What this unit is about

This unit describes the issues related to network performance. It shows you how to use performance tools to monitor and tune network performance.

What you should be able to do

After completing this unit, you should be able to:

• Identify the network components that affect network performance

• List the network tools that can be used to measure, monitor, and tune network performance

• Monitor and tune UDP and TCP transport mechanisms

• Monitor and tune for IP fragmentation mechanisms

• Monitor and tune network adapter and interface mechanisms

How you will check your progress

Accountability:

• Checkpoint
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

AIX Version 6.1 System Management Guide: Communications and Networks

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)

SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Figure 7-1. Unit objectives AN512.0

Notes:

Unit objectives

After completing this unit, you should be able to:
• Identify the network components that affect network performance
• List the network tools that can be used to measure, monitor, and tune network performance
• Monitor and tune UDP and TCP transport mechanisms
• Monitor and tune for IP fragmentation mechanisms
• Monitor and tune network adapter and interface mechanisms


Figure 7-2. What affects network performance? AN512.0

Notes:

Overview

There are a number of things that can affect network performance. Among them are:

- Type of interface

- Capacity of hubs, switches, and routers

- Host architecture

- Types of connections using the network at the same time

- Settings of AIX parameters

Some of these factors are external to AIX and have to be managed separately. However, there are a number of parameters that are internal to AIX that can be used to manage and even improve network performance.

As with so many areas of performance, it is often possible to improve the performance by simply increasing the amount of the constraining resource. In the case of networking, the primary resource is network bandwidth. Thus, upgrading to a faster network can often improve network performance. For example, one might upgrade the network from 10 Mbps to 100 Mbps (or even gigabit Ethernet). But the bandwidth of the physical network is not the only factor, and there can be logical resources that are the actual constraining factor. Thus, this unit will focus on many of these logical resources such as buffer sizes, queue sizes, delay timers, and so forth.

What affects network performance?
• Network infrastructure
• Type of session or connection
• Session parameters
• Resource availability
• AIX settings


Figure 7-3. Document your environment AN512.0

Notes:

netstat -i

It is hard to analyze a network performance problem if you do not know what you are working with. The netstat -in command lists the configured interfaces on the system. This includes those interfaces which are currently in a down state (shown with a leading * symbol in its name). The configured MTU size is shown for each interface (this will be covered later).

It also shows some statistics related to those interfaces. The network interface statistics provide a quick overview about which network interfaces show the most load.

To look at a single interface, use capital I and the name of the interface. For example:

# netstat -I en0

To output the address as numeric IP address, add a -n; to see the symbolic name translation, do not use the -n flag.

Document your environment

# netstat -in
Name Mtu   Network    Address           ZoneID  Ipkts Ierrs  Opkts Oerrs  Coll
en0  1500  link#4     0.1a.64.91.85.fe  -        1406     0    209     4     0
en0  1500  192.168.2  192.168.2.1       -        1406     0    209     4     0
lo0  16896 link#1     -                 -         354     0    366     0     0
lo0  16896 127        127.0.0.1         -         354     0    366     0     0
lo0  16896 ::1                          0         354     0    366     0     0

# lsdev -Cc adapter | grep 'ent[0-9]'
ent0 Available 01-08 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
ent1 Available 01-09 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)

# entstat -d ent0 | more
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
Hardware Address: 00:1a:64:91:85:fe
Elapsed Time: 0 days 7 hours 35 minutes 53 seconds

Transmit Statistics:                 Receive Statistics:
--------------------                 -------------------
Packets: 205                         Packets: 1410
Bytes: 19786                         Bytes: 136274
Interrupts: 0                        Interrupts: 1404
Transmit Errors: 0                   Receive Errors: 0
Packets Dropped: 0                   Packets Dropped: 0


lsdev

The lsdev listing identifies the available ethernet adapters. This can identify currently unconfigured adapters that might be used, either as alternative network adapters or that might be combined with another adapter to form an etherchannel (aggregate adapter), for additional bandwidth.

entstat

The entstat listing identifies additional adapter characteristics and configuration information. Later, the course covers using the detailed network statistics provided in the entstat report.
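As a quick hedged example of scanning that report for trouble indicators (the grep patterns match the statistics labels shown in the report above):

# entstat -d ent0 | grep -i -E 'errors|dropped'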


Figure 7-4. Measuring network performance AN512.0

Notes:

Network throughput

Network performance has two main aspects. One is throughput and the other is response time.

When transferring large amounts of data, we are concerned with throughput. This is measured in bytes per second. Any application can measure the amount of data transferred and how long it took. A standard file transfer application which automatically reports this to us is FTP.

One common mistake made in measuring network throughput is to accidentally include non-network factors, such as retrieval of the data from disk at the source or storing of the data at the destination. A good way to avoid this is to use special device files such as /dev/zero and /dev/null which have no disk activity. Other utilities may give you an option to do “memory” transfers.

Measuring network performance
• Network throughput:
  - Number of bytes transferred / total transfer time
  - ftp command is a common tool:
    ftp> put "| dd if=/dev/zero bs=32k count=100" /dev/null
  - Compare with:
    • Theoretical bandwidth (Ex. Fast Ethernet = 94.8 Mbps)
    • Baseline measurements
    • Performance goals
• Network response time:
  - Round-trip delay of transaction
  - Common tools: ping, netperf (transaction rate)
  - Application processing is often a major factor in total response time


Another common tool that is used is the spray command. This is an RPC based command which will allow you to specify the number, size and delay between packets sent. It then reports the average transaction rate and data transfer rate. It depends on configuring the inetd to enable the service at the destination.

Next, the measurement needs to be evaluated. One way to do this is to compare it to the theoretical bandwidth that can be achieved on the type of network you have. One needs to be careful here. Do not assume that 100 Mbps transfer is actually possible on a 100 Mbps Ethernet. Even without contention for the bandwidth, there is overhead in the form of intergap delays, frame headers and trailers, IP headers and TCP headers. If not transferring on a common LAN, there are other components in the path which can affect the transfer.
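As a hedged worked example (the timing value is invented for illustration): the ftp test shown on the visual transfers 100 x 32 KB = 3,276,800 bytes. If ftp reports that the transfer took 0.35 seconds, the throughput is 3,276,800 / 0.35 ≈ 9.4 MB/s, or roughly 75 Mbps, which you would then compare against the approximately 94.8 Mbps practically achievable on Fast Ethernet.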

The following table lists some of the network interface types:

Interface                              Name   Speed
Ethernet                               (en)   10 Mb/sec - 10 Gb/sec
IEEE 802.3                             (et)   10 Mb/sec - 10 Gb/sec
Token-Ring                             (tr)   4 Mb/sec or 16 Mb/sec
X.25 protocol                          (xt)   64 Kb/sec
Serial Line Internet Protocol, SLIP    (sl)   64 Kb/sec
loopback                               (lo)   -
FDDI                                   (fi)   100 Mb/sec
SOCC                                   (so)   220 Mb/sec
ATM                                    (at)   155 Mb/sec or 622 Mb/sec

To avoid a mismatch in speed or duplex mode at 10/100 Mb speed, it is often recommended to disable auto-negotiation for Ethernet cards and set them to the fixed media speed and duplex mode. It is also imperative that the same be done at the switch port. This does not apply if you are running at Gb speed.

As with performance analysis in all the major resource categories, it is important to obtain a benchmark of what is achievable under normal conditions when performance seems satisfactory. When concerned with degraded performance, comparing current measurements to the baseline can tell you if network throughput is a cause of the current degradation.

Comparing against the baseline is also important for normal monitoring of the system to spot trends in performance degradation, which in time may result in performance goals not being met.

Utilization is the percentage of time a device is in use. General queueing theory holds that devices with utilizations greater than 70% will see response time increases because incoming requests have to wait for previous requests to complete. Maximum acceptable utilizations are a trade-off of response time for throughput. In interactive systems, utilizations of devices should generally not exceed 70-80% for acceptable response times.

Network response time

Many applications do not have that much data to transfer, but need to obtain a fairly quick reply. There are many networking elements which may have little effect on throughput, but would have a great effect on response time. Response time is determined by measuring the time between when a request is sent and a reply is received.

While we will focus here on network response time, it is important to realize that to the end user, this network response time may only be a small part of the overall perceived response time. The client software may do processing even before sending a transaction, the server may need to wait for a program to be loaded and once loaded may take some time to process the request (including disk I/O delays).

To isolate the network response time from the non-network elements, we use tools which have very little processing delay at either end. A universal tool of this sort is the ping command. The ping command will provide a report giving the delay between sending the ICMP echo request and the ICMP echo reply. Because it only operates at Layer 3 and is built into the kernel, it has very little additional delay in the processing.

Another way to measure response time is to record a rapid transaction rate. For this you need a client/server application which serializes a large number of transactions. As soon as the reply to one transaction returns, it issues another one. If we invert the transaction rate (divide it into 1) we will obtain the average response time per transaction. This has the dual benefit of smoothing out any short term variances during the test and the ability to measure sub second response times more accurately. A common tool for this purpose is netperf. (www.netperf.org)
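For example (the numbers are invented for illustration), if netperf reports a sustained rate of 2,000 serialized transactions per second, the average response time is 1 / 2000 = 0.5 ms per transaction.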

It is more difficult to identify an optimal theoretical response time because the network latencies (often measured in hundreds of microseconds) are only a small part of the network response time. The other factors depend very much on the network topology and machine configurations. Once again, obtaining a baseline measurement under normal loads when performance is acceptable is important for later comparisons.

For those who are not familiar with the ftp command's put subcommand syntax in the visual, here is an extract from the BSD man page for ftp:

If the first character of the file name is "|", the remainder of the argument is interpreted as a shell command. ftp then forks a shell, using popen(3) with the argument supplied, and reads (writes) from the stdin (stdout). If the shell command includes spaces, the argument must be quoted; for example, "| ls -lt".


Figure 7-5. Network services processing AN512.0

Notes:

Introduction

The visual shows the flow of network data from a sending application down through the protocol layers and out on to the physical network. It then shows the flow of network data into a receiving application. The data arrives across the physical network and makes its way up the protocol stack.

Maximum Transmission Unit (MTU)

The MTU indicates the established maximum size for data transfer for a given network interface. The MTU parameter is tunable and must be set to the same value for all hosts on a given network interface. If the amount of data sent by a process is larger than the MTU, it is divided into separate packets that comply to the MTU by the layers of protocol illustrated above.

Network services processing

[Diagram: on the sending side, data moves from the application's write buffer through the socket send buffer (mbuf chain), with MTU compliance applied at the TCP and TCP/UDP layers, MTU enforcement at the IP layer, and out through the transmit queues to the adapter and media. On the receiving side, data arrives through the receive queues and IP input queues, up through the socket receive buffer (mbuf chain) to the application's read buffer via a file descriptor. The layers shown, top to bottom: application layer, socket layer, TCP/UDP layer, IP layer, Demux and IF layer, device driver, adapter, media.]


Sending

The interface layer (IF) is used on output and is the same level as the Demux layer (used for input) in AIX. It places transmit requests on to a transmit queue, where the requests are then serviced by the network interface device driver. The size of the transmit queue is tunable, as is described later in this lecture. The loopback interface still uses the IF layer both on input and output.

Receiving

The network interface device driver places incoming packets on a receive queue. If the receive queue is full, packets are dropped and lost, resulting in the sender needing to retransmit. The receive queue is tunable using SMIT or the chdev command. The maximum queue size is specific to each type of communication adapter.

The IP layer also has an input queue. The Demux layer places incoming packets on this queue. Once again, if the queue is full, packets are dropped and never reach the application. If packets are dropped at the IP layer, a statistic called "ipintrq overflows" in the output of netstat -s -f inet is incremented. If this statistic increases in value, then you should tune the ipqmaxlen tunable using the no command. In AIX, in general, interfaces will not do queuing and will directly call the IP input queue routine to process the packet. The loopback interface will still do queuing.
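A hedged sketch of checking and tuning that queue (the value 250 is illustrative; depending on the AIX level, ipqmaxlen may be a reboot tunable, hence the -r flag here):

# netstat -s -f inet | grep ipintrq    # look for a growing overflow count
# no -o ipqmaxlen                      # display the current setting
# no -r -o ipqmaxlen=250               # schedule a larger queue for the next boot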

Communication steps

Applications that use TCP/IP protocols across a network use sockets to facilitate communication. A socket is similar to the file access mechanism. On the send side, the following things happen:

Step  Action
 1    A program that needs network communication opens a TCP or UDP type socket.
 2    The program writes data to the socket.
 3    Data is copied to a socket send buffer made up of mbufs and clusters.
 4    The socket calls the TCP/UDP layer, passing a pointer to the linked list of mbufs or clusters.
 5    TCP/UDP allocates a new mbuf for header information.
 6    TCP/UDP copies data from the socket send buffer to the header mbuf or new mbufs chain (maximum size of chain governed by MTU). UDP copies the data to one or more clusters, with the remainder allocated to mbufs, then adds a UDP header.
 7    TCP/UDP checksums the data and calls the IP layer.
 8    IP determines the correct interface, fragments UDP packets larger than the MTU size, and updates and checksums the IP part of the header.
 9    The mbuf chain is passed to the interface (Demux layer).
10    Demux prepends link layer header information, checks format, and calls the device driver write routine.
11    At the device driver layer, the mbuf chain containing the data is enqueued on the transmit queue and the adapter is signaled to start transmission operations.
12    After all data is sent, control returns to the application, the transmit queues are adjusted, and the mbufs are freed.

The receive side works in reverse, stripping off headers and passing along pointers to mbufs. Queue sizes, buffer sizes, and MTU sizes are all tunable parameters.

The processing sequence shown does not cover the various alternative flows that could result from changing various tunables. It is only an example of the default processing sequence that is traditionally implemented.


Figure 7-6. Network memory AN512.0

Notes:

Overview

AIX provides a network memory pool to service network communication. The maximum size of the network memory pool is defined by the thewall network option. Network buffers are automatically allocated and pinned out of this network memory pool when they are needed.

Size of network pool

The thewall value for maximum size of the network memory pool cannot be changed. It will be calculated at boot time by the following formula:

- 32-bit kernel is one half of RAM or 1 GB (whichever is smaller)

- 64-bit kernel is one half of RAM or 65 GB (whichever is smaller)

Network memory

• AIX provides a network buffer pool for network operations:
  – Dynamically allocated and deallocated based on demand
  – Cluster sizes can range from 32 bytes up to 128 KB
  – mbufs anchor data
  – Data can be in an mbuf or a chain of clusters
• thewall is the maximum amount of network pinned memory:
  – It is not tunable
  – It is determined by the amount of real memory
• If mbufs or clusters are not available, performance may suffer because packets may be dropped or delayed

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

© Copyright IBM Corp. 2010 Unit 7. Network performance 7-13

Page 362: AN512STUD

Student Notebook

The following command shows the amount of real memory that can be used for the network memory pool on a machine:

no -o thewall

The sys0 ODM object (attributes of the operating system) maxmbuf attribute, if set to a non-zero value, will be used instead of thewall. The default value is zero.

Note: The sys0 maxmbuf parameter cannot be set to a value that is greater than the system determined thewall value.
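For example (maxmbuf is expressed in kilobytes, and the value shown is only illustrative):

# lsattr -E -l sys0 -a maxmbuf
# chdev -l sys0 -a maxmbuf=16384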

Memory buffers

AIX network services manage memory that is allocated and pinned for use by network services. The amount of memory allocated and managed by network kernel services dynamically increases and decreases based upon the utilization pattern, but will never exceed thewall.

The allocated memory is organized into memory clusters of various sizes. These clusters are used for many purposes including various control blocks that are necessary to represent sockets, connections, interfaces, routes, and other network components. The cluster sizes can vary from 32 bytes to 1024 KB, in sizes that are always a power of 2 (Ex. 32, 64, 128, 256, 512,...).

Network operations require buffers to transfer data. These buffers use clusters matched to the amount of data that needs to be stored. On a multiprocessor system, each CPU is assigned its own network memory pool with buckets of buffers from 32 bytes to 16384 bytes, and there is a set of global buckets for sizes from 16 KB to 1024 KB.

Every network datagram that is sent or received must be represented by a control block called an mbuf. The mbuf is stored in the appropriately sized memory cluster. If the data is small enough, then it is stored right in the mbuf. Otherwise, the mbuf will point to an mbuf cluster that can hold the data. An mbuf cluster is a control block which is allocated a matching size memory cluster. In many cases, the mbufs are chained together to keep track of them.

Since the mbuf clusters only come in certain sizes, the memory used may be almost twice as much as the amount of data being transmitted. For example, if an application sends 1460 bytes of data, then the smallest cluster that can contain it would be a 2048 byte cluster.

The term buffer is used in many different ways in computer jargon and, in fact, has many different meanings in networking. For example, later we will talk about a socket buffer, which is really a different concept than the clusters we are discussing here to hold an individual datagram or message sent by an application or received from the network adapter. When someone uses the term buffer, be sure to clarify what they are referring to.

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

7-14 AIX Performance Management © Copyright IBM Corp. 2010

Page 363: AN512STUD

Student NotebookV5.4

Uempty

Network buffer limit

The maximum amount of real memory that can be used for the network memory pool is limited by thewall. The value of thewall cannot be changed; the sys0 maxmbuf attribute can only be used to further restrict the maximum amount of memory used for the network buffer pool, not to increase it. The value of thewall is calculated at boot time:

- 32-bit kernel is 1/2 of RAM or 1 GB (whichever is smaller)

- 64-bit kernel is 1/2 of RAM or 65 GB (whichever is smaller)

Options for a network buffer shortage

When netstat -m shows mbuf allocation failures, you have the following options:

- Add more RAM, if the machine runs 32-bit kernel and has less than 2 GB RAM

- Add more RAM, if the machine runs 64-bit kernel and thewall is less than 65 GB

- Change from 32-bit kernel to 64-bit kernel if possible and add RAM if needed

- Check the size of the socket send and receive buffers to determine whether they can be reduced

- It is possible that an mbuf or cluster memory leak by a kernel component is causing the mbuf or cluster shortage. A steady increase in allocations of a particular cluster size or of a particular control block usage could be an indicator of an mbuf leak. A full analysis of a kernel memory leak is outside the scope of this class.

Limited components

The limit on the network memory pool (thewall) is also used to limit how much memory can be used for STREAMS. The tunable parameter called strthresh (default value of 85% of thewall) specifies that once the total amount of allocated memory has reached 85% of thewall, no more memory can be given to STREAMS.

Similarly, another threshold called sockthresh (also defaults to 85%) specifies that once the total amount of allocated memory has reached 85% of thewall, no new socket connections can occur (socket() and socketpair() system calls return with ENOBUFS). These thresholds are tunable via the no command.
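Both thresholds can be displayed or changed with the no command; for example (90 is only an illustrative value):

# no -o strthresh
# no -o sockthresh=90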


Figure 7-7. Memory statistics with netstat -m . AN512.0

Notes:

What to look for

The main thing to look for in the output of netstat -m are non-zero values in the failed and delayed columns. If these values are non-zero, you need to identify whether you can reduce the mbuf usage (by reducing the socket buffer sizes, discussed later), add more memory (on 64-bit kernel) or move to a 64-bit kernel, if possible.

The maximum size for the network buffer pool cannot be increased beyond the system defined default.

Once a cluster (such as an mbuf) is allocated and pinned, it can be freed by the network services routine. Instead of unpinning this buffer and giving it back to the system, it is left on a free list based on the size of this buffer. The next time a buffer is requested, it can be taken off this free list in order to avoid the overhead of pinning. Once the number of buffers on the free list reaches the high water mark, buffers smaller than 4096 bytes will be coalesced together into page-sized units so that they can be unpinned and given back to the system. When the buffers are given back to the system, the freed column is incremented.

Memory statistics with netstat -m

# netstat -m
Kernel malloc statistics:

******* CPU 0 *******
By size   inuse   calls   failed  delayed  free  hiwat  freed
64        171     5452    0       2        21    1884   0
128       2032    2477    0       63       16    942    0
256       810     5189    0       50       22    1884   0
512       2108    175570  0       258      20    2355   0
1024      188     4428    0       48       8     942    0
2048      557     1694    0       261      3     1413   0
4096      133     139     0       4        25    471    0
8192      4       10      0       1        0     117    0
16384     128     128     0       16       0     58     0
32768     24      24      0       6        0     29     0
65536     59      59      0       30       0     29     0
131072    3       3       0       0        36    73     0

******* CPU 1 *******
By size   inuse   calls   failed  delayed  free  hiwat  freed
64        32      326     0       1        96    1884   0
128       54      646     0       2        42    942    0
256       36      998     0       2        12    1884   0
512       26      30441   0       0        78    2355   0
. . .


If the freed value consistently increases, this indicates that the high water mark is too low. There is no shipped tool to increase the high water mark; however, the thresholds scale with the maximum amount of memory available for network buffers.

Allocating and deallocating buffers

When a network service needs to transport data, it can call a kernel service such as m_get() to obtain a memory buffer. If the buffer is already available and pinned, it can get it right away. If the upper limit has not been reached and the buffer is not pinned, then a buffer is allocated and pinned. Once pinned, the memory stays pinned when it is freed back to the network pool. If the number of free buffers reaches a high water mark (not tunable), then a certain number are unpinned and given back to the system for general use. This unpinning is done by the netm kproc. The caller of m_get() can specify whether or not to wait for a network memory buffer. If M_DONTWAIT is specified and no pinned buffers are available at that time, then a failed counter is incremented. If M_WAIT is specified, then the process is put to sleep until the buffer can be allocated and pinned by the netm kproc (in this case the failed counter is not incremented). The larger size buffers can only be allocated if M_WAIT is specified.

The low water mark and high water mark for mbufs scale with the size of the network buffer pool.

If mbufs or clusters are not available, performance may suffer because packets may be dropped or delayed.


Figure 7-8. Socket flow control (TCP) AN512.0

Notes:

Buffers

Sockets hold transient data in two buffers:

- Send space buffer

- Receive space buffer

The system implements limits on the sizes of these buffers on both a per socket and system level. Separate limits are defined for TCP and UDP buffers.

TCP reliable transport

The TCP send buffer is used to hold onto data that has been sent, until acknowledgement is received that the other side has successfully received the data. If acknowledgement is not received, then TCP can retransmit what it has in the buffer. The size of the buffer limits how much unacknowledged data can be buffered. If it fills, then a send by the application will be blocked until there is free space in the send buffer.

[Figure: Socket flow control (TCP). The application's data buffer in user memory is copied into the socket structure's send and receive buffers in kernel memory, where the data is held in chains of mbufs.]


At the destination socket, the receive buffer is used to hold onto arriving data until it can be matched to an application receive request and moved to the application’s private memory. Due to TCP flow control, this buffer should never overflow or discard data.

TCP flow control

TCP implements flow control by using a sliding window mechanism which is described in detail on the next visual. This allows data to be transmitted and received without having to worry about exceeding the size of the socket buffers. The no command parameters tcp_sendspace and tcp_recvspace are global limits on the TCP send and receive socket buffers. Applications can use the setsockopt() system call to override these limits. The default value for tcp_sendspace and tcp_recvspace is 16384.
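For example, to display the current global defaults:

# no -o tcp_sendspace
# no -o tcp_recvspace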

Ultimate limit for TCP and UDP buffers

Another parameter, sb_max, controls the upper limit for any of these buffers. All of these parameters are tuned with the no command.


Figure 7-9. TCP acknowledgement and retransmission AN512.0

Notes:

Overview

When the destination socket receives a segment it does not immediately send an acknowledgement. Instead it waits to see if there is a datagram being sent in the other direction to “piggy back” the acknowledgement on. This is to reduce the number of ack-only packets using the network capacity. (The TCP protocol designers use the term “piggy back” to refer to the practice of signalling the acknowledgement in a data packet that is being sent anyway.)

If there are no datagrams going in the other direction, the destination socket waits for another segment to arrive and will then acknowledge both segments. If, after waiting 200 ms, there is neither a datagram to piggy back on nor a second segment to trigger an acknowledgement, then the socket sends an acknowledgement for the one segment. This 200 ms timer can be tuned using the fasttimeo network option.

At the socket which sent the segments, it holds them in the send space buffer until it receives the acknowledgement, at which point it frees up that space in the buffer.

[Figure: TCP acknowledgement and retransmission. Segments 5 through 8 are transmitted; the receiver acknowledges segments 6 and 7 within the 200 ms fasttimeo delay, but the acknowledgement for segment 8 never arrives, so after the retransmission timeout (RTO) the sender retransmits segment 8 and then receives its acknowledgement.]


If after waiting for a kernel calculated Retransmission Time Out (RTO) period, the sending socket does not receive an acknowledgement, it retransmits the segment on the assumption that it was discarded in the network.

If the host sending the data has sent additional packets after the unacknowledged packet, then that sending host needs to retransmit not only the unacknowledged packet, but all the packets that were sent after that point. This further adds load to the network and additional overhead to the hosts on the connection.

Technically, the receiving host acknowledges receipt of everything up to a byte position in a stream of bytes that are numbered from the first byte transmitted in the connection, but the acknowledged byte position almost always correlates to the last byte of a segment; thus it is common to talk about the segments that were acknowledged. The details about acknowledged bytes in the stream of transmission are mainly important when doing analysis of network traces.


Figure 7-10. TCP flow control and probes AN512.0

Notes:

Overview

In order to prevent TCP receive buffer overflows, TCP implements a flow control mechanism in which the receiving socket controls how much data can be transmitted by the sending socket. The amount of unacknowledged data that the sending side may transmit is called the window size. If the transmitting socket has a full window of transmitted but unacknowledged traffic, it has to stop and wait for data to be acknowledged before it can continue transmitting.

The receiving side advertises a window size based upon its ability to receive that data. The greatest factor used to determine the window size is the size of the TCP socket receive buffer. If the TCP receive buffer is too small, it can artificially constrain the throughput of the connection. Even when the TCP receive buffer is very large, if the receiving server is experiencing congestion (buffer filling faster than it can process the data), it can reduce the window to protect itself from overflow.

[Figure: TCP flow control and probes. The sender transmits packets C and D (8 KB) and receives "acknowledge D win=8 KB", then transmits packets E and F (8 KB) and receives "acknowledge F win=0 KB" because of server congestion. After a timeout, the sender issues a window probe and again receives "acknowledge F win=0 KB". When the server is ready for more data, a later window probe returns "acknowledge F win=16 KB" and the sender resumes with packet G (4 KB).]


How it works

The receiver advertises a window size back to the sender as part of an ACK packet. This tells the sender how much room the receiver has in its buffer to accept packets.

The sender will transmit all the segments within its window (the sequence of packet segments waiting to be sent).

The receiver can acknowledge multiple packets instead of sending back an ACK for each packet. As long as the receiver is acknowledging packets fast enough and is advertising a large enough window, the sender will continue transmitting packets. Larger window sizes allow more time to transmit data while unacknowledged data travels to the destination and the acknowledgement returns.

When a server gets overloaded the receive socket buffer may fill up faster than the application can receive the data. In response to this the receiving host will reduce the advertised window size. If the socket buffer is completely filled, then the advertised window can be reduced to zero, preventing any new transmission on the connection. When the receiving socket buffer empties, the receiving host will send an unsolicited acknowledgement with the non-zero window size. If there is a long delay in the sending host receiving the new window size, it will send a window probe. This is because it does not know if a non-zero window advertisement was discarded in the network. Without a window probe, the connection could be in a permanent deadlock with the sending host waiting for a non-zero window and the receiver waiting for the next data packet.

A statistic of the number of window probes usually indicates how long and how often the advertised window was closed to zero. This, in turn, can be an indication of server overload.


Figure 7-11. netstat -p tcp . AN512.0

Notes:

Highlighted statistics

Statistics of interest are:

- Packets sent
- Data packets
- Data packets retransmitted
- Window probe packets
- Packets received
- Completely duplicate packets
- Window probes
- Retransmit timeouts

netstat -p tcp

# netstat -p tcp
tcp:
        6899764 packets sent
                4436476 data packets (2943162856 bytes)
                3208 data packets (3788499 bytes) retransmitted
                1813815 ack-only packets (500199 delayed)
                1 URG only packet
                389 window probe packets
                484658 window update packets
                161217 control packets
                0 large sends
                0 bytes sent using largesend
                0 bytes is the biggest largesend
        7861688 packets received
                3535095 acks (for 2943219325 bytes)
                82344 duplicate acks
                0 acks for unsent data
                5906529 packets (950111507 bytes) received in-sequence
                4165 completely duplicate packets (376089 bytes)
                1 old duplicate packet
                140 packets with some dup. data (1611 bytes duped)
                67997 out-of-order packets (15386274 bytes)
                105 packets (139969 bytes) of data after window
                0 window probes


Packets sent and packets retransmitted

For the TCP statistics, compare the number of packets sent to the number of data packets retransmitted. If the number of packets retransmitted is over 10-15 percent of the total packets sent, TCP is experiencing timeouts indicating that network traffic may be too high for acknowledgments (ACKs) to return before a timeout. A bottleneck on the receiving node or general network problems can also cause TCP retransmissions, which will increase network traffic, further adding to any network performance problems.
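A quick way to pull out just these statistics for comparison (the grep pattern simply matches the report lines shown in the visual):

# netstat -p tcp | grep -E "packets sent|retransmitted"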

Packets received and completely duplicate packets

Compare the number of packets received with the number of completely duplicate packets. If TCP on a sending node times out before an ACK is received from the receiving node, it will retransmit the packet. Duplicate packets occur when the receiving node eventually receives all the retransmitted packets. If the number of duplicate packets exceeds 10-15 percent, the problem may again be too much network traffic or a bottleneck at the receiving node. Duplicate packets increase network traffic.

Retransmit timeouts

Another important statistic in the report is the value for retransmit timeouts, which occurs when TCP sends a packet but does not receive an ACK in time. It then resends the packet. This value is incremented for any subsequent retransmittals. These continuous retransmittals drive CPU utilization higher, and if the receiving node does not receive the packet, it eventually will be dropped. (This is not shown on the visual, but is highlighted in the “rest of the output” below.)

Window probe packets and window probes

A large value for sent window probe packets indicates that either the socket receive space of the remote receiving sockets is too small or the applications on the remote receiving host are not reading the data fast enough.

A large value for received window probes indicates that either the socket receive buffer on the local receiving socket is too small or the applications on the local receiving host are not reading the data quickly enough.


The rest of the output

The output in the visual is the beginning of the total output. Following is the remainder of the report:

5441 window update packets
51 packets received after close
0 packets with bad hardware assisted checksum
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
53 discarded by listeners
1123560 ack packet headers correctly predicted
4109607 data packet headers correctly predicted
53780 connection requests
56147 connection accepts
109101 connections established (including accepts)
130872 connections closed (including 1623 drops)
0 connections with ECN capability
0 times responded to ECN
826 embryonic connections dropped
3418078 segments updated rtt (of 3087065 attempts)
0 segments with congestion window reduced bit set
0 segments with congestion experienced bit set
0 resends due to path MTU discovery
24 path MTU discovery terminations due to retransmits
940 retransmit timeouts
4 connections dropped by rexmit timeout
1402 fast retransmits
223 when congestion window less than 4 segments
646 newreno retransmits
42 times avoided false fast retransmits
0 persist timeouts
0 connections dropped due to persist timeout
8929 keepalive timeouts
8893 keepalive probes sent
35 connections dropped by keepalive
0 times SACK blocks array is extended
0 times SACK holes array is extended
0 packets dropped due to memory allocation failure
3 connections in timewait reused
0 delayed ACKs for SYN
0 delayed ACKs for FIN
0 send_and_disconnects
0 spliced connections


0 spliced connections closed
0 spliced connections reset
0 spliced connections timeout
0 spliced connections persist timeout
0 spliced connections keepalive timeout


Figure 7-12. TCP socket buffer tuning (1 of 2) AN512.0

Notes:

TCP send buffer

The TCP socket send buffer is used to buffer the application data in the kernel using mbufs and clusters before it is sent beyond the socket and TCP layer. The default size of this buffer is specified by the no parameter tcp_sendspace, but can be overridden by the application using the setsockopt() system call.

The send buffer can hold both data waiting to be transmitted (queued due to a full window) and data that has already been transmitted but is waiting for acknowledgement. When data is acknowledged, this frees up space in the buffer, which in turn allows the application to send more data. If a send buffer fills up, the application cannot send any more data. A larger send buffer allows an application to have more transmitted data unacknowledged and to queue data while the window is full. A small send buffer quickly fills up and causes application send requests to be blocked, even though the window is not full.

TCP socket buffer tuning (1 of 2)

• TCP send buffer size - how much data can be buffered before the application is blocked
• TCP receive buffer size - how much data the receiving system can buffer until the application reads it
• Buffer sizes can be set in a hierarchy:
  – Application setsockopt(): SO_SNDBUF, SO_RCVBUF
  – Interface attributes: tcp_sendspace, tcp_recvspace
  – Network tunables: tcp_sendspace, tcp_recvspace
• Effective window size based on minimum of:
  – Transmitter’s send buffer
  – Receiver’s advertised receive window size


Uempty

If an application does non-blocking I/O (specified O_NDELAY or O_NONBLOCK on the socket), then if the send buffer fills up, the application will return with an EWOULDBLOCK/EAGAIN error rather than being put to sleep. Applications need to be coded to handle this condition. A suggested solution is to sleep for a short period of time and try to send again. When changing sendspace or recvspace values, in some cases it is necessary to stop and restart the inetd process (stopsrc -s inetd; startsrc -s inetd).

TCP receive buffer

The TCP receive buffer is used to accommodate incoming data. When the data is read by the TCP layer, TCP can send back an acknowledgment for that packet immediately or it can delay before sending the ACK. TCP tries to piggyback the ACK if a data packet was being sent back anyway. If multiple packets are coming in and can be stored in the receive buffer, TCP can ACK all of these packets with one ACK. Along with the ACK, TCP will send back a window advertisement to the sending system telling it how much room there is left in the receive buffer. If there’s not enough room left, the sender will be blocked until the application has read the data. The default size of this buffer is specified by the parameter tcp_recvspace.


Figure 7-13. TCP socket buffer tuning (2 of 2) AN512.0

Notes:

Selecting the best socket buffer size

If we set the initial window size too small we may be unnecessarily blocking application send requests. On the other hand, if the sender is on a much faster machine and network than the destination, it could be adding to a congestion situation (resulting in packet discards). In that case, we may want to “de-tune” the window size to restrain how fast we transmit.

One way to determine how large the window should be is to calculate the “bandwidth-delay product”. Basically this is the amount of data we can transmit during the Round Trip Time (RTT). The RTT is the time between when we transmit a segment and when we receive the matching acknowledgement. The trick is in determining the transmission rate and the RTT.

A different and common approach is to determine the best window size through experimentation. By trying a variety of different window sizes and measuring the throughput on each, we can determine at what point increasing the window size leads to diminishing returns.

TCP socket buffer tuning (2 of 2)

• Tune initial window size:
  – Big enough to avoid blocking application sends: how much data can be sent during round trip time (RTT)?
  – Small enough to avoid receiver or network problems
• Experiment for optimal throughput:
  – Try different sizes and measure the effect
• sb_max limits the maximum size for any socket buffer:
  – Set sb_max to at least twice the size of the largest socket buffer
• TCP window size is limited to 64 KB but can be set higher (rfc1323=1):
  – Maximum window size = 2 ** 30 (1 GB)
  – Adds 12 additional bytes of overhead to each packet


TCP window size

TCP uses a 16-bit value for its window size, by default. This provides for a maximum window of 64 KB. If data is being sent through adapters that have large MTU sizes (32 KB or 64 KB for example), TCP streaming performance may not be optimal since the packet or packets will get sent and the sender will have to wait for an acknowledgment. By enabling the RFC 1323 option using no -o rfc1323=1, TCP’s window size can be set as high as 1 GB (2 ** 30). After setting this option, you can increase the tcp_recvspace parameter to something much larger, such as 10 times the size of the MTU. An alternative option would be to reduce the MTU size if the receiving system does not support RFC 1323.

Maximum memory for socket buffers

A socket’s send buffer memory usage plus that socket’s receive buffer memory usage can never exceed sb_max bytes. sb_max is a ceiling on buffer space consumption. In addition, no individual socket buffer size maximum (Ex. tcp_sendspace) is allowed to exceed the sb_max value. The two quantities (socket buffer size versus sb_max) are not measured in the same way, however. The socket buffer size limits the amount of data that can be held in the socket buffers. The sb_max value limits the number of bytes of mbufs that can be in the socket buffer at any given time. In an Ethernet environment, for example, each 2048 byte mbuf cluster might hold just 1500 bytes of data. In that case, sb_max would have to be 1.37 times larger than the specified socket buffer size to allow the buffer to reach its specified capacity. The guideline is to set sb_max to at least twice the size of the largest socket buffer.

To change the sb_max value, use the command:

# no -o sb_max=<new_value>

Large MTU issues

On adapters that have 64 KB MTUs, TCP streaming performance can be seriously degraded if the receive buffer is 64 KB or less. The two main protocols which have this concern are the High Performance Switch (Federation switch) and ATM since both can be configured with a 64 KB MTU.

The problem is that as soon we transmit a segment, we have filled the window. We then have to stop and wait the RTT until we receive an acknowledgement that allows us to slide the window forward. We have to wait between each and every MTU transmission. To avoid this, we need to have a window size that is at least twice the MTU size and preferably larger.


If we send large segments (ex. 32 KB) which are smaller than these large MTUs (64 KB), we also have interactions with Nagle’s Algorithm, which can result in waiting 200 ms between each transmit, reducing the throughput to 5 packets per second. We will cover Nagle’s Algorithm later in this unit.

If the receiving machine is not an AIX system or does not support RFC1323, then reducing the MTU size is one way to improve streaming performance in this situation.

RFC 1323

RFC 1323 is designed to modify the standard TCP protocol to support networks which have a large bandwidth, use large MTUs, and are very fast. The original protocol was designed for the 10 Mbps Ethernet with a 1500 byte MTU. There are several changes to the protocol specified in RFC 1323. For example, with the rate at which packets could be sent on a 10 Mbps Ethernet, it would take a very long time for the sequence number field in the TCP header to overflow, but on a Gigabit Ethernet connection this can happen much more quickly. So RFC 1323 specifies a protocol for handling the wraparound when the sequence number field starts counting from the beginning after reaching its limit. The most commonly cited benefit is the ability to modify the use of the TCP header field for advertising the window size. The original field had a maximum value of 64 KB. With modern networks that became a major performance constraint. RFC 1323 provides a mechanism to multiply the value in the TCP header advertised window size field by powers of two, thus allowing much larger window sizes. For the High Performance Switch (HPS), use of RFC 1323 is really a requirement.

To enable RFC 1323, use the command:

# no -o rfc1323=1

The downside of RFC 1323 is that every TCP header needs a 12 byte extension, which adds to the overhead of using the protocol. So you do not want to turn this on unless you need the benefits it provides.

Table of suggested buffer sizes

The following table shows some suggested socket buffer sizes based on the type of adapter and the MTU size. The general rule of thumb is for TCP send and receive space to be at least 10 times the MTU size. MTU sizes above 16 KB should use rfc1323=1 to allow larger tcp_recvspace values. For high-speed adapters, larger TCP send and receive space values help performance.

The window size is the receiver’s window size. rfc1323 only affects the receiver.

In benchmark tests with gigabit Ethernet using a 9000 byte MTU, it was found that the performance was the same for both the given sets of buffer sizes.


Many of the faster adapters set ISNO options, making it unnecessary for you to tune based on that adapter. But remember, this is only a starting point - different sessions have different requirements.

Device        Speed      MTU    tcp_sendspace  tcp_recvspace  rfc1323
------        -----      ---    -------------  -------------  -------
Ethernet      10/100 Mb  1500   16384          16384          0
Ethernet      Gb         1500   131072         65536          0
Ethernet      Gb         9000   131072         65536          0
Ethernet      Gb         9000   262144         131072         1
Ethernet      10 Gb      1500   131072         65536          0
Ethernet      10 Gb      9000   262144         131072         1
ATM           155 Mb     1500   16384          16384          0
ATM           155 Mb     9180   65536          65536          1
ATM           155 Mb     65527  655360         655360         1
FibreChannel  2 Gb       65280  65536          65536          1
FDDI          100 Mb     4352   45056          45056          0


Figure 7-14. Interface specific network options AN512.0

Notes:

Tunable options

The Interface Specific Network Options (ISNO) is enabled by default and can be disabled by setting the no option (use_isno) to 0.

For each network interface, there are five ISNO parameters:

-rfc1323

-tcp_nodelay

-tcp_sendspace

-tcp_recvspace

-tcp_mssdflt

These correspond to the same options with the no command.

Interface specific network options

• AIX supports a subset of network tuning attributes that can be set on each network interface
• Tunable options include:
  – tcp_sendspace and tcp_recvspace
  – rfc1323
  – tcp_mssdflt
  – tcp_nodelay
• These options are tuned at the interface level using SMIT or chdev
• The no option use_isno defaults to 1 (enabled)
• ISNO values automatically configured for some adapters
• All of these can be overridden by application setsockopt()


Changing the options

If these values are set for a specific interface, then they will override the system no default value. This allows different network adapters to be tuned for the best performance. The application can override any of these options using setsockopt().

These values can be displayed via the lsattr -E -l interface command. They can be changed via the chdev -l interface -a attribute=value command. For example:

chdev -l en0 -a tcp_recvspace=65536 -a tcp_sendspace=65536

sets the tcp_recvspace and tcp_sendspace to 64 KB for en0 interface.

Using the chdev command will change the value in the ODM database so it will be saved between system reboots. If you want to set a value for testing or temporarily, use the ifconfig command. For example:

ifconfig en0 hostname tcp_recvspace 65536 tcp_sendspace 65536 tcp_nodelay 1

sets the tcp_recvspace and tcp_sendspace to 64 KB and enables tcp_nodelay.

These values are also displayed via the ifconfig interface command.

Default ISNO option values

For some high speed adapters, the ISNO parameters are defaulted in the ODM predefined database.

Adapter Type  MTU    RFC1323  tcp_sendspace  tcp_recvspace
GigaE         1500   0        131072         65536
GigaE         9000   1        262144         131072
ATM           9180   0        65536          65536
ATM           65527  1        65536          65536
FDDI          4352   0        45046          45046
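To verify the ISNO values in effect for a given interface (whether supplied by the ODM defaults above or set by an administrator), query just those attributes; for example:

# lsattr -E -l en0 -a tcp_sendspace -a tcp_recvspace -a rfc1323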


Figure 7-15. Nagle’s algorithm AN512.0

Notes:

Overview

While a local area network (LAN) can handle many small sized packets (defined to be smaller than the maximum segment size), a wide area network (WAN) could get congested. To reduce the congestion problem, TCP implements the Nagle algorithm which states that a TCP connection can have no more than one outstanding small segment that has yet to be acknowledged. This means the first small segment can be sent right away, but no more small segments can be sent until an acknowledgement (ACK) is received for the previous one. Instead, subsequent small segments are collected together by TCP until TCP deems there is enough to meet the MSS value or until the TCP 200 ms timer expires.

Nagle’s algorithm

To prevent congestion of networks, Nagle’s algorithm states that a TCP connection can have only one outstanding small segment that has yet to be acknowledged:

• A small segment is defined to be smaller than the MSS
• Packets are collected until there is enough data to meet the MSS requirement or until the 200 ms TCP timer expires
• May hinder the performance of certain types of applications such as request/response applications
• Packet transmission does not get delayed by Nagle’s algorithm if any of the following are set:
  – tcp_nodelay socket option through setsockopt() system call in the application
  – tcp_nagle_limit to 1
  – tcp_nodelay to 1 in the Interface Specific Network Options


Disabling Nagle’s algorithm

Since some applications may not stream packets (such as an application that sends a small packet and cannot do anything until it gets back a response), these applications may actually suffer serious performance problems due to this algorithm. In such cases, the applications can do a setsockopt() on the socket (after the connect or accept) and set the tcp_nodelay flag.

If the send buffer size is less than or equal to the maximum segment size (ATM and SP switches can have 64 KB MTUs), then the application’s data will be sent immediately but the application will have to wait for an ACK before sending another packet, due to Nagle’s algorithm. This prevents TCP streaming and could reduce throughput. To maintain a steady stream of packets, increase the socket send buffer size so that it’s greater than the MTU (3-10 times the MTU size could be used as a rule-of-thumb).

A system administrator can also allow all TCP connections on an interface to behave as if tcp_nodelay was set by setting interface specific options such as tcp_nodelay to 1 on the interface (not all interfaces support this yet). For details on tcp_nodelay tuning see later material under, “Network interface tuning”.

Another no parameter is called tcp_nagle_limit. This value defaults to the largest packet size (65535). TCP disables the Nagle algorithm for packets of size greater than or equal to the value of tcp_nagle_limit. So, this means you can essentially disable the algorithm altogether for all packets by setting the value to 0 or 1.
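For example, to effectively disable the algorithm for all connections, or for just the connections through one interface (en0 is only an illustrative interface name):

# no -o tcp_nagle_limit=1
# chdev -l en0 -a tcp_nodelay=1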

Delayed packet transmission

If the amount of data that the application wants to send has all of the following attributes:

- Smaller than the send buffer size

- Smaller than the maximum segment size

- tcp_nodelay is not set

Then TCP will delay up to 200 ms (fasttimeo tunable) until one of the following conditions is met before transmitting the packets:

- There’s enough data to fill the send buffer

- The amount of data is greater than or equal to the maximum segment size

The MSS value is computed by TCP based on the MTU size or the tcp_mssdflt value, depending on whether it is a local or remote network. If tcp_nodelay is set, then the data is sent immediately. This is useful for request/response type of applications.

Most network interfaces support the tcp_nodelay option which can be set with the chdev command. If you have a connection through a network interface that does not support tcp_nodelay, and you cannot change the application, set the no parameter tcp_nagle_limit to 1.


Delayed acknowledgements

Operating systems like Windows NT/2000 have difficulties with data streaming when they get delayed acknowledgements (acknowledgement packets are always sent delayed on AIX). You can disable delayed acknowledgement transmission by setting the no parameter tcp_nodelayack to 1.
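For example:

# no -o tcp_nodelayack=1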

Idle connection

When a connection goes idle (after 0.5 seconds without any data traffic), the initial window size is set to one MTU. When an application then sends more data than the MTU size, one packet is sent that needs to be acknowledged before sending the rest of the data. The receiver might be expecting more than one packet and does not send the acknowledgement packet immediately (usually the receiver will send it after 200 ms). You can increase the initial window size by changing the no parameter tcp_init_window so that more than one packet is sent when sending data through an idle connection.

In order to do this, the rfc2414 option must be on. For example, the following will set the initial window size to 4*MSS (Maximum Segment Size for this connection):

# no -o rfc2414=1

# no -o tcp_init_window=4


Figure 7-16. UDP buffer overflow AN512.0

Notes:

Adjusting for UDP buffer overflows

Without flow control services, it is possible to have socket receive buffer overflows.

This will happen when UDP datagrams arrive faster than the UDP application can issue receive requests. Unable to transfer the data from the socket buffer to the application’s private memory, the buffer fills up. The next datagram to arrive is discarded for lack of space.

While it is hard to predict how large the receive buffer should be, a commonly recommended starting value is 10 times larger than the sendspace being used by the transmitting socket. Some environments can get by with less, while others will need even more. The only way to know for sure is to monitor the occurrence of receive buffer overflows.
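For example, with the values below (which are only a starting point) the receive buffer follows the 10x guideline:

# no -o udp_sendspace=65536
# no -o udp_recvspace=655360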

One way to handle the situation is to decrease demand by reducing the number of clients transmitting to a server (perhaps by spreading the workload across more individual servers). Another technique is to reduce the size of the records being sent, though that is not always an option for the given application.

UDP buffer overflow

Datagrams arriving faster than receives are issued result in buffer overflow and packet discard.

• Solutions:
  – Increase udp_recvspace to at least 10 x udp_sendspace
  – Increase CPU cycles
  – Decrease number of clients or workload

[Figure: arriving datagrams fill the udp_recvspace socket receive buffer until the application drains it with recvfrom().]



Sometimes the problem is that the receiving host is CPU bound. When this happens, it is possible that while the network adapter interrupt handlers are able to get cycles (they get a preferred fixed priority), the application may be starved for cycles and thus delayed in issuing new receives. The solution is to tune the CPU situation.


Figure 7-17. netstat -p udp . AN512.0

Notes:

Highlighted statistics

Statistics of interest are:

- Dropped due to no socket

- Socket buffer overflows

Socket buffer overflows

A large socket buffer overflow count indicates that either the UDP socket receive buffer on the local machine is too small or the applications are not reading the data fast enough. The result is that the packet is dropped. You want to avoid ANY dropped packets in UDP protocol since it has a severe impact on performance.

Socket buffer overflows could be due to insufficient transmit and receive UDP sockets, too few nfsd daemon threads (we will cover NFS in the next unit), or too small nfs_socketsize, udp_recvspace, and sb_max values.

netstat -p udp

# netstat -p udp
udp:
        1309238 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        139 dropped due to no socket
        521435 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        787664 delivered
        1283000 datagrams output


Check the affected system for CPU or I/O saturation, and verify the recommended setting for the other communication layers by using the no -a command. If the system is saturated, you must either reduce its load or increase its resources.

Dropped due to no socket

The dropped due to no socket counter is an important statistic. It indicates that there was no open socket matching the destination port number on the arriving datagram. The application on this host is either not running or is in a state where it is not ready to accept requests. A well written UDP client/server design would have the two sides do a handshake before starting a flow of requests. The sending side may attempt this over and over in a polling fashion until the server replies. As such, this counter is likely to represent a functional problem. If this value is high, investigate how the application is handling sockets.


Figure 7-18. Fragmentation and segmentation AN512.0

Notes:

Segment Size

When TCP takes data from the sendspace buffer to pass to the IP layer, it first prepends a 20 byte header with such information as source and destination port numbers, byte position in the stream and other transport layer management information. The data carried in this datagram is referred to as a segment.

Because TCP is connection oriented it has better visibility to what an optimal transmission unit size should be. This MTU is then converted into the maximum size of the segment. The Maximum Segment Size (MSS) is (MTU - TCP header - IP header). The IP header is 20 bytes and the TCP header is 20 bytes. Thus, for standard Ethernet the MSS would be ( 1500 - 40 = 1460 ).

TCP will never send a segment larger than the MSS. As a result, IP at the originating host will never have to fragment. If the MSS is optimal for the entire path, then none of the intermediate routers will need to fragment the datagram, either. As a result, the destination host will not need to do fragment reassembly using the IP input queue.

[Figure: Fragmentation and segmentation. User data passes from the application to the TCP layer, which prepends a TCP header to form segments no larger than the MSS; the IP layer prepends an IP header (no IP fragmentation occurs); the link layer prepends its own header to form a frame no larger than the MTU. MSS: Maximum Segment Size. MTU: Maximum Transmit Unit.]


Figure 7-19. Intermediate network MTU restrictions AN512.0

Notes:

Introduction

At TCP connection establishment, both sides communicate what their MTU restrictions are, expressed as an MSS value. Unfortunately, this only identifies the local MTU restrictions, which is fine if both sides are on the same network, but will cause problems if they are remote.

If TCP used these values based upon the local MTU, we could end up with intermediate routers fragmenting the packets.

In this example, we have two gigabit Ethernets using jumbo frames (9000 byte MTU) with an intermediate fast Ethernet network which has an MTU of 1500. If we were to use the local MTU values of 9000, then the first router would be forced to break those large packets into smaller ones with transmission units no more than 1500 bytes in size.

The router does not need this extra burden and we want to avoid creating fragments.

We need some way to communicate the MTU restriction of the intermediate networks.

[Figure: Intermediate network MTU restrictions. Two hosts on gigabit Ethernet networks (MTU 9000) establish a connection and agree on mss=8960, but the path crosses an intermediate fast Ethernet network with an MTU of 1500, so the first router must break each 9000 byte packet into multiple 1500 byte fragments. Best performance comes from the largest segments which will not be fragmented.]


Figure 7-20. TCP maximum segment size AN512.0

Notes:

Introduction

The Maximum Segment Size (MSS) is the largest “chunk” of data that TCP will transmit to the other end. When a connection is established, each end has the option of announcing an MSS it expects to receive, based upon its local MTU restriction. If one end of a connection does not receive an MSS option from the other end, a default of 536 bytes is (typically) assumed. This allows for a 20 byte IP header plus 20 byte TCP header to fit into a 576 byte IP datagram. In practice, this small default size is rarely experienced.

Size and fragmentation

In general, the larger the MSS the better, until fragmentation occurs. A larger segment size allows more data per segment, reducing the TCP and IP header cost per byte of data.

TCP maximum segment size

• TCP Maximum Segment Size (MSS):
  – Largest segment TCP will transmit to the other end
  – MSS = (MTU – TCP/UDP headers – IP header)
  – The goal is to avoid any fragmentation
  – A larger size allows less overhead per byte of data
• Local connection: MSS uses the local interface MTU sizes
• For a remote connection the MSS is tunable:
  # no -o tcp_mssdflt=1460 (default value)
• Overridden with the path MTU mechanism:
  – Discovers the best MTU value for a given path
  – To disable (it is enabled by default): # no -o tcp_pmtu_discover=0
  – To display or manage table entries:
    # pmtu display
    # pmtu delete -dst 192.168.1.5


The MSS allows a host to limit the datagram size that is sent by the other side. If the MSS is small enough, there will be no need to fragment.

When establishing a connection, TCP can announce an MSS value up to the outgoing interface MTU minus the size of the fixed TCP and IP headers. If the destination IP address specified in a connection is “nonlocal”, the protocol default for MSS is 536 (AIX uses a default value of 512 bytes). In practice this is overridden by the value of the no option tcp_mssdflt which defaults to 1460 and is tunable.

Note that, in AIX 5L V5.3 and later, the tcp_mssdflt is ignored when path MTU discovery (PMTU) is enabled. This will be covered in more detail later in this unit.

Network route with MTU attribute

There may be situations where the smallest MTU that you need to anticipate with tcp_mssdflt is only on some routes but not on others. In that situation, you might want to use a different MSS for some connection than others.

One way to improve this is to add a static route that forces the default MTU to a different size. Let’s say that the local system is on the 129.35.46.1-126 subnet (the router’s address is 129.35.46.1). If you wanted to send data to the 9.3 network with a 1500 size MTU, then you can specify this on the local system with the route command:

/usr/sbin/route add -net 9.3.0.0 -netmask 255.255.0.0 129.35.46.1 -mtu 1500

Path MTU

The global tcp_mssdflt may not be optimal for all paths. The alternative of manually defining routes each with an associated mtu attribute would be difficult to manage. To provide a more customized approach with low administrative overhead, AIX implements a mechanism which automatically discovers the optimal MTU size for each connection destination and stores it in a Path MTU (PMTU) table. The contents of the table can be displayed with the pmtu command.

How path MTU is discovered

The discovery mechanism simply sends segments which comply with the local MTU restriction, but with an IP header bit set to forbid fragmentation. If there is a router in the path which has an interface with a tighter restriction, it sends back an ICMP error packet. Using this information, TCP discovers the largest MSS that can be successfully routed to the destination without requiring fragmentation.

Path MTU timeout

Since changes could occur in the network (such as a failover resulting in a path with a smaller MTU), the entries periodically expire and then require rediscovery. If


administrators know that the path MTU has changed and do not wish to wait for the 10 minute (default) entry expiration, then they can manually delete an entry.
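The expiration interval is itself tunable through the no option pmtu_expire (in minutes), and a stale entry can be deleted immediately. A sketch (the address and new value are illustrative):

# no -o pmtu_expire=20    (default is 10 minutes)
# pmtu delete -dst 192.168.1.5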

This PMTU table is used to identify the MSS to be used for each TCP transmission.

UDP PMTU discovery

There is also a udp_pmtu_discover network option, which is also enabled by default. The catch is that this is only used if UDP applications are coded to query the PMTU size and then use that information to restrict the size of their sends.


Figure 7-21. Fragmentation and IP input queue AN512.0

[Figure: a datagram that exceeds the adapter MTU is fragmented by the IP layer; arriving fragments wait on the receiving host’s IP Input Queue (governed by the no options ipqmaxlen and ipfragttl) and may be discarded or delayed.]

Notes:

Introduction

A datagram will be fragmented by the IP layer whenever the transmission unit would otherwise exceed the MTU for the interface. If this did not happen, the interface would have to discard the datagram to avoid overflowing the adapter’s transmission buffer.

Fragmentation may occur at the source host or at an intermediate router’s IP layer. One of the major reasons to avoid fragmentation is IP Input Queue overflows.

IP input queue processing

When fragments arrive at the destination host’s IP layer, they are placed on the IP Input Queue. IP will not pass the data to the transport layer until it has reassembled the original datagram. It must receive all the fragments before it can do this. If one of the fragments was discarded in the lower network layers, then IP will never be able to reassemble the original datagram. Rather than have these dead fragments fill up the



queue, it periodically checks to see how long they have been in the queue. If fragments have been in the queue longer than the time to live specified by the ipfragttl network option value, they are discarded by the IP layer. The ipfragttl option is coded in half seconds with a default of 60 (that is 30 seconds).

These fragment discards are shown under the netstat -s and the netstat -p ip statistic: fragments dropped after timeout.

If the total number of fragments in the IP Input Queue reaches the number specified by the ipqmaxlen network option, any newly arriving fragments are discarded by the IP layer. These fragment discards are shown under the netstat -s and the netstat -p ip statistic: ipintrq overflows.

Performance impact of IP input queue discards

All discards are bad, since that requires the higher layers to detect the loss through timeouts and then retransmit. There are several scenarios in which discards can occur at the IP Input Queue.

- There were no discards or significant delays in the network, but the arrival rate of fragments is so high that it overflows the IP Input Queue anyway. This would require a very high traffic rate in combination with insufficient CPU cycles to process the fragments.

- There are discards in the network (lower layers). This will cause the IP Input Queue to eventually discard the dead fragments. Since a fragment from the original datagram was lost anyway, there is no additional timeout and retransmit penalty for these fragments. The problem is that holding onto the dead fragments until ipfragttl expires will increase the chance that the IP Input Queue will overflow. When that happens, fragments from other datagrams, which would otherwise arrive and be reassembled, will be discarded. To the extent that the remaining fragments (of the original datagram) do arrive and get placed on the IP Input Queue, they will wait for ipfragttl before being discarded. This can further aggravate an IP Input Queue overflow problem, and require timeout and retransmission of the other datagrams.

- There is a long delay in the arrival of a fragment. Even though waiting longer for the late fragment would have allowed reassembly of the datagram, the queued fragments for that datagram will exceed their time to live and they will be discarded. Then the late fragment will arrive and sit for ipfragttl after which it will be discarded. The impact is that the datagram for the late fragment needs to be retransmitted. In addition, the situation can contribute to an IP Input Queue overflow with the result of discarding fragments for other datagrams, which then also need to be retransmitted.


Tuning to avoid IP input queue discards

What causes overflows is that the fragments arrive faster than they can be reassembled and removed from the queue. Reducing the volume of fragments arriving, and solving any network problems that may delay or discard in-transit fragments, is the best way to solve the problem.

Shortening the ipfragttl will discard incomplete fragment chains sooner and possibly avoid a queue overflow situation, but that may also force otherwise avoidable retransmissions of datagrams with delayed fragments.

Increasing the ipfragttl may help if delayed fragments are the main cause of discards due to timeouts, rather than overflows of the queue. But this will hold onto dead fragments longer and could cause an input queue overflow.
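Either adjustment is made dynamically with the no command; for example (the new value is only illustrative):

# no -o ipfragttl         (displays the current value, in half-seconds)
# no -o ipfragttl=40      (lowers the reassembly timeout to 20 seconds)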

Increasing the ipqmaxlen will help avoid transitory overflows, but will not help if there is a sustained high fragment arrival rate with delayed or discarded fragments.
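Note that on recent AIX levels ipqmaxlen is a load-time tunable, so a sketch of raising it would be (the value shown is illustrative, and the change takes effect at the next reboot):

# no -o ipqmaxlen         (displays the current queue limit)
# no -r -o ipqmaxlen=250  (recorded for the next boot)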

Again, the best way to avoid IP Input Queue overflows is to reduce fragmentation and eliminate packet discards and delays.


Figure 7-22. netstat -p ip. AN512.0

Notes:

What to look for

Our focus here is to examine how much traffic there is, how much of it is fragmented, and any discards related to IP Input Queue management.

Statistics should generally be assessed in terms of their significance as a percentage of the total traffic.

Items to look for relative to the total packets received:

- A high percentage of fragments received: As we will see, using TCP with the proper MSS value should avoid this situation

- A high percentage of fragments dropped after timeout: These are ipfragttl timeouts.

- A high percentage of fragments dropped (dup or out of space) and a high percentage of ipintrq overflows: These are due to overflowing the IP Input Queue.
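As a quick sketch of turning these raw counters into percentages (the pattern strings match the statistic lines shown in the output below):

# netstat -p ip | awk '/total packets received/ {t=$1} / fragments received/ {f=$1} END {printf "%.1f%% of received packets were fragments\n", 100*f/t}'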

netstat -p ip

# netstat -p ip
ip:
        9892501 total packets received
        . . .
        189901 fragments received
        0 fragments dropped (dup or out of space)
        5 fragments dropped after timeout
        64443 packets reassembled ok
        9222159 packets for this host
        . . .
        8260713 packets sent from this host
        . . .
        10494 output datagrams fragmented
        206446 fragments created
        . . .
        0 ipintrq overflows
        . . .


Items to look for relative to the packets sent from this host:

- A high percentage of output datagrams fragmented and a high percentage of fragments created: Again, using TCP with the proper MSS value should avoid this situation.


Additional output

A complete listing of the output would be:

9892501 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
0 with bad options
0 with incorrect version number
189901 fragments received
0 fragments dropped (dup or out of space)
5 fragments dropped after timeout
64443 packets reassembled ok
9222159 packets for this host
12408 packets for unknown/unsupported protocol
0 packets forwarded
532466 packets not forwardable
0 redirects sent
8260713 packets sent from this host
0 packets sent with fabricated ip header
0 output packets dropped due to no bufs, etc.
0 output packets discarded due to no route
10494 output datagrams fragmented
206446 fragments created
0 datagrams that can't be fragmented
10 IP Multicast packets dropped due to no receiver
42608 successful path MTU discovery cycles
8422 path MTU rediscovery cycles attempted
8070 path MTU discovery no-response estimates
8773 path MTU discovery response timeouts
7 path MTU discovery decreases detected
60158 path MTU discovery packets sent
0 path MTU discovery memory allocation failures
0 ipintrq overflows
0 with illegal source
0 packets processed by threads
0 packets dropped by threads
0 packets dropped due to the full socket receive buffer
0 dead gateway detection packets sent
0 dead gateway detection packet allocation failures
0 dead gateway detection gateway allocation failures


Broadcast traffic

A high value for packets for unknown/unsupported protocol points to machines using non-IP protocols sending broadcast messages. If this number is increasing rapidly, the machines sending the packets should be identified and possibly moved to another subnet or network segment (sometimes the router mistakenly forwards those packets). On the other hand, this traffic may be normal, and the extra load on your host (even if it is not participating in these broadcasts) may not be a problem.

Indicators of corrupted or truncated packets

Non-zero values for the following counters, though rare, can point to possible network problems, such as a defective or misconfigured adapter or switch port, or poor cabling practices:

- bad header checksums
- fragments dropped (dup or out of space)

(Non-zero values for either of these indicate either a network that is corrupting packets or device driver receive queues that are not large enough.)

- with size smaller than minimum
- with data size < data length
- with header length < data size
- with data length < header length
- with bad options


Figure 7-23. Interface and hardware flow AN512.0

Notes:

Overview

To handle the transfer of data between AIX and the network adapters, AIX provides queues where data can be placed until the other party can remove it and process it. If packets are placed on the queue faster than they are removed, the queue will fill up and packets will be discarded.

Transmit queue processing

The device drivers may provide a “transmit queue” limit which may be both hardware queue and software queue limits. Some drivers only have a hardware queue whereas others can have both hardware and software queues. Some drivers only allow the software queue limits to be modified.

Interface and hardware flow

[Figure: transmit and receive queues sit between the interface layer and the device driver/adapter buffers; an interrupt handler services the receive side, and the adapter connects through a switch or hub to the network.]


The interface receives a pointer to the mbufs holding the packet to be sent and prepends the link layer frame headers. It then places a pointer to the mbufs on the transmit queue and signals the adapter.

The adapter will access the queue, locate the data and transfer it to its hardware buffer, thus freeing up the entry in the transmit queue. The adapter then uses the link protocols to transmit the data on the cabling. The cabling could be point to point to another host adapter (cross-over cable), could be daisy chained through other host adapters (Ethernet BNC), could be wired to a central repeater hub or (more commonly) could be cabled to a switch.

Receive queue processing

AIX pre-allocates mbufs and mbuf clusters which are large enough to hold the largest transmission unit it expects to receive (MTU) and stores pointers to them on the receive queue. These are referred to as the receive pool buffers.

The receiving adapter listens to the transmissions on the wire looking for the frames with its own hardware address (or a broadcast address). It records the data in its hardware buffer and locates a free buffer on the receive queue, to which it transfers the data it has received. The adapter then signals AIX which runs an interrupt handler to process the data.

The interrupt handler locates the data on the receive queue and processes the data. The mbufs with the data likely end up either on the IP Input Queue or on a transport layer socket receive queue. The entry on the adapter receive queue is freed up and a new mcluster is allocated and placed on the queue to receive future data.


Figure 7-24. Transmit queue overflows AN512.0

Notes:

Introduction

The device driver queues a transmit packet directly to the adapter hardware queue. If the CPU is fast relative to the speed of the network, or if there are multiple CPUs, the system may produce transmit packets faster than they can be transmitted on the network. This will cause the hardware queue to fill. Once the hardware queue is full, the driver can queue packets to the software queue (if it exists). If the software transmit queue limit is reached, then the transmit packets are discarded. This can affect performance because the upper level protocols will have to retransmit these packets.

If there are transient situations where this happens due to a burst of activity or a temporary delay in transmitting by the adapter, a larger queue may be able to handle the situation.

If, on the other hand, there is a sustained situation where the packet transmission rate is too much for the adapter to keep up with, then one needs to either reduce that rate (reschedule or distribute the workload, or reduce the window size) or get more adapter bandwidth.

Transmit queue overflows

• Packets arrive faster than the adapter can remove them
  # entstat -d ent1 | grep -i 'transmit queue'
  Max Packets on S/W Transmit Queue: 210
  S/W Transmit Queue Overflow: 0
  Current S/W+H/W Transmit Queue Length: 1
• If bursty traffic, increasing the queue size may help
• If sustained high traffic rate:
  – Improve adapter speed or bandwidth
  – Reduce number of parallel application threads
• Change the queue size on the adapter in the ODM:
  # chdev -P -l ent1 -a tx_que_size=2048
  # shutdown -Fr
• To make effective without a reboot:
  # ifconfig en1 detach
  # chdev -l ent1 -a tx_que_size=2048
  # /etc/rc.net



Upgrading from 10 Mbps to 100 Mbps Ethernet may be all that is needed. An alternative solution is to group multiple adapters in an aggregate (etherchannel). That provides increased bandwidth and availability.

An aggregate will only help the situation if the high packet transmission rate is over several different connections. If there is only one connection, then all the traffic for that connection will go over the same single physical adapter. Workload balancing for aggregates works best when there are many connections to spread over the physical adapters which participate in the aggregate.

A similar solution would be to spread the connections over different interfaces using multi-path routing.

Tuning the transmit queue

The transmit queue size is an attribute of the adapter. Adapter attributes cannot be changed while the device driver is open. Since a configured interface will be using the adapter (even if in a down state), you must detach the interface from the adapter before you can change the attribute.

The chdev command will normally update the ODM object for the adapter and also make the change effective in the kernel. Since detaching the interface removes all interface configuration information from the kernel, you will need to reconfigure the interface after making your change. Note that this is disruptive to all communication over that interface. If you are using the interface to remotely access the platform, you will lose your connections and need a procedure that allows you to reconnect.

The alternative method is to have the chdev command update only the ODM object. This does not require that the interface be detached, but the ODM change will not become effective until you reboot the system. During reboot, the change is made effective in the kernel when the adapter object is changed to an available state during cfgmgr processing.

Max packets on S/W transmit queue

The Max Packets on S/W Transmit Queue shows the maximum number of outgoing packets ever queued to the software transmit queue.

An indication of an inadequate queue size is when the maximum transmits queued equals the current queue size (tx_que_size). This indicates that the queue was full at some point.

To check the current size of the queue, use the lsattr -El adapter command (where adapter is, for example, tok0 or ent0). Because the queue is associated with the device driver and adapter for the interface, use the adapter name, not the interface name. Use the SMIT or the chdev command to change the queue size.


S/W transmit queue overflow

The S/W Transmit Queue Overflow shows the number of outgoing packets that have overflowed the software transmit queue. A value other than zero requires the same actions as would be needed if the Max Packets on S/W Transmit Queue reaches the tx_que_size. The transmit queue size must be increased.

The Max Packets on S/W Transmit Queue field will show the high water mark for the transmit queue, and the S/W Transmit Queue Overflow field will show the number of software queue overflows. Note, these values may represent the hardware queue if the adapter does not support a software transmit queue. If there are Transmit Queue Overflows, then the hardware or software queue limits for the driver should be increased using the chdev command or SMIT.

Changing the attribute

Different adapters have different names for the attribute that controls the transmit queue size. So you need to first display the attributes of the adapter to find out the name.

# lsattr -E -l ent1

You also need to know what values are acceptable, so you next need to list the range of values for that attribute name

# lsattr -R -l ent1 -a tx_que_size

The value is the number of entries in the queue. The cost of setting a large number is not great, since the queue itself does not require much storage; it is basically an array of pointers to buffers that were already allocated by the higher layers. For UDP traffic, a larger queue could allow more datagrams to accumulate that would otherwise be discarded, thus using more memory until they are successfully transmitted.

Reconfiguring the interface

If you try to configure the adapter while a related interface is configured, you will get an error message stating that the device is in use. The solution is to change the attribute in the ODM object for the adapter, without making it effective yet.

You then have two options. You can either:

- Delay making the change effective until the next reboot. On reboot, the init process will run /etc/rc.net which will read the ODM and configure the interface.

- Make the change effective now by detaching and then reconfiguring the interface.

If using the second option, remember that it is disruptive to connections using that interface. If you are doing the second option remotely, and there is only one interface to connect through (and you are using it), you need to make sure you can reconnect. One way to do this is to code the procedure in a script and run it with nohup.
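A minimal sketch of such a script, using the same command sequence shown on the visual (the interface and adapter names and the queue size are assumptions for illustration):

# cat /tmp/txq_change.sh
#!/bin/ksh
# Detach the interface so the adapter attribute can be changed
ifconfig en1 detach
# Update the ODM and the running kernel in one step
chdev -l ent1 -a tx_que_size=2048
# Reconfigure the interfaces from their ODM definitions
/etc/rc.net

# nohup /tmp/txq_change.sh &      (keeps running even if your session drops)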


Example

The following example shows:

- Detection of adapter transmit queue overflow.

- Change to the transmit queue size for the adapter. This change will not take effect until reboot. (lsattr says it has been made, but that change is only in the ODM, not the current value). The change should reduce transmit queue overflows.

Note that the transmit and receive queue overflows can also be indicated by the netstat -i output under Ierrs and Oerrs columns.

# entstat -d ent0
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 10/100 Mbps Ethernet PCI Adapter II (1410ff01)
Hardware Address: 00:02:55:6f:1b:aa
Elapsed Time: 1 days 19 hours 14 minutes 2 seconds
. . .
S/W Transmit Queue Overflow: 20
. . .

# lsattr -El ent0
. . .
tx_que_sz       8192    Software TX Queue Size              True
txdesc_que_sz   512     TX Descriptor Queue Size            True
use_alt_addr    no      Enable Alternate Ethernet Address   True

# chdev -P -a tx_que_sz=16384 -l ent0
ent0 changed

# lsattr -El ent0
. . .
tx_que_sz       16384   Software TX Queue Size              True
txdesc_que_sz   512     TX Descriptor Queue Size            True
use_alt_addr    no      Enable Alternate Ethernet Address   True

# shutdown -Fr


Figure 7-25. Adapter configuration conflicts AN512.0

Notes:

Adapter configuration problems

A common reason for network performance problems is the misconfiguration of the adapter or switch port.

Speed mismatches between the switch port and the adapter will become obvious because the adapter simply will not work.

A less obvious misconfiguration problem is a mode mismatch. The adapter communicates, but the performance is severely impacted. The classic symptom is a high collision rate, with multiple collisions, late collisions, and CRC errors. Note that this will only be seen on the side using half-duplex. If that is the Ethernet switch side, then these errors would be visible to the Ethernet switch administrator. The AIX side might see higher TCP retransmission without knowing why.

Adapter configuration conflicts

• Switch port and adapter configuration must match
  # entstat -d ent0 | egrep -i 'media speed'
  Media Speed Selected: Auto negotiation
  Media Speed Running: 100 Mbps Full Duplex
• A duplex mode mismatch results in a high level of collisions
  # entstat -d ent0 | egrep -i 'collision|deferred'
  Max Collision Errors: 0          No Resource Errors: 0
  Late Collision Errors: 0         Receive Collision Errors: 0
  Deferred: 0                      Packet Too Short Errors: 0
  Timeout Errors: 0                Packets Discarded by Adapter: 0
  Single Collision Count: 0        Receiver Start Count: 0
  Multiple Collision Count: 0
• Configuration options:
  – Configure both sides for auto-negotiation
  – Configure both sides for the fastest speed and full duplex
• Do not set auto-negotiation on one side only:
  – Defaults to half-duplex at the auto-negotiation end
  – If other side is coded full-duplex: mode mismatch


It is recommended that either both sides configure to use auto-negotiate, which should result in the highest common speed and full-duplex mode, or to hard code the configuration on both sides to the fastest common speed and full-duplex.
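On the AIX side, speed and duplex are adapter attributes. A sketch of hard coding them (the attribute name and allowed values vary by adapter type; media_speed is typical of 10/100 adapters, and the -P change requires the interface to be detached or a reboot to take effect):

# lsattr -R -l ent0 -a media_speed       (list the values this adapter accepts)
# chdev -P -l ent0 -a media_speed=100_Full_Duplex
# shutdown -Fr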

A common error is to code auto-negotiate on one side and not on the other. This will likely result in a mode mismatch.

Gigabit Ethernet only supports full-duplex, so it is impossible to have a mode mismatch when configured for 1000 Mbps.

What is a collision?

When discussing collisions, you may wish to think of it like a telephone party line. A system will check for a transmission in progress before trying to send a packet (carrier sense). If two machines transmit at exactly the same time, there is a collision because neither senses the other. When a host recognizes a collision, it backs off and waits a short random interval before retransmitting. The more machines on the network, the greater the chance of collision. A collision rate can be calculated as the number of collisions divided by the number of output packets. If the result is greater than 10%, there is high network utilization, and the network may need to be reorganized or partitioned.
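A quick worked example with hypothetical numbers: if entstat reports 1200 collisions against 10000 output packets, the collision rate is:

1200 / 10000 = 12%

which is above the 10% guideline and suggests the segment is overutilized.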

If your network environment requires half-duplex operation, then you will find that collisions are normal in that environment. It is through collision detection, back-offs, and retries that the original Ethernet standard allowed the sharing of a common wire. If you are using BNC wiring or are cabled into a repeater hub (instead of a switching hub), then you will be using the half-duplex protocol. But even in this environment, you should see mostly single collisions (too many busy adapters on the common wire will cause multiple collisions and bad performance). And even with half-duplex, we do not expect to see late collisions.

Max collision errors

Max Collision Errors is the number of unsuccessful transmissions due to too many collisions. The number of collisions encountered exceeded the number of retries on the adapter.

Late collision errors

Late Collision Errors is the number of unsuccessful transmissions due to the late collision error. A late collision error is one that occurs after the start of transmission. Normally when two adapters try to start a transmission at the same time, they detect a collision and retry at random time intervals. Or if one adapter detects that another has a transmission in progress, it will retry later. Late Collision is usually caused by either misconfigured or defective Ethernet adapters (they failed to detect a transmission in progress and transmitted anyway), or by incorrect cabling where the two adapters are so far apart that they do not hear the other until after they have started their transmissions.


Deferred

Deferred counts indicate packets that could not be sent because the media was half-duplex and the adapter sensed that there was a packet already coming down the media so that it deferred the sending of the packet until the line was free. With full-duplex, this would not be a problem because both sides can send and receive at the same time. If the adapter sensed that the line was free and sent the packet but the other side also did the same thing at the same time, that’s where collisions occur.

Ethernet collisions can be avoided by running full-duplex (make sure both sides are correctly set to full-duplex otherwise the performance will be even worse than having collisions).

Collision errors

The types of collision errors reported are:

- Single Collision Count is the number of outgoing packets with single (only one) collision encountered during transmission.

- Multiple Collision Count is the number of outgoing packets with multiple (2 - 15) collisions encountered during transmission attempts.

- Receive Collision Errors is the number of incoming packets with collision errors during reception.

Collision errors should be considered since they can decrease performance. Multiple collision errors are even worse because that means the same packet was sent multiple times and each time it had a collision. With full-duplex (available on switches and crossover cables), you should not see any collisions; so it is best to use a switch with both ends running full-duplex.

If one end was half-duplex and the other was full-duplex, then collisions are almost guaranteed, since the full-duplex side is not even listening for collisions, and performance may be terrible (look for CRC (cyclical redundancy check) errors in this case). The errors and collisions will be seen on the half-duplex side of the mismatch.

Additional details

Collisions do not occur in a full-duplex environment. Since the connection between the adapter and the switch port is analogous to an Ethernet crossover cable joining two computers, they are the only ones talking. If the connection is full-duplex, then there is never a conflict over who gets to talk when. In fact, an adapter or port that is configured for full-duplex does not even bother to detect the other side transmitting when it wants to transmit.

If one side is configured to half-duplex and the other is configured to full-duplex, then the half-duplex side is assuming that the other side can only handle half-duplex communications while the full-duplex side is going to transmit any time it wants to.


On the half-duplex side, when it wants to transmit, it keeps hearing the other side transmitting, does its random delay retries and keeps having collisions, because the full-duplex side is not using the back off and retry protocol. And when the half-duplex side gets no initial collision and starts a transmit, it will likely get late collisions and immediately terminate that transmission because the full-duplex side does not care if the half-duplex side is talking and it will transmit anyway. This leads to high collision rates and very poor performance.

Even with both sides set to auto-negotiate, there can still be mode mismatches. This depends on both sides having properly implemented the auto-negotiation standard. If there was confusion about how to implement the standard, or a bug in the implementation, then they may fail to negotiate correctly. In this situation, hard coding on both sides is the expedient solution.

Another mistake that can affect performance is the misplacement of adapters in the bus slots. Overloading a PCI bus can lead to performance problems. For example, a single Gigabit Ethernet adapter can consume the bandwidth of a PCI bus; two gigabit adapters on the same bus could overload that bus. Even if you do not overload a bus, placing a PCI-X adapter in a PCI slot could reduce the MHz rate of the adapter's PCI interface.

It is strongly recommended that administrators plan their adapter placement using the manual: Adapter Placement Reference for AIX.


Figure 7-26. Receive pool buffer errors AN512.0

Notes:

Introduction

A full receive pool can result in the adapter discarding packets. Some adapters allow configuring the number of resources used for receiving packets from the network. This might include the number of receive buffers (and even their size) or may simply be a receive queue parameter (which indirectly controls the number of receive buffers). The receive resources may need to be increased to handle peak bursts on the network.

The entstat -d <adapter-name> command will give you a count of the number of receive pool buffer errors. A small percentage is not a great concern. But, a large percentage will have a significant impact on performance.

Receive pool buffer errors

• An adapter may receive packets faster than the interrupt handler can remove them
• Check entstat -d for a count of errors
  # entstat -d ent0 | grep -i 'receive pool'
  Receive Pool Buffer Size: 2048
  Free Receive Pool Buffers: 767
  No Receive Pool Buffer Errors: 5746361
• If bursty traffic, a larger pool may reduce the errors:
  – Some adapters support configurable receive pools
• If sustained high traffic, may throttle traffic:
  – Reduce window size(s)
  – Reduce number of clients transmitting to server
• Adapter and switch port problems can cause errors


Increasing the queue size

The name of the adapter attribute which controls the adapter's receive queue size will vary from adapter to adapter. List the attributes of the adapter to find the correct name, and then update the value using the procedure we covered for the transmit queue.

Throttling the traffic

If there is a sustained high level of traffic, you may need to throttle it back to avoid the errors. While the discards themselves should automatically slow down the sliding window and even place the TCP session into slowstart mode, it might be better to reduce the tcp_recvspace to further reduce the arrival rate of packets. You can do this globally, be more selective by using the ISNO options, or even configure the application to set the socket buffer sizes using setsockopt().
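For example, to lower the receive space on just one interface through ISNO rather than globally (a sketch; the value shown is only illustrative, and the use_isno option must be enabled, which it is by default):

# no -o use_isno                        (verify ISNO is enabled)
# chdev -l en0 -a tcp_recvspace=65536   (applies only to connections over en0)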

More commonly, the total traffic is a combination of many clients transmitting streams of data at the same time. For example, doing multiple concurrent backups of client systems to Tivoli Storage Manager (TSM) may overload the TSM server.

Adapter and switch port problems

In some cases, we have found that the cause is a defective switch port or a defective adapter. Changing the port in use may solve it. Upgrading the adapter microcode to the current level may provide a fix. Replacing the adapter may be needed.


Figure 7-27. Network traces AN512.0

Notes:

tcpdump

The tcpdump command prints out the header information of the packets captured on a network interface. It can be used to trace all packets that go through a single network interface or to trace a specific protocol, such as TCP.

By default, tcpdump sends its output to stdout and does not require any post processing. However, it also allows data collection in raw format (without any packet parsing) into a file. This file can be used as input for post processing with tcpdump.

If you ran the perfpmr.sh script, then a tcpdump.raw file will be in the output directory. You may then either use the PerfPMR script tcpdump.sh -r to format the tcpdump or (if you wish more control over the formatting and record selection) use the tcpdump -r command to format it. tcpdump defaults to the first configured interface. If you need control over what interface is being traced, you need to run tcpdump directly with the -i <interface> option.
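A capture-then-format workflow might look like this sketch (the interface name and the filter expression are assumptions; -w writes the raw capture and -r reads it back):

# tcpdump -i en0 -w /tmp/tcpdump.raw port 14000
# tcpdump -r /tmp/tcpdump.raw -n | more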

Network traces

• Capture and report details of network packets
• Useful for analyzing difficult situations
• Requires detailed understanding of network protocols and header fields
• tcpdump
  – Good summary of header information
  – Easy to read
  – PerfPMR creates tcpdump.raw file
• iptrace and ipreport
  – More detailed report
  – PerfPMR creates iptrace.raw file


Supported interfaces

Only a limited number of network interfaces are supported by tcpdump: Ethernet, FDDI, token-ring and loopback. Interfaces like the css interface (SP2 high performance switch) are not supported by tcpdump. For those interfaces, the iptrace command must be used to capture the packets.

iptrace command

The iptrace command is an interface-level packet tracing tool for Internet protocols. Unlike tcpdump, it captures the entire contents of the packets and writes them into a logfile. The filename must be specified when the iptrace command is invoked. The size of the logfile can become quite large, sometimes several hundred megabytes in just a few seconds, depending on the speed of the network adapter and the level of network traffic.

Post processing of the logfile is done with the ipreport command.

iptrace loads a kernel extension for the packet capturing. This kernel extension does not get unloaded when iptrace is stopped with kill -9. Thus, iptrace should be stopped with kill -15. You can unload the kernel extension with iptrace -u if you mistakenly killed iptrace with kill -9.
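A typical direct invocation might look like the following sketch (the interface and file names are illustrative):

# iptrace -i en0 /tmp/iptrace.raw     (start the capture; it runs as a daemon)
# ps -ef | grep iptrace               (find the process ID)
# kill -15 <pid>                      (stop cleanly with SIGTERM, not -9)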

ipreport command

The ipreport command is the post processing tool for iptrace data.

To generate a report on an iptrace logfile run: ipreport <logfile> | more

It is advisable to generate an ipreport output with packet numbering, decoded RPC calls, and protocol information (-s, -r, and -n flags).

The PerfPMR generated iptrace.raw file can be formatted by running the PerfPMR script: iptrace.sh -r.

The ipreport command does hostname resolution for all packets in the logfile. The processing of the data can take a very long time if the hostnames are not known (that usually happens when you post process iptrace data from customer machines). You can reduce the processing time by using the -N option, which bypasses the host name resolution. For example: ipreport -srnN <logfile> | more


Figure 7-28. Network trace examples AN512.0

Notes:

The strength of the tcpdump report is the ability to see many packets on a single page, because it can use one line per packet. The tool can be customized to present the trace information in different ways. For example, this example requested that the IP addresses not be translated into their symbolic names, and that the time stamps be printed as a delta (in microseconds) between the current and previous line. The ability to see time stamps can be helpful in seeing where delays occurred. The source and destination fields are important for identifying the connection and the direction of flow. You can also see the amount of data (in parentheses) and what bytes are being acknowledged as received in the acknowledgements.

The strength of the ipreport is the great amount of detail provided in its breakdown of the header fields. But this can also be a weakness, since it can be difficult to see the big picture through all that detail.

Network trace examples

• tcpdump example:

000015 IP client.33100 > server.14000: P 4061:4121(60) ack 21 win 65535
000182 IP server.14000 > client.33100: P 21:41(20) ack 4121 win 17520
000059 IP client.33100 > server.14000: . 4121:5581(1460) ack 41 win 65535
000010 IP client.33100 > server.14000: P 5581:6121(540) ack 41 win 65535
209793 IP server.14000 > client.33100: . ack 6121 win 17520

• ipreport example:

Packet Number 2
ETH: ====( 77 bytes transmitted on interface en0 )==== 09:48:01.954494310
ETH: [ 00:06:29:c3:0a:1c -> 00:06:29:ec:00:64 ] type 800 (IP)
IP: < SRC = 9.3.104.19 > (ginger.austin.ibm.com)
IP: < DST = 9.41.90.25 > (idefix.austin.ibm.com)
IP: ip_v=4, ip_hl=20, ip_tos=0, ip_len=63, ip_id=50365, ip_off=0 DF
IP: ip_ttl=60, ip_sum=a5a3, ip_p = 6 (TCP)
TCP: <source port=23(telnet), destination port=34919 >
TCP: th_seq=e4d92b8a, th_ack=43bc31c5
TCP: th_off=5, flags<PUSH | ACK>
TCP: th_win=17520, th_sum=e5b8, th_urp=0
TCP: 00000000 67696e67 65722e61 75737469 6e2e6962 |ginger.austin.ib|
TCP: 00000010 6d2e636f 6d203a                     |m.com :|


Figure 7-29. Checkpoint (1 of 3) AN512.0

Notes:

Checkpoint (1 of 3)

1. Interactive users are more concerned with measurements of __________, while users of batch data transfers are more concerned with measurements of _______________.

2. True/False: In AIX 6, the thewall maximum amount of network pinned memory can be increased only by increasing the amount of real memory.

3. When sending a single TCP packet, an acknowledgement can, by default, be delayed as long as ______ milliseconds.


Figure 7-30. Checkpoint (2 of 3) AN512.0

Notes:

Checkpoint (2 of 3)

4. True/False: Increasing the tcp_recvspace at the receiving host will always increase the effective window size for the connections.
   _________________________________________________________________

5. What network option must be enabled to allow window sizes greater than 64 KB?
   _______________________________________

6. List two ways in which Nagle’s Algorithm can be disabled:
   –
   –


Figure 7-31. Checkpoint (3 of 3) AN512.0

Notes:

Checkpoint (3 of 3)

7. If you saw a large count for ipintrq overflows in the netstat report, which actions would help reduce the overflows?
   a) Increase memory and CPU capacity at the receiving host
   b) Increase ipqmaxlen and decrease ipfragttl
   c) Decrease ipqmaxlen and increase ipfragttl
   d) Eliminate the cause of delayed and dropped fragments.

8. A high percentage of collisions in an Ethernet full-duplex switch environment is an indication of:
   ______________________________________________
   _______________________________________________


Figure 7-32. Exercise 7: Network performance AN512.0

Notes:


Exercise 7: Network performance

• Window size tuning

• FTP case study

• Packet throughput (optional)

• Transmit queue overflows (optional)


Figure 7-33. Unit summary AN512.0

Notes:

Unit summary

This unit covered:
• Identifying the network components that affect network performance
• Listing the network tools that can be used to measure, monitor, and tune network performance
• Monitoring and tuning UDP and TCP transport mechanisms
• Monitoring and tuning for IP fragmentation mechanisms
• Monitoring and tuning network adapter and interface mechanisms


Unit 8. NFS performance

What this unit is about

This unit describes the factors that influence the performance of the Network File System, more commonly known as NFS. It covers the tools for monitoring NFS activity and for tuning performance in an NFS environment.

What you should be able to do

After completing this unit, you should be able to:

• Define the basic NFS tuning concepts

• List the differences between NFS V2, V3 and V4

• Monitor and tune NFS servers

• Monitor and tune NFS clients

How you will check your progress

Accountability:

• Checkpoint
• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

AIX Version 6.1 System Management Guide: Communications and Networks

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)

SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Figure 8-1. Unit objectives AN512.0

Notes:


Unit objectives

After completing this unit, you should be able to:

• Define the basic Network File Systems (NFS) tuning concepts

• List the differences between NFS V2, V3 and V4

• Use nfsstat and netpmon to monitor NFS

• Use nfso and mount options to tune NFS


Figure 8-2. NFS tuning concepts AN512.0

Notes:

Overview

The Network File System (NFS) is a distributed file system. It is independent of machine type or operating system. A typical NFS environment consists of one or more client machines and at least one NFS server. The NFS server can export its local file systems to the client machines so that the clients can have access to these file systems as if they were local on the clients. Applications can access the file systems transparently (normal file semantics can be used).

NFS misuse

A common error is to assume that, since remote access to a file system acts functionally like a local file system, it can be conveniently used for all sorts of remote data access without concern. While it works functionally, it is not always the most efficient way to handle the situation. Here are two examples of when it might not be a good idea to use NFS:

NFS tuning concepts

• NFS allows one or more NFS clients to mount file systems from an NFS server
  – Cumulative load of many NFS clients can overload the server
  – May need to limit number of clients or detune clients
    • Reduce number of client biod threads
    • Reduce client read and write sizes
• NFS performance depends on basic memory, CPU, I/O, and network performance management
• Do not misuse NFS
  – Stable files should be replicated to user systems


- The first example is a one-time access of an entire file. It will be faster to use FTP to transfer the file to your local platform than to retrieve the entire file over NFS. The issue would be different if you were only accessing a few portions of the file.

- A second example is when an application is designed for performance assuming that all the files are local. There is a big difference between the application looking through a 5,000 entry directory file locally versus doing the same thing over NFS. Or an application that does a lot of dynamic linking to modules in what it expects to be a local library. Place that same library across an NFS mount and you could experience significant degradation of performance.

In many situations, it is better to allow the client to have its own local copies of the file.

Client and server tuning

NFS tuning can be quite complex since it involves tuning every component we have discussed.

The NFS file system accesses may be concentrated on one or a few disks. Check for disk bottlenecks using the physical I/O monitoring techniques mentioned previously (using tools such as iostat, topas, or filemon). Disk and/or LVM tuning may have to be done.

On both the NFS client and NFS server, this includes tuning CPU usage, memory, logical and physical I/O (physical I/O on client if cacheFS is used), mount options, network options, network adapter tuning, and NFS options. Sometimes, the client may have to be de-tuned. That is, limit the amount of I/O sent to the NFS server because either the network or the NFS server cannot keep up with the data rate.

Increasing biod and nfsd threads can improve throughput unless the network or server cannot keep up. Then, decreasing biod threads or reducing read/write buffer sizes may help. The biod and nfsd threads will be created dynamically up to a tunable maximum value (mount option for biod and nfso option nfs_max_threads for nfsd).
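For example (a sketch; the server and mount point names are illustrative):

# nfso -o nfs_max_threads                 (display the server's nfsd thread maximum)
# mount -o biods=4 server:/export /mnt    (cap this mount at 4 biod threads)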

Network tuning

The network itself may have to be tuned (switches, routers, network media). Check for network media settings such as speed and duplex modes. Also, check the switch statistics or router statistics to see if packets are getting dropped.

Detuning a client

If dropped and significantly delayed packets are mainly due to the load being presented by your platform as a client and you can not improve the ability of the network or the NFS server to handle that load, then you may have to de-tune your NFS client. De-tuning means to slow down the NFS client.


If the network or server congestion is due to the accumulated load of many clients, then you would need to either reduce the number of clients or find a way to detune all of them. If you do not control the client platforms, many of the client detuning methods will not be practical.

It is counter-intuitive, but detuning may actually improve performance. This is because the performance impact of pacing the traffic is often less than the performance impact of discarded packets due to congestion.

De-tuning should only be done as a last resort since this will decrease performance if the network or server was not a bottleneck.

How to de-tune

A common detuning technique is to reduce the number of biod threads down to a value of 1 (mount -o biods=1). This sets the biod threads to 1 per mount.

The read/write size can be reduced by specifying a small value for rsize or wsize (such as 1024), where rsize and wsize are mount options.

Enabling commit-behind

Sometimes enabling commit-behind can also improve performance even though this will cause more commits to occur than if commit-behind was not needed. While commit-behind causes more commits, it actually can reduce the number of commits if page-replacement is occurring on modified client pages.
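Commit-behind is enabled with a mount option; a sketch (the server and directory names are illustrative):

# mount -o combehind server:/export /mnt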


Figure 8-3. NFS versions AN512.0

Notes:

NFS V2

NFS Version 2 (V2) was the only version of NFS available until AIX V4.2.1 at which point NFS Version 3 (V3) was implemented. Currently on AIX, NFS V2, V3 and V4 are supported simultaneously on both the AIX NFS client and the AIX NFS server. Writes in NFS V2 are much slower than in V3 or V4 because all writes are synchronous on the server. The client’s write request does not complete until the write has reached the disk on the NFS server. Also, the maximum read or write size is 8 KB in NFS V2. Another limitation of NFS V2 is that the file offsets are 32 bits which means the maximum file size that can be supported is 4 GB. On AIX, the default maximum number of biod threads is 7 per NFS V2 mount, but this can be overridden with the biods mount option. NFS V2 supports both UDP and TCP (TCP is the default on AIX).

NFS versions

• NFS V2
  – Maximum and default read/write size is 8 KB
  – All writes are synchronous
  – File offsets are limited to 32 bits
• NFS V3 (best performing)
  – Default read/write size 32 KB, maximum 64 KB
  – Reliable asynchronous writes
  – Attributes on replies and readdirplus reduce getattr overhead
  – File offsets can be up to 64 bits
• NFS V4
  – Only supports TCP
  – rpc.lockd, rpc.statd, and rpc.mountd merged into nfsd
  – Benefits large-scale file sharing distributed environments
  – Better security
  – Requires more processing overhead


NFS V3

NFS Version 3 was introduced in AIX V4.2.1. AIX can support both NFS Version 2 and Version 3 on the same machine. NFS V3 is the default on AIX. NFS V3 improves performance in many ways:

- NFS V3 removes the 8 KB read/write size limit in V2. The default read/write size is 32 KB on AIX with NFS V3.

- The maximum NFS I/O size in AIX NFS V3 is 64 KB. Other vendors’ NFS V3 implementations may have different maximum and default values.

- NFS V2 required that writes to the NFS server did not return back to the client until the data was committed to disk. NFS V3 provides for reliable asynchronous writes so that writes can be written to disk asynchronously. The write goes to memory and returns. If the write has not been committed to disk, the client marks that write as a smudged write. Eventually the write will have to be committed (due to page replacement on the client or due to a sync). When the write is committed, the contents are sent to the disk and then the write on the client is considered no longer in a smudged state.

NFS V3 operations return the attributes of the file with every operation, so that the number of attribute calls from the client can be decreased. A directory read can result in a readdirplus call (an operation not available in NFS V2) which not only reads the directory but can also return the attributes of multiple files in the directory.

Since NFS V3 uses 64-bit offsets, the file size can be considerably larger than 4 GB.

NFS V4

NFS Version 4 is described by RFC 3530.

NFS V4 only supports TCP. UDP is not supported.

While NFS V4 is similar to prior versions of NFS (primarily NFS V3), NFS V4 provides many new functional enhancements in areas such as security, scalability, and back-end data management. These characteristics make NFS V4 a better choice for large-scale distributed file sharing environments.

The additional functionality and complexity of NFS V4 result in more processing overhead. Therefore, NFS V4 performance might be slower than with NFS V3 for many applications. The performance impact of NFS V4 varies significantly depending on which new functions are used.

For example, if you use the same security mechanisms on NFS V4 and NFS V3, your system might perform slightly slower with NFS V4. However, you might notice a significant degradation in performance when comparing the performance of NFS V3 using traditional UNIX authentication (AUTH_SYS) to that of NFS V4 using Kerberos 5 with privacy, which means full user data encryption.


A client that performs an NFS V3 mount of a file system that the server has exported to also allow NFS V4 mounts will experience performance degradation compared with mounting the same file system from a server that exports it without allowing NFS V4 mounts.

rsize/wsize option

The rsize value specifies the maximum read size on the mount, while the wsize value specifies the maximum write size on the mount. If the application issues I/Os larger than these, NFS breaks them down into rsize/wsize chunks.
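For example, to set the 32 KB maximums explicitly on a V3 mount (the server name and mount point here are hypothetical):

# mount -o rsize=32768,wsize=32768 server:/export /mnt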


Figure 8-4. Transport layers used by NFS AN512.0

Notes:

Selecting TCP or UDP

In NFS V3, the client machine can select TCP or UDP as a transport protocol for a particular mount. The default is TCP in AIX V4.3 and higher. Prior to AIX V4.3, the default was UDP for both NFS V2 and NFS V3. A mount option (proto) can be used to select TCP or UDP (example: mount -o proto=udp).

UDP works efficiently over clean or efficient networks and responsive servers. For wide area networks or for busy networks or networks with slower servers, TCP may provide better performance since its inherent flow control can minimize retransmits on the network. Also, since the maximum UDP packet size is 64 KB (which includes the IP header), the maximum NFS I/O size used with UDP is less than 64 KB (depending on the size of the IP header and UDP header).

UDP is not available in NFS V4.


Transport layers used by NFS

• Client can select the transport protocol for each mount via mount options

• TCP is the default protocol in NFS v3
  – Provides reliable delivery with flow control, an in-order stream of data, and error detection with retransmission
  – Provides for larger throughput due to 64 KB I/O sizes
  – Recommended for wide-area networks or on networks with high congestion or inefficient NFS servers

• UDP is recommended in efficient environments where the network and server are able to keep up with client requests


Figure 8-5. NFS request path AN512.0

Notes:

Client side

NFS is just one of the possible file systems that can be accessed transparently on a client machine. When an application thread issues a system call to access a file or directory in an NFS file system on the client, the system call will go to the kernel's virtual file system layer which will determine what type of file system it is. If it is NFS, then the kernel calls the NFS kernel extension. The NFS kernel extension will, for certain requests (for example, reads and writes) use a daemon thread (discussed later) to send the request to the server. For other requests, the NFS kernel extension will send the requests itself. Before sending the requests, External Data Representation (XDR) will be used to convert the data to a common form in order to support platforms with different data representations. (We will not cover the details of XDR in this course and it is not a performance tuning component.) Once the data is in the XDR format, the Remote Procedure Call (RPC) library routines are used to handle the actual network communications between the client and the server.

NFS request path

[Diagram: on the client, application system calls → virtual file system (vnode) → NFS kernel extension → daemon thread → RPC → network; on the server, RPC → daemon thread → NFS kernel extension → virtual file system (vnode) → AIX file system → disk]


It is important to understand that it is RPC which is handling the UDP or TCP socket calls. When a daemon thread is used, that thread is dedicated to that request until it receives the reply from the server.

Server side

At the NFS server, the request is received by a daemon thread (discussed later) using the RPC routines. XDR is used to convert the data to the local data representation and the request is then handled by the NFS kernel extension. The NFS kernel extension accesses the local file system to execute the requests such as open, close, read, write, get attributes, and so forth. The local file system access goes to the kernel's virtual file system layer which determines the type of file system (for example, JFS or JFS2) and then invokes the kernel extension for that file system type. Assuming a normal JFS or JFS2 mount, the local file system kernel extension will use VMM mechanisms to handle the files. The local file system I/O processing is normal, with caching mechanisms available. Writes may remain in memory until a write-behind, a sync, or a page steal flushes them to disk. Reads may be satisfied without physical disk I/O if the data is already cached in memory. When a daemon thread is used, that thread is dedicated to that request until it is completed and the reply is sent back to the client.


Figure 8-6. NFS performance related daemons AN512.0

Notes:

Overview

Additional NFS daemon threads are not expensive when it comes to memory usage, so it is normally okay to have many NFS daemon threads running. The rpc.mountd, rpc.lockd, nfsd, and biod daemons are multi-threaded.

nfsd daemon

nfsd daemons are the active agents providing NFS services from the NFS server. The receipt of any one NFS protocol request from a client requires the dedicated attention of an nfsd daemon until that request is satisfied and the results of the request processing are sent back to the client.


NFS performance related daemons

• Too few daemon threads could restrict performance

• nfsd (server only)
  # nfso -o nfs_max_threads (restricted tunable)

• rpc.biod (client only, threads per mount)
  # mount -o biods=## (default 32)

• rpc.mountd (server only, mainly an automounter concern)
  rpc.mountd -h ## (default 16)

• rpc.lockd (client and server)
  rpc.lockd -a ## (default 33)

• Procedure for the rpc.mountd and rpc.lockd subsystems:
  # chssys -s <subsysname> -a "<option>"
  # stopsrc -s <subsysname>
  # startsrc -s <subsysname>


rpc.mountd daemon

rpc.mountd is a server daemon that answers a client's RPC request to mount a server's exported file system or directory. The rpc.mountd daemon finds out which file systems are available by reading the /etc/xtab file. The rpc.mountd daemon is not in NFS V4 because the operation is moved into the main NFS V4 protocol.

rpc.lockd daemon

The rpc.lockd daemon handles the file locking requests for files in NFS file systems prior to NFS V4. The rpc.lockd daemon is not in NFS V4 because the operation is moved into the main NFS V4 protocol.

rpc.statd daemon

The rpc.statd daemon coordinates file lock recovery if a system crashes. The rpc.statd daemon is not in NFS V4 because the operation is moved into the main NFS V4 protocol.

portmap daemon

The portmap daemon is a network service daemon that provides clients with a standard way of looking up a port number associated with a specific program.

nfsrgyd daemon

The nfsrgyd daemon is new in NFS V4. It provides a user name and group name translation service for NFS servers and clients. This daemon must be running in order to perform translations between NFS string attributes and UNIX numeric identities.

biod daemon

The biod daemon is the block input/output daemon. The biod is used on the client to submit open/read/write/close requests from the client. It also performs read-ahead and write-behind requests, as well as directory reads. The biod daemon threads improve NFS performance by filling or emptying the buffer cache on behalf of the NFS client applications. When a user on a client system wants to read from or write to a file on a server, the biod threads send the requests to the server.

The following NFS operations are sent directly to the server from the NFS client kernel extension and do not require the use of biods: getattr, setattr, lookup, readlink, create, remove, rename, link, symlink, mkdir, rmdir, readdir, and fsstat. The default number of biods is seven per V2 mount or 32 per V3 and V4 mounts, and can be increased or decreased as necessary for performance.


The nfsd and biod user level processes are not used by NFS V4. They have been replaced by kernel processes called nfsd and kbiod. Use the -k flag of the ps command to see kernel processes.

biods option

The biods option specifies the number of biod threads for the mount. The default is 32 per NFS V3 and NFS V4 mounts and 7 per NFS V2 mount. The maximum value is 128 for each type of mount.

Tuning the network lock manager in NFS V2 and V3

On clients and servers where there is heavy file locking activity, the rpc.lockd daemon may become a bottleneck. If so, it can be tuned so that more rpc.lockd threads are created. This is done by passing the number of threads as an argument to rpc.lockd. The chssys/stopsrc/startsrc commands can be used to change this number. The NFS server should have enough rpc.lockd threads to handle all of its clients (which means more threads than the clients run). The default value is 33 and the maximum value is 511.
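For example, to run 200 rpc.lockd threads (a hypothetical count; any value up to 511 is accepted), change the subsystem definition and restart it:

# chssys -s rpc.lockd -a "200"
# stopsrc -s rpc.lockd
# startsrc -s rpc.lockd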

File locking in NFS V4

There are significant changes in NFS V4 for file locking when compared to earlier NFS versions. The RPC operations for file locking have been moved into the main NFS protocol. The separate Network Lock Manager and status monitor protocols in earlier NFS versions are eliminated along with the corresponding rpc.lockd and rpc.statd daemons in NFS V4.

mountd

On servers that are handling large numbers of mount requests from clients, the rpc.mountd may not be able to keep up. It could be that the clients are using automount and the automount timeout is too low. In any case, the number of rpc.mountd threads on the server can be increased by specifying the -h flag to rpc.mountd. The new value goes into effect after rpc.mountd is stopped and restarted using:

# stopsrc -s rpc.mountd
# startsrc -s rpc.mountd
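Note that the new -h value is registered with the subsystem definition before the restart; for example, with a hypothetical count of 32 threads:

# chssys -s rpc.mountd -a "-h 32"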

The rpc.mountd daemon is not used in NFS V4; the mount operation is part of the main NFS V4 protocol.


Figure 8-7. nfsstat -s AN512.0

Notes:

nfsstat command

By default, the nfsstat command prints out NFS client and server statistics and statistics on NFS and remote procedure calls (RPC).

The flags for /usr/sbin/nfsstat are:

-c client information

-s server information

-n NFS information only

-r RPC information only

-z reset statistics (root only)

-m mount statistics

The default if no flags are given is: nfsstat -csnr.


nfsstat -s

# nfsstat -rs
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
100256     0          0          0          0          29999      0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
721845     0          0          0          0          94082      2

# nfsstat -ns
Server nfs:
calls      badcalls   public_v2  public_v3
822019     2          0          0
… <version 2 calls not shown> …
Version 3: (613184 calls)
null       getattr    setattr    lookup     access     readlink   read
28 0%      50447 8%   4883 0%    23692 3%   17356 2%   0 0%       344852 56%
write      create     mkdir      symlink    mknod      remove     rmdir
111758 18% 3856 0%    191 0%     0 0%       0 0%       1297 0%    18 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
349 0%     52 0%      629 0%     1650 0%    4241 0%    16 0%      8 0%
commit
47861 7%


Using the -s flag on an NFS client does not show server statistics for any machine other than itself (an NFS client may also be an NFS server to another machine). The statistics are cumulative statistics since system boot (or since the counters were reset to 0 with nfsstat -z).

Too many nfsd threads?

If you think you need more nfsd threads on your server, and proceed to add some, watch the nullrecv column in the nfsstat -rs output. If the number starts to grow, it may mean you have too many nfsd threads. However, this is usually not the case on AIX NFS servers as much as it could be the case on other platforms. The reason for that is that all nfsd threads are not woken up at the same time when an NFS request comes into the server. Instead, the first nfsd thread wakes up, and if there is more work to do, this daemon will wake up the second nfsd thread, and so on. You can adjust the maximum number of nfsd threads in the system by using the nfs_max_threads parameter of the nfso command. The default is 3891, which is the maximum.
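For example, to display the characteristics of this tunable (it is restricted, so changing it should be a deliberate decision):

# nfso -L nfs_max_threads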

Duplicate checks

Duplicate checks are performed for non-idempotent operations (that is, those that cannot be performed twice with the same result). The classic example is rm. The first rm will succeed, but if the reply is lost, the client will retransmit it. You want duplicate requests like these to succeed, so the duplicate request cache is consulted, and if it is a duplicate request, the same (successful) result is returned on the duplicate request as was generated on the initial request. Depending on whether the Connection oriented or the Connectionless dupreq counters are increasing, the appropriate NFS duplicate cache size may need to be increased (nfso -o nfs_tcp_duplicate_cache_size or nfso -o nfs_udp_duplicate_cache_size). The default duplicate cache size is 1000 and the maximum is 10000. The types of requests that can be stored in the duplicate cache are: setattr(), write(), create(), remove(), rename(), link(), symlink(), mkdir(), rmdir().
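For example, to raise the TCP duplicate cache to a hypothetical mid-range value:

# nfso -o nfs_tcp_duplicate_cache_size=5000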

PerfPMR reports

The PerfPMR nfsstat.sh script runs all the nfsstat reports both before and after the specified measurement period. The results can be found in the nfsstat.int file in the PerfPMR output directory.

If you do not already have a separate baseline report, do not reset the nfsstat statistics before running perfpmr.sh or nfsstat.sh. That way the “before” reports can be used as the baseline.


Figure 8-8. NFS statistics using netpmon -O nfs AN512.0

Notes:

This netpmon report shows the rates of the read and write calls (and bytes) that each client is sending to the NFS server. It is requested by using the -O nfs option when running netpmon.

The netpmon report also shows file activity on the server due to NFS mounts. Each row describes the amount of NFS activity handled by this server on behalf of a particular client. At the bottom of the report, the calls for all clients are totaled.

This information can be used to identify where the demand is coming from. That, in turn, can be used to investigate the situation for a particular client or to load balance the demand among alternative servers.


NFS statistics using netpmon -O nfs

# netpmon -O nfs -o netpmon.out; sleep 60; trcstop
# more netpmon.out
=====================================================================
NFS Server Statistics (by Client):
----------------------------------
                        ------ Read -----  ----- Write -----   Other
Client                  Calls/s   Bytes/s  Calls/s   Bytes/s  Calls/s
---------------------------------------------------------------------
aixclient1                 0.48      2115     0.22       162     0.32
aixclient2                 0.28      1228     0.25       356     0.32
aixclient3                 0.30      1296     0.27       264     0.38
aixclient4                 0.28      1228     0.23       261     0.30
aixclient5                 0.22       887     0.33       216     0.53
aixclient6                 0.17       751     0.08       107     0.03
aixclient7                 0.02        68     0.05         5     0.02
---------------------------------------------------------------------
Total (all clients)        1.75      7574     1.43      1371     1.90
=====================================================================


Figure 8-9. Server tuning with nfso AN512.0

Notes:

Overview

In addition to the nfs_max_threads, nfs_rfc1323, and the nfs_tcp_duplicate_cache_size or nfs_udp_duplicate_cache_size parameters already discussed, there are a few more that affect performance on the NFS server.

nfso command

The nfso command is used to configure NFS tuning parameters. The nfso command sets or displays current or next boot values for NFS tuning parameters. This command can also make permanent changes or defer changes until the next reboot. Whether the command sets or displays a parameter is determined by the accompanying flag. The -o flag performs both actions. It can either display the value of a parameter or set a new value for a parameter.
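For example, to display the current value of a tunable, or to change it so that the new value also persists across reboots (the value shown here is simply the default):

# nfso -o nfs_server_clread
# nfso -p -o nfs_server_clread=1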


Server tuning with nfso

• Tuning options with nfso on an NFS server:
  – nfs_server_base_priority: fixes the priority of the NFS daemon threads to this value
  – nfs_server_clread: default value of 1 enables aggressive file read-ahead on the NFS server
  – nfs_v3_server_readdirplus: default value of 1 enables the NFS V3 readdirplus ability


Extreme care should be taken when using this command. If used incorrectly, the nfso command can make your system inoperable.

nfs_server_base_priority

The nfs_server_base_priority parameter fixes the priority of the NFS server threads (scheduling policy is round-robin) to the value specified. A value of 0 means that the default non-fixed or floating priority (scheduling policy SCHED_OTHER) is used.

nfs_server_clread

The nfs_server_clread option allows the NFS server to be very aggressive about the reading of a file. This may be useful in cases where the client is reading sequentially but the JFS/JFS2 read-ahead parameters are at default values. If the value is 1 (the default), then aggressive read-ahead is done. If the value is 0, normal system default read-ahead methods are used. Normal system read-ahead is controlled by VMM. The more aggressive top-half JFS read-ahead is less susceptible to read-ahead breaking down due to out-of-order requests (which are typical in the NFS server case). When the mechanism is activated, it will read an entire cluster (128 KB, the LVM logical track group size).

nfs_v3_server_readdirplus

The nfs_v3_server_readdirplus option enables the server to automatically return file handle and file attribute information along with directory entries instead of the client having to separately request this information. This greatly improves performance and is enabled by default. There may be rare situations where almost all the server traffic is for a client which does not need the additional information and turning the option off may improve performance. It is recommended to leave this enabled.

The following NFS options have either been discontinued or are restricted tunables:

nfs_device_specific_bufs (discontinued)

The nfs_device_specific_bufs parameter when set to 1 means that the NFS server will use memory allocations from network devices if the network device supports such a feature. Use of these special memory allocations by the NFS server can positively affect the overall performance of the NFS server. The default of 1 means the NFS server is allowed to use the special network device memory allocations. These are buffers managed by a network interface that result in improved performance (over regular mbufs) because no setup for DMA is required on these. Two adapters that support this include the Micro Channel ATM adapter and the SP2 switch adapter. If the value of 0 is used, the NFS server will use the traditional memory allocations for its processing of NFS client requests.


nfs_max_connections (discontinued)

The nfs_max_connections parameter value can be used to limit the number of NFS clients that use TCP mounts. The default value of 0 means that no limit is enforced. Tuning this can be used to reduce the load on the NFS server by denying access to some clients.


nfs_socketsize (restricted)

The nfs_socketsize parameter, when tuned on the client machine, specifies the size of the UDP send/receive socket buffers used for each UDP request. The default is 60000 bytes.

nfs_tcp_socketsize (restricted)

The nfs_tcp_socketsize parameter, when tuned on the client machine, specifies the size of the TCP send/receive socket buffer used for each NFS server connection. The default is 60000 bytes.

nfs_iopace_pages (restricted)

The nfs_iopace_pages option specifies the number of pages that will be sent to a file on the NFS server before the application issuing I/Os to that file is put to sleep. The default value is 0, which means that 1/8th of the file will be sent. The VMM I/O pacing can also be used. But, in the case of NFS flushes from shared memory, the writes may bypass the VMM I/O pacing code. This parameter can also be used to keep one process from using up all of the bufstructs from a paging device table (PDT).

nfs_dynamic_retrans (restricted)

The nfs_dynamic_retrans value, if set to 1 (the default), will adjust the timeout value dynamically after each retransmit. This value can be doubled each time, but may vary based on a feedback mechanism.


The maximum timeout value is 20 seconds. If the timeo mount option is set, the timeo value is used for each retransmit when nfs_dynamic_retrans=0. Otherwise, timeo is only used for the initial retransmit.

Tuning bufstruct pools (restricted)

The number of pools is tunable with the nfso command (nfs_v2_pdts, nfs_v3_pdts or nfs_v4_pdts). The number of bufstructs in each pool is also tunable. It can go up to a maximum value of 5000 per pool. This is tunable also with nfso (nfs_v2_vm_bufs, nfs_v3_vm_bufs or nfs_v4_vm_bufs). When tuning the vm_bufs values, make sure this is set before the pdts value is set. Both the pdts and vm_bufs values must be set before the NFS file systems are mounted.

When issuing reads/writes to files in an NFS mounted file system, each NFS I/O uses a bufstruct that is obtained from a pre-allocated pool. By default, there is one pool that is created for each NFS version and all NFS mount points of the same version share that pool. This pool can differ in size depending on the amount of RAM on the client machine (though for most clients, the pool will have 1000 bufstructs).
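For example, assuming hypothetical values and remembering that these are restricted tunables (nfso warns when they are changed), the V3 pools could be enlarged before the file systems are mounted, setting the vm_bufs value first:

# nfso -o nfs_v3_vm_bufs=2000
# nfso -o nfs_v3_pdts=2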

nfs_auto_rbr_trigger (restricted)

The nfs_auto_rbr_trigger nfso option can be used to specify the number of megabytes to initially cache in memory. File contents sequentially read after this threshold will be released after the application has read the data. This option is defaulted to 0, which currently has AIX perform no release-behind-on-read processing. By coding the option you can enable release-behind-on-read and determine how much initial data to cache.


Figure 8-10. nfsstat -c. AN512.0

Notes:

Connection oriented versus connectionless

The report has two sections. The connection oriented section is for TCP mounts. The connectionless section is for UDP mounts. The statistics are the total for all mounts of a given type.

Dropped packets

For performance monitoring, nfsstat -rc will give information on whether the network is dropping packets. A network may drop a packet if it cannot handle it. Dropped packets may be the result of the response time of the network hardware or software, or an overloaded CPU on the server. A dropped packet is not actually lost, as the request is retransmitted (normally successfully). Packets are rarely dropped on the client. Usually, packets are dropped on either the network or on the server.


nfsstat -c

# nfsstat -rc
Client rpc:
Connection oriented
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
1          0          0          0          0          0          0
nomem      cantconn   interrupts
0          0          0
Connectionless
calls      badcalls   retrans    badxids    timeouts   newcreds   badverfs
1448       0          12         0          12         0          0
timers     nomem      cantsend
22         0          0

# nfsstat -nc
Client nfs:
calls      badcalls   clgets     cltoomany
1437       0          0          0
… <nfsv2 calls not shown> …
Version 3: (1 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       1 100%     0 0%
commit
0 0%


Retransmissions and timeouts

If using UDP mounts, the retrans column in the rpc section displays the number of times requests were retransmitted due to a timeout in waiting for a response. This is either related to dropped packets or a delayed reply from the server. If the retrans number consistently increases, then it indicates a problem with the server or network keeping up with demand. Use vmstat, netpmon, and iostat on the server machine to check the load. With TCP mounts, retransmissions are handled by the transport layer.
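One way to watch for growth is to reset the counters and compare over an interval (root only; skip the reset if you have not yet captured a baseline):

# nfsstat -z
# sleep 300; nfsstat -rc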

With soft mounts, when the major timeout period has expired and there has been no reply from the server, the application is informed that the request has failed and the timeouts counter is incremented. For UDP, the major timeout period is after RPC has reached the retransmit limit.

Delayed server replies and excessive retransmissions

Generally the analysis of the following statistics is mainly a concern for UDP mounts, where the RPC protocol needs to handle error detection and retransmission.

A high badxid count implies that requests are reaching the various NFS servers, but the servers are too loaded to send replies before the local host's RPC calls time out and are retransmitted. The badxid value is incremented each time a duplicate reply is received for a transmitted request (an RPC request retains its XID through all transmission cycles). Excessive retransmissions place an additional strain on the network or server, further degrading response time.

If the server is CPU-bound, it will affect NFS and its daemons. To improve the situation, the server must be tuned or upgraded, or the user can localize his application’s files. If the server is I/O-bound, the server file systems can be reorganized, or localized files can be used.

If the server does not appear overloaded and the badxid column in nfsstat is much lower than the timeout column, then there may be network hardware problems. A network analyzer can help pinpoint this.

Other report fields

The timers field shows the number of times the calculated time-out value was greater than or equal to the minimum specified time-out value for a call.

The cantconn field shows the number of times the call failed due to a failure to make a connection to the server.

The nomem field shows the number of times the calls failed due to a failure to allocate memory.

The interrupts field shows the number of times the call was interrupted by a signal before completing.


The output listed using nfsstat -nc on the client machine can be used to determine what types of requests are being made. If there is a high number of getattr requests, attribute cache tuning may help. If there is a high number of read requests, then more memory may help. If there is a high number of commits, then tuning commit-behind may help.

Additional NFS V4 output

When you run nfsstat -4nc on an AIX 5L V5.3 system, the following additional information is displayed:

Version 4: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    statfs     finfo      commit     open
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
confirm    downgrade  close      lock       locku      lockt      setclid
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
renew      clid_cfm   secinfo    release_lo replicate
0 0%       0 0%       0 0%       0 0%       0 0%


Figure 8-11. nfsstat -m AN512.0

Notes:

Overview

The nfsstat -m option shows statistics for each NFS mount on the client. First, it shows the NFS version, protocol, and mount options. In addition, it provides (for UDP mounts only) the smoothed round-trip times and the current NFS RPC timeout value:

- srtt is the smoothed round-trip time

- dev is the estimated deviation

- cur is the current timeout value

RPC uses an exponential back-off for the time-out. A large current timeout value indicates slow RPC acknowledgements.

The numbers in parentheses are the actual times in milliseconds. The other values are un-scaled values kept by the operating system kernel. You can ignore the un-scaled values. Response times are shown for lookups, reads, writes and a combination of all of these operations (all).


nfsstat -m

# nfsstat -m
/nfs/retain/pmrs from /nfs/retain/pmrs:cia
Flags: vers=2,proto=udp,auth=unix,soft,intr,dynamic,rsize=8192,wsize=8192,retrans=5
Lookups: srtt=2 (5ms), dev=2 (10ms), cur=1 (20ms)
All:     srtt=2 (5ms), dev=2400 (12000ms), cur=600 (12000ms)

/nfs/retain/bin from /nfs/retain/bin:cia
Flags: vers=2,proto=udp,auth=unix,hard,intr,dynamic,rsize=8192,wsize=8192,retrans=5
Lookups: srtt=14 (35ms), dev=6 (30ms), cur=4 (80ms)
Reads:   srtt=18 (45ms), dev=5 (25ms), cur=4 (80ms)
All:     srtt=15 (37ms), dev=7 (35ms), cur=5 (100ms)

/nfs/cust from /nfs/cust:l3perf
Flags: vers=3,proto=tcp,auth=unix,soft,intr,link,symlink,rsize=32768,wsize=32768,retrans=5
All:     srtt=0 (0ms), dev=2400 (12000ms), cur=600 (12000ms)


Figure 8-12. Client commit-behind tuning AN512.0

Notes:

Overview

With NFS V2, for each write, the page is synched to disk on the server. Therefore, it is considered committed once the write is completed.

With NFS V3 and V4, a write may just go to the memory cache on the NFS server and return. In this case, the write is considered as being in a smudged state on the NFS client. If the file is synched, then a commit occurs and all dirty pages are flushed to disk. At this point, the pages are no longer in smudged state. If page-replacement occurs on the client and a smudged page is stolen, then the VMM drives an NFS commit for this page. This page-replacement activity can cause a high rate of commits to the NFS server as there can be a commit per page.


Client commit-behind tuning

• A high number of commits could be due to VMM page replacement:
  – When NFS client pages in a ‘smudged’ state are stolen, they will be committed to the NFS server
  – Page-by-page commits are inefficient and can overload the server

• The combehind mount option enables commit-behind:
  – After numclust clusters of pages are modified and a cluster boundary is crossed, smudged pages in previously modified clusters are committed to the server using a single commit
  – The default numclust value is 128 clusters (each cluster has 4 pages)
  – A smaller numclust mount option value will make the commits more aggressive (use when page-by-page commits are still high)


Enabling commit-behind

To increase the NFS client and server performance in this case, the combehind mount option can be used to enable commit-behind. Commit-behind uses the NFS numclust value (defaults to 128 and can be overridden with the numclust mount option) to determine when to send commits. An NFS cluster contains 4 pages. After 4*128 pages by default, if another page is modified, then these pages are committed with a single commit call. If page replacement is running faster than the commit-behind algorithm (commits continue to increase in nfsstat -c), then make commit-behind more aggressive by reducing the numclust value (a suggested value is 32, 64 or 128).
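For example, to enable commit-behind with a more aggressive (hypothetical) cluster count on an NFS mount:

# mount -o combehind,numclust=64 server:/export /mnt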

VMM read cache impact

A side effect of enabling commit-behind is that VMM caching is effectively disabled. This could have a negative impact when doing large sequential reads on the same NFS mount.


Figure 8-13. Client attribute cache tuning AN512.0

Notes:

File attribute cache tunables

NFS maintains a cache on each client of the attributes of recently accessed directories and files. Five parameters, beginning with ac, can be set to control how long an entry is kept in cache. These are all mount options and can be set in /etc/filesystems or specified at the mount command line. The mount options are:

- actimeo is the absolute time for which file and directory entries are kept in the file attribute cache after an update. If specified, this value overrides the following *min and *max values, effectively setting them all to the actimeo value.

- acregmin is the minimum time after an update that file entries will be retained. The default is 3 seconds.

- acregmax is the maximum time after an update that file entries will be retained. The default is 60 seconds.


Client attribute cache tuning

• If the NFS client is looking up file attributes at a high rate AND the attributes don’t change often, then tuning the attribute cache values may increase performance

• File attribute cache mount options:
  – actimeo
  – acregmin
  – acregmax
  – acdirmin
  – acdirmax
  – noac


- acdirmin is the minimum time after an update that directory entries will be retained. The default is 30 seconds.

- acdirmax is the maximum time after an update that directory entries will be retained. The default is 60 seconds.

- noac specifies that this mount performs no attribute or directory caching.

Each time the file or directory is updated, its removal is postponed for at least acregmin or acdirmin seconds. If this is the second or subsequent update, the entry is kept at least as long as the interval between the last two updates, but not more than acregmax or acdirmax seconds.
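For example, to hold cached attributes for two minutes on a read-mostly mount (a hypothetical value that overrides the *min and *max settings):

# mount -o actimeo=120 server:/export /mnt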


Figure 8-14. NFS I/O pacing, release-behind, and DIO AN512.0

Notes:

Avoiding shortages of resources

A sudden large increase in NFS requests can exhaust the number of bufstructs on the client and strain network or server resources. One common cause is the flushing of unwritten file pages when an application closes (typically with an fsync) or when the syncd daemon runs. Another common cause is a single application which is writing out a large file, which could hog these resources and thus affect other applications.

Pacing the flushing of cached file writes

The nfs_iopace_pages nfso option specifies the maximum number of dirty pages that can be written to the server at one time. The default value is 0, which indicates that the kernel dynamically adjusts the maximum depending upon the write sizes (with a starting value of 32 pages). Coding a non-zero value for this option allows the administrator to force a particular maximum to be used.


NFS I/O pacing, release-behind, and DIO

• Pace NFS reads and writes on open files
  – minpout and maxpout mount options
  – Suspends I/O until the number of outstanding pageouts is low

• Release-behind conserves NFS client memory
  – For reads only
  – Similar to JFS release-behind tuning
  – rbr mount option
  – Next page read triggers release of the previous cluster
  – Cluster size = numclust * MAX(wsize, rsize)

• Direct I/O and Concurrent I/O are available as NFS mount options: dio and cio
  – Application needs to be properly designed


Pacing application I/O

The maxpout and minpout mount options control the outstanding pageouts thresholds at which additional I/O to the NFS file system will be suspended and when it will be resumed. These options for NFS mounts were introduced with AIX 5L V5.3 ML03.

When the outstanding number of pageouts reaches the maxpout value, I/Os to that file system are blocked until the number of outstanding pageouts has been reduced to the minpout value. This helps prevent a single application from dominating the I/O, and also tends to smooth out the request load, thus avoiding a transient shortage of bufstructs.

By default, if not coded on the mount, AIX will still use this pacing mechanism, but with kernel determined values for maxpout and minpout.
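For example, to force explicit (hypothetical) thresholds instead of the kernel-determined values:

# mount -o maxpout=33,minpout=16 server:/export /mnt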

Conserving NFS client memory

If there is very little likelihood that the applications sequentially reading a file (or any other application on this client) will re-read what has been cached in memory, then that data is unnecessarily competing with other uses of that memory. Starting with AIX 5L V5.3 ML03, NFS has the ability (as JFS has had for some time) to free up this memory after the application has read the data.

Global automatic release behind on read

The nfs_auto_rbr_trigger nfso option can be used to specify the number of megabytes to initially cache in memory. File contents sequentially read after this threshold will be released after the application has read the data. This option is defaulted to 0, which currently has AIX perform no release-behind-on-read processing. By coding the option you can enable release-behind-on-read and determine how much initial data to cache.
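For example, to cache the first 64 MB (a hypothetical value) of each sequentially read file and release pages behind the read from then on:

# nfso -o nfs_auto_rbr_trigger=64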

Using the rbr mount option

You can enable release-behind-on-read for individual mounts by coding the rbr mount option. This option will cause AIX to release previously read pages when the next page is sequentially read beyond the current cluster. The size of the cluster is determined by the kernel, based upon the current value for numclust and the read or write sizes that are specified for the mount.

The mount option overrides the global automatic mechanism.
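For example (hypothetical server and mount point):

# mount -o rbr server:/export /mnt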

Using the dio and cio mount options

Direct I/O for NFS has the same considerations as discussed in the file system unit. Concurrent I/O is DIO with file locking disabled (the application has to handle the data locking instead).


Figure 8-15. Checkpoint (1 of 2) AN512.0

Notes:


Checkpoint (1 of 2)

1. True or False: A large number of concurrent NFS clients can overload an NFS server.

2. The ________ daemons are the block input/output daemons and are required in order to perform remote I/O requests at an NFS client.

3. On clients and servers where there is heavy file locking activity, the _________ daemon may become a bottleneck (for NFS V2 and NFS V3).


Figure 8-16. Checkpoint (2 of 2) AN512.0

Notes:


Checkpoint (2 of 2)

4. The __________ command can be used to look at per-mount statistics at the NFS client.

5. The _________ utility can identify which NFS clients present the greatest workload at the NFS server.

6. If the NFS client has overcommitted memory, the ___________ mount option can be used to improve NFS I/O efficiency and the _______ mount option can be used to release file cache memory once the application receives the data.


Figure 8-17. Exercise 8: NFS performance tuning AN512.0

Notes:


Exercise 8: NFS performance tuning

• View the performance differences between NFS versions
• Examine nfsstat and netpmon reports


Figure 8-18. Unit summary AN512.0

Notes:


Unit summary

This unit covered:

• The basic Network File System (NFS) tuning concepts
• Differences between NFS V2, V3, and V4
• Using nfsstat and netpmon to monitor NFS
• Using nfso and mount options to tune NFS


Unit 9. Performance management methodology

What this unit is about

This unit reviews performance monitoring methodology and summarizes the tools and procedures covered in this course. The emphasis is on finding bottlenecks using standard AIX monitoring tools.

What you should be able to do

After completing this unit, you should be able to:

• List the steps to approach performance analysis

• Describe the distinct areas of performance that need to be investigated and how to go about monitoring those areas

• Use tools that will aid with performance monitoring and tuning on partitioned systems

How you will check your progress

Accountability:

• Checkpoint

• Machine exercises

References

AIX Version 6.1 Performance Management

AIX Version 6.1 Performance Tools Guide and Reference

AIX Version 6.1 Commands Reference, Volumes 1-6

SG24-6478 AIX 5L Practical Performance Tools and Tuning Guide (Redbook)

SG24-6184 IBM eServer Certification Study - AIX 5L Performance and System Tuning (Redbook)


Figure 9-1. Unit objectives AN512.0

Notes:


Unit objectives

After completing this unit, you should be able to:

• List the steps to a methodical approach to performance analysis
• Describe the distinct areas of performance that need to be investigated and how to go about monitoring those areas
• Use tools that will aid with performance monitoring and tuning on partitioned systems


Figure 9-2. Factors that can affect performance AN512.0

Notes:

Introduction

Technological improvements in microprocessors, disks, and networking equipment have dramatically changed the look of server computing. While those improvements have more often than not reduced the incidence of performance problems, they have also increased the capabilities of systems such that more complex problems need to be solved. Thus, performance tuning has tended to change in nature from simple hardware and software bottleneck analysis toward evaluation of more complex interactions.

Important factors that can affect performance

As server performance is distributed throughout each server component and type of resource, it is essential to identify the most important factors or bottlenecks that will affect the performance for a particular activity.


Factors that can affect performance

• Detecting the bottleneck(s) within a server system depends on a range of factors such as:
  – Configuration of the server hardware
  – Software application(s) workload
  – Configuration parameters of the operating system
  – Network configuration and topology

[Graphic: bottlenecks restricting system throughput]


Detecting the bottleneck within a server system depends on a range of factors such as:

- Configuration of the server hardware
- Software application(s) workload
- Configuration parameters of the operating system
- Network configuration and topology

File servers need fast network adapters and fast disk subsystems. In contrast, database server environments typically produce high processor and disk utilization, requiring fast processors or multiple processors and fast disk subsystems. Both file and database servers require large amounts of memory for caching by the operating system or the application.

Bottlenecks

A bottleneck is a term used to describe a particular performance issue which is throttling the throughput of the system. It could be in any of the subsystems: CPU, memory, or I/O including network I/O. The graphic in the visual above illustrates that there may be several performance bottlenecks on a system and some may not be discovered until other, more constraining, bottlenecks are discovered and solved.


Figure 9-3. Determine type of problem AN512.0

Notes:

Types of problems

In addition to the questions in the visual above to discover the type of problem, you can ask questions such as: Is this problem new or could it have always been there? What has changed since you’ve noticed the problem? Is it a functional problem that is causing the performance problem?

Determining the type of problem (functional or performance) is half the battle in determining how to solve the issue. A functional problem is typically more straight forward to solve; you find one or more things to fix. With a performance problem you need to theorize what could help, try it, and see if it causes the performance to be better or worse.


Determine type of problem

• Determine the type of problem:
  – Is it a functional problem or purely a performance problem?
  – Is it a trend or a sudden issue?
  – Is the problem only at certain times (of the day, week, and so forth)?

• You’ll need to know baseline performance statistics and what your performance goals are:
  – Use AIX tools
  – Use PerfPMR
  – Document statistics regularly to spot trends for capacity planning
  – Document statistics during high workloads


Creating a baseline

Because you need something to compare current statistics to, you need to have baseline statistics documented. You may have several baselines documented depending on the cyclical nature of the workload. For example, a separate baseline may be needed for the end-of-month batch processing workload versus the rest of the month.


Figure 9-4. Trade-offs and performance approach AN512.0

Notes:

Trade-offs

There are many trade-offs related to performance tuning that should be considered. The key is to ensure there is a balance between them.

The trade-offs are:

- Cost versus performance

In some situations, the only way to improve performance is by using more or faster hardware. But, ask the question “Does the additional cost result in a proportional increase in performance?”

- Conflicting performance requirements

If there is more than one application running simultaneously, there may be conflicting performance requirements.


Trade-offs and performance approach

• Trade-offs must be considered, such as:
  – Cost versus performance
  – Conflicting performance requirements
  – Speed versus functionality

• Performance may be improved using a methodical approach:
  1. Understanding the factors which can affect performance
  2. Measuring the current performance of the server
  3. Identifying a performance bottleneck
  4. Changing the component which is causing the bottleneck
  5. Measuring the new performance of the server to check for improvement


- Speed versus functionality

Resources may be increased to improve a particular area, but serve as an overall detriment to the system. Also, you may need to make choices when configuring your system for speed versus maximum scalability.

Methodical approach

Using a methodical approach, you can obtain improved server performance. For example:

- Understanding the factors which can affect server performance, for the specific server functional requirements and for the characteristics of the particular system

- Measuring the current performance of the server

- Identifying a performance bottleneck

- Upgrading/tuning the component which is causing the bottleneck

- Measuring the new performance of the server to check for improvement


Figure 9-5. Performance analysis flowchart AN512.0

Notes:

Introduction

This is a flowchart that some performance analysts use. Keep in mind it is an iterative process. The rest of this unit will look at the subsystems in more detail.


Performance analysis flowchart

[Flowchart: During normal operations, monitor system performance and check it against requirements. When a performance problem appears, test in turn whether the system is CPU bound, memory bound, I/O bound, or network bound, and take the corresponding actions. If none of these apply, run additional tests and take actions from there. After each action, check whether performance meets the stated goals; if not, iterate.]


Figure 9-6. CPU performance flowchart AN512.0

Notes:

CPU bound system

A CPU bound system means that all the processors are nearing 100% busy, with processes which want to run but cannot (or cannot as quickly), causing you not to meet your performance goals.

Locate dominant processes

In order to understand why a system is CPU bound, you have to determine which processes are using the most CPU time by using a command like ps. The %CPU column gives the percentage of time the process has used the CPU since the process started. You then must verify if those processes are running correctly and if they are using the same amount of CPU as usual. You may find processes which are not behaving normally (spinning, using up CPU time, but not doing any work) and you may be able to kill these.
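For example, one quick way to list the heaviest CPU consumers with BSD-style ps output (the third column is %CPU):

# ps aux | sort -rn -k3 | head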


CPU performance flowchart

[Flowchart: Start by monitoring CPU usage and comparing it with goals (vmstat, sar, topas, time). If CPU usage is high, locate the dominant process(es) (ps, tprof, topas); if a process’s behavior is abnormal, kill it; if the behavior is normal, tune the applications or the operating system (nice/renice, bindprocessor, smtctl, schedo, or make the application multi-threaded). If CPU usage is not high but the CPU is not supposed to be idle, determine the cause of the idle time by tracing, then fix or tune the application or the OS. Otherwise, check the memory and disk subsystems.]


Tune applications or operating system

The application itself may be able to be tuned. Is it single-threaded? Can you increase the number of its threads or processes? You may also be able to tune the operating system. Is simultaneous multi-threading enabled? Can you change the priorities of processes or threads such that the most important processes and threads receive most favored status?


Figure 9-7. Memory performance flowchart AN512.0

Notes:

Memory bound system

A system is memory bound if it has high memory occupancy and high paging space or file paging activity. The activity of the paging space is given by the number of pages read from disk to memory (page ins) and number of pages written to disk (page out).

Examples of memory parameters to tune with vmo are the minfree, maxfree, minperm%, maxperm%, maxclient%, strict_maxclient, strict_maxperm, and lru_file_repage tuning options.
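A minimal sketch of inspecting and changing these tunables (the values shown are examples only, not recommendations):

# Show current, default, and boot values plus the valid range for one tunable
vmo -L minfree

# Change tunables on the running system (add -p to also persist across reboots)
vmo -o lru_file_repage=0 -o maxclient%=80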

Example of determining a memory bound system

For example, you might use topas and notice that memory is 100% consumed (Comp, Noncomp), paging space is 61% consumed (% Used), a lot of pages are being written to disk (PgspOut), and you see higher VMM page steals (Steals). Because the system is using all the memory and asking for more, this system is memory bound. Note that page stealing is normal behavior in AIX and, depending on the application, may not be an issue.

[Flowchart: Memory performance. START: monitor memory usage; meeting goals? (lsps -s, topas, vmstat -I, PerfPMR). If not: paging, page steals, or repaging? If yes, is memory overcommitted? (svmon -G). If overcommitted, check for a memory leak (vmstat, ps gv, svmon -P): if there is one, determine the offending process and kill or debug it; otherwise reduce the workload or add memory. If memory is not overcommitted, tune the memory parameters (vmo).]


Now you need to determine why the system is memory bound. Is it because of a memory leak? Perhaps you need to reduce the workload to free up memory, or add more physical memory. You may be able to change tuning options to make more efficient use of memory if adding memory is not an option.

Determine which processes are causing the problem

One action to take is to determine which processes are making the system memory bound. Use the ps or svmon commands to look for processes that are consuming a lot of memory. In the following example, perl is the largest memory consumer, with a total number of pages in real memory of 218933 (around 875 MB) and a total number of pages reserved or used on paging space of 97963 (nearly 400 MB).


The second application, vpross, uses only 48 MB of memory and less than 7 MB of paging space, so the perl application is the root cause of this memory problem.

# svmon -P
------------------------------------------------------------------------
    Pid  Command       Inuse    Pin   Pgsp  Virtual  64-bit  Mthrd  LPage
 332008  perl         218933   4293  97963   318471       N      N      N

   Vsid  Esid  Type  Description           LPage  Inuse   Pin   Pgsp  Virtual
   7380     6  work  working storage           -  65536     0      0    65536
   1383     3  work  working storage           -  47302     0  18215    65515
  15389     7  work  working storage           -  44717     0      0    44717
  17388     4  work  working storage           -  38021     0  27528    65536
  21393     5  work  working storage           -  15031     0  50528    65536
      0     0  work  kernel segment            -   6843  4290   1621     8454
  3f8bd     d  work  loader segment            -   1352     0     71     3062
  29397     f  work  shared library data       -     83     0      0       83
   d385     2  work  process private           -     32     3      0       32
   3362     1  clnt  code,/dev/hd2:12435       -     16     0      -        -
  2f374     a  work  working storage           -      0     0      0        0
  3f37c     9  work  working storage           -      0     0      0        0
  3d37d     8  work  working storage           -      0     0      0        0
------------------------------------------------------------------------
    Pid  Command       Inuse    Pin   Pgsp  Virtual  64-bit  Mthrd  LPage
 303122  vpross        12193   4293   1696    15465       N      N      N

   Vsid  Esid  Type  Description           LPage  Inuse   Pin   Pgsp  Virtual
      0     0  work  kernel segment            -   6843  4290   1621     8454
  17368     2  work  process private           -   3913     3      4     3917
  3f8bd     d  work  loader segment            -   1352     0     71     3062
  1936f     1  pers  code,/dev/lv00:83977      -     53     0      -        -
  1136b     f  work  shared library data       -     32     0      0       32
  1336a     -  pers  /dev/lv00:83969           -      0     0      -        -
------------------------------------------------------------------------


Figure 9-8. Disk/File system performance flowchart AN512.0

Notes:

Disk and file system performance issues

A system may be disk bound if at least one disk is busy and cannot fulfill other requests, and processes are blocked waiting for I/O operations to complete. The limitation can be either physical or logical. The physical limitation involves hardware, such as the bandwidth of disks, adapters, and the system bus. The logical limitation involves the organization of the logical volumes on the disks and Logical Volume Manager (LVM) settings, such as striping or mirroring.

Example of determining a disk or file system bound system

A system might show a high I/O wait of 86.6% (Wait), hdisk0 active 98.7% of the time (Busy%), and more than 5 processes waiting for an I/O operation to complete (Waitqueue). This system is waiting on write operations to hdisk0, which may be an indication that it is disk bound.

[Flowchart: Disk/file system performance. START: monitor disk usage; meeting goals? (vmstat -I and -v, svmon -G; svmon shows whether there is enough memory to cache files). If not: is a disk overloaded? (iostat, topas, sar -d, filemon) If yes, distribute the load. Is the adapter overloaded? (iostat -a) If yes, distribute the load. Is it a file system or LVM issue? (fileplace, lvmstat, lslv, lspv) If yes, de-fragment, change the fragment size, check the compression setting, and distribute logical or physical volumes if there are hotspots.]


Disk I/O analysis

When a system has been identified as having disk I/O performance problems, the next step is to find out where the problem comes from. Check the adapter throughput and the disk throughput. The activity of a disk adapter is given by the iostat -a command. Because the maximum bandwidth of an adapter depends on its type and technology, compare the statistics given by iostat to the published bandwidth for the adapter to determine how loaded the adapter is. If the adapter is overloaded, try to move some data to another disk on a distinct adapter, move a physical disk to another adapter, or add a disk adapter.
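For example, a quick way to see adapter-level activity alongside the per-disk statistics (the interval and count are arbitrary):

# Adapter throughput report: three samples at 5-second intervals
iostat -a 5 3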

The disk may be bound simply because the data on it is poorly organized. Verify the placement of logical volumes on the disk with the lspv command. If logical volumes are fragmented across the disk, reorganize them with the reorgvg or migratepv commands.
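A minimal sketch, assuming hdisk0 and a volume group named datavg:

# Show how logical volume partitions are distributed across the disk regions
lspv -p hdisk0

# Reorganize the volume group's logical volumes to match their intra-disk
# allocation policies (I/O intensive; schedule for an off-peak period)
reorgvg datavg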

If logical volumes are well organized on the disks, the problem may come from the file distribution in the file system. The fileplace command displays the file organization. If space efficiency is near 100%, this means that the file does not have many fragments and they are contiguous. If necessary, you can use the defragfs command to increase a file system's contiguous free space by reorganizing allocations to be contiguous rather than scattered across the disk.
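For example (the file and mount point here are hypothetical):

# Show the placement and space efficiency of a file
fileplace -v /data/bigfile

# Report what defragmentation would achieve, without changing anything
defragfs -r /data

# Defragment the file system
defragfs /data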


Figure 9-9. Network performance flowchart (1 of 3) AN512.0

Notes:

Monitoring performance and comparing to goals

With network I/O, one of the things that needs to be done is to document the network topology and identify the transmit and receive hosts for the major applications (or at least the ones that use the network).

Hardware configuration problems

Adapter-to-switch port link configuration problems can cause corrupted frames and late or multiple collisions. In addition to using netstat -v and entstat -d on the hosts, check the statistics on the Ethernet switch(es). If the configuration looks fine, check the rest of the hardware, such as cables, switch ports, and the adapters.
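For example, assuming the adapter is ent0:

# Detailed adapter statistics: errors, collisions, and negotiated media speed
entstat -d ent0

# The equivalent information for all configured adapters
netstat -v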

[Flowchart: Network performance (1 of 3). START: monitor network usage; meeting goals? (ping, netstat, netperf). If not: HW configuration problems? (netstat -v, entstat -d) If yes, fix the configuration. Adapter xmit queue/rcv pool overflows? (netstat -v, entstat -d) If yes, increase the queue or pool (chdev), or decrease the workload. Network buffer shortage? (netstat -m) If yes, add RAM, use the 64-bit kernel, or decrease the workload.]


Are there adapter transmit (xmit) queue or receive (rcv) pool overflows?

If there are overflows in these areas, increase the size of the transmit queue or the receive pool with the chdev command for the adapter or decrease the network load.
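A sketch of enlarging a transmit queue; the attribute names vary by adapter type (xmt_que_size is one common example), so list the supported attributes first. The adapter name and value here are assumptions:

# See which queue-related attributes this adapter supports, and their ranges
lsattr -El ent0 | grep -i que

# Example only: raise the transmit queue size; -P defers the change until the
# device is next reconfigured (for example, at reboot) if it is busy
chdev -l ent0 -a xmt_que_size=1024 -P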

Are there network buffer shortages?

One configuration option if you are running short on network buffers is to use the 64-bit kernel. You could also add more memory or find out what is using all the network buffers and try to decrease the network load.
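For example (a sketch):

# Kernel network memory statistics; non-zero values in the 'failed' columns
# indicate mbuf or buffer shortages
netstat -m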


Figure 9-10. Network performance flowchart (2 of 3) AN512.0

Notes:

Are there IP input queue overflows?

If there are IP input queue overflows, try to solve this by avoiding fragmentation, and/or by eliminating network discards and delays. You could also increase the queue size with the no command or decrease network load.
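For example (512 is an example value only; the default queue length is much lower):

# Look for 'ipintrq overflows' in the IP statistics
netstat -p ip

# Raise the IP input queue length
no -o ipqmaxlen=512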

Are there UDP receive buffer overflows?

If there are UDP receive buffer overflows, try to solve this by increasing the size of the buffer with the no command. Also, check that the CPU subsystem is not constrained. Another solution is to decrease the network load.
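A minimal sketch (the size is an example; values set with no apply to sockets that do not choose their own size with setsockopt()):

# Look for 'socket buffer overflows' in the UDP statistics
netstat -p udp

# Raise the default UDP receive buffer size
no -o udp_recvspace=262144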

[Flowchart: Network performance (2 of 3). Continuing: IP input queue overflows? (netstat -p ip) If yes, increase the queue size (no), avoid fragmentation, eliminate discards/delays, or decrease the workload. UDP rcv buffer overflows? (netstat -p udp) If yes, increase the buffer size (no, setsockopt()), check the CPU, or decrease the workload. TCP retransmits? (netstat -p tcp) If yes, find and fix the drops/delays. All paths loop back to "Meeting goals?".]


Are there TCP retransmits?

If you see TCP retransmits, you will need to identify where the packets are being dropped or delayed and fix the problem at the source. The problem could be caused by any of the above network issues, or could be somewhere else in the network.
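For example, comparing the retransmitted packet count to the total packets sent gives a retransmission rate (a sketch):

# Look for 'data packets retransmitted' and retransmit timeout counters
netstat -p tcp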


Figure 9-11. Network performance flowchart (3 of 3) AN512.0

Notes:

Use the optimal TCP initial window size

The TCP initial window size is controlled by the TCP send and receive buffer sizes. You may need to experiment with window sizes to find the optimal setting. Tune the ISNO parameters with the chdev command. You may also need to enable rfc1323.
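A sketch of both settings; the interface name and buffer sizes are assumptions, and ISNO values take effect only when the use_isno network option is enabled (the default):

# Allow TCP window sizes larger than 64 KB
no -o rfc1323=1

# Set per-interface send and receive buffer sizes (ISNO attributes)
chdev -l en0 -a tcp_sendspace=262144 -a tcp_recvspace=262144 -a rfc1323=1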

Do you see 200 ms transmit pauses?

These transmit pauses are due to waiting for delayed acknowledgements. Solve by disabling Nagle’s Algorithm.
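For example, system-wide (an application can achieve the same per connection with the TCP_NODELAY socket option):

# Setting tcp_nagle_limit to 1 effectively disables Nagle's Algorithm
no -o tcp_nagle_limit=1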

Is the network demand simply too high?

Identify the source(s) of the demand and eliminate what you can or slow it down. Or, even better, try to redistribute the traffic to other adapters or servers, or over time.

[Flowchart: Network performance (3 of 3). Continuing: TCP initial window size not optimal? (netstat -p tcp, tcpdump or iptrace) If yes, size the TCP send/receive buffers and possibly enable rfc1323 (no, lsdev/chdev, setsockopt()). 200 ms transmit pauses? If yes, disable Nagle's Algorithm (no, setsockopt()). Demand too high? (netstat) If yes, decrease or distribute the traffic. All paths loop back to "Meeting goals?".]


To slow down demand, you can decrease the window size, message sizes, or the maximum number of connections.

The source of the demand may be:

- An application that either has a bug, or is poorly designed, or simply has a lot of valid work to do

- The sum of many applications that cumulatively overload the queues, memory, or adapter

- One remote session partner which is either sending too much data or requesting too much data to be sent back

- The sum of many session partners that in total are overloading this host


Figure 9-12. NFS performance flowchart: Servers AN512.0

Notes:

NFS performance tuning

Part of the difficulty with tuning for NFS is that it is a client/server application. So its tuning involves everything we have taught in the entire course plus some items that are unique to NFS. The flowchart in the visual above is a basic methodology outline for servers. The next visual will show a flowchart for clients.

Use the correct protocols for NFS mounts

Are you using the default and recommended NFSv3 and TCP protocols for the NFS mounts? Other options (NFSv2, UDP, NFSv4) may impact performance, and should be treated as special cases when it comes to tuning. If you can, change back to NFSv3 and TCP.

The rest of the server tuning decision points in the flowchart assume these protocols are in use.
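For example, on a client (the server name and paths are hypothetical):

# Show the protocol, version, and options in effect for each NFS mount
nfsstat -m

# Mount explicitly with NFSv3 over TCP
mount -o vers=3,proto=tcp server1:/export/data /mnt/data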

[Flowchart: NFS performance, servers. START: monitor NFS usage; meeting goals? (nfsstat). If not: non-standard mount protocols? If yes, use NFSv3 and TCP if appropriate. Any resource bottlenecks? (all CPU, memory, I/O, and network analysis tools) If yes, fix the system bottleneck; use NFS-specific tuning tools (nfso). NFS threads need tuning? If yes, the maximum number of threads may be too low for lockd, mountd, or nfsd. After each action, try it and look for improvement.]


Tune the resource subsystems on the NFS server

On the NFS server, watch for bottlenecks in all subsystems: CPU, memory, I/O, and networking as previously discussed in this course. Any of these could affect NFS performance.

Specifically for NFS processor performance, you can set the server’s priority with nfso. For I/O, try setting aggressive read-ahead with nfso. For networking, you can request setsockopt() values for socket buffer size and configure the use of rfc1323 with nfso.
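A minimal sketch of the server-side nfso tuning mentioned above:

# Request large TCP windows (rfc1323) on the server's NFS sockets
nfso -o nfs_rfc1323=1

# Review all current NFS tunables, including the socket size options
nfso -a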

Tune NFS threads

The NFS server has some logical resources which may need tuning. You can configure the maximum number of threads for lockd, mountd, and nfsd. Increase these if your server can handle the load. Decrease these if you need to throttle back NFS traffic.
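For example, the nfsd thread ceiling can be raised with nfso (128 is an example value only):

# Raise the maximum number of nfsd threads on the server
nfso -o nfs_max_threads=128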

Additional ways to throttle workload include using nfso to reduce the maximum read and write sizes, the maximum number of connections, and the socket size. Also look at redistributing the workload by spreading the workload over more NFS servers or by controlling when certain clients request services. Instead of de-tuning the server, you could de-tune the clients.


Figure 9-13. NFS performance flowchart: Clients AN512.0

Notes:

Tune the resource subsystems on the NFS clients

On the NFS client, just like on the NFS server, watch for bottlenecks in all subsystems: CPU, memory, I/O, and networking as previously discussed in this course. Any of these could affect NFS performance.

Specifically for NFS client memory performance, you can use nfso or mount options to enable release behind on read. For networking, you can request setsockopt() values for socket buffer size and configure the use of rfc1323 with nfso. You can also set read and write options with mount options.
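A sketch of a client mount combining these options; the host, path, and sizes are assumptions (rbr requests release-behind-on-read):

# Mount with release-behind-on-read and explicit read/write transfer sizes
mount -o rbr,rsize=65536,wsize=65536,proto=tcp,vers=3 server1:/export/data /mnt/data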

Tune NFS threads

The NFS client has some logical resources which may need tuning. You can configure the maximum number of threads for biod. Increase these if your server can handle the load. Decrease these if you need to throttle back NFS traffic from this client. Also check the number and size of the bufstruct pools with nfso; these could be artificially constraining performance.

[Flowchart: NFS performance, clients. START: monitor NFS usage; meeting goals? (nfsstat, netpmon). If not: any resource bottlenecks? (all CPU, memory, I/O, and network analysis tools) If yes, fix the system bottleneck; use NFS-specific tuning tools (nfso, or mount options). NFS threads need tuning? If yes, set the number of biod threads and size the bufstruct pools (nfso, or mount options). High commits or attribute requests? (nfsstat -nc) If yes, tune the combehind option and the attribute cache mount options. After each action, try it and look for improvement.]



High number of commits and/or a high number of attribute requests

If nfsstat -nc shows a high number of commits, consider tuning the combehind setting. This is a mount option.

If nfsstat -nc shows a high number of attribute requests, consider tuning the attribute cache. This is a mount option.

Also, if one application is unfairly dominating in a constrained environment, look at tuning the IOpacing mount options, both minpout and maxpout.
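For example, the commit-behind and attribute cache tuning just described are both expressed as mount options (the timeout value, host, and path are assumptions):

# Mount with commit-behind and a 60-second attribute cache timeout
mount -o combehind,actimeo=60,vers=3,proto=tcp server1:/export/data /mnt/data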

Special case when using UDP (server and client)

For systems using UDP protocol, if netstat -p udp at receiving side shows buffer overruns, either:

- Tune the NFS system as previously discussed (increase socket size or tune for constrained CPU). Also try setting the UDP buffer size via setsockopt() (using nfso), or

- Possibly increase the maximum number of nfsd threads using nfso.

Use nfsstat -cr on the client to see the RPC statistics. NFS/RPC handles error detection and retransmission. If retrans and badxid are both high, there are delays (or acknowledgement discards) either in the network or at a congested server. Investigate and fix the problem (using the tools covered in the networking unit of this course).

If retrans is high but badxid is low, this indicates that requests are being discarded either in the network or at the server. Investigate and fix the problem (using the tools covered in the networking unit of this course).


Figure 9-14. Checkpoint AN512.0

Notes:

Checkpoint

1. These are the steps for a methodological approach to performance analysis. Put them in the correct order:
   __ Identifying a performance bottleneck
   __ Measuring the current performance of the server
   __ Changing the component which is causing the bottleneck
   __ Understanding the factors which can affect performance
   __ Measuring the new performance of the server to check for improvement

2. What are the distinct areas or subsystems to analyze for performance?


Figure 9-15. Exercise 9: Summary exercise AN512.0

Notes:

Introduction

Use all of the tools and knowledge learned in this course to find and fix the performance issues.


Exercise 9: Summary exercise

• Use PerfPMR reports to determine symptoms of performance issues

• Recommend tuning actions


Figure 9-16. Unit summary AN512.0

Notes:

Unit summary

• A system could be CPU bound if all of the following are true:
  – Processors are nearing 100% busy
  – Many jobs are waiting for a CPU in the run queue, and performance has degraded

• A system could be memory bound if it has:
  – High memory occupancy, high paging space activity, or high file paging activity

• A system could be disk bound if it has:
  – At least one disk busy and unable to fulfill other requests
  – Processes blocked and waiting for I/O operations to complete

• A system could be network I/O bound if it has:
  – The bandwidth of at least one network adapter totally (or almost totally) used
  – Run out of buffers or memory, or is configured incorrectly


Appendix A. Checkpoint solutions

Unit 1 - Performance analysis and tuning overview

Checkpoint solutions (1 of 2)

1. Use these terms with the following statements: benchmarks, metrics, baseline, performance goals, throughput, response time
   a. Performance is dependent on a combination of throughput and response time.
   b. Expectations can be used as the basis for performance goals.
   c. These are standardized tests used for evaluation. benchmarks
   d. You need to know this to be able to tell if your system is performing normally. baseline
   e. These are collected by analysis tools. metrics


Unit 1 - Performance analysis and tuning overview (cont.)

Checkpoint solutions (2 of 2)

2. The four components of system performance are:
   – CPU
   – Memory
   – I/O
   – Network

3. After tuning a resource or system parameter and monitoring the outcome, what is the next step in the tuning process? Determine if the performance goal(s) have been met.

4. The six tuning options commands are:
   – schedo
   – vmo
   – ioo
   – lvmo
   – no
   – nfso


Unit 2 - Data collection

Checkpoint solutions

1. What is the difference between a functional problem and a performance problem? A functional problem is when an application, hardware, or network is not behaving correctly. A performance problem is when the function is working, but the speed at which it performs is slow.

2. What is the name of the supported tool used to collect reports with a wide variety of performance data? PerfPMR

3. True/False You can individually run the scripts that perfpmr.sh calls.

4. True/False You can dynamically change the topas and nmon displays.


Unit 3 - Monitoring, analyzing, and tuning CPU usage

Checkpoint solutions

1. What is the difference between a process and a thread? A process is an activity within the system that is started by a command, shell program, or another process. A thread is what is dispatched to a CPU and is part of a process. A process can have one or more threads.

2. The default scheduling policy is called: SCHED_OTHER

3. The default scheduling policy applies to fixed or non-fixed priorities? non-fixed

4. Priority numbers range from 0 to 255.

5. True/False The higher the priority number, the more favored the thread will be for scheduling.

6. List at least two tools to monitor CPU usage:
   – vmstat, sar, topas, nmon

7. List at least two tools to determine what processes are using the CPUs:
   – ps, tprof, topas, nmon


Unit 4 - Virtual memory performance monitoring and tuning


Checkpoint solutions (1 of 2)

1. What are the three virtual memory segment types? persistent, client, and working

2. What type of segments are paged out to paging space? working

3. What are the two classifications of memory (for the purpose of choosing which pages to steal)? computational memory and non-computational (file) memory

4. What is the name of the kernel process that implements the page replacement algorithm? lrud


Unit 4 - Virtual memory performance monitoring and tuning (cont.)

Checkpoint solutions (2 of 2)

5. List the vmo parameter that matches the description:
   a. Specifies the minimum number of frames on the free list when the VMM starts to steal pages to replenish the free list: minfree
   b. Specifies the number of frames on the free list at which page stealing stops: maxfree
   c. Specifies the point below which the page stealer will steal file or computational pages regardless of repaging rates: minperm%
   d. Specifies whether or not to consider repage rates when deciding what type of page to steal: lru_file_repage


Unit 5 - Physical and logical volume performance


Checkpoint solutions

1. True/False When you see two hdisks on your system, you know they represent two separate physical disks.

2. List two commands that will provide real time disk I/O statistics.

   iostat, sar -d, topas or nmon

3. Identify and define the default mirroring scheduling policy. Parallel policy - sends read requests to the least busy copy and write requests to all copies concurrently.

4. What tools allow you to observe the time the physical disks are active in relation to their average transfer rates by monitoring system input/output device loads? iostat and sar


Unit 6 - File system performance monitoring and tuning


Checkpoint solutions (1 of 3)

1. True/False File fragmentation can result in a sequential read pattern of many small reads with seeks between them.

2. True/False When measuring file system performance, I/O subsystems should not be shared.

3. Two commands to measure read throughput are: dd and time.

4. The fileplace command can be used to determine if there is fragmentation.


Unit 6 - File system performance monitoring and tuning (cont.)


Checkpoint solutions (2 of 3)

5. What tunable functions exist to flush out modified file pages, based on a threshold of the number of dirty pages in memory?
   Sequential write-behind
   Random write-behind

6. What is the difference between JFS and JFS2 random write-behind? The threshold for random writes in JFS is simply the number of random pages. In JFS2, in addition to using the number of random writes as a threshold, it has a definition of what is considered a random write based upon the separation between the writes.


Unit 6 - File system performance monitoring and tuning (cont.)


Checkpoint solutions (3 of 3)

7. List factors that may impact performance when files are fragmented:

   Sequential access is no longer sequential
   Random access is affected (by having to access more widely dispersed data)
   Access time is dominated by longer seek times

8. What commands can be used to determine if there is a file system performance problem?
   iostat
   filemon

9. What is the relationship between file system buffers and the VMM I/O queue? Read/write requests will be queued on the VMM I/O queue once the system runs out of file system buffers.


Unit 7 - Network performance


Checkpoint solutions (1 of 3)

1. Interactive users are more concerned with measurements of response time, while users of batch data transfers are more concerned with measurements of throughput.

2. True/False The thewall maximum amount of network pinned memory can be increased in AIX 6 only by increasing the amount of real memory.

3. When sending a single TCP packet, an acknowledgement can, by default, be delayed as long as 200 milliseconds.


Unit 7 - Network performance (cont.)


Checkpoint solutions (2 of 3)

4. True/False Increasing the tcp_recvspace at the receiving host will always increase the effective window size for the connections. If the tcp_sendspace at the transmitting host is smaller than the tcp_recvspace at the receiving host, it will become the controlling factor. Both ends would need to be increased.

5. What network option must be enabled to allow window sizes greater than 64 KB? rfc1323

6. List two ways in which Nagle's Algorithm can be disabled:
   – Specify tcp_nodelay either from the application (setsockopt) or as an Interface Specific Network Option
   – Specify tcp_nagle_limit=1 as a network option


Unit 7 - Network performance (cont.)


Checkpoint solutions (3 of 3)

7. If you saw a large count for ipintrq in the netstat report, which actions would help reduce the overflows?
   a) Increase memory and CPU capacity at the receiving host
   b) Increase ipqmaxlen and decrease ipfragttl
   c) Decrease ipqmaxlen and increase ipfragttl
   d) Eliminate the cause of delayed and dropped fragments.
   Answer: b and d

8. A high percentage of collisions in an Ethernet full duplex switch environment is an indication of: Either a defective adapter or switch port, or a duplex mode configuration mismatch between the adapter and switch port.


Unit 8 - NFS performance


Checkpoint solutions (1 of 2)

1. True/False A large number of concurrent NFS clients can overload an NFS server.

2. The biod daemons are the block input/output daemons and are required in order to perform remote I/O requests at an NFS client.

3. On clients and servers where there is heavy file locking activity, the rpc.lockd daemon may become a bottleneck (for NFS V2 and NFS V3).


Unit 8 - NFS performance (cont.)


Checkpoint solutions (2 of 2)

4. The nfsstat -m command can be used to look at per-mount statistics at the NFS client.

5. The netpmon utility can identify which NFS clients present the greatest workload at the NFS server.

6. If the NFS client has overcommitted memory, the combehind mount option can be used to improve NFS I/O efficiency, and the rbr mount option can be used to release file cache memory once the application receives the data.


Unit 9 - Performance management methodology


Checkpoint solutions

1. These are the steps for a methodological approach to performance analysis. Put them in the correct order:
   1. Understanding the factors which can affect performance
   2. Measuring the current performance of the server
   3. Identifying a performance bottleneck
   4. Changing the component which is causing the bottleneck
   5. Measuring the new performance of the server to check for improvement

2. What are the distinct areas or subsystems to analyze for performance? CPU, memory, disks, file systems, network, NFS
