resetting your pain threshold andy ward – senior principal consultant december 2015

Resetting Your Pain Threshold

Andy Ward – Senior Principal Consultant

December 2015

2 © 2015 CA. ALL RIGHTS RESERVED.

Monitoring Overhead WLM and Its Role System Address Spaces – keeping CPU down DDF Log Activity Locking and Latching zIIPs IO Suspension Buffer Pools What we’re not going to cover

– How to report on all this information in individual monitors


Conventions Used In This Presentation

There are a number of vendors of DB2 performance monitors– CA, IBM, BMC, ASG etc.– To keep things simple whenever IFCID metrics are specified I have used

the IBM IFCID field name These can be found in hlq.DSNIVPD(DSNWMSGS) If vendors use other field-names (as CA do) then it is likely they will provide

the IBM equivalent (if available) too (as CA do)

To avoid confusion, especially in calculations, I have listed the IFCID field name in underlined italic text in all places

A number of metrics in this presentation are relevant to monitoring over a set time interval – to be decided by yourself– In some cases your statistics interval or perhaps an aggregation of these

I’ve done my very best to ensure calculations are correct


The Missing Bullet Holes

Abraham Wald– A statistician in World War 2– Tasked with finding out where best to put the armour on planes

These are the numbers he had to work with for returning planes

His conclusion?– Put the armour on the engines

– Sometimes it’s where you DON’T look rather than where you do look

Sometimes the answer is where you are not looking…

Section of Plane Bullet Holes Per Square Foot

Engine 1.11

Fuselage 1.73

Fuel System 1.55

Rest of Plane 1.8

Monitoring Overhead


Monitoring Overhead

All DB2 performance monitors generate overhead

This can be realised in– The monitors STCs– DB2 STCs– Your applications

Small changes to DB2 monitor parameters can have big effects– Certain IFCIDs result in big overheads, there is a reason you can’t see

everything out of the box.

Ensure temporary monitoring requirements remain temporary– And qualify them as much as possible


Monitoring Your Monitoring

No Monitor-ing

IDB2 Only (No -805)

TMON (LBP Reported)

IDB2 Only (O-riginal -805)

Over-head

0 0.0538 0.071 0.5993

5.00%15.00%25.00%35.00%45.00%55.00%

Overhead

Perc

enta

ge O

verh

ead

Current Mon.(Customer

The damage the wrong IFCID can do…


Using Your Monitor

If your monitor allows, consider setting up your own DB2 Health report– Online (preferably) and/or batch (good for regular execution to view

trends)

Consider outputting to a comma separated file (if available) for easy export into Excel or offload data to DB2 tables

Incorporate the metrics discussed in this PPT– There is also a very good Redbook ‘Subsystem and Transaction

Monitoring and Tuning with DB2 11 for z/OS’ Despite the name - it’s not only for DB2 11 shops

When issues occur this information may well point you to the area of concern quickly

Workload Manager


Workload Manager

Don’t need to be experts but may need enough knowledge to make a case

Sometimes we need to monitor outside DB2 to ensure things are running well inside– Look for WLM resource group capping delays (your z/OS monitor

should show you these) Capping of workloads entering DB2 should be avoided

– Potential for longer holding of locks if class is suspended

Understand Service Classification Goals– Average response time goal– Percentile response time goal– Velocity


Workload Manager…cont'd

The ‘Performance Index’– A PI of 1 means the goal is being exactly achieved – <1 performing better than goal requirements– >1 performing worse than goal requirements– Good measure but relies on service classification goals being correctly

set– In a Sysplex there is a local PI to indicate if a particular LPAR is suffering– DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL (screenshot follows)

DB2 10 and greater contains WLM details PI is multiplied by 100 in this output

Periods– Each Service Class can have multiple periods

Work moves down the periods once it has consumed n service units– WLM importance level is likely to drop through these

1 to 5 (discretionary workload is 6)


DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL

DSNV401I !S10A DISPLAY THREAD REPORT FOLLOWS -

DSNV402I !S10A ACTIVE THREADS -

NAME ST A REQ ID AUTHID PLAN ASID TOKEN

DB2CALL T 11 PTX#S170 ACFSTCID PSAA1700 00D7 154

V437-WORKSTATION=DB2CALL, USERID=ACFSTCID,

APPLICATION NAME=PTX#S170

V482-WLM-INFO=CLSONL:1:5:95

DB2CALL T * 7 WARAN01 WARAN01 DB2C1800 002B 155

V437-WORKSTATION=DB2CALL, USERID=WARAN01,

APPLICATION NAME=WARAN01

V441-ACCOUNTING=1

V482-WLM-INFO=*:1077952576:*:1077952576

Service Class:Period:Importance:Performance Index


WLM Best Practices

Ensure DB2 Address spaces are always running at a higher importance than the work entering them

Should only run 25-35 concurrent service class periods on an LPAR

Understand your threads that are regularly exceeding their PI– DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL– Consider moving them to another class or perhaps altering the existing

class

Ensure there is an entry for SUBSYS=DDF in your WLM policy linked to, at the least, a catch-all Service Class

Assess DB2 Service Goals at least annually– If a PI is regulary seen below 1 then perhaps response time goals need

to be tightened

System Address Spaces


System Address Space CPU Time

Sometimes zIIP processing is ignored because it is not perceived to have a cost– Important to take into account across releases – CPU may have been

moved here– But it is still processing– Especially important if you are comparing across releases

Still should be accounted for tuning purposes too

Understanding CPU time– Class 1 (in-application)

From getting a thread to end of execution– Class 2 (in-DB2)

CPU related to SQL execution– Class 3

Wait times (around 20 buckets) The dreaded ‘not-accounted’


System Address Space CPU Time…cont’d

Some rules of thumb– TCB times should be low compared to MSTR and DBM1 SRB times– IRLM SRB time should be relatively low compared to MSTR and DBM1

SRB time

MSTR – Reduce TCB

Bigger Logs Not always achievable under certain processing but…

– Try and keep 6 hours of log data across your active logs

– Reduce SRB Investigate REALSTORAGE_MANAGEMENT zPARM

– AUTO by default

– Consider ON for constrained development environments


System Address Space CPU Time…cont’d

DBM1– Reduce TCB

Monitor dataset open and close processing Should be less than 0.1 – 1 opens per second during normal processing

IFCID 0002 QBSTDSO shows cumulative dataset opens since start of stats interval

IRLM– Reduce SRB

P-lock negotiation for pagesets constantly moving in and out of GBP dependency can add to SRB time

– Look at pseudo close controlling zPARMS PCLOSEN & PCLOSET (consider increasing these if GBP dependency thrashing is recognised)

Especially if you are seeing synchronous IO increase

– Also look at CLOSE settings

If the object Is known to be GBP dependent CLOSE NO may be best


DDF

PKGREL_COMMIT = YES (DB2 11)– A PACKAGE bound RELEASE(DEALLOCATE) will be implicitly released at

COMMIT/ROLLBACK if BIND/REBIND or OLS is queued that effects it

MAXDBAT– -DISPLAY DDF DETAIL

Look at QUEDBAT

– Cumulative counter of queued DBATs since DB2 was recycled

Look at CONQUED

– Currently queued DBATs (also shown in IFCID 0001 QDSTQDBT)

If either of these are high consider increasing zParm MAXDBAT

– Carefully consider the storage implications of doing this though

WLMHEALTH

– Shows how WLM is treating the address space


-DISPLAY DDF DETAIL

-D1B1 DIS DDF DET

DSNL080I -D1B1 DSNLTDDF DISPLAY DDF REPORT FOLLOWS:

DSNL081I STATUS=STARTD

DSNL082I LOCATION LUNAME GENERICLU

DSNL083I DB1B

USIBMSC.SCPD1B1 -NONE

DSNL084I TCPPORT=39400 SECPORT=39401 RESPORT=39402 IPNAME=-NONE

DSNL085I IPADDR=::9.12.6.70

DSNL086I SQL DOMAIN=wtsc63.itso.ibm.com

DSNL086I RESYNC DOMAIN=wtsc63.itso.ibm.com

DSNL089I MEMBER IPADDR=::9.12.6.70

DSNL090I DT=I CONDBAT= 10000 MDBAT= 200

DSNL091I MCONQN = 200 MCONQW = 120

DSNL092I ADBAT = 0 QUEDBAT = 0 INADBAT = 0 CONQUED = 0

DSNL093I DSCDBAT= 0 INACONN = 0

DSNL094I WLMHEALTH=100 CLSDCONQN=

…

Log Activity


Log Activity

In DB2 11 Log Buffers use 1MB page frames if they are available– Consider using these (LFAREA setting)– All buffers reside in HCSA

Log Buffers can be a bottleneck - writing– Remember if they are unavailable all INSERT, UPDATE and DELETE

processing is suspended– In DB210 and above OUTBUFF specification is going to be allocated and

fixed at DB2 startup – correct sizing is important– Calculate Log data volume at peak times

IFCID 0001

– QJSTBFFL (Log CI’s created) * 4K/monitoring_interval or stats_interval

– If QJSTWTB (Log buffers unavailable) is > 0 at non-peak times increase OUTBUFF


Log Activity

Log writes and log prefetch are zIIP eligible in DB2 11– zIIP queuing may potentially cause issues

A little more on this later

I’m sure many of you have already implemented these, but to reduce log activity ensure– Compression is in use, especially for large row lengths

And compression dictionaries are current– RRF is implemented for objects that contain VARCHARs

Especially those where VARCHARs are not at the end of the table definition

Locking and Latching


Locking and Latching

Making the most of lock avoidance– CLSN, PUNC bits, Datasharing page latches – Are unlock requests/commits high for an interval (>5)?

If so lock avoidance may not be working well Lack of lock avoidance can also result in IRLM latch contention Lots of pointer records? - Reorg

– DB211 - PCTFREE_UPD (-1 learns the best value)

Pseudo deleted index rows– Count in SYSIBM.SYSINDEXSPACESTATS – REORGPSUEDODELETES

Non-data sharing this should be no more than 10% of total entries Data sharing no more than 5% of total entries

– A S-Lock will be required if CLSN processing fails

To clean-up - Reorg or if using DB211 an automated process is available


Locking and Latching…cont'd

Long URs can result in lock avoidance not being utilised– Consider using the following ZPARMs

URCHKTH – number of checkpoint cycles before COMMIT

– DB2 10> default is 5

URLGWTH – number of log writes between COMMITs

– DB2 10> default is 10K

IRLM latch contention is considered high if IRLM suspensions are approx. 5% of the total IRLM lock requests– IRLM latch contention rate

IFCID 0002 QTXASLAT / (QTXALOCK + QTXAUNLK + QTXAQRY + QTXACHNG) * 100


Data Sharing Contentions

Global contention rate (%) over time period– Next slide has calculation

False contention (IFCID 0002 – QTGSFLMG)– This is a per member rate

Assuming OA12164 and PK85159 are on (old APARs)– Should be less than 3% of total XES IRLM requests

Calculation shown on the next slide

XES contention (IFCID 0002 - QTGSSGLO)– XES only has 2 locking modes S and X– IRLM supports many more – the lock may be compatible– V8 introduced ‘locking protocol 2’ to improve XES/IRLM lock mapping

Real Contention (IFCID 0002 – QTGSIGLO)


Calculations

Global contention rate– (QTGSIGLO + QTGSSGLO + QTGSFLMG) / Total XES IRLM requests

Total XES IRLM requests– (QTGSLSLM + QTGSCSLM + QTGSUSLM) + (QTGSKIDS + QTGSFLSE) +

(QTGSIGLO + QTGSSGLO + QTGSFLMG)


Tuning Data Sharing Locks

TRACKMOD NO– Limit spacemap updates– Downside – incremental copies will run longer due to having to scan the

space to find changed pages

Reduce index leaf page splits – IFCID 359

INSERT intensive workload?– Consider MEMBER CLUSTER

INSERTS will be added where there is space rather than in CLUSTERing order

– Likely to result in reducing P-locks

– Although an increase in spacemap pages Each spacemap only covers 199 pages when using MEMBER CLUSTER

MAXROWS 1 LOCKSIZE PAGE if locking is intensive– Consider for small objects to mimic row level locking


zIIP

As already stated, including zIIP processing is important when assessing overall CPU resource consumption

SYS1.PARMLIB(IEAOPTxx) parameters– IIPHONORPRIORITY (YES/NO)

This setting specifies whether GCP can assist when zIIPs are flooded

– YES – workload can be passed back to GCP (preferred setting)

– NO – workload will queue until zIIP capacity is available

DB2 11 system workload will not be offloaded (i.e. prefetch)

– ZIIPAWMT The amount of time workload will queue before being returned to the GCP

– Minimum setting 1600 (1.6 milliseconds)

– Default

HIPERDISPATCH = YES – 3200 (3.2 milliseconds)


zIIP

WLM Activity Report (RMF)– APPL%IIPCP shows how much work was offloaded to zIIP and returned

to the GCP

Rule of thumb to minimise dispatching delays– Total zIIP utilization thresholds

1 zIIP < 30%, 2 zIIPs < 60%, 3 zIIPS < 70%– Don’t drive your zIIPs in the same way you drive GCP

In most shops the number of GCP engines far outweigh the number of zIIP engines

Queuing theory dictates that at relatively low utilization values response time increases (see graph on next slide)


Markov’s Equation

IO Suspensions


IO Suspensions

There is a need to monitor IO (read and write) elapsed time– Looking for suspensions– Generally there is more read IO – a good place to start

General high processor contention may manifest itself as an IO problem– Once IO completes the application is not immediately dispatched

What is a good response time?– As always it depends!– A baseline for your environment is required

Busier IO subsystems are of course going to have higher response times– Data residing on SSDs should certainly see a better response time than

traditional DASD


IO Suspensions…cont’d

Synchronous IO– Most monitors will show an average for synchronous IO

Perhaps broken down by read and write– Understand the issue, is it a large number of small suspensions or a

small number of large suspensions? - Use IFCID 0003 fields QWACARNE – shows the number of synchronous IO wait events QWACAWTI – shows the accumulated wait time

– Number of suspensions high Poor access path, poor page residency (covered in next section),

disorganised objects, index splits (IFCID 359)– Time per suspension large

Disk inefficiency

– Over utilization, ensure disk features are being used such as PAV on ESS

– Log synchronous read IO waits on a separate counter to assist with problem determination – IFCID 0003 QWACARLG Lots of ROLLBACKS?

Buffer Pools


Buffer Pools - Local

Look at your hit ratios Data Management threshold reached

– This should be kept at 0– Threads processing multiple rows in same page will result in 1

GETPAGE per row if this is hit

Page residency time– System Residency Time – look for a goal of > 60 seconds

Calculation on next slide

Reducing IO– Add 50MB to a BP in increments, once IO reduction drops off you've

likely hit the top of the curve

Check for pre-fetched pages getting stolen before they are used (IFCID 0002 QBSTSIO)


Page Residency Time Calculations

All IFCID fields from IFCID 0002

Total Pages Read– QBSTRIO + QBSTSIO + QBSTSPP + QBSTLPP + QBSTDPP

System Residency Time (seconds)– VPSIZE / Total Pages Read per second

Random Page Residency Time (seconds)– MAX (System Residency Time, (buffer pool size * (1-VPSEQT/100) / sync

pages (QBSTSIO) read per second)

Sequential Page Residency Time (seconds)– MIN (System Residency Time, (buffer pool size * (VPSEQT/100) / sync

pages (QBSTSIO) read per second)


Buffer Pool – Page Fixing

The pool must be TOTALLY backed by real storage– IFCID 0002 QBSTRPI + QBSTWPI to see if paging is becoming an issue

for page fixed pools– If page fixing still a requirement keep critical objects in pool, remove

others and downsize

Base which pools to fix on IO intensity if you are storage constrained– IO intensity - (BP pages read + BP pages written) / number of buffers– Also ensure these pools benefit from large frames if available

Keep an eye on Large Frame high water marks– D VIRTSTOR,LFAREA (APAR 0A31116)

AUTOSIZE– New DB211 parms to assist here VPSIZEMIN/VPSIZEMAX


Group Buffer Pools

Coupling Facility settings– ALLOWAUTOALT(YES)

Not designed for sudden floods but gradual workload increases No rebuild of the structure required

– Set MINSIZE equal to INITSIZE (assuming your initial sizing is good)– FULLTHRESHOLD – 80-90%– SIZE – 1.3 to 2 times INITSIZE

Synchronous Read Cross Invalidation ratio should be < 10%– QBGLXR / (QBGLXR + QBGLXD)

Failures due to lack of storage (IFCID 0002 QBGLWF)– Ideally these should be 0 but certainly aim for <1%

Group Buffer Pool Write Around (IFCID 0002 QBGLWA)– If this is > 0 it is indicative of the GBP becoming flooded

In Summary


We’ve covered a lot of ground

Hopefully you have found a few things to take back to your environments in this PPT– All the fields/calculations have been included for you to use at your

shops – Disclaimer – I have done my best to ensure the calculations are correct

Remember many of the metrics (individual and calculated) should be viewed over an interval or time period– What that is is down to you – perhaps a peak hour?

Make use of your monitors to become more proactive– But keep an eye on monitoring overhead, there is a happy medium

between collecting too much and too little information

Remember the missing bullet holes!

Senior Principal Consultant

[email protected]

Andy Ward

resetting your pain threshold andy ward – senior principal consultant december 2015

Documents

shops db2

db2 monitor parameters

db2 performance monitorsca

db2 environmentsthis

db2 table documentation

individual monitors

ifcid metrics

monitoring best practice