resetting your pain threshold andy ward – senior principal consultant december 2015
TRANSCRIPT
Resetting Your Pain Threshold
Andy Ward – Senior Principal Consultant
December 2015
2 © 2015 CA. ALL RIGHTS RESERVED.
Monitoring Overhead WLM and Its Role System Address Spaces – keeping CPU down DDF Log Activity Locking and Latching zIIPs IO Suspension Buffer Pools What we’re not going to cover
– How to report on all this information in individual monitors
3 © 2015 CA. ALL RIGHTS RESERVED.
Conventions Used In This Presentation
There are a number of vendors of DB2 performance monitors– CA, IBM, BMC, ASG etc.– To keep things simple whenever IFCID metrics are specified I have used
the IBM IFCID field name These can be found in hlq.DSNIVPD(DSNWMSGS) If vendors use other field-names (as CA do) then it is likely they will provide
the IBM equivalent (if available) too (as CA do)
To avoid confusion, especially in calculations, I have listed the IFCID field name in underlined italic text in all places
A number of metrics in this presentation are relevant to monitoring over a set time interval – to be decided by yourself– In some cases your statistics interval or perhaps an aggregation of these
I’ve done my very best to ensure calculations are correct
4 © 2015 CA. ALL RIGHTS RESERVED.
The Missing Bullet Holes
Abraham Wald– A statistician in World War 2– Tasked with finding out where best to put the armour on planes
These are the numbers he had to work with for returning planes
His conclusion?– Put the armour on the engines
– Sometimes it’s where you DON’T look rather than where you do look
Sometimes the answer is where you are not looking…
Section of Plane Bullet Holes Per Square Foot
Engine 1.11
Fuselage 1.73
Fuel System 1.55
Rest of Plane 1.8
Monitoring Overhead
6 © 2015 CA. ALL RIGHTS RESERVED.
Monitoring Overhead
All DB2 performance monitors generate overhead
This can be realised in– The monitors STCs– DB2 STCs– Your applications
Small changes to DB2 monitor parameters can have big effects– Certain IFCIDs result in big overheads, there is a reason you can’t see
everything out of the box.
Ensure temporary monitoring requirements remain temporary– And qualify them as much as possible
7 © 2015 CA. ALL RIGHTS RESERVED.
Monitoring Your Monitoring
No Monitor-ing
IDB2 Only (No -805)
TMON (LBP Reported)
IDB2 Only (O-riginal -805)
Over-head
0 0.0538 0.071 0.5993
5.00%15.00%25.00%35.00%45.00%55.00%
Overhead
Perc
enta
ge O
verh
ead
Current Mon.(Customer
The damage the wrong IFCID can do…
8 © 2015 CA. ALL RIGHTS RESERVED.
Using Your Monitor
If your monitor allows, consider setting up your own DB2 Health report– Online (preferably) and/or batch (good for regular execution to view
trends)
Consider outputting to a comma separated file (if available) for easy export into Excel or offload data to DB2 tables
Incorporate the metrics discussed in this PPT– There is also a very good Redbook ‘Subsystem and Transaction
Monitoring and Tuning with DB2 11 for z/OS’ Despite the name - it’s not only for DB2 11 shops
When issues occur this information may well point you to the area of concern quickly
Workload Manager
10 © 2015 CA. ALL RIGHTS RESERVED.
Workload Manager
Don’t need to be experts but may need enough knowledge to make a case
Sometimes we need to monitor outside DB2 to ensure things are running well inside– Look for WLM resource group capping delays (your z/OS monitor
should show you these) Capping of workloads entering DB2 should be avoided
– Potential for longer holding of locks if class is suspended
Understand Service Classification Goals– Average response time goal– Percentile response time goal– Velocity
11 © 2015 CA. ALL RIGHTS RESERVED.
Workload Manager…cont'd
The ‘Performance Index’– A PI of 1 means the goal is being exactly achieved – <1 performing better than goal requirements– >1 performing worse than goal requirements– Good measure but relies on service classification goals being correctly
set– In a Sysplex there is a local PI to indicate if a particular LPAR is suffering– DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL (screenshot follows)
DB2 10 and greater contains WLM details PI is multiplied by 100 in this output
Periods– Each Service Class can have multiple periods
Work moves down the periods once it has consumed n service units– WLM importance level is likely to drop through these
1 to 5 (discretionary workload is 6)
12 © 2015 CA. ALL RIGHTS RESERVED.
DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL
DSNV401I !S10A DISPLAY THREAD REPORT FOLLOWS -
DSNV402I !S10A ACTIVE THREADS -
NAME ST A REQ ID AUTHID PLAN ASID TOKEN
DB2CALL T 11 PTX#S170 ACFSTCID PSAA1700 00D7 154
V437-WORKSTATION=DB2CALL, USERID=ACFSTCID,
APPLICATION NAME=PTX#S170
V482-WLM-INFO=CLSONL:1:5:95
DB2CALL T * 7 WARAN01 WARAN01 DB2C1800 002B 155
V437-WORKSTATION=DB2CALL, USERID=WARAN01,
APPLICATION NAME=WARAN01
V441-ACCOUNTING=1
V482-WLM-INFO=*:1077952576:*:1077952576
Service Class:Period:Importance:Performance Index
13 © 2015 CA. ALL RIGHTS RESERVED.
WLM Best Practices
Ensure DB2 Address spaces are always running at a higher importance than the work entering them
Should only run 25-35 concurrent service class periods on an LPAR
Understand your threads that are regularly exceeding their PI– DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL– Consider moving them to another class or perhaps altering the existing
class
Ensure there is an entry for SUBSYS=DDF in your WLM policy linked to, at the least, a catch-all Service Class
Assess DB2 Service Goals at least annually– If a PI is regulary seen below 1 then perhaps response time goals need
to be tightened
System Address Spaces
15 © 2015 CA. ALL RIGHTS RESERVED.
System Address Space CPU Time
Sometimes zIIP processing is ignored because it is not perceived to have a cost– Important to take into account across releases – CPU may have been
moved here– But it is still processing– Especially important if you are comparing across releases
Still should be accounted for tuning purposes too
Understanding CPU time– Class 1 (in-application)
From getting a thread to end of execution– Class 2 (in-DB2)
CPU related to SQL execution– Class 3
Wait times (around 20 buckets) The dreaded ‘not-accounted’
16 © 2015 CA. ALL RIGHTS RESERVED.
System Address Space CPU Time…cont’d
Some rules of thumb– TCB times should be low compared to MSTR and DBM1 SRB times– IRLM SRB time should be relatively low compared to MSTR and DBM1
SRB time
MSTR – Reduce TCB
Bigger Logs Not always achievable under certain processing but…
– Try and keep 6 hours of log data across your active logs
– Reduce SRB Investigate REALSTORAGE_MANAGEMENT zPARM
– AUTO by default
– Consider ON for constrained development environments
17 © 2015 CA. ALL RIGHTS RESERVED.
System Address Space CPU Time…cont’d
DBM1– Reduce TCB
Monitor dataset open and close processing Should be less than 0.1 – 1 opens per second during normal processing
IFCID 0002 QBSTDSO shows cumulative dataset opens since start of stats interval
IRLM– Reduce SRB
P-lock negotiation for pagesets constantly moving in and out of GBP dependency can add to SRB time
– Look at pseudo close controlling zPARMS PCLOSEN & PCLOSET (consider increasing these if GBP dependency thrashing is recognised)
Especially if you are seeing synchronous IO increase
– Also look at CLOSE settings
If the object Is known to be GBP dependent CLOSE NO may be best
DDF
23 © 2015 CA. ALL RIGHTS RESERVED.
DDF
PKGREL_COMMIT = YES (DB2 11)– A PACKAGE bound RELEASE(DEALLOCATE) will be implicitly released at
COMMIT/ROLLBACK if BIND/REBIND or OLS is queued that effects it
MAXDBAT– -DISPLAY DDF DETAIL
Look at QUEDBAT
– Cumulative counter of queued DBATs since DB2 was recycled
Look at CONQUED
– Currently queued DBATs (also shown in IFCID 0001 QDSTQDBT)
If either of these are high consider increasing zParm MAXDBAT
– Carefully consider the storage implications of doing this though
WLMHEALTH
– Shows how WLM is treating the address space
24 © 2015 CA. ALL RIGHTS RESERVED.
-DISPLAY DDF DETAIL
-D1B1 DIS DDF DET
DSNL080I -D1B1 DSNLTDDF DISPLAY DDF REPORT FOLLOWS:
DSNL081I STATUS=STARTD
DSNL082I LOCATION LUNAME GENERICLU
DSNL083I DB1B
USIBMSC.SCPD1B1 -NONE
DSNL084I TCPPORT=39400 SECPORT=39401 RESPORT=39402 IPNAME=-NONE
DSNL085I IPADDR=::9.12.6.70
DSNL086I SQL DOMAIN=wtsc63.itso.ibm.com
DSNL086I RESYNC DOMAIN=wtsc63.itso.ibm.com
DSNL089I MEMBER IPADDR=::9.12.6.70
DSNL090I DT=I CONDBAT= 10000 MDBAT= 200
DSNL091I MCONQN = 200 MCONQW = 120
DSNL092I ADBAT = 0 QUEDBAT = 0 INADBAT = 0 CONQUED = 0
DSNL093I DSCDBAT= 0 INACONN = 0
DSNL094I WLMHEALTH=100 CLSDCONQN=
…
Log Activity
26 © 2015 CA. ALL RIGHTS RESERVED.
Log Activity
In DB2 11 Log Buffers use 1MB page frames if they are available– Consider using these (LFAREA setting)– All buffers reside in HCSA
Log Buffers can be a bottleneck - writing– Remember if they are unavailable all INSERT, UPDATE and DELETE
processing is suspended– In DB210 and above OUTBUFF specification is going to be allocated and
fixed at DB2 startup – correct sizing is important– Calculate Log data volume at peak times
IFCID 0001
– QJSTBFFL (Log CI’s created) * 4K/monitoring_interval or stats_interval
– If QJSTWTB (Log buffers unavailable) is > 0 at non-peak times increase OUTBUFF
27 © 2015 CA. ALL RIGHTS RESERVED.
Log Activity
Log writes and log prefetch are zIIP eligible in DB2 11– zIIP queuing may potentially cause issues
A little more on this later
I’m sure many of you have already implemented these, but to reduce log activity ensure– Compression is in use, especially for large row lengths
And compression dictionaries are current– RRF is implemented for objects that contain VARCHARs
Especially those where VARCHARs are not at the end of the table definition
Locking and Latching
29 © 2015 CA. ALL RIGHTS RESERVED.
Locking and Latching
Making the most of lock avoidance– CLSN, PUNC bits, Datasharing page latches – Are unlock requests/commits high for an interval (>5)?
If so lock avoidance may not be working well Lack of lock avoidance can also result in IRLM latch contention Lots of pointer records? - Reorg
– DB211 - PCTFREE_UPD (-1 learns the best value)
Pseudo deleted index rows– Count in SYSIBM.SYSINDEXSPACESTATS – REORGPSUEDODELETES
Non-data sharing this should be no more than 10% of total entries Data sharing no more than 5% of total entries
– A S-Lock will be required if CLSN processing fails
To clean-up - Reorg or if using DB211 an automated process is available
30 © 2015 CA. ALL RIGHTS RESERVED.
Locking and Latching…cont'd
Long URs can result in lock avoidance not being utilised– Consider using the following ZPARMs
URCHKTH – number of checkpoint cycles before COMMIT
– DB2 10> default is 5
URLGWTH – number of log writes between COMMITs
– DB2 10> default is 10K
IRLM latch contention is considered high if IRLM suspensions are approx. 5% of the total IRLM lock requests– IRLM latch contention rate
IFCID 0002 QTXASLAT / (QTXALOCK + QTXAUNLK + QTXAQRY + QTXACHNG) * 100
31 © 2015 CA. ALL RIGHTS RESERVED.
Data Sharing Contentions
Global contention rate (%) over time period– Next slide has calculation
False contention (IFCID 0002 – QTGSFLMG)– This is a per member rate
Assuming OA12164 and PK85159 are on (old APARs)– Should be less than 3% of total XES IRLM requests
Calculation shown on the next slide
XES contention (IFCID 0002 - QTGSSGLO)– XES only has 2 locking modes S and X– IRLM supports many more – the lock may be compatible– V8 introduced ‘locking protocol 2’ to improve XES/IRLM lock mapping
Real Contention (IFCID 0002 – QTGSIGLO)
32 © 2015 CA. ALL RIGHTS RESERVED.
Calculations
Global contention rate– (QTGSIGLO + QTGSSGLO + QTGSFLMG) / Total XES IRLM requests
Total XES IRLM requests– (QTGSLSLM + QTGSCSLM + QTGSUSLM) + (QTGSKIDS + QTGSFLSE) +
(QTGSIGLO + QTGSSGLO + QTGSFLMG)
33 © 2015 CA. ALL RIGHTS RESERVED.
Tuning Data Sharing Locks
TRACKMOD NO– Limit spacemap updates– Downside – incremental copies will run longer due to having to scan the
space to find changed pages
Reduce index leaf page splits – IFCID 359
INSERT intensive workload?– Consider MEMBER CLUSTER
INSERTS will be added where there is space rather than in CLUSTERing order
– Likely to result in reducing P-locks
– Although an increase in spacemap pages Each spacemap only covers 199 pages when using MEMBER CLUSTER
MAXROWS 1 LOCKSIZE PAGE if locking is intensive– Consider for small objects to mimic row level locking
zIIPs
35 © 2015 CA. ALL RIGHTS RESERVED.
zIIP
As already stated, including zIIP processing is important when assessing overall CPU resource consumption
SYS1.PARMLIB(IEAOPTxx) parameters– IIPHONORPRIORITY (YES/NO)
This setting specifies whether GCP can assist when zIIPs are flooded
– YES – workload can be passed back to GCP (preferred setting)
– NO – workload will queue until zIIP capacity is available
DB2 11 system workload will not be offloaded (i.e. prefetch)
– ZIIPAWMT The amount of time workload will queue before being returned to the GCP
– Minimum setting 1600 (1.6 milliseconds)
– Default
HIPERDISPATCH = YES – 3200 (3.2 milliseconds)
36 © 2015 CA. ALL RIGHTS RESERVED.
zIIP
WLM Activity Report (RMF)– APPL%IIPCP shows how much work was offloaded to zIIP and returned
to the GCP
Rule of thumb to minimise dispatching delays– Total zIIP utilization thresholds
1 zIIP < 30%, 2 zIIPs < 60%, 3 zIIPS < 70%– Don’t drive your zIIPs in the same way you drive GCP
In most shops the number of GCP engines far outweigh the number of zIIP engines
Queuing theory dictates that at relatively low utilization values response time increases (see graph on next slide)
37 © 2015 CA. ALL RIGHTS RESERVED.
Markov’s Equation
IO Suspensions
39 © 2015 CA. ALL RIGHTS RESERVED.
IO Suspensions
There is a need to monitor IO (read and write) elapsed time– Looking for suspensions– Generally there is more read IO – a good place to start
General high processor contention may manifest itself as an IO problem– Once IO completes the application is not immediately dispatched
What is a good response time?– As always it depends!– A baseline for your environment is required
Busier IO subsystems are of course going to have higher response times– Data residing on SSDs should certainly see a better response time than
traditional DASD
40 © 2015 CA. ALL RIGHTS RESERVED.
IO Suspensions…cont’d
Synchronous IO– Most monitors will show an average for synchronous IO
Perhaps broken down by read and write– Understand the issue, is it a large number of small suspensions or a
small number of large suspensions? - Use IFCID 0003 fields QWACARNE – shows the number of synchronous IO wait events QWACAWTI – shows the accumulated wait time
– Number of suspensions high Poor access path, poor page residency (covered in next section),
disorganised objects, index splits (IFCID 359)– Time per suspension large
Disk inefficiency
– Over utilization, ensure disk features are being used such as PAV on ESS
– Log synchronous read IO waits on a separate counter to assist with problem determination – IFCID 0003 QWACARLG Lots of ROLLBACKS?
Buffer Pools
42 © 2015 CA. ALL RIGHTS RESERVED.
Buffer Pools - Local
Look at your hit ratios Data Management threshold reached
– This should be kept at 0– Threads processing multiple rows in same page will result in 1
GETPAGE per row if this is hit
Page residency time– System Residency Time – look for a goal of > 60 seconds
Calculation on next slide
Reducing IO– Add 50MB to a BP in increments, once IO reduction drops off you've
likely hit the top of the curve
Check for pre-fetched pages getting stolen before they are used (IFCID 0002 QBSTSIO)
43 © 2015 CA. ALL RIGHTS RESERVED.
Page Residency Time Calculations
All IFCID fields from IFCID 0002
Total Pages Read– QBSTRIO + QBSTSIO + QBSTSPP + QBSTLPP + QBSTDPP
System Residency Time (seconds)– VPSIZE / Total Pages Read per second
Random Page Residency Time (seconds)– MAX (System Residency Time, (buffer pool size * (1-VPSEQT/100) / sync
pages (QBSTSIO) read per second)
Sequential Page Residency Time (seconds)– MIN (System Residency Time, (buffer pool size * (VPSEQT/100) / sync
pages (QBSTSIO) read per second)
44 © 2015 CA. ALL RIGHTS RESERVED.
Buffer Pool – Page Fixing
The pool must be TOTALLY backed by real storage– IFCID 0002 QBSTRPI + QBSTWPI to see if paging is becoming an issue
for page fixed pools– If page fixing still a requirement keep critical objects in pool, remove
others and downsize
Base which pools to fix on IO intensity if you are storage constrained– IO intensity - (BP pages read + BP pages written) / number of buffers– Also ensure these pools benefit from large frames if available
Keep an eye on Large Frame high water marks– D VIRTSTOR,LFAREA (APAR 0A31116)
AUTOSIZE– New DB211 parms to assist here VPSIZEMIN/VPSIZEMAX
45 © 2015 CA. ALL RIGHTS RESERVED.
Group Buffer Pools
Coupling Facility settings– ALLOWAUTOALT(YES)
Not designed for sudden floods but gradual workload increases No rebuild of the structure required
– Set MINSIZE equal to INITSIZE (assuming your initial sizing is good)– FULLTHRESHOLD – 80-90%– SIZE – 1.3 to 2 times INITSIZE
Synchronous Read Cross Invalidation ratio should be < 10%– QBGLXR / (QBGLXR + QBGLXD)
Failures due to lack of storage (IFCID 0002 QBGLWF)– Ideally these should be 0 but certainly aim for <1%
Group Buffer Pool Write Around (IFCID 0002 QBGLWA)– If this is > 0 it is indicative of the GBP becoming flooded
In Summary
47 © 2015 CA. ALL RIGHTS RESERVED.
We’ve covered a lot of ground
Hopefully you have found a few things to take back to your environments in this PPT– All the fields/calculations have been included for you to use at your
shops – Disclaimer – I have done my best to ensure the calculations are correct
Remember many of the metrics (individual and calculated) should be viewed over an interval or time period– What that is is down to you – perhaps a peak hour?
Make use of your monitors to become more proactive– But keep an eye on monitoring overhead, there is a happy medium
between collecting too much and too little information
Remember the missing bullet holes!