why did my job run so long? - cmg

Why did my job run so long?

Speeding Performance by

Understanding the Cause

John Baker

MVS Solutions

[email protected]

mailto:[email protected]

Agenda

• Where is my application spending its time?

– CPU time, I/O time, wait (queue) times

• What am I waiting for?

– Various flavors of queue time

– What can/should I do about delays?

• Real world comparison

– Stay tuned!

• Q/A

• Conclusions and wrap up

2

Distribution of Elapsed Time

Elapsed time = CPU time + I/O time + wait times

3

• CPU time = TCB + SRB

• I/O = IOSQ + PEND + CON + DISC

• Wait (queue) times

– Initiator

– Allocation (ENQ contention)

– System services (HSM recall)

– CPU Delay

– LPAR dispatch

– …

Sample Job A

• Elapsed time over 4 hours

• CPU time almost 1 hour

• I/O time under 10 minutes

= Focus on CPU time

4

JOB RUNTM CPUTM IOTIME

JOBA 4:23:53 0:48:21 0:09:12

Reducing CPU time

• Recompile

– Many improvements in OS updates

• Tune application

– Application Performance Tools

• (e.g. Strobe, FreezeFrame)

– Identify CPU use by area of source code

– Make friends with your developer

5

Sample Job B

• Elapsed time under 3 hours

• CPU time 20 minutes

• I/O time over 1.5 hours

= Focus on I/O time

6

JOB RUNTM CPUTM IOTIME

JOBB 2:41:30 0:21:20 1:37:44

Reducing I/O time

• Identify patterns

– sequential vs random; read vs write

• Buffers

– For VSAM consider NSR vs LSR

– Give SORT memory – but not too much!

• Block size

– System-determined generally works well – but check!

– Half track for sequential; No smaller than 2K for random

• Compression

– zEDC looks very impressive!

• Include Storage Subsystem in your capacity planning

7

Wait/Queue time

8

0

5

10

15

20

25

30

0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%

Re

spo

nse

/Ela

pse

d T

ime

Percent Utilization

Time vs Utilization

Wait

Execution

High Utilization = High wait time

9

Elapsed time grows exponentially with

utilization. Increasing priority doesn’t make

the CPU any faster

• At high utilization levels, wait time is

much greater than service time

Flavors of Queue (wait) Time

• Wait for “server”: initiator / CICS AOR / IMS MPR

• CPU delay (wait for logical CPU)

• I/O delay (iosq, pend, disconnect)

• Capping delay (LPAR capped vs actual delay)

• Resource Group maximum enforced

• Wait for LPAR (logical CPU) to be dispatched

– PR/SM weight

– Demand from other LPARs

– CPC/CEC capacity

10

Initiator Queue

• SMF: R723CQDT

• TOTAL queue time (divide for average per job)

• Just start more inits?

• Not necessarily a good idea

• “Tuning to reduce the number of simultaneously active

address spaces to the proper number needed to support

a workload can reduce RNI and improve performance”

11 https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=

https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID



Automated Initiators: Less is More

12

0

50

100

150

200

250

300

350

0 1 2 3 4 5 6 7 8 9

Acti

ve J

ob

s

Time (hours)

Benchmark: TM vs WLM concurrent Jobs

TM

WLM

TM jobs ahead

• Concurrency based on performance and utilization

CPU Delay

• Wait for logical CPU

• SMF: R723CCDE

• Work is ready to run but is delayed access to CPU

• Related to Service Class / goal / importance

– Dispatching priority

• There is almost always some CPU delay

– Tolerance is subjective

– Are goals/SLA’s being met?

• Priorities are relative – overloading leads to thrashing

• Consider discretionary for MTTW 13

0

10

20

30

40

50

60

70

80

90

100

Utilization vs CPU Delay

HIDLY

MDDLY

LODLY

UTIL

Utilization vs CPU delays

14

50% delay may be acceptable

At 100% busy, throughput degrades

significantly

I/O Delay

• IOSQ:

– HyperPAV

• Pend:

– CMR = overloaded controller

– DB = volume contention (reserve?)

– Any remaining = likely channels

• Disconnect

– Random read misses

– Synchronous remote copy

15

Revisit: sample Job B

• Disconnect time of 0:40:31 = 40% of total 1:37:44

• 40:31 (2431 seconds) divided by 9239646 I/O’s…

• = .263 ms average disconnect time

• Likely not unreasonable for random reads (consider SSD)

• Could also be replicated writes

• Become familiar with your typical application response times 16

JOB RUNTM CPUTM IOTIME SMF30AID SMF30AIW EXCPS

JOBB 2:41:30 0:21:20 1:37:44 0:40:31 0:04:40 9239646

Capping Delay

• Possible when caps present

• SMF70NSW

– WLM caps the logical CPUs

– Delays LPAR dispatch

• SMF70NCA

– Work is actually delayed for CPU due to capping

• Consider TM automation

17

0

200

400

600

800

1000

1200

0:0

0:0

0

0:1

5:0

0

0:3

0:0

0

0:4

5:0

0

1:0

0:0

0

1:1

5:0

0

1:1

5:0

0

1:3

0:0

0

1:4

5:0

0

2:0

0:0

0

2:1

5:0

0

2:3

0:0

0

2:4

5:0

0

3:0

0:0

0

3:1

5:0

0

3:3

0:0

0

3:4

5:0

0

4:0

0:0

0

4:1

5:0

0

4:3

0:0

0

4:4

5:0

0

5:0

0:0

0

5:1

5:0

0

5:3

0:0

0

5:4

5:0

0

6:0

0:0

0

6:1

5:0

0

6:3

0:0

0

6:4

5:0

0

7:0

0:0

0

7:1

5:0

0

7:3

0:0

0

7:4

5:0

0

8:0

0:0

0

8:0

0:0

0

MSU

/HR

MSU Demand vs R4HA

LPAR_C

LPAR_B

LPAR_A

CPCR4HA

CAP

Capping vs Delay

LPAR is capped (SMF70NSW)

Work is delayed SMF70NCA

Capping can impact all Workloads

0

10

20

30

40

50

60

70

80

90

100

Ve

loci

ty

STC_H: Importance 1

Goal

Velocity

Machine CPU reaches 100% Capping

begins

Batch is not the only workload suffering. Even the most

critical workloads are unable to meet their goal

Resource Group (max)

• Also a form of capping (same WLM algorithms)

• Pro: Useful to control “problem” applications

• Con: Static. Not flexible

• R723CCCA

– Resource Group maximum enforced

– Will override Service Class goals

20

LPAR Dispatch Delay

• Ratio of logical processor busy to physical processor busy

• Not always as obvious but very common!

• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)

– Share, Aug. 2004

– https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077

• MXG: PLCPRDYQ

• Improved with Hiperdispatch and IRD

21

https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077




LPAR Dispatch Delay

• More initiators and/or higher dispatching

priority will not resolve this problem 22

0

10

20

30

40

50

60

70

80

90

100

LPAR Dispatch Delay

INITS

CPUBSY

MVSBSY

40% delay

What does this look like in the real world?

23

Let’s take a trip to the deli

How many in the store at one time?

24

INITIATORS

Who’s next in line?

25

Dispatching Priority (Service Class)

How long til I can give my order?

26

Logical Processor (CP) Busy

How long til I’m done!

27

Physical Processor (CP) Busy

Forcing one step only stresses the next

28

Balance

30

About MVS Solutions

• MVS Solutions Inc. – Installed in over 200 datacenters worldwide

– IBM Partner in Development

• ThruPut Manager – Automated Workload Balancing

– Automated Batch Prioritization

– Automated Capacity Management

31

Contact me: [email protected]

[email protected]

Join our Blog at www.thruputmanager.com



http://www.thruputmanager.com/

why did my job run so long? - cmg

Documents