why did my job run so long? - cmg

32
Why did my job run so long? Speeding Performance by Understanding the Cause John Baker MVS Solutions [email protected]

Upload: others

Post on 26-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Why did my job run so long? - CMG

Why did my job run so long?

Speeding Performance by

Understanding the Cause

John Baker

MVS Solutions

[email protected]

Page 2: Why did my job run so long? - CMG

Agenda

• Where is my application spending its time?

– CPU time, I/O time, wait (queue) times

• What am I waiting for?

– Various flavors of queue time

– What can/should I do about delays?

• Real world comparison

– Stay tuned!

• Q/A

• Conclusions and wrap up

2

Page 3: Why did my job run so long? - CMG

Distribution of Elapsed Time

Elapsed time = CPU time + I/O time + wait times

3

• CPU time = TCB + SRB

• I/O = IOSQ + PEND + CON + DISC

• Wait (queue) times

– Initiator

– Allocation (ENQ contention)

– System services (HSM recall)

– CPU Delay

– LPAR dispatch

– …

Page 4: Why did my job run so long? - CMG

Sample Job A

• Elapsed time over 4 hours

• CPU time almost 1 hour

• I/O time under 10 minutes

= Focus on CPU time

4

JOB RUNTM CPUTM IOTIME

JOBA 4:23:53 0:48:21 0:09:12

Page 5: Why did my job run so long? - CMG

Reducing CPU time

• Recompile

– Many improvements in OS updates

• Tune application

– Application Performance Tools

• (e.g. Strobe, FreezeFrame)

– Identify CPU use by area of source code

– Make friends with your developer

5

Page 6: Why did my job run so long? - CMG

Sample Job B

• Elapsed time under 3 hours

• CPU time 20 minutes

• I/O time over 1.5 hours

= Focus on I/O time

6

JOB RUNTM CPUTM IOTIME

JOBB 2:41:30 0:21:20 1:37:44

Page 7: Why did my job run so long? - CMG

Reducing I/O time

• Identify patterns

– sequential vs random; read vs write

• Buffers

– For VSAM consider NSR vs LSR

– Give SORT memory – but not too much!

• Block size

– System-determined generally works well – but check!

– Half track for sequential; No smaller than 2K for random

• Compression

– zEDC looks very impressive!

• Include Storage Subsystem in your capacity planning

7

Page 8: Why did my job run so long? - CMG

Wait/Queue time

8

Page 9: Why did my job run so long? - CMG

0

5

10

15

20

25

30

0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%

Re

spo

nse

/Ela

pse

d T

ime

Percent Utilization

Time vs Utilization

Wait

Execution

High Utilization = High wait time

9

Elapsed time grows exponentially with

utilization. Increasing priority doesn’t make

the CPU any faster

• At high utilization levels, wait time is

much greater than service time

Page 10: Why did my job run so long? - CMG

Flavors of Queue (wait) Time

• Wait for “server”: initiator / CICS AOR / IMS MPR

• CPU delay (wait for logical CPU)

• I/O delay (iosq, pend, disconnect)

• Capping delay (LPAR capped vs actual delay)

• Resource Group maximum enforced

• Wait for LPAR (logical CPU) to be dispatched

– PR/SM weight

– Demand from other LPARs

– CPC/CEC capacity

10

Page 11: Why did my job run so long? - CMG

Initiator Queue

• SMF: R723CQDT

• TOTAL queue time (divide for average per job)

• Just start more inits?

• Not necessarily a good idea

• “Tuning to reduce the number of simultaneously active

address spaces to the proper number needed to support

a workload can reduce RNI and improve performance”

11 https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=

Page 12: Why did my job run so long? - CMG

Automated Initiators: Less is More

12

0

50

100

150

200

250

300

350

0 1 2 3 4 5 6 7 8 9

Acti

ve J

ob

s

Time (hours)

Benchmark: TM vs WLM concurrent Jobs

TM

WLM

TM jobs ahead

• Concurrency based on performance and utilization

Page 13: Why did my job run so long? - CMG

CPU Delay

• Wait for logical CPU

• SMF: R723CCDE

• Work is ready to run but is delayed access to CPU

• Related to Service Class / goal / importance

– Dispatching priority

• There is almost always some CPU delay

– Tolerance is subjective

– Are goals/SLA’s being met?

• Priorities are relative – overloading leads to thrashing

• Consider discretionary for MTTW 13

Page 14: Why did my job run so long? - CMG

0

10

20

30

40

50

60

70

80

90

100

Utilization vs CPU Delay

HIDLY

MDDLY

LODLY

UTIL

Utilization vs CPU delays

14

50% delay may be acceptable

At 100% busy, throughput degrades

significantly

Page 15: Why did my job run so long? - CMG

I/O Delay

• IOSQ:

– HyperPAV

• Pend:

– CMR = overloaded controller

– DB = volume contention (reserve?)

– Any remaining = likely channels

• Disconnect

– Random read misses

– Synchronous remote copy

15

Page 16: Why did my job run so long? - CMG

Revisit: sample Job B

• Disconnect time of 0:40:31 = 40% of total 1:37:44

• 40:31 (2431 seconds) divided by 9239646 I/O’s…

• = .263 ms average disconnect time

• Likely not unreasonable for random reads (consider SSD)

• Could also be replicated writes

• Become familiar with your typical application response times 16

JOB RUNTM CPUTM IOTIME SMF30AID SMF30AIW EXCPS

JOBB 2:41:30 0:21:20 1:37:44 0:40:31 0:04:40 9239646

Page 17: Why did my job run so long? - CMG

Capping Delay

• Possible when caps present

• SMF70NSW

– WLM caps the logical CPUs

– Delays LPAR dispatch

• SMF70NCA

– Work is actually delayed for CPU due to capping

• Consider TM automation

17

Page 18: Why did my job run so long? - CMG

0

200

400

600

800

1000

1200

0:0

0:0

0

0:1

5:0

0

0:3

0:0

0

0:4

5:0

0

1:0

0:0

0

1:1

5:0

0

1:1

5:0

0

1:3

0:0

0

1:4

5:0

0

2:0

0:0

0

2:1

5:0

0

2:3

0:0

0

2:4

5:0

0

3:0

0:0

0

3:1

5:0

0

3:3

0:0

0

3:4

5:0

0

4:0

0:0

0

4:1

5:0

0

4:3

0:0

0

4:4

5:0

0

5:0

0:0

0

5:1

5:0

0

5:3

0:0

0

5:4

5:0

0

6:0

0:0

0

6:1

5:0

0

6:3

0:0

0

6:4

5:0

0

7:0

0:0

0

7:1

5:0

0

7:3

0:0

0

7:4

5:0

0

8:0

0:0

0

8:0

0:0

0

MSU

/HR

MSU Demand vs R4HA

LPAR_C

LPAR_B

LPAR_A

CPCR4HA

CAP

Capping vs Delay

LPAR is capped (SMF70NSW)

Work is delayed SMF70NCA

Page 19: Why did my job run so long? - CMG

Capping can impact all Workloads

0

10

20

30

40

50

60

70

80

90

100

Ve

loci

ty

STC_H: Importance 1

Goal

Velocity

Machine CPU reaches 100% Capping

begins

Batch is not the only workload suffering. Even the most

critical workloads are unable to meet their goal

Page 20: Why did my job run so long? - CMG

Resource Group (max)

• Also a form of capping (same WLM algorithms)

• Pro: Useful to control “problem” applications

• Con: Static. Not flexible

• R723CCCA

– Resource Group maximum enforced

– Will override Service Class goals

20

Page 21: Why did my job run so long? - CMG

LPAR Dispatch Delay

• Ratio of logical processor busy to physical processor busy

• Not always as obvious but very common!

• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)

– Share, Aug. 2004

– https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077

• MXG: PLCPRDYQ

• Improved with Hiperdispatch and IRD

21

Page 22: Why did my job run so long? - CMG

LPAR Dispatch Delay

• More initiators and/or higher dispatching

priority will not resolve this problem 22

0

10

20

30

40

50

60

70

80

90

100

LPAR Dispatch Delay

INITS

CPUBSY

MVSBSY

40% delay

Page 23: Why did my job run so long? - CMG

What does this look like in the real world?

23

Let’s take a trip to the deli

Page 24: Why did my job run so long? - CMG

How many in the store at one time?

24

INITIATORS

Page 25: Why did my job run so long? - CMG

Who’s next in line?

25

Dispatching Priority (Service Class)

Page 26: Why did my job run so long? - CMG

How long til I can give my order?

26

Logical Processor (CP) Busy

Page 27: Why did my job run so long? - CMG

How long til I’m done!

27

Physical Processor (CP) Busy

Page 28: Why did my job run so long? - CMG

Forcing one step only stresses the next

28

Page 29: Why did my job run so long? - CMG

29

Page 30: Why did my job run so long? - CMG

Balance

30

Page 31: Why did my job run so long? - CMG

About MVS Solutions

• MVS Solutions Inc. – Installed in over 200 datacenters worldwide

– IBM Partner in Development

• ThruPut Manager – Automated Workload Balancing

– Automated Batch Prioritization

– Automated Capacity Management

31

Contact me: [email protected]

[email protected]

Join our Blog at www.thruputmanager.com

Page 32: Why did my job run so long? - CMG