
Fewer/Faster vs More/Slower Practical Considerations Scott Chapman Enterprise Performance Strategies [email protected]


Page 1: Fewer/Faster vs More/Slower Practical Considerations

Fewer/Faster vs More/Slower

Practical Considerations

Scott Chapman

Enterprise Performance Strategies

[email protected]

Page 2

Contact, Copyright, and Trademark Notices

Questions? Send email to Scott at [email protected], or visit our website at http://www.epstrategies.com or http://www.pivotor.com.

Copyright Notice: © Enterprise Performance Strategies, Inc. All rights reserved. No part of this material may be reproduced, distributed, stored in a retrieval system, transmitted, displayed, published or broadcast in any form or by any means, electronic, mechanical, photocopy, recording, or otherwise, without the prior written permission of Enterprise Performance Strategies. To obtain written permission please contact Enterprise Performance Strategies, Inc. Contact information can be obtained by visiting http://www.epstrategies.com.

Trademarks: Enterprise Performance Strategies, Inc. presentation materials contain trademarks and registered trademarks of several companies.

The following are trademarks of Enterprise Performance Strategies, Inc.: Health Check®, Reductions®, Pivotor®

The following are trademarks of the International Business Machines Corporation in the United States and/or other countries: IBM®, z/OS®, zSeries®, WebSphere®, CICS®, DB2®, S390®, WebSphere Application Server®, and many others.

Other trademarks and registered trademarks may exist in this presentation.

Page 3

Performance Workshops Available

During these workshops you will be analyzing your own data!

• Parallel Sysplex and z/OS Performance Tuning

(Web / Internet Based!)

• Instructors: Peter Enrico and Scott Chapman

• August 2016

• WLM Performance and Re-evaluating of Goals

• Instructors: Peter Enrico and Scott Chapman

• October 2016

• Essential z/OS Performance Tuning Workshop

• Instructors: Peter Enrico, Tom Beretvas, and Scott Chapman

• z/OS Capacity Planning and Performance Analysis

• Instructor: Ray Wicks

Page 4

Overview

•Processors, caches, c, and performance

•Pros/Cons for fewer/faster vs more/slower

•What to expect when changing processor speeds

•What choices might you have: example scenarios

• Interesting measurements

•Considerations for various workloads

Red = things I want you to pay particular attention to

Page 5

Clock cycles and effective capacity

• Ideally, you’d like to get real work done each clock cycle

•z Processor speeds are really fast

–z10 – 4.4 GHz
–z196 – 5.2 GHz
–zEC12 – 5.5 GHz
–z13 – 5.0 GHz

•So 1 ms to wait for an I/O = millions of clock cycles

Billions of cycles per second; 1 clock cycle = a fraction of a nanosecond
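The arithmetic behind that last bullet is easy to check. The clock speed below is the z13 figure from this slide; the 1 ms wait is the slide's example:

```python
# How many clock cycles pass while a ~5 GHz core waits 1 ms for an I/O.
CLOCK_HZ = 5.0e9        # z13-class core, ~5.0 GHz
IO_WAIT_S = 1.0e-3      # 1 ms I/O wait

cycles_lost = CLOCK_HZ * IO_WAIT_S
print(f"{cycles_lost:,.0f} cycles spent waiting")   # 5,000,000 cycles
```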

Page 6

How long is a cycle again?

•Just over 2 inches

–Light, in a vacuum

–Electrical signal in a circuit is much slower (40-70% of c)

–1 meter in fiber ~ 5 ns

•Need to make a round trip

•Signal paths aren’t as the mosquito flies

–7.7 miles of wire in a zEC12 chip, >13 miles in a z13

•Physical distance matters!
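The "just over 2 inches" figure can be checked directly. The 40-70%-of-c range is from the slide; the 50% used below is just an arbitrary midpoint for illustration:

```python
# Distance a signal covers in one clock cycle, per the slide's figures.
C_M_PER_S = 299_792_458      # speed of light in a vacuum, m/s
FREQ_HZ = 5.5e9              # zEC12-class clock, 5.5 GHz
M_TO_IN = 39.3701            # meters to inches

cycle_s = 1.0 / FREQ_HZ
light_in = C_M_PER_S * cycle_s * M_TO_IN     # light, in a vacuum
circuit_in = light_in * 0.5                  # assume signal at ~50% of c
print(f"light: {light_in:.2f} in/cycle, circuit: {circuit_in:.2f} in/cycle")
```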

Page 7

Data access hierarchy

• Register

• Memory

– L1 Cache

– L2 Cache

– L3 Cache

– L4 Cache

• Local

• Remote

– Real

• Storage Class Memory

• Disk

– Cache

– SSD

– Spinning

• Network

(The hierarchy above is ordered closest to the CPU core first)

Optimal performance & capacity utilization = keeping data as close to the processor as possible!

The farther the data is away from the processor, the more clock cycles will be spent accessing it.

Page 8

z13 Processor Chip Schematic

[Schematic: per-core L2 caches arranged around two shared on-chip L3 segments]

• Approximation based on IBM docs
• Cache sizes scaled relative to each other
• Physical location of L1 cache unclear
• L1 = 96K + 128K
• L2 = 2M + 2M
• L3 = 64M

Page 9

z13 Storage Control Chip Schematic

[Schematic: L4 controller surrounded by four L4 cache areas]

• Approximation based on IBM docs
• NIC directory embedded in the 4 L4 areas
• L4 controller schematic simplified
• L4 = 480M / node, 960M / drawer
• L4 NIC directory = 224M / node

Page 10

Data locality – L4 / Memory

[Diagram: many memory cards sitting beyond the L4 cache; not to scale]

Page 11

Cache utilization & performance

•Memory is far away from the processor core and relatively slow

•Effective use of processor cache is important to keeping the processor “fed”

–Shared engines = more systems competing for cache resources

•Cache effectiveness measurements are in the Hardware Instrumentation Services SMF 113 records

–Requires z/OS 1.8 + PTFs & z10 GA2

•Enable HIS and record the 113 records

–Required for effective capacity planning on upgrade

Page 12

Dynamic Address Translation

•DAT is performed using multiple tables that point to different ranges of storage

•DAT is not free!

•The result of DAT is cached in Translation Look-aside Buffers (TLBs)

•TLBs are in L1 cache and managed by the hardware

–Relatively small

•1MB & 2GB pages make TLBs more effective

–Larger pages = fewer pages = fewer TLB entries required

• 100GB = 50 2GB pages = 102,400 1MB pages = 26,214,400 4K pages
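The page-count arithmetic on this slide is easy to reproduce:

```python
# Pages needed to map a 100 GB region at each supported page size.
# Fewer, larger pages -> fewer TLB entries needed to cover the region.
GB = 1024 ** 3
REGION = 100 * GB

for name, size in [("2GB", 2 * GB), ("1MB", 1024 ** 2), ("4K", 4096)]:
    print(f"{name:>3} pages: {REGION // size:,}")
# 2GB pages: 50
# 1MB pages: 102,400
#  4K pages: 26,214,400
```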

Page 13

Estimated impact of TLB Misses

Page 14

HiperDispatch Terms

•Logical processors are classified as:

–High – the processor is essentially dedicated to the LPAR (100% share)
–Medium – share between 0% and 100%
–Low – not needed to satisfy the LPAR’s weight

•This processor classification is sometimes referred to as “vertical”, “polarity”, or “pool”

–E.g. Vertical High = VH = High Polarity = High Pool = HP

•Parked / Unparked

–Initially, VL processors are “parked”: work is not dispatched to them
–VL processors may become unparked (eligible for work) if there is demand and available capacity

Page 15

HiperDispatch off

[Diagram: without HiperDispatch, LPARs A, B, and C are dispatched across all chips and their L3/L4 caches]

Page 16

HiperDispatch on

[Diagram: with HiperDispatch, LPARs A, B, and C are each concentrated on particular chips, keeping cache working sets local]

Turn on HiperDispatch!

Page 17

HiperDispatch Details

•Cache effectiveness may increase when a unit of work is redispatched on the same physical CPU that it was last on

•But prior to HiperDispatch, PR/SM would manage each logical CPU evenly based on its average share of a processor

• Imagine an LPAR with 4 logical CPUs and a weight of 40% on a 5-way CEC

–The LPAR can use up to 40% of the 5-way, or up to 2 physical CPUs’ worth of capacity

•Without HD, each logical CPU would get 50% of a physical CPU

–Dispatching across 5 physical CPUs

•With HD, 1 logical CPU gets 100% of a physical CPU and 2 get 50%

–The other logical CPU is held in reserve, not used until/unless needed
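The example above in numbers. The VH/VM split shown is the one the slide describes; PR/SM's actual polarity assignment has more rules than this sketch:

```python
# LPAR with 4 logical CPs and a weight worth 40% of a 5-way CEC.
physical_cps, weight_share, logicals = 5, 0.40, 4
entitlement = physical_cps * weight_share            # 2.0 physical CPs worth

# Without HiperDispatch: entitlement spread evenly across every logical CP.
no_hd = [entitlement / logicals] * logicals          # 4 logicals at 50% each

# With HiperDispatch: 1 vertical high, 2 mediums at 50%, 1 parked low.
with_hd = [1.0, 0.5, 0.5, 0.0]

assert abs(sum(no_hd) - entitlement) < 1e-9          # same total either way
assert abs(sum(with_hd) - entitlement) < 1e-9
print("without HD:", no_hd, "with HD:", with_hd)
```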

Page 18

How can you improve cache effectiveness?

• Enable HiperDispatch

• Make good use of large pages

• Upgrade to newer machine

• Consider more/slower CPUs instead of fewer/faster

–More CPUs = more L1/L2/TLB

Cache sizes (L1/L2 per CP; L3 per chip; L4 shared per book or drawer):

Year  Machine  L1 Data  L1 Inst.  L2 Data  L2 Inst.  L3     NIC dir  L4
2005  z9 EC    256KB    256KB     40MB     n/a       n/a    n/a      n/a
2008  z10 EC   128KB    64KB      3MB      n/a       n/a    n/a      48MB
2010  z196     128KB    64KB      1.5MB    n/a       24MB   n/a      192MB
2012  zEC12    96KB     64KB      1MB      1MB       48MB   n/a      384MB
2015  z13      128KB    96KB      2MB      2MB       64MB   224MB    480MB

Page 19

Do you have a choice, and if you do, what should you choose?

Page 20

Processor count trade-offs

•Mainframe

–Sub-capacity (“slow”) processors are limited to the first drawer/book
–4 speed options (“EC” machines) or 26 (“BC” machines)

• Intel – new Xeon chips

–6 to 22 cores / chip
–1.6 GHz – 3.5 GHz

Page 21

What’s in a name (or model number)?

•General form: mmmm-snn

–mmmm = model number

–s = relative engine speed

• “EC” machines: 4 (slowest) to 7 (fastest)

• “BC” machines: A (slowest) to Z (fastest)

–nn = number of general purpose engines

•Examples:

–2827-704 = zEC12 with 4 full speed engines

–2964-410 = z13 with 10 slowest-speed engines

–2818-M03 = z114 with 3 engines, middle of the speed range

Common name  Model number
z13          2964
z13s         2965
zEC12        2827
zBC12        2828
z196         2817
z114         2818
z10EC        2097
z10BC        2098
z9EC         2094
z9BC         2096
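The naming convention above can be decoded mechanically. This helper and its model-to-name map are built only from this slide's table; it is an illustration, not an official API:

```python
# Decode mmmm-snn: machine family, relative engine speed, engine count.
NAMES = {"2964": "z13", "2965": "z13s", "2827": "zEC12", "2828": "zBC12",
         "2817": "z196", "2818": "z114", "2097": "z10EC", "2098": "z10BC",
         "2094": "z9EC", "2096": "z9BC"}

def decode(model: str) -> tuple:
    mmmm, snn = model.split("-")
    return NAMES.get(mmmm, mmmm), snn[0], int(snn[1:])

print(decode("2827-704"))   # ('zEC12', '7', 4): 4 full-speed engines
print(decode("2964-410"))   # ('z13', '4', 10): 10 slowest-speed engines
print(decode("2818-M03"))   # ('z114', 'M', 3): 3 mid-speed engines
```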

Page 22

Scott’s ROTs for CPU counts

•Don’t configure z/OS with fewer than 2 logical CPs

–Possible exception: the LPAR is a largely unused sandbox & not in a sysplex

•Fewer than 3 physical CPs on a machine is troublesome

–2 is possibly OK with a single primary LPAR

•Minor changes probably don’t matter as much as larger changes

–Going from 3 faster to 5 slower not as interesting as 5 to 10 or 3 to 8

–“not as interesting” doesn’t mean ignore it

Page 23

Upgrade scenario examples

•Upgrade z196/z114 to z13

•Keep MSUs about the same

–Easily done for MLC sub-capacity, but consider the ISVs

•Explore the options with zPCR

–Completely fictional configs, but hopefully somewhat representative

z196 2817-704, 531 MSUs, 6 LPARs
z196 2817-504, 265 MSUs, 4 LPARs
z114 2818-V02, 92 MSUs, 3 LPARs

Page 24

z13 Options <1500 MSUs

[Chart: capacity in MSUs (0 to 1500) for candidate z13 models]

See: https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprITRzOSv2r1?OpenDocument

Page 25

All z13s Models

[Chart: capacity in MSUs (0 to 900) for all z13s models, 2965-A01 through 2965-Z05]

Page 26

z13s Options < 500 MSUs

[Chart: z13s options under 500 MSUs, 2965-A01 through 2965-Z03]

Page 27

Scenario A, Total Capacity

Page 28

Scenario A, Single CP

Page 29

Scenario B, Total Capacity

Page 30

Scenario C, Total Capacity

Page 31

Scenario Comparisons

Page 32

Scenario C comparisons

Page 33

Impact on software cost

Fewer MSUs/Cap is better!

Page 34

What does this mean?

•When considering an upgrade you should consider all engine speed options

•The choice might impact your software bill, at least slightly

•The big question is: what happens if you change the engine speed?

Page 35

What should you expect from more/slower vs. fewer/faster?

Page 36

Fewer/Faster

+ Lower CPU time
+ Better for single-threaded CPU-intensive workloads
+ Specialty engine capacity constraints cause less impact
+ Scalable to multiple books/drawers
- CPU spinning/waiting is more of total capacity (higher Parallel Sysplex overhead)
- Less L1/L2 cache
- Potentially fewer high polarity processors (more inter-LPAR impact)
- Fewer concurrently executing tasks

Best for single-task performance

More/Slower

- Higher CPU times
- Worse for single-threaded CPU-intensive workloads
- Specialty engine capacity constraints cause more impact
- Not scalable past one book/drawer
+ CPU spinning/waiting is less of total capacity (lower Parallel Sysplex overhead)
+ More L1/L2 cache
+ Potentially more high polarity processors (less inter-LPAR impact)
+ More concurrently executing tasks

Best for system efficiency

(“Professionals and Jailbirds”: pros and cons, in other words)

Page 37

Components of elapsed times

•CPU time – time spent using the CPU

–Directly impacted by CPU speed

•CPU wait/delay time – time spent waiting to get on a CPU

–Related to number & speed of CPUs

• I/O time – time spent doing I/O

–Likely won’t change substantially with CPU changes

•Database or other subsystem request time

–This will likely include some CPU time that’s not charged back to the job

–Also likely includes some I/O time

•Serialization (locking, enqueues, latches, etc.)

–Can be related to how fast other work is running

Page 38

What to expect (fewer/faster)

•Lower CPU times

–If you double the processor speed, then CPU time should drop by about half

–Of course, things will likely not be so easy

•Possibly higher CPU delays

–Fewer processors = more queueing

•Possibly more variable throughput

–High priority, CPU-intensive tasks can monopolize more of the total capacity

–Sync CF requests consume more of your capacity, unless they get faster too

• Always best to keep the CF technology on the same level as your CPU

Page 39

What to expect (more/slower)

•Higher CPU times

–If you halve the processor speed, then CPU time should double

–Of course, things will likely not be so easy

•Possibly lower CPU delays

–More processors = less queueing

•Possibly more consistent throughput

–High priority, CPU-intensive tasks can monopolize less of the total capacity

–Sync CF requests consume less of your capacity

• Note CF engines always run at full speed

Page 40

Interesting measurements

Page 41

Work units

•Work units is the replacement for the “in-ready” address space counts

–Counts running or waiting work units (consensus vs. doc)

•A more accurate representation of work, because we have an increasing number of multi-threaded address spaces

•Values by processor type (GP, zIIP, zAAP)

•Plot min, average, max over time

–Max is often far larger than average

•Distribution of observations

–Based on the number of online and not parked processors (N)

–Counts in buckets: N, N+1, N+2, N+3, N+5, N+10, N+15, N+20, N+30…
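A sketch of how such a distribution could be bucketed. The bucket edges are the slide's, but the exact boundary semantics here are my assumption:

```python
# Classify a work-unit sample relative to N (online, unparked processors).
OFFSETS = [0, 1, 2, 3, 5, 10, 15, 20, 30]   # bucket edges from the slide

def bucket(work_units: int, n: int) -> str:
    # Return the first bucket whose upper edge covers the sample.
    for off in OFFSETS:
        if work_units <= n + off:
            return f"N+{off}" if off else "N"
    return "N+30+"

print(bucket(7, 7), bucket(11, 7), bucket(50, 7))   # N N+5 N+30+
```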

Page 42

Work units over time

Page 43

Work units over time - zIIP

Page 44

Work unit distribution

N=7-9

Page 45

Highest task CPU percent

• “New” field in SMF30: SMF30_Highest_Task_CPU_Percent

–For interval records: the largest percentage of CPU time used by any task in the address space = round(TCB / interval * 100)

–For step/job end records: largest reported interval value

•Related field: SMF30_Highest_Task_CPU_Program

–Name of the program loaded by the task that had the largest percentage

•As value approaches 100, per-CPU speed becomes more important

–Threshold for worry: “it depends”

•We don’t know anything about the second-most busy task

• “Spikey” TCBs might be under-represented

–Larger intervals hide larger spikes
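The interval calculation quoted above, written out. The sample numbers are invented for illustration:

```python
# round(TCB / interval * 100) for the busiest task in the address space.
def highest_task_cpu_percent(task_tcb_seconds, interval_seconds):
    return round(max(task_tcb_seconds) / interval_seconds * 100)

# Busiest task burned 810 of a 900-second interval: as this nears 100,
# per-CPU speed matters more than CPU count for this address space.
print(highest_task_cpu_percent([810.0, 45.0, 3.2], 900.0))   # 90
```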

Page 46

Highest task CPU – high level overview

Page 47

CPU Intensity

•Consider calculating CPU Intensity: CPU time / Elapsed Time

–Really the same calculation as the highest task CPU percent

•Can do this for entire batch jobs and/or on an interval basis for jobs and/or started tasks

•Gives you a gross overview of which workloads might be sensitive to CPU speed changes

–Note the same issues as for highest task CPU percent, maybe even worse
–But… it considers all CPU time, not just the most intensive TCB
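CPU intensity as defined above is a one-line calculation (example values invented):

```python
# CPU intensity = CPU time / elapsed time, per job or per interval.
def cpu_intensity(cpu_seconds: float, elapsed_seconds: float) -> float:
    return cpu_seconds / elapsed_seconds

# 42 CPU-minutes over a 60-minute run: 0.70, likely speed-sensitive.
# Values above 1.0 imply several concurrently busy TCBs.
print(f"{cpu_intensity(42 * 60, 60 * 60):.2f}")   # 0.70
```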

Page 48

CPU Intensity

Page 49

CICS: QR TCB time

•Each CICS region has multiple TCBs, but only one QR TCB

–Single TCB = scalability bottleneck

•Historically all application code ran on the QR TCB

•Today: many more TCBs

–Prime example: SQL and MQ commands run on L8 TCB

–L9: User Key OpenAPI programs

• But use CICS Key for programs doing DB2/MQ

–L8/L9 TCB pool limit = (2 * max task) + 32

•CICS SMF data will break CPU consumption down by TCB

–Will also show things like the amount of dispatch delay and number of dispatches
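The pool-limit rule above as a formula. The MAXTASKS value is invented for illustration:

```python
# L8/L9 TCB pool limit = (2 * MAXTASKS) + 32, per the slide.
def l8_l9_pool_limit(max_tasks: int) -> int:
    return 2 * max_tasks + 32

print(l8_l9_pool_limit(60))   # 152 TCBs for a region with MAXTASKS=60
```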

Page 50

Moving between TCBs: not threadsafe

[Flow diagram: QR TCB vs L8 TCB. The program starts on the QR TCB; application code and EXEC CICS commands (threadsafe or not) run on QR, while each EXEC SQL switches over to the L8 TCB and back]

Even though the program is not threadsafe, some amount of execution (SQL/MQ) occurs on the L8 TCB, not the QR TCB.

Every TCB switch takes a certain amount of CPU.

Waits for the QR TCB are also very possible.

Page 51

Moving between TCBs: threadsafe

[Flow diagram: QR TCB vs L8 TCB. After the first EXEC SQL, application code and threadsafe EXEC CICS commands stay on the L8 TCB; only a non-threadsafe EXEC CICS command forces a return to QR]

Marking your programs threadsafe means:

• Less TCB switch overhead
• Less QR TCB wait
• More work runs on multiple L8 TCBs

Note that without API(OPENAPI): after executing a non-threadsafe CICS command, execution stays on the QR TCB until the next EXEC SQL (or MQ).

Where possible, use threadsafe! Threadsafe Considerations for CICS: http://www.redbooks.ibm.com/abstracts/sg246351.html

Page 52

Moving between TCBs: threadsafe & OpenAPI

[Flow diagram: QR TCB vs L8 TCB, threadsafe & OpenAPI. Execution moves to the L8 TCB early and returns there even after non-threadsafe CICS commands]

Note that with API(OPENAPI), after executing a non-threadsafe CICS command, execution returns to the L8 TCB.

But this is EXECKey(CICSKey); EXECKey(UserKey) is shown on the next page.

Page 53

Threadsafe, OPENAPI, User Key

[Flow diagram: QR, L8, and L9 TCBs. Application code runs on the L9 TCB, but every EXEC SQL switches to L8 and back, and the non-threadsafe CICS command still goes to QR]

While this solves our single-TCB problem, it re-introduces the excessive task switching we started with.

Not recommended for applications doing DB2/MQ work!

Page 54

CICS Response Time Breakdown

Page 55

QR Dispatch Delay

Page 56

Specialty engine crossover

•SMF 30 and SMF 72 record the amount of CPU time consumed on GPs that could have been done on zIIPs (or zAAPs)

•Mostly an indication of queueing for specialty engines

•Can be mostly (~99%) stopped by setting IIPHONORPRIORITY(NO)

•Slower GPs mean the execution time will be longer

–Maybe should just wait for a zIIP to become available?

• ZIIPAWMT can tune that, but probably rarely needed

• If you have significant failover happening, ideally just buy another zIIP

–Or turn on SMT on z13

–Note SMT or not is also a more/slower vs. fewer/faster question

Page 57

zIIP crossover

Page 58

Sysplex overhead

•While sync requests are executing, the requesting CPU spins waiting for a response

•A faster engine with the same CF response time = more lost capacity

•More/slower CPs mean that spin time is a smaller portion of your capacity

•Keep CF technology matched as best as possible

Page 59

Sysplex overhead

Page 60

Practical Considerations

Page 61

Batch

•Find candidates that will be impacted

–Consider filtering by batch jobs that consume more than trivial CPU and/or run more than a trivial amount of time

–Consider filtering by CPU intensity

–Consider batch jobs that are CPU-bound when the system is not CPU-constrained

•Understand how much headroom you have in your batch SLAs

–E.G. if you’re meeting your window by 3 hours today, some delay may be fine

–Make sure the SLAs are representative of real business need!

•Do you care about low-importance batch?

–Maybe: dev/test work needs to get done too!

Page 62

Started tasks

•Find candidates that will be impacted

–SMF30_Highest_Task_CPU_Percent is a good place to start

–But don’t forget to look for other started tasks with a high CPU intensity

• If moving to fewer/faster, beware high-priority workloads’ ability to monopolize CPUs

Page 63

CICS

•Often the CPU time & CPU delay per transaction is so small as to be unnoticeable to the end user

–Most transactions use milliseconds of CPU; milliseconds + milliseconds of wait may not be enough to matter
–But beware applications which trigger multiple transactions per user interaction

•How busy are the QR TCBs in each region?

–Note that this is a single-server queueing model: ~70% busy implies 2x service time
–As QR TCB busy exceeds 70% and approaches 100%, increased caution is warranted

• If at all possible, make applications threadsafe
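The ~70%-busy caution follows from single-server queueing. This M/M/1 sketch is my illustration of the rule of thumb, not a CICS-specific model:

```python
# M/M/1: mean queueing delay = service_time * u / (1 - u) at utilization u.
def qr_wait_multiplier(utilization: float) -> float:
    return utilization / (1.0 - utilization)

for u in (0.5, 0.7, 0.9):
    print(f"QR {u:.0%} busy -> wait ~= {qr_wait_multiplier(u):.1f}x service time")
# At 70% busy the average wait already exceeds 2x the service time,
# and it grows without bound as the QR TCB approaches 100% busy.
```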

Page 64

zIIP-eligible workloads

•Slower engines differ more from full-speed zIIPs

•Consider whether you want to allow failover based on

–How often failover occurs

–Whether it occurs during your R4HA peaks

–How important the work that fails over is

–Discrepancy between GPs and zIIPs

Page 65

Can you use OOCoD to help?

•On/Off Capacity on Demand is great for trying different combinations

–“Delivered” capacity can be less than “purchased” capacity

–Changes within purchased capacity can be done for no additional hardware charge

•But there are rules that you have to be aware of:

–There are various agreements to sign
–You cannot decrease the number of physical CPs below what was delivered
–You cannot decrease the capacity below what was delivered
–You cannot go beyond what’s physically installed or 2x purchased capacity

–Pre-paid maintenance may be problematic

• Pre-pay based on purchased vs. delivered capacity?

• Try during warranty, lock in before maintenance kicks in?

Page 66

OOCoD Matrices (Upgrade Scenario B)

Capacity by number of CPUs and engine speed:

Speed   1      2      3      4      5      6      7      8      9      10     11
4xx     250    478    697    910    1,118  1,321  1,520  1,716  1,907  2,093  2,277
5xx     746    1,417  2,067  2,694  3,306  3,903  4,485  5,052  5,606  6,145  6,671
6xx     1,068  2,019  2,938  3,827  4,690  5,528  6,342  7,132  7,899  8,644  9,368
7xx     1,695  3,196  4,644  6,041  7,392  8,700  9,964  11,188 12,371 13,515 14,622

Which of these options will work best?

Bringing in the machine as a 410 precludes using OOCoD to try the other options.

Starting at a 503 is better.

But we really want to start as a 602 (or 502), even though we don’t expect to use that.

Blue = delivered capacity; Green = OOCoD possibilities

Page 67

Summary

•Choice of engine speed can affect system efficiency

•Engine speed choice can possibly affect real capacity / MSU

•Slower engines may be a better choice

•Many measurements to review to help you decide

•Examine your workloads

Fill out your evaluations