hptf 2240 final
DESCRIPTION
Case studies of performance analysis and capacity planning
TRANSCRIPT
April 9, 2023
Hanging By a Thread: Using Capacity Planning to Survive
Session 2240 Surf F 08:00 Wednesday
Paul O’Sullivan
Topics Up for Discussion
• Introduction
• Current Status
• Case Study 1 – Capacity Planning
• Case Study 2 – Performance Analysis
• Findings
• Future
Introduction
• Paul O’Sullivan
• Capacity Management Consultant
• Capacity Planning/Performance Analyst since 1994
  − Infrastructure and Fixed Income
  − Investment Banking/Insurance applications
• PerfCap Corporation
Current State of Performance Analysis and Capacity Planning
Capacity Planning
− Different climate today than even 5 years ago
  • Massive proliferation of servers
  • Multi-platform and multi-tier
  • Management non-interest
    − High-level data only
− Capacity Planning:
  • ‘too difficult to do so we will not bother’
  • Buy more servers – (not any more)
Issues
• Lack of specialists
• Too much data to collect
• Hard to correlate different platforms and treat the application as an entity
• Top-down approach
  − Processes first, data later
• Diffused responsibility
• …and…
Falling hardware costs
• The following is a quotation for a typical 4-way database server:
  − 4 x CPU: GBP 8,000
  − 1 x Storage Array: GBP 13,235
  − 3 x Power supplies: GBP 750
  − 15 x Drives for Array: GBP 4,500
  − 2 x 1GB Memory: GBP 10,000
  − Total: GBP 35,500
  − Year: 2000
  − Refurbished!
OK, anyone can complain…
• …but how can we fix it?
• Two examples of recent work
  − Capacity Planning: Itanium
  − Performance Analysis: SQL Server and EVA
• Futures
Capacity Planning
Oracle RAC on Itanium Linux

A Sample Study
Oracle RAC Capacity Planning
•Currently 3-node RAC running on IA64 Linux
•Expect 3x workload on current Oracle RAC within next two years.
•Must evaluate capacity of current cluster.
•Examine upgrade alternatives if current configuration not capable of sustaining expected load.
RAC Node CPU Utilizations, July-Sept 2008
Selection of Peak Benchmark Load
CPU by Image / Disk I/O Rate
CPU Utilization by Core
Reasonable core load balance at heavy loads.
Overall Disk I/O Rates
Overall Disk Data Rate
Disk Response Times
Memory Allocation
eCAP Workload Definition
(diagram: workload classes oracleNDSPRD1, oracleLockProcs, oracleProcs, and asmProcs, each with CPU and Disk I/O demands)
Workload Characteristics

Workload Class     Process  Multi-Processing  Creation     CPU          Disk I/O
                   Count    Level             Rate (/sec)  Utilization  Rate (/sec)
oracleNDSPRD1      1110     547.1             0.925        73%          639
oracleLockProcs    8        3.2               0.007        5%           277
oracleWorkProcs    46       31.8              0.038        1%           14
ASM processes      20       9.7               0.017        0.2%         10
daemons            6        2.4               0.005        0.05%        4
data collector     1        0.4               0.001        0.3%         26
root processes     1161     266.0             0.968        3%           233
other processes    774      47.5              0.645        2%           311
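The table's totals can be cross-checked in a few lines (the figures are taken directly from the slide; oracleNDSPRD1 dominates both CPU and disk demand):

```python
# (CPU utilization %, disk IO/s) per workload class, from the slide above
workloads = {
    "oracleNDSPRD1":   (73.0, 639),
    "oracleLockProcs": (5.0, 277),
    "oracleWorkProcs": (1.0, 14),
    "ASM processes":   (0.2, 10),
    "daemons":         (0.05, 4),
    "data collector":  (0.3, 26),
    "root processes":  (3.0, 233),
    "other processes": (2.0, 311),
}

# Aggregate demand at the benchmark peak
total_cpu = sum(cpu for cpu, _ in workloads.values())
total_io = sum(io for _, io in workloads.values())
print(round(total_cpu, 2), total_io)  # 84.55 1514
```

The aggregate CPU utilization near 85% is consistent with the later finding that CPU, not disk, is the primary bottleneck.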
Primary Response Time Components
Current System Response Time Curve

(chart: response time vs. load; capacity 100%, headroom 9%)

Current System Headroom

headRoom = 100 × percentGrowthToKnee / (100 + percentGrowthToKnee)
         = 100 × 10 / (100 + 10)
         ≈ 9%
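The headroom figure can be reproduced from the growth-to-knee relation (a minimal sketch; the 10% growth-to-knee value is read from the slide):

```python
def headroom_pct(percent_growth_to_knee: float) -> float:
    """Headroom as a percentage of total capacity at the response-time knee."""
    return 100.0 * percent_growth_to_knee / (100.0 + percent_growth_to_knee)

# 10% workload growth to the knee leaves about 9% of capacity in reserve
print(round(headroom_pct(10.0), 1))  # 9.1
```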
Findings - Current System
•At peak sustained load, 9% headroom
•CPU is primary resource bottleneck
• Possible solutions:
  − Horizontal scaling
  − Integrity upgrade
  − Alternate hardware platform
Platform Alternatives (3 or 4 nodes)

HP rx7620 (1.1 GHz, Itanium 2) – current configuration
HP rx8640 (1.6 GHz, 24MB L3 cache), 16 core
HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
IBM p570 (2.2 GHz, Power 5), 16 core
IBM p570 (2.2 GHz, Power 5), 32 core
IBM p570 (4.7 GHz, Power 6), 16 core
Sun SPARC Enterprise M8000 (2.4 GHz), 16 core
Sun SPARC Enterprise M8000 (2.4 GHz), 32 core

Configuration must support 200% workload growth.
Response Time vs Workload Growth – 3-node RAC

(chart: relative response time vs. % workload growth from benchmark, −100% to +400%, for:
HP rx7620 (1.1 GHz, Itanium 2), 16-core
HP rx8640 (1.6 GHz, 24MB L3 cache, Itanium 2), 16-core
HP rx8640 (1.6 GHz, 24MB L3 cache, Itanium 2), 32-core
IBM p570 (2.2 GHz, Power 5), 16-core
IBM p570 (2.2 GHz, Power 5), 32-core
IBM p570 (4.7 GHz, Power 6), 16-core
Sun SPARC Enterprise M8000 (2.4 GHz), 16-core
Sun SPARC Enterprise M8000 (2.4 GHz), 32-core)

Note: CPU is the primary resource bottleneck; disk and memory will support 200% growth.
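Curves of this shape come from queueing models: response time stays near the bare service time until utilization approaches the knee, then climbs steeply. A minimal sketch of the idea (this is not PerfCap's eCAP model; it treats one m-core node as an M/M/m service center with service time normalized to 1):

```python
import math

def erlang_c(m: int, rho: float) -> float:
    """Erlang C: probability an arriving job must queue, given m servers
    each at per-server utilization rho (0 <= rho < 1)."""
    a = m * rho  # offered load in Erlangs
    s = sum(a**k / math.factorial(k) for k in range(m))
    top = a**m / (math.factorial(m) * (1.0 - rho))
    return top / (s + top)

def relative_response_time(m: int, rho: float) -> float:
    """M/M/m mean response time relative to the bare service time."""
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))

# A 16-core node stays flat at moderate load, then hits the knee:
for rho in (0.5, 0.8, 0.95):
    print(rho, round(relative_response_time(16, rho), 2))
```

Faster or more numerous cores lower the utilization produced by a given workload, which is why the curves for the larger configurations push the knee further to the right.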
Response Time vs Workload Growth – 4-node RAC

(chart: relative response time vs. % workload growth from benchmark, −100% to +400%, for the same eight platform configurations as the 3-node chart)
Qualifying Platforms

• Four platform configurations support the required growth:
  − HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
  − IBM p570 (2.2 GHz, Power 5), 32 core
  − IBM p570 (4.7 GHz, Power 6), 16 core
  − Sun SPARC Enterprise M8000 (2.4 GHz), 32 core
• Horizontal scaling to 4 nodes will not change the qualifying platforms.
Response Time vs Workload Growth (reduced core, 3-node configurations)

(chart: relative response time vs. % workload growth from benchmark, for:
Sun SPARC Enterprise M8000 (2.4 GHz), 32 cores
HP rx8640 (1.6 GHz, 24MB L3 cache), 30 cores
IBM p570 (2.2 GHz, Power 5), 26 cores
IBM p570 (4.7 GHz, Power 6), 12 cores)

Response Time vs Workload Growth (reduced core, 4-node configurations)

(chart: relative response time vs. % workload growth from benchmark, for:
Sun SPARC Enterprise M8000 (2.4 GHz), 24 cores
HP rx8640 (1.6 GHz, 24MB L3 cache), 24 cores
IBM p570 (2.2 GHz, Power 5), 20 cores
IBM p570 (4.7 GHz, Power 6), 10 cores)
Optimized Configurations
Platform                                3-node cores   4-node cores
Sun SPARC Enterprise M8000 (2.4 GHz)    32             24
HP rx8640 (1.6 GHz, 24MB L3 cache)      30             24
IBM p570 (2.2 GHz, Power 5)             26             20
IBM p570 (4.7 GHz, Power 6)             12             10

Final choice based on cost and management issues.
Performance Analysis
SQL Server on HP Blades and EVA
Performance Analysis 1
• Large insurance firm acquisition
• Migrating applications
• Requirement of 10x growth
• Much new hardware purchased
• 160 servers in the environments
• Application still slow
  − SQL developers under the microscope
Performance Analysis
• Asked to examine a SQL Server application
• Theory was that the EVA 6000 could not cope with the IO load generated by SQL Server
• Used the PAWZ Performance Analysis and Capacity Planning tool to find performance issues
• EVA performance data ‘unavailable’, so used the SAN modeling ability of the PAWZ Capacity Planner
Hardware Configuration
− 16-way Quad Core HP Blade 460c
− 2 x 4Gb FC fibre cards
− SQL Server 2000
− EVA 6000
  • 96-disk disk group, 300Gb 15k drives
  • Shared with other Windows servers
Initial Analysis
• SQL Server processes were generating very high response times on SAN drives
• SQL Server processes were themselves paging (flushing data to disk) at regular intervals
• Overall IO rates were low: 1,000 IO/sec
• CPU usage was low (10%) for a server of this type (?)
• Memory usage was low (15%) for a server of this type (?)
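A back-of-envelope check supports the "IO rates were low" observation (figures taken from the slides; the ~180 IOPS per 15k spindle is a rough rule of thumb, not a measured value):

```python
# 1,000 IO/s spread across the EVA 6000's 96-disk group is a very light
# per-spindle load, so the array itself was an unlikely bottleneck.
total_io_per_sec = 1000
disks_in_group = 96

per_disk_io = total_io_per_sec / disks_in_group
print(round(per_disk_io, 1))  # 10.4
```

At roughly 10 IO/s per spindle against a nominal ~180 random IOPS for a 15k drive, the drives were nearly idle; the high response times had to come from somewhere else.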
Not really high IO counts these days….
IO Rates
Very high D: drive response time….
Disk Response Time
Very high D: drive response time….
IO Sizes
SQL Server process generating all the IO
Obviously, something wrong with the application, right?
Process-based IO Rates
1.7Gb. Excuse me?
But the server has 24Gb of memory
SQL Server Memory
Soft paging into the free list
SQL Server paging
Soft paging into the free list: a huge IO load generated as data is moved to and from the SQL Server process
SQL Server paging
So what happened?
• Although SQL Server Enterprise can be configured to use all available memory, it will not use more than 1.7Gb of actual memory until Address Windowing Extensions (AWE) is enabled.
• AWE has to be configured via the sp_configure utility (‘show advanced options’).
• AWE has to be enabled and then given a required memory size.
• AWE will not operate if there is less than 3Gb of free memory on the server: SQL Server will disable it.
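The steps above map to a short sp_configure session. This is a sketch, not the exact script used in the engagement: the 20480 MB cap is an illustrative value for this 24Gb server (leaving headroom for the OS), and SQL Server 2000 requires a restart before AWE takes effect.

```sql
-- Make the advanced options (including AWE) visible
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Enable AWE and give SQL Server a fixed memory size
-- (20480 MB is illustrative; size to leave >3Gb free for the OS)
EXEC sp_configure 'awe enabled', 1;
EXEC sp_configure 'max server memory', 20480;
RECONFIGURE;

-- Restart the SQL Server service for AWE to take effect
```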
Production: IO before
Production: IO After
Production: IO Q Before
Production: IO Q After
Production: Disk Busy Q Before
Production: Disk Busy Q After
HUGE reduction in disk busy
Result
• CPU usage increased
• Application could handle more concurrent users in test
• Customer very happy
  − No hardware purchase, no project, no application change
  − Rapid resolution to the problem
    • Took 2 hours to work it out
    • Problem had been bad since January
• Relieved pressure on the SAN
  − Until another SQL Server with the same problem…
Lessons
• Even with a performance tool already in place, few people were using it well
• Blame game without looking at the facts (data)
• Need to improve fault-finding capabilities
  − Better ways to correlate data
  − Automatic methods of alerting to the real problem and its nature
• Classic case of the ‘cause behind the cause’
So what do we need?
• 1st hurdle overcome – obtaining data
• 2nd hurdle overcome – presenting data efficiently
• 3rd hurdle overcome – scalability of performance data from clients
• 4th hurdle overcome – automatic capacity planning data
• 5th hurdle – to do – making sense of the data
  − Expert reports
  − Just showing the issues
  − Removing the need for manual analysis
Want to know more?
•Booth Number 631
•http://www.perfcap.com