hptf 2240 final
DESCRIPTION
Case studies of performance analysis and capacity planning
TRANSCRIPT
April 9, 2023
Hanging By a Thread: Using Capacity Planning to Survive
Session 2240 Surf F 08:00 Wednesday
Paul O’Sullivan
Topics Up for Discussion
• Introduction
• Current Status
• Case Study 1 – Capacity Planning
• Case Study 2 – Performance Analysis
• Findings
• Future
Introduction
• Paul O’Sullivan
• Capacity Management Consultant
• Capacity Planning/Performance Analyst since 1994
  − Infrastructure and Fixed Income
  − Investment Banking/Insurance applications
• PerfCap Corporation
Current State of Performance Analysis and Capacity Planning
Capacity Planning
− Different climate today than even 5 years ago
  • Massive proliferation of servers
  • Multi-platform and multi-tier
  • Management non-interest
    − High-level data only
− Capacity Planning:
  • ‘too difficult to do so we will not bother’
  • Buy more servers – (not any more)
Issues
• Lack of specialists
• Too much data to collect
• Hard to correlate different platforms and treat the application as an entity
• Top-down approach
  − Processes first, data later
• Diffused responsibility
• …and…
Falling hardware costs
• The following is a quotation for a typical 4-way database server:
  − 4 x CPU: GBP 8,000
  − 1 x Storage Array: GBP 13,235
  − 3 x Power supplies: GBP 750
  − 15 x Drives for Array: GBP 4,500
  − 2 x 1GB Memory: GBP 10,000
  − Total: GBP 35,500
  − Year: 2000
  − Refurbished!
OK, anyone can complain…
• …but how can we fix it?
• Two examples of recent work
  − Capacity Planning: Itanium
  − Performance Analysis: SQL Server and EVA
• Futures
Capacity Planning
Oracle RAC on Itanium Linux

A Sample Study
Oracle RAC Capacity Planning
•Currently 3-node RAC running on IA64 Linux
•Expect 3x workload on current Oracle RAC within next two years.
•Must evaluate capacity of current cluster.
•Examine upgrade alternatives if current configuration not capable of sustaining expected load.
RAC Node CPU Utilizations, July-Sept 2008
Selection of Peak Benchmark Load
CPU by Image / Disk I/O Rate
CPU Utilization by Core
Reasonable core load balance at heavy loads.
Overall Disk I/O Rates
Overall Disk Data Rate
Disk Response Times
Memory Allocation
eCAP Workload Definition
(diagram: workload classes oracleNDSPRD1, oracleLockProcs, oracleProcs, and asmProcs, each with CPU and Disk I/O demands)
Workload Characteristics

Workload Class     Process  Multi-Processing  Creation     CPU          Disk I/O
                   Count    Level             Rate (/sec)  Utilization  Rate (/sec)
oracleNDSPRD1      1110     547.1             0.925        73%          639
oracleLockProcs    8        3.2               0.007        5%           277
oracleWorkProcs    46       31.8              0.038        1%           14
ASM processes      20       9.7               0.017        0.2%         10
daemons            6        2.4               0.005        0.05%        4
data collector     1        0.4               0.001        0.3%         26
root processes     1161     266.0             0.968        3%           233
other processes    774      47.5              0.645        2%           311
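The table's totals can be cross-checked in a few lines (the figures are taken directly from the slide; oracleNDSPRD1 dominates both CPU and disk demand):

```python
# (CPU utilization %, disk IO/s) per workload class, from the slide above
workloads = {
    "oracleNDSPRD1":   (73.0, 639),
    "oracleLockProcs": (5.0, 277),
    "oracleWorkProcs": (1.0, 14),
    "ASM processes":   (0.2, 10),
    "daemons":         (0.05, 4),
    "data collector":  (0.3, 26),
    "root processes":  (3.0, 233),
    "other processes": (2.0, 311),
}

# Aggregate demand at the benchmark peak
total_cpu = sum(cpu for cpu, _ in workloads.values())
total_io = sum(io for _, io in workloads.values())
print(round(total_cpu, 2), total_io)  # 84.55 1514
```

The aggregate CPU utilization near 85% is consistent with the later finding that CPU, not disk, is the primary bottleneck.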
Primary Response Time Components
Current System Response Time Curve

(chart: response time vs. load; capacity 100%, headroom 9%)

Current System Headroom

headRoom = 100 × percentGrowthToKnee / (100 + percentGrowthToKnee)
         = 100 × 10 / (100 + 10)
         ≈ 9%
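The headroom figure can be reproduced from the growth-to-knee relation (a minimal sketch; the 10% growth-to-knee value is read from the slide):

```python
def headroom_pct(percent_growth_to_knee: float) -> float:
    """Headroom as a percentage of total capacity at the response-time knee."""
    return 100.0 * percent_growth_to_knee / (100.0 + percent_growth_to_knee)

# 10% workload growth to the knee leaves about 9% of capacity in reserve
print(round(headroom_pct(10.0), 1))  # 9.1
```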
Findings - Current System
•At peak sustained load, 9% headroom
•CPU is primary resource bottleneck
• Possible solutions:
  − Horizontal scaling
  − Integrity upgrade
  − Alternate hardware platform
Platform Alternatives (3 or 4 nodes)

HP rx7620 (1.1 GHz, Itanium 2) – current configuration
HP rx8640 (1.6 GHz, 24MB L3 cache), 16 core
HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
IBM p570 (2.2 GHz, Power 5), 16 core
IBM p570 (2.2 GHz, Power 5), 32 core
IBM p570 (4.7 GHz, Power 6), 16 core
Sun SPARC Enterprise M8000 (2.4 GHz), 16 core
Sun SPARC Enterprise M8000 (2.4 GHz), 32 core

Configuration must support 200% workload growth.
Response Time vs Workload Growth – 3-node RAC

(chart: relative response time vs. % workload growth from benchmark, −100% to +400%, for:
HP rx7620 (1.1 GHz, Itanium 2), 16-core
HP rx8640 (1.6 GHz, 24MB L3 cache, Itanium 2), 16-core
HP rx8640 (1.6 GHz, 24MB L3 cache, Itanium 2), 32-core
IBM p570 (2.2 GHz, Power 5), 16-core
IBM p570 (2.2 GHz, Power 5), 32-core
IBM p570 (4.7 GHz, Power 6), 16-core
Sun SPARC Enterprise M8000 (2.4 GHz), 16-core
Sun SPARC Enterprise M8000 (2.4 GHz), 32-core)

Note: CPU is the primary resource bottleneck; disk and memory will support 200% growth.
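Curves of this shape come from queueing models: response time stays near the bare service time until utilization approaches the knee, then climbs steeply. A minimal sketch of the idea (this is not PerfCap's eCAP model; it treats one m-core node as an M/M/m service center with service time normalized to 1):

```python
import math

def erlang_c(m: int, rho: float) -> float:
    """Erlang C: probability an arriving job must queue, given m servers
    each at per-server utilization rho (0 <= rho < 1)."""
    a = m * rho  # offered load in Erlangs
    s = sum(a**k / math.factorial(k) for k in range(m))
    top = a**m / (math.factorial(m) * (1.0 - rho))
    return top / (s + top)

def relative_response_time(m: int, rho: float) -> float:
    """M/M/m mean response time relative to the bare service time."""
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))

# A 16-core node stays flat at moderate load, then hits the knee:
for rho in (0.5, 0.8, 0.95):
    print(rho, round(relative_response_time(16, rho), 2))
```

Faster or more numerous cores lower the utilization produced by a given workload, which is why the curves for the larger configurations push the knee further to the right.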
Response Time vs Workload Growth – 4-node RAC

(chart: relative response time vs. % workload growth from benchmark, −100% to +400%, for the same eight platform configurations as the 3-node chart)
Qualifying Platforms

• Four platform configurations support the required growth:
  − HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
  − IBM p570 (2.2 GHz, Power 5), 32 core
  − IBM p570 (4.7 GHz, Power 6), 16 core
  − Sun SPARC Enterprise M8000 (2.4 GHz), 32 core
• Horizontal scaling to 4 nodes will not change the qualifying platforms.
Response Time vs Workload Growth (reduced core, 3-node configurations)

(chart: relative response time vs. % workload growth from benchmark, for:
Sun SPARC Enterprise M8000 (2.4 GHz), 32 cores
HP rx8640 (1.6 GHz, 24MB L3 cache), 30 cores
IBM p570 (2.2 GHz, Power 5), 26 cores
IBM p570 (4.7 GHz, Power 6), 12 cores)

Response Time vs Workload Growth (reduced core, 4-node configurations)

(chart: relative response time vs. % workload growth from benchmark, for:
Sun SPARC Enterprise M8000 (2.4 GHz), 24 cores
HP rx8640 (1.6 GHz, 24MB L3 cache), 24 cores
IBM p570 (2.2 GHz, Power 5), 20 cores
IBM p570 (4.7 GHz, Power 6), 10 cores)
Optimized Configurations
Platform                                3-node cores   4-node cores
Sun SPARC Enterprise M8000 (2.4 GHz)    32             24
HP rx8640 (1.6 GHz, 24MB L3 cache)      30             24
IBM p570 (2.2 GHz, Power 5)             26             20
IBM p570 (4.7 GHz, Power 6)             12             10

Final choice based on cost and management issues.
Performance Analysis
SQL Server on HP Blades and EVA
Performance Analysis 1
• Large insurance firm acquisition
• Migrating applications
• Requirement of 10x growth
• Much new hardware purchased
• 160 servers in the environments
• Application still slow
  − SQL developers under the microscope
Performance Analysis
• Asked to examine a SQL Server application
• Theory was that the EVA 6000 could not cope with the IO load generated by SQL Server
• Used the PAWZ Performance Analysis and Capacity Planning tool to find performance issues
• EVA performance data ‘unavailable’, so used the SAN modeling ability of the PAWZ Capacity Planner
Hardware Configuration
− 16-way Quad Core HP Blade 460c
− 2 x 4Gb FC fibre cards
− SQL Server 2000
− EVA 6000
  • 96-disk disk group, 300Gb 15k drives
  • Shared with other Windows servers
Initial Analysis
• SQL Server processes were generating very high response times on SAN drives
• SQL Server processes were themselves paging (flushing data to disk) at regular intervals
• Overall IO rates were low: 1,000 IO/sec
• CPU usage was low (10%) for a server of this type (?)
• Memory usage was low (15%) for a server of this type (?)
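A back-of-envelope check supports the "IO rates were low" observation (figures taken from the slides; the ~180 IOPS per 15k spindle is a rough rule of thumb, not a measured value):

```python
# 1,000 IO/s spread across the EVA 6000's 96-disk group is a very light
# per-spindle load, so the array itself was an unlikely bottleneck.
total_io_per_sec = 1000
disks_in_group = 96

per_disk_io = total_io_per_sec / disks_in_group
print(round(per_disk_io, 1))  # 10.4
```

At roughly 10 IO/s per spindle against a nominal ~180 random IOPS for a 15k drive, the drives were nearly idle; the high response times had to come from somewhere else.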
Not really high IO counts these days….
IO Rates
Very high D: drive response time….
Disk Response Time
Very high D: drive response time….
IO Sizes
SQL Server process generating all the IO
Obviously, something wrong with the application, right?
Process-based IO Rates
1.7Gb. Excuse me?
But the server has 24Gb of memory
SQL Server Memory
Soft paging into the free list
SQL Server paging
Soft paging into the free list: a huge IO load generated as data is moved to and from the SQL Server process
SQL Server paging
So what happened?
• Although SQL Server Enterprise can be configured to use all available memory, it will not use more than 1.7Gb of actual memory until Address Windowing Extensions (AWE) is enabled.
• AWE has to be configured via the sp_configure utility (‘show advanced options’).
• AWE has to be enabled and then given a required memory size.
• AWE will not operate if there is less than 3Gb of free memory on the server: SQL Server will disable it.
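The steps above map to a short sp_configure session. This is a sketch, not the exact script used in the engagement: the 20480 MB cap is an illustrative value for this 24Gb server (leaving headroom for the OS), and SQL Server 2000 requires a restart before AWE takes effect.

```sql
-- Make the advanced options (including AWE) visible
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Enable AWE and give SQL Server a fixed memory size
-- (20480 MB is illustrative; size to leave >3Gb free for the OS)
EXEC sp_configure 'awe enabled', 1;
EXEC sp_configure 'max server memory', 20480;
RECONFIGURE;

-- Restart the SQL Server service for AWE to take effect
```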
Production: IO before
Production: IO After
Production: IO Q Before
Production: IO Q After
Production: Disk Busy Q Before
Production: Disk Busy Q After
HUGE reduction in disk busy
Result
• CPU usage increased
• Application could handle more concurrent users in test
• Customer very happy
  − No hardware purchase, no project, no application change
  − Rapid resolution to the problem
    • Took 2 hours to work it out
    • Problem had been bad since January
• Relieved pressure on the SAN
  − Until another SQL Server with the same problem…
Lessons
• Even with a performance tool already in place, few people were using it well
• Blame game without looking at the facts (data)
• Need to improve fault-finding capabilities
  − Better ways to correlate data
  − Automatic methods of alerting to the real problem and its nature
• Classic case of the ‘cause behind the cause’
So what do we need?
• 1st hurdle overcome – obtaining data
• 2nd hurdle overcome – presenting data efficiently
• 3rd hurdle overcome – scalability of performance data from clients
• 4th hurdle overcome – automatic capacity planning data
• 5th hurdle – to do – making sense of the data
  − Expert reports
  − Just showing the issues
  − Removing the need for manual analysis
Want to know more?
•Booth Number 631
•http://www.perfcap.com