virtualization and the metrics of performance ... - cmg€¦ · • cloud - ex: aws; “one ec2...

20
23 S t b 2011 Virtualization and the Metrics of 23 September 2011 Performance & Capacity Management Has the world changed? © Copyright 2011 | First Data Corporation Mark Preston

Upload: others

Post on 25-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

23 S t b 2011

Virtualization and the Metrics of 23 September 2011

Performance & Capacity Management Has the world changed?

© Copyright 2011 | First Data Corporation

Mark Preston

Page 2: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

A dAgenda

• Reality CheckReality Check …….• General Observations• Traditional metrics for a non-virtual environmentTraditional metrics for a non virtual environment• How virtualization scheduling/dispatching of shared

environments can change the paradigm• Case Study• Conclusions• Discussion

2| © Copyright 2011 | First Data Corporation

Page 3: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

R lit Ch kReality Check ……• Application users don’t care about system utilization as

long as response time is metlong as response time is met• Consumers purchase units of computing to support

functionality, not CPU cyclesfunctionality, not CPU cycles• Internal – Clusters, SAN, Networking• Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent

CPU capacity of a 1 0 1 2 GHz 2007 Opteron or 2007 Xeon processorCPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor“

• Virtualized environments exhibit a range of latency & i hb h d iti itneighborhood sensitivity

• Shared environment interference effects are realConsumers of IT are realizing that in a virtual world• Consumers of IT are realizing that in a virtual world utilization based capacity planning is suspect, if not useless in some cases

3| © Copyright 2011 | First Data Corporation

Page 4: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

R lit Ch kReality Check ……• System building blocks incur latency• Virtualization is not a free technology (adds another layer)• In the drive to reduce enterprise cost host are over-subscribed• Over-subscribed host and shared resources (Multi-CPU, I/O

consolidation) require scheduling, which incurs overhead• The application team doesn’t have control of the neighborhood• The application team doesn t have control of the neighborhood• Let’s reword the opening Reality Check

• Application admin’s don’t care about system utilization as long asApplication admin s don t care about system utilization as long as response time is met

4| © Copyright 2011 | First Data Corporation

Page 5: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

General Observations• The neighborhood is affected by many factors

• Hypervisor, Over-subscription, Load synchronicityyp , p , y y• Underlying HW virtualization (AMD-V, Intel VT-x, srIOV, mrIOV)

• Some classes of applications are better suited for virtualization th ththan others

• Financial trading systems, Infrastructure Monitoring – Less compatible• Transaction processing – Depends on workloadTransaction processing Depends on workload• Web facing – Highly compatible

• Virtualization is best suited for dissimilar applications• Multi-threaded applications (databases) prefer bare metal• Intra-tier sensitivity may produce unexpected application effects

• A Virtualized DB, sensitive to virtualization latency, may result in low Web tier utilization and “Server is too busy” messages

• Cloud adds both on-site and connection latency

5| © Copyright 2011 | First Data Corporation

y

Page 6: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Traditional Metrics• CPU Utilization

• CPU is generally considered to be a proxy for all other performance/capacity vectors

• Memory

• Network

• I/O

• Assumption - Interference effects from other workloads do not occur for the CPU vector

• Free Memory• An indication of goodness

Vi li i i• Virtualization can over-commit memory• ccNUMA can add latency - local L1 cache vs. ‘remote’ inter-socket L2/L3 cache

• Storage ThroughputStorage Throughput• Typically specified as IOPs, MB/s• Assumption - Interference effects from other workloads do not occur for the CPU

vector

6| © Copyright 2011 | First Data Corporation

vector

Page 7: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

H D Vi t li ti Ch Thi ?How Does Virtualization Change Things?

• Shared-occupancy host environments exhibit not-easily-Shared occupancy host environments exhibit not easilyvisible interference effects

• Resource scheduler / dispatcher design sharpens this interference effect

• CPU busy and storage throughput at the guest level is no longer a reliable indicator of workload performancelonger a reliable indicator of workload performance

• Usage and throughput metrics are increasingly de-coupledp

7| © Copyright 2011 | First Data Corporation

Page 8: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Vi t li ti dd l d l tVirtualization adds layers and latencyApp App App App

2 vCPU/MemvFS, vFA , vNIC

2 vCPU/MemvFS, vFA , vNIC

4 vCPU/MemvFS, vFA , vNIC

4 vCPU/MemvFS, vFA , vNIC

12 vCPUs

Scheduling, Load Balancing & Bandwidth Management

Virtualization

Ove

r bs

crip

tion

tenc

y

pCPU pCPU pCPU pCPU pNIC pFANIC 4 pCPUs

Sub

crea

sing

Lat

p pCPU pCPU pCPU

Cache CacheCache

pNIC pFApNIC pFA 4 pCPUs

2 pNICs

Inc

pMemory

SANSwitch

2 pFAs

8| © Copyright 2011 | First Data Corporation

Page 9: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Each Layer Incurs LatencyYour Data

Your Application

Designs, Lessons and Advice from Building Large Distributed Systems“Numbers Everyone Should Know”

9| © Copyright 2011 | First Data Corporation

yJeff Dean Google Fellow

Page 10: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Virtualization exhibits the classic latency curve

10| © Copyright 2011 | First Data Corporation

Source: VMware ESX 3: Ready Time Observations - Feb 2004

Page 11: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Storage also follows the traditional latency curvecurve

11| © Copyright 2011 | First Data Corporation

Page 12: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Capacity Planning with Traditional Metrics%CPU < 60 70% Th ld i h !%CPU < 60-70% …..The world is happy! …. I think

Traditional allocation ofresources

Produces an ESX host with apparently significant capacity

Produces an ESX host with apparently significant capacity

12| © Copyright 2011 | First Data Corporation

Page 13: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Case StudyCase Study• Our application provides:

• Infrastructure discovery• Event correlation• Service assurance• Service assurance

• Deployed across entire enterprise• Composed of multiple tiers / componentsComposed of multiple tiers / components

• Collect data across thousands of monitored resources• Ingest and processing of eventsg p g• Console functionality for operation centers/admins

• Some components exhibit synchronous behavior • Reporting elements can be resource hogs

• Users characterized application as “Unusable”

13| © Copyright 2011 | First Data Corporation

Page 14: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Environment Capacity AuditN P bl F dNo Problem Found

Host CPU Utilization

MemoryUtilization

StorageThroughput

ESX Host A 16-20% ~12.5% 30MB/s

ESX Host B 19-25% ~28% 3MB/s

ESX Host C ~10% ~12.5% 22MB/s for App AApp AkB/s for App of Interest

14| © Copyright 2011 | First Data Corporation

Page 15: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Performance AnalysisSignificant waiting for resourcesSignificant waiting for resources

T diti l tili ti t iPerformance Test

Traditional utilization metrics indicate significant capacity

Virtualization metrics indicate i ifi i i f CPUsignificant waiting for CPU resources

%Rdy – Ready-to-Run but no physical CPU fresource free

5% - Could be sign of trouble

10% - There is a problem10% There is a problem

10-15% - In this case the application was unusable

15| © Copyright 2011 | First Data Corporation

Page 16: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Environment Performance AuditEnvironment Performance Audit

Host CPU Utilization

MemoryUtilization

StorageThroughput

CPU %Rdy

Latency

ESX Host A 16-20% ~12.5% 30MB/s 10%

ESX Host B 19-25% ~28% 3MB/s 35-55%

ESX Host C ~10% ~12.5% 22MB/s for App A

22-35%App AkB/s for App of Interest

16| © Copyright 2011 | First Data Corporation

Page 17: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

ResolutionResolution

• Starting point• Performance team brought in after many months of application team trying to

make application ‘usable’• Existing infrastructure - ESX 3.5 + vmware tools + multi-core ESC host + SAN• Followed a structured use-case driven performance test, analysis,

recommendations process

• Initial relatively short timeframe solutionInitial relatively short timeframe solution• ESX 4.1 + vmware tools + same multi-core ESX host + SAN• Result – better but still judged “Not Usable”

• Final solution• ESX 4.1 + vmware tools + new 24-core based server cluster + new dedicated

VM neighborhood layout + dedicated one-to-one vCPU-to-pCPU + load-VM neighborhood layout dedicated one to one vCPU to pCPU loadbalanced SAN

• Although one step away from a physical environment, virtualization still provides ‘ility’ benefits

17| © Copyright 2011 | First Data Corporation

p y

Page 18: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Performance issues resolved

%Rdy reduced to less than 5% !

CPU utilization has virtually quadrupled ! %CPU, a dependant metric,

has increased to acceptable levels

18| © Copyright 2011 | First Data Corporation

Page 19: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Latency can be cumulativeLatency can be cumulative

Web Business Intelligence DB DB ….

2 vCPU/MemvFS, vFA , vNIC

Intelligence

2 vCPU/MemvFS, vFA , vNIC

4 vCPU/MemvFS, vFA , vNIC

4 vCPU/MemvFS, vFA , vNIC

12 vCPUsFrom an overall service perspective the impact of %Rdyis application, neighborhood and path dependent

Scheduling, Load Balancing & Bandwidth Management

Virtualization

Ove

r bs

crip

tion

tenc

y From a host perspective the impact of %Rdy can be summation dependent

pCPU pCPU pCPU pCPU pNIC pFANIC 4 pCPUs

Sub

crea

sing

Lat summation dependent

For sequential task the cumulative delay of %Rdy is p pCPU pCPU pCPU

Cache CacheCache

pNIC pFApNIC pFA 4 pCPUs

2 pNICs

Inc

most likely unacceptable

pMemory

SANSwitch

2 pFAsCareful latency driven selection and layout of VMs is critical to service performance

19| © Copyright 2011 | First Data Corporation

Page 20: Virtualization and the Metrics of Performance ... - CMG€¦ · • Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent CPU capacity of a 1 0CPU capacity of a 1.0-1 2

Summaryy• Old metrics can be misleading

• Low CPU utilization may be an indication of significant waiting for CPU y g gresources

• Low storage throughput may be an indication of significant waiting for SAN resources

• Review your application for suitability• Financial trading systems, Infrastructure Monitoring – Less compatible• Transaction processing – Depends on workload• Web facing – Highly compatible

• Adopt /augment new metrics for virtualized environmentsAdopt /augment new metrics for virtualized environments• CPU – For vmware ESX control for %Rdy• Storage – Latency is unique to class, architecture and technology of storage

Similar to BTM, Control for Latency

20| © Copyright 2011 | First Data Corporation