virtualization and the metrics of performance ... - cmg€¦ · • cloud - ex: aws; “one ec2...
TRANSCRIPT
23 S t b 2011
Virtualization and the Metrics of 23 September 2011
Performance & Capacity Management Has the world changed?
© Copyright 2011 | First Data Corporation
Mark Preston
A dAgenda
• Reality CheckReality Check …….• General Observations• Traditional metrics for a non-virtual environmentTraditional metrics for a non virtual environment• How virtualization scheduling/dispatching of shared
environments can change the paradigm• Case Study• Conclusions• Discussion
2| © Copyright 2011 | First Data Corporation
R lit Ch kReality Check ……• Application users don’t care about system utilization as
long as response time is metlong as response time is met• Consumers purchase units of computing to support
functionality, not CPU cyclesfunctionality, not CPU cycles• Internal – Clusters, SAN, Networking• Cloud - Ex: AWS; “One EC2 Compute Unit provides the equivalent
CPU capacity of a 1 0 1 2 GHz 2007 Opteron or 2007 Xeon processorCPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor“
• Virtualized environments exhibit a range of latency & i hb h d iti itneighborhood sensitivity
• Shared environment interference effects are realConsumers of IT are realizing that in a virtual world• Consumers of IT are realizing that in a virtual world utilization based capacity planning is suspect, if not useless in some cases
3| © Copyright 2011 | First Data Corporation
R lit Ch kReality Check ……• System building blocks incur latency• Virtualization is not a free technology (adds another layer)• In the drive to reduce enterprise cost host are over-subscribed• Over-subscribed host and shared resources (Multi-CPU, I/O
consolidation) require scheduling, which incurs overhead• The application team doesn’t have control of the neighborhood• The application team doesn t have control of the neighborhood• Let’s reword the opening Reality Check
• Application admin’s don’t care about system utilization as long asApplication admin s don t care about system utilization as long as response time is met
4| © Copyright 2011 | First Data Corporation
General Observations• The neighborhood is affected by many factors
• Hypervisor, Over-subscription, Load synchronicityyp , p , y y• Underlying HW virtualization (AMD-V, Intel VT-x, srIOV, mrIOV)
• Some classes of applications are better suited for virtualization th ththan others
• Financial trading systems, Infrastructure Monitoring – Less compatible• Transaction processing – Depends on workloadTransaction processing Depends on workload• Web facing – Highly compatible
• Virtualization is best suited for dissimilar applications• Multi-threaded applications (databases) prefer bare metal• Intra-tier sensitivity may produce unexpected application effects
• A Virtualized DB, sensitive to virtualization latency, may result in low Web tier utilization and “Server is too busy” messages
• Cloud adds both on-site and connection latency
5| © Copyright 2011 | First Data Corporation
y
Traditional Metrics• CPU Utilization
• CPU is generally considered to be a proxy for all other performance/capacity vectors
• Memory
• Network
• I/O
• Assumption - Interference effects from other workloads do not occur for the CPU vector
• Free Memory• An indication of goodness
Vi li i i• Virtualization can over-commit memory• ccNUMA can add latency - local L1 cache vs. ‘remote’ inter-socket L2/L3 cache
• Storage ThroughputStorage Throughput• Typically specified as IOPs, MB/s• Assumption - Interference effects from other workloads do not occur for the CPU
vector
6| © Copyright 2011 | First Data Corporation
vector
H D Vi t li ti Ch Thi ?How Does Virtualization Change Things?
• Shared-occupancy host environments exhibit not-easily-Shared occupancy host environments exhibit not easilyvisible interference effects
• Resource scheduler / dispatcher design sharpens this interference effect
• CPU busy and storage throughput at the guest level is no longer a reliable indicator of workload performancelonger a reliable indicator of workload performance
• Usage and throughput metrics are increasingly de-coupledp
7| © Copyright 2011 | First Data Corporation
Vi t li ti dd l d l tVirtualization adds layers and latencyApp App App App
2 vCPU/MemvFS, vFA , vNIC
2 vCPU/MemvFS, vFA , vNIC
4 vCPU/MemvFS, vFA , vNIC
4 vCPU/MemvFS, vFA , vNIC
12 vCPUs
Scheduling, Load Balancing & Bandwidth Management
Virtualization
Ove
r bs
crip
tion
tenc
y
pCPU pCPU pCPU pCPU pNIC pFANIC 4 pCPUs
Sub
crea
sing
Lat
p pCPU pCPU pCPU
Cache CacheCache
pNIC pFApNIC pFA 4 pCPUs
2 pNICs
Inc
pMemory
SANSwitch
2 pFAs
8| © Copyright 2011 | First Data Corporation
Each Layer Incurs LatencyYour Data
Your Application
Designs, Lessons and Advice from Building Large Distributed Systems“Numbers Everyone Should Know”
9| © Copyright 2011 | First Data Corporation
yJeff Dean Google Fellow
Virtualization exhibits the classic latency curve
10| © Copyright 2011 | First Data Corporation
Source: VMware ESX 3: Ready Time Observations - Feb 2004
Storage also follows the traditional latency curvecurve
11| © Copyright 2011 | First Data Corporation
Capacity Planning with Traditional Metrics%CPU < 60 70% Th ld i h !%CPU < 60-70% …..The world is happy! …. I think
Traditional allocation ofresources
Produces an ESX host with apparently significant capacity
Produces an ESX host with apparently significant capacity
12| © Copyright 2011 | First Data Corporation
Case StudyCase Study• Our application provides:
• Infrastructure discovery• Event correlation• Service assurance• Service assurance
• Deployed across entire enterprise• Composed of multiple tiers / componentsComposed of multiple tiers / components
• Collect data across thousands of monitored resources• Ingest and processing of eventsg p g• Console functionality for operation centers/admins
• Some components exhibit synchronous behavior • Reporting elements can be resource hogs
• Users characterized application as “Unusable”
13| © Copyright 2011 | First Data Corporation
Environment Capacity AuditN P bl F dNo Problem Found
Host CPU Utilization
MemoryUtilization
StorageThroughput
ESX Host A 16-20% ~12.5% 30MB/s
ESX Host B 19-25% ~28% 3MB/s
ESX Host C ~10% ~12.5% 22MB/s for App AApp AkB/s for App of Interest
14| © Copyright 2011 | First Data Corporation
Performance AnalysisSignificant waiting for resourcesSignificant waiting for resources
T diti l tili ti t iPerformance Test
Traditional utilization metrics indicate significant capacity
Virtualization metrics indicate i ifi i i f CPUsignificant waiting for CPU resources
%Rdy – Ready-to-Run but no physical CPU fresource free
5% - Could be sign of trouble
10% - There is a problem10% There is a problem
10-15% - In this case the application was unusable
15| © Copyright 2011 | First Data Corporation
Environment Performance AuditEnvironment Performance Audit
Host CPU Utilization
MemoryUtilization
StorageThroughput
CPU %Rdy
Latency
ESX Host A 16-20% ~12.5% 30MB/s 10%
ESX Host B 19-25% ~28% 3MB/s 35-55%
ESX Host C ~10% ~12.5% 22MB/s for App A
22-35%App AkB/s for App of Interest
16| © Copyright 2011 | First Data Corporation
ResolutionResolution
• Starting point• Performance team brought in after many months of application team trying to
make application ‘usable’• Existing infrastructure - ESX 3.5 + vmware tools + multi-core ESC host + SAN• Followed a structured use-case driven performance test, analysis,
recommendations process
• Initial relatively short timeframe solutionInitial relatively short timeframe solution• ESX 4.1 + vmware tools + same multi-core ESX host + SAN• Result – better but still judged “Not Usable”
• Final solution• ESX 4.1 + vmware tools + new 24-core based server cluster + new dedicated
VM neighborhood layout + dedicated one-to-one vCPU-to-pCPU + load-VM neighborhood layout dedicated one to one vCPU to pCPU loadbalanced SAN
• Although one step away from a physical environment, virtualization still provides ‘ility’ benefits
17| © Copyright 2011 | First Data Corporation
p y
Performance issues resolved
%Rdy reduced to less than 5% !
CPU utilization has virtually quadrupled ! %CPU, a dependant metric,
has increased to acceptable levels
18| © Copyright 2011 | First Data Corporation
Latency can be cumulativeLatency can be cumulative
Web Business Intelligence DB DB ….
2 vCPU/MemvFS, vFA , vNIC
Intelligence
2 vCPU/MemvFS, vFA , vNIC
4 vCPU/MemvFS, vFA , vNIC
4 vCPU/MemvFS, vFA , vNIC
12 vCPUsFrom an overall service perspective the impact of %Rdyis application, neighborhood and path dependent
Scheduling, Load Balancing & Bandwidth Management
Virtualization
Ove
r bs
crip
tion
tenc
y From a host perspective the impact of %Rdy can be summation dependent
pCPU pCPU pCPU pCPU pNIC pFANIC 4 pCPUs
Sub
crea
sing
Lat summation dependent
For sequential task the cumulative delay of %Rdy is p pCPU pCPU pCPU
Cache CacheCache
pNIC pFApNIC pFA 4 pCPUs
2 pNICs
Inc
most likely unacceptable
pMemory
SANSwitch
2 pFAsCareful latency driven selection and layout of VMs is critical to service performance
19| © Copyright 2011 | First Data Corporation
Summaryy• Old metrics can be misleading
• Low CPU utilization may be an indication of significant waiting for CPU y g gresources
• Low storage throughput may be an indication of significant waiting for SAN resources
• Review your application for suitability• Financial trading systems, Infrastructure Monitoring – Less compatible• Transaction processing – Depends on workload• Web facing – Highly compatible
• Adopt /augment new metrics for virtualized environmentsAdopt /augment new metrics for virtualized environments• CPU – For vmware ESX control for %Rdy• Storage – Latency is unique to class, architecture and technology of storage
Similar to BTM, Control for Latency
20| © Copyright 2011 | First Data Corporation