Computer Measurement Group, India 1 Presented at CMG India 2nd Annual Conference 2015. Copyright © VMware Inc, 2015

www.cmgindia.org

Chasing Performance

Understand, Troubleshoot and Optimize Workload performance on vSphere Platform

Adarsh Jagadeeshwaran, Ramprasad K S, Sai Inabattini

VMware India

What we refine

• Introduction

• Virtualization Basics

• vSphere Platform Basics

• Scoping

• Approach

• Data Collection and Analysis

• IO Profiling

• Optimizing

Introduction

Virtualization Basics

Virtualizing x86…

• Non-virtualized (“native”) system

– The OS is designed to run on bare-metal hardware

• Assumption of full ownership with full privileges

– CPU privilege levels are used to restrict access

– Operating system code runs with higher privilege

– Application code has less privilege

– Privilege escalations are either faulted or ignored

• Challenges (Virtualization)

– Hardware is shared among multiple VMs

– Cannot let the guest run with the highest privilege

– Virtual machines need to be isolated and contained

• Classical “ring compression” or “de-privileging”

– Run the guest OS kernel in Ring 1

– Privileged instructions trap and are emulated by the VMM

• System calls, exceptions

• Privileged resources (IOAPIC)

– Not sufficient on x86, because not all privilege escalations can be trapped

Binary Translation

• Combination of Direct Execution and Binary Translation (BT)

– Virtual Machine Monitor (VMM) tracks the execution of the guest

– Just-in-time translation of privileged execution

• Translate the code that can cause privilege escalation

• Use a translation cache for reusable translations

– Direct execution of user-level code (Ring 3)

• Impacts

– Overhead of translation and of maintaining the cache

– Translated code is not as optimal as the original code

• VMM needs to run in the same address space as the guest

– VMM data region needs protection from guest access

– Use segmentation on 32-bit platforms to protect the VMM region

– Segmentation is not available on all 64-bit platforms

– BT is inefficient for running 64-bit guests on these platforms

Hardware Assist - CPU

• New execution modes for privileged execution

– Root mode for the VMM

– Non-root mode with Rings 0–3 for guest execution

– Additional control structures (VMCS/VMCB)

– Guest execution need not be monitored

• Preprogram the execution to trap privilege escalations and pass control to the VMM

• On exit to the VMM, the VMM inspects the guest execution and completes it on behalf of the guest

• Impacts

– Exit to the VMM can be expensive

– In-guest updates (page tables) are costlier to track

– IN/OUT instructions and memory-mapped I/O have higher overhead

Memory Virtualization

• Virtual Memory

– Applications see a contiguous, fixed address space: Virtual Address (VA)

– OS maps Virtual Address to Physical Address (PA)

• Uses page tables to store this VA to PA mapping (normal page size: 4K)

– CPU walks the page table for any translations

• Address in %CR3 = page table location

• Uses the TLB (Translation Lookaside Buffer) to cache the translations

• In a Virtual Machine

– An additional address translation is required

• VA -> PA -> MA

– VMM uses a shadow page table to map guest VA to MA (Machine Address), which replaces the guest page tables

– CPU uses the VA -> MA mapping (no two-level walks)

• Reduced cost at execution

[Figure: 0GB to 4GB address spaces of App 1 and App 2 mapped to Physical memory on a native machine, and to VM-Physical and then Machine memory in a virtual machine]
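
The VA -> PA -> MA composition behind a shadow page table can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not VMware's implementation: page tables are modelled as plain dictionaries keyed by page number, and the names `build_shadow_table`, `guest_pt` and `pmap` are hypothetical.

```python
def build_shadow_table(guest_pt, pmap):
    """Compose guest VA->PA and hypervisor PA->MA maps into one VA->MA map.

    guest_pt: dict of guest virtual page -> guest physical page (set up by the guest OS)
    pmap:     dict of guest physical page -> machine page (set up by the VMM)
    Both structures are simplified stand-ins for real page tables.
    """
    shadow = {}
    for va_page, pa_page in guest_pt.items():
        # Only guest physical pages currently backed by machine memory get an entry.
        if pa_page in pmap:
            shadow[va_page] = pmap[pa_page]
    return shadow


# Made-up page numbers, purely for illustration.
guest_pt = {0x10: 0x2, 0x11: 0x3, 0x12: 0x7}
pmap = {0x2: 0x9A, 0x3: 0x41}
shadow = build_shadow_table(guest_pt, pmap)
print(shadow)
```

In this toy model, virtual page 0x12 has a valid guest mapping but no shadow entry, which corresponds to the hidden page fault case: the VMM has to fill the entry without involving the guest.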

Memory Virtualization

• Impact of using Shadow Page Tables

– Tracing guest page table writes is expensive

– Cost of propagating the changes to the shadow page table

– Page faults need to be intercepted

• Hidden page faults (missing in the shadow page table)

• True page faults (forwarded to the guest)

– Context switches need to be monitored

• CR3 updates need to be intercepted to replace the guest page table with shadow page table

– Resources (memory) for maintaining shadow page tables

Hardware Assist - MMU

• Nested or Extended Page Tables

– Hardware assistance for two levels of address translation

• VA – PA mapping is set up by the guest as usual

• PA – MA mapping is set up by the hypervisor

– CPU walks the guest page tables and then the hypervisor page tables to arrive at the VA to MA mapping during a TLB fill

• Benefits

– No traces required (guest page table modifications are as fast as native)

– No exits on guest page faults (true faults) or context switches

– No memory overhead for shadow page tables

– Scalable across SMP

• Impact

– TLB fill is expensive because the page walk is costlier

– Using large pages absorbs some of this cost (2MB vs. 4KB)

• Smaller page tables, fewer levels to traverse

• Each TLB entry covers more memory, so there are fewer TLB misses
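
A rough worked example of why the nested walk is costlier and why large pages help. The sketch assumes a radix page table with the given number of levels on both the guest and the hypervisor side, and ignores page-walk caches; the 64-entry TLB is a hypothetical size chosen only to show the arithmetic.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for one TLB fill with nested paging.

    Every guest page table pointer, plus the final guest physical address,
    must itself be translated through the hypervisor page tables.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1


def tlb_reach_bytes(entries, page_size_bytes):
    """Amount of memory a TLB with `entries` entries can map at once."""
    return entries * page_size_bytes


KIB = 1024
MIB = 1024 * KIB

print(nested_walk_refs(4, 4))  # 4KB pages in guest and hypervisor
print(nested_walk_refs(3, 3))  # 2MB pages in both: one level fewer each
print(tlb_reach_bytes(64, 4 * KIB) // KIB, "KiB reach with 4KB pages")
print(tlb_reach_bytes(64, 2 * MIB) // MIB, "MiB reach with 2MB pages")
```

Under these assumptions a native 4-level walk needs 4 references, the nested walk needs up to 24, and backing both levels with 2MB pages brings that down to 15 while multiplying the TLB reach by 512.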

I/O Virtualization

• Software-based I/O virtualization

– I/O devices are virtual

– IN/OUT instructions are monitored and completed by the VMM

– Traps induce additional latency

• Hardware Assistance

– PCI pass-through

• A physical I/O device dedicated to guest

• Depends on availability of hardware support (IO-MMU)

• Guest has control over I/O device and transacts directly

– SR-IOV

• A single device can be shared across multiple guests

• Multiple Virtual Functions appear as if they are physical devices

• Needs hardware support and capable devices

– Good for latency sensitive applications

vSphere Platform Basics

vSphere ESXi

vSphere ESXi is a Type 1 hypervisor that runs directly on bare metal

• VMkernel is the purpose-built kernel which manages and schedules resources

• Resources are allocated based on entitlements and availability

• Guests run in isolation, controlled by a Virtual Machine Monitor (VMM)

• Device drivers load as VMkernel modules

• Other worlds include processes related to management of the server (hostd, vpxa, dcui, etc.)

CPU Scheduler

• Designed around fairness, with emphasis on responsiveness and resource utilization

– Share-based algorithm with an option to guarantee resources

– Highly scalable

• Supports 128 vCPUs / 4 TB RAM per virtual machine

• 480 logical CPUs on the host

• 1024 virtual machines / 4096 vCPUs per host

• 1:32 overcommitment ratio

• CPU time is distributed to virtual machines and other worlds (helpers, hostd, vpxa, sfcbd, etc.)

– Scheduler's default quantum is 50 ms

– World can be de-scheduled before this quantum expires if it becomes idle

– VM can overrun the quantum if it does not yield
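
The share-based idea can be illustrated with a deliberately simplified calculation. This is not the ESXi scheduler: it assumes every world demands CPU at the same time and ignores reservations, limits and idle worlds. The VM names, share values and host capacity are made up.

```python
def share_based_entitlement(capacity_mhz, shares):
    """Split capacity among fully contending worlds in proportion to their shares.

    Simplified sketch: no reservations, no limits, no redistribution of
    capacity that an idle world leaves unused.
    """
    total_shares = sum(shares.values())
    return {vm: capacity_mhz * s / total_shares for vm, s in shares.items()}


# Hypothetical host with 10,000 MHz to hand out and three contending VMs.
demo = share_based_entitlement(10000, {"vm-a": 2000, "vm-b": 1000, "vm-c": 1000})
print(demo)
```

With twice the shares, vm-a is entitled to twice the CPU time of each of the other two VMs, but only while all three are competing.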

Co-Scheduling

• SMP virtual machines (vSMP) pose a unique challenge for optimized resource usage

– Guest expects all CPUs to be scheduled at the same time (Co-Scheduling)

• But all CPUs might not have enough workload (waste??)

• Optimization

– Do not schedule siblings if they are idle

• Provides an illusion of synchronous progress

• Skew in execution can be dangerous (no forever run)

– Keep a check on skew and keep it bounded

– Strict Co-Scheduling

• If the skew is beyond threshold stop all sibling CPUs and schedule them together

• Can cause CPU fragmentation

– Relaxed Co-Scheduling

• The decision to stop or proceed is made for each sibling vCPU

– Not all siblings need to be stopped (only those that are too far ahead)
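
The difference between the strict and relaxed policies can be sketched as follows. This is a conceptual model only: skew is measured against the slowest sibling, and the 5 ms threshold and per-vCPU progress values are hypothetical, not ESXi defaults.

```python
def relaxed_costop(progress_ms, skew_threshold_ms):
    """Relaxed policy: stop only the vCPUs that are too far ahead of the slowest sibling."""
    slowest = min(progress_ms.values())
    return sorted(v for v, p in progress_ms.items() if p - slowest > skew_threshold_ms)


def strict_costop(progress_ms, skew_threshold_ms):
    """Strict policy: if any sibling exceeds the skew threshold, stop all of them."""
    if relaxed_costop(progress_ms, skew_threshold_ms):
        return sorted(progress_ms)
    return []


# Hypothetical guest progress per vCPU, in milliseconds.
progress = {"vcpu0": 100, "vcpu1": 101, "vcpu2": 99, "vcpu3": 110}
print(relaxed_costop(progress, skew_threshold_ms=5))
print(strict_costop(progress, skew_threshold_ms=5))
```

In this example only vcpu3 has run more than 5 ms ahead of the slowest sibling, so the relaxed policy stops that one vCPU while the strict policy would stop all four.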

Load Balancing

• Running worlds need to be load balanced across PCPUs to have proper resource utilization and responsiveness

– Worlds can be migrated across PCPUs (Pull or Push)

– Not all migrations are beneficial (cache/NUMA advantages are at risk)

• Virtual machines are scheduled across the NUMA nodes

• Based on either CPU load, memory load, or round robin

• Can trigger sub-optimal performance if the guest is not aware of the topology

• Rebalancing the workload takes into account

• CPU load (Short term imbalance or long term fairness)

• Effectiveness of Last Level Cache

• CPU and Memory Topology

• Action Affinity

• Cost of migration

• Migrations across NUMA nodes are carefully evaluated

Memory Scheduler

• Proportional share-based allocation (as with CPU)

– Reclaims allocated memory that is found to be idle

– Overcommitment-friendly memory conservation algorithms

• Transparent Page Sharing

• Memory Ballooning

• Memory Compression

• Swap to SSD (Low Latency Swap)

• Swap to File

• Large Page Support (2MB)

– Lower overhead for page tables

– TLB can cover more address space

– Amortizes the cost of Nested Page Table walks

Storage I/O Path

[Figure: storage I/O path through three layers: Virtual Machine / VMM, VMkernel, and Physical Infra. Components shown: Guest OS; SCSI HBA Emulation; SCSI Command Emulation; Virtual Disk (Virtual Mode RDM, VMDK, Snapshot); Physical Mode RDM; Block Device; File System Switch; VMFS, Virtual SAN, NFS; TCP/IP; Logical Device I/O Scheduler; Device and Path Management; Adapter I/O Scheduler; Device Drivers; I/O Adapter]

Network I/O Path

[Figure: network I/O path through three layers: Virtual Machine / VMM, VMkernel, and Physical Infra. Components shown: Guest OS; NIC Emulation; I/O Filters/Chain (Pre); Virtual Switch; I/O Filters/Chain (Post); VLAN, Team, Mirror, Offloads; Adapter I/O Scheduler; Device Drivers; I/O Adapter]

Scoping

Resource utilization is high, possibly impacting performance

Process takes more time in a VM when compared to Physical

Application startup time is too long

Application is performing poorly in VM

File copy is too slow

[Figure: file copy on a Physical Machine vs. a Virtual Machine]

Network access is too slow (Packet Drops??)

What and How

• Understanding the Issue

– Clearly define and scope the problem

– Collect details about the scenario in which it manifests

– Have we collected the evidence (statistics, results)?

– Do we see any specific pattern?

– Is it only one VM/application having the issue?

• Redefine

– What are the expectations (SLA, ...)?

– Is it a comparison against another system? If yes, is it apples to apples?

– Eliminate obvious resource-related mismatches

– Do we have enough resources allocated?

– Redefine the issue after factoring in all of the above

– Tools and measures, exit criteria

Required…..

• Good to have

– Platform-level benchmarks

– I/O benchmarks are handy

– Application profile

– SLA and comparison details

– More than one iteration of test and data collection

• Tools/Metrics

– vCenter data collection, ESXTOP, vSCSI Stats, Net Stats, vRealize Operations Manager

– Guest OS level statistics: Perfmon, SAR, vmstat, iostat

– Benchmarking tools: IOMeter, Netperf, Uperf, database stress tools

Approach

Where to look

• Application Level

– App-specific performance data

– Guest OS: CPU/memory, I/O statistics

• Virtualization Layer

– vCenter performance charts, limits, shares and other contentions, ESXTOP

• Physical Server Level

– CPU and memory saturation

– Power management

– I/O bandwidth

• Connectivity Layer

– Network/FC switches and data paths

– Packet loss, bandwidth utilization

• Peripheral Devices

– Utilization, latency, throughput

Methodology

Test or Monitor → Collect Data → Validate Data → Analyze Data → Tune

Benchmarking

• Real Program

– Run the program and measure the performance with respect to response/completion times

• Micro Benchmark

– Performance of a specific activity of a program is tested

– Short run that involves no I/O or only a small set of I/O

– Ex: memory allocation/deallocation, program loop

• Component Benchmark

– Measure raw performance of components

– CPU/memory, network I/O, storage I/O

• Kernel Benchmarking

– Targeted towards profiling kernel performance

• Synthetic Benchmarks

– Specifically written tests that mimic the operations of a program/platform

• I/O, Database Benchmarks

– Measure system throughput

Data Collection

Application and OS Stats

• Benchmark tools

– IOMeter, Netperf, uperf, sysbench, DB hammer, ...

• CPU Utilization

– Used by application

– Privileged (system) time vs. user time

– Load averages or processor queue length

• Memory Utilization

– Available (free)

– Pages/sec (swap I/O), page faults

• I/O

– Reads/writes, throughput

– Queue lengths

– Latency/wait times

– Network packets received, errors, discarded, resets

vSphere Layer

• vCenter Real time/Historic performance charts

– Start here as it provides historical data for comparison

– Minimum Collection Interval 20 Seconds (Refresh: 1 min)

– Enough if the problem window is longer (> ~3 minutes)

• Host Level Statistics

– The host does not store much historical data (real time only)

– esxtop provides real-time statistics (minimum interval: 2 sec)

• Provides CPU, memory, storage, network and other system-level statistics

• Can be intimidating for live analysis

• Collect the data using the batch mode option and perform offline analysis (remember to use the -a switch to include all statistics)

– vscsiStats, net-stats: I/O statistics
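
For the offline analysis of batch-mode data, a capture can be taken with something like `esxtop -b -a -d 2 -n 150 > capture.csv` (batch mode, all counters, 2-second delay, 150 samples) and then processed with a small script. The sketch below assumes the capture is a CSV whose first column is a timestamp and whose remaining columns are named counters; the host name, VM name and exact column headers in the sample are made up for illustration and will differ on a real host.

```python
import csv
import io


def average_counter(csv_text, needle):
    """Average every column whose header contains `needle` across all samples."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = [i for i, name in enumerate(header) if needle in name]
    sums = {header[i]: 0.0 for i in cols}
    rows = 0
    for row in reader:
        rows += 1
        for i in cols:
            sums[header[i]] += float(row[i])
    return {name: total / rows for name, total in sums.items()} if rows else {}


# Tiny synthetic capture standing in for a real batch-mode file.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["(PDH-CSV 4.0) (UTC)(0)",
                 r"\\esx01\Group Cpu(1001:web01)\% Ready",
                 r"\\esx01\Group Cpu(1001:web01)\% Used"])
writer.writerow(["09/07/2016 10:00:00", "4.0", "55.0"])
writer.writerow(["09/07/2016 10:00:02", "6.0", "65.0"])

averages = average_counter(sample.getvalue(), "% Ready")
print(averages)
```

Real captures taken with -a can contain thousands of columns, so filtering by a header substring before doing any arithmetic keeps the analysis manageable.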

Analysis

CPU Resources

• Availability of CPU time is key for any application's performance

– ESXi CPU scheduler distributes CPU time to all running worlds based on entitlements and reservations, using a share-based algorithm

– Several factors influence the availability and scheduling of CPU

– Contention can result in degraded performance

CPU Statistics

• What to Look At

– USED (Used): CPU time consumed by the world

– RUN (Run): Time for which world was scheduled on the CPU

– SYS (System): CPU time used by VMkernel on behalf of world

– WAIT (Wait): CPU time spent idle/busy/waiting for resource

– IDLE (Idle): World had no work to do

– RDY (Ready): World is waiting for availability of CPU resources

– CSTP (Co-Stop): For VMs with multiple vCPUs. Suggests vCPUs are waiting to be scheduled together

– SWPWT (Swap Wait): World waiting for Swap I/O completion

– DMD (Demand): Moving average of CPU demand from world

– LAT_C (Latency): Overall latency

CPU Usage

• Used and Run

– A high value for Used suggests a CPU-bound workload

– Compare with Demand

– Will adding more CPUs to the VM help?

• Used differs substantially from Run

– Scheduled CPU is not consumed by the VM

• Usually latency is associated with this scenario

– Check the power management settings for the active policy

• Too many transitions into C1 and deeper C-states can be a problem

• Look at the esxtop screen showing Power/CPU state data (“p”)

CPU Ready and Co-Stop

• Ready: Time spent waiting for the availability of CPU resources

– Guest wants to execute but no free CPU slot is available

• Co-Stop: Not all sibling vCPUs are making progress at the same rate

– Stop the vCPU which is ahead until the slower vCPUs can catch up

• Check the physical CPU vs. virtual CPU ratio

– A higher consolidation ratio can result in higher ready time for the VM
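
esxtop reports Ready as a percentage, while vCenter charts report it as a summation in milliseconds per sample. A small conversion helps when comparing the two. The sketch assumes the 20-second real-time collection interval mentioned earlier; dividing by the vCPU count to get a per-vCPU figure is an added convention for VM-level values, and the sample numbers are made up.

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms per sample) into a percentage.

    interval_s: length of the collection interval the summation covers.
    vcpus:      divide by the vCPU count to get an average per-vCPU figure
                when the summation is reported for the whole VM (assumption).
    """
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


print(ready_percent(2000))           # one vCPU, 2000 ms ready in a 20 s sample
print(ready_percent(2000, vcpus=4))  # same summation spread over 4 vCPUs
```

The same arithmetic applies to the Co-Stop summation.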

Memory Stats

• Memory State – Indicator of memory pressure

– High: Sufficient memory available to satisfy current demand

• No memory conservation activities necessary

– Soft (< 4% free): Software-based conservation/sharing techniques are used

• Balloon, Memory Compression, Low Latency Swap

– Hard (< 2% free): Swap to disk (swap file of the VM)

– Low (< 1% free): Allocations are delayed/stunned

• Not a workable scenario. VMs will be stunned (applications fail/disconnect)
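
The state thresholds on this slide can be expressed as a small classifier. The sketch assumes the percentages refer to free host memory and uses the slide's values as-is; actual thresholds depend on the ESXi version and host memory size, so treat the numbers as illustrative.

```python
def memory_state(free_pct):
    """Map the percentage of free host memory to the memory state named on the slide."""
    if free_pct < 1.0:
        return "low"    # allocations delayed, VMs stunned
    if free_pct < 2.0:
        return "hard"   # swap to disk
    if free_pct < 4.0:
        return "soft"   # ballooning, compression, low latency swap
    return "high"       # no reclamation needed


for pct in (10.0, 3.5, 1.5, 0.5):
    print(pct, memory_state(pct))
```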

Key Memory Stats

• NUMA related Statistics

– NRMEM (not in vCenter): Memory allocated on a remote NUMA node

– NLMEM (not in vCenter): Memory allocated on the local NUMA node

– N%L (not in vCenter): Percentage of memory on the local node

• Utilization related Statistics

– GRANT (Granted): Physical memory allocated

– TCHD (Active): Memory recently touched by the VM

– MCTLSZ (Balloon): Memory reclaimed using the memory control (balloon) driver

– SWCUR (Swapped): Memory that is swapped out

– SWW/s, SWR/s (Swap out/in): Current swap activity
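
The relationship between the three NUMA counters is simple arithmetic, shown here as a sketch. The function name and the sample values are hypothetical.

```python
def numa_local_percent(nlmem_mb, nrmem_mb):
    """Share of a VM's memory that sits on its local NUMA node (the N%L idea)."""
    total = nlmem_mb + nrmem_mb
    return 100.0 * nlmem_mb / total if total else 0.0


# Hypothetical VM with 6 GB local and 2 GB remote memory.
print(numa_local_percent(6144, 2048))
```

A value well below 100 means a noticeable part of the VM's memory accesses go to a remote node, which is worth correlating with the sizing of the VM against the NUMA layout.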

VM Stats in UI

[Figure: VM memory usage charts: no pressure vs. under pressure]

Storage Stats

• esxtop storage views

– Adapter (d)

– Disk Device (u)

– VM (v)

• What should be looked at

– Queue Statistics

– Latency Statistics

– Error Statistics

Latency – Why?

• Majority of the Device Latencies are caused by external infrastructure

– Busy Storage Device

– Load factor

– Burst I/O resulting from Backup, OS/Antivirus updates, Replication, Boot Storm

– Noisy neighbor issues

• Unoptimized access to the device (path policies and configuration)

• Unreliable fabric/network

– Faulty cables, adapters

• I/O to a disk with snapshots will see inconsistent latency

Network Stats

• Network issues generally manifest as failed connections, a high number of retransmissions, and packet drops

– TCP will recover, but the result is poor throughput. The application may not tolerate it.

• Network I/O can be more CPU intensive than storage I/O

– Focus should be on drop-related stats

– In the majority of cases, receive drops are associated with CPU contention

– Transmit drops are associated with overload of the physical network adapters

– Are we seeing too much broadcast traffic on the links?

IO Profiling

vSCSI Stats

• Data collection tool which provides insight into the I/O pattern of a virtual machine

– Can be used to profile the I/O requirements of an application

• I/O Size (Length)

• Queue Size (Outstanding I/O)

• Read/Write Ratio

• How Random (or How Sequential)

• Inter-arrival rate (How demanding)

• Latency (Read/Write)
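
To make these profile dimensions concrete, the sketch below derives a few of them from a list of I/O records. It does not parse vscsiStats output: the trace format (operation, starting block address, length in bytes), the 512-byte block size and the sample records are assumptions for illustration.

```python
from collections import Counter


def profile_io(trace, block_size=512):
    """Summarise an I/O trace given as (op, lba, length_bytes) tuples in issue order."""
    if not trace:
        raise ValueError("trace is empty")
    reads = sum(1 for op, _, _ in trace if op == "R")
    sizes = Counter(length for _, _, length in trace)
    sequential = 0
    for (_, lba, length), (_, next_lba, _) in zip(trace, trace[1:]):
        # An I/O counts as sequential if it starts where the previous one ended.
        if next_lba == lba + length // block_size:
            sequential += 1
    pairs = max(len(trace) - 1, 1)
    return {
        "read_pct": 100.0 * reads / len(trace),
        "sequential_pct": 100.0 * sequential / pairs,
        "size_histogram": dict(sizes),
    }


trace = [("R", 0, 4096), ("R", 8, 4096), ("W", 1000, 8192),
         ("W", 1016, 8192), ("R", 16, 4096)]
profile = profile_io(trace)
print(profile)
```

A profile like this (read/write mix, sequentiality, dominant I/O sizes) is what would be fed into a benchmark tool to reproduce the application's load against a storage device.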

Net Stats

• Can be used to profile Network I/O

– Type of Traffic (TCP/UDP, IPV4/IPV6)

– Transmit and Receive Rates (PPS)

– Transmit and Receive Packet Sizes

– Burst or flat (Cluster of Packets)

– Inter-arrival Time (How active)

– CPU Usage

Optimization

Sizing and Configuration

• Virtual Machine Sizing and Configuration

– Pick the right sizing for CPU and memory

• Over-allocation of CPU can result in scheduling conflicts

• Under-allocation of memory results in guest paging

• Over-allocation of memory results in higher overhead

• Ballooning away memory might be bad for memory-intensive applications

– Use pre-allocated, eager zeroed thick disks for I/O-intensive workloads

– pvSCSI and vmxnet3 have lower CPU requirements compared to other emulations

– Are we running with unwarranted/unnoticed limits/reservations?

• Server Hardware Considerations

– Make sure hardware assists for CPU/memory and I/O devices are available

– Are the power management settings optimal? (Default: Balanced)

• Can result in too conservative CPU usage in low-load scenarios

• I/O-bound workloads perform poorly in these conditions

• Disable C-states to turn off any power management (check the BIOS)

Workload Placement

• Planning should consider average as well as peak utilization

• Consolidation Ratios

– Make sure sufficient resources are available for CPU-bound critical workloads

– Higher consolidation ratios are acceptable for desktop/test workloads

• Size the VM with attention to the NUMA layout of the server

– VMs spread across multiple NUMA nodes can be a problem

– vNUMA helps to avoid this penalty

• Don't place all heavy hitters on the same hosts

– Use affinity/anti-affinity rules to guarantee placement

• DRS can help to keep the VMs happy

– Make sure vMotion works (vMotion normally needs some CPU resources)

• Use resource pools (with reservations) over VM-level reservations

– Resource pools also help to implement priority-based SLAs
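
The NUMA sizing point above can be turned into a quick check during planning. The sketch simply compares a VM's vCPU count and memory against one node's cores and memory; the host layout (12 cores and 128 GB per node) is hypothetical, and the check ignores hyper-threading and memory already used by other VMs.

```python
def fits_in_numa_node(vm_vcpus, vm_mem_gb, cores_per_node, mem_per_node_gb):
    """True if the VM can be placed entirely within a single NUMA node."""
    return vm_vcpus <= cores_per_node and vm_mem_gb <= mem_per_node_gb


# Hypothetical two-socket host: 12 cores and 128 GB per NUMA node.
print(fits_in_numa_node(8, 64, cores_per_node=12, mem_per_node_gb=128))
print(fits_in_numa_node(16, 64, cores_per_node=12, mem_per_node_gb=128))
print(fits_in_numa_node(8, 192, cores_per_node=12, mem_per_node_gb=128))
```

VMs that fail this check span nodes and are the ones for which a guest-visible NUMA topology matters.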

I/O Optimization

• Storage

– Use optimal path policies (vendor involvement might be required)

– Keep an eye on storage utilization and load

– Pick the right RAID level based on the usage pattern

– Storage I/O Control can help to reduce the impact of busy conditions

– Storage DRS can keep the load balanced

• Network

– Use hardware with offload features (savings on CPU time)

• Segmentation, checksum, etc.

– Jumbo frames have lower CPU overhead

• May not help if the average size of I/O transactions is small (~1500 bytes)

– Network I/O Control can help to prioritize the workload

• BIOS/Drivers

– Keep the driver and BIOS levels up to date
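
The jumbo frame point under Network can be checked with simple arithmetic: the CPU saving comes from handling fewer frames per transaction. The sketch treats the MTU as the usable payload per frame and ignores protocol headers, which is a simplification; the transaction sizes are examples.

```python
import math


def frames_needed(transaction_bytes, mtu_bytes):
    """Number of frames required to carry one transaction (headers ignored)."""
    return math.ceil(transaction_bytes / mtu_bytes)


for size in (1500, 65536):
    print(size, "bytes:",
          frames_needed(size, 1500), "standard frames vs.",
          frames_needed(size, 9000), "jumbo frames")
```

A 64 KB transfer drops from 44 frames to 8, while a transaction of about 1500 bytes needs one frame either way, which is why jumbo frames do not help small-transaction workloads.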

Latency Sensitive Application

• Source of Latency

– Virtualization overhead

– CPU Contention (Consolidation Ratio)

• Additional Contention in case of using Hyper Threading

– Techniques used for optimizing the CPU usage for network handling

• Interrupt coalescing

• Large Receive Offload (LRO)

– CPU Power Management

• C-State Transitions

• Optimization

– Use virtual machine settings to mark the latency sensitivity of the VM

• Exclusive access to CPU resources

• Limits the de-scheduling of the VM

• Use of coalescing and other optimization methods is limited

– Use SR-IOV based pass-through for further reduction in latency

Questions?

Thank You

Appendix

• Performance Monitoring Utilities: resxtop and esxtop

– https://pubs.vmware.com/vsphere-60/index.jsp#com.vmware.vsphere.monitoring.doc/GUID-A31249BF-B5DC-455B-AFC7-7D0BBD6E37B6.html

• vScsiStats

– http://cormachogan.com/2013/07/10/getting-started-with-vscsistats/

• NetStats

– Command line tool

• From a shell session run ‘net-stats -h’ for details

• Latency Sensitive Applications

– http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

• vCenter Performance Data Overview

– http://pubs.vmware.com/vsphere-60/topic/com.vmware.vsphere.monitoring.doc/GUID-AA1F733C-1450-4437-AB0A-E5FA24CC386E.html