cmgi 2015 tt1 s4 chasing-performance

Computer Measurement Group, India. Presented at CMG India 2nd Annual Conference 2015. Copyright © VMware Inc, 2015

www.cmgindia.org

Chasing Performance

Understand, Troubleshoot and Optimize Workload performance on vSphere Platform

Adarsh Jagadeeshwaran, Ramprasad K S, Sai Inabattini

VMware India


What we refine

• Introduction

• Virtualization Basics

• vSphere Platform Basics

• Scoping

• Approach

• Data Collection and Analysis

• IO Profiling

• Optimizing


Introduction


Virtualization Basics


Virtualizing x86…

• Non-virtualized (“native”) system

– OS is designed to run on bare metal hardware

• Assumption of full ownership with full privileges

– Use the CPU privilege levels to restrict access

– Operating system code runs with higher privilege

– Application code has less privilege

– Privilege escalations are either faulted or ignored

• Challenges (Virtualization)

– Hardware is shared among multiple VMs

– Cannot let the guest run with the highest privilege

– Virtual Machines need to be isolated and contained

• Classical “ring compression” or “de-privileging”

– Run guest OS kernel in Ring 1

– Privileged instructions trap and are emulated by the VMM

• System calls, exceptions

• Privileged resources (IOAPIC)

– Not enough for x86, as not all privilege escalations can be trapped


Binary Translation

• Combination of Direct Execution and Binary Translation (BT)

– Virtual Machine Monitor (VMM) tracks the execution of the guest

– Just in time translation of privileged execution

• Translate the code that can cause privilege escalation

• Use translation cache for reusable translations

– Direct execution of user level code (Ring 3)

• Impacts – Overhead of translation and cache

– Translated code is not as optimal as original code

• VMM needs to run in the same address space as guest – VMM data region needs protection from guest access

– Use segmentation on 32 bit platform to protect VMM region

– Segmentation is not available on all 64-bit platforms

– BT is inefficient for running 64-bit guests on these platforms


Hardware Assist - CPU

• New execution modes for privileged execution

– Root mode for VMM

– Non-root mode with Ring 0 – 3 for guest execution

– Additional control structures (VMCS/VMCB)

– Guest execution need not be monitored

• Preprogram the execution to trap privilege escalations and pass the control to VMM

• On exit to VMM, VMM inspects the guest execution and completes on behalf of guest

• Impacts – Exit to VMM can be expensive

– In-guest updates (page table) are costlier to track

– IN/OUT instruction and memory mapped I/O have higher overhead


Memory Virtualization

• Virtual Memory

– Applications see a contiguous fixed address space: Virtual Address (VA)

– OS maps Virtual Address to Physical Address (PA)

• Uses Page Tables to store this VA to PA mapping. (normal page size: 4K)

– CPU walks the page table for any translations

• Address in %CR3 = page table location

• Uses TLB (Translation look-aside Buffers) to cache the translations

• In a Virtual Machine – Additional address translation is required

• VA -> PA -> MA

– VMM uses shadow page table to map Guest VA to MA and replaces the guest page tables

– CPU uses VA -> MA mapping (no two level walks)

• Reduced cost at execution
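The shadow page table idea above can be sketched as composing the two mappings ahead of time, so that a lookup at execution time needs only one step. This is a toy model, not how a VMM is implemented: the dictionaries, page numbers and function names are all hypothetical.

```python
"""Toy model of a shadow page table: compose guest VA->PA with hypervisor PA->MA."""

PAGE_SIZE = 4096  # normal 4K page size, as on the slide


def build_shadow(guest_pt, host_pt):
    """Return a VA-page -> MA-page table for every guest mapping the host backs."""
    return {va: host_pt[pa] for va, pa in guest_pt.items() if pa in host_pt}


def translate(shadow_pt, vaddr):
    """Translate a virtual address with a single lookup; None models a page fault."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = shadow_pt.get(page)
    return None if frame is None else frame * PAGE_SIZE + offset


guest_pt = {0: 7, 1: 3}    # guest page tables: VA page -> PA page (hypothetical)
host_pt = {7: 42, 3: 19}   # hypervisor mapping: PA page -> MA page (hypothetical)
shadow = build_shadow(guest_pt, host_pt)
print(shadow)                        # VA page -> MA page
print(translate(shadow, 4096 + 5))   # address inside VA page 1
```

Every change the guest makes to `guest_pt` would require rebuilding the affected `shadow` entries, which is where the tracing cost of this technique comes from.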

[Figure: address spaces (0GB to 4GB) of App 1 and App 2 mapped to Physical and Machine memory, natively and in a VM (VM - Physical)]


Memory Virtualization

• Impact of using Shadow Page Tables – Tracing the guest page table writes is expensive

– Cost of Propagating the changes to shadow page table

– Page faults need to be intercepted

• Hidden page faults (Missed in Shadow page table)

• True page fault (Forward to guest)

– Context switches to be monitored

• CR3 updates need to be intercepted to replace the guest page table with shadow page table

– Resources (memory) for maintaining shadow page tables


Hardware Assist - MMU

• Nested or Extended Page Tables – Hardware assistance for two level of address translations

• VA – PA mapping is set up by guest as usual

• PA – MA mapping is set up by hypervisor

– CPU walks guest page tables and then hypervisor page tables to arrive at the VA to MA mapping during TLB fill

• Benefits – No traces required (Guest page table modifications are as fast as native)

– No exits on guest page faults (True faults), Context switches

– No memory overhead for shadow page tables

– Scalable across SMP

• Impact – TLB fill is expensive and page walk is costlier

– Using large pages will absorb some costs here (2MB v/s 4KB)

• Smaller page tables – fewer levels to traverse

• Effective TLB capacity increases, so there are fewer TLB misses
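A rough way to see why the nested walk is costlier and why large pages help: in the worst case each guest page-table pointer, plus the final guest physical address, needs its own walk of the hypervisor tables. The sketch below assumes 4-level tables for 4KB pages and 3 levels for 2MB pages (x86-64); the TLB entry count is a hypothetical figure.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for a two-level (nested) page walk.

    Each guest level costs one host walk plus one guest entry read, and the
    final guest physical address costs one more host walk:
    g * (h + 1) + h == (g + 1) * (h + 1) - 1.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1


def tlb_reach_bytes(entries, page_size):
    """Amount of memory the TLB can map without a miss."""
    return entries * page_size


KB, MB = 1024, 1024 * 1024
print(nested_walk_refs(4, 4))          # 4KB pages in guest and hypervisor
print(nested_walk_refs(3, 3))          # 2MB pages in guest and hypervisor
print(tlb_reach_bytes(64, 4 * KB))     # hypothetical 64-entry TLB, 4KB pages
print(tlb_reach_bytes(64, 2 * MB))     # same TLB, 2MB pages
```

A native 4-level walk needs at most 4 references, so the nested case is several times more expensive per TLB miss, and large pages reduce both the cost per miss and the number of misses.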


I/O Virtualization

• Software based I/O virtualization – I/O devices are virtual

– IN/OUT instructions are monitored and completed by VMM

– Impact of traps will induce additional latency

• Hardware Assistance – PCI pass through

• A physical I/O device dedicated to guest

• Depends on availability of hardware support (IO-MMU)

• Guest has control over I/O device and transacts directly

– SR-IOV

• A single device can be shared across multiple guests

• Multiple Virtual Functions appear as if they are physical devices

• Needs hardware support and capable devices

– Good for latency sensitive applications


vSphere Platform Basics


vSphere ESXi

vSphere ESXi is a Type 1 hypervisor that runs directly on bare metal

• VMkernel is the purpose-built kernel which manages and schedules resources

• Resources are allocated based on entitlements and availability

• Guests run in isolation, controlled by a Virtual Machine Monitor (VMM)

• Device drivers load as VMkernel modules

• Other worlds include processes related to management of the server (hostd, vpxa, dcui etc…)


CPU Scheduler

• Designed around being fair (Fairness) with emphasis on responsiveness and resource utilization – Share based algorithm with option to guarantee resources

– Highly Scalable

• Supports 128 vCPUs /4 TB RAM per Virtual machine

• 480 Logical CPUs on the host

• 1024 Virtual Machines/4096 vCPUs per host

• 1:32 over commitment ratio

• CPU time is distributed to virtual machines and other worlds (helpers, hostd, vpxa, sfcbd etc…) – Scheduler's default quantum is 50 ms

– World can be de-scheduled before this quantum expires if it becomes idle

– VM can overrun the quantum if it does not yield
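As an illustration of share based allocation, the sketch below splits a host's CPU capacity in proportion to shares when every world is fully contending. It is a deliberate simplification: reservations, limits and idle worlds are ignored, and the world names, share values and capacity are hypothetical.

```python
def entitlements(capacity_mhz, shares):
    """Split capacity in proportion to shares (all worlds assumed busy)."""
    total = sum(shares.values())
    return {name: capacity_mhz * s / total for name, s in shares.items()}


# Hypothetical worlds with 2:1:1 shares on an 8000 MHz host.
result = entitlements(8000, {"vm-a": 2000, "vm-b": 1000, "vm-c": 1000})
print(result)
```

When a world goes idle, its unused entitlement is available to the others, which is why shares only matter under contention.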


Co-Scheduling

• SMP Virtual Machines (vSMP) pose unique challenge for optimized resource usage

– Guest expects all CPUs to be scheduled at the same time (Co-Scheduling)

• But all CPUs might not have enough workload (waste??)

• Optimization

– Do not schedule siblings if they are idle

• Provides an illusion of synchronous progress

• Skew in execution can be dangerous (a vCPU cannot run ahead of its siblings forever)

– Keep a check on skew and keep it bounded

– Strict Co-Scheduling

• If the skew is beyond threshold stop all sibling CPUs and schedule them together

• Can cause CPU fragmentation

– Relaxed Co-Scheduling

• The decision to stop or proceed is made for each sibling vCPU

– Not all siblings need to be stopped (only those that are far ahead should stop)
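The difference between strict and relaxed co-scheduling can be sketched as follows. Skew is measured here as progress relative to the slowest sibling; the threshold value, the progress numbers and the function name are assumptions made for illustration, not the scheduler's actual accounting.

```python
def vcpus_to_stop(progress_ms, threshold_ms, strict=False):
    """Return the vCPU ids that should be co-stopped.

    progress_ms maps vCPU id -> guest progress. A vCPU is 'ahead' when its
    lead over the slowest sibling exceeds threshold_ms.
    """
    slowest = min(progress_ms.values())
    ahead = [v for v, p in progress_ms.items() if p - slowest > threshold_ms]
    if strict and ahead:
        return sorted(progress_ms)   # strict: stop every sibling
    return sorted(ahead)             # relaxed: stop only the leaders


progress = {0: 100, 1: 92, 2: 90, 3: 91}   # hypothetical progress per vCPU
print(vcpus_to_stop(progress, threshold_ms=5))
print(vcpus_to_stop(progress, threshold_ms=5, strict=True))
```

In the relaxed case only vCPU 0 is stopped, so the remaining siblings keep running and fewer physical CPUs have to be free at the same time.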


Load Balancing

• Running worlds need to be load balanced across PCPUs to have proper resource utilization and responsiveness

– Worlds can be migrated across PCPUs (Pull or Push)

– Not All migrations can be good (Cache/NUMA advantages at risk)

• Virtual Machines are scheduled across the NUMA nodes

• Based on either CPU, Memory load or Round Robin

• Can trigger sub-optimal performance if guest is not aware of topology

• Rebalancing the workload takes into account

• CPU load (Short term imbalance or long term fairness)

• Effectiveness of Last Level Cache

• CPU and Memory Topology

• Action Affinity

• Cost of migration

• Migrations across NUMA nodes are carefully evaluated


Memory Scheduler

• Proportional share based allocation (as CPU)

– Will reclaim the allocation if found idle

– Over commitment friendly memory conservation algorithms

• Transparent Page Sharing

• Memory Ballooning

• Memory Compression

• Swap to SSD (Low Latency Swap)

• Swap to File

• Large Page Support (2MB)

– Lesser overhead for page tables

– TLB can hold more addressable space

– Amortizes the cost of Nested Page Table walks


Storage I/O Path

[Figure: Storage I/O path. Virtual Machine / VMM: Guest OS, SCSI HBA Emulation. VMkernel: SCSI Command Emulation; Virtual Disk (Virtual Mode RDM, VMDK, Snapshot); Physical Mode RDM; File System Switch (VMFS, Virtual SAN, NFS, TCP/IP); Block Device; Logical Device I/O Scheduler; Device and Path Management; Adapter I/O Scheduler; Device Drivers. Physical Infra: I/O Adapter.]


Network I/O Path

[Figure: Network I/O path. Virtual Machine / VMM: Guest OS, NIC Emulation. VMkernel: I/O Filters/Chain (Pre), Virtual Switch (VLAN, Team, Mirror, Offloads), I/O Filters/Chain (Post), Adapter I/O Scheduler, Device Drivers. Physical Infra: I/O Adapter.]


Scoping


Resource utilization is high possibly impacting the performance


Process takes more time in a VM when compared to Physical


Application startup time is too long


Application is performing poorly in VM


File copy is too slow

[Figure: file copy, Physical Machine vs. Virtual Machine]


Network access is too slow (Packet Drops?)


What and How

• Understanding the Issue – Clearly define and scope the problem

– Collect details about the scenario of manifestation

– Have we collected the evidence (Statistics, Results)?

– Do we have any specific pattern

– Is it only one VM/Application having the issue?

• Redefine – What are the expectations (SLA….)

– Is it a comparison against something else? If yes, is it apples to apples?

– Eliminate obvious resource related mismatches

– Do we have enough resources allocated?

– Redefine the issue after factoring all above

– Tools and Measures, Exit criteria


Required…..

• Good to have – Platform level benchmarks

– I/O benchmarks are handy

– Application Profile

– SLA and comparison details

– More than one iteration of test and data collection

• Tools/Metrics – vCenter Data Collection, ESXTOP, vSCSI Stats, Net Stats, vRealize Operations Manager

– Guest OS Level Statistics

– Perfmon, SAR, vmstat, iostat

– Benchmarking Tools

– IOMeter, Netperf, Uperf, Database Stress tools


Approach


Where to look

• Application Level – App specific Performance data

– Guest OS - CPU/Memory, I/O Statistics

• Virtualization layer – vCenter Performance Charts, Limits, Shares & other Contentions, ESXTOP

• Physical Server Level – CPU and Memory Saturation

– Power Management

– I/O Bandwidth

• Connectivity Layer – Network/FC Switches and data paths.

– Packet Loss, Bandwidth Utilization

• Peripheral Devices – Utilization, Latency, Throughput


Methodology

[Figure: methodology flow: Test or Monitor, Collect Data, Validate Data, Analyze Data, Tune]


Benchmarking

• Real Program – Run the program and measure the performance with respect to response/completion times

• Micro Benchmark – Performance of a specific activity of a program is tested – Short run and involves no/small set of IO – Ex: Memory Allocation/De Allocation, Program Loop

• Component Benchmark – Measure raw performance of components – CPU/ Memory, Network IO, Storage IO

• Kernel Benchmarking – Targeted towards profiling kernel performance

• Synthetic Benchmarks – Specifically written tests to mimic the operations from a program/platform

• I/O, Database benchmarks – Measure systems throughput


Data Collection


Application and OS Stats

• Benchmark tools – IOMeter, Netperf, uperf, sysbench, DB hammer…..

• CPU Utilization – Used by Application – Privileged (System) Time V/s User Time – Load Averages or Processor Queue Length

• Memory Utilization – Available (Free) – Pages/Sec (Swap I/O), Page Faults

• I/O – Read/Writes, Throughput – Queue Lengths – Latency/Wait Times – Network Packets Received, Errors, Discarded, Resets


vSphere Layer

• vCenter Real time/Historic performance charts

– Start here as it provides historical data for comparison

– Minimum Collection Interval 20 Seconds (Refresh: 1 min)

– Enough if the problem window is longer (> ~3 minutes)

• Host Level Statistics

– Host does not store much of historic data (Real time only)

– esxtop provides real-time statistics (minimum interval: 2 Sec)

• Provides CPU, Memory, Storage, Network and other System Level Statistics

• Can be intimidating to perform live analysis

• Collect the data using the batch mode option and perform offline analysis (remember to use the -a switch to include all stats)

– vscsiStats, net-stats – I/O statistics
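A batch capture such as `esxtop -b -a -d 2 -n 30 > capture.csv` produces a wide CSV with one column per counter, which is easier to analyze offline with a small script. The sketch below pulls out the columns whose name contains a given counter. The column naming shown in the sample (host, group and counter separated by backslashes) is an assumption about the file layout, and the sample values are made up.

```python
import csv
import io


def counter_series(csv_text, needle):
    """Return {column name: [float samples]} for columns whose name contains needle."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


# Hypothetical two-sample capture; a real file has thousands of columns.
SAMPLE = "\n".join([
    r'"Time","\\host1\Group Cpu(1001:vm-a)\% Ready","\\host1\Group Cpu(1001:vm-a)\% Used"',
    '"10:00:00","4.5","60.0"',
    '"10:00:02","7.5","62.0"',
])
ready = counter_series(SAMPLE, "% Ready")
for name, samples in ready.items():
    print(name, max(samples))
```

For a real capture, read the file contents with `open(...).read()` and pass them in place of `SAMPLE`.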


Analysis


CPU Resources

• Availability of CPU time is key for any application performance – ESXi CPU scheduler distributes CPU time to all running worlds based on the entitlements and reservations using a share based algorithm

– Several factors influence the availability of scheduling of CPU

– Contentions can result in degraded performance


CPU Statistics

• What to look at

– USED (Used): CPU time consumed by the world

– RUN (Run): Time for which the world was scheduled on the CPU

– SYS (System): CPU time used by VMkernel on behalf of the world

– WAIT (Wait): Time the world spent blocked or busy-waiting for a resource (includes idle time)

– IDLE (Idle): World had no work to do

– RDY (Ready): World is ready to run but waiting for CPU resources

– CSTP (Co-Stop): For VMs with multiple vCPUs. Suggests vCPUs are waiting to be scheduled together

– SWPWT (Swap Wait): World waiting for swap I/O completion

– DMD (Demand): Moving average of CPU demand from the world

– LAT_C (Latency): Time the world was ready to run but not scheduled because of CPU contention


CPU Usage

• Used and Run

– A high value for Used suggests a CPU-bound workload

– Compare with Demand

– Will adding more CPU to the VM help?

• Used substantially differs from Run

– Scheduled CPU is not consumed by the VM

• Latency is usually associated with this scenario

– Check the Power Management settings for the active policy

• Too many transitions into C1 and deeper states can be a problem

• Look at the esxtop page showing Power/CPU state data (“p”)


CPU Ready and Co-Stop

• Ready: Time spent waiting for the availability of a CPU resource – Guest wants to execute but no free CPU slot is available

• Co-Stop: Sibling vCPUs are not all making progress at the same rate – Stop the vCPU which is ahead until the slower vCPUs can catch up

• Check the Physical CPU vs. Virtual CPU ratio – A higher consolidation ratio can result in higher ready time for the VM
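vCenter charts report CPU ready as a summation in milliseconds per sample, while esxtop reports a percentage, so it helps to convert one into the other. The sketch below assumes the 20 second real-time interval mentioned earlier. Dividing by the vCPU count to get a per-vCPU average is an interpretation added here, and the sample values are hypothetical.

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) into percent of the sample interval.

    With vcpus > 1 the result is the average per vCPU, assuming ready_ms is
    the sum over all vCPUs of the VM.
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


print(ready_percent(1000))           # 1000 ms ready in a 20 s real-time sample
print(ready_percent(2000, vcpus=4))  # 2000 ms summed over a 4 vCPU VM
```

Historical charts use longer intervals, so pass the matching `interval_s` when converting those values.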


Memory Stats

• Memory State – Indicator of Memory Pressure

– High: Sufficient memory available to satisfy current demand

• No memory conservation activities necessary

– Soft (< 4%): Software based conservation/sharing techniques used

• Balloon, Memory Compression, Low Latency Swap

– Hard (< 2%): Swap to disk (swap file of the VM)

– Low (< 1%): Allocations would be delayed/stunned

• Not a workable scenario. VMs will be stunned (applications fail/disconnect)
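The state thresholds above can be written as a simple classifier over the percentage of free host memory. The cut-offs are taken from the slide as-is; the actual thresholds depend on the ESXi release and host memory size, so treat the numbers as illustrative.

```python
def memory_state(free_pct):
    """Map free host memory (percent) to the memory state named on the slide."""
    if free_pct < 1:
        return "low"    # allocations delayed, VMs stunned
    if free_pct < 2:
        return "hard"   # swap to disk
    if free_pct < 4:
        return "soft"   # software based conservation techniques
    return "high"       # no conservation needed


for pct in (10, 3, 1.5, 0.5):
    print(pct, memory_state(pct))
```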


Key Memory Stats

• NUMA related Statistics

– NRMEM (Not In vCenter):Memory Allocated on Remote NUMA Node

– NLMEM (Not In vCenter): Memory Allocated on Local NUMA Node

– N%L (Not in vCenter): Percentage of Memory on Local Node

• Utilization related Statistics

– GRANT (Granted): Physical memory allocated

– TCHD (Active): Estimate of memory recently touched by the guest

– MCTLSZ (Balloon): Memory reclaimed using Memory Control Driver

– SWCUR (Swapped): Memory that is swapped Out

– SWW/s, SWR/s (Swap out /in): Current Swap Activity


VM Stats in UI

[Figure: VM memory charts in the UI: Memory Usage – No Pressure vs. Memory Usage – Under Pressure]


Storage Stats

• Esxtop storage views

– Adapter (d)

– Disk Device (u)

– VM (v)

• What should be looked at

– Queue Statistics

– Latency Statistics

– Error Statistics


Latency – Why?

• Majority of the Device Latencies are caused by external infrastructure

– Busy Storage Device

– Load factor

– Burst I/O resulting from Backup, OS/Antivirus updates, Replication, Boot Storm

– Noisy neighbor issues

• Un-Optimized access to the Device (Path policies and configuration)

• Unreliable fabric/network

– Faulty cables, adapters

• I/O to the disk with snapshot will see inconsistent latency


Network Stats

• Network issues generally manifest as failed connections, a high number of retransmissions and packet drops.

– TCP will recover, but throughput suffers. The application may not tolerate it.

• Network I/O can be more CPU intensive than Storage – Focus should be on Drop related stats

– In majority of the cases receive drops are associated with CPU contentions

– Transmit drops are associated with overload of physical network adapters

– Are we seeing too much broadcast traffic on the links?


IO Profiling


vSCSI Stats

• Data collection tool which provides insight into the I/O pattern of a Virtual Machine

– Can be used to profile the I/O requirements of an Application

• I/O Size (Length)

• Queue Size (Outstanding I/O)

• Read/Write Ratio

• How Random (or How Sequential)

• Inter-arrival rate (How demanding)

• Latency (Read/Write)
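To make the profile dimensions above concrete, the sketch below derives a read/write ratio, an I/O size histogram and a sequentiality figure from a list of I/O records. The record layout and the trace are hypothetical; this is not the vscsiStats output format, only the kind of summary it provides.

```python
from collections import Counter


def io_profile(trace):
    """Summarize a non-empty trace of (op, lba, length_blocks) in issue order.

    op is 'R' or 'W'. An I/O counts as sequential when it starts exactly
    where the previous one ended.
    """
    reads = sum(1 for op, _, _ in trace if op == "R")
    sizes = Counter(length for _, _, length in trace)
    sequential = sum(
        1
        for (_, lba0, len0), (_, lba1, _) in zip(trace, trace[1:])
        if lba1 == lba0 + len0
    )
    return {
        "read_fraction": reads / len(trace),
        "size_histogram": dict(sizes),
        "sequential_fraction": sequential / max(len(trace) - 1, 1),
    }


# Hypothetical trace: two sequential reads, a distant write, another read.
trace = [("R", 0, 8), ("R", 8, 8), ("W", 100, 16), ("R", 16, 8)]
profile = io_profile(trace)
print(profile)
```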


Net Stats

• Can be used to profile Network I/O

– Type of Traffic (TCP/UDP, IPV4/IPV6)

– Transmit and Receive Rates (PPS)

– Transmit and Receive Packet Sizes

– Burst or flat (Cluster of Packets)

– Inter-arrival Time (How active)

– CPU Usage
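The same kind of summary can be sketched for network traffic: packet rate, average packet size and inter-arrival time from a list of packet samples. The sample layout and values are hypothetical and do not reflect the net-stats output format.

```python
def net_profile(packets):
    """Summarize (timestamp_s, size_bytes) samples sorted by time (at least two)."""
    duration = packets[-1][0] - packets[0][0]
    gaps = [b[0] - a[0] for a, b in zip(packets, packets[1:])]
    return {
        "pps": (len(packets) - 1) / duration,
        "avg_size": sum(size for _, size in packets) / len(packets),
        "mean_interarrival_s": sum(gaps) / len(gaps),
        "max_gap_s": max(gaps),  # a large value next to a small mean hints at bursts
    }


# Hypothetical capture of four packets over two seconds.
samples = [(0.0, 64), (0.5, 1500), (1.0, 1500), (2.0, 64)]
summary = net_profile(samples)
print(summary)
```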


Optimization


Sizing and Configuration

• Virtual Machine Sizing and Configuration

– Pick the right sizing for CPU and Memory

• Over-allocation of CPU can result in scheduling conflicts

• Under-allocation of memory results in guest paging

• Over-allocation of memory results in higher overhead

• Ballooning away memory might be bad for memory-intensive applications

– Use pre-allocated, eager-zeroed thick disks for I/O intensive workloads

– pvSCSI and vmxnet3 have lower CPU requirements compared to other emulations

– Are we running with unwarranted/unnoticed Limits/Reservations?

• Server Hardware Considerations

– Make sure Hardware Assists for CPU/Memory and I/O devices are available

– Are the Power Management settings optimal? (Default: Balanced)

• Can result in too conservative CPU usage in low load scenarios

• I/O bound workloads perform poorly in these conditions

• Disable C-States to turn off power management (check the BIOS)


Workload Placement

• Planning should consider Average as well as Peak utilization

• Consolidation Ratios

– Make sure sufficient resources are available for CPU bound critical workloads

– Higher consolidation ratios are acceptable for desktop/test workloads

• Size the VM with attention to the NUMA layout of the server

– VMs spreading across multiple NUMA nodes can be a problem

– vNUMA will help to avoid this penalty

• Don't place all heavy hitters on the same hosts

– Use Affinity/Anti-Affinity rules to guarantee placement

• DRS can help to keep the VMs happy

– Make sure vMotion works (vMotion normally needs some CPU resources)

• Use Resource Pools (with reservations) over VM level reservations

– Resource pools also help to implement priority based SLAs
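For the NUMA sizing point above, a quick check is whether the VM's vCPU count and memory fit within a single node. The sketch below is a back-of-the-envelope estimate only: it ignores hypervisor overhead, hyper-threading and other VMs already on the node, and the node sizes are hypothetical.

```python
import math


def nodes_spanned(vcpus, mem_gb, cores_per_node, mem_gb_per_node):
    """Minimum number of NUMA nodes a VM of this size has to spread across."""
    by_cpu = math.ceil(vcpus / cores_per_node)
    by_mem = math.ceil(mem_gb / mem_gb_per_node)
    return max(by_cpu, by_mem)


# Hypothetical host: 12 cores and 128 GB per NUMA node.
print(nodes_spanned(8, 64, 12, 128))    # fits in one node
print(nodes_spanned(16, 64, 12, 128))   # vCPU count forces a second node
print(nodes_spanned(8, 300, 12, 128))   # memory forces three nodes
```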


I/O Optimization

• Storage

– Use optimal path policies (vendor involvement might be required)

– Keep an eye on storage utilization and load

– Pick right RAID level based on the usage pattern

– Storage I/O Control can help to reduce the impact of busy conditions

– Storage DRS can keep the load balanced

• Network

– Use hardware with offload features (savings on CPU time)

• Segmentation, Checksum etc.

– Jumbo frames have lower CPU overhead

• May not help if the average size of I/O transactions is small (~1500)

– Network I/O Control can help to prioritize the workload

• BIOS/Drivers – Recommended to keep the driver and BIOS level up to date
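The jumbo frame point in the Network list can be checked with simple arithmetic: count the frames needed to carry a payload at each MTU. The sketch assumes TCP over IPv4 with no header options (40 bytes of headers per frame); the payload sizes are examples.

```python
import math

TCP_IPV4_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(payload_bytes, mtu):
    """Number of frames required to carry payload_bytes at the given MTU."""
    return math.ceil(payload_bytes / (mtu - TCP_IPV4_HEADERS))


for payload in (65536, 1400):
    print(payload, frames_needed(payload, 1500), frames_needed(payload, 9000))
```

A 64 KB transfer needs far fewer frames at MTU 9000, so there is less per-frame processing. A 1400 byte transaction fits in one frame either way, which is why jumbo frames may not help small I/O.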


Latency Sensitive Application

• Source of Latency – Virtualization Overhead

– CPU Contention (Consolidation Ratio)

• Additional Contention in case of using Hyper Threading

– Techniques used for optimizing the CPU usage for network handling

• Interrupt coalescing

• Large Receive Offload (LRO)

– CPU Power Management

• C-State Transitions

• Optimization – Use virtual machine settings to mark the sensitivity of the VM

• Exclusive access to CPU resources

• Limit the de-scheduling of VM

• Use of coalescing and other optimization methods is limited

– Use SR-IOV based pass-through for further reduction in latency


Questions?


Thank You


Appendix

• Performance Monitoring Utilities: resxtop and esxtop – https://pubs.vmware.com/vsphere-60/index.jsp#com.vmware.vsphere.monitoring.doc/GUID-A31249BF-B5DC-455B-AFC7-7D0BBD6E37B6.html

• vScsiStats – http://cormachogan.com/2013/07/10/getting-started-with-vscsistats/

• NetStats – Command line tool

• From a shell session run ‘net-stats -h’ for details

• Latency Sensitive Applications – http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

• vCenter Performance Data Overview – http://pubs.vmware.com/vsphere-60/topic/com.vmware.vsphere.monitoring.doc/GUID-AA1F733C-1450-4437-AB0A-E5FA24CC386E.html
