Singapore, Q1 2013
Capacity Management in the Virtual World
What changes in the virtual world
It is a shared infrastructure
• The Apps team or “Business” no longer owns the infrastructure.
• Bespoke systems become standardized services.
Emergence of 2-tier Capacity Management
• VM level: done by the Apps team.
• Infra level: Compute, Storage, Network, Security, Datacenter. Done by the DC Infra team.
It is not just technical
• It is also cultural, social, political, or whatever you want to label it.
Words from a Practitioner
Changes in detail
Compute Resource
• Dedicated becomes shared.
• A single server becomes a cluster.
• A 2-node cluster (HA) becomes an N+X cluster.
• Emergence of the cluster as the smallest compute unit. Different clusters serve different purposes.
Network Resource
• The access switch is completely virtualised.
Storage Resource
• 10% on a central array becomes 100%. Sharing is 100%.
• Do you share the array between virtual and physical servers?
Application
• Licensing!
• Affinity Rules
Capacity Management at VM-level
Keep the building block simple
• 1 VM - 1 OS - 1 App - 1 Instance
Size for Peak Workload
• A month-end VM needs to be sized based on the month-end workload.
• Example: a VM has 2 peaks.
  Peak 1 uses 8 vCPU and 12 GB vRAM.
  Peak 2 uses 2 vCPU and 48 GB vRAM.
  The size is 8 vCPU, 48 GB vRAM. Yes, this is a 75% buffer!
Size correctly
• Oversizing results in slower performance in the virtual world.
• No need to be rigid; physical-world constraints no longer apply.
  You can have 50 GB vRAM, no need for 64 GB.
  You can have 5 vCPU, no need for 8 vCPU.
  You can have 500 GB vDisk, no need for 720 GB.
• It can always be increased, with no downtime on certain OSes.
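The peak-sizing rule above can be sketched in a few lines of Python (the `size_for_peaks` helper is hypothetical, for illustration only): take the maximum of each resource across all observed peaks.

```python
# Hypothetical helper: size a VM for the maximum of each resource across all peaks.
def size_for_peaks(peaks):
    """peaks: list of (vcpu, vram_gb) tuples, one per observed peak period."""
    vcpu = max(p[0] for p in peaks)
    vram_gb = max(p[1] for p in peaks)
    return vcpu, vram_gb

# The slide's example: Peak 1 uses 8 vCPU / 12 GB, Peak 2 uses 2 vCPU / 48 GB.
print(size_for_peaks([(8, 12), (2, 48)]))  # → (8, 48)
```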
Capacity Management at VM-level
Monitor the following 5 components
• vCPU
• vRAM
• vNetwork
• vDisk IOPS (this is hard!)
• vDisk Capacity
Use Reservation sparingly
Use Shares sparingly
Avoid using Limit
Capacity Management at Infra-Level
Get the Architecture correct
• A wrong architecture makes management & operation unnecessarily difficult.
Always know the actual capacity of the physical resource
• If you don’t know how many CPUs or how much RAM the ESXi host has, it’s impossible to figure out how much capacity is left.
• With that… do you know how many IOPS the storage has? The majority of shared storage is… well, shared. There are non-ESXi servers mounting it. You need to know the minimum guaranteed IOPS for vSphere.
Once you know the Actual, figure out the Usable portion
• Capacity left for VMs
• IaaS “workload”
• Non-vSphere workload
Capacity Management at Infra-Level: IaaS “workload”
vSphere is more than a hypervisor. Think Datacenter, not Server.

IaaS workload                                           Compute  Network  Storage
vMotion, Storage vMotion                                Low      High     High
VM Snapshot, cloning, hibernate                         Medium   Low      High
Virtual Storage (e.g. VMware VSA,
  Distributed Storage, Nutanix)                         Medium   High     High
Backup with dedupe (e.g. VDP Advanced, VADP)            High     High     High
  (the backup vendor can provide the info; for VDP
  Advanced you can see the actual in vSphere or
  vCenter Operations)
Hypervisor-based AV, IDS, IPS, FW                       Medium   Medium   Low
vSphere Replication                                     Low      Medium   Medium
Edge-of-network services (e.g. vShield Edge, F5)        Low      High     Low
DR test or actual (with SRM)                            High     High     High
HA and cluster maintenance                              High     N/A      N/A
Fault Tolerance (VM)                                    High     Medium   Low
Capacity Management at Infra-Level: IaaS “workload”
Use this setting in vCenter Operations 5.6 to cater for both the IaaS workload and VM Peak buffer
Capacity Management at Infra-Level: Compute
Example of tiering in cluster
Tier    # Hosts   Identical ESXi spec?  Failure tolerance  Max # VMs  Remarks
Tier 1  Always 6  Always identical      2 hosts            25         Only for critical apps. No resource overcommit.
Tier 2  4-8       Maybe                 1 host             75         Apps can be vMotioned to Tier 1 during a critical run.
Tier 3  4-10      No                    1 host             150        Some resource overcommit.
SW      2-10      Maybe                 1-3 hosts          150        Running expensive software. Oracle and SQL are the norm, as part of DB as a Service.
In the above example, capacity planning becomes much simpler in Tier 1, as we will hit the Availability limit before we hit the Capacity limit.
We still have to do Capacity Management at the VM-level though.
Example: Usable RAM calculation
                           Physical RAM  Remarks
Raw capacity               128 GB
- Hypervisor               -2 GB         vmkernel, vMotion, cloning, VADP
- Agent VMs                -6 GB         AV, IDS/IPS, FW, Virtual Storage
Capacity after hypervisor  120 GB        128 - 8 = 120 GB
- 20% headroom             -24 GB        Run at 80% utilisation; room for Anti-Affinity or large VMs
Nett Available             96 GB         120 - 24 = 96 GB
In a cluster of 10 ESXi hosts with N+1:
- The Raw RAM is 1280 GB.
- The Usable RAM is 864 GB (96 GB × 9 hosts).
- Your capacity planning starts at 864 GB, not 1280 GB.
It is lower if you are running a Virtual Storage VM.
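The table's walk-down can be expressed as a short sketch (the `usable_cluster_ram` helper and its default deductions are assumptions mirroring the slide's example, not an official formula):

```python
# Sketch of the slide's arithmetic; overhead figures are the slide's own estimates.
def usable_cluster_ram(hosts, raw_gb_per_host, hypervisor_gb=2, agent_vms_gb=6,
                       headroom=0.20, ha_spare_hosts=1):
    after_overhead = raw_gb_per_host - hypervisor_gb - agent_vms_gb  # 128 - 8 = 120
    per_host = after_overhead * (1 - headroom)                       # run at 80%: 96 GB
    return per_host * (hosts - ha_spare_hosts)                       # N+1: 96 x 9 hosts

print(usable_cluster_ram(hosts=10, raw_gb_per_host=128))  # → 864.0
```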
Example: Usable CPU calculation
                           Physical Cores  Remarks
Raw capacity               22              16 cores; HT is assumed to give a 40% boost. The vSphere 5.1 manual states 50%.
- Hypervisor               -2              vmkernel, vMotion, cloning, VADP
- Agent VMs                -1              AV, IDS/IPS, FW, Virtual Storage
Capacity after hypervisor  19              22 - 3 = 19
Nett Available             15              80% of 19
In a cluster of 10 ESXi hosts (16 cores, 32 threads each) with N+1:
- The Raw CPU is 160 cores (320 threads).
- The Usable CPU is 135 “cores” (15 × 9 hosts).
It is lower if you are running a Virtual Storage VM.
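The CPU version follows the same pattern, rounding down at each step as the slide does (the `usable_cluster_cores` helper is an illustrative assumption):

```python
# Sketch of the slide's CPU walk-down; HT boost and overheads are the slide's estimates.
def usable_cluster_cores(hosts, cores_per_host, ht_boost=0.40,
                         hypervisor=2, agent_vms=1, headroom=0.20, ha_spare_hosts=1):
    raw = int(cores_per_host * (1 + ht_boost))       # 16 x 1.4 -> 22 "cores"
    after_overhead = raw - hypervisor - agent_vms    # 22 - 3 = 19
    per_host = int(after_overhead * (1 - headroom))  # 80% of 19 -> 15
    return per_host * (hosts - ha_spare_hosts)       # N+1: 15 x 9 hosts

print(usable_cluster_cores(hosts=10, cores_per_host=16))  # → 135
```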
Capacity Management at Infra-Level: Storage
Determine the actual architecture you use. Then calculate your Usable IOPS
[Diagram: classic shared-array architecture. ESXi 1-3 and a Backup Server connect over a storage network (normally FC) to Shared Storage, which is also mounted by non-ESXi servers. Backup workload, non-VAAI workload and Array Replication also hit the array.]

[Diagram: distributed-storage architecture. A vSphere cluster (ESXi 1-3, each with local Storage 1-3) on a Shared IP Network, plus a Backup VM with Backup Storage. Distributed Storage Mirroring and vSphere Replication traffic ride the same network.]
Example: Usable Storage calculation
                                 IOPS     Remarks
Raw capacity                     100,000  Backend IOPS
Front-end IOPS                   50,000   RAID 10 halves the backend IOPS; RAID 5 or 6 will be lower.
- Non-ESXi workload              -5,000   Not recommended to mix, as we cannot do QoS or Storage IO Control
- Backup workload                -5,000   Included in vSphere if VDP Advanced is used
- Array Replication              -5,000   If used
- vSphere Replication            -5,000   If used
- Distributed Storage Mirroring  -10,000  If used
Nett Available                   20,000   Only 20% left for VM workload!
All numbers are pure estimates.
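As a sketch, the IOPS walk-down might look like this (the `usable_iops` helper is hypothetical; the RAID factor and deduction values are the slide's own estimates):

```python
# Hypothetical helper: backend IOPS -> front-end IOPS -> minus shared workloads.
def usable_iops(backend_iops, raid_factor, deductions):
    """raid_factor: 0.5 for RAID 10 in this sketch; RAID 5/6 would be lower."""
    return backend_iops * raid_factor - sum(deductions)

# Non-ESXi, backup, array replication, vSphere Replication, DS mirroring.
print(usable_iops(100_000, 0.5, (5_000, 5_000, 5_000, 5_000, 10_000)))  # → 20000.0
```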
Capacity Management at Infra-Level: Network
[TODO: add a picture of the network traffic, and my network calculation.]
Estimated ESXi IO bandwidth in 2014
Purpose                  Bandwidth  Remarks
VM                       4 Gb       For ~20 VMs. A vShield Edge VM needs a lot of bandwidth, as all traffic passes through it.
FT network               10 Gb      Based on the VMworld 2012 presentation on multi-vCPU FT. Current 1-vCPU FT needs around 1 Gb.
Distributed Storage      10 Gb      Required if storage is virtualised. Based on the Tech Preview in 5.1 and Nutanix.
vMotion                  8 Gb       vSphere 5.1 introduces shared-nothing live migration, which increases demand as the vmdk is much larger than the vRAM. Includes multi-NIC vMotion for faster migration when multiple VMs must be moved.
Management               1 Gb       Copying a powered-off VM to another host without a shared datastore takes this bandwidth.
IP Storage (serving VM)  6 Gb       NFS or iSCSI. Not the same as Distributed Storage, as DS does not serve VMs. No need for 10 Gb, as the storage array is likely shared by 10-50 hosts; the array may only have 40 Gb total for all these hosts.
vSphere Replication      1 Gb       Should be sufficient, as the WAN link is likely the bottleneck.
Total                    40 Gb
Disclaimer: future features may differ from what I describe here.
Example: Usable Network calculation
                        GE   Remarks
Raw capacity            40   4× 10 GE NICs
- Management Network    -1
- vSphere Replication   -1
- Live Migration        -6   DRS and Storage DRS; shared-nothing vMotion
- IP Storage            -6   From VM to Storage
- Distributed Storage   -10  From Storage to Storage (mirroring)
- Fault Tolerance       -8   If used
Nett Available          8    Only 20% left for VM workload!
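The same subtraction, as a sketch (the `usable_vm_bandwidth` helper and the dictionary keys are illustrative):

```python
# Hypothetical helper: raw NIC bandwidth minus the non-VM traffic classes.
def usable_vm_bandwidth(raw_gb, deductions):
    return raw_gb - sum(deductions.values())

deductions = {"management": 1, "vsphere_replication": 1, "live_migration": 6,
              "ip_storage": 6, "distributed_storage": 10, "fault_tolerance": 8}
print(usable_vm_bandwidth(40, deductions))  # → 8
```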
Summary
1. Get the architecture right: Compute, Storage, Network.
2. Calculate the Usable Capacity: non-VMware workload, IaaS workload, VM workload.
3. Configure vCenter Operations accordingly (to be covered separately).
Requirements: Reports & Analysis
Core Questions (major impact):
• What is our typical VM profile? (Small, Medium, Large)
• How many more VMs can we put?
  - Based on our actual-workload VMs
  - Based on theoretical/specified-size VMs
• When do we need to buy additional capacity?
  - Compute
  - Storage
  - Network

Non-core Questions:
• Which VMs need to be right-sized?
• Which VMs are basically dormant (idle VMs)?
• Which clusters are under heavy usage? Which are underutilised?
Determining a VM utilisation
There are 2 types of utilisation that must be captured:
• Average utilisation over the entire period
  This is the average of all utilisation samples. Over time this tends to be low, unless the VM is busy more than 50% of the time. A VM will have a low average if:
  - It is normally used during office hours (which is just ~40 hours per week)
  - It is cyclical in nature (e.g. month-end payroll processing, end-of-day batch jobs)
  - The peak is much higher than the average, but only lasts for a short burst
• Average utilisation during the Peak period only
  This is the average of the peak utilisation. Examples:
  - Average over the peak period (e.g. the 1st day of the month)
  - Average of the top 10% of utilisation samples. For a month-end job this needs to be the top 3%, as it is 1 day out of 30.
  This is needed for cyclical workloads.
VM workload is cleaner than physical-server workload:
• AV scans and backups are offloaded to the hypervisor.
• Disk defragmentation is no longer required.
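Both figures can be computed side by side; this sketch (the `overall_and_peak_average` helper is hypothetical) takes the mean of the top fraction of samples as the peak average:

```python
# Hypothetical helper: overall mean plus the mean of the busiest fraction of samples.
def overall_and_peak_average(samples, top_fraction=0.10):
    ordered = sorted(samples, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    overall = sum(samples) / len(samples)
    peak = sum(ordered[:k]) / k
    return overall, peak

# A cyclical VM: 6 hours at 100%, 18 hours at 5%. The overall average hides the peak.
samples = [100] * 6 + [5] * 18
print(overall_and_peak_average(samples, top_fraction=0.25))  # → (28.75, 100.0)
```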
Managing cyclical workload
Example of cyclical workload
• A daily batch run from midnight to 6 am; from 6 am to midnight it is idle.
  In this case, even if it runs at 100%, the daily average will be just ~25%.
• An end-of-month batch job; from the 1st to the 29th (or 30th) it is idle.
  In this case, even if it runs at 100% for an entire day, the monthly average will be ~4%.
Is the VM oversized?
• No, the VM is right-sized.
• If we know the Peak period, and only count that period, we will see that the VM is not oversized.
• The key is to determine this Peak period automatically. Setting the peak period for 1000s of VMs is prone to human error, plus the workload can change.
Is the VM running out of capacity soon?
• No, the VM does not need to be given more resources.
• To determine the Time Remaining, we need both the Peak period and the Total period. Using the Peak period alone is misleading, as it will show that we need to give the VM more resources.
Actual VM distribution
Explanation:
• vCPU is rounded up to the nearest whole number, as we cannot assign a slice of a vCPU.
• vRAM is rounded up to the nearest whole number. In reality we round up to the nearest even number, or to the nearest 10 for large VMs: 3 GB becomes 4 GB; 73 GB becomes 80 GB.
• vDisk GB is rounded up to the nearest 10 GB.
• vDisk IOPS is rounded up to the nearest 10 IOPS.
• vNetwork is rounded up to the nearest Mb. It would take 1000 VMs to saturate a 1 GE link.
Size    vCPU  vRAM   vDisk (GB)  vDisk (IOPS)  vNetwork
Small   1     1 GB   110 GB      10 IOPS       1 Mb
Medium  2     4 GB   320 GB      310 IOPS      1 Mb
Large   5     11 GB  740 GB      1020 IOPS     30 Mb
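The rounding conventions above can be sketched as a helper (the `round_vm_size` function, and the 16 GB cut-off for a "large" VM, are illustrative assumptions):

```python
import math

# Hypothetical helper applying the slide's rounding conventions.
def round_vm_size(vcpu, vram_gb, vdisk_gb, vdisk_iops, vnet_mb):
    vram = math.ceil(vram_gb)
    if vram > 16:                        # assumed "large VM" cut-off: nearest 10
        vram = math.ceil(vram / 10) * 10
    elif vram % 2:                       # otherwise round up to an even number
        vram += 1
    return (math.ceil(vcpu),                  # whole vCPUs only
            vram,
            math.ceil(vdisk_gb / 10) * 10,    # nearest 10 GB
            math.ceil(vdisk_iops / 10) * 10,  # nearest 10 IOPS
            math.ceil(vnet_mb))               # nearest Mb

print(round_vm_size(1.3, 3, 101, 7, 0.4))       # → (2, 4, 110, 10, 1)
print(round_vm_size(4.2, 73, 732, 1014, 28.5))  # → (5, 80, 740, 1020, 29)
```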
How many VMs can we put?
Description:
• An Actual-size VM means the size is based on actual workload, or actual Reservation. vCenter Operations can take Reservation into account.
• A Given-size VM means the value is entered manually by the Administrator. We might enter 4 vCPU, 16 GB vRAM, 400 GB vDisk as input, and see how many such VMs we can put.
• This is at cluster level, not host level. The cluster is the smallest logical building block.
• A result above the cluster limit is shown as the vSphere 5.1 cluster limit.
• This shows what we can put now, not 6 months in the future (or any other date).
Cluster           Actual-size VMs  Given-size VMs
Tier 1            21               10
Tier 2            62               42
Tier 3            125              110
Software Cluster  12               17
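The "how many more VMs" question reduces to the most constrained resource. A minimal sketch (the `vms_that_fit` helper is hypothetical; the optional cap stands in for the cluster limit mentioned above):

```python
# Hypothetical helper: the most constrained resource decides how many VMs fit.
def vms_that_fit(usable, per_vm, cluster_limit=None):
    fit = min(usable[k] // per_vm[k] for k in per_vm)
    return min(fit, cluster_limit) if cluster_limit is not None else fit

# Illustrative numbers: usable cluster capacity vs the "Medium" VM profile.
usable = {"vcpu": 135, "vram_gb": 864, "vdisk_gb": 20_000}
medium = {"vcpu": 2, "vram_gb": 4, "vdisk_gb": 320}
print(vms_that_fit(usable, medium))  # → 62
```

Here vDisk is the bottleneck: 20,000 // 320 = 62, even though vCPU and vRAM would allow more.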
When do we need to buy additional resource?
Description:
• Results are rounded to the nearest day. Results above 1 year are shown as >1 year.
Cluster   vCPU        vRAM        vDisk (GB)  vDisk (IOPS)  vNetwork
Tier 1    6 months    3 months    11 months   8 months      6 months
Tier 2    >1 year     >1 year     >1 year     >1 year       >1 year
Tier 3    15 days     10 days     25 days     5 days        15 days
Software  1.5 months  1.8 months  1.2 months  3.5 months    2.5 months
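"Time Remaining" is, at its simplest, a linear projection of growth against the usable capacity; this sketch (the `days_remaining` helper and the numbers are illustrative assumptions) shows the idea:

```python
# Hypothetical helper: linear projection of days until usage reaches usable capacity.
def days_remaining(usable, used_now, daily_growth):
    if daily_growth <= 0:
        return float("inf")   # no growth: capacity never runs out
    return max(0.0, (usable - used_now) / daily_growth)

# e.g. 864 GB usable RAM, 700 GB used, growing ~1.8 GB/day: roughly 3 months left.
print(round(days_remaining(864, 700, 1.8)))  # → 91
```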
© 2009 VMware Inc. All rights reserved
Thank You
The rest of the deck is still a rough draft.
Cluster Overview