Singapore, Q1 2013
Capacity Management in the Virtual World
What changes in the virtual world
It is a shared infrastructure
• The Apps team or “Business” no longer owns the infrastructure.
• Bespoke systems become standardized services.
Emergence of 2-tier Capacity Management
• VM level: done by the Apps team.
• Infra level: Compute, Storage, Network, Security, Datacenter. Done by the DC Infra team.
It is not just technical
• It is also cultural, social, political, or whatever you want to label it.
Words from a Practitioner
Changes in detail
Compute Resource
• Dedicated becomes shared.
• A single server becomes a cluster.
• A 2-node cluster (HA) becomes an N+X cluster.
• Emergence of the cluster as the smallest compute unit. Different clusters serve different purposes.
Network Resource
• The access switch is completely virtualised.
Storage Resource
• 10% on a central array becomes 100%. Sharing is 100%.
• Do you share the array between virtual and physical servers?
Application
• Licensing!
• Affinity Rules
Capacity Management at VM-level
Keep the building block simple
• 1 VM - 1 OS - 1 App - 1 Instance
Size for Peak Workload
• A month-end VM needs to be sized based on the month-end workload.
• Example: a VM has 2 peaks.
  Peak 1 uses 8 vCPU and 12 GB vRAM.
  Peak 2 uses 2 vCPU and 48 GB vRAM.
  The size is 8 vCPU, 48 GB vRAM. Yes, this is a 75% buffer!
Size correctly
• Oversizing results in slower performance in the virtual world.
• No need to be rigid; physical-world constraints no longer apply.
  You can have 50 GB vRAM, no need for 64 GB.
  You can have 5 vCPU, no need for 8 vCPU.
  You can have 500 GB vDisk, no need for 720 GB.
• It can always be increased, with no downtime on certain OSes.
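The peak-sizing rule above can be sketched in a few lines of Python (the `size_for_peaks` helper is hypothetical, for illustration only): take the maximum of each resource across all observed peaks.

```python
# Hypothetical helper: size a VM for the maximum of each resource across all peaks.
def size_for_peaks(peaks):
    """peaks: list of (vcpu, vram_gb) tuples, one per observed peak period."""
    vcpu = max(p[0] for p in peaks)
    vram_gb = max(p[1] for p in peaks)
    return vcpu, vram_gb

# The slide's example: Peak 1 uses 8 vCPU / 12 GB, Peak 2 uses 2 vCPU / 48 GB.
print(size_for_peaks([(8, 12), (2, 48)]))  # → (8, 48)
```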
Capacity Management at VM-level
Monitor the following 5 components
• vCPU
• vRAM
• vNetwork
• vDisk IOPS (this is hard!)
• vDisk Capacity
Use Reservation sparingly
Use Shares sparingly
Avoid using Limit
Capacity Management at Infra-Level
Get the Architecture correct
• A wrong architecture makes management & operation unnecessarily difficult.
Always know the actual capacity of the physical resource
• If you don’t know how many CPUs or how much RAM the ESXi host has, it’s impossible to figure out how much capacity is left.
• With that… do you know how many IOPS the storage has? The majority of shared storage is… well, shared. There are non-ESXi servers mounting it. You need to know the minimum guaranteed IOPS for vSphere.
Once you know the Actual, figure out the Usable portion
• Capacity left for VMs
• IaaS “workload”
• Non-vSphere workload
Capacity Management at Infra-Level: IaaS “workload”
vSphere is more than a hypervisor. Think Datacenter, not Server.

IaaS workload                                           Compute  Network  Storage
vMotion, Storage vMotion                                Low      High     High
VM Snapshot, cloning, hibernate                         Medium   Low      High
Virtual Storage (e.g. VMware VSA,
  Distributed Storage, Nutanix)                         Medium   High     High
Backup with dedupe (e.g. VDP Advanced, VADP)            High     High     High
  (the backup vendor can provide the info; for VDP
  Advanced you can see the actual in vSphere or
  vCenter Operations)
Hypervisor-based AV, IDS, IPS, FW                       Medium   Medium   Low
vSphere Replication                                     Low      Medium   Medium
Edge-of-network services (e.g. vShield Edge, F5)        Low      High     Low
DR test or actual (with SRM)                            High     High     High
HA and cluster maintenance                              High     N/A      N/A
Fault Tolerance (VM)                                    High     Medium   Low
Capacity Management at Infra-Level: IaaS “workload”
Use this setting in vCenter Operations 5.6 to cater for both the IaaS workload and VM Peak buffer
Capacity Management at Infra-Level: Compute
Example of tiering in cluster
Tier    # Hosts   Identical ESXi spec?  Failure tolerance  Max # VMs  Remarks
Tier 1  Always 6  Always identical      2 hosts            25         Only for critical apps. No resource overcommit.
Tier 2  4-8       Maybe                 1 host             75         Apps can be vMotioned to Tier 1 during a critical run.
Tier 3  4-10      No                    1 host             150        Some resource overcommit.
SW      2-10      Maybe                 1-3 hosts          150        Running expensive software. Oracle and SQL are the norm, as part of DB as a Service.
In the above example, capacity planning becomes much simpler in Tier 1, as we will hit the Availability limit before we hit the Capacity limit.
We still have to do Capacity Management at the VM-level though.
Example: Usable RAM calculation
                           Physical RAM  Remarks
Raw capacity               128 GB
- Hypervisor               -2 GB         vmkernel, vMotion, cloning, VADP
- Agent VMs                -6 GB         AV, IDS/IPS, FW, Virtual Storage
Capacity after hypervisor  120 GB        128 - 8 = 120 GB
- 20% headroom             -24 GB        Run at 80% utilisation; room for Anti-Affinity or large VMs
Nett Available             96 GB         120 - 24 = 96 GB
In a cluster of 10 ESXi hosts with N+1:
- The Raw RAM is 1280 GB.
- The Usable RAM is 864 GB (96 GB × 9 hosts).
- Your capacity planning starts at 864 GB, not 1280 GB.
It is lower if you are running a Virtual Storage VM.
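The table's walk-down can be expressed as a short sketch (the `usable_cluster_ram` helper and its default deductions are assumptions mirroring the slide's example, not an official formula):

```python
# Sketch of the slide's arithmetic; overhead figures are the slide's own estimates.
def usable_cluster_ram(hosts, raw_gb_per_host, hypervisor_gb=2, agent_vms_gb=6,
                       headroom=0.20, ha_spare_hosts=1):
    after_overhead = raw_gb_per_host - hypervisor_gb - agent_vms_gb  # 128 - 8 = 120
    per_host = after_overhead * (1 - headroom)                       # run at 80%: 96 GB
    return per_host * (hosts - ha_spare_hosts)                       # N+1: 96 x 9 hosts

print(usable_cluster_ram(hosts=10, raw_gb_per_host=128))  # → 864.0
```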
Example: Usable CPU calculation
                           Physical Cores  Remarks
Raw capacity               22              16 cores; HT is assumed to give a 40% boost. The vSphere 5.1 manual states 50%.
- Hypervisor               -2              vmkernel, vMotion, cloning, VADP
- Agent VMs                -1              AV, IDS/IPS, FW, Virtual Storage
Capacity after hypervisor  19              22 - 3 = 19
Nett Available             15              80% of 19
In a cluster of 10 ESXi hosts (16 cores, 32 threads each) with N+1:
- The Raw CPU is 160 cores (320 threads).
- The Usable CPU is 135 “cores” (15 × 9 hosts).
It is lower if you are running a Virtual Storage VM.
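The CPU version follows the same pattern, rounding down at each step as the slide does (the `usable_cluster_cores` helper is an illustrative assumption):

```python
# Sketch of the slide's CPU walk-down; HT boost and overheads are the slide's estimates.
def usable_cluster_cores(hosts, cores_per_host, ht_boost=0.40,
                         hypervisor=2, agent_vms=1, headroom=0.20, ha_spare_hosts=1):
    raw = int(cores_per_host * (1 + ht_boost))       # 16 x 1.4 -> 22 "cores"
    after_overhead = raw - hypervisor - agent_vms    # 22 - 3 = 19
    per_host = int(after_overhead * (1 - headroom))  # 80% of 19 -> 15
    return per_host * (hosts - ha_spare_hosts)       # N+1: 15 x 9 hosts

print(usable_cluster_cores(hosts=10, cores_per_host=16))  # → 135
```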
Capacity Management at Infra-Level: Storage
Determine the actual architecture you use. Then calculate your Usable IOPS
[Diagram: classic shared-array architecture. ESXi 1-3 and a Backup Server connect over a storage network (normally FC) to Shared Storage, which is also mounted by non-ESXi servers. Backup workload, non-VAAI workload and Array Replication also hit the array.]

[Diagram: distributed-storage architecture. A vSphere cluster (ESXi 1-3, each with local Storage 1-3) on a Shared IP Network, plus a Backup VM with Backup Storage. Distributed Storage Mirroring and vSphere Replication traffic ride the same network.]
Example: Usable Storage calculation
                                 IOPS     Remarks
Raw capacity                     100,000  Backend IOPS
Front-end IOPS                   50,000   RAID 10 halves the backend IOPS; RAID 5 or 6 will be lower.
- Non-ESXi workload              -5,000   Not recommended to mix, as we cannot do QoS or Storage IO Control
- Backup workload                -5,000   Included in vSphere if VDP Advanced is used
- Array Replication              -5,000   If used
- vSphere Replication            -5,000   If used
- Distributed Storage Mirroring  -10,000  If used
Nett Available                   20,000   Only 20% left for VM workload!
All numbers are pure estimates.
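As a sketch, the IOPS walk-down might look like this (the `usable_iops` helper is hypothetical; the RAID factor and deduction values are the slide's own estimates):

```python
# Hypothetical helper: backend IOPS -> front-end IOPS -> minus shared workloads.
def usable_iops(backend_iops, raid_factor, deductions):
    """raid_factor: 0.5 for RAID 10 in this sketch; RAID 5/6 would be lower."""
    return backend_iops * raid_factor - sum(deductions)

# Non-ESXi, backup, array replication, vSphere Replication, DS mirroring.
print(usable_iops(100_000, 0.5, (5_000, 5_000, 5_000, 5_000, 10_000)))  # → 20000.0
```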
Capacity Management at Infra-Level: Network
[TODO: add a picture of the network traffic, and my network calculation.]
Estimated ESXi IO bandwidth in 2014
Purpose                  Bandwidth  Remarks
VM                       4 Gb       For ~20 VMs. A vShield Edge VM needs a lot of bandwidth, as all traffic passes through it.
FT network               10 Gb      Based on the VMworld 2012 presentation on multi-vCPU FT. Current 1-vCPU FT needs around 1 Gb.
Distributed Storage      10 Gb      Required if storage is virtualised. Based on the Tech Preview in 5.1 and Nutanix.
vMotion                  8 Gb       vSphere 5.1 introduces shared-nothing live migration, which increases demand as the vmdk is much larger than the vRAM. Includes multi-NIC vMotion for faster migration when multiple VMs must be moved.
Management               1 Gb       Copying a powered-off VM to another host without a shared datastore takes this bandwidth.
IP Storage (serving VM)  6 Gb       NFS or iSCSI. Not the same as Distributed Storage, as DS does not serve VMs. No need for 10 Gb, as the storage array is likely shared by 10-50 hosts; the array may only have 40 Gb total for all these hosts.
vSphere Replication      1 Gb       Should be sufficient, as the WAN link is likely the bottleneck.
Total                    40 Gb
Disclaimer: future features may differ from what I describe here.
Example: Usable Network calculation
                        GE   Remarks
Raw capacity            40   4× 10 GE NICs
- Management Network    -1
- vSphere Replication   -1
- Live Migration        -6   DRS and Storage DRS; shared-nothing vMotion
- IP Storage            -6   From VM to Storage
- Distributed Storage   -10  From Storage to Storage (mirroring)
- Fault Tolerance       -8   If used
Nett Available          8    Only 20% left for VM workload!
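The same subtraction, as a sketch (the `usable_vm_bandwidth` helper and the dictionary keys are illustrative):

```python
# Hypothetical helper: raw NIC bandwidth minus the non-VM traffic classes.
def usable_vm_bandwidth(raw_gb, deductions):
    return raw_gb - sum(deductions.values())

deductions = {"management": 1, "vsphere_replication": 1, "live_migration": 6,
              "ip_storage": 6, "distributed_storage": 10, "fault_tolerance": 8}
print(usable_vm_bandwidth(40, deductions))  # → 8
```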
Summary
1. Get the architecture right: Compute, Storage, Network.
2. Calculate the Usable Capacity: non-VMware workload, IaaS workload, VM workload.
3. Configure vCenter Operations accordingly (to be covered separately).
Requirements: Reports & Analysis
Core Questions (major impact):
• What is our typical VM profile? (Small, Medium, Large)
• How many more VMs can we put?
  - Based on our actual-workload VMs
  - Based on theoretical/specified-size VMs
• When do we need to buy additional capacity?
  - Compute
  - Storage
  - Network

Non-core Questions:
• Which VMs need to be right-sized?
• Which VMs are basically dormant (idle VMs)?
• Which clusters are under heavy usage? Which are underutilised?
Determining a VM utilisation
There are 2 types of utilisation that must be captured:
• Average utilisation over the entire period
  This is the average of all utilisation samples. Over time this tends to be low, unless the VM is busy more than 50% of the time. A VM will have a low average if:
  - It is normally used during office hours (which is just ~40 hours per week)
  - It is cyclical in nature (e.g. month-end payroll processing, end-of-day batch jobs)
  - The peak is much higher than the average, but only lasts for a short burst
• Average utilisation during the Peak period only
  This is the average of the peak utilisation. Examples:
  - Average over the peak period (e.g. the 1st day of the month)
  - Average of the top 10% of utilisation samples. For a month-end job this needs to be the top 3%, as it is 1 day out of 30.
  This is needed for cyclical workloads.
VM workload is cleaner than physical-server workload:
• AV scans and backups are offloaded to the hypervisor.
• Disk defragmentation is no longer required.
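Both figures can be computed side by side; this sketch (the `overall_and_peak_average` helper is hypothetical) takes the mean of the top fraction of samples as the peak average:

```python
# Hypothetical helper: overall mean plus the mean of the busiest fraction of samples.
def overall_and_peak_average(samples, top_fraction=0.10):
    ordered = sorted(samples, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    overall = sum(samples) / len(samples)
    peak = sum(ordered[:k]) / k
    return overall, peak

# A cyclical VM: 6 hours at 100%, 18 hours at 5%. The overall average hides the peak.
samples = [100] * 6 + [5] * 18
print(overall_and_peak_average(samples, top_fraction=0.25))  # → (28.75, 100.0)
```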
Managing cyclical workload
Example of cyclical workload
• A daily batch run from midnight to 6 am; from 6 am to midnight it is idle.
  In this case, even if it runs at 100%, the daily average will be just ~25%.
• An end-of-month batch job; from the 1st to the 29th (or 30th) it is idle.
  In this case, even if it runs at 100% for an entire day, the monthly average will be ~4%.
Is the VM oversized?
• No, the VM is right-sized.
• If we know the Peak period, and only count that period, we will see that the VM is not oversized.
• The key is to determine this Peak period automatically. Setting the peak period for 1000s of VMs is prone to human error, plus the workload can change.
Is the VM running out of capacity soon?
• No, the VM does not need to be given more resources.
• To determine the Time Remaining, we need both the Peak period and the Total period. Using the Peak period alone is misleading, as it will show that we need to give the VM more resources.
Actual VM distribution
Explanation:
• vCPU is rounded up to the nearest whole number, as we cannot assign a slice of a vCPU.
• vRAM is rounded up to the nearest whole number. In reality we round up to the nearest even number, or to the nearest 10 for large VMs: 3 GB becomes 4 GB; 73 GB becomes 80 GB.
• vDisk GB is rounded up to the nearest 10 GB.
• vDisk IOPS is rounded up to the nearest 10 IOPS.
• vNetwork is rounded up to the nearest Mb. It would take 1000 VMs to saturate a 1 GE link.
Size    vCPU  vRAM   vDisk (GB)  vDisk (IOPS)  vNetwork
Small   1     1 GB   110 GB      10 IOPS       1 Mb
Medium  2     4 GB   320 GB      310 IOPS      1 Mb
Large   5     11 GB  740 GB      1020 IOPS     30 Mb
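The rounding conventions above can be sketched as a helper (the `round_vm_size` function, and the 16 GB cut-off for a "large" VM, are illustrative assumptions):

```python
import math

# Hypothetical helper applying the slide's rounding conventions.
def round_vm_size(vcpu, vram_gb, vdisk_gb, vdisk_iops, vnet_mb):
    vram = math.ceil(vram_gb)
    if vram > 16:                        # assumed "large VM" cut-off: nearest 10
        vram = math.ceil(vram / 10) * 10
    elif vram % 2:                       # otherwise round up to an even number
        vram += 1
    return (math.ceil(vcpu),                  # whole vCPUs only
            vram,
            math.ceil(vdisk_gb / 10) * 10,    # nearest 10 GB
            math.ceil(vdisk_iops / 10) * 10,  # nearest 10 IOPS
            math.ceil(vnet_mb))               # nearest Mb

print(round_vm_size(1.3, 3, 101, 7, 0.4))       # → (2, 4, 110, 10, 1)
print(round_vm_size(4.2, 73, 732, 1014, 28.5))  # → (5, 80, 740, 1020, 29)
```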
How many VMs can we put?
Description:
• An Actual-size VM means the size is based on actual workload, or actual Reservation. vCenter Operations can take Reservation into account.
• A Given-size VM means the value is entered manually by the Administrator. We might enter 4 vCPU, 16 GB vRAM, 400 GB vDisk as input, and see how many such VMs we can put.
• This is at cluster level, not host level. The cluster is the smallest logical building block.
• A result above the cluster limit is shown as the vSphere 5.1 cluster limit.
• This shows what we can put now, not 6 months in the future (or any other date).
Cluster           Actual-size VMs  Given-size VMs
Tier 1            21               10
Tier 2            62               42
Tier 3            125              110
Software Cluster  12               17
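The "how many more VMs" question reduces to the most constrained resource. A minimal sketch (the `vms_that_fit` helper is hypothetical; the optional cap stands in for the cluster limit mentioned above):

```python
# Hypothetical helper: the most constrained resource decides how many VMs fit.
def vms_that_fit(usable, per_vm, cluster_limit=None):
    fit = min(usable[k] // per_vm[k] for k in per_vm)
    return min(fit, cluster_limit) if cluster_limit is not None else fit

# Illustrative numbers: usable cluster capacity vs the "Medium" VM profile.
usable = {"vcpu": 135, "vram_gb": 864, "vdisk_gb": 20_000}
medium = {"vcpu": 2, "vram_gb": 4, "vdisk_gb": 320}
print(vms_that_fit(usable, medium))  # → 62
```

Here vDisk is the bottleneck: 20,000 // 320 = 62, even though vCPU and vRAM would allow more.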
When do we need to buy additional resource?
Description:
• Results are rounded to the nearest day. Results above 1 year are shown as >1 year.
Cluster   vCPU        vRAM        vDisk (GB)  vDisk (IOPS)  vNetwork
Tier 1    6 months    3 months    11 months   8 months      6 months
Tier 2    >1 year     >1 year     >1 year     >1 year       >1 year
Tier 3    15 days     10 days     25 days     5 days        15 days
Software  1.5 months  1.8 months  1.2 months  3.5 months    2.5 months
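"Time Remaining" is, at its simplest, a linear projection of growth against the usable capacity; this sketch (the `days_remaining` helper and the numbers are illustrative assumptions) shows the idea:

```python
# Hypothetical helper: linear projection of days until usage reaches usable capacity.
def days_remaining(usable, used_now, daily_growth):
    if daily_growth <= 0:
        return float("inf")   # no growth: capacity never runs out
    return max(0.0, (usable - used_now) / daily_growth)

# e.g. 864 GB usable RAM, 700 GB used, growing ~1.8 GB/day: roughly 3 months left.
print(round(days_remaining(864, 700, 1.8)))  # → 91
```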
© 2009 VMware Inc. All rights reserved
Thank You
The rest of the deck is still a rough draft.
Cluster Overview