VMware Farm Optimization
By Jeremy Kampwerth [email protected]
Introduction To Me
• Windows and Unix System Administrator for 8 years.
• Capacity and Performance Engineer for 6 years.
• Apparently I like working for large companies.
• I consider myself a jack of many trades.
• Presentations like this are not one of the trades.
In General
• The topic has VMware in the title, but this is not about VMware specifically; the concepts I will discuss could be applied to any virtual environment.
• The concepts are more about common sense than they are trade secrets.
• My role was to assist the project by providing analysis and technical expertise.
  – I was not doing the dirty work.
Introduction to the Topic
• Working hand-in-hand with the virtual support team to rein in the wildfire that was virtual sprawl.
• Optimize VMs based on historical utilization data, with added controls around application requirements.
• Today the capacity team is part of both the before and the after process, to regulate and review.
Introduction to the Topic
• I will discuss
  – How we got into the mess and how the capacity team helped to get out of it.
  – How and why the capacity team was engaged.
  – The expensive tools we used to do the job.
  – The guidelines we used to make safe decisions.
  – What were the failures?
  – What led to the successes?
How did we get into this mess?
• Many factors led to the wildfire (virtual sprawl)
  – Corporate decision to push virtualization
  – Lack of controls in the request process
    • Led to many over-provisioned VMs
  – Existing large, non-centralized environment managed across many different internal organizations, each with a different set of rules
Ask the Capacity Team for help
• Surely the internal capacity team was the first call.
• Surely before they ask for money they would think of the capacity team.
• Surely upper management would know the capacity team exists.
  – Luckily they did
What was being asked
• Can we help the virtual team rein in the madness?
• Can we produce the same results as the outside company?
• Can we do it in a safe manner?
• Can we do it reliably, and reproduce that reliability?
What did they do?
• Looked at data for thousands of VMs
  – Data only contained 4 weeks
• Analysis via Modeling tool
  – Fancy tool with top secret formula
• CMDB details not considered
  – No application relationships
  – No account for age of the VM
• Many reductions found
  – Over 40% vCPU reduction
  – Over 70% vMEM reduction
Our Guidelines
• Make sure the server is being used for what it was intended
  – In deployment for 180 days
• Consider the application
  – Match by role and function
    • Within each application, all production web servers should be sized the same
• Enough data
  – Minimum 90 days of data
• Peak utilization
  – No arguing (but, but why?)
  – 15 minute interval
  – Add headroom
    • 20% headroom for vCPU
    • 5% headroom for vMem (consumed memory)
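The guidelines above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling the team actually used: the `VmHistory` record, its field names, and the idea that samples are already expressed in vCPUs used and units of consumed vMem are all hypothetical. The thresholds (180 days deployed, 90 days of data, 20% and 5% headroom, rounding up) come from the slide.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the guidelines slide.
MIN_DAYS_DEPLOYED = 180
MIN_DAYS_OF_DATA = 90
VCPU_HEADROOM = 0.20   # 20% headroom for vCPU
VMEM_HEADROOM = 0.05   # 5% headroom for consumed vMem


@dataclass
class VmHistory:
    """Hypothetical record of one VM's history (names are assumptions)."""
    days_deployed: int
    days_of_data: int
    vcpu_samples: list   # vCPUs in use, one value per 15-minute interval
    vmem_samples: list   # consumed vMem, one value per 15-minute interval


def is_eligible(vm):
    """A VM is only sized if it is old enough and has enough data."""
    return (vm.days_deployed >= MIN_DAYS_DEPLOYED
            and vm.days_of_data >= MIN_DAYS_OF_DATA)


def recommend(vm):
    """Return (vCPU, vMem) from the single peak plus headroom, rounded up.

    Returns None when the VM does not meet the eligibility guidelines.
    """
    if not is_eligible(vm):
        return None
    vcpu = math.ceil(max(vm.vcpu_samples) * (1 + VCPU_HEADROOM))
    vmem = math.ceil(max(vm.vmem_samples) * (1 + VMEM_HEADROOM))
    return vcpu, vmem


# Made-up example: peak of 1.3 vCPUs and 5.2 units of consumed vMem.
example = VmHistory(days_deployed=400, days_of_data=120,
                    vcpu_samples=[0.6, 1.3, 0.9],
                    vmem_samples=[4.1, 5.2, 4.8])
print(recommend(example))
```

Using the single maximum rather than an average or percentile is what makes the formula conservative: one busy 15-minute interval is enough to keep a VM at its current size.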
Candidacy Analysis Overview
| Criterion | Outside Co. | Internal |
|---|---|---|
| Location | Only a subset of locations included | All locations included |
| Resources | vCPU & vMem | Initial vCPU & small vMem pilot |
| Asset Status/Duration | Not taken into consideration | 180 days deployed |
| Environment Matching (by AppID) | Not taken into consideration | Match Prod/BCP; match Non-Prod |
| Minimum Data Required for Recommendation | 4 weeks via Modeling tool | 90 days of data |
| vCPU Formulas Used | Not disclosed | Single max vCPU (15 min interval) + 20%, rounded up |
| vMem Formulas Used | Not disclosed | Single max consumed vMem + 5%, rounded up |
| VMs Analyzed | thousands | thousands plus thousands |
| Candidates Identified | 92% | 15% |
Analysis Comparison
(configurations listed as vCPU x vMem)

Data Center 1
| | Server 1 | Server 2 | Server 3 | Servers 4, 5, 6 |
|---|---|---|---|---|
| Current Configuration | 4 x 8 | 4 x 8 | 4 x 8 | 4 x 8 |
| Outside Co. Recommendation | 1 x 1 | 1 x 1 | 1 x 1 | 1 x 1 |
| Internal Recommendation | 2 x 8 | 2 x 8 | 4 x 8 | None |
| Reason for Difference | Max vCPU 1.48, Max vMem 7.8 | BCP match PROD | Max vCPU 2.92, Max vMem 7.67 | Disposed |

Data Center 2
| | Server 1 | Server 2 | Server 3 | Server 4 |
|---|---|---|---|---|
| Current Configuration | 4 x 16 | 4 x 16 | 4 x 16 | 4 x 16 |
| Outside Co. Recommendation | 1 x 4 | 1 x 7 | 1 x 6 | 1 x 7 |
| Internal Recommendation | 3 x 12 | 2 x 16 | 2 x 16 | 3 x 16 |
| Reason for Difference | Max vCPU 2.8, Max vMem 11.4 | Max vCPU 1.52, Max vMem 15.62 | Max vCPU 1.14, Max vMem 15.62 | Max vCPU 2.72, Max vMem 15.61 |

Data Center 3
| | Server 1 | Server 2 | Servers 3, 4 |
|---|---|---|---|
| Current Configuration | 4 x 6 | 4 x 6 | 2 x 4 |
| Outside Co. Recommendation | 1 x 1 | 1 x 1 | 1 x 2 |
| Internal Recommendation | 2 x 6 | 2 x 6 | None |
| Reason for Difference | Max vCPU 1.44, Max vMem 5.49 | Max vCPU 1.7, Max vMem 5.6 | Disposed |
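The "BCP match PROD" reason reflects the environment matching guideline: servers with the same role in the same application are sized together, so one server's recommendation can be driven by its peer's utilization. A minimal sketch of that idea follows. The dictionary keys, the environment labels, and the choice to give every member of a group the largest per-server recommendation are assumptions made for illustration; the slides only say that Prod is matched with BCP and Non-Prod with Non-Prod.

```python
from collections import defaultdict

# Assumption: PROD and BCP are sized together; anything else is non-prod.
ENV_GROUP = {"PROD": "prod", "BCP": "prod"}


def match_sizes(servers):
    """Give every server in a group the largest recommendation in that group.

    `servers` is a list of dicts with hypothetical keys:
    name, app_id, function, env, vcpu, vmem (per-server recommendations).
    Returns {name: (vcpu, vmem)} after matching.
    """
    groups = defaultdict(list)
    for s in servers:
        key = (s["app_id"], s["function"], ENV_GROUP.get(s["env"], "non-prod"))
        groups[key].append(s)

    matched = {}
    for members in groups.values():
        vcpu = max(m["vcpu"] for m in members)
        vmem = max(m["vmem"] for m in members)
        for m in members:
            matched[m["name"]] = (vcpu, vmem)
    return matched


# Made-up example: the BCP server is idle but inherits the PROD size.
servers = [
    {"name": "web-prod-1", "app_id": "APP1", "function": "web", "env": "PROD", "vcpu": 2, "vmem": 8},
    {"name": "web-bcp-1", "app_id": "APP1", "function": "web", "env": "BCP", "vcpu": 1, "vmem": 6},
    {"name": "web-dev-1", "app_id": "APP1", "function": "web", "env": "DEV", "vcpu": 1, "vmem": 4},
]
print(match_sizes(servers))
```

Sizing a mostly idle BCP server from its own history would undersize it for the day it has to carry production load, which is why matching matters.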
The Process
• Capacity team to produce the results and review them with the project team to identify candidates.
• Project team to communicate plan to planners and application owners.
• Allow for rebuttal
  – But you better bring the facts
• Optimize
The First Year Results
• Of the 15% of VM candidates identified
  – 23% were cancelled after the appeals process
  – Of the completed
    • 50% reduction in configured vCPUs
    • vMem was excluded
  – 100% of reductions made with no issues
The Second Year Results
• Of the 8% of VM candidates identified
  – 24% were cancelled after the appeals process
  – Of the completed
    • 20% reduction in configured vCPUs
    • 10% reduction in configured vMem
  – 100% of reductions made with no issues
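To put the two years side by side, the candidate and cancellation rates can be combined into the share of analyzed VMs that were actually optimized. This is simple arithmetic on the percentages quoted above; it assumes the candidate percentage is a share of all VMs analyzed in that year, and the function name is made up for the example.

```python
def share_optimized(candidate_rate, cancel_rate):
    """Fraction of analyzed VMs that were optimized after appeals."""
    return candidate_rate * (1 - cancel_rate)


year1 = share_optimized(0.15, 0.23)  # 15% candidates, 23% cancelled
year2 = share_optimized(0.08, 0.24)  # 8% candidates, 24% cancelled

print(f"Year 1: {year1:.1%} of analyzed VMs optimized")
print(f"Year 2: {year2:.1%} of analyzed VMs optimized")
```

Under that assumption, roughly 11.6% of analyzed VMs were optimized in the first year and roughly 6.1% in the second.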
Realized Benefits
• Better performing VMs
  – Over-provisioning of resources can hurt
• Better performing hosts
  – Accurate view allowed for higher utilization of the clusters
• Costs
  – Delayed purchase of new farms for over a year
• Time to focus on the future
  – New farms running more powerful hardware allowed for a many-to-one replacement
What were the issues?
• Communication breakdown
  – First knowledge of optimization was from the change request
• Lack of understanding
  – Not knowing how and why
• Coordination of optimizations
  – Had to learn how things would work
What led to the Success?
• Management backing
  – You will be optimized unless you can produce evidence
• Conservative formula
  – Peak utilization served us well
• Communication, Communication, Communication
• Processes in place
  – Appeals process
  – Resources on demand (or at least with a phone call)
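What if we got more aggressive
| | Scenario 1 (same as Year 2) | Scenario 2 (1 hour interval) | Scenario 3 (12 month max) | Scenario 4 (less overhead) |
|---|---|---|---|---|
| Location | All infrastructure | All infrastructure | All infrastructure | All infrastructure |
| Resources | vCPU & vMem | vCPU & vMem | vCPU & vMem | vCPU & vMem |
| Asset Status/Duration | 180 days deployed | 180 days deployed | 180 days deployed | 180 days deployed |
| Environment Matching (by AppID) | Match Prod/BCP; match Non-Prod | Match Prod/BCP; match Non-Prod | Match Prod/BCP; match Non-Prod | Match Prod/BCP; match Non-Prod |
| Data Required for Recommendation | 90 days minimum (15 month max) | 90 days minimum (15 month max) | 90 days minimum (12 month max) | 90 days minimum (12 month max) |
| vCPU Formulas Used | Single max vCPU (15 min interval) + 20%, rounded up | Single max vCPU (1 hour interval) + 20%, rounded up | Single max vCPU (1 hour interval) + 20%, rounded up | Single max vCPU (1 hour interval), rounded up |
| vMem Formulas Used | Single max consumed vMem (15 min interval) + 5%, rounded up | Single max consumed vMem (1 hour interval) + 5%, rounded up | Single max consumed vMem (1 hour interval) + 5%, rounded up | Single max consumed vMem (1 hour interval), rounded up |
| VMs Analyzed | Over 10k | Over 10k | Over 10k | Over 10k |
| Candidates Identified | 4% | 18% | 19% | 29% |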
RISK
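A rough sketch of why the scenarios find more candidates: a longer interval smooths out short spikes, and dropping headroom removes the remaining cushion. The example below assumes that a 1 hour value is the average of four consecutive 15 minute samples, which the slides do not state, and the sample values and function names are invented for illustration.

```python
import math


def hourly_averages(samples_15min):
    """Average each run of four 15-minute samples into one hourly value.

    Assumption: hourly data is a plain average of the 15-minute samples.
    """
    hours = []
    for i in range(0, len(samples_15min), 4):
        chunk = samples_15min[i:i + 4]
        hours.append(sum(chunk) / len(chunk))
    return hours


def size(samples, headroom):
    """Single max plus headroom, rounded up to a whole vCPU."""
    return math.ceil(max(samples) * (1 + headroom))


# Two hours of made-up vCPU usage with one short spike to 1.9 vCPUs.
samples = [0.5, 0.6, 1.9, 0.6, 0.4, 0.5, 0.6, 0.5]

scenario_1 = size(samples, 0.20)                   # 15 min peak + 20%
scenario_2 = size(hourly_averages(samples), 0.20)  # 1 hour peak + 20%
scenario_4 = size(hourly_averages(samples), 0.0)   # 1 hour peak, no headroom

print(scenario_1, scenario_2, scenario_4)
```

For this made-up VM the recommendation drops from 3 vCPUs to 2 and then to 1, even though the short spike to 1.9 vCPUs is real. That is the trade between more candidates and more risk.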
Where are we today
• Part of the request process
  – Previously we may or may not have been asked for sizing
  – Currently all sizings come through us
• All existing servers get a sizing recommendation
• Annual Optimization Review
  – At least one optimization per year
  – Optimization now includes vCPU, vMem, and storage
    • Storage follows the same type of guidelines, but the analysis is not done by the capacity team
Thanks for listening!