Iwan ‘e1’ Rahabok
MGT2508BU
Troubleshooting Mastery with vRealize
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
#MGT2508BUR CONFIDENTIAL
Your speaker
Iwan ‘e1’ Rahabok
[email protected]
@e1_ang
9119-9226
Troubleshooting Stories: Availability
Area | What was reported | Client | Final Cause
Network | Mass VDI Disconnect | Investment House | Network switch buffer overflow
App | Horizon View View CS issue | Investment House | MS ADAM not synchronizing
Storage | Random VMs have high disk latency | Telco 1 | Bug in Array firmware
Troubleshooting Stories: Performance
What was reported | Client | Final Cause
Overall VDI slowness | Telco 2 | JavaScript in new apps
SQL Server hit a glitch | Holding Company | vMotion was long
VMs experience storage latency | UK Bank | Physical Oracle servers on shared array
vSAN performing slow | Cloud Provider | A VM did IOmeter
SAP experiences slow storage | State Oil Firm | Backup Schedule is different for Tier 1
Random VMs slow. No response to ping | Holding Company | Symantec EP Agent update
VMs slow when cluster utilization was 6% | UK Bank | Large Idle VMs
VDI grew slow over 5 months | Investment | VDI CPU demand undersized
Tier 3 clusters performing better than Tier 2 | Global Bank | Overcommit Ratio is misleading
Troubleshooting: Only the 1st time is forgiven
[Diagram: Issue!, Troubleshoot + RCA, Alert Set, Alert Triggered]
• Same problem happens again: know within minutes.
• Similar problem happens: have a good understanding within 1 hour.
Availability Troubleshooting Process
[Diagram comparing three processes: Manual Process, Optimized Process, Ideal Process. Step labels on the slide: Issue!, Upload Logs, Waiting for analysis, Analysis, Check the rest, RCA Report, Joint analysis, Alert Setup, Call Home, Preventive Action. The manual steps are marked "Manual" and "Time loss…"; the other two processes are marked "vRealize".]
Performance Troubleshooting Process
[Diagram comparing three processes: Manual Process, Optimized Process, Ideal Process. Step labels on the slide: Issue!, SLA breached, All Hands on Deck, Blame Storming, Ping Pong, Time loss…, No RCA, Big Picture, Joint Analysis, Shared Tools, Your Dashboard, RCA Report, Alert Setup, Early Warning, Preventive Action, vRealize.]
“A problem thoroughly understood is always fairly simple. Found your opinions on facts, not prejudices. We know too many things that are not true.”
Charles Kettering
Inventor and philosopher
General Motors
Master the Architecture
[Diagram, example: Tier 1 Clusters and Tier 2 Clusters (Cluster 1 to Cluster N, including vSAN and Nutanix), their datastores (numbered 1 to 10), and the physical array volumes (Vol 1 to Vol 3, or N/a) behind them.]
Master the Dependency
• Distributed services that are in-line can cause a performance problem somewhere else
• If they are affected, the business VM may be affected.
– One problem translates into another. This makes troubleshooting hard
[Diagram: an ESXi host (CPU, RAM, Disk, Net) running a Storage VM, a Network VM and a Security VM alongside the Business VMs.]
Master the point of monitoring
• For vmdk, use Datastore metric groups.
• For RDM, use Disk metric groups
• Disk metric group is not relevant for NFS
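The three rules above can be sketched as a small lookup. This is only an illustration: the function name and the string labels are hypothetical and are not vR Ops identifiers, and it assumes that a vmdk on an NFS datastore is monitored through the Datastore metric group like any other vmdk.

```python
def metric_group_for(backing: str) -> str:
    """Return the metric group to watch for a given virtual disk backing.

    Hypothetical helper; labels are illustrative, not product identifiers.
    """
    backing = backing.lower()
    if backing in ("vmdk", "vmfs", "nfs"):
        # vmdk files live on a datastore, so use the Datastore metric group.
        # The Disk metric group is not relevant for NFS.
        return "Datastore"
    if backing == "rdm":
        # RDM bypasses the datastore layer, so use the Disk metric group.
        return "Disk"
    raise ValueError(f"unknown backing type: {backing!r}")


for kind in ("vmdk", "rdm", "nfs"):
    print(kind, "->", metric_group_for(kind))
```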
VM Disk
[Diagram: a VM with Disk 1, Disk 2 and Disk 3 (scsi0:0, scsi0:1, scsi0:2). Each vDisk is backed by a VMFS datastore, an NFS datastore, or an RDM disk.]
VM Memory
Layer | Utilization | Contention
Application | Depends on the app | Depends on the app
Guest OS | OS Free RAM, OS Commit Ratio | OS Page-in Rate
VM | VM Consumed | VM Contention
Hero to Zero. Over 5 months…
What was reported
IT received complaints from end users. Random users are affected, but most users are happy.
Performance was good initially. It grew worse over the months.
Information Gathering
[Chart: Deployed and Performance over 5 months. Y-axis from 200 to 550, with Full Capacity marked.]
No change in Horizon, vSphere. Same Win 7 golden image. new application,
Happens on both wired & wireless, and on thin client, thick client and iPad.
How Infra impacts VM performance
[Diagram: ESXi + physical infrastructure (Infrastructure Utilisation) hosts the VMs (VM Utilisation, VM Contention), each running a Guest OS.]
Each VM is 2 vCPU, 8 GB RAM. At peak period, the average user is expected to use only 7% CPU.
Analysis
Plotting over time:
• ESXi utilization: CPU higher than expected during peak period.
• Total Demand: again, higher than plan.
• Demand/User: need to investigate this.
Plan: 674 MHz. Actual: ~1.8 GHz.
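To put the plan and actual figures on this slide side by side, a rough calculation helps. The sketch below only divides the two numbers shown (674 MHz planned, roughly 1.8 GHz observed); it makes no assumption about how either figure was derived.

```python
plan_mhz = 674      # planned CPU demand, from the slide
actual_mhz = 1800   # observed demand, ~1.8 GHz on the slide

# How far the observed demand overshoots the sizing plan.
ratio = actual_mhz / plan_mhz
print(f"Actual demand is about {ratio:.1f}x the plan")
```

An overshoot of this size is consistent with the RCA conclusion later in the deck that the sizing was too small.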
Windows idle. No hands on keyboard.
MS Outlook running (online).
CPU “Peak” Usage is around 40%.
Watching a YouTube video (company video). Both vCPUs rose to around 95% when we changed to full screen.
We need to cater for concurrency: what percentage of staff are watching at the same time?
Working with a web-based app, then pausing, then working on another web-based app: CPU usage is well above 7%.
Utilization Contention
• By itself, 40% of a 2 vCPU VM does not matter.
• But stack them up, add a monster VM, run it on a small ESXi, and you will have a problem.
[Diagram: Supply is an ESXi host with 8 cores + 8 cores. Demand is an 8 vCPU Storage VM plus eight 2 vCPU VMs.]
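The supply and demand picture on this slide can be worked through as simple arithmetic. The sketch below assumes the diagram shows one ESXi host with two 8-core sockets, one 8 vCPU Storage VM and eight 2 vCPU desktop VMs; the variable names are illustrative.

```python
# Supply: physical cores on the ESXi host (assumed 2 sockets x 8 cores).
sockets = 2
cores_per_socket = 8
supply_cores = sockets * cores_per_socket

# Demand: configured vCPUs of every VM placed on that host.
storage_vm_vcpu = 8
desktop_vm_count = 8
vcpu_per_desktop = 2
demand_vcpu = storage_vm_vcpu + desktop_vm_count * vcpu_per_desktop

ratio = demand_vcpu / supply_cores
print(f"{demand_vcpu} vCPU on {supply_cores} cores, {ratio:.1f} vCPU per core")
```

The ratio alone looks modest. The point of the slide is that a single 8 vCPU VM occupies half the cores whenever it runs, so the small VMs can queue even at a low consolidation ratio.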
Super Metrics
8 line charts:
• Max and Average.
• 4 infra elements.
Each covers 5 months, because we knew performance was good at the start of that period.
CPU was the constraint:
• Average is good.
• Max is bad.
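Why the Max and the Average tell different stories can be seen with a few numbers. The sketch below uses made-up contention values for four VMs in one cluster at one point in time; it is plain Python, not vR Ops super metric syntax.

```python
from statistics import mean

# Hypothetical CPU contention (%) per VM in one cluster at one sample time.
contention_pct = {"vm-01": 0.3, "vm-02": 0.4, "vm-03": 0.2, "vm-04": 9.5}

values = list(contention_pct.values())
avg_contention = mean(values)          # looks acceptable
max_contention = max(values)           # exposes the one unhappy VM
worst_vm = max(contention_pct, key=contention_pct.get)

print(f"avg={avg_contention:.1f}%  max={max_contention:.1f}%  worst={worst_vm}")
```

A cluster-level average hides the single VM that is suffering, which is why the deck tracks the maximum across VMs alongside the average.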
Cluster CPU Utilization
Max VM CPU Contention
Avg VM CPU Contention
We need to validate at individual level
This one proves that the VM experienced CPU contention, which built up over 2+ months. It was then solved by moving the VM to a cluster that was performing well.
We went back 3 months for User A, as the performance was good 3 months ago.
The CPU utilisation pattern was the same, so user demand didn’t go up.
[Chart annotation: the time the VM was migrated]
Zooming in: Before and After
Zooming into the period before and after VM migration. CPU Contention dropped right after migration. Notice it remains stable after that.
Zooming in further, so the comparison is clearer.
This proves the HCI vendor box was having difficulty serving its VMs well. We are at a low consolidation ratio, so something else is using up the physical cores.
VM CPU Latency in mid July. We went back 3 months, as the performance was good 3 months ago.
This was 7 days of data, so >2000 chances to hit a high value (7 days at a 5-minute collection interval is 2,016 samples).
VM CPU Latency at the end of Sep, after more VMs were added. It’s ~10x worse.
This explains the performance degradation felt by users.
After migration: 20x better
After the VM was migrated out of the HCI vendor cluster onto ESXi with no overcommit and no large VM, the value stayed well below 0.5%. The peak was 0.38%.
This one proves that the VM experienced CPU contention, which built up over 2+ months. It was then solved by moving it to a cluster without contention.
Root Cause Analysis
What happened
• Sizing was too small.
• ESXi unable to cope.
• Monster VM + AV VM took half the box
Lesson Learned
• Know the workload.
• Website matters in VDI
• Know how VMkernel scheduler works
• Define KPI for VDI
Alert Setup
• Max VM CPU Contention in a cluster
• Max VM RAM Contention in a cluster
• Max VM Disk Latency in a cluster
Mass VDI Disconnect
What was reported
A fellow IT team member, seated in the same row as the VDI team, had his session disconnected at 7:42 am while he was working.
The Head of IT Infra had issues trying to get his zero client to wake up.
This is the Horizon Event DB.
It’s hard to read as it’s not plotted across time. You need to scroll through the events manually, one by one, so it is easy to make a mistake.
Worse, it does not distinguish between normal and abnormal disconnects.
This is Log Insight.
We know who was affected.
We also know the ESXi Host they
were on.
We can also group by View CS
Mass disconnect happened.
We know this is abnormal as it’s a high spike, and no one was working at that time.
Is there a pattern on the client side?
Only thin clients are affected. Thick clients (Windows PC) are not affected.
Checking the client logs, there is no error other than a network error. The client itself is working fine.
Analysis on affected VMs
Find something in common among the affected VMs.
Problems only happened on HCI clusters. The hosts were well spread.
• The hosts themselves had no errors.
• We knew these were the hosts at the time of the incident, as there was no vMotion.
Conclusion: it does not look like a problem with the hosts, but with the HCI.
ESXi No | Cluster | Affected VM
5 | HCI cluster 1 | 1
7 | HCI cluster 1 | 4
8 | HCI cluster 1 | 2
9 | HCI cluster 1 | 2
10 | HCI cluster 1 | 2
11 | HCI cluster 1 | 4
12 | HCI cluster 1 | 2
13 | HCI cluster 1 | 3
14 | HCI cluster 1 | 1
15 | HCI cluster 1 | 2
16 | HCI cluster 1 | 3
17 | HCI cluster 1 | 1
18 | HCI cluster 1 | 2
21 | HCI cluster 1 | 2
22 | HCI cluster 1 | 4
23 | HCI cluster 2 | 1
26 | HCI cluster 2 | 1
27 | HCI cluster 1 | 2
28 | HCI cluster 1 | 4
What did the VMs experience?
It’s not a Windows issue, since the VMs resumed normally.
It’s not an application issue, since it happened on 42 OS instances at the same time.
Something must have frozen the VMs temporarily.
A VM consumes 4 resources:
• CPU
• Disk
• RAM
• Network
We will check each of the 4.
Average Disk Latency among Affected VMs.
The 42 users experienced a spike in Disk Latency at the time of the incident.
The spike timing and length matched the situation.
We plotted the same for CPU, Network and RAM. No correlation at all.
Max latency rose to 543 ms. That’s very high as it’s
sustained for 5 minutes. From here we can tell there is a
storage issue.
Drilling into the 42 VMs.
There is a noticeable spike across many VMs.
The green number shows the last value, not the peak value.
However, there is no high IOPS coming from the 42 VMs. This is expected at 7:42 am on a weekday.
So it is something from outside the VMs.
It indicates that storage IO was not being processed.
Conclusion: VMs were hit by disk latency
• At the time of the incident (7:42 am), there was a distinct and unhealthy spike.
• Both before and after are very good, so it’s a short-lived issue.
• It’s also a one-time event. It did not occur before or after within a 48-hour period. The spike is distinct.
• There was no spike in IOPS. Latency happens when Demand isn’t met by Supply. So Supply has dropped.
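The reasoning in the last bullet can be written down as a small decision rule. This is a sketch only: the function name, the 2x spike factor and the baseline numbers are hypothetical. Only the 543 ms latency figure comes from the deck.

```python
def likely_cause(latency_ms, baseline_latency_ms, iops, baseline_iops, factor=2.0):
    """Classify a latency spike by whether demand (IOPS) rose with it.

    `factor` is a hypothetical "how much above baseline counts as a spike".
    """
    latency_spiked = latency_ms > factor * baseline_latency_ms
    iops_spiked = iops > factor * baseline_iops
    if not latency_spiked:
        return "no latency spike"
    # Latency is demand not met by supply. If demand did not rise,
    # the supply side (array, network, HCI sync) must have dropped.
    return "demand rose" if iops_spiked else "supply dropped"


# 543 ms is the max latency from the incident; other values are made up.
print(likely_cause(latency_ms=543, baseline_latency_ms=5, iops=100, baseline_iops=100))
```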
Alert so IT knows right away.
Root Cause Analysis
What happened
• View Client initiated disconnect
• View Server not responding
• Windows IO freeze
• HCI Storage can’t sync
• Network buffer overflow
Lesson Learned
• VDI monitoring has to include storage and network
• Not all alerts are in the GUI. Buried in log
Alert Setup
• Network buffer overflow
• HCI unable to sync
• View Disconnect > 10 at the same time.
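The last alert, more than 10 View disconnects at the same time, can be sketched as a simple bucket count. The example below is an illustration, not a Log Insight query: it assumes "at the same time" means within the same minute, and the event timestamps (including the date) are made up around the 7:42 am incident.

```python
from collections import Counter
from datetime import datetime

THRESHOLD = 10  # from the slide: more than 10 disconnects at the same time


def mass_disconnect_minutes(timestamps, threshold=THRESHOLD):
    """Return the minutes in which disconnect events exceed the threshold."""
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return sorted(minute for minute, count in per_minute.items() if count > threshold)


# Hypothetical events: 42 disconnects at 7:42, plus two normal ones later.
events = [datetime(2017, 1, 2, 7, 42, s) for s in range(42)]
events += [datetime(2017, 1, 2, 9, 15, 0), datetime(2017, 1, 2, 11, 30, 0)]

print(mass_disconnect_minutes(events))
```

Normal logoffs spread through the day never reach the threshold, so only the 7:42 bucket is reported.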
Start CMP Engagement with Assessments
Get your FREE reports
vSphere Optimization Assessment (VOA)
▪ Optimize SDDC and Hybrid Cloud
▪ Time to value in days
vmware.com/assessment/voa
Hybrid Cloud Assessment (HCA)
▪ Compare private and public cloud costs
▪ Time to value in < 1 hour
vmware.com/hybrid-cloud-assessment.html
Next Steps and Key Resources
Upgrade to vRealize Operations 6.6
Upgrade to vROps 6.6. Visit our new upgrade center: vmware.com/go/vrops/upgrade
Share Your Feedback
Receive a $10 Amazon gift card by completing a short survey about vRealize Operations: surveymonkey.com/r/vrops66
OFFER valid during VMworld 2017
Get Certified
Complete your online certification exam
VMware Digital Badges: vmware.com/go/vROPS2017badge
Overall VDI slowness
• What was reported
– IT Manager received complaints from end users that VDI is slow
– Customer is a telco in Singapore
• Gathering additional info
– No change in infrastructure
– No change in number of users or VDI VM master image
– A new application was rolled out that provides a front-end for 5 existing applications. Users can work with all 5 apps instead of 1 at a time. Productivity improved.
– The app was developed on a laptop. It was never tested in a VDI environment.
Analysis
• Sum up total Demand from all users
– Take over 1 month, not 1 day.
– Sum the total demand at peak period, then divide by # Users.
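The two steps above can be sketched as follows. All sample values and timestamps are hypothetical; the structure simply sums per-user VM demand at each sample, picks the peak across the month, and divides by the number of users.

```python
# Hypothetical CPU demand (MHz) of each user's VM at three sample times
# spread over one month. Real data would have many more samples.
demand_mhz = {
    "day01 10:00": [600, 700, 650],
    "day15 10:00": [1700, 1900, 1800],
    "day30 10:00": [900, 800, 1000],
}

# Step 1: total demand from all users at each sample time.
totals = {t: sum(per_vm) for t, per_vm in demand_mhz.items()}

# Step 2: take the peak over the whole month, then divide by the user count.
peak_time = max(totals, key=totals.get)
users = len(demand_mhz[peak_time])
per_user_mhz = totals[peak_time] / users

print(f"peak at {peak_time}: {per_user_mhz:.0f} MHz per user")
```

Looking at a single day (day01 in this made-up data) would have suggested a much lower per-user figure, which is why the slide asks for a month.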
Random VMs slow
• What was reported
– VMs do not respond to ping. There seems to be no network problem.
– No time pattern to the problems. They can occur daily, or less often.
– Different VMs are affected. No pattern among the affected VMs.
– Problem disappears on its own.
– Customer suspects a rogue VM is causing the problem
• Gathering additional info
– 700 VMs spread across 2 datacenters in Singapore, linked by a low-latency network (2x 1Gb links).
– EMC (VPLEX, linked by Cisco MDS) supporting vSphere HA (stretched cluster)
– Half of the VMs are on VPLEX-replicated datastores.
– vR Ops and Log Insight were not installed
– Network team noticed a high spike in link utilization that matched the time of the issue
• Since the link is dedicated to VPLEX, VPLEX replication must have gone up
Analysis
• VPLEX is replicating changes
– Write operations, not Read.
– We plotted
• Install vR Ops
– Created a group of all datastores backed by VPLEX
– Created a super metric that sums their Write Throughput.
– Expected to see the pattern match. It did!
• Log in to some VMs and check Windows Events. They show a Symantec EP Agent signature update.
• Signature update: it appends 4 KB, but to do so it copies 512 MB, appends 4 KB, then deletes 512 MB.
• Discussed with Security Team. They agreed to randomize even more.
• Problem stopped happening
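The super metric step can be illustrated outside vR Ops. The sketch below sums write throughput across a group of datastores at each sample and checks whether its peak lines up with the replication link spike. Datastore names and all numbers are hypothetical, and this is plain Python rather than super metric syntax.

```python
# Hypothetical write throughput (KBps) per VPLEX-backed datastore,
# one value per sample time.
write_kbps = {
    "ds-vplex-01": [100, 120, 5000, 110],
    "ds-vplex-02": [90, 100, 4800, 95],
    "ds-vplex-03": [80, 85, 5100, 90],
}
# Hypothetical inter-site link utilisation (%) at the same sample times.
link_util_pct = [10, 12, 95, 11]

# "Super metric": sum of write throughput across the group, per sample.
total_write = [sum(sample) for sample in zip(*write_kbps.values())]

peak_write_idx = total_write.index(max(total_write))
peak_link_idx = link_util_pct.index(max(link_util_pct))

print(total_write)
print("patterns match" if peak_write_idx == peak_link_idx else "no match")
```

The size of each spike follows from the update behaviour described above: appending 4 KB by copying 512 MB first writes roughly 131,000 times more data than the change itself.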
VMs slow when cluster utilization was 6%
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
Tier 3 clusters performing better than Tier 2
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
SQL Server performance
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
VMs experience storage latency
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
Which VMs hit high disk latency, and how often?
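One way to answer this question from exported samples is to count, per VM, how many samples exceed a latency threshold. The sketch below is illustrative: the 20 ms threshold, the VM names and the latency values are all hypothetical.

```python
THRESHOLD_MS = 20  # hypothetical definition of "high" disk latency

# Hypothetical disk latency samples (ms) per VM.
latency_ms = {
    "vm-a": [3, 25, 4, 30],
    "vm-b": [2, 3, 4, 5],
    "vm-c": [50, 60, 2, 70],
}

# How often each VM exceeded the threshold.
hits = {vm: sum(1 for v in samples if v > THRESHOLD_MS)
        for vm, samples in latency_ms.items()}

# Keep only VMs that were hit, worst offenders first.
ranked = sorted(((vm, n) for vm, n in hits.items() if n > 0),
                key=lambda pair: pair[1], reverse=True)

for vm, n in ranked:
    print(f"{vm}: {n} samples above {THRESHOLD_MS} ms")
```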
Zooming into 1 VM. The pattern is obvious!
vSAN performing slow
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
SAP slow performance
• What was reported
– SAP team detected slowness. They were seeing high storage latency
– They checked with the Storage team. The Storage team saw high latency & IOPS across many LUNs in many arrays. Since the arrays are dedicated to VMware, they asked the Platform team what was causing the widespread demand.
– Platform team is caught in between customer and supplier.
• Gathering additional info
– Customer has 6 datacenters, 22 vCenters and 3000 VMs.
– The problem happened during the early morning, but the complaint only reached the VMware team at 12 pm.
• Analysis
• Guarding for future
Random VMs have high disk latency
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
Telco in India
Windows VMs cannot boot
• What was reported
– 3 VMs unable to boot. All are Windows, and all have physical RDM
– Customer is a large Asian property conglomerate based in Singapore
– Customer wanted to know what caused the corruption
• Gathering additional info
– MBR is only 512 bytes at the beginning of a drive. BIOS uses it to locate the start of the OS partition. There is a boot table
• Analysis
– MBR region was corrupted: 3 bytes
– Microsoft restored the MBR region
– Not able to figure out why it was corrupted.
• Guarding for future
– Microsoft provided a utility to back up the MBR region.
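The idea of keeping a copy of the 512-byte MBR and comparing it later can be sketched in a few lines. This is not the Microsoft utility mentioned above, whose behaviour the deck does not describe. The function names are hypothetical, and the demo runs against a throwaway image file, not a real disk.

```python
import os
import tempfile

MBR_SIZE = 512  # from the slide: the MBR is the first 512 bytes of a drive


def read_mbr(path):
    """Read the first 512 bytes of a disk image (or raw device)."""
    with open(path, "rb") as f:
        data = f.read(MBR_SIZE)
    if len(data) != MBR_SIZE:
        raise ValueError("source is smaller than one MBR sector")
    return data


def changed_offsets(before, after):
    """Return the byte offsets at which two MBR copies differ."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


# Build a throwaway 4 KB image with the standard 0x55 0xAA boot signature.
image = bytearray(4096)
image[510:512] = b"\x55\xaa"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(image)
    image_path = tmp.name

backup = read_mbr(image_path)  # copy taken while the MBR is healthy

# Simulate a 3-byte corruption, as in the incident (offset is arbitrary).
with open(image_path, "r+b") as f:
    f.seek(100)
    f.write(b"\xff\xff\xff")

damaged = read_mbr(image_path)
diff = changed_offsets(backup, damaged)
print("changed offsets:", diff)

os.remove(image_path)
```

With a known-good copy on hand, the exact bytes that changed can be identified and the sector restored.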
SQL Server slow but IOPS is low
• What was reported
• Gathering additional info
• Analysis
• Guarding for future
VM CPU counters
• Used vs Usage vs Demand
• Latency
• Ready
– Include co-stop since 6.0?
Single VM Troubleshooting
The VM
Capacity issue:
• CPU Demand > 90%
• CPU Run Queue > 3 per vCPU
• CPU Swap Wait high, CPU IO Wait high
• RAM Free < 250 MB
• RAM Committed > 70%
• Page-In Rate is high
• Disk Queue Length > ___
• Disk IOPS or Throughput or OIO is high
• Low disk space
• Network Usage is high
Non-Capacity issue:
• Wrong driver (storage driver, network driver) or its settings
• Too many snapshots or large snapshots
• Tools not running
• VM vCPU Usage unbalanced
• App configured wrongly, not-indexed
• Memory Leak
• Network Latency is high or TCP retransmit
• VM too big, process ping-pong, high context switch
• NUMA effect
• Guest OS power setting
The Infra
Infra unable to cope:
• ESXi CPU insufficient: Demand > 90%, VM CPU Co-Stop > 1%, CPU Ready > 5%, no. of cores too small for VM
• ESXi RAM insufficient: VM Balloon active, VM RAM Swap-in is high, NUMA migration
• ESXi Disk IOPS or Throughput is high
• ESXi vmkernel queue or latency is high
• Datastore latency is high
• ESXi vmnic usage is high
Other issue:
• VM was vMotioned
• ESXi vmnic dropped packets or generated errors
• ESXi wrong configuration: power management, multi-pathing, driver version, queue depth setting
• Hardware fault: disk soft error, bad sector, RAM error,
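The numeric thresholds on this slide lend themselves to a simple checklist. The sketch below encodes only the checks that have an explicit number (CPU Demand, Run Queue, RAM Free, RAM Committed, Co-Stop, Ready). The dictionary keys and the sample values are hypothetical, not vR Ops metric names.

```python
def vm_capacity_flags(m):
    """Guest-level capacity checks from the slide that this VM trips."""
    checks = [
        ("CPU Demand > 90%", m["cpu_demand_pct"] > 90),
        ("CPU Run Queue > 3 per vCPU", m["cpu_run_queue"] > 3 * m["vcpus"]),
        ("RAM Free < 250 MB", m["ram_free_mb"] < 250),
        ("RAM Committed > 70%", m["ram_committed_pct"] > 70),
    ]
    return [name for name, tripped in checks if tripped]


def infra_flags(m):
    """Signs from the slide that the infrastructure is unable to cope."""
    checks = [
        ("VM CPU Co-Stop > 1%", m["cpu_costop_pct"] > 1),
        ("CPU Ready > 5%", m["cpu_ready_pct"] > 5),
    ]
    return [name for name, tripped in checks if tripped]


# Hypothetical metrics for one 2 vCPU VM.
sample = {
    "vcpus": 2,
    "cpu_demand_pct": 95,
    "cpu_run_queue": 4,
    "ram_free_mb": 900,
    "ram_committed_pct": 60,
    "cpu_costop_pct": 0.2,
    "cpu_ready_pct": 7.5,
}

print("VM:", vm_capacity_flags(sample))
print("Infra:", infra_flags(sample))
```

Separating the two lists mirrors the slide: the first tells you whether the VM itself is undersized, the second whether the host is failing to serve it.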