
Iwan ‘e1’ Rahabok

MGT2508BU


Troubleshooting Mastery with vRealize

VMworld 2017 Content: Not for publication or distribution

Disclaimer

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.


Your speaker

Iwan ‘e1’ Rahabok, @e1_ang, 9119-9226

Troubleshooting Stories: Availability

What was reported | Client | Final Cause
Network: Mass VDI Disconnect | Investment House | Network switch buffer overflow
App: Horizon View, View CS issue | Investment House | MS ADAM not synchronizing
Storage: Random VMs have high disk latency | Telco 1 | Bug in Array firmware

Troubleshooting Stories: Performance

What was reported | Client | Final Cause
Overall VDI slowness | Telco 2 | JavaScript in new apps
SQL Server hit a glitch | Holding Company | vMotion was long
VMs experience storage latency | UK Bank | Physical Oracle servers on shared array
vSAN performing slow | Cloud Provider | A VM did IOmeter
SAP experiences slow storage | State Oil Firm | Backup Schedule is different for Tier 1
Random VMs slow. No response to ping | Holding Company | Symantec EP Agent update
VMs slow when cluster utilization was 6% | UK Bank | Large Idle VMs
VDI grew slow over 5 months | Investment | VDI CPU demand undersized
Tier 3 clusters performing better than Tier 2 | Global Bank | Overcommit Ratio is misleading

Troubleshooting: Only 1st time is forgiven

[Diagram labels: Issue!, Troubleshoot + RCA, Alert Set, Alert Triggered]

• Same problem happens again: know within minutes.

• Similar problem happens: have a good understanding within 1 hour.

Availability Troubleshooting Process

[Diagram comparing three flows: Manual Process, Optimized Process, Ideal Process. Labels: Issue!, Upload Logs, Waiting for analysis, Analysis, Check the rest, RCA Report, Joint analysis, vRealize, Alert Setup, Call Home, Preventive Action. The manual steps are marked "Time loss….".]

Performance Troubleshooting Process

[Diagram comparing three flows: Manual Process, Optimized Process, Ideal Process. Labels: Issue!, SLA breached, All Hands on Deck, Blame Storming, Ping Pong, No RCA, Time loss…., Big Picture Joint Analysis, Your Dashboard, Shared Tools, RCA Report, Alert Setup, vRealize, Early Warning, Preventive Action.]

“A problem thoroughly understood is always fairly simple. Found your opinions on facts, not prejudices. We know too many things that are not true.”

Charles Kettering

Inventor and philosopher

General Motors




Master the Architecture

[Diagram, example: Tier 1 Clusters and Tier 2 Clusters (Cluster 1 to Cluster N), Datastores 1 to 10, Volumes 1 to 3 on a Physical Array, vSAN, Nutanix]

Master the Dependency

• Distributed services that are in-line can cause a performance problem somewhere else.

• If they are affected, the business VM may be affected.

– One problem translates into another. This makes troubleshooting hard.

[Diagram: an ESXi Host (CPU, RAM, Disk, Net) running a Storage VM, a Network VM, a Security VM, and Business VMs]

Master the point of monitoring

VM Disk

• For vmdk, use Datastore metric groups.

• For RDM, use Disk metric groups.

• Disk metric group is not relevant for NFS.

[Diagram: a VM with three vDisks (scsi0:0, scsi0:1, scsi0:2; Disk 1, Disk 2, Disk 3) backed by RDM, VMFS Datastore, and NFS Datastore]

VM Memory

Layer | Utilization | Contention
Application | Depends on the app | Depends on the app
Guest OS | OS Free RAM, OS Commit Ratio | OS Page-in Rate
VM | VM Consumed | VM Contention
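The disk rules above can be read as a simple lookup from how a virtual disk is backed to the metric group worth watching. The sketch below is only an illustration: the backing labels (`vmdk_vmfs`, `vmdk_nfs`, `rdm`) and the function name are hypothetical and are not vRealize identifiers.

```python
# Hypothetical labels for how a virtual disk is backed; not vRealize identifiers.
METRIC_GROUP_BY_BACKING = {
    "vmdk_vmfs": "Datastore",  # vmdk on a VMFS datastore
    "vmdk_nfs": "Datastore",   # vmdk on NFS; the Disk group is not relevant here
    "rdm": "Disk",             # raw device mapping
}


def metric_group_for(backing: str) -> str:
    """Return the metric group to look at for a given disk backing."""
    try:
        return METRIC_GROUP_BY_BACKING[backing]
    except KeyError:
        raise ValueError(f"unknown disk backing: {backing!r}") from None
```

Raising on an unknown label, rather than defaulting to one group, keeps a typo from silently pointing the analysis at the wrong counters.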

Hero to Zero. Over 5 months…

What was reported

IT received complaints from end users. Random users are affected, but most users are happy. Performance was good initially. It grew worse over the months.

Information Gathering

[Chart over 5 months: Deployed against Full Capacity (axis 200 to 550), with Performance]

• No change in Horizon, vSphere. Same Win 7 golden image. new application,

• Happens on both wired & wireless, on thin client, thick client, and iPad.

How Infra impacts VM performance

[Diagram labels: ESXi + physical infrastructure, Infrastructure Utilisation, VM, Guest OS, VM Utilisation, VM Contention]

Each VM is 2 vCPU, 8 GB RAM. At peak period, the average user is expected to use only 7% CPU.

Analysis

• Demand/User: Plan 674 MHz, Actual ~1.8 GHz. Need to investigate this.

• Total Demand: plotting over time, CPU is higher than expected during peak period.

• ESXi utilization: again, higher than plan.

Windows idle. No hands on keyboard. MS Outlook running (online). CPU “Peak” Usage is around 40%.

Watching a YouTube video (company video). Both CPUs rose when we changed to full screen, to around 95%. We need to cater for concurrency: what percentage of staff watch at the same time?

Working with a web-based app, then pausing, then working on another web-based app. CPU usage is well above 7%.

Utilization Contention

• By itself, 40% of a 2 vCPU VM does not matter.

• But stack them up, add a monster VM, run it on a small ESXi, and you will have a problem.

[Diagram, Demand vs Supply: an ESXi with 8 cores + 8 cores, running an 8 vCPU Storage VM plus eight 2 vCPU VMs]
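To make the stacking argument concrete, here is a rough worked example. It assumes the diagram shows a 16-core host (2 x 8 cores) running eight 2 vCPU desktops and one 8 vCPU storage VM. The 40% figure comes from the slides. The storage VM being fully busy is an assumption made for illustration, and hyper-threading is ignored.

```python
def cpu_demand_cores(vms):
    """Sum demand in physical-core equivalents: vCPUs x utilization per VM."""
    return sum(vcpus * utilization for vcpus, utilization in vms)


HOST_CORES = 16  # assumption: 2 sockets x 8 cores, as the diagram suggests

# Assumed mix: eight 2 vCPU desktops at 40%, plus one busy 8 vCPU storage VM.
desktops = [(2, 0.40)] * 8
storage_vm = [(8, 1.00)]

demand = cpu_demand_cores(desktops + storage_vm)
print(f"demand {demand:.1f} of {HOST_CORES} cores ({demand / HOST_CORES:.0%})")
```

Under these assumptions the host is already close to full with every desktop at only 40%. If the desktops move toward the 95% seen during full-screen video, demand exceeds the 16 cores and VMs start waiting for CPU.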

Super Metrics

8 line charts:

• Max and Average.

• 4 Infra elements.

Each over 5 months, as we knew what was good.

CPU was the constraint

• Average is good.

• Max is bad.

[Charts: Cluster CPU Utilization, Max VM CPU Contention, Avg VM CPU Contention]

We need to validate at individual level.
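The "average is good, max is bad" observation is easier to see with a small roll-up. The sketch below is plain Python, not vRealize Operations super metric syntax. It takes per-VM CPU contention series and produces the max and the average across VMs at each sample. The VM names and values are made up.

```python
from statistics import mean


def cluster_rollup(contention_by_vm):
    """Roll up per-VM series into per-sample max and average across VMs.

    contention_by_vm maps a VM name to a list of CPU contention (%) samples;
    all lists are assumed to be aligned on the same timestamps.
    """
    per_sample = list(zip(*contention_by_vm.values()))
    return [max(s) for s in per_sample], [mean(s) for s in per_sample]


# Made-up samples: most VMs are fine, one VM suffers at the second sample.
samples = {
    "vdi-01": [0.2, 0.3, 0.2],
    "vdi-02": [0.1, 0.2, 0.1],
    "vdi-03": [0.3, 9.7, 0.3],
}
max_series, avg_series = cluster_rollup(samples)
```

The max series exposes the one unhappy VM at full strength, while the average dilutes it. With hundreds of VMs in a cluster the average would barely move.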

This one proves that the VM experienced CPU contention, which built over 2+ months. It was then solved by moving it to a performing cluster.

We went back 3 months for User A, as the performance was good 3 months ago.

CPU utilisation pattern was the same. So user demand didn’t go up.

[Chart annotation: the time the VM was migrated]

Zooming in: Before and After

Zooming into the period before and after VM migration. CPU Contention dropped right after migration. Notice it remains stable after that.

Zooming in further, so the comparison is clearer.

This proves the HCI vendor box was having difficulty serving its VMs well. We are at a low consolidation ratio, so something else is using up the physical cores.

VM CPU Latency in mid July. We went back 3 months, as the performance was good 3 months ago. This was 7 days of data, >2000 chances to hit high.

VM CPU Latency at the end of September, after more VMs were added. It’s ~10x worse. This explains the performance degradation felt by users.
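The ">2000 chances" figure is consistent with one sample every 5 minutes over 7 days. The slide does not state the interval, so the 5-minute default below is an assumption. The second helper is a hypothetical way to express "how often did it hit high" as a share of samples.

```python
def sample_count(days: int, interval_minutes: int = 5) -> int:
    """Number of samples collected over `days` at a fixed interval."""
    return days * 24 * 60 // interval_minutes


def share_above(samples, threshold):
    """Fraction of samples strictly above a threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)
```

With these assumptions, `sample_count(7)` gives 2016 samples for the week.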

After migration: 20x better

After migrating out of the HCI vendor box into an ESXi host with no overcommit and no large VM, it was well below 0.5%. The peak was 0.38%.

This one proves that the VM experienced CPU contention, which built over 2+ months. It was then solved by moving it to a cluster without contention.

Root Cause Analysis

What happened

• Sizing was too small.

• ESXi unable to cope.

• Monster VM + AV VM took half the box.

Lesson Learned

• Know the workload.

• Website matters in VDI.

• Know how VMkernel scheduler works.

• Define KPI for VDI.

Alert Setup

• Max VM CPU Contention in a cluster

• Max VM RAM Contention in a cluster

• Max VM Disk Latency in a cluster

Mass VDI Disconnect

What was reported

A fellow IT team member, seated in the same row as the VDI team, had his session disconnected at 7:42 am while he was working.

Head of IT Infra had issues trying to get his zero client to wake up.

This is Horizon Event DB.

It’s hard to see as it’s not plotted across time. You need to scroll manually, one by one. It is easy to make a mistake.

Worse, it does not distinguish normal and abnormal disconnects.

This is Log Insight.

We know who was affected. We also know the ESXi Host they were on. We can also group by View CS.

Mass disconnect happened

We know this is abnormal as it’s a high spike, and no one was working at that time.

Is there a pattern on the client side?

Only thin clients are affected. Thick clients (Windows PC) are not affected.

Checking the client logs, there is no error other than a network error. The client itself is working fine.

Analysis on affected VMs

Find something in common among the affected VMs

Problems only happened on HCI clusters. The hosts were well spread.

• The hosts themselves had no errors.

• We knew these were the hosts, as there was no vMotion.

Conclusion: it does not look like a problem with the hosts, but with the HCI.

ESXi No | Cluster | Affected VM
5 | HCI cluster 1 | 1
7 | HCI cluster 1 | 4
8 | HCI cluster 1 | 2
9 | HCI cluster 1 | 2
10 | HCI cluster 1 | 2
11 | HCI cluster 1 | 4
12 | HCI cluster 1 | 2
13 | HCI cluster 1 | 3
14 | HCI cluster 1 | 1
15 | HCI cluster 1 | 2
16 | HCI cluster 1 | 3
17 | HCI cluster 1 | 1
18 | HCI cluster 1 | 2
21 | HCI cluster 1 | 2
22 | HCI cluster 1 | 4
23 | HCI cluster 2 | 1
26 | HCI cluster 2 | 1
27 | HCI cluster 1 | 2
28 | HCI cluster 1 | 4

What did the VMs experience?

It’s not a Windows issue, since the VMs resumed normally.

It’s not an application issue, since it happened on 42 OSes at the same time.

Something must have frozen the VMs. Temporarily.

A VM consumes 4 resources:

• CPU

• Disk

• RAM

• Network

We will check each of the 4.

Average Disk Latency among Affected VMs.

The 42 users experienced a spike in Disk Latency at the time of incident.

The spike timing and length matched the situation.

We plot the same for CPU, Network and RAM. No correlation at all.
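One way to reproduce this check is to average the per-VM latency series of the affected VMs and flag the samples that stand far above the usual level. This is a minimal sketch with made-up values. The factor of 5 over the median is an arbitrary illustrative choice, not a recommended threshold.

```python
from statistics import mean, median


def average_across_vms(series_by_vm):
    """Average aligned per-VM series into one series."""
    return [mean(sample) for sample in zip(*series_by_vm.values())]


def spike_indexes(series, factor=5.0):
    """Indexes where a value exceeds `factor` times the series median."""
    baseline = median(series)
    return [i for i, value in enumerate(series) if value > factor * baseline]


# Made-up disk latency (ms) for two affected VMs; the third sample is the incident.
latency_ms = {
    "vm-a": [4, 5, 410, 6],
    "vm-b": [6, 5, 530, 4],
}
avg = average_across_vms(latency_ms)
spikes = spike_indexes(avg)
```

Running the same two functions on the CPU, RAM, and network series of the same VMs shows whether any of them spikes in the same samples.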




Max latency rose to 543 ms. That’s very high as it’s sustained for 5 minutes. From here we can tell there is a storage issue.

Drilling into the 42 VMs. There is a noticeable spike across many VMs.

The green number shows the last value, not the peak value.

However, there is no high IOPS coming from the 42 VMs. This is expected at 7:42 am on a weekday.

So it is something from outside the VMs. It indicates storage I/O was not being processed.

Conclusion: VMs were hit by disk latency

• At the time of the incident (7:42 am), there was a distinct and unhealthy spike.

• Both before and after are very good. So it’s a short-lived issue.

• It’s also a one-time event. It did not occur in the 48 hours before or after. The spike is distinct.

• There was no spike in IOPS. Latency happens when Demand isn’t met by Supply. So Supply has dropped.
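The last bullet is a small piece of reasoning that can be written down as a triage rule: high latency with high IOPS points at demand, high latency without it points at supply. The function below is a hypothetical sketch. The factor of 2 over baseline is an arbitrary illustrative multiplier, and every number except the 543 ms peak is made up.

```python
def classify_latency_spike(latency_ms, iops, baseline_latency_ms, baseline_iops,
                           factor=2.0):
    """Rough triage of one sample against its baseline.

    factor is an arbitrary illustrative multiplier, not a vendor threshold.
    """
    latency_high = latency_ms > factor * baseline_latency_ms
    iops_high = iops > factor * baseline_iops
    if not latency_high:
        return "normal"
    return "demand rose" if iops_high else "supply dropped"


# 543 ms is the peak from the incident; the IOPS and baselines are made up.
verdict = classify_latency_spike(543, 120, baseline_latency_ms=5, baseline_iops=100)
```

With latency far above baseline and IOPS roughly flat, the verdict is "supply dropped", which matches the conclusion above.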

Alert so IT knows right away.




Root Cause Analysis

What happened

• View Client initiated disconnect

• View Server not responding

• Windows IO freeze

• HCI Storage can’t sync

• Network buffer overflow

Lesson Learned

• VDI monitoring has to include storage and network.

• Not all alerts are in the GUI. Some are buried in logs.

Alert Setup

• Network buffer overflow

• HCI unable to sync

• View Disconnect > 10 at the same time.
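The "View Disconnect > 10 at the same time" alert can be sketched as counting disconnect events per minute. This is not a Log Insight query. It is a plain Python illustration that assumes the events have already been extracted as (timestamp, user) pairs, and it reads "at the same time" as "within the same minute". The dates and user names are made up.

```python
from collections import Counter
from datetime import datetime


def mass_disconnect_minutes(events, threshold=10):
    """Return the minutes that saw more than `threshold` disconnects.

    events: iterable of (timestamp, user) pairs, one per disconnect event.
    """
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts, _ in events)
    return sorted(m for m, count in per_minute.items() if count > threshold)


# Made-up events: 12 users dropped within the same minute, 2 dropped later.
events = [(datetime(2017, 1, 2, 7, 42, s), f"user{s}") for s in range(12)]
events += [
    (datetime(2017, 1, 2, 9, 0, 5), "user-a"),
    (datetime(2017, 1, 2, 9, 0, 9), "user-b"),
]
alert_minutes = mass_disconnect_minutes(events)
```

Only the 7:42 bucket crosses the threshold. The two later disconnects look like normal user behaviour and stay below it.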




Start CMP Engagement with Assessments: Get your FREE reports

vSphere Optimization Assessment (VOA)

▪ Optimize SDDC and Hybrid Cloud

▪ Time to value in days

▪ https://www.vmware.com/assessment/voa

Hybrid Cloud Assessment (HCA)

▪ Compare private and public cloud costs

▪ Time to value in < 1 hour

▪ https://www.vmware.com/hybrid-cloud-assessment.html

Next Steps and Key Resources

Upgrade to vRealize Operations 6.6

Visit our new upgrade center: vmware.com/go/vrops/upgrade

Share Your Feedback

Receive a $10 Amazon gift card by completing a short survey about vRealize Operations: surveymonkey.com/r/vrops66. Offer valid during VMworld 2017.

Get Certified

Complete your online certification exam. VMware Digital Badges: http://www.vmware.com/go/vROPS2017badge

Overall VDI slowness

• What was reported

– IT Manager received complaints from end users that VDI is slow

– Customer is a telco in Singapore

• Gathering additional info

– No change in infrastructure

– No change in number of users or VDI VM master image

– A new application was rolled out that provides a front-end for 5 existing applications. Users are able to work with all 5 apps instead of 1 at a time. Productivity improves.

– The app was developed on a laptop. It was never tested in a VDI environment.

Analysis

• Sum up total Demand from all users

– Take over 1 month, not 1 day.

– Sum the total demand at peak period, then divide by # Users.
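The two steps above amount to a short calculation. The sketch below assumes the total demand of all VDI VMs has already been summed per sample (in MHz) over the month. The function name and the numbers in the usage line are made up.

```python
def peak_demand_per_user(total_demand_mhz, users):
    """Total CPU demand at the peak sample divided by the number of users.

    total_demand_mhz: summed demand of all VDI VMs, one value per sample,
    ideally covering a month rather than a day.
    """
    if users <= 0:
        raise ValueError("users must be positive")
    return max(total_demand_mhz) / users


# Made-up month of totals reduced to three samples, for 100 users.
per_user_mhz = peak_demand_per_user([90_000, 150_000, 120_000], users=100)
```

Taking the peak over a month rather than a day reduces the chance of sizing against an unusually quiet day.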


Random VMs slow

• What was reported

– VMs do not respond to ping. There seems to be no network problem.

– No time pattern to the problems. They can occur daily, or less often.

– Different VMs are affected. No pattern among the affected VMs.

– Problem disappears on its own.

– Customer suspects a rogue VM is causing the problem.

• Gathering additional info

– 700 VMs spread across 2 Datacenters in Singapore. Linked by low latency network, 2x 1Gb link.

– EMC (VPLEX, linked by Cisco MDS) supporting vSphere HA (stretched cluster)

– Half of the VMs are on VPLEX-replicated datastores.

– vR Ops and Log Insight were not installed

– Network team noticed a high spike in link utilization that matched the time of the issue

• Since the link is dedicated to VPLEX, VPLEX replication must have gone up

Analysis

• VPLEX is replicating changes

– Write operations, not Read.

– We plotted

• Install vR Ops

– Created a group of all datastores backed by VPLEX

– Created a super metric that sums their Write Throughput.

– Expected to see the pattern match. It did!

• Log in to some VMs and check Windows Events. They show a Symantec EP Agent signature update.

• Signature update: it only appends 4 KB, but to do so it copies 512 MB, appends 4 KB, then deletes 512 MB.

• Discussed with Security Team. They agreed to randomize even more.

• Problem stopped happening.
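The super metric step and the "pattern match" can be illustrated outside the product. The sketch below sums per-datastore write throughput into one series and compares it with link utilization using a Pearson correlation. It is plain Python with made-up numbers, not vR Ops super metric syntax. Both series are assumed to be aligned on the same timestamps and to vary over the window, since a constant series would make the correlation undefined.

```python
from math import sqrt


def sum_series(series_by_datastore):
    """Sum aligned per-datastore series into one series (the super metric)."""
    return [sum(sample) for sample in zip(*series_by_datastore.values())]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up write throughput per VPLEX-backed datastore, and link utilization (%).
write_throughput = {"ds1": [10, 12, 300, 11], "ds2": [8, 9, 250, 10]}
link_utilization = [5, 6, 92, 6]

total_writes = sum_series(write_throughput)
match = pearson(total_writes, link_utilization)
```

A correlation close to 1 supports the suspicion that writes on the replicated datastores drive the link spikes. It does not by itself say which VMs issue the writes, which is why the next step was to log in to the VMs.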

VMs slow when cluster utilization was 6%



• Analysis

• Guarding for future


Tier 3 clusters performing better than Tier 2



• Analysis



SQL Server performance



• Analysis



VMs experience storage latency



• Analysis



Which VMs hit high disk latency and how often?

Zooming into 1 VM. The pattern is obvious!

vSAN performing slow



• Analysis



SAP slow performance

• What was reported

– SAP team detects slowness. They are seeing high storage latency.

– They check with the Storage team. The Storage team saw high latency & IOPS across many LUNs in many arrays. Since the arrays are dedicated to VMware, they ask the Platform team what is causing the widespread demand.

– Platform team is caught in between customer and supplier.

• Gathering additional info

– Customer has 6 datacenters, 22 vCenters and 3000 VMs.

– The problem happened during early morning, but the complaint only reached the VMware team at 12 pm.

• Analysis

Random VMs have high disk latency

Telco in India

• Analysis

Windows VMs cannot boot

• What was reported

– 3 VMs unable to boot. All are Windows, and all have physical RDM

– Customer is a large Asian property conglomerate based in Singapore

– Customer wanted to know what caused the corruption

• Gathering additional info

– MBR is only 512 bytes at the beginning of a drive. BIOS uses it to locate the start of the OS partition. There is a boot table.

• Analysis

– MBR region corrupted: 3 bytes.

– Microsoft restored the MBR region.

– Not able to figure out why it was corrupted.

• Guarding for future

– Microsoft provided a utility to back up the MBR region.

SQL Server slow but IOPS is low



• Analysis



VM CPU counters

• Used vs Usage vs Demand

• Latency

• Ready

– Include co-stop since 6.0?


Single VM Troubleshooting

The VM

Capacity issue:
• CPU Demand > 90%
• CPU Run Queue > 3 per vCPU
• CPU Swap Wait high, CPU IO Wait high
• RAM Free < 250 MB
• RAM Committed > 70%
• Page-In Rate is high
• Disk Queue Length > ___
• Disk IOPS or Throughput or OIO is high
• Low disk space
• Network Usage is high

Non-Capacity issue:
• Wrong driver (storage driver, network driver) or its settings
• Too many snapshots or large snapshots
• Tools not running
• VM vCPU Usage unbalanced
• App configured wrongly, not-indexed
• Memory Leak
• Network Latency is high or TCP retransmit
• VM too big, process ping-pong, high context switch
• NUMA effect
• Guest OS power setting

The Infra

Infra unable to cope:
• ESXi CPU insufficient: Demand > 90%, VM CPU Co-Stop >1%, CPU Ready >5%, number of cores too small for VM
• ESXi RAM insufficient: VM Balloon active, VM RAM Swap-in is high, NUMA migration
• ESXi Disk IOPS or Throughput is high
• ESXi vmkernel queue or latency is high
• Datastore latency is high
• ESXi vmnic usage is high

Other issue:
• VM was vMotioned
• ESXi vmnic dropped packets or generated errors
• ESXi wrong configuration: power management, multi-pathing, driver version, queue depth setting
• Hardware fault: disk soft error, bad sector, RAM error
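The numeric thresholds in the lists above lend themselves to a simple first-pass check. The sketch below only covers the items that come with a number on the slide. The dictionary keys are made up for illustration and are not vRealize Operations metric keys.

```python
def capacity_flags(m):
    """Return the slide's threshold checks that a VM's metrics trip.

    m is a dict of already-collected values; the key names are made up
    for this sketch and are not vRealize Operations metric keys.
    """
    checks = [
        ("VM CPU Demand > 90%", m["cpu_demand_pct"] > 90),
        ("CPU Run Queue > 3 per vCPU", m["cpu_run_queue"] / m["vcpus"] > 3),
        ("RAM Free < 250 MB", m["ram_free_mb"] < 250),
        ("RAM Committed > 70%", m["ram_committed_pct"] > 70),
        ("VM CPU Co-Stop > 1%", m["cpu_costop_pct"] > 1),
        ("VM CPU Ready > 5%", m["cpu_ready_pct"] > 5),
    ]
    return [name for name, tripped in checks if tripped]


# Made-up values for one 2 vCPU VM.
vm_metrics = {
    "vcpus": 2,
    "cpu_demand_pct": 55,
    "cpu_run_queue": 8,
    "ram_free_mb": 900,
    "ram_committed_pct": 40,
    "cpu_costop_pct": 0.2,
    "cpu_ready_pct": 6.5,
}
flags = capacity_flags(vm_metrics)
```

A check like this only narrows the search. The non-capacity and "other" items (drivers, snapshots, vMotion, hardware faults) still need to be reviewed by hand.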


