Singapore, June 2013


Network Troubleshooting in the virtual world


Co-authors & Reviewers

Reviewers

• Lim Wei Chiang

• Huang Ya Jian, SE Manager, Arista Networks

Iwan ‘e1’ Rahabok, Staff SE, Strategic Accounts, VMware

[email protected] | Linkedin.com/in/e1ang

VCAP-DCD, TOGAF Certified, vExpert 2013

VCP, CCDP, CCNP



Network Troubleshooting

Our example scenario:

• You are responsible for the following environment:

• 1500 VMs on 100 ESXi 5.1 hosts. All VMs are server VMs, not desktop VMs, running a mix of Windows and Linux.

• The majority of VMs use VADP-based backup. A few use LAN-based backup.

• All of the above reside in 1 physical datacenter.

• The physical network is 10 GE on Arista switches.

• Each ESXi host is a 1U rackmount server with 2x 10 GE ports and the following networks:

• vmkernel network: IP Storage, vMotion, Management

• VM network: DMZ, Zone 1, Zone 2, Veritas Heartbeat

• You use dvSwitch 5.1 with Network QoS enabled.


Network Troubleshooting: Approach

Possible reasons for demand spike:

• Storage vMotion, vMotion, broadcast storm

Possible reasons for supply drop:

• Misconfiguration

• Hardware fault


Available counters

• ESXi Counters

• VM Counters

• Cluster Counters

• Datacenter Counters


Network Troubleshooting

What type of info do you need?

• Dropped packets and errors

• Throughput (bandwidth)

• Latency

• Special packets: broadcast, error, multicast

How do you need to show the info?

• A line chart is useful for showing a few objects across time.

• It is great at showing a point in time or a period, but it does not scale in terms of the number of objects.

• A heat map is useful for showing many objects, but at a given point in time.

• Normally just the current time.

• It can also present 2-dimensional information, making it useful for comparison.

• It gives good relative information, comparing many objects against one another.

• A weather map adds a time dimension that allows you to go back in time.

• Not as good as a line chart.

• A top-N chart shows the top N objects (e.g. Top 25 VMs in terms of network utilisation).

• A data-distribution chart shows how the data is distributed during a period of time.

• See next screen for example.
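As a minimal sketch of what a top-N chart computes, the following ranks objects by a metric and keeps the first N. The VM names and utilisation figures are made-up sample data, and `top_n` is a hypothetical helper, not part of any VMware product.

```python
import heapq


def top_n(metric_by_object, n=25):
    """Return the n (name, value) pairs with the highest value, highest first."""
    return heapq.nlargest(n, metric_by_object.items(), key=lambda item: item[1])


# Hypothetical sample: network utilisation (%) per VM.
vm_utilisation = {"vm-a": 12.5, "vm-b": 71.0, "vm-c": 3.2, "vm-d": 44.8}

for name, value in top_n(vm_utilisation, n=3):
    print(f"{name}: {value:.1f}%")
```

With 1500 VMs, the same call with the default `n=25` would give the list behind a Top 25 chart.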


Charts example


To prove that Network is performing well

Errors

• Not a single ESXi host is experiencing packet drops on any of its NICs (vmnics).

• If there are any, show the ESXi names.

• Not a single VM is experiencing packet drops.

Utilisation

• Not a single VM is hitting its limit, be it 1 GE or 10 GE.

• Not a single ESXi vmnic is hitting its limit.

• Total bandwidth hitting the physical switches is below capacity.

• Top 25 talkers show utilisation below the limit.

• 4 charts required: VM TX, VM RX, ESXi TX, ESXi RX

Special network traffic

• Broadcast traffic is minimal, for both ESXi and VM.
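The error and utilisation criteria above can be sketched as a simple check over per-object metrics. This is an illustration only: the field names (`dropped_packets`, `utilisation_pct`), the sample objects and the `network_health` helper are assumptions, not an actual vCenter or vCenter Operations API.

```python
def network_health(objects, max_utilisation_pct=100.0):
    """Return (name, reason) pairs for every object that violates a criterion."""
    issues = []
    for name, metrics in objects.items():
        if metrics["dropped_packets"] > 0:
            issues.append((name, "packet drops"))
        if metrics["utilisation_pct"] >= max_utilisation_pct:
            issues.append((name, "hitting its limit"))
    return issues


# Hypothetical sample covering two vmnics and one VM.
sample = {
    "esxi-01/vmnic0": {"dropped_packets": 0, "utilisation_pct": 35.0},
    "esxi-02/vmnic1": {"dropped_packets": 12, "utilisation_pct": 20.0},
    "vm-web-01": {"dropped_packets": 0, "utilisation_pct": 100.0},
}

for name, reason in network_health(sample):
    print(f"{name}: {reason}")
```

An empty result is the proof that the network is performing well; a non-empty result names the objects to investigate.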


Approach

Dashboard #1: Do we have any errors in our networks?

• A multi-datacenter view

• A single error in a VM or ESXi will show up in this overall dashboard, as it is taking the Max (all objects).

Dashboard #2: If yes, which VMs and ESXi are affected?

• Lists the top 25 VMs and the top 25 ESXi hosts.

Dashboard #3: Is any VM or ESXi near its peak?

• A peak in any VM or ESXi will show up in this super-metric based line chart.

Dashboard #4: Is our network near its peak?

Dashboard #5: Who are the top consumers for each physical datacenter?

Dashboard #6: How is the workload distributed?

• This uses a heat map to show relative info.

Dashboard #7: What’s the detail for a particular VM?

• When we have identified a specific VM and want to know all the network details.
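Dashboards #1 and #3 rely on taking the maximum across all objects, so one misbehaving VM or ESXi host is enough to move the line. A minimal sketch of that aggregation, using made-up packet drop samples and a hypothetical `max_across_objects` helper:

```python
def max_across_objects(series_by_object):
    """For each sample time, take the maximum value across all objects."""
    return [max(values) for values in zip(*series_by_object.values())]


# Hypothetical packet drop (%) per VM at four consecutive sample times.
drop_pct = {
    "vm-1": [0.0, 0.0, 0.0, 0.0],
    "vm-2": [0.0, 0.0, 2.5, 0.0],
    "vm-3": [0.0, 0.0, 0.0, 0.0],
}

# A single line for the whole datacenter; the spike from vm-2 is preserved.
print(max_across_objects(drop_pct))
```

An average over the same data would dilute the spike as the number of objects grows, which is why Max is the useful calculation here.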


Dashboard #1: Do we have any errors in our networks?

Physical Datacenter 1

• Maximum packet drop for all VMs in the entire DC (%)

• Maximum packet drop for all ESXi hosts in the entire DC (%)

• Maximum “bad” packets for all ESXi hosts in the entire DC (Gb/s)

Physical Datacenter 2

• Same set of charts as Datacenter 1. We should display all datacenters that have heavy connections with each other.

Explanation on how this dashboard is built will be given later.



Dashboard #2: If yes, which VMs and ESXi are affected?

• Top 25 VMs by packet drop

• Top 25 ESXi hosts by packet drop

The above charts consist of 2 parts: the bar chart and the line chart. The line chart is not really visible though, so we will zoom into it later on. An actual dashboard will be shown later.



Dashboard #3: Is any VM or ESXi near its peak?

• Maximum TX for all VMs in the entire DC (%)

• Maximum TX for all ESXi hosts in the entire DC (%)

• Maximum RX for all VMs in the entire DC (%)

• Maximum RX for all ESXi hosts in the entire DC (%)




Dashboard #4: Is our network near its peak?

• Total TX from all VMs in the entire DC (Gb/s)

• Total TX from all ESXi hosts in the entire DC (Gb/s)

• Total RX from all VMs in the entire DC (Gb/s)

• Total RX from all ESXi hosts in the entire DC (Gb/s)
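A rough sketch of the arithmetic behind a "total" chart: sum the per-host throughput, convert it to Gb/s, and compare it with the aggregate port capacity. The host and port counts come from the example scenario; the per-host figures are invented, and the conversion assumes the counters are in kilobytes per second with decimal units. Aggregate NIC capacity is used here only as a stand-in for the real capacity of the physical switches, which the slides do not state.

```python
HOSTS = 100            # from the example scenario
PORTS_PER_HOST = 2     # 2x 10 GE per ESXi host
PORT_SPEED_GBPS = 10


def total_gbps(kbytes_per_sec_by_object):
    """Sum throughput given in KB/s (kilobytes) and convert it to Gb/s."""
    total_kbytes = sum(kbytes_per_sec_by_object.values())
    return total_kbytes * 8 / 1_000_000  # KB/s -> kb/s -> Gb/s


# Hypothetical TX rate per ESXi host in KB/s; only three hosts are listed.
esxi_tx = {"esxi-01": 125_000, "esxi-02": 125_000, "esxi-03": 125_000}

capacity_gbps = HOSTS * PORTS_PER_HOST * PORT_SPEED_GBPS
used_gbps = total_gbps(esxi_tx)
print(f"{used_gbps:.1f} Gb/s of {capacity_gbps} Gb/s "
      f"({100 * used_gbps / capacity_gbps:.2f}%)")
```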



Dashboard #5: Who are the top consumers for each physical datacenter?

Physical Datacenter 1

• Top 25 VMs by RX in Mb/sec

• Top 25 VMs by TX in Mb/sec

• Top 25 ESXi hosts by RX in Mb/sec

• Top 25 ESXi hosts by TX in Mb/sec


Dashboard #6: How is the workload distributed?


Dashboard #7: What’s the detail for a particular VM?


Dashboard #1


Dashboard #1: Cluster packet drop


Dashboard #2: VM with packet drop

Screenshot showing the top 25 VMs in terms of % packet drops

• In this example, it’s clear there is a problem as the % is high.

• The chart can be complemented with a line chart, showing the details of the selected VM.

• The line chart can be adjusted to display historical data.
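For readers who want to reproduce a packet drop percentage from raw counters, here is one possible definition. The formula (dropped packets over dropped plus delivered packets) is an assumption for illustration; the slides do not state how the charted percentage is derived.

```python
def drop_percentage(dropped, delivered):
    """Percentage of packets dropped, assuming total = dropped + delivered."""
    total = dropped + delivered
    if total == 0:
        return 0.0  # no traffic in the interval, so nothing was dropped
    return 100.0 * dropped / total


# Hypothetical counters for one VM over one sampling interval.
print(f"{drop_percentage(dropped=25, delivered=75):.1f}%")
```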


Dashboard #3: Peak of any VM or ESXi

We are using Workload (%), a derived metric
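As an illustration of how a percentage-style metric such as Workload (%) can be derived from a raw rate, the sketch below divides usage by link capacity. This is an assumed formula, not the product's documented definition, and it assumes the raw counter is in kilobytes per second with decimal units.

```python
def workload_pct(usage_kbytes_per_sec, link_speed_gbps):
    """Usage as a percentage of link capacity (assumed definition)."""
    usage_gbps = usage_kbytes_per_sec * 8 / 1_000_000  # KB/s -> Gb/s
    return 100.0 * usage_gbps / link_speed_gbps


# Hypothetical VM pushing 125,000 KB/s (1 Gb/s) over a 10 GE uplink.
print(f"{workload_pct(125_000, 10):.1f}%")
```

Expressing every object as a percentage of its own limit is what lets 1 GE and 10 GE objects share one chart.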


Dashboard #4: Total network utilisation

Sample super-metric that provides granularity


Dashboard #5

This example shows the top 25 VMs in terms of packets sent

• Data is shown in KBps.

• Utilisation is very low.

The bar chart is complemented with a simple line chart

• It gives historical data.

• Can go back 1 year.


Super Metric: main screen


Super Metric: applying to a type of resource


Super Metric package: group of super metrics


Super Metric: naming tips

[Calculation] [Object] [Resource] in a [Container] (units)

• Calculation: Sum, Min, Max, etc.

• Object: VM or ESXi

• Resource: CPU or RAM or Disk or Network

• Container: Cluster or Datacenter or vCenter

• Units: % or Mbps or packets, etc.
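The naming pattern can be applied mechanically. The small helper below is a hypothetical illustration that fills in the template; it is not part of any product.

```python
def super_metric_name(calculation, obj, resource, container, units):
    """Build a name following '[Calculation] [Object] [Resource] in a [Container] (units)'."""
    return f"{calculation} {obj} {resource} in a {container} ({units})"


print(super_metric_name("Max", "VM", "Network", "Datacenter", "%"))
print(super_metric_name("Sum", "ESXi", "Network", "Cluster", "Mbps"))
```

Consistent names make it easier to find the right super metric when building a dashboard.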
