Singapore, June 2013


Page 1


Network Troubleshooting in the virtual world

Page 2

Co-authors & Reviewers

Reviewers
• Lim Wei Chiang

• Huang Ya Jian, SE Manager, Arista Networks

Iwan ‘e1’ Rahabok, Staff SE, Strategic Accounts, VMware

[email protected] | Linkedin.com/in/e1ang

VCAP-DCD, TOGAF Certified, vExpert 2013

VCP, CCDP, CCNP

Page 3

Network Troubleshooting

Our example scenario:
• You are responsible for the following environment:

• 1500 VMs on 100 ESXi 5.1 hosts. All are server VMs, not desktop VMs, with a mix of Windows and Linux.

• The majority of VMs use VADP-based backup; a few use LAN-based backup.

• All of the above reside in one physical datacenter.

• The physical network is 10 GE, on Arista switches.

• Each ESXi host is a 1U rackmount server with 2x 10 GE ports, carrying the following networks:
  • VMkernel networks: IP Storage, vMotion, Management
  • VM networks: DMZ, Zone 1, Zone 2, Veritas Heartbeat

• You use dvSwitch 5.1 with Network QoS enabled.
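A quick sanity check on the scenario above: the sketch below works out the average VM density and the raw uplink capacity. It assumes VMs are spread evenly across hosts and that both 10 GE ports on every host carry traffic at the same time; the scenario states neither. VMkernel traffic (IP Storage, vMotion, Management) shares the same uplinks, so the per-VM figure is an upper bound, not an allowance.

```python
# Back-of-the-envelope sizing for the example scenario.
# Assumptions (not stated in the scenario): VMs are spread evenly across
# hosts, and both 10 GE uplinks on every host are usable at the same time.

VM_COUNT = 1500
HOST_COUNT = 100
PORTS_PER_HOST = 2
PORT_SPEED_GBPS = 10


def scenario_capacity(vms: int, hosts: int, ports: int, speed_gbps: float) -> dict:
    """Return average VM density and raw uplink capacity figures."""
    per_host_gbps = ports * speed_gbps
    vms_per_host = vms / hosts
    return {
        "vms_per_host": vms_per_host,
        "uplink_gbps_per_host": per_host_gbps,
        "uplink_gbps_total": per_host_gbps * hosts,
        # Average share of a host's raw uplink bandwidth per VM, in Mb/s,
        # ignoring VMkernel traffic that shares the same ports.
        "avg_mbps_per_vm": per_host_gbps * 1000 / vms_per_host,
    }


if __name__ == "__main__":
    figures = scenario_capacity(VM_COUNT, HOST_COUNT, PORTS_PER_HOST, PORT_SPEED_GBPS)
    for name, value in figures.items():
        print(f"{name}: {value:g}")
```

With the scenario's numbers this gives 15 VMs per host, 20 Gb/s of raw uplink per host and 2,000 Gb/s across the 100 hosts.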

Page 4

Network Troubleshooting: Approach

Possible reasons for a demand spike:

• Storage vMotion, vMotion, broadcast storm

Possible reasons for a supply drop:

• Misconfiguration

• Hardware fault
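The approach boils down to asking which side moved: demand or supply. The sketch below is one hypothetical way to encode that check; the LinkSample structure, the baseline, and both thresholds are illustrative assumptions, not values from this deck.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One observation of a network link (hypothetical structure)."""
    demand_gbps: float            # traffic currently offered to the link
    baseline_demand_gbps: float   # typical traffic for this time of day
    capacity_gbps: float          # capacity currently available
    nominal_capacity_gbps: float  # capacity the link should have, e.g. 2x 10 GE


def classify_congestion(s: LinkSample, spike_factor: float = 1.5,
                        drop_factor: float = 0.9) -> list[str]:
    """Attribute a problem to a demand spike, a supply drop, both, or neither.

    spike_factor and drop_factor are illustrative thresholds only.
    """
    causes = []
    if s.demand_gbps >= spike_factor * s.baseline_demand_gbps:
        causes.append("demand spike")  # e.g. Storage vMotion, vMotion, broadcast storm
    if s.capacity_gbps < drop_factor * s.nominal_capacity_gbps:
        causes.append("supply drop")   # e.g. misconfiguration, hardware fault
    return causes
```

For example, a host whose traffic jumped from a 6 Gb/s baseline to 18 Gb/s on healthy 2x 10 GE uplinks is flagged as a demand spike, while a host that has lost one of its two 10 GE uplinks is flagged as a supply drop.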

Page 5

Available counters

[Screenshots: ESXi counters, VM counters, Cluster counters, Datacenter counters]

Page 6

Network Troubleshooting

What type of information do you need?
• Dropped packets and errors

• Throughput (bandwidth)

• Latency

• Special packets: broadcast, error, multicast

How should the information be shown?
• A line chart is useful for showing a few objects across time.
  • Great at showing a time period, but it does not scale in terms of the number of objects.

• A heat map is useful for showing many objects, but only at a given point in time.
  • Normally just the current point in time.
  • It can also present two-dimensional information, making it useful for comparison.
  • It gives good relative information, comparing many objects against one another.

• A weather map adds a time dimension, allowing you to go back in time.
  • Not as good as a line chart for this, though.

• A top-N chart shows the top N objects (e.g. the top 25 VMs by network utilisation).

• A data-distribution chart shows how the data is distributed over a period of time.
  • See the next slide for an example.
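A data-distribution chart is essentially a histogram of one metric over a period. The sketch below buckets throughput samples (in Mb/s) into fixed-width ranges; the sample values and the bucket width are made up for illustration, and the monitoring tool's own bucketing may differ.

```python
from collections import Counter


def distribution(samples: list[float], bucket_width: float) -> dict[float, int]:
    """Count how many samples fall into each bucket [lower, lower + width).

    Returns a dict keyed by each bucket's lower bound, in ascending order.
    Empty buckets are omitted.
    """
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")
    counts = Counter((s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))


# Hypothetical throughput samples for one VM over a period, in Mb/s.
print(distribution([120, 340, 90, 410, 380, 150, 60], 100))
```

Each key is the lower edge of a 100 Mb/s bucket and each value is how many samples fell into it, which is what the bars of a distribution chart show.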

Page 7

Charts example

Page 8

To prove that the network is performing well

Errors
• Not a single ESXi host is experiencing packet drops on any of its NICs (vmnics).
  • If there are, show the ESXi host names.

• Not a single VM is experiencing packet drops.

Utilisation
• Not a single VM is hitting its limit, be it 1 GE or 10 GE.

• Not a single ESXi vmnic is hitting its limit.

• Total bandwidth hitting the physical switches is below capacity.

• The top 25 talkers show utilisation below their limit.
  • 4 charts required: VM TX, VM RX, ESXi TX, ESXi RX

Special network
• Broadcast traffic is minimal, for both ESXi hosts and VMs.
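The checklist above can be expressed as a simple pass/fail report. This is a sketch only: NicStats is a hypothetical structure, and the 90% utilisation and 1% broadcast thresholds are placeholders to tune for your environment.

```python
from dataclasses import dataclass


@dataclass
class NicStats:
    """Per-interval stats for one VM vNIC or ESXi vmnic (hypothetical structure)."""
    name: str
    dropped_packets: int
    usage_mbps: float
    link_speed_mbps: float  # e.g. 1000 for 1 GE, 10000 for 10 GE
    broadcast_packets: int
    total_packets: int


def health_report(nics: list[NicStats], util_limit: float = 0.9,
                  broadcast_limit: float = 0.01) -> dict[str, list[str]]:
    """List the NICs that fail each check; an empty list means the check passes.

    util_limit and broadcast_limit are illustrative thresholds only.
    """
    report = {"packet_drops": [], "near_limit": [], "high_broadcast": []}
    for n in nics:
        if n.dropped_packets > 0:
            report["packet_drops"].append(n.name)
        if n.usage_mbps >= util_limit * n.link_speed_mbps:
            report["near_limit"].append(n.name)
        # Broadcast share of all packets seen in the interval.
        if n.total_packets and n.broadcast_packets / n.total_packets > broadcast_limit:
            report["high_broadcast"].append(n.name)
    return report
```

Listing the offending names, rather than returning a single boolean, matches the "If there are, show the ESXi host names" requirement.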

Page 9

Approach

Dashboard #1: Do we have any errors in our networks?

• A multi-datacenter view

• A single error in a VM or ESXi host will show up in this overall dashboard, as it takes the maximum across all objects.

Dashboard #2: If yes, which VMs and ESXi are affected?

• Lists the top 25 VMs and the top 25 ESXi hosts.

Dashboard #3: Is any VM or ESXi near its peak?

• A peak in any VM or ESXi will show up in this super-metric based line chart.

Dashboard #4: Is our network near its peak?

Dashboard #5: Who are the top consumers in each physical datacenter?

Dashboard #6: How is the workload distributed?

• This uses a heat map to show relative info.

Dashboard #7: What’s the detail for a particular VM?

• Used when we have identified a specific VM and want to see all of its network details.

Page 10

Dashboard #1: Do we have any errors in our networks?

[Chart panels for Physical Datacenter 1 and Physical Datacenter 2: Maximum packet drop for all VM in entire DC (%); Maximum packet drop for all ESXi in entire DC (%); Maximum “bad” packet for all ESXi in entire DC (Gb/s)]

Physical Datacenter 2 shows the same set of charts as Datacenter 1. We should display all datacenters that are heavily connected to each other.

An explanation of how this dashboard is built is given later.
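The roll-up behind Dashboard #1 takes the maximum across all objects in the datacenter. The sketch below mimics that in plain Python rather than in the monitoring tool's super-metric syntax; the drop-% formula (dropped / (packets + dropped)) and the input shape are assumptions for illustration.

```python
def drop_percent(dropped: int, packets: int) -> float:
    """Packet drop as a % of packets handled.

    Hypothetical definition: dropped / (packets + dropped) * 100.
    Returns 0.0 when the object saw no traffic at all.
    """
    total = packets + dropped
    return 100.0 * dropped / total if total else 0.0


def max_drop_in_dc(per_object: dict[str, tuple[int, int]]) -> float:
    """Datacenter-level roll-up: the worst drop % among all objects.

    per_object maps an object name to (dropped, packets) for one interval.
    Taking the max means a single bad VM or host is enough to move the line.
    """
    return max((drop_percent(d, p) for d, p in per_object.values()), default=0.0)


# Hypothetical one-interval readings for three VMs in one datacenter.
print(max_drop_in_dc({"vm-a": (0, 1000), "vm-b": (5, 995), "vm-c": (0, 0)}))
```

Using max rather than average is the design choice that makes this an alarm: an average over 1500 VMs would dilute one VM's drops to almost nothing.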

Page 11

Dashboard #2: If yes, which VMs and ESXi are affected?

[Chart panels for Physical Datacenter 1 and Physical Datacenter 2: Top 25 VM by packet drop; Top 25 ESXi by packet drop]

Physical Datacenter 2 shows the same set of charts as Datacenter 1. We should display all datacenters that are heavily connected to each other.

Each chart above consists of two parts: a bar chart and a line chart. The line chart is not really visible here, so we zoom into it later on. An actual dashboard is shown later.
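Dashboards #2 and #5 are top-N lists. Selecting the top 25 is a partial sort; the sketch below uses Python's heapq.nlargest, with made-up object names and values.

```python
import heapq


def top_n(values: dict[str, float], n: int = 25) -> list[tuple[str, float]]:
    """Return the n objects with the highest value, largest first.

    heapq.nlargest is equivalent to sorted(..., reverse=True)[:n], so
    objects with equal values keep their input order.
    """
    return heapq.nlargest(n, values.items(), key=lambda kv: kv[1])


# Hypothetical % packet drop per VM for one interval.
drops = {"vm-a": 0.0, "vm-b": 4.2, "vm-c": 1.5, "vm-d": 9.0}
print(top_n(drops, 2))
```

The same function serves the packet-drop, RX and TX rankings; only the input dictionary changes.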

Page 12

Dashboard #3: Is any VM or ESXi near its peak?

[Chart panels for Physical Datacenter 1 and Physical Datacenter 2: Maximum TX for all VM in entire DC (%); Maximum TX for all ESXi in entire DC (%); Maximum RX for all VM in entire DC (%); Maximum RX for all ESXi in entire DC (%)]

Physical Datacenter 2 shows the same set of charts as Datacenter 1. We should display all datacenters that are heavily connected to each other.

Page 13

Dashboard #4: Is our network near its peak?

[Chart panels for Physical Datacenter 1 and Physical Datacenter 2: Total TX from all VM in entire DC; Total TX from all ESXi in entire DC; Total RX from all VM in entire DC; Total RX from all ESXi in entire DC; axis units Gb/s and %]

Physical Datacenter 2 shows the same set of charts as Datacenter 1. We should display all datacenters that are heavily connected to each other.
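Where Dashboards #1 and #3 take the maximum, Dashboard #4 needs a sum. The sketch assumes per-host throughput is available in Mb/s and that every uplink counts towards capacity (20 Gb/s per host in the example scenario); both the input shape and the capacity model are assumptions.

```python
def dc_total_gbps(per_host_mbps: dict[str, float]) -> float:
    """Sum per-host throughput (Mb/s) into a datacenter total in Gb/s."""
    return sum(per_host_mbps.values()) / 1000.0


def dc_utilisation_pct(per_host_mbps: dict[str, float], uplink_gbps_per_host: float) -> float:
    """Datacenter total as a % of the combined raw host uplink capacity."""
    capacity_gbps = uplink_gbps_per_host * len(per_host_mbps)
    if not capacity_gbps:
        return 0.0
    return 100.0 * dc_total_gbps(per_host_mbps) / capacity_gbps


# Hypothetical TX readings for two hosts, in Mb/s.
hosts = {"esxi01": 1200.0, "esxi02": 800.0}
print(dc_total_gbps(hosts), dc_utilisation_pct(hosts, 20))
```

Note that the sum of host uplink capacity is not the same as the capacity of the physical switches; the "below capacity" check on the switches needs their own figures.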

Page 14

Dashboard #5: Who are the top consumers in each physical datacenter?

[Chart panels for Physical Datacenter 1: Top 25 VM by RX in Mb/sec; Top 25 VM by TX in Mb/sec; Top 25 ESXi by RX in Mb/sec; Top 25 ESXi by TX in Mb/sec]

Page 15

Dashboard #6: How is the workload distributed?

Page 16

Dashboard #7: What’s the detail for a particular VM?

Page 17

Dashboard #1

Page 18

Dashboard #1: Cluster packet drop

Page 19

Dashboard #2: VM with packet drop

Screenshot showing the top 25 VMs by % packet drop.
• In this example, it’s clear there is a problem, as the % is high.

• The chart can be complemented with a line chart, showing the details of the selected VM.

• The line chart can be adjusted to display historical data.

Page 20

Dashboard #3: Peak of any VM or ESXi

We are using Workload (%), a derived metric
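The deck does not spell out how Workload (%) is derived. Purely as an assumption, the sketch below treats it as demand divided by capacity, taking the busier direction of a full-duplex link; the monitoring tool's actual formula may differ.

```python
def workload_pct(tx_mbps: float, rx_mbps: float, link_speed_mbps: float) -> float:
    """Hypothetical workload %: the busier direction as a share of link speed.

    Full-duplex links carry TX and RX independently, so each direction is
    compared with the full link speed and the higher of the two is reported.
    """
    if link_speed_mbps <= 0:
        raise ValueError("link_speed_mbps must be positive")
    return 100.0 * max(tx_mbps, rx_mbps) / link_speed_mbps


# A 1 GE vNIC sending 300 Mb/s and receiving 750 Mb/s.
print(workload_pct(300, 750, 1000))
```

Normalising to a percentage is what lets a 1 GE VM and a 10 GE vmnic share one "near its peak" chart.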

Page 21

Dashboard #4: Total network utilisation

Sample super-metric that provides granularity

Page 22

Dashboard #5

This example shows the top 25 VMs by packets sent.

• Data is shown in KBps.

• Utilisation is very low.

The bar chart is complemented with a simple line chart

• It gives historical data.

• Can go back 1 year.
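This chart is in KBps while the Dashboard #5 rankings are in Mb/sec, so comparing them needs a unit conversion. Whether KB means 1,000 or 1,024 bytes depends on the source; the sketch makes that an explicit parameter rather than guessing.

```python
def kbps_to_mbps(kilobytes_per_sec: float, kilo: int = 1024) -> float:
    """Convert kilobytes per second (KBps) to megabits per second (Mb/s).

    kilo=1024 assumes KB means 1024 bytes; pass kilo=1000 if your source
    uses decimal kilobytes. Megabits here are decimal (10**6 bits).
    """
    bits_per_sec = kilobytes_per_sec * kilo * 8
    return bits_per_sec / 1_000_000


print(kbps_to_mbps(125, kilo=1000))  # decimal kilobytes
print(kbps_to_mbps(1000))            # binary kilobytes (default)
```

The factor of 8 (bytes to bits) matters far more than the 1,000 vs 1,024 choice, which shifts the result by only about 2.4%.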

Page 23

© 2010 VMware Inc. All rights reserved

Thank you

Page 24

Super Metric: main screen

Page 25

Super Metric: applying to a type of resource

Page 26

Super Metric package: group of super metrics

Page 27

Super Metric: naming tips

[Calculation] [Object] [Resource] in a [Container] (units)
• Calculation: Sum, Min, Max, etc.
• Object: VM or ESXi
• Resource: CPU or RAM or Disk or Network
• Container: Cluster or Datacenter or vCenter
• Units: % or Mbps or packets, etc.
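The naming convention is easiest to apply consistently if names are generated rather than typed by hand. A minimal, hypothetical helper; it only checks that no part is blank, since the slide lists its values as examples ("etc.") rather than a closed set.

```python
def super_metric_name(calculation: str, obj: str, resource: str,
                      container: str, units: str) -> str:
    """Build a name following [Calculation] [Object] [Resource] in a [Container] (units)."""
    parts = {
        "calculation": calculation,
        "object": obj,
        "resource": resource,
        "container": container,
        "units": units,
    }
    missing = [key for key, value in parts.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing name parts: {', '.join(missing)}")
    return f"{calculation} {obj} {resource} in a {container} ({units})"


print(super_metric_name("Max", "VM", "Network", "Datacenter", "%"))
```

A name such as "Max VM Network in a Datacenter (%)" states the roll-up, the object type, the resource, the scope and the unit in one line, which is what makes a long super metric list searchable.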