Page 1: How to Get the Most Out of - GEEKBOY.PRO

#vmworld

HBI2333BU

How to Get the Most Out of vSphere vMotion

Niels Hagoort, VMware, Inc.

Arunachalam Ramanathan, VMware, Inc.

#HBI2333BU

VMworld 2019 Content: Not for publication or distribution

Page 2: How to Get the Most Out of - GEEKBOY.PRO

©2019 VMware, Inc.

Disclaimer

This presentation may contain product features or functionality that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined.

The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.

Page 3: How to Get the Most Out of - GEEKBOY.PRO

Agenda

How does vMotion work?

How to scale vMotion performance?

How to tune vMotion concurrent limits?

Troubleshooting vMotion

Q&A

Page 4: How to Get the Most Out of - GEEKBOY.PRO

One of the most momentous game-changers in the IT industry!

History of vMotion

Page 5: How to Get the Most Out of - GEEKBOY.PRO

How Does vMotion Work?

Page 6: How to Get the Most Out of - GEEKBOY.PRO

Start a live-migration

vMotion Process

Page 7: How to Get the Most Out of - GEEKBOY.PRO

vMotion Workflow

1. Create VM on Destination
2. Copy Memory
3. Quiesce VM on Source
4. Transfer Device State
5. Resume VM on Destination
6. Power Off VM on Source

[Figure: Source ESX Host and Destination ESX Host connected by the vMotion Network, both attached to a Datastore]

Execution Switchover Time of 1 sec

Page 8: How to Get the Most Out of - GEEKBOY.PRO

What happens when you initiate a live-migration?

vMotion Process

Page 9: How to Get the Most Out of - GEEKBOY.PRO

What happens when you initiate a live-migration?

vMotion Process

Compatibility specification
• Versions
• Available resources
  • For virtual machine
  • To support vMotion process

Page 10: How to Get the Most Out of - GEEKBOY.PRO

What happens when you initiate a live-migration?

vMotion Process

Migration specification
• The virtual machine that is being live-migrated
• Configuration of that virtual machine (virtual hardware, VM options, etc.)
• Source ESXi host
• Destination ESXi host
• vMotion network details

Compatibility specification
• Versions
• Available resources
  • For virtual machine
  • To support vMotion process

Page 11: How to Get the Most Out of - GEEKBOY.PRO

What happens when you initiate a live-migration?

vMotion Process

Page 12: How to Get the Most Out of - GEEKBOY.PRO

How is memory copied?

vMotion Process

Source VM Memory

Destination VM Memory

• Phase 0: Copy the VM’s 24GB of memory, trace pages. As we send that memory, the VM dirties 8GB

• Phase 1: Retransmit the dirtied 8GB. In the process, the VM dirties another 3GB

• Phase 2: Send the 3GB. While that transfer is happening, the VM dirties 1GB

• Phase 3: Send the remaining 1GB
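The phase-by-phase figures above can be checked with a short sketch. It first totals the slide's own numbers, then runs a hypothetical model in which every pass dirties a fixed fraction of what was just sent. The function name, the constant `dirty_ratio`, and the `switchover_gb` threshold are illustrative assumptions, not vMotion internals.

```python
def precopy_phases(memory_gb, dirty_ratio, switchover_gb, max_phases=20):
    """Return the GB sent in each pre-copy phase.

    dirty_ratio is the (assumed constant) fraction of the data sent in one
    phase that the guest dirties again while that phase is in flight.
    Iteration stops once a phase is small enough to switch over.
    """
    if not 0 <= dirty_ratio < 1:
        raise ValueError("pre-copy only converges when dirty_ratio < 1")
    phases = [float(memory_gb)]
    while phases[-1] > switchover_gb and len(phases) < max_phases:
        phases.append(phases[-1] * dirty_ratio)
    return phases


# Figures taken directly from the slide: 24 GB, then 8, 3 and 1 GB.
slide_phases_gb = [24, 8, 3, 1]
total_sent_gb = sum(slide_phases_gb)

# Hypothetical model: one third of each pass is dirtied again.
modelled = precopy_phases(24, dirty_ratio=1 / 3, switchover_gb=1)

print(f"slide total: {total_sent_gb} GB sent for a 24 GB VM")
print("modelled phases (GB):", [round(p, 2) for p in modelled])
```

With the slide's figures, 36 GB crosses the network to move a 24 GB VM. The model only converges because each pass dirties less than it sends, which is why a guest that dirties memory faster than the network can copy it is the hard case for pre-copy.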

Page 13: How to Get the Most Out of - GEEKBOY.PRO


What if Guest writes to memory during live-migration? Page Tracing

vMotion Process

Page 14: How to Get the Most Out of - GEEKBOY.PRO

Iterative memory pre-copy

vMotion Process

Page 15: How to Get the Most Out of - GEEKBOY.PRO

Switchover phase

vMotion Process

Page 16: How to Get the Most Out of - GEEKBOY.PRO

How to Scale vMotion Performance?

Saturating the given network

Page 17: How to Get the Most Out of - GEEKBOY.PRO

Agenda

Saturating link speed

• Multiple vMotion vmkernel NICs

• Scaling single vMotion vmkernel NIC

• Auto scaling vMotion vmkernel NIC

Page 18: How to Get the Most Out of - GEEKBOY.PRO

Multiple vMotion VMkernel Interfaces

For high speed links

Page 19: How to Get the Most Out of - GEEKBOY.PRO

vMotion VMkernel interface

Standard vMotion Configuration

Page 20: How to Get the Most Out of - GEEKBOY.PRO

How many threads / helpers are used by vMotion?

Streams and Threads

Page 21: How to Get the Most Out of - GEEKBOY.PRO

Multi-NIC vMotion helped to saturate multiple 1-10GbE NICs

Streams and Threads

Page 22: How to Get the Most Out of - GEEKBOY.PRO

The challenge with > 25GbE NICs

Streams and Threads

Page 23: How to Get the Most Out of - GEEKBOY.PRO

Scale bandwidth utilization by adding vMotion VMkernel interfaces!

Streams and Threads

Page 24: How to Get the Most Out of - GEEKBOY.PRO

Instantiate multiple vMotion streams

Streams and Threads

Page 25: How to Get the Most Out of - GEEKBOY.PRO

Using multiple CPU cores = Higher transfer rates

Streams and Threads

Page 26: How to Get the Most Out of - GEEKBOY.PRO

A single vMotion stream can utilize ~15 Gbps of bandwidth

25 GbE : 1 stream = ~15 Gbps

40 GbE : 2 streams = ~30 Gbps

50 GbE : 3 streams = ~45 Gbps

100 GbE : 6 streams = ~90 Gbps
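The rows above are simple multiples of the per-stream figure. A minimal sketch, assuming each stream contributes a flat ~15 Gbps and that the aggregate cannot exceed the link speed, shows how much of each link the listed stream counts would use. The function and constant names are illustrative.

```python
PER_STREAM_GBPS = 15  # approximate single-stream throughput quoted on the slide

# (link speed in GbE, number of streams) pairs taken from the slide
SLIDE_ROWS = [(25, 1), (40, 2), (50, 3), (100, 6)]


def stream_throughput(link_gbe, streams, per_stream_gbps=PER_STREAM_GBPS):
    """Return (aggregate Gbps, fraction of the link used), capped at link speed."""
    aggregate = min(streams * per_stream_gbps, link_gbe)
    return aggregate, aggregate / link_gbe


for link, streams in SLIDE_ROWS:
    gbps, used = stream_throughput(link, streams)
    print(f"{link:>3} GbE, {streams} stream(s): ~{gbps} Gbps ({used:.0%} of link)")
```

Under these assumptions a single stream leaves a 25 GbE link well short of line rate, which is the gap the multi-stream techniques in this section are meant to close.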

vMotion stream performance

Streams and Threads

Page 27: How to Get the Most Out of - GEEKBOY.PRO

Scaling Single vmknic to Link Speed

Tune vMotion streams and vmknic RX queues

Page 28: How to Get the Most Out of - GEEKBOY.PRO

Scaling vMotion to Link Speed

Multiple vmknics

Create multiple vmknics on single uplink

• Requires multiple IP addresses

• Each vmknic will have

– One vMotion stream

– One receive queue

• 25 GbE will require 2 vmknics

– 1 vmknic gets 15 Gbps

• Management overhead

– IP per vmknic

[Figure: 25 GbE physical NIC with hardware RX queues, two vMotion vmknics with their vmknic RX queues, and one vMotion stream per vmknic, each stream served by Crypto, Stream, and Completion helpers]

Page 29: How to Get the Most Out of - GEEKBOY.PRO

Scaling Single vMotion vmknic

Tune vMotion streams and RX queues

What if a single vmknic could have

• Multiple vMotion streams

• Multiple receive dispatch (RX) queues

Still need the NIC to scale out the flows

• Receive Side Scaling

– Dynamic RSS

– Maps each vMotion stream to a NIC hardware ring

[Figure: 25 GbE physical NIC with hardware RX queues, a single vMotion vmkernel NIC with multiple vmknic RX queues, and two vMotion streams, each served by Crypto, Stream, and Completion helpers]

Page 30: How to Get the Most Out of - GEEKBOY.PRO

Scaling Single vMotion vmknic

Tune vMotion streams and RX queues

1. Scale vmknic RX queue

/net/tcpip/defaultNumRxQueue

vsish -e set /net/tcpip/defaultNumRxQueue 2

2. Scale vMotion streams

/config/Migrate/intOpts/VMotionStreamHelpers

vsish -e set /config/Migrate/intOpts/VMotionStreamHelpers 2

[Figure: 25 GbE physical NIC with hardware RX queues, a single vMotion vmkernel NIC with multiple vmknic RX queues, and two vMotion streams, each served by Crypto, Stream, and Completion helpers]

Page 31: How to Get the Most Out of - GEEKBOY.PRO

Auto Scaling vMotion

1. Determine uplink speed at vMotion start
2. Determine scale factor: uplink speed / 15 Gbps
3. Scale vMotion vmknic’s RX queues
4. Start required no. of vMotion streams

No tuning required

Dynamically scale

• Works out of the box

– Avoids need for manual tuning

• ESX network stack

– Provides ability to dynamically scale vmknic

• vMotion

– Can dynamically start the required no. of streams
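The auto-scaling steps can be sketched as a small planning function. The slide gives the scale factor as uplink speed divided by 15 Gbps but does not say how a fractional result is rounded, so rounding is left as an explicit parameter here. Using the same count for RX queues and streams is also an assumption, as are all names in the sketch.

```python
import math

PER_STREAM_GBPS = 15  # divisor quoted on the slide


def auto_scale_plan(uplink_gbps, round_up, per_stream_gbps=PER_STREAM_GBPS):
    """Return the scale factor plus the RX queue and stream counts derived from it.

    round_up=True rounds a fractional scale factor up, round_up=False rounds
    it down. At least one queue and one stream are always planned.
    """
    scale_factor = uplink_gbps / per_stream_gbps
    rounded = math.ceil(scale_factor) if round_up else math.floor(scale_factor)
    count = max(1, rounded)
    # Assumption: RX queues and vMotion streams are scaled to the same count.
    return {"scale_factor": scale_factor, "rx_queues": count, "streams": count}


for speed in (10, 25, 40, 100):
    down = auto_scale_plan(speed, round_up=False)["streams"]
    up = auto_scale_plan(speed, round_up=True)["streams"]
    print(f"{speed:>3} GbE: {down} stream(s) rounding down, {up} rounding up")
```

Rounding down reproduces the stream counts listed earlier for 25, 40, 50 and 100 GbE, while rounding up matches the earlier statement that 25 GbE requires 2 vmknics. The source does not say which one the auto-scaling logic uses.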

Page 32: How to Get the Most Out of - GEEKBOY.PRO

Tuning vMotion for Hybrid Cloud

Long distance vMotion

Page 33: How to Get the Most Out of - GEEKBOY.PRO

Scaling Long Distance vMotion

1. Determine Network Latency
2. Determine Network Bandwidth
3. Compute BDP
4. Set TCP Socket buffer size

Saturating available network bandwidth

To saturate network

• Size TCP socket buffer

– With bandwidth delay product (BDP)

– BDP = latency x network bandwidth

Page 34: How to Get the Most Out of - GEEKBOY.PRO

Manual Scaling for Long Distance vMotion

Tuning line rate and socket buffer

ESX Host: vmkernel config options

1. Expected Bandwidth

Why?
vMotion doesn’t determine bandwidth
• Takes uplink speed as bandwidth
• Doesn’t work for long distances
• So 1 GbE is default bandwidth

How?
vsish -e set /config/Migrate/intOpts/NetExpectedLineRateMBps 10000

2. Max Socket Buffer Size

Why?
Default MAX is 16 MB
• 10 GbE at 150 ms requires ~186 MB

How?
vsish -e set /net/tcpip/instances/defaultTcpipStack/sbMax 195035136
(186 MB in bytes)
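The socket buffer figure can be reproduced from BDP = latency x bandwidth. The sketch below uses integer arithmetic and treats the quoted latency as round-trip time, which is an assumption. The helper names are illustrative, and the throughput cap relies on the general rule that a TCP connection cannot have more than one buffer's worth of data in flight per round trip.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed (bit/s) and RTT (ms)."""
    return bandwidth_bps * rtt_ms // 8000  # /8 for bits->bytes, /1000 for ms->s


def window_limited_bps(buffer_bytes, rtt_ms):
    """Upper bound on throughput (bit/s) when one buffer is sent per round trip."""
    return buffer_bytes * 8 * 1000 // rtt_ms


required = bdp_bytes(10_000_000_000, 150)        # 10 GbE at 150 ms
slide_sbmax = 186 * 1024 * 1024                  # "186 MB in bytes" from the slide
default_cap = window_limited_bps(16 * 1024 * 1024, 150)  # default 16 MB maximum

print(f"BDP for 10 GbE at 150 ms: {required} bytes")
print(f"sbMax value on the slide: {slide_sbmax} bytes")
print(f"Throughput cap with 16 MB buffer at 150 ms: ~{default_cap / 1e9:.2f} Gbit/s")
```

The computed BDP is 187,500,000 bytes, close to the slide's ~186 MB, and the slide's `sbMax` value is 186 x 1024 x 1024 bytes. Under the same assumptions the default 16 MB maximum would cap a 150 ms path below 1 Gbit/s, which is why the buffer has to grow for long distance vMotion.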

Page 35: How to Get the Most Out of - GEEKBOY.PRO

Auto Scaling Long Distance vMotion

Determining expected bandwidth

1. Measure latency
• At vMotion start compute latency

2. Measure bandwidth
• If latency > 4ms then compute BW

3. BDP = BW x latency
• Compute Bandwidth delay product (BDP)

4. If BDP > SB MAX
• Adjust socket buffer (SB) Max to BDP
• Adjust socket buffer size to BDP

5. Start pre-copy
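The decision flow on this slide can be sketched as a single function. Measuring latency and bandwidth is out of scope here, so both are passed in as inputs. The function name, its return shape, and the choice to leave the buffers untouched at or below the 4 ms threshold are illustrative assumptions.

```python
LATENCY_THRESHOLD_MS = 4  # threshold quoted on the slide


def plan_socket_buffer(latency_ms, bandwidth_bps, sb_max_bytes):
    """Return (sb_max_bytes, socket_buffer_bytes) to apply before pre-copy starts.

    socket_buffer_bytes is None when latency is at or below the threshold,
    meaning the sketch leaves the existing buffer settings alone.
    """
    if latency_ms <= LATENCY_THRESHOLD_MS:
        return sb_max_bytes, None
    # BDP = BW x latency, converted from bit*ms to bytes.
    bdp = bandwidth_bps * latency_ms // 8000
    if bdp > sb_max_bytes:
        sb_max_bytes = bdp  # raise the maximum so the buffer can hold one BDP
    return sb_max_bytes, bdp


DEFAULT_SB_MAX = 16 * 1024 * 1024  # default maximum of 16 MB

print(plan_socket_buffer(2, 10_000_000_000, DEFAULT_SB_MAX))    # short distance
print(plan_socket_buffer(150, 10_000_000_000, DEFAULT_SB_MAX))  # long distance
```

Only the long distance case raises the maximum. A path whose BDP already fits within the current maximum keeps that maximum and just sizes the buffer to the BDP.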

Page 36: How to Get the Most Out of - GEEKBOY.PRO

Tuning Concurrent vMotion Limits

vMotion Per Resource limits

Page 37: How to Get the Most Out of - GEEKBOY.PRO

Concurrent vMotion Limits

vMotion cost per resource

vMotion Cost
• Network Cost = 1
• Storage Cost = 1
• Host Cost = 1

vMotion limits
• 4 per 1 GbE NIC
• 8 per 10 GbE NIC
• 8 per ESX Host
• 128 per datastore

MAX Cost per resource
• 1 GbE: 4
• 10/40/100 GbE: 8
• Datastore: 128
• ESX Host: 8
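The listed limits follow from dividing each resource's maximum cost by the cost one vMotion charges against it. A minimal sketch of that arithmetic, assuming a plain integer division and using illustrative dictionary keys and function names:

```python
# Maximum cost each resource can carry, as listed on the slide.
MAX_COST = {"1 GbE NIC": 4, "10 GbE NIC": 8, "ESX host": 8, "datastore": 128}

# Cost a single vMotion charges against each resource (all 1 on the slide).
VMOTION_COST = {"1 GbE NIC": 1, "10 GbE NIC": 1, "ESX host": 1, "datastore": 1}


def concurrent_limits(max_cost, op_cost):
    """Return how many operations fit on each resource, or None if uncharged."""
    limits = {}
    for resource, cap in max_cost.items():
        cost = op_cost.get(resource, 0)
        limits[resource] = cap // cost if cost > 0 else None
    return limits


for resource, limit in concurrent_limits(MAX_COST, VMOTION_COST).items():
    print(f"{resource}: {limit} concurrent vMotions")
```

With every cost at 1, the limits equal the maximum costs: 4 and 8 per NIC, 8 per host, and 128 per datastore.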

Page 38: How to Get the Most Out of - GEEKBOY.PRO

Tuning Concurrent vMotions

How to limit concurrency?

Why change default concurrency limits?

Single vMotion can saturate given link speed
• 1/10/40/100 GbE NICs

Concurrent vMotions => Longer vMotion duration
– Multiple vMotions share vMotion network
– Longer time to put host into maintenance mode

Reason for these limits => Historical
• Limits defined when vMotion couldn’t saturate link speed

Limit Concurrency

• Increase network cost of vMotion

vCenter Advanced Config option
config.vpxd.ResourceManager.networkCostPerVmotion
• Default cost is 1
• For 1 vMotion at any given time
– Increase cost to 4 for 1 GbE
– Increase cost to 8 for 10 GbE
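To see how the network cost setting maps to concurrency, the sketch below sweeps a few cost values against the per-NIC maximum costs of 4 (1 GbE) and 8 (10 GbE) given earlier in the deck. It assumes vCenter admits vMotions while their summed cost stays within the maximum, so the limit is modelled as an integer division. Only the option name `config.vpxd.ResourceManager.networkCostPerVmotion` comes from the source; everything else is illustrative.

```python
NIC_MAX_COST = {"1 GbE": 4, "10 GbE": 8}  # per-NIC maximum cost from the deck


def vmotions_per_nic(network_cost, nic_max_cost):
    """Concurrent vMotions a NIC admits for a given per-vMotion network cost."""
    if network_cost < 1:
        raise ValueError("network cost must be at least 1")
    return nic_max_cost // network_cost


for nic, cap in NIC_MAX_COST.items():
    for cost in (1, 2, 4, 8):
        if cost <= cap:
            print(f"{nic}: networkCostPerVmotion={cost} -> "
                  f"{vmotions_per_nic(cost, cap)} concurrent")
```

Setting the cost equal to the NIC's maximum (4 for 1 GbE, 8 for 10 GbE) yields exactly one vMotion at a time, matching the values on the slide.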

Page 39: How to Get the Most Out of - GEEKBOY.PRO

Concurrent Storage vMotion Limits

Storage vMotion cost per resource

Storage vMotion Cost
• Network Cost = 0
• Storage Cost = 16
• Host Cost = 4

Storage vMotion Limits
• 2 per ESX Host
• 8 per datastore

MAX Cost per resource
• Datastore: 128
• ESX Host: 8

Page 40: How to Get the Most Out of - GEEKBOY.PRO

Tuning Storage vMotions

Concurrency based on storage array

Tuning simultaneous disk copy

Copy one disk at a time for a given VM
– For the given source destination datastore pair

VMX Config option
• svmotion.maxSimultaneousDiskCopy
• E.g. for a VM with 3 disks on NVMe datastore
• svmotion.maxSimultaneousDiskCopy = 3

Tuning Concurrency

Increase/Decrease storage vMotion cost
• Maximize datastore throughput and minimize latency
– Preserve VM requirements

vCenter Advanced Config option
• config.vpxd.ResourceManager.CostPerEsx6xSVmotion
• Default Cost is 16
• E.g. reduce cost to 8 to double concurrency
– For high perf NVMe storage
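The "reduce cost to 8 to double concurrency" example can be checked against the storage vMotion costs and maximums given on the previous slide (storage cost 16, host cost 4, datastore maximum 128, host maximum 8). The sketch assumes the advanced option changes only the datastore-side cost and that limits are a plain integer division. Function and constant names are illustrative.

```python
DATASTORE_MAX_COST = 128  # maximum cost per datastore
HOST_MAX_COST = 8         # maximum cost per ESX host


def svmotion_limits(storage_cost=16, host_cost=4):
    """Concurrent Storage vMotions per datastore and per host for given costs."""
    return {
        "per_datastore": DATASTORE_MAX_COST // storage_cost,
        "per_host": HOST_MAX_COST // host_cost,
    }


default = svmotion_limits()
tuned = svmotion_limits(storage_cost=8)  # assumed effect of lowering the cost to 8

print("default:", default)
print("tuned:  ", tuned)
```

Halving the storage cost doubles the per-datastore figure from 8 to 16. In this model the per-host figure stays at 2 because the host cost is unchanged, so the extra datastore concurrency would have to come from more hosts taking part.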

Page 41: How to Get the Most Out of - GEEKBOY.PRO

Concurrent vMotion without Shared Storage Limits

vMotion cost per resource

vMotion Cost
• Network Cost = 1
• Storage Cost = 16
• Host Cost = 4

vMotion Limits
• 4 per 1 GbE NIC
• 8 per 10 GbE NIC
• 2 per ESX Host
• 8 per datastore

MAX Cost per resource
• 1 GbE: 4
• 10/40/100 GbE: 8
• Datastore: 128
• ESX Host: 8

Page 42: How to Get the Most Out of - GEEKBOY.PRO

Tuning Concurrent vMotions without Shared Storage

Concurrency based on storage array performance

Moving storage dominates vMotion time

[Figure: vMotion Time split into 90% and 10% segments, labeled 1 and 2]

First tune for Storage then Network

• Follow storage vMotion tuning guidelines
• Tune simultaneous disk copy
• Tune concurrent limits
• Tune network cost
• Check if disk copy is remote
– Src ESX host cannot access destination storage
• Check if disk copy can saturate vMotion network
• If it can, limit concurrent vMotions to 1
– By tuning the network cost

Page 43: How to Get the Most Out of - GEEKBOY.PRO

Troubleshooting vMotion

Navigating vSphere logs

Page 44: How to Get the Most Out of - GEEKBOY.PRO

vMotion Logging

vSphere logging

vSphere Components    5 Processes    5 Log Files
vCenter Server        vpxd           vpxd.log
ESX                   vpxa           vpxa.log
ESX                   hostd          hostd.log
ESX                   VMX            vmware.log
ESX                   vmkernel       vmkernel.log

Page 45: How to Get the Most Out of - GEEKBOY.PRO

Tracking vMotion

Management and data plane

Management/Control Plane

Processes
• vCenter vpxd
• vCenter agent vpxa
• Host Daemon hostd

Operation ID
• opID=1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327

Hostd
• opID to migration ID mapping

Data Plane

Processes
• VMX
• Vmkernel

Migration ID
• 1439244808967130

Page 46: How to Get the Most Out of - GEEKBOY.PRO

Associating logs

vMotion Tracing Logs

Log files (vmkernel and VM log files)

• grep “VMotion” keyword in vmkernel log files (/var/log/vmkernel*)

# grep VMotion /var/log/vmkernel*

2015-08-10T22:13:40.958Z cpu89:2237036)VMotionUtil: 3995: 1439244808967130 S: Stream connection 1 added.

2015-08-10T22:13:41.164Z cpu89:2237036)VMotion: 6834: 1439244808967130 S: Detected 50Ms round-trip latency to remote host.

2015-08-10T22:13:41.266Z cpu89:2237036)XVMotion: 3284: 1439244808967130 S: Starting XVMotion stream.

2015-08-10T22:16:37.556Z cpu20:2237038)VMotion: 4873: 1439244808967130 S: Estimated network bandwidth 112.070 MB/s during disk copy.

• grep migration id in VM log files (vmware*log files in VM home directory)

# grep 1439244808967130 ${VM_HOME}/vmware*

VPXD Logs (vCenter Server)

• Find the “Operation ID” of vMotion

# grep "relocate" /var/log/vmware/vpxd/vpxd-*.log | grep BEGIN

2015-08-10T22:13:28.847Z info vpxd[7FD6AED5A700] [Originator@6876 sub=vpxLro opID=1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327-ngc-bf] [VpxLRO] --BEGIN task-55 -- vm-46 -- vim.VirtualMachine.relocate --

Hostd Logs (ESX Server)

• Find the Migration ID of vMotion from the Operation ID

# grep 1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327 /var/log/hostd.log | grep -i migrate

2015-08-10T22:13:39.568Z info hostd[2FCC2B70] [Originator@6876 sub=Vcsvc.VMotionSrc (1439244808967130) opID=1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327-ngc-bf-40-86-7fb8 user=vpxuser:VSPHERE.LOCAL\Administrator] VMotionEntry: migrateType = 1
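The grep chain above (vpxd Operation ID, then hostd Migration ID, then vmkernel and VM logs) can also be scripted. The sketch below runs against the sample lines from this slide only. Both regular expressions are assumptions derived from those samples, and log formats may differ between vSphere versions.

```python
import re

# Sample lines copied from the slide.
VPXD_LINE = (
    "2015-08-10T22:13:28.847Z info vpxd[7FD6AED5A700] [Originator@6876 sub=vpxLro "
    "opID=1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327-ngc-bf] [VpxLRO] "
    "--BEGIN task-55 -- vm-46 -- vim.VirtualMachine.relocate --"
)
HOSTD_LINE = (
    "2015-08-10T22:13:39.568Z info hostd[2FCC2B70] [Originator@6876 "
    "sub=Vcsvc.VMotionSrc (1439244808967130) "
    "opID=1807e8fa-3b9d-453e-8dd4-79f2d0ac91ca-327-ngc-bf-40-86-7fb8 "
    r"user=vpxuser:VSPHERE.LOCAL\Administrator] VMotionEntry: migrateType = 1"
)
VMKERNEL_LINE = (
    "2015-08-10T22:13:41.164Z cpu89:2237036)VMotion: 6834: 1439244808967130 S: "
    "Detected 50Ms round-trip latency to remote host."
)

# Assumed patterns: a UUID followed by "-<digits>" after "opID=", and a
# numeric ID in parentheses after the Vcsvc.VMotion* subsystem name.
OPID_RE = re.compile(
    r"opID=([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-\d+)"
)
MIGRATION_ID_RE = re.compile(r"sub=Vcsvc\.VMotion\w*\s+\((\d+)\)")


def op_id_from_vpxd(line):
    """Extract the Operation ID prefix from a vpxd relocate BEGIN line."""
    match = OPID_RE.search(line)
    return match.group(1) if match else None


def migration_id_from_hostd(line, op_id):
    """Extract the Migration ID from a hostd line that carries the given opID."""
    if op_id is None or op_id not in line:
        return None
    match = MIGRATION_ID_RE.search(line)
    return match.group(1) if match else None


op_id = op_id_from_vpxd(VPXD_LINE)
migration_id = migration_id_from_hostd(HOSTD_LINE, op_id)
print("Operation ID:", op_id)
print("Migration ID:", migration_id)
print("Found in vmkernel line:", migration_id in VMKERNEL_LINE)
```

The Operation ID ties together the management-plane logs, and the Migration ID recovered from hostd is the key to search for in the vmkernel and `vmware.log` files.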

Page 47: How to Get the Most Out of - GEEKBOY.PRO

vMotion Failures

Common failures and patterns

• vMotion network connectivity issues
– ESX hosts cannot ping each other, or time out after 20 secs
– MTU mismatch between vmknic and network layer, i.e. switches and routers

• Storage
– Datastore unreachable or APD
– I/Os time out after 20 secs or more

• vMotion successful but guest VM issues
– VM network is not reachable: no L2 level connectivity on Dest ESX host

• Resource overcommit
– Cannot allocate memory for a long time
– Swapping takes a long time, leading to vMotion timeout

Page 48: How to Get the Most Out of - GEEKBOY.PRO

Summary

Key takeaways

vMotion process overview

Scaling vMotion performance for high speed links
• Multiple vmknics
• Scaling single vmknic
• Auto scaling vMotion to high speed links

Scaling long distance vMotion
• Tuning socket buffers
• Auto scaling vMotion to size socket buffer

Tuning vMotion concurrent limits
• Tuning vMotion network and storage costs

Page 49: How to Get the Most Out of - GEEKBOY.PRO

[UX70007U] Managing workloads at scale

• Feedback about vMotion UX, bulk migration etc

Design Studio Session

Managing workloads at scale
