
Research – Zurich Research Laboratory

Got Loss? Get zOVN!

Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat

ACM SIGCOMM 2013, 12-16 August, Hong Kong, China


Application Performance in Virtualized Datacenter Networks

[Figure: end-users access datacenter services over the global Internet (long-and-fat links). Inside the datacenter, routers and switches form the physical network (short-and-fat links) connecting virtualized servers 1 to N. Each server has a NIC and a virtual switch serving VMs 1 to K, each VM with its own vNIC.]


Physical Network: Lossless Links

• IBM has built flow-controlled links since the 1980s
• High Performance Computing community: large-scale lossless distributed systems
  • Flow control improves performance
  • HPC and datacenter communities disconnected
• Why do we disregard Ethernet flow control?
  • PAUSE widely available, largely ignored
• Converged Enhanced Ethernet – applies HPC and storage lessons
  • Priority Flow Control (standardized 2011)
  • Constantly improved for 1T


Virtual Networks are Different

[Table: physical networks vs. virtual networks, compared on packet forwarding, deterministic bandwidth and delay, link-level flow control, bandwidth allocation, and latency (µs in physical networks, ms in virtual networks).]

Virtual Networks in embryonic stage


Contributions

• Loss identification and characterization in virtual networks
• Dirty-slate approach for latency-sensitive applications
  • Exploit an L2 technique to the benefit of TCP and the application
• Introduce zero-loss Overlay Virtual Network (zOVN)
  • Flow-controlled virtual switch
• Evaluation with Partition/Aggregate
  • Prototype implementation
  • Cross-layer simulation

Flow control improves application performance


Outline

• Introduction

• Losses in Virtual Networks

• zOVN Architecture

• Evaluation

• Conclusions


Losses in Virtual Networks

• Packets traverse a series of queues
• Producer/Consumer problem on each queue

Not implemented correctly on each queue
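The producer/consumer point can be illustrated with a minimal sketch of a single bounded queue. This is not the zOVN code: the `run` function, the queue capacity and the packet count are illustrative assumptions. In lossy mode the producer tail-drops when the queue is full; in lossless mode it blocks until the consumer frees a slot.

```python
import queue
import threading

def run(lossless, n_packets=1000, capacity=8):
    """Push n_packets through one bounded queue; return (received, dropped)."""
    q = queue.Queue(maxsize=capacity)
    dropped = 0
    received = []

    def consumer():
        while True:
            item = q.get()
            if item is None:          # sentinel: the producer is done
                return
            received.append(item)

    t = threading.Thread(target=consumer)
    t.start()
    for seq in range(n_packets):
        if lossless:
            q.put(seq)                # blocks while the queue is full (back-pressure)
        else:
            try:
                q.put_nowait(seq)     # tail-drop when the queue is full
            except queue.Full:
                dropped += 1
    q.put(None)                       # blocking put, so the sentinel is never dropped
    t.join()
    return len(received), dropped

lossless_received, lossless_dropped = run(lossless=True)
lossy_received, lossy_dropped = run(lossless=False)
```

With blocking puts every packet is delivered in order; with non-blocking puts the number of drops depends on how fast the consumer happens to run.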

[Figure: a physical machine hosting a vSwitch and three VMs. VM 1 and VM 2 are sources whose vNIC Tx queues feed vSwitch Port A Tx and Port B Tx; VM 3 is the sink, fed from vSwitch Port C Rx through its vNIC Rx queue.]

Losses in Virtual Networks (2)

• Numbers: measurement points
  • Inject UDP packets at (1)
  • Count how many still arrive at (6)
• Loss locations
  • vSwitch – between (3) and (4)
  • Receive stack – between (5) and (6)
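The accounting behind these measurement points can be sketched as follows. The counter values are hypothetical, not results from the talk, and `segment_losses` is an illustrative helper; with two sources, the counts at points (1) to (3) would be summed over both source paths.

```python
def segment_losses(counts):
    """counts maps a measurement point (1..6) to the packets observed there.

    Returns the number of packets lost between each pair of consecutive points.
    """
    points = sorted(counts)
    return {(a, b): counts[a] - counts[b] for a, b in zip(points, points[1:])}

# Hypothetical counter values, for illustration only.
counts = {1: 10000, 2: 10000, 3: 10000, 4: 7000, 5: 7000, 6: 6500}

losses = segment_losses(counts)
vswitch_loss = losses[(3, 4)]   # lost inside the vSwitch
stack_loss = losses[(5, 6)]     # lost in the receive stack
```

The per-segment losses always sum to the difference between what was injected at (1) and what arrived at (6).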

[Figure: two source VMs and one sink VM attached to a vSwitch on one physical machine, annotated with measurement points (1) to (3) along each source path and (4) to (6) along the sink path.]

Losses in Virtual Networks (3)

Configuration  Hypervisor  vNIC    vSwitch
C1             Qemu/KVM    Virtio  Linux Bridge
C2             Qemu/KVM    Virtio  Open vSwitch
C3             Qemu/KVM    Virtio  VALE
C4             H2          N2      S4
C5             H2          E1000   S4
C6             Qemu/KVM    E1000   Linux Bridge
C7             Qemu/KVM    E1000   Open vSwitch

[Figure: stacked bars of injected traffic [MBps] (axis 0 to 200) for configurations C1 to C7, split into Received, vSwitch Loss, and Stack Loss.]



TX Path

[Figure: transmit path through the VM, the hypervisor, and the NIC. Components: Application; guest kernel (socket Tx, Qdisc, vNIC Tx); vSwitch (Port A Rx, Port B Tx); zOVN bridge; NIC Tx; physical link. Labeled interactions: write / return value, enqueue, start_xmit, start/stop queue, free skb, receive / return value, wake-up, overlay encapsulation, send frame, receive PAUSE.]


RX Path: Fix Stack Loss

[Figure: receive path through the NIC, the hypervisor, and the VM. Components: physical link; NIC Rx; zOVN bridge; vSwitch (Port A Tx, Port B Rx); guest kernel (vNIC Rx, NET RX softirq, netif_receive_skb, socket Rx); Application. Labeled interactions: receive frame, send PAUSE, overlay decapsulation, send / return value, wake-up, pause/resume queue, read / return value.]

• setsockopt: select lossy or lossless.


Lossless Virtual Switch

[Figure: vSwitch with Tx ports 1 to N on the sender side and Rx ports 1 to N on the receiver side.]

Senders:
• Produce packets
• Start forwarder
• Sleep

Receivers:
• Consume packets
• Start forwarder
• Sleep

Forwarder:
• Move packets from Tx to Rx
• Pause Tx ports if Rx port full
• Wake up Tx ports when something is consumed
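The sender, receiver, and forwarder roles above can be sketched with one lock and one condition variable. This is an illustrative user-space model under stated assumptions, not the zOVN implementation: the class name `LosslessSwitch`, the per-port capacity, and running the forwarder inline in the calling thread are all choices made for this sketch.

```python
import threading
from collections import deque

class LosslessSwitch:
    def __init__(self, n_ports, capacity=4):
        self.tx = [deque() for _ in range(n_ports)]   # filled by senders
        self.rx = [deque() for _ in range(n_ports)]   # drained by receivers
        self.capacity = capacity
        self.cond = threading.Condition()

    def _forward(self):
        """Move packets from Tx to Rx; a full Rx port pauses the Tx port."""
        moved = False
        for txq in self.tx:
            while txq:
                dst, _payload = txq[0]
                if len(self.rx[dst]) >= self.capacity:
                    break                 # head packet stays queued: port paused
                self.rx[dst].append(txq.popleft())
                moved = True
        if moved:
            self.cond.notify_all()        # wake sleeping senders and receivers

    def send(self, src, dst, payload):
        with self.cond:
            while len(self.tx[src]) >= self.capacity:
                self.cond.wait()          # sender sleeps instead of dropping
            self.tx[src].append((dst, payload))
            self._forward()               # "start forwarder"

    def recv(self, port):
        with self.cond:
            while not self.rx[port]:
                self.cond.wait()          # receiver sleeps until data arrives
            _dst, payload = self.rx[port].popleft()
            self._forward()               # freed Rx space lets paused Tx ports resume
            return payload

# Two senders (ports 0 and 1) converge on one receiver (port 2).
sw = LosslessSwitch(n_ports=3, capacity=4)

def sender(src):
    for i in range(200):
        sw.send(src, 2, (src, i))

threads = [threading.Thread(target=sender, args=(s,)) for s in (0, 1)]
for t in threads:
    t.start()
got = [sw.recv(2) for _ in range(400)]
for t in threads:
    t.join()
```

No packet is ever discarded: a full Rx port only stalls the Tx ports that feed it. The sketch keeps a single FIFO per Tx port, so a paused head packet also delays packets behind it that are bound for other ports.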


Fully Lossless Path

• Fixed:
  • vSwitch – between (3) and (4)
  • Receive stack – between (5) and (6)

[Figure: two source VMs and one sink VM attached to a vSwitch on one physical machine, annotated with measurement points (1) to (3) along each source path and (4) to (6) along the sink path.]



Partition/Aggregate Workload

• Problem: TCP incast
  • During Aggregate, buffers might overflow.
  • For short flows: TCP ineffective, ACK clock stalled.
  • Must rely on timeouts.
• Partition and Aggregate – datacenter internal
  • Open to optimizations
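Why a single timeout matters so much in this workload can be shown with a small worked example. The Master can only aggregate once the slowest Worker has answered, so one response that has to wait for a retransmission timeout sets the completion time of the whole query. The 200 ms timeout and the per-Worker times below are assumptions chosen for illustration, not measurements from the talk.

```python
MIN_RTO_MS = 200.0   # assumption: a typical minimum retransmission timeout

def query_completion_ms(worker_times_ms, lost_workers=()):
    """The Master finishes only when the slowest Worker response has arrived.

    Workers listed in lost_workers pay one retransmission timeout on top of
    their normal response time.
    """
    return max(
        t + (MIN_RTO_MS if w in lost_workers else 0.0)
        for w, t in enumerate(worker_times_ms)
    )

times = [1.0, 1.2, 0.9, 1.1]                   # hypothetical per-Worker times in ms
no_loss = query_completion_ms(times)           # bounded by the slowest Worker
one_loss = query_completion_ms(times, {2})     # one Worker waits for a timeout
```

In this example a single timed-out response raises the completion time from about a millisecond to about two hundred.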

[Figure: a Master above four Workers, with numbered steps (1) to (4); steps (2) and (3) appear once per Worker.]

Testbed Setup

[Figure: four IBM x3550 M4 servers, each running VM 1 to VM 16 behind a vSwitch. Each server connects at 1G to the control network (HP 1810-8G 1G switch) and at 10G to the data network (IBM G8264 10G switch).]

• 4x rack servers
  • 16 physical cores + HyperThreading
  • Intel 10G adapters (ixgbe drivers)
• 16 VMs / server
  • 8 VMs for PA traffic*
  • 8 VMs produce background flow

* as in “DCTCP: Efficient Packet Transport for the Commoditized Data Center”, SIGCOMM 2010

Testbed Results (CUBIC)

[Figure: mean completion time [ms] (log scale, 1 to 1000) versus response size [packets] (log scale, 1 to 10000) for configurations LL, LZ, ZL, and ZZ.]

Configuration  Virtual Network Flow Control  Physical Network Flow Control
LL             No                            No
LZ             No                            Yes
ZL             Yes                           No
ZZ             Yes                           Yes

• Virtual only better than physical only: vSwitch primary congestion point. Physical switch congestion negligible
• No improvement for short/long flows: Long transfers can remain on lossy priorities

Simulation Setup

• Larger topology: 256 servers
• 4 VMs / server
  • 3 VMs produce PA traffic
  • 1 VM background flows
• Assumption: infinite CPU

Simulation Results (64 packets)

• Confirm findings from prototype experiments

• (LZ) Physical only flow control: shift the drop point into the virtual network

• (ZZ) Both flow controls required for better performance

[Figure: mean completion time [ms] (axis 0 to 45) for NewReno, Vegas, and Cubic under configurations LL, LZ, ZL, and ZZ. LL: no flow control; LZ: physical network flow control only; ZL: virtual network flow control only; ZZ: both.]

Faster CPUs or faster networks?

• Loss ratio influenced by CPU/network speed ratio

TX
• Slow CPU coupled with a fast network is desirable
  • e.g. Xeon + 1G network drops more than Core2 + 1G network

RX
• Fast CPU coupled with a slow network is desirable
  • e.g. Xeon + 10G network drops more than Xeon + 1G network

• Conflicting requirements: cannot solve problem by changing hardware

The only solution: add flow control!
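The conflicting requirements can be made concrete with a simple fluid approximation, offered here as a sketch rather than the model used in the talk. Without flow control, a finite queue whose producer runs at rate p and whose consumer runs at rate c drops, in the long run, a fraction max(0, 1 - c/p) of the packets. On TX the CPU is the producer and the network the consumer; on RX the roles are swapped. All rates below are hypothetical.

```python
def loss_ratio(producer_rate, consumer_rate):
    """Long-run drop fraction of a finite queue without flow control."""
    if producer_rate <= consumer_rate:
        return 0.0
    return 1.0 - consumer_rate / producer_rate

# Hypothetical rates in Gbps, chosen only to illustrate the trend.
tx_fast_cpu = loss_ratio(producer_rate=4.0, consumer_rate=1.0)    # fast CPU feeds a 1G link
tx_slow_cpu = loss_ratio(producer_rate=1.5, consumer_rate=1.0)    # slower CPU, same link
rx_fast_net = loss_ratio(producer_rate=10.0, consumer_rate=4.0)   # 10G link feeds the CPU
rx_slow_net = loss_ratio(producer_rate=1.0, consumer_rate=4.0)    # 1G link, same CPU
```

Speeding up the CPU lowers the RX loss but raises the TX loss, and speeding up the network does the opposite, so no hardware ratio removes both.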


Conclusions

• Loss identification and characterization in OVN
• First flow-controlled vSwitch for future Overlay Virtual Networks
  • Dirty-slate approach for latency-sensitive applications
  • Un-tuned TCP
  • Commodity 1-10G Ethernet fabric
  • Result replication trivial
  • Orthogonal to other proposals
• Lossless links: order-of-magnitude completion time reduction in Partition/Aggregate


Backup


Encapsulation in Overlay Virtual Networks

Workflow

1. Source VM sends packet to its attached vSwitch.
2. vSwitch queries the Controller to find the address of the destination.
3. Controller answers. The information is cached by the switch.
4. Packet sent over physical network encapsulated with new headers.
5. Packet decapsulated at destination virtual switch.
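The five steps can be sketched as a toy model. The `Controller` and `VSwitch` classes, the dictionary-based "headers", and the address strings are all hypothetical names introduced for illustration; they do not correspond to any real overlay implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    """Maps a VM address to the physical server that hosts it."""
    directory: dict
    queries: int = 0

    def lookup(self, vm_addr):
        self.queries += 1
        return self.directory[vm_addr]

@dataclass
class VSwitch:
    server_addr: str
    controller: Controller
    cache: dict = field(default_factory=dict)

    def send(self, dst_vm, inner_frame):
        # Steps 2-3: ask the controller only on a cache miss, then cache the answer.
        if dst_vm not in self.cache:
            self.cache[dst_vm] = self.controller.lookup(dst_vm)
        # Step 4: wrap the inner frame with outer headers for the physical network.
        return {
            "outer_src": self.server_addr,
            "outer_dst": self.cache[dst_vm],
            "inner": inner_frame,
        }

    @staticmethod
    def receive(encapsulated):
        # Step 5: strip the outer headers at the destination vSwitch.
        return encapsulated["inner"]

ctrl = Controller({"vm-b": "server-2"})
src_switch = VSwitch("server-1", ctrl)
first = src_switch.send("vm-b", b"payload")   # step 1: the VM hands over a packet
second = src_switch.send("vm-b", b"again")    # answered from the cache
delivered = VSwitch.receive(first)
```

Only the first packet toward a destination costs a controller round trip; later packets reuse the cached mapping.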

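Encapsulated frame (inner packet, then the added outer headers): Payload | TCP | IP | Eth || Encap | UDP | IP | Eth

[Figure: a source server and a destination server, each hosting VMs attached to a vSwitch with a cache, connected through the physical network, plus a Fabric Controller. Steps (1) to (3) are marked on the source side, (4) on the physical network, and (5) on the destination side.]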