vsnoop: improving tcp throughput in virtualized environments via acknowledgement offload

vSnoop: Improving TCP Throughput in Virtualized Environments

via Acknowledgement Offload

Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu

Department of Computer SciencePurdue University

Cloud Computing and HPC

Background and Motivation Virtualization: A key enabler of cloud

computing Amazon EC2, Eucalyptus

Increasingly adopted in other real systems: High performance computing

NERSC’s Magellan system Grid/cyberinfrastructure computing

In-VIGO, Nimbus, Virtuoso

Multiple VMs hosted by one physical host Multiple VMs sharing the same core

Flexibility, scalability, and economy

VM Consolidation: A Common Practice

Hardware

Virtualization Layer

VM 1 VM 3 VM 4VM 2Key Observation:VM consolidation negatively

impacts network performance!

Sender

Hardware

Virtualization Layer

Investigating the Problem

Server

VM 1 VM 2 VM 3Client

Number of VMs

US East – West

US East – Europe

US West – Australia

RTT increases in proportion to VM scheduling slice

(30ms)

Q1: How does CPU Sharing affect RTT ?

RTT Increase

Q2: What is the Cause of RTT Increase ?

Sender

Hardware

Driver Domain(dom0)

Device Driver

bufbuf

VM scheduling latency dominates

virtualization overhead!

+ dom0 processing x wait time in buffer

Connection to the VM is much slower than dom0!

Q3: What is the Impact on TCP Throughput ?

+ dom0 x VM

Our Solution: vSnoop Alleviates the negative effect of VM

scheduling on TCP throughput Implemented within the driver domain to

accelerate TCP connections

Does not require any modifications to the VM

Does not violate end-to-end TCP semantics Applicable across a wide range of VMMs

Xen, VMware, KVM, etc.

Sender VM1 BufferDriver Domain

SYN,ACK

VM1 buffer

TCP Connection to a VMScheduled VM

SYN,ACKSYN

VM Scheduling Latency

Sender establishes a TCP connection to

Sender VM Shared BufferDriver Domain

SYN,ACK

VM1 buffer

Key Idea: Acknowledgement OffloadScheduled VM

SYN,ACK

w/ vSnoop

Faster progress during TCP slowstart

vSnoop’s Impact on TCP Flows TCP Slow Start

Early acknowledgements help progress connections faster

Most significant benefit for short transfers that are more prevalent in data centers [Kandula IMC’09], [Benson WREN’09]

TCP congestion avoidance and fast retransmit Large flows in the steady state can also benefit from

vSnoop Benefit not as much as for Slow Start

Challenge 1: Out-of-order/special packets (SYN, FIN packets)

Solution: Let the VM handle these packets

Challenge 2: Packet loss after vSnoop Solution: Let vSnoop acknowledge only if room in

buffer

Challenge 3: ACKs generated by the VM Solution: Suppress/rewrite ACKs already generated by

vSnoop

Challenge 4: Throttle Receive window to keep vSnoop online

Solution: Adjusted according to the buffer size

Challenges

State Machine Maintained Per-Flow

Unexpected Sequence

Active(online)

No buffer(offline)

Out-of-order packet

In-order pkt Buffer space

available

Out-of-order packet

In-order pktNo buffer

In-order pkt Buffer space available

No buffer

Packet recvEarly acknowledgements

for in-order packets

Don’t acknowledge

Pass out-of-order pkts to VM

vSnoop Implementation in Xen

Driver Domain (dom0)

Bridge

Netfront

Netback

vSnoop

Netfront

Netback

VM3Netfront

Netback

buf bufbuf

Tuning Netfront

Evaluation Overheads of vSnoop

TCP throughput speedup

Application speedup Multi-tier web service (RUBiS) MPI benchmarks (Intel, High-Performance

Linpack)

Evaluation – Setup VM hosts

3.06GHz Intel Xeon CPUs, 4GB RAM Only one core/CPU enabled Xen 3.3 with Linux 2.6.18 for the driver domain (dom0)

and the guest VMs Client machine

2.4GHz Intel Core 2 Quad CPU, 2GB RAM Linux 2.6.19

Gigabit Ethernet switch

vSnoop Routines

Single Stream Multiple Streams

Cycles CPU % Cycles CPU %

vSnoop_ingress() 509 3.03 516 3.05vSnoop_lookup_hash(

)74 0.44 91 0.51

vSnoop_build_ack() 52 0.32 52 0.32vSnoop_egress() 104 0.61 104 0.61

Per-packet CPU overhead for vSnoop routines in dom0

vSnoop Overhead Profiling per-packet vSnoop overhead using

Xenoprof [Menon VEE’05]

Minimal aggregateCPU overhead

Median

0.192MB/s

0.778MB/s

6.003MB/s

TCP Throughput Improvement 3 VMs consolidated, 1000 transfers of a

100KB file Vanilla Xen, Xen+tuning,

Xen+tuning+vSnoop30x Improvement

+ Vanilla Xen x Xen+tuning * Xen+tuning+vSnoop

TCP Throughput: 1 VM/Core

50KBNo

Transfer Size

Xen+tuning+vSnoopXen+tuningXen

TCP Throughput: 2 VMs/Core

50KBNo

Transfer Size

vSnoop’s benefit rises with higher VM consolidation

TCP Throughput: Other Setup Parameters CPU load for VMs Number of TCP connections to VM Driver domain on separate core Sender being a VM

vSnoop consistently achieves significant TCP

throughput improvement

vSnoopdom0

dom1 dom2

Server1

vSnoopdom0

dom1 dom2

Server2ClientClient Threads

Application-Level Performance: RUBiS

RUBiS Clients Apache MySQL

RUBiS Operation Countw/o vSnoop

Countw/ vSnoop

Browse 421 505 19.9%BrowseCategories 288 357 23.9%

SearchItemsInCategory 3498 4747 35.7%BrowseRegions 128 141 10.1%

ViewItem 2892 3776 30.5%ViewUserInfo 732 846 15.6%

ViewBidHistory 339 398 17.4%Others 3939 4815 22.2%Total 12237 15585 27.4%

Average Throughput 29 req/s 37 req/s 27.5%

RUBiS Results

Intel MPI Benchmark: Network intensive High-performance Linpack: CPU intensive

vSnoopdom0

dom1 dom2

Server1dom0

dom1 dom2

Server2dom0

dom1 dom2

Server3dom0

Server4

MPI nodes

Application-level Performance – MPI Benchmarks

vSnoop vSnoop vSnoop

Intel MPI Benchmark Results: Broadcast

Message Size

40% Improvement

Intel MPI Benchmark Results: All-to-All

Message Size

HPL Benchmark Results

0.000 0.200 0.400 0.600 0.800 1.000 1.200 1.400 1.600 1.800

Problem Size and Block Size (N,NB)

Xen+tuning+vSnoopXen

Related Work Optimizing virtualized I/O path

Menon et al. [USENIX ATC’06,’08; ASPLOS’09]

Improving intra-host VM communications XenSocket [Middleware’07], XenLoop

[HPDC’08], Fido [USENIX ATC’09], XWAY [VEE’08], IVC [SC’07]

I/O-aware VM scheduling Govindan et al. [VEE’07], DVT [SoCC’10]

Conclusions Problem: VM consolidation degrades TCP

throughput Solution: vSnoop

Leverages acknowledgment offloading Does not violate end-to-end TCP semantics Is transparent to applications and OS in VMs Is generically applicable to many VMMs

Results: 30x improvement in median TCP throughput About 30% improvement in RUBiS benchmark 40-50% reduction in execution time for Intel

MPI benchmark

Thank you.

For more information: http://

friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoopOr Google “vSnoop Purdue”

TCP Benchmarks cont. Testing different scenarios:

a) 10 concurrent connections b) Sender also subject to VM

scheduling c) Driver domain on a separate core

TCP Benchmarks cont. Varying CPU load for 3 consolidated VMs:

40% CPU load:

80% CPU load:

60% CPU load:

vsnoop: improving tcp throughput in virtualized environments via acknowledgement offload

vm migration

economy vm consolidation

result of vm consolidation

hpc applications

server consolidation

vm scheduling slice

multiples vms

physical hostmultiple

Documents

siemens 33kv offload isolator

wipro - mobile internet offload gateway

data offload approaches for mobile · pdf filedata offload...

autonomous nic offload

wlan traffic offload in lte white paper - rohde &...

beta systems operlog manager user's guide · 2016. 10....

context-specific managed offload for mobile data...

intelligent offload - cisco

vsnoop: improving tcp throughput in virtualized environments...

int 1010 tcp offload

wlan traffic offload in lte

wi fi offload paper

mobile data offload- wi-fi offload - tec data...

understanding wi-fi offload

wifi offload cisco

why wifi offload?

data warehouse offload

business of offload v01

cellular offload[1]

wifi offload-through-eap-authentication