Source: wisdom.cs.wisc.edu/workshops/spring-14/talks/wu.pdf

Virtual Network Diagnosis as a Service

Wenfei Wu (UW-Madison)

Guohui Wang (Facebook)

Aditya Akella (UW-Madison)

Anees Shaikh (Google)

Virtual Networks in the Cloud

[Figure: virtual networks in the cloud. Tenant 1's network spans Subnet 1 and Subnet 2, with VMs attached to virtual switches (VS), a middlebox (MB), routers (R), and gateways (GW); Tenant 2 has its own VMs and virtual switch. Application traffic flows across this virtual topology.]

• Data center infrastructure

• Virtual networks

• Virtual-to-Physical Mapping

• Network services

• Sharing, Isolation

Virtual Network Problems

• Multiple layers may have various problems
  o Connectivity/performance issues in applications

• Isolation and abstraction prevent tenants from diagnosing their virtual networks

[Figure: application traffic from one VM to another traverses a middlebox (MB) and the underlying physical switches (SW).]

Existing Solutions

• Infrastructure-layer tools may expose the physical network to tenants
  o sFlow, NetFlow
  o OFRewind, NDB, etc.

• Tools in VMs are difficult to deploy in some virtual components (e.g., middleboxes)
  o Tcpdump, Xtrace, SNAP

VND Proposal

• The cloud provider should offer a virtual network diagnostic service (VND) to the tenants

1. Diagnostic request

2. Trace collection

3. Trace parse

4. Data query

VND Challenges

• Preserve isolation and abstractions

• Low overhead

• Scalability

Contents

• Motivation

• VND Design & Implementation

• Evaluation

• Conclusions

VND Architecture

[Figure: VND architecture. The Cloud Controller runs a Control Server hosting the Policy Manager and the Analysis Manager. The tenant submits collection and parse configurations and issues queries; the Policy Manager combines the collect config with the topology into Collection Policies and distributes collect/parse configs to Table Servers. Each Table Server runs a Trace Collector (raw data), a Trace Parser (flow trace into a trace table), and a Query Executor; the Analysis Manager coordinates query execution across the trace tables.]

Data Collection (1)

• The tenant submits a Data Collection Configuration

[Figure: example tenant topology with a Front End, a Load Balancer, and Servers 1-3. Capture points are annotated on the Load Balancer:
  Cap: Input srcIP=10.0.0.6 dstIP=10.0.0.8 tcp dstPort=80 dualDirection
  Cap: output ...]

Configuration format:

Virtual Appliance Link: l1
  Capture Point node1
    Flow Pattern field = value, ...
  Capture Point node2
  ...
Virtual Appliance Node: n1
  Capture Point input, [output]
  ...
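For illustration, the load-balancer capture above can also be written out as a concrete data structure. The dictionary below is a minimal sketch of how a tenant-side client might assemble this Data Collection Configuration before submitting it; the key names simply mirror the format on this slide and are assumptions, not VND's actual wire format.

# Minimal sketch (not VND's actual API or wire format): the load-balancer
# capture above, expressed as a structure a tenant-side client might submit.
collection_config = {
    "virtual_appliance_node": "load_balancer",            # "n1" in the format above
    "capture_points": [
        {"point": "input",
         "flow_pattern": {"srcIP": "10.0.0.6", "dstIP": "10.0.0.8",
                          "proto": "tcp", "dstPort": 80,
                          "dualDirection": True}},
        {"point": "output", "flow_pattern": {}},           # "Cap: output ..." above
    ],
}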

Data Collection (2)

• Policy Manager generates a Data Collection Policy

Example Data Collection Policy:

Trace ID tr_id1
Physical Cap Point vs1
Collector c1 at h1
Pattern field = value, ... (e.g., srcIP=10.0.0.6, ...)
Path vs1, ..., h1, c1

[Figure: inside the hypervisor, the vSwitch (vs1) mirrors the Load Balancer's packets that match the pattern to the collector (c1) as trace tr_id1, alongside normal forwarding through the NIC.]
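Conceptually, the Policy Manager resolves the tenant's virtual capture point through the provider's virtual-to-physical mapping and emits a policy record like the one above. The sketch below illustrates that step under assumed names (the V2P table, make_collection_policy); it is not VND's actual code.

import itertools

# Hypothetical virtual-to-physical mapping held by the provider:
# virtual appliance -> physical capture vSwitch, host, and collector.
V2P = {
    "load_balancer": {"vswitch": "vs1", "host": "h1", "collector": "c1"},
}

_trace_ids = itertools.count(1)

def make_collection_policy(appliance, pattern):
    """Sketch of turning a tenant capture point into a Data Collection Policy
    with the fields shown above."""
    loc = V2P[appliance]
    return {
        "trace_id": f"tr_id{next(_trace_ids)}",
        "physical_cap_point": loc["vswitch"],
        "collector": f'{loc["collector"]} at {loc["host"]}',
        "pattern": pattern,
        "path": [loc["vswitch"], "...", loc["host"], loc["collector"]],
    }

policy = make_collection_policy(
    "load_balancer",
    {"srcIP": "10.0.0.6", "dstIP": "10.0.0.8", "proto": "tcp", "dstPort": 80})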

Data Parse

• The tenant submits a Data Parse Configuration

Data Parse Configuration format:

Table ID tab_id1
Filter exp
Fields field_list

exp = exp and exp | exp or exp | not exp | (exp) | prim
prim = field in value_set
field_list = field (as name) (, field (as name))*

Example:

Trace ID all
Filter: ip.proto = tcp or ip.proto = udp
Fields: timestamp as ts, ip.src as src_ip, ip.dst as dst_ip, ip.proto as proto, tcp.src as src_port, tcp.dst as dst_port, udp.src as src_port, udp.dst as dst_port

Resulting trace table:

ts   src_ip   dst_ip   proto   src_port   dst_port
...  ...      ...      ...     ...        ...
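As a rough illustration of what a Trace Parser does with this configuration, the sketch below reads a raw packet trace with Scapy, applies the example filter (TCP or UDP over IP), and emits rows with the renamed fields. The pcap-file input and the choice of Scapy are assumptions for illustration, not VND's actual parser implementation.

from scapy.all import IP, TCP, UDP, rdpcap  # assumes Scapy is installed

def parse_trace(pcap_file):
    """Apply the example parse config: keep TCP/UDP packets over IP and
    extract (ts, src_ip, dst_ip, proto, src_port, dst_port) rows."""
    rows = []
    for pkt in rdpcap(pcap_file):
        if IP not in pkt:
            continue
        if TCP in pkt:
            l4 = pkt[TCP]
        elif UDP in pkt:
            l4 = pkt[UDP]
        else:
            continue                         # Filter: ip.proto = tcp or udp
        rows.append({
            "ts": float(pkt.time),           # timestamp as ts
            "src_ip": pkt[IP].src,           # ip.src as src_ip
            "dst_ip": pkt[IP].dst,           # ip.dst as dst_ip
            "proto": pkt[IP].proto,          # ip.proto as proto
            "src_port": l4.sport,            # tcp/udp.src as src_port
            "dst_port": l4.dport,            # tcp/udp.dst as dst_port
        })
    return rows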

Data Analysis

[Figure: the Analysis Manager maintains a diagnostic schema, consisting of a Collection View (Trace_ID, Vnet_App, Cap_Loc, Pattern) and a Parse View keyed by Trace_ID, over the trace tables held by distributed Query Executors, and exposes it through a SQL interface.]

• Trace Tables form a distributed database

• Tenants build diagnostic applications that issue operations against this schema
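Because the trace tables sit behind a SQL interface (the prototype uses MySQL), the table produced by the example parse configuration can be pictured as ordinary DDL. The statement below is only a sketch: the table name and column types are assumptions, not VND's actual schema.

# Sketch of a trace table as it might look behind the SQL interface.
# Table name and column types are assumptions (the prototype uses MySQL).
CREATE_TRACE_TABLE = """
CREATE TABLE trace_tab_id1 (
    ts        DOUBLE,            -- packet timestamp
    src_ip    VARCHAR(45),
    dst_ip    VARCHAR(45),
    proto     TINYINT UNSIGNED,  -- ip.proto
    src_port  INT UNSIGNED,
    dst_port  INT UNSIGNED
);
"""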

Data Analysis Examples

• RTT
  o Correlate each data packet with its ACK packet
  o Compute the average time difference (a query sketch follows the example table below)

• Throughput, loss, delay, statistics, etc.

ts   src_IP   dst_IP   seq    ack    length
t1   IP1      IP2      1000   1      1400
t2   IP2      IP1      0      2400   0
t3   IP1      IP2      2400   1      1400
t4   IP2      IP1      0      3800   0
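With a trace table parsed to include seq, ack, and length (as above), the RTT correlation is a self-join in SQL: match each data packet with the reverse-direction packet whose ack equals seq + length, then average the timestamp difference. The query below is only a sketch against an assumed table name (rtt_trace) and ignores retransmissions and delayed or cumulative ACKs; throughput, loss, and delay statistics follow from similar aggregate queries.

# Sketch: average RTT over a parsed trace table. The table name is an
# assumption; retransmissions and delayed/cumulative ACKs are ignored.
AVG_RTT_QUERY = """
SELECT AVG(a.ts - d.ts) AS avg_rtt
FROM rtt_trace AS d               -- data packets
JOIN rtt_trace AS a               -- their ACKs (reverse direction)
  ON a.src_ip = d.dst_ip
 AND a.dst_ip = d.src_ip
 AND a.ack = d.seq + d.length     -- e.g. t2 acks t1: 2400 = 1000 + 1400
WHERE d.length > 0                -- d carries payload
  AND a.length = 0;               -- a is a pure ACK
"""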

Contents

• Motivation

• VND Design & Implementation

• Evaluation

• Conclusions

Trace Collection (1)

• Network Overhead
  o Each minute, we capture one additional VM pair's traffic
  o The duplicated traffic increases as we capture more traffic
  o The original application traffic is not impacted

[Setup: two servers (Server 1 and Server 2), each hosting 8 VMs.]

Trace Collection (2)

• Memory Overhead
  o VMs perform memory copy and data transfer simultaneously
  o We measure memory and network throughput
  o Each 1 Gbps of duplicated traffic adds 59 MB/s of memory overhead

Data Query Overhead

• We use RTT monitoring as an example
  o 200 Mbps traffic
  o Calculate the average RTT periodically

• Network overhead is negligible

• Execution time scales linearly with the data size

• VND can process 2-3 Gbps of traffic in real time

Scalability

• The Control Server is a simple web server and can be scaled up easily

• Data collection is distributed and kept local to avoid becoming a bottleneck

• Query execution is also distributed
  o Real-time analysis: 2-3 Gbps (RTT example, MySQL implementation)
  o Offline analysis: higher volumes

Conclusions

• The cloud provider should offer a virtual network diagnostic service to the tenants

• We design VND
  o Architecture, interfaces, and operations
  o Addresses several important challenges

• Our experiments and simulation demonstrate the feasibility of VND

END

Q&A

Please contact WISDOM:

http://wisdom.cs.wisc.edu

Functional Validation (1)

• Middlebox Bottleneck Detection
  o VND uses throughput and RTT measurements to locate the abnormality

[Figure: middlebox chain containing an RE middlebox and an IDS]

Functional Validation (2)

• Middlebox scaling

[Figure: middlebox scaling experiment on the RE and IDS chain]

VND Architecture

[Backup slide: repeat of the VND architecture diagram. The Cloud Controller's Control Server hosts the Policy Manager and Analysis Manager; distributed Table Servers each run a Trace Collector, Trace Parser, Trace Table, and Query Executor.]

Optimizations

• Local Table Server placement
  o Place the collector locally with the capture points
  o Avoid trace-collection traffic traversing the network
  o Move data only when queries need it

• Avoid interference with existing forwarding rules by using the multi-table feature of OVS (see the sketch below)
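To illustrate the multi-table idea, the capture (tap) rule can sit in its own OVS table, mirror matching packets to the collector's port, and then resubmit to the table that holds the pre-existing forwarding rules, so the two rule sets never have to be merged. In the sketch below, the bridge name, mirror port number, and table IDs are assumptions, not VND's actual configuration.

import subprocess

BRIDGE = "br-int"    # assumed bridge name
MIRROR_PORT = 10     # assumed OpenFlow port of the trace collector
TAP_TABLE = 0        # table holding VND's capture (tap) rules
FWD_TABLE = 1        # table holding the pre-existing forwarding rules

# Mirror the captured flow to the collector, then continue normal processing.
tap_rule = (
    f"table={TAP_TABLE},priority=100,"
    "tcp,nw_src=10.0.0.6,nw_dst=10.0.0.8,tp_dst=80,"
    f"actions=output:{MIRROR_PORT},resubmit(,{FWD_TABLE})"
)
# Everything else just falls through to the forwarding table.
default_rule = f"table={TAP_TABLE},priority=0,actions=resubmit(,{FWD_TABLE})"

for rule in (tap_rule, default_rule):
    subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, rule], check=True)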

Scalability (2)

• Data Query simulation
  o A data center with 10,000 servers, each with a 10 Gbps NIC
  o Virtual network size in [2, 20]
  o Query executors can process 3 Gbps of traffic in real time
  o Total link utilization in [0.1, 0.9]

• Results
  o 30% of total link capacity can be queried in real time
