Source: wisdom.cs.wisc.edu/workshops/spring-14/talks/wu.pdf
TRANSCRIPT
Virtual Network Diagnosis as a Service
Wenfei Wu (UW-Madison)
Guohui Wang (Facebook)
Aditya Akella (UW-Madison)
Anees Shaikh (Google)
Virtual Networks in the Cloud
[Figure: two tenants' virtual networks: VMs in subnets carry application traffic through virtual switches (VS), middleboxes (MB), routers (R), and gateways (GW) over the shared data center infrastructure]
• Data center infrastructure
• Virtual networks
• Virtual-to-physical mapping
• Network services
• Sharing, isolation
Virtual Network Problems
• Multiple layers may each exhibit problems
  o These surface as connectivity/performance issues in applications
• Isolation and abstraction prevent tenants from diagnosing their virtual networks
[Figure: application traffic flows from VM through a middlebox (MB) to another VM across physical switches (SW)]
Existing Solutions
• Infrastructure-layer tools may expose the physical network to tenants
  o sFlow, NetFlow
  o OFRewind, NDB, etc.
• Tools that run inside VMs are difficult to deploy in some virtual components (e.g., middleboxes)
  o Tcpdump, Xtrace, SNAP
VND Proposal
• The cloud provider should offer a virtual network diagnostic service (VND) to the tenants
1. Diagnostic request
2. Trace collection
3. Trace parse
4. Data query
VND Challenges
• Preserve isolation and abstractions
• Low overhead
• Scalability
Contents
• Motivation
• VND Design & Implementation
• Evaluation
• Conclusions
VND Architecture
[Figure: VND architecture. The tenant submits a Collect Config and a Parse Config to the Control Server. The Policy Manager combines the Collect Config with topology information from the Cloud Controller to produce a Collection Policy. On each Table Server, a Trace Collector captures raw data as a flow trace, a Trace Parser applies the Parse Config to build Trace Tables, and a Query Executor carries out the Analysis Manager's query execution over those tables.]
Data Collection (1)
• The tenant submits a Data Collection Configuration
[Figure: example application: a front end sends requests through a load balancer to Servers 1-3]

Example capture request at the load balancer:
  Cap: input  srcIP=10.0.0.6 dstIP=10.0.0.8 tcp dstPort=80 dualDirection
  Cap: output …

Configuration format:
  Virtual Appliance Link: l1
    Capture Point node1
      Flow Pattern field = value, ...
    Capture Point node2
    ...
  Virtual Appliance Node: n1
    Capture Point input, [output]
    ...
Data Collection (2)
• Policy Manager generates a Data Collection Policy
Example collection policy:
  Trace ID:           tr_id1
  Physical Cap Point: vs1
  Collector:          c1 at h1
  Pattern:            field = value, ... (e.g., srcIP=10.0.0.6, …)
  Path:               vs1, ..., h1, c1

[Figure: the virtual capture point maps to vSwitch vs1 in the hypervisor hosting the load balancer; duplicated packets travel from vs1 through the NIC to collector c1]
Data Parse
• The tenant submits a Data Parse Configuration
Parse configuration format:
  Table ID: tab_id1
  Filter:   exp
  Fields:   field_list

  exp        = exp and exp | exp or exp | not exp | (exp) | prim
  prim       = field in value_set
  field_list = field (as name) (, field (as name))*

Example:
  Trace ID: all
  Filter:   ip.proto = tcp or ip.proto = udp
  Fields:   timestamp as ts, ip.src as src_ip, ip.dst as dst_ip,
            ip.proto as proto, tcp.src as src_port, tcp.dst as dst_port,
            udp.src as src_port, udp.dst as dst_port

Resulting table:
  ts | src_IP | dst_IP | proto | src_port | dst_port
  …  | …      | …      | …     | …        | …
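Once parsed, a trace behaves like an ordinary relational table. A minimal query sketch, assuming the table name tab_id1 and the field names from the example configuration above:

  -- Count packets per flow in the parsed trace (table and field names
  -- are taken from the example parse configuration, not a fixed API).
  SELECT src_ip, dst_ip, proto, src_port, dst_port, COUNT(*) AS pkts
  FROM tab_id1
  GROUP BY src_ip, dst_ip, proto, src_port, dst_port;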
Data Analysis
[Figure: the Analysis Manager maintains a Collection View (Trace_ID, Vnet_App, Cap_Loc, Pattern, …) and a Parse View over the distributed Trace Tables; Query Executors on the Table Servers answer queries through an SQL interface]
Diagnostic Schema
• Trace Tables form a distributed database
[Figure: tenants and diagnostic applications issue operations against the diagnostic schema]
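A minimal sketch of a lookup against this schema, assuming a view named collection_view with the columns shown in the Data Analysis figure (Trace_ID, Vnet_App, Cap_Loc, Pattern); the actual view and column names in VND may differ:

  -- Find this tenant's traces captured at virtual switch vs1
  -- (view and column names are assumptions, not VND's exact API).
  SELECT Trace_ID, Pattern
  FROM collection_view
  WHERE Cap_Loc = 'vs1';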
Data Analysis Examples
• RTT
  o Correlate each data packet with its ACK packet
  o Compute the average time difference (see the query sketch after the table)
• Throughput, loss, delay, other statistics, etc.
  ts | src_IP | dst_IP | seq  | ack  | length
  t1 | IP1    | IP2    | 1000 | 1    | 1400
  t2 | IP2    | IP1    | 0    | 2400 | 0
  t3 | IP1    | IP2    | 2400 | 1    | 1400
  t4 | IP2    | IP1    | 0    | 3800 | 0
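A minimal sketch of this correlation in VND's SQL interface; the table name trace_tab and the exact matching rule (a pure ACK acknowledging seq + length) are assumptions, since the slide states only the idea:

  -- Pair each data packet with the ACK that acknowledges exactly
  -- seq + length, then average the gaps (here: t2-t1 and t4-t3).
  SELECT AVG(a.ts - d.ts) AS avg_rtt
  FROM trace_tab d
  JOIN trace_tab a
    ON  a.src_IP = d.dst_IP
    AND a.dst_IP = d.src_IP
    AND a.ack    = d.seq + d.length
  WHERE d.length > 0     -- data packets only
    AND a.length = 0;    -- pure ACKs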
Contents
• Motivation
• VND Design & Implementation
• Evaluation
• Conclusions
Trace Collection (1)
• Network overhead
  o Each minute, we capture one additional VM pair's traffic
  o The volume of duplicated traffic grows as more traffic is captured
  o The original application traffic is not impacted
[Figure: testbed: two servers hosting 8 VMs each]
Trace Collection (2)
• Memory overhead
  o VMs perform memory copies and data transfers simultaneously
  o We measure memory and network throughput
  o Each 1 Gbps of duplicated network traffic causes 59 MB/s of memory overhead (for context, 1 Gbps ≈ 125 MB/s)
Data Query Overhead
• We use RTT monitoring as an example
  o 200 Mbps of traffic
  o The average RTT is computed periodically
• Network overhead is negligible
• Execution time scales linearly with the data size
• VND can process 2-3 Gbps of traffic in real time
Scalability
• The Control Server is a simple web server and can be scaled up easily
• Data collection is performed locally on each server, so it does not become a bottleneck
• Query execution is also distributed
  o Real-time analysis: 2-3 Gbps (RTT example, MySQL implementation)
  o Offline analysis: higher volumes
Conclusions
• The cloud provider should offer a virtual network diagnostic service to the tenants
• We design VND
  o Its architecture, interfaces, and operations
  o Addressing several important challenges
• Our experiments and simulation demonstrate the feasibility of VND
END
Q&A
Please contact WISDOM:
http://wisdom.cs.wisc.edu
Functional Validation
• Middlebox bottleneck detection
  o VND uses throughput and RTT measurements to locate the abnormality
[Figure: test traffic traverses a chain of middleboxes (RE, IDS)]
Functional Validation (2)
• Middlebox scaling
[Figure: the same RE and IDS middlebox chain, used for the scaling experiment]
VND Architecture (recap)
[Figure: the same VND architecture diagram shown earlier]
Optimizations
• Local Table Server placement
  o Place the collector on the same host as the capture points
  o Avoids trace-collection traffic traversing the network
• Move data only when queries need it
• Avoid interfering with existing rules by using the multi-table feature of OVS
Scalability (2)
• Data query simulation
  o A data center with 10,000 servers, each with a 10 Gbps NIC
  o Virtual network sizes drawn from [2, 20]
  o Each query executor can process 3 Gbps of traffic in real time
  o Total link utilization drawn from [0.1, 0.9]
• Result: 30% of the total link capacity can be queried in real time