OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy
Qun Huang, Haifeng Sun, Patrick P. C. Lee
Wei Bai, Feng Zhu, Yungang Bao
Flow-level Network Telemetry
Entities: Hardware Switches, End-hosts, Controller
Packet: (flowkey, packet values)
Flow Statistics: one row per flow (Flow 1, Flow 2, Flow 3, ...), with values such as Pkt count
Goal
Flow statistics with both:
• Full Accuracy
• Resource Efficiency
Full Accuracy
1. Always-on: all time intervals
2. Network-wide: all devices
3. Complete: all flows
4. Correct: zero per-flow error
Resource Efficiency
Hardware Switches: fast ASIC, limited memory, limited programmability
End-Hosts: sufficient memory, programmability, slow CPU, limited visibility
Controller: sufficient CPU and memory, global visibility, limited bandwidth
Existing Approaches: Trade-offs
[Figure: existing approaches placed on two axes, Full Accuracy and Resource Efficiency]
• SNMP: coarse-grained
• Hash tables: high overheads
• Event matching, top-k counting, sampling: only partial flows
• Sketch: approximate results
• Our goal: full accuracy together with resource efficiency
Root Cause
[Figure: Controller and three Telemetry Operators, each with its own Resource Management, Flowkeys, and Values]
• Operators are executed individually with limited collaboration
• Operators have to be heavy and sacrifice accuracy
OmniMon
Re-architect network telemetry by distributed design
Question 1: How to coordinate different entities for network telemetry?
Question 2: How to provide reliable guarantees for the coordination?
Split-and-Merge Architecture
Network telemetry is split into four partial operations: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis
• Break heavy operators
• Network-wide coordination
Flowkey Tracking
[Figure: Controller and two Flowkeys tables]
Value Update
[Figure: Controller, a Packet, and two devices each holding Slots (Ingress), Flowkeys, and Slots (Egress)]
Mapping at End-Host
Different strategies of slot mapping in end-hosts and switches
Mapping at End-Host (Egress)
[Figure: Slots (Ingress), Flowkeys, and Slots (Egress)]
Mapping at End-Host (Ingress)
1. Embed Location: HostIndex carried in the Packet
2. Locate Slot
No Flowkeys
Mapping at Switch
1. Global Coordination
2. Embed Index
3. Extract & Update
[Figure: packets of Flow 1 and Flow 2 each carry a SwitchIndex and a HostIndex]
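The three steps above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the header fields `host_index` and `switch_index`, and the round-robin rule the controller uses to hand out switch indices are hypothetical and are not taken from the OmniMon implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    flowkey: str        # only the source end-host looks at this in the sketch
    host_index: int     # hypothetical header field: dedicated slot at the end-host
    switch_index: int   # hypothetical header field: shared slot at the switch


class Controller:
    """Step 1 (global coordination): hands out switch indices to flows."""

    def __init__(self, num_switch_slots):
        self.num_switch_slots = num_switch_slots
        self.assigned = {}

    def assign_switch_index(self, flowkey):
        # Assumed policy: round-robin over the available switch slots.
        if flowkey not in self.assigned:
            self.assigned[flowkey] = len(self.assigned) % self.num_switch_slots
        return self.assigned[flowkey]


@dataclass
class SourceHost:
    table: dict = field(default_factory=dict)   # flowkey -> (host_index, switch_index)
    egress: list = field(default_factory=list)  # one dedicated egress slot per flow

    def send(self, flowkey, assign_switch_index):
        """Step 2 (embed index): track the flowkey and stamp both indices."""
        if flowkey not in self.table:
            self.egress.append(0)
            self.table[flowkey] = (len(self.egress) - 1, assign_switch_index(flowkey))
        host_index, switch_index = self.table[flowkey]
        self.egress[host_index] += 1
        return Packet(flowkey, host_index, switch_index)


@dataclass
class Switch:
    slots: list

    def receive(self, pkt):
        """Step 3 (extract and update): no flowkey lookup, only an array update."""
        self.slots[pkt.switch_index] += 1


controller = Controller(num_switch_slots=2)
host = SourceHost()
switch = Switch(slots=[0, 0])
for flowkey, packets in {"flow1": 3, "flow2": 2, "flow3": 4}.items():
    for _ in range(packets):
        switch.receive(host.send(flowkey, controller.assign_switch_index))

print(host.egress)   # [3, 2, 4]
print(switch.slots)  # [7, 2]
```

In this sketch flow1 and flow3 share switch slot 0, so the switch alone cannot tell them apart, while the end-host keeps exact per-flow counts.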
Collective Analysis
• Collect results from end-hosts and switches to form final flow statistics (Flow 1, Flow 2, Flow 3, …)
• Exploit end-host information to decompose switch slots
Collective Analysis (Detail)
Switch: shared slot value 24
Source End-Host of Flow 1: Flowkeys, Slots (Egress) = 13, Switch Index (zero-error flowkey and values)
Source End-Host of Flow 2: Flowkeys, Slots (Egress) = 11, Switch Index (zero-error flowkey and values)
Decomposed switch slot: Flow 1: 13, Flow 2: 11
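The decomposition in this example can be sketched in Python. The sketch assumes no packet loss and that both flows traverse the switch within the same interval; the function name and the data layout are hypothetical.

```python
def decompose_switch_slots(host_records, switch_slots):
    """Split shared switch slots into per-flow values using end-host records.

    host_records: flowkey -> (switch_index, egress_count) from source end-hosts.
    switch_slots: list of shared slot values recorded by one switch.
    Raises ValueError when a slot does not match the sum of its flows.
    """
    expected = [0] * len(switch_slots)
    for switch_index, count in host_records.values():
        expected[switch_index] += count
    for idx, (exp, got) in enumerate(zip(expected, switch_slots)):
        if exp != got:
            raise ValueError(f"slot {idx}: expected {exp}, switch recorded {got}")
    # Without loss, each flow's value at the switch equals its end-host value.
    return {flowkey: count for flowkey, (_, count) in host_records.items()}


host_records = {"Flow 1": (0, 13), "Flow 2": (0, 11)}
per_flow = decompose_switch_slots(host_records, [24])
print(per_flow)  # {'Flow 1': 13, 'Flow 2': 11}
```

The consistency check matters: if the switch slot held anything other than 24, the end-host values alone would no longer explain it.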
Putting It Together
Switches: Shared Slots
• Low Memory Usage
• Simple Updates
End-Hosts: Hash Table
• Exact Per-flow Tracking
• Affordable Operations
Controller: Global Info.
• Switch Index Mapping
• Collective Analysis
OmniMon
Re-architect network telemetry by distributed design
Question 1: How to coordinate different entities for network telemetry?
Question 2: How to provide reliable guarantees for the coordination?
Unreliable Events
Lack of global clock
• Devices reside in different intervals
Packet loss
• Flow values are missing in some devices
Impact of Unreliable Events
End-host records: Flow 1: 13, Flow 2: 11 (Expect: 24 at the switch)
• Long Delay: Recorded by Other Intervals
• Packet Loss: Not Recorded
Switch records 20 instead of 24, so its per-flow values are unknown: Flow 1: ???, Flow 2: ???
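The mismatch on this slide (24 expected, 20 recorded) can be framed as a small set of linear constraints on per-flow losses. The sketch below is only an illustration: it assumes a single switch, that all packets fall in the same interval, and that destination end-hosts report how many packets of each flow arrived. The destination counts are invented for the example, and the brute-force enumeration stands in for whatever solver the paper actually uses.

```python
from itertools import product


def infer_losses(sent, received, switch_slot):
    """Enumerate per-flow losses before the switch that fit the observations.

    Unknown b[f] = packets of flow f lost before reaching the switch, with
      0 <= b[f] <= sent[f] - received[f]        (bounded by the flow's total loss)
      sum over f of (sent[f] - b[f]) == switch_slot   (shared-slot equation)
    Returns every assignment of b that satisfies both constraints.
    """
    flows = list(sent)
    total_loss = [sent[f] - received[f] for f in flows]
    solutions = []
    for b in product(*(range(t + 1) for t in total_loss)):
        if sum(sent[f] - x for f, x in zip(flows, b)) == switch_slot:
            solutions.append(dict(zip(flows, b)))
    return solutions


sent = {"Flow 1": 13, "Flow 2": 11}

# Hypothetical destination counts: only Flow 1 lost packets.
unique = infer_losses(sent, {"Flow 1": 9, "Flow 2": 11}, switch_slot=20)
print(unique)     # [{'Flow 1': 4, 'Flow 2': 0}]

# Hypothetical destination counts: both flows lost packets.
ambiguous = infer_losses(sent, {"Flow 1": 10, "Flow 2": 9}, switch_slot=20)
print(ambiguous)  # [{'Flow 1': 2, 'Flow 2': 2}, {'Flow 1': 3, 'Flow 2': 1}]
```

The second call shows why a single shared slot is not always enough: several loss patterns can explain the same slot value, so extra constraints are needed to pin down one answer.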
Reliability Guarantees
Lack of global clock: Hybrid consistency model
• Network-wide synchronization
Guarantees
• Each packet is included in the same interval by all devices
• All end-hosts reside in the same time interval most of the time
Packet loss: Loss inference
• Linear system with DCN-specific optimizations
• Flow mapping algorithm
Guarantees
• Per-switch, per-flow loss inference in common cases
More details in the paper
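One way to picture the first consistency guarantee is to let every packet carry the interval number chosen by its sender, so that every device counts the packet in that interval regardless of its own clock. The sketch below assumes such an embedding. The field name `epoch`, the class name, and the rule that a device adopts a newer epoch seen in traffic are illustrative assumptions, not details stated in the slides.

```python
from collections import defaultdict


class Device:
    def __init__(self, name, local_epoch=0):
        self.name = name
        self.local_epoch = local_epoch
        self.counts = defaultdict(int)  # epoch -> packets counted in that epoch

    def send(self):
        """Source side: stamp the packet with the sender's current epoch."""
        pkt = {"epoch": self.local_epoch}
        self.counts[pkt["epoch"]] += 1
        return pkt

    def receive(self, pkt):
        """Count the packet in the epoch it carries, not in the local one."""
        self.counts[pkt["epoch"]] += 1
        # Assumed catch-up rule: move forward if traffic shows a newer epoch.
        self.local_epoch = max(self.local_epoch, pkt["epoch"])


src = Device("source", local_epoch=5)
sw = Device("switch", local_epoch=4)        # clock lags behind the source
dst = Device("destination", local_epoch=6)  # clock runs ahead of the source

for _ in range(3):
    pkt = src.send()
    sw.receive(pkt)
    dst.receive(pkt)

print(dict(src.counts), dict(sw.counts), dict(dst.counts))
# {5: 3} {5: 3} {5: 3}
```

All three devices attribute the packets to epoch 5 even though their local clocks disagree, which is the property the collective analysis relies on.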
Implementation
Testbed
• End-hosts: DPDK
• Switch: P4
• Controller: C++
Simulator: 8-ary fat-tree
Host Overheads
Less than 10% overhead when adding telemetry functionality to PktGen
Hash lookups dominate the overhead
Switch Overheads
Fewer resources than sketch-based techniques
Only OmniMon achieves zero errors
OmniMon
• OS: OmniMon that monitors only packet count
• OF: OmniMon that monitors 9 statistics
Sketch techniques (each sketch only monitors packet count)
• FR: FlowRadar (NSDI 16)
• UM: UnivMon (SIGCOMM 16)
• ES: Elastic Sketch (SIGCOMM 18)
• SL: SketchLearn (SIGCOMM 18)
More Results
Controller overheads
Synchronization efficiency
Accountability
Scalability
Use case: anomaly detection
Use case: network failure diagnosis
Use case: load balance evaluation
Conclusion
OmniMon architecture: split-and-merge design
• Four partial operations
• Network-wide coordination
Consistency guarantee
• Network-wide synchronization with hybrid consistency model
Accountability guarantee
• Packet loss inference with linear systems
• Flow mapping algorithm
Prototype: DPDK + P4
Results: compare with 11 state-of-the-art solutions in various aspects
Source Code Available: https://github.com/N2-Sys/Omnimon