Self-Driving Networks - Platform Lab ·...
Balaji Prabhakar and Mendel Rosenblum Departments of EE and CS, Stanford
Self-Driving Networks
1960s to 2000: Packet-switching developed
• Protocols for computer-to-computer communication
• Algorithms for routing, switching, load balancing, congestion control, …
2005 to now: Software-defined Networking (SDN)
• Programmability and flexibility
Now onward: Self-Driving Networks
• Autonomy: Network senses and monitors itself; programs and controls itself
• Interactivity: Infra should be transparent and fun to interact with, especially for 3rd-party users
Background: Self-Driving Networks
What does “self-driving” mean?
[Figure labels: DCN, Workload]
Given a DCN and a workload, or jobs that arrive over time
• Allocate resources (network, CPU, memory, storage), so that
• Jobs are processed quickly (small job completion time), and
• Resource utilization is efficient
Key functions: Sense, Infer, Learn, Control
Sense DCN from the Edge
[Figure labels: TX Timestamp, RX Timestamp]
NIC-based telemetry
• More scalable: edge observations are a sufficient statistic
  - No per-queue counters
  - No per-packet measurement
  - No extra network traffic due to sensed data
• Doesn’t need forklift upgrade of network, or same vendor of switches (data formats)
• Just need NICs which are capable of time-stamping probes/packets
  - Pretty standard for most 10, 40, 100G NICs
Of course, if switches can give extra data, that’ll help
• E.g., path followed by probe/packet
NIC-based probing also useful for fine-grained clock synchronization
Sensing at the Edge
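To make the edge-sensing idea concrete, here is a minimal sketch of how TX/RX timestamps could be turned into per-probe delay observations. It assumes the sender and receiver clocks are already synchronized, and it treats the smallest delay seen between a pair of hosts as the no-queueing baseline. The `Probe` class and its field names are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    src: str    # sending host (hypothetical field name)
    dst: str    # receiving host (hypothetical field name)
    tx_ns: int  # NIC transmit timestamp, nanoseconds
    rx_ns: int  # NIC receive timestamp, nanoseconds


def one_way_delays(probes):
    """One-way delay of each probe; only meaningful if clocks are synchronized."""
    return [p.rx_ns - p.tx_ns for p in probes]


def queueing_delays(probes):
    """Delay in excess of the minimum seen for the same (src, dst) pair.

    Assumption: the minimum observed delay approximates propagation plus
    processing time, so the excess is attributed to queueing.
    """
    baseline = {}
    for p in probes:
        key = (p.src, p.dst)
        delay = p.rx_ns - p.tx_ns
        baseline[key] = min(baseline.get(key, delay), delay)
    return [(p.rx_ns - p.tx_ns) - baseline[(p.src, p.dst)] for p in probes]
```

Nothing here requires switch counters or extra traffic beyond the probes themselves, which is the scalability point made on the slide.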
Infer
Network bottlenecks
• Detailed buffer depths at switches
• Link utilizations
• Queue and link compositions
  - Whose packets are in the queues/links?
  - Which applications, tenants’ traffic, etc.?
• Link failures, brownouts
Application performance
• Timeouts
• Predict stragglers
• Comparisons/Regressions
  - Why did the latest software patch slow things down?
Challenges: Noisy data, sparse observations, speed, scalability
Current focus: Sense at the edge, infer/reconstruct fine network details
• Mesh of probes for sensing
• Machine learning and neural nets for real-time inference
  - Function Approximation: Implementing network algos with NNs → much faster
  - Pattern Recognition: Learning network load from patterns (packet traces, CPU/memory utilization patterns)
Built system with following modules:
• Software-based clock synchronization system: ~10s of nanoseconds accuracy
• Network reconstruction: accurately infer queues, link utilization, etc.
• Query and rendering engines: interactive visualization and diagnostic tool
Future work will focus on learning and control
• Learning best responses in real-time using Reinforcement Learning
• Integration with network controllers for real-time autonomous control
Roadmap and progress
Google testbed
• 40G links, 40 racks, 5-stage Clos switching
  - Collaboration with Ashish Naik and Amin Vahdat at Google
Stanford testbed
• 1G links, 128-server, 2-stage Clos switching
Platforms and Testbeds
[Figure label: Cisco 2960]
Network Evolution Reconstruction from Edge-based Timestamps
Sensing DCN from the edge
[Figure labels: TX Timestamp, RX Timestamp]
Algorithm
• Input:
  – 5-tuple flow IDs, for inferring network paths
  – Rx, Tx timestamps of probes
• Basic equations
  – For each packet:
  – Combine all packets:
  – Solve for queue sizes: Use Lasso algorithm
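The slide's equations did not survive in this transcript, so the following is only a sketch of one plausible formulation, not the authors' exact model. It assumes each probe's queueing delay is the sum of the delays at the queues on its path, which stacks into a linear system d = A q, where A is a 0/1 path matrix (row = probe, column = queue). Lasso then minimizes 0.5*||d - A q||^2 + lam*||q||_1, which favors solutions where most queues are empty. The function names, the penalty `lam`, and the toy topology are all hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam (the proximal step for the L1 penalty)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_cd(A, d, lam, iters=200):
    """Minimize 0.5*||d - A q||^2 + lam*||q||_1 by cyclic coordinate descent."""
    n, m = len(A), len(A[0])
    q = [0.0] * m
    resid = list(d)  # residual d - A q, with q starting at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(m)]
    for _ in range(iters):
        for j in range(m):
            if col_sq[j] == 0:
                continue  # no probe crosses this queue, so it is unobservable
            # Correlation of column j with the residual that excludes q[j].
            rho = sum(A[i][j] * (resid[i] + A[i][j] * q[j]) for i in range(n))
            new_qj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_qj - q[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= A[i][j] * delta
                q[j] = new_qj
    return q


# Toy example: 4 probes over 3 queues; only queue 1 holds a backlog (500 ns).
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1],
     [1, 0, 0]]
d = [500.0, 500.0, 0.0, 0.0]  # per-probe queueing delay in ns
q_hat = lasso_cd(A, d, lam=1.0)
```

In this toy case the estimate puts essentially all the delay on queue 1, slightly shrunk by the penalty, and leaves the other two queues at zero. A production solver would also enforce that queue sizes are non-negative; that constraint is omitted here to keep the sketch a plain Lasso.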
Estimates well
Clock Synchronization
A classical hard problem: affects performance of distributed systems
• Can boost performance of existing solutions
  - e.g., in databases by maintaining causality and external consistency
• Or enable new ones
  - e.g., fine-grained resource and task scheduling, real-time distributed control, etc.
• Has become more severe as clock precision and event frequency have gone up: milliseconds → microseconds → nanoseconds
Current solutions
• Expensive: PTP and PPS require compatible hardware
• Uneven performance: Many PTP-compatible switches perform poorly under load
Background
Clock synchronization
Pairwise clock drifts
• Typically 5-10 microseconds/sec
• Can be as high as 30 microseconds/sec
Clock frequency varies with temperature
• Ideal temperature ~ 25-28 deg centigrade
• Resonance frequency changes quadratically with temperature: 10 °C change ~ 3.35 usec/s
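The drift numbers above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a purely quadratic model, drift = k * (delta T)^2, and back-computes k = 0.0335 usec/s per °C² from the slide's "10 °C change ~ 3.35 usec/s" figure; that coefficient and the function names are illustrative assumptions, not values stated in the talk.

```python
def temperature_drift_us_per_s(delta_c, k_us_per_s_per_c2=0.0335):
    """Drift (usec/s) for a temperature offset of delta_c degrees C from ideal.

    Assumes a purely quadratic dependence; k is back-computed from the
    slide's 10 C -> 3.35 usec/s data point.
    """
    return k_us_per_s_per_c2 * delta_c ** 2


def offset_after_ns(drift_us_per_s, interval_s):
    """Clock offset (ns) accumulated over interval_s at a constant drift."""
    return drift_us_per_s * 1e3 * interval_s


def max_resync_interval_s(drift_us_per_s, budget_ns):
    """Longest gap between corrections that keeps the offset within budget_ns."""
    return budget_ns / (drift_us_per_s * 1e3)
```

For example, at the worst-case pairwise drift of 30 usec/s, `max_resync_interval_s(30, 30)` gives roughly 0.001 s: holding two clocks within 30 ns would need a correction about every millisecond unless the drift itself is estimated and compensated.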
A software clock synchronization system
• Probe-based, only needs timestamping-capable NICs
  → Same probe mesh needed for reconstruction
Synchronization accuracy of 10s of nanoseconds
• Accuracy verified against NetFPGAs
Our solution
[Figure: synchronization error (ns), mean and 99th percentile, versus network load (%) from 0 to 80]
Synchronization error stays under 40 ns at 80% load
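The slides do not describe how probe timestamps are converted into a clock offset, so the sketch below uses the textbook two-way (NTP-style) estimate purely as an illustration; it should not be read as the system's actual algorithm. It assumes forward and reverse path delays are equal, and it keeps the sample with the smallest round-trip time, since that sample is least likely to have been distorted by queueing. All names are hypothetical.

```python
def offset_and_rtt(t1, t2, t3, t4):
    """Two-way offset estimate from one probe exchange.

    t1: request sent      (clock A)
    t2: request received  (clock B)
    t3: reply sent        (clock B)
    t4: reply received    (clock A)
    Returns (estimated offset of B relative to A, round-trip time),
    assuming symmetric path delays.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    rtt = (t4 - t1) - (t3 - t2)
    return offset, rtt


def best_offset(samples):
    """Offset from the exchange with the smallest round-trip time."""
    return min((offset_and_rtt(*s) for s in samples), key=lambda r: r[1])[0]


# Clock B is 250 ns ahead of clock A; the base one-way delay is 1000 ns.
samples = [
    (0, 1250, 1300, 2050),           # no queueing: estimate is exact
    (10000, 11850, 11900, 12650),    # 600 ns of queueing on the forward path
]
```

The second exchange alone would report an offset of 550 ns, because the extra forward-path queueing breaks the symmetry assumption. Selecting the minimum-RTT exchange recovers 250 ns, which hints at why naive probing degrades under load and why holding error under 40 ns at 80% load is a notable result.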
Self-Driving Networks is a multi-year project
• Current system has clock sync and network reconstruction
• Ready for wider deployment: many use cases beyond telemetry
  - Regressions/comparisons, forensics, planning, purchasing, policy setting, …
• Initial work on learning; in future, we’ll be doing more learning and control
We welcome your feedback and collaborations
Summary