2006 © SWITCH
End-to-end Performance over Research Networks
Simon Leinen, SWITCH
Wizard gap, PERT, performance monitoring, Premium IP
End-to-end Performance Issues
• Performance seen by end users hasn't followed backbone upgrades
• “Wizard gap” (ordinary users vs. land speed record heroes)
• Issues solving multi-domain performance problems
• Issues solving multi-layer performance problems
• Lack of performance-oriented network monitoring
-> The “ends” must be included in network performance work!
• endpoints, i.e. hosts, operating systems, applications (users even)
• campus networks and their administrators
Various efforts to improve e2e performance
Internet2 “e2epi” (end-to-end performance initiative)
– Performance workshops
– Web100 kernel instrumentation and other TCP enhancements for Linux
    enable end-user tools such as NDT (e.g. ndt.switch.ch)
    auto-tuning for TCP buffers
    experimental TCP variants (Vegas, Westwood, HS-TCP, BIC, S-TCP, H-TCP...)
GN2
– PERT (Performance Enhancement and Response Team)
    “like a CERT but for performance”
    chartered to “own” performance issues (no fingerpointing)
    collect knowledge, produce documentation (to make itself obsolete)
– Premium IP and other backbone-specific enhancements
Bandwidth is not everything
Most transfers over the Internet (including the GTREN) limited by RTT
– TCP window-size limitations for “LFNs” (Long Fat Networks)
– short flows
– delay-sensitive applications (conversational A/V, RPC, games...)
-> what works well in the LAN won't always do so over the WAN
– help users tune TCP (Web100/NDT very useful here)
– provide assistance with application design and engineering
    alternatives to TCP etc.
RTT harder to improve than bandwidth
– speed-of-light issue (btw. router hop-count quickly becoming irrelevant)
– some inter-continental connections more useful than others
    e.g. TEIN link through Siberia reduces EU-China RTT by half
Other important performance indicators: availability, predictability...
-> using capacity as prime “connectivity” metric no longer justified.
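A minimal sketch of why RTT, not capacity, bounds a single TCP flow: a sender can have at most one window of data in flight per round trip, so throughput is at most window / RTT, and filling the path needs a window of about bandwidth x RTT (the bandwidth-delay product). The 150 ms RTT and 1 Gb/s capacity below are illustrative assumptions, not measurements from these slides; 65535 bytes is the largest TCP window without window scaling.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on one TCP flow's rate: one window per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    rtt = 0.150        # assumed inter-continental RTT in seconds (illustrative)
    capacity = 1e9     # assumed path capacity in bit/s (illustrative)
    window = 65535     # maximum TCP window without window scaling

    rate = window_limited_throughput_bps(window, rtt)
    needed = bdp_bytes(capacity, rtt)
    print(f"64 KB window at {rtt * 1000:.0f} ms RTT: at most {rate / 1e6:.2f} Mb/s")
    print(f"window needed to fill the path: {needed / 1e6:.2f} MB")
```

Under these assumptions the untuned flow stays in the low Mb/s range regardless of backbone capacity, which is the case buffer tuning (Web100/NDT) is meant to catch.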
Example from right here (how NOT to do it)
My traceroute [v0.71]
agathe (0.0.0.0) Wed May 24 10:24:32 2006
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 10.129.21.252 0.0% 377 5.1 8.3 2.5 181.5 15.5
2. 10.64.1.8 1.3% 377 531.6 507.7 125.1 992.6 152.5
3. 172.28.95.109 2.1% 377 544.3 506.3 98.1 1003. 157.6
4. 172.28.74.22 1.6% 377 499.9 509.9 123.5 1204. 162.7
5. 172.28.76.19 1.6% 377 479.8 512.4 117.8 1155. 160.2
6. 172.28.76.33 2.7% 377 475.0 513.0 110.3 1134. 159.7
7. 172.28.75.17 2.7% 377 421.9 515.9 135.5 1102. 158.2
8. 172.28.87.4 2.9% 376 424.8 517.4 119.1 1067. 154.8
9. 172.28.218.241 2.1% 376 583.6 522.1 113.3 1096. 159.4
10. 193.158.5.13 2.9% 376 536.9 513.6 107.3 919.3 156.1
11. zrh-e4.ZRH.CH.net.DTAG.DE 3.7% 376 556.2 526.1 106.6 1027. 154.3
12. swiix1-g2-1.switch.ch 2.9% 376 511.2 534.6 120.0 1087. 158.8
13. 130.59.36.249 2.9% 376 533.0 529.7 139.7 1053. 152.1
14. swiCS3-10GE-1-1.switch.ch 2.7% 376 527.4 525.6 111.8 1052. 148.1
15. swiNM1-G1-0-25.switch.ch 1.6% 376 529.3 528.9 125.7 1090. 150.4
16. swiLM1-V610.switch.ch 2.4% 375 510.2 526.9 136.2 1037. 153.8
17. diotima.switch.ch 1.9% 375 575.2 526.9 149.9 959.0 152.4
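To put the numbers above in perspective, here is a rough sketch using the well-known Mathis et al. approximation for steady-state TCP throughput, rate <= (MSS / RTT) * (C / sqrt(p)) with C = sqrt(3/2). This formula is not part of the original slides. The RTT (526.9 ms average) and loss (1.9%) are taken from the last hop of the trace; the 1460-byte MSS is an assumption, and loss seen by mtr probes is only a proxy for the loss a TCP flow would experience.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on steady-state TCP throughput (Mathis et al.)."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    c = math.sqrt(3.0 / 2.0)  # constant for the simple periodic-loss model
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    mss = 1460       # assumed MSS for a 1500-byte MTU path
    rtt = 0.5269     # average RTT to diotima.switch.ch from the trace, in seconds
    loss = 0.019     # loss to diotima.switch.ch from the trace

    rate = mathis_throughput_bps(mss, rtt, loss)
    print(f"estimated TCP throughput bound: {rate / 1e3:.0f} kb/s")
```

With these inputs the bound comes out at roughly 0.2 Mb/s, whatever the nominal capacity of the links involved.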
GN2 PERT
Part of SA3 (Service Activity – End-to-end Quality of Service)
– also called PACE - “Performance and Allocated Capacity for End-Users”
PERT Case Managers mostly from several NRENs
– duty CMs, rotating weekly (with videoconference briefings)
– dedicated CMs for some cases
– reachable through PTS (PERT Ticket System) or [email protected]
Subject Matter Experts (SMEs) participation
– issues of “recruiting” and involvement (on demand vs. interest-based)
PERT Knowledge Base (KB)
– currently Wiki-based - http://kb.pert.switch.ch/
– “Performance Guides” published as deliverables
GN2 PERT Ticket System (PTS)
PERT Knowledge Base (KB)
GN2 PERT Cases (closed)
DEISA TCP Throughput Reduction
– solved – due to GEANT packet reordering with heavy cross-traffic
    will partly go away with GEANT2 (some of the routers are upgraded)
DEISA-Teragrid Performance (TCP throughput)
– closed, but not solved in due time (until demo was over)
DEISA TCP Throughput issues with some sites
– found RTT dependency, GEANT->GEANT2 changes explain variations
Loss of large packets on one of the e-VLBI (-> JIVE) paths
– resolved by configuration
GN2 PERT cases (ongoing)
ITER VPN
– information-gathering phase – VPN makes traditional diagnostics hard
e-VLBI
– ongoing investigation – infrequent tests and network changes over time
EU->US routing through Japan
– ongoing, but maybe not really a case for PERT?
    or, should we have all (GTREN) BGP geeks participate as SMEs?
GN2 PERT Experience
Weaknesses
– Few, and often difficult (but interesting!) cases
    Mostly large groups: DEISA, e-VLBI (JIVE), DESY/FNAL, ITER...
    Trying to open up to larger customer base
– It's hard to close cases!
    lack of clear success indicators
– Friction can be further reduced
    weekly Case Manager handover, PTS, SME involvements
Strengths
– Brings users (researchers) closer to NOCs
– Mutual learning experience
    Bodes well for PERT Knowledge Base
    Provides vital input on measurement infrastructure requirements
– Inspires PERT activities in NRENs
SWITCH PERT Example: Opera oberta
Opera oberta
– high-quality multicast transmissions of opera from Barcelona and Madrid
– mostly Spanish participants, but a few in FR, MX, and now CH
– currently 9 Mb/s DVB+D5.1, experimenting with HDTV (~15 Mb/s)
Customer (EPFL) contacted us
– early tests were unsatisfactory (due to problems at source, it turns out)
– set up NOC support (awareness, test participation, monitoring)
– one transmission still failed (due to misconfigured SWITCH router)
– fixed problem, improved NOC support (out-of-hours service)
– next transmission (last night) a success – it had to be...
-> include aspects of availability and support in “performance” notion
Conclusions
• significant potential for service improvements on current infrastructure
– end-host tuning, delay-robust protocols, better NOC cooperation
• PERT concept really helps
– improves customers' “reach” into backbones
– “user interface” can still be improved
• Leverage new developments in the future
– backbone measurement instrumentation, e.g. GN2 JRA1 PerfSONAR
– Premium IP and other “on-demand” services
• Long-term benefits
– smart users + dumb networks -> unexpected performance and innovation
    The end-to-end principles are honoured!