
Network-aware OS

DOE/MICS Project Final Review

September 16, 2004

Tom Dunigan [email protected]
Matt Mathis [email protected]
Brian Tierney [email protected]

ORNL: Florence Fowler, Steven Carter, Nagi Rao, Bill Wing

PSC: Raghu Reddy, John Heffner, Janet Brown

LBNL: Jason Lee, Martin Stouffer

U.S. Department of Energy Office of Science LBNL/ORNL/PSC

Roadmap

• Motivation & Background
• Net100 project components
  – Web100
  – network probes & sensors
  – protocol analysis and tuning
• Results
  – TCP tuning daemon
  – Tuning experiments
• Project contributions

www.net100.org

DOE-funded project (Office of Science): $2.6M, 3 years beginning 9/01; LBNL, ORNL, PSC, NCAR

Net100 project objectives (network-aware operating systems):
• measure, understand, and improve end-to-end network/application performance
• tune network protocols and applications (grid and bulk transfer)
• emphasis: TCP bulk transfer over high delay/bandwidth nets


Motivation

• Poor network application performance
  – high-bandwidth paths, but applications are slow
  – is it the application? the OS? the network? … yes
  – often need a network "wizard"
• Changing: bandwidths
  – 9.6 Kbs … 1.5 Mbs … 45 … 100 … 1000 … ? Gbs
• Unchanging: TCP
  – speed of light (RTT)
  – packet size (MSS/MTU), still 1500 bytes
  – TCP congestion control
• TCP is lossy by design!
  – 2x overshoot at startup, sawtooth
  – recovery rate proportional to MSS/RTT²
  – recovery after a loss can be very slow on today's high delay/bandwidth links, and unacceptable on tomorrow's links:
    • 10 Gbs cross country: recovery time > 1 hr! (arithmetic sketched below)
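The "> 1 hr" figure follows from simple arithmetic. A minimal check, assuming a 100 ms cross-country RTT and 1500-byte segments (neither value is stated on the slide):

    # Back-of-the-envelope AIMD recovery: after a loss, cwnd is halved and then
    # regains only one segment per RTT. Values below are assumptions for illustration.
    link_bps = 10e9                                # 10 Gbs path
    rtt = 0.100                                    # assumed cross-country RTT, seconds
    seg_bits = 1500 * 8                            # 1500-byte MSS
    pipe_segments = link_bps * rtt / seg_bits      # ~83,000 segments to fill the path
    segments_to_regain = pipe_segments / 2         # lost to the multiplicative decrease
    print(segments_to_regain * rtt / 60)           # ~69 minutes of linear recovery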

[Plot: ORNL to NERSC ftp over GigE/OC12 (600 Mbs), 80 ms RTT. Instantaneous and average bandwidth over 40 seconds: early startup losses, then linear recovery at 0.5 Mbs, averaging about 8 Mbs.]


TCP 101

• adaptable and fair
• flow-controlled by sender/receiver buffer sizes
• self-clocking with positive ACKs of in-sequence data
• sensitive to packet size (MTU) and RTT
• slow start: +1 packet per each packet ACK'd (exponential)
• congestion window (cwnd): max packets that can be in flight
• packet loss: 3 dup ACKs or timeout (AIMD; sketched in code after this list)
  – cut cwnd in half (Multiplicative Decrease)
  – add 1 packet to cwnd per RTT (Additive Increase)

• Workarounds:
  – parallel streams
  – non-TCP (UDP) applications
  – Net100 (no changes to applications)
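A minimal sketch of the slow-start and AIMD window dynamics listed above (toy model in segments; no real TCP state, retransmission, or pacing):

    # Toy cwnd model of slow start and AIMD (segments, not bytes).
    def on_ack(cwnd, ssthresh):
        if cwnd < ssthresh:
            return cwnd + 1            # slow start: +1 segment per ACK (doubles each RTT)
        return cwnd + 1.0 / cwnd       # congestion avoidance: ~+1 segment per RTT

    def on_loss(cwnd):
        new_ssthresh = max(cwnd / 2.0, 2)   # multiplicative decrease: halve the window
        return new_ssthresh, new_ssthresh   # (new cwnd, new ssthresh)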


Net100 components

• Web100 Linux kernel (NSF)
  – instrumented TCP stack (IETF MIB draft)
• Path characterization
  – Network Tuning and Analysis Framework (NTAF)
  – both active and passive measurement tools
  – database of measurements
• TCP protocol analysis and tuning
  – simulation/emulation
    • ns
    • TCP-over-UDP (atou)
    • NISTNet
  – kernel tuning extensions
  – tuning daemon (WAD)


Web100
• NSF funded (PSC/NCAR/NCSA), web100.org
• Modified Linux kernel
  – instrumented kernel to read/set TCP variables for a specific flow
  – readable: RTT, counts (bytes, pkts, retransmits, dups), state (SACKs, windowscale, cwnd, ssthresh)
  – settable: buffer sizes
  – 100+ TCP variables (IETF MIB) (/proc/web100/)
• GUI to display/modify a flow's TCP variables in real time
• API for network-aware applications or tuning daemon
• Net100 extensions:
  – additional tuning variables and algorithms
  – event notification for a tuning daemon
  – Java bandwidth tester http://cruise.ornl.gov:7123
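A minimal sketch of poking at the per-flow instrumentation. The directory layout under /proc/web100/ shown here is an assumption for illustration; real tools went through the Web100 userland library rather than raw file reads:

    # Illustrative only: list instrumented TCP connections exported by a Web100 kernel.
    # Assumes one numbered directory per flow under /proc/web100/ (layout assumed).
    import os

    WEB100_ROOT = "/proc/web100"
    if os.path.isdir(WEB100_ROOT):
        for entry in sorted(os.listdir(WEB100_ROOT)):
            path = os.path.join(WEB100_ROOT, entry)
            if entry.isdigit() and os.path.isdir(path):
                print("instrumented flow", entry, "exports:", os.listdir(path))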


Network Tool Analysis Framework (NTAF)

• Configure and launch network tools
  – measure bandwidth/latency (iperf, pchar, pipechar)
  – augment tools to report Web100 data
• Collect and transform tool results
  – use NetLogger to transform results into a common format
• Save results for short-term auto-tuning and archive them for later analysis
  – compare predicted to actual performance
  – measure effectiveness of tools and auto-tuning
  – provide data that can be used to predict future performance
  – invaluable for comparing tools (pathload/pchar/netest)

Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL
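A minimal sketch of the measure-and-archive cycle. The tool names come from the slide; the schedule, command-line flags, peer hostname, and log format are assumptions, not the NTAF implementation:

    # Periodically run a bandwidth tool and append timestamped output for later
    # analysis and auto-tuning. Everything below is an illustrative sketch.
    import subprocess, time

    def measure(peer, archive="ntaf-results.log"):
        result = subprocess.run(["iperf", "-c", peer, "-t", "10"],
                                capture_output=True, text=True)
        with open(archive, "a") as log:
            log.write("%s %s\n%s\n" % (time.ctime(), peer, result.stdout))

    # e.g. once an hour against a hypothetical NTAF peer:
    # while True: measure("ntaf-peer.example.org"); time.sleep(3600)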


TCP flow visualization

- Web interface for data archive and visualization


TCP tuning

• "enable" high speed
  – need buffer = bandwidth*RTT; autotune
    • ORNL/NERSC (80 ms, OC12) needs 6 MB (checked in the snippet after this list)
  – faster slow-start
• avoid losses
  – modified slow-start
  – reduce bursts
  – anticipate loss (ECN, Vegas?)
  – reorder threshold
• speed recovery
  – bigger MTU or "virtual MSS"
  – modified AIMD (0.5, 1) (Floyd, Kelly)
  – delayed ACKs, initial window, slow-start increment
• avoid congestion collapse, be fair (?) … intranets, QoS
• Net100: ns simulation, NISTNet emulation, "almost TCP over UDP" (atou), WAD/Internet
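The 6 MB buffer figure above is just the bandwidth-delay product; a one-line check, taking OC12 as roughly 600 Mbs:

    # buffer needed to keep a ~600 Mbs, 80 ms RTT path full
    print(600e6 * 0.080 / 8)    # 6,000,000 bytes, i.e. about 6 MB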

[ns simulation: 500 Mbs link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover.]


TCP Tuning Daemon

• Work-around Daemon (WAD)
  – tune unknowing sender/receiver at startup and/or during flow
  – Web100 kernel extensions
    • pre-set windowscale to allow dynamic tuning
    • uses netlink to alert daemon of socket open/close (or poll)
    • besides existing Web100 buffer tuning, new tuning parameters and algorithms
    • knobs to disable Linux 2.4 caching, burst management, and sendstall
  – config file with static tuning data
    • mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams)
  – daemon periodically polls NTAF for fresh tuning data
  – can do out-of-kernel tuning (e.g., Floyd)
  – written in C (also a Python version)

WAD config file:

  [bob]
  src_addr: 0.0.0.0
  src_port: 0
  dst_addr: 10.5.128.74
  dst_port: 0
  mode: 1
  sndbuf: 2000000
  rcvbuf: 100000
  wadai: 6
  wadmd: 0.3
  maxssth: 100
  divide: 1
  reorder: 9
  sendstall: 0
  delack: 0
  floyd: 1
  kellyai: 0
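A minimal sketch of reading a config in this format and finding the entry that applies to a new flow. The wildcard semantics for 0 / 0.0.0.0 are an assumption based on the example; applying the values through the Web100 interface is not shown:

    # Parse WAD-style "key: value" sections and pick the one matching a destination.
    import configparser

    def match_flow(cfg_path, dst_addr, dst_port):
        cfg = configparser.ConfigParser()
        cfg.read(cfg_path)
        for name in cfg.sections():
            entry = cfg[name]
            if entry.get("dst_addr") in ("0.0.0.0", dst_addr) and \
               entry.get("dst_port") in ("0", str(dst_port)):
                return name, dict(entry)   # e.g. ('bob', {'wadai': '6', 'wadmd': '0.3', ...})
        return None, {}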


Experimental results

• Evaluating the tuning daemon in the wild
  – emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  – tests over: 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE+jumbo frame, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  – tests over NISTNet testbed (speed, loss, delay)
• Various TCP tuning options
  – buffer tuning (static, auto, and dynamic/NTAF)
  – AIMD mods (including Floyd, Kelly, Vegas, static, virtual MSS)
  – slow-start mods
  – parallel streams vs. single tuned stream vs. UDP transports

[Diagram: NISTNet testbed host]


Buffer tuning

Classic buffer tuning
• network-challenged app. gets 10 Mbs
• same app., WAD/NTAF-tuned buffer gets 143 Mbs

Autotuning buffers (kernel)
• Linux 2.4, Feng's Dynamic Right Sizing
• Net100 autotuning (sketched below)
  – receiver estimates RTT
  – receiver advertises a window of 2x the data received in an RTT
  – buffer size grows dynamically to 2x bandwidth*RTT
  – separate application buffers from kernel buffers
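A minimal sketch of the receiver-side rule above. The 2x factor is from the slide; the names and the cap are illustrative:

    # Receiver advertises roughly twice what it received in the last RTT, so the
    # window grows toward 2 * bandwidth * RTT without a hand-tuned buffer.
    def advertised_window(bytes_rcvd_last_rtt, max_kernel_buffer):
        return min(2 * bytes_rcvd_last_rtt, max_kernel_buffer)

    # An OC12 (~600 Mbs) path at 80 ms RTT delivers ~6 MB per RTT when full, so
    # the advertised window settles near 12 MB (2x bandwidth*RTT).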

[Plots: ORNL to PSC over OC192, 30 ms RTT; ORNL to PSC over OC12, 80 ms RTT]


Speeding recovery

[Plot: Amsterdam to Chicago, GigE via 10GigE, 100 ms RTT; UDP burst]

Selectable TCP AIMD algorithms:
• Floyd HS TCP: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks
• Kelly scalable TCP: use an MD of 1/8 instead of 1/2, and add a percentage of cwnd (e.g., 1%) each RTT
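A minimal sketch of the per-RTT updates these options select, plus the virtual-MSS additive increase described next. Parameter values follow the slide; Floyd's HS TCP, which looks its AI/MD up from a table indexed by cwnd, is omitted:

    # Toy per-RTT window updates (segments). Not the kernel implementation.
    def standard_aimd(cwnd, loss):
        return cwnd * 0.5 if loss else cwnd + 1              # classic (0.5, 1)

    def kelly_scalable(cwnd, loss, ai_frac=0.01, md=0.125):
        return cwnd * (1 - md) if loss else cwnd * (1 + ai_frac)

    def virtual_mss(cwnd, loss, wad_ai=6):
        # WAD_AI adds k segments per RTT, roughly like a k*MSS jumbo frame
        return cwnd * 0.5 if loss else cwnd + wad_ai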

Virtual MSS
• tune TCP's additive increase (WAD_AI)
• add k segments per RTT during recovery
• k=6 is like a GigE jumbo frame, but:
  – interrupt rate is not reduced
  – doesn't add k segments for the initial window


WAD tuning

Modified slow-start and AI
• often losses in slow-start
• WAD-tuned Floyd slow-start and fixed AI (6)

WAD-tuned AIMD and slow-start
• parallel streams behave like AIMD (1/(2k), k)
• exploit TCP's fairness
• WAD-tuned single stream (0.125, 4)
• the same, plus Floyd slow-start
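The (0.125, 4) setting comes from the parallel-stream equivalence: one flow tuned to behave like k streams uses MD 1/(2k) and AI k. A minimal check:

    # AIMD parameters for a single tuned flow that mimics k parallel streams
    def equivalent_aimd(k):
        md = 1.0 / (2 * k)    # a loss usually hits only one of the k streams
        ai = float(k)         # k streams together add k segments per RTT
        return md, ai

    print(equivalent_aimd(4))   # (0.125, 4.0): the WAD-tuned single-stream setting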

[Plots: ORNL to NERSC, OC12, 80 ms RTT; ORNL to CERN, OC12, 150 ms RTT]


Workaround: parallel streams
• takes advantage of TCP's fairness
• faster startup, k buffers
• faster recovery
  – often only 1 stream loses a packet
  – MD: 1/(2k) rather than 1/2
  – AI: k times faster linear phase
• BUT
  – requires rewrite of applications
  – how many streams? buffer size?
• GridFTP, bbftp, psocket lib

[Cartoon: Alice and Bob sharing a link; "clever" Alice opens 3 streams ... bad girl]


GridFTP tuning

Can a tuned single stream compete with parallel streams? Mostly not with "equivalence" tuning, but sometimes. Parallel streams have a slow-start advantage.

The WAD can divide a buffer among concurrent flows: fairer/faster? Tests were inconclusive; testing on the real Internet is problematic.

Is there a "congestion metric"? Per unit of time?

  Flow       Mbs   congestion   re-xmits
  untuned     28        4           30
  tuned       74        5          295
  parallel    52       30          401

  untuned     25        7           25
  tuned       67        2          420
  parallel    88       17          440

Data/plots from Web100 tracer. Buffers: 64K I/O, 4 MB TCP.


Recent Net100 research

– more user-friendly WAD, WAD-lite
  • no NTAF; bandwidth test thread
– invited to submit Web100/Net100 mods to Linux 2.6
– port to Cray X1
  • Linux network front-end
  • added Net100 kernel, 4x improvement in wide-area TCP!
– port to SGI Altix
– TCP Vegas (rule sketched after this list)
  • Vegas avoids loss (if RTT is increasing, Vegas backs off)
  • can be configured to compete with standard TCP (Feng)
  • Caltech's FAST (adjusts alpha dynamically)
– comparison with other "workarounds"
  • parallel streams
  • non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP)
– additional accelerants
  • slow-start initial/increment
  • reorder resilience
  • delayed ACKs
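A minimal sketch of the Vegas idea mentioned above: estimate how many segments are queued in the path from RTT growth and back off before a loss. The alpha/beta thresholds here are common defaults, not Net100 settings:

    # Vegas-style per-RTT adjustment (toy model in segments).
    def vegas_update(cwnd, base_rtt, current_rtt, alpha=2, beta=4):
        queued = cwnd * (1 - base_rtt / current_rtt)   # segments estimated to be queued
        if queued > beta:
            return cwnd - 1        # RTT growing: back off before packets are dropped
        if queued < alpha:
            return cwnd + 1        # path looks underused: probe for more bandwidth
        return cwnd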


TCP tuning for other OS's

Reorder threshold
• seeing more out-of-order packets; future: multipath or bonded NICs
• WAD tunes a bigger reorder threshold for the path
• 40x improvement!
• Linux 2.4 does a good job already
  – adjusts and caches the reorder threshold
  – can "undo" congestion avoidance
• UDP transports don't handle re-ordering well

Delayed ACKs

• WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
• Linux 2.4 already turns off delayed ACKs for initial slow-start

[ns simulation: 500 Mbs link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover. Note the aggressive static AIMD (Floyd pre-tune).]

LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!


Interactions

• Scientific applications
  – SciDAC supernova and global climate
  – data grids (CERN, SLAC)
• Middleware
  – Globus/gridFTP
  – HSI/HPSS
• Network measurement
  – Internet2 end-to-end
  – PingER (Cottrell)
  – Claffy/Dovrolis pathload
  – netest (Guojun)
  – SCNM
• Protocol research
  – Dynamic Right-Sizing (Feng)
  – HS TCP (Floyd)
  – Scalable TCP (Kelly)
  – TCP Vegas (Feng, Low)
  – Tsunami/SABUL/FOBS/RBUDP
  – parallel streams (Hacker)
• OS vendors
  – Linux
  – IBM AIX/Linux
  – Cray X1
  – SGI Altix
• Talks/papers/software: www.net100.org


Insights

• Parallel streams are quite effective
  – no kernel mods, but need new app's
  – bypass system buffer limits
  – faster slow-start and recovery, and still TCP-like
• Rate-based UDP is effective
  – no kernel mods, but need new app's
  – sensitive to re-ordering
  – many duplicate packets
  – does software-based rate control in the application layer scale?
• WAD and WAD-lite: nice for experimenting or QoS, hard for the user
  – configure auto-tuning and Floyd's HS TCP

• Vote for bigger MTUs


Summary
• Novel approaches
  – non-invasive dynamic/auto-tuning of legacy applications
  – out-of-kernel tuning
  – using TCP to tune TCP
  – tuning per flow/destination based on recent path metrics or policy (QoS)
• Effective evaluation framework
  – protocol analysis and tuning
  – network/application/OS debugging
  – path characterization tools, archive, and visualization tools
• Performance improvements, WAD-tuned:
  – buffers 10x
  – AIMD 2x to 10x
  – delayed ACK 2x
  – slow-start 3x
  – reorder 40x

• Timely -- needed for science on today’s and tomorrow’s networks