Congestion Avoidance and Control

Van Jacobson
Lawrence Berkeley Laboratory

Michael J. Karels
University of California at Berkeley

November, 1988

Introduction

Computer networks have experienced an explosive growth over the past few years and with that growth have come severe congestion problems. For example, it is now common to see internet gateways drop 10% of the incoming packets because of local buffer overflows. Our investigation of some of these problems has shown that much of the cause lies in transport protocol implementations (not in the protocols themselves): The ‘obvious’ ways to implement a window-based transport protocol can result in exactly the wrong behavior in response to network congestion. We give examples of ‘wrong’ behavior and describe some simple algorithms that can be used to make right things happen. The algorithms are rooted in the idea of achieving network stability by forcing the transport connection to obey a ‘packet conservation’ principle. We show how the algorithms derive from this principle and what effect they have on traffic over congested networks.

In October of ’86, the Internet had the first of what became a series of ‘congestion collapses’. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and two IMP hops) dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. In particular, we wondered if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was “yes”.

Note: This is a very slightly revised version of a paper originally presented at SIGCOMM ’88 [11]. If you wish to reference this work, please cite [11].

This work was supported in part by the U.S. Department of Energy under Contract Number DE-AC03-76SF00098.

This work was supported by the U.S. Department of Commerce, National Bureau of Standards, under Grant Number 60NANB8D0830.

Since that time, we have put seven new algorithms into the 4BSD TCP:

(i) round-trip-time variance estimation

(ii) exponential retransmit timer backoff

(iii) slow-start

(iv) more aggressive receiver ack policy

(v) dynamic window sizing on congestion

(vi) Karn’s clamped retransmit backoff

(vii) fast retransmit

Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet.

This paper is a brief description of (i)–(v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [15]. (vii) is described in a soon-to-be-published RFC (ARPANET “Request for Comments”).

Algorithms (i)–(v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a ‘conservation of packets’ principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them.

By ‘conservation of packets’ we mean that for a connection ‘in equilibrium’, i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call ‘conservative’: A new packet isn’t put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.¹ Observation of the Internet suggests that it was not particularly robust. Why the discrepancy?

¹ A conservative flow means that for any given time, the integral of the packet density around the sender–receiver–sender loop is a constant. Since packets have to ‘diffuse’ around this loop, the integral is sufficiently continuous to be a Lyapunov function for the system. A constant function trivially meets the conditions for Lyapunov stability so the system is stable and any superposition of such systems is stable. (See [2], chap. 11–12 or [20], chap. 9 for excellent introductions to system stability theory.)


There are only three ways for packet conservation to fail:

1. The connection doesn’t get to equilibrium, or

2. A sender injects a new packet before an old packet has exited, or

3. The equilibrium can’t be reached because of resource limits along the path.

In the following sections, we treat each of these in turn.

1 Getting to Equilibrium: Slow-start

Failure (1) has to be from a connection that is either starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a ‘clock’ to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is ‘self clocking’ (fig. 1). Self clocking systems automatically adjust to bandwidth and delay variations and have a wide dynamic range (important considering that TCP spans a range from 800 Mbps Cray channels to 1200 bps packet radio links). But the same thing that makes a self-clocked system stable when it’s running makes it hard to start — to get data flowing there must be acks to clock out packets but to get acks there must be data flowing.

To start the ‘clock’, we developed a slow-start algorithm to gradually increase the amount of data in-transit.² Although we flatter ourselves that the design of this algorithm is rather subtle, the implementation is trivial — one new state variable and three lines of code in the sender:

• Add a congestion window, cwnd, to the per-connection state.

• When starting or restarting after a loss, set cwnd to one packet.

• On each ack for new data, increase cwnd by one packet.

• When sending, send the minimum of the receiver’s advertised window and cwnd.


² Slow-start is quite similar to the CUTE algorithm described in [13]. We didn’t know about CUTE at the time we were developing slow-start but we should have — CUTE preceded our work by several months. When describing our algorithm at the Feb., 1987, Internet Engineering Task Force (IETF) meeting, we called it soft-start, a reference to an electronics engineer’s technique to limit in-rush current. The name slow-start was coined by John Nagle in a message to the IETF mailing list in March, ’87. This name was clearly superior to ours and we promptly adopted it.


Actually, the slow-start window increase isn’t that slow: it takes time $R \log_2 W$ where $R$ is the round-trip-time and $W$ is the window size in packets (fig. 2). This means the window opens quickly enough to have a negligible effect on performance, even on links with a large bandwidth–delay product. And the algorithm guarantees that a connection will source data at a rate at most twice the maximum possible on the path. Without slow-start, by contrast, when 10 Mbps Ethernet hosts talk over the 56 Kbps Arpanet via IP gateways, the first-hop gateway sees a burst of eight packets delivered at 200 times the path bandwidth. This burst of packets often puts the connection into a persistent failure mode of continuous retransmissions (figures 3 and 4).
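The sender actions above translate almost directly into code. Below is a minimal sketch in C, in whole-packet units; the names (struct tcp_conn, ss_restart and so on) are illustrative, not taken from the 4BSD sources:

    /* Per-connection congestion control state (illustrative). */
    struct tcp_conn {
        unsigned cwnd;     /* congestion window, in packets */
        unsigned rcv_wnd;  /* receiver's advertised window, in packets */
    };

    /* When starting, or restarting after a loss, set cwnd to one packet. */
    void ss_restart(struct tcp_conn *tc) { tc->cwnd = 1; }

    /* On each ack for new data, open the window by one packet. */
    void ss_on_ack(struct tcp_conn *tc) { tc->cwnd += 1; }

    /* When sending, never exceed min(cwnd, receiver's advertised window). */
    unsigned ss_send_window(const struct tcp_conn *tc) {
        return tc->cwnd < tc->rcv_wnd ? tc->cwnd : tc->rcv_wnd;
    }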

2 Conservation at equilibrium: round-trip timing

Once data is flowing reliably, problems (2) and (3) should be addressed. Assuming that the protocol implementation is correct, (2) must represent a failure of the sender’s retransmit timer. A good round trip time estimator, the core of the retransmit timer, is the single most important feature of any protocol implementation that expects to survive heavy load. And it is frequently botched ([26] and [12] describe typical problems).

One mistake is not estimating the variation, $\sigma_R$, of the round trip time, $R$. From queuing theory we know that $R$ and the variation in $R$ increase quickly with load. If the load is $\rho$ (the ratio of average arrival rate to average departure rate), $R$ and $\sigma_R$ scale like $(1-\rho)^{-1}$. To make this concrete, if the network is running at 75% of capacity, as the Arpanet was in last April’s collapse, one should expect round-trip-time to vary by a factor of sixteen ($-2\sigma$ to $+2\sigma$).

The TCP protocol specification [23] suggests estimating mean round trip time via the low-pass filter

$$R \leftarrow \alpha R + (1 - \alpha) M$$

where $R$ is the average RTT estimate, $M$ is a round trip time measurement from the most recently acked data packet, and $\alpha$ is a filter gain constant with a suggested value of 0.9. Once the $R$ estimate is updated, the retransmit timeout interval, $rto$, for the next packet sent is set to $\beta R$.

The parameter $\beta$ accounts for RTT variation (see [4], section 5). The suggested $\beta = 2$ can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only been delayed in transit. This forces the network to do useless work, wasting bandwidth on duplicates of packets that will eventually be delivered, at a time when it’s known to be having trouble with useful work. I.e., this is the network equivalent of pouring gasoline on a fire.
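For concreteness, here is a sketch of the RFC793 estimator just described, in C and in floating point for clarity; the constants follow the suggested values above and the function name is illustrative:

    #define ALPHA 0.9   /* filter gain suggested by [23] */
    #define BETA  2.0   /* RTT variation fudge factor    */

    static double srtt;  /* smoothed RTT estimate R; in practice,
                          * seed this with the first measurement */

    /* Fold in a new RTT measurement M and return the retransmit
     * timeout rto = BETA * R. */
    double rfc793_rto(double m) {
        srtt = ALPHA * srtt + (1.0 - ALPHA) * m;
        return BETA * srtt;
    }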


Figure 1: Window Flow Control ‘Self-clocking’

This is a schematic representation of a sender and receiver on high bandwidth networks connected by a slower, long-haul net. The sender is just starting and has shipped a window’s worth of packets, back-to-back. The ack for the first of those packets is about to arrive back at the sender (the vertical line at the mouth of the lower left funnel). The vertical dimension is bandwidth, the horizontal dimension is time. Each of the shaded boxes is a packet. Bandwidth × Time = Bits so the area of each box is the packet size. The number of bits doesn’t change as a packet goes through the network so a packet squeezed into the smaller long-haul bandwidth must spread out in time. The time $P_b$ represents the minimum packet spacing on the slowest link in the path (the bottleneck). As the packets leave the bottleneck for the destination net, nothing changes the inter-packet interval so on the receiver’s net packet spacing $P_r = P_b$. If the receiver processing time is the same for all packets, the spacing between acks on the receiver’s net $A_r = P_r = P_b$. If the time slot $P_b$ was big enough for a packet, it’s big enough for an ack so the ack spacing is preserved along the return path. Thus the ack spacing on the sender’s net $A_s = P_b$. So, if packets after the first burst are sent only in response to an ack, the sender’s packet spacing will exactly match the packet time on the slowest link in the path.


We developed a cheap method for estimating variation (see appendix A)³ and the resulting retransmit timer essentially eliminates spurious retransmissions. A pleasant side effect of estimating $\beta$ rather than using a fixed value is that low load as well as high load performance improves, particularly over high delay paths such as satellite links (figures 5 and 6).

Another timer mistake is in the backoff after a retransmit: If a packet has to be retransmitted more than once, how should the retransmits be spaced? For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working — exponential backoff — but a proof of this is beyond the scope of this paper.⁴ To finesse a proof, note that a network is, to a very good approximation, a linear system. That is, it is composed of elements that behave like linear operators — integrators, delays, gain stages, etc. Linear system theory says that if a system is stable, the stability is exponential. This suggests that an unstable system (a network subject to random load shocks and prone to congestive collapse⁵) can be stabilized by adding some exponential damping (exponential timer backoff) to its primary excitation (senders, traffic sources).

³ We are far from the first to recognize that transport needs to estimate both mean and variation. See, for example, [5]. But we do think our estimator is simpler than most.

⁴ See [7]. Several authors have shown that backoffs ‘slower’ than exponential are stable given finite populations and knowledge of the global traffic. However, [16] shows that nothing slower than exponential behavior will work in the general case. To feed your intuition, consider that an IP gateway has essentially the same behavior as the ‘ether’ in an ALOHA net or Ethernet. Justifying exponential retransmit backoff is the same as showing that no collision backoff slower than an exponential will guarantee stability on an Ethernet. Unfortunately, with an infinite user population even exponential backoff won’t guarantee stability (although it ‘almost’ does — see [1]). Fortunately, we don’t (yet) have to deal with an infinite user population.
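The exponential backoff itself is simple to state in code. A minimal sketch in C, assuming the interval starts from the estimated rto; the clamp RTO_MAX and the function name are illustrative assumptions, not part of the argument above:

    #define RTO_MAX 64  /* illustrative upper bound on the timeout */

    /* Timeout to use for the nth retransmission of a packet:
     * double the interval on each retransmit, up to the clamp. */
    unsigned backed_off_rto(unsigned rto, unsigned nretrans) {
        while (nretrans-- > 0 && rto < RTO_MAX)
            rto <<= 1;
        return rto < RTO_MAX ? rto : RTO_MAX;
    }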


Figure 2: The Chronology of a Slow-start

[Diagram: packets (grey, numbered) and their acks (white) laid out in one-round-trip-time rows 0R–3R; the horizontal unit is one packet time.]

The horizontal direction is time. The continuous time line has been chopped into one-round-trip-time pieces stacked vertically with increasing time going down the page. The grey, numbered boxes are packets. The white numbered boxes are the corresponding acks. As each ack arrives, two packets are generated: one for the ack (the ack says a packet has left the system so a new packet is added to take its place) and one because an ack opens the congestion window by one packet. It may be clear from the figure why an add-one-packet-to-window policy opens the window exponentially in time. If the local net is much faster than the long haul net, the ack’s two packets arrive at the bottleneck at essentially the same time. These two packets are shown stacked on top of one another (indicating that one of them would have to occupy space in the gateway’s outbound queue). Thus the short-term queue demand on the gateway is increasing exponentially and opening a window of size $W$ packets will require $W/2$ packets of buffer capacity at the bottleneck.


3 Adapting to the path: congestion avoidance

If the timers are in good shape, it is possible to state with some confidence that a timeout indicates a lost packet and not a broken timer. At this point, something can be done about (3). Packets get lost for two reasons: they are damaged in transit, or the network is congested and somewhere on the path there was insufficient buffer capacity. On most network paths, loss due to damage is rare ($\ll$ 1%) so it is probable that a packet loss is due to congestion in the network.⁶

⁵ The phrase congestion collapse (describing a positive feedback instability due to poor retransmit timers) is again the coinage of John Nagle, this time from [22].

⁶ Because a packet loss empties the window, the throughput of any window flow control protocol is quite sensitive to damage loss. For an RFC793 standard TCP running with window $w$ (where $w$ is at most the bandwidth-delay product), a loss probability of $p$ degrades throughput by a factor of $(1 + 2pw)^{-1}$. E.g., a 1% damage loss rate on an Arpanet path (8 packet window) degrades TCP throughput by 14%. The congestion control scheme we propose is insensitive to damage loss until the loss rate is on the order of the window equilibration length (the number of packets it takes the window to regain its original size after a loss). If the pre-loss size is $w$, equilibration takes roughly $w^2/3$ packets so, for the Arpanet, the loss sensitivity threshold is about 5%. At this high loss rate, the empty window effect described above has already degraded throughput by 44% and the additional degradation from the congestion avoidance window shrinking is the least of one’s problems. We are concerned that the congestion control noise sensitivity is quadratic in $w$ but it will take at least another generation of network evolution to reach window sizes where this will be significant. If experience shows this sensitivity to be a liability, a trivial modification to the algorithm makes it linear in $w$. An in-progress paper explores this subject in detail.


A ‘congestion avoidance’ strategy, such as the one proposed in [14], will have two components: The network must be able to signal the transport endpoints that congestion is occurring (or about to occur). And the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn’t received.

If packet loss is (almost) always due to congestion and if a timeout is (almost) always due to a lost packet, we have a good candidate for the ‘network is congested’ signal. Particularly since this signal is delivered automatically by all existing networks, without special modification (as opposed to [14] which requires a new bit in the packet headers and a modification to all existing gateways to set this bit).

The other part of a congestion avoidance strategy, the endnode action, is almost identical in the DEC/ISO scheme and our TCP⁷ and follows directly from a first-order time-series model of the network:⁸ Say network load is measured by average queue length over fixed intervals of some appropriate length (something near the round trip time). If $L_i$ is the load at interval $i$, an uncongested network can be modeled by saying $L_i$ changes slowly compared to the sampling time. I.e., $L_i = N$ ($N$ constant). If the network is subject to congestion, this zeroth order model breaks down. The average queue length becomes the sum of two terms, the $N$ above that accounts for the average arrival rate of new traffic and intrinsic delay, and a new term that accounts for the fraction of traffic left over from the last time interval and the effect of this left-over traffic (e.g., induced retransmits):

$$L_i = N + \gamma L_{i-1}$$

(These are the first two terms in a Taylor series expansion of $L(t)$. There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.)


⁷ This is not an accident: We copied Jain’s scheme after hearing his presentation at [9] and realizing that the scheme was, in a sense, universal.

⁸ See any good control theory text for the relationship between a system model and admissible controls for that system. A nice introduction appears in [20], chap. 8.

When the network is congested, $\gamma$ must be large and the queue lengths will start increasing exponentially.⁹ The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, $W$, we end up with the sender policy

On congestion:

$$W_i = d W_{i-1} \qquad (d < 1)$$

I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists).

If there’s no congestion, $\gamma$ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but says nothing if a connection is using less than its fair share (since the network is stateless, it cannot know this). Thus a connection has to increase its bandwidth utilization to find out the current limit. E.g., you could have been sharing the path with someone else and converged to a window that gives you each half the available bandwidth. If she shuts down, 50% of the bandwidth will be wasted unless your window size is increased. What should the increase policy be?

The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, $W_i = b W_{i-1}$, $1 < b \le 1/d$. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. The analytic reason for this has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [17], chap. 2.1, calls the rush-hour effect).¹⁰ Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable.

⁹ I.e., the system behaves like $L_i \sim \gamma L_{i-1}$, a difference equation with the solution $L_n = \gamma^n L_0$, which goes exponentially to infinity for any $\gamma > 1$.



Without justification, we’ll state that the best increase policy is to make small, constant changes to the window size:¹¹

On no congestion:

$$W_i = W_{i-1} + u \qquad (u \ll W_{max})$$

where $W_{max}$ is the pipe size (the delay-bandwidth product of the path minus protocol overhead — i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [14] and the policy we’ve implemented in TCP. The only difference between the two implementations is the choice of constants for $d$ and $u$. We used 0.5 and 1 for reasons partially explained in appendix D. A more complete analysis is in yet another in-progress paper.

The preceding has probably made the congestion control algorithm sound hairy but it’s not. Like slow-start, it’s three lines of code:

• On any timeout, set cwnd to half the current window size (this is the multiplicative decrease).

• On each ack for new data, increase cwnd by 1/cwnd (this is the additive increase).¹²

• When sending, send the minimum of the receiver’s advertised window and cwnd.
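A minimal sketch of these three actions in C, keeping cwnd as a floating point number of packets for clarity (appendix B does the equivalent in scaled integer units); the names are illustrative and the one-packet floor is an added assumption:

    struct tcp_conn {
        double   cwnd;     /* congestion window, in packets */
        unsigned rcv_wnd;  /* receiver's advertised window, in packets */
    };

    /* On any timeout: multiplicative decrease. */
    void ca_on_timeout(struct tcp_conn *tc) {
        tc->cwnd /= 2;
        if (tc->cwnd < 1)
            tc->cwnd = 1;   /* floor at one packet (assumption) */
    }

    /* On each ack for new data: additive increase, at most one
     * packet per window's worth of acks. */
    void ca_on_ack(struct tcp_conn *tc) {
        tc->cwnd += 1.0 / tc->cwnd;
    }

    /* When sending: min of cwnd and the receiver's window. */
    unsigned ca_send_window(const struct tcp_conn *tc) {
        unsigned w = (unsigned)tc->cwnd;
        return w < tc->rcv_wnd ? w : tc->rcv_wnd;
    }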

Note that this algorithm is only congestion avoidance, it doesn’t include the previously described slow-start. Since the packet loss that signals congestion will result in a re-start, it will almost certainly be necessary to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm.¹³

¹¹ See [3] for a complete analysis of these increase and decrease policies. Also see [7] and [8] for a control-theoretic analysis of a similar class of control policies.

¹² This increment rule may be less than obvious. We want to increase the window by at most one packet over a time interval of length $R$ (the round trip time). To make the algorithm ‘self-clocked’, it’s better to increment by a small amount on each ack rather than by a large amount at the end of the interval. (Assuming, of course, the sender has effective silly window avoidance (see [4], section 3) and doesn’t attempt to send packet fragments because of the fractionally sized window.) A window of size cwnd packets will generate at most cwnd acks in one $R$. Thus an increment of 1/cwnd per ack will increase the window by at most one packet in one $R$. In TCP, windows and packet sizes are in bytes so the increment translates to maxseg*maxseg/cwnd where maxseg is the maximum segment size and cwnd is expressed in bytes, not packets.


Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn’t far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default 4.3BSD window size is eight packets (4 KB). Thus simultaneous conversations between, say, any two hosts at Berkeley and any two hosts at MIT would exceed the network capacity of the UCB–MIT IMP path and would lead¹⁴ to the type of behavior shown.

4 Future work: the gateway side of congestion control

While algorithms at the transport endpoints can insure the network capacity isn’t exceeded, they cannot insure fair sharing of that capacity. Only in gateways, at the convergence of flows, is there enough information to control sharing and fair allocation. Thus, we view the gateway ‘congestion detection’ algorithm as the next big step.

The goal of this algorithm is to send a signal to the endnodes as early as possible, but not so early that the gateway becomes starved for traffic.

¹³ We have also developed a rate-based variant of the congestion avoidance algorithm to apply to connectionless traffic (e.g., domain server queries, RPC requests). Remembering that the goal of the increase and decrease policies is bandwidth adjustment, and that ‘time’ (the controlled parameter in a rate-based scheme) appears in the denominator of bandwidth, the algorithm follows immediately: The multiplicative decrease remains a multiplicative decrease (e.g., double the interval between packets). But subtracting a constant amount from interval does not result in an additive increase in bandwidth. This approach has been tried, e.g., [18] and [24], and appears to oscillate badly. To see why, note that for an inter-packet interval $I$ and decrement $c$, the bandwidth change of a decrease-interval-by-constant policy is

$$\frac{1}{I} \to \frac{1}{I - c}$$

a non-linear, and destabilizing, increase.

An update policy that does result in a linear increase of bandwidth over time is

$$I_i = \frac{\alpha I_{i-1}}{\alpha + I_{i-1}}$$

where $I_i$ is the interval between sends when the $i$-th packet is sent and $\alpha$ is the desired rate of increase in packets per packet/sec. (Taking reciprocals, $1/I_i = 1/I_{i-1} + 1/\alpha$, so the sending rate grows by a constant $1/\alpha$ per send.)

We have simulated the above algorithm and it appears to perform well. To test the predictions of that simulation against reality, we have a cooperative project with Sun Microsystems to prototype RPC dynamic congestion control algorithms using NFS as a test-bed (since NFS is known to have congestion problems yet it would be desirable to have it work over the same range of networks as TCP).

¹⁴ did lead.


Since we plan to continue using packet drops as a congestion signal, gateway ‘self protection’ from a mis-behaving host should fall out for free: That host will simply have most of its packets dropped as the gateway tries to tell it that it’s using more than its fair share. Thus, like the endnode algorithm, the gateway algorithm should reduce congestion even if no endnode is modified to do congestion avoidance. And nodes that do implement congestion avoidance will get their fair share of bandwidth and a minimum number of packet drops.

Since congestion grows exponentially, detecting it early is important. If detected early, small adjustments to the senders’ windows will cure it. Otherwise massive adjustments are necessary to give the net enough spare capacity to pump out the backlog. But, given the bursty nature of traffic, reliable detection is a non-trivial problem. Jain [14] proposes a scheme based on averaging between queue regeneration points. This should yield good burst filtering but we think it might have convergence problems under high load or significant second-order dynamics in the traffic.¹⁵ We plan to use some of our earlier work on ARMAX models for round-trip-time/queue length prediction as the basis of detection. Preliminary results suggest that this approach works well at high load, is immune to second-order effects in the traffic and is computationally cheap enough to not slow down kilopacket-per-second gateways.

Acknowledgements

We are grateful to the members of the Internet Activity Board’s End-to-End and Internet-Engineering task forces for this past year’s interest, encouragement, cogent questions and network insights. Bob Braden of ISI and Craig Partridge of BBN were particularly helpful in the preparation of this paper: their careful reading of early drafts improved it immensely.

The first author is also deeply in debt to Jeff Mogul of DEC Western Research Lab. Without Jeff’s interest and patient prodding, this paper would never have existed.

¹⁵ These problems stem from the fact that the average time between regeneration points scales like $(1-\rho)^{-1}$ and the variance like $(1-\rho)^{-3}$ (see Feller [6], chap. VI.9). Thus the congestion detector becomes sluggish as congestion increases and its signal-to-noise ratio decreases dramatically.


Figure 3: Startup behavior of TCP without Slow-start

Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7). The window size for the connection was 16 KB (32 512-byte packets) and there were 30 packets of buffer available at the bottleneck gateway. The actual path contains six store-and-forward hops so the pipe plus gateway queue has enough capacity for a full window but the gateway queue alone does not. Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit. ‘Desirable’ behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior. The dashed line shows the 20 KBps bandwidth available for this connection. Only 35% of this bandwidth was used; the rest was wasted on retransmits. Almost everything is retransmitted at least once and data from 54 to 58 KB is sent five times.


Figure 4: Startup behavior of TCP with Slow-start

Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+ TCP with slow-start. No bandwidth is wasted on retransmits but two seconds is spent on the slow-start so the effective bandwidth of this part of the trace is 16 KBps — two times better than figure 3. (This is slightly misleading: Unlike the previous figure, the slope of the trace is 20 KBps and the effect of the 2 second offset decreases as the trace lengthens. E.g., if this trace had run a minute, the effective bandwidth would have been 19 KBps. The effective bandwidth without slow-start stays at 7 KBps no matter how long the trace.)


Figure 5: Performance of an RFC793 retransmit timer

Trace data showing per-packet round trip time on a well-behaved Arpanet connection. The x-axis is the packet number (packets were numbered sequentially, starting with one) and the y-axis is the elapsed time from the send of the packet to the sender’s receipt of its ack. During this portion of the trace, no packets were dropped or retransmitted. The packets are indicated by a dot. A dashed line connects them to make the sequence easier to follow. The solid line shows the behavior of a retransmit timer computed according to the rules of RFC793.

Figure 6: Performance of a Mean+Variance retransmit timer

Same data as above but the solid line shows a retransmit timer computed according to the algorithm in appendix A.


Figure 7: Multiple conversation test setup

[Diagram: four Sun 3/50s (polo, hot, surf, vs) on 10 Mbps Ethernets at LBL and four UCB machines (renoir and monet, VAX 750s; vangogh, a VAX 8600; okeeffe, a CCI), connected via IP routers csam (LBL) and cartan (UCB) over a 230.4 Kbps microwave link.]

Test setup to examine the interaction of multiple, simultaneous TCP conversations sharing a bottleneck link. 1 MByte transfers (2048 512-data-byte packets) were initiated 3 seconds apart from four machines at LBL to four machines at UCB, one conversation per machine pair (the dotted lines above show the pairing). All traffic went via a 230.4 Kbps link connecting IP router csam at LBL to IP router cartan at UCB. The microwave link queue can hold up to 50 packets. Each connection was given a window of 16 KB (32 512-byte packets). Thus any two connections could overflow the available buffering and the four connections exceeded the queue capacity by 160%.


Figure 8: Multiple, simultaneous TCPs with no congestion avoidance

[Plot: sequence number (KB) vs. time (sec).]

Trace data from four simultaneous TCP conversations without congestion avoidance over the paths shown in figure 7. 4,000 of 11,000 packets sent were retransmissions (i.e., half the data packets were retransmitted). Since the link data bandwidth is 25 KBps, each of the four conversations should have received 6 KBps. Instead, one conversation got 8 KBps, two got 5 KBps, one got 0.5 KBps and 6 KBps has vanished.


Figure 9: Multiple, simultaneous TCPs with congestion avoidance

[Plot: sequence number (KB) vs. time (sec).]

Trace data from four simultaneous TCP conversations using congestion avoidance over the paths shown in figure 7. 89 of 8281 packets sent were retransmissions (i.e., 1% of the data packets had to be retransmitted). Two of the conversations got 8 KBps and two got 4.5 KBps (i.e., all the link bandwidth is accounted for — see fig. 11). The difference between the high and low bandwidth senders was due to the receivers. The 4.5 KBps senders were talking to 4.3BSD receivers which would delay an ack until 35% of the window was filled or 200 ms had passed (i.e., an ack was delayed for 5–7 packets on the average). This meant the sender would deliver bursts of 5–7 packets on each ack. The 8 KBps senders were talking to 4.3+BSD receivers which would delay an ack for at most one packet (because of an ack’s ‘clock’ rôle, the authors believe that the minimum ack frequency should be every other packet). I.e., the sender would deliver bursts of at most two packets. The probability of loss increases rapidly with burst size so senders talking to old-style receivers saw three times the loss rate (1.8% vs. 0.5%). The higher loss rate meant more time spent in retransmit wait and, because of the congestion avoidance, smaller average window sizes.


Figure 10: Total bandwidth used by old and new TCPs

[Plot: relative bandwidth vs. time (sec).]

The thin line shows the total bandwidth used by the four senders without congestion avoidance (fig. 8), averaged over 5 second intervals and normalized to the 25 KBps link bandwidth. Note that the senders send, on the average, 25% more than will fit in the wire. The thick line is the same data for the senders with congestion avoidance (fig. 9). The first 5 second interval is low (because of the slow-start), then there is about 20 seconds of damped oscillation as the congestion control ‘regulator’ for each TCP finds the correct window size. The remaining time the senders run at the wire bandwidth. (The activity around 110 seconds is a bandwidth ‘re-negotiation’ due to connection one shutting down. The activity around 80 seconds is a reflection of the ‘flat spot’ in fig. 9 where most of conversation two’s bandwidth is suddenly shifted to conversations three and four — competing conversations frequently exhibit this type of ‘punctuated equilibrium’ behavior and we hope to investigate its dynamics in a future paper.)


Figure 11: Effective bandwidth of old and new TCPs

[Plot: relative bandwidth vs. time (sec).]

Figure 10 showed the old TCPs were using 25% more than the bottleneck link bandwidth. Thus, once the bottleneck queue filled, 25% of the senders’ packets were being discarded. If the discards, and only the discards, were retransmitted, the senders would have received the full 25 KBps link bandwidth (i.e., their behavior would have been anti-social but not self-destructive). But fig. 8 noted that around 25% of the link bandwidth was unaccounted for. Here we average the total amount of data acked per five second interval. (This gives the effective or delivered bandwidth of the link.) The thin line is once again the old TCPs. Note that only 75% of the link bandwidth is being used for data (the remainder must have been used by retransmissions of packets that didn’t need to be retransmitted). The thick line shows delivered bandwidth for the new TCPs. There is the same slow-start and turn-on transient followed by a long period of operation right at the link bandwidth.


Figure 12: Window adjustment detail

[Plot: relative bandwidth vs. time (sec).]

Because of the five second averaging time (needed to smooth out the spikes in the old TCP data), the congestion avoidance window policy is difficult to make out in figures 10 and 11. Here we show effective throughput (data acked) for TCPs with congestion control, averaged over a three second interval. When a packet is dropped, the sender sends until it fills the window, then stops until the retransmission timeout. Since the receiver cannot ack data beyond the dropped packet, on this plot we’d expect to see a negative-going spike whose amplitude equals the sender’s window size (minus one packet). If the retransmit happens in the next interval (the intervals were chosen to match the retransmit timeout), we’d expect to see a positive-going spike of the same amplitude when the receiver acks the out-of-order data it cached. Thus the height of these spikes is a direct measure of the sender’s window size. The data clearly shows three of these events (at 15, 33 and 57 seconds) and the window size appears to be decreasing exponentially. The dotted line is a least squares fit to the six window size measurements obtained from these events. The fit time constant was 28 seconds. (The long time constant is due to lack of a congestion avoidance algorithm in the gateway. With a ‘drop’ algorithm running in the gateway, the time constant should be around 4 seconds.)


A A fast algorithm for rtt mean and variation

A.1 Theory

The RFC793 algorithm for estimating the mean round trip time is one of the simplest examples of a class of estimators called recursive prediction error or stochastic gradient algorithms. In the past 20 years these algorithms have revolutionized estimation and control theory [19] and it’s probably worth looking at the RFC793 estimator in some detail.

Given a new measurement $m$ of the RTT (round trip time), TCP updates an estimate of the average RTT $a$ by

$$a \leftarrow (1 - g)a + g m$$

where $g$ is a ‘gain’ ($0 < g < 1$) that should be related to the signal-to-noise ratio (or, equivalently, variance) of $m$. This makes more sense, and computes faster, if we rearrange and collect terms multiplied by $g$ to get

$$a \leftarrow a + g(m - a)$$

Think of $a$ as a prediction of the next measurement. $m - a$ is the error in that prediction and the expression above says we make a new prediction based on the old prediction plus some fraction of the prediction error. The prediction error is the sum of two components: (1) error due to ‘noise’ in the measurement (random, unpredictable effects like fluctuations in competing traffic) and (2) error due to a bad choice of $a$. Calling the random error $E_r$ and the estimation error $E_e$,

$$a \leftarrow a + g E_r + g E_e$$


In integer arithmetic, keeping a scaled version of the average in sa and of the mean deviation in sv (each scaled by the reciprocal of its gain), the estimator updates for a new measurement m are:

    /* update Average estimator */
    m -= (sa >> 3);
    sa += m;

    /* update Deviation estimator */
    if (m < 0)
        m = -m;
    m -= (sv >> 3);
    sv += m;

It’s not necessary to use the same gain for $a$ and $v$. To force the timer to go up quickly in response to changes in the RTT, it’s a good idea to give $v$ a larger gain. In particular, because of window–delay mismatch there are often RTT artifacts at integer multiples of the window size.¹⁷ To filter these, one would like $1/g$ in the $a$ estimator to be at least as large as the window size (in packets) and $1/g$ in the $v$ estimator to be less than the window size.¹⁸

Using a gain of .25 on the deviation and computing the retransmit timer, $rto$, as $a + 4v$, the final timer code looks like:

    m -= (sa >> 3);
    sa += m;
    if (m < 0)
        m = -m;
    m -= (sv >> 2);
    sv += m;
    rto = (sa >> 3) + sv;

In general this computation will correctly round $rto$: Because of the sa truncation when computing $m - a$, sa will converge to the true mean rounded up to the next tick. Likewise with sv. Thus, on the average, there is half a tick of bias in each. The $rto$ computation should be rounded by half a tick and one tick needs to be added to account for sends being phased randomly with respect to the clock. So, the 1.75 tick bias contribution from $4v$ approximately equals the desired half tick rounding plus one tick phase correction.
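For reference, the same arithmetic wrapped in a self-contained routine; a sketch assuming integer timer ticks, with struct rtt_est and rtt_update as illustrative names:

    struct rtt_est {
        int sa;   /* scaled average: 8*a   */
        int sv;   /* scaled deviation: 4*v */
    };

    /* Fold a new RTT measurement m (in ticks) into the estimator
     * and return the retransmit timeout. */
    int rtt_update(struct rtt_est *e, int m) {
        m -= e->sa >> 3;      /* m becomes the prediction error m - a  */
        e->sa += m;           /* a += error/8  (sa holds 8a)           */
        if (m < 0)
            m = -m;           /* |error| drives the deviation update   */
        m -= e->sv >> 2;
        e->sv += m;           /* v += (|error| - v)/4  (sv holds 4v)   */
        return (e->sa >> 3) + e->sv;   /* rto = a + 4v                 */
    }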

¹⁷ E.g., see packets 10–50 of figure 5. Note that these window effects are due to characteristics of the Arpa/Milnet subnet. In general, window effects on the timer are at most a second-order consideration and depend a great deal on the underlying network. E.g., if one were using the Wideband with a 256 packet window, 1/256 would not be a good gain for $v$ (1/16 might be).

¹⁸ Although it may not be obvious, the absolute value in the calculation of $v$ introduces an asymmetry in the timer: Because $v$ has the same sign as an increase and the opposite sign of a decrease, more gain in $v$ makes the timer go up quickly and come down slowly, ‘automatically’ giving the behavior suggested in [21]. E.g., see the region between packets 50 and 80 in figure 6.

B The combined slow-start with congestion avoidance algorithm

The sender keeps two state variables for congestion control: a slow-start/congestion window, cwnd, and a threshold size, ssthresh, to switch between the two algorithms. The sender’s output routine always sends the minimum of cwnd and the window advertised by the receiver. On a timeout, half the current window size is recorded in ssthresh (this is the multiplicative decrease part of the congestion avoidance algorithm), then cwnd is set to 1 packet (this initiates slow-start). When new data is acked, the sender does

    if (cwnd < ssthresh)
        /* if we're still doing slow-start
         * open window exponentially */
        cwnd += 1;
    else
        /* otherwise do Congestion Avoidance
         * increment-by-1 */
        cwnd += 1/cwnd;

Thus slow-start opens the window quickly to what congestion avoidance thinks is a safe operating point (half the window that got us into trouble), then congestion avoidance takes over and slowly increases the window size to probe for more bandwidth becoming available on the path.

Note that the else clause of the above code will malfunction if cwnd is an integer in unscaled, one-packet units. I.e., if the maximum window for the path is $w$ packets, cwnd must cover the range $0 \ldots w$ with a resolution of at least $1/w$.¹⁹ Since sending packets smaller than the maximum transmission unit for the path lowers efficiency, the implementor must take care that the fractionally sized cwnd does not result in small packets being sent. In reasonable TCP implementations, existing silly-window avoidance code should prevent runt packets but this point should be carefully checked.

¹⁹ For TCP this happens automatically since windows are expressed in bytes, not packets. For protocols such as ISO TP4, the implementor should scale cwnd so that the calculations above can be done with integer arithmetic and the scale factor should be large enough to avoid the fixed point (zero) of 1/cwnd in the congestion avoidance increment.
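Pulling the pieces of this appendix together, a sketch of the combined scheme in C, using a fixed-point scale so the cwnd += 1/cwnd increment works in integer arithmetic as footnote 19 suggests; SCALE, on_timeout and on_new_ack are illustrative names:

    #define SCALE 1024  /* fixed-point scale for cwnd, in packets */

    static unsigned cwnd = 1 * SCALE;  /* slow-start/congestion window */
    static unsigned ssthresh = ~0u;    /* threshold between algorithms */

    /* On a timeout: record half the current window, then slow-start. */
    void on_timeout(void) {
        ssthresh = cwnd / 2;
        cwnd = 1 * SCALE;
    }

    /* When new data is acked. */
    void on_new_ack(void) {
        if (cwnd < ssthresh)
            cwnd += SCALE;                  /* slow-start: +1 packet per ack */
        else
            cwnd += (SCALE * SCALE) / cwnd; /* avoidance: +1 packet per window */
    }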

C Window adjustment interaction with round-trip timing

Some TCP connections, particularly those over a very low speed link such as a dial-up SLIP line [25], may experience an unfortunate interaction between congestion window adjustment and retransmit timing:


Network paths tend to divide into two classes: delay-dominated, where the store-and-forward and/or transit delays determine the RTT, and bandwidth-dominated, where (bottleneck) link bandwidth and average packet size determine the RTT.²⁰ On a bandwidth-dominated path of bandwidth $b$, a congestion-avoidance window increment of $\Delta w$ will increase the RTT of post-increment packets by

$$\Delta R = \frac{\Delta w}{b}$$

If the path RTT variation $v$ is small, $\Delta R$ may exceed the $4v$ cushion in $rto$, a retransmit timeout will occur and, after a few cycles of this, ssthresh (and, thus, cwnd) end up clamped at small values.

The $rto$ calculation in appendix A was designed to prevent this type of spurious retransmission timeout during slow-start. In particular, the RTT variation $v$ is multiplied by four in the $rto$ calculation because of the following: A spurious retransmit occurs if the retransmit timeout computed at the end of slow-start round $i$, $rto_i$, is ever less than or equal to the actual RTT of the next round. In the worst case of all the delay being due the window, $R$ doubles each round (since the window size doubles). Thus $R_{i+1} = 2R_i$ (where $R_i$ is the measured RTT at slow-start round $i$). But

$$v_i \ge R_i - R_{i-1} = R_i/2$$

and

$$rto_i = R_i + 4v_i \ge 3R_i \ge 2R_i \ge R_{i+1}$$

so spurious retransmit timeouts cannot occur.²¹

Spurious retransmission due to a window increase can occur during the congestion avoidance window increment since the window can only be changed in one packet increments so, for a packet size $s$, there may be as many as $s - 1$ packets between increments, long enough for any $v$ increase due to the last window increment to decay away to nothing. But this problem is unlikely on a bandwidth-dominated path since the increments would have to be more than twelve packets apart (the decay time of the $v$ filter times its gain in the $rto$ calculation) which implies that a ridiculously large window is being used for the path.²² Thus one should regard these timeouts as appropriate punishment for gross mis-tuning and their effect will simply be to reduce the window to something more appropriate for the path.

²⁰ E.g., TCP over a 2400 baud packet radio link is bandwidth-dominated since the transmission time for a (typical) 576 byte IP packet is 2.4 seconds, longer than any possible terrestrial transit delay.

²¹ The original SIGCOMM ’88 version of this paper suggested calculating $rto$ as $R + 2v$ rather than $R + 4v$. Since that time we have had much more experience with low speed SLIP links and observed spurious retransmissions during connection startup. An investigation of why these occurred led to the analysis above and the change to the $rto$ calculation in app. A.


Although slow-start and congestion avoidance are designed to not trigger this kind of spurious retransmission, an interaction with higher level protocols frequently does: Application protocols like SMTP and NNTP have a ‘negotiation’ phase where a few packets are exchanged stop-and-wait, followed by a data transfer phase where all of a mail message or news article is sent. Unfortunately, the ‘negotiation’ exchanges open the congestion window so the start of the data transfer phase will dump several packets into the network with no slow-start and, on a bandwidth-dominated path, faster than $rto$ can track the RTT increase caused by these packets. The root cause of this problem is the same one described in sec. 1: dumping too many packets into an empty pipe (the pipe is empty since the negotiation exchange was conducted stop-and-wait) with no ack ‘clock’. The fix proposed in sec. 1, slow-start, will also prevent this problem if the TCP implementation can detect the phase change. And detection is simple: The pipe is empty because we haven’t sent anything for at least a round-trip-time (another way to view RTT is as the time it takes the pipe to empty after the most recent send). So, if nothing has been sent for at least one RTT, the next send should set cwnd to one packet to force a slow-start. I.e., if the connection state variable lastsnd holds the time the last packet was sent, the following code should appear early in the TCP output routine:

    /* true if there is no data in transit (all data sent has been acked) */
    int idle = (snd_max == snd_una);
    if (idle && now - lastsnd >= rto)
        cwnd = 1;   /* force a slow-start on the next send */

The boolean idle is true if there is no data in transit (all data sent has been acked) so the if says “if there’s nothing in transit and we haven’t sent anything for ‘a long time’, slow-start.” Our experience has been that either the current RTT estimate or the $rto$ estimate can be used for ‘a long time’ with good results.²³

²² The largest sensible window for a path is the bottleneck bandwidth times the round-trip delay and, by definition, the delay is negligible for a bandwidth-dominated path so the window should only be a few packets.

²³ The $rto$ estimate is more convenient since it is kept in units of time while RTT is scaled. Also, because of send/receive symmetry, the time of the last receive can be used rather than the last send — if the protocol implements ‘keepalives’, this state variable may already exist.


D Window Adjustment Policy

A reason for using ½ as the decrease term, as opposed to the ⅞ in [14], was the following handwaving: When a packet is dropped, you’re either starting (or restarting after a drop) or steady-state sending. If you’re starting, you know that half the current window size ‘worked’, i.e., that a window’s worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it’s probably because a new connection started up and took some of your bandwidth. We usually run our nets with $\rho \le 0.5$ so it’s probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative — and being conservative at high traffic intensities is probably wise.

Although a factor of two change in window size seems a large performance penalty, in system terms the cost is negligible: Currently, packets are dropped only when a large queue has formed. Even with the ISO IP ‘congestion experienced’ bit [10] to force senders to reduce their windows, we’re stuck with the queue because the bottleneck is running at 100% utilization with no excess bandwidth available to dissipate the queue. If a packet is tossed, some sender shuts up for two RTT, exactly the time needed to empty the queue. If that sender restarts with the correct window size, the queue won’t reform. Thus the delay has been reduced to minimum without the system losing any bottleneck bandwidth.

The 1-packet increase has less justification than the 0.5 decrease. In fact, it’s almost certainly too large. If the algorithm converges to a window size of $w$, there are $O(w^2)$ packets between drops with an additive increase policy. We were shooting for an average drop rate of $<$ 1% and found that on the Arpanet (the worst case of the four networks we tested), windows converged to 8–12 packets. This yields 1 packet increments for a 1% average drop rate.

But, since we’ve done nothing in the gateways, the window we converge to is the maximum the gateway can accept without dropping packets. I.e., in the terms of [14], we are just to the left of the cliff rather than just to the right of the knee. If the gateways are fixed so they start dropping packets when the queue gets pushed past the knee, our increment will be much too aggressive and should be dropped by about a factor of four (since our measurements on an unloaded Arpanet place its ‘pipe size’ at 4–5 packets). It appears trivial to implement a second order control loop to adaptively determine the appropriate increment to use for a path. But second order problems are on hold until we’ve spent some time on the first order part of the algorithm for the gateways.

References

[1] Aldous, D. J. Ultimate instability of exponential back-off protocol for acknowledgment based transmission control of random access communication channels. IEEE Transactions on Information Theory IT-33, 2 (Mar. 1987).

[2] Borrelli, R., and Coleman, C. Differential Equations. Prentice-Hall Inc., 1987.

[3] Chiu, D.-M., and Jain, R. Networks with a connectionless network layer; part iii: Analysis of the increase and decrease algorithms. Tech. Rep. DEC-TR-509, Digital Equipment Corporation, Stanford, CA, Aug. 1987.

[4] Clark, D. Window and Acknowlegement Strategy in TCP. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, July 1982. RFC-813.

[5] Edge, S. W. An adaptive timeout algorithm for retransmission across a packet switching network. In Proceedings of SIGCOMM ’83 (Mar. 1983), ACM.

[6] Feller, W. Probability Theory and its Applications, second ed., vol. II. John Wiley & Sons, 1971.

[7] Hajek, B. Stochastic approximation methods for decentralized control of multiaccess communications. IEEE Transactions on Information Theory IT-31, 2 (Mar. 1985).

[8] Hajek, B., and Van Loon, T. Decentralized dynamic control of a multiaccess broadcast channel. IEEE Transactions on Automatic Control AC-27, 3 (June 1982).

[9] Proceedings of the Sixth Internet Engineering Task Force (Boston, MA, Apr. 1987). Proceedings available as NIC document IETF-87/2P from DDN Network Information Center, SRI International, Menlo Park, CA.

[10] International Organization for Standardization. ISO International Standard 8473, Information Processing Systems — Open Systems Interconnection — Connectionless-mode Network Service Protocol Specification, Mar. 1986.

[11] Jacobson, V. Congestion avoidance and control. In Proceedings of SIGCOMM ’88 (Stanford, CA, Aug. 1988), ACM.

[12] Jain, R. Divergence of timeout algorithms for packet retransmissions. In Proceedings Fifth Annual International Phoenix Conference on Computers and Communications (Scottsdale, AZ, Mar. 1986).

[13] Jain, R. A timeout-based congestion control scheme for window flow-controlled networks. IEEE Journal on Selected Areas in Communications SAC-4, 7 (Oct. 1986).

[14] Jain, R., Ramakrishnan, K., and Chiu, D.-M. Congestion avoidance in computer networks with a connectionless network layer. Tech. Rep. DEC-TR-506, Digital Equipment Corporation, Aug. 1987.

[15] Karn, P., and Partridge, C. Estimating round-trip times in reliable transport protocols. In Proceedings of SIGCOMM ’87 (Aug. 1987), ACM.

[16] Kelly, F. P. Stochastic models of computer communication systems. Journal of the Royal Statistical Society B 47, 3 (1985), 379–395.

[17] Kleinrock, L. Queueing Systems, vol. II. John Wiley & Sons, 1976.

[18] Kline, C. Supercomputers on the Internet: A case study. In Proceedings of SIGCOMM ’87 (Aug. 1987), ACM.

[19] Ljung, L., and Soderstrom, T. Theory and Practice of Recursive Identification. MIT Press, 1983.

[20] Luenberger, D. G. Introduction to Dynamic Systems. John Wiley & Sons, 1979.

[21] Mills, D. Internet Delay Experiments. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, Dec. 1983. RFC-889.

[22] Nagle, J. Congestion Control in IP/TCP Internetworks. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, Jan. 1984. RFC-896.

[23] Postel, J., Ed. Transmission Control Protocol Specification. SRI International, Menlo Park, CA, Sept. 1981. RFC-793.

[24] Prue, W., and Postel, J. Something A Host Could Do with Source Quench. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, July 1987. RFC-1016.

[25] Romkey, J. A Nonstandard for Transmission of IP Datagrams Over Serial Lines: Slip. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, June 1988. RFC-1055.

[26] Zhang, L. Why TCP timers don’t work well. In Proceedings of SIGCOMM ’86 (Aug. 1986), ACM.