Congestion Avoidance and Control

Van Jacobson
Lawrence Berkeley Laboratory

Michael J. Karels
University of California at Berkeley

November, 1988

Introduction

Computer networks have experienced an explosive growth over the past few years and with that growth have come severe congestion problems. For example, it is now common to see internet gateways drop 10% of the incoming packets because of local buffer overflows. Our investigation of some of these problems has shown that much of the cause lies in transport protocol implementations (not in the protocols themselves): The ‘obvious’ ways to implement a window-based transport protocol can result in exactly the wrong behavior in response to network congestion. We give examples of ‘wrong’ behavior and describe some simple algorithms that can be used to make right things happen. The algorithms are rooted in the idea of achieving network stability by forcing the transport connection to obey a ‘packet conservation’ principle. We show how the algorithms derive from this principle and what effect they have on traffic over congested networks.

In October of ’86, the Internet had the first of what became a series of ‘congestion collapses’. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and two IMP hops) dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. In particular, we wondered if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was “yes”.

Note: This is a very slightly revised version of a paper originally presented at SIGCOMM ’88 [11]. If you wish to reference this work, please cite [11].

This work was supported in part by the U.S. Department of Energy under Contract Number DE-AC03-76SF00098.

This work was supported by the U.S. Department of Commerce, National Bureau of Standards, under Grant Number 60NANB8D0830.

Since that time, we have put seven new algorithms into the 4BSD TCP:

(i) round-trip-time variance estimation

(ii) exponential retransmit timer backoff

(iii) slow-start

(iv) more aggressive receiver ack policy

(v) dynamic window sizing on congestion

(vi) Karn’s clamped retransmit backoff

(vii) fast retransmit

Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet.

This paper is a brief description of (i)–(v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [15]. (vii) is described in a soon-to-be-published RFC (ARPANET “Request for Comments”).

Algorithms (i)–(v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a ‘conservation of packets’ principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them.

By ‘conservation of packets’ we mean that for a connection ‘in equilibrium’, i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call ‘conservative’: A new packet isn’t put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.¹ Observation of the Internet suggests that it was not particularly robust. Why the discrepancy?

¹ A conservative flow means that for any given time, the integral of the packet density around the sender–receiver–sender loop is a constant. Since packets have to ‘diffuse’ around this loop, the integral is sufficiently continuous to be a Lyapunov function for the system. A constant function trivially meets the conditions for Lyapunov stability so the system is stable and any superposition of such systems is stable. (See [2], chap. 11–12 or [20], chap. 9 for excellent introductions to system stability theory.)


There are only three ways for packet conservation to fail:

1. The connection doesn’t get to equilibrium, or

2. A sender injects a new packet before an old packet has exited, or

3. The equilibrium can’t be reached because of resource limits along the path.

In the following sections, we treat each of these in turn.

1 Getting to Equilibrium: Slow-start

Failure (1) has to be from a connection that is either starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a ‘clock’ to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is ‘self clocking’ (fig. 1). Self clocking systems automatically adjust to bandwidth and delay variations and have a wide dynamic range (important considering that TCP spans a range from 800 Mbps Cray channels to 1200 bps packet radio links). But the same thing that makes a self-clocked system stable when it’s running makes it hard to start — to get data flowing there must be acks to clock out packets but to get acks there must be data flowing.

To start the ‘clock’, we developed a slow-start algorithm to gradually increase the amount of data in-transit.² Although we flatter ourselves that the design of this algorithm is rather subtle, the implementation is trivial — one new state variable and three lines of code in the sender:

• Add a congestion window, cwnd, to the per-connection state.

• When starting or restarting after a loss, set cwnd to one packet.

• On each ack for new data, increase cwnd by one packet.

• When sending, send the minimum of the receiver’s advertised window and cwnd.


² Slow-start is quite similar to the CUTE algorithm described in [13]. We didn’t know about CUTE at the time we were developing slow-start but we should have — CUTE preceded our work by several months. When describing our algorithm at the Feb., 1987, Internet Engineering Task Force (IETF) meeting, we called it soft-start, a reference to an electronics engineer’s technique to limit in-rush current. The name slow-start was coined by John Nagle in a message to the IETF mailing list in March, ’87. This name was clearly superior to ours and we promptly adopted it.


Actually, the slow-start window increase isn’t that slow: it takes time $R \log_2 W$ where $R$ is the round-trip-time and $W$ is the window size in packets (fig. 2). This means the window opens quickly enough to have a negligible effect on performance, even on links with a large bandwidth–delay product. And the algorithm guarantees that a connection will source data at a rate at most twice the maximum possible on the path. Without slow-start, by contrast, when 10 Mbps Ethernet hosts talk over the 56 Kbps Arpanet via IP gateways, the first-hop gateway sees a burst of eight packets delivered at 200 times the path bandwidth. This burst of packets often puts the connection into a persistent failure mode of continuous retransmissions (figures 3 and 4).
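The sender actions above translate almost directly into code. Below is a minimal sketch in C, in whole-packet units; the names (struct tcp_conn, ss_restart and so on) are illustrative, not taken from the 4BSD sources:

    /* Per-connection congestion control state (illustrative). */
    struct tcp_conn {
        unsigned cwnd;     /* congestion window, in packets */
        unsigned rcv_wnd;  /* receiver's advertised window, in packets */
    };

    /* When starting, or restarting after a loss, set cwnd to one packet. */
    void ss_restart(struct tcp_conn *tc) { tc->cwnd = 1; }

    /* On each ack for new data, open the window by one packet. */
    void ss_on_ack(struct tcp_conn *tc) { tc->cwnd += 1; }

    /* When sending, never exceed min(cwnd, receiver's advertised window). */
    unsigned ss_send_window(const struct tcp_conn *tc) {
        return tc->cwnd < tc->rcv_wnd ? tc->cwnd : tc->rcv_wnd;
    }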

2 Conservation at equilibrium: round-trip timing

Once data is flowing reliably, problems (2) and (3) should be addressed. Assuming that the protocol implementation is correct, (2) must represent a failure of the sender’s retransmit timer. A good round trip time estimator, the core of the retransmit timer, is the single most important feature of any protocol implementation that expects to survive heavy load. And it is frequently botched ([26] and [12] describe typical problems).

One mistake is not estimating the variation, $\sigma_R$, of the round trip time, $R$. From queuing theory we know that $R$ and the variation in $R$ increase quickly with load. If the load is $\rho$ (the ratio of average arrival rate to average departure rate), $R$ and $\sigma_R$ scale like $(1-\rho)^{-1}$. To make this concrete, if the network is running at 75% of capacity, as the Arpanet was in last April’s collapse, one should expect round-trip-time to vary by a factor of sixteen ($-2\sigma$ to $+2\sigma$).

The TCP protocol specification [23] suggests estimating mean round trip time via the low-pass filter

$$R \leftarrow \alpha R + (1 - \alpha) M$$

where $R$ is the average RTT estimate, $M$ is a round trip time measurement from the most recently acked data packet, and $\alpha$ is a filter gain constant with a suggested value of 0.9. Once the $R$ estimate is updated, the retransmit timeout interval, $rto$, for the next packet sent is set to $\beta R$.

The parameter $\beta$ accounts for RTT variation (see [4], section 5). The suggested $\beta = 2$ can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only been delayed in transit. This forces the network to do useless work, wasting bandwidth on duplicates of packets that will eventually be delivered, at a time when it’s known to be having trouble with useful work. I.e., this is the network equivalent of pouring gasoline on a fire.
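For concreteness, here is a sketch of the RFC793 estimator just described, in C and in floating point for clarity; the constants follow the suggested values above and the function name is illustrative:

    #define ALPHA 0.9   /* filter gain suggested by [23] */
    #define BETA  2.0   /* RTT variation fudge factor    */

    static double srtt;  /* smoothed RTT estimate R; in practice,
                          * seed this with the first measurement */

    /* Fold in a new RTT measurement M and return the retransmit
     * timeout rto = BETA * R. */
    double rfc793_rto(double m) {
        srtt = ALPHA * srtt + (1.0 - ALPHA) * m;
        return BETA * srtt;
    }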


Figure 1: Window Flow Control ‘Self-clocking’

This is a schematic representation of a sender and receiver on high bandwidth networks connected by a slower, long-haul net. The sender is just starting and has shipped a window’s worth of packets, back-to-back. The ack for the first of those packets is about to arrive back at the sender (the vertical line at the mouth of the lower left funnel). The vertical dimension is bandwidth, the horizontal dimension is time. Each of the shaded boxes is a packet. Bandwidth × Time = Bits so the area of each box is the packet size. The number of bits doesn’t change as a packet goes through the network so a packet squeezed into the smaller long-haul bandwidth must spread out in time. The time $P_b$ represents the minimum packet spacing on the slowest link in the path (the bottleneck). As the packets leave the bottleneck for the destination net, nothing changes the inter-packet interval so on the receiver’s net packet spacing $P_r = P_b$. If the receiver processing time is the same for all packets, the spacing between acks on the receiver’s net $A_r = P_r = P_b$. If the time slot $P_b$ was big enough for a packet, it’s big enough for an ack so the ack spacing is preserved along the return path. Thus the ack spacing on the sender’s net $A_s = P_b$. So, if packets after the first burst are sent only in response to an ack, the sender’s packet spacing will exactly match the packet time on the slowest link in the path.


We developed a cheap method for estimating variation (see appendix A)³ and the resulting retransmit timer essentially eliminates spurious retransmissions. A pleasant side effect of estimating $\beta$ rather than using a fixed value is that low load as well as high load performance improves, particularly over high delay paths such as satellite links (figures 5 and 6).

Another timer mistake is in the backoff after a retransmit: If a packet has to be retransmitted more than once, how should the retransmits be spaced? For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working — exponential backoff — but a proof of this is beyond the scope of this paper.⁴ To finesse a proof, note that a network is, to a very good approximation, a linear system. That is, it is composed of elements that behave like linear operators — integrators, delays, gain stages, etc. Linear system theory says that if a system is stable, the stability is exponential. This suggests that an unstable system (a network subject to random load shocks and prone to congestive collapse⁵) can be stabilized by adding some exponential damping (exponential timer backoff) to its primary excitation (senders, traffic sources).

³ We are far from the first to recognize that transport needs to estimate both mean and variation. See, for example, [5]. But we do think our estimator is simpler than most.

⁴ See [7]. Several authors have shown that backoffs ‘slower’ than exponential are stable given finite populations and knowledge of the global traffic. However, [16] shows that nothing slower than exponential behavior will work in the general case. To feed your intuition, consider that an IP gateway has essentially the same behavior as the ‘ether’ in an ALOHA net or Ethernet. Justifying exponential retransmit backoff is the same as showing that no collision backoff slower than an exponential will guarantee stability on an Ethernet. Unfortunately, with an infinite user population even exponential backoff won’t guarantee stability (although it ‘almost’ does — see [1]). Fortunately, we don’t (yet) have to deal with an infinite user population.
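The exponential backoff itself is simple to state in code. A minimal sketch in C, assuming the interval starts from the estimated rto; the clamp RTO_MAX and the function name are illustrative assumptions, not part of the argument above:

    #define RTO_MAX 64  /* illustrative upper bound on the timeout */

    /* Timeout to use for the nth retransmission of a packet:
     * double the interval on each retransmit, up to the clamp. */
    unsigned backed_off_rto(unsigned rto, unsigned nretrans) {
        while (nretrans-- > 0 && rto < RTO_MAX)
            rto <<= 1;
        return rto < RTO_MAX ? rto : RTO_MAX;
    }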


Figure 2: The Chronology of a Slow-start

[Diagram: packets (grey, numbered) and their acks (white) laid out in one-round-trip-time rows 0R–3R; the horizontal unit is one packet time.]

The horizontal direction is time. The continuous time line has been chopped into one-round-trip-time pieces stacked vertically with increasing time going down the page. The grey, numbered boxes are packets. The white numbered boxes are the corresponding acks. As each ack arrives, two packets are generated: one for the ack (the ack says a packet has left the system so a new packet is added to take its place) and one because an ack opens the congestion window by one packet. It may be clear from the figure why an add-one-packet-to-window policy opens the window exponentially in time. If the local net is much faster than the long haul net, the ack’s two packets arrive at the bottleneck at essentially the same time. These two packets are shown stacked on top of one another (indicating that one of them would have to occupy space in the gateway’s outbound queue). Thus the short-term queue demand on the gateway is increasing exponentially and opening a window of size $W$ packets will require $W/2$ packets of buffer capacity at the bottleneck.


3 Adapting to the path: congestion avoidance

If the timers are in good shape, it is possible to state with some confidence that a timeout indicates a lost packet and not a broken timer. At this point, something can be done about (3). Packets get lost for two reasons: they are damaged in transit, or the network is congested and somewhere on the path there was insufficient buffer capacity. On most network paths, loss due to damage is rare ($\ll$ 1%) so it is probable that a packet loss is due to congestion in the network.⁶

⁵ The phrase congestion collapse (describing a positive feedback instability due to poor retransmit timers) is again the coinage of John Nagle, this time from [22].

⁶ Because a packet loss empties the window, the throughput of any window flow control protocol is quite sensitive to damage loss. For an RFC793 standard TCP running with window $w$ (where $w$ is at most the bandwidth-delay product), a loss probability of $p$ degrades throughput by a factor of $(1 + 2pw)^{-1}$. E.g., a 1% damage loss rate on an Arpanet path (8 packet window) degrades TCP throughput by 14%. The congestion control scheme we propose is insensitive to damage loss until the loss rate is on the order of the window equilibration length (the number of packets it takes the window to regain its original size after a loss). If the pre-loss size is $w$, equilibration takes roughly $w^2/3$ packets so, for the Arpanet, the loss sensitivity threshold is about 5%. At this high loss rate, the empty window effect described above has already degraded throughput by 44% and the additional degradation from the congestion avoidance window shrinking is the least of one’s problems. We are concerned that the congestion control noise sensitivity is quadratic in $w$ but it will take at least another generation of network evolution to reach window sizes where this will be significant. If experience shows this sensitivity to be a liability, a trivial modification to the algorithm makes it linear in $w$. An in-progress paper explores this subject in detail.


A ‘congestion avoidance’ strategy, such as the one proposed in [14], will have two components: The network must be able to signal the transport endpoints that congestion is occurring (or about to occur). And the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn’t received.

If packet loss is (almost) always due to congestion and if a timeout is (almost) always due to a lost packet, we have a good candidate for the ‘network is congested’ signal. Particularly since this signal is delivered automatically by all existing networks, without special modification (as opposed to [14] which requires a new bit in the packet headers and a modification to all existing gateways to set this bit).

The other part of a congestion avoidance strategy, the endnode action, is almost identical in the DEC/ISO scheme and our TCP⁷ and follows directly from a first-order time-series model of the network:⁸ Say network load is measured by average queue length over fixed intervals of some appropriate length (something near the round trip time). If $L_i$ is the load at interval $i$, an uncongested network can be modeled by saying $L_i$ changes slowly compared to the sampling time. I.e., $L_i = N$ ($N$ constant). If the network is subject to congestion, this zeroth order model breaks down. The average queue length becomes the sum of two terms, the $N$ above that accounts for the average arrival rate of new traffic and intrinsic delay, and a new term that accounts for the fraction of traffic left over from the last time interval and the effect of this left-over traffic (e.g., induced retransmits):

$$L_i = N + \gamma L_{i-1}$$

(These are the first two terms in a Taylor series expansion of $L(t)$. There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.)


⁷ This is not an accident: We copied Jain’s scheme after hearing his presentation at [9] and realizing that the scheme was, in a sense, universal.

⁸ See any good control theory text for the relationship between a system model and admissible controls for that system. A nice introduction appears in [20], chap. 8.

When the network is congested, $\gamma$ must be large and the queue lengths will start increasing exponentially.⁹ The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, $W$, we end up with the sender policy

On congestion:

$$W_i = d W_{i-1} \qquad (d < 1)$$

I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists).

If there’s no congestion, $\gamma$ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but says nothing if a connection is using less than its fair share (since the network is stateless, it cannot know this). Thus a connection has to increase its bandwidth utilization to find out the current limit. E.g., you could have been sharing the path with someone else and converged to a window that gives you each half the available bandwidth. If she shuts down, 50% of the bandwidth will be wasted unless your window size is increased. What should the increase policy be?

The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, $W_i = b W_{i-1}$, $1 < b \le 1/d$. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. The analytic reason for this has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [17], chap. 2.1, calls the rush-hour effect).¹⁰ Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable.

⁹ I.e., the system behaves like $L_i \sim \gamma L_{i-1}$, a difference equation with the solution $L_n = \gamma^n L_0$, which goes exponentially to infinity for any $\gamma > 1$.



Without justification, we’ll state that the best increase policy is to make small, constant changes to the window size:¹¹

On no congestion:

$$W_i = W_{i-1} + u \qquad (u \ll W_{max})$$

where $W_{max}$ is the pipe size (the delay-bandwidth product of the path minus protocol overhead — i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [14] and the policy we’ve implemented in TCP. The only difference between the two implementations is the choice of constants for $d$ and $u$. We used 0.5 and 1 for reasons partially explained in appendix D. A more complete analysis is in yet another in-progress paper.

The preceding has probably made the congestion control algorithm sound hairy but it’s not. Like slow-start, it’s three lines of code:

• On any timeout, set cwnd to half the current window size (this is the multiplicative decrease).

• On each ack for new data, increase cwnd by 1/cwnd (this is the additive increase).¹²

• When sending, send the minimum of the receiver’s advertised window and cwnd.
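A minimal sketch of these three actions in C, keeping cwnd as a floating point number of packets for clarity (appendix B does the equivalent in scaled integer units); the names are illustrative and the one-packet floor is an added assumption:

    struct tcp_conn {
        double   cwnd;     /* congestion window, in packets */
        unsigned rcv_wnd;  /* receiver's advertised window, in packets */
    };

    /* On any timeout: multiplicative decrease. */
    void ca_on_timeout(struct tcp_conn *tc) {
        tc->cwnd /= 2;
        if (tc->cwnd < 1)
            tc->cwnd = 1;   /* floor at one packet (assumption) */
    }

    /* On each ack for new data: additive increase, at most one
     * packet per window's worth of acks. */
    void ca_on_ack(struct tcp_conn *tc) {
        tc->cwnd += 1.0 / tc->cwnd;
    }

    /* When sending: min of cwnd and the receiver's window. */
    unsigned ca_send_window(const struct tcp_conn *tc) {
        unsigned w = (unsigned)tc->cwnd;
        return w < tc->rcv_wnd ? w : tc->rcv_wnd;
    }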

Note that this algorithm is only congestion avoidance, it doesn’t include the previously described slow-start. Since the packet loss that signals congestion will result in a re-start, it will almost certainly be necessary to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm.¹³

¹¹ See [3] for a complete analysis of these increase and decrease policies. Also see [7] and [8] for a control-theoretic analysis of a similar class of control policies.

¹² This increment rule may be less than obvious. We want to increase the window by at most one packet over a time interval of length $R$ (the round trip time). To make the algorithm ‘self-clocked’, it’s better to increment by a small amount on each ack rather than by a large amount at the end of the interval. (Assuming, of course, the sender has effective silly window avoidance (see [4], section 3) and doesn’t attempt to send packet fragments because of the fractionally sized window.) A window of size cwnd packets will generate at most cwnd acks in one $R$. Thus an increment of 1/cwnd per ack will increase the window by at most one packet in one $R$. In TCP, windows and packet sizes are in bytes so the increment translates to maxseg*maxseg/cwnd where maxseg is the maximum segment size and cwnd is expressed in bytes, not packets.


Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn’t far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default 4.3BSD window size is eight packets (4 KB). Thus simultaneous conversations between, say, any two hosts at Berkeley and any two hosts at MIT would exceed the network capacity of the UCB–MIT IMP path and would lead¹⁴ to the type of behavior shown.

4 Future work: the gateway side of congestion control

While algorithms at the transport endpoints can insure the network capacity isn’t exceeded, they cannot insure fair sharing of that capacity. Only in gateways, at the convergence of flows, is there enough information to control sharing and fair allocation. Thus, we view the gateway ‘congestion detection’ algorithm as the next big step.

The goal of this algorithm is to send a signal to the endnodes as early as possible, but not so early that the gateway becomes starved for traffic.

¹³ We have also developed a rate-based variant of the congestion avoidance algorithm to apply to connectionless traffic (e.g., domain server queries, RPC requests). Remembering that the goal of the increase and decrease policies is bandwidth adjustment, and that ‘time’ (the controlled parameter in a rate-based scheme) appears in the denominator of bandwidth, the algorithm follows immediately: The multiplicative decrease remains a multiplicative decrease (e.g., double the interval between packets). But subtracting a constant amount from interval does not result in an additive increase in bandwidth. This approach has been tried, e.g., [18] and [24], and appears to oscillate badly. To see why, note that for an inter-packet interval $I$ and decrement $c$, the bandwidth change of a decrease-interval-by-constant policy is

$$\frac{1}{I} \to \frac{1}{I - c}$$

a non-linear, and destabilizing, increase.

An update policy that does result in a linear increase of bandwidth over time is

$$I_i = \frac{\alpha I_{i-1}}{\alpha + I_{i-1}}$$

where $I_i$ is the interval between sends when the $i$-th packet is sent and $\alpha$ is the desired rate of increase in packets per packet/sec. (Taking reciprocals, $1/I_i = 1/I_{i-1} + 1/\alpha$, so the sending rate grows by a constant $1/\alpha$ per send.)

We have simulated the above algorithm and it appears to perform well. To test the predictions of that simulation against reality, we have a cooperative project with Sun Microsystems to prototype RPC dynamic congestion control algorithms using NFS as a test-bed (since NFS is known to have congestion problems yet it would be desirable to have it work over the same range of networks as TCP).

¹⁴ did lead.


Since we plan to continue using packet drops as a congestion signal, gateway ‘self protection’ from a mis-behaving host should fall out for free: That host will simply have most of its packets dropped as the gateway tries to tell it that it’s using more than its fair share. Thus, like the endnode algorithm, the gateway algorithm should reduce congestion even if no endnode is modified to do congestion avoidance. And nodes that do implement congestion avoidance will get their fair share of bandwidth and a minimum number of packet drops.

Since congestion grows exponentially, detecting it early is important. If detected early, small adjustments to the senders’ windows will cure it. Otherwise massive adjustments are necessary to give the net enough spare capacity to pump out the backlog. But, given the bursty nature of traffic, reliable detection is a non-trivial problem. Jain [14] proposes a scheme based on averaging between queue regeneration points. This should yield good burst filtering but we think it might have convergence problems under high load or significant second-order dynamics in the traffic.¹⁵ We plan to use some of our earlier work on ARMAX models for round-trip-time/queue length prediction as the basis of detection. Preliminary results suggest that this approach works well at high load, is immune to second-order effects in the traffic and is computationally cheap enough to not slow down kilopacket-per-second gateways.

Acknowledgements

We are grateful to the members of the Internet Activity Board’s End-to-End and Internet-Engineering task forces for this past year’s interest, encouragement, cogent questions and network insights. Bob Braden of ISI and Craig Partridge of BBN were particularly helpful in the preparation of this paper: their careful reading of early drafts improved it immensely.

The first author is also deeply in debt to Jeff Mogul of DEC Western Research Lab. Without Jeff’s interest and patient prodding, this paper would never have existed.

¹⁵ These problems stem from the fact that the average time between regeneration points scales like $(1-\rho)^{-1}$ and the variance like $(1-\rho)^{-3}$ (see Feller [6], chap. VI.9). Thus the congestion detector becomes sluggish as congestion increases and its signal-to-noise ratio decreases dramatically.


Figure 3: Startup behavior of TCP without Slow-start

Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7). The window size for the connection was 16 KB (32 512-byte packets) and there were 30 packets of buffer available at the bottleneck gateway. The actual path contains six store-and-forward hops so the pipe plus gateway queue has enough capacity for a full window but the gateway queue alone does not. Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit. ‘Desirable’ behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior. The dashed line shows the 20 KBps bandwidth available for this connection. Only 35% of this bandwidth was used; the rest was wasted on retransmits. Almost everything is retransmitted at least once and data from 54 to 58 KB is sent five times.


Figure 4: Startup behavior of TCP with Slow-start

Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+ TCP with slow-start. No bandwidth is wasted on retransmits but two seconds is spent on the slow-start so the effective bandwidth of this part of the trace is 16 KBps — two times better than figure 3. (This is slightly misleading: Unlike the previous figure, the slope of the trace is 20 KBps and the effect of the 2 second offset decreases as the trace lengthens. E.g., if this trace had run a minute, the effective bandwidth would have been 19 KBps. The effective bandwidth without slow-start stays at 7 KBps no matter how long the trace.)


Figure 5: Performance of an RFC793 retransmit timer

Trace data showing per-packet round trip time on a well-behaved Arpanet connection. The x-axis is the packet number (packets were numbered sequentially, starting with one) and the y-axis is the elapsed time from the send of the packet to the sender’s receipt of its ack. During this portion of the trace, no packets were dropped or retransmitted. The packets are indicated by a dot. A dashed line connects them to make the sequence easier to follow. The solid line shows the behavior of a retransmit timer computed according to the rules of RFC793.

Figure 6: Performance of a Mean+Variance retransmit timer

Same data as above but the solid line shows a retransmit timer computed according to the algorithm in appendix A.


Figure 7: Multiple conversation test setup

[Diagram: four Sun 3/50s (polo, hot, surf, vs) on 10 Mbps Ethernets at LBL and four UCB machines (renoir and monet, VAX 750s; vangogh, a VAX 8600; okeeffe, a CCI), connected via IP routers csam (LBL) and cartan (UCB) over a 230.4 Kbps microwave link.]

Test setup to examine the interaction of multiple, simultaneous TCP conversations sharing a bottleneck link. 1 MByte transfers (2048 512-data-byte packets) were initiated 3 seconds apart from four machines at LBL to four machines at UCB, one conversation per machine pair (the dotted lines above show the pairing). All traffic went via a 230.4 Kbps link connecting IP router csam at LBL to IP router cartan at UCB. The microwave link queue can hold up to 50 packets. Each connection was given a window of 16 KB (32 512-byte packets). Thus any two connections could overflow the available buffering and the four connections exceeded the queue capacity by 160%.


Figure 8: Multiple, simultaneous TCPs with no congestion avoidance

[Plot: sequence number (KB) vs. time (sec).]

Trace data from four simultaneous TCP conversations without congestion avoidance over the paths shown in figure 7. 4,000 of 11,000 packets sent were retransmissions (i.e., half the data packets were retransmitted). Since the link data bandwidth is 25 KBps, each of the four conversations should have received 6 KBps. Instead, one conversation got 8 KBps, two got 5 KBps, one got 0.5 KBps and 6 KBps has vanished.


Figure 9: Multiple, simultaneous TCPs with congestion avoidance

[Plot: sequence number (KB) vs. time (sec).]

Trace data from four simultaneous TCP conversations using congestion avoidance over the paths shown in figure 7. 89 of 8281 packets sent were retransmissions (i.e., 1% of the data packets had to be retransmitted). Two of the conversations got 8 KBps and two got 4.5 KBps (i.e., all the link bandwidth is accounted for — see fig. 11). The difference between the high and low bandwidth senders was due to the receivers. The 4.5 KBps senders were talking to 4.3BSD receivers which would delay an ack until 35% of the window was filled or 200 ms had passed (i.e., an ack was delayed for 5–7 packets on the average). This meant the sender would deliver bursts of 5–7 packets on each ack. The 8 KBps senders were talking to 4.3+BSD receivers which would delay an ack for at most one packet (because of an ack’s ‘clock’ rôle, the authors believe that the minimum ack frequency should be every other packet). I.e., the sender would deliver bursts of at most two packets. The probability of loss increases rapidly with burst size so senders talking to old-style receivers saw three times the loss rate (1.8% vs. 0.5%). The higher loss rate meant more time spent in retransmit wait and, because of the congestion avoidance, smaller average window sizes.


Figure 10: Total bandwidth used by old and new TCPs

[Plot: relative bandwidth vs. time (sec).]

The thin line shows the total bandwidth used by the four senders without congestion avoidance (fig. 8), averaged over 5 second intervals and normalized to the 25 KBps link bandwidth. Note that the senders send, on the average, 25% more than will fit in the wire. The thick line is the same data for the senders with congestion avoidance (fig. 9). The first 5 second interval is low (because of the slow-start), then there is about 20 seconds of damped oscillation as the congestion control ‘regulator’ for each TCP finds the correct window size. The remaining time the senders run at the wire bandwidth. (The activity around 110 seconds is a bandwidth ‘re-negotiation’ due to connection one shutting down. The activity around 80 seconds is a reflection of the ‘flat spot’ in fig. 9 where most of conversation two’s bandwidth is suddenly shifted to conversations three and four — competing conversations frequently exhibit this type of ‘punctuated equilibrium’ behavior and we hope to investigate its dynamics in a future paper.)


Figure 11: Effective bandwidth of old and new TCPs

[Plot: relative bandwidth vs. time (sec).]

Figure 10 showed the old TCPs were using 25% more than the bottleneck link bandwidth. Thus, once the bottleneck queue filled, 25% of the senders’ packets were being discarded. If the discards, and only the discards, were retransmitted, the senders would have received the full 25 KBps link bandwidth (i.e., their behavior would have been anti-social but not self-destructive). But fig. 8 noted that around 25% of the link bandwidth was unaccounted for. Here we average the total amount of data acked per five second interval. (This gives the effective or delivered bandwidth of the link.) The thin line is once again the old TCPs. Note that only 75% of the link bandwidth is being used for data (the remainder must have been used by retransmissions of packets that didn’t need to be retransmitted). The thick line shows delivered bandwidth for the new TCPs. There is the same slow-start and turn-on transient followed by a long period of operation right at the link bandwidth.


Figure 12: Window adjustment detail

[Plot: relative bandwidth vs. time (sec).]

Because of the five second averaging time (needed to smooth out the spikes in the old TCP data), the congestion avoidance window policy is difficult to make out in figures 10 and 11. Here we show effective throughput (data acked) for TCPs with congestion control, averaged over a three second interval. When a packet is dropped, the sender sends until it fills the window, then stops until the retransmission timeout. Since the receiver cannot ack data beyond the dropped packet, on this plot we’d expect to see a negative-going spike whose amplitude equals the sender’s window size (minus one packet). If the retransmit happens in the next interval (the intervals were chosen to match the retransmit timeout), we’d expect to see a positive-going spike of the same amplitude when the receiver acks the out-of-order data it cached. Thus the height of these spikes is a direct measure of the sender’s window size. The data clearly shows three of these events (at 15, 33 and 57 seconds) and the window size appears to be decreasing exponentially. The dotted line is a least squares fit to the six window size measurements obtained from these events. The fit time constant was 28 seconds. (The long time constant is due to lack of a congestion avoidance algorithm in the gateway. With a ‘drop’ algorithm running in the gateway, the time constant should be around 4 seconds.)


A A fast algorithm for rtt mean and variation

A.1 Theory

The RFC793 algorithm for estimating the mean round trip time is one of the simplest examples of a class of estimators called recursive prediction error or stochastic gradient algorithms. In the past 20 years these algorithms have revolutionized estimation and control theory [19] and it’s probably worth looking at the RFC793 estimator in some detail.

Given a new measurement $m$ of the RTT (round trip time), TCP updates an estimate of the average RTT $a$ by

$$a \leftarrow (1 - g)a + g m$$

where $g$ is a ‘gain’ ($0 < g < 1$) that should be related to the signal-to-noise ratio (or, equivalently, variance) of $m$. This makes more sense, and computes faster, if we rearrange and collect terms multiplied by $g$ to get

$$a \leftarrow a + g(m - a)$$

Think of $a$ as a prediction of the next measurement. $m - a$ is the error in that prediction and the expression above says we make a new prediction based on the old prediction plus some fraction of the prediction error. The prediction error is the sum of two components: (1) error due to ‘noise’ in the measurement (random, unpredictable effects like fluctuations in competing traffic) and (2) error due to a bad choice of $a$. Calling the random error $E_r$ and the estimation error $E_e$,

$$a \leftarrow a + g E_r + g E_e$$


In integer arithmetic, keeping a scaled version of the average in sa and of the mean deviation in sv (each scaled by the reciprocal of its gain), the estimator updates for a new measurement m are:

    /* update Average estimator */
    m -= (sa >> 3);
    sa += m;

    /* update Deviation estimator */
    if (m < 0)
        m = -m;
    m -= (sv >> 3);
    sv += m;

It’s not necessary to use the same gain for $a$ and $v$. To force the timer to go up quickly in response to changes in the RTT, it’s a good idea to give $v$ a larger gain. In particular, because of window–delay mismatch there are often RTT artifacts at integer multiples of the window size.¹⁷ To filter these, one would like $1/g$ in the $a$ estimator to be at least as large as the window size (in packets) and $1/g$ in the $v$ estimator to be less than the window size.¹⁸

Using a gain of .25 on the deviation and computing the retransmit timer, $rto$, as $a + 4v$, the final timer code looks like:

    m -= (sa >> 3);
    sa += m;
    if (m < 0)
        m = -m;
    m -= (sv >> 2);
    sv += m;
    rto = (sa >> 3) + sv;

In general this computation will correctly round $rto$: Because of the sa truncation when computing $m - a$, sa will converge to the true mean rounded up to the next tick. Likewise with sv. Thus, on the average, there is half a tick of bias in each. The $rto$ computation should be rounded by half a tick and one tick needs to be added to account for sends being phased randomly with respect to the clock. So, the 1.75 tick bias contribution from $4v$ approximately equals the desired half tick rounding plus one tick phase correction.
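For reference, the same arithmetic wrapped in a self-contained routine; a sketch assuming integer timer ticks, with struct rtt_est and rtt_update as illustrative names:

    struct rtt_est {
        int sa;   /* scaled average: 8*a   */
        int sv;   /* scaled deviation: 4*v */
    };

    /* Fold a new RTT measurement m (in ticks) into the estimator
     * and return the retransmit timeout. */
    int rtt_update(struct rtt_est *e, int m) {
        m -= e->sa >> 3;      /* m becomes the prediction error m - a  */
        e->sa += m;           /* a += error/8  (sa holds 8a)           */
        if (m < 0)
            m = -m;           /* |error| drives the deviation update   */
        m -= e->sv >> 2;
        e->sv += m;           /* v += (|error| - v)/4  (sv holds 4v)   */
        return (e->sa >> 3) + e->sv;   /* rto = a + 4v                 */
    }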

¹⁷ E.g., see packets 10–50 of figure 5. Note that these window effects are due to characteristics of the Arpa/Milnet subnet. In general, window effects on the timer are at most a second-order consideration and depend a great deal on the underlying network. E.g., if one were using the Wideband with a 256 packet window, 1/256 would not be a good gain for $v$ (1/16 might be).

¹⁸ Although it may not be obvious, the absolute value in the calculation of $v$ introduces an asymmetry in the timer: Because $v$ has the same sign as an increase and the opposite sign of a decrease, more gain in $v$ makes the timer go up quickly and come down slowly, ‘automatically’ giving the behavior suggested in [21]. E.g., see the region between packets 50 and 80 in figure 6.

B The combined slow-start with congestion avoidance algorithm

The sender keeps two state variables for congestion control: a slow-start/congestion window, cwnd, and a threshold size, ssthresh, to switch between the two algorithms. The sender’s output routine always sends the minimum of cwnd and the window advertised by the receiver. On a timeout, half the current window size is recorded in ssthresh (this is the multiplicative decrease part of the congestion avoidance algorithm), then cwnd is set to 1 packet (this initiates slow-start). When new data is acked, the sender does

    if (cwnd < ssthresh)
        /* if we're still doing slow-start
         * open window exponentially */
        cwnd += 1;
    else
        /* otherwise do Congestion Avoidance
         * increment-by-1 */
        cwnd += 1/cwnd;

Thus slow-start opens the window quickly to what congestion avoidance thinks is a safe operating point (half the window that got us into trouble), then congestion avoidance takes over and slowly increases the window size to probe for more bandwidth becoming available on the path.

Note that the else clause of the above code will malfunction if cwnd is an integer in unscaled, one-packet units. I.e., if the maximum window for the path is $w$ packets, cwnd must cover the range $0 \ldots w$ with a resolution of at least $1/w$.¹⁹ Since sending packets smaller than the maximum transmission unit for the path lowers efficiency, the implementor must take care that the fractionally sized cwnd does not result in small packets being sent. In reasonable TCP implementations, existing silly-window avoidance code should prevent runt packets but this point should be carefully checked.

¹⁹ For TCP this happens automatically since windows are expressed in bytes, not packets. For protocols such as ISO TP4, the implementor should scale cwnd so that the calculations above can be done with integer arithmetic and the scale factor should be large enough to avoid the fixed point (zero) of 1/cwnd in the congestion avoidance increment.
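Pulling the pieces of this appendix together, a sketch of the combined scheme in C, using a fixed-point scale so the cwnd += 1/cwnd increment works in integer arithmetic as footnote 19 suggests; SCALE, on_timeout and on_new_ack are illustrative names:

    #define SCALE 1024  /* fixed-point scale for cwnd, in packets */

    static unsigned cwnd = 1 * SCALE;  /* slow-start/congestion window */
    static unsigned ssthresh = ~0u;    /* threshold between algorithms */

    /* On a timeout: record half the current window, then slow-start. */
    void on_timeout(void) {
        ssthresh = cwnd / 2;
        cwnd = 1 * SCALE;
    }

    /* When new data is acked. */
    void on_new_ack(void) {
        if (cwnd < ssthresh)
            cwnd += SCALE;                  /* slow-start: +1 packet per ack */
        else
            cwnd += (SCALE * SCALE) / cwnd; /* avoidance: +1 packet per window */
    }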

C Window adjustment interaction with round-trip timing

Some TCP connections, particularly those over a very low speed link such as a dial-up SLIP line [25], may experience an unfortunate interaction between congestion window adjustment and retransmit timing:


Network paths tend to divide into two classes: delay-dominated, where the store-and-forward and/or transit delays determine the RTT, and bandwidth-dominated, where (bottleneck) link bandwidth and average packet size determine the RTT.²⁰ On a bandwidth-dominated path of bandwidth $b$, a congestion-avoidance window increment of $\Delta w$ will increase the RTT of post-increment packets by

$$\Delta R = \frac{\Delta w}{b}$$

If the path RTT variation $v$ is small, $\Delta R$ may exceed the $4v$ cushion in $rto$, a retransmit timeout will occur and, after a few cycles of this, ssthresh (and, thus, cwnd) end up clamped at small values.

The $rto$ calculation in appendix A was designed to prevent this type of spurious retransmission timeout during slow-start. In particular, the RTT variation $v$ is multiplied by four in the $rto$ calculation because of the following: A spurious retransmit occurs if the retransmit timeout computed at the end of slow-start round $i$, $rto_i$, is ever less than or equal to the actual RTT of the next round. In the worst case of all the delay being due the window, $R$ doubles each round (since the window size doubles). Thus $R_{i+1} = 2R_i$ (where $R_i$ is the measured RTT at slow-start round $i$). But

$$v_i \ge R_i - R_{i-1} = R_i/2$$

and

$$rto_i = R_i + 4v_i \ge 3R_i \ge 2R_i \ge R_{i+1}$$

so spurious retransmit timeouts cannot occur.²¹

Spurious retransmission due to a window increase can occur during the congestion avoidance window increment since the window can only be changed in one packet increments so, for a packet size $s$, there may be as many as $s - 1$ packets between increments, long enough for any $v$ increase due to the last window increment to decay away to nothing. But this problem is unlikely on a bandwidth-dominated path since the increments would have to be more than twelve packets apart (the decay time of the $v$ filter times its gain in the $rto$ calculation) which implies that a ridiculously large window is being used for the path.²² Thus one should regard these timeouts as appropriate punishment for gross mis-tuning and their effect will simply be to reduce the window to something more appropriate for the path.

²⁰ E.g., TCP over a 2400 baud packet radio link is bandwidth-dominated since the transmission time for a (typical) 576 byte IP packet is 2.4 seconds, longer than any possible terrestrial transit delay.

²¹ The original SIGCOMM ’88 version of this paper suggested calculating $rto$ as $R + 2v$ rather than $R + 4v$. Since that time we have had much more experience with low speed SLIP links and observed spurious retransmissions during connection startup. An investigation of why these occurred led to the analysis above and the change to the $rto$ calculation in app. A.


Although slow-start and congestion avoidance are designed to not trigger this kind of spurious retransmission, an interaction with higher level protocols frequently does: Application protocols like SMTP and NNTP have a ‘negotiation’ phase where a few packets are exchanged stop-and-wait, followed by a data transfer phase where all of a mail message or news article is sent. Unfortunately, the ‘negotiation’ exchanges open the congestion window so the start of the data transfer phase will dump several packets into the network with no slow-start and, on a bandwidth-dominated path, faster than $rto$ can track the RTT increase caused by these packets. The root cause of this problem is the same one described in sec. 1: dumping too many packets into an empty pipe (the pipe is empty since the negotiation exchange was conducted stop-and-wait) with no ack ‘clock’. The fix proposed in sec. 1, slow-start, will also prevent this problem if the TCP implementation can detect the phase change. And detection is simple: The pipe is empty because we haven’t sent anything for at least a round-trip-time (another way to view RTT is as the time it takes the pipe to empty after the most recent send). So, if nothing has been sent for at least one RTT, the next send should set cwnd to one packet to force a slow-start. I.e., if the connection state variable lastsnd holds the time the last packet was sent, the following code should appear early in the TCP output routine:

    /* true if there is no data in transit (all data sent has been acked) */
    int idle = (snd_max == snd_una);
    if (idle && now - lastsnd >= rto)
        cwnd = 1;   /* force a slow-start on the next send */

The boolean idle is true if there is no data in transit (all data sent has been acked) so the if says “if there’s nothing in transit and we haven’t sent anything for ‘a long time’, slow-start.” Our experience has been that either the current RTT estimate or the $rto$ estimate can be used for ‘a long time’ with good results.²³

²² The largest sensible window for a path is the bottleneck bandwidth times the round-trip delay and, by definition, the delay is negligible for a bandwidth-dominated path so the window should only be a few packets.

²³ The $rto$ estimate is more convenient since it is kept in units of time while RTT is scaled. Also, because of send/receive symmetry, the time of the last receive can be used rather than the last send — if the protocol implements ‘keepalives’, this state variable may already exist.


D Window Adjustment Policy

A reason for using ½ as the decrease term, as opposed to the ⅞ in [14], was the following handwaving: When a packet is dropped, you’re either starting (or restarting after a drop) or steady-state sending. If you’re starting, you know that half the current window size ‘worked’, i.e., that a window’s worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it’s probably because a new connection started up and took some of your bandwidth. We usually run our nets with $\rho \le 0.5$ so it’s probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative — and being conservative at high traffic intensities is probably wise.

Although a factor of two change in window size seems a large performance penalty, in system terms the cost is negligible: Currently, packets are dropped only when a large queue has formed. Even with the ISO IP ‘congestion experienced’ bit [10] to force senders to reduce their windows, we’re stuck with the queue because the bottleneck is running at 100% utilization with no excess bandwidth available to dissipate the queue. If a packet is tossed, some sender shuts up for two RTT, exactly the time needed to empty the queue. If that sender restarts with the correct window size, the queue won’t reform. Thus the delay has been reduced to minimum without the system losing any bottleneck bandwidth.

The 1-packet increase has less justification than the 0.5 decrease. In fact, it’s almost certainly too large. If the algorithm converges to a window size of $w$, there are $O(w^2)$ packets between drops with an additive increase policy. We were shooting for an average drop rate of $<$ 1% and found that on the Arpanet (the worst case of the four networks we tested), windows converged to 8–12 packets. This yields 1 packet increments for a 1% average drop rate.

But, since we’ve done nothing in the gateways, the window we converge to is the maximum the gateway can accept without dropping packets. I.e., in the terms of [14], we are just to the left of the cliff rather than just to the right of the knee. If the gateways are fixed so they start dropping packets when the queue gets pushed past the knee, our increment will be much too aggressive and should be dropped by about a factor of four (since our measurements on an unloaded Arpanet place its ‘pipe size’ at 4–5 packets). It appears trivial to implement a second order control loop to adaptively determine the appropriate increment to use for a path. But second order problems are on hold until we’ve spent some time on the first order part of the algorithm for the gateways.

References

[1] Aldous, D. J. Ultimate instability of exponential back-off protocol for acknowledgment based transmission control of random access communication channels. IEEE Transactions on Information Theory IT-33, 2 (Mar. 1987).

[2] Borrelli, R., and Coleman, C. Differential Equations. Prentice-Hall Inc., 1987.

[3] Chiu, D.-M., and Jain, R. Networks with a connectionless network layer; part iii: Analysis of the increase and decrease algorithms. Tech. Rep. DEC-TR-509, Digital Equipment Corporation, Stanford, CA, Aug. 1987.

[4] Clark, D. Window and Acknowlegement Strategy in TCP. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, July 1982. RFC-813.

[5] Edge, S. W. An adaptive timeout algorithm for retransmission across a packet switching network. In Proceedings of SIGCOMM ’83 (Mar. 1983), ACM.

[6] Feller, W. Probability Theory and its Applications, second ed., vol. II. John Wiley & Sons, 1971.

[7] Hajek, B. Stochastic approximation methods for decentralized control of multiaccess communications. IEEE Transactions on Information Theory IT-31, 2 (Mar. 1985).

[8] Hajek, B., and Van Loon, T. Decentralized dynamic control of a multiaccess broadcast channel. IEEE Transactions on Automatic Control AC-27, 3 (June 1982).

[9] Proceedings of the Sixth Internet Engineering Task Force (Boston, MA, Apr. 1987). Proceedings available as NIC document IETF-87/2P from DDN Network Information Center, SRI International, Menlo Park, CA.

[10] International Organization for Standardization. ISO International Standard 8473, Information Processing Systems — Open Systems Interconnection — Connectionless-mode Network Service Protocol Specification, Mar. 1986.

[11] Jacobson, V. Congestion avoidance and control. In Proceedings of SIGCOMM ’88 (Stanford, CA, Aug. 1988), ACM.

[12] Jain, R. Divergence of timeout algorithms for packet retransmissions. In Proceedings Fifth Annual International Phoenix Conference on Computers and Communications (Scottsdale, AZ, Mar. 1986).

[13] Jain, R. A timeout-based congestion control scheme for window flow-controlled networks. IEEE Journal on Selected Areas in Communications SAC-4, 7 (Oct. 1986).

[14] Jain, R., Ramakrishnan, K., and Chiu, D.-M. Congestion avoidance in computer networks with a connectionless network layer. Tech. Rep. DEC-TR-506, Digital Equipment Corporation, Aug. 1987.

[15] Karn, P., and Partridge, C. Estimating round-trip times in reliable transport protocols. In Proceedings of SIGCOMM ’87 (Aug. 1987), ACM.

[16] Kelly, F. P. Stochastic models of computer communication systems. Journal of the Royal Statistical Society B 47, 3 (1985), 379–395.

[17] Kleinrock, L. Queueing Systems, vol. II. John Wiley & Sons, 1976.

[18] Kline, C. Supercomputers on the Internet: A case study. In Proceedings of SIGCOMM ’87 (Aug. 1987), ACM.

[19] Ljung, L., and Soderstrom, T. Theory and Practice of Recursive Identification. MIT Press, 1983.

[20] Luenberger, D. G. Introduction to Dynamic Systems. John Wiley & Sons, 1979.

[21] Mills, D. Internet Delay Experiments. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, Dec. 1983. RFC-889.

[22] Nagle, J. Congestion Control in IP/TCP Internetworks. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, Jan. 1984. RFC-896.

[23] Postel, J., Ed. Transmission Control Protocol Specification. SRI International, Menlo Park, CA, Sept. 1981. RFC-793.

[24] Prue, W., and Postel, J. Something A Host Could Do with Source Quench. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, July 1987. RFC-1016.

[25] Romkey, J. A Nonstandard for Transmission of IP Datagrams Over Serial Lines: Slip. ARPANET Working Group Requests for Comment, DDN Network Information Center, SRI International, Menlo Park, CA, June 1988. RFC-1055.

[26] Zhang, L. Why TCP timers don’t work well. In Proceedings of SIGCOMM ’86 (Aug. 1986), ACM.