2.5 reliable transmission - github pages · the sliding window algorithm the sliding window...

27
2.5 Reliable Transmission As we saw in the previous section, frames are sometimes corrupted while in transit, with an error code like CRC used to detect such errors. While some error codes are strong enough also to correct errors, in practice the overhead is typically too large to handle the range of bit and burst errors that can be introduced on a network link. Even when error-correcting codes are used (e.g., on wireless links) some errors will be too severe to be corrected. As a result, some corrupt frames must be discarded. A link-level protocol that wants to deliver frames reliably must somehow recover from these discarded (lost) frames. It's worth noting that reliability is a function that may be provided at the link level, but many modern link technologies omit this function. Furthermore, reliable delivery is frequently provided at higher levels, including both transport and sometimes, the application layer. Exactly where it should be provided is a matter of some debate and depends on many factors. We describe the basics of reliable delivery here, since the principles are common across layers, but you should be aware that we're not just talking about a link-layer function. Reliable delivery is usually accomplished using a combination of two fundamental mechanisms—acknowledgments and timeouts. An acknowledgment (ACK for short) is a small control frame that a protocol sends back to its peer saying that it has received an earlier frame. By control frame we mean a header without any data, although a protocol can piggyback an ACK on a data frame it just happens to be sending in the opposite direction. The receipt of an acknowledgment indicates to the sender of the original frame that its frame was successfully delivered. If the sender does not receive an acknowledgment after a reasonable amount of time, then it retransmits the original frame. This action of waiting a reasonable amount of time is called a timeout. The general strategy of using acknowledgments and timeouts to implement reliable delivery is sometimes called automatic repeat request (abbreviated ARQ). This section describes three different ARQ algorithms using generic language; that is, we do not give detailed information about a particular protocol's header fields. Stop-and-Wait The simplest ARQ scheme is the stop-and-wait algorithm. The idea of stop-and-wait is straightforward: After transmitting one frame, the sender waits for an acknowledgment before transmitting the next frame. If the acknowledgment does not arrive after a certain period of time, the sender times out and retransmits the original frame. 2.5 Reliable Transmission 62

Upload: others

Post on 16-Aug-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

2.5ReliableTransmission

Aswesawintheprevioussection,framesaresometimescorruptedwhileintransit,withanerrorcodelikeCRCusedtodetectsucherrors.Whilesomeerrorcodesarestrongenoughalsotocorrecterrors,inpracticetheoverheadistypicallytoolargetohandletherangeofbitandbursterrorsthatcanbeintroducedonanetworklink.Evenwhenerror-correctingcodesareused(e.g.,onwirelesslinks)someerrorswillbetooseveretobecorrected.Asaresult,somecorruptframesmustbediscarded.Alink-levelprotocolthatwantstodeliverframesreliablymustsomehowrecoverfromthesediscarded(lost)frames.

It'sworthnotingthatreliabilityisafunctionthatmaybeprovidedatthelinklevel,butmanymodernlinktechnologiesomitthisfunction.Furthermore,reliabledeliveryisfrequentlyprovidedathigherlevels,includingbothtransportandsometimes,theapplicationlayer.Exactlywhereitshouldbeprovidedisamatterofsomedebateanddependsonmanyfactors.Wedescribethebasicsofreliabledeliveryhere,sincetheprinciplesarecommonacrosslayers,butyoushouldbeawarethatwe'renotjusttalkingaboutalink-layerfunction.

Reliabledeliveryisusuallyaccomplishedusingacombinationoftwofundamentalmechanisms—acknowledgmentsandtimeouts.Anacknowledgment(ACKforshort)isasmallcontrolframethataprotocolsendsbacktoitspeersayingthatithasreceivedanearlierframe.Bycontrolframewemeanaheaderwithoutanydata,althoughaprotocolcanpiggybackanACKonadataframeitjusthappenstobesendingintheoppositedirection.Thereceiptofanacknowledgmentindicatestothesenderoftheoriginalframethatitsframewassuccessfullydelivered.Ifthesenderdoesnotreceiveanacknowledgmentafterareasonableamountoftime,thenitretransmitstheoriginalframe.Thisactionofwaitingareasonableamountoftimeiscalledatimeout.

Thegeneralstrategyofusingacknowledgmentsandtimeoutstoimplementreliabledeliveryissometimescalledautomaticrepeatrequest(abbreviatedARQ).ThissectiondescribesthreedifferentARQalgorithmsusinggenericlanguage;thatis,wedonotgivedetailedinformationaboutaparticularprotocol'sheaderfields.

Stop-and-Wait

ThesimplestARQschemeisthestop-and-waitalgorithm.Theideaofstop-and-waitisstraightforward:Aftertransmittingoneframe,thesenderwaitsforanacknowledgmentbeforetransmittingthenextframe.Iftheacknowledgmentdoesnotarriveafteracertainperiodoftime,thesendertimesoutandretransmitstheoriginalframe.

2.5ReliableTransmission

62

Page 2: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure1.Timelineshowingfourdifferentscenariosforthestop-and-waitalgorithm.(a)TheACKisreceivedbeforethetimerexpires;(b)theoriginalframeislost;(c)theACKislost;(d)thetimeoutfirestoosoon.

Figure1illustratestimeslinesforfourdifferentscenariosthatresultfromthisbasicalgorithm.Thesendingsideisrepresentedontheleft,thereceivingsideisdepictedontheright,andtimeflowsfromtoptobottom.Figure1(a)showsthesituationinwhichtheACKisreceivedbeforethetimerexpires;(b)and(c)showthesituationinwhichtheoriginalframeandtheACK,respectively,arelost;and(d)showsthesituationinwhichthetimeoutfirestoosoon.Recallthatby"lost"wemeanthattheframewascorruptedwhileintransit,thatthiscorruptionwasdetectedbyanerrorcodeonthereceiver,andthattheframewassubsequentlydiscarded.

Thepackettimelinesshowninthissectionareexamplesofafrequentlyusedtoolinteaching,explaining,anddesigningprotocols.Theyareusefulbecausetheycapturevisuallythebehaviorovertimeofadistributedsystem—somethingthatcanbequitehardtoanalyze.Whendesigningaprotocol,youoftenhavetobepreparedfortheunexpected—asystemcrashes,amessagegetslost,orsomethingthatyouexpectedtohappenquicklyturnsouttotakealongtime.Thesesortsofdiagramscanoftenhelpusunderstandwhatmightgowronginsuchcasesandthushelpaprotocoldesignerbepreparedforeveryeventuality.

Thereisoneimportantsubtletyinthestop-and-waitalgorithm.Supposethesendersendsaframeandthereceiveracknowledgesit,buttheacknowledgmentiseitherlostordelayedinarriving.Thissituationisillustratedintimelines(c)and(d)ofFigure1.Inbothcases,thesendertimesoutandretransmitstheoriginalframe,butthereceiverwillthinkthatitisthenextframe,sinceitcorrectlyreceivedandacknowledgedthefirstframe.Thishasthepotentialtocauseduplicatecopiesofaframetobedelivered.Toaddressthisproblem,theheaderforastop-and-waitprotocolusuallyincludesa1-bitsequencenumber—thatis,thesequencenumbercantakeonthevalues0and1—andthesequencenumbersusedforeachframealternate,asillustratedinFigure2.Thus,whenthesenderretransmitsframe0,thereceivercandeterminethatitisseeingasecondcopyofframe0ratherthanthefirstcopyofframe1andthereforecanignoreit(thereceiverstillacknowledgesit,incasethefirstACKwaslost).

2.5ReliableTransmission

63

Page 3: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure2.Timelineforstop-and-waitwith1-bitsequencenumber.

Themainshortcomingofthestop-and-waitalgorithmisthatitallowsthesendertohaveonlyoneoutstandingframeonthelinkatatime,andthismaybefarbelowthelink'scapacity.Consider,forexample,a1.5-Mbpslinkwitha45-msround-triptime.Thislinkhasadelay×bandwidthproductof67.5Kb,orapproximately8KB.SincethesendercansendonlyoneframeperRTT,andassumingaframesizeof1KB,thisimpliesamaximumsendingrateof

Bits-Per-Frame/Time-Per-Frame=1024x8/0.045=182kbps

oraboutone-eighthofthelink'scapacity.Tousethelinkfully,then,we'dlikethesendertobeabletotransmituptoeightframesbeforehavingtowaitforanacknowledgment.

Thesignificanceofthedelay×bandwidthproductisthatitrepresentstheamountofdatathatcouldbeintransit.Wewouldliketobeabletosendthismuchdatawithoutwaitingforthefirstacknowledgment.Theprincipleatworkhereisoftenreferredtoaskeepingthepipefull.Thealgorithmspresentedinthefollowingtwosubsectionsdoexactlythis.

SlidingWindowConsideragainthescenarioinwhichthelinkhasadelay×bandwidthproductof8KBandframesare1KBinsize.WewouldlikethesendertobereadytotransmittheninthframeatprettymuchthesamemomentthattheACKforthefirstframearrives.Thealgorithmthatallowsustodothisiscalledslidingwindow,andanillustrativetimelineisgiveninFigure3.

Figure3.Timelinefortheslidingwindowalgorithm.

2.5ReliableTransmission

64

Page 4: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

TheSlidingWindowAlgorithm

Theslidingwindowalgorithmworksasfollows.First,thesenderassignsasequencenumber,denotedSeqNum,toeachframe.Fornow,let'signorethefactthatSeqNumisimplementedbyafinite-sizeheaderfieldandinsteadassumethatitcangrowinfinitelylarge.Thesendermaintainsthreevariables:Thesendwindowsize,denotedSWS,givestheupperboundonthenumberofoutstanding(unacknowledged)framesthatthesendercantransmit;LARdenotesthesequencenumberofthelastacknowledgmentreceived;andLFSdenotesthesequencenumberofthelastframesent.Thesenderalsomaintainsthefollowinginvariant:

LFS-LAR<=SWS

ThissituationisillustratedinFigure4.

Figure4.Slidingwindowonsender.

Whenanacknowledgmentarrives,thesendermovesLARtotheright,therebyallowingthesendertotransmitanotherframe.Also,thesenderassociatesatimerwitheachframeittransmits,anditretransmitstheframeshouldthetimerexpirebeforeanACKisreceived.NoticethatthesenderhastobewillingtobufferuptoSWSframessinceitmustbepreparedtoretransmitthemuntiltheyareacknowledged.

Thereceivermaintainsthefollowingthreevariables:Thereceivewindowsize,denotedRWS,givestheupperboundonthenumberofout-of-orderframesthatthereceiveriswillingtoaccept;LAFdenotesthesequencenumberofthelargestacceptableframe;andLFRdenotesthesequencenumberofthelastframereceived.Thereceiveralsomaintainsthefollowinginvariant:

LAF-LFR<=RWS

ThissituationisillustratedinFigure5.

Figure5.Slidingwindowonreceiver.

WhenaframewithsequencenumberSeqNumarrives,thereceivertakesthefollowingaction.IfSeqNum<=LFRorSeqNum>LAF,thentheframeisoutsidethereceiver'swindowanditisdiscarded.IfLFR<SeqNum<=LAF,thentheframeiswithinthereceiver'swindowanditisaccepted.NowthereceiverneedstodecidewhetherornottosendanACK.LetSeqNumToAckdenotethelargestsequencenumbernotyetacknowledged,suchthatallframeswithsequencenumberslessthanorequaltoSeqNumToAckhavebeenreceived.ThereceiveracknowledgesthereceiptofSeqNumToAck,evenifhighernumberedpacketshavebeenreceived.Thisacknowledgmentissaidtobecumulative.ItthensetsLFR=SeqNumToAckandadjustsLAF=LFR+RWS.

Forexample,supposeLFR=5(i.e.,thelastACKthereceiversentwasforsequencenumber5),andRWS=4.ThisimpliesthatLAF=9.Shouldframes7and8arrive,theywillbebufferedbecausetheyarewithinthereceiver'swindow.However,noACKneedstobesentsinceframe6hasyettoarrive.Frames7and8aresaidtohavearrivedoutoforder.(Technically,thereceivercouldresendanACKforframe5whenframes7and8arrive.)Shouldframe6

2.5ReliableTransmission

65

Page 5: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

thenarrive—perhapsitislatebecauseitwaslostthefirsttimeandhadtoberetransmitted,orperhapsitwassimplydelayed—thereceiveracknowledgesframe8,bumpsLFRto8,andsetsLAFto12.Ifframe6wasinfactlost,thenatimeoutwillhaveoccurredatthesender,causingittoretransmitframe6.

It'sunlikelythatapacketcouldbedelayedonapoint-to-pointlink,thissamealgorithmisusedonmulti-hopconnectionswheresuchdelaysarepossible.

Weobservethatwhenatimeoutoccurs,theamountofdataintransitdecreases,sincethesenderisunabletoadvanceitswindowuntilframe6isacknowledged.Thismeansthatwhenpacketlossesoccur,thisschemeisnolongerkeepingthepipefull.Thelongerittakestonoticethatapacketlosshasoccurred,themoreseverethisproblembecomes.

Noticethat,inthisexample,thereceivercouldhavesentanegativeacknowledgment(NAK)forframe6assoonasframe7arrived.However,thisisunnecessarysincethesender'stimeoutmechanismissufficienttocatchthissituation,andsendingNAKsaddsadditionalcomplexitytothereceiver.Also,aswementioned,itwouldhavebeenlegitimatetosendadditionalacknowledgmentsofframe5whenframes7and8arrived;insomecases,asendercanuseduplicateACKsasacluethataframewaslost.Bothapproacheshelptoimproveperformancebyallowingearlydetectionofpacketlosses.

Yetanothervariationonthisschemewouldbetouseselectiveacknowledgments.Thatis,thereceivercouldacknowledgeexactlythoseframesithasreceivedratherthanjustthehighestnumberedframereceivedinorder.So,intheaboveexample,thereceivercouldacknowledgethereceiptofframes7and8.Givingmoreinformationtothesendermakesitpotentiallyeasierforthesendertokeepthepipefullbutaddscomplexitytotheimplementation.

Thesendingwindowsizeisselectedaccordingtohowmanyframeswewanttohaveoutstandingonthelinkatagiventime;SWSiseasytocomputeforagivendelay×bandwidthproduct.Ontheotherhand,thereceivercansetRWStowhateveritwants.TwocommonsettingsareRWS=1,whichimpliesthatthereceiverwillnotbufferanyframesthatarriveoutoforder,andRWS=SWS,whichimpliesthatthereceivercanbufferanyoftheframesthesendertransmits.ItmakesnosensetosetRWS>SWSsinceit'simpossibleformorethanSWSframestoarriveoutoforder.

FiniteSequenceNumbersandSlidingWindow

Wenowreturntotheonesimplificationweintroducedintothealgorithm—ourassumptionthatsequencenumberscangrowinfinitelylarge.Inpractice,ofcourse,aframe'ssequencenumberisspecifiedinaheaderfieldofsomefinitesize.Forexample,a3-bitfieldmeansthatthereareeightpossiblesequencenumbers,0..7.Thismakesitnecessarytoreusesequencenumbersor,statedanotherway,sequencenumberswraparound.Thisintroducestheproblemofbeingabletodistinguishbetweendifferentincarnationsofthesamesequencenumbers,whichimpliesthatthenumberofpossiblesequencenumbersmustbelargerthanthenumberofoutstandingframesallowed.Forexample,stop-and-waitallowedoneoutstandingframeatatimeandhadtwodistinctsequencenumbers.

Supposewehaveonemorenumberinourspaceofsequencenumbersthanwehavepotentiallyoutstandingframes;thatis,SWS<=MaxSeqNum-1,whereMaxSeqNumisthenumberofavailablesequencenumbers.Isthissufficient?TheanswerdependsonRWS.IfRWS=1,thenMaxSeqNum>=SWS+1issufficient.IfRWSisequaltoSWS,thenhavingaMaxSeqNumjustonegreaterthanthesendingwindowsizeisnotgoodenough.Toseethis,considerthesituationinwhichwehavetheeightsequencenumbers0through7,andSWS=RWS=7.Supposethesendertransmitsframes0..6,theyaresuccessfullyreceived,buttheACKsarelost.Thereceiverisnowexpectingframes7,0..5,butthesendertimesoutandsendsframes0..6.Unfortunately,thereceiverisexpectingthesecondincarnationofframes0..5butgetsthefirstincarnationoftheseframes.Thisisexactlythesituationwewantedtoavoid.

ItturnsoutthatthesendingwindowsizecanbenomorethanhalfasbigasthenumberofavailablesequencenumberswhenRWS=SWS,orstatedmoreprecisely,

SWS<(MaxSeqNum+1)/2

2.5ReliableTransmission

66

Page 6: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Intuitively,whatthisissayingisthattheslidingwindowprotocolalternatesbetweenthetwohalvesofthesequencenumberspace,justasstop-and-waitalternatesbetweensequencenumbers0and1.Theonlydifferenceisthatitcontinuallyslidesbetweenthetwohalvesratherthandiscretelyalternatingbetweenthem.

NotethatthisruleisspecifictothesituationwhereRWS=SWS.WeleaveitasanexercisetodeterminethemoregeneralrulethatworksforarbitraryvaluesofRWSandSWS.Alsonotethattherelationshipbetweenthewindowsizeandthesequencenumberspacedependsonanassumptionthatissoobviousthatitiseasytooverlook,namelythatframesarenotreorderedintransit.Thiscannothappenonadirectpoint-to-pointlinksincethereisnowayforoneframetoovertakeanotherduringtransmission.However,wewillseetheslidingwindowalgorithmusedinadifferentenvironments,andwewillneedtodeviseanotherrule.

ImplementationofSlidingWindow

Thefollowingroutinesillustratehowwemightimplementthesendingandreceivingsidesoftheslidingwindowalgorithm.Theroutinesaretakenfromaworkingprotocolnamed,appropriatelyenough,SlidingWindowProtocol(SWP).Soasnottoconcernourselveswiththeadjacentprotocolsintheprotocolgraph,wedenotetheprotocolsittingaboveSWPasthehigh-levelprotocol(HLP)andtheprotocolsittingbelowSWPasthelink-levelprotocol(LLP).

Westartbydefiningapairofdatastructures.First,theframeheaderisverysimple:Itcontainsasequencenumber(SeqNum)andanacknowledgmentnumber(AckNum).ItalsocontainsaFlagsfieldthatindicateswhethertheframeisanACKorcarriesdata.

typedefu_charSwpSeqno;

typedefstruct{SwpSeqnoSeqNum;/*sequencenumberofthisframe*/SwpSeqnoAckNum;/*ackofreceivedframe*/u_charFlags;/*upto8bitsworthofflags*/}SwpHdr;

Next,thestateoftheslidingwindowalgorithmhasthefollowingstructure.Forthesendingsideoftheprotocol,thisstateincludesvariablesLARandLFS,asdescribedearlierinthissection,aswellasaqueuethatholdsframesthathavebeentransmittedbutnotyetacknowledged(sendQ).ThesendingstatealsoincludesacountingsemaphorecalledsendWindowNotFull.Wewillseehowthisisusedbelow,butgenerallyasemaphoreisasynchronizationprimitivethatsupportssemWaitandsemSignaloperations.EveryinvocationofsemSignalincrementsthesemaphoreby1,andeveryinvocationofsemWaitdecrementssby1,withthecallingprocessblocked(suspended)shoulddecrementingthesemaphorecauseitsvaluetobecomelessthan0.AprocessthatisblockedduringitscalltosemWaitwillbeallowedtoresumeassoonasenoughsemSignaloperationshavebeenperformedtoraisethevalueofthesemaphoreabove0.

Forthereceivingsideoftheprotocol,thestateincludesthevariableNFE.Thisisthenextframeexpected,theframewithasequencenumberonemorethatthelastframereceived(LFR),describedearlierinthissection.Thereisalsoaqueuethatholdsframesthathavebeenreceivedoutoforder(recvQ).Finally,althoughnotshown,thesenderandreceiverslidingwindowsizesaredefinedbyconstantsSWSandRWS,respectively.

typedefstruct{/*sendersidestate:*/SwpSeqnoLAR;/*seqnooflastACKreceived*/SwpSeqnoLFS;/*lastframesent*/SemaphoresendWindowNotFull;SwpHdrhdr;/*pre-initializedheader*/structsendQ_slot{Eventtimeout;/*eventassociatedwithsend-timeout*/Msgmsg;}sendQ[SWS];

/*receiversidestate:*/

2.5ReliableTransmission

67

Page 7: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

SwpSeqnoNFE;/*seqnoofnextframeexpected*/structrecvQ_slot{intreceived;/*ismsgvalid?*/Msgmsg;}recvQ[RWS];}SwpState;

ThesendingsideofSWPisimplementedbyproceduresendSWP.Thisroutineisrathersimple.First,semWaitcausesthisprocesstoblockonasemaphoreuntilitisOKtosendanotherframe.Onceallowedtoproceed,sendSWPsetsthesequencenumberintheframe'sheader,savesacopyoftheframeinthetransmitqueue(sendQ),schedulesatimeouteventtohandlethecaseinwhichtheframeisnotacknowledged,andsendstheframetothenext-lower-levelprotocol,whichwedenoteasLINK.

Onedetailworthnotingisthecalltostore_swp_hdrjustbeforethecalltomsgAddHdr.ThisroutinetranslatestheCstructurethatholdstheSWPheader(state->hdr)intoabytestringthatcanbesafelyattachedtothefrontofthemessage(hbuf).Thisroutine(notshown)musttranslateeachintegerfieldintheheaderintonetworkbyteorderandremoveanypaddingthatthecompilerhasaddedtotheCstructure.Theissueofbyteorderisanon-trivialissue,butfornowitisenoughtoassumethatthisroutineplacesthemostsignificantbitofamultiwordintegerinthebytewiththehighestaddress.

AnotherpieceofcomplexityinthisroutineistheuseofsemWaitandthesendWindowNotFullsemaphore.sendWindowNotFullisinitializedtothesizeofthesender'sslidingwindow,SWS(thisinitializationisnotshown).Eachtimethesendertransmitsaframe,thesemWaitoperationdecrementsthiscountandblocksthesendershouldthecountgoto0.EachtimeanACKisreceived,thesemSignaloperationinvokedindeliverSWP(seebelow)incrementsthiscount,thusunblockinganywaitingsender.

staticintsendSWP(SwpState*state,Msg*frame){structsendQ_slot*slot;hbuf[HLEN];

/*waitforsendwindowtoopen*/semWait(&state->sendWindowNotFull);state->hdr.SeqNum=++state->LFS;slot=&state->sendQ[state->hdr.SeqNum%SWS];store_swp_hdr(state->hdr,hbuf);msgAddHdr(frame,hbuf,HLEN);msgSaveCopy(&slot->msg,frame);slot->timeout=evSchedule(swpTimeout,slot,SWP_SEND_TIMEOUT);returnsend(LINK,frame);}

BeforecontinuingtothereceivesideofSWP,weneedtoreconcileaseeminginconsistency.Ontheonehand,wehavebeensayingthatahigh-levelprotocolinvokestheservicesofalow-levelprotocolbycallingthesendoperation,sowewouldexpectthataprotocolthatwantstosendamessageviaSWPwouldcallsend(SWP,packet).Ontheotherhand,theprocedurethatimplementsSWP'ssendoperationiscalledsendSWP,anditsfirstargumentisastatevariable(SwpState).Whatgives?Theansweristhattheoperatingsystemprovidesgluecodethattranslatesthegenericcalltosendintoaprotocol-specificcalltosendSWP.Thisgluecodemapsthefirstargumenttosend(themagicprotocolvariableSWP)intobothafunctionpointertosendSWPandapointertotheprotocolstatethatSWPneedstodoitsjob.Thereasonwehavethehigh-levelprotocolindirectlyinvoketheprotocol-specificfunctionthroughthegenericfunctioncallisthatwewanttolimithowmuchinformationthehigh-levelprotocolhascodedinitaboutthelow-levelprotocol.Thismakesiteasiertochangetheprotocolgraphconfigurationatsometimeinthefuture.

NowwemoveontoSWP'sprotocol-specificimplementationofthedeliveroperation,whichisgiveninproceduredeliverSWP.Thisroutineactuallyhandlestwodifferentkindsofincomingmessages:ACKsforframessentearlierfromthisnodeanddataframesarrivingatthisnode.Inasense,theACKhalfofthisroutineisthecounterparttothe

2.5ReliableTransmission

68

Page 8: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

sendersideofthealgorithmgiveninsendSWP.AdecisionastowhethertheincomingmessageisanACKoradataframeismadebycheckingtheFlagsfieldintheheader.NotethatthisparticularimplementationdoesnotsupportpiggybackingACKsondataframes.

WhentheincomingframeisanACK,deliverSWPsimplyfindstheslotinthetransmitqueue(sendQ)thatcorrespondstotheACK,cancelsthetimeoutevent,andfreestheframesavedinthatslot.ThisworkisactuallydoneinaloopsincetheACKmaybecumulative.TheonlyotherthingtonoticeaboutthiscaseisthecalltosubroutineswpInWindow.Thissubroutine,whichisgivenbelow,ensuresthatthesequencenumberfortheframebeingacknowledgediswithintherangeofACKsthatthesendercurrentlyexpectstoreceive.

Whentheincomingframecontainsdata,deliverSWPfirstcallsmsgStripHdrandload_swp_hdrtoextracttheheaderfromtheframe.Routineload_swp_hdristhecounterparttostore_swp_hdrdiscussedearlier;ittranslatesabytestringintotheCdatastructurethatholdstheSWPheader.deliverSWPthencallsswpInWindowtomakesurethesequencenumberoftheframeiswithintherangeofsequencenumbersthatitexpects.Ifitis,theroutineloopsoverthesetofconsecutiveframesithasreceivedandpassesthemuptothehigher-levelprotocolbyinvokingthedeliverHLProutine.ItalsosendsacumulativeACKbacktothesender,butdoessobyloopingoverthereceivequeue(itdoesnotusetheSeqNumToAckvariableusedintheprosedescriptiongivenearlierinthissection).

staticintdeliverSWP(SwpStatestate,Msg*frame){SwpHdrhdr;char*hbuf;

hbuf=msgStripHdr(frame,HLEN);load_swp_hdr(&hdr,hbuf)if(hdr->Flags&FLAG_ACK_VALID){/*receivedanacknowledgment—doSENDERside*/if(swpInWindow(hdr.AckNum,state->LAR+1,state->LFS)){do{structsendQ_slot*slot;

slot=&state->sendQ[++state->LAR%SWS];evCancel(slot->timeout);msgDestroy(&slot->msg);semSignal(&state->sendWindowNotFull);}while(state->LAR!=hdr.AckNum);}}

if(hdr.Flags&FLAG_HAS_DATA){structrecvQ_slot*slot;

/*receiveddatapacket—doRECEIVERside*/slot=&state->recvQ[hdr.SeqNum%RWS];if(!swpInWindow(hdr.SeqNum,state->NFE,state->NFE+RWS-1)){/*dropthemessage*/returnSUCCESS;}msgSaveCopy(&slot->msg,frame);slot->received=TRUE;if(hdr.SeqNum==state->NFE){Msgm;

while(slot->received){deliver(HLP,&slot->msg);msgDestroy(&slot->msg);

2.5ReliableTransmission

69

Page 9: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

slot->received=FALSE;slot=&state->recvQ[++state->NFE%RWS];}/*sendACK:*/prepare_ack(&m,state->NFE-1);send(LINK,&m);msgDestroy(&m);}}returnSUCCESS;}

Finally,swpInWindowisasimplesubroutinethatcheckstoseeifagivensequencenumberfallsbetweensomeminimumandmaximumsequencenumber.

staticboolswpInWindow(SwpSeqnoseqno,SwpSeqnomin,SwpSeqnomax){SwpSeqnopos,maxpos;

pos=seqno-min;/*pos*should*beinrange[0..MAX)*/maxpos=max-min+1;/*maxposisinrange[0..MAX]*/returnpos<maxpos;}

FrameOrderandFlowControl

Theslidingwindowprotocolisperhapsthebestknownalgorithmincomputernetworking.Whatiseasilyconfusingaboutthealgorithm,however,isthatitcanbeusedtoservethreedifferentroles.Thefirstroleistheonewehavebeenconcentratingoninthissection—toreliablydeliverframesacrossanunreliablelink.(Ingeneral,thealgorithmcanbeusedtoreliablydelivermessagesacrossanunreliablenetwork.)Thisisthecorefunctionofthealgorithm.

Thesecondrolethattheslidingwindowalgorithmcanserveistopreservetheorderinwhichframesaretransmitted.Thisiseasytodoatthereceiver—sinceeachframehasasequencenumber,thereceiverjustmakessurethatitdoesnotpassaframeuptothenext-higher-levelprotocoluntilithasalreadypassedupallframeswithasmallersequencenumber.Thatis,thereceiverbuffers(i.e.,doesnotpassalong)out-of-orderframes.Theversionoftheslidingwindowalgorithmdescribedinthissectiondoespreserveframeorder,althoughwecouldimagineavariationinwhichthereceiverpassesframestothenextprotocolwithoutwaitingforallearlierframestobedelivered.Aquestionweshouldaskourselvesiswhetherwereallyneedtheslidingwindowprotocoltokeeptheframesinorderatthelinklevel,orwhether,instead,thisfunctionalityshouldbeimplementedbyaprotocolhigherinthestack.

Thethirdrolethattheslidingwindowalgorithmsometimesplaysistosupportflowcontrol—afeedbackmechanismbywhichthereceiverisabletothrottlethesender.Suchamechanismisusedtokeepthesenderfromover-runningthereceiver—thatis,fromtransmittingmoredatathanthereceiverisabletoprocess.Thisisusuallyaccomplishedbyaugmentingtheslidingwindowprotocolsothatthereceivernotonlyacknowledgesframesithasreceivedbutalsoinformsthesenderofhowmanyframesithasroomtoreceive.Thenumberofframesthatthereceiveriscapableofreceivingcorrespondstohowmuchfreebufferspaceithas.Asinthecaseofordereddelivery,weneedtomakesurethatflowcontrolisnecessaryatthelinklevelbeforeincorporatingitintotheslidingwindowprotocol.

Oneimportantconcepttotakeawayfromthisdiscussionisthesystemdesignprinciplewecallseparationofconcerns.Thatis,youmustbecarefultodistinguishbetweendifferentfunctionsthataresometimesrolledtogetherinonemechanism,andyoumustmakesurethateachfunctionisnecessaryandbeingsupportedinthemosteffectiveway.Inthisparticularcase,reliabledelivery,ordereddelivery,andflowcontrolaresometimescombinedinasingleslidingwindowprotocol,andweshouldaskourselvesifthisistherightthingtodoatthelinklevel.

ConcurrentLogicalChannels

2.5ReliableTransmission

70

Page 10: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

ThedatalinkprotocolusedintheoriginalARPANETprovidesaninterestingalternativetotheslidingwindowprotocol,inthatitisabletokeepthepipefullwhilestillusingthesimplestop-and-waitalgorithm.Oneimportantconsequenceofthisapproachisthattheframessentoveragivenlinkarenotkeptinanyparticularorder.Theprotocolalsoimpliesnothingaboutflowcontrol.

TheideaunderlyingtheARPANETprotocol,whichwerefertoasconcurrentlogicalchannels,istomultiplexseverallogicalchannelsontoasinglepoint-to-pointlinkandtorunthestop-and-waitalgorithmoneachoftheselogicalchannels.Thereisnorelationshipmaintainedamongtheframessentonanyofthelogicalchannels,yetbecauseadifferentframecanbeoutstandingoneachoftheseverallogicalchannelsthesendercankeepthelinkfull.

Moreprecisely,thesenderkeeps3bitsofstateforeachchannel:aboolean,sayingwhetherthechanneliscurrentlybusy;the1-bitsequencenumbertousethenexttimeaframeissentonthislogicalchannel;andthenextsequencenumbertoexpectonaframethatarrivesonthischannel.Whenthenodehasaframetosend,itusesthelowestidlechannel,andotherwiseitbehavesjustlikestop-and-wait.

Inpractice,theARPANETsupported8logicalchannelsovereachgroundlinkand16overeachsatellitelink.Intheground-linkcase,theheaderforeachframeincludeda3-bitchannelnumberanda1-bitsequencenumber,foratotalof4bits.Thisisexactlythenumberofbitstheslidingwindowprotocolrequirestosupportupto8outstandingframesonthelinkwhenRWS=SWS.

2.5ReliableTransmission

71

Page 11: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

5.2ReliableByteStream(TCP)

IncontrasttoasimpledemultiplexingprotocollikeUDP,amoresophisticatedtransportprotocolisonethatoffersareliable,connection-oriented,byte-streamservice.Suchaservicehasprovenusefultoawideassortmentofapplicationsbecauseitfreestheapplicationfromhavingtoworryaboutmissingorreordereddata.TheInternet'sTransmissionControlProtocolisprobablythemostwidelyusedprotocolofthistype;itisalsothemostcarefullytuned.ItisforthesetworeasonsthatthissectionstudiesTCPindetail,althoughweidentifyanddiscussalternativedesignchoicesattheendofthesection.

Intermsofthepropertiesoftransportprotocolsgivenintheproblemstatementatthestartofthischapter,TCPguaranteesthereliable,in-orderdeliveryofastreamofbytes.Itisafull-duplexprotocol,meaningthateachTCPconnectionsupportsapairofbytestreams,oneflowingineachdirection.Italsoincludesaflow-controlmechanismforeachofthesebytestreamsthatallowsthereceivertolimithowmuchdatathesendercantransmitatagiventime.Finally,likeUDP,TCPsupportsademultiplexingmechanismthatallowsmultipleapplicationprogramsonanygivenhosttosimultaneouslycarryonaconversationwiththeirpeers.

Inadditiontotheabovefeatures,TCPalsoimplementsahighlytunedcongestion-controlmechanism.TheideaofthismechanismistothrottlehowfastTCPsendsdata,notforthesakeofkeepingthesenderfromover-runningthereceiver,butsoastokeepthesenderfromoverloadingthenetwork.AdescriptionofTCP'scongestion-controlmechanismispostponeduntilthenextchapter,wherewediscussitinthelargercontextofhownetworkresourcesarefairlyallocated.

Sincemanypeopleconfusecongestioncontrolandflowcontrol,werestatethedifference.Flowcontrolinvolvespreventingsendersfromover-runningthecapacityofreceivers.Congestioncontrolinvolvespreventingtoomuchdatafrombeinginjectedintothenetwork,therebycausingswitchesorlinkstobecomeoverloaded.Thus,flowcontrolisanend-to-endissue,whilecongestioncontrolisconcernedwithhowhostsandnetworksinteract.

End-to-EndIssues

AttheheartofTCPistheslidingwindowalgorithm.Eventhoughthisisthesamebasicalgorithmasisoftenusedatthelinklevel,becauseTCPrunsovertheInternetratherthanaphysicalpoint-to-pointlink,therearemanyimportantdifferences.ThissubsectionidentifiesthesedifferencesandexplainshowtheycomplicateTCP.ThefollowingsubsectionsthendescribehowTCPaddressestheseandothercomplications.

First,whereasthelink-levelslidingwindowalgorithmpresentedrunsoverasinglephysicallinkthatalwaysconnectsthesametwocomputers,TCPsupportslogicalconnectionsbetweenprocessesthatarerunningonanytwocomputersintheInternet.ThismeansthatTCPneedsanexplicitconnectionestablishmentphaseduringwhichthetwosidesoftheconnectionagreetoexchangedatawitheachother.Thisdifferenceisanalogoustohavingtodialuptheotherparty,ratherthanhavingadedicatedphoneline.TCPalsohasanexplicitconnectionteardownphase.Oneofthethingsthathappensduringconnectionestablishmentisthatthetwopartiesestablishsomesharedstatetoenabletheslidingwindowalgorithmtobegin.ConnectionteardownisneededsoeachhostknowsitisOKtofreethisstate.

Second,whereasasinglephysicallinkthatalwaysconnectsthesametwocomputershasafixedround-triptime(RTT),TCPconnectionsarelikelytohavewidelydifferentround-triptimes.Forexample,aTCPconnectionbetweenahostinSanFranciscoandahostinBoston,whichareseparatedbyseveralthousandkilometers,mighthaveanRTTof100ms,whileaTCPconnectionbetweentwohostsinthesameroom,onlyafewmetersapart,mighthaveanRTTofonly1ms.ThesameTCPprotocolmustbeabletosupportbothoftheseconnections.Tomakemattersworse,theTCPconnectionbetweenhostsinSanFranciscoandBostonmighthaveanRTTof100msat3a.m.,butanRTTof500msat3p.m.VariationsintheRTTareevenpossibleduringasingleTCPconnectionthatlastsonlyafew

5.2ReliableByteStream(TCP)

210

Page 12: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

minutes.Whatthismeanstotheslidingwindowalgorithmisthatthetimeoutmechanismthattriggersretransmissionsmustbeadaptive.(Certainly,thetimeoutforapoint-to-pointlinkmustbeasettableparameter,butitisnotnecessarytoadaptthistimerforaparticularpairofnodes.)

AthirddifferenceisthatpacketsmaybereorderedastheycrosstheInternet,butthisisnotpossibleonapoint-to-pointlinkwherethefirstpacketputintooneendofthelinkmustbethefirsttoappearattheotherend.Packetsthatareslightlyoutoforderdonotcauseaproblemsincetheslidingwindowalgorithmcanreorderpacketscorrectlyusingthesequencenumber.Therealissueishowfaroutoforderpacketscangetor,saidanotherway,howlateapacketcanarriveatthedestination.Intheworstcase,apacketcanbedelayedintheInternetuntiltheIPtimetolive(TTL)fieldexpires,atwhichtimethepacketisdiscarded(andhencethereisnodangerofitarrivinglate).KnowingthatIPthrowspacketsawayaftertheirTTLexpires,TCPassumesthateachpackethasamaximumlifetime.Theexactlifetime,knownasthemaximumsegmentlifetime(MSL),isanengineeringchoice.Thecurrentrecommendedsettingis120seconds.KeepinmindthatIPdoesnotdirectlyenforcethis120-secondvalue;itissimplyaconservativeestimatethatTCPmakesofhowlongapacketmightliveintheInternet.Theimplicationissignificant—TCPhastobepreparedforveryoldpacketstosuddenlyshowupatthereceiver,potentiallyconfusingtheslidingwindowalgorithm.

Fourth,thecomputersconnectedtoapoint-to-pointlinkaregenerallyengineeredtosupportthelink.Forexample,ifalink'sdelay×bandwidthproductiscomputedtobe8KB—meaningthatawindowsizeisselectedtoallowupto8KBofdatatobeunacknowledgedatagiventime—thenitislikelythatthecomputersateitherendofthelinkhavetheabilitytobufferupto8KBofdata.Designingthesystemotherwisewouldbesilly.Ontheotherhand,almostanykindofcomputercanbeconnectedtotheInternet,makingtheamountofresourcesdedicatedtoanyoneTCPconnectionhighlyvariable,especiallyconsideringthatanyonehostcanpotentiallysupporthundredsofTCPconnectionsatthesametime.ThismeansthatTCPmustincludeamechanismthateachsideusesto"learn"whatresources(e.g.,howmuchbufferspace)theothersideisabletoapplytotheconnection.Thisistheflowcontrolissue.

Fifth,becausethetransmittingsideofadirectlyconnectedlinkcannotsendanyfasterthanthebandwidthofthelinkallows,andonlyonehostispumpingdataintothelink,itisnotpossibletounknowinglycongestthelink.Saidanotherway,theloadonthelinkisvisibleintheformofaqueueofpacketsatthesender.Incontrast,thesendingsideofaTCPconnectionhasnoideawhatlinkswillbetraversedtoreachthedestination.Forexample,thesendingmachinemightbedirectlyconnectedtoarelativelyfastEthernet—andcapableofsendingdataatarateof10Gbps—butsomewhereoutinthemiddleofthenetwork,a1.5-Mbpslinkmustbetraversed.And,tomakemattersworse,databeinggeneratedbymanydifferentsourcesmightbetryingtotraversethissameslowlink.Thisleadstotheproblemofnetworkcongestion.Discussionofthistopicisdelayeduntilthenextchapter.

Weconcludethisdiscussionofend-to-endissuesbycomparingTCP'sapproachtoprovidingareliable/ordereddeliveryservicewiththeapproachusedbyvirtual-circut-basednetworkslikethehistoricallyimportantX.25network.InTCP,theunderlyingIPnetworkisassumedtobeunreliableandtodelivermessagesoutoforder;TCPusestheslidingwindowalgorithmonanend-to-endbasistoprovidereliable/ordereddelivery.Incontrast,X.25networksusetheslidingwindowprotocolwithinthenetwork,onahop-by-hopbasis.Theassumptionbehindthisapproachisthatifmessagesaredeliveredreliablyandinorderbetweeneachpairofnodesalongthepathbetweenthesourcehostandthedestinationhost,thentheend-to-endservicealsoguaranteesreliable/ordereddelivery.

Theproblemwiththislatterapproachisthatasequenceofhop-by-hopguaranteesdoesnotnecessarilyadduptoanend-to-endguarantee.First,ifaheterogeneouslink(say,anEthernet)isaddedtooneendofthepath,thenthereisnoguaranteethatthishopwillpreservethesameserviceastheotherhops.Second,justbecausetheslidingwindowprotocolguaranteesthatmessagesaredeliveredcorrectlyfromnodeAtonodeB,andthenfromnodeBtonodeC,itdoesnotguaranteethatnodeBbehavesperfectly.Forexample,networknodeshavebeenknowntointroduceerrorsintomessageswhiletransferringthemfromaninputbuffertoanoutputbuffer.Theyhavealsobeenknowntoaccidentallyreordermessages.Asaconsequenceofthesesmallwindowsofvulnerability,itisstillnecessarytoprovidetrueend-to-endcheckstoguaranteereliable/orderedservice,eventhoughthelowerlevelsofthesystemalsoimplementthatfunctionality.

5.2ReliableByteStream(TCP)

211

Page 13: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Thisdiscussionservestoillustrateoneofthemostimportantprinciplesinsystemdesign—theend-to-endargument.Inanutshell,theend-to-endargumentsaysthatafunction(inourexample,providingreliable/ordereddelivery)shouldnotbeprovidedinthelowerlevelsofthesystemunlessitcanbecompletelyandcorrectlyimplementedatthatlevel.Therefore,thisrulearguesinfavoroftheTCP/IPapproach.Thisruleisnotabsolute,however.Itdoesallowforfunctionstobeincompletelyprovidedatalowlevelasaperformanceoptimization.Thisiswhyitisperfectlyconsistentwiththeend-to-endargumenttoperformerrordetection(e.g.,CRC)onahop-by-hopbasis;detectingandretransmittingasinglecorruptpacketacrossonehopispreferabletohavingtoretransmitanentirefileend-to-end.

SegmentFormat

TCPisabyte-orientedprotocol,whichmeansthatthesenderwritesbytesintoaTCPconnectionandthereceiverreadsbytesoutoftheTCPconnection.Although"bytestream"describestheserviceTCPofferstoapplicationprocesses,TCPdoesnot,itself,transmitindividualbytesovertheInternet.Instead,TCPonthesourcehostbuffersenoughbytesfromthesendingprocesstofillareasonablysizedpacketandthensendsthispackettoitspeeronthedestinationhost.TCPonthedestinationhostthenemptiesthecontentsofthepacketintoareceivebuffer,andthereceivingprocessreadsfromthisbufferatitsleisure.ThissituationisillustratedinFigure1,which,forsimplicity,showsdataflowinginonlyonedirection.Rememberthat,ingeneral,asingleTCPconnectionsupportsbytestreamsflowinginbothdirections.

Figure1.HowTCPmanagesabytestream.

ThepacketsexchangedbetweenTCPpeersinFigure1arecalledsegments,sinceeachonecarriesasegmentofthebytestream.EachTCPsegmentcontainstheheaderschematicallydepictedinFigure2.Therelevanceofmostofthesefieldswillbecomeapparentthroughoutthissection.Fornow,wesimplyintroducethem.

5.2ReliableByteStream(TCP)

212

Page 14: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure2.TCPheaderformat.

TheSrcPortandDstPortfieldsidentifythesourceanddestinationports,respectively,justasinUDP.Thesetwofields,plusthesourceanddestinationIPaddresses,combinetouniquelyidentifyeachTCPconnection.Thatis,TCP'sdemuxkeyisgivenbythe4-tuple

(SrcPort,SrcIPAddr,DstPort,DstIPAddr)

NotethatbecauseTCPconnectionscomeandgo,itispossibleforaconnectionbetweenaparticularpairofportstobeestablished,usedtosendandreceivedata,andclosed,andthenatalatertimeforthesamepairofportstobeinvolvedinasecondconnection.Wesometimesrefertothissituationastwodifferentincarnationsofthesameconnection.

TheAcknowledgement,SequenceNum,andAdvertisedWindowfieldsareallinvolvedinTCP'sslidingwindowalgorithm.BecauseTCPisabyte-orientedprotocol,eachbyteofdatahasasequencenumber.TheSequenceNumfieldcontainsthesequencenumberforthefirstbyteofdatacarriedinthatsegment,andtheAcknowledgementandAdvertisedWindowfieldscarryinformationabouttheflowofdatagoingintheotherdirection.Tosimplifyourdiscussion,weignorethefactthatdatacanflowinbothdirections,andweconcentrateondatathathasaparticularSequenceNumflowinginonedirectionandAcknowledgementandAdvertisedWindowvaluesflowingintheoppositedirection,asillustratedinFigure3.Theuseofthesethreefieldsisdescribedmorefullylaterinthischapter.

Figure3.Simplifiedillustration(showingonlyonedirection)oftheTCPprocess,withdataflowinonedirectionandACKsintheother.

The6-bitFlagsfieldisusedtorelaycontrolinformationbetweenTCPpeers.ThepossibleflagsincludeSYN,FIN,RESET,PUSH,URG,andACK.TheSYNandFINflagsareusedwhenestablishingandterminatingaTCPconnection,respectively.Theiruseisdescribedinalatersection.TheACKflagissetanytimetheAcknowledgementfieldisvalid,implyingthatthereceivershouldpayattentiontoit.TheURGflagsignifiesthatthissegmentcontains

5.2ReliableByteStream(TCP)

213

Page 15: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

urgentdata.Whenthisflagisset,theUrgPtrfieldindicateswherethenonurgentdatacontainedinthissegmentbegins.Theurgentdataiscontainedatthefrontofthesegmentbody,uptoandincludingavalueofUrgPtrbytesintothesegment.ThePUSHflagsignifiesthatthesenderinvokedthepushoperation,whichindicatestothereceivingsideofTCPthatitshouldnotifythereceivingprocessofthisfact.Wediscusstheselasttwofeaturesmoreinalatersection.Finally,theRESETflagsignifiesthatthereceiverhasbecomeconfused—forexample,becauseitreceivedasegmentitdidnotexpecttoreceive—andsowantstoaborttheconnection.

Finally,theChecksumfieldisusedinexactlythesamewayasforUDP—itiscomputedovertheTCPheader,theTCPdata,andthepseudoheader,whichismadeupofthesourceaddress,destinationaddress,andlengthfieldsfromtheIPheader.ThechecksumisrequiredforTCPinbothIPv4andIPv6.Also,sincetheTCPheaderisofvariablelength(optionscanbeattachedafterthemandatoryfields),aHdrLenfieldisincludedthatgivesthelengthoftheheaderin32-bitwords.ThisfieldisalsoknownastheOffsetfield,sinceitmeasurestheoffsetfromthestartofthepackettothestartofthedata.

ConnectionEstablishmentandTermination

ATCPconnectionbeginswithaclient(caller)doinganactiveopentoaserver(callee).Assumingthattheserverhadearlierdoneapassiveopen,thetwosidesengageinanexchangeofmessagestoestablishtheconnection.(RecallfromChapter1thatapartywantingtoinitiateaconnectionperformsanactiveopen,whileapartywillingtoacceptaconnectiondoesapassiveopen.)Onlyafterthisconnectionestablishmentphaseisoverdothetwosidesbeginsendingdata.Likewise,assoonasaparticipantisdonesendingdata,itclosesonedirectionoftheconnection,whichcausesTCPtoinitiatearoundofconnectionterminationmessages.Noticethat,whileconnectionsetupisanasymmetricactivity(onesidedoesapassiveopenandtheothersidedoesanactiveopen),connectionteardownissymmetric(eachsidehastoclosetheconnectionindependently).Therefore,itispossibleforonesidetohavedoneaclose,meaningthatitcannolongersenddata,butfortheothersidetokeeptheotherhalfofthebidirectionalconnectionopenandtocontinuesendingdata.

Tobemoreprecise,connectionsetupcanbesymmetric,withbothsidestryingtoopentheconnectionatthesametime,butthecommoncaseisforonesidetodoanactiveopenandtheothersidetodoapassiveopen.

Three-WayHandshake

ThealgorithmusedbyTCPtoestablishandterminateaconnectioniscalledathree-wayhandshake.WefirstdescribethebasicalgorithmandthenshowhowitisusedbyTCP.Thethree-wayhandshakeinvolvestheexchangeofthreemessagesbetweentheclientandtheserver,asillustratedbythetimelinegiveninFigure4.

Figure4.Timelineforthree-wayhandshakealgorithm.

5.2ReliableByteStream(TCP)

214

Page 16: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Theideaisthattwopartieswanttoagreeonasetofparameters,which,inthecaseofopeningaTCPconnection,arethestartingsequencenumbersthetwosidesplantousefortheirrespectivebytestreams.Ingeneral,theparametersmightbeanyfactsthateachsidewantstheothertoknowabout.First,theclient(theactiveparticipant)sendsasegmenttotheserver(thepassiveparticipant)statingtheinitialsequencenumberitplanstouse(Flags=SYN,SequenceNum=x).Theserverthenrespondswithasinglesegmentthatbothacknowledgestheclient'ssequencenumber(Flags=ACK,Ack=x+1)andstatesitsownbeginningsequencenumber(Flags=SYN,SequenceNum=y).Thatis,boththeSYNandACKbitsaresetintheFlagsfieldofthissecondmessage.Finally,theclientrespondswithathirdsegmentthatacknowledgestheserver'ssequencenumber(Flags=ACK,Ack=y+1).ThereasonwhyeachsideacknowledgesasequencenumberthatisonelargerthantheonesentisthattheAcknowledgementfieldactuallyidentifiesthe"nextsequencenumberexpected,"therebyimplicitlyacknowledgingallearliersequencenumbers.Althoughnotshowninthistimeline,atimerisscheduledforeachofthefirsttwosegments,andiftheexpectedresponseisnotreceivedthesegmentisretransmitted.

Youmaybeaskingyourselfwhytheclientandserverhavetoexchangestartingsequencenumberswitheachotheratconnectionsetuptime.Itwouldbesimplerifeachsidesimplystartedatsome"well-known"sequencenumber,suchas0.Infact,theTCPspecificationrequiresthateachsideofaconnectionselectaninitialstartingsequencenumberatrandom.Thereasonforthisistoprotectagainsttwoincarnationsofthesameconnectionreusingthesamesequencenumberstoosoon—thatis,whilethereisstillachancethatasegmentfromanearlierincarnationofaconnectionmightinterferewithalaterincarnationoftheconnection.

State-TransitionDiagram

TCPiscomplexenoughthatitsspecificationincludesastate-transitiondiagram.AcopyofthisdiagramisgiveninFigure5.Thisdiagramshowsonlythestatesinvolvedinopeningaconnection(everythingaboveESTABLISHED)andinclosingaconnection(everythingbelowESTABLISHED).Everythingthatgoesonwhileaconnectionisopen—thatis,theoperationoftheslidingwindowalgorithm—ishiddenintheESTABLISHEDstate.

5.2ReliableByteStream(TCP)

215

Page 17: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure5.TCPstate-transitiondiagram.

TCP'sstate-transitiondiagramisfairlyeasytounderstand.EachboxdenotesastatethatoneendofaTCPconnectioncanfinditselfin.AllconnectionsstartintheCLOSEDstate.Astheconnectionprogresses,theconnectionmovesfromstatetostateaccordingtothearcs.Eacharcislabeledwithatagoftheformevent/action.Thus,ifaconnectionisintheLISTENstateandaSYNsegmentarrives(i.e.,asegmentwiththeSYNflagset),theconnectionmakesatransitiontotheSYN_RCVDstateandtakestheactionofreplyingwithanACK+SYNsegment.

Noticethattwokindsofeventstriggerastatetransition:(1)asegmentarrivesfromthepeer(e.g.,theeventonthearcfromLISTENtoSYN_RCVD),or(2)thelocalapplicationprocessinvokesanoperationonTCP(e.g.,theactiveopeneventonthearcfromCLOSEDtoSYN_SENT).Inotherwords,TCP'sstate-transitiondiagrameffectivelydefinesthesemanticsofbothitspeer-to-peerinterfaceanditsserviceinterface.Thesyntaxofthesetwointerfacesisgivenbythesegmentformat(asillustratedinFigure2)andbysomeapplicationprogramminginterface,suchasthesocketAPI,respectively.

Nowlet'stracethetypicaltransitionstakenthroughthediagraminFigure5.Keepinmindthatateachendoftheconnection,TCPmakesdifferenttransitionsfromstatetostate.Whenopeningaconnection,theserverfirstinvokesapassiveopenoperationonTCP,whichcausesTCPtomovetotheLISTENstate.Atsomelatertime,theclientdoesanactiveopen,whichcausesitsendoftheconnectiontosendaSYNsegmenttotheserverandtomovetotheSYN_SENTstate.WhentheSYNsegmentarrivesattheserver,itmovestotheSYN_RCVDstateandrespondswith

5.2ReliableByteStream(TCP)

216

Page 18: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

aSYN+ACKsegment.ThearrivalofthissegmentcausestheclienttomovetotheESTABLISHEDstateandtosendanACKbacktotheserver.WhenthisACKarrives,theserverfinallymovestotheESTABLISHEDstate.Inotherwords,wehavejusttracedthethree-wayhandshake.

Therearethreethingstonoticeabouttheconnectionestablishmenthalfofthestate-transitiondiagram.First,iftheclient'sACKtotheserverislost,correspondingtothethirdlegofthethree-wayhandshake,thentheconnectionstillfunctionscorrectly.ThisisbecausetheclientsideisalreadyintheESTABLISHEDstate,sothelocalapplicationprocesscanstartsendingdatatotheotherend.EachofthesedatasegmentswillhavetheACKflagset,andthecorrectvalueintheAcknowledgementfield,sotheserverwillmovetotheESTABLISHEDstatewhenthefirstdatasegmentarrives.ThisisactuallyanimportantpointaboutTCP—everysegmentreportswhatsequencenumberthesenderisexpectingtoseenext,evenifthisrepeatsthesamesequencenumbercontainedinoneormoreprevioussegments.

Thesecondthingtonoticeaboutthestate-transitiondiagramisthatthereisafunnytransitionoutoftheLISTENstatewheneverthelocalprocessinvokesasendoperationonTCP.Thatis,itispossibleforapassiveparticipanttoidentifybothendsoftheconnection(i.e.,itselfandtheremoteparticipantthatitiswillingtohaveconnecttoit),andthenforittochangeitsmindaboutwaitingfortheothersideandinsteadactivelyestablishtheconnection.Tothebestofourknowledge,thisisafeatureofTCPthatnoapplicationprocessactuallytakesadvantageof.

Thefinalthingtonoticeaboutthediagramisthearcsthatarenotshown.Specifically,mostofthestatesthatinvolvesendingasegmenttotheothersidealsoscheduleatimeoutthateventuallycausesthesegmenttobepresentiftheexpectedresponsedoesnothappen.Theseretransmissionsarenotdepictedinthestate-transitiondiagram.Ifafterseveraltriestheexpectedresponsedoesnotarrive,TCPgivesupandreturnstotheCLOSEDstate.

Turningourattentionnowtotheprocessofterminatingaconnection,theimportantthingtokeepinmindisthattheapplicationprocessonbothsidesoftheconnectionmustindependentlycloseitshalfoftheconnection.Ifonlyonesideclosestheconnection,thenthismeansithasnomoredatatosend,butitisstillavailabletoreceivedatafromtheotherside.Thiscomplicatesthestate-transitiondiagrambecauseitmustaccountforthepossibilitythatthetwosidesinvokethecloseoperatoratthesametime,aswellasthepossibilitythatfirstonesideinvokescloseandthen,atsomelatertime,theothersideinvokesclose.Thus,onanyonesidetherearethreecombinationsoftransitionsthatgetaconnectionfromtheESTABLISHEDstatetotheCLOSEDstate:

Thissideclosesfirst:ESTABLISHED→FIN_WAIT_1→FIN_WAIT_2→TIME_WAIT→CLOSED.

Theothersideclosesfirst:ESTABLISHED→CLOSE_WAIT→LAST_ACK→CLOSED.

Bothsidescloseatthesametime:ESTABLISHED→FIN_WAIT_1→CLOSING→TIME_WAIT→CLOSED.

Thereisactuallyafourth,althoughrare,sequenceoftransitionsthatleadstotheCLOSEDstate;itfollowsthearcfromFIN_WAIT_1toTIME_WAIT.Weleaveitasanexerciseforyoutofigureoutwhatcombinationofcircumstancesleadstothisfourthpossibility.

ThemainthingtorecognizeaboutconnectionteardownisthataconnectionintheTIME_WAITstatecannotmovetotheCLOSEDstateuntilithaswaitedfortwotimesthemaximumamountoftimeanIPdatagrammightliveintheInternet(i.e.,120seconds).Thereasonforthisisthat,whilethelocalsideoftheconnectionhassentanACKinresponsetotheotherside'sFINsegment,itdoesnotknowthattheACKwassuccessfullydelivered.Asaconsequence,theothersidemightretransmititsFINsegment,andthissecondFINsegmentmightbedelayedinthenetwork.IftheconnectionwereallowedtomovedirectlytotheCLOSEDstate,thenanotherpairofapplicationprocessesmightcomealongandopenthesameconnection(i.e.,usethesamepairofportnumbers),andthedelayedFINsegmentfromtheearlierincarnationoftheconnectionwouldimmediatelyinitiatetheterminationofthelaterincarnationofthatconnection.

SlidingWindowRevisited

5.2ReliableByteStream(TCP)

217

Page 19: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

WearenowreadytodiscussTCP'svariantoftheslidingwindowalgorithm,whichservesseveralpurposes:(1)itguaranteesthereliabledeliveryofdata,(2)itensuresthatdataisdeliveredinorder,and(3)itenforcesflowcontrolbetweenthesenderandthereceiver.TCP'suseoftheslidingwindowalgorithmisthesameasatthelinklevelinthecaseofthefirsttwoofthesethreefunctions.WhereTCPdiffersfromthelink-levelalgorithmisthatitfoldstheflow-controlfunctioninaswell.Inparticular,ratherthanhavingafixed-sizeslidingwindow,thereceiveradvertisesawindowsizetothesender.ThisisdoneusingtheAdvertisedWindowfieldintheTCPheader.ThesenderisthenlimitedtohavingnomorethanavalueofAdvertisedWindowbytesofunacknowledgeddataatanygiventime.ThereceiverselectsasuitablevalueforAdvertisedWindowbasedontheamountofmemoryallocatedtotheconnectionforthepurposeofbufferingdata.Theideaistokeepthesenderfromover-runningthereceiver'sbuffer.Wediscussthisatgreaterlengthbelow.

ReliableandOrderedDelivery

ToseehowthesendingandreceivingsidesofTCPinteractwitheachothertoimplementreliableandordereddelivery,considerthesituationillustratedinFigure6.TCPonthesendingsidemaintainsasendbuffer.Thisbufferisusedtostoredatathathasbeensentbutnotyetacknowledged,aswellasdatathathasbeenwrittenbythesendingapplicationbutnottransmitted.Onthereceivingside,TCPmaintainsareceivebuffer.Thisbufferholdsdatathatarrivesoutoforder,aswellasdatathatisinthecorrectorder(i.e.,therearenomissingbytesearlierinthestream)butthattheapplicationprocesshasnotyethadthechancetoread.

Figure6.RelationshipbetweenTCPsendbuffer(a)andreceivebuffer(b).

Tomakethefollowingdiscussionsimplertofollow,weinitiallyignorethefactthatboththebuffersandthesequencenumbersareofsomefinitesizeandhencewilleventuallywraparound.Also,wedonotdistinguishbetweenapointerintoabufferwhereaparticularbyteofdataisstoredandthesequencenumberforthatbyte.

Lookingfirstatthesendingside,threepointersaremaintainedintothesendbuffer,eachwithanobviousmeaning:LastByteAcked,LastByteSent,andLastByteWritten.Clearly,

LastByteAcked<=LastByteSent

sincethereceivercannothaveacknowledgedabytethathasnotyetbeensent,and

LastByteSent<=LastByteWritten

sinceTCPcannotsendabytethattheapplicationprocesshasnotyetwritten.AlsonotethatnoneofthebytestotheleftofLastByteAckedneedtobesavedinthebufferbecausetheyhavealreadybeenacknowledged,andnoneofthebytestotherightofLastByteWrittenneedtobebufferedbecausetheyhavenotyetbeengenerated.

5.2ReliableByteStream(TCP)

218

Page 20: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Asimilarsetofpointers(sequencenumbers)aremaintainedonthereceivingside:LastByteRead,NextByteExpected,andLastByteRcvd.Theinequalitiesarealittlelessintuitive,however,becauseoftheproblemofout-of-orderdelivery.Thefirstrelationship

LastByteRead<NextByteExpected

istruebecauseabytecannotbereadbytheapplicationuntilitisreceivedandallprecedingbyteshavealsobeenreceived.NextByteExpectedpointstothebyteimmediatelyafterthelatestbytetomeetthiscriterion.Second,

NextByteExpected<=LastByteRcvd+1

since,ifdatahasarrivedinorder,NextByteExpectedpointstothebyteafterLastByteRcvd,whereasifdatahasarrivedoutoforder,thenNextByteExpectedpointstothestartofthefirstgapinthedata,asinFigure6.NotethatbytestotheleftofLastByteReadneednotbebufferedbecausetheyhavealreadybeenreadbythelocalapplicationprocess,andbytestotherightofLastByteRcvdneednotbebufferedbecausetheyhavenotyetarrived.

FlowControl

Mostoftheabovediscussionissimilartothatfoundinthestandardslidingwindowalgorithm;theonlyrealdifferenceisthatthistimeweelaboratedonthefactthatthesendingandreceivingapplicationprocessesarefillingandemptyingtheirlocalbuffer,respectively.(Theearlierdiscussionglossedoverthefactthatdataarrivingfromanupstreamnodewasfillingthesendbufferanddatabeingtransmittedtoadownstreamnodewasemptyingthereceivebuffer.)

Youshouldmakesureyouunderstandthismuchbeforeproceedingbecausenowcomesthepointwherethetwoalgorithmsdiffermoresignificantly.Inwhatfollows,wereintroducethefactthatbothbuffersareofsomefinitesize,denotedMaxSendBufferandMaxRcvBuffer,althoughwedon'tworryaboutthedetailsofhowtheyareimplemented.Inotherwords,weareonlyinterestedinthenumberofbytesbeingbuffered,notinwherethosebytesareactuallystored.

Recallthatinaslidingwindowprotocol,thesizeofthewindowsetstheamountofdatathatcanbesentwithoutwaitingforacknowledgmentfromthereceiver.Thus,thereceiverthrottlesthesenderbyadvertisingawindowthatisnolargerthantheamountofdatathatitcanbuffer.ObservethatTCPonthereceivesidemustkeep

LastByteRcvd-LastByteRead<=MaxRcvBuffer

toavoidoverflowingitsbuffer.Itthereforeadvertisesawindowsizeof

AdvertisedWindow=MaxRcvBuffer-((NextByteExpected-1)-LastByteRead)

whichrepresentstheamountoffreespaceremaininginitsbuffer.Asdataarrives,thereceiveracknowledgesitaslongasalltheprecedingbyteshavealsoarrived.Inaddition,LastByteRcvdmovestotheright(isincremented),meaningthattheadvertisedwindowpotentiallyshrinks.Whetherornotitshrinksdependsonhowfastthelocalapplicationprocessisconsumingdata.Ifthelocalprocessisreadingdatajustasfastasitarrives(causingLastByteReadtobeincrementedatthesamerateasLastByteRcvd),thentheadvertisedwindowstaysopen(i.e.,AdvertisedWindow=MaxRcvBuffer).If,however,thereceivingprocessfallsbehind,perhapsbecauseitperformsaveryexpensiveoperationoneachbyteofdatathatitreads,thentheadvertisedwindowgrowssmallerwitheverysegmentthatarrives,untiliteventuallygoesto0.

TCPonthesendsidemustthenadheretotheadvertisedwindowitgetsfromthereceiver.Thismeansthatatanygiventime,itmustensurethat

LastByteSent-LastByteAcked<=AdvertisedWindow

5.2ReliableByteStream(TCP)

219

Page 21: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Saidanotherway,thesendercomputesaneffectivewindowthatlimitshowmuchdataitcansend:

EffectiveWindow=AdvertisedWindow-(LastByteSent-LastByteAcked)

Clearly,EffectiveWindowmustbegreaterthan0beforethesourcecansendmoredata.Itispossible,therefore,thatasegmentarrivesacknowledgingxbytes,therebyallowingthesendertoincrementLastByteAckedbyx,butbecausethereceivingprocesswasnotreadinganydata,theadvertisedwindowisnowxbytessmallerthanthetimebefore.Insuchasituation,thesenderwouldbeabletofreebufferspace,butnottosendanymoredata.

Allthewhilethisisgoingon,thesendsidemustalsomakesurethatthelocalapplicationprocessdoesnotoverflowthesendbuffer—thatis,

LastByteWritten-LastByteAcked<=MaxSendBuffer

IfthesendingprocesstriestowriteybytestoTCP,but

(LastByteWritten-LastByteAcked)+y>MaxSendBuffer

thenTCPblocksthesendingprocessanddoesnotallowittogeneratemoredata.

Itisnowpossibletounderstandhowaslowreceivingprocessultimatelystopsafastsendingprocess.First,thereceivebufferfillsup,whichmeanstheadvertisedwindowshrinksto0.Anadvertisedwindowof0meansthatthesendingsidecannottransmitanydata,eventhoughdataithaspreviouslysenthasbeensuccessfullyacknowledged.Finally,notbeingabletotransmitanydatameansthatthesendbufferfillsup,whichultimatelycausesTCPtoblockthesendingprocess.Assoonasthereceivingprocessstartstoreaddataagain,thereceive-sideTCPisabletoopenitswindowbackup,whichallowsthesend-sideTCPtotransmitdataoutofitsbuffer.Whenthisdataiseventuallyacknowledged,LastByteAckedisincremented,thebufferspaceholdingthisacknowledgeddatabecomesfree,andthesendingprocessisunblockedandallowedtoproceed.

Thereisonlyoneremainingdetailthatmustberesolved—howdoesthesendingsideknowthattheadvertisedwindowisnolonger0?Asmentionedabove,TCPalwayssendsasegmentinresponsetoareceiveddatasegment,andthisresponsecontainsthelatestvaluesfortheAcknowledgeandAdvertisedWindowfields,evenifthesevalueshavenotchangedsincethelasttimetheyweresent.Theproblemisthis.Oncethereceivesidehasadvertisedawindowsizeof0,thesenderisnotpermittedtosendanymoredata,whichmeansithasnowaytodiscoverthattheadvertisedwindowisnolonger0atsometimeinthefuture.TCPonthereceivesidedoesnotspontaneouslysendnondatasegments;itonlysendstheminresponsetoanarrivingdatasegment.

TCPdealswiththissituationasfollows.Whenevertheothersideadvertisesawindowsizeof0,thesendingsidepersistsinsendingasegmentwith1byteofdataeverysooften.Itknowsthatthisdatawillprobablynotbeaccepted,butittriesanyway,becauseeachofthese1-bytesegmentstriggersaresponsethatcontainsthecurrentadvertisedwindow.Eventually,oneofthese1-byteprobestriggersaresponsethatreportsanonzeroadvertisedwindow.

NotethatthereasonthesendingsideperiodicallysendsthisprobesegmentisthatTCPisdesignedtomakethereceivesideassimpleaspossible—itsimplyrespondstosegmentsfromthesender,anditneverinitiatesanyactivityonitsown.Thisisanexampleofawell-recognized(althoughnotuniversallyapplied)protocoldesignrule,which,forlackofabettername,wecallthesmartsender/dumbreceiverrule.RecallthatwesawanotherexampleofthisrulewhenwediscussedtheuseofNAKsinslidingwindowalgorithm.

ProtectingagainstWraparound

ThissubsectionandthenextconsiderthesizeoftheSequenceNumandAdvertisedWindowfieldsandtheimplicationsoftheirsizesonTCP'scorrectnessandperformance.TCP'sSequenceNumfieldis32bitslong,anditsAdvertisedWindowfieldis16bitslong,meaningthatTCPhaseasilysatisfiedtherequirementoftheslidingwindowalgorithmthatthe

5.2ReliableByteStream(TCP)

220

Page 22: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

sequencenumberspacebetwiceasbigasthewindowsize:2 >>2×2 .However,thisrequirementisnotthe

interestingthingaboutthesetwofields.Considereachfieldinturn.

Therelevanceofthe32-bitsequencenumberspaceisthatthesequencenumberusedonagivenconnectionmightwraparound—abytewithsequencenumberScouldbesentatonetime,andthenatalatertimeasecondbytewiththesamesequencenumberSmightbesent.Onceagain,weassumethatpacketscannotsurviveintheInternetforlongerthantherecommendedMSL.Thus,wecurrentlyneedtomakesurethatthesequencenumberdoesnotwraparoundwithina120-secondperiodoftime.WhetherornotthishappensdependsonhowfastdatacanbetransmittedovertheInternet—thatis,howfastthe32-bitsequencenumberspacecanbeconsumed.(Thisdiscussionassumesthatwearetryingtoconsumethesequencenumberspaceasfastaspossible,butofcoursewewillbeifwearedoingourjobofkeepingthepipefull.)Table1showshowlongittakesforthesequencenumbertowraparoundonnetworkswithvariousbandwidths.

Bandwidth TimeuntilWraparound

T1(1.5Mbps) 6.4hours

Ethernet(10Mbps) 57minutes

T3(45Mbps) 13minutes

FastEthernet(100Mbps) 6minutes

OC-3(155Mbps) 4minutes

OC-48(2.5Gbps) 14seconds

OC-192(10Gbps) 3seconds

10GigE(10Gbps) 3seconds

Table1.TimeUntil32-BitSequenceNumberSpaceWrapsAround.

Asyoucansee,the32-bitsequencenumberspaceisadequateatmodestbandwidths,butgiventhatOC-192linksarenowcommonintheInternetbackbone,andthatmostserversnowcomewith10GigEthernet(or10Gbps)interfaces,we'renowwell-pastthepointwhere32bitsistoosmall.Fortunately,theIETFhasworkedoutanextensiontoTCPthateffectivelyextendsthesequencenumberspacetoprotectagainstthesequencenumberwrappingaround.Thisandrelatedextensionsaredescribedinalatersection.

KeepingthePipeFull

Therelevanceofthe16-bitAdvertisedWindowfieldisthatitmustbebigenoughtoallowthesendertokeepthepipefull.Clearly,thereceiverisfreetonotopenthewindowaslargeastheAdvertisedWindowfieldallows;weareinterestedinthesituationinwhichthereceiverhasenoughbufferspacetohandleasmuchdataasthelargestpossibleAdvertisedWindowallows.

Inthiscase,itisnotjustthenetworkbandwidthbutthedelay×bandwidthproductthatdictateshowbigtheAdvertisedWindowfieldneedstobe—thewindowneedstobeopenedfarenoughtoallowafulldelay×bandwidthproduct'sworthofdatatobetransmitted.AssuminganRTTof100ms(atypicalnumberforacross-countryconnectionintheUnitedStates),Table2givesthedelay×bandwidthproductforseveralnetworktechnologies.

Bandwidth Delay×BandwidthProduct

T1(1.5Mbps) 18KB

Ethernet(10Mbps) 122KB

T3(45Mbps) 549KB

FastEthernet(100Mbps) 1.2MB

32 16

5.2ReliableByteStream(TCP)

221

Page 23: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

OC-3(155Mbps) 1.8MB

OC-48(2.5Gbps) 29.6MB

OC-192(10Gbps) 118.4MB

10GigE(10Gbps) 118.4MB

Table2.RequiredWindowSizefor100-msRTT.

Asyoucansee,TCP'sAdvertisedWindowfieldisinevenworseshapethanitsSequenceNumfield—itisnotbigenoughtohandleevenaT3connectionacrossthecontinentalUnitedStates,sincea16-bitfieldallowsustoadvertiseawindowofonly64KB.TheverysameTCPextensionmentionedaboveprovidesamechanismforeffectivelyincreasingthesizeoftheadvertisedwindow.

TriggeringTransmission

Wenextconsiderasurprisinglysubtleissue:howTCPdecidestotransmitasegment.Asdescribedearlier,TCPsupportsabyte-streamabstraction;thatis,applicationprogramswritebytesintothestream,anditisuptoTCPtodecidethatithasenoughbytestosendasegment.Whatfactorsgovernthisdecision?

Ifweignorethepossibilityofflowcontrol—thatis,weassumethewindowiswideopen,aswouldbethecasewhenaconnectionfirststarts—thenTCPhasthreemechanismstotriggerthetransmissionofasegment.First,TCPmaintainsavariable,typicallycalledthemaximumsegmentsize(MSS),anditsendsasegmentassoonasithascollectedMSSbytesfromthesendingprocess.MSSisusuallysettothesizeofthelargestsegmentTCPcansendwithoutcausingthelocalIPtofragment.Thatis,MSSissettothemaximumtransmissionunit(MTU)ofthedirectlyconnectednetwork,minusthesizeoftheTCPandIPheaders.ThesecondthingthattriggersTCPtotransmitasegmentisthatthesendingprocesshasexplicitlyaskedittodoso.Specifically,TCPsupportsapushoperation,andthesendingprocessinvokesthisoperationtoeffectivelyflushthebufferofunsentbytes.Thefinaltriggerfortransmittingasegmentisthatatimerfires;theresultingsegmentcontainsasmanybytesasarecurrentlybufferedfortransmission.However,aswewillsoonsee,this"timer"isn'texactlywhatyouexpect.

SillyWindowSyndrome

Ofcourse,wecan'tjustignoreflowcontrol,whichplaysanobviousroleinthrottlingthesender.IfthesenderhasMSSbytesofdatatosendandthewindowisopenatleastthatmuch,thenthesendertransmitsafullsegment.Suppose,however,thatthesenderisaccumulatingbytestosend,butthewindowiscurrentlyclosed.NowsupposeanACKarrivesthateffectivelyopensthewindowenoughforthesendertotransmit,say,MSS/2bytes.Shouldthesendertransmitahalf-fullsegmentorwaitforthewindowtoopentoafullMSS?Theoriginalspecificationwassilentonthispoint,andearlyimplementationsofTCPdecidedtogoaheadandtransmitahalf-fullsegment.Afterall,thereisnotellinghowlongitwillbebeforethewindowopensfurther.

Itturnsoutthatthestrategyofaggressivelytakingadvantageofanyavailablewindowleadstoasituationnowknownasthesillywindowsyndrome.Figure7helpsvisualizewhathappens.IfyouthinkofaTCPstreamasaconveyerbeltwith"full"containers(datasegments)goinginonedirectionandemptycontainers(ACKs)goinginthereversedirection,thenMSS-sizedsegmentscorrespondtolargecontainersand1-bytesegmentscorrespondtoverysmallcontainers.AslongasthesenderissendingMSS-sizedsegmentsandthereceiverACKsatleastoneMSSofdataatatime,everythingisgood(Figure7(a)).But,whatifthereceiverhastoreducethewindow,sothatatsometimethesendercan'tsendafullMSSofdata?Ifthesenderaggressivelyfillsasmaller-than-MSSemptycontainerassoonasitarrives,thenthereceiverwillACKthatsmallernumberofbytes,andhencethesmallcontainerintroducedintothesystemremainsinthesystemindefinitely.Thatis,itisimmediatelyfilledandemptiedateachendandisnevercoalescedwithadjacentcontainerstocreatelargercontainers,asinFigure7(b).ThisscenariowasdiscoveredwhenearlyimplementationsofTCPregularlyfoundthemselvesfillingthenetworkwithtinysegments.

5.2ReliableByteStream(TCP)

222

Page 24: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure7.Sillywindowsyndrome.(a)AslongasthesendersendsMSS-sizedsegmentsandthereceiverACKsoneMSSatatime,thesystemworkssmoothly.(b)AssoonasthesendersendslessthanoneMSS,orthereceiverACKs

lessthanoneMSS,asmall"container"entersthesystemandcontinuestocirculate.

Notethatthesillywindowsyndromeisonlyaproblemwheneitherthesendertransmitsasmallsegmentorthereceiveropensthewindowasmallamount.Ifneitherofthesehappens,thenthesmallcontainerisneverintroducedintothestream.It'snotpossibletooutlawsendingsmallsegments;forexample,theapplicationmightdoapushaftersendingasinglebyte.Itispossible,however,tokeepthereceiverfromintroducingasmallcontainer(i.e.,asmallopenwindow).TheruleisthatafteradvertisingazerowindowthereceivermustwaitforspaceequaltoanMSSbeforeitadvertisesanopenwindow.

Sincewecan'teliminatethepossibilityofasmallcontainerbeingintroducedintothestream,wealsoneedmechanismstocoalescethem.ThereceivercandothisbydelayingACKs—sendingonecombinedACKratherthanmultiplesmallerones—butthisisonlyapartialsolutionbecausethereceiverhasnowayofknowinghowlongitissafetodelaywaitingeitherforanothersegmenttoarriveorfortheapplicationtoreadmoredata(thusopeningthewindow).Theultimatesolutionfallstothesender,whichbringsusbacktoouroriginalissue:WhendoestheTCPsenderdecidetotransmitasegment?

Nagle'sAlgorithm

ReturningtotheTCPsender,ifthereisdatatosendbutthewindowisopenlessthanMSS,thenwemaywanttowaitsomeamountoftimebeforesendingtheavailabledata,butthequestionishowlong?Ifwewaittoolong,thenwehurtinteractiveapplicationslikeTelnet.Ifwedon'twaitlongenough,thenwerisksendingabunchoftinypacketsandfallingintothesillywindowsyndrome.Theansweristointroduceatimerandtotransmitwhenthetimerexpires.

Whilewecoulduseaclock-basedtimer—forexample,onethatfiresevery100ms—Nagleintroducedanelegantself-clockingsolution.TheideaisthataslongasTCPhasanydatainflight,thesenderwilleventuallyreceiveanACK.ThisACKcanbetreatedlikeatimerfiring,triggeringthetransmissionofmoredata.Nagle'salgorithmprovidesasimple,unifiedrulefordecidingwhentotransmit:

Whentheapplicationproducesdatatosendifboththeavailabledataandthewindow>=MSSsendafullsegmentelseifthereisunACKeddatainflightbufferthenewdatauntilanACKarriveselsesendallthenewdatanow

5.2ReliableByteStream(TCP)

223

Page 25: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Inotherwords,it'salwaysOKtosendafullsegmentifthewindowallows.It'salsoallrighttoimmediatelysendasmallamountofdataiftherearecurrentlynosegmentsintransit,butifthereisanythinginflightthesendermustwaitforanACKbeforetransmittingthenextsegment.Thus,aninteractiveapplicationlikeTelnetthatcontinuallywritesonebyteatatimewillsenddataatarateofonesegmentperRTT.Somesegmentswillcontainasinglebyte,whileotherswillcontainasmanybytesastheuserwasabletotypeinoneround-triptime.BecausesomeapplicationscannotaffordsuchadelayforeachwriteitdoestoaTCPconnection,thesocketinterfaceallowstheapplicationtoturnoffNagel'salgorithmbysettingtheTCP_NODELAYoption.Settingthisoptionmeansthatdataistransmittedassoonaspossible.

AdaptiveRetransmissionBecauseTCPguaranteesthereliabledeliveryofdata,itretransmitseachsegmentifanACKisnotreceivedinacertainperiodoftime.TCPsetsthistimeoutasafunctionoftheRTTitexpectsbetweenthetwoendsoftheconnection.Unfortunately,giventherangeofpossibleRTTsbetweenanypairofhostsintheInternet,aswellasthevariationinRTTbetweenthesametwohostsovertime,choosinganappropriatetimeoutvalueisnotthateasy.Toaddressthisproblem,TCPusesanadaptiveretransmissionmechanism.WenowdescribethismechanismandhowithasevolvedovertimeastheInternetcommunityhasgainedmoreexperienceusingTCP.

OriginalAlgorithm

Webeginwithasimplealgorithmforcomputingatimeoutvaluebetweenapairofhosts.ThisisthealgorithmthatwasoriginallydescribedintheTCPspecification—andthefollowingdescriptionpresentsitinthoseterms—butitcouldbeusedbyanyend-to-endprotocol.

TheideaistokeeparunningaverageoftheRTTandthentocomputethetimeoutasafunctionofthisRTT.Specifically,everytimeTCPsendsadatasegment,itrecordsthetime.WhenanACKforthatsegmentarrives,TCPreadsthetimeagain,andthentakesthedifferencebetweenthesetwotimesasaSampleRTT.TCPthencomputesanEstimatedRTTasaweightedaveragebetweenthepreviousestimateandthisnewsample.Thatis,

EstimatedRTT=alphaxEstimatedRTT+(1-alpha)xSampleRTT

TheparameteralphaisselectedtosmooththeEstimatedRTT.AsmallalphatrackschangesintheRTTbutisperhapstooheavilyinfluencedbytemporaryfluctuations.Ontheotherhand,alargealphaismorestablebutperhapsnotquickenoughtoadapttorealchanges.TheoriginalTCPspecificationrecommendedasettingofalphabetween0.8and0.9.TCPthenusesEstimatedRTTtocomputethetimeoutinaratherconservativeway:

TimeOut=2xEstimatedRTT

Karn/PartridgeAlgorithm

AfterseveralyearsofuseontheInternet,aratherobviousflawwasdiscoveredinthissimplealgorithm.TheproblemwasthatanACKdoesnotreallyacknowledgeatransmission;itactuallyacknowledgesthereceiptofdata.Inotherwords,wheneverasegmentisretransmittedandthenanACKarrivesatthesender,itisimpossibletodetermineifthisACKshouldbeassociatedwiththefirstorthesecondtransmissionofthesegmentforthepurposeofmeasuringthesampleRTT.ItisnecessarytoknowwhichtransmissiontoassociateitwithsoastocomputeanaccurateSampleRTT.AsillustratedinFigure8,ifyouassumethattheACKisfortheoriginaltransmissionbutitwasreallyforthesecond,thentheSampleRTTistoolarge(a);ifyouassumethattheACKisforthesecondtransmissionbutitwasactuallyforthefirst,thentheSampleRTTistoosmall(b).

5.2ReliableByteStream(TCP)

224

Page 26: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

Figure8.AssociatingtheACKwith(a)originaltransmissionversus(b)retransmission.

Thesolution,whichwasproposedin1987,issurprisinglysimple.WheneverTCPretransmitsasegment,itstopstakingsamplesoftheRTT;itonlymeasuresSampleRTTforsegmentsthathavebeensentonlyonce.ThissolutionisknownastheKarn/Partridgealgorithm,afteritsinventors.TheirproposedfixalsoincludesasecondsmallchangetoTCP'stimeoutmechanism.EachtimeTCPretransmits,itsetsthenexttimeouttobetwicethelasttimeout,ratherthanbasingitonthelastEstimatedRTT.Thatis,KarnandPartridgeproposedthatTCPuseexponentialbackoff,similartowhattheEthernetdoes.Themotivationforusingexponentialbackoffissimple:Congestionisthemostlikelycauseoflostsegments,meaningthattheTCPsourceshouldnotreacttooaggressivelytoatimeout.Infact,themoretimestheconnectiontimesout,themorecautiousthesourceshouldbecome.Wewillseethisideaagain,embodiedinamuchmoresophisticatedmechanism,inthenextchapter.

Jacobson/KarelsAlgorithm

TheKarn/PartridgealgorithmwasintroducedatatimewhentheInternetwassufferingfromhighlevelsofnetworkcongestion.Theirapproachwasdesignedtofixsomeofthecausesofthatcongestion,but,althoughitwasanimprovement,thecongestionwasnoteliminated.Thefollowingyear(1988),twootherresearchers—JacobsonandKarels—proposedamoredrasticchangetoTCPtobattlecongestion.Thebulkofthatproposedchangeisdescribedinthenextchapter.Here,wefocusontheaspectofthatproposalthatisrelatedtodecidingwhentotimeoutandretransmitasegment.

Asanaside,itshouldbeclearhowthetimeoutmechanismisrelatedtocongestion—ifyoutimeouttoosoon,youmayunnecessarilyretransmitasegment,whichonlyaddstotheloadonthenetwork.Theotherreasonforneedinganaccuratetimeoutvalueisthatatimeoutistakentoimplycongestion,whichtriggersacongestion-controlmechanism.Finally,notethatthereisnothingabouttheJacobson/KarelstimeoutcomputationthatisspecifictoTCP.Itcouldbeusedbyanyend-to-endprotocol.

ThemainproblemwiththeoriginalcomputationisthatitdoesnottakethevarianceofthesampleRTTsintoaccount.Intuitively,ifthevariationamongsamplesissmall,thentheEstimatedRTTcanbebettertrustedandthereisnoreasonformultiplyingthisestimateby2tocomputethetimeout.Ontheotherhand,alargevarianceinthesamplessuggeststhatthetimeoutvalueshouldnotbetootightlycoupledtotheEstimatedRTT.

Inthenewapproach,thesendermeasuresanewSampleRTTasbefore.Itthenfoldsthisnewsampleintothetimeoutcalculationasfollows:

Difference=SampleRTT-EstimatedRTTEstimatedRTT=EstimatedRTT+(deltaxDifference)Deviation=Deviation+delta(|Difference|-Deviation)

wheredeltaisafractionbetween0and1.Thatis,wecalculateboththemeanRTTandthevariationinthatmean.

5.2ReliableByteStream(TCP)

225

Page 27: 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

TCPthencomputesthetimeoutvalueasafunctionofbothEstimatedRTTandDeviationasfollows:

TimeOut=muxEstimatedRTT+phixDeviation

wherebasedonexperience,muistypicallysetto1andphiissetto4.Thus,whenthevarianceissmall,TimeOutisclosetoEstimatedRTT;alargevariancecausestheDeviationtermtodominatethecalculation.

Implementation

TherearetwoitemsofnoteregardingtheimplementationoftimeoutsinTCP.ThefirstisthatitispossibletoimplementthecalculationforEstimatedRTTandDeviationwithoutusingfloating-pointarithmetic.Instead,thewhole

calculationisscaledby2 ,withdeltaselectedtobe1/2 .Thisallowsustodointegerarithmetic,implementing

multiplicationanddivisionusingshifts,therebyachievinghigherperformance.Theresultingcalculationisgivenbythefollowingcodefragment,wheren=3(i.e.,delta=1/8).NotethatEstimatedRTTandDeviationarestoredintheirscaled-upforms,whilethevalueofSampleRTTatthestartofthecodeandofTimeOutattheendarereal,unscaledvalues.Ifyoufindthecodehardtofollow,youmightwanttotrypluggingsomerealnumbersintoitandverifyingthatitgivesthesameresultsastheequationsabove.

{SampleRTT-=(EstimatedRTT>>3);EstimatedRTT+=SampleRTT;if(SampleRTT<0)SampleRTT=-SampleRTT;SampleRTT-=(Deviation>>3);Deviation+=SampleRTT;TimeOut=(EstimatedRTT>>3)+(Deviation>>1);}

ThesecondpointofnoteisthattheJacobson/Karelsalgorithmisonlyasgoodastheclockusedtoreadthecurrenttime.OntypicalUniximplementationsatthetime,theclockgranularitywasaslargeas500ms,whichissignificantlylargerthantheaveragecross-countryRTTofsomewherebetween100and200ms.Tomakemattersworse,theUniximplementationofTCPonlycheckedtoseeifatimeoutshouldhappeneverytimethis500-msclocktickedandwouldonlytakeasampleoftheround-triptimeonceperRTT.Thecombinationofthesetwofactorscouldmeanthatatimeoutwouldhappen1secondafterthesegmentwastransmitted.Onceagain,theextensionstoTCPincludeamechanismthatmakesthisRTTcalculationabitmoreprecise.

Alloftheretransmissionalgorithmswehavediscussedarebasedonacknowledgmenttimeouts,whichindicatethatasegmenthasprobablybeenlost.Notethatatimeoutdoesnot,however,tellthesenderwhetheranysegmentsitsentafterthelostsegmentweresuccessfullyreceived.ThisisbecauseTCPacknowledgmentsarecumulative;theyidentifyonlythelastsegmentthatwasreceivedwithoutanyprecedinggaps.Thereceptionofsegmentsthatoccurafteragapgrowsmorefrequentasfasternetworksleadtolargerwindows.IfACKsalsotoldthesenderwhichsubsequentsegments,ifany,hadbeenreceived,thenthesendercouldbemoreintelligentaboutwhichsegmentsitretransmits,drawbetterconclusionsaboutthestateofcongestion,andmakebetterRTTestimates.ATCPextensionsupportingthisisdescribedinalatersection.

RecordBoundariesSinceTCPisabyte-streamprotocol,thenumberofbyteswrittenbythesenderarenotnecessarilythesameasthenumberofbytesreadbythereceiver.Forexample,theapplicationmightwrite8bytes,then2bytes,then20bytestoaTCPconnection,whileonthereceivingsidetheapplicationreads5bytesatatimeinsidealoopthatiterates6

n n

5.2ReliableByteStream(TCP)

226