baccelli/evaluation/augustinchaintreauphd.pdf

Processes of Interaction in Data Networks

A dissertation submitted for the degree of Doctor of Philosophy
by Augustin Chaintreau

Thesis

Processes of Interaction in Data Networks

presented to obtain the title of Doctor of Science of the Université Paris 6
speciality: Computer Science
by Augustin Chaintreau
defended on 16 January 2006, before a jury composed of:

Ernst Biersack, Reviewer
Frank Kelly, Reviewer
Jean-Yves Le Boudec, Reviewer
Philippe Chrétienne, Examiner
Christophe Diot, Examiner
François Baccelli, Thesis Advisor

Acknowledgments

I wish to thank François Baccelli. The rigor and the kindness of his advice exceeded anything I had known, during these years of work carried out under his supervision. I am indebted to him for it every day.

I would like to thank the members of my committee, for their trust when accepting to evaluate my work. It is a great honor for me, and it also has been an immense pleasure to know afterwards that I have been writing this dissertation for their examination. I am also grateful to Frank Kelly for giving me access to the Mathematical Library while I was in Cambridge, which added a lot to the enjoyment of writing this dissertation.

I would also like to thank Christophe Diot. He inspired my first work on data networks. I never cease to admire his intuition and his energy. The end of this thesis owes much to his patience and his support during this last year, in which I have had the good fortune to work with him on new topics.

Finally, I would like to thank Mélanie Sag, for it is our happy encounter that gave me the unquenched desire to know and to write. And, to be honest, one would still be wondering in which language my thesis is written, had she not proofread it.

The content of this dissertation has benefitted from many contributions and criticism. All my thanks go to my co-authors Danny de Vleeschauwer, Zhen Liu, David McDonald, Anton Riabov, and Sambit Sahu, for their kind help and their expert advice. Most of the claims that I am able to make in this thesis depend on their talents. I wish also to express my gratitude to many of my colleagues for their interest, for their suggestions, and for their friendliness: Bartek Blaszczyszyn, Thomas Bonald, Charles Bordenave, Jon Crowcroft, Moez Draief, Alain Frisch, Richard Gass, Pan Hui, Bruno Kauffmann, Marc Lelarge, Jean Mairesse, Laurent Massoulié, Vivek Mathre, Dina Papagiannaki, Alexandre Proutière, James Scott, Renata Texeira, Patrick Thiran. I cannot detail all the influence each had on my work, but I do hope you will feel it in the excitement I had when writing this thesis.

I am grateful to the Conseil Général des Technologies de l'Information, to the École Normale Supérieure and to INRIA, as well as to the research laboratories of IBM Watson, Intel Cambridge and Thomson Paris, for their unfailing support at several key moments of this work. Among the people who helped me directly, I would like to acknowledge the kindness and the daily support of Letizia Ballarini, Florence Barbara, Angela Barreto, Véronique Beaupel, Marie Claudine Bendayan, Jacques Beigbeder, Julia Blackwell, Joelle Isnard, Laurence Lenormand and Valérie Mongiat.

About this thesis

This thesis is made of four chapters, numbered from 0 to 3. Chapter 0 contains an introduction to the TCP communication protocol used in data networks; it also briefly justifies my motivation for looking at the interaction of data flows. Two types of interaction are then presented in Chapters 1 and 2. The last results are obtained using a new framework to analyze large discrete-event systems; it is described in Chapter 3, which may be read separately from the rest.

A short technical summary was added, to present the corresponding publications, and the collaboration I had with other researchers while I completed this work. This thesis ends with a synthesis written in French.

Contents

0 Introduction: Decentralized control in data networks
    1 The Transport Control Protocol
        1.1 Flow Control and Reliability
        1.2 Congestion Control
        1.3 The Algorithms of TCP
        1.4 A Brief Discussion About Performance
        1.5 Some Extensions
    2 Microscopic Models
        2.1 Preliminary Studies with Dedicated Networks
        2.2 Statistical Multiplexing
        2.3 TCP as a (max,+) Linear System
    3 Bandwidth Sharing with Fluid Flows
        3.1 Fairness among Persistent Flows
        3.2 Capacity under Dynamic Traffic Demand
    4 Motivation and Summary of this Dissertation
    Bibliography

1 Bandwidth Sharing with Interacting Flows
    1 Related Works and Open Problems
    2 The Hybrid AIMD Model with Persistent Flows
        2.1 Overview
        2.2 Many User Approximation
        2.3 A Closed Form Formula for TCP Throughput TDF
    3 Examples of Applications
        3.1 Buffer and Packet Losses
        3.2 Impact of Buffer Sizes
        3.3 Dimensioning with Probabilistic Guarantee
    4 Analysis with Non Persistent Flows
        4.1 The Free Regime
        4.2 Property of Inter Congestion Time
        4.3 The Congestion Regime
        4.4 Application: Rate and Stability
    Bibliography

2 Scalable Multicasting on Overlay Network
    1 15 Years of Multicasting: Lessons Learned
        1.1 The Network Can Route Multicast Traffic...
        1.2 ... But It Cannot Transport It.
    2 Networking in Overlays
        2.1 Building Optimal Overlay
        2.2 Towards Large Scale Overlay
        2.3 Transporting Data on an Overlay
        2.4 Our Contribution
    3 The one-to-many TCP Overlay
        3.1 Designing an Adaptive Data Transport on Overlays
        3.2 A Distributed Discrete-Event System
        3.3 Last-Passage Percolation, and Packet Losses
    4 Scalability Analysis
        4.1 Empirical Results on Back-Pressure
        4.2 Proving Throughput Scalability
        4.3 Empirical Results with Infinite Buffers
        4.4 Proving Latency Scalability
    Bibliography

3 Last-Passage Percolation in Pattern Grids
    1 Pattern Grid
        1.1 Definitions
        1.2 Sharp Vector
        1.3 Why is a Sharp Vector Necessary?
        1.4 Why is a Sharp Vector Useful?
    2 Directional Last-Passage Percolation
        2.1 Comparison with Linear Rate
        2.2 Asymptotic Linear Growth
        2.3 Ordering, Solidarity Property
        2.4 Hydrodynamic Scaling 1: Quadrant
        2.5 Hydrodynamic Scaling 2: Extended Quadrant
    3 Application: Infinite Discrete-Event Systems
        3.1 Quadrant
        3.2 Extended Quadrant
    4 Extension to Pattern Invariant Graph
        4.1 Pattern Invariant Graphs
        4.2 Leveled Graphs
    Bibliography

A Some Mathematical Background
    I Measure Theory and Ergodicity
    II Super-additivity
    III Stochastic Order
    IV Concentration of Measure
    V Heavy Tailed Distribution

List of Technical Contributions

Index

Synthesis (in French)
    1 Principles of Decentralized Networks
    2 Bandwidth Sharing
    3 Multicast over Peer-to-Peer Networks
    4 Last-Passage Percolation in Pattern Grids

Chapter 0

Introductory Chapter
Decentralized Control in Data Networks

Et ainsi ne pouvant faire que ce qui est juste fût fort, on a fait que ce qui est fort fût juste.¹
Blaise Pascal.

The Internet is a global sharing of resources to transfer computer files. Its development is pushed by the trend to make digital information available for reproduction and transport, at marginal cost. Any two of the connected machines can exchange data; the resources necessary for them to communicate are, most of the time, open to everyone.

What is the best way to share these resources for any Internet traffic? What are the merits of the Transport Control Protocol (TCP), which is de facto responsible for the amount of network resources allocated today to most applications?

We argue in this chapter that these open questions share a common ground: they involve a large population of data flows, which interact according to a process of discrete events.

Guidelines: The motivation and algorithms used to design the Transport Control Protocol (TCP) are briefly presented in Section 1, together with some questions that followed its wide adoption. The properties of this protocol are then presented using two complementary approaches. First, microscopic models, presented in Section 2, which take into account packet granularity and typically focus on a single flow considered in isolation. Second, fluid-flow models, described in Section 3, and the new light they shed on the bandwidth sharing produced by TCP. This allows us to motivate in Section 4 our methodology and the three contributions made in this document.

¹ Pensées, Raison des effets, fragment 135, Paris, Mercure de France, 1976, p. 78: "As they could not fortify justice they have justified force." (translated by A.J. Krailsheimer).

1 The Transport Control Protocol

In this section, we briefly motivate the principles and the algorithms used to regulate data communication in computer networks, as a part of the Transport Control Protocol (TCP). Starting with the earliest issues that this protocol has addressed (flow control, reliability), we stress later the most critical functions it performs (congestion control, bandwidth sharing).

We wish to quickly bring insights into implementation aspects, which illustrate and justify our approach. However, we have to admit that these highlights necessarily give a partial view. An exhaustive description of this protocol's specification may be found in [31].

1.1 Flow Control and Reliability

Flow control is among the first necessities for the communication of digital information, even for an elementary system like two communicating devices connected by a dedicated cable. Flow control makes sure that bits are sent by the source at a rate sustainable by both the receiver and the channel that is used.

Example: Sensing Device
To motivate the need for flow control with an example, the author was recently involved in an experiment with sensing devices carried by people. In the first version of the software used for these devices, no flow control was implemented when, after an experiment, each of these devices was connected to a computer to collect the data. The buffer of the receiving computer interface was filled immediately after data started to be sent; all the remaining bits were lost.
Most of the experimental data collected during a week was never recovered, as every device had to be programmed again. As a consequence the experiment had to be repeated shortly after, using improved software.

The same problem occurred in the earliest stage of communication networks between heterogeneous computers. It was solved through a feedback mechanism implemented between the receiver and the source: when the communication is first established, the receiver allocates a memory buffer, which we call the input buffer, and denote by B_IN. Its size, in bits, is immediately advertised to the source. In addition, every set of bits sent by the source, also called a data packet, is given a sequence number and is acknowledged by the receiver to the source, after it has been received in the input buffer.

With this simple feedback mechanism, the source can keep track of the amount of bits that are sent but still not acknowledged (also called flight_size). The source is also aware of the current filling of the receiver's input buffer, which is explicitly advertised in each acknowledgment packet. To avoid buffer overflow, the following condition must be enforced at the source: a packet is sent as soon as sufficient memory is known to be available for its correct arrival. For each packet, this is verified if and only if:

(1) receiver_mem_avail ≥ flight_size + pkt_size .

In the particular case where the buffer is full and all packets sent have been received and acknowledged, but could not leave the input buffer, the communication can only resume later, after enough memory has been made available again in this buffer. In this case, a specific acknowledgment packet is generated to advertise the new value of the memory available to the source.

According to rule (1), a packet departure is most of the time triggered by the acknowledgment of a data packet. Seen over time, it looks like there is an implicit relative synchronization between the processing of packets in the two devices. This property is usually referred to as self-clocking. Let us stress another one of its consequences: when the delay incurred between the two devices is non-negligible, the number of in-flight bits can be adapted by dimensioning the input buffer. It keeps the communication running on the channel between the two ends even if the delay is long. Note that this implicit coordination is more efficient than sending successive blocks of size B_IN bits separately, each being acknowledged before the next block is sent.

Coming back to the example of the sensing device we described above, a simple mechanism like this one was developed in the final platform. It solved this elementary problem of data collection. It is perhaps worth noting that this function may not be included in all digital communication systems, especially prototypes.

The feedback mechanism we described did not consider packet losses that may occur in the network.
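Before losses are taken into account, the loss-free mechanism of rule (1) can be illustrated with a minimal Python sketch. It is not taken from the thesis: the class and method names are hypothetical, and it assumes a constant packet size and that each acknowledgment carries the memory currently advertised by the receiver.

```python
from dataclasses import dataclass


@dataclass
class FlowControlSender:
    """Sender-side bookkeeping for rule (1), ignoring packet losses."""
    pkt_size: int            # constant packet size, in bits (assumption)
    receiver_mem_avail: int  # memory last advertised by the receiver, in bits
    flight_size: int = 0     # bits sent but not yet acknowledged

    def can_send(self) -> bool:
        # Rule (1): receiver_mem_avail >= flight_size + pkt_size
        return self.receiver_mem_avail >= self.flight_size + self.pkt_size

    def send_while_allowed(self) -> int:
        """Send as many packets as rule (1) permits; return how many left."""
        sent = 0
        while self.can_send():
            self.flight_size += self.pkt_size
            sent += 1
        return sent

    def on_ack(self, advertised_mem: int) -> int:
        """An acknowledgment removes one packet from the flight and refreshes
        the advertised memory; the departures it triggers are returned."""
        self.flight_size -= self.pkt_size
        self.receiver_mem_avail = advertised_mem
        return self.send_while_allowed()
```

With a 4000-bit input buffer and 1000-bit packets, four packets leave at once; afterwards each acknowledgment that still advertises 4000 bits releases exactly one new packet, which is the self-clocking behavior described above.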
Packet losses would indeed create a drift, as each lost packet would not be acknowledged and would never be deleted from the flight_size maintained by the source. After a certain number of losses, condition (1) could never be verified in practice, leading the communication to a permanent halt. Another problem is that lost packets are never received by the destination.

The previous mechanism can be simply adapted to cope with this problem. We can implement a time-out mechanism at the source: every time an acknowledgment is received, the time elapsed since the packet was sent is measured. This time is equal to the delay on the round trip path, from the source to the destination and back to the source. This delay is usually denoted RTT, for Round Trip Time. A timer is reset for every acknowledgment received. In case it times out, packets sent that are still not acknowledged are assumed lost and are all retransmitted. The value of this timer is chosen large enough not to interfere with normal delay variation (see below).

Implementation: Time-Out
In practice, the estimation of the Round Trip Time has to be done on the fly. The following rule is applied every time an acknowledgment is received at the source:

(2) rtt := α.rtt + (1 − α).new_rtt    (usually α = 0.9)
    rto := β.rtt                      (usually β = 2)
    timer := rto

The value of α, positive, makes the evolution of the estimated RTT smoother. The value of β, larger than 1, makes sure that the time-out is not triggered by a small normal variation of the RTT.
When a time-out occurs, a retransmission is sent, and a new timer is set with double the previous value. This implies an exponential back-off of packet departures in case of serious congestion; this multiplication is stopped after six steps, when the timer has been multiplied by sixty-four in total.

Another solution is to implement a cumulative acknowledgment between the source and the destination. Acknowledgment packets are also sent when a packet is received, but they now contain the maximal sequence number m such that all previous packets (m, m − 1, m − 2, . . .) have been correctly received. Losses are inferred by the source when it receives an acknowledgment with the same sequence number as its predecessor, also called a duplicate acknowledgment. To ensure error recovery, when the source receives a duplicate acknowledgment containing number m, it assumes that all packets sent after packet m have been lost, cancels them from the value of flight_size and retransmits a copy of all of them according to (1). In this case, retransmitted packets eventually resume the sequence of acknowledgments and communication is never stopped even in the case of packet losses.

This solution has the advantage of discovering a loss faster than relying on time-outs. However, in the particular case where all packets allowed to be sent are lost in a row, no duplicate acknowledgment is produced; this can lead the communication to a deadlock. For this reason, implementing a time-out mechanism cannot be avoided in practice.
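The two loss-detection mechanisms of this section, the time-out of rule (2) and the cumulative acknowledgment, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the names are hypothetical, sequence numbers are assumed to start at 1, and the back-off counter is assumed to be cleared by the next acknowledgment, which the text does not specify.

```python
ALPHA = 0.9            # smoothing factor, "usually 0.9" in rule (2)
BETA = 2.0             # safety factor, "usually 2" in rule (2)
MAX_BACKOFF_STEPS = 6  # doubling stops after six steps (factor 64)


class RetransmissionTimer:
    """Round trip time estimation and time-out value, following rule (2)."""

    def __init__(self, initial_rtt):
        self.rtt = initial_rtt
        self.rto = BETA * initial_rtt
        self.timer = self.rto
        self.backoff_steps = 0

    def on_ack(self, new_rtt):
        # Rule (2), applied for every acknowledgment received.
        self.rtt = ALPHA * self.rtt + (1 - ALPHA) * new_rtt
        self.rto = BETA * self.rtt
        self.timer = self.rto
        self.backoff_steps = 0  # assumption: an ack ends the back-off

    def on_timeout(self):
        # Exponential back-off: double the timer, at most six times.
        if self.backoff_steps < MAX_BACKOFF_STEPS:
            self.timer *= 2
            self.backoff_steps += 1
        return self.timer


def cumulative_ack(received):
    """Largest m such that packets 1..m were all received (0 if none)."""
    m = 0
    while m + 1 in received:
        m += 1
    return m


def acks_for_arrivals(arrivals):
    """Cumulative acknowledgment emitted after each packet arrival."""
    received, acks = set(), []
    for seq in arrivals:
        received.add(seq)
        acks.append(cumulative_ack(received))
    return acks


def count_duplicates(acks):
    """Number of acknowledgments equal to their predecessor."""
    return sum(1 for prev, cur in zip(acks, acks[1:]) if cur == prev)
```

For instance, if packets 1 to 6 are sent and packet 4 is lost, the arrivals 1, 2, 3, 5, 6 produce the acknowledgments 1, 2, 3, 3, 3: the last two are duplicates and reveal the loss without waiting for the timer.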
Note that the case where every packet in flight is lost is unlikely to happen except for extremely lossy networks, or if the number of packets simultaneously sent is small (because of a small receiver buffer, or for short flows).

Implementation: Cumulative Acknowledgment
In practice, packets may be reordered as well as being lost, either because of a discipline implemented in the routers, or because different packets have used different routes. For this reason, the retransmission is rarely sent after receiving the first duplicate acknowledgment. In practice, the third duplicate acknowledgment is used. Note that with this rule and no time-out mechanism, communication could have been stopped even for a flight_size allowing three packets to be sent.
Which packets are retransmitted? All packets that are not acknowledged are susceptible to be lost, because of the cumulative implementation of the feedback. As a consequence, by default a copy is sent for all of them (see extensions discussed in Section 1.5).

Other mechanisms for flow control, including the use of negative acknowledgments signaling packet losses, may be found in [19].

To summarize what we learned from this elementary case:
• Problems in network control are quite simple to phrase. The solutions they require can be deployed end-to-end (as advocated for example in [29]) and may be described in a compact algorithmic form. Their analysis can be more complex as they involve distributed systems.
• The solutions implemented tend to mix different functions together: in this case, the flow control and the loss recovery.

We prove in the next paragraph, and in the rest of this dissertation, that these two preliminary facts apply remarkably well to characterize more complex problems: TCP bandwidth sharing for unicast communication, and a natural extension to the control of one-to-many multicast communication.

1.2 Congestion Control

The first version of the TCP protocol used the receiver-feedback flow control and error recovery presented above. As the network changed during the end of the eighties, it became necessary to modify the flow control mechanism, as the nature of the resource sharing on the Internet evolved.

The mechanism we have presented allows a finite number of packets to be simultaneously sent; it also sets the pace of packets based on their acknowledgments. As routers do have a limited memory buffer, and thus need to drop packets when they receive too much traffic, the reaction of this mechanism is insufficient in practice when the number of active flows on the network becomes too large.

The main concern with this mechanism is that the network can enter a critical regime where packets are sent but dropped extensively, incurring a large number of retransmissions. Note that retransmissions, in the setting that we have presented here, are not performed efficiently as a whole flight_size is created again and sent. As a consequence, the efficiency of the network is impacted by a situation where massive drops are expected.

In other words, the overall amount of data that can be transported per unit of time decreases in case of overload.

This is close to the everyday experience of a car driver caught in a traffic jam on an urban highway. In the presence of heavy traffic, vehicles are forced to slow down, and the overall number of vehicles getting through per unit of time is reduced. It happens specifically at times when a larger number of them need to use the road.

Phenomena of this sort were observed on the Internet for the first time in the eighties. They were a consequence of the increase of the Internet traffic demand, and of its concentration on a few critical links, typically used to cover long distances. Periods of congestion collapse, where the overall capacity was reduced, were typically met at times of the day when many flows were produced by users. Note that congestion does not hit every flow equally: flows close to the bottleneck could be able to process their packets faster in the case of a large buffer overflow. Other flows, that compete to access the router's buffer from further locations, could be starved. We refer to this discrimination as the round-trip time bias.

The problem of handling congestion shares similarities with the flow control problem described in the previous section. Buffer overflows lead to the loss of data packets, which needs to be addressed by setting a correct sending rate. But this problem is different because of the following fact: the scarce resources, that are now the most precious and need to be carefully used, are shared between users. The phenomenon of congestion collapse can be interpreted as a move of the bottleneck link in the network from the edge to the center (from the buffer of the receiving devices, already used to control the traffic as in rule (1), towards the capacity and buffer present in shared routers).

How to avoid a state of congestion was addressed by revisiting the control that we described earlier.
As before, the number of unacknowledged bits allowed to be transmitted simultaneously remains bounded by a constant, also called a window. The new feature is that the size of this window needs to be adapted to the state of the routers used in the path from the source to its destination. How to perform this adaptation with a minimal amount of signaling is a difficult task; the answer, which is still used today, was proposed in 1988 by Van Jacobson in [15] and soon incorporated in the TCP protocol.

1.3 The Algorithms of TCP

Slow Start

The correct window size that needs to be used by the flow at some given time is chosen as the minimum of two values: the receiver window rwnd that is equal to receiver_mem_avail, and a new value, called congestion window cwnd, which adapts according to the resources available in the network. A packet is then allowed to be sent by the source if:

(3) min(rwnd, cwnd) ≥ flight_size + pkt_size .

Let us think of the value of this window in packets, rather than in bits. Usually denoted W, it is equal to (min(rwnd, cwnd))/(pkt_size) assuming a constant packet size.

The slow start algorithm can be described as follows: the congestion window cwnd starts with a conservative value that is a small multiple of pkt_size; its value is increased by pkt_size for every acknowledgment of new data packets received.

In this setting, every acknowledgment packet is responsible for the departure of two packets, one because the number of packets included in the flight size is decreased by 1, and one as cwnd is increased by the size of one packet. As a consequence, one can see that this so-called slow start mechanism in fact implies a fast increase, roughly exponential, of the congestion window with time.

Losses are captured by a time-out based on the round trip time estimation, as before. As losses are not usually caused by unreliable links, they indicate that the current flight_size is above the number of bits that the network can simultaneously handle, because of congestion. To avoid at all costs the event of general congestion collapse, the congestion window is reset to a small multiple of pkt_size after each packet loss.

This mechanism ensures that each flow is able to reach its natural equilibrium, and at the same time reacts to congestion. It was shown in [15] that it already reduces the amount of retransmissions on a typical link in overload by several orders of magnitude. However, the time spent in the regime close to the optimum is quite small, as the amount of packets sent doubles in a single round trip time.
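The doubling per round trip time can be checked with a short Python sketch. It is an idealized illustration, not the thesis's model: the function name is hypothetical, the window is counted in packets, and it assumes no loss, no receiver limit, and one acknowledgment per packet.

```python
def slow_start_windows(initial_cwnd_pkts, rounds):
    """Congestion window (in packets) at the start of each round trip."""
    cwnd = initial_cwnd_pkts
    history = [cwnd]
    for _ in range(rounds):
        acks_this_round = cwnd  # every packet of the window is acknowledged
        for _ in range(acks_this_round):
            cwnd += 1           # cwnd grows by one pkt_size per acknowledgment
        history.append(cwnd)
    return history
```

Starting from two packets, `slow_start_windows(2, 4)` gives the sequence 2, 4, 8, 16, 32, which is the roughly exponential growth mentioned above.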
Moreover, the reaction of slow start to congestion appearing in the network seems rather strong.

Congestion Avoidance

This algorithm, introduced at the same time as slow start, proposes a smoother evolution of the window: each acknowledgment packet is responsible for an increase of the window equal to 1/W, where W is the value of cwnd expressed in number of packets. As a consequence, the size of the window is increased by one packet only after all the packets sent in a whole window are acknowledged.

Implementation: Additive Increase
In practice, the increase of W is done via cwnd, for each acknowledgment packet received,

(4) cwnd := cwnd + pkt_size · pkt_size / cwnd .

This new feature comes together with a necessary adaptation of the window in the occurrence of congestion, identified through packet losses and time-outs. The effect of the window adaptation is necessarily asymmetric: the decrease of the congestion window should be stronger to account for the previous traffic left over, and its consecutive retransmissions. Consequently, it was proposed to implement a multiplicative decrease of the window: cwnd should be multiplied by a constant d smaller than 1, usually set to 0.5. As a consequence, in case of severe congestion incurring consecutive losses, the congestion window is guaranteed to become exponentially smaller with occurring losses.

These two algorithms (slow start and congestion avoidance) solve two distinct problems: the former is efficient to go to equilibrium, the latter is good at staying close to it. Historically they were quickly combined together, using a threshold value sstresh, that governs the reaction to the arrival of an acknowledgment packet in the source:

(5) When an acknowledgment is received, do the following:
    cwnd := cwnd + pkt_size,                      if cwnd < sstresh
    cwnd := cwnd + pkt_size · pkt_size / cwnd,    if cwnd ≥ sstresh .

These two mechanisms were included together in the standard of TCP in two successive versions: the first to be proposed, TCP Tahoe, did not implement multiplicative decrease on cwnd but on sstresh:

(6) In case of a time-out, do the following:
    sstresh := max(flight_size / 2, 2) and cwnd := 1 .

TCP Reno, proposed immediately after, implemented the following mechanism, Fast Recovery Fast Retransmit, based on cumulative acknowledgments. It can be considered as a more ideal implementation of an Additive Increase Multiplicative Decrease (AIMD) window adaptation:

(7) If three duplicate acknowledgments arrive before the time-out:
    timer := rto
    sstresh := max(flight_size / 2, 2)
    cwnd := sstresh .
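Rules (5), (6) and (7) can be gathered in one small state machine. The Python sketch below is only an illustration under stated assumptions: the class name and the initial threshold value are hypothetical, all quantities are counted in packets (that is, pkt_size = 1), and the timer update of rule (7) is left out.

```python
from dataclasses import dataclass


@dataclass
class RenoWindow:
    """Congestion window adaptation, in packets (pkt_size = 1 assumed)."""
    cwnd: float = 1.0      # congestion window
    sstresh: float = 64.0  # threshold; the initial value is hypothetical

    def on_ack(self):
        # Rule (5): slow start below the threshold, additive increase above.
        if self.cwnd < self.sstresh:
            self.cwnd += 1.0
        else:
            self.cwnd += 1.0 / self.cwnd

    def on_timeout(self, flight_size):
        # Rule (6), TCP Tahoe: halve the threshold, restart from one packet.
        self.sstresh = max(flight_size / 2, 2)
        self.cwnd = 1.0

    def on_triple_dup_ack(self, flight_size):
        # Rule (7), TCP Reno: multiplicative decrease of the window itself.
        self.sstresh = max(flight_size / 2, 2)
        self.cwnd = self.sstresh
```

In this sketch a triple duplicate acknowledgment with eight packets in flight leaves the window at four packets, while a time-out in the same situation sends it back to one packet and lets slow start climb again up to the threshold.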

Implementation: Window Inflation
It seems at first contradictory to retransmit and reduce the window at the same time. When a packet is lost, the source may be aware of it only after some delay. A number of bits might have been already sent in the flight, before the third duplicate acknowledgment, or a time-out, triggers retransmission. In total, it cannot be more than flight_size − pkt_size bits sent and received, creating at most W − 4 additional duplicate acknowledgments.
The presence of these packets that have already been sent in the flight size, when the window reduces, introduces a temporal gap in the departure of packets. This gap may not be needed as the duplicate acknowledgments received prove that other packets are being correctly transmitted. For this reason, the fast retransmit fast recovery implements the following window inflation:

(8) when the third dupACK is received,                              cwnd := sstresh + 3.pkt_size
    for any new dupACK received,                                    cwnd := cwnd + pkt_size
    when all data sent before the retransmission are acknowledged,  cwnd := sstresh

Note that even with window inflation, there remains a time interval with no new packet departure: the first half of the duplicate acknowledgments received does not allow new packets to be sent, as the window was divided by two. However, packet departures resume with the second half of the duplicate acknowledgments. Another advantage of this approach is to continue the self-clocked departure of packets after a loss, avoiding a sudden burst of packets happening at the next acknowledgment.

1.4 A Brief Discussion About Performance

Let us briefly summarize the highlights of the flow control, congestion control and reliability as implemented in the TCP protocol. In the rest of this document we focus, unless otherwise mentioned, on the Reno version of TCP.

TCP is Decentralized and Scalable

Being decentralized is certainly the property that is best exemplified by this protocol. The network is unaware of the number of active flows, and of the controls they implement. Moreover, indications of the global states of congestion in the network's links are neither maintained nor signaled; these are implicitly derived by each flow from the history of its data packets' acknowledgments.

• No expli it feedba k from the network is needed to adapt the windowsize, as advo ated by other ongestion avoidan e me hanisms proposed(see for example [28).• No isolation between ows needs to be done in the routers, as it wasrequired for other types of ongestion avoidan e me hanisms (see forexample fair queuing with pa ket pair in [18).

Page 20: baccelli/Evaluation/AugustinChaintreauPhD.pdf

10 Chapter 0• Moreover, the amount of state maintained by the sour e and the re- eiver to ommuni ate is small. This is an important feature as busyweb servers have to handle a large number of TCP ows at the sametime.But is TCP E ient and Fair ?That TCP is e ient is still the subje t of a debate. Despite a strong in reaseof the apa ity of data networks, the same me hanisms are used withoutbeing modied today, adapting more or less to the in reased speed and sizeof data ommuni ations. However, it should be mentioned that for longdelay or large apa ity, re ent resear h works have been proposing modiedversions of TCP for optimized performan e.Being fair ould be thought as the most ontroversial property of thisproto ol. This proto ol was shown in many situations to share the network'sresour es among ows with a broken fairness. Flows are not treated equally,espe ially as a onsequen e from their round trip times. This an be seen asa kind of redu ed version of the round trip time bias experien ed in ase of ongestion, where ows do not suer from starvation any more.Another quite important issue is that it oers little prote tion againstmisbehaving ows, that may not redu e their own window in ase of loss, oreven do not limit their transmission by a onstant size window. Following thisobservation, a re ommendation that is often quoted today in the literatureis that any appli ation deployed in the network needs to show that it doesnot harm the bandwidth of TCP ows.1.5 Some ExtensionsDelayed A knowledgmentThis feature was enabled to redu e the amount of signaling pa kets (in this ase, a knowledgment pa kets) in the network. In this s heme, an a knowl-edgment is not ne essarily sent for every re eived pa ket. In general, ana knowledgment is sent for every se ond segment re eived or after a er-tain elapsed time (that is usually referred to as a delayed a knowledgmenttime-out). 
On the other hand, after a segment loss, the duplicate acknowledgments produced are sent immediately, as this situation needs to be signaled to the source quickly. More details can be found in [31].

Selective Acknowledgment (SACK)

In case of multiple losses, the cumulative acknowledgments that we have described perform badly. Such multiple losses do not seem to be uncommon on the Internet, where losses are created in bursts by buffer overflows.


For this reason, improved acknowledgments were proposed, which describe more accurately the bits that were correctly received. TCP Selective ACKnowledgment (SACK), presented in [22], advertises to the source portions of received bits up to a certain number, and enhances the error recovery. Fractional acknowledgments are another technique, proposed in [10].

Explicit Congestion Notification (ECN)

The first explicit feedback mechanism for congestion avoidance in data networks was introduced in [28]. It proposed to add a single congestion indication bit in the header of each IP packet, to report congestion. This bit was initially set to zero but could be changed by a router that was experiencing an average queue length above a threshold. The window implemented by the source is supposed to react every round trip time if more than half of its packets are marked. The window then needs to be decreased according to a multiplicative decrease, with value d = 0.875.

A variation of this technique was presented in [9] for TCP Reno: it proposes to mark packets using a probabilistic marking scheme in the routers, such that the marking probability increases linearly with the average queue length, following the mechanism of Random Early Detection (RED) proposed in [11]. The window of TCP is halved at the first marked packet experienced, and consecutive marking has no further effect in the following round trip time.
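The two window reactions to marked packets described above can be sketched as follows. This is only an illustration under stated assumptions: the function and class names are hypothetical, the additive increase applied when fewer than half of the packets are marked is an assumption of the sketch (the text only specifies the decrease), and the minimum window of one packet is also assumed.

```python
def decbit_window_update(window, marked, sent, decrease=0.875, increase=1.0):
    """One per-round-trip update in the spirit of the scheme of [28].

    If more than half of the packets sent during the last round trip were
    marked, the window is multiplied by `decrease` (d = 0.875 in the text).
    The additive `increase` used otherwise is an assumption of this sketch.
    """
    if sent > 0 and marked > sent / 2:
        return max(1.0, window * decrease)
    return window + increase


class RenoEcnWindow:
    """Reaction described for TCP Reno in [9]: the window is halved at the
    first marked packet, and further marks are ignored for one round trip."""

    def __init__(self, window):
        self.window = float(window)
        self.hold_until = None  # marks received before this time are ignored

    def on_ack(self, now, rtt, marked):
        if marked and (self.hold_until is None or now >= self.hold_until):
            self.window = max(1.0, self.window / 2)
            self.hold_until = now + rtt
        return self.window
```

With a window of 16 packets, 9 marks out of 16 trigger the multiplicative decrease, whereas exactly 8 marks do not, since the text requires more than half of the packets to be marked.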


2 Microscopic Models

The first section of this chapter quickly presented the practice of controlling data networks. In this section and the next one, we give a quick overview of the theories used to justify and better understand the capabilities of TCP control. These frameworks, introduced over the last ten years, stem from the analysis of diverse modeling tools: queuing theory, partial differential equations, Markovian processes, point processes, (max,+) algebra.

Let us start with what is called here 'Microscopic modeling', which takes explicitly into account the behavior of TCP driven by acknowledgments and lost packets. These are models in which small variations of the algorithms implemented might have an impact on the analytical results found. That type of analysis quickly became an active area of research in the 90s, with the wide adoption of TCP in a fixed specification, and many aspects of its performance in networks left open for research.

Almost all these models consider a single TCP flow using a network path. They usually make the following simplifications:

a1 The receiver window rwnd is not constraining the communication and the source always has data ready to be sent. This is equivalent to assuming that the sending rate depends only on the congestion window size. Moreover, we assume that each packet sent has the same size. We consider windows expressed in packets.
a2 Every packet is acknowledged instantaneously, and perfectly: the effect of delayed acknowledgments in the receiver is neglected, and the error recovery using retransmission is not taken into account.
a3 The flow has reached a steady state behavior, which may be thought of as that of a long TCP flow after some initialization time.

2.1 Preliminary Studies with Dedicated Networks

The first microscopic studies of the dynamics of TCP focused on a single flow using a set of links and routers. It is assumed that this route admits a bottleneck link (i.e. one with minimal capacity); its capacity is denoted by C and its buffer size by B.
The bandwidth delay product of this route is defined as the round trip time multiplied by the capacity of the bottleneck link, expressed in packets per second. It represents a bound on the window that can be used, if the buffering effects in the networks are neglected.

In [30], a model is proposed for flows using TCP Tahoe. The window evolution is approximated by a continuous time periodic process. Values of ssthresh in steady state are derived based on the parameters of the network. [20] conducted a similar analysis for both TCP Tahoe and TCP Reno, in the case of a buffer supposed to be small when compared to the bandwidth delay product. In this context, the round trip delay of a packet and its acknowledgment can be well approximated by the deterministic round trip time, and the slow start phase can be accurately described.

It is shown that TCP by itself may operate in peculiar regimes for some values of the parameters: TCP Tahoe might alternate between two slow start phases with different ssthresh for small buffer; more generally, the utilization of the link is poor, because of periodic congestion recovery epochs, both for TCP Reno and Tahoe.

2.2 Statistical Multiplexing

To the best of our knowledge, there is no closed form result, derived from packet level mechanisms, that can be deduced for the evolution of several flows sharing a part of networks' links and buffers. The difficulties involved recall the complexity inherent in multiclass queuing systems.

As a consequence, the models proposed for the behavior of a TCP flow in a shared network make the following assumption:

a4 There exists a state of operation of the network that is not impacted by the behavior of the reference flow studied.

In other words, the network should be considered as a 'black box': it suffers statistically from congestion epochs, seen by the TCP flow through packet losses, as an exogenous statistical feedback. As a consequence, the performance of TCP should be expressed as a function of aggregate measures of the network state (such as the properties of the processes of packet losses).

AIMD and the Rare Loss Asymptotic

A first model of this kind was drafted in [24]. It assumes that sent packets may be lost independently with probability p. The behavior of TCP is an ideal congestion avoidance phase of TCP Reno, with instantaneous and perfect feedback in case of packet losses. The evolution of the window, after a rescaling, can be approximated for small p by a process that is described with a drift and a collection of negative jumps, distributed in time according to a Poisson Process.
Results are given on the invariant measures of this limit process.

Following this approximation, the sending rate can be related to p via:

(9)  $\mathrm{Thru} = \mathrm{Cste} \cdot \dfrac{\mathrm{pkt\_size}}{\mathrm{RTT}\,\sqrt{p}}$

This heuristic formula is also presented in [23], where it is extended to the case of periodic drop and delayed ACK, leading to the same formula with different values of Cste. A rigorous justification of this relation was shown in [7] (see the framed paragraph below).
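Formula (9) is straightforward to evaluate numerically. The helper below is a minimal sketch; its name and argument names are hypothetical, and the constant is left as a required argument since the text stresses that Cste depends on the loss and acknowledgment assumptions. The value 1.3098 used in the example is the one this section reports for the model of [7].

```python
import math


def sqrt_formula_throughput(pkt_size, rtt, p, cste):
    """Formula (9): Thru = Cste * pkt_size / (RTT * sqrt(p)).

    pkt_size is in bytes (or any data unit), rtt in seconds and p is the
    packet loss probability; the result is in data units per second.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if rtt <= 0:
        raise ValueError("rtt must be positive")
    return cste * pkt_size / (rtt * math.sqrt(p))


# 1500-byte packets, 100 ms round trip time, 1% loss
rate = sqrt_formula_throughput(pkt_size=1500, rtt=0.1, p=0.01, cste=1.3098)
```

Because the loss probability enters through its square root, multiplying p by four only halves the predicted throughput.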


These models explained for the first time the empirical relation between loss rate and average throughput, initially observed in [8]. Unfortunately for practical purposes, they could not accurately capture the behavior of TCP for loss rates beyond 1%.

More Details: Rare Loss Asymptotic

The Markovian process of the window evolution was given a deeper analysis in [7]. Assuming independent losses between RTT, and an ideal AIMD evolution, the authors could give a precise meaning to the asymptotic regime for small loss rates:

They introduce the process $V_n^{(p)}$ of the window seen when a packet is lost, and its re-scaled version $\sqrt{p}\,V_n^{(p)}$. The following results hold as p goes to 0:

• The rescaled process, as well as its invariant measure, converges to a limit process and its associated invariant measure, characterized by:

$$(V_{n+1})^2 = \frac{(V_n)^2 + 2E_n}{2}$$

where $E_n$ are i.i.d. with law exp(1).

• Several previous heuristic results were confirmed using this approach. The constant for the coefficient Cste in (9) could be established to 1.3098, as observed in [8].
• This result applies also to the original process of the window, but the limit process is harder to write. It can be proved with a maximal window size as well.
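The order of magnitude of the constant can be checked with a small Monte Carlo experiment. The sketch below does not use the limit process; it simulates directly an idealized AIMD window, one step per round trip, under assumptions of its own: the window grows by one packet per loss-free round trip, a round trip with window w sees at least one loss with probability 1 - (1 - p)^w, and the window is then halved once. The function name and its parameters are hypothetical.

```python
import random


def estimate_cste(p, rounds=200_000, seed=1):
    """Estimate Cste in formula (9) for an idealized AIMD window.

    Returns sqrt(p) times the average number of packets sent per round
    trip, measured after a warm-up period.
    """
    rng = random.Random(seed)
    w = 1.0
    sent = 0.0
    warmup = rounds // 10
    for i in range(rounds + warmup):
        if i >= warmup:
            sent += w  # packets sent during this round trip
        if rng.random() < 1.0 - (1.0 - p) ** w:
            w = max(1.0, w / 2.0)  # multiplicative decrease
        else:
            w += 1.0  # additive increase
    return (p ** 0.5) * sent / rounds


print(round(estimate_cste(1e-3), 2))
```

For small p the estimate should fall in the neighborhood of the value quoted above, up to discretization effects and statistical noise; it is not meant to reproduce the fourth decimal.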

The TCP Reno Throughput Formula

The window evolution was studied in [25] via a model closer to the packet level of TCP. The regenerative process introduced is based on rounds, each of which corresponds to a window of packets sent. The feedback for each packet is determined in the following way: given that none of the previous packets of its round are lost, this packet may be lost with probability p. Each packet lost implies that all remaining packets that were sent in this round are lost. A new round starts immediately, with a halved window; losses in this new round are independent from the previous round. This model does not account for retransmission but it includes time-outs that are produced by the losses of three duplicate acknowledgments, as well as the maximal window size advertised by the receiver.

The main merit of this model is that it offers an accurate approximation of the relation between the aggregate loss rate seen by a flow and the long term average throughput that it achieves, as many empirical validations have shown. The formula derived from this approach is usually quoted as the reference throughput that should be achieved in the long term by a flow, in order to respect the natural bandwidth sharing created by TCP.
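The text does not reproduce the formula itself. For illustration, the sketch below evaluates the approximation that is commonly quoted in the literature for round-based Reno models with time-outs and a maximal window; identifying it with the exact expression of [25] is an assumption of this example, as are the function name and the parameter names (t0 for the time-out duration, b for the number of packets covered by one acknowledgment).

```python
import math


def reno_throughput(p, rtt, t0, wmax, b=1):
    """Approximate long-term throughput of a Reno flow, in packets per second.

    Evaluates min(wmax / rtt,
                  1 / (rtt * sqrt(2bp/3)
                       + t0 * min(1, 3 * sqrt(3bp/8)) * p * (1 + 32 p^2))).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    congestion_avoidance = rtt * math.sqrt(2.0 * b * p / 3.0)
    timeouts = t0 * min(1.0, 3.0 * math.sqrt(3.0 * b * p / 8.0)) * p * (1.0 + 32.0 * p * p)
    return min(wmax / rtt, 1.0 / (congestion_avoidance + timeouts))
```

For rare losses the time-out term is negligible and the expression reduces to the square root law of formula (9), with a constant equal to sqrt(3/(2b)); for a small advertised window the throughput is simply capped at wmax / rtt.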


Loss Correlation

Congestion epochs, which imply window reductions of a given flow through packet losses, may be seen as a point process of events in time. It might be thought of as deterministic, as done in [30] and [20], or as an independent process for either different packets [23] or different RTT [25], [7].

We have seen that the performance of TCP is impacted by the properties of this process; for instance, the constant coefficient in (9) is different for random or periodic losses (see e.g. [23]). This relation is studied for a general stochastic process of congestion epochs in [1]. This article considers a model of window evolution based on an exogenous ergodic stationary sequence of congestion events, each of which represents any event (one or several packet losses) that decreases the window by half. Note that the occurrences of congestion events, given as an input for this model, are assumed independent from the current window size. Explicit formulas for the throughput are shown using the correlation function, which proves in particular that for a given loss rate, the performance of TCP is improved by losses occurring in bursts.

More Details: Loss Correlation in the Rare Loss Asymptotic

The impact of loss correlation was also studied analytically for the asymptotic of rare loss in [13] for the following setting: clumps of packet losses occur, where a certain number of subsequent window reductions happen (with a fixed law X). The model is studied when the rate of occurrence of clumps p becomes arbitrarily small.

In a result similar to the one shown in [7], the embedded Markov chain describing the window at the end of a clump, denoted $V_n^{(p)}$, can be asymptotically characterized:

• The scaled process $(\sqrt{p}\,V_n^{(p)})_n$, together with its invariant measure, converges in distribution to a limit process and its associated invariant measure, which is characterized by:

$$\frac{(V_\infty)^2}{2} + E_0 \overset{dist}{=} I, \quad \text{where } E_0 \text{ is independent with law } \exp(1),$$

$$\text{and } I = \int_0^{+\infty} e^{-\xi(t)}\,dt \quad \text{with} \quad \xi(t) = \log(4) \sum_{k=1}^{N(t)} X_k,$$

N is a Poisson process with parameter 1, and $(X_k)$ are i.i.d. variables with law X.

• Note that I can be expressed as an exponential functional of the Lévy process ξ. This category of random variables exhibits a remarkable symmetry property and has recently received considerable attention for its applications in mathematical finance.
• Generally, the stationary throughput can be obtained as a function of the variable X. The following result holds:

$$\text{if } X \le_{conv} Y, \text{ then } \mathrm{Thru}_X \sqrt{E[X]} \ge \mathrm{Thru}_Y \sqrt{E[Y]}.$$

As $E[X] \le_{conv} X$ by Jensen's inequality, this proves in particular that $\mathrm{Thru}_X \ge \mathrm{Thru}_{E[X]}$ (assuming here that E[X] is an integer), so that the throughput for varying clumps of losses with number X is higher than for a fixed number of losses E[X] in every clump.


2.3 TCP as a (max,+) Linear System

As demonstrated in [3], the discrete-event nature of TCP control and the operations that it performs all fit well in the (max,+) linear algebra.

Window Flow Control is a (max,+) Linear System

A single server queue with first come first served service discipline is known to be a linear system in the (max,+) algebra: let us introduce $a_m$ and $d_m$, the times at which customer m arrives at and leaves such a queue. These two sequences of dates can be related via:

$$d_m = \max(d_{m-1}, a_m) + \sigma_m = (d_{m-1} \oplus a_m) \otimes \sigma_m = (d_{m-1} \otimes \sigma_m) \oplus (a_m \otimes \sigma_m),$$

where we have denoted by $\sigma_m$ the time required by the queue to serve customer m. Note that for the last equality we have used the classical notation of (max,+) algebra to denote the maximum and the addition: $a \oplus b = \max(a, b)$ and $a \otimes b = a + b$.

The same result extends to a sequence of K queues in tandem (where the departure time from queue k is the arrival time in queue k + 1, for k = 1, ..., K−1). This sequence of queues may be thought of as the sequence of router buffers traversed by each packet belonging to this flow.

A window flow control with a constant size W also defines a linear operation in this algebra: in fact, the departure from the source of packet m is, in this case, authorized if packet m−W has arrived at the destination and is acknowledged. Assuming instantaneous feedback, and that a packet arrives at the destination as soon as it leaves the last queue (associated with index k = K), we can write the following recurrence equation:

(10)
$$d_{0,m} = d_{K,m-W} \oplus a_m \quad \text{for } k = 0 \text{ (i.e. the source)},$$
$$d_{k,m} = (d_{k,m-1} \oplus d_{k-1,m}) \otimes \sigma_{k,m} \quad \text{for } k = 1, \dots, K.$$

Again, it defines a linear system in (max,+), as it proves that the vectors made with departure times for packet m can be computed as a product of m matrices that only depend on the service times, and the time $a_m$ where the packet is available to be sent by the source.

Consider now the case of a window that does not remain constant in time. Results already found can be extended using $(W_m)_m$, the process of windows experienced by each packet sent. This process characterizes the recurrence equation in the following way: packet m can be sent immediately after packet $m - W_m$ is acknowledged.

The Impact of Cross Traffic

The previous (max,+) linear system only accounted for packets of the reference flow that we study, which we also call reference packets. Other flows using the capacities and buffers of each router are not included in these previous models. The presence of this cross traffic is experienced by reference packets through an additional queuing delay between two of them, to traverse each of the links on the route. As shown in [2], the effect of the cross traffic on the throughput is difficult to predict in general, even for a constant window size, as it does not only depend on a first order statistic like the rate of cross traffic packet arrivals.

One way to circumvent this extreme sensitivity is to assume that the amount of cross traffic packets that intervene between two successive packets of the reference flow is fixed in distribution, and is independent between different pairs of reference packets and links. For each reference packet, this cross traffic adds a random time spent in this queue, which comes in addition after the preceding reference packet has left. We add this time value to the reference packet processing time, as a random aggregated service time (usually denoted AST), associated with each reference packet and each link.

In this model, congestion occurring in a link is represented by a large value taken by its associated aggregated service times, creating for a flow a large time gap between two acknowledgments received. Packet losses and TCP reactions can be represented accurately as a function of these time sequences; several examples can be found in [3].

Outcomes

The main result of this approach is that the throughput depends only on the RTT and the bottleneck rate, for deterministic constant service time. Formulas are given for TCP Tahoe and Reno under some different feedback assumptions. However, if aggregated service times are random, the long term throughput of a flow is not characterized by their mean values only, and it depends on the distribution that is chosen on each link.
Analytical results are usually hard to obtain for such sensitive closed loop systems.

This modeling paradigm is also a compact representation of the packet level implementation of TCP; it allows fast and accurate simulations of its control behavior in a network. As shown in Chapter 2, this simulation technique scales for a large number of flows, and applies to TCP as well as extended versions of this protocol.
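The single queue recursion and the window recursion (10) of this section translate almost literally into code. The sketch below is only an illustration: function names are hypothetical, departure times with a negative packet index are taken to be minus infinity (the neutral element of the max operation), and service times are supplied as plain lists rather than drawn from a distribution.

```python
NEG_INF = float("-inf")


def fifo_departures(arrivals, services):
    """Single FIFO queue: d_m = max(d_{m-1}, a_m) + sigma_m."""
    departures = []
    prev = NEG_INF
    for a, s in zip(arrivals, services):
        prev = max(prev, a) + s
        departures.append(prev)
    return departures


def window_tandem_departures(arrivals, services, window):
    """Recursion (10) for K queues in tandem under a constant window W.

    arrivals[m]        : time a_m at which packet m is available at the source
    services[k - 1][m] : service time sigma_{k,m} of packet m at queue k
    Returns d with d[k][m] for k = 0 (source) up to K (last queue).
    """
    K = len(services)
    M = len(arrivals)
    d = [[NEG_INF] * M for _ in range(K + 1)]
    for m in range(M):
        # Packet m may leave the source once packet m - W is acknowledged.
        acked = d[K][m - window] if m >= window else NEG_INF
        d[0][m] = max(acked, arrivals[m])
        for k in range(1, K + 1):
            prev_same_queue = d[k][m - 1] if m >= 1 else NEG_INF
            d[k][m] = max(prev_same_queue, d[k - 1][m]) + services[k - 1][m]
    return d
```

With two queues, unit service times and all packets available at time 0, a window of one packet yields departures from the last queue at times 2, 4, 6, 8, whereas a window of two packets is enough to keep the bottleneck busy and yields 2, 3, 4, 5.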


3 Bandwidth Sharing with Fluid Flows

We have already seen in Section 1 how flow control, congestion control and error recovery are jointly performed by TCP. The algorithms implemented in this protocol perform in fact another important task implicitly: they allocate network resources between competing flows.

This combination does not necessarily appear as a bad feature. First, the essence of congestion avoidance is to prevent the panic state of overloading in a network, which is best addressed directly in the sources. Hence it should define how flows should conduct themselves when they share a network. Second, implementing two distinct algorithms, each one being responsible for one of these tasks, could prove complex and costly in practice. Even if performance benefits are to be expected, these mechanisms would necessarily coexist, making their designs and their analysis difficult.

This open problem requires in addition answering two preliminary questions. What is the suitable objective of a well designed bandwidth sharing scheme? Can this objective be obtained by a decentralized protocol?

The models we present in this section are based on the abstraction of fluid flows, where the unit of data used is supposed small enough to approximate data transport by a continuous process. Thanks to this simplification, networks where multiple flows interact could be well studied. We highlight in this section some of their remarkable properties.

3.1 Fairness among Persistent Flows

We start by describing bandwidth sharing under the following assumption: there is a fixed number of flows, and all of them always have data to send. It corresponds to a network with a fixed saturated traffic demand. We suppose that routes are fixed, such that the network can be modeled as:

(11)
  a set of links L, with capacity $C_l$
  a set of routes R, defined as subsets of links
  a number of flows per route $(n_r)_{r \in R}$

From RTT Bias ...

As exhibited for the first time in [8], and described in more detail in [20], the AIMD mechanism of the window based on acknowledgments makes TCP naturally biased against flows with long RTT. In other words, flows going through more links are allowed by TCP a smaller sending rate.

Generally speaking, this bias could be described by the following rule [20]: the throughput given to a flow is inversely proportional to $\mathrm{RTT}^a$ with 1 ≤ a ≤ 2. This coefficient comes from a double effect: the throughput is the window divided by the RTT, and the window of a flow increases more slowly for larger round trip times. This effect may depend on several factors, most notably the current queuing delay in the bottleneck, such that different values of a are found between one and two for different scenarios, as obtained in [20] with a microscopic model of a single flow.

Let us consider the example shown in Figure 1. The most egalitarian way of sharing the bandwidth is to give half of the capacity to each route. This allocation is an example of the so-called max−min fairness, which can be defined for any network and any constrained flows as the allocation which maximizes, for the lexicographic order, the increasing sorted vector of the rates allocated ([4] p.450). In other words, this allocation balances, as much as possible, any trade-off in order to increase the smallest rate. In opposition to this egalitarian objective, the RTT bias effect can be described as an additional penalty enforced by TCP for flows that use more resources to carry their data than others. It was initially seen as an undesired feature of the protocol, one that makes the transparent access of the Internet unnecessarily complex for network designers and users.

[Figure: link l1 followed by link l2; route r1 uses l1, route r2 uses l2, route r3 uses both links.]
L = {l1, l2} and R = {r1 = {l1}, r2 = {l2}, r3 = {l1, l2}}
Figure 1: Linear network with two links with same capacity and three routes.

... to Distributed Fairness

The work of Kelly, started in 1997 [16] and continued in the famous article [17] of 1998, showed a surprising result, and led to a change in the understanding of the Internet resources sharing.

• First, this work advocated that there is no ground for enforcing the max−min fairness, which does not necessarily reflect any economical objective, nor social efficiency. On the contrary, the objective of any bandwidth sharing scheme could be better interpreted as establishing the maximum social value of a network. This objective could account for a degree of fairness among flows, in particular avoiding any starvation, as well as be a useful tool for network designers.
• Second, and more surprisingly, what TCP is achieving in a decentralized manner is specifically the maximum social value of a given utility function, which guarantees the so-called proportional fairness.


Proportional fairness balances any trade-off so that a flow can increase its rate by a percentage only if this does not decrease another flow by a larger percentage of its rate. This definition extends to more than two flows by comparing the sum of the percentage increases with the cumulated percentage decrease: overall, it maximizes the sum of the logarithms of the allocated rates.

For two flows, starting from the max−min fair allocation, one might decrease by δ the rate r of a given flow to increase by Δ another flow with a higher rate R. But this is allowed only if the resulting improvement is higher in rate proportion, that is only if

$$\frac{\Delta}{\delta} > \frac{R}{r}.$$

This second result was shown for a continuous fluid flow model, which assumed instantaneous feedback. This serves as a strong indication that the first suitable objective, defined as social optimality for a network, can be performed in a distributed manner, following a design close to the one currently deployed by TCP.

The works presented in [14] and [32] refined this result, using ordinary differential equations. A key factor that impacts the fairness achieved by a distributed AIMD scheme is the rate at which negative feedback, which induces multiplicative decrease, is received by the flow. Taking into account the fact that flows with a higher rate are more likely to receive feedback at a higher rate, the fairness obtained was proved to be an approximation of proportional fairness. But the qualitative result remains: rates are allocated to all flows so that they maximize a sum of utility functions, denoted FA. That defines the FA fairness.

These macroscopic results are somewhat technical to prove; but they are easy to apply in practice to describe the resource allocation objective targeted by TCP. They also offer insights into the design of protocols for new types of networks, in which protocols may have to be modified. As an example, the analysis conducted in [27] proved that rate max−min fairness, currently implemented by 802.11, is generally not desirable to share the medium of a wireless network.

What remains largely an open question is how the design of a distributed algorithm and the network may change the social optimality of a bandwidth allocation. In [21], Massoulié and Roberts prove in a similar model that an approximation of any fairness defined by a utility function may be computed in a distributed manner depending on the scheduling performed in the router, and/or the adaptation performed by the source. They do not, however, propose a new algorithm that can be readily deployed to replace existing TCP if another objective, different from proportional fairness, is chosen.
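The difference between max−min fairness and proportional fairness can be made concrete on the linear network of Figure 1. The sketch below assumes one flow per route and unit link capacities. Max−min rates are obtained by progressive filling; proportionally fair rates are obtained with a generic gradient iteration on link prices, which is a standard convex optimization technique chosen here for brevity and not the algorithm of [16] or [17]. All function names, the step size and the iteration count are assumptions of this example.

```python
def max_min_fair(capacity, routes):
    """Progressive filling: raise all unfrozen rates equally until a link
    saturates, then freeze the routes crossing it. Every route is assumed
    to use at least one link."""
    rates = {r: 0.0 for r in routes}
    frozen = set()
    while len(frozen) < len(routes):
        step = min(
            (capacity[l] - sum(rates[r] for r in routes if l in routes[r]))
            / sum(1 for r in routes if l in routes[r] and r not in frozen)
            for l in capacity
            if any(l in routes[r] and r not in frozen for r in routes)
        )
        for r in routes:
            if r not in frozen:
                rates[r] += step
        for l in capacity:
            load = sum(rates[r] for r in routes if l in routes[r])
            if capacity[l] - load < 1e-12:
                frozen.update(r for r in routes if l in routes[r])
    return rates


def proportional_fair(capacity, routes, step=0.05, iterations=5000):
    """Maximize the sum of log(rate) with a gradient iteration on link prices:
    each rate is the inverse of the total price along its route."""
    price = {l: 1.0 for l in capacity}
    rates = {}
    for _ in range(iterations):
        rates = {r: 1.0 / sum(price[l] for l in routes[r]) for r in routes}
        for l in capacity:
            load = sum(rates[r] for r in routes if l in routes[r])
            price[l] = max(1e-9, price[l] + step * (load - capacity[l]))
    return rates


capacity = {"l1": 1.0, "l2": 1.0}
routes = {"r1": {"l1"}, "r2": {"l2"}, "r3": {"l1", "l2"}}
mm = max_min_fair(capacity, routes)
pf = proportional_fair(capacity, routes)
```

Max−min fairness gives half of the capacity to each route, whereas the proportionally fair allocation gives one third to the long route r3 and two thirds to each short route. Moving from the first allocation to the second, r3 loses one third of its rate (δ/r = 1/3), while r1 and r2 each gain one third of theirs, so the cumulated percentage increase exceeds the percentage decrease.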


3.2 Capacity under Dynamic Traffic Demand

In the previous section we assumed a fixed number of active flows that always adapt their rate to the maximum allocation allowed by TCP. Stability for such a traffic demand is not an issue, as soon as one assumes that no flows are starved; this last condition is guaranteed by almost any fair allocation: max−min, proportional fairness, and the ones that can be defined with a large class of utility functions.

In practice, the traffic demand in today's data networks may not adapt in this way to the network condition, as it is the consequence of users requesting the transfers of electronic documents. A flow associated with a file may adapt its instantaneous rate according to TCP, but it remains active until the necessary amount of data is transported on the network. That type of traffic demand is usually called elastic traffic. This has two major consequences. First, the rate obtained by a flow depends on the current number of active flows, which may change with time. Second, documents that remain to be completed may accumulate in the network, up to an infinite number with time. In other words, this network is essentially an open loop system with an input given by the process of document requests; it can be unstable.

A network with dynamic elastic traffic demand can be modeled as:

(12)
  a set of links L, with capacity $C_l$.
  a set of routes R, defined as subsets of links.
  processes of documents requested on routes, with rates $(\lambda_r)_{r \in R}$, and a distribution of document sizes $(\sigma_r)_{r \in R}$.

We have in particular the following necessary condition for stability:

Condition 1: for any link l, $\mathrm{load}_l = \sum_{r \,|\, l \in r} \lambda_r E[\sigma_r] \le C_l$.

If this condition does not hold for one of the links, the number of document requests waiting to be completed increases with time to infinity. This is because one link receives overall more data to send than it can transmit. One can define generally the capacity of the network as the set of possible traffic demands on routes that keep this system stable.

Impact of Fairness on Stability

We assume here, whenever a flow joins or leaves, that the bandwidth sharing quickly converges to an equilibrium rate allocation. In other words, the time scale of flows joining and leaving is an order above the time scale of packet transmission and TCP adaptation.

Under this assumption, a result shown in [5] is that Condition 1 is also sufficient to ensure stability. This was shown for rates that are allocated according to a general definition of fairness that includes proportional fairness, FA-fairness, and max−min fairness. The same result holds, even if fairness is modified to introduce per class weights in the utility functions, and also in the case of a limiting rate given to each flow. It is shown for Poisson flow arrival and σ exponentially distributed, but the proof can be extended to any renewal process and distribution of σ with minimal moment condition.

As a consequence, a network that enforces one of these fair allocations operates at full capacity w.r.t. dynamic elastic demand. Following this result, the RTT bias produced by TCP in the network's bandwidth sharing does not lead to instability of networks with dynamic demand, as it corresponds to the maximization of a fair utility function. On the other hand, the same network with the same demand could be unstable with other types of rate allocation, such as the ones based on preemptive priorities or per-class reservation. Several examples are shown in [5], where a demand that satisfies Condition 1 leads to instability.

Insensitivity and Network Performance Evaluation

The performance of networks under dynamic traffic demand is studied in [12]. Let us consider a network with one link, where each active flow receives an equal share of the bandwidth (as would be the case for all the fair allocations we considered). The traffic demand is assumed to be the sum of a large number of user sessions, in different classes. Each session may contain one or several flows with a general size distribution which may be different among flows but only depends on the class they belong to. Sessions are assumed to arrive according to a Poisson Process. Under this quite general assumption, a remarkable fact, shown in [12], is that the expected number of active flows in the link only depends on the mean traffic intensity. By Little's law, this proves that the expected completion time for a flow of size s is given by

$$\frac{s}{C - \mathrm{load}}.$$

In other words, the harmonic mean of flow throughputs, given as a function of their size s, is a constant equal to C − load (the spare capacity).

The property of insensitivity is not only a remarkable theoretical property. It is important, yet not necessary, to prevent the network from being disrupted by large variations in file sizes: as shown in [5], using a preemptive priority can induce, even for a stable case, completion times of files with an infinite expectation. This was a consequence of the heavy tail property of the file size distribution that is observed in practice, and of the preemptive discipline of allocated rate. We should mention that these pathological situations have not been exhibited when the flow rates are allocated according to a fairness criterion that avoids the starvation of any flow.

This insensitivity result can be extended with a limit rate associated with each flow. Unfortunately, it does not apply to more general networks, such as a bottleneck that may be accessed unequally owing to the round trip time bias, or a network that contains multiple bottlenecks. A notable exception is that proportional fairness is insensitive for linear and grid networks (see [5]). More generally it was proved in [26] that this insensitivity property characterizes a restricted class of regular networks.

Extensions to non egalitarian bandwidth sharing: The case of a single bottleneck link sharing bandwidth unequally under dynamic traffic demand is analyzed in [6]. It extends the previous result as the RTT bias may occur in this model. The capacity of the link is supposed to be shared to maximize at all times the utility functions defined as a sum on the active flows, created by a large population of ON-OFF sources. Under some ergodic conditions, the rates allocated to all the ON-OFF sources are shown to maximize another utility function. This utility function is derived from the original one and from the maximal mean rates associated with each source.
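Condition 1 and the completion time formula of this section are simple enough to be checked numerically. The sketch below uses hypothetical function names and made-up demand figures on the two-link topology of Figure 1; capacities, loads and sizes are expressed in the same (arbitrary) data unit per unit of time.

```python
def link_loads(routes, arrival_rate, mean_size):
    """load_l = sum over routes r using link l of lambda_r * E[sigma_r]."""
    loads = {}
    for r, links in routes.items():
        for l in links:
            loads[l] = loads.get(l, 0.0) + arrival_rate[r] * mean_size[r]
    return loads


def satisfies_condition_1(capacity, routes, arrival_rate, mean_size):
    """True if the load of every link does not exceed its capacity."""
    loads = link_loads(routes, arrival_rate, mean_size)
    return all(loads.get(l, 0.0) <= capacity[l] for l in capacity)


def expected_completion_time(size, capacity, load):
    """Single link shared equally: expected completion time s / (C - load)."""
    if load >= capacity:
        raise ValueError("no spare capacity: the link is not stable")
    return size / (capacity - load)


routes = {"r1": {"l1"}, "r2": {"l2"}, "r3": {"l1", "l2"}}
arrival_rate = {"r1": 2.0, "r2": 1.0, "r3": 1.0}  # documents per unit of time
mean_size = {"r1": 2.0, "r2": 3.0, "r3": 4.0}     # data units per document
loads = link_loads(routes, arrival_rate, mean_size)
```

In this example the loads are 8 on l1 and 7 on l2, so Condition 1 holds with capacities of 10 but fails if the capacity of l1 drops to 7. On a single link of capacity 10 carrying a load of 8, a document of size 6 completes in 6 / (10 − 8) = 3 units of time on average, whatever the distribution of document sizes.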


4 Motivation and Summary of this Dissertation

A short overview of the control of data networks gives a remarkably new vision of the tragedy of the commons, which states that a common good left to the control of individuals quickly becomes overexploited and unusable.² In fact, when it comes to transporting digital information, this tragedy can be avoided with a small level of coordination between network elements using the same resources:

• The adaptation mechanisms, responsible for avoiding congestion, abstract the network as a 'black box'. In practice, the network neither maintains nor advertises its current state. Inference and decisions are both done directly by end-hosts.
• Moreover, the resulting bandwidth sharing exhibits surprising properties of solidarity that guarantee its stability to a dynamic elastic demand. To some extent, this bandwidth sharing is insensitive to the high variation found in the traffic demand.

The striking fact, which inspired our work, is that a decentralized protocol, designed to work independently of functions performed by the network, uses the network constantly to interact implicitly with other flows. This remark had to be quickly followed in our mind by two methodological recommendations.

A New Model to Understand TCP Controlled Networks

The 'black box' abstraction is an ideal principle for protocol designers, facilitating separate progress of different parts of the network's functions and guaranteeing compatibility (see for example [29] for an argument supporting the end-to-end principle in network design). However, this abstraction should be used with care when applied to network analysis, as it leads to the assumption, usually met in the literature, that the network states can be known separately and fixed in distribution.

In contrast, it seems to us that some open problems on TCP bandwidth sharing can be better addressed by a model that focuses on the nature of the implicit flow interaction (during congestion epochs), and does not assume a priori that the flows verify a global optimization criterion.

- In Chapter 1, we extend the hybrid AIMD model proposed by Baccelli and Hong to study bandwidth sharing in a link for fixed and dynamic traffic demands. Noticeable properties of TCP (synchronization, fractal density, HTTP turbulence) and their consequences on performance are justified analytically.

² More details can be found at en.wikipedia.org/wiki/Tragedy_of_the_commons.


A Decentralized Protocol for Cooperative Communications

TCP enables a large number of flows to transport data concurrently on shared links; for each flow, it implicitly finds the suitable resources.

Could a protocol follow the same design and allow a large number of flows to transport data cooperatively, keeping this data moving in a large group with no more than its appropriate resources? Could the self-clocking property be reproduced among an arbitrary group of cooperative end-hosts?

- We claim in Chapter 2 that a decentralized control protocol like TCP can be successfully extended to one-to-many communication from a source to any large group of receivers, when it is deployed between end-hosts in an overlay. It guarantees reliable communication to every receiver, and adapts locally to a fixed amount of memory and network capacity. Our key result is that this protocol converges to the appropriate sending rate, which is greater than a positive constant that does not depend on the size of the group.

- Motivated by this previous example, we show in Chapter 3 that decentralized communication protocols are, among other cooperative applications, well represented by last-passage directed percolation time through a category of invariant graphs. This new modeling tool makes it possible to consider a general class of distributed dynamical discrete-event systems: we identify for them a simple condition that characterizes their behavior on large scales, and their possible stationary regimes, with hydrodynamic limits.

NB: The framework introduced in Chapter 3 is used in the proofs of Chapter 2, but except for these mathematical arguments, these two chapters may be read independently.


Bibliography

[1] E. Altman, K. Avrachenkov, and C. Barakat. A stochastic model of TCP/IP with stationary random losses. In SIGCOMM '00: Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 231–242. ACM Press, 2000.
(This paper studies the performance of TCP under a general ergodic stationary process of congestion events in time; this process is fixed independently of the window size.)

[2] F. Baccelli and T. Bonald. Window flow control in FIFO networks with cross traffic. Queuing Syst. Theory Appl., 32(1-3):195–231, 1999.
(This paper analyses the impact of the cross traffic on links used by a flow controlled with a window of constant size. It shows that the mean throughput depends on higher order statistics of the cross traffic arrival processes and is not monotone in the cross traffic arrival rate.)

[3] F. Baccelli and D. Hong. TCP is max-plus linear and what it tells us on its throughput. In SIGCOMM '00: Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 219–230. ACM Press, 2000.
(This paper presents a dynamical model of TCP Tahoe and Reno at the packet level; its evolution is interpreted as a (max,+) linear recursion. It shows how the throughput can be expressed for deterministic service times, but remains difficult to predict in a random environment.)

[4] D. Bertsekas and R. Gallager. Data networks (2nd ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.
(A reference textbook on the control of data networks.)

[5] T. Bonald and L. Massoulié. Impact of fairness on internet performance. In SIGMETRICS '01: Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 82–91. ACM Press, 2001.
(This paper proves that fairness does impact the stability of a network with elastic traffic. The property of insensitivity is discussed and established for proportional fairness on networks organized on a lattice.)

[6] C.-S. Chang and Z. Liu. A bandwidth sharing theory for a large number of HTTP-like connections. IEEE/ACM Trans. Netw., 12(5):952–962, 2004.
(This paper studies the bandwidth sharing of a large number of ON-OFF sources sharing a single bottleneck: if the instantaneous rates of active flows are allocated according to some utility, the time-average rates obtained by flows are shown to maximize a transformed utility function.)

[7] V. Dumas, F. Guillemin, and P. Robert. A Markovian analysis of additive-increase multiplicative-decrease. Advances in Applied Probability, 34(1):85–111, 2002.
(A rigorous analysis of an ideal AIMD window evolution in the rare loss asymptotic.)

[8] S. Floyd. Connections with multiple congested gateways in packet switched networks, part 1: One way traffic. ACM Computer Communications Review, 21(5):30–47, 1991.
(One of the first empirical studies of the bias of TCP against flows with longer round trip times.)

[9] S. Floyd. TCP and explicit congestion notification. SIGCOMM Comput. Commun. Rev., 24(5):8–23, 1994.
(This paper advocates the use of a congestion-experienced bit in IP packets, in the same spirit as [28], together with a probabilistic marking scheme.)

[10] S. Floyd, T. Henderson, and A. Gurtov. RFC 3782 - the NewReno modification to TCP's fast recovery algorithm, 2004.
(Updated version of RFC 2582, which presents the fast retransmission fast recovery mechanism deployed in TCP NewReno to handle window reduction efficiently.)

[11] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw., 1(4):397–413, 1993.
(This paper proposes probabilistic packet dropping in routers, as a function of the average queue size.)

[12] S. Ben Fredj, T. Bonald, A. Proutière, G. Régnié, and J. W. Roberts. Statistical bandwidth sharing: a study of congestion at flow level. In SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pages 111–122. ACM Press, 2001.
(This paper presents the dynamic bandwidth sharing of dynamic flows. It proves that the previously known result of insensitivity can be extended to sessions with arbitrary shape that arrive according to a Poisson process. It discusses by simulations the impact for general networks, and overloaded links with impatient users.)

[13] F. Guillemin, P. Robert, and B. Zwart. AIMD algorithms and exponential functionals. Annals of Applied Probability, 14(1):90–117, 2004.
(This paper extends the analysis of [7] to a case of correlated losses.)

[14] P. Hurley, J.Y. Le Boudec, and P. Thiran. A note on the fairness of additive increase and multiplicative decrease. In Proceedings of ITC-16, Edinburgh, June 1999.
(As a variation of the analysis provided in [17], this article shows that if acknowledgments are received in proportion to the rate achieved, the fairness criterion achieved by AIMD is slightly different from proportional fairness, in case of homogeneous round trip times.)

[15] V. Jacobson. Congestion avoidance and control. In SIGCOMM '88: Symposium proceedings on Communications architectures and protocols, pages 314–329. ACM Press, 1988.
(This article introduces slow start and congestion avoidance; it was a landmark in the design of TCP Tahoe and TCP Reno.)

[16] F. Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8:33–37, 1997.
(This article shows that a social optimum in data networks can be achieved through a distributed mechanism of charge per unit of time. It proposes the proportional fairness criterion as an alternative to (max,min) fairness.)

[17] F. Kelly, A. Maulloo, and D. Tan. Rate control in communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49:237–252, 1998.
(This article shows through differential equations analysis that TCP decentralized control achieves a form of fairness through packet dropping and window adaptation.)

[18] S. Keshav. A control-theoretic approach to flow control. In SIGCOMM '91: Proceedings of the conference on Communications architecture & protocols, pages 3–15, New York, NY, USA, 1991. ACM Press.
(A proposition to regulate Internet flows using a fair queuing mechanism in routers together with a packet pair technique to select the appropriate rate.)

[19] S. Keshav. An engineering approach to computer networking: ATM networks, the Internet, and the telephone network. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.
(A reference textbook on technology and algorithms found in computer networks. It contains an extensive description of routing and flow control issues and mechanisms proposed in the literature.)

[20] T. V. Lakshman and U. Madhow. The performance of TCP/IP for networks with high bandwidth-delay products and random loss. IEEE/ACM Trans. Netw., 5(3):336–350, 1997. (First version appeared in IFIP Transactions C-26, High Performance Networking in 1994.)
(One of the first papers that analyzes the performance of TCP (Reno) in high speed networks.)

[21] L. Massoulié and J. W. Roberts. Bandwidth sharing: objectives and algorithms. IEEE/ACM Trans. Netw., 10(3):320–328, 2002. First version appeared at INFOCOM'99.
(This article presents several rate allocation objectives, introducing the potential delay minimization. It shows different algorithms, for routers and end-hosts, to achieve some of these objectives.)

[22] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. RFC 2018 - TCP selective acknowledgment options, 1996.
(Introduces a lightweight mechanism to identify the packets that have effectively been lost during a congestion epoch; only those are then retransmitted.)

[23] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic behavior of the TCP congestion avoidance algorithm. SIGCOMM Comput. Commun. Rev., 27(3):67–82, 1997.
(This paper validates the square root formula, introduced in [24], in the context of periodic loss.)

[24] T. Ott, J. Kemperman, and M. Mathis. The stationary behavior of ideal TCP congestion avoidance. Unpublished manuscript, available at citeseer.ist.psu.edu/ott96stationary.html.
(This paper introduces the rare loss asymptotic of an ideal AIMD window evolution; it proposes one of the first square root formulae describing the throughput as a function of the loss rate.)

[25] J. Padhye, V. Firoiu, D. F. Towsley, and J. F. Kurose. Modeling TCP Reno performance: a simple model and its empirical validation. IEEE/ACM Trans. Netw., 8(2):133–145, 2000.
(This famous paper presents the TCP Reno throughput=f(loss rate) formula, which accounts in particular for the presence of timeouts.)

[26] A. Proutière. Insensibilité et bornes stochastiques dans les réseaux de file d'attente, application à la modélisation des reseaux de télécommunications au niveau flot. PhD thesis, Ecole Polytechnique, 2003.
(Among other contributions, this work characterizes the class of networks guaranteeing the insensitivity property.)

[27] B. Radunović and J.Y. Le Boudec. Rate performance objectives of multi-hop wireless networks. In Proceedings of IEEE INFOCOM, Hong Kong, China, March 2004.
(This article advocates proportional fairness for wireless networks, proving that a rate adaptation towards (max,min) fairness leads to gross inefficiency.)

[28] K. K. Ramakrishnan and R. Jain. A binary feedback scheme for congestion avoidance in computer networks with a connectionless network layer. In SIGCOMM '88: Symposium proceedings on Communications architectures and protocols, pages 303–313, New York, NY, USA, 1988. ACM Press.
(Published at exactly the same time as Van Jacobson's famous paper, this article proposes the DECbit explicit feedback mechanism.)

[29] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277–288, November 1984.
(This article illustrates and advocates the position that a function need not be implemented in a lower layer unless this provides a significant performance improvement. Most notably, this principle guarantees that the amount of state maintained in the network is minimal.)

[30] S. Shenker, L. Zhang, and D. D. Clark. Some observations on the dynamics of a congestion control algorithm. SIGCOMM Comput. Commun. Rev., 20(5):30–39, 1990.
(One of the first studies of the behavior of TCP Tahoe. This article exhibits the loss synchronization effect occurring between flows and the traffic clustering in the buffer.)

[31] R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, 1994.
(A reference book describing the specifications of the TCP/IP protocol suite.)

[32] M. Vojnović, J.Y. Le Boudec, and C. Boutremans. Global fairness of additive-increase and multiplicative-decrease with heterogeneous round-trip times. In Proceedings of IEEE INFOCOM, pages 1303–1312, Tel Aviv, Israel, March 2000.
(This article extends the result of [14] to a case of heterogeneous round trip times.)


Chapter 1

Bandwidth Sharing with Interacting Flows

One of the beauties of Jungle Law is that punishment settles all scores.1
Rudyard Kipling

How do TCP flows compete and coordinate themselves dynamically, via intersecting processes of congestion epochs?

This exploratory chapter builds on the pioneering work of Baccelli and Hong presented in [4]. We construct a model of the implicit interaction between flows, for fixed and dynamic traffic demand on a single bottleneck link. The success of this approach lies in two outcomes. First, simple situations are sufficiently well characterized, usually leading to a closed form expression. Second, intricate properties quoted in measurements or analysis of TCP controlled networks can generally be justified and well represented for the first time in a single framework.

Guidelines: Related open problems and previous works are presented in §1. In §2, we present Baccelli and Hong's hybrid AIMD model in a nominal case: a single link shared by a fixed number of persistent flows; we extend their work to describe the steady state distribution of the throughput in a closed form formula. This approach is applied as a tool for dimensioning TCP controlled networks in §3. We then present in §4 one of the first studies of the impact of dynamic traffic demand on TCP bandwidth sharing; we establish the presence of a previously unknown phenomenon of HTTP turbulence for conditions that we can characterize in some cases.

1 The Jungle Book, Kaa's Hunting, London, Macmillan, 1919, p.83.


1 Related Works and Open Problems

This chapter addresses some practical and some general open questions on TCP bandwidth sharing properties. Let us take some time to introduce each of them.

Synchronization and Dimensioning

Concurrent flows on Internet links usually exhibit synchronized behaviors, as was first reported in [17]: a large proportion of flows (possibly all of them) loses at least one packet during each congestion epoch. This was initially shown in the case of a small number of flows using the Tahoe version of TCP on a bottleneck link, in a path with a small bandwidth delay product. The same phenomenon was exhibited in [20], where two-way traffic was included. This was confirmed analytically in [15], where several versions of TCP (Tahoe and Reno) were considered, focusing on a case of large bandwidth delay product.

All these studies state that link utilization could be affected by synchronization of flows: a long time is necessary to recover from a congestion epoch affecting many flows simultaneously, and this leaves part of the capacity of a link unused.

Example: Loss in Bursts
Several analytical works [1, 11] have shown that for a given rate of packet losses per unit of time, the average throughput obtained by TCP is improved by losses occurring in bursts, compared to an uncorrelated process.
Can we conclude from these two results that TCP controlled flows generally benefit from a situation of concentrated losses? It seems difficult to do so. Not only might the implementation of TCP (like the fast recovery fast retransmit algorithm) be disrupted in such a case, but it also seems that having congestion epochs separated by short intervals of time can facilitate synchronization between a population of flows.

In this context, dimensioning TCP controlled networks, to target an expected demand and the performance required by an application, is challenging. The simplest solution considers a fluid flow model: it divides the capacity of the bottleneck link by the expected number of flows (or a weighted version based on RTT bias), but this does not account for throughput variations, and might be over-optimistic. Most of the current models give the mean throughput obtained by a TCP controlled flow, or its distribution, as a function of its loss rate. But in general, it is a difficult task to predict the losses experienced by a flow on a given network, even if one can bound the number of active flows in the bottleneck link, and assume that they have a homogeneous RTT.
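As a minimal sketch of the fluid flow estimate mentioned above, the following Python function splits the capacity equally, or according to weights. The function name `fluid_shares` and the choice of weights proportional to 1/RTT are illustrative assumptions; the text only says that a weighted version based on the RTT bias exists, not which weights it uses.

```python
def fluid_shares(capacity, n_flows, rtts=None):
    """Fluid-model estimate of per-flow throughput on one bottleneck.

    With rtts=None the capacity is split equally between the flows.
    Otherwise flow n gets a share proportional to 1 / rtts[n]; this
    weighting is an illustrative assumption, not a rule from the text.
    """
    if n_flows <= 0:
        raise ValueError("need at least one flow")
    if rtts is None:
        return [capacity / n_flows] * n_flows
    if len(rtts) != n_flows:
        raise ValueError("need one RTT per flow")
    weights = [1.0 / r for r in rtts]
    total = sum(weights)
    return [capacity * w / total for w in weights]
```

Such an estimate only gives a mean share per flow; as the paragraph above points out, it says nothing about throughput variations.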


This chapter identifies several regimes of synchronization, based on the demand and the size of the buffer in the link. Closed form expressions can then be used as a dimensioning tool for TCP networks with long lived flows.

Sensitivity to Higher Order Statistics

Exact insensitivity as it was defined in [16] cannot be expected in general from today's data networks. This means that the average performance observed on a network could depend on detailed statistics of the traffic demand. At the same time, large variations have been shown in the nature of the traffic demand (as flow sizes belonging to different applications). Besides, a growing number of complex statistical properties of the traffic carried by the Internet have been proved (self similarity and long range dependence). Their consequences, if any, remain difficult to establish.

Let us point out at this stage that the hybrid AIMD model that we use in this chapter produces traffic with some similar statistical properties. It should not come as a surprise that the exact results we derive may contain some of these intricate properties.

More Details: Closed Loop and Buffer Occupancy
As proved in [14], applying open loop analysis can overestimate the sensitivity to higher order statistics.
In this article, Kherani and Kumar consider the case of a bottleneck link where files to be transferred arrive as a Poisson process, and are served according to the processor sharing discipline. As a consequence, under the stability condition, the mean number of files that remain to be served is finite and does not depend on the file size distribution. It is assumed that these TCP flows send data in the bottleneck queue according to the congestion avoidance mechanism of TCP (by an amount that increases linearly except when the window is reduced). Larger files, which stay longer, might then have more data sent in the pipe; if file sizes follow a heavy tail distribution, the mean queue occupancy might be expected to grow to infinity.
The result shown by Kherani and Kumar gives a strong argument supporting the opposite claim. They assume that flows do not suffer from packet losses, and that file sizes are distributed according to a Pareto distribution with parameter α > 1.5. They prove that the amount of packets sent keeps a finite mean, but that the process of packet arrivals exhibits a certain form of long range dependence.
This is in contrast with the behavior of an open loop model that takes as an input the same process of packet arrivals, and serves them independently as in a single server queue. In this new model, the occupancy of the buffer would have an infinite mean. This example illustrates remarkably well the additional complexity introduced in TCP by the closed loop implementation, and, to some extent, it indicates the robustness of this protocol in the context of large variations of the input process.

We give in this chapter at least one new example in which complex properties (fractal invariance of density function) do not prevent the performance from being easily predictable. It comes, along with other examples (see framed paragraph above), as an indication that closed loop systems might exhibit complex yet harmless statistical properties.

TCP and Non Persistent Flows

Internet traffic demands for file requests present heterogeneous sizes, distributed over several orders of magnitude: many of the flows created remain active for a short time, some leave after a small number of congestion epochs. Little is known about the impact of these joins and leaves on the bandwidth sharing orchestrated by TCP.

Approximating TCP bandwidth sharing as processor sharing states that flows are not discriminated in a bottleneck depending on their sizes; it guarantees stability (i.e. that flows being completed do not accumulate) on the largest possible set of random dynamic demands. It assumes a coupling between flows that is loose (as they interact only through their total number), but extremely reactive (they adapt their rate instantaneously at all times). Can we guarantee stability in practice as flows interact imperfectly, only at a given time scale through periods of congestion?

More Details: TCP as a Perturbation of Processor Sharing
The previous question is formulated, and partially answered, in [13]. In this article, Kherani and Kumar consider the process of load created by Poisson arrivals of file sizes to be transmitted in a bottleneck, which follows the AIMD algorithms of TCP. This model assumes that all flows are synchronized (all the flows suffer a loss if the buffer overflows), and that they all receive an equal share of the overall load created on the bottleneck. It is therefore a time varying processor sharing queue. It is shown by numerical computation of the model that the mean performance (average completion time of a transfer, or time average throughput of an active session) is impacted by the file size distribution. Another regime is identified for a bandwidth delay product much larger than the mean file size, where flows do not suffer any loss.

At the end of this chapter, we prove that TCP bandwidth sharing under dynamic traffic demand might enter some turbulent region, where at least two steady states might be possible, depending on the initial conditions.


2 The Hybrid AIMD Model with Persistent Flows

The hybrid AIMD model was presented for the first time in [4]. A TCP flow is characterized by an instantaneous sending rate, as for a fluid flow model, which evolves continuously most of the time but occasionally exhibits a discontinuity (corresponding to a congestion epoch, or an event occurring for this flow such as the end of a file transfer). The presence of a non-continuous evolution term motivates calling this model 'hybrid'.

We focus in this section on the additive increase multiplicative decrease typical of the congestion avoidance phase, which is the dominant aspect of the TCP Reno protocol. The model can be applied to the Tahoe version of TCP as well: it is done, for example, in §4.

2.1 Overview

The hybrid AIMD model considers for a given TCP flow (denoted by index $n$) its instantaneous throughput at time $t \in \mathbb{R}$ (denoted by $X^{(n)}(t)$); it increases linearly with a slope $1/R^2$, where $R$ is its associated RTT. As a result of the finite bandwidth available on the link (denoted by $C$), a congestion epoch occurs when the sum of the instantaneous throughputs reaches the capacity. At this time, some connections may suffer from packet losses and halve their throughput. We first consider the buffer-less case in this section. We come back to the case of a bottleneck router with a nonzero buffer in §3.1.

We introduce $T_i$, the time of the $i$-th congestion epoch, and $X_i^{(n)}$, the instantaneous throughput achieved by connection $n$ taken immediately after the $i$-th congestion epoch: $X_i^{(n)} = X^{(n)}(T_i+)$. From the AIMD behavior of the simplified TCP protocol, illustrated in Figure 1.1, we deduce the following recurrence equation:

(1.1)  $X_i^{(n)} = \gamma_i^{(n)} \left( X_{i-1}^{(n)} + q_i \right)$, where $q_i = \frac{1}{R^2} (T_i - T_{i-1})$,

• where $\gamma_i^{(n)}$ is a random variable equal to 1, except if connection $n$ suffers a loss at time $T_i$, in which case it is $\frac{1}{2}$.
• $q_i$ as defined here is the throughput growth of the connection since the last congestion epoch. This is the same for each connection, as each of them has the same RTT.

By expressing that at the $i$-th congestion epoch the capacity $C$ is just consumed, it can be seen that the increase $q_i$ does not only depend on the throughput of the $n$-th connection, but also on the throughputs of all other connections; it might be written as:

(1.2)  $\sum_{n=1}^{N} \left( X_{i-1}^{(n)} + q_i \right) = C$,


[Figure 1.1: plot of the throughput $X^{(n)}(t)$ against time $t$ over the congestion epochs $T_{i-3}, T_{i-2}, T_{i-1}, T_i$, showing the values $X_{i-3}^{(n)}, \dots, X_i^{(n)}$ for an example with $\gamma_{i-3}^{(n)} = \frac{1}{2}$, $\gamma_{i-2}^{(n)} = \frac{1}{2}$, $\gamma_{i-1}^{(n)} = 1$, $\gamma_i^{(n)} = \frac{1}{2}$.]

Figure 1.1: Evolution of the instantaneous throughput of a connection.

(1.3)  which leads to $q_i = \frac{C - S_{i-1}}{N}$ with $S_{i-1} = \sum_{n=1}^{N} X_{i-1}^{(n)}$.

Replacing the expression for $q_i$ in (1.1), we capture the AIMD competition between TCP connections on the link in a recurrence equation with a product of random matrices, whose elements contain the random variables $(\gamma_i^{(n)})_{n=1 \dots N,\, i \in \mathbb{Z}}$.

Synchronization Rate

In this model, packet losses are included via the value $\frac{1}{2}$ that can be taken by the variables $(\gamma_i^{(n)})_{n=1 \dots N,\, i \in \mathbb{Z}}$. We assume that these variables are independent between different connections, and also that they are independent in time. By homogeneity, we assume that they have the same law for all connections, and for all congestion epochs. In this case, the distribution of all random variables in the system is fixed once we choose the value of the probability $p = P(\gamma_i^{(n)} = \frac{1}{2})$ (this depends neither on $n$ nor on $i$), which is referred to as the synchronization rate. It represents both the proportion of connections suffering at least one packet loss in a congestion epoch, and the probability for one connection to lose at least one packet when congestion appears. Another model has been introduced in [12] where, for each congestion epoch $T_i$, the variables for the different connections $\gamma_i^{(1)}, \dots, \gamma_i^{(N)}$ are independent, but with laws that depend on the throughput immediately before the congestion epoch. We consider the simplest model here, allowing us to derive a closed form formula.
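The recurrence (1.1) to (1.3), together with the independence assumption on the variables $\gamma_i^{(n)}$, can be iterated numerically. The following Python sketch is only an illustration under stated assumptions: the function name `simulate_aimd`, the seed handling and the initial condition (an equal split of half the capacity) are hypothetical choices, not part of the original model description.

```python
import random


def simulate_aimd(n_flows, capacity, p, n_epochs, seed=0):
    """Iterate recurrences (1.1)-(1.3) for N flows on one bufferless link.

    Returns the per-flow throughputs just after the last congestion epoch
    and the average over epochs of flow 0's post-congestion throughput.
    """
    rng = random.Random(seed)
    # Arbitrary start (an assumption): equal split of half the capacity.
    x = [capacity / (2.0 * n_flows)] * n_flows
    total_flow0 = 0.0
    for _ in range(n_epochs):
        # (1.3): common linear growth needed for the sum to reach C.
        q = (capacity - sum(x)) / n_flows
        # Just before the epoch the throughputs add up to C, as in (1.2).
        before = [xn + q for xn in x]
        assert abs(sum(before) - capacity) < 1e-9 * capacity
        # (1.1): each flow halves independently with probability p.
        x = [0.5 * b if rng.random() < p else b for b in before]
        total_flow0 += x[0]
    return x, total_flow0 / n_epochs
```

For example, `simulate_aimd(50, 100.0, 0.5, 2000)` runs 2000 congestion epochs for 50 flows with synchronization rate 0.5. Note that when no flow loses a packet at an epoch, the next growth $q_i$ is zero, which is a degenerate feature of the bufferless model with small $N$ or small $p$.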


More Details: Previous Results
The hybrid AIMD model exhibits several remarkable properties described in earlier work. Let us quickly highlight them.
• An extended square root formula giving the expected throughput as a function of the experienced packet loss rate and the synchronization rate was shown in [4]:

$E[X(\infty)] = \mathrm{Cste} \cdot \frac{1}{R \sqrt{\mathrm{loss\_rate}}}$ where $\mathrm{Cste} = \sqrt{2 - \frac{p}{2}} \in \left[ \sqrt{\frac{3}{2}} ; \sqrt{2} \right]$.

• As shown in [4], statistical properties observed on short time scales, that are not well explained today, seem to be present in similar shape in the results of this model.
• One of the most important outcomes of this approach is a tool for fast simulations. Simulating the mechanism of millions of TCP flows sharing a network of ten thousand links is made possible with a reasonable amount of memory and computation time. These numbers are some orders of magnitude beyond what can be simulated today using packet level discrete event simulators like the ns2 simulator.
This simulation tool was initially presented in [5], and is now the base of a commercialized software product (see www.n2nsoft.com).
• In the case of several bottlenecks, the interaction of multiple flows mimics the dynamics of a billiard, as presented in [6]. This leads to periodic bottleneck sequences in the deterministic case of full synchronization. The instantaneous throughputs seen by different connections are distributed on a fractal set that can be significantly complex.
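To make the extended square root formula of the first item concrete, here is a small Python sketch. It assumes the constant reads $\sqrt{2 - p/2}$, which is the reading consistent with the stated range between $\sqrt{3/2}$ (for $p = 1$) and $\sqrt{2}$ (for $p$ close to 0); the function names are hypothetical, and the units of the result follow those chosen for the RTT and the loss rate.

```python
import math


def sqrt_formula_constant(p):
    """Constant sqrt(2 - p/2) of the extended square root formula."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("synchronization rate must lie in [0, 1]")
    return math.sqrt(2.0 - p / 2.0)


def expected_throughput(rtt, loss_rate, p):
    """E[X] = Cste / (R * sqrt(loss_rate)) for synchronization rate p."""
    if rtt <= 0.0 or loss_rate <= 0.0:
        raise ValueError("rtt and loss_rate must be positive")
    return sqrt_formula_constant(p) / (rtt * math.sqrt(loss_rate))
```

With full synchronization ($p = 1$) the constant is $\sqrt{3/2}$, and it grows towards $\sqrt{2}$ as the synchronization rate decreases.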

Expectation and Stationary Distribution

The expectation of the throughput seen immediately after congestion can be deduced from the independence assumption and the symmetry found in the model (see [4]):

(1.4)  $E[X_i^{(n)}] = \frac{C}{N} \left( 1 - \frac{p}{2} \right)$.

Note that this expression can be interpreted intuitively. As the total throughput immediately before the congestion is the capacity $C$, and a proportion $p$ of the connections have halved their throughput during the congestion, this equation only reflects that, on average, the total throughput seen immediately after congestion should be $C \left( 1 - \frac{p}{2} \right)$.

By iterating the recurrence relation (1.1), an infinite sum of products is obtained. In [4], it is shown that this infinite sum converges to an expression of the steady state distribution:

(1.5)  $X_\infty^{(n)} = \sum_{j \ge 0} \gamma_i^{(n)} \cdots \gamma_{i-j}^{(n)} \, q_{i-j}$.
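As a hedged sketch of why (1.4) holds (the full argument is in [4]), one can take expectations in (1.1) and use (1.2). The steps below assume that the flows are exchangeable and that $\gamma_i^{(n)}$ is independent of everything that happened before $T_i$; the `align*` environment assumes the amsmath package.

```latex
% Sketch only: assumes exchangeable flows and gamma_i^{(n)} independent
% of the past (X_{i-1}^{(n)} and q_i only depend on earlier epochs).
\begin{align*}
E\big[\gamma_i^{(n)}\big]
  &= (1-p)\cdot 1 + p\cdot\tfrac{1}{2} = 1-\tfrac{p}{2},\\
E\big[X_{i-1}^{(n)} + q_i\big]
  &= \tfrac{1}{N}\,E\Big[\textstyle\sum_{m=1}^{N}\big(X_{i-1}^{(m)}+q_i\big)\Big]
   = \tfrac{C}{N}
  \quad\text{(symmetry between flows and (1.2))},\\
E\big[X_i^{(n)}\big]
  &= E\big[\gamma_i^{(n)}\big]\,E\big[X_{i-1}^{(n)}+q_i\big]
   = \tfrac{C}{N}\Big(1-\tfrac{p}{2}\Big).
\end{align*}
```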


2.2 Many User Approximation

Equation (1.1) specifies the relations between values of the throughput for different connections at consecutive congestion epochs. In particular, one flow interacts with another one only via the expression of the inter-congestion additive growth $q_i$, through the sum $\frac{1}{N} S_{i-1}$. When $N$ is increased while keeping the same capacity ratio $C'/N$, the quantity $\frac{1}{N} S_{i-1}$ is obtained as an average of more and more random variables with the same expectation. From the study of their inter-correlation, we can show that the variance of the average vanishes and that this average can be taken in the model to be equal to its expected value (a proof can be found in [12]):

$\frac{S_{i-1}}{N} \sim \frac{\rho C}{N} \left( 1 - \frac{p}{2} \right)$ if $N$ is large enough, hence $q_i = \frac{\rho C - S_{i-1}}{N} \sim \frac{\rho C}{N} \frac{p}{2} = q$.

Under this approximation, the variables $q_i$ converge for all $i$ to their expected values. The recurrence simplifies to a one-dimensional equation:

(1.6)  $X_i^{(n)} = \gamma_i^{(n)} \left( X_{i-1}^{(n)} + q \right)$ and we have $X_\infty^{(n)} = q \sum_{j \ge 0} \gamma_i^{(n)} \cdots \gamma_{i-j}^{(n)}$.

Steady State Law as an Infinite Geometric Walk

In the infinite sum defining the steady state law in (1.6), we can deduce the term number $j+1$ from the term number $j$ by an independent coin toss: with probability $1-p$ they are equal, and with probability $p$ the latter term is half the former. Consequently, the steady state law is the result of the infinite geometric walk represented in Figure 1.2, where, after each step, we decide independently with probability $p$ to halve the step length or to keep the same value.

[Figure 1.2, left panel: example walk for $(\gamma_i^{(n)}, \gamma_{i-1}^{(n)}, \dots) = (1, 1, \frac{1}{2}, \frac{1}{2}, 1, 1, \frac{1}{2}, 1, \dots)$, giving 2 steps at step length $q$, 1 step at $q/2$, 3 steps at $q/4$, 2 steps at $q/8$, etc. Right panel: Probability[Throughput > Threshold] plotted against Threshold (1 is c), from a simulation of size 100000, for synchronization rates 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.99 and 1.]

Figure 1.2: Description of the infinite geometric walk (left), TDF of the instantaneous throughput after congestion for different values of $p$ (right).
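The geometric walk can be made concrete by counting, for a given sequence of halving indicators, how many terms of the sum in (1.6) have each length. The following Python sketch uses a hypothetical helper `walk_steps` and exact fractions; it is an illustration of the construction, not code from the original work.

```python
from collections import Counter
from fractions import Fraction


def walk_steps(gammas):
    """Count the steps taken at each length q / 2**k for a gamma sequence.

    gammas lists (gamma_i, gamma_{i-1}, ...) with entries 1 or 1/2.
    Term j of the sum in (1.6) is q times the product of the first j + 1
    entries, so each entry contributes one step of that product length.
    """
    counts = Counter()
    product = Fraction(1)
    for g in gammas:
        product *= g
        counts[product] += 1
    return dict(counts)


half = Fraction(1, 2)
# The example sequence shown in the left panel of Figure 1.2.
example = [1, 1, half, half, 1, 1, half, 1]
steps = walk_steps(example)
# Keys are step lengths in units of q:
# 2 steps at q, 1 step at q/2, 3 steps at q/4, 2 steps at q/8.
```

The partial sum of the walk for this prefix is the sum of length times count, here $3.5\,q$.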


We can observe that the number of steps made in this walk with a certain fixed length $q/2^k$ is thus geometrically distributed with ratio $1-p$. The variable may be zero if we consider the step length q (a division can occur immediately before the first step), but for any smaller length $q/2^k$ with $k \ge 1$, there is at least one step with this length. Moreover, these variables are independent. The steady state distribution of the sequence $(X^{(n)}_i)_{i \in \mathbb{Z}}$ is then the same as the following distribution:

(1.7)  $X_\infty = q A_0 + \sum_{k \ge 1} \frac{q}{2^k}\,(1 + A_k) = q + q \sum_{k \ge 0} \frac{A_k}{2^k}$,

where $A_0, A_1, \ldots$ are independent geometric random variables starting in 0 with ratio $(1-p)$: $\mathrm{Prob}[A_k = m] = p(1-p)^m$, $\forall m \ge 0$.

We have represented this distribution, for different values of p and a capacity such that $c = \frac{\rho C}{N} = 1$, on the right of Figure 1.2. As observed in the figure, the distribution of this variable, for p close to 1, presents some fractal properties. They can be interpreted as the consequences of the discrete symmetry found in the formulation of this variable as an infinite geometric walk.

Throughput in Continuous Time

Until now, we have only been considering the instantaneous throughput taken immediately after congestion epochs $(X^{(n)}(T_i+))_{i \in \mathbb{Z}}$, that is an embedded chain of the continuous time process of the instantaneous throughput $(X^{(n)}(t))_{t \in \mathbb{R}}$. The values of this continuous time process can easily be derived from the embedded chain, as the throughput of each connection increases linearly with a constant slope between congestion epochs.

In the many user approximation, the value $q_i$ of the throughput's increase between $T_{i-1}$ and $T_i$ is equal for all i to the same constant q. Inter-congestion times are then constant. Adding the variable qU, where U is independent and uniformly distributed on [0; 1], to the steady state law of the embedded chain leads to the stationary distribution of the continuous time process.

Characterization of the stationary distribution: Under the many user approximation, the stationary distribution of the throughput obtained by a connection has the same distribution as the variable:

(1.8)  $X(\infty) = q + qU + q \sum_{k \ge 0} \frac{A_k}{2^k}$,

where
- q is a constant equal to $\frac{\rho C}{N}\,\frac{p}{2}$,
- U is uniformly distributed in [0; 1],
- each $A_k$ has a geometric distribution starting in 0, with ratio $(1-p)$,
- all the variables here are independent.
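The characterization (1.8) lends itself to direct sampling. The sketch below is an illustration under stated assumptions, not the simulator used for the figures: the truncation `kmax`, the sample sizes and all function names are our own choices, and `c` stands for ρC/N.

```python
import random

def geometric_from_zero(p, rng):
    """Draw A with P[A = m] = p * (1 - p)**m for m >= 0."""
    m = 0
    while rng.random() >= p:
        m += 1
    return m

def sample_throughput(p, c, rng, kmax=40):
    """One draw of X(inf) from (1.8), with q = c * p / 2 and c = rho*C/N.

    The series over k is cut at kmax (an illustrative choice); the
    neglected terms carry a weight of order 2**-kmax.
    """
    q = c * p / 2.0
    walk = sum(geometric_from_zero(p, rng) / 2.0**k for k in range(kmax + 1))
    return q * (1.0 + rng.random() + walk)

def empirical_tdf(samples, threshold):
    """Fraction of samples strictly above the threshold."""
    return sum(s > threshold for s in samples) / len(samples)

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_throughput(0.8, 1.0, rng) for _ in range(20_000)]
    print(sum(draws) / len(draws), empirical_tdf(draws, 1.0))
```

For p = 1 every A_k is zero, so the samples fall between c/2 and c, which is a quick way to check the sampler.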


We can already deduce the expectation for this throughput:

(1.9)  $E[X(\infty)] = E[X_\infty] + q\,E[U]$, so that $E[X(\infty)] = \frac{\rho C}{N}\left(1 - \frac{p}{4}\right)$.

Empirical Validation

We have conducted experiments to justify the use of the many user approximation: we have compared the TDF observed by simulation of the model, for different values of N, with its asymptotic many user approximation, while keeping the other parameters (ρ, p) constant and $c = \frac{\rho C}{N}$ equal to 1.

The TDF observed by simulations are shown in Figure 1.3. We present here the two cases where the value of p is small and large; similar results have been obtained for any value of p in [0; 1]. These experiments show that the many user approximation is accurate with a moderate number of connections. For a number of connections greater than 16, whatever the value of p, it is hard to distinguish between the TDF and the value given by its many user approximation.

[Figure 1.3, both panels: Probability[Throughput > Threshold] against the threshold (1 is c), 100,000 simulation steps, with one curve for the asymptotic value and one for each number of flows 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512. Left panel: synchronization rate 0.2. Right panel: synchronization rate 0.8.]

Figure 1.3: TDF of the throughput in continuous time, with a finite number of connections, compared with its many user asymptotic: p=0.2 (left), p=0.8 (right).

2.3 A Closed Form Formula for TCP Throughput TDF

We derive from (1.8) the following formula, giving the TDF of the stationary throughput obtained by one source as an infinite sum containing infinite products:

(1.10)  $P\left(X(\infty) \ge q + qx\right) = \sum_{k \ge 0} \alpha_k(x)\, \phi^{(+\infty)}(2^k x)$,

where $\phi^{(+\infty)}$ is a Fourier series, $\phi^{(+\infty)}(x) = \sum_{l \in \mathbb{Z}} \beta^{(+\infty)}_l e^{2i\pi l x}$, and

$\alpha_k(x) = \frac{p\,(1-p)^{2^k x}\left(\left(\frac{1}{1-p}\right)^{2^k} - 1\right)}{2^k} \prod_{m=1}^{k} a_m$, with $a_m = \frac{p}{1-(1-p)^{1-2^m}}$,

$\beta^{(+\infty)}_l = \frac{1}{(\ln(1-p) + 2i\pi l)^2} \prod_{m \ge 1} b_{m,l}$, with $b_{m,l} = \frac{p}{1 - e^{-i\frac{2\pi}{2^m} l}\,(1-p)^{1-\frac{1}{2^m}}}$
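A fixed-truncation evaluation of (1.10) can be sketched as follows. This is an illustration and not the algorithm of the dissertation: the truncation levels `K`, `L`, `M` are fixed by hand rather than derived from a target precision, all function names are hypothetical, and α_k is computed in an algebraically equivalent rearranged form (our own) that avoids very large intermediate powers. A small simulation of Y = U + Σ A_k/2^k is included as a cross-check.

```python
import cmath
import math
import random

def beta_coefficients(p, L=400, M=40):
    """Fourier coefficients beta_l of (1.10) for |l| <= L (0 < p < 1).

    The infinite product over m is cut at M; L and M are fixed
    illustrative truncations, not adaptive choices.
    """
    log_c = math.log(1.0 - p)
    coeffs = {}
    for l in range(-L, L + 1):
        prod = complex(1.0, 0.0)
        for m in range(1, M + 1):
            phase = cmath.exp(-2j * math.pi * l / 2**m)
            prod *= p / (1.0 - phase * (1.0 - p) ** (1.0 - 2.0**-m))
        coeffs[l] = prod / (log_c + 2j * math.pi * l) ** 2
    return coeffs

def alpha(k, x, p):
    """alpha_k(x) of (1.10), rearranged to avoid huge intermediate powers.

    With c = 1 - p, a_m = p / (1 - c**(1 - 2**m)) is rewritten as
    -p * c**(2**m - 1) / (1 - c**(2**m - 1)); the powers of c are then
    gathered into a single exponent 2**k * (x + 1) - 2 - k.
    """
    log_c = math.log(1.0 - p)
    prod = 1.0
    for m in range(1, k + 1):
        prod *= -p / (1.0 - math.exp((2**m - 1) * log_c))
    exponent = 2**k * (x + 1.0) - 2 - k
    return (p / 2**k) * (1.0 - math.exp(2**k * log_c)) * math.exp(exponent * log_c) * prod

def tail_probability(x, p, coeffs, K=12):
    """Truncated (1.10): approximates P(Y >= x), i.e. P(X(inf) >= q + q*x), x >= 0."""
    total = 0.0
    for k in range(K + 1):
        phi = sum(b * cmath.exp(2j * math.pi * l * 2**k * x)
                  for l, b in coeffs.items())
        total += alpha(k, x, p) * phi.real
    return total

def monte_carlo_tail(x, p, n=10_000, kmax=40, seed=0):
    """Simulation estimate of P(Y >= x), used only as a cross-check."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y = rng.random()
        for k in range(kmax + 1):
            a = 0
            while rng.random() >= p:   # geometric variable starting at 0
                a += 1
            y += a / 2.0**k
        hits += y >= x
    return hits / n

if __name__ == "__main__":
    p = 0.5
    coeffs = beta_coefficients(p)
    for x in (0.5, 1.0):
        print(x, tail_probability(x, p, coeffs), monte_carlo_tail(x, p))
```

The coefficients β_l depend only on p, so they are computed once and reused for every threshold x.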

Numerical Algorithm: A numerical algorithm allows one to estimate this quantity with arbitrary precision. For any ε > 0, the algorithm computes K, L, and $(M(l))_{|l| \le L}$ such that:

$P\left(X(\infty) \ge q + qx\right) = \sum_{k=0}^{K} \alpha_k(x) \sum_{l=-L}^{L} \frac{1}{(\ln(1-p) + 2i\pi l)^2} \prod_{m=1}^{M(l)} b_{m,l}\; e^{2i\pi l (2^k x)} \pm \varepsilon$.

This algorithm first estimates a uniform bound on infinite products (see (1.15) in the proof below), then deduces from the study of the sequences $(a_m)_{m \ge 1}$ and $(b_{m,l})_{m \ge 1, l \in \mathbb{Z}}$ the appropriate truncation for the infinite sums, and concludes by studying the convergence of each infinite product. Additional details can be found in [8].

Sketch of the Proof: We consider in this proof the Cumulative Distribution Function (CDF) of a random variable, instead of its TDF, to be consistent with classical mathematical notation. A general inversion formula expresses the CDF of a random variable as a complex integral of its Laplace transform over a straight line ($s = c + i\mathbb{R}$) that lies in the area where the Laplace transform exists. This Laplace transform is easily computed as X(∞) is written as a sum of independent random variables.

This integral may be developed with the classical method of residues, as the Laplace transform can be extended to an analytic function, defined on the whole complex plane, outside its singularities. The path of integration can then be included in a closed contour of the complex plane which may contain some singularities. It is then sufficient to show that the integral over the other paths in the contour can be asymptotically neglected. This is what we do for a truncated version of the variable X(∞), since a direct application of the method to X(∞) itself cannot be justified. A limit argument then allows us to conclude and prove (1.10).

Preliminaries

The Stieltjes integral is the natural tool to be used here as it allows one to handle within the same framework both a continuous law with density (as for variable U), and a discrete law (as for variables $(A_k)_{k \ge 0}$), as found in the expression of X(∞). An introduction to Stieltjes integration may be found in [19] (Chap. I and II).

Consider a random variable Y whose CDF µ is of bounded variation. Its Laplace transform can be defined by the following Stieltjes integral on the function µ:

$f_Y(s) = \int_0^{+\infty} e^{-sx}\, d\mu(x)$ for $s = \sigma + i\tau \in \mathbb{C}$.

µ being bounded, this integral is well defined for σ > 0. We have, for any c > 0, the following inversion formula (see [19]):

$\lim_{T \to +\infty} \frac{1}{2i\pi} \int_{c-iT}^{c+iT} \frac{e^{sx}}{s}\, f_Y(s)\, ds = \frac{\mu(x+) + \mu(x-)}{2}$ ($= \mu(x)$ for µ continuous).

Laplace Transform in the model: Since the density of the random variable $Y = U + \sum_{k \ge 0} \frac{A_k}{2^k}$ exists, its CDF is continuous. The Laplace transform $f_Y$ is expressed as an infinite product and inversion gives, for all c > 0 and x ≥ 0, as $f_U(s) = \frac{1 - e^{-s}}{s}$ and $f_{A_k}(s) = \frac{p}{1 - (1-p)e^{-s}}$,

(1.11)  $P(Y \le x) = \lim_{T \to +\infty} \frac{1}{2i\pi} \int_{c-iT}^{c+iT} \frac{e^{sx}}{s}\, \frac{(1 - e^{-s})}{s} \prod_{k \ge 0} \frac{p}{1 - (1-p)e^{-s/2^k}}\, ds$.

Singularities: $f_U$ has no singularity, as opposed to $f_{A_k}$ which possesses an infinite number of them, all located on the axis σ = ln(1 − p) in points s = ln(1 − p) + 2ilπ, for $l \in \mathbb{Z}$. Consequently, the Laplace transform for variable Y has singularities in points $s = 2^k \ln(1-p) + 2i(l 2^k)\pi$, for k ≥ 0, $l \in \mathbb{Z}$.

Truncation: Unfortunately the function that is integrated in (1.11) contains too many singularities for the calculus to be done directly; it is better to consider temporarily the truncated variable of order K, $Y_K = U + \sum_{k=0 \ldots K} \frac{A_k}{2^k}$. Its Laplace transform is given by the expression defining $f_Y$, where the infinite product has been reduced to its K + 1 first factors. In particular $f_{Y_K}$ possesses no singularities in s = σ + iτ for σ strictly smaller than $2^K \ln(1-p)$.

STEP 1: The Expression of the Integral

Justification of the residue calculus for K_order truncation: Let us consider the rectangular contour represented on Figure 1.4, defined with parameters T = π(2m + 1) and $\sigma_0 < 2^K \ln(1-p)$. We integrate the function as in (1.11), where the expansion of the product is stopped after K + 1 factors (i.e. the inversion formula for the K_order truncated variable).
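For reference, the two elementary transforms entering (1.11), and the location of the poles, can be checked by direct computation. The lines below are a worked verification added for convenience; they use only the definitions of U and A_k given earlier.

```latex
\begin{align*}
f_U(s) &= \int_0^1 e^{-sx}\,dx = \frac{1 - e^{-s}}{s}, \\
f_{A_k}(s) &= \sum_{m \ge 0} p\,(1-p)^m e^{-sm}
            = \frac{p}{1 - (1-p)\,e^{-s}}
            \qquad \text{for } \sigma > \ln(1-p), \\
f_Y(s) &= f_U(s) \prod_{k \ge 0} f_{A_k}\!\left(\frac{s}{2^k}\right)
        = \frac{1 - e^{-s}}{s} \prod_{k \ge 0} \frac{p}{1 - (1-p)\,e^{-s/2^k}}, \\
(1-p)\,e^{-s/2^k} = 1
  &\iff \frac{s}{2^k} = \ln(1-p) + 2i\pi l
  \iff s = 2^k \ln(1-p) + 2i\pi\,(2^k l), \qquad l \in \mathbb{Z}.
\end{align*}
```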


[Figure 1.4: rectangular contour in the complex plane. The horizontal axis carries the marks σ0, σc, 0 and c, the vertical axis the marks τ = −2π, τ = 0 and τ = 2π. The horizontal edges of the contour lie at τ = −T = −π(2m + 1) and τ = T = π(2m + 1). The singularities are drawn on vertical lines labelled σ = σc = ln(1 − p) and σ = 2^K σc.]

Figure 1.4: Integral calculus for the K_order approximation (here K = 2).

• Along the right edge of the rectangle, the integral for T large enough tends to $P(Y_K \le x)$.

• Along edge ②, $s = \sigma_0 + i\tau$ with $-T \le \tau \le T$, and the integral is

$\int_{②} = \frac{1}{2i\pi} \int_{\sigma_0 + iT}^{\sigma_0 - iT} \frac{e^{sx}(1 - e^{-s})}{s^2}\; \frac{p}{1 - (1-p)e^{-s}} \cdots \frac{p}{1 - (1-p)e^{-s/2^K}}\, ds$;

for any k = 0 … K, when τ varies, $(1-p)e^{-s/2^k}$ is on the circle with center zero and radius $(1-p)e^{-\sigma_0/2^k}$, which is larger than 1. Consequently, its distance to the point 1 in the complex plane is bigger than its radius minus 1, and we have:

(1.12)  $\left| \int_{②} \right| \le \left( e^{\sigma_0 x} + e^{\sigma_0 (x-1)} \right) \left( \prod_{k=0}^{K} \frac{p}{(1-p)e^{-\sigma_0/2^k} - 1} \right) \int_{-T}^{T} \frac{1}{|\sigma_0 + i\tau|^2}\, d\tau$.

In this last bound, for x ≥ 0, the coefficient before the integral remains bounded for $\sigma_0 \to -\infty$, as the product of K + 1 factors in brackets is equivalent to $\left(\frac{p}{1-p}\right)^{K+1} e^{\sigma_0 \left(1 + \frac{1}{2} + \ldots + \frac{1}{2^K}\right)}$. The integral on the right can be bounded for any value of T by the integral of the same function on the infinite domain ]−∞, +∞[, which vanishes when $\sigma_0 \to -\infty$ by monotone convergence. To conclude, the integral over ② vanishes for large negative values of $\sigma_0$, uniformly with respect to T.


• On edge ① (the integral over ③ is treated similarly): $s = \sigma + iT$ with $\sigma_0 \le \sigma \le c$,

$\int_{①} = -\frac{1}{2i\pi} \int_{\sigma_0 + iT}^{c + iT} \left( \frac{p}{1 - (1-p)e^{-s}} \cdots \frac{p}{1 - (1-p)e^{-s/2^K}} \right) \frac{e^{sx} - e^{s(x-1)}}{(\sigma + iT)^2}\, ds$.

The product of K + 1 factors in brackets can be bounded, as for any k = 0 … K the argument of $z = (1-p)e^{-s/2^k}$ cannot be closer than $\frac{\pi}{2^K}$ from zero (as T = π modulo 2π). In other terms, the complex number z cannot be in the domain $\{ r e^{i\theta},\ r \ge 0,\ |\theta| \le \frac{\pi}{2^K} \}$, and this implies in particular that its distance to 1 is at least $\sqrt{2}\sqrt{1 - \cos\frac{\pi}{2^K}} = \mathrm{dist}_K > 0$,

(1.13)  and thus $\left| \int_{①} \right| \le \left( \frac{p}{\mathrm{dist}_K} \right)^{K+1} \int_{\sigma_0}^{c} \frac{e^{\sigma x} + e^{\sigma (x-1)}}{\sigma^2 + T^2}\, d\sigma$.

For $\sigma_0$ fixed, this integral vanishes as T = π(2m + 1) becomes large with m.

Asymptotically, for any arbitrary precision ε > 0, $\sigma_0$ can be taken negative and large enough to have the integral over ② smaller than ε for any value of T. If T is π(2m + 1) with m large enough, the integrals over ① and ③ are both smaller than ε. Residue calculus can then be used, as the integral for the right edge of the contour extended to infinity is equal to the sum of the residues on the left side of this line.

Expression of the sum of the residues: We want to study the residues of:

$s \mapsto \frac{e^{sx}(1 - e^{-s})}{s^2} \left( \frac{p}{1 - (1-p)e^{-s}} \cdots \frac{p}{1 - (1-p)e^{-s/2^K}} \right)$.

• For s → 0: every factor in the product on the right converges to one, the residue is given by a Laurent series expansion of the coefficient on the left, and is equal to 1.

• For $s \to s_{k,l} = 2^k \ln(1-p) + 2i\pi(2^k l)$, where k = 0 … K, $l \in \mathbb{Z}$: The left coefficient of the function is equal to:

$\frac{(1-p)^{2^k x} - (1-p)^{2^k (x-1)}}{(2^k)^2}\; \frac{1}{(\ln(1-p) + 2i\pi l)^2}\; e^{2^k\, 2i\pi l x}$.

Factors appearing in the right product are $\frac{p}{1 - (1-p)e^{-s/2^j}}$ with j = 0 … K:
- for j = k, the residue is $\lim_{s \to s_{k,l}} (s - s_{k,l}) \frac{p}{1 - (1-p)e^{-s/2^k}} = p\,(2^k)$,
- for j = 0 … (k − 1), the factor simplifies to $a_{k-j} = \frac{p}{1 - (1-p)^{1 - 2^{k-j}}}$,
- for j = (k + 1) … K, it is $b_{j-k,l} = \frac{p}{1 - e^{-i\frac{2\pi}{2^{j-k}} l}\,(1-p)^{1 - \frac{1}{2^{j-k}}}}$.

Then the residue $R^{(K)}_{k,l}$ at this point is given by

$R^{(K)}_{k,l} = - \underbrace{\frac{p\left((1-p)^{2^k (x-1)} - (1-p)^{2^k x}\right)}{2^k} \prod_{m=1}^{k} a_m}_{=\, \alpha_k(x) \text{ and does not depend on } l}\; \underbrace{\frac{1}{(\ln(1-p) + 2i\pi l)^2} \prod_{m=1}^{K-k} b_{m,l}}_{=\, \beta^{(K-k)}_l}\; e^{2i\pi l 2^k x}$,

where the minus sign comes from the sign of the left coefficient above.

We find a first version of Formula (1.10), for K_order truncated variables:

(1.14)  $P(Y_K \le x) = 1 + \sum_{k=0 \ldots K,\ l \in \mathbb{Z}} R^{(K)}_{k,l} = 1 - \sum_{k=0 \ldots K} \alpha_k(x)\, \phi^{(K-k)}(2^k x)$,

where $\phi^{(K-k)}$ is given by a Fourier series, $\phi^{(K-k)}(x) = \sum_{l \in \mathbb{Z}} \beta^{(K-k)}_l e^{2i\pi l x}$.

STEP 2: From the Finite Order Case to the Exact Formula

The random variables $Y_K$ converge monotonically to Y for K → +∞. Consequently, if the CDFs $\mu_K$ of the truncated variables converge point-wise to a function µ, the limit function is the CDF of the limit variable. The rest of the section establishes that point-wise convergence, which proves (1.10).
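As a sanity check, which is ours and not part of the original proof, formula (1.14) can be verified by hand for K = 0 and 0 < x < 1, where Y_0 = U + A_0 and the answer is known directly: Y_0 ≤ x requires A_0 = 0 and U ≤ x, so P(Y_0 ≤ x) = px. Writing a = ln(1 − p), and using the Fourier expansion of the 1-periodic extension of e^{−ax}/(1 − e^{−a}):

```latex
\begin{align*}
\sum_{l \in \mathbb{Z}} \frac{e^{2i\pi l x}}{a + 2i\pi l}
  &= \frac{e^{-ax}}{1 - e^{-a}}
  && \text{(Fourier series on } 0 < x < 1\text{)} \\
\phi^{(0)}(x) = \sum_{l \in \mathbb{Z}} \frac{e^{2i\pi l x}}{(a + 2i\pi l)^2}
  &= -\frac{\partial}{\partial a}\,\frac{e^{-ax}}{1 - e^{-a}}
   = \frac{e^{-ax}\left( x\,(1 - e^{-a}) + e^{-a} \right)}{(1 - e^{-a})^2} \\
  &= \frac{(1-p)^{1-x}\,(1 - px)}{p^2}
  && \text{(using } e^{-a} = \tfrac{1}{1-p}\text{)} \\
\alpha_0(x) &= p\,(1-p)^{x}\left( \frac{1}{1-p} - 1 \right) = p^2 (1-p)^{x-1} \\
P(Y_0 \le x) &= 1 - \alpha_0(x)\,\phi^{(0)}(x) = 1 - (1 - px) = px .
\end{align*}
```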

• For k ≥ 0, $l \in \mathbb{Z}$, $\lim_{K \to +\infty} \beta^{(K-k)}_l = \beta^{(+\infty)}_l$ is implied by the convergence of the infinite product $\prod_{m \ge 1} b_{m,l}$: As $b_{m,l} = \frac{1}{1 - d_{m,l}}$, where

$d_{m,l} = \frac{1-p}{p} \left( e^{-\frac{\ln(1-p) + 2i\pi l}{2^m}} - 1 \right) \sim_{m \to +\infty} -\frac{1}{2^m}\, \frac{1-p}{p}\, (\ln(1-p) + 2i\pi l)$,

from $\sum_{m \ge 1} |d_{m,l}| < +\infty$, we deduce that the product $\prod_{m \ge 1} (1 - d_{m,l})$ is absolutely convergent, and thus converges to a finite non null limit (see [18] Chap. I for more details on the matter).

• This implies the point-wise convergence $\phi^{(K-k)}(x) \to_{K \to +\infty} \phi^{(\infty)}(x)$: For m ≥ 1 and $l \in \mathbb{Z}$, $b_{m,l} = \frac{p}{1 - \theta (1-p)^{1 - \frac{1}{2^m}}}$ where θ is a complex number with modulus 1. Hence, as 0 < 1 − p < 1, we have $|b_{m,l}| \le \frac{p}{1 - (1-p)^{1 - \frac{1}{2^m}}} = b_{m,0}$.

(1.15)  Note that $b_{m,0} \ge 1$, hence $\left| \prod_{m=1}^{K-k} b_{m,l} \right| \le \prod_{m \ge 1} b_{m,0} = P$.

As a consequence, we have $|\beta^{(K-k)}_l| \le P\, \frac{1}{|l|^2}$ for $l \in \mathbb{Z}$, l ≠ 0, and any value of K − k; we can also deduce $|\beta^{(+\infty)}_l| \le P\, \frac{1}{|l|^2}$. All terms defining the series $\phi^{(K-k)}(x)$ converge to the corresponding term defining the limit series; the previous bound on $\beta^{(K-k)}_l$ is then sufficient to conclude.

• This implies the point-wise convergence $\mu_K(x) \to_{K \to +\infty} \mu(x)$. Again, each term in the sums defining $\mu_K$ converges to the corresponding limit term. One easy consequence of the previous study is that $\phi^{(K-k)}(x)$ can be bounded for any value of x and K − k. Having $\sum_{k \ge 0} |\alpha_k(x)| < +\infty$ is then sufficient to conclude, and it is a consequence of

$|\alpha_k(x)| \le \alpha_k = \frac{p}{2^k} \left( \frac{1}{1-p} \right)^{2^k} \prod_{m=1 \ldots k} |a_m|$, and $|a_m| \sim_{m \to +\infty} \frac{p}{1-p}\,(1-p)^{2^m}$.


3 Examples of Applications

3.1 Buffer and Packet Losses

Effect of the Buffer Size

A congestion epoch generally starts as soon as the total throughput exceeds the capacity C and the buffer is full. The expression of the inter-congestion time (1.2), defined with the first instant when the capacity is reached, then holds only when the buffer size can be neglected. More generally, the inter-congestion time can be defined using the same expression, where the value of C has been corrected to C′ = ρC, with ρ greater than one, to reflect the filling of the buffer before the first packet losses appear.

The buffer size B also has an influence on the RTT R. Indeed, the buffer occupancy can in principle vary between 0 and B. Some parts of the TCP flow hit an empty buffer, some hit a buffer that is completely full. The parts that hit a full buffer need to wait for an additional time of $\frac{B}{C}$, with respect to the former ones. As we have assumed a constant RTT R in our analysis, this additional RTT fluctuation needs to be much smaller than R. Hence, the approach described in this section is only valid if B is much lower than the bandwidth delay product RC.

Estimation of the corrected value of the capacity can be made by studying the buffer evolution between two successive congestion epochs, and distinguishing between the two cases whether or not the buffer empties in that period. When congestion starts at time $T_{i-1}$, the buffer is full and the total throughput has reached the corrected capacity C′. During congestion, on average a proportion p of connections halve their throughput, implying a reduction of the total throughput to the value $C'(1 - \frac{p}{2})$. The total throughput used in the link is then for some time lower than its capacity C, and the number of packets in the buffer decreases and may vanish. When the sum of throughputs increases to such a value that the capacity C is reached again, the buffer starts to fill; it is full when the total throughput has reached the value C′.

• If the decrease in buffer utilization after a congestion epoch is sufficient to empty the buffer before the link capacity is reached again, then losses occur after the buffer gets entirely filled. A calculation then gives

$C' = C + \frac{\sqrt{2BN}}{R}$.

• If the buffer never empties, the decrease and increase of the buffer utilization over a period between two congestion epochs should compensate, hence we have: $C' - C$ and $C - C'(1 - \frac{p}{2})$ should be equal, so that $C' = C\, \frac{1}{1 - \frac{p}{4}}$.
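The two values of C′ can be recovered with a short fluid computation. The derivation below is a sketch under an assumption that is not restated in this section: between congestion epochs each of the N rates grows linearly with slope 1/R² (one packet per RTT, every RTT), so the total throughput S(t) grows with slope N/R². The symbols S, b, t_0 and τ are introduced here for illustration only.

```latex
\begin{align*}
&\text{Buffer content: } \frac{db}{dt} = S(t) - C, \qquad \frac{dS}{dt} = \frac{N}{R^2}. \\[4pt]
&\text{Occasionally empty buffer: } b(t_0) = 0 \text{ when } S(t_0) = C, \text{ so } \\
&\qquad b(t) = \frac{N}{2R^2}\,(t - t_0)^2 = B
   \iff t - t_0 = R\sqrt{\frac{2B}{N}}, \\
&\qquad C' = S(t) = C + \frac{N}{R^2}\, R\sqrt{\frac{2B}{N}} = C + \frac{\sqrt{2BN}}{R}. \\[4pt]
&\text{Never empty buffer: over a period of length } \tau,\ S \text{ grows linearly from }
   C'\!\left(1 - \tfrac{p}{2}\right) \text{ to } C', \\
&\qquad \int_0^{\tau} \left( S(t) - C \right) dt
   = \tau \left( \frac{C'\left(1 - \frac{p}{2}\right) + C'}{2} - C \right) = 0
   \iff C' = \frac{C}{1 - \frac{p}{4}}.
\end{align*}
```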


It can be shown that the maximal load ρ is given in general by the minimum of these two possible values. The smallest of these values determines the functioning case (occasionally empty buffer or never empty buffer) that is valid (more details in [8]).

(1.16)  $\rho = \min\left( 1 + \chi,\ \frac{1}{1 - \frac{p}{4}} \right)$, where $\chi = \frac{\sqrt{2BN}}{RC}$.

Estimating the Synchronization Rate

We propose here a method to choose the value of the parameter p, which is described in full detail in [6].

When the buffer is full, we assume that we enter a period of congestion, for a time equal to one RTT. The router is then modeled as an M/M/1/B queue with arrival rate ρC and service rate C, where B is expressed in packets, and C in packets per second. The probability for a connection to lose at least one packet in this congestion epoch is then given by

(1.17)  $p = \frac{1 - \exp\left( -\frac{RC}{N}\, \frac{\rho - 1}{1 - \left(\frac{1}{\rho}\right)^{B+1}} \right)}{1 - \exp\left( -RC\, \frac{\rho - 1}{1 - \left(\frac{1}{\rho}\right)^{B+1}} \right)} = f(\rho)$.
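Formulas (1.16) and (1.17) translate into a few lines of code. This is a minimal sketch with hypothetical function names; B is in packets and C in packets per second, as in the text, and the value at ρ = 1 is taken as the limit of the ratio, which is our own handling of that corner case.

```python
import math

def chi(B, N, R, C):
    """chi = sqrt(2*B*N) / (R*C), with B in packets and C in packets/s."""
    return math.sqrt(2.0 * B * N) / (R * C)

def load(B, N, R, C, p):
    """Maximal load rho of (1.16) for a given synchronization rate p."""
    return min(1.0 + chi(B, N, R, C), 1.0 / (1.0 - p / 4.0))

def expected_losses(rho, B, R, C):
    """R*C*(rho-1) / (1 - rho**-(B+1)), the exponent appearing in (1.17).

    At rho == 1 the ratio is replaced by its limit R*C / (B+1).
    """
    if rho == 1.0:
        return R * C / (B + 1.0)
    return R * C * (rho - 1.0) / (1.0 - rho ** -(B + 1.0))

def sync_rate(rho, B, N, R, C):
    """Synchronization rate p = f(rho) of (1.17)."""
    losses = expected_losses(rho, B, R, C)
    # expm1(-x) = exp(-x) - 1, so the ratio equals (1 - e^{-x/N}) / (1 - e^{-x})
    return math.expm1(-losses / N) / math.expm1(-losses)

if __name__ == "__main__":
    R, N = 0.1, 34
    C = 34e6 / (8 * 536)        # 34 Mb/s in packets of 536 bytes
    B = 50.0
    rho = 1.0 + chi(B, N, R, C)
    print(rho, sync_rate(rho, B, N, R, C))
```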

• In the "occasionally empty buffer" case, the value of ρ is known to be equal to 1 + χ, which directly gives a value for p.

• In the "never empty buffer" case, this is slightly more complex as the value of ρ depends on p itself. p is then the solution of a fixed point equation p = g(p), where $g(p) = f\left( \frac{1}{1 - \frac{p}{4}} \right)$. In [8], an iterative method is proposed to find this fixed point.

Again we have observed that p is more generally given by the minimum of these two possible values; the choice of the minimal value determines whether the occasionally empty buffer case or the never empty buffer case applies.

The value of parameter p is then entirely fixed by the characteristics of the network C, R, B, N, and, in particular, it does not depend on the initial conditions of the connections passing through the link. Note that the formula we propose here to estimate p can be replaced in our model by any other method, with no change in the rest of our analysis to derive the law of the throughput.

3.2 Impact of Buffer Sizes

In this section we consider a bottleneck router of capacity C shared by N users. Each (persistent TCP) user sees the same RTT R of 100 ms. The number of users is always such that the available capacity per user is kept to $\frac{C}{N}$ = 1 Mb/s. Three cases are considered: a bottleneck router with a link capacity C of 34 Mb/s, 100 Mb/s and 600 Mb/s, which is shared by 34, 100 and 600 (persistent TCP) users respectively. We study first the effect of the buffer size B of this router on the synchronization rate p and its consequence on the instantaneous throughput seen by one user. Both the average throughput and its instantaneous distribution are considered.

N.B.: In all previous sections, the link capacity and the buffer size were expressed in packets per second and packets respectively, while in this section we use bits per second and bytes as units respectively. To relate the latter to the former, we need to choose a typical packet size for the IP packets transporting the TCP data. We assume a packet size of 536 bytes.

As explained in 2, our analysis is only valid for B ≪ RC. We then take the largest considered value of the buffer size to be B = 0.1RC, which corresponds to 0.0425 Mbyte, 0.125 Mbyte and 0.75 Mbyte for C = 34 Mb/s, 100 Mb/s and 600 Mb/s respectively. The value of buffer size $B_{thr}$ that separates the occasionally empty buffer case from the never empty buffer case is 0.55 Mbyte, 1.62 Mbyte and 9.71 Mbyte for the 34 Mb/s, 100 Mb/s and 600 Mb/s case respectively. Consequently, the occasionally empty buffer case always applies to values of B considered in this study.

We define the relative average throughput as the ratio of the average throughput seen by a user at a random time instant in steady state, as in (1.9), over the user's maximum fair share $\frac{C}{N}$. It is given by

(1.18)  $\frac{E[X(\infty)]}{C/N} = \rho \left( 1 - \frac{p}{4} \right)$,

and as one can see by examining (1.16), which estimates ρ as a minimum, it can never exceed 1. Moreover, the value 1 can only occur if, in (1.16), the right term under the minimum operator is the smallest, which corresponds to the case where the buffer never empties between congestion epochs. This follows the intuitive argument that any system that distributes resources amongst users cannot offer each user its maximum fair share if it wastes transmission opportunities, i.e. if it allows its buffer to empty. Notice also that the relative average throughput never drops below 0.75 as ρ ≥ 1 and 0 ≤ p ≤ 1. As the occasionally empty buffer case always applies here, the load is increasing with the square-root of the buffer size B.

Figure 1.5 illustrates how the synchronization rate p and the relative average throughput seen by one user change as the buffer B increases. In particular, it shows that the average throughput does not always increase as the buffer size B gets bigger, i.e. there may be some (counterintuitive) situations where increasing the buffer size B actually leads to a decrease in average throughput. This can be understood in our model by studying the variation of the synchronization rate with B.
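The unit conversions used in this section are easy to get wrong, so a small sketch may help. It assumes 1 Mb/s = 10^6 bit/s and 1 Mbyte = 10^6 bytes (the text does not state which convention is used) and all function names are hypothetical. Under these conventions the three quoted thresholds B_thr give χ close to 1/3, that is 1 + χ close to 4/3, the value of 1/(1 − p/4) at p = 1; this reading is ours and is not stated in the text.

```python
import math

PACKET_BYTES = 536          # packet size assumed in the text
RTT = 0.1                   # seconds

def capacity_in_packets(capacity_bps):
    """Convert a link capacity from bits/s to packets/s."""
    return capacity_bps / (8.0 * PACKET_BYTES)

def buffer_bytes(fraction_of_rc, capacity_bps, rtt=RTT):
    """Buffer size in bytes for B = fraction * R * C."""
    return fraction_of_rc * rtt * capacity_bps / 8.0

def chi(buffer_in_bytes, n_users, capacity_bps, rtt=RTT):
    """chi of (1.16) after converting B and C to packets."""
    B = buffer_in_bytes / PACKET_BYTES
    C = capacity_in_packets(capacity_bps)
    return math.sqrt(2.0 * B * n_users) / (rtt * C)

def relative_average_throughput(rho, p):
    """Right-hand side of (1.18)."""
    return rho * (1.0 - p / 4.0)

if __name__ == "__main__":
    for mbps, bthr_mbyte in ((34, 0.55), (100, 1.62), (600, 9.71)):
        bps = mbps * 1e6
        print(mbps, buffer_bytes(0.1, bps) / 1e6, chi(bthr_mbyte * 1e6, mbps, bps))
```

The second printed column reproduces the buffer sizes quoted for B = 0.1RC, and the third shows χ at the quoted thresholds.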


Figure 1.5: Synchronization rate p (left) and relative average throughput (right), seen by one TCP user as a function of the buffer size B relative to RC.

For a small buffer size, the load ρ (at congestion epochs) is barely larger than 1. Under this condition, the synchronization rate is given by

(1.19)  $p = \frac{1 - \exp\left( -\frac{RC}{N(B+1)} \right)}{1 - \exp\left( -RC\, \frac{1}{B+1} \right)}$.

Hence, for B small enough to have $\frac{B+1}{RC} \ll \frac{1}{N}$, the value of p is close to 1, and it decreases with B. Equation (1.17) shows that this decrease, for a certain value of B, gets compensated by the effect of ρ, which also increases as B increases. A minimal value of p is then attained for some buffer size, and beyond it the value of p increases under the influence of the second effect. We observe this phenomenon in Figure 1.5 (left). Note that the value of $\frac{B}{RC}$ where the local minimum occurs and the minimal value of p both decrease as the link capacity C increases.

If the increase of p, occurring after a certain buffer size, is sufficient to compensate the increase of ρ in the product (1.18) defining the relative average throughput, a local maximum of the average throughput is seen as a function of the buffer size. Such a local maximum is observed for all parameter settings in Figure 1.5 (right). Note that this local maximum is sharper for a larger capacity. For larger buffer sizes (such values are not shown in Figure 1.5, as the necessary condition B ≪ RC is no longer valid for the cases studied here), the increase in load ρ dominates and results in an overall increase of the relative average throughput. Ultimately, if the buffer size is bigger than $B_{thr}$, the never empty buffer case applies, implying a relative average throughput equal to 1.

These results were checked via an NS simulation (a packet-based simulator of TCP networks, available at http://www.isi.edu/nsnam/ns/). The set-up was chosen to be as close as possible to the one above. N users share a router of capacity C, where $\frac{C}{N}$ is kept equal to 1 Mb/s, which introduces a propagation delay of 10 ms. The N users each access this router through a separate link of capacity 1.5 Mb/s where the propagation delay is 30 ms; after the common router, each user uses a separate link to its destination, of capacity 1.5 Mb/s, where the propagation delay is 10 ms. Finally, the propagation delay for the acknowledgment packets is 50 ms. Notice that the RTT is 100 ms, plus some delay because of queuing in the network. As the buffer size has been chosen small enough in these simulations, the queuing delay in the bottleneck can be neglected, and in particular, it cannot be responsible for the local decrease of the throughput observed when the buffer is larger.

Figure 1.6: Relative average throughput seen by one user as a function of the buffer size B relative to RC, obtained by NS simulations.

Results from these simulations are shown in Figure 1.6. There is no curve available for N = 600 users, because it turned out to require too much computer memory. The overall trend of the average throughput is similar to what our model predicts; the average relative throughput is roughly increasing with B, but it admits some local maximum and its evolution stays pretty flat after this local maximum, except for some small variations. We can also observe some differences with our model. First, a relative average throughput smaller than 75% can be observed when the buffer is small, as B plays, in this case, the role of a Wmax (maximal number of unacknowledged packets sent through the network for a connection); this packet level effect cannot be captured by our model. Second, we observe that for large values of the buffer size our model underestimates the average throughput (or equivalently overestimates the synchronization rate p). This might be a consequence of the variation of the RTT due to the queuing delay, as for these cases the buffer is sufficiently large to have such an impact. It would explain why we observe less synchronization between flows than predicted by our calculation on Markovian queues.

Next, we consider the TDF of the instantaneous throughput. To study the impact of ρ and p separately, we define the relative instantaneous throughput by rewriting (1.8) as

(1.20)  $\frac{X(\infty)}{C/N} = \rho Z$ with $Z = \frac{p}{2} \left( 1 + U + \sum_{k \ge 0} \frac{A_k}{2^k} \right)$.

Notice that the synchronization rate p solely determines the TDF of the random variable Z. For p = 1, Z is uniformly distributed between 0.5 and 1. Figure 1.7 shows the TDF of the variable Z for different values of p. As p decreases from 1, the TDF becomes less steep and a broader range of values can be taken, from $\frac{p}{2}$ to +∞. The 0.95-quantile (i.e. that value z for which P(Z > z) = 0.95) thus decreases as p decreases and, similarly, the 0.05-quantile increases as p decreases.
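The behaviour of the quantiles of Z can be explored by simulation. The following sketch is illustrative only: it samples Z from (1.20) with a truncated series, and `sample_z`, `tail_quantile`, the truncation `kmax` and the sample sizes are our own choices, not part of the original study.

```python
import random

def sample_z(p, rng, kmax=40):
    """One draw of Z = (p/2) * (1 + U + sum_k A_k / 2^k) from (1.20)."""
    walk = 0.0
    for k in range(kmax + 1):
        a = 0
        while rng.random() >= p:    # geometric variable starting at 0
            a += 1
        walk += a / 2.0**k
    return 0.5 * p * (1.0 + rng.random() + walk)

def tail_quantile(samples, level):
    """Empirical z such that a fraction `level` of the samples exceeds z."""
    ordered = sorted(samples)
    index = int(round((1.0 - level) * (len(ordered) - 1)))
    return ordered[index]

if __name__ == "__main__":
    rng = random.Random(0)
    for p in (1.0, 0.8, 0.5):
        zs = [sample_z(p, rng) for _ in range(5_000)]
        print(p, tail_quantile(zs, 0.95), tail_quantile(zs, 0.05))
```

For p = 1 the variable Z is uniform on [0.5, 1], so the 0.95-quantile should come out near 0.525, which gives a simple check of the sampler.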

Figure 1.7: TDF of the variable Z with the synchronization rate p as parameter.

Figure 1.8 shows the TDF of the throughput seen by one user. In each case, four buffer sizes were chosen such that $\frac{B}{RC}$ = 0.0008, 0.0024, 0.008 and 0.08. We observe in two cases, corresponding to the lowest buffer size for capacity 34 Mb/s and 100 Mb/s, a law close to the uniform distribution on the interval [0.5 Mb/s; 1 Mb/s], as the synchronization rate p is then close to 1. The average throughput is in these cases only a little larger than 0.75 Mb/s (i.e. only 75% of the naively expected value of 1 Mb/s). Remark that this is not the case for the smallest buffer size with capacity 600 Mb/s, as the synchronization rate p has already decreased from 1. In the case of 600 Mb/s, we observe the decrease in average throughput as B increases for the two largest values of the buffer size. This was already explained when discussing Figure 1.5.

Figure 1.8: TDF and average throughput seen by one TCP user for several buffer sizes B: C = 34 Mb/s (top left), 100 Mb/s (top right) and 600 Mb/s (bottom).

3.3 Dimensioning with Probabilistic Guarantee

In the previous section, we calculated the average throughput seen by a user in a given scenario. We now address a different problem: given a fixed capacity and buffer size, what is the maximal number of connections that can share the link when each of them needs to satisfy a performance requirement? This requirement is given on the distribution of the instantaneous throughput.

As in 3.2, we express here the capacity and the buffer in bytes, instead of packets. To relate the latter to the former, a constant packet size of 536 bytes is assumed. The link considered has a capacity of 34 Mb/s, 100 Mb/s and 600 Mb/s, and the buffer is again a parameter that we vary while keeping $\frac{B}{RC} \le 0.1$. The object of our study is the maximal value of N to have a throughput always greater than x = 100 kb/s, except for at most 5% of the time (i.e. P(X(∞) ≥ x) ≥ 95%). Therefore, we need to numerically invert the closed form formula (1.10).

As the threshold for the throughput considered here (100 kb/s) is one order of magnitude smaller than the throughput considered in the previous section (around 1 Mb/s), we can intuitively expect that the number of users we consider now is higher. This is indeed the case, and there are, for every capacity level, between three and five times more connections in this second study. Consequently, the threshold buffer $B_{thr}$ (which decreases when N is higher) may fit in the range of values that we consider for B (to keep $\frac{B}{RC}$ smaller than 0.1).

Figure 1.9 shows the results obtained with the closed form formula. For each value of B, we have increased N, while evaluating p and P(X(∞) ≥ x) using (1.10); the process was stopped when this probability fell under 0.95. The maximal value of N is given relative to the value Nmax we could expect if the capacity was equally shared perfectly at any time. We have $N_{max} = \frac{C}{x}$, and for a shared capacity of 34 Mb/s, 100 Mb/s and 600 Mb/s, it is equal respectively to 340, 1000 and 6000 flows.

Figure 1.9: The maximal number of users, for which a throughput of 100 kb/s can be guaranteed 95 % of the time, seen as a function of the buffer size B in the bottleneck router.

If the buffer size B is small, the synchronization rate p is very close to 1. Hence, the TDF of the throughput is practically uniform between $\frac{1}{2}\frac{C}{N}$ and $\frac{C}{N}$. Hence, the 0.95-quantile hardly exceeds $\frac{1}{2}\frac{C}{N}$, and the maximal number of users is slightly higher than half of Nmax. Remark however that the NS simulations of the previous section indicated that in this case the packet phenomena are so important that our model, partially based on fluid flow, cannot capture this behavior very well.

For a slightly larger buffer size B, the load ρ increases with B while the synchronization rate p decreases. This has a double beneficial effect on the (relative) average throughput, as was observed in the previous section. Strangely enough, these beneficial effects do not apply to the 0.95-quantile. This was already observed for the relative instantaneous throughput Z on Figure 1.7. So, in this interval for B where the average throughput increases, the value of the 0.95-quantile actually decreases (and as a consequence, fewer users can be supported if such a guarantee of 100 kb/s needs to be given 95% of the time).

For still larger values of the buffer size B, i.e. for B larger than the value where the minimal synchronization rate p occurs, we observe a steep increase of the number of users that are guaranteed 100 kb/s 95% of the time. This is explained by the increase of the synchronization rate after its minimum, and at the same time the increase of the load ρ, as both are beneficial to the 0.95-quantile of the throughput distribution.

Finally, the buffer is so large that the never empty buffer case applies. Note that, as the number of users considered in the link is greater now than it was in the previous section, the never empty buffer case occurs when B is still moderately large (i.e. B smaller than 0.1RC), so that our analysis can be applied. The average throughput seen by each user in this case is exactly $\frac{C}{N}$. The load ρ is given by $\frac{4}{4-p}$ and the synchronization rate p remains more or less constant. This value of the synchronization rate determines the 0.95-quantile of the relative instantaneous throughput Z. The number of users N that can be supported, as can be seen on Figure 1.9, remains more or less constant for these values of B.
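The role of the 0.95-quantile of Z in this dimensioning problem can be made explicit from (1.20). The relation below is a worked reformulation added for illustration; $z_{0.95}(p)$ denotes the value z with P(Z > z) = 0.95. Since ρ and p themselves depend on N through (1.16) and (1.17), it is a condition to be checked for each candidate N rather than an explicit formula for the maximal N.

```latex
\begin{align*}
P\big(X(\infty) \ge x\big) \ge 0.95
  &\iff P\!\left( Z \ge \frac{xN}{\rho C} \right) \ge 0.95
   \iff \frac{xN}{\rho C} \le z_{0.95}(p) \\
  &\iff N \le \rho\, z_{0.95}(p)\, \frac{C}{x} = \rho\, z_{0.95}(p)\, N_{max}. \\[4pt]
\text{For } p = 1:\quad & Z \text{ is uniform on } [\tfrac{1}{2}, 1],\quad
  z_{0.95}(1) = 0.525,\quad N \le 0.525\, \rho\, N_{max}.
\end{align*}
```

With ρ barely larger than 1, this is consistent with the observation that the maximal number of users is slightly higher than half of Nmax for small buffers.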

4 Analysis with Non Persistent Flows

In this section, we apply the same modeling approach of TCP flows' interaction to study the impact of the randomly varying number of flows.

Our Model

We consider a variation of the previous model: each of the $N$ flows that share the bottleneck link alternates as an ON-OFF source; when it is active, each flow transfers a file of size exponentially distributed with mean $1/\mu$; at the end of its transfer, each flow remains idle for a silent period that is exponentially distributed with mean $1/\beta$. We do not fix in advance the initial conditions of the flows, as several cases are considered.

During an active period of transfer, a flow increases its rate according to the AIMD rule already explained. We assume buffers are small enough for congestion to occur instantaneously as soon as the sum of rates for all active flows reaches the router's capacity. When congestion occurs, each flow has a fixed probability $p$ to halve its window. Note that we also consider the case of TCP Tahoe, in which the window is not divided by 2, but is reset to zero if this flow experiences congestion.

Before moving on, we would like to discuss two specific assumptions:
• The distribution of idle time and file size is exponential. Several studies have shown that file size distribution and idle time are both heavy tailed. Most of the results that we present are only shown here for the exponential law. However, we have been able to demonstrate that our results are not limited to light tailed distributions. File sizes distributed with a heavy tail should exhibit, even more severely, the phenomenon of turbulence that is identified below. More details may be found in [3].
• TCP flows are all in congestion avoidance phase. The initial slow start phase of each flow is ignored here, and flows are assumed to start directly in congestion avoidance. Another solution is to assume an instantaneous jump when the connection starts; it is discussed in more detail in [3].

We are interested in the evolution of this system for large $N$, where a mean field approximation is valid. We do not prove here the existence of this mean field solution; we rather characterize which types of solutions can be found, leaving the justification of the limit result for future work. In the same spirit as for the mean field analysis of RED (made in [7]), we need to assume that the rates of the active flows are distributed according to a density function $s(z, t)$ that evolves in time according to a partial differential equation.

The Two Regimes of Dynamic Traffic Demand

There are at least two steady state regimes that we can identify:
• In the free regime, which we analyze first in Section 4.1, the flows do not interact as no congestion occurs in the network. The traffic is only characterized by the increase of the congestion avoidance phase that we have assumed. Between two congestion epochs, the evolution of this model is always described as in the free regime. We study its evolution equation carefully to apply it later.
• In the congestion regime, congestion epochs occur with a fixed period $\tau$. Our first result (shown in Section 4.2) is to characterize the possible values of this period. It provides us with a necessary condition on the existence of such a regime. In a second step, we apply the evolution of the free regime, obtained earlier, to deduce the invariant measure of the rate distribution immediately after a congestion epoch.

4.1 The Free Regime

Regenerative Rate Process

In the case without congestion, each flow increases its transmission rate linearly at rate $1/R^2$ and can transmit a file of size $y$ packets in time $t$ where $y = t^2/(2R^2)$; i.e. in time $t = R\sqrt{2y}$. The density of the transmission time of a file is $\mu \frac{t}{R^2} e^{-\frac{\mu t^2}{2R^2}}$ (as easily seen by the change of variable $t \to v = t^2/(2R^2)$) and the mean file transmission time is therefore
$$T_{ON} = R\int_0^\infty \mu \exp(-\mu y)\sqrt{2y}\,dy = R\sqrt{\frac{\pi}{2\mu}}. \tag{1.21}$$
A tagged flow alternates between periods composed of a silence period of exponential duration with parameter $\beta$, and an active period of mean duration $T_{ON}$, distributed according to the above density.

The rate $X(t)$ of the tagged flow at time $t$ is a regenerative process that stays equal to 0 during OFF periods and increases linearly with time during activity periods. This stochastic process regenerates after the completion of one OFF and one ON period. The point process of regeneration epochs of a tagged flow is denoted by $S$.

During each ON period, a flow transmits on average $1/\mu$ packets. Consequently the average transmission rate per flow is
$$\rho = \frac{1/\mu}{1/\beta + T_{ON}}. \tag{1.22}$$
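As a quick numerical illustration of (1.21) and (1.22), the sketch below evaluates $T_{ON}$ and $\rho$ in closed form and compares $\rho$ with a Monte Carlo estimate of the regenerative ON-OFF process of one tagged flow. The function names, the seed and the number of cycles are illustrative choices, not part of the original text; the parameter values $1/\mu = 2000$ Pkts, $1/\beta = 2$ s and $R = 0.1$ s are the ones used later in this section, for which $\rho$ is around 263 Pkts/s.

```python
import math
import random


def mean_on_time(mu: float, rtt: float) -> float:
    """Mean file transmission time T_ON = R * sqrt(pi / (2 mu)), as in (1.21)."""
    return rtt * math.sqrt(math.pi / (2.0 * mu))


def free_regime_rate(mu: float, beta: float, rtt: float) -> float:
    """Average rate per flow rho = (1/mu) / (1/beta + T_ON), as in (1.22)."""
    return (1.0 / mu) / (1.0 / beta + mean_on_time(mu, rtt))


def simulate_rate(mu: float, beta: float, rtt: float,
                  cycles: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of rho over OFF/ON regeneration cycles.

    Each cycle is an exponential silence (parameter beta) followed by the
    transfer of an exponential file of mean 1/mu, which lasts R * sqrt(2 y)
    for a file of y packets when no congestion occurs.
    """
    rng = random.Random(seed)
    packets = 0.0
    elapsed = 0.0
    for _ in range(cycles):
        idle = rng.expovariate(beta)
        size = rng.expovariate(mu)
        elapsed += idle + rtt * math.sqrt(2.0 * size)
        packets += size
    return packets / elapsed


if __name__ == "__main__":
    mu, beta, rtt = 1.0 / 2000.0, 0.5, 0.1
    print("T_ON      =", mean_on_time(mu, rtt))
    print("rho       =", free_regime_rate(mu, beta, rtt))
    print("simulated =", simulate_rate(mu, beta, rtt))
```

The simulated value is a ratio estimator, so it only agrees with the closed form up to sampling noise.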

The proportion $\nu$ of flows which are idle is $(1/\beta)/(1/\beta + T_{ON})$. The transmission rate equals $\nu\beta/\mu$. This is intuitively obvious since $\nu\beta$ is the rate at which new flows come on-line and each new flow must transmit on average $1/\mu$ packets.

Hence, when the regime without congestion occurs, the average transmission rate per flow $\rho$ is less than $C$; i.e. $\nu\beta/\mu < C$ and
$$\rho = \frac{\nu\beta}{\mu} = \left(\mu\left(\frac{1}{\beta} + R\sqrt{\frac{\pi}{2\mu}}\right)\right)^{-1} < C. \tag{1.23}$$

Partial Differential Equation

Let $\nu(t)$ be the proportion of idle flows at time $t$. Let $s(z, t)$ be the density of the transmission rates of active flows in the mean-field regime (we consider first the case with a density for the sake of clear exposition). Consequently,
$$\int_0^\infty s(z, t)\,dz = 1 - \nu(t). \tag{1.24}$$
As shown by Baccelli et al. in [7], the density function $s(z, t)$ satisfies in this context:
$$\frac{\partial s}{\partial t}(z, t) + \frac{1}{R^2}\frac{\partial s}{\partial z}(z, t) = -\mu z\, s(z, t). \tag{1.25}$$
Multiplied by $dz$, the second term on the left hand side represents the rate of change of the proportion of transmission rates in $[z, z + dz]$ because of the linear increase of the congestion avoidance phase of TCP. The right hand side represents the rate at which files complete transmission, since $s(z, t)dz$ is the proportion of flows with transmission rates in the interval $[z, z + dz]$ and such flows complete their transmissions at a rate $\mu z$.

The rate at which flows become active is $\beta\nu(t)$, hence in time $dt$ the area $\beta\nu(t)dt$ is added under the graph of $s(z, t)$ between 0 and $dt/R^2$, because this area is cleared out by the additive increase in the transmission rates. The area under the graph of $s(z, t)$ between 0 and $dt/R^2$ is $s(0, t)dt/R^2$ to first order. Hence,
$$s(0, t)/R^2 = \beta\nu(t). \tag{1.26}$$
It is shown in [3], using Laplace transform arguments, that $s(z, t)$ satisfies the following Fredholm equation:
$$s(z, t) = s\left(z - \frac{t}{R^2}, 0\right) e^{-\mu\left(tz - \frac{t^2}{2R^2}\right)} + e^{-\mu R^2 \frac{z^2}{2}} R^2\beta\left(1 - \int_0^\infty s(x, t - zR^2)\,dx\right) \tag{1.27}$$

which turns out to be quite handy for numerical exploitation, as we shall see below. Identity (1.27) is easy to interpret when considering the two cases: for the rate to be $z$ at time $t$, either the transfer of the file transmitted at time 0 is not yet completed at time $t$, which requires that the rate was $z - t/R^2 \ge 0$ at time 0, or it is completed, which requires that the flow was inactive at time $t - zR^2 > 0$ and there was a transition from inactive to active at that time. In fact, it is clear that (1.27) can be generalized to describe the evolution of a measure $S(dz, t)$ representing the distribution of transmission rates at time $t$ starting from an arbitrary measure $S(dz, 0)$:
$$S(dz, t) = R^2\beta\left(1 - \int_{x=0}^\infty S(dx, t - zR^2)\right) e^{-\mu R^2 \frac{z^2}{2}}\,dz + S\left(dz - \frac{t}{R^2}, 0\right) e^{-\mu\left(zt - \frac{t^2}{2R^2}\right)}. \tag{1.28}$$
Let
$$\alpha(t) = \int_0^\infty z\, s(z, t)\,dz. \tag{1.29}$$
The function $\alpha(t)$ represents the aggregate rate (sum of the transmission rates at time $t$ where the sum is over all flows).

Lemma 1 The Laplace transform of $\alpha$, denoted by $\hat{\alpha}(u) = \int_0^\infty e^{-ut}\alpha(t)\,dt$, can be expressed as
$$\hat{\alpha}(u) = \frac{\nu(0)\frac{\beta}{\beta+u} I(u) + J(u)}{1 - \mu\frac{\beta}{\beta+u} I(u)},$$
where
$$I(u) = R^2\int_0^\infty x\, e^{-R^2 u x - R^2\mu x^2/2}\,dx,$$
$$J(u) = R^2\int_{z=0}^\infty e^{R^2 u z + \frac{R^2\mu z^2}{2}}\, s(z, 0)\int_{x=z}^\infty x\, e^{-R^2 u x - \frac{R^2\mu x^2}{2}}\,dx\,dz.$$

Lemma 2 The stationary distribution of the rates is:
$$\nu(\infty) = \frac{1/\beta}{\frac{1}{\beta} + R\sqrt{\frac{\pi}{2\mu}}}, \qquad s(z, \infty) = \frac{R^2 e^{-R^2\mu z^2/2}}{\frac{1}{\beta} + R\sqrt{\frac{\pi}{2\mu}}}. \tag{1.30}$$
The stationary aggregate rate is:
$$\alpha(\infty) = \frac{1}{\mu}\,\frac{1}{\frac{1}{\beta} + R\sqrt{\frac{\pi}{2\mu}}} = \rho. \tag{1.31}$$
The proofs of these lemmas can be found in [3]. This article also contains an interpretation of the functions introduced in Lemma 1 in terms of renewal theory, and a closed form expression for the solution of (1.25) as a function of the time domain.
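The stationary quantities of Lemma 2 can be checked numerically against (1.24), (1.26) and (1.31). The sketch below is only a sanity check under the same illustrative parameters as elsewhere in this section ($1/\mu = 2000$ Pkts, $1/\beta = 2$ s, $R = 0.1$ s); the helper names, the truncation point `z_max` and the trapezoidal rule are assumptions made for the example.

```python
import math


def stationary_density(z: float, mu: float, beta: float, rtt: float) -> float:
    """Stationary density s(z, infinity) of Lemma 2."""
    t_on = rtt * math.sqrt(math.pi / (2.0 * mu))
    return rtt ** 2 * math.exp(-rtt ** 2 * mu * z * z / 2.0) / (1.0 / beta + t_on)


def trapezoid(f, a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule with n sub-intervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def check_lemma2(mu: float, beta: float, rtt: float,
                 z_max: float = 4000.0, n: int = 40_000):
    """Return (nu, active mass, aggregate rate, rho) for the stationary law."""
    t_on = rtt * math.sqrt(math.pi / (2.0 * mu))
    nu = (1.0 / beta) / (1.0 / beta + t_on)
    rho = (1.0 / mu) / (1.0 / beta + t_on)
    active = trapezoid(lambda z: stationary_density(z, mu, beta, rtt), 0.0, z_max, n)
    alpha = trapezoid(lambda z: z * stationary_density(z, mu, beta, rtt), 0.0, z_max, n)
    return nu, active, alpha, rho


if __name__ == "__main__":
    nu, active, alpha, rho = check_lemma2(1.0 / 2000.0, 0.5, 0.1)
    print("nu + active mass:", nu + active)   # should be close to 1, see (1.24)
    print("alpha vs rho    :", alpha, rho)    # should agree, see (1.31)
```

The integration range is truncated where the Gaussian tail of the density is negligible for these parameters; other parameter values may need a different `z_max`.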

4.2 Property of Inter Congestion Time

Rate Conservation Principle

Define $X(t)$ to be the transmission rate of a tagged flow participating in the steady state. Assume that a stationary regime exists for $X(t)$, namely that it is a stationary stochastic process defined on a probability space $(\Omega, \mathcal{F}, P)$. The distribution of $X(t)$ is therefore the same for all flows in the steady state. $X(t)$ increases linearly at rate $1/R^2$ when it is active; i.e. with mean rate $P(X(0) > 0)/R^2$. This increase is counteracted by two effects. First, the negative jumps when a file finishes, as the value of $X(t)$ changes to zero. Second, a reduction by one half of $X(t)$ when a packet is lost at a congestion epoch.

Let us introduce the following point processes:
• $T$, the point process of congestion epochs, with inter-arrival times $\tau$, with Palm expectation $E^\tau_0$; let $\tau$ denote the expectation of the inter-congestion times w.r.t. $P^\tau_0$;
• $D$, the point process of file completions of the tagged flow, with intensity $\lambda_\delta$ and with Palm expectation $E^\delta_0$.

When a file is completely downloaded, the throughput is reset to zero. Hence, with the introduced notation, the rate of decrease of $X(t)$ caused by file completion is $\lambda_\delta E^\delta_0(X(0-))$. In addition to that, the mean rate at which the tagged flow suffers a packet loss is $p/\tau$, and the tagged flow divides its transmission rate by 2 for each loss. This accounts for another decrease of $X(t)$ with rate $\frac{p}{\tau} E^\tau_0[X(0-)/2]$. This rate is equal to $pC/(2\tau)$, since the utilization is exactly one when the congestion epoch begins, which gives $E^\tau_0[X(0-)] = C$.

By the rate conservation principle (see e.g. [2, p.24]) of stationary processes, the mean rate of increase equals the mean rate of decrease. Hence,
$$\frac{P(X(0) > 0)}{R^2} = \frac{pC}{2\tau} + \lambda_\delta E^\delta_0[X(0-)]. \tag{1.32}$$
On the left hand side, the unknown quantity is the steady state probability that a flow is active. On the right hand side, we have $\lambda_\delta$, the rate at which file completions occur, and $E^\delta_0[X(0-)]$, the mean transmission rate observed when the file is completely downloaded.

In the Tahoe case, the conservation principle may be written as
$$\frac{P(X(0) > 0)}{R^2} = \frac{pC}{\tau} + \lambda_\delta E^\delta_0[X(0-)]. \tag{1.33}$$
In what follows, this identity is used as a way to determine the possible values of $\tau$. As we shall see below, the expressions that show up in (1.32) and (1.33), namely $P(X(0) > 0)$ and $E^\delta_0[X(0-)]$, can be computed as a function of $\tau$, so that this equation can be seen as a fixed point equation for $\tau$.
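As a consistency check, the rate conservation identity can be verified by hand in the free regime of Section 4.1, where no congestion occurs and the loss term is absent. This is a sketch that only uses quantities already introduced: each regeneration cycle has mean length $1/\beta + T_{ON}$ and contains exactly one file completion, and a file whose transfer lasts $t$ completes at rate $t/R^2$.

```latex
\begin{align*}
P(X(0) > 0) &= \frac{T_{ON}}{1/\beta + T_{ON}}, \qquad
\lambda_\delta = \frac{1}{1/\beta + T_{ON}}, \qquad
E^\delta_0[X(0-)] = \frac{T_{ON}}{R^2}, \\
\lambda_\delta\, E^\delta_0[X(0-)]
  &= \frac{T_{ON}}{R^2\,(1/\beta + T_{ON})}
   = \frac{P(X(0) > 0)}{R^2}.
\end{align*}
```

The mean rate of increase and the mean rate of decrease therefore balance, which is the form (1.32) takes when the term $pC/(2\tau)$ is dropped.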

The Fredholm Equations

In this section, we let the parameter $N$ tend to $\infty$ and we assume the existence of a stationary mean-field limit as $N \to \infty$ in the same spirit as in [7], [9] or [4]. In such a mean-field regime, the inter-congestion times become deterministic and we have propagation of chaos; i.e. each flow becomes independent. We concentrate on the case where the stationary regime of the mean-field limit has periodic congestion events, separated by the same time value $\tau$.

As seen below, when $\tau$ is known, all quantities in (1.32) can be computed as the solutions of certain Fredholm integral equations. Equation (1.32) can then be used to determine $\tau$ as a fixed point.

In this section, we assume $\tau$ to be given. We define a cycle to start at a congestion epoch where the tagged flow is idle. The cycle ends at the first congestion epoch when the flow is idle again. We use the following notation:
• $\Sigma$ is the point process of congestion epochs where the tagged flow is idle, with inter-arrival times $\sigma$ and with Palm expectation $E^\sigma_0$.

The rationale for defining such cycles is that the sequence of successive cycles associated with the tagged flow is i.i.d., or, in other words, that the beginnings of cycles are regeneration times for the tagged flow.

Expected number of files in a cycle: Define $f(t)$ to be the expected number of files that will be transmitted by the end of the current cycle, given that the tagged flow is inactive at the current time $t$ (where $0 \le t < \tau$). Also define $g(z)$ to be the expected number of files that will be transmitted by the end of the current cycle, given that the current transmission rate of the tagged flow is $z$ packets per second and that the current time is immediately after a congestion epoch.

Our goal is to evaluate $f(0)$, but we find $f(t)$ for all $t \in [0, \tau[$. Since the silence period has an exponential distribution, we can condition on the time when the flow has a new file to transmit. There are two possibilities. Either the file arrives before the next congestion epoch at some time $r$ where $t \le r \le \tau$ or it does not. If it has not arrived, the current cycle ends and $f(t) = 0$.

If it does, for a time $r$ where $t \le r \le \tau$, we condition on the size $y$ of the arriving file. There are again two cases. Either the transmission of this file is completed before the next congestion epoch or there is some remaining data to be transmitted after the next congestion epoch. We are in the first case if we can transmit $y$ packets in $\tau - r$ time units, where the flow starts with a null transmission rate. Since the transmission rate increases at rate $1/R^2$, it takes $t'$ time units to transmit $y$ packets if $y = (t'/2)(t'/R^2)$, i.e. if $t' = R\sqrt{2y}$. Consequently $y$ packets can be transmitted before the next congestion epoch only if $y \le (\tau - r)^2/(2R^2)$. In this case we add one to the number of files transmitted during the current cycle, plus a renewal term. We can summarize this first case by
$$\int_t^\tau \beta e^{-\beta(r-t)} \int_0^{\frac{(\tau-r)^2}{2R^2}} \mu e^{-\mu y}\left(1 + f(r + R\sqrt{2y})\right)dy\,dr.$$
In the second case, the $y$ packets cannot be transmitted before the next congestion epoch. In this case, which occurs with probability $\exp(-\mu(\tau - r)^2/(2R^2))$, we do not add one to the number of files transmitted, but only the expected number of files transmitted after the next congestion epoch. It depends on the throughput seen after congestion: by the congestion epoch the transmission rate of the tagged flow is $(\tau - r)/R^2$. There is probability $p$ that the tagged flow suffers a packet loss, which reduces the transmission rate to $(\tau - r)/(2R^2)$.

We can summarize the expected number of files that will be transmitted by the end of the current cycle, given we are in this second case, as
$$\int_t^\tau \beta e^{-\beta(r-t)} e^{-\mu\frac{(\tau-r)^2}{2R^2}}\left(p\,g\left(\frac{\tau-r}{2R^2}\right) + (1-p)\,g\left(\frac{\tau-r}{R^2}\right)\right)dr.$$
We conclude that $f(t)$ is given by (1.34):

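$$f(t) = \int_t^\tau \beta e^{-\beta(r-t)}\left(\int_0^{\frac{(\tau-r)^2}{2R^2}} \mu e^{-\mu y}\left(1 + f(r + R\sqrt{2y})\right)dy + e^{-\mu\frac{(\tau-r)^2}{2R^2}}\left(p\,g\left(\frac{\tau-r}{2R^2}\right) + (1-p)\,g\left(\frac{\tau-r}{R^2}\right)\right)\right)dr.$$
By similar arguments (see [3]), $g(z)$ can be written as:
$$g(z) = \int_0^{z\tau + \frac{\tau^2}{2R^2}} \mu e^{-\mu y}\left(1 + f\left(R\sqrt{R^2z^2 + 2y} - R^2 z\right)\right)dy + e^{-\mu\left(z\tau + \frac{\tau^2}{2R^2}\right)}\left(p\,g\left(\frac{z + \frac{\tau}{R^2}}{2}\right) + (1-p)\,g\left(z + \frac{\tau}{R^2}\right)\right). \tag{1.35}$$
Equations (1.34) and (1.35) constitute an integral equation of the Fredholm type for the pair $(f, g)$.

The three unknowns of the conservation principle equation: One can get similar Fredholm equations for determining the pairs of functions $(h, i)$, $(j, k)$ and $(l, m)$ where:
• $h(t)$ is the expected cumulative time when the flow is active in the remaining time of the current cycle, given that the tagged flow is inactive at the current time $t$ with $0 \le t < \tau$.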

• $i(z)$ is the expected cumulative time when the flow is active in the remaining time of the current cycle, given that the current time is immediately after a congestion epoch, and that the tagged flow is active with a transmission rate of $z$.
• $j(t)$ is the expected residual time before the end of the current cycle, given that the tagged flow is inactive at the current time $t$ with $0 \le t < \tau$.
• $k(z)$ is the expected residual time before the end of the current cycle, given that the current time is immediately after a congestion epoch, and that the tagged flow is active with a current transmission rate of $z$.
• $l(t)$ is the expected cumulative throughput reduction, caused by file completions, from now to the end of the cycle, given that the tagged flow is inactive at the current time $t$, with $0 \le t < \tau$.
• $m(z)$ is the expected cumulative throughput reduction, caused by file completions, from now to the end of the cycle, given that the current time is immediately after a congestion epoch, and that the tagged flow is active with a rate of $z$.

The knowledge of
• $E^\sigma_0[K_B] := f(0)$, the mean number of births during a cycle (which is also the mean number of file completions during a cycle);
• $E^\sigma_0\left[\int_0^\sigma 1_{X(t)>0}\,dt\right] = h(0)$, the mean cumulative ON time over a cycle;
• $E^\sigma_0[\sigma] = j(0)$, the mean duration of a cycle and
• $E^\sigma_0\left[\int_0^\sigma X(t-)\,D(dt)\right] = l(0)$, the mean cumulative throughput reduction, caused by file completions, over a cycle
in turn determines the 3 unknowns of (1.32) since:
$$\lambda_\delta = \frac{E^\sigma_0[K_B]}{E^\sigma_0[\sigma]} = \frac{f(0)}{j(0)},$$
$$E^\delta_0[X(0-)] = \frac{E^\sigma_0\left[\int_0^\sigma X(t-)\,D(dt)\right]}{E^\sigma_0[K_B]} = \frac{l(0)}{f(0)},$$
$$P(X(0) > 0) = \frac{E^\sigma_0\left[\int_0^\sigma 1_{X(t)>0}\,dt\right]}{E^\sigma_0[\sigma]} = \frac{h(0)}{j(0)}.$$
Notice that the product $\lambda_\delta E^\delta_0[X(0-)]$, which is used in (1.32), is equal to $\frac{l(0)}{j(0)}$, so that the $(f, g)$ pair is actually not required for solving this fixed point equation.
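Fredholm equations of the second kind such as those above are commonly solved by sampling the unknown function on a grid and replacing the integral by a weighted sum, which leaves a linear system. The sketch below illustrates this idea on a generic scalar equation $u(x) = \phi(x) + \int_a^b K(x, s)u(s)\,ds$; it is not the coupled system for the pairs $(f, g)$, $(h, i)$, $(j, k)$, $(l, m)$, and the function names, the trapezoidal weights and the toy kernel are assumptions chosen for the example.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / m[r][r]
    return x


def solve_fredholm(kernel, forcing, a, b, samples):
    """Discretize u(x) = forcing(x) + int_a^b kernel(x, s) u(s) ds.

    The integral is replaced by a trapezoidal sum over `samples` equally
    spaced nodes, which gives the linear system (I - W K) u = forcing.
    """
    h = (b - a) / (samples - 1)
    nodes = [a + k * h for k in range(samples)]
    weights = [h] * samples
    weights[0] = weights[-1] = h / 2.0
    matrix = [
        [(1.0 if i == j else 0.0) - weights[j] * kernel(nodes[i], nodes[j])
         for j in range(samples)]
        for i in range(samples)
    ]
    rhs = [forcing(x) for x in nodes]
    return nodes, solve_linear(matrix, rhs)


if __name__ == "__main__":
    # Toy problem: u(x) = x + int_0^1 x s u(s) ds, whose exact solution is 1.5 x.
    nodes, values = solve_fredholm(lambda x, s: x * s, lambda x: x, 0.0, 1.0, 51)
    print(values[-1])  # value at x = 1, close to 1.5
```

The accuracy is governed by the number of samples, in the same way that the sampling density controls the numerical error of any such discretization.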

Numerical Evaluation of the Fixed Point

In this section we present the method that we developed to numerically study the fixed point equation satisfied by $\tau$. The main result is a common linear equation describing the integral equations for the different pairs of functions $(f, g)$, $(h, i)$, $(j, k)$, $(l, m)$:
$$\begin{aligned}
A(t) &= \int_t^\tau \beta e^{-\beta(r-t)}\left(U(r) + \int_r^\tau \kappa(s-r)e^{-\kappa\frac{(s-r)^2}{2}}A(s)\,ds + e^{-\kappa\frac{(\tau-r)^2}{2}}\left(p\,B\left(\frac{\tau-r}{2}\right) + (1-p)\,B(\tau-r)\right)\right)dr, \\
B(r) &= V(r) + \int_0^\tau \kappa(r+s)e^{-\kappa\frac{s^2+2sr}{2}}A(s)\,ds + e^{-\kappa\frac{\tau^2+2\tau r}{2}}\left(p\,B\left(\frac{\tau+r}{2}\right) + (1-p)\,B(\tau+r)\right).
\end{aligned} \tag{1.36}$$
Each of the pairs of functions $(f, g)$, $(h, i)$, $(j, k)$, $(l, m)$ satisfies a Fredholm equation of the second kind, where all equations share some common terms. It is shown in [3] that the general form of these equations is as follows: we look for a function $A$, defined on $[0; \tau]$, and a function $B$, defined on $[0; +\infty[$, that satisfy (1.36). In this equation, $\kappa = \mu/R^2$ and the functions $U$ and $V$ are given in the following table for all 4 cases:
$$\begin{array}{llll}
A(t) & B(r) & U(r) & V(r) \\
f(t) & g\left(\frac{r}{R^2}\right) - 1 & 1 & 0 \\
h(t) & i\left(\frac{r}{R^2}\right) & a_\tau(r) & b_\tau(r) \\
j(t) & k\left(\frac{r}{R^2}\right) & \frac{1}{\beta} + a_\tau(r) & b_\tau(r) \\
l(t) & m\left(\frac{r}{R^2}\right) - \frac{r}{R^2} & \frac{a_\tau(r) - \frac{p}{2}c_\tau(r)}{R^2} & \frac{b_\tau(r) - \frac{p}{2}d_\tau(r)}{R^2}
\end{array}$$
with the functions $a_\tau$, $b_\tau$, $c_\tau$, $d_\tau$ defined as:
$$a_\tau(r) = \int_r^\tau e^{-\kappa\frac{(s-r)^2}{2}}\,ds, \qquad b_\tau(r) = \int_0^\tau e^{-\kappa\frac{s^2+2sr}{2}}\,ds,$$
$$c_\tau(r) = (\tau - r)\,e^{-\kappa\frac{(\tau-r)^2}{2}}, \qquad d_\tau(r) = (r + \tau)\,e^{-\kappa\frac{\tau^2+2\tau r}{2}}.$$
Let $(\Gamma(t), \tilde{\Gamma}(r))$ be the solution $(A, B)$ of (1.36) for $(U, V) = (1, 0)$, let $(\Theta(t), \tilde{\Theta}(r))$ denote the solution for $(U, V) = (a_\tau, b_\tau)$, and let $(\Delta(t), \tilde{\Delta}(r))$ be the solution of this equation for $(U, V) = (c_\tau, d_\tau)$. According to the last table, we have $\Gamma(t) = f(t)$, $\Theta(t) = h(t)$ and, as identity (1.36) is linear in $U$ and $V$,
$$\frac{1}{\beta}\Gamma(t) + \Theta(t) = j(t) \quad\text{and}\quad \frac{\Theta(t) - \frac{p}{2}\Delta(t)}{R^2} = l(t).$$
We numerically solve (1.36) in the following way. First, we set $B(r) = 0$ for $r > K\tau$. This is motivated by the fact that, for physical reasons, $B(r)$
has to decrease as $r$ increases, a fact that can be proved mathematically, but we omit the proof here. Second, we approximate the functions $A(t)$ and $B(r)$ on a discrete uniform scale with a density of $M$ samples per interval of length $\tau$. So, the function $A(t)$ is approximated by a vector of $M$ samples and $B(r)$ by a vector of $KM$ samples. We stack both vectors and hence obtain a vector of dimension $(K + 1)M$. Approximating the integrals in (1.36) by weighted sums of the samples of the functions, (1.36) reduces to a matrix identity. Solving this matrix equation involves the inversion of a $(K + 1)M \times (K + 1)M$ matrix. The numerical error introduced in this procedure can be controlled by the choice of the parameters $K$ and $M$ (see [3]).

Finding the Value for τ

As shown above, $\tau$ satisfies the following equation:
$$\frac{pC}{2\tau} + \frac{l(0)}{j(0)} = \frac{1}{R^2}\frac{h(0)}{j(0)}, \quad\text{i.e.}\quad C = \frac{\tau\,\Delta(0)}{\left(\frac{1}{\beta}\Gamma(0) + \Theta(0)\right)R^2}. \tag{1.37}$$
This form is valid both for the Reno and the Tahoe cases, for appropriate definitions of $\Theta$ and $\Gamma$. In Figure 1.10, we have computed the right-hand side of the rightmost equation in (1.37), which does not depend on $C$, as a function of $\tau$ for a fixed setting of the parameters $1/\beta = 2$ s, $1/\mu = 2000$ Pkts, $R = 100$ ms, $p = 0.8$. On this plot, we can see that, if the link capacity is large enough, there is no value of $\tau$ at which this function reaches $C$ (here for $C = 290$ Pkts/s). In this case, the only possible stable regime is congestion-less. For smaller values of the capacity, we observe either two fixed points (e.g. for $C = 270$ Pkts/s) or one (e.g. for $C = 250$ Pkts/s). In the case with two solutions, we have several candidates for a stable regime, with different periods. In the next section, we present a method that distinguishes between solutions that may be the inter-congestion time of a stable regime and other solutions. From Figure 1.10, we can conclude more:
• for all $C$-values above 273.5 Pkts/s (283.3 in the Tahoe case), there are no intersections;
• for $263 < C < 273.5$ Pkts/s ($263 < C < 283.3$ in the Tahoe case), there are two intersections and
• for $C < 263$ Pkts/s, there is only one intersection.

The Case with the Tahoe Version of TCP

Once $\tau$ is given, the rate of the tagged flow is again a regenerative process with the same cycle structure as for the Reno version, namely starting with a congestion epoch when the rate of the tagged flow is 0, and ending when
the rate at a congestion epoch is again 0. Using the same notation as in the Reno case, we now get
$$f(t) = \int_t^\tau \beta e^{-\beta(r-t)}\left(\int_0^{\frac{(\tau-r)^2}{2R^2}} \mu e^{-\mu y}\left(1 + f(r + R\sqrt{2y})\right)dy + e^{-\mu\frac{(\tau-r)^2}{2R^2}}\left(p\,g(0) + (1-p)\,g\left(\frac{\tau-r}{R^2}\right)\right)\right)dr$$
and
$$g(z) = \int_0^{z\tau + \frac{\tau^2}{2R^2}} \mu e^{-\mu y}\left(1 + f\left(R\sqrt{R^2z^2 + 2y} - R^2 z\right)\right)dy + e^{-\mu\left(z\tau + \frac{\tau^2}{2R^2}\right)}\left(p\,g(0) + (1-p)\,g\left(z + \frac{\tau}{R^2}\right)\right).$$
Other pairs of functions $(h, i)$, $(j, k)$ and $(l, m)$ can be analyzed in the same way (as detailed in [3]).

Figure 1.10: The RHS of (1.37) as a function of $\tau$; the fixed points are the intersections of this RHS with the horizontal line $C$, in the Reno and the Tahoe cases.

4.3 The Congestion Regime

The Invariant Measure Equation

Assume there exists a periodic regime of period $\tau$. Then $\tau$ should be a solution of (1.32). Let us consider $(\nu_0, S_0(dz))$, that gives the proportion of
OFF sources and the distribution of rates just after a congestion epoch. The joint distribution of this couple of variables should be invariant w.r.t. the shift that moves from a congestion epoch to the next.

First, $\tau$ and $(\nu_0, S_0(dz))$ should be such that the aggregate rate function $\alpha_0$, obtained when taking $S(dz, 0) = S_0(dz)$, is such that $\alpha_0(\tau) = C$ and $\alpha_0(t) < C$ for all $0 < t < \tau$.

In addition, given that during a congestion epoch a proportion $p$ of the windows are halved, the couple $(\nu_0, S_0(dz))$ should satisfy the integral equation (which is referred to as the invariant measure equation)
$$S_0(dz) = (1 - p)\,S(dz, \tau) + p\,S(d2z, \tau), \tag{1.38}$$
where $S(dz, t)$ is the solution of (1.28), with the initial condition $S(dz, 0)$ taken equal to $S_0(dz)$.

When using the explicit solution of the partial differential equation (1.25), described in detail in [3], one gets that the last integral equation for the distribution $S_0$ can also be seen as a Fredholm type integral equation of the second kind.

In the Tahoe case, the transmission rates of active sources have a measure which must have a point mass at zero at a congestion epoch; the invariant measure equation then reads
$$S_0(dz) = (1 - p)\,S(dz, \tau) + p\,\delta_0(dz)\int_0^\infty S(dv, \tau). \tag{1.39}$$
A few remarks should be made, before addressing numerical issues:
• The existence of a couple $(\nu_0, S_0(dz))$ that solves (1.38) and is such that $\alpha_0(\tau) = C$ and $\alpha_0(t) < C$ for all $t < \tau$ is necessary and sufficient for the existence of a periodic congestion regime of period $\tau$. Using this, it is easy to check that in the region where (1.37) has two fixed points, the rightmost fixed point is spurious. This immediately follows from the fact that the condition $\alpha_0(t) < C$ for all $t < \tau$ is not satisfied for this other fixed point.
• The more general problem of finding all possible periodic regimes is the following: find all pairs, made of a real number $0 < \tau < \infty$ and a couple $(\nu_0, S_0(dz))$, such that (1.38) holds, and such that $\alpha_0(\tau) = C$ and $\alpha_0(t) < C$ for all $t < \tau$. In the Tahoe case, these variables need to satisfy (1.39) instead.
• Of course, other stationary regimes are possible, such as periodic regimes where the aggregate rate has a period that consists of $k > 1$ congestion epochs, or even non periodic regimes (although we did not find such regimes by simulation).

• Injecting the couple $(\nu_0, S_0(dz))$ as an initial condition into (1.27) determines the proportion of active flows and the throughput distribution of active flows $S(dz, t)$ for all $0 \le t < \tau$. The mean stationary throughput, obtained from this function as an average over continuous time, is given by the following cycle mean:
$$\frac{1}{\tau}\int_{t=0}^\tau \int_{z=0}^\infty z\,S(dz, t)\,dt. \tag{1.40}$$

Numerical Solution

We have chosen a numerical procedure to find an approximation for $s(z, t)$ based on identities (1.27) and (1.38). We approximate the function $s(z, t)$ on a discrete scale with $L + 1$ samples over its time domain (an interval of length $\tau$) and with a density of $L$ samples per interval of length $\frac{\tau}{R^2}$ over its space domain (i.e. the $z$ variable). We use $L + 1$ samples in the time domain as there is a crucial difference between the time instant just before a congestion epoch (the $L$-th sample) and the time instant just after (the 0-th sample). We truncate the $s(z, t)$ function in the $z$ direction by putting $s(z, t) = 0$ for $z > K\frac{\tau}{R^2}$. This truncation is motivated by the solution of the interaction-less system, where this function decays like the tail of a Gaussian distribution.

The discrete versions of (1.27) and (1.38) define a matrix equation. In this case (in contrast to the case of solving for $A(t)$ and $B(r)$ in Section 4.2), there are $L^2K$ unknowns and the matrices involved may become large. Therefore, we used (1.27) and (1.38) as a recursive rule to calculate an approximation for $s(z, t)$. The larger $L$ and $K$ are chosen, the better the approximation is. For the examples considered in this section, $K = 5$ and $L = 200$ turned out to be adequate values.

The Multiple Stationary Regime Region

In this section, we give both numerical and simulation evidence that the following condition on the load factor
$$\rho = \frac{1/\mu}{1/\beta + T_{ON}} \le C,$$
namely that the capacity per user is more than the mean load per user, is not a sufficient condition for having an interaction-less mean-field regime with all initial conditions. The numerical part is based on the solution of the set of Fredholm equations already introduced. The simulations are based on the N2N code (footnote 2), a discrete-event simulator which computes the AIMD sharing for a finite number of ON/OFF flows, interacting through the sum of their rates, as described in Section 4.

2 http://www.n2nsoft.com
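The congestion step (1.38) used in the recursive rule above has a simple form when the measure has a density $s$: halving a rate turns $s(z)$ into $2s(2z)$, so the density just after the epoch is $(1-p)\,s(z) + 2p\,s(2z)$. The sketch below applies this map on a uniform grid. It only illustrates this one step of the Reno case; the grid, the test density and the function names are assumptions made for the example, and the free-regime evolution (1.27) between epochs is not included.

```python
import math


def congestion_map(density, p):
    """Density just after a congestion epoch, Reno case of (1.38).

    `density[k]` samples s(z) at z = k * h on a uniform grid. A proportion p
    of the flows halve their rate, which contributes 2 * s(2 z); samples
    beyond the grid are treated as zero.
    """
    n = len(density)
    after = []
    for k in range(n):
        halved = 2.0 * density[2 * k] if 2 * k < n else 0.0
        after.append((1.0 - p) * density[k] + p * halved)
    return after


def moment(density, h, order):
    """Riemann-sum approximation of int z^order s(z) dz on the grid."""
    return sum((k * h) ** order * value for k, value in enumerate(density)) * h


def sample_density(h, n):
    """Illustrative smooth density z * exp(-z^2 / 2) sampled on the grid."""
    return [(k * h) * math.exp(-((k * h) ** 2) / 2.0) for k in range(n)]


if __name__ == "__main__":
    h, p = 0.01, 0.8
    before = sample_density(h, 2001)
    after = congestion_map(before, p)
    # The mass of active flows is preserved; the aggregate rate is scaled by 1 - p/2.
    print(moment(before, h, 0), moment(after, h, 0))
    print(moment(before, h, 1), moment(after, h, 1))
```

In the Tahoe case (1.39) the halved part would instead be collected into a point mass at zero, which a pure density representation like this one cannot hold without an extra variable.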

We also show that there exist values of the parameters such that, depending on the initial rates of the various flows, one may enter either into an interaction-less stationary regime, or into a stationary congestion regime.

In the case considered here, $1/\mu = 2000$ Pkts, $1/\beta = 2$ s, $p = 0.8$ and $R = 0.1$ s. The load factor $\rho$ is then around 263 Pkts/s. We take $C = 270$ Pkts/s.
• When the initial condition is chosen according to the stationary law given in (1.30), then $\alpha(t) = \rho$ for all $t$ and no congestion occurs at all since $\rho < C$.
• As already shown in Section 4.2, the rate conservation principle admits two solutions given by the fixed point equation (1.32), the smallest of which is $\tau \sim 3.7$ s. Using the solution of the invariant measure equation of Section 4.3, we find that for this value of $\tau$, there exists a probability measure satisfying the integral equation (1.38) and satisfying the key condition that the associated $\alpha$ function first reaches $C$ at time $\tau$. The p.d.f. of this distribution, as obtained by two different methods, is depicted in Figure 1.11 for Reno. The existence of such a regime is confirmed by the N2N (footnote 3) simulation of a million HTTP users with the above characteristics and sharing a link of capacity 270 Pkts/s. Moreover, the steady state distributions found by simulation match quite precisely those obtained numerically.

In other words, depending on the initial phases of the flows, one either enters into a congestion-less regime or into a periodic regime with infinitely many congestion epochs. The first case occurs when the initial conditions are chosen independently for all flows, and each flow is in the stationary regime it would reach if there were no interaction at all. The second case occurs if the flows are more in phase: here all start inactive at time 0.

Here are a few remarks of interest:
• The same period and periodic regime are reached when the initial condition is that with all flows initially active and with null rate;
• The largest value of $C$ for which we observe these two possible stationary regimes is approximately 273.5 Pkts/s, as shown independently by the N2N simulator and the fixed point method;
• The second solution of the fixed point equation for $\tau$ happens to be spurious. There exists a probability solution of (1.38) but the associated $\alpha$ function crosses the $C$ level before this value of $\tau$.
• Similar results hold for Tahoe. The associated distributions are given in [3].

3 http://www.n2nsoft.com

Figure 1.11: Probability distribution function for the throughput seen in steady state, immediately after a congestion epoch, immediately before a congestion epoch, and in continuous time: numerical solution of the invariant measure equation (top), N2N simulation of 1 million HTTP flows (bottom). [Plots not reproduced; the axis ticks run from 0 to 900 horizontally and from 0 to 0.003 vertically.]

Dependence of Bi-Stability Region w.r.t. the Parameters

Let $C_T$ be the maximum $C$ for which there is an interaction regime, let $\rho$ be given as in (1.22), and define the over-provisioning ratio (for guaranteeing the absence of interaction) to be $\omega = C_T/\rho$. Here are a few data on this ratio in the exponential case with $p = 0.8$ and $1/\mu = 2000$ Pkts.
• $1/\beta = 2$ s, $R = 0.1$ s: $\omega = 1.04$;
• $1/\beta = 4$ s, $R = 0.1$ s: $\omega = 1.06$;
• $1/\beta = 8$ s, $R = 0.1$ s: $\omega = 1.09$;
• $1/\beta = 2$ s, $R = 0.05$ s: $\omega = 1.06$;
• $1/\beta = 8$ s, $R = 0.05$ s: $\omega = 1.12$;
• $1/\beta = 2$ s, $R = 0.025$ s: $\omega = 1.09$;

• $1/\beta = 8$ s, $R = 0.025$ s: $\omega = 1.15$.

According to these values, the ratio $\omega$ grows when the RTT decreases and when the mean idle time $1/\beta$ increases.

Proof of the Existence of a Turbulent Regime

Let us consider the Tahoe case with an initial condition consisting of all sources active and with 0 rate. The functions $\alpha(t)$ (the aggregate rate defined in (1.29)) and $\gamma(t) = 1 - \nu(t)$ (the proportion of active flows), associated with this initial condition, play a key role in the construction of this section. They are depicted in Figure 1.12 in the case $1/\mu = 2000$, $1/\beta = 2$ and $R = 0.1$.

Figure 1.12: The $\alpha$ (left) and the $\gamma$ (right) functions. [Plots not reproduced; the left panel has ticks from 0 to 350 against 0 to 30, the right panel from 0.7 to 1 against 5 to 30.]

Let $M$ denote the maximum of $\alpha(t)$ over all $t > 0$, $\theta$ the argument of the maximum of $\alpha$, $m$ the minimum of $\alpha(t)$ over all $t > \theta$ and let $\underline{\gamma}$ denote the minimum of $\gamma(t)$ over all $t > 0$. In the particular case of Figure 1.12, we have $M = 301.8$, $\theta = 5.5$, $m = 258.1$ and $\underline{\gamma} = 0.723$. Let $\overline{C} = p\underline{\gamma}M + (1 - p\underline{\gamma})m$.

Lemma 3 For the above initial condition, if $\overline{C} > \rho$, then the Tahoe version of the model experiences an infinite number of congestion epochs for all $C$ in the interval $\rho \le C \le \overline{C}$.

The proof can be found in [3].

This result shows in our example, when $p = 0.8$, that Tahoe exhibits infinitely many congestion epochs as soon as $C \le \overline{C} = 283.38$. Note that this is only a sufficient condition for congestion, namely that $C_T$ may be in general smaller than $\overline{C}$. Under the same assumption, if the initial condition is changed to have all flows initially in the steady state of the interaction-less regime, then the system remains in this regime forever. Hence for these parameters two possible equilibrium regimes may be found.

From our numerical and simulation results, it seems that the bi-stability region for Tahoe is larger than for Reno (see Figure 1.10). We have no result corresponding to Lemma 3 in the Reno case at this stage. The fact that Reno could have a turbulent regime, when the load per user is less than the capacity per user, is hence currently justified by simulation and numerical evidence only.
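For reference, the bound of Lemma 3 can be evaluated directly from the values read on Figure 1.12 with $p = 0.8$; the short computation below only restates that arithmetic.

```latex
\begin{align*}
p\,\underline{\gamma} &= 0.8 \times 0.723 = 0.5784, \\
\overline{C} &= 0.5784 \times 301.8 + (1 - 0.5784) \times 258.1 \\
             &\approx 174.561 + 108.815 \approx 283.38 \ \text{Pkts/s}.
\end{align*}
```

Since $\rho$ is around 263 Pkts/s for these parameters, the condition $\overline{C} > \rho$ of the lemma holds in this example.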


4.4 Application: Rate and Stability

Statistical Properties of the Stationary Rate

We now study more detailed properties of the stationary throughput. Figure 1.13 gives the stationary rate densities, obtained by simulation and numerically in the case C=250 Pkts/sec, p=.4, 1/µ=2200 Pkts, 1/β=2 s., R=.1 s. The fractal and intricate structure of the distribution of the rate at a congestion epoch should not come as a surprise (similar shapes were obtained for long lived sessions in the previous section and in [8]). Compared to the case of Figure 1.11, the irregularities of the distribution are reduced by the smaller value of p. The continuous time stationary rate has a more regular rate density.

Figure 1.13: Stationary distribution densities of transmission rate, obtained by the numerical method: TCP Reno, C=250 Pkts/s., p=.4, 1/µ=2200 Pkts, 1/β=2 s.

Comparison to the PS-Engset Model

How do the AIMD induced dynamic evolutions presented in this section compare with the processor sharing (PS) approximations, proposed in the literature to model TCP bandwidth sharing?

The closest model of large population with PS discipline is the Engset model with N users, where N is large. In this model, the active sessions


generate 1/µ packets, which are queued at a single server processor sharing node serviced with rate CN packets per second. Once served, these sessions move to an infinite server think time node, where they stay for a duration of 1/β seconds. It is shown in [3] that when N tends to infinity, the mean rate obtained by each flow is x = β min(1/µ, C/β). Figure 1.14 below compares this to the expressions obtained from our AIMD model, with and without slow start. In the case without slow start, the rate in the increasing part of the curve of the AIMD model (i.e. the part where no congestion occurs) is obtained from (1.22). As one can check, the match is not so good unless the load is small. Notice that there is actually no reason for these models to be close because, in the PS formula, there is no dependence on the RTT.
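The PS-Engset limit x = β min(1/µ, C/β) can be evaluated directly. The sketch below is only an illustration: the function name is hypothetical, and the parameter values (C = 250 Pkts/s per user, 1/β = 2 s) are borrowed from the figures of this section rather than from the curve of Figure 1.14.

```python
def ps_engset_mean_rate(mean_file_size, mean_think_time, capacity_per_user):
    """Mean per-flow rate x = beta * min(1/mu, C/beta) in the large-N PS-Engset limit.

    mean_file_size is 1/mu (packets), mean_think_time is 1/beta (seconds),
    capacity_per_user is C (packets per second).
    """
    beta = 1.0 / mean_think_time
    return beta * min(mean_file_size, capacity_per_user / beta)


# Sweep the mean file size: the rate grows linearly, then saturates at C.
for size in (200, 500, 2000):
    print(size, ps_engset_mean_rate(size, mean_think_time=2.0, capacity_per_user=250.0))
```

The function takes no RTT argument, which is the point made above: the PS formula cannot depend on the round trip time.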

Figure 1.14: The average rate as predicted by the PS and the AIMD models.

The qualitative properties found in the present study have no analogy in these PS models:
• There are no multiple stationary regimes associated with different initial conditions. Above, we looked at the steady state of the Engset model and then let N (population) go to infinity. We first let time go to infinity (to be in steady state) and we then let N go to infinity. If we had started the Engset model in some transient state (e.g. all


users thinking, rather than all in steady state), then let N go to infinity, and finally let time go to infinity, we would have obtained the same steady state as above. This can be easily seen by a direct analysis of the transient mean-field Engset model.
Note that these multiple regimes appear in the vicinity of critical load, which is precisely a region where PS is not expected to provide an accurate model for TCP bandwidth sharing anyway.
• The rightmost part of the PS curve assumes that the capacity is used entirely, whereas the AIMD dynamic evolution does not. The rightmost part of the AIMD curve has a horizontal asymptote given by

C(1−p/4) (that is here .8×C) as predicted by the long lived flow theory, presented in the previous section. We observe in Figure 1.14 that a sharp decrease of the mean performance (about 15%) takes place at a value of the mean file size which corresponds to an underload regime. This sharp decrease is caused by the transition from the congestion-less regime to the congestion regime described above. This is another qualitative feature (a consequence of the partial synchronization of flows) which cannot be captured in the PS Engset model.

Meta-Stability

The mean-field limit of the stochastic evolution has two stationary regimes for some values of the parameters; this implies two meta-stable regimes for a stochastic system with a finite population, which are observed with the same parameters scaled down with the number of users. The system may oscillate from one regime to the other, staying in each of them for a certain amount of time. This phenomenon (see e.g. [10] for another example pertaining to protocols) is depicted in Figure 1.15, which features the Tahoe case with 1/µ = 2000 Pkts, 1/β = 2 s. and R = 0.1 s.

In Figure 1.15, the number of sources is rather small (1000) and the capacity is approximately the critical value above which the mean-field limit has only one congestion-less regime. The two modes are clearly visible in the trajectories. The fluctuations are high enough to make the system move frequently from one mode to the other.



Figure 1.15: Bi-stability: Aggregated throughput of 1000 Tahoe flows (in packets) seen as a function of time (in seconds) for C = 282 Pkts/s, p = .8.


Bibliography

[1] E. Altman, K. Avrachenkov, and C. Barakat. 2000.
(See reference in Chap. 0).
[2] F. Baccelli and P. Bremaud. 2003.
(See reference in Chap. 3).
[3] F. Baccelli, A. Chaintreau, D. De Vleeschauwer, and D. R. McDonald. A mean-field analysis of short lived interacting TCP flows. In SIGMETRICS 2004: Proceedings of the joint international conference on Measurement and modeling of computer systems, pages 343-354. ACM Press, 2004. (Extended version available as INRIA Research Report 5205 at http://www.inria.fr/rrrt/rr-5205.html)
(This paper proves the existence of HTTP turbulent modes in mean field model of TCP interaction).
[4] F. Baccelli and D. Hong. AIMD, fairness and fractal scaling of TCP traffic. In Proceedings of IEEE INFOCOM, 2002.
(This paper introduces the hybrid AIMD interaction model, together with some of its remarkable properties).
[5] F. Baccelli and D. Hong. Flow level simulation of large IP networks. In Proceedings of IEEE INFOCOM, 2003.
(This paper presents a simulation framework based on the hybrid AIMD model).
[6] F. Baccelli and D. Hong. Interaction of TCP flows as billiards. In Proceedings of IEEE INFOCOM, 2003.
(This paper presents an analytical framework for hybrid AIMD interaction on multiple bottlenecks, several remarkable fractal properties of this system are shown).
[7] F. Baccelli, D. R. McDonald, and J. Reynier. A mean-field model for multiple TCP connections through a buffer implementing RED. Perform. Eval., 49(1-4):77-97, 2002.
(This work studies interacting TCP flows sharing a bottleneck with an active queue management scheme. It presents a mean field limit described by a partial differential equation of rate distribution).
[8] A. Chaintreau and D. De Vleeschauwer. A closed form formula for long-lived TCP connections throughput. Perform. Eval., 49(1-4):57-76, 2002. (Extended version available as INRIA Research Report number 4443 at http://www.inria.fr/rrrt/rr-4443.html)
(This paper establishes the formula for the distribution of instantaneous rate of long lived TCP connections).


[9] C.-S. Chang and Z. Liu. 2004.
(See reference in Chap. 0).
[10] R. Gibbens, P. Hunt, and F. Kelly. Bistability in communications networks. In (Eds. G.R. Grimmett and D.J.A. Welsh) Disorder in Physical Systems, pages 113-127, 1990.
(For an infinite circuits network receiving calls to be routed, this paper proves that the blocking probability, if a random alternative routing is allowed, may take multiple values corresponding to different equilibrium regimes. Metastability is identified for finite networks).
[11] F. Guillemin, P. Robert, and B. Zwart. 2004.
(See reference in Chap. 0).
[12] D. Hong and D. Lebedev. Many TCP user asymptotic analysis of the AIMD model. INRIA Research Report number 3971, June 2001. Available at http://www.inria.fr/rrrt/rr-3971.html
(This work extends the exposition of the hybrid AIMD model, describing result of mean field in the case of dependent packet losses).
[13] A. Kherani and A. Kumar. Stochastic models for throughput analysis of randomly arriving elastic flows in the internet. In Proceedings of IEEE INFOCOM, 2002.
(This paper analyses the impact of TCP congestion period on the bandwidth sharing of a bottleneck with dynamic traffic, in the simplifying assumptions that flows receive instantaneous equal share at all time, and are fully synchronized (they all suffer from a loss in a congestion period)).
[14] A. Kherani and A. Kumar. Closed loop analysis of the bottleneck buffer under adaptive window controlled transfer of HTTP-like traffic. In Proceedings of IEEE INFOCOM, 2003.
(This paper analyses a processor sharing queue with a Poisson process of flows. Flows are supposed to be window controlled, and loss-free. It shows that self similarity appears for heavy tailed distribution of file sizes, but that the buffer content remains bounded).
[15] T. V. Lakshman and U. Madhow. 1997.
(See reference in Chap. 0).
[16] A. Proutière. PhD thesis, 2003.
(See reference in Chap. 0).
[17] S. Shenker, L. Zhang, and D. D. Clark. 1990.
(See reference in Chap. 0).


[18] E.C. Titchmarsh. The Theory of Functions. Oxford Univ. Press, 1932.
(A reference textbook on functional analysis and the convergence of complex series).
[19] D. Widder. The Laplace Transform. Princeton Univ. Press, 1946.
(A textbook which covers the Laplace Stieltjes integral theory of bounded variation functions, that extends continuous functions and allows a very general Laplace Inversion method).
[20] L. Zhang, S. Shenker, and D. D. Clark. Observations on the dynamics of a congestion control algorithm: the effects of two-way traffic. In SIGCOMM '91: Proceedings of the conference on Communications architecture & protocols, pages 133-147, New York, NY, USA, 1991. ACM Press.
(This paper extends the empirical study of [17] to a two-way traffic, exhibiting in particular the effect of acknowledgment compression).


Chapter 2
Scalable Multicasting on Overlay Networks

Un albero ha bisogno di due cose: sostanza sotto terra e bellezza fuori. Sono creature concrete ma spinte da una forza di eleganza.1
Erri de Luca

Can we relax the requirement that a packet transmitted by an IP router can only reach a single destination? Given the redundancy of the content requested on-line today, can a decentralized network like the Internet efficiently make data available to a large set of destinations?

Back in the early nineties, these two problems seemed indistinguishable, and almost resolved; a new generation of IP routers made it possible to carry data over a dynamic tree. But the control of data transport over this routing extension could never be included in a well accepted protocol, leaving the first question dangerously open, and the second one unaddressed.

In this chapter, we claim that the guiding principles behind the successful design of TCP extend to control efficient data transport in large groups. We show this under one condition: that all communication control tasks are performed by end-hosts rather than by network elements.

Guidelines: Carrying data efficiently from a source to more than one destination is generally called multicasting. Deploying this transport function with minimum network support is still uncertain owing to many obstacles, which we present in §1 to motivate our work. Overlay network architectures, described in §2, do not rely at all on network elements, and by contrast have quickly become a dominant carrier of popular content. We follow this direction in §3 and introduce a decentralized control protocol for a data flow in an overlay, to adapt to a finite amount of network resources and memory available. Our analysis, shown in §4, proves that the data receiving rate, and delays between end-hosts, can be guaranteed independently of the overlay size.

1 Tre Cavalli, Milano, Feltrinelli, 1999: A tree demands two things: below ground, matter, and beauty beyond. Concrete creatures, they stem from a force of elegance.


1 15 Years of Multicasting: Lessons Learned

Inside a collection of IP routers under the same control entity, also called an autonomous system, a best effort delivery service to a group of destination addresses can be supported over a distribution tree. In this section, we briefly present the history of multicast communications supported by network elements, and identify what has pushed and what has limited their deployment.

1.1 The Network Can Route Multicast Traffic...

Deploying IP-Multicast Routing

Dating from 1978, Reverse Path Forwarding was proposed to build broadcast trees [21] based on the routing table to the source, together with a reverse flooding method. This was improved in 1990 by Deering to account for trees that do not reach every network host [25], making it possible to route data more efficiently to any group of hosts inside an autonomous system. Note that in this scheme, the dynamic join and leave of a destination in a group is addressed through an update algorithm implemented in the routers by an exchange of control messages. Extensions were soon proposed to optimize the scalability of this approach to inter-domain routes [8], and/or sparse groups [24]. One important challenge addressed by these techniques is to mitigate the costly flood-and-prune exchange of control messages that is usually found in such link state protocols.

None of these architectures were widely deployed and all of them are hardly used today, despite a large interest shown by the research community. It is worth noting that this deployment challenge is twofold:
1 This routing algorithm seems to contradict the end-to-end principle advocated for computer communication; states corresponding to multicast group addresses need to be used and maintained in the routers. As one consequence, this creates disincentives for router operators to make this service available in their network. This goes along with the naturally difficult operation of infrastructure replacement on large networks, and implementation updates.
2 The potential saving of bandwidth achieved by such a scheme is large, motivated by the large redundancy of the popular data transmitted on the Internet. On the other hand, the bandwidth that can potentially be wasted is a critical issue.
The first part could be addressed, at least to some extent, by hardware improvement and careful optimization. The second question seems more subtle. One may argue that current applications using unicast UDP


connections can already waste a significant bandwidth as they do not follow the congestion avoidance behavior. Following this line, if multicast is widely adopted, an important part of the network traffic could be transformed into a more economical bandwidth consumption. On the other hand, it remains to be proved that the bandwidth sharing that works reasonably well today for unicast communication is not disrupted by the addition of multicast traffic, related to groups of arbitrary size and arbitrary volume of data to transfer.

In other words, multicast communications should negotiate their bandwidth with regard to network conditions, to guarantee what TCP does for unicast communication: preventing congestion and gross unfairness from taking place in extended regions of the network.

More Details : A Tale of Two Cities
To illustrate the difficult deployment of IP-multicast, let us describe the current situation: Inside an autonomous system, mechanisms may be deployed to build efficient distribution trees, as we discussed above. Outside, the only available multicast forwarding mechanism deployed at the network level is the multicast backbone (Mbone). It was described previously in [15], and in [27] with more details. This is a network made of tunnels among gateways in different Internet domains, set and maintained manually. The routing among these tunnels is essentially broadcast with a packet Time To Live (TTL) counter decreased by one after passing each gateway: a packet is allowed to use a maximal number of tunnels. Control is done through source based fixed rate, essentially open-loop, and man-made higher level mechanisms including reservation of time slots, with no automatic resolution of conflict.
In spite of this limitation, the Mbone became a visible platform for group communication. On a limited scale, it proved the users' high expectation for multicast applications. It also proved, as reported in [27], the limit of a deployed solution in the network with this technology: in the absence of any control except source fixed rate, multicast sessions take bandwidth by forcing all other TCP connections to back off. Traffic on those links may be severely disturbed by a multicast transmission and the consequences are spread globally for large groups, or as the effect of users' mistakes when setting the parameters (such as the TTL), not to mention malicious attackers.

A Growing Number of Group Applications

Before moving on to describe some of the historical solutions proposed, it is perhaps worth addressing a common skepticism spread in the research community about the need for efficient group communication. Applications that can benefit from multicasting have been used for a long time by Internet users and they are growing: examples are large collective e-mails, Internet radios, cooperative downloads of software source code, real time dissemination of financial data. Until now, none of these widely adopted services has been deployed using a multicast bandwidth saving scheme; the number


of users they can handle today is typically limited by a network resource bottleneck, or a maintenance cost that could and should be improved.

This area is now filled with diverse pragmatic problems, taking the best advantages of possible design choices: dedicated resources with semi-manual tuning, content replication in proxies, epidemic asynchronous dissemination in a peer to peer network. It is rather difficult to describe it well today as a single computer communication issue which can be addressed in a standard specification. In particular the routing, the higher level requirements, and the optimal architecture may differ from one application to another.

We believe that these applications share at least one requirement, that is the object of our contribution: all of them need a simple and efficient congestion avoidance mechanism. It should be compatible with TCP behavior, and ensure a fair share of network resources.

1.2 ... But It Cannot Transport It.

Indeed a best effort multicast delivery is not sufficient for all applications. Reliability and flow control may be needed between the source and its destinations. Congestion avoidance, as seen before, is a requirement.

Rajagopalan et al. introduce in [49] a solution to deploy a reliable multicast delivery at the network level between autonomous systems. In this architecture, each gateway is locally responsible for retransmission over its next hop, to avoid costly retransmission taking place between elements that are far apart in the tree. A sliding window mechanism is implemented per-hop between gateways; it alternates with a global synchronization cycle that takes into account reconfigurations of the tree. Several other multicast functions are introduced, for example to reach some destinations only. It is proved to be robust to failure and to deliver any data to a finite group in finite time. However, there are several reasons that make this architecture not practical:
• For all window sizes, the throughput vanishes as the tree expands. This is caused by the synchronization step, performed regularly before resuming communication, that requires the exchange of control messages over the entire tree. Note that this necessary halt also seems contradictory with the implementation of a window flow control: the interruptions cancel the benefit stemming from the self clocking property (see §1.1 in Chapter 0).
• It exacerbates the need for state and control message exchanges in the network elements, making this architecture difficult to deploy in practice, even for a small group.
Applying to this case the same principle that was used with such success in the design of the unicast TCP protocol, a large body of research focused on implementing the same transport functions end-to-end, rather than per-hop.


This almost always requires mechanisms to be designed from scratch, and compared to their respective unicast TCP implementation. This stems from the topological change of the data forwarding path, from a root to all the leaves in a tree.

In the rest of this section, we discuss the implementations that have been proposed to improve multicasting with the three following functions: providing error recovery, controlling flow rate, and avoiding congestion.

Feedback and Reliability

The primary argument for per-hop recovery is the danger of feedback storms in the reverse path of a multicast tree, as the number of leaves that may generate or answer a request may be large. One natural solution, introduced by Gusella et al. in [31] on an Ethernet LAN, is to use randomized timers in the receivers. The optimal value for these timers in the context of multi round source retransmitting was computed by Danzig in [22]. The same technique was used by Floyd et al. in [28] for deploying error recovery between receivers themselves. They proposed a scheme based on negative acknowledgments, multicast to every member of the group, with a randomized timer in each receiver used for emitting a NACK and for answering one of them. This last feature allows receivers to back off from sending packets when not necessary; it also enables local recovery, as a closer receiver is likely to be contacted sooner and answer a given request. Part of these algorithms were included in the Pragmatic General Multicast Protocol (PGM) that improves IP multicast delivery by randomized timers in the receivers, NACK aggregation in routers, and selective forwarding of repair packets.

The same feedback problem is addressed by Paul et al. in [45], using a hierarchy of designated receivers that process local feedback aggregation and cache packets received for later retransmissions. Note that none of these schemes requires additional support from the network, but that RMTP requires a set of designated receivers to keep the entire data that was sent. A recent study by Radoslavov et al. [48] has shown that a retransmission hierarchy, implemented between end-hosts, exhibits performance comparable to one implemented in the routers.

More Details : Coding for Error-Recovery
Source-coding/receivers-decoding usually provides additional flexibility on the control of a digital communication. It may be possible, using for instance Reed Solomon coding, to generate redundant data sent in addition to the original packets, so that only a sufficiently large subset of them needs to be received in order to deliver the original data. This block erasure property may drastically improve the performance of large multicast groups. This technique, known generally as Forward Error Correction (FEC), was introduced by Metzner in [43] for the context of broadcast.
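The block erasure property can be illustrated with the simplest possible erasure code, a single XOR parity packet, which is used here as a stand-in for Reed Solomon coding and is much weaker: any k of the k+1 transmitted packets suffice to rebuild the k original ones, but only one loss per block can be repaired. All names below are hypothetical and the sketch assumes equal-length packets.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets):
    """Return the data packets followed by one parity packet (XOR of all of them)."""
    return list(packets) + [reduce(xor_bytes, packets)]


def recover(received):
    """Rebuild the data packets from a block in which at most one entry is None."""
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet repairs only one erasure")
    block = list(received)
    if missing:
        # The XOR of all k+1 packets is zero, so the lost one is the XOR of the rest.
        present = [pkt for pkt in block if pkt is not None]
        block[missing[0]] = reduce(xor_bytes, present)
    return block[:-1]  # drop the parity packet


data = [b"pkt0", b"pkt1", b"pkt2"]
block = encode(data)
block[1] = None  # one packet is erased on the way to this receiver
assert recover(block) == data
```

Different receivers may lose different packets of the same block and still all be repaired by the same parity packet, which is what makes this kind of redundancy attractive for large groups.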


It was proved by Nonnenmacher et al. in [44] that an integrated approach using both FEC and negative acknowledgments implemented at the highest network layer is the most efficient for reliable data delivery. This improvement may, however, be impacted by losses happening in bursts.
A key parameter is the slice size, which denotes the amount of data that can be coded together and benefits from the block erasure property. It drives the encoding/decoding time, as well as the efficiency of the protocol.

Flow Control : Multi-Rate ?

Flow control is responsible for delivering data at a rate sustainable for all the receiving applications. How to set this rate is not only a question of constraints but also of design. It may already be satisfactory to send data at the minimal rate that any receiver can achieve from the source. But it may be useful, depending on the group application, to send data faster to some of the receivers. A good example of the latter case is a source sending real time data, with a short expiration date. Dealing with heterogeneity of the receivers means that some of them can receive more data (e.g. improved quality for the case of a real time broadcast). This natural adaptation of the demand to the network capacity is usually called graceful degradation. Note that achieving multiple rates may be useful for other types of traffic as well (e.g. the download of the same document, at different speeds, in a large group with heterogeneous receivers).

Single-rate and multi-rate delivery use one (or several) simultaneous multicast delivery service(s) as an underlying function: the single-rate case, via a single distribution tree; the multi-rate case, via different multicast groups with cumulative transmission rate. In the latter case (see [42] and references therein), this deals with heterogeneous receivers for the same multicast session. Real time data can be encoded using several fixed rate groups, corresponding to incremental quality layers. Similarly, a file may be interleaved in different layers to minimize the download completion time of faster receivers. The source then needs to send only once the maximum number of layers that a receiver can achieve at any given time.

Moreover, this mechanism may be thought to adapt well to TCP bandwidth sharing. Imagine that each receiver knows in advance its bandwidth fair share. It may be able to subscribe to all the corresponding layers, achieving this rate (or a close lower bound) on the path leading toward it from the source. Rubenstein et al. proved in [57] that allowing a multicast session to use multi-rate among receivers in the group leads to a more predictable and consistent fairness definition for a network, as opposed to single-rate.

The protocol described in [42] is based on this idea: receivers attempt to increase their rate (i.e. joining a group), and abort this join if congestion is experienced. A mechanism of shared learning is advocated,


where receivers can infer the performance on their paths and avoid simultaneous attempts, by listening to each other. Drops are not distributed to high layers only, to have every receiver back off in case of severe congestion, and mitigate the effect of ill behaved flows. The key advantage of this approach is that it does not require any additional support from the network's routers.

Benefits from this approach are clear, if receivers are classified in a small number of classes (for instance, some users accessing from the same LAN as the source and some via a long haul slow line) and if their performance can be well predicted. The scheme previously described would certainly make a good choice of the targeted rate.

On the other hand, it is not necessarily obvious that this is enough to avoid network congestion. In a general context with many possible rates, it seems difficult to react to avoid congestion at the time-scale of multicast group leaves and joins. This is what is observed and discussed by Vicisano et al. in [62], who propose an improved mechanism which goes in the direction of TCP highly reactive bandwidth sharing. Shared learning is avoided through synchronization packets emitted by the source, that correspond to indications that the rate used could be increased. These synchronization packets are sent at the end of short packet bursts, so that their possible dropping indicates the capacity of the bottleneck link to every branch of the group. It is clearly an elegant way to solve two problems at once, which may be compared to packet pair techniques [35], proposed for congestion avoidance.

More Details : Coding for Better Efficiency
Rizzo et al. introduce in [54] a protocol based on FEC and slice interleaving to reliably distribute a given file to a large set of receivers. Receivers send requests for a number of additional packets, in case they did not receive enough to complete the decoding of the file. This is implemented with a random timer to avoid a feedback storm. These explicit requests simplify the algorithm used by the source, which does not need to identify packets nor receivers that need to receive packets. In addition, it may be deployed on devices with limited computation and memory capabilities.
This approach was extended with the introduction of Tornado codes, discussed by Byers et al. in [14]. This technique allows larger slice sizes (about 10,000 packets of size 1 kB) to be used for Forward Error Correction, while keeping encoding/decoding time in the order of a second. These codes are incorporated with the layered multicast mechanism developed by Vicisano et al. in [62], to account for rate adaptation, and a specific block interleaving presented by Bhattacharyya et al. in [11], to minimize the duplicates received. Note that the decoding is non deterministic in the case of Tornado codes, as it depends on the packets that are lost on the way: this requires each receiver to first compute the decoding before deciding to leave.
This architecture makes the multicast group communication appear closer to an ideal digital fountain. Different streams of packets are received by each destination, because of packet losses, different layers used, different times for joining and leaving, and possible service interruption, but they are all served efficiently in a way that is transparent to the source. For these reasons, in this ideal case, one can expect a great performance gain, and to relax the


necessary coupling between the source and its receivers.
Implementation limits should be checked for this architecture: there is inefficiency in the coding technique, as a function of the limited encoding time. This is responsible for around 5% (and a maximum of 10%) of additional data transmitted, in the case of a file that fits into a single slice. In the case of larger files (above 10Mb for Tornado code), it is necessary to divide them in blocks, which increases the inefficiency as some duplicate packets may be received while waiting for another block to be completed. In addition, duplicates may be received as the effect of the varying conditions in the network (drop or increase of layers used), just like an interruption of service would damage the interleaving scheme among layers.

Congestion Avoidance

Window flow control, implemented end-to-end with a hierarchical mechanism of feedback aggregation, allows in theory a multicast congestion control scheme to mimic the behavior of the TCP protocol between a source and an arbitrary group of destinations. It is indeed advocated for RMTP in [45]. The sending rate is adapted by the source with a window that is adapted on a clock cycle, where status messages are reported; advertised retransmission requests are used as negative feedback to which the window adaptation should react.

There are however some reasons to believe that the congestion avoidance algorithm used for unicast cannot be applied in the same way to a multicast tree:
• In [12], Bhattacharyya et al. have shown that the packet loss probability is necessarily overestimated when measured between a source and a large number of receiving end-hosts. This leads AIMD congestion avoidance to poor performance.
• Golestani et al. studied in [29] how a naive implementation of TCP on heterogeneous groups of receiving end-hosts may unnecessarily reduce the share of bandwidth that the group is allocated. They propose a receiver driven implementation of TCP, based on a token pool and feedback consolidation to avoid this effect. This may also prove to solve the loss path multiplicity exhibited by Bhattacharyya et al., but none of the results are shown outside deterministic models.
• As we proved in a previous work [16] for window flow control with any bounded window size, the variation of delays caused by congestion reduces the throughput of a multicast group as the inverse of the logarithm of the number of receivers. As a consequence, a window flow control implemented on a tree would need to drastically increase its window with the group size, making it difficult to achieve a fair competition with normal unicast connections.
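The loss path multiplicity effect mentioned in the first item above can be given a rough numerical feel. The sketch below assumes, purely for illustration, that every receiver loses a given packet independently with the same probability; real multicast trees share links, so losses are correlated, and this is not the model analysed in [12]. The function name is hypothetical.

```python
def group_loss_probability(path_loss, receivers):
    """Probability that at least one of `receivers` independent paths loses a packet.

    This is the loss rate a source would observe if it reacted to a loss
    reported by any receiver, under the independence assumption stated above.
    """
    return 1.0 - (1.0 - path_loss) ** receivers


# With a 1% loss rate per path, the loss rate seen by the source grows
# quickly with the group size.
for n in (1, 10, 100, 1000):
    print(n, round(group_loss_probability(0.01, n), 3))
```

An AIMD source that halves its window on every such event therefore backs off far more often than any single unicast connection on the same paths would.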

In this context, it seems hard to adapt the TCP protocol to avoid congestion on an IP distribution tree without asking for more support from the network.

More Details: Directions for IP-Multicast Congestion Avoidance

Another generation of architectures has been proposed, and partially deployed, to cope with the congestion avoidance deployment problem of IP-multicast. They are based on some form of estimation performed by receivers and/or routers, to adjust dynamically the rate or the packets sent by the source.

• In [53], Rizzo proposes a congestion avoidance algorithm built on top of PGM (PGMCC), which implements a window flow control between the source and only one receiver at any time. This representative receiver may change with time with no need for window re-initialization. The key element is then a timely and accurate estimation of the receiver that should control the pace of packet emission at the source. Several heuristic rules are used, including a variation of the square root formula we presented in Chapter 0.
• Similarly, TFMCC [64] exchanges detailed feedback between the source and the current limiting receiver, to adapt the sending rate of the source exactly to the reference square root formula corresponding to the slowest receiver.
The difference is that PGMCC only uses estimation to compare receivers with one another, whereas TFMCC uses this estimation to control the rate according to the throughput equation. As a consequence, the latter may achieve smoother data delivery, at the expense of a larger sensitivity to the estimation technique.
An extension of TFMCC to a multi-rate case was proposed in [36], where receivers estimate the throughput that they can obtain and move between groups to adapt to their fair rate.
• In [34] and [23], the authors show that a fairness criterion may be achieved for multi-rate multicast (together with unicast connections), through a packet marking strategy in routers and receiver-driven rate adaptation. This scheme requires a per-flow state to be maintained in the routers for each multicast flow, which seems reasonable in the context of IP-multicast routing.
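The selection of a limiting receiver described for PGMCC and TFMCC can be sketched as follows. This is only an illustration: the exact formula of Chapter 0 is not restated in this section, so the sketch assumes the common simplified square root form, rate = (1/RTT) * sqrt(3 / (2p)), and the report structure and function names are hypothetical rather than taken from either protocol.

```python
import math
from dataclasses import dataclass


@dataclass
class ReceiverReport:
    """Feedback assumed to be available for each receiver (hypothetical)."""
    name: str
    rtt: float        # round-trip time to the source, in seconds
    loss_rate: float  # estimated loss probability, in (0, 1]


def sqrt_formula_rate(rtt: float, loss_rate: float) -> float:
    """Simplified square root estimate, in packets per second (assumed form)."""
    return math.sqrt(3.0 / (2.0 * loss_rate)) / rtt


def limiting_receiver(reports):
    """Receiver with the smallest estimated rate: only a comparison between
    receivers is needed here (the PGMCC-style use of the estimate)."""
    return min(reports, key=lambda r: sqrt_formula_rate(r.rtt, r.loss_rate))


def source_rate(reports) -> float:
    """Rate set directly from the formula for the limiting receiver
    (the TFMCC-style use of the estimate)."""
    slowest = limiting_receiver(reports)
    return sqrt_formula_rate(slowest.rtt, slowest.loss_rate)


REPORTS = [
    ReceiverReport("r1", rtt=0.05, loss_rate=0.01),
    ReceiverReport("r2", rtt=0.20, loss_rate=0.02),
    ReceiverReport("r3", rtt=0.10, loss_rate=0.001),
]

if __name__ == "__main__":
    print(limiting_receiver(REPORTS).name, source_rate(REPORTS))
```

The first function only needs the ordering of the estimates to be right, while the second depends on their absolute values, which is consistent with the larger sensitivity to the estimation technique noted above.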

Last Thoughts

Several recent works are trying to make IP-multicast incrementally deployable today, at least for small to moderate group sizes. In TCP-SMO, proposed by Liang et al. in [38], acknowledgments are sent separately by each receiver. It is argued that this can be handled by routers and processed by a server for groups containing up to 1000 receivers, as acknowledgments are smaller in size and less frequent than data packets. Receiver-initiated window flow control is implemented, following a variation of [29]. Throughput for moderate group sizes is shown to be better than with multiple unicast connections. Going further in this direction, TCP-XM, presented by Jeacle et al. in [33], proposes a single-rate reliable file transfer to multiple destinations through a unified command: mcp, for multiple copy, reminiscent of the widely used scp.
This mechanism uses the multicast capability of the network whenever possible, while still guaranteeing delivery through unicast connections that are also used for feedback and flow control. The mcp command was initially conceived as a building block for large data transfers among scientific computing environments.

These pragmatic approaches, if deployed, would create an incentive for extending the deployment of multicast among current networks. In return, this may improve the understanding of the multicast control schemes designed in the literature, and increase our confidence in their ability to use networks efficiently, which is difficult to establish today.

To summarize this short overview of network-supported multicast, we would like to highlight the following remarks:

• Multicast functions require support, either from the network elements or from a hierarchical organization among receivers, with additional dedicated resources.
• The exact behavior of reliability and flow control can and should be enforced differently depending on the application.
• Avoiding the congestion caused by a multicast group communication, which is necessary for deployment, is challenging. A solution for handling large groups is not known today.

2 Networking in Overlays

An overlay is defined as a network of computers built on top of another one. The nodes of this overlay network are simply the end-hosts of the underlying network (typically, a set of computers connected to an IP network). An abstract link, also called an overlay hop, is drawn between two overlay nodes to represent a unicast path, made of several underlying links, that connects these two endpoints in the underlying network. An example of an overlay is shown in Figure 2.1: the path from B to D contains three links in the underlying topology, but it is represented by a single hop in the abstract topology of the overlay.

[Figure 2.1: An overlay: underlying topology (left) and abstract topology (right). Nodes A to G; overlay hops A→B, A→G, B→C, B→D, G→E, G→F.]

The key property of an overlay is that many end-hosts may act as relaying points in the abstract topology, as soon as they are connected to at least two others via overlay hops. This makes it possible to build, with little effort, a data dissemination topology running on large groups, using only a minimal service supported by the network (i.e. unicast delivery between two end-hosts). As an example, shown in Figure 2.1, a set of computers can be organized in an overlay distribution tree. In contrast with the difficult deployment of router-based schemes, architectures where new functions are added in end-hosts offer flexibility and easy incremental building.

Application Layer Multicasting

Using overlays bears similarities with the abstraction of a layer, as done to separate the functions of computer networks. Following this line of thought, a service supported by an overlay topology may be described as a function implemented at the application layer. The rationale for moving a
function upwards, closer to the application that uses it, was already given (for example in [58]) as a consequence of the end-to-end argument. This guideline states that a function does not need to be implemented in a lower layer unless this provides a significant performance improvement. We have in fact already seen in §1 several proposals to implement multicast functions (error recovery, flow control, congestion avoidance) with no support from the network, and little support from the source.

Application Layer Multicast follows this direction. It was first proposed in the late nineties (see [20]) as an alternative to the IP-multicast service: receiving end-hosts self-organize in an overlay distribution topology, each of them taking care locally of the addressing and of the necessary packet duplication. In other words, architectures in this class do not rely a priori on any multicast service provided by the network, even if some of them might be able to take advantage of it locally. Today, their development is pushed by the multiplication of end-hosts connected to the Internet with high data-transmission rates, which has already given birth to new classes of peer-to-peer applications.

In the rest of this section, we briefly present the protocols that have been proposed to build an overlay network on top of the Internet. We focus first on measures of their performance, and then on the issue of scalability for large overlays. We then introduce problems remaining unaddressed in the control of multicast transport on a general overlay, which is the object of our present contribution.

2.1 Building Optimal Overlay

Designing protocols to create a satisfactory overlay topology recalls many difficult aspects of Internet routing, with an additional constraint: the topology of an overlay needs to be deployed using unicast paths between end-hosts.

The Essential Part of a Hard Problem

Let us consider a fixed set of end-hosts that may communicate through an overlay topology, such as an overlay distribution tree. The optimal topology for latency may not coincide in general with the optimal topology for capacity. To some extent, this problem is not a new one: similar questions arise for IP-multicast, and even for unicast routing, when multiple paths may be used. However, when the network is created as a hierarchy of domains with a small number of gateways, latency and capacity optimality, for unicast or IP-multicast paths, do not strongly contradict each other. In particular, in a network with bidirectional links that does not contain any cycle, the same topology defines a unique path for each pair of nodes, which is clearly the best in terms of latency and capacity.
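The last remark can be made concrete with a small sketch. It takes an acyclic topology with bidirectional links (for convenience, the six links are named after the hops of Figure 2.1), finds the unique path between two nodes, and computes its latency as a sum and its capacity as a bottleneck minimum. All numerical values are hypothetical and chosen only for illustration.

```python
# Hypothetical (latency in ms, capacity in Mb/s) for each bidirectional link.
LINKS = {
    ("A", "B"): (20.0, 10.0),
    ("A", "G"): (35.0, 4.0),
    ("B", "C"): (10.0, 8.0),
    ("B", "D"): (15.0, 2.0),
    ("G", "E"): (5.0, 6.0),
    ("G", "F"): (25.0, 5.0),
}


def build_adjacency(links):
    """Turn the link table into a symmetric adjacency map."""
    adj = {}
    for (u, v), metrics in links.items():
        adj.setdefault(u, {})[v] = metrics
        adj.setdefault(v, {})[u] = metrics
    return adj


def unique_path(adj, src, dst):
    """Depth-first search; in an acyclic topology the path found is the only one."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


def path_metrics(adj, path):
    """Latency adds up along the path; capacity is set by the slowest link.
    Expects a path with at least two nodes."""
    hops = list(zip(path, path[1:]))
    latency = sum(adj[u][v][0] for u, v in hops)
    capacity = min(adj[u][v][1] for u, v in hops)
    return latency, capacity


if __name__ == "__main__":
    adj = build_adjacency(LINKS)
    path = unique_path(adj, "C", "E")
    print(path, path_metrics(adj, path))
```

Since there is a single candidate path per pair of nodes, no trade-off between the two metrics can arise in this case; the trade-off only appears once several paths are available.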

The topology of an overlay network is, on the other hand, clearly the other extreme: one can possibly use one distinct overlay hop to transport data in the overlay between any arbitrary pair of nodes. Using this topology would result in a maximal number of overlay hops to maintain in each end-host, and it would consume a large amount of bandwidth to make data available between all of them. In order to save bandwidth, an overlay topology that uses paths made of several overlay hops is generally better. This comes with a certain disadvantage, as the latency between two end-hosts in the overlay can grow, and the capacity may be reduced (for cases where the opposite occurs, see the framed paragraph below).

More Details: The Capacity of Paths with Several Overlay Hops

As observed in [19, 37], choosing a path going through two overlay hops may increase the bandwidth available between the first and the last end-hosts.

The first reason that comes to mind is that routing protocols in IP networks usually optimize delays. The existence of a longer but faster route between two end-hosts is then not a contradiction. Note also that the inter-domain routing protocols used may follow different policies, which do not consider any of these measures. This can create, as shown in [68], triangle inequality violations: going through an intermediate end-host might result in an overall decrease of the total latency, compared with the direct IP unicast path between these two points.

In the case of an overlay, another effect can explain this capacity increase. Let us assume here that overlay hops (i.e. unicast connections between end-hosts) implement TCP control. The capacity of an overlay hop is thus defined as the TCP throughput on this path, which decreases with the RTT (see §2 in Chapter 0). Splitting a TCP connection with a large RTT into two smaller ones, going through an intermediate end-host, may in this case improve the overall throughput, compared with a direct connection between the source and the destination. This unexpected consequence of the RTT bias was observed in [19] and considered in [37] to improve overlay performance.

To avoid a prohibitive cost introduced by data being forwarded back and forth between the transport and application layers in the intermediate end-host, a fast kernel modification may be useful (such as TCP Splicing by Maltz et al. in [40], initially designed for fast proxies).

How to define and construct a good distribution tree, when its branching points are network end-hosts, has little in common with the problem of constructing an efficient IP-multicast tree. Under diverse latency and degree constraints, finding the optimal distribution tree was shown in [59] to be a hard problem, even for a centralized algorithm assuming complete knowledge of the performance of network links. Moreover, let us stress two facts that impact the suitability of any practical solution:

• Accurate optimality is not required to have a reasonable distribution tree construction. Indeed, an application built on an overlay is, by design choice, already experiencing moderate to long delays.
• In contrast, a requirement that is more usually met in overlay networks is the need for a quick and efficient reconfiguration. This depends on the needs and support of applications. In general, this is the consequence of the fact that an end-host may fail or leave the communication group more frequently than an IP router.

A Recursive Routing Solution

In theory, it would be possible to apply known distributed routing protocols (distance vector, or link state) in the complete graph that links each pair of overlay end-hosts. What makes this solution impractical is that the amount of control messages and states to maintain grows quickly with the group's size. This is because, as opposed to IP networks, all hosts are possible neighbors in the overlay network.

The solution implemented by the Narada protocol described in [20] is to drop overlay hops in the complete graph, while keeping a connected graph that is referred to as a mesh. This mesh is then used like an underlying topology, where routing is performed according to one of the distributed protocols already introduced on the Internet. More precisely, each host evaluates its candidate neighbors among a finite list with a given heuristic based on latency (bandwidth is included in a new heuristic in [19]), that it shares with its neighbors. A dynamic algorithm lets this architecture change with time to adapt to network conditions, host joins and leaves, as well as network partitions. All of them are addressed through a rather large number of control packet exchanges. It is shown that Narada can accommodate 128 receivers while keeping the delay in the overlay network less than 4 times the direct unicast source-destination delay. This relative metric does not apply well to short unicast paths, where this ratio may become larger, but in this case the overall delay is shown to be small in absolute value as well. Bandwidth seems to be a little more sensitive, but the penalty is shown to be smaller than 30% in the overlay network, for groups of up to 20 receivers. Gossamer, described by Chawathe in [18], is a variation of this structure that improves the amount of control packet exchanges.

Dynamic Overlay Trees

Centralized and distributed algorithms have been proposed to build directly an overlay distribution tree between end-hosts. One solution is to have update messages sent to a center host that takes care of this reconfiguration; this is what is advocated for small groups in the ALMI architecture proposed by Pendarakis et al. in [46]. Another solution is to rely on an adaptive distributed algorithm: the Overcast architecture, presented by Jannotti et al. in [32], includes an algorithm where hosts are pushed deeper in the tree, under the constraint that they do not lose bandwidth; ties between candidate
neighbors are broken in favor of the smaller RTT. Zhang et al. proposed in [65] a similar algorithm that prioritizes smaller RTT, while keeping the same rule to push a host deeper in the tree. It is advocated in [65] that this tree reconfiguration is a good clustering method, as hosts that are close in the IP network are likely to make the same decisions and stay close in the tree.

The performance of these two algorithms is comparable with that of recursive routing algorithms. For groups containing around 200 receivers, the latency penalty stays below 4 or 5 with different measures, and the bandwidth penalty is usually smaller than 30%.

Estimating the impact of reconfiguration for a dynamic overlay network is generally difficult, and has been little studied. Notable exceptions are an extended experiment conducted for Narada in [19], and some simulation results presented in [32]. These results show clear limitations of these protocols for fast turnout distributed applications that involve large and dynamic groups of hosts.

2.2 Towards Large Scale Overlay

None of the schemes we presented above scales beyond a few hundred participating end-hosts. Although this might be sufficient for some applications, new techniques have been proposed to develop overlays of larger scale, primarily for self-organized file sharing applications. Their goal is in particular to address the natural limit of the first generation of peer-to-peer networks, based either on a centralized directory or on a TTL-limited flooding technique.

Even correct addressing is a challenging task in this context. No end-host can handle by itself the complete set of participating end-hosts. As a consequence, new participants joining a session are handled quickly by a bootstrap end-host, whose functions are kept to a minimum.

Addressing and Routing

Since the work of Plaxton et al. [47], it has been known that a large number of objects may, in theory, be stored and accessed efficiently through a distributed addressing technique.

Description: Plaxton-Rajaraman-Richa Algorithm. It is based on the couple of a unique identifier and a label associated with each host. A unidirectional neighboring relation is built between hosts based primarily, but not solely, on label prefixing. Each object is identified uniquely and associated with a dedicated host, called its root, that is independent from the source that may contain it. A request forwarding scheme guarantees that an object is found by any host, if one of its copies is currently available in another host.

The efficiency of this technique is surprising. The memory size that needs to be maintained increases as O(M log N), where M is the number of objects that may be stored in a host and N is the number of hosts. Inserting or deleting an object or a host is shown to have a cost that is, with high probability, either constant in N or bounded by a power of log N.

Moreover, this addressing technique can explicitly account for the cost c(u, v) of a transmission between two hosts, which is supposed symmetric and verifies the following property:

\[
\min\bigl(N,\ \delta \times M(u, r)\bigr) \;\le\; M(u, 2r) \;\le\; \Delta \times M(u, r), \tag{2.1}
\]

where $M(u, r) = \#\{\, v \text{ a host} \mid c(u, v) \le r \,\}$.

Under this condition, it can be shown that the cost to answer the request of an object of length L by host u is controlled as O(f(L) c(u, v)), where v is the host containing this object with a minimal transmission cost to u, and f is a cost factor that depends on the object size.

More Details: Content Distribution Network

• Variations of this algorithm are implemented by Rowstron et al. in [55] and Zhao et al. in [67], between dynamic hosts on a wide area network.
• Stoica et al. [60] proposed a simplified version of Plaxton's algorithm, assuming a constant cost between all hosts, for the robust location of a host storing a particular item. Concurrent host joins and failures are addressed, to bound generally the number of overlay hops that are needed in a query.
• Another approach was proposed by Ratnasamy et al. in [50], where a host is associated with a label and a zone in a d-dimensional torus. The advantage of this approach is to have a set of neighbors per host that does not depend on the total number N of hosts. It is claimed to obtain path lengths from a host to a requested object that are in $O(d\,N^{1/d})$.
• Another overlay structure may be constructed by identifying hosts with points in the plane and defining neighbors as those of a Delaunay triangulation. This approach was introduced by Liebeherr et al. in [39], stressing that this neighboring relation might be computed in a distributed manner, while the number of neighbors per host remains small on average. Moreover, the compass routing, where each next step is chosen as the neighbor with the smallest angle to a given direction, is guaranteed to reach a destination with no loop in this context.

The key performance criterion of a large overlay network is the topological congruence, which is the relation between the distance between two hosts in number of overlay
hops, and the underlying distance between them on the physical network. Because of the hierarchical and fragmented nature of the IP network, this property is also referred to as a congruent clustering.

A guarantee on congruence is indeed formulated by the original work of Plaxton and his coauthors. Interpreting the cost between two points as a latency distance, the distributed addressing technique transports an object over no more than the direct distance between the closest source and the destination, multiplied by a constant. Proving it requires assumption (2.1) to hold for this distance. The results extend easily to a looser guarantee in dynamic overlay networks [55], [67], where the same bound applies if we assume that the system reconfigures quickly enough. Another approach to improve congruence is proposed by Ratnasamy et al. in [50]. They use a binning scheme: zones are allocated to hosts according to a comparison of the distances between the host and a given set of landmarks. This algorithm is evaluated in [51] by simulations. The comparison of the performance of these different approaches remains to be done.

Dissemination in Content Distribution Network

Following the progress of the self-organizing properties of Content Distribution Networks, and their relative success in providing a congruent topology, several schemes have been proposed to implement Application Level Multicast directly on these overlay structures. In fact, multicasting comes as an easy extension of their routing algorithm. Overall, it makes it possible to send multicast data to a large group with a minimum number of states. This improvement also provides designers with a faster reconfiguration of the distribution structure, making the overall architecture more robust to hosts joining and leaving the group.

• Zhao et al. proposed in [69] to build an overlay distribution tree on the overlay maintained by the Tapestry architecture [67]. This is done via an exchange of JOIN and PRUNE messages towards the root. It is reminiscent of the Core Based Tree [8] architecture, proposed to build inter-domain IP-multicast distribution trees. However, as opposed to this technique, the root host needs to maintain a list of the complete set of receivers, and to exchange messages for every host joining or leaving.
• Similarly, Rowstron et al. proposed in [56] to build an overlay distribution tree using the architecture maintained by Pastry. Their implementation is even closer to that of the Core Based Tree protocol, as a root is primarily selected for the group. It assumes that the root of the group receives all messages via unicast, and is responsible for sending them along the overlay distribution tree.
• Ratnasamy et al. [52] proposed to build a dedicated overlay network with the hosts of a multicast group, where flooding is performed according to virtual coordinates, with a mechanism for duplicate avoidance. In particular, it seamlessly supports one or several multicast sources. Choosing to build an entire overlay may be too costly for some applications.
• Overlays relying on Delaunay triangulation are another interesting approach for Application Layer Multicasting, proposed by Liebeherr et al. in [39]. In addition to guaranteeing with high probability a small number of neighbors, the compass routing provides a minimum spanning tree. But the virtual topology is not known to be closely related to any real distance in the physical network.

Multicasting data on large overlays is made possible by any of these techniques. We would like to add two observations:

• None of these multicast architectures that rely on a content distribution network uses the rich semantic structure of this overlay, in terms of object storage and access through labels. The only feature that is needed is a robust neighboring relation, built to make the overlay connected and somewhat compact.
• Congruence is certainly a good sign for an overlay network to multicast traffic efficiently, but it is far from being a guarantee of its performance in this context. Content Distribution Networks, built for distributed databases, have been mostly used to retrieve a small amount of information, such as the IP address of a nearby server containing a file. The overlay topology constructed for these applications is not optimized for transporting data: in most of the architectures used, the actual data is sent directly via unicast.

Layered Clustering

Banerjee et al. proposed a simple layered clustering technique in [9], which may be considered as a relative prefix-based routing focused on locality. In particular, it is agnostic to any virtual coordinate, and implements a hierarchical topology that only accounts for relative distance.

Description: The NICE Protocol [9]. Each host belongs to a cluster in the lowest layer of this hierarchy, where each of these clusters contains between k and 3k − 1 hosts (k being a design parameter). The leader of a cluster is chosen as a host with minimal maximal distance to all others. The set made of the leaders of the clusters belonging to a given level forms the immediate next layer of hosts in the hierarchy, where the same rules apply. Hence, each host belongs to a variable number of clusters (at most O(log N), where N is the number of hosts). Data is forwarded according to the hierarchy: when a host receives a packet, it forwards it to all other hosts that share a cluster with it, except for the cluster shared with the sender of this packet. This structure can be maintained through a limited amount of states and messages to refresh and measure distance (at most O(k log N) for the leader of the highest order cluster, and O(k) on average per host). This routing may be interpreted as a variation of Plaxton's algorithm where neighbors are constantly changing to reflect the close relation between hosts.

This approach was proved to lead to a performance similar to that of Narada, but with far fewer states maintained by each host, fewer control messages, and a smaller cost of reconfiguration in the distribution architecture.

More Details: Large Scale Tree Topology

What would the topology of a large overlay distribution tree look like? It obviously depends on the scheme that was chosen to build it.

The quality of the hops chosen in the overlay: It is directly linked to the congruence between the topology of the overlay network and the underlying physical one. Trees based on prefix routing have provable congruence for asymptotically large overlays; for others (zone neighboring in a d-torus, Delaunay triangulation), congruence is only supported by empirical facts. The congruence of a tree based on a mesh approach, or on a binary dynamical algorithm, may depend on the particular implementation (bounded fan-out degree).

The shape of the overlay tree: A tree based on prefix routing has a maximum fan-out (number of neighbors) that increases as O(log N), just like its length. This remains true for a tree constructed by the layered clustering technique of NICE. In both cases, the average degree would be finite. Delaunay triangulation maintains a bounded average degree as well, but its maximal number of neighbors is unknown (and could grow to N − 1 in particular, and rare, situations). The number of neighboring zones in the splitting algorithm of a d-dimensional torus is not known in advance either. Again, only the average degree can be shown to be constant with N.
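The two rules stated in the description of NICE (leader election by minimal maximal distance, and forwarding to every shared cluster except the one shared with the sender) can be sketched as follows. This is an illustration only, not the protocol of [9]: hosts are placed on a line with hypothetical coordinates, clusters are given statically, and cluster maintenance is ignored.

```python
from collections import deque

# Hypothetical host positions; distance is the absolute difference.
POS = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 10.0, "e": 11.0, "f": 13.0}


def dist(u, v):
    return abs(POS[u] - POS[v])


def choose_leader(cluster):
    """Host whose largest distance to the other members is smallest."""
    return min(
        sorted(cluster),
        key=lambda h: max((dist(h, o) for o in cluster if o != h), default=0.0),
    )


def forward_targets(host, sender, clusters):
    """Hosts sharing a cluster with `host`, skipping any cluster that also
    contains `sender` (sender is None at the source)."""
    targets = set()
    for members in clusters:
        if host in members and sender not in members:
            targets |= members - {host}
    return targets


def disseminate(source, clusters):
    """Count how many copies of one packet each host receives."""
    received = {source: 0}
    queue = deque([(source, None)])
    while queue:
        host, sender = queue.popleft()
        for target in sorted(forward_targets(host, sender, clusters)):
            received[target] = received.get(target, 0) + 1
            queue.append((target, host))
    return received


# Two lowest-layer clusters (k = 2, sizes between k and 3k - 1), and one
# upper-layer cluster made of their leaders.
LOW = [{"a", "b", "c"}, {"d", "e", "f"}]
CLUSTERS = LOW + [{choose_leader(c) for c in LOW}]

if __name__ == "__main__":
    print([choose_leader(c) for c in LOW])
    print(disseminate("a", CLUSTERS))
```

In this example every host other than the source receives exactly one copy: the exclusion of the cluster shared with the sender is what prevents duplicates inside a cluster.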

Epidemic-Style Dissemination on Overlay

In a new category of overlay communication, data is not transmitted between pairs of peers chosen in advance: on the contrary, the neighbors and the destination of each packet may be chosen differently according to a random rule.
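As a rough illustration of such a random rule, the sketch below simulates a basic push gossip: in each round, every host that already holds the packet sends it to a few hosts drawn uniformly at random. It is a generic textbook scheme given under simplifying assumptions (synchronous rounds, full membership knowledge, no loss), not one of the specific protocols discussed in this section; the function name and parameters are hypothetical.

```python
import random


def push_gossip(hosts, source, rng, fanout=1, max_rounds=200):
    """Return the number of informed hosts after each round.

    `hosts` is a list, `rng` a random.Random instance. A host may draw
    itself or an already informed host, in which case the copy is wasted.
    """
    informed = {source}
    history = [len(informed)]
    for _ in range(max_rounds):
        if len(informed) == len(hosts):
            break
        newly = set()
        for _host in informed:
            for target in rng.sample(hosts, fanout):
                newly.add(target)
        informed |= newly
        history.append(len(informed))
    return history


if __name__ == "__main__":
    hosts = [f"h{i}" for i in range(50)]
    print(push_gossip(hosts, "h0", random.Random(1)))
```

The number of informed hosts never decreases and roughly doubles in the early rounds, while the last few hosts take longer to reach; no host needs to know a distribution tree in advance.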

The efficiency of randomized algorithms, based on an epidemic principle for distributed computing, was exhibited in [26]. The problem addressed by Demers et al. in this article is to guarantee the consistency of updates in a large distributed database, where copies of objects are replicated in several locations. In the context of IP-multicast, a similar approach was introduced by Birman et al. to guarantee reliability. In bimodal multicast, proposed in [13], all hosts alternate between a usual multicast transmission from the source, and a period of repair, in which random gossiping is performed to receive packets that were missing.

Let us describe some of the proposed schemes based on this approach:

• Gupta et al. designed in [30] an epidemic protocol aimed at large scale dissemination. It is based on two ingredients: a hierarchical organization to reduce the amount of randomly created messages that span the entire network, and an adaptive dissemination scheme deployed in this hierarchy.
• Zhang et al. proposed in [66] a data-driven approach to application layer multicast for real-time video applications. Their scheme is based on video data bulks of a given size, and a request scheduler with given deadlines. It maintains a set of possible providers for each bulk, which are contacted simultaneously to retrieve data.

2.3 Transporting Data on an Overlay

Once an overlay topology has been constructed to communicate efficiently, in a set of end-hosts which is possibly a very large group, can we control the data it transports in a decentralized way? This seems necessary to consistently maintain a reliable delivery, and the adaptation to the network and to the memory buffer available in end-hosts.

Generally speaking, all overlay architectures benefit from the underlying unicast protocol deployed on the network. As an example, guaranteeing a conservative usage of network resources is made simpler if, for instance, TCP is deployed locally on each overlay hop, between each pair of connected end-hosts. As is usually the case in computer networking, this benefit does not come alone, as other features of these protocols can be extremely useful.

More Details: Is a TCP Overlay TCP Fair?

Consider an overlay implementing one-to-many communication from a source, with TCP between each pair of its end-hosts. The network resources (routers and links) that it involves are not exactly given by the maximum unicast equivalent from this source to each destination.

First, several TCP connections used in this overlay might share a link, claiming overall more bandwidth for this session than a corresponding unicast flow. This is true in particular
on access links for a relaying end-host in the overlay. Second, several studies have shown that using an intermediate end-host, relaying in a two-hop overlay path, could improve the overall throughput between a sender and a receiver end-host, because of the RTT bias (see the framed paragraph "The Capacity of Paths with Several Overlay Hops" in §2.1). This provides the two end-hosts with a larger bandwidth than what they would be allowed in a direct unicast connection.

However, the following points support the deployment of a TCP overlay:

• Effects of this overlay unfairness may already be used by today's network hosts to increase their bandwidth. In fact, the multiplication of TCP connections is common in peer-to-peer networks, and is usually limited by the fact that TCP performs badly on a congested access link. These effects are in all cases significantly less dangerous than an IP-multicast congestion collapse.
• An overlay may use more bandwidth than a single unicast communication, but this may be regarded as a compensation for the number of final destinations that are served through this data exchange. As a multicast flow serves a large group of people, a limited amount of extra bandwidth may be easier to accept for this flow. This was already advocated by Wang et al. in [63], and most of the time it can be achieved locally through a bounded degree overlay.

Assumptions: In this section and the rest of this chapter, we focus on the single source case, assuming a fixed overlay distribution tree, built on the overlay topology to support a one-to-many communication from the source. The forwarding inside this tree is deterministic (as opposed to epidemic-style dissemination protocols, which make consistent delivery more challenging).

Two conditions (unreliable delivery, or infinite memory available in all end-hosts) are already favorable to implement this control. We describe these two cases first for illustration, presenting later the more general problem.

Unreliable Delivery

Let us consider the case of a multicast communication where the data is distributed in cumulative layers, which can be dropped selectively if needed. We already met applications with such graceful degradation in §1.2. Deploying such multicast applications in an overlay makes their control simpler: as multicast delivery functions (replication, addressing) are deployed directly in the end-hosts, many other functions can be added per hop to control the communication. By contrast, in the context of IP-multicast, this would involve modifications in the network routers.

In the context of unreliable data delivery, prevention of congestion in network routers and links is achieved by implementing TCP between each pair of neighboring end-hosts in the overlay. Note that for any case in which another approach would prove to be better, it is possible to use another transport protocol. The flow control then only needs to be implemented inside each end-host, when the data is read and replicated to feed several different TCP connections that correspond to different overlay hops. Assuming a reasonable buffer size to link upstream and downstream connections, and dropping

packets from higher layers in priority when needed at each intermediary step, we can guarantee to deliver to each host in the overlay the best quality that it can achieve through this distribution tree. This approach was used by Chu et al. in [19], for a distribution tree built with Narada on a wide area network. It is shown in this article that this approach is not as sensitive to group size as IP-multicast solutions, when the group remains moderately small.

The case of a dynamic join-and-leave tree is more challenging, as communication from the source may be stopped, in entire subtrees, during a large period of the tree reconstruction. A probabilistic scheme has been proposed by Banerjee et al. [10] to improve the resilience of a distribution architecture. Redundancy is introduced randomly in the forwarding strategy between end-hosts, such that large disconnected subtrees continue to receive the data with high probability through one of these redundant links. A NACK approach inside the overlay makes sure that data received by any end-host reaches all end-hosts in its connected component. This scheme does not guarantee reliability, but a heuristic argument given in [10] indicates that a large majority of the data could be delivered to a large majority of the end-hosts remaining active.

A similar technique, which resembles an epidemic algorithm, is used by Zhang et al. in the Coolstreaming architecture, described in [66], which was in 2005 one of the most popular programs to share real-time TV content. Again, the proposed solution is made possible only by the per-hop implementation of forwarding in an overlay.

Reliable Transport with Permanent Storage

As another example of a communication that is made simple via an overlay network, let us consider the following case: a group of hosts is willing to download a given file that they can keep in memory entirely, at least for all the time necessary to the communication. 
It may be because they want to store it for their own use, or because they are dedicated servers participating in a cooperative application (as in [32], [17]) with any amount of memory needed. We assume in addition that they do not leave or crash before the download completion of all their remaining children.

We see easily that end-to-end reliable transfer is easy to deploy and control, using TCP between each pair of hosts. Packets lost in the network are automatically recovered. Any packet received can be stored until it is allowed to be sent later on the next TCP connections. Such a scheme offers reliability and flexibility, at the cost of using any possible amount of memory available everywhere.

This approach was first advocated in [32], where it is shown that it guarantees data completeness as well. In case of failure of an internal host, the tree construction reconfigures the first disconnected host, and resumes the communication from where it was left. The same approach may be used in the case of a Content Distribution Network, and is done for a static overlay case in [39]. It is also considered in [17] to deliver data to proxies with infinite buffer space.

More Details: A Digital Fountain Approach to Overlay Multicasting

In [37], Kwon et al. argue that through the introduction of Tornado encoding/decoding between the source and its receivers, a reliable file transfer may be achieved on overlay networks, even in the case of finite buffers. They advocate using TCP between each pair of end-hosts, for congestion avoidance, but dropping packets in an end-host when one of its buffers is full. As packets are encoded, it is possible for some hosts to complete their transmission later on, which also preserves the reliability between the source and any receiver.

This is a rather elegant way of extending the digital fountain vision to an overlay network. It adapts to the heterogeneity of the receivers, and possibly to their migration in the tree, in case of joins and leaves.
• In practice, as opposed to an ideal digital fountain, the total number of symbols that the source can produce without duplication is bounded by a finite stretch factor, which is linearly linked to the encoding time. If the download time of a receiver exceeds the time for the source to send a complete set of its symbols, duplicates may be received, possibly many.
• Long files would also need to be cut into different blocks, and the effect of additional duplicates holds as in IP-multicast, as receivers are waiting for the last block to complete. Previous interleaving techniques introduced to cope with this problem in IP-multicast (block interleaving, random order) have not been evaluated in the context of forwarding on an overlay. In this new case, losses occurring in bursts because of end-host overflow, or migrations in the tree, may be significantly more dominant.

More Details: Hybrid Unicast-Multicast

Reconciling the relative efficiency of IP-multicast with the pervasiveness of overlay architectures was proposed recently for application layer multicast. All of these proposals treat IP-multicast groups as local building blocks, under the control of single entities organized in an overlay.

Chawathe et al. presented in [17] an architecture to enable this service, based on an overlay of proxy servers that serve local delivery groups. Their proposition is a little more general, as they advocate an application-level semantic that may be used by these proxies to adapt their communication to network conditions. Proxies may use graceful degradation or keep in memory all the data received to adapt to their receiver connections.

Zhang et al. give similar arguments in [65] to implement an overlay of hosts serving local group delivery through IP-multicast. They emphasize overlay networks as a deployment methodology for multicast, in a way that extends the current Mbone.

Liang et al. focus in [38] on the best use of IP-multicast where it is currently deployed. They propose to implement TCP control through a multicast channel and unicast feedback for small and moderate groups. They conclude that the size of the group can increase if a hierarchical overlay architecture is built, where each end-host serves a small number of others through a TCP-SMO controlled small group.


Reliable Store-and-Forward

Relying on an infinite memory buffer available in all end-hosts to support the communication makes the control of multicast overlay transport simple and convenient. The question that we introduce in this section, and that is addressed later on in this chapter, is: what happens if we get rid of this assumption?

For finite buffers, additional care should be taken inside the application to implement a minimum level of synchronization between flows. Examples of such implementations are described in [49], where a synchronization step is performed periodically after a fixed number of packets. A similar approach is advocated in [46], for a group of cooperating processes. An application-level acknowledgment mechanism is proposed, which makes sure that a packet has been received before allowing its deletion in all hosts. The size of this application window control is given by the smallest buffer available in a host. As it deals with all receivers, this flow control, proposed in this article for small groups, cannot extend to large ones: the same negative result proved in [16] for IP-multicast holds. The throughput experienced by users decreases to zero as the set of participating end-hosts becomes large.

This problem is identified as well in the context of an overlay of proxies to implement reliable multicast in [17]. It is stated that proxies need to be able to limit the sending rate from the source, to avoid dropping packets themselves, if they cannot serve on time the data they received.

We propose an algorithm to deploy Store-and-Forward with finite buffers in the context of an overlay network built on TCP. It was proposed independently from us by Urvoy-Keller et al. in [61]. It specifies only one additional rule that needs to be enforced in every internal host of the overlay network (see §3): a packet does not leave the buffer dedicated to its incoming TCP connection unless there are sufficient memory bits available in all buffers corresponding to its outgoing TCP connections. This back-pressure rule, which resembles the output queue blocking of a network switch, makes sure that the source adapts to prevent any buffer overflow. It is simple to implement, quite coherent with the rest of the TCP design, and does not need any additional message exchange.

This could be thought of as a per-hop overlay adaptation, which adapts to the congestion in end-hosts in the same way as TCP adapts to congestion in the network. Equivalently, it may be described as an end-host implementation of the scheme proposed in [49], with the difference that synchronization between flows always remains local and decentralized.

2.4 Our Contribution

In the following sections of this chapter, we analyze the performance of large overlay networks implementing TCP with back-pressure. They may be seen


as another interacting system of flows. The difference with the previous interacting problem, analyzed in Chapter 1, is that they do not generally compete for a given bandwidth. On the contrary, they cooperate in a synchronized manner, to control data transport consistently across different parts of the overlay. One characteristic feature remains the same: all flows adapt their own behavior to that of other flows in the system, not perfectly but according to a decentralized mechanism.

1 As our first important result, we show that such a control may be deployed on a network of any scale with finite local buffer memory. In contrast with some intuition, a positive data throughput can be guaranteed independently of the overlay size. This is at odds with almost all the conjectures we found about this subject.

We then come back to the case of unlimited buffer available for each host. The latency of data transit from the source to a host, arbitrarily far away, has rarely been studied, nor has the number of packets remaining in an intermediary end-host. For a large communication group, these may be directly linked to the speed of data dissemination, and to the memory needed to support it. For a proxy that serves many transient files simultaneously, this amount of memory can be critical. It is in general a function of the rate of the source, the TCP control evolution, including the loss rate, and the distribution of delay observed on a link by a packet. We make two contributions:

2 We establish precisely the condition for an asymptotically large overlay to remain stable, which guarantees that the number of data waiting to be served, for a flow of any size, remains bounded.

3 We characterize the speed of data flow, together with a limit (in the Cesàro sense) of the average buffer occupancy. They are both described as the transform of a hydrodynamic limit.


3 The one-to-many TCP Overlay

We present in this section a new control protocol for overlay multicast data flows. It may be seen as an extension, for one-to-many communication, of the mechanisms deployed by TCP on a unicast path, which were presented in detail in Chapter 0. The implementation of this control is end-to-end and packet-driven. The data flow reacts via each packet to the conditions in the network, and, via a back-pressure rule, to the amount of memory available in the overlay.

We then prove that features of this architecture need to be captured in a new class of distributed dynamical discrete-event systems, defined by an invariant recurrence relation. This motivates the study of these objects made in Chapter 3, which is instrumental for the scalability properties that we prove later.

3.1 Designing an Adaptive Data Transport on Overlays

We consider a single source and a fixed overlay distribution tree. Our architecture implements TCP on each overlay hop, together with a back-pressure rule, similarly to the mechanism described in [61], which was developed independently from us.

Definitions, Notation

In the study of overlay networks, several levels of organization interfere. It is therefore important to choose different words to refer to a notion that may be found at different levels:
• IP network: we refer to an element of the physical IP network as a router. These routers are connected to their neighbors by links.
• Overlay: end-hosts that are participating in the overlay network are connected by a unicast path in the IP network. While this path may traverse several routers in the physical network (see Figure 2.1), we call it an overlay hop.
An internal host, that is, one interior to the overlay tree, receives, stores, and forwards the data. The root of the overlay distribution tree is called the source, and the other non-internal hosts, at the extremities of the tree, are called leaf hosts.
• Each host receives packets from its ancestor in the tree, which we call its mother host. It duplicates them and sends a copy to each of its children, also called daughter hosts, as seen in Figure 2.1.

Overlay Topology: We consider several tree topologies, for which we introduce the following generic notation. We number hosts by a pair (k, l) designating their location in the overlay multicast tree. The first index k gives


their distance to the root of the tree (or level). The second index l numbers the hosts within the same level. For the case of a complete binary tree, the hosts at the same distance k from the root are labeled by numbers $l = 0, \ldots, 2^k - 1$. An example of a complete binary tree with height equal to 2 is shown in Figure 2.1 (right). The mother host of host (k, l) is denoted $(k-1, m(k,l))$. The daughter hosts of host (k, l) are labeled $(k+1, l')$ with $l' \in d(k,l)$. For a complete binary tree, $m(k,l) = \lfloor l/2 \rfloor$ and $d(k,l) = \{2l, 2l+1\}$. Figure 2.2 shows this notation applied to the overlay example already shown in Figure 2.1.

[Figure: the overlay tree of Figure 2.1, with hosts A to G labeled (0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (2, 3), and overlay hops A→B, A→G, B→C, B→D, G→E, G→F.]
Figure 2.2: Notation of end-hosts inside an overlay

Overlay hops: The point-to-point communication between a mother host and a daughter host is carried out by TCP. We shall assume that in all TCP connections, Fast Retransmit Fast Recovery [1] is implemented. We consider both situations with and without Selective Acknowledgment (SACK). It is convenient to present our analysis first for the case of packets marked with Explicit Congestion Notification (ECN) (see §3.2); this case is covered as well.

Buffering in hosts: On each host (except for the root host), there is an input buffer, corresponding to the receiver window of the upstream TCP connection, and, except for the leaf hosts, there are several output buffers, also referred to as forwarding buffers, one for each downstream TCP connection. Figure 2.3 illustrates these buffering mechanisms. Throughout this chapter we shall assume that all these buffers have finite sizes $B_{\mathrm{IN}}$, $B_{\mathrm{OUT}}$ (for input buffer and output buffer).

For host (k, l), we denote by $B^{(k,l)}_{\mathrm{IN}}$ the size of its input buffer. It is measured in packets; we assume that all packets have the same size for the sake of simple exposition, although this is not an essential assumption. We denote by $B^{(k,l)}_{\mathrm{OUT},(k',l')}$ the size of the output buffer of host (k, l), in the socket associated with the TCP connection to the daughter host (k′, l′).
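As an illustration of the (k, l) numbering introduced above for a complete binary tree, the following Python sketch computes the hosts of a level, the mother host, and the daughter hosts. The function names `hosts_at_level`, `mother`, and `daughters`, and the `height` parameter, are assumptions made for this example only; they do not appear in the original text.

```python
def hosts_at_level(k):
    """Hosts at distance k from the root: l = 0, ..., 2^k - 1."""
    return [(k, l) for l in range(2 ** k)]


def mother(k, l):
    """Mother host (k - 1, m(k, l)), with m(k, l) = floor(l / 2)."""
    if k == 0:
        raise ValueError("the root host (0, 0) has no mother host")
    return (k - 1, l // 2)


def daughters(k, l, height):
    """Daughter hosts (k + 1, l') with l' in d(k, l) = {2l, 2l + 1}.

    Hosts at level `height` are leaf hosts and have no daughters.
    """
    if k >= height:
        return []
    return [(k + 1, 2 * l), (k + 1, 2 * l + 1)]
```

For a tree of height 2, `daughters(1, 1, 2)` lists the hosts (2, 2) and (2, 3), and `mother` maps both of them back to (1, 1).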


Reliable Transfer and Forwarding via Back-Pressure

There can be three different types of packet losses in the overlay multicast: (1) losses that occur in the network path between two hosts; losses caused by overflow in (2) the input buffer and (3) the output buffer. The first type of loss is recovered by the TCP acknowledgment and retransmission mechanisms.

The second type of loss does not occur, thanks to the flow control implemented by TCP (already seen in §1.1 p.2 of Chapter 0). Indeed, the available space in the input buffer at the receiver host is advertised to the sender through the acknowledgments of TCP. In addition, when the available input buffer space differs from the last advertised size by two Maximum Segment Sizes (MSS) or more, which can occur when packets are copied to the output buffers, the receiver sends a notification to the sender via a special packet.

The last type of loss is avoided by back-pressure. A packet is removed from the input buffer only when it has been copied into all output buffers. This copying process is blocked when an output buffer is full and is resumed once there is room for at least one packet in all output buffers. Thus, because of this blocking back-pressure mechanism, there is no overflow at the output buffers. In order to help the reader identify the mechanisms used in end-hosts, we have depicted (in Figure 2.3) a zoom of an end-host and of the information exchanged with its mother and daughter hosts.
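The back-pressure rule described above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: the class name `RelayHost`, its methods, and the use of a single capacity shared by all output buffers are hypothetical, and TCP itself (acknowledgments, windows, retransmissions) is reduced to an explicit `acknowledge` call.

```python
from collections import deque


class RelayHost:
    """Internal host with one input buffer and one output buffer per daughter."""

    def __init__(self, n_daughters, out_capacity):
        self.input = deque()
        self.outputs = [deque() for _ in range(n_daughters)]
        self.out_capacity = out_capacity  # assumed identical for all output buffers

    def receive(self, packet):
        """A packet arrives from the mother host."""
        self.input.append(packet)

    def forward(self):
        """Copy packets to ALL output buffers while every one of them has room.

        Returns the number of packets moved; copying blocks as soon as a
        single output buffer is full, so no output buffer can overflow.
        """
        moved = 0
        while self.input and all(len(q) < self.out_capacity for q in self.outputs):
            packet = self.input.popleft()
            for q in self.outputs:
                q.append(packet)
            moved += 1
        return moved

    def acknowledge(self, daughter):
        """The given daughter acknowledged its oldest packet: free one slot."""
        return self.outputs[daughter].popleft()
```

In this sketch a slow daughter stalls the copying for all daughters: after `acknowledge(0)` alone, `forward()` still moves nothing if the output buffer of daughter 1 is full.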

[Figure: host (1, 0) with its input buffer (size $B^{(1,0)}_{\mathrm{IN}}$), its back-up buffer, and its two output buffers (sizes $B^{(1,0)}_{\mathrm{OUT},(2,0)}$ and $B^{(1,0)}_{\mathrm{OUT},(2,1)}$); the bars are the events in which a packet is received, forwarded, and acknowledged, and the edges are labeled with lags such as 0 and the windows $(W^{(2,0)}_m)_{m \ge 1}$, $(W^{(2,1)}_m)_{m \ge 1}$.]
Figure 2.3: Description of host with index (k, l) = (1, 0).

In this figure, each bar represents a sequence of events of a certain type (for instance the sequence of packet departure times from a buffer); the labels on the edges represent the lag in the packet sequence number between the events connected by this edge. Let us interpret the figure: for a bar with several incoming edges, the m-th event takes place as soon as, for every upstream bar, the event of order m minus the lag has taken place. For instance, packet m leaves the input buffer at the latest of the following three events: packet m has arrived in the input buffer; packet $m - B^{(1,0)}_{\mathrm{OUT},(2,0)}$ has been acknowledged; packet $m - B^{(1,0)}_{\mathrm{OUT},(2,1)}$ has been acknowledged.

Flow control, congestion window control, and back-pressure guarantee that there are no losses in the overlay, even if all hosts have finite-size buffers. However, these mechanisms also reduce the throughput of the group communication. It is crucial to understand this scalability issue, namely to check whether the throughput would vanish when the group size grows.

Retransmissions and Re-Sequencing

Another factor could significantly impact the throughput of the group communication: the packet re-sequencing delay caused by losses. In case of a packet loss along a path, TCP eventually retransmits the packet. However, some packets with larger sequence numbers arrive before the retransmission (of the lost packet) arrives. These packets are not copied out to forwarding buffers until the retransmitted copy arrives. Such a delay in packet processing has negligible impact on the throughput of the TCP connection experiencing the loss, owing to the window inflation implemented by Fast Retransmit Fast Recovery (see the framed paragraph on p.8). However, it has an impact on the downstream TCP connections, as it creates an interruption in packet arrivals. Such perturbations may cause significant performance degradation in these downstream TCP connections and exhibit ripple effects in the corresponding subtrees. In turn, because of the back-pressure mechanisms, these performance degradations impact the source sending rate, and therefore the group communication throughput.

More Details: Guaranteeing Data Completeness with Backup Buffers

With an overlay multicast scheme, an important issue to address is resiliency, i.e., handling host failures and/or departures (possibly without prior notice). This is not the main focus of this document, but this architecture might be extended to account for some local partial failures.

Failure of an end-host participating in the overlay is discovered via a missing heartbeat message, sent regularly between neighbors via UDP datagrams. 
Once a failure is detected, the tree needs to be reconfigured in such a way that the daughter hosts of the failed host, as well as the subtrees they are the roots of, are re-attached to the original tree. A new TCP connection is established for each re-attachment. There is a variety of ways to reconfigure the tree. Some algorithms were presented and evaluated in this context in [4]. In order to achieve end-to-end reliability, we need to ensure that the data received is complete after hosts are re-attached. In other words, we need to make sure that their new mother hosts have data that is old enough for them not to miss parts of the sequence that were already exchanged. Avoiding this interruption in the packet sequence may be more difficult for a host distant from the root, since the packets that it had received, at the time of failure, may have been already processed and discarded by other group members, except for the failed host.

This could be efficiently achieved with a limited amount of supplementary memory. One can implement a backup buffer in the end-hosts to create copies of the data exchanged, while it is moved from the input buffer to the output buffers. Whenever a new TCP connection is established, the packets in the backup buffer of the sender are sent out first.

In [4] we show that if the size of the backup buffer is large enough compared to those of the input and output buffers, then end-to-end reliability can be guaranteed with different reconnection strategies, as long as the number of simultaneous failures remains under a given bound.

The processing of packets inside the overlay network according to these mechanisms can be described as a distributed discrete-event system. We present our model under different but related formalisms:
• (max,+) recursions defined by the relation between the processes of events; this formalism turns out to be the most efficient for simulating the dynamics of this class of networks, as done in the next section.
• Last-passage percolation in pattern-invariant graphs. This seems to be the right formalism to adequately represent all the required mechanisms in a compact way. It is in particular useful to account for the effect of packet losses and re-sequencing. It is instrumental later in the mathematical proof.

We first present the model in the ECN packet marking case. Packet losses, retransmission and re-sequencing can be included in these frameworks, but it is more complex; we present this case as a second step in §3.3. 
The elaboration of these representations, within the context of this non-acyclic framework with losses and re-sequencing, is one of the main technical achievements of this work.

3.2 A Distributed Discrete-Event System

The Model for the ECN Packet Marking Case

We model routers as single-server queues with FIFO discipline and infinite buffer (losses are taken into account as an exogenous process, which controls the window evolution). The packets of the multicast flow compete with those of other flows for the resources of each router and link. This competition creates additional queuing delays between the service of packets belonging to the reference multicast flow. In these queues, we only consider the packets of the multicast flow, but we expand their processing times to random aggregated service times. The values taken by this variable take into account the effect of cross traffic (for more details, see §2.3 p.16 in Chapter 0).
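To make the router model concrete, here is a small Python sketch of a single-server FIFO queue with infinite buffer. It uses the standard recursion for such a queue: a packet starts service at the later of its own arrival and the departure of the previous packet, and leaves after its service time. The function name `fifo_departures` and the numerical values are assumptions chosen for illustration; in the model of this section the service times would be the random aggregated service times.

```python
def fifo_departures(arrivals, service_times):
    """Departure times of a single-server FIFO queue with infinite buffer.

    arrivals[m] is the arrival time of packet m (non-decreasing), and
    service_times[m] is its (aggregated) service time.
    """
    departures = []
    previous = float("-inf")  # the queue is initially empty
    for arrival, service in zip(arrivals, service_times):
        # Service starts once the packet has arrived AND the server is free.
        previous = max(arrival, previous) + service
        departures.append(previous)
    return departures


# Two packets arrive together at time 0, a third one at time 5.
print(fifo_departures([0, 0, 5], [1, 2, 1]))  # [1, 3, 6]
```

A larger aggregated service time delays every later packet that is queued behind it, which is how cross traffic slows down the reference flow in this representation.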


More Details: The Last-Mile Effect

The following issue should be taken into account in the models: if the out-degree of some host of the overlay tree is large, then the access link from this host may become the actual bottleneck, because of the large number of simultaneous transfers originating from this host. Hence the throughput of the overlay hops associated with this host may in fact be affected. This "last-mile link" effect can be incorporated in our model. The extra traffic can be represented by an increase of the aggregated service times. In order to keep this under control, the general idea is then to keep a deterministic bound, say D, on the out-degree of any host in the overlay tree. It is easy to show that if the bandwidth sharing is fair on the last-mile link, then the perturbed system, where all aggregated service times on this link are multiplied by D, is a conservative lower bound system in terms of throughput.

Hence, whenever the out-degree of any host is bounded by a constant, the proof of the scalability of throughput for the case without this last-mile effect extends to a proof of scalability with this effect taken into account.

We introduce the following notation:
• The TCP connection to host (k, l) is labeled with the index (k, l). It has a route that consists of a sequence of $H_{(k,l)}$ routers in series.
• Routers of TCP connection (k, l) are labeled by the index h taking values $1, 2, \ldots, H_{(k,l)}$. Each router is modeled as a single-server queue containing only packets from the reference connection. The service time for these packets in this queue is a random variable describing the impact of cross traffic, also called Aggregated Service Time (AST). For packet m being served through the router with index h in connection (k, l), it is denoted by $\mathrm{AST}^{(k,l,h)}_m$; accordingly, the buffer of this router is denoted with the label (k, l, h).

• We also introduce labels for the other buffers used by TCP connection (k, l): label (k, l, beg) denotes the output buffer of host $(k-1, m(k,l))$ on the socket corresponding to TCP connection (k, l); label (k, l, end) denotes the input buffer of host (k, l), which is the destination of this connection.

TCP Window flow control: Let $(W^{(k,l)}_m)_{m \ge 1}$ denote the window size sequence for TCP connection (k, l). More precisely, $W^{(k,l)}_m$ is the window size seen by packet m. This sequence takes its values in the set $\{1, 2, \ldots, W_{\max}\}$, where $W_{\max}$ is the maximal window size.

We assume the following random evolution for this sequence, which corresponds to the TCP Reno congestion avoidance phase: when the window is equal to w, it increases by a size corresponding to one MSS after w packets (additive increase rule), if there is no packet marked with ECN; when a packet is marked by one of the routers, the window is halved (multiplicative decrease rule); actually, an integer approximation of halving is used so as to keep the window in the set $\{1, 2, \ldots, W_{\max}\}$; similarly, if the window is equal to $W_{\max}$, it remains equal to this value until the next packet marking. If one assumes packets to be marked independently with probability $p_{(k,l)}$, then $(W^{(k,l)}_m)_m$ is an aperiodic and ergodic Markov chain [7].

A couple of remarks should be made:
1. In this model, we do not include the Time-Outs that occur and reinitialize the window after a source starvation, or a large variation of delay, and we neglect the slow-start phase. Both can be taken into account in simulations, but for all the cases that we present in this chapter, we have observed that they do not impact the long-term throughput.
2. We assume also that each packet is acknowledged. In current TCP implementations, an acknowledgment is sent for every second segment. This could be taken into account by saying that a packet transmission in the model represents the transmission of two MSS in the TCP connection.
One should then consider an "abstract packet" with size 2 × MSS in the model. The variable $W_m$, which is an integer expressed in abstract packets, is equal to the integer part of CWND/(2 × MSS), where CWND is the congestion window given by the TCP protocol; it increases by MSS/(2 × MSS) = 1/2 for each window successfully transmitted (i.e. the value of $W_m$ is increased by 1 after the successful transmission of $2W_m$ packets).

More Details: Description with Timed Petri-Nets

Let us assume that no packet is lost and that the congestion window remains equal to a constant deterministic value W. In this case, the transport of data in the overlay is a timed Petri-net.

A representation of this system for a binary overlay tree with height 2, including the blocking mechanisms associated with window flow control and back-pressure, is shown in the figure below.

The processing of this model can be thought of as a Token Game. Tokens can be stored in places (represented by circles or horizontal boxes) and transported to other places through transitions (represented by bars). They may represent either packets, or acknowledgments, or more generally control events associated with the routing, transport, or back-pressure mechanisms. As in a Petri-net, the general rule is that any transition takes place as soon as a token is available in each of the places upstream of this transition. One token is then consumed from each upstream place and one token is created in all places downstream of the transition, after some random processing time which depends on the transition.

The initial condition of the system is that all places with an index h (represented by empty circles), which represent buffers containing data packets of the communication, are empty of tokens. A number of tokens are put initially in the system to represent the empty space available in the routers (represented by circles that contain a single token), or in end-host memory (represented by horizontal boxes where the number is the amount of initial tokens).


This corresponds to the case where initially the multicast communication has not started and all memory buffers are empty in every host.

The feedback arcs, which go in the direction opposite to the data stream, represent the various flow control and back-pressure mechanisms.

[Figure: timed Petri-net of the overlay, fed by the source transition $T_m$; places are labeled h = beg, 1, 2, 3, end along each route, and the feedback arcs carry the windows $W^{(1,0)}, W^{(1,1)}, W^{(2,0)}, \ldots, W^{(2,3)}$, the input buffer sizes $B^{(0,0)}_{\mathrm{IN}}, \ldots, B^{(2,3)}_{\mathrm{IN}}$, and the output buffer sizes $B^{(0,0)}_{\mathrm{OUT},(1,0)}, \ldots, B^{(1,1)}_{\mathrm{OUT},(2,3)}$.]

• Note that in general, owing to the dynamical evolution of the window size, controlled externally by a loss process, this discrete-event system does not fall in the category of Petri-nets. It shares however a similarity that is used in the next paragraph (it can be seen as a linear system in the (max,+) algebra).

For more details on Petri-nets see [6].

Figure 2.4: A binary overlay tree of height 2 with input and output blocking.

(max,+) Linear Evolution Equations

Let $T_m$, $m \ge 0$, denote the time when packet m is available at the root host. In this work, we may assume a saturated root, where all packets are ready at the root from the beginning of the communication; namely $T_m = 0$ for all m. Let $x^{(k,l,h)}_m$ denote the time when packet m has completed its transmission and leaves the buffer with label (k, l, h). In particular for $h = 1, \ldots, H_{(k,l)}$, this is the time when router h in connection (k, l) has completed the service for packet m. The quantity $x^{(k,l,\mathrm{beg})}_m$ is the time when packet m departs from the output buffer of the source host of TCP connection (k, l), and arrives in the input buffer of router h = 1. Finally, $x^{(k,l,\mathrm{end})}_m$ is the time when packet m departs from the input buffer of the receiver host of TCP connection (k, l), and is transmitted into each output buffer of host (k, l).

The network (except for the root) is assumed to be initially empty of any packets of the communication. This can be represented by taking $x^{(k,l,h)}_m = -\infty$ for any m < 0. Then, with the above assumptions, the dynamic behavior of the model presented in the last subsection is given by the following evolution equations at the root host (where ∨ denotes the maximum):

\[
\begin{aligned}
x^{(0,0,\mathrm{beg})}_m &= T_m \vee x^{(0,0,\mathrm{end})}_{m-B^{(0,0)}_{\mathrm{IN}}} \\
x^{(0,0,\mathrm{end})}_m &= x^{(0,0,\mathrm{beg})}_m \vee \Big( \bigvee_{l \in d(0,0)} x^{(1,l,H_{(1,l)})}_{m-B^{(0,0)}_{\mathrm{OUT},(1,l)}} \Big)
\end{aligned}
\]

and, for a host (k, l), where $k \ge 1$, and $l \ge 0$:

\[
\begin{aligned}
x^{(k,l,\mathrm{beg})}_m &= x^{(k-1,m(k,l),\mathrm{end})}_m \vee x^{(k,l,\mathrm{end})}_{m-B^{(k,l)}_{\mathrm{IN}}} \vee x^{(k,l,H_{(k,l)})}_{m-W^{(k,l)}_m} \\
x^{(k,l,1)}_m &= \Big( x^{(k,l,\mathrm{beg})}_m \vee x^{(k,l,1)}_{m-1} \Big) + \mathrm{AST}^{(k,l,1)}_m \\
&\;\;\vdots \\
x^{(k,l,H_{(k,l)})}_m &= \Big( x^{(k,l,H_{(k,l)}-1)}_m \vee x^{(k,l,H_{(k,l)})}_{m-1} \Big) + \mathrm{AST}^{(k,l,H_{(k,l)})}_m \\
x^{(k,l,\mathrm{end})}_m &= x^{(k,l,H_{(k,l)})}_m \vee \bigvee_{l' \in d(k,l)} x^{(k+1,l',H_{(k+1,l')})}_{m-B^{(k,l)}_{\mathrm{OUT},(k+1,l')}} .
\end{aligned}
\]

One can observe that in all these equations, the index corresponding to the packet number is m in the LHS, and always smaller than or equal to m in the RHS. When the index is m in the RHS, it is a term where either index h or index k is smaller than the one in the LHS. Following this observation one can easily compute numerically the value of these dates in the lexicographic order corresponding to indices m, k, l, h.

3.3 Last-Passage Percolation, and Packet Losses

We introduce in this section a new class of model, based on a category of invariant graphs, which reproduces well the evolution equations that we have found previously. Moreover, the impact of packet losses on re-transmissions and re-sequencing can be easily captured in this formalism. Completion times of packet transmissions are seen as a last-passage percolation in this category of graphs. The properties of this class of models are interesting by themselves, and we treat them separately in Chapter 3.

Last passage percolation in a Graph

Let G be the graph where the set of vertices is:

\[
V = \{ (0,0,\mathrm{beg},m),\ (0,0,\mathrm{end},m),\ m \in \mathbb{Z} \}
\cup \{ (k,l,h,m),\ k \ge 1,\ l \ge 0,\ h \in \{ \mathrm{beg}, 1, \ldots, H_{(k,l)}, \mathrm{end} \},\ m \in \mathbb{Z} \}
\]


and the set of edges E is $E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5$ with:

\[
\begin{aligned}
E_1 ={}& \{ (0,0,\mathrm{end},m) \to (0,0,\mathrm{beg},m) \mid \forall m \in \mathbb{Z} \} \\
&\cup \{ (k,l,1,m) \to (k,l,\mathrm{beg},m),\ (k,l,\mathrm{end},m) \to (k,l,H_{(k,l)},m) \mid \forall k \ge 1,\ l \ge 0,\ m \in \mathbb{Z} \} \\
&\cup \{ (k,l,h,m) \to (k,l,h-1,m) \mid 2 \le h \le H_{(k,l)},\ k \ge 1,\ l \ge 0,\ m \in \mathbb{Z} \} \\
&\cup \{ (k,l,\mathrm{beg},m) \to (k-1,m(k,l),\mathrm{end},m) \mid \forall k \ge 1,\ l \ge 0,\ m \in \mathbb{Z} \} \\
E_2 ={}& \{ (k,l,h,m) \to (k,l,h,m-1) \mid \forall h,\ k \ge 1,\ l \ge 0,\ m \in \mathbb{Z} \} \\
E_3 ={}& \{ (k,l,\mathrm{beg},m) \to (k,l,H_{(k,l)},m-W^{(k,l)}_m) \mid \forall k \ge 1,\ l \ge 0,\ m \in \mathbb{Z} \} \\
E_4 ={}& \{ (k,l,\mathrm{beg},m) \to (k,l,\mathrm{end},m-B^{(k,l)}_{\mathrm{IN}}) \mid \forall k \ge 0,\ l \ge 0,\ m \in \mathbb{Z} \} \\
E_5 ={}& \{ (k,l,\mathrm{end},m) \to (k+1,l',H_{(k+1,l')},m-B^{(k,l)}_{\mathrm{OUT},(k+1,l')}) \mid \forall l' \in d(k,l) \text{ and } \forall k \ge 0,\ l \ge 0,\ m \in \mathbb{Z} \}
\end{aligned}
\]

We illustrate the graph G for the case of two TCP connections in series in Figure 2.5, where the $E_1$ edges are the horizontal ones and the $E_2$ edges are the vertical ones. The categories of the other edges in the Figure are explicitly indicated.

[Figure 2.5: Random Graph Representing two TCP Connections in tandem with Back-Pressure Constraints. Columns: k = 0 (h = beg, end), k = 1 with $H_1 = 4$ (h = beg, 1, 2, 3, 4, end), k = 2 with $H_2 = 3$ (h = beg, 1, 2, 3, end). Rows: Packet m, Packet m − 1, ..., Packet m − B, ..., Packet m − $W_m$. Edges of classes $E_3$, $E_4$ and $E_5$ are labelled.]

We introduce a weight for each vertex (k, l, h, m), that is given by $\mathrm{AST}^{(k,l,h)}_m$ for $h \in \{1, 2, \ldots, H_{(k,l)}\}$ and $m \in \mathbb{Z}$, and that is equal to zero for $h \in \{\mathrm{beg}, \mathrm{end}\}$. The weight Wei(π) of a path π in G is defined as the sum of the weights of the vertices contained in π.

All the buffers in the hosts are initially empty. As a consequence, we define the restriction of this graph denoted by G[0]: the weight of vertex (k, l, h, m) in G[0] is changed to $-\infty$ for all vertices with $m < 0$. One can then prove, by an induction argument based on the recurrence equations of §3.2, that for all k, l, h, m

\[
(2.2) \qquad x^{(k,l,h)}_m = \max_{\pi \text{ a path in } G[0],\ (k,l,h,m) \leadsto (0,0,\mathrm{beg},0)} \mathrm{Wei}(\pi) .
\]

Model for Packet Losses

Our aim in this section is not to build an exact model for the case with packet losses, but rather to describe a simplified and tractable model obtained via a set of conservative transformations. To prove the scalability in this case (namely the positiveness of throughput in the exact model for an infinite overlay tree), it is sufficient to show that the simplified conservative model scales in the same sense.

Imagine packet with index m is lost. Before TCP is aware of this loss, some of the packets $m + 1, \ldots, m + W_m$ might have left the source. They have been in fact allowed by the congestion window mechanism, but they may not have left the source because of the back-pressure mechanism, or because they are not available at this time in the sender host. In all cases, the following simplified window evolution is conservative:

- The window is set to $\max((W_m - 1)/2, 1)$ for packets with indices $m + 1, m + 2, \ldots, m + W_m + \max((W_m - 1)/2, 1) - 1$.
- Starting from packet $m + W_m + \max((W_m - 1)/2, 1)$, the additive increase evolution of the window is resumed.

It means here that the exact system using Fast Retransmit Fast Recovery (see [1]) has larger windows at all times. The exact throughput, as a consequence, would be higher than in this simplified model (more details may be found in [4]).
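The index bookkeeping of this simplified window evolution can be sketched as follows. The function names are hypothetical, and rounding $(W_m - 1)/2$ down to an integer is an assumption, since the text does not say how a fractional window is handled.

```python
def halved_window(w_m: int) -> int:
    """max((W_m - 1) / 2, 1), with the division rounded down (assumption)."""
    return max((w_m - 1) // 2, 1)


def conservative_phase(m: int, w_m: int):
    """Describe the reduced-window phase that follows the loss of packet m.

    Returns (reduced window, first index, last index, resume index), where the
    reduced window applies to packets first..last and the additive increase
    resumes at packet `resume index`.
    """
    w = halved_window(w_m)
    first = m + 1
    last = m + w_m + w - 1
    return w, first, last, last + 1


# Example: packet 100 is lost while the window is 9 packets.
print(conservative_phase(100, 9))
```

For a loss of packet 100 with a window of 9, the reduced window of 4 packets covers indices 101 to 112, and the additive increase resumes at packet 113.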


In addition to that, the loss of packet with index m has two major impacts. First, retransmissions are sent by the sources. We can choose to include them at the last possible step of the communication (between packet $m + W_m$ and $m + W_m + 1$). This tends to overload the network at intuitively the worst time (after the self-clocking mechanism has resumed with a half window), and this is what is done in our simulations. Another choice, even more conservative, is to include retransmissions at every possible step (after each packet $m, m + 1, \ldots, m + W_m$) to cover each possible case of the exact model. This is what we choose in our model, as the scalability result can still be proved under this assumption. The number of retransmissions at each step is $W_m$ in our model by default, again corresponding to a conservative assumption; it is only one packet if SACK is implemented by TCP.

The second consequence of packet m being lost is that some packets $m + 1, m + 2, \ldots$ are blocked in the input buffer of the destination host, as packets need to be processed by the host according to the order defined by the sequence number. In our model, we have chosen to include this blocking for all the possible packets involved. In other words packets $m, m + 1, \ldots, m + W_m$ cannot be processed in the destination host until packet $m + W_m$, and the retransmissions, arrive.

The Graph associated with Packet Losses

In this section, vertices of the graph associated with the index m refer either to packet m itself, or to a retransmitted packet, sent after packet m and before packet m + 1. In the random graph associated with the loss model, we denote the vertex for host (k, l), packet m and index h by v(k, l, h, m). For all k, l, $h = 1, \ldots, H_{(k,l)}$ and m, we add a vertex v′(k, l, h, m) on top of v(k, l, h, m), which represents the potential retransmissions of packets just between packets m and m + 1. The weight associated with this vertex is independent and obeys the same law as $\mathrm{AST}^{(k,l,h)}_m$, in the case where TCP is implementing SACK. Otherwise, it should be equal to a sum of $2 \times W_m$ independent variables with this law, as several packets may be retransmitted.

We also add the following edges to include these vertices in the horizontal and vertical structures of the graph:

• Horizontal edges: v′(k, l, 1, m) → v(k, l, beg, m) and v′(k, l, h, m) → v′(k, l, h − 1, m) for h = 2 ... H,
• Vertical edges: v′(k, l, h, m) → v(k, l, h, m) for h = 1 ... H.

In order to represent the effect of the loss and the retransmission of packet m on the TCP connection (k, l), we add:

• Edge $E_6$: $v(k, l, \mathrm{end}, m) \to v'(k, l, H_{(k,l)}, m + W_m - 1)$ in order to represent the re-sequencing constraints on packets $m, m + 1, \ldots, m + W_m - 1$.


• Edges $E_7$: $v(k, l, h, m'' + 1) \to v'(k, l, h, m'')$ for all $h = 1, \ldots, H_{(k,l)}$ and $m'' = m, \ldots, m + W_m$ to represent the retransmission of packet m (as the extra packet in between indices $m + W_m - 1$ and $m + W_m$) which delays the following packets.

The complete graph (including all types of edges $E_1, \ldots, E_7$) is presented in Figure 2.6, for the case of two TCP connections in tandem. Edges belonging to $E_7$ are the vertical local edges in red. For readability purposes, edges belonging to classes other than $E_1$ and $E_2$ have been represented only when they depart from station k and packet m; we have also chosen to represent here the case where $B_{\mathrm{IN}} = B_{\mathrm{OUT}} = B$ for simplicity.

[Figure 2.6: Random Graph Representing two TCP Connections in tandem with back-pressure and re-sequencing Constraints. Columns: k = 0 (h = beg, end), k = 1 with $H_1 = 4$, k = 2 with $H_2 = 3$. Rows, each with its window: Packet $m - W_m$ ($W = W_{m - W_m}$), Packet m − B ($W = W_{m-B}$), Packet m − 1 ($W = W_{m-1}$), Packet m ($W = W_m$), Packet m + 1 ($W = W_m/2$), Packet $m + W_m$ ($W = W_m/2$), Packet $m + W_m + 1$ ($W = W_m/2$). Edges of classes $E_3$, $E_4$, $E_5$ and $E_6$ are labelled.]


4 Scalability Analysis

This section establishes two properties of the one-to-many TCP Overlay. First, that the throughput experienced by users can be guaranteed for any group size, as it is positive in an infinite group. Second, that even if the buffer becomes large, the number of data packets waiting to be served in a host far away from the source converges in a weak sense, leading to a bounded latency for data transit in the group. This second result requires that a fixed rate control is performed at the source, to avoid the critical case.

These two facts are first observed empirically in §4.1 and §4.3. They are obtained in a simulator based on the (max,+) evolution equations, which is particularly efficient for the simulation of large overlay trees. We also prototyped our reliable multicast architecture and carried out experiments in a planetary testbed environment. Both of these observations are then justified analytically, respectively in §4.2 and §4.4, based on the last-passage percolation model we introduced above, and on the results shown in Chapter 3.

Assumptions

All our results require that certain statistical assumptions hold on the individual point to point routes that basically guarantee the good behavior of each TCP connection in a stand-alone situation (e.g. bound on the number of IP links that are traversed, bound on the packet loss probability etc.).

We call the homogeneous model (resp. the non-homogeneous model) the case where:

• Each host in the overlay tree has a fixed fan-out degree D (resp. bounded from above by D);
• All back-pressure buffers have the same size for any host in the overlay tree (resp. they are bounded from below by constants $B_{\mathrm{IN}}$ and $B_{\mathrm{OUT}}$);
• The routes used by all TCP connections are structurally and statistically equivalent (resp. offer minimal structural and statistical guarantees). More precisely, the number of hops H is the same for all connections (resp. bounded from above by H); the ECN packet marking or packet loss process is independent and identically distributed in all connections with probability p (resp. with a probability that is smaller than p); and finally, the aggregated service times are independent in all routers, and identically distributed with law σ (resp. they are independent and bounded from above, in the stochastic order sense (see §III in Appendix A), by a random variable σ with a finite mean).


4.1 Empirical Results on Back-Pressure

The throughput obtained in the one-to-many TCP overlay is the result of a (possibly large) number of connections in interaction, as a consequence of the back-pressure mechanism. The first concern that we would like to address in this section is: how does this throughput decrease as a function of the overlay size, and what is the impact of the buffer size on this throughput degradation?

(max,+) Simulations with Finite Buffers

We analyze the throughput obtained for long file transfers in large groups. We use for this purpose a (max,+) simulator based on a variation of the Evolution Equations described in §3.2, where packet losses and re-sequencing are taken into account. The main advantage of this equation-based simulator, compared to more traditional discrete-event simulators, is that it allows one to handle larger overlays: we simulate typically an overlay made with more than 1,000 end-hosts, using a network made of more than 10,000 routers in total.

We have chosen MSS=100B, so that a packet is 200B (see Remark 2 in p.110). In each simulation run, we simulate the transmissions of 10M packets (or 2GB of data). We only report results where the overlay distribution tree is binary and complete. Each TCP connection involved goes through 10 routers in series, and all the packets transmitted on this connection have an independent probability p to get a negative feedback (loss or ECN marking). By default, p = 0.01. We choose $B_{\mathrm{IN}} = 50$ packets (i.e. 10KB), and $B_{\mathrm{OUT}}$ varies as 50, 100, 1000 and 10 000 packets (resp. 10KB, 20KB, 200KB, 2MB). Consequently, for each connection $W_{\max} = \min(B_{\mathrm{IN}}, B_{\mathrm{OUT}}) = 50$ packets.

As it was described above, the cross traffic is characterized by the Aggregated Service Times in each router.
In these simulations we have considered both exponential (the default option) and Pareto random values with mean equal to 10ms in each router/link.

We have simulated overlays of sizes up to 1023 hosts, with different variants for handling losses: TCP RENO type (with Fast Recovery Fast Retransmit), TCP SACK and TCP over ECN. We also considered the impact of output buffer size. Figure 2.7 illustrates the throughput as a function of the group size, in the case of TCP Reno. It is easy to see that, quite intuitively, the group throughput is a decreasing function of the group size, and an increasing function of the output buffer size. Note that this Figure is plotted with a log-scale for the x-axis, representing the number of receivers, as several orders of magnitude of receivers are presented. The throughput seen in a linear scale for the number of receivers would look almost completely flat. We observe that when the output buffer is large (say more than 1000 packets), the throughput flattens out quickly with small groups (less than 10 hosts). For smaller output buffers, the convergence to the asymptotic throughput can be observed when the group size reaches 100 hosts. This means that the throughput decreases initially with the size of the overlay. However, for more than 100 receivers, it does not decrease significantly any more. This throughput can be made close to the reference throughput obtained by a single connection, if the buffers used are moderately high.

[Figure 2.7: Asymptotic group throughput as a function of group size, for binary overlay tree and exponential cross traffic (TCP with Fast Retransmit Fast Recovery). Panel title: TCP Reno; Overlay with 10 Routers; Packet Loss Rate = 0.01; Service Time: exponentially distributed with mean = 10ms. x-axis: Number of Receivers (log scale, 1 to 1000); y-axis: Asymptotic Throughput (Pkts/sec, 38 to 50). Curves: Buffer = 10 000, 1 000, 100 and 50 Packets.]

The two other variants of TCP exhibit similar behavior (as seen in Figure 2.8): with the same configuration, TCP with SACK has a throughput that is about 8% more than that of TCP Reno; TCP over ECN has roughly the same performance, slightly better, with 2% improvement over TCP SACK.

[Figure 2.8: Group throughput as a function of group size, with exponential cross traffic, and different output buffer sizes: TCP using SACK (left), TCP using ECN (right). Both panels: Overlay with 10 Routers; loss probability 0.01; Service Time: exponentially distributed with mean = 10ms. x-axis: Number of Receivers (log scale, 1 to 1000); y-axis: Asymptotic Throughput (Pkts/sec, 38 to 50). Curves: Buffer = 10 000, 1 000, 100 and 50 Packets.]

Comparison between Asymptotic Throughput and Single Connection Throughput. In [5], it was shown for the case without back pressure that the group throughput is equal to the minimum of those of a single saturated connection (such throughput is referred to as local throughput). Thus, for the homogeneous case, this translates into the fact that the group throughput is identical to the local throughput. In our case, there is no hope that such a relation holds because of the back pressure mechanisms. It is however interesting to know how far the large group asymptotic throughput is from the local throughput. In the following table, we report the ratio of these two quantities, where the group throughput was simulated for a group of 1023 receivers:

Buffer (Pkts)   10,000   1,000   100   50
TCP RENO        .99      .98     .90   .83
TCP SACK        .99      .99     .92   .86
TCP ECN         .99      .99     .92   .87

Table 2.1: Ratio of Group Throughput / Local Throughput

It is worthwhile observing that the group throughput with large output buffers remains, as expected, close to the local throughput. In other words, large output buffers alleviate in a significant way the effect of the back pressure mechanism. Even if the output buffer is small, say 50 packets (in this case, identical to the input buffer), the degradation of the group throughput created by back pressure is moderate (less than 18%).

Remark: The previous result, which shows that back-pressure leads to moderate throughput degradation, is not restricted to the homogeneous case. Under the heterogeneous model assumptions, as the group throughput is a deterministic non-increasing function of the collection of weights, Lemma 11 in Appendix A may be used to prove that the throughput in any large group is larger than a positive constant, by stochastic comparison.

Impact of Cross Traffic Distribution: In our work, cross traffic at the routers is modeled through the aggregated service times. We now use simulation to see what is the impact of this distribution, in particular when it is heavy tailed. In Figure 2.9 (left), we show the throughput as a function of the group size for exponential and Pareto distributions, with different tail parameters, but the same mean.

First, we observe that the throughput is different, even for an overlay distribution tree containing only a single overlay hop, from the source to a unique receiver. This is expected, as it was shown already in [7] that the long-term throughput of a TCP connection does not depend only on the average aggregated service time, but is sensitive to the variation of this distribution. We see that the heavier the tail of the distribution is, the smaller the throughput is.

We also observe that, even for heavy tail distributions like Pareto, when the second moment exists (which is the case when the parameter is 2.1), the throughput curve has a shape similar to that of the exponential distribution. However, when the parameter is 1.9, the second moment no longer exists, and the throughput curve tends to decay faster.

Impact of Packet Loss Probability: How does the group throughput behave with regard to the packet loss (or negative feedback) probability? As we pointed out earlier in this section, the asymptotic group throughput remains relatively close to the throughput of a single connection (i.e. local throughput) when the output buffer is not too small.
The square root formula is known to describe the long term average throughput of a point to point TCP connection. A natural question that arises is whether this formula could be generalized to predict the long term average group throughput in a back-pressured overlay network. We have conducted simulations which suggest that, even with the back pressure mechanisms, the group throughput has a similar shape as that of the single-connection throughput. Figure 2.9 (right) illustrates the group throughput as a function of packet loss probability in a particular case (a group of 126 receivers with B = 100 Pkts). One immediately notices that the single connection throughput (i.e. local throughput) is close to that of the group of size 126, independently of the loss probability.
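For reference, a commonly quoted form of the square root formula is recalled below. It is not stated explicitly in the text above, and it relies on assumptions that the present model does not make: a fixed round-trip time RTT, periodic losses with probability p, and one acknowledgment per packet.

\[
\text{throughput} \;\approx\; \frac{1}{\mathrm{RTT}} \sqrt{\frac{3}{2p}} \quad \text{packets per second.}
\]

The constant $\sqrt{3/2}$ changes with the acknowledgment strategy and the loss model. What matters for the comparison with Figure 2.9 (right) is the $1/\sqrt{p}$ shape of the curve.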


[Figure 2.9: Group throughput as a function of group size for several laws of cross traffic (left), as a function of packet loss probability (right). Left panel: TCP Reno; Overlay with 10 Routers; Buffer = 100 Pkts; Packet Loss Proba = 0.01; service time with mean = 10ms; x-axis: Group Size (0 to 1000); y-axis: Group Throughput (Pkts/sec, 30 to 50); curves: Exponential law, Pareto law with parameter 2.1, Pareto law with parameter 1.9. Right panel: TCP Reno; Overlay with 10 Routers; Buffer = 40 Pkts; Service Time: exponentially distributed with mean = 10ms; x-axis: Packet Loss Proba (0 to 0.1); y-axis: Group Throughput (Pkts/sec, 10 to 100); curves: One connection, Group size = 126.]

Experiments on Planet-Lab

In order to evaluate the practicality of our models, we have implemented a prototype of the one-to-many TCP overlay multicasting system. We used the Planet-Lab network², which gives access to computers located in universities and research centers over the world. Our implementation runs a separate process for each output and input buffer, which are synchronized via semaphores and pipes. As soon as data is read from the input buffer, it is available for outgoing transmissions. A separate semaphore is used to ensure that data is not read from the input socket if it cannot be sent to the output buffers, which creates back-pressure. A dedicated central host was used to monitor and control the progress of the experiments.

² Planet-Lab is an open platform for developing, deploying, and accessing planetary-scale services, www.planet-lab.org.

To analyze scalability of throughput, we constructed a balanced binary overlay distribution tree of 63 hosts connected to the Internet. We started simultaneously transmissions in balanced sub-trees of sizes 15, 31 and 63 with the same root. Running experiments simultaneously allowed us to avoid difficulties associated with fluctuation of networking conditions. In this way, link capacities are always shared between overlay trees of different sizes in roughly equal proportions across the overlay trees. We measured throughput in packets per second, achieved on each link during transmission of 10MB of data. Throughput of a link was measured by the receiving host. We report in Table 2.2 the group throughput measurements for 3 different overlay tree sizes and 3 different settings for output buffer size. Group throughput is computed as the minimum value of link throughput observed in the overlay tree. Similarly to our simulations presented above, the size of each packet is 200 bytes, the size of the input buffer is 50 packets, and the size of the output buffer is variable.

Group size:         15   31   63
Buffer=50 Pkts      95   86   88
Buffer=100 Pkts     82   88   77
Buffer=1000 Pkts    87   95   93

Table 2.2: Throughput in Pkts/s for different sizes

One can observe that the group throughput does not change much with the group size. This is consistent with the simulation results reported above, although, as is quite expected, the absolute numbers are different.

4.2 Proving Throughput Scalability

In this section, we are interested in the throughput of the group communication when the size of the overlay tree gets large. For this, we directly consider an infinite overlay distribution tree. We show that even for this case and for back-pressure mechanisms deployed between all end-hosts, the group throughput is positive. This is an unexpected result in view of the preliminary simulation results reported in [61], and it also contrasts with the non-scalability results reported in the literature on IP-supported reliable multicast.

Interpretation as a Pattern Invariant Graph

According to the description made in §3.3, the packet processing times in an end-host participating in the one-to-many TCP overlay follow a precedence rule. For any end-host, indexed by k, l (both represent its position in the overlay), any packet m is received at time $x^{(k,l,\mathrm{end})}_m$, that can be written as a path of maximal weight using (2.2).

The key property, which is always verified by the graphs we presented in §3.3, is the following: the edges that may be used do not depend on the absolute position of an end-host, but are defined using only a finite number of its neighbors (mother and daughter hosts in the overlay tree). As a consequence, the precedence relation between packet processing times, detailed in §3.3, defines a pattern invariant graph (p.i.g., see Chap. 3, §4). It is therefore only a matter of rewriting to see that:

\[
x^{(k,l,\mathrm{end})}_m = \mathrm{Per}\big( m \times (k,l) \times \mathrm{end} \big) ,
\]

where Per is a last-passage percolation on a graph constructed on the product $\mathbb{N} \times T_D$. The first index describes the packet number m and the second index the position in the overlay tree (that is a semi-infinite regular rooted tree $T_D$, see p.204). For every element of this product (denoted $m \times (k, l)$, that is associated with a given end-host and a given packet number), we associate a finite set (or pattern) which describes the successive steps of processing of this packet in this end-host: $\mathcal{H} = \{\mathrm{beg}, 1, 2, \ldots, H, \mathrm{end}\}$. As it was already seen above, each of these values for index $h \in \mathcal{H}$ represents a buffer either in a host, or in a router, and it may be extended to contain $1', 2', \ldots, H'$ representing re-transmission.

Let us define the transformation $\eta(m \times (k, l)) = (m + 1) \times (k, l)$ that is a shift in the graph on the first coordinate. This is obviously an invariant transformation of the product $\mathbb{N} \times T_D$. There always exists a path $\eta(m \times (k, l)) \times h \leadsto m \times (k, l) \times h$; hence Theorem 11, shown in Chapter 3, tells us that we have, quite generally by super-additivity, that

\[
\lim_{m \to \infty} \frac{x^{(0,0,\mathrm{end})}_m}{m}
= \lim_{m \to \infty} \frac{\mathrm{Per}\big( \eta^{(m)}(0 \times (0,0)) \times \mathrm{end} \big)}{m}
= \ell \quad \text{exists a.s. in } \mathbb{R} \cup \{+\infty\} .
\]

This allows to interpret $1/\ell$ as the intensity of packet arrivals to the input buffer of the root end-host (0, 0). Hence this is the asymptotic long term throughput of the overlay, measured at the source.
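The limit above can be estimated numerically by iterating the evolution equations of §3.2 in the order m, then k, then h. The sketch below is a simplified illustration, not the simulator used in this work: it assumes a chain of hosts (D = 1), a constant window (no losses, retransmissions or re-sequencing), a saturated root with $T_m = 0$, and the same number of routers on every connection. All function and parameter names are hypothetical.

```python
import math
import random

NEG_INF = -math.inf


def simulate_chain(n_packets, n_hops, n_routers, window, b_in, b_out,
                   service, seed=0):
    """Iterate the evolution equations for a chain of TCP connections.

    Returns x with x[k][j][m]: j = 0 is 'beg', j = 1..H are the routers and
    j = H + 1 is 'end'. Host 0 is the saturated root (T_m = 0).
    """
    rng = random.Random(seed)
    K, H = n_hops, n_routers
    x = [[[NEG_INF] * n_packets for _ in range(H + 2)] for _ in range(K + 1)]

    def at(k, j, m):
        # Dates with a negative packet index are -infinity (empty network).
        return x[k][j][m] if m >= 0 else NEG_INF

    for m in range(n_packets):          # order: m, then k, then h
        for k in range(K + 1):
            if k == 0:
                beg = max(0.0, at(0, H + 1, m - b_in))
                end = beg
            else:
                beg = max(at(k - 1, H + 1, m),     # data from the mother host
                          at(k, H + 1, m - b_in),  # room in the input buffer
                          at(k, H, m - window))    # constant window
                end = beg
                for h in range(1, H + 1):
                    end = max(end, at(k, h, m - 1)) + service(rng)
                    x[k][h][m] = end
            x[k][0][m] = beg
            if k < K:
                # Back-pressure from the output buffer towards the daughter.
                end = max(end, at(k + 1, H, m - b_out))
            x[k][H + 1][m] = end
    return x


def throughput(x):
    """Packets per unit of time delivered to the last host of the chain."""
    last_end = x[-1][-1]
    return len(last_end) / last_end[-1]


if __name__ == "__main__":
    # Exponential aggregated service times with mean 10 ms (rate 100 per s).
    x = simulate_chain(5000, n_hops=5, n_routers=10, window=50, b_in=50,
                       b_out=50, service=lambda rng: rng.expovariate(100.0))
    print(throughput(x))
```

With deterministic unit service times the recursion can be checked by hand: a window of one packet over three routers yields one packet every three time units, which gives a quick sanity test before random service times are plugged in.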

• It is not hard to see, when the buffer used in each end-host is finite, that this limit is the same when host (0, 0) is replaced by any end-host (k, l) (Proposition 13).

• Note also that for the non-homogeneous case, we can show by stochastic comparison that

\[
\liminf_{m \to \infty} \frac{m}{x^{(k,l,h)}_m} \ge \frac{1}{\ell} ,
\]

where $1/\ell$ is the group throughput of the corresponding homogeneous model defined with the bounds.

What remains, and indeed requires the biggest effort, is to prove that ℓ is a.s. finite and constant. We will prove it under different assumptions on the topology and the law used by the aggregated service time.

Scalability under Light Tailed Assumption

In this section, we assume that the random variable σ is light tailed:

There exists $t > 0$ such that $E[e^{t\sigma}] \le A(t) < +\infty$.

We consider the normalized level application (see §4 in Chap. 3) $\mathrm{lev} : \mathbb{N} \times T_D \to \mathbb{Z}^2$, defined by $\mathrm{lev}(m \times (k, l)) = (m, k)$. This application maps the pattern invariant graph already described into a pattern grid with dimension 2. The associated dependence graph is shown in Figure 2.10.


[Figure 2.10: The dependence graph associated with a one-to-many TCP overlay with back pressure and re-transmission re-sequencing. Vertices: beg, 1, 2, 3, 1′, 2′, 3′, end. Edge labels: (−1, 0); $(-B_{\mathrm{IN}}, 0)$; $(-B_{\mathrm{OUT}}, +1)$; (0, −1); $(-1, 0), \ldots, (-W_{\max}, 0)$; $(1, 0), \ldots, (W_{\max}, 0)$. Conventions used: no label means null label, except for one class of edges (its symbol is lost in this transcript) that always has label (−1, 0).]

According to Theorem 12, we only need to prove that this dependence graph is sharp (i.e. it admits a sharp vector) to have

\[
\lim_{m \to \infty} \frac{x^{(k,l,\mathrm{end})}_m}{m} = \ell \quad \text{that is finite and constant, for a.s. and } L^1 \text{ convergence.}
\]

One can see that, for any path $\pi : h \to h'$ in the dependence graph associated with vector r, the function $f(\pi) = \langle r, s \rangle + \varphi(h') - \varphi(h)$, where

\[
s_1 = H + 2, \qquad s_2 = (H + 2) W_{\max} + 1
\]

and

\[
\varphi(\mathrm{beg}) = 0, \qquad
\varphi(h) = h \ \text{for } h = 1, \ldots, H, \qquad
\varphi(h') = h + 1 \ \text{for } h = 1, \ldots, H, \qquad
\varphi(\mathrm{end}) = (H + 2) W_{\max} ,
\]

verifies $f(\pi) < 0$. This is obvious as $f(\pi) < 0$ can be proved for all paths made with a single edge, and $f(\pi \circ \pi') = f(\pi) + f(\pi')$. As a consequence, whenever π is a cycle in the dependence graph, we have $\langle r, s \rangle \le -1$. This proves that s is indeed a sharp vector for this dependence graph.

Note that the same argument proves that an overlay using TCP and organized in an infinite diagonal lattice, where every host aggregates information from its two upstream neighbors and transmits it to its two downstream children, admits a positive throughput independent of the size.
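The step from $f(\pi) < 0$ to the bound on cycles can be spelled out. The derivation below assumes, as the additivity of f suggests, that the vector r associated with a path is the sum of the labels $r_1, \ldots, r_n$ of its edges $\pi_1, \ldots, \pi_n$, with $\pi_i : h_{i-1} \to h_i$. For a cycle, $h_n = h_0$ and the φ terms telescope:

\[
f(\pi) = \sum_{i=1}^{n} f(\pi_i)
= \sum_{i=1}^{n} \langle r_i, s \rangle + \sum_{i=1}^{n} \big( \varphi(h_i) - \varphi(h_{i-1}) \big)
= \langle r, s \rangle + \varphi(h_n) - \varphi(h_0)
= \langle r, s \rangle .
\]

Hence $\langle r, s \rangle = f(\pi) < 0$. Since the edge labels and the coordinates of s are integers, $\langle r, s \rangle$ is an integer, which gives $\langle r, s \rangle \le -1$.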


Scalability under Heavy Tailed Assumptions

Proving the same scalability result under heavy tailed assumptions is significantly more difficult, as it requires to go deeper in the analysis of the combinatorial properties of paths in the pattern invariant graph. It is in general difficult to tell if the light tailed assumption on the weights can be relaxed.

There is one exception, if we assume D = 1 (such that the semi-infinite regular rooted tree becomes a semi-infinite line). In this case the pattern invariant graph becomes exactly a pattern grid of dimension 2. It has the same dependence graph as before, hence this dependence graph verifies the sharp criterion. For this case we know that the limit ℓ is a finite constant, as long as we have

\[
\int_0^{+\infty} P(\sigma \ge u)^{1/2} \, du < \infty .
\]

This should not come as a surprise, as Martin already showed in [41] that an infinite series of queues with finite buffer and blocking admits a positive throughput under the same moment condition. In fact, Lemma 5.4 in this same article could be used almost directly to prove the result by comparison.

More generally, as shown in Chap. 3 §1.4, it is enough to embed a path of G into a greedy lattice animal with a size that is bounded by a linear function of the size of the path. This is always possible for a pattern grid that admits a sharp vector.

Whether this property remains true for other pattern invariant graphs is currently still an open question that we wish to study in future work.

4.3 Empirical Results with Infinite Buffers

When large buffers are used in end-systems, the one-to-many TCP overlay tends to behave like an overlay where an infinite amount of memory is available to store data exchanged. In this case, the analysis of the system is in some way simplified, owing to the following acyclic causality property: If one considers a fixed destination end-system in the overlay, the packet arrival times do not depend on any end-system that is downstream in the overlay distribution tree. In other words, the time of arrival of packets to an end-system depends only on its preceding overlay hops, that are the ones contained in the path from the source to this destination.

The acyclic causality property, occurring for infinite buffers, has two consequences:

1. One can study any branch of the distribution tree separately, seen as an infinite chain of overlay hops. In other words, we can always assume that the tree has a constant degree equal to 1, so that the index l is always equal to 0 and can be discarded from the notation.

Page 137: baccelli/Evaluation/AugustinChaintreauPhD.pdf

S alable Multi asting on Overlay Network 1272. The throughput s alability an be proved step-by-step, following anargument of rate onservation (see [5).Another onsequen e of the a y li ausality property is that the overlayar hite ture is essentially open-loop, and may be unstable. In parti ular,the number of data pa kets waiting to be sent in a given end-host an growto innity with time. As a onsequen e a large laten y an be experien edbetween the time a pa ket is sent by the sour e and it is re eived by anend-hosts.(max,+) Simulations with Innite BuersAgain, the simulation results we obtained are based on a dire t exploitationof the (max,+) evolution equations that we have introdu ed. A ording tothe remark made above, we need only to simulate the transmission of pa ketsin a hain of end-hosts. We onsider here only the homogeneous ase.Figure 2.11 studies the stationary mean buer o upan y in a host lo- ated at level k of an overlay network omposed of an arbitrary overlay tree.The throttling of the sour e is assumed to be realized via a deterministi s heme: it sends a pa ket every λ−1 se onds where λ is hosen smaller thanthe saturated throughput of a single isolated onne tion.As one an see from this Figure, the mean number of data pa kets waitinggrows with k and seems to stabilize to some asymptoti value b, whi h anbe intuitively thought of as the mean stationary buer of a host being atlevel ∞. We dene the spatial Cesaro average as the average number ofdata pa kets waiting in the k rst end-hosts in the hain, divided by k.This seems to onverge for large k to a limit value. Note that, ombinedwith Little's law, this result extends to hara terize an asymptoti delayat innity, experien ed far away from the sour e by a pa ket, to traverse asingle overlay hop. We denote it by d and it is dened in a rigorous way inthe next se tion.Figure 2.12 studies the sensitivity of the asymptoti average number ofpa kets w.r.t. 
the distribution of the aggregated servi e time, and the pa ketloss probability. It shows dierent urves that give the spatial Cesaro aver-age, estimated after 100 su essive overlay hops, as a fun tion of λ for alladmissible values of λ (i.e. smaller than the lo al saturated throughput).In Figure 2.12 (left), the only dieren e between these four urves is thedistribution fun tion of the aggregated servi es representing the inuen e of ross tra . The lowest urve is that with exponential aggregated servi etimes. The upper urves feature various Pareto distributions with in reasingvariability.As one an he k, this asymptoti number of pa kets waiting is quitesensitive to the variability of ross tra . The heavier the tail is, the morepa kets remain to wait in the buer.
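The chain experiment described above can be illustrated with a much simpler stand-in. The sketch below is not the (max,+) TCP model of this chapter: it assumes that each overlay hop is a single FIFO server with i.i.d. service times and an infinite buffer, fed by a deterministic source emitting one packet every 1/λ seconds. The names `simulate_chain`, `pareto_service` and `cesaro` are hypothetical helpers introduced only for this illustration.

```python
import random


def simulate_chain(num_packets, num_hops, lam, draw_service, seed=0):
    """Mean sojourn time per hop in a tandem of FIFO hops with infinite buffers.

    Simplifying assumption (not the thesis model): hop k serves packets one
    at a time, so the departure time of packet m from hop k is
        D[m][k] = max(D[m][k-1], D[m-1][k]) + service(m, k),
    with D[m][-1] = m / lam (deterministic throttled source).
    """
    rng = random.Random(seed)
    arrivals = [m / lam for m in range(num_packets)]  # creation times t_m
    mean_sojourn = []
    for _ in range(num_hops):
        departures = []
        last_departure = float("-inf")
        total = 0.0
        for m in range(num_packets):
            start = max(arrivals[m], last_departure)
            last_departure = start + draw_service(rng)
            departures.append(last_departure)
            total += last_departure - arrivals[m]
        mean_sojourn.append(total / num_packets)
        arrivals = departures  # output of hop k feeds hop k + 1
    return mean_sojourn


def pareto_service(alpha, mean):
    """Sampler for a Pareto service time with tail index alpha > 1 and given mean."""
    x_min = mean * (alpha - 1) / alpha
    return lambda rng: x_min * rng.paretovariate(alpha)


def cesaro(values):
    """Running averages (v_1 + ... + v_k) / k for k = 1, 2, ..."""
    averages, running = [], 0.0
    for count, value in enumerate(values, start=1):
        running += value
        averages.append(running / count)
    return averages
```

Under these assumptions, multiplying each mean sojourn time by λ gives, by Little's law, an estimate of the mean number of packets held at that hop, and `cesaro` applied to that list gives the spatial average over the first k hops. Swapping an exponential sampler such as `lambda rng: rng.expovariate(1 / mean)` for `pareto_service(2.1, mean)` is one way to explore the effect of heavier tails in this toy setting.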

[Figure 2.11: Average number of data packets waiting in an end-host, as a function of the host level in the tree. Setting: 10 links in an overlay edge, Window Max = 40, Loss Proba = 0.001, Throttle Rate = 30 pkts/sec. x-axis: level of end-host; y-axis: mean number of packets waiting to be served. Curves: in the end-host; spatial Cesaro average.]

Figure 2.12 (right) studies the sensitivity of the same functionals w.r.t. the packet loss probability. The same trend is observed for all these values, but as expected the regions of stability giving the admissible values of λ are different.

[Figure 2.12: Study of the mean buffer occupancy seen as a function of throttling rate: for different laws of service time (left), for various packet loss probabilities (right). Left panel: 10 routers in each overlay edge, Window Max = 40, Loss Proba = 0.001; curves: exponential tail, Pareto tail with parameter 2.5, 2.3 and 2.1. Right panel: 10 links in each overlay edge, Window Max = 40, exponential tail; curves: Loss proba = 0.000001, 0.001, 0.005, 0.01 and 0.02. x-axis: throttling rate (pkts/sec); y-axis: mean number of packets waiting to be served.]

Experiments on Planet-Lab

Saturated Source: In Table 2.3 we present the results of our measurements of throughput and buffer utilization. In this table, the leftmost column contains the symbolic names assigned to the hosts used in the experiments. The indentation in this column describes the structure of the overlay distribution tree, with the first indentation level corresponding to the root, the second to its children, etc. For each non-root node, we list the characteristics of the incoming link to that node (so that each line actually describes a link). We repeated the measurements 10 times, and took the average, minimum and maximum of the measured parameters.

We present in the second column the local throughput of the incoming overlay hop to this end-host, in kilobytes per second; it was measured shortly after or before the multicast diffusion. This local throughput is the one obtained when the previous end-host always has packets to transmit downstream. In addition, transfers on all overlay hops associated with this end-host were started simultaneously, in order to take into account the bandwidth shared on the last links between the different overlay hops.

The last two columns show throughput and buffer utilization measurements, as observed during the global overlay multicast. In this experiment, the memory available to store data in an end-host was not restricted. We report the maximum number of entries used in the buffer located on the upstream node of the link. Each buffer entry corresponds to one 100-byte block. We send 20,000 blocks in total during the experiment. Buffer utilization is measured as the ratio of the maximum number of blocks used in the buffer to the total number of blocks sent in an experiment. Notice that buffer utilization is high at the first nodes, since data is generated at the root node quickly, and almost all blocks are immediately buffered.

As expected, we observe in this table that the group throughput seen by any end-host is equal to the minimum of the local maximum TCP throughputs, taken among the overlay hops belonging to the path from the source to this end-host. One can also observe that the buffer occupancy is quite large in

many buffers. This may be explained by the heterogeneity in the local TCP throughput observed at different locations.

Table 2.3: Large buffers, with saturated source.

Host        Local Throughput   Group Throughput   Buffer Utilization
            (KB/s)             (KB/s)             (%)
b7
asterix-1    201  235  254      147 155 165        98 98 99
ace          356  372  403      147 155 165         0  0  1
edge         231  235  244      147 155 164         0  0  1
asterix-2    186  204  224      146 154 164         3  4  5
ananda-1     341  397  507      147 155 165         0  0  0
umn-1        864  885  900      147 155 164         0  0  1
baobab       103  113  124      113 116 119        31 36 44
fermi-1       31   32   34       22  36  58        60 69 74
berk-1       121  209  309       22  36  58         1  1  1
pisa-1        21   25   28       17  19  21        82 83 83
ucsb-1       721  769  821       17  19  21         1  1  1
cmu-1        667  671  678       17  19  21         1  1  1
berk-2       107  387  555      219 367 558        95 96 99
ucsb-2        65  118  173      135 159 177        27 46 66
cmu-4        538  625  673      134 158 176         0  1  1
ananda-2    1044 1159 1366      134 158 176         0  0  0
dogmatix     219  372  561      134 150 164         0 10 27
umn-2        872  877  888      134 150 164         0  0  0
b8            91  133  165      128 154 186        49 59 69
asterix-3    258  276  308      128 136 146        10 17 27
berk-3        94  161  214      116 125 133         3  4  4
pisa-2       346  483  560      127 135 146         3  3  3
cmu-2        884  905  939      128 154 185         0  1  1
fermi-2      660  690  721      128 154 185         0  0  0

Rate-Control at the Source: Table 2.4 presents the effect of implementing a rate control at the source end-host on the same performance indicators (throughput, buffer utilization). This experiment is identical to the previous one, except that we have introduced a 10-millisecond delay between sending individual 100-byte blocks at the source host. This corresponds to a fixed transmission rate of approximately 10 kilobytes per second. The numbers for link throughput are repeated from the previous table.

Indeed, to collect these two sets of measurements, we ran the two experiments (with and without rate control) in sequence, one after another, until 10 measurements of each experiment were taken. The average time of one experiment ranged from 2 to 5 minutes. By performing measurements immediately one after another, we tried to minimize the effects of network fluctuation as much as possible.

From our experimental measurements, and as could be expected, the rate control mechanism is effective: all the end-hosts now experience the same throughput. Moreover, the buffer occupancy is strikingly low, irrespective of the heterogeneity among the local TCP throughputs seen at different locations. The results for two other configurations are described in [5]; they confirm the observations that we highlight here.

Table 2.4: Large buffers, with fixed rate control.

Host        Link Throughput    Fixed Rate   Buffer Utilization
            (KB/s)             (KB/s)       (%)
b7
asterix-1    201  235  254     10 10 10      0  1  3
ace          356  372  403     10 10 10      0  0  0
edge         231  235  244     10 10 10      0 25 74
asterix-2    186  204  224     10 10 10      0  0  0
ananda-1     341  397  507     10 10 10      0  0  0
umn-1        864  885  900     10 10 10      0  0  0
baobab       103  113  124     10 10 10      0  1  4
fermi-1       31   32   34     10 10 10      0  1  3
berk-1       121  209  309     10 10 10      1  1  1
pisa-1        21   25   28     10 10 10      1  2  3
ucsb-1       721  769  821     10 10 10      1  1  1
cmu-1        667  671  678     10 10 10      1  1  1
berk-2       107  387  555     10 10 10      0  0  0
ucsb-2        65  118  173     10 10 10      0  0  0
cmu-4        538  625  673     10 10 10      0  0  0
ananda-2    1044 1159 1366     10 10 10      0  0  0
dogmatix     219  372  561     10 10 10      0  0  0
umn-2        872  877  888     10 10 10      0  0  0
b8            91  133  165     10 10 10      0  2  6
asterix-3    258  276  308     10 10 10      0  1  3
berk-3        94  161  214     10 10 10      0  0  1
pisa-2       346  483  560     10 10 10      0  1  1
cmu-2        884  905  939     10 10 10      0  0  1
fermi-2      660  690  721     10 10 10      0  0  0

4.4 Proving Latency Scalability

This section explores the scalability of buffer occupancy and latency for an overlay network controlled by TCP, already introduced in Section 3 and further analyzed in Section 4.2. What is different is that we now assume that the buffer used by each end-host is not limited. In other words, this unlimited memory of each host cancels the back-pressure described in Sections 4.1-4.2.

We examine here the conditions under which packets do not accumulate infinitely in a buffer of the overlay. We also characterize the delay experienced by a packet to reach an end-host. Two aspects make this problem difficult. First, a large number of packets may have been already sent in a branch of the tree when a packet is created. As a consequence, we need to construct a steady state regime for the global system. Second, end-hosts in the overlay might be very far away from the source, and the delay might grow fast with the size of the overlay.

Formulation with Pattern Grid

By default, we study here the homogeneous case, but most of the results we prove can be extended to a heterogeneous case.

As a first remark, already made informally in Section 4.3, using infinite buffers has the following consequence: any end-host receives packets independently of any host that is not on its path from the root. In other words, each branch of the tree is isolated, and each can be represented as a single infinite chain of end-hosts k = 0, 1, 2, ..., which follows the local behavior of the one-to-many TCP overlay with locally infinite buffers.

Each packet m is processed at each host k in several steps represented by the elements of $\mathcal{H} = \{\mathrm{beg}, 1, 2, \ldots, H, \mathrm{end}\}$. We generally denote these processing times by

$$T = \left\{ T_{(m,k)\times h} \;\middle|\; (m,k) \in \mathbb{Z}^2,\ h \in \mathcal{H} \right\}.$$

It is easy to see from Section 3 that this time-valued process follows a uniform recurrence system described by a pattern grid on $\mathbb{Z}^2 \times \mathcal{H}$. As shown in Section 4.2, this pattern grid is totally ordered and its support (i.e. the graph containing all edges that have positive probability) is sharp.

We introduce, for any packet number m, the time $t_m$ at which it is created by the source. The process $t = \{t_m \mid m \in \mathbb{Z}\}$ is supposed to be compatible with a measure-preserving flow $\theta_t$ (see p.11 in [3]), with intensity λ. A saturated input (i.e. $t_m = 0$ for any m) is denoted by λ = ∞ (note that in this case, the process is not ergodic).

Immediately after a packet is created at the source, it may be transmitted from the output buffer of the first connection (denoted by k = 0, h = beg). We can then define a stationary regime for T as the smallest solution of a recurrence system that includes boundary conditions defined by t, for vertices $\{(m, 0) \times \mathrm{beg} \mid m \in \mathbb{Z}\}$, and the recurrence of the pattern grid (see (3.11) in Section 3.2 of Chapter 3).

The Necessity of a Rate Control at the Source

The first result of this section proves that no steady state may be found if the rate at which the source sends data is not set to a proper finite value. More precisely, it should be smaller than the saturated throughput of an isolated TCP connection.

The velocity along dimension 1 in h is defined for all $h \in \mathcal{H}$ by:

$$\frac{1}{\mathrm{Velo}_1(h)} = \lim_{m\to\infty} \frac{\mathrm{Per}^{[-\infty\times 0]}_{(m,0)\times h}}{m} = \lim_{m\to\infty} \frac{T_{(m,0)\times h}}{m} \quad \text{when } \lambda = \infty.$$

We have, for all m, that the following path exists:

$$\pi : (m, 0) \times \mathrm{end} \to (m, 0) \times H \to \ldots \to (m, 0) \times 1 \to (m, 0) \times \mathrm{beg}.$$

We can deduce that $\min\{\mathrm{Velo}_1(h) \mid h = \mathrm{beg}, 1, \ldots, H, \mathrm{end}\} = \mathrm{Velo}_1(\mathrm{end})$, and that this defines the saturated throughput of the first TCP connection (associated with index k = 0).

It follows from Corollary 4 (ii), shown in Chapter 3, that

(2.3) if $\lambda > \mathrm{Velo}_1(\mathrm{end})$, then $\forall m\ \forall k \ge 0,\ T_{(m,k)\times \mathrm{end}} = +\infty$ a.s.

As a consequence, when packets start to be sent from index m = 0, according to process t with such a rate, the above result proves that the delay they experience in the first connection grows in distribution to infinity. This is caused by packets accumulating in the intermediate buffers. To avoid such instability, a proper value of the rate λ has to be chosen.

• It is not difficult to prove that, with this notation, $\lambda < \mathrm{Velo}_1(\mathrm{end})$ is a sufficient condition for the existence of a finite steady state. The proof may be found in [5]. The steady state regime is constructed inductively for k = 0, 1, 2, ..., as the processing of the different overlay hops is entirely disjoint. We do not prove this result here, as most of it is contained in the one shown in the next section.

• Both of these results (necessary and sufficient conditions) can be extended to the heterogeneous case. The stability threshold is then given by the saturated output rate corresponding to the slowest isolated connection.

Scalability of Latency with Rate Control

We assume here that the distribution of the aggregated service times verifies

$$\int_0^{+\infty} P(\sigma \ge u)^{1/2}\,du < \infty.$$

Under this condition, we know by Theorem 9, shown in Chapter 3, that hydrodynamic scaling functions $\gamma_1$, $\gamma_2$ may be defined, together with $\mathrm{Thres} = \gamma_2(0^+)$.

A technical difficulty is that Thres may in general be smaller than $\mathrm{Velo}_1(\mathrm{end})$. We expect them to be equal in this case, but could not find a complete proof of it. As a consequence, we assume that the rate λ verifies λ < Thres, which may a priori be more restrictive than $\lambda < \mathrm{Velo}_1(\mathrm{end})$.

Under this condition, Theorem 10 can be applied. It shows first that almost surely the variables defining T are all finite. It proves the existence of a global stationary regime for all end-hosts. Note that the distribution of this process is also a bound, for the stochastic ordering, of the delay experienced by a packet in a transient system started with packet m = −M.

Moreover, we have the following a.s. convergence applying to the long-range cumulated latency:

(2.4) $$\lim_{K\to\infty} \frac{T_{(0,K)\times \mathrm{beg}}}{K} = d = \sup_{x\in\mathbb{R}} \left( \gamma_1(x) - \frac{x}{\lambda} \right) < \infty.$$

This result should be interpreted as follows: when the depth of the overlay distribution tree grows large, the latency of a packet from the source to some host grows linearly with the depth of the host. This latency has an asymptotic increment of d per overlay hop, where d is a finite constant.

The value of the constant d depends on the hydrodynamic limit $\gamma_1(x)$ associated with this pattern grid. It is in general difficult to characterize. To the best of our knowledge, the explicit form of this function is only known in the particular case with constant window $W^{(k)}_m \equiv 1$, with $H_k = 1$, and with an exponential weight (see [2]).

Figure 2.13 (left) gives an estimation of the $\gamma_1$ function for different distributions of the aggregated service time. Figure 2.13 (right) plots two evaluations of d as a function of the throttling rate λ. The aggregated service time was, in this case, chosen exponential. The first curve presents d using the previous estimation of $\gamma_1$ and (2.4); the second one evaluates d directly by simulation, as the average stationary sojourn time at infinity. The match seems good, as long as the throttling rate is not taken too close to the critical value.
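The first of these two evaluations, reading (2.4) as d = sup over x of (γ1(x) − x/λ), can be sketched numerically once γ1 has been estimated on a grid of abscissas. The snippet below is only a minimal illustration under that reading: the function name `latency_per_hop` and the toy curve `toy_gamma1` are hypothetical and not taken from the thesis, and replacing the supremum by a maximum over finitely many sample points is an approximation.

```python
import math


def latency_per_hop(gamma1_samples, lam):
    """Approximate d = sup_x (gamma1(x) - x / lam) from sampled points.

    gamma1_samples: iterable of (x, gamma1(x)) pairs, e.g. read off an
    estimated curve; lam: throttling rate of the source (packets/sec).
    The supremum is replaced by a maximum over the sample points.
    """
    if lam <= 0:
        raise ValueError("the throttling rate must be positive")
    return max(g - x / lam for x, g in gamma1_samples)


def toy_gamma1(x):
    """Hypothetical concave curve used only to exercise the formula."""
    return 2.0 * math.sqrt(x)
```

For the toy curve, the maximum of 2√x − x/λ is reached at x = λ² and equals λ, so `latency_per_hop([(i / 100, toy_gamma1(i / 100)) for i in range(401)], 1.0)` returns a value of about 1. With a real estimate of γ1, the quality of the approximation depends on how finely the curve is sampled around the maximizer.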

[Figure 2.13: An example of hydrodynamic function for the saturated system (left). Mean sojourn time at infinity by two different methods (right). Left panel: 10 links in each overlay edge, Window Max = 40, Loss Proba = 0.02; x-axis: abscissa x; y-axis: value of Gamma(x); curves: light tailed, Pareto 2.5, Pareto 2.3. Right panel: 10 links in an overlay edge, Window Max = 40, Loss Proba = 0.001; x-axis: throttling rate (pkts/sec); y-axis: mean sojourn time at infinity (sec); curves: given by analysis, given by simulations.]

What is interesting for us is that the constant d yields an upper bound on the end-to-end latency in the communication of an overlay that is linear in the depth of the tree, independently of the total group size and of the amount of information sent.

Some Extensions

• We can easily show that (2.4) holds when beg is replaced by any other $h \in \mathcal{H}$, as we have $T_{(0,K)\times \mathrm{beg}} \le T_{(0,K)\times h} \le T_{(0,K+1)\times \mathrm{beg}}$.

• Let us define $D_{m,k} = T_{(m,k)\times \mathrm{end}} - T_{(m,k-1)\times \mathrm{end}}$, the delay of a packet m to cross the overlay hop k, after going through hop (k − 1). To handle k = 0, we fix by convention $T_{(m,-1)\times \mathrm{end}} = t_m$. The convergence (2.4) can be interpreted as a law of large numbers, when K grows large, for the sequence of these delays:

$$\lim_{K\to\infty} \frac{1}{K} \sum_{k=0}^{K} D_{m,k} = d, \quad \text{true for all } m.$$

• Note that for any k, the sequence $\{D_{m,k} \mid m \in \mathbb{Z}\}$ might be seen as the sojourn time of a packet in overlay hop k, whereas the process $\{T_{(m,k-1)\times \mathrm{end}} \mid m \in \mathbb{Z}\}$, with intensity λ, can be interpreted as a packet arrival process in this hop. Both are stationary, compatible with the flow defined by the original input process t.

Let $B_{m,k}$ denote the buffer occupancy in overlay hop k, seen when packet m arrives in the buffer (k, beg). By definition, it includes all the packets buffered in the k-th TCP connection (in end-hosts, in routers) from host k − 1 to host k. By Little's law (see p.186 in [3]) we have

$$\forall m\ \forall k \ge 0,\quad E[B_{m,k}] = \lambda\, E[D_{m,k}] \quad \text{and} \quad \lim_{K\to\infty} \frac{1}{K} \sum_{k=0}^{K} E[B_{m,k}] = \lambda d.$$

This shows that if the source emits packets at rate λ, the mean buffer occupancy in steady state converges, in the sense of Cesaro, to a finite constant. In other words, the mean cumulated number of packets present between host k = 0 and host k = K grows linearly with K, with increment λd.
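As a purely illustrative order of magnitude for this last statement, one may plug in numbers. The values below are assumptions chosen for the example, not measurements from the thesis: a throttling rate λ = 30 packets per second (the rate used for Figure 2.11) and a hypothetical asymptotic delay d = 0.2 seconds per hop.

```latex
% Hypothetical values: \lambda = 30 packets/s, d = 0.2 s per overlay hop.
\[
  \lambda d = 30 \times 0.2 = 6 \ \text{packets per overlay hop (Cesaro average)},
\]
\[
  \sum_{k=0}^{K} E[B_{m,k}] \approx K \lambda d = 100 \times 6 = 600
  \ \text{packets for } K = 100 \text{ hops}.
\]
```

Under these assumed values, doubling the depth of the chain roughly doubles the total number of packets in flight, while the average held per hop stays near 6.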

Bibliography

[1] M. Allman, V. Paxson, and W. Stevens, 1999.
(See reference in Chap. 0).

[2] F. Baccelli, A. Borovkov, and J. Mairesse, 2000.
(See reference in Chap. 3).

[3] F. Baccelli and P. Bremaud. 2003.
(See reference in Chap. 3).

[4] F. Baccelli, A. Chaintreau, Z. Liu, and A. Riabov. The one-to-many TCP overlay: A scalable and reliable multicast architecture. In Proceedings of IEEE INFOCOM, 2005. (Extended version available as INRIA Research Report 5241 at http://www.inria.fr/rrrt/rr-5241.html)
(This article studies the case of an overlay network controlled by TCP with a finite buffer in any host and back pressure; it proves the scalability of throughput for light tailed aggregated service times).

[5] F. Baccelli, A. Chaintreau, Z. Liu, A. Riabov, and S. Sahu. Scalability of reliable group communication using overlays. In Proceedings of IEEE INFOCOM, 2004. (Extended version available as INRIA Research Report number 4895 at http://www.inria.fr/rrrt/rr-4895.html)
(This article analyzes an overlay network controlled by TCP with an infinite buffer in any host; it proves the scalability of latency and gives the expected asymptotic delay using a hydrodynamic limit).

[6] F. Baccelli, G. Cohen, G. Olsder, and J.P. Quadrat. Synchronization and Linearity. Wiley, 1992. (Out of print. An electronic version can be freely downloaded at http://www-rocq.inria.fr/metalau/cohen/SED/book-online.html)
(A reference textbook describing the max-plus framework to analyze discrete-event systems).

[7] F. Baccelli and D. Hong. 2000.
(See reference in Chap. 0).

[8] T. Ballardie, P. Francis, and J. Crowcroft. Core based trees (cbt). In SIGCOMM '93: Conference proceedings on Communications architectures, protocols and applications, pages 85-95, New York, NY, USA, 1993. ACM Press.
(Advocates the use of a core router in the multicast tree. It simplifies the tree building, avoiding costly pruning (DVMRP) or control message exchange (MOSPF). It extends to multi-source multicast delivery).

[9] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable application layer multicast. In SIGCOMM '02: Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications, pages 205-217. ACM Press, 2002.
(Presents NICE, a layered clustering technique to build and maintain overlay tree distribution. It is shown to have similar performance to Narada [20] while requiring the N hosts to keep and refresh on average a constant number of states, with a maximum of O(log N) for a few of them).

[10] S. Banerjee, S. Lee, B. Bhattacharjee, and A. Srinivasan. Resilient multicast using overlays. In SIGMETRICS '03: Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 102-113. ACM Press, 2003.
(Introduces Probabilistic Resilient Multicast (PRM), which randomly recovers from packet losses in an overlay network through random redundant forwarding. It is shown to increase data delivery ratios. A heuristic argument is given to show that recovery can be performed for a large proportion of nodes in logarithmic time).

[11] S. Bhattacharyya, J. F. Kurose, D. F. Towsley, and R. Nagarajan. Efficient rate-controlled bulk data transfer using multiple multicast groups. IEEE/ACM Trans. Netw., 11(6):895-907, 2003. (First version appeared at INFOCOM'98)
(Extends layered multicast from real time delivery to file transfer using an efficient nested scheduling scheme between layers).

[12] S. Bhattacharyya, D. F. Towsley, and J. F. Kurose. The loss path multiplicity problem in multicast congestion control. In Proceedings of IEEE INFOCOM, pages 856-863, 1999.
(Shows that the loss probability of a packet in an IP-multicast distribution tree is overestimated, compared to a line with similar performance. It shows in particular that the TCP window adaptation performance decreases as the group expands).

[13] K. P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Trans. Comput. Syst., 17(2):41-88, 1999.
(Advocates the use of two alternating phases for reliable IP multicast: one classical IP multicast delivery, followed by a repair phase where messages are exchanged between receivers according to a random epidemic algorithm).

[14] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In SIGCOMM '98: Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication, pages 56-67. ACM Press, 1998.
(This paper advocates the use of Tornado code for reliable multicast delivery, to bypass traditional error recovery even for long files. Packet schedulers for single and multi-rate are presented to minimize latency and duplicate packets).

[15] S. Casner and S. E. Deering. First IETF internet audiocast. ACM Computer Communication Review, 22(3):92-97, 1992.
(Describes the IP-multicast and manual tunneling and rate control architecture deployed for the first IETF audio broadcast in 1992).

[16] A. Chaintreau, F. Baccelli, and C. Diot. Impact of TCP-like congestion control on the throughput of multicast groups. IEEE/ACM Trans. Netw., 10(4):500-512, 2002.
(Proves a limit of window flow control implemented on a multicast delivery tree with N receivers and aggregated feedback: the throughput decreases as 1/log(N) as an effect of light tailed delay variations).

[17] Y. Chawathe, S. McCanne, and E. A. Brewer. RMX: reliable multicast for heterogeneous networks. In Proceedings of IEEE INFOCOM, volume 2, pages 785-804, 2000.
(Advocates a two level hierarchy in reliable multicast. At the higher level, the proxies organize themselves in an overlay network using TCP unicast. At the lower level, each proxy transmits data to a local delivery group through IP-multicast. Receiver heterogeneity is addressed by each proxy using application level semantics sent in the data).

[18] Y. D. Chawathe. Scattercast: an adaptable broadcast distribution framework. Multimedia Syst., 9(1):104-118, 2003. (Previously published in the author's PhD dissertation in 2000)
(Describes Gossamer, a variation of Narada [20], to build an overlay of proxies that serve local delivery groups using IP-multicast as a last step).

[19] Y. Chu, S. Rao, S. Seshan, and H. Zhang. Enabling conferencing applications on the internet using an overlay multicast architecture. In SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pages 55-67. ACM Press, 2001.
(Demonstrates that real time unreliable delivery to small groups is feasible through Application Layer Multicast. Based on a variation of Narada, this paper shows that both delay and bandwidth need to be taken into account to build efficient overlay distribution trees).

[20] Y. Chu, S. G. Rao, and H. Zhang. A case for end system multicast (keynote address). In SIGMETRICS '00: Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 1-12. ACM Press, 2000.
(Advocates the implementation of multicast delivery functions inside end systems, rather than in the network's routers, using only unicast service. Presents Narada, which builds an overlay distribution tree using a distance vector protocol on top of a partial overlay mesh).

[21] Y. K. Dalal and R. M. Metcalfe. Reverse path forwarding of broadcast packets. Commun. ACM, 21(12):1040-1048, 1978.
(Describes an efficient method to compute the shortest path broadcast tree on top of unicast routing. It is not optimal in the case of a network with asymmetric delays; nevertheless it offers a very good candidate, much easier to implement than previous delivery methods).

[22] P. B. Danzig. Flow control for limited buffer multicast. IEEE Trans. Softw. Eng., 20(1):1-12, 1994.
(Describes an optimal multi-round back-off timer setting, implemented by receivers, to minimize the latency for broadcasting and gathering answers from all receivers, with a finite buffer in the source that may overflow).

[23] S. Deb and R. Srikant. Congestion control for fair resource allocation in networks with multicast flows. IEEE/ACM Trans. Netw., 12(2):274-285, 2004.
(Proposes a marking strategy, similar to [34], with minimal state in the routers to achieve fair allocation in a multirate multicast and unicast shared network).

[24] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C.G. Liu, and L. Wei. An architecture for wide-area multicast routing. In SIGCOMM '94: Proceedings of the conference on Communications architectures, protocols and applications, pages 126-135, New York, NY, USA, 1994. ACM Press.
(Describes Protocol Independent Multicast (PIM), which extends CBT [8] in two ways: allowing a local multicast protocol to build an efficient tree in a given domain (LAN), and using a Rendez Vous point as bootstrap core router, to improve the (WAN) distribution shared tree into a source-specific distribution tree by explicit signaling toward the source).

[25] S. E. Deering and D. R. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Trans. Comput. Syst., 8(2):85-110, 1990. (First version appeared in SIGCOMM'88)
(This paper presents different routing protocols to build multicast trees. One of them is an extension of RPF [21] to build dynamic multicast trees, using restricted per source broadcast and pruning).

[26] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In PODC '87: Proceedings of the sixth annual ACM Symposium on Principles of distributed computing, pages 1-12. ACM Press, 1987.
(This landmark article introduces randomized algorithms to guarantee that every update is eventually reflected in a large number of database replicas).

[27] H. Eriksson. Mbone: the multicast backbone. Communications of the ACM, 37(8):54-60, 1994.
(A report on the design of, and experience with, the multicast backbone: application, addressing, routing and tunnelling, some bugs in routers' implementation. This paper also discusses the danger of severe global congestion created by multicast sessions).

[28] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Trans. Netw., 5(6):784-803, 1997. (First version appeared in SIGCOMM'95)
(Presents the Scalable Reliable Multicast (SRM) protocol, which addresses multicast error recovery through receiver initiated retransmission. Feedback storms are avoided through randomized timers, as in [22]; RTT distance estimates enable a form of local recovery).

[29] S. J. Golestani and K. K. Sabnani. Fundamental observations on multicast congestion control in the internet. In Proceedings of IEEE INFOCOM, pages 990-1000, 1999.
(Proposes a receiver based implementation of window flow control for multicast delivery, to cope with heterogeneous capacities and delays between branches of the tree. This architecture, which can be implemented through feedback aggregation, is shown to be TCP friendly in a deterministic model).

[30] I. Gupta, A.M. Kermarrec, and A. J. Ganesh. Efficient epidemic-style protocols for reliable and scalable multicast. In SRDS '02: Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems (SRDS'02), page 180. IEEE Computer Society, 2002.
(Proposes a hierarchical adaptive organization improving conventional gossiping techniques for large scale multicast applications, to reduce the data sent across several domains and adapt to various packet loss probabilities).

[31] R. Gusella and S. Zatti. An election algorithm for a distributed clock synchronization program. In Proceedings of the IEEE 6th International Conference on Distributed Computing Systems, pages 364-371, 1986.
(Proposes the use of randomized timers to select a leader and avoid conflicts; this article inspired similar techniques to prevent feedback storms in reliable multicast).

[32] J. Jannotti, D. K. Gifford, K. L. Johnson, M. F. Kaashoek, and J. O'Toole Jr. Overcast: Reliable multicasting with an overlay network. In Proceedings of the 4th Symposium on Operating System Design and Implementation (OSDI), pages 197-212, San Diego, CA, 2000. USENIX.
(Describes a self-organizing overlay of server replicas (nodes with permanent storage of a given file). The overlay distribution tree is built via a local operation performed by each end-system: move as deep as possible in the overlay tree while not decreasing its estimated bandwidth).

[33] K. Jeacle and J. Crowcroft. TCP-XM: Unicast-enabled reliable multicast, 2005. (To be published; more details at http://www.cl.cam.ac.uk/users/kj234/)
(Presents a mechanism, instantly deployed in user-space, for reliable efficient transfer of files to a medium size group of destinations. It makes transparent use of the network's multicast capability whenever possible, switching seamlessly to unicast if needed).

[34] K. Kar, S. Sarkar, and L. Tassiulas. A scalable low overhead rate control algorithm for multirate multicast sessions. IEEE Journal of Selected Areas in Communication (Special issue in Network Support for Multicast Communications), 20(8):1541-1557, 10 2002.
(Formalizes the multicast multirate control problem as a global utility maximization. This problem is shown to be more complex than the corresponding unicast case, but it may be generally obtained by a decentralized algorithm based on packet marking in routers).

[35] S. Keshav. 1991.
(See reference in Chap. 0).

[36] G.I. Kwon and J. W. Byers. Smooth multirate multicast congestion control. In INFOCOM, 2003.
(Advocates the use of several single-rate multicast groups, controlled by TFMCC [64] and adaptive membership, to achieve scalable multi-rate flow control in reliable multicast).

[37] G.U. Kwon and J.W. Byers. ROMA: Reliable overlay multicast with loosely coupled TCP connections. In Proceedings of IEEE INFOCOM, volume 1, pages 385-395, 2004.
(Advocates the use of Tornado code for reliable content delivery in an overlay network with finite intermediary buffers, with a per hop drop in case of intermediate overflow. It is said to achieve the equivalent of reliable multi-rate delivery ([14]) in the context of Application Layer Multicast).

[38] S. Liang and D. R. Cheriton. TCP-SMO: Extending TCP to support medium-scale multicast applications. In Proceedings of IEEE INFOCOM, 2002.
(Proposes a modification of the TCP kernel implementation to control multicast delivery for a small or moderate group. It is based on subscription and unicast feedback channels, with conservative flow control).

[39] J. Liebeherr, M. Nahas, and Si Weisheng. Application-layer multicasting with Delaunay triangulation overlays. IEEE Journal on Selected Areas in Communications, 20:1472-1488, 2002.
(Presents an overlay maintenance technique based on Delaunay Triangulation (DT). It benefits from the natural properties of DT: distributed formation and maintenance, small average number of neighbors, guarantee of success for compass routing, and easy building of a spanning tree).

[40] D. A. Maltz and P. Bhagwat. TCP splicing for application layer proxy performance. Research Report RC 21139, IBM, March 1998. (Also published in Journal of High Speed Network vol.8 n.3 pp.235-240 in 1999)
(Proposes a kernel modification connecting several TCP connections in cascade efficiently, to enhance the performance of web servers).

[41] J. Martin, 2002.
(See reference in Chap. 3).

[42] S. McCanne, V. Jacobson, and M. Vetterli. Receiver-driven layered multicast. In SIGCOMM '96: Proceedings of the ACM SIGCOMM '96 conference on Applications, technologies, architectures, and protocols for computer communication, volume 26,4, pages 117-130, New York, NY, USA, August 1996. ACM Press.
(Advocates the use of several multicast groups, with fixed rate and layered source coding, for real time multicast delivery. Presents Receiver-driven Layered Multicast (RLM): the receivers control their rate by joining and leaving the multicast groups, through distributed decision, trying to gain from others' attempts).

[43] J. J. Metzner. An improved broadcast retransmission protocol.
IEEETransa tions on Communi ations, 32(6):679683, 6 1984.(Introdu es Forward Error Corre ting ode to improve the e ien y ofa one-to-many data transmission with lossy hannels).[44 J. Nonnenma her, E. Biersa k, and D. F. Towsley. Parity-based loss re- overy for reliable multi ast transmission. In SIGCOMM '97: Pro eed-ings of the ACM SIGCOMM '97 onferen e on Appli ations, te hnolo-gies, ar hite tures, and proto ols for omputer ommuni ation, pages

Page 153: baccelli/Evaluation/AugustinChaintreauPhD.pdf

BIBLIOGRAPHY 143289300, New York, NY, USA, 1997. ACM Press.(Des ribes the use of FEC in the ontext of reliable multi ast omputer ommuni ation. It is shown that a layered approa h where FEC is trans-parent to the appli ation, is outperformed by an integrated transportlevel FEC-NACK approa h, espe ially in the ontext of losses o urringin bursts).[45 S. Paul, K. K. Sabnani, J. C.H. Lin, and S. Bhatta haryya. Reliablemulti ast transport proto ol (RMTP). IEEE Journal of Sele ted Areasin Communi ations, 15(3):407421, 1997. First version appeared in IN-FOCOM'96(This proto ol, introdu ed for reliable multi ast delivery, avoids A kimplosion by using sele tive periodi a knowledgments, aggregatedthrough a xed hierar hy of A k Pro essors (AP). APs are requiredto a he all data transmitted. A window ow ontrolled is implementedat the sour e, that adapts to the slowest re eiver).[46 D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel. ALMI: An appli- ation level multi ast infrastru ture. In Pro eedings of the 3rd USENIXSymposium on Internet Te hnologies and Systems (USITS), pages 4960, 2001.(Advo ates a entralized ar hite ture for Appli ation Layer Multi ast,that builds a minimum spanning overlay tree minimizing delays for asmall group of end-systems. Syn hronization a knowledgments are pe-riodi ally transmitted along the whole tree, as in [49, to avoid interme-diate buer overows).[47 C. G. Plaxton, R. Rajaraman, and A. W. Ri ha. A essing nearby opies of repli ated obje ts in a distributed environment. In SPAA '97:Pro eedings of the ninth annual ACM symposium on Parallel algorithmsand ar hite tures, pages 311320, New York, NY, USA, 1997. ACMPress.(This arti le proves that a large number N of hosts an self-organize toe iently distribute les between themselves: the amount of state, themaintenan e ost of nodes/obje ts are all upper bounded by O(log2 N),while the a ess ost follows a onstant stret h from the optimal).[48 P. Radoslavov, C. Papadopoulos, R. Govindan, and D. Estrin. 
A om-parison of appli ation-level and router-assisted hierar hi al s hemes forreliable multi ast. IEEE/ACM Trans. Netw., 12(3):469482, 2004.(Dis usses the metri of implosion (number of feedba k pa kets re eived)and exposure (retransmitted pa kets that are not ne essary) with dif-ferent types of hierar hy. End-systems hierar hy is shown to performreasonnably well when ompared to hierar hy that needs routers sup-port).

Page 154: baccelli/Evaluation/AugustinChaintreauPhD.pdf

144 BIBLIOGRAPHY[49 B. Rajagopalan. Reliability and s aling issues in multi ast ommuni- ation. In SIGCOMM '92: Conferen e pro eedings on Communi ationsar hite tures & proto ols, pages 188198. ACM Press, 1992.(Proposes a reliable multi ast delivery s heme implemented via per-hopwindow ow ontrol, and periodi re overy, implemented between net-work elements).[50 S. Ratnasamy, P. Fran is, M. Handley, R. Karp, and S. Shenker. A s al-able ontent-addressable network. In SIGCOMM '01: Pro eedings of the2001 onferen e on Appli ations, te hnologies, ar hite tures, and proto- ols for omputer ommuni ations, pages 161172. ACM Press, 2001.(This paper presents CAN, a self-organized overlay embedding a set ofnodes in a virtual oordinate spa e of dimension d, through a numberof zone splits and merges. Content to be stored is hashed and storeda ording to its asso iated zone. Routing is performed via greedy movea ording to virtual oordinate, and is proved to perform on average inO(N1/d) steps).[51 S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Topologi ally-aware overlay onstru tion and server sele tion. In Pro eedings of IEEEINFOCOM, 6 2002.(Proposes an improvement of CAN to guarantee ongruen e of virtualand physi al topology: zones are allo ated using estimation of networkdistan e to a ommon set of landmarks).[52 S. Ratnasamy, M. Handley, R. M. Karp, and S. Shenker. Appli ation-level multi ast using ontent-addressable networks. In NGC '01: Pro- eedings of the Third International Workshop on Networked Group Com-muni ation, pages 1429. Springer-Verlag, 2001.(This proposes to use a dedi ted ontent distribution ar hite ture asCAN [50 for multi ast dissemination. Data is sent using ooding basedon virtual oordinates, with minimal state to avoid dupli ates).[53 L. Rizzo. PGMCC: a TCP-friendly single-rate multi ast ongestion ontrol s heme. In SIGCOMM '00: Pro eedings of the onferen e onAppli ations, Te hnologies, Ar hite tures, and Proto ols for ComputerCommuni ation, pages 1728. 
ACM Press, 2000.(Des ribes a window ow ontrol for one-to-many multi ast ommuni- ation, based on ele tion of a group representative among re eiver andaggregated NACK feedba k for reliability. The key me hanism is the hange of the representative re eiver in ase of new onditions observedon its path).[54 L. Rizzo and L. Vi isano. RMDP: an FEC-based reliable multi astproto ol for wireless environments. SIGMOBILE Mob. Comput. Com-mun. Rev., 2(2):2331, 1998.

Page 155: baccelli/Evaluation/AugustinChaintreauPhD.pdf

BIBLIOGRAPHY 145(Presents a reliable multi ast proto ol transmission with a redundan yat the sour e oupled with a pa kets ounts request from re eivers. Itis low demanding on the re eiving host, to adapt well to mobile nodeswith low omputation apabilities).[55 A. I. T. Rowstron and P. Drus hel. Pastry: S alable, de entralized ob-je t lo ation, and routing for large-s ale peer-to-peer systems. In Mid-dleware 2001: Pro eedings of the IFIP/ACM International Conferen eon Distributed Systems Platforms Heidelberg, pages 329350, London,UK, 2001. Springer-Verlag.(This self organized overlay network applies the theoreti al frameworkof Plaxton et al. [47 to onstru t a s alable robust Content DistributionNetwork exhibiting lo ality property).[56 A. I. T. Rowstron, A.M. Kermarre , M. Castro, and P. Drus hel. S ribe:The design of a large-s ale event noti ation infrastru ture. In NGC'01: Pro eedings of the Third International COST264 Workshop onNetworked Group Communi ation, pages 3043, London, UK, 2001.Springer-Verlag.(Proposes a distributed algorithm to build a ore based overlay distri-bution tree on the overlay network maintained by Pastry, in the ontextof publish/susbs ribe group).[57 D. Rubenstein, J. F. Kurose, and D. F. Towsley. The impa t of mul-ti ast layering on network fairness. In SIGCOMM '99: Pro eedings ofthe onferen e on Appli ations, te hnologies, ar hite tures, and proto- ols for omputer ommuni ation, pages 2738. ACM Press, 1999.(Dis usses issues of fairness for allo ating rate on multi ast tree ; itshows that max−min fairness exhibits paradoxi al properties as a on-sequen e from the solidarity of a single rate s heme. By opposition, thebehavior of a multi-rate s heme seems easier to understand and to usein pra ti e).[58 J. H. Saltzer, D. P. Reed, and D. D. Clark, 1984.(See referen e in Chap. 0).[59 S. Shi and J. Turner. Multi ast routing and bandwidth dimensioning inoverlay networks. IEEE Journal on Sele ted Areas in Communi ations,20:14441455, 2002. 
(First version appeared at INFOCOM'02)(Shows that nding optimal distribution tree built on overlay networkis NP omplete for various obje tives related to delay and degree on-straint. Evaluates approximation obtained using entralized greedy al-gorithms).[60 I. Stoi a, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.Chord: A s alable peer-to-peer lookup servi e for internet appli ations.

Page 156: baccelli/Evaluation/AugustinChaintreauPhD.pdf

146 BIBLIOGRAPHYIn SIGCOMM '01: Pro eedings of the 2001 onferen e on Appli ations,te hnologies, ar hite tures, and proto ols for omputer ommuni ations,pages 149160, New York, NY, USA, 2001. ACM Press.(Propose a robust simplied implementation of Plaxton's algorithm tohandle node joins and leaves, with an appli ation to sour e dis overy inpeer-to-peer networks).[61 G. Urvoy-Keller and E. W. Biersa k. A ongestion ontrol model formulti ast overlay networks and its performan e. In NGC '02: Pro eed-ings of the Fourth International Workshop on Networked Group Com-muni ation, 10 2002.(Proposes to implement ba k-pressure between hosts in an overlay toinsure a reliable group ommuni ation. The authors established by sim-ulations that the throughput is less sensitive to group's size up to 64re eivers, even with intermediate nite buers).[62 L. Vi isano, L. Rizzo, and J. Crow roft. TCP-like ongestion ontrolfor layered multi ast data transfer. In Pro eedings of IEEE INFOCOM,pages 9961003, 1998.(Improves the adaptation me hanism of RLM [42 by syn hronizing re- eivers' behavior with periodi burst of data ; parameters are hosen asto mimi k AIMD adaptation).[63 H. A. Wang and M. S hwartz. A hieving bounded fairness for multi astand TCP tra in the internet. In SIGCOMM '98: Pro eedings of theACM SIGCOMM '98 onferen e on Appli ations, te hnologies, ar hi-te tures, and proto ols for omputer ommuni ation, pages 8192, NewYork, NY, USA, 1998. ACM Press.(Proposes essential fairness as an obje tive for multi ast ongestion ontrol: keeping a onstant bound on the ratio of rates obtained dividedby the rate of a TCP ow. A random adaptive algorithm is proposed toa hieve this goal).[64 J. Widmer and M. Handley. Extending equation-based ongestion on-trol to multi ast appli ations. SIGCOMM Comput. Commun. 
Rev.,31(4):275285, 2001.(In this paper, the equation based rate ontrol is adapted to multi astone-to-many ommuni ation, using a areful re eiver fair rate estima-tion and randomized timer feedba k te hnique).[65 B. Zhang, S. Jamin, and L. Zhang. Host multi ast: A framework fordelivering multi ast to end users. In Pro eedings of IEEE INFOCOM,volume 3, pages 13661375, 2002.(Presents a variation of Over ast tree building, HTMP, that aims atminimizing end-to-end round trip delay. This algorithm is in luded in an

Page 157: baccelli/Evaluation/AugustinChaintreauPhD.pdf

BIBLIOGRAPHY 147overlay ar hite ture using IP-multi ast as the latest step when possiblefor lo al delivery group).[66 X. Zhang, J. Liu, B. Li, and T.-S. P. Yum. DONet/CoolStreaming: Adata-driven overlay network for live media streaming. In Pro eedings ofIEEE INFOCOM, 2005.(An overlay ommuni ation ar hite ture for video streaming based onepidemi dissemination: blo ks of video are randomly distributed be-tween peers, following an e ient deadline s heduler implemented lo- ally).[67 B. Y. Zhao, J. D. Kubiatowi z, and A. D. Joseph. Tapestry: A resilientglobal-s ale overlay for servi e deployment. IEEE Journal on Sele tedAreas in Communi ations, 22(1):4153, 1 2004. (Previously publishedas Te h Report UCB/CSD-01-1141 in 2001)(Presents an implementation of Plaxton et al. theoreti al frameworkto implement a ontent distribution network among Internet end-hosts,with a degree of lo ality).[68 H. Zheng, E. Keong Lua, M. Pias, and T.G. Grin. Internet routingpoli ies and round-trip-times. In Pro eedings of the 6th InternationalWorkshop on Passive and A tive Network Measurement (PAM 2005),pages 236250, 2005.(Identies triangular inequality violations of the RTT as a widespread onsequen e of Internet routing poli ies.).[69 S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, and J. D. Kubia-towi z. Bayeux: an ar hite ture for s alable and fault-tolerant wide-areadata dissemination. In NOSSDAV '01: Pro eedings of the 11th inter-national workshop on Network and operating systems support for digitalaudio and video, pages 1120, New York, NY, USA, 2001. ACM Press.(Proposes to leverage the routing apability of a Content DistributionNetwork build on overlay for Sour e Spe i Appli ation Layer Multi- ast. The overlay distribution tree is enable/disable on ea h bran h bythe root node via end-to-end ex hange of join and prune messages).

Page 158: baccelli/Evaluation/AugustinChaintreauPhD.pdf

148 BIBLIOGRAPHY

Page 159: baccelli/Evaluation/AugustinChaintreauPhD.pdf

Chapter 3

Last-Passage Percolation in Pattern Grids

Wir sehen in der Natur nie Etwas als Einzelnheit, sondern wir sehen Alles in Verbindung mit etwas Anderem, das vor ihm, neben ihm, hinter ihm, unter ihm und über ihm sich befindet.¹
Johann Wolfgang von Goethe

This chapter presents an original analysis of large distributed discrete-event systems; this class includes collaborative decentralized protocols such as the one presented in Chapter 2.

The general model is the following: an infinite collection of tasks to complete is organized along several dimensions. We assume that tasks interact only locally through the same invariant causality pattern, which is based on a precedence rule: a given task starts as soon as a set of neighboring tasks have been completed. This allows us to model general distributed systems, and we argue that their asymptotic behavior on large scales can already be characterized reasonably well. This chapter supports this claim, and covers more extensively the case of task processes embedded in a lattice with dimension 2.

Guidelines: In §1, we describe these interacting task processes in a pattern grid; their associated completion times are shown to verify a system of uniform recurrence equations. We identify two sub-categories called sharp and non-sharp, which exhibit large paths with different behaviors. It is shown in §2 that the sharp condition characterizes a pattern grid that admits constants of directional last-passage percolation. In §2.4 and §2.5, we show for dimension 2 that these constants, seen as a function of the direction, share regularity properties with hydrodynamic limits. Based on these results, we construct and characterize in §3 the stationary regime of a semi-infinite line of interacting task processes. We prove in §4 that our results are not restricted to tasks organized over a lattice, but can be applied to general graphs, under a proper invariance condition.

¹ Gespräche mit Eckermann, 5. Juni 1826, München, Hanser Verlag, 1986, p. 536: "In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it." (translated by J. Oxenford).


1 Pattern Grid

1.1 Definitions

Let $\mathcal{H}$ be a finite set with cardinal denoted by $H$. A pattern grid with dimension $d$ and motifs $\mathcal{H}$ is defined as a graph $G_{\mathrm{patt}} = (V, E)$ that follows an invariant property:

• The set of its vertices is $V = \mathbb{Z}^d \times \mathcal{H}$.
• The set of its edges $E$ is supposed to be invariant by any translation in the lattice $\mathbb{Z}^d$, such that for all $v$ in $\mathbb{Z}^d$ we have:

\[ (a \times h) \to (a' \times h') \in E \quad\text{iff}\quad ((a + v) \times h) \to ((a' + v) \times h') \in E . \]

In this work, we consider only the case of locally finite graphs: the set of edges leaving any vertex in this graph is always finite. The invariant property then implies that the degrees of vertices in this graph are uniformly bounded.

It is easy to see that another way to characterize the set $E$ of a pattern grid is via a collection of dependence sets: a finite collection $(\Delta_{h,h'})_{h,h' \in \mathcal{H}}$ of subsets of $\mathbb{Z}^d$ indexed by $\mathcal{H}^2$, such that

\[ (a \times h) \to (a' \times h') \text{ is in } E \text{ if and only if } a' - a \text{ is in } \Delta_{h,h'} . \tag{3.1} \]

As this graph is supposed locally finite, all these subsets are necessarily finite.

Recurrence Equations

A pattern grid can be interpreted as an infinite set of task completions interacting via precedence rules. Each vertex $a \times h$ represents a task that needs to be done in cooperation with other processes; this task requires an amount of time to complete, which we call its weight and denote by $\mathrm{Wei}(a, h)$. Each edge stands for one precedence relationship: when leading from $a \times h$ to $a' \times h'$, it states that task $a \times h$ cannot be started unless $a' \times h'$ is completed. In this case, we call $a' \times h'$ an immediate predecessor of $a \times h$.

Each pattern grid defines a system of uniform recurrence equations, which can be thought of as the first time of completion $T_h(a)$, for $a$ in $\mathbb{Z}^d$ and $h$ in $\mathcal{H}$, which satisfies the precedence rules. These equations state that each task starts as soon as all its immediate predecessors have been completed:

\[ T_h(a) = \mathrm{Wei}(a, h) + \max \left\{ T_{h'}(a') \;\middle|\; (a \times h) \to (a' \times h') \in E \right\} . \tag{3.2} \]

Equivalently, using (3.1), we can rewrite this equation as

\[ T_h(a) = \mathrm{Wei}(a, h) + \max \left\{ T_{h'}(a + r) \;\middle|\; r \in \Delta_{h,h'},\ h' \in \mathcal{H} \right\} . \tag{3.3} \]
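As an illustration of how (3.3) can be evaluated, here is a minimal Python sketch. It rests on assumptions that are not part of the text: the grid is truncated to a finite box, predecessors outside the box are ignored (they contribute 0, which is neutral for non-negative weights), and the dependence relation has no loop inside the box. The names `solve_recurrence`, `weight` and `in_box` are hypothetical.

```python
from functools import lru_cache


def solve_recurrence(delta, weight, in_box):
    """Return a function T(a, h) evaluating recurrence (3.3) on a finite box.

    delta  : dict mapping (h, h') to a set of integer shift vectors (Delta_{h,h'})
    weight : function (a, h) -> non-negative weight Wei(a, h)
    in_box : function a -> bool; predecessors outside the box are ignored
             (a truncation assumed for this sketch only)
    """

    @lru_cache(maxsize=None)
    def T(a, h):
        best = 0.0  # value used when no predecessor lies in the box
        for (h_from, h_to), shifts in delta.items():
            if h_from != h:
                continue
            for r in shifts:
                b = tuple(x + y for x, y in zip(a, r))
                if in_box(b):
                    best = max(best, T(b, h_to))
        return weight(a, h) + best

    return T


# Single motif "o" with dependence set {(-1, 0), (0, -1)} and unit weights.
delta = {("o", "o"): {(-1, 0), (0, -1)}}
T = solve_recurrence(
    delta,
    weight=lambda a, h: 1.0,
    in_box=lambda a: all(x >= 0 for x in a),
)
print(T((2, 3), "o"))
```

Memoization keeps the evaluation linear in the number of tasks visited; for large boxes an iterative sweep in a compatible order would avoid deep recursion.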


More Details: Set of Uniform Recurrence Equations

From the last expression (3.3), a simple change of notation shows that the set of equations defined by a pattern grid is a system of Uniform Recurrence Equations (UREs), as defined by Karp et al. in [7]. The key property is that the dependence sets $(\Delta_{h,h'})_{h,h' \in \mathcal{H}}$ appearing in (3.3) do not depend on the current coordinates of $a$.

In this article, conditions are given to be able to construct incrementally the solutions of such systems, in the right order. The authors assume that there exists a centralized scheduler to complete these tasks in parallel while respecting the precedence rules. A typical problem to study is the optimal scheduling strategy, with or without constraints on the number of tasks being completed simultaneously, and/or constraints on the amount of memory available to keep results from previous computations. Gaujal and his coauthors study in [5] the minimal memory size for systems of UREs in dimension 1.

In our work, we do not assume a limitation of the number of simultaneous tasks, as they are all computed locally on separate systems. We assume that each computation takes a random time that is roughly the same for all of them, and we study the asymptotic speed of task completion when several dimension indices become large.

Example A: Infinite Tandem of Queues

Consider a semi-infinite line of single server queues, indexed by $k \in \mathbb{N}$, serving customers indexed by $m \in \mathbb{Z}$. When a customer has completed its service in server $k$, he enters immediately the (infinite) buffer of server $k + 1$, where he is scheduled according to a first come first served discipline.

These tasks are numbered according to two dimensions indexed by $m$ and $k$. The task "service of customer $m$ in server $k$" is naturally given the index $(m, k)$, and a weight corresponding to the service time of this customer in this server.

The precedence relation between tasks is described by the following set of edges:
• $(m, k) \to (m, k - 1)$ (i.e. the service of a customer in $k$ cannot start unless its service in server $k - 1$ is completed).
• $(m, k) \to (m - 1, k)$ (i.e. the service of a customer in $k$ cannot start unless the previous customer has completed its service on this server).

For this pattern grid, (3.3) can be written in the following form, introduced in [6]:

\[ T(m, k) = \mathrm{Wei}(m, k) + \max\bigl( T(m, k - 1),\ T(m - 1, k) \bigr) . \]

Equivalently, one could characterize this pattern grid as the only one of dimension 2 with a motif made of a single element, $\mathcal{H} = \{o\}$, and the dependence set $\Delta_{o,o} = \{(-1, 0), (0, -1)\}$.

Example B: Infinite Tandem of Queues with Blocking

We first cover a simple case. Imagine that in Example A described above, each queue has a finite buffer with size $B$, and that it implements the following blocking before service discipline to avoid buffer overflow: service of customer $m$ is not started in $k$ until sufficient memory space is available in the buffer of server $k + 1$.

The time at which the buffer of $k + 1$ has sufficient memory space to receive customer $m$ is when customer $m - B$ has completed its service in $k + 1$. This system can be represented by the same pattern grid as in Example A, if we add the following collection of edges to the set $E$:

\[ (m, k) \to (m - B, k + 1) \quad \text{for all } m \text{ and } k . \]

Extension to a pair of buffers: When it is expected that buffers become frequently full, the blocking before service discipline may be considered too conservative. In the periods when a buffer is full, the service of the preceding server is, in this case, systematically stopped. A technique involving two buffers can be used to avoid this phenomenon.

Each queue implements two buffers of sizes $B_{\mathrm{IN}}$ and $B_{\mathrm{OUT}}$ with the following mechanism: customers enter the queue in the buffer $B_{\mathrm{IN}}$ before being served. When they are served, they enter a second buffer, where they are stored until they can be immediately forwarded to the next server. We assume that customers are served in a first come first served discipline, and that buffer overflows are avoided via the two following precedence rules: customer $m$ cannot leave $B_{\mathrm{IN}}$ to be served unless customer $m - B_{\mathrm{OUT}}$ has left $B_{\mathrm{OUT}}$; customer $m$ cannot leave $B_{\mathrm{OUT}}$ to enter server $k + 1$ until customer $m - B_{\mathrm{IN}}$ has left the buffer $B_{\mathrm{IN}}$ in $k + 1$.

This system may be represented as a pattern grid with dimension 2 and the set $\mathcal{H} = \{i, o\}$ made of two elements. The weight of $(m, k) \times i$ is the service time of this customer in this server; the weight of $(m, k) \times o$ is null, as we neglect forwarding time between servers.

The collection of edges in this grid is defined as follows (edge, then interpretation):
• $(m, k) \times i \to (m, k - 1) \times o$: serving a customer requires that he was forwarded from the previous server.
• $(m, k) \times o \to (m, k) \times i$: a customer cannot enter the second buffer before he has been served by this server.
• $(m, k) \times i \to (m - 1, k) \times i$: serving a customer requires that the preceding customer has been served.
• $(m, k) \times i \to (m - B_{\mathrm{OUT}}, k) \times o$: avoid $B_{\mathrm{OUT}}$ overflow.
• $(m, k) \times o \to (m - B_{\mathrm{IN}}, k + 1) \times i$: avoid $B_{\mathrm{IN}}$ overflow.

Equivalently, it may be described as the pattern grid $\mathbb{Z}^2 \times \{i, o\}$ with dependence sets:

\[ \Delta_{i,i} = \{(-1, 0)\}, \qquad \Delta_{o,o} = \emptyset, \]
\[ \Delta_{i,o} = \{(0, -1), (-B_{\mathrm{OUT}}, 0)\}, \qquad \Delta_{o,i} = \{(0, 0), (-B_{\mathrm{IN}}, +1)\} . \]

This recurrence system was introduced in [9], in a more compact form, where departure dates of customers are interpreted as paths drawn on a lattice.

Example C: Infinite Tandem of Queues with Control Agents

In this case, the infinite line of single server queues is complemented every $L$ queues by an agent, which keeps track locally of the packets going through and may implement some flow control.

Let us consider, for instance, that agents implement locally a window flow control (with a fixed window size $W$). Agent $k$ forwards the packet $m$ when the packet $m - W$ has been received by the next agent on the line.

This distributed system may be thought of as a pattern grid with $H = L + 1$ elements denoted by $a, l_1, \ldots, l_L$, representing an agent and the $L$ next queues. The associated edges are:
• $(m, k) \times l_1 \to (m, k) \times a$,
• $(m, k) \times l_i \to (m, k) \times l_{i-1}$ for all $i = 2, \ldots, L$,
• $(m, k) \times a \to (m, k - 1) \times l_L$,
• $(m, k) \times l_i \to (m - 1, k) \times l_i$ for all $i = 1, \ldots, L$,
• $(m, k) \times a \to (m - W, k + 1) \times a$ and $(m, k) \times a \to (m - 1, k) \times a$.

It corresponds to the pattern grid $\mathbb{Z}^2 \times \{a, l_1, \ldots, l_L\}$ with dependence sets:

\[ \Delta_{a,l_i} = \begin{cases} \{(0, -1)\} & \text{for } i = L, \\ \emptyset & \text{otherwise,} \end{cases} \qquad \Delta_{a,a} = \{(-W, +1), (-1, 0)\}, \]

\[ \Delta_{l_i,l_j} = \begin{cases} \{(-1, 0)\} & \text{for } j = i, \\ \{(0, 0)\} & \text{for } j = i - 1, \\ \emptyset & \text{otherwise,} \end{cases} \qquad \Delta_{l_i,a} = \begin{cases} \{(0, 0)\} & \text{for } i = 1, \\ \emptyset & \text{otherwise.} \end{cases} \]
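The dependence sets of Example C can be written down as a lookup table, from which the immediate predecessors of any task follow by (3.1). The sketch below is only an illustration: the function names `example_c_sets` and `predecessors`, the string labels for motifs, and the numerical values of L and W are assumptions made for this example.

```python
def example_c_sets(L, W):
    """Dependence sets of Example C: one agent 'a' followed by L queues l1..lL."""
    delta = {
        ("a", "a"): {(-W, 1), (-1, 0)},   # window control and FIFO order at the agent
        ("a", f"l{L}"): {(0, -1)},        # agent k waits for last queue of block k-1
        ("l1", "a"): {(0, 0)},            # first queue waits for its own agent
    }
    for i in range(1, L + 1):
        delta[(f"l{i}", f"l{i}")] = {(-1, 0)}        # previous customer, same queue
        if i >= 2:
            delta[(f"l{i}", f"l{i - 1}")] = {(0, 0)}  # same customer, previous queue
    return delta


def predecessors(delta, a, h):
    """Immediate predecessors of task a x h, read off the dependence sets via (3.1)."""
    return {
        (tuple(x + y for x, y in zip(a, r)), h2)
        for (h1, h2), shifts in delta.items()
        if h1 == h
        for r in shifts
    }


delta_c = example_c_sets(L=3, W=4)
print(sorted(predecessors(delta_c, (5, 2), "a")))
```

Because the table does not depend on the coordinates, translating the task by any vector translates its predecessors by the same vector, which is the invariance property required of a pattern grid.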

Dependence Graph

A path in a pattern grid is defined following the usual definition for graphs. Its size is given by the number of vertices that it contains. Multiplicity of vertices is included in this number, for the case where the path contains a loop; but as seen in the previous section, we deal most of the time with self-avoiding paths. Its weight is the sum of the weights of all of its vertices (including multiplicity in case there are loops in this path).

Following the terminology used in [7], we define the dependence graph of a pattern grid as a graph such that:
• The vertices are given by the elements of $\mathcal{H}$.
• Each edge is given a label in $\mathbb{Z}^d$; the set of labeled edges is:

\[ \left\{ h \to h' \text{ with label } r \;\middle|\; r \in \Delta_{h,h'},\ h, h' \in \mathcal{H} \right\} . \]

Note that there could be two edges leading from $h$ to $h'$ with different labels.

Paths drawn in the dependence graph are called dependence paths. A unique label, which is a vector in $\mathbb{Z}^d$, may be associated with each path $\pi$ of the dependence graph; it is given by the sum of the labels corresponding to edges used by $\pi$.
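The definition above can be made concrete with a short sketch that lists the labeled edges of a dependence graph and sums labels along a dependence path. It uses the two-buffer system of Example B with the illustrative values $B_{\mathrm{IN}} = 2$ and $B_{\mathrm{OUT}} = 3$; these values and the names `dependence_graph` and `path_label` are assumptions made for the example.

```python
def dependence_graph(delta):
    """Labeled edges (h, h', r) of the dependence graph, one per r in Delta_{h,h'}."""
    return sorted((h, h2, r) for (h, h2), shifts in delta.items() for r in shifts)


def path_label(path):
    """Label of a dependence path: the sum of the labels of its edges."""
    if not path:
        raise ValueError("a path uses at least one edge")
    for (_, end, _), (start, _, _) in zip(path, path[1:]):
        if end != start:
            raise ValueError("consecutive edges must share a vertex")
    dim = len(path[0][2])
    return tuple(sum(edge[2][j] for edge in path) for j in range(dim))


B_IN, B_OUT = 2, 3  # illustrative buffer sizes
delta_b = {
    ("i", "i"): {(-1, 0)},
    ("i", "o"): {(0, -1), (-B_OUT, 0)},
    ("o", "i"): {(0, 0), (-B_IN, 1)},
}

for edge in dependence_graph(delta_b):
    print(edge)

# Two parallel edges lead from i to o; this path uses the one labeled (-B_OUT, 0).
print(path_label([("i", "o", (-B_OUT, 0)), ("o", "i", (-B_IN, 1))]))
```

Storing one tuple per label keeps parallel edges with different labels distinct, as the definition requires.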


In other words, edges in the dependence graph are obtained in the following way: for every edge $(v \times h) \to (v' \times h') \in E$, we define an edge $h \to h'$ in the dependence graph, with a label in $\mathbb{Z}^d$ that is equal to the vector $v' - v \equiv (v'_1 - v_1, \ldots, v'_d - v_d)$. Two identical edges in the dependence graph with the same label are identified as a unique one, to avoid infinite multiplicity of identical labeled edges. The graph is thus locally finite.

Similarly, a path $\pi$ in the dependence graph may be thought of as the projection of a path drawn in the pattern grid. The projection of a path in the pattern grid is unique. For each arbitrary $v \in \mathbb{Z}^d$, a path $\pi$ in the dependence graph, starting from $h$ in $\mathcal{H}$, is the projection of a path in the pattern grid starting from $v \times h$, which is unique.

Examples A-B-C: Dependence Graphs

Example A: The dependence graph for a semi-infinite line of single server queues contains a single vertex, and two edges with labels $(-1, 0)$ and $(0, -1)$.

[Figure: dependence graph with a single vertex and two edges labeled $(0, -1)$ and $(-1, 0)$.]

Example B: If each queue implements blocking with two buffers $B_{\mathrm{IN}}$, $B_{\mathrm{OUT}}$, according to the mechanism described in Example B, we have the following dependence graph:

[Figure: dependence graph with vertices $i$ and $o$; edge labels $(0, 0)$, $(-1, 0)$, $(0, -1)$, $(-B_{\mathrm{IN}}, 1)$, $(-B_{\mathrm{OUT}}, 0)$.]

Example C: The case of a line of infinite tandem queues with control agents implementing window flow control, placed every $L$ queues, is described by the following dependence graph (we have represented the case $L = 3$):

[Figure: dependence graph with vertices $a$, $l_1$, $l_2$, $l_3$; edge labels $(-1, 0)$ (four edges), $(0, 0)$ (three edges), $(0, -1)$, $(-W, 1)$.]


1.2 Sharp Vector

In the previous subsections, we presented pattern grids using several representations; all of them are equivalent owing to the translation invariance property of the set of edges in the grid. We did not impose other restrictions on this dependence relation; however, some seem necessary. As an example, loops appearing in the dependence relation (i.e. loops in the pattern grid) would create a deadlock in the recurrence: one task would need to be completed before it is started.

In the two following subsections, we identify a condition, verified by pattern grids that are called sharp, that is sufficient to avoid such loops. We will see that this condition is indeed stronger: it implies an upper bound on the size of the paths between two fixed points. A pattern grid can be characterized as sharp or non-sharp using only a finite number of integer comparisons.

Definitions

Let us consider the dependence graph of the pattern grid, defined in §1.1. A path in this graph that starts and ends in the same vertex (i.e. the same element of $\mathcal{H}$), and uses at least one edge, is called a cycle. It is said to be simple if it does not strictly contain another cycle.

In other words, a cycle in the dependence graph is simple if and only if its vertices are all distinct, except for the first and last vertices that are necessarily the same. Note in particular that a simple cycle has a size at least 2, and smaller than or equal to $H + 1$. As this graph contains a finite number of vertices and edges, there is a finite number of simple cycles.

Remark 1 As seen above, a loop in the dependence relation, characterized by a loop in the pattern grid, or equivalently by a cycle with null label in the dependence graph, implies a deadlock. Let us stress that a cycle in the dependence graph whose associated label is not zero bears no sign of a degeneracy. In fact, such cycles necessarily appear in any long path, as the dependence graph contains a finite number of edges and vertices.

A pattern grid is called sharp if it admits a sharp vector, which is defined by the following condition:

Condition 2 There exists a vector $s$ in $\mathbb{Z}^d$, called a sharp vector, verifying $\langle r, s \rangle < 0$ for all labels $r$ associated with a simple cycle.

A label $r$ associated with a simple cycle is called an elementary vector. The family of elementary vectors is necessarily finite, as the number of simple cycles is finite. One can thus check whether any $s \in \mathbb{Z}^d$ is a sharp vector, using a finite number of linear inequalities (one corresponding to each simple cycle).

Finding all simple cycles may be computationally expensive if the set $\mathcal{H}$ is large. One could take advantage of the following proposition:

Proposition 1 A cycle of size 2 (with a single edge) is called a trivial cycle.

(i) All trivial cycles are simple.
(ii) Edges contained in non-trivial simple cycles and edges contained in trivial cycles are disjoint.

Proof: (i) is obvious, as all cycles contain at least 2 vertices. To prove (ii), let us consider a non-trivial simple cycle $\sigma$. If one of its edges is contained in a trivial cycle, then $\sigma$ contains this trivial cycle, which contradicts the fact that it is simple.

Non-simple cycles: The following proposition shows that the previous definitions are equivalent if the paths considered are extended to contain all cycles, which are not necessarily simple:

Proposition 2 Let $v$ be in $\mathbb{Z}^d$. Then $\langle r, v \rangle < 0$ for all labels $r$ of cycles if and only if $\langle r, v \rangle < 0$ for all labels $r$ of simple cycles.

Proof: This can be shown easily by induction on the size of the cycle. All cycles with size 2 are simple, and if $\sigma$ is a non-simple cycle, then it may be written $\pi_1 \sigma_0 \pi_2$, where $\sigma_0$ is a simple cycle, and $\pi_1 \pi_2$, which is well defined, is a cycle with strictly smaller size. The result then holds as the vector associated with $\sigma$ is $r_{\pi_1} + r_{\pi_2} + r_{\sigma_0} = r_{\pi_1 \pi_2} + r_{\sigma_0}$.

Geometric interpretation: For any family $V$ of vectors of $\mathbb{R}^d$, we define its cone as the subset of linear combinations of vectors in $V$ with non-negative coefficients, and we denote it by $\mathrm{Cone}(V)$. For the family of elementary vectors, we call this cone the elementary cone. There exists a sharp vector if and only if there is a vector that has a negative scalar product with all non-null elements of the elementary cone. In other words, there should be a hyperplane that meets this cone only in 0, such that this cone is entirely included in one of the half-spaces defined by this hyperplane.

Examples are shown in Figure 3.1 for the case of dimension 2. Different families of elementary vectors have been represented, containing 3 (as in case (a)) to 6 elements (as in (d)). The cone generated by positive linear combinations of this family is shown in gray. We have shown in black the directions that define sharp vectors, for cases (a) and (b), where such a vector may be found (note that the subset of sharp vectors is also a cone, except that it does not contain 0). Case (c) shows an example of a family containing opposite vectors, making it impossible to find a sharp vector. In case (d), the cone created by positive linear combinations of elementary vectors generates the whole space $\mathbb{R}^2$, so that no sharp vector may be found. A question that may arise is whether these cases depict all possible situations. This can be proved for dimension 2, and it can be found in §1.3.

Figure 3.1: Geometric representation: (a) and (b) represent families that admit a sharp vector, (c) and (d) families that do not admit such a vector.

1.3 Why is a Sharp Vector Necessary?

We claim in this section that a pattern grid that does not admit a sharp vector always exhibits a pathological behavior for the study of large paths. It is a rather general statement and it seems difficult to give a precise mathematical meaning to it. We thus focus on the case of dimension 2, and present some results to support our claim.

These results illustrate the complexity lying behind the dependence relation defined by a pattern grid. They also justify our approach of assuming, from §1.4 on, that there always exists a sharp vector.

Strictly Obtuse

Let us first define the following condition, which clearly makes the existence of a sharp vector impossible.

Condition 3 A family of vectors of $\mathbb{Z}^d$ is said strictly obtuse if

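$$\forall v \in \mathbb{Z}^2,\ v \neq 0, \text{ there exists } e \text{ chosen in the family such that } \langle v, e \rangle > 0 .$$

Note that in the case (illustrated for example in Figure 3.1 (d)) where $\mathrm{Cone}(V)$ contains all elements in $\mathbb{R}^d$, it is clear that Condition 3 holds: any $v$ could then be written as $v = \sum_{e \in V} a_e . e$, where the coefficient $a_e$ is non-negative for all $e$. Having $\langle v, e \rangle \le 0$ for all $e$ in this family would then be absurd, as it would imply

$$\langle v, v \rangle = \Big\langle v, \sum_{e \in V} a_e . e \Big\rangle = \sum_{e \in V} a_e \langle v, e \rangle \le 0 .$$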
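The sharpness test discussed above is easy to experiment with numerically. The following is a minimal sketch, assuming that a sharp vector is an integer vector with a negative scalar product with every elementary vector; the three families are hypothetical examples in the spirit of Figure 3.1, not the ones drawn there, and the search is limited to a small box, so a result of `None` only says that no candidate lies in that box.

```python
from itertools import product


def dot(u, v):
    """Scalar product of two integer vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def is_sharp(v, family):
    """True if v has a negative scalar product with every vector of the family."""
    return all(dot(r, v) < 0 for r in family)


def find_sharp_vector(family, bound=3):
    """Brute-force search over integer vectors with coordinates in [-bound, bound].

    Returns a sharp vector if one is found in the box, otherwise None.
    """
    d = len(family[0])
    for v in product(range(-bound, bound + 1), repeat=d):
        if is_sharp(v, family):
            return v
    return None


# Hypothetical families (assumptions made for illustration only).
acute_family = [(-1, 0), (0, -1), (-1, -2)]      # contained in a half-plane
opposite_family = [(1, -1), (-1, 1), (0, -1)]    # contains a pair of opposite vectors
spanning_family = [(1, 0), (-1, 1), (-1, -1)]    # its cone is the whole plane

for name, fam in [("acute", acute_family),
                  ("opposite", opposite_family),
                  ("spanning", spanning_family)]:
    print(name, find_sharp_vector(fam))
```

For the last two families no sharp vector exists at all, whatever the size of the box: two opposite vectors cannot both have a negative scalar product with the same vector, and a family whose cone is the whole plane verifies Condition 3.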

The converse is true in dimension 2, and surprisingly technical to prove. It can be seen in this case as a consequence of the two following lemmas:

Lemma 4 Let $V$ be a family of $\mathbb{Z}^2$ containing a pair of opposite vectors: there exist $e$ and $f$ in $V$ such that $e = -a.f$, where $a \in \mathbb{R}$, $a > 0$. Then Condition 3 implies $\mathrm{Cone}(V) = \mathbb{R}^2$.

Proof: Let us choose $v$ in $\mathbb{R}^2$ such that $\langle e, v \rangle = 0$. By Condition 3, there exists $e'$ such that $\langle e', v \rangle > 0$. We now claim that $v$ is in $\mathrm{Cone}(V)$:

• If $\langle e', e \rangle \le 0$, let us fix $a = -\frac{\langle e', e \rangle}{\langle e, e \rangle}$. We have that $e' + a.e$ is equal to $\langle e', v \rangle . v$, as
$$\langle e' + a.e, e \rangle = \langle e', e \rangle + a . \langle e, e \rangle = 0, \qquad \langle e' + a.e, v \rangle = \langle e', v \rangle > 0 ,$$
and $(e, v)$ are perpendicular.

• If $\langle e', e \rangle > 0$, then $\langle e', f \rangle < 0$; we can choose $a = -\frac{\langle e', f \rangle}{\langle f, f \rangle}$, and prove that $e' + a.f$ is equal to $\langle e', v \rangle . v$.

The same method can be used to show that $-v$ is in $\mathrm{Cone}(V)$, as it verifies $\langle -v, e \rangle = 0$. As a consequence we have shown that $\mathrm{Cone}(V)$ contains four vectors defining a cross, $(e, f, v, -v)$, and hence contains any vector of $\mathbb{R}^2$.

Lemma 5 Let $V$ be a finite family of $\mathbb{Z}^2$ that does not contain a pair of opposite vectors, and verifies Condition 3. Then it contains a generating triple: there exists $(e, f, g)$ in $V$ such that
$$\begin{cases} \langle e, f \rangle < 0, & \langle e, g \rangle < 0, \\ \langle \bar{e}, f \rangle > 0, & \langle \bar{e}, g \rangle < 0, \end{cases}$$
where $\bar{e}$ is a vector verifying $\langle e, \bar{e} \rangle = 0$.

Proof: We will prove that assuming that no such triple $(e, f, g)$ exists is absurd. Indeed, in such a situation, one could construct an infinite sequence made of distinct elements of $V$, in the following way:

Step 1: Let $e_0$ be arbitrarily chosen in $V$. We can find $f \in V$ such that $\langle e_0, f \rangle < 0$ (i.e. Condition 3 used with $v = -e_0$). Let $\bar{e}_0$ be orthogonal to $e_0$. If $\langle \bar{e}_0, f \rangle$ is zero, then $e_0$ and $f$ are opposite vectors in $V$; this is not possible by assumption. Hence we can suppose $\langle \bar{e}_0, f \rangle > 0$; otherwise, we could replace $\bar{e}_0$ by $-\bar{e}_0$.

This proves that the scalar products between the different vectors included in these two pairs are such that the following diagram holds:

[Diagram: signs of the scalar products between the vectors $e_0$, $\bar{e}_0$, $f$ and $\bar{f}$.]

In this figure, the label on an edge linking two vectors represents the sign of their scalar product, and 0 when they are orthogonal. Note that the edge represented at the top was drawn with a dashed line because it follows from the five other edges in the following way: the signs of the scalar products between $e_0$ and the two other vectors $(f, \bar{f})$ are known (i.e. $-$ with $f$, $+$ with $\bar{f}$). This is not true for $\bar{e}_0$, as only one of these scalar products is known (i.e. $+$ with $f$). But as $\langle e_0, \bar{e}_0 \rangle$ is null, we have:

$$0 = \langle e_0, \bar{e}_0 \rangle = \big\langle e_0,\ a \big( \langle \bar{e}_0, f \rangle . f + \langle \bar{e}_0, \bar{f} \rangle . \bar{f} \big) \big\rangle = \underbrace{a}_{>0} \Big( \underbrace{\langle \bar{e}_0, f \rangle}_{>0} . \underbrace{\langle e_0, f \rangle}_{<0} + \underbrace{\langle \bar{e}_0, \bar{f} \rangle}_{?} . \underbrace{\langle e_0, \bar{f} \rangle}_{>0} \Big) .$$

It can then be deduced that $\langle \bar{e}_0, \bar{f} \rangle > 0$.

Step 2: It follows from Condition 3 that there exists $g \in V$ verifying $\langle \bar{e}_0, g \rangle < 0$. Note that this implies that $g$ is different from $e_0$ and $f$. We now prove that the previous diagram can be extended to the following:

[Diagram: the previous diagram extended with the vector $g$ and the signs of its scalar products with $e_0$, $\bar{e}_0$, $f$ and $\bar{f}$.]

Again we need to deduce the edge drawn with dashed lines.

• First, we have necessarily $\langle g, e_0 \rangle \ge 0$. Otherwise, $(e_0, f, g)$ would be a generating triple. This implies that the scalar product $\langle g, f \rangle$ is negative, following the signs of the scalar products of $g$ with $e_0$ and $\bar{e}_0$.

• This implies $\langle g, e_0 \rangle \neq 0$. Otherwise, $g = -a.\bar{e}_0$, which is made impossible by $\langle g, f \rangle < 0$. Hence $\langle g, e_0 \rangle > 0$.

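Similarly, $\langle g, \bar{f} \rangle \ge 0$, as $(f, e_0, g)$ would else be a generating triple. $\langle g, \bar{f} \rangle = 0$ would imply $\langle e_0, g \rangle = \langle e_0, -a f \rangle \ge 0$; this proves that $\langle g, \bar{f} \rangle > 0$.

Step 3: Let us now choose $\bar{g}$, a vector in $\mathbb{R}^2$ that is orthogonal to $g$ and verifies $\langle \bar{g}, e_0 \rangle > 0$. What we claim is that it implies that the following diagram holds:

[Diagram: signs of the scalar products between the vectors $e_0$, $\bar{e}_0$, $f$, $\bar{f}$, $g$ and $\bar{g}$.]

Let us prove the two positive scalar products remaining, together with the one negative.

• $\langle \bar{e}_0, \bar{g} \rangle > 0$ is easy to deduce from $\langle g, \bar{g} \rangle = 0$ and the projection of $\bar{g}$ in the coordinates defined by $(e_0, \bar{e}_0)$, as done in Step 1.

• This then implies the positivity of $\langle \bar{f}, \bar{g} \rangle$, as $\bar{g}$ may be projected with positive coordinates defined by $(e_0, \bar{e}_0)$.

• $\langle f, \bar{g} \rangle > 0$ can be deduced from $\langle g, \bar{g} \rangle = 0$, where $g$ is projected along the coordinates corresponding to $(f, \bar{f})$.

Step 4: As shown in the previous diagram, we are now back to the initial relations between $(e_0, \bar{e}_0)$ and $(f, \bar{f})$, for $(g, \bar{g})$ and $(f, \bar{f})$. This allows us to define by induction a sequence $(e_i, \bar{e}_i)$, starting with $e_1 = g$ and $\bar{e}_1 = \bar{g}$, and applying the same construction (i.e. choosing $e_{i+1} \in V$ such that $\langle \bar{e}_i, e_{i+1} \rangle < 0$, and $\bar{e}_{i+1}$ as orthogonal to $e_{i+1}$ and satisfying $\langle \bar{e}_{i+1}, e_i \rangle > 0$). Let us now prove that $(e_i)_{i \ge 1}$ can never be equal to $e_0$. As the same argument applies by induction, it proves that the sequence is made of distinct elements in $V$.

Let us first note that $\langle \bar{e}_i, \bar{e}_0 \rangle > 0$, which follows from

$$\langle \bar{e}_i, \bar{e}_0 \rangle = \big\langle a \big( \langle \bar{e}_i, f \rangle . f + \langle \bar{e}_i, \bar{f} \rangle . \bar{f} \big),\ \bar{e}_0 \big\rangle = \underbrace{a}_{>0} \Big( \underbrace{\langle \bar{e}_i, f \rangle}_{>0} . \underbrace{\langle f, \bar{e}_0 \rangle}_{>0} + \underbrace{\langle \bar{e}_i, \bar{f} \rangle}_{>0} . \underbrace{\langle \bar{f}, \bar{e}_0 \rangle}_{>0} \Big) .$$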

It follows by induction that $\langle e_i, \bar{e}_0 \rangle < 0$ for any $i = 1, 2, \ldots$ It holds for $i = 1$. For $i \ge 2$:

$$\langle e_i, \bar{e}_0 \rangle = \underbrace{a}_{>0} \Big( \underbrace{\langle e_i, e_{i-1} \rangle}_{>0} . \underbrace{\langle e_{i-1}, \bar{e}_0 \rangle}_{<0 \text{ by induction}} + \underbrace{\langle e_i, \bar{e}_{i-1} \rangle}_{<0} . \underbrace{\langle \bar{e}_{i-1}, \bar{e}_0 \rangle}_{>0} \Big) .$$

This proves that $e_i$ is not orthogonal to $\bar{e}_0$, and hence the sequence is always different from $e_0$.

Corollary 1 When $d = 2$, Condition 3 is equivalent to $\mathrm{Cone}(V) = \mathbb{R}^2$.

Proof: The case where the family contains a pair of opposite vectors was already shown in Lemma 4. Otherwise, we can always assume (see Lemma 5) that there exists a generating triple. This proves that there exist $e, f, g$ in $V$, and $\bar{e}$ verifying $\langle e, \bar{e} \rangle = 0$, such that:

$$f = -a.e + b.\bar{e}, \qquad g = -c.e - d.\bar{e}, \qquad \text{with } a, b, c, d > 0 .$$

We have $f + \frac{b}{d} . g = \big( a + \frac{bc}{d} \big) . (-e)$ and hence $-e$ is in $\mathrm{Cone}(V)$. Similarly, $f + a.e = b.\bar{e}$ (respectively $g + c.e = d.(-\bar{e})$), so that $\bar{e}$ (respectively $-\bar{e}$) is in $\mathrm{Cone}(V)$. This cone contains the cross $(e, -e, \bar{e}, -\bar{e})$, and hence any element of $\mathbb{R}^2$.

Corollary 2 If we consider a family in $\mathbb{Z}^2$ that does not admit a sharp vector, then one of the following statements is true:

(i) It contains the vector 0.
(ii) It contains a couple of opposite vectors.
(iii) It contains a generating triple.

Proof: The fact that no sharp vector may be found for $V$ can be written

$$\forall v \in \mathbb{Z}^d, \text{ there exists } r \in V \text{ such that } \langle r, v \rangle \ge 0 \quad \big(\text{i.e. } \max_{r \in V} \langle r, v \rangle \ge 0 \big) .$$

First, let us suppose that $0 \notin V$ and that Condition 3 is not verified: there exist $v_0$ in $\mathbb{R}^2$ and $r_0 \in V$ such that
$$\max_{r \in V} \langle r, v_0 \rangle = \langle r_0, v_0 \rangle = 0 .$$
We then consider, for any $k \ge 1$, the sequence $k.v_0 - r_0$. We know that there exists $r_k \in V$ such that $k \langle v_0, r_k \rangle - \langle r_k, r_0 \rangle \ge 0$. This sequence can never be equal to $r_0$, and as it takes values in a finite set, there exists $r_1$ in $V$ chosen an infinite number of times. As a consequence, we have necessarily that $\langle v_0, r_1 \rangle \ge 0$ (hence $\langle v_0, r_1 \rangle = 0$) and that $\langle r_1, r_0 \rangle \le 0$ (hence $\langle r_1, r_0 \rangle < 0$, as $v_0$ and $r_0$ are perpendicular). This implies that $V$ contains $(r_0, r_1)$ as a pair of opposite vectors.

Two cases remain: when 0 is in $V$, so that (i) holds by definition, and when Condition 3 is verified. In this last case, Lemma 5 already proved that $V$ should contain a pair of opposite vectors or a generating triple.

Consequence on Pattern Grid

Let us consider a pattern grid with dimension 2, such that the family of the vectors associated with its simple cycles does not admit a sharp vector. By Corollary 2, we know that it contains one of these peculiarities: the vector 0, or a pair of opposite vectors, or a generating triple.

Irreducibility: A pattern grid is said irreducible if its dependence graph is strongly connected, such that there always exists a path leading from $h$ to $h'$, for any $h$ and $h'$. Note that such a path can always be chosen to have a size at most $H$.

The irreducibility property is required to derive the following result.

Theorem 1 We consider an irreducible pattern grid with dimension $d = 2$ that does not admit a sharp vector. We pick an arbitrary vertex of this graph as an origin. There exists a vertex $v \times h$ such that we can build a path of arbitrarily large size from $v \times h$ to the origin.

Proof: We will prove the following fact: one can construct a path of arbitrary size which begins and ends in a given bounded subset. This allows us to conclude: one can then define an infinite number of these paths, chosen with increasing size; at least two points in this domain may be linked by a path of any arbitrary size. From the invariance by translation, we can assume that this collection of paths starts in a fixed vertex and ends in the origin.

To prove this fact, let us first define the constant $M$ as

$$M = \max \big\{ \|r\|_\infty \ \text{for } r \text{ associated with a path } \pi \text{ of size } |\pi| \le H + 1 \big\} .$$

By irreducibility, one can draw a path in the dependence graph from any vertex $h$ to any other $h'$, and this path may be chosen to have size at most $H$, so that in particular its associated vector $r$ verifies $\|r\|_\infty \le M$. Note that any simple cycle contains at most $H + 1$ vertices, hence its associated vector $r$ verifies $\|r\|_\infty \le M$ as well.

Based on Corollary 2, we prove the fact for the following bounded domain:

$$B_{5M} = \big\{ a \times h \ \big|\ \|a\|_\infty \le 5M \big\} .$$

To prove this fact, we need only to show that paths of any size may be found in the dependence graph, such that their associated vectors remain upper bounded by $5M$, for the uniform norm.

• If 0 is a vector associated with a simple cycle, iterations of this cycle may have arbitrary size, with a null associated vector. As seen before, a dependence relation of this kind always implies a deadlock.

• If $(e, f)$ is a pair of opposite vectors, associated with two simple cycles $\sigma$ and $\rho$ of the dependence graph, we denote by $h$ and $h'$ the two extreme vertices associated with $\sigma$ and $\rho$, and by $\pi$ a path from $h$ to $h'$ in the dependence graph, with size at most $H$.

We have that $-\frac{e_1}{f_1} = -\frac{e_2}{f_2} > 0$. In the case where it is a rational number, one can find two arbitrarily large integers $p$ and $q$ verifying $q.e + p.f = 0$. Otherwise, one can find arbitrarily large integers $p$ and $q$ such that $\frac{p}{q} \le -\frac{e_1}{f_1} \le \frac{p+1}{q}$, which implies $\|q.e + p.f\|_\infty \le \|f\|_\infty \le M$.

In both cases, the path $(\sigma)^q \pi (\rho)^p$ may take an arbitrarily large size as $p$ and $q$ are chosen large, but its associated vector is $q.e + r_\pi + p.f$, whose uniform norm is upper bounded by $\|q.e + p.f\|_\infty + \|r_\pi\|_\infty \le 2M$.

• Lastly, if the family of elementary vectors contains a generating triple, then we can find $e, f, g$ associated with simple cycles $\sigma, \rho, \tau$, and $\bar{e}$ verifying $\langle e, \bar{e} \rangle = 0$, $\langle \bar{e}, \bar{e} \rangle = 1$, such that:

$$f = -a.e + b.\bar{e}, \qquad g = -c.e - d.\bar{e}, \qquad \text{with } a, b, c, d > 0 .$$

Let $q \in \mathbb{N}$ be arbitrarily large. We choose $r$, then $p$, such that

$$\frac{r}{q} \le \frac{b}{d} \le \frac{r+1}{q} \qquad \text{and} \qquad -\frac{p}{q} \le -\Big( a + \frac{r}{q} c \Big) \le \frac{-p+1}{q} .$$

We then have: $0 \le qb - rd \le d$ and $0 \le p - (qa + rc) \le 1$.

One can rewrite $p.e + q.f + r.g$ as $(p - (qa + rc)).e + (qb - rd).\bar{e}$, which proves that its uniform norm is upper bounded by $\|e\|_\infty + d.\|\bar{e}\|_\infty$. As $\bar{e}$ has Euclidean norm equal to 1, we have that $d = -\langle g, \bar{e} \rangle \le 2.\|g\|_\infty$, hence $p.e + q.f + r.g$ has uniform norm at most $3.M$.

Let us denote by $h, h', h''$ the extremal vertices for the cycles $\sigma, \rho, \tau$, and let $\pi$ (respectively, $\pi'$) be a path in the dependence graph leading from $h$ to $h'$ (respectively, from $h'$ to $h''$), whose size is at most $H$. We can then consider the compound path $(\sigma)^p \pi (\rho)^q \pi' (\tau)^r$. It may take an arbitrarily large size as $q$ increases, but its uniform norm remains at most $5.M$.
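The choice of the integers $r$ and $p$ in the generating-triple case can be checked with exact arithmetic. The following is a minimal sketch, assuming that $a, b, c, d$ are positive rationals (the function names and the numerical values are hypothetical, chosen only for illustration): it takes $r = \lfloor qb/d \rfloor$ and $p = \lceil qa + rc \rceil$ and returns the two coordinates of $p.e + q.f + r.g$ in the basis $(e, \bar{e})$.

```python
from fractions import Fraction
from math import ceil, floor


def choose_p_r(a, b, c, d, q):
    """Pick integers r, then p, such that
    r/q <= b/d <= (r+1)/q  and  0 <= p - (q*a + r*c) <= 1."""
    r = floor(q * b / d)
    p = ceil(q * a + r * c)
    return p, r


def coordinates(a, b, c, d, p, q, r):
    """Coordinates of p.e + q.f + r.g in the basis (e, e_bar),
    using f = -a.e + b.e_bar and g = -c.e - d.e_bar."""
    return p - (q * a + r * c), q * b - r * d


# Hypothetical coefficients of a generating triple.
a, b, c, d = Fraction(3, 2), Fraction(2, 3), Fraction(1, 5), Fraction(7, 4)
for q in (1, 10, 1000):
    p, r = choose_p_r(a, b, c, d, q)
    print(q, p, r, coordinates(a, b, c, d, p, q, r))
```

Whatever the value of $q$, the two coordinates stay in $[0, 1]$ and $[0, d]$ respectively, while $p$, $q$ and $r$, and hence the size of the compound path, grow without bound.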

1.4 Why is a Sharp Vector Useful?

In the previous section, we have examined which peculiarities can arise when the family of vectors associated with simple cycles does not admit a sharp vector. In this section, we do exactly the opposite. Starting from the assumption that a sharp vector may be found, as will almost always be the case in the applications, we prove that paths in the pattern grid between two fixed points may be bounded in different ways.

Let us start with the following refinement of Condition 2: a sharp vector is called well sharp if it has only non-negative coordinates (i.e. $s \in \mathbb{N}^d$) and

(3.4) $\langle r, s \rangle \le -1$ for all labels $r$ of simple cycles.

In practice, it will always be possible to assume that a sharp vector is well sharp, for the following two reasons. First, if a sharp vector has negative coordinates in one or several dimensions, one can redefine the numbering of vertices in the pattern grid, such that all indices corresponding to these dimensions are replaced by their opposite. The pattern grid obtained is the same read from a different orientation, and it admits a sharp vector with non-negative coordinates. All results obtained in this chapter hence hold for the original pattern grid, after the orientation is changed. Second, as the number of simple cycles is finite, the number of labels $r$ appearing in the previous definition is finite. A sharp vector has a negative scalar product with all of them, such that when it is multiplied by a sufficiently large integer, all these scalar products are below $-1$.

Residue: Proposition 2 states that the scalar product between the label of a path and a sharp vector is negative if this path is a cycle. This result is not true in general for all paths. To treat this case we introduce the following constant, called the residue of a pattern grid, associated with a sharp vector $s$:

$$\mathrm{Res} = \max \big\{ (\langle r, s \rangle)^+ \ \big|\ r \text{ associated with } \pi \text{ where } |\pi| \le H \big\} .$$

It is a finite maximum by definition, because the dependence graph contains a finite number of vertices and edges. Note that for the case where the motif set $\mathcal{H}$ contains a single element, this residue is null, because every path is a cycle.

The following theorem comes from the remark that large paths in the dependence graph are essentially made of concatenated cycles, except for a marginal part that is well covered by the residue. This allows us to bound the size of this path in the following way:

Theorem 2 Let us consider a pattern grid admitting a sharp vector $s$. Let $\pi : a \times h \to a' \times h'$ be a path. Its size is then upper bounded by:

$$|\pi| \le H \big( 1 + \mathrm{Res} + \langle a - a', s \rangle \big) .$$
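The residue and the bound of Theorem 2 can be computed directly on a small dependence graph. The sketch below is only illustrative and rests on two assumptions: the size of a path is read as its number of vertices (consistent with the statements that cycles have at least 2 vertices and that simple cycles have at most $H + 1$), and the dependence graph is encoded as a list of triples `(h, h2, label)`, which is a hypothetical representation. The single-motif graph is Example A; the two-motif graph is an invented example.

```python
def dot(u, v):
    """Scalar product of two integer vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def residue(motifs, edges, s):
    """Maximum of the positive part of <r, s> over labels r of paths with at most H vertices."""
    H = len(motifs)
    zero = (0,) * len(s)
    frontier = {(h, zero) for h in motifs}   # paths made of a single vertex, null label
    best = 0
    for _ in range(H - 1):                   # each round appends one edge to every path
        frontier = {(h2, tuple(x + y for x, y in zip(lab, r)))
                    for (h, lab) in frontier
                    for (h1, h2, r) in edges if h1 == h}
        best = max([best] + [dot(lab, s) for _, lab in frontier])
    return best


def size_bound(H, res, a, a_prime, s):
    """Upper bound H(1 + Res + <a - a', s>) on the size of a path from a x h to a' x h'."""
    return H * (1 + res + dot(tuple(x - y for x, y in zip(a, a_prime)), s))


# Example A: a single motif, elementary vectors (-1, 0) and (0, -1), s = (1, 1).
edges_a = [("h", "h", (-1, 0)), ("h", "h", (0, -1))]
res_a = residue(["h"], edges_a, (1, 1))
print(res_a, size_bound(1, res_a, (2, 3), (0, 0), (1, 1)))

# Hypothetical two-motif graph whose only simple cycle has label (-1, -1).
edges_2 = [("i", "o", (1, 0)), ("o", "i", (-2, -1))]
res_2 = residue(["i", "o"], edges_2, (1, 1))
print(res_2, size_bound(2, res_2, (3, 3), (0, 0), (1, 1)))
```

In Example A the bound is attained: a path from $(2, 3)$ to the origin uses 5 edges, hence 6 vertices.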

Proof: Let us first remark that this result is equivalent to the following result on paths in the dependence graph: for any path $\pi$ with label $r$, we have

$$|\pi| \le H \big( 1 + \mathrm{Res} - \langle r, s \rangle \big) .$$

We will prove this fact by induction on the size of $\pi$. First, from the definition of the residue, this result holds trivially for any path $\pi$ whose size is less than or equal to $H$.

If $\pi$ has a size strictly larger than $H$, then it contains a cycle, and hence a simple cycle, $\sigma$. We can write $\pi = \pi_1 \sigma \pi_2$. The path $\pi_1 \pi_2$ is well defined, as the vertex ending $\pi_1$ is also the one starting $\pi_2$. The path $\pi_1 \pi_2$ has necessarily a smaller size than $\pi$. If we suppose that it verifies the fact of the theorem, we have:

$$\begin{aligned}
|\pi| &\le |\pi_1 \pi_2| + |\sigma| - 1 \le |\pi_1 \pi_2| + H + 1 - 1 \\
&\le H \big( 1 + \mathrm{Res} - \langle r_{\pi_1} + r_{\pi_2}, s \rangle \big) + H \\
&\le H \big( 1 + \mathrm{Res} - \langle r_{\pi_1} + r_{\pi_2}, s \rangle \big) - H \langle r_\sigma, s \rangle \\
&\le H \big( 1 + \mathrm{Res} - \langle r_\pi, s \rangle \big) .
\end{aligned}$$

Corollary 3 Consider a pattern grid that admits a well sharp vector, and one of its vertices, chosen arbitrarily. There exists $L_{\max}$ such that all paths $\pi$ starting from this vertex with size bigger than $L_{\max}$ end in a vertex with at least one negative coordinate.

Proof: Let $\pi : a \times h \to a' \times h'$ be such a path. By Theorem 2, we have

$$\langle a', s \rangle \le \big( 1 + \mathrm{Res} + \langle a, s \rangle \big) - \frac{|\pi|}{H} .$$

If $\pi$ has a size bigger than $H (1 + \mathrm{Res} + \langle a, s \rangle)$, it implies $\langle a', s \rangle < 0$. All coordinates of $s$ are non-negative; $a'$ hence contains at least one negative coordinate.

Embedded Lattice Animal

Lattice animals, in dimension $d$, are connected subsets of a lattice $\mathbb{Z}^d$. They were introduced for the first time in [3] and [4], where their combinatorial properties were established. In this document, we follow the refined analysis made by Martin in [10].

In a lattice $\mathbb{Z}^d$, where weights are associated with each vertex, a greedy lattice animal (one of maximal weight in a collection) has a lot in common with a path with maximal weight appearing in last-passage percolation. One aspect which distinguishes them is that the edges allowed to connect elements of a lattice animal are purely based on neighboring adjacency, as defined by the lattice, while the edges appearing in a path may involve other relations, characterized by the subsets $(\Delta_{h,h'})_{h,h' \in \mathcal{H}}$.

Let us define the following adjacency relation between the vertices of a pattern grid. We start by the choice of an arbitrary projection of $\mathcal{H}$ in $\mathbb{Z}$, denoted $p : \mathcal{H} \to \{1, 2, \ldots, H\}$.

The vertex $a \times h$ in $\mathbb{Z}^d \times \mathcal{H}$ is called an adjacent predecessor of $a' \times h'$ in any of the three following cases (3.5):

(i) $p(h') > 1$, $p(h) = p(h') - 1$ and $a = a'$,

(ii) $p(h') = 1$, $p(h) = H$ and $a = (a'_1, a'_2, \ldots, a'_d - 1)$,

(iii) $h = h'$, and there exists $j \in \{1, 2, \ldots, d - 1\}$ such that $a = (a'_1, \ldots, a'_{j-1}, a'_j - 1, a'_{j+1}, \ldots, a'_d)$.

Vertices $a \times h$ and $a' \times h'$ are called adjacent if one is an adjacent predecessor of the other. In other words, the adjacency relation is the one imported from the lattice $\mathbb{Z}^d$ by the following bijection:

(3.6) $\mathbb{Z}^d \times \mathcal{H} \to \mathbb{Z}^d, \qquad a \times h \mapsto (a_1, a_2, a_3, \ldots, H.a_d + p(h) - 1)$

An animal on a pattern grid is a subset $\xi \subset \mathbb{Z}^d \times \mathcal{H}$ that is connected by pairs of adjacent vertices: for all $(a, h)$ and $(a', h')$ in $\xi$ there exists a sequence of elements of $\xi$

$$(a, h) = (a_0, h_0),\ (a_1, h_1),\ \ldots,\ (a_n, h_n) = (a', h')$$

such that for all $i \ge 1$, $(a_{i-1}, h_{i-1})$ and $(a_i, h_i)$ are adjacent.

The class of animals defined on a pattern grid with dimension $d$ inherits combinatorial properties from the class of animals defined on a lattice $\mathbb{Z}^d$, as a consequence of the bijection defined in (3.6). Moreover, the next result proves that, for a sharp pattern grid, this class captures the combinatorial expansion of paths.

Let us introduce the radius of a pattern grid as the finite maximum

$$\mathrm{Rad} = \max \big\{ \|r\|_\infty \ \text{for } r \in \Delta_{h,h'},\ h, h' \in \mathcal{H} \big\} .$$

It is the maximum difference on one coordinate between vertices $a \times h$ and $a' \times h'$ that are connected by an edge in the pattern grid.

Theorem 3 In a pattern grid with dimension $d$, admitting a well sharp vector $s$, let $\pi : a \times h \to a' \times h'$ be a path. It is included in an animal $\xi$ with

$$|\xi| \le (H^2 + d.H.\mathrm{Rad}) \big( 1 + \mathrm{Res} + \langle a - a', s \rangle \big) .$$
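The claim that the adjacency relation is imported from $\mathbb{Z}^d$ by the bijection (3.6) can be checked mechanically on a small box. The following is a minimal sketch; it takes $j$ in case (iii) to range over the first $d - 1$ coordinates, which is what the bijection requires, and it encodes the projection $p$ as a dictionary, which is an assumption of this illustration (as are the motif names).

```python
from itertools import product


def to_lattice(a, h, p, H):
    """Image of the vertex a x h under the bijection (3.6)."""
    return tuple(a[:-1]) + (H * a[-1] + p[h] - 1,)


def is_adjacent_predecessor(a, h, a2, h2, p, H):
    """Cases (i)-(iii): is a x h an adjacent predecessor of a2 x h2?"""
    a, a2 = tuple(a), tuple(a2)
    if p[h2] > 1 and p[h] == p[h2] - 1 and a == a2:                    # case (i)
        return True
    if p[h2] == 1 and p[h] == H and a == a2[:-1] + (a2[-1] - 1,):      # case (ii)
        return True
    if h == h2:                                                        # case (iii)
        return any(a == a2[:j] + (a2[j] - 1,) + a2[j + 1:]
                   for j in range(len(a2) - 1))
    return False


# Hypothetical motif set with H = 3 motifs, in dimension d = 2.
H = 3
p = {"x": 1, "y": 2, "z": 3}
vertices = [(a, h) for a in product(range(-1, 2), repeat=2) for h in p]


def is_unit_step(u, v):
    """True if v - u is one of the unit vectors of the lattice."""
    diff = sorted(y - x for x, y in zip(u, v))
    return diff == [0] * (len(diff) - 1) + [1]


consistent = all(
    is_adjacent_predecessor(a, h, a2, h2, p, H)
    == is_unit_step(to_lattice(a, h, p, H), to_lattice(a2, h2, p, H))
    for (a, h) in vertices for (a2, h2) in vertices
)
print(consistent)
```

The check compares, for every pair of vertices in the box, the three cases of the definition with the statement that the images differ by a unit vector of $\mathbb{Z}^d$.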

Proof: As a first remark, if two vertices are connected by an edge, the set $\{a \times h, a' \times h'\}$ is included in the following animal $\xi$, listed together with the number of vertices of each part:

$$\begin{array}{ll}
\xi = & \#\ \text{vertices} \\
a \times h,\ a \times p^{-1}(p(h) \pm 1),\ \ldots,\ a \times p^{-1}(p(h')) = a \times h', & |p(h') - p(h)| + 1 \\
(a_1 \pm 1, a_2, \ldots, a_d) \times h',\ \ldots,\ (a'_1, a_2, \ldots, a_d) \times h' & |a'_1 - a_1| \\
(a'_1, a_2 \pm 1, \ldots, a_d) \times h',\ \ldots,\ (a'_1, a'_2, \ldots, a_d) \times h' & |a'_2 - a_2| \\
\ldots & \ldots \\
(a'_1, a'_2, \ldots, a_d \pm 1) \times h',\ \ldots,\ (a'_1, a'_2, \ldots, a'_d) \times h' & |a'_d - a_d|
\end{array}$$

This animal contains at most $H + d.\mathrm{Rad}$ vertices.

The path $\pi$ contains less than $H (1 + \mathrm{Res} + \langle a - a', s \rangle)$ edges; the subset of its vertices can then be extended into an animal which contains less than $H (H + d.\mathrm{Rad}) (1 + \mathrm{Res} + \langle a - a', s \rangle)$ vertices in total.

Examples and Conclusion

All the Examples A, B and C that we described before fall in the category of sharp pattern grids. Using Proposition 1, we can proceed in two steps: first, treat all the trivial cycles and remove all their edges, and in a second step, work on the remaining edges to find other simple cycles.

Example A, B and C: Finding a Sharp Vector

Example A: As the dependence graph contains a single vertex, we have immediately that the family of elementary vectors is $(-1, 0), (0, -1)$, admitting $s = (1, 1)$ as a well sharp vector.

Example B: Only one trivial cycle, with label $(-1, 0)$, may be found. There remain four edges, two leading from $i$ to $o$, and two from $o$ to $i$. Choosing separately one or the other for each direction gives in total four simple cycles, with labels $(0, -1)$, $(-B_{\mathrm{OUT}}, 0)$, $(-B_{\mathrm{IN}}, 0)$, $(-B_{\mathrm{IN}} - B_{\mathrm{OUT}}, 1)$. All these labels except the last one have only non-positive coordinates, and none of them is null, such that any vector with positive coordinates will have a negative product with all of them. From the shape of the last label, and as we have $B_{\mathrm{IN}} + B_{\mathrm{OUT}} \ge 1$, we see that the vector $(2, 1)$ is a well sharp vector.

Example C: There are $L + 1$ trivial cycles (one on each vertex $l_1, \ldots, l_L$ with label $(-1, 0)$, and one on $a$, with label $(-W, 1)$). When edges corresponding to trivial cycles are removed, only one cycle remains, which contains all the remaining edges and has label $(0, -1)$. As for Example B, we can show easily that $(2, 1)$ is a well sharp vector.

Let us finish with the following remark: having a sharp vector is sufficient but not necessary to avoid loops (i.e. preventing deadlock) in the pattern grid. It is proved by the following example, taken from one in [7]:

Example D: Non-Sharp Pattern Grid

Let us consider the following dependence graph:

[Figure: dependence graph of Example D, with edges labeled $(0, -1)$, $(0, 0)$, $(-1, 1)$ and $(1, -1)$.]

• Clearly, this dependence graph is not sharp, as two trivial cycles are labeled by two opposite vectors.

• One can also easily check manually that it does not contain a cycle with null label. Let $\sigma$ be such a cycle; its null label is in particular a linear combination of the form $a.(1, -1) + b.(-1, 1) + c.(0, 0) + d.(0, -1) = 0$. Writing that the first coordinate of $r_\sigma$ is null proves that $a = b$. As the second coordinate of $r_\sigma$ is also supposed to be null, it implies that $d = 0$, so that the cycle cannot contain the edge with label $(0, -1)$. When this edge is removed from the graph, it is not strongly connected. No cycle can then be built that passes through two distinct vertices, and cycles that contain only one vertex clearly cannot have a null label.

As a consequence of Example D, the existence of a sharp vector does not characterize pattern grids that avoid loops. It is in particular different from the conditions that were introduced in [7].
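The manual check made in Example D on the linear combination of labels can be reproduced by enumeration. This is only a sketch of that one step, under the assumption that the coefficients $a, b, c, d$ are non-negative integers counting how many times each edge is used; it does not encode the graph itself, and the function names are hypothetical.

```python
from itertools import product

# Edge labels of Example D, in the order of the coefficients (a, b, c, d).
LABELS = [(1, -1), (-1, 1), (0, 0), (0, -1)]


def combination(coeffs):
    """Label a.(1,-1) + b.(-1,1) + c.(0,0) + d.(0,-1) for coeffs = (a, b, c, d)."""
    return tuple(sum(k * lab[i] for k, lab in zip(coeffs, LABELS)) for i in range(2))


def null_combinations(max_coeff):
    """All coefficient tuples in [0, max_coeff]^4 whose combination is the null vector."""
    return [k for k in product(range(max_coeff + 1), repeat=4)
            if combination(k) == (0, 0)]


for a, b, c, d in null_combinations(2):
    print(a, b, c, d)
```

Every null combination found has $a = b$ and $d = 0$, in agreement with the argument that a cycle with null label could not use the edge labeled $(0, -1)$.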

2 Directional Last-Passage Percolation

The last-passage percolation time is defined for any $a \in \mathbb{Z}^d$ and $h \in \mathcal{H}$ as

$$\mathrm{Per}_{a,h} = \sup_{\pi :\ a \times h \to 0 \times h} \mathrm{Wei}(\pi) .$$

Remark 2 To handle the case $a = 0$ we include in this supremum the path $(0 \times \mathrm{beg})$, which does not contain an edge, with a weight $\mathrm{Wei}(0 \times \mathrm{beg})$.

The previous section established some properties of large paths in pattern grids, exhibiting the contrast between the sharp and the non-sharp cases. In this section, we apply them to characterize the growth rate of the sequence $(\mathrm{Per}_{m.a,h})_{m \in \mathbb{N}}$, defined for any $a \in \mathbb{Z}^d$ and $h \in \mathcal{H}$ as the directional last-passage percolation, with direction $a$.

Stochastic Assumptions

Random edges: Our results are presented in a slightly more general framework than in §1, as the set of edges $E$ is now supposed to be a random variable (i.e. a random subset of $(\mathbb{Z}^d \times \mathcal{H})^2$), whose law is invariant by any translation made in the lattice $\mathbb{Z}^d$. Such a graph is called a random pattern grid.

We introduce the support of this random pattern grid: it is the deterministic pattern grid, defined on $\mathbb{Z}^d \times \mathcal{H}$, that contains the edge $(a \times h) \to (a' \times h')$ if and only if this edge is contained in $E$ with a positive probability.

It falls within the definition of a pattern grid given in §1. We always assume that this support is a locally finite graph.

Restriction: For any vector $b$ in $(\mathbb{Z} \cup \{-\infty\})^d$, the restricted pattern grid, denoted by $G^{[b]}_{\mathrm{patt}}$, is obtained after replacing the weight of a vertex $a \times h$ by $-\infty$ if we have $a_i < b_i$ for at least one index $i$ chosen in $\{1, \ldots, d\}$.

A vertex $a \times h$ in the grid $G^{[b]}_{\mathrm{patt}}$ is called valid if it does not have a weight equal to $-\infty$ (i.e. if we have $b \le a$). A path in the grid $G^{[b]}_{\mathrm{patt}}$ is called valid if it only contains valid vertices.

For any $a \in \mathbb{Z}^d$ and $h$ in $\mathcal{H}$, we denote by $\mathrm{Per}^{[b]}_{a,h}$ the last-passage percolation time in the restricted grid. In other words, it is the weight of a maximal valid path leading from $a \times h$ to $0 \times h$, such that in particular we have $\mathrm{Per}^{[b]}_{a,h} \le \mathrm{Per}_{a,h}$. Note also that $\mathrm{Per}^{[b]}_{a,h} = -\infty$ if $a$ is not a valid vertex.

We define the set of valid directions:

$$A(b) = \big\{ a \in \mathbb{Z}^d \ \big|\ \text{for all } i = 1, \ldots, d,\ b_i > -\infty \implies a_i \ge 0 \big\} .$$

We observe that if $a \notin A(b)$, then necessarily the sequence $\mathrm{Per}^{[b]}_{m.a,h}$ is constant equal to $-\infty$ for $m$ sufficiently large, as $m.a$ is not a valid vertex when $m$ is sufficiently large. This is why directional last-passage percolation, which is the subject of this section, is only interesting for valid directions.

Weights: We suppose that the weights associated with different vertices are independent random variables. The weight of $v \times h$ is supposed to follow a law that depends only on $h$ (except for restriction) and is upper bounded, for the stochastic ordering (see §III in Appendix A), by a variable $s$ verifying:

Condition 4 The law of $s$ verifies:
$$\int_0^{+\infty} P(s \ge u)^{1/d} \, du < \infty .$$

This condition implies $E[(s)^d] < +\infty$. It is in particular implied by $E[(s)^{d+\epsilon}] < +\infty$ for any positive $\epsilon$, but it is more general.

2.1 Comparison with Linear Rate

Let us start with the case of a random pattern grid with a sharp support.

Theorem 4 Let $G_{\mathrm{patt}}$ be a random pattern grid with sharp support. Any sequence of directional last-passage percolation times grows at most linearly:

$$\max_{h, h' \in \mathcal{H}} \limsup_{m \to \infty} \frac{1}{m} \Big( \sup_{\pi :\ m.a \times h \to 0 \times h'} \mathrm{Wei}(\pi) \Big) < +\infty \quad \text{a.s. and in expectation.}$$

Proof: Theorem 1.1 in [10] tells us that in a lattice $\mathbb{Z}^d$, with weights verifying Condition 4, the weight of a greedy lattice animal $\xi$ grows linearly:

$$\frac{1}{n} \Big( \max_{\xi \text{ latt. anim.},\ |\xi| = n,\ 0 \in \xi} \mathrm{Wei}(\xi) \Big) \to N < \infty \quad \text{a.s. and in } L^1, \text{ for } n \to \infty .$$

For any fixed $h$ and $h'$, as a consequence of Theorem 3, a path $\pi : m.a \times h \to 0 \times h'$ appearing in the definition of $\mathrm{Per}_{m.a,h}$ is contained in an animal $\xi$ with size smaller than

$$|\xi| \le (H^2 + d.\mathrm{Rad}.H) \big( 1 + \mathrm{Res} + m . \langle a, s \rangle \big) ,$$

where $s$ is a well sharp vector for this grid. All these animals contain in particular $0 \times h'$. The weight of a path is then upper bounded by the weight of a greedy animal that contains a fixed point, and we have that

$$\limsup_{m \to \infty} \frac{1}{m} \sup_{\pi :\ m.a \times h \to 0 \times h'} \mathrm{Wei}(\pi) \le (H^2 + d.\mathrm{Rad}.H) . \langle a, s \rangle . N < \infty .$$

The case of a non-sharp pattern grid is quite different. Let us first define the set of strictly valid directions $\mathring{A}(b) \subset A(b)$:

$$\mathring{A}(b) = \big\{ a \in \mathbb{Z}^d \ \big|\ \text{for all } i = 1, \ldots, d,\ b_i > -\infty \implies a_i > 0 \big\} .$$

We prove in a nominal case (a deterministic irreducible pattern grid of dimension 2) that growth is necessarily more than linear, for some directions.

Theorem 5 Let $G_{\mathrm{patt}}$ be a deterministic irreducible pattern grid with dimension 2, not admitting a sharp vector. Let $b \in (\mathbb{Z} \cup \{-\infty\})^2$, $h \in \mathcal{H}$. Weights are supposed non-negative, and not identically null.

Then either $\sup_{a \in \mathring{A}(b)} \mathrm{Per}^{[b]}_{a,h} = -\infty$,
or $\exists a \in \mathring{A}(b)$ such that $\dfrac{\mathrm{Per}^{[b]}_{m.a,h}}{m} \to +\infty$ for $m \to \infty$, a.s. and in expectation.

Proof: Assume $\sup_{a \in \mathring{A}(b)} \mathrm{Per}^{[b]}_{a,h} > -\infty$. First we show that it implies that $\exists a \in \mathring{A}(b)$ such that

$$\forall L,\ \exists \pi_L : a \times h \to 0 \times h \text{ with } |\pi_L| = \tilde{L} \ge L, \qquad \exists \pi \text{ a valid path}: a \times h \to 0 \times h .$$

As there exists $a \in \mathring{A}$ with $\mathrm{Per}^{[b]}_{a,h} \neq -\infty$, there exists a valid path leading from $a \times h$ to $0 \times h$, and hence a valid path from any $m.a \times h$ to $0 \times h$, where $m \ge 1$. By Theorem 1, we know there exists $a'$ in $\mathbb{Z}^d$ such that for any $L$ there exists a path with size larger than $L$ going from $a' \times h$ to $0 \times h$. By translation, we can always build a path that goes from $(m.a + a') \times h$ to $m.a \times h$ and then to $0 \times h$, with size at least $L$. By choosing $m$ arbitrarily large, we can always assume that $m.a + a'$ is in $\mathring{A}$, and that the path corresponding to $L = 1$ is valid.

Let us now prove that $\lim_{m \to \infty} \frac{E[\mathrm{Per}^{[b]}_{m.a,h}]}{m} = +\infty$. To simplify, assume that all vertices have the same weight law, with expectation $\overline{\mathrm{Wei}}$.

We denote by $\pi^{(m)}_L$ (respectively $\pi^{(m)}$) the translated version of the path $\pi_L$ (respectively $\pi$) that starts in $m.a \times h$ and ends in $(m-1).a \times h$. Note that for $m \ge 1$, $\pi^{(m)}$ is always valid. For $m \ge \tilde{L}.\mathrm{Rad}$, $\pi^{(m)}_L$ is valid as well, as we have for all $i = 1, 2$: $b_i \le 0$ and either $b_i = -\infty$ or $a_i \ge 1$.

We can then build the compound path from $m.a \times h$ to $0 \times h$:

$$\pi^{(m)}_L\ \pi^{(m-1)}_L\ \ldots\ \pi^{(\tilde{L}\mathrm{Rad}+1)}_L\ \pi^{(\tilde{L}\mathrm{Rad})}\ \ldots\ \pi^{(1)} .$$

It is valid and contains at least $(m - \tilde{L}.\mathrm{Rad}) L$ vertices, proving

$$E[\mathrm{Per}_{m.a,h}] \ge \frac{m}{2} L . \overline{\mathrm{Wei}} \quad \text{for } m \ge 2 \tilde{L}.\mathrm{Rad} .$$

Applying this to all $L \ge 0$ proves that $\lim_{m \to \infty} \frac{E[\mathrm{Per}_{m.a \times h}]}{m} = +\infty$.

For any $L$, let us denote by $(S^L_m)_{m \ge L.\mathrm{Rad}}$ the weight of the path from $m.a \times h$ to $(L.\mathrm{Rad}).a \times h$ constructed with translations of $\pi_L$. $S^L_m$ is the sum of $m - L.\mathrm{Rad}$ independent variables with weight $L.\overline{\mathrm{Wei}}$. Hence, when $m \to \infty$, by the law of large numbers,

$$\lim \frac{S^L_m}{m} = \lim \frac{S^L_m}{m - L.\mathrm{Rad}} = L\, \overline{\mathrm{Wei}} \quad \text{a.s.}$$

Hence $\exists M(L)$, a.s. finite, such that for $m \ge \max(M(L), 2L.\mathrm{Rad})$

$$\frac{S^L_m}{m} \ge \frac{L}{2} \overline{\mathrm{Wei}}, \quad \text{hence} \quad \frac{\mathrm{Per}^{[b]}_{m.a,h}}{m} \ge \frac{L}{2} \overline{\mathrm{Wei}} ,$$

which proves $\liminf \frac{\mathrm{Per}^{[b]}_{m.a,h}}{m} = +\infty$ a.s., as $L$ was chosen arbitrarily.

The result holds if vertices have different laws. As there is a finite number of laws involved in the weights, each of them can be bounded from below, in the stochastic order sense, by a distribution with positive mean. The result holds by stochastic comparison, as we consider only monotone functionals of the random variables.

A New Interpretation of the Sharpness Condition

Let us illustrate the case of a non-sharp pattern grid with an example.

Example D: Non-Sharp Pattern Grid

This example was described in §1.4. We have represented in the next figure (a) a local part of the pattern grid associated with the dependence graph previously shown.

[Figure: panels (a) and (b).]

Page 183: baccelli/Evaluation/AugustinChaintreauPhD.pdf

Let us consider the restricted pattern grid corresponding to vector b = (0, 0), i.e. the quadrant where we include only vertices with non-negative coordinates. As shown in the figure on the right, one can construct a long path from any vertex to the origin by following first the top-left direction (remaining only on black vertices), following an edge towards a white vertex before crossing the y-axis, then following a bottom-right line, remaining on white vertices.

It can be seen that, starting from coordinate (m, k), the length of this path for large m and k is of the order of $(m + k)^2$. As a consequence, directional last-passage percolation grows more than linearly.

Theorem 4 proves that no super-linear expansion can occur in a pattern grid with a sharp vector. Theorem 5 claims that, in dimension 2, if such a sharp vector cannot be found, super-linear explosions of paths (similar to that of Example D) necessarily occur for some valid directions, as long as the system is not too degenerate (i.e. in particular that it admits a valid path for strictly valid directions).

Condition 2 then comes as an additional criterion, defined on Uniform Recurrence Systems, stronger than the ones shown in [7]. It may be interpreted as a characteristic property of systems that remain linearly dependent on their past conditions.

Most of the distributed discrete-event systems that we met fall in the category of sharp dependence. As the next section shows, their asymptotic growth can be well described.

2.2 Asymptotic Linear Growth

The previous section has shown that directional last-passage percolation grows at most linearly in the case of a sharp pattern grid. We have in fact a more precise result: it grows asymptotically at a linear constant rate.

Theorem 6 We consider a random pattern grid $G^{[b]}_{\mathrm{patt}}$ with sharp support, $b \in (\mathbb{Z} \cup \{-\infty\})^d$, $h \in H$, and $a \in \mathbb{Z}^d$ a valid direction. We suppose:
• There is almost surely a valid path: $a \times h \leadsto 0 \times h$.
Then there exists a constant of directional last-passage percolation
$$\frac{\mathrm{Per}^{[b]}_{m.a,h}}{m} \to \delta^{[b]}(a, h) \in \mathbb{R} \quad \text{a.s. and in } L^1 \text{ for } m \to \infty.$$
NB: If no valid path exists from $a \times h$ to $0 \times h$, by convention $\delta^{[b]}(a, h) = -\infty$.

Remark 3 Assuming the existence of such a valid path guarantees that no time-valued variable in the sequence $(\mathrm{Per}^{[b]}_{m.a,h})_{m \ge 1}$ takes the value $-\infty$ with a positive probability. For some a, it may happen that this sequence takes the value $-\infty$ for only some first terms, such that the convergence still holds for them, as seen in Section 2.5.
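As a purely illustrative sketch (not part of the original text), the linear growth stated in Theorem 6 can be observed numerically on the simplest two-dimensional grid. The code below assumes a hypothetical dependence in which each vertex (m, k) of the quadrant depends on (m - 1, k) and (m, k - 1), which is the usual model of a tandem of single-server queues. The function names and the choice of exponential weights are assumptions made for this example only.

```python
import random

NEG_INF = float("-inf")

def last_passage(weights):
    """Return the table per[m][k] of maximal path weights to the origin (0, 0).

    weights[m][k] is the weight of vertex (m, k) in the quadrant m, k >= 0.
    Assumed dependence: (m, k) -> (m - 1, k) and (m, k) -> (m, k - 1).
    Vertices outside the quadrant carry the value -inf (restriction).
    """
    rows, cols = len(weights), len(weights[0])
    per = [[NEG_INF] * cols for _ in range(rows)]
    for m in range(rows):
        for k in range(cols):
            if m == 0 and k == 0:
                best = 0.0  # the origin has no predecessor
            else:
                left = per[m - 1][k] if m > 0 else NEG_INF
                down = per[m][k - 1] if k > 0 else NEG_INF
                best = max(left, down)
            per[m][k] = weights[m][k] + best
    return per

def directional_ratio(per, a, m):
    """Per_{m.a} / m for a direction a = (a1, a2) with non-negative entries."""
    return per[m * a[0]][m * a[1]] / m

if __name__ == "__main__":
    random.seed(0)
    n = 400
    w = [[random.expovariate(1.0) for _ in range(n + 1)] for _ in range(n + 1)]
    per = last_passage(w)
    for m in (50, 100, 200, 400):
        print(m, directional_ratio(per, (1, 1), m))
```

With i.i.d. weights the printed ratios should stabilise as m grows, in line with the almost sure convergence of the theorem. The limiting value depends on the weight distribution and is not asserted here.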


Proof: Let us introduce the following variables:
$$X^{[b]}_{m,m'} = \sup_{\substack{\pi \text{ path in } G^{[b+m.a]}_{\mathrm{patt}} \\ \pi :\, m'.a \times h \to m.a \times h}} \mathrm{Wei}(\pi) - \mathrm{Wei}(m.a \times h).$$
We would like to apply the super-additive ergodic theorem to this sequence of variables.

STEP 1: Let us first note that we have, for all $m < m' < m''$,
$$X^{[b]}_{m,m''} \ge X^{[b]}_{m,m'} + X^{[b]}_{m',m''}.$$
To prove it, let us denote by $\tilde{X}^{[b]}_{m',m''}$ the variable $X^{[b]}_{m',m''}$ where, in its definition, the pattern grid $G^{[b+m'.a]}_{\mathrm{patt}}$ was replaced by $G^{[b+m.a]}_{\mathrm{patt}}$.

As no coordinate of a is negative, except those for which $b_i$ is $-\infty$, having $m < m'$ implies $b + m.a \le b + m'.a$. Hence the variable $\tilde{X}^{[b]}_{m',m''}$ is greater than or equal to $X^{[b]}_{m',m''}$, and we can write
$$X^{[b]}_{m,m'} + X^{[b]}_{m',m''} \le X^{[b]}_{m,m'} + \tilde{X}^{[b]}_{m',m''}.$$
This right-hand side may be rewritten as:
$$\sup \Big\{ \underbrace{\mathrm{Wei}(\pi) + \mathrm{Wei}(\pi') - \mathrm{Wei}(m'.a \times h)}_{\mathrm{Wei}(\pi' \circ \pi)} - \mathrm{Wei}(m.a \times h) \Big\},$$
where the supremum is taken over all paths $\pi$ and $\pi'$ in $G^{[b+m.a]}_{\mathrm{patt}}$ such that $\pi : m'.a \times h \to m.a \times h$ and $\pi' : m''.a \times h \to m'.a \times h$. The path $\pi' \circ \pi$ is well defined in $G^{[b+m.a]}_{\mathrm{patt}}$; it goes from $m''.a \times h$ to $m.a \times h$. Its weight is hence upper bounded by $X^{[b]}_{m,m''} + \mathrm{Wei}(m.a \times h)$. This proves that the sequence is super-additive.

STEP 2: This sequence is strictly stationary. We introduce the shift $\theta_a$ by step a, as defined below. The law of $G_{\mathrm{patt}}$ is by definition invariant by this shift. This shift changes $X^{[b]}_{m,m'}$ into $X^{[b]}_{m+1,m'+1}$. Hence the law of $(X^{[b]}_{m,m'})_{m<m'}$ is identical to the one of $(X^{[b]}_{m+1,m'+1})_{m<m'}$.

To prove it, let us define explicitly the probability space and the mappings that we use on it.

• The state space is $\Omega = (\mathbb{R} \cup \{-\infty\})^{\mathbb{Z}^d \times H}$.

• Marginal projections in this space define the weights of the vertices: $\mathrm{Wei}(a \times h)(\omega) = \omega_{a \times h}$ for all $a \in \mathbb{Z}^d$ and $h \in H$. Note that, by independence among them, they define the law P.


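• The shift $\theta_a : \Omega \to \Omega$ is defined for every step $a \in \mathbb{Z}^d$ by
$$(\theta_a(\omega))_{a' \times h} = \omega_{(a+a') \times h}.$$
P is by definition left invariant by the shift $\theta_a$. We have $\mathrm{Wei}(a' \times h) \circ \theta_a = \mathrm{Wei}((a + a') \times h)$, which we can prove for all $\omega$:
$$(\mathrm{Wei}(a' \times h) \circ \theta_a)(\omega) = (\theta_a(\omega))_{a' \times h} = \omega_{(a+a') \times h} = \mathrm{Wei}((a+a') \times h)(\omega).$$

• For all $b \in (\mathbb{Z} \cup \{-\infty\})^d$ we define the annihilator $\phi^{[b]} : \Omega \to \Omega$ by:
$$(\phi^{[b]}(\omega))_{a \times h} = \begin{cases} -\infty & \text{if, for one } i \in \{1, 2, \dots, d\},\ a_i < b_i, \\ \omega_{a \times h} & \text{otherwise.} \end{cases}$$
Clearly the law P is not invariant by this mapping. For two variables $Y, \tilde{Y}$ defined in the same way respectively on $G_{\mathrm{patt}}$ and $G^{[b]}_{\mathrm{patt}}$, we can write $\tilde{Y} = Y \circ \phi^{[b]}$. A priori they have different distributions. We have $\phi^{[b]} \circ \theta_a = \theta_a \circ \phi^{[b+a]}$, which can be verified again $\omega$ by $\omega$:
$$\begin{aligned}
(\phi^{[b]}(\theta_a(\omega)))_{a' \times h} &= (\theta_a(\omega))_{a' \times h} - \infty \times \mathbb{1}_{\{\exists\, 1 \le i \le d,\ a'_i < b_i\}} \\
&= \omega_{(a'+a) \times h} - \infty \times \mathbb{1}_{\{\exists\, 1 \le i \le d,\ a'_i + a_i < b_i + a_i\}} \\
&= (\phi^{[b+a]}(\omega))_{(a'+a) \times h} \\
&= (\theta_a(\phi^{[b+a]}(\omega)))_{a' \times h}.
\end{aligned}$$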
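The commutation rule between the shift and the annihilator can be checked mechanically on a finite window of positions. The following sketch is only an illustration under stated assumptions: a configuration is modelled as a Python function from (position, h) to a weight, and the helper names theta, phi, sample_omega and commutes are hypothetical, chosen for this example.

```python
import random

NEG_INF = float("-inf")

def theta(a):
    """Shift by step a: (theta_a(omega))(a', h) = omega(a + a', h)."""
    def apply(omega):
        return lambda pos, h: omega(tuple(x + y for x, y in zip(a, pos)), h)
    return apply

def phi(b):
    """Annihilator: weight -inf at every position having some a_i < b_i."""
    def apply(omega):
        def restricted(pos, h):
            if any(x < bound for x, bound in zip(pos, b)):
                return NEG_INF
            return omega(pos, h)
        return restricted
    return apply

def sample_omega(seed):
    """A reproducible configuration of weights, defined lazily on Z^d x H."""
    cache = {}
    rng = random.Random(seed)
    def omega(pos, h):
        if (pos, h) not in cache:
            cache[(pos, h)] = rng.random()
        return cache[(pos, h)]
    return omega

def commutes(omega, a, b, points, h="h"):
    """Check phi[b] o theta_a == theta_a o phi[b + a] on the given positions."""
    b_plus_a = tuple(x + y for x, y in zip(b, a))
    lhs = phi(b)(theta(a)(omega))
    rhs = theta(a)(phi(b_plus_a)(omega))
    return all(lhs(p, h) == rhs(p, h) for p in points)
```

Coordinates of b equal to -inf are handled by ordinary floating-point comparison, since no finite coordinate is smaller than -inf and adding a finite step to -inf leaves it unchanged.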

To prove the stationarity of the sequence $(X^{[b]}_{m,m'})_{m<m'}$, let us first remark that if we introduce $X_{m,m'} = X^{[-\infty \times \dots \times -\infty]}_{m,m'}$, we have
$$(X_{m,m'})_{m<m'} \circ \theta_a = (X_{m+1,m'+1})_{m<m'}.$$
This is because, for any path $\pi$, $\mathrm{Wei}(\pi) \circ \theta_a = \mathrm{Wei}(\tilde{\pi})$, where $\tilde{\pi}$ is the image of $\pi$ by the translation of step a, which sends any vertex $(a' \times h)$ to $((a' + a) \times h)$.

We can then deduce the strict stationarity of the sequence with restriction in the following way:
$$\begin{aligned}
(X^{[b]}_{m,m'} \circ \theta_a)_{m<m'} &= (X_{m,m'} \circ \phi^{[b+m.a]} \circ \theta_a)_{m<m'} \\
&= (X_{m,m'} \circ \theta_a \circ \phi^{[b+(m+1).a]})_{m<m'} \\
&= (X_{m+1,m'+1} \circ \phi^{[b+(m+1).a]})_{m<m'} \\
&= (X^{[b]}_{m+1,m'+1})_{m<m'}.
\end{aligned}$$

STEP 3: This sequence admits a linear bound. As there exists a.s. a path $\pi : a \times h \leadsto 0 \times h$, the variables $X_{m,m+1}$ take the value $-\infty$ with null probability. By Theorem 2 applied to the support of $G_{\mathrm{patt}}$, we know that the number of paths included in the definition of $X^{[b]}_{m,m'}$ is finite, because their lengths


are upper bounded and the support is locally finite. As a consequence, $X^{[b]}_{m,m'}$ is a finite maximum, and has a finite expectation.

We know by Theorem 4 that we can choose A such that
$$\limsup \frac{E[X^{[b]}_{0,m}]}{m} \le E\Big[\limsup \frac{X^{[b]}_{0,m}}{m}\Big] \le A.$$

STEP 4: We will show the ergodicity of the shift $\theta_a$ when $a \ne 0$ (in this step we write T for $\theta_a$). Let us assume first that all edges are deterministic. The collection of cylinders
$$\big\{ \{\omega_{a_1 \times h_1} \in E_1, \dots, \omega_{a_n \times h_n} \in E_n\} \ \big|\ n \ge 0,\ a \in \mathbb{Z}^d,\ E_1, \dots, E_n \in \mathcal{B}(\mathbb{R}) \big\}$$
is closed under finite intersection, and it generates all the events of the Borel $\sigma$-field. Let C and D be two cylinders, defined with coordinates $c_1, \dots, c_n$ and $d_1, \dots, d_m$. Note that, by independence of the weights, two cylinders are independent if they are defined with distinct sets of coordinates. In particular, as $a \ne 0$, $T^{-k}(C)$ and D are independent for k sufficiently large. This proves that T is strongly mixing for the collection of cylinders. By Proposition 17 (i) and (ii), the shift is ergodic.

The same result holds when edges are independently distributed, when starting from edges with different coordinates.

N.B.: The same result holds if the edges are not independent, but follow the history of an irreducible and aperiodic Markov chain. It comes from the fact that Markov shifts are strongly mixing in this case, and that the product of strongly mixing shifts is strongly mixing (see [11] p.59, and [2] p.49).

By the super-additive ergodic theorem (see Corollary 5 in Appendix A), we can prove that $\frac{1}{m} X^{[b]}_{0,m}$ converges a.s. and in $L^1$ to a deterministic constant, which we denote by $\delta(a, h)$. This proves the result, as $X^{[b]}_{0,m} = \mathrm{Per}^{[b]}_{m.a,h} - \mathrm{Wei}(0 \times h)$.

Some Properties of the Function $\delta^{[b]}$

Proposition 3 For all b and $a, a'$ in $A(b)$:
(i) $E[\mathrm{Per}^{[b]}_{a,h}] \le \delta^{[b]}(a, h) + E[\mathrm{Wei}(0 \times h)]$.
(ii) $\delta^{[b]}(a + a', h) \ge \delta^{[b]}(a, h) + \delta^{[b]}(a', h)$.

Proof: (i) If there does not exist a.s. a path $a \times h \leadsto 0 \times h$ valid in $G^{[b]}_{\mathrm{patt}}$, the result is obvious, as $\mathrm{Per}^{[b]}_{a,h} = -\infty$ with a positive probability.

Otherwise, this is an easy consequence of the proof of the previous theorem, as we have by super-additivity
$$E[\mathrm{Per}^{[b]}_{a,h} - \mathrm{Wei}(0 \times h)] = E[X^{[b]}_{0,1}] \le \delta^{[b]}(a, h).$$


(ii) is obviously verified if either $\delta^{[b]}(a, h)$ or $\delta^{[b]}(a', h)$ is equal to $-\infty$. Otherwise, we proceed as in the proof of the super-additivity in the previous theorem. It is sufficient to prove that, for all m,
$$E[\mathrm{Per}^{[b]}_{m.(a+a'),h}] \ge E[\mathrm{Per}^{[b]}_{m.a,h}] + E[\mathrm{Per}^{[b]}_{m.a',h}] - E[\mathrm{Wei}(m.a \times h)].$$
Let us apply the shift $\theta_{m.a}$ to the variable $\mathrm{Per}^{[b]}_{m.a',h}$. What we obtain is a variable with the same expectation (by stationarity), which may be written
$$\mathrm{Per}^{[b]}_{m.a',h} \circ \theta_{m.a} = \sup_{\substack{\pi \text{ path in } G^{[b+m.a]}_{\mathrm{patt}} \\ \pi :\, (m.a' + m.a) \times h \leadsto m.a \times h}} \mathrm{Wei}(\pi).$$
This variable is less than or equal to the one with the same definition where $G^{[b+m.a]}_{\mathrm{patt}}$ is replaced by $G^{[b]}_{\mathrm{patt}}$, as we have $b \le b + m.a$.

Hence the sum made with this variable and $\mathrm{Per}^{[b]}_{m.a,h} - \mathrm{Wei}(m.a \times h)$ may be interpreted as the weight of a compound path $\tilde{\pi} \circ \pi$ in $G^{[b]}_{\mathrm{patt}}$, where $\pi$ is a path leading from $m.a \times h$ to $0 \times h$, and $\tilde{\pi}$ is a path leading from $m.(a + a') \times h$ to $m.a \times h$. This proves that this sum is upper bounded by $\mathrm{Per}^{[b]}_{m.(a+a'),h}$.

2.3 Ordering, Solidarity Property

Last-passage percolation in a general pattern grid is complex, as a path may either not exist or not be valid between two given points that we are considering (see, for example, the necessary conditions for Theorems 5 and 6). In this section, we introduce conditions to avoid these degenerate cases.

Ordering

Pattern grids belonging to this class are the ones in which a valid path may always be found when coordinates are non-increasing in all dimensions. This is more or less assuming that the natural order of indices $a_1, \dots, a_d$ corresponds in reality to a precedence relation of tasks.

Let us denote by $e_i$ the vector with coordinate i equal to 1, all others being null.

Condition 5 A pattern grid is totally ordered if

(i) All its weights are non-negative (except for the ones that were replaced by $-\infty$ after restriction).
(ii) For all $h \in H$ and $i = 1, \dots, d$, there exists almost surely a path
$\pi : a \times h \leadsto (a - e_i) \times h$ valid in $G^{[a - e_i]}_{\mathrm{patt}}$.


Examples A-B-C-D: Total Ordering

Example A: It defines a pattern grid with total ordering, for obvious reasons.

Example B: Strictly speaking, the dependence graph shown for this example is not ordered. There exists a path $(m, k) \times h \leadsto (m, k - 1) \times h$ for $h = i, o$, and a path $(m, k) \times i \leadsto (m - 1, k) \times i$, but one cannot find a path $(m, k) \times o \leadsto (m - 1, k) \times o$.

Let us add the edges $o \to o$ with label $(-1, 0)$. We claim that it does not impact the system of recurrence equations. First, as the weight in o is assumed to be null, we have in this new recurrence system
$$\begin{aligned}
T_o(m, k) &= \max\big(T_i(m, k),\ T_i(m - B_{IN}, k + 1),\ T_o(m - 1, k)\big) \\
&= \max\big(T_i(m, k),\ T_i(m - B_{IN}, k + 1),\ T_i(m - 1, k),\ T_i(m - 1 - B_{IN}, k + 1),\ T_o(m - 2, k)\big) \\
&= \max\big(T_i(m, k),\ T_i(m - B_{IN}, k + 1),\ T_o(m - 2, k)\big) \\
&\ \ \vdots \\
&= \max\big(T_i(m, k),\ T_i(m - B_{IN}, k + 1),\ T_o(m - B_{OUT}, k)\big) \\
&= \max\big(T_i(m, k),\ T_i(m - B_{IN}, k + 1)\big),
\end{aligned}$$
because $T_i(m, k) \ge T_o(m - B_{OUT}, k)$. The completed pattern grid is then totally ordered.

Example C: It is again trivially a totally ordered graph.

Example D: It does not define a totally ordered pattern grid. From a white vertex with coordinates $(m_0, k_0)$, we cannot reach $(m_0 - 1, k_0)$ while staying in the domain where $k \ge k_0$. In particular we have $\delta^{[(0,0)]}((1, 0)) = -\infty$.

Total order implies a comparison of directional last-passage percolation, based on the ordering of their directions.

Proposition 4 In a pattern grid with sharp support, totally ordered:

(i) For all $a \le a'$, $\mathrm{Per}^{[b]}_{a,h} \le \mathrm{Per}^{[b]}_{a',h}$, and hence $\delta^{[b]}(a, h) \le \delta^{[b]}(a', h)$.
(ii) A well sharp vector s has all its coordinates $\ge 1$.
(iii) For $b \le 0 \le a$, we have $\delta^{[b]}(a, h) > -\infty$.

Proof: (i) The comparison is obvious if we do not have $b \le a$; we can then assume it. In this case, we can build a valid path $\pi : a' \times h \to a \times h$ in $G^{[b]}_{\mathrm{patt}}$ by decreasing coordinates one by one. We can consider
$$\mathrm{Per}^{[b]}_{a,h} + \mathrm{Wei}(\pi) - \mathrm{Wei}(a \times h) \ge \mathrm{Per}^{[b]}_{a,h}$$
and interpret it as the weight of some paths leading from $a' \times h$ to $0 \times h$, proving the result.

(ii) A path $a \times h \leadsto (a - e_i) \times h$, which exists by definition, defines a cycle in the dependence graph, which proves that $\langle -e_i, s \rangle \le -1$.

(iii) By Theorem 6, all we need to verify is that there exists a valid path $a \times h \leadsto 0 \times h$, which can be constructed by decreasing one by one each of the non-negative coordinates of a.


Solidarity

The solidarity property of a pattern grid, defined over one of its dimensions i, certifies that some path starting with a null i-th coordinate ends in any vertex. It is usually assumed together with total order. In practice both conditions are typically verified by closed-loop systems.

Condition 6 A pattern grid $G^{[b]}_{\mathrm{patt}}$ with dimension d is called solidar along its i-th dimension if there always exists a valid path to any valid vertex that starts from a vertex with a null i-th coordinate.

N.B.: In order for this to be verified, we necessarily assume that $b_i \le 0$.

Verifying this condition for each valid vertex may seem too complex. Under a general condition on the restriction, usually met in practice, we can prove it using a single vertex.

Proposition 5 For a totally ordered grid $G^{[b]}_{\mathrm{patt}}$, with $\forall j,\ b_j \in \{0, -\infty\}$:
It is solidar w.r.t. dimension $i \in \{1, \dots, d\}$ if and only if, for all h,

(i) there exists $\bar{a} \in \mathbb{Z}^d$ with $\bar{a}_i = 0$ and a valid path: $\bar{a} \times h \leadsto e_i \times h$.
(ii) If (i) is verified for $\bar{a}$, then for any valid vertex $a \times h$ there exists a valid path:
$c \times h \leadsto a \times h$ with $c_j = a_i.\bar{a}_j + a_j$ for $j \ne i$, and $c_j = 0$ for $j = i$.

Proof: For any h, $e_i \times h$ is a valid vertex, hence solidarity implies (i).

Let us assume (i). We note first that, by total ordering, the path defining solidarity can always be found towards a valid vertex whose i-th coordinate is negative or null.

Also, by translation, we can construct a path $m.\bar{a} \times h \leadsto m.e_i \times h$ for any $m \ge 1$. It is obtained by concatenating copies of the same path, translated by $\bar{a}$ in the grid. The only thing to check is that each path remains valid after translation by the vector $\bar{a}$ or $e_i$, and this is always true as no coordinate of b is finite and negative ($\bar{a}$ could indeed have a negative coordinate in dimension j, but as it defines a valid path this would imply $b_j = -\infty$).

We now consider any valid vertex $a \times h$. Whatever the value of $a_i$, we can construct a valid path from $a_i.\bar{a} \times h$ to $a_i.e_i \times h$. Let us translate this path by the vector $a'$ that is equal to a, except for its coordinate i, which is null. This cannot contradict the validity of the path, for the same reason as in the previous paragraph.

This actually constructs a path from $c \times h$ (as defined in (ii)) to $a \times h$; this proves (ii) and the solidarity property.


Examples A-B-C: Solidarity

Example A typically describes an open-loop system. In this case, no coordinate can increase along any path, so that it is impossible to have solidarity for either dimension 1 or dimension 2.

Example B has solidarity with regard to dimension 2. In the dependence graph, we can follow the paths $(B_{IN} + B_{OUT}, 0) \times o \to (B_{OUT}, 1) \times i \to (0, 1) \times o$ and $(B_{IN} + B_{OUT}, 0) \times i \to (B_{IN}, 0) \times o \to (0, 1) \times i$.

This corresponds to the precedence rule: packet $m = B_{IN} + B_{OUT}$ cannot leave station $k = 0$ before packet $m = 0$ leaves station $k = 1$.

Example C has the solidarity property for dimension 2; it may be shown in a similar way.

2.4 Hydrodynamic Scaling 1: Quadrant

For any valid direction, a constant of directional last-passage percolation exists as soon as there is a valid path following this vector; for totally ordered grids, this covers a large class of directions. The constant of directional percolation can be described in more detail in the case of dimension 2, as done in this section. Functional properties of the limit, called hydrodynamic by similarity with some models found in statistical physics, are instrumental to the later study of the stationary regime.

In this section, we treat the case of a totally ordered pattern grid with dimension 2, restricted along both its dimensions, denoted $G^{[0]}_{\mathrm{patt}}$. We denote by m and k its two indices, which play a symmetrical role in this case. Note that all directions with non-negative m and k are valid and that, by total ordering, $\delta$ is always a finite positive constant on these directions.

Let h be an element of H, fixed arbitrarily until the end of this section. We define the velocity in dimension i by
$$\frac{1}{\mathrm{Velo}_i} = \delta^{[0]}(e_i, h).$$
It may be thought of as the speed to complete steps in dimension i only, all other indices remaining equal to 0. For Examples A, B and C, where the first dimension represents a packet sequence number, this velocity may be thought of as a throughput expressed in packets processed per unit of time.

Definition and Regularity on ]0;+∞[

The next theorem is an adaptation of Theorem 6.3 in [6], which rewrites the function $\delta$ using two real-valued functions defined on $\mathbb{R}^+$.

Theorem 7 Let $G^{[0]}_{\mathrm{patt}}$ denote a random pattern grid with sharp support, totally ordered. There exist functions defined for $x \ge 0$ as the following


deterministic limits (for a.s. and $L^1$ convergence):
$$\gamma_1(x) = \lim_{k \to \infty} \frac{\mathrm{Per}^{[0]}_{(\lfloor x.k \rfloor, k),h}}{k}\ ; \qquad \gamma_2(x) = \lim_{m \to \infty} \frac{\mathrm{Per}^{[0]}_{(m, \lfloor x.m \rfloor),h}}{m}.$$
We refer to them as the hydrodynamic limits of the grid $G^{[0]}_{\mathrm{patt}}$. We have in particular $\gamma_1(0) = \frac{1}{\mathrm{Velo}_2}$ and $\gamma_2(0) = \frac{1}{\mathrm{Velo}_1}$. Moreover:
(i) $\gamma_1, \gamma_2$ are non-decreasing, concave, continuous on ]0,+∞[.
(ii) They verify $\gamma_1(x) = x\,\gamma_2(1/x)$ for all $x > 0$.

Proof: Let us first prove that $\gamma_1$ is well defined.

• Step 1: When $x = \frac{p}{r} \ge 0$ is a rational number, we have by Prop. 4 (i)
$$\mathrm{Per}^{[0]}_{(p\lfloor k/r \rfloor,\, r\lfloor k/r \rfloor),h} \le \mathrm{Per}^{[0]}_{(\lfloor (p/r)k \rfloor,\, k),h} \le \mathrm{Per}^{[0]}_{(p\lfloor (k/r)+1 \rfloor,\, r\lfloor (k/r)+1 \rfloor),h},$$
for all $k \ge 0$. Hence $\frac{1}{k}\mathrm{Per}^{[0]}_{(\lfloor (p/r)k \rfloor, k),h} \to \frac{1}{r}\,\delta^{[0]}((p, r), h)$ a.s. and in $L^1$; $\delta^{[0]}((p, r), h)$ is not $-\infty$ thanks to Prop. 4 (iii).

The function $\gamma_1$ is thus well defined for any rational number, and is non-decreasing. Note that $\gamma_1(0) = \delta^{[0]}((0, 1), h)$. Let us note that $\gamma_1(x) = \frac{1}{r}\,\delta^{[0]}((p, r), h)$ holds for all integers p, r such that $x = \frac{p}{r}$.

• Step 2: For $x = \frac{p}{r}$, $y = \frac{q}{r}$ and $t = \frac{a}{b} \le 1$, all positive rational numbers, we have
$$\begin{aligned}
\gamma_1(tx + (1 - t)y) &= \tfrac{1}{br}\,\delta^{[0]}((ap + (b - a)q,\ br), h) \\
&\ge \tfrac{1}{br}\,\delta^{[0]}((ap, ar), h) + \tfrac{1}{br}\,\delta^{[0]}(((b - a)q, (b - a)r), h) \\
&\ge t\,\gamma_1(x) + (1 - t)\,\gamma_1(y),
\end{aligned}$$
which proves that $\gamma_1$ is concave on the set of rational numbers.

$\gamma_1$ is non-decreasing, hence it admits at every point a left limit and a right limit (except at the point 0, where it only admits a right limit). These two limits are equal on ]0,∞[; otherwise it would contradict its concavity. We deduce that $\gamma_1$ is continuous at all positive rational numbers. By continuity, we can extend the limit that defines $\gamma_1$ to all reals, as this sequence may be lower bounded and upper bounded by sequences that correspond to rational numbers and have the same limit.

The same arguments could be applied to the definition of $\gamma_2$; we obtain in particular:
$$\gamma_2(x) = \frac{1}{r}\,\delta^{[0]}(r, p) \quad \text{for all } p, r \text{ such that } x = \frac{p}{r},$$


hence,
$$\gamma_1(x) = \frac{1}{r}\,\delta^{[0]}(p, r) = \frac{p}{r} \cdot \frac{1}{p}\,\delta^{[0]}(p, r) = x\,\gamma_2\Big(\frac{1}{x}\Big).$$
By continuity, this identity holds for all positive real numbers x.

Continuity of the Hydrodynamic Limit Around Zero

The hydrodynamic limit functions are defined on $\mathbb{R}^+$ and continuous at any positive real value, but the continuity at zero cannot be shown with the same method. It is in fact quite a challenging question.

It was shown first in [6], assuming light-tailed service times, for the particular case of an infinite tandem of single-server queues (i.e. Example A). This proof relies on a quite remarkable enumeration of paths with the Stirling formula that cannot, to the best of our knowledge, be reproduced for a general pattern grid. The light-tail assumption is not needed, as it was shown in [1] that the same results hold for an infinite tandem of single-server queues with general service times whose law verifies Condition 4. It used a technique of concatenation, bounding the difference with lattice animals, and Lee's continuity criteria [8]. We do not know today how to prove it generally, and there is some evidence that this fact could require additional conditions on the dependence relation of the pattern grid.

There is however one general class of pattern grids for which this difficult question can be solved, at least partially: the ones that exhibit the solidarity property. Let us assume this property is verified for dimension i; the following result shows that the corresponding hydrodynamic limit is then continuous at point zero. We wrote the following results, and their proof, according to the notation already used for Theorem 7; they are used later as a single result.

Theorem 8 Under the assumptions of Theorem 7, introducing $\frac{1}{\mathrm{Thres}} = \lim_{0^+} \gamma_2 \ge \frac{1}{\mathrm{Velo}_1}$:
(iii) $\gamma_1(x + y) \ge \gamma_1(x) + y.\frac{1}{\mathrm{Thres}}$ and $\lim_{\infty} \gamma_1' = \lim_{x \to \infty} \frac{\gamma_1(x)}{x} = \frac{1}{\mathrm{Thres}}$.
(iv) If the grid has solidarity with respect to dimension 2, $\mathrm{Velo}_1 = \mathrm{Thres}$.

Proof: We conduct it as a continuation of the previous proof.

• Step 4: Let us consider $x = \frac{p}{r}$ and $y = \frac{q}{r}$, two rational numbers. We have:
$$\begin{aligned}
\gamma_1(x + y) &= \tfrac{1}{r}\,\delta^{[0]}((p + q, r) \times h) \\
&\ge \tfrac{1}{r}\,\delta^{[0]}((p, r - 1) \times h) + \tfrac{1}{r}\,\delta^{[0]}((q, 1) \times h) \\
&\ge \tfrac{r-1}{r}\,\gamma_1(x) + \tfrac{q}{r}\,\underbrace{\tfrac{1}{q}\,\delta^{[0]}((q, 1) \times h)}_{=\gamma_2(1/q)\ \ge\ \gamma_2(0^+)}
\end{aligned}$$


As this is true for any r, we have the first part of (iii). It extends to all real numbers by continuity, and proves that $\gamma_1'(x) \ge \frac{1}{\mathrm{Thres}}$.

To conclude on (iii), let us remark that, as $\gamma_1$ is a concave function, for $x > 0$ and $y > 0$
$$\frac{\gamma_1(x + y) - \gamma_1(x)}{y} \le \frac{\gamma_1(x)}{x} = \gamma_2\Big(\frac{1}{x}\Big).$$
This proves $\frac{1}{\mathrm{Thres}} \le \gamma_1'(x) \le \gamma_2(\frac{1}{x})$, which allows us to conclude.

• Step 5: We now assume that $G^{[0]}_{\mathrm{patt}}$ has solidarity for the second dimension. Proposition 5 (ii) applied to the vector $(r, p)$ shows that, for any $(p \ge 0, r \ge 0)$, there is a path $(p.\bar{a}_1 + r, 0) \times h \leadsto (r, p) \times h$, where $\bar{a} = (\bar{a}_1, 0)$ denotes the vector given by Proposition 5 (i). This implies in particular that $\delta^{[0]}((r, p) \times h) \le \delta^{[0]}((p.\bar{a}_1 + r, 0) \times h)$. Hence we have for $x = \frac{p}{r}$:
$$\gamma_2(x) = \frac{1}{r}\,\delta^{[0]}((r, p) \times h) \le \underbrace{\frac{p.\bar{a}_1 + r}{r}}_{=1 + \bar{a}_1.x} \cdot \underbrace{\frac{1}{p.\bar{a}_1 + r}\,\delta^{[0]}((p.\bar{a}_1 + r, 0) \times h)}_{=\gamma_2(0)}.$$
This proves $\gamma_2(x) \le \gamma_2(0)(1 + C.x)$ for a constant C, and the result.

This result could be thought of as surprising. Previous works established the continuity for a particular pattern grid that is exactly at the opposite of the solidarity property (the open-loop infinite tandem of single-server queues). In reality, the same common result is obtained in these two cases, but by two different methods. The general case, which could mix global open-loop behavior avoiding solidarity while allowing local closed-loop systems, is an interesting subject for further research.

To conclude, let us remark that this is far from being only of technical interest: the limit of the hydrodynamic limit at zero is related to the upper bound of the stability region of the recurrence system (see Section 3.2).

2.5 Hydrodynamic Scaling 2: Extended Quadrant

Results from the previous section are now extended to the case where the restriction only applies to one dimension: we assume that the first index m can take any value, positive or negative; the pattern grid is hence denoted by $G^{[-\infty,0]}_{\mathrm{patt}}$. In other words, valid vertices correspond to all positions $(m, k)$ found in the upper half-plane $k \ge 0$.

Remark 4 Results shown here are based on a variation of the same sets of arguments and share a lot of aspects with the ones presented in the previous section. To keep notation to a minimum, we did not indicate on the symbols $\mathrm{Velo}_1$, $\mathrm{Velo}_2$, Thres, $\gamma_1$, $\gamma_2$ whether they belong to the quadrant or the extended quadrant. In general, they are not identical. For each application, it is important to indicate which restriction was used to define the precedence relation.

Critical Direction, and Super-Critical Paths

In the previous section, we only considered directions defined with positive coordinates. The assumption of Theorem 6 was in this case guaranteed by total ordering and Proposition 4 (iii). This is not the case anymore, and we now need to study which directions of the space are likely to have a last-passage percolation rate.

Let us define a set, known as the interval of direction:
$$I_{dir} = \Big\{ x \in \mathbb{R}\ \Big|\ \sup_{p, r \in \mathbb{Z} \times \mathbb{N},\ \frac{p}{r} < x} \delta^{[-\infty \times 0]}((p, r) \times h) > -\infty \Big\}.$$

Proposition 6 For $G^{[-\infty,0]}_{\mathrm{patt}}$, with sharp support for vector s, totally ordered:
(i) $I_{dir}$ is an interval that can be written $I_{dir} = \,]X_{dir}; +\infty[$,
(ii) $X_{dir}$ is also called the critical direction, and $-\infty < -\frac{s_2}{s_1} \le X_{dir} \le 0$.

Proof: Clearly $x \in I_{dir}$ and $x \le y$ implies $y \in I_{dir}$; $I_{dir}$ is hence an interval with infinite upper bound. It cannot contain its lower bound, as $x \in I_{dir}$ implies that there exists $\frac{p}{r} < x$ verifying $\delta^{[-\infty \times 0]}((p, r) \times h) \ne -\infty$, meaning that $]\frac{p}{r}; +\infty[$ is contained in $I_{dir}$. This proves (i).

Let p, r satisfy the definition of $I_{dir}$; in other words, there exists a path from $(p, r) \times h$ to $0 \times h$. It defines a cycle in the dependence graph, such that $-p s_1 - r s_2 \le -1$, where s is a well sharp vector. We deduce that $\frac{p}{r} \ge -\frac{s_2}{s_1} > -\infty$. As the grid is totally ordered, any positive p verifies the condition, hence $I_{dir}$ contains all positive real numbers. This proves (ii).

The function $\delta$ was defined as a functional of the sequence $(\mathrm{Per}_{m.a,h})_{m \ge 1}$, whose elements are in $\mathbb{R} \cup \{-\infty\}$. By convention, we fixed $\delta$ to $-\infty$ whenever the first element of the sequence is $-\infty$ or, equivalently, whenever this sequence does not contain only finite real values. In the next results, we prove that almost surely this sequence is identically $-\infty$, or that it only contains a finite number of non-real elements. It is then possible, with another convention, to define $\delta$ as an almost sure limit for all directions.

Proposition 7 For $G^{[-\infty \times 0]}_{\mathrm{patt}}$, with sharp support, totally ordered, and $x \in I_{dir}$:

(i) $\delta^{[-\infty \times 0]}((\lfloor kx \rfloor, k) \times h) \ne -\infty$, for k large enough.
(ii) If $-\frac{p}{r} < x < 0$, $rx + p \ge 1$, and $k \ge pr$, then there exists a valid path $(\lfloor kx \rfloor, k) \leadsto (-np, nr)$ where $n \ge \frac{(-x)k}{p}$.


(iii) If $x < -\frac{p}{r} < 0$, $rx + p \le -1$ and $k \ge pr$, then there exists a valid path $(-np, nr) \leadsto (\lfloor kx \rfloor, k)$, with $\frac{k}{r} \le n \le \frac{(-x)k}{p}$.

Proof: If x is positive, (i) holds by total ordering. We assume that $x < 0$ is in $I_{dir}$. We first establish property (ii).

By $rx + p \ge 1$ and $k \ge p.r$, we have, using a multiplication by $\frac{k}{r}$, that:
$$(3.7)\qquad kx \ge -k.\frac{p}{r} + \frac{k}{r} \ge -k.\frac{p}{r} + p, \quad \text{which implies} \quad \lfloor kx \rfloor \ge -k.\frac{p}{r} + (p - 1).$$
Let us consider the following p numbers: $\lfloor kx \rfloor, \lfloor kx \rfloor - 1, \dots, \lfloor kx \rfloor - (p - 1)$. One of them is a multiple of $-p$, i.e. there exists $0 \le i \le p - 1$ such that $\lfloor kx \rfloor - i = -n.p$. This implies $n \ge \frac{k}{p}.(-x)$.

Note that $i \ge 0$ implies $-n.p \le \lfloor kx \rfloor$, and that $i \le p - 1$ implies by (3.7) that $-n.p \ge \lfloor kx \rfloor - (p - 1) \ge -k\frac{p}{r}$, hence $n.r \le k$. In other words, $(-np, nr) \le (\lfloor kx \rfloor, k)$.

By total ordering, a path exists $(\lfloor kx \rfloor, k) \times h \leadsto (-np, nr) \times h$. This path is valid in $G^{[(-np, nr)]}_{\mathrm{patt}}$, hence in $G^{[-\infty \times 0]}_{\mathrm{patt}}$ as well. This proves (ii).

Let x be in $I_{dir}$; there exist positive integers p, r such that $-\frac{p}{r} < x$ and there is a.s. a valid path $(-p, r) \times h \leadsto 0 \times h$. In addition, we can suppose that $rx \ge -p + 1$; otherwise p and r can be multiplied by the same integer to satisfy this condition. By (ii), if k is greater than or equal to pr, there exists a valid path $(\lfloor kx \rfloor, k) \times h \leadsto (-np, nr) \times h$, and hence a valid path $(\lfloor kx \rfloor, k) \times h \leadsto (0, 0) \times h$. This proves (i) for $x < 0$. For $x = 0$ and $x \in I_{dir}$, it is easy to prove the same property with a similar technique.

Let us now prove (iii). We suppose $rx + p \le -1$ and $k \ge pr$; it implies $kx + k\frac{p}{r} \le -\frac{k}{r} \le -p$. Among the p numbers $\lfloor kx \rfloor + 1, \lfloor kx \rfloor + 2, \dots, \lfloor kx \rfloor + p$, there is a multiple of $-p$, so that $\lfloor kx \rfloor + i = -n.p$ for some $1 \le i \le p$. We have $-np \ge \lfloor kx \rfloor + 1 \ge kx$, hence $n \le \frac{k}{p}.(-x)$, and $-np \le \lfloor kx \rfloor + p \le kx + p \le -k\frac{p}{r}$, hence $n \ge \frac{k}{r}$.

It implies $(-np, nr) \ge (\lfloor kx \rfloor, k)$, which proves by total ordering the expected result.

Definition and Regularity of Hydrodynamic Limits

The previous technical facts allow us to extend the hydrodynamic limit proved in the quadrant to the pattern grid $G^{[-\infty \times 0]}_{\mathrm{patt}}$.

Theorem 9 Let $G^{[-\infty \times 0]}_{\mathrm{patt}}$ have a sharp support, and be totally ordered. The functions $\gamma_1, \gamma_2$, called hydrodynamic limits, that take value in $\mathbb{R}\ \cup$


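$\{-\infty\}$, can be defined for any $x \in \mathbb{R}$ as the following almost sure limits:
$$\gamma_1(x) = \begin{cases} -\infty & \text{for } x < X_{dir}, \\ \lim_{k \to \infty} \frac{1}{k}\,\mathrm{Per}^{[-\infty \times 0]}_{(\lfloor x.k \rfloor, k),h} \in \mathbb{R} & \text{for } x > X_{dir}, \text{ or } x \ge 0, \end{cases}$$
$$\gamma_2(x) = \begin{cases} -\infty & \text{for } x < 0, \\ \lim_{m \to \infty} \frac{1}{m}\,\mathrm{Per}^{[-\infty \times 0]}_{(m, \lfloor x.m \rfloor),h} & \text{for } x \ge 0. \end{cases}$$
For each of them, in the second case, the convergence is also true in $L^1$. We have $\gamma_1(0) = \frac{1}{\mathrm{Velo}_2}$, $\gamma_2(0) = \frac{1}{\mathrm{Velo}_1}$, and we define $\gamma_2(0^+) = \frac{1}{\mathrm{Thres}}$.
(i) $\gamma_1$ and $\gamma_2$ are non-decreasing, concave, continuous on $I_{dir}$ and ]0,+∞[.
(ii) For $x > 0$, we have $\gamma_1(x) = x\,\gamma_2(1/x)$.
(iii) For $x \in \mathbb{R}$, $y \ge 0$, $\gamma_1(x + y) \ge \gamma_1(x) + y.\frac{1}{\mathrm{Thres}}$ and $\lim_{\infty} \gamma_1' = \frac{1}{\mathrm{Thres}}$.
(iv) If $G^{[-\infty,0]}_{\mathrm{patt}}$ shows solidarity w.r.t. dimension 2, $\frac{1}{\mathrm{Thres}} = \frac{1}{\mathrm{Velo}_1}$.

Proof: To simplify notation, let us denote $\delta^{[-\infty \times 0]}((p, r) \times h)$ by $\delta(p, r)$ and $\mathrm{Per}^{[-\infty \times 0]}_{(p,r) \times h}$ by $\mathrm{Per}_{p,r}$.

Definition, Concavity, Continuity: We define $\gamma_1$ on $\mathbb{R}$ as
$$\gamma_1(x) = \sup_{p, r \in \mathbb{Z} \times \mathbb{N}^*,\ \frac{p}{r} \le x} \frac{\delta(p, r)}{r} = \limsup_{r \to \infty} \frac{\delta(\lfloor xr \rfloor, r)}{r} = \limsup_{r \to \infty} \frac{1}{r}\,\mathrm{Per}_{\lfloor xr \rfloor, r}.$$
We know by definition that $\gamma_1(x) > -\infty$ if and only if $x > X_{dir}$. The argument shown in Step 1 of the proof of Theorem 7 proves that this function $\gamma_1$ is well defined and finite when $x \ge 0$. The function $\gamma_1$ is obviously non-decreasing, so that $\gamma_1(x) < +\infty$ for any $x \in \mathbb{R}$.

This function is concave, since for $\frac{p_1}{r_1} \le x$, $\frac{p_2}{r_2} \le y$ and $0 \le t = \frac{a}{b} \le 1$,
$$\begin{aligned}
\tfrac{a}{b}\,\tfrac{1}{r_1}\,\delta(p_1, r_1) + \tfrac{b-a}{b}\,\tfrac{1}{r_2}\,\delta(p_2, r_2)
&\le \tfrac{1}{b r_1 r_2}\big(a\,\delta(p_1 r_2, r_1 r_2) + (b - a)\,\delta(p_2 r_1, r_1 r_2)\big) \\
&\le \tfrac{1}{b r_1 r_2}\big(\delta(a p_1 r_2, a r_1 r_2) + \delta((b - a) p_2 r_1, (b - a) r_1 r_2)\big) \\
&\le \tfrac{1}{b r_1 r_2}\,\delta(a p_1 r_2 + (b - a) p_2 r_1,\ b r_1 r_2) \\
&\le \gamma_1\big(\tfrac{a}{b} x + \tfrac{b-a}{b} y\big).
\end{aligned}$$
As this holds for any $p_1, r_1, p_2, r_2$ satisfying the previous conditions, we have $t\,\gamma_1(x) + (1 - t)\,\gamma_1(y) \le \gamma_1(tx + (1 - t)y)$.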


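The function $\gamma_1$ is well defined and finite on $]X_{dir}; +\infty[$. It is non-decreasing and concave, hence it is continuous on this interval.

Convergence: For $x \ge 0$ the result can be proved by the same argument as in Theorem 7 (comparison of integer part and total order). We now prove it more generally, when $X_{dir} < x < 0$:
$$\lim_{k \to \infty} \frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k} = \gamma_1(x) \quad \text{a.s. and in } L^1.$$

Lower bound: Let us fix $\varepsilon > 0$. By continuity of $\gamma_1$, we choose $x' \in [x - \varepsilon; x]$ such that $\gamma_1(x') \ge \gamma_1(x) - \varepsilon$.

Following the definition of $\gamma_1(x')$, there exist positive integers p, r such that $-\frac{p}{r} < x$ and $\frac{1}{r}\,\delta(-p, r) \ge \gamma_1(x') - \varepsilon$. It is even possible to choose them so that they satisfy $(1 + \varepsilon)x < -\frac{p}{r} < x$ and $rx + p \ge 1$.

We have in this case, by Proposition 7 (ii), that for all $k \ge p.r$ there is a valid path from $(\lfloor kx \rfloor, k) \times h$ to $(-np, nr) \times h$, with $n \ge \frac{-xk}{p}$. This implies:
$$(3.8)\qquad \frac{1}{k}\,\mathrm{Per}_{\lfloor kx \rfloor, k} \ge \frac{1}{k}\,\mathrm{Per}_{-np,nr} \ge \frac{-x}{p}\,\frac{\mathrm{Per}_{-np,nr}}{n} \ge \frac{1}{1 + \varepsilon}\,\frac{\mathrm{Per}_{-np,nr}}{rn}.$$
Now, by the a.s. convergence shown in Theorem 6, we have that:
a.s., $\exists N^{(-)}(\omega)$ such that $\forall n \ge N^{(-)}(\omega)$, $\frac{\mathrm{Per}_{-np,nr}}{nr}(\omega) > \frac{1}{r}\,\delta(-p, r) - \varepsilon$,
a.s., if $k \ge K^{(-)}(\omega) = \max(N.\frac{p}{-x}, p.r)$, $\frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k}(\omega) > \frac{1}{1 + \varepsilon}(\gamma_1(x) - 3\varepsilon)$.

Similarly, by $L^1$ convergence:
there exists $N^{(-)}$ such that, for $n \ge N^{(-)}$, $E\big[\frac{\mathrm{Per}_{-np,nr}}{nr}\big] \ge \frac{1}{r}\,\delta(-p, r) - \varepsilon$,
hence, for $k \ge K^{(-)} = \max(N.\frac{p}{-x}, p.r)$, $E\big[\frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k}\big] > \frac{1}{1 + \varepsilon}(\gamma_1(x) - 3\varepsilon)$.

Upper bound: It is a very similar argument, using Proposition 7 (iii). Let us fix $\varepsilon > 0$; we choose $x'' \in [x; x(1 - \varepsilon)]$ such that $\gamma_1(x'') \le \gamma_1(x) + \varepsilon$.

If we choose $-\frac{p}{r}$ in $[x, x'']$, by definition $\frac{1}{r}\,\delta(-p, r) \le \gamma_1(x'')$. Multiplying p and r by the same number, we can suppose that $rx + p \le -1$. In this case, for all $k \ge pr$, we know that there is a valid path $(-np, nr) \times h \leadsto (\lfloor kx \rfloor, k) \times h$, where $\frac{k}{r} \le n \le \frac{(-x)k}{p}$. It implies:
$$\frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k} \le \frac{\mathrm{Per}_{-np,nr}}{k} \le (-x)\,\frac{r}{p}\,\frac{\mathrm{Per}_{-np,nr}}{nr} \le \frac{1}{1 - \varepsilon}\,\frac{\mathrm{Per}_{-np,nr}}{nr}.$$
Again, by the a.s. convergence proved in Theorem 6, we have
a.s. $\exists N^{(+)}(\omega)$ such that $\forall n \ge N^{(+)}$, $\frac{\mathrm{Per}_{-np,nr}}{nr}(\omega) < \frac{1}{r}\,\delta(-p, r) + \varepsilon$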


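a.s., if $k \ge K^{(+)}(\omega) = \max(N.r, p.r)$, $\frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k}(\omega) < \frac{1}{1 - \varepsilon}(\gamma_1(x) + 2\varepsilon)$.

And, by $L^1$ convergence,
$\exists N^{(+)}$ such that, for $n \ge N^{(+)}$, $E\big[\frac{\mathrm{Per}_{-np,nr}}{nr}\big] < \frac{1}{r}\,\delta(-p, r) + \varepsilon$,
hence, for $k \ge K^{(+)} = \max(N.r, p.r)$, $E\big[\frac{\mathrm{Per}_{\lfloor kx \rfloor, k}}{k}\big] < \frac{1}{1 - \varepsilon}(\gamma_1(x) + 2\varepsilon)$.

We can deduce from these bounds that the convergence holds for $\gamma_1$ whenever x is non-negative or greater than $X_{dir}$.

Increase of $\gamma_1$: The proof is indeed almost the same. For $\frac{p}{r} \le x$ and $y = \frac{q}{r} \ge 0$, two rational numbers, we have:
$$\begin{aligned}
\gamma_1(x + y) &\ge \tfrac{1}{r}\,\delta(p + q, r) \\
&\ge \tfrac{1}{r}\,\delta(p, r - 1) + \tfrac{1}{r}\,\delta(q, 1) \\
&\ge \tfrac{r-1}{r}\,\gamma_1(x) + \tfrac{q}{r}\,\underbrace{\tfrac{1}{q}\,\delta(q, 1)}_{=\gamma_2(1/q)\ \ge\ \gamma_2(0^+)}
\end{aligned}$$
We conclude by considering the supremum over all possible p, q, r:
$$\gamma_1(x + y) \ge \gamma_1(x) + y.\frac{1}{\mathrm{Thres}}.$$
All other results, which consider positive x, may be shown with exactly the same argument as for Theorem 7.

Property of the Legendre Transform

The Legendre transform of the hydrodynamic limit is an important functional that can be related to an asymptotic latency (see next section). We show here that it is well defined on ]0; Thres[, and that it is equal to this supremum taken on a finite interval.

Proposition 8 We place ourselves under the assumptions of Theorem 9. Let $\lambda > 0$ and let $\gamma_1^*(\lambda) = \sup_{x \in \mathbb{R}}\big(\gamma_1(x) - \frac{x}{\lambda}\big)$ be the Legendre transform of $\gamma_1$.

(i) $\lambda < \mathrm{Thres} \Longrightarrow \gamma_1^*(\lambda) < +\infty$ and $\lambda > \mathrm{Thres} \Longrightarrow \gamma_1^*(\lambda) = +\infty$.
(ii) For any $\mu$ such that $\frac{1}{\mathrm{Thres}} < \frac{1}{\lambda} - \mu < \frac{1}{\lambda}$, there exists $\zeta > 0$ verifying:
$$\gamma_1^*(\lambda) = \sup_{X_{dir} < x \le \zeta}\Big(\gamma_1(x) - \frac{x}{\lambda}\Big) \quad \text{and, for } x \ge \zeta, \quad \gamma_1(x) - \frac{x}{\lambda} \le -\mu x.$$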


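Proof: Let us start with (ii). We first choose $\mu > 0$ such that $\frac{1}{\mathrm{Thres}} < \frac{1}{\lambda} - \mu$. We know that $\lim_{x \to \infty} \frac{\gamma_1(x)}{x} = \frac{1}{\mathrm{Thres}}$. Hence there exists $\zeta > 0$ verifying $\frac{\gamma_1(x)}{x} \le \frac{1}{\lambda} - \mu$ for $x \ge \zeta$, proving the last part of (ii).

As $\sup_{x \in \mathbb{R}}\big(\gamma_1(x) - \frac{x}{\lambda}\big) \ge \gamma_1(0) > 0$, it implies the first part of (ii). As $\gamma_1$ is non-decreasing, $\gamma_1(X_{dir}) \le \lim_{X_{dir}^+} \gamma_1$. As we have $\gamma_1(x) = -\infty$ for $x < X_{dir}$, we deduce (i).

(iii) follows from Theorem 9 (iii), as we have $\gamma_1(y) \ge \gamma_1(0) + \frac{y}{\mathrm{Thres}}$.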
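A small numerical sketch may help to read Proposition 8. It is not taken from the text: the function toy_gamma1 below is a hypothetical concave, non-decreasing profile, set to -inf for x < 0 so that the critical direction is 0 and Thres = 1 in this toy case, and the supremum of gamma1(x) - x / lambda is approximated by a plain grid search on a finite interval.

```python
import math

NEG_INF = float("-inf")

def toy_gamma1(x):
    """Hypothetical hydrodynamic profile: (1 + sqrt(x))^2 for x >= 0, -inf below.

    It is concave, non-decreasing, and gamma1(x) / x -> 1 as x -> infinity,
    so the threshold of this toy example is Thres = 1.
    """
    if x < 0:
        return NEG_INF
    return (1.0 + math.sqrt(x)) ** 2

def legendre_on_interval(gamma, lam, zeta, steps=10_000):
    """Approximate the sup over 0 <= x <= zeta of gamma(x) - x / lam on a grid."""
    best = NEG_INF
    for i in range(steps + 1):
        x = zeta * i / steps
        best = max(best, gamma(x) - x / lam)
    return best
```

For this toy profile and lambda = 0.5 < Thres, the supremum equals 2 and is attained at x = 1, so enlarging the interval beyond that point does not change the value, which is the content of part (ii). For lambda > Thres the restricted supremum keeps growing with the interval, in line with part (i).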


190 Chapter 33 Appli ation: Innite Dis rete-Event SystemsPattern grids an be used to represent the evolution of some innite dis reteevent systems, as it was already shown by several examples in 1. In thisse tion, we hara terize transient and stable regimes for su h systems, usingthe hydrodynami s aling presented in 2.Generally speaking there are, at least, two kinds of boundary onditionsthat are used over a pattern grid asso iated with a dis rete event system:• First, a vertex (e.g. 0 × beg) of the grid is hosen as the origin of aquadrant ( orresponding to the restri ted grid G[0]patt).• Se ond, the quadrant is extended to all oordinates, positive or neg-ative, along the rst dimension ( orresponding to the restri ted gridG[−∞×0]patt ). A time-valued pro ess (tm)m∈Z reates other onstraints forthe beginning of all tasks in m × 0 × beg | m ∈ Z .The rst ase represents well an innite system, originally empty, that isstudied starting from its rst initial steps, and in saturation. It is des ribedin 3.1. The se ond ase, shown in 3.2, an be used to follow the samesystem, with a limited input, when it is stationary along its rst dimension.For systems in dimension 2, we provide onditions for the existen e of thisstable regime, and hara terize its long-range ompletion time.3.1 QuadrantIn the rst ase, introdu ed above, ompletion time of tasks are given bylast-passage per olation to the origin vertex. It is then easy to apply resultsfrom 2 to hara terize this transient regime.More pre isely, we onsider the pro ess of ompletion times of all tasks,

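\[ T = \left\{ T_{a \times h} \in \mathbb{R} \cup \{-\infty\} \;\middle|\; a \in \mathbb{Z}^d,\ h \in H \right\}. \]

We suppose that $T_{a \times h} = -\infty$ if $a$ has at least one negative coordinate. The task associated with the origin vertex $0 \times \mathrm{beg}$ is supposed to start at time 0. The completion times verify the uniform recurrence system, defined on the restricted grid $G^{[0]}_{\mathrm{patt}}$ as

\[ T_{0 \times \mathrm{beg}} = \mathrm{Wei}(0 \times \mathrm{beg}), \quad \text{and, for } a \times h \neq 0 \times \mathrm{beg}, \quad T_{a \times h} = \mathrm{Wei}(a \times h) + \max_{(a \times h) \to (a' \times h')} T_{a' \times h'}. \tag{3.9} \]

Proposition 9 Let $G^{[0]}_{\mathrm{patt}}$ have sharp support and be totally ordered. Let $T$ be a solution of (3.9); we have for any $a \ge 0$ and $h \in H$

\[ T_{a \times h} = \max \left\{ \mathrm{Wei}(\pi) \;\middle|\; \pi \text{ a path in } G^{[0]}_{\mathrm{patt}},\ \pi : a \times h \leadsto 0 \times \mathrm{beg} \right\} < +\infty. \]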


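In particular, when $h = \mathrm{beg}$, we have $T_{a \times \mathrm{beg}} = \mathrm{Per}^{[0]}_{a \times \mathrm{beg}}$.

Note that the same expression holds whenever $a \ge 0$ is not verified; in this case, both sides are equal to $-\infty$.

Proof: When $a \times h = 0 \times \mathrm{beg}$, the path $\pi = (0 \times \mathrm{beg})$ is included in the above definition (see Remark 2 p.169). That is the only one; any other path would define a cycle in the dependence graph, with a null associated vector, and it would contradict Proposition 2.

Hence, setting

\[ \widetilde{T}_{a \times h} = \sup \left\{ \mathrm{Wei}(\pi) \;\middle|\; \pi \text{ a path in } G^{[0]}_{\mathrm{patt}},\ \pi : a \times h \leadsto 0 \times \mathrm{beg} \right\} \]

defines a solution of (3.9). According to Corollary 3, shown for all sharp pattern grids, a path starting from $a \times h$ with size above $L_{\max}$ ends in a vertex with at least one negative coordinate. This proves that the supremum above can be restricted to all paths with length at most $L_{\max}$. That is a finite set, as the support of a pattern grid was supposed locally finite, and it defines a finite maximum.

What remains to be proved is that this defines the unique solution of the system of equations (3.9). The proof is a little technical.

Let $T$ be any solution of (3.9). We consider $(a \times h) \neq (0 \times \mathrm{beg})$. Note that $T_{a \times h}$ can be developed into:

\[ \sup \left\{ \mathrm{Wei}(\pi) - \mathrm{Wei}(a' \times h') + T_{a' \times h'} \;\middle|\; \pi \text{ a path in } G^{[0]}_{\mathrm{patt}},\ \pi : a \times h \leadsto a' \times h',\ (0 \times \mathrm{beg}) \notin (\pi \setminus \{a' \times h'\}),\ a' \in \mathbb{Z}^d,\ h' \in H \right\}. \]

This definition includes in particular the case $a' \times h' = 0 \times \mathrm{beg}$, so that $T_{a \times h} \ge \widetilde{T}_{a \times h}$. Proving $T_{a \times h} \le \widetilde{T}_{a \times h}$ is then sufficient to conclude.

Let us define for all paths $\pi$ included in the above definition:

\[ \mathrm{val} : \pi \mapsto \mathrm{Wei}(\pi) - \mathrm{Wei}(a' \times h') + T_{a' \times h'}. \]

We can show by decreasing induction that $\mathrm{val}(\pi) \le \widetilde{T}_{a \times h}$. Indeed this is obvious for any $\pi$ with length above $L_{\max}$, as $a'$ then has at least one negative coordinate, so that $T_{a' \times h'} = -\infty$. Let us suppose that the result holds for all paths with length $L + 1$, and let $\pi$ be a path with length $L$.

If $(a' \times h') = (0 \times \mathrm{beg})$, $\mathrm{val}(\pi) = \mathrm{Wei}(\pi)$ and the result is obvious as $\pi$ is included in the definition of $\widetilde{T}_{a \times h}$. If $(a' \times h') \neq (0 \times \mathrm{beg})$, $T_{a' \times h'}$ may be developed according to (3.9) and we have:

\[ \mathrm{val}(\pi) = \max_{\pi' : a' \times h' \to a'' \times h''} \Big( \underbrace{\mathrm{Wei}(\pi) - \mathrm{Wei}(a' \times h') + \mathrm{Wei}(a' \times h')}_{\mathrm{Wei}(\pi \pi') - \mathrm{Wei}(a'' \times h'')} + T_{a'' \times h''} \Big) = \max_{\pi' : a' \times h' \to a'' \times h''} \mathrm{val}(\pi \pi') \le \widetilde{T}_{a \times h} \quad \text{by induction.} \]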
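As an illustration of Proposition 9, the following Python sketch solves the recurrence (3.9) on a small quadrant and compares the result with the maximum of Wei(π) over all paths to the origin. The toy grid is an assumption made for this example only, not one of the grids studied in this chapter: d = 2, H = {beg}, and each vertex (i, j) depends on (i − 1, j) and (i, j − 1). The names `solve_recurrence`, `all_paths` and `max_path_weight` are hypothetical.

```python
import random

NEG_INF = float("-inf")

# Assumed toy pattern: vertex (i, j) depends on (i - 1, j) and (i, j - 1).
STEPS = [(-1, 0), (0, -1)]


def solve_recurrence(wei, size):
    """Completion times given by recurrence (3.9) on {0..size-1}^2."""
    T = {}
    for i in range(size):
        for j in range(size):
            if (i, j) == (0, 0):
                T[(i, j)] = wei[(i, j)]  # the origin task starts at time 0
                continue
            # Vertices with a negative coordinate have T = -infinity.
            preds = [T.get((i + di, j + dj), NEG_INF) for di, dj in STEPS]
            T[(i, j)] = wei[(i, j)] + max(preds)
    return T


def all_paths(a):
    """Yield every path (list of vertices) from a to the origin."""
    if a == (0, 0):
        yield [a]
        return
    for di, dj in STEPS:
        b = (a[0] + di, a[1] + dj)
        if min(b) >= 0:
            for tail in all_paths(b):
                yield [a] + tail


def max_path_weight(wei, a):
    """Last-passage percolation time: maximal total weight of a path."""
    return max(sum(wei[v] for v in path) for path in all_paths(a))


random.seed(0)
SIZE = 5
wei = {(i, j): random.randint(1, 9) for i in range(SIZE) for j in range(SIZE)}
T = solve_recurrence(wei, SIZE)
agree = all(T[a] == max_path_weight(wei, a) for a in wei)
```

Integer weights are used so that the two computations can be compared exactly; with real-valued weights a small tolerance would be needed.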


Note that the expression for $T$ in the above proof always defines a solution of (3.9), even if the pattern grid is neither sharp nor totally ordered. It also defines, generally, the minimal solution of

\[ T_{0 \times \mathrm{beg}} \ge \mathrm{Wei}(0 \times \mathrm{beg}), \quad \text{and, for all } a \times h, \quad T_{a \times h} \ge \mathrm{Wei}(a \times h) + \max_{(a \times h) \to (a' \times h')} T_{a' \times h'}. \tag{3.10} \]

But in the general case, the supremum in $T$ may be infinite. Moreover, for non-sharp pattern grids, other solutions, including infinite ones, may be found. The sharpness criterion is sufficient to avoid these degenerate cases, but it is not in general necessary. Sufficient and necessary conditions, based on more general assumptions, may be found in [7].

3.2 Extended Quadrant

We now introduce another set of boundary conditions, to answer the following question: can we extend the domain of definition of $T$ to handle doubly infinite values (positive and negative), so that this process follows a law invariant by translations along this axis? We prove here that this extension can be made along one axis, when $d = 2$.

Notation and Expression for T

We denote, for $m \in \mathbb{Z}$, $a \in \mathbb{Z}^{d-1}$, and $h \in H$, the vertex $(m, a_1, \ldots, a_{d-1}) \times h$ by $(m \times a \times h)$. The process of completion times is

\[ T = \left\{ T_{m \times a \times h} \in \mathbb{R} \cup \{-\infty\} \;\middle|\; m \in \mathbb{Z},\ a \in \mathbb{Z}^{d-1},\ h \in H \right\}. \]

We suppose that $T_{m \times a \times h} = -\infty$ whenever $a$ has a negative coordinate. When $d = 2$, it corresponds to restricting the set of valid vertices to a half-plane. When $d = 3$, the set of valid vertices is the quarter of a space, as shown in Figure 3.2.

We introduce a set of constraints $\{ t_m \mid m \in \mathbb{Z} \}$ that characterize the completion times on the axis $\{ m \times 0 \times \mathrm{beg} \mid m \in \mathbb{Z} \}$. It is generally too restrictive to fix the exact completion time of these tasks, as they also need to satisfy the precedence relation of the pattern grid. Instead we interpret $t_m$ as an authorization time: the task $m \times 0 \times \mathrm{beg}$ may start at any time after $t_m$, but not before.

We can then define the process $T$ as the minimal solution of the system, defined on the restricted grid $G^{[-\infty \times 0]}_{\mathrm{patt}}$ as:

\[ \begin{cases} T_{m \times 0 \times \mathrm{beg}} \ge \mathrm{Wei}(m \times 0 \times \mathrm{beg}) + t_m & \text{for all } m \text{ in } \mathbb{Z}, \\ T_{m \times a \times h} \ge \mathrm{Wei}(m \times a \times h) + \sup_{(m \times a \times h) \to (m' \times a' \times h')} T_{m' \times a' \times h'}. \end{cases} \tag{3.11} \]


[Figure 3.2: Quadrant (a) and Extended Quadrant (b) represented for the case d = 3; the axes are a1, a2, a3.]

Proposition 10 Let $G^{[-\infty \times 0]}_{\mathrm{patt}}$ have sharp support and be totally ordered. Let $T$ be the minimal solution of (3.11). For $a \ge 0$, $T_{m \times a \times h}$ is equal to

\[ \sup_{n \in \mathbb{Z}} \left( \max \left\{ \mathrm{Wei}(\pi) \;\middle|\; \pi \text{ a path in } G^{[-\infty \times 0]}_{\mathrm{patt}},\ \pi : m \times a \times h \leadsto (m - n) \times 0 \times \mathrm{beg} \right\} + t_{m-n} \right). \]

For $T_{m \times 0 \times \mathrm{beg}}$, the supremum in $n$ can be restricted to $n \ge 0$. For $T_{m \times a \times h}$, it can be restricted to $n \ge -M_a$, where $M_a$ depends only on $a$.

Proof: Let us denote the above supremum by $\widetilde{T}_{m \times a \times h}$. As in the proof of Proposition 9, we can check that it defines a solution of (3.11).

As the pattern grid has a sharp support, for all $n \in \mathbb{Z}$, paths $\pi : m \times a \times h \leadsto (m - n) \times 0 \times \mathrm{beg}$ have a bounded length, as a consequence of Theorem 2, which proves that the supremum of this set is a finite maximum.

It is easy to see, following a development similar to the one used in the proof of Proposition 9, that any other solution $T$ verifies $T_{m \times a \times h} \ge \widetilde{T}_{m \times a \times h}$, proving that the solution found is minimal.

Let us consider $m \times 0 \times \mathrm{beg}$, and $n < 0$. A path $m \times 0 \times \mathrm{beg} \leadsto (m - n) \times 0 \times \mathrm{beg}$ defines a cycle in the dependence graph, whose associated vector $r$ has all coordinates non-negative. It contradicts the definition of a sharp vector, $\langle r, s \rangle \le -1$. Hence, there does not exist such a path, and we can always suppose that $n \ge 0$ in the supremum defining $T_{m \times 0 \times \mathrm{beg}}$.

More generally, for any $n < 0$ and $\pi : m \times a \times h \leadsto (m - n) \times 0 \times \mathrm{beg}$, by Theorem 2,

\[ |\pi| \le H \left( 1 + \mathrm{Res} + n \cdot s_1 + \langle a, (s_2, \ldots, s_d) \rangle \right), \quad \text{hence} \quad n \ge -\frac{1}{s_1} \left( 1 + \mathrm{Res} + \langle a, (s_2, \ldots, s_d) \rangle \right) = -M_a, \]

where $s_1 > 0$ by total order and Proposition 4 (ii).
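The expression of Proposition 10 can be checked numerically on a small example. The sketch below is written under assumptions that are not part of the text: a toy grid with d = 2 and H = {beg}, where the vertex (m, k) depends on (m − 1, k) and (m, k − 1) (the tandem of single-server queues is of this type), and a system that is empty before customer 0, which amounts to taking t_m = −∞ for m < 0 so that the supremum in n becomes a finite maximum. The function names are hypothetical.

```python
import random

NEG_INF = float("-inf")


def minimal_solution(wei, t, n_customers, n_steps):
    """Minimal solution of (3.11) on the assumed toy grid."""
    T = {}
    for m in range(n_customers):
        for k in range(n_steps):
            left = T.get((m - 1, k), NEG_INF)
            below = T.get((m, k - 1), NEG_INF)
            # The authorization time only constrains tasks on the axis k = 0.
            auth = t[m] if k == 0 else NEG_INF
            T[(m, k)] = wei[(m, k)] + max(left, below, auth)
    return T


def heaviest_path(wei, src, dst):
    """Maximal weight of a path from src down to dst (both ends included)."""
    (m, k), (m0, k0) = src, dst
    best = {}
    for i in range(m0, m + 1):
        for j in range(k0, k + 1):
            prev = max(best.get((i - 1, j), NEG_INF),
                       best.get((i, j - 1), NEG_INF))
            best[(i, j)] = wei[(i, j)] + (0 if (i, j) == dst else prev)
    return best[src]


def sup_formula(wei, t, m, k):
    """Expression of Proposition 10, restricted to 0 <= n <= m."""
    return max(heaviest_path(wei, (m, k), (m - n, 0)) + t[m - n]
               for n in range(m + 1))


random.seed(0)
N_CUSTOMERS, N_STEPS = 6, 5
wei = {(m, k): random.randint(1, 9)
       for m in range(N_CUSTOMERS) for k in range(N_STEPS)}
t = [0]
for _ in range(N_CUSTOMERS - 1):
    t.append(t[-1] + random.randint(0, 12))  # non-decreasing authorizations

T = minimal_solution(wei, t, N_CUSTOMERS, N_STEPS)
agree = all(T[(m, k)] == sup_formula(wei, t, m, k) for (m, k) in wei)
```

The truncation to customers m ≥ 0 is the main simplification: the stationary regime studied below requires the process t to be defined on all of Z.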


Note that in the above expression for $T$, the supremum in $n$ could be infinite, as an infinite number of terms, corresponding to all non-negative $n$, are included. It generally depends on the process $\{ t_m \mid m \in \mathbb{Z} \}$ that characterizes the boundary of $G^{[-\infty \times 0]}_{\mathrm{patt}}$.

Stable Regime of Two-Dimensional Pattern Grids

We consider a pattern grid in dimension 2, where the position in $\mathbb{Z}^2$ of a motif is denoted with indexes $m$ and $k$, as in §2.4. The process of completion times is

\[ T = \left\{ T_{(m,k) \times h} \;\middle|\; m, k \in \mathbb{Z},\ h \in H \right\}. \]

According to Theorem 9, there exist two functions, $\gamma_1$ and $\gamma_2$, defined by hydrodynamic scaling, that characterize asymptotic directional last-passage percolation. The positive constant $\mathrm{Thres}$ is defined by $\frac{1}{\mathrm{Thres}} = \gamma_2(0^+)$.

What we show in the next theorem is that these functions characterize which processes $t = \{ t_m \mid m \in \mathbb{Z} \}$ define a stable pattern grid (i.e. a pattern grid where $T$ is a.s. finite and corresponds to a stationary regime).

Theorem 10 Let $G^{[-\infty \times 0]}_{\mathrm{patt}}$ have sharp support, and be totally ordered. Let $T$ be the minimal solution of (3.11); we suppose $d = 2$ and

(i) the process $t$ is stationary and ergodic, with a rate $\lambda < \mathrm{Thres}$.

Then, for all $m \in \mathbb{Z}$ we have: $\forall k \ge 0$, $-\infty < T_{(m,k) \times \mathrm{beg}} < +\infty$ a.s., and

\[ \lim_{k \to \infty} \frac{T_{(m,k) \times \mathrm{beg}} - t_m}{k} = d \ \text{ a.s., for } \ d = \sup_{x \in \mathbb{R}} \left( \gamma_1(x) - \frac{x}{\lambda} \right) < \infty. \]

Under the assumptions above, $t$ is stationary, and the pattern grid $G^{[-\infty \times 0]}_{\mathrm{patt}}$ is unchanged by the translation $m \to m + 1$. As a consequence the law of the process $T$ is invariant by the same translation, and, for a fixed $k$, the sequence of variables $(T_{(m,k) \times \mathrm{beg}} - t_m)_{m \in \mathbb{Z}}$ is stationary. It is therefore sufficient to prove the result of the theorem for $T_{(0,k) \times \mathrm{beg}} - t_0$.

Interpretation: if the index $m$ represents the number of a customer entering a system, the law of $T_{(0,k) \times \mathrm{beg}} - t_0$ characterizes, in steady state, the time between the arrival of a customer (authorization of his origin task) and the completion of his first $k$ steps.

Moreover, let us define $D_{m,k} = T_{(m,k) \times \mathrm{beg}} - T_{(m,k-1) \times \mathrm{beg}}$, the time spent by customer $m$ to complete his $k$-th step, after his $(k-1)$-th step is done, where $t_m = T_{(m,-1) \times \mathrm{beg}}$ by convention. The convergence above can be interpreted as an a.s. convergence of Cesàro averages, or a strong law of large numbers, for the sequence $(D_{m,k})_{k \ge 0}$.
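The Cesàro interpretation rests on a telescoping identity: the delays $D_{m,0}, \ldots, D_{m,k}$ sum to $T_{(m,k) \times \mathrm{beg}} - t_m$. The short Python sketch below checks this on a deterministic toy example chosen for illustration only: the same assumed grid as a tandem of queues (vertex (m, k) depends on (m − 1, k) and (m, k − 1)), unit weights, authorizations spaced by 2 time units, and a system empty before customer 0. In this particular case every delay equals 1, so the average delay is 1. The function names are hypothetical.

```python
NEG_INF = float("-inf")


def completion_times(wei, t, n_customers, n_steps):
    """Minimal solution of (3.11) on the assumed toy grid."""
    T = {}
    for m in range(n_customers):
        for k in range(n_steps):
            left = T.get((m - 1, k), NEG_INF)
            below = T.get((m, k - 1), NEG_INF)
            auth = t[m] if k == 0 else NEG_INF
            T[(m, k)] = wei[(m, k)] + max(left, below, auth)
    return T


def step_delays(T, t, m, n_steps):
    """D_{m,k} = T_(m,k) - T_(m,k-1), with the convention T_(m,-1) = t_m."""
    previous = t[m]
    delays = []
    for k in range(n_steps):
        delays.append(T[(m, k)] - previous)
        previous = T[(m, k)]
    return delays


N_CUSTOMERS, N_STEPS = 6, 50
wei = {(m, k): 1 for m in range(N_CUSTOMERS) for k in range(N_STEPS)}
t = [2 * m for m in range(N_CUSTOMERS)]  # one authorization every 2 time units

T = completion_times(wei, t, N_CUSTOMERS, N_STEPS)
delays = step_delays(T, t, 5, N_STEPS)
average_delay = sum(delays) / N_STEPS
```

With random weights the individual delays fluctuate, and Theorem 10 concerns the limit of their average in the stationary two-sided system, which this finite truncation does not reproduce.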


Proof: The proof of this fact is rather long; we distinguish several steps to help the reader. They all follow the arguments given in [9] to prove Theorem 5.2.

- STEP 1: Simplification and notation

Without loss of generality we can suppose that the task corresponding to $0 \times 0 \times \mathrm{beg}$ is authorized at time $t = 0$. We denote for all $m \in \mathbb{Z}$ by $A(m)$ the time between the authorizations of tasks $(m+1) \times 0 \times \mathrm{beg}$ and $m \times 0 \times \mathrm{beg}$. With this notation, we can rewrite:

\[ t_m = \sum_{i=(m)^-}^{(m)^+ - 1} \mathrm{sgn}(i)\, A(i) = \begin{cases} \sum_{i=0}^{m-1} A(i) & \text{for } m \ge 0, \\ -\sum_{i=m}^{-1} A(i) & \text{for } m < 0. \end{cases} \tag{3.12} \]

To simplify the notation, let us introduce:

\[ M(n, k) = \sup \left\{ \mathrm{Wei}(\pi) \;\middle|\; \pi \in G^{[-\infty \times 0]}_{\mathrm{patt}},\ \pi : 0 \times k \times \mathrm{beg} \leadsto -n \times 0 \times \mathrm{beg} \right\}. \]

Note that we have, by translation invariance, the identity in distribution

\[ \forall (n, k) \in \mathbb{Z}^2, \quad M(n, k) \overset{\mathrm{dist}}{=} \mathrm{Per}^{[-\infty \times 0]}_{(n,k) \times \mathrm{beg}}, \tag{3.13} \]

so that in particular they have the same expectation. We can also rewrite

\[ T_{(0,k) \times \mathrm{beg}} = \sup_{n \in \mathbb{Z}} \left( M(n, k) + \sum_{i=(n)^- + 1}^{(n)^+} \mathrm{sgn}(-i)\, A(-i) \right). \tag{3.14} \]

Since the grid is totally ordered, there is a path from $(0, k) \times \mathrm{beg}$ to $(-n, 0) \times \mathrm{beg}$, for all $n \ge 0$, which implies $M(n, k) > -\infty$. We know by Proposition 10 that this supremum can be restricted to values $n \ge -k \frac{s_2}{s_1}$, where $s$ is a sharp vector of the grid. For the first half of the theorem, it is then sufficient to prove that for a value of $N_0$:

\[ \sup_{n \ge N_0} \left( M(n, k) - \sum_{i=1}^{n} A(-i) \right) < \infty \ \text{ a.s.} \tag{3.15} \]

Proposition 8 (i) tells us that $d < \infty$. For $\mu > 0$ chosen such that $\frac{1}{\mathrm{Thres}} < \frac{1}{\lambda} - \mu$, by Proposition 8 (ii), there exists $\zeta > 0$ such that the supremum defining $d$ can be restricted to $x$ verifying $X_{\mathrm{dir}} < x \le \zeta$. What remains to be proved is that, when $k$ goes to infinity, we have the following:

\[ \left( \frac{1}{k} \sup_{n \ge \zeta k} \left( M(n, k) - \sum_{i=1}^{n} A(-i) \right) \right)^{+} \to 0 \ \text{ a.s.} \tag{3.16} \]

\[ \frac{1}{k} \max_{-k \frac{s_2}{s_1} \le n \le \zeta k} \left( M(n, k) + \sum_{i=(n)^- + 1}^{(n)^+} \mathrm{sgn}(-i)\, A(-i) \right) \to d \ \text{ a.s.} \tag{3.17} \]
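The signed sums in (3.12) and (3.14) are easy to misread, so here is a small Python sketch of them. It assumes the reading $(n)^- = \min(n, 0)$ and $(n)^+ = \max(n, 0)$, and the convention sgn(0) = +1, which is the one that makes the two cases of (3.12) agree with the sum on the left; both are assumptions of this example, as are the function names. Under these conventions the sum appearing in (3.14) is exactly $t_{-n}$.

```python
def sgn(i):
    # Assumed convention: sgn(0) = +1, so that the term i = 0 of (3.12)
    # is counted positively.
    return 1 if i >= 0 else -1


def authorization_time(A, m):
    """t_m as in the two cases of (3.12), with t_0 = 0; A maps i to A(i)."""
    if m >= 0:
        return sum(A(i) for i in range(0, m))
    return -sum(A(i) for i in range(m, 0))


def signed_sum(A, n):
    """Sum over i = (n)^- + 1, ..., (n)^+ of sgn(-i) A(-i), as in (3.14)."""
    lo, hi = min(n, 0), max(n, 0)
    return sum(sgn(-i) * A(-i) for i in range(lo + 1, hi + 1))


def example_gaps(i):
    """A hypothetical sequence of inter-authorization times."""
    return 1 + (i % 3)


t_values = {m: authorization_time(example_gaps, m) for m in range(-4, 5)}
```

The identity $t_{m+1} - t_m = A(m)$ holds for every integer m, which matches the definition of A(m) as the time between two successive authorizations.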


- STEP 2: Truncation, concentration around the mean

We are not aware of a direct method proving (3.15)-(3.17) based on their expressions. The key argument to conclude was introduced by Baccelli, Borovkov and Mairesse in [1]; we follow the extended version used by Martin in [9]. Let us introduce for any $m$, $k$ and $h$ the truncated variable:

\[ \check{\mathrm{Wei}}((m,k) \times h) = \min \left( \mathrm{Wei}((m,k) \times h),\ \max(|m|, |k|)^{1/4} \right). \]

We define $\check{M}(n, k)$ according to the same definition as $M(n, k)$, where the weights used in $G^{[-\infty \times 0]}_{\mathrm{patt}}$ are replaced by the associated truncated variables.

A combinatorial argument (Lemma 5.5 in [9]) proves that if the distributions of weights satisfy Condition 4, we have:

\[ \frac{1}{N} \max_{|\xi| \le N;\ 0 \in \xi} \left( \mathrm{Wei}(\xi) - \check{\mathrm{Wei}}(\xi) \right) \to 0 \ \text{ a.s. when } N \to \infty, \tag{3.18} \]

where $\xi$ denotes a lattice animal, and $|\xi|$ its size, in a lattice with dimension 2.

Lemma 6

(i) $\forall x$, $\dfrac{M(\lfloor xk \rfloor, k) - \check{M}(\lfloor xk \rfloor, k)}{k} \to 0$ a.s. as $k \to \infty$.

(ii) $\max_{-k \frac{s_2}{s_1} \le n \le \zeta k} \dfrac{M(n, k) - \check{M}(n, k)}{k} \to 0$ a.s. as $k \to \infty$.

(iii) $\max_{k \le n/\zeta} \dfrac{M(n, k) - \check{M}(n, k)}{n} \to 0$ a.s. as $n \to \infty$.

Proof of the Lemma: As a consequence of Theorem 3, all paths involved in the definition of $M(n, k)$ and $\check{M}(n, k)$ can be included in a lattice animal $\xi$ with size bounded by $(H^2 + d \cdot H \cdot \mathrm{Rad})(1 + \mathrm{Res} + n \cdot s_1 + k \cdot s_2)$. This lattice animal contains $(0, k) \times \mathrm{beg}$, and we can artificially add $(0, k-1) \times \mathrm{beg}, \ldots, (0, 0) \times \mathrm{beg}$ to extend $\xi$ into a lattice animal with the same property, which includes the origin, and has size less than $(H^2 + d \cdot H \cdot \mathrm{Rad})(1 + \mathrm{Res} + n \cdot s_1 + k \cdot (s_2 + 1))$.

Result (i) then follows from (3.18) with $N = (H^2 + d \cdot H \cdot \mathrm{Rad})(1 + \mathrm{Res} + (x \cdot s_1 + s_2 + 1)k)$, (ii) with $N = (H^2 + d \cdot H \cdot \mathrm{Rad})(1 + \mathrm{Res} + (s_1 \zeta + s_2 + 1)k)$, and (iii) with $N = (H^2 + d \cdot H \cdot \mathrm{Rad})(1 + \mathrm{Res} + (s_1 + (s_2 + 1)/\zeta)n)$.

Working with bounded variables allows us to obtain the following concentration of measure result:

Lemma 7 For any $n$ and $k$, we introduce $A = H(1 + \mathrm{Res} + s_1 + s_2)$; then

\[ P\left( \left| \check{M}(n, k) - E[\check{M}(n, k)] \right| \ge u \right) \le \exp \left( - \frac{u^2}{16 A \left( c_0 + c_1 |n| + c_2 |k| \right)^{3/2}} + 64 \right), \]

where $c_0 = H\,\mathrm{Rad}\,(1 + \mathrm{Res})$, $c_1 = H\,\mathrm{Rad}\,(s_1 + 1)$, $c_2 = H\,\mathrm{Rad}\,(s_2 + 1)$.

Proof of the Lemma: Let us introduce $R = \mathrm{Rad}\, H \left( 1 + \mathrm{Res} + (s_1 + 1)|n| + (s_2 + 1)|k| \right)$. We consider a path $\pi : (0, k) \times \mathrm{beg} \to (-n, 0) \times \mathrm{beg}$, involved in the definition of $\check{M}(n, k)$. By Theorem 2, we know that it contains at most

\[ |\pi| \le H(1 + \mathrm{Res} + s_1 n + s_2 k) \le A \cdot R \ \text{ vertices.} \]


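According to the definition of $\mathrm{Rad}$, we know that for $(n', k') \times h$ contained in $\pi$,

\[ |n'| \le |n| + |\pi| \cdot \mathrm{Rad} \le R \quad \text{and} \quad |k'| \le |k| + |\pi| \cdot \mathrm{Rad} \le R, \]

so that the truncated weight of any vertex contained in $\pi$ is upper bounded by $R^{1/4}$. Theorem 15, shown in Appendix A, can then be applied to the collection of paths involved in the definition of the truncated variable $\check{M}(n, k)$, with $S = A \cdot R$ and $L = R^{1/4}$, proving this result.

We have a concentration result of a different kind, applying to the input process, as a direct consequence of Lemma 5.8 in [9].

Lemma 8 As $\{ A(i),\ i \in \mathbb{Z} \}$ is stationary ergodic, we have

\[ \frac{1}{N} \max_{-N \le n \le N} \left| \sum_{i=(n)^- + 1}^{(n)^+} \mathrm{sgn}(-i)\, A(-i) + \frac{n}{\lambda} \right| \to 0 \ \text{ a.s. as } N \to \infty. \]

- STEP 3: A simplified version of (3.17)

Let us prove the convergence of this functional when variables are truncated and replaced by their expectation,

\[ \frac{1}{k} \max_{-k \frac{s_2}{s_1} \le n \le \zeta k} \left( E[\check{M}(n, k)] - \frac{1}{\lambda} \cdot n \right) \to d. \tag{3.19} \]

Note that this convergence result now deals with real numbers. We remark first that according to Proposition 3 (i):

\[ E[\check{M}(m, k)] \le E[M(m, k)] \le E\left[ \mathrm{Per}^{[-\infty \times 0]}_{(m,k) \times \mathrm{beg}} \right] \le \delta^{[-\infty \times 0]}((m, k), \mathrm{beg}) + E[s] \le k \gamma_1\!\left( \frac{m}{k} \right) + E[s], \]

so that, for all $k > 0$,

\[ \max_{-k \frac{s_2}{s_1} \le m \le \zeta k} \frac{E[\check{M}(m, k)] - \frac{1}{\lambda} \cdot m}{k} \le \sup_{-\frac{s_2}{s_1} \le x \le \zeta} \left( \gamma_1(x) - \frac{x}{\lambda} \right) + \frac{E[s]}{k}. \]

This proves that the superior limit of the LHS in (3.19) is smaller than $d$. For the other comparison, we deduce from the definition of $\gamma_1$ as an $L^1$ limit,

\[ \left| \frac{E[M(\lfloor xk \rfloor, k)]}{k} - \gamma_1(x) \right| = \left| \frac{E\left[ \mathrm{Per}^{[-\infty \times 0]}_{(\lfloor xk \rfloor, k) \times \mathrm{beg}} \right]}{k} - \gamma_1(x) \right| \to_{k \to \infty} 0, \]

holding for all $x > X_{\mathrm{dir}}$. By Lemma 6 (i) and dominated convergence, we have

\[ \frac{E[\check{M}(\lfloor xk \rfloor, k)]}{k} \to_{k \to \infty} \gamma_1(x). \]

Applied to $x = \operatorname{argmax}_{-\frac{s_2}{s_1} \le x \le \zeta} \left( \gamma_1(x) - \frac{x}{\lambda} \right)$, we have

\[ \frac{E[\check{M}(\lfloor xk \rfloor, k)] - \lfloor xk \rfloor / \lambda}{k} \to d. \]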


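The inferior limit of the LHS in (3.19) is then above $d$, proving the convergence.

- STEP 4: Deducing (3.17)

In all the steps that remain, we prove that the convergence towards the constant $d$ holds for the original sequence, as all approximations introduced can be neglected. Let us start by establishing (3.17).

We choose $\varepsilon > 0$. By Lemma 7, applied to the truncated variable $\check{M}$, we have

\[ P\left( \left| \check{M}(n, k) - E[\check{M}(n, k)] \right| \ge \varepsilon k \right) \le \exp \left( - \frac{(\varepsilon k)^2}{16 A \left( c_0 + c_1 |n| + c_2 |k| \right)^{3/2}} + 64 \right). \]

For $-k \frac{s_2}{s_1} \le n \le k\zeta$ we have $c_0 + c_1 |n| + c_2 |k| \le c_0 + c_2' |k|$, with $c_2' = c_2 + c_1 \max(\zeta, \frac{s_2}{s_1})$. Hence, summing on $n$ we obtain

\[ P\left( \bigcup_{n = -\frac{s_2}{s_1} k}^{\lfloor \zeta k \rfloor} \left\{ \left| \check{M}(n, k) - E[\check{M}(n, k)] \right| > \varepsilon k \right\} \right) \le a' k \exp(-b' \sqrt{k}) \tag{3.20} \]

for $k \ge \max\left( \frac{c_0}{c_2'},\ \frac{1}{\zeta + \frac{s_2}{s_1}} \right)$, with $a' = e^{64} \cdot 2 \left( \zeta + \frac{s_2}{s_1} \right)$ and $b' = \frac{\varepsilon^2}{(16A)(2c_2')^{3/2}}$.

It proves that (3.20) is the term of a convergent series in $k \ge 0$. By the Borel-Cantelli lemma, applied to all $\varepsilon$, there exists a.s. a finite $K_\varepsilon$ such that for $k \ge K_\varepsilon$ and $-k \frac{s_2}{s_1} \le n \le k\zeta$, $\left| \check{M}(n, k) - E[\check{M}(n, k)] \right| \le k\varepsilon$. By (3.19) it implies

\[ \frac{1}{k} \max_{-k \frac{s_2}{s_1} \le n \le \zeta k} \left( \check{M}(n, k) - \frac{n}{\lambda} \right) \to d \ \text{ a.s. as } k \to \infty. \]

Lemma 6 (ii) and Lemma 8 allow us to deduce (3.17) directly.

- STEP 5: Proving (3.15) and (3.16)

We chose $\mu$ and $\zeta$ according to Proposition 8 (ii), such that $\gamma_1(x) \le (\frac{1}{\lambda} - \mu)x$ for all $x \ge \zeta$. As a consequence, for any $n \ge \zeta k$ we have:

\[ E[M(n, k)] \le k \gamma_1\!\left( \frac{n}{k} \right) + E[s] \le \left( \frac{1}{\lambda} - \mu \right) n + E[s]. \]

This would imply (3.15) and (3.16) if all variables were replaced by their expectations, but to treat the real sequence, we need the following rewriting. For any $\nu$ such that $\frac{1}{\lambda} - \mu < \nu < \frac{1}{\lambda}$, let $\rho = \frac{1}{2}\left( \nu - (\frac{1}{\lambda} - \mu) \right)$; we have

\[ \rho > 0 \quad \text{and} \quad \forall n \ge \zeta k, \quad E[M(n, k)] - \nu n \le -2\rho n + E[s]. \]

As a consequence $\sup_{n \ge \zeta k} (M(n, k) - \nu n)$ is upper bounded by

\[ \sup_{n \ge \zeta k} \left( M(n, k) - \check{M}(n, k) - \rho n \right) + \sup_{n \ge \zeta k} \left( \check{M}(n, k) - E[\check{M}(n, k)] - \rho n \right) + E[s]. \tag{3.21} \]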


Bounding the left term of (3.21): By Lemma 6 (iii), where $\check{M}$ denotes the truncated variable, we have $\frac{1}{n} \max_{k < \frac{n}{\zeta}} \left( M(n, k) - \check{M}(n, k) \right) \to 0$ a.s. as $n \to \infty$, which implies that there exists $N$ a.s. finite, such that for $n \ge N$:

\[ \max_{k < \frac{n}{\zeta}} \left( M(n, k) - \check{M}(n, k) \right) < \rho n. \]

Bounding the middle term of (3.21): From Lemma 7,

\[ P\left( \left| \check{M}(n, k) - E[\check{M}(n, k)] \right| \ge \delta n \right) \le \exp \left( - \frac{(\delta n)^2}{16 A \left( c_0 + c_1 |n| + c_2 |k| \right)^{3/2}} + 64 \right). \]

For $0 \le k < n/\zeta$, we have $c_0 + c_1 |n| + c_2 |k| \le c_0 + c_1' n$ for $c_1' = c_1 + c_2/\zeta$. Hence,

\[ P\left( \bigcup_{k < \frac{n}{\zeta}} \left\{ \left| \check{M}(n, k) - E[\check{M}(n, k)] \right| \ge \delta n \right\} \right) \le a'' n \exp(-b'' \sqrt{n}), \]

for $n \ge \max(\zeta, c_0 / c_1')$, with $a'' = e^{64} \cdot \frac{2}{\zeta}$ and $b'' = \frac{\delta^2}{(16A)(2c_1')^{3/2}}$.

This is therefore the term of a convergent series in $n > 0$. By the Borel-Cantelli lemma, there exists $N'$ a.s. finite such that this event does not occur for $n \ge N'$.

Note that as $\nu < \frac{1}{\lambda}$, by the ergodic theorem, there exists $N''$ a.s. finite such that $\sum_{i=0}^{n} A(-i) - \nu n$ is positive for $n \ge N''$.

As a consequence, $M(n, k) - \sum_{i=0}^{n} A(-i) \le E[s]$ for any $k \ge 0$ and $n \ge \max(N, N', N'', \zeta k)$. This implies (3.15). The same comparison holds for $k \ge \frac{\max(N, N', N'')}{\zeta}$ and any $n > \zeta k$, proving (3.16).

Stability condition: What was shown in the previous theorem is that $\lambda < \mathrm{Thres}$ is a sufficient condition to define a stable regime, where the long-range completion time grows linearly in the distance from the axis. We prove next that it is, generally speaking, a necessary condition as well, for these two results to hold.

The velocity in a dimension was defined in §2.4 as $\frac{1}{\delta^{[0]}(e_i, \mathrm{beg})}$, for an implicit arbitrary choice of $\mathrm{beg}$ in $H$. In the next result, we write explicitly the element $h$ chosen in this definition, as $\mathrm{Velo}_1(h)$. For consistent notation we also denote $\mathrm{Thres} = \mathrm{Thres}(\mathrm{beg})$, as was already assumed before.

Corollary 4 We make the same assumptions as in Theorem 10.

(i) If $\lambda > \mathrm{Thres}(\mathrm{beg})$, $\lim_{k \to \infty} \dfrac{T_{(m,k) \times \mathrm{beg}} - t_m}{k} = +\infty$ a.s., and in expectation.

(ii) If $\lambda > \mathrm{Velo}_1(h)$, where $h \in H$, then $\exists K \ge 0$ such that $T_{(m,k) \times h} = +\infty$ a.s., for any $m$ and $k \ge K$.


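Remark 5 For the second assertion we implicitly assume that the solution $T_{(m,k) \times h}$ is well defined, at least for some $m, k$ (i.e. there exist almost surely $m', k'$ and a path $\pi : (m, k) \times h \leadsto (m', k') \times \mathrm{beg}$).

Proof: As a consequence of the ergodic theorem, if $\nu > \frac{1}{\lambda}$, there exists $N_1(\nu)$, a.s. finite, such that: $n \ge N_1(\nu) \implies t_{-n} > -n \cdot \nu$.

Proving the second assertion: It is sufficient to prove, for a given $h$ and $k$, that $T_{(0,k) \times h} = +\infty$ a.s., as it implies the same result on $T_{(m,k) \times h}$ by stationarity, and therefore on $T_{(m,k') \times \mathrm{beg}}$ when $k' \ge k$, by total ordering.

It is possible to choose $h$, $\nu$ and $\mu$ verifying: $\frac{1}{\lambda} < \nu < \mu < \frac{1}{\mathrm{Velo}_1(h)}$. According to Remark 5, we can always assume that there is a path $\pi : (m, k) \times h \leadsto (0, k') \times \mathrm{beg}$, with $k' \ge 0$, and that the path is valid (otherwise, we can replace $k$ and $k'$ by sufficiently large numbers). After translation by $-n$ along the $x$ axis, it remains valid, implying

\[ \text{for all } n \ge 0, \quad T_{(-(n-m), k) \times h} \ge T_{(-n, k') \times \mathrm{beg}} \ge t_{-n}. \]

Developing (3.11), we can lower bound $T_{(0,k) \times h}$, with any $n \ge m$, by

\[ \sup \left\{ \mathrm{Wei}(\pi) \;\middle|\; \pi \in G^{[-\infty \times 0]}_{\mathrm{patt}},\ \pi : 0 \times k \times h \leadsto -(n - m) \times k \times \mathrm{beg} \right\} + \underbrace{T_{(-(n-m), k) \times h}}_{\ge t_{-n}}. \]

Let us denote this supremum, minus $\mathrm{Wei}((0, k) \times h)$, by $Y^{(k)}_{n-m}$. Let us introduce $\check{Y}^{(k)}$, the same process where $G^{[-\infty \times 0]}_{\mathrm{patt}}$ is replaced by $G^{[-\infty \times k]}_{\mathrm{patt}}$. Note that we have, for all $n \in \mathbb{Z}$, the comparison $Y^{(k)}_n \ge \check{Y}^{(k)}_n$.

The process $\left\{ \check{Y}^{(k)}_n \;\middle|\; n \in \mathbb{Z} \right\}$ is super-additive, and $\check{Y}^{(k)}_n \overset{\mathrm{dist}}{=} \mathrm{Per}^{[-\infty \times 0]}_{(0,n) \times h}$. It implies that these two processes have the same limit:

\[ \lim_{n \to \infty} \frac{\check{Y}^{(k)}_n}{n} = \frac{1}{\mathrm{Velo}_1(h)} \ \text{ a.s., and in } L^1. \]

Hence there exists $N_2$, a.s. finite, such that: $n \ge N_2 \implies Y^{(k)}_n > n \cdot \mu$.

Hence, for $n \ge \max(N_1(\nu), m + N_2)$ we have $Y^{(k)}_{n-m} + t_{-n} > (n - m) \cdot \mu - n \cdot \nu$. As $\mu > \nu$, the right-hand side goes to infinity with $n$. This proves $T_{(0,k) \times h} = +\infty$ a.s.

Proving the first assertion: It requires more probabilistic technique. Let us note first:

\[ \frac{T_{(0,k) \times \mathrm{beg}}}{k} \ge \frac{1}{k} \sup_{x \ge 0} \left( M(\lfloor xk \rfloor, k) - \sum_{i=1}^{\lfloor xk \rfloor} A(-i) \right). \]

The convergence in expectation is quite direct. Let us choose $x \in \mathbb{R}$ arbitrarily; we have $\lim_{k \to \infty} \frac{E[M(\lfloor xk \rfloor, k)]}{k} = \gamma_1(x)$, and hence

\[ \liminf_{k \to \infty} E\left[ \frac{1}{k} T_{(0,k) \times \mathrm{beg}} \right] \ge \gamma_1(x) - \frac{x}{\lambda}. \]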


Having $\lambda > \mathrm{Thres}$ proves that $\sup_{x \ge 0} (\gamma_1(x) - \frac{x}{\lambda}) = +\infty$, as shown in Proposition 8 (i); it concludes the proof on the expectations.

Let us now introduce the same truncated variables $\check{M}(n, k)$ as in the previous proof. By Lemma 6 (i) and dominated convergence, we have

\[ \frac{E[\check{M}(\lfloor xk \rfloor, k)]}{k} \to_{k \to \infty} \gamma_1(x). \]

We choose $\nu$ such that $\mathrm{Thres} < \frac{1}{\nu} < \lambda$, any $C \ge 0$, and $x \ge 0$ such that $\gamma_1(x) - \nu \cdot x \ge C$. We have

\[ \lim_{k \to \infty} \frac{E[\check{M}(\lfloor xk \rfloor, k)] - \nu \cdot \lfloor xk \rfloor}{k} \ge C. \tag{3.22} \]

Let us choose $\varepsilon > 0$. Working with truncated variables allows us to apply Lemma 7. For $k \ge \frac{c_0}{c_2 + c_1 |x|}$ and $b = \frac{\varepsilon^2}{(16A)(2 c_2 + 2 c_1 |x|)^{3/2}}$, we have

\[ P\left( \left| \check{M}(\lfloor xk \rfloor, k) - E[\check{M}(\lfloor xk \rfloor, k)] \right| \ge \varepsilon k \right) \le \exp(-b \sqrt{k}). \]

This defines a convergent series. By the Borel-Cantelli lemma, there exists $K_\varepsilon$, a.s. finite, such that $k \ge K_\varepsilon \implies \check{M}(\lfloor xk \rfloor, k) \ge E[\check{M}(\lfloor xk \rfloor, k)] - k \cdot \varepsilon$. Combined with (3.22), it proves

\[ \lim_{k \to \infty} \frac{\check{M}(\lfloor xk \rfloor, k) - \nu \cdot \lfloor xk \rfloor}{k} \ge C - \varepsilon. \]

We deduce $\liminf_{k \to \infty} \frac{T_{(0,k) \times \mathrm{beg}}}{k} \ge C$ a.s., for all $C$. This limit is therefore $+\infty$.

Note that it also proves indirectly

\[ \mathrm{Thres}(\mathrm{beg}) \le \min \{ \mathrm{Velo}_1(h) \mid h \in H \} \le \mathrm{Velo}_1(\mathrm{beg}). \]

Discussion, future works: We have seen in §2.4 that $\mathrm{Thres}$ and $\mathrm{Velo}_1$ in general may not be equal, as the hydrodynamic function could exhibit a discontinuity at the point zero. What was shown in the previous result is that these two constants may be related to two different definitions of stability.

• The constant $\min \{ \mathrm{Velo}_1(h) \mid h \in H \}$ represents the minimal saturated completion rate of a task following the axis $\{ (m, 0) \mid m \in \mathbb{Z} \}$. If $T$ is well defined for all $h \in H$, it provides a necessary condition for the stability defined in the following way: $T_{(m,k) \times h} < +\infty$ a.s. for any $m, k, h$. We do not expect this condition to be sufficient, as the completion rate for a task further apart from the axis could be smaller. Note that conditions of this type rephrase, here for infinite systems, the Baccelli-Foss saturation rule (see [2] p.165), which states that the


stability region of a Monotone-Homogeneous-Separable system is described by its saturated output rate. But because of the infinite size, it is not possible to use this result directly to prove that it is a sufficient condition.

• The constant $\mathrm{Thres}$ characterizes a sufficient and necessary condition for a long-range stability, that is more stringent: a finite steady state exists for the process $T$ and, in addition, delays in this regime, cumulated along the $k$-axis, grow linearly with index $k$. We have not been able to come up with an example where the two notions of stability are distinct, and where a stationary regime may exist that does not verify the result of Theorem 10. We expect that in general they are equivalent, so that the law of large numbers shown in Theorem 10 holds in general for any stable regime.

Conjecture 1 For $G_{\mathrm{patt}}$ with sharp support, totally ordered, and $d = 2$,

\[ \lambda < \mathrm{Thres} \iff T_{(m,k) \times h} < +\infty \ \text{ a.s. } \forall m, k, h. \]

• We might be able to improve the first constant and establish another expression for $\mathrm{Thres}$. To the best of our knowledge, a good candidate for a sufficient condition is the following:

Conjecture 2 For $G_{\mathrm{patt}}$ with sharp support, totally ordered, and $d = 2$,

\[ \mathrm{Thres} = \min_{h \in H} \inf_{k \ge 0} \delta^{[-\infty \times -k]}((1, 0) \times h) \]

or, equivalently,

\[ \inf_{x > 0} \lim_{m \to \infty} \frac{\mathrm{Per}^{[-\infty \times 0]}_{(m, \lfloor xm \rfloor) \times \mathrm{beg}}}{m} = \max_{h \in H} \sup_{k \ge 0} \lim_{m \to \infty} \frac{\mathrm{Per}^{[-\infty \times -k]}_{(m, 0) \times h}}{m}. \]

Note that Conjecture 2 implies Conjecture 1. We have already seen that a tandem of single-server queues (see [1]), as well as any pattern grid with the solidarity property (see §2.5), verifies $\mathrm{Thres} = \mathrm{Velo}_1$, and hence both of these conjectures.

Results of that sort are essential here, as they allow at the same time to characterize $\mathrm{Thres}$ with a reasonably simple performance measure (the rate of completion of a given task), and to prove that the law of large numbers on the delays holds in this case for any smaller input rate. It could be possible to identify a general class of infinite discrete-event systems which follow a similar result.


4 Extension to Pattern Invariant Graph

Until now we considered, from §1 to §3, the class of pattern grids. They are defined as a directed graph or, equivalently, a precedence relation built over a reference lattice. The characteristic flavor of this theory stems from the following property: the distribution of the set of edges is invariant by any translation of the reference lattice.

In practice, tasks in distributed systems may be organized not as in a lattice but according to inductive rules, just like the ones found for example in an infinite tree. Some transformations may leave this organization invariant, in the same way as translations by a fixed vector preserve the relation defined on a lattice.

In this section, we introduce a general framework to study precedence relations that are stable under these transformations. It generalizes the class of pattern grids to pattern invariant graphs (p.i.g.s). The theory of this general class of distributed systems is out of the scope of this chapter, but we identify a category of p.i.g.s in which asymptotic last-passage percolation can be well described, and which is of some practical interest.

4.1 Pattern Invariant Graphs

We start by defining more precisely what makes a graph invariant with regard to a given transformation.

Invariance in Graph

Let us consider a graph $G = (V, E)$. We call a mapping $\eta : V \to V$ an invariant transformation (i.t.) of $G$ if we have for all $u, v$ in $V$:

\[ \big( \eta(v) = \eta(u) \iff v = u \big) \quad \text{and} \quad \big( (u, v) \in E \iff (\eta(u), \eta(v)) \in E \big). \]

If the mapping defined by an invariant transformation $\eta$ is a bijection on $V$, $\eta$ is called an invariant bijection. As an example, the identity on $V$ obviously defines an invariant bijection for all graphs.

Examples: Graphs and Invariant Transformations

• The semi-infinite line: defined with $V = \mathbb{N}$, and $E = \{ (n \to n + 1) \mid n \in \mathbb{N} \}$.
  - All invariant transformations are given by the translations to the right, or the additions by a positive number. This might be seen as we have
    \[ (n \to n + 1) \in E \implies (\eta(n) \to \eta(n + 1)) \in E \implies \eta(n + 1) = \eta(n) + 1. \]
  - As a consequence, the only invariant bijection is the identity.


• The infinite line: $\mathbb{Z}$, with edges $\{ (m \to m + 1) \mid m \in \mathbb{Z} \}$.
  - The invariant transformations are all the translations, or the additions by a fixed integer. The proof is the same. Each of them is an invariant bijection.

• The bi-dimensional diagonal lattice: it is described as an integer quadrant:
  \[ V = \mathbb{N}^2 \quad \text{and} \quad E = \{ ((i, j) \to (i + 1, j)),\ ((i, j) \to (i, j + 1)) \mid i \ge 0,\ j \ge 0 \}. \]
  Because the two dimensions play a similar role, this graph is best described with a $\pi/4$ rotation.

  [Figure: the diagonal lattice drawn after a π/4 rotation, with the axes i and j.]

  - An invariant transformation is characterized by the image of the vertex $(0, 0)$, that may be any vertex $(i, j)$, and the images of $(1, 0)$ and $(0, 1)$, which are always in $\{ (i + 1, j), (i, j + 1) \}$, but may be chosen arbitrarily. In particular, the orientation of the lattice may be switched by an invariant transformation.
  - In this case there are two invariant bijections: the identity and the reflection that switches every point around a horizontal line in the figure above (corresponding to the line $i = j$).

  Other examples of diagonal lattices may be constructed with any dimension.

• The semi-infinite regular rooted tree: denoted by $T_D$, it is defined for all degrees $D \ge 1$ by
  \[ V = \{ 1, 2, \ldots, D \}^{(\mathbb{N})} \quad \text{and} \quad E = \{ (a \to a\, i) \mid a \in V,\ 1 \le i \le D \}, \]
  where for every set $X$ we denote by $X^{(\mathbb{N})}$ the set made of finite sequences of elements of $X$, which includes the empty sequence $\emptyset$. We denote, for $x$ in $X$ and $y = (y_1, \ldots, y_k) \in X^{(\mathbb{N})}$, by $y\, x$ the concatenated sequence $(y_1, \ldots, y_k, x)$.

  For illustration, let us draw a part of the tree $T_3$ that contains all vertices associated with a sequence with length at most 2:

  [Figure: the first levels of the tree T_3: the root, its children 1, 2, 3, and their children (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3).]


  - The invariant transformations are characterized by a sequence (that is associated with the image of the root), and by a set of permutations $(\sigma_x)_{x \in V}$ of the set $\{ 1, \ldots, D \}$. They basically describe all projections of the tree into one of its subtrees (that are all the children of the image of the root), where the horizontal order of children may be swapped for each node.
  - The invariant bijections are the invariant transformations such that the image of the root is the root.

The composition of two invariant transformations is an invariant transformation. It is useful to think of the set of invariant transformations as the isometric mappings on the graph. They generally characterize the symmetry of the set of edges.

We say that a graph is semi-transitive if, for all $u$ and $v$ in $V$, there exists an invariant transformation $\eta$ such that either $\eta(u) = v$ or $\eta(v) = u$. If for any $u$ and $v$ in $V$ there always exists an invariant bijection that verifies the same properties, we call the graph transitive.

Note that these two properties are different. The semi-infinite line $\mathbb{N}$ and the rooted tree $T_D$ are semi-transitive but not transitive. By contrast, the infinite line $\mathbb{Z}$ verifies the two properties.

It would be possible to study only transitive graphs, as any graph used in the applications could be completed in all directions to become transitive. But performing this completion makes the results look unnecessarily complex, and the proof of the most important results is not simpler in this case.

Degrees and transitivity: In this work, we consider only graphs with finite out-degree (i.e. only a finite number of edges depart from any vertex). If a graph $G$ is transitive, assuming finite out-degree implies that there exists a uniform bound on the degree of any vertex. Indeed, one can show in this case that all vertices have the same out-degree.

The same result does not hold, generally speaking, for all semi-transitive graphs (a counter-example is shown below). We exclude this case from our further study; we always assume that there exists a uniform bound on the out-degree of all vertices.

More Details: A counter-example of uniform degree bound

Let us consider the semi-infinite line $\mathbb{N}$, with the set of edges

\[ \{ (n \to m) \mid m < n \}. \]

From any vertex $n$, there depart exactly $n$ edges (one towards each of the vertices $0, 1, \ldots, n - 1$). Hence this graph has finite out-degree everywhere, but no uniform bound on the out-degree. Its invariant transformations are the translations to the right by any number; it is semi-transitive.

Definitions

Let $(G_i = (V_i, E_i))_{i = 1 \ldots d}$ be a finite collection of graphs. Let $i$ be in $\{ 1, \ldots, d \}$; we consider, for all invariant transformations $\eta$ of $G_i$, the mappings defined on the product by:

\[ \begin{cases} V_1 \times \ldots \times V_d \to V_1 \times \ldots \times V_d \\ (v_1, \ldots, v_d) \mapsto (v_1, \ldots, v_{i-1}, \eta(v_i), v_{i+1}, \ldots, v_d). \end{cases} \tag{3.23} \]

These mappings, taken for all $i$, generate a class that we call the class of invariances, or the class of collection invariances. A collection invariance consists in applying, in a product graph, an invariant transformation (that may be the identity) to each coordinate.

It is important to distinguish the invariance of a collection of graphs from the invariant transformation of the product graph. The product graph of a finite collection, denoted $\prod_{1 \le i \le d} G_i$, is generally defined by:

\[ \left( \prod_{1 \le i \le d} V_i,\ \bigcup_{1 \le i \le d} \left\{ v \to (v_1, \ldots, v_{i-1}, v_i', v_{i+1}, \ldots, v_d) \;\middle|\; (v_i \to v_i') \in E_i \right\} \right). \]

Every invariance of a collection is by definition an invariant transformation of $\prod_{1 \le i \le d} G_i$. But the converse, in general, is false.

Let us consider for instance the graph product of two semi-infinite lines (defined as in p.203). One can easily see that $\mathbb{N} \times \mathbb{N}$ is in this case equal to the diagonal lattice (shown above). The mapping $\mathbb{N} \times \mathbb{N} \to \mathbb{N} \times \mathbb{N}$, $(i, j) \mapsto (j, i)$ is therefore an invariant transformation of the product graph. But this is not an invariance of the collection. In other words, invariant transformations map every edge to another, while invariances map every edge of a given graph $G_i$ to another edge of the same graph $G_i$.

Let $H$ be a finite set, and $(G_i = (V_i, E_i))_{i = 1 \ldots d}$ a finite collection of semi-transitive graphs. A pattern invariant graph (p.i.g.) is a graph $(\mathcal{V}, \mathcal{E})$, where the set of vertices $\mathcal{V}$ is given by the product $V_1 \times \ldots \times V_d \times H$, and the set of edges $\mathcal{E}$ is stable under all collection invariances.

In other words, the set of edges $\mathcal{E}$ is a subset of $\mathcal{V}^2$ that verifies

Condition 7 For any $1 \le i \le d$ and $\eta$ an invariant transformation of $G_i$, $v \times h \to v' \times h'$ is in $\mathcal{E}$ if and only if $\mathcal{E}$ also contains

\[ (v_1, \ldots, v_{i-1}, \eta(v_i), v_{i+1}, \ldots, v_d) \times h \to (v_1', \ldots, v_{i-1}', \eta(v_i'), v_{i+1}', \ldots, v_d') \times h'. \]

We would like to stress that the sets $E_1, \ldots, E_d$ are only present in this definition because they characterize the set of invariances; these sets are not necessarily related to $\mathcal{E}$ in any way.

Recurrence equations: As for pattern grids, we can interpret the precedence rule defined by a pattern invariant graph inside a system of uniform

Page 217: baccelli/Evaluation/AugustinChaintreauPhD.pdf

Last-Passage Per olation in Pattern Grids 207re urren e equations between time-valued pro esses. The presen e of an edgev×h → v′×h′ denotes that v′×h′ should be ompleted rst. The rst timesof ompletion, Th(v) | v ∈ V, h ∈ H , that satisfy the pre eden e rules arethe solution of:(3.24) Th(v) = Wei(v, h) + max Th′(v′)| (v, h) → (v′, h′) ∈ E .These equations may still be read as ea h task starts as soon as all itsimmediate prede essors have been ompleted.Last-Passage Per olationLet us x in a pattern invariant graph an origin vertex o × h, we dene thelast-passage per olation time to this origin, from any vertex v, as:Per v×h = sup

π: v×h o×hWei(π) .Let us onsider η, an invarian e of Gpatt, we dene by indu tion the sequen eof iterated transformation (η(m))m≥0 by:

η(m)(v) =

v if m = 0

η(m−1)(η(v)) if m ≥ 1 .Note that we have: η(m)(V) ⊆ η(m−1)(V) ⊆ . . . ⊆ η(V) ⊆ V .An invarian e η is said transient if we have, for any v of V, that (η(m)(v))m≥0is a self-avoiding sequen e of V.We onsider a set of independent random variables Av | v ∈ V . Wesuppose that their distribution is not degenerate: P [Av 6= E[Av]] > 0 .For any η, we onstru t the asso iated shift T by: T (ω)v = ωη(v−−−).We have T−1 (Av ∈ A) = Aη(v) ∈ A for all event A, and v ∈ V.Proposition 11 The shift T is ergodi if and only if η is transient.Proof: If η is not transient, there exists v su h that the sequen eη(m)(v) is periodi with period M . In this ase, the event Av ∈ C,Aη(v) ∈C, . . . , Aη(M−1)(v) ∈ C is invariant by T for any C. As the distribution ofvariables A is not degenerated, this shift annot be ergodi .If η is transient, let us onsider the lass of ylinder events. Su h anevent C is hara terized only by the value of Av for v in a nite set, thatis alled the support of C, denoted by supp(C). It is easy to verify thatT−1(C) is a ylinder and that supp(T−1(C)) = η(supp(C)). Two ylindersC and D verify that T−k(C) and D are independent for k su iently large,be ause supp(T−k(C)) and supp(D) are eventually disjoint. Otherwise, itwould imply that an element in supp(D) is equal to several iterations of thesame element of C, that is impossible as η is transient.


We now conclude by a classical monotone class argument. What we have just shown proves that $\lim_{k \to \infty} P[T^{-k}(C) \cap D] = P[C] \, P[D]$. The class of cylinders is closed by finite intersection. By Proposition 17, shown in Section I of Appendix A, the shift is mixing and hence ergodic.

The iterated images of a vertex by a transformation play, in this context, the same role as the sequence of multiplication of a given vector in a pattern grid. Similarly, we prove, in the next result, that the asymptotic speed of the sequence $\mathrm{Per}_{\eta^{(m)}(o) \times h}$ can be well characterized:

Theorem 11 Let $G_{patt}$ be a pattern invariant graph; we assume

(i) $\eta$ is a transient invariance of $G_{patt}$.
(ii) There exists a.s. a path from $\eta(o) \times h$ to $o \times h$.

In this case,

\[
\lim_{m \to \infty} \frac{\mathrm{Per}_{\eta^{(m)}(o) \times h}}{m} = \ell \quad \text{a.s.},
\]

where $\ell \in \mathbb{R} \cup \{+\infty\}$ and either $\ell = E[\ell] < +\infty$ a.s. or $E[\ell] = +\infty$.

Proof: Let us introduce, for all $m \ge 0$, the pattern invariant graph $G^{[m]}_{patt}$ where the weight of a vertex $v \times h$ is changed to $-\infty$ if $v \notin \eta^{(m)}(V)$.

The proof of this result may be seen as an extension of Theorem 6. We define for $m, m'$ positive integers, with $m < m'$,

\[
X_{m,m'} = \sup_{\substack{\pi \text{ path in } G^{[m]}_{patt} \\ \pi :\, \eta^{(m')}(o) \times h \leadsto \eta^{(m)}(o) \times h}} \mathrm{Wei}(\pi) - \mathrm{Wei}(m \times v \times h).
\]

The collection $\{X_{m,m'}, m < m'\}$ defines a super-additive process (see Section II in Appendix A). It verifies (S1) as shown by the same proof as in Theorem 6. It verifies (S'2) as we have for any $m, m'$ in $\mathbb{N}$ that $X_{m+1,m'+1} = X_{m,m'} \circ T$, where $T$ is defined as above. This shift is ergodic by (i) and Proposition 11. We have $X_{0,1} = \mathrm{Per}_{1 \times v \times h} \ge 0$ a.s. by assumption (ii).

The result is then a direct consequence of Corollary 6 shown in Appendix A.

It always makes sense to consider the asymptotic last-passage percolation of iterations in a pattern invariant graph, as long as a path exists and the transformation is transient. But two cases are possible: either the convergence holds in $L^1$, and the a.s. limit is a finite constant, or the limit has infinite expectation. Which of them occurs depends on the combinatorial property of the precedence relation, as well as on the distribution of the weights in all vertices.
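The statement of Theorem 11 can be explored numerically. The following is a minimal simulation sketch, not the construction used in the proof, and it rests on assumptions that are not taken from the text: the graph is the diagonal lattice on $\mathbb{N} \times \mathbb{N}$ with edges from $(i, j)$ to $(i-1, j)$ and $(i, j-1)$, the set $H$ has a single element, the invariance is $\eta(i, j) = (i+1, j+1)$, and the weights are i.i.d. exponential with mean 1. The function name `last_passage_ratio` is hypothetical. The dynamic program is the recurrence (3.24) restricted to this graph.

```python
import random

def last_passage_ratio(m, seed=0):
    """Estimate Per_{(m, m)} / m on the lattice {0..m} x {0..m} (assumed setting)."""
    rng = random.Random(seed)
    # While row i is processed, per[j] holds the last-passage time of (i-1, j)
    # until it is overwritten with the value for (i, j).
    per = [0.0] * (m + 1)
    for i in range(m + 1):
        for j in range(m + 1):
            w = rng.expovariate(1.0)  # weight of vertex (i, j)
            if i == 0 and j == 0:
                best = 0.0                      # the origin has no successor
            elif i == 0:
                best = per[j - 1]               # only (0, j-1) is reachable
            elif j == 0:
                best = per[j]                   # only (i-1, 0) is reachable
            else:
                best = max(per[j], per[j - 1])  # (i-1, j) or (i, j-1)
            per[j] = w + best
    return per[m] / m

if __name__ == "__main__":
    for m in (50, 100, 200):
        print(m, last_passage_ratio(m))
```

Under these assumptions the weights are light tailed, so the ratio is expected to stay bounded as m grows, which is the finite case of the dichotomy above.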


In the next section we introduce a criterion, based on the property of sharp vectors, already shown in Section 1.4, to prove that the convergence holds in $L^1$. As opposed to pattern grids, that were already characterized, we require a much more stringent condition on the distribution of weights.

4.2 Leveled Graphs

A mapping $\mathrm{lev} : V \to \mathbb{Z}$ is called a level if it is not constant, and the level difference is preserved by any invariant application $\eta$ of $G$:

\[
\text{for all } u, v \text{ in } V, \quad \mathrm{lev}(\eta(v)) - \mathrm{lev}(\eta(u)) = \mathrm{lev}(v) - \mathrm{lev}(u).
\]

Example : Leveled graphs

Let us describe the levels that may be defined in the previous examples we studied in Section 4.1. Two levels that differ only by an additive term and a multiplicative constant are said to be equivalent.

• The semi-infinite and infinite lines: Any level is characterized by $\mathrm{lev}(0)$ and $\mathrm{lev}(1)$. As for all $i$ there exists $\eta$ such that $\eta(0) = i$, $\eta(1) = i + 1$, we have that $\mathrm{lev}(i) = \mathrm{lev}(0) + i.(\mathrm{lev}(1) - \mathrm{lev}(0))$.
• The diagonal lattice and the semi-infinite regular rooted tree: The level difference between $(0, 0)$ (respectively, the root) and each of its children should be the same. Let us denote it by $l$. By semi-transitivity, the level difference is also $l$ between any node and one of its children, hence $\mathrm{lev}(i, j) = \mathrm{lev}(0, 0) + l.(i + j)$ (respectively, $\mathrm{lev}(x_1, \ldots, x_d) = \mathrm{lev}(\emptyset) + l.d$). In both cases, the level is characterized by the distance from the root.

In all these cases, all levels belong to the same equivalence class.

It is worth noting that, as a consequence, the difference $\mathrm{lev}(\eta(v)) - \mathrm{lev}(v)$ is the same for every vertex $v$; this difference is referred to as the level of $\eta$, denoted by $\mathrm{lev}(\eta)$. Note that $\mathrm{lev}(\eta) \ne 0$ is a sufficient condition to prove that $\eta$ is transient, as it implies automatically that $\eta^{(m)}(v)$ is self-avoiding for all $v$.

Normalization: We can assume without loss of generality that there exist $u$ and $v$ in $V$ such that $\mathrm{lev}(u) - \mathrm{lev}(v) = 1$. This comes from the following lemma:

Lemma 9 $A = \{\mathrm{lev}(v) - \mathrm{lev}(u) \mid u, v \in V\}$ is a subgroup of $\mathbb{Z}$.

Proof: It is clear that if $a \in A$, then $-a$ is in $A$ as well. If $a$ and $b$ are in $A$, we know that $a + b$ can be written:

\[
a + b = \mathrm{lev}(v) - \mathrm{lev}(u) + \mathrm{lev}(w) - \mathrm{lev}(z).
\]

By semi-transitivity, there exists an invariant transformation $\eta$ such that either $\eta(u) = w$ or $\eta(w) = u$. Let us suppose the first case; then we have

\[
a + b = \mathrm{lev}(v) - \mathrm{lev}(u) + \mathrm{lev}(\eta(u)) - \mathrm{lev}(z)
\]

and, as $-\mathrm{lev}(u) + \mathrm{lev}(\eta(u)) = -\mathrm{lev}(v) + \mathrm{lev}(\eta(v))$, we deduce that $a + b = \mathrm{lev}(\eta(v)) - \mathrm{lev}(z) \in A$. The same conclusion holds in the second case as $\mathrm{lev}(w) - \mathrm{lev}(\eta(w)) = \mathrm{lev}(z) - \mathrm{lev}(\eta(z))$.

Hence, if the set $A$ does not contain 1, as it is not restricted to $\{0\}$, it is equal to $k.\mathbb{Z}$ for a certain $k \in \mathbb{N}$, $k > 1$. Choose $o \in V$ arbitrary; the mapping $\mathrm{lev}' : u \mapsto \frac{\mathrm{lev}(u) - \mathrm{lev}(o)}{k}$ is a level, and there exist $u, v$ such that $\mathrm{lev}(u) - \mathrm{lev}(v) = k$. This implies that $G$ admits a level, namely $\mathrm{lev}'$, for which there exist $u, v$ verifying $\mathrm{lev}'(u) - \mathrm{lev}'(v) = 1$. By semi-transitivity this proves that there exists an invariant transformation with level in $\{+1, -1\}$, by choosing $\eta$ such that $u = \eta(v)$ or $v = \eta(u)$.

Level dependence graph: Let $G_{patt}$ be a p.i.g. over a collection of semi-transitive graphs, with normalized levels $\mathrm{lev}_1, \ldots, \mathrm{lev}_d$. We consider the projection defined by the product of level mappings:

\[
V_1 \times \cdots \times V_d \times H \to \mathbb{Z}^d \times H, \qquad
(v_1, \ldots, v_d) \times h \mapsto (\mathrm{lev}_1(v_1), \ldots, \mathrm{lev}_d(v_d)) \times h. \tag{3.25}
\]

It can be easily seen that this allows us to associate $G_{patt}$ with a unique pattern grid. Its associated dependence sets are given by:

\[
\Delta_{h,h'} = \bigl\{ \mathrm{lev}(v') - \mathrm{lev}(v) \bigm| v, v' \in V \text{ s.t. } (v \times h \to v' \times h') \in E \bigr\}.
\]

If this dependence graph admits a sharp vector, the number of vertices contained in any path in the pattern grid, as well as in the pattern invariant graph, can be bounded by a scalar product:

Proposition 12 Let $\pi : v \times h \leadsto v' \times h'$ be a path defined in a pattern invariant graph. We assume that the associated dependence graph admits a well sharp vector $s$. We then have:

\[
|\pi| \le H \bigl( 1 + \mathrm{Res} + \langle \mathrm{lev}(v) - \mathrm{lev}(v'), s \rangle \bigr).
\]

Proof: This is a consequence of Theorem 2, as a path in the pattern invariant graph has the same length as a path in the pattern grid.

We are now able to state the major result of this section: if weights are light-tailed, pattern invariant graphs with a sharp level admit a finite limit of directional last-passage percolation. The main interest of this theorem is not in its precision (a similar result was obtained for pattern grids under a more general condition on the weights' distribution) but rather in the wide range of decentralized organizations that fit in the assumptions made here.
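The definition of a level, and the remark that a non-zero level forces transience, can be sanity-checked on a small sample of the diagonal lattice. The sketch below is illustrative only: the helper names (`lev`, `translation`, `preserves_level_difference`, `level_of`) are hypothetical, the level is taken to be the distance from the root $(0, 0)$, and translations by non-negative offsets are assumed to be invariant transformations of this lattice.

```python
from itertools import product

def lev(v):
    """Candidate level on the diagonal lattice: distance from the root (0, 0)."""
    i, j = v
    return i + j

def translation(a, b):
    """Translation by (a, b), a, b >= 0 (assumed to be an invariant transformation)."""
    return lambda v: (v[0] + a, v[1] + b)

def preserves_level_difference(eta, vertices):
    """Check lev(eta(v)) - lev(eta(u)) == lev(v) - lev(u) on a finite sample."""
    return all(
        lev(eta(v)) - lev(eta(u)) == lev(v) - lev(u)
        for u, v in product(vertices, repeat=2)
    )

def level_of(eta, vertices):
    """Return lev(eta) if lev(eta(v)) - lev(v) is constant on the sample, else None."""
    shifts = {lev(eta(v)) - lev(v) for v in vertices}
    return shifts.pop() if len(shifts) == 1 else None

sample = [(i, j) for i in range(4) for j in range(4)]
eta = translation(1, 2)

if __name__ == "__main__":
    print(preserves_level_difference(eta, sample), level_of(eta, sample))
```

Since the level of this `eta` is non-zero, the levels along any orbit are strictly increasing, so the orbit cannot revisit a vertex.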


Theorem 12 Let $G_{patt}$ be a pattern invariant graph; we assume:

(i) There is a level application associated with a sharp dependence graph.
(ii) The invariance $\eta$ is transient (implied in particular by $\mathrm{lev}(\eta) \ne 0$).
(iii) The weights of all vertices are light-tailed.

Then

\[
\lim_{m \to \infty} \frac{\mathrm{Per}_{\eta^{(m)}(o) \times h}}{m} = \ell, \quad \text{a.s., and in } L^1, \quad \ell \text{ is a constant}.
\]

Proof: According to the super-additive theorem (Corollary 6), we only need to prove that

\[
\limsup_{m \to \infty} \frac{\mathrm{Per}_{\eta^{(m)}(o) \times h}}{m} < \infty.
\]

All weights are upper bounded, for the stochastic order (see Section III in Appendix A), by a variable of law $s$. We denote $A(t) = E[e^{t.s}]$.

A path $\pi$ leading from $\eta^{(m)}(o) \times h$ to the origin $o \times h$ verifies, by Proposition 12, immediately above,

\[
|\pi| \le H \Bigl( 1 + \mathrm{Res} + \underbrace{\langle \mathrm{lev}(\eta^{(m)}(o)) - \mathrm{lev}(o), s \rangle}_{= m . \langle \mathrm{lev}(\eta), s \rangle} \Bigr).
\]

Hence, when $m \ge 1$, $|\pi| \le B.m$ with $B = H (1 + \mathrm{Res} + |\langle \mathrm{lev}(\eta), s \rangle|)$. This implies, by the Markov inequality,

\[
P[\mathrm{Wei}(\pi) \ge x m] \le e^{-t x m} E[e^{t \mathrm{Wei}(\pi)}] \le e^{-t x m} A(t)^{|\pi|} \le e^{-t x m} A(t)^{B.m}.
\]

Let $D$ be the uniform bound on the degree of any vertex of the pattern invariant graph. All the paths contained in the definition of $\mathrm{Per}_{\eta^{(m)}(o) \times h}$ end in the same vertex. They all have a length upper bounded by $B.m$. We can therefore bound the number of these paths by $D^{B.m}$.

As the probability of the union of some events is upper bounded by the sum of their probabilities, we have

\[
P\bigl[ \mathrm{Per}_{\eta^{(m)}(o) \times h} \ge x m \bigr] \le D^{B.m} e^{-t x m} A(t)^{B.m}.
\]

When $x$ is chosen large enough (in fact it should be chosen such that $e^{t x} > (A(t) D)^B$), this probability defines a convergent series. The result then follows from the Borel-Cantelli lemma:

\[
P\Bigl[ \limsup_{m \to \infty} \frac{1}{m} \mathrm{Per}_{\eta^{(m)}(o) \times h} \le x \Bigr] = 1.
\]
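The convergence used in the last step can be written out explicitly. The following worked bound only rearranges the inequality of the proof into a geometric series; the symbol $r$ is introduced here for illustration, and $t > 0$ is assumed to be a value for which $A(t)$ is finite, as provided by the light tail assumption.

```latex
\[
\sum_{m \ge 1} P\bigl[ \mathrm{Per}_{\eta^{(m)}(o) \times h} \ge x m \bigr]
\;\le\; \sum_{m \ge 1} \Bigl( \bigl( D \, A(t) \bigr)^{B} e^{-t x} \Bigr)^{m}
\;=\; \sum_{m \ge 1} r^{m},
\qquad r = \bigl( D \, A(t) \bigr)^{B} e^{-t x}.
\]
The series is finite as soon as $r < 1$, that is
\[
x > \frac{B}{t} \log \bigl( D \, A(t) \bigr),
\]
and the Borel-Cantelli lemma then gives
$\mathrm{Per}_{\eta^{(m)}(o) \times h} < x m$ for all but finitely many $m$, almost surely.
```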


It could be important to study the last-passage percolation of a sequence of iterated images starting from any arbitrary element. In fact, this limit will be exactly the same when the pattern invariant graph follows an additional condition, that recalls the solidarity exhibited by some pattern grids.

A p.i.g., with origin $o \times h$, is called solidar if we have for all $v \in V$

(i) There exists a.s. a path $\pi : v \times h \leadsto o \times h$.
(ii) $\exists m \ge 0$ such that there is a path $\pi : \eta^{(m)}(o) \times h \leadsto v \times h$.

Proposition 13 Let $G_{patt}$ be solidar; under the conditions of Theorem 12,

\[
\lim_{m \to \infty} \frac{\mathrm{Per}_{\eta^{(m)}(v) \times h}}{m} = \ell \quad \text{a.s., and in } L^1, \text{ for all } v.
\]

Proof: For any $v$, by solidarity, there exists $M \ge 0$ such that a.s.: $0 \le \mathrm{Per}_{v \times h} \le \mathrm{Per}_{\eta^{(M)}(o) \times h}$. These paths are preserved under $\eta$, so that we have a.s.:

\[
\mathrm{Per}_{\eta^{(m)}(o) \times h} \le \mathrm{Per}_{\eta^{(m)}(v) \times h} \le \mathrm{Per}_{\eta^{(m+M)}(o) \times h},
\]

which proves that $\lim_m \frac{1}{m} \mathrm{Per}_{\eta^{(m)}(v) \times h}$ is equal to the same constant.

Why is the light tail assumption necessary?

For sharp pattern grids with dimension $d$, we proved in Section 2 that the last-passage percolation time grows linearly in any direction, for a general moment condition which depends on $d$. Theorem 12 generalizes this result to any system of tasks organized in a graph, as long as the graph is semi-transitive and the system can be projected on a sharp pattern grid. But this result requires that every task's completion time is light tailed.

In the following proposition, we illustrate why the light tail condition is needed. We present a simple example based on an infinite binary tree, and prove that the result of Theorem 12 does not hold whenever task completion times are heavy tailed, although all the other conditions to apply the theorem are satisfied.

Proposition 14 Let a p.i.g. be defined on $\mathbb{N} \times T_2$ by the following edges:

(i) $(m, u) \to (m - 1, u)$.
(ii) $(m, u) \to (m, m(u))$, where $m(u)$ denotes the mother node of $u$ in the tree $T_2$.
(iii) $(m, u) \to (m - B, v)$ for $v \in d(u)$, where $d(u)$ denotes the set of daughter nodes of $u$ in the tree.

Note that this example generalizes Example B on an infinite tree. Let us consider the level defined as the distance from the root node in the tree; it allows us to project this p.i.g. into a sharp pattern grid.


However, if we assume that the weights are heavy tailed, then for any $u$,

\[
\limsup_{m \to \infty} \frac{\mathrm{Per}_{(m,u)}}{m} = \infty, \quad \text{a.s.}
\]

Proof: It is sufficient to prove that $\limsup \frac{\mathrm{Per}_{(m,o)}}{m} = \infty$ when $m$ goes to infinity and is a multiple of $B$. For any $m = B.n$, and any vertex $u$ at distance $n$ from the root, we can construct a path $(m, o) \leadsto (0, u) \leadsto (0, o)$ using edges of type (iii) and (ii).

There are $2^n$ such nodes $u$, and the weight of the constructed path is at least that of vertex $(0, u)$. Hence, we obtain that:

\[
\frac{\mathrm{Per}_{(n.B,o)}}{n.B} \ge \frac{1}{B} \, \frac{1}{n} \max \bigl\{ \mathrm{Wei}(0, u) \bigm| u \in d^{(n)}(o) \bigr\}.
\]

This is the maximum of a collection of variables independent in the indices $n$ and $u$. The index $u$ takes an exponential number of values and the variables are heavy tailed. The result is then a consequence of Proposition 18 (ii), which is shown in Appendix A.
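The mechanism behind this proof, a maximum over exponentially many heavy-tailed weights outgrowing any linear normalization, can be observed in a small simulation. This is a hedged sketch rather than a check of Proposition 18: the Pareto law with shape `alpha` is chosen here as one example of a distribution without any finite exponential moment, and the function name `scaled_max` is hypothetical.

```python
import random

def scaled_max(n, alpha=1.5, seed=0):
    """Return (1/n) * max of 2**n i.i.d. Pareto(alpha) weights (illustrative law)."""
    rng = random.Random(seed)
    # One weight per vertex u at distance n from the root of the binary tree.
    best = max(rng.paretovariate(alpha) for _ in range(2 ** n))
    return best / n

if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        print(n, scaled_max(n))
```

With a light-tailed law such as the exponential one, the same maximum grows only linearly in n, so the normalized quantity stays bounded.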


Bibliography

[1] F. Baccelli, A. Borovkov, and J. Mairesse. Asymptotic results on infinite tandem queuing networks. Probability Theory and Related Fields, 118:365–405, 2000.
(A study of general infinite tandems of single server queues: for a saturated input, a hydrodynamic limit is defined for heavy tailed service times, using analogies with the growth of lattice animals. A steady state is constructed when the input is stationary, for the first time in this context, and a law of large numbers on sojourn times is established.)

[2] F. Baccelli and P. Bremaud. Elements of Queuing Theory. Springer-Verlag, second edition, 2003.
(This textbook presents the theory of stationary ergodic point processes and Palm calculus, which are then used to extend results of queuing theory to general processes of arrivals and services.)

[3] J. T. Cox, A. Gandolfi, P. S. Griffin, and H. Kesten. Greedy lattice animals I: upper bounds. Annals Appl. Prob., 1(3):1151–1169, 1993.
(This article introduces greedy lattice animals and proves an upper bound on their growth using a combinatorial argument.)

[4] A. Gandolfi and H. Kesten. Greedy lattice animals II: linear growth. Annals Appl. Prob., 1(4):76–107, 1994.
(This article follows the results of [3], proving that the weight of a greedy lattice animal grows at a constant linear rate under a simple condition on the moments of the weight distribution.)

[5] B. Gaujal, A. Jean-Marie, and J. Mairesse. Computations of uniform recurrence equations using minimal memory size. SIAM J. Comput., 30(5):1701–1738, 2001.
(Several bounds are shown on the size of the minimal memory needed to compute the solution of a uniform recurrence equation in dimension 1, using a model of multiple processors with a shared memory.)

[6] P. Glynn and W. Whitt. Departures from many queues in series. Annals of Applied Probability, 1(4):546–572, 1991.
(This article studies a large number of single server queues in series with a saturated input, under the assumption that service times are independent and light tailed. A hydrodynamic limit is shown to characterize the departure of client $m$ from the $\lfloor x m \rfloor$ queue, as $m$ grows.)

[7] R. M. Karp, R. E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. J. ACM, 14(3):563–590, 1967.
(Introduces the notion of Uniform Recurrence Equations (URE) to characterize solutions of fixed point problems. This article defines the dependence graph and a necessary and sufficient condition under which a solution can be built, when the domain of definition is a quadrant.)

[8] S. Lee. The continuity of M and N in greedy lattice animals. Journal of Theoretical Probability, 10(1):87–100, 1997.
(Following the results shown in [4], this article proves that the coefficient of linear growth for a greedy lattice animal is continuous w.r.t. the weak convergence of the weights distribution.)

[9] J. Martin. Large tandem queuing networks with blocking. Queuing Systems, Theory and Applications, 41:45–72, 2002.
(This paper generalizes the results of [1] to a series of queues with finite buffer and back pressure blocking, and a slightly more general condition on the service time law. It introduces the concept of dependence path.)

[10] J. Martin. Linear growth for greedy lattice animals. Stochastic Processes and their Applications, 98(1):43–66, 2002.
(This article proves the linear growth of greedy lattice animals under a slightly more general condition on the weight distribution than in [4]. The proof follows from a comparison with the Bernoulli law, and the property of super-additivity.)

[11] Karl Petersen. Ergodic Theory. Cambridge Press, 1983.
(A reference textbook on measure preserving applications and the ergodicity or mixing properties which allow the ergodic theorem to be applied.)

[12] M. Talagrand. 1995.
(See reference in Appendix A.)


Appendix A

Some Mathematical Background

Notation used

Integers and Fractional Parts

For all real $x$ we denote by $\lfloor x \rfloor$ its integer part, $\lfloor x \rfloor = \sup \{ n \in \mathbb{Z} \mid n \le x \}$. Note that for any real $x$, positive or negative, we have:

\[
\lfloor x \rfloor \le x < \lfloor x \rfloor + 1 \quad \text{or, equivalently,} \quad x - 1 < \lfloor x \rfloor \le x.
\]

We denote the positive and negative part of a real number, respectively, by $(x)^+$ and $(x)^-$:

\[
(x)^+ = \max(x, 0) \quad \text{and} \quad (x)^- = \min(x, 0).
\]

Vectors, Order, Norm

For $a, b$ in $\mathbb{R}^d$, we denote by $a \le b$ the partial order

\[
a \le b \text{ if and only if } (a_i \le b_i \text{ for all } i = 1, \ldots, d).
\]

We denote the scalar product of two real vectors $a$ and $b$ by $\langle a, b \rangle$. The euclidean norm of $a$ is denoted by $\|a\|$:

\[
\text{for all } a, b \text{ in } \mathbb{R}^d, \quad \langle a, b \rangle = \sum_{i=1}^{d} a_i . b_i \quad \text{and} \quad \|a\| = \sqrt{\langle a, a \rangle} = \sqrt{\sum_{i=1}^{d} (a_i)^2}.
\]

The uniform norm of $a$ is given by: $\|a\|_\infty = \max_{i=1,\ldots,d} |a_i|$.
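For readers who want to experiment with this notation, the following short sketch transcribes it directly. The helper names (`pos`, `neg`, `leq`, `scalar`, `euclid_norm`, `uniform_norm`) are hypothetical; note that the negative part follows the convention above, $(x)^- = \min(x, 0)$, so it is non-positive.

```python
import math

def pos(x):
    """Positive part (x)^+ = max(x, 0)."""
    return max(x, 0.0)

def neg(x):
    """Negative part (x)^- = min(x, 0); non-positive with this convention."""
    return min(x, 0.0)

def leq(a, b):
    """Coordinate-wise partial order on vectors of the same dimension."""
    return all(ai <= bi for ai, bi in zip(a, b))

def scalar(a, b):
    """Scalar product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def euclid_norm(a):
    """Euclidean norm ||a|| = sqrt(<a, a>)."""
    return math.sqrt(scalar(a, a))

def uniform_norm(a):
    """Uniform norm ||a||_inf = max |a_i|."""
    return max(abs(ai) for ai in a)

if __name__ == "__main__":
    # math.floor is the integer part used here, also for negative numbers.
    print(math.floor(-1.5), pos(-3.0), neg(-3.0), euclid_norm([3.0, 4.0]))
```

The order is only partial: neither `leq([1, 2], [2, 1])` nor `leq([2, 1], [1, 2])` holds.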


I Measure Theory and Ergodicity

Ergodic theory will be an important tool to prove that limits of a process of random variables not only exist, but are also almost surely equal to a constant.

In this section, we revisit some of the most widely known results of measure theory and ergodic characterization, to establish an ergodic theorem for applications that are not necessarily invertible. It will make it easier to formulate the results shown on pattern invariant graphs.

Algebra, σ-Field, Monotone Class

A collection $\mathcal{A}$ of subsets of a set $X$ is called an algebra if it contains the empty set, and if it is closed by the usual operations on sets. In other words, it verifies:

(P1) $A \in \mathcal{A}$ implies that $\overline{A} = \{ x \in X \mid x \notin A \} \in \mathcal{A}$.
(P2) $A \in \mathcal{A}$ and $B \in \mathcal{A}$ implies that $A \cap B$ is in $\mathcal{A}$.

One obtains from this definition, by induction, that the union or the intersection of any finite family of subsets of $\mathcal{A}$ is also an element of $\mathcal{A}$. The collection is called a σ-field if the same result holds for any countably infinite family of subsets. In other words, it should verify (P1) and the following:

(P3) If $A_0, A_1, \ldots$ are chosen in $\mathcal{A}$, then $\bigcap_{i \ge 0} A_i \in \mathcal{A}$.

One can check easily that the intersection of any family of σ-fields is a σ-field. For any collection of subsets $\mathcal{C}$, we denote by $\sigma(\mathcal{C})$ the intersection of all σ-fields that contain $\mathcal{C}$. It is usually referred to as the σ-field generated by $\mathcal{C}$.

Most of the properties established for elements of a σ-field can be shown first on a restricted collection of subsets $\mathcal{C}$, and then generalized to all elements of $\sigma(\mathcal{C})$. The most difficult part of this method is usually to prove that the subsets that verify a property are closed under infinite union or intersection. This motivates the next definition.

A collection $\mathcal{M}$ of subsets of $X$ is called a monotone class if

(M1) $\mathcal{M}$ contains $X$.
(M2) For $A' \subseteq A$, both in $\mathcal{M}$, the restriction $A \setminus A' = \{ x \in A, x \notin A' \}$ is in $\mathcal{M}$.
(M3) For any increasing sequence $A_n$ in $\mathcal{M}$, $\lim_{n \to \infty} A_n = \bigcup_{n \ge 0} A_n$ is in $\mathcal{M}$.

A monotone class is always closed by the operation of complement (P1). The next proposition proves that collections that are both a monotone class and an algebra are exactly all the σ-fields.

Proposition 15 A monotone class, closed under finite intersection (or finite union), is a σ-field.


Proof: For any $A_0, A_1, \ldots$, we can consider $A'_i = \bigcup_{0 \le j \le i} A_j$. This forms another sequence of subsets, chosen in the monotone class. This sequence is increasing, hence $\bigcup_{i \ge 0} A'_i$, that is equal to $\bigcup_{i \ge 0} A_i$, is in the class, proving the result.

Imagine that a property on events is preserved by restriction, and extends to a limit of increasing subsets. It is hence verified by subsets of a monotone class. According to the previous result, we can prove it for any event of a σ-field if we show that this property is preserved by finite intersection, and that it is verified by a collection generating all events.

The next result proves an even stronger result. It is indeed sufficient to check that the property is verified by a collection generating all events, and that this generating collection is closed by finite intersection.

Theorem 13 Let $\mathcal{C}$ be a collection, closed under finite intersection. If a monotone class contains $\mathcal{C}$, it contains $\sigma(\mathcal{C})$.

Proof: Let $\mathcal{M}(\mathcal{C})$ be the smallest monotone class containing $\mathcal{C}$, and

\[
\mathcal{M}_1 = \{ A \in \mathcal{M}(\mathcal{C}) \mid \forall B \in \mathcal{C}, \; A \cap B \in \mathcal{M}(\mathcal{C}) \}.
\]

$\mathcal{M}_1$ is a monotone class:

• It contains $X$.
• For $A' \subseteq A$ chosen in $\mathcal{M}_1$, we have that $A' \cap B$ and $A \cap B$ are in $\mathcal{M}(\mathcal{C})$, hence that $(A \setminus A') \cap B = (A \cap B) \setminus (A' \cap B)$ is in $\mathcal{M}(\mathcal{C})$.
• If $(A_n)_{n \ge 0}$ is an increasing sequence of $\mathcal{M}_1$, the sets $(A_n \cap B)$ are in $\mathcal{M}(\mathcal{C})$, which proves $\bigcup (A_n \cap B) = \bigl( \bigcup A_n \bigr) \cap B \in \mathcal{M}(\mathcal{C})$.

$\mathcal{M}_1$ is a monotone class and, as $\mathcal{C}$ is closed under finite intersection, it obviously contains $\mathcal{C}$; hence $\mathcal{M}_1$ and $\mathcal{M}(\mathcal{C})$ are equal. Let us now consider

\[
\mathcal{M}_2 = \{ A \in \mathcal{M}(\mathcal{C}) \mid \forall B \in \mathcal{M}(\mathcal{C}), \; A \cap B \in \mathcal{M}(\mathcal{C}) \}.
\]

Again, this defines a monotone class (same arguments). It contains all subsets of $\mathcal{C}$ because $\mathcal{M}_1 = \mathcal{M}(\mathcal{C})$; it is therefore equal to $\mathcal{M}(\mathcal{C})$. This proves that $\mathcal{M}(\mathcal{C})$ is closed under finite intersection; it is hence a σ-field by Proposition 15, which necessarily contains $\sigma(\mathcal{C})$, by minimality.

Measure Preserving Application

Let us consider a measure space $(X, \mathcal{B}, \mu)$; a measurable application $T : X \to X$ is called a measure preserving application (m.p.a.) if it verifies

\[
\mu(T^{-1}(E)) = \mu(E) \text{ for all events } E \in \mathcal{B},
\]

where $T^{-1}(E) = \{ x \in X \mid T(x) \in E \}$.

Proposition 16 Let $T : X \to X$ be an application and $E \subseteq X$,

\[
T(T^{-1}(E)) \subseteq E, \quad T(T^{-1}(E)) = E \text{ if } T \text{ is onto},
\]
\[
E \subseteq T^{-1}(T(E)), \quad E = T^{-1}(T(E)) \text{ if } T \text{ is one-to-one}.
\]

Proof: We have $x \in T(T^{-1}(E)) \iff \exists y \in T^{-1}(E)$ s.t. $x = T(y)$. By definition of $T^{-1}(E)$, this implies necessarily that $x$ is in $E$. Similarly, if $T$ is onto and $x$ is in $E$, there exists $y$ such that $T(y) = x$ and $y$ is necessarily in $T^{-1}(E)$, which proves the second inclusion.

\[
x \in T^{-1}(T(E)) \iff \exists y \in E \text{ s.t. } T(x) = T(y).
\]

This is obviously verified for any $x \in E$. If $T$ is one-to-one, the RHS implies in particular that $y = x$, and, necessarily, $x \in E$.

As a consequence, if $T$ is a m.p.a. and one-to-one, we have $\mu(T^{-1}(T(E))) = \mu(E)$ by Proposition 16, and $\mu(T^{-1}(T(E))) = \mu(T(E))$ by definition for $T$. This implies, in this case, $\mu(T(E)) = \mu(E)$ for any event $E$. But this equality does not hold generally for any m.p.a., as shown in the example below.

Example : Semi-infinite line

Let us consider a collection of independent real random variables on a semi-infinite line, $\{ X_i \mid i \in \mathbb{N} \}$. Formally, it is defined on the product space $\Omega = \mathbb{R}^{\mathbb{N}}$, with the σ-field given by Borel sets, and the product probability.

We can define the shift application that operates a shift of the index to the right:

\[
T : \Omega \to \Omega, \qquad (x_0, x_1, x_2, \ldots) \mapsto (x_1, x_2, \ldots).
\]

$T$ is a measure preserving application, as a sequence $x_0, x_1, \ldots$ is in $T^{-1}(E)$ if and only if $x_1, x_2, \ldots$ is in $E$. But we do not necessarily have $\mu(T(E)) = \mu(E)$ for all events.

Let us consider the case $P[X_i = 1] = P[X_i = -1] = \frac{1}{2}$, and the event $E = \{ X_0 \ge 0 \}$,

• $E$ has probability $\frac{1}{2}$;
• $T^{-1}(E) = \{ X_1 \ge 0 \}$, and it has probability $\frac{1}{2}$;
• but $T(E) = \Omega$ and it has probability 1.

Set-equivalence: We introduce the following relation between sets

\[
A \cong B \iff \mu(A \,\Delta\, B) = 0 \iff \mu(A \cap \overline{B}) = \mu(\overline{A} \cap B) = 0.
\]


Let us remark first that, if $A \cong B$, then $\mu(A) = \mu(B) = \mu(A \cap B)$.

This relation defines an equivalence relation. Reflexivity and symmetry are obvious; transitivity may be shown in the following way. We assume $A \cong B$ and $B \cong C$,

\[
A \cap \overline{C} = \underbrace{\bigl( A \cap \overline{B} \cap \overline{C} \bigr)}_{\subseteq A \cap \overline{B}} \cup \underbrace{\bigl( A \cap B \cap \overline{C} \bigr)}_{\subseteq B \cap \overline{C}}.
\]

This proves that $\mu(A \cap \overline{C}) = 0$. $\mu(\overline{A} \cap C) = 0$ may be shown in a similar way, hence $A \cong C$.

Note that the relation $\cong$ is compatible with any m.p.a., as we have for any such $T$:

\[
A \cong B \iff T^{-1}(A) \cong T^{-1}(B).
\]

This is because $x \in T^{-1}(A) \cap \overline{T^{-1}(B)}$ if and only if $T(x) \in A$ and $T(x) \notin B$. This is verified if and only if $x \in T^{-1}(A \cap \overline{B})$. Hence $T^{-1}(A) \,\Delta\, T^{-1}(B) = T^{-1}(A \,\Delta\, B)$, which implies that $T^{-1}(A) \,\Delta\, T^{-1}(B)$ and $A \,\Delta\, B$ have the same measure.

Ergodicity, Strong mixing

For any m.p.a. $T$, we call an event $E$ invariant if we have $T^{-1}(E) \cong E$ (i.e. $\mu(T^{-1}(E) \,\Delta\, E) = 0$). This definition characterizes the events that are preserved by $T$, once all subsets of null probability have been neglected.

The application $T$ is called ergodic if any of its invariant events is either negligible (i.e. it has probability 0), or occurs almost surely (i.e. it has probability 1). This property will ensure that variables defined as asymptotic limits of super-additive processes are degenerate and take only a single constant value.

Most of the time, we will deduce ergodicity from the following criterion: $T$ is called strongly mixing if it verifies, for any events $A$ and $B$,

\[
\lim_{k \to \infty} \mu\bigl( T^{-k}(A) \cap B \bigr) = \mu(A) \mu(B). \tag{A.1}
\]

Proposition 17 Let $T$ be a m.p.a. on $(X, \mathcal{B}, \mu)$.

(i) $T$ is strongly mixing if and only if (A.1) holds for all $A$ and $B$ in $\mathcal{C}$, where $\mathcal{C}$ is a collection, closed under finite intersection, which generates $\mathcal{B}$.
(ii) If $T$ is strongly mixing, it is ergodic.

Proof: (ii) may be easily deduced from the properties of the relation $\cong$:

\[
E \cong T^{-1}(E) \implies T^{-1}(E) \cong T^{-2}(E) \implies E \cong T^{-2}(E),
\]

by compatibility with $T^{-1}$ first, and then by transitivity. By induction, we can deduce $E \cong T^{-k}(E)$ for any $k \ge 0$ and any invariant event $E$. As a consequence, $\mu(T^{-k}(E) \cap E) = \mu(E)$. This implies, by (A.1) applied to the events $A = B = E$, that $\mu(E) = \mu(E)^2$, hence $\mu(E) \in \{0, 1\}$.

(i) Let $B$ be chosen in $\mathcal{C}$, and consider the collection

\[
\mathcal{M}(B) = \Bigl\{ A \in \mathcal{B} \Bigm| \lim_{k \to \infty} \mu\bigl( T^{-k}(A) \cap B \bigr) = \mu(A) \mu(B) \Bigr\}.
\]

We claim that it is a monotone class (see below). As it contains $\mathcal{C}$, by Theorem 13, this collection contains all events of $\mathcal{B}$. Fix now any event $A$ in $\mathcal{B}$; the collection

\[
\mathcal{M}'(A) = \Bigl\{ B \in \mathcal{B} \Bigm| \lim_{k \to \infty} \mu\bigl( T^{-k}(A) \cap B \bigr) = \mu(A) \mu(B) \Bigr\}
\]

contains all subsets of $\mathcal{C}$. Again, if this is a monotone class it hence contains all events of $\mathcal{B}$, and the result is proved.

Showing that $\mathcal{M}(B)$ and $\mathcal{M}'(A)$ are monotone classes follows quite simply from their definition; we detail the argument only for $\mathcal{M}(B)$, as the proof for the other collection is even simpler.

Let $A' \subseteq A$, both contained in $\mathcal{M}(B)$. We have

\[
\mu(A \setminus A') = \mu(A) - \mu(A') \quad \text{and} \quad T^{-k}(A \setminus A') \cap B = \bigl( T^{-k}(A) \cap B \bigr) \setminus \bigl( T^{-k}(A') \cap B \bigr),
\]

hence $\mu(T^{-k}(A \setminus A') \cap B) = \mu(T^{-k}(A) \cap B) - \mu(T^{-k}(A') \cap B)$. Letting $k$ grow large proves that

\[
\lim_{k \to \infty} \mu\bigl( T^{-k}(A \setminus A') \cap B \bigr) = \mu(A \setminus A') \mu(B), \quad \text{hence } A \setminus A' \text{ is in } \mathcal{M}(B).
\]

Let $A_n \nearrow A$ be an increasing sequence of subsets in $\mathcal{M}(B)$. As all these subsets are measurable, we have $\mu(A) = \lim \mu(A_n)$. Let us choose any $\varepsilon > 0$ and $N$ such that $n \ge N \implies \mu(A \setminus A_n) \le \varepsilon / 2$; we have

\[
\mu(T^{-k}(A_n) \cap B) \le \mu(T^{-k}(A) \cap B) = \mu(T^{-k}(A_n) \cap B) + \mu(T^{-k}(A \setminus A_n) \cap B) \le \mu(T^{-k}(A_n) \cap B) + \varepsilon / 2.
\]

We can now choose $K$ sufficiently large to have for any $k \ge K$

\[
\mu(A_n) \mu(B) - \varepsilon / 2 \le \mu(T^{-k}(A_n) \cap B) \le \mu(A_n) \mu(B) + \varepsilon / 2,
\]
\[
\mu(A) \mu(B) - \varepsilon \le \mu(T^{-k}(A_n) \cap B) \le \mu(A) \mu(B) + \varepsilon / 2.
\]

Combined with the previous inequality, this gives $\mu(A) \mu(B) - \varepsilon \le \mu(T^{-k}(A) \cap B) \le \mu(A) \mu(B) + \varepsilon$ for any $k \ge K$. Hence $A$ is in $\mathcal{M}(B)$, which completes the proof. The case of $\mathcal{M}'(A)$ may be shown by similar arguments.
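Both the semi-infinite line example and criterion (A.1) can be verified by exhaustive enumeration when the events are cylinders, since a cylinder only reads finitely many coordinates. The sketch below is illustrative and makes its own modelling choices: each coordinate is $\pm 1$ with probability 1/2 as in the example, an event is represented by a predicate on a finite prefix of the sequence, and the helper names (`prob`, `preimage`, `image`, `both`) are hypothetical.

```python
from itertools import product

VALUES = (-1, 1)  # each coordinate is +1 or -1 with probability 1/2

def prob(event, depth):
    """Probability of a cylinder event that only reads coordinates 0..depth-1."""
    points = list(product(VALUES, repeat=depth))
    return sum(1 for x in points if event(x)) / len(points)

def preimage(event, k=1):
    """T^{-k}(E): x belongs to it when the sequence shifted k times is in E."""
    return lambda x: event(x[k:])

def image(event):
    """T(E): y belongs to it when some x0 makes (x0, y0, y1, ...) an element of E."""
    return lambda y: any(event((x0,) + tuple(y)) for x0 in VALUES)

def both(e1, e2):
    """Intersection of two events."""
    return lambda x: e1(x) and e2(x)

E = lambda x: x[0] >= 0                 # the event {X_0 >= 0}
A = lambda x: x[0] == 1 and x[1] == 1   # a cylinder reading coordinates 0 and 1

if __name__ == "__main__":
    print(prob(E, 1), prob(preimage(E), 2), prob(image(E), 1))
    for k in range(4):
        print(k, prob(both(preimage(A, k), E), k + 2), prob(A, 2) * prob(E, 1))
```

In this setting the supports of $T^{-k}(A)$ and $E$ are disjoint as soon as $k \ge 1$, so the product formula already holds exactly for every $k \ge 1$, while it fails for $k = 0$.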


II Super-additivity

A collection of real valued random variables $X = \{ X_{s,t} \mid s \in \mathbb{N}, t \in \mathbb{N}, s < t \}$ is called a super-additive process if it verifies:

(S1) For any $s < t < u$, we have $X_{s,u} \ge X_{s,t} + X_{t,u}$.
(S2) The joint distribution of $\{ X_{s+1,t+1} \mid s, t \in \mathbb{N}, s < t \}$ is the same as that of $X$.
(S3) $X_{0,t}$ admits a finite expectation for all $t \ge 1$.
(S4) There exists $A > 0$ such that for all $t \ge 1$, $E[X_{0,t}] \le t.A$.

Theorem 14 Let $X$ be a super-additive process, then

\[
\lim_{t \to \infty} \frac{X_{0,t}}{t} = \ell \in \mathbb{R}, \quad \text{a.s., in } L^1, \quad \text{and} \quad E[\ell] = \sup_{t > 0} \frac{E[X_{0,t}]}{t}.
\]

Proof: One can easily check that $X$ is a super-additive process if and only if the collection of variables $\{ -X_{s,t} \mid s, t \in \mathbb{N}, s < t \}$ is a sub-additive process, as defined in [1]. The result is then a consequence of Kingman's sub-additive results (Theorem 1 p.885, in the same article).

The stationarity condition (S2) can be strengthened to prove the degeneracy of the limit, using a measure preserving application and ergodicity:

(S'2) For all $s, t \in \mathbb{N}$, $X_{s+1,t+1} = X_{s,t} \circ T$, where $T$ is an ergodic m.p.a.

Note here that $T$ does not need to be invertible (see Section I).

Corollary 5 If $X$ verifies (S1), (S'2), (S3) and (S4), then, almost surely,

\[
\ell = E[\ell] = \sup \Bigl\{ \frac{E[X_{0,t}]}{t} \Bigm| t \ge 1 \Bigr\}.
\]

Proof: For all $t \ge 1$, we have by almost sure limit

\[
\ell = \lim_{t \to \infty} \frac{X_{0,t}}{t}
\ge \lim_{t \to \infty} \Bigl( \frac{X_{0,1}}{t} + \frac{t - 1}{t} \, \frac{X_{1,t}}{t - 1} \Bigr)
\ge \lim_{t \to \infty} \frac{X_{0,1}}{t} + \lim_{t \to \infty} \frac{X_{0,t-1} \circ T}{t - 1},
\]

so that $\ell \ge \ell \circ T$ almost everywhere. As $E[\ell] = E[\ell \circ T] < \infty$, it implies, almost surely, $\ell = \ell \circ T$. This proves that $\{ \ell \ge E[\ell] \}$ is an invariant event under $T$. It cannot have a null probability; it is then almost sure. Similarly we have, almost surely, $\ell \le E[\ell]$, and hence $\ell = E[\ell]$.
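The identity between the limit and the supremum of $E[X_{0,t}]/t$ used in Theorem 14 and Corollary 5 rests on a deterministic fact about super-additive sequences, often called Fekete's lemma. The short derivation below is added as a reminder and is not taken from the text; the notation $a_t$, $q$ and $r$ is introduced here for illustration.

```latex
Write $a_t = E[X_{0,t}]$. By (S1), $X_{0,s+t} \ge X_{0,s} + X_{s,s+t}$, and by (S2),
$X_{s,s+t}$ has the same law as $X_{0,t}$, so that
\[
a_{s+t} \ge a_s + a_t \quad \text{for all } s, t \ge 1 .
\]
Fix $s \ge 1$ and write any $t > s$ as $t = q s + r$ with $q \ge 1$ and $1 \le r \le s$. Then
\[
\frac{a_t}{t} \;\ge\; \frac{q\, a_s + a_r}{t} \;=\; \frac{q s}{t} \cdot \frac{a_s}{s} + \frac{a_r}{t} .
\]
When $t \to \infty$ with $s$ fixed, $q s / t \to 1$ and $a_r / t \to 0$, because $a_r$ takes
finitely many finite values by (S3). Hence $\liminf_{t} a_t / t \ge a_s / s$ for every $s$, and
\[
\sup_{s \ge 1} \frac{a_s}{s} \;\le\; \liminf_{t \to \infty} \frac{a_t}{t}
\;\le\; \limsup_{t \to \infty} \frac{a_t}{t} \;\le\; \sup_{s \ge 1} \frac{a_s}{s} ,
\]
so the limit exists and equals the supremum, which is finite by (S4).
```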


224 Appendix AWhat happens when integrability is not veried ?It is usually the ase that super-additivity and stationary ergodi ity omedire tly from the denition of a pro ess. Integrability, and ondition (S4),are usually the most di ult assumptions to verify. What we show in thenext result is that only two ases are possible: either they are veried, thelimit is a nite onstant and the onvergen e holds. Otherwise, the a.s. limitstill exists but it always has an innite mean.Corollary 6 If X veries (S1), (S′2), and E[(X0,1)

−] > −∞, we havelimt→∞

X0,t

t= ℓ ∈ R ∪ +∞ a.s. and, for ℓ = lim

t→∞

E[X0,t]

t= sup

t>0

E[X0,t]

t

, either (i) ℓ < +∞ and ℓ = ℓ a.s.or (ii) ℓ = +∞ and E[ℓ] = +∞.Note that, in the se ond ase, we do not have ℓ = +∞ a.s., as the limit isnot ne essarily a degenerate variable (i.e. a deterministi onstant). This isbe ause the ergodi result applies only to variable with nite mean.Proof: The argument may be found p.885 in [1. Let us introdu e for all

$N \in \mathbb{N}$, the truncated process $X^{(N)}$ defined by $X^{(N)}_{s,t} = \min(X_{s,t}, N(t-s))$. This process verifies (S2) and (S3), hence the limit

$$\ell^{(N)} = \lim_{t\to\infty} \frac{X^{(N)}_{0,t}}{t}$$

is defined, a.s. By monotonicity, $\ell^{(N)}$ is increasing with $N$, so that $\ell = \lim_{N\to\infty} \ell^{(N)}$ exists and is the almost sure limit of $\frac{1}{t} X_{0,t}$.

We now prove the following lemma:

Lemma 10 Exactly one of the following statements holds: either (i) $\lim_{t\to\infty} \frac{E[X_{0,t}]}{t} = +\infty$, or (ii) (S3) and (S4) are verified.

Proof of the Lemma: If there exists one $t$ such that $E[X_{0,t}] = \infty$, the same result holds for any $t' > t$. In other words, we are in the first case. Otherwise, (S3) holds as the negative part of $X_{0,t}$ always admits a finite expectation. By super-additivity of the expectation, we know that $\lim_{t\to\infty} \frac{E[X_{0,t}]}{t} < +\infty$ if and only if $\sup\left\{ \frac{E[X_{0,t}]}{t} \,\middle|\, t \ge 1 \right\} < +\infty$, proving the lemma.

In the last case of the lemma, we are in the conditions of Corollary 5, and result (i) clearly holds. Otherwise, we know that $E[\ell] \ge \frac{1}{t} E[X^{(N)}_{0,t}]$ for any $N$ and $t$. When $N$ goes to infinity, we have by monotone convergence that

$$E[\ell] \ge \frac{1}{t} E[X_{0,t}] \quad \text{for any } t,$$

which implies (ii) when $t$ grows large, as $\lim_{t\to\infty} \frac{E[X_{0,t}]}{t} = +\infty$.
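
The dichotomy of Corollary 6 can be observed numerically in the simplest super-additive case, an additive process $X_{0,t}$ built as a sum of $t$ i.i.d. non-negative variables. The sketch below is only an illustration under that assumption; the function names, the two laws (exponential with mean 1, Pareto with tail index 0.5) and the seed are choices made here, not taken from the dissertation. In this i.i.d. special case the limit in case (ii) happens to be $+\infty$ almost surely, which the corollary does not guarantee in general.

```python
import random


def running_average(sample, t, seed=0):
    """Return X_{0,t} / t for the additive process X_{0,t} = sum of t i.i.d. draws."""
    rng = random.Random(seed)
    return sum(sample(rng) for _ in range(t)) / t


def finite_mean(rng):
    # Exponential law with mean 1: the integrable case (i).
    return rng.expovariate(1.0)


def infinite_mean(rng):
    # Pareto law on [1, +inf) with tail index 0.5: the mean is infinite, case (ii).
    return rng.paretovariate(0.5)


if __name__ == "__main__":
    for t in (10**3, 10**4, 10**5):
        print(t, running_average(finite_mean, t), running_average(infinite_mean, t))
```

With the integrable law the printed ratio stabilises near the mean, whereas with the infinite-mean law it keeps growing as $t$ increases.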


III Stochastic Order

The stochastic order is a partial order between distribution measures on $\mathbb{R}$. We say that a real variable $X$ is stochastically smaller than $Y$, denoted $X \le_{st} Y$, if

$$\text{for all } x \in \mathbb{R}, \quad P[X \ge x] \le P[Y \ge x].$$

$X \le_{st} Y$ implies in particular that there exist two variables $\tilde{X}$ and $\tilde{Y}$, with the same laws as $X$ and $Y$ respectively, such that $\tilde{X} \le \tilde{Y}$ holds almost surely. We refer to [3] for a proof of this last result, and for more details on the stochastic order.

Lemma 11 Let $(X_i)_{i\in I}$ be a collection of independent real random variables and $\phi : \mathbb{R}^I \to \mathbb{R}$ a non-decreasing deterministic functional of this collection. If we have, for all $i$, $Y \le_{st} X_i \le_{st} Z$, then we have:

$$\phi(Y) \le \phi(X) \le \phi(Z),$$

where $\phi(Y)$ is defined as the value of $\phi$ for a collection of independent random variables with law $Y$.

Proof: One can choose collections $\tilde{Y}$, $\tilde{X}$, $\tilde{Z}$ of random variables with the same joint laws (fixed by the independence assumptions), such that $\tilde{Y}_i \le \tilde{X}_i \le \tilde{Z}_i$ holds almost surely for all $i$. We then conclude by monotonicity, as the value of $\phi$ only depends on the law of the collection of variables.

IV Concentration of Measure

Theorem 15 Let $X_i$, $i = 1, \dots, N$, be $N$ independent variables on $[0; L]$. Let $\mathcal{C}$ be a collection of subsets of $\{1, \dots, N\}$ such that for $C \in \mathcal{C}$, $|C| \le S$. Then the variable $Z = \max_{C\in\mathcal{C}} \sum_{i\in C} X_i$ verifies

$$P\left[\,|Z - E[Z]| \ge u\,\right] \le \exp\left(64 - \frac{u^2}{16 S L^2}\right).$$

Proof: It is shown as Lemma 5.1 in [2]. This is a consequence of the concentration of measure result shown by Talagrand (Theorem 8.1.1 in [4]).

V Heavy Tailed Distribution

A real random variable $X$ is said to be light tailed if it verifies:

(A.2) There exists $t > 0$ such that $E[e^{tX}] < \infty$.


Note that this only depends on its distribution. A variable that does not verify Condition (A.2) is by definition heavy tailed.

We present in the rest of this section a characterization of light and heavy tailed distributions, as conditions for the linear growth of a maximum taken in large collections of variables with the same law.

Let $f : \mathbb{N} \to \mathbb{N}$ be a non-decreasing function; we define:

$$M_n = \frac{1}{n} \max\left\{ X^{(n)}_i \,\middle|\, 1 \le i \le f(n) \right\},$$

where all variables $(X^{(n)}_i)_{i\ge 1, n\ge 1}$ are supposed to follow the law of $X$.

Proposition 18 When $a e^{bn} \le f(n) \le a' e^{b'n}$, for positive $a, a', b, b'$:
(i) If $X$ is light tailed, $\limsup M_n < M$ a.s. for a constant $M$.
If we assume in addition that all the variables $(X^{(n)}_i)_{i\ge 1, n\ge 1}$ are independent,
(ii) if $X$ is heavy tailed, $\limsup M_n = \infty$ a.s.

Proof: (i) follows from a classical application of the Markov inequality:

$$P[M_n \ge x] \le f(n)\, P[X \ge nx] \le f(n)\, P[f(X) \ge f(nx)] \le \frac{f(n)}{f(nx)}\, E[f(X)].$$

Hence

$$\sum_{n\ge 1} P[M_n \ge x] \le \left( E[f(X)]\, \frac{a'}{a} \right) \sum_{n\ge 1} \left( e^{b' - x b} \right)^n.$$

When $x$ is chosen large enough this is a convergent series, proving by Borel-Cantelli that, almost surely, only a finite number of these events occur.

(ii) is an application of the Borel-Cantelli lemma. Let us denote by $F$ (respectively $\bar{F}$) the CDF (resp. CCDF) of the variable $X$. We have for $t > 0$

$$\int_0^A e^{tx}\, dF(x) = e^{tA} F(A) - F(0) - \int_0^A t\, e^{tx} F(x)\, dx.$$

Note that $\int_0^A t\, e^{tx}\, dx = e^{tA} - 1$, and hence the RHS may be rewritten as:

$$\underbrace{e^{tA}\,(F(A) - 1)}_{\le 0} + 1 - F(0) + t \int_0^A e^{tx}\,(1 - F(x))\, dx.$$

By the heavy tail assumption, this sum grows to infinity with $A$, which implies

$$\int_0^\infty e^{tx}\, \bar{F}(x)\, dx = \infty.$$

Using $e^{tx} \le e^{t(\lfloor x \rfloor + 1)}$ and $\bar{F}(x) \le \bar{F}(\lfloor x \rfloor)$, we deduce in particular that $\sum_{n\ge 1} e^{tn} \bar{F}(n) = \infty$ for any $t > 0$, or similarly that for any $x > 0$, $\sum_{n\ge 1} f(n)\, \bar{F}(nx) = \infty$.


Let us fix a positive $x$; we consider the following series:

$$P[M_n \ge x] = 1 - \left(1 - P[X \ge nx]\right)^{f(n)} = 1 - e^{f(n) \ln(1 - \bar{F}(nx))}.$$

Note that the events in the LHS for different $n$ are independent. We can then apply the converse Borel-Cantelli lemma if the sum of this series diverges.

If $\lim_{\infty} \bar{F} > 0$ then the result is obvious; otherwise this limit is zero and we have $\ln(1 - \bar{F}(nx)) \sim -\bar{F}(nx)$, thus $P[M_n \ge x] \sim 1 - e^{-f(n)\bar{F}(nx)}$.

Again, if $\limsup f(n)\bar{F}(nx) \ge \delta > 0$, the result is obvious, as there then exist infinitely many elements in this series that are greater than $1 - e^{-\delta}$. Otherwise, $f(n)\bar{F}(nx)$ converges to zero as $n$ grows, and we have $P[M_n \ge x] \sim f(n)\bar{F}(nx)$. As shown above, the series on the right diverges whenever $X$ is heavy tailed.

Applying the converse Borel-Cantelli lemma, we obtain that, for any $x$, almost surely $M_n$ goes above $x$ infinitely many times. This proves that, almost surely, the superior limit of $M_n$ is above $x$. This superior limit is therefore almost surely infinite.

Bibliography

[1] J.F.C. Kingman. Subadditive ergodic theory. Annals of Probability, 1(6):883-909, 1973.
(Proof of convergence a.s., and in $L^1$, of the empirical mean of a sequence of variables, if this sequence is sub-additive, can be bounded from below, and is stationary with regard to an ergodic shift.)
[2] J. Martin, 2002.
(See reference in Chap. 3.)
[3] A. Müller and D. Stoyan. Comparison Methods for Stochastic Models and Risks. Wiley, 2002.
(A reference textbook on the comparison of random variable distributions, and its consequences for different functionals.)
[4] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math., 81:73-205, 1995.
(This article presents a probabilistic upper bound on the distance between a variable and its mean; one class of variables studied is the maximum of sums of independent variables chosen in a given subset.)


List of Technical Contributions

The work presented in this dissertation was carried out partly in collaboration with François Baccelli, Danny De Vleeschauwer, Zhen Liu, David McDonald, and Anton Riabov.

Chapter 1 extends the Hybrid-AIMD modeling framework, previously introduced by François Baccelli and Dohy Hong, to study the bandwidth sharing of TCP flows on a bottleneck link.

• Under a fixed traffic demand, we establish a closed-form formula describing the distribution of the instantaneous rate of a connection, in the asymptotic mean-field stationary regime.
This work, published in [1], was done in collaboration with Danny De Vleeschauwer. In particular, Sections 2 and 3 of this chapter have been only slightly modified from this article.
• Under a dynamic traffic demand, we exhibit and justify the presence of a turbulent regime, where different equilibria can be reached depending on the initial conditions.
This result, published in [2,3], has been a collaboration between François Baccelli, Danny De Vleeschauwer, David McDonald and myself. Section 4 closely follows the content of this paper. The analysis based on partial differential equations was designed and led by François Baccelli and David McDonald, §4.1 and §4.3 in particular. I contributed to establishing the first result (the rate conservation equation described in §4.2); the original method for numerical estimation was designed by Danny De Vleeschauwer.

[1] A. Chaintreau and D. De Vleeschauwer. A closed form formula for long-lived TCP connections throughput. Perform. Eval., 49(1-4):57-76, 2002.
[2] F. Baccelli, A. Chaintreau, D. De Vleeschauwer, and D. R. McDonald. A mean-field analysis of short lived interacting TCP flows. In Proceedings of SIGMETRICS 2004, pages 343-354. ACM Press, 2004.
[3] F. Baccelli, A. Chaintreau, D. De Vleeschauwer, and D. R. McDonald. HTTP turbulence. Networks and Heterogeneous Media, 1(1):1-40, 2006.


Chapter 2 proposes a general architecture to deploy TCP decentralized control on an overlay network, with an emphasis on providing end-to-end reliable transport while keeping the system congestion-adaptive and scalable.

• The key contribution is to represent this architecture in a class of infinite discrete event systems, which generalizes an infinite tandem of queues. Packet processing times may be described in this setting as last-passage percolation in a new category of random graphs.
• We prove that a positive long-term throughput can be guaranteed from a source to an infinite number of destinations, even if the local memory available in every end-host is finite. This requires the overlay to have a uniformly bounded degree, and the random perturbation created by cross traffic to be light-tailed.
• The same result may be obtained for perturbation with a heavy tail distribution, under a general moment condition: when the out-degree of all end-hosts is one (i.e. they form an infinite chain), or when the available memory is infinite in all end-hosts.
If, in addition, rate control is implemented by the source, we prove that the overlay admits a steady state; in this regime, a law of large numbers is shown for the delays of a packet through infinite sequences of overlay hops.

These results were produced in collaboration with François Baccelli and Zhen Liu; Anton Riabov conducted a validation of these findings on the PlanetLab testbed. They were published successively in [4] (which focuses on the case with infinite local memory) and in [5] (which extends the results to finite local memory). In particular §3, as well as §4.1 and §4.3, closely follow the content of these articles. However, the analytical results presented in §4.2 and §4.4 have been rewritten for this dissertation in the more general analytical framework that was built afterwards.

[4] F. Baccelli, A. Chaintreau, Z. Liu, A. Riabov, and S. Sahu. Scalability of reliable group communication using overlays. In Proceedings of INFOCOM, 2004.
[5] F. Baccelli, A. Chaintreau, Z. Liu, and A. Riabov. The one-to-many TCP overlay: A scalable and reliable multicast architecture. In Proceedings of INFOCOM, 2005. (Also presented, as an invited paper, at the Sixteenth International Symposium on Mathematical Theory of Networks and Systems (MTNS2004).)

Chapter 3 presents a unifying framework, pattern grids, for infinite discrete-event systems characterized by a precedence relation, built over a lattice, and that is invariant in law under any translation.

• We introduce the sharpness condition, usually easy to verify, that characterizes pattern grids that always follow a linear rate of directional last-passage percolation. The key element is to identify a relation between a path size, its scalar product with a reference vector, and properties of dependence cycles between tasks.
• The existence of a stationary regime, where long-range completion times satisfy a law of large numbers, is proved in any sharp pattern grid with dimension 2. The stability condition and the limit growth rate are characterized by a hydrodynamic scaling.
• Results on the rate of directional last-passage percolation are generalized to some pattern invariant graphs, in which the precedence relation may be built not just on a lattice but over any infinite graph.

The content of this chapter is a personal contribution, with help and advice from François Baccelli; it will be submitted shortly for publication.




Summary

"Everything is possible, but not everything is useful."¹ (Paul)

The Internet offers a means of transporting any digital information over links accessible to all. The reproduction of this information in the network formed by these links, and its transport from one hop to the next, both at an extremely low marginal cost, are never controlled as a whole by any authority.

This document studies the technological principles that make this transport possible and that characterize it. We seek to justify its operation at large scale, to understand its theoretical limits, and to identify some areas that remain unexplored. Many of these questions approach, in different ways, one central object: a large population of flows that interact according to a discrete-event process.

1 Principles of Decentralized Networks

End-to-End Control and Scalability

Two devices are connected to a communication network, that is, a unit of information, in the form of a segment of variable size, can be transmitted from one device to the other. Nothing is known a priori about this addressing system: a segment of information can be lost; it can also be delivered once or several times, with relatively large and variable delays; several successive transmissions can be reordered. Nor can one precisely estimate the volume of information that the receiving device, or the links carrying it, can handle at each instant.

How, under these conditions, should the different transmissions be scheduled over time? How can the integrity of the transmitted information be ensured?

For convenience, we propose to abstract the underlying network as a black box whose behavior is only known very roughly. A solution found in this way will then always be applicable.

¹ Πάντα μοι ἔξεστιν, ἀλλ' οὐ πάντα συμφέρει. First Epistle to the Corinthians, 6:12.


Example: Buffer Memory and Acknowledgments

To simplify, we first assume that there is no loss (i.e. a transmission is always received) and no reordering (i.e. transmissions are received in the order in which they are sent).

Each segment of information to be transmitted carries a sequence number; for each transmission received, the destination sends the source an acknowledgment (an acknowledgment packet, often abbreviated "ack"). The following rules can then be used:

• The destination reserves a memory buffer of fixed size. This buffer fills with the information received directly from the network, and empties as the destination processes the received data, at whatever speed suits it. The size of this memory is initially announced to the source; moreover, after each reception by the destination, the memory still available is announced in the corresponding acknowledgment.
• Thanks to the acknowledgment mechanism, the source can estimate the number of bytes in flight, that is, the size of the data sent but not yet acknowledged. At each instant, a segment is sent by the source if and only if:
(A.3) fen_rec ≥ taille_vol + taille_segment
where fen_rec is the receive window, taille_vol the size of the data in flight, and taille_segment the size of the segment. These rules guarantee the destination, at each instant, that no unit of information is sent to it unless it is able to receive it.
• Only one possible deadlock remains: all the information is acknowledged and the buffer is still full. Indeed, when part of this memory becomes available again, there can be no new acknowledgment to announce it to the source and to update the value the source uses in condition (A.3).
This difficulty is fairly easy to work around. In this situation the destination sends an empty acknowledgment, which refreshes the available memory known to the source.

With this last addition, these rules guarantee the source that the sending process never blocks, except in the serious and unlikely case of a complete blockage of the network or of the destination.

The control mechanism described in the box above already allows a few remarks that matter for what follows:

• The control of data traffic (for instance, avoiding overflows at the destination and deadlocks at the source) can be carried out with a small number of additional messages and simple rules, applied end to end.
• These mechanisms do not require establishing, maintaining, or even simply knowing a particular state of the network; they start in medias res, that is, without prerequisites.
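
As an illustration of the receive-window rule (A.3) and of the empty acknowledgment that resolves the deadlock, here is a minimal sketch of the sender-side bookkeeping. The class `Sender` and its method names are hypothetical and introduced only for this example; the field names `fen_rec`, `taille_vol` and `taille_segment` follow the notation of the text, and the byte counts are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    fen_rec: int         # receive window last advertised by the destination (bytes)
    taille_vol: int = 0  # bytes sent but not yet acknowledged ("in flight")

    def can_send(self, taille_segment: int) -> bool:
        # Condition (A.3): the segment must fit within the advertised window.
        return self.fen_rec >= self.taille_vol + taille_segment

    def on_send(self, taille_segment: int) -> None:
        self.taille_vol += taille_segment

    def on_ack(self, acked_bytes: int, advertised_window: int) -> None:
        # An acknowledgment (possibly empty, with acked_bytes == 0) releases
        # in-flight bytes and refreshes the window known to the source.
        self.taille_vol -= acked_bytes
        self.fen_rec = advertised_window


if __name__ == "__main__":
    s = Sender(fen_rec=3000)
    while s.can_send(1000):
        s.on_send(1000)
    print(s.taille_vol)       # bytes in flight once the window is full
    for _ in range(3):
        s.on_ack(1000, 0)     # everything acknowledged, destination buffer still full
    print(s.can_send(1000))   # the source is blocked
    s.on_ack(0, 1000)         # empty acknowledgment announcing free memory
    print(s.can_send(1000))   # the source may send again
```

In this sketch the empty acknowledgment is simply a call to `on_ack` with no acknowledged bytes; without it, the source would never learn that memory has been freed.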


Example: Losses and Reordering

The fate of a transmitted segment of information depends on the current state of the network; it can sometimes be lost, or duplicated. The same principles as before make it possible to ensure the reliability of the communication, end to end.

• The acknowledgment procedure makes it possible to schedule a retransmission, sent by the source if it does not receive the acknowledgment of a segment before a certain timer expires. This timer can be tuned effectively using the history of previous observations.
This ensures that the destination receives every transmission, or one of its retransmissions.
• With another method, the destination indicates in each acknowledgment not the segment received but the largest sequence number m such that all preceding segments (with number n ≤ m) have been received. Without any loss, this amounts to the same thing; in the case of a loss, or of reordering, the source receives several successive acknowledgments carrying the same number. To avoid reacting to every reordering irregularity, the source retransmits all unacknowledged transmissions after having received 3 duplicates.

In this type of architecture, it is common for the solutions chosen for a priori distinct functions (ensuring the retransmission of lost segments, regulating the volume of information to send) to intersect in common rules. This adds to a latent ambiguity, as several event processes (sending/reception of a segment/acknowledgment) react, often by anticipation, without being immediately known to one another.

Thus, while these rules are easy to put into practice, analyzing how they operate under various conditions can prove a difficult task. The properties they guarantee are therefore particularly valuable.

Avoiding Congestion and Starvation

Buffer memory, cumulative acknowledgments, retransmissions: the mechanisms we have just presented describe fairly well the regulation of Internet traffic during the 1980s. It was at the end of that decade that a phenomenon of congestion in the core of the network appeared, which changed how this traffic is controlled.

The core of a computer network consists of routers connected by links. These routers are able to forward a segment toward its next hop, according to its destination address. They typically use substantial computing power and a large buffer memory to hold the information to be processed.

With the accelerated deployment of the Internet, and the concentration of traffic demand on a few central links, the network began to experience congestion at certain periods of the day. As with the recurring traffic jams on urban motorways, this accumulation of traffic, beyond the processing speed of routers, considerably reduces the efficiency with which it flows. A significant share of the transmitted segments is quickly lost in memory overflows, which triggers their retransmission. Yet this happens at the very moment when the need to use the network is greatest.

This phenomenon does not handicap all flows in the same way: a source close to the congested router always gets through; others, further away, are no longer able to use it. Responding to congestion is therefore a problem of fairness: avoiding the starvation of some data flows on links under heavy demand.

This raises a question very close to those we have already encountered. As before, one can consider answering it with a control performed at the source. The main difficulty, however, is that (1) the most precious resource (the occupancy of buffer memories in routers) is not known explicitly, and above all (2) it is now shared among all flows.

In other words, we have just revealed a new bottleneck, which has become predominant, and which naturally defines another rate to follow. This rate must no longer be set by the edges of the network (the buffers of the destinations) but by its core (the buffers of the routers). The rules we have already described are therefore adapted as follows:

• A new bound on the maximum amount of unacknowledged data is introduced. The source is not allowed to send more than one congestion window. This adds to the previous constraint of the receive window. Thus, the source is allowed to send a segment only if
(A.4) min(fen_rec, fen_cong) ≥ taille_vol + taille_segment.
• The value of this window is adaptive. It evolves according to implicit signals of the state of the network, namely the successes or failures of successive segment transmissions.

How to carry out this adaptation process efficiently with so little information exchanged remains a difficult problem. One answer, proposed in 1988 by Van Jacobson and still in use today, is to rely on an algorithm with additive increase and multiplicative decrease.

Example: Adaptive Window

Slow Start (in French, "Démarrage Ralenti"):


• The congestion window is initialized with a small value (on the order of a few taille_segment); it is then increased by taille_segment for each acknowledgment the source receives.
Under these conditions, the arrival of an acknowledgment triggers the departure of two segments: the first because the number of segments in flight decreases by one, and the second because fen_cong increases by the size of one segment. The congestion window therefore grows quite fast over time, roughly exponentially.
• Segment losses are indicated by a timer, as before. Since in general they are not caused by a link error, they indicate that the volume of data currently in flight cannot be handled simultaneously. This is a state of imminent congestion: to prevent it from degenerating, the congestion window is reset to a small value after each of these losses.

This ensures that each flow is able to transmit a certain volume of data, and that each reacts in time to avoid congestion. However, the time spent near an equilibrium regime is very short, since the amount of data sent roughly doubles between two successive flights. Moreover, the reaction of this mechanism to each segment loss may seem excessive. Another algorithm is therefore proposed, which makes the window size evolve less abruptly.

Congestion Avoidance:

• Each acknowledgment received increases the window by a value "1/W", where W denotes the size of this window, expressed in number of segments. Thus, the congestion window increases by one segment only once all the acknowledgments of a flight have been received. It therefore grows linearly with time.
• In this case one must react differently to the onset of congestion, indicated by a segment loss (timer expired without an acknowledgment received, or a gap in the sequence of acknowledgments). It seems important that the adaptation of the window be asymmetric: the decrease of the window must be the stronger of the two, to compensate for the preceding data and possible retransmissions, and to guarantee stability. This algorithm uses a multiplicative decrease (dividing the window size by a constant strictly greater than one, by default 2, for each loss event).

The TCP protocol (Transport Control Protocol) is a combination of these two algorithms (Slow Start and Congestion Avoidance); comparing the current window size with a value (which we denote seuil, the threshold) decides how the window increases when an acknowledgment is received. This makes it possible both to reach an equilibrium quickly with Slow Start and to stay there with Congestion Avoidance.

On reception of an acknowledgment:
  fen_cong := fen_cong + taille_segment, if fen_cong < seuil
  fen_cong := fen_cong + taille_segment × taille_segment / fen_cong, if fen_cong ≥ seuil.

In a first version, TCP Tahoe, the multiplicative decrease is applied for each segment loss (indicated by a timer), not to the window, which is always reset, but to the value of the threshold. In a later version, TCP Reno, which is today the most common, the window itself is halved after a small number of duplicates. When a loss is indicated by a timer, the same rule as in TCP Tahoe applies.

TCP Tahoe: when a timer expires without acknowledgment,
  seuil := max(taille_vol / 2, 2) and fen_cong := 1.
TCP Reno: at the third duplicate received in a row,
  reset the timer
  seuil := max(taille_vol / 2, 2)
  fen_cong := seuil.

Figure A.1: Multiplicative decrease, for two classical versions of TCP.

The congestion control of TCP Reno, which regulates the majority of traffic, today prevents most flow starvation. It is also exemplary: it is the answer to a global problem (allocating segment sending rates to everyone) by a fairly flexible mechanism (a few rules established end to end), entirely distributed. This protocol, established before the 1990s, has been modified very little, while the network itself has changed scale several times, in size and in speed.

Subject and Method

As we have just seen through a few striking examples, the efficient operation of a decentralized communication network can be obtained with little coordination. It is thus possible to:

• design the different functions of a data flow separately;
• set the behavior of each flow without knowing the others.

It is only in this way, end to end, that different forms of communication can be adopted and improved gradually, by integrating with existing traffic.

One cannot ignore that the flows of a network never cease to interact. Each follows the various rules of a protocol, through discrete-event processes of which some parts are shared. It is the properties of this form of interaction (How can its performance be known? Its limits? How can its scalability be characterized?) that we propose to study here.
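
The rules that each flow applies on its own, without knowing the others, are the window updates summarized in this section: growth on each acknowledgment (Slow Start below the threshold, Congestion Avoidance above it) and the Tahoe and Reno decreases of Figure A.1. A minimal sketch follows, counting everything in segments (so taille_segment = 1). The class `CongestionWindow`, its method names, and the initial threshold of 64 segments are assumptions made for this illustration; only the update formulas come from the text.

```python
from dataclasses import dataclass


@dataclass
class CongestionWindow:
    # All quantities are counted in segments (taille_segment = 1).
    fen_cong: float = 1.0  # congestion window
    seuil: float = 64.0    # threshold; this initial value is an assumption

    def on_ack(self) -> None:
        if self.fen_cong < self.seuil:
            # Slow Start: one more segment per acknowledgment.
            self.fen_cong += 1.0
        else:
            # Congestion Avoidance: 1/W per acknowledgment.
            self.fen_cong += 1.0 / self.fen_cong

    def on_timeout(self, taille_vol: float) -> None:
        # TCP Tahoe rule (also applied by TCP Reno when a timer expires).
        self.seuil = max(taille_vol / 2.0, 2.0)
        self.fen_cong = 1.0

    def on_third_duplicate_ack(self, taille_vol: float) -> None:
        # TCP Reno rule: the window itself is set to the halved value.
        self.seuil = max(taille_vol / 2.0, 2.0)
        self.fen_cong = self.seuil


if __name__ == "__main__":
    w = CongestionWindow(fen_cong=1.0, seuil=4.0)
    for _ in range(4):
        w.on_ack()
        print(w.fen_cong)
    w.on_third_duplicate_ack(taille_vol=4.0)
    print(w.seuil, w.fen_cong)
```

Retransmission timers, duplicate counting and the receive-window constraint (A.4) are deliberately left out of this sketch.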


It may be useful to describe our methodology, for the sake of clarity, even though it is a classical framework.

Probability theory makes it possible to consider events whose occurrence is not necessarily certain, but whose likelihood can be measured (either by symmetry or by estimation). It allows one to study efficiently the behavior of large systems under broad conditions of use, which represent in a sense a typical randomness that can often be justified by experience. Among the random quantities that appear in our study are the occurrence of a segment loss during a memory overflow, the volumes of traffic successively received by a router, and the size of a document to download.

To conclude, we will rely on two results, or rather two broad guidelines, from probability theory:

1 Probabilistic systems of very large size, or observed over long periods of time, approach in a rigorous sense a deterministic limit (that is, one that is the same in all realizations).

Thus, by the law of large numbers, the average over a large number of unrelated trials comes arbitrarily close to a certain limit (that is, a fixed value). We will use different variants of this result (Kingman's sub-additive ergodic theorem, the existence of a mean-field limit).

2 This deterministic limit can be characterized (by showing that it is a finite, or positive, constant, or by establishing bounds) without knowing in detail the distribution of the typical randomness. To prove it, it is almost always necessary to bound the variations of this randomness (that is, the likelihood of exceptional events, such as the realization of very large values).

Thus, in a queue, a stable regime with finite load exists for all realizations if the demand is not excessive and if a certain sum over the probability of very large service times is finite. More generally, one can deduce bounds valid in all realizations, for monotone functions of the randomness, by stochastic comparison.

The originality of our results is the following: we study infinite-time limits of systems that are themselves infinite in space, but whose elements interact according to regular laws. We will also describe a system that does not satisfy assertion 1 above, since its limit exists but depends on the initial conditions. Finally, as far as possible, we establish our results for randomness with heavy variation.


viii Synthèse2 Partage de la Bande PassanteNous avons vu que le ontrle de ongestion ee tué par TCP est responsablede fa to du partage des ressour es de ommuni ation d'un réseau (la bandepassante des liens, les mémoires tampons des routeurs). Elles sont en eetallouées impli itement par le omportement de haque ot en ompétitionsur un lien. Ce omportement individuel est xé par le standard ; 'est luiqui garantit que le ot ne rée pas d'engorgement massif dans ses propresliens, e qui qui lui serait préjudi iable.C'est e résultat impli ite des proto oles que nous étudions de près dans ette se tion, en nous on entrant sur quelques as simples.Dans e domaine, on peut dire que la pratique a pré édé la théorie, enlui réservant quelques surprises de taille. On onnaît depuis environ unedé ennie des modèles détaillés du proto ole de TCP, ils analysent le plussouvent un ot unique, en isolation dans un environnement donné. Dansd'autres modèles plus omplets, il est supposé que les proto oles atteignentrapidement un équilibre déni omme un optimal global. Cette hypothèse,qui peut paraître surprenante, a reçu plusieurs justi ations théoriques etempiriques ; elle est devenue en pratique d'une grande utilité.Nous étendons dans ette se tion un modèle d'intera tion proposé ini-tialement par Ba elli et Hong. Le su ès de e modèle est d'être à la foissusamment simple pour orir des formules en forme lose, et susamment omplet pour être omparé à plusieurs des approximations pré édentes, età des propriétés dynamiques remarquables repérées par d'autres travaux demesure.Le Modèle "Hybrid AIMD"C'est un modèle uide, où nous supposons que haque ot (noté ave l'indexn) reçoit un débit instantané, fon tion du temps, noté X(n)(t). L'évolutionde e débit suit le omportement de l'algorithme Congestion Avoidan e,présenté à la se tion pré édente. 
Chaque ot augmente son débit linéaire-ment ; omme la apa ité de mémoire de haque lien est nie, une époque de ongestion a lieu sur e lien peu après que la somme de es débits a dépassésa apa ité. A e moment, un ou plusieurs ots peuvent être vi times de laperte d'un segment, qui entraîne la division de leurs débits par deux.Nous supposons que les ots ont le même délai de bout en bout R, etpartagent tous un même lien de goulot d'étranglement, qui ara térise ledébit qu'ils obtiennent. Pour simplier, nous traitons d'abord le as oùle nombre de ots N est onstant, et où la mémoire tampon du lien estnégligeable.Introduisons Ti l'époque de ongestion i, et X

(n)i = X(n)(Ti+) le débitinstantané du ot n mesuré immédiatement après et événement de onges-tion. Nous avons l'évolution suivante (illustrée par la Figure A.2):

Page 253: baccelli/Evaluation/AugustinChaintreauPhD.pdf

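[Figure A.2: Evolution of the instantaneous rate of a flow during an activity phase. Horizontal axis: time t, with congestion epochs $T_{i-3}, T_{i-2}, T_{i-1}, T_i$; vertical axis: instantaneous rate $X^{(n)}(t)$. In the example shown, $\gamma^{(n)}_{i-3} = \gamma^{(n)}_{i-2} = \gamma^{(n)}_i = 1/2$ and $\gamma^{(n)}_{i-1} = 1$.]

$$X^{(n)}_i = \gamma^{(n)}_i \left( X^{(n)}_{i-1} + q_i \right), \quad \text{where } q_i = \frac{1}{R^2}\,(T_i - T_{i-1}), \tag{A.5}$$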

• $\gamma^{(n)}_i$ is a random variable equal to 1, unless flow n suffers a loss at time $T_i$, in which case it equals 1/2. We assume that the latter occurs with probability p, independently for each flow and each epoch.
• $q_i$ is the rate increase since the last congestion epoch. It is the same for every flow, since all flows have the same end-to-end delay.

The interaction between flows is carried by the values of the increment $q_i$, which depends on the rates of all flows. As the number of flows grows, for a capacity normalized per flow, $q_i$ converges to a constant limit equal to $\frac{C}{N}\frac{p}{2}$, where C is the processing capacity of the link. Each flow then follows a mean-field regime, in which it interacts with the others through a deterministic average. This equilibrium is at once more complex (since it involves the compensation of an infinite number of flows) and easier to characterize (since the flows evolve independently, with the same law).

Static Traffic Demand

This is the case described so far, where the number of active flows on the link is constant. The stationary law of the instantaneous rate in the mean-field regime can be characterized explicitly:

$$X^{(\infty)} = q + qU + q \sum_{k \ge 0} \frac{A_k}{2^k}, \quad \text{with } q = \frac{C}{N}\frac{p}{2}. \tag{A.6}$$

where

- U is uniformly distributed on [0, 1],
- each $A_k$ is geometrically distributed starting from 0, with ratio (1 − p),
- all the variables above are independent.

This law can be interpreted as the outcome of a geometric random walk in which the size of the increment may be halved at each step. As Figure A.3 shows, its right-hand term (the rate measured after congestion) has remarkable fractal properties.

[Figure A.3: Description of an infinite geometric walk (left; for the sequence $(\gamma^{(n)}_i, \gamma^{(n)}_{i-1}, \ldots) = (1, 1, \tfrac12, \tfrac12, 1, 1, \tfrac12, 1, \ldots)$ the walk takes 2 steps of size q, 1 step of size q/2, 3 steps of size q/4, 2 steps of size q/8, etc.). Distribution of the instantaneous rate after congestion for different values of p (right; simulation of 100,000 steps; horizontal axis: normalized threshold (1 = c), from 0 to 2; vertical axis: probability [rate > threshold]; curves for synchronization rates 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.99 and 1).]

The image of the law (A.6) under the Laplace transform can be expressed as an infinite product. It can be extended beyond its domain by an analytic function with infinitely many singularities, regularly spaced as in Figure A.4.

The Stieltjes integral is a natural tool here, since it handles discrete laws and laws with continuous density within the same framework. One such computation yields the following formula, which can be used directly to describe the instantaneous rate of a flow:

$$P\left(X^{(\infty)} \ge q + qx\right) = \sum_{k \ge 0} \alpha_k(x)\, \varphi^{(+\infty)}(2^k x), \tag{A.7}$$

where $\varphi^{(+\infty)}$ is the Fourier series $\varphi^{(+\infty)}(x) = \sum_{l \in \mathbb{Z}} \beta^{(+\infty)}_l e^{2i\pi l x}$, and

$$\alpha_k(x) = \frac{p\,(1-p)^{2^k x}\left(\left(\frac{1}{1-p}\right)^{2^k} - 1\right)}{2^k} \prod_{m=1}^{k} a_m, \quad \text{with } a_m = \frac{p}{1 - (1-p)^{1-2^m}},$$

$$\beta^{(+\infty)}_l = \frac{1}{\left(\ln(1-p) + 2i\pi l\right)^2} \prod_{m \ge 1} b_{m,l}, \quad \text{with } b_{m,l} = \frac{p}{1 - e^{-i\frac{2\pi}{2^m} l}\,(1-p)^{1 - \frac{1}{2^m}}}.$$

For guaranteeing the rate of a TCP flow, this formula plays the role that Erlang's formula plays in dimensioning telephone networks.
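The recursion (A.5) in the mean-field regime, where $q_i$ is replaced by the constant q, is easy to check numerically. The Python sketch below is illustrative only: the function names, the parameter values and the truncation of the series at 60 terms are assumptions made for this example, not part of the original model description. It iterates $X_i = \gamma_i (X_{i-1} + q)$ and, separately, samples the post-congestion part of (A.6), read here as $q + q \sum_k A_k / 2^k$ (that is, (A.6) without the uniform term qU). Both should have mean $q(2/p - 1)$, which follows from taking expectations in the recursion.

```python
import random


def simulate_post_congestion_mean(p, q, steps, seed=0):
    """Iterate X_i = gamma_i * (X_{i-1} + q) and return the average of the X_i."""
    rng = random.Random(seed)
    x = q  # q is the fixed point reached when every epoch is a loss
    total = 0.0
    for _ in range(steps):
        gamma = 0.5 if rng.random() < p else 1.0  # halve the rate with probability p
        x = gamma * (x + q)
        total += x
    return total / steps


def sample_series(p, q, rng, terms=60):
    """Draw q + q * sum_k A_k / 2^k, truncated after `terms` terms (requires 0 < p <= 1)."""
    value = q
    for k in range(terms):
        a_k = 0
        while rng.random() >= p:  # count loss-free epochs before the next loss
            a_k += 1
        value += q * a_k / 2 ** k
    return value


def mean_field_mean(p, q):
    """Mean post-congestion rate implied by the recursion: q * (2/p - 1)."""
    return q * (2.0 / p - 1.0)


if __name__ == "__main__":
    p, q = 0.5, 1.0  # hypothetical values chosen for the illustration
    rng = random.Random(1)
    series_mean = sum(sample_series(p, q, rng) for _ in range(20000)) / 20000
    print(simulate_post_congestion_mean(p, q, 200000), series_mean, mean_field_mean(p, q))
```

The two empirical means printed by the script should be close to each other and to the closed-form value, which is a quick consistency check between the recursion and the series representation under the assumptions stated above.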

[Figure A.4: Domain of definition of the law after Laplace transform (complex plane with axes σ and τ; singularities marked, with $\sigma_c = \ln(1 - p)$).]

Dynamic Traffic Demand

The data flows carried over the Internet form an extremely heterogeneous and dynamic demand. Web traffic, characterized by series of intermittent flows, accounts in particular for a large share of the utilization of some links, which cannot easily be understood with models of fixed flows.

We study a textbook case: a constant number of traffic sources alternate between download periods and idle periods, both assumed to follow exponential laws. This assumption clearly does not hold in reality, but we will see that our results do not depend on it. The practical novelty of this model is that it studies how the interaction of TCP flows described above is perturbed when traffic demand is intermittent.

Once again, we study a mean-field limit as the number of sources becomes large, for normalized parameters. At least two limit regimes can exist: one in which congestion epochs disappear entirely (hence without interaction), and another in which congestion events recur periodically. We call the latter regime turbulent, by analogy with fluid mechanics.

What is new is that the regions where these two regimes can exist are not disjoint: for some load parameters and demand profiles, it seems a priori possible to establish either regime in an infinite system, depending only on the initial conditions. To prove it, we construct initial conditions that lead to one regime or the other. One of them, for the turbulent regime, is expressed as a fixed-point equation. This equation cannot be solved directly, but several numerical methods indicate that a solution exists. Finally, for the TCP Tahoe version, which is a simplification of the model presented above, the turbulent regime is obtained from much simpler initial conditions.

Starting from typical initial conditions, as long as the congestion-free regime is the only possible one, the performance of the link is fairly close to what a Processor Sharing discipline would achieve. If the demand profile is varied, for example by increasing the mean size of the requested files, the turbulent regime becomes possible beyond a first threshold, and then remains the only feasible one beyond a second threshold. In practice, as illustrated numerically in Figure A.5, this results in a phase transition where the total throughput decreases as the demand increases. For saturated links, we recover the asymptotic results already encountered in the case of persistent flows.

[Figure A.5: Mean throughput estimated by the Processor Sharing model and by the Hybrid AIMD model.]

These properties of the mean-field limit regimes translate, for finite populations, into bi-stability phenomena. Several quasi-stationary regimes can be sustained by the system, and one often observes a kind of oscillation from one to the other, as illustrated by Figure A.6.

[Figure A.6: Bi-stability: time evolution of the total throughput of 1000 TCP Tahoe flows on a link of capacity C = 282 (segments/s) with p = 0.8. Horizontal axis from 800 to 1200; vertical axis from 0 to 300000.]

3 Multicast over Peer-to-Peer Networks

Can the same information content be disseminated efficiently, over a decentralized network, to a large number of recipients?

We argue in this section that such dissemination, deployed among peers, can be maintained in a decentralized and fair manner for groups of any size. It relies on another form of interaction between end-to-end controlled flows, organized in series or along the branches of a tree. It scales under a simple condition, the sharpness criterion, which is satisfied by an immediate extension of TCP.

A Deployment Problem

IP multicast routing, designed in the early 1990s, allows elements in the core of a network to duplicate a received segment and forward it to several links, as along the branches of a tree. Ideally, it makes it possible to disseminate data to an arbitrary set of destinations. Despite partial deployment, it is still not known how to integrate it with the rest of the traffic in a reliable and fair way. Indeed, it conflicts with the conditions under which all the functions of upper-layer protocols (acknowledgments, reliability, rate control, congestion control) are established, for several reasons:

• These transport functions must now operate over a tree rather than over a route.
• As before, they must be organized end to end, which here means between a single source and a possibly large set of destinations. The core of the tree, that is, the routers where branches meet, cannot take part directly in these tasks.

Although no new data transport standard has emerged, prior work has produced a few useful recommendations.
Multipoint data transport requires support deployed among the participants of a group or inside the network, otherwise the source is quickly saturated; the functions that characterize it may vary according to the needs of the application; finally, congestion control seems achievable only in two particular cases: for unreliable communications, or when the number of destinations is small.

Transport in Peer-to-Peer Networks

A peer-to-peer network is defined as a structure built on top of a common physical network. The nodes of this peer-to-peer network are simply a subset of the end hosts of the original network; a peer-to-peer link, which we call a hop ("bond"), is in fact the abstract representation of a path connecting two end hosts in the original network. As illustrated in Figure A.7, the path from B to D contains three links in the physical topology of the network, but it is represented by a single hop in the abstract topology.

[Figure A.7: Peer-to-peer network: physical topology (left) and abstract topology (right). Nodes A to G; hops Bond A→B, A→G, B→C, B→D, G→E and G→F.]

It is remarkably easy to organize a very large number of end hosts of a network into a peer-to-peer structure, as the success of the first file-sharing networks has shown. A theoretical justification for the emergence of these large distribution structures has been known since 1997; it was first discovered by Plaxton for the organization of large distributed databases.

For some years now this technique has been proposed as an alternative support for classical multicast. Two data transport settings are generally considered:

• The unreliable case (for example real-time video streaming), which can typically use any transport protocol.
• The reliable case, where the received data are kept in full by all peers. TCP must then be used on each hop of the peer-to-peer network to ensure reliable communication between two successive nodes.

We consider here a slightly more general case, in which it is not necessary to assume that each node keeps all the transmitted data. On the contrary, we assume that the memory each node has available for this communication is finite. This data transport adapts, as TCP does locally, to ensure that no buffer overflows anywhere in the tree. The protocol thus implicitly guarantees that the rate used by the source is truly the one the tree can sustain, without data accumulating at any point of the network.

The functions used are, with very few modifications, those of standard TCP; we have represented them in several steps, as the transitions of a Petri net, in Figure A.8.

[Figure A.8: Description of a station with index (k, l) = (1, 0). The Petri net shows the input buffer $B_{ENT}^{(1,0)}$, the two output buffers $B_{SOR}^{(1,0),(2,0)}$ and $B_{SOR}^{(1,0),(2,1)}$, a backup memory, the window sequences $(W^{(2,0)}_m)_{m \ge 1}$ and $(W^{(2,1)}_m)_{m \ge 1}$, and transitions for receiving, acknowledging and transmitting segments and for advertising the available buffer memory.]

Scalability

The protocol we have introduced can be described as a coupling in series, along the branches of a tree, of several TCP control loops. It inherits many of their properties, among them a congestion control suited to its immediate integration with the rest of Internet traffic. Two questions remain, however. Does the throughput decrease to zero as the group grows, given that the accumulation of TCP loops builds a closed-loop system over an increasing number of elements? And, if the local buffers are large, can one describe how they fill at equilibrium and how fast information flows through the tree over long distances?

The operation of this protocol is easily described by a last-passage percolation model in a random graph. This graph has two dimensions, given by the index m of the segment number and by an index describing the nodes of the peer-to-peer network. Each vertex of this graph represents the transport of a segment to a buffer (of a router or of a peer-to-peer node) and its sojourn there.

We have represented this graph in Figure A.9 for a peer-to-peer network with three elements in line (a source, a relay, a destination). A more compact representation of this system of uniform recurrences, through its dependence graph, is shown in Figure A.10. That figure represents different steps in the transport of a segment; they depend on other steps (possibly for other segments and/or for other hops of the network). As we show in the next section, it suffices to check that the vectors associated with the cycles of the dependence graph satisfy the sharpness criterion. An elementary computation on integers confirms this. A propagation speed of the segments through the network is then guaranteed, even in a group of unbounded size.

In general this requires that the inter-service time in the routers for the segments concerned (that is, the volume of cross traffic) admit an exponential moment. Less restrictive conditions may suffice for some topologies.

With infinite local memory, the evolution of each branch of the tree can be studied separately, which simplifies matters considerably. Two regimes are possible, depending on the output rate of the source: a divergent regime, where information accumulates in the peer-to-peer network, or on the contrary a stable regime, where the load of each queue converges. One can show that the second case holds as long as the output rate of the source is below a threshold value. Unfortunately this threshold is not easy to express in general, but in several examples it coincides with the minimum local throughput of a hop of the peer-to-peer network. We believe this is in fact always the case; for now it is only a conjecture.

In this stable regime, the delay in the peer-to-peer network and the volume of data present in the buffers both satisfy a law of large numbers. Their deterministic limit over long distances is expressed as the Legendre transform of a hydrodynamic function.
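For a single branch with infinite local memory, a much-simplified version of the last-passage percolation picture can be sketched in a few lines of Python. The code below is an illustration under strong assumptions, not the protocol's actual graph: it ignores TCP windows, acknowledgments and back-pressure, and keeps only the two dependencies of a tandem of first-come first-served queues (segment m at node k waits for segment m − 1 at node k and for segment m at node k − 1). The function name, the exponential weights and the sizes used in the demonstration are hypothetical.

```python
import random


def last_passage_times(weights):
    """Return times[m][k], the last-passage (completion) time of task (m, k).

    weights[m][k] is the duration attached to segment m at node k.
    Task (m, k) starts once tasks (m - 1, k) and (m, k - 1) are both complete.
    """
    n_segments = len(weights)
    n_nodes = len(weights[0])
    times = [[0.0] * n_nodes for _ in range(n_segments)]
    for m in range(n_segments):
        for k in range(n_nodes):
            prev_segment = times[m - 1][k] if m > 0 else 0.0
            prev_node = times[m][k - 1] if k > 0 else 0.0
            # The task starts after its slowest predecessor, then lasts weights[m][k].
            times[m][k] = max(prev_segment, prev_node) + weights[m][k]
    return times


if __name__ == "__main__":
    rng = random.Random(0)
    n_nodes = 20  # hypothetical length of the branch
    for n_segments in (100, 200, 400):
        w = [[rng.expovariate(1.0) for _ in range(n_nodes)] for _ in range(n_segments)]
        t = last_passage_times(w)
        # Completion time of the last segment at the last node, per segment sent.
        print(n_segments, t[-1][-1] / n_segments)
```

Printing the ratio for growing numbers of segments gives a rough feel for the linear growth of last-passage times discussed in the next section; it is not a substitute for the limit theorems stated there.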


[Figure A.9: Random graph corresponding to two TCP flows in tandem, with back-pressure and reordering.]

4 Last-Passage Percolation in Pattern Grids

We build here a calculus model to study the behavior of very large distributed systems. It can be described as follows: an infinite set of tasks to be completed is organized along several dimensions, like the elements of a grid. Their evolution follows a predecessor relation that is invariant under translations of the grid. Thus a task can start as soon as a set of neighboring tasks has been completed; the time needed to carry it out is assumed independent of that of the other tasks.

This framework covers a wide class of discrete event systems, such as cooperative distributed end-to-end protocols. It can be regarded as a probabilistic counterpart of the Uniform Recurrence Equations introduced in the complexity theory of parallel computation. The long-range behavior of these systems and their stability properties are characterized by a sharpness criterion and by hydrodynamic limits.

[Figure A.10: Dependence graph of a one-to-many TCP peer-to-peer network. Vertices deb, 1, 2, 3, 1′, 2′, 3′ and fin; arc labels include (W, 0), (−W, 0), (0, −1), (−B_SOR, +1), (−B_ENT, 0), (1, 0) and (−1, 0). Null vectors are omitted.]

Sharpness Criterion and Directional Last-Passage Percolation

The predecessor relation in a pattern grid can be represented by a dependence graph, which contains a finite number of vertices and arcs, each arc being labeled with a vector. The grid is said to be sharp if it satisfies the following condition:

Criterion 1 There exists a vector s in $\mathbb{N}^d$ (called a sharp vector) such that $\langle r, s \rangle \le -1$ for every vector r associated with an elementary cycle of the dependence graph.

Note that this condition implies that the dependence graph has no cycle with null vector, and hence that the predecessor relation is well defined (no task is its own predecessor). The sharpness criterion is not, however, a necessary condition for this well-definedness, as the following example shows:

Example: a non-sharp pattern grid

Consider the following dependence graph, whose arcs carry the labels (0, −1), (0, 0), (−1, 1) and (1, −1).

• The associated pattern grid is not sharp, since this graph contains two opposite vectors.
• One can easily check by hand that it contains no cycle with null vector, since a linear combination satisfying

$$a\,(1,-1) + b\,(-1,1) + c\,(0,0) + d\,(0,-1) = 0$$

necessarily satisfies a = b and hence d = 0. Once the arc labeled (0, −1) is removed, the graph is no longer strongly connected, so no cycle can contain two distinct vertices. The conclusion follows, since no cycle defined on a single vertex can have a null vector.

A geometric illustration of the sharpness criterion is drawn in Figure A.11. One can even show that it describes all possible situations in dimension 2; this follows from the next result:

Lemma 1 Let a family of vectors in $\mathbb{Z}^2$ admit no sharp vector. Then at least one of the following holds:

(i) It contains the null vector.
(ii) It contains two opposite vectors.
(iii) It contains a generating triple (e, f, g) such that $\langle e, f \rangle < 0$, $\langle e, g \rangle < 0$, $\langle e^{\perp}, f \rangle > 0$ and $\langle e^{\perp}, g \rangle < 0$, where $e^{\perp} \in \mathbb{Z}^2$ satisfies $\langle e, e^{\perp} \rangle = 0$.

Sharp pattern grids can therefore be called oriented grids, since they are those where the vectors associated with dependence cycles fit inside a strictly acute cone. The value of this criterion is that it allows one to prove, by a very elementary argument, that the length of a path is always bounded by a linear function of its displacement in the grid. Moreover, the combinatorial properties of paths in the pattern grid can be bounded by those of lattice animals (that is, the sets of the grid connected by adjacency).

[Figure A.11: Geometric illustration of the sharpness criterion: (a) and (b) are families that admit a sharp vector, (c) and (d) do not (cases (ii) and (iii) of the lemma, respectively).]

The following point can be deduced: for a fixed direction of space, the sequence of last-passage percolation times to the origin from a point on this axis grows at most linearly. This assumes that the laws of the task completion times are dominated by s, which satisfies

$$\int_0^{+\infty} P(s \ge u)^{1/d}\, du < \infty.$$

By the subadditive ergodic theorem, this is enough to prove that the growth rate converges to a finite constant limit.

The sharpness criterion also appears to be a necessary condition: it is so at least for an irreducible grid (i.e. one whose dependence graph is strongly connected). One can indeed show that if a pattern grid is not sharp, the limit of this last-passage percolation sequence is not linear in some directions.

Example: last-passage percolation in non-sharp grids

In the next figure we have represented, on the left, the predecessor relation for the dependence graph described above.

[Figure: (a) predecessor relation of the non-sharp example; (b) a path in this grid.]

As the right-hand figure shows, one can follow a path from any vertex by first going in the upper-left direction (staying on black vertices only). One passes through a white vertex just before reaching the vertical axis. One then goes back down in the lower-right direction, along the diagonal just below, through white vertices only, and starts again.

One can show that, starting from coordinates (m, k), the length of this path is of order $(m + k)^2$. The last-passage percolation sequence in a given direction therefore does not grow linearly. Far from being an exception, this phenomenon is in fact characteristic of all pattern grids that admit no sharp vector (at least when they are irreducible).

Stability of Infinite Discrete Event Systems

From now on we consider only sharp pattern grids of dimension 2 that are totally ordered. That is, we assume that a task always follows its counterparts whose indices are all smaller or equal. This property is generally satisfied by systems of objects using the first-come first-served discipline. Under these conditions, the directional last-passage percolation limits are described by two hydrodynamic limit functions, non-decreasing and concave:

$$\gamma_1, \gamma_2 : \mathbb{R} \to \mathbb{R} \cup \{-\infty\}.$$

These hydrodynamic limits make it possible to study the stability of infinite discrete event systems under constraint, that is, here, pattern grids defined with, as a boundary condition, an authorization date for each task on an axis, as drawn on the right of Figure A.12. We denote by λ the intensity of the authorization process of the tasks on the axis.

[Figure A.12: Quarter plane (a) and extended quarter (b) along an axis, for dimension d = 3; axes a1, a2, a3.]

• If $\lambda < \mathrm{Thres} = \frac{1}{\gamma_2(0^+)}$, a stable regime can be constructed for the whole system. In this regime, the completion dates of the tasks satisfy a strong law of large numbers as one moves away from the axis. Note that for this last result, which could be called large-scale stability, the condition is also necessary.
• If $\lambda > \mathrm{Velo1} = \frac{1}{\gamma_2(0)}$, no stable regime exists.

We do not currently know of a general proof that these two thresholds are equal. Such a proof would characterize large-scale stability by a comparison between λ and $\gamma_2(0)$, that is, with the saturated throughput of the system, as predicted by the saturation rule that applies to finite systems. This conjecture is proved in some cases (including the infinite tandem of queues) and for all closed-loop systems that satisfy a solidarity property.

Pattern Graphs

Given a graph G = (V, E), a map η : V → V is called an invariant transformation of G if, for all u, v in V,

$$\big(\eta(v) = \eta(u) \iff v = u\big) \quad\text{and}\quad \big((u,v) \in E \iff (\eta(u), \eta(v)) \in E\big).$$

Example: graphs and invariant transformations

• The diagonal grid is the graph

$$V = \mathbb{N}^2 \quad\text{and}\quad E = \{\, ((i,j) \to (i+1,j)),\ ((i,j) \to (i,j+1)) \mid i \ge 0,\ j \ge 0 \,\}.$$

Since the two dimensions are exchangeable, this graph is best drawn after a rotation by π/4.

[Drawing of the rotated grid, with axes i and j.]

Its invariant transformations are characterized by the image of the vertex (0, 0), which can be any vertex (i, j), and by the images of (1, 0) and (0, 1), which are chosen distinct in {(i + 1, j), (i, j + 1)} but otherwise arbitrarily. In particular, the two directions of the grid can be exchanged.

• The infinite oriented regular tree is defined for each degree D ≥ 1 by

$$V = \{1, 2, \ldots, D\}^{(\mathbb{N})} \quad\text{and}\quad E = \{\, (a \to a\,i) \mid a \in V,\ 1 \le i \le D \,\},$$

where, for any set X, $X^{(\mathbb{N})}$ denotes the set of finite sequences of elements of X, including the empty sequence ∅. For x in X and $y = y_1, \ldots, y_k \in X^{(\mathbb{N})}$, we also write $y\,x$ for the concatenated sequence $y_1, \ldots, y_k, x$.

Let us draw the part of the tree $T_3$ that contains all the elements associated with sequences of length at most 2:

[Drawing of the tree: root; children 1, 2, 3; grandchildren (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3).]

Its invariant transformations are characterized by a sequence (associated with the image of the root) and a family of permutations $(\sigma_x)_{x \in V}$ of the set {1, ..., D}. These transformations are all the projections of the tree onto one of its subtrees, where the "horizontal" order may be changed at each step.

This makes it possible to define invariances on a product of graphs, which generalize the translations of a grid. The definition of directional last-passage percolation limits can then be extended to patterns organized along more general product graphs, by iterating an invariance. Two cases are possible, and combinatorial arguments can decide between them: either the limit exists and is a finite constant, or it exists but has infinite mean.

The most useful result is then the following: if a pattern graph can be embedded, through the definition of a level, into a pattern grid that is sharp, it inherits that grid's properties on path lengths. If the laws of the completion times are assumed to admit an exponential moment, this suffices to prove, for any locally finite graph, that the directional last-passage percolation limit is a finite constant.
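Criterion 1 can be checked mechanically once the vectors of the elementary cycles are known. The Python sketch below tests a candidate sharp vector and searches for one by brute force over a bounded box of $\mathbb{N}^d$ (taking $\mathbb{N}$ to include 0). The function names, the search bound and the two small families used in the demonstration are assumptions made for illustration; they are not the cycle vectors of Figure A.10. A `None` result only means that no sharp vector was found within the bound.

```python
from itertools import product


def is_sharp_vector(s, cycle_vectors):
    """Check <r, s> <= -1 for every cycle vector r."""
    return all(sum(ri * si for ri, si in zip(r, s)) <= -1 for r in cycle_vectors)


def find_sharp_vector(cycle_vectors, max_coord=10):
    """Search s in {0, ..., max_coord}^d; return the first sharp vector found, else None."""
    dim = len(cycle_vectors[0])
    for s in product(range(max_coord + 1), repeat=dim):
        if is_sharp_vector(s, cycle_vectors):
            return s
    return None


if __name__ == "__main__":
    # Hypothetical family contained in a strictly acute cone: a sharp vector exists.
    print(find_sharp_vector([(-1, 0), (-1, 1)]))
    # Two opposite vectors, as in case (ii) of Lemma 1: no sharp vector can exist.
    print(find_sharp_vector([(1, -1), (-1, 1)]))
```

The brute-force search is only practical in small dimension; since the criterion is a system of linear inequalities, an integer or linear programming solver would be a natural replacement for larger dependence graphs.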

Processes of Interaction in Data Networks

Internet data traffic is controlled by end-to-end interaction between flows, which follows a discrete event process specified by the TCP protocol. We start by establishing limit regimes that characterize the bandwidth sharing of a link when a large number of flows is responsible for fixed or dynamic traffic demand. We then design a distributed protocol for communication on overlay networks that extends the process controlling a flow to an arbitrary set of destinations. This leads us to introduce a new framework of calculus to study infinite regular discrete event systems, described by last-passage percolation times. We establish, for this protocol and more generally for any distributed system satisfying a "sharpness criterion", that a throughput can be guaranteed independently of the system's size, and that a stable regime exists in which long-range dissemination times follow a law of large numbers.

Processus d'Interaction dans les Réseaux de Données

The stability and fairness of Internet traffic are ensured by the end-to-end interaction of data transport flows, following a discrete event process specified by the TCP protocol. We first describe several regimes that characterize a large population of flows, fixed or intermittent, sharing the bandwidth of a link. We then design a distributed protocol that extends the control process of a flow to dissemination toward an arbitrary group of destinations over a peer-to-peer network. This leads us to define a new framework of calculus to study infinite regular discrete event systems, described by last-passage percolation. For this protocol, and for any system satisfying a "sharpness criterion", we show that the throughput is guaranteed whatever the size of the system, and that a stable regime exists in which propagation times satisfy a law of large numbers.