fault-tolerant broadcasting and gossiping in communication networks

Upload: mies0r

Post on 10-Apr-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    1/14

    Fault-Tolerant Broadcasting and Gossipingin Communication NetworksAndrzej Pelc*Departement dlnformatique, Universite du Quebec a Hull, Hull, Quebec J8X 3x7, Canada

    Broadcasting and gossiping are fundamental tasks in network communication. In broadcasting, or one-to-all communication, information originally held in one node of the network (called the source) must betransmitted to all other nodes. In gossiping, or all-to-all communication, every node holds a messagewhich has to be transmitted to all other nodes. As communication networks grow in size, they becomeincreasingly vulnerable to component failures. Thus, capabilities for fault-tolerant broadcasting and gossipinggain importance. The present paper is a survey of the fast-growing area of research investigating thesecapabilities.We focus on two most important efficiency measures of broadcasting and gossiping algorithms:running time and number of elementary ransmissions required by the communication process.We emphasizethe unlfying thread in most results from the research in fault-tolerant communication: the trade-offs betweenefficiency of communication schemes and their fault-tolerance. 0 7996 John Mey & Sons, Inc.

    1. INTRODUCTIONBroadcasting and gossiping are fundamen tal tasks in net-work communication. They both aim at disseminatinginformation among nodes. In broadcasting, also calledone-to-all communication, information originally held inone node of the network (called the source) has to betransmitted to all other nodes. In gossiping, or all-to-all communication, every node holds a message (value)which must be transmitted to all other nodes. These typesof network communication often occur in distributedcomputing, e.g., in global processor synchronization andupdating distributed databases. Moreover, suc h comm uni-cation tasks are implicit in many parallel computationproblems, where data and results are distributed amongprocessors. This happens, e.g., in matrix multiplication,parallel solving of linear systems, parallel computing ofthe discrete Fourier transform, o r parallel sorting, cf. [8,31, 531.T w o most important measures of performance ofbroadcasting and gossiping algorithms are the num ber ofelementary transmissions ( c a l f s ) and the number of

    * E-mail: [email protected]

    rounds ( r ime) required. Another concern in the design ofcommunication schemes is the demand that they imposeon the underlying network. S ince dense networks are dif-ficult and costly to implement, it is important to considerefficient broadcasting and go ssiping algorithm s that workfor networks as sparse as possible. Excellent accou nts ofthe literature on broadcasting and gossiping focusing onthe above-mentioned problems can be found in surveys[33, 50, 511.As commu nication networks grow in size, they becomeincreasingly vulnerable to com ponent failures. Som e linksand/or nodes of the network may fail. It becomes im-portant to design communication algorithms in such away that the desired communication task be accomplishedefficiently in spite of these faults, usually without know -ing their location ahead of time. As such, much attentionhas been devoted recently to fault-tolerant broadcastingand gossiping. The present paper, which is an extendedversion of [62], surveys this rapidly growing area ofresearch. It has been necessary to make choices in thelarge body of literature concerning fault-tolerant commu-nication, leaving out vast and important subdomains re-lated to the main focus of this paper. We do not coverthe issue of network reliability, as it is not immediately

    NETWORKS, Vol. 28 (1996) 143-1560 1996 John Wiley & Sons. Inc. CCC 0028-3045/96/030143-14

    143

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    2/14

    I# PELC

    concerned with the efficiency of fault-tolerant commu ni-cation but rather with necessary conditions for its feasibil-ity. Another issue closely related to fault-tolerant broad-casting and gossiping is the Byzantine Agreement. Thismaterial has been covered in the survey [ 5 ] which alsodiscussed the im portant problem of multiprocessor systemdiagnosis. We d o not consider problems of the ByzantineAgreement in this paper, although we discuss some as-pects of broadcasting and gossiping in the presence ofByzantine faults, pointing out differences of our approach.Finally, we focus attention only on the two above-men-tioned communication tasks, leaving out, e.g., the im-portant and largely studied issue of fault-tolerant, point-to-point routing where problems and techniques aredifferent from those encountered for broadcasting andgossiping.We focus on two most important and widely studiedefficiency m easures of broadcasting and gossiping algo-rithms: running time and number of elementary transmis-sions used in the communication process. We also discussth e issue of sparsity of networks supporting efficient fault-tolerant communication schemes; schemes that work forsparse networks are more w idely applicable than are thoserequiring, e g , a completely connected network. In thissurvey, we emphasize the unifying thread found in mostresults from research in fault-tolerant com munication: th etrade-offs between efficiency of com munication schem esand their fault-tolerance.The paper is organized as follows: In Section 2, wediscuss a variety of comm unication and fault models dis-cussed in the literature. One dividing line is betweenbounded and probabilistic fault mo dels, as both the goalsand algorithm design techniques differ in each case. Therest of the paper is organized as follows: In Section 3,we survey results in the bounded fault model, while inSection4, andom fault distribution is considered. Section5 is devoted to the discussion of possible directions offuture research.

    2. MODELS AND TERMINOLOGYThe d omain of fault-tolerant broadcasting and gossipingis a rather broad area of research. Different authors haveadopted varying approaches, both to network comm unica-tion and to fault modeling. Thus, many combinations ofassumptions concerning the communication process andthe possible faults can be fo und, yielding a large num berof communication models and fault-tolerance solutions.The design of algorithms and results concerning theirefficiency and robustness heavily depend on the underly-ing model; hence, it is very important to specify the as-sumptions in a detailed and rigorous way.The only assumption common to all papers reviewedin the present survey is the consideration of point-to-pointcommunication networks modeled as undirected graphs

    whose vertices are sites of the network (e.g., processors,computers) and whose edges are communication linksused to transmit messages from site to site. A number ofother features must be specified in order to describe themodel completely. Below , we point out the choices to bemade and fix the appropriate terminology.2.1. CommunicationModeThe communication primitive is a call taking place be-tween two adjacent sites (or nodes) of the network andusually lasting a unit of time. The communication modespecifies which calls can be exe cuted sim ultaneously dur-ing one unit of time and what messages can be transmittedin one call. In the shouting mode, also called n-porr orfink-bound, node can com mu nicate with all its neighborsduring a single unit of time, while in the whispering mode,also called I-porr or processor-bound, a node can com-municate with at most one neighbor. Further, the comm u-nication mode can be full-duplex, when during one callmessages between communicating nodes can travel inboth directions through the (bidirectional) link joiningthem, or halfduplex, when every node can only send oronly receive information in a given call. The shoutingmode models, e g , adio communication, in which a sig-nal can be simultaneously transmitted to all receiverswithin the range of the broadcasting station. The w hisper-ing mode is suitable to model wire-based comm unication,such as occurs in traditional telephone networks. Like-wise, the full-duplex mode is appropriate for telephoneconversations, while the half-duplex mode is used insending telegrams or letters. Several combinations andvariations of these modes have been considered, e.g., si-multaneous sending to al l neighbors but sequential receiv-ing or the concurrent sending to one n eighbor and receiv-ing from o ne (possibly different) neighbor.2.2. Size of PacketsAnother important characteristic of the communicationprocess, significantly influencing the performance of algo-rithms, is the size of message packets. A packet is theamount of information that can be sent by a node to itsneighbor in one call. This parameter may dramaticallychange the process of gossiping, as large packets allowsingle transmission of already accumulated information.In case of broadcasting, packet size often does not changethe algorithm design, as only one message is to be dissem-inated throughout the network. However, even in thiscase, i t may play a significant role, especially if largecontrol messages concerning already detected faults needto be circulated during the algorithm execution. The sizeof packets can vary from unit, when each packet cancontain the value of only one node, to unbounded, whenpackets are large enough to contain all values of nodesin the network. Large bandwidth availability, such as in

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    3/14

    FAULT-TOLERANT BROADCASTING AND GOSSIPING 1 6optical communication networks, makes the assumptionof large, essentially unbounded packets increasingly plau-sible.2.3. Fault ClassificationSeveral assumptions are made concerning aspects offaults that can occur in the communication process. Thefirst concern is to specify which components are fault-prone: only links, only nodes, or both links and nodes.Furthermore, the nature of faults must be described. Thetwo most commonly studied types are crash and Byzan-tine faults. If the fault is a crash, the faulty node does notsend or receive messages or the faulty l ink does not trans-mit messages. Faulty components do not alter transmittedmessages; they fail to transmit messages at all. Such faultsare relatively benign. Although some information may belost, at least the information that is received canbe trusted.Byzantine faults, on the other hand, are a worst-case faultscenario: Faulty components can behave arbitrarily (evenmaliciously) as transmitters, by stopping, rerouting, oraltering transmitted messages in a way most detrimentalto the communication process. Byzantine failures that ex-hibit all these kinds of damaging behavior rarely occurin practice. They may be caused by a hostile agent whoseaim is to destroy the communication process. On the otherhand, the concept of Byzantine faults plays an importantrole in our study, representing a worst-case assumption.Communication algorithms that work correctly in thepresence of Byzantine faults can be used safely under anyfau t scenario.Another important characteristic of faults, which mustbe specified, is their duration. Faults can be eitherp e m -nent (i.e., the status faulty/fault-free of a component doesnot change during the algorithm execution)or transient(i.e. , the status of a component may change in each unit oftime). Permanent faults usually correspond to hardwareComponent failures, while transient l ink faults correspondto transmission failures due, e.g., to temporary magneticinterference. Knowledge as to which types of faults arelikely to occur is important in communication algorithmdesign. Repeating attempts to transmit the same messagealong the same link is useless in case of permanent faultsbut may be essential if faults are of a transient nature.2.4. Fault DistributionOne of the crucial assumptions made about faults con-cerns their distribution. Clearly, some limitations on thenumber of possibly faulty components mustbe imposed;otherwise, no communication is possible. Two commonlyused fault models are the bounded model and theprobabi-listic model. In the bounded model, an upper boundk isimposed on the numberof faulty components; their worst-case location is assumed. In the probabilistic model, faultsare assumed to occur randomly and independently of each

    other, with specified probability. The choice betweenthese two assumptions regarding fault occurrence influ-ences the definition of the goal of broadcasting and gos-siping in the presence of faults.In the bounded fault model, we usually seek what istermed k-tolerant broadcasting and gossiping. In case ofbroadcasting, the source is assumed fault-free (otherwise,no broadcasting is possible); its message must reach allfault-free nodes provided that no more thank components(links or nodes or both, depending on the particular sce-nario) are faulty. In the case of gossiping, al l fault-freenodes must receive messages from all fault-free nodes,provided that no more than k components are faulty. Itshould be stressed that no agreement concerning messagesfrom faulty nodes is required among the fault-free nodes;this is where fault-tolerant gossiping differs from Byzan-tine Agreement in the case of Byzantine node failures.

    In the probabilistic fault model, the communicationgoal cannot be achieved with certainty, since, with somesmall probability, all components may fail and precludeany message transmission. As a result, almost safe broad-casting and gossiping is sought. In broadcasting,all fault-free nodes must receive the source message with probabil-it y at least 1 - ( I / n ) (the source being fault free); ingossiping, all fault-free nodes must receive messagesfrom all fault-free nodes with a specified probability.De-signing efficient communication algorithms whose relia-bility increases for networks of larger size is difficult forthe following reason: In small networks, it is relativelyeasy to achieve fault-tolerant communication using mas-sive redundancy; due to the small scale of the network,resources used by such brute-force communication proce-dures (either time or number of messages) will not beexcessive. In larger networks, however, the trade-off be-tween reliability and efficiency becomes an important is-sue; for such networks, highly reliable and efficient algo-rithms are sought.2.5. Flexibility of AlgorithmsIn a fault-free environment, broadcasting and gossipingalgorithms have a simple form. All calls to be carried outin each time unit can be specified in advance, before thealgorithm execution. In the presence of faults, a distinc-tion must be made regarding this point, which will sig-nificantly affect the efficiency and robustness of fault-tolerant communication. Broadcasting and gossiping al-gorithms can be either nonadaptive, also called oblivious,where all calls must be scheduled in advance,or adaptive,where every node can schedule its next call based oninformation currently available to it. In adaptive algo-rithms, a node becomes aware of whether a call it at-tempted was successful (i.e., if the called node and theconnecting link are fault-free);hence, different calls canbe executed depending on the successor failure of previ-ous ones. Even in this second case, however, nodes can

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    4/14

    146 PELConly take advantage of locally available information; wedo not assume the existence of a central monitor supervis-ing the overall communication process. Adaptive algo-rithms require more local control and memory at eachnode but are usually more efficient for the same fault-tolerance level.2.6. Models in the LiteratureAll the above characteristics of the communication andfault models must be specified if results concerning fault-tolerant broadcasting and go ssiping are to be meaningful.Many authors make som e of the assumptions tacitly, andthe com plete description of the model can only be inferredfrom the algorithm description or from arguments con-cerning efficiency and robustness of the communicationschemes. The interest of the research comm unity in differ-ent m odels varies; the popularity of som e of the proposedmodels have changed over time. An exam ple is the choiceto be made between the bounded and probabilistic mod-els; interest in nonadaptive vs. adaptive communicationalgorithms has also varied with time. The first papersin the area of fault-tolerant broadcasting and gossipingfavored the bounded fault model and nonadaptive algo-rithms. This is probab ly due to the fact that, at first, broad-casting and gossiping w ere viewed m ostly as combinato-rial, graph theoretic problems; communication schemeshave a simple combinatorial formulation under the aboveassumptions. As the dom ain of research has evolved, sev-eral researchers pointed o ut that the random fault assump-tion corresponds m uch better to failure Occurrence in realnetworks; as a result, the probabilistic fault model gainedprominence. In addition, as local memory and computa-tion capabilities of processors have increased, the im ple-mentation of adaptive algorithms has become more realis-tic. At present, there is a good balance of research con-ducted in each of the above models; we can seetechnological advances as well as intrinsic interest incombinatorics influencing the choice of particular com-munication scenarios.2.7. NotationWe use the following notation through the rest of thepaper: We let n denote the number of nodes in the net-work. Unless otherwise specified, the network is a com-plete graph. An m-hypercube is the graph on 2"' nodeslabeled by binary sequences of length rn in which adjacentnodes have labels differing in exactly one place. We de-note by log the logarithm w ith base 2 and by In the naturallogarithm.3. THE BOUNDED FAULT MODELIn this section, we assume that there are at m ost k faultycomponents in the network and their distribution is w orst-

    case. We denote by T the minimum time of k-tolerantbroadcasting or gossiping and by C the minimum numberof calls used in such a communication process.3.1. Permanent Link FaultsWe first assume that nodes are fault free and there are atmost k permanently faulty links in the network. As such,faults are of crash type. We assume the communicationmode is whispering.The first results on the number of calls in this modelwere proved in [7 1 , where both the full-duplex and half-duplex variations were considered. In case of broadcast-ing, the exact value of C was obtained in both cases. Forthe full-duplex mode,

    C = r s ( n - l ) 1 , i f k s n - 2 , and

    c = r? n 1 . otherwise.For the half-duplex mode, C = ( k + 1)(n - 1 ).In case of gossiping, the exact value of C was obtainedfor the half-duplex mode: C = ( k + 2 ) n - 2. However,for the full-duplex mode, only bounds on C have beenestablished:

    5 CI ( k + i ) ( n - I)], if k 5 n - 2, an d

    s L ( k + i)n - )] , i f k 2 n - 2.

    It was conjectured that the exact value of C is closeto the upper bound; more precisely, that C = [ k + 3) /2 ] n - consr. This conjecture w as later disproved in [481under the transient fault assumption (cf. Su bsection 3.4).The following results concern the minimum time Tof nonadaptive, k-tolerant broadcasting in the full-duplexmode:In [ 5 6 ] , he lower bound T 2 lo g nl + k , for n - 1> k 2 0, was proved. Moreover, the exact value of Twas established for k = 1, 2:

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    5/14

    FAULT-TOLERANT BROADCASTING AND GOSSIPING 147

    T = r l o g n l + I , i f k = 1, n > 2 ,T = r l o g n l + 2 , i f k = 2 , n > 4 , n z 2 ' - 1,T = [ lo g nl + 3, n = 2 ' - 1.f k = 2, n > 6,In [561, the k-tolerant broadcast function BA n wasdefined and studied for the first time. This is the minimumnumber of links in a network supporting k-tolerant broad-casting in minimum time T . In [561, the following boundson B ,(n)were established:

    Similar upper bounds on B 2 ( n )were obtained for somevalues of n. The above upper bounds on B ,(n) nd B 2 ( n )were later improved in [151. Sparser networks supporting1-tolerant and 2-tolerant broadcasting in minimum timewere constructed, giving the following estimates:

    n n2 2,(n) [log nl - - + 6, for even n, and

    for n = 0 or 1 ( m o d 4 ) , and

    for n = 2 o r 3 (mod 4 ) .In [54], it was shown that

    The above results concerning the minimum values forthe time T of k-tolerant broadcasting and for the k-tolerantbroadcast function Bk(n) ere extended in [ 4 2 ]. For allk 5 Llog PI], he m inimum time T of k-tolerant broadcast-+ 1, depending on k . As for the function &(n), it wasproved in [42] to be O( n ( k + log n)), whenever ks Llog nJ. Moreover, under the additional assumptionthat n = 2'". the exact value Bk(n) ( m k ) n / 2 wasestablished. Further bounds on the function B,(n) nd itsvalues for specific k and n can be found in [11. In p articu-

    ing was proved to be [log n l + k or r l o g ( n - I )1 + k

    lar, the authors construct networks supporting k-tolerantbroadcasting in minimum time, with a number of l inksdiffering from BA(n) by at most a constant factor, for k< Llog nl if n is even and for k 5 L2r10g''1- n + I J ifn is odd. It was also proved that

    (m - 1)(2"' - 6 )2, (2"' - 6) = , when m is even.For general values of k , two upper bounds on time Thad been obtained earlier in [64]. First, it was provedthat T s 2k + h o g nl + 4. This upper bound was alsoobtained independently in [5 7] . For large k , e.g., linearin n , t can be advantageous to decrease the coefficientat k at the expense of increasing the coefficient at [log

    nl . In these situations, the second upper bound from 1641is useful:T s 1 + $ ) k + 2 d l o g n l + d ,

    for any constant c I , k s ( n - 1 ) / ( 8 c + l ) ] - c.and a constant d depending on c but not on n or k.The exa ct values of T for arbitrarily large k were firstobtained in [3 7 ], for n being a power of 2. If n = 2"'and k 5 n - 2, k-tolerant broadcasting schemes wereconstructed that perform in the optimal possible time:T = l o g n + k , i f k s n - l o g n - 1, andT = l o g n + k + 1, i f n - l o g n 5 k s - 2.

    Moreover, the networks supporting these schem es havethe minimum possible number of links ( m + k)n/2,thereby proving that B A ( n )= (m + k ) n / 2 , f or n = 2"and k s - 2.In the same paper, tighter upper bounds on T for othervalues of n an d k were a lso established. Let m = Llog nJand n = 2" ' + , where 0 < 5 2"'-2. It was proved thatm - h o g l - 1

    if k 5 2"' - m - 1. an dLs m + k + l +k - l J,Lm - r i o g j l - 1T = s m + k + 2 +

    if 2"' - m - 1 < k 5 2"' - 2.The tightness of these bounds depends on k an d j . Fo r

    small values of j , when n is close to Llog nl, nd smallvalues of k, close to log n, he upper bound does notexceed log n + k very much and thus it is fairly tight.

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    6/14

    148 PELCOn the other hand, for large j , close to 2-, the boundbecomes close to log n + 2k, which is then similar tothe bound T 5 2k + [log nl + 4 from [64] nd [57]and is far from the lower bound when k is large. Theexact values of broadcasting time for arbitrary n and kremain unknown.In [591, k-tolerant broadcasting and gossiping wasconsidered in the shouting mode for product graphs G= G I X Gz.The authors derived bounds on the m inimumtime and the minimum number of calls for k-tolerantbroadcasting and gossiping in G depending on these val-ues for G , and G z.

    A different approach to broadcasting in the presenceof faults was adopted in [121. In the usual scenario of k-fault-tolerance, even an adaptive algorithm does not as-sume any a pr ior i knowledge of fault location; this infor-mation must be acquired by nodes during the comm unica-tion process, which is usually costly in terms of timeand calls. In [ 121, fault-tolerant broadcasting in the m-hypercube was considered, assuming that the location offaults is a priori known to the source. Packets were as-sumed of size O ( m ) and a variation of the whisperingmode was considered, in which each node could sendinformation to two neighbors in a unit of time. Finally,it was assumed that the number of faulty links does notexceed rm/21 - I . Under these assumptions, it wasshown that a nonadaptive broadcasting algorithm for them-hypercube existed w orking in optimal time m and usingth e optimal number of 2- 1 calls.

    In [40], the lineur communication model was used.In this model, the time to transmit a message of lengthL is p + LT,where p is the startup time and LT isthe propagation time. Four communication modes wereconsidered: whispering and shouting in both the half-duplex and full-duplex variations. Also, both transient andpermanent link failures were investigated. Under theseassumptions, the authors used Rabins Information Dis-persal Algorithm (ID A ) to construct fast k-tolerant broad-casting algorithms for the hypercube. Given a file F, IDAencodes it into n smaller files F,, . . ,F, in such a waythat F can be recovered from any subset of n - k filesF, . Broadcasting proposed in [401 was performed so thatone failure could only affect one file F, and, consequently,the algorithm was k-tolerant. For each of the consideredcommunication modes, the execution time of this algo-rithm was show n to be sm aller than the time of broadcast-ing using the straightforward ( k + I)-replication ap-proach.In [301, randomized broadcasting in the whisperingmode was considered. Informed nodes decide randomlyto which neighbor they transmit the message at each timeunit. (This should not be confused with deterministicbroadcasting in the presence of randomly distributedfaults, to be considered in Section 4.) If k < cn, or aconstant c < 4, random broadcasting in the presence of

    at most k faulty links was proved to be completed in timeO( log n), with probability converging to 1.

    3.2. Permanent Node FaultsWe now assume that links are fault-free and there are atmost k permanently faulty nodes. Faults are of the crashtype; the comm unication mo de is full-duplex whispering.W e first consider the minimum time T of k-tolerant adap-tive broadcasting. The first results in this model wereobtained in [29]. It was proved that if there are at mostk nonadjacent faulty nodes in some chordal ri n s thenadaptive broadcasting can be performed in time f og nl+ k. (Chordal rings are rings [ x , ) , . . . , x, , -~] i th addi-tional links [ x i , x ~ ~ + , , ~ , , , , ~ ] ,here t E S , or a fixed SC ( 1, . . . , - 1 ) .) Adaptive broadcasting in completenetworks has been recently investigated in [ 4 5 ] . T wovariations of the above model were considered: In thefirst, called the wake-up model, only nodes that have al-ready obtained the source information (i.e., are awake)can attempt placing calls. In this model, the exact valueof minimum broadcasting time T = k + r l o g ( n - k ) lwas obtained. The second variation does not impose anyrestriction on nodes placing calls. Uninformed no des canalso place calls; this enables the algorithm to performpreprocessing and avoid time-consuming calls from thesource to faulty nodes. In this general model, the authorsconstructed a k-tolerant broadcasting algorithm workingin time O(log2n), whenever k I an, or a constant a< 1. It remains open if adaptive broadcasting can be donein time O(1og n).In [431, nonadaptive broadcasting in the half-duplexwhispering mode was considered, under the same as-sumption as above: k s an, or a constant a < 1. T heauthors established upper and lower bounds on T an dconstructed a k-tolerant broadcasting algorithm requiringtime at most 1.73 times greater than optimal.In [521. a comm unication mod e in between whisperingand shouting was considered: Every node can transmit toat most k other nodes in a unit of time. The authorsproposed nonadaptive k-tolerant broadcasting schemesworking for the complete n-node network in timer l o g , + , n l + 3. Fo r k = 1 and k = 2, these schemes workin t ime r log,+ ,nl + 2, and if no faults are present, theywork in optimal time hog,, ,nlIn [21, the goal considered was that of minimizing thenumber of calls in fault-tolerant gossiping. It was assumedthat nodes fail in a Byzantine manner, but they havediagnostic capabilities: A fault-free node can correctlydiagnose tested nod es, while faulty testers are unpredict-able. The half-duplex whispering mode was adopted, andthe size of packets was assumed unbounded. The authorspresented an adaptive gossiping algorithm using 3n logn + O ( n ) alls and wo rking correctly whenever the fault-free part of the network is connected.

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    7/14

    FAULT-TOLERANT BROADCASTING AND GOSSIPING 1493.3. Permanent Link and Node FaultsWe now assume that both links and nodes can fail andthat the total number of such permanent crash faults isat most k. In [ 6 5 ] , nonadaptive broadcasting in the m-hypercube was considered under the assumption thatnodes can simultaneously send and receive m essages andthat the number k of faults is less than m . A broadcastingalgorithm w as proposed, requiring time at most m + 1 ifa node can simultaneously transmit to all neighbors andtime at most 2m if transmitting a message is possible onlyto one neighbor at a time.In [131, a comm unication mode in between whisperingand shouting was considered. The authors assumed thateach node can communicate with at most t neighbors ina unit of time. Assuming that k < m, they proposed ascheme for k-tolerant broadcasting in the m-hypercuberequiring a time of m + r ( k + 1 ) / t 1 . Some bounds onthe broadcasting time in the presence of a larger numberof faults that do not disconnect the hypercube were alsoshown.In [491, k-tolerant broadcasting in jumping nenvorkswas considered. These are n-node chordal rings in whichnodes u and u are adjacent iff u - u = 2'(mod n), fo rsome integer r. In the full-duplex whispering mode, theupper bound Ts lo g nl + k + 2 on k-tolerant broadcast-ing time was established under the assumption k s h o gn l - 2.Nonadaptive k-tolerant broadcasting in product net-works G = GI - - - x G, was considered in [3 ] .The authors proposed broadcasting algorithms for suchnetworks using the construction of n independent span-ning trees rooted at the source s = (sI . . , s,$) (the pathsfrom the sou rce to any node in distinct trees aremutuallyinternally node-disjoint ) . These algorithms can tolerateup to L (n - 1 ) / 2 1 Byzantine faults or up to n - 1 crashfaults. The execution time t of these algorithms was de-rived in terms of broadcasting time in the fac tor networksG ,, oth in the whispering and in the shouting mode. Forthe whispering mode, t = 2 ZyLl' p, + on+ I , and forthe shouting mode, t = 1 + C:=, a,, here 0, and a,denote op timal fault-free broadcasting time from s, in G, ,in whispering and shouting m ode, respectively.In [321, the previously described linear communica-tion model was used and k-tolerant broadcasting and gos-siping algorithms were proposed for the hypercube inboth shouting and whispering modes. It was proved thattheir running time is asymptotically optimal, i.e., it con-verges to the lower bound as he length L of the messagesincreases.A different approach to broadcasting in the presenceof faults, similar to that in [121, was adopted in [631 and[41]. In both papers, the location of faults was assumedto be known to all nodes. The adopted m ode of commun i-cation was shouting. Under these assumptions, broadcast-ing in the m-hypercube was considered in [631. An algo-

    rithm was described that performs broadcasting in thepresence of up to m - 2 arbitrary node or link failures,using the optimal time m and the optimal number of calls2"' - . This should be compared to the previously men-tioned result of [121. obtained under different assump-tions. In [121, the numb er of faults was sm aller and onlylinks were fault-prone, but the location of faults wasknown only to the source and the communication modewas more restrictive, i.e., a node could com mun icate withonly two neighbors in a unit of time.A similar result for the star graph was obtained in[41] . The n-star graph S, is a graph whose nodes arelabeled by all permutations of integers 1, . . . , n and thenode u = ( u l , . . . , u,,) is adjacent to all nodes u[i]. fori = 2, . . . , , where u[ ] results from u by a transpositionof uI and u, . S, has n! nodes and the diameter D ( S , )= L3(n - 1)/2J. The authors determined the maximumnumber r ( n ) of link or node faults such that the fault-free part of S, still has diameter D(S ,). Assuming at mostr(n ) ink or node faults known to all nodes, they proposeda broadcasting algorithm for S, working in optimal timeD ( S,) and using the optimal number of calls n ! - 1. Inthe traditional fault-tolerant setting, i.e., in the presence ofk unknown faults, they proposed k-tolerant broadcastingusing the optimal number of ( k + 1) (n - 1 calls.A variation of the broadcasting problem, called linearbroadcasting. was first considered in [10, 221. See alsoProblem 13 by Greenberg (the Report Dispersal Prob-lem) in [341. The source has an unrestricted number ofidentical tokens. In any unit of time, each node that holdstokens can send at most one token to at most one othernode. The goal is for all nodes to be visited by at leastone token. Th e difference from broadcasting is that tokensmay not be "multiplied" for free at any node but haveto travel to each node from the source. In [271, a fault-tolerant version of linear broadcasting was considered.Each token has a predetermined route, indicating in whichorder nodes should be visited. A faulty node or link de-stroys all tokens passing through it. A linear broadcastingscheme consisting of token routes is k-tolerant if everyfault-free nod e is visited by at least one token, wheneverthe total number of faults does not exceed k . The perfor-mance measure of a k-tolerant scheme, adopted in [ 2 7 ],was the number of tokens used by the scheme. For posi-tive integers k and n , et PL n) enote the minimum num-ber of tokens for which there exists a k-tolerant linearbroadcasting scheme in a (n + 1)-node complete net-work. The authors established lower bounds on Pk(n)and showed k-tolerant schemes using few tokens. It wasproved, in particular, that the minimum number of tokenssufficient to perform 2-tolerant linear broadcasting is@(log log n ) and that it is 2 for 1-tolerant linear broad-casting.3.4. Transient FaultsWe now turn attention to transient faults. Let us firstnote that if a broadcasting or gossiping algorithm works

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    8/14

    correctly assuming at most k permanent link and/o r nodefaults then it also works correctly assuming at most ktransient link and /o r node faults during the entire commu -nication process. Hence, all upper bounds on time andthe number of calls reported above for permanent faultsstill hold in the present scenario. In this subsection, wediscuss results that use the assumption that faults are tran-

    time in O ( og n + k ) .This orde r of magnitude is clearlyoptimal. This result was subsequently strengthened in[371 where the following upp er bounds on T were proved:

    T I rn + k , if n = 2"'. which is tight, andT s r n + 3 k + 1 , i f n = 2 " - ' + j < 2 " .

    sient, i.e., the same fault cannot prevent transmission intwo distinct time units, unlike in the permanent fault case.All results concern only link faults. We assume thatnodes are fault free and links are subject to transientfailures, i.e., there are at most k faulty calls during thecommu nication process. Faults are of crash type, i.e., nomessage passes during a faulty call. Packets are un-bounded, the communication mode is whispering, andalgorithms are nonadaptive.The first results concerning the minimum number Cof calls were proved in [ 4 8 ] . As mentioned in S ubsection3.1, it was conjectured in [ 7 ] that C = [ (k + 3 ) / 2 ] n- const under the assumption of at most k permanentlink faults. This conjecture was disproved in [ 4 8 ] underthe stronger assumption of at m ost k transient link faults.The following upper bound was shown:

    Moreover, the gossiping algorithm constructed in[37], requiring a minimum time T = rn + k , when n= 2", has an additional feature: It uses the minimumnumber ( r n + k ) n / 2 of calls. T he scheme also improveson the number of calls, when compared to [48 ] for manyvalues of k an d j . The exact value of T for this model,for arbitrary n and k , remains unknown.For broadcasting in the half-duplex variation of thismodel, the following lower bounds w ere proved in [381.F o r k = 1 and arbitrary n , T = h o g n l + 2 and C 2 2n- 2. On the other hand, a l-tolerant broadcasting schem ewas shown for which both the time and the number ofcalls achieve the above optimal values. For arbitrary n= 2"' > 2k, it was proved that the minimum time of k-tolerant broadcasting is T = rn + 2k. Moreover, it wasshown that k-tolerant broadcasting in time rn + 2k, for k> 2"-' , can be achieved in the symmetric directed rn-hypercube. This network has the minimum number oflinks among all networks sup porting k-tolerant broadcast-ing in the optimum time rn + 2k.We now return to the full-duplex m ode. The followingvariation on the assumption of at most k faulty calls wasrecently considered in [MI. Since the number of callfailures is likely to increase with execution time, insteadof imposing a fixed upper bound on this number, it seems

    C 5-k + O ( k 6 + n log n ) .2In [481, a class of improved upper bounds for almostall k was also obtained, e.g.:

    C f ( k + 8 ) ( n - 1 ) + 2k + 16.reasonable to assume-that the possible numb er of failuresis proportional to the time e lapsed sin ce the beginning ofthe algorithm execution. To make broadcasting possiblein the whispering mode, the proportion coefficient mustbe smaller than 1; therwise, no message leav es the sourcein the worst case. Fix a C 1. Assume that, for any t> 0, at most at calls may fail in the first t time units ofthe algorithm execution. The following results regardingbroadcasting time in this linearly bound ed transient faultmodel were obtained in [44 .

    The exact value of the minimum num ber of calls in k-tolerant gossiping in this model is still unknown.A more general measure of communication cost hasbeen considered in [391. Every link ( i , j ) of the completenetwork on n nodes is assigned a positive cost c ( i , ) .The total cost of a broadcasting or gossiping algorithmis obtained by adding c ( i ,j ) whenever a message travelsthrough the link ( i , j ) . For example, the total cost isequal to the number of calls in the half-duplex mode,when all costs c(i, ) are equal to I . ] It was proved in[39] that, for given costs of links, the problem of de - If the network is an n-node chain, thentermining the minimum cost of performing k-tolerant gos-siping among n nodes is NP-hard. The authors proposeda k-tolerant gossiping algorithm with cost at most twicethe optimal. Moreover, their gossiping scheme requirestime at most ( k + 2 )n - 1 larger than optimal. In thecase of broadcasting, a k-tolerant algorithm was con-structed with cost k + 1 times larger than the cost of aminimum spanning tree. It was proved that this cost isoptimal among all k-tolerant broadcasting algorithms.

    The first result on the time of gossiping in the full-duplex mode with transient link faults was proved in

    T E O ( n ) , if a < $ andT E a((A)")if a 2 - .

    If the network is an rn-hypercube then, for any a< 1,T = L L ( r n - l ) ] + r n .[48]. A k-tolerant algorithm was proposed with running 1 - a

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    9/14

    FAULT-TOLERANT BROADCASTING AND GOSSIPING I51If the network is the com plete n-node graph, then, fora n y a < 1,

    log n + O(l0g log n).1T E -1 - aA similar assumption that the number of faulty trans-missions may increase in time was addressed in [ 351. Theauthors considered the shouting communication mode;therefore, it was possible to allow a larger number offailures per time unit. They assumed that at most k callsmay fail in each time unit for a constant k smaller thanthe edge-connectivity of the network. (For larger valuesof k , the network could be disconnected.) In [35] , theissue of broadcasting time in the m -hypercube was studied

    for this model, where it was proved that T E m + o (m ) .This model was further investigated in [21] for generalnetworks. Let d be the diameter of the network and let k(smaller than the edge-connectivity ) be the maximumnumber of faulty calls per time unit. The following resultswere proved in [2 1 1 :For fixed d , T E O ( k d ' * - ' ) , or arbitrary networks,and there exist networks for which T E @ ( k ( d ' 2 - ' ) .For fixed k , T E O (d ' + ) , for arbitrary networks, andthere exist networks for which T E 0( '* I ) .For multidimensional tori, TE O (d ) .As an open prob-lem, the authors asked whether T is linear in the diameter,for all vertex transitive graphs.A different way of defining fault-tolerant broadcastingwas proposed in [ 5 8 ] . nstead of demanding that a broad-casting scheme tolerate at most k link faults in the net-work, for a given parameter k , the requirement of fault-tolerance was allowed to vary from node to node. Fix anetwork N and a broadcasting source s . Given a node u

    f s, et k, be the minimum number of links whose dele-tion disconnects u from s. A broadcasting algorithm isreliable for the node u, if u gets the source message when-ever less than k,, links are subject to transient crash faults.A broadcasting algorithm is faithful if it is reliable forevery node. In [581, the author investigated the trade-offbetween the minimum number of calls used by a faithfulbroadcasting algorithm and the m aximum am ount of localmemory needed in a node. The latter is called the spacecomplexity of the algorithm. It was proved that everyfaithful broadcasting algorithm for an arbitrary networkuses at least Z,,, k,, calls. Moreover, for any network,there exists a faithful broadcasting algorithm of linearspace complexity, using the above optimal number ofcalls. On the other hand, arbitrarily large networks wereconstructed for which all faithful broadcasting algorithmsusing no local memory for computations at nodes use anumber of calls exponential in the size of the network.Finally, the author characterized networks that supportfaithful broadcasting using the optimal number of callsand no local memory for computations.

    4. THE PROBABILISTIC FAULT MODELIn this section, we assu me that links fail with probability0 s p < 1, nodes fail with probability 0 s q < 1, an dall faults are independent. The values p = 0 ( q = 0)correspond to the assumption that links (nodes) are fault-free. Unless explicitly stated, algorithms are deterministicand probabilistic considerations relate only to their cor-rectness and/or efficiency.In most of the papers using the probabilistic faultmodel, results concerning the execution time and thenumber of calls used by broadcasting and gossiping algo-rithms are of asym ptotic nature, i.e., only their orders ofmagnitude (up to a multiplicative constant) are mini-mized. Using this approach, results concerning broadcast-ing and gossiping are usually equivalent when packetsare assumed of unbounded size. indeed, a reverse of abroadcasting algorithm can be used to gather all valuesin one node; then, the total information can be broadcastfrom this node to all other nodes. In this way, both thetime and number of calls are at most doubled.In the probabilistic m odel, there are two natural varia-tions for defining the performance of ad aptive algorithmsin terms of time and number of calls. Both these valuesare random variables, as they depend on the location offaults which are random. Thus, we may ask about theworst-case or the expected time and num ber of calls usedby the algorithm. It will be seen below that this distinctionis sometimes significant. It does not occur in the caseof nonadaptive algorithms, as all transmissions in suchalgorithms are scheduled in advance and do not dependon the random occurrence of faults.4.1. Permanent Link and Node FaultsWe first assume that all failures are permanent and ofcrash type and that the communication mode is full-du-plex whispering. First, consider the fault-free nodes sce-nario (q = 0). n the following papers, p < 1 was as-sumed to be constant. In [91, a nonadaptive, almost safebroadcasting algorithm was proposed. It used noncon-structive expanders and worked in time O(log n). In[251, broadcasting and gossiping were studied under theassumption of unbounded packets. A simple treelike con-struction was applied to guarantee almost safe adaptivebroadcasting and gossiping using an expe cted time O ( ogn) nd an expected number of calls O ( n ) .Nonadaptivebroadcasting and gossiping algorithms were proposed,working in time O (lo g2 n) and using O (n log n) calls.The order of the number of calls was proved to be optimalfor nonadaptive algorithms (cf. also [91). Later, the con-struction from [25] was extended in [17] to decreasenonadaptive broadcasting time to O( og n) aswell, undermore general assumptions. Also, in [171, the result from[25 1 concerning adaptive broadcasting was strengthened.An adaptive, almost safe broadcasting algorithm was

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    10/14

    152 PELCgiven whose worst-case time was O(1og n ) and worst-case number of calls was O (n), oth orders of m agnitudebeing optim al. In [161,almost safe broadcasting was con-sidered for the m-hypercube. The authors constructed anonadaptive broadcasting algorithm requiring timeO ( m ) ,whenever p 5 0.09.In [661, almost safe, adaptive broadcasting was con-sidered, assuming that link failure probability p is largerthan constant. Let p* = 1 - p denote the probability thata link is fault free. Fo r p * = [ w( n ) lo g n l l n , let w ( n ) beany function divcrging to infinity. The authors proposedan almost safe broadcasting algorithm working in timelog n + o( log n) , with probability converging to 1. Underth e slightly stronger assumption p* = ( K n2n ) l n , whereK > 12/( ln 2), they showed an almost safe broadcastingalgorithm working in optimal time h o g nl .These results were subsequently extended in [461. Theauthor showed almost safe broadcasting in time log n+ d log log n, or d < 1, whenever p* = ( c In n ) l n ,where c > 18.4. Moreover, almost safe broadcasting int ime ho g nl was shown for p * = ( c In n-log log n ) l n ,where c > 16. In [361, the latter result was strengthened:Almost sa fe broadcasting in t ime ho g nl was shown forp * = (c In n ) l n , for some positive constant c.In [471, the results from [461 were extended in anotherway. On the one hand, almost safe broadcasting in timeh o g nl + 1 was shown for p * = (C In nslog log log n ) ln , where c > 16.On the other hand, the author showedalmost safe broadcasting in time log n + d log log logn, for d < I , whenever p* = ( c In n ) l n , where c> 18.4, thus improving required time for this probabilityvalue.Random ized broadcasting in the above fault model wasconsidered in [ 3 0 ] . F o r p * = [log n + w ( n ) ] / n , herew (n ) is any function diverging to infinity, randomizedbroadcasting was proved to be almost safe and work int ime O(log2 n), while for p* 2 [ ( 1 + )logn l l n , itsrunning time is O ( og n ) .In the following papers, both links and nodes wereassumed fault-prone. In [171, nonadaptive broadcastingwas considered for arbitrary constant fault probability val-ues p ,q < 1. The authors proposed an almost safe broad-casting algorithm w orking in time O ( og n), sing a tree-like construction. This eliminated the need of noncon-structive expanders used in [91, simultaneouslyweakening the assumption regarding node faults (in [91,nodes were assumed fault-free). In [17], an adaptive,almost safe broadcasting algorithm w as also given havingworst-case running time O (lo g n ) nd expected numberof calls O ( n ) . t remains open whether both the runningtime O(log n ) and the number of calls O(n ) can beguaranteed in the worst case. As previously mentioned,the authors gave a positive answer to this question assum-ing that nodes are fault-free.

    A variation of broadcasting, called waking up, wasconsidered in [191. In the beginning, only the source is

    awake and has to wake up all fault-free nodes by sendingthe wake-up message. The difference from classicalbroadcasting is that only nodes that are awake (i.e., al-ready have received the message) can place calls; a dor-mant (uninformed) node cannot call to seek the sourcemessage, as in broadcasting. This restriction has a sig-nificant impact on the minimum number of calls of analmost safe, waking-up algorithm, in the case when linksand nodes are fault prone. It was proved in [191 that everysuch algorithm (even adaptive) must use an expectednumber of 52( n log n ) alls. (T his can be contrasted withthe above-mentioned result from [171, where classicalbroadcasting was performed with an expected linear num-ber of calls.) M oreover, the authors constructed an alm ostsafe wake up algorithm working in expected time O( ogn), and using an expected number of O ( n log n ) alls,both orders of magnitude being optimal. An additionalfeature of the above schem e was that it worked in anony-mous (com ple te) networks in w hich nodes do not knowtheir labels and execu te identical algorithms.In [201, nonadaptive broadcasting in the hypercubewas considered. It is well known that, in this case, almostsafe broadcasting is im possible for large fault probabilityvalues, as the hypercube can be then disconnected withconstant positive probability. For small probabilities offaults, satisfying the condition ( 1 - p ) ( 1 - q ) 2 0.99,an almost safe broadcasting algorithm workmg in timeO(m ) for the m-hypercube was constructed in [201.As observed previously, for packets of unbounded size,the above results can be immediately extended to gossip-ing. The situation changes significantly if we considersmaller packets. For unit-size packets, e.g., gossiping re-quires linear time even without faults. In [28]. the rela-tions between the size of packets and the time of fault-tolerant broadcasting were investigated, for arbitrary faultprobability values p , 9 < 1. For packets of size b ( n ) , heauthors constructed a nonadaptive almost safe gossipingalgorithm working in time O ( [ n l b ( n ) ] log n), asilyseen to be optimal. For the unbounded packet size, thisyields gossiping time O( og n),which also follows from[171, while fo r unit-size packets, this yields linear gossip-ing time. The algorithm in [28] used explicitly con-structible expanders.It should be mentioned that in the above gossipingalgorithm, nodes do not know, a priori, whose value iscurrently transmitted; thus, it is necessary to attach nodelabels to values during transmissions. Since labels mustuse at least log n bits, a packet of size b ( n ) , which, bydefinition, must contain values of b ( n ) nodes, must, infact, contain b(n)log n bits, even for one-bit values. Forexample, unit-size packets must contain O( og n) bits.This should be compared to the situation in [181 and [ 1 I ] ,where labels did not have to be attached and packetscould have only a constant number of bits.In [141. a nonadaptive broadcasting algorithm tolerat-ing at most m - faults in the m-hypercube w as proposed.

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    11/14

    FAULT-TOLERANT BROADCASTING AND QOSSlPlNG 1s

    Its running time is (2" - 1)m nd the number of callsis m2" - 2" + 1. In addition, the author performedsimulations showing that for fault probability p s fall nodes become informed after an average time of lessthan 5m.Byzantine link failures under the assumption that nodesare fault free were considered in [61. Link failure proba-bility was a con stan tp < 4;otherwise, no reliable com mu-nication can be achieved in a Byzantine environment. Thecommunication mode was full-duplex whispering. Usingnonconstructive expanders, the authors proposed a non-adaptive almost safe broadcasting algorithm working intime O(1og n).In [41, nonadaptive gossiping with unbounded p acketswas considered for the m-hypercube. The authors as-sumed that either links or nodes fail in a Byzantine way.For fault-free links (respectively, nodes) and nodes (re-

    spectively, links) failing w ith probability p s / m , witha small constant c , an almost safe gossiping algorithmrequiring time 2m was presented. For fault-free nodes andlinks failing with probability p s , with a sm all constantc. almost safe gossiping in time 2m 2 was shown. It re-mains open whether almost safe gossiping with un-bounded packets, or even almost safe broadcasting, canbe performed in the m-hypercube in time O(m) henlinks fail in a Byzantine manner with a small constantprobability.Gossiping with unit-size packets in the presence ofByzantine faults was studied in [111. The authors consid-ered strongly nonadaptive algorithms in which not onlytransmission scheduling is done in advance, but it is alsopredetermined which node's value is to be sent in a giventransmission. This en ables the algorithm to sk ip the labelsof no des during transmissions, similar to the approach in[181 and unlike that in [281 . Thus, if node values havea constant number of bits, packets contain a constantnumber of bits, as well. Two commu nication modes wereconsidered. In mo de 1, sending was performed by shout-ing, i.e., a node could send a packet simultaneously to allneighbors, but receiving was sequential, i.e., a node couldreceive only one packet at a time. Mode 2 was classicalwhispering. These two communication modes were com-bined with two assumptions regarding faults: (N ) fault-free links and nodes failing with constant probability 0< q < and (L) fault-free nodes and links failing withconstant probability 0 < p < i. All four models yieldedby combinations of these assumptions were considered,labeled as LI, L2, N1, and N2. For each of these models,almost safe gossiping algorithms were constructed whoserunning time T and the number of calls C were shown tobe of the smallest possible order of magnitude. In caseof models L1 and L2, T E @( n log n),C E O(n210g n )and the algorithm achieving this performance worked forthe sparsest possible networks. In case of m odels N1 andN2, T E 6 ( n ) nd C E @ ( n 2 ) , f the underlying network

    is complete. For these models, gossiping algorithmsworking in sparser networks were also constructed andtheir performance w as proved to be of the smallest possi-ble order of magnitude among gossiping schemes workingin these networks. For example, in case of the modelN l , an almost safe gossiping algorithm was constructed,working in time O ( n w ( n ) ) nd using O ( n 2 w ( n ) ) alls,for any function w(n) -, 03. This algorithm used onlyO ( n w ( n ) ) inks for communication. It was proved, onthe other hand, that almost safe gossiping in time 0 nor using O(n2 ) alls is impossible in any network witho ( n 2 ) inks.The results from [1 ] should be compared to thosefrom [281. Consider the model L2 from [111:whisperingwith fault-free nodes an d faulty links. Suppo se that nodevalues are only 1 or 0. The lower bound T E O(n og n)on gossiping time, proved in [111, remains true for crashfaults as well. This lower bound is valid for stronglynonadaptive algorithms described above and for one-bitpackets. On the other hand, the algorithm from [28]works for link and node crash faults. It is not stronglynonadaptive in the above sense and works in time O(n )for unit-size packets. However, as previously remarked,in [281, labels have to be attached to nodes; thus, unit-size packets must, in fact, contain a logarithmic numberof bits. The following question remains open: Supposethat links and nodes of the complete network are subjectto permanent crash faults with probabilities p < 1 and q< 1, respectively, and that all values of nodes have aconstant number of bits. Does there exist an almost safegossiping algorithm using packets containing a constantnumber of bits and working in linear time? A positiveanswer to this question for transient link faults and fault-free nodes was given in [181.In [231, the linear broadcasting problem, discussed inSection 3, was considered under the name of token dis-persal. The authors assumed that links and/or nodes ofthe (co m ple te) network fail independently with constantprobabilities and that an attempt to send a token to afaulty node or via a faulty link does not succeed (i.e., thetoken remains at the sending no de ). Every fault-free nodeha s to be visited. The performance measure adopted in[23] was the running time of the scheme. If only nodesor only links can fail ( p = 0 or q = 0). hen an almostsafe, token dispersal algorithm was shown with runningtime O(6).f both links and nodes can fail (p, q > 01,then almost safe token dispersal was proposed with run-ning time O ( G ) . n both cases, algorithms werenonadaptive and the order of magnitude was proved tobe optimal. Other fault-tolerant aspects of the token dis-persal problem were considered in [24].4.2. Transient FaultsWe now assume that individual calls fail with constantprobability 0 zs p < 1 and all failures, including those

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    12/14

    of calls placed along the same link, are independent. Nodefailures occur with probability 0 5 9 < I ; they are perma-nent and independen t. Faults are of crash type and pack etsare of unbounded size. The comm unication mode is full-duplex whispering.In [601. adaptive broadcasting for the complete net-work was considered in this model, assuming arbitrary

    p , q < 1. An alm ost safe broadcasting algorithm requiringexpected time O(1og n ) nd worst-case time O ( n og n )was constructed. An improvement of this result followsfrom [171, where almost safe, nonadaptive broadcastingworking in t ime O(log n ) was given. The result from[ 171 holds for transient faults, as well. and thus yieldsworst-case time O( og n ) .In the following papers, nodes were assumed fault-free, i.e., q = 0. Under this assumption, communicationin n-node bounded degree networks of diameter D ( n )was investigated in [261. It was shown that almost safe,nonadaptive broadcasting and gossiping can be performedin such networks in time O ( D ( n ) ) sing O ( n log n )calls. Adaptive broadcasting and gossiping algorithmsalso were designed, working in worst-case time O (D ( n))and using an expected linear number of calls. All theseorders of magnitude are optimal.A variation of this model for unit-size packets wasconsidered in [ 181. Under this scenario, gossiping musttake t ime at least linear in n , as every node must readthe value of every other node. In [ 1 8 ] , an almost safenonadaptive gossiping algorithm working in time O (n )was constructed for the large class of networks havingspanning trees of bounded maximum degree (including,e.g., all Hamiltonian graphs). It is worth noting that thegossiping scheme in [181was constructed in such a waythat each node knows in advance the order in which it isgoing to get values of other nodes. Thus, it was not neces-sary to append node labels to their values during transmis-sions and, consequently, unit-size packets really meantpackets containing a constant num ber of bits, in the casewhen values were of constant size. This should be com-pared to the scenario in [ 2 8 ] , described in Subsec-tion 4.1.The relationships between almost safe broadcastingtime and th e number of links in the network were studiedin [61] for the shouting communication mode and crashtransmission faults. It was show n that the minimum tim eof almost sa fe broadcasting in networks with e ( n ) linksis T E @(n log n l e ( n ) ) .Byzantine transmission faults were considered in [551.In this case, the assum ption p < is necessary; otherwise,no reliable communication can be guaranteed. The prob-lem considered in [ 5 5 ] is that of the minimum time re-quired for alm ost safe nonadaptive broadcasting in the n -node chain. It follows from [26] that this time is O ( n ) ,if faults are of crash type. On the other hand, for Byzan-tine faults, a simple algorithm working in time O ( n log

    n ) an be constructed. Diks and Pelc (see Problem 29 in[ 3 4 ] ) asked if there exists an almost safe broadcastingalgorithm w orking in time O ( n ) or Byzantine faults. Th emain result of [55] is the construction and analysis ofsuch an algorithm.

    5. FUTURE RESEARCHThis survey demonstrates that although an important bodyof research exists concerning fault-tolerant broadcastingand gossiping, our understanding of the relations betw eenefficiency and fault tolerance of communication algo-rithms for these tasks is still fragmentary and incomplete.At least three groups of open problems can be specifiedon the basis of already obtained research results. Researchdirected toward their solution is likely to deepen the un-derstanding of the domain and increase the potential ap-plicability of theoretical results in practice.The first group of problems concerns tightening th egaps between upper and lower bounds on the minimumtime and /or the m inimum num ber of calls in fault-tolerantbroadcasting and gossiping. In som e cases, these gaps arefairly small, and the remaining open problems a re mostlyof combinatorial interest. In other situations, however,even the exact order of magnitude of minimum time orminimum number of calls in fault-tolerant communicationhas not yet been established. Future developments mayhave a significant impact on the efficiency of actuallyimplemented communication schemes.The second direction for future research concerns in-vestigating new communication and fault models arisingas a consequenc e of emerg ing technologies. Although, aswe have seen, a large number of hypotheses have alreadybeen considered, their possible combinations yielding aplethora of potential m odels, important challenges are thestudy of models that faithfully describe existing patternsof communication and the exploration of features likelyto characterize networks built in the future.Finally, it can be seen that most of the research doneto date in the surveyed area concerns complete networksand hypercubes. Good topological properties of these net-works, such as symmetry and high connectivity, enablethem to support efficient and robust com munication algo-rithms. However, especially in the case of complete net-works, they are increasingly difficult and costly to buildas the number of nodes grows. Hence, there is a need toexplore fault-tolerant capabilities of other networks, inparticular, sparser ones. H ere, again, the types of networksactually used in practice provide im portant and challeng-ing research targets.

    This research was supported in part by NSERC Grant OGP0008136.

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    13/14

    FAULT-TOLERANT BROADCASTING AND GOSSIPING 1%REFERENCES

    R. Ahlswede, L. Gargano, H. S . Haroutunian, and L. H .Khachatrian, Fault-tolerant minimum broadcast net-works. Networks 27 ( 1996) 293-307.A. Bagchi and S. L. H akimi, Information disseminationin distributed systems with faulty units. IEEE Trans.Comp. 43 (1994) 98-710.F. Bao and Y. garashi, Reliable broadca sting in productnetworks with Byzantine faults. Proceedings of the 26thAnnual International Symposium on Fault-TolerantComputing ( 1996) 262-27 .F. Bao, Y. garashi, and K. Katano, Broadcasting inhypercubes with randomly distributed Byzantine faults.Proceedings WDA G95, LNCS 972, 21 5-229.M. Barborak, M. Malek, and A. Dahbura, The consen susproblem in fault-tolerant com puting.ACM Comput. Su n.

    P. Berman, K. Diks, and A. Pelc, Reliable broadcastingin logarithmic time w ith Byzan tine link failures. J. Alg.,to appear.K. A. Berman and M. Hawrylycz. Telephone problemswith failures. SIAMJ . Alg. Disc. Methods 7 ( 1986) 13-17.D. P. Bertsek as and J. N. Tsitsiklis, Parallel and Distrib-uted Computation: Numerical Methods. Prentice-Hall,Englewood Cliffs, NJ (1989).D. Bienstock, Broadcasting with random faults. Disc.Appl. Math. 20 (1988) -7.S. Bitan and S. Zaks, Optimal linear broadcast. J. A f g .D. Blough and A. Pelc, Optimal communication in net-works with randomly distributed Byzantine faults. Net-works 23 ( 1993) 691 -701.J. Bruck, On optimal broadcasting in faulty hypercubes.Disc. Appl. Math. 53 ( 1994)3-13.S.Carlsson, Y. garashi, K. Kanai, A. Lingas, K. Miura,and 0. Pe ter so n, Information disseminating schemes forfault-tolerance in hypercubes. IEICE Trans. Fund. E75S. C. Chau. Fast and robust broadca sting in faulty hyper-cubes. Manuscript.S. C. Chau and A. L. Liestman, Constructing fault-toler-ant m inimal broadcast networks, J. Comb. In5 Sys. Sci.B. S. Chlebus, K.Diks, and A. Pelc, Optimal broadcast-ing in faulty hypercubes. Digest of Papers, FTCS21(1991) 66-273.B. S. Chlebus, K. Diks, and A. Pelc, Sparse networkssupporting efficient reliable broadcasting. Proc.B. S. Chlebus. K. Diks, and A. Pelc, Fast gossiping withshort unreliable messages. Disc. Appl. M ath. 53 (1994)15-24.B. S. Chlebus. K. Diks, and A. Pelc, Waking up ananonymous faulty network from a single source. Pro-ceedings of the 27th Annual Hawaii International Con-ference on System Sciences 2 Jan. 1994) 187-193.

    25 (1993) 171-220.

    14(1993) 88-315.

    (1992) 55-260.

    10 (1986) 1-18.

    ICALP93, LNCS 700, 388-397.

    301

    L 391

    B. S. Chlebus, K. Diks, and A. Pelc, Reliable broadcast-ing in hypercubes with random link and node failures.Combin. fr ob . Comput. , to appear.B. S. Chlebus, K.Diks, and A. Pelc, Broadcasting insynchronous networks with dynam ic faults. Networks 27C. T. Chou and I. S.Gopal, Linear broadcast routing.K. Diks, A. Malinowski, and A. Pelc, Reliable tokendispersal with random faults. Par. Proc. Lett. 4 (1994)K. Diks, A. Malinowski, and A. Pelc, Token transfer ina faulty network. Theor. Znf: Appl. 29 (1995) 83-400.K. Diks and A. P elc, Reliable gossip schemes with ran-dom link failures. Pro ceedi ngs of the 28th Annual Aller -ton Conference on Communication, Control and Com-puting. (Oct. 1990) 978-987.K. Diks and A. Pelc, Almost safe gossiping in boundeddegree networks. SIAM J. Disc. Math. 5 (1992) 338-344.K.Diks and A. Pelc, Fault-tolerant linear broadcasting.Proceedings of the First C ana da- France C onference onParallel and Distributed Computing, Theory and Prac-tice, Montreal. Canada, LNCS 805 (May 1994) 207-217.K. Diks and A. Pelc, Efficient gossiping by packets innetworks with random faults. SIAM J. Disc. Math. 9A. F arley, Reliable minimum-time broadcast networks.Proceedings of the 18th SE Conference on Combinato-rics, G raph Theory and Computing. Congress. Numer.U. Feige, D. Peleg, P. Raghavan , and E. Upfal, Random-ized broadcast in networks. Random Struct. Alg. 1(1990) 47-460.G. ox , M. ohnsson, G. Lyzenga, S. Otto, J. Salmon,and D. Walker, Solving Problems on Concurrent P roces-sors. Prentice-Hall, Englewood C liffs, NJ, (1988).P. Fraigniaud. A symptotically optimal broadcasting andgossiping in faulty hypercube multicomputers. IEEETrans. Comp . 41 (1992) 1410-1419.P. Fraigniaud and E. Lazard, Methods and problems ofcomm unication in usual networks. Disc. Appl. Math. 53P. Fraigniaud, A. L. Liestman, and D. Sotteau, Eds.,Open problems. Par. P roc. Lett. 3 (1993) 07-524.P. Fraigniaud and C. Peyrat, Broadcasting in a hyper-cube when some calls fail. Inf. Proc. Lett. 39 (1991)A. Frieze and M. Molloy, Broadcasting in randomgraphs. Disc. Appl. Math. 54 (1994) 7-80.L. G argano, Tighter bounds on fault-tolerant broadcast-ing and gossiping. Networks 22 (1992) 69-486.L. Gargano, A. L. Liestman, J. Peters, and D. Richards,Reliable broadcasting. Discr. Appl. Math. 53 (1994)L. Gargano and A. A. Rescigno, Comm unication com-

    (1996) 09-318.J . Alg. 10(1989) 90-517.

    417-427.

    (1996) -18.

    59 (1987) 7-48.

    (1994) 9-133.

    115-1 19.

    135- 148.

  • 8/8/2019 Fault-Tolerant Broadcasting and Gossiping in Communication Networks

    14/14

    156 PELCplexity of fault-tolerant information diffusion. Proceed-ings of the Fifrh IEEE Symposium on Parallel and Dis-tributed Computing ( 1993) .L. Gargano, A. A. Rescigno, and U. Vaccaro, Fault-tolerant hypercube broadcasting via information dis-persal. Networks 23 ( 1993) 271 -282.

    [411 L. Gargano. A. A. Rescigno, and U. Vaccaro, Minimumtime broadcast in faulty star networks. Manuscript.[ 421 L. Gargano and U. Vaccaro, Minimum time networkstolerating a logarithmic number of faults. SIAM J . Disc.Math. 5 (1992) 178-198.

    L. Gpieniec and A. Pelc, Broadcasting with a boundedfraction of faulty nodes. Technica l Report RR 95/01-1,Universitt du Qu6bec i Hull ( 1995) .L. Gpieniec and A. Pelc, Broadcasting with linearlybounded transmission faults. Technical Report RR 95104-7, Universitt du Quebec i Hull ( 1995).L. Gpieniec and A. Pelc, Adaptive broadcasting withfaulty nodes. Par. Comput.. to appear.A. V. Gerbessiotis, Broadcasting in random graphs.Discr. Appl. Math. 53 (1994) 149-170.A. V. Gerbe ssiotis , Close-to-optimal and near-optimalbroadcasting in random graphs. Disc. Appl. Math. 63R. W. Haddad, S.Roy, and A. A. Schaffer. O n gossipingwith faulty telephone lines. SIAM J. Alg . Disc. Methods

    [ 491 Y. an , Y. garashi, K. Kanai, and K. Miura, Broadcast-ing in faulty binary jumping ne tworks. J.Par. Dist. Com-S.M. Hedetniemi, S.T. Hedetniemi. and A. L. Liestman,A survey of gossiping and broadcasting in communica-tion networks. Networks 18 (1988) 319-349.

    [ 511 J . HromkoviE, R. Klasing, B. Monien, and R. Peine,Dissemination of information in interconnection net-works (broadcasting and gossiping). CombinarorialNer-work Theory. (F. Hsu and D.-Z. Du, Eds.). SciencePress & AMS , to appear.Y. Igarashi, K. Kanai, K. Miura, and S.Osawa, Optimalschemes for disseminating information and their fault-tolerance. IEICE Trans. Inf: Syst. E75 ( 1992) 22-29.S. L. Johnsson and C. T. Ho, Matrix multiplication onBoolean cubes using ge neric communication primitives.

    [40]

    [ 431

    [ 441

    [45][46][ 471

    (1995) 129-150.[481

    8 (1987) 439-445.

    put. 23 (1994) 462-467.[ 501

    [ 521

    [ 531

    Parallel Processing and Medium-Scale MultiprocessorsL. H. Khachatrian and H. S. Harutounian, On optimalbroadcast graphs. Proceedings of the Fourth Interna-tional C olloquium on Coding Th eory, Armenia ( 1990)L. KuEera, Broadcasting through a noisy one-dimen-sional network. Technical Report MPI-1-93- 106, Max-Planck-Institut fur Infonnatik ( 1993).A. L. Liestman, Fault-tolerant broadcast graphs. Net-works 15 (1985) 159-171.G. M addaluno , Algorithm s for the construction of fault-tolerant networks (in Italian). Thesis, Universita di Sa-lerno, 1987.S. Moran, M essage complexity versus spac e complexityin fault tolerant broadcast protocols. Networks 19( 1989)S. Ohring and D. H.Hohndel, Optimal fault tolerantcommunication algorithms on product networks usingspanning trees. Proceedings of the 6th IEEE Symposiumon Parallel and Distributed Processing ( 1994) 188-195.A. Pelc. Broadcasting in complete netw orks with faultynodes using unreliable calls. I n f . Proc. Lett. 40 ( 1991)169-174.A. Pelc, Broadcasting time in sparse networks with fau ltytransmissions. Par. Proc. Lett. 2 1992) 355-361.A. Pelc, Fast fault-tolerant broadcasting and gossiping.Proceed ings of the 2nd C olloquium on Structural Infor-mation and Communication Complexity, SIROCC0'95,Greece (June 1995) 159-172.D. Peleg. A note on optimal time broadcast in faultyhypercubes. J. Par. Dist. Comp. 26 (1995) 132-135.D. Peleg and A. A. Schaffer, Time bounds on fault-tolerant broadcasting. Neiworks 19 ( 1989) 803-822.P. Ramanathan and K. G. Shin, Reliable broadcast inhypercube multicomputers. IEEE Trans. Comp. 37E. R. Scheinerman and J . C. Wierman, Optimal andnear-optimal broadcast in random graphs. Disc. Appl.Math. 25 (1989) 289-297.

    (A. W ouk, Ed.). SIAM (1989) 108-156.

    69-77.

    505 -5 19.

    (1988) 1654-1657.

    Received November 22, 1995Accepted March 27, 1996