
Automatica, Vol. 27, No. 1, pp. 3-21, 1991
Printed in Great Britain.

0005-1098/91 $3.00 + 0.00
Pergamon Press plc
© 1990 International Federation of Automatic Control

Survey Paper

Some Aspects of Parallel and Distributed Iterative Algorithms - A Survey*

DIMITRI P. BERTSEKAS‡§ and JOHN N. TSITSIKLIS‡

Iterative methods suitable for use in parallel and distributed computing systems are surveyed. Both synchronous and asynchronous implementations are discussed. A number of theoretical issues regarding the validity of asynchronous algorithms are addressed.

Key Words: Computational methods; distributed data processing; iterative methods; parallel processing; asynchronous algorithms; parallel algorithms; distributed algorithms.

Abstract: We consider iterative algorithms of the form x := f(x), executed by a parallel or distributed computing system. We first consider synchronous executions of such iterations and study their communication requirements, as well as issues related to processor synchronization. We also discuss the parallelization of iterations of the Gauss-Seidel type. We then consider asynchronous implementations whereby each processor iterates on a different component of x, at its own pace, using the most recently received (but possibly outdated) information on the remaining components of x. While certain algorithms may fail to converge when implemented asynchronously, a large number of positive convergence results is available. We classify asynchronous algorithms into three main categories, depending on the amount of asynchronism they can tolerate, and survey the corresponding convergence results. We also discuss issues related to their termination.

1. INTRODUCTION

PARALLEL AND DISTRIBUTED computing systems have received broad attention motivated by several different types of applications. Roughly speaking, parallel computing systems consist of several tightly coupled processors that are located within a small distance of each other. Their main purpose is to execute jointly a computational task and they have been designed with such a purpose in mind: communication between processors is fast and reliable. Distributed computing systems are somewhat different in a number of respects. Processors are loosely coupled with little, if any, central coordination and control, and interprocessor communication is more problematic. Communication delays can be unpredictable, and the communication links themselves can be unreliable. Finally, while the architecture of a parallel system is usually chosen with a particular set of computational tasks in mind, the structure of distributed systems is often dictated by exogenous considerations. Nevertheless, there are several algorithmic issues that arise in both parallel and distributed systems and that can be addressed jointly. To avoid repetition, we will mostly employ in the sequel the term "distributed", but it should be kept in mind that most of the discussion applies to parallel systems as well.

There are at least two contexts where distributed computation has played a significant role. The first is the context of information acquisition, information extraction, and control within spatially distributed systems. An example is a sensor network in which a set of geographically distributed sensors obtain information on the state of the environment and process it cooperatively. Another example is provided by data communication networks in which certain functions of the network (such as correct and timely routing of messages) have to be controlled in a distributed manner, through the cooperation of the computers residing at the nodes of the network. Other applications are possible in the quasistatic decentralized control of large scale systems whereby certain parameters (e.g. operating points for each subsystem) are to be optimized locally, while taking into account interactions with neighboring subsystems. The second important context for parallel or distributed computation is the solution of very large computational problems in which no single processor has sufficient computational power to tackle the problem on its own.

The ideas of this paper are relevant to both contexts, but our presentation will emphasize large scale numerical computation issues and iterative methods in particular. Accordingly, we shall consider

* Received 10 November 1988; revised 12 February 1990; received in final form 11 March 1990. The original version of this paper was presented at the IFAC/IMACS Symposium on Distributed Intelligence Systems which was held in Varna, Bulgaria, during June 1988. The published Proceedings of this IFAC Meeting may be ordered from: Pergamon Press plc, Headington Hill Hall, Oxford OX3 0BW, U.K. This paper was recommended for publication in revised form by Editor K. J. Åström.

† Research supported by the NSF under Grants ECS-8519058 and ECS-8552419, with matching funds from Bellcore, Du Pont and IBM, and by the ARO under Grant DAAL03-86-K-0171.

‡ Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

§ Author to whom all correspondence should be addressed.


algorithms of the form x := f(x), where x = (x_1, ..., x_n) is a vector in ℜ^n and f: ℜ^n → ℜ^n is an iteration mapping defining the algorithm. In many interesting applications, it is natural to consider distributed executions of this iteration whereby the ith processor updates x_i according to the formula

x_i := f_i(x_1, ..., x_n),     (1.1)

while receiving information from other processors on the current values of the remaining components.

Our discussion of distributed implementations of iteration (1.1) focuses on mechanisms for interprocessor communication and synchronization. We also consider asynchronous implementations and present a survey of the convergence issues that arise in the face of asynchronism. These issues are discussed in more detail in Bertsekas and Tsitsiklis (1989b), where proofs of most of the results quoted here can be found.

Iteration (1.1) can be executed synchronously, whereby processors perform an iteration, communicate their results to the other processors, and then proceed to the next iteration. In Section 2, we introduce two alternative synchronous iterations, namely Jacobi type and Gauss-Seidel type iterations, and discuss briefly their parallelization. In Section 3, we indicate that synchronous parallel execution is feasible even if the underlying computing system is inherently asynchronous (i.e. no processor has access to a global clock), provided that certain synchronization mechanisms are in place. We review and compare three representative synchronization methods. We also discuss some basic communication problems that arise naturally in parallel iterations, assuming that processors communicate using a point-to-point communication network. Then, in Section 4, we provide a more detailed analysis of the required time per parallel iteration. In Section 5, we indicate that the synchronous execution of iteration (1.1) can have certain drawbacks, thus motivating asynchronous implementations whereby each processor computes at its own pace while receiving (possibly outdated) information on the values of the components updated by the other processors. An asynchronous implementation of iteration (1.1) is not mathematically equivalent to its synchronous counterpart, and an otherwise convergent algorithm may become divergent. It will be seen that asynchronous iterative algorithms can display several different convergence behaviors, ranging from divergence to guaranteed convergence in the face of the worst possible amount of asynchronism and communication delays. We classify the possible behaviors in three broad classes; the corresponding convergence results are surveyed in Sections 6, 7 and 8, respectively. In Section 9, we address some difficulties that arise if we wish to terminate an asynchronous distributed algorithm in finite time. Finally, Section 10 contains our conclusions and a brief discussion of future research directions.

2. JACOBI AND GAUSS-SEIDEL ITERATIONS

Let X_1, ..., X_p be subsets of the Euclidean spaces ℜ^{n_1}, ..., ℜ^{n_p}, respectively. Let n = n_1 + ... + n_p, and let X ⊂ ℜ^n be the Cartesian product X = X_1 × ... × X_p. Accordingly, any x ∈ ℜ^n is decomposed in the form x = (x_1, ..., x_p), with each x_i belonging to ℜ^{n_i}. For i = 1, ..., p, let f_i: X → X_i be a given function and let f: X → X be the function defined by f(x) = (f_1(x), ..., f_p(x)) for every x ∈ X. We want to solve the fixed point problem x = f(x). To this end we will consider the iteration

x := f(x).

We will also consider the more general iteration

x_i := f_i(x) if i ∈ I, and x_i := x_i otherwise,     (2.1)

where I is a subset of the component index set {1, ..., p}, which may change from one iteration to the next.

We are interested in the distributed implementation of such iterations. While some of the discussion applies to shared memory systems, we will focus in this and the next two sections on a message-passing system with p processors, each having its own local memory and communicating with the other processors over a communication network. We assume that the ith processor has the responsibility of updating the ith component x_i according to the rule x_i := f_i(x). It is implicitly assumed here that the ith processor knows the form of the function f_i. In the special case where f(x) = Ax + b, where A is an n × n matrix and b ∈ ℜ^n, this amounts to assuming that the ith processor knows the rows of the matrix A corresponding to the components assigned to it. Other implementations of the linear iteration x := Ax + b are also possible. For example, each processor could be given certain columns of A. We do not pursue this issue further and refer the reader to McBryan and Van der Velde (1987) and Fox et al. (1988) for discussions of alternative matrix storage schemes.

For the implementation of the iteration, it is seen that if the function f_j depends on x_i (with i ≠ j), then processor j must be informed by processor i of the current value of x_i. To capture such data dependencies, we form a directed graph G = (N, A), called the dependency graph of the algorithm, with nodes N = {1, ..., p} and with arcs A = {(i, j) | i ≠ j and f_j depends on x_i}. We assume that for every arc (i, j) in the dependency graph there is a communication capability by means of which processor i can relay information to processor j. We also assume that messages are received correctly within a finite but otherwise arbitrary amount of time. Such communication may be possible through a direct communication link joining processors i and j, or it could consist of a multi-hop path in a communication network. The discussion that follows applies to both cases.

An iteration in which all of the components of x are simultaneously updated [I = {1, ..., p} in (2.1)] is sometimes called a Jacobi type iteration. In an alternative form, the components of x are updated one at a time, and the most recently computed values of the other components are used. The resulting iteration is often called an iteration of the

Gauss-Seidel type and is described mathematically by

x_i(t + 1) = f_i(x_1(t + 1), ..., x_{i-1}(t + 1), x_i(t), ..., x_p(t)),     i = 1, ..., p.     (2.2)

In a serial computing environment, Gauss-Seidel iterations are often preferable. As an example, consider the linear case where f(x) = Ax + b, and A has non-negative elements and spectral radius less than one. Then, the classical Stein-Rosenberg theorem [see e.g. Bertsekas and Tsitsiklis (1989b, p. 152)] states that both the Gauss-Seidel and the Jacobi iterations converge at a geometric rate to the unique fixed point of f; however, in a serial setting where one Jacobi iteration takes as much time as one Gauss-Seidel iteration, the rate of convergence of the Gauss-Seidel iteration is always faster. Surprisingly, in a parallel setting this conclusion is reversed, as we now describe in a somewhat more general context.
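To make the distinction between the two update orders concrete, the following minimal sketch performs one Jacobi sweep and one Gauss-Seidel sweep for the linear mapping f(x) = Ax + b considered above; the particular matrix and vector are hypothetical, chosen only so that A is non-negative with spectral radius less than one.

```python
import numpy as np

def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every component is updated from the old vector x."""
    return A @ x + b

def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: component i uses the already-updated values
    of components 1, ..., i-1, as in equation (2.2)."""
    x = x.copy()
    for i in range(len(x)):
        x[i] = A[i] @ x + b[i]
    return x

# Hypothetical example data: A is non-negative with spectral radius < 1,
# so both iterations converge to the unique fixed point of f(x) = Ax + b.
A = np.array([[0.0, 0.3, 0.2],
              [0.1, 0.0, 0.3],
              [0.2, 0.1, 0.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(np.eye(3) - A, b)   # fixed point, for reference

x_j = np.zeros(3)
x_gs = np.zeros(3)
for t in range(50):
    x_j = jacobi_sweep(A, b, x_j)
    x_gs = gauss_seidel_sweep(A, b, x_gs)

print(np.linalg.norm(x_j - x_star), np.linalg.norm(x_gs - x_star))
```

Counting sweeps, the Gauss-Seidel iterate is typically the closer of the two to the fixed point, which is the serial phenomenon captured by the Stein-Rosenberg theorem; the parallel comparison that follows measures time differently.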

Consider the sequence {x^J(t)} generated by the Jacobi iteration

x^J(t + 1) = f(x^J(t)),     t = 0, 1, ...,     (2.3)

and the sequence {x^GS(t)} generated by the Gauss-Seidel iteration (2.2), started from the same initial condition x(0) = x^J(0) = x^GS(0). The following result is proved in Tsitsiklis (1989), generalizing an earlier result of Smart and White (1988):

Proposition 1. Suppose that f: ℜ^n → ℜ^n has a unique fixed point x*, and is monotone, that is, it satisfies f(x) ≤ f(y) if x ≤ y. Then, if f(x(0)) ≤ x(0), we have

x* ≤ x^J(pt) ≤ x^GS(t),     t = 0, 1, ...,

and if x(0) ≤ f(x(0)), we have

x^GS(t) ≤ x^J(pt) ≤ x*,     t = 0, 1, ....

Proposition 1 establishes the faster parallel convergence of the Jacobi iteration, for certain initial conditions, assuming that a Gauss-Seidel iteration takes as much parallel time as p Jacobi iterations. It has also been shown in Smart and White (1988) that if, in addition to the assumptions of Proposition 1, f is linear (and thus satisfies the assumptions of the Stein-Rosenberg theorem), the rate of convergence of the Jacobi iteration is faster than the rate of convergence of the Gauss-Seidel iteration. An extension of this result that applies to asynchronous Jacobi and Gauss-Seidel iterations is also given in Bertsekas and Tsitsiklis (1989a).

The preceding comparison of Jacobi and Gauss-Seidel iterations assumes that a Jacobi iteration is executed in one time step, and that the Gauss-Seidel iteration cannot be parallelized (so that a full update of all the components x_1, ..., x_p requires p time steps). This is the case when the number of available processors is p and the dependency graph describing the structure of the iteration is complete (every component depends on every other component), so that no two components can be updated in parallel. A Gauss-Seidel iteration can still converge faster, however, if it can be parallelized to the point where it requires the same number of time steps as the corresponding Jacobi iteration; this can happen if the number of available processors is less than p and the dependency graph is sufficiently sparse, as we now illustrate.

FIG. 1. A dependency graph.

Consider the dependency graph of Fig. 1. A corresponding Gauss-Seidel iteration is described by

x_1(t + 1) = f_1(x_1(t), x_3(t))
x_2(t + 1) = f_2(x_1(t + 1), x_2(t))
x_3(t + 1) = f_3(x_2(t + 1), x_3(t), x_4(t))
x_4(t + 1) = f_4(x_2(t + 1), x_4(t))

and its structure is shown in Fig. 2. We notice here that x_3(t + 1) and x_4(t + 1) can be computed in parallel. In particular, a sweep, that is, an update of all four components, can be performed in only three stages.

FIG. 2. The data dependencies in a Gauss-Seidel iteration.

On the other hand, a different ordering of the components leads to an iteration of the form

x_1(t + 1) = f_1(x_1(t), x_3(t))
x_3(t + 1) = f_3(x_2(t), x_3(t), x_4(t))
x_4(t + 1) = f_4(x_2(t), x_4(t))
x_2(t + 1) = f_2(x_1(t + 1), x_2(t))

which is illustrated in Fig. 3. We notice here that x_1(t + 1), x_3(t + 1), and x_4(t + 1) can be computed in parallel, and a sweep requires only two stages.

FIG. 3. The data dependencies in a Gauss-Seidel iteration for a different updating order.

The above example motivates the problem of choosing an ordering of the components for which a sweep requires the least number of stages. The solution of this problem, given in Bertsekas and Tsitsiklis (1989b, p. 23), is as follows:

Proposition 2. The following are equivalent:
(i) There exists an ordering of the variables such that a sweep of the corresponding Gauss-Seidel algorithm can be performed in K parallel steps.
(ii) We can assign colors to the nodes of the dependency graph so that at most K different colors

are used and so that each subgraph obtained by restricting to the set of nodes with the same color has no directed cycles.

A well known special case of the above proposition arises when the dependency graph G is symmetric, that is, the presence of an arc (i, j) ∈ A also implies the presence of the arc (j, i). In this case there is no need to distinguish between directed and undirected cycles, and the coloring problem of Proposition 2 reduces to coloring the nodes of the dependency graph so that no two neighboring nodes have the same color.

Unfortunately, the coloring problem of Proposition 2 is intractable (NP-hard). On the other hand, in several practical situations the dependency graph G has a very simple structure and the coloring problem can be solved by inspection. Furthermore, it can be shown that if the dependency graph is a tree or a two-dimensional grid, only two colors suffice, so a Gauss-Seidel sweep can be done in two steps, with roughly half the components of x being updated in parallel at each step. In this case, while with n processors the Jacobi method is as fast or faster than Gauss-Seidel, the reverse is true when using n/2 processors (or, more generally, any number of processors with which a Gauss-Seidel step can be completed in the same time as the Jacobi iteration).

Even with unstructured dependency graphs, reasonably good colorings can be found using simple heuristics; see Zenios and Lasken (1988) and Zenios and Mulvey (1988) for examples. Let us also point out that the parallelization of Gauss-Seidel methods by means of coloring is very common in the context of the numerical solution of partial differential equations; see, for example, Ortega and Voigt (1985) and the references therein.

A related approach for parallelizing Gauss-Seidel iterations, which is fairly easy to implement, is discussed in Barbosa (1986) and Barbosa and Gafni (1987). In this approach, a new sweep is allowed to start before the previous one has been completed and, for this reason, one obtains, in general, somewhat greater parallelism than that obtained by the coloring approach.

We finally note that the order in which the variables are updated in a Gauss-Seidel sweep may have a significant effect on the convergence rate of the iteration. Thus, completing a Gauss-Seidel sweep in a minimum number of steps is not the only consideration in selecting the grouping of variables to be updated in parallel; the corresponding rate of convergence must also be taken into account.
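As an illustration of the coloring approach just described, the following sketch applies a simple greedy heuristic to a symmetric dependency graph, given as an adjacency list, and groups the components of a Gauss-Seidel sweep into parallel stages, one stage per color. The small example graph is hypothetical, and the heuristic makes no claim of using the minimum number of colors.

```python
def greedy_coloring(neighbors):
    """Assign to each node the smallest color not used by an already-colored
    neighbor (assumes a symmetric dependency graph: j in neighbors[i] iff
    i in neighbors[j])."""
    color = {}
    for node in sorted(neighbors):
        used = {color[v] for v in neighbors[node] if v in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color

def sweep_stages(neighbors):
    """Group components into parallel stages of a Gauss-Seidel sweep:
    all components with the same color are updated simultaneously."""
    color = greedy_coloring(neighbors)
    stages = [[] for _ in range(max(color.values()) + 1)]
    for node, c in color.items():
        stages[c].append(node)
    return stages

# Hypothetical 4-component dependency graph, written symmetrically.
deps = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
print(sweep_stages(deps))   # e.g. [[1, 4], [2], [3]]: a sweep in three stages
```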


3. SYNCHRONIZATION AND COMMUNICATION ISSUES

We say that an execution of iteration (2.1) is synchronous if it can be described mathematically by the formula

x_i(t + 1) = f_i(x_1(t), ..., x_p(t)) if t ∈ T^i, and x_i(t + 1) = x_i(t) otherwise.     (3.1)

Here, t is an integer-valued variable used to index different iterations, not necessarily representing real time, and T^i is an infinite subset of the index set {0, 1, ...}. Thus, T^i is the set of time indices at which x_i is updated. With different choices of T^i one obtains different algorithms, including Jacobi and Gauss-Seidel type methods. We will later contrast synchronous iterations with asynchronous iterations, where instead of the current component values x_j(t), earlier values x_j(t - d) are used in (3.1), with d being a possibly positive and unpredictable "communication delay" that depends on i, j and t.
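The role of the sets T^i can be illustrated with a small sketch of the model (3.1): the same mapping is iterated first with T^i = {0, 1, 2, ...} for every i (a Jacobi iteration) and then with the update times of the components staggered (a Gauss-Seidel-like schedule). The linear mapping used here is a hypothetical example.

```python
import numpy as np

def synchronous_iteration(f, x0, in_T, steps):
    """Model (3.1): component i is updated at the times t with in_T(i, t) True,
    and is left unchanged otherwise."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        fx = f(x)
        for i in range(len(x)):
            if in_T(i, t):
                x[i] = fx[i]
    return x

# Hypothetical contracting linear mapping f(x) = Ax + b.
A = np.array([[0.0, 0.4], [0.3, 0.0]])
b = np.array([1.0, 1.0])
f = lambda x: A @ x + b

x_jacobi = synchronous_iteration(f, [0.0, 0.0], lambda i, t: True, 20)
x_staggered = synchronous_iteration(f, [0.0, 0.0], lambda i, t: t % 2 == i, 40)
print(x_jacobi, x_staggered)   # both approach the fixed point of f
```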


3.1. Synchronization methods

Synchronous execution is certainly possible if the processors have access to a global clock, and if messages can be reliably transmitted from one processor to another between two consecutive "ticks" of the clock. Barring the existence of a global clock, synchronous execution can still be accomplished by using synchronization protocols called synchronizers. We refer the reader to Awerbuch (1985) for a comparative complexity analysis of a class of synchronizers, and we continue with a brief discussion of three representative synchronization methods. These methods will be described for the case of Jacobi type iterations, but they can be easily adapted for the case of Gauss-Seidel iterations as well.

(a) Global synchronization. Here the processors proceed to the (t + 1)st iteration, also referred to as a phase, only after every processor i has completed the tth iteration and has received the value of x_j(t) from every j such that (j, i) ∈ A. Global synchronization can be implemented by a variety of techniques, a simple one being the following: the processors are arranged as a spanning tree, with a particular processor chosen to be the root of the tree. Once processor i has computed x_i(t), has received the value of x_j(t) for every j such that (j, i) ∈ A, and has received a phase termination message from all of its "children" in the tree, it sends a phase termination message to its "father" in the tree. Phase termination messages thus propagate towards the root. Once the root has received a phase termination message from all of its children, it knows that the current phase has been completed and sends a phase initiation message to its children, which is propagated along the spanning tree. Once a processor receives such a message it can proceed to the next phase. (See Fig. 4 for an illustration.)

(b) Local synchronization. Global synchronization can be seen to be rather wasteful in terms of the time required per iteration. An alternative is to allow the ith processor to proceed with the (t + 1)st iteration as soon as it has received all the messages x_j(t) it needs. Thus, processor i moves ahead on the basis of local information alone, obviating the need for propagating messages along a spanning tree.

It is easily seen that the iterative computation can only proceed faster when local synchronization is employed. Furthermore, this conclusion can also be reached even if a more efficient global synchronization method were possible, whereby all processors start the (t + 1)st iteration immediately after all messages generated by the tth iteration have been delivered. (We refer to this hypothetical and practically unachievable situation as ideal global synchronization.) Let us assume that the time required for one computation and the communication delays are bounded above by a finite constant and are bounded below by a positive constant. Then it is easily shown that the time spent for a number K of iterations under ideal global synchronization is at most a constant multiple of the corresponding time when local synchronization is employed.

The advantage of local synchronization is better seen if communication delays do not obey any a priori bound. For example, let us assume that the communication delay of every message is an independent exponentially distributed random variable with mean one. Furthermore, suppose for simplicity that each processor sends messages to exactly d other processors, where d is some constant (i.e. the outdegree of each node of the dependency graph is equal to d). With global synchronization, the real time spent for one iteration is roughly equal to the maximum of dp independent exponential random variables and its expectation is, therefore, of the order of log (dp). Thus, the expected time needed for K iterations is of the order of K log (dp). On the other hand, with local synchronization, it turns out that the expected time for K iterations is of the order of log p + K log d [joint work with C. H. Papadimitriou; see Bertsekas and Tsitsiklis (1989b, p. 104)]. If K is large, then local synchronization is faster by a factor roughly equal to log (dp)/log d. Its advantage is more pronounced if d is much smaller than p, as is the case in most practical applications. Some related analysis and experiments can be found in Dubois and Briggs (1982).

(c) Synchronization via rollback. This method, introduced by Jefferson (1985), has been primarily applied to the simulation of discrete-event systems. It can also be viewed as a general purpose synchronization method, but it is likely to be inferior to the preceding two methods in applications involving the solution of systems of equations. Consider a situation where the message x_j(t) transmitted from some processor j to some other processor i is most likely to take a fixed default value known to i. In such a case, processor i may go ahead with the computation of x_i(t + 1) without waiting for the value of x_j(t), by making the assumption that x_j(t) will take the default value. In case a message comes later which falsifies the assumption that x_j(t) has the default value, then a rollback occurs; that is, the computation of x_i(t + 1) is invalidated and is performed once more, taking into account the correct value of x_j(t). Furthermore, if a processor has sent messages based on computations which are later invalidated, it sends antimessages which cancel the earlier messages. A reception of such an antimessage by some other processor k could invalidate some of k's computations and could trigger the transmission of further antimessages by k. This process has the potential of explosive generation of antimessages that could drain the available communication resources. On the other hand, it is hoped that the number of messages and antimessages would remain small in problems of practical interest, although insufficient analytical evidence is available at present. Some probabilistic analyses of the performance of this method can be found in Lavenberg et al. (1983) and Mitra and Mitrani (1984).
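The comparison between global and local synchronization discussed under (b) can be checked with a small simulation. Under the simplifying assumptions used in the text (unit-mean exponential message delays, negligible computation time, and each processor sending to d other processors), the sketch below estimates the completion time of K iterations under the two schemes; the ring-like choice of neighbors is only an illustrative assumption.

```python
import random

def simulate(p, d, K, seed=0):
    """Estimate completion time of K iterations with unit-mean exponential
    message delays, under global and local synchronization.  Processor i
    sends to its next d processors on a ring (an illustrative assumption)."""
    rng = random.Random(seed)
    out = {i: [(i + k) % p for k in range(1, d + 1)] for i in range(p)}
    # in-neighbors of i are the processors that send to i
    inn = {i: [j for j in range(p) if i in out[j]] for i in range(p)}

    t_global = 0.0
    done = [0.0] * p                      # local-sync completion times
    for _ in range(K):
        delay = {(j, i): rng.expovariate(1.0) for j in range(p) for i in out[j]}
        # global: everyone waits for the slowest of the d*p messages
        t_global += max(delay.values())
        # local: processor i proceeds once its own in-messages have arrived
        done = [max([done[i]] + [done[j] + delay[(j, i)] for j in inn[i]])
                for i in range(p)]
    return t_global, max(done)

print(simulate(p=100, d=3, K=50))   # local sync typically finishes much earlier
```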

FIG. 4. Illustration of the global synchronization method.

3.2. Single and multinode broadcasting

Regardless of whether the implementation is synchronous or not, it is necessary to exchange some information between the processors after each iteration. The interprocessor communication time can be substantial when compared to the time devoted to computations, and it is important to carry out the message exchanges as efficiently as possible. There are a number of generic communication problems that arise frequently in iterative and other algorithms. We describe a few such tasks related to message broadcasting.

In the first communication task, we want to send the same message from a given processor to every other processor (we call this a single node broadcast). In a generalized version of this problem, we want to do a single node broadcast simultaneously from all nodes (we call this a multinode broadcast). A typical example where a multinode broadcast is needed arises in the iteration x := f(x). If we assume that there is a separate processor assigned to component x_i, i = 1, ..., p, and that the function f depends on all components x_j, j = 1, ..., p, then, at the end of an iteration, there is a need for every processor to send the value of its component to every other processor, which is a multinode broadcast.

FIG. 5. (a) A single node broadcast uses a tree that is rooted at a given node (which is node 1 in the figure). The time next to each link is the time that transmission of the packet on the link begins. (b) A single node accumulation problem involving summation of n scalars a_1, ..., a_n (one per processor) at the given node (which is node 1 in the figure). The time next to each link is the time at which transmission of the "combined" packet on the link begins, assuming that the time for scalar addition is negligible relative to the time required for packet transmission.

Clearly, to solve the single node broadcast problem, it is sufficient to transmit the given node's message along a spanning tree rooted at the given node, that is, a spanning tree of the network together with a direction on each link of the tree such that there is a unique path from the given node (called the root) to every other node. With an optimal choice of such a spanning tree, a single node broadcast takes Θ(r) time,† where r is the diameter of the network, as shown in Fig. 5(a). To solve the multinode broadcast problem, we need to specify one spanning tree per root node. The difficulty here is that some links may belong to several spanning trees; this complicates the timing analysis, because several messages can arrive simultaneously at a node and require transmission on the same link, with a queueing delay resulting.

There are two important communication problems that are dual to the single and multinode broadcasts, in the sense that the spanning tree(s) used to solve one problem can also be used to solve the dual in the same amount of communication time. In the first problem, called single node accumulation, we want to send to a given node a message from every other node; we assume, however, that messages can be "combined" for transmission on any communication link, with a "combined" transmission time equal to the transmission time of a single message. This problem arises, for example, when we want to form at a given node a sum consisting of one term for each node, as in an inner product calculation [see Fig. 5(b)]; we can view addition of scalars at a node as "combining" the corresponding messages into a single message. The second problem, which is dual to a multinode broadcast, is called multinode accumulation, and involves a separate single node accumulation at each node. It can be shown that a single node (or multinode) accumulation problem can be solved in the same time as a single node (respectively multinode) broadcast problem, by realizing that an accumulation algorithm can be viewed as a broadcast algorithm running in reverse time, as illustrated in Fig. 5. As shown in Fig. 4, global synchronization can be accomplished by a single node broadcast followed by a single node accumulation.
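A minimal sketch of the two dual primitives on a tree may help fix ideas. It is not one of the optimal algorithms referred to in Table 1; it only illustrates how a single node broadcast pushes a value down a spanning tree and how the corresponding accumulation combines partial sums on the same links in reverse. The parent-pointer tree used below is a hypothetical example.

```python
def broadcast(parent, root, value):
    """Single node broadcast: the root's value is copied to every node by
    pushing it down the (spanning) tree."""
    children = {}
    for v, u in parent.items():
        children.setdefault(u, []).append(v)
    received = {root: value}
    frontier = [root]
    while frontier:
        u = frontier.pop()
        for v in children.get(u, []):
            received[v] = received[u]          # message travels link (u, v)
            frontier.append(v)
    return received

def accumulate(parent, root, values):
    """Single node accumulation: each node's value is added into its parent,
    deepest nodes first, so partial sums are "combined" on every tree link,
    exactly a broadcast run in reverse time."""
    def depth(v):
        return 0 if v == root else 1 + depth(parent[v])
    total = dict(values)
    for v in sorted(parent, key=depth, reverse=True):
        total[parent[v]] += total[v]           # combined packet on link (v, parent[v])
    return total[root]

# Hypothetical 6-node tree rooted at node 1, given by parent pointers.
parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
print(broadcast(parent, root=1, value=42))
print(accumulate(parent, root=1, values={i: float(i) for i in range(1, 7)}))  # 21.0
```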

Algorithms for solving the broadcast problems just described, together with other related communication problems, have been developed for several popular architectures (Nassimi and Sahni, 1980; Saad and Schultz, 1987; McBryan and Van der Velde, 1987; Ozveren, 1987; Bertsekas et al., 1989; Bertsekas and Tsitsiklis, 1989b; Johnsson and Ho, 1989). Table 1 gives the order of magnitude of the time needed to solve each of these problems using an optimal algorithm. The underlying assumption for the results of this table is that each message requires unit time for transmission on any link of the interconnection network, and that each processor can transmit and receive a message simultaneously on all of its incident links. Specific algorithms that attain these times are given in Bertsekas et al. (1989) and Bertsekas and Tsitsiklis (1989b, Section 1.3.4). In most cases these algorithms are optimal in that they solve the problem in the minimum possible number of time steps. Figure 6 illustrates a multinode broadcast algorithm for a ring with p processors, which attains the minimum number of steps.

† The notation h(y) = Θ(g(y)), where y is a positive integer, means that for some c_1 > 0, c_2 > 0, and y_0 > 0, we have c_1|g(y)| ≤ h(y) ≤ c_2|g(y)| for all y ≥ y_0.


FIG. 6. (a) A ring of p nodes having as links the pairs (i, i + 1) for i = 1, ..., p - 1, and (p, 1). (b) A multinode broadcast on a ring with p nodes can be performed in ⌈(p - 1)/2⌉ stages as follows: at stage 1, each node sends its own packet to its clockwise and counterclockwise neighbors. At stages 2, ..., ⌈(p - 1)/2⌉, each node sends to its clockwise neighbor the packet received from its counterclockwise neighbor at the previous stage; also, at stages 2, ..., ⌈(p - 2)/2⌉, each node sends to its counterclockwise neighbor the packet received from its clockwise neighbor at the previous stage. The figure illustrates this process for p = 6.
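The schedule described in the caption of Fig. 6 can be simulated directly. The sketch below carries out the ⌈(p - 1)/2⌉-stage multinode broadcast on a ring and checks that every node ends up with every packet; message passing is only mimicked with local lists.

```python
from math import ceil

def ring_multinode_broadcast(p):
    """Simulate the multinode broadcast of Fig. 6 on a ring of p nodes
    (clockwise neighbor of node i is node (i+1) % p).  Returns the number
    of stages after which every node holds every packet."""
    have = [{i} for i in range(p)]
    cw_recv = list(range(p))    # packet node i would forward clockwise next
    ccw_recv = list(range(p))   # packet node i would forward counterclockwise next
    cw_stages = ceil((p - 1) / 2)    # stages with clockwise transmissions
    ccw_stages = ceil((p - 2) / 2)   # stages with counterclockwise transmissions
    for s in range(1, cw_stages + 1):
        cw_recv = [cw_recv[(i - 1) % p] for i in range(p)]       # i receives from i-1
        for i in range(p):
            have[i].add(cw_recv[i])
        if s <= ccw_stages:
            ccw_recv = [ccw_recv[(i + 1) % p] for i in range(p)]  # i receives from i+1
            for i in range(p):
                have[i].add(ccw_recv[i])
    assert all(h == set(range(p)) for h in have)
    return cw_stages

print(ring_multinode_broadcast(6))   # 3 stages for p = 6, as in Fig. 6
```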

Using the results of Table 1, it is also shown in Bertsekas and Tsitsiklis (1989b) that if a hypercube is used, then most of the basic operations of numerical linear algebra, i.e. inner product, matrix-vector multiplication, matrix-matrix multiplication, power of a matrix, etc., can be executed in parallel in the same order of time as when communication is instantaneous. In some cases this is also possible when the processors are connected with a less powerful interconnection network such as a square mesh. Thus, communication affects only the "multiplying constant" as opposed to the order of time needed to carry out these operations. Nonetheless, with a large number of processors, the effect of communication delays on linear algebra operations can be very substantial.

TABLE 1. SOLUTION TIMES OF OPTIMAL ALGORITHMS FOR THE BROADCAST AND ACCUMULATION PROBLEMS USING A RING, A BINARY BALANCED TREE, A d-DIMENSIONAL MESH (WITH THE SAME NUMBER OF PROCESSORS ALONG EACH DIMENSION), AND A HYPERCUBE WITH p PROCESSORS. THE TIMES GIVEN FOR THE RING ALSO HOLD FOR A LINEAR ARRAY

Problem                                                 Ring    Tree        Mesh          Hypercube
Single node broadcast (or single node accumulation)     Θ(p)    Θ(log p)    Θ(p^{1/d})    Θ(log p)
Multinode broadcast (or multinode accumulation)         Θ(p)    Θ(p)        Θ(p)          Θ(p/log p)

4. ITERATION COMPLEXITY

We now try to assess the potential benefit from parallelization of the iteration x := f(x). In particular, we will estimate the order of growth of the required time per iteration, as the dimension n increases. Our analysis is geared towards large problems and the issue of speedup of iterative methods using a large number of processors. We will make the following

assumptions:

(a) All components of x are updated at each iteration. (This corresponds to a Jacobi iteration. If a Gauss-Seidel iteration is used instead, the time per iteration cannot increase, since by updating only a subset of the components, the computation per iteration will be reduced and the communication problem will be simplified. Based on this, it can be seen that the order of required time will be unaffected if in place of a Jacobi iteration we perform a Gauss-Seidel sweep with a number of steps which is fixed and independent of the dimension n.)

(b) There are n processors, each updating a single scalar component of x at each iteration. (One may wish to use fewer than n processors, say p, each updating an n/p-dimensional component of x, in order to economize on communication. We argue later, however, that under our assumptions, choosing p < n cannot improve the order of time required per iteration, although it may reduce this time by a constant factor. In practice, of course, the number of available processors is often much less than n, and it is interesting to consider optimal utilization of a limited number of processors in the context of iterative methods. In this paper, however, we will not address this issue, preferring to concentrate on the potential and limitations of iterative computation using massively parallel machines with an abundant number of processors.)

(c) Following the execution of their assigned portion of the iteration, the processors exchange the updated values of their components by means of a communication algorithm such as a multinode broadcast. The subsequent synchronization takes negligible time. (This can be justified by noting that local synchronization can be accomplished as part of the communication algorithm and thus requires no additional time. Furthermore, global synchronization can be done by means of a single node broadcast followed by a single node accumulation. Thus the time required for global synchronization grows with n no faster than a multinode broadcast time. Therefore, if the communication portion of the iteration is done by a multinode broadcast, the global synchronization time can be ignored when estimating the order of required time per iteration.)

We estimate the time per iteration as T_COMP + T_MNB, where T_COMP is the time to compute the updated components f_i(x), and T_MNB is the time to exchange the updated component values between the processors as necessary. If there is overlap of the computation and communication phases due to some form of pipelining, the time per iteration will be smaller than T_COMP + T_MNB, but its order of growth with n will not change. We consider several hypotheses for T_COMP and T_MNB, corresponding to different types of computation and communication hardware, and structures of the functions f_i. In particular, we consider the following cases, motivated primarily by the case where the system of equations x = f(x) is linear.

Small T_COMP (= Θ(1)). One example for this case is when the iteration functions f_i are linear and correspond to a very sparse system (the maximum node degree of the dependency graph is Θ(1)). Another example is when the system solved is linear and dense, but each processor has vector processing capability allowing it to compute inner products in Θ(1) time.

Medium T_COMP (= Θ(log n)). An example for this case is when the system solved is linear and dense, and each processor can compute an inner product in Θ(log n) time. It can be shown that this is possible if each processor is itself a message-passing parallel processor with log n diameter.

Large T_COMP (= Θ(n)). An example for this case is when the system solved is linear and dense, and each processor computes inner products serially in Θ(n) time.

Also the following are considered for the communication time T_MNB.

Small T_MNB (= Θ(1)). An example for this case is when special very fast communication hardware is used, making the time for the multinode broadcast negligible relative to T_COMP or relative to the communication software overhead at the message sources. Another example is when the processors are connected by a network that matches the form of the dependency graph, so that all necessary communication involves directly connected nodes. For example, when solving partial differential equations, the dependency graph is often a grid resulting from discretization of physical space. Then, with processors arranged in an appropriate grid, communication can be done very fast.

Medium T_MNB (= Θ(n/log n)). An example for this case is when the multinode broadcast is performed using a hypercube network (cf. Table 1).

Large T_MNB (= Θ(n)). An example for this case is when the multinode broadcast is performed using a ring network or a linear array (cf. Table 1).

Table 2 gives the time per iteration T_COMP + T_MNB for the different combinations of cases. In the worst case, the time per iteration is Θ(n), and this time is faster by a factor n than the time needed to execute serially the linear iteration x := Ax + b when the matrix A is fully dense. In this case, the speedup is proportional to the number of processors n and the benefit from parallelization is very substantial. This thought, however, must be tempered by the realization that the parallel solution time still increases at least linearly with n, unless the number of iterations needed to solve the problem within practical accuracy decreases with n, an unlikely possibility.

TABLE 2. TIME PER ITERATION OF x := f(x) UNDER A VARIETY OF ASSUMPTIONS FOR THE COMPUTATION TIME PER ITERATION T_COMP AND THE COMMUNICATION TIME PER ITERATION T_MNB. IN THE CELLS ABOVE THE DIAGONAL THE COMPUTATION TIME IS THE BOTTLENECK, AND IN THE CELLS BELOW THE DIAGONAL THE COMMUNICATION TIME IS THE BOTTLENECK

                         T_COMP: Θ(1)    T_COMP: Θ(log n)    T_COMP: Θ(n)
T_MNB: Θ(1)              Θ(1)            Θ(log n)            Θ(n)
T_MNB: Θ(n/log n)        Θ(n/log n)      Θ(n/log n)          Θ(n)
T_MNB: Θ(n)              Θ(n)            Θ(n)                Θ(n)

In the best case of Table 2, the time per iteration is bounded irrespectively of the dimension n, offering hope that with special computing and communication hardware, some extremely large practical problems can be solved in reasonable time.

Another interesting case, which is not covered by Table 2, arises in connection with the linear iteration x := Ax + b, where A is an n × n fully dense matrix. It can be shown that this iteration can be executed in a hypercube network of n^2 processors in Θ(log n) time [see, for example, Bertsekas and Tsitsiklis (1989b)]. While it is hard to imagine at present hypercubes of n^2 processors solving large n × n systems, this case provides a theoretical limit for the time per iteration for unstructured linear systems in message-passing machines.

Consider now the possibility of using p < n processors, each updating an n/p-dimensional component of x. The computation time per iteration will then increase by a factor n/p, so the question arises whether it is possible to improve the order of growth of the communication time in the cases where T_MNB is the iteration time bottleneck. The cases of medium and large T_MNB are of principal interest here. In these cases the corresponding times, measured in message transmission time units, are Θ(p/log p) and Θ(p), respectively. Because, however, each message involves n/p values, its transmission time grows linearly with n/p, so the corresponding time T_MNB becomes Θ(n/log p) and Θ(n), respectively. Thus the order of time per iteration is not improved by choosing p < n, at least under the hypotheses of this section.

5. ASYNCHRONOUS ITERATIONS

Asynchronous iterations have been introduced by Chazan and Miranker (1969) (under the name chaotic relaxation) for the solution of linear equations. In an asynchronous implementation of the iteration x := f(x), processors are not required to wait to receive all messages generated during the previous iteration. Rather, each processor is allowed to keep iterating on its own component at its own pace. If the current value of the component updated by some other processor is not available, then some outdated value received at some time in the past is used instead. Furthermore, processors are not required to communicate their results after each iteration but only once in a while. We allow some processors to compute faster and execute more iterations than others, we allow some processors to communicate more frequently than others, and we allow the communication delays to be substantial and unpredictable. We also allow the communication channels to deliver messages out of order, i.e. in a different order than the one in which they were transmitted.

There are several potential advantages that may be gained from asynchronous execution [see Kung (1976) for a related discussion].

(a) Reduction of the synchronization penalty. There is no overhead such as the one associated with the global synchronization method. In particular, a processor can proceed with the next iteration without waiting for all other processors to complete the current iteration, and without waiting for a synchronization algorithm to execute. Furthermore, in certain cases there are even advantages over the local synchronization method, as we now discuss. Suppose that an algorithm happens to be such that each iteration leaves the value of x_i unchanged. With local synchronization, processor i must still send messages to every processor j with (i, j) ∈ A, because processor j will not otherwise proceed to the next iteration. Consider now a somewhat more realistic case where the algorithm is such that a typical iteration is very likely to leave x_i unchanged. Then each processor j with (i, j) ∈ A will often be found in a situation where it waits for rather uninformative messages stating that the value of x_i has not changed. In an asynchronous execution, processor j does not wait for messages from processor i and the progress of the algorithm is likely to be faster. A similar argument can be made for the case where x_i changes only slightly between iterations. Notice that the situation is similar to the case of synchronization via rollback, except that in an asynchronous algorithm processors do not roll back even if they iterate on the basis of outdated and later invalidated information.

(b) Ease of restarting. Suppose that the processors are engaged in the solution of an optimization problem and that suddenly one of the parameters of the problem changes. (Such a situation is common and natural in the context of data networks or in the quasistatic control of large scale systems.) In a synchronous execution, all processors should be informed, abort the computation, and then reinitiate (in a synchronized manner) the algorithm. In an asynchronous implementation no such reinitialization is required. Rather, each processor incorporates the new parameter value in its iterations as soon as it learns the new value, without waiting for all processors to become aware of the parameter change. When all processors learn the new parameter value, the algorithm becomes the correct (asynchronous) iteration.

(c) Reduction of the effect of bottlenecks. Suppose that the computational power of processor i suddenly deteriorates drastically. In a synchronous execution the entire algorithm would be slowed down. In an asynchronous execution, however, only the progress of x_i and of the components strongly influenced by x_i would be affected; the remaining components would still retain the capacity of making unhampered progress. Thus the effects of temporary malfunctions tend to be localized. The same argument applies to the case where a particular communication channel is suddenly slowed down.

(d) Convergence acceleration due to a Gauss-Seidel effect. With a Gauss-Seidel execution, convergence often takes place with fewer updates of each component, the reason being that new information is incorporated faster in the update formulas. On the other hand, Gauss-Seidel iterations are generally less parallelizable. Asynchronous algorithms have the

potential of displaying a Gauss-Seidel effect because the newest information is incorporated into the computations as soon as it becomes available, while retaining maximal parallelism as in Jacobi-type algorithms.

A major potential drawback of asynchronous algorithms is that they cannot be described mathematically by the iteration x(t + 1) = f(x(t)). Thus, even if this iteration is convergent, the corresponding asynchronous iteration could be divergent, and indeed this is sometimes the case. Even if the convergence of the asynchronous iteration can be established, the corresponding analysis is often difficult. Nevertheless, there is a large number of results stating that certain classes of important algorithms retain their desirable convergence properties in the face of asynchronism; they will be surveyed in Sections 6-8. Another difficulty relates to the fact that an asynchronous algorithm may have converged (within a desired accuracy) but the algorithm does not terminate because no processor is aware of this fact. We address this issue in Section 9.

We now present our model of asynchronous computation. Let the set X and the function f be as described in Section 2. Let t be an integer variable used to index the events of interest in the computing system. Although t will be referred to as a time variable, it may have little relation with "real time". Let x_i(t) be the value of x_i residing in the memory of the ith processor at time t. We assume that there is a set of times T^i at which x_i is updated. To account for the possibility that the ith processor may not have access to the most recent values of the components of x, we assume that

x_i(t + 1) = f_i(x_1(τ^i_1(t)), ..., x_p(τ^i_p(t))),     ∀t ∈ T^i,     (5.1)

where the τ^i_j(t) are times satisfying

0 ≤ τ^i_j(t) ≤ t,     ∀t ≥ 0.

At all times t ∉ T^i, x_i(t) is left unchanged and

x_i(t + 1) = x_i(t),     ∀t ∉ T^i.     (5.2)

We assume that the algorithm is initialized with some x(0) ∈ X.

The above mathematical description can be used as a model of asynchronous iterations executed by either a message-passing distributed system or a shared-memory parallel computer. For an illustration of the latter case, see Fig. 7.

FIG. 7. Illustration of a component update in a shared memory multiprocessor. Here x_2 is viewed as being updated at time t = 9 (9 ∈ T^2), with τ^2_1(9) = 1, τ^2_2(9) = 2, and τ^2_3(9) = 4. The updated value of x_2 is entered at the corresponding register at t = 10. Several components can be simultaneously in the process of being updated, and the values of τ^i_j(t) can be unpredictable.

The difference t - τ^i_j(t) is equal to zero for a synchronous execution. The larger this difference is, the larger is the amount of asynchronism in the algorithm. Of course, for the algorithm to make any progress at all we should not allow τ^i_j(t) to remain forever small. Furthermore, no processor should be allowed to drop out of the computation and stop iterating. For this reason, certain assumptions need to be imposed. There are two different types of assumptions, which we state below.

Assumption 1. (Total asynchronism.) The sets T^i are infinite and if {t_k} is a sequence of elements of T^i which tends to infinity, then lim_{k→∞} τ^i_j(t_k) = ∞ for every j.

Assumption 2. (Partial asynchronism.) There exists a positive constant B such that:

(a) For every t ≥ 0 and every i, at least one of the elements of the set {t, t + 1, ..., t + B - 1} belongs to T^i.

(b) There holds

t - B < τ^i_j(t) ≤ t,     ∀i, j, ∀t ∈ T^i.     (5.3)

(c) There holds τ^i_i(t) = t, for all i and t ∈ T^i.

The constant B of Assumption 2, to be called the asynchronism measure, bounds the amount by which the information available to a processor can be outdated. Notice that a Jacobi-type synchronous iteration is the special case of partial asynchronism in which B = 1. Notice also that Assumption 2(c) states that the information available to processor i regarding its own component is never outdated. Such an assumption is natural in most contexts, but could be violated in certain types of shared memory parallel computing systems if we allow more than one processor to update the same component of x. It turns


out that if we relax Assumption 2(c), the convergence of certain asynchronous algorithms is destroyed (Lubachevsky and Mitra, 1986; Bertsekas and Tsitsiklis, 1989b, pp. 506 and 517). Parts (a) and (b) of Assumption 2 are typically satisfied in practice.
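To make the model (5.1)-(5.2) and the asynchronism measure B concrete, the following sketch simulates an asynchronous linear iteration in which each component is updated at randomly chosen times and reads randomly outdated values of the other components, in a way that respects Assumption 2 with a given B. The matrix is a hypothetical maximum-norm contraction, so the iterates approach the fixed point despite the asynchronism.

```python
import random
import numpy as np

def asynchronous_iteration(A, b, B, T_end, seed=0):
    """Simulate (5.1)-(5.2) for f(x) = Ax + b under partial asynchronism:
    reads are at most B-1 steps old, the own component is always current
    (Assumption 2(c)), and each component is updated at least once in every
    window of B consecutive times (Assumption 2(a))."""
    rng = random.Random(seed)
    n = len(b)
    history = [np.zeros(n)]                      # history[t] = x(t)
    for t in range(T_end):
        x = history[-1].copy()
        for i in range(n):
            # update x_i with probability 1/2, and in any case at the times
            # t with t = i (mod B), which enforces Assumption 2(a)
            if rng.random() < 0.5 or t % B == i % B:
                tau = [t if j == i else max(0, t - rng.randrange(B))
                       for j in range(n)]        # outdated read times tau_j^i(t)
                reads = np.array([history[tau[j]][j] for j in range(n)])
                x[i] = A[i] @ reads + b[i]
        history.append(x)
    return history[-1]

A = np.array([[0.0, 0.3, 0.2],       # hypothetical matrix with max-norm < 1
              [0.1, 0.0, 0.3],
              [0.2, 0.1, 0.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(np.eye(3) - A, b)
print(np.linalg.norm(asynchronous_iteration(A, b, B=5, T_end=200) - x_star))
```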

out that if we relax Assumption 2(c). the convergenceof certain asynchronous algorithms is destroyed(Lubachevsky and Mitra. 1986; Bertsekas andTsitsiklis. 1989b. p. 506 and p. 517). Parts (a) and (b)of Assumption 2 are typically satisfied in practice.

Asynchronous algorithms can exhibit three different types of behavior (other than guaranteed divergence):

(a) Convergence under total asynchronism.

(b) Convergence under partial asynchronism, for every value of B, but possible divergence under totally asynchronous execution.

(c) Convergence under partial asynchronism if B is small enough, and possible divergence if B is large enough.

The mechanisms by which convergence is established in each one of the above three cases are fundamentally different and we address them in the subsequent three sections, respectively.

6. TOTALLY ASYNCHRONOUS ALGORITHMS

Totally asynchronous convergence results have been obtained† by Chazan and Miranker (1969) for linear iterations, Miellou (1975a), Baudet (1978), El Tarazi (1982), and Miellou and Spiteri (1985) for contracting iterations, Miellou (1975b) and Bertsekas (1982) for monotone iterations, and Bertsekas (1983) for general iterations. Related results can also be found in Uresin and Dubois (1986, 1988, 1990). The following general result is from Bertsekas (1983).

Let t'_k = max_i {t^i} + 1. Then, using the Cartesian product structure of X(k), we have

x(t) ∈ X(k + 1), ∀t ≥ t'_k.

Finally, since by Assumption 1 we have τ^i_j(t) → ∞ as t → ∞, t ∈ T^i, we can choose a time t_{k+1} ≥ t'_k that is sufficiently large so that τ^i_j(t) ≥ t'_k for all i, j, and t ∈ T^i with t ≥ t_{k+1}. We then have x_j(τ^i_j(t)) ∈ X_j(k + 1), for all t ∈ T^i with t ≥ t_{k+1} and all j = 1, ..., n, which implies that

x^i(t) = (x_1(τ^i_1(t)), x_2(τ^i_2(t)), ..., x_n(τ^i_n(t))) ∈ X(k + 1).

p "Proposition 3. Let X = n X, C n ~n,. Suppose that

,-I ,-Ifor each i e {I. p }. there exists a sequence{Xj(k)} of subsets of X, such that:

(a) X,(k + I) c X,(k). for all k 2: o."

(b) The sets X(k) = n X,(k) have the property'51

!(x)eX(k+I).forallxeX.(c) Every limit point of a sequence {x(k)} with theproperty x(k) e X(k) for all k. is a fixed point off.

Then. under Assumption I (total asynchronism).and if x(O) e X(O). every limit point of the sequence

{x(r)} generated by the asynchronous iteration(5.1)-(5.2) is a fixed point of f.

The induction is complete. Q.E.D.

The key idea behind Proposition 3 is that eventually x(t) enters and stays in the set X(k); furthermore, due to condition (b) in Proposition 3, it eventually moves into the next set X(k + 1). The most restrictive assumption in the proposition is the requirement that each X(k) is the Cartesian product of sets X_i(k). Successful application of Proposition 3 depends on the ability to properly define the sets X_i(k) with the required properties. This is possible for two general classes of iterations which will be discussed shortly.

Notice that Proposition 3 makes no assumptions on the nature of the sets X_i(k). For this reason, it can be applied to problems involving continuous variables, as well as to discrete iterations involving finite-valued variables. Furthermore, the result extends in the obvious way to the case where each X_i(k) is a subset of an infinite-dimensional space (instead of being a subset of ℜ^{n_i}) or to the case where f has multiple fixed points.

Interestingly enough, the sufficient conditions for asynchronous convergence provided by Proposition 3 are also known to be necessary for two special cases: (i) if n_i = 1 for each i and the mapping f is linear (Chazan and Miranker, 1969), and (ii) if the set X is finite (Uresin and Dubois, 1990).

Several authors have also studied asynchronous iterations with zero delays, that is, under the assumption τ^i_j(t) = t for every t ∈ T^i; see, for example, Robert et al. (1975) and Robert (1976, 1987, 1988). Note that this is a special case of our asynchronous model, but is more general than the synchronous Jacobi and Gauss-Seidel iterations of Section 2, because the sets T^i are allowed to be arbitrary. General necessary and sufficient convergence conditions for the zero-delay

Proof. We show by induction that for each k ≥ 0, there is a time t_k such that:

(a) x(t) ∈ X(k) for all t ≥ t_k.

(b) For all i and t ∈ T^i with t ≥ t_k, we have x^i(t) ∈ X(k), where

x^i(t) = (x_1(τ^i_1(t)), x_2(τ^i_2(t)), ..., x_n(τ^i_n(t))), ∀t ∈ T^i.

[In words: after some time, all solution estimates will be in X(k) and all estimates used in iteration (5.1) will come from X(k).]

†Actually, some of these papers only consider partially asynchronous iterations, but their convergence results readily extend to cover the case of total asynchronism.


case can be found in Tsitsiklis (1987), where it is shown that asynchronous convergence is guaranteed if and only if there exists a Lyapunov-type function which testifies to this.

6.1. Maximum norm contractions
Consider a norm on ℜ^n defined by

||x|| = max_i ||x_i||_i / w_i,   (6.1)

where x_i ∈ ℜ^{n_i} is the ith component of x, ||·||_i is a norm on ℜ^{n_i}, and w_i is a positive scalar, for each i. Suppose that f has the following contraction property: there exists some α ∈ [0, 1) such that

||f(x) - x*|| ≤ α ||x - x*||, ∀x ∈ X,   (6.2)

where x* is a fixed point of f. Given a vector x(0) ∈ X with which the algorithm is initialized, let

X_i(k) = {x_i ∈ ℜ^{n_i} | ||x_i - x*_i||_i ≤ α^k ||x(0) - x*||}.

It is easily verified that these sets satisfy the conditions of Proposition 3 and convergence to x* follows.

Iteration mappings f with the contraction property (6.2) are very common. We list a few examples:

(a) Linear iterations of the form f(x) = Ax + b, where A is an n × n matrix such that ρ(|A|) < 1. Here, |A| is the matrix whose entries are the absolute values of the corresponding entries of A, and ρ(|A|), the spectral radius of |A|, is the largest of the magnitudes of the eigenvalues of |A| (Chazan and Miranker, 1969). This result follows from a corollary of the Perron-Frobenius theorem that states that ρ(|A|) < 1 if and only if A is a contraction mapping with respect to a weighted maximum norm of the form (6.1), for a suitable choice of the weights. As a special case, we obtain totally asynchronous convergence of the iteration π := πP for computing a row vector π consisting of the invariant probabilities of an irreducible, discrete-time, finite-state Markov chain. Here, P is the transition probability matrix of the chain and one of the components of π is held fixed throughout the algorithm (Bertsekas and Tsitsiklis, 1989b). Another special case, the case of periodic asynchronous iterations, is considered in Donnelly (1971). Let us mention here that the condition ρ(|A|) < 1 is not only sufficient but also necessary for totally asynchronous convergence (Chazan and Miranker, 1969).

(b) Gradient iterations of the form f(x) = x - γ∇F(x), where γ is a small positive stepsize parameter and F: ℜ^n → ℜ is a twice continuously differentiable cost function whose Hessian matrix is bounded and satisfies the diagonal dominance condition

Σ_{j≠i} |∇²_{ij}F(x)| ≤ ∇²_{ii}F(x) - β, ∀i, ∀x ∈ X.   (6.3)

Here, β is a positive constant and ∇²_{ij}F stands for (∂²F)/(∂x_i ∂x_j) (Bertsekas, 1983; Bertsekas and Tsitsiklis, 1989b).

Example 1. Consider the iteration x := x - γAx, where A is the positive definite matrix given by

A = [ 1+ε   1     1
       1    1+ε   1
       1     1    1+ε ]

and γ, ε are positive constants. This iteration can be viewed as the gradient iteration x := x - γ∇F(x) for minimizing the quadratic function F(x) = ½x'Ax and is known to converge synchronously if the stepsize γ is sufficiently small. If ε > 1, then the diagonal dominance condition (6.3) holds and totally asynchronous convergence follows, when the stepsize γ is sufficiently small. On the other hand, when 0 < ε < 1, the condition (6.3) fails to hold for all γ > 0. In fact, in that case, it is easily shown that ρ(|I - γA|) > 1 for every γ > 0, and totally asynchronous convergence fails to hold, according to the necessary conditions quoted earlier. An illustrative sequence of events under which the algorithm diverges is the following. Suppose that the processors start with a common vector x(0) = (c, c, c) and that each processor executes a very large number t_0 of updates of its own component without informing the others. Then, in effect, processor 1 solves the equation 0 = (∂F/∂x_1)(x_1, c, c) = (1 + ε)x_1 + c + c, to obtain x_1(t_0) = -2c/(1 + ε), and the same conclusion is obtained for the other processors as well. Assume now that the processors exchange their results at time t_0 and repeat the above described scenario. We will then obtain x_i(2t_0) = -2x_i(t_0)/(1 + ε) = (-2)²c/(1 + ε)². Such a sequence of events can be repeated ad infinitum, and it is clear that the vector x(t) will diverge if ε < 1.

(c) The projection algorithm (as well as several other algorithms) for variational inequalities. Here, X = Π_{i=1}^p X_i ⊂ ℜ^n is a closed convex set, f: X → ℜ^n is a given function, and we are looking for a vector x* ∈ X such that

(x - x*)'f(x*) ≥ 0, ∀x ∈ X.

The projection algorithm is given by x := [x - γf(x)]^+, where [·]^+ denotes orthogonal projection on the set X. Totally asynchronous convergence to x* is obtained under the assumption that the mapping x ↦ x - γf(x) is a maximum norm contraction mapping, and this is always the case if the Jacobian of f satisfies a diagonal dominance condition (Bertsekas and Tsitsiklis, 1989b). Special cases of variational inequalities include constrained convex optimization, solution of systems of nonlinear equations, traffic equilibrium problems under a user-optimization principle, and Nash games. Let us point out here that an asynchronous algorithm for solving a traffic equilibrium problem can be viewed as a model of a traffic network in operation whereby individual users optimize their individual routes given the current condition of the network. It is natural to assume that such user-optimization takes place asynchronously. Similarly, in a game theoretic context, we can think of a set of players who asynchronously adapt their strategies so as to improve their individual payoffs, and an asynchronous iteration can be used as a model of such a situation.
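The scenario of Example 1 is easy to reproduce numerically. The sketch below is a minimal simulation of it, assuming the 3 x 3 matrix A above; the values of γ, ε, c and the number of local updates per round are hypothetical.

def example1_round(x, eps, gamma, local_steps):
    """One 'round' of the scenario in Example 1: each processor repeatedly
    updates its own component of x := x - gamma*A*x using stale values of the
    other components; results are exchanged only at the end of the round."""
    n = len(x)
    new_x = []
    for i in range(n):
        xi = x[i]
        for _ in range(local_steps):
            # gradient of F(x) = 0.5*x'Ax with A = (1+eps)*I + (ones - I)
            grad_i = (1.0 + eps) * xi + sum(x[j] for j in range(n) if j != i)
            xi -= gamma * grad_i
        new_x.append(xi)
    return new_x

def run(eps, rounds=10, gamma=0.05, local_steps=2000, c=1.0):
    x = [c, c, c]
    for _ in range(rounds):
        x = example1_round(x, eps, gamma, local_steps)
    return max(abs(v) for v in x)

print("eps = 2.0 :", run(2.0))   # diagonal dominance holds: the iterates shrink
print("eps = 0.5 :", run(0.5))   # 0 < eps < 1: the iterates grow with every round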


6.2. Monotone mappings
Consider a function f: ℜ^n → ℜ^n which is continuous, monotone [that is, if x ≤ y then f(x) ≤ f(y)], and has a unique fixed point x*. Furthermore, assume that there exist vectors u, v such that u ≤ f(u) ≤ f(v) ≤ v. If we let f^k be the composition of k copies of f and X(k) = {x | f^k(u) ≤ x ≤ f^k(v)}, then Proposition 3 applies and establishes totally asynchronous convergence. The above stated conditions on f are satisfied by the iteration mapping corresponding to the successive approximation (value iteration) algorithm for discounted and certain undiscounted infinite horizon dynamic programming problems (Bertsekas, 1982).

An important special case is the asynchronous Bellman-Ford algorithm for the shortest path problem. Here we are given a directed graph G = (N, A), with N = {1, ..., n} and, for each arc (i, j) ∈ A, a weight a_ij representing its length. The problem is to compute the shortest distance x_i from every node i to node 1. We assume that every cycle not containing node 1 has positive length and that there exists at least one path from every node to node 1. Then, the shortest distances correspond to the unique fixed point of the monotone mapping f: ℜ^n → ℜ^n defined by f_1(x) = 0 and

f_i(x) = min_{(i,j)∈A} (a_ij + x_j), i ≠ 1.


(d) Waveform relaxation methods for solving a system of ordinary differential equations under a weak coupling assumption (Mitra, 1987), as well as for two-point boundary value problems (Lang et al., 1986; Spiteri, 1984; Bertsekas and Tsitsiklis, 1989b).

Other studies have dealt with an asynchronous Newton algorithm (Bojanczyk, 1984), an agreement problem (Li and Basar, 1987), diagonally dominant linear programming problems (Tseng, 1990), and a variety of infinite-dimensional problems such as partial differential equations and variational inequalities (Spiteri, 1984, 1986; Miellou and Spiteri, 1985; Anwar and El Tarazi, 1985).

In the case of maximum norm contraction mappings, there are some convergence rate estimates available which indicate that the asynchronous iteration converges faster than its synchronous counterpart, especially if the coupling between the different components of x is relatively weak. Let us suppose that an update by a processor takes one time unit and that the communication delays are always equal to D time units, where D is a positive integer. With a synchronous algorithm, there is one iteration every D + 1 time units and the "error" ||x(t) - x*|| can be bounded by Cα^{t/(D+1)}, where C is some constant [depending on x(0)] and α is the contraction factor of (6.2). We now consider an asynchronous execution whereby, at each time step, an iteration is performed by each processor i and the result is immediately transmitted to the other processors. Thus the values of x_j (j ≠ i) which are used by processor i are always outdated by D time units. Concerning the function f, we assume that there exists some scalar β such that 0 < β < α and

||f_i(x) - x*_i||_i ≤ max{ α ||x_i - x*_i||_i, β max_{j≠i} ||x_j - x*_j||_j }, ∀i.   (6.4)

It is seen that a small value of β corresponds to a situation where the coupling between different components of x is weak. Under condition (6.4), the convergence rate estimate for the synchronous iteration cannot be improved, but the error ||x(t) - x*|| for the asynchronous iteration can be shown (Bertsekas and Tsitsiklis, 1989b) to be bounded above by Cρ^t, where C is some constant and ρ is the positive solution of the equation ρ = max{α, βρ^{-D}}. It is not hard to see that ρ < α^{1/(D+1)} and the asynchronous algorithm converges faster. The advantage of the asynchronous algorithm is more pronounced when β is very small (very weak coupling), in which case ρ approaches α. The latter is the convergence rate that would have been obtained if there were no communication delays at all. We conclude that, for weakly coupled problems, asynchronous iterations are slowed down very little by communication delays, in sharp contrast with their synchronous counterparts.
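The two rate estimates are easy to compare numerically. The following sketch evaluates the positive solution ρ of ρ = max{α, βρ^{-D}} in closed form and prints it next to the synchronous rate α^{1/(D+1)}; the sample values of α, β and D are hypothetical.

def async_rate(alpha, beta, D, tol=1e-12):
    """Positive solution of rho = max(alpha, beta * rho**(-D)), 0 < beta < alpha < 1."""
    # If beta <= alpha**(D+1) the solution is rho = alpha;
    # otherwise it solves rho**(D+1) = beta.
    if beta * alpha ** (-D) <= alpha + tol:
        return alpha
    return beta ** (1.0 / (D + 1))

alpha, D = 0.9, 5
for beta in (0.8, 0.5, 0.05):
    rho = async_rate(alpha, beta, D)
    print(f"beta={beta}: asynchronous rate rho={rho:.4f}, "
          f"synchronous rate alpha^(1/(D+1))={alpha ** (1 / (D + 1)):.4f}")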

The Bellman-Ford algorithm consists of the iteration x := f(x) and can be shown to converge asynchronously (Tajibnapis, 1977; Bertsekas, 1982). We now compare the synchronous and the asynchronous versions. We assume that both versions are initialized with x_i = ∞ for every i ≠ 1, which is the most common choice. The synchronous iteration is known to converge after at most n iterations. However, assuming that the communication delays from processor i to j are fixed to some constant D_ij, and that the computation time is negligible, it is easily shown that the asynchronous iteration is guaranteed to terminate earlier than the synchronous one.

Notice that the number of messages exchanged in the synchronous Bellman-Ford algorithm is at most n³. This is because there are at most n stages and at most n messages are transmitted by each processor at each stage. Interestingly enough, with an asynchronous execution, and if the communication delays are allowed to be arbitrary, some simple examples (due to E. M. Gafni and R. G. Gallager; see Bertsekas and Tsitsiklis, 1989b) show that the number of messages exchanged until termination could be exponential in n, even if we restrict processor i to transmit a message only when the value of x_i changes. This could be a serious drawback, but experience with the algorithm indicates that this worst case behavior rarely occurs and that the average number of messages exchanged is polynomial in n. It also turns out that the expected number of messages is polynomial in n under some reasonable probabilistic assumptions on the execution of the algorithm (Tsitsiklis and Stamoulis, 1990).
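A minimal sketch of an asynchronous Bellman-Ford execution is given below; the graph, the random update schedule, and the assumption that every update sees the latest values of the neighbors are hypothetical simplifications of the message-passing execution discussed above.

import random

INF = float("inf")

def async_bellman_ford(n, arcs, steps=500, seed=0):
    """Asynchronous Bellman-Ford for shortest distances to node 1.

    arcs: dict (i, j) -> a_ij, the length of arc (i, j).
    x[i] is the current estimate at node i; x[1] is fixed to 0.
    At each step a randomly chosen node re-evaluates
        x_i := min over arcs (i, j) of (a_ij + x_j)
    using the most recently available values x_j.
    """
    random.seed(seed)
    x = {i: (0.0 if i == 1 else INF) for i in range(1, n + 1)}
    out = {i: [(j, a) for (k, j), a in arcs.items() if k == i] for i in range(1, n + 1)}
    for _ in range(steps):
        i = random.randint(2, n)                     # node 1 never changes
        if out[i]:
            x[i] = min(a + x[j] for j, a in out[i])  # the monotone update f_i
    return x

# Hypothetical 4-node example; all cycles avoiding node 1 have positive length.
arcs = {(2, 1): 1.0, (3, 2): 2.0, (3, 1): 5.0, (4, 3): 1.0, (4, 2): 4.0}
print(async_bellman_ford(4, arcs))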

A number of asynchronous convergence results


Clearly, if the algorithm converges then, in the limit, the values possessed by different processors are equal. We will thus refer to the asynchronous iteration x := Ax as an agreement algorithm. It can be shown that, under the assumption of partial asynchronism, the function d defined by

d(z(t)) = max_i max_{t-B<τ≤t} x_i(τ) - min_i min_{t-B<τ≤t} x_i(τ)   (7.1)

has the properties assumed in Proposition 4, provided that at least one of the diagonal entries of A is positive. In particular, if the processors initially disagree, the "maximum disagreement" [cf. (7.1)] is reduced by a positive amount after at most 2nB time units (Tsitsiklis, 1984). Proposition 4 applies and establishes geometric convergence to agreement. Furthermore, such partially asynchronous convergence is obtained no matter how big the value of the asynchronism measure B is, as long as B is finite.
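The following sketch simulates a partially asynchronous execution of the agreement algorithm; the stochastic matrix A, the bound B, and the random schedule are hypothetical, and B is enforced both on the update intervals and on the age of the information used.

import random

def agreement(A, x0, B=5, steps=400, seed=1):
    """Partially asynchronous agreement iteration x_i := sum_j a_ij x_j.

    A is an irreducible stochastic matrix with positive diagonal entries.
    Each processor updates at least once every B steps, using values of the
    other components that are outdated by at most B - 1 steps; a processor
    always uses its own latest value.
    """
    random.seed(seed)
    n = len(x0)
    hist = [list(x0)]                       # hist[t][j] = x_j(t)
    last_update = [0] * n
    for t in range(steps):
        x_new = list(hist[t])
        for i in range(n):
            if random.random() < 0.4 or t - last_update[i] >= B - 1:
                x_new[i] = sum(
                    A[i][j] * (hist[t][j] if j == i
                               else hist[random.randint(max(0, t - B + 1), t)][j])
                    for j in range(n))
                last_update[i] = t
        hist.append(x_new)
    return hist[-1]

A = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(agreement(A, [0.0, 3.0, 9.0]))   # all components approach a common value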

The following example (Bertsekas and Tsitsiklis, 1989b) shows that the agreement algorithm need not converge totally asynchronously.

making essential use of monotonicity conditions are also available for relaxation and primal-dual algorithms for linear and nonlinear network flow problems (Bertsekas, 1986; Bertsekas and Eckstein, 1987, 1988; Bertsekas and El Baz, 1987; Bertsekas and Castanon, 1989, 1990). Experiments showing faster convergence for asynchronous over synchronous relaxation methods for assignment problems using a shared memory machine are given in Bertsekas and Castanon (1989).

We finally note that, under the monotonicity assumptions of this subsection, the convergence rate of an asynchronous iteration is guaranteed to be at least as good as the convergence rate of a corresponding synchronous iteration, under a fair comparison (Bertsekas and Tsitsiklis, 1989a).

7. PARTIALLY ASYNCHRONOUS ALGORITHMS-I

We now consider iterations satisfying the partial asynchronism Assumption 2. Since old information is "purged" from the algorithm after at most B time units, it is natural to describe the "state" of the algorithm at time t by the vector z(t) ∈ X^B defined by

z(t) = (x(t), x(t - 1), ..., x(t - B + 1)).

We then notice that x(t + 1) can be determined [cf. (5.1)-(5.3)] in terms of z(t); in particular, knowledge of x(τ), for τ ≤ t - B, is not needed. We assume that the iteration mapping f is continuous and has a nonempty set X* ⊂ X of fixed points. Let Z* be the set of all vectors z* ∈ X^B of the form z* = (x*, x*, ..., x*), where x* belongs to X*. We present a sometimes useful convergence result, which employs a Lyapunov-type function d defined on the set X^B.

Here, the synchronous iteration x(t + 1) = Ax(t) converges in a single step to the vector x = (y, y), where y = (x_1 + x_2)/2. Consider the following totally asynchronous scenario. Each processor updates its value at each time step. At certain times t_1, t_2, ..., each processor transmits its value, which is received with zero delay and is immediately incorporated into the computations of the other processor. We then have

x_1(t + 1) = (x_1(t) + x_2(t_k))/2, t_k ≤ t < t_{k+1},
x_2(t + 1) = (x_1(t_k) + x_2(t))/2, t_k ≤ t < t_{k+1}.

Proposition 4. (Bertsekas and Tsitsiklis, 1989b) Suppose that there exist a positive integer t* and a continuous function d: X^B → [0, ∞) with the following properties: for every initialization z(0) ∉ Z* of the iteration and any subsequent sequence of events (conforming to Assumption 2), we have d(z(t*)) < d(z(0)) and d(z(t)) ≤ d(z(0)) for all t. Then every limit point of a sequence {z(t)} generated by the partially asynchronous iteration (5.1)-(5.2) belongs to Z*. Furthermore, if X = ℜ^n, if the function d is of the form d(z) = inf_{z*∈Z*} ||z - z*||, where ||·|| is some vector norm, and if the function f is of the form f(x) = Ax + b, where A is an n × n matrix and b is a vector in ℜ^n, then d(z(t)) converges to zero at the rate of a geometric progression.


For an interesting application of the above proposition, consider a mapping f: ℜ^n → ℜ^n of the form f(x) = Ax, where A is an irreducible stochastic matrix, and let n_i = 1 for each i. In the corresponding iterative algorithm, each processor maintains and communicates a value of a scalar variable x_i and once in a while forms a convex combination of its own variable with the variables received from the other processors according to the rule

x_i := Σ_{j=1}^n a_ij x_j.

(See Fig. 8 for an illustration.) Thus,

x_1(t_{k+1}) = (1/2)^{t_{k+1}-t_k} x_1(t_k) + (1 - (1/2)^{t_{k+1}-t_k}) x_2(t_k),
x_2(t_{k+1}) = (1/2)^{t_{k+1}-t_k} x_2(t_k) + (1 - (1/2)^{t_{k+1}-t_k}) x_1(t_k).

Subtracting these two equations we obtain

|x_2(t_{k+1}) - x_1(t_{k+1})| = (1 - 2(1/2)^{t_{k+1}-t_k}) |x_2(t_k) - x_1(t_k)|
                             = (1 - ε_k) |x_2(t_k) - x_1(t_k)|,

where ε_k = 2(1/2)^{t_{k+1}-t_k}. In particular, the disagreement |x_2(t_k) - x_1(t_k)| keeps decreasing. On the other hand, convergence to agreement is not guaranteed unless Π_{k=1}^∞ (1 - ε_k) = 0, which is not necessarily the case. For example, if we choose the differences t_{k+1} - t_k to be large enough so that ε_k < k^{-2}, then we can use the fact Π_{k=1}^∞ (1 - k^{-2}) > 0 to see that convergence to agreement does not take place.
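The computation in Example 2 can be checked numerically. In the sketch below the exchange intervals t_{k+1} - t_k are chosen so that ε_k < k^{-2}, and the disagreement settles at a positive value; the initial values and the number of rounds are hypothetical.

import math

def example2(rounds=20, x1=0.0, x2=1.0):
    """Two processors average with a copy of the other's value that is
    refreshed only at exchange times t_1 < t_2 < ...; after m local steps the
    own value has moved to within (1/2)**m of the stale copy."""
    for k in range(1, rounds + 1):
        # choose m = t_{k+1} - t_k so that eps_k = 2*(1/2)**m < k**-2
        m = max(2, math.ceil(math.log2(2.0 * k * k) + 1))
        w = 0.5 ** m                         # weight left on the own old value
        x1, x2 = w * x1 + (1 - w) * x2, w * x2 + (1 - w) * x1
        print(f"k={k:2d}  disagreement={abs(x2 - x1):.6f}")

example2()   # the disagreement decreases but stays bounded away from zero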

Example 2 shows that failure to converge is possible if part (b) of the partial asynchronism Assumption 2


FIG. 8. Illustration of divergence in Example 2.

(c) An asynchronous algorithm for load balancing in a computer network whereby highly loaded processors transfer fractions of their load to their lightly loaded neighbors, until the load of all processors becomes the same (Bertsekas and Tsitsiklis, 1989b).

In all of the above cases, partially asynchronous convergence has been proved for all values of B, and examples are available which demonstrate that totally asynchronous convergence fails.

We close by mentioning a particular context in which the agreement algorithm could be of use. Consider a set of processors that obtain a sequence of noisy observations and try to estimate certain parameters by means of some iterative method. This could be a stochastic gradient algorithm (such as the ones arising in recursive system identification) or some kind of Monte Carlo estimation algorithm. All processors are employed for the estimation of the same parameters but their individual estimates are generally different because the noises corrupting their observations can be different. We let the processors communicate and combine their individual estimates in order to average their individual noises, thereby reducing the error variance. We thus let the processors execute the agreement algorithm, trying to agree on a common estimate, while simultaneously obtaining new observations which they incorporate into their estimates. There are two opposing effects here: the agreement algorithm tends to bring their estimates closer together, while new observations have the potential of increasing the difference of their estimates. Under the partial asynchronism assumption, the agreement algorithm tends to converge geometrically. On the other hand, in several stochastic algorithms (such as the stochastic approximation iteration

fails to hold. There also exist examples demonstrating that parts (a) and (c) of Assumption 2 are also necessary for convergence.

Example 2 illustrates best the convergence mechanism in algorithms which converge partially asynchronously for every B, but not totally asynchronously. The key idea is that the distance from the set of fixed points is guaranteed to "contract" once in a while. However, the contraction factor depends on B and approaches 1 as B gets larger. (In the context of Example 2, the contraction factor is 1 - ε_k, which approaches 1 as t_{k+1} - t_k is increased to infinity.) As time goes to infinity, the distance from the set of fixed points is contracted an infinite number of times, but this guarantees convergence only if the contraction factor is bounded away from 1, which then necessitates a finite but otherwise arbitrary bound on B.

Partially asynchronous convergence for every value of B has been established for several variations and generalizations of the agreement algorithm (Tsitsiklis, 1984; Bertsekas and Tsitsiklis, 1989b), as well as for a variety of other problems:

(a) The iteration π := πP for the computation of a row vector π of invariant probabilities, associated with an irreducible stochastic matrix P with a nonzero diagonal entry (Lubachevsky and Mitra, 1986). This result can be also obtained by letting x_i = π_i/π*_i, where π* is a positive vector satisfying π* = π*P, and by verifying that the variables x_i obey the equations of the agreement algorithm (Bertsekas and Tsitsiklis, 1989b).

(b) Relaxation algorithms involving nonexpansive mappings with respect to the maximum norm (Tseng et al., 1990; Bertsekas and Tsitsiklis, 1989b). Special cases include dual relaxation algorithms for strictly convex network flow problems and linear iterations for the solution of linear equations of the form Ax = b, where A is an irreducible matrix satisfying the weak diagonal dominance condition

Σ_{j≠i} |a_ij| ≤ a_ii, for all i.

x := x - (1/t)(∇F(x) + w),

where w represents observation noise) the stepsize 1/t


decreases to zero as time goes to infinity. We then have, asymptotically, a separation of time scales: the stochastic algorithm operates on a slower time scale and therefore the agreement algorithm can be approximated by an algorithm in which agreement is instantly established. It follows that the asynchronous nature of the agreement algorithm cannot have any adverse effect on the convergence of the stochastic algorithm. Rigorous results of this type can be found in Tsitsiklis (1984); Tsitsiklis et al. (1986); Kushner and Yin (1987a, b); Bertsekas and Tsitsiklis (1989b).

An important application of asynchronous gradient-like optimization algorithms arises in the context of optimal quasistatic routing in data networks. In a common formulation of the routing problem one is faced with a convex nonlinear multicommodity network flow problem (Bertsekas and Gallager, 1987) that can be solved using gradient projection methods. It has been shown that these methods also converge partially asynchronously, provided that a small enough stepsize is used (Tsitsiklis and Bertsekas, 1986). Furthermore, such methods can be naturally implemented on-line by having the processors in the network asynchronously exchange information on the current traffic conditions in the system and perform updates trying to reduce the measure of congestion being optimized. An important property of such an asynchronous algorithm is that it adapts to changes in the problem being solved (such as changes in the amount of traffic to be routed through the network) without a need for aborting and restarting the algorithm. Some further analysis of the asynchronous routing algorithm can be found in Tsai (1986, 1989) and Tsai et al. (1986).

9. TERMINATION OF ASYNCHRONOUS ITERATIONS

In practice, iterative algorithms are executed only for a finite number of iterations, until some termination condition is satisfied. In the case of asynchronous iterations, the problem of determining whether termination conditions are satisfied is rather difficult because each processor possesses only partial information on the progress of the algorithm.

We now introduce one possible approach for handling the termination problem for asynchronous iterations. In this approach, the problem is decomposed into two parts:

(a) An asynchronous iterative algorithm is modified so that it terminates in finite time.

(b) A special procedure is used to detect termination in finite time after it has occurred.

In order to handle the termination problem, we have to be a little more specific about the model of interprocessor communication. While the general model of asynchronous iterations introduced in Section 5 can be used for both shared memory and message-passing parallel architectures, we adopt here a more explicit message-passing model. In particular, we assume that each processor j sends messages with the value of x_j to every other processor i. Processor i keeps a buffer with the most recently received value of x_j. We denote the value in this buffer at time t by x^i_j(t). This value was transmitted by processor j at some earlier time τ^i_j(t) and therefore x^i_j(t) = x_j(τ^i_j(t)).

We also assume the following:

8. PARTIALLY ASYNCHRONOUS ALGORITHMS-II

We now turn to the study of partially asynchronous iterations that converge only when the stepsize is small. We illustrate the behavior of such algorithms in terms of a prototypical example.

Let A be an n × n positive definite symmetric matrix and let b be a vector in ℜ^n. We consider the asynchronous iteration x := x - γ(Ax - b), where γ is a small positive stepsize. We define a cost function F: ℜ^n → ℜ by F(x) = ½x'Ax - x'b, and our iteration is equivalent to the gradient algorithm x := x - γ∇F(x) for minimizing F. This algorithm is known to converge synchronously provided that γ is chosen small enough. On the other hand, it was shown in Example 1 that the gradient algorithm does not converge totally asynchronously. Furthermore, a careful examination of the argument in that example reveals that for every value of γ there exists a B large enough such that the partially asynchronous gradient algorithm does not converge. Nevertheless, if γ is fixed to a small value, and if B is not excessively large (we roughly need B ≤ C/γ, where C is some constant determined by the structure of the matrix A), then the partially asynchronous iteration turns out to be convergent. An equivalent statement is that for every value of B there exists some γ_0 > 0 such that if 0 < γ < γ_0 then the partially asynchronous algorithm converges (Tsitsiklis et al., 1986; Bertsekas and Tsitsiklis, 1989b). The rationale behind such a result is the following. If the information available to processor i on the value of x_j is outdated by at most B time units, then the difference between the value x_j(τ^i_j(t)) possessed by processor i and the true value x_j(t) is of the order of γB, because each step taken by processor j is of the order of γ. It follows that for γ very small the errors caused by asynchronism become negligible and cannot destroy the convergence of the algorithm.

The above mentioned convergence result can be extended to more general gradient-like algorithms for nonquadratic cost functions F. One only needs to assume that the iteration is of the form x := x - γs(x), where s(x) is an update direction with the property s_i(x)∇_iF(x) ≥ K|∇_iF(x)|², where K is a positive constant, together with a Lipschitz continuity condition on ∇F and a boundedness assumption of the form ||s(x)|| ≤ L||∇F(x)|| (Tsitsiklis et al., 1986; Bertsekas and Tsitsiklis, 1989b). Similar conclusions are obtained for gradient projection iterations for constrained convex optimization (Bertsekas and Tsitsiklis, 1989b).
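The trade-off between the stepsize γ and the asynchronism bound B can be observed in a simulation such as the following, in which every processor uses neighbor values that are exactly B steps old; the matrix, the stepsize, and the two values of B are hypothetical.

def delayed_gradient(A, b, gamma, B, steps=40000):
    """Gradient iteration x := x - gamma*(Ax - b) in which processor i always
    uses values of the other components that are outdated by B steps."""
    n = len(b)
    hist = [[0.0] * n for _ in range(B + 1)]        # hist[-1] is the current x
    for _ in range(steps):
        cur, old = hist[-1], hist[0]
        new = []
        for i in range(n):
            view = [cur[j] if j == i else old[j] for j in range(n)]  # own value is fresh
            grad_i = sum(A[i][j] * view[j] for j in range(n)) - b[i]
            new.append(cur[i] - gamma * grad_i)
        hist = hist[1:] + [new]
    x = hist[-1]
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))

# Hypothetical positive definite system with 0 < eps < 1 (cf. Example 1).
eps = 0.5
A = [[1 + eps if i == j else 1.0 for j in range(3)] for i in range(3)]
b = [1.0, 2.0, 3.0]
print("B = 10 :", delayed_gradient(A, b, gamma=0.02, B=10))    # small B: residual goes to zero
print("B = 200:", delayed_gradient(A, b, gamma=0.02, B=200))   # large B: residual keeps growing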

Assumption 3. (a) If t ∈ T^i and x_i(t + 1) ≠ x_i(t), then processor i will eventually send a message to every other processor.

(b) If a processor i has sent a message with the


Example 3. Consider the function f: ℜ² → ℜ² defined by

f_1(x) = { -x_1, if x_2 ≥ ε/2,
            0,   if x_2 < ε/2,

f_2(x) = x_2/2.

It is clear that the asynchronous iteration x := f(x) converges to x* = (0, 0); in particular, x_2 is updated according to x_2 := x_2/2 and tends to zero; thus, it eventually becomes smaller than ε/2. Eventually processor 1 receives a value of x_2 smaller than ε/2 and a subsequent update by the same processor sets x_1 to zero.

Let us now consider the iteration x := g(x). If the algorithm is initialized with x_2 between ε/2 and ε, then the value of x_2 will never change, and processor 1 will keep executing the nonconvergent iteration x_1 := -x_1. Thus, the asynchronous iteration x := g(x) is not guaranteed to terminate.
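A direct simulation of Example 3 shows this behavior. In the sketch below, g is the ε-thresholded modification of f introduced in this section; the threshold, the initial point, and the (synchronous) schedule are hypothetical.

EPS = 0.2

def f(x):
    """The mapping of Example 3: f1 flips x1 unless x2 has dropped below eps/2."""
    x1, x2 = x
    f1 = -x1 if x2 >= EPS / 2 else 0.0
    return (f1, x2 / 2)

def g(x):
    """Thresholded mapping: component i is updated only if it would move by >= eps."""
    fx = f(x)
    return tuple(fx[i] if abs(fx[i] - x[i]) >= EPS else x[i] for i in range(2))

x = (1.0, 0.15)          # x2 strictly between eps/2 and eps
for t in range(10):
    x = g(x)             # a synchronous execution already exhibits the problem
    print(t, x)
# x2 never changes (|x2/2 - x2| = 0.075 < EPS), so x1 keeps flipping between +1 and -1:
# the iteration x := g(x) does not terminate, although x := f(x) converges to (0, 0).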

value of x_i(t) to some other processor j, then processor i will send a new message to processor j only after the value of x_i changes (due to an update by processor i).

(c) Messages are received in the order that they are transmitted.

(d) Each processor sends at least one message to every other processor.

The remainder of this section is devoted to the derivation of conditions under which the iteration x := g(x) is guaranteed to terminate. We introduce some notation. Let I be a subset of the set {1, ..., p} of all processors. For each i ∈ I, let there be given some value θ_i ∈ X_i. We consider the asynchronous iteration x := f^{I,θ}(x), which is the same as the iteration x := f(x) except that any component x_i, with i ∈ I, is set to the value θ_i. Formally, the mapping f^{I,θ} is defined by letting f^{I,θ}_i(x) = f_i(x), if i ∉ I, and f^{I,θ}_i(x) = θ_i, if i ∈ I.
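In code, the mapping f^{I,θ} can be obtained from f as in the following sketch; the particular f, I, and θ are hypothetical.

def freeze(f, I, theta):
    """Return the mapping f^{I,theta}: components in I are pinned to theta[i],
    all other components are computed by f."""
    def f_frozen(x):
        fx = f(x)
        return [theta[i] if i in I else fx[i] for i in range(len(x))]
    return f_frozen

# Hypothetical two-component example.
f = lambda x: [0.5 * x[0] + 0.25 * x[1], 0.25 * x[0] + 0.5 * x[1] + 1.0]
f_frozen = freeze(f, I={1}, theta={1: 2.0})
print(f_frozen([0.0, 0.0]))   # component 1 is held at 2.0, component 0 follows f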

Proposition 5. (Bertsekas and Tsitsiklis, 1989a) Let Assumption 3 hold. Suppose that for any I ⊂ {1, ..., p} and for any choice of θ_i ∈ X_i, i ∈ I, the asynchronous iteration x := f^{I,θ}(x) is guaranteed to converge. Then the asynchronous iteration x := g(x) terminates in finite time.

Assumption 3(d) is only needed to get the algorithm started. Assumption 3(b) is crucial and has the following consequences. If the value of x(t) settles to some final value, then there will be some time t* after which no messages will be sent. Furthermore, all messages transmitted before t* will eventually reach their destinations and the algorithm will eventually reach a quiescent state where none of the variables x_i changes and no message is in transit. We can then say that the algorithm has terminated.

More formally, we view termination as equivalent to the following two properties:

(i) No message is in transit.

(ii) An update by some processor i causes no change in the value of x_i.

Property (ii) is a collection of local termination conditions. There are several algorithms for termination detection when a termination condition can be decomposed as above (Dijkstra and Scholten, 1980; Bertsekas and Tsitsiklis, 1989b). Thus termination detection causes no essential difficulties, under the assumption that the asynchronous algorithm terminates in finite time.

We now turn to the more difficult problem of converting a convergent asynchronous iterative algorithm into a finitely terminating one. If we were dealing with the synchronous iteration x(t + 1) = f(x(t)), it would be natural to terminate the algorithm when the condition ||x(t + 1) - x(t)|| ≤ ε is satisfied, where ε is a small positive constant reflecting the desired accuracy of solution, and where ||·|| is a suitable norm. This suggests the following approach for the context of asynchronous iterations. Given the iteration mapping f and the accuracy parameter ε, we define a new iteration mapping g: X → X by letting

Proof. Consider the asynchronous iteration x := g(x). Let J be the set of all indices i for which the variable x_i(t) changes only a finite number of times. For each i ∈ J, let θ_i be the limiting value of x_i(t). Since f maps X into itself, so does g. It follows that θ_i ∈ X_i for each i. For each i ∈ J, processor i sends a positive but finite number of messages [Assumptions 3(d) and (b)]. By Assumption 3(a), the last message sent by processor i carries the value θ_i and by Assumption 3(c) this is also the last message received by any other processor. Thus, for all t large enough and for all i ∈ J, we will have x^j_i(t) = x_i(τ^j_i(t)) = θ_i. Thus, the iteration x := g(x) eventually becomes identical with the iteration x := f^{J,θ}(x) and therefore converges. This implies that the difference x_i(t + 1) - x_i(t) converges to zero for any i ∉ J. On the other hand, because of the definition of the mapping g, the difference x_i(t + 1) - x_i(t) is either zero, or its magnitude is bounded below by ε > 0. It follows that x_i(t + 1) - x_i(t) eventually settles to zero, for every i ∉ J. This shows that i ∈ J for every i ∉ J; we thus obtain a contradiction unless J = {1, ..., p}, which proves the desired result. Q.E.D.

g_i(x) = { f_i(x), if ||f_i(x) - x_i|| ≥ ε,
           x_i,    otherwise.

We will henceforth assume that the processors are executing the asynchronous iteration x := g(x). The key question is whether this new iteration is guaranteed to terminate in finite time. One could argue as follows. Assuming that the original iteration x := f(x) is guaranteed to converge, the changes in the vector x will eventually become arbitrarily small, in which case we will have g(x) = x and the iteration x := g(x) will terminate. Unfortunately, this argument is fallacious, as demonstrated by the following example.

We now identify certain cases in which the main assumption in Proposition 5 is guaranteed to hold. We consider first the case of monotone iterations and we


the conclusions of the analysis, but more testing with a broader variety of computer architectures is needed to provide a comprehensive picture of the practical behavior of asynchronous iterations. Furthermore, the proper implementation of asynchronous algorithms in real parallel machines can be quite challenging and more experience is needed in this area. Finally, much remains to be done to enlarge the already substantial class of problems for which asynchronous algorithms can be correctly applied.

assume that the iteration mapping f has the properties introduced in Section 6.2. For any I and {θ_i | i ∈ I}, the mapping f^{I,θ} inherits all of the properties of f, except that f^{I,θ} is not guaranteed to have a unique fixed point. If this latter property can be independently verified, then the asynchronous iteration x := f^{I,θ}(x) is guaranteed to converge, and Proposition 5 applies. Let us simply say here that this property can indeed be verified for several interesting problems.

Let us now consider the case where f satisfies the contraction condition ||f(x) - x*|| ≤ α ||x - x*|| of (6.2). Unfortunately, it is not necessarily true that the mappings f^{I,θ} also satisfy the same contraction condition. In fact, the mappings f^{I,θ} are not even guaranteed to have a fixed point. Let us strengthen the contraction condition of (6.2) and assume that

||f(x) - f(y)|| ≤ α ||x - y||, ∀x, y ∈ ℜ^n,   (9.1)

where ||·|| is the weighted maximum norm of (6.1) and α ∈ [0, 1). We have f^{I,θ}_i(x) - f^{I,θ}_i(y) = θ_i - θ_i = 0 for all i ∈ I. Thus,

||f^{I,θ}(x) - f^{I,θ}(y)|| = max_{i∉I} (1/w_i) ||f_i(x) - f_i(y)||_i
                           ≤ max_i (1/w_i) ||f_i(x) - f_i(y)||_i
                           = ||f(x) - f(y)|| ≤ α ||x - y||.

Hence, the mappings f^{I,θ} inherit the contraction property (9.1). As discussed in Section 6, this property guarantees asynchronous convergence and therefore Proposition 5 applies again.

We conclude that the modification x := g(x) of the asynchronous iteration x := f(x) is often, but not always, guaranteed to terminate in finite time. It is an interesting research question to devise economical termination procedures for the iteration x := f(x) that are always guaranteed to work. The snapshot algorithm of Chandy and Lamport (1985) [see Bertsekas and Tsitsiklis (1989b), Section 8.2] seems to be one option.

10. CONCLUSIONS

Iterative algorithms are easy to parallelize and can be executed synchronously even in inherently asynchronous computing systems. Furthermore, for the regular communication networks associated with several common parallel architectures, the communication requirements of iterative algorithms are not severe enough to preclude the possibility of massive parallelization and speedup of the computation. Iterative algorithms can also be executed asynchronously, often without losing the desirable convergence properties of their synchronous counterparts, although the mechanisms that affect convergence can be quite different for different types of algorithms. Such asynchronous execution may offer substantial advantages in a variety of contexts.

At present, there is very strong evidence suggesting that asynchronous iterations converge faster than their synchronous counterparts. However, this evidence is principally based on analysis and simulations. There is only a small number of related experimental works using shared memory machines. These works support

REFERENCES

Anwar, M. N. and N. El Tarazi (1985). Asynchronous algorithms for Poisson's equation with nonlinear boundary conditions. Computing, 34, 155-168.

Awerbuch, B. (1985). Complexity of network synchroniza-tion. J. ACM, 32.1104-1123.

Barbosa, V. C. (1986). Concurrency In systems wIthneighborhood constraints. Doctoral Dissertation. Compu-ter Science Dept., U.C.L.A.. Los Angeles. CA. U.S.A.

Barbosa. V. C. and E. M. Gafni (1987). Concurrency inheavily loaded neighborhood-constrained systems. Proc.7th Int. Cont on Distributed ComputIng Sy.ftems.

Baudet, G. M. (1978). Asynchronous iterative methods for multiprocessors. J. ACM, 25, 226-244.

Bertsekas, D. P. (1982). Distributed dynamic programming. IEEE Trans. Aut. Control, AC-27, 610-616.

Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Math. Programming, 27, 107-120.

Bertsekas, D. P. (1986). Distributed asynchronous relaxation methods for linear network flow problems. Technical Report LIDS-P-1606, Laboratory for Information and Decision Systems, M.I.T., Cambridge, MA.

Bertsekas, D. P. and D. A. Castanon (1989). Parallel synchronous and asynchronous implementations of the auction algorithm. Technical Report TP-308, Alphatech Inc., Burlington, MA. Also Parallel Computing (to appear).

Bertsekas, D. P. and D. A. Castanon (1990). Parallel asynchronous primal-dual methods for the minimum cost flow problem. Math. Programming (submitted).

Bertsekas, D. P. and J. Eckstein (1988). Dual coordinate step methods for linear network flow problems. Math. Programming, 42, 203-243.

Bertsekas, D. P. and J. Eckstein (1987). Distributed asynchronous relaxation methods for linear network flow problems. Proc. IFAC '87, Munich, F.R.G.

Bertsekas, D. P. and D. El Baz (1987). Distributed asynchronous relaxation methods for convex network flow problems. SIAM J. Control Optimiz., 25, 74-84.

Bertsekas, D. P. and R. G. Gallager (1987). Data Networks. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas. D. P.. C. Ozveren. G. Stamoulis. P. Tseng and JN. Tsitsiklis (1989). Optimal communication algorithmsfor hypercubes. Technical Report LIDS-P-11l47. Labora-tory for Information and Decision Systems. M.I.T..Cambridge. MA. Also J. Parallel Distrib. Compul. (toappear).

Bertsekas, D. P. and J. N. Tsitsiklis (1989a). Convergence rate and termination of asynchronous iterative algorithms. Proc. 1989 Int. Conf. on Supercomputing, Irakleion, Greece, 1989, pp. 461-470.

Bertsekas, D. P. and J. N. Tsitsiklis (1989b). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Bojanczyk, A. (1984). Optimal asynchronous Newtonmethod for the solution of nonlinear equations. J. ACM.32. 792-803.

Chandy. K. M. and L. Lamport (1985). Distributedsnapshots: determining global states of distributed systems.ACM Trans. Comput. Syst., 3.63-75.

Chazan, D. and W. Miranker (1969). Chaotic relaxation. Linear Algebra and its Applications, 2, 199-222.

Donnelly, J. D. P. (1971). Periodic chaotic relaxation. Linear Algebra and its Applications, 4, 117-128.


Robert, F., M. Charnay and F. Musy (1975). Itérations chaotiques série-parallèle pour des équations non linéaires de point fixe. Aplikace Matematiky, 20, 1-38.

Saad. Y. and M, H. Schultz (1987). Data Communication inHypercubes. Research Report Y ALEU/DCS/RR-428.Yale University. New Haven. CN.

Smart. D. and J. White (1988). Reducing the parallelsolution time of sparse circuit matrices using reorderedGaussIan elimination and relaxation. Proc. 1988 ISCAS.Espoo. Finland.

Spiteri, P. (1984). Contribution à l'étude de grands systèmes non linéaires. Doctoral Dissertation, Université de Franche-Comté, Besançon, France.

Spiteri. P. (1986). Paralle! asynchronous algorithms forsolving boundary value problems. In M. Cosnard et al.(Eds.), Parallel Algorithms and Architectures, NorthHolland. Amsterdam, pp. 73-84.

Tajibnapis, W. D. (1977). A correctness proof of a topologyinformation maintenance protocol for a distributedcomputer network. Commun. ACM. 20.477-485.

Tsai, W. K. (1986). Optimal quasi-static routing for virtual circuit networks subjected to stochastic inputs. Doctoral Dissertation, Dept of Electrical Engineering and Computer Science, M.I.T., Cambridge, MA.

Tsai. W. K. (1989). Convergence of gradient projectionrouting methods in an asynchronous stochastic quasI-staticvirtual circuit network. IEEE Tran.r. Aut. Control. 34.20-33.

Tsai, W. K.. J. N. Tsitsiklis and D. P. Bertsekas (1986).Some issues in distributed asynchronous routing in virtualcircuit data networks. Proc. 25th IEEE Con/. on Decisionand Control, Athens, Greece. 1335-1337

Tseng. P. (1990). Distributed computation for linearprogramming problems satisfying a certain diagonaldominance condition. Math. Operat. Res, 15.33-48.

Tseng, P., D. P. Bertsekas and J. N. Tsitsiklis (1990). SIAMJ. Control Optimiz. 28. 678-710

Tsitsiklis, J. N. (1984). Problems in decentralized decisionmaking and computation. Ph.D. thesis, Dep of ElectricalEngineering and Computer Science. M.l.T., Cambridge,MA.

Tsitsiklis, J. N. (1987). On the stability of asynchronous iterative processes. Math. Systems Theory, 20, 137-153.

Tsitsiklis, J. N. (1989). A comparison of Jacobi andGauss-Seidel parallel iterations. Applied Math. Lett.. 2.167-170.

Tsitsiklis, J. N. and D. P. Bertsekas (1986). Distributed asynchronous optimal routing in data networks. IEEE Trans. Aut. Control, AC-31, 325-332.

Tsitsiklis, J. N., D. P. Bertsekas and M. Athans (1986). Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Aut. Control, AC-31, 803-812.

Tsitsiklis, J. N. and G. D. Stamoulis (1990). On the average communication complexity of asynchronous distributed algorithms. Technical Report LIDS-P-1986, Laboratory for Information and Decision Systems, M.I.T., Cambridge, MA.

Uresin, A. and M. Dubois (1986). Generalized asynchronous iterations. In Lecture Notes in Computer Science, 237, Springer, Berlin, pp. 272-278.

Uresin, A. and M. Dubois (1988). Sufficient conditions for the convergence of asynchronous iterations. Technical Report, Computer Research Institute, University of Southern California, Los Angeles, CA, U.S.A.

Uresin, A. and M. Dubois (1990). Parallel asynchronous algorithms for discrete data. J. ACM, 37, 588-606.

Zenios, S. A. and R. A. Lasken (1988). Nonlinear network optimization on a massively parallel Connection Machine. Annals of Operations Research, 14, 147-165.

Zenios, S. A. and J. M. Mulvey (1988). Distributed algorithms for strictly convex network flow problems. Parallel Computing, 6, 45-56.

Dijkstra. E. W. and C. S. Scholten (1980). Terminationdetection for diffusing computations. Inform. Process.Lell.. 11. 1-4.

Dubois. M. and F. A. Briggs (1982). Performance ofsynchronIzed iterative processes in multi-processor sys-tems. IEEE Trans. Software Engng, 8.419-431.

El Tarazi, M. N. (1982). Some convergence results for asynchronous algorithms. Numerische Mathematik, 39, 325-340.

Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon and D. Walker (1988). Solving Problems on Concurrent Processors, Vol. I. Prentice-Hall, Englewood Cliffs, NJ.

Jefferson. D. R. (1985). Virtual time. ACM Trans.Programm. Languages Syst., 7. 404-425.

Johnsson, S. L. and C. T. Ho (1989). Optimum broadcastingand personalized communication in hypercubes. IEEETrans. Comput.. 38. 1249-1268.

Kung. H. T. (1976). Synchronized and asynchronous parallelalgorithms for multiprocessors. In Algorithms andComplexity. Academic Press. New York. pp. 153-200.

Kushner. H. J. and G. Yin (1987a). Stochastic approxima-tion algorithms for parallel and distributed processing.Stochastics. 22.219-250.

Kushner. H. J. and G. Yin (1987b). Asymptotic proper-ties of distributed and communicating stochastic approxim-ation algorithms. SIAM J. Control Optimiz.. 25. 1266-1290.

Lang. B.. J. C. Miellou and P. Spiteri (1986). Asynchronousrelaxation algorithms for optimal control problems. Math.("omput. Simult.. 28. 227-242.

Lavenberg. S.. R. Muntz and B. Samadi (1983).Performance analysis of a rollback method for distributedsimulation. In A. K. Agrawala and S. K. Tripathi (Eds.).Pl'rformance 83. North Holland. Amsterdam, pp. 117-132.

Li, S. and T. Basar (1987). Asymptotic agreement and convergence of asynchronous stochastic algorithms. IEEE Trans. Aut. Control, 32, 612-618.

Lubachevsky. B. and D. Mitra (1986). A chaoticasynchronous algorithm for computing the fixed point of anonnegative matrix of unit spectral radius. J. ACM, 33.130-150.

McBryan, O. A. and E. F. Van der Velde (1987).Hypercube algorithms and implementations, SIAM J.Scientific Statist. Comput., 8. s227-s287.

Miellou, J. C. (1975a). Algorithmes de relaxation chaotique à retards. R.A.I.R.O., 9, R-1, 55-82.

Miellou, J. C. (1975b). Itérations chaotiques à retards; études de la convergence dans le cas d'espaces partiellement ordonnés. Comptes Rendus, Académie des Sciences de Paris, 280, Série A, 233-236.

Miellou, J. C. and P. Spiteri (1985). Un critère de convergence pour des méthodes générales de point fixe. Math. Modelling Numer. Anal., 19, 645-669.

Mitra. D (1987). Asynchronous relaxations for thenumerical solution of differential equations by parallelprocessors, SIAM J. Sci. Statist. Comput., 8. s43-s58.

Mitra. D. and I. Mitrani (1984). Analysis and optimumperformance of two message-passing parallel processorssynchronized by rollback. In E. Gelenbe (Ed.).Performance '84. North Holland. Amsterdam. 35-50.

Nassimi. D. and S. Sahni (1980). An optimal routingalgorithm for mesh-connected parallel computers. J.ACM, 27. 6-29.

Ortega, J. M. and R. G. Voigt (1985). Solution of partial differential equations on vector and parallel computers. SIAM Review, 27, 149-240.

Ozveren, C. (1987). Communication aspects of parallel processing. Technical Report LIDS-P-1721, Laboratory for Information and Decision Systems, M.I.T., Cambridge, MA.

Robert, F. (1976). Contraction en norme vectorielle: convergence d'itérations chaotiques pour des équations non linéaires de point fixe à plusieurs variables. Linear Algebra and its Applications, 13, 19-35.

Robert, F. (1987). Itérations discrètes asynchrones. Technical Report 671M, I.M.A.G., University of Grenoble, France.