Working with Mike on Distributed Computing Theory, 1978-1992


Page 1: Working with Mike on Distributed Computing Theory, 1978-1992

Working with Mike on Distributed Computing Theory, 1978-1992

Nancy Lynch
Theory of Distributed Systems Group, MIT CSAIL

Fischer Fest, July 2003

Page 2: Working with Mike on Distributed Computing Theory, 1978-1992

My joint papers with Mike

• Abstract complexity theory [Lynch, Meyer, Fischer 72] [Lynch, Meyer, Fischer 72, 76]
• Lower bounds for mutual exclusion [Burns, Fischer, Jackson, Lynch, Peterson 78, 82]
• K-consensus [Fischer, Lynch, Burns, Borodin 79]
• Describing behavior and implementation of distributed systems [Lynch, Fischer 79, 80, 81]
• Decomposing shared-variable algorithms [Lynch, Fischer 80, 83]
• Optimal placement of resources in networks [Fischer, Griffeth, Guibas, Lynch 80, 81, 92]
• Time-space tradeoff for sorting [Borodin, Fischer, Kirkpatrick, Lynch, Tompa 81]

Page 3: Working with Mike on Distributed Computing Theory, 1978-1992

More joint papers

• Global snapshots of distributed computations [Fischer, Griffeth, Lynch 81, 82]
• Lower bound on rounds for Byzantine agreement [Fischer, Lynch 81, 82]
• Synchronous vs. asynchronous distributed systems [Arjomandi, Fischer, Lynch 81, 83]
• Efficient Byzantine agreement algorithms [Dolev, Fischer, Fowler, Lynch, Strong 82]
• Impossibility of consensus [Fischer, Lynch, Paterson 82, 83, 85]
• Colored ticket algorithm [Fischer, Lynch, Burns, Borodin 83]
• Easy impossibility proofs for distributed consensus problems [Fischer, Lynch, Merritt 85, 86]

Page 4: Working with Mike on Distributed Computing Theory, 1978-1992


And still more joint papers!

• FIFO resource allocation in small shared space [Fischer, Lynch, Burns, Borodin 85, 89]

• Probabilistic analysis of network resource allocation algorithm [Lynch, Griffeth, Fischer, Guibas 85, 86]

• Reliable communication over unreliable channels [Afek, Attiya, Fekete, Fischer, Lynch, Mansour, Wang, Zuck 92, 94]

• 16 major projects…14 in the area of distributed computing theory…

• Some well known, some not…

Page 5: Working with Mike on Distributed Computing Theory, 1978-1992

In this talk:

• I'll describe what these papers are about, and what I think they contributed to the field of distributed computing theory.
• Put in context of earlier/later research.
• Give you a little background about why/how we wrote them.

Page 6: Working with Mike on Distributed Computing Theory, 1978-1992

By topic:

1. Complexity theory
2. Mutual exclusion and related problems
3. Semantics of distributed systems
4. Sorting
5. Resource placement in networks
6. Consistent global snapshots
7. Synchronous vs. asynchronous distributed systems
8. Distributed consensus
9. Reliable communication from unreliable channels

Page 7: Working with Mike on Distributed Computing Theory, 1978-1992

1. Prologue (Before distributed computing)

• MIT, 1970-72
• Mike Fischer + Albert Meyer's research group in algorithms and complexity theory:
  – Amitava Bagchi, Donna Brown, Jeanne Ferrante, David Johnson, Dennis Kfoury, me, Robbie Moll, Charlie Rackoff, Larry Stockmeyer, Bostjan Vilfan, Mitch Wand, Frances Yao,…
• Lively…energetic…ideas…parties…fun
• My papers with Mike (based on my thesis):
  – Priority arguments in complexity theory [Lynch, Meyer, Fischer 72]
  – Relativization of the theory of computational complexity [Lynch, Meyer, Fischer 72, 76]

Page 8: Working with Mike on Distributed Computing Theory, 1978-1992

Abstract Complexity Theory

• Priority arguments in complexity theory [Lynch, Meyer, Fischer 72]
• Relativization of the theory of computational complexity [Lynch, Meyer, Fischer 72, Trans AMS 76]
• What are these papers about?
  – They show the existence of pairs, A and B, of recursive problems that are provably hard to solve, even given the other as an oracle.
  – In fact, given any hard A, there's a hard B that doesn't help.
• What does this have to do with Distributed Computing Theory?
  – Nothing at all.
  – Well, it foreshadows our later focus on lower bounds/impossibility results.
  – Suggests study of relative complexity (and computability) of problems, which has become an important topic in DCT.

Page 9: Working with Mike on Distributed Computing Theory, 1978-1992

2. Many Years Later…Mutual Exclusion and Related Problems

• Background:
  – We met again at a 1977 theory conference, decided the world needed a theory for distributed computing.
  – Started working on one…with students: Jim Burns, Paul Jackson (Georgia Tech), Gary Peterson (U. Wash.)
  – Lots of visits, Georgia Tech and U. Wash.
  – Mike's sabbatical at Georgia Tech, 1980
  – Read lots of papers:
    • Dijkstra---Mutual exclusion,…
    • Lamport---Time clocks,…
    • Johnson and Thomas---Concurrency control
    • Cremers and Hibbard---Lower bound on size of shared memory to solve 2-process mutual exclusion
  – Finally, we wrote one:
    • Data Requirements for Implementation of N-Process Mutual Exclusion Using a Single Shared Variable [Burns, Fischer, Jackson, Lynch, Peterson 78, 82]

Page 10: Working with Mike on Distributed Computing Theory, 1978-1992

Mutual Exclusion

• Data Requirements for Implementation of N-Process Mutual Exclusion Using a Single Shared Variable [Burns, F, Jackson, L, Peterson ICPP 78, JACM 82]
• What is this paper about?
  – N processes accessing read-modify-write shared memory, solving N-process lockout-free mutual exclusion.
  – Lower bound of (N+1)/2 memory states.
    • Constructs bad executions that "look like" other executions, as far as processes can tell.
  – For bounded waiting, lower bound of N+1 states.
  – Nearly-matching algorithms
    • Based on distributed simulation of a centralized scheduler process (see the sketch below).
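To make the "single read-modify-write shared variable" setting concrete, here is a minimal, hypothetical Python sketch (not the BFJLP algorithm): a ticket-style mutual exclusion protocol in which the only shared state is one variable accessed atomically. It uses unbounded ticket numbers; the paper's contribution is how few states of such a variable suffice, which this sketch does not attempt to achieve.

```python
import threading

class AtomicSharedVariable:
    """Models a single shared variable with atomic read-modify-write access."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def rmw(self, fn):
        """Atomically apply fn to the current value; return (old, new)."""
        with self._lock:
            old = self._value
            self._value = fn(old)
            return old, self._value

# The shared variable holds a pair (next_ticket, now_serving).
shared = AtomicSharedVariable((0, 0))

def acquire():
    (my_ticket, _), _ = shared.rmw(lambda v: (v[0] + 1, v[1]))  # take a ticket
    while True:
        (_, serving), _ = shared.rmw(lambda v: v)               # read (identity RMW)
        if serving == my_ticket:
            return

def release():
    shared.rmw(lambda v: (v[0], v[1] + 1))

def worker(i, results):
    for _ in range(3):
        acquire()
        results.append(i)   # critical region; entries occur in ticket (FIFO) order
        release()

if __name__ == "__main__":
    results = []
    threads = [threading.Thread(target=worker, args=(i, results)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(results)
```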

Page 11: Working with Mike on Distributed Computing Theory, 1978-1992

Mutual Exclusion

• Theorem: Lower bound of N states, bounded waiting:
  – Let processes 1,…,N enter the trying region, one by one.
  – If the memory has < N states, two processes, say i and j, must leave the memory in the same state.
  – Then processes i+1,…,j are hidden and can be bypassed arbitrarily many times.
• Theorem: Lower bound of N+1 states, bounded waiting:
  – Uses a slightly more complicated strategy of piecing together execution fragments.
• Theorem: Lower bound of N/2 states, for lockout-freedom:
  – Still more complicated strategy.
• Two algorithms…
  – Based on simulating a centralized scheduler.
  – Bounded-waiting algorithm with only ~N states.
  – Surprising lockout-free algorithm with only ~N/2 states!

Page 12: Working with Mike on Distributed Computing Theory, 1978-1992

Mutual Exclusion

• Why is this interesting?
  – Cute algorithms and lower bounds.
  – Some of the first lower bound results in DCT.
  – "Looks like" argument for lower bounds, typical of many later impossibility arguments.
  – Virtual scheduler process provides some conceptual modularity for the algorithms.
  – We worked out a lot of formal definitions:
    • State-machine model for asynchronous shared memory systems.
    • With liveness assumptions, input/output distinctions.
    • Exclusion problems, safety and liveness requirements.

Page 13: Working with Mike on Distributed Computing Theory, 1978-1992

Decomposing Algorithms Using a Virtual Scheduler

• Background:
  – Continuing on the same theme…
  – Mutual exclusion algorithms (ours and others') were complicated, hard to understand and prove correct.
  – We realized we needed ways of decomposing them.
  – Our algorithms used a virtual scheduler, informally.
  – Could we make this rigorous?
  – We did:
    • A Technique for Decomposing Algorithms that Use a Single Shared Variable [Lynch, Fischer 80, JCSS 83]

Page 14: Working with Mike on Distributed Computing Theory, 1978-1992

Decomposing Distributed Algorithms

• A Technique for Decomposing Algorithms that Use a Single Shared Variable [Lynch, Fischer 80, JCSS 83]
• What is this paper about?
  – Defines a "supervisor system", consisting of N contending processes + one distinguished, permanent supervisor process (see the sketch below).
  – Shows that, under certain conditions, a system of N contenders with no supervisor can simulate a supervisor system.
  – Generalizes, and makes explicit, the technique used in [BFJLP 78, 82].
  – Applies this method to two mutual exclusion algorithms.

[Figure: the distinguished supervisor ("Sup") process in a supervisor system]
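As an illustration of the supervisor-system idea (a hypothetical sketch, not the construction from the paper): the contenders never coordinate among themselves; they only exchange requests and grants with one distinguished supervisor process, which serializes access.

```python
import threading
import queue

def supervisor(requests, done):
    """Distinguished supervisor process: grants the critical region
    to one contender at a time, in request (FIFO) order."""
    while True:
        grant_event, name = requests.get()
        if name is None:              # shutdown sentinel
            return
        grant_event.set()             # grant the critical region
        done.wait()                   # wait until that contender leaves it
        done.clear()

def contender(name, requests, done, log):
    grant = threading.Event()
    requests.put((grant, name))       # ask the supervisor for the resource
    grant.wait()
    log.append(name)                  # critical region
    done.set()                        # tell the supervisor we are finished

if __name__ == "__main__":
    requests, done, log = queue.Queue(), threading.Event(), []
    sup = threading.Thread(target=supervisor, args=(requests, done))
    sup.start()
    workers = [threading.Thread(target=contender, args=(f"P{i}", requests, done, log))
               for i in range(5)]
    for w in workers: w.start()
    for w in workers: w.join()
    requests.put((None, None))        # stop the supervisor
    sup.join()
    print(log)
```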

Page 15: Working with Mike on Distributed Computing Theory, 1978-1992

Decomposing Distributed Algorithms

• Why is this interesting?
  – Explicit decomposition of a complex distributed algorithm.
  – Prior to this point, distributed algorithms were generally presented as monolithic entities.
  – Foreshadows later extensive uses of decomposition and transformations.
  – Resulting algorithm is easier to understand.
  – Not much increase in costs.

Page 16: Working with Mike on Distributed Computing Theory, 1978-1992

Generalization to K-Exclusion

• Background:
  – Still continuing on the same theme…
  – Started inviting other visitors to join us (Borodin, Lamport, Arjomandi,…)
  – Our mutual exclusion algorithms didn't tolerate stopping failures, either in the critical region or in the trying/exit regions.
  – So we added in some fault-tolerance:
    • K-exclusion, to model access to K resources (so progress can continue if < K processes fail in the critical region).
    • f-robustness, to model progress in the presence of at most f failures overall.
  – Still focused on bounds on the size of shared memory.
  – Wrote:
    • Resource allocation with immunity to limited process failure [Fischer, Lynch, Burns, Borodin FOCS 79]

Page 17: Working with Mike on Distributed Computing Theory, 1978-1992

K-Exclusion

• Resource allocation with immunity to limited process failure [Fischer, Lynch, Burns, Borodin FOCS 79]
• What is this paper about?
  – N processes accessing RMW shared memory, solving f-robust lockout-free K-exclusion.
  – Three robust algorithms:
    • Unlimited robustness, FIFO, O(N²) states
      – Simulates a queue of waiting processes in shared memory.
      – Distributed implementation of the queue.
    • f-robust, bounded waiting, O(N) states
      – Simulates a centralized scheduler.
      – Fault-tolerant, distributed implementation of the scheduler, using 2f+1 "helper" processes.
    • f-robust, FIFO, O(N (log N)^c) states
  – Corresponding lower bounds:
    • Ω(N²) states for unlimited robustness, FIFO
    • Ω(N) states for f-robustness, bounded waiting

Page 18: Working with Mike on Distributed Computing Theory, 1978-1992

K-Exclusion

• Why is this interesting?
  – New definitions:
    • K-exclusion problem, studied later by [Dolev, Shavit], others.
    • f-robustness (progress in the presence of at most f failures), for exclusion problems.
  – Lots of cute algorithms and lower bounds.
  – Among the earliest lower bound results in DCT; more "looks like" arguments.
  – Algorithmic ideas:
    • Virtual scheduler process, fault-tolerant implementation.
    • Distributed implementation of a shared queue.
    • Passive communication with possibly-failed processes.
  – Proof ideas:
    • Refinement mapping used to show that the real algorithm implements the algorithm with a virtual scheduler.
    • First example I know of a refinement mapping used to verify a distributed algorithm.

Page 19: Working with Mike on Distributed Computing Theory, 1978-1992

K-Exclusion

• Background:
  – Lots of results in the FOCS 79 paper.
  – Only one seems to have made it to journal publication: the Colored Ticket Algorithm.

Page 20: Working with Mike on Distributed Computing Theory, 1978-1992

K-Exclusion

• The Colored Ticket Algorithm [F, L, Burns, Borodin 83]
• Distributed FIFO resource allocation using small shared space [F, L, Burns, Borodin 85, TOPLAS 89]
• What are these papers about?
  – FIFO K-exclusion algorithm, unlimited robustness, O(N²) states.
  – Simulates a queue of waiting processes in shared memory; the first K processes may enter the critical region.
  – First distributed implementation: use tickets with unbounded numbers. Keep track of the last ticket issued and the last ticket validated for entry to the critical region (see the sketch below).
  – Second version: use size-N batches of K + 1 different colors. Reuse a color if no ticket of the same color is currently issued or validated.
  – Corresponding Ω(N²) lower bound.
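A minimal sketch of the unbounded-ticket version of FIFO K-exclusion described above (hypothetical Python; a single-threaded demonstration of the state transitions, not a real concurrent implementation). The colored-ticket contribution, bounding the shared space, is not shown.

```python
class FifoKExclusion:
    """Unbounded-ticket FIFO K-exclusion (state-transition sketch).

    Shared state: 'issued' = number of tickets handed out,
                  'exited' = number of processes that have left the
                             critical region.
    A process holding ticket t may enter while t < exited + K, so at most
    K processes are inside at once, in ticket (FIFO) order.
    """
    def __init__(self, K):
        self.K = K
        self.issued = 0
        self.exited = 0

    def request(self):
        t = self.issued          # atomically take the next ticket
        self.issued += 1
        return t

    def may_enter(self, t):
        return t < self.exited + self.K

    def exit(self):
        self.exited += 1

if __name__ == "__main__":
    kx = FifoKExclusion(K=2)
    tickets = [kx.request() for _ in range(4)]   # four contenders
    print([t for t in tickets if kx.may_enter(t)])  # [0, 1]: only the first K=2 may enter
    kx.exit()                                       # ticket 0 leaves the critical region
    print(kx.may_enter(2))                          # True: ticket 2 may now enter
    print(kx.may_enter(3))                          # False
```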

Page 21: Working with Mike on Distributed Computing Theory, 1978-1992

K-Exclusion

• Why is this interesting?
  – Algorithm modularity: a bounded-memory algorithm simulating an unbounded-memory version.
  – Similar strategy to other bounded-memory algorithms, like the [Dolev, Shavit] bounded concurrent timestamp algorithm.

Page 22: Working with Mike on Distributed Computing Theory, 1978-1992

3. Semantics of Distributed Systems

• Background:
  – When we began working in DCT, there were no usable mathematical models for:
    • Describing distributed algorithms.
    • Proving correctness, performance properties.
    • Stating and proving lower bounds.
  – So we had to invent them.
  – At first, ad hoc, for each paper.
  – But soon, we were led to define something more general:
    • On describing the behavior and implementation of distributed systems [Lynch, Fischer 79, 80, 81]
  – We got to present this in Evian-les-Bains, France.

Page 23: Working with Mike on Distributed Computing Theory, 1978-1992

Semantics of Distributed Systems

• On describing the behavior and implementation of distributed systems [Lynch, Fischer 79, 80, TCS 81]
• What is this paper about?
  – It defined a mathematical modeling framework for asynchronous interacting processes.
  – Based on infinite-state state machines, communicating using shared variables (see the sketch below).
  – Input/output distinction, fairness.
  – Input-enabled, with respect to the environment's changes to shared variables.
  – External behavior notion: the set of "traces" of accesses to shared variables that arise during fair executions.
  – Implementation notion: subset, for sets of traces.
  – Composition definition, compositionality results.
  – Time measure. Time bound proofs using recurrences.
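A toy rendering of the modeling style described above (hypothetical Python, not the paper's formal definitions): processes are state machines whose steps read and write shared variables, and the "external behavior" is the trace of shared-variable accesses produced by a fair execution.

```python
class Process:
    """A (possibly infinite-state) state machine acting on shared variables."""
    def __init__(self, name, state, step):
        self.name, self.state, self.step = name, state, step

def fair_execution(processes, shared, rounds):
    """Round-robin (hence fair) scheduler; returns the trace of
    shared-variable accesses -- the 'external behavior' of the run."""
    trace = []
    for _ in range(rounds):
        for p in processes:
            p.state, accesses = p.step(p.state, shared)
            trace.extend((p.name,) + a for a in accesses)
    return trace

# Example: two processes incrementing a shared counter 'x'.
def incrementer(state, shared):
    old = shared["x"]                      # read access
    shared["x"] = old + 1                  # write access
    return state + 1, [("read", "x", old), ("write", "x", old + 1)]

if __name__ == "__main__":
    shared = {"x": 0}
    procs = [Process("P1", 0, incrementer), Process("P2", 0, incrementer)]
    for event in fair_execution(procs, shared, rounds=2):
        print(event)
    print("final x =", shared["x"])
```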

Page 24: Working with Mike on Distributed Computing Theory, 1978-1992

Semantics of Distributed Systems

• Why is this interesting?
  – Very early modeling framework for asynchronous distributed algorithms.
  – Includes features that have become standard in later models, esp. implementation = subset, for trace sets.
  – Predecessor of I/O automata (but uses shared variables rather than shared actions).
  – Had notions of composition, hiding.
  – Had a formal notion of implementation, though no notion of simulation relations.
  – Differs from prior models [Hoare] [Milne, Milner]:
    • Based on math concepts (sets, sequences, etc.) instead of notations (process expressions) and proof rules.
    • Simpler notions of external behavior and implementation.

Page 25: Working with Mike on Distributed Computing Theory, 1978-1992

4. Sorting

• A time-space tradeoff for sorting on non-oblivious machines [Borodin, F, Kirkpatrick, L, Tompa JCSS 81]
• What is this paper about?
  – Defined a DAG model for sorting programs.
  – Defined measures:
    • Time T = length of the longest path
    • Space S = log of the number of vertices
  – Proved a lower bound on the product: S · T = Ω(N²), for sorting N elements.
  – Based on counting arguments for permutations.
• Why is this interesting?
  – I don't know.
  – A neat complexity theory result.
  – Tight bound [Munro, Paterson].
  – It has nothing to do with distributed computing theory.

Page 26: Working with Mike on Distributed Computing Theory, 1978-1992

5. Resource Allocation in Networks

• Background:
  – K-exclusion = allocation of K resources.
  – Instead of shared memory, now consider allocating K resources in a network.
  – Questions:
    • Where to place the resources?
    • How to locate them efficiently?
  – Experiments (simulations), analysis.
  – Led to papers:
    • Optimal placement of resources in networks [Fischer, Griffeth, Guibas, Lynch 80, ICDCS 81, Inf & Comp 92]
    • Probabilistic analysis of network resource allocation algorithm [Lynch, Griffeth, Fischer, Guibas 85, 86]

Page 27: Working with Mike on Distributed Computing Theory, 1978-1992

Resource Allocation in Networks

• Optimal placement of resources in networks [Fischer, Griffeth, Guibas, Lynch 80, ICDCS 81, Inf & Comp 92]
• What is this paper about?
  – How to place K resources in a tree to minimize the total expected cost (path length) of servicing (matching) exactly K requests arriving randomly at nodes.
  – One-shot resource allocation problem.
  – Characterizations, efficient divide-and-conquer algorithms for determining optimal placements.
  – Theorem: Fair placements (where each subtree has approximately its expected number of needed resources) are approximately optimal (see the sketch below).
  – Cost O(K)
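A hypothetical sketch of the "fair placement" notion above: split K resources among the subtrees of each node in proportion to the expected number of requests arising in each subtree, rounding by largest remainder. This only illustrates the definition; it is not the paper's divide-and-conquer algorithm or its optimality analysis.

```python
def subtree_weight(tree, node, prob):
    """Total request probability in the subtree rooted at node.
    tree: node -> list of children; prob: node -> request probability."""
    return prob[node] + sum(subtree_weight(tree, c, prob) for c in tree.get(node, []))

def fair_placement(tree, root, prob, K, placement=None):
    """Assign K resources to the subtree at root, keeping each child's share
    close to its expected number of requests (largest-remainder rounding)."""
    if placement is None:
        placement = {}
    children = tree.get(root, [])
    total = subtree_weight(tree, root, prob)
    # Ideal (fractional) shares for the root itself and for each child subtree.
    parts = [(root, prob[root])] + [(c, subtree_weight(tree, c, prob)) for c in children]
    shares = [(name, K * w / total if total > 0 else 0) for name, w in parts]
    alloc = {name: int(s) for name, s in shares}
    leftover = K - sum(alloc.values())
    for name, s in sorted(shares, key=lambda x: x[1] - int(x[1]), reverse=True)[:leftover]:
        alloc[name] += 1
    placement[root] = alloc[root]
    for c in children:
        fair_placement(tree, c, prob, alloc[c], placement)
    return placement

if __name__ == "__main__":
    tree = {"r": ["a", "b"], "a": ["a1", "a2"], "b": []}
    prob = {"r": 0.0, "a": 0.1, "a1": 0.3, "a2": 0.2, "b": 0.4}
    print(fair_placement(tree, "r", prob, K=5))
```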

Page 28: Working with Mike on Distributed Computing Theory, 1978-1992

Resource Allocation in Networks

• Why is this interesting?
  – Optimal placements are not completely obvious; e.g., you can't minimize flow on all edges simultaneously.
  – Analysis uses interesting properties of convex functions.
  – Results suggested by Nancy Griffeth's experiments.
  – Uses an interesting math observation: the mean and median of a binomial distribution are always within 1 of each other, that is, median(n,p) ∈ {⌊np⌋, ⌈np⌉} (see the check below).
    • Unfortunately, already discovered (but not that long ago!) [Jogdeo, Samuels 68], [Uhlmann 63, 66]
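A quick numerical check of the observation above (a sketch using only the standard library; the median here is taken as the smallest m with CDF(m) ≥ 1/2):

```python
from math import comb, floor, ceil

def binom_median(n, p):
    """Smallest m with P[Binomial(n,p) <= m] >= 1/2."""
    cdf = 0.0
    for m in range(n + 1):
        cdf += comb(n, m) * p**m * (1 - p)**(n - m)
        if cdf >= 0.5:
            return m

if __name__ == "__main__":
    for n in range(1, 60):
        for p in (0.1, 0.3, 0.5, 0.72, 0.9):
            med = binom_median(n, p)
            assert floor(n * p) <= med <= ceil(n * p), (n, p, med)
    print("median(n,p) lies in {floor(np), ceil(np)} for all cases checked")
```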

Page 29: Working with Mike on Distributed Computing Theory, 1978-1992

Resource Allocation in Networks

• Probabilistic analysis of network resource allocation algorithm [Lynch, Griffeth, Fischer, Guibas 85, 86]
• What is this paper about?
  – An algorithm for matching resources to requests in trees.
    • The search for a resource to satisfy each request proceeds sequentially, in larger and larger subtrees.
    • Search in a subtree reverses direction when it discovers that the subtree is empty.
    • Cost is independent of the size of the network.
  – Analysis:
    • For a simplified case:
      – Requests and responses all at the same depth in the tree
      – Choose subtree to search probabilistically
      – Constant message delay
    • Proved that the worst case = noninterfering requests.
    • Expected time for noninterfering requests is constant, independent of the size of the network or the number of requests.

Page 30: Working with Mike on Distributed Computing Theory, 1978-1992

Resource Allocation in Networks

• Why is this interesting?
  – Cute algorithm.
  – Performance independent of the size of the network---an interesting "scalability" criterion.
  – Analysis decomposed in an interesting way:
    • Analyze sequential executions (using traditional methods).
    • Bound performance of concurrent executions in terms of the performance of sequential executions.

Page 31: Working with Mike on Distributed Computing Theory, 1978-1992

6. Consistent Global Snapshots

• Background:
  – We studied database concurrency control algorithms [Bernstein, Goodman].
  – Led us to consider implementing transaction semantics in a distributed setting.
  – Canonical examples:
    • Bank transfer and audit transactions.
    • Consistent global snapshot (distributed checkpoint) transactions.
  – Led to:
    • Global states of a distributed system [Fischer, Griffeth, Lynch 81, TOSE 82]

Page 32: Working with Mike on Distributed Computing Theory, 1978-1992

Consistent Global Snapshots

• Global states of a distributed system [Fischer, Griffeth, Lynch 81, TOSE 82]
• What is this paper about?
  – Models distributed computations using multi-site transactions.
  – Defines correctness conditions for a checkpoint (consistent global snapshot):
    • Returns a transaction-consistent state that includes all transactions completed before the checkpoint starts, and possibly some of those that overlap it.
  – Provides a general, nonintrusive algorithm for computing a checkpoint:
    • Mark transactions as "pre-checkpoint" or "post-checkpoint".
    • Run pre-checkpoint transactions to completion, in a parallel execution of the system.
  – Applications to bank audit, detecting inconsistent data, system recovery.

Page 33: Working with Mike on Distributed Computing Theory, 1978-1992

Global snapshots

• Why is this interesting?
  – The earliest distributed snapshot algorithm I know of.
  – Some ideas similar to the [Chandy, Lamport 85] "marker" algorithm, but:
    • Presented in terms of transactions.
    • Used the formal automaton model of [Lynch, Fischer 79], which makes it a bit obscure.

Page 34: Working with Mike on Distributed Computing Theory, 1978-1992

7. Synchronous vs. Asynchronous Distributed Systems

• Synchronous vs. asynchronous distributed systems [Arjomandi, Fischer, Lynch 81, JACM 83]
• Background:
  – Eshrat Arjomandi visited us at Georgia Tech, 1980.
  – Working on PRAM algorithms for matrix computations.
  – We considered extensions to the asynchronous, distributed setting.
• What is this paper about?
  – Defines a synchronization problem, the s-session problem, in which the processes perform at least s "sessions", then halt.
  – In a session, each process performs at least one output.
  – Motivated by "sufficient interleaving" of matrix operations.
  – Shows that this problem can be solved in time O(s) in a synchronous system (obvious).
  – But it takes time Ω(s · diam) in an asynchronous system (not obvious), where diam is the network diameter.

Page 35: Working with Mike on Distributed Computing Theory, 1978-1992

Synchronous vs. Asynchronous Distributed Systems

• Why is this interesting?
  – Cute impossibility proof.
  – First proof I know that synchronous systems are inherently faster than asynchronous systems, even in a non-fault-prone setting.
  – Interesting contrast with the synchronizer results of [Awerbuch], which seem to say that asynchronous systems are just as fast as synchronous systems.
  – The difference is that the session problem requires preserving the global order of events at different nodes.

Page 36: Working with Mike on Distributed Computing Theory, 1978-1992

8. Distributed Consensus

• Background:
  – Leslie Lamport visited us at Georgia Tech, 1980.
  – We discussed his new Albanian agreement problem and algorithms.
  – The algorithms took a lot of time (f+1 rounds) and a lot of processes (3f+1).
  – We wondered why.
  – Led us to write:
    • A lower bound for the time to assure interactive consistency [Fischer, Lynch 81, IPL 82]

Page 37: Working with Mike on Distributed Computing Theory, 1978-1992

Lower Bound on Rounds, for Byzantine Agreement

• A lower bound for the time to assure interactive consistency [Fischer, Lynch 81, IPL 82]
• What is this paper about?
  – You probably already know.
  – Shows that f+1 rounds are needed to solve consensus, with up to f Byzantine failures.
  – Uses a chain argument, constructing a chain spanning from the all-0, failure-free execution to the all-1, failure-free execution (see the sketch below).

[Figure: a chain of executions from the all-0 ("0000") failure-free execution to the all-1 ("1111") failure-free execution]
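A hedged restatement of the chain-argument skeleton referenced above (standard form; not verbatim from the talk):

```latex
% Skeleton of the chain argument (paraphrased; not verbatim from the slides).
Assume an algorithm solves consensus in at most $f$ rounds despite up to $f$ Byzantine
failures.  Construct a chain of executions
\[
  \alpha_0 \sim \alpha_1 \sim \cdots \sim \alpha_m ,
\]
where $\alpha_0$ is the failure-free execution with all inputs $0$, $\alpha_m$ is the
failure-free execution with all inputs $1$, and $\alpha_i \sim \alpha_{i+1}$ means that
some process correct in both executions cannot distinguish them.  That process must
decide the same value in both, so every execution in the chain decides the same value,
contradicting validity ($\alpha_0$ must decide $0$ and $\alpha_m$ must decide $1$).
```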

Page 38: Working with Mike on Distributed Computing Theory, 1978-1992

Lower Bound on Rounds, for Byzantine Agreement

• Why is this interesting?
  – A fundamental fact about fault-tolerant distributed computing.
  – First chain argument I know of.
  – Proved for Byzantine failures, but similar ideas were used later to prove the same bound for less drastic failure models:
    • Byzantine failures with authentication [Dolev, Strong], [De Millo, Lynch, Merritt]
    • Stopping failures [Merritt].
  – Leaves open the questions of minimizing communication and storage.

Page 39: Working with Mike on Distributed Computing Theory, 1978-1992

BA with Small Storage/Communication

• Background:
  – 1981, by now we've moved to MIT, Yale.
  – Next we considered the communication/storage complexity.
  – Most previous algorithms used exponential communication.
  – [Dolev 82] used 4f + 4 rounds, O(n⁴ log n) bits of communication.
  – We wrote:
    • A Simple and Efficient Byzantine Generals Algorithm [Lynch, Fischer, Fowler 82]
    • Efficient Byzantine agreement algorithms [Dolev, Fischer, Fowler, Lynch, Strong IC 82]

Page 40: Working with Mike on Distributed Computing Theory, 1978-1992

BA with Small Storage/Communication

• A Simple and Efficient Byzantine Generals Algorithm [Lynch, Fischer, Fowler 82]
• Efficient Byzantine agreement algorithms [Dolev, Fischer, Fowler, Lynch, Strong IC 82]
• What is in this paper?
  – A new algorithm, using 2f+3 rounds and O(nf + f³ log f) bits.
  – Asymmetric: processes try actively to decide on 1.
  – Processes "initiate" a proposal to decide 1, and relay each other's proposals.
  – Initiate at later stages based on the number of known proposals.
  – The threshold for initiation increases steadily at later rounds.
  – Decide after 2f + 3 rounds based on sufficiently many known proposals.

Page 41: Working with Mike on Distributed Computing Theory, 1978-1992

BA with Small Storage/Communication

• Why is this interesting?
  – An efficient algorithm, at the time.
  – The presentation was unstructured; later work of [Srikanth, Toueg 87] added more structure to such algorithms, in the form of a "consistent broadcast" primitive.
  – Later algorithms [Berman, Garay 93], [Garay, Moses 93], [Moses, Waarts 94] used polynomial communication and only f+1 rounds (the minimum).

Page 42: Working with Mike on Distributed Computing Theory, 1978-1992

Impossibility of Consensus

• Background:
  – For other consensus problems in fault-prone settings, like approximate agreement [Dolev, Lynch, Pinter, Stark, Weihl], algorithms for synchronous models always seemed to extend to asynchronous models.
  – So we tried to design a fault-tolerant asynchronous consensus algorithm.
  – We failed.
  – Tried to prove an impossibility result, failed.
  – Tried to find an algorithm,…
  – Finally found the impossibility result:
    • Impossibility of distributed consensus with one faulty process [F, L, Paterson 82, PODS 83, JACM 85]

Page 43: Working with Mike on Distributed Computing Theory, 1978-1992

Impossibility of Consensus

• What is in this paper?
  – You know. A proof that you can't solve consensus in asynchronous systems, with even one stopping failure.
  – Uses a "bivalence" argument, which focuses on the form that a decision point would have to take (see the definitions below).
  – And does a case analysis to prove that this configuration can't occur.

[Figure: a bivalent configuration with one step leading to a 0-valent configuration and another to a 1-valent configuration]
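For readers who don't already know the argument, here is a hedged restatement of the standard valence terminology the slide relies on (paraphrased; not verbatim from the talk):

```latex
% Standard FLP valence definitions (paraphrased; not from the slides).
Let $C$ be a configuration reachable by the protocol.
\begin{itemize}
  \item $C$ is \emph{$0$-valent} (resp.\ \emph{$1$-valent}) if every decision reachable
        from $C$ is $0$ (resp.\ $1$).
  \item $C$ is \emph{bivalent} if both decision values are still reachable from $C$.
\end{itemize}
A decision point is a step that takes a bivalent configuration to a univalent one; the
case analysis shows that such a critical step can always be postponed, so some fair
execution with at most one crashed process remains bivalent forever and never decides.
```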

Page 44: Working with Mike on Distributed Computing Theory, 1978-1992

Impossibility of Consensus

• Why is this interesting?
  – A fundamental fact about fault-tolerant distributed computing.
  – Addresses issues of interest to system builders, not just theoreticians.
  – Makes it clear that it's necessary to strengthen the model or weaken the problem.
  – The result is not obvious.
  – "Proof just hard enough" (Mike Fischer).
  – Proof methods: bivalence, chain arguments.
• Leslie may say more about this…

Page 45: Working with Mike on Distributed Computing Theory, 1978-1992

The Beginnings of PODC

• Around this time (1981), we noticed that:
  – There was a lot of distributed computing theory,
  – But no conferences devoted to it.
    • E.g., FLP appeared in a database conference (PODS).
• 1982: Mike and I (and Robert Probert) started PODC.

Page 46: Working with Mike on Distributed Computing Theory, 1978-1992

Lower Bounds on Number of Processes

• Background:
  – [Pease, Shostak, Lamport 80] showed a 3f+1 lower bound on the number of processes for Byzantine agreement.
  – [Dolev 82] showed a 2f+1 connectivity bound for BA.
  – [Lamport 83] showed a 3f+1 lower bound on the number of processes for weak BA.
  – [Coan, Dolev, Dwork, Stockmeyer 85] showed a 3f+1 lower bound for the Byzantine firing squad problem.
  – [Dolev, Lynch, Pinter, Stark, Weihl 83] claimed a 3f+1 bound for approximate BA.
  – [Dolev, Halpern, Strong 84] showed a 3f+1 lower bound for Byzantine clock synchronization.
  – Surely there is some common reason for all these results.
  – We unified them, in:
    • Easy impossibility proofs for distributed consensus problems [Fischer, Lynch, Merritt PODC 85, DC 86]

Page 47: Working with Mike on Distributed Computing Theory, 1978-1992

Lower Bounds on Number of Processes

• What is this paper about?
  – Gives a uniform set of proofs for Byzantine agreement, weak BA, approximate BA, Byzantine firing squad, and Byzantine clock synchronization.
  – Proves 3f+1 lower bounds on the number of processes, and 2f+1 lower bounds on connectivity.
  – Uses "hexagon vs. triangle" arguments (sketched below):

[Figure: the "hexagon vs. triangle" argument, a six-process ring labeled A, B, C, A, B, C contrasted with a three-process triangle labeled A, B, C]
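A hedged paraphrase of how such a hexagon argument typically runs (standard form of the technique; details simplified and not taken verbatim from the talk):

```latex
% Sketch of the hexagon-vs-triangle technique (paraphrased; not verbatim from the talk).
Suppose a protocol for three processes $A, B, C$ solves Byzantine agreement while
tolerating one faulty process ($n = 3$, $f = 1$).  Take two copies of each process and
connect them in a ring
\[
  A_0 - B_0 - C_0 - A_1 - B_1 - C_1 - A_0 ,
\]
each copy running the original protocol, with inputs chosen so that different adjacent
pairs should be forced to decide differently.  The ring is an ordinary failure-free
system, yet to any two adjacent copies it is indistinguishable from a triangle in which
the third process is Byzantine (the rest of the ring plays that faulty role).  Applying
agreement and validity to different adjacent pairs then yields contradictory decisions,
so no such protocol exists for $n = 3$, $f = 1$; the general $3f+1$ bound follows by a
reduction.
```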

Page 48: Working with Mike on Distributed Computing Theory, 1978-1992

Lower Bounds on Number of Processes

• Why is this interesting?
  – Basic results about fault-tolerant computation.
  – Unifies a lot of results/proofs.
  – Cute proofs.

Page 49: Working with Mike on Distributed Computing Theory, 1978-1992

9. Epilogue: Reliable Communication

• Reliable communication over unreliable channels [Afek, Attiya, Fekete, Fischer, Lynch, Mansour, Wang, Zuck 92, JACM 94]
• Background:
  – Much later, we again became interested in a common problem.
  – Two reliable processes, communicating over two unreliable channels that can reorder, lose, and duplicate messages.
  – Bounded-size messages (no sequence numbers).
  – Can we implement a reliable FIFO channel?
  – [Wang, Zuck 89] showed it impossible for reordering + duplication.
  – Any one kind of fault is easy to tolerate.
  – The Alternating Bit Protocol tolerates loss + duplication.
  – Remaining question: reordering + loss, with bounded-size messages.

Page 50: Working with Mike on Distributed Computing Theory, 1978-1992

Reliable Communication

• What is this paper about?
  – Resolves the question of the implementability of reliable FIFO channels over unreliable channels exhibiting reordering + loss, using bounded-size messages.
  – An algorithm, presented in two layers (see the sketch below):
    • Layer 1: Uses the given channels to implement channels that do not reorder (but may lose and duplicate messages).
      – Receiver-driven (probes to request messages).
      – Uses more and more messages as time goes on, to "swamp out" copies of old messages.
      – Lots of messages.
      – Really lots of messages.
    • Layer 2: Uses ABP over Layer 1 to get a reliable FIFO channel.
  – A lower bound, saying that any such algorithm must use lots of messages.
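To make Layer 2 concrete, here is a minimal, hypothetical simulation of the Alternating Bit Protocol over a channel that may lose and duplicate (but not reorder) messages, the kind of channel Layer 1 provides. It is a round-based toy, not the paper's algorithm; the loss/duplication probabilities and round count are arbitrary.

```python
import random

def lossy_dup_channel(chan, rng):
    """Deliver the oldest message, possibly losing or duplicating it.
    The channel never reorders, matching what Layer 1 provides."""
    if not chan:
        return None
    if rng.random() < 0.3:          # loss: drop the oldest message
        chan.pop(0)
        return None
    if rng.random() < 0.3:          # duplication: deliver without consuming
        return chan[0]
    return chan.pop(0)

def alternating_bit(data, rounds=2000, seed=1):
    rng = random.Random(seed)
    to_recv, to_send = [], []       # channel contents in each direction
    bit, idx, delivered = 0, 0, []  # sender's bit, next item, receiver output
    expected = 0                    # receiver's expected bit
    for _ in range(rounds):
        if idx < len(data):
            to_recv.append((bit, data[idx]))        # sender (re)transmits
        msg = lossy_dup_channel(to_recv, rng)       # receiver side
        if msg is not None:
            b, payload = msg
            if b == expected:                       # new item: deliver it
                delivered.append(payload)
                expected ^= 1
            to_send.append(b)                       # ack the bit just seen
        ack = lossy_dup_channel(to_send, rng)       # sender side
        if ack is not None and ack == bit and idx < len(data):
            bit ^= 1                                # move to the next item
            idx += 1
    return delivered

if __name__ == "__main__":
    data = list("RELIABLE")
    print("".join(alternating_bit(data)))   # data delivered in order, exactly once
```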

Page 51: Working with Mike on Distributed Computing Theory, 1978-1992

Reliable Communication

• Why is this interesting?
  – Fundamental facts about communication in fault-prone settings.
  – Related impossibility results proved by [Mansour, Schieber 92], [Wang, Zuck 89], [Tempero, Ladner 90, 95].

Page 52: Working with Mike on Distributed Computing Theory, 1978-1992

Conclusions

• A lot of projects, a lot of papers.
• Especially during 1979-82.
• Especially during Mike's sabbatical, 1980.
• They seem pretty interesting, even now!
• Some are well known, some not…

• Anyway, we had a lot of fun working on them.

• Thanks to Mike for a great collaboration!