
Domesticating Parallelism

Linda and Friends

Sudhir Ahuja, AT&T Bell Laboratories
Nicholas Carriero and David Gelernter, Yale University

Linda consists of a few simple primitives that support an "uncoupled" style of parallel programming. Implementations exist on a broad spectrum of parallel machines.


Linda consists of a few simple operators designed to support and simplify the construction of explicitly-parallel programs. Linda has been implemented on AT&T Bell Labs' S/Net multicomputer and, in a preliminary way, on an Ethernet-based MicroVAX network and an Intel iPSC hypercube. Although the implementations are new and need refinement, our early experiences with them have been revealing, and we take them as supporting our claim that Linda is a promising tool.

Parallel programming is often described as being fundamentally harder than conventional, sequential programming, but in our experience (limited so far, but growing) it isn't. Parallel programming in Linda is conceptually the same order of task as conventional programming in a sequential language. Parallelism does, though, encompass a potentially difficult problem. A conventional program consists of one executing process, of a single point in computational time-space, but a parallel program consists of many, and to the extent that we have to worry about the relationship among these points in time and space, the mood turns nasty. Linda's mission, however, is to make it largely unnecessary to think about the coupling between parallel processes. Linda's uncoupled processes, in fact, never deal with each other directly. A parallel program in Linda is a spatially and temporally unordered bag of processes, not a process graph. To the extent that process uncoupling succeeds, the difficulty of designing, debugging, and understanding a parallel program grows additively and not multiplicatively with the variety of processes it encompasses.

When the simple operators Linda provides are injected into a host language h, they turn h into a parallel programming language. A Linda system consists of the runtime kernel, which implements interprocess communication and process management, and a preprocessor or compiler. A Linda-based parallel language is in fact a new language, not an old one with added system calls, to the extent that the preprocessor or compiler recognizes the Linda operations, checks and rewrites them on the basis of symbol table information, and can optimize the pattern of kernel calls that result based on its knowledge of constants and loops, among other things. Most of our programming experiments so far have been conducted in C-Linda (and we use C-Linda for the examples below), but we have implemented a Fortran-Linda preprocessor as well. The kernel is language-independent. It will support N-Linda for any language N.

Associated with the Linda operators is a particular programming methodology, based on distributed data structures. (The language doesn't restrict programmers to this methodology. It merely allows the methodology, which most other languages don't.) The distributed-data-structure methodology in turn suggests a particular strategy for dealing with parallelism. Most models of parallelism assume that a program will be parallelized by partitioning it into a large number of simultaneous activities. This partitioning appears, however, to be relatively difficult to do, especially when we consider large multicomputers that support thousand-fold parallelism and beyond. In the Linda framework, we can get parallelism by replicating as well as by partitioning. We anticipate that it will frequently be simpler to stamp out many identical copies of one process than to create the same number of distinct processes. So the final ingredient in the Linda framework is a strategy for coping with parallelism by replication rather than by partitioning.

In the following we discuss first what it seems to us that parallel programmers need. We then describe Linda, some programming experiments using Linda, the current implementation, and the project now underway to go beyond the current software implementation and build a hardware Linda machine. We go on to discuss some related higher-level parallel languages that can be implemented on top of the Linda kernel, particularly the symmetric languages. We close in a blaze of speculation.

What parallel programmers need

Many parallel algorithms are known and more are in development; many parallel machines are available and many more will be soon. But the fate of the whole effort will ultimately be decided by the extent to which working programmers can put the algorithms and the machines together. The needs of parallel programmers have not been accommodated very well to date.

A machine-independent and (potentially) portable programming vehicle. Designers of parallel languages generally hold that programming tools should accommodate a high-level programming model, not a particular architecture. But as parallel machines emerge commercially, there has been little effort spent on making high-level, machine-independent tools available on them. Young debutante machines are sometimes gotten up in their own full-blown parallel languages; more often they come dressed in only a handful of idiosyncratic system calls that support the local variant of message-passing or memory-sharing. In either case, so long as each new machine is provided with its own parallel programming system, programs for multicomputer x will not only have to be recoded, they may need to be conceptually reformulated to run on multicomputer y. (This is particularly true if x is a shared-memory machine like a BBN Butterfly [1] or an IBM RP3 [2] and y is a network, like an Intel iPSC.) But users need to be able to run parallel programs on a range of architectures, particularly now, when interesting designs of unknown merit proliferate. They need to be able to communicate parallel algorithms. Methodological knowledge can't grow when sources are cluttered with local dialect. Finally, they need programming tools suited to their needs, not to the machine's.

A programming tool that absolves them as fully as possible from dealing with spatial and temporal relationships among parallel processes. We referred to the general problem of uncoupling above. Uncoupling has both a spatial and a temporal aspect. Spatially, each process in a parallel program will usually develop a result or a series of results that will be accepted as input by certain other processes. Uncoupling suggests that process q should not be required to know or care that process j accepts q's data as input. Instead of requiring q to execute an explicit "send this data to j" statement, we would rather that q be permitted simply to tag its new data with a logical label (for example, "new data from q") and then forget about it, under the assumption that any process that wants it will come and get it. At some later point in program development, a different process may decide to deal with q's data. Under the spatially-uncoupled scheme, this won't matter to q.

Temporal uncoupling involves similar though perhaps slightly more subtle issues. If q is forced to send to j explicitly, the system is constrained to have both processes loaded simultaneously (or at least to have buffer space allocated for j when q runs). Further, most parallel languages attach some form of synchronization constraint to send. A synchronized send operation like Ada's entry call or CSP-Occam's output statement forces the system not merely to load but to run the receiving process before the sender can continue. We would rather that our parallel programs be largely free of scheduling implications like these. Not only do they constrain the system in ways that may be undesirable, but they force programmers to think in simultaneities. As far as possible, we would like programmers to be able to develop q's code without having to envision other simultaneous execution loci. To achieve this, we would like q to be allowed to take each new datum it develops and heave it overboard without a backwards glance. (We make this a bit more concrete below.)

A programming tool that allows tasks to be dynamically distributed at runtime. Generally there is more logical parallelism in a parallel algorithm than physical parallelism in a host multicomputer, which means that at runtime there are more ready tasks than idle processors. Good speedup obviously requires that tasks be evenly distributed among available processors. Many systems require that this distribution be performed statically at load time. Sometimes, finding a good static distribution is easy, notably when the program's logical structure matches the machine's physical structure. As the program's logical structure grows more irregular, the task gets harder, and when the program's computational focus develops dynamically at runtime, finding a good static mapping may be impossible. Many important applications and program structures fall into the first, easily-handled category, but many more do not. For those that don't, dynamic distribution of tasks is essential.

A programming tool that can be implemented efficiently on existing hardware. Obviously. Parallel language research has produced far more designs than implementations. Elegant language ideas will always be interesting regardless of the existence of good implementations, but parallel programmers, as opposed to language researchers, require implementable elegance.

Linda

Linda centers on an idiosyncratic memory model. Where a conventional memory's storage unit is the physical byte (or something comparable), Linda memory's storage unit is the logical tuple, or ordered set of values. Where the elements of a conventional memory are accessed by address, elements in Linda memory have no addresses. They are accessed by logical name, where a tuple's name is any selection of its values. Where a conventional memory is accessed via two operations, read and write, a Linda memory is accessed via three: read, add, and remove.

Figure 1. out statement: drop it in.

It is a consequence of the last characteristic that tuples in a Linda memory can't be altered in situ. To be changed, they must be physically removed, updated, and then reinserted. This makes it possible for many processes to share access to a Linda memory simultaneously; using Linda we can build distributed data structures that, unlike conventional ones, may be manipulated by many processes in parallel. Furthermore, as a consequence of the first characteristic (a Linda memory stores tuples, not bytes), Linda's shared memory is coarse-grained enough to be supported efficiently without shared-memory hardware. Shared memory has long been regarded as the most flexible and powerful way of sharing data among parallel processes, but a naive shared memory requires hardware support that is complicated, expensive to build, and suitable only for multicomputers, not for local area networks. Linda's variant of shared memory, on the other hand, runs both on the S/Net and on a MicroVAX network, neither of which provides any physically shared memory. (Of course, Linda may be implemented on shared-memory multicomputers as well, as we discuss below.)

Figure 2. in statement: haul it out.

Linda's shared memory is referred to as tuple space, or TS. Messages in Linda are never exchanged between two processes directly. Instead, a process with data to communicate adds it to tuple space, and a process that needs to receive data seeks it, likewise, in tuple space. There are four operations defined over TS: out(), in(), read(), and eval(). out(t) causes tuple t to be added to TS; the executing process continues immediately. in(s) causes some tuple t that matches template s to be withdrawn from TS; the values of the actuals in t are assigned to the formals in s and the executing process continues. If no matching t is available when in(s) executes, the executing process suspends until one is, then proceeds as before. If many matching t's are available, one is chosen arbitrarily. read(s) is the same as in(s), with actuals assigned to formals as before, except that the matched tuple remains in TS.

For example, executing

out("P", 5, false)

causes the tuple ("P", 5, false) to be added to TS. The first component of a tuple serves as a logical name, here "P"; the remaining components are data values. Subsequent execution of

in("P", int i, bool b)

might cause tuple ("P", 5, false) to be withdrawn from TS. 5 would be assigned to i and false to b. Alternatively, it might cause any other matching tuple (any other, that is, whose first component is "P" and whose second and third components are an integer and a Boolean, respectively) to be withdrawn and assigned. Executing

read("P", int i, bool b)

when ("P", 5, false) is available in TS may cause 5 to be assigned to i and false to b, or equivalently may cause the assignment of values from some other type-consonant tuple, with the matched tuple itself remaining in TS in either case. eval(t) is the same as out(t), except that eval adds an unevaluated tuple to TS. (eval is not primitive in Linda; it will be implemented on top of out. We haven't done this yet in S/Net-Linda, so we omit further mention of eval.) See Figures 1, 2, and 3.

Figure 3. read statement: read it and leave it.

The parameters to an in or read statement needn't all be formals. Any or all may be actuals as well. All actuals must be matched by corresponding actuals in a tuple for tuple-matching to occur. Thus the statement

in("P", int i, 15)

may withdraw tuple ("P", 6, 15) but not tuple ("P", 6, 12). When a variable appears in a tuple without a type declarator, its value is used as an actual. The annotation formal may precede an already-declared variable to indicate that the programmer intends a formal parameter. Thus, if i and j have already been declared as integer variables, the following two statements are equivalent to the preceding one:

j = 15;
in("P", formal i, j)

This extended naming convention (it resembles the select operation in relational databases) is referred to as structured naming. Structured naming makes TS content-addressable, in the sense that processes may select among a collection of tuples that share the same first component on the basis of the values of any other component fields. Any parameter to out() or eval() except the first may likewise be a formal; a formal parameter in a tuple matches any type-consonant actual in an in or read statement's template. See Figure 4.

Figure 4. Structured naming: legal matches.
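To make the matching rules concrete (this is the behavior Figure 4 depicts), the short C-Linda fragment below pairs a few out statements with templates that can and cannot match them. The tuple name "point" and its values are our own illustration, not an example from the paper, and the fragment assumes the Linda preprocessor and kernel described in the text.

int i, j;

out("point", 3, 4);              /* tuple ("point", 3, 4) enters TS */
out("point", 8, 9);              /* tuple ("point", 8, 9) enters TS */

read("point", formal i, 9);      /* only ("point", 8, 9) matches: the actual 9
                                    must agree; i is set to 8, the tuple stays */
in("point", 3, formal j);        /* only ("point", 3, 4) matches: the actual 3
                                    must agree; j is set to 4, tuple withdrawn */
in("point", formal i, formal j); /* matches the remaining ("point", 8, 9) */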

Programming in Linda

Linda accommodates the needs for uncoupling and dynamic scheduling we listed above by relying on distributed data structures. As noted, a distributed data structure is one that may be manipulated by many parallel processes simultaneously. Distributed data structures are the natural complement to parallel program structures, but despite this natural relationship, distributed data structures are impossible in most parallel programming languages. Most parallel languages are based instead on what we call the manager-process model of parallelism, which requires that shared data objects be encapsulated within manager processes. Operations on shared data are carried out, on request, by the manager process on the user's behalf. See Figures 5 and 6.

Figure 5. The manager process services requests one at a time.
Figure 6. Data are directly accessible to all parallel processes.

The processes in a partitioned-network program are coupled, while those in a replicated-worker program are uncoupled.

The manager-process model has important advantages, and manager-process programs are easy to write in Linda. What is significant, though, is the number of cases in which distributed-data-structure programs come closer to achieving the qualities we want. They do so particularly in the context of parallel programs structured not as logical networks but as collections of identical workers. In logical-network-style parallelism (the more common variety), a program is partitioned into n pieces, where n is determined by the logic of the algorithm. Each of the n logical pieces is implemented by a process, and each process keeps its attention demurely fixed on its own conventional, local data structures. In the replicated-worker model, we don't partition our program at all; we replicate it r times, where r is determined by the number of processors we have available. All r processes scramble simultaneously over a distributed data structure, seeking work where they can get it. There is a strong underlying sense in which the processes in a partitioned-network program are coupled, while those in a replicated-worker program are uncoupled. In the partitioned-network program, each process must, in general, deal with its neighbor processes; in the replicated-worker program, workers ignore each other completely. The replicated-worker model is interesting for a number of more specific reasons as well.

1. It scales transparently. Once we have developed and debugged a program with a single worker process, our program will run in the same way, only faster, with ten parallel workers or a hundred. We need be only minimally aware of parallelism in developing the program, and we can adjust the degree of parallelism in any given run to the available resources.

2. It eliminates logically-pointless context switching. Each processor runs a single process. We add processes only when we add processors. The process-management burden per node is exactly the same when the program runs on one node as when it runs on a thousand. (This is not true, of course, in the network model. A network program always creates the same number of processes. If many processors are available, they spread out; if there are only a few, they pile up.)

3. It balances load dynamically, by default. Each worker process repeatedly searches for a task to execute, executes it, and loops. Tasks are therefore divided at runtime among the available workers.

It's important to note that most of the programs we've experimented with are not pure replicated-worker examples; they involve some partitioning as well as some replication of duties. It's also true that purely network-style programs may be written in Linda and may rely on distributed data structures. Linda programs that tend towards the replicated style seem to be the most idiomatic and interesting, though.

We illustrate with a simple example that doesn't exercise the mechanism fully, but makes some basic points. We've tested several matrix multiplication programs using S/Net-Linda. One version consists of a setup-cleanup process and at least one, but ordinarily many, worker processes. Each worker is repeatedly assigned some element of the product matrix to compute; it computes this assigned element and is assigned another, until all elements of the product matrix have been filled in. If A and B are the matrices to be multiplied, then specifically

1. The initialization process uses a succession of out statements to dump A's rows and B's columns into TS. When these statements have completed, TS holds

("A", 1, A's-first-row)
("A", 2, A's-second-row)
...
("B", 1, B's-first-column)
("B", 2, B's-second-column)
...

Indices are included as the second element of each tuple so that worker processes, using structured naming, can select the ith row or jth column for reading. The initializer then adds the tuple

("Next", 1)

to TS and terminates. 1 indicates the next element to be computed.

2. Each worker process repeatedly decides on an element to compute, then computes it. To select a next element, the worker removes the "Next" tuple from TS, determines from its second field the indices of the product element to be computed next, and reinserts "Next" with an incremented second field:

in("Next", formal NextElem);
if (NextElem < dim * dim)
    out("Next", NextElem + 1);
i = (NextElem - 1)/dim + 1;
j = (NextElem - 1)%dim + 1;

The worker now proceeds to compute the product element whose index is (i,j). Note that if (i,j) is the last element of the product matrix, the "Next" tuple is not reinserted. When the other workers attempt to remove it, they will block. A Linda program terminates when all processes have terminated or have blocked at in or read statements.

To compute element (i,j) of the product, the worker executes

read("A", i, formal row);
read("B", j, formal col);
out("result", i, j, DotProduct(row, col));

Thus each element of the product is packed in a separate tuple and dumped into TS. (Note that the first read statement picks out a tuple whose first element is "A" and second is the value of i; this tuple's third element is assigned to the formal row.)

3. The cleanup process reels in the product-element tuples, installs them in the result matrix prod, and prints prod:

for (row = 1; row <= NumRows; row++)
    for (col = 1; col <= NumCols; col++)
        in("result", row, col, formal prod[row][col]);

print prod;
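Read together, the three steps above imply a worker of roughly the following shape. This is our consolidation of the fragments already shown, not code from the paper: the loop structure, the declarations, and the names MAXDIM and DotProduct are assumptions, and a Linda preprocessor and kernel are presumed.

/* Hypothetical consolidated worker for the matrix example; a sketch only. */
worker()
{
    int NextElem, i, j;
    double row[MAXDIM], col[MAXDIM];      /* element type and MAXDIM assumed */

    /* Loop forever; once the "Next" tuple is gone, the in() below blocks and,
       as the text notes, the program terminates when every process has
       terminated or blocked.                                               */
    while (1) {
        in("Next", formal NextElem);      /* claim the next task             */
        if (NextElem < dim * dim)
            out("Next", NextElem + 1);    /* reinsert counter for the others */
        i = (NextElem - 1)/dim + 1;
        j = (NextElem - 1)%dim + 1;

        read("A", i, formal row);         /* ith row of A, left in TS        */
        read("B", j, formal col);         /* jth column of B, left in TS     */
        out("result", i, j, DotProduct(row, col));
    }
}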

This simple program depends entirely on distributed data structures. The input matrices are distributed data structures; all worker processes may read them simultaneously. In the manager-process model, processes would send read-requests to the appropriate manager and await its reply. The "Next" tuple is a distributed data structure; all worker processes share direct access to it. In the manager-process model, again, worker processes would read and update the "Next" counter indirectly via a manager. The product matrix is a distributed data structure, which all workers participate in building simultaneously.

We discuss the performance of this program, and of another version that assigns coarser-grained tasks that compute an entire row rather than a single inner product, elsewhere [3]. Both versions show good speedup as we add processors up to the limited number available to us on our S/Net (currently eight). The version discussed above requires only two parallel workers and one control process to beat a conventional uniprocessor C program on 32 x 32 matrices, and continues to show linear speedup as we add workers. The coarser-grained version, with its lower communication overhead, shows speedup close to ideal linear speedup of the uniprocessor C version: our figures show close to a progressive doubling, tripling, and so on of the C program's speed as we add Linda workers.

The matrix program displays the uncoupling and dynamic-scheduling properties that we claimed above were important. Uncoupling: no worker deals directly with any other. Dynamic task scheduling: the matrix program assigns tasks to workers dynamically. But of course, in a problem as simple and regular as matrix multiplication, dynamic scheduling isn't important. We could just as well have assigned each of n workers 1/n of the product matrix to compute. (It's interesting to note, however, that even with a problem as orderly as matrix multiplication, dynamic scheduling might be the technique of choice if we were running on a nonhomogeneous network, on which processors vary in speed and in runtime loading. We've been studying just such a network: a collection of VAXes ranging from MicroVAX I's to 8600's.)

Dynamic scheduling becomes important when tasks require varying amounts of time to complete. In the general case, moreover, new tasks may be developed dynamically as old ones are processed. Linda techniques to deal with this general problem are based on a distributed data structure called a task bag. Workers repeatedly draw their next assignment from the task bag, carry out the specified assignment, and drop any new tasks generated in the process back into the task bag. The program completes when the bag is empty. The scheme is easily implemented in Linda. Elements of the bag will be tuples of the form

("Task", task descriptor)

Each worker executes the following loop:

loop {
    /* withdraw a task from the bag: */
    in("Task", formal NextTask);
    process "NextTask";
    for (each NewTask generated in the process)
        /* drop the new task into the bag: */
        out("Task", NewTask);
}

We've experimented with programs of this sort to perform LU decomposition with pivoting and to find paths through a graph, among others. Note that, if it were necessary to process tasks in a particular order rather than in arbitrary order, we would build a task queue instead of a task bag. The technique would involve numbered tuples and structured naming.
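The paper leaves the task-queue variant as a remark, so the following is only one plausible sketch of it, under our own tuple names ("head", "tail"): counter tuples serialize producers and consumers, and structured naming on the sequence number delivers tasks in order. A Linda preprocessor and kernel are assumed.

/* Hypothetical task queue built from numbered tuples and structured naming.
   Some initializer seeds the counters once:  out("head", 1); out("tail", 1); */
int n;

/* enqueue: claim the tail counter, number the new task, advance the counter */
in("tail", formal n);
out("Task", n, NewTask);
out("tail", n + 1);

/* dequeue: claim the head counter, advance it, then withdraw exactly task n;
   n's value is used as an actual, so structured naming selects task number n */
in("head", formal n);
out("head", n + 1);
in("Task", n, formal NextTask);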

The S/Net's Linda kernel

Figure 7. An S/Net kernel. (a) shows out: broadcast, while (b) shows in: check locally. (The inverse of this scheme is also possible.)

Linda has often been regarded as posing a particularly difficult implementation problem. The difficulty lies in the fact that, as noted above, Linda supplies a form of logically-shared memory without assuming any physically-shared memory in the underlying hardware. The following paragraphs summarize the way in which we implemented Linda on the S/Net (the S/Net implementation is discussed in detail elsewhere [3]); there are many other possible implementations as well.

Our implementation buys speed at the expense of communication bandwidth and local memory. The reasonableness of this trade-off was our starting point. (Possible variants are more conservative with local memory.)

Executing out(t) causes tuple t to be broadcast to every node in the network; every node stores a complete copy of TS. Executing in(s) triggers a local search for a matching t. If one is found, the local kernel attempts to delete t network-wide using a procedure we discuss below. If the attempt succeeds, t is returned to the process that executed in(). (The attempt fails only if a process on some other node has simultaneously attempted to delete t, and has succeeded.) If the local search triggered by in(s) turns up no matching tuple, all newly-arriving tuples are checked until a match occurs, at which point the matched tuple is deleted and returned as before. read() works in the same way as in(), except that no tuple-deletion need be attempted. As soon as a matching tuple is found, it is immediately returned to the reading process. See Figure 7.

The delete protocol must satisfy two requirements. First, all nodes must receive the "delete" message, and second, if many processes attempt to delete simultaneously, only one must succeed. The manner in which these requirements are met will depend, of course, on the available hardware.

When some node fails to receive and buffer a broadcast message, a negative-acknowledgement signal is available on the S/Net bus. One possible delete protocol has two parts: The sending kernel rebroadcasts repeatedly until the negative-acknowledgement signal is not present. It then awaits an "ok to delete t" message from the node on which t originated. In this protocol the kernel on the tuple's origin node is responsible for allowing one process, and only one, to delete it. (We have implemented other protocols as well. Processes may use the bus as a semaphore to mediate multiple simultaneous deletes, for example, and avoid the use of a special "ok to delete" message.)
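As a sketch of the origin node's role in that protocol: if the origin node's kernel handles delete requests for a given tuple one at a time, granting the first request and refusing the rest is enough to guarantee that only one deleter succeeds. The structure and names below are our illustration, not the kernel's actual code.

/* Illustrative only: origin-node arbitration of simultaneous deletes. */
struct tuple_entry {
    int deleted;              /* has "ok to delete" already been granted? */
    /* ... tuple fields ... */
};

int grant_delete(struct tuple_entry *t)
{
    if (t->deleted)
        return 0;             /* a competing deleter already won; refuse   */
    t->deleted = 1;           /* first request wins: reply "ok to delete t" */
    return 1;
}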

Evidence suggests that a minimal out-in transaction, from kernel entry on the out side to kernel exit on the in side, takes about 1.4 ms.

We are working on other implementations as well. The VAX-network Linda kernel (which was designed and is being implemented by Jerry Leichter) uses a technique that is in a sense the inverse of the existing S/Net scheme. We're in the process of trying this new technique on the S/Net also. In the new protocol, out requires only a local install; in(s) causes template s to be broadcast to all nodes in the network. Whenever a node receives a template s, it checks s against all of its locally-stored tuples. If there is a match, it sends the matched tuple off to the template's node. If not, it stores the template for x ticks (checking all tuples newly-generated within this period against it), then throws it out. If the template's origin node hasn't received a matching tuple after x ticks, it rebroadcasts the template. More than one node may respond with a matching tuple to a template broadcast; when a template broadcaster receives more than one tuple, it simply installs the extras alongside its locally-generated tuples and sends them onward when they're needed. (In a more elaborate version, we can forestall the arrival of unneeded tuples by having potential senders monitor the bus, or by broadcasting an "I've got one, enough already" message at the appropriate point.) This scheme doesn't require hardware support for reliable broadcast and it doesn't require tuples to be replicated on each node, so per-node storage requirements are much lower.

The Linda kernel for the Intel iPSC hypercube, designed and implemented by Rob Bjornson, relies on point-to-point rather than broadcast communication. His scheme implements tuple space as a hash table distributed throughout the network. Each tuple is hashed on out to a unique network node and is sent there for storage. Templates are hashed and stored in the same way.
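The paper doesn't say which fields drive the hash, so the plain-C sketch below assumes, purely for illustration, that the tuple's first component (its logical name) selects the node; a tuple and any template carrying the same name then land on the same node, where local matching proceeds as usual.

/* Illustrative sketch: route a tuple or template to a storage node by
   hashing its first component. Hashing on the name is our assumption. */
#include <stdio.h>

unsigned node_for(const char *name, unsigned nnodes)
{
    unsigned h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;   /* simple string hash */
    return h % nnodes;
}

int main(void)
{
    /* out("Next", ...) and in("Next", ...) would both be routed here: */
    printf("\"Next\" tuples live on node %u of 64\n", node_for("Next", 64));
    return 0;
}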

Finally, several of us (Bjornson, Carriero, Leichter, and Gelernter) have begun, in conjunction with Scientific Computing Associates, to design and implement a Linda kernel for the Encore Multimax. Nodes on the Multimax have direct access to physically shared memory. The Multimax Linda kernel should therefore be faster and simpler than the kernels described above, and in fact it is. The relationship between Linda and shared-memory multiprocessors like the Multimax is roughly similar to what holds between block-structured languages and stack architectures. The architecture strongly supports the language; the language refines the power of the architecture and makes it accessible to programmers. Of course, for all its promise, the Encore doesn't end our interest in networks like the S/Net. Shared memory seems ideal for small or medium-sized collections of processors. S/Net-like architectures, particularly the Linda machine we describe below, may well scale upwards to enormous sizes.

We have referred to Linda as a programming language, but it really isn't. It is a new machine model, in the same sense in which dataflow or graph-reduction may be regarded as machine models as much as programming methodologies. The kernels described above are software realizations of a Linda machine, but Ahuja and Venkatesh Krishnaswamy of Yale are designing a hardware Linda machine as well, based on the S/Net. The heart of the Linda machine is a box to be interposed between each processor and the S/Net bus. The box implements the Linda communication kernel in hardware, turning an ordinary bus into a tuple space. The current box is designed for the S/Net exclusively, but we are interested in general versions that will connect arbitrary nodes and communication media as well. Installation of either the software Linda kernel or the hardware Linda boxes has the effect of uniting many physically-disjoint nodes into one logically-shared space.

Friends

Linda may be regarded as machine language for the Linda machine. We can in fact compile higher-level parallel languages into Linda. Higher-level languages may, for example, support shared variables that are directly accessible to parallel processes. If v is a shared variable, the compiler might translate

v := expr

to

in("v", formal v_value);
out("v", expr)

and

f(...,v,...)

to

read("v", formal v_value);
f(...,v_value,...)
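One consequence of this translation is worth spelling out: because in withdraws the "v" tuple, a compiled read-modify-write holds the shared variable exclusively until the matching out, so concurrent updates serialize. A hypothetical compilation of v := v + 1 (our example, not the paper's) would be:

in("v", formal v_value);   /* no other process can in or read "v" now  */
out("v", v_value + 1);     /* reinsert the updated shared variable     */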

We can support data objects like streams on top of Linda in the same general way.
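The stream representation isn't spelled out in the paper; one hypothetical sketch, under our own names, numbers the elements so that a single writer appends with out while any number of readers scan in order with read, which leaves each element in TS for the others.

/* Hypothetical one-writer, many-reader stream; the representation is ours. */
int value;                              /* element type assumed to be int   */

int next = 1;                           /* writer's private index           */
out("stream", next, value);             /* append element number next       */
next = next + 1;

int mine = 1;                           /* each reader's private index      */
read("stream", mine, formal value);     /* blocks until element mine exists;
                                           the element stays for other readers */
mine = mine + 1;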

The higher-level parallel languages that interest us particularly are the so-called symmetric languages. Symmetric languages are based on the proposition that, just as we can give names to arbitrary statement sequences and nest their execution in arbitrary ways, we should be able to name arbitrary horizontal combinations, or environments, and nest them in arbitrary ways.

Consider an arbitrary "execute sequentially" statement:

s1; s2; ...; sn

We can represent the execution of this sequence at runtime as in Figure 8. Each box represents the execution of one statement in the sequence. The boxes are stacked on top of each other; the evaluations of successive elements occupy disjoint intervals of time, but they may successively occupy the same space. (Thus if each si is a block that creates local variables, we can always reuse the previous block's storage space for the next block's variables, once evaluation of the previous block is complete.) Now suppose we transpose this structure around the time-space axis, as in Figure 9. Again, each box represents the execution of a separate statement. In the resulting structure, which we refer to as an alpha, the evaluations of successive elements occupy disjoint regions of space and share one time; that is, they execute concurrently. If we added alphas to a programming language and wrote them

s1 & s2 & ... & sn,

the resulting statement calls for the execution of all si in parallel.

Figure 8. Execute sequentially.
Figure 9. Execute in parallel.

Suppose we add one more element to the alpha's definition. In most programming languages, a local variable's scope is specified explicitly and its lifetime is inferred from its scope by the following simple rule: A variable must live for at least as long as the statements that refer to it, so that they may be assured of finding it when they look. In symmetric languages we reverse this rule and infer scope from lifetime: If a variable is guaranteed to live for at least as long as a group of statements, then those statements may refer to it because they are assured of finding it when they look. Now consider the alpha: Execution of an alpha as a whole can't be complete until each of its components has executed to completion. (The same rule holds for the standard "execute sequentially" form.) Because alpha execution isn't complete until every component has been fully evaluated, no box in the alpha representation above will disappear until they all do. It follows that, if we store a named variable instead of an executable statement inside some box, then that variable should be accessible to statements in adjacent boxes, because the variable and the statement live for the same interval of time. The statement is therefore assured of finding the variable when it looks for it. Hence, symmetric languages will use alphas to create blocks as well as to create parallel-execution streams. For example, the Pascal block

var i: real; j: integer; begin ... end

becomes

i: real & j: integer & begin ... end

in Symmetric Pascal.

The alpha can in fact be used as a flexible computational cupboard. We can store any assortment of named values and active processes in its slots.

We'd like to be able to encompass whole networks, even physically-dispersed ones, within Linda systems.

Symmetric languages use alphas to serve the purpose of a Pascal record, of a Simula class or Scheme closure, of a package or a module, and in fact of an entire program or environment. All symmetric languages naturally encompass interpreted as well as compiled execution. A symmetric-language interpreter simply builds an alpha incrementally, repeatedly tacking on new elements at the end. This incrementally-growing alpha is the interpreter's environment. Because the elements of an alpha may be evaluated in parallel, the symmetric interpreter is a parallel interpreter: Each new expression the user enters is evaluated in a separate process, concurrently with all previous expressions. The values returned by all these concurrent evaluations coalesce into a single shared naming environment.

This is a mere sketch. Symmetric languages are discussed in detail elsewhere [4]. We are particularly interested in Symmetric C and Symmetric Lisp; either may be implemented on top of the runtime environment provided by the Linda kernel.

The future

We have many future plans.

The semantics of a tuple space allow it, like a file, to exist independently of any particular process or program. A tuple space might in the abstract outlive many invocations of the same program. What we'd like, then, is for tuple spaces to be regarded as a special sort of file (or equivalently, for files to be special tuple spaces). We'd like to be able to keep tuple spaces along with files in hierarchical directories. With many tuple spaces to choose from, Linda processes must be given a way to indicate which one is the current one. Once some such mechanism has been provided, the availability of multiple tuple spaces greatly expands the system's capabilities. We can associate different protection attributes with different tuple spaces, just as we do with different files. We can use tuple spaces to support communication between user and system processes by making tuple spaces available for out only, read only, and so on. We can also allow in operations that remove whole tuple spaces at one blow and outs that add whole tuple spaces. The design and implementation of such an extension is a goal for the immediate future.

It's clear that Linda can be an interpreted command language as well as a compiled one. It would be useful to allow users to add, read, and remove tuples from active tuple spaces interactively. We've taken some preliminary steps towards implementing such a system. We'd like, too, to be able to encompass whole networks, even physically-dispersed ones, within Linda systems. We can then use Linda to write distributed network utilities like mailers and file systems. Our work on the VAX-network implementation is leading toward experiments of this sort.

Finally, we imagine, as an object of prime interest for the future, an enormous Linda machine highly optimized to support Linda primitives. We don't yet know how to build such a machine, but it's hard not to notice that very large networks with small diameters might be constructed out of multidimensional grids of S/Net-like buses. Having built such a machine, we imagine tuple space itself as the machine's main memory. (Outside of tuple space, only registers and local caches exist.) As the Linda primitives become faster and more efficient, such an architecture looks more and more like a sort of dataflow machine, but with a crucial difference. As in a dataflow machine, we can create task templates (stored in tuples), update them with new values as they become available, and mark them "enabled" when all values are filled in. General-purpose evaluator processes, much like the replicated workers discussed above, use in to pick out task tuples marked "enabled." But unlike the token space of a dataflow machine, a Linda machine's tuple space can store data structures as well as task descriptors. Processes are free to build whatever (distributed) data structures they want, and manipulate and side-effect them as they choose. We might even use such a Linda machine to store large databases operated upon in parallel.

Such work is for the future. We still lack a polished Linda implementation on any machine. We hope to have one soon. And clearly we can learn a great deal by continuing to refine and to experiment with Linda kernels for present-generation architectures. This is what we plan to do.

Acknowledgments

Rob Bjornson, Venkatesh Krishnaswamy, and Jerry Leichter are our collaborators in the Yale Linda group. Thanks also to Erik DeBenedictis, Robert Gaglianello, Howard Katseff, and Thomas London of AT&T Bell Labs.

References

1. J. T. Deutsch and A. R. Newton, "MSplice: A Multiprocessor-based Circuit Simulator," Proc. 1984 Int'l Conf. Parallel Processing, Aug. 1984, pp. 207-214.

2. G. F. Pfister et al., "The IBM Research Parallel Processor (RP3): Introduction and Architecture," Proc. 1985 Int'l Conf. Parallel Processing, Aug. 1985.

3. N. Carriero and D. Gelernter, "The S/Net's Linda Kernel," Proc. Symp. Operating System Principles, Dec. 1985, and ACM TOCS, May 1986.

4. D. Gelernter, "Symmetric Programming Languages," Yale Univ. Dept. of Computer Science tech. report yaleu/dcs/rr#253, Dec. 1984.

Sudhir Ahuja obtained his MS and PhD in electrical engineering from Rice University in 1974 and 1977, respectively. He has been with AT&T Bell Laboratories, Holmdel, NJ since 1977. He is currently the head of the System Architectures Research Department. His earlier work involved associative memories, pipelining, and parallel processing. He has been involved in the design and implementation of an associative processor, high-speed buses, and multiprocessor systems. His current interests are in the field of multiprocessor architectures, concurrent programming, local networking, and the use of VLSI to implement specialized processor architectures.

Nicholas Carriero is a graduate student in the Yale University Department of Computer Science. He received a BS from Brown in 1980 and an MS in computer science from SUNY at Stony Brook in 1983. Distributed programming languages and operating systems are his research interests.

David Gelernter's biography and photo appear following the Guest Editor's Introduction, on page 16.

Readers may write to Gelernter at the Dept. of Computer Science, PO Box 2158, Yale Station, New Haven, CT 06520-2158.
