
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 10, 130-139 (1990)

Software Combining Algorithms for Distributing Hot-Spot Addressing *

PEIYI TANG

Department of Computer Science, The Australian National University, Canberra, ACT 2601, Australia

AND

PEN-CHUNG YEW

Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

In a large shared-memory multiprocessor system, a large number of simultaneous accesses to a single shared variable (called a hot spot in [10]) can degrade the performance of its shared memory system. Software combining [14] is an inexpensive alternative to the hardware combining networks [3, 9] for tackling this problem. This paper gives software combining algorithms for three different types of hot-spot accesses: (1) barrier synchronizations in parallel loops, (2) fetch-and-add type of operations, and (3) P and V operations on semaphores. They include most of the general hot-spot access patterns. By using software combining trees to distribute hot-spot accessings, the number of processors that can access the same location is greatly reduced. In these algorithms, the completion time of a hot-spot access is in the order O(log² N) in a multiprocessor system with N processors, assuming that the delay of a switch element in an interconnection network is a constant time, O(1). © 1990 Academic Press, Inc.

1. INTRODUCTION

Large tightly coupled multiprocessor systems with shared memory such as the Denelcor HEP [11], BBN's Butterfly, Cedar [5], Ultracomputer [3], and RP3 [9] are becoming more common due to the progress of VLSI and software technologies. As processor speed and system size increase, shared memory becomes a bottleneck in such systems. Techniques such as using caches in processors and prefetching data before they are needed can be used to tackle this problem. However, shared variables such as barriers for parallel loops [12], loop indices in processor self-scheduling [12], semaphores for managing critical sections [1], and pointers for parallel queues [4] and parallel linked lists [13] can still create very serious problems when they are accessed by a large number of processors. These shared variables are referred to as "hot spots" [10]. It was shown that even a very small percentage (e.g., 1%) of hot-spot accesses, if not handled properly, can congest the interconnection network, severely degrading memory bandwidth and seriously affecting regular accesses. Combining networks have been suggested to solve this problem in NYU's Ultracomputer [3] and IBM's RP3 multiprocessor project [9]. However, combining hardware can increase the complexity and the cost of a network. It is estimated that the cost of a combining network for fetch-and-add operations is about 6 to 32 times as much as that of a regular network [10]. The combining hardware will also slow down the traffic of regular shared-memory accesses unless a separate network is used as proposed in RP3.

* This work was supported in part by the National Science Foundation under Grants US NSF DCR84-06916 and US NSF DCR84-10110, the U.S. Department of Energy under Grant US DOE DE-FG02-85ER25001, and by the donations of IBM and Digital Equipment Corporations.

Yew et al. [14] proposed an inexpensive alternative to this problem. Instead of using expensive hardware to combine hot-spot accesses in an interconnection network, software combining trees are used. For each hot spot, a software combining tree is created in which each node is shared by only a few processors and nodes are dispersed across different memory modules. However, only simple hot-spot accesses are discussed in [14] and no algorithms are proposed for traversing these software trees.

In this paper, we explore software combining algorithms for more complicated hot-spot accessing like P/V operations and fetch-and-add types of operations. We assume that some kind of indivisible test-and-modify synchronization instructions like those in the Cedar system [15] are available to reduce synchronization overhead. We use a Pascal-like language to describe our algorithms. Uppercase names are globally shared variables and lowercase names are local variables or read-only shared variables in these algorithms.

2. SYNCHRONIZATION INSTRUCTIONS

To simplify our presentation and reduce possible synchronization overhead, we use Cedar synchronization instructions [15] in our algorithms. A synchronization variable x has two fields: KEY and DATA. The KEY field is for storing synchronization information (which is an integer) and the DATA field is for storing the value of the variable (which can be a floating-point number). The format is as follows:

{x; Test on KEY; Operation on KEY; Operation on DATA}.

Here x is the name (or the address) of the synchronization variable. The "test on KEY" specifies the condition to be tested between the KEY field of x (denoted by x.KEY) and an integer provided by the instruction. The test includes >, ≥, <, ≤, =, ≠, and NULL. The NULL test means that no test is needed and therefore the result of the test is always true. The "operation on KEY" can be Increment, Decrement, Add, Fetch, Fetch&Add, Store, Fetch&Increment, Fetch&Decrement, and No Action. The "operation on DATA" can be Fetch, Store, and No Action. The entire execution of a synchronization instruction, i.e., the test on KEY and the operations on KEY and DATA, is done in each shared memory module and is indivisible. The operation on KEY and the operation on DATA are executed only when the result of the test is true. The memory module will inform the processor of a "success" if the condition is satisfied and the operations are completed. For some applications, a processor needs to busy-wait (spin-lock) at a synchronization instruction until it is successfully executed. There are two ways to implement such busy-waiting in a multiprocessor system: (1) The pending synchronization can be queued in the memory module and tested later without resubmission from the processor. The HEP multiprocessor [11] uses this scheme. (2) A "failure" signal is sent back to the processor and the processor resubmits the same instruction until it is executed successfully by the memory module. We use the latter scheme in this paper. In our algorithms, we use a star at the test condition in a synchronization instruction if busy-waiting is required. In other words,

{x; (test on KEY)*; operation on KEY; operation on DATA}

is equivalent to

1: {x; test on KEY; operation on KEY; operation on DATA}
   if (failure) then goto 1

It can be seen that a synchronization instruction with busy-waiting can be resubmitted by the processor many times. To simplify the algorithm complexity analysis, we assume that a synchronization instruction with busy-waiting always succeeds on the first try.
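For readers more used to conventional atomic primitives, the behavior of such an instruction can be modeled in ordinary code. The following sketch is ours, not Cedar's: it uses C with a per-variable spin lock standing in for the memory module's indivisible execution, and shows one particular instruction, {x; (KEY>0)*; Fetch&Decrement; No Action}, in both its single-attempt and busy-waiting forms (all type and function names are our own).

/* A minimal model of a Cedar-style synchronization variable (the names
   are ours).  The spin lock plays the role of the memory module, which
   executes the whole instruction indivisibly. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag busy;   /* models the memory module serializing requests    */
    int key;            /* KEY field: synchronization information (integer) */
    double data;        /* DATA field: value of the variable                */
} syncvar;              /* initialize with busy = ATOMIC_FLAG_INIT          */

/* One attempt at {x; (KEY>0); Fetch&Decrement; No Action}:
   returns true on "success", false on "failure". */
static bool try_fetch_dec_if_positive(syncvar *x, int *fetched)
{
    while (atomic_flag_test_and_set(&x->busy))  /* enter the memory module  */
        ;
    bool ok = (x->key > 0);                     /* test on KEY              */
    if (ok) {                                   /* only if the test holds:  */
        *fetched = x->key;                      /*   Fetch ...              */
        x->key   = x->key - 1;                  /*   ... &Decrement on KEY  */
    }                                           /* No Action on DATA        */
    atomic_flag_clear(&x->busy);
    return ok;
}

/* Busy-waiting form {x; (KEY>0)*; Fetch&Decrement; No Action}:
   the processor resubmits the instruction until it succeeds. */
static int fetch_dec_if_positive_spin(syncvar *x)
{
    int fetched;
    while (!try_fetch_dec_if_positive(x, &fetched))
        ;
    return fetched;
}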

In this paper, the value of the KEY field is referred to as the value of the synchronization variable when the DATA field of a synchronization variable is not used.

In a large multiprocessor system, a shared-memory access takes far more time than a local-memory access or any logical or arithmetic instruction. Therefore, we only consider shared-memory accesses in analyzing the algorithms in this paper. We assume that the multiprocessor system uses a regular multistage interconnection network such as an Omega network [6] to connect processors and shared-memory modules. We also assume that each switch operation in the network takes a constant time, O(1). In a multiprocessor system with N processors, a shared-memory access has to traverse log N stages of switch elements to reach shared memory [6]. Therefore, the cost of a shared-memory access is in the order of O(log N).
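For example, in a system with N = 1024 processors, an Omega network has log2 1024 = 10 stages of switches, so a single shared-memory access costs on the order of 10 switch delays.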

3. BARRIER SYNCHRONIZATIONS

Given P processors to execute a parallel program, a barrier at a given point of the program requires that no processor pass the point until all P processors have reached it [8]. A barrier can be implemented as follows:

s0  {B; (KEY=0)*; No Action; No Action};
s1  {A; Null; Fetch(x)&Decrement; No Action};
s2  if (x=1) then begin
s3    A.KEY := P;
s4    B.KEY := P-1;
    end
    else
s5  {B; (KEY>0)*; Decrement; No Action}

where the initial value of the counter A is the number of processors, P, and the initial value of B is 0. The variable A counts the number of processors that have reached the barrier and the variable B is a lock to block processors until all of the P processors have arrived [12]. When all processors have arrived, the lock will open and the processors can pass the barrier. Statement s0 is necessary to prevent the race condition possible when the barrier is enclosed in an outer serial loop. The execution rates of parallel processors in a multiprocessor system can be arbitrarily different. It is possible that, after lock B is open, fast processors come back to the barrier before slow processors pass statement s5. Without statement s0, these fast processors could pass statement s5 for the second time without synchronizing with the slow processors. Statements s0 and s5 represent the two necessary synchronization points of a barrier. However, if the outer serial loop encloses at least two barriers, each barrier needs only one synchronization point. In this case, the synchronization point at statement s0 is redundant and statement s0 can be removed.
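As a rough modern rendering (ours, not the paper's; it uses C11 atomics in place of the Cedar instructions, with the atomic test-and-decrement of s5 expressed as a compare-and-swap loop), the two-variable barrier above might look like this:

#include <stdatomic.h>

/* A and B play the roles of the counter and the lock above:
   A starts at P, B starts at 0. */
typedef struct {
    atomic_int A;   /* counts processors still expected at the barrier     */
    atomic_int B;   /* lock that releases and counts departing processors  */
} barrier_t;

void barrier_wait(barrier_t *b, int P)
{
    while (atomic_load(&b->B) != 0)              /* s0: wait for previous round  */
        ;
    int x = atomic_fetch_sub(&b->A, 1);          /* s1: Fetch&Decrement on A     */
    if (x == 1) {                                /* s2: last processor to arrive */
        atomic_store(&b->A, P);                  /* s3: reinitialize the counter */
        atomic_store(&b->B, P - 1);              /* s4: open the lock            */
    } else {
        int v;
        do {                                     /* s5: {B; (KEY>0)*; Decrement} */
            v = atomic_load(&b->B);
        } while (v <= 0 ||
                 !atomic_compare_exchange_weak(&b->B, &v, v - 1));
    }
}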

The shared variables used for barrier synchronizations are potential hot spots in multiprocessor systems. We could use a single variable to implement a barrier with two synchronization points [2]. However, our two-variable barrier is no worse than a single-variable barrier, because the traffic to a single shared variable can be reduced by half. Of course, if there are a large number of processors, variables A and B can still become hot spots. Next, we discuss the software combining algorithm which distributes hot-spot addressing to these barrier synchronization variables.

To simplify our presentation, let us assume P = k^l, where k is the fan-out at each node of a software combining tree. The tree has l levels and each node contains two shared variables: a counter A and a lock B. Each processor is assigned to a leaf node of the combining tree. The data structure of each node is as follows:

type node = record
  A: syncvar        /* a counter, initialized to be k */
  B: syncvar        /* a lock, initialized to be 0 */
  parent: ↑node     /* a pointer to the parent node */
  root: boolean     /* to indicate if the node is the root */
end

where "syncvar" means a synchronization variable. The combining algorithm for each node of the tree is shown in Algorithm 3.1. Note that after all of the k^l processors have left the barrier, counter A and lock B of each node are reinitialized.

In Algorithm 3.1, the shared variables in each node are accessed by k processors instead of by P processors. As long as k is kept small, hot-spot contention can be greatly reduced, as shown by the simulations in [14]. Since the processors have to traverse the software combining tree from the leaves to the root, the software combining barrier cost for each processor is about 3 log_k P accesses to shared memory (each Barrier procedure call to a tree node has three shared-memory accesses, namely s0, s1, and s6, for most processors). The cost of a shared-memory access is O(log N). If all the processors of the system are allocated to execute a single program, i.e., N = P, the cost of a software combining barrier for each processor is in the order of O(log² N).
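As a rough numerical illustration (our example, not the paper's): with k = 4 and P = N = 1024, the combining tree has log_4 1024 = 5 levels, so each processor makes about 3 x 5 = 15 shared-memory accesses per barrier, each costing about log_2 1024 = 10 switch delays, for roughly 150 switch delays in total; and no single synchronization variable is accessed by more than k = 4 processors.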

Procedure Barrier(nodeptr: ↑node)
begin
  with nodeptr↑ do begin
s0  {B; (KEY=0)*; No Action; No Action};
s1  {A; Null; Fetch(x)&Decrement; No Action};
s2  if (x=1) then begin
s3    if (root=false) then call Barrier(parent);
s4    A.KEY := k;
s5    B.KEY := k-1;
    end
    else
s6  {B; (KEY>0)*; Decrement; No Action};
  end
end

ALGORITHM 3.1

4. SOFTWARE COMBINING MECHANISM FOR GENERAL ACCESS PATTERNS

The access pattern of the shared variables described in the previous section has two characteristics: (1) the total number of processors accessing a shared variable is fixed, and (2) processors will not proceed to the next step until they have all finished the current step. These two characteristics make the combining algorithm very simple, as shown in Algorithm 3.1.

For shared variables which do not have such access patterns, more sophisticated combining algorithms are needed. Those shared variables can be pointers to the head or to the tail of a ring-buffer parallel queue [4], locks and linking pointers of parallel linked lists [13], semaphores for P/V operations [1], loop index variables in self-scheduling loops [12], etc. Combining trees for these shared variables have several unique characteristics:

(1) In the combining algorithm described in Section 3, a representative cannot be sent to its parent node before k processors have all accessed the node. For general access patterns like P/V operations, it may take a very long time before all k processors access the shared node. Some processors may access the shared node more frequently than others, and some processors may never access the shared node at all. We cannot wait until all k processors have arrived before sending a representative to the parent node as in Algorithm 3.1.

(2) In a hardware combining network [10], requests may arrive at a switching element simultaneously and be combined in the switching element. We assume that there is no combining network in our system; each memory module has to service these requests sequentially. Hence, there is no clear notion of "simultaneity" in our system. We have to define the notion of simultaneity in a different way.

(3) Depending on how requests are combined in a node, the number of requests combined may be larger or smaller than k. In a combining tree in Section 3, there are always exactly k requests combined in each node.

(4) Due to the hardware complexity in a hardware combining network, pairwise combining is usually used. There is no such constraint in a software combining tree. We can combine as many requests as we want.

First let us define the notion of simultaneity. One way to define it is to create a “window” of time. Requests which arrive during that period of time will be combined. Requests which miss the current window have to wait for the next one.

There are many ways to define a window, and its length will determine the effectiveness of this scheme. If a window is too long, even though many requests can be combined, the earlier requests will suffer a long waiting time. If a window is too short, not many requests can be combined. This parameter depends very heavily on other system parameters like memory speed, network delay, and processor speed. Since we concentrate only on algorithms, we do not elaborate on how this parameter should be chosen.

In this paper, we assume a large multiprocessor system; hence, network delay is relatively long compared to memory and processor speed. We define a window as follows: When a processor accesses a node, it checks to see if it is the first processor accessing the node since the previous window was closed. If it is the first processor, it marks the time of the access as the beginning of a new window and will access the node again to "close" the window. The time between these two visits is defined as the length of a window. Any request arriving within the window will be combined.

There are two advantages to this scheme:

(1) It allows a window to be defined only by active processors. Even with only one active processor (i.e., there is no combining), the processor can still traverse the tree by itself. We do not need a different algorithm for this special case.

(2) If the conflict resolution schemes in the interconnection network and in the shared memory are fair, the length of a window can be adjusted dynamically by the two visits of the first processor. For example, if there are a lot of accesses between those two visits, the second visit of the processor will be pushed back and the length of the window will become longer to combine all those accesses. On the other hand, if there are very few accesses between those two visits, the window will be shortened accordingly because the first processor can get to the node more quickly in its second visit.

In the following sections, we describe three different access patterns to shared variables. In Section 5, access patterns like the computing of the sum, the maximum, or the minimum of a large array are discussed. In this access pattern, requests only traverse the combining tree from leaf nodes to the root node, and no information is sent back from the root to leaf nodes. In Section 6, information is flowing from leaf nodes to the root node, and the result will then propagate back to leaf nodes. Fetch-and-add operations [4] have this kind of access pattern. In Section 7, information is flowing from leaf nodes to the root node, and some kind of test is performed on the way to the root node. The result will propagate back to leaf nodes. This type of access pattern includes P and V operations on semaphores.

5. GENERAL ACCESS PATTERN 1

The tree traversing algorithm for access patterns like the computing of the sum, the maximum, or the minimum is described in Algorithm 5.1. The data structure for a tree node for software combining is as follows:

type node = record
  WINDOW: syncvar   /* initialized to be 0 */
  COUNTER: syncvar  /* initialized to be 0 */
  parent: ↑node     /* a pointer to the parent node */
  root: boolean     /* to indicate if the node is the root */
  X: syncvar        /* used for combining operations */
end

Procedure COMBINE(nodeptr: ↑node; argument: real)
local var temp: real; t0, w: integer;
begin
  with nodeptr↑ do
    if root=true then
      perform required operation on X with argument
    else begin
s1    {WINDOW; (KEY<m)*; Fetch(t0)&Increment; No Action};
c1    if (t0=0) then begin
s2      {WINDOW; Null; Fetch(w)&Add(m); No Action};
        if (w=1) then begin
          temp := argument;
s3        WINDOW.KEY := 0;
        end
        else begin
s4        perform required operation on X with argument;
s5        {COUNTER; (KEY=w-1)*; No Action; No Action};
s6        temp := X;
s7        COUNTER.KEY := 0;
        end;
s8      call COMBINE(parent, temp);
      end
      else begin
s9      perform required operation on X with argument;
s10     {COUNTER; Null; Increment; No Action};
        if (t0=1) then begin
s11       {COUNTER; (KEY=0)*; No Action; No Action};
s12       reinitialize X;
s13       WINDOW.KEY := 0;
        end
      end
    end
end

ALGORITHM 5.1

Each window is controlled by a synchronization variable WINDOW, whose initial value is 0. As indicated before, the length of the window is defined by the two visits of the first processor. In its first visit, the first processor simply fetch-and-increments WINDOW. In its second visit, it adds m to WINDOW (see s2 of Algorithm 5.1) to close the current window of the node. All processors accessing the node first check to see if WINDOW is less than m. If so, the processor fetch-and-increments WINDOW to enter the combining window; otherwise, it has to wait for the next combining window (see s1 of Algorithm 5.1). Therefore, m is the largest number of requests that can be combined in a window and it should be set as large as possible. If the synchronization instruction set includes Swap (i.e., Fetch&Store), we can use Fetch(w)&Store(m) in s2. In this case, m can be the maximum integer MAXINT. With a very large m, we have an equivalent of "unbounded" combining [7]. The hardware for unbounded combining in a switch element is prohibitively expensive. One advantage of software combining is that there is no extra cost for unbounded combining.

WINDOW is also used as a counter to count the number of processors in the current combining window. The value of WINDOW fetched by a processor to its local variable t0 is actually the number of processors that have entered the combining window before that processor. (The successful completion of s1 defines the time when a processor enters the combining window.) Therefore, local variable t0 of the first processor must be 0, t0 of the second processor must be 1, and so on. When the first processor accesses WINDOW for the second time in s2, the value fetched to its local variable w must be the total number of processors in the current combining window (including the first processor itself). If the first processor finds its w = 1, which means that there is only one processor (the first processor itself) in the combining window and there is no need for combining, it proceeds to traverse to the parent node after reinitializing WINDOW to 0. If w is larger than 1, the first processor is chosen as the representative to carry the combined request to the parent node for that particular combining window. This scheme is also used in Algorithms 6.1 and 7.1.

Algorithm 5.1 shows only the control mechanism of combining at a node. It does not specify what kind of operations are applied to the hot spot in statements s4 or s9. The synchronization variable COUNTER, with an initial value of 0, is used to synchronize the first processor with the remaining processors in the combining window. All processors except the first one in the combining window increment the COUNTER in s10 after they perform the required operations on X in s9. On the other hand, the first processor will fetch X in s6 only after it performs its operation on X in s4 and checks to see in s5 that the COUNTER becomes w - 1, the total number of the remaining processors of the combining window. Thus, it is guaranteed that the first processor will carry the correctly combined value of X to the parent node only after all of the processors in the combining window have performed the required operations on it.

The reinitializations of X and WINDOW are done by the second processor, i.e., the processor with t0 = 1, in s12 and s13. This will not happen until COUNTER is reinitialized by the first processor in s7 after it finishes fetching X, due to the wait statement s11. (Notice that the value of COUNTER is at least 1 after the second processor passes s10 and before the first processor resets it to 0 in s7.) The order of s12 and s13 is important and should not be reversed, because it guarantees that X (and COUNTER) is properly reinitialized before WINDOW is reset to 0 and the entire tree node is open again for the next combining window. The next combining window can be started as soon as WINDOW is reset to 0, allowing software combining at the nodes along the paths from the leaves to the root to be pipelined.
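For concreteness, here is how the control flow of Algorithm 5.1 might be rendered with conventional atomic operations (our translation, not the paper's code; the "required operation" is specialized to a running sum, the Cedar instructions are replaced by C11 atomics and spin loops, and M stands for the bound m):

#include <stdatomic.h>

#define M 1000000                    /* combining bound m (any large value) */

typedef struct node {
    atomic_int  window;              /* WINDOW.KEY, initially 0             */
    atomic_int  counter;             /* COUNTER.KEY, initially 0            */
    atomic_long x;                   /* X, here a running sum, initially 0  */
    struct node *parent;
    int root;
} node_t;

void combine_sum(node_t *n, long argument)
{
    if (n->root) {                                   /* required operation at the root */
        atomic_fetch_add(&n->x, argument);
        return;
    }
    int t0;
    for (;;) {                                       /* s1: enter a window             */
        t0 = atomic_load(&n->window);
        if (t0 < M && atomic_compare_exchange_weak(&n->window, &t0, t0 + 1))
            break;
    }
    if (t0 == 0) {                                   /* c1: first processor            */
        int w = atomic_fetch_add(&n->window, M);     /* s2: close the window           */
        long temp;
        if (w == 1) {                                /* alone in the window            */
            temp = argument;
            atomic_store(&n->window, 0);             /* s3: reopen the node            */
        } else {
            atomic_fetch_add(&n->x, argument);       /* s4                             */
            while (atomic_load(&n->counter) != w - 1)/* s5: wait for the others        */
                ;
            temp = atomic_load(&n->x);               /* s6: combined value             */
            atomic_store(&n->counter, 0);            /* s7                             */
        }
        combine_sum(n->parent, temp);                /* s8: carry result upward        */
    } else {
        atomic_fetch_add(&n->x, argument);           /* s9                             */
        atomic_fetch_add(&n->counter, 1);            /* s10                            */
        if (t0 == 1) {                               /* second processor cleans up     */
            while (atomic_load(&n->counter) != 0)    /* s11: wait for s7               */
                ;
            atomic_store(&n->x, 0);                  /* s12: reinitialize X            */
            atomic_store(&n->window, 0);             /* s13: reopen the node           */
        }
    }
}

A call at a leaf, e.g. combine_sum(leaf, 5), then contributes 5 to the root's sum while no single node is ever touched by more than one window of processors at a time.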

Let PE1 be the first processor and PE2, ..., PEw the remaining processors in the combining window. We have shown the following theorem indicating the correctness of Algorithm 5.1.

THEOREM 5.1. Algorithm 5.1 accomplishes the following objectives and, hence, controls a combining tree correctly:

(1) 1 ≤ w ≤ m, i.e., there will be at least 1, and no more than m, processors in each window.

(2) PE1 executes the "true" branch (s2, s3, s4, s5, s6, s7, s8) and PE2, ..., PEw execute the "false" branch (s9, s10, s11, s12, s13) of the statement c1.

(3) PE1 will bring the value of shared variable X to the parent node only after all processors in the window have performed the required operations on it.

(4) COUNTER, WINDOW, and X in the node are reinitialized properly before the node is used for the next window.

Let us estimate the execution time for Algorithm 5.1. If there is only one processor in the combining window (i.e., there is no combining in that window), the processor needs three shared-memory accesses (s1, s2, s3) before it calls the parent node. If there is combining in the window, the representative needs six shared-memory accesses (s1, s2, s4, s5, s6, s7) before it calls the parent node. The rest of the processors need three shared-memory accesses (s1, s9, s10), except that one of them needs another three shared-memory accesses (s11, s12, s13) to reinitialize X and WINDOW. Note that not all of the processors need to traverse the entire tree. The worst case happens when a processor is chosen as the representative all the way from a leaf node up to the root node. In this case, the total completion time for the procedure call at a leaf is 6(log_k P - 1) + 1 shared-memory accesses. Again, if N = P, the cost of the procedure call at a leaf is in the order of O(log² N).

6. GENERAL ACCESS PATTERN 2

In this section, we describe a combining algorithm for those access patterns which require information to be propagated from leaf nodes to the root node, and some results are then propagated back from the root node to leaf nodes. Fetch-and-add operations [4] have such access patterns. Even though our algorithm is independent of the type of operations on those shared variables, to simplify our presentation, we use fetch-and-add as an example.

We use a combining window as defined in Algorithm 5.1. Each processor in a combining window fetch-and-adds the shared variable X in the node. The initial value of X is 0. The fetched value is kept in the processor to be used later as an offset. At the end of combining, the representative processor (i.e., the first processor) will take the final value of X to call the parent node. The rest of the processors will be busy-waiting at a shared variable called "mailbox" for a return value brought back from the parent node. When the representative processor brings back a return value from the parent node, it will put the return value in the mailbox. Those waiting processors can then use that value as a base and add their own fetched values as an offset to get the final value of the combined fetch-and-add operation. The same procedure is used for all intermediate nodes except the root node, where no combining is needed. Using this procedure, a processor that is fetch-and-adding a shared variable at a node in the lowest level of a combining tree will have the same effect as if it were fetch-and-adding directly to the root node.
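As a small worked example (ours): suppose the root variable currently holds 100 and two processors issue fetch-and-add requests with addends 3 and 5 at the same leaf in the same window. At the leaf, one request fetches offset 0 and the other fetches an offset equal to the first addend applied (3 or 5, depending on the order in which they reach X), leaving X = 8; the representative then carries a single fetch-and-add of 8 to the root, which returns the base 100 and leaves 108 behind. Each processor finally returns base + offset, so the two results are 100 and 103 (or 100 and 105), exactly what some serial ordering of the two uncombined requests at the root would have produced.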

In order to have as much parallelism as possible for a combining tree, we do not want to lock up a node when a representative is sent to the parent node for further combining. The next combining window should be able to be created as soon as the previous window is closed. While there can be only one window created at a time at a node, it is possible that several groups of processors participating in different combining windows are waiting for returning values at the same node. To prevent confusion among the processors waiting for returning values in different combining windows, the processors participating in the same combining window must have an agreement as to where the mailbox is, and each window should have its own mailbox.

The mailbox of a window is provided by the processor chosen as the representative of the window, and the location of the mailbox has to be passed to all other processors in that window. When the representative processor visits the parent node, it may be chosen again as a representative at the parent node. In this case, it needs to provide another mailbox address for the parent node. If the processor is chosen as a representative at all intermediate nodes on its way to the root, it has to provide l - 1 mailbox addresses for an l-level combining tree. In our algorithm, each processor is allocated an array of pointers for mailboxes:

mbxptr: array [1 .. l-1] of ↑syncvar.

Each mailbox is a synchronization variable. Its KEY field is used to synchronize communication among processors, and its DATA field is used to store the returning value. When a processor is chosen as a representative for a window at a node in level i (leaf nodes are in level 1, and the root node is in level l), its mbxptr(i) is passed on to other participating processors. Algorithm 6.1 is a recursive procedure for combining fetch-and-add requests. The shared variable for the fetch-and-add operation is X. Algorithm 6.1 is almost the same as Algorithm 5.1 except statements s1', s2', s3', and s4' are added to perform handshaking among the processors, and s5' is used to compute the return value of the fetch-and-add operation. We prove the correctness of the algorithm as follows:

LEMMA 6.1. Let PE1, PE2, ..., PEw (w > 1) be the processors participating in a combining window at a node and PE1 the representative of the window. The value returned from the parent node by PE1 (i.e., the value of "base" in the recursive call in s8) will be passed on to the processors PE2, ..., PEw.

Procedure FetchandAdd(nodeptr: ↑node; addend: integer; level: integer; var return: real)
/** nodeptr: a pointer that points to the current node;
    addend:  the value in the fetch-and-add operation;
    level:   the tree level of the current node;
    return:  return value of the procedure **/
local var sum, base, offset: integer; t0, w: integer; pointer: ↑syncvar;
begin
  with nodeptr↑ do
    if root=true then
      {X; Null; Fetch(return)&Add(addend); No Action}
    else begin
s1    {WINDOW; (KEY<m)*; Fetch(t0)&Increment; No Action};
      if (t0=0) then begin
s1'     {mbxptr(level)↑; (KEY=0)*; No Action; No Action};
s2      {WINDOW; Null; Fetch(w)&Add(m); Store(mbxptr(level))};
        if (w=1) then begin
          sum := addend;
s3        WINDOW.KEY := 0;
        end
        else begin
s4        {X; Null; Fetch(offset)&Add(addend); No Action};
s5        {COUNTER; (KEY=w-1)*; No Action; No Action};
s6        sum := X.KEY;
s7        COUNTER.KEY := 0;
        end;
s8      call FetchandAdd(parent, sum, level+1, base);
s2'     if (w>1) then {mbxptr(level)↑; Null; Store(w-1); Store(base)};
      end
      else begin
s9      {X; Null; Fetch(offset)&Add(addend); No Action};
s3'     {WINDOW; (KEY>m)*; No Action; Fetch(pointer)};
s10     {COUNTER; Null; Increment; No Action};
        if (t0=1) then begin
s11       {COUNTER; (KEY=0)*; No Action; No Action};
s12       X.KEY := 0;
s13       WINDOW.KEY := 0;
        end;
s4'     {pointer↑; (KEY>0)*; Decrement; Fetch(base)};
      end;
s5'   return := base + offset;
    end
end

ALGORITHM 6.1

Proof. PE1 will access WINDOW at s2 and the rest of the processors will access WINDOW at s3'. The condition that allows PE2, ..., PEw to fetch the DATA field of WINDOW is that the KEY value of WINDOW is greater than m. This happens only after PE1 has executed s2, in which PE1 stores the mailbox pointer mbxptr(level) ("level" is the level of the node in the tree) into the DATA field of WINDOW. On the other hand, we also have to guarantee that each of PE2, ..., PEw will get that mailbox pointer. Note that s3' is executed before s10, in which COUNTER is incremented. By the time COUNTER becomes w - 1, each of PE2, ..., PEw must have fetched the DATA field of WINDOW. Note that WINDOW will not be opened for a new window (see s13) until COUNTER is reset to 0 at s7, which is executed only after COUNTER becomes w - 1 (see s5). PE2, ..., PEw must have all received the mailbox pointer from the DATA field of WINDOW by the time WINDOW is reopened for the new combining window.

The initial value of the KEY field of the mailbox (pointed to by "pointer") is 0. PE2, ..., PEw will be busy-waiting at s4' until PE1 stores the return value of the recursive call into the DATA field of the mailbox and sets the KEY field to w - 1 at s2'. PE2, ..., PEw can then start fetching the return value at s4'. We must also guarantee that each of PE2, ..., PEw will receive that return value. Note that the KEY field of the mailbox will not turn to 0 until all of PE2, ..., PEw have successfully accessed the mailbox in s4'. The mailbox will not be reused for another combining unless its KEY field is 0 (see s1', which is before s2). Hence, what each of PE2, ..., PEw receives from the mailbox in s4' must be the valid return value from the parent node. This proves the lemma. Q.E.D.

When a number of processors access the same variable in the shared memory in a multiprocessor system, the order of the accesses is nondeterministic due to possible conflicts in the network. There is no guarantee that the requests issued first will arrive at a memory module first. We must bear this point in mind in the following discussion.

THEOREM 6.1. If a processor issues a fetch-and-add request to its assigned leaf node, using Algorithm 6.1, the return value of the request and the effect of the request on the root node are the same as if the fetch-and-add request were issued directly to the root node.

Proof. Using Lemma 6.1, it is easy to show that the theorem is true for a tree of 2 levels. Suppose that the theorem is true for a tree of k levels. Let PE1, PE2, ..., PEw be the processors participating in a combining at a leaf node of a tree of k + 1 levels and PE1 the representative. Since we are using the same control structure as in Algorithm 5.1, from item (3) in Theorem 5.1, PE1 will use the sum of the addends from all the participating processors to issue a new fetch-and-add request to its parent node. From this point on, it is a k-level tree. According to the assumption about k-level trees, the effect of this new request is the same as if it were issued to the root node directly, and the return value is also the same as if it were brought back directly from the root. From Lemma 6.1, this return value will be passed on to all processors. Thus, the theorem is true for combining trees of k + 1 levels. Q.E.D.

As in Algorithm 5.1, we consider only the number of shared-memory accesses in estimating the time needed for a procedure call in Algorithm 6.1. If there is no combining in the window, a processor needs only four shared-memory accesses (s1, s1', s2, s3) before it visits the parent node in s8. If there is combining in the window, the representative needs seven shared-memory accesses (s1, s1', s2, s4, s5, s6, s7) before it calls the parent node and one shared-memory access (s2') after that. Because the rest of the processors have to wait for the return value of that recursive call and fetch the return value in another shared-memory access (s4'), the time for a single procedure call excluding the time for a recursive call to the parent node is nine shared-memory accesses. Hence, the total completion time of each procedure call at a leaf node is 9(log_k P - 1) + 1 shared-memory accesses. The cost of a procedure call at the leaf node is still in the order of O(log² N), assuming N = P.

7. GENERAL ACCESS PATTERN 3

In this section, we present a combining algorithm for operations like P's and V's on semaphores. This type of accessing pattern is very similar to access pattern 2 in the last section, except that some kind of test is needed at each node.

It is shown by Gottlieb [4] that P's and V's can be implemented through fetch-and-add operations. We can use Algorithm 6.1 for those fetch-and-add operations and eliminate hot-spot addressings for semaphore P/V operations. However, this will be extremely inefficient.

Using our synchronization instructions, the P operation on a semaphore S, P(S), can be expressed as

{S; (KEY>0)*; Decrement; No Action}

and the V operation, V(S), is just

{S; Null; Increment; No Action}.
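In conventional atomic-operation terms (a sketch of ours, not part of the combining scheme itself), these two instructions amount to the usual semaphore semantics:

#include <stdatomic.h>

/* P: busy-wait until the count is positive, then decrement it atomically,
   i.e. {S; (KEY>0)*; Decrement; No Action}. */
void P(atomic_int *S)
{
    for (;;) {
        int v = atomic_load(S);
        if (v > 0 && atomic_compare_exchange_weak(S, &v, v - 1))
            return;
    }
}

/* V: increment the count, i.e. {S; Null; Increment; No Action}. */
void V(atomic_int *S)
{
    atomic_fetch_add(S, 1);
}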

To distribute hot-spot accessings to the shared variable S, we replace it by a k-way combining tree and use a recursive procedure called PV to traverse the tree (see Algorithm 7.1).

Procedure PV in Algorithm 7.1 has four arguments: (1) "nodeptr" is a pointer pointing to the current node. (2) "level" is the tree level of the current node. (3) "request" is the value that should be added to the semaphore. (It can be a positive number or a negative number: a positive number is equivalent to a V operation; a negative number represents the value to be taken from the semaphore. It cannot be 0, because that would mean all of the positive and the negative requests in a combining window are satisfied in a node and no further traversal is needed.) (4) "received" is a nonnegative value received from the tree root. For a PV call with a negative request, received is no larger than the absolute value of request. If request is positive (which means a V operation), there will be no received value.

Thus, using the combining tree and Algorithm 7.1, the P(S) operation on the semaphore S can be implemented as

L: call PV(leaf, 1, -1, received)
   if (received=0) then goto L

and a V(S) operation is

call PV(leaf, 1, +1, received),

where “leaf” is the pointer to the leaf node to which the processor is assigned.

Procedure RootPV in Algorithm 7.2 is used when processors reach the root node.

Algorithm 7.1 uses Algorithm 5.1 to control the combining in a node. All of the statements in Algorithm 5.1 are included in Algorithm 7.1 with the same labels and in the same order. Note that s4 and s9 in Algorithm 5.1 are replaced by a statement labeled s4&s9 and its following "if" statement, which involves only local variables.

Procedure PV(nodeptr: ↑node; level: integer; request: integer; var received: integer)
local var t0, w, x, y, balance, response: integer; pointer: ↑syncvar;
begin
  with nodeptr↑ do
    if (root=true) then call RootPV(nodeptr, request, received)
    else begin
      received := 0;
s1    {WINDOW; (KEY<m)*; Fetch(t0)&Increment; No Action};
c1    if (t0=0) then begin
s2      {WINDOW; Null; Fetch(w)&Add(m); Store(mbxptr(level))};
        if (w=1) then begin
s3        WINDOW.KEY := 0;
          call PV(parent, level+1, request, received);
          return;
        end;
        pointer := mbxptr(level);
      end
      else
s1'     {WINDOW; (KEY>m)*; No Action; Fetch(pointer)};
s4&s9 {X; Null; Fetch(x)&Add(request); No Action};
c2    if (request*x < 0) then
c3      if (request>0) then
s2'       {pointer↑; Null; Add(min(|x|, request)); No Action}
        else begin
          received := received + min(x, |request|);
          request := request + min(x, |request|);
        end;
      if (t0=0) then begin
s5      {COUNTER; (KEY=w-1)*; No Action; No Action};
s6      balance := X.KEY;
s7      COUNTER.KEY := 0;
        response := 0;
s8      if (balance≠0) then call PV(parent, level+1, balance, response);
        if (balance<0) then
s3'       {pointer↑; Null; Add(response); Store(1)}
        else response := 0;
      end
      else begin
s10     {COUNTER; Null; Increment; No Action};
        if (t0=1) then begin
s11       {COUNTER; (KEY=0)*; No Action; No Action};
s12       X.KEY := 0;
s13       {WINDOW; Null; Store(0); Store(0)};
        end;
      end;
c6    if (request<0) then begin
        while (request<0 and pointer↑.DATA=0) do begin
s4'       {pointer↑; (KEY≥-request); Add(request); No Action};
          if (success) then begin
            received := received + |request|;
            request := 0;
          end;
        end;
        if (request<0) then begin
s6'       {pointer↑; Null; Fetch(y)&Add(request); No Action};
          if (y>0) then received := received + min(y, |request|);
        end;
      end;
c7    if (t0=0) then
c8      if (balance≥0) then
s7'       {pointer↑; (KEY=0)*; No Action; No Action}
        else
s8'       {pointer↑; (KEY=balance+response)*; Store(0); Store(0)};
    end
end

ALGORITHM 7.1

In the following discussions, a PV procedure call with a positive request and a negative request are referred to as a V request and a P request, respectively. The basic idea is to use V requests to satisfy P requests in the same combining window as early as possible at each node. Variable X is used to store the balance of P requests and V requests. Its initial value is 0.

Each processor in the same combining window fetches the original value of X and adds its request to X. If the final balance in X is 0, all P requests can eventually be satisfied by the V requests locally. If the final balance in X is negative, some P requests cannot be completely satisfied. After a processor with a P request accesses X, it will be busy-waiting at the mailbox, if the P request is not yet completely satisfied. When a V request arrives and finds X has a negative value, it will try to satisfy as many P requests busy-waiting at the mailbox as possible by adding its request value or the absolute value of X, whichever is smaller, to the mailbox. After all processors in the combining window have accessed X, if its final balance is not zero, a representative (i.e., the first processor) is sent to the parent node with the balance as the actual parameter for request in the recursive call.

Those unsatisfied P requests busy-waiting at the mailbox can be satisfied by two sources: (1) later processors with V requests (see s2'); and (2) the representative, which brings back a positive return value (see s3'). The DATA field of the mailbox is used as a flag to indicate whether the representative has returned from the parent node. Before the representative returns from the parent node (i.e., when mailbox.DATA = 0), some processors with unsatisfied P requests busy-waiting at the mailbox may be satisfied by the contributions of V requests to the mailbox (see s4'). Note that the mailbox never becomes negative during this period of time. After the representative returns from the parent node and puts a positive value into the mailbox (i.e., when mailbox.DATA = 1), all of the unsatisfied P requests will compete for the positive value left in the mailbox (see s6').

The representative processor reinitializes the mailbox before it completes the PV procedure call. Recall that the mailbox used in the combining window is provided by the representative processor. As is shown below, the representative processor will not reinitialize the mailbox until all of the relevant processors have finished accessing it. Therefore, there is no race for the mailbox between the processors in the different combining windows. (Note that the representative processor could issue another PV call after it finishes the current PV call. If it is chosen again as the representative on the way from the leaf node to the same node of level i, its mailbox, mbxptr(i)↑, could be used again for communication in another combining window at the same node.) In other words, each mailbox of each processor is guaranteed to be properly reinitialized before its next use.

Procedure RootPV(nodeptr: ↑node; request: integer; var received: integer)
local var x, y: integer;
begin
  with nodeptr↑ do begin
    {X; (KEY=1)*; Decrement; Fetch(x)};
    if (request>0) then begin
      received := 0;
      y := x + request;
    end
    else begin
      received := min(x, |request|);
      y := max(0, x + request);
    end;
    {X; Null; Increment; Store(y)};
  end
end

ALGORITHM 7.2

There are two cases in mailbox reinitialization: (1) balance ≥ 0 and (2) balance < 0. In case 1, when the representative processor reaches statement c7, the KEY field of the mailbox is 0 if and only if all the relevant processors have accessed it. This is because the final value of the KEY of the mailbox is 0, and by the time the representative processor reaches statement c7, all processors with V requests contributing to the mailbox must have executed statement s2', due to the synchronization by COUNTER. Similarly, in case 2, when the representative processor reaches statement c7, the KEY field of the mailbox is "balance + response" if and only if all the relevant processors have accessed it.

Although Algorithm 7.1 appears to be more complicated than Algorithm 6.1, both use the same control structure for combining as Algorithm 5.1 and have similar delay for a single procedure call. If there is no combining in a window, a processor needs 3 shared-memory accesses (s1, s2, s3) before it visits the parent node. If there is combining in a window, the representative needs 6 or 7 shared-memory accesses (s1, s2, s4&s9, s2', s5, s6, s7), depending on whether it executes s2' or not, before it visits the parent node. After it returns from the parent node, it may spend another shared-memory access in s3'. If it has an unsatisfied P request, it will spend another 2 shared-memory accesses in s4' and s6'. Including the shared-memory access in s7' or s8', the longest time for a single procedure call by a representative is 11 shared-memory accesses. But, unlike those in Algorithm 6.1, not all processors have to wait for the return value from the parent node. Some of the P requests can be satisfied immediately by V requests after as few as 4 shared-memory accesses (s1, s1', s4&s9, s10). Hence, the total completion time for a PV procedure call to a leaf node varies from case to case. The worst case occurs when a P request to a leaf node is always combined with P requests in all nodes along the path to the root node. In this case, the P request can still be completed in O(log² N) time.

8. CONCLUDING REMARKS

In shared-memory multiprocessor systems, software combining of concurrent accesses to shared variables is an inexpensive alternative to hardware combining [14]. Operations on hot-spot variables are often complicated read-and-modify memory operations, such as fetch-and-add or P/V operations. In this paper, we presented software combining algorithms for these hot-spot accessings.

The number of processors that can access the same location in the shared memory is greatly reduced by using a software combining tree to replace the original hot-spot variable. Because accesses to memory locations are sequential, we introduced the concept of the combining window and the notion of simultaneity for a combining tree. An important advantage of software combining is that it can adjust the number of combined accesses at a node dynamically and, hence, realize unbounded combining [7] without extra cost.

Serial bottlenecks in shared memories and tree saturation in interconnection networks are the two problems caused by hot-spot accessing (or memory contention). Tree saturation makes the serial bottleneck worse. That is, if P processors issue their requests to a hot spot simultaneously, the completion time of all these requests is much larger than P when the network is congested. Tree saturation also degrades the network bandwidth for regular accesses. Software combining solves both problems.

If a shared-memory access in the system is assumed to take a constant time O(1), all combining algorithms in this paper cost O(log P), where P is the number of processors accessing the hot spot. In reality, the shared-memory access time is O(log N), where N is the total number of processors in the system (P ≤ N). Therefore, the cost of all combining algorithms in this paper is O(log² N) if the switch clock in the interconnection network is assumed to be the constant time O(1). In any case, the ratio of the time cost between software combining and hardware combining is O(log P).

ACKNOWLEDGMENTS

The authors thank Chuan-Qi Zhu for his valuable comments during the preparation of this paper. The authors are also indebted to the reviewers of the paper for their important suggestions.

REFERENCES

1. Dijkstra, E. W. Hierarchical ordering of sequential processes. In Hansen, P. B. (Ed.), Operating System Principles. Prentice-Hall, Englewood Cliffs, NJ, 1972, pp. 72-93.

2. Dimitrovsky, I. A short note on barrier synchronization. Ultracomputer System Software Note 59, Courant Institute, New York University, New York, Jan. 1985.

3. Gottlieb, A., et al. The NYU Ultracomputer - Designing a MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.

4. Gottlieb, A., et al. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. ACM Trans. Programming Languages Systems 5, 2 (Apr. 1983), 164-189.

5. Kuck, D. J., et al. Parallel supercomputing today and the Cedar approach. Science (Feb. 1986), 967-974.

6. Lawrie, D. H. Access and alignment of data in an array processor. IEEE Trans. Comput. C-24, 12 (Dec. 1975), 1145-1155.

7. Lee, G., Kruskal, C., and Kuck, D. J. The effectiveness of combining in shared memory parallel computers in the presence of "hot spots." Proc. 1986 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1986, pp. 35-41.


8. Lusk, E. L., and Overbeek, R. A. Implementation of monitors with macros: A programming aid for the HEP and other parallel processors. In Kowalik, J. S. (Ed.), Parallel MIMD Computation: HEP Supercomputer and Its Applications. MIT Press, Cambridge, MA, 1985.

9. Pfister, G. F., et al. The IBM research parallel processor prototype (RP3): Introduction and architecture. Proc. 1985 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1985, pp. 764-771.

10. Pfister, G. F., and Norton, V. A. "Hot-Spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943-948.

11. Smith, B. The architecture of HEP. In Kowalik, J. S. (Ed.), Parallel MIMD Computation: HEP Supercomputer and Its Applications. MIT Press, Cambridge, MA, 1985, pp. 58-62.

12. Tang, P., and Yew, P. C. Processor self-scheduling for multiple-nested parallel loops. Proc. 1986 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1986, pp. 528-535.

13. Tang, P., Yew, P. C., and Zhu, C. Q. A parallel linked list for shared-memory multiprocessors. Proc. 1989 International Computer Software and Applications Conference. IEEE Computer Society, Silver Spring, MD, 1989, pp. 130-135.

14. Yew, P. C., Tzeng, N. F., and Lawrie, D. H. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 388-395.

15. Zhu, C. Q., and Yew, P. C. A scheme to enforce data dependences on large multiprocessor systems. IEEE Trans. Software Engrg. SE-13, 6 (June 1987), 726-739.

Received October 7, 1987; accepted March 29, 1989