
Parallel Computing 12 (1989) 1-20, North-Holland

Performance of parallel processors *

Horace P. FLATT

IBM Palo Alto Scientific Center, Palo Alto, CA 94304, U.S.A.

Ken KENNEDY

Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251, U.S.A.

Received December 1986 Revised February 1987, November 1988

Abstract. The impact of synchronization and communication overhead on the performance of parallel processors is investigated with the aim of establishing upper bounds on the performance of parallel processors under ideal conditions. Bounds are derived under fairly general conditions on the synchronization cost function. These bounds have implications for a variety of parallel architectures and can be used to derive several popular 'laws' about processor performance and efficiency.

Keywords. Parallel processing, synchronization and communication overhead, performance model for parallel processors, parallel architecture efficiency.

1. Introduction

Because of the need for more computation cycles than current technology can deliver in a single processor, there has been substantial interest in systems that employ multiple processors. If we can develop architectures for which doubling the number of processors effectively halves the running time, our computation power will be limited only by the number of processors that we can build into a manageable system. The literature abounds with proposals for architectures incorporating hundreds or even thousands of processors [15,19,36,37].

However, in computations on both simulated and real parallel systems of differing architectures, users have encountered limitations on the potential reduction of running time for a fixed application. Amdahl's famous critique of parallel processing [2] points out that there are parts of every program that must be done sequentially; hence, parallel processing can achieve running times no less than the time to execute the sequential code on a single processor. This observation, which has come to be known as "Amdahl's Law", establishes an upper bound on speedup for parallel processing that is independent of the number of processors employed. In practice, even more severe limitations can obtain--in fact, beyond a certain number, adding additional processors results in increased running times. This phenomenon has been observed in computations arising from such diverse fields as artificial intelligence [4,13] and semiconductor simulation [21].

But even the most enthusiastic supporters of parallel processing agree that, in order to bring more than one processor to bear on a single problem, synchronization and communication are required. The time devoted to these activities offsets the performance gained through parallelism.

* Support for this research was provided by IBM Corporation.



In this paper we investigate the impact of synchronization and communication overhead on the performance of parallel processors. Our aim is to establish upper bounds on the power of parallel processing under ideal conditions. We shall do this by assuming that, for a given program, there is a region amenable to parallel execution and this region can be partitioned into as many tasks of equal duration as we like. It is clear that this assumption leads to an under-estimation of the running time and, hence, an over-estimation of the power of parallel processing. Under this assumption, we will investigate various reasonable models for the cost of synchronization. This investigation will provide some insight into the fundamental limitations of various architectures proposed in the literature. We also hope that it will provide guidance on how to craft better compilers to support parallel architectures.

We begin with a discussion of the simple performance model for parallel processors. A component of this model is the function defining the cost of synchronization for a given number of processors. Properties of reasonable choices for this function will be investigated and implications will be drawn for several architectural schemes. Lower bounds for the performance of parallel systems will be discussed and used as a basis for comparison. In the course of this discussion, we derive some popular 'laws' of parallel performance, including Amdahl's first and second laws. We close with a discussion of the tradeoff between parallelism and granularity.

2. Timing model for parallel processing

In this section we introduce a simple model for the performance of parallel processors, originated by Flatt [12]. This model provides a lower bound for the running time of a given application on a parallel system and, hence, yields an upper bound on the speedup achievable for this application.

Consider a system consisting of n identical processors interconnected in some way for the purpose of passing data and control information between the processors. This connection can be through a shared memory as in the NYU Ultracomputer [17] or direct interconnections as in the Cosmic Cube [37]. Each processor is assumed to have its own local memory.

Consider a program for which the execution time on a single CPU is normalized to unity. In converting this program to run on a multiprocessor, we subdivide it into two components: a component with running time T_s that must be run sequentially and a component with sequential running time T_p that can be subdivided into parallel components running on different processors. Note that

T_s + T_p = 1.   (1)

On a parallel system with n processors, this program will have a running time T comprising three components:

(1) serial execution time T_s,
(2) parallel execution time, equal to T_p/n if the parallelizable part of the program can be partitioned into n parallel components of equal running time,
(3) synchronization and communication overhead T_o(n). Assume T_o(n) ≥ 0 for all n ≥ 1.

The parallel computation time T(n) is then

T(n) = T_s + T_p/n + T_o(n).   (2)
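To make the model concrete, here is a minimal Python sketch of equation (2). The overhead function is passed in as a callable; the linear form and the numeric values in the example are assumptions chosen for illustration only, not measurements from the paper.

    def parallel_time(n, T_s, T_p, T_o):
        """Lower-bound running time T(n) = T_s + T_p/n + T_o(n), equation (2)."""
        return T_s + T_p / n + T_o(n)

    # Illustrative example with an assumed linear overhead T_o(n) = m*t_c*(n - 1),
    # which satisfies T_o(1) = 0.  The values of m, t_c and T_p are hypothetical.
    m, t_c = 10, 0.001
    T_p = 0.95
    T_s = 1.0 - T_p          # from equation (1)
    overhead = lambda n: m * t_c * (n - 1)

    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(parallel_time(n, T_s, T_p, overhead), 4))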


3. Synchronization overhead

Let us now turn to analyzing the behavior of reasonable choices for the overhead function T_o(n). This function represents the contribution to running time due to the need for synchronization and communication between processors. Its value can be affected by the structure of the application, which influences the necessity for communication, the task dispatching algorithm used to control the assignment of processes to processors, and the hardware and software mechanisms for communication and synchronization. Nevertheless, we present a set of reasonable assumptions about the behavior of T_o. Throughout, T_o is treated as a real-valued function of a real parameter n.

We shall see that the assumptions made concerning the overhead are satisfied by both linear and logarithmic functions, among others. From (2) above, we see that so long as T_o(n) is less than the true overhead for each n, we obtain a lower bound for the running time. Nonetheless, as always, the conclusions drawn are valid only in those cases where the assumptions are satisfied.

Fundamental assumptions. The function T_o has the following properties:
(F1) T_o(n) is continuous and twice differentiable with respect to n,
(F2) T_o(1) = 0,
(F3) T_o'(n) > 0 for all n ≥ 1, hence T_o(n) is nonnegative,
(F4) n T_o''(n) + 2 T_o'(n) > 0 for all n ≥ 1,
(F5) there exists n_1 ≥ 1 such that T_o(n_1) = 1.

Although the rationale for conditions (F1)-(F3) is fairly obvious, the other conditions need some motivation. Assumption (F4) guarantees that any zero of T'(n) is unique and (F5) insures that eventually T'(n) > 0. These facts are established by the following lemmas.

Lemma 3.1. Under conditions (F1)-(F3), (F4) implies that, if T'(n) has a zero, it is unique.

Proof. Differentiating (2) yields

T'(n) = (1/n²)(n² T_o'(n) - T_p).   (3)

Condition (F4) implies that n² T_o'(n) is monotone increasing, since

d/dn (n² T_o'(n)) = n(n T_o''(n) + 2 T_o'(n)) > 0 for all n ≥ 1.   (4)

Therefore, T'(n) can be zero only at the point where n² T_o'(n) = T_p, and since n² T_o'(n) is monotone increasing, that point is unique. □

Lemma 3.2. Under conditions (F1)-(F4), if there exists a value of n, say n_x, such that T'(n_x) = 0, then T''(n_x) > 0 and T(n) is monotone increasing for n > n_x.

Proof. Differentiating (3) yields

T''(n) = 2T_p/n³ + T_o''(n).   (5)

Rearranging (3) gives

T_p/n³ = (1/n)(T_o'(n) - T'(n))

and substituting this in (5) yields

T''(n) = (1/n)(2 T_o'(n) + n T_o''(n)) - (2/n) T'(n).   (6)

This equation, in conjunction with condition (F4), establishes the first result. Furthermore, from (3) we see that

n² T'(n) = n² T_o'(n) - T_p.

Since n² T_o'(n) is monotone in n, so is n² T'(n). Consequently, if T'(n_x) = 0, then n_x² T'(n_x) = 0 and n² T'(n) > 0 for all n > n_x. Since n_x ≥ 1, T'(n) > 0 for n > n_x, establishing the second result. □

Lemma 3.3. Under conditions (F1)-(F4), (F5) is sufficient to insure that, for any choice of T_p, 0 ≤ T_p ≤ 1, T'(n) > 0 for some n ≥ 1.

Proof. If T_o'(1) > T_p, then T'(1) > 0 vacuously. So assume that T_o'(1) ≤ T_p. By (3), T'(1) ≤ 0. We claim that T'(n) > 0 for n sufficiently large. Suppose that this is not so. Then by (3),

T_o'(n) ≤ T_p/n² for all n ≥ 1.

Therefore,

T_o(n) ≤ T_o(1) + T_p ∫_1^n (1/x²) dx = T_p(1 - 1/n) < T_p ≤ 1 for all n ≥ 1.

This contradicts assumption (F5) that T_o(n) is somewhere equal to 1. □

It is easy to exhibit a function T_o that satisfies (F1)-(F4), fails to satisfy (F5) and for which T'(n) is never greater than 0. Assume T_p > 0 and let

T_o(n) = 2T_p/3 + T_p/(3n²) - T_p/n.   (7)

Clearly, (F1) and (F2) are satisfied and, since

T_o'(n) = T_p/n² - 2T_p/(3n³) > 0 for all n ≥ 1 and

T_o''(n) = 2T_p/n⁴ - 2T_p/n³,

we have

n T_o''(n) + 2 T_o'(n) = 2T_p/n³ - 2T_p/n² + 2T_p/n² - 4T_p/(3n³) = 2T_p/(3n³) > 0 for all n ≥ 1.

Hence (F3) and (F4) are satisfied. However, T_o(n) < 1 for all n ≥ 1, so (F5) fails. Now

T(n) = T_s + 2T_p/3 + T_p/(3n²)

and

T'(n) = -2T_p/(3n³) < 0 for all n ≥ 1.

We now present the main result of this section.


Theorem 3.4. Under conditions (F1)-(F5), there exists a unique value of n, say n_0, at which T(n) assumes a minimum for n ≥ 1, that is,

T(n_0) ≤ T(n) for all n ≥ 1.

If T_o'(1) ≤ T_p, then n_0 is the solution to

n_0² T_o'(n_0) = T_p.   (8)

Otherwise, n_0 = 1.

Proof. If T_o'(1) > T_p, then T'(1) > 0. Using the argument from Lemmas 3.1 and 3.2, since n² T'(n) is monotone increasing, T'(n) cannot equal 0 for n ≥ 1. Hence, T'(n) > 0 and T(n) is monotone increasing for n ≥ 1. In this case, T(n) has a unique minimum at n = 1.

So assume that T_o'(1) ≤ T_p. By (3), T'(1) ≤ 0. By Lemma 3.3, T'(n) > 0 for n sufficiently large, so by continuity of T' there must exist an n_0 such that T'(n_0) = 0. This n_0 is unique by Lemma 3.1 and represents a minimum by Lemma 3.2. Formula (8) follows immediately from equation (3). □
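Equation (8) can be solved numerically for any overhead function satisfying (F1)-(F5). The bisection sketch below is only an illustration under the stated assumptions; the caller supplies the derivative T_o', and the search bracket is an arbitrary choice.

    def optimal_processors(T_p, dT_o, n_hi=1e6, iters=200):
        """Solve n**2 * T_o'(n) = T_p (equation (8)) by bisection on [1, n_hi].

        Assumes dT_o(n) = T_o'(n) and that n**2 * T_o'(n) is monotone
        increasing, which is guaranteed by assumption (F4)."""
        f = lambda n: n * n * dT_o(n) - T_p
        if f(1.0) >= 0.0:                 # T_o'(1) >= T_p: minimum is at n = 1
            return 1.0
        lo, hi = 1.0, float(n_hi)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Example: assumed linear overhead T_o(n) = k*(n - 1), so T_o'(n) = k.
    k, T_p = 0.001, 0.98
    print(optimal_processors(T_p, lambda n: k), (T_p / k) ** 0.5)  # matches sqrt(T_p/k)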

4. Performance measures

4.1. Parallel cost penalty

The function

Q(n) = nT(n) = nT_s + T_p + nT_o(n)   (9)

may be viewed as the cost penalty of a parallel computation under the following rationale. The cost of a computation is clearly proportional to the number of processors times the length of the computation, since one can view the computation as renting the processors for that length of time. Since the cost of the serial version of the computation is unity (one processor for unit time), nT(n) is the ratio of the cost of the parallel computation to the cost of the serial computation. Clearly, the closer Q(n) is to 1, the more cost effective the computation is.

Parallel processing usually involves wasting some resources in order to decrease the time to completion of a computation. Therefore the following theorem should come as no surprise.

Theorem 4.1. Under assumptions (F1)-(F5), (1) Q(1) = 1, (2) Q(n) is monotone increasing, and (3) Q'(n) is monotone increasing.

Proof. Property (1) follows trivially from equation (9). Consider the equation for Q'(n):

Q'(n) = T_s + T_o(n) + n T_o'(n).   (10)

This is positive for all n ≥ 1 by assumptions (F2) and (F3). Hence, property (2) is established. The formula for Q'' is

Q''(n) = n T_o''(n) + 2 T_o'(n).   (11)

This is positive by (F4), establishing property (3). □

Corollary 4.2. Q( n ) > 1 for all n > 1.

The tangent line to Q(n) has some interesting and useful properties that we shall exploit in future sections. Here we simply introduce the notation and prove two simple results.


Theorem 4.3. Let l(n, x) denote the line tangent to Q at n:

l(n, x) = Q'(n)(x - n) + Q(n).   (12)

Then

l(n, 0) = T_p - n² T_o'(n).   (13)

Proof. By (12),

l(n, 0) = Q(n) - n Q'(n).

Substituting from (9) and (10) we get

l(n, 0) = nT_s + T_p + nT_o(n) - n(T_s + T_o(n) + nT_o'(n)) = T_p - n² T_o'(n). □

Corollary 4.4. If T_o'(1) ≤ T_p, the line tangent to Q at n_0 has the property

l(n_0, 0) = 0.   (14)

Proof. Immediate from equations (3) and (13). □

4.2. Efficiency

The efficiency of a parallel computation is defined as the inverse of the cost penalty:

E(n) = 1/Q(n) = 1/(nT(n)).   (15)

Substituting (9) into (15) and observing that T_s = 1 - T_p, we get

E(n) = 1/(1 + (n - 1)T_s + nT_o(n)) < 1 for all n > 1.   (16)

From Theorem 4.1 we have

Corollary 4.5. Under assumptions (F1)-(F5) (1) E(1) = 1, (2) E(n) is monotone decreasing, and (3) E'(n ) is monotone increasing.

It is obviously advantageous for E to be close to unity for as large a value of n as possible. However, we see from (2) and (16) that

T_p = (nE(n) - 1)/((n - 1)E(n)) + (n/(n - 1)) T_o(n).   (17)

It may be readily deduced from the work of Seitz [37] that for some problems run on the Cosmic Cube, with 64 processors, E(64) = 0.83333. Thus,

T_p = 0.9968 + 1.0159 T_o(64).

In other words, the portion of the computation that can be performed in parallel must be over 99 percent of the total computation and the overhead function for 64 processors must be smaller than three tenths of a percent of the total computation.
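The quoted figures are easy to check from equation (17); the Python snippet below is just a sanity check of that arithmetic.

    # Equation (17): T_p = (n*E - 1)/((n - 1)*E) + n/(n - 1) * T_o(n).
    n, E = 64, 0.83333
    coeff_const = (n * E - 1) / ((n - 1) * E)    # about 0.9968
    coeff_To = n / (n - 1)                       # about 1.0159
    To_max = (1 - coeff_const) / coeff_To        # about 0.003, since T_p <= 1
    print(coeff_const, coeff_To, To_max)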

We see from this the high degree of parallelism and low overhead required to achieve truly high efficiency. In fact, we will show that the minimal execution time will be achieved at a significantly lower level of efficiency. To state this another way, if the parallel system is operating at high efficiency on a given computation, the running time may be further reduced by adding processors, because the minimal execution time is always achieved at fairly low efficiency levels.

Theorem 4.6. Under assumptions (F1)-(F5), if T_o'(1) < T_p, then the following hold:
(1) If T_o(n) = kn - k, then

E(n_0) = 1/(2 + (n_0 - 2)T_s - T_p/n_0).

(2) If (d/dn)(T_o(n)/n) ≤ 0 for n = n_0, then

E(n_0) ≤ 1/(2 + (n_0 - 2)T_s).

Proof. The value of E(n_0) for T_o(n) = kn - k follows directly from the definition of E(n_0) and n_0, establishing (1). To establish (2), consider

d/dn (T_o(n)/n) = (n² T_o'(n) - n T_o(n))/n³.

For n = n_0 we have by assumption

n_0² T_o'(n_0) ≤ n_0 T_o(n_0).

Consequently, from (16),

E(n_0) = 1/(1 + (n_0 - 1)T_s + n_0 T_o(n_0)) ≤ 1/(1 + (n_0 - 1)T_s + n_0² T_o'(n_0)).

Since n_0² T_o'(n_0) = T_p = 1 - T_s by Theorem 3.4,

E(n_0) ≤ 1/(1 + (n_0 - 1)T_s + T_p) = 1/(2 + (n_0 - 2)T_s),

the desired result. □

The implication of this theorem is that for any overhead function with a growth rate very close to linear,

E(n_0) ≤ 1/(2 - T_p/n_0).

This means that the efficiency at the optimal point n_0 cannot be much greater than ½. In fact, if the growth rate is less than linear, the efficiency at the optimal point n_0 is less than ½, provided n_0 > 2. This is not particularly encouraging and leads us to look for a good operating point: some number of processors < n_0 such that the efficiency is reasonably high. This will be the subject of Section 4.4.
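A quick numeric check of Theorem 4.6 with an assumed linear overhead T_o(n) = k(n - 1) shows the efficiency at n_0 landing near ½; the particular values of k and T_p below are illustrative only.

    import math

    k, T_p = 0.0005, 0.99          # assumed overhead slope and parallel fraction
    T_s = 1.0 - T_p
    n0 = math.sqrt(T_p / k)        # equation (8) for linear overhead

    E_n0 = 1.0 / (1.0 + (n0 - 1.0) * T_s + n0 * k * (n0 - 1.0))   # equation (16)
    bound = 1.0 / (2.0 + (n0 - 2.0) * T_s - T_p / n0)             # Theorem 4.6(1)
    print(n0, E_n0, bound)         # E(n_0) equals the bound and is close to 1/2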

4.3. Speedup

The speedup S(n) of a parallel program running on n processors over the same program running on a single processor is

S(n) = 1/T(n) = nE(n) = n/Q(n).   (18)


Corollary 4.7. Under the assumptions of Theorem 4.6,
(1) if T_o(n) = kn - k, then

S(n_0) = 1/(T_s + (2/n_0 - 1/n_0²)T_p),

(2) if (d/dn)(T_o(n)/n) ≤ 0 for n = n_0, then

S(n_0) ≤ 1/(T_s + 2T_p/n_0),

(3) if n_2 is such that n_2 T_o(n_2) = T_p, then n_2 ≤ n_0.

Proof. Parts (1) and (2) follow directly from Theorem 4.6 and equation (18). Part (3) derives from the observation that both n² T_o'(n) and n T_o(n) are monotone increasing. Since T_p = n_0² T_o'(n_0) ≤ n_0 T_o(n_0), we must have n_2 ≤ n_0. □

The main property of speedup is given by the following.

Theorem 4.8. Under assumptions (F1)-(F5) the speedup S(n) has a unique maximum at n_0 ≥ 1. If T_o'(1) ≤ T_p,

S(n_0) = 1/Q'(n_0).   (19)

Proof. The existence of the maximum is a direct corollary of Theorem 3.4. To establish (19), consider

Q'(n) = d/dn (n/S(n)) = 1/S(n) - n S'(n)/S²(n).

Since S'(n_0) = 0, the result follows immediately. □

We note that this is a stronger result than Amdahl's Law [2,5], which predicts a maximum speedup of 1/T_s that cannot be achieved with a finite number of processors. The principal difference here is the effect of the overhead function T_o(n).

4.4. An operating point

From the previous discussion, it is clear that neither the efficiency nor the speedup alone gives a good indicator of where the best operating point should be. The best efficiency is achieved with one processor, while efficiency will be less than ½ for a machine with n_0 processors. We should be able to pick some number of processors < n_0 for which the efficiency is much greater but the speedup is not significantly smaller. We need to somehow take both efficiency and speedup into consideration when picking such a point.

Consider the function

F(n) = E(n)S(n) = nE²(n) = S²(n)/n.   (20)

This function may be characterized as the square of the geometric mean of E ( n ) and S(n ) .


The value of this function is that it combines the behavior of efficiency and speedup. As early as 1976, Kuck suggested optimizing this function in algorithm design, because it is a measure of speedup over cost [25], observing that cost is proportional to the number of processors times the running time. Since speedup over cost is given by

S(n)/(nT(n)) = S(n) · 1/(nT(n)) = E(n)S(n) = F(n),

the optimal point for F(n) is the same as the optimum sought by Kuck. If there exists an optimal point for F, it should be somewhere between 1 (the optimal point for E) and n_0. The following theorem establishes these results in a slightly more general form. Suppose we consider the weighted geometric mean of E and S:

F_x(n) = E^x(n) S^(2-x)(n),

where 0 < x < 2. The following theorem shows that F x always has an optimal point.

Theorem 4.9. Let x be such that 0 < x < 2. Define F_x as follows:

F_x(n) = E^x(n) S^(2-x)(n) = n^(-x) S²(n).

Under assumptions (F1)-(F5), if

T_o'(1) < T_p - ½x,   (21)

then F_x has a unique maximum at n = n_F(x) > 1, where n_F(x) is the solution of the equation

H_x(n) = x(nT_s + nT_o(n)) + 2n² T_o'(n) = (2 - x)T_p.   (22)

Proof. Consider the function

G_x(n) = 1/F_x(n) = n^x T²(n).   (23)

F_x assumes a maximum whenever G_x assumes a minimum. Let us examine the properties of G_x. Differentiating (23) we get

G_x'(n) = x n^(x-1) T²(n) + 2 n^x T(n) T'(n),   (24)

which, when expanded, yields

G_x'(n) = n^(x-1) T(n) [x(T_s + T_p/n + T_o(n)) + 2n(T_o'(n) - T_p/n²)].

Rearranging and expanding T(n) leads to the following expression:

G_x'(n) = (T(n)/n^(2-x)) [x n T_s + x n T_o(n) + x T_p + 2n² T_o'(n) - 2T_p]
        = (T(n)/n^(2-x)) [H_x(n) - (2 - x)T_p].   (25)

From (22) and the fundamental assumptions, we see that H_x is monotone increasing in n for fixed x. Consider

H_x(1) = x T_s + 2 T_o'(1),

hence, by assumption (21) we see

H_x(1) - (2 - x)T_p = x - 2T_p + 2T_o'(1) < 0.

Since H_x is monotone increasing, H_x(n) - (2 - x)T_p can have at most one zero. Furthermore, from (F5) we see that H_x(n_1) ≥ 2 ≥ (2 - x)T_p for all x ≥ 0. Thus H_x(n) - (2 - x)T_p has at least one zero. That zero must occur at the point where equation (22) holds. □

Corollary 4.10. Under assumptions (F1)-(F5), there exists a unique value of n, say n_F, at which F(n) assumes a maximum for n ≥ 1. Furthermore, 1 ≤ n_F ≤ n_0, with n_F = n_0 only if n_0 = 1.

Proof. There are three cases to consider. First suppose that

T_o'(1) ≤ T_p - ½.

Then Theorem 3.4 tells us that T'(n_0) = 0. Furthermore, the assumptions of Theorem 4.9 are also satisfied for x = 1, so we have from (22) and (24) that

G_1'(n_0) = T²(n_0) > 0.

It is easily seen from (25) that G_1' is monotone increasing, so if it takes on a zero, it must take one on before n_0. But Theorem 4.9 guarantees us that it takes on a zero at n_F = n_F(1).

Now suppose that

T_p - ½ < T_o'(1) ≤ T_p.

In this case, we cannot use Theorem 4.9 directly, but we can observe that, since G_1'(n) > 0 for all n ≥ 1, the minimum value of G_1, and hence the unique maximum value of F_1, occurs at n_F = 1. Since Theorem 3.4 assures us that n_0 ≥ 1, we have the desired inequality.

Finally, if T_o'(1) > T_p, then both n_F and n_0 are equal to one. □

Corollary 4.11. Under the assumptions of Theorem 4.9,

2 n_F T'(n_F) = -T(n_F).   (26)

Proof. Equation (26) follows directly from equation (24) and G_1'(n_F) = 0, from which we conclude that

T²(n_F) + 2 n_F T(n_F) T'(n_F) = 0.

The result follows from the observation that T(n) > 0. □

Theorem 4.12. Under the same assumptions as Theorem 4.9,

2 n_F² T_o'(n_F) < n_0² T_o'(n_0)   (27)

and

S(n_F)/S(n_0) > (n_0 + n_F)/(2n_0).   (28)

Proof. In this case we have n_F > 1, and rearranging (22) with x = 1 yields

n_F T_s + n_F T_o(n_F) + 2 n_F² T_o'(n_F) = T_p.

Since, by Theorem 3.4, T_p = n_0² T_o'(n_0), we conclude that

2 n_F² T_o'(n_F) < n_0² T_o'(n_0),

the desired equation (27). To establish (28), consider

l(n_F, x) = Q'(n_F)(x - n_F) + Q(n_F)

(see (12)). Since Q' is monotone increasing by Theorem 4.1 and since n_F < n_0 by Corollary 4.10, we have

l(n_F, n_0) = n_0 Q'(n_F) + (Q(n_F) - n_F Q'(n_F)) < Q(n_0).

Differentiating Q(n) = nT(n) and substituting from (26) yields

Q'(n_F) = T(n_F) + n_F T'(n_F) = ½ T(n_F).

Substituting this into the previous equation yields

½(n_0 + n_F) T(n_F) < n_0 T(n_0).

Equation (28) follows immediately. □

We may arrive at an interpretation of the meaning of this optimal point for F in the following manner (see [24]). Since F = S/Q, we have

F'(n) = F(n)(S'(n)/S(n) - Q'(n)/Q(n)).

Since n_F is the unique maximal point for F, F'(n) > 0 for n < n_F and F'(n) < 0 for n > n_F. Hence for n < n_F,

S'(n)/S(n) > Q'(n)/Q(n),

so a relative change in the speedup is greater than the corresponding relative change in the cost penalty and it is therefore useful to increase n. For n > n_F, the relative change in the speedup is smaller than the relative change in the cost penalty and it is therefore useful to decrease n. In this sense, n_F is the optimal point for operation.

To look at it another way, by the monotonicity of the positive square root, the optimal point for F is also the optimal point for

(E·S)^(1/2),

the geometric mean of E and S. By (18), this means that n_F is the optimal point for

S(n)/√n.

Differentiating with respect to n, we get

d/dn (S(n)/√n) = (1/√n)(S'(n) - S(n)/(2n)).

Thus,

S'(n_F) = S(n_F)/(2n_F) ≤ ½.

In other words, the marginal value of each additional processor after processor n_F is less than half a processor. Even if it is possible to double S(n_F), it will require at least 4 n_F processors.
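The relationship between n_F and n_0 is easy to see numerically. The Python sketch below assumes a linear overhead and illustrative parameter values, and locates n_F by a crude grid search rather than by solving (22).

    # Assumed linear overhead T_o(n) = k*(n - 1); T, E, S, F follow (2), (16), (18), (20).
    k, T_p = 0.001, 0.98
    T_s = 1.0 - T_p

    T = lambda n: T_s + T_p / n + k * (n - 1)
    E = lambda n: 1.0 / (n * T(n))
    S = lambda n: 1.0 / T(n)
    F = lambda n: E(n) * S(n)

    n0 = round((T_p / k) ** 0.5)              # equation (8) for linear overhead
    nF = max(range(1, 1000), key=F)           # grid search for the maximum of F
    print("n0:", n0, "S:", round(S(n0), 2), "E:", round(E(n0), 2))
    print("nF:", nF, "S:", round(S(nF), 2), "E:", round(E(nF), 2))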

5. Applications

Let us now turn to an examination of the implications of these results by considering some realistic choices for T_o. We will study three general system organizations: the shared memory multiprocessor using critical regions for synchronization and scheduling, an array of processors arranged as a k-cube for fixed k, and an array of processors interconnected by a logarithmic network, either directly as in the Cosmic Cube or through a shared memory as in the Ultracomputer.

5.1. Critical region scheduling

Let us consider a system that uses critical regions to schedule its tasks. Assume that the processor executing in the serial part of the program enters a parallel region. It then spends some amount of time posting the parallel tasks and then each processor, in turn, enters the critical region of the scheduling code and selects a task for itself. If traversing the critical region takes time t_c, then the last processor finishes the critical region n t_c time units after the parallel code is entered. Hence the overhead of scheduling is

T_o(n) = m(n t_c - t_c)   (29)

where m is the number of entries into a region of parallel execution. The subtraction of t_c insures that T_o(1) = 0.

Note that there will also be some time expended waiting for the last processor to finish its parallel task (if barrier synchronization is assumed). However, this cost is subsumed in (29) by the following rationale. At the end of its task, each processor enters a critical region to update a counter that indicates the number of tasks that are complete. The first task to complete becomes the next serial processor and waits until the last task updates the counter to n. It then begins execution of the serial portion. If the waiting processor behaves in a way that insures that at least one waiting task can make it into the exit critical region before it tries to enter again, the total cost of exit synchronization can be no more than linear in the number of processors.

Hence, for any system that uses critical regions for scheduling, the parallel execution time will be

T(n) = T_s + T_p/n + m(n t_c - t_c).

This formula should hold for several commercial multiprocessors including the CRAY X-MP and the HEP. That overhead for these processors does indeed follow this formula is borne out by published studies [20,31].

Applying equation (3) from Theorem 3.4, we see that n_0, the optimal number of processors, satisfies

n_0² m t_c = T_p.

This means that

n_0 = √(T_p/(m t_c)).   (30)

Clearly this is an idealized lower bound for the running time of a parallel program. It is particularly unlikely, for example, that the program will be perfectly partitionable. However, it will give us a convenient model to study the limits of performance improvement for parallel programs.

The next question is: what are realistic values for n_0, t_c and T_p? We shall rely on two studies to provide values for these.

The first study, by Larson [31], ran three programs on one- and two-processor configurations of the CRAY X-MP. Although the number of parallel region entry points is not specified, Larson indicates that T_p = 0.96, 0.98, and 0.98 on the three codes and his figures imply that m t_c ≈ 0.009. If this is taken as the value with T_p = 0.98, the maximum speedup is 5.32 with 10 processors. This means that no matter how many processors we use we cannot get a speedup better than 5.32 on this program. This is fairly discouraging.

In a study of the Denelcor HEP [20], Hockney observed that synchronization overhead behaves as our model would predict. His numbers would support a value of 34.8 ms for t_c, which is consistent with intuition.
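The Larson-based estimate above can be reproduced in a few lines of Python; m·t_c ≈ 0.009 and T_p = 0.98 are the figures quoted above, and everything else follows equations (29) and (30). The model yields a maximum speedup of roughly 5, in the neighborhood of the 5.32 quoted above (the exact figure depends on how m·t_c was rounded).

    import math

    mtc, T_p = 0.009, 0.98                   # values quoted above from Larson's study
    T_s = 1.0 - T_p

    n0 = math.sqrt(T_p / mtc)                # equation (30): about 10 processors
    T_n0 = T_s + T_p / n0 + mtc * (n0 - 1)   # critical-region model, equation (29)
    print(round(n0, 1), round(1.0 / T_n0, 2))   # maximum speedup near 5 (cf. 5.32 above)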

5.2. Cube connections

An array of n processors connected as a k-cube, for fixed k, is arranged into a cube of k dimensions, with each processor connected to its 2k neighbors. If each dimension has size q, then there are n = q^k processors.

In a k-cube, the time for synchronization depends on the time for a message to traverse the distance between the two processors most distant from one another (at opposite corners). Since such a message must travel q - 1 links in each of the dimensions, the time to traverse from one corner to another is proportional to kq. From this we see that, for the k-cube,

T_o(n) = m(t_m k n^(1/k) - t_m k)   (31)

where t_m is the basic time to send a message between neighboring cube elements. From equation (3) in Theorem 3.4 we see that

n_0² m t_m n_0^(1/k - 1) = T_p,

which yields

n_0 = (T_p/(m t_m))^(k/(k+1)).   (32)

5.3. Logarithmic connection

The NYU Ultracomputer [17,18,36] provides a substantially different approach to synchronization. The Ultracomputer is a shared memory machine in which the memory interconnection network is a variation on an Omega Network [32]. In such a network, the delays to memory are proportional to the logarithm (base 2) of the number of processors and memories.

The Ultracomputer also provides a synchronization instruction, called fetch-and-add, that is implemented by the interconnection network and takes time proportional to log₂ n. Thus in log₂ n time, the Ultracomputer can assign each processor a unique task or determine if all processors are finished with a given parallel task. It is possible to implement a similar mechanism in a message-passing machine based on the hypercube interconnection, such as the Cosmic Cube [37].

In a machine like this, the overhead is given by

T_o(n) = m t_c log₂ n.   (33)

By Theorem 3.4,

n_0² m t_c log₂ e / n_0 = T_p.

Hence

n_0 = T_p/(m t_c log₂ e).   (34)

This is roughly the square of the optimal number of processors for critical region scheduling.
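To compare the three organizations, the Python sketch below evaluates the optimal processor counts (30), (32) and (34) for one illustrative set of parameters; all of the numeric values are assumptions chosen only to show the relative orders of magnitude.

    import math

    T_p = 0.99       # parallel fraction (assumed)
    m = 100          # number of parallel region entries (assumed)
    t_c = 1e-6       # critical-region / fetch-and-add cost (assumed)
    t_m = 1e-6       # neighbor message time in the k-cube (assumed)
    k = 3            # cube dimension (assumed)

    n0_critical = math.sqrt(T_p / (m * t_c))              # equation (30)
    n0_cube = (T_p / (m * t_m)) ** (k / (k + 1))          # equation (32)
    n0_log = T_p / (m * t_c * math.log2(math.e))          # equation (34)

    # The logarithmic network supports far more processors than critical-region
    # scheduling, roughly its square (up to the log2(e) factor), with the k-cube in between.
    print(round(n0_critical), round(n0_cube), round(n0_log))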


6. Lower bounds for synchronization time

It seems clear that a logarithmic form of T_o is the best that can be achieved for realistic architectures with limitations on interconnection fan-in. This has been observed by a number of researchers, notably Johnson [22]. A straightforward argument establishes that detecting when all processes have completed their assignments takes at least logarithmic time if no processor can read more than k inputs at once. The processors are conceptually arranged into a tree. On the first cycle, each processor writes into a memory location assigned to hold its 'completion variable'. On the next cycle, the parent processors each read the completion variables for k children, writing their own completion variables only when all their children processors have finished. The root processor can write its own completion variable only after log_k(n) cycles.
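The counting behind this argument can be mimicked with a tiny Python simulation: with fan-in k, the number of combining rounds needed before the root can declare completion is the base-k logarithm of n, rounded up. This is only a sketch of the counting argument, not of any real barrier implementation.

    import math

    def completion_rounds(n, k):
        """Rounds needed for a fan-in-k reduction tree over n completion flags."""
        rounds, groups = 0, n
        while groups > 1:
            groups = math.ceil(groups / k)   # each parent combines up to k children
            rounds += 1
        return rounds

    for n in (2, 16, 256, 4096):
        print(n, completion_rounds(n, k=2), math.ceil(math.log2(n)))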

Can the basic limitation of logarithmic overhead be overcome through the use of hierarchy? Consider the following argument. If a problem can be decomposed into parallel subproblems, each of which has a smaller communication burden than the whole, we may be able to save on the complexity of communication. To put it another way, suppose we take a network of computers operating at the optimal point. Can we not gain additional speedup, beyond what the theory introduced above suggests, by replacing each individual computer with a parallel network of equally powerful machines?

Unfortunately, things are not so simple. Let us analyze this situation by asking what the best possible speedup is for a network of n subnets, each with k parallel processors. In other words, how well can we do with a hierarchy of nk processors.

In the ideal situation, the best we can hope for is that all nk processors are brought to bear on an approximately equal part of the program. Hence the parallel execution time for the hierarchical processor system is

T(nk) = T_s + T_p/(nk) + T_o^h(nk),   (35)

where T_o^h is the overhead function for the hierarchy. Let us examine the composition of T_o^h in the context of the problem of detecting the condition that all processors have finished executing a parallel region (barrier synchronization). The time to detect this condition consists of an amount of time to detect it in each of the subnets plus an amount of time to detect that each subnet is done. Since detection of completion in the subnets can be done in parallel, we have the following equation for T_o^h:

T_o^h(nk) = T_o(k) + T_o(n).   (36)

Here T_o is the overhead function for the interconnection scheme used in the main network and the subnets.

The interconnection cost represented by this function is no better than logarithmic. To see this, assume that

T_o(x) = c log₂ x.

Then,

T_o^h(nk) = c log₂ k + c log₂ n = c log₂(nk) = T_o(nk).

Thus, hierarchy provides no improvement over a shared memory machine with a logarithmic synchronization primitive like the Ultracomputer. However, it can be used to improve the performance of systems that employ a linear or cube interconnection scheme. For example, linear critical region scheduling can be improved to logarithmic by employing a tree of semaphores.


7. Increasing problem size

Many proponents point out, quite correctly, that parallel processing makes it feasible to attack much larger problems. In particular, Fox observes that a principal goal of parallel processing is to solve a problem k times bigger using kn processors than one can solve using n processors. Recently, Benner, Gustafson and Montry from Sandia Laboratories, who achieved real speedups of over 400 on a 1024-processor NCUBE hypercube, proposed scaled speedup as a measure of how well an algorithm scales when the number of processors is multiplied by k [3,11,16]. In their informal definition, a scaled speedup of close to k indicates good scaling.

Our goal is to present a formal definition of scaled speedup and investigate its properties. To accomplish this, we will define the running time T to be a function of both the number of processors n and a measure z of the number of elementary computations in a given algorithm, assuming that each elementary computation takes the same amount of time. This is important because many algorithms are nonlinear in the size of the input data set, hence, we must use a measure of the absolute number of computations. Let us define the running time T as follows:

T(n, z) = T_s(z) + T_p(z)/n + T_o(n, z)   (37)

where T_s, T_p and T_o have become functions of z as well. If we assume that each computation takes one time unit, we have

z = T_s(z) + T_p(z).   (38)

With this definition in hand, we are now prepared to define scaled speedup S_s(k, n, z), where k ≥ 1 is the multiplication factor, n ≥ 1 is the original number of processors and z > 0 is the original number of computations:

S_s(k, n, z) = k · T(n, z)/T(kn, kz).   (39)

The rationale for this definition is that scaled speedup is intended to be a measure of how close the running time of a size kz problem on kn processors is to the time for a problem of size z on n processors. Hence, we divide the running time of the smaller problem by the (presumably longer) running time of the larger problem and multiply by k, the factor by which the number of processors has increased. Thus, a scaled speedup close to k means that the problem is scaling well as we increase the number of processors. We feel that this definition properly captures the intent of the Sandia group's informal definition [3]. Note that scaled speedup, as we define it, is also a function of the initial problem size and the initial number of processors. This will be significant in the analysis that follows.

We will now introduce the assumptions on scalability that are proposed by Fox and the Sandia researchers. Intuitively, they propose that, for certain important classes of computa- tions, when a computation increases in size by a factor of k, all of the increase will occur in the parallel region. We will use the following formal assumptions to approximate this.

Scalability assumptions.
(S1) T_o(n, z) = T_o(n). That is, T_o is a function only of the number of parallel processors and not the size of the computation.
(S2) T_s(kz) = T_s(z). That is, there is no increase in the running time of the serial region as the problem size increases.

We are now ready to establish theoretical bounds on scaled speedup.


Theorem 7.1. Under the scalability assumptions, if we assume that z and n are fixed, then

S_s(k, n, z) = O(k/T_o(kn)) as k → ∞.

Proof. By expanding the denominator of (39) and applying (38) and assumption (S2), we get

S_s(k, n, z) = k · T(n, z)/(T_s(kz) + T_p(kz)/(kn) + T_o(kn)) = k · T(n, z)/(T_s(z) + (kz - T_s(z))/(kn) + T_o(kn)).

Rearranging terms, this yields

S_s(k, n, z) = k · T(n, z)/(T_s(z)(1 - 1/(kn)) + z/n + T_o(kn)).   (40)

Since

T_s(z)(1 - 1/n) ≤ T_s(z)(1 - 1/(kn)) for all k ≥ 1,

we have

S_s(k, n, z) ≤ k · T(n, z)/(T_s(z)(1 - 1/n) + z/n + T_o(kn)).

We now observe that T_o(kn) is the only term in the denominator that depends on k, and since the other terms are nonnegative we conclude

S_s(k, n, z) ≤ k · T(n, z)/T_o(kn).   (41)

The result follows immediately. □

Since logarithmic overhead is the best possible, we have the following corollary.

Corollary 7.2. Under the assumptions of Theorem 7.1 and assuming logarithmic overhead,

S_s(k, n, z) = k · T(n, z)/(T_s(z)(1 - 1/(kn)) + z/n + c log₂ n + c log₂ k).   (42)

Proof. The result follows immediately from Theorem 7.1 and the substitution of T_o(n) = c log₂ n in equation (40) from the proof of Theorem 7.1. □

Corollary 7.2 tells us that, in the limit, if we multiply the size of the problem and the number of processors by k the scaled speedup increases only by a factor proportional to k/ log k. This relationship has been known as "Kuck's Conjecture" [30] and "Amdahl's second law", although it is not due to Amdahl and there is no evidence that he even believed it [8]. A form of this relationship, under the assumption of random numbers of processors available, was proved by Lee [33].

Corollary 7.2 also gives us a way of estimating the point at which the asymptotic bound begins to dominate. For small values of c, scaled speedup will be close to k until k becomes large enough to make c log₂ k comparable to T_s(z) + z/n + c log₂ n. This explains the good scaled speedups for the Sandia experiments, but indicates that similar scaled speedups cannot be achieved indefinitely by further increases of k.
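A small numeric sketch of equation (40) with an assumed logarithmic overhead shows this transition; the parameter values below are chosen only to make the log term visible and are not taken from any of the studies cited.

    import math

    # Assumed logarithmic overhead T_o(n) = c*log2(n) and constant serial time T_s(z) = T_s,
    # i.e. scalability assumptions (S1) and (S2).  All values are illustrative.
    c, T_s, n, z = 0.05, 0.1, 64, 64.0

    def T(procs, work):
        """Running time (37): T_s + T_p(work)/procs + T_o(procs), with T_p(work) = work - T_s."""
        return T_s + (work - T_s) / procs + c * math.log2(procs)

    def scaled_speedup(k):
        """Equation (39): S_s(k, n, z) = k * T(n, z) / T(k*n, k*z)."""
        return k * T(n, z) / T(k * n, k * z)

    for k in (1, 10, 100, 1000, 10000):
        print(k, round(scaled_speedup(k), 1), round(scaled_speedup(k) / k, 3))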

For other overhead functions the scaled speedups will fall off more rapidly, and for linear overhead the situation is especially bad.

Corollary 7.3. Under the assumptions of Theorem 7.1, if T_o(n) = O(n), then S_s(k, n, z) = O(1).

In other words, for a given algorithm there exists a constant upper bound on scaled speedup.

We observe that an interesting result holds for the speedup as originally defined if we increase the problem size keeping the number of processors fixed. The speedup S(k, n, z) obtained by using n processors on a problem of size kz can be defined as

S(k, n, z) = T(kz)/T(n, kz)

where, as before, z = T(z) = T_s(z) + T_p(z).

Suppose we replace scalability assumption (S2) with the assumption

(S2') lim_{z→∞} T_s(z)/z = 0.

Theorem 7.4. Under scalability assumptions (S1) and (S2'), if we assume that z and n are fixed as k increases,

lim_{k→∞} S(k, n, z) = n.

Proof. We have

S(k, n, z) = kz/(T_s(kz) + T_p(kz)/n + T_o(n)) = 1/(T_s(kz)/kz + (1/n)·T_p(kz)/kz + T_o(n)/kz).

By the assumptions, the first and last terms in the denominator tend to zero as k increases, while the second term increases to 1/n. □

This result shows that, under the assumptions, you can get arbitrarily close to a speedup of n by increasing the problem size. Note that the running time for a problem of size kz on n processors is likely to be longer than the running time of a problem of size z on one processor.
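Theorem 7.4 is easy to see numerically: with n fixed and the problem growing, the conventional speedup climbs toward n. The Python sketch assumes T_s(z) = √z, which satisfies (S2'), and a fixed overhead; all of the values are illustrative.

    import math

    n, z, T_o_fixed = 16, 100.0, 0.5          # fixed processors, base size, overhead (assumed)
    T_serial = lambda work: math.sqrt(work)   # assumed T_s(z): T_s(z)/z -> 0, i.e. (S2')

    def speedup(k):
        work = k * z
        t1 = work                                                  # T(1, kz) = kz by (38)
        tn = T_serial(work) + (work - T_serial(work)) / n + T_o_fixed
        return t1 / tn

    for k in (1, 10, 100, 1000, 10000):
        print(k, round(speedup(k), 2))        # approaches n = 16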

Next, consider what happens to efficiency as problem size and number of processors increase together. Following our earlier development,

E(n, z) = T(1, z)/(n T(n, z)).

Hence, the ratio of efficiencies is given by

E(kn, kz)/E(n, z) = (kz · n T(n, z))/(z · kn T(kn, kz)) = (1/k) S_s(k, n, z) = O(1/T_o(kn))   (43)

under the assumptions of Corollary 7.2. In other words, efficiency goes down as the problem size and the number of processors increase proportionally. Fox observes that the efficiency can be increased if the number of computations per processor increases as the number of processors increases [14]. It appears that this increase must be at a rate greater than that of k times the synchronization function.

One way to view increasing the problem size per processor is as a type of granularity tradeoff. What is the effect of increasing the problem size per processor while decreasing the number of processors? This is the subject of the next section.

8. Tradeoffs

8.1. Processors versus region size

The only reason for wishing to use fewer processors is that the overhead function has become so large that it pays to reduce it. This happens when the number of processors exceeds the optimal number for the problem as defined by equation (8). Note, however, that any decrease in the number of processors will increase efficiency.

8.2. Multiple region entries

Typically, a program tailored for parallelism has a number of regions of parallelism with sequential regions interspersed. Suppose we use m as the number of such regions. Then the overhead function T_o becomes a function of m as well as n and z. Usually, m will be a multiplicative factor in T_o, that is

T_o(n, z, m) = m T_o(n, z, 1)

where T_o(n, z, 1) is the overhead function for a single region, and T_o(n, z, m) is equivalent to the overhead function we have been dealing with in earlier sections. If we normalize z to 1, T_o(n, 1, 1) can be assumed to satisfy assumptions (F1)-(F5). The formula for parallel execution time (2) becomes

T(n, 1, m) = T_s + T_p/n + m T_o(n, 1).   (44)

Formula (44) can be used to determine the minimum program granule worth converting to parallelism. Suppose we determine that we can convert a region of sequential running time x to parallel execution on n processors. How large must x be to make it pay off? In the new program, the sequential region is T_s - x and the parallel region is T_p + x. The value of m is increased by one in the new program. If the running time of the new version is to be smaller, then the following inequality, derived by subtracting the two versions of formula (44), must hold:

x(1 - 1/n) - T_o(n, 1) > 0.

In other words,

x > (n/(n - 1)) T_o(n, 1).

Intuitively, this means that the sequential running time of the target region must be slightly larger than the time required to dispatch and synchronize the parallel execution, an unsurprising result. However, it is also worth noting that since T_o(n, 1) increases monotonically with n, the sequential running time of the target region must also increase as the number of processors grows larger. The implications of this property for fine granularity and massively parallel processors need to be considered further.
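The threshold can be wrapped in a one-line Python helper; the linear single-region overhead used in the example is only an assumption, carried over from the critical-region model of Section 5.1.

    def min_region_size(n, T_o_single):
        """Smallest sequential region worth parallelizing on n processors:
        x > n/(n - 1) * T_o(n, 1), from the inequality above."""
        return n / (n - 1) * T_o_single(n)

    # Example with an assumed single-region overhead T_o(n, 1) = t_c*(n - 1).
    t_c = 1e-4
    for n in (2, 8, 32, 128):
        print(n, min_region_size(n, lambda p: t_c * (p - 1)))   # grows with n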


9. Other work

A number of investigators have discovered that there exists an optimum point for various synchronization models. Hockney notes the existence of such a point for the HEP [21], which uses linear critical region scheduling, and Cytron establishes the existence of optimal points for both the linear and logarithmic cases [10]. Cytron also derives the approximate square relation between the optimal points for linear and logarithmic overhead. In fact, it seems clear that these results have somewhat the quality of folk theorems. For example, when one of us was discussing these results with John Cocke of IBM Research, he asserted that he had discovered the formulas for the linear and logarithmic optimal points while listening to a speaker on multiprocessor design [7].

In this paper we have attempted to unify the treatment of a wide range of reasonable strategies for synchronization overhead and to suggest an ideal point at which to operate a parallel processor. In turn this will lead to goals for the size of parallel regions. If we can keep the ratio of the time spent in a parallel region to the time spent in synchronization high, we can achieve excellent performance from parallel processors. Thus, the compilers and programming systems can better determine when parallel computation is profitable.

References

[1] J.R. Allen and K. Kennedy, PFC: a program to convert Fortran to parallel form, Report MASC TR 82-6, Department of Mathematical Sciences, Rice University, Houston, TX, 1982.
[2] G.M. Amdahl, Validity of the single processor approach to achieving large scale computer capabilities, in: Proc. AFIPS Spring Joint Comp. Conf. 30, Atlantic City, NJ (1967) 483-485.
[3] R.E. Benner, J.L. Gustafson and G.R. Montry, Analysis of scientific application programs on a 1024-processor hypercube, Technical Report, Sandia National Laboratories, Albuquerque, NM, 1987.
[4] H.D. Brown, E. Schoen and B.A. Delagi, An experiment in knowledge-based signal understanding using parallel architectures, Knowledge Systems Laboratory Report KSL 86-69, 1986.
[5] B. Buzbee, Parallel processing makes tough demands, Comput. Design 23 (1984) 137-140.
[6] B. Buzbee, Private communication, 1985.
[7] J. Cocke, Private communication, 1985.
[8] L.A. Cohn and H. Sullivan, Uniform bounds on efficient parallel execution time and speedup, Preliminary working version, Sullivan Computer Corporation, 1984.
[9] Documentation from the Cosmic Cube Project, CalTech, Pasadena, CA.
[10] R. Cytron, Useful parallelism in a multiprocessing environment, in: Proc. 1985 Parallel Processing Conf.
[11] J. Dongarra, A. Karp and K. Kennedy, First Gordon Bell Awards winners achieve speedup of 400, IEEE Software (May 1988) 108-112.
[12] H.P. Flatt, A simple model for parallel processing, Computer 17 (11) (1984) 95.
[13] C.L. Forgy, A. Gupta, A. Newell and R. Wedig, Initial assessment of architectures for production systems, in: Proc. National Conf. on Artificial Intelligence (1984) 116-120.
[14] G. Fox, Keynote address, Hypercube Conference, Knoxville, TN, 1985.
[15] D. Gajski, D. Kuck, D. Lawrie and A. Sameh, A large scale multiprocessor, Laboratory for Advanced Supercomputers Cedar Project, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 1983.
[16] J.L. Gustafson, Reevaluating Amdahl's law, Comm. ACM 31 (2) (1988) 532-533.
[17] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. Snir, The NYU Ultracomputer--designing an MIMD shared memory parallel computer, IEEE Trans. Comput. 32 (1983) 175-189.
[18] A. Gottlieb and J.T. Schwartz, Networks and algorithms for very-large-scale parallel computation, Computer 15 (1) (1982) 27-36.
[19] W.D. Hillis, The Connection Machine (MIT Press, Cambridge, MA, 1985).
[20] R.W. Hockney, Performance characterization of the HEP, in: J.S. Kowalik, ed., Parallel MIMD Computation: HEP Supercomputer and Its Applications (MIT Press, Cambridge, MA, 1985).
[21] R.W. Hockney, Characterizing overheads on VM-EPEX and multiple FPS-164 processors, Preprint, 1986.
[22] D.S. Johnson, The NP-completeness column: an ongoing guide, seventh edition, J. Algorithms 4, 189-203.
[23] K. Kennedy, Automatic translation of Fortran programs to vector form, Rice Technical Report 476-029-4, Rice University, 1980.
[24] L. Kleinrock, On flow control in computer networks, in: Proc. ICC 78 (1978).
[25] D.J. Kuck, On the speedup and cost of parallel computation, in: R.S. Anderssen and R.P. Brent, eds., The Complexity of Computational Problem Solving (University of Queensland Press, St. Lucia, Queensland, Australia, 1976).
[26] D.J. Kuck, A survey of parallel machine organization and programming, Comput. Surveys 9 (1) (1977) 25-59.
[27] D.J. Kuck, The Structure of Computers and Computations, Vol. 1 (Wiley, New York, 1978).
[28] D.J. Kuck, R.H. Kuhn, B. Leasure, D.A. Padua and M. Wolfe, Compiler transformation of dependence graphs, in: Conf. Record 8th ACM Symposium on Principles of Programming Languages, Williamsburg, VA (1981).
[29] D.J. Kuck, R.H. Kuhn, B. Leasure and M. Wolfe, The structure of an advanced vectorizer for pipelined processors, in: Proc. IEEE Computer Society 4th International Computer Software and Applications Conf. (1980).
[30] D.J. Kuck, D.H. Lawrie and A.H. Sameh, Measurements of parallelism in ordinary Fortran programs, in: 1973 Sagamore Conference on Parallel Processing.
[31] J.L. Larson, Multitasking on the Cray X-MP-2 multiprocessor, Computer 17 (7) (1984) 62-69.
[32] D. Lawrie, Access and alignment of data in an array processor, IEEE Trans. Comput. 24 (1975) 1145-1155.
[33] R.B.-L. Lee, Performance bounds in parallel processor organizations, in: D.J. Kuck, D. Lawrie and A. Sameh, eds., High Speed Computer and Algorithm Organization (Academic Press, New York, 1977).
[34] N.R. Lincoln, It's really not as much fun building a supercomputer as it is simply inventing one, in: D.J. Kuck, D. Lawrie and A. Sameh, eds., High Speed Computer and Algorithm Organization (Academic Press, New York, 1977).
[35] R.M. Russell, The CRAY-1 computer system, Comm. ACM 21 (1) (1978) 63-72.
[36] J.T. Schwartz, Ultracomputers, ACM Trans. Program. Languages Systems 2 (1980) 484-521.
[37] C.L. Seitz, The Cosmic Cube, Comm. ACM 23 (1) (1985) 22-33.
[38] R.L. Sites, An analysis of the Cray-1 computer, in: Proc. 5th Annual Symposium on Computer Architecture (1978) 101-106.
[39] R.E. Wellck, A simple model for parallel processing on the Cray X-MP, Computer 18 (3) (1985) 113.