

Parallel Computing 32 (2006) 14–23

www.elsevier.com/locate/parco

Communication-optimal parallel parenthesis matching

Chun-Hsi Huang a,*, Xin He b, Min Qian a

a Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
b Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA

Received 19 June 2003; received in revised form 10 December 2004; accepted 5 July 2005
Available online 13 September 2005

Abstract

We provide the first non-trivial lower bound, (p−3)/p · n/p, where p is the number of processors and n is the data size, on the average-case communication volume, r, required to solve the parenthesis matching problem, assuming problem instances are uniformly distributed, and we present a parallel algorithm that takes linear (optimal) computation time and optimal expected message volume, r + p. The kernel of the algorithm is to solve the all nearest smaller values problem. Provided n/p = Ω(p), we present an algorithm that achieves optimal sequential computation time and uses only a constant number of communication phases, with the message volume in each phase bounded above by (n/p + p) in the worst case and p in the average case. Experiments have been performed on two clusters, an SGI Intel Linux Cluster and a Sun cluster of workstations, both showing low communication overhead and good speedups.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Parallel algorithms; Communication complexity; Parenthesis matching

1. Introduction

A sequence of parentheses is balanced if every left (right, respectively) parenthesis has a matching right (left, respectively) parenthesis. Let a balanced sequence of parentheses be represented by (a_1, a_2, ..., a_n), where a_k represents the kth parenthesis. The parenthesis matching problem asks to determine the matching parenthesis of each parenthesis. This is a fundamental problem in combinatorics with wide applications in computational geometry and graph theory.

This paper discusses the expected communication complexity of the parenthesis matching problem. We then provide a parallel algorithm, analyze its communication cost, and conclude its communication optimality by showing that this cost asymptotically matches the lower bound on the communication complexity of the problem itself. The motivation of this work is to extend the use of general-purpose parallel programming models to address the communication complexity of basic problems and to design communication-efficient/optimal parallel algorithms for such problems.

0167-8191/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.parco.2005.07.001

Computational resources and technical support provided by the Center for Computational Research (CCR) at the State University of New York at Buffalo. A preliminary version without experimental results appeared in the 13th International Symposium on Algorithms and Computation (ISAAC), 2002.
* Corresponding author.
E-mail addresses: [email protected] (C.-H. Huang), [email protected] (X. He), [email protected] (M. Qian).

The paper is organized as follows. Section 2 proves a non-trivial lower bound, Θ(n/p) (or (p−3)/p · n/p, more specifically), where p stands for the number of processors and n for the problem size, on the expected-case communication volume, r, required to solve the parenthesis matching problem, assuming the problem instances are uniformly distributed. Note that this bound is used to verify the communication optimality of the algorithm presented thereafter in Section 3. Section 4 analyzes the communication volume of the parenthesis matching algorithm, proves that this volume is in fact r + Θ(p), and therefore concludes communication optimality, since p = o(n/p) for coarse-grained parallel machines. Experimental results of the parenthesis matching algorithm on two different clusters are presented in Section 5. Section 6 concludes the paper with some related problems, followed by acknowledgments.

2. Lower bound on communication cost

This section analyzes the expected communication cost of the parenthesis matching problem. The basic idea is as follows. Initially we partition the input sequence of parentheses evenly and distribute it among the processors. Then each processor sequentially pairs up its local parentheses. The unmatched ones need to seek their matches from other processors, which requires communication. Note that this communication is required regardless of the algorithm used and therefore serves as a theoretical lower bound on the communication cost of any algorithm for parallel parenthesis matching. Below we derive, in the expected case, the maximum number of unmatched parentheses and hence provide a non-trivial lower bound, assuming the input instances are uniformly distributed.

Let S_n be the set of all possible legal parenthesis sequences of length n, where n = 2m. We have |S_n| = C_m = (1/(m+1)) · C(2m, m), the mth Catalan number [4]. In addition, given a sequence (a_1, a_2, ..., a_n) ∈ S_n, we define:

(1) rm(j) (lm(j), respectively) to be the index of the right (left, respectively) matching parenthesis if a_j is a left (right, respectively) parenthesis;

(2) Ms(i) and Mr(i) to be the total numbers of right and left parentheses, respectively, in processor P_i that are not locally matched, when (a_1, a_2, ..., a_n) is evenly partitioned and distributed among the p processors. Note that Ms(i) (Mr(i), respectively) is the size of the message sent (received, respectively) by processor P_i in step 3 of Algorithm Parenthesis Matching.

Moreover, for a randomly selected sequence from S_n, we denote the event that the jth element is a right (left, respectively) parenthesis by a_j = r (a_j = l, respectively). The following lemma can be derived.

Lemma 2.1. Pr[a_j = r] = ( Σ_{k=0}^{⌊j/2⌋−1} C_k C_{m−1−k} ) / C_m.

Proof. We count the number of legal sequences in which a_j is a right parenthesis. The subsequence between a_j and its left match must be a legal sequence and hence consists of an even number of parentheses. Thus, the left match of a_j can only occur at locations j−1, j−3, ..., and j − (2⌊j/2⌋ − 1). In addition, the total number of sequences with a_j = r and the left match of a_j at location j − (2k−1) is C_{k−1} C_{m−k}. (This is because the sequence a_{j−2k+2}, ..., a_{j−1} must be a legal sequence consisting of (k−1) pairs of parentheses, and the sequence a_1, ..., a_{j−2k}, a_{j+1}, ..., a_n must be a legal sequence consisting of (m−k) pairs of parentheses.)

Therefore, the total number of sequences with a_j = r is Σ_{k=0}^{⌊j/2⌋−1} C_k C_{m−1−k}, which completes the proof. □

We also observe that:

(1) Pr[a_j = l] = 1 − Pr[a_j = r],
(2) Pr[a_j = r] = Pr[a_{n−j+1} = l], and
(3) E(Ms(i)) = E(Mr(p − i + 1)).


Observations (2) and (3) can be made by reversing the sequence and switching the roles of right and left parentheses.

Lemma 2.2. Let a_k be a parenthesis in processor P_i. The probability that a_k is a right parenthesis and is not locally matched in P_i is

Pr[a_k = r] − ( Σ_{t=0}^{⌊(k − (n/p)(i−1))/2⌋ − 1} C_t C_{m−1−t} ) / C_m.

Proof. The probability that a_k is a right parenthesis is Pr[a_k = r]. a_k can only be locally matched by a left parenthesis a_{k−2t−1} for t = 0, 1, ..., ⌊(k − (n/p)(i−1))/2⌋ − 1. By the argument in the proof of Lemma 2.1, for each fixed t, the probability that a_{k−2t−1} is the matching left parenthesis of a_k is C_t C_{m−1−t}/C_m. This proves the lemma. □

Lemma 2.3. The expected maximum size of messages (sent and received) by any processor is E(Ms(p)).

Proof. By Lemma 2.2, the expected size of the message sent by processor P_i (1 ≤ i ≤ p) is

E(Ms(i)) = Σ_{k=(n/p)(i−1)+1}^{(n/p)i} T_k(i),

where

T_k(i) = Pr[a_k = r] − ( Σ_{t=0}^{⌊(k − (n/p)(i−1))/2⌋ − 1} C_t C_{m−1−t} ) / C_m

is the probability that a_k is an unmatched right parenthesis within P_i.

For any fixed i_1 < i_2, compare T_k(i_1) and T_k(i_2) at the same relative position k − (n/p)(i−1) within the block: the second terms

Σ_{t=0}^{⌊(k − (n/p)(i−1))/2⌋ − 1} C_t C_{m−1−t} / C_m

are identical, while the first term (Pr[a_k = r]) in T_k(i_1) is less than the first term in T_k(i_2), since by Lemma 2.1 Pr[a_j = r] is non-decreasing in j. Thus E(Ms(p)) = max_{1≤i≤p} {E(Ms(i))}; that is, the maximum expected size of the messages sent by any processor is E(Ms(p)).

Similarly, the maximum expected size of the messages received by any processor is E(Mr(1)) which, according to Observation (3), equals E(Ms(p)). □

Thus, to lower bound the expected maximum message size, we need to lower bound E(Ms(p)). In the following lemmas, we show this bound for a random legal sequence.

Lemma 2.4. C_x C_{m−x−1} / C_m ≥ m^{3/2} / ( 8√π · x^{3/2} (m−x−1)^{3/2} ).

Proof. By Stirling's approximation,

√(2πn) (n/e)^n ≤ n! ≤ √(2πn) (n/e)^n e^{1/(12n)},

we have

C_x = (1/(x+1)) · (2x)! / (x! x!)
    ≥ (1/(x+1)) · √(4πx) (2x/e)^{2x} / [ √(2πx) (x/e)^x e^{1/(12x)} ]^2
    = 2^{2x} e^{−1/(6x)} / ( (x+1) √(πx) ).

Analogously,

C_{m−x−1} ≥ 2^{2(m−x−1)} e^{−1/(6(m−x−1))} / ( (m−x) √(π(m−x−1)) )

and

C_m ≤ (1/(m+1)) · √(4πm) (2m/e)^{2m} e^{1/(24m)} / [ √(2πm) (m/e)^m ]^2
    = (1/(m+1)) · 2^{2m} e^{1/(24m)} / √(πm).

Therefore,

C_x C_{m−x−1} / C_m ≥ (m+1) √(πm) e^{−1/(6x) − 1/(6(m−x−1)) − 1/(24m)} / ( 4 (x+1) √(πx) (m−x) √(π(m−x−1)) )
                    ≥ ( m^{3/2} / ( 4√π · x^{3/2} (m−x−1)^{3/2} ) ) · (1/2)
                    = m^{3/2} / ( 8√π · x^{3/2} (m−x−1)^{3/2} ). □

Lemma 2.5. (p−3)/p · n/p ≤ E(Ms(p)) < n/p.

Proof. By Lemma 2.1, we have

E(Ms(p)) = Σ_{k=(n/p)(p−1)+1}^{n} ( Pr[a_k = r] − ( Σ_{t=0}^{⌊(k − (n/p)(p−1))/2⌋ − 1} C_t C_{m−1−t} ) / C_m )

= ( C_0 C_{m−1} + C_1 C_{m−2} + ... + C_{⌊((n/p)(p−1)+1)/2⌋−1} C_{m−⌊((n/p)(p−1)+1)/2⌋} ) / C_m
+ ( C_1 C_{m−2} + C_2 C_{m−3} + ... + C_{⌊((n/p)(p−1)+2)/2⌋−1} C_{m−⌊((n/p)(p−1)+2)/2⌋} ) / C_m + ...
+ ( C_{⌊n/p⌋−1} C_{m−⌊n/p⌋} + C_{n/p} C_{m−n/p−1} + ... + C_{m−1} C_0 ) / C_m

≥ (n/p) · ( C_{⌊n/p⌋−1} C_{m−⌊n/p⌋} + ... + C_{⌊((n/p)(p−1)+1)/2⌋−1} C_{m−⌊((n/p)(p−1)+1)/2⌋} ) / C_m.

(The last inequality holds because there are n/p lines on the right-hand side and each line contains the common expression in the bracket.)

Based on Lemma 2.4 and the fact that C_m = Σ_{k=0}^{m−1} C_k C_{m−1−k}, we conclude the following:

E(Ms(p)) ≥ (n/p) · ( C_{⌊n/p⌋−1} C_{m−⌊n/p⌋} + ... + C_{⌊((n/p)(p−1)+1)/2⌋−1} C_{m−⌊((n/p)(p−1)+1)/2⌋} ) / C_m
         ≥ (n/p) · (1/m) · ( ⌊((n/p)(p−1)+1)/2⌋ − ⌊n/p⌋ + 1 )
         ≥ (n/p) · (p−3)/p.

(The second inequality results from the fact that the terms on the right-hand side of the first inequality are the larger ⌊((n/p)(p−1)+1)/2⌋ − ⌊n/p⌋ + 1 terms of Σ_{k=0}^{m−1} C_k C_{m−1−k} / C_m = 1.)

In addition, we can easily infer E(Ms(p)) < n/p, since

E(Ms(p)) = Σ_{k=(n/p)(p−1)+1}^{n} ( Pr[a_k = r] − ( Σ_{t=0}^{⌊(k − (n/p)(p−1))/2⌋ − 1} C_t C_{m−1−t} ) / C_m ) < n/p,

as each of the n/p summands is a probability and hence less than 1. This completes the proof. □

3. Parallel parenthesis matching algorithm

The cost model used to describe and analyze the algorithm is a general-purpose coarse-grained parallel programming model. Parameters similar to those used in specific parallel models, such as the BSP and PRAM(n) [9], are used to capture the performance characteristics, including: (1) p, the number of processors; (2) g, the ratio of communication throughput to processor throughput; and (3) L, the time required to barrier-synchronize all or part of the processors. In addition, the model consists of three components: (1) a set of processors, each with a local memory, (2) a communication network, and (3) a mechanism for globally synchronizing the processors. The algorithm proceeds as a series of supersteps, in which a processor may operate only on values stored in local memory. Values sent through the communication network are not guaranteed to arrive until the end of the current superstep. We use the term h-relation to denote a routing task in which each processor has at most h words of data to send to other processors and each processor is also due to receive at most h words of data from other processors. In each superstep, if at most w arithmetic operations are performed by each processor and the data communicated forms an h-relation, the cost of this superstep is w + h·g + L.

We first reduce the parenthesis matching problem to the all nearest smaller values problem [2], defined as follows. Let A = (a_1, a_2, ..., a_n) be an array of elements from a totally ordered domain. For each a_j, 1 ≤ j ≤ n, find the nearest element to the left of a_j and the nearest element to the right of a_j that are smaller than a_j. A typical application of the ANSVP is the merging of two sorted lists [3,7]. Let A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n) be two increasing arrays to be merged. The reduction to the ANSVP is done by constructing an array C = (a_1, a_2, ..., a_n, b_n, b_{n−1}, ..., b_1) and then solving the ANSVP with respect to C. If b_y is the right match of a_x, the location of a_x in the merged list is x + y. The locations of the b_x's can be found similarly.

The parallel parenthesis matching algorithm will employ a fundamental operation, prefix sum, which is defined in terms of a binary associative operator ⊕. The computation takes as input a sequence ⟨b_1, b_2, ..., b_n⟩ and produces as output a sequence ⟨c_1, c_2, ..., c_n⟩ such that c_1 = b_1 and c_k = c_{k−1} ⊕ b_k for k = 2, 3, ..., n. Note that the prefix sum is a basic operation in parallel graph algorithms, parallel computational geometry, and a wide range of numerical calculations [6].

We will describe the parenthesis matching algorithm first. The details of how the prefix sum operation is implemented are described next.

Algorithm Parenthesis Matching:
Input: A balanced parenthesis list, partitioned into p contiguous subsets. Each processor stores one subset.
Output: The mate location of each parenthesis is stored in the processor containing that parenthesis.

1. Each processor assigns "1" to each left parenthesis and "−1" to each right parenthesis.
2. Perform a prefix sum operation on this array and derive the resulting array A.
3. Perform an ANSVP algorithm on A.

Note that the prefix sum operation used in step 2 can be implemented in two supersteps, provided that n/p = Ω(p), as below.

Assume that the initial input is b_1, b_2, ..., b_n and the processors are labeled P_1, P_2, ..., P_p. Processor P_i initially contains b_{(n/p)(i−1)+1}, b_{(n/p)(i−1)+2}, ..., b_{(n/p)i}.

2.1. Each processor computes all prefix sums of b_{(n/p)(i−1)+1}, b_{(n/p)(i−1)+2}, ..., b_{(n/p)i} and stores the results in local memory locations s_{(n/p)(i−1)+1}, s_{(n/p)(i−1)+2}, ..., s_{(n/p)i}.
2.2. Each P_i sends s_{(n/p)i} to processor P_p.
2.3. Processor P_p computes prefix sums of the p values received in step 2.2 and stores these values at t_1, t_2, ..., t_p.
2.4. Processor P_p then sends t_i to P_{i+1}, 1 ≤ i ≤ p − 1.
2.5. Each processor adds the received value to s_{(n/p)(i−1)+1}, s_{(n/p)(i−1)+2}, ..., s_{(n/p)i}.

Steps 2.2 and 2.4 involve communication, so we can easily arrange the prefix sum algorithm in two supersteps, each taking a p-relation in the communication phase.

Since step 2 of Algorithm Parenthesis Matching takes only a p-relation, step 3 dominates the communication cost. Namely, computing the communication cost of the parenthesis matching problem now reduces to computing the communication cost of the ANSVP algorithm [5].

3.1. The ANSVP algorithm

Given a sequence A = (a_1, a_2, ..., a_n) and a p-processor coarse-grained parallel computer with any communication medium (shared memory or any interconnection network), each P_i (1 ≤ i ≤ p) stores (a_{(n/p)(i−1)+1}, a_{(n/p)(i−1)+2}, ..., a_{(n/p)i}). For simplicity, we assume that the elements in the sequence are distinct. We define the nearest smaller value to the right of an element to be its right match. The ANSVP can be solved sequentially in linear time using a straightforward stack approach. To find the right matches of the elements, we scan the input, keep the elements for which no right match has been found on a stack, and report the current element as the right match of those elements on the top of the stack that are larger than the current element. The left matches can be found similarly. For brevity and without loss of generality, we will focus on finding the right matches.

Some definitions are given below. (Here we use i, 1 ≤ i ≤ p, for processor-related indexing and j, 1 ≤ j ≤ n, for array-element-related indexing.)

• For any j:
  – rm(j) (lm(j), respectively) := the index of the right (left, respectively) match of a_j.
  – rmp(j) (lmp(j), respectively) := the index of the processor containing a_{rm(j)} (a_{lm(j)}, respectively).
• For any i:
  – min(i) := the index (in A) of the smallest element in P_i.
  – rm_min(i) (lm_min(i), respectively) := the index of the right (left, respectively) match of a_{min(i)} with respect to the array A_min = (a_{min(1)}, ..., a_{min(p)}).
  – ℘_i := {P_x | rm_min(x) = i}; φ_i := {P_x | lm_min(x) = i}.

For example, consider a 15-element sequence (a_j, 1 ≤ j ≤ 15) = (3,2,4,9,8,15,14,13,7,6,12,1,10,11,5) and three processors P_1, P_2, and P_3. After partitioning, the subsequences (3,2,4,9,8), (15,14,13,7,6), and (12,1,10,11,5) are assigned to P_1, P_2, and P_3, respectively. According to the above definitions, we have the following: rm(2) = 12, rmp(2) = 3, min(1) = 2, min(2) = 6, min(3) = 1, rm_min(1) = 3, and ℘_3 = {P_1, P_2}.

Also based on the above definitions, we observe that P_{rmp(min(i))} = P_{rm_min(i)} (P_{lmp(min(i))} = P_{lm_min(i)}, respectively). Next we prove a lemma used in the algorithm.

Lemma 3.1. On a p-processor coarse-grained computer, for any i, if a_{rm(min(i))} exists and rmp(min(i)) ≠ i + 1, then there exists a unique processor P_{k(i)}, i < k(i) < rmp(min(i)), such that lmp(min(k(i))) = i and rmp(min(k(i))) = rmp(min(i)). (Symmetrically, for any i, if a_{lm(min(i))} exists and lmp(min(i)) ≠ i − 1, then there exists a unique processor P_{k′(i)}, lmp(min(i)) < k′(i) < i, such that lmp(min(k′(i))) = lmp(min(i)) and rmp(min(k′(i))) = i.)

Proof. We show that P_s, where a_{min(s)} = min{a_{min(i+1)}, a_{min(i+2)}, ..., a_{min(rmp(min(i))−1)}}, is the unique processor described in the lemma. For any s′ with i + 1 ≤ s′ < s, rmp(min(s′)) must be ≤ s. Similarly, for any s′ with s + 1 ≤ s′ < rmp(min(i)), lmp(min(s′)) must be ≥ s. This renders P_s the only candidate processor. We can easily infer that a_{min(s)} > a_{min(i)} > a_{min(rmp(min(i)))}. Since a_{min(s)} is the smallest element among those in P_{i+1}, ..., P_{rmp(min(i))−1}, we conclude that P_s is the unique processor P_{k(i)} specified in Lemma 3.1. (The symmetric part can be proved similarly.) □

We next outline our algorithm. To begin with, all processors sequentially find the right matches of their local elements using the stack approach. These matched elements require no interprocessor communication. We therefore focus on the elements that are not yet matched. The general idea is to find the right matches of these not-yet-matched elements by reducing the original ANSVP to 2p smaller "special" ANSVPs and solving them in parallel.

Next we compute the right and left matches of all a_{min(i)}'s. To do this, we first solve the ANSVP with respect to the array A_min = (a_{min(1)}, ..., a_{min(p)}). Then, for each processor P_i, we define four sequences, Seq1_i, Seq2_i, Seq3_i, and Seq4_i, as follows:

• If a_{rm(min(i))} does not exist, then Seq1_i and Seq2_i are undefined.
• If a_{rm(min(i))} exists and rmp(min(i)) = i + 1, then: Seq1_i = (a_{min(i)}, ..., a_{(n/p)i}); Seq2_i = (a_{(n/p)i+1}, ..., a_{rm(min(i))}).
• If a_{rm(min(i))} exists and rmp(min(i)) > i + 1, let P_{k(i)} be the unique processor specified in Lemma 3.1. Then: Seq1_i = (a_{min(i)}, ..., a_{lm(min(k(i)))}); Seq2_i = (a_{rm(min(k(i)))}, ..., a_{rm(min(i))}).
• If a_{lm(min(i))} does not exist, then Seq3_i and Seq4_i are undefined.
• If a_{lm(min(i))} exists and lmp(min(i)) = i − 1, then: Seq3_i = (a_{(n/p)(i−1)+1}, ..., a_{min(i)}); Seq4_i = (a_{lm(min(i))}, ..., a_{(n/p)(i−1)}).
• If a_{lm(min(i))} exists and lmp(min(i)) < i − 1, let P_{k′(i)} be the unique processor specified in Lemma 3.1. Then: Seq3_i = (a_{rm(min(k′(i)))}, ..., a_{min(i)}); Seq4_i = (a_{lm(min(i))}, ..., a_{lm(min(k′(i)))}).

Using the previous 15-element sequence, since k(1) = 2, we have, for example, Seq1_1 = (2,4) and Seq2_1 = (1).

Note that Seq1_i and Seq3_i, if they exist, always reside on P_i; Seq2_i, if it exists, always resides on P_{rmp(min(i))}; and Seq4_i, if it exists, always resides on P_{lmp(min(i))}. The following observation [2] specifies how to find the right matches of all unmatched elements: the right matches of all not-yet-matched elements in Seq1_i lie in Seq2_i, and the right matches of all not-yet-matched elements in Seq4_i, except its first element, lie in Seq3_i.

Each processor P_i is therefore responsible for identifying the right matches of the not-yet-matched elements in Seq1_i and Seq4_i. Again, we apply the sequential algorithm at each processor P_i to the two concatenated sequences Seq1_i‖Seq2_i and Seq4_i‖Seq3_i. All elements will be right-matched after the above-mentioned 2p special ANSVPs are solved in parallel [2].

The algorithm below finds the right matches and is therefore denoted Algorithm ANSVPr.

Algorithm ANSVPr:
Input: A, partitioned into p subsets of contiguous elements. Each processor stores one subset.
Output: The right match of each a_j is computed and stored in the processor containing a_j.

1. Each P_i sequentially solves the ANSVPr problem with respect to its local subset.
2. (a) Each P_i computes its local minimum a_{min(i)}.
   (b) All a_{min(i)}'s are globally communicated. (Hence each P_i has the array A_min.)
3. Each P_i solves the ANSVPr and ANSVPl problems with respect to A_min and identifies the sets ℘_i and φ_i.
4. Each P_i computes a_{rm(min(x))} for every P_x ∈ ℘_i and a_{lm(min(y))} for every P_y ∈ φ_i.
5. Each P_i determines Seq1_i, Seq3_i and receives Seq2_i, Seq4_i as follows:
   (a) Each P_i computes the unique k(i) and k′(i) (as in Lemma 3.1), if they exist, and determines Seq1_i and Seq3_i.
   (b) Each P_i determines Seq2_x for every P_x ∈ ℘_i and Seq4_y for every P_y ∈ φ_i (as in Lemma 4.1).
   (c) Each P_i sends Seq2_x, for every P_x ∈ ℘_i, to P_x, and Seq4_y, for every P_y ∈ φ_i, to P_y.
6. (a) Each P_i finds the right matches of the unmatched elements in Seq1_i and Seq4_i.
   (b) Each P_i collects the matched Seq4_y's from all P_y ∈ φ_i.


4. Cost analysis

The following technical lemma, Lemma 4.1, provides the foundation for the analysis of the communication cost.

Lemma 4.1

1. Suppose that ℘_i = {P_{x_1}, P_{x_2}, ..., P_{x_t}}, where x_1 < x_2 < ... < x_t. Then:

   Seq2_{x_1} = (a_{rm(min(x_2))}, ..., a_{rm(min(x_1))});
   Seq2_{x_2} = (a_{rm(min(x_3))}, ..., a_{rm(min(x_2))}); ...;
   Seq2_{x_t} = (a_{(n/p)(i−1)+1}, ..., a_{rm(min(x_t))}).

2. Suppose that φ_i = {P_{y_1}, P_{y_2}, ..., P_{y_s}}, where y_1 < y_2 < ... < y_s. Then:

   Seq4_{y_1} = (a_{lm(min(y_1))}, ..., a_{lm(min(y_2))});
   Seq4_{y_2} = (a_{lm(min(y_2))}, ..., a_{lm(min(y_3))}); ...;
   Seq4_{y_s} = (a_{lm(min(y_s))}, ..., a_{(n/p)i}).

Proof. We only prove Statement 1; the proof of Statement 2 is similar. First observe that, for any P_x, P_y ∈ ℘_i, x < y implies a_{min(x)} < a_{min(y)} and k(x) ≤ y. Based on these observations, we have k(x_l) = x_{l+1} for 1 ≤ l < t and x_t = i − 1. The lemma follows from the definition of Seq2. □

Here we assume that the sequential computation time for the ANSVPr problem on input of size n is T_s(n), and that the sequential time for finding the minimum of n elements is T_h(n). The cost breakdown of Algorithm ANSVPr can then be derived as in Table 1.

Since n/p = Ω(p) and T_s(n) = T_h(n) = O(n), the computation time in each step is obviously linear in the local data size, namely O(n/p). Steps 2(b), 5(c), and 6(b) involve communication; thus the algorithm takes three supersteps. Based on Lemma 4.1 and the fact that |φ_i| + |℘_i| ≤ p, the communication steps 5(c) and 6(b) can each be implemented by an (n/p + p)-relation. Therefore we conclude that the ANSVP with input size n can be solved on a p-processor coarse-grained machine in three supersteps using linear local computation time and at most an (n/p + p)-relation in each communication phase, provided p ≤ n/p.

We showed in Section 2 that E(Ms(p)) ≥ (p−3)/p · n/p. Since the algorithm for parenthesis matching employs the prefix sum operation, which introduces an additional p-relation, and the ANSVP algorithm, which introduces at most a p-relation (refer to Lemma 4.1) in addition to the messages required to be sent/received, we conclude that the average-case message volume of our parenthesis matching algorithm is bounded above by E(Ms(p)) + Θ(p), which is asymptotically communication-optimal, provided that n/p = Ω(p).

Table 1
Cost breakdown of Algorithm ANSVPr

Step  Computation                                                   Communication                                                Synchronizing
1     T_s(n/p)
2(a)  T_h(n/p)
2(b)                                                                p·g                                                          L
3     2·T_s(p)
4     max_i { T_h(n/p + |φ_i|) + T_h(n/p + |℘_i|) }
5(a)  max_i { T_h(rm_min(i) − lm_min(i)) }
5(b)  O(n/p)
5(c)                                                                max_i { (Σ_{P_x∈℘_i} |Seq2_x| + Σ_{P_y∈φ_i} |Seq4_y|)·g }    L
6(a)  max_i { T_s(|Seq1_i| + |Seq2_i|) + T_s(|Seq4_i| + |Seq3_i|) }
6(b)                                                                max_i { (Σ_{P_x∈φ_i} |Seq4_x|)·g }                           L

Fig. 1. SGI Intel Linux Cluster running times (x-axis, in seconds).


5. Experimental results

The experiments were carried out on two cluster systems provided by the Center for Computational Research at the State University of New York at Buffalo: (1) an SGI Intel Linux Cluster with 75 dual-processor nodes, each node equipped with two 1-GHz Pentium III processors, 1 GB of RAM, and 30 GB of disk space, with a Myrinet 2000 (2 Gb) switch and RedHat 6.2 (kernel 2.4) as the operating system; and (2) a Sun Cluster of 16 750-MHz Sun Blade 1000 workstations with 1 GB of RAM per node and 48 360-MHz Sun Ultra 5 workstations with 2 GB of RAM per node, configured as a supercomputer, with a Myrinet 1000 (1.28 Gb) switch and Solaris 8 as the operating system. Both systems use the Portable Batch System (PBS) for batch job submission.

Our program is written in C, using the MPI (Message Passing Interface) library for interprocessor communication. A portable implementation of MPI, MPICH, is installed on both the Sun Cluster and the SGI Intel Cluster. We use MPI_Bcast for broadcasting. Collective communication routines, such as MPI_Allreduce and MPI_Allgather, are also used for global communication. MPI_Scatter and MPI_Gather are used to distribute input data and collect results, and MPI_Send and MPI_Recv are used for two-sided message passing.

The experiments were done on relatively small input sizes (less than 0.5 million) to demonstrate the communication efficiency. Each data point presented was obtained as the average of 5 test runs, each on a different randomly generated balanced sequence of parentheses, as shown in Figs. 1 and 2. The speedups are almost linear on both clusters when up to 16 processors are used on sequences of size about 0.5 M (0.5 million).

The experimental results agree well with the predicted expected communication cost and its impact on the overall performance. Note that the ratio of computation time to communication time is rather low, since the computation time is Θ(n/p) with a very small hidden constant (estimated to be less than 7), and it gradually decreases as the number of processors increases. Note also that the expected message size is (p−3)/p · n/p; therefore the communication cost increases when the number of processors increases, while the actual computation time decreases.

Barrier synchronization also contributes to the communication overhead.1 This is partly because of the highcost of barrier synchronization itself, and partly because of the extremely small hidden constant of the linear-time sequential stack algorithm.

This demonstrates that the parenthesis matching problem is communication-heavy by nature. However, by carefully designing a parallel algorithm whose communication cost is optimal, or as close as possible to the lower bound on the communication requirement of the problem itself, we still enjoy reasonable speedups. We showed the experimental results on relatively small input sizes; when the input sizes are increased, the speedups are expected to improve significantly.

1 An independent experiment on an SGI 3200 shows that, as the number of processors increases to 32 with an input size of 0.1 million, almost 40% of the parallel running time is contributed by barrier synchronization. When 16 processors are used on 0.1 million data items, barrier synchronization uses about 25% of the total execution time. However, when only 4 processors are used, this ratio reduces significantly, to about 4%.

Fig. 2. Sun Cluster running times (x-axis, in seconds).

6. Conclusions and future work

We provide the first non-trivial lower bound, (p−3)/p · n/p, where p is the number of processors and n is the data size, on the average-case communication volume, r, required to solve the parenthesis matching problem, and we present a parallel algorithm that takes linear (optimal) computation time and optimal expected message volume, r + p. The kernel of the algorithm is to solve the all nearest smaller values problem. Provided n/p = Ω(p), we present an algorithm that achieves optimal sequential computation time and uses only a constant number of communication phases, with the message volume in each phase bounded above by (n/p + p) in the worst case and p in the average case.

The parenthesis matching problem is one of several applications that reduce to the ANSVP. Typical examples are the binary tree reconstruction problem [1] and the monotone polygon triangulation problem [6,8], which happen to have the same number of problem instances, a Catalan number, as the parenthesis matching problem. Whether the ANSVP algorithm also yields an expected-case communication-optimal algorithm for these problems remains to be investigated.

Acknowledgements

The authors appreciate the technical support provided by the Center for Computational Research at the State University of New York at Buffalo. We would also like to thank the anonymous referees for many helpful comments that greatly improved the readability of this paper.

References

[1] S.G. Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997.
[2] O. Berkman, B. Schieber, U. Vishkin, Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values, J. Algorithms 14 (1993) 344–370.
[3] A. Borodin, J. Hopcroft, Routing, merging and sorting on parallel models of computation, J. Comput. Syst. Sci. 30 (1985) 130–145.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, McGraw-Hill, 2000.
[5] X. He, C.-H. Huang, Communication efficient BSP algorithm for all nearest smaller values problem, J. Parallel Distrib. Comput. 61 (2001) 1425–1438.
[6] J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
[7] C. Kruskal, Searching, merging and sorting in parallel computation, IEEE Trans. Comput. C-32 (1983) 942–946.
[8] J. Reif, Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993.
[9] L.G. Valiant, A bridging model for parallel computation, Commun. ACM 33 (8) (1990) 103–111.