
Parallel Computing 18 (1992) 177-184 North-Holland


Improved universal k-selection in hypercubes

Hong Shen

Department of Computer Science, Åbo Akademi University, Lemminkäisenkatu 14, SF-20520 Turku, Finland

Received 15 August 1991

Abstract

Shen, H., Improved universal k-selection in hypercubes, Parallel Computing 18 (1992) 177-184.

This paper presents an improved algorithm for universal k-selection in hypercubes. The algorithm has a worst-case time complexity of $O((n/p) \log p \log(kp/n))$ for selecting the k smallest numbers from n given numbers in a hypercube of p processors ($p \le n$). This result shows a maximum speedup of $O(\log k)$ over the known result for the same problem in the case $kp = O(n)$.

Keywords. Parallel algorithm; universal k-selection; hypercube; processors; time complexity.

1. Introduction

The problem of finding the k smallest (or largest) numbers from n given numbers ($k < n$) is called the k-selection problem. It arises often in statistics, image processing and distributed computing.

Parallel k-selection in hypercubes has been extensively studied, and a variety of algorithms for it have been proposed in the literature [2,3,6,7]. For selecting the k smallest numbers from n given numbers in a hypercube of p processors, we say an algorithm is universal if it is valid for both $p = n$ and $p < n$. The known result for universal k-selection in hypercubes needs time $O((n/p) \log k \log p)$ [6].

In this paper, we present an improved algorithm for universal k-selection in hypercubes. Our algorithm is a variant of the algorithm in [6] and has a worst-case time complexity of $O((n/p) \log p \log(kp/n))$.

2. Preliminaries

A hypercube consists of $2^r$ ($r \ge 0$) processors addressed from 0 to $2^r - 1$ by r-bit binary codes ($d_{r-1} d_{r-2} \cdots d_0$) and connected by a set of interconnection functions $cube_0, cube_1, \ldots, cube_{r-1}$, where $cube_i$ connects the processor at address $d = d_{r-1} d_{r-2} \cdots d_i \cdots d_0$ to the processor at address $d_{r-1} d_{r-2} \cdots \bar{d}_i \cdots d_0$ via a bidirectional link for all $0 \le d \le 2^r - 1$ and $0 \le i \le r - 1$, forming $2^{r-1}$ pairs of processors.

Transferring a fixed-size data packet between two neighbouring processors in a hypercube is assumed to complete within a constant (communication) time. To solve a problem of size n (input data) in a hypercube of p processors, each processor in the hypercube should have a local memory of size at least $O(n/p)$. All data are uniformly distributed among the processors in the hypercube.

0167-8191/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

A sequence consisting of two monotonic (sorted) sequences, one non-decreasing and the other non-increasing, is called a bitonic sequence. For convenience, we write 'm-sequence' for a sequence containing m numbers. We also define a sorted sequence to be a sequence sorted in non-decreasing order.

The following theorem was shown by Batcher [1]:

Batcher's Theorem. If $a_0, a_1, \ldots, a_{m-1}$ is a bitonic sequence, then sequences MIN and MAX are both bitonic and no number in MIN is greater than any number in MAX, where

MIN: $\min(a_0, a_{m/2}), \min(a_1, a_{m/2+1}), \ldots, \min(a_{m/2-1}, a_{m-1})$;

MAX: $\max(a_0, a_{m/2}), \max(a_1, a_{m/2+1}), \ldots, \max(a_{m/2-1}, a_{m-1})$.

We call the above operation to form sequences MIN and MAX Batcher's Compare-Exchange operation, denoted the BCE operation. Parallel merging, sorting and selection can be realized simply by carrying out a series of BCE operations [1,4,8]. They are called bitonic merging, bitonic sorting and bitonic selection respectively.
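As an illustration of the BCE operation and of bitonic merging built from it, the following is a minimal sequential sketch. The function names `bce` and `bitonic_merge` are chosen here for illustration only and do not come from the paper; the input length is assumed to be a power of two.

```python
def bce(seq):
    """One Batcher Compare-Exchange step on a bitonic sequence of even length.

    Element i is compared with element i + m/2; the smaller values form MIN
    and the larger values form MAX.
    """
    half = len(seq) // 2
    mins = [min(seq[i], seq[i + half]) for i in range(half)]
    maxs = [max(seq[i], seq[i + half]) for i in range(half)]
    return mins, maxs


def bitonic_merge(seq):
    """Sort a bitonic sequence (length a power of two) by repeated BCE steps."""
    if len(seq) <= 1:
        return list(seq)
    mins, maxs = bce(seq)
    # Both halves are bitonic again, and every element of MIN is <= every
    # element of MAX, so the halves can be merged independently.
    return bitonic_merge(mins) + bitonic_merge(maxs)


if __name__ == "__main__":
    bitonic = [1, 4, 6, 8, 7, 5, 3, 2]   # non-decreasing, then non-increasing
    print(bce(bitonic))
    print(bitonic_merge(bitonic))
```

For selection only the MIN half needs to be kept after a BCE step, which is what distinguishes bitonic selection from bitonic sorting.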

Like merging and sorting [8], selection can be realized by carrying out a sequence of parallel comparisons, each of which carries out data comparisons over all pairs of processors whose addresses differ only in one bit of the same index, called the pivot [4]. Thus the sequence of parallel comparisons can be represented by a sequence of pivots, called the pivot sequence. The following theorem gives the pivot sequence for universal k-selection:

Theorem 1 [6]. For the problem of selecting k smallest numbers from n numbers in p processors, the pivot sequence for parallel comparisons is

1. $B_0, B_1, \ldots, B_{t-(l-r)-1},\ t-(l-r),\ B_{t-(l-r)-1},\ t-(l-r)+1,\ B_{t-(l-r)-1},\ \ldots,\ r-2,\ B_{t-(l-r)-1},\ r-1$, if $l - r < t$;
2. $0, 1, \ldots, r-1$, if $l - r \ge t$,

where $k = 2^t$, $n = 2^l$, $p = 2^r$, $t < l$, $r \le l$, and $B_i = i, i-1, \ldots, 0$ is the pivot sequence for merging of a bitonic sequence of length $2^{i+1}$ ($0 \le i \le t-(l-r)-1$).
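To make the structure of the pivot sequence in Theorem 1 concrete, the following sketch generates it from the exponents t, l and r. The function names `merge_pivots` and `pivot_sequence` are hypothetical helpers introduced here, not part of [6].

```python
def merge_pivots(i):
    """B_i = i, i-1, ..., 0: pivots for merging a bitonic sequence of length 2^(i+1)."""
    return list(range(i, -1, -1))


def pivot_sequence(t, l, r):
    """Pivot sequence of Theorem 1 for k = 2^t, n = 2^l, p = 2^r."""
    g = t - (l - r)
    if g <= 0:
        # Case 2 (l - r >= t): each processor holds at least k numbers.
        return list(range(r))
    seq = []
    # Case 1 (l - r < t): B_0, B_1, ..., B_{g-1} come first.
    for i in range(g):
        seq += merge_pivots(i)
    # Then each pivot g, ..., r-2 is followed by B_{g-1}; pivot r-1 ends the sequence.
    for i in range(g, r):
        seq.append(i)
        if i < r - 1:
            seq += merge_pivots(g - 1)
    return seq


if __name__ == "__main__":
    # k = 8, n = 32, p = 8, i.e. t = 3, l = 5, r = 3
    print(pivot_sequence(3, 5, 3))
```

For k = 8, n = 32 and p = 8 this yields the pivots 0, 1, 0, 2: one merging pivot, a comparison at pivot 1 followed by a re-merge at pivot 0, and a final comparison at pivot 2.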

Data comparison over a pair of processors is carried out in the following way: the processor with the larger address ($P_j$) sends its data to the other processor ($P_i$), which carries out the comparison and sends the larger number back to $P_j$ if the direction is ascending, or the smaller one otherwise. The direction setting for a comparison depends on the address of the processor that performs the comparison and the corresponding pivot's position in the pivot sequence. For the pivot sequence in Theorem 1, the direction setting of data comparisons is determined as follows:

1. For comparisons at pivots in $B_i$ ($0 \le i \le t-(l-r)-1$), the direction is ascending if $d_{r-1} \oplus d_{r-2} \oplus \cdots \oplus d_{i+1} = 0$ and descending otherwise, where $\oplus$ is the 'exclusive-OR' operator.
2. For comparisons at pivots $t-(l-r), t-(l-r)+1, \ldots, r-1$, the direction is always ascending.

Matching pivot i in the pivot sequence for k-selection with interconnection function $cube_i$ in a hypercube of p processors $P_0, P_1, \ldots, P_{p-1}$ ($0 \le i < r$, $p = 2^r$), we can directly obtain an algorithm for universal k-selection in the hypercube, where the parallel comparison at pivot i is carried out in all pairs of processors connected via $cube_i$. In the known algorithm, each processor takes a sorted data block of size $n/p$ as a unit to carry out the parallel comparisons according to the pivot sequence in Theorem 1, and a data comparison over a pair of processors is a procedure of merging their two data blocks. The sketch of the algorithm is below [6]:

1. All processors simultaneously do sequential selection and sorting, each selecting the k smallest numbers from its own data block of size $n/p$, if $n/p > k$, and sorting them.
2. All processors repeatedly pair-wise merge successive pairs of data blocks to form a series of sorted k-sequences, where all successive pairs pair-wise form a series of bitonic 2k-sequences.
3. All processors repeatedly perform BCE operations and merge the resulting bitonic k-sequences (MINs) into sorted k-sequences until there is only one MIN sequence containing the k smallest numbers left.

The above algorithm has a worst-case time complexity of $O((n/p) \log k \log p)$ [6].
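The direction rule for comparisons at the pivots in $B_i$, stated earlier in this section, can be sketched as follows. The function names are hypothetical, and the parameter g stands for $t-(l-r)$. The second function fills all direction bits of one processor from the top index down, using the identity $dir_d[i] = dir_d[i+1] \oplus d_{i+1}$, which follows directly from the definition.

```python
def direction(d, i, r):
    """Direction bit for pivots in B_i at address d: XOR of bits d_{r-1}..d_{i+1}.

    0 means ascending, 1 means descending.
    """
    x = 0
    for bit in range(i + 1, r):
        x ^= (d >> bit) & 1
    return x


def directions(d, g, r):
    """All direction bits dir_d[0..g-1] of processor d, with g = t - (l - r)."""
    if g <= 0:
        return []
    dirs = [0] * g
    dirs[g - 1] = direction(d, g - 1, r)
    for i in range(g - 2, -1, -1):
        # dir_d[i] = dir_d[i+1] XOR d_{i+1}
        dirs[i] = dirs[i + 1] ^ ((d >> (i + 1)) & 1)
    return dirs


if __name__ == "__main__":
    # Processor 6 = 110 in a hypercube of 8 processors (r = 3), with g = 2.
    print(directions(6, 2, 3))
```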

3. An improved algorithm

In the above algorithm each processor always keeps its data block sorted, which is not necessary for the selection problem itself. Removing this requirement reduces the time complexity of the algorithm in [6] and leads to the improved algorithm for universal k-selection in hypercubes introduced below.

The main idea of our new algorithm is that instead of a sorted data block, each processor takes an unsorted data block to carry out the parallel comparisons according to the pivot sequence in Theorem 1, and instead of a procedure of merging two data blocks, a data comparison over a pair of processors is a procedure of splitting two data blocks. The splitting procedure splits the two data blocks into two sets of equal size, LOW and HIGH, where no number in LOW is greater than any number in HIGH. It works by first selecting the median of the data over the two data blocks and then comparing all data with the median to form LOW and HIGH. To describe the algorithm, we first need to introduce the following definition and theorem:

Definition. Let $BK_0, BK_1, \ldots, BK_{m-1}$ be data blocks of equal size over a set of numbers. A sequence consisting of $BK_0, BK_1, \ldots, BK_{m-1}$ is said to be block-sorted if no number in block $BK_i$ is greater than any number in block $BK_{i+1}$, where $0 \le i < m - 1$. A block-bitonic sequence consists of two block-monotonic sequences, one ascending and the other descending.

As an extension of Batcher's Theorem, we have the following theorem:

Theorem 2. Let $BK_0, BK_1, \ldots, BK_{m-1}$ be a block-bitonic sequence and

MIN: $\min(BK_0, BK_{m/2}), \min(BK_1, BK_{m/2+1}), \ldots, \min(BK_{m/2-1}, BK_{m-1})$,

MAX: $\max(BK_0, BK_{m/2}), \max(BK_1, BK_{m/2+1}), \ldots, \max(BK_{m/2-1}, BK_{m-1})$.

Sequences MIN and MAX are both block-bitonic and no number in MIN is greater than any number in MAX, where $\min(BK_i, BK_j)$ and $\max(BK_i, BK_j)$ are the smaller half and larger half of the numbers in $BK_i$ and $BK_j$, $0 \le i, j \le m - 1$, respectively.

Proof. By the meaning of $\min(BK_i, BK_j)$ and $\max(BK_i, BK_j)$, $0 \le i, j \le m - 1$, matching $BK_i$ in the theorem with $a_i$ in Batcher's Theorem directly shows the correctness of the theorem. []
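The splitting procedure that realizes $\min(BK_i, BK_j)$ and $\max(BK_i, BK_j)$ can be sketched as below. This is only an illustration under stated assumptions: the function name `split_by_median` is hypothetical, and the median is found here by sorting for brevity, whereas the complexity analysis assumes a linear-time sequential selection algorithm [5].

```python
def split_by_median(a, b):
    """Split two equal-size unsorted blocks into (LOW, HIGH) of equal size.

    No number in LOW is greater than any number in HIGH.
    """
    union = a + b
    half = len(union) // 2
    # Stand-in for linear-time median selection (assumption of this sketch).
    median = sorted(union)[half - 1]
    low = [x for x in union if x < median]
    high = [x for x in union if x > median]
    ties = [x for x in union if x == median]
    # Values equal to the median fill LOW up to exactly half the data.
    need = half - len(low)
    low += ties[:need]
    high += ties[need:]
    return low, high


if __name__ == "__main__":
    low, high = split_by_median([5, 4, 7, 6], [4, 2, 1, 5])
    print(low, high)
```

Neither output block is sorted; only the relation between the two blocks is guaranteed, which is all that the block-sorted and block-bitonic properties require.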

Fig. 1. Block-bitonic merging: a block-bitonic sequence of four blocks becomes a block-sorted sequence after block-BCE operations at pivots 1 and 0.

We call the above operation to form MIN and MAX the block-BCE operation. As in bitonic merging, carrying out a series of block-BCE operations on a block-bitonic sequence results in a block-sorted sequence. We call this procedure block-bitonic merging. Figure 1 shows an example of block-bitonic merging for a block-bitonic sequence of four blocks, each containing four numbers.
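A minimal sequential sketch of the block-BCE operation and block-bitonic merging is given below. The function names are hypothetical, the number of blocks is assumed to be a power of two, and sorting is used as a stand-in for median-based splitting. The sample data are made up for illustration and are not the values of Figure 1.

```python
def split(a, b):
    """Smaller half and larger half of the numbers in two equal-size blocks."""
    merged = sorted(a + b)   # stand-in for median selection plus partition
    half = len(a)
    return merged[:half], merged[half:]


def block_bce(blocks):
    """Block-BCE: block i is split against block i + m/2, giving MIN and MAX."""
    half = len(blocks) // 2
    mins, maxs = [], []
    for i in range(half):
        low, high = split(blocks[i], blocks[i + half])
        mins.append(low)
        maxs.append(high)
    return mins, maxs


def block_bitonic_merge(blocks):
    """Turn a block-bitonic sequence into a block-sorted one by repeated block-BCE."""
    if len(blocks) <= 1:
        return [list(b) for b in blocks]
    mins, maxs = block_bce(blocks)
    return block_bitonic_merge(mins) + block_bitonic_merge(maxs)


if __name__ == "__main__":
    # Two ascending blocks followed by two descending blocks (block-bitonic).
    seq = [[1, 10], [11, 20], [18, 19], [8, 9]]
    print(block_bitonic_merge(seq))
```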

The sketch of our algorithm is described as follows:

1. All processors simultaneously do sequential selection, each selecting the k smallest numbers from its own data block of size $n/p$, if $n/p > k$.
2. All processors repeatedly pair-wise do block-merging on successive pairs of data blocks to form a series of block-sorted k-sequences, where all successive pairs pair-wise form a series of block-bitonic 2k-sequences.
3. All processors repeatedly perform block-BCE operations and make the resulting block-bitonic k-sequences (MINs) into block-sorted k-sequences until there is only one MIN sequence containing the k smallest numbers left.

For the problem of selecting k smallest numbers from n numbers in a hypercube of p processors, w.l.o.g., we assume that $k = 2^t$, $n = 2^l$, $p = 2^r$, where $t < l$, $r \le l$. For the case $2^{t-1} < k < 2^t$, the problem can be solved by first selecting the $2^t$ smallest numbers, then sorting the selected $2^t$ numbers, and finally cutting off the first k smallest numbers. For the case $2^{l-1} < n < 2^l$, simply adding $2^l - n$ dummy numbers '$\infty$' to the data set, so as to get exactly $2^l$ numbers, will meet the assumption. Let $d_{r-1} d_{r-2} \cdots d_0$ be the binary code of d, $0 \le d \le p - 1$, and let processor $P_d$ use a binary array $dir_d[0..t-(l-r)-1]$ to keep the information of direction setting, where $dir_d[i]$ represents the direction for comparisons at the pivots in $B_i$ during block-bitonic merging ($0 \le i \le t-(l-r)-1$). Our algorithm for universal k-selection in hypercubes is then described as follows:

Algorithm UKSELECT (k, n, p)
{* Select k smallest numbers from n given numbers in a hypercube of p processors, $k = 2^t$, $n = 2^l$, $p = 2^r$, $t < l$, $r \le l$. *}

1. if $l - r < t$ then
     for i := 0 to $t-(l-r)-1$ do
       for d := 0 to $p - 1$ do in parallel
         $dir_d[i] := d_{r-1} \oplus d_{r-2} \oplus \cdots \oplus d_{i+1}$;
   {* $P_d$ computes its directions for all data comparisons needed in the k-selection, $0 \le d \le p - 1$. *}

2. if $l - r > t$ then
     for d := 0 to $p - 1$ do in parallel
       $P_d$ deletes the $n/p - k$ largest numbers from its data block $D_d$;
   {* $P_d$ individually selects the k smallest numbers from its data block, $0 \le d \le p - 1$. *}

3. if $l - r < t$ then
     for i := 0 to $t-(l-r)-1$ do
       for j := 0 to i do
         for d := 0 to $p - 1$ do in parallel
           3.1. if $d_{i-j} = 1$ then $P_d$ sends $D_d$ to the neighbour along link $cube_{i-j}$;
           3.2. if $d_{i-j} = 0$ then $P_d$ selects the median of its data, splits its data into two sets LOW and HIGH according to the median, and sends set HIGH if $dir_d[i] = 0$ or LOW otherwise to the neighbour along link $cube_{i-j}$;
   {* Perform block-bitonic merging to form block-sorted k-sequences. *}

4. for i := $\max\{t-(l-r), 0\}$ to $r - 2$ do
     for d := 0 to $p - 1$ do in parallel
       4.1. if $d_i = 1$ then $P_d$ sends $D_d$ to the neighbour along link $cube_i$;
       4.2. if $d_i = 0$ then $P_d$ selects the median of its data, splits its data into LOW and HIGH according to the median, and sends HIGH and an inactivation marker to the neighbour along link $cube_i$;
       {* Perform the block-BCE operation to form MINs and MAXs. A processor becomes inactive when it receives the marker. *}
       4.3. if $l - r < t$ then
              for j := 0 to $t-(l-r)-1$ do
                4.3.1. if $d_{t-(l-r)-j-1} = 1$ then $P_d$ sends $D_d$ to the neighbour along link $cube_{t-(l-r)-j-1}$;
                4.3.2. if $d_{t-(l-r)-j-1} = 0$ then $P_d$ selects the median of its data, splits its data into LOW and HIGH according to the median, and sends HIGH if $dir_d[t-(l-r)-1] = 0$ or LOW otherwise to the neighbour along link $cube_{t-(l-r)-j-1}$;
              {* Perform block-bitonic merging to merge the MINs into block-sorted k-sequences. *}

5. for d := 0 to $p - 1$ do in parallel
     5.1. if $d_{r-1} = 1$ then $P_d$ sends $D_d$ to the neighbour along link $cube_{r-1}$;
     5.2. if $d_{r-1} = 0$ then $P_d$ selects the median of its data, splits its data into LOW and HIGH according to the median, and sends HIGH and an inactivation marker to the neighbour along link $cube_{r-1}$.
   {* Perform the last block-BCE operation to obtain the final MIN sequence including the k smallest numbers. *}

Based on Theorems 1 and 2, the correctness of UKSELECT follows directly from the comments interspersed with the algorithm.
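The control flow of UKSELECT can be checked with a small sequential simulation. The sketch below is not the parallel implementation: processors are entries of a dictionary, message passing is replaced by direct access, sorting stands in for median-based splitting, and all function names are hypothetical. It assumes k, n and p are powers of two with $t < l$ and $r \le l$.

```python
import random


def split(a, b):
    """(LOW, HIGH): smaller and larger halves of two equal-size blocks."""
    merged = sorted(a + b)   # stand-in for median selection plus partition
    half = len(a)
    return merged[:half], merged[half:]


def xor_bits(d, lo, r):
    """XOR of bits lo..r-1 of address d, i.e. the direction bit dir_d[lo-1]."""
    x = 0
    for bit in range(lo, r):
        x ^= (d >> bit) & 1
    return x


def ukselect(data, k, p):
    """Sequentially simulate UKSELECT; returns the k smallest numbers, unsorted."""
    n = len(data)
    t, l, r = k.bit_length() - 1, n.bit_length() - 1, p.bit_length() - 1
    assert (1 << t, 1 << l, 1 << r) == (k, n, p) and t < l and r <= l
    size = n // p
    blocks = {d: list(data[d * size:(d + 1) * size]) for d in range(p)}
    g = t - (l - r)   # log2 of the number of processors per k-sequence

    def merge_step(pivot, lo):
        # Block split along cube_pivot; direction is the XOR of bits lo..r-1.
        for d in list(blocks):
            if (d >> pivot) & 1 == 0:
                partner = d | (1 << pivot)
                low, high = split(blocks[d], blocks[partner])
                if xor_bits(d, lo, r) == 0:
                    blocks[d], blocks[partner] = low, high
                else:
                    blocks[d], blocks[partner] = high, low

    def bce_step(pivot):
        # Block-BCE along cube_pivot: keep LOW, the partner becomes inactive.
        for d in list(blocks):
            if (d >> pivot) & 1 == 0:
                partner = d | (1 << pivot)
                blocks[d], _ = split(blocks[d], blocks[partner])
                del blocks[partner]

    if g < 0:                               # step 2
        for d in blocks:
            blocks[d] = sorted(blocks[d])[:k]
    for i in range(max(g, 0)):              # step 3
        for j in range(i + 1):
            merge_step(i - j, i + 1)
    for i in range(max(g, 0), r - 1):       # step 4
        bce_step(i)
        for j in range(max(g, 0)):
            merge_step(g - 1 - j, g)
    if r >= 1:                              # step 5
        bce_step(r - 1)
    return [x for d in sorted(blocks) for x in blocks[d]]


if __name__ == "__main__":
    random.seed(1)
    data = random.sample(range(100), 32)
    print(sorted(ukselect(data, 8, 8)), sorted(data)[:8])
```

The result is spread over the processors that remain active and is block-bitonic rather than sorted, so the usage line sorts it only for comparison with the expected answer.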

Figure 2 shows an example of implementation of UKSELECT when k = 8, n = 32 and p = 8.


4. Time complexity of the algorithm

The time complexity of UKSELECT can be obtained by the following analysis:

1. If $n/p \ge k$, i.e. $l - r \ge t$, steps 1, 3 and 4.3 in the algorithm do not act. So the algorithm has a simple time complexity:

$$O\left(\frac{n}{p}\right) + (\log p - 1)\,O(k) + O(k) \le O\left(\frac{n}{p} \log p\right),$$

where the first term is for step 2 by employing a best sequential selection algorithm [5], the second for step 4 and the last for step 5.

Fig. 2. An example of UKSELECT for k = 8, n = 32 and p = 8 (contents of processors $P_0$ to $P_7$ initially and after steps 1, 3, 4.1 and 4.2, 4.3 and 5).

2. If $n/p < k$, i.e. $l - r < t$, the time complexity can be analysed as follows: Step 1 takes time at most $O(\log p)$. Note that $dir_d[t-(l-r)-1] = d_{r-1} \oplus d_{r-2} \oplus \cdots \oplus d_{t-(l-r)}$ and $dir_d[i] = dir_d[i+1] \oplus d_{i+1}$ for $0 \le i < t-(l-r)-1$, where operation '$\oplus$' can be realized by conventional integer division, multiplication and addition. Step 2 does not act. Step 3 takes time

$$\sum_{i=0}^{t-(l-r)-1} \sum_{j=0}^{i} O\left(\frac{n}{p}\right) = O\left(\frac{n}{p} \log^2 \frac{kp}{n}\right).$$

Step 4 takes time

$$\sum_{i=t-(l-r)}^{r-2} \left( O\left(\frac{n}{p}\right) + \sum_{j=0}^{t-(l-r)-1} O\left(\frac{n}{p}\right) \right) = O\left(\frac{n}{p} \log\frac{n}{k} \log\frac{kp}{n}\right).$$

Step 5 takes time O(n/p). So the total time of the algorithm in the worst case is

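$$O(\log p) + O\left(\frac{n}{p} \log^2 \frac{kp}{n}\right) + O\left(\frac{n}{p} \log\frac{n}{k} \log\frac{kp}{n}\right) + O\left(\frac{n}{p}\right) = O\left(\frac{n}{p} \log p \log\frac{kp}{n}\right).$$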
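The last simplification rests on the identity $\log(kp/n) + \log(n/k) = \log p$. As a short check, write $a = \log(kp/n)$ and $b = \log(n/k)$ (the symbols a and b are introduced here only for this derivation). The costs of steps 3 and 4 then combine as follows:

```latex
\begin{aligned}
\frac{n}{p}\,a^{2} + \frac{n}{p}\,a\,b &= \frac{n}{p}\,a\,(a + b), \\
a + b &= \log\frac{kp}{n} + \log\frac{n}{k} = \log p, \\
\frac{n}{p}\,a\,(a + b) &= \frac{n}{p}\,\log p\,\log\frac{kp}{n}.
\end{aligned}
```

In this case $n/p < k$, so $a \ge 1$ and $\log p \ge 1$, and the remaining terms $O(\log p)$ and $O(n/p)$ are absorbed by the product.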

We therefore have the following theorem:

Theorem 3. Selecting k smallest numbers from n given numbers in a hypercube of p processors can be completed in time $O((n/p) \log p \log(kp/n))$, where $k < n$ and $p \le n$.

Theorem 3 shows that Algorithm UKSELECT has a maximum speedup of $O(\log k)$ in time complexity over the known result of $O((n/p) \log k \log p)$ [6] for universal k-selection in hypercubes in the case $kp = O(n)$.
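The speedup claim can be read off from the ratio of the two bounds. The following short derivation assumes $kp = cn$ for a constant $c > 1$, where c is a symbol introduced here for illustration:

```latex
\frac{(n/p)\,\log k\,\log p}{(n/p)\,\log p\,\log(kp/n)}
  = \frac{\log k}{\log(kp/n)},
\qquad
kp = cn \;\Rightarrow\; \frac{\log k}{\log(kp/n)} = \frac{\log k}{\log c} = O(\log k).
```

As $kp/n$ grows, the ratio shrinks, so $O(\log k)$ is the largest speedup, reached when $kp/n$ is bounded by a constant.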

5. Concluding remarks

Selecting k smallest numbers from n given numbers is an interesting problem arising in many applications. The known universal algorithm for solving this problem in a hypercube of p processors, $p \le n$, has a worst-case time complexity of $O((n/p) \log k \log p)$ [6]. In this paper we have presented a new universal algorithm that reduces this time complexity to $O((n/p) \log p \log(kp/n))$ for the same problem.

The proposed algorithm eliminates the need, present in the previous algorithm [6], to keep the data block loaded in each processor sorted, and therefore achieves a better time complexity. The algorithm shows a maximum speedup of $O(\log k)$ in time complexity over the known result for universal k-selection in hypercubes in the case $kp = O(n)$.

References

[1] K.E. Batcher, Sorting networks and their applications, Proc. AFIPS 1968 Spring Joint Computer Conf. (1968) 307-314.

[2] S. Chandran and A. Rosenfeld, Order statistics on a hypercube, Information Processing Letters 27(3) (1988) 129-132.

[3] K.L. Chen and H. Shen, Bitonic selection algorithm on SIMD machines, Proc. 2nd Internat. Conf. on Computers and Applications (1987) 176-182.

[4] K.L. Chen and H. Shen, A bitonic selection algorithm on multiprocessor system, J. Comput. Sci. Tech. 4(4) (1989) 315-322.


[5] D.E. Knuth, The Art of Computer Programming Vol. 3: Sorting and Searching (Addison-Wesley, Reading, MA, 1973).

[6] H. Shen, A universal algorithm for parallel k-selection in hypercubes, Rep. Comput. Sci. & Math., Åbo Akademi Univ., A(126) (1991), to appear in Proc. Parallel Computing 91 (Elsevier Science).

[7] J.P. Sheu and J.S. Tang, Efficient parallel k selection algorithm, Information Processing Letters 35(6) (1990) 313-316.

[8] H.S. Stone, Parallel processing with the perfect shuffle, IEEE Trans. Comput. C-20(2) (1971) 153-161.
