


Acta Informatica 1, 311-319 (1972). © by Springer-Verlag 1972

Bounds on Algorithms for String Generation

A. C. McKellar and C. K. Wong

Received February 18, 1972

Summary. The well-known lower bound of log₂(n!) on the number of comparisons required to sort n items is extended to cover algorithms, such as replacement selection, which produce a sorted string whose length is a random variable. The case of algorithms which produce several strings is also discussed, and these results are then applied to obtain an upper bound on the length of strings produced by a class of string generation algorithms.

Introduction

There is a well-known lower bound, viz. log₂(n!), on the expected number of comparisons required to achieve a complete sort of n items. On the other hand, various algorithms such as replacement selection [3] produce sorted strings whose length is a random variable. In this paper, we obtain an extension of this bound to cover algorithms which produce several sorted strings both for finite input sequences and infinite input sequences. These results are then applied to obtain an upper bound on the expected string length for natural selection [2].

The notation and terminology for trees will be consistent with Knuth [4] except that our definition of level differs by 1. E( ) will denote the expected value of the argument.

Bounds on Arbitrary String Generating Algorithms

The following well-known result [5] is stated since the result is very important to the subsequent development.

Lemma 1. Let T be a binary tree such that each internal node has exactly two successors. Let p₁, p₂, ..., pₙ be non-negative weights associated with the terminal nodes of T such that Σⁿᵢ₌₁ pᵢ = 1. Then the total path length of the terminal nodes satisfies the inequality

$$\sum_{i=1}^{n} p_i l_i \ge -\sum_{i=1}^{n} p_i \log_2 p_i,$$

where lᵢ is the length of the path from the root to the ith terminal node and pᵢ log₂ pᵢ = 0 when pᵢ = 0.

It is convenient to talk of lᵢ as the level of the ith terminal node because of the obvious pictorial interpretation. Thus, the level of the root is 0. A string is a set of items in sorted order.
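As an illustration of Lemma 1 (an editorial addition, not part of the original paper), the following Python sketch computes both sides of the inequality for one small extended binary tree; the tree shape and the weights are arbitrary choices made only for the example.

```python
import math

# A terminal node is a weight; an internal node is a pair (left, right).
# Example tree with five terminal nodes; the weights sum to 1.
tree = ((0.4, 0.1), (0.2, (0.05, 0.25)))

def leaf_levels(node, level=0):
    """Yield (weight, level) for every terminal node of the tree."""
    if isinstance(node, tuple):
        yield from leaf_levels(node[0], level + 1)
        yield from leaf_levels(node[1], level + 1)
    else:
        yield node, level

leaves = list(leaf_levels(tree))
path_length = sum(p * l for p, l in leaves)                    # sum of p_i * l_i
entropy = -sum(p * math.log2(p) for p, _ in leaves if p > 0)   # -sum of p_i * log2 p_i

print(path_length, entropy)      # 2.3 >= 2.04..., as Lemma 1 requires
assert path_length >= entropy
```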


Lemma 2. Consider any sorting algorithm which employs only compares to produce exactly one string of length x (x being a random variable) from some permutation on {1, 2, ..., M}. Let c be the number of compares used by the algorithm. If each permutation is equally likely, then

$$E(c) \ge E(\log_2(x!)).$$

Proof. Any such algorithm can be represented as a tree in the following way. The path from the root to any node will represent a sequence of comparisons together with the outcomes of those comparisons. Each node will be labeled with exactly those permutations which would have yielded the specified sequence of outcomes on the sequence of comparisons determined by the path from the root to that node. After a particular comparison has been made, the algorithm may designate some sequence of items as a sorted string and then stop, in which case the corresponding node, v, of the tree will be a leaf. If, on the other hand, the algorithm makes an additional comparison, it will have one of two possible outcomes, less than or greater than. The root of the left subtree of v will be labeled with exactly those permutations in v for which the outcome of the comparison would be less than and the root of the right subtree of v will be labeled with exactly those permutations in v which would yield an outcome of greater than.

Thus, for example, the root is labeled with all M! permutations and if the first comparison is between the ith and jth items, then the root of the left subtree will be labeled with the M!/2 permutations for which item i is less than item j. Clearly, each node has either 0 or 2 immediate successors (some nodes may be empty) and if v has two immediate successors, then they constitute a partition on v. Further, the set of leaves of the tree are a partition on the root.

Suppose that for some leaf, v, the algorithm produces a string of length x. Then for every permutation in v, those x items must have the same ranks relative to one another. Therefore, there are at most M!/x! permutations in such a leaf.

Let j be the running index over the set of leaves which produce a string of a given length and let n(x, j) be the total number of permutations in the jth leaf which has string length x. Then

$$\sum_x \sum_j n(x, j) = M!$$

and

$$n(x, j) \le \frac{M!}{x!}$$

for all j. Since each of the M! permutations is equally likely, n(x, j)/M! is the probability of arriving at the jth terminal node with string length x and Σⱼ n(x, j)/M! is the probability of getting a string of length x.

Let l(x, j) be the level of the jth leaf with string length x. Then

$$E(c) = \sum_x \sum_j \frac{n(x, j)}{M!}\, l(x, j).$$


Applying Lemma 1 yields

$$E(c) \ge -\sum_x \sum_j \frac{n(x, j)}{M!} \log_2 \frac{n(x, j)}{M!}.$$

Using the facts noted above and elementary algebra, we obtain

$$\begin{aligned}
E(c) &\ge \log_2 M! - \frac{1}{M!} \sum_x \sum_j n(x, j) \log_2 n(x, j) \\
     &\ge \log_2 M! - \frac{1}{M!} \sum_x \sum_j n(x, j) \log_2 \frac{M!}{x!} \\
     &= \frac{1}{M!} \sum_x \sum_j n(x, j) \log_2 x! \\
     &= \sum_x \log_2(x!) \sum_j \frac{n(x, j)}{M!} \\
     &= E(\log_2(x!))
\end{aligned}$$

and the lemma is proved.

Implicit in the above proof was the restriction that the algorithm was deterministic, i.e., that the comparison made at any point was uniquely determined by the sequence of outcomes of the preceding comparisons. This restriction is easily removed since a non-deterministic algorithm can be represented by a family of trees, one for each possible set of choices made by the algorithm. Since the above bound applies to each tree, it applies to an average taken over the family of trees.

One might hope to strengthen Lemma 2 by showing that the expected number of compares given that the string had a particular length, x, was greater than or equal to log₂(x!). The following example shows this to be impossible.

Let M = 3 and consider the algorithm which compares the first item to the second and then compares the second item to the third. If the outcome of both compares is less than or the outcome of both is greater than, the algorithm produces a string of length 3. Otherwise, it produces a string of length 2. When it produces a string of length 3, it takes exactly 2 comparisons but log₂ 3! ≈ 2.59.
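The example can be verified by brute force; the short Python check below (an editorial addition) enumerates all six permutations, applies the two fixed comparisons just described, and compares the expected cost with the bound of Lemma 2.

```python
import math
from itertools import permutations

lengths = []
for p in permutations((1, 2, 3)):
    # The algorithm always makes exactly two comparisons.
    same_outcome = (p[0] < p[1]) == (p[1] < p[2])
    lengths.append(3 if same_outcome else 2)   # string length produced

expected_compares = 2.0
expected_bound = sum(math.log2(math.factorial(x)) for x in lengths) / len(lengths)
print(expected_compares, expected_bound)   # 2.0 >= 1.528..., so Lemma 2 holds
print(math.log2(math.factorial(3)))        # 2.585: exceeds the 2 compares used when x = 3
```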

The above technique can now be extended to algorithms which produce several strings.

Theorem 1. Consider any sorting algorithm which employs only compares to produce exactly k strings of length x₁, x₂, ..., xₖ respectively (x₁, x₂, ..., xₖ being random variables) from some permutation on {1, 2, ..., M}. Let c be the number of compares used by the algorithm. If each permutation is equally likely, then

$$E(c) \ge \sum_{i=1}^{k} E(\log_2(x_i!)).$$

Proof. Any such algorithm can be represented by a binary tree as in the proof of Lemma 2. If the algorithm produces strings of length x₁, x₂, ..., xₖ, then the corresponding leaf of the tree has at most M!/(x₁! x₂! ⋯ xₖ!) permutations. Grouping all leaves according to (x₁, x₂, ..., xₖ) with j as the running index and


n(x₁, ..., xₖ; j) as the total number of permutations in the jth node, we have

$$\sum_{(x_1, \ldots, x_k)} \sum_j n(x_1, \ldots, x_k; j) = M!$$

and

$$n(x_1, \ldots, x_k; j) \le \frac{M!}{x_1! \cdots x_k!}.$$

Also, n(x₁, ..., xₖ; j)/M! is the probability of arriving at the jth terminal node with lengths x₁, ..., xₖ and Σⱼ n(x₁, ..., xₖ; j)/M! is the probability of getting k strings of lengths x₁, ..., xₖ, respectively. By the same reasoning as in the proof of Lemma 2, one has

$$\begin{aligned}
E(c) &\ge \sum_{(x_1, \ldots, x_k)} \sum_j \frac{n(x_1, \ldots, x_k; j)}{M!} \log_2 \frac{M!}{n(x_1, \ldots, x_k; j)} \\
     &\ge \sum_{(x_1, \ldots, x_k)} \log_2 (x_1! \cdots x_k!) \sum_j \frac{n(x_1, \ldots, x_k; j)}{M!} \\
     &= \sum_{i=1}^{k} E(\log_2(x_i!)).
\end{aligned}$$

Corollary 1.

$$E(c) \ge \sum_{i=1}^{k} \log_2 (E(x_i)!).$$

Proof. Since log₂(x!) is a convex function, Jensen's inequality yields

$$E(\log_2(x!)) \ge \log_2 (E(x)!).$$

Note that the factorial symbol should be interpreted as the gamma function when non-integer arguments are involved.
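For non-integer arguments the quantity log₂(E(x)!) is thus log₂ Γ(E(x) + 1); a short Python evaluation (an editorial illustration, not part of the paper) using the standard log-gamma function is shown below.

```python
import math

def log2_factorial(x):
    """log2(x!) interpreted as log2(Gamma(x + 1)), valid for non-integer x >= 0."""
    return math.lgamma(x + 1.0) / math.log(2.0)

print(log2_factorial(3.7))                   # log2(3.7!) via the gamma function
print(log2_factorial(4.0), math.log2(24))    # agrees with log2(4!) at integer arguments
```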

An extension of Theorem 1 to infinite input sequences is now given. Let B be the set of all infinite sequences of numbers (b₁, b₂, ...) such that for any given finite set of indices i₁, i₂, ..., iₘ, all sets of relative ranks of b_{i₁}, ..., b_{iₘ} are equally likely.

Theorem 2. Consider any sorting algorithm, A, which employs only compares to produce strings of length x₁, x₂, ... (x₁, x₂, ... are random variables with finite expectations x̄₁, x̄₂, ...) from some element of B. Let c(n) be the number of comparisons used by the algorithm to produce the first n strings. If the element of B is chosen at random, then

$$E(c(n)) \ge \sum_{i=1}^{n} \log_2(\bar{x}_i!).$$

Proof. Let Fᵢ be the distribution function for xᵢ. Since each xᵢ has finite mean by hypothesis, it follows that given ε > 0, there exists δᵢ > 0 such that for all Mᵢ satisfying

$$\Pr\{x_i > M_i\} < \delta_i \tag{1}$$


it is true that

$$\int_{M_i}^{\infty} x_i \, dF_i < \frac{\varepsilon}{n}. \tag{2}$$

Algorithm A*, which is a slightly modified version of the given algorithm A, is defined as follows. If, for a particular element b of B, xᵢ ≤ Mᵢ, 1 ≤ i ≤ n, then algorithms A and A* are identical. If, on the other hand, xᵢ > Mᵢ for some i ≤ n, then A* converts the first n components of b into strings of length 1 and then initiates A on the remainder of b. Let x₁*, x₂*, ... be the lengths of strings produced by A*. Then, clearly

$$x_i^* = \begin{cases} x_i & \text{if } x_1 \le M_1, x_2 \le M_2, \ldots, x_n \le M_n, \quad 1 \le i \le n, \\ 1 & \text{otherwise.} \end{cases}$$

Therefore

$$\bar{x}_i - \bar{x}_i^* = \int_D (x_i - 1)\, dF_i \le \sum_{j=1}^{n} \int_{M_j}^{\infty} x_j\, dF_j < \varepsilon \tag{3}$$

where D is the set of points at which xᵢ and xᵢ* differ and the final inequality follows from (2).

Let c*(n) be the number of comparisons required by A* to generate the first n strings. Since A* never uses more comparisons than A in generating the first n strings and in fact uses fewer whenever A produces an exceptionally long string, we have the inequality

$$E(c(n)) \ge E(c^*(n)). \tag{4}$$

We can now apply Theorem 1 to A* with M = Σⁿᵢ₌₁ Mᵢ to obtain

$$E(c^*(n)) \ge \sum_{i=1}^{n} E(\log_2(x_i^*!)). \tag{5}$$

Applying Jensen's inequality and (3) yields

$$\sum_{i=1}^{n} E(\log_2(x_i^*!)) \ge \sum_{i=1}^{n} \log_2(\bar{x}_i^*!) \tag{6}$$

$$\ge \sum_{i=1}^{n} \log_2((\bar{x}_i - \varepsilon)!). \tag{7}$$

Combining inequalities (4), (5), (6) and (7) and observing that since they are true for all positive ε, they are also true for ε = 0, completes the proof.

Corollary 2. If E(log₂(xᵢ!)) < ∞, 1 ≤ i ≤ n, then

$$E(c(n)) \ge \sum_{i=1}^{n} E(\log_2(x_i!)).$$

Proof. If E(log₂(xᵢ!)) < ∞, then one can bound E(log₂(xᵢ!)) − E(log₂(xᵢ*!)) in the same manner that x̄ᵢ − x̄ᵢ* was bounded by (1) and (2). Otherwise the proof of Theorem 2 goes through unchanged.


Note that any algorithm for which x₁, x₂, ... each have finite variance satisfies the hypothesis of Corollary 2.

Bound on String Length for Natural Selection

We now apply these results to obtain an upper bound on the string length for an algorithm known as natural selection [2].

Natural selection is characterized by two parameters, G and R, where G is the number of records which can be accommodated in main store at any point in time and RG is the size (in records) of the reject store. Natural selection begins by filling main store with the first G records from the input file. The smallest of these G is selected to become the first record of the first sorted string and is replaced in main store by the next record from the input file. Provided that the replacement record is larger than the record it replaced (i.e., the record which has just joined the output string), it will eventually become part of the current output string and one continues by again selecting the smallest item in main store to become the next record of the sorted string, replacing it as before with the next record from the input file. Eventually, a replacement record will be encountered which is smaller than the last item of the current output string, in which case it is clear that this replacement record can never become part of the current string. In this case, the replacement record is written into the reject store and the record is said to have been rejected. When a record is rejected, another replacement record is read from the input file and compared with the last record on the current output string to determine whether it should in turn be rejected or whether one should again select the smallest record in store and append it to the output string. Eventually, RG records will have been rejected. At that point, the G − 1 records which remain in main store are appended in order of increasing size to the current output string, thus completing that string. To generate the next sorted string, one proceeds exactly as before except that instead of reading records from the input file, one reads the records which were rejected in generating the previous string in the order in which they were rejected. If R < 1, then all of the rejected records will fit in main store and will all become part of the next string. On the other hand, if R > 1, there is a possibility that some of the rejected records will be rejected for a second time. In general, if for some integer i, R ≤ i, then a record will be rejected at most i times. When the reject list is exhausted, one returns to reading from the input file. The entire process terminates in the obvious way when the input file is exhausted.
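For concreteness, the following Python sketch (an editorial reconstruction of the description above, not the authors' implementation) generates one sorted string by natural selection; the function name `natural_selection_string` and the use of a heap for the main store are choices made only for this illustration.

```python
import heapq
import random
from itertools import islice

def natural_selection_string(source, G, R):
    """Produce one sorted string by natural selection from an iterator of records.

    G is the main store capacity; R*G (assumed integral here) is the reject store
    capacity. Returns the string and the rejected records in the order rejected,
    which would feed the generation of the next string.
    """
    reject_cap = int(R * G)
    store = list(islice(source, G))        # fill main store with the first G records
    heapq.heapify(store)                   # smallest record is always accessible
    string, rejects = [], []
    while store and len(rejects) < reject_cap:
        string.append(heapq.heappop(store))     # smallest record joins the output string
        for rec in source:                      # read replacement records
            if rec >= string[-1]:
                heapq.heappush(store, rec)      # can still join the current string
                break
            rejects.append(rec)                 # too small: goes to the reject store
            if len(rejects) == reject_cap:
                break
        else:
            break                               # input exhausted
    string.extend(sorted(store))           # append the records left in main store
    return string, rejects

# A rough check of the relative length of the first string for R = 1:
rng = random.Random(0)
data = iter([rng.random() for _ in range(200000)])
s, rej = natural_selection_string(data, G=1000, R=1)
print(len(s) / 1000)   # observed x_1 (simulation values of this kind appear as curve C1 of Fig. 1)
```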

Natural selection differs from replacement selection [3] only in the treatment of rejected records--replacement selection leaves them resident in main store whereas natural selection writes them onto a secondary store. However, this apparently modest change complicates the analysis considerably. Thus it is of interest to produce bounds on the performance of natural selection.

Let xₖG be the length of the kth string produced by natural selection (we may assume that the input file is infinite so there will almost surely be enough records to generate k strings). Writing the string length in this form is a convenient normalization. No assumption of linearity is involved since xₖ can depend on G. We count the number of comparisons required to produce this string as follows.


In producing the kth sorted string, xₖG + RG records pass through main store. The first G records are sorted completely, requiring a number of comparisons which we denote by S(G) (for many sorting algorithms, S(G) is a random variable). RG records are rejected, each of which undergoes one comparison. The remaining xₖG − G records each undergo one comparison to determine that the record is not rejected and some number, which we denote by T(G), of comparisons to determine its rank relative to the G − 1 sorted records resident in main store. T(G) will, in general, be a random variable.
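Collecting these counts (an editorial restatement, written schematically; the bounds on S(G) and T(G) come from Lemma 3 below), the comparisons cₖ charged to the kth string satisfy

$$c_k = S(G) + RG + (x_k G - G)\bigl(1 + T(G)\bigr),$$

and taking expectations with E(S(G)) ≤ log₂(G!) + 0.0861 G and E(T(G)) ≤ log₂ G + 0.0861 gives the per-string term summed in the proof of Theorem 3.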

Lemma 3. Given G − 1 items in sorted order, let E(In(G)) be the expected number of comparisons required to insert a new item using binary insertion to create a sorted set of size G. If the probability that the new item assumes a given rank relative to the G − 1 sorted items is a monotone increasing (or decreasing) function of rank, then

$$\log_2 G \le E(\mathrm{In}(G)) < 0.0861 + \log_2 G. \tag{8}$$

If G is a power of 2, equality can be made to hold in (8) regardless of the probability distribution of the rank of the new item.

Proof. An insertion algorithm can be represented as a tree as follows. The root of the tree is labeled with the item against which the new item is compared first. The roots of the left and right subtrees are labeled with the items against which the new item is compared next, depending upon the outcome of the first comparison. This tree has G − 1 internal nodes and G leaves, and the leaves are in 1-1 correspondence with the possible ranks of the new item. When G is a power of 2, there is an algorithm which has a balanced tree and the last assertion of the lemma follows trivially.

When G is not a power of 2, there are a variety of trees which are almost balanced in the sense that they have as many leaves as possible at level ⌊log₂ G⌋ and the remainder at level ⌈log₂ G⌉. Such trees are called complete [4]. If the new item is equally likely to assume each of the G possible ranks, then regardless of the complete tree chosen, the lemma follows immediately on substituting n = G and I = G − 1 in Lemma 2 of [1]. If the probability distribution is a monotone increasing function of rank, then one uses the unique complete tree which places as many of the largest ranks as possible at level ⌊log₂ G⌋ and puts the remaining ranks at level ⌈log₂ G⌉. Clearly, the expected number of comparisons required is bounded above by the expected number required for the equally likely case and so the lemma follows. When the probability distribution is monotone decreasing, one uses the complete tree which is skewed in the opposite direction.

It is perhaps surprising that there is no insertion scheme which will make (8) valid for arbitrary distributions. To see this, consider the case G = 5 and suppose the possible ranks for the item to be inserted have probabilities ε/3, (1 − ε)/2, ε/3, (1 − ε)/2, and ε/3 in that order. Then the expected number of comparisons required for the insertion is easily shown to be 2.5 − ε/6 whereas 0.0861 + log₂ 5 ≈ 2.41.
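This counterexample can be checked exhaustively; the sketch below (an editorial addition) enumerates every ordered binary tree with five leaves, reads off the leaf depths in rank order, and minimizes the expected number of comparisons for the stated distribution.

```python
import math
from fractions import Fraction

def trees(n):
    """All ordered binary trees with n leaves; a leaf is None, an internal node a pair."""
    if n == 1:
        yield None
        return
    for k in range(1, n):
        for left in trees(k):
            for right in trees(n - k):
                yield (left, right)

def depths(tree, d=0):
    """Leaf depths in left-to-right (i.e. rank) order."""
    if tree is None:
        return [d]
    return depths(tree[0], d + 1) + depths(tree[1], d + 1)

eps = Fraction(1, 100)
p = [eps / 3, (1 - eps) / 2, eps / 3, (1 - eps) / 2, eps / 3]   # rank probabilities

best = min(sum(pi * di for pi, di in zip(p, depths(t))) for t in trees(5))
print(float(best))                          # 2.498333... = 2.5 - eps/6
print(float(Fraction(5, 2) - eps / 6))      # the value claimed in the text
print(0.0861 + math.log2(5))                # 2.408...: the upper bound of (8) is violated
```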

As a corollary of Lemma 3, one obtains the fact that sorting by insertion requires, on the average, fewer than 0.0861 n + log₂(n!) comparisons to sort n items provided the hypothesis of Lemma 3 can be satisfied at each stage of the


process. In particular, for natural selection, items which are rejected in forming the kth string are distributed over a range which is a monotone increasing function of time. While some of the earliest items rejected may also have been rejected in the past, this only increases the tendency for the first items in the reject store to be smaller than the later ones. Thus the items resident in main store during formation of the (k+1)st string will tend to be smaller than one would otherwise expect. Therefore, assuming that a replacement item is not rejected, its probability distribution will be a monotone increasing function of rank and the hypotheses of Lemma 3 are satisfied.

For arbitrary distributions, binary insertion guarantees that

$$E(\mathrm{In}(G)) < 1 + \log_2 G.$$

Hence, sorting n items by insertion can always be accomplished in at most n + log₂(n!) compares.
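For reference (an editorial sketch, not the authors' procedure), binary insertion of an item into a sorted list of k items uses at most ⌈log₂(k + 1)⌉ comparisons, which is what Python's `bisect` module implements; repeating it gives the insertion sort referred to above.

```python
from bisect import insort   # binary search followed by insertion

def insertion_sort(items):
    """Sort by repeated binary insertion; at most n + log2(n!) comparisons in total."""
    result = []
    for x in items:
        insort(result, x)    # at most ceil(log2(len(result) + 1)) comparisons
    return result

print(insertion_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]
```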

Theorem 3. Let x₁G, x₂G, ... be the lengths of the strings produced by natural selection. If input sequences are randomly chosen elements of B, each xᵢ has finite expectation x̄ᵢ and x̄ᵢ → x̄ as i → ∞, then for any fixed R and G sufficiently large,

i) x̄ is bounded,

ii) $R - \log_2 2e \ge \bar{x}\,\log_2(\bar{x}/2e) - 0.0861\,\bar{x}$.

Proof. For fixed R and G, the expected number of compares to produce the first n strings satisfies the inequality

$$E(c(n)) \le \sum_{k=1}^{n} \bigl[\log_2(G!) + 0.0861\,G + (\bar{x}_k G - G + RG) + (\bar{x}_k G - G)(\log_2 G + 0.0861)\bigr].$$

On the other hand, by Theorem 2

$$E(c(n)) \ge \sum_{k=1}^{n} \log_2((\bar{x}_k G)!).$$

Combining these inequalities, dividing both sides by n, noting that

$$\frac{1}{n} \sum_{k=1}^{n} \log_2((\bar{x}_k G)!) \ge \log_2\Bigl(\Bigl(\frac{1}{n} \sum_{k=1}^{n} \bar{x}_k G\Bigr)!\Bigr)$$

since log₂(x!) is a convex function, and that $\frac{1}{n}\sum_{k=1}^{n} \bar{x}_k \to \bar{x}$ as n → ∞, we have

$$\log_2(G!) + \bar{x} G - G + RG + \bar{x} G \log_2 G - G \log_2 G + 0.0861\,\bar{x} G \ge \log_2((\bar{x} G)!).$$

Applying Stirling's formula, neglecting terms which are o (G), dividing by G and rearranging terms yields

$$R - \log_2 2e \ge \bar{x}\,\log_2\frac{\bar{x}}{2e} - 0.0861\,\bar{x} - \frac{3\log_2(\bar{x} G)}{2G}.$$


[Figure 1 near here.]

Fig. 1. The bound on expected relative string length x̄ as a function of reject store size R. C1 = actual value of x̄ (obtained by simulation). C2 = the bound

Clearly x̄ is bounded as G → ∞. Hence the last term tends to 0. Thus, for G sufficiently large, the theorem follows.

Fig. 1 is a plot of this bound. Also shown in Fig. 1 is the actual relative string length as a function of R obtained by simulation.
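The bound curve can be regenerated numerically (an editorial sketch; the values come from solving inequality ii) and are not copied from the published figure): for each R, find the largest x̄ satisfying R − log₂ 2e ≥ x̄ log₂(x̄/2e) − 0.0861 x̄.

```python
import math

def rhs(xbar):
    """Right-hand side of ii): xbar * log2(xbar / 2e) - 0.0861 * xbar."""
    return xbar * math.log2(xbar / (2 * math.e)) - 0.0861 * xbar

def bound_on_xbar(R, hi=1000.0):
    """Largest xbar with rhs(xbar) <= R - log2(2e), found by bisection.

    rhs decreases up to xbar = 2**1.0861 and increases afterwards, so starting
    from xbar = 1 the feasible set is an interval and bisection applies.
    """
    target = R - math.log2(2 * math.e)
    lo = 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if rhs(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

for R in range(1, 13):
    print(R, round(bound_on_xbar(R), 2))   # reproduces the shape of curve C2
```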

The authors would like to thank Professor R. Floyd for a stimulating discussion at an early stage in this work.

References

1. Frazer, W. D., McKellar, A. C.: Samplesort: A sampling approach to minimal storage tree sorting. J. ACM 17, 496-507 (1970).

2. Frazer, W. D., Wong, C. K.: Sorting by natural selection. To appear, Comm. ACM.

3. Friend, E. H.: Sorting on electronic computer systems. J. ACM 3, 134-168 (1956).

4. Knuth, D. E.: The art of computer programming, vol. 1. Reading, Massachusetts: Addison-Wesley 1969.

5. Shannon, C. E., Weaver, W.: The mathematical theory of communication. Urbana, Illinois: The University of Illinois Press 1964.

A. C. McKellar, C. K. Wong, IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, N.Y. 10598, USA
