
JOURNAL OF ALGORITHMS 25, 274–289 (1997)
ARTICLE NO. AL970885

Greedy Algorithms for On-Line Data Compression*

József Békési and Gábor Galambos

Department of Computer Science, JGYTF, P.O. Box 396, H-6720 Szeged, Hungary

Ulrich Pferschy

Department of Statistics and Operations Research, University Graz, Universitätsstrasse 15, A-8010 Graz, Austria

and

Gerhard J. Woeginger

Institute of Mathematics B, University of Technology Graz, Steyrergasse 30, A-8010 Graz, Austria

Received July 3, 1997

We consider on-line text-compression problems where compression is done by substituting substrings according to some fixed static dictionary (code book). Due to the long running time of optimal algorithms, several heuristics have been introduced in the literature. In this paper, we continue the investigations of Katajainen and Raita [3]. We complete the worst-case analysis of the longest matching algorithm and of the differential greedy algorithm for several types of special dictionaries and we derive matching lower and upper bounds for all variants of this problem. © 1997 Academic Press

*This research was partially supported by the Spezialforschungsbereich F003 "Optimierung und Kontrolle," Projektbereich Diskrete Optimierung, and by a grant from the Hungarian Academy of Sciences (OTKA, No. T016349).

0196-6774/97 $25.00. Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

Recent advances in computer technology, both in the field of hardware and in the field of software, strongly require large amounts of data to be moved between various components or to be stored in bounded capacity devices. All these operations need data transfers, either between two computers or between two parts of the same computer. Essentially, there are just two possibilities to increase the performance of the transfer: either to use better (and more expensive!) hardware or to compress the data before the transfer.

One possibility for compressing a given source string is to substitute pieces of the string with the help of a dictionary. A dictionary consists of pairs of strings over a finite alphabet (source word, code word), which are used to replace a substring in the source string. We will consider only methods which use a static dictionary, that is, a fixed dictionary that cannot be changed or extended during the encoding–decoding procedure. Our aim is to translate (encode) the source string with the help of the dictionary strings into a code text with minimal length; in other words, we want to find a space-optimal encoding procedure.

The preceding setup is equivalent to the problem of finding a shortest path in a related directed, edge-weighted graph (cf. Schuegraf and Heaps [4]): for a source string $S = s_1 s_2 \ldots s_n$, we define a graph $N = (V, A)$ on the vertex set $V = \{v_0, v_1, \ldots, v_n\}$. There is an edge $(v_i, v_{i+d}) \in A$ iff there exists a pair (source word, code word) in the dictionary such that the source word consists of $d$ characters that exactly match the original source string in the positions $i + 1, \ldots, i + d$. The weight of this edge is the number of bits in the corresponding code word. It is easily seen that a shortest path from $v_0$ to $v_n$ in the graph $N$ corresponds to an optimal compression of the source string $S$.
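The shortest-path formulation can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function name `optimal_encoding`, the representation of a dictionary as a mapping from source word to a (code word name, weight in bits) pair, and the toy dictionary are all assumptions made for the example. Because every edge of $N$ points from a lower-numbered to a higher-numbered vertex, a single left-to-right pass is enough.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: source word -> (code word name, weight in bits).
Dictionary = Dict[str, Tuple[str, int]]


def optimal_encoding(source: str, dictionary: Dictionary) -> Optional[Tuple[int, List[str]]]:
    """Shortest path from v_0 to v_n in the graph N = (V, A).

    Returns (total weight, code words along the path),
    or None if v_n cannot be reached from v_0.
    """
    n = len(source)
    inf = float("inf")
    dist = [inf] * (n + 1)   # dist[i]: weight of a shortest path from v_0 to v_i
    pred = [None] * (n + 1)  # pred[i]: (previous vertex, code word) on that path
    dist[0] = 0
    for i in range(n):       # N is acyclic, so vertices can be relaxed in order
        if dist[i] == inf:
            continue
        for word, (code, weight) in dictionary.items():
            # Edge (v_i, v_{i+d}) exists iff the source word matches at position i.
            if word and source.startswith(word, i):
                j = i + len(word)
                if dist[i] + weight < dist[j]:
                    dist[j] = dist[i] + weight
                    pred[j] = (i, code)
    if dist[n] == inf:
        return None
    codes: List[str] = []
    i = n
    while i > 0:             # walk the predecessor chain back to v_0
        i, code = pred[i]
        codes.append(code)
    codes.reverse()
    return dist[n], codes


# Hypothetical toy dictionary, for illustration only.
toy = {"a": ("A", 8), "b": ("B", 8), "ab": ("X", 5), "ba": ("Y", 3)}
print(optimal_encoding("aba", toy))
```

On the toy input the three possible encodings weigh 24, 13, and 11 bits, and the sketch returns the 11-bit one.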

In case that the graph has many cut vertices (i.e., vertices which divide the original problem into independent subproblems) and in case that all these subproblems are reasonably small, we can indeed solve the problem efficiently and compute the optimal encoding. However, in practice, this optimal algorithm turned out to be impractical and inefficient for very long strings (cf. [3]). Therefore, heuristics have been developed to derive near optimal solutions.

The earlier developed heuristics (e.g., the longest fragment first heuristic; cf. Schuegraf and Heaps [4]) have not been deeply analyzed and only experimental results on their performance have been reported. The first worst-case analysis for data compression heuristics was performed by Katajainen and Raita [3] who examined so-called on-line heuristics: An on-line data compression algorithm starts at the source vertex $v_0$, examines all outgoing edges, and chooses one of them according to some given rule. Then the algorithm continues this procedure from the vertex reached via the chosen edge. There is no possibility to undo a decision made at an earlier time, and no backtracking is allowed.

The worst-case behaviour of a heuristic is measured by its asymptotic worst-case ratio, which is defined as follows. Let $D = \{(w_i, c_i) : i = 1, \ldots, k\}$ be a static dictionary and consider an arbitrary data compression algorithm $A$. Let $A(D, S)$, resp. $\mathrm{OPT}(D, S)$, denote the compressed string produced by algorithm $A$, resp. the optimal encoding for a given source string $S$. The length of these codings will be denoted by $\|A(D, S)\|$, resp. $\|\mathrm{OPT}(D, S)\|$. Then the asymptotic worst-case ratio of algorithm $A$ is defined as

$$R_A(D) = \limsup_{n \to \infty} \left\{ \frac{\|A(D, S)\|}{\|\mathrm{OPT}(D, S)\|} : S \in \mathcal{S}(n) \right\},$$

where $\mathcal{S}(n)$ is the set of all text strings with exactly $n$ characters.

Katajainen and Raita [3] analyzed two simple on-line heuristics, the longest matching and the differential greedy algorithms (exact definitions of these algorithms will be given later). They used four parameters to investigate and state the asymptotic worst-case ratios of these algorithms:

$$\begin{aligned}
\mathrm{Bt}(S) &= \text{length of each symbol of the source string } S \text{ in bits},\\
\mathrm{lmax}(D) &= \max\{|w_i| : i = 1, \ldots, k\},\\
\mathrm{cmin}(D) &= \min\{\|c_i\| : i = 1, \ldots, k\},\\
\mathrm{cmax}(D) &= \max\{\|c_i\| : i = 1, \ldots, k\},
\end{aligned}$$

where $|w_i|$ denotes the length of a string $w_i$ in characters and $\|c_i\|$ the length of a code word $c_i$ in bits. If the meaning is clear from the context, we will simply denote the bit length of each input character by $\mathrm{Bt}$. Throughout the paper we will assume that $\mathrm{lmax} \ge 3$ to avoid the separate treatment of trivial cases.

Not surprisingly, the worst-case behaviour of a heuristic strongly depends on the features of the available dictionary. Following the framework in [3], we now define those types of dictionaries which will be examined in the sequel:

A dictionary is called general if it contains all symbols of the input alphabet as source words (this just ensures that every heuristic will in any case reach the sink of the underlying graph and thus will terminate the encoding with a feasible solution). In this paper, we will only deal with general dictionaries. A general dictionary is called a

(1) code-uniform dictionary, if all code words are of equal length (i.e., $\|c_i\| = \|c_j\|$, $1 \le i, j \le k$);

(2) nonlengthening dictionary, if the length of any code word never exceeds the length of the corresponding source word (i.e., $\|c_i\| \le |w_i|\,\mathrm{Bt}$, $1 \le i \le k$);

(3) suffix dictionary, if with every source word $w$ also all of its proper suffixes are source words (i.e., if $w = \omega_1 \omega_2 \cdots \omega_q$ is a source word $\Rightarrow$ $\omega_h \omega_{h+1} \cdots \omega_q$ is a source word for all $2 \le h \le q$);

(4) prefix dictionary, if with every source word $w$ also all of its proper prefixes are source words (i.e., if $w = \omega_1 \omega_2 \cdots \omega_q$ is a source word $\Rightarrow$ $\omega_1 \omega_2 \cdots \omega_h$ is a source word for all $1 \le h \le q - 1$).

The nonlengthening property only makes sense if some inefficiency in the form of unused codes is present in the source representation.
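The dictionary types defined above translate directly into small predicates. The following Python sketch is an illustration under assumptions, not part of the paper: the dictionary is assumed to be a mapping from source word to a (code word name, weight in bits) pair, and all function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed representation: source word -> (code word name, weight in bits).
Dictionary = Dict[str, Tuple[str, int]]


def is_general(d: Dictionary, alphabet: Set[str]) -> bool:
    """Every symbol of the input alphabet is a source word."""
    return all(symbol in d for symbol in alphabet)


def is_code_uniform(d: Dictionary) -> bool:
    """All code words have the same length in bits."""
    return len({weight for _, weight in d.values()}) <= 1


def is_nonlengthening(d: Dictionary, bt: int) -> bool:
    """No code word is longer than its source word: ||c_i|| <= |w_i| * Bt."""
    return all(weight <= len(word) * bt for word, (_, weight) in d.items())


def is_suffix(d: Dictionary) -> bool:
    """Every proper suffix of every source word is a source word."""
    return all(word[h:] in d for word in d for h in range(1, len(word)))


def is_prefix(d: Dictionary) -> bool:
    """Every proper prefix of every source word is a source word."""
    return all(word[:h] in d for word in d for h in range(1, len(word)))


# Hypothetical example: source words u, v, w, uv, vw, vww, vwwu.
example = {
    "u": ("a", 12), "v": ("b", 12), "w": ("c", 12), "uv": ("d", 12),
    "vw": ("e1", 12), "vww": ("e2", 12), "vwwu": ("f", 4),
}
print(is_general(example, {"u", "v", "w"}), is_prefix(example), is_suffix(example))
```

In this example the dictionary is general and prefix, but not suffix, because `ww` (a proper suffix of `vww`) is not a source word.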

In this paper, we will continue the investigations of Katajainen and Raita [3]. We answer several questions that remained open and provide matching upper and lower bounds for some dictionary types, in particular, for the differential greedy algorithm. Furthermore, we perform the worst-case analysis for prefix dictionaries.

The paper is organized as follows. Section 2 deals with the longest matching algorithm and prefix dictionaries. In Section 3 we investigate the differential greedy algorithm. Beside results for the prefix property, we show some improvements on the results in [3] for those cases where the given worst-case bounds were not tight. Some concluding remarks are presented in Section 4.

2. PREFIX DICTIONARIES AND THE LONGEST MATCHING HEURISTIC

The longest matching algorithm (LM) processes the text string from left to right and chooses at each position the longest dictionary source word that matches the original text. This substring is then replaced by the corresponding code word.
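The LM rule just described can be sketched as follows. This is a minimal illustration, assuming the dictionary is given as a mapping from source word to a (code word name, weight in bits) pair; the function name `longest_matching` and the toy dictionary are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: source word -> (code word name, weight in bits).
Dictionary = Dict[str, Tuple[str, int]]


def longest_matching(source: str, dictionary: Dictionary) -> Tuple[int, List[str]]:
    """Encode `source` left to right, always taking the longest matching source word."""
    pos, total = 0, 0
    codes: List[str] = []
    while pos < len(source):
        matches = [w for w in dictionary if w and source.startswith(w, pos)]
        if not matches:
            # Cannot happen for a general dictionary, which contains every symbol.
            raise ValueError(f"no source word matches at position {pos}")
        # Two different source words matching at the same position cannot have
        # the same length, so the longest match is unique.
        word = max(matches, key=len)
        code, weight = dictionary[word]
        codes.append(code)
        total += weight
        pos += len(word)
    return total, codes


# Hypothetical toy dictionary, for illustration only.
toy = {"a": ("A", 8), "b": ("B", 8), "ab": ("X", 5), "ba": ("Y", 3)}
print(longest_matching("aba", toy))
```

On the toy input LM takes `ab` and then `a`, for 13 bits, although encoding `a` followed by `ba` would need only 11 bits; the on-line rule cannot revise its first choice.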

LM is one of the oldest heuristics used for data compression (see [4]). Katajainen and Raita [3] analyzed the worst-case behaviour of LM for dictionaries that are code uniform/nonlengthening/suffix and derived tight bounds for all eight combinations of these properties. They conjectured that "the coding result for prefix dictionaries can be weaker than the bounds derived for suffix dictionaries."

In this section, we will derive tight worst-case bounds for all four combinations of the property prefix with the properties code uniform/nonlengthening. We will show that the longest matching algorithm for prefix dictionaries can behave as badly as possible: all bounds for prefix dictionaries with additional properties $\mathcal{P}$ are the same as the corresponding bounds for general dictionaries with properties $\mathcal{P}$; in other words, adding the prefix property does not improve any worst-case bound for LM.


We will first show a general upper bound which is valid for any compression algorithm. An example where this general bound is attained by the longest matching algorithm is given by Katajainen and Raita in [3], where an identical upper bound was shown in a significantly longer proof.

THEOREM 2.1. Let $D$ be a general dictionary. Then, for any encoding algorithm $A$,

$$R_A(D) \le \frac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}}.$$

Proof. We consider source strings where the $A$-path and an OPT-path are vertex disjoint with common end vertices $v$ and $v'$. Obviously, any given string can be partitioned into a number of substrings with this property and the worst-case ratio for every substring carries over to their combination. The number of characters between these end vertices is denoted by $j$ and the number of vertices on the OPT-path between them by $r$. Because the two paths are vertex disjoint and each arc "consumes" at least one character, the number of arcs of the $A$-path between $v$ and $v'$ is at most $j - r$. However, the arcs on the optimal path have to cover the distance between $v$ and $v'$. Hence, we have

$$r \ge \frac{j}{\mathrm{lmax}}.$$

This yields

$$R_A(D) \le \frac{(j - r)\,\mathrm{cmax}}{(r + 1)\,\mathrm{cmin}} \le \left(\frac{j}{r} - 1\right)\frac{\mathrm{cmax}}{\mathrm{cmin}} \le \frac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}}.$$

Setting $\mathrm{cmax} = \mathrm{cmin}$ and repeating the previous arguments, we get the following:

COROLLARY 2.2. Let $D$ be a code-uniform dictionary. Then, for any encoding algorithm $A$,

$$R_A(D) \le \mathrm{lmax} - 1.$$

By constructing an example of a prefix dictionary, where the general upper bound from Theorem 2.1 is attained, our claim can be proven.

THEOREM 2.3. Let $D$ be a prefix dictionary. Then

$$R_{LM}(D) \le \frac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}}$$

and this bound can be attained.

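Proof. The upper bound follows immediately from Theorem 2.1. The matching lower bound is derived by choosing the three-symbol alphabet $\{u, v, w\}$ and the following prefix dictionary:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{cmax}$ |
| $v$ | $b$ | $\mathrm{cmax}$ |
| $w$ | $c$ | $\mathrm{cmax}$ |
| $uv$ | $d$ | $\mathrm{cmax}$ |
| $vw^j$, $j = 1, \ldots, \mathrm{lmax} - 2$ | $e_j$ | $\mathrm{cmax}$ |
| $vw^{\mathrm{lmax}-2}u$ | $f$ | $\mathrm{cmin}$ |

For $i > 0$, we consider the strings $S_i = u(vw^{\mathrm{lmax}-2}u)^i$ with length $n = i\,\mathrm{lmax} + 1$. Visualizing the corresponding network (see Fig. 1 for an illustration), it can be checked that $\mathrm{OPT}(D, S_i) = af^i$ and $\mathrm{LM}(D, S_i) = (dc^{\mathrm{lmax}-2})^i a$. Hence,

$$R_{LM}(D) \ge \lim_{n \to \infty} \frac{\|\mathrm{LM}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|} = \lim_{i \to \infty} \frac{i(\mathrm{lmax} - 1)\,\mathrm{cmax} + \mathrm{cmax}}{i\,\mathrm{cmin} + \mathrm{cmax}} = \frac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}}.$$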
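The lower-bound construction in the proof of Theorem 2.3 can be checked numerically for concrete parameters. The sketch below is an illustration only: it stores just the weight of each source word, the helper names are hypothetical, and the values of lmax, cmin, and cmax are arbitrary choices. It builds the dictionary and the strings $S_i$ from the proof, runs LM, computes the optimum by a shortest-path pass, and prints the ratio for growing $i$.

```python
from fractions import Fraction
from typing import Dict


def theorem_2_3_dictionary(lmax: int, cmin: int, cmax: int) -> Dict[str, int]:
    """Source word -> weight, following the table in the proof of Theorem 2.3."""
    d = {"u": cmax, "v": cmax, "w": cmax, "uv": cmax}
    for j in range(1, lmax - 1):              # v w^j for j = 1, ..., lmax - 2
        d["v" + "w" * j] = cmax
    d["v" + "w" * (lmax - 2) + "u"] = cmin    # v w^(lmax-2) u
    return d


def lm_cost(source: str, d: Dict[str, int]) -> int:
    """Total weight of the longest matching encoding."""
    pos = total = 0
    while pos < len(source):
        word = max((w for w in d if source.startswith(w, pos)), key=len)
        total += d[word]
        pos += len(word)
    return total


def opt_cost(source: str, d: Dict[str, int]):
    """Weight of a shortest path from v_0 to v_n (one left-to-right pass)."""
    n = len(source)
    dist = [0] + [float("inf")] * n
    for i in range(n):
        for word, weight in d.items():
            if source.startswith(word, i):
                j = i + len(word)
                dist[j] = min(dist[j], dist[i] + weight)
    return dist[n]


lmax, cmin, cmax = 4, 3, 10                   # arbitrary illustrative values
d = theorem_2_3_dictionary(lmax, cmin, cmax)
for i in (1, 10, 100):
    s = "u" + ("v" + "w" * (lmax - 2) + "u") * i
    ratio = Fraction(lm_cost(s, d), opt_cost(s, d))
    print(i, float(ratio), float(Fraction((lmax - 1) * cmax, cmin)))
```

For these parameters the LM cost equals $i(\mathrm{lmax}-1)\,\mathrm{cmax} + \mathrm{cmax}$ and the optimal cost equals $\mathrm{cmax} + i\,\mathrm{cmin}$, so the printed ratio approaches the bound from below as $i$ grows.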

COROLLARY 2.4. Let $D$ be a prefix and code-uniform dictionary. Then

$$R_{LM}(D) \le \mathrm{lmax} - 1$$

and this bound can be attained.

FIG. 1. Illustration for the dictionary and string $S_i$ defined in the proof of Theorem 2.3 (with $\mathrm{lmax} = 4$). The optimal code path runs below the horizontal line, the LM heuristic path above.

Proof. The upper bound follows immediately from Corollary 2.2. The lower bound is easily deduced by setting $\mathrm{cmin} = \mathrm{cmax}$ in the proof of Theorem 2.3. Details are left to the reader.

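THEOREM 2.5. Let $D$ be a prefix and nonlengthening dictionary. Then

$$R_{LM}(D) \le \frac{\mathrm{lmax}\,\mathrm{Bt}}{\mathrm{cmin}}$$

and this bound can be attained.

Proof. To derive the upper bound, we follow the proof of Theorem 2.1 and modify the weights of the edges in an appropriate way (we take into account that a code word corresponding to a unit-length source word cannot be longer than $\mathrm{Bt}$).

To get a matching lower-bound example, we use the dictionary given in the proof of Theorem 2.3 and modify the weights as follows:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{cmin}$ |
| $v$ | $b$ | $\mathrm{Bt}$ |
| $w$ | $c$ | $\mathrm{Bt}$ |
| $uv$ | $d$ | $2\,\mathrm{Bt}$ |
| $vw^j$, $j = 1, \ldots, \mathrm{lmax} - 2$ | $e_j$ | $(j + 1)\,\mathrm{Bt}$ |
| $vw^{\mathrm{lmax}-2}u$ | $f$ | $\mathrm{cmin}$ |

Thereby we get

$$R_{LM}(D) \ge \lim_{n \to \infty} \frac{\|\mathrm{LM}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|} = \lim_{i \to \infty} \frac{i\,\mathrm{lmax}\,\mathrm{Bt} + \mathrm{cmin}}{(i + 1)\,\mathrm{cmin}} = \frac{\mathrm{lmax}\,\mathrm{Bt}}{\mathrm{cmin}}.$$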

By applying arguments analogous to those in the preceding proofs, it is easy to show the following:

COROLLARY 2.6. Let $D$ be a prefix, nonlengthening, and code-uniform dictionary. Then

$$R_{LM}(D) \le \mathrm{lmax} - 1$$

and this bound can be attained.

3. IMPROVED AND NEW BOUNDS FOR THE DIFFERENTIAL GREEDY HEURISTIC

The greedy algorithm, which we will call the differential greedy algorithm (DG), always chooses the "best possible local compression" at the current position. This is done by calculating the differences $|w_i|\,\mathrm{Bt} - \|c_i\|$ for all matching dictionary source words $w_i$ and taking the source word which maximizes this difference. Ties are broken arbitrarily.
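The DG rule can be sketched in the same style as LM. This is an illustrative sketch under assumptions: the dictionary is a mapping from source word to a (code word name, weight in bits) pair, the function name `differential_greedy` is hypothetical, and ties are resolved by taking the first maximizer in dictionary order, which is just one of the arbitrary choices the definition allows.

```python
from typing import Dict, List, Tuple

# Assumed representation: source word -> (code word name, weight in bits).
Dictionary = Dict[str, Tuple[str, int]]


def differential_greedy(source: str, dictionary: Dictionary, bt: int) -> Tuple[int, List[str]]:
    """Encode `source` left to right, maximizing |w| * Bt - ||c|| at each position."""
    pos, total = 0, 0
    codes: List[str] = []
    while pos < len(source):
        matches = [w for w in dictionary if w and source.startswith(w, pos)]
        if not matches:
            raise ValueError(f"no source word matches at position {pos}")
        # Local saving in bits; max() returns the first maximizer in dictionary
        # order, which is the (assumed) tie-breaking rule of this sketch.
        word = max(matches, key=lambda w: len(w) * bt - dictionary[w][1])
        code, weight = dictionary[word]
        codes.append(code)
        total += weight
        pos += len(word)
    return total, codes


# Hypothetical toy dictionary: the long source word has an expensive code word.
toy = {"a": ("A", 8), "ab": ("X", 20), "b": ("B", 4)}
print(differential_greedy("ab", toy, bt=8))
```

Here DG prefers `a` (saving 0 bits) over `ab` (saving -4 bits) and ends with 12 bits, whereas the longest matching rule would pick `ab` and pay 20 bits.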

The differential greedy algorithm was introduced by Gonzalez-Smith and Storer [2], and it was partially analyzed from a worst-case point of view in [3].

Recently, a new algorithm, the fractional greedy heuristic, where for each local compression the ratio $|w_i|\,\mathrm{Bt} / \|c_i\|$ is maximized, was introduced and analyzed by the authors in [1].

Since the worst-case behaviour of the DG algorithm is in many cases similar to the behaviour of the LM algorithm, we first examine the cases of prefix dictionaries and show that the corresponding bounds from Section 2 carry over to the DG case. We will leave most of the proofs to the reader. Note that the upper bound from Theorem 2.1 is also valid for the DG case.

THEOREM 3.1. Let $D_1$ be a prefix dictionary. Furthermore, let $D_2$ be a prefix and code-uniform dictionary, $D_3$ be a prefix and nonlengthening dictionary, and let $D_4$ be a prefix, nonlengthening, and code-uniform dictionary. Then

$$\begin{aligned}
R_{DG}(D_1) &\le \frac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}},\\
R_{DG}(D_2) &\le \mathrm{lmax} - 1,\\
R_{DG}(D_3) &\le \frac{\mathrm{lmax}\,\mathrm{Bt}}{\mathrm{cmin}},\\
R_{DG}(D_4) &\le \mathrm{lmax} - 1
\end{aligned}$$

and all four bounds can be attained.

Proof. To show that the bound from Theorem 2.1 can also be attained for a prefix dictionary $D_1$, the same dictionary introduced in the proof of Theorem 2.3 can be used. Again, we have $\mathrm{OPT}(S_i) = af^i$ and $\mathrm{DG}(S_i) = (dc^{\mathrm{lmax}-2})^i a$, which yields the desired bound.

Along the same line one can easily prove the tightness of the bounds for $D_2$, $D_3$, and $D_4$.

Obviously, for code-uniform dictionaries, DG and LM yield identical coding results. Hence, this property will not be considered in the following.

For nonlengthening, suffix dictionaries, Katajainen and Raita proved the result stated in the following proposition. Among other open problems they asked for the exact value of this worst-case ratio.

PROPOSITION 3.2 (Katajainen and Raita [3]). Let $D$ be a nonlengthening, suffix dictionary. Then

$$R_{DG}(D) \le \frac{\min\{\mathrm{lmax}\,\mathrm{Bt},\; 2\,\mathrm{cmax} - \mathrm{Bt}\}}{\mathrm{cmin}}.$$

In the next theorem we prove that this upper bound is best possible.

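THEOREM 3.3. For infinitely many quadruples of positive integers $\mathrm{Bt}$, $\mathrm{lmax}$, $\mathrm{cmin}$, and $\mathrm{cmax}$ with $\mathrm{cmin} \le \mathrm{Bt}$, $\mathrm{cmin} \le \mathrm{cmax}$, and $\mathrm{cmax} \le \mathrm{lmax}\,\mathrm{Bt}$, there exists a nonlengthening, suffix dictionary $D$ such that

$$R_{DG}(D) \ge \frac{\min\{\mathrm{lmax}\,\mathrm{Bt},\; 2\,\mathrm{cmax} - \mathrm{Bt}\}}{\mathrm{cmin}}.$$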

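Proof.

Case I. If $\mathrm{lmax}\,\mathrm{Bt} \le 2\,\mathrm{cmax} - \mathrm{Bt}$, then we consider the following dictionary:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{Bt}$ |
| $w^j$, $j = 1, \ldots, \mathrm{lmax} - 1$ | $b_j$ | $j\,\mathrm{Bt}$ |
| $uw$ | $c$ | $2\,\mathrm{Bt}$ |
| $w^j u$, $j = 1, \ldots, \mathrm{lmax} - 2$ | $d_j$ | $(j + 1)\,\mathrm{Bt}$ |
| $w^{\mathrm{lmax}-1}u$ | $e$ | $\mathrm{cmin}$ |

For $i \ge 1$, we define strings $S_i = u(w^{\mathrm{lmax}-1}u)^i$ of length $n = i\,\mathrm{lmax} + 1$. With this we get $\mathrm{DG}(D, S_i) = (cb_{\mathrm{lmax}-2})^i a$ and $\mathrm{OPT}(D, S_i) = ae^i$. Calculating the worst-case ratio, we have

$$R_{DG}(D) \ge \lim_{n \to \infty} \frac{\|\mathrm{DG}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|} = \lim_{i \to \infty} \frac{i\,\mathrm{lmax}\,\mathrm{Bt} + \mathrm{Bt}}{\mathrm{Bt} + i\,\mathrm{cmin}} = \frac{\mathrm{lmax}\,\mathrm{Bt}}{\mathrm{cmin}}.$$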

Case II. If $\mathrm{lmax}\,\mathrm{Bt} > 2\,\mathrm{cmax} - \mathrm{Bt}$, let us suppose that $\mathrm{cmax} = \alpha\,\mathrm{Bt}$ for some $\alpha \ge 2$. This implies $2\alpha - 1 < \mathrm{lmax}$. We consider the following dictionary:

| Source word | Code word | Weight |
|---|---|---|
| $u^j$, $j = 1, \ldots, \alpha - 1$ | $a_j$ | $\mathrm{Bt}$ |
| $u^{\alpha}$ | $b$ | $2\,\mathrm{Bt}$ |
| $uw^j$, $j = 1, \ldots, \alpha - 1$ | $c_j$ | $(j + 1)\,\mathrm{Bt}$ |
| $w^j$, $j = 1, \ldots, \alpha - 2$ | $d_j$ | $j\,\mathrm{Bt}$ |
| $w^{\alpha-1}$ | $e$ | $\mathrm{cmax} - \mathrm{Bt}$ |
| $w^{2(\alpha-1)}u$ | $f$ | $\mathrm{cmin}$ |
| $w^j u$, $j = 1, \ldots, 2\alpha - 3$, $j \ne \alpha - 1$ | $g_j$ | $\mathrm{cmin}$ |
| $w^{\alpha-1}u$ | $h$ | $\mathrm{cmax}$ |

For $S_i = u^{\alpha}(w^{2(\alpha-1)}u)^i$, we get $\mathrm{OPT}(D, S_i) = bf^i$ and $\mathrm{DG}(D, S_i) = a_{\alpha-1}(c_{\alpha-1}e)^i a_1$.

There are some points where the choice of DG is not unique. As DG chooses $u^{\alpha-1}$, there also is another possible candidate $u^{\alpha}$. The two differences computed by DG are $(\alpha - 1)\,\mathrm{Bt} - \mathrm{Bt}$, resp. $\alpha\,\mathrm{Bt} - 2\,\mathrm{Bt}$. Without loss of generality, we suppose that DG chooses $u^{\alpha-1}$. In a similar way, the algorithm chooses $uw^{\alpha-1}$ instead of $u$ and $w^{\alpha-1}$ instead of $w^{\alpha-1}u$. With this the worst-case ratio is given by

$$R_{DG}(D) \ge \lim_{n \to \infty} \frac{\|\mathrm{DG}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|} = \lim_{i \to \infty} \frac{i(2\,\mathrm{cmax} - \mathrm{Bt}) + 2\,\mathrm{Bt}}{i\,\mathrm{cmin} + 2\,\mathrm{Bt}} = \frac{2\,\mathrm{cmax} - \mathrm{Bt}}{\mathrm{cmin}}.$$

Concerning the analysis of the DG heuristic for suffix dictionaries, the authors in [3] mention that "the behaviour of the heuristic is inherently more difficult to analyze." The only upper bounds that were known for this type of dictionary are the same as for general dictionaries; they are summarized in the following proposition. Deriving tighter bounds was posed as another open problem.

PROPOSITION 3.4 (Katajainen and Raita [3]). Let $D$ be a suffix dictionary. Then

$$R_{DG}(D) \le \begin{cases}
\dfrac{\mathrm{cmin} + (\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin} + (\mathrm{lmax} - 1)\,\mathrm{Bt}} & \text{if } (\mathrm{lmax} - 1)^2\,\mathrm{cmax}\,\mathrm{Bt} < \mathrm{cmin}^2 \text{ and } (\mathrm{cmax} - \mathrm{cmin})/\mathrm{Bt} \ge \mathrm{lmax} - 1,\\[2ex]
\dfrac{(\mathrm{lmax} - 1)\,\mathrm{cmax}}{\mathrm{cmin}} & \text{otherwise.}
\end{cases}$$

In the next theorem we give tight bounds for the asymptotic worst-case behaviour of the DG heuristic for suffix dictionaries.

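THEOREM 3.5. Let $D$ be a suffix dictionary. Then

$$R_{DG}(D) \le \begin{cases}
\dfrac{2\,\mathrm{cmax} - \mathrm{Bt}}{\mathrm{cmin}} & \text{if } \mathrm{cmax} \le \tfrac{3}{2}\,\mathrm{Bt},\\[2ex]
\dfrac{(2\,\mathrm{cmax} + \mathrm{Bt})^2}{8\,\mathrm{Bt}\,\mathrm{cmin}} & \text{if } \tfrac{3}{2}\,\mathrm{Bt} < \mathrm{cmax} \le \left(\mathrm{lmax} - \tfrac{3}{2}\right)\mathrm{Bt},\\[2ex]
\dfrac{(\mathrm{lmax} - 1)\bigl(2\,\mathrm{cmax} - (\mathrm{lmax} - 2)\,\mathrm{Bt}\bigr)}{2\,\mathrm{cmin}} & \text{if } \left(\mathrm{lmax} - \tfrac{3}{2}\right)\mathrm{Bt} < \mathrm{cmax}.
\end{cases}$$

All bounds can be attained.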
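To get a feeling for how the three regimes of Theorem 3.5 fit together, the bound can be evaluated with exact rational arithmetic. The following Python sketch is only an illustration; the function name `dg_suffix_bound` and the sample parameters are assumptions, and the case conditions are multiplied by 2 to stay in integers.

```python
from fractions import Fraction


def dg_suffix_bound(lmax: int, bt: int, cmin: int, cmax: int) -> Fraction:
    """Upper bound of Theorem 3.5 on R_DG(D) for a suffix dictionary D."""
    if 2 * cmax <= 3 * bt:                      # cmax <= (3/2) Bt
        return Fraction(2 * cmax - bt, cmin)
    if 2 * cmax <= (2 * lmax - 3) * bt:         # cmax <= (lmax - 3/2) Bt
        return Fraction((2 * cmax + bt) ** 2, 8 * bt * cmin)
    return Fraction((lmax - 1) * (2 * cmax - (lmax - 2) * bt), 2 * cmin)


# Arbitrary illustrative parameters: lmax = 5, Bt = 8, cmin = 4.
for cmax in (12, 20, 28, 40):
    general = Fraction((5 - 1) * cmax, 4)       # bound of Theorem 2.1
    print(cmax, dg_suffix_bound(5, 8, 4, cmax), general)
```

At the two thresholds $\mathrm{cmax} = \tfrac{3}{2}\,\mathrm{Bt}$ and $\mathrm{cmax} = (\mathrm{lmax} - \tfrac{3}{2})\,\mathrm{Bt}$ the neighbouring expressions coincide, so the bound is continuous in $\mathrm{cmax}$; for the sample values it also stays below the general bound of Theorem 2.1.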

Proof. Let $N = (V, A)$ be the network for a string $S$ and the dictionary $D$ as defined in Section 1. Let $v_i$ and $v_j$ be two consecutive cut vertices of $N$. This implies that both of them lie on the optimal path and on the DG-path.

First, we will prove an upper bound on $R_{DG}(D)$. Without loss of generality, we may assume that the optimal path has only a single edge from $v_i$ to $v_j$ with (maximum) length $\mathrm{lmax}$ and (minimum) weight $\mathrm{cmin}$. We introduce the following notation. We assume that the DG-path consists of the sequence $(v_i, v_{i_1}, v_{i_2}, \ldots, v_{i_{k+1}}, v_j)$. The length, resp. weight, of an edge $(v_{i_p}, v_{i_{p+1}})$, $1 \le p \le k$, is denoted by $t_p$, resp. $c_p$.

Since the given dictionary has the suffix property, at vertex $v_{i_p}$ the algorithm DG has to choose between the two edges $(v_{i_p}, v_{i_{p+1}})$ and $(v_{i_p}, v_j)$. The latter has weight at most $\mathrm{cmax}$. Since the DG heuristic uses the path through vertex $v_{i_{p+1}}$,

$$t_p\,\mathrm{Bt} - c_p \ge \left(\sum_{l=p}^{k} t_l + 1\right)\mathrm{Bt} - \mathrm{cmax}$$

must hold. Summing up over all $p$, we get

$$\sum_{p=1}^{k} c_p \le k\,\mathrm{cmax} + \mathrm{Bt}\sum_{p=1}^{k} t_p - \mathrm{Bt}\sum_{p=1}^{k}\sum_{l=p}^{k} t_l - k\,\mathrm{Bt}. \tag{1}$$

First, we will estimate the third term on the right-hand side:

$$\sum_{p=1}^{k}\sum_{l=p}^{k} t_l = \sum_{p=1}^{k} p\,t_p. \tag{2}$$

We denote

$$T = \sum_{l=1}^{k} t_l + 1 \le \mathrm{lmax} - 1.$$

It is easy to verify that the minimum of (2) is attained iff $t_1 = T - k$ and $t_p = 1$, $p = 2, \ldots, k$. Hence, we get

$$\min_{t_l} \sum_{p=1}^{k}\sum_{l=p}^{k} t_l = (T - k) + \sum_{i=2}^{k} i = T + \frac{(k + 1)(k - 2)}{2}.$$

Substituting this result into (1) yields

$$\sum_{p=1}^{k} c_p \le \max_{1 \le k \le \mathrm{lmax}-2} \left\{ k\,\mathrm{cmax} - \mathrm{Bt}\,(k + 1) - \frac{(k + 1)(k - 2)}{2}\,\mathrm{Bt} \right\} = \frac{1}{2} \max_{1 \le k \le \mathrm{lmax}-2} \left\{ 2k\,\mathrm{cmax} - (k^2 + k)\,\mathrm{Bt} \right\}. \tag{3}$$

This expression is a concave function in $k$ and becomes maximum for $k = (2\,\mathrm{cmax} - \mathrm{Bt})/(2\,\mathrm{Bt})$. Depending on the feasible range for $k$, we distinguish three cases:

(1) If $\mathrm{cmax} \le \tfrac{3}{2}\,\mathrm{Bt}$, the maximum is taken at $k = 1$.

(2) If $\tfrac{3}{2}\,\mathrm{Bt} < \mathrm{cmax} \le (\mathrm{lmax} - \tfrac{3}{2})\,\mathrm{Bt}$, the maximum is taken at $k = (2\,\mathrm{cmax} - \mathrm{Bt})/(2\,\mathrm{Bt})$.

(3) If $(\mathrm{lmax} - \tfrac{3}{2})\,\mathrm{Bt} < \mathrm{cmax}$, the maximum is taken at $k = \mathrm{lmax} - 2$.

Substituting the corresponding values of $k$ into the right-hand side of (3) and assigning the weight $\mathrm{cmax}$ to the final edge from $v_{i_{k+1}}$ to $v_j$, we arrive at the desired results for the upper bounds.
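The substitution step is stated without the intermediate algebra. As a worked sketch (the abbreviation $f(k)$ for the right-hand side of (3) is introduced here only for convenience and is not notation from the paper), the three cases evaluate as follows; dividing each result by $\mathrm{cmin}$, the weight of the single optimal edge, gives the three expressions of Theorem 3.5.

```latex
\begin{align*}
f(k) &= \tfrac{1}{2}\bigl(2k\,\mathrm{cmax} - (k^2 + k)\,\mathrm{Bt}\bigr),\\[1ex]
k = 1:\quad
f(1) + \mathrm{cmax}
  &= (\mathrm{cmax} - \mathrm{Bt}) + \mathrm{cmax}
   = 2\,\mathrm{cmax} - \mathrm{Bt},\\[1ex]
k = \frac{2\,\mathrm{cmax} - \mathrm{Bt}}{2\,\mathrm{Bt}}:\quad
f(k) + \mathrm{cmax}
  &= \frac{(2\,\mathrm{cmax} - \mathrm{Bt})^2}{8\,\mathrm{Bt}} + \mathrm{cmax}
   = \frac{(2\,\mathrm{cmax} - \mathrm{Bt})^2 + 8\,\mathrm{Bt}\,\mathrm{cmax}}{8\,\mathrm{Bt}}
   = \frac{(2\,\mathrm{cmax} + \mathrm{Bt})^2}{8\,\mathrm{Bt}},\\[1ex]
k = \mathrm{lmax} - 2:\quad
f(k) + \mathrm{cmax}
  &= \frac{(\mathrm{lmax} - 2)\bigl(2\,\mathrm{cmax} - (\mathrm{lmax} - 1)\,\mathrm{Bt}\bigr)}{2} + \mathrm{cmax}
   = \frac{(\mathrm{lmax} - 1)\bigl(2\,\mathrm{cmax} - (\mathrm{lmax} - 2)\,\mathrm{Bt}\bigr)}{2}.
\end{align*}
```

In the second case the identity $2k\,\mathrm{cmax} - k(k+1)\,\mathrm{Bt} = k\bigl(2\,\mathrm{cmax} - (k+1)\,\mathrm{Bt}\bigr)$ together with $k + 1 = (2\,\mathrm{cmax} + \mathrm{Bt})/(2\,\mathrm{Bt})$ gives the first equality.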

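In the second part of the proof, we show that all the preceding upper bounds indeed are tight.

Case I. For $\mathrm{cmax} \le \tfrac{3}{2}\,\mathrm{Bt}$, we consider the following dictionary:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{cmax} - \mathrm{Bt}$ |
| $w$ | $b$ | $\mathrm{cmax}$ |
| $uw$ | $c$ | $\mathrm{cmax}$ |
| $w^j u$, $j = 1, \ldots, \mathrm{lmax} - 2$ | $d_j$ | $\mathrm{cmax}$ |
| $w^{\mathrm{lmax}-1}u$ | $d_{\mathrm{lmax}-1}$ | $\mathrm{cmin}$ |
| $w^j$, $j = 1, \ldots, \mathrm{lmax} - 1$ | $e_j$ | $\mathrm{cmax} - \mathrm{Bt}$ |

We compress the string $S_i = u(w^{\mathrm{lmax}-1}u)^i$ of length $n = i\,\mathrm{lmax} + 1$. Since $\mathrm{OPT}(D, S_i) = ad_{\mathrm{lmax}-1}^{\,i}$ and $\mathrm{DG}(D, S_i) = (ce_{\mathrm{lmax}-2})^i a$,

$$R_{DG}(D) \ge \lim_{n \to \infty} \frac{\|\mathrm{DG}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|} = \lim_{i \to \infty} \frac{i(2\,\mathrm{cmax} - \mathrm{Bt}) + \mathrm{cmax} - \mathrm{Bt}}{(\mathrm{cmax} - \mathrm{Bt}) + i\,\mathrm{cmin}} = \frac{2\,\mathrm{cmax} - \mathrm{Bt}}{\mathrm{cmin}}.$$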

Case II. If $\tfrac{3}{2}\,\mathrm{Bt} < \mathrm{cmax} \le (\mathrm{lmax} - \tfrac{3}{2})\,\mathrm{Bt}$, we suppose that $\mathrm{Bt}$ is even, $\mathrm{cmin} \le \mathrm{Bt}/2$, and $\mathrm{cmax} = (2\alpha + 1)\,\mathrm{Bt}/2$ for some integer $\alpha$, $1 \le \alpha \le \mathrm{lmax} - 4$. We construct a dictionary on $\mathrm{lmax}$ letters. The letters of the alphabet are denoted by $u, v, w_1, \ldots, w_{\mathrm{lmax}-2}$:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{cmax}$ |
| $uv$ | $b$ | $\mathrm{cmax}$ |
| $v^j$, $j = 1, \ldots, \mathrm{lmax} - 2 - k$ | $c$ | $\mathrm{cmax}$ |
| $v^{\mathrm{lmax}-1-k}$ | $d_0$ | $\tfrac{1}{2}\,\mathrm{Bt}$ |
| $w_j$, $j = 1, \ldots, k - 1$ | $d_j$ | $(2j + 1)\,\mathrm{Bt}/2$ |
| $v^j w_1 \cdots w_{k-1} u$, $j = 1, \ldots, \mathrm{lmax} - 1 - k$ | $e_j$ | $\mathrm{cmax}$ |
| $v^{\mathrm{lmax}-k} w_1 \cdots w_{k-1} u$ | $e_{\mathrm{lmax}-k}$ | $\mathrm{cmin}$ |
| $w_j \cdots w_{k-1} u$, $j = 1, \ldots, \mathrm{lmax} - 1$ | $f_j$ | $\mathrm{cmax}$ |

Let $k = (2\,\mathrm{cmax} - \mathrm{Bt})/(2\,\mathrm{Bt})$ and the source string $S_i = u(v^{\mathrm{lmax}-k} w_1 \cdots w_{k-1} u)^i$. It is easy to check that $\mathrm{OPT}(D, S_i) = ae_{\mathrm{lmax}-k}^{\,i}$. Following the DG-path, one can see that at the beginning the DG algorithm chooses the edge $uv$. At the end vertex of this edge there are several edges. One skips the string $v$ and the others skip the strings $v^j$, $j = 2, \ldots, \mathrm{lmax} - 1 - k$. So DG chooses the edge $v^{\mathrm{lmax}-1-k}$. On the remaining part of the string (until it meets again a letter $u$), the DG algorithm has to decide between the edges $w_j \cdots w_{k-1} u$, $j = 1, \ldots, k - 1$, and the single character edge $w_j$. In each vertex it resolves the arising tie by choosing $w_j$. In this way we get $\mathrm{DG}(D, S_i) = (bd_0 d_1 \cdots d_{k-1})^i a$ which yields

$$\begin{aligned}
R_{DG}(D) &\ge \lim_{n \to \infty} \frac{\|\mathrm{DG}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|}\\
&= \lim_{i \to \infty} \frac{i\left(\mathrm{cmax} + \frac{\mathrm{Bt}}{2}\sum_{j=0}^{k-1}(2j + 1)\right) + \mathrm{cmax}}{\mathrm{cmax} + i\,\mathrm{cmin}}\\
&= \lim_{i \to \infty} \frac{i\left(\mathrm{cmax} + \frac{\mathrm{Bt}}{2}\cdot\frac{(2\,\mathrm{cmax} - \mathrm{Bt})^2}{4\,\mathrm{Bt}^2}\right) + \mathrm{cmax}}{\mathrm{cmax} + i\,\mathrm{cmin}}\\
&= \frac{(2\,\mathrm{cmax} + \mathrm{Bt})^2}{8\,\mathrm{Bt}\,\mathrm{cmin}}.
\end{aligned}$$

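Case III. Finally, for $(\mathrm{lmax} - \tfrac{3}{2})\,\mathrm{Bt} < \mathrm{cmax}$, we again consider a suffix dictionary on $\mathrm{lmax}$ letters called $u, w_1, \ldots, w_{\mathrm{lmax}-1}$:

| Source word | Code word | Weight |
|---|---|---|
| $u$ | $a$ | $\mathrm{cmax}$ |
| $w_j$, $j = 1, \ldots, \mathrm{lmax} - 1$ | $b_j$ | $\mathrm{cmax} - j\,\mathrm{Bt}$ |
| $uw_{\mathrm{lmax}-1}$ | $c$ | $\mathrm{cmax}$ |
| $w_j \cdots w_1 u$, $j = \mathrm{lmax} - 2, \ldots, 1$ | $d_j$ | $\mathrm{cmax}$ |
| $w_{\mathrm{lmax}-1} \cdots w_1 u$ | $e$ | $\mathrm{cmin}$ |

For the source string $S_i = u(w_{\mathrm{lmax}-1} \cdots w_1 u)^i$ of length $n = i\,\mathrm{lmax} + 1$, we get $\mathrm{OPT}(D, S_i) = ae^i$ and $\mathrm{DG}(D, S_i) = (cb_{\mathrm{lmax}-2} b_{\mathrm{lmax}-3} \cdots b_1)^i a$. Hence,

$$\begin{aligned}
R_{DG}(D) &\ge \lim_{n \to \infty} \frac{\|\mathrm{DG}(D, S_i)\|}{\|\mathrm{OPT}(D, S_i)\|}\\
&= \lim_{i \to \infty} \frac{i\left(\mathrm{cmax} + \sum_{j=1}^{\mathrm{lmax}-2}(\mathrm{cmax} - j\,\mathrm{Bt})\right) + \mathrm{cmax}}{\mathrm{cmax} + i\,\mathrm{cmin}}\\
&= \lim_{i \to \infty} \frac{i(\mathrm{lmax} - 1)\bigl(2\,\mathrm{cmax} - (\mathrm{lmax} - 2)\,\mathrm{Bt}\bigr) + 2\,\mathrm{cmax}}{2\,\mathrm{cmax} + 2i\,\mathrm{cmin}}\\
&= \frac{(\mathrm{lmax} - 1)\bigl(2\,\mathrm{cmax} - (\mathrm{lmax} - 2)\,\mathrm{Bt}\bigr)}{2\,\mathrm{cmin}},
\end{aligned}$$

which completes the proof.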


4. CONCLUSIONS

We investigated four properties of dictionaries: code-uniform, nonlengthening, suffix, and prefix. The case of a dictionary that is prefix and suffix at the same time seems neither practical nor interesting to us, and thus there remain twelve "reasonable" combinations of these properties.

The analysis of the worst-case behaviour of the longest matching algorithm on eight of these dictionary types was performed by Katajainen and Raita [3] (the eight types that are not prefix). The remaining four types were examined in this paper. Now for all twelve types tight bounds on the worst-case ratio of LM are known.

The differential greedy algorithm behaves identically to LM in the case of a code-uniform dictionary. Thus, there remain six dictionary types for which the behaviour of DG has to be analyzed. The two types that are prefix were analyzed in the current paper. The two types that are neither prefix nor suffix were analyzed in [3]. These four types turned out to be easy to analyze. For the case of a suffix, nonlengthening dictionary, [3] gave an upper bound and we provided the matching lower bound. For the case of a suffix dictionary, rather involved (but matching) lower and upper bounds were derived in Section 3 of the current paper.

Essentially, we see three lines for future research. We should

(1) identify other reasonable properties for dictionaries and investigate the consequences of these properties on the worst-case behaviour of LM and DG;

(2) construct better approximation algorithms that exploit the special properties of some dictionary types (e.g., it would be interesting to have a heuristic able to deal more efficiently with suffix dictionaries);

(3) extend the analysis to dynamic (adaptive) dictionaries (i.e., find tight bounds for the Ziv–Lempel algorithm [5]).

ACKNOWLEDGMENT

Gábor Galambos gratefully acknowledges the hospitality of the Technical University Graz during his visiting position at the Institute of Mathematics.

REFERENCES

1. J. Békési, G. Galambos, U. Pferschy, and G. J. Woeginger, The fractional greedy algorithm for data compression, Computing 56 (1996), 29–46.

2. M. E. Gonzalez-Smith and J. A. Storer, Parallel algorithms for data compression, J. Assoc. Comput. Mach. 32 (1985), 344–373.

3. J. Katajainen and T. Raita, An analysis of the longest matching and the greedy heuristic in text encoding, J. Assoc. Comput. Mach. 39 (1992), 281–294.

4. E. J. Schuegraf and H. S. Heaps, A comparison of algorithms for data base compression by use of fragments as language elements, Inform. Stor. Ret. 10 (1974), 309–319.

5. J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory 23 (1977), 337–343.