optimization of lz77 data compression algorithm
Post on 03-Jul-2015
525 Views
Preview:
DESCRIPTION
TRANSCRIPT
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
42
OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM
Hemraj Kumawat CSE,IIT Jodhpur, Jodhpur, India
Jitendra Chaudhary
CSE,IIT Jodhpur, Jodhpur, India
ABSTRACT
Data compression refers to reducing the amount of space needed to store data or reducing the
amount of time needed to transmit data. Many data compression techniques allow encoding the
compressed form of data with different compression ratio. In particular, in the case of LZ77
technique, it reduces the data concurrency of an input file. In the output of this technique it conveys
more information that is actually not needed in practical. Removing the extra information from the
encoded file that makes this algorithm more optimal. Our task is to identify how much extra
information it conveys and how can we minimize it so that there is no trouble at the time of
decoding. Basically the encoded output of LZ77 is the sequence of triplets (a structure of encoded
output) that is in binary and having fix size. For making the triplets of fix size, sometimes we are
creating unnecessary information. We present the method of variable triplet size as a way to improve
LZ77 compression and demonstrate it through many experiments. In our optimization algorithm we
are getting more compression ratio compare to the conventional LZ77 data compression algorithm.
Keywords: Look-ahead Buffer: The look-ahead buffer contains characters yet to be encoded. This buffer starts
where the Search buffer ends and during the algorithm the Search buffer extends into the look-ahead
buffer.
Match Length: The Match Length is the length of largest matching block in the look-ahead buffer.
These pairs are called triplets, consisting of offset, matching length and code word of character. If
the character is matching then next character code word is used, otherwise same character code word
is used.
Offset: The actual distance between the current position of the pointer and the look-ahead buffer is
known as offset.
Search Buffer: The Search Buffer represents the most recently encoded characters.
Sliding Window: The Structure for Data manipulation, in which the Data is held The Sliding
Window, is divided into two parts as Search buffer and look-ahead buffer.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 5, September – October (2013), pp. 42-48 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com
IJCET
© I A E M E
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
43
1. INTRODUCTION
LZ77 algorithm achieves compression by replacing repeated occurrences of data with
references to a single copy of that data existing earlier in the input (uncompressed) data stream. A
match is encoded by a pair of numbers called a length-distance pair. Some common convention and
definition of the words that we are using in this paper.
2. CONVENTIONAL LZ77 ALGORITHM
LZ77 compression algorithm exploits the fact that words and phrases within a text file are
likely to be repeated. When there is repetition, they can be encoded as a pointer to an earlier
occurrence, with the pointer accompanied by the number of characters to be matched. It is a very
simple adaptive scheme that requires no prior knowledge of the source and seems to require no
assumptions about the characteristics of the source.
In the LZ77 approach the dictionary is simply a portion of the previously encoded sequence.
The encoder examines the input sequence through a sliding window which consists of two parts: a
search buffer that contains a portion of the recently encoded sequence and a look ahead buffer that
contains the next portion of the sequence to be encoded. The algorithm searches the sliding window
for the longest match with the beginning of the look-ahead buffer and outputs a reference (a pointer)
to that match. It is possible that there is no match at all, so the output cannot contain just pointers. In
LZ77 the reference is always represented as a triplet<o,l,c>, where ‘o’ is an offset to the match, ‘l’ is
length of the match and ‘c’ is the next symbol after the match. If there is no match, the algorithm
outputs a null-pointer (both the offset and the match length equal to 0) and the first symbol in the
look-ahead buffer. The values of an offset to a match and length must be limited to some maximum
constant. For this algorithm we have to define the length of the look-ahead buffer, search buffer. The
symbol is usually encoded in 8 bit. More over the compression performance of LZ77 mainly depends
on these values. Generally the search buffer length is more than the look-ahead-buffer size. So the
total triplet size:
While ( look-ahead Buffer not empty) {
get a reference (position, length) to longest match;
if (length > 0)
{
output (position, length, next symbol);
shift the window length+1 positions along;
}
else {
output (0, 0, first symbol in the look-ahead buffer);
shift the window 1 character along;
}
}
ST= [⌈log2(search buffer length)⌉] +[⌈log2(look-ahead buffer length)⌉]+8
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
44
We can have better understanding with an example- “aacaacabcabaaac” For this example the size of
look-ahead buffer is 6 and search buffer is 4.
Triple Binary -- <0, 0, a> 00000011000001
-- <1, 1, c> 00100111000011
-- < 3, 4, b> 01110011000010
--< 3, 3, a> 01101111000001
-- <1, 2, c> 00101011000011
Sliding window( Size: 6 )
Longest match
Next Character
The triplet length for this example is 14(3+3+8). So here the encoded binary string of this example.
0000001100000100100111000011011100110000100110111100000100101011000011
triplet triplet triplet triplet triplet
-------------------|-------------------|-----------------------|----------------------|--------------------|
The decoding is much faster than the encoding in this process because we have to move our
pointer with fixed length (triple length-14 for this example) and it is one of the important features of
this process. From this way we get the triplets .Now we can easily decode to original data by
reversing the encoding process.
3. OPTIMIZATION OF LZ77
As we have described earlier the triplets of LZ77 algorithm have fix size. In the case, when
offset is equal to the matching length we can modify the structure of triplets and represent it with the
new structure that have only <l,c> where l is the matching length and ‘c’ is the next symbol after the
match .We called this new structure as “doublet” in this paper .All the things are same as the
conventional LZ77 algorithm except replacing the triplet with doublet in the case of matching length
equal to offset length.
��= log2(look-ahead buffer length)⌉] + 8
By replacing the triplet with doublet we are saving [⌈log2(search buffer length)⌉] number of
bits per matching case. The exact algorithms is described in the block
a a c a a c a b c a b a a a c
a a c a a c a b c a b a a a c
a a c a a c a b c a b a a a c
a a c a a c a b c a b a a a c
a a c a a c a b c a b a a a c
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
45
While ( look-ahead Buffer not empty) {
get a reference (position, length) to longest match;
if (length > 0)
{
If (length== position){
output (length, next symbol);
shift the window length+1 positions along;
}
else{
output (position, length, next symbol);
shift the window length+1 positions along;
}}
else {
output (0, first symbol in the look-ahead buffer);
shift the window 1 character along;
}
}
We will have better understanding of this optimized algorithm with the previous section
example- “aacaacabcabaaac”
Triplet/doublet Binary
--<0, a> 00011000001
--<1, c> 00111000011
--< 3, 4, b> 01110011000010
--< 3, a> 01111000001
--<1, 2, c> 00101011000011
The length of triplet for this example is 14(3+3+8) and the doublet length is 11(3+8). So here
the encoded binary string of this example
0001100000100111000011011100110000100111100000100101011000011
doublet doublet triplet doublet triplet
-----------------|------------------|---------------|-----------------|-----------------------|
The decoding process is slower than the conventional LZ77 Decoding. In this algorithm we
have to move our pointer with the variable size (doublet and triplet length) .But the problem in
decoding is how we will identify which one is doublet and which one is triplet. So the decoding
process is described in the next section. Once we decode encoded file to triplets and doublet then we
can easily get to original data by reversing the encoding (data to triplet/doublet) process.
a a c a A c A b c a b a A a c
a a c a A c A b c a b a A a c
a a c a A c a b c a b a A a c
a a c a A c A b c a b a a a c
a a c a A c A b c a b a A a c
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
46
4. DECODING PROCESS
For decoding the binary string that contains doublets and triplets, first we have to identify
which one is doublet’s binary string and which one is triplet binary string .To identify this we are
using the concept of delimiter that will depend on the size of the sliding window length. We will put
the delimiter before the doublet binary string.
The effective doublet length is: doublet size + delimiter size.
Now each binary substring is starting from the delimiter .Now for decoding we move the pointer
from starting of the binary string and check whether the first consecutive k bits are equal to the
delimiter or not where k is the length of delimiter .If it is equal then move the current pointer with
the effective doublet length ahead and make this substring to the doublet substring otherwise move
the pointer with the triplet length and make this substring to the triplet substring.
So the decoding algorithm is given below
Effective doublet length = doublet size + delimiter length
//delimiter is a binary string that depends on the offset length
Starting the moving pointer from 0
While(the moving pointer m<total binary length){
if((m to m+ delimiter’s length substring)==delimiter )
{
Move the pointer m with effective doublet length ahead.
Get the doublet binary substring form (m + delimiter’s length ) to (m + effective
length)
}
else{
Move the pointer m with triplet length ahead.
Get the triplet binary substring form m to (m + triplet length)}
So here the above example’s binary string after using the delimiter before the doublets.(here we are
using the delimiter=”1”)
1000110000011001110000110111001100001010111100000100101011000011
doublet double triplet doublet triplet
------------------|----------------|----------------|----------------------|----------------------|
So the substring of doublet length just after the blue one is the doublet binary substring and the rest
substrings of triple size are the triplet binary string .We can easily identify the delimiter (blue one for
this example) by moving the pointer with appropriate length according the doublet and triplets.
But what will happen when the offset’s first k bits are equal to the delimiter then this algorithm is not
valid, where k is the length of delimiter .We cannot decompress the binary string from this delimiter.
Let’s see this from an example:
For maximum search buffer length=31 and look-ahead buffer=7 and delimiter =”111”
Binary string
1110001100000111100011000010111010101100110111100111000111
|___|---doublet----||___|--doublet----||___|----triplet----------||___|--doublet---
From the starting we will see that first two strings are doublet string. The third substring is
actually triplet substring but from our algorithm it reads out as the doublet substring because the first
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
47
3 bits are equal to the delimiter. This will affect the rest decoding process .Eventually we will get the
wrong decoding output.
So we are bounding our delimiter size by:
Delimiter size = no. of first consecutive 1’s in binary representation of sliding window length +1
All the bits in the delimiter are 1’s. From this Delimiter string we are getting the correct decoding
output.
5. ALGORITHMS ANALYSIS
C: Total no of Matching (offset=Matching Length)
CR: Compression ratio
Op LZ77: Optimized LZ77 Algorithm
LZ77: Conventional LZ77 Algorithm
D: % Difference between conventional LZ77 and Optimized LZ77= (Op LZ77 – LZ77) / LZ77
Max. Offset Value=maximum offset value= search buffer length
6. CONCLUSION
As we can see from the above analysis that for maximum offset value (111) the improved
compression ratio is 1.896 and for maximum offset value (223) the improved compression ratio is
1.154. From this analysis result we can see that maximum offset value is increasing, the improved
compression ratio is decreasing. When we increase the search buffer length then the total number of
matching (offset=matching length) will be decreased. The improved compression is directly
proportional to the matching. The improved compression will be more when the matching is more.
So generally this optimal algorithm is more effective, where either conventional LZ77 algorithm is
not so optimal or the input has less repetition.
Serial
Number File size
(in bytes)
Comparison between conventional LZ77 and Optimized LZ77
Max. Offset Value=111 Max. Offset Value=223 Max. Offset Value=447
C CR
C CR
C CR
LZ77 Op
LZ77 D LZ77
Op
LZ77 D LZ77
Op
LZ77 D
1 61,305 2208 1.180 1.200 2.280 1099 1.293 1.313 1.547 621 1.400 1.417 1.130
2 4,21,144 11417 1.210 1.230 1.770 5514 1.318 1.333 1.153 3104 1.420 1.433 0.842
3 7,79,959 17044 1.197 1.213 1.370 6412 1.310 1.319 0.700 3710 1.386 1.393 0.521
4 73,246 2820 1.189 1.169 1.690 1398 1.265 1.275 0.800 742 1.334 1.338 0.341
5 1,09,684 3755 1.200 1.226 2.180 1857 1.317 1.336 1.469 1063 1.423 1.440 1.080
6 6,04,919 19281 1.264 1.291 2.160 8246 1.390 1.407 1.258 4053 1.507 1.519 0.800
7 1,51,610 3489 1.236 1.255 1.497 1430 1.349 1.360 0.830 779 1.453 1.462 0.591
8 1,30,725 3309 1.222 1.242 1.635 1494 1.347 1.360 1.020 750 1.465 1.475 0.660
9 4,27,180 14369 1.254 1.282 2.253 6673 1.389 1.409 1.437 3388 1.517 1.531 0.950
10 6,78,036 23537 1.157 1.182 2.130 10763 1.263 1.279 1.320 5683 1.362 1.374 0.903
Average 10123 1.211 1.229 1.896 4488 1.324 1.339 1.154 2389 1.426 1.438 0.783
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
48
REFERENCES
[1] IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-23, NO. 3, MAY 1977
337A Universal Algorithm for Sequential Data Compression
[2] International Symposium on Information Theory and its Applications, ISITA2006 Seoul,
Korea, October 29–November 1, 2006 Improving LZ77 Data Compression using Bit Recycling
[3] International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011 68 A
Comparative Study Of Text Compression Algorithms
[4] Northwestern University Department of Electrical and Computer Engineering ECE 428:
Information Theory Spring 2004
[5] http://www.stringology.org/DataCompression/lz77/index_en.html
[6] http://www.zlib.net/feldspar.html
• Links of the text files used in Analysis are given below (sorted by the index of the table) 1. http://www.gutenberg.org/cache/epub/28466/pg28466.txt
2. http://www.gutenberg.org/cache/epub/16728/pg16728.txt
3. http://www.gutenberg.org/cache/epub/9173/pg9173.txt
4. http://www.gutenberg.org/cache/epub/32482/pg32482.txt
5. http://www.gutenberg.org/files/25731/25731-0.txt
6. http://www.gutenberg.org/cache/epub/26598/pg26598.txt
7. http://www.gutenberg.org/cache/epub/28569/pg28569.txt
8. http://www.gutenberg.org/cache/epub/32962/pg32962.txt
9. http://www.gutenberg.org/cache/epub/23319/pg23319.txt
10. http://www.gutenberg.org/cache/epub/101/pg101.txt
top related