efficient decodable and searchable natural language adaptive compression salvador de bahía. aug...
Post on 19-Dec-2015
217 Views
Preview:
TRANSCRIPT
Efficient Decodable and Searchable Natural Language Adaptive
Compression
Salvador de Bahía. Aug 2005
28th Annual International ACM SIGIR
Gonzalo Navarro (DCC – U. Chile)
Nieves R. Brisaboa (UDC – Spain)
José R. Paramá (UDC – Spain)
Antonio Fariña (UDC – Spain)
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 2
— Semistatic CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas
— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 3
IntroductionIntroduction
Why compression? My huge disk is cheap. Compression reduces not only space !
— Disk access time (your huge disk is slow)
— Transmission time (networks are even slower)
— Search time (less data to process)
Compression can be integrated into Text Retrieval Systems, improving their performance in all aspects
Word-based semistatic models proved successfulWord-based semistatic models proved successful MG system (Witten, Moffat, Bell) Byte-oriented Huffman (Moura, N., Ziviani, Baeza-Yates) End-Tagged Dense Codes (Brisaboa, Fariña, N.)
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 4
IntroductionIntroduction
But what about sending results to a remote receiver?— Uncompressed form (wastes network bandwidth)
— Keep in compressed form (plus the large model of the corpus)
— Semistatic recompression (uneffective for small files)
— Dynamic recompressionDynamic recompression (effective, permits a conversation)Gzip (40%), DETDC (32%), arithmetic encoding (25%), …
Semistatic compression is preferred in Text Databases
� Reduces their size: 25%-30% compression ratio
� Permits direct access and local decompression
� Permits direct search on the compressed text� Searches are up to 8 times faster.
� Decompression is only needed for presenting results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 5
IntroductionIntroduction
Dynamic compression: Receiver
Decompression (to present results)
Keyword search (to classify documents, alert users, …)
Dictionary-based (Ziv-Lempel):
Fast decompression
Decent direct searching
Not so good compression ratios
Bad for low bandwidth networks
Statistical (arithmetic, PPM):
Good compression ratios
Costly at sender and receiver side
Direct searching impossible
Bad for weak receivers
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 6
IntroductionIntroduction Our contribution: DLETC
—New dynamic compressor for text databasesSimple to understand and implement
—StatisticalGood compression ratio
Good for low-bandwidth networks
—Easier to handle by the receiverBreaks the usual sender/receiver symmetry
Very fast decompression
Good for weak receivers
—Searchable without decompressingFirst statistical method permitting direct search
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 7
— Semistatic CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas
— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 8
Semistatic compressionSemistatic compression
Statistical semistatic compression
— 2 passes
Gather frequencies of words and encoding
Compression (substitution word codeword)
— Association between source symbol codeword does not change across the text
— Direct search is possible— Most representative method: Huffman.
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 9
Word-based HuffmanWord-based Huffman
Moffat proposed the use of words instead of characters as the coding alphabet
Distribution of words more biased than that of characters
0
2
4
6
8
10
12
14
16
18
E A O L S N D R U I T C P M Y Q B H G F V J Ñ Z X K W
n
Word freq
distribution
Character freq distribution (Spanish)
Compression ratio about 25% (English texts) This idea joins the requirements of compression
algorithms and of Information Retrieval systems
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 10
Plain Huffman & Tagged HuffmanPlain Huffman & Tagged Huffman
Moura, Navarro, Ziviani and Baeza: — 2 new techniques: Plain Huffman and Tagged Huffman
Common features— Huffman-based
— Word-based
— Byte- rather than bit-oriented (compression ratio ±30%)
Plain Huffman = Huffman over bytes (256-ary tree) Tagged Huffman flags the beginning of each codeword
First bit is:•“1” for 1st bit of 1st byte•“0” for 1st bit remaining bytes
1xxxxxxx 0xxxxxxx 0xxxxxxx
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 11
Plain Huffman & Tagged HuffmanPlain Huffman & Tagged Huffman
Differences:
— Plain Huffman. (tree of arity 2b=256)
— Tagged Huffman. (tree of arity 2b-1 =128)
Loss of compression ratio (3.5 perc. points)
Tag marks beginning of codewords
Direct search (improved searches)Boyer-Moore
Random access (random decompression)
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 12
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas
— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 13
End-Tagged Dense CodeEnd-Tagged Dense Code
Small change: Flag signal the end of a codeword
Prefix code independently of the remaining 7 bits of each byte
Huffman tree is not needed: Dense coding.• Improving Tagged Huffman compression ratio by 2.5 perc. points
Flag bit Same Tagged Huffman searching capabilities
First bit is: “1” --> for 1st bit of last byte “0” --> for 1st bit remaining bytes
1xxxxxxx
0xxxxxxx
Two-byte codeword 1xxxxxxx0xxxxxxx
Three-byte codeword 1xxxxxxx0xxxxxxx0xxxxxxx
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 14
End-Tagged Dense CodeEnd-Tagged Dense Code Encoding scheme
.....
Words from 128+ 1282+1 to
128 +1282 +1283 use three bytes
(1283 = 221 codewords)
00000000:00000000:10000000
……
01111111:01111111:11111111
Words from 128+1 to 128+1282 are encoded using two bytes
(1282 = 214 codewords)
00000000:10000000
…..
01111111:11111111
First 128 words are encoded using one byte (27 codewords)
1000000010000001…..11111111
Codewords depend on the rank, not on the frequency
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 15
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas
— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 16
Dynamic codesDynamic codes
Statistical dynamic compression
— Dynamic modelingFrequencies are computed dynamically as text arrives.
— Sender and receiver…
Start with an empty model
Adapt their model during the process.
Both processes are symmetric
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 17
Dynamic End-Tagged Dense CodeDynamic End-Tagged Dense Code
Symmetric sender and receiver. It uses the on-the-fly encoding and decoding algorithms
— Requirement: maintaining the vocabulary sorted by freq.
speed
Sender’s vocabulary
1 the 8
2 is 4
3 code 2
4 tag 2
5 Huffman 2
6
7 ratio 1
8 1
ETDC
speed C8 = encode (8)
Exampleword freq
speed
ETDC
21
send codeword increase frequencyexchange with top
12 The codeword given to a word si may
change each time si is processed.
DETDC …
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 18
rose
Dynamic End-Tagged Dense Code: transmissionDynamic End-Tagged Dense Code: transmission
senderreceiver
Example: … a rose roserose is a very nice one
rose
a126
127
128
129
1
C127
rose
1 a
rose
126
127
128
129
1
1
vocabulary
word freq
vocabulary
word freq
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 19
is
is
is 1
126
127
128
129
vocabulary
word freq
Dynamic End-Tagged Dense Code: transmissionDynamic End-Tagged Dense Code: transmission
1
rose
senderreceiver
is
a
2
1
rose
a
2
is
is 1
Example: … a rose rose isis a very nice one
126
127
128
129
vocabulary
word freq
Cnew
0|128
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 20
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 21
DLETDC – Basic IdeasDLETDC – Basic Ideas
The idea:— Break the correspondence between position and codeword— Codewords are mantained explicitly by the sender— SENDER: After processing a symbol si …
Exchange si top(fi) to keep the vocabulary sorted,
Keep original codewords unless codeword length should vary
Otherwise swap, and notify the receiver.
— 2 fixed slots are reserved: new-symbol, ascii word
Swap, Ci, CodeInj.
New-symbolswap
0
1
128129
25500
127128129
128129
1270
1651116512
2550 128
016513 0 129
1272113662 127 2542113663 127 127 255
02113664 0 0 128
1272113661 127 253
Rank
1-byte
2-byte
3-byte
— RECEIVER: does not maintain frequencies
Decodes codewords
Adds new symbols (Cnew-symbol)
Only swaps when it decodes a Cswap
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 22
C127
very 1
126
127
128
129
vocabulary
word
DLETDC: transmissionDLETDC: transmission
senderreceiver
a
1
1 is
a
a
very
Example: … a rose rose is aa very nice one
126
127
128
129
vocabulary
word freq
125 rose125 rose 2 C125
is C126
a C127
C128
1-by
te2-
byte
No codeword swap needed since: “a” C127 and “is” C126
code
2
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 23
128
1127
2
CswapC128C126
1
126
127
128
129
vocabulary
word
DLETDC: transmissionDLETDC: transmission
senderreceiver
very
is
a
very
very
Example: … a rose rose is a veryvery nice one
126
129
vocabulary
word freq
125 rose125 rose 2 C125
a C127
is C126
very C128
1-by
te2-
byte
A codeword swap is needed
code
2
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 24
128
130
Cnew
128
2127
2
1
126
127
129
vocabulary
word
DLETDC: transmissionDLETDC: transmission
senderreceiver
nice
very
a
nice
is
Example: … a rose rose is a very nicenice one
126
129
vocabulary
word freq
125 rose125 rose 2 C125
a C127
very C126
is C128
1-by
te2-
byte
A new word case
code
nice
130
nice 1 C129 nice
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 25
DLETDC – Evolution of swaps.DLETDC – Evolution of swaps.
Swaps worsen compression ratio and processing time How many swaps occur in practice?
ZIFF corpus4,6x107 words
237,622 diff words
Between lengts 1 and 2Between lengts 2 and 3
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 26
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— The code— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 27
DLETDC – Searches – Keyword filteringDLETDC – Searches – Keyword filtering Multipattern Horspool
— A trie is traversed to recognize text backwards
Search process:
— Initial phaseSearch for the ASCII pattern: C_new | (pattern)
— When foundReplace in the trie by its codeword: C_pat
Need also to monitor C_new to know C_pat
— Watch the swaps
Track C_swap C_pat * and C_swap * C_pat It suffices to search for:
• C_new• C_pat
and make some neighborhood checks
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 28
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— The code— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 29
Empirical ResultsEmpirical Results
We used some text collections from TREC-2 & TREC- 4, to perform the experiments
Results in:— Compression ratio
— Compression time
— Decompression time
— Search speed
CORPUS size (bytes) # words Diff. words
CALGARY 2,131,045 528,611 30,995
FT91 14,749,355 3,135,383 75,681
CR 51,085,545 10,230,907 117,713
FT92 175,449,235 36,803,204 284,892
ZIFF 185,220,215 40,866,492 237,622
FT93 197,586,294 42,063,804 291,427
FT94 203,783,923 43,335,126 295,018
AP 250,714,271 53,349,620 269,141
ALL FT 591,568,807 124,971,944 577,352
ALL 1,080,719,883 229,596,845 886,190— Dual Intel Pentium-III 800 Mhz with 768Mb RAM.
Debian GNU/Linux (kernel 2.2.19)
gcc 3.3.3 20040429 and –O9 optimizations
Time represents CPU user-time
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 30
Empirical Results. Compression RatioEmpirical Results. Compression Ratio
gzip DLETDC DETDC arith bzip2
26
27
28
29
30
31
32
33
34
35co
mp
ress
ion
ra
tio (
%)
technique
35.09 33.69 33.66 27.98 25.98
Comparison with other techniques.
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 31
Empirical Results. Compression timeEmpirical Results. Compression time
gzip DLETDC DETDC arith bzip2
100
200
300
400
500
600
700C
om
pre
ssio
n ti
me
(se
c)
technique
431.5 154.6 150.1 510.0 1342.0
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 32
Empirical Results. Decompression timeEmpirical Results. Decompression time
gzip DLETDC DETDC arith bzip2
50
100
150
200
250
300
350
400
De
om
pre
ssio
n ti
me
(se
c)
technique
59.8 49.5 75.8 394.1 432.4
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 33
Search speedSearch speed
5 10 15 20 25
2
4
6
8
10
12
14
16
18
20
Avg
se
arc
h ti
me
(se
c)
number of Patterns
Multi-pattern searches
Plain Text
DLETDC
3.5
2
3.4
6
4.3
8
4.4
9
5.0
1
10
.14
13
.80
15
.84
17
.61
18
.70
Avg time over 10,000 random searches for words of length >= 6
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 34
— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman
End-Tagged Dense Code (ETDC)
— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)
OutlineOutlineOutlineOutline
Conclusions
Word-based compression
Introduction
Dynamic Lightweight ETDC (DLETDC)— Basic Ideas
— Searching DLETDC— Empirical Results
August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 35
ConclusionsConclusions
We have presented DLETDC— A new dynamic “dense” compressor.
— Designed for low bandwidth and weak receivers.
Competitive compression ratio (around 32-34%, < gzip) Fast at compression (much faster than other adaptive) Very fast at decompression (even faster than gunzip!) Very fast at searching (>3 times faster than plain searching)
DLETDC breaks the common symmetry between sender and receiver in adaptive techniques.— Lightweight and fast decompressor/receiver.
— A dynamic searchable technique.
top related