![Page 1: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/1.jpg)
The PAQ4 Data Compressor
Matt Mahoney
Florida Tech.
![Page 2: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/2.jpg)
Outline
• Data compression background
• The PAQ4 compressor
• Modeling NASA valve data
• History of PAQ4 development
![Page 3: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/3.jpg)
Data Compression Background
• Lossy vs. lossless
• Theoretical limits on lossless compression
• Difficulty of modeling data
• Current compression algorithms
![Page 4: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/4.jpg)
Lossy vs. Lossless
• Lossy compression discards unimportant information
– NTSC (color TV), JPEG, MPEG discard imperceptible image details
– MP3 discards inaudible details
• Losslessly compressed data can be restored exactly
![Page 5: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/5.jpg)
Theoretical Limits on Lossless Compression
• Cannot compress random data
• Cannot compress recursively
• Cannot compress every possible message
– Every compression algorithm must expand some messages by at least 1 bit
• Cannot compress x better than log2 1/P(x) bits on average (Shannon, 1949)
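As a rough illustration of the Shannon bound, the sketch below computes the ideal code length log2 1/P(x) for one event and its average over a distribution. The function names and the two example distributions are assumptions made for illustration only.

```python
import math

def ideal_code_length(p):
    """Bits needed to code an event of probability p: log2(1/p)."""
    return math.log2(1.0 / p)

def entropy(dist):
    """Average ideal code length in bits per symbol for a distribution."""
    return sum(p * ideal_code_length(p) for p in dist.values() if p > 0)

# Hypothetical example: four equally likely symbols need 2 bits each.
uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
print(entropy(uniform))

# A skewed source has lower entropy, so it can be coded in fewer bits.
skewed = {"a": 2 / 3, "b": 1 / 3}
print(entropy(skewed))
```

No lossless coder can do better than these averages for a source that really follows the given distribution.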
![Page 6: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/6.jpg)
Difficulty of Modeling
• In general, the probability distribution P of a source is unknown
• Estimating P is called modeling
• Modeling is hard
– Text: as hard as AI
– Encrypted data: as hard as cryptanalysis
![Page 7: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/7.jpg)
Text compression is as hard as passing the Turing test for AI
• P(x) = probability of a human dialogue x (known implicitly by humans)
• A machine knowing P(A|Q) = P(QA)/P(Q) would be indistinguishable from a human
• Entropy of English ≈ 1 bit per character (Shannon, 1950)
– Best compression: 1.2 to 2 bpc (depending on input size)
Computer
Q: Are you human?
A: Yes
![Page 8: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/8.jpg)
Compressing encrypted data is equivalent to breaking the encryption
• Example: x = 1,000,000 zero bytes encrypted with AES in CBC mode and key “foobar”
• The encrypted data passes all tests for statistical randomness (not compressible)
• C(x) = 65 bytes, using the English description of x as the code
• Finding C(x) requires guessing the key
![Page 9: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/9.jpg)
Nevertheless, some common data is compressible
![Page 10: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/10.jpg)
Redundancy in English text
• Letter frequency: P(e) > P(q)
– so “e” is assigned a shorter code
• Word frequency: P(the) > P(eth)
• Semantic constraints: P(drink tea) > P(drink air)
• Syntactic constraints: P(of the) > P(the of)
![Page 11: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/11.jpg)
Redundancy in images (pic from Calgary corpus)
Adjacent pixels are often the same color, P(000111) > P(011010)
![Page 12: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/12.jpg)
Redundancy in the Calgary corpus
Distance back to last match of length 1, 2, 4, or 8
![Page 13: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/13.jpg)
Redundancy in DNA
tcgggtcaataaaattattaaagccgcgttttaacaccaccgggcgtttctgccagtgacgttcaagaaaatcgggccattaagagtgagttggtattccatgttaagcatccacaggctggtatctgcaaccgattataacggatgcttaacgtaatcgtgaagtatgggcatatttattcatctttcggcgcagaatgctggcgaccaaaaatcacctccatccgcgcaccgcccgcatgctctctccggcgacgattttaccctcatattgctcggtgatttcgcgggctacc
P(a)=P(t)=P(c)=P(g)=1/4 (2 bpc) e.coli (1.92 bpc?)
![Page 14: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/14.jpg)
Some data compression methods
• LZ77 (gzip)
– Repeated strings replaced with pointers back to previous occurrence
• LZW (compress, gif)
– Repeated strings replaced with index into dictionary
– LZ decompression is very fast
• PPM (prediction by partial match)
– Characters are arithmetic encoded based on statistics of longest matching context
– Slower, but better compression
![Page 15: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/15.jpg)
LZ77 Example
the cat in the hat
...a...a...a...or?
Sub-optimal compression due to redundancy in LZ77 coding
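The pointer idea can be sketched with a toy greedy LZ77 coder. This is a minimal illustration, not gzip's actual algorithm: the function names, the token format (a literal character or an `(offset, length)` pair) and the minimum match length of 2 are all assumptions.

```python
def lz77_encode(text, min_len=2):
    """Greedy LZ77: emit literals or (offset, length) pointers to earlier text."""
    tokens, i = [], 0
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(i):  # try every earlier start position
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def lz77_decode(tokens):
    """Rebuild the text by copying from already decoded output."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length = tok
            for _ in range(length):
                out.append(out[-off])  # copy one symbol at a time (allows overlap)
        else:
            out.append(tok)
    return "".join(out)

tokens = lz77_encode("the cat in the hat")
print(tokens)
print(lz77_decode(tokens))
```

When a string has several earlier occurrences, more than one pointer decodes to the same text. Code space spent on those equivalent pointers is one source of the coding redundancy noted above.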
![Page 16: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/16.jpg)
LZW Example
the cat in the hat
atthein
Sub-optimal compression due to parsing ambiguity
...ab...bc...abc...
aabbbcc ab+c or a+bc?
![Page 17: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/17.jpg)
Predictive Arithmetic Compression (optimal)
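[Diagram. Compressor: the input and a probability p from “Predict next symbol” feed the Arithmetic Coder, which emits compressed data. Decompressor: the compressed data and p from an identical “Predict next symbol” feed the Arithmetic Decoder, which emits the output.]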
![Page 18: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/18.jpg)
Arithmetic Coding
• Maps string x into C(x) ∈ [0,1), represented as a high precision binary fraction
• P(y < x) < C(x) < P(y ≤ x)
– < is a lexicographical ordering
• There exists a C(x) with at most a log2 1/P(x) + 1 bit representation
– Optimal within 1 bit of Shannon limit
• Can be computed incrementally
– As characters of x are read, the bounds tighten
– As the bounds tighten, the high order bits of C(x) can be output
![Page 19: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/19.jpg)
Arithmetic coding example
• P(a) = 2/3, P(b) = 1/3
– We can output “1” after the first “b”
[Figure: the interval [0,1) subdivided by successive symbols (a, b; aa, ab, ba, bb; aaa through bbb), with the binary fractions 0.01, 0.1 and 0.11 marked]
aaa = “”
aba = 1
baa = 11
bbb = 11111
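The codes in this example can be checked with a short sketch that narrows the interval one symbol at a time and then searches for the shortest binary fraction inside it. The helper names are hypothetical, and exact fractions replace the incremental fixed-precision arithmetic a real coder would use.

```python
from fractions import Fraction
import math

P = {"a": Fraction(2, 3), "b": Fraction(1, 3)}

def interval(s, probs=P):
    """Return [low, high) for string s, narrowing once per symbol."""
    low, width = Fraction(0), Fraction(1)
    for ch in s:
        start = Fraction(0)
        for sym in sorted(probs):  # lexicographic order of symbols
            if sym == ch:
                break
            start += probs[sym]
        low += width * start
        width *= probs[ch]
    return low, low + width

def shortest_code(low, high):
    """Shortest bit string c such that the binary fraction 0.c lies in [low, high)."""
    n = 0
    while True:
        k = math.ceil(low * 2**n)  # smallest n-bit numerator at or above low
        if Fraction(k, 2**n) < high:
            return format(k, "b").zfill(n) if n else ""
        n += 1

for s in ["aaa", "aba", "baa", "bbb"]:
    print(s, repr(shortest_code(*interval(s))))
```

Every string beginning with “b” falls in [2/3, 1), which lies entirely above 1/2, so the first output bit is already known to be 1.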
![Page 20: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/20.jpg)
Prediction by Partial Match (PPM)
Guess next letter by matching longest context
the cat in the ha?
Longest context match is “a”
Next letter in context “a” is “t”
the cat in th?
Longest context match is “th”
Next letter in context “th” is “e”
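The two guesses above can be reproduced with a simplified sketch that only finds the longest previously seen context and returns the character that followed it. Real PPM keeps per-context statistics and escape probabilities, so this is an illustration of the matching step only; `predict_next` and `max_order` are assumed names.

```python
def predict_next(text, max_order=8):
    """Return (context, predicted next char) using the longest earlier match."""
    for n in range(min(max_order, len(text) - 1), 0, -1):
        context = text[-n:]
        # Most recent earlier occurrence that is followed by at least one char.
        i = text.rfind(context, 0, len(text) - 1)
        if i >= 0:
            return context, text[i + n]
    return "", None  # no context has been seen before

print(predict_next("the cat in the ha"))
print(predict_next("the cat in th"))
```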
![Page 21: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/21.jpg)
How do you mix old and new evidence?
..abx...abx...abx...aby...ab?
P(x) = ?
P(y) = ?
![Page 22: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/22.jpg)
How do you mix evidence from contexts of different lengths?
..abcx...bcy...cy...abc?
P(x) = ?
P(y) = ?
P(z) = ? (unseen but not impossible)
![Page 23: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/23.jpg)
PAQ4 Overview
• Predictive arithmetic coder
• Predicts 1 bit at a time
• 19 models make independent predictions
– Most models favor newer data
• Weighted average of model predictions
– Weights adapted by gradient descent
• SSE adjusts final probability (Osnach)
• Mixer and SSE are context sensitive
![Page 24: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/24.jpg)
PAQ4
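[Diagram: Input Data feeds the Models; each Model passes a prediction p to the Mixer; the Mixer passes p to SSE; SSE passes p to the Arithmetic Coder, which produces the Compressed Data. A context input also feeds the Mixer and SSE.]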
![Page 25: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/25.jpg)
19 Models
• Fixed (P(1) = ½)
• n-gram, n = 1 to 8 bytes
• Match model for n > 8
• 1-word context (white space boundary)
• Sparse 2-byte contexts (skips a byte) (Osnach)
• Table models (2 above, or 1 above and left)
• 8 predictions per byte
– Context normally begins on a byte boundary
![Page 26: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/26.jpg)
n-gram and sparse contexts
.......x? .....x.x?
......xx? ....x..x?
.....xxx? ....x.x.?
....xxxx? x...x...?
...xxxxx? .....xx.?
..xxxxxx? ....xx..?
.xxxxxxx? ... word? (begins after space)
xxxxxxxx? xxxxxxxxxx? (variable length > 8)
![Page 27: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/27.jpg)
Record (or Table) Model
• Find a byte repeated 4 times with same interval, e.g. ..x..x..x..x
• If interval is at least 3, assume a table
• 2 models:
– first and second bytes above
– bytes above and left
...x...
...x...
...?
.......
...x...
..x?
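The detection rule and the two table contexts can be sketched as follows. This is an assumed reading of the slide, not PAQ4's source: the function names are hypothetical, and the detector simply returns the first interval at which some byte value occurs 4 times with equal spacing.

```python
def detect_record_length(data, min_interval=3):
    """Return the interval at which some byte repeats 4 times, or None."""
    last_pos, last_gap, run = {}, {}, {}
    for i, b in enumerate(data):
        if b in last_pos:
            gap = i - last_pos[b]
            # Count consecutive equal gaps; 3 equal gaps means 4 occurrences.
            run[b] = (run.get(b, 0) + 1) if last_gap.get(b) == gap else 1
            last_gap[b] = gap
            if run[b] >= 3 and gap >= min_interval:
                return gap
        last_pos[b] = i
    return None

def table_contexts(data, i, record_len):
    """Contexts for predicting data[i]: (two bytes above), (byte above, byte to the left)."""
    above = data[i - record_len]
    above2 = data[i - 2 * record_len]
    left = data[i - 1]
    return (above, above2), (above, left)

rows = b"a1xb2xc3xd4x"  # hypothetical table with 3-byte records
print(detect_record_length(rows))
print(table_contexts(rows, 11, 3))
```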
![Page 28: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/28.jpg)
Nonstationary counter model
• Count 0 and 1 bits observed in each context
• Discard from the opposite count:
– If more than 2 then discard ½ of the excess
• Favors newer data and highly predictive contexts
![Page 29: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/29.jpg)
Nonstationary counter example

| Input (in some context) | n0 | n1 | p(1) |
|---|---|---|---|
| 0000000000 | 10 | 0 | 0/10 |
| 00000000001 | 6 | 1 | 1/7 |
| 000000000011 | 4 | 2 | 2/6 |
| 0000000000111 | 3 | 3 | 3/6 |
| 00000000001111 | 2 | 4 | 4/6 |
| 000000000011111 | 2 | 5 | 5/7 |
| 0000000000111111 | 2 | 6 | 6/8 |
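The counts in this example follow from a simple update rule, sketched below. The function name is an assumption, and the rounding (what remains of the excess is rounded down) is chosen to match the example rather than taken from the PAQ4 source.

```python
def update(n0, n1, bit):
    """Add the observed bit; if the opposite count exceeds 2, drop half the excess."""
    if bit:
        n1 += 1
        if n0 > 2:
            n0 = 2 + (n0 - 2) // 2
    else:
        n0 += 1
        if n1 > 2:
            n1 = 2 + (n1 - 2) // 2
    return n0, n1

n0 = n1 = 0
history = []
for bit in [0] * 10 + [1] * 6:
    n0, n1 = update(n0, n1, bit)
    history.append((n0, n1))
    print(n0, n1, f"p(1) = {n1}/{n0 + n1}")
```

After only four 1 bits the counter already favors 1, even though ten 0 bits came first.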
![Page 30: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/30.jpg)
Mixer
• $p(1) = \sum_i w_i n_{1i} / \sum_i w_i n_i$
– $w_i$ = weight of i’th model
– $n_{0i}$, $n_{1i}$ = 0 and 1 counts for i’th model
– $n_i = n_{0i} + n_{1i}$
• Cost to code a 1 bit = $-\log p(1)$
• Weight gradient of that cost: $\partial \mathrm{cost} / \partial w_i = n_i / \sum_j w_j n_j - n_{1i} / \sum_j w_j n_{1j}$
• Adjust $w_i$ by a small amount (0.1-0.5%) in the direction of the negative gradient after coding each bit (to reduce the cost of coding that bit)
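A minimal sketch of the mixing and weight update, derived directly from the weighted-average definition of p(1). With S the weighted sum of total counts and Sy the weighted sum of counts for the bit y just coded, the negative gradient of -log p(y) with respect to weight i is ny_i/Sy - n_i/S. The function names, the `rate` step size and the example counts are assumptions, and PAQ4's integer arithmetic is not reproduced.

```python
def mix(weights, counts):
    """p(1) = sum_i w_i * n1_i / sum_i w_i * n_i, with counts given as (n0, n1) pairs."""
    s1 = sum(w * n1 for w, (n0, n1) in zip(weights, counts))
    s = sum(w * (n0 + n1) for w, (n0, n1) in zip(weights, counts))
    return s1 / s

def update_weights(weights, counts, bit, rate=0.002):
    """One gradient step that lowers -log p(bit) for the bit just coded."""
    s = sum(w * (n0 + n1) for w, (n0, n1) in zip(weights, counts))
    sy = sum(w * (n1 if bit else n0) for w, (n0, n1) in zip(weights, counts))
    if s == 0 or sy == 0:
        return list(weights)  # gradient undefined; leave weights unchanged
    new = []
    for w, (n0, n1) in zip(weights, counts):
        ny = n1 if bit else n0
        # Negative gradient of -log(sy / s) with respect to w.
        step = ny / sy - (n0 + n1) / s
        new.append(max(0.0, w + rate * step))
    return new

weights = [1.0, 1.0]
counts = [(1, 3), (3, 1)]  # hypothetical (n0, n1) counts from two models
before = mix(weights, counts)
weights = update_weights(weights, counts, bit=1)
after = mix(weights, counts)
print(before, after)
```

The model that predicted the coded bit more strongly gains weight, so p(1) rises slightly after a 1 bit.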
![Page 31: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/31.jpg)
Secondary Symbol Estimation (SSE)
• Maps an input estimate of P(x) to a refined estimate of P(x)
• Refines final probability by adapting to observed bits
• Piecewise linear approximation
• 32 segments (shorter near 0 or 1)
• Counts n0, n1 at segment intersections (stationary, no discounting opposite count)
• 8-bit counts are halved if over 255
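The list above can be sketched as a small class. Several details are assumptions made to keep the sketch short: the segments are uniform (PAQ4's are shorter near 0 and 1), the counts start at values that make the initial function the identity, and both knots bounding the input are trained on every update.

```python
class SSE:
    """Piecewise linear map from an input probability to a refined one."""

    def __init__(self, segments=32):
        self.segments = segments
        # One (n0, n1) pair per knot, initialized so that output == input.
        self.n1 = [k for k in range(segments + 1)]
        self.n0 = [segments - k for k in range(segments + 1)]

    def _knot_p(self, k):
        return self.n1[k] / (self.n0[k] + self.n1[k])

    def _locate(self, p):
        x = p * self.segments
        k = min(int(x), self.segments - 1)
        return k, x - k  # left knot index and position within the segment

    def refine(self, p):
        k, frac = self._locate(p)
        return (1 - frac) * self._knot_p(k) + frac * self._knot_p(k + 1)

    def update(self, p, bit):
        k, _ = self._locate(p)
        for j in (k, k + 1):  # assumption: train both knots of the segment
            if bit:
                self.n1[j] += 1
            else:
                self.n0[j] += 1
            if self.n0[j] > 255 or self.n1[j] > 255:  # keep counts within 8 bits
                self.n0[j] //= 2
                self.n1[j] //= 2

sse = SSE()
print(sse.refine(0.5))
for _ in range(200):
    sse.update(0.5, 1)  # the input model says 0.5, but the bit is always 1
print(sse.refine(0.5))
```

After training, an input of 0.5 maps to a much higher output, which is the kind of correction SSE is meant to learn.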
![Page 32: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/32.jpg)
SSE example
[Graph: output p (0 to 1) plotted against input p (0 to 1), showing the initial function and the trained function]
![Page 33: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/33.jpg)
Mixer and SSE are context sensitive
• 8 mixers selected by 3 high order bits of last whole byte
• 1024 SSE functions selected by current partial byte and 2 high order bits of last whole byte
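The two selections amount to a few bit operations, sketched below. The function names are hypothetical, and the partial byte is assumed to be stored as a value from 0 to 255, which gives 256 × 4 = 1024 SSE functions.

```python
def mixer_index(last_byte):
    """3 high-order bits of the last whole byte select one of 8 mixers."""
    return last_byte >> 5

def sse_index(partial_byte, last_byte):
    """Partial byte (assumed 0..255) plus 2 high bits of the last byte: 256 * 4 = 1024."""
    return ((last_byte >> 6) << 8) | partial_byte

print(mixer_index(0xE0), sse_index(0x01, 0xC0))
```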
![Page 34: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/34.jpg)
Experimental Results on Popular Compressors, Calgary Corpus
| Compressor | Size (bytes) | Compression time (sec., 750 MHz) |
|---|---|---|
| Original data | 3141622 | |
| compress | 1272772 | 1.5 |
| pkzip 2.04e | 1032290 | 1.5 |
| gzip -9 | 1017624 | 2 |
| winrar 3.20 | 754270 | 7 |
| paq4 | 672134 | 166 |
![Page 35: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/35.jpg)
Results on Top Compressors
| Compressor | Size | Time (sec.) |
|---|---|---|
| ppmn | 716297 | 23 |
| rk 1.02 | 707160 | 44 |
| ppmonstr I | 696647 | 35 |
| paq4 | 672134 | 166 |
| epm r9 | 668115 | 54 |
| rkc | 661602 | 91 |
| slim 18 | 659358 | 153 |
| compressia 1.0b | 650398 | 66 |
| durilca 0.3a | 647028 | 35 |
![Page 36: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/36.jpg)
Compression for Anomaly Detection
• Anomaly detection: finding unlikely events
• Depends on ability to estimate probability
• So does compression
![Page 37: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/37.jpg)
Prior work
• Compression detects anomalies in NASA TEK valve data
– C(normal) = C(abnormal)
– C(normal + normal) < C(normal + abnormal)
– Verified with gzip, rk, and paq4
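The comparison can be sketched with Python's `zlib` standing in for the compressor C. The traces here are synthetic stand-ins, not the NASA data: two nearly identical "normal" byte sequences and one unrelated "abnormal" sequence, all hypothetical.

```python
import random
import zlib

def csize(data):
    """Compressed size in bytes, with zlib standing in for the compressor C."""
    return len(zlib.compress(data, 9))

rng = random.Random(0)
# Synthetic stand-ins for traces (hypothetical data, 208 quantization levels).
normal_a = bytes(rng.randrange(208) for _ in range(2000))
normal_b = bytearray(normal_a)
for i in range(0, len(normal_b), 100):  # a second "normal" trace: small deviations
    normal_b[i] = (normal_b[i] + 1) % 208
normal_b = bytes(normal_b)
abnormal = bytes(rng.randrange(208) for _ in range(2000))  # unrelated trace

print(csize(normal_a), csize(abnormal))
print(csize(normal_a + normal_b), csize(normal_a + abnormal))
```

Taken alone, the two traces compress to about the same size. Appended to a normal trace, the abnormal one costs far more, because nothing learned from the normal trace helps predict it.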
![Page 38: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/38.jpg)
NASA Valve Solenoid Traces
• Data set 3 solenoid current (Hall effect sensor)
• 218 normal traces
• 20,000 samples per trace
• Measurements quantized to 208 values
• Data converted to a 4,360,000 byte file with 1 sample per byte
![Page 39: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/39.jpg)
Graph of 218 overlapped traces data (green)
![Page 40: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/40.jpg)
Compression Results

| Compressor | Size |
|---|---|
| Original | 4360000 |
| gzip -9 | 1836587 |
| slim 18 | 1298189 |
| epm r9 | 1290581 |
| durilca 0.3a | 1287610 |
| rkc | 1277363 |
| rk4 –mx | 1275324 |
| ppmonstr Ipre | 1272559 |
| paq4 | 1263021 |
![Page 41: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/41.jpg)
PAQ4 Analysis
• Removing SSE had little effect
• Removing all models except n=1 to 5 had little effect
• Delta coding made compression worse for all compressors
• Model is still too large to code in SCL, but uncompressed data is probably noise which can be modeled statistically
![Page 42: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/42.jpg)
Future Work
• Compress with noise filtered out
• Verify anomaly detection by temperature, voltage, and plunger impediment (voltage test 1)
• Investigate analog and other models
• Convert models to rules
![Page 43: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/43.jpg)
History of PAQ4

| Date | Compressor | Calgary Size |
|---|---|---|
| Nov. 1999 | P12 (Neural net, FLAIRS paper in 5/2000) | 831341 |
| Jan. 2002 | PAQ1 (Nonstationary counters) | 716704 |
| May 2003 | PAQ2 (Serge Osnach adds SSE) | 702382 |
| Sept. 2003 | PAQ3 (Improved SSE) | 696616 |
| Oct. 2003 | PAQ3N (Osnach adds sparse models) | 684580 |
| Nov. 2003 | PAQ4 (Adaptive mixing) | 672135 |
![Page 44: The PAQ4 Data Compressor Matt Mahoney Florida Tech](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc60e2/html5/thumbnails/44.jpg)
Acknowledgments
• Serge Osnach (author of EPM) for adding SSE and sparse models to PAQ2, PAQ3N
• Yoockin Vadim (YBS), Werner Bergmans, Berto Destasio for benchmarking PAQ4
• Jason Schmidt, Eugene Shelwien (ASH, PPMY) for compiling faster/smaller executables
• Eugene Shelwien, Dmitry Shkarin (DURILCA, PPMONSTR, BMF) for improvements to SSE contexts
• Alexander Ratushnyak (ERI) for finding a bug in an earlier version of PAQ4