Choosing Word Occurrences for the Smallest Grammar Problem

Rafael Carrascosa¹, Matthias Gallé², François Coste², Gabriel Infante-Lopez¹

¹NLP Group, U. N. de Córdoba, Argentina
²Symbiose Project, IRISA / INRIA, France

LATA, May 25th, 2010
Smallest Grammar Problem

Problem Definition
Given a sequence s, find a context-free grammar G(s) of minimal size that generates exactly this and only this sequence.

Example
s = “how much wood would a woodchuck chuck if a woodchuck could chuck wood?”; a possible G(s) (not necessarily minimal) is

S → how much N2 wN3 N4 N1 if N4 cN3 N1 N2 ?
N1 → chuck
N2 → wood
N3 → ould
N4 → a N2N1
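Such a straight-line grammar can be checked mechanically: expanding S must yield exactly s. A minimal sketch (the dictionary representation and the token split are illustrative, not the talk's):

```python
# Sketch: verify that a straight-line grammar derives a single string.
# Rule bodies are lists of tokens; tokens that appear as rule names are
# non-terminals, everything else is a literal string.
rules = {
    "S": ["how much ", "N2", " w", "N3", " ", "N4", " ", "N1", " if ", "N4",
          " c", "N3", " ", "N1", " ", "N2", "?"],
    "N1": ["chuck"],
    "N2": ["wood"],
    "N3": ["ould"],
    "N4": ["a ", "N2", "N1"],
}

def cons(symbol):
    """Expand a non-terminal to the single string it generates."""
    return "".join(cons(t) if t in rules else t for t in rules[symbol])

print(cons("S"))
# how much wood would a woodchuck chuck if a woodchuck could chuck wood?
```

Every non-terminal expands deterministically, which is exactly the property the Remark below relies on.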
Applications
• Data Compression
• Sequence Complexity
• Structure Discovery
[Figure 1.1: Hierarchies for Genesis 1:1 in (a) English, (b) French, and (c) German]
…particular class of grammars, and attempt to provide guarantees about identifying the source grammar from its output.

The hierarchy of phrases provides a concise representation of the sequence, and conciseness can be an end in itself. When the hierarchy is appropriately encoded, the technique provides compression. Data compression is concerned with making efficient use of limited bandwidth and storage by removing redundancy. Most compression schemes work by taking advantage of the repetitive nature of sequences, either by creating structures or by accumulating statistics. Building a hierarchy, however, allows not only the sequence, but also the repeated phrases, to be encoded efficiently. This success underscores the close relationship between learning and data compression.

1.3 Some examples
This section previews several results from the thesis. SEQUITUR produces a hierarchy of repetitions from a sequence. For example, Figure 1.1 shows parts of three hierarchies inferred from the text of the Bible in English, French, and German. The hierarchies are formed without any knowledge of the preferred structure of words and phrases, but nevertheless capture many meaningful regularities. In Figure 1.1a, the word beginning is split into begin and ning, a root word and a suffix. Many words and word groups appear as distinct parts in the hierarchy (spaces have been made explicit by replacing them with bullets). The same algorithm produces the French version in Figure 1.1b, where commencement is…

© Nevill-Manning 1997
Remark
Not only S, but any non-terminal of the grammar generates only one sequence of terminal symbols: cons :: N → Σ*
For the example grammar:
cons(S) = s
cons(N1) = chuck
cons(N2) = wood
cons(N3) = ould
cons(N4) = a woodchuck
Size of a Grammar

|G| = Σ_{N→ω ∈ P} (|ω| + 1)
For the example grammar, concatenating all right-hand sides with one end marker per rule gives:

how much N2 wN3 N4 N1 if N4 cN3 N1 N2 ? | chuck | wood | ould | a N2 N1 |
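The size measure is a one-liner once rule bodies are symbol lists; a sketch (the dictionary representation is illustrative):

```python
# |G| = sum over rules N -> omega of (|omega| + 1): the length of every
# right-hand side plus one end marker per rule.
def grammar_size(rules):
    return sum(len(body) + 1 for body in rules.values())

# Toy grammar generating "abab": S -> N N, N -> a b
toy = {"S": ["N", "N"], "N": ["a", "b"]}
print(grammar_size(toy))  # 6, i.e. the flattened string "N N | a b |"
```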
Previous Approaches
1. Practical algorithms: Sequitur (and offline friends). 1996“Compression and Explanation Using Hierarchical Grammars”. Nevill-Manning & Witten. The Computer
Journal. 1997
2. Compression theoretical framework: Grammar Based Code.2000“Grammar-based codes: a new class of universal lossless source codes”. Kieffer & Yang. IEEE T on
Information Theory. 2000
3. Approximation ratio to the size of a Smallest Grammar in theworst case. 2002“The Smallest Grammar Problem”, Charikar et.al. IEEE T on Information Theory. 2005
Offline algorithms
• Maximal Length (ML): take the longest repeat, replace all occurrences with a new symbol, iterate.
S → how much wood would a woodchuck chuck if a woodchuck could chuck wood?
⇓
S → how much wood wouldN1huck ifN1ould chuck wood?
N1 → a woodchuck c
⇓
S → how much wood wouldN1huck if N1ould N2wood?
N1 → a woodN2c
N2 → chuck
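One ML step can be sketched directly; the brute-force repeat search below is for clarity only (the practical algorithms use suffix structures):

```python
# Sketch of one Maximal-Length step: find a longest substring occurring
# at least twice (non-overlapping) and replace every occurrence.
def longest_repeat(s):
    best = ""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            if len(w) > len(best) and s.count(w) >= 2:  # count() is non-overlapping
                best = w
    return best

s = "how much wood would a woodchuck chuck if a woodchuck could chuck wood?"
w = longest_repeat(s)
print(repr(w))             # ' a woodchuck c'
print(s.replace(w, "N1"))  # 'how much wood wouldN1huck ifN1ould chuck wood?'
```

Note that the longest repeat includes its surrounding spaces and cuts across word boundaries, which is exactly what the N1 rule above shows.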
Bentley & McIlroy. “Data compression using long common strings”. DCC. 1999.
Nakamura et al. “Linear-Time Text Compression by Longest-First Substitution”. MDPI Algorithms. 2009.
• Most Frequent (MF): take the most frequent repeat, replace all occurrences with a new symbol, iterate.
Larsson & Moffat. “Offline Dictionary-Based Compression”. DCC. 1999.
• Most Compressive (MC): take the repeat that compresses best, replace all occurrences with a new symbol, iterate.
Apostolico & Lonardi. “Off-line compression by greedy textual substitution”. Proceedings of the IEEE. 2000.
A General Framework: IRR
IRR (Iterative Repeat Replacement) framework
Input: a sequence s, a score function f

1. Initialize the grammar G by S → s.
2. Take the repeat ω that maximizes f over G.
3. If replacing ω would yield a bigger grammar than G, then
   3.1 return G
   else
   3.1 replace all (non-overlapping) occurrences of ω in G by a new symbol N
   3.2 add the rule N → ω to G
   3.3 go to 2

Complexity: O(n³)
Results on Canterbury Corpus
sequence      Sequitur  IRR-ML  IRR-MF  IRR-MC
alice29.txt   19.9%     37.1%   8.9%    41000
asyoulik.txt  17.7%     37.8%   8.0%    37474
cp.html       22.2%     21.6%   10.4%   8048
fields.c      20.3%     18.6%   16.1%   3416
grammar.lsp   20.2%     20.7%   15.1%   1473
kennedy.xls   4.6%      7.7%    0.3%    166924
lcet10.txt    24.5%     45.0%   8.0%    90099
plrabn12.txt  14.9%     45.2%   5.8%    124198
ptt5          23.4%     26.1%   6.4%    45135
sum           25.6%     15.6%   11.9%   12207
xargs.1       16.1%     16.2%   11.8%   2006
average       19.0%     26.5%   9.3%

Extends and confirms the results of Nevill-Manning & Witten. “On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences”. Proceedings of the IEEE, vol. 88, no. 11. November 2000.
Split the Problem
The Smallest Grammar Problem splits into two complementary subproblems: the choice of constituents and the choice of occurrences.
A General Framework: IRRCOO
IRRCOO (Iterative Repeat Replacement with Choice of Occurrence Optimization) framework
Input: a sequence s, a score function f

1. Initialize the grammar G by S → s.
2. Take the repeat ω that maximizes f over G.
3. If replacing ω would yield a bigger grammar than G, then
   3.1 return G
   else
   3.1 G ← mgp(cons(G) ∪ cons(ω))
   3.2 go to 2
Choice of Occurrences
Minimal Grammar Parsing (MGP) Problem
Given sequences Ω = {s = w0, w1, . . . , wm}, find a context-free grammar of minimal size that has non-terminals S = N0, N1, . . . , Nm such that cons(Ni) = wi.
Choice of Occurrences: an Example
Given sequences Ω = {ababbababbabaabbabaa, abbaba, bab}
A minimal grammar for Ω is
N0 → aN2N2N1N1a
N1 → abN2a
N2 → bab
mgp can be computed in O(n³).
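With the constituents fixed, choosing occurrences reduces to a shortest-path problem: parse each word left to right, where one step consumes either a single terminal or an occurrence of another constituent. A sketch of that dynamic program (the function name and data layout are mine, not the paper's):

```python
# Sketch of the parsing step behind mgp: a minimal parse of `target`
# using single terminals or occurrences of the given constituents.
def min_parse(target, constituents):
    INF = float("inf")
    n = len(target)
    best = [0] + [INF] * n          # best[i]: fewest symbols for target[:i]
    back = [None] * (n + 1)         # symbol used for the last step
    for i in range(n):
        if best[i] == INF:
            continue
        if best[i] + 1 < best[i + 1]:           # advance by one terminal
            best[i + 1], back[i + 1] = best[i] + 1, target[i]
        for name, w in constituents.items():    # advance by a constituent
            if target.startswith(w, i) and best[i] + 1 < best[i + len(w)]:
                best[i + len(w)], back[i + len(w)] = best[i] + 1, name
    parse, i = [], n                # recover the parse right to left
    while i > 0:
        sym = back[i]
        parse.append(sym)
        i -= len(constituents.get(sym, sym))
    return parse[::-1]

# Omega from the example: N1 = abbaba, N2 = bab; parsing s gives the N0 body
body = min_parse("ababbababbabaabbabaa", {"N1": "abbaba", "N2": "bab"})
print(body)   # a 6-symbol parse, matching N0 -> aN2N2N1N1a
```

Running this program once per word of Ω (each with the other constituents available) yields the minimal grammar, which is the O(n³) computation the slide refers to.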
A Search Space for the SGP
Given s, take the lattice 〈R(s), ⊆〉 and associate a score to each node η: the size of the grammar mgp(η ∪ s). A smallest grammar will have an associated node with minimal score.
A Search Space for the SGP
The lattice is a good search space
For every sequence s, there is a node η in 〈R(s), ⊆〉 such that mgp(η ∪ s) is a smallest grammar.

Not the case for the IRR search space
There exists a sequence s such that, for any score function f, IRR(s, f) does not return a smallest grammar.
The ZZ Algorithm
bottom-up phase: given a node η, compute the scores of the nodes η ∪ {wi} and take the node with the smallest score.

top-down phase: given a node η, compute the scores of the nodes η \ {wi} and take the node with the smallest score.

ZZ: a succession of both phases. It runs in O(n⁷).
Results on Canterbury Corpus
sequence      IRRCOO-MC  ZZ     IRR-MC
alice29.txt   -4.3%      -8.0%  41000
asyoulik.txt  -2.9%      -6.6%  37474
cp.html       -1.3%      -3.5%  8048
fields.c      -1.3%      -3.1%  3416
grammar.lsp   -0.1%      -0.5%  1473
kennedy.xls   -0.1%      -0.1%  166924
lcet10.txt    -1.7%      –      90099
plrabn12.txt  -5.5%      –      124198
ptt5          -2.6%      –      45135
sum           -0.8%      -1.5%  12207
xargs.1       -0.8%      -1.7%  2006
average       -2.0%      -3.1%
New Results
Classification  sequence            length    IRRMGP* size  improvement
Virus           P. lambda           48 Knt    13061         -4.25%
Bacterium       E. coli             4.6 Mnt   741435        -8.82%
Protist         T. pseudonana chrI  3 Mnt     509203        -8.15%
Fungus          S. cerevisiae       12.1 Mnt  1742489       -9.68%
Alga            O. tauri            12.5 Mnt  1801936       -8.78%
Back to Structure
How similar are the structures returned by the different algorithms?
Standard measures to compare parse trees:

• Unlabeled Precision and Recall (F-measure)
• Unlabeled Non-Crossing Precision and Recall (F-measure)

Dan Klein. “The Unsupervised Learning of Natural Language Structure”. PhD thesis. Stanford University. 2005.
Similarity of Structure
sequence       algorithm (vs IRR-MC)  size gain  UF    UNCF
fields.c       ZZ                     3.1%       77.8  85.3
               IRRCOO-MC              1.3%       84.1  88.7
cp.html        ZZ                     3.5%       66.3  75.0
               IRRCOO-MC              1.3%       81.4  84.8
alice.txt      ZZ                     8.0%       36.6  38.6
               IRRCOO-MC              4.3%       63.9  66.0
asyoulike.txt  ZZ                     6.6%       34.6  35.8
               IRRCOO-MC              2.9%       55.1  56.9
Conclusions and Perspectives

• Split the SGP into two complementary problems: the choice of constituents and the choice of occurrences.
• Definition of a search space that contains a solution...
• ...and of algorithms that find smaller grammars than the state of the art.
• Promising results on DNA sequences (whole genomes).
• Focus on the structure: the meaning of (dis)similarity.
The End
S → thDkAforBr attenC. DoAhave Dy quesCs?
A → B
B → you
C → tion
D → an
Parse Tree Similarity Measures
UNCP(P1, P2) = |{b ∈ brackets(P1) : b does not cross brackets(P2)}| / |brackets(P1)|

UNCR(P1, P2) = |{b ∈ brackets(P2) : b does not cross brackets(P1)}| / |brackets(P2)|

UNCF(P1, P2) = 2 / (UNCP(P1, P2)⁻¹ + UNCR(P1, P2)⁻¹)
UP(P1, P2) = |brackets(P1) ∩ brackets(P2)| / |brackets(P1)|

UR(P1, P2) = |brackets(P1) ∩ brackets(P2)| / |brackets(P2)|

UF(P1, P2) = 2 / (UP(P1, P2)⁻¹ + UR(P1, P2)⁻¹)
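Both families of measures are easy to compute from bracket sets; a sketch (the span representation is assumed, and there is no guard against empty bracket sets or zero precision):

```python
# Sketch of the unlabeled measures. A parse is a set of spans (i, j);
# two brackets cross when they overlap without one containing the other.
def crosses(a, b):
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def uf(p1, p2):
    """UF: harmonic mean of unlabeled precision UP and recall UR."""
    up = len(p1 & p2) / len(p1)
    ur = len(p1 & p2) / len(p2)
    return 2 / (1 / up + 1 / ur)

def uncf(p1, p2):
    """UNCF: a bracket only has to not cross the other parse."""
    uncp = sum(all(not crosses(b, c) for c in p2) for b in p1) / len(p1)
    uncr = sum(all(not crosses(b, c) for c in p1) for b in p2) / len(p2)
    return 2 / (1 / uncp + 1 / uncr)

p1 = {(0, 4), (0, 2), (2, 4)}
p2 = {(0, 4), (1, 3), (0, 2)}
print(round(uf(p1, p2), 3), round(uncf(p1, p2), 3))  # 0.667 0.444
```

As the toy example shows, UNCF can be lower than UF when shared brackets still cross brackets unique to the other parse, and it is always at least as forgiving on nested structure.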
Problems of IRR-like algorithms
Example
xaxbxcx |1 xbxcxax |2 xcxaxbx |3 xaxcxbx |4 xbxaxcx |5 xcxbxax |6 xax |7 xbx |8 xcx

A smallest grammar is:
S → AbC |1 BcA |2 CaB |3 AcB |4 BaC |5 CbA |6 A |7 B |8 C
A → xax
B → xbx
C → xcx
But what IRR can do looks like:
S → Abxcx |1 xbxcA |2 xcAbx |3 Acxbx |4 xbAcx |5 xcxbA |6 A |7 xbx |8 xcx
A → xax
⇓
S → Abxcx |1 BcA |2 xcAbx |3 AcB |4 xbAcx |5 xcxbA |6 A |7 B |8 xcx
A → xax
B → xbx
⇓
S → AbC |1 BcA |2 xcAbx |3 AcB |4 xbAcx |5 CbA |6 A |7 B |8 C
A → xax
B → xbx
C → xcx