smallest grammar by recompressionstelo/cpm/cpm13/14_jez.pdf · 2013. 7. 9. · grammar...

Smallest grammar by recompression

Artur Jeż

Max Planck Institute for Informatics

17.06.2013

Grammar-based compression

Represent w as a CFG generating it.

Advantages
- it is usually small (at most quadratic vs. LZ)
- compression is fast
- it is exponential on good data
- extracts hierarchical structure
- it is easy to work on
- related to LZW and LZ

Smallest grammar

Problem
Given $w$, return the smallest CFG $G_w$ such that $L(G_w) = \{w\}$.

With an O(1) (constant-factor) increase in size, this is an SLP.

Definition (SLP: Straight Line Programme)
CFG with
- ordered nonterminals $X_1, X_2, \ldots$
- Chomsky normal form
- for $X_i \to X_j X_k$ we have $j, k < i$
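
A minimal Python sketch of the definition above. It assumes an SLP is stored as a list in which entry i is either a terminal letter or a pair (j, k) of earlier indices; this encoding and the names `is_slp`, `expand` and `doubling` are illustrative and do not come from the talk.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: rules[i] is the right-hand side of X_i,
# either a terminal letter or a pair (j, k) meaning X_i -> X_j X_k.
Rule = Union[str, Tuple[int, int]]


def is_slp(rules: List[Rule]) -> bool:
    """Check the ordering condition: X_i -> X_j X_k requires j, k < i."""
    for i, rhs in enumerate(rules):
        if isinstance(rhs, tuple):
            j, k = rhs
            if not (0 <= j < i and 0 <= k < i):
                return False
    return True


def expand(rules: List[Rule]) -> str:
    """Return the single word derived from the last nonterminal.

    Every intermediate word is materialised, so this is only suitable
    for small examples: the derived word can be exponentially long.
    """
    words: List[str] = []
    for rhs in rules:
        if isinstance(rhs, str):
            words.append(rhs)
        else:
            j, k = rhs
            words.append(words[j] + words[k])
    return words[-1]


# X_0 -> a, X_1 -> b, X_2 -> X_0 X_1, X_3 -> X_2 X_2, X_4 -> X_3 X_3
doubling = ["a", "b", (0, 1), (2, 2), (3, 3)]
```

Each additional doubling rule in `doubling` doubles the length of the derived word, which is the sense in which grammar compression can be exponential on good data.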

What is known

Best approximation ratio
$O(\log(n/g))$, where $g$ is the size of the optimal grammar.

- Rytter
  – represent $w$ as LZ, size $\ell \le g$
  – translation of LZ into SLP, size $O(\ell \log(n/\ell)) \le O(g \log(n/g))$
  – the intermediate grammar is balanced (AVL-type condition)
- Charikar et al.
  – similar to Rytter
  – different balance criterion (length of word)
- Sakamoto
  – local replacement rules (plus a global partition): pairs and blocks
  – analysis vs. LZ

Linear time.

This talk

Very simple linear-time algorithm, $O(\log(n/g))$ approximation.

- analysis in the recompression framework, vs. SLP
  – very robust
  – good: easier to show better approximation?
  – bad: might in fact be larger
- not balanced
  – good: easier to show approximation?
  – bad: worse for further processing
- height $O(\log n)$, when $a_\ell$ has height 1

Algorithm similar to Sakamoto, different analysis.

Example

a aa a bb a bc a bb a b c ab
$a_3$ a bb a bc a bb a b c ab
    rules: $a_3 \to a^3$
$a_3$ a bb a bc a $b_2$ a b c ab
    rules: $a_3 \to a^3$, $b_2 \to b^2$
$a_3$ b c a $b_2$ c ab d d d
    rules: $a_3 \to a^3$, $b_2 \to b^2$, $d \to ab$
$a_3$ b c a $b_2$ c e d d d
    rules: $a_3 \to a^3$, $b_2 \to b^2$, $d \to ab$, $e \to ba$

Intuition
- Phases: compress only pairs and blocks from the beginning of a phase.
- Treat nonterminals as letters.
- To speed up, we make some pair compressions simultaneously (partition $\Sigma$ into $\Sigma_\ell$, $\Sigma_r$; compress pairs from $\Sigma_\ell\Sigma_r$).

Algorithm

while |T| > 1 do
    L ← list of letters in T
    for each a ∈ L do                        ▷ Block compression
        compress maximal blocks of a         ▷ O(|T|)
    P ← list of pairs
    find partition of Σ into Σ_ℓ and Σ_r     ▷ Try to maximize the occurrences from Σ_ℓ Σ_r in T
    for ab ∈ P ∩ Σ_ℓ Σ_r do                  ▷ These pairs do not overlap
        compress pair ab                     ▷ Pair compression
return the constructed grammar
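
The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation from the talk: terminals are characters, fresh nonterminals are integers, block rules are stored unexpanded as a^l, and the partition is found by a simple greedy pass rather than by the RadixSort-based linear-time procedure. All function names (`compress_blocks`, `choose_partition`, `compress_pairs`, `recompress`, `expand`) are hypothetical.

```python
from collections import Counter
from itertools import count, groupby


def compress_blocks(T, rules, fresh):
    """Replace every maximal block a^l with l >= 2 by one fresh symbol."""
    names, out = {}, []
    for a, run in groupby(T):
        l = sum(1 for _ in run)
        if l == 1:
            out.append(a)
            continue
        if (a, l) not in names:
            names[(a, l)] = fresh()
            # Stored unexpanded; a size-conscious version uses binary expansion.
            rules[names[(a, l)]] = [a] * l
        out.append(names[(a, l)])
    return out


def choose_partition(T):
    """Greedy split of the letters into (left, right); assumes T has no pair aa."""
    pairs = Counter(zip(T, T[1:]))
    adj = {}
    for (x, y), n in pairs.items():
        adj.setdefault(x, Counter())[y] += n
        adj.setdefault(y, Counter())[x] += n
    left, right = set(), set()
    for c, neighbours in adj.items():
        w_left = sum(n for d, n in neighbours.items() if d in left)
        w_right = sum(n for d, n in neighbours.items() if d in right)
        # Place c opposite the side that already holds more of its neighbours.
        (right if w_left >= w_right else left).add(c)
    forward = sum(n for (x, y), n in pairs.items() if x in left and y in right)
    backward = sum(n for (x, y), n in pairs.items() if x in right and y in left)
    return (left, right) if forward >= backward else (right, left)


def compress_pairs(T, left, right, rules, fresh):
    """Replace each occurrence of a pair from left x right by a fresh symbol."""
    names, out, i = {}, [], 0
    while i < len(T):
        if i + 1 < len(T) and T[i] in left and T[i + 1] in right:
            key = (T[i], T[i + 1])
            if key not in names:
                names[key] = fresh()
                rules[names[key]] = list(key)
            out.append(names[key])
            i += 2          # left and right are disjoint, so pairs never overlap
        else:
            out.append(T[i])
            i += 1
    return out


def recompress(word):
    """Return (start symbol, rules) of a grammar deriving exactly `word`."""
    if not word:
        raise ValueError("word must be non-empty")
    rules, ids = {}, count()
    fresh = lambda: next(ids)
    T = list(word)
    while len(T) > 1:
        T = compress_blocks(T, rules, fresh)
        if len(T) > 1:
            left, right = choose_partition(T)
            T = compress_pairs(T, left, right, rules, fresh)
    return T[0], rules


def expand(sym, rules):
    """Expand a symbol back into the word it derives (for checking only)."""
    if sym not in rules:
        return sym
    return "".join(expand(s, rules) for s in rules[sym])
```

Calling `expand(*recompress(w))` should give back `w` for any non-empty string `w`. The greedy partition cuts at least half of the pair occurrences, so at least a quarter fall into the chosen orientation, which is enough for every phase to shorten T.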

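Partition

1/4 of appearances covered
There is a partition $\Sigma_\ell$, $\Sigma_r$ such that at least 1/4 of the pair appearances in $T$ are covered, i.e. belong to $\Sigma_\ell\Sigma_r$.

- After block compression, $aa$ does not appear.
- Random partition: 1/4 of the pairs can be covered.
- Derandomise (expected value).
- We need the number of appearances of each $ab$: RadixSort, $O(|T|)$.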
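
The expected-value argument behind the 1/4 bound can be written out as follows. This is a sketch assuming each letter is assigned to $\Sigma_\ell$ or $\Sigma_r$ independently with probability 1/2; the notation $\#(ab)$ for the number of appearances of the pair $ab$ in $T$ is introduced here for illustration.

```latex
% After block compression every pair ab in T has a \neq b.
\Pr[\, a \in \Sigma_\ell \wedge b \in \Sigma_r \,]
  = \tfrac12 \cdot \tfrac12 = \tfrac14 ,
\qquad
\mathbb{E}\Bigl[\sum_{ab \in \Sigma_\ell\Sigma_r} \#(ab)\Bigr]
  = \sum_{a \neq b} \tfrac14 \, \#(ab)
  = \tfrac14 \,(|T| - 1) .
```

Since the expectation is $(|T|-1)/4$, some partition covers at least that many appearances, and the method of conditional expectations finds one deterministically.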

Size reduction

Size drop
Consider each pair of consecutive letters $ab$ in $T$. For 1/4 of them, one letter is compressed in a phase.
  – if $a = b$: it is compressed
  – if $a \ne b$: 1/4 of those pairs are in $\Sigma_\ell\Sigma_r$

When we consider $ab$ we replace it, unless one letter was already replaced.

Length drops by a constant factor.

Towards running time
It is enough to show that one round runs in $O(|T|)$.
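
Why a linear-time round suffices can be seen from a geometric sum. The sketch below assumes the length shrinks by some constant factor $c < 1$ per phase; the symbols $c$ and $T_i$ (the text at the start of phase $i$) are introduced here, and the slide does not state the value of the constant.

```latex
|T_0| = n, \qquad |T_{i+1}| \le c\,|T_i| \quad (0 < c < 1)
\;\Longrightarrow\;
\sum_{i \ge 0} O(|T_i|)
  \;\le\; O\Bigl(n \sum_{i \ge 0} c^{\,i}\Bigr)
  \;=\; O\Bigl(\frac{n}{1-c}\Bigr)
  \;=\; O(n).
```

The same assumption bounds the number of phases by $O(\log n)$.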

Running time

Partition
$O(|T|)$ time.

Block compression
By RadixSort, $O(|T|)$ time.

Pair compression
By RadixSort, $O(|T|)$ time.
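
One way RadixSort yields the pair counts in linear time is sketched below. It assumes the letters of T have already been renamed to integers in `range(sigma)` with `sigma` at most |T|; that renaming step, and the names `radix_sort_pairs` and `pair_counts`, are assumptions made for the illustration.

```python
def radix_sort_pairs(T, sigma):
    """Sort positions i by the pair (T[i], T[i+1]) with two stable bucket passes."""
    positions = list(range(len(T) - 1))
    # Least significant key first (second letter), then the first letter.
    for offset in (1, 0):
        buckets = [[] for _ in range(sigma)]
        for i in positions:
            buckets[T[i + offset]].append(i)
        positions = [i for bucket in buckets for i in bucket]
    return positions


def pair_counts(T, sigma):
    """List of ((a, b), number of appearances of ab in T), sorted by pair."""
    counts = []
    for i in radix_sort_pairs(T, sigma):
        pair = (T[i], T[i + 1])
        if counts and counts[-1][0] == pair:
            counts[-1] = (pair, counts[-1][1] + 1)
        else:
            counts.append((pair, 1))
    return counts
```

Each pass costs O(|T| + sigma), so the whole count is O(|T|) under the stated assumption on `sigma`. Equal pairs end up adjacent in the sorted order, which is also what pair compression needs in order to give all appearances of one pair the same fresh letter.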

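Number of nonterminals

Representation cost
- when $c$ replaces $ab$ we add the rule $c \to ab$: representation cost 1
- when $a^{\ell_1}, a^{\ell_2}, \ldots, a^{\ell_k}$ are replaced with $a_{\ell_1}, a_{\ell_2}, \ldots, a_{\ell_k}$ ($\ell_1 < \ell_2 < \cdots < \ell_k$):
  – first represent $a^{\ell_2-\ell_1}, a^{\ell_3-\ell_2}, \ldots, a^{\ell_k-\ell_{k-1}}$ as $a_{\ell_2-\ell_1}, a_{\ell_3-\ell_2}, \ldots, a_{\ell_k-\ell_{k-1}}$
  – do this by binary expansion (make new rules $a_2 \to aa$, $a_4 \to a_2a_2$, $a_8 \to a_4a_4$, ...)
  – $a_{\ell_{i+1}} \to a_{\ell_{i+1}-\ell_i}\, a_{\ell_i}$
  – representation cost $O\left(\sum_{i=1}^{k-1} \log(\ell_{i+1}-\ell_i)\right)$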
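
The block representation above can be sketched as follows. The code assumes the block lengths of one letter are given as positive integers and names the letter for a^m as `a_m`; the function names `block_rules` and `expand` and the dictionary encoding of rules are hypothetical.

```python
def block_rules(a, lengths):
    """Binary rules for the letters a_l, l in lengths, each deriving a^l."""
    rules = {}

    def name(m):
        return a if m == 1 else f"{a}_{m}"

    def ensure(m):
        """Make sure name(m) derives a^m, adding O(log m) binary rules."""
        if m == 1 or name(m) in rules:
            return
        top = 1 << (m.bit_length() - 1)       # largest power of two <= m
        if top == m:
            ensure(m // 2)
            rules[name(m)] = [name(m // 2), name(m // 2)]
        else:
            ensure(top)
            ensure(m - top)
            rules[name(m)] = [name(top), name(m - top)]

    prev = None
    for l in sorted(set(lengths)):
        if prev is None:
            ensure(l)                          # smallest block: expand directly
        else:
            ensure(l - prev)                   # the difference a_{l - prev}
            if name(l) not in rules:
                rules[name(l)] = [name(l - prev), name(prev)]
        prev = l
    return rules


def expand(sym, rules):
    """Expand a symbol into the word it derives (for checking only)."""
    if sym not in rules:
        return sym
    return "".join(expand(s, rules) for s in rules[sym])
```

For lengths 3, 5 and 12 this produces six rules in total, because the powers of two and the differences are shared between blocks; that sharing is what the logarithmic terms in the cost bound account for.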

Analysis outline

- We begin with a $G$ generating $T$ (thought experiment).
- At each moment we keep $G$ generating the current $T$:
  – we apply the compression to $G$
  – it is changed so that this can be done
- Representation cost is calculated using $G$.
- $G$ is of a more general form: $X_i \to u X_j v X_k w$.
- Explicit letters have credit.
- Representation cost is paid by released credit:
  – $ab$ is replaced by $c$
  – we need 1 representation cost
  – each $ab$ in $G$ is replaced with $c$, 1 credit is released
  – (a bit more tricky for blocks)
- We only need to count the amount of created credit.

Pair compression

$X_1 \to ababcab$, $X_2 \to abcbX_1abX_1a$

- compression of $ab$: easy
- compression of $ba$: problem

Definition (Non-crossing pairs)
$ab$ is a non-crossing pair iff none of the following happens:
- $aX$ appears in a rule and $X$ begins with $b$
- $Xb$ appears in a rule and $X$ ends with $a$

When each pair from $\Sigma_\ell\Sigma_r$ is non-crossing, replace all those pairs in $G$ (no new credit).
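
The definition can be checked mechanically. The sketch below tests exactly the two conditions listed above, assuming rules are stored as a dictionary from nonterminal to a non-empty list of symbols and that `order` lists the nonterminals bottom-up; the names `first_last` and `is_crossing` are illustrative.

```python
def first_last(rules, order):
    """First and last terminal letter of the word derived by each nonterminal."""
    first, last = {}, {}
    for X in order:                       # bottom-up: children are seen first
        rhs = rules[X]
        first[X] = first.get(rhs[0], rhs[0])
        last[X] = last.get(rhs[-1], rhs[-1])
    return first, last


def is_crossing(a, b, rules, order):
    """True if the pair ab has a crossing appearance in some rule."""
    first, last = first_last(rules, order)
    for rhs in rules.values():
        for x, y in zip(rhs, rhs[1:]):
            if x == a and y in rules and first[y] == b:   # aX, X begins with b
                return True
            if y == b and x in rules and last[x] == a:    # Xb, X ends with a
                return True
    return False


# The grammar from the slide.
rules = {
    "X1": list("ababcab"),
    "X2": list("abcb") + ["X1", "a", "b", "X1", "a"],
}
order = ["X1", "X2"]
```

On this grammar `is_crossing("a", "b", rules, order)` is false and `is_crossing("b", "a", rules, order)` is true, because $bX_1$ appears in the rule for $X_2$ and $X_1$ begins with $a$. This matches the easy and problematic cases named on the slide.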

Making pairs non-crossing

When ab has a crossing appearance: $aX_i$ or $X_ib$
– $X_i$ defines $bw$: change it to $w$, replace $X_i$ by $bX_i$
– symmetrically for ending a

LeftPop(b)
for $i \leftarrow 1 \,..\, g-1$ do
    if the first symbol in $X_i \to \alpha$ is b then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma
After LeftPop(b) and RightPop(a) the pair ab is non-crossing.

Can be done in parallel for all $ab \in \Sigma_\ell\Sigma_r$.
Credit increases by O(g).
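The LeftPop(b) procedure and its mirror image can be sketched in Python as follows. The dict-based grammar encoding and the names `substitute`, `left_pop`, `right_pop` and `expand` are hypothetical, chosen only for this illustration. Dropping a nonterminal whose rule becomes empty is also an assumption of the sketch, since the slide does not say what happens in that case.

```python
def substitute(grammar, name, replacement):
    """Replace every occurrence of nonterminal `name` in the rule bodies by `replacement`."""
    for head, rhs in grammar.items():
        new_rhs = []
        for sym in rhs:
            new_rhs.extend(replacement if sym == name else [sym])
        grammar[head] = new_rhs


def left_pop(grammar, order, b):
    """LeftPop(b): `order` lists X_1, ..., X_g bottom-up; the start symbol X_g is skipped."""
    for name in order[:-1]:
        rhs = grammar[name]
        if rhs and rhs[0] == b:
            del rhs[0]                              # remove this b
            kept = [name] if rhs else []            # assumption: drop X_i if it became empty
            substitute(grammar, name, [b] + kept)   # replace X_i in productions by b X_i


def right_pop(grammar, order, a):
    """RightPop(a): the symmetric operation for a trailing a."""
    for name in order[:-1]:
        rhs = grammar[name]
        if rhs and rhs[-1] == a:
            del rhs[-1]
            kept = [name] if rhs else []
            substitute(grammar, name, kept + [a])


def expand(grammar, sym):
    """String derived from `sym`."""
    if sym in grammar:
        return "".join(expand(grammar, s) for s in grammar[sym])
    return sym


# The pair ba is crossing in this grammar: b X1 appears and X1 begins with a.
grammar = {
    "X1": list("ababcab"),
    "X2": list("abcb") + ["X1"] + list("ab") + ["X1", "a"],
}
order = ["X1", "X2"]
before = expand(grammar, "X2")

left_pop(grammar, order, "a")    # second letter of the pair ba
right_pop(grammar, order, "b")   # first letter of the pair ba

print(grammar["X1"])
print(grammar["X2"])
print(expand(grammar, "X2") == before)
```

After both calls X1 begins with b and ends with a, so every occurrence of ba that used to straddle the boundary of X1 is now written explicitly in the rule of X2, while the generated word is unchanged.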


Making pairs non-crossing

When $ab \in \Sigma_\ell\Sigma_r$ has a crossing appearance: $aX_i$ or $X_ib$
– $X_i$ defines $bw$: change it to $w$, replace $X_i$ by $bX_i$
– symmetrically for ending a

LeftPop
for $i \leftarrow 1 \,..\, g-1$ do
    if the first symbol in $X_i \to \alpha$ is $b \in \Sigma_r$ then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma
After LeftPop and RightPop the pairs in $\Sigma_\ell\Sigma_r$ are non-crossing.

Can be done in parallel for all $ab \in \Sigma_\ell\Sigma_r$.

Credit increases by O(g).
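The version for a whole partition can be sketched in the same spirit: one bottom-up pass pops a leading letter from $\Sigma_r$ and a trailing letter from $\Sigma_\ell$ from each rule, after which no pair of $\Sigma_\ell\Sigma_r$ should be reported as crossing. The function names `pop_letters`, `crossing_pairs` and `expand`, the dict-based encoding, the choice $\Sigma_\ell = \{b, c\}$, $\Sigma_r = \{a\}$, and the handling of emptied rules are all assumptions of this sketch rather than details from the talk.

```python
def pop_letters(grammar, order, sigma_l, sigma_r):
    """LeftPop and RightPop for disjoint letter sets, in one bottom-up pass."""
    for name in order[:-1]:                      # the start symbol X_g is not popped
        rhs = grammar[name]
        left, right = [], []
        if rhs and rhs[0] in sigma_r:
            left = [rhs.pop(0)]                  # leading b from sigma_r
        if rhs and rhs[-1] in sigma_l:
            right = [rhs.pop()]                  # trailing a from sigma_l
        if not (left or right):
            continue
        middle = [name] if rhs else []           # assumption: an emptied X_i is dropped
        for head in grammar:
            grammar[head] = [t for s in grammar[head]
                             for t in (left + middle + right if s == name else [s])]


def crossing_pairs(grammar, order, sigma_l, sigma_r):
    """Pairs ab from sigma_l x sigma_r with a crossing appearance (aX or Xb)."""
    first, last = {}, {}
    for name in order:                           # bottom-up, skipping emptied rules
        rhs = grammar[name]
        if rhs:
            first[name] = first.get(rhs[0], rhs[0])
            last[name] = last.get(rhs[-1], rhs[-1])
    found = set()
    for rhs in grammar.values():
        for x, y in zip(rhs, rhs[1:]):
            if x in sigma_l and y in first and first[y] in sigma_r:
                found.add((x, first[y]))         # aX, X begins with b
            if y in sigma_r and x in last and last[x] in sigma_l:
                found.add((last[x], y))          # Xb, X ends with a
    return found


def expand(grammar, sym):
    """String derived from `sym`."""
    if sym in grammar:
        return "".join(expand(grammar, s) for s in grammar[sym])
    return sym


grammar = {
    "X1": list("ababcab"),
    "X2": list("abcb") + ["X1"] + list("ab") + ["X1", "a"],
}
order = ["X1", "X2"]
sigma_l, sigma_r = {"b", "c"}, {"a"}
before = expand(grammar, "X2")

print(crossing_pairs(grammar, order, sigma_l, sigma_r))   # pairs that cross before popping
pop_letters(grammar, order, sigma_l, sigma_r)
print(crossing_pairs(grammar, order, sigma_l, sigma_r))   # pairs that cross after popping
print(expand(grammar, "X2") == before)
```

Each nonterminal contributes at most two popped letters per occurrence in a rule, which matches the O(g) credit bound stated above under the assumption that rules have bounded length.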


Blocks & Wrap up

Idea
Similar to pairs:
– $X_i$ defines $a^{\ell_i} w b^{r_i}$: change it to $w$
– replace $X_i$ in rules by $a^{\ell_i} X_i b^{r_i}$
– analysis: trickier, but works: O(g)

In total
– O(g) per phase
– O(log n) phases
– O(g log n) credit in total (= size of the created grammar)
– can be improved to O(g log(n/g))
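The block idea can be illustrated with a small Python sketch that pops the maximal explicit prefix $a^{\ell_i}$ and suffix $b^{r_i}$ from each rule and reinserts them around $X_i$. The example grammar, the names `pop_blocks` and `expand`, and the treatment of rules that become empty are hypothetical choices for this illustration; the subsequent replacement of blocks by fresh letters is not shown.

```python
def pop_blocks(grammar, order):
    """Pop the leading block a^l and the trailing block b^r of each rule (start symbol excluded)."""
    for name in order[:-1]:
        rhs = grammar[name]

        # Maximal prefix of one repeated letter (empty if the rule starts with a nonterminal).
        i = 0
        while i < len(rhs) and rhs[i] not in grammar and rhs[i] == rhs[0]:
            i += 1
        prefix, rhs = rhs[:i], rhs[i:]

        # Maximal suffix of one repeated letter in what is left.
        j = len(rhs)
        while j > 0 and rhs[j - 1] not in grammar and rhs[j - 1] == rhs[-1]:
            j -= 1
        suffix, rhs = rhs[j:], rhs[:j]

        grammar[name] = rhs                       # X_i now defines w
        if not (prefix or suffix):
            continue
        middle = [name] if rhs else []            # assumption: an emptied X_i is dropped
        for head in grammar:                      # replace X_i by a^l X_i b^r
            grammar[head] = [t for s in grammar[head]
                             for t in (prefix + middle + suffix if s == name else [s])]


def expand(grammar, sym):
    """String derived from `sym`."""
    if sym in grammar:
        return "".join(expand(grammar, s) for s in grammar[sym])
    return sym


# Hypothetical example: X1 defines a^2 (bc) b^2.
grammar = {
    "X1": list("aabcbb"),
    "X2": ["a", "X1", "b", "X1", "b"],
}
order = ["X1", "X2"]
before = expand(grammar, "X2")

pop_blocks(grammar, order)
print(grammar["X1"])
print(grammar["X2"])
print(expand(grammar, "X2") == before)
```

After popping, the blocks of a and b that were split between X1 and its context appear as explicit runs in the rule of X2, so each maximal block can be handled within a single rule body.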


Acknowledgments

M. Lohrey: suggesting the analysis.

P. Gawrychowski: introducing me to the topic.

Literature

– K. Mehlhorn, R. Sundar and Ch. Uhrig, Maintaining Dynamic Sequences under Equality Tests in Polylogarithmic Time, ’97

– H. Sakamoto, A fully linear-time approximation algorithm for grammar-based compression, ’05

– M. Lohrey and Ch. Mathissen, Compressed Membership in Automata with Compressed Labels, ’11


Open problems, related research

Open problems
– better approximation
– simpler computational model (no RadixSort)
– addition chains ($O(\log n / \log\log n)$ approximation known)

Other applications of recompression
– compressed membership
– fully compressed pattern matching
– word equations
