data compression with finite windows fiala and greene speaker: giora alexandron

Data Compression with Finite Windows

Fiala and Greene

Speaker: Giora Alexandron

Overview:-----------------------

Our main purpose:

See how a suffix tree supports a compression algorithm.

What we will see:

A data compression method that works by substituting text. It uses a modification of the basic suffix tree to support cyclic maintenance of the most recent strings seen in the file.

Outline------------------------

1. Compression: - In General

- Our Algorithm

2. Data Structure: - Modification of the suffix tree.

3. Theoretical Considerations: - Proofs.

4. Improvements.

Compression-------------------------------

What is Compression:

Compression is the coding of data to minimize its representation. We will focus on lossless, adaptive, one-pass methods.

Main approaches:

- Statistical approach: try to predict the next symbol.

- Substitutional approach: replace blocks of text with references to earlier occurrences of identical text.

**We will focus on a substitutional method**

Compression-cont.------------------------------

What characterizes a good compressor:

- Good compression ratio.

- Runs fast in compression.

- Uses a minimum of space.

- Runs fast in expansion.

There are trade-offs among all of these. Naturally, we want to achieve them all:

A good algorithm + a matching data structure

Substitutional Compressing---------------------------------------

Consider the following basic scheme:

The compressed files would contain two types of codewords:

literal x: pass the next x characters directly to the output.

copy x, y: go back y characters and copy the next x characters, starting at that position.

Example------------------------------------------------

..it was the best of times, it was the worst of times..

Would compress to-

(literal 26)it was the best of times, (copy 11 -26)wor(copy 11 -27)

Here (copy 11 -26) means: go back 26 characters (-26) and copy the next 11 (+11).

Example-cont.------------------------------------------------

And we get a very simple lossless method:

The compression achieved depends on the size of the copy and literal codewords.

[Figure: compression turns "..it was the best of times, it was the worst of times." into "..it was the best of times, (copy 11 -26)wor(copy 11 -27)."; expansion turns it back.]
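The expansion side of this scheme can be sketched in a few lines. This is a minimal illustration only: the tuple representation of codewords and the name `expand` are assumptions made for the example, not part of the original method's description.

```python
def expand(codewords):
    """Expand a list of codewords back into text.

    Each codeword is ("literal", text) or ("copy", x, y), where
    ("copy", x, y) means: go back y characters in the output and
    copy the next x characters.
    """
    out = []
    for word in codewords:
        if word[0] == "literal":
            out.extend(word[1])
        else:
            _, length, back = word
            start = len(out) - back
            # Copy one character at a time, so that a copy may overlap
            # the text it is producing (length > back).
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


codewords = [
    ("literal", "it was the best of times, "),
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
]
print(expand(codewords))
```

The expander needs no search structure at all, only the text it has already produced, which is why expansion is the cheap direction.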

A1------------------------------------------------------

The encoding of A1:

- 8 bits for a literal codeword: 0000xxxx, where xxxx gives the literal length [1..16].

- 16 bits for a copy codeword: xxxxyy..yy, where xxxx gives the copy length [2..16] and yy..yy the displacement [1..4096].

(Can you figure out the logic behind it?)

And we get a compression of 51 bytes to 36:

(literal 16)it was the best (literal 10)of times, (copy 11 -26)wor(copy 11 -27)
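The byte count can be checked with a small packing sketch. It assumes one plausible reading of the codeword layout: the 4-bit field stores length minus one (so 0000 in the high bits marks a literal), and the 12-bit field stores displacement minus one. The function names and the tuple representation are hypothetical.

```python
def encode_a1(codewords):
    """Pack ("literal", text) / ("copy", length, disp) codewords into bytes."""
    out = bytearray()
    for word in codewords:
        if word[0] == "literal":
            data = word[1].encode("ascii")
            assert 1 <= len(data) <= 16
            out.append(len(data) - 1)                  # 0000xxxx
            out += data
        else:
            _, length, disp = word
            assert 2 <= length <= 16 and 1 <= disp <= 4096
            value = ((length - 1) << 12) | (disp - 1)  # xxxxyy..yy
            out += value.to_bytes(2, "big")
    return bytes(out)


def decode_a1(data):
    """Inverse of encode_a1: expand the packed bytes back into text."""
    out = bytearray()
    i = 0
    while i < len(data):
        high, low = data[i] >> 4, data[i] & 0x0F
        if high == 0:                       # literal of low + 1 characters
            out += data[i + 1 : i + 2 + low]
            i += 2 + low
        else:                               # copy of high + 1 characters
            disp = ((low << 8) | data[i + 1]) + 1
            start = len(out) - disp
            for k in range(high + 1):
                out.append(out[start + k])
            i += 2
    return out.decode("ascii")


codewords = [
    ("literal", "it was the best "),
    ("literal", "of times, "),
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
]
packed = encode_a1(codewords)
print(len(packed), decode_a1(packed))
```

Under this layout a copy of length 1 is never needed, so the all-zero length field is free to mark a literal. The 36 bytes are 17 + 11 for the two literals, 4 for "wor" with its header, and 2 for each copy.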

A1’s policy----------------------------

If the compressor is idle (it has just finished a codeword):

look for a copy of length >= 2;

otherwise, start a literal.

If the compressor is in the middle of a literal:

extend it until a copy of length >= 3 is found.
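The policy can be sketched as a greedy loop. To keep the example short, the longest match is found by brute force over the window rather than with the suffix tree discussed later; `longest_match` and `compress_a1` are hypothetical names, and the 16-character literal limit is taken from the A1 encoding.

```python
WINDOW, MAX_LEN = 4096, 16


def longest_match(text, p):
    """Brute-force longest match for text[p:] starting inside the window."""
    best_len, best_disp = 0, 0
    for s in range(max(0, p - WINDOW), p):
        k = 0
        while k < MAX_LEN and p + k < len(text) and text[s + k] == text[p + k]:
            k += 1
        if k > best_len:
            best_len, best_disp = k, p - s
    return best_len, best_disp


def compress_a1(text):
    """Greedy A1 policy: copy >= 2 when idle, copy >= 3 inside a literal."""
    codewords, literal, p = [], "", 0
    while p < len(text):
        length, disp = longest_match(text, p)
        if length >= (3 if literal else 2):
            if literal:                       # close the open literal first
                codewords.append(("literal", literal))
                literal = ""
            codewords.append(("copy", length, disp))
            p += length
        else:
            literal += text[p]
            p += 1
            if len(literal) == MAX_LEN:       # literal is full: back to idle
                codewords.append(("literal", literal))
                literal = ""
    if literal:
        codewords.append(("literal", literal))
    return codewords


text = "it was the best of times, it was the worst of times"
for word in compress_a1(text):
    print(word)
```

The higher threshold inside a literal reflects the codeword sizes: interrupting a literal for a 2-character copy costs a 2-byte copy plus, usually, a fresh literal header afterwards, so it does not pay.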

Where do we stand?

1. Compression: In General, Our Algorithm (we are here)

The Data Structure-----------------------------------------

What do we need?

Find the current longest match (for copy).

What could we use?

Naive solution: a suffix tree with all strings of length <= 16 in the previous 4096-byte window.

The cost --------------------------------------------

If we descended d levels to insert the string starting at position j,

we will descend at least d-1 levels to insert the string starting at position j+1.

So the cost is O(nd) for insertion.

But we want to eliminate d.
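To make the O(nd) cost concrete, here is a minimal sketch of the naive structure as a plain depth-limited trie in which every insertion restarts at the root. It is an illustration under simplifying assumptions: there is no window deletion, arcs carry single characters, and the names `TrieNode` and `insert` are hypothetical.

```python
MAX_DEPTH = 16


class TrieNode:
    def __init__(self):
        self.children = {}
        self.position = -1   # start of the latest string through this node


def insert(root, text, j):
    """Insert text[j:j+MAX_DEPTH], always starting from the root.

    Returns (d, position): d is the number of already existing levels
    descended (the longest match with an earlier string), and position
    is where that earlier string started.
    """
    node, d, match_pos, matching = root, 0, -1, True
    for ch in text[j:j + MAX_DEPTH]:
        child = node.children.get(ch)
        if child is None:
            matching = False
            child = node.children[ch] = TrieNode()
        elif matching:
            d += 1
            match_pos = child.position
        child.position = j   # every node on the path is touched
        node = child
    return d, match_pos


root = TrieNode()
text = "abcabcab"
results = [insert(root, text, j) for j in range(len(text))]
print(results[3], results[4])
```

In this run the insertion at position 3 descends 5 existing levels and the one at position 4 descends 4, matching the "at least d-1" observation. Each insertion walks its whole path again from the root, which is the factor d that suffix links are meant to remove.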

Modifications------------------------------------

a. Suffix links:

Each node representing a string aX has a pointer to the node representing the string X.

Immediate advantage:

We don’t need to return to the root after each insertion.

Suffix Links------------------------------------

How we use and create suffix links, for the text ..aXYb..:

1. Create a new node (for aXY), and insert b.

2. Use a suffix link to insert XYb:

a.1 Go up to the new node's parent (for aX) and cross to the node for X using the suffix link.

a.2 Rescan to the node for XY (it does not necessarily exist). If it doesn’t exist, create it! Rescan means we don’t need to check the string again, but go straight to that node.

a.3 Scan from the node for XY to insert XYb.

3. Add the new node's suffix link (to the node for XY). And we are finished with the insertion!

Invariant kept: every internal node has a suffix link (except the one just created).

Demands from the data structure:

We explained insertion.

What about deletion?

[Figure: the text "……gffghk……", with "delete" and "insert" marked on it.]

Modifications- cont.------------------------------------

Deletion:

b. Leaves in a circular buffer: identify the oldest leaf and delete it.

c. 'Son count': when it falls to one, delete the node and combine the arcs.

[Figure: a circular buffer of leaves, positions 1 to 4096, and a node with son count = 3.]

Is it enough?------------------------------------

We still have a problem.

Pointers in higher nodes can become out of date.

But climbing up to update those pointers would cancel the advantages of using the suffix links!

Modifications- Last ------------------------------------

d. Percolating updates:

Each internal node has an update bit (a true/false bit).

Percolating updates ------------------------------------

When updating a node:

If bit = true:

1. Set the bit to false.

2. Propagate the update to the parent.

If bit = false:

1. Set the bit to true.

2. Stop the update.
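The update rule can be sketched as follows. This is a minimal illustration: the `Node` fields and the function name `percolate` are hypothetical, and the sketch assumes that a node's position is refreshed on every update it receives, which is one reading of the rule.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.position = 0         # pointer into the text window
        self.update_bit = False


def percolate(leaf, position):
    """Send an update from a new leaf upward.

    Returns the number of internal nodes that were touched.
    """
    node, touched = leaf.parent, 0
    while node is not None:
        node.position = position  # assumption: refreshed on every update
        touched += 1
        if node.update_bit:
            node.update_bit = False
            node = node.parent    # bit was true: propagate to the parent
        else:
            node.update_bit = True
            break                 # bit was false: stop here
    return touched


# A chain root <- mid <- low <- leaf, updated four times.
root = Node()
mid = Node(root)
low = Node(mid)
leaf = Node(low)
costs = [percolate(leaf, pos) for pos in (10, 11, 12, 13)]
print(costs, root.position)
```

Each node forwards only every second update it receives, so the work halves at every level. In the run above the four updates touch 1, 2, 1 and 3 nodes, which stays within two per update on average.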

Percolating updates-cont.-------------------------------------------

Effect:

Keeps all internal pointers at positions within the 4096-byte window of the file.

Worst case: the update propagates all the way to the root.

Amortized: summing over all new leaves, we get constant cost.

Summary of the inner loop---------------------------------------------------------

The operations:

1. Insert:

a. Insert the previous string.

b. Use the suffix link to insert the next string.

2. Percolate an update from the leaf:

If the bit is true: set the position field of the node to the current position, set the bit to false, and propagate to the parent.

If the bit is false: set it to true, and stop.

3. Circular buffer:

a. Replace the oldest leaf with the new one.

b. If its parent has only one remaining son:

1. Delete the parent, and attach the remaining son to the grandparent.

2. Percolate the deleted node’s position (*special case: comparative percolation).
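The circular-buffer step can be sketched as below. This is a simplified illustration with hypothetical names (`Node`, `delete_leaf`, `replace_oldest`): arcs are identified only by their first symbol, the son count is taken as the size of the children map, and the percolation of the deleted node's position is left out.

```python
class Node:
    def __init__(self, key=None, parent=None):
        self.key = key            # first symbol of the arc from the parent
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[key] = self


def delete_leaf(leaf):
    """Remove a leaf; splice out its parent if one son remains."""
    parent = leaf.parent
    del parent.children[leaf.key]
    if len(parent.children) == 1 and parent.parent is not None:
        (only_son,) = parent.children.values()
        grandparent = parent.parent
        # Combine the two arcs: the son takes over the parent's arc.
        only_son.key = parent.key
        only_son.parent = grandparent
        grandparent.children[parent.key] = only_son


def replace_oldest(buffer, position, new_leaf):
    """Store new_leaf in the circular buffer, deleting the leaf it evicts."""
    slot = position % len(buffer)
    if buffer[slot] is not None:
        delete_leaf(buffer[slot])
    buffer[slot] = new_leaf


# Tiny window of 4 positions instead of 4096.
root = Node()
inner = Node("a", root)
old = Node("b", inner)
other = Node("c", inner)
buffer = [None] * 4
replace_oldest(buffer, 0, old)
replace_oldest(buffer, 1, other)
new = Node("d", root)
replace_oldest(buffer, 4, new)    # position 4 reuses slot 0 and evicts old
print(sorted(root.children))
```

Because the buffer slot is the position modulo the window size, the leaf being overwritten is always the oldest one, so no search is needed to find it.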

Where do we stand?

1. Compression: done.

2. Data Structure: done (we are here).

Theoretical Considerations----------------------------------------------------

Correctness and linearity of suffix tree construction-

we already saw that.

We need to be convinced about destruction:

Theorem 1:

Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.

Proof:

Assume the contrary: a node's suffix link points to a node that was deleted.

The existence of the first node (for a string aX of length l) means: two strings in the window agree for l symbols and differ at l+1, e.g. ..df..gb…df..gz..

Dropping their first symbol gives two strings that agree for l-1 symbols and differ at l. They start one position later, so under FIFO deletion they are still in the tree.

This contradicts the assumption that the node for X had only one son and was therefore deleted.

Theoretical Considerations-----------------------------------------------------

Theorem 2:

Each percolated update has constant amortized cost.

Proof:

Assume a ‘credit’ on each internal node whose ‘update’ flag is true.

A new node is added with two ‘credits’:

One is spent to update the parent.

The second is either given to the parent, and we terminate (the parent's flag was false),

or combined with the parent's credit to obtain two on the parent, and we continue (the parent's flag was true). Apply recursively on the parent.

Result:

The invariant is kept, and we get an amortized cost of two updates per new leaf.

Theoretical Considerations-----------------------------------------------------

Theorem 3 (effectiveness):

Using the percolating update, every internal node will be updated at least once in a period (4096).

Proof:

We will prove that every internal node is updated at least twice in a period, and thus propagates at least one update up.

(By contradiction.) Find the node farthest from the root that doesn’t propagate an update to its parent.

3 cases:

a. It has two (or more) remaining* children: both are farther from the root, thus they updated it.

(*A child that has remained for the entire period.)

b. It has only one remaining child: one update comes from it. The second comes from a new child when it is created (a new arc causes the son to update the parent).

c. It has two new children: similar.

In all cases, the node will receive two updates during a period, and thus propagate an update. Contradiction.

Other Theoretical Considerations (bounds on the compression)-----------------------------------------------------------

We have focused on the data structure.

There are other questions, about the compression.

But that is for another time (and another course)!

We will only mention them briefly:

Consider the following (the encoder is at position j):

Position:               j    j+1   j+2   j+3   j+4   j+5

Copy length available:  1    3     16    15    14    13

A1: (literal 1)x (copy 3 y)(copy 14 y), 6 bytes

Optimal: (literal 2)xx (copy 16 y), 5 bytes

How bad can it get?
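The two byte counts can be reproduced with a short sketch that compares the greedy A1 policy against a dynamic-programming optimum over the same codeword costs (1 header byte plus the characters for a literal, 2 bytes for a copy). The function names are hypothetical, and the copy lengths after the sixth position are an assumption: they are taken to keep decreasing by one, since the example only lists the first six.

```python
def a1_cost(avail):
    """Bytes emitted by the greedy A1 policy; avail[p] = longest copy at p."""
    n, p, cost, lit = len(avail), 0, 0, 0   # lit = length of the open literal
    while p < n:
        longest = min(avail[p], 16, n - p)
        if longest >= (3 if lit else 2):
            cost += 2                        # 16-bit copy codeword
            p += longest
            lit = 0
        else:
            if lit == 0:
                cost += 1                    # 8-bit literal header
            cost += 1                        # the literal character itself
            lit = (lit + 1) % 16             # a full literal closes itself
            p += 1
    return cost


def optimal_cost(avail):
    """Minimum bytes over all parsings, by dynamic programming from the end."""
    n = len(avail)
    best = [float("inf")] * (n + 1)
    best[n] = 0
    for p in range(n - 1, -1, -1):
        for k in range(1, min(16, n - p) + 1):             # literal of k chars
            best[p] = min(best[p], 1 + k + best[p + k])
        for k in range(2, min(avail[p], 16, n - p) + 1):   # copy of k chars
            best[p] = min(best[p], 2 + best[p + k])
    return best[0]


# Positions j .. j+5 from the example, then the assumed tail 12, 11, ..., 1.
avail = [1, 3, 16, 15, 14, 13] + list(range(12, 0, -1))
print(a1_cost(avail), optimal_cost(avail))
```

The greedy policy grabs the copy of length 3 at j+1 and so misses the copy of length 16 that starts one position later. The optimum needs to look ahead to see it, which a one-pass method cannot do.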

Heuristic vs. Optimal-------------------------------

Foresight algorithms:

They need more than one pass: we pay big time.

And the gain?

(Optimal vs. A1)

On average: about 1% better.

In the worst case: 20%.

Back to our business

A1’s virtues-------------------------

- Simple one-pass, adaptive, lossless method.

- Natural approach to 8 bits per character.

Performance:

- Compression ratio: up to 1/8.

- Expander: fast, simple, small storage requirements.

- Compressor: much slower and larger.

(All in comparison to other copy/literal methods.)

Improvements--------------------------------

- Enlarge the window: gain compression ratio, pay in space and speed.

- Enlarge the copy length: same.

- Change the encoding: gain performance, pay in simplicity.

- Change the update policy: gain compression speed, pay in space and expansion speed.

Summary--------------------------------

We introduced the compression problem and proposed a simple substitutional compression algorithm based on copy/literal codewords.

Our main interest was the data structure. We saw how a modification of the basic suffix tree answers the algorithm's demands, and at what cost.

Don’t push
