Post on 21-Jan-2016

Data Compression with Finite Windows

Fiala and Greene

Speaker: Giora Alexandron

Overview:-----------------------

Our main purpose:

See how a suffix tree supports a compression algorithm.

What we will see:

A data compression method that works by substituting text. It uses a modification of the basic suffix tree to support cyclic maintenance of the most recent strings seen in the file.

Outline------------------------

1. Compression: - In General

- Our Algorithm

2. Data Structure: - Modification of the suffix tree.

3. Theoretical Considerations: - Proofs.

4. Improvements.

Compression-------------------------------

What is compression:

Compression is the coding of data to minimize its representation. We will focus on lossless, adaptive, one-pass methods.

Main approaches:

- Statistical approach: try to predict the next symbol.

- Substitutional approach: replace blocks of text with references to earlier occurrences of identical text.

**We will focus on a substitutional method.**

Compression-cont.------------------------------

What characterizes a good compressor:

- Good compression ratio.

- Runs fast in compression.

- Uses a minimum of space.

- Runs fast in expansion.

There are trade-offs among all of these. Naturally, we want to achieve them all:

A good algorithm + a matching data structure

Substitutional Compressing---------------------------------------

Consider the following basic scheme:

The compressed file contains two types of codewords:

literal x: pass the next x characters directly to the output.

copy x, -y: go back y characters and copy x characters starting at that position.
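The expansion side of this scheme can be sketched in a few lines of Python. This is only an illustration: the tuple representation of codewords (`("literal", text)` and `("copy", length, back)`) and the function name `expand` are assumptions made for the sketch, not something the slides define.

```python
from typing import List, Tuple, Union

# Hypothetical in-memory codewords (not from the slides):
#   ("literal", text)       -> pass text straight to the output
#   ("copy", length, back)  -> go back `back` characters and copy `length` of them
Codeword = Union[Tuple[str, str], Tuple[str, int, int]]


def expand(codewords: List[Codeword]) -> str:
    """Rebuild the original text from a list of literal/copy codewords."""
    out: List[str] = []
    for cw in codewords:
        if cw[0] == "literal":
            out.extend(cw[1])
        else:
            _, length, back = cw
            start = len(out) - back
            if start < 0:
                raise ValueError("copy reaches before the start of the output")
            # Copy one character at a time, so a copy may overlap
            # the text it is currently producing.
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)
```

Because each copy refers only to text that has already been produced, the expander needs nothing but the output written so far.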

Example------------------------------------------------

..it was the best of times, it was the worst of times..

Would compress to:

(literal 26)it was the best of times, (copy 11 -26)(literal 3)wor(copy 11 -27)

[Figure: arrows over the text mark the 26 literal characters (+26), then each copy's displacement and length (-26 +11, -27 +11).]

Example-cont.------------------------------------------------

And we get a very simple lossless method:

[Figure: compression maps "..it was the best of times, it was the worst of times." to "(literal 26)it was the best of times, (copy 11 -26)(literal 3)wor(copy 11 -27)."; expansion maps it back.]

The compression achieved depends on the size of the copy and literal codewords.

A1------------------------------------------------------

The encoding of A1:

- 8 bits for a literal codeword: 0000xxxx (bits 0..7), where xxxx encodes the literal length [1..16].

- 16 bits for a copy codeword: xxxxyy..yy (bits 0..15), where xxxx encodes the copy length [2..16] and the remaining 12 bits yy..yy encode the displacement [1..4096].

(Can you figure out the logic behind it?)

And we get a compression of 51 to 36 bytes:

(literal 16)it was the best (literal 10)of times, (copy 11 -26)(literal 3)wor(copy 11 -27)
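One way to read the two bit layouts is sketched below in Python. The slides give only the field widths and ranges; storing each value minus one and using big-endian byte order are assumptions of this sketch, and the function names are made up for illustration. Under that reading, a zero in the first four bits marks a literal, which is why copy lengths start at 2.

```python
from typing import Tuple


def pack_literal(length: int) -> bytes:
    """8-bit literal header 0000xxxx; xxxx holds length - 1 (assumed bias)."""
    if not 1 <= length <= 16:
        raise ValueError("literal length must be in 1..16")
    return bytes([length - 1])


def pack_copy(length: int, displacement: int) -> bytes:
    """16-bit copy codeword: 4 bits of length - 1, 12 bits of displacement - 1."""
    if not 2 <= length <= 16:
        raise ValueError("copy length must be in 2..16")
    if not 1 <= displacement <= 4096:
        raise ValueError("displacement must be in 1..4096")
    word = ((length - 1) << 12) | (displacement - 1)
    return word.to_bytes(2, "big")


def unpack(buf: bytes, i: int) -> Tuple[tuple, int]:
    """Decode the codeword header at buf[i]; return (codeword, next index)."""
    first = buf[i]
    if first >> 4 == 0:  # top four bits zero: this is a literal header
        return ("literal", (first & 0x0F) + 1), i + 1
    word = int.from_bytes(buf[i:i + 2], "big")
    return ("copy", (word >> 12) + 1, (word & 0x0FFF) + 1), i + 2


def a1_size(codewords) -> int:
    """Bytes used: 1 header byte plus the characters for a literal, 2 for a copy."""
    return sum(1 + cw[1] if cw[0] == "literal" else 2 for cw in codewords)
```

For the example, `a1_size([("literal", 16), ("literal", 10), ("copy", 11, 26), ("literal", 3), ("copy", 11, 27)])` adds up 17 + 11 + 2 + 4 + 2 = 36 bytes, matching the figure on the slide.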

A1’s policy----------------------------

If the compressor is idle (it has just finished a codeword):

look for a copy of length >= 2;

otherwise, start a literal.

If the compressor is in the middle of a literal:

extend it until a copy of length >= 3 is found.
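The policy can be sketched as a greedy loop. The Python below is a minimal illustration, not the talk's implementation: it finds the longest match by brute force (the suffix tree exists precisely to avoid this), and the names `longest_match`, `compress`, `WINDOW` and `MAX_LEN` are assumptions of the sketch. Closing a literal once it reaches 16 characters follows from the literal length range [1..16].

```python
from typing import List, Tuple

WINDOW = 4096   # maximum displacement
MAX_LEN = 16    # maximum copy length and maximum literal length


def longest_match(data: str, pos: int) -> Tuple[int, int]:
    """Brute-force search for the longest earlier match of data[pos:].

    Returns (length, displacement), or (0, 0) when nothing matches.
    """
    best_len, best_back = 0, 0
    for start in range(max(0, pos - WINDOW), pos):
        length = 0
        while (length < MAX_LEN and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        if length > 0 and length >= best_len:  # ties go to the nearest match
            best_len, best_back = length, pos - start
    return best_len, best_back


def compress(data: str) -> List[tuple]:
    out: List[tuple] = []
    literal = ""  # literal being extended; empty means the compressor is idle
    pos = 0
    while pos < len(data):
        length, back = longest_match(data, pos)
        threshold = 3 if literal else 2  # the A1 policy from the slide
        if length >= threshold:
            if literal:
                out.append(("literal", literal))
                literal = ""
            out.append(("copy", length, back))
            pos += length
        else:
            literal += data[pos]
            pos += 1
            if len(literal) == MAX_LEN:  # literal is full: close it, become idle
                out.append(("literal", literal))
                literal = ""
    if literal:
        out.append(("literal", literal))
    return out
```

On "it was the best of times, it was the worst of times" this sketch yields a 16-character literal, a 10-character literal, (copy 11 -26), the literal "wor", and (copy 11 -27).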

Where do we stand?

1. Compression: - In General (done)

- Our Algorithm (done)

2. Data Structure: - Modification of the suffix tree. (here)

3. Theoretical Considerations: - Proofs.

The Data Structure-----------------------------------------

What do we need?

Find the current longest match (for copy).

What could we use?

Naive solution:

A suffix tree with all strings of length <= 16 in the previous 4096-byte window.

[Figure: the 4096-byte window ending at the current position, with strings of length 16 marked inside it.]

The cost --------------------------------------------

If we descended d levels to insert the string starting at position j,

we will descend at least d-1 levels to insert the string starting at position j+1.

So the cost of insertion is O(nd).

But we want to eliminate d.

[Figure: inside the 4096-byte window, the string at j descends d levels and the string at j+1 descends at least d-1 levels.]

Modifications------------------------------------

a. Suffix links:

Each node representing the string aX has a pointer to the node representing the string X.

Immediate advantage:

We don’t need to return to the root after each insertion.

[Figure: the node for aX with a suffix link to the node for X; an arc labeled Y hangs below each of them.]
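The definition can be checked on a small example. The Python sketch below uses a plain, uncompressed trie of all substrings up to a depth limit, which is a simplification: the talk's structure is a compact suffix tree built incrementally, and this sketch does not implement that construction. All names (`Node`, `build_trie`, `add_suffix_links`, `find`) are made up for illustration.

```python
from collections import deque
from typing import Dict, Optional


class Node:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.children: Dict[str, "Node"] = {}
        self.suffix_link: Optional["Node"] = None


def build_trie(text: str, max_depth: int) -> Node:
    """Insert every substring of length <= max_depth, always starting at the root."""
    root = Node(0)
    for j in range(len(text)):
        node = root
        for ch in text[j:j + max_depth]:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
    return root


def add_suffix_links(root: Node) -> None:
    """Link the node for aX to the node for X, level by level."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            if node is root:
                child.suffix_link = root  # a single character links to the empty string
            else:
                # If aXc occurs in the text, then Xc occurs too, so this lookup succeeds.
                child.suffix_link = node.suffix_link.children[ch]
            queue.append(child)


def find(root: Node, s: str) -> Node:
    node = root
    for ch in s:
        node = node.children[ch]
    return node
```

Following a link from depth d lands at depth d-1, which is what lets the next insertion start there instead of at the root.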

Suffix Links------------------------------------

How we use and create suffix links:

..aXYb..

1. Create a new node (for aXY), and insert b.

2. Use the suffix link to insert XYb:

a.1 We go up to the node for aX and cross to the node for X using the suffix link.

a.2 Rescan Y to reach the node for XY (which does not necessarily exist). If it doesn’t exist, create it! Rescanning means we don’t need to check the string again, but go straight to that node.

a.3 Scan from the node for XY to insert XYb.

3. Add the new node’s suffix link (to the node for XY). And we finish with the insertion!

Invariant kept: every internal node has a suffix link (except the one just created).

[Figure: the nodes for aX and X joined by a suffix link, each with an arc Y below it; the new node below aX gets the leaf b; "rescan" follows Y below X and "scan" continues from there.]

Demands from the data structure:

We explained insertion.

What about deletion?

[Figure: the text ......gffghk...... with the 4096-byte window; the oldest string is deleted at one end, a new string is inserted at the other, and the match is searched inside the window.]

Modifications- cont.------------------------------------

Deletion:

b. Leaves in a circular buffer:

identify the oldest leaf and delete it.

c. ’Son count’:

when it falls to one, delete the node and combine the arcs.

[Figure: a circular buffer of leaf positions 1..4096; an internal node with son count = 3.]

Is it enough?------------------------------------

NO.

We still have a problem.

Higher pointers can become out-of-date.

But climbing up to update those pointers would take away the advantages of using the suffix links!

[Figure: the suffix tree from the previous slides, with a node's pointer into the text ..fkjg...]

Modifications- Last ------------------------------------

d. Percolating updates:

Each internal node has an update bit (true/false).

When updating a node:

If bit = true:

1. Set the bit to false.

2. Propagate the update to the parent.

If bit = false:

1. Set the bit to true.

2. Stop the update.

Percolating updates-cont.-------------------------------------------

Effect:

Keeps the positions stored in all internal nodes within the 4096-byte window of the file.

Cost:

Worst case:

the update propagates up to the root.

Amortized:

summing over all new leaves, we get constant cost.
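The update rule can be sketched in Python as follows. The class and field names (`Internal`, `position`, `update_bit`) are hypothetical, and the sketch assumes that a node refreshes its position every time an update reaches it, with the bit deciding only whether the update continues upward.

```python
from typing import Optional


class Internal:
    """An internal node: parent pointer, a position in the text, an update bit."""

    def __init__(self, parent: Optional["Internal"] = None) -> None:
        self.parent = parent
        self.position = 0
        self.update_bit = False


def percolate(node: Optional[Internal], position: int) -> int:
    """Apply an update starting at `node`; return how many nodes were touched."""
    touched = 0
    while node is not None:
        node.position = position  # the node now points inside the window
        touched += 1
        if node.update_bit:
            node.update_bit = False  # true: clear the bit and pass the update up
            node = node.parent
        else:
            node.update_bit = True   # false: set the bit and stop
            break
    return touched
```

Every second update that reaches a node continues to its parent, so long climbs are rare; four updates sent to the bottom of a three-node chain touch 1, 2, 1 and 3 nodes.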

Summary of the inner loop---------------------------------------------------------

The operations:

1. Insert:

a. Insert the previous string.

b. Use the suffix link to insert the next string.

2. Percolate update from the leaf:

Set the position field of the node to the current position.

If the bit is true, set it to false and propagate to the parent.

If the bit is false, set it to true, and stop.

3. Circular buffer:

a. Replace the oldest leaf with the new one.

b. If its parent has only one remaining son:

1. Delete the parent, and attach the remaining son to the grandparent.

2. Percolate the deleted node’s position

(*special case: comparative percolation).
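The deletion step (3) can be sketched on a toy tree. In the Python below, `Node`, `delete_oldest` and the use of `collections.deque` as the FIFO of leaves are assumptions made for illustration; arcs carry explicit label strings rather than positions in the text, and suffix links and the percolation of the deleted node's position are left out.

```python
from collections import deque
from typing import Deque, Dict, Optional


class Node:
    def __init__(self, label: str, parent: Optional["Node"] = None) -> None:
        self.label = label  # label of the arc leading into this node
        self.parent = parent
        self.children: Dict[str, "Node"] = {}  # keyed by first character of the arc
        if parent is not None:
            parent.children[label[0]] = self


def delete_oldest(leaves: Deque[Node]) -> None:
    """Delete the oldest leaf; splice out its parent if one son remains."""
    leaf = leaves.popleft()
    parent = leaf.parent
    del parent.children[leaf.label[0]]
    # The 'son count' is len(parent.children); the root is never spliced out.
    if parent.parent is not None and len(parent.children) == 1:
        (only,) = parent.children.values()
        only.label = parent.label + only.label  # combine the two arcs
        only.parent = parent.parent
        # Same first character as the parent's arc, so this replaces the parent.
        parent.parent.children[only.label[0]] = only
```

A fixed-size array indexed by position modulo 4096 would serve equally well as the circular buffer; the deque only stands in for "oldest leaf first".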

Where do we stand?

1. Compression: - In General (done)

- Our Algorithm (done)

2. Data Structure: - Modification of the suffix tree. (done)

3. Theoretical Considerations: - Proofs. (here)

Theoretical Considerations----------------------------------------------------

Correctness and linearity of suffix tree construction:

we already saw that.

We need to be convinced about destruction:

Theorem 1:

Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.

Proof:

Assume the contrary: a node's suffix link points to a node that was deleted.

The existence of the first node (at depth l) means that two strings in the window agree for l symbols and differ at symbol l+1:

……df..gb…df..gz..

Dropping the first symbol of each, two strings agree for l-1 symbols and differ at symbol l. Both start one position later than the originals, so under FIFO deletion they are still in the window.

This contradicts the assumption that the target node had only one son and was therefore deleted.

[Figure: two branches b and z below a node at depth l, and the same branches below its suffix-link target at depth l-1.]

Theoretical Considerations-----------------------------------------------------

Theorem 2:

Each percolated update has constant amortized cost.

Proof:

Assume a ‘credit’ on each internal node whose ‘update’ flag is true.

A new leaf is added with two ‘credits’:

One is spent to update the parent.

The second: if the parent's flag is false, give the credit to the parent (its flag becomes true) and terminate;

or, if the parent's flag is true, the parent now holds two credits (its flag becomes false), and we continue.

Apply recursively on the parent.

Result:

the invariant is kept, and we get an amortized cost of two updates per new leaf.

[Figure: a new leaf with 2 credits below a parent whose flag is false (0 credits becomes 1, flag becomes true), and below a parent whose flag is true (1 credit becomes 2, passed on to its own parent).]
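The same credit argument can be written as a potential-function calculation. The potential $\Phi$ and the symbol $k$ are notation introduced here for the sketch; the slides use only the word "credit".

```latex
% \Phi = number of internal nodes whose update flag is true
%        (one credit per such node).
% A percolation that touches k nodes clears k-1 flags (true -> false)
% and sets at most one flag (false -> true), so
\[
  \Delta\Phi \;\le\; 1 - (k - 1) \;=\; 2 - k ,
\]
\[
  \text{amortized cost} \;=\; \underbrace{k}_{\text{actual updates}} + \Delta\Phi
  \;\le\; k + (2 - k) \;=\; 2 .
\]
% Since \Phi \ge 0 and \Phi = 0 for an empty tree, the total number of
% node updates caused by n new leaves is at most 2n.
```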

Theoretical Considerations-----------------------------------------------------

Theorem 3 (effectiveness):

Using the percolating update, every internal node will be updated at least once in a period (4096).

Proof:

We will prove that every internal node is updated at least twice in a period, and thus propagates at least one update up.

(By contradiction.) Find the node farthest from the root that doesn’t propagate an update to its parent.

3 cases:

a. It has two (or more) remaining* children: both are farther from the root, and thus each has updated it.

b. It has only one remaining* child: one update comes from it. The second comes from a new child when it is created (a new arc causes the son to update the parent).

c. It has two new children: similar.

In all cases, the node receives two updates during a period, and thus propagates an update. Contradiction.

* A remaining child is a child that has remained for the entire period.

Other Theoretical Considerations (bounds on the compression)

-----------------------------------------------------------

We have focused on the data structure.

There are other questions about the compression.

But more on that another time (and in another course)!

We will only mention them briefly:

Consider the following:

Position:               j    j+1   j+2   j+3   j+4   j+5
Copy length available:  1    3     16    15    14    13

The encoder is at position j.

A1:      (literal 1)x(copy 3 y)(copy 14 y)    6 bytes
Optimal: (literal 2)xx(copy 16 y)             5 bytes

How bad can it get?
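The two byte counts can be reproduced with a small Python sketch. It assumes the A1 sizes (1 header byte plus the characters for a literal, 2 bytes for a copy) and that any shorter copy of length >= 2 is available wherever a longer one is; the function names and the `avail` list are illustrative only.

```python
from typing import List

MAX_LEN = 16


def a1_cost(avail: List[int]) -> int:
    """Bytes used by the greedy A1 policy; avail[i] = longest copy at position i."""
    n = len(avail)
    cost, pos, literal = 0, 0, 0  # literal = length of the literal in progress
    while pos < n:
        need = 3 if literal else 2
        length = min(avail[pos], n - pos)
        if length >= need:
            cost += 2  # a copy codeword
            pos += length
            literal = 0
        else:
            if literal == 0:
                cost += 1  # header byte of a new literal
            cost += 1      # the literal character itself
            literal += 1
            pos += 1
            if literal == MAX_LEN:
                literal = 0
    return cost


def optimal_cost(avail: List[int]) -> int:
    """Minimum bytes over all parses, by dynamic programming from the end."""
    n = len(avail)
    best = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        options = [1 + k + best[i + k]                       # literal of k characters
                   for k in range(1, min(MAX_LEN, n - i) + 1)]
        options += [2 + best[i + k]                          # copy of k characters
                    for k in range(2, min(avail[i], n - i) + 1)]
        best[i] = min(options)
    return best[0]


# The slide's situation: 18 characters, copy lengths 1, 3, 16, 15, 14, 13, ...
avail = [1, 3] + list(range(16, 0, -1))
```

The dynamic program looks at the whole input before deciding, which is the foresight that a one-pass method gives up.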

Heuristic vs. Optimal-------------------------------

Foresight algorithms:

Must make more than one pass: we pay big time.

And the gain?

(Optimal vs. A1)

On average: about 1% better.

In the worst case: 20%.


Back to our business

A1’s virtues-------------------------

- Simple one-pass adaptive lossless method.

- Natural approach to 8 bits per character.

Performance:

- Compression ratio: up to 1/8.

- Expander: fast, simple, small storage requirements.

- Compressor: much slower and larger.

(All in comparison to other copy/literal methods.)

Improvements--------------------------------

- Enlarge the window: gain compression ratio, pay in space and speed.

- Enlarge the copy length: same.

- Change the encoding: gain performance, pay in simplicity.

- Change the update policy: gain compression speed, pay in space and expansion speed.

Summary

We introduced the compression problem and proposed a simple substitutional compression algorithm based on copy/literal codewords.

Our main interest was the data structure. We saw how a modification of the basic suffix tree answers the algorithm's demands, and at what cost.
