Page 1: Some  aspects of information  theory  for a computer  scientist

Some aspects of information theory for a computer scientist

Eric Fabre
http://people.rennes.inria.fr/Eric.Fabre
http://www.irisa.fr/sumo

11 Sep. 2014

Page 2: Some  aspects of information  theory  for a computer  scientist

Outline

1. Information: measure and compression

2. Reliable transmission of information

3. Distributed compression

4. Fountain codes

5. Distributed peer-to-peer storage

11/09/14

Page 3: Some  aspects of information  theory  for a computer  scientist

Information: measure and compression

11/09/14

1

Page 4: Some  aspects of information  theory  for a computer  scientist

Let’s play…

11/09/14

One card is drawn at random from the following set (shown as a figure). Guess the color (suit) of the card with a minimum of yes/no questions.

One strategy
• is it hearts ?
• if not, is it clubs ?
• if not, is it diamonds ?

It wins in
• 1 guess, with probability ½
• 2 guesses, with prob. ¼
• 3 guesses, with prob. ¼
→ 1.75 questions on average

Is there a better strategy ?
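As a check (not on the slides) : with the suit probabilities ½, ¼, ⅛, ⅛ implied by the win probabilities above, the average cost of this strategy equals the entropy of the suit, which is why no strategy does better on average.

```python
import math

# Hypothetical suit probabilities, inferred from the win probabilities above:
# 1 question w.p. 1/2 (hearts), 2 w.p. 1/4 (clubs), 3 w.p. 1/4 (diamonds or spades).
p = {"hearts": 1/2, "clubs": 1/4, "diamonds": 1/8, "spades": 1/8}
questions = {"hearts": 1, "clubs": 2, "diamonds": 3, "spades": 3}

avg_questions = sum(p[s] * questions[s] for s in p)        # 1.75 questions on average
entropy = -sum(q * math.log2(q) for q in p.values())       # 1.75 bits

print(avg_questions, entropy)
```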

Page 5: Some  aspects of information  theory  for a computer  scientist

11/09/14

Observation

Lessons
- more likely means easier to guess (carries less information)
- the amount of information depends only on the log-likelihood of an event
- guessing with yes/no questions = encoding with bits = compressing

(figure: the questions define a prefix code over the suits, apparently 1, 01, 001, 000, so that a sequence of draws is encoded by concatenation, e.g. 01001000)

Page 6: Some  aspects of information  theory  for a computer  scientist

11/09/14

Important remark:

• codes like the one below are not permitted

• they cannot be uniquely decoded if one transmits sequences of encoded values of X ; e.g. the sequence 11 can encode “Diamonds” or “Hearts, Hearts”

• one would need one extra symbol to separate “words”

(figure: the forbidden code, apparently 1, 0, 11, 00 over the four suits)

Page 7: Some  aspects of information  theory  for a computer  scientist

Entropy

11/09/14

Source of information = random variable
notation: variables X, Y, … taking values x, y, …

information carried by the event “X = x” : h(x) = −log2 P(X = x)

average information carried by X : H(X) = E[ −log2 P(X) ] = −Σx P(X = x) log2 P(X = x)

H(X) measures the average difficulty to encode/describe/guess random outcomes of X
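A one-line implementation of H(X) for a finite distribution (illustrative, not from the slides) :

```python
import math

def H(dist):
    """Entropy in bits of a finite distribution, given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(H([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits : the card example
print(H([0.25] * 4))                 # 2.0 bits  : uniform over 4 values
print(H([1.0]))                      # 0.0 bits  : not random at all
```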

Page 8: Some  aspects of information  theory  for a computer  scientist

Properties

11/09/14

H(X,Y) ≤ H(X) + H(Y), with equality iff X and Y independent (i.e. P(X,Y) = P(X)·P(Y))

H(X) ≥ 0, with equality iff X not random

H(X) ≤ log2 |X|, with equality iff the distribution of X is uniform

Bernoulli distribution : H(X) = −p·log2 p − (1−p)·log2(1−p), maximal (= 1 bit) at p = ½

Page 9: Some  aspects of information  theory  for a computer  scientist

Conditional entropy

11/09/14

H(Y|X) = Σx P(X=x)·H(Y | X=x) = −Σx,y P(x,y)·log2 P(y|x) : uncertainty left on Y when X is known

Property

H(Y|X) ≤ H(Y), with equality iff Y and X independent

Page 10: Some  aspects of information  theory  for a computer  scientist

11/09/14

Example : X = color, Y = value

(The slide computes H(Y | X=x) for each color x, averages these values into H(Y|X), recalls H(Y), and so one checks that H(Y|X) ≤ H(Y).)

Exercise : check that
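The quantities involved here can be verified numerically on any joint distribution ; a minimal sketch on a hypothetical joint table (the actual deck of the slide is not reproduced), which also checks the chain rule H(X,Y) = H(X) + H(Y|X) :

```python
import math
from collections import defaultdict

# Hypothetical joint distribution P(X = color, Y = value); any table summing to 1 works.
joint = {("red", "ace"): 0.25, ("red", "king"): 0.25,
         ("black", "ace"): 0.10, ("black", "king"): 0.40}

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p
    pY[y] += p

HX, HY, HXY = H(pX.values()), H(pY.values()), H(joint.values())

# H(Y|X) = sum_x P(X=x) * H(Y | X=x), i.e. the average of the per-x entropies
HY_given_X = sum(pX[x] * H([joint[x, y] / pX[x] for y in pY if (x, y) in joint]) for x in pX)

print(f"H(X)={HX:.3f}  H(Y)={HY:.3f}  H(X,Y)={HXY:.3f}  H(Y|X)={HY_given_X:.3f}")
print("chain rule H(X)+H(Y|X)=H(X,Y) :", abs(HX + HY_given_X - HXY) < 1e-12)
print("H(Y|X) <= H(Y) :", HY_given_X <= HY + 1e-12)
```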

Page 11: Some  aspects of information  theory  for a computer  scientist

11/09/14

A visual representation

(figure: information diagram showing H(X) and H(Y) as two overlapping regions, whose intersection is I(X;Y) and whose private parts are H(X|Y) and H(Y|X))

Page 12: Some  aspects of information  theory  for a computer  scientist

11/09/14

Data compression

CoDec for source X, with R bits/sample on average

rate R is achievable iff there exist CoDec pairs (fn, gn) of rate R with vanishing error probability : P[ gn(fn(X1…Xn)) ≠ (X1…Xn) ] → 0 as n → ∞

Usage: there was no better strategy for our card game !

Theorem (Shannon, ‘48) :
- a lossless compression scheme for source X must have a rate R ≥ H(X) bits/sample on average
- the rate H(X) is (asymptotically) achievable

Page 13: Some  aspects of information  theory  for a computer  scientist

11/09/14

Proof

Necessity : if R is achievable, then R ≥ H(X) ; quite easy to prove.
Sufficiency : for R > H(X), one must build a lossless coding scheme using R bits/sample on average.

Solution 1
• use a known optimal lossless coding scheme for X : the Huffman code (see the sketch below)
• then prove H(X) ≤ L < H(X) + 1, where L is the average codeword length
• over n independent symbols X1,…,Xn, one gets H(X) ≤ Ln/n < H(X) + 1/n, so the rate per symbol approaches H(X)

Solution 2 : encoding only “typical sequences”
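A minimal sketch of the Huffman code of Solution 1 (illustrative, not the lecture's own implementation) : it computes the codeword lengths with a heap and checks H(X) ≤ L < H(X) + 1 on the card distribution.

```python
import heapq, math

def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (probs: dict symbol -> probability)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)                      # tie-breaker, so tuples never compare the lists
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # every symbol under the merged node gets one more bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = {"hearts": 0.5, "clubs": 0.25, "diamonds": 0.125, "spades": 0.125}
lengths = huffman_lengths(probs)
L = sum(probs[s] * lengths[s] for s in probs)
H = -sum(p * math.log2(p) for p in probs.values())
print(lengths)                     # {'hearts': 1, 'clubs': 2, 'diamonds': 3, 'spades': 3}
print(H, L, H <= L < H + 1)        # 1.75 1.75 True
```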

Page 14: Some  aspects of information  theory  for a computer  scientist

11/09/14

Typical sequences

Let X1,…,Xn be independent, with the same law as X

By the law of large numbers, one has the a.s. convergence −(1/n)·log2 P(X1…Xn) → H(X)

Sequence x1…xn is typical iff | −(1/n)·log2 P(x1…xn) − H(X) | ≤ ε

or equivalently 2^(−n(H(X)+ε)) ≤ P(x1…xn) ≤ 2^(−n(H(X)−ε))

Set of typical sequences : Aε(n) = { x1…xn satisfying the inequalities above }

Page 15: Some  aspects of information  theory  for a computer  scientist

11/09/14

AEP : asymptotic equipartition property

• one has P[ (X1,…,Xn) ∈ Aε(n) ] → 1 as n → ∞

• and |Aε(n)| ≤ 2^(n(H(X)+ε))

So non-typical sequences count for 0, and there are approximately 2^(nH(X)) typical sequences, each of probability ≈ 2^(−nH(X)), among the K^n = 2^(n·log2 K) possible sequences, where K is the alphabet size.

Optimal lossless compression
• encode a typical sequence with nH(X) bits
• encode a non-typical sequence with n·log2 K bits
• add 0 / 1 as prefix to mean typ. / non-typ.
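The convergence that defines typicality is easy to observe numerically ; a small illustrative sketch (source and parameters are arbitrary) :

```python
import math, random

# Source with alphabet size K = 4 and H(X) = 1.75 bits (the card distribution).
symbols = ["a", "b", "c", "d"]
probs   = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * math.log2(p) for p in probs)

rng = random.Random(1)
for n in (10, 100, 1000, 10000):
    seq = rng.choices(symbols, probs, k=n)
    log_p = sum(math.log2(probs[symbols.index(s)]) for s in seq)
    print(n, -log_p / n)        # -(1/n) log2 P(X1...Xn) approaches H(X) = 1.75
```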

Page 16: Some  aspects of information  theory  for a computer  scientist

11/09/14

Practical coding schemes

Encoding by typicality is impractical !

Practical codes :
• Huffman code
• arithmetic coding (adapted to data flows)
• etc.
All require knowledge of the source distribution to be efficient.

Universal codes :
• do not need to know the source distribution
• for long sequences X1…Xn, converge to the optimal rate H(X) bits/symbol
• example: the Lempel-Ziv algorithm (used in ZIP, Compress, etc.)

Page 17: Some  aspects of information  theory  for a computer  scientist

11/09/14

Reliable transmission of information

2

Page 18: Some  aspects of information  theory  for a computer  scientist

Mutual information

11/09/14

Properties

I(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) ≥ 0, with equality iff X and Y are independent

measures how many bits X and Y have in common (on average)

Page 19: Some  aspects of information  theory  for a computer  scientist

Noisy channel

11/09/14

Channel = input alphabet A, output alphabet B, transition probability P(B|A)

(figure: channel mapping input letters A to output letters B according to P(B|A))

observe that the input distribution P(A) is left free

Capacity : C = max over P(A) of I(A;B), in bits / use of channel

Maximizing over P(A) maximizes the coupling between input and output letters, and favors the letters that are least altered by noise.

Page 20: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

The erasure channel : a proportion p of the bits is erased

(figure: input A ∈ {0,1} ; each bit is received intact with probability 1−p, or replaced by the erasure symbol with probability p)

Define the erasure variable E = f(B), with E = 1 when an erasure occurred, and E = 0 otherwise

One has H(A|B) = P(E=1)·H(A|B, E=1) + P(E=0)·H(A|B, E=0) = p·H(A), and I(A;B) = H(A) − H(A|B) = (1−p)·H(A), maximal for uniform A.
So C = 1 − p bits / channel use.
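The value C = 1 − p can be checked by computing I(A;B) directly from the joint distribution of the erasure channel ; a small illustrative sketch :

```python
import math

def mutual_information(joint):
    """I(A;B) in bits, from a dict (a, b) -> P(a, b)."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

for p in (0.1, 0.3, 0.5):
    # uniform input A in {0,1}; the output equals A with prob 1-p, or 'e' (erasure) with prob p
    joint = {(0, 0): 0.5 * (1 - p), (0, "e"): 0.5 * p,
             (1, 1): 0.5 * (1 - p), (1, "e"): 0.5 * p}
    print(p, mutual_information(joint))     # equals 1 - p, the capacity of the erasure channel
```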

Page 21: Some  aspects of information  theory  for a computer  scientist

Protection against errors

11/09/14

Idea: add extra bits to the message, to augment its inner redundancy (this is exactly the converse of data compression)

Coding scheme

• X takes values in { 1, 2, … , M = 2^(nR) }

• rate of the codec : R = log2(M) / n transmitted bits / channel use

• R is achievable iff there exists a sequence of CoDecs (fn, gn) of rate R such that the error probability vanishes : P[ gn(B1…Bn) ≠ X ] → 0 as n → ∞, where A1…An = fn(X) is the transmitted codeword and B1…Bn the channel output

(figure: X → fn → A1…An → noisy channel → B1…Bn → gn → X̂)

Page 22: Some  aspects of information  theory  for a computer  scientist

Error correction (for a binary channel)

11/09/14

Repetition

• useful bit U sent 3 times : A1 = A2 = A3 = U
• decoding by majority
• detects and corrects one error… but the rate drops to R’ = R/3
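A quick simulation of this repetition code over a binary symmetric channel (the flip probability p = 0.1 is illustrative) ; majority decoding corrects single errors, and the residual error probability matches 3p²(1−p) + p³ :

```python
import random

def send_with_repetition(bit, p, rng):
    """Send one useful bit three times over a BSC(p) and decode by majority."""
    received = [bit ^ (rng.random() < p) for _ in range(3)]
    return int(sum(received) >= 2)

p, trials = 0.1, 200_000
rng = random.Random(0)
errors = sum(send_with_repetition(0, p, rng) != 0 for _ in range(trials))
print("simulated error rate :", errors / trials)
print("theoretical value    :", 3 * p**2 * (1 - p) + p**3)   # ~ 0.028, vs p = 0.1 without coding
```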

Parity checks

• X = k useful bits U1…Uk, expanded into n bits A1…An

• rate R = k/n

• for example: add extra redundant bits Vk+1…Vn that are linear combinations of U1…Uk

• examples :
  • ASCII code : k=7, n=8
  • ISBN
  • social security number
  • credit card number

Questions : how ??? and how many extra bits ???

Page 23: Some  aspects of information  theory  for a computer  scientist

How ?

11/09/14

Almost all channel codes are linear : Reed-Solomon, Reed-Muller, Golay, BCH, cyclic codes, convolutional codes… They use finite field theory and algebraic decoding techniques.

The Hamming code

• 4 useful bits U1…U4

• 3 redundant bits V1…V3

• rate R = 4/7
• detects and corrects 1 error (exercise…)
• trick : any 2 codewords differ in at least 3 bits

Codeword layout : U1 U2 U3 U4 V1 V2 V3

[ U1 … U4 ] · G = [ U1 … U4 V1 … V3 ]

Generating matrix (of a linear code) :

G = | 1 0 0 0 0 1 1 |
    | 0 1 0 0 1 0 1 |
    | 0 0 1 0 1 1 0 |
    | 0 0 0 1 1 1 1 |
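A minimal sketch of this Hamming code : encoding multiplies by G over GF(2), and decoding computes the syndrome with the parity-check matrix H derived from G to locate a single flipped bit (the test input is arbitrary).

```python
# Hamming (7,4) code with the generating matrix G shown above (all arithmetic mod 2).
G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]

# Parity-check matrix H = [P^T | I3], where P is the redundancy part of G.
H = [[0,1,1,1,1,0,0],
     [1,0,1,1,0,1,0],
     [1,1,0,1,0,0,1]]

def encode(u):                  # u = 4 useful bits U1...U4
    return [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def decode(r):                  # r = 7 received bits, at most one of them flipped
    syndrome = [sum(H[i][j] * r[j] for j in range(7)) % 2 for i in range(3)]
    if any(syndrome):
        # the syndrome equals the column of H at the erroneous position: flip that bit
        pos = [list(col) for col in zip(*H)].index(syndrome)
        r = r[:]
        r[pos] ^= 1
    return r[:4]                # the code is systematic: the first 4 bits are the useful bits

u = [1, 0, 1, 1]
c = encode(u)
r = c[:]
r[5] ^= 1                       # one transmission error
print(c, r, decode(r) == u)     # the single error is corrected
```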

Page 24: Some  aspects of information  theory  for a computer  scientist

How much ?

11/09/14

(figure: what people believed before ‘48 vs. what Shannon proved in ’48)

Usage: measures the efficiency of an error correcting code for some channel

Theorem (Shannon, ‘48) :
- any achievable transmission rate R must satisfy R ≤ C transmitted bits / channel use
- any transmission rate R < C is achievable

Page 25: Some  aspects of information  theory  for a computer  scientist

Proof

11/09/14

Necessity : if a coding is (asympt.) error free, then its rate satisfies R ≤ C ; rather easy to prove.
Sufficiency : any rate R < C is achievable ; this demands to build a coding scheme !

Idea = random coding !

• take the best distribution P*(A) on the input alphabet of the channel (the one achieving the capacity)

• build a random codeword w = a1…an by drawing letters according to P*(A) (w is a typical sequence)

• sending w over the channel yields the output w’ = b1…bn, which is a typical sequence for P(B), and the pair (w,w’) is jointly typical for P(A,B)

Page 26: Some  aspects of information  theory  for a computer  scientist

11/09/14

(figure: the M typical sequences w1, w2, …, wM of A1…An used as codewords ; each wi sent over the channel yields an output w’i among the possible typical sequences of B1…Bn ; around each codeword lies the “cone” of output sequences jointly typical with it)

• if M is small enough, the output cones do not overlap (with high probability)
• maximal number of input codewords : M ≈ 2^(nH(B)) / 2^(nH(B|A)) = 2^(nI(A;B)) ≤ 2^(nC)

which proves that any R < C is achievable !

Page 27: Some  aspects of information  theory  for a computer  scientist


Perfect coding

11/09/14

Perfect code = error-free and achieves capacity. What does it look like ?

• by the data processing inequality

nR = H(X) = I(X;X) ≤ I(A1…An;B1…Bn) ≤ nC

• if R = C, then I(A1…An;B1…Bn) = nC

• possible iff the letters Ai of the codeword are independent, and each I(Ai;Bi) = C, i.e. each Ai carries C bits

(figure: k useful bits → fn → n transmitted bits → noisy channel → gn)

For a binary channel, R = k/n : a perfect code spreads information uniformly over a larger number of bits

Page 28: Some  aspects of information  theory  for a computer  scientist

In practice

11/09/14

• Random coding is impractical : it relies on a (huge) codebook for cod./dec.
• Algebraic (linear) codes were preferred for long : more structure, cod./dec. with algorithms
• But in practice, they remained much below optimal rates !
• Things changed in 1993 when Berrou & Glavieux invented the turbo-codes
• followed by the rediscovery of the low-density parity check codes (LDPC), invented by Gallager in his PhD… in 1963 !
• both code families behave like random codes… but come with low-complexity cod./dec. algorithms

Page 29: Some  aspects of information  theory  for a computer  scientist

Can feedback improve capacity ?

11/09/14

Principle
• the outputs of the channel are revealed to the sender
• the sender can use this information to adapt its next symbol

(figure: channel with a feedback link from the receiver back to the sender)

Theorem : Feedback does not improve channel capacity.

But it can greatly simplify coding, decoding, and transmission protocols.

Page 30: Some  aspects of information  theory  for a computer  scientist

2nd PART

11/09/14

Information theory was designed for point-to-point communications, which was soon considered a limitation…

broadcast channel : each user has a different channel
multiple access channel : interferences

Spread information : which structure for this object ? how to regenerate / transmit it ?

Page 31: Some  aspects of information  theory  for a computer  scientist

2nd PART

11/09/14

What is the capacity of a network ?

Are network links just pipes, with capacity, in which information flows like a fluid ?

(figure: the butterfly network, with sources A and B at the top, sinks C and D at the bottom, and intermediate nodes E and F on the middle link)

How many transmissions are needed to broadcast a from A to C, D and b from B to C, D ?

With plain routing, packets a and b must be sent separately over the bottleneck link E-F. If instead node E forwards the single combination a+b, sink C (which also receives a directly) recovers b = a + (a+b), and sink D (which receives b directly) recovers a = b + (a+b) : by network coding, one transmission over link E-F can be saved.

Medard & Koetter, 2003
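A tiny sketch of the XOR trick behind this example (packet values are arbitrary integers) :

```python
# Sources: A holds packet a, B holds packet b. Sinks C and D both want a and b.
a, b = 0b10110010, 0b01101101

# Plain routing: the bottleneck link E-F must carry a and b in two separate transmissions.
# Network coding: node E forwards a single coded packet over E-F instead.
coded = a ^ b

# C receives a directly from A, plus the coded packet relayed by F:
b_at_C = a ^ coded          # recovers b
# D receives b directly from B, plus the coded packet relayed by F:
a_at_D = b ^ coded          # recovers a

print(b_at_C == b, a_at_D == a)   # True True : one transmission saved on the bottleneck link
```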

Page 32: Some  aspects of information  theory  for a computer  scientist

Outline

1. Information: measure and compression

2. Reliable transmission of information

3. Distributed compression

4. Fountain codes

5. Distributed peer-to-peer storage

11/09/14

Page 33: Some  aspects of information  theory  for a computer  scientist

11/09/14

Distributed source coding

3

Page 34: Some  aspects of information  theory  for a computer  scientist

Collecting spread information

11/09/14

• X, Y are two distant but correlated sources
• transmit their values to a unique receiver (perfect channels)
• no communication between the encoders

(figure: X and Y are observed at distant locations, with no communication between encoder 1 and encoder 2 ; encoder 1 sends at rate R1 and encoder 2 at rate R2 to a joint decoder, which outputs the pair (X,Y) ; information diagram showing H(X|Y), I(X;Y), H(Y|X))

• Naive solution = ignore the correlation, compress and send each source separately : rates R1 = H(X), R2 = H(Y)

• Can one do better, and take advantage of the correlation of X and Y ?


Page 35: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

• X = weather in Brest, Y = weather in Quimper
• the probability that the weathers are identical is 0.89
• one wishes to send the observed weather of 100 days in both cities

• one has H(X) = 1 = H(Y), so naive encoding requires 200 bits
• I(X;Y) ≈ 0.5, so not sending the “common information” twice saves about 50 bits

Joint distribution P(X,Y) :

              Y = sun   Y = rain
  X = sun      0.445     0.055
  X = rain     0.055     0.445
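These numbers follow directly from the table ; a small numerical check :

```python
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Joint distribution of (X = weather in Brest, Y = weather in Quimper)
joint = {("sun", "sun"): 0.445, ("sun", "rain"): 0.055,
         ("rain", "sun"): 0.055, ("rain", "rain"): 0.445}

HX = HY = H([0.5, 0.5])                  # both marginals are uniform
HXY = H(joint.values())
I = HX + HY - HXY

print("P(identical) =", joint["sun", "sun"] + joint["rain", "rain"])       # 0.89
print(f"H(X) = {HX}, H(Y) = {HY}, I(X;Y) = {I:.3f}")                        # I(X;Y) ~ 0.5
print(f"naive: {100*(HX+HY):.0f} bits ; joint limit: {100*HXY:.0f} bits for 100 days")
```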

Page 36: Some  aspects of information  theory  for a computer  scientist

Necessary conditions

11/09/14

Question: what are the best possible achievable transmission rates ?

(same figure as before : separate encoders for X and Y at rates R1 and R2, no communication between them, and a joint decoder recovering the pair (X,Y))

• Jointly, both coders must transmit the full pair (X,Y), so R1 + R2 ≥ H(X,Y)

• Each coder alone must transmit the private information that is not accessible through the other variable, so R1 ≥ H(X|Y) and R2 ≥ H(Y|X)

A pair (R1,R2) is achievable if there exist separate encoders fnX and fnY for the sequences X1…Xn and Y1…Yn resp., and a joint decoder gn, that are asymptotically error-free.

Page 37: Some  aspects of information  theory  for a computer  scientist

Result

11/09/14

Theorem (Slepian & Wolf, ‘75) : The achievable region is defined by

• R1 ≥ H(X|Y)

• R2 ≥ H(Y|X)

• R1+R2 ≥ H(X,Y)

(figure: the achievable region in the (R1,R2) plane, bounded by R1 ≥ H(X|Y), R2 ≥ H(Y|X) and R1+R2 ≥ H(X,Y), with corner points (H(X), H(Y|X)) and (H(X|Y), H(Y)))

The achievable region is easily shown to be convex, upper-right closed.

Page 38: Some  aspects of information  theory  for a computer  scientist

Compression by random binning

11/09/14

• encode only the typical sequences w = x1…xn
• throw them at random into 2^(nR) bins, with R > H(X)

(figure: bins numbered 1, 2, 3, …, 2^(nR) ; the bin index, written on nR bits, i.e. R bits/symbol, is the codeword)

Encoding of w = the number b of the bin where w lies

Decoding : if w = unique typical sequence in bin number b, output w ; otherwise, output “error”

Error probability : w collides with one of the ≈ 2^(nH(X)) other typical sequences spread over 2^(nR) bins with probability ≈ 2^(−n(R−H(X))) → 0, since R > H(X)
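A toy simulation of compression by random binning (all parameters are illustrative) ; it shows the failure rate behaving like 2^(−n(R−H(X))) :

```python
import itertools, math, random
from collections import defaultdict

# Source with alphabet of size 4 and H(X) = 1.75 bits; tiny block length so we can enumerate.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
n, R, eps = 8, 2.0, 0.25          # block length, rate in bits/symbol (R > H), typicality margin
H = -sum(q * math.log2(q) for q in p.values())

def neg_log_p(seq):
    return -sum(math.log2(p[s]) for s in seq)

# typical sequences: empirical log-likelihood per symbol close to H(X)
typical = [seq for seq in itertools.product(p, repeat=n) if abs(neg_log_p(seq) / n - H) <= eps]

num_bins = 2 ** round(R * n)
rng = random.Random(0)
bin_of = {seq: rng.randrange(num_bins) for seq in typical}    # the random binning = the encoder

population = defaultdict(int)
for b in bin_of.values():
    population[b] += 1

# decoding w from its bin number fails iff another typical sequence shares its bin
fail = sum(population[bin_of[seq]] > 1 for seq in typical) / len(typical)
print(len(typical), "typical sequences,", num_bins, "bins")
print(f"observed failure rate {fail:.3f}  vs  2^(-n(R-H)) = {2**(-n*(R-H)):.3f}")
# the failure rate vanishes as n grows, since R > H(X)
```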

Page 39: Some  aspects of information  theory  for a computer  scientist

Proof of Slepian-Wolf

11/09/14

• fX and fY are two independent random binnings of rates R1 and R2, applied to x = x1…xn and y = y1…yn resp.

• to decode the pair of bin numbers (bX,bY) = (fX(x),fY(y)), g outputs the unique pair (x,y) of jointly typical sequences in box (bX,bY), or “error” if there is more than one such pair.

• R2 > H(Y|X) : given x, there are 2^(nH(Y|X)) sequences y that are jointly typical with x

• R1+R2 > H(X,Y) : the number of boxes 2^(n(R1+R2)) must be greater than the number 2^(nH(X,Y)) of jointly typical pairs

(figure: a 2^(nR1) × 2^(nR2) grid of boxes indexed by the bin pairs ; the jointly typical pairs (x,y) are scattered over the grid, and decoding succeeds when the received box contains only one of them)

Page 40: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

X = color

Y = value

(figure: information diagram with H(X|Y) = 1.25, I(X;Y) = 0.5, H(Y|X) = 1.25)

Questions:

1. Is there an instantaneous* transmission protocol for rates RX=1.25=H(X|Y), RY=1.75=H(Y) ?

• send Y (always) : 1.75 bits
• what about X ?

(caution: the code for X should be uniquely decodable)

(table: the code for Y is 0, 10, 110, 111 ; the codewords for X, given Y, are left as question marks to be filled in)

(*) i.e. for sequences of length n=1

2. What about RX=RY=1.5 ?


Page 41: Some  aspects of information  theory  for a computer  scientist

In practice

11/09/14

The Slepian-Wolf theorem extends to N sources. It long remained an academic result, since no practical coders existed.

At the beginning of the 2000s, practical coders and applications appeared :
• compression of correlated images (e.g. same scene, 2 angles)
• sensor networks (e.g. measurement of a temperature field)
• case of a channel with side information
• acquisition of structured information, without communication

Page 42: Some  aspects of information  theory  for a computer  scientist

11/09/14

Fountain codes

4

Page 43: Some  aspects of information  theory  for a computer  scientist

Network protocols

11/09/14

TCP/IP (transmission control protocol)

network (erasure channel)

(figure: packets 1–7 are sent ; the receiver acknowledges what it has received, and missing packets are retransmitted)

Drawbacks
• slow for huge files over long-range connections (e.g. cloud backups…)
• feedback channel… but feedback does not improve capacity !
• repetition code… the worst rate among error correcting codes !
• designed by engineers who ignored information theory ? :o)

However
• the erasure rate of the channel (thus its capacity) is unknown / changing
• feedback makes protocols simpler
• there exist faster protocols (UDP) for streaming feeds

Page 44: Some  aspects of information  theory  for a computer  scientist

A fountain of information bits…

11/09/14

How to quickly and reliably transmit K packets of b bits?

Fountain code :
• from the K packets, generate and send a continuous flow of packets
• some get lost, some go through ; no feedback
• as soon as K(1+ε) of them are received, whichever they are, decoding becomes possible

Fountain codes are examples of rateless codes (no predefined rate), or universal codes : they adapt to the channel capacity.

Page 45: Some  aspects of information  theory  for a computer  scientist

Random coding…

11/09/14

Packet tn sent at time n is a random linear combination of the K packets s1…sK to transmit :

tn = Σk Gn,k · sk   (the sum being a bitwise XOR)

where the Gn,k are random IID binary variables.

(figure: the K source packets s1…sK of b bits each, and the emitted packets t1…tK’ ; in matrix form, [t1 … tK’] = [s1 … sK] · G, where G is a K × K’ random binary matrix)

Page 46: Some  aspects of information  theory  for a computer  scientist

Decoding

11/09/14

(figure: the N received packets r1…rN satisfy [r1 … rN] = [s1 … sK] · G’, where G’ is the K × N sub-matrix of G formed by the columns of the packets that were not lost)

Some packets are lost, and N out of K’ are received. This is equivalent to another random code, with generating matrix G’.

How big should N be to enable decoding ?

Page 47: Some  aspects of information  theory  for a computer  scientist

11/09/14

Decoding

• For N = K, what is the probability that G’ is invertible ?
One has [r1 … rN] = [s1 … sK] · G’, where G’ is a random K × N binary matrix. If G’ is invertible, one can decode by [s1 … sK] = [r1 … rN] · G’^(−1).
Answer : this probability converges quickly to ≈ 0.289 (as soon as K > 10).

• What about N = K+E ? What is the probability P that at least one K × K sub-matrix of G’ is invertible ?
Answer : P = 1 − δ(E), where δ(E) ≤ 2^(−E) (δ(E) < 10^(−6) for E = 20) : exponential convergence to 1 with E, regardless of K.

Complexity

• K/2 operations per generated packet, so O(K²) for encoding
• decoding : O(K³) for the matrix inversion
• one would like better complexities… linear ?
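The 0.289 figure is easy to reproduce by simulating random binary matrices and comparing with the exact value Π_{i=1..K} (1 − 2^(−i)) ; a small sketch :

```python
import random

def rank_gf2(rows, ncols):
    """Rank over GF(2) of a matrix given as a list of integers (one bitmask per row)."""
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i] >> col & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] >> col & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank

def prob_invertible(K, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = sum(rank_gf2([rng.getrandbits(K) for _ in range(K)], K) == K for _ in range(trials))
    return hits / trials

for K in (5, 10, 20, 40):
    exact = 1.0
    for i in range(1, K + 1):
        exact *= 1 - 2.0 ** (-i)
    print(K, prob_invertible(K), round(exact, 4))   # both columns approach ~ 0.289
```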

Page 48: Some  aspects of information  theory  for a computer  scientist

LT codes

11/09/14

Invented by Michael Luby (2003), and inspired by LDPC codes (Gallager, 1963).

Idea : linear combinations of packets should be “sparse”

Encoding

• for each packet tn, randomly select a “degree” dn according to some distribution ρ(d) on degrees

• choose at random dn packets among s1…sK and take as tn the sum (XOR) of these dn packets

• some nodes have low degree, others have high degree : this makes the graph a small world

(figure: bipartite graph connecting the source packets s1…sK to the emitted packets t1…tN)

Page 49: Some  aspects of information  theory  for a computer  scientist

Decoding LT codes

11/09/14

Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving

Example

1 0 1 1

(Pages 50 to 54 repeat this slide, showing successive steps of the decoder on the example : starting from the received packets 1 0 1 1, the source bits are recovered one by one, 1, then 1 0, then 1 0 1.)

Page 55: Some  aspects of information  theory  for a computer  scientist

Decoding LT codes

11/09/14

Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving

Example

1 0 1 1

1 0 1

How to choose the degrees ?
• each iteration should yield a single new node of degree 1 ; achieved by the distribution ρ(1) = 1/K and ρ(d) = 1/(d(d−1)) for d = 2…K
• the average degree is log_e K, so the decoding complexity is K·log_e K
• in reality :
  • one needs a few nodes of high degree to ensure that every packet is connected to at least one check-node
  • one needs a few more small-degree nodes to ensure that decoding starts
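A compact sketch of an LT code with the degree distribution ρ above and the peeling decoder (payloads and parameters are illustrative ; as noted above, the pure distribution ρ is fragile, so decoding may occasionally get stuck) :

```python
import random

def soliton(K):
    """rho(1) = 1/K, rho(d) = 1/(d(d-1)) for d = 2..K."""
    return [1.0 / K] + [1.0 / (d * (d - 1)) for d in range(2, K + 1)]

def lt_encode(source, n_out, rng):
    """Each emitted packet = XOR of d randomly chosen source packets, with d ~ rho."""
    K, rho, packets = len(source), soliton(len(source)), []
    for _ in range(n_out):
        d = rng.choices(range(1, K + 1), weights=rho)[0]
        neighbours = set(rng.sample(range(K), d))
        value = 0
        for i in neighbours:
            value ^= source[i]
        packets.append((neighbours, value))
    return packets

def lt_decode(packets, K):
    """Peeling decoder: repeatedly use a packet of residual degree 1, then propagate."""
    residual = [[set(nb), v] for nb, v in packets]
    decoded = {}
    progress = True
    while progress:
        progress = False
        for nb, v in residual:
            if len(nb) == 1:
                i = next(iter(nb))
                if i not in decoded:
                    decoded[i] = v          # the residual value is exactly that source packet
                    progress = True
        for entry in residual:              # subtract every decoded packet from the others
            for i in list(entry[0]):
                if i in decoded:
                    entry[0].discard(i)
                    entry[1] ^= decoded[i]
    return decoded if len(decoded) == K else None

rng = random.Random(42)
K = 30
source = [rng.getrandbits(16) for _ in range(K)]      # K source packets (16-bit toy payloads)
received = lt_encode(source, int(1.7 * K), rng)       # the packets that made it through
decoded = lt_decode(received, K)
print("fully decoded" if decoded == dict(enumerate(source)) else "decoding stuck, ask for more packets")
```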

Page 56: Some  aspects of information  theory  for a computer  scientist

In practice…

11/09/14

Performance
• both encoding and decoding are in K·log K (instead of K² and K³)
• for large K > 10^4, the observed overhead E represents from 5% to 10%
• Raptor codes (Shokrollahi, 2003) do better : linear time complexity

Applications
• broadcast to many users : a fountain code adapts to the channel of each user, with no need to rebroadcast the packets missed by some user
• storage on many unreliable devices : e.g. RAID (redundant array of inexpensive disks), data centers, peer-to-peer distributed storage

Page 57: Some  aspects of information  theory  for a computer  scientist

11/09/14

Distributed P2P storage

5

Page 58: Some  aspects of information  theory  for a computer  scientist

Principle

11/09/14

(figure: the raw data s1…sK is expanded into redundant data t1…tN, and each packet ti is stored on a distinct storage device (disk, peer, …))

Idea = raw data split into packets, expanded with some ECC. Each newly created packet is stored independently. The original data is erased.

Problems
• disks can crash, peers can leave : eventual data loss
• the original data can be recovered if enough packets remain… but missing packets need to be restored

Restoration
• perfect : the packet that is lost is exactly replaced
• functional : new packets are built, so as to preserve data recoverability
• intermediate : maintain the systematic part of the data

(figure: the restored packets t’2 … t’N are stored on new peers)

Page 59: Some  aspects of information  theory  for a computer  scientist

Which codes ?

11/09/14

Target : one should rebuild the missing blocks… without first rebuilding the original data ! (that would require too much bandwidth)

Fountain / random codes :
• random linear combinations of the remaining blocks among t1…tn
• but this does not preserve the appropriate degree distribution

MDS codes : maximum distance separable codes
• can rebuild s1…sk from any subset of exactly k blocks in t1…tn
• example : Reed-Solomon codes

Page 60: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

(figure: the raw data consists of k = 2 sets of α = 2 blocks, (a, b) and (c, d) ; it is stored on n = 4 nodes of α = 2 blocks each : (a, b), (c, d), (a+c, b+d), (b+c, a+b+d) ; when the node holding (a, b) fails, it is reconstructed by requesting β = 1 block from each surviving node, e.g. d, b+d and a+b+d)
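Assuming the node contents read off the figure, (a, b), (c, d), (a+c, b+d), (b+c, a+b+d), with “+” meaning XOR, the repair of the first node from one block per surviving node can be checked directly :

```python
def xor(x, y):                   # blocks are byte strings; "+" in the figure is a bitwise XOR
    return bytes(u ^ v for u, v in zip(x, y))

a, b, c, d = b"AAAA", b"BBBB", b"CCCC", b"DDDD"       # the four raw blocks (toy payloads)
node1, node2 = (a, b), (c, d)
node3 = (xor(a, c), xor(b, d))                        # (a+c, b+d)
node4 = (xor(b, c), xor(xor(a, b), d))                # (b+c, a+b+d)

# node1 = (a, b) fails; download beta = 1 block from each surviving node: d, b+d, a+b+d
got_d, got_bd, got_abd = node2[1], node3[1], node4[1]
new_b = xor(got_bd, got_d)                            # b = (b+d) + d
new_a = xor(xor(got_abd, new_b), got_d)               # a = (a+b+d) + b + d
print((new_a, new_b) == node1)                        # True : the lost node is rebuilt
```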

Page 61: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

(figure: the same storage code, with a second repair scenario : the newcomer downloads combinations of blocks, such as c+d, b+d, b+c and a+b+d, and stores new combinations rather than the original ones, i.e. a functional repair)

Result (Dimakis et al., 2010) : for functional repair, given k, n and d ≥ k (the number of nodes to contact for a repair), network coding techniques allow to optimally balance α (the number of blocks stored per node) and β (the bandwidth necessary for a reconstruction).

Page 62: Some  aspects of information  theory  for a computer  scientist

11/09/14

Conclusion

6

Page 63: Some  aspects of information  theory  for a computer  scientist

A few lessons

11/09/14

Ralf Koetter* : “Communications aren’t anymore about transmitting a bit, but about transmitting evidence about a bit.”

(*) one of the inventors of Network Coding

Random structures spread information uniformly.

Information theory gives bounds on how much one can learn about some hidden information…

One does not have to build the actual protocols/codes that will reveal this information.

Page 64: Some  aspects of information  theory  for a computer  scientist

Management of distributed information… in other fields

11/09/14

Digital communications (network information theory)
- A, B : random variables, possibly correlated
- one wishes to compute at B the value f(A,B)
- how many bits should be exchanged ?
- how many communication rounds ?

Compressed sensing (signal processing)
- the signal can be described by sparse coefficients
- random (sub-Nyquist) sampling

Communication complexity (computer science)
- A, B : variables, taking values in a huge space
- how many bits should A send to B in order to check A = B ?
- solution by random coding (a sketch follows below)

(figure: A sends n bits to B, who checks whether A = B)
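A small sketch of the random-coding idea for the equality check, in the model where A and B share public random bits (the parameters are illustrative) : A sends k random parities of its n-bit string instead of the whole string, and a difference escapes detection with probability only 2^(−k).

```python
import random

def parities(bits, masks):
    """For each mask, the parity of the selected positions of `bits`."""
    return [sum(x & m for x, m in zip(bits, mask)) % 2 for mask in masks]

n, k = 10_000, 32                         # n-bit inputs, k parity bits exchanged instead of n
rng = random.Random(7)                    # stands for the public randomness shared by A and B
masks = [[rng.randrange(2) for _ in range(n)] for _ in range(k)]

x = [rng.randrange(2) for _ in range(n)]        # A's string
y = x[:]
y[1234] ^= 1                                    # B's string differs in a single position

fingerprint_A = parities(x, masks)              # the k bits actually transmitted by A
fingerprint_B = parities(y, masks)
print("declared equal" if fingerprint_A == fingerprint_B else "declared different")
# if x != y, each random parity differs with probability 1/2, hence error probability 2^(-k)
```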

Page 65: Some  aspects of information  theory  for a computer  scientist

thank you !