Page 1: Some  aspects of information  theory  for a computer  scientist

Some aspects of information theory for a computer scientist

Eric Fabre
http://people.rennes.inria.fr/Eric.Fabre
http://www.irisa.fr/sumo

11 Sep. 2014

Page 2: Some  aspects of information  theory  for a computer  scientist

Outline

1. Information: measure and compression

2. Reliable transmission of information

3. Distributed compression

4. Fountain codes

5. Distributed peer-to-peer storage

11/09/14

Page 3: Some  aspects of information  theory  for a computer  scientist

Information: measure and compression

11/09/14

1

Page 4: Some  aspects of information  theory  for a computer  scientist

Let’s play…

11/09/14

One card is drawn at random from the following set (shown as a figure). Guess the color (suit) of the card with a minimum of yes/no questions.

One strategy
• is it hearts ?
• if not, is it clubs ?
• if not, is it diamonds ?

It wins in
• 1 guess, with probability ½
• 2 guesses, with prob. ¼
• 3 guesses, with prob. ¼
→ 1.75 questions on average

Is there a better strategy ?
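As a check (not on the slides) : with the suit probabilities ½, ¼, ⅛, ⅛ implied by the win probabilities above, the average cost of this strategy equals the entropy of the suit, which is why no strategy does better on average.

```python
import math

# Hypothetical suit probabilities, inferred from the win probabilities above:
# 1 question w.p. 1/2 (hearts), 2 w.p. 1/4 (clubs), 3 w.p. 1/4 (diamonds or spades).
p = {"hearts": 1/2, "clubs": 1/4, "diamonds": 1/8, "spades": 1/8}
questions = {"hearts": 1, "clubs": 2, "diamonds": 3, "spades": 3}

avg_questions = sum(p[s] * questions[s] for s in p)        # 1.75 questions on average
entropy = -sum(q * math.log2(q) for q in p.values())       # 1.75 bits

print(avg_questions, entropy)
```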

Page 5: Some  aspects of information  theory  for a computer  scientist

11/09/14

Observation

Lessons
- more likely means easier to guess (carries less information)
- the amount of information depends only on the log-likelihood of an event
- guessing with yes/no questions = encoding with bits = compressing

(figure: the questions define a prefix code over the suits, apparently 1, 01, 001, 000, so that a sequence of draws is encoded by concatenation, e.g. 01001000)

Page 6: Some  aspects of information  theory  for a computer  scientist

11/09/14

Important remark:

• codes like the one below are not permitted

• they cannot be uniquely decoded if one transmits sequences of encoded values of X ; e.g. the sequence 11 can encode “Diamonds” or “Hearts, Hearts”

• one would need one extra symbol to separate “words”

(figure: the forbidden code, apparently 1, 0, 11, 00 over the four suits)

Page 7: Some  aspects of information  theory  for a computer  scientist

Entropy

11/09/14

Source of information = random variable
notation: variables X, Y, … taking values x, y, …

information carried by the event “X = x” : h(x) = −log2 P(X = x)

average information carried by X : H(X) = E[ −log2 P(X) ] = −Σx P(X = x) log2 P(X = x)

H(X) measures the average difficulty to encode/describe/guess random outcomes of X
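A one-line implementation of H(X) for a finite distribution (illustrative, not from the slides) :

```python
import math

def H(dist):
    """Entropy in bits of a finite distribution, given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(H([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits : the card example
print(H([0.25] * 4))                 # 2.0 bits  : uniform over 4 values
print(H([1.0]))                      # 0.0 bits  : not random at all
```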

Page 8: Some  aspects of information  theory  for a computer  scientist

Properties

11/09/14

H(X,Y) ≤ H(X) + H(Y), with equality iff X and Y independent (i.e. P(X,Y) = P(X)·P(Y))

H(X) ≥ 0, with equality iff X not random

H(X) ≤ log2 |X|, with equality iff the distribution of X is uniform

Bernoulli distribution : H(X) = −p·log2 p − (1−p)·log2(1−p), maximal (= 1 bit) at p = ½

Page 9: Some  aspects of information  theory  for a computer  scientist

Conditional entropy

11/09/14

H(Y|X) = Σx P(X=x)·H(Y | X=x) = −Σx,y P(x,y)·log2 P(y|x) : uncertainty left on Y when X is known

Property

H(Y|X) ≤ H(Y), with equality iff Y and X independent

Page 10: Some  aspects of information  theory  for a computer  scientist

11/09/14

Example : X = color, Y = value

(The slide computes H(Y | X=x) for each color x, averages these values into H(Y|X), recalls H(Y), and so one checks that H(Y|X) ≤ H(Y).)

Exercise : check that
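The quantities involved here can be verified numerically on any joint distribution ; a minimal sketch on a hypothetical joint table (the actual deck of the slide is not reproduced), which also checks the chain rule H(X,Y) = H(X) + H(Y|X) :

```python
import math
from collections import defaultdict

# Hypothetical joint distribution P(X = color, Y = value); any table summing to 1 works.
joint = {("red", "ace"): 0.25, ("red", "king"): 0.25,
         ("black", "ace"): 0.10, ("black", "king"): 0.40}

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p
    pY[y] += p

HX, HY, HXY = H(pX.values()), H(pY.values()), H(joint.values())

# H(Y|X) = sum_x P(X=x) * H(Y | X=x), i.e. the average of the per-x entropies
HY_given_X = sum(pX[x] * H([joint[x, y] / pX[x] for y in pY if (x, y) in joint]) for x in pX)

print(f"H(X)={HX:.3f}  H(Y)={HY:.3f}  H(X,Y)={HXY:.3f}  H(Y|X)={HY_given_X:.3f}")
print("chain rule H(X)+H(Y|X)=H(X,Y) :", abs(HX + HY_given_X - HXY) < 1e-12)
print("H(Y|X) <= H(Y) :", HY_given_X <= HY + 1e-12)
```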

Page 11: Some  aspects of information  theory  for a computer  scientist

11/09/14

A visual representation

(figure: information diagram showing H(X) and H(Y) as two overlapping regions, whose intersection is I(X;Y) and whose private parts are H(X|Y) and H(Y|X))

Page 12: Some  aspects of information  theory  for a computer  scientist

11/09/14

Data compression

CoDec for source X, with R bits/sample on average

rate R is achievable iff there exist CoDec pairs (fn, gn) of rate R with vanishing error probability : P[ gn(fn(X1…Xn)) ≠ (X1…Xn) ] → 0 as n → ∞

Usage: there was no better strategy for our card game !

Theorem (Shannon, ‘48) :
- a lossless compression scheme for source X must have a rate R ≥ H(X) bits/sample on average
- the rate H(X) is (asymptotically) achievable

Page 13: Some  aspects of information  theory  for a computer  scientist

11/09/14

Proof

Necessity : if R is achievable, then R ≥ H(X) ; quite easy to prove.
Sufficiency : for R > H(X), one must build a lossless coding scheme using R bits/sample on average.

Solution 1
• use a known optimal lossless coding scheme for X : the Huffman code (see the sketch below)
• then prove H(X) ≤ L < H(X) + 1, where L is the average codeword length
• over n independent symbols X1,…,Xn, one gets H(X) ≤ Ln/n < H(X) + 1/n, so the rate per symbol approaches H(X)

Solution 2 : encoding only “typical sequences”
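A minimal sketch of the Huffman code of Solution 1 (illustrative, not the lecture's own implementation) : it computes the codeword lengths with a heap and checks H(X) ≤ L < H(X) + 1 on the card distribution.

```python
import heapq, math

def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (probs: dict symbol -> probability)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)                      # tie-breaker, so tuples never compare the lists
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # every symbol under the merged node gets one more bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = {"hearts": 0.5, "clubs": 0.25, "diamonds": 0.125, "spades": 0.125}
lengths = huffman_lengths(probs)
L = sum(probs[s] * lengths[s] for s in probs)
H = -sum(p * math.log2(p) for p in probs.values())
print(lengths)                     # {'hearts': 1, 'clubs': 2, 'diamonds': 3, 'spades': 3}
print(H, L, H <= L < H + 1)        # 1.75 1.75 True
```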

Page 14: Some  aspects of information  theory  for a computer  scientist

11/09/14

Typical sequences

Let X1,…,Xn be independent, with the same law as X

By the law of large numbers, one has the a.s. convergence −(1/n)·log2 P(X1…Xn) → H(X)

Sequence x1…xn is typical iff | −(1/n)·log2 P(x1…xn) − H(X) | ≤ ε

or equivalently 2^(−n(H(X)+ε)) ≤ P(x1…xn) ≤ 2^(−n(H(X)−ε))

Set of typical sequences : Aε(n) = { x1…xn satisfying the inequalities above }

Page 15: Some  aspects of information  theory  for a computer  scientist

11/09/14

AEP : asymptotic equipartition property

• one has P[ (X1,…,Xn) ∈ Aε(n) ] → 1 as n → ∞

• and |Aε(n)| ≤ 2^(n(H(X)+ε))

So non-typical sequences count for 0, and there are approximately 2^(nH(X)) typical sequences, each of probability ≈ 2^(−nH(X)), among the K^n = 2^(n·log2 K) possible sequences, where K is the alphabet size.

Optimal lossless compression
• encode a typical sequence with nH(X) bits
• encode a non-typical sequence with n·log2 K bits
• add 0 / 1 as prefix to mean typ. / non-typ.
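The convergence that defines typicality is easy to observe numerically ; a small illustrative sketch (source and parameters are arbitrary) :

```python
import math, random

# Source with alphabet size K = 4 and H(X) = 1.75 bits (the card distribution).
symbols = ["a", "b", "c", "d"]
probs   = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * math.log2(p) for p in probs)

rng = random.Random(1)
for n in (10, 100, 1000, 10000):
    seq = rng.choices(symbols, probs, k=n)
    log_p = sum(math.log2(probs[symbols.index(s)]) for s in seq)
    print(n, -log_p / n)        # -(1/n) log2 P(X1...Xn) approaches H(X) = 1.75
```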

Page 16: Some  aspects of information  theory  for a computer  scientist

11/09/14

Practical coding schemes

Encoding by typicality is impractical !

Practical codes :
• Huffman code
• arithmetic coding (adapted to data flows)
• etc.
All require knowledge of the source distribution to be efficient.

Universal codes :
• do not need to know the source distribution
• for long sequences X1…Xn, converge to the optimal rate H(X) bits/symbol
• example: the Lempel-Ziv algorithm (used in ZIP, Compress, etc.)

Page 17: Some  aspects of information  theory  for a computer  scientist

11/09/14

Reliable transmission of information

2

Page 18: Some  aspects of information  theory  for a computer  scientist

Mutual information

11/09/14

Properties

I(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) ≥ 0, with equality iff X and Y are independent

measures how many bits X and Y have in common (on average)

Page 19: Some  aspects of information  theory  for a computer  scientist

Noisy channel

11/09/14

Channel = input alphabet A, output alphabet B, transition probability P(B|A)

(figure: channel mapping input letters A to output letters B according to P(B|A))

observe that the input distribution P(A) is left free

Capacity : C = max over P(A) of I(A;B), in bits / use of channel

Maximizing over P(A) maximizes the coupling between input and output letters, and favors the letters that are least altered by noise.

Page 20: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

The erasure channel : a proportion p of the bits is erased

(figure: input A ∈ {0,1} ; each bit is received intact with probability 1−p, or replaced by the erasure symbol with probability p)

Define the erasure variable E = f(B), with E = 1 when an erasure occurred, and E = 0 otherwise

One has H(A|B) = P(E=1)·H(A|B, E=1) + P(E=0)·H(A|B, E=0) = p·H(A), and I(A;B) = H(A) − H(A|B) = (1−p)·H(A), maximal for uniform A.
So C = 1 − p bits / channel use.
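The value C = 1 − p can be checked by computing I(A;B) directly from the joint distribution of the erasure channel ; a small illustrative sketch :

```python
import math

def mutual_information(joint):
    """I(A;B) in bits, from a dict (a, b) -> P(a, b)."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

for p in (0.1, 0.3, 0.5):
    # uniform input A in {0,1}; the output equals A with prob 1-p, or 'e' (erasure) with prob p
    joint = {(0, 0): 0.5 * (1 - p), (0, "e"): 0.5 * p,
             (1, 1): 0.5 * (1 - p), (1, "e"): 0.5 * p}
    print(p, mutual_information(joint))     # equals 1 - p, the capacity of the erasure channel
```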

Page 21: Some  aspects of information  theory  for a computer  scientist

Protection against errors

11/09/14

Idea: add extra bits to the message, to augment its inner redundancy (this is exactly the converse of data compression)

Coding scheme

• X takes values in { 1, 2, … , M = 2^(nR) }

• rate of the codec : R = log2(M) / n transmitted bits / channel use

• R is achievable iff there exists a sequence of CoDecs (fn, gn) of rate R such that the error probability vanishes : P[ gn(B1…Bn) ≠ X ] → 0 as n → ∞, where A1…An = fn(X) is the transmitted codeword and B1…Bn the channel output

(figure: X → fn → A1…An → noisy channel → B1…Bn → gn → X̂)

Page 22: Some  aspects of information  theory  for a computer  scientist

Error correction (for a binary channel)

11/09/14

Repetition

• useful bit U sent 3 times : A1 = A2 = A3 = U
• decoding by majority
• detects and corrects one error… but the rate drops to R’ = R/3
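A quick simulation of this repetition code over a binary symmetric channel (the flip probability p = 0.1 is illustrative) ; majority decoding corrects single errors, and the residual error probability matches 3p²(1−p) + p³ :

```python
import random

def send_with_repetition(bit, p, rng):
    """Send one useful bit three times over a BSC(p) and decode by majority."""
    received = [bit ^ (rng.random() < p) for _ in range(3)]
    return int(sum(received) >= 2)

p, trials = 0.1, 200_000
rng = random.Random(0)
errors = sum(send_with_repetition(0, p, rng) != 0 for _ in range(trials))
print("simulated error rate :", errors / trials)
print("theoretical value    :", 3 * p**2 * (1 - p) + p**3)   # ~ 0.028, vs p = 0.1 without coding
```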

Parity checks

• X = k useful bits U1…Uk, expanded into n bits A1…An

• rate R = k/n

• for example: add extra redundant bits Vk+1…Vn that are linear combinations of U1…Uk

• examples :
  • ASCII code : k=7, n=8
  • ISBN
  • social security number
  • credit card number

Questions : how ??? and how many extra bits ???

Page 23: Some  aspects of information  theory  for a computer  scientist

How ?

11/09/14

Almost all channel codes are linear : Reed-Solomon, Reed-Muller, Golay, BCH, cyclic codes, convolutional codes… They use finite field theory and algebraic decoding techniques.

The Hamming code

• 4 useful bits U1…U4

• 3 redundant bits V1…V3

• rate R = 4/7
• detects and corrects 1 error (exercise…)
• trick : any 2 codewords differ in at least 3 bits

Codeword layout : U1 U2 U3 U4 V1 V2 V3

[ U1 … U4 ] · G = [ U1 … U4 V1 … V3 ]

Generating matrix (of a linear code) :

G = | 1 0 0 0 0 1 1 |
    | 0 1 0 0 1 0 1 |
    | 0 0 1 0 1 1 0 |
    | 0 0 0 1 1 1 1 |
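A minimal sketch of this Hamming code : encoding multiplies by G over GF(2), and decoding computes the syndrome with the parity-check matrix H derived from G to locate a single flipped bit (the test input is arbitrary).

```python
# Hamming (7,4) code with the generating matrix G shown above (all arithmetic mod 2).
G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]

# Parity-check matrix H = [P^T | I3], where P is the redundancy part of G.
H = [[0,1,1,1,1,0,0],
     [1,0,1,1,0,1,0],
     [1,1,0,1,0,0,1]]

def encode(u):                  # u = 4 useful bits U1...U4
    return [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def decode(r):                  # r = 7 received bits, at most one of them flipped
    syndrome = [sum(H[i][j] * r[j] for j in range(7)) % 2 for i in range(3)]
    if any(syndrome):
        # the syndrome equals the column of H at the erroneous position: flip that bit
        pos = [list(col) for col in zip(*H)].index(syndrome)
        r = r[:]
        r[pos] ^= 1
    return r[:4]                # the code is systematic: the first 4 bits are the useful bits

u = [1, 0, 1, 1]
c = encode(u)
r = c[:]
r[5] ^= 1                       # one transmission error
print(c, r, decode(r) == u)     # the single error is corrected
```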

Page 24: Some  aspects of information  theory  for a computer  scientist

How much ?

11/09/14

(figure: what people believed before ‘48 vs. what Shannon proved in ’48)

Usage: measures the efficiency of an error correcting code for some channel

Theorem (Shannon, ‘48) :
- any achievable transmission rate R must satisfy R ≤ C transmitted bits / channel use
- any transmission rate R < C is achievable

Page 25: Some  aspects of information  theory  for a computer  scientist

Proof

11/09/14

Necessity : if a coding is (asympt.) error free, then its rate satisfies R ≤ C ; rather easy to prove.
Sufficiency : any rate R < C is achievable ; this demands to build a coding scheme !

Idea = random coding !

• take the best distribution P*(A) on the input alphabet of the channel (the one achieving the capacity)

• build a random codeword w = a1…an by drawing letters according to P*(A) (w is a typical sequence)

• sending w over the channel yields the output w’ = b1…bn, which is a typical sequence for P(B), and the pair (w,w’) is jointly typical for P(A,B)

Page 26: Some  aspects of information  theory  for a computer  scientist

11/09/14

(figure: the M typical sequences w1, w2, …, wM of A1…An used as codewords ; each wi sent over the channel yields an output w’i among the possible typical sequences of B1…Bn ; around each codeword lies the “cone” of output sequences jointly typical with it)

• if M is small enough, the output cones do not overlap (with high probability)
• maximal number of input codewords : M ≈ 2^(nH(B)) / 2^(nH(B|A)) = 2^(nI(A;B)) ≤ 2^(nC)

which proves that any R < C is achievable !

Page 27: Some  aspects of information  theory  for a computer  scientist


Perfect coding

11/09/14

Perfect code = error-free and achieves capacity. What does it look like ?

• by the data processing inequality

nR = H(X) = I(X;X) ≤ I(A1…An;B1…Bn) ≤ nC

• if R = C, then I(A1…An;B1…Bn) = nC

• possible iff the letters Ai of the codeword are independent, and each I(Ai;Bi) = C, i.e. each Ai carries C bits

(figure: k useful bits → fn → n transmitted bits → noisy channel → gn)

For a binary channel, R = k/n : a perfect code spreads information uniformly over a larger number of bits

Page 28: Some  aspects of information  theory  for a computer  scientist

In practice

11/09/14

• Random coding is impractical : it relies on a (huge) codebook for cod./dec.
• Algebraic (linear) codes were preferred for long : more structure, cod./dec. with algorithms
• But in practice, they remained much below optimal rates !
• Things changed in 1993 when Berrou & Glavieux invented the turbo-codes
• followed by the rediscovery of the low-density parity check codes (LDPC), invented by Gallager in his PhD… in 1963 !
• both code families behave like random codes… but come with low-complexity cod./dec. algorithms

Page 29: Some  aspects of information  theory  for a computer  scientist

Can feedback improve capacity ?

11/09/14

Principle
• the outputs of the channel are revealed to the sender
• the sender can use this information to adapt its next symbol

(figure: channel with a feedback link from the receiver back to the sender)

Theorem : Feedback does not improve channel capacity.

But it can greatly simplify coding, decoding, and transmission protocols.

Page 30: Some  aspects of information  theory  for a computer  scientist

2nd PART

11/09/14

Information theory was designed for point-to-point communications, which was soon considered a limitation…

broadcast channel : each user has a different channel
multiple access channel : interferences

Spread information : which structure for this object ? how to regenerate / transmit it ?

Page 31: Some  aspects of information  theory  for a computer  scientist

2nd PART

11/09/14

What is the capacity of a network ?

Are network links just pipes, with capacity, in which information flows like a fluid ?

(figure: the butterfly network, with sources A and B at the top, sinks C and D at the bottom, and intermediate nodes E and F on the middle link)

How many transmissions are needed to broadcast a from A to C, D and b from B to C, D ?

With plain routing, packets a and b must be sent separately over the bottleneck link E-F. If instead node E forwards the single combination a+b, sink C (which also receives a directly) recovers b = a + (a+b), and sink D (which receives b directly) recovers a = b + (a+b) : by network coding, one transmission over link E-F can be saved.

Medard & Koetter, 2003
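A tiny sketch of the XOR trick behind this example (packet values are arbitrary integers) :

```python
# Sources: A holds packet a, B holds packet b. Sinks C and D both want a and b.
a, b = 0b10110010, 0b01101101

# Plain routing: the bottleneck link E-F must carry a and b in two separate transmissions.
# Network coding: node E forwards a single coded packet over E-F instead.
coded = a ^ b

# C receives a directly from A, plus the coded packet relayed by F:
b_at_C = a ^ coded          # recovers b
# D receives b directly from B, plus the coded packet relayed by F:
a_at_D = b ^ coded          # recovers a

print(b_at_C == b, a_at_D == a)   # True True : one transmission saved on the bottleneck link
```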

Page 32: Some  aspects of information  theory  for a computer  scientist

Outline

1. Information: measure and compression

2. Reliable transmission of information

3. Distributed compression

4. Fountain codes

5. Distributed peer-to-peer storage

11/09/14

Page 33: Some  aspects of information  theory  for a computer  scientist

11/09/14

Distributed source coding

3

Page 34: Some  aspects of information  theory  for a computer  scientist

Collecting spread information

11/09/14

• X, Y are two distant but correlated sources
• transmit their values to a unique receiver (perfect channels)
• no communication between the encoders

(figure: X and Y are observed at distant locations, with no communication between encoder 1 and encoder 2 ; encoder 1 sends at rate R1 and encoder 2 at rate R2 to a joint decoder, which outputs the pair (X,Y) ; information diagram showing H(X|Y), I(X;Y), H(Y|X))

• Naive solution = ignore the correlation, compress and send each source separately : rates R1 = H(X), R2 = H(Y)

• Can one do better, and take advantage of the correlation of X and Y ?


Page 35: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

• X = weather in Brest, Y = weather in Quimper
• the probability that the weathers are identical is 0.89
• one wishes to send the observed weather of 100 days in both cities

• one has H(X) = 1 = H(Y), so naive encoding requires 200 bits
• I(X;Y) ≈ 0.5, so not sending the “common information” twice saves about 50 bits

Joint distribution P(X,Y) :

              Y = sun   Y = rain
  X = sun      0.445     0.055
  X = rain     0.055     0.445
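These numbers follow directly from the table ; a small numerical check :

```python
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Joint distribution of (X = weather in Brest, Y = weather in Quimper)
joint = {("sun", "sun"): 0.445, ("sun", "rain"): 0.055,
         ("rain", "sun"): 0.055, ("rain", "rain"): 0.445}

HX = HY = H([0.5, 0.5])                  # both marginals are uniform
HXY = H(joint.values())
I = HX + HY - HXY

print("P(identical) =", joint["sun", "sun"] + joint["rain", "rain"])       # 0.89
print(f"H(X) = {HX}, H(Y) = {HY}, I(X;Y) = {I:.3f}")                        # I(X;Y) ~ 0.5
print(f"naive: {100*(HX+HY):.0f} bits ; joint limit: {100*HXY:.0f} bits for 100 days")
```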

Page 36: Some  aspects of information  theory  for a computer  scientist

Necessary conditions

11/09/14

Question: what are the best possible achievable transmission rates ?

(same figure as before : separate encoders for X and Y at rates R1 and R2, no communication between them, and a joint decoder recovering the pair (X,Y))

• Jointly, both coders must transmit the full pair (X,Y), so R1 + R2 ≥ H(X,Y)

• Each coder alone must transmit the private information that is not accessible through the other variable, so R1 ≥ H(X|Y) and R2 ≥ H(Y|X)

A pair (R1,R2) is achievable if there exist separate encoders fnX and fnY for the sequences X1…Xn and Y1…Yn resp., and a joint decoder gn, that are asymptotically error-free.

Page 37: Some  aspects of information  theory  for a computer  scientist

Result

11/09/14

Theorem (Slepian & Wolf, ‘75) : The achievable region is defined by

• R1 ≥ H(X|Y)

• R2 ≥ H(Y|X)

• R1+R2 ≥ H(X,Y)

(figure: the achievable region in the (R1,R2) plane, bounded by R1 ≥ H(X|Y), R2 ≥ H(Y|X) and R1+R2 ≥ H(X,Y), with corner points (H(X), H(Y|X)) and (H(X|Y), H(Y)))

The achievable region is easily shown to be convex, upper-right closed.

Page 38: Some  aspects of information  theory  for a computer  scientist

Compression by random binning

11/09/14

• encode only the typical sequences w = x1…xn
• throw them at random into 2^(nR) bins, with R > H(X)

(figure: bins numbered 1, 2, 3, …, 2^(nR) ; the bin index, written on nR bits, i.e. R bits/symbol, is the codeword)

Encoding of w = the number b of the bin where w lies

Decoding : if w = unique typical sequence in bin number b, output w ; otherwise, output “error”

Error probability : w collides with one of the ≈ 2^(nH(X)) other typical sequences spread over 2^(nR) bins with probability ≈ 2^(−n(R−H(X))) → 0, since R > H(X)
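A toy simulation of compression by random binning (all parameters are illustrative) ; it shows the failure rate behaving like 2^(−n(R−H(X))) :

```python
import itertools, math, random
from collections import defaultdict

# Source with alphabet of size 4 and H(X) = 1.75 bits; tiny block length so we can enumerate.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
n, R, eps = 8, 2.0, 0.25          # block length, rate in bits/symbol (R > H), typicality margin
H = -sum(q * math.log2(q) for q in p.values())

def neg_log_p(seq):
    return -sum(math.log2(p[s]) for s in seq)

# typical sequences: empirical log-likelihood per symbol close to H(X)
typical = [seq for seq in itertools.product(p, repeat=n) if abs(neg_log_p(seq) / n - H) <= eps]

num_bins = 2 ** round(R * n)
rng = random.Random(0)
bin_of = {seq: rng.randrange(num_bins) for seq in typical}    # the random binning = the encoder

population = defaultdict(int)
for b in bin_of.values():
    population[b] += 1

# decoding w from its bin number fails iff another typical sequence shares its bin
fail = sum(population[bin_of[seq]] > 1 for seq in typical) / len(typical)
print(len(typical), "typical sequences,", num_bins, "bins")
print(f"observed failure rate {fail:.3f}  vs  2^(-n(R-H)) = {2**(-n*(R-H)):.3f}")
# the failure rate vanishes as n grows, since R > H(X)
```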

Page 39: Some  aspects of information  theory  for a computer  scientist

Proof of Slepian-Wolf

11/09/14

• fX and fY are two independent random binnings of rates R1 and R2, applied to x = x1…xn and y = y1…yn resp.

• to decode the pair of bin numbers (bX,bY) = (fX(x),fY(y)), g outputs the unique pair (x,y) of jointly typical sequences in box (bX,bY), or “error” if there is more than one such pair.

• R2 > H(Y|X) : given x, there are 2^(nH(Y|X)) sequences y that are jointly typical with x

• R1+R2 > H(X,Y) : the number of boxes 2^(n(R1+R2)) must be greater than the number 2^(nH(X,Y)) of jointly typical pairs

(figure: a 2^(nR1) × 2^(nR2) grid of boxes indexed by the bin pairs ; the jointly typical pairs (x,y) are scattered over the grid, and decoding succeeds when the received box contains only one of them)

Page 40: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

X = color

Y = value

(figure: information diagram with H(X|Y) = 1.25, I(X;Y) = 0.5, H(Y|X) = 1.25)

Questions:

1. Is there an instantaneous* transmission protocol for rates RX=1.25=H(X|Y), RY=1.75=H(Y) ?

• send Y (always) : 1.75 bits
• what about X ?

(caution: the code for X should be uniquely decodable)

(table: the code for Y is 0, 10, 110, 111 ; the codewords for X, given Y, are left as question marks to be filled in)

(*) i.e. for sequences of length n=1

2. What about RX=RY=1.5 ?


Page 41: Some  aspects of information  theory  for a computer  scientist

In practice

11/09/14

The Slepian-Wolf theorem extends to N sources. It long remained an academic result, since no practical coders existed.

At the beginning of the 2000s, practical coders and applications appeared :
• compression of correlated images (e.g. same scene, 2 angles)
• sensor networks (e.g. measurement of a temperature field)
• case of a channel with side information
• acquisition of structured information, without communication

Page 42: Some  aspects of information  theory  for a computer  scientist

11/09/14

Fountain codes

4

Page 43: Some  aspects of information  theory  for a computer  scientist

Network protocols

11/09/14

TCP/IP (transmission control protocol)

network (erasure channel)

(figure: packets 1–7 are sent ; the receiver acknowledges what it has received, and missing packets are retransmitted)

Drawbacks
• slow for huge files over long-range connections (e.g. cloud backups…)
• feedback channel… but feedback does not improve capacity !
• repetition code… the worst rate among error correcting codes !
• designed by engineers who ignored information theory ? :o)

However
• the erasure rate of the channel (thus its capacity) is unknown / changing
• feedback makes protocols simpler
• there exist faster protocols (UDP) for streaming feeds

Page 44: Some  aspects of information  theory  for a computer  scientist

A fountain of information bits…

11/09/14

How to quickly and reliably transmit K packets of b bits?

Fountain code :
• from the K packets, generate and send a continuous flow of packets
• some get lost, some go through ; no feedback
• as soon as K(1+ε) of them are received, whichever they are, decoding becomes possible

Fountain codes are examples of rateless codes (no predefined rate), or universal codes : they adapt to the channel capacity.

Page 45: Some  aspects of information  theory  for a computer  scientist

Random coding…

11/09/14

Packet tn sent at time n is a random linear combination of the K packets s1…sK to transmit :

tn = Σk Gn,k · sk   (the sum being a bitwise XOR)

where the Gn,k are random IID binary variables.

(figure: the K source packets s1…sK of b bits each, and the emitted packets t1…tK’ ; in matrix form, [t1 … tK’] = [s1 … sK] · G, where G is a K × K’ random binary matrix)

Page 46: Some  aspects of information  theory  for a computer  scientist

Decoding

11/09/14

(figure: the N received packets r1…rN satisfy [r1 … rN] = [s1 … sK] · G’, where G’ is the K × N sub-matrix of G formed by the columns of the packets that were not lost)

Some packets are lost, and N out of K’ are received. This is equivalent to another random code, with generating matrix G’.

How big should N be to enable decoding ?

Page 47: Some  aspects of information  theory  for a computer  scientist

11/09/14

Decoding

• For N = K, what is the probability that G’ is invertible ?
One has [r1 … rN] = [s1 … sK] · G’, where G’ is a random K × N binary matrix. If G’ is invertible, one can decode by [s1 … sK] = [r1 … rN] · G’^(−1).
Answer : this probability converges quickly to ≈ 0.289 (as soon as K > 10).

• What about N = K+E ? What is the probability P that at least one K × K sub-matrix of G’ is invertible ?
Answer : P = 1 − δ(E), where δ(E) ≤ 2^(−E) (δ(E) < 10^(−6) for E = 20) : exponential convergence to 1 with E, regardless of K.

Complexity

• K/2 operations per generated packet, so O(K²) for encoding
• decoding : O(K³) for the matrix inversion
• one would like better complexities… linear ?
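The 0.289 figure is easy to reproduce by simulating random binary matrices and comparing with the exact value Π_{i=1..K} (1 − 2^(−i)) ; a small sketch :

```python
import random

def rank_gf2(rows, ncols):
    """Rank over GF(2) of a matrix given as a list of integers (one bitmask per row)."""
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i] >> col & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] >> col & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank

def prob_invertible(K, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = sum(rank_gf2([rng.getrandbits(K) for _ in range(K)], K) == K for _ in range(trials))
    return hits / trials

for K in (5, 10, 20, 40):
    exact = 1.0
    for i in range(1, K + 1):
        exact *= 1 - 2.0 ** (-i)
    print(K, prob_invertible(K), round(exact, 4))   # both columns approach ~ 0.289
```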

Page 48: Some  aspects of information  theory  for a computer  scientist

LT codes

11/09/14

Invented by Michael Luby (2003), and inspired by LDPC codes (Gallager, 1963).

Idea : linear combinations of packets should be “sparse”

Encoding

• for each packet tn, randomly select a “degree” dn according to some distribution ρ(d) on degrees

• choose at random dn packets among s1…sK and take as tn the sum (XOR) of these dn packets

• some nodes have low degree, others have high degree : this makes the graph a small world

(figure: bipartite graph connecting the source packets s1…sK to the emitted packets t1…tN)

Page 49: Some  aspects of information  theory  for a computer  scientist

Decoding LT codes

11/09/14

Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving

Example

1 0 1 1

(Pages 50 to 54 repeat this slide, showing successive steps of the decoder on the example : starting from the received packets 1 0 1 1, the source bits are recovered one by one, 1, then 1 0, then 1 0 1.)

Page 55: Some  aspects of information  theory  for a computer  scientist

Decoding LT codes

11/09/14

Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving

Example

1 0 1 1

1 0 1

How to choose the degrees ?
• each iteration should yield a single new node of degree 1 ; achieved by the distribution ρ(1) = 1/K and ρ(d) = 1/(d(d−1)) for d = 2…K
• the average degree is log_e K, so the decoding complexity is K·log_e K
• in reality :
  • one needs a few nodes of high degree to ensure that every packet is connected to at least one check-node
  • one needs a few more small-degree nodes to ensure that decoding starts
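A compact sketch of an LT code with the degree distribution ρ above and the peeling decoder (payloads and parameters are illustrative ; as noted above, the pure distribution ρ is fragile, so decoding may occasionally get stuck) :

```python
import random

def soliton(K):
    """rho(1) = 1/K, rho(d) = 1/(d(d-1)) for d = 2..K."""
    return [1.0 / K] + [1.0 / (d * (d - 1)) for d in range(2, K + 1)]

def lt_encode(source, n_out, rng):
    """Each emitted packet = XOR of d randomly chosen source packets, with d ~ rho."""
    K, rho, packets = len(source), soliton(len(source)), []
    for _ in range(n_out):
        d = rng.choices(range(1, K + 1), weights=rho)[0]
        neighbours = set(rng.sample(range(K), d))
        value = 0
        for i in neighbours:
            value ^= source[i]
        packets.append((neighbours, value))
    return packets

def lt_decode(packets, K):
    """Peeling decoder: repeatedly use a packet of residual degree 1, then propagate."""
    residual = [[set(nb), v] for nb, v in packets]
    decoded = {}
    progress = True
    while progress:
        progress = False
        for nb, v in residual:
            if len(nb) == 1:
                i = next(iter(nb))
                if i not in decoded:
                    decoded[i] = v          # the residual value is exactly that source packet
                    progress = True
        for entry in residual:              # subtract every decoded packet from the others
            for i in list(entry[0]):
                if i in decoded:
                    entry[0].discard(i)
                    entry[1] ^= decoded[i]
    return decoded if len(decoded) == K else None

rng = random.Random(42)
K = 30
source = [rng.getrandbits(16) for _ in range(K)]      # K source packets (16-bit toy payloads)
received = lt_encode(source, int(1.7 * K), rng)       # the packets that made it through
decoded = lt_decode(received, K)
print("fully decoded" if decoded == dict(enumerate(source)) else "decoding stuck, ask for more packets")
```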

Page 56: Some  aspects of information  theory  for a computer  scientist

In practice…

11/09/14

Performance
• both encoding and decoding are in K·log K (instead of K² and K³)
• for large K > 10^4, the observed overhead E represents from 5% to 10%
• Raptor codes (Shokrollahi, 2003) do better : linear time complexity

Applications
• broadcast to many users : a fountain code adapts to the channel of each user, with no need to rebroadcast the packets missed by some user
• storage on many unreliable devices : e.g. RAID (redundant array of inexpensive disks), data centers, peer-to-peer distributed storage

Page 57: Some  aspects of information  theory  for a computer  scientist

11/09/14

Distributed P2P storage

5

Page 58: Some  aspects of information  theory  for a computer  scientist

Principle

11/09/14

(figure: the raw data s1…sK is expanded into redundant data t1…tN, and each packet ti is stored on a distinct storage device (disk, peer, …))

Idea = raw data split into packets, expanded with some ECC. Each newly created packet is stored independently. The original data is erased.

Problems
• disks can crash, peers can leave : eventual data loss
• the original data can be recovered if enough packets remain… but missing packets need to be restored

Restoration
• perfect : the packet that is lost is exactly replaced
• functional : new packets are built, so as to preserve data recoverability
• intermediate : maintain the systematic part of the data

(figure: the restored packets t’2 … t’N are stored on new peers)

Page 59: Some  aspects of information  theory  for a computer  scientist

Which codes ?

11/09/14

Target : one should rebuild the missing blocks… without first rebuilding the original data ! (that would require too much bandwidth)

Fountain / random codes :
• random linear combinations of the remaining blocks among t1…tn
• but this does not preserve the appropriate degree distribution

MDS codes : maximum distance separable codes
• can rebuild s1…sk from any subset of exactly k blocks in t1…tn
• example : Reed-Solomon codes

Page 60: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

(figure: the raw data consists of k = 2 sets of α = 2 blocks, (a, b) and (c, d) ; it is stored on n = 4 nodes of α = 2 blocks each : (a, b), (c, d), (a+c, b+d), (b+c, a+b+d) ; when the node holding (a, b) fails, it is reconstructed by requesting β = 1 block from each surviving node, e.g. d, b+d and a+b+d)
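Assuming the node contents read off the figure, (a, b), (c, d), (a+c, b+d), (b+c, a+b+d), with “+” meaning XOR, the repair of the first node from one block per surviving node can be checked directly :

```python
def xor(x, y):                   # blocks are byte strings; "+" in the figure is a bitwise XOR
    return bytes(u ^ v for u, v in zip(x, y))

a, b, c, d = b"AAAA", b"BBBB", b"CCCC", b"DDDD"       # the four raw blocks (toy payloads)
node1, node2 = (a, b), (c, d)
node3 = (xor(a, c), xor(b, d))                        # (a+c, b+d)
node4 = (xor(b, c), xor(xor(a, b), d))                # (b+c, a+b+d)

# node1 = (a, b) fails; download beta = 1 block from each surviving node: d, b+d, a+b+d
got_d, got_bd, got_abd = node2[1], node3[1], node4[1]
new_b = xor(got_bd, got_d)                            # b = (b+d) + d
new_a = xor(xor(got_abd, new_b), got_d)               # a = (a+b+d) + b + d
print((new_a, new_b) == node1)                        # True : the lost node is rebuilt
```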

Page 61: Some  aspects of information  theory  for a computer  scientist

Example

11/09/14

(figure: the same storage code, with a second repair scenario : the newcomer downloads combinations of blocks, such as c+d, b+d, b+c and a+b+d, and stores new combinations rather than the original ones, i.e. a functional repair)

Result (Dimakis et al., 2010) : for functional repair, given k, n and d ≥ k (the number of nodes to contact for a repair), network coding techniques allow to optimally balance α (the number of blocks stored per node) and β (the bandwidth necessary for a reconstruction).

Page 62: Some  aspects of information  theory  for a computer  scientist

11/09/14

Conclusion

6

Page 63: Some  aspects of information  theory  for a computer  scientist

A few lessons

11/09/14

Ralf Koetter* : “Communications aren’t anymore about transmitting a bit, but about transmitting evidence about a bit.”

(*) one of the inventors of Network Coding

Random structures spread information uniformly.

Information theory gives bounds on how much one can learn about some hidden information…

One does not have to build the actual protocols/codes that will reveal this information.

Page 64: Some  aspects of information  theory  for a computer  scientist

Management of distributed information… in other fields

11/09/14

Digital communications (network information theory)
- A, B : random variables, possibly correlated
- one wishes to compute at B the value f(A,B)
- how many bits should be exchanged ?
- how many communication rounds ?

Compressed sensing (signal processing)
- the signal can be described by sparse coefficients
- random (sub-Nyquist) sampling

Communication complexity (computer science)
- A, B : variables, taking values in a huge space
- how many bits should A send to B in order to check A = B ?
- solution by random coding (a sketch follows below)

(figure: A sends n bits to B, who checks whether A = B)
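A small sketch of the random-coding idea for the equality check, in the model where A and B share public random bits (the parameters are illustrative) : A sends k random parities of its n-bit string instead of the whole string, and a difference escapes detection with probability only 2^(−k).

```python
import random

def parities(bits, masks):
    """For each mask, the parity of the selected positions of `bits`."""
    return [sum(x & m for x, m in zip(bits, mask)) % 2 for mask in masks]

n, k = 10_000, 32                         # n-bit inputs, k parity bits exchanged instead of n
rng = random.Random(7)                    # stands for the public randomness shared by A and B
masks = [[rng.randrange(2) for _ in range(n)] for _ in range(k)]

x = [rng.randrange(2) for _ in range(n)]        # A's string
y = x[:]
y[1234] ^= 1                                    # B's string differs in a single position

fingerprint_A = parities(x, masks)              # the k bits actually transmitted by A
fingerprint_B = parities(y, masks)
print("declared equal" if fingerprint_A == fingerprint_B else "declared different")
# if x != y, each random parity differs with probability 1/2, hence error probability 2^(-k)
```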

Page 65: Some  aspects of information  theory  for a computer  scientist

thank you !