1 a(n) (extremely) brief/crude introduction to minimum description length principle jdu 2006-04

1

A(n) (extremely) brief/crude introduction to minimum description length principle

jdu

2006-04

2

Outline

• Conceptual/non-technical introduction

• Probabilities and Codelengths

• Crude MDL

• Refined MDL

• Other topics

3

Outline



4

Introduction

• Example: data compression
  – Description methods

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

5

Introduction

• Example: regression
  – Model selection and overfitting
  – Complexity of the model vs. goodness of fit


6

Introduction

• Models vs. Hypotheses


7

Introduction

• Crude 2-part version of MDL


8

Outline



9

Probabilities and Codelengths

• Let X be a finite or countable set
  – A code C(x) for X
    • 1-to-1 mapping from X to $\bigcup_{n>0} \{0,1\}^n$
    • $L_C(x)$: number of bits needed to encode x using C
  – P: probability distribution defined on X
    • P(x): the probability of x
    • A sequence of (usually iid) observations $x_1, x_2, \ldots, x_n$: $x^n$

10

Probabilities and Codelengths

• Prefix codes: as examples of uniquely decodable codes
  – No code word is a prefix of any other

a 0

b 111

c 1011

d 1010

r 110

! 100

Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
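
The prefix property of the table above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function names and the test string "abracadabra!" are illustrative assumptions; only the code table itself comes from the slide.

```python
# Code table copied from the slide; everything else is an illustrative assumption.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}


def is_prefix_free(code):
    """Return True if no code word is a prefix of another code word."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(text, code=CODE):
    """Concatenate the code words of the symbols in text."""
    return "".join(code[symbol] for symbol in text)


def decode(bits, code=CODE):
    """Decode left to right; prefix-freeness makes each match unambiguous."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a code word")
    return "".join(symbols)
```

Because no code word is a prefix of another, the decoder never needs to look ahead: the first code word that matches the buffer is the only possible one.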

11

Probabilities and Codelengths

• Expected codelength of a code C:
  $E_P[L_C(x)] = \sum_{x \in X} P(x) L_C(x)$
  – Lower bound: the entropy $H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$

• Optimal code
  – A code is optimal if it has minimum expected codelength over all uniquely decodable codes
  – How to design one given P?
    • Huffman coding

12

Probabilities and Codelengths

• Huffman coding

Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
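
As a rough illustration of Huffman coding and of the entropy lower bound on expected codelength, here is a minimal sketch. The function names and the example distribution in the usage note are assumptions chosen for illustration; the construction is the standard one (repeatedly merge the two least probable subtrees).

```python
import heapq
import math


def huffman_code(probs):
    """Build a binary Huffman code for a dict symbol -> probability."""
    if len(probs) == 1:
        return {symbol: "0" for symbol in probs}
    # Heap entries: (probability, unique tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        # Prepend one bit to every code word in the two merged subtrees.
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


def expected_length(probs, code):
    """Expected codelength: sum over x of P(x) * L_C(x)."""
    return sum(p * len(code[s]) for s, p in probs.items())


def entropy(probs):
    """Entropy in bits: -sum over x of P(x) * log2 P(x)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)
```

For a dyadic distribution such as {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}, the expected codelength of the resulting code equals the entropy; in general it lies within one bit above it.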

13

Probabilities and Codelengths

• How to design a code for {1, 2, …, M}?
  – Assuming a uniform distribution: 1/M for each number
  – ~ log M bits
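
A fixed-width binary code is one simple way to realize the uniform code. The sketch below assumes that choice; the function names are hypothetical, and the width is rounded up to a whole number of bits.

```python
def uniform_code_length(m):
    """Bits per code word of a fixed-length code for {1, ..., m}."""
    return (m - 1).bit_length()  # equals ceil(log2(m)) for m >= 1


def uniform_encode(k, m):
    """Encode k in {1, ..., m} as the fixed-width binary form of k - 1."""
    if not 1 <= k <= m:
        raise ValueError("k must lie in {1, ..., m}")
    width = uniform_code_length(m)
    # With m == 1 there is nothing to communicate, so the code word is empty.
    return format(k - 1, "b").zfill(width) if width else ""
```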

14

Probabilities and Codelengths

• How to design a code for all the positive integers?
  – For each k
    • Describe it with $\lceil \log k \rceil$ 0s
    • Followed by a 1
    • Then encode k using the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$
    • In total, ~ 2 log k + 1 bits
  – Can be refined…
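
The construction for the positive integers can be sketched as follows. This assumes the number of leading 0s is ceil(log2 k) and that the uniform code is a fixed-width binary code over {1, ..., 2^ceil(log2 k)}, which gives exactly 2 * ceil(log2 k) + 1 bits; the function names are hypothetical.

```python
def encode_positive_integer(k):
    """Prefix code for k >= 1: m zeros, a one, then k in an m-bit uniform code."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    m = (k - 1).bit_length()  # m = ceil(log2(k)), so k lies in {1, ..., 2^m}
    payload = format(k - 1, "b").zfill(m) if m else ""
    return "0" * m + "1" + payload


def decode_positive_integer(bits):
    """Decode one integer from the front of bits; return (k, remaining bits)."""
    m = bits.index("1")  # number of leading zeros announces the payload width
    payload = bits[m + 1 : 2 * m + 1]
    if len(payload) != m:
        raise ValueError("truncated code word")
    k = int(payload, 2) + 1 if m else 1
    return k, bits[2 * m + 1 :]
```

The run of zeros tells the decoder how many payload bits follow, so code words can be concatenated without separators.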

15

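Probabilities and Codelengths

• Let P be a probability distribution over X; then there exists a code C for X such that, for all x in X:
  $L_C(x) = \lceil -\log P(x) \rceil$

• Let C be a uniquely decodable code over X; then there exists a probability distribution P (possibly summing to less than 1) such that, for all x in X:
  $L_C(x) = -\log P(x)$

• Likewise for sequences $x^n$, ignoring rounding:
  $L_C(x^n) = -\log P(x^n)$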
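
The correspondence between codelengths and probabilities rests on the Kraft-McMillan inequality (the codelengths of any uniquely decodable code satisfy sum over x of 2^(-L(x)) <= 1), which is not stated on the slide. A minimal numerical sketch, with hypothetical function names and an assumed example distribution in the usage note:

```python
import math


def shannon_lengths(probs):
    """Codelengths ceil(-log2 P(x)) for a dict symbol -> probability."""
    return {s: math.ceil(-math.log2(p)) for s, p in probs.items() if p > 0}


def kraft_sum(lengths):
    """Sum of 2^(-L(x)); at most 1 for any uniquely decodable code."""
    return sum(2.0 ** -length for length in lengths.values())


def implied_distribution(lengths):
    """P(x) = 2^(-L(x)), so that L(x) = -log2 P(x) holds exactly."""
    return {s: 2.0 ** -length for s, length in lengths.items()}
```

For example, the prefix code with codelengths 1, 3, 4, 4, 3, 3 has Kraft sum exactly 1, so its implied distribution is a proper probability distribution; shannon_lengths({"a": 0.5, "b": 0.3, "c": 0.2}) gives lengths whose Kraft sum is below 1.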

16

Probabilities and Codelengths

• Codelength revisited


17

Outline



18

Crude MDL

• Preliminary: k-th order Markov chain on X={0,1}
  – A sequence: $X_1, X_2, \ldots, X_N$
  – Special case: 0-th order: Bernoulli model (biased coin)
    • Maximum likelihood estimator

19

Crude MDL

• Preliminary: k-th order Markov chain on X={0,1}
  – Special case: first order Markov chain B(1)
    • MLE

20

Crude MDL

• Preliminary: k-th order Markov chain on X={0,1}
  – $2^k$ parameters
    • theta[1|000…000] = n[1|000…000]/n[000…000]
    • theta[1|000…001]
    • …
    • theta[1|111…110]
    • theta[1|111…111]
  – Log likelihood function: …
  – MLE: …
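
The counting estimator theta[1|context] = n[1|context]/n[context] can be sketched as below. This is an illustration under stated assumptions: the data is a Python string of "0" and "1" characters, n[context] counts only occurrences of the context that are followed by a symbol, and the likelihood conditions on the first k symbols. The function names are hypothetical.

```python
from collections import Counter
import math


def markov_mle(bits, k):
    """MLE theta[1|context] = n[1|context] / n[context] for a binary string."""
    n_context = Counter()
    n_one = Counter()
    for i in range(k, len(bits)):
        context = bits[i - k : i]  # the k symbols preceding position i
        n_context[context] += 1
        if bits[i] == "1":
            n_one[context] += 1
    return {c: n_one[c] / n_context[c] for c in n_context}


def log2_likelihood(bits, k, theta):
    """log2 P(x_{k+1..n} | x_{1..k}, theta), conditioning on the first k symbols."""
    total = 0.0
    for i in range(k, len(bits)):
        p_one = theta[bits[i - k : i]]
        p = p_one if bits[i] == "1" else 1.0 - p_one
        if p == 0.0:
            return -math.inf  # theta assigns probability zero to the data
        total += math.log2(p)
    return total
```

With k = 0 the context is the empty string and the estimator reduces to the Bernoulli case: the fraction of 1s in the sequence.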

21

Crude MDL

• Question: Given data $D = x^n$, find the Markov chain that best explains D.
  – We do not want to restrict ourselves to chains of fixed order
    • How to avoid overfitting?
    • Obviously, an (n-1)-th order Markov model would always fit the data the best

22

Crude MDL

• two-part MDL revisited


23

Crude MDL

• Description length of data given hypothesis

24

Crude MDL

• Description length of hypothesis
  – The code should not change with the sample size n.
  – Different codes will lead to preferences of different hypotheses
  – How to design a code that
    • Leads to good inferences with small, practically relevant sample sizes?

25

Crude MDL

• An "intuitive" and "reasonable" code for k-th order Markov chain
  – First describe k using $2 \log k + 1$ bits
  – Then describe the $d = 2^k$ parameters
    • Assume n is given in advance
      – For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1)
      – Describe each theta with $\log(n+1)$ bits
  – $L(H) = 2 \log k + 1 + d \log(n+1)$
  – $L(H) + L(D|H) = 2 \log k + 1 + d \log(n+1) - \log P(D \mid k, \theta)$
  – For a given k, only the MLE theta needs to be considered
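
The two-part codelength above can be turned into a small order-selection routine. The sketch below makes several assumptions that are not on the slide: data is a string of "0"/"1" characters, logarithms are base 2, the first k symbols (which have no full context) are sent at 1 bit each, and only orders k >= 1 are searched because 2 log k + 1 is undefined at k = 0. All function names are hypothetical.

```python
from collections import Counter
import math


def neg_log2_likelihood(bits, k):
    """-log2 P(D | k, MLE theta), with the first k symbols sent at 1 bit each."""
    n_context, n_one = Counter(), Counter()
    for i in range(k, len(bits)):
        context = bits[i - k : i]
        n_context[context] += 1
        n_one[context] += bits[i] == "1"
    total = float(k)  # assumption: 1 bit per initial symbol
    for context, count in n_context.items():
        ones = n_one[context]
        for c in (ones, count - ones):
            if c:  # zero counts contribute nothing at the MLE
                total -= c * math.log2(c / count)
    return total


def two_part_length(bits, k):
    """L(H) + L(D|H) = 2 log k + 1 + d log(n+1) - log P(D | k, theta)."""
    n, d = len(bits), 2 ** k
    l_hypothesis = 2 * math.log2(k) + 1 + d * math.log2(n + 1)
    return l_hypothesis + neg_log2_likelihood(bits, k)


def select_order(bits, max_order):
    """Return the order k in 1..max_order with the shortest total description."""
    return min(range(1, max_order + 1), key=lambda k: two_part_length(bits, k))
```

On a strictly alternating sequence a first-order chain already predicts every symbol, so higher orders only add parameter cost; on a repeating "0011" pattern a first-order chain cannot predict the next symbol and the second-order chain gives the shortest total description among small orders.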

26

Crude MDL

• Good news
  – We have found a principled manner to encode data D using H

• Bad news
  – We have not found clear guidelines to design codes for H

27

Outline


• Probabilities and Codelengths

• Crude MDL

• Refined MDL

• Other issues

28

Refined MDL

• Universal codes and universal distributions
  – Maximum likelihood code depends on the data
    • How to describe the data in an unambiguous manner?
  – Design a code such that, for every possible observation, its codelength corresponds to its ML? Impossible.

29

Refined MDL

• Worst-case regret

• Optimal universal model

30

Refined MDL

• Normalized maximum likelihood (NML)

• Minimizing -logNML
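
For a concrete case, the sketch below computes the NML codelength for the Bernoulli model, assuming the standard definition: the maximized likelihood of the observed sequence divided by the sum of maximized likelihoods over all sequences of the same length. The function names are hypothetical, and the direct summation is only intended for moderate n.

```python
import math


def bernoulli_ml(ones, n):
    """Maximized likelihood P(x^n | theta_hat) for a sequence with `ones` 1s."""
    theta = ones / n
    # Python evaluates 0.0 ** 0 as 1.0, which handles theta = 0 and theta = 1.
    return theta ** ones * (1.0 - theta) ** (n - ones)


def nml_normalizer(n):
    """Sum of maximized likelihoods over all 2^n binary sequences of length n."""
    # Sequences with the same number of 1s share the same maximized likelihood.
    return sum(math.comb(n, m) * bernoulli_ml(m, n) for m in range(n + 1))


def nml_codelength(bits):
    """-log2 P_nml(x^n) = -log2 P(x^n | theta_hat) + log2 of the normalizer."""
    n, ones = len(bits), bits.count("1")
    return -math.log2(bernoulli_ml(ones, n)) + math.log2(nml_normalizer(n))
```

The second term, the logarithm of the normalizer, does not depend on the data: it is the same for every sequence of length n, so the regret relative to the maximum likelihood codelength is constant across sequences.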

31

Refined MDL

• Complexity of a model

– The more sequences that can be fit well by an element of M, the larger M’s complexity

– Would it lead to a "right" balance between complexity and fit?
  • Hopefully…

32

Refined MDL

• General refined MDL


33

Outline



34

Other topics

• Mixture code

• Resolvability

• …

35

References

• Barron, A.; Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.

• Grünwald, P.D.; Myung, I.J. & Pitt, M.A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.

• Hall, P. & Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.

