A(n) (extremely) brief/crude introduction to minimum description length principle, jdu, 2006-04


Upload: constance-blake

Post on 28-Dec-2015


Page 1:

A(n) (extremely) brief/crude introduction to minimum description length principle

jdu

2006-04

Page 2:

Outline

• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics

Page 4:

Introduction

• Example: data compression
  – Description methods

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 5:

Introduction

• Example: regression
  – Model selection and overfitting
  – Complexity of the model vs. goodness of fit

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 6:

Introduction

• Models vs. Hypotheses

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 7:

Introduction

• Crude 2-part version of MDL

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 9:

Probabilities and Codelengths

• Let X be a finite or countable set
  – A code C(x) for X
    • 1-to-1 mapping from X to $\bigcup_{n>0} \{0,1\}^n$
    • $L_C(x)$: number of bits needed to encode x using C
  – P: probability distribution defined on X
    • P(x): the probability of x
    • A sequence of (usually iid) observations $x_1, x_2, \ldots, x_n$, written $x^n$

Page 10:

Probabilities and Codelengths

• Prefix codes: as examples of uniquely decodable codes
  – no code word is a prefix of any other

Symbol  Codeword
a       0
b       111
c       1011
d       1010
r       110
!       100

Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
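The prefix property can be checked mechanically. The following is a minimal sketch, not part of the original slides: it uses the code table from this slide, and the function names (`is_prefix_free`, `encode`, `decode`) are chosen here for illustration only.

```python
# Code table copied from the slide; everything else is illustrative.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}


def is_prefix_free(code):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and w2.startswith(w1)
        for i, w1 in enumerate(words)
        for j, w2 in enumerate(words)
    )


def encode(text, code):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Decode left to right; with a prefix code the first match is the only match."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack, which is why prefix codes are uniquely decodable.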

Page 11:

Probabilities and Codelengths

• Expected codelength of a code C: $E_P(L_C(x)) = \sum_{x \in X} P(x) L_C(x)$
  – Lower bound: $H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$
• Optimal code
  – if it has minimum expected codelength over all uniquely decodable codes
  – How to design one given P?
    • Huffman coding
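As a small numerical illustration of the expected codelength and its entropy lower bound, here is a sketch under assumed inputs: the distribution `P` and the two codelength tables are made up for the example and do not come from the slides.

```python
import math


def expected_codelength(p, lengths):
    """Sum over x of P(x) * L_C(x)."""
    return sum(p[x] * lengths[x] for x in p)


def entropy(p):
    """Minus the sum over x of P(x) * log2 P(x), skipping zero-probability symbols."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


# Hypothetical example distribution with dyadic probabilities.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
matched_lengths = {"a": 1, "b": 2, "c": 3, "d": 3}   # L_C(x) = -log2 P(x)
fixed_lengths = {"a": 2, "b": 2, "c": 2, "d": 2}     # plain 2-bit code

print(entropy(P))                               # 1.75
print(expected_codelength(P, matched_lengths))  # 1.75
print(expected_codelength(P, fixed_lengths))    # 2.0
```

The code whose lengths equal $-\log_2 P(x)$ meets the bound exactly here because every probability is a power of 1/2; the fixed-length code is a quarter of a bit worse on average.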

Page 12:

Probabilities and Codelengths

• Huffman coding

Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
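The figure for this slide is not in the transcript. As a stand-in, here is a minimal sketch of the standard Huffman construction (repeatedly merge the two least probable subtrees). The function name `huffman_code` and the tie-breaking counter are choices made for this example, so the exact codewords may differ from those in the original figure, although the expected codelength is still optimal.

```python
import heapq
from itertools import count


def huffman_code(p):
    """Build a binary Huffman code for a dict mapping symbol -> probability."""
    if len(p) == 1:
        # Degenerate case: a single symbol still needs one codeword.
        return {next(iter(p)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(prob, next(tiebreak), {sym: ""}) for sym, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # least probable subtree
        p1, _, code1 = heapq.heappop(heap)  # second least probable subtree
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]
```

For a distribution such as `{"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}` the resulting codelengths are 1, 2, 3 and 3 bits, matching $-\log_2 P(x)$ for each symbol.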

Page 13:

Probabilities and Codelengths

• How to design code for {1, 2, …, M}?
  – Assuming a uniform distribution: 1/M for each number
  – ~ $\log M$ bits per number ($\lceil \log_2 M \rceil$ bits with a fixed-length binary code)
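A fixed-length binary code realizes this bound. The sketch below is illustrative only; the names `uniform_width` and `uniform_encode` are not from the slides, and it assumes the width is rounded up to a whole number of bits.

```python
def uniform_width(M):
    """Number of bits of a fixed-length code for {1, ..., M}: ceil(log2 M)."""
    if M < 1:
        raise ValueError("M must be at least 1")
    return (M - 1).bit_length()  # exact integer form of ceil(log2 M)


def uniform_encode(k, M):
    """Encode k in {1, ..., M} as a fixed-width binary string."""
    if not 1 <= k <= M:
        raise ValueError("k must lie in {1, ..., M}")
    width = uniform_width(M)
    return format(k - 1, "b").zfill(width) if width else ""
```

Every number gets a codeword of the same length, which corresponds to assigning each of the M outcomes probability close to 1/M.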

Page 14:

Probabilities and Codelengths

• How to design code for all the positive integers?
  – For each k
    • Describe it with $\lceil \log k \rceil$ 0s
    • Followed by a 1
    • Then encode k using the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$
    • In total, ~ $2\log k + 1$ bits
  – Can be refined…
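The construction can be sketched as follows. This is an illustration under one assumption that should be stated loudly: the number of leading 0s is taken to be $\lceil \log_2 k \rceil$, which is consistent with the stated total of about $2\log k + 1$ bits. The function names are made up for the example.

```python
def encode_positive_integer(k):
    """Prefix code for k >= 1: n zeros, a one, then k in n bits, n = ceil(log2 k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    n = (k - 1).bit_length()  # ceil(log2 k), assumed length of the 0-run
    body = format(k - 1, "b").zfill(n) if n else ""  # uniform code on {1..2^n}
    return "0" * n + "1" + body


def decode_positive_integer(bits):
    """Decode one integer from the front of bits; return (k, remaining bits)."""
    n = bits.index("1")  # the run of leading zeros announces the body length
    body = bits[n + 1 : 2 * n + 1]
    if len(body) != n:
        raise ValueError("truncated codeword")
    k = int(body, 2) + 1 if n else 1
    return k, bits[2 * n + 1 :]
```

Because the run of 0s announces how many bits follow the 1, codewords can be concatenated and still be decoded one at a time, so the code is a prefix code over all positive integers.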

Page 15:

Probabilities and Codelengths

• Let P be a probability distribution over X, then there exists a code C for X such that, for all x in X: $L_C(x) = \lceil -\log P(x) \rceil$
• Let C be a uniquely decodable code over X, then there exists a (possibly defective) probability distribution P such that, for all x in X: $L_C(x) = -\log P(x)$
• Neglecting rounding, for sequences: $L_C(x^n) = -\log P(x^n)$
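Both directions of this correspondence can be illustrated numerically. The sketch below assumes the usual rounded-up codelengths $\lceil -\log_2 P(x) \rceil$ and uses the Kraft sum $\sum_x 2^{-L(x)} \le 1$ as the standard existence check; the example distribution and all function names are hypothetical.

```python
import math


def shannon_lengths(p):
    """Distribution -> codelengths: L(x) = ceil(-log2 P(x))."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items() if px > 0}


def kraft_sum(lengths):
    """Sum of 2^-L(x); a prefix code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -l for l in lengths.values())


def implied_distribution(lengths):
    """Codelengths -> (possibly defective) distribution: P(x) = 2^-L(x)."""
    return {x: 2.0 ** -l for x, l in lengths.items()}


# Hypothetical distribution, not taken from the slides.
P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = shannon_lengths(P)
print(lengths, kraft_sum(lengths))
```

When the Kraft sum is strictly below 1, the implied distribution sums to less than 1; that is the sense in which it is "defective".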

Page 16:

Probabilities and Codelengths

• Codelength revisited

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 18:

Crude MDL

• Preliminary: k-th order Markov chain on X = {0,1}
  – A sequence: $X_1, X_2, \ldots, X_N$
  – Special case: 0-th order: Bernoulli model (biased coin)
    • Maximum Likelihood estimator

Page 19:

Crude MDL

• Preliminary: k-th order Markov chain on X = {0,1}
  – Special case: first order Markov chain $B^{(1)}$
    • MLE

Page 20:

Crude MDL

• Preliminary: k-th order Markov chain on X = {0,1}
  – $2^k$ parameters
    • theta[1|000…000] = n[1|000…000]/n[000…000]
    • theta[1|000…001]
    • …
    • theta[1|111…110]
    • theta[1|111…111]
  – Log likelihood function: …
  – MLE: …
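The counting estimator theta[1|context] = n[1|context]/n[context] can be sketched directly. This is an illustration, not the slides' own derivation: it assumes the data is a string of '0'/'1' characters, that the first k symbols are only used as context (they are conditioned on rather than modelled), and that contexts never seen in the data are simply left out. The function names are made up.

```python
import math
from collections import Counter


def markov_mle(x, k):
    """Return {context: theta[1|context]} estimated by counting."""
    n_ctx, n_one = Counter(), Counter()
    for i in range(k, len(x)):
        ctx = x[i - k : i]          # the k symbols preceding position i
        n_ctx[ctx] += 1
        n_one[ctx] += int(x[i] == "1")
    return {ctx: n_one[ctx] / n_ctx[ctx] for ctx in n_ctx}


def log_likelihood(x, k, theta):
    """log2 P(x_{k+1..n} | first k symbols, theta) for a k-th order chain."""
    total = 0.0
    for i in range(k, len(x)):
        p_one = theta[x[i - k : i]]
        total += math.log2(p_one if x[i] == "1" else 1.0 - p_one)
    return total
```

With k = 0 the only context is the empty string and the estimate reduces to the Bernoulli MLE, the fraction of 1s in the data.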

Page 21:

Crude MDL

• Question: Given data $D = x^n$, find the Markov chain that best explains D.
  – We do not want to restrict ourselves to chains of fixed order
    • How to avoid overfitting?
    • Obviously, an (n-1)-th order Markov model would always fit the data the best

Page 22:

Crude MDL

• Two-part MDL revisited

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 23:

Crude MDL

• Description length of data given hypothesis

Page 24:

Crude MDL

• Description length of hypothesis
  – The code should not change with the sample size n.
  – Different codes will lead to preferences of different hypotheses
  – How to design a code that
    • Leads to good inferences with small, practically relevant sample sizes?

Page 25:

Crude MDL

• An "intuitive" and "reasonable" code for k-th order Markov chain
  – First describe k using $2\log k + 1$ bits
  – Then describe the $d = 2^k$ parameters
    • Assume n is given in advance
  – For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1)
  – Describe each theta with $\log(n+1)$ bits
  – $L(H) = 2\log k + 1 + d\log(n+1)$
  – $L(H) + L(D|H) = 2\log k + 1 + d\log(n+1) - \log P(D \mid k, \theta)$
  – For a given k, only the MLE theta needs to be considered
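The two-part criterion on this slide can be turned into a small order-selection routine. The sketch below rests on several assumptions that are not in the slides: logarithms are base 2, the first k symbols are conditioned on rather than encoded, and candidate orders start at k = 1 because $2\log k + 1$ is undefined for k = 0. All function names are illustrative.

```python
import math
from collections import Counter


def neg_log_likelihood_at_mle(x, k):
    """L(D|H) = -log2 P(x | k, theta_hat), with theta_hat obtained by counting."""
    n_ctx, n_one = Counter(), Counter()
    for i in range(k, len(x)):
        ctx = x[i - k : i]
        n_ctx[ctx] += 1
        n_one[ctx] += int(x[i] == "1")
    bits = 0.0
    for ctx, total in n_ctx.items():
        for c in (n_one[ctx], total - n_one[ctx]):
            if c:  # outcomes that never occur contribute nothing
                bits -= c * math.log2(c / total)
    return bits


def two_part_codelength(x, k):
    """L(H) + L(D|H) with L(H) = 2 log k + 1 + d log(n + 1), d = 2^k."""
    n, d = len(x), 2 ** k
    l_h = 2 * math.log2(k) + 1 + d * math.log2(n + 1)
    return l_h + neg_log_likelihood_at_mle(x, k)


def select_order(x, max_k):
    """Pick the order k in 1..max_k with the smallest total codelength."""
    return min(range(1, max_k + 1), key=lambda k: two_part_codelength(x, k))
```

On a strictly alternating sequence such as `"01" * 50` a first-order chain already predicts every symbol, so the extra parameters of higher orders only add to L(H). On `"0011" * 50` a first-order chain pays roughly one bit per symbol in L(D|H), and the second-order chain wins despite its larger L(H).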

Page 26:

Crude MDL

• Good news
  – We have found a principled manner to encode data D using H
• Bad news
  – We have not found clear guidelines to design codes for H

Page 28:

Refined MDL

• Universal codes and universal distributions
  – The maximum likelihood code depends on the data
    • How to describe the data in an unambiguous manner?
  – Design a code such that for every possible observation, its codelength corresponds to its ML? Impossible.

Page 29:

Refined MDL

• Worst-case regret

• Optimal universal model
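The formulas for this slide are not in the transcript. For orientation, the definitions commonly used in the refined MDL literature are sketched below; the notation ($\bar{P}$ for the universal distribution, $\hat{\theta}(x^n)$ for the ML estimate, $\mathcal{R}$ for regret) is assumed here and may differ from what the slide showed.

```latex
% Regret of a distribution \bar{P} relative to a model M = {P(. | theta)} on data x^n:
\mathcal{R}(\bar{P}, x^n)
  = -\log \bar{P}(x^n) - \min_{\theta} \bigl[ -\log P(x^n \mid \theta) \bigr]
  = -\log \bar{P}(x^n) + \log P(x^n \mid \hat{\theta}(x^n))

% Worst-case regret: the largest excess codelength over all sequences of length n
\mathcal{R}_{\max}(\bar{P}) = \max_{x^n} \mathcal{R}(\bar{P}, x^n)

% Optimal universal model: the distribution with the smallest worst-case regret
\bar{P}^{*} = \arg\min_{\bar{P}} \; \max_{x^n} \mathcal{R}(\bar{P}, x^n)
```

The regret is the number of extra bits $\bar{P}$ needs compared with the best-fitting element of the model chosen in hindsight.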

Page 30:

Refined MDL

• Normalized maximum likelihood (NML)

• Minimizing $-\log$ NML
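For a concrete case, the NML distribution of the Bernoulli model can be computed exactly for small n. The sketch assumes the standard definition, in which the maximized likelihood $P(x^n \mid \hat{\theta}(x^n))$ is divided by its sum over all sequences of length n, and base-2 logarithms; the function names are illustrative and `math.comb` requires Python 3.8 or later.

```python
import math


def bernoulli_ml_prob(n1, n):
    """P(x^n | theta_hat) for a binary sequence with n1 ones among n symbols."""
    if n1 in (0, n):
        return 1.0  # theta_hat is 0 or 1, so the sequence has probability 1
    t = n1 / n
    return t ** n1 * (1 - t) ** (n - n1)


def bernoulli_complexity(n):
    """log2 of the normalizer: the sum of ML probabilities over all 2^n sequences."""
    total = sum(math.comb(n, n1) * bernoulli_ml_prob(n1, n) for n1 in range(n + 1))
    return math.log2(total)


def nml_codelength(x):
    """-log2 NML(x) = -log2 P(x | theta_hat(x)) + model complexity."""
    n, n1 = len(x), x.count("1")
    return -math.log2(bernoulli_ml_prob(n1, n)) + bernoulli_complexity(n)
```

The codelength splits into a goodness-of-fit term and a complexity term that depends only on the model and n, not on the particular data.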

Page 31:

Refined MDL

• Complexity of a model
  – The more sequences that can be fit well by an element of M, the larger M's complexity
  – Would it lead to a "right" balance between complexity and fit?
    • Hopefully…

Page 32:

Refined MDL

• General refined MDL

Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

Page 34:

Other topics

• Mixture code
• Resolvability
• …

Page 35:

References

• Barron, A.; Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.

• Grünwald, P.D.; Myung, I.J. & Pitt, M.A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.

• Hall, P. & Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.