If we measured a distribution P, what is the tree-dependent distribution Pt that best approximates P?


Page 1:

• If we measured a distribution P, what is the tree-dependent distribution Pt that best approximates P?
• Search space: all possible trees
• Goal: from all possible trees, find the one closest to P
• Distance measurement: the Kullback–Leibler cross-entropy measure
• Operators/procedure

Page 3:

Problem definition

• X1…Xn are random variables
• P is unknown
• Given independent samples x1,…,xs drawn from distribution P
• Estimate P

Solution 1 - independence

• Assume X1…Xn are independent
• P(x) = Π P(xi)

Solution 2 - trees

• P(x) = Π P(xi|xj)
• xj is the parent of xi in some tree (see the sketch below)
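A minimal sketch contrasting the tree-based factorization with the full joint, assuming binary variables and a hypothetical chain-shaped tree A → B → C (the names and numbers here are made up for illustration):

```python
# Tree-structured factorization over binary variables A, B, C,
# using the hypothetical tree A -> B -> C (A is the root):
#   P(a, b, c) = P(a) * P(b | a) * P(c | b)
p_a = {0: 0.4, 1: 0.6}                    # P(A)
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1,  # P(B=b | A=a), keyed by (b, a)
               (0, 1): 0.2, (1, 1): 0.8}
p_c_given_b = {(0, 0): 0.7, (1, 0): 0.3,  # P(C=c | B=b), keyed by (c, b)
               (0, 1): 0.5, (1, 1): 0.5}

def p_tree(a, b, c):
    """Tree-dependent distribution: every factor conditions on one parent."""
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

# Sanity check: the factorization defines a valid distribution.
print(sum(p_tree(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # ~1.0
```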

Page 4:

Kullback–Leibler cross-entropy measure

• For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined to be
  D_KL(P ‖ Q) = Σx P(x) log [ P(x) / Q(x) ]
• It can be seen from the definition of the Kullback–Leibler divergence that
  D_KL(P ‖ Q) = H(P, Q) − H(P)
• where H(P, Q) is called the cross entropy of P and Q, and H(P) is the entropy of P
• Nonnegative measure (by Gibbs' inequality); a small numeric sketch follows
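Since the slides' formula images were lost, here is a minimal numeric sketch of the definition above, with made-up example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)) for discrete P, Q.

    Terms with P(x) = 0 contribute 0 (the 0*log(0) = 0 convention)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Made-up distributions over three outcomes:
P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))  # > 0; it is 0 only when P == Q (Gibbs' inequality)
```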

Page 5:


Entropy is a measure of uncertainty

• Fair coin:
  – H(½, ½) = –½ log2(½) – ½ log2(½) = 1 bit
  – (i.e., we need 1 bit to convey the outcome of a coin flip)

• Biased coin:
  – H(1/100, 99/100) = –1/100 log2(1/100) – 99/100 log2(99/100) ≈ 0.08 bit

• As P(heads) → 1, the information of the actual outcome → 0:
  – H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source
  – (using the convention 0·log2(0) = 0; these numbers are reproduced in the sketch below)
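The coin numbers above can be checked with a small sketch (same 0·log2(0) = 0 convention):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) * log2(P(x)), skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # the 0*log2(0) = 0 convention
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0 bit  (fair coin)
print(entropy([0.01, 0.99]))  # ~0.08 bit (biased coin)
print(entropy([0.0, 1.0]))    # -0.0, i.e. 0 bits (no uncertainty)
```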

Page 6:

Optimization Task

• Init: fix the structure of some tree t
• Assign probabilities: what conditional probabilities Pt(x|y) would yield the best approximation of P?
• Procedure: vary the structure of t over all possible spanning trees
• Goal: among all trees with probabilities, which is the closest to P?

Page 7:

What Probabilities to assign?

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, then we get the best t-dependent approximation of P.

Page 8:

How to vary over all trees? How to move in the search space?

Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (X, Y) is defined by the mutual information measure I(X; Y).

Page 9:

Mutual information

• Measures how much knowing one of these variables reduces our uncertainty about the other
• In the extreme case (e.g., when X and Y are identical), the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y or X
• Mutual information is a measure of dependence
• Mutual information is nonnegative (i.e., I(X;Y) ≥ 0) and symmetric (i.e., I(X;Y) = I(Y;X)); see the sketch below
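A small sketch computing I(X;Y) from a joint probability table; the two test tables are made-up examples chosen to exercise the properties above:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) ).

    `joint` is a 2-D array with joint[x, y] = P(X=x, Y=y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2((joint / (px * py))[mask]))

# Independent variables carry zero mutual information:
print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))  # ~0.0
# Identical variables: I(X;Y) equals the entropy H(X):
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))          # 1.0 bit
```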

Page 10:

The algorithm

• Find the maximum spanning tree, with the weight of branch (Xi, Xj) given by the mutual information I(Xi; Xj)
• Compute Pt
  – Select an arbitrary root node and compute the conditional probabilities P(xi | xj) along the directed branches (a sketch of the full procedure follows)
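A minimal end-to-end sketch of the procedure, assuming binary variables and a dataset given as a 0/1 NumPy array (the function names are mine, not from the slides):

```python
import numpy as np
from itertools import combinations

def pairwise_mi(data, i, j):
    """Empirical mutual information (in bits) between binary columns i and j."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((data[:, i] == a) & (data[:, j] == b))
            px, py = np.mean(data[:, i] == a), np.mean(data[:, j] == b)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def chow_liu_tree(data, root=0):
    """Return the learned tree as a {child: parent} map (the root maps to None)."""
    n_vars = data.shape[1]
    # Candidate branches sorted by MI weight, heaviest first.
    edges = sorted(combinations(range(n_vars), 2),
                   key=lambda e: pairwise_mi(data, *e), reverse=True)
    # Kruskal's algorithm: greedily keep every edge that joins two components,
    # which yields a maximum-weight spanning tree.
    comp = list(range(n_vars))
    adj = {v: [] for v in range(n_vars)}
    for i, j in edges:
        if comp[i] != comp[j]:
            adj[i].append(j)
            adj[j].append(i)
            old, new = comp[j], comp[i]
            comp = [new if c == old else c for c in comp]
    # Orient the undirected tree away from an arbitrary root.
    parent, stack = {root: None}, [root]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                stack.append(u)
    return parent

# Made-up data: column 1 noisily copies column 0, column 3 noisily copies column 2.
rng = np.random.default_rng(0)
a, c = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
data = np.column_stack([a, a ^ (rng.random(1000) < 0.1).astype(int),
                        c, c ^ (rng.random(1000) < 0.1).astype(int)])
print(chow_liu_tree(data))  # expect branches 0-1 and 2-3 in the learned tree
```

Per Theorem 1, the conditional probabilities Pt(xi | xj) along each branch are then just the empirical conditionals computed from the data.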

Page 11:

Illustration of CL-Tree Learning

Pairwise statistics for four variables A, B, C, D:

  Pair   Mutual information   Pairwise joint distribution (four entries, rounded)
  A,B    0.3126               (0.56, 0.11, 0.02, 0.31)
  A,C    0.0229               (0.51, 0.17, 0.17, 0.15)
  A,D    0.0172               (0.53, 0.15, 0.19, 0.13)
  B,C    0.0230               (0.44, 0.14, 0.23, 0.19)
  B,D    0.0183               (0.46, 0.12, 0.26, 0.16)
  C,D    0.2603               (0.64, 0.04, 0.08, 0.24)

[Figure: the complete graph over nodes A, B, C, D with these edge weights, and the resulting maximum-weight spanning tree, which keeps the branches A–B (0.3126), C–D (0.2603), and B–C (0.0230).]
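As a check, a tiny sketch that runs the greedy maximum-spanning-tree selection on exactly the weights from the table above:

```python
# Mutual-information edge weights from the illustration above.
weights = {("A", "B"): 0.3126, ("A", "C"): 0.0229, ("A", "D"): 0.0172,
           ("B", "C"): 0.0230, ("B", "D"): 0.0183, ("C", "D"): 0.2603}

comp = {v: v for v in "ABCD"}  # which component each node belongs to
tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    if comp[u] != comp[v]:     # keep the heaviest edge joining two components
        tree.append((u, v))
        old, new = comp[v], comp[u]
        comp = {x: new if c == old else c for x, c in comp.items()}
print(tree)  # [('A', 'B'), ('C', 'D'), ('B', 'C')]
```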

Page 12:

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, then we get the best t-dependent approximation of P.

Page 13:

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, then we get the best t-dependent approximation of P.

By Gibbs' inequality, the expression Σx P(x) log P′(x) is maximized when P′(x) = P(x), so the whole expression is maximal when P′(xi|xj) = P(xi|xj).

Q.E.D.

Page 14:

Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, then we get the best t-dependent approximation of P.

Gibbs' inequality: for probability distributions P and Q over the same domain,
  Σx P(x) log P(x) ≥ Σx P(x) log Q(x),
with equality if and only if P(x) = Q(x) for all x.


Page 16:

Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (X, Y) is defined by the mutual information measure:
  I(X; Y) = Σx,y P(x, y) log [ P(x, y) / (P(x) P(y)) ]

From Theorem 1, we get Pt(x) = Π P(xi|xj), so D(P, Pt) = −Σx P(x) log Pt(x) − H(P).

After this assignment, and applying Bayes' rule to each conditional, minimizing D(P, Pt) amounts to maximizing the sum of branch weights Σ I(Xi; Xj).
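My reconstruction of the slide's lost formulas, written out in full (here p(i) denotes the parent of node i, with the root's factor read as P(xi)):

```latex
\begin{align*}
D(P, P^t) &= \sum_{x} P(x)\,\log\frac{P(x)}{P^t(x)}
           = -\sum_{x} P(x)\,\log P^t(x) - H(X_1,\dots,X_n) \\
          &= -\sum_{i=1}^{n} \sum_{x_i,\,x_{p(i)}} P(x_i, x_{p(i)})\,
               \log P(x_i \mid x_{p(i)}) - H(X_1,\dots,X_n) \\
          &= -\sum_{i=1}^{n} I(X_i; X_{p(i)})
             + \sum_{i=1}^{n} H(X_i) - H(X_1,\dots,X_n)
\end{align*}
```

The first term is the negated sum of branch weights; the next slide argues that the remaining terms do not depend on t.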


Page 18:

Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on the branch (X, Y) is defined by the mutual information measure:
  I(X; Y) = Σx,y P(x, y) log [ P(x, y) / (P(x) P(y)) ]

• The second and third terms are independent of t
• D(P, Pt) is nonnegative (Gibbs' inequality)

Thus, minimizing the distance D(P, Pt) is equivalent to maximizing the sum of branch weights.

Q.E.D.

Page 19:


Chow-Liu (CL) Results

• If the distribution P is tree-structured, CL finds the CORRECT one
• If the distribution P is NOT tree-structured, CL finds the tree-structured Q that has minimal KL divergence: argminQ KL(P; Q)
• Even though there are 2^Θ(n log n) possible trees, CL finds the BEST one in polynomial time, O(n² [m + log n])

Page 20:

Chow-Liu Trees - Summary

• Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]
• Learning the structure and the probabilities
  – Compute individual and pairwise marginal distributions for all pairs of variables
  – Compute the mutual information (MI) for each pair of variables
  – Build a maximum spanning tree for the complete graph with variables as nodes and MIs as edge weights
• Properties
  – Efficient: O(#samples × (#variables)² × (#values per variable)²)
  – Optimal

where the mutual information weight is
  MI(X, Y) = ΣX,Y P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ]

Page 21:

• S. Kullback (1959). Information Theory and Statistics. John Wiley and Sons, NY.
• Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467.