


Learning a Small Mixture of Trees
M. Pawan Kumar (http://ai.stanford.edu/~pawan)    Daphne Koller (http://ai.stanford.edu/~koller)

Aim: To efficiently learn a small mixture of trees that approximates an observed distribution

Mixture of Trees

Variables V = {v_1, v_2, …, v_n}. Label x_a ∈ X_a for variable v_a. A labeling x assigns a label to every variable.

[Figure: three trees t_1, t_2, t_3 over variables v_1, v_2, v_3; a hidden variable z selects which tree generates the labeling.]

Pr(x | m) = ∑_{t ∈ T} ρ_t Pr(x | t), where ρ_t = Pr(z = t) are the mixture weights

Pr(x | t) = ∏_{(a,b) ∈ t} θ^t_ab(x_a, x_b) / ∏_a θ^t_a(x_a)^(d_a − 1)

θ^t_ab(x_a, x_b): pairwise potentials
θ^t_a(x_a): unary potentials
d_a: degree of v_a
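To make the factorization above concrete, here is a minimal Python sketch (not from the poster; the three-variable tree and its potentials are illustrative) that evaluates Pr(x | t) from pairwise and unary potentials and combines trees into Pr(x | m):

import itertools

# Toy tree over v_0, v_1, v_2 with edges (0,1) and (0,2); binary labels (illustrative).
# theta_ab are pairwise potentials (edge marginals), theta_a unary potentials, degree[a] = d_a.
edges = [(0, 1), (0, 2)]
theta_ab = {(0, 1): {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3},
            (0, 2): {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.2}}
theta_a = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}, 2: {0: 0.5, 1: 0.5}}
degree = {0: 2, 1: 1, 2: 1}

def tree_prob(x, edges, theta_ab, theta_a, degree):
    # Pr(x | t) = prod_{(a,b) in t} theta_ab(x_a, x_b) / prod_a theta_a(x_a)^(d_a - 1)
    num = 1.0
    for (a, b) in edges:
        num *= theta_ab[(a, b)][(x[a], x[b])]
    den = 1.0
    for a, d in degree.items():
        den *= theta_a[a][x[a]] ** (d - 1)
    return num / den

def mixture_prob(x, trees, rho):
    # Pr(x | m) = sum_t rho_t Pr(x | t)
    return sum(r * tree_prob(x, *t) for r, t in zip(rho, trees))

trees = [(edges, theta_ab, theta_a, degree)]   # a "mixture" with a single tree
rho = [1.0]
total = sum(mixture_prob(x, trees, rho) for x in itertools.product([0, 1], repeat=3))
print(total)   # ~1.0 because the toy potentials are consistent edge/node marginals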

Minimizing the KL Divergence: Meila and Jordan, 2000 (MJ00)

KL(θ_1 || θ_2) = ∑_x Pr(x | θ_1) log [ Pr(x | θ_1) / Pr(x | θ_2) ]

θ_1: observed distribution
θ_2: simpler distribution

EM Algorithm (relies heavily on initialization)
E-step: estimate Pr(x | t) for each x and t
M-step: obtain structure and potentials (Chow-Liu)
Focuses on the dominant mode
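The Chow-Liu step fits the maximum-weight spanning tree under empirical pairwise mutual information (the potentials are then the corresponding edge and node marginals). A minimal sketch of the structure part of that step, assuming discrete samples in a NumPy array; the data and names are illustrative:

import numpy as np
from itertools import combinations

def mutual_information(col_a, col_b):
    # Empirical mutual information between two discrete columns.
    mi = 0.0
    for a in np.unique(col_a):
        for b in np.unique(col_b):
            p_ab = np.mean((col_a == a) & (col_b == b))
            p_a, p_b = np.mean(col_a == a), np.mean(col_b == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(data):
    # Maximum-weight spanning tree under mutual information (Kruskal with union-find).
    n_vars = data.shape[1]
    weights = sorted(((mutual_information(data[:, a], data[:, b]), a, b)
                      for a, b in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for w, a, b in weights:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            edges.append((a, b))
    return edges

# Toy binary samples (rows = samples, columns = variables); edge (0, 1) is picked
# first because those two columns are identical.
data = np.array([[0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
print(chow_liu_edges(data))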

Rosset and Segal, 2002 (RS02)

m* = arg min_m max_i log [ p(x_i) / Pr(x_i | m) ] = arg max_m min_i [ Pr(x_i | m) / p(x_i) ]

Results

Standard UCI datasets:

Dataset    MJ00          RS02          Our
Agaricus   99.98 (.04)   100 (0)       100 (0)
Nursery    99.2 (.02)    98.35 (0.3)   99.28 (.13)
Splice     95.5 (0.3)    95.6 (.42)    96.1 (.15)

(MJ00 uses twice as many trees.)

Learning Pictorial Structures
11 characters in an episode of "Buffy"
24,244 faces (first 80% train, last 20% test)
13 facial features (variables) + positions (labels)
Unary: logistic regression, Pairwise: m

Bag of visual words: 65.68%

RS02     Our
66.05    66.05
66.01    66.65
66.01    66.86
66.08    67.25
66.08    67.48
66.16    67.50
66.20    67.68

Minimizing the α-Divergence: Renyi, 1961; Minka, 2005

D_α(θ_1 || θ_2) = [ 1 / (α − 1) ] log ∑_x Pr(x | θ_1)^α Pr(x | θ_2)^(1−α)

D_1(θ_1 || θ_2) = KL(θ_1 || θ_2), so the α-divergence generalizes the KL divergence.

Fitting q to p: the larger α is, the more inclusive the fit (Minka, 2005).

[Figure: fits of q to p for α = 0.5, α = 1, α = ∞]

Use α = ∞: minimizing D_∞ over the samples amounts to minimizing max_i log [ p(x_i) / Pr(x_i | m) ], i.e. the RS02 objective above.
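A small numeric check of the α-divergence above (Renyi form), on illustrative toy distributions: α near 1 recovers the KL divergence, and large α approaches the maximum log-ratio used in the problem formulation below.

import numpy as np

def alpha_divergence(p, q, alpha):
    # D_alpha(p || q) = 1/(alpha - 1) * log sum_x p(x)^alpha q(x)^(1 - alpha)
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])   # "observed" distribution (toy)
q = np.array([0.4, 0.4, 0.2])   # simpler approximating distribution (toy)

print(alpha_divergence(p, q, 1.0001), kl_divergence(p, q))   # nearly equal: D_1 = KL
print(alpha_divergence(p, q, 50.0), np.max(np.log(p / q)))   # large alpha -> max log-ratio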

Problem Formulation

Given a distribution p(.), find a mixture of trees by minimizing the α-divergence (with α = ∞).

Fractional Covering: Plotkin et al., 1995

Choose from the set T = {t_j} of all possible trees defined over the n random variables.
Matrix A where A(i, j) = Pr(x_i | t_j)
Vector b where b(i) = p(x_i)
Vector ρ ≥ 0 such that ∑_j ρ_j = 1   (the polytope P)

max λ   s.t.   a_i ≥ λ b_i for all i,   where a = Aρ and ρ ∈ P

Constraints are defined over an infinite number of variables.

Exponential potential:   min ∑_i exp(−α a_i / b_i)   s.t.   ρ ∈ P

Parameter α ∝ log(m)
Width w = max_{ρ ∈ P} max_i a_i / b_i

Initial solution ρ_0
Define λ_0 = min_i a_i^0 / b_i
Define σ = ε / (4w)

Finding an ε-optimal solution:
While λ < 2λ_0, iterate:
  Define y_i = exp(−α a_i / b_i) / b_i
  Find ρ' = argmax_{ρ ∈ P} y^T A ρ
  Update ρ = (1 − σ) ρ + σ ρ'
Each iteration minimizes a first-order approximation of the potential.
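A minimal sketch of the iterate above on a toy instance where the candidate trees are enumerated explicitly, so the argmax over P reduces to picking one column of A; the value of α, the fixed iteration count, and the data are simplifications for illustration, not the poster's settings.

import numpy as np

# Toy instance: A(i, j) = Pr(x_i | t_j) for m samples and a small explicit set of trees,
# b(i) = p(x_i). In the real problem the trees cannot be enumerated like this.
A = np.array([[0.50, 0.10, 0.30],
              [0.10, 0.50, 0.30],
              [0.40, 0.40, 0.40]])
b = np.array([0.4, 0.4, 0.2])

m, n_trees = A.shape
rho = np.full(n_trees, 1.0 / n_trees)    # initial solution rho_0
w = np.max(A / b[:, None])               # width: max over P of max_i a_i/b_i (attained at a vertex)
alpha = 4.0 * np.log(m)                  # parameter alpha ~ log(m)
sigma = 1.0 / (4.0 * w)                  # step size sigma = eps/(4w) with eps = 1

for _ in range(200):                     # fixed iteration budget instead of the lambda < 2*lambda_0 test
    a = A @ rho                          # a_i = (A rho)_i
    y = np.exp(-alpha * a / b) / b       # y_i = exp(-alpha a_i/b_i) / b_i
    j = np.argmax(y @ A)                 # rho' = simplex vertex maximizing y^T A rho
    rho_prime = np.zeros(n_trees)
    rho_prime[j] = 1.0
    rho = (1.0 - sigma) * rho + sigma * rho_prime

lam = np.min(A @ rho / b)                # covering value lambda = min_i a_i/b_i
print(rho, lam)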

Drawbacks:
(1) Slow convergence
(2) Singleton trees (probability = 0 for unseen test examples)

Overview
- An intuitive objective function for learning a mixture of trees
- Formulate the problem using fractional covering
- Identify the drawbacks of fractional covering
- Make suitable modifications to the algorithm

Modifying Fractional Covering

(1) Start with α = 1/w. Increase α by a factor of 2 if necessary.
    Allows a large step size and keeps the y_i large enough for numerical stability.

(2) Minimize using a convex relaxation:

    min ∑_i exp( −α Pr(x_i | t) / p(x_i) )   s.t.   Pr(x_i | t) ≥ 0,   ∑_i Pr(x_i | t) ≤ 1

    The constraint t ∈ T (that Pr(. | t) be a tree distribution) is dropped.

Initialize tolerance ε, parameter μ, factor f.
Solve for the distribution Pr(. | t):

    min   f ∑_i exp( −α Pr(x_i | t) / p(x_i) )  −  ∑_i log Pr(x_i | t)  −  log( 1 − ∑_i Pr(x_i | t) )

Update f = μ f until m / f ≤ ε.

Log-barrier approach; use Newton's method.
To minimize g(z), update z = z − ( ∇²g(z) )⁻¹ ∇g(z)   (Hessian and gradient).
The Hessian has uniform off-diagonal elements, so the matrix inversion takes only linear time.
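The linear-time inversion follows from that structure: assuming the Hessian takes the form D + c·11^T (a diagonal from the separable terms plus a uniform off-diagonal contribution from the barrier on ∑_i Pr(x_i | t)), the Sherman-Morrison identity solves the Newton system in O(n). A sketch under that assumption:

import numpy as np

def newton_direction_uniform_offdiag(diag, c, grad):
    # Solve (D + c * 1 1^T) dz = grad in O(n) via Sherman-Morrison,
    # where D = diag(diag) and every off-diagonal Hessian entry equals c.
    Dinv_g = grad / diag
    Dinv_1 = 1.0 / diag
    correction = c * np.sum(Dinv_g) / (1.0 + c * np.sum(Dinv_1))
    return Dinv_g - Dinv_1 * correction

# Check against a dense solve on a random instance.
rng = np.random.default_rng(0)
n = 5
diag = rng.uniform(1.0, 2.0, n)
c = 0.3
grad = rng.normal(size=n)
H = np.diag(diag) + c * np.ones((n, n))          # uniform off-diagonal entries c
dz_fast = newton_direction_uniform_offdiag(diag, c, grad)
print(np.allclose(dz_fast, np.linalg.solve(H, grad)))   # True

# Newton update for minimizing g(z): z = z - H^{-1} grad g(z)
z = rng.normal(size=n)
z_new = z - dz_fast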

Project back to a tree distribution using Chow-Liu. This projection may result in an increase in the objective.

Discard the best-explained sample and recompute t:
Enforce Pr(x_i' | t) = 0, where i' = argmax_i Pr(x_i | t) / p(x_i), and redistribute its mass:
Pr(x_i | t) = Pr(x_i | t) + s_i Pr(x_i' | t),   with   s_i = p(x_i | t) / ∑_k p(x_k | t)

Computationally expensive operation? Use the previous solution; only one log-barrier optimization is required.
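A small sketch of the discard step: zero out the best-explained sample and spread its mass over the remaining samples in proportion to their current probabilities (one reading of the s_i rule above; the arrays are illustrative).

import numpy as np

def discard_and_redistribute(pr_t, p):
    # Zero out the best-explained sample i' = argmax_i Pr(x_i|t)/p(x_i)
    # and spread its mass over the remaining samples.
    pr_t = pr_t.copy()
    i_star = np.argmax(pr_t / p)
    mass = pr_t[i_star]
    pr_t[i_star] = 0.0
    s = pr_t / pr_t.sum()            # s_i proportional to the remaining Pr(x_i | t)
    return pr_t + s * mass, i_star

pr_t = np.array([0.5, 0.2, 0.2, 0.1])    # current (relaxed) Pr(x_i | t)
p = np.array([0.25, 0.25, 0.25, 0.25])   # observed p(x_i)
new_pr, dropped = discard_and_redistribute(pr_t, p)
print(dropped, new_pr, new_pr.sum())     # sample 0 dropped; total mass preserved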

Convergence Properties
Maximum number of increases for α: O(log(log(m)))
Maximum number of discarded samples: m − 1
Polynomial time per iteration; polynomial-time convergence of the overall algorithm.

Future Work
Mixtures in log-probability space?
Connections to Discrete AdaBoost?
