


Learning a Small Mixture of Trees
M. Pawan Kumar (http://ai.stanford.edu/~pawan)    Daphne Koller (http://ai.stanford.edu/~koller)

Aim: To efficiently learn a small mixture of trees that approximates an observed distribution

Mixture of Trees

Variables V = {v_1, v_2, …, v_n}. Label x_a ∈ X_a for variable v_a. A labeling x assigns a label to every variable.

[Figure: three trees t_1, t_2, t_3 over variables v_1, v_2, v_3; a hidden variable z selects which tree generates the labeling.]

Pr(x | m) = ∑_{t ∈ T} ρ_t Pr(x | t), where ρ_t = Pr(z = t) are the mixture weights

Pr(x | t) = ∏_{(a,b) ∈ t} θ^t_ab(x_a, x_b) / ∏_a θ^t_a(x_a)^(d_a − 1)

θ^t_ab(x_a, x_b): pairwise potentials
θ^t_a(x_a): unary potentials
d_a: degree of v_a
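To make the factorization above concrete, here is a minimal Python sketch (not from the poster; the three-variable tree and its potentials are illustrative) that evaluates Pr(x | t) from pairwise and unary potentials and combines trees into Pr(x | m):

import itertools

# Toy tree over v_0, v_1, v_2 with edges (0,1) and (0,2); binary labels (illustrative).
# theta_ab are pairwise potentials (edge marginals), theta_a unary potentials, degree[a] = d_a.
edges = [(0, 1), (0, 2)]
theta_ab = {(0, 1): {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3},
            (0, 2): {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.2}}
theta_a = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}, 2: {0: 0.5, 1: 0.5}}
degree = {0: 2, 1: 1, 2: 1}

def tree_prob(x, edges, theta_ab, theta_a, degree):
    # Pr(x | t) = prod_{(a,b) in t} theta_ab(x_a, x_b) / prod_a theta_a(x_a)^(d_a - 1)
    num = 1.0
    for (a, b) in edges:
        num *= theta_ab[(a, b)][(x[a], x[b])]
    den = 1.0
    for a, d in degree.items():
        den *= theta_a[a][x[a]] ** (d - 1)
    return num / den

def mixture_prob(x, trees, rho):
    # Pr(x | m) = sum_t rho_t Pr(x | t)
    return sum(r * tree_prob(x, *t) for r, t in zip(rho, trees))

trees = [(edges, theta_ab, theta_a, degree)]   # a "mixture" with a single tree
rho = [1.0]
total = sum(mixture_prob(x, trees, rho) for x in itertools.product([0, 1], repeat=3))
print(total)   # ~1.0 because the toy potentials are consistent edge/node marginals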

Minimizing the KL Divergence: Meila and Jordan, 2000 (MJ00)

KL(θ_1 || θ_2) = ∑_x Pr(x | θ_1) log [ Pr(x | θ_1) / Pr(x | θ_2) ]

θ_1: observed distribution
θ_2: simpler distribution

EM Algorithm (relies heavily on initialization)
E-step: estimate Pr(x | t) for each x and t
M-step: obtain structure and potentials (Chow-Liu)
Focuses on the dominant mode
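The Chow-Liu step fits the maximum-weight spanning tree under empirical pairwise mutual information (the potentials are then the corresponding edge and node marginals). A minimal sketch of the structure part of that step, assuming discrete samples in a NumPy array; the data and names are illustrative:

import numpy as np
from itertools import combinations

def mutual_information(col_a, col_b):
    # Empirical mutual information between two discrete columns.
    mi = 0.0
    for a in np.unique(col_a):
        for b in np.unique(col_b):
            p_ab = np.mean((col_a == a) & (col_b == b))
            p_a, p_b = np.mean(col_a == a), np.mean(col_b == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(data):
    # Maximum-weight spanning tree under mutual information (Kruskal with union-find).
    n_vars = data.shape[1]
    weights = sorted(((mutual_information(data[:, a], data[:, b]), a, b)
                      for a, b in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for w, a, b in weights:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            edges.append((a, b))
    return edges

# Toy binary samples (rows = samples, columns = variables); edge (0, 1) is picked
# first because those two columns are identical.
data = np.array([[0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
print(chow_liu_edges(data))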

Rosset and Segal, 2002 (RS02)

m* = arg min_m max_i log [ p(x_i) / Pr(x_i | m) ] = arg max_m min_i [ Pr(x_i | m) / p(x_i) ]

Results

Standard UCI datasets:

Dataset    MJ00          RS02          Our
Agaricus   99.98 (.04)   100 (0)       100 (0)
Nursery    99.2 (.02)    98.35 (0.3)   99.28 (.13)
Splice     95.5 (0.3)    95.6 (.42)    96.1 (.15)

(MJ00 uses twice as many trees.)

Learning Pictorial Structures
11 characters in an episode of "Buffy"
24,244 faces (first 80% train, last 20% test)
13 facial features (variables) + positions (labels)
Unary: logistic regression, Pairwise: m

Bag of visual words: 65.68%

RS02     Our
66.05    66.05
66.01    66.65
66.01    66.86
66.08    67.25
66.08    67.48
66.16    67.50
66.20    67.68

Minimizing the α-Divergence: Renyi, 1961; Minka, 2005

D_α(θ_1 || θ_2) = [ 1 / (α − 1) ] log ∑_x Pr(x | θ_1)^α Pr(x | θ_2)^(1−α)

D_1(θ_1 || θ_2) = KL(θ_1 || θ_2), so the α-divergence generalizes the KL divergence.

Fitting q to p: the larger α is, the more inclusive the fit (Minka, 2005).

[Figure: fits of q to p for α = 0.5, α = 1, α = ∞]

Use α = ∞: minimizing D_∞ over the samples amounts to minimizing max_i log [ p(x_i) / Pr(x_i | m) ], i.e. the RS02 objective above.
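A small numeric check of the α-divergence above (Renyi form), on illustrative toy distributions: α near 1 recovers the KL divergence, and large α approaches the maximum log-ratio used in the problem formulation below.

import numpy as np

def alpha_divergence(p, q, alpha):
    # D_alpha(p || q) = 1/(alpha - 1) * log sum_x p(x)^alpha q(x)^(1 - alpha)
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])   # "observed" distribution (toy)
q = np.array([0.4, 0.4, 0.2])   # simpler approximating distribution (toy)

print(alpha_divergence(p, q, 1.0001), kl_divergence(p, q))   # nearly equal: D_1 = KL
print(alpha_divergence(p, q, 50.0), np.max(np.log(p / q)))   # large alpha -> max log-ratio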

Problem Formulation

Given a distribution p(.), find a mixture of trees by minimizing the α-divergence (with α = ∞).

Fractional Covering: Plotkin et al., 1995

Choose from the set T = {t_j} of all possible trees defined over the n random variables.
Matrix A where A(i, j) = Pr(x_i | t_j)
Vector b where b(i) = p(x_i)
Vector ρ ≥ 0 such that ∑_j ρ_j = 1   (the polytope P)

max λ   s.t.   a_i ≥ λ b_i for all i,   where a = Aρ and ρ ∈ P

Constraints are defined over an infinite number of variables.

Exponential potential:   min ∑_i exp(−α a_i / b_i)   s.t.   ρ ∈ P

Parameter α ∝ log(m)
Width w = max_{ρ ∈ P} max_i a_i / b_i

Initial solution ρ_0
Define λ_0 = min_i a_i^0 / b_i
Define σ = ε / (4w)

Finding an ε-optimal solution:
While λ < 2λ_0, iterate:
  Define y_i = exp(−α a_i / b_i) / b_i
  Find ρ' = argmax_{ρ ∈ P} y^T A ρ
  Update ρ = (1 − σ) ρ + σ ρ'
Each iteration minimizes a first-order approximation of the potential.
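A minimal sketch of the iterate above on a toy instance where the candidate trees are enumerated explicitly, so the argmax over P reduces to picking one column of A; the value of α, the fixed iteration count, and the data are simplifications for illustration, not the poster's settings.

import numpy as np

# Toy instance: A(i, j) = Pr(x_i | t_j) for m samples and a small explicit set of trees,
# b(i) = p(x_i). In the real problem the trees cannot be enumerated like this.
A = np.array([[0.50, 0.10, 0.30],
              [0.10, 0.50, 0.30],
              [0.40, 0.40, 0.40]])
b = np.array([0.4, 0.4, 0.2])

m, n_trees = A.shape
rho = np.full(n_trees, 1.0 / n_trees)    # initial solution rho_0
w = np.max(A / b[:, None])               # width: max over P of max_i a_i/b_i (attained at a vertex)
alpha = 4.0 * np.log(m)                  # parameter alpha ~ log(m)
sigma = 1.0 / (4.0 * w)                  # step size sigma = eps/(4w) with eps = 1

for _ in range(200):                     # fixed iteration budget instead of the lambda < 2*lambda_0 test
    a = A @ rho                          # a_i = (A rho)_i
    y = np.exp(-alpha * a / b) / b       # y_i = exp(-alpha a_i/b_i) / b_i
    j = np.argmax(y @ A)                 # rho' = simplex vertex maximizing y^T A rho
    rho_prime = np.zeros(n_trees)
    rho_prime[j] = 1.0
    rho = (1.0 - sigma) * rho + sigma * rho_prime

lam = np.min(A @ rho / b)                # covering value lambda = min_i a_i/b_i
print(rho, lam)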

Drawbacks:
(1) Slow convergence
(2) Singleton trees (probability = 0 for unseen test examples)

Overview
- An intuitive objective function for learning a mixture of trees
- Formulate the problem using fractional covering
- Identify the drawbacks of fractional covering
- Make suitable modifications to the algorithm

Modifying Fractional Covering

(1) Start with α = 1/w. Increase α by a factor of 2 if necessary.
    Allows a large step size and keeps the y_i large enough for numerical stability.

(2) Minimize using a convex relaxation:

    min ∑_i exp( −α Pr(x_i | t) / p(x_i) )   s.t.   Pr(x_i | t) ≥ 0,   ∑_i Pr(x_i | t) ≤ 1

    The constraint t ∈ T (that Pr(. | t) be a tree distribution) is dropped.

Initialize tolerance ε, parameter μ, factor f.
Solve for the distribution Pr(. | t):

    min   f ∑_i exp( −α Pr(x_i | t) / p(x_i) )  −  ∑_i log Pr(x_i | t)  −  log( 1 − ∑_i Pr(x_i | t) )

Update f = μ f until m / f ≤ ε.

Log-barrier approach; use Newton's method.
To minimize g(z), update z = z − ( ∇²g(z) )⁻¹ ∇g(z)   (Hessian and gradient).
The Hessian has uniform off-diagonal elements, so the matrix inversion takes only linear time.
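The linear-time inversion follows from that structure: assuming the Hessian takes the form D + c·11^T (a diagonal from the separable terms plus a uniform off-diagonal contribution from the barrier on ∑_i Pr(x_i | t)), the Sherman-Morrison identity solves the Newton system in O(n). A sketch under that assumption:

import numpy as np

def newton_direction_uniform_offdiag(diag, c, grad):
    # Solve (D + c * 1 1^T) dz = grad in O(n) via Sherman-Morrison,
    # where D = diag(diag) and every off-diagonal Hessian entry equals c.
    Dinv_g = grad / diag
    Dinv_1 = 1.0 / diag
    correction = c * np.sum(Dinv_g) / (1.0 + c * np.sum(Dinv_1))
    return Dinv_g - Dinv_1 * correction

# Check against a dense solve on a random instance.
rng = np.random.default_rng(0)
n = 5
diag = rng.uniform(1.0, 2.0, n)
c = 0.3
grad = rng.normal(size=n)
H = np.diag(diag) + c * np.ones((n, n))          # uniform off-diagonal entries c
dz_fast = newton_direction_uniform_offdiag(diag, c, grad)
print(np.allclose(dz_fast, np.linalg.solve(H, grad)))   # True

# Newton update for minimizing g(z): z = z - H^{-1} grad g(z)
z = rng.normal(size=n)
z_new = z - dz_fast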

Project back to a tree distribution using Chow-Liu. This projection may result in an increase in the objective.

Discard the best-explained sample and recompute t:
Enforce Pr(x_i' | t) = 0, where i' = argmax_i Pr(x_i | t) / p(x_i), and redistribute its mass:
Pr(x_i | t) = Pr(x_i | t) + s_i Pr(x_i' | t),   with   s_i = p(x_i | t) / ∑_k p(x_k | t)

Computationally expensive operation? Use the previous solution; only one log-barrier optimization is required.
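A small sketch of the discard step: zero out the best-explained sample and spread its mass over the remaining samples in proportion to their current probabilities (one reading of the s_i rule above; the arrays are illustrative).

import numpy as np

def discard_and_redistribute(pr_t, p):
    # Zero out the best-explained sample i' = argmax_i Pr(x_i|t)/p(x_i)
    # and spread its mass over the remaining samples.
    pr_t = pr_t.copy()
    i_star = np.argmax(pr_t / p)
    mass = pr_t[i_star]
    pr_t[i_star] = 0.0
    s = pr_t / pr_t.sum()            # s_i proportional to the remaining Pr(x_i | t)
    return pr_t + s * mass, i_star

pr_t = np.array([0.5, 0.2, 0.2, 0.1])    # current (relaxed) Pr(x_i | t)
p = np.array([0.25, 0.25, 0.25, 0.25])   # observed p(x_i)
new_pr, dropped = discard_and_redistribute(pr_t, p)
print(dropped, new_pr, new_pr.sum())     # sample 0 dropped; total mass preserved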

Convergence Properties
Maximum number of increases for α: O(log(log(m)))
Maximum number of discarded samples: m − 1
Polynomial time per iteration; polynomial-time convergence of the overall algorithm.

Future Work
Mixtures in log-probability space?
Connections to Discrete AdaBoost?
