
Efficient Bounds for the Softmax Function Applications to Inference in Hybrid Models

Guillaume Bouchard

Xerox Research Centre Europe

December 7, 2007

Deterministic Inference in Hybrid Graphical Models

Discrete variables with continuous* parents:
- No sufficient statistic
- No conjugate distribution
- Inference is intractable, so approximate deterministic inference is needed

Approaches:
- Local sampling
- Deterministic approximations: Gaussian quadrature, the delta method, the Laplace approximation, or maximizing a lower bound on the variational free energy

[Figure: a hybrid graphical model over variables X0 to X5 and Y1 to Y3; the legend distinguishes discrete vs. continuous and observed vs. hidden variables]

*or a large number of discrete parents


Variational inference

Focus on Bayesian multinomial logistic regression

Mean-field approximation: Q belongs to an approximation family

[Figure: plate model with covariates X1i, X2i, weight vectors β1, β2, and label Yi, repeated over data points i; the legend distinguishes discrete vs. continuous and observed vs. hidden variables]

Maximizing the variational lower bound requires an upper bound on the log-partition term of the softmax; whether a good upper bound exists is the question addressed next.
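For reference, the mean-field setup the slide names is the standard variational identity (not specific to this talk): the marginal likelihood is lower-bounded by the variational free energy over a factorized family Q.

```latex
\log p(Y) \;\ge\; \mathbb{E}_{Q}\!\left[\log \frac{p(Y, X)}{Q(X)}\right],
\qquad Q(X) = \prod_{j} Q_j(X_j).
```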


Bounding the log-partition function (1)

Binary case (one dimension): the classical quadratic bound [Jordan and Jaakkola]

We propose its multiclass extension
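The equations themselves did not survive the transcript. For reference, my reconstruction of the classical binary bound and of the multiclass extension this talk proposes (with variational parameters ξ and α, following Bouchard's published form):

```latex
% Binary (Jaakkola--Jordan) bound, valid for any \xi > 0:
\log(1 + e^{x}) \;\le\; \lambda(\xi)\,(x^{2} - \xi^{2}) + \frac{x - \xi}{2} + \log(1 + e^{\xi}),
\qquad \lambda(\xi) = \frac{1}{4\xi}\tanh\frac{\xi}{2}.

% Multiclass extension, valid for any \alpha \in \mathbb{R} and \xi_k > 0:
\log \sum_{k=1}^{K} e^{x_k} \;\le\; \alpha + \sum_{k=1}^{K}
\left[ \frac{x_k - \alpha - \xi_k}{2}
 + \lambda(\xi_k)\big((x_k - \alpha)^2 - \xi_k^2\big)
 + \log(1 + e^{\xi_k}) \right].
```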


Bounding the log-partition function (2)

[Figure: the bound on log Σ_k e^{x_k} as a function of x, for K = 2 (top panel) and K = 10 (bottom panel); curves compare the worst-curvature bound, the optimal tight bound, and the optimal average bound (parameter = 2, 1, and 0.1)]
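A quick numeric sanity check that the proposed quadratic bound really upper-bounds the log-partition function. This is a minimal sketch assuming the multiclass quadratic form with variational parameters α and ξ_k that the talk proposes; the function names (`lam`, `lse`, `bouchard_bound`) are mine, not from the slides.

```python
import numpy as np

def lam(xi):
    # Jaakkola-Jordan coefficient: lambda(xi) = tanh(xi/2) / (4*xi)
    return np.tanh(xi / 2.0) / (4.0 * xi)

def lse(x):
    # Log-partition (log-sum-exp) of a score vector x.
    return np.logaddexp.reduce(x)

def bouchard_bound(x, alpha, xi):
    # Quadratic upper bound on lse(x), with a scalar variational
    # parameter alpha and one positive parameter xi_k per class.
    t = x - alpha
    terms = lam(xi) * (t ** 2 - xi ** 2) + (t - xi) / 2.0 + np.logaddexp(0.0, xi)
    return alpha + terms.sum()

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=5) * 3.0       # random score vector, K = 5
    alpha = x.mean()                   # a reasonable (not optimal) alpha
    xi = np.abs(x - alpha) + 0.5       # any positive xi gives a valid bound
    assert bouchard_bound(x, alpha, xi) >= lse(x)
```

The check holds for arbitrary (not just optimized) variational parameters, which is the point of a bound: optimizing α and ξ only tightens it.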


Other upper bounds

- Concavity of the log [e.g. Blei et al.]
- Worst curvature [Böhning]
- Bound using hyperbolic cosines [Jebara]
- Local approximation [Gibbs]: not proved to be an upper bound
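For reference, the worst-curvature bound cited above has the following standard fixed-Hessian form (my transcription from the literature): the Hessian of the log-sum-exp is dominated by a constant matrix B, giving for any expansion point ψ

```latex
\log \sum_k e^{x_k} \;\le\; \log \sum_k e^{\psi_k}
 + (x - \psi)^{\top} g(\psi)
 + \tfrac{1}{2} (x - \psi)^{\top} B \,(x - \psi),
\qquad B = \tfrac{1}{2}\left(I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^{\top}\right),
```

where g(ψ) denotes the softmax of ψ. Because B is fixed, the resulting quadratic can be maximized analytically at each iteration.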


Proof idea: expand the product of inverted sigmoids

- Upper-bounded by K quadratic upper bounds
- Lower-bounded by a linear function (log-convexity of f)

Proof: apply Jensen's inequality to the log-convex function (equation lost in the transcript)
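My reconstruction of the expansion step the slide names: each factor 1 + e^{x_k − α} is an inverted sigmoid, σ(α − x_k)^{-1}, and expanding the product shows it dominates the plain sum, since the expansion contains every single term e^{x_k − α} plus only nonnegative cross terms:

```latex
\sum_{k} e^{x_k - \alpha} \;\le\; \prod_{k} \left(1 + e^{x_k - \alpha}\right)
\quad\Longrightarrow\quad
\log \sum_{k} e^{x_k} \;\le\; \alpha + \sum_{k} \log\!\left(1 + e^{x_k - \alpha}\right).
```

Each log(1 + e^{x_k − α}) term is then upper-bounded by the binary quadratic bound, which yields the K quadratic upper bounds the slide mentions.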


Bounds on the Expectation

Exponential bound

Quadratic bound

[Figure: simulation results comparing the exponential and quadratic bounds; equations and plots lost in the transcript]
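The slide's formulas are lost. For reference, one standard "exponential" bound on the expected log-partition for Gaussian scores follows from Jensen's inequality applied to the concave log together with the Gaussian moment-generating function; whether this is exactly the form used on the slide is not recoverable from the transcript:

```latex
\mathbb{E}\!\left[\log \sum_k e^{X_k}\right]
\;\le\; \log \sum_k \mathbb{E}\!\left[e^{X_k}\right]
\;=\; \log \sum_k e^{\mu_k + \sigma_k^2 / 2},
\qquad X_k \sim \mathcal{N}(\mu_k, \sigma_k^2).
```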


Bayesian multinomial logistic regression

Exponential bound: cannot be maximized in closed form; requires gradient-based optimization, or a fixed-point equation (unstable!)

Quadratic bound: analytic update (equation lost in the transcript)
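The analytic update itself did not survive the transcript. As a generic reference (not the slide's exact formula): given any quadratic lower bound on each likelihood term of the form log p(y_i | β) ≥ −½ βᵀA_i β + b_iᵀβ + c_i and a Gaussian prior β ~ N(μ₀, Σ₀), the optimal Gaussian variational posterior is available in closed form:

```latex
q(\beta) = \mathcal{N}(m, V), \qquad
V = \Big(\Sigma_0^{-1} + \sum_i A_i\Big)^{-1}, \qquad
m = V \Big(\Sigma_0^{-1} \mu_0 + \sum_i b_i\Big).
```

This is why a quadratic bound permits analytic updates while the exponential bound does not.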


Numerical experiments

Setup:
- Iris dataset: 4 dimensions, 3 classes; prior with unit variance
- Learning: batch updates, compared against an MCMC estimate based on 100K samples
- Error: Euclidean distance between the mean and variance parameters

Results: the "worst curvature" bound is both faster and more accurate.

[Figure: two panels vs. number of iterations, comparing the worst-curvature and sigmoid-product bounds; left: error (1.5 to 6); right: variational free energy (scale ×10^5)]


Conclusion

- Multinomial links in graphical models are feasible
- Existing bounds work well
- We can expect further improvements

Remark: better bounds are only needed in the Bayesian setting; for MAP estimation, even a loose bound converges.

Future work:
- Application to discriminative learning
- Mixture-based mean-field approximation


Backup slides


Numerical experiments (backup)

Same setup as the main experiment: Iris dataset (4 dimensions, 3 classes, unit-variance prior), batch updates, comparison against an MCMC estimate based on 100K samples; error is the Euclidean distance between the mean and variance parameters.

[Figure: error (0.002 to 0.022) vs. number of iterations for the worst-curvature and sigmoid-product bounds]

[Figure (backup): the bound on log Σ_k e^{x_k} as a function of x for K = 3 and K = 100, plus one unlabeled panel; curves compare the worst-curvature bound, the optimal tight bound, and the optimal average bound (parameter = 2, 1, and 0.1)]


Jebara’s bound

One dimension: hyperbolic cosine bound (equation lost in the transcript)

Multi-dimensional case: (equation lost in the transcript)