Unbiased Implicit Variational Inference
franrruiz.github.io/contents/group_talks/UC3M-Dec2018.pdf

Page 1:

Unbiased Implicit Variational Inference

Michalis K. Titsias & Francisco J. R. Ruiz

December 17th, 2018

Page 2:

Probabilistic Modeling Pipeline

[Pipeline diagram: Model Design → Inference Algorithm → Predict & Explore → Model Testing, with the Dataset as input]

- Posit generative process with hidden and observed variables

- Given the data, reverse the process to infer hidden variables

- Use hidden structure to make predictions, explore the dataset, etc.

Page 3:

Probabilistic Modeling Pipeline

[Pipeline diagram: Model Design → Inference Algorithm → Predict & Explore → Model Testing, with the Dataset as input]

- Incorporate domain knowledge with interpretable components

- Separate assumptions from computation

- Facilitate collaboration with domain experts

Page 4:

Applications: Gene Signature Discovery

Can we identify de novo gene expression patterns in scRNA-seq?

(Figure from a bioRxiv preprint: doi:10.1101/367003, first posted online Jul. 11, 2018.)

Page 5:

Applications: Consumer Preferences

Can we use mobile location data to find the most promising location for a new restaurant?

Restaurants in the Bay Area

Page 6:

Applications: Shopping Behavior

Can we use past shopping transactions to learn customer preferences and predict demand as a function of price?

Page 7:

Background: Probabilistic Modeling

- Latent variables z

- Observations x

- Probabilistic model p(x, z)

- Posterior p(z | x) = p(x, z) / ∫ p(x, z) dz

Page 8:

Background: Probabilistic Modeling

p(z | x) = p(x, z) / ∫ p(x, z) dz

- The posterior allows us to explore the data and make predictions

- Approximating the posterior is the central challenge of Bayesian inference

Page 9:

Inference

[Pipeline diagram: Model Design → Inference Algorithm → Predict & Explore → Model Testing, with the Dataset as input]

Page 10:

Background: Variational Inference

- Variational inference approximates the posterior

- Find simpler distribution qθ(z) ≈ p(z | x)

- Use KL divergence to measure similarity between qθ(z) and p(z | x)

- Minimize KL divergence w.r.t. variational parameters θ

Page 11:

Background: Mean-Field Variational Inference

- Classical VI: mean-field variational distribution,

qθ(z) = ∏n qθn(zn)

- Simple, but might not be accurate

Page 12:

Our Goal: More Expressive Variational Distributions

[Figure: three toy target distributions (banana, x-shaped, multimodal). Blue dots: samples from qθ(z).]

Page 13:

Variational Inference with Implicit Distributions

- Easy to draw samples from qθ(z):

sample ε ∼ q(ε); set z = fθ(ε)

- Cannot evaluate the density qθ(z)

- Flexible distribution due to the non-linear transformation (see the sketch below)
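As a concrete picture of the two bullets above, a minimal numpy sketch: fθ is a small fixed-weight MLP, so sampling is one forward pass, but there is no formula for the density of z. The network shape, weights, and names are illustrative assumptions, not the setup used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transformation f_theta: a fixed two-layer MLP mapping noise
# to z. In practice the weights (W1, b1, W2, b2) are the variational
# parameters theta being optimized.
W1, b1 = rng.normal(size=(20, 5)), np.zeros(20)
W2, b2 = rng.normal(size=(2, 20)), np.zeros(2)

def f_theta(eps):
    """Non-linear transform that defines the implicit distribution q_theta(z)."""
    h = np.tanh(W1 @ eps + b1)
    return W2 @ h + b2

eps = rng.normal(size=5)   # sample eps ~ q(eps), a standard Gaussian
z = f_theta(eps)           # set z = f_theta(eps); the density q_theta(z) is unknown
```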


Page 14:

VI with Implicit Distributions is Hard

- The VI objective is the ELBO (maximizing it is equivalent to minimizing the KL),

L(θ) = Eqθ(z)[ log p(x, z) − log qθ(z) ]   (first term: model; second term: entropy)
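The equivalence can be made explicit: the ELBO and the KL sum to the log marginal likelihood, which is constant in θ, so maximizing L(θ) minimizes the KL. In LaTeX:

```latex
\log p(x) =
  \underbrace{\mathbb{E}_{q_\theta(z)}\!\left[\log p(x,z) - \log q_\theta(z)\right]}_{\mathcal{L}(\theta)}
  + \mathrm{KL}\!\left(q_\theta(z) \,\|\, p(z \mid x)\right)
```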

- Gradient of the objective, ∇θL(θ), via reparameterization:

∇θL(θ) = Eq(ε)[ ∇z( log p(x, z) − log qθ(z) ) |z=fθ(ε) × ∇θfθ(ε) ]

- Monte Carlo estimates require ∇z log qθ(z) (not available)

Page 15:

Unbiased Implicit Variational Inference

- We describe how to obtain an unbiased Monte Carlo estimator

- We avoid density ratio estimation

- Key ideas:

  1. Semi-implicit construction of qθ(z)
  2. Gradient of the entropy component written as an expectation,

∇z log qθ(z) = E_distrib(·)[ function(z, ·) ]

Page 16:

UIVI Step 1: Semi-Implicit Distribution

- Implicit distribution (diagram)

- (Semi-)implicit distribution (diagram)

Page 17:

UIVI Step 1: Semi-Implicit Distribution

- (Semi-)implicit distribution

- Example: the conditional qθ(z | ε) is a Gaussian,

qθ(z | ε) = N(z | µθ(ε), Σθ(ε))

Page 18:

UIVI Step 1: Semi-Implicit Distribution

- (Semi-)implicit distribution

- The distribution qθ(z) is still implicit

- Easy to sample (see the sketch below):

sample ε ∼ q(ε)
obtain µθ(ε) and Σθ(ε)
sample z ∼ N(z | µθ(ε), Σθ(ε))

- The variational distribution qθ(z) is not tractable,

qθ(z) = ∫ q(ε) qθ(z | ε) dε
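A minimal numpy sketch of this sampling recipe, with hypothetical mixing networks mu_theta and a diagonal sigma_theta standing in for the real architecture (all names and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mixing networks: mu_theta and a diagonal Sigma_theta.
W_mu = rng.normal(size=(2, 10))
W_sig = rng.normal(size=(2, 10))

def mu_theta(eps):
    return W_mu @ np.tanh(eps)

def sigma_theta(eps):
    # Diagonal covariance entries; softplus keeps them positive.
    return np.log1p(np.exp(W_sig @ np.tanh(eps)))

# Sample from the semi-implicit q_theta(z):
eps = rng.normal(size=10)                                  # eps ~ q(eps)
z = mu_theta(eps) + np.sqrt(sigma_theta(eps)) * rng.normal(size=2)
# z ~ N(mu_theta(eps), Sigma_theta(eps)); the marginal
# q_theta(z) = int q(eps) q_theta(z | eps) d eps remains intractable.
```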

Page 19:

UIVI Step 1: Semi-Implicit Distribution

- (Semi-)implicit distribution

- Assumptions on the conditional qθ(z | ε):

  - Reparameterizable
  - Tractable gradient ∇z log qθ(z | ε)

Note: this is different from ∇z log qθ(z) (still intractable)

Page 20:

UIVI Step 1: Semi-Implicit Distribution

- (Semi-)implicit distribution

- The Gaussian meets both assumptions:

  - Reparameterizable,

u ∼ N(u | 0, I),   z = hθ(u ; ε) = µθ(ε) + Σθ(ε)^{1/2} u

  - Tractable gradient (a numerical check follows below),

∇z log qθ(z | ε) = −Σθ(ε)^{-1} (z − µθ(ε))
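Both properties are easy to sanity-check numerically. A small sketch with a diagonal covariance and hypothetical values, comparing the closed-form score against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical conditional N(z | mu, diag(sig2)) for a fixed eps.
mu = np.array([0.5, -1.0])
sig2 = np.array([0.8, 1.5])

def log_q(z):
    # log N(z | mu, diag(sig2)), including the normalizing constant.
    return -0.5 * np.sum((z - mu) ** 2 / sig2 + np.log(2 * np.pi * sig2))

z = mu + np.sqrt(sig2) * rng.normal(size=2)   # reparameterized sample

analytic = -(z - mu) / sig2                   # -Sigma^{-1} (z - mu)
h = 1e-6
numeric = np.array([
    (log_q(z + h * np.eye(2)[i]) - log_q(z - h * np.eye(2)[i])) / (2 * h)
    for i in range(2)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```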

Page 21:

UIVI Step 2: Gradient as Expectation

- Goal: estimate the gradient of the entropy component, ∇z log qθ(z)

- Rewrite it as an expectation,

∇z log qθ(z) = Eqθ(ε′ | z)[ ∇z log qθ(z | ε′) ]

- Form a Monte Carlo estimate,

∇z log qθ(z) ≈ ∇z log qθ(z | ε′),   ε′ ∼ qθ(ε′ | z)

Page 22:

UIVI: Full Algorithm

- The gradient of the ELBO is

∇θL(θ) = Eq(ε)q(u)[ ∇z( log p(x, z) − log qθ(z) ) |z=hθ(u ; ε) × ∇θhθ(u ; ε) ]

- Estimate the gradient based on samples (see the sketch after this list):

1. Sample ε ∼ q(ε), u ∼ q(u) (standard Gaussians)
2. Set z = hθ(u ; ε) = µθ(ε) + Σθ(ε)^{1/2} u
3. Evaluate ∇z log p(x, z) and ∇θhθ(u ; ε)
4. Sample ε′ ∼ qθ(ε′ | z)
5. Approximate ∇z log qθ(z) ≈ ∇z log qθ(z | ε′)
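A one-sample sketch of steps 1-5 on a toy problem, assuming a standard-Gaussian target and a linear-Gaussian conditional qθ(z | ε) = N(z | Aε, s²I) so that ∇θhθ is explicit; sample_reverse_conditional is a hypothetical placeholder for the HMC routine discussed on the next slides:

```python
import numpy as np

rng = np.random.default_rng(3)

d_z, d_eps, s2 = 2, 4, 0.5
A = rng.normal(size=(d_z, d_eps))        # variational parameters theta

def grad_log_p(z):
    # Toy target: log p(x, z) = -||z||^2 / 2 + const.
    return -z

def grad_z_log_q_cond(z, eps):
    # -Sigma^{-1} (z - mu_theta(eps)) for mu_theta(eps) = A @ eps.
    return -(z - A @ eps) / s2

def sample_reverse_conditional(z, eps):
    # Placeholder: a real implementation runs a few HMC steps targeting
    # q_theta(eps' | z), initialized at eps (see the sketch after Page 25).
    return eps

# Steps 1-2: sample eps, u and reparameterize z = h_theta(u; eps).
eps = rng.normal(size=d_eps)
u = rng.normal(size=d_z)
z = A @ eps + np.sqrt(s2) * u

# Steps 3-5: assemble the gradient w.r.t. A. Since z = A @ eps + sqrt(s2) u,
# dz_i / dA_ij = eps_j, so the chain rule gives an outer product.
eps_prime = sample_reverse_conditional(z, eps)
grad_z = grad_log_p(z) - grad_z_log_q_cond(z, eps_prime)
grad_A = np.outer(grad_z, eps)   # one-sample estimate of grad_theta L(theta);
                                 # unbiased once eps_prime is a true HMC sample
```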

Page 23:

UIVI: The Reverse Conditional

- The distribution qθ(ε′ | z) is the reverse conditional (the forward conditional is qθ(z | ε))

- Sample from qθ(ε′ | z) using HMC, targeting

qθ(ε′ | z) ∝ q(ε′) qθ(z | ε′)

- Problem: HMC is slow... How to accelerate this?

Page 24:

UIVI: The Reverse Conditional

- Recall the UIVI algorithm,

(Page 22 repeated: the ELBO gradient and the five sampling steps of the full UIVI algorithm.)

- We have that (ε, z) ∼ qθ(ε, z) = q(ε) qθ(z | ε) = qθ(z) qθ(ε | z)

- Thus, ε is a sample from qθ(ε | z)

- To accelerate sampling ε′ ∼ qθ(ε′ | z), initialize HMC at ε

Page 25:

UIVI: The Reverse Conditional

- Sample from qθ(ε′ | z) using HMC targeting

qθ(ε′ | z) ∝ q(ε′) qθ(z | ε′)

- Initialize HMC at stationarity (using ε)

- Run a few HMC iterations to reduce the correlation between ε and ε′ (see the sketch below)
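A minimal leapfrog-HMC sketch for the reverse conditional, reusing the toy linear-Gaussian conditional from the earlier sketch; per the slide, the chain starts at the ε that generated z. Step size and iteration counts are illustrative, not tuned values from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

d_z, d_eps, s2 = 2, 4, 0.5
A = rng.normal(size=(d_z, d_eps))

def log_target(e, z):
    # log q(eps') + log q_theta(z | eps'), up to an additive constant.
    return -0.5 * e @ e - 0.5 * np.sum((z - A @ e) ** 2) / s2

def grad_log_target(e, z):
    return -e + A.T @ (z - A @ e) / s2

def hmc_reverse_conditional(z, eps_init, n_iters=5, n_leapfrog=5, step=0.1):
    e = eps_init.copy()
    for _ in range(n_iters):
        p = rng.normal(size=e.shape)          # fresh momentum each iteration
        e_new, p_new = e.copy(), p.copy()
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new = p_new + 0.5 * step * grad_log_target(e_new, z)
        for _ in range(n_leapfrog - 1):
            e_new = e_new + step * p_new
            p_new = p_new + step * grad_log_target(e_new, z)
        e_new = e_new + step * p_new
        p_new = p_new + 0.5 * step * grad_log_target(e_new, z)
        # Metropolis correction keeps q_theta(eps' | z) invariant.
        log_accept = (log_target(e_new, z) - 0.5 * p_new @ p_new
                      - log_target(e, z) + 0.5 * p @ p)
        if np.log(rng.uniform()) < log_accept:
            e = e_new
    return e

eps = rng.normal(size=d_eps)
z = A @ eps + np.sqrt(s2) * rng.normal(size=d_z)  # (eps, z) ~ q_theta(eps, z)
eps_prime = hmc_reverse_conditional(z, eps)       # approx. eps' ~ q_theta(eps' | z)
```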

Page 26:

Toy Experiments

[Figure: the three toy targets (banana, x-shaped, multimodal). Blue dots: samples from qθ(z).]

Page 27:

Experiments: Multinomial Logistic Regression

p(x, z) = p(z) ∏_{n=1}^{N} exp{x_n^T z_{y_n} + z_{0 y_n}} / Σ_k exp{x_n^T z_k + z_{0k}}
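For concreteness, a small numpy sketch of this log-joint, assuming a standard-Gaussian prior p(z) (the slide leaves the prior unspecified, so that choice is an assumption):

```python
import numpy as np

def log_joint(X, y, Z, z0):
    """Bayesian multinomial logistic regression log-joint, up to a constant.

    X: (N, D) features; y: (N,) integer labels; Z: (D, K) weights per class;
    z0: (K,) intercepts. Prior: standard Gaussian on Z and z0 (assumed).
    """
    logits = X @ Z + z0                          # (N, K): x_n^T z_k + z_0k
    m = logits.max(axis=1, keepdims=True)        # stabilized log-sum-exp
    log_norm = m[:, 0] + np.log(np.sum(np.exp(logits - m), axis=1))
    log_lik = np.sum(logits[np.arange(len(y)), y] - log_norm)
    log_prior = -0.5 * (np.sum(Z ** 2) + np.sum(z0 ** 2))
    return log_lik + log_prior
```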

[Figure: ELBO and test log-likelihood vs. time (minutes) on MNIST and HAPT, comparing SIVI and UIVI [this paper].]

UIVI provides better ELBO and predictive performance than SIVI


Page 28:

Experiments: Multinomial Logistic Regression

p(x, z) = p(z) ∏_{n=1}^{N} exp{x_n^T z_{y_n} + z_{0 y_n}} / Σ_k exp{x_n^T z_k + z_{0k}}

[Figure: ELBO vs. iterations (×100) on MNIST, for 1, 5, 20, and 50 HMC iterations.]

Number of HMC iterations does not significantly impact results


Page 29:

Experiments: VAE

- Model is pφ(x, z) = ∏n p(zn) pφ(xn | zn)

- Amortized variational distribution qθ(zn | xn) = ∫ q(εn) qθ(zn | εn, xn) dεn

- Goal: find model parameters φ and variational parameters θ
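A sketch of what the amortized conditional mean could look like, shaped after the manuscript excerpt below (two hidden layers of 200 units and a 10-dimensional εn concatenated with xn); the weights, the ReLU activations, and all names are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)

d_x, d_eps, d_z, h = 784, 10, 10, 200
W1 = rng.normal(size=(h, d_x + d_eps)) * 0.01
W2 = rng.normal(size=(h, h)) * 0.01
W3 = rng.normal(size=(d_z, h)) * 0.01

def encoder_mean(x, eps):
    # Mean of the Gaussian conditional q_theta(z_n | eps_n, x_n): an MLP
    # that sees both the observation and the mixing noise.
    a = np.maximum(0.0, W1 @ np.concatenate([x, eps]))
    a = np.maximum(0.0, W2 @ a)
    return W3 @ a

x = (rng.uniform(size=d_x) > 0.5).astype(float)  # a binarized image
eps = rng.normal(size=d_eps)                     # eps_n ~ q(eps)
mu = encoder_mean(x, eps)   # add conditional Gaussian noise to obtain z_n
```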

Manuscript under review by AISTATS 2019

Figure 3. Estimates of the ELBO and the test log-likelihood as a function of wall-clock time for the Bayesian multinomial logistic regression model (Section 4.2). Compared to SIVI (red), UIVI (blue) achieves a better bound on the marginal likelihood and has better predictive performance.

...images of clothing items. We binarize the Fashion-MNIST images with a threshold at 0.5. Images in both datasets are of size 28×28 pixels.

Variational family. We use the variational family described in Section 4.1 with Gaussian prior and Gaussian conditional. Since the variational distribution is amortized, we let the conditional qθ(zn | εn, xn) depend on the observation xn, such that the variational distribution is qθ(zn | xn) = ∫ q(εn) qθ(zn | εn, xn) dεn. We obtain the mean of the Gaussian conditional as the output of a neural network having as inputs both xn and εn. We set the dimensionality of εn to 10 and the width of each of the two hidden layers of the neural network to 200.

For comparisons, we also fit a standard VAE [Kingma and Welling, 2014]. The standard VAE uses an explicit Gaussian distribution whose mean and covariance are functions of the input, i.e., qθ(zn | xn) = N(zn | µθ(xn), Σθ(xn)). The mean and covariance are parameterized using two separate neural networks with the same structure described above, and the covariance is set to be diagonal. The neural network for the covariance has softplus activations in the output layer, i.e., softplus(x) = log(1 + e^x).

Experimental setup. For the generative model pφ(xn | zn) we use a factorized Bernoulli distribution. We use a two-hidden-layer neural network with 200 hidden units on each hidden layer, whose sigmoidal outputs define the means of the Bernoulli distribution. We set the prior p(zn) = N(zn | 0, I) and the dimensionality of zn to 10. We run 400,000 iterations of each method (explicit variational distribution, SIVI, and UIVI), using the same initialization and a minibatch of size 100. We set the SIVI parameter L = 100 so that both SIVI and UIVI have similar complexity (see below).

                          average test log-likelihood
method                    MNIST      Fashion-MNIST
Explicit (standard VAE)   −98.29     −126.73
SIVI                      −97.77     −121.53
UIVI [this paper]         −94.09     −110.72

Table 2. Estimates of the marginal log-likelihood on the test set for the VAE (Section 4.3). UIVI gives better predictive performance than SIVI.

We set the learning rate η = 10⁻³ for the network parameters of the variational Gaussian conditional, η = 2·10⁻⁴ for its covariance (we also set η = 2·10⁻⁴ for the network that parameterizes the covariance of the explicit distribution), and η = 10⁻³ for the network parameters of the generative model. We reduce the learning rate by a factor of 0.9 every 15,000 iterations.

Results. We estimate the marginal likelihood on the test set using importance sampling,

log p(xn) ≈ log (1/S) Σ_{s=1}^{S} [ pφ(xn | zn^(s)) p(zn^(s)) ] / [ (1/M) Σ_{m=1}^{M} qθ(zn^(s) | εn^(m), xn) ],

zn^(s) ∼ qθ(zn | xn),   εn^(m) ∼ q(ε),

where we set S = 1,000 and M = 10,000 samples.
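A self-contained sketch of this estimator on a toy linear-Gaussian model where every density is explicit; the model, the encoder, and the (much smaller) S and M are illustrative stand-ins for the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(6)

def logsumexp(a):
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))

d_z, d_eps, s2 = 2, 3, 0.09
B = rng.normal(size=(d_eps, d_z)) * 0.3

def log_gauss(v, mu, var):
    return -0.5 * np.sum((v - mu) ** 2 / var + np.log(2 * np.pi * var))

def enc_mean(x, eps):
    # Semi-implicit encoder mean, depending on both x and eps.
    return 0.5 * x + eps @ B

def log_is_estimate(x, S=200, M=500):
    eps_bank = rng.normal(size=(M, d_eps))            # eps^(m) ~ q(eps)
    log_ratios = np.empty(S)
    for s in range(S):
        eps = rng.normal(size=d_eps)
        z = enc_mean(x, eps) + np.sqrt(s2) * rng.normal(size=d_z)  # z^(s) ~ q_theta(z | x)
        log_num = log_gauss(x, z, 1.0) + log_gauss(z, 0.0, 1.0)    # p(x | z) p(z), in logs
        # Monte Carlo estimate of the encoder density q_theta(z^(s) | x).
        log_cond = np.array([log_gauss(z, enc_mean(x, e), s2) for e in eps_bank])
        log_ratios[s] = log_num - (logsumexp(log_cond) - np.log(M))
    return logsumexp(log_ratios) - np.log(S)

print(log_is_estimate(np.zeros(d_z)))
```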

Table 2 shows the estimated values of the test marginal likelihood for all methods and datasets. UIVI provides better predictive performance than SIVI, which in turn gives better predictions than the explicit Gaussian approximation.

In terms of computational complexity, the average time per iteration is similar for UIVI and SIVI. On MNIST, it is 0.14 seconds for SIVI and 0.16 seconds for UIVI; on Fashion-

UIVI provides better ELBO and predictive performance


Page 30:

Conclusion

- UIVI approximates the posterior with an expressive variational distribution

- The variational distribution is implicit

- UIVI directly optimizes the ELBO

- Good results on Bayesian multinomial logistic regression and VAEs

Page 31:

Proof of the Key Equation

- Goal: prove that

∇z log qθ(z) = Eqθ(ε | z)[ ∇z log qθ(z | ε) ]

- Start with the log-derivative identity,

∇z log qθ(z) = (1 / qθ(z)) ∇z qθ(z)

- Apply the definition of qθ(z) as a mixture,

∇z log qθ(z) = (1 / qθ(z)) ∫ ∇z qθ(z | ε) q(ε) dε

- Apply the log-derivative identity to qθ(z | ε),

∇z log qθ(z) = (1 / qθ(z)) ∫ qθ(z | ε) q(ε) ∇z log qθ(z | ε) dε

- Apply Bayes' theorem, qθ(z | ε) q(ε) / qθ(z) = qθ(ε | z), to conclude

∇z log qθ(z) = ∫ qθ(ε | z) ∇z log qθ(z | ε) dε = Eqθ(ε | z)[ ∇z log qθ(z | ε) ]

Page 32:

SIVI

- SIVI optimizes a lower bound of the ELBO,

L_SIVI^(L)(θ) = E_{ε ∼ q(ε)} E_{z ∼ qθ(z | ε)} E_{ε^(1), ..., ε^(L) ∼ q(ε)} [ log p(x, z) − log( (1 / (L+1)) ( qθ(z | ε) + Σ_{ℓ=1}^{L} qθ(z | ε^(ℓ)) ) ) ]
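For contrast with UIVI's unbiased estimator, a sketch of the SIVI surrogate for log qθ(z), reusing the toy linear-Gaussian conditional (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

d_z, d_eps, s2, L = 2, 4, 0.5, 100
A = rng.normal(size=(d_z, d_eps))

def log_q_cond(z, e):
    # log q_theta(z | eps) for the conditional N(z | A @ eps, s2 * I).
    return (-0.5 * np.sum((z - A @ e) ** 2) / s2
            - 0.5 * d_z * np.log(2 * np.pi * s2))

eps = rng.normal(size=d_eps)
z = A @ eps + np.sqrt(s2) * rng.normal(size=d_z)   # z ~ q_theta(z | eps)

eps_fresh = rng.normal(size=(L, d_eps))            # eps^(1..L) ~ q(eps)
log_terms = np.array([log_q_cond(z, eps)] +
                     [log_q_cond(z, e) for e in eps_fresh])
m = log_terms.max()
log_q_surrogate = m + np.log(np.mean(np.exp(log_terms - m)))
# SIVI plugs this (L+1)-term log-average in place of log q_theta(z), which
# yields a lower bound on the ELBO rather than an unbiased estimate.
```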