
Fast Inference in Deep Generative Models

Isabel Valera, Jonas Umlauft

Reading and Communication Club (RCC)

Computational and Biological Learning (CBL), University of Cambridge


Outline

Introduction

Learning Stochastic Inverses

Lower Bound over the Marginal Likelihood

Predictive Sparse Decomposition

Summary


Introduction

Goal: Do fast inference and learning in deep directed models.

Problem: Exponential number of possible explanations for the data.

[Figure: the rain/sprinkler/wet-grass network, and an extended version with several sprinkler nodes, illustrating how the number of possible explanations for the observed wet grass grows.]

Novel paradigm for fast inference and learning in directed models


Introduction

Different names for the same thing:
- Learning to do Inference
- Amortized Inference
- Learning to Learn
- ...

Focus of this talk: algorithms that use both a generative and a recognition model.

[Figure: latent variables x and observations y, linked by a generative model p_θ(x, y) and a recognition model q_φ(x, y).]


Outline

Introduction

Learning Stochastic Inverses

Lower Bound over the Marginal Likelihood

Predictive Sparse Decomposition

Summary


Learning Stochastic Inverses - Model

Generative model G:

p(x, y) = ∏_n p(y_n | pa_G(y_n)) ∏_m p(x_m | pa_G(x_m))

Recognition model R:

p(x, y) = ∏_n p(y_n) ∏_m p(x_m | pa_R(x_m))

x_m for m = 1 ... M: latent variables
y_n for n = 1 ... N: observed variables
pa_G(x_m): set of parents of node x_m in the generative model
pa_R(x_m): set of parents of node x_m in the recognition model

[Figure: the rain/sprinkler/wet-grass network shown twice, once with the generative edges (rain, sprinkler → wet grass) and once with the inverted recognition edges.]
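To make the two factorizations concrete, here is a small sketch on the sprinkler network: the joint is built from the generative factorization G, and the inverse (recognition) factorization R is recovered by exact conditioning. The CPT numbers are invented for illustration, not taken from the paper.

```python
import itertools

# Generative factorization G:  p(rain) p(sprinkler) p(wet | rain, sprinkler)
p_rain = {0: 0.8, 1: 0.2}
p_sprinkler = {0: 0.7, 1: 0.3}
p_wet1 = {(0, 0): 0.05, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.95}   # p(wet = 1 | rain, sprinkler)

joint = {}
for r, s, w in itertools.product([0, 1], repeat=3):
    pw = p_wet1[(r, s)] if w == 1 else 1.0 - p_wet1[(r, s)]
    joint[(r, s, w)] = p_rain[r] * p_sprinkler[s] * pw

# Recognition factorization R inverts the graph:
#   p(rain, sprinkler, wet) = p(wet) p(sprinkler | wet) p(rain | sprinkler, wet)
p_w = {w: sum(p for (r, s, w2), p in joint.items() if w2 == w) for w in [0, 1]}
p_sw = {(s, w): sum(p for (r, s2, w2), p in joint.items() if (s2, w2) == (s, w))
        for s, w in itertools.product([0, 1], repeat=2)}
p_s_given_w = {(s, w): p_sw[(s, w)] / p_w[w] for (s, w) in p_sw}
p_r_given_sw = {(r, s, w): joint[(r, s, w)] / p_sw[(s, w)] for (r, s, w) in joint}

# Both factorizations reproduce exactly the same joint distribution.
for r, s, w in itertools.product([0, 1], repeat=3):
    recon = p_w[w] * p_s_given_w[(s, w)] * p_r_given_sw[(r, s, w)]
    assert abs(recon - joint[(r, s, w)]) < 1e-12
print("generative and recognition factorizations define the same joint")
```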


Learning Stochastic Inverses - Idea

Main idea:

1. Split inference into different tasks by splitting the observations into T subsets.

2. For each task t = 1 ... T:
   - Infer p(x | y^(t)) by sampling.
   - Use the samples from task t − 1 to approximate θ_m ≈ p(x_m | pa_R(x_m)) (see the sketch below).

3. There is a trade-off between the number of tasks T and the number of datapoints per task.
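The inverse factors in item 2 can be approximated by simple conditional frequency counts over whatever samples are available. A minimal sketch of such an estimator, with invented variable names and toy samples:

```python
from collections import Counter, defaultdict

# Each sample assigns a value to every node; here pa_R(x_m) = (y1, x1).
samples = [
    {"y1": 1, "x1": 0, "xm": 1},
    {"y1": 1, "x1": 0, "xm": 0},
    {"y1": 1, "x1": 0, "xm": 1},
    {"y1": 0, "x1": 1, "xm": 0},
]

def estimate_inverse_factor(samples, child, parents):
    """Estimate p(child | parents) by conditional relative frequencies."""
    joint = Counter()      # counts of (parent values, child value)
    margin = Counter()     # counts of parent values
    for s in samples:
        key = tuple(s[p] for p in parents)
        joint[(key, s[child])] += 1
        margin[key] += 1
    theta = defaultdict(dict)
    for (key, value), count in joint.items():
        theta[key][value] = count / margin[key]
    return dict(theta)

theta_m = estimate_inverse_factor(samples, child="xm", parents=("y1", "x1"))
print(theta_m)   # e.g. {(1, 0): {1: 2/3, 0: 1/3}, (0, 1): {0: 1.0}}
```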


Learning Stochastic Inverses - Algorithm

Proposed Algorithm (for each task):

1. Construct M recognition models R_m such that x_m is a leaf and y_n for n = 1 ... N are roots.

[Figure: the sprinkler network and one of its inverses R_m, with the edges into the latent nodes reversed.]

2. Approximate the inverse factors θ_i ≈ p(x_i | pa_{R_m}(x_i)) using samples from the generative model.

3. Sample from the posterior p(x | y) using Metropolis-Hastings steps (sketched below):
   - Uniformly sample an inverse R_m.
   - Sample a proposal size l ∼ Uniform(0, ..., l_max).
   - For i = m − l ... m, sample x*_i ∼ θ_i.
   - Propose the move (x_{m−l}, ..., x_m) → (x*_{m−l}, ..., x*_m).

Steps 1 and 2 can be performed offline; Step 3 runs online.
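A minimal sketch of Step 3, simplified to a single full-block Metropolis-Hastings proposal on the sprinkler network; the inverse-factor values below stand in for trained estimates and are not from the paper.

```python
import math
import random

random.seed(0)

# Generative model: p(rain) p(sprinkler) p(wet | rain, sprinkler); all numbers invented.
P_RAIN = {0: 0.8, 1: 0.2}
P_SPRINKLER = {0: 0.7, 1: 0.3}
P_WET1 = {(0, 0): 0.05, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.95}   # p(wet = 1 | r, s)

def log_joint(r, s, w):
    pw = P_WET1[(r, s)] if w == 1 else 1.0 - P_WET1[(r, s)]
    return math.log(P_RAIN[r]) + math.log(P_SPRINKLER[s]) + math.log(pw)

# Inverse factors for the recognition ordering wet -> sprinkler -> rain
# (placeholders standing in for trained estimates theta_i).
Q_S = {1: {0: 0.35, 1: 0.65}, 0: {0: 0.9, 1: 0.1}}                 # q(s | w)
Q_R = {(0, 1): {0: 0.2, 1: 0.8}, (1, 1): {0: 0.7, 1: 0.3},
       (0, 0): {0: 0.95, 1: 0.05}, (1, 0): {0: 0.9, 1: 0.1}}       # q(r | s, w)

def sample(dist):
    u, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if u <= acc:
            return value
    return value

def log_q(r, s, w):
    return math.log(Q_S[w][s]) + math.log(Q_R[(s, w)][r])

def mh_step(state, w):
    r, s = state
    s_new = sample(Q_S[w])                       # propose the whole latent block
    r_new = sample(Q_R[(s_new, w)])
    log_alpha = (log_joint(r_new, s_new, w) + log_q(r, s, w)
                 - log_joint(r, s, w) - log_q(r_new, s_new, w))
    return (r_new, s_new) if math.log(random.random()) < log_alpha else (r, s)

# Short chain conditioned on observing wet grass (w = 1).
state, hits = (0, 0), 0
for _ in range(20000):
    state = mh_step(state, w=1)
    hits += state[0]
print("estimated p(rain = 1 | wet = 1):", hits / 20000)
```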


Learning Stochastic Inverses - Experimental Results

Figure 2: Schema of the Bayes net structure used in experiment 1. Thick arrows indicate almost-deterministic dependencies; shaded nodes are observed. The actual network has 15 layers with a total of 120 nodes.

[Figure 3 plot: error in marginals vs. time (seconds) for Gibbs and for Inverses trained on 10x10, 10x100, and 10x1000 samples.]

Figure 3: The effect of training on approximate posterior samples for 10 inference tasks. As the number of training samples per task increases, Inverse MCMC with proposals of size 20 performs new inference tasks more quickly.

[Figure 4 plot: error in marginals vs. number of training samples for Inverses (kNN).]

Figure 4: Learning an inverse distribution for the brightness constancy model (Figure 1) from prior samples using the KNN density predictor. More training samples result in better estimates after the same number of MCMC steps.

Figures 3 and 5 show the effect of training the frequency estimator on 10 inference tasks and testing on a different task (averaged over 20 runs). Inverse proposals of (up to) size k=20 do worse than pure Gibbs sampling with little training (due to higher rejection rate), but they speed convergence as the number of training samples increases. More generally, large proposals are likely to be rejected without training, but improve convergence after training.

Figure 6 illustrates how the number of inference tasks influences error and MH acceptance ratio in a setting where the total number of training samples is kept constant. Surprisingly, increasing the number of training tasks from 5 to 15 has little effect on error and acceptance ratio for this network. That is, it seems relatively unimportant which posterior the training samples are drawn from; we may expect different results when posteriors are more sparse.

Figure 7 shows how different sources of training data affect the quality of the trained sampler (averaged over 20 runs). As the strength of near-deterministic dependencies increases, direct training on Gibbs samples becomes infeasible. In this regime, we can still train on prior samples and on Gibbs samples for networks with relaxed dependencies. Alternatively, we can employ the annealing scheme outlined in the previous section. In this example, we take the temperature ladder to be [.2, .1, .05, .02, .01, 0]; that is, we start by learning inverses for the relaxed network where all CPT probabilities are constrained to lie within [.2, .8]; we then use these inverses as proposers for MCMC inference on a network constrained to CPT probabilities in [.1, .9], learn the corresponding inverses, and continue, until we reach the network of interest (at temperature 0).

While the empirical frequency estimator used in the above experiments provides an attractive asymptotic convergence guarantee (Theorem 3), it is likely to generalize slowly from small amounts of training data. For practical purposes, we may be more interested in getting useful generalizations quickly than converging to a perfect proposal distribution. Fortunately, the Inverse MCMC algorithm can be used with any estimator for local conditionals, consistent or not. We evaluate this idea on a 12-node subset of the network used in the previous experiments. We learn complete inverses, resampling up to 12 nodes at once. We compare inference using a logistic regression estimator with L2 regularization (with and without interaction terms) to inference using the empirical frequency estimator. Figure 9 shows the error (integrated over time to better reflect convergence speed) against the number of training examples, averaged over 300 runs. The regression estimator with interaction terms results in significantly better results when training on few posterior samples, but is ultimately overtaken by the consistent empirical estimator.

Next, we use the KNN density predictor to learn inverse distributions for the continuous Bayesian network shown in Figure 1. To evaluate the quality of the learned distributions, we take 1000

Schema of the Bayes net structure, and error in marginals over computation time.


Learning Stochastic Inverses - Experimental Results

[Figure 5 plots: error in marginals and acceptance ratio vs. log10(training samples per task), for maximum proposal sizes 5-30.]

Figure 5: Without training, big inverse proposals result in high error, as they are unlikely to be accepted. As we increase the number of approximate posterior samples used to train the MCMC sampler, the acceptance probability for big proposals goes up, which decreases overall error.

[Figure 6 plot: acceptance ratio vs. number of tasks, for maximum proposal sizes 5-30.]

Figure 6: For the network under consideration, increasing the number of tasks (i.e., samples for other observations) we train on has little effect on acceptance ratio (and error) if we keep the total number of training samples constant.

[Figure 7 plot: test error (after 10s) by training source (Prior, Gibbs, Relaxed Gibbs, Annealing) at determinism 0.95 and 0.9999.]

Figure 7: For networks without hard determinism, we can train on Gibbs samples. For others, we can use prior samples, Gibbs samples for relaxed networks, and samples from a sequence of annealed Inverse samplers.

samples using Inverse MCMC and compare marginals to a solution computed by JAGS (Plummer et al., 2003). As we refine the inverses using forward samples, the error in the estimated marginals decreases towards 0, providing evidence for convergence towards a posterior sampler (Figure 4).

To evaluate Inverse MCMC in more breadth, we run the algorithm on all binary Bayes nets with up to 500 nodes that have been submitted to the UAI 08 inference competition (216 networks). Since many of these networks exhibit strong determinism, we train on prior samples and apply the annealing scheme outlined above to generate approximate posterior samples. For training and testing, we use the evidence provided with each network. We compute the error in marginals as described above for both Gibbs (proposal size 1) and Inverse MCMC (maximum proposal size 20). To summarize convergence over the 1200s of test time, we compute the area under the error curves (Figure 8). Each point represents a single run on a single model. We label different classes of networks. For the grid networks, grid-k denotes a network with k% deterministic dependencies. While performance varies across network classes (with extremely deterministic networks making the acquisition of training data challenging), the comparison with Gibbs suggests that learned block proposals frequently help.

Overall, these results indicate that Inverse MCMC is of practical benefit for learning block proposals in reasonably large Bayes nets and using a realistic amount of training data (an amount that might result from amortizing over five or ten inferences).

Analysis of the effect of the number of training samples per task.


Outline

Introduction

Learning Stochastic Inverses

Lower Bound over the Marginal Likelihood

Predictive Sparse Decomposition

Summary


Introduction

Main idea:
- Introduce a recognition model to obtain an efficient approximation of a lower bound on the marginal log-likelihood.
- Jointly optimize the parameters of the recognition and generative models by maximizing the lower bound.

[Figure: latent variables x and observations y, linked by a generative model p_θ(x, y) and a recognition model q_φ(x, y).]


Lower bound on marginal log-likelihood

Derivation of a lower bound on the log-probability [1]:

log p(y | θ) = log ∫ P_θ(x, y) dx = log ∫ Q_φ(x) · P_θ(x, y) / Q_φ(x) dx
            = ∫ Q_φ(x) log [ P_θ(x, y) / Q_φ(x) ] dx + KL(Q_φ(x) ‖ P_θ(x | y))
            ≥ ∫ Q_φ(x) log P_θ(x, y) dx − ∫ Q_φ(x) log Q_φ(x) dx =: −F(y | θ, Q_φ(x))

- observations y, latent variables x
- generative model parameters θ
- recognition model parameters φ

Goal: Optimize the lower bound −F(y | θ, Q_φ(x)) over θ and Q_φ(x).

[1] KL(Q ‖ P) is the Kullback–Leibler divergence from Q to P.
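A quick numeric sanity check of the bound on a toy model with one binary latent variable (all probabilities invented): −F equals log p(y) exactly when Q_φ(x) is the true posterior, and is strictly smaller otherwise.

```python
import numpy as np

p_x = np.array([0.6, 0.4])                 # prior p(x), invented values
p_y_given_x = np.array([0.9, 0.2])         # p(y = 1 | x) for x = 0, 1

y = 1
p_xy = p_x * p_y_given_x                   # joint p(x, y = 1) for both x values
log_py = np.log(p_xy.sum())                # exact marginal log-likelihood

def neg_F(q):
    """Lower bound -F(y | theta, Q) = E_Q[log P(x, y)] - E_Q[log Q(x)]."""
    return np.sum(q * np.log(p_xy)) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = p_xy / p_xy.sum()            # exact posterior p(x | y)

print("log p(y)             :", log_py)
print("-F with arbitrary Q  :", neg_F(q_arbitrary))   # strictly below log p(y)
print("-F with Q = posterior:", neg_F(q_posterior))   # equals log p(y)
```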


Optimization of the lower bound

Using, e.g., a factorial recognition distribution parameterized by φ restricts the surface over which the bound is optimized.


Approach 1: Helmholtz Machine

Idea: A connectionist multi-layer system with binary stochastic units connected hierarchically by two sets of weights θ, φ.

- Top-down connections θ implement the generative model.
- Bottom-up connections φ implement the recognition model.


Helmholtz Machine - Assumptions

Assumption on the recognition distribution: The activity of each unit in layer l is independent of all other units in layer l, given the activities in layer l − 1.

⇒ Q_φ(x) is factorial (separable) in each layer
⇒ The recognition model only needs to specify h probabilities, not 2^h − 1
⇒ Computationally tractable, but the log-probability is underestimated

The generative model is taken to be factorial in the same way. Recognition and generation:

q_j^l(φ, s^{l−1}) = σ( ∑_i s_i^{l−1} φ_{i,j}^{l−1,l} ),    p_j^l(θ, s^{l+1}) = σ( ∑_k s_k^{l+1} θ_{k,j}^{l+1,l} )

Stochastic gradient ascent across all data can be performed using −F(θ, φ) = ∑_y −F(y; θ, Q_φ(x)).
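To make the two factorial passes concrete, here is a small numpy sketch with random placeholder weights; the layer sizes and the fixed 0.5 prior on the top layer are arbitrary assumptions, not part of the original model description.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

sizes = [6, 4, 3]                      # layer 0 = data layer, layers 1..2 latent
phi = [rng.normal(scale=0.1, size=(sizes[l], sizes[l + 1]))       # bottom-up weights
       for l in range(len(sizes) - 1)]
theta = [rng.normal(scale=0.1, size=(sizes[l + 1], sizes[l]))     # top-down weights
         for l in range(len(sizes) - 1)]

def recognize(y):
    """Sample latent layers bottom-up: s^l_j ~ Bernoulli(q^l_j(phi, s^{l-1}))."""
    states = [y]
    for l in range(len(sizes) - 1):
        q = sigmoid(states[-1] @ phi[l])
        states.append((rng.random(q.shape) < q).astype(float))
    return states

def generate():
    """Sample layers top-down: s^l_j ~ Bernoulli(p^l_j(theta, s^{l+1}))."""
    top = (rng.random(sizes[-1]) < 0.5).astype(float)   # placeholder prior on the top layer
    states = [top]
    for l in reversed(range(len(sizes) - 1)):
        p = sigmoid(states[0] @ theta[l])
        states.insert(0, (rng.random(p.shape) < p).astype(float))
    return states

y = (rng.random(sizes[0]) < 0.5).astype(float)   # a dummy binary datapoint
print("recognition states:", [s.astype(int) for s in recognize(y)])
print("generated states  :", [s.astype(int) for s in generate()])
```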


Helmholtz Machine - Wake-Sleep Algorithm

Gradient computation is complicated → use the wake-sleep algorithm as a simple learning scheme for layered networks (a toy sketch follows the two phases).

Wake phase:
1. Bottom-up: Select s^l based on q_j^l(φ, s^{l−1}).
2. Top-down: Set θ to minimize KL(Q, P).

Sleep phase:
1. Top-down: Select s^l based on p_j^l(θ, s^{l+1}).
2. Bottom-up: Set φ to minimize KL(P, Q) (the reverse divergence, evaluated on samples generated by the model).
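A toy numpy sketch of the two phases in their local delta-rule form; layer sizes, data, learning rate, and the fixed top-layer prior are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

sizes = [6, 4, 3]
phi = [rng.normal(scale=0.1, size=(sizes[l], sizes[l + 1])) for l in range(2)]
theta = [rng.normal(scale=0.1, size=(sizes[l + 1], sizes[l])) for l in range(2)]
lr = 0.05

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_phase(y):
    # Bottom-up: sample latent states from the recognition model ...
    states = [y]
    for l in range(2):
        states.append(bernoulli(sigmoid(states[-1] @ phi[l])))
    # ... then nudge the generative weights to make those states more probable.
    for l in range(2):
        p = sigmoid(states[l + 1] @ theta[l])
        theta[l] += lr * np.outer(states[l + 1], states[l] - p)

def sleep_phase():
    # Top-down: generate a "dream" from the generative model ...
    states = [bernoulli(np.full(sizes[-1], 0.5))]
    for l in reversed(range(2)):
        states.insert(0, bernoulli(sigmoid(states[0] @ theta[l])))
    # ... then nudge the recognition weights to reproduce the dream's latents.
    for l in range(2):
        q = sigmoid(states[l] @ phi[l])
        phi[l] += lr * np.outer(states[l], states[l + 1] - q)

data = bernoulli(np.full((20, sizes[0]), 0.7))   # dummy binary dataset
for epoch in range(5):
    for y in data:
        wake_phase(y)
        sleep_phase()
print("trained; e.g. norm of phi[0]:", np.linalg.norm(phi[0]))
```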


Approach 2: Stochastic Gradient Variational Bayes (SGVB)

Idea:

- Deep generative models with continuous latent variables.
- Stochastic variational inference to optimize the lower bound −F(y; θ, φ) over θ and φ.
- Online algorithm:

Assuming the factorization over N datapoints,

log p(y^(1), ..., y^(N) | θ) = ∑_{i=1}^N log p(y^(i) | θ),

we can lower bound the marginal likelihood for each datapoint:

−F(y^(i); θ, φ) = ∫ Q_φ(x | y^(i)) [ −log Q_φ(x | y^(i)) + log P_θ(y^(i), x) ] dx


SGVB - Reparametrization Trick

- Reparametrization: x ∼ Q_φ(x | y) → x = g_φ(y, ε), where
  - the transformation g_φ(y, ε) is differentiable,
  - ε ∼ p(ε) is an auxiliary noise variable.
- Samples: x^(i,l) = g_φ(y^(i), ε^(l)) with ε^(l) ∼ p(ε).
- Monte Carlo estimation of the integrals:

−F(y^(i); θ, Q_φ(x | y)) ≈ (1/L) ∑_{l=1}^L [ −log Q_φ(x^(i,l) | y^(i)) + log P_θ(y^(i), x^(i,l)) ]
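A small numeric sketch of the reparametrized estimator for a one-dimensional Gaussian toy model (all parameter values invented), where the exact marginal likelihood is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: p(x) = N(0, 1), p(y | x) = N(theta * x, 1),
# recognition model Q_phi(x | y) = N(w*y + b, sigma^2).
theta = 1.5
phi = {"w": 0.6, "b": 0.0, "log_sigma": -0.3}

def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2

def elbo_estimate(y, L=10):
    mu = phi["w"] * y + phi["b"]
    sigma = np.exp(phi["log_sigma"])
    eps = rng.standard_normal(L)               # eps^(l) ~ p(eps) = N(0, 1)
    x = mu + sigma * eps                       # x^(i,l) = g_phi(y, eps): reparametrization
    log_q = log_normal(x, mu, sigma)           # log Q_phi(x | y)
    log_p = log_normal(x, 0.0, 1.0) + log_normal(y, theta * x, 1.0)   # log P_theta(x, y)
    return np.mean(log_p - log_q)

y_obs = 2.0
print("MC estimate of -F(y):", elbo_estimate(y_obs, L=1000))
# For this Gaussian model the exact log p(y) = log N(y; 0, theta^2 + 1),
# which upper-bounds the estimate above (up to Monte Carlo noise).
print("exact log p(y)      :", log_normal(y_obs, 0.0, np.sqrt(theta**2 + 1)))
```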


SGVB - Minibatch Estimation of the Lower Bound

With N datapoints in total and minibatches of size M:

F(y; θ, Q_φ(x | y)) ≈ F(y; θ, φ) = (N / M) ∑_{m=1}^M F(y^(m); θ, φ)

Algorithm (a toy end-to-end sketch follows):
Initialize θ, φ.
Repeat until convergence:
1. Select a random minibatch of M datapoints y^(m).
2. Generate noise samples ε^(l) ∼ p(ε).
3. Approximate the gradient ∇_{φ,θ} F(y^(m); θ, φ).
4. Update θ, φ by stochastic optimization (SGD).
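A toy end-to-end sketch of the minibatch procedure on the same one-dimensional Gaussian model; finite-difference gradients stand in for backpropagation purely to keep the sketch short, and every value is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: p(x) = N(0, 1), p(y | x) = N(theta * x, 1), Q_phi(x | y) = N(w*y + b, s^2).
def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2

def scaled_bound(params, y_batch, eps, N):
    """(N / M) * sum over the minibatch of the L = 1 reparametrized estimate of -F."""
    theta, w, b, log_s = params
    mu, sigma = w * y_batch + b, np.exp(log_s)
    x = mu + sigma * eps                                  # reparametrized samples
    log_q = log_normal(x, mu, sigma)
    log_p = log_normal(x, 0.0, 1.0) + log_normal(y_batch, theta * x, 1.0)
    return (N / len(y_batch)) * np.sum(log_p - log_q)

# Synthetic dataset drawn from the model with theta_true = 2.
N, M, theta_true = 400, 50, 2.0
data = theta_true * rng.standard_normal(N) + rng.standard_normal(N)

params = np.array([0.5, 0.0, 0.0, 0.0])                   # [theta, w, b, log_s]
lr, delta = 1e-4, 1e-4

def avg_bound(params):
    return scaled_bound(params, data, rng.standard_normal(N), N) / N

print("avg bound before training:", avg_bound(params))
for step in range(2000):
    idx = rng.choice(N, size=M, replace=False)            # 1. random minibatch
    eps = rng.standard_normal(M)                          # 2. noise samples (shared below)
    grad = np.zeros_like(params)
    for k in range(len(params)):                          # 3. finite-difference gradient
        bump = np.zeros_like(params)
        bump[k] = delta
        grad[k] = (scaled_bound(params + bump, data[idx], eps, N)
                   - scaled_bound(params - bump, data[idx], eps, N)) / (2 * delta)
    params += lr * grad                                   # 4. stochastic gradient ascent
print("avg bound after training :", avg_bound(params))    # should typically have increased
print("learned parameters [theta, w, b, log_s]:", params)
```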


SGVB - Experiments

Figure 2: Comparison of our SGVB method to the wake-sleep algorithm, in terms of optimizing the lower bound, for different dimensionality of latent space (Nz). Our method converged considerably faster and reached a better solution in all experiments. Interestingly enough, more latent variables does not result in more overfitting, which is explained by the regularizing effect of the lower bound. Vertical axis: the estimated average variational lower bound per datapoint. The estimator variance was small (< 1) and omitted. Horizontal axis: amount of training points evaluated. Computation took around 20-40 minutes per million training samples with an Intel Xeon CPU running at an effective 40 GFLOPS.

5 Experiments

We trained generative models of images from the MNIST and Frey Face datasets [3] and compared learning algorithms in terms of the variational lower bound, and the estimated marginal likelihood.

The generative model (decoder) and variational approximation (encoder) from section 3 were used, where the described encoder and decoder have an equal number of hidden units. Since the Frey Face data are continuous, we used a decoder with Gaussian outputs, identical to the encoder, except that the means were constrained to the interval (0, 1) using a sigmoidal activation function at the decoder output. Note that with hidden units we refer to the hidden layer of the neural networks of the encoder and decoder.

Parameters are updated using stochastic gradient ascent where gradients are computed by differentiating the lower bound estimator ∇_{θ,φ} L(θ, φ; X) (see algorithm 1), plus a small weight decay term corresponding to a prior p(θ) = N(0, I). Optimization of this objective is equivalent to approximate MAP estimation, where the likelihood gradient is approximated by the gradient of the lower bound.

We compared performance of SGVB to the wake-sleep algorithm [HDFN95]. We employed the same encoder (also called recognition model) for the wake-sleep algorithm and the variational auto-encoder. All parameters, both variational and generative, were initialized by random sampling from N(0, 0.01), and were jointly stochastically optimized using the MAP criterion. Stepsizes were adapted with Adagrad [DHS10]; the Adagrad global stepsize parameters were chosen from {0.01, 0.02, 0.1} based on performance on the training set in the first few iterations. Minibatches of size M = 100 were used, with L = 1 samples per datapoint.

Likelihood lower bound. We trained generative models (decoders) and corresponding encoders (a.k.a. recognition models) having 500 hidden units in case of MNIST, and 200 hidden units in case of the Frey Face dataset (to prevent overfitting, since it is a considerably smaller dataset). Figure 2 shows the results when comparing the lower bounds. Interestingly, superfluous latent variables did not result in overfitting, which is explained by the regularizing nature of the variational bound.

Marginal likelihood. For very low-dimensional latent space it is possible to estimate the marginal likelihood of the learned generative models using an MCMC estimator. More information about the

[3] Available at http://www.cs.nyu.edu/~roweis/data.html

Comparison of SGVB and the wake-sleep algorithm, in terms of the lower bound, for different dimensions of the latent space (Nz).


SGVB - Experiments

(a) Learned Frey Face manifold (b) Learned MNIST manifold

Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with SGVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z. For each of these values z, we plotted the corresponding generative p_θ(x|z) with the learned parameters θ.

(a) 2-D latent space (b) 5-D latent space (c) 10-D latent space (d) 20-D latent space

Figure 5: Random samples from learned generative models of MNIST for different dimensionalities of latent space.

Let μ and σ denote the variational mean and s.d. evaluated at datapoint i, and let μ_j and σ_j simply denote the j-th element of these vectors. Then:

∫ q_θ(z) log p(z) dz = ∫ N(z; μ, σ²) log N(z; 0, I) dz = −(J/2) log(2π) − (1/2) ∑_{j=1}^J (μ_j² + σ_j²)

And:

∫ q_θ(z) log q_θ(z) dz = ∫ N(z; μ, σ²) log N(z; μ, σ²) dz = −(J/2) log(2π) − (1/2) ∑_{j=1}^J (1 + log σ_j²)
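A quick numpy check of these two closed-form expectations against Monte Carlo estimates (the dimensionality and the particular μ, σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

J = 5
mu = rng.normal(size=J)
sigma = np.exp(rng.normal(scale=0.3, size=J))

def log_std_normal(z):
    return -0.5 * J * np.log(2 * np.pi) - 0.5 * np.sum(z**2, axis=-1)

def log_q(z):
    return (-0.5 * J * np.log(2 * np.pi) - np.sum(np.log(sigma))
            - 0.5 * np.sum(((z - mu) / sigma) ** 2, axis=-1))

z = mu + sigma * rng.standard_normal((200000, J))     # samples from q(z) = N(mu, diag(sigma^2))

analytic_Eq_log_p = -0.5 * J * np.log(2 * np.pi) - 0.5 * np.sum(mu**2 + sigma**2)
analytic_Eq_log_q = -0.5 * J * np.log(2 * np.pi) - 0.5 * np.sum(1 + np.log(sigma**2))

print("E_q[log p(z)]  analytic:", analytic_Eq_log_p, " MC:", log_std_normal(z).mean())
print("E_q[log q(z)]  analytic:", analytic_Eq_log_q, " MC:", log_q(z).mean())
# Their difference is -KL(q || p), the term that appears in the lower bound.
```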

Visualisations of learned data manifold for a generative model with two-dimensional latent space p(y|x), learned with SGVB.


Outline

Introduction

Learning Stochastic Inverses

Lower Bound over the Marginal Likelihood

Predictive Sparse Decomposition

Summary


Predictive Sparse Decomposition (PSD) - Motivation

Learning sparse representations is useful:
- features are more likely to be linearly separable in a high-dimensional space
- features are more robust to noise

The optimal sparse coding problem [2]:

min ‖X‖_0  s.t.  Y = BX

Find a representation X ∈ R^m for the signal Y ∈ R^n as a linear combination described by B ∈ R^{n×m}, where m > n.

Problem: Combinatorial search is intractable in high-dimensional spaces.

[2] ‖·‖_0 returns the number of non-zero elements of a vector.


PSD - Augmentation of the objective function

The relaxed, unconstrained optimization problem (with the l0 penalty replaced by l1):

L(Y, X; B) = (1/2) ‖Y − BX‖²_2 + λ ‖X‖_1

For efficient inference, the following recognition model is introduced,

F(Y; G, W, D) = G tanh(WY + D),

which maps from the observation Y to the latent variable X.

Force the representation X to be close to the predictor:

L(Y, X; B, P_f) = ‖Y − BX‖²_2 (reconstruction error) + λ ‖X‖_1 (sparsity) + ‖X − F(Y; P_f)‖²_2 (prediction error)

Filter matrix W ∈ R^{m×n}, vector of biases D ∈ R^m, diagonal gain matrix G ∈ R^{m×m}, P_f = {G, W, D}.
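A minimal numpy sketch of the augmented objective with random placeholder parameters; the dimensions and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 16, 64                       # signal dim n, code dim m (overcomplete: m > n)
B = rng.normal(size=(n, m))         # decoder / basis
W = rng.normal(size=(m, n))         # encoder filters
D = rng.normal(size=m)              # encoder biases
g = np.abs(rng.normal(size=m))      # diagonal of the gain matrix G
lam = 0.5

def predictor(Y):
    """F(Y; G, W, D) = G tanh(W Y + D) -- the trainable feed-forward encoder."""
    return g * np.tanh(W @ Y + D)

def psd_loss(Y, X):
    reconstruction = np.sum((Y - B @ X) ** 2)
    sparsity = lam * np.sum(np.abs(X))
    prediction = np.sum((X - predictor(Y)) ** 2)
    return reconstruction + sparsity + prediction

Y = rng.normal(size=n)
X = predictor(Y)                    # approximate inference: just the forward pass
print("PSD loss at the predicted code:", psd_loss(Y, X))
```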


PSD - Learning and Inference

L(Y, X; B, P_f) = ‖Y − BX‖²_2 + λ ‖X‖_1 + ‖X − F(Y; P_f)‖²_2

Learning: Find the optimal values of the basis and predictor parameters U = {B, P_f}:

1. Keep U constant and minimize L(Y, X; U) with respect to X.
2. Update U by one step of stochastic gradient descent: U ← U − η ∂L/∂U.

Inference (both modes are sketched below):
- Approximate inference: only use the forward prediction X = F(Y; P_f).
- Optimal inference: run iterative gradient descent, X* = argmin_X L.
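A sketch of both inference modes for a fixed random model (placeholder values): the cheap forward prediction, and iterative minimization of L over X using proximal-gradient (ISTA-style) steps, which is one standard way of handling the l1 term; the paper's own optimizer may differ.

```python
import numpy as np

rng = np.random.default_rng(6)

n, m, lam = 16, 64, 0.5
B = rng.normal(size=(n, m)) / np.sqrt(n)
W, D, g = rng.normal(size=(m, n)), rng.normal(size=m), np.abs(rng.normal(size=m))

predictor = lambda Y: g * np.tanh(W @ Y + D)
loss = lambda Y, X: (np.sum((Y - B @ X) ** 2) + lam * np.sum(np.abs(X))
                     + np.sum((X - predictor(Y)) ** 2))

def infer_optimal(Y, n_steps=200):
    Xf = predictor(Y)
    X = Xf.copy()
    step = 1.0 / (2 * (np.linalg.norm(B, 2) ** 2 + 1))       # safe step size for ISTA
    for _ in range(n_steps):
        grad = 2 * B.T @ (B @ X - Y) + 2 * (X - Xf)           # gradient of the smooth part
        Z = X - step * grad
        X = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)   # soft-thresholding
    return X

Y = rng.normal(size=n)
X_fast = predictor(Y)                 # approximate inference (one forward pass)
X_star = infer_optimal(Y)             # "optimal" inference by iterative descent
print("loss at F(Y)      :", loss(Y, X_fast))
print("loss after descent:", loss(Y, X_star))   # should be lower
```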


PSD - Experimental Results

Comparison of PSD and the exact feature-sign (FS) algorithm on the Caltech 101 dataset:

Using PSD is more than 100 times faster than FS. The speed advantage increases with a lower sparsity penalty λ.

Recognition accuracy versus measured sparsity (average l1-norm) shows no statistically significant difference.


Outline

Introduction

Learning Stochastic Inverses

Lower Bound over the Marginal Likelihood

Predictive Sparse Decomposition

Summary


Summary

Reference | Generative model | Recognition model | Type of inference
Stochastic inverses [Stuhlmuller'12] | Bayes net | Bayes net | MCMC
Helmholtz machine [Zemel'95] | Sigmoidal NN | Sigmoidal NN | Wake-sleep algorithm
SGVB [Kingma'14] | Bayes net (continuous latent variables) | Bayes net (continuous latent variables) | Variational inference
PSD [Kavukcuoglu'10] | Sparse latent linear model | Nonlinear mapping | Convex optimization
[Jimenez-Rezende'14] | Gaussian net | Gaussian net | Variational inference + stochastic backpropagation
[Salakhutdinov'10] | Boltzmann machines | - | Adaptive MCMC
[Salakhutdinov'12] | Boltzmann machines | - | Variational inference
[Rasmussen'03] | (any complex) generative model | Gaussian process | Hybrid Monte Carlo

Is this a novel paradigm in machine learning for fast inference?

[Kavukcuoglu, Ranzato, and LeCun 2010] [Dayan et al. 1995] [Kingma and Welling 2013] [Rezende, Mohamed, and Wierstra 2014] [Stuhlmuller, Taylor, and Goodman 2013]


References

Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. In: Neural Computation 7.5 (1995), pp. 889–904.

Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. In: arXiv preprint arXiv:1010.3467 (2010).

D. P. Kingma and M. Welling. Stochastic Gradient VB and the Variational Auto-Encoder. In: ArXiv e-prints (Dec. 2013). arXiv:1312.6114 [stat.ML].

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models. In: arXiv preprint arXiv:1401.4082 (2014).

Andreas Stuhlmuller, Jacob Taylor, and Noah Goodman. Learning Stochastic Inverses. In: Advances in Neural Information Processing Systems 26. Ed. by C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger. 2013, pp. 3048–3056.
