EMPOWERING PROBABILISTIC INFERENCE WITH STOCHASTIC DEEP NEURAL NETWORKS

Guoqing Zheng

THESIS COMMITTEE:

Yiming Yang, Co-Chair (Carnegie Mellon University)

Jaime Carbonell, Co-Chair (Carnegie Mellon University)

Pradeep Ravikumar (Carnegie Mellon University)

John Paisley (Columbia University)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Language Technologies Institute
School of Computer Science
Carnegie Mellon University

November 2017

Copyright © 2017 by Guoqing Zheng, Pittsburgh PA, USA

ABSTRACT

Probabilistic models are powerful tools for understanding real-world data from various domains, such as natural language, images, and temporal time series. Complex and flexible probabilistic models are often preferred for accurate modeling; however, inference difficulties often arise due to the high computational or design cost of the inference algorithm. Meanwhile, recent advances with deep neural networks in both supervised and unsupervised learning have shown prominent advantages in learning complex deterministic functions from input to output. Integrating deep neural networks into probabilistic modeling thus becomes an important research direction. Though existing research has opened the door to using deep neural networks to model stochasticity for probabilistic modeling, it still suffers from limitations, such as a) the family of distributions that can be captured for inference is limited, b) probabilistic statements about the data cannot be made for some models, even though they take uncertainty into account, and c) applications to discrete and dynamic temporal data have not yet been fully explored.

In this thesis, we aim to address the above limitations of incorporating stochastic deep neural networks for probabilistic inference. Specifically, we propose: a) to enrich the family of variational distributions for inference, b) to equip models that have been shown to capture real data well with probabilistic statements, and c) to explore applications of stochastic neural inference to domains where the data is discrete and dynamic, such as natural language and temporal time series. Preliminary experimental results have demonstrated the effectiveness of the proposed approaches.

CONTENTS

1 Introduction
  1.1 Overview
  1.2 Thesis Statements and Contributions
2 Asymmetric Variational Autoencoders
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Variational Autoencoder (VAE)
    2.2.2 Importance Weighted Autoencoder (IWAE)
    2.2.3 Normalizing Flows
  2.3 The Proposed Method
    2.3.1 Flexible Lower Bounds with Auxiliary Variables
    2.3.2 Learning with Monte Carlo Estimates
  2.4 Connection to Related Methods
    2.4.1 Normalizing flow as a special case
    2.4.2 Other methods with auxiliary variables
    2.4.3 Adversarial learning based inference models
  2.5 Experiments
    2.5.1 Setups
    2.5.2 Generative Density Estimation
    2.5.3 Generated Samples
  2.6 Conclusions
3 Convolutional Normalizing Flows
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Transformation of random variables
    3.2.2 Normalizing flows
  3.3 A new transformation unit
    3.3.1 Normalizing flow with d hidden units
    3.3.2 Convolutional Flow
  3.4 Experiments
    3.4.1 Synthetic data
    3.4.2 Handwritten digits and characters
  3.5 Conclusions
4 Generative Adversarial Networks with Likelihood
  4.1 Generative Adversarial Networks
  4.2 Equipping GAN with likelihood evaluation
5 Neural Variational Topic Modeling
  5.1 Neural Variational Text Modeling
  5.2 Neural Variational Topic Modeling
    5.2.1 Preliminary Experimental Results
6 Stochastic Neural Inference for Temporal Data
  6.1 Temporal VAE
7 Timeline
Bibliography

LIST OF FIGURES

Figure 1: Inference models for VAE, AVAE, and AVAE with k auxiliary random variables (the generative model is fixed as shown in Figure 1a, left). Note that multiple arrows pointing to a node indicate one stochastic layer, with the source nodes concatenated as input to the stochastic layer and the target node as stochastic output. One stochastic layer could consist of multiple deterministic layers. (For the detailed architecture used in the experiments, refer to the Experiments section.)
Figure 2: Training data and random samples
Figure 3: (a) Illustration of 1-D convolution, where the dimensions of the input/output variable are both 8 (the input vector is padded with 0), the width of the convolution filter is 3 and dilation is 1; (b) A block of ConvFlow layers stacked with different dilations.
Figure 4: Approximation performance with different numbers of ConvBlocks
Figure 5: Training data and generated samples
Figure 6: GAN architecture (figure reproduced from [4])
Figure 7: NVDM architecture (figure reproduced from [19])

LIST OF TABLES

Table 1: MNIST test set NLL with generative models G1 and G2 (lower is better)
Table 2: OMNIGLOT test set NLL with generative models G1 and G2 (lower is better)
Table 3: MNIST test set NLL with generative models G1 and G2 (lower is better, K is number of ConvBlocks)
Table 4: OMNIGLOT test set NLL with generative models G1 and G2 (lower is better, K is number of ConvBlocks)
Table 5: Timetable

1 INTRODUCTION

In this chapter we introduce and explain background material on probabilistic modeling with deep neural networks, and raise a set of research questions to explore and answer in this thesis.

1.1 Overview

Probabilistic models are powerful tools for modeling real-world data from various domains, such as natural language, images, and time series, offering rich flexibility, accurate prediction, and meaningful interpretations. Two critical components need to be addressed in a successful probabilistic modeling framework: model specification and model inference. Complex and flexible probabilistic models are often preferred for accurate modeling, particularly for data domains with rich structure, such as images, natural language, and time series; however, inference difficulties often arise due to the high computational or design cost of the inference counterpart.

Meanwhile, recent advances with deep neural networks in both supervised and unsupervised learning have shown prominent advantages in representing and learning complex functions, and also shed light on improving probabilistic modeling. Integrating deep neural networks into probabilistic modeling thus becomes an important research direction. Deep neural networks can be used to enhance both components of a probabilistic modeling framework, i.e., the generative model and the inference model. The first work in this direction is the Variational Autoencoder (VAE), which uses deep neural networks to represent both the generative model and the inference model. It has opened the door to using deep neural networks for probabilistic modeling, though it suffers from limitations, which this thesis aims to address.

The main obstacles to fully empowering probabilistic inference with deep neural networks result from two fundamental characteristics of deep neural networks:

a) Deep neural networks are best known for their ability to learn and approximate arbitrary and complex deterministic functions mapping from their input to their output; however, there are only a very limited number of ways to inject randomness into the network to model uncertainty about the quantity of interest;

b) Even for the few ways in which uncertainty can be modeled, the family of probability distributions that allows efficient learning is still limited, which hampers their use for general-purpose stochastic inference.

In particular, Variational Autoencoders (VAE) [11] and Generative Adversarial Networks (GAN) [9] are two representative works that try to incorporate and model uncertainty in data, and they have shown success in probabilistic modeling of data from various domains. However, there are also limitations in their ability to represent arbitrarily complex probabilistic models. On one hand, VAE injects randomness into its network architecture, and in order to use stochastic gradient descent for network training, often over-simplified distributions over the randomness are assumed, such as Gaussians. On the other hand, GAN doesn't inject randomness into the network itself; rather, a random noise source is assumed as part of the input. Since the network itself is still deterministic and it only takes samples from the density of interest, techniques and tricks for training deterministic functions can be applied here. However, one key drawback of GAN is that it's difficult to deliver probabilistic statements about its behavior, such as how likely the network is to generate a certain sample, which also limits the extent to which it can be used for probabilistic modeling.

1.2 Thesis Statements and Contributions

In this thesis, we aim to explore and address the above mentioned problems. Particularly, we claim that we are currently far from fully harnessing the power of DNNs to manipulate stochasticity for probabilistic inference, and we propose to achieve better probabilistic inference with DNNs by asking and answering the following research questions:

Research question 1: The original Variational Autoencoder relies on the reparameterization trick to construct the inference model for variational inference. Can we make the inference model of VAE more flexible, to accommodate variational families that might not admit reparameterization tricks?

Proposed work: We propose to cover a much richer variational family q for VAE via two methods. One is to incorporate auxiliary variables into the VAE framework, yielding a model we term Asymmetric Variational Autoencoders (see Chapter 2); the other is to propose a new family of neural network layers for density transformation to capture complex posterior families (see Chapter 3).

Research question 2: Generative Adversarial Networks are known to generate sharp and accurate data samples; however, probabilistic statements about the generated data points are still lacking. Is it possible to equip GAN with likelihood interpretations?

Proposed work: We propose to address this problem by constructing the generator of GAN with bijective transformations to enable density evaluation of generated samples, particularly with the new density transformation layer. (See Chapter 4)

Research question 3: Variational Autoencoders are efficient for inference, and are adopted mainly to model continuous data, such as images. Not much effort has been devoted to the discrete data case. On the other hand, topic modeling is a powerful technique for understanding text. Is it possible to also integrate variational autoencoders with classical probabilistic graphical models for text data, such as topic models?

Proposed work: We propose to combine stochastic neural inference with traditional probabilistic graphical models. In particular, we explore and address how VAE can be combined with topic models such as LDA for inference. (See Chapter 5)

Research question 4: Dynamics in the data is another important aspect for better understanding; in addition to modeling data from a static point of view, is it possible to also address dynamics with the flexible neural probabilistic models, e.g., for time series data?

Proposed work: To capture complex uncertainty in temporal data analysis, we propose to apply the novel density transformation layers to the VAE framework to model complex temporal and multivariate stochastic dependencies in temporal data. (See Chapter 6)

2 ASYMMETRIC VARIATIONAL AUTOENCODERS

Variational inference for latent variable models is prevalent in various machine learning problems, typically solved by maximizing the Evidence Lower Bound (ELBO) of the true data likelihood with respect to a variational distribution. However, freely enriching the family of variational distributions is challenging, since the ELBO requires variational likelihood evaluations of the latent variables. In this chapter, we propose a novel framework to enrich the variational family based on an alternative lower bound, by introducing auxiliary random variables to the variational distribution only. While offering a much richer family of complex variational distributions, the resulting inference network is likelihood almost free, in the sense that only the latent variables require evaluations from simple likelihoods, and samples from all the auxiliary variables are sufficient for maximum likelihood inference. We show that the proposed approach is essentially optimizing a probabilistic mixture of ELBOs, thus enriching modeling capacity and enhancing robustness. It outperforms state-of-the-art methods in our experiments on several density estimation tasks.

2.1 Introduction

Estimating posterior distributions is the primary focus of Bayesian inference, where we are interested in how our belief over the variables in our model changes after observing a set of data. Predictions can also benefit from Bayesian inference, as every prediction will be equipped with a confidence interval representing how sure the prediction is. Compared to the maximum a posteriori (MAP) estimator of the model parameters, which is a point estimator, the posterior distribution provides richer information about model parameters and hence more justified predictions.

Among various inference algorithms for posterior estimation, variational inference (VI) and Markov Chain Monte Carlo (MCMC) are the most widely used ones. It is well known that MCMC suffers from slow mixing time, though asymptotically the chained samples will approach the true posterior. Furthermore, for latent variable models (LVMs) where each sampled data point is associated with a latent variable, the number of simulated Markov chains increases with the number of data points, making the computation too costly. VI, on the other hand, facilitates faster inference because it optimizes an explicit objective function and its convergence can be measured and controlled. Hence, VI has been widely used in many Bayesian models, such as the mean-field approach for Latent Dirichlet Allocation [1], etc. To enrich the family of distributions over the latent variables, neural network based variational inference methods have also been proposed, such as the Variational Autoencoder (VAE) [11], the Importance Weighted Autoencoder (IWAE) [3] and others [12, 20, 24]. These methods outperform the traditional mean-field based inference algorithms due to their flexible distribution families and easy-to-scale algorithms, therefore becoming the state of the art for variational inference.

The aforementioned VI methods are essentially maximizing the evidence lower bound (ELBO), i.e., the lower bound of the true marginal data likelihood, defined as

$$\log p_\theta(x) = \log \mathbb{E}_{z\sim q_\phi(z|x)} \frac{p(z)\,p(x|z)}{q(z|x)} \;\geq\; \mathbb{E}_{z\sim q_\phi(z|x)} \log \frac{p(z)\,p(x|z)}{q(z|x)} \tag{1}$$

Notice that the equality holds if and only if qφ(z|x) = pθ(z|x), and otherwise a gap always exists. The more flexible the variational family q(z|x) is, the more likely it can match the true posterior p(z|x). However, arbitrarily enriching the variational model family q is non-trivial, since optimizing Eq. 1 always requires evaluations of q(z|x), no matter what architecture is used to model q.

In this chapter we propose to optimize an alternative lower bound of the true data likelihood, by introducing auxiliary variables to the variational model only. Most importantly, likelihood evaluations are not required for the auxiliary variables and easy sampling is admitted. This essentially results in a likelihood almost free inference network, in the sense that only variational likelihoods for the actual latent variables are needed, while samples for all auxiliary variables are sufficient for maximum likelihood inference. We argue that the resulting inference network is essentially learning a mixture of different variational posteriors, and thus enables modeling a much richer and more flexible family of complex posterior distributions. It can be shown that the new framework subsumes several recently developed neural variational methods as special cases. We conduct empirical evaluations on several density estimation tasks, which validate the effectiveness of the proposed method.

The rest of the chapter is organized as follows: we briefly review two existing approaches for inference network modeling, and present our proposed likelihood almost free inference network in the following section. We then point out the connections of the proposed framework to other related ones. Empirical evaluations and analysis are carried out in the Experiments section, and we conclude the chapter in the last section.

2.2 Preliminaries

In this section, we briefly review several existing methods that aim to enrich inference networks with flexible neural network architectures.

2.2.1 Variational Autoencoder (VAE)

Given a generative model pθ(x, z) = pθ(z)pθ(x|z) defined over data x and latent variable z, indexed by parameter θ, variational inference aims to approximate the intractable posterior p(z|x) with qφ(z|x), indexed by parameter φ, such that the ELBO is maximized:

$$\mathcal{L}_{\text{VAE}}(x) \equiv \mathbb{E}_q \log p(x, z) - \mathbb{E}_q \log q(z|x) \;\leq\; \log p(x) \tag{2}$$

Parameters of both the generative distribution p and the variational distribution q are learned by maximizing the ELBO with stochastic gradient methods.¹ Specifically, VAE [11] assumes that both the conditional distribution of the data given the latent codes in the generative model and the variational posterior distribution are Gaussians, whose means and diagonal covariances are parameterized by two neural networks, termed the generative network and the inference network, respectively. Model learning is possible due to the re-parameterization trick [11], which makes back-propagation through the stochastic variables possible.
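To make the reparameterization trick concrete, the following is a minimal, self-contained PyTorch sketch of a Gaussian-posterior VAE ELBO. The Bernoulli observation model matches the binarized-image experiments later in the chapter, but the layer sizes and overall shapes here are illustrative assumptions, not the exact architectures used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Minimal VAE sketch: Gaussian q(z|x), Bernoulli p(x|z)."""
    def __init__(self, x_dim=784, z_dim=50, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps the sample
        # differentiable w.r.t. the inference-network parameters.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        logits = self.dec(z)
        # E_q[log p(x|z)] under a Bernoulli observation model
        log_px_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
        # Analytic KL(q(z|x) || N(0, I))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
        return (log_px_z - kl).mean()

# Usage: maximize the ELBO, i.e. minimize its negative:
# vae = GaussianVAE(); loss = -vae.elbo(x_batch); loss.backward()
```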

2.2.2 Importance Weighted Autoencoder (IWAE)

The above ELBO is a lower bound of the true data log-likelihood log p(x); hence [3] proposed IWAE to directly estimate the true data log-likelihood in the presence of the variational model², namely

$$\log p(x) = \log \mathbb{E}_{q} \frac{p(x, z)}{q(z|x)} \;\geq\; \mathbb{E}_{z_1,\dots,z_m \sim q}\left[\log \frac{1}{m}\sum_{i=1}^{m}\frac{p(x, z_i)}{q(z_i|x)}\right] \equiv \mathcal{L}_{\text{IWAE}}(x) \tag{3}$$

where m is the number of importance weighted samples. The above bound is tighter than the ELBO used in VAE. When trained on the same network structure as VAE, with the above estimate as the training objective, IWAE achieves considerable improvements over VAE on various density estimation tasks [3]; a similar idea is also considered in [21].

¹ We drop the dependencies of p and q on parameters θ and φ to prevent clutter.
² The variational model is also referred to as the inference model, hence we use the terms interchangeably.

2.2.3 Normalizing Flows

Another idea to enrich the model capacity to cover complex distributions is normalizing flows [24], whose central idea is transforming a simple variable with known density to construct complex densities. Given a random variable z ∈ R^d with density p(z), consider a smooth and invertible function f : R^d → R^d operated on z. Let z' = f(z) be the resulting random variable; the density of z' is

$$p(z') = p(z)\left|\det\frac{\partial f^{-1}}{\partial z'}\right| = p(z)\left|\det\frac{\partial f}{\partial z}\right|^{-1} \tag{4}$$

Normalizing flows thus consider successively transforming z0 with a series of transformations {f1, f2, ..., fK} to construct arbitrarily complex densities for zK = fK ∘ fK−1 ∘ ... ∘ f1(z0) as

$$\log p(z_K) = \log p(z_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right| \tag{5}$$

Hence the complexity lies in computing the determinant of the Jacobian matrix. Without further assumptions about f, the general complexity for that is O(d³), where d is the dimension of z. In order to accelerate this, [24] proposed the following family of transformations, termed planar flow:

$$f(z) = z + u\, h(w^\top z + b) \tag{6}$$

where w ∈ R^d, u ∈ R^d, b ∈ R are parameters and h(·) is a univariate non-linear function with derivative h'(·). For this family of transformations, the determinant of the Jacobian matrix can be efficiently computed to facilitate model training:

$$\det\frac{\partial f}{\partial z} = \det\left(I + u\,\psi(z)^\top\right) = 1 + u^\top \psi(z) \tag{7}$$

where ψ(z) = h'(w⊤z + b) w. The computation cost of the determinant is hence reduced from O(d³) to O(d).

Applying f to z can be viewed as feeding the input variable to a neural network with only one single hidden unit, followed by a linear output layer which has the same dimension as the input layer. Obviously, because of the bottleneck caused by the single hidden unit, the capacity of the family of transformed densities is hence limited.


2.3 The Proposed Method

In this section we propose to maximize an alternative lower bound of the data likelihood, which essentially results in an inference network that is almost likelihood free, in the sense that only the latent variables require evaluations from simple likelihoods, and samples from all the auxiliary variables are sufficient for maximum likelihood inference.

2.3.1 Flexible Lower Bounds with Auxiliary Variables

The true data log-likelihood can be rewritten as

$$\log p(x) = \log \mathbb{E}_{q(z|\tau, x)} \frac{p(x, z)}{q(z|\tau, x)} \equiv \mathcal{L}_\tau(x) \tag{8}$$

for any valid (conditional) variational density q(z|τ, x), where τ is an auxiliary variable (parameter) introduced into the variational model q. Notice that τ appears in the subscript of L_τ(x) to explicitly indicate that any practical estimate of L_τ(x) will depend on the choice of τ.

With the above formulation, one is tempted to maximize the benefits of the auxiliary variable τ by finding the most suitable τ, e.g., by maximizing

$$\mathcal{L}_{\max}(x) \equiv \max_{\tau} \log \mathbb{E}_{q(z|\tau, x)} \frac{p(x, z)}{q(z|\tau, x)} \tag{9}$$

In the context of formulating the variational distribution with neural networks, this is equivalent to adding more deterministic layers to the inference network, to accommodate the mapping from x to τ and then the mapping from τ to the stochastic layer over z. This will certainly increase the model capacity of the variational distribution due to the additional parameters; however, a key shortcoming is that even with the learned, fixed τ*, it is eventually still one single density q(z|τ*, x), which can hardly capture all the stochastic behaviors of the latent variable z given a data point x.

For instance, when modeling binary data with classic VAE and IWAE, which typically assume that a data point is generated from a multivariate Bernoulli conditioned on its latent variable, whose prior is a Gaussian, a single Gaussian variational distribution will not be able to accurately describe the obviously non-Gaussian posterior. To this end, we propose to treat τ as a random variable with proper support and to maximize the log-likelihood with the expectation over τ, which is instantiated for VAE and IWAE as

$$\mathcal{L}_{\text{ASY-VAE}}(x) \equiv \mathbb{E}_{q(\tau|x)}\,\mathbb{E}_{q(z|\tau, x)}\left[\log p(x, z) - \log q(z|\tau, x)\right] \tag{10}$$


and

$$\mathcal{L}_{\text{ASY-IWAE}}(x) \equiv \mathbb{E}_{q(\tau|x)}\log \mathbb{E}_{q(z|\tau, x)} \frac{p(x, z)}{q(z|\tau, x)} \tag{11}$$

respectively.

No likelihood evaluations required for τ. One key advantage brought by the auxiliary variable τ is that both terms inside the inner expectations do not involve q(τ|x); hence no likelihood evaluations are required for τ when Monte Carlo based methods are used to optimize the above bounds. So instead of directly modeling q(τ|x), which is not easy to manipulate, we choose to model the sample generation process from q(τ|x). To fully enrich model flexibility, we use a neural network f to construct τ given x and a random noise vector ε as

$$\tau = f(x, \varepsilon) \tag{12}$$

Due to the presence of the auxiliary variable τ, the inference model now implicitly defines a probability measure over (τ, z), resulting in a structure asymmetric to standard variational autoencoders and their variants, where the stochasticity of the inference model q and the generative model p are both defined over z. We therefore term the model with the above bounds the Asymmetric Variational Autoencoder (AVAE), which includes ASY-VAE and ASY-IWAE as the two instantiations for VAE and IWAE, respectively.

Intuitively, AVAE can be thought of as optimizing a mixture of various bounds L_τ, which enhances the robustness of model learning; this is of particular importance to Monte Carlo methods for estimating the bound and its gradients w.r.t. its parameters. Moreover, the resulting inference model enjoys higher flexibility, with the potential to capture complex structure of the posterior distribution, such as multi-modality.

For completeness, we briefly note that

Proposition 1 Both L_ASY-VAE(x) and L_ASY-IWAE(x) are lower bounds of the true data log-likelihood, satisfying log p(x) = L_ASY-IWAE(x) ≥ L_ASY-VAE(x).

The proof follows directly from Jensen's inequality and is hence omitted. Figure 1 shows a comparison of the inference models of the classic VAE and the proposed AVAE.

Remark 1 Though the first equality holds for any choice of distribution q(τ|x) (whether τ depends on x or not), for practical estimation with Monte Carlo methods it becomes an inequality (log p(x) ≥ L̂_ASY-IWAE(x)), and the bound tightens as the number of importance samples is increased [3]. The second inequality always holds when estimated with Monte Carlo samples.

Figure 1: Inference models for VAE (a), AVAE (b), and AVAE with k auxiliary random variables (c). The generative model is fixed, as shown in Figure 1a (left). Note that multiple arrows pointing to a node indicate one stochastic layer, with the source nodes concatenated as input to the stochastic layer and the target node as stochastic output. One stochastic layer could consist of multiple deterministic layers. (For the detailed architecture used in the experiments, refer to the Experiments section.)

Remark 2 The above bounds only involve one auxiliary variable τ; however, τ can also be a set of auxiliary variables. Moreover, with the same motivation, we can make the variational family of AVAE even more flexible by defining a series of k auxiliary variables, such that q(z, τ1, ..., τk|x) = q(τ1|x) q(τ2|τ1) ... q(τk|τk−1) q(z|τ1, ..., τk, x), with the sample generation process

$$\tau_1 = f_1(x, \varepsilon_1), \qquad \tau_i = f_i(\tau_{i-1}, \varepsilon_i) \ \text{ for } i = 2, 3, \dots, k \tag{13}$$

and we have

Proposition 2 The AVAE objective with k auxiliary random variables {τ1, τ2, ..., τk} is also a lower bound to the true log-likelihood, satisfying log p(x) = L_ASY-IWAE-k ≥ L_ASY-VAE-k, where

$$\mathcal{L}_{\text{ASY-VAE-}k}(x) \equiv \mathbb{E}_{q(\tau_1|x)}\mathbb{E}_{q(\tau_2|\tau_1)}\cdots\mathbb{E}_{q(\tau_k|\tau_{k-1})}\mathbb{E}_{q(z|x,\tau_1,\dots,\tau_k)}\big[\log p(x, z) - \log q(z|x, \tau_1, \dots, \tau_k)\big] \tag{14}$$

and

$$\mathcal{L}_{\text{ASY-IWAE-}k}(x) \equiv \mathbb{E}_{q(\tau_1|x)}\mathbb{E}_{q(\tau_2|\tau_1)}\cdots\mathbb{E}_{q(\tau_k|\tau_{k-1})}\log \mathbb{E}_{q(z|x,\tau_1,\dots,\tau_k)} \frac{p(x, z)}{q(z|x, \tau_1, \dots, \tau_k)} \tag{15}$$

Figure 1c illustrates the inference model of an AVAE with k auxiliary variables.

2.3.2 Learning with Monte Carlo Estimates

With no additional assumptions made about the generative model p and the variational model q, other than that they are parameterized by neural networks with stochastic layers, for both ASY-VAE and ASY-IWAE we can estimate the corresponding bounds L_ASY-VAE and L_ASY-IWAE and their gradients with ancestral sampling from the model.

For example, for ASY-VAE with one auxiliary variable τ, we estimate

$$\hat{\mathcal{L}}_{\text{ASY-VAE}}(x) = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\big(\log p(x, z_{ij}) - \log q(z_{ij}|\tau_i, x)\big) \tag{16}$$

and

$$\hat{\mathcal{L}}_{\text{ASY-IWAE}}(x) = \frac{1}{n}\sum_{i=1}^{n}\log \frac{1}{m}\sum_{j=1}^{m}\frac{p(x, z_{ij})}{q(z_{ij}|\tau_i, x)} \tag{17}$$

where n is the number of τs sampled from the current q(τ|x) and m is the number of zs sampled from the conditional q(z|τi, x) for every i = 1, ..., n. The parameters of both the inference model and the generative model are jointly learned by maximizing the above bounds. Besides, back-propagation through the stochastic variable z (typically assumed to be Gaussian for continuous latent variables) is possible through the re-parameterization trick, and this naturally also holds for all the auxiliary variables τ since they are defined in a generative manner.
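The estimator of Eq. (17) is straightforward to implement once samplers for τ and z and the corresponding log-densities are available. The sketch below is a hedged illustration only: the function handles sample_tau, sample_z, log_p_xz and log_q_z are hypothetical stand-ins for the inference and generative networks, not names used in the thesis.

```python
import math
import torch

def asy_iwae_bound(x, sample_tau, sample_z, log_p_xz, log_q_z, n=1, m=5):
    """Monte Carlo estimate of Eq. (17) for a batch x of shape (batch, x_dim)."""
    outer = []
    for _ in range(n):                       # n samples of the auxiliary variable tau
        tau = sample_tau(x)                  # tau = f(x, eps); no likelihood of tau needed
        log_w = []
        for _ in range(m):                   # m importance-weighted samples of z
            z = sample_z(x, tau)             # z ~ q(z | x, tau), via reparameterization
            log_w.append(log_p_xz(x, z) - log_q_z(z, x, tau))
        log_w = torch.stack(log_w, dim=0)    # shape: (m, batch)
        # log (1/m) * sum_j w_j, computed stably in log space
        outer.append(torch.logsumexp(log_w, dim=0) - math.log(m))
    return torch.stack(outer, dim=0).mean(dim=0)   # average over the n tau samples
```

Setting n = 1, as in the experiments reported below, reduces the outer loop to a single auxiliary sample per data point.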

One might suspect that, with the above estimators for the loss and its gradients, the introduced auxiliary variables will make the Monte Carlo estimates of high variance, due to the fact that more sources of randomness are introduced into the variational distribution and thus an exponential number of samples of the auxiliary variables would be needed to achieve an accurate estimate of the above bound, since q(z|x) = ∫ q(τ|x) q(z|x, τ) dτ. However, we emphasize that, since the bound only involves the conditional density of the latent variables given all other variables, namely q(z|x, τ), and since the bound is still valid for any configuration of samples of the auxiliary variables τ, the proposed method can still explore a richer distribution family than without introducing the auxiliary variable. In fact, in all our empirical evaluations of AVAE with different layers of auxiliary variables, we set n = 1, and the experimental results still favor the proposed method over vanilla VAE and IWAE.

2.4 Connection to Related Methods

Before we proceed to the experimental evaluations of the proposed methods, we highlight the relations of AVAE to other similar methods.


2.4.1 Normalizing flow as a special case

We now show that the proposed framework also covers another recently proposed idea as a special case. When z = g(τ), that is, when z is modeled as a deterministic function of τ, the above framework essentially reduces to Normalizing Flow (NF) [24]. When z = f(x, τ), the above bound becomes

$$\begin{aligned}
\log p_\theta(x) &\geq \mathbb{E}_{\tau} \log \mathbb{E}_{z\sim q_\phi(z|\tau,x)} \frac{p_\theta(z)\,p_\theta(x|z)}{q_\phi(z|x,\tau)} &\text{(18)}\\
&= \mathbb{E}_{\tau} \log \frac{p_\theta(z{=}f(x,\tau))\,p_\theta(x|z{=}f(x,\tau))}{q_\phi(z{=}f(x,\tau)|x,\tau)} &\text{(19)}\\
&= \mathbb{E}_{\tau} \log p_\theta(z{=}f(x,\tau)) + \mathbb{E}_{\tau} \log p_\theta(x|z{=}f(x,\tau)) - \mathbb{E}_{\tau} \log q_\phi(z{=}f(x,\tau)|x,\tau) &\text{(20)}\\
&= \mathbb{E}_{\tau} \log p_\theta(z{=}f(x,\tau)) + \mathbb{E}_{\tau} \log p_\theta(x|z{=}f(x,\tau)) - \left(\mathbb{E}_{\tau} \log q_\phi(\tau|x) - \mathbb{E}_{\tau} \log\left|\det\frac{\partial f}{\partial \tau}\right|\right) &\text{(21)}
\end{aligned}$$

where we assume that f is invertible w.r.t. τ; this is essentially the bound optimized by Normalizing Flows. When NF assumes multiple layers of transformations, it can also be thought of as defining a series of random variables by warping the variables from previous layers, such that z_i = f(z_{i−1}). However, it requires f to be bijective to enable likelihood evaluation of the intermediate variables, while AVAE doesn't place such a restriction on f.

2.4.2 Other methods with auxiliary variables

Hierarchical Variational Models (HVM) [23] and Auxiliary Deep Generative Models (ADGM) [16] are two closely related variational methods with auxiliary variables. HVM also considers enriching the variational model family, by placing a prior over the latent variable of the variational distribution q(z|x). ADGM takes another route to this goal, placing a prior over the auxiliary variable in the generative model, which in some cases keeps the marginal generative distribution of the data invariant. It has been shown by [2] that HVM and ADGM are mathematically equivalent.

However, our proposed method doesn't add any prior on the generative model and thus doesn't change the structure of the generative model. We argue that, since no priors over the introduced variables are defined, and since by the model structure most of the auxiliary variables are not directly linked to explaining the data (except for τ1 when we assume multiple auxiliary variables for the variational distribution), our proposed method places the least restrictive constraints on the variational distribution, which enriches model capacity for more accurate posterior approximation.

2.4.3 Adversarial learning based inference models

Adversarial learning based inference models, such as Adversarial Autoencoders [17], Adversarial Variational Bayes [18], and Adversarially Learned Inference [7], aim to maximize the ELBO without any variational likelihood evaluations at all. It can be shown that for the above adversarial learning based models, when the discriminator is trained to its optimum, the model is equivalent to optimizing the ELBO. However, due to the minimax game involved in the adversarial setting, in practice it is not guaranteed at any given moment that they are optimizing a lower bound of the true data likelihood, thus no maximum likelihood learning interpretation can be provided. In contrast, our proposed framework doesn't require variational likelihood evaluations for almost all the variables in the variational model, except for the actual latent variables z, while still maintaining the maximum likelihood interpretation.

2.5 Experiments

2.5.1 Setups

To test our proposed AVAE for variational inference we use the standard benchmark datasets MNIST³ and Omniglot⁴ [14]. Our method is general and can be applied to any formulation of the generative model pθ(x, z). For simplicity and fair comparison, in this chapter we focus on densities defined by stochastic neural networks, i.e., a broad family of flexible probabilistic generative models with parameters defined by neural networks. Specifically, we consider the following two families of generative models:

$$G_1: \; p_\theta(x, z) = p_\theta(z)\,p_\theta(x|z) \tag{22}$$
$$G_2: \; p_\theta(x, z_1, z_2) = p_\theta(z_1)\,p_\theta(z_2|z_1)\,p_\theta(x|z_2) \tag{23}$$

where p(z) and p(z1) are the priors defined over z and z1 for G1 and G2, respectively. All other conditional densities are specified with their parameters θ defined by neural networks, therefore ending up with two stochastic neural networks. Such a network could have any number of layers; however, in this chapter we focus on the ones which have only one or two stochastic layers, i.e., G1 and G2, to conduct a fair comparison with previous methods on similar network architectures, such as VAE, IWAE and Normalizing Flows.

³ Data downloaded from http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/
⁴ Data downloaded from https://github.com/yburda/iwae/raw/master/datasets/OMNIGLOT/chardata.mat

We use the same network architectures for both G1 and G2 as in [3], specifically as follows:

G1: A single Gaussian stochastic layer z with 50 units. In between the latent variable z and the observation x there are two deterministic layers, each with 200 units;

G2: Two Gaussian stochastic layers z1 and z2 with 50 and 100 units, respectively. Two deterministic layers with 200 units connect the observation x and the latent variable z2, and two deterministic layers with 100 units are in between z2 and z1.

where a Gaussian stochastic layer consists of two fully connected linear layers, one outputting the mean and the other outputting the logarithm of the diagonal covariance. All other deterministic layers are fully connected with tanh nonlinearity.

For G1, an inference network with the following architecture is used:

$$\tau_1 = f_1(x \,\|\, \varepsilon_1) \quad \text{where } \varepsilon_1 \sim \mathcal{N}(0, I) \tag{24}$$
$$\tau_i = f_i(\tau_{i-1} \,\|\, \varepsilon_i) \quad \text{where } \varepsilon_i \sim \mathcal{N}(0, I), \text{ for } i = 2, \dots, k \tag{25}$$
$$q(z|x, \tau_1, \dots, \tau_k) = \mathcal{N}\big(\mu(x\|\tau_1\|\dots\|\tau_k),\ \mathrm{diag}(\sigma(x\|\tau_1\|\dots\|\tau_k))\big) \tag{26}$$

where ‖ denotes the concatenation operator. All noise vectors ε are set to be of 50 dimensions, and all other variables have the corresponding dimensions in the generative model. The inference network used for G2 is the same, except that the Gaussian stochastic layer is defined for z2; an additional Gaussian stochastic layer with z2 as input is defined for z1, with the dimensions of the variables aligned to those in the generative model G2. Further, Bernoulli observation models are assumed for both MNIST and Omniglot. For MNIST, we employ the static binarization strategy as in [15], while dynamic binarization is employed for Omniglot.
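As a concrete reading of Eqs. (24)–(26), the following PyTorch sketch builds a G1-style inference network by warping 50-dimensional Gaussian noise into k auxiliary variables and concatenating them with x. The exact depth and width of the deterministic layers inside each f_i are assumptions made for illustration, not a statement of the architecture actually used in the experiments.

```python
import torch
import torch.nn as nn

class AVAEInference(nn.Module):
    """Sketch of the G1 inference network: k auxiliary variables from warped noise."""
    def __init__(self, x_dim=784, z_dim=50, noise_dim=50, k=1, h_dim=200):
        super().__init__()
        self.k, self.noise_dim = k, noise_dim
        # tau_1 = f_1(x || eps_1); tau_i = f_i(tau_{i-1} || eps_i)
        self.f = nn.ModuleList()
        in_dim = x_dim + noise_dim
        for _ in range(k):
            self.f.append(nn.Sequential(nn.Linear(in_dim, h_dim), nn.Tanh(),
                                        nn.Linear(h_dim, noise_dim)))
            in_dim = noise_dim + noise_dim
        enc_in = x_dim + k * noise_dim
        self.enc = nn.Sequential(nn.Linear(enc_in, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        taus, prev = [], x
        for f in self.f:
            eps = torch.randn(x.size(0), self.noise_dim, device=x.device)
            prev = f(torch.cat([prev, eps], dim=1))   # warp noise into tau_i
            taus.append(prev)
        h = self.enc(torch.cat([x] + taus, dim=1))
        return self.mu(h), self.logvar(h)   # parameters of q(z | x, tau_1, ..., tau_k)
```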

Our baseline models include VAE, IWAE and NF. Since our proposed method involves adding more layers to the inference network, we also include an enhanced version of VAE with more deterministic layers added to its inference network, which we term VAE+.⁵ All models are implemented in PyTorch.⁶

⁵ VAE+ is a restricted version of AVAE with all the noise vectors ε set to be constantly 0.
⁶ http://pytorch.org/


Parameters of both the variational distribution and the generative distribution of all models are optimized with Adam [13] for 2000 epochs, with a fixed learning rate of 0.0005 and exponential decay rates for the 1st and 2nd moments of 0.9 and 0.999, respectively. Batch normalization [10] is also used, as it has been shown to improve learning for neural stochastic models [25].

2.5.2 Generative Density Estimation

For MNIST, models are trained and tuned on the 60,000 training and validation images, and the estimated log-likelihood on the test set, computed with 5000 importance weighted samples, is reported. Table 1 presents the performance of all models when the generative model is assumed to be from G1 and from G2.

Firstly, VAE+ achieves higher log-likelihood estimates than vanilla VAE due to the additional layers in the inference network, implying that a better posterior approximation is learned. Second, we observe that ASY-VAE achieves better density estimates than VAE+, which confirms our expectation that adding auxiliary variables to the inference network leads to a richer family of variational distributions. This suggests that incorporating more sources of stochasticity in the inference network is another key factor in providing a richer variational family, in addition to enlarging the variational model space by adding more deterministic layers. Similar trends can be observed for the importance weighted versions (ASY-IWAE versus IWAE). Overall, our proposed method ASY-IWAE outperforms IWAE by more than 1 nat on G1 and by 0.75 nat on G2.

Results on OMNIGLOT are presented in Table 2, where similar trends can be observed as on MNIST. One observation different from MNIST is that the gains of ASY-VAE and ASY-IWAE over VAE and IWAE, respectively, are not as large as they are on MNIST. This could be explained by the fact that Omniglot is a smaller dataset, roughly 40% of the size of MNIST.

2.5.3 Generated Samples

After the models are trained, generative samples can be obtained by feeding z ∼ N(0, I) to the learned generative model G1 (or z1 ∼ N(0, I) to G2). Since higher log-likelihood estimates are obtained with G2, Figure 2 shows random samples generated by our proposed method trained with G2 on both MNIST and Omniglot, compared to real samples from the training sets. We observe that the generated samples are visually consistent with the training data.


Table 1: MNIST test set NLL with generative models G1 and G2 (lower is better)

MNIST (static binarization)    −log p(x) on G1    −log p(x) on G2
VAE [3]                        87.88              85.65
IWAE (IW = 50) [3]             86.10              84.04
VAE+NF [24]                    -                  85.10
VAE+ (k = 1)                   87.56              85.53
VAE+ (k = 4)                   87.40              85.23
VAE+ (k = 8)                   87.28              85.07
ASY-VAE (k = 1)                87.31              85.23
ASY-VAE (k = 4)                87.16              85.08
ASY-VAE (k = 8)                87.01              84.97
ASY-IWAE (IW = 50, k = 1)      85.76              83.77
ASY-IWAE (IW = 50, k = 4)      85.31              83.52
ASY-IWAE (IW = 50, k = 8)      85.03              83.29

2.6 Conclusions

This chapter presents a new framework to enrich the variational family for variational inference, by introducing auxiliary random variables into the variational inference networks, based on an alternative lower bound of the data likelihood. We emphasize that no variational likelihood evaluations are required for the auxiliary variables, hence allowing complex distributions to be constructed by warping a simple random noise vector as input to neural networks. This leads to a likelihood almost free inference network, in the sense that only variational likelihoods for the actual latent variables are needed, while samples for all auxiliary variables are sufficient. It can be shown that the proposed inference network is essentially learning a richer probabilistic mixture of variational posteriors, thus achieving a much richer and more flexible family of variational distributions. Empirical evaluations of the instantiated Asymmetric Variational Autoencoders (AVAE) demonstrate the effectiveness of incorporating auxiliary variables in variational inference.

It remains an interesting question how many auxiliary variables are needed to best exploit the variational family for a specific problem. Also, since this chapter only focuses on enriching the variational distribution, other techniques such as Normalizing Flows and Inverse Autoregressive Flows can be combined with the proposed framework. Hence, how to effectively aggregate the proposed method of enriching the variational family with other techniques, including adversarial learning, to achieve the best possible generative modeling is another promising direction to explore. Lastly, training deep neural networks with an arbitrary number of stochastic layers remains a challenging problem, for which a principled framework can be pursued.

Table 2: OMNIGLOT test set NLL with generative models G1 and G2 (lower is better)

Omniglot                       −log p(x) on G1    −log p(x) on G2
VAE [3]                        108.86             107.93
IWAE (IW = 50) [3]             104.87             103.93
VAE+ (k = 1)                   108.80             107.89
VAE+ (k = 4)                   108.64             107.80
VAE+ (k = 8)                   108.53             107.67
ASY-VAE (k = 1)                108.74             107.82
ASY-VAE (k = 4)                108.60             107.65
ASY-VAE (k = 8)                108.41             107.43
ASY-IWAE (IW = 50, k = 1)      104.83             103.57
ASY-IWAE (IW = 50, k = 4)      104.80             103.44
ASY-IWAE (IW = 50, k = 8)      104.63             103.40


Figure 2: Training data and random samples. (a) MNIST training data; (b) random samples from ASY-IWAE with k = 8; (c) OMNIGLOT training data; (d) random samples from ASY-IWAE with k = 8.


3 CONVOLUTIONAL NORMALIZING FLOWS

Bayesian posterior inference is prevalent in various machine learning problems. Variational inference provides one way to approximate the posterior distribution; however, its expressive power is limited, and so is the accuracy of the resulting approximation. Recently, there has been a trend of using neural networks to approximate the variational posterior distribution due to the flexibility of neural network architectures. One way to construct a flexible variational distribution is to warp a simple density into a complex one via normalizing flows, where the resulting density can be analytically evaluated. However, there is a trade-off between the flexibility of the normalizing flow and the computation cost of an efficient transformation. In this chapter, we propose a simple yet effective architecture of normalizing flows, ConvFlow, based on convolution over the dimensions of the random input vector. Experiments on synthetic and real world posterior inference problems demonstrate the effectiveness and efficiency of the proposed method.

3.1 Introduction

Posterior inference is the key to Bayesian modeling, where we are interested in how our belief over the variables of interest changes after observing a set of data points. Predictions can also benefit from Bayesian modeling, as every prediction will be equipped with confidence intervals representing how sure the prediction is. Compared to the maximum a posteriori estimator of the model parameters, which is a point estimator, the posterior distribution provides richer information about the model parameters, hence enabling more justified predictions.

Among the various inference algorithms for posterior estimation, variational inference (VI) and Markov chain Monte Carlo (MCMC) are the two most widely used ones. It is well known that MCMC suffers from slow mixing time, though asymptotically the samples from the chain will be distributed according to the true posterior. VI, on the other hand, facilitates faster inference, since it optimizes an explicit objective function and convergence can be measured and controlled; it has been widely used in many Bayesian models, such as Latent Dirichlet Allocation [1], etc.


However, one drawback of VI is that it makes strong assumptions about the shape of the posterior, e.g., that the posterior can be decomposed into multiple independent factors. Though faster convergence of parameter learning can be achieved, the approximation accuracy is largely limited.

The above drawbacks stimulate interest in richer function families that approximate posteriors while maintaining acceptable learning speed. Specifically, neural networks are one such model class, with large modeling capacity and efficient learning. [24] proposed normalizing flows, where a neural network is set up to learn an invertible transformation from a known distribution, which is easy to sample from, to the true posterior. Model learning is achieved by minimizing the KL divergence between the empirical distribution of the generated samples and the true posterior. After being properly trained, the model will generate samples close to the true posterior, so that Bayesian predictions are made possible. Other methods based on modeling random variable transformations, but with different formulations, have also been explored, including NICE [5], the Inverse Autoregressive Flow [12], and Real NVP [6].

One key component for normalizing flows to work is computing the determinant of the Jacobian of the transformation, and in order to maintain fast Jacobian computation, either a very simple function is used as the transformation, such as the planar flow in [24], or complex tweaking of the transformation layer is required. Alternatively, in this chapter we propose a simple yet effective architecture of normalizing flows, based on convolution over the random input vector. Due to the nature of convolution, a bijective mapping between the input and output vectors can be easily established; meanwhile, the determinant of the convolution Jacobian can be computed in linear time. We further propose to incorporate dilated convolution [22, 28] to model long range interactions among the input dimensions. The resulting convolutional normalizing flow, which we term Convolutional Flow (ConvFlow), is simple yet effective in warping simple densities to match complex ones.

The remainder of this chapter is organized as follows: We briefly review the principles of normalizing flows in Section 3.2, and then present our proposed normalizing flow architecture based on convolution in Section 3.3. Empirical evaluations and analysis on both synthetic and real world data sets are carried out in Section 3.4, and we conclude this chapter in Section 3.5.


3.2 Preliminaries

3.2.1 Transformation of random variables

Given a random variable z ∈ R^d with density p(z), consider a smooth and invertible function f : R^d → R^d operated on z. Let z' = f(z) be the resulting random variable; the density of z' can be evaluated as

$$p(z') = p(z)\left|\det\frac{\partial f^{-1}}{\partial z'}\right| = p(z)\left|\det\frac{\partial f}{\partial z}\right|^{-1} \tag{27}$$

thus

$$\log p(z') = \log p(z) - \log\left|\det\frac{\partial f}{\partial z}\right| \tag{28}$$
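For intuition, Eq. (28) can be checked numerically on a toy invertible map; the affine map f(z) = 3z + 1 below is a hypothetical example chosen purely for illustration.

```python
import torch

# If z ~ N(0, 1) and z' = f(z) = 3*z + 1, then df/dz = 3 and Eq. (28) gives
# log p(z') = log p(z) - log 3, which must match the exact N(1, 3^2) density of z'.
base = torch.distributions.Normal(0.0, 1.0)
z = base.sample((100000,))
z_prime = 3.0 * z + 1.0
log_p_change_of_var = base.log_prob(z) - torch.log(torch.tensor(3.0))
log_p_exact = torch.distributions.Normal(1.0, 3.0).log_prob(z_prime)
print(torch.allclose(log_p_change_of_var, log_p_exact, atol=1e-5))  # True
```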

3.2.2 Normalizing flows

Normalizing flows consider successively transforming z0 with a series of transformations {f1, f2, ..., fK} to construct arbitrarily complex densities for zK = fK ∘ fK−1 ∘ ... ∘ f1(z0) as

$$\log p(z_K) = \log p(z_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right| \tag{29}$$

Hence the complexity lies in computing the determinant of the Jacobian matrix. Without further assumptions about f, the general complexity for that is O(d³), where d is the dimension of z. In order to accelerate this, [24] proposed the following family of transformations, termed planar flow:

$$f(z) = z + u\, h(w^\top z + b) \tag{30}$$

where w ∈ R^d, u ∈ R^d, b ∈ R are parameters and h(·) is a univariate non-linear function with derivative h'(·). For this family of transformations, the determinant of the Jacobian matrix can be computed as

$$\det\frac{\partial f}{\partial z} = \det\left(I + u\,\psi(z)^\top\right) = 1 + u^\top \psi(z) \tag{31}$$

where ψ(z) = h'(w⊤z + b) w. The computation cost of the determinant is hence reduced from O(d³) to O(d).

Applying f to z can be viewed as feeding the input variable z to a neural network with a single hidden unit, followed by a linear output layer of the same dimension as the input layer. Because of the bottleneck caused by the single hidden unit, the capacity of the resulting family of transformed densities is limited.
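As a concrete illustration, the planar flow of Eq. (30)–(31) can be written in a few lines of PyTorch. This is a minimal sketch under the assumption of a tanh nonlinearity; the class and variable names are ours, and the constraint on u that guarantees invertibility (see [24]) is omitted for brevity.

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """f(z) = z + u * h(w^T z + b), with log|det df/dz| = log|1 + u^T psi(z)|."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                   # z: (batch, dim)
        pre = z @ self.w + self.b                           # w^T z + b, shape (batch,)
        f = z + self.u * torch.tanh(pre).unsqueeze(-1)      # Eq. (30) with h = tanh
        psi = (1 - torch.tanh(pre) ** 2).unsqueeze(-1) * self.w   # psi(z) = h'(w^T z + b) w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)   # Eq. (31), per sample
        return f, log_det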


3.3 a new transformation unit

In this section, we first propose a general extension to the above-mentioned planar normalizing flow, and then a restricted version of it, which turns out to be a convolution over the dimensions of the input random vector.

3.3.1 Normalizing flow with d hidden units

Instead of having a single hidden unit as in the planar flow, consider d hidden units. We denote the weights associated with the edges from the input layer to the hidden layer by W ∈ R^{d×d} and the vector adjusting the magnitude of each dimension of the hidden layer activation by u; the transformation is defined as

f(z) = u \odot h(Wz + b)   (32)

where ⊙ denotes point-wise multiplication. The Jacobian matrix of this transformation is

\frac{\partial f}{\partial z} = \operatorname{diag}(u \odot h'(Wz + b))\, W   (33)

\det \frac{\partial f}{\partial z} = \det[\operatorname{diag}(u \odot h'(Wz + b))]\, \det(W)   (34)

Since det(diag(u ⊙ h'(Wz + b))) can be computed in linear time, the complexity of computing the above determinant lies in computing det(W). Essentially, the planar flow restricts W to be a vector of length d instead of a matrix; however, we can relax that assumption while still maintaining linear complexity of the determinant computation, based on the simple fact that the determinant of a triangular matrix is just the product of its diagonal elements.

3.3.2 Convolutional Flow

Since a normalizing flow with a fully connected layer may not be bijective, and generally requires O(d^3) computations for the determinant of the Jacobian even when it is, we propose to use 1-d convolution to transform random vectors.

Figure 3(a) illustrates how a 1-d convolution is performed over an input vector to produce another vector. We propose to perform a 1-d convolution on an input random vector z, followed by a non-linearity and the necessary post-activation operations, to generate an output vector. Specifically,

f(z) = z + u \odot h(\operatorname{conv}(z, w))   (35)


Figure 3: (a) Illustration of 1-d convolution, where the dimensions of the input/output variable are both 8 (the input vector is padded with 0), the width of the convolution filter is 3 and the dilation is 1; (b) A block of ConvFlow layers stacked with different dilations (dilation = 1, 2, 4).

where w ∈ R^k is the parameter of the 1-d convolution filter (k is the convolution kernel width), conv(z, w) is the 1-d convolution operation shown in Figure 3(a), h(·) is a bijective non-linear activation function¹, ⊙ denotes point-wise multiplication, and u ∈ R^d is a vector adjusting the magnitude of each dimension of the activation from h(·). We term this normalizing flow a Convolutional Flow (ConvFlow).

ConvFlow enjoys the following properties

• Bijectivity can be easily achieved if proper padding and an invertible activation function are adopted;

• Due to local connectivity, the Jacobian determinant of ConvFlow only takes O(d) computation, independent of the convolution kernel width k, since

\frac{\partial f}{\partial z} = I + \operatorname{diag}(w_1 u \odot h'(\operatorname{conv}(z, w)))   (36)

1 Examples of valid h(x) include sigmoid, tanh, softplus, leaky rectifier (Leaky ReLU), exponential linear unit (ELU), etc. Note that the vanilla rectifier (ReLU) is invalid here since it is not bijective when x < 0.


where w_1 denotes the first element of w. For the illustration in Figure 3(a), for example, the Jacobian matrix of the 1-d convolution conv(z, w) is

\frac{\partial \operatorname{conv}(z, w)}{\partial z} =
\begin{bmatrix}
w_1 & w_2 & w_3 &        &        &     \\
    & w_1 & w_2 & w_3    &        &     \\
    &     & \ddots & \ddots & \ddots &  \\
    &     &        & w_1    & w_2 & w_3 \\
    &     &        &        & w_1 & w_2 \\
    &     &        &        &     & w_1
\end{bmatrix}   (37)

which is a triangular matrix whose determinant can be easily computed;

• ConvFlow is much simpler than previously proposed variants of normalizing flows. The total number of parameters of one ConvFlow layer is only d + k, where generally k < d, particularly in high dimensional cases. Note that the number of parameters in the planar flow of [24] is 2d, and even more parameters are needed in the Inverse Autoregressive Flow [12] and Real NVP [6];

A series of K ConvFlows can be stacked to generate complex output densities. Further, since convolutions are only visible to inputs from neighboring dimensions, we propose to incorporate dilated convolution into the flow to accommodate interactions among dimensions that are far apart. Figure 3(b) presents a block of 3 stacked ConvFlows with a different dilation for each layer: a larger receptive field is achieved without increasing the number of parameters. We term this a ConvBlock.
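To make Eq. (35)–(36) concrete, below is a minimal PyTorch sketch of a single ConvFlow layer. The right-padding convention (so that output dimension i only depends on input dimensions i, i + dilation, ...), the leaky ReLU activation, and all class and variable names are our assumptions for illustration, not a definitive implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFlow(nn.Module):
    """f(z) = z + u * h(conv(z, w)); log|det| = sum_i log|1 + w_1 u_i h'(conv(z, w)_i)|."""
    def __init__(self, dim, kernel_size=3, dilation=1):
        super().__init__()
        self.k, self.d = kernel_size, dilation
        self.w = nn.Parameter(torch.randn(kernel_size) * 0.01)   # 1-d convolution filter
        self.b = nn.Parameter(torch.zeros(1))                    # convolution bias
        self.u = nn.Parameter(torch.randn(dim) * 0.01)

    def conv(self, z):                                   # z: (batch, dim)
        pad = (self.k - 1) * self.d                      # zero-pad on the right only
        z_pad = F.pad(z.unsqueeze(1), (0, pad))          # (batch, 1, dim + pad)
        w = self.w.view(1, 1, -1)
        return F.conv1d(z_pad, w, dilation=self.d).squeeze(1) + self.b   # (batch, dim)

    def forward(self, z):
        c = self.conv(z)
        h = F.leaky_relu(c, negative_slope=0.01)         # bijective activation
        f = z + self.u * h                               # Eq. (35)
        h_prime = torch.where(c > 0, torch.ones_like(c), torch.full_like(c, 0.01))
        diag = 1 + self.w[0] * self.u * h_prime          # diagonal of df/dz, Eq. (36)
        log_det = torch.log(torch.abs(diag) + 1e-8).sum(dim=1)
        return f, log_det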

From the block of ConvFlow layers presented in Figure 3(b), it is easy to verify that dimension i (1 ≤ i ≤ d) of the output vector only depends on succeeding dimensions, not preceding ones. In other words, dimensions with larger indices tend to get little warping compared to the ones with smaller indices. Fortunately, this can be easily resolved by a Revert Layer, which simply outputs a reversed version of its input vector. Specifically, a Revert Layer g operates as

g(z) := g([z_1, z_2, ..., z_d]^\top) = [z_d, z_{d-1}, ..., z_1]^\top   (38)

It is easy to verify that a Revert Layer is bijective and that the Jacobian of g is a d × d matrix with 1s on its anti-diagonal and 0s elsewhere, so that log |det ∂g/∂z| = 0. Therefore, we can append a Revert Layer after each ConvBlock to accommodate warping for dimensions with larger indices, without additional computation cost for the Jacobian, as follows

z \to \underbrace{\text{ConvBlock} \to \text{Revert} \to \text{ConvBlock} \to \text{Revert} \to \cdots \to}_{\text{repetitions of ConvBlock + Revert for } K \text{ times}} f(z)   (39)
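Building on the hypothetical ConvFlow class sketched above, the Revert Layer and the composition of Eq. (39) might be assembled as follows; the log-determinants of the stacked layers simply add up as in Eq. (29), and the Revert Layer contributes zero.

import torch
import torch.nn as nn

class Revert(nn.Module):
    """Outputs the reversed input; |det| = 1, so the log-det contribution is 0."""
    def forward(self, z):
        return torch.flip(z, dims=[1]), torch.zeros(z.size(0), device=z.device)

class NormalizingFlowStack(nn.Module):
    """Composes flow layers; accumulates sum_k log|det df_k/dz_{k-1}| as in Eq. (29)."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, z):
        total_log_det = torch.zeros(z.size(0), device=z.device)
        for layer in self.layers:
            z, log_det = layer(z)
            total_log_det = total_log_det + log_det
        return z, total_log_det

# One ConvBlock (here with dilations 1, 2, 4) followed by a Revert Layer, repeated K times.
def build_flow(dim, K=2, dilations=(1, 2, 4), kernel_size=3):
    layers = []
    for _ in range(K):
        layers += [ConvFlow(dim, kernel_size, d) for d in dilations]
        layers.append(Revert())
    return NormalizingFlowStack(layers)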

3.4 experiments

We test the performance of the proposed ConvFlow in two settings: one on synthetic data, approximating an unnormalized target density, and the other on density estimation for handwritten digits and characters.

3.4.1 Synthetic data

We conduct experiments using the proposed ConvFlow to approximate an unnormalized target density over a 2-dimensional z, with p(z) ∝ exp(−U(z)), where

U(z) = \frac{1}{2} \left[ \frac{z_2 - w_1(z)}{0.4} \right]^2, \qquad w_1(z) = \sin\left( \frac{\pi z_1}{2} \right)

The target density of z is plotted as the leftmost column in Figure 4, and we test whether the proposed ConvFlow can transform a two dimensional standard Gaussian to the target density by minimizing the KL divergence

\min \mathrm{KL}(q_K(z_K) \,\|\, p(z_K)) = \mathbb{E}_{z_K}[\log q_K(z_K)] - \mathbb{E}_{z_K}[\log p(z_K)]   (40)

= \mathbb{E}_{z_0}[\log q_0(z_0)] - \mathbb{E}_{z_0}\left[ \log \left| \det \frac{\partial f}{\partial z_0} \right| \right] + \mathbb{E}_{z_0}[U(f(z_0))] + \text{const}   (41)

where all expectations are evaluated with samples taken from q_0(z_0). We use a 2-d standard Gaussian as q_0(z_0) and test different numbers of ConvBlocks stacked together. Each ConvBlock in this case consists of a ConvFlow layer with kernel size 2 and dilation 1, followed by another ConvFlow layer with kernel size 2 and dilation 2. A Revert Layer is appended after each ConvBlock, and leaky ReLU with a negative slope of 0.01 is adopted in ConvFlow.
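For illustration, a training loop for this synthetic task under the objective of Eq. (40)–(41) could be sketched as follows. It reuses the hypothetical build_flow helper from the earlier sketch; the batch size, learning rate and number of steps here are illustrative assumptions, not the settings used for the reported results.

import math
import torch

def U(z):                                   # unnormalized negative log target density
    w1 = torch.sin(math.pi * z[:, 0] / 2)
    return 0.5 * ((z[:, 1] - w1) / 0.4) ** 2

flow = build_flow(dim=2, K=8, dilations=(1, 2), kernel_size=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

for step in range(10000):
    z0 = torch.randn(256, 2)                # samples from q0 = N(0, I)
    zK, log_det = flow(z0)
    # KL(q_K || p) up to a constant: E[log q0(z0)] - E[log|det df/dz0|] + E[U(f(z0))]
    log_q0 = -0.5 * (z0 ** 2).sum(dim=1)    # log N(0, I) up to an additive constant
    loss = (log_q0 - log_det + U(zK)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()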

Experimental results are shown in Figure 4 for different numbers of ConvBlocks stacked to compose f. Even with 4 layers of ConvBlocks, the flow is already approximating the target density, despite underestimating the density around the boundaries. With 8 layers of ConvFlow, the transformation from a standard Gaussian noise vector to the desired unnormalized target density can be accurately learned. Notice that with 8 layers, we are only using 40 parameters ((4 + 1) ∗ 8, with the bias terms of the convolutions counted).


Figure 4: Approximation performance with different numbers of ConvBlocks: (a) K = 2, (b) K = 4, (c) K = 8.


3.4.2 Handwritten digits and characters

3.4.2.1 Setups

To test the proposed ConvFlow for variational inference, we use the standard benchmark datasets MNIST² and OMNIGLOT³ [14]. Our method is general and can be applied to any formulation of the generative model p_θ(x, z); for simplicity and fair comparison, in this chapter we focus on densities defined by stochastic neural networks, i.e., a broad family of flexible probabilistic generative models with parameters defined by neural networks. Specifically, we consider the following two families of generative models

G1: p_θ(x, z) = p_θ(z)\, p_θ(x|z)   (42)

G2: p_θ(x, z_1, z_2) = p_θ(z_1)\, p_θ(z_2|z_1)\, p_θ(x|z_2)   (43)

where p(z) and p(z_1) are the priors defined over z and z_1 for G1 and G2, respectively. All other conditional densities are specified with their parameters θ defined by neural networks, resulting in two stochastic neural networks. Such networks could have any number of layers; in this chapter, we focus on those with one and two stochastic layers, i.e., G1 and G2, to conduct a fair comparison with previous methods on similar network architectures, such as VAE, IWAE and Normalizing Flows.

We use the same network architectures for both G1 and G2 as in [3], specifically as follows:

G1: A single Gaussian stochastic layer z with 50 units. Between the latent variable z and the observation x there are two deterministic layers, each with 200 units;

G2: Two Gaussian stochastic layers z_1 and z_2 with 50 and 100 units, respectively. Two deterministic layers with 200 units connect the observation x and the latent variable z_2, and two deterministic layers with 100 units are between z_2 and z_1;

where a Gaussian stochastic layer consists of two fully connected linear layers, one outputting the mean and the other the logarithm of the diagonal covariance. All other deterministic layers are fully connected with tanh nonlinearity. Bernoulli observation models are assumed for both MNIST and OMNIGLOT.
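As an illustration of the Gaussian stochastic layer just described, a minimal PyTorch sketch might look as follows; the class name and the placement of the reparameterization trick inside the layer are our choices, and the layer sizes mirror G1 only for concreteness.

import torch
import torch.nn as nn

class GaussianStochasticLayer(nn.Module):
    """Two linear heads producing the mean and the log of the diagonal covariance."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, latent_dim)
        self.log_var = nn.Linear(in_dim, latent_dim)

    def forward(self, h):
        mu, log_var = self.mean(h), self.log_var(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps        # reparameterization trick
        return z, mu, log_var

# Inference network of G1 (illustrative): x -> 200 -> 200 -> Gaussian layer with 50 units
encoder = nn.Sequential(nn.Linear(784, 200), nn.Tanh(),
                        nn.Linear(200, 200), nn.Tanh())
q_z = GaussianStochasticLayer(200, 50)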

2 Data downloaded from http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/

3 Data downloaded from https://github.com/yburda/iwae/raw/master/datasets/OMNIGLOT/chardata.mat


For MNIST, we employ the static binarization strategy of [15], while dynamic binarization is employed for OMNIGLOT.

The inference networks q(z|x) for G1 and G2 have architectures similar to the generative models, with details in [3]. ConvFlow is used to warp the output of the inference network q(z|x), assumed to be Gaussian conditioned on the input x, to match complex true posteriors. Our baseline models include VAE [11], IWAE [3] and Normalizing Flows [24]. Since our proposed method adds more layers to the inference network, we also include another enhanced version of VAE with more deterministic layers added to its inference network, which we term VAE+.⁴ All models are implemented in PyTorch. Parameters of both the variational distribution and the generative distribution of all models are optimized with Adam [13] for 2000 epochs, with a fixed learning rate of 0.0005 and exponential decay rates for the 1st and 2nd moments of 0.9 and 0.999, respectively. Batch normalization [10] is also used, as it has been shown to improve learning for neural stochastic models [25].

For inference models with a latent variable z of 50 dimensions, a ConvBlock consists of the following ConvFlow layers:

[ConvFlow(kernel size = 5, dilation = 1), ConvFlow(kernel size = 5, dilation = 2), ConvFlow(kernel size = 5, dilation = 4), ConvFlow(kernel size = 5, dilation = 8), ConvFlow(kernel size = 5, dilation = 16), ConvFlow(kernel size = 5, dilation = 32)]   (44)

and for inference models with a latent variable z of 100 dimensions, a ConvBlock consists of the following ConvFlow layers:

[ConvFlow(kernel size = 5, dilation = 1), ConvFlow(kernel size = 5, dilation = 2), ConvFlow(kernel size = 5, dilation = 4), ConvFlow(kernel size = 5, dilation = 8), ConvFlow(kernel size = 5, dilation = 16), ConvFlow(kernel size = 5, dilation = 32), ConvFlow(kernel size = 5, dilation = 64)]   (45)

A Revert Layer is appended after each ConvBlock, and leaky ReLU with a negative slope of 0.01 is used as the activation function in ConvFlow.
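Assuming the hypothetical ConvFlow, Revert and NormalizingFlowStack classes sketched in Section 3.3, the configuration of Eq. (44) for a 50-dimensional latent variable might be assembled as follows; the helper names and the choice of K = 8 are illustrative.

def conv_block_50d():
    # Eq. (44): kernel size 5, dilations 1, 2, 4, 8, 16, 32, followed by a Revert Layer
    dilations = [1, 2, 4, 8, 16, 32]
    return [ConvFlow(dim=50, kernel_size=5, dilation=d) for d in dilations] + [Revert()]

# K stacked ConvBlocks warp the Gaussian output of the inference network q(z|x)
posterior_flow = NormalizingFlowStack(
    [layer for _ in range(8) for layer in conv_block_50d()])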

3.4.2.2 Generative Density Estimation

For MNIST, models are trained and tuned on the 60,000 training and validation images, and the estimated log-likelihood on the test set with 5000 importance weighted samples is reported.

4 VAE+ adds more layers before the stochastic layer of the inference network, while the proposed method adds convolutional flow layers after the stochastic layer.


Table 3: MNIST test set NLL with generative models G1 and G2 (lower is better; K is the number of ConvBlocks)

MNIST (static binarization)        − log p(x) on G1    − log p(x) on G2

VAE [3]                            87.88               85.65
IWAE (IW = 50) [3]                 86.10               84.04
VAE+NF [24]                        –                   85.10
VAE+ (K = 1)                       87.56               85.53
VAE+ (K = 4)                       87.40               85.23
VAE+ (K = 8)                       87.28               85.07
VAE+ConvFlow (K = 1)               86.92               85.03
VAE+ConvFlow (K = 2)               86.10               84.47
VAE+ConvFlow (K = 4)               84.91               83.98
VAE+ConvFlow (K = 8)               84.53               83.22
IWAE+ConvFlow (K = 8, IW = 50)     84.13               82.96

Table 3 presents the performance of all models when the generative model is assumed to be from G1 and from G2.

First, VAE+ achieves higher log-likelihood estimates than vanilla VAE due to the additional layers in the inference network, implying that a better posterior approximation is learned (which is still assumed to be Gaussian). Second, we observe that VAE with ConvFlow achieves much better density estimates than VAE+, which confirms our expectation that warping the variational distribution with convolutional flows enforces the resulting variational posterior to match the true, complex posterior. Adding more blocks of convolutional flows brings the variational posterior even closer to the true posterior. Lastly, when convolutional normalizing flows are combined with multiple importance weighted samples, as shown in the last row of Table 3, a further improvement in test set log-likelihood is achieved. Overall, the method combining ConvFlow and importance weighted samples achieves the best NLL in both settings, outperforming IWAE significantly by about 2 nats on G1 and more than 1 nat on G2. Also notice that ConvFlow combined with IWAE achieves an NLL about 2 nats better than the normalizing flow used in [24], with fewer parameters in the normalizing flows, suggesting that ConvFlow is more efficient and effective in warping simple densities to complex ones.


Table 4: OMNIGLOT test set NLL with generative models G1 and G2 (lower is better; K is the number of ConvBlocks)

OMNIGLOT                           − log p(x) on G1    − log p(x) on G2

VAE [3]                            108.86              107.93
IWAE (IW = 50) [3]                 104.87              103.93
VAE+ (K = 1)                       108.80              107.89
VAE+ (K = 4)                       108.64              107.80
VAE+ (K = 8)                       108.53              107.67
VAE+ConvFlow (K = 1)               107.41              106.32
VAE+ConvFlow (K = 2)               107.05              105.80
VAE+ConvFlow (K = 4)               106.24              104.35
VAE+ConvFlow (K = 8)               105.87              103.58
IWAE+ConvFlow (K = 8, IW = 50)     104.21              103.02

Results on OMNIGLOT are presented in Table 4, where trends similar to those on MNIST can be observed. One difference from MNIST is that the gain of IWAE+ConvFlow over IWAE is not as large, which could be explained by the fact that OMNIGLOT is a smaller dataset, roughly 40% the size of MNIST.

3.4.2.3 Generated Samples

After the models are trained, generative samples can be obtained by feeding z ∼ N(0, I) to the learned generative model G1 (or z_2 ∼ N(0, I) to G2). Since higher log-likelihood estimates are obtained with G2, Figure 5 shows random generative samples from our proposed method trained with G2 on both MNIST and OMNIGLOT, compared to real samples from the training sets. We observe that the generated samples are visually consistent with the training data.

Figure 5: Training data and generated samples. (a) MNIST training data; (b)–(d) random samples from IWAE-ConvFlow (K = 8) trained on MNIST; (e) OMNIGLOT training data; (f)–(h) random samples from IWAE-ConvFlow (K = 8) trained on OMNIGLOT.

3.5 conclusions

This chapter presents a simple yet effective architecture for composing normalizing flows based on convolution over the input vectors. ConvFlow takes advantage of the efficient computation of convolution while keeping the number of parameters small. To further accommodate long range interactions among the dimensions, dilated convolution is incorporated into the framework without increasing the number of model parameters. A Revert Layer is used to maximize the opportunity for all dimensions to receive as much warping as possible. Experimental results on approximating complex target densities and on density estimation for real-world handwritten digit and character data demonstrate the effectiveness and efficiency of ConvFlow. In particular, density estimates on MNIST show significant improvements over state-of-the-art methods, validating the power of ConvFlow in warping multivariate densities. It remains an interesting question how many layers of ConvFlow are needed to exploit its full potential; we hope to address the theoretical properties of ConvFlow in future work.


4 GENERATIVE ADVERSARIAL NETWORKS WITH LIKELIHOOD

Generative Adversarial Networks (GAN) [9] are powerful architectures for generative modeling based on learning a discriminator which tries to distinguish real data points from generated ones. They have been shown to generate high quality data samples, such as sharp, visually consistent images []. Lying at the core is a generator which deterministically maps a randomly sampled noise vector, typically assumed to come from an isotropic Gaussian, to a point in the data space. One drawback of GAN is that no probabilistic statements can be made about a generated sample, such as how probable the generator is to generate a given image. This chapter aims to address this aspect, building on the normalizing flows developed in the previous chapters.

4.1 generative adversarial networks

We first briefly introduce GAN, whose architecture is shown in Figure 6.

Figure 6: GAN architecture (Figure reproduced from [4])


A GAN consists of two components, a discriminator D and a generator G. The generator G takes a randomly sampled noise vector z and outputs a generated sample x' = G(z), while the discriminator D tries to tell a real data point x apart from a generated data point x'. Formally, training a GAN amounts to finding the parameters of both D and G by solving the following minimax game

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (46)

It can be shown that, with G fixed, the optimal discriminator D is

D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}   (47)

and, under the assumption that D and G have enough capacity, the optimum of the above minimax game is reached when p_g(x) = p_data(x).

The above result guarantees that at the optimum the density of the generator converges to the true data density; however, it remains unclear what exactly that density is, since the generator provides no way to evaluate it. This chapter takes a first step toward exploring this problem with the help of the aforementioned normalizing flows.
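For reference, one alternating update of the minimax objective in Eq. (46) might be sketched as follows; this assumes D outputs probabilities in (0, 1), uses illustrative function and variable names, and is not tied to any particular GAN implementation.

import torch

def gan_step(D, G, x_real, z, opt_D, opt_G):
    """One alternating update of the minimax game in Eq. (46)."""
    # Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))]
    x_fake = G(z).detach()
    d_loss = -(torch.log(D(x_real) + 1e-8).mean()
               + torch.log(1 - D(x_fake) + 1e-8).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator: minimize E[log(1 - D(G(z)))]
    g_loss = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()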

4.2 equipping gan with likelihood evaluation

One idea to enable density evaluation for the generator is to construct the generator from bijective transformations, i.e., to assume

x' = G(z) := f(z)   (48)

where f is bijective (which also implies that z has the same dimension as a data point); hence the density of x' can be written analytically as

p_g(x') = p_z(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}   (49)

f can be constructed as a composition of a series of bijective transformations f_1, f_2, ..., f_K, such as the convolutional flows proposed in Chapter 3.

Conceptually, the above construction enables density evaluation of generated samples from the density of the randomly sampled noise z. When D and G are jointly trained, this should also recover the underlying data density. The model capacity might be limited due to the bijective constraint on each transformation; however, we can always stack as many bijective transformations as needed to compose a complex function. The trade-off between model capacity and the ability to evaluate the generator density will be explored in this chapter.
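A minimal sketch of the density evaluation in Eq. (49), assuming a standard Gaussian noise density p_z and a generator built as a stack of bijective flow layers that returns both its output and the accumulated log-determinant (as in the hypothetical flow stack of Chapter 3):

import math
import torch

def generator_log_density(flow, z):
    """log p_g(x') = log p_z(z) - log|det df/dz| for x' = f(z), as in Eq. (49)."""
    x, log_det = flow(z)                       # forward pass through the bijective generator
    d = z.size(1)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return x, log_pz - log_det                 # generated sample and its log-density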


5 NEURAL VARIATIONAL TOPIC MODELING

Classical probabilistic graphical models (PGM) [27] are another set of generative models for describing, capturing and inferring complex statistical dependencies among variables. Inference refers to the process of estimating the parameters associated with the model; in the traditional PGM literature, Markov chain Monte Carlo (MCMC) and mean-field variational inference (MFVI) are the two most widely used methods. MCMC, on one hand, requires evaluating conditional densities of the variables of interest conditioned on the others so that a Markov chain can be constructed and sampled; this yields asymptotically consistent parameter estimates, but at the cost of long inference time. Mean-field variational inference, on the other hand, relies on factorized variational densities to approximate the true, unknown posterior by maximizing an evidence lower bound (ELBO) of the data likelihood; its asymptotic behavior is not guaranteed, due to the over-simplified assumptions about the variational posteriors. Given a limited time budget, MFVI is often preferred over MCMC, and satisfactory results have been obtained in various scenarios. However, a key drawback of MFVI is that inference for a new data sample absent from the training set often involves folding the test example into the training set and running the MFVI updates for a few iterations to infer its latent variables, which can hamper its performance in settings where fast inference for new data points is desired. In this chapter, we study the possibility of combining stochastic neural inference with these classical PGMs. As a starting point, we investigate how topic models, such as Latent Dirichlet Allocation (LDA) [1], can be learned with stochastic neural inference.

5.1 neural variational text modeling

In the direction of adopting neural variational inference for text modeling, there are two closely related works, namely Neural Variational Text Modeling (NVDM) [19] and Autoencoding Variational Inference for Topic Models (AVITM) [26]. NVDM and AVITM are built with motivations similar to the VAE: they build an inference network q(z|x) to approximate the true posterior p(z|x) by optimizing the ELBO. The parameters of the inference network can be learned such that it best approximates the true posterior. Efficient inference for a new data point x is then easily obtained by feeding x to the inference network, whose output represents the corresponding latent variables z. However, neither NVDM nor AVITM employs the original generative process of LDA, hence it is unclear whether they are learning the parameters of LDA.

Figure 7: NVDM architecture (Figure reproduced from [19])

Figure 7 shows the model architecture of NVDM, which aims to learn both the inference network (encoding the conditional q(h|x)) and the generative network (encoding the conditional p(x|h)). Specifically, the learning objective of NVDM is

\max\ \log p_\theta(x) = \log \mathbb{E}_{h \sim q_\phi(h|x)} \left[ \frac{p_\theta(x|h)\, p(h)}{q_\phi(h|x)} \right]   (50)

\geq \mathbb{E}_{h \sim q_\phi(h|x)}[\log p_\theta(x|h)] + \mathbb{E}_{h \sim q_\phi(h|x)}[\log p(h)] - \mathbb{E}_{h \sim q_\phi(h|x)}[\log q_\phi(h|x)]   (51)

where h is the latent encoding of x. NVDM assumes both p(h) and p(x|h) to be Gaussian, so NVDM is clearly optimizing a different objective from that of LDA. It is also worth noting that the latent code h in NVDM does not carry the meaning of a vector of topic proportions as in LDA.

AVITM, on the other hand, tries to alleviate this problem, as the topic vector lying on a probability simplex is believed to be key to obtaining a set of meaningful topics [26], by assuming q(h|x) to be a logistic Normal distribution, i.e.,

q(h|x) := LogisticNormal(µ(x),Σ(x)) (52)

where µ(x) and Σ(x) are neural networks with parameters to be learned. However, AVITM does not adopt the original generative model of LDA either, as it approximates the Dirichlet prior p(z) with another logistic Normal distribution. Though AVITM achieves better topics, its performance is still worse than LDA. We aim to address this problem in this chapter.

5.2 neural variational topic modeling

The generative process for Latent Dirichlet Allocation can be summarized as

• For each topic k, sample ψ_k ∼ Dir(β)

• For each document i,

  1. Sample θ_i ∼ Dir(α)

  2. For each word j in document i:

     a) Sample a topic z_{i,j} ∼ Multinomial(θ_i)

     b) Sample a word w_{i,j} ∼ Multinomial(ψ_{z_{i,j}})
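For concreteness, ancestral sampling from this generative process can be sketched with torch.distributions as follows; the function name and the hyperparameter values are illustrative assumptions.

import torch
from torch.distributions import Dirichlet, Categorical

def sample_lda_corpus(num_docs, doc_len, num_topics, vocab_size, alpha=0.1, beta=0.01):
    """Ancestral sampling from the LDA generative process above (a sketch)."""
    psi = Dirichlet(torch.full((vocab_size,), beta)).sample((num_topics,))  # topic-word dists
    docs = []
    for _ in range(num_docs):
        theta = Dirichlet(torch.full((num_topics,), alpha)).sample()        # topic proportions
        z = Categorical(theta).sample((doc_len,))                           # topic per word
        w = Categorical(psi[z]).sample()                                    # word per position
        docs.append(w)
    return psi, docs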

The marginal likelihood of a document w is

p(w; \alpha, \beta) = \int_\theta \left( \prod_{n=1}^{N} \sum_{z_n=1}^{k} p(w_n|z_n, \beta)\, p(z_n|\theta) \right) p(\theta; \alpha)\, d\theta   (53)

and it turns out that the discrete topic variables z can be summed out, which yields

p(w; \alpha, \beta) = \int_\theta \left( \prod_{n=1}^{N} p(w_n|\theta, \beta) \right) p(\theta|\alpha)\, d\theta   (54)

where p(w_n|θ, β) is Multinomial(1, βθ). Further, we have

p(w, \theta; \alpha, \beta) = \left( \prod_{n=1}^{N} p(w_n|\theta, \beta) \right) p(\theta|\alpha)   (55)

In this part, we propose to construct an inference network q(θ|x) to approximate the true posterior by maximizing the following ELBO, which involves the original generative process of LDA:

\log p(w; \alpha, \beta) = \log \mathbb{E}_{\theta \sim q(\theta|x)} \left[ \frac{p(w, \theta; \alpha, \beta)}{q(\theta|x)} \right]   (56)

\geq \mathbb{E}_{\theta \sim q(\theta|x)}[\log p(w, \theta; \alpha, \beta)] - \mathbb{E}_{\theta \sim q(\theta|x)}[\log q(\theta|x)]   (57)

= \mathbb{E}_{\theta \sim q(\theta|x)} \left[ \sum_{n=1}^{N} \log p(w_n|\theta, \beta) \right] + \mathbb{E}_{\theta \sim q(\theta|x)}[\log p(\theta; \alpha)] - \mathbb{E}_{\theta \sim q(\theta|x)}[\log q(\theta|x)]   (58)


where the inference network q(θ|x) is assumed to be a logistic Normal distribution parameterized by a neural network. Samples θ ∼ q(θ|x) from the inference network can be obtained by using the reparameterization trick as

\varepsilon \sim \mathrm{Normal}(0, I)   (59)

\theta = \mathrm{Softmax}(\mu(x) + \Sigma(x)^{1/2} \varepsilon)   (60)

where µ(x) and Σ(x) are represented as neural networks taking x as input.
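A minimal sketch of the reparameterized logistic Normal inference network of Eq. (59)–(60), assuming a diagonal Σ(x) and illustrative layer sizes; the class and attribute names are ours, not those of AVITM.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LogisticNormalEncoder(nn.Module):
    """q(theta|x): softmax of a reparameterized Gaussian, Eq. (59)-(60)."""
    def __init__(self, vocab_size, num_topics, hidden=200):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, num_topics)
        self.log_var = nn.Linear(hidden, num_topics)   # diagonal Sigma(x), as a log-variance

    def forward(self, x):                              # x: bag-of-words counts
        h = self.body(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                                       # Eq. (59)
        theta = F.softmax(mu + torch.exp(0.5 * log_var) * eps, dim=-1)   # Eq. (60)
        return theta, mu, log_var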

5.2.1 Preliminary Experimental Results

We conduct preliminary experiments on the 20 Newsgroups data set (11,000 training samples with a 2000-word vocabulary), and measure the perplexity on the test set, reported in the table below.

# of topics    LDA with MFVI    AVITM    NVTM

50             1207             1180     1154

It can be seen that using a neural network as the inference model for LDA achieves a better approximation to the posterior p(θ|x), and that using the original LDA generative model further reduces the perplexity.


6 STOCHASTIC NEURAL INFERENCE FOR TEMPORAL DATA

Inference of patterns, or latent structures, in temporal data is another important research direction for temporal data analysis. For example, extracting local basis vectors as described in [29] is one way to learn salient patterns in temporal data; however, in order to infer the patterns for a new temporal time series, SIDL needs to solve a sparse coding problem on the fly, which is inefficient, particularly in scenarios where fast inference is preferred. In multivariate temporal data analysis, inferring the correlations, or latent structures, among the different signal dimensions is another critical problem for better understanding the interactions among the variables, with the potential to lead to deeper and more accurate data understanding and prediction. Traditional methods in this direction often rely on parametric assumptions about the joint distribution of the multivariate temporal signals, a widely used one being the Gaussian [8]. The parametric assumption often limits the application scope of such methods. In this chapter, we propose to model multivariate temporal data with a novel framework which supports not only fast inference for unseen data series, but also complex or even non-parametric distributions among the different signal dimensions.

6.1 temporal vae

Suppose we are given multivariate temporal data x := {x_1, x_2, ..., x_N}, a series of observations from time stamp t = 1 to t = N. Each x_i is a p-dimensional vector encoding the p variables at that time point. To model temporal dependencies, historical data within a window of size d is often considered; to keep the discussion clear, we assume that we only look back to the previous time stamp, i.e., d = 1, effectively assuming that the data stream is a Markov chain specified by

p(x) = p(x_1) \prod_{i=2}^{N} p(x_i|x_{i-1})   (61)

A couple of existing methods can be categorized under this setting; for example, the vector autoregressive model (VAR) essentially assumes a Gaussian conditional for p(x_i|x_{i-1}) when the model is learned by minimizing the mean squared error (MSE) between the predicted and true values.

Often we believe that there is some underlying latent state associated with each time stamp which determines the temporal data observable to us. Taking into account the temporal changes of the underlying latent variables, the above model family becomes

p(x, z) = p(z_0) \prod_{i=1}^{N} p(z_i|z_{i-1})\, p(x_i|z_i)   (62)

A special case of the above model is the Gaussian-Markov model, where all conditionals are Gaussian. However, the Gaussian assumption is often too restrictive, and the model then lacks the flexibility to adapt to data where more complex correlations among variables are present. In this chapter, we propose a Variational Autoencoder (VAE) based framework to model complex distributions among variables beyond the Gaussian assumption, which further provides fast inference and prediction for new instances of temporal data streams. We term the framework the Temporal Variational Autoencoder (Temporal VAE).

Temporal VAE assumes the following probabilistic model over the data observation x and the latent variables z

p(x, z) = p(z_0) \prod_{i=1}^{N} p(z_i|z_{i-1})\, p(x_i|z_i)   (63)

z_i := f(\tau_i), \quad p(\tau_i|z_{i-1}) \text{ is a simple distribution}   (64)

x_i := g(\gamma_i), \quad p(\gamma_i|z_i) \text{ is a simple distribution}   (65)

where f and g are bijective functions parameterized by neural networks, such as the normalizing flows and convolutional flows described in Chapter 3. The family of joint distributions over the multivariate observations is thus enriched by the learnable functions f and g.
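As one possible (not definitive) instantiation of the transition in Eq. (63)–(64), the sketch below conditions a simple diagonal Gaussian on z_{i−1} and pushes its sample through a bijective flow such as the hypothetical ConvFlow stack from Chapter 3; class and attribute names are illustrative.

import torch
import torch.nn as nn

class FlowTransition(nn.Module):
    """z_i = f(tau_i), tau_i ~ N(mu(z_{i-1}), sigma(z_{i-1})^2): one way to realize Eq. (63)-(64)."""
    def __init__(self, latent_dim, flow):
        super().__init__()
        self.mu = nn.Linear(latent_dim, latent_dim)
        self.log_var = nn.Linear(latent_dim, latent_dim)
        self.flow = flow                       # bijective map, e.g. a stack of ConvFlow layers

    def forward(self, z_prev):
        mu, log_var = self.mu(z_prev), self.log_var(z_prev)
        tau = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # simple conditional p(tau_i|z_{i-1})
        z, log_det = self.flow(tau)            # warp tau into a richer conditional p(z_i|z_{i-1})
        return z, log_det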

Further, to facilitate efficient model inference, an inference model parameterized by a neural network is proposed:

q(z|x) = q(z_0) \prod_{i=1}^{N} q(z_i|x_i, z_{i-1})   (66)

where constructions similar to those for p are used for q. Both the generative model p and the inference model q are learned jointly by maximizing the ELBO

\log p(x) = \log \mathbb{E}_{q(z|x)} \left[ \frac{p(x, z)}{q(z|x)} \right] \geq \mathbb{E}_{q(z|x)}[\log p(x, z)] - \mathbb{E}_{q(z|x)}[\log q(z|x)]   (67)

Modeling multivariate temporal data with the proposed framework will be thoroughly explored and examined in this chapter.


7 TIMELINE

The anticipated timeline to complete this thesis is as follows:

Table 5: Timetable

2017 2018

Task 11 12 1 2 3 4 5 6 7 8

AVAE (Ch. 2, arXiv:1711.08352) X X

ConvFlow (Ch. 3, arXiv:1711.02255) X X

GANs with ConvFlow (Ch. 4) X X

App. Topic Modeling (Ch. 5) X X X X

App. Time Series Modeling (Ch. 6) X X X X

Thesis Writing X X X X

Thesis Defense X


BIBLIOGRAPHY

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." In: Journal of Machine Learning Research 3 (2003), pp. 993–1022.

[2] Niko Brümmer. "Note on the equivalence of hierarchical variational models and auxiliary deep generative models." In: arXiv preprint arXiv:1603.02443 (2016).

[3] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. "Importance Weighted Autoencoders." In: arXiv preprint arXiv:1509.00519 (2015).

[4] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. "Generative Adversarial Networks: An Overview." In: ArXiv e-prints (Oct. 2017). arXiv: 1710.07035 [cs.CV].

[5] Laurent Dinh, David Krueger, and Yoshua Bengio. "NICE: Non-linear independent components estimation." In: arXiv preprint arXiv:1410.8516 (2014).

[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. "Density estimation using Real NVP." In: arXiv preprint arXiv:1605.08803 (2016).

[7] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. "Adversarially learned inference." In: arXiv preprint arXiv:1606.00704 (2016).

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. "A note on the group lasso and a sparse group lasso." In: arXiv preprint arXiv:1001.0736 (2010).

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680.

[10] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. 2015, pp. 448–456.

[11] Diederik P. Kingma and Max Welling. "Auto-Encoding Variational Bayes." In: arXiv preprint arXiv:1312.6114 (2013).


[12] Diederik P. Kingma, Tim Salimans, Rafal Józefowicz, Xi Chen, Ilya Sutskever, and Max Welling. "Improving Variational Autoencoders with Inverse Autoregressive Flow." In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 2016, pp. 4736–4744.

[13] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." In: arXiv preprint arXiv:1412.6980 (2014).

[14] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. "One-shot learning by inverting a compositional causal process." In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States. 2013, pp. 2526–2534.

[15] Hugo Larochelle and Iain Murray. "The Neural Autoregressive Distribution Estimator." In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011. 2011, pp. 29–37.

[16] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. "Auxiliary Deep Generative Models." In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2016, pp. 1445–1453. url: http://jmlr.org/proceedings/papers/v48/maaloe16.html.

[17] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. "Adversarial autoencoders." In: arXiv preprint arXiv:1511.05644 (2015).

[18] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. "Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks." In: arXiv preprint arXiv:1701.04722 (2017).

[19] Yishu Miao, Lei Yu, and Phil Blunsom. "Neural variational inference for text processing." In: International Conference on Machine Learning. 2016, pp. 1727–1736.

[20] Andriy Mnih and Karol Gregor. "Neural Variational Inference and Learning in Belief Networks." In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. 2014, pp. 1791–1799.

[21] Andriy Mnih and Danilo Jimenez Rezende. "Variational Inference for Monte Carlo Objectives." In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2016, pp. 2188–2196.


[22] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "WaveNet: A generative model for raw audio." In: arXiv preprint arXiv:1609.03499 (2016).

[23] Rajesh Ranganath, Dustin Tran, and David M. Blei. "Hierarchical Variational Models." In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2016, pp. 324–333.

[24] Danilo Jimenez Rezende and Shakir Mohamed. "Variational Inference with Normalizing Flows." In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. 2015, pp. 1530–1538.

[25] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. "Ladder Variational Autoencoders." In: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 2016, pp. 3738–3746.

[26] Akash Srivastava and Charles Sutton. "Autoencoding Variational Inference For Topic Models." In: arXiv preprint arXiv:1703.01488 (2017).

[27] Martin J. Wainwright, Michael I. Jordan, et al. "Graphical models, exponential families, and variational inference." In: Foundations and Trends® in Machine Learning 1.1–2 (2008), pp. 1–305.

[28] Fisher Yu and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." In: arXiv preprint arXiv:1511.07122 (2015).

[29] Guoqing Zheng, Yiming Yang, and Jaime G. Carbonell. "Efficient Shift-Invariant Dictionary Learning." In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 2016, pp. 2095–2104. doi: 10.1145/2939672.2939824. url: http://doi.acm.org/10.1145/2939672.2939824.
