Published as a conference paper at ICLR 2015

REWEIGHTED WAKE-SLEEP

Jörg Bornschein and Yoshua Bengio*
Department of Computer Science and Operations Research
University of Montreal
Montreal, Quebec, Canada

ABSTRACT

Training deep directed graphical models with many hidden variables and performing inference remains a major challenge. Helmholtz machines and deep belief networks are such models, and the wake-sleep algorithm has been proposed to train them. The wake-sleep algorithm relies on training not just the directed generative model but also a conditional generative model (the inference network) that runs backward from visible to latent, estimating the posterior distribution of latent given visible. We propose a novel interpretation of the wake-sleep algorithm which suggests that better estimators of the gradient can be obtained by sampling latent variables multiple times from the inference network. This view is based on importance sampling as an estimator of the likelihood, with the approximate inference network as a proposal distribution. This interpretation is confirmed experimentally, showing that better likelihood can be achieved with this reweighted wake-sleep procedure. Based on this interpretation, we propose that a sigmoidal belief network is not sufficiently powerful for the layers of the inference network in order to recover a good estimator of the posterior distribution of latent variables. Our experiments show that using a more powerful layer model, such as NADE, yields substantially better generative models.

1 INTRODUCTION

Training directed graphical models, especially models with multiple layers of hidden variables, remains a major challenge. This is unfortunate because, as has been argued previously (Hinton et al., 2006; Bengio, 2009), a deeper generative model has the potential to capture high-level abstractions and thus generalize better. The exact log-likelihood gradient is intractable, be it for Helmholtz machines (Hinton et al., 1995; Dayan et al., 1995), sigmoidal belief networks (SBNs), or deep belief networks (DBNs) (Hinton et al., 2006), which are directed models with a restricted Boltzmann machine (RBM) as top layer. Even obtaining an unbiased estimator of the gradient of the DBN or Helmholtz machine log-likelihood is not something that has been achieved in the past. Here we show that it is possible to get an unbiased estimator of the likelihood (which unfortunately makes it a slightly biased estimator of the log-likelihood), using an importance sampling approach. Past proposals to train Helmholtz machines and DBNs rely on maximizing a variational bound as a proxy for the log-likelihood (Hinton et al., 1995; Kingma and Welling, 2014; Rezende et al., 2014). The first of these is the wake-sleep algorithm (Hinton et al., 1995), which relies on combining a "recognition" network (which we call an approximate inference network here, or simply inference network) with a generative network. In the wake-sleep algorithm, the two networks provide targets for each other. We review these previous approaches and introduce a novel approach that generalizes the wake-sleep algorithm. Whereas the original justification of the wake-sleep algorithm has been questioned (because it optimizes a KL divergence in the wrong direction), a contribution of this paper is to shed a different light on the wake-sleep algorithm, viewing it as a special case of the proposed reweighted wake-sleep (RWS) algorithm, i.e., as reweighted wake-sleep with a single sample. This shows that wake-sleep corresponds to optimizing a somewhat biased estimator of the likelihood gradient, while using more samples makes the estimator less biased (and asymptotically unbiased as more samples are considered). We empirically show that effect, with clearly better results obtained with K = 5 samples than with K = 1 (wake-sleep), and 5 or 10 being sufficient to achieve good results. Unlike in the case of DBMs, which rely on a Markov chain to get samples and estimate the gradient by a mean over those samples, here the samples are i.i.d., avoiding the very serious problem of mixing between modes that can plague MCMC methods (Bengio et al., 2013) when training undirected graphical models.

* Jörg Bornschein is a CIFAR Global Scholar; Yoshua Bengio is a CIFAR Senior Fellow.

arXiv:1406.2751v4 [cs.LG] 16 Apr 2015

Another contribution of this paper regards the architecture of the deep approximate inference network. We view the inference network as estimating the posterior distribution of latent variables given the observed input. With this view it is plausible that the classical architecture of the inference network (an SBN, details below) is inappropriate, and we test this hypothesis empirically. We find that more powerful parametrizations that can represent non-factorial posterior distributions yield better results.

2 REWEIGHTED WAKE-SLEEP

2.1 THE WAKE-SLEEP ALGORITHM

The wake-sleep algorithm was proposed as a way to train Helmholtz machines, which are deep directed graphical models $p(x, h)$ over visible variables $x$ and latent variables $h$, where the latent variables are organized in layers $h_k$. In the Helmholtz machine (Hinton et al., 1995; Dayan et al., 1995), the top layer $h_L$ has a factorized unconditional distribution, so that ancestral sampling can proceed from $h_L$ down to $h_1$, and then the sample $x$ is generated by the bottom layer given $h_1$. In the deep belief network (DBN) (Hinton et al., 2006), the top layer is instead generated by an RBM, i.e., by a Markov chain, while simple ancestral sampling is used for the others. Each intermediate layer is specified by a conditional distribution parametrized as a stochastic sigmoidal layer (see section 3 for details).

The wake-sleep algorithm is a training procedure for such generative models, which involves training an auxiliary network, called the inference network, that takes a visible vector $x$ as input and stochastically outputs samples $h_k$ for all layers $k = 1$ to $L$. The inference network outputs samples from a distribution that should estimate the conditional probability of the latent variables of the generative model (at all layers) given the input. Note that in these kinds of directed models exact inference, i.e., sampling from $p(h \mid x)$, is intractable.

The wake-sleep algorithm proceeds in two phases. In the wake phase, an observation $x$ is sampled from the training distribution $D$ and propagated stochastically up the inference network (one layer at a time), thus sampling latent values $h$ from $q(h \mid x)$. Together with $x$, the sampled $h$ forms a target for training $p$, i.e., one performs a step of gradient ascent with respect to maximum likelihood over the generative model $p(x, h)$, with the data $x$ and the inferred $h$. This is useful because, whereas computing the gradient of the marginal likelihood $p(x) = \sum_h p(x, h)$ is intractable, computing the gradient of the complete log-likelihood $\log p(x, h)$ is easy. In addition, these updates decouple all the layers (because both the input and the target of each layer are considered observed). In the sleep phase, a "dream" sample is obtained from the generative network by ancestral sampling from $p(x, h)$ and is used as a target for the maximum likelihood training of the inference network, i.e., $q$ is trained to estimate $p(h \mid x)$.

The justification for the wake-sleep algorithm that was originally proposed is based on the following variational bound,

$$\log p(x) \ge \sum_h q(h \mid x) \log \frac{p(x, h)}{q(h \mid x)},$$

that is true for any inference network $q$, but the bound becomes tight as $q(h \mid x)$ approaches $p(h \mid x)$. Maximizing this bound with respect to $p$ corresponds to the wake phase update. The update with respect to $q$ should minimize $\mathrm{KL}(q(h \mid x) \,\|\, p(h \mid x))$ (with $q$ as the reference), but instead the sleep phase update minimizes the reversed KL divergence, $\mathrm{KL}(p(h \mid x) \,\|\, q(h \mid x))$ (with $p$ as the reference).
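The gap between the variational bound and the true log-likelihood can be made explicit. The short derivation below is a standard identity, written out here as a worked step rather than quoted from the paper; it only uses the factorization $p(x, h) = p(h \mid x)\, p(x)$ and the fact that $q(h \mid x)$ sums to one over $h$.

```latex
\begin{aligned}
\sum_h q(h \mid x) \log \frac{p(x, h)}{q(h \mid x)}
  &= \sum_h q(h \mid x) \left[ \log p(x) + \log \frac{p(h \mid x)}{q(h \mid x)} \right] \\
  &= \log p(x) - \mathrm{KL}\big( q(h \mid x) \,\|\, p(h \mid x) \big)
\end{aligned}
```

Because the KL divergence is non-negative, the right-hand side never exceeds $\log p(x)$, and the two coincide exactly when $q(h \mid x) = p(h \mid x)$. This is also why a bound-based update of $q$ would minimize the divergence with $q$ as the reference.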

2.2 AN IMPORTANCE SAMPLING VIEW YIELDS REWEIGHTED WAKE-SLEEP

If we think of $q(h \mid x)$ as estimating $p(h \mid x)$ and train it accordingly (which is basically what the sleep phase of wake-sleep does), then we can reformulate the likelihood as an importance-weighted average:

$$p(x) = \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)} = \mathbb{E}_{h \sim q(h \mid x)}\left[ \frac{p(x, h)}{q(h \mid x)} \right] \simeq \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}, \qquad h^{(k)} \sim q(h \mid x) \tag{1}$$

Eqn. (1) is a consistent and unbiased estimator for the marginal likelihood $p(x)$. The optimal $q$ that results in a minimum variance estimator is $q^*(h \mid x) = p(h \mid x)$. In fact, we can show that this is a zero-variance estimator, i.e., the best possible one, that will result in a perfect $p(x)$ estimate even with a single arbitrary sample $h \sim p(h \mid x)$:

$$\mathbb{E}_{h \sim p(h \mid x)}\left[ \frac{p(h \mid x)\, p(x)}{p(h \mid x)} \right] = p(x)\, \mathbb{E}_{h \sim p(h \mid x)}[1] = p(x) \tag{2}$$

Any mismatch between $q$ and $p(h \mid x)$ will increase the variance of this estimator, but it will not introduce any bias. In practice, however, we are typically interested in an estimator for the log-likelihood. Taking the logarithm of (1) and averaging over multiple datapoints will result in a conservative biased estimate and will, on average, underestimate the true log-likelihood due to the concavity of the logarithm. Increasing the number of samples will decrease both the bias and the variance. Variants of this estimator have been used in, e.g., (Rezende et al., 2014; Gregor et al., 2014) to evaluate trained models.
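The properties claimed for estimator (1) can be checked numerically on a model small enough to enumerate. The sketch below is illustrative only: the joint table `p_joint`, the proposal `q_poor` and the helper `is_estimate` are made-up names and numbers, not part of the paper. It shows that sampling from the exact posterior reproduces $p(x)$ from a single sample, and that the expected single-sample log-estimate (the variational bound) lies below $\log p(x)$.

```python
import math
import random

# Toy joint p(x, h) for one fixed observation x and four latent states h.
# The numbers are invented for illustration.
p_joint = [0.02, 0.10, 0.05, 0.03]           # p(x, h) for h = 0..3
p_x = sum(p_joint)                            # exact marginal p(x)
posterior = [pj / p_x for pj in p_joint]      # exact p(h | x)
q_poor = [0.25, 0.25, 0.25, 0.25]             # a mismatched proposal q(h | x)

def is_estimate(q, K, rng):
    """Average of p(x, h) / q(h | x) over K samples h ~ q(h | x), as in eq. (1)."""
    samples = rng.choices(range(len(q)), weights=q, k=K)
    return sum(p_joint[h] / q[h] for h in samples) / K

rng = random.Random(0)
est_exact_q = is_estimate(posterior, 1, rng)  # one sample from p(h | x) suffices
est_poor_q = is_estimate(q_poor, 1000, rng)   # noisier, but still unbiased

# Exact expectation of the K = 1 log-estimate under q_poor (the variational bound).
expected_log_k1 = sum(qh * math.log(pj / qh) for qh, pj in zip(q_poor, p_joint))
log_p_x = math.log(p_x)
```

With the mismatched proposal the likelihood estimate fluctuates around `p_x`, while `expected_log_k1` sits strictly below `log_p_x`, which is the Jensen gap described above.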

2.3 TRAINING BY REWEIGHTED WAKE-SLEEP

We now consider the models $p$ and $q$ parameterized with parameters $\theta$ and $\phi$ respectively.

Updating $p_\theta$ for given $q_\phi$: We propose to use an importance sampling estimator based on eq. (1) to compute the gradient of the marginal log-likelihood $\mathcal{L}_p(\theta, x) = \log p_\theta(x)$:

$$\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim D) = \frac{1}{p(x)}\, \mathbb{E}_{h \sim q(h \mid x)}\left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)}) \quad \text{with } h^{(k)} \sim q(h \mid x) \tag{3}$$

and the importance weights

$$\tilde{\omega}_k = \frac{\omega_k}{\sum_{k'=1}^{K} \omega_{k'}}, \qquad \omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}.$$

See the supplement for a detailed derivation. Note that this is a biased estimator because it implicitly contains a division by the estimated $p(x)$. Furthermore, there is no guarantee that $q = p(h \mid x)$ results in a minimum variance estimate of this gradient. But both the bias and the variance decrease as the number of samples is increased. Also note that the wake-sleep algorithm uses a gradient that is equivalent to using only K = 1 sample. Another noteworthy detail about eq. (3) is that the importance weights $\tilde{\omega}_k$ are automatically normalized such that they sum up to one.
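In practice the joint probabilities $p(x, h^{(k)})$ of a deep model are far too small to be represented directly, so the weights are usually formed from log-probabilities. The following is a minimal sketch of that computation; working in log space and subtracting the maximum before exponentiating are implementation choices assumed here, not details given in the paper, and `normalized_weights` is a hypothetical helper name.

```python
import math

def normalized_weights(log_p_joint, log_q):
    """Return (normalized weights, log-likelihood estimate) from per-sample log-probs.

    log_p_joint[k] = log p(x, h^(k)) and log_q[k] = log q(h^(k) | x).
    """
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]   # log of unnormalized weights
    m = max(log_w)                                            # shift to avoid underflow
    shifted = [math.exp(lw - m) for lw in log_w]
    total = sum(shifted)
    w_tilde = [s / total for s in shifted]                    # weights sum to one
    log_lik = m + math.log(total / len(log_w))                # log of the average weight
    return w_tilde, log_lik

# Log-probabilities at a scale where exp() alone would underflow to zero.
w_tilde, log_lik = normalized_weights([-1000.0, -1001.0, -1005.0], [-3.0, -3.0, -2.0])
w_single, log_lik_single = normalized_weights([-95.0], [-4.0])
```

With a single sample the normalized weight is always one, which recovers the plain wake-sleep gradient mentioned above.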

Updating $q_\phi$ for given $p_\theta$: In order to minimize the variance of the estimator (1), we would like $q(h \mid x)$ to track $p(h \mid x)$. We propose to train $q$ using maximum likelihood learning with the loss $\mathcal{L}_q(\phi, x, h) = \log q_\phi(h \mid x)$. There are at least two reasonable options for obtaining training data for $\mathcal{L}_q$: 1) maximize $\mathcal{L}_q$ under the empirical training distribution $x \sim D$, $h \sim p(h \mid x)$, or 2) maximize $\mathcal{L}_q$ under the generative model $(x, h) \sim p_\theta(x, h)$. We will refer to the former as the wake phase q-update and to the latter as the sleep phase q-update. In the case of a DBN (where the top layer is generated by an RBM), there is an intermediate solution called contrastive wake-sleep, which has been proposed in (Hinton et al., 2006). In contrastive wake-sleep we sample $x$ from the training distribution, propagate it stochastically into the top layer and use that $h$ as the starting point for a short Markov chain in the RBM, then sample the other layers in the generative network $p$ to generate the rest of $(x, h)$. The objective is to put the inference network's capacity where it matters most, i.e., near the input configurations that are seen in the training set.

Analogous to eqn. (1) and (3), we use importance sampling to derive gradients for the wake phase q-update,

$$\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim D) \simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x) \tag{4}$$

with the same importance weights $\tilde{\omega}_k$ as in (3) (the details of the derivation can again be found in the supplement). Note that this is equivalent to optimizing $q$ so as to minimize $\mathrm{KL}(p(\cdot \mid x) \,\|\, q(\cdot \mid x))$. For the sleep phase q-update we consider the model distribution $p(x, h)$ a fully observed system and can thus derive gradients without further sampling:

$$\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, (x, h)) = \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \quad \text{with } x, h \sim p(x, h) \tag{5}$$

This update is equivalent to the sleep phase update in the classical wake-sleep algorithm.
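The statement that the wake phase q-update (4) minimizes $\mathrm{KL}(p(\cdot \mid x) \,\|\, q(\cdot \mid x))$ can be sketched in a few lines. This is an outline under the assumption that the samples $h^{(k)}$ are treated as fixed once drawn (the proposal is not differentiated through); the full derivation is the one referred to in the supplement.

```latex
\begin{aligned}
-\frac{\partial}{\partial\phi}\, \mathrm{KL}\big( p(\cdot \mid x) \,\|\, q_\phi(\cdot \mid x) \big)
  &= \mathbb{E}_{h \sim p(h \mid x)}\left[ \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
  &= \frac{1}{p(x)}\, \mathbb{E}_{h \sim q(h \mid x)}\left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
  &\simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x),
  \qquad h^{(k)} \sim q(h \mid x)
\end{aligned}
```

The first line holds because the entropy of $p(\cdot \mid x)$ does not depend on $\phi$; the second rewrites the expectation under the proposal using $p(h \mid x) = p(x, h) / p(x)$; the third replaces both the expectation and $p(x)$ by their K-sample estimates, which yields the normalized weights.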

Algorithm 1: Reweighted wake-sleep training procedure and likelihood estimator. K is the number of approximate inference samples and controls the trade-off between computation and accuracy of the estimators (both for the gradient and for the likelihood). We typically use a large value (K = 100,000) for the test set likelihood estimator but a small value (K = 5) for estimating gradients. Both the wake phase and sleep phase update rules for q are optionally included (either one or both can be used, and best results were obtained using both). The original wake-sleep algorithm has K = 1 and only uses the sleep phase update of q. To estimate the log-likelihood at test time, only the computations up to $\hat{\mathcal{L}}$ are required.

for number of training iterations do
    • Sample example(s) $x$ from the training distribution
    for k = 1 to K do
        • Layerwise sample latent variables $h^{(k)}$ from $q(h \mid x)$
        • Compute $q(h^{(k)} \mid x)$ and $p(x, h^{(k)})$
    end for
    • Compute unnormalized weights: $\omega_k = p(x, h^{(k)}) / q(h^{(k)} \mid x)$
    • Normalize the weights: $\tilde{\omega}_k = \omega_k / \sum_{k'} \omega_{k'}$
    • Compute unbiased likelihood estimator: $\hat{p}(x) = \mathrm{average}_k\, \omega_k$
    • Compute log-likelihood estimator: $\hat{\mathcal{L}}(x) = \log \mathrm{average}_k\, \omega_k$
    • Wake-phase update of p: use gradient estimator $\sum_k \tilde{\omega}_k\, \partial \log p(x, h^{(k)}) / \partial\theta$
    • Optionally, wake-phase update of q: use gradient estimator $\sum_k \tilde{\omega}_k\, \partial \log q(h^{(k)} \mid x) / \partial\phi$
    • Optionally, sleep-phase update of q: sample $(x', h')$ from $p$ and use gradient $\partial \log q(h' \mid x') / \partial\phi$
end for
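To make the steps of Algorithm 1 concrete, here is a small, self-contained sketch for the simplest possible case: one hidden layer, factorial sigmoid (SBN-style) layers for both p and q, plain SGD without momentum, and one example per step. All of these choices, the helper names (`rws_step`, `ascend`, `init_model`) and the toy data are assumptions made for illustration; this is not the authors' implementation, which is linked in section 4.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_probs(W, b, inp):
    """Factorial sigmoid layer: P(out_i = 1 | inp) = sigmoid(W[i] . inp + b[i])."""
    return [sigmoid(sum(w * v for w, v in zip(row, inp)) + bi) for row, bi in zip(W, b)]

def bernoulli_logp(v, probs):
    """log P(v) for a binary vector v under independent Bernoulli(probs)."""
    eps = 1e-12  # guard against log(0) if a unit saturates
    return sum(math.log(max(p if vi else 1.0 - p, eps)) for vi, p in zip(v, probs))

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def ascend(W, b, inp, target, probs, scale):
    """In place: add scale * gradient of log P(target | inp) for a sigmoid layer."""
    for i, (t, p) in enumerate(zip(target, probs)):
        g = scale * (t - p)
        b[i] += g
        for j, v in enumerate(inp):
            W[i][j] += g * v

def rws_step(x, model, K, lr, rng):
    """One RWS iteration for a single example x; returns the log-likelihood estimate."""
    Wp, bp, c, Wq, bq = model    # p(x|h): Wp, bp;  p(h): logits c;  q(h|x): Wq, bq
    prior = [sigmoid(ci) for ci in c]
    q_probs = layer_probs(Wq, bq, x)
    hs, pxs, log_w = [], [], []
    for _ in range(K):
        h = sample(q_probs, rng)                     # h^(k) ~ q(h | x)
        px = layer_probs(Wp, bp, h)
        log_p = bernoulli_logp(h, prior) + bernoulli_logp(x, px)
        log_w.append(log_p - bernoulli_logp(h, q_probs))
        hs.append(h)
        pxs.append(px)
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    log_lik = m + math.log(total / K)                # log average_k of the weights
    w_tilde = [wi / total for wi in w]               # normalized weights
    # Dream sample for the sleep phase, drawn before any parameter changes.
    h_dream = sample(prior, rng)
    x_dream = sample(layer_probs(Wp, bp, h_dream), rng)
    q_dream = layer_probs(Wq, bq, x_dream)
    for wk, h, px in zip(w_tilde, hs, pxs):
        ascend(Wp, bp, h, x, px, lr * wk)            # wake-phase update of p(x | h)
        for j, hj in enumerate(h):                   # wake-phase update of p(h)
            c[j] += lr * wk * (hj - prior[j])
        ascend(Wq, bq, x, h, q_probs, lr * wk)       # wake-phase update of q
    ascend(Wq, bq, x_dream, h_dream, q_dream, lr)    # sleep-phase update of q
    return log_lik

def init_model(n_vis, n_hid, rng):
    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return (rand(n_vis, n_hid), [0.0] * n_vis, [0.0] * n_hid, rand(n_hid, n_vis), [0.0] * n_hid)

rng = random.Random(0)
data = [[1, 1, 1, 0], [1, 1, 0, 0]]                  # toy binary "dataset"
model = init_model(4, 2, rng)
history = [rws_step(data[t % 2], model, K=5, lr=0.1, rng=rng) for t in range(600)]
```

The list `history` holds the per-step log-likelihood estimates, so comparing its first and last entries gives a rough picture of training progress on the toy data. Setting K = 1 and dropping the wake-phase update of q reduces the step to classical wake-sleep.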

2.4 RELATION TO WAKE-SLEEP AND VARIATIONAL BAYES

Recently, there has been a resurgence of interest in algorithms related to the Helmholtz machine and to the wake-sleep algorithm for directed graphical models containing either continuous or discrete latent variables.

In Neural Variational Inference and Learning (NVIL; Mnih and Gregor, 2014), the authors propose to maximize the variational lower bound on the log-likelihood to get a joint objective for both p and q. It was known that this approach results in a gradient estimate of very high variance for the recognition network q (Dayan and Hinton, 1996). In the NVIL paper, the authors therefore propose variance reduction techniques such as baselines to obtain a practical algorithm that improves significantly over the original wake-sleep algorithm. With respect to computational complexity, we note that while we draw K samples from the inference network for RWS, NVIL draws only a single sample from q but maintains, queries and trains an additional auxiliary baseline-estimating network. With RWS and a typical value of K = 5 we thus require at least twice as many arithmetic operations, but we do not have to store the baseline network and do not have to find suitable hyperparameters for it.

Recent examples for continuous latent variables include the auto-encoding variational Bayes (Kingma and Welling, 2014) and stochastic backpropagation papers (Rezende et al., 2014). In both cases one maximizes a variational lower bound on the log-likelihood that is rewritten as two terms: one that is a log-likelihood reconstruction error through a stochastic encoder (approximate inference) / decoder (generative model) pair, and one that regularizes the output of the approximate inference stochastic encoder so that its marginal distribution matches the generative prior on the latent variables (and the latter is also trained to match the marginal of the encoder output). Besides the fact that these variational auto-encoders are only for continuous latent variables, another difference with the reweighted wake-sleep algorithm proposed here is that in the former, a single sample from the approximate inference distribution is sufficient to get an unbiased estimator of the gradient of a proxy (the variational bound). Instead, with reweighted wake-sleep, a single sample would correspond to regular wake-sleep, which gives a biased estimator of the likelihood gradient. On the other hand, as the number of samples increases, reweighted wake-sleep provides a less biased (asymptotically unbiased) estimator of the log-likelihood and of its gradient. Similar in spirit, but aimed at a structured output prediction task, is the method proposed by Tang and Salakhutdinov (2013). The authors optimize the variational bound of the log-likelihood instead of the direct IS estimate, but they also derive update equations for the proposal distribution that share many of the properties also found in reweighted wake-sleep.
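For reference, the two-term form of the variational bound mentioned above can be written in the notation of section 2.1. This is the standard decomposition used in the variational auto-encoder literature, stated here as a sketch with $p(h)$ denoting the generative prior; it is not an equation taken from this paper.

```latex
\sum_h q(h \mid x) \log \frac{p(x, h)}{q(h \mid x)}
  = \underbrace{\mathbb{E}_{h \sim q(h \mid x)}\big[ \log p(x \mid h) \big]}_{\text{reconstruction term}}
  \;-\; \underbrace{\mathrm{KL}\big( q(h \mid x) \,\|\, p(h) \big)}_{\text{regularization term}}
```

For continuous $h$ the sum becomes an integral. The identity follows from $p(x, h) = p(x \mid h)\, p(h)$.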

3 COMPONENT LAYERS

Although the framework can be readily applied to continuous variables, we here restrict ourselves to distributions over binary visible and binary latent variables. We build our models by combining probabilistic components, each one associated with one of the layers of the generative network or of the inference network. The generative model can therefore be written as $p_\theta(x, h) = p_0(x \mid h_1)\, p_1(h_1 \mid h_2) \cdots p_L(h_L)$, while the inference network has the form $q_\phi(h \mid x) = q_1(h_1 \mid x) \cdots q_L(h_L \mid h_{L-1})$. For a distribution $P$ to be a suitable component we must have a method to efficiently compute $P(x^{(k)} \mid y^{(k)})$ given $(x^{(k)}, y^{(k)})$, and we must have a method to efficiently draw i.i.d. samples $x^{(k)} \sim P(x \mid y)$ for a given $y$. In the following we will describe experiments containing three kinds of layers.

Sigmoidal Belief Network (SBN) layer: An SBN layer (Saul et al., 1996) is a directed graphical model with independent variables $x_i$ given the parents $y$:

$$P^{\mathrm{SBN}}(x_i = 1 \mid y) = \sigma(W^{i,:}\, y + b_i) \tag{6}$$

Although an SBN is a very simple generative model given $y$, performing inference for $y$ given $x$ is in general intractable.

Autoregressive SBN layer (AR-SBN, DARN): If we consider $x_i$ an ordered set of observed variables and introduce directed, autoregressive links between all previous $x_{<i}$ and a given $x_i$, we obtain a fully-visible sigmoid belief network (FVSBN; Frey, 1998; Bengio and Bengio, 2000). When we additionally condition an FVSBN on the parent layer's $y$, we obtain a layer model that was first used in Deep AutoRegressive Networks (DARN; Gregor et al., 2014):

$$P^{\mathrm{AR\text{-}SBN}}(x_i = 1 \mid x_{<i}, y) = \sigma(W^{i,:}\, y + S^{i,<i}\, x_{<i} + b_i) \tag{7}$$

We use $x_{<i} = (x_1, x_2, \cdots, x_{i-1})$ to refer to the vector containing the first $i-1$ observed variables. The matrix $S$ is a lower triangular matrix that contains the autoregressive weights between the variables $x_i$, and with $S^{i,<j}$ we refer to the first $j-1$ elements of the $i$-th row of this matrix. In contrast to a regular SBN layer, the units $x_i$ are thus not independent of each other but can be predicted, as in a logistic regression, in terms of their predecessors $x_{<i}$ and of the input of the layer, $y$.
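The two operations required of a component layer, evaluating $P(x \mid y)$ and drawing samples, look as follows for an AR-SBN layer. This is a plain-Python sketch of eq. (7) with made-up weights; the function names are hypothetical, and `S` is stored as a full square matrix of which only the entries below the diagonal are read.

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def unit_prob(i, x_prev, y, W, S, b):
    """P(x_i = 1 | x_<i, y) as in eq. (7); only x_prev[0..i-1] is read."""
    a = b[i] + sum(w * yj for w, yj in zip(W[i], y))
    a += sum(S[i][j] * x_prev[j] for j in range(i))
    return sigmoid(a)

def ar_sbn_logp(x, y, W, S, b):
    """log P(x | y): sum of the per-unit conditional log-probabilities."""
    total = 0.0
    for i, xi in enumerate(x):
        p = unit_prob(i, x, y, W, S, b)
        total += math.log(p if xi else 1.0 - p)
    return total

def ar_sbn_sample(y, W, S, b, rng):
    """Ancestral sampling: units are drawn in order, each seeing its predecessors."""
    x = []
    for i in range(len(b)):
        x.append(1 if rng.random() < unit_prob(i, x, y, W, S, b) else 0)
    return x

# Hypothetical 3-unit layer with a 2-dimensional parent y.
W = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]
S = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [-2.0, 0.7, 0.0]]
b = [0.1, -0.2, 0.0]
y = [1, 0]
total_prob = sum(math.exp(ar_sbn_logp(list(x), y, W, S, b))
                 for x in itertools.product([0, 1], repeat=3))
x_sample = ar_sbn_sample(y, W, S, b, random.Random(0))
```

Summing the probabilities of all eight configurations gives one, which confirms that the chain of conditionals defines a proper distribution. The sequential loop in `ar_sbn_sample` is the reason such layers cannot be sampled in a single parallel step.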

Conditional NADE layer: The Neural Autoregressive Distribution Estimator (NADE; Larochelle and Murray, 2011) is a model that uses an internal, accumulating hidden layer to predict variables $x_i$ given the vector containing all previous variables $x_{<i}$. Instead of the logistic regression in an FVSBN or an AR-SBN, the dependency between the variables $x_i$ is here mediated by an MLP (Bengio and Bengio, 2000):

$$P(x_i = 1 \mid x_{<i}) = \sigma\big(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + a) + b_i\big) \tag{8}$$

with $W$ and $V$ denoting the encoding and decoding matrices for the NADE hidden layer. For our purposes we condition this model on the random variables $y$:

$$P^{\mathrm{NADE}}(x_i = 1 \mid x_{<i}, y) = \sigma\big(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + U_a\, y + a) + U_b^{i,:}\, y + b_i\big) \tag{9}$$

Such a conditional NADE has been used previously for modeling musical sequences (Boulanger-Lewandowski et al., 2012).
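The "accumulating hidden layer" can be read off eq. (9): the hidden pre-activation starts at $U_a y + a$ and gains one column of $W$ for every preceding unit that is switched on. The sketch below evaluates $\log P(x \mid y)$ this way in plain Python. The storage layout (`W_cols[i]` holding column $i$ of $W$) and all parameter values are assumptions chosen for illustration.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cond_nade_logp(x, y, W_cols, V, U_a, U_b, a, b):
    """log P(x | y) following eq. (9).

    W_cols[i] is column i of W (one weight per hidden unit); V[i] and U_b[i]
    are the rows for output i; U_a has one row per hidden unit.
    """
    acc = [aj + dot(row, y) for aj, row in zip(a, U_a)]    # a + U_a y
    log_prob = 0.0
    for i, xi in enumerate(x):
        hidden = [sigmoid(v) for v in acc]                  # depends on x_<i only
        p = sigmoid(dot(V[i], hidden) + dot(U_b[i], y) + b[i])
        log_prob += math.log(p if xi else 1.0 - p)
        if xi:                                              # accumulate column i of W
            acc = [v + w for v, w in zip(acc, W_cols[i])]
    return log_prob

# Hypothetical layer: 3 output units, 2 hidden units, 2-dimensional parent y.
W_cols = [[0.4, -0.3], [0.9, 0.1], [-0.5, 0.6]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.8, 0.3]]
U_a = [[0.2, -0.4], [0.7, 0.1]]
U_b = [[0.3, 0.0], [-0.2, 0.5], [0.1, -0.6]]
a = [0.0, 0.1]
b = [0.1, -0.1, 0.2]
y = [1, 1]
total_prob = sum(math.exp(cond_nade_logp(list(x), y, W_cols, V, U_a, U_b, a, b))
                 for x in itertools.product([0, 1], repeat=3))
```

Reusing `acc` across units keeps the cost of a full evaluation linear in the number of units times the hidden size, instead of recomputing $W^{:,<i} x_{<i}$ from scratch for every $i$.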

For each layer distribution we can construct an unconditioned distribution by removing the conditioning variable $y$. We use such unconditioned distributions as top layer $p(h)$ for the generative network $p$.

[Figure 1: two panels, both with a logarithmic x-axis "training samples" (10^0 to 10^2). Panel A: final LL estimate on the test set (y-axis from −130 to −80) for NADE 200, SBN 10-200-200 and SBN 200. Panel B: bias (y-axis from 0.3 to 0.9) and std-dev (y-axis from 0.6 to 1.8), each at epoch 50 and at the last epoch.]

Figure 1: A) Final log-likelihood estimate w.r.t. number of samples used during training. B) L2-norm of the bias and standard deviation of the low-sample estimated $p_\theta$ gradient relative to a high-sample (K = 5000) based estimate.

P-model size       NVIL       wake-sleep       RWS             RWS
                                               Q-model: SBN    Q-model: NADE
SBN 200            (113.1)    116.3 (120.7)    103.1           95.0
SBN 200-200        (99.8)     106.9 (109.4)    93.4            91.1
SBN 200-200-200    (96.7)     101.3 (104.4)    90.1            88.9
AR-SBN 200                                     89.2
AR-SBN 200-200                                 92.8
NADE 200                                                       86.8
NADE 200-200                                                   87.6

Table 1: MNIST results for various architectures and training methods. In the NVIL column we cite the numbers reported by Mnih and Gregor (2014). Values in brackets are variational NLL bounds; values without brackets report NLL estimates (see section 2.2).

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasets. The supplement describes additional experiments on smaller datasets from the UCI repository. With these experiments we 1) quantitatively analyze the influence of the number of samples K; 2) demonstrate that using a more powerful layer model for the inference network q can significantly enhance the results even when the generative model is a factorial SBN; and 3) show that we approach state-of-the-art performance when using either relatively deep models or when using powerful layer models such as a conditional NADE. Our implementation is available at https://github.com/jbornschein/reweighted-ws.

4.1 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) and downloaded in binarized form from (Larochelle, 2011). For training we use stochastic gradient descent with momentum (β = 0.95) and set the mini-batch size to 25. The experiments in this paragraph were run with learning rates of 0.0003, 0.001, and 0.003. From these three we always report the experiment with the highest validation log-likelihood. In the majority of our experiments a learning rate of 0.001 gave the best results, even across different layer models (SBN, AR-SBN and NADE). If not noted otherwise, we use K = 5 samples during training and K = 100,000 samples to estimate the final log-likelihood on the test set.¹ To disentangle the influence of the different q updating methods we set up p and q networks consisting of three hidden SBN layers with 10, 200 and 200 units (SBN/SBN 10-200-200). After convergence, the model trained updating q during the sleep phase only reached a final estimated log-likelihood of −93.4, the model trained with a q-update during the wake phase reached −92.8, and the model trained with both wake and sleep phase updates reached −91.9. As a control we trained a model that does not update q at all. This model reached −171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

¹ We refer to the lower bound estimates, which can be arbitrarily tightened by increasing the number of test samples, as LL estimates to distinguish them from the variational LL lower bounds (see section 2.2).

[Figure 2: panel A plots the estimated LL (y-axis from −130 to −80) against the number of samples (logarithmic x-axis, 10^0 to 10^2) for NADE-NADE 200, SBN-SBN 200-200-10 and SBN-SBN 200. Panels B and C, titled SBN/SBN 10-100-200-300-400 and NADE/NADE 250, show model samples.]

Figure 2: A) Final log-likelihood estimate w.r.t. number of test samples used. B) Samples from the SBN/SBN 10-200-200 generative model. C) Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)

Left: Results on binarized MNIST

Method                                  NLL
RWS (SBN/SBN 10-100-200-300-400)        85.48 (est.)
RWS (NADE/NADE 250)                     85.23 (est.)
RWS (AR-SBN/SBN 500)†                   84.18 (est.)
NADE (500 units [1])                    88.35
EoNADE (2hl, 128 orderings [2])         85.10
DARN (500 units [3])                    84.13
RBM (500 units, CD3 [4])                105.5
RBM (500 units, CD25 [4])               86.34
DBN (500-2000 [5])                      86.22 (bound), 84.55 (est.)

Right: Results on CalTech 101 Silhouettes

Method                                  NLL est.
RWS (SBN/SBN 10-50-100-300)             113.3
RWS (NADE/NADE 150)                     104.3
NADE (500 hidden units)                 110.6
RBM (4000 hidden units [6])             107.8

Table 2: Various RWS trained models in relation to previously published methods. [1] Larochelle and Murray (2011); [2] Murray and Larochelle (2014); [3] Gregor et al. (2014); [4] Salakhutdinov and Murray (2008); [5] Murray and Salakhutdinov (2009); [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

Next we investigate the influence of the number of samples used during training. The results are visualized in Fig. 1 A. Although the results depend on the layer distributions and on the depth and width of the architectures, we generally observe that the final estimated log-likelihood does not improve significantly when using more than 5 samples during training for NADE models, and using more than 25 samples for models with SBN layers. We can quantify the bias and the variance of the gradient estimator (3) using bootstrapping. While training an SBN/SBN 10-200-200 model with K = 100 training samples, we use K = 5,000 samples to get a high quality estimate of the gradient for a small but fixed set of 25 datapoints (the size of one mini-batch). By repeatedly resampling smaller sets of {1, 2, 5, ..., 5000} samples with replacement and by computing the gradient based on these, we get a measure for the bias and the variance of the small sample estimates relative to the high quality estimate. These results are visualized in Fig. 1 B. In Fig. 2 A we finally investigate the quality of the log-likelihood estimator (eqn. 1) when applied to the MNIST test set.
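The bootstrapping procedure described above can be sketched on synthetic data. In this illustration each importance sample carries a scalar stand-in for its gradient contribution, whereas the paper measures the L2-norm over the full gradient vector; the pool of weights and gradients is randomly generated and the helper names are hypothetical.

```python
import math
import random

def snis(log_w, grads):
    """Self-normalized importance-sampling estimate: sum_k w~_k * g_k."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * g for wi, g in zip(w, grads)) / sum(w)

def bootstrap_bias_std(log_w, grads, K, repeats, rng):
    """Bias and std. dev. of K-sample estimates relative to the full-pool estimate."""
    reference = snis(log_w, grads)
    n = len(log_w)
    estimates = []
    for _ in range(repeats):
        idx = [rng.randrange(n) for _ in range(K)]          # resample with replacement
        estimates.append(snis([log_w[i] for i in idx], [grads[i] for i in idx]))
    mean = sum(estimates) / repeats
    var = sum((e - mean) ** 2 for e in estimates) / repeats
    return abs(mean - reference), math.sqrt(var)

rng = random.Random(0)
pool_log_w = [rng.gauss(0.0, 1.0) for _ in range(2000)]           # synthetic log-weights
pool_grads = [lw + rng.gauss(0.0, 1.0) for lw in pool_log_w]      # correlated "gradients"
results = {K: bootstrap_bias_std(pool_log_w, pool_grads, K, 500, rng) for K in (1, 5, 50)}
flat = bootstrap_bias_std(pool_log_w, [2.0] * 2000, 5, 50, rng)   # constant gradient
```

Each entry of `results` is a `(bias, std)` pair; both shrink as K grows, which is the qualitative behaviour reported for Fig. 1 B. With a constant gradient every resample returns the same value, so bias and spread vanish.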

Table 1 summarizes how different architectures compare to each other and how RWS compares to related methods for training directed models. We essentially observe that RWS trained models consistently improve over classical wake-sleep, especially for deep architectures. We furthermore observe that using autoregressive layers (AR-SBN or NADE) for the inference network improves the results even when the generative model is composed of factorial SBN layers. Finally, we see that the best performing models with autoregressive layers in p are always shallow, with only a single hidden layer. In Table 2 (left) we compare some of our best models to the state-of-the-art results published on MNIST. The deep SBN/SBN 10-100-200-300-400 model was trained for 1000 epochs with K = 5 training samples and a learning rate of 0.001. For fine-tuning we run an additional 500 epochs with a learning rate decay of 1.005 and 100 training samples. For comparison we also train the best performing model from the DARN paper (Gregor et al., 2014) with RWS, i.e., a single layer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between the observed and the latents. We essentially obtain the same final test set log-likelihood. For this shallow network we thus do not observe any improvement from using RWS.

Figure 3: CalTech 101 Silhouettes. A) Random selection of training data points. B) Random samples from the SBN/SBN 10-50-100-300 generative network. C) Random samples from the NADE-150 generative network. (We show the probabilities from which each pixel is sampled.)

4.2 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 × 28 pixel CalTech 101 Silhouettes dataset. This dataset consists of 4,100 examples in the training set, 2,264 examples in the validation set and 2,307 examples in the test set. We trained various architectures on this dataset using the same hyperparameters as for the MNIST experiments. Table 2 (right) summarizes our results. Note that our best SBN/SBN model is a relatively deep network with 4 hidden layers (300-100-50-10) and reaches an estimated LL of −116.9 on the test set. Our best network, a shallow NADE/NADE-150 network, reaches −104.3 and improves over the previous state of the art (−107.8, an RBM with 4000 hidden units by Cho et al. (2013)).

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers of binary latent variables. It generalizes and improves over the wake-sleep algorithm, providing a lower-bias and lower-variance estimator of the log-likelihood gradient at the price of more samples from the inference network. During training, the weighted samples from the inference network decouple the layers such that the learning gradients only propagate within the individual layers. Our experiments demonstrate that a small number of ≈ 5 samples is typically sufficient to jointly train relatively deep architectures of at least 5 hidden layers without layerwise pretraining and without carefully tuning learning rates. The resulting models produce reasonable samples (by visual inspection) and they approach state-of-the-art performance in terms of log-likelihood on several discrete datasets.

We found that even in the cases when the generative networks contain SBN layers only, better results can be obtained with inference networks composed of more powerful, autoregressive layers. This, however, comes at the price of reduced computational efficiency on, e.g., GPUs, as the individual variables h_i ∼ q(h|x) have to be sampled in sequence (even though the theoretical complexity is not significantly worse compared to SBN layers).

We furthermore found that models with autoregressive layers in the generative network p typically produce very good results. But the best ones were always shallow, with only a single hidden layer. At this point it is unclear if this is due to optimization problems.

Acknowledgments: We would like to thank Laurent Dinh, Vincent Dumoulin and Li Yao for helpful discussions, and the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for their powerful software. We furthermore acknowledge CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources.


REFERENCES

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS'99, pages 400–406. MIT Press.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML'2012.

Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3), 805–831.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Larochelle, H. (2011). Binarized MNIST dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS'2011), volume 15 of JMLR W&CP.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014). To appear.

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In ICML'2014.

Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In NIPS'08, volume 21, pages 1137–1144.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25.

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In NIPS'2013.


6 SUPPLEMENT

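6.1 GRADIENTS FOR p(x, h)

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
 &= \frac{\partial}{\partial\theta} \log p_\theta(x)
  = \frac{1}{p(x)} \frac{\partial}{\partial\theta} \sum_h p(x, h) \\
 &= \frac{1}{p(x)} \sum_h p(x, h) \frac{\partial}{\partial\theta} \log p(x, h) \\
 &= \frac{1}{p(x)} \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \\
 &= \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
 &\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\end{aligned}
\tag{10}
$$

with $\omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$ and $h^{(k)} \sim q(h \mid x)$.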
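As a sanity check of the derivation above, the following minimal Python sketch evaluates the expectation form of eq. (10) by enumerating h rather than sampling it, and compares the result with a finite-difference gradient of log p(x). The one-parameter toy model (the values in `LIK`, the prior p(h = 1) = sigmoid(theta)) and all function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

LIK = [0.2, 0.9]   # hypothetical p(x | h) at the observed x, for h = 0, 1

def joint(theta, h):
    """p(x, h) with prior p(h = 1) = sigmoid(theta)."""
    prior1 = sigmoid(theta)
    return (prior1 if h else 1.0 - prior1) * LIK[h]

def log_px(theta):
    return math.log(joint(theta, 0) + joint(theta, 1))

def reweighted_grad(theta, q):
    """Expectation form of eq. (10), enumerating h instead of sampling."""
    num = den = 0.0
    for h in (0, 1):
        w = joint(theta, h) / q[h]       # importance weight p(x,h)/q(h|x)
        dlogp = h - sigmoid(theta)       # d/dtheta log p(x, h)
        num += q[h] * w * dlogp
        den += q[h] * w                  # sums to p(x)
    return num / den

theta, eps = 0.3, 1e-6
finite_diff = (log_px(theta + eps) - log_px(theta - eps)) / (2 * eps)
grad_a = reweighted_grad(theta, [0.5, 0.5])
grad_b = reweighted_grad(theta, [0.1, 0.9])
```

In this exact (enumerated) form the result does not depend on the proposal q; the dependence on q, and the bias, only enter once the expectation is replaced by K samples.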

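6.2 GRADIENTS FOR THE WAKE PHASE Q UPDATE

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
 &= \frac{\partial}{\partial\phi} \sum_h p(h \mid x) \log q_\phi(h \mid x) \\
 &= \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
 &\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{11}
$$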

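Note that we arrive at the same gradients when we set out to minimize KL(p(·|x) || q(·|x)) for a given datapoint x:

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathrm{KL}\left(p_\theta(h \mid x) \,\|\, q_\phi(h \mid x)\right)
 &= \frac{\partial}{\partial\phi} \sum_h p_\theta(h \mid x) \log \frac{p_\theta(h \mid x)}{q_\phi(h \mid x)} \\
 &= - \sum_h p_\theta(h \mid x) \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \\
 &\simeq - \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{12}
$$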


6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Figure 5 plot. x-axis: samples (10^0 to 10^3). Left y-axis: bias; right y-axis: std-dev. Curves: bias (epoch 50), bias (last epoch), std dev (epoch 50), std dev (last epoch).]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K=5000 primary samples from a SBN/SBN 10-200-200 network trained on MNIST).
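The bootstrapping analysis described above can be sketched as follows. This is a minimal illustration, not the code used for the figure: the pool of 5000 importance weights is drawn from a log-normal distribution as a stand-in for weights obtained from a trained model, and the function names and the number of bootstrap repetitions are assumptions.

```python
import math
import random

def log_mean(ws):
    """log of the average importance weight, i.e. the log-likelihood estimate."""
    return math.log(sum(ws) / len(ws))

def bootstrap_bias_std(weights, k, n_boot, rng):
    """Bias and std-dev of the k-sample estimate relative to the full pool."""
    ref = log_mean(weights)
    est = [log_mean(rng.choices(weights, k=k)) for _ in range(n_boot)]
    mean = sum(est) / n_boot
    var = sum((e - mean) ** 2 for e in est) / n_boot
    return mean - ref, math.sqrt(var)

rng = random.Random(0)
# Hypothetical pool of 5000 unnormalized importance weights (log-normal).
pool = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(5000)]
results = {k: bootstrap_bias_std(pool, k, 2000, rng) for k in (1, 10, 100)}
```

Each entry of `results` holds a (bias, std-dev) pair; resampling with replacement from the large pool plays the role of drawing fresh small sample sets.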


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on 8 different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space: The learning rate was set to a value in {0.001, 0.003, 0.01}. For SBNs we use K=10 training samples and we tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150 or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200 or 10-100-300 hidden units. We trained NADE/NADE models with K=5 training samples and one hidden layer with 30, 50, 75, 100 or 200 units in it.

Model | ADULT | CONNECT4 | DNA | MUSHROOMS | NIPS-0-12 | OCR-LETTERS | RCV1 | WEB
FVSBN | 13.17 | 12.39 | 83.64 | 10.27 | 276.88 | 39.30 | 49.84 | 29.35
NADE* | 13.19 | 11.99 | 84.81 | 9.81 | 273.08 | 27.22 | 46.66 | 28.39
EoNADE+ | 13.19 | 12.58 | 82.31 | 9.68 | 272.38 | 27.31 | 46.12 | 27.87
DARN3 | 13.19 | 11.91 | 81.04 | 9.55 | 274.68 | 28.17 | 46.10 | 28.83
RWS - SBN | 13.65 | 12.68 | 90.63 | 9.90 | 272.54 | 29.99 | 46.16 | 28.18
hidden units | 5-20-100 | 10-50-150 | 10-150 | 10-50-150 | 10-50-150 | 10-100-300 | 10-50-200 | 10-50-300
RWS - NADE | 13.16 | 11.68 | 84.26 | 9.71 | 271.11 | 26.43 | 46.09 | 27.92
hidden units: 30, 50, 100, 50, 75, 100

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, Larochelle (2014). (NADE*: 500 hidden units; EoNADE+: 1hl, 16 ord)


  • 1 Introduction
  • 2 Reweighted Wake-Sleep
    • 2.1 The Wake-Sleep Algorithm
    • 2.2 An Importance Sampling View yields Reweighted Wake-Sleep
    • 2.3 Training by Reweighted Wake-Sleep
    • 2.4 Relation to Wake-Sleep and Variational Bayes
  • 3 Component Layers
  • 4 Experiments
    • 4.1 MNIST
    • 4.2 CalTech 101 Silhouettes
  • 5 Conclusions
  • 6 Supplement
    • 6.1 Gradients for p(x, h)
    • 6.2 Gradients for the wake phase q update
    • 6.3 Additional experimental results
      • 6.3.1 Learning curves for MNIST experiments
      • 6.3.2 Bootstrapping based log(p(x)) bias/variance analysis
      • 6.3.3 UCI binary datasets

achieve good results. Unlike in the case of DBMs, which rely on a Markov chain to get samples and estimate the gradient by a mean over those samples, here the samples are i.i.d., avoiding the very serious problem of mixing between modes that can plague MCMC methods (Bengio et al., 2013) when training undirected graphical models.

Another contribution of this paper regards the architecture of the deep approximate inference network. We view the inference network as estimating the posterior distribution of latent variables given the observed input. With this view it is plausible that the classical architecture of the inference network (an SBN, details below) is inappropriate, and we test this hypothesis empirically. We find that more powerful parametrizations that can represent non-factorial posterior distributions yield better results.

2 REWEIGHTED WAKE-SLEEP

2.1 THE WAKE-SLEEP ALGORITHM

The wake-sleep algorithm was proposed as a way to train Helmholtz machines, which are deep directed graphical models p(x, h) over visible variables x and latent variables h, where the latent variables are organized in layers h_k. In the Helmholtz machine (Hinton et al., 1995; Dayan et al., 1995), the top layer h_L has a factorized unconditional distribution, so that ancestral sampling can proceed from h_L down to h_1, and then the sample x is generated by the bottom layer given h_1. In the deep belief network (DBN) (Hinton et al., 2006), the top layer is instead generated by an RBM, i.e., by a Markov chain, while simple ancestral sampling is used for the others. Each intermediate layer is specified by a conditional distribution parametrized as a stochastic sigmoidal layer (see section 3 for details).
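Ancestral sampling in such a layered model can be sketched in a few lines of Python. The sketch below assumes a factorized top layer and sigmoidal conditional layers as described above; the function names, the data layout (a list of weight/bias pairs ordered from the top down) and the toy weights are hypothetical and not taken from the authors' implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli(probs, rng):
    """Draw one binary vector with independent unit probabilities."""
    return [int(rng.random() < p) for p in probs]

def ancestral_sample(top_logits, layers, rng):
    """Sample h_L from a factorized prior, then each layer below in turn.

    layers: list of (W, b) pairs ordered from p(h_{L-1} | h_L) down to
    p(x | h_1); each row of W holds the incoming weights of one unit.
    Returns [h_L, ..., h_1, x].
    """
    state = bernoulli([sigmoid(z) for z in top_logits], rng)
    states = [state]
    for W, b in layers:
        logits = [sum(w * s for w, s in zip(row, state)) + bi
                  for row, bi in zip(W, b)]
        state = bernoulli([sigmoid(z) for z in logits], rng)
        states.append(state)
    return states

rng = random.Random(0)
# Hypothetical network with 2 latent and 3 visible units; the weights are
# chosen so large that sampling is practically deterministic.
top = [50.0, -50.0]
layer = ([[100.0, 0.0], [0.0, 100.0], [0.0, 0.0]], [-50.0, -50.0, 50.0])
h, x = ancestral_sample(top, [layer], rng)
```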

The wake-sleep algorithm is a training procedure for such generative models, which involves training an auxiliary network, called the inference network, that takes a visible vector x as input and stochastically outputs samples h_k for all layers k = 1 to L. The inference network outputs samples from a distribution that should estimate the conditional probability of the latent variables of the generative model (at all layers) given the input. Note that in these kinds of directed models exact inference, i.e., sampling from p(h|x), is intractable.

The wake-sleep algorithm proceeds in two phases. In the wake phase, an observation x is sampled from the training distribution D and propagated stochastically up the inference network (one layer at a time), thus sampling latent values h from q(h|x). Together with x, the sampled h forms a target for training p, i.e., one performs a step of gradient ascent with respect to maximum likelihood over the generative model p(x, h), with the data x and the inferred h. This is useful because, whereas computing the gradient of the marginal likelihood p(x) = Σ_h p(x, h) is intractable, computing the gradient of the complete log-likelihood log p(x, h) is easy. In addition, these updates decouple all the layers (because both the input and the target of each layer are considered observed). In the sleep phase, a "dream" sample is obtained from the generative network by ancestral sampling from p(x, h) and is used as a target for the maximum likelihood training of the inference network, i.e., q is trained to estimate p(h|x).

The justification for the wake-sleep algorithm that was originally proposed is based on the following variational bound,

$$
\log p(x) \ge \sum_h q(h \mid x) \log \frac{p(x, h)}{q(h \mid x)}
$$

that is true for any inference network q, but the bound becomes tight as q(h|x) approaches p(h|x). Maximizing this bound with respect to p corresponds to the wake phase update. The update with respect to q should minimize KL(q(h|x) || p(h|x)) (with q as the reference), but instead the sleep phase update minimizes the reversed KL divergence, KL(p(h|x) || q(h|x)) (with p as the reference).
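The behaviour of the bound can be checked numerically on a model small enough to enumerate. The following sketch uses a hypothetical model with a single binary latent variable and made-up joint probabilities; it only illustrates that the bound lies below log p(x) for a mismatched q and becomes tight when q equals the posterior.

```python
import math

def elbo(joint, q):
    """Variational bound sum_h q(h|x) * log(p(x,h) / q(h|x)) for one fixed x."""
    return sum(qh * math.log(ph / qh) for ph, qh in zip(joint, q) if qh > 0)

# Hypothetical toy model: one binary latent h, joint p(x, h) at a fixed x.
joint = [0.1, 0.3]                      # p(x, h=0), p(x, h=1)
log_px = math.log(sum(joint))           # exact log p(x)
posterior = [ph / sum(joint) for ph in joint]

loose = elbo(joint, [0.5, 0.5])         # mismatched q: strictly below log p(x)
tight = elbo(joint, posterior)          # q equal to p(h|x): bound is tight
```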

2.2 AN IMPORTANCE SAMPLING VIEW YIELDS REWEIGHTED WAKE-SLEEP

If we think of q(h|x) as estimating p(h|x) and train it accordingly (which is basically what the sleep phase of wake-sleep does), then we can reformulate the likelihood as an importance-weighted average:

$$
p(x) = \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)}
     = \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \right]
     \simeq \frac{1}{K} \sum_{\substack{k=1 \\ h^{(k)} \sim q(h \mid x)}}^{K} \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}
\tag{1}
$$

Eqn. (1) is a consistent and unbiased estimator for the marginal likelihood p(x). The optimal q that results in a minimum variance estimator is q*(h|x) = p(h|x). In fact, we can show that this is a zero-variance estimator, i.e., the best possible one that will result in a perfect p(x) estimate even with a single arbitrary sample h ∼ p(h|x):

$$
\mathbb{E}_{h \sim p(h \mid x)} \left[ \frac{p(h \mid x)\, p(x)}{p(h \mid x)} \right]
 = p(x)\, \mathbb{E}_{h \sim p(h \mid x)} [1] = p(x)
\tag{2}
$$

Any mismatch between q and p(h|x) will increase the variance of this estimator, but it will not introduce any bias. In practice, however, we are typically interested in an estimator for the log-likelihood. Taking the logarithm of (1) and averaging over multiple datapoints will result in a conservative biased estimate and will, on average, underestimate the true log-likelihood due to the concavity of the logarithm. Increasing the number of samples will decrease both the bias and the variance. Variants of this estimator have been used in, e.g., (Rezende et al., 2014; Gregor et al., 2014) to evaluate trained models.
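These properties can be illustrated with a small simulation. The sketch below is a hypothetical example (a single binary latent variable with made-up joint probabilities, hypothetical function names): it shows that one sample from the exact posterior already reproduces p(x), that a mismatched q still gives the right value on average, and that the average of the logarithm of small-K estimates lies below log p(x).

```python
import math
import random

def is_estimate(joint, q, K, rng):
    """Eq. (1): average of p(x,h)/q(h|x) over K samples h ~ q(h|x)."""
    states = range(len(joint))
    samples = rng.choices(states, weights=q, k=K)
    return sum(joint[h] / q[h] for h in samples) / K

rng = random.Random(0)
joint = [0.1, 0.3]                       # hypothetical p(x, h) at a fixed x
p_x = sum(joint)
posterior = [ph / p_x for ph in joint]

# With q = p(h|x) every weight equals p(x): a single sample is exact.
exact_single = is_estimate(joint, posterior, K=1, rng=rng)

# With a mismatched q the estimate is noisy but unbiased, while the mean of
# the log of small-K estimates underestimates log p(x).
runs = [is_estimate(joint, [0.5, 0.5], K=2, rng=rng) for _ in range(20000)]
mean_est = sum(runs) / len(runs)
mean_log_est = sum(math.log(r) for r in runs) / len(runs)
```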

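2.3 TRAINING BY REWEIGHTED WAKE-SLEEP

We now consider the models p and q parameterized with parameters θ and φ, respectively.

Updating pθ for given qφ: We propose to use an importance sampling estimator based on eq. (1) to compute the gradient of the marginal log-likelihood L_p(θ, x) = log pθ(x):

$$
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
 = \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right]
 \simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
 \quad \text{with } h^{(k)} \sim q(h \mid x)
\tag{3}
$$

and the importance weights

$$
\tilde{\omega}_k = \frac{\omega_k}{\sum_{k'=1}^{K} \omega_{k'}}, \qquad
\omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}
$$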

See the supplement for a detailed derivation. Note that this is a biased estimator because it implicitly contains a division by the estimated p(x). Furthermore, there is no guarantee that q = p(h|x) results in a minimum variance estimate of this gradient. But both the bias and the variance decrease as the number of samples is increased. Also note that the wake-sleep algorithm uses a gradient that is equivalent to using only K = 1 sample. Another noteworthy detail about eq. (3) is that the importance weights ω̃ are automatically normalized such that they sum up to one.
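A minimal sketch of how the normalized weights and the weighted gradient of eq. (3) might be computed is given below. Working with log-probabilities and subtracting the maximum before exponentiating is a common numerical-stability trick and an assumption here, not something stated in the text; the function names and the example numbers are hypothetical.

```python
import math

def normalized_weights(log_p_joint, log_q):
    """Return the normalized weights w_k / sum_k' w_k', w_k = p(x,h_k)/q(h_k|x).

    Works in log space and subtracts the maximum before exponentiating
    (a standard stability trick, assumed here for illustration).
    """
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def weighted_gradient(weights, per_sample_grads):
    """Eq. (3)-style estimator: sum_k w_k * grad_k, with gradients as lists."""
    dim = len(per_sample_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, per_sample_grads))
            for i in range(dim)]

# Hypothetical log-probabilities for K = 3 samples.
w = normalized_weights([-90.0, -92.0, -95.0], [-3.0, -4.0, -2.0])
g = weighted_gradient(w, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
single = normalized_weights([-500.0], [-1.0])   # K = 1: the weight is 1
```

With K = 1 the single normalized weight equals one regardless of its unnormalized value, which is the wake-sleep special case mentioned above.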

Updating qφ for given pθ: In order to minimize the variance of the estimator (1) we would like q(h|x) to track p(h|x). We propose to train q using maximum likelihood learning with the loss L_q(φ, x, h) = log qφ(h|x). There are at least two reasonable options for obtaining training data for L_q: 1) maximize L_q under the empirical training distribution x ∼ D, h ∼ p(h|x), or 2) maximize L_q under the generative model (x, h) ∼ pθ(x, h). We will refer to the former as the wake phase q-update and to the latter as the sleep phase q-update. In the case of a DBN (where the top layer is generated by an RBM), there is an intermediate solution called contrastive wake-sleep, which has been proposed in (Hinton et al., 2006). In contrastive wake-sleep we sample x from the training distribution, propagate it stochastically into the top layer and use that h as the starting point for a short Markov chain in the RBM, then sample the other layers in the generative network p to generate the rest of (x, h). The objective is to put the inference network's capacity where it matters most, i.e., near the input configurations that are seen in the training set.

Analogous to eqn. (1) and (3) we use importance sampling to derive gradients for the wake phase q-update

$$
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
 \simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\tag{4}
$$

with the same importance weights ω̃_k as in (3) (the details of the derivation can again be found in the supplement). Note that this is equivalent to optimizing q so as to minimize KL(p(·|x) || q(·|x)). For the sleep phase q-update we consider the model distribution p(x, h) a fully observed system and can thus derive gradients without further sampling:

$$
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, (x, h))
 = \frac{\partial}{\partial\phi} \log q_\phi(h \mid x)
 \quad \text{with } x, h \sim p(x, h)
\tag{5}
$$

This update is equivalent to the sleep phase update in the classical wake-sleep algorithm.

Algorithm 1: Reweighted Wake-Sleep training procedure and likelihood estimator. K is the number of approximate inference samples and controls the trade-off between computation and accuracy of the estimators (both for the gradient and for the likelihood). We typically use a large value (K = 100,000) for the test set likelihood estimator but a small value (K = 5) for estimating gradients. Both the wake phase and sleep phase update rules for q are optionally included (either one or both can be used, and best results were obtained using both). The original wake-sleep algorithm has K=1 and only uses the sleep phase update of q. To estimate the log-likelihood at test time, only the computations up to L̂(x) are required.

for number of training iterations do
  • Sample example(s) x from the training distribution
  for k = 1 to K do
    • Layerwise sample latent variables h(k) from q(h|x)
    • Compute q(h(k)|x) and p(x, h(k))
  end for
  • Compute unnormalized weights: ω_k = p(x, h(k)) / q(h(k)|x)
  • Normalize the weights: ω̃_k = ω_k / Σ_k′ ω_k′
  • Compute unbiased likelihood estimator: p̂(x) = average_k ω_k
  • Compute log-likelihood estimator: L̂(x) = log average_k ω_k
  • Wake-phase update of p: use gradient estimator Σ_k ω̃_k ∂ log p(x, h(k)) / ∂θ
  • Optionally, wake-phase update of q: use gradient estimator Σ_k ω̃_k ∂ log q(h(k)|x) / ∂φ
  • Optionally, sleep-phase update of q: sample (x′, h′) from p and use gradient ∂ log q(h′|x′) / ∂φ
end for
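The procedure above can be made concrete with a small, self-contained Python sketch. It is an illustration only and not the authors' Theano implementation: it assumes a single hidden layer with factorial (SBN-style) distributions for p(h), p(x|h) and q(h|x), plain SGD without momentum, and hypothetical names (`ToyRWS`, `step`) and hyperparameters throughout.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bern_logp(v, probs):
    """Log-probability of binary vector v under independent Bernoullis."""
    return sum(math.log(p if vi else 1.0 - p) for vi, p in zip(v, probs))

def affine(M, v, bias):
    return [sum(m * vj for m, vj in zip(row, v)) + b for row, b in zip(M, bias)]

def sample(probs, rng):
    return [int(rng.random() < p) for p in probs]

class ToyRWS:
    """Hypothetical one-hidden-layer model: p(h) p(x|h) and q(h|x)."""

    def __init__(self, n_x, n_h, seed=0):
        self.rng = random.Random(seed)
        rnd = lambda r, c: [[self.rng.gauss(0.0, 0.1) for _ in range(c)]
                            for _ in range(r)]
        self.a = [0.0] * n_h                          # logits of p(h)
        self.W, self.b = rnd(n_x, n_h), [0.0] * n_x   # p(x|h)
        self.V, self.c = rnd(n_h, n_x), [0.0] * n_h   # q(h|x)

    def step(self, x, K=5, lr=0.1):
        rng = self.rng
        # Sample h(k) ~ q(h|x) and compute importance weights in log space.
        qp = [sigmoid(z) for z in affine(self.V, x, self.c)]
        pa = [sigmoid(z) for z in self.a]
        hs = [sample(qp, rng) for _ in range(K)]
        px = [[sigmoid(z) for z in affine(self.W, h, self.b)] for h in hs]
        log_w = [bern_logp(h, pa) + bern_logp(x, p) - bern_logp(h, qp)
                 for h, p in zip(hs, px)]
        m = max(log_w)
        w = [math.exp(l - m) for l in log_w]
        total = sum(w)
        ll = m + math.log(total / K)        # log average_k w_k
        w = [wi / total for wi in w]        # normalized weights
        # Wake-phase updates of p and q, weighted by the normalized weights.
        for wk, h, p in zip(w, hs, px):
            for j, hj in enumerate(h):
                self.a[j] += lr * wk * (hj - pa[j])
                self.c[j] += lr * wk * (hj - qp[j])
                for i, xi in enumerate(x):
                    self.V[j][i] += lr * wk * (hj - qp[j]) * xi
            for i, xi in enumerate(x):
                self.b[i] += lr * wk * (xi - p[i])
                for j, hj in enumerate(h):
                    self.W[i][j] += lr * wk * (xi - p[i]) * hj
        # Sleep-phase update of q on a "dream" sample (x', h') ~ p.
        hd = sample([sigmoid(z) for z in self.a], rng)
        xd = sample([sigmoid(z) for z in affine(self.W, hd, self.b)], rng)
        qd = [sigmoid(z) for z in affine(self.V, xd, self.c)]
        for j, hj in enumerate(hd):
            self.c[j] += lr * (hj - qd[j])
            for i, xi in enumerate(xd):
                self.V[j][i] += lr * (hj - qd[j]) * xi
        return ll, w

model = ToyRWS(n_x=4, n_h=2)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]          # two made-up binary patterns
history = [model.step(data[t % 2])[0] for t in range(200)]
```

Each call to `step` returns the log-likelihood estimate for the presented example together with the normalized weights; calling it with `K=1` and dropping the wake-phase q-update would correspond to the classical wake-sleep setting described in the caption.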

2.4 RELATION TO WAKE-SLEEP AND VARIATIONAL BAYES

Recently, there has been a resurgence of interest in algorithms related to the Helmholtz machine and to the wake-sleep algorithm for directed graphical models containing either continuous or discrete latent variables.

In Neural Variational Inference and Learning (NVIL; Mnih and Gregor, 2014), the authors propose to maximize the variational lower bound on the log-likelihood to get a joint objective for both p and q. It was known that this approach results in a gradient estimate of very high variance for the recognition network q (Dayan and Hinton, 1996). In the NVIL paper, the authors therefore propose variance reduction techniques such as baselines to obtain a practical algorithm that improves significantly on the original wake-sleep algorithm. With respect to the computational complexity, we note that while we draw K samples from the inference network for RWS, NVIL on the other hand draws only a single sample from q but maintains, queries and trains an additional auxiliary baseline-estimating network. With RWS and a typical value of K = 5 we thus require at least twice as many arithmetic operations, but we do not have to store the baseline network and do not have to find suitable hyperparameters for it.

Recent examples for continuous latent variables include the auto-encoding variational Bayes (Kingma and Welling, 2014) and stochastic backpropagation papers (Rezende et al., 2014). In both cases one maximizes a variational lower bound on the log-likelihood that is rewritten as two terms: one that is the log-likelihood reconstruction error through a stochastic encoder (approximate inference) - decoder (generative model) pair, and one that regularizes the output of the approximate inference stochastic encoder so that its marginal distribution matches the generative prior on the latent variables (and the latter is also trained to match the marginal of the encoder output). Besides the fact that these variational auto-encoders are only for continuous latent variables, another difference with the reweighted wake-sleep algorithm proposed here is that in the former, a single sample from the approximate inference distribution is sufficient to get an unbiased estimator of the gradient of a proxy (the variational bound). Instead, with reweighted wake-sleep, a single sample would correspond to regular wake-sleep, which gives a biased estimator of the likelihood gradient. On the other hand, as the number of samples increases, reweighted wake-sleep provides a less biased (asymptotically unbiased) estimator of the log-likelihood and of its gradient. Similar in spirit, but aimed at a structured output prediction task, is the method proposed by Tang and Salakhutdinov (2013). The authors optimize the variational bound of the log-likelihood instead of the direct IS estimate, but they also derive update equations for the proposal distribution that resemble many of the properties also found in reweighted wake-sleep.

3 COMPONENT LAYERS

Although the framework can be readily applied to continuous variables, we here restrict ourselves to distributions over binary visible and binary latent variables. We build our models by combining probabilistic components, each one associated with one of the layers of the generative network or of the inference network. The generative model can therefore be written as pθ(x, h) = p_0(x|h_1) p_1(h_1|h_2) ··· p_L(h_L), while the inference network has the form qφ(h|x) = q_1(h_1|x) ··· q_L(h_L|h_{L−1}). For a distribution P to be a suitable component we must have a method to efficiently compute P(x(k)|y(k)) given (x(k), y(k)), and we must have a method to efficiently draw i.i.d. samples x(k) ∼ P(x|y) for a given y. In the following we will describe experiments containing three kinds of layers:

Sigmoidal Belief Network (SBN) layer: A SBN layer (Saul et al., 1996) is a directed graphical model with independent variables x_i given the parents y:

$$
P^{\mathrm{SBN}}(x_i = 1 \mid y) = \sigma(W_i\, y + b_i)
\tag{6}
$$

Although a SBN is a very simple generative model given y, performing inference for y given x is in general intractable.
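The two operations required of a component layer (evaluating the probability of a configuration and drawing samples) are straightforward for an SBN layer. The following Python sketch illustrates eq. (6); the function names and the example parameters are hypothetical.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sbn_probs(W, b, y):
    """P(x_i = 1 | y) = sigmoid(W_i . y + b_i) for every unit i."""
    return [sigmoid(sum(w * yj for w, yj in zip(row, y)) + bi)
            for row, bi in zip(W, b)]

def sbn_sample(W, b, y, rng):
    """Draw x ~ P(x | y); all units are sampled independently."""
    return [int(rng.random() < p) for p in sbn_probs(W, b, y)]

def sbn_log_prob(W, b, x, y):
    """log P(x | y) as a sum over independent units."""
    return sum(math.log(p if xi else 1.0 - p)
               for xi, p in zip(x, sbn_probs(W, b, y)))

rng = random.Random(0)
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]   # hypothetical: 3 units, 2 parents
b = [0.1, -0.2, 0.0]
y = [1, 0]
x = sbn_sample(W, b, y, rng)
# The probabilities of all 2^3 configurations sum to one.
total = sum(math.exp(sbn_log_prob(W, b, list(v), y))
            for v in itertools.product([0, 1], repeat=3))
```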

Autoregressive SBN layer (AR-SBN, DARN): If we consider x_i an ordered set of observed variables and introduce directed, autoregressive links between all previous x_{<i} and a given x_i, we obtain a fully-visible sigmoid belief network (FVSBN; Frey, 1998; Bengio and Bengio, 2000). When we additionally condition a FVSBN on the parent layer's y we obtain a layer model that was first used in Deep AutoRegressive Networks (DARN; Gregor et al., 2014):

$$
P^{\mathrm{AR\text{-}SBN}}(x_i = 1 \mid x_{<i}, y) = \sigma(W_i\, y + S_{i,<i}\, x_{<i} + b_i)
\tag{7}
$$

We use x_{<i} = (x_1, x_2, ···, x_{i−1}) to refer to the vector containing the first i−1 observed variables. The matrix S is a lower triangular matrix that contains the autoregressive weights between the variables x_i, and with S_{i,<j} we refer to the first j−1 elements of the i-th row of this matrix. In contrast to a regular SBN layer, the units x_i are thus not independent of each other but can be predicted, like in a logistic regression, in terms of their predecessors x_{<i} and of the input of the layer, y.
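A sketch of eq. (7) in Python makes the sequential structure explicit: each unit depends on the parents y and on the units sampled before it. The names and parameters below are hypothetical, and S is stored as a full matrix of which only the strictly lower triangle is used.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ar_sbn_log_prob(W, S, b, x, y):
    """Eq. (7): sum_i log P(x_i | x_<i, y), using only S[i][j] with j < i."""
    total = 0.0
    for i, xi in enumerate(x):
        z = b[i] + sum(w * yj for w, yj in zip(W[i], y))
        z += sum(S[i][j] * x[j] for j in range(i))   # predecessors x_<i only
        p = sigmoid(z)
        total += math.log(p if xi else 1.0 - p)
    return total

def ar_sbn_sample(W, S, b, y, rng):
    """Units have to be drawn one after another, in order."""
    x = []
    for i in range(len(b)):
        z = b[i] + sum(w * yj for w, yj in zip(W[i], y))
        z += sum(S[i][j] * x[j] for j in range(i))
        x.append(int(rng.random() < sigmoid(z)))
    return x

rng = random.Random(0)
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]            # hypothetical weights
S = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [-2.0, 0.7, 0.0]]
S0 = [[0.0] * 3 for _ in range(3)]                    # no autoregressive links
b = [0.1, -0.2, 0.0]
y = [1, 0]
x = ar_sbn_sample(W, S, b, y, rng)
total = sum(math.exp(ar_sbn_log_prob(W, S, b, list(v), y))
            for v in itertools.product([0, 1], repeat=3))
```

Setting all autoregressive weights to zero (`S0`) recovers a plain SBN layer.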

Conditional NADE layer: The Neural Autoregressive Distribution Estimator (NADE; Larochelle and Murray, 2011) is a model that uses an internal, accumulating hidden layer to predict variables x_i given the vector containing all previous variables x_{<i}. Instead of logistic regression as in a FVSBN or an AR-SBN, the dependency between the variables x_i is here mediated by an MLP (Bengio and Bengio, 2000):

$$
P(x_i = 1 \mid x_{<i}) = \sigma\left(V_i\, \sigma(W_{<i}\, x_{<i} + a) + b_i\right)
\tag{8}
$$

with W and V denoting the encoding and decoding matrices for the NADE hidden layer. For our purposes we condition this model on the random variables y:

$$
P^{\mathrm{NADE}}(x_i = 1 \mid x_{<i}, y) = \sigma\left(V_i\, \sigma(W_{<i}\, x_{<i} + U_a\, y + a) + U_b^{i}\, y + b_i\right)
\tag{9}
$$

Such a conditional NADE has been used previously for modeling musical sequences (Boulanger-Lewandowski et al., 2012).
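The accumulating hidden layer of eq. (9) can be sketched as follows. The code keeps a running pre-activation a + U_a y + W_{<i} x_{<i} and adds the i-th column of W once x_i is known. The matrix shapes (W: hidden × visible, V: visible × hidden, U_a: hidden × parents, U_b: visible × parents), the function name and the random parameters are assumptions made for this illustration.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nade_log_prob(W, V, Ua, Ub, a, b, x, y):
    """log P(x | y) for a conditional NADE layer, eq. (9)."""
    n_hid = len(a)
    # Running hidden pre-activation: a + Ua y + W_<i x_<i (empty sum at i = 0).
    acc = [a[k] + sum(u * yj for u, yj in zip(Ua[k], y)) for k in range(n_hid)]
    total = 0.0
    for i, xi in enumerate(x):
        hid = [sigmoid(v) for v in acc]
        z = b[i] + sum(v * hk for v, hk in zip(V[i], hid))
        z += sum(u * yj for u, yj in zip(Ub[i], y))
        p = sigmoid(z)
        total += math.log(p if xi else 1.0 - p)
        if xi:                       # add column i of W for the later units
            for k in range(n_hid):
                acc[k] += W[k][i]
    return total

rng = random.Random(0)
mat = lambda r, c: [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(r)]
n_x, n_hid, n_y = 3, 2, 2            # hypothetical layer sizes
W, V = mat(n_hid, n_x), mat(n_x, n_hid)
Ua, Ub = mat(n_hid, n_y), mat(n_x, n_y)
a, b = [0.0] * n_hid, [0.0] * n_x
y = [1, 0]
total = sum(math.exp(nade_log_prob(W, V, Ua, Ub, a, b, list(v), y))
            for v in itertools.product([0, 1], repeat=n_x))
```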

For each layer distribution we can construct an unconditioned distribution by removing the conditioning variable y. We use such unconditioned distributions as the top layer p(h) for the generative network p.


[Figure 1 plots. Panel A: x-axis: training samples (10^0 to 10^2); y-axis: Final LL estimate (testset); curves: NADE 200, SBN 10-200-200, SBN 200. Panel B: x-axis: training samples (10^0 to 10^2); left y-axis: bias; right y-axis: std-dev; curves: bias (epoch 50), bias (last epoch), std dev (epoch 50), std dev (last epoch).]

Figure 1: A) Final log-likelihood estimate w.r.t. number of samples used during training. B) L2-norm of the bias and standard deviation of the low-sample estimated pθ gradient relative to a high-sample (K=5000) based estimate.

P-model, size | NVIL | wake-sleep | RWS (Q-model: SBN) | RWS (Q-model: NADE)
SBN 200 | (113.1) | 116.3 (120.7) | 103.1 | 95.0
SBN 200-200 | (99.8) | 106.9 (109.4) | 93.4 | 91.1
SBN 200-200-200 | (96.7) | 101.3 (104.4) | 90.1 | 88.9
AR-SBN 200 | 89.2
AR-SBN 200-200 | 92.8
NADE 200 | 86.8
NADE 200-200 | 87.6

Table 1: MNIST results for various architectures and training methods. In the 3rd column we cite the numbers reported by Mnih and Gregor (2014). Values in brackets are variational NLL bounds, values without brackets report NLL estimates (see section 2.2).

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasets. The supplement describes additional experiments on smaller datasets from the UCI repository. With these experiments we 1) quantitatively analyze the influence of the number of samples K; 2) demonstrate that using a more powerful layer-model for the inference network q can significantly enhance the results even when the generative model is a factorial SBN; and 3) show that we approach state-of-the-art performance when using either relatively deep models or when using powerful layer models such as a conditional NADE. Our implementation is available at https://github.com/jbornschein/reweighted-ws.

4.1 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) and downloaded in binarized form from (Larochelle, 2011). For training we use stochastic gradient descent with momentum (β = 0.95) and set the mini-batch size to 25. The experiments in this paragraph were run with learning rates of {0.0003, 0.001, and 0.003}. From these three we always report the experiment with the highest validation log-likelihood. In the majority of our experiments a learning rate of 0.001 gave the best results, even across different layer models (SBN, AR-SBN and NADE). If not noted otherwise, we use K = 5 samples during training and K = 100,000 samples to estimate the final log-likelihood on the test set.¹ To disentangle the influence of the different q updating methods we set up p and q networks consisting of three hidden SBN layers with 10, 200 and 200 units (SBN/SBN 10-200-200). After convergence, the model trained updating q during the sleep phase only reached a final estimated log-likelihood of −93.4, the model trained with a q-update during the wake phase reached −92.8, and the model trained with both wake and sleep phase update reached −91.9. As a control we trained a model that does not update q at all. This model reached

¹ We refer to the lower bound estimates, which can be arbitrarily tightened by increasing the number of test samples, as LL estimates to distinguish them from the variational LL lower bounds (see section 2.2).


[Figure 2 plots. Panel A: x-axis: samples (10^0 to 10^2); y-axis: est. LL; curves: NADE-NADE 200, SBN-SBN 200-200-10, SBN-SBN 200. Panel B title: SBN/SBN 10-100-200-300-400. Panel C title: NADE/NADE 250.]

Figure 2: A) Final log-likelihood estimate w.r.t. number of test samples used. B) Samples from the SBN/SBN 10-200-200 generative model. C) Samples from the NADE/NADE 250 generative model (we show the probabilities from which each pixel is sampled).

Results on binarized MNIST

Method | NLL bound | NLL est.
RWS (SBN/SBN 10-100-200-300-400) | 85.48
RWS (NADE/NADE 250) | 85.23
RWS (AR-SBN/SBN 500)† | 84.18
NADE (500 units, [1]) | 88.35
EoNADE (2hl, 128 orderings, [2]) | 85.10
DARN (500 units, [3]) | 84.13
RBM (500 units, CD3, [4]) | 105.5
RBM (500 units, CD25, [4]) | 86.34
DBN (500-2000, [5]) | 86.22 | 84.55

Results on CalTech 101 Silhouettes

Method | NLL est.
RWS (SBN/SBN 10-50-100-300) | 113.3
RWS (NADE/NADE 150) | 104.3
NADE (500 hidden units) | 110.6
RBM (4000 hidden units, [6]) | 107.8

Table 2: Various RWS trained models in relation to previously published methods: [1] Larochelle and Murray (2011), [2] Uria et al. (2014), [3] Gregor et al. (2014), [4] Salakhutdinov and Murray (2008), [5] Murray and Salakhutdinov (2009), [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

−171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

Next we investigate the influence of the number of samples used during training. The results are visualized in Fig. 1 A. Although the results depend on the layer-distributions and on the depth and width of the architectures, we generally observe that the final estimated log-likelihood does not improve significantly when using more than 5 samples during training for NADE models, and when using more than 25 samples for models with SBN layers. We can quantify the bias and the variance of the gradient estimator (3) using bootstrapping. While training a SBN/SBN 10-200-200 model with K = 100 training samples, we use K = 5,000 samples to get a high quality estimate of the gradient for a small but fixed set of 25 datapoints (the size of one mini-batch). By repeatedly resampling smaller sets of {1, 2, 5, ···, 5000} samples with replacement and by computing the gradient based on these, we get a measure for the bias and the variance of the small sample estimates relative to the high quality estimate. These results are visualized in Fig. 1 B. In Fig. 2 A we finally investigate the quality of the log-likelihood estimator (eqn. 1) when applied to the MNIST test set.

Table 1 summarizes how different architectures compare to each other and how RWS compares to related methods for training directed models. We essentially observe that RWS trained models consistently improve over classical wake-sleep, especially for deep architectures. We furthermore observe that using autoregressive layers (AR-SBN or NADE) for the inference network improves the results even when the generative model is composed of factorial SBN layers. Finally, we see that the best performing models with autoregressive layers in p are always shallow with only a single hidden layer.

In Table 2 (left) we compare some of our best models to the state-of-the-art results published on MNIST. The deep SBN/SBN 10-100-200-300-400 model was trained for 1000 epochs with K = 5 training samples and a learning rate of 0.001. For fine-tuning we ran an additional 500 epochs with a learning rate decay of 1.005 and 100 training samples. For comparison we also train the best performing model from the DARN paper (Gregor et al., 2014) with RWS, i.e., a single layer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between the observed and the latents. We essentially obtain the same final test set log-likelihood. For this shallow network we thus do not observe any improvement from using RWS.

Figure 3: CalTech 101 Silhouettes. A Random selection of training data points. B Random samples from the SBN/SBN 10-50-100-300 generative network. C Random samples from the NADE-150 generative network. (We show the probabilities from which each pixel is sampled.)

4.2 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 × 28 pixel CalTech 101 Silhouettes dataset. This dataset consists of 4,100 examples in the training set, 2,264 examples in the validation set and 2,307 examples in the test set. We trained various architectures on this dataset using the same hyperparameters as for the MNIST experiments. Table 2 (right) summarizes our results. Note that our best SBN/SBN model is a relatively deep network with 4 hidden layers (300-100-50-10) and reaches an estimated LL of −116.9 on the test set. Our best network, a shallow NADE/NADE-150 network, reaches −104.3 and improves over the previous state of the art (−107.8, an RBM with 4000 hidden units by Cho et al. (2013)).

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers of binary latent variables. It generalizes and improves over the wake-sleep algorithm, providing a lower bias and lower variance estimator of the log-likelihood gradient at the price of more samples from the inference network. During training the weighted samples from the inference network decouple the layers such that the learning gradients only propagate within the individual layers. Our experiments demonstrate that a small number of ≈ 5 samples is typically sufficient to jointly train relatively deep architectures of at least 5 hidden layers without layerwise pretraining and without carefully tuning learning rates. The resulting models produce reasonable samples (by visual inspection) and they approach state-of-the-art performance in terms of log-likelihood on several discrete datasets.

We found that even in the cases when the generative networks contain SBN layers only, better results can be obtained with inference networks composed of more powerful, autoregressive layers. This however comes at the price of reduced computational efficiency on e.g. GPUs, as the individual variables $h_i \sim q(h \mid x)$ have to be sampled in sequence (even though the theoretical complexity is not significantly worse compared to SBN layers).

We furthermore found that models with autoregressive layers in the generative network p typically produce very good results. But the best ones were always shallow with only a single hidden layer. At this point it is unclear if this is due to optimization problems.

Acknowledgments: We would like to thank Laurent Dinh, Vincent Dumoulin and Li Yao for helpful discussions, and the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for their powerful software. We furthermore acknowledge CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources.


REFERENCES

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS’99, pages 400–406. MIT Press.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML’13). ACM.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML’2012.

Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3), 805–831.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Larochelle, H. (2011). Binarized MNIST dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS’2011), volume 15 of JMLR W&CP.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014). To appear.

Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In NIPS’08, volume 21, pages 1137–1144.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML’2014.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25.

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In NIPS’2013.

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In ICML’2014.


6 SUPPLEMENT

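6.1 GRADIENTS FOR p(x, h)

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
= \frac{\partial}{\partial\theta} \log p_\theta(x)
&= \frac{1}{p(x)} \frac{\partial}{\partial\theta} \sum_h p(x, h) \\
&= \frac{1}{p(x)} \sum_h p(x, h) \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\end{aligned}
\tag{10}
$$

with $\omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$ and $h^{(k)} \sim q(h \mid x)$.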

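6.2 GRADIENTS FOR THE WAKE PHASE Q UPDATE

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
&= \frac{1}{p(x)} \frac{\partial}{\partial\phi} \sum_h p(x, h) \log q_\phi(h \mid x) \\
&= \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{11}
$$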

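Note that we arrive at the same gradients (up to the sign, because the divergence is minimized rather than maximized) when we set out to minimize KL(p(·|x) ‖ q(·|x)) for a given datapoint x:

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathrm{KL}\left(p_\theta(h \mid x) \,\|\, q_\phi(h \mid x)\right)
&= \frac{\partial}{\partial\phi} \sum_h p_\theta(h \mid x) \log \frac{p_\theta(h \mid x)}{q_\phi(h \mid x)} \\
&= -\sum_h p_\theta(h \mid x) \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \\
&\simeq -\frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{12}
$$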


6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Figure 5 plot. Horizontal axis: number of samples (logarithmic, 10^0 to 10^3). Left vertical axis: bias. Right vertical axis: std.-dev. Curves: bias (epoch 50), bias (last epoch), std. dev. (epoch 50), std. dev. (last epoch).]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K=5000 primary samples from an SBN/SBN 10-200-200 network trained on MNIST).


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on 8 different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space: The learning rate was set to a value in {0.001, 0.003, 0.01}. For SBNs we use K=10 training samples and we tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150 or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200 or 10-100-300 hidden units. We trained NADE/NADE models with K=5 training samples and one hidden layer with 30, 50, 75, 100 or 200 units in it.

Model          ADULT      CONNECT4    DNA       MUSHROOMS   NIPS-0-12   OCR-LETTERS   RCV1        WEB
FVSBN          13.17      12.39       83.64     10.27       276.88      39.30         49.84       29.35
NADE*          13.19      11.99       84.81     9.81        273.08      27.22         46.66       28.39
EoNADE+        13.19      12.58       82.31     9.68        272.38      27.31         46.12       27.87
DARN3          13.19      11.91       81.04     9.55        274.68      28.17         46.10       28.83
RWS - SBN      13.65      12.68       90.63     9.90        272.54      29.99         46.16       28.18
 hidden units  5-20-100   10-50-150   10-150    10-50-150   10-50-150   10-100-300    10-50-200   10-50-300
RWS - NADE     13.16      11.68       84.26     9.71        271.11      26.43         46.09       27.92
 hidden units  30, 50, 100, 50, 75, 100

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, Larochelle (2014). (NADE*: 500 hidden units; EoNADE+: 1hl, 16 ord)



average

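$$
p(x) = \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)}
= \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \right]
\simeq \frac{1}{K} \sum_{\substack{k=1 \\ h^{(k)} \sim q(h \mid x)}}^{K} \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}
\tag{1}
$$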

Eqn. (1) is a consistent and unbiased estimator for the marginal likelihood p(x). The optimal q that results in a minimum variance estimator is $q^*(h \mid x) = p(h \mid x)$. In fact we can show that this is a zero-variance estimator, i.e., the best possible one that will result in a perfect p(x) estimate even with a single arbitrary sample $h \sim p(h \mid x)$:

$$
\mathbb{E}_{h \sim p(h \mid x)} \left[ \frac{p(h \mid x)\, p(x)}{p(h \mid x)} \right]
= p(x)\, \mathbb{E}_{h \sim p(h \mid x)} [1] = p(x)
\tag{2}
$$

Any mismatch between q and p(h|x) will increase the variance of this estimator, but it will not introduce any bias. In practice however, we are typically interested in an estimator for the log-likelihood. Taking the logarithm of (1) and averaging over multiple datapoints will result in a conservative biased estimate and will, on average, underestimate the true log-likelihood due to the concavity of the logarithm. Increasing the number of samples will decrease both the bias and the variance. Variants of this estimator have been used in e.g. (Rezende et al., 2014; Gregor et al., 2014) to evaluate trained models.
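The estimator in eqn. (1) and the zero-variance property in eqn. (2) can be checked numerically on a model small enough to enumerate. The sketch below is an illustration under assumptions: the toy model (one latent layer with 3 binary units, 4 visible units, random parameters), the uniform proposal and all function names are hypothetical and not taken from the paper's implementation.

```python
import numpy as np
from itertools import product

def log_sigmoid(a):
    # Numerically stable log(sigmoid(a)).
    return -np.logaddexp(0.0, -a)

def bernoulli_logpmf(v, logits):
    # Sum over the last axis of log Bernoulli(v; sigmoid(logits)).
    return np.sum(v * log_sigmoid(logits) + (1 - v) * log_sigmoid(-logits), axis=-1)

def log_joint(x, h, W, b, c):
    # log p(x, h) = log p(h) + log p(x | h) for a toy model with one latent layer.
    return bernoulli_logpmf(h, c) + bernoulli_logpmf(x, h @ W.T + b)

def exact_log_px(x, W, b, c):
    # Brute-force log p(x): sum the joint over all 2^n_h latent configurations.
    all_h = np.array(list(product([0.0, 1.0], repeat=c.size)))
    return np.logaddexp.reduce(log_joint(x, all_h, W, b, c))

def is_log_likelihood(x, h_samples, log_q, W, b, c):
    # Log of eqn. (1): log[(1/K) sum_k p(x, h_k) / q(h_k | x)], computed in log space.
    log_w = log_joint(x, h_samples, W, b, c) - log_q
    return np.logaddexp.reduce(log_w) - np.log(log_w.size)

rng = np.random.default_rng(0)
n_x, n_h, K = 4, 3, 100_000
W, b, c = rng.normal(size=(n_x, n_h)), rng.normal(size=n_x), rng.normal(size=n_h)
x = np.array([1.0, 0.0, 1.0, 1.0])

# Deliberately poor proposal: uniform over all latent configurations.
h = rng.integers(0, 2, size=(K, n_h)).astype(float)
log_q = np.full(K, -n_h * np.log(2.0))
estimate = is_log_likelihood(x, h, log_q, W, b, c)
exact = exact_log_px(x, W, b, c)

# Optimal proposal q = p(h | x): one arbitrary sample already gives log p(x).
h_one = np.array([[1.0, 0.0, 1.0]])
log_posterior = log_joint(x, h_one, W, b, c) - exact
single_sample = is_log_likelihood(x, h_one, log_posterior, W, b, c)
print(exact, estimate, single_sample)
```

Working with log-weights and `np.logaddexp` avoids underflow, which matters once the joint probabilities become very small for realistic layer sizes.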

2.3 TRAINING BY REWEIGHTED WAKE-SLEEP

We now consider the models p and q parameterized with parameters θ and φ respectively.

Updating $p_\theta$ for given $q_\phi$: We propose to use an importance sampling estimator based on eq. (1) to compute the gradient of the marginal log-likelihood $\mathcal{L}_p(\theta, x) = \log p_\theta(x)$:

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
&= \frac{1}{p(x)} \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
&\simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\quad \text{with } h^{(k)} \sim q(h \mid x)
\end{aligned}
\tag{3}
$$

and the importance weights $\tilde{\omega}_k = \frac{\omega_k}{\sum_{k'=1}^{K} \omega_{k'}}$, $\omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$.

See the supplement for a detailed derivation. Note that this is a biased estimator because it implicitly contains a division by the estimated p(x). Furthermore, there is no guarantee that q = p(h|x) results in a minimum variance estimate of this gradient. But both the bias and the variance decrease as the number of samples is increased. Also note that the wake-sleep algorithm uses a gradient that is equivalent to using only K = 1 sample. Another noteworthy detail about eq. (3) is that the importance weights $\tilde{\omega}$ are automatically normalized such that they sum up to one.
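As a small illustration of the normalized weights in eq. (3), the sketch below computes them from log-probabilities and forms the weighted gradient sum. The function names are hypothetical, and subtracting the maximum log-weight before exponentiating is an implementation choice assumed here for numerical stability; it does not change the normalized result.

```python
import numpy as np

def normalized_importance_weights(log_p_joint, log_q):
    # w~_k = w_k / sum_k' w_k' with w_k = p(x, h_k) / q(h_k | x), from log-values.
    log_w = np.asarray(log_p_joint, dtype=float) - np.asarray(log_q, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # shift by the max to avoid underflow
    return w / np.sum(w)

def reweighted_gradient(weights, per_sample_grads):
    # sum_k w~_k * grad_k, where per_sample_grads has shape (K, ...).
    return np.tensordot(weights, per_sample_grads, axes=1)

# Log-probabilities far below what exp() can represent directly.
w_tilde = normalized_importance_weights([-1000.0, -1001.0], [-2.0, -2.0])
grad = reweighted_gradient(w_tilde, np.array([[1.0, 0.0], [0.0, 1.0]]))

# With a single sample the weight is always one: the classical wake-sleep case.
w_single = normalized_importance_weights([-3.7], [-1.2])
print(w_tilde, grad, w_single)
```

With K = 1 the normalization cancels the weight entirely, which is why the estimator reduces to the wake-sleep gradient in that case.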

Updating $q_\phi$ for given $p_\theta$: In order to minimize the variance of the estimator (1), we would like q(h|x) to track p(h|x). We propose to train q using maximum likelihood learning with the loss $\mathcal{L}_q(\phi, x, h) = \log q_\phi(h \mid x)$. There are at least two reasonable options for obtaining training data for $\mathcal{L}_q$: 1) maximize $\mathcal{L}_q$ under the empirical training distribution $x \sim \mathcal{D}$, $h \sim p(h \mid x)$, or 2) maximize $\mathcal{L}_q$ under the generative model $(x, h) \sim p_\theta(x, h)$. We will refer to the former as wake phase q-update and to the latter as sleep phase q-update. In the case of a DBN (where the top layer is generated by an RBM), there is an intermediate solution called contrastive-wake-sleep, which has been proposed in (Hinton et al., 2006). In contrastive wake-sleep we sample x from the training distribution, propagate it stochastically into the top layer and use that h as starting point for a short Markov chain in the RBM, then sample the other layers in the generative network p to generate the rest of (x, h). The objective is to put the inference network’s capacity where it matters most, i.e., near the input configurations that are seen in the training set.

Analogous to eqn. (1) and (3) we use importance sampling to derive gradients for the wake phase q-update:

$$
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
\simeq \sum_{k=1}^{K} \tilde{\omega}_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\tag{4}
$$

with the same importance weights $\tilde{\omega}_k$ as in (3) (the details of the derivation can again be found in the supplement). Note that this is equivalent to optimizing q so as to minimize KL(p(·|x) ‖ q(·|x)). For the sleep phase q-update we consider the model distribution p(x, h) a fully observed system and can thus derive gradients without further sampling:

$$
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, (x, h))
= \frac{\partial}{\partial\phi} \log q_\phi(h \mid x)
\quad \text{with } x, h \sim p(x, h)
\tag{5}
$$

This update is equivalent to the sleep phase update in the classical wake-sleep algorithm.

Algorithm 1: Reweighted Wake-Sleep training procedure and likelihood estimator. K is the number of approximate inference samples and controls the trade-off between computation and accuracy of the estimators (both for the gradient and for the likelihood). We typically use a large value (K = 100,000) for the test set likelihood estimator but a small value (K = 5) for estimating gradients. Both the wake phase and sleep phase update rules for q are optionally included (either one or both can be used, and best results were obtained using both). The original wake-sleep algorithm has K=1 and only uses the sleep phase update of q. To estimate the log-likelihood at test time, only the computations up to $\hat{\mathcal{L}}$ are required.

for number of training iterations do
    • Sample example(s) x from the training distribution
    for k = 1 to K do
        • Layerwise sample latent variables $h^{(k)}$ from q(h|x)
        • Compute $q(h^{(k)} \mid x)$ and $p(x, h^{(k)})$
    end for
    • Compute unnormalized weights: $\omega_k = p(x, h^{(k)}) / q(h^{(k)} \mid x)$
    • Normalize the weights: $\tilde{\omega}_k = \omega_k / \sum_{k'} \omega_{k'}$
    • Compute unbiased likelihood estimator: $\hat{p}(x) = \mathrm{average}_k\, \omega_k$
    • Compute log-likelihood estimator: $\hat{\mathcal{L}}(x) = \log \mathrm{average}_k\, \omega_k$
    • Wake-phase update of p: Use gradient estimator $\sum_k \tilde{\omega}_k\, \partial \log p(x, h^{(k)}) / \partial\theta$
    • Optionally, wake phase update of q: Use gradient estimator $\sum_k \tilde{\omega}_k\, \partial \log q(h^{(k)} \mid x) / \partial\phi$
    • Optionally, sleep phase update of q: Sample (x′, h′) from p and use gradient $\partial \log q(h' \mid x') / \partial\phi$
end for
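One iteration of Algorithm 1 can be sketched for the simplest possible case. Everything specific here is an assumption made for illustration: a single latent layer with factorial Bernoulli prior, SBN conditionals for both p(x|h) and q(h|x), plain gradient ascent without momentum, one datapoint per step, and a toy two-pattern dataset. The function and parameter names are hypothetical and do not correspond to the released implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bern_logp(v, prob, eps=1e-12):
    # Log-probability of binary vectors v under independent Bernoulli(prob).
    return np.sum(v * np.log(prob + eps) + (1 - v) * np.log(1 - prob + eps), axis=-1)

def rws_step(x, params, K, lr, rng):
    W, b, c, V, d = (params[name] for name in ("W", "b", "c", "V", "d"))
    # Sample h^(k) ~ q(h|x) and evaluate q(h^(k)|x) and p(x, h^(k)).
    q_prob = sigmoid(V @ x + d)                                  # (n_h,)
    h = (rng.random((K, q_prob.size)) < q_prob).astype(float)    # (K, n_h)
    x_prob = sigmoid(h @ W.T + b)                                # (K, n_x)
    prior = sigmoid(c)
    log_w = bern_logp(h, prior) + bern_logp(x, x_prob) - bern_logp(h, q_prob)
    # Unnormalized weights (shifted), log-likelihood estimate, normalized weights.
    shift = log_w.max()
    w = np.exp(log_w - shift)
    ll_hat = shift + np.log(w.mean())
    w_tilde = w / w.sum()
    # Wake-phase update of p: sum_k w~_k * d log p(x, h_k) / d theta.
    err_x = x - x_prob                                           # (K, n_x)
    params["W"] = W + lr * (err_x * w_tilde[:, None]).T @ h
    params["b"] = b + lr * w_tilde @ err_x
    params["c"] = c + lr * w_tilde @ (h - prior)
    # Wake-phase update of q: sum_k w~_k * d log q(h_k | x) / d phi.
    err_h = w_tilde @ (h - q_prob)
    dV, dd = np.outer(err_h, x), err_h
    # Sleep-phase update of q: dream (x', h') from p and fit q to it.
    h_s = (rng.random(c.size) < prior).astype(float)
    x_s = (rng.random(b.size) < sigmoid(W @ h_s + b)).astype(float)
    err_s = h_s - sigmoid(V @ x_s + d)
    params["V"] = V + lr * (dV + np.outer(err_s, x_s))
    params["d"] = d + lr * (dd + err_s)
    return ll_hat

rng = np.random.default_rng(0)
n_x, n_h = 4, 2
params = {"W": 0.01 * rng.normal(size=(n_x, n_h)), "b": np.zeros(n_x),
          "c": np.zeros(n_h), "V": 0.01 * rng.normal(size=(n_h, n_x)),
          "d": np.zeros(n_h)}
data = np.array([[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
history = [rws_step(data[t % 2], params, K=5, lr=0.1, rng=rng) for t in range(2000)]
print(np.mean(history[:10]), np.mean(history[-10:]))
```

For sigmoid-Bernoulli layers the gradient of the log-probability with respect to the pre-activation is simply the difference between the sampled value and its predicted probability, which is why no automatic differentiation is needed in this toy version.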

2.4 RELATION TO WAKE-SLEEP AND VARIATIONAL BAYES

Recently, there has been a resurgence of interest in algorithms related to the Helmholtz machine and to the wake-sleep algorithm for directed graphical models containing either continuous or discrete latent variables.

In Neural Variational Inference and Learning (NVIL, Mnih and Gregor, 2014) the authors propose to maximize the variational lower bound on the log-likelihood to get a joint objective for both p and q. It was known that this approach results in a gradient estimate of very high variance for the recognition network q (Dayan and Hinton, 1996). In the NVIL paper, the authors therefore propose variance reduction techniques such as baselines to obtain a practical algorithm that improves significantly over the original wake-sleep algorithm. With respect to the computational complexity we note that while we draw K samples from the inference network for RWS, NVIL on the other hand draws only a single sample from q but maintains, queries and trains an additional auxiliary baseline estimating network. With RWS and a typical value of K = 5 we thus require at least twice as many arithmetic operations, but we do not have to store the baseline network and do not have to find suitable hyperparameters for it.

Recent examples for continuous latent variables include the auto-encoding variational Bayes (Kingma and Welling, 2014) and stochastic backpropagation papers (Rezende et al., 2014). In both cases one maximizes a variational lower bound on the log-likelihood that is rewritten as two terms: one that is the log-likelihood reconstruction error through a stochastic encoder (approximate inference) - decoder (generative model) pair, and one that regularizes the output of the approximate inference stochastic encoder so that its marginal distribution matches the generative prior on the latent variables (and the latter is also trained, to match the marginal of the encoder output). Besides the fact that these variational auto-encoders are only for continuous latent variables, another difference with the reweighted wake-sleep algorithm proposed here is that in the former, a single sample from the approximate inference distribution is sufficient to get an unbiased estimator of the gradient of a proxy (the variational bound). Instead, with the reweighted wake-sleep, a single sample would correspond to regular wake-sleep, which gives a biased estimator of the likelihood gradient. On the other hand, as the number of samples increases, reweighted wake-sleep provides a less biased (asymptotically unbiased) estimator of the log-likelihood and of its gradient. Similar in spirit, but aimed at a structured output prediction task, is the method proposed by Tang and Salakhutdinov (2013). The authors optimize the variational bound of the log-likelihood instead of the direct IS estimate, but they also derive update equations for the proposal distribution that resemble many of the properties also found in reweighted wake-sleep.

3 COMPONENT LAYERS

Although the framework can be readily applied to continuous variables, we here restrict ourselves to distributions over binary visible and binary latent variables. We build our models by combining probabilistic components, each one associated with one of the layers of the generative network or of the inference network. The generative model can therefore be written as $p_\theta(x, h) = p_0(x \mid h_1)\, p_1(h_1 \mid h_2) \cdots p_L(h_L)$, while the inference network has the form $q_\phi(h \mid x) = q_1(h_1 \mid x) \cdots q_L(h_L \mid h_{L-1})$. For a distribution P to be a suitable component we must have a method to efficiently compute $P(x^{(k)} \mid y^{(k)})$ given $(x^{(k)}, y^{(k)})$, and we must have a method to efficiently draw i.i.d. samples $x^{(k)} \sim P(x \mid y)$ for a given y. In the following we will describe experiments containing three kinds of layers:

Sigmoidal Belief Network (SBN) layer: An SBN layer (Saul et al., 1996) is a directed graphical model with independent variables $x_i$ given the parents y:

$$
P^{\mathrm{SBN}}(x_i = 1 \mid y) = \sigma(W^{i,:}\, y + b_i)
\tag{6}
$$

Although an SBN is a very simple generative model given y, performing inference for y given x is in general intractable.
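The two operations required of a component layer (evaluating the probability and drawing samples) are straightforward for the SBN layer of eq. (6). A minimal sketch follows; the function names and the small random parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sbn_sample(y, W, b, rng):
    # Units are conditionally independent given y, so one vectorized draw suffices.
    p = sigmoid(W @ y + b)
    return (rng.random(p.shape) < p).astype(float)

def sbn_log_prob(x, y, W, b):
    # log P(x | y) = sum_i log Bernoulli(x_i; sigmoid(W[i] @ y + b[i])).
    p = sigmoid(W @ y + b)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
y = np.array([1.0, 0.0])
x = sbn_sample(y, W, b, rng)

# Sanity check: the probabilities of all 2^3 configurations sum to one.
total = sum(np.exp(sbn_log_prob(np.array(cfg), y, W, b))
            for cfg in product([0.0, 1.0], repeat=3))
print(x, total)
```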

Autoregressive SBN layer (AR-SBN, DARN): If we consider $x_i$ an ordered set of observed variables and introduce directed, autoregressive links between all previous $x_{<i}$ and a given $x_i$, we obtain a fully-visible sigmoid belief network (FVSBN, Frey, 1998; Bengio and Bengio, 2000). When we additionally condition an FVSBN on the parent layer’s y we obtain a layer model that was first used in Deep AutoRegressive Networks (DARN, Gregor et al., 2014):

$$
P^{\mathrm{AR\text{-}SBN}}(x_i = 1 \mid x_{<i}, y) = \sigma(W^{i,:}\, y + S^{i,<i}\, x_{<i} + b_i)
\tag{7}
$$

We use $x_{<i} = (x_1, x_2, \cdots, x_{i-1})$ to refer to the vector containing the first i-1 observed variables. The matrix S is a lower triangular matrix that contains the autoregressive weights between the variables $x_i$, and with $S^{i,<j}$ we refer to the first j-1 elements of the i-th row of this matrix. In contrast to a regular SBN layer, the units $x_i$ are thus not independent of each other: each unit can be predicted, as in a logistic regression, from its predecessors $x_{<i}$ and from the input of the layer, y.
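A short sketch of eq. (7) shows the asymmetry between the two layer operations: evaluating the probability of a given x needs only one matrix product, while sampling has to visit the units in order. The function names and random parameters below are hypothetical; the strictly lower triangular part of S is extracted explicitly so that a unit never depends on itself.

```python
import numpy as np
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ar_sbn_log_prob(x, y, W, S, b):
    # All conditionals can be evaluated at once because x is fully known.
    S_strict = np.tril(S, k=-1)          # unit i only sees units 0..i-1
    p = sigmoid(W @ y + S_strict @ x + b)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def ar_sbn_sample(y, W, S, b, rng):
    # Sampling is sequential: unit i needs the sampled values of its predecessors.
    x = np.zeros(b.size)
    log_prob = 0.0
    for i in range(b.size):
        p_i = sigmoid(W[i] @ y + S[i, :i] @ x[:i] + b[i])
        x[i] = float(rng.random() < p_i)
        log_prob += np.log(p_i) if x[i] == 1.0 else np.log(1.0 - p_i)
    return x, float(log_prob)

rng = np.random.default_rng(0)
n_x, n_y = 4, 2
W, S, b = rng.normal(size=(n_x, n_y)), rng.normal(size=(n_x, n_x)), rng.normal(size=n_x)
y = np.array([1.0, 1.0])
x, lp_sampled = ar_sbn_sample(y, W, S, b, rng)
lp_evaluated = ar_sbn_log_prob(x, y, W, S, b)

# Sanity check: the probabilities of all 2^4 configurations sum to one.
total = sum(np.exp(ar_sbn_log_prob(np.array(cfg), y, W, S, b))
            for cfg in product([0.0, 1.0], repeat=n_x))
print(x, lp_sampled, lp_evaluated, total)
```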

Conditional NADE layer: The Neural Autoregressive Distribution Estimator (NADE, Larochelle and Murray, 2011) is a model that uses an internal, accumulating hidden layer to predict variables $x_i$ given the vector containing all previous variables $x_{<i}$. Instead of logistic regression as in an FVSBN or an AR-SBN, the dependency between the variables $x_i$ is here mediated by an MLP (Bengio and Bengio, 2000):

$$
P(x_i = 1 \mid x_{<i}) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + a) + b_i)
\tag{8}
$$

with W and V denoting the encoding and decoding matrices for the NADE hidden layer. For our purposes we condition this model on the random variables y:

$$
P^{\mathrm{NADE}}(x_i = 1 \mid x_{<i}, y) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + U_a\, y + a) + U_b^{i,:}\, y + b_i)
\tag{9}
$$

Such a conditional NADE has been used previously for modeling musical sequences (Boulanger-Lewandowski et al., 2012).
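The "accumulating hidden layer" of eq. (9) can be made concrete with a short sketch: the hidden pre-activation starts from the conditioning term and grows by one column of W for every variable already processed, so each conditional costs only one extra vector addition. Function name, shapes and random parameters are hypothetical and only serve the illustration.

```python
import numpy as np
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cond_nade_log_prob(x, y, W, V, U_a, U_b, a, b):
    # Shapes: W (n_hid, n_x), V (n_x, n_hid), U_a (n_hid, n_y), U_b (n_x, n_y).
    act = U_a @ y + a                    # hidden pre-activation before seeing any x_i
    log_prob = 0.0
    for i in range(x.size):
        hidden = sigmoid(act)            # depends on x_{<i} and y only
        p_i = sigmoid(V[i] @ hidden + U_b[i] @ y + b[i])
        log_prob += x[i] * np.log(p_i) + (1 - x[i]) * np.log(1 - p_i)
        act = act + W[:, i] * x[i]       # accumulate the contribution of x_i
    return float(log_prob)

rng = np.random.default_rng(0)
n_x, n_y, n_hid = 3, 2, 4
W, V = rng.normal(size=(n_hid, n_x)), rng.normal(size=(n_x, n_hid))
U_a, U_b = rng.normal(size=(n_hid, n_y)), rng.normal(size=(n_x, n_y))
a, b = rng.normal(size=n_hid), rng.normal(size=n_x)
y = np.array([0.0, 1.0])

# Sanity check: the probabilities of all 2^3 configurations sum to one.
total = sum(np.exp(cond_nade_log_prob(np.array(cfg), y, W, V, U_a, U_b, a, b))
            for cfg in product([0.0, 1.0], repeat=n_x))
print(total)
```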

For each layer distribution we can construct an unconditioned distribution by removing the conditioning variable y. We use such unconditioned distributions as top layer p(h) for the generative network p.


[Figure 1 plots. Panel A: final LL estimate (test set) versus number of training samples (logarithmic axis) for NADE 200, SBN 10-200-200 and SBN 200. Panel B: bias and std.-dev. versus number of training samples (logarithmic axis), each shown at epoch 50 and at the last epoch.]

Figure 1: A Final log-likelihood estimate w.r.t. number of samples used during training. B L2-norm of the bias and standard deviation of the low-sample estimated $p_\theta$ gradient relative to a high-sample (K=5000) based estimate.

P-model   size          NVIL      wake-sleep      RWS (Q-model: SBN)   RWS (Q-model: NADE)
SBN       200           (113.1)   116.3 (120.7)   103.1                95.0
SBN       200-200       (99.8)    106.9 (109.4)   93.4                 91.1
SBN       200-200-200   (96.7)    101.3 (104.4)   90.1                 88.9
AR-SBN    200           -         -               -                    89.2
AR-SBN    200-200       -         -               -                    92.8
NADE      200           -         -               -                    86.8
NADE      200-200       -         -               -                    87.6

Table 1: MNIST results for various architectures and training methods. In the 3rd column we cite the numbers reported by Mnih and Gregor (2014). Values in brackets are variational NLL bounds, values without brackets report NLL estimates (see section 2.2).

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasets. The supplement describes additional experiments on smaller datasets from the UCI repository. With these experiments we 1) quantitatively analyze the influence of the number of samples K; 2) demonstrate that using a more powerful layer-model for the inference network q can significantly enhance the results even when the generative model is a factorial SBN; and 3) show that we approach state-of-the-art performance when using either relatively deep models or when using powerful layer models such as a conditional NADE. Our implementation is available at https://github.com/jbornschein/reweighted-ws.

4.1 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) and downloaded in binarized form from (Larochelle, 2011). For training we use stochastic gradient descent with momentum (β=0.95) and set the mini-batch size to 25. The experiments in this paragraph were run with learning rates of 0.0003, 0.001, and 0.003. From these three we always report the experiment with the highest validation log-likelihood. In the majority of our experiments a learning rate of 0.001 gave the best results, even across different layer models (SBN, AR-SBN and NADE). If not noted otherwise, we use K = 5 samples during training and K = 100,000 samples to estimate the final log-likelihood on the test set¹. To disentangle the influence of the different q updating methods we set up p and q networks consisting of three hidden SBN layers with 10, 200 and 200 units (SBN/SBN 10-200-200). After convergence, the model trained updating q during the sleep phase only reached a final estimated log-likelihood of −93.4, the model trained with a q-update during the wake phase reached −92.8, and the model trained with both wake and sleep phase update reached −91.9. As a control we trained a model that does not update q at all. This model reached −171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

¹ We refer to the lower bound estimates, which can be arbitrarily tightened by increasing the number of test samples, as LL estimates to distinguish them from the variational LL lower bounds (see section 2.2).

[Figure 2 plots. Panel A: estimated LL versus number of test samples (logarithmic axis) for NADE-NADE 200, SBN-SBN 200-200-10 and SBN-SBN 200. Panel B title: SBN/SBN 10-100-200-300-400. Panel C title: NADE/NADE 250.]

Figure 2: A Final log-likelihood estimate w.r.t. number of test samples used. B Samples from the SBN/SBN 10-200-200 generative model. C Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)

Results on binarized MNIST
Method                                NLL bound   NLL est.
RWS (SBN/SBN 10-100-200-300-400)      -           85.48
RWS (NADE/NADE 250)                   -           85.23
RWS (AR-SBN/SBN 500)†                 -           84.18
NADE (500 units, [1])                 -           88.35
EoNADE (2hl, 128 orderings, [2])      -           85.10
DARN (500 units, [3])                 -           84.13
RBM (500 units, CD3, [4])             -           105.5
RBM (500 units, CD25, [4])            -           86.34
DBN (500-2000, [5])                   86.22       84.55

Results on CalTech 101 Silhouettes
Method                                NLL est.
RWS (SBN/SBN 10-50-100-300)           113.3
RWS (NADE/NADE 150)                   104.3
NADE (500 hidden units)               110.6
RBM (4000 hidden units, [6])          107.8

Table 2: Various RWS trained models in relation to previously published methods: [1] Larochelle and Murray (2011), [2] Uria et al. (2014), [3] Gregor et al. (2014), [4] Salakhutdinov and Murray (2008), [5] Murray and Salakhutdinov (2009), [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

Next we investigate the influence of the number of samples used during training The results arevisualized in Fig 1 A Although the results depend on the layer-distributions and on the depthand width of the architectures we generally observe that the final estimated log-likelihood does notimprove significantly when using more than 5 samples during training for NADE models and usingmore than 25 samples for models with SBN layers We can quantify the bias and the variance ofthe gradient estimator (3) using bootstrapping While training a SBNSBN 10-200-200 model withK = 100 training samples we useK = 5 000 samples to get a high quality estimate of the gradientfor a small but fixed set of 25 datapoints (the size of one mini-batch) By repeatedly resamplingsmaller sets of 1 2 5 middot middot middot 5000 samples with replacement and by computing the gradient basedon these we get a measure for the bias and the variance of the small sample estimates relative thehigh quality estimate These results are visualized in Fig 1 B In Fig 2 A we finally investigate thequality of the log-likelihood estimator (eqn 1) when applied to the MNIST test set

Table 1 summarizes how different architectures compare to each other and how RWS comparesto related methods for training directed models We essentially observe that RWS trained modelsconsistently improve over classical wake-sleep especially for deep architectures We furthermoreobserve that using autoregressive layers (AR-SBN or NADE) for the inference network improvesthe results even when the generative model is composed of factorial SBN layers Finally we see thatthe best performing models with autoregressive layers in p are always shallow with only a single

7

Published as a conference paper at ICLR 2015

Figure 3 CalTech 101 Silhouettes A Random selection of training data points B Random samplesfrom the SBNSBN 10-50-100-300 generative network C Random Samples from the NADE-150generative network (We show the probabilities from which each pixel is sampled)

hidden layer In Table 2 (left) we compare some of our best models to the state-of-the-art resultspublished on MNIST The deep SBNSBN 10-100-200-300-400 model was trained for 1000 epochswith K = 5 training samples and a learning rate of 0001 For fine-tuning we run additional 500epochs with a learning rate decay of 1005 and 100 training samples For comparison we also trainthe best performing model from the DARN paper (Gregor et al 2014) with RWS ie a singlelayer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between theobserved and the latents We essentially obtain the same final testset log-likelihood For this shallownetwork we thus do not observe any improvement from using RWS

42 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 times 28 pixel CalTech 101 Silhouettes dataset Thisdataset consists of 4100 examples in the training set 2264 examples in the validation set and2307 examples in the test set We trained various architectures on this dataset using the samehyperparameter as for the MNIST experiments Table 2 (right) summarizes our results Note thatour best SBNSBN model is a relatively deep network with 4 hidden layers (300-100-50-10) andreaches a estimated LL of -1169 on the test set Our best network a shallow NADENADE-150network reaches -1043 and improves over the previous state of the art (minus1078 a RBM with 4000hidden units by Cho et al (2013))

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers of binary latent variables. It generalizes and improves over the wake-sleep algorithm, providing a lower-bias and lower-variance estimator of the log-likelihood gradient at the price of more samples from the inference network. During training, the weighted samples from the inference network decouple the layers such that the learning gradients only propagate within the individual layers. Our experiments demonstrate that a small number of ≈ 5 samples is typically sufficient to jointly train relatively deep architectures of at least 5 hidden layers without layerwise pretraining and without carefully tuning learning rates. The resulting models produce reasonable samples (by visual inspection) and they approach state-of-the-art performance in terms of log-likelihood on several discrete datasets.

We found that even in the cases when the generative networks contain SBN layers only, better results can be obtained with inference networks composed of more powerful, autoregressive layers. This however comes at the price of reduced computational efficiency on e.g. GPUs, as the individual variables $h_i \sim q(h|x)$ have to be sampled in sequence (even though the theoretical complexity is not significantly worse compared to SBN layers).

We furthermore found that models with autoregressive layers in the generative network p typically produce very good results. But the best ones were always shallow with only a single hidden layer. At this point it is unclear if this is due to optimization problems.

Acknowledgments: We would like to thank Laurent Dinh, Vincent Dumoulin and Li Yao for helpful discussions and the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for their powerful software. We furthermore acknowledge CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources.


REFERENCES

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS'99, pages 400–406. MIT Press.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML'2012.

Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3), 805–831.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Larochelle, H. (2011). Binarized MNIST dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS'2011), volume 15 of JMLR: W&CP.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014). To appear.

Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In NIPS'08, volume 21, pages 1137–1144.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25.

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In NIPS'2013.

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In ICML'2014.


6 SUPPLEMENT

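6.1 GRADIENTS FOR p(x, h)

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
&= \frac{\partial}{\partial\theta} \log p_\theta(x)
 = \frac{1}{p(x)} \frac{\partial}{\partial\theta} \sum_h p(x, h) \\
&= \frac{1}{p(x)} \sum_h p(x, h)\, \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \sum_h q(h \mid x)\, \frac{p(x, h)}{q(h \mid x)}\, \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)}\, \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)}\, \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k\, \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\end{aligned}
\tag{10}
$$

with $\omega_k = \frac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$ and $h^{(k)} \sim q(h \mid x)$.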
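As a numerical illustration of the last step in (10), the following minimal sketch compares the self-normalized importance sampling estimator with the exact gradient on a toy model with a single three-valued latent variable. The toy model (a softmax prior over h with a fixed likelihood table), the proposal q and all function names are assumptions made for this example; they are not taken from the paper or from its released code.

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def exact_grad(theta, lik):
    """Exact d/dtheta log p(x) for p(h) = softmax(theta), p(x|h) = lik[h]."""
    prior = softmax(theta)
    posterior = prior * lik / np.sum(prior * lik)
    return posterior - prior

def rws_grad(theta, lik, q, K, rng):
    """Self-normalized estimator from (10) with K samples h^(k) ~ q."""
    prior = softmax(theta)
    h = rng.choice(len(q), size=K, p=q)   # h^(k) ~ q(h|x)
    w = prior[h] * lik[h] / q[h]          # omega_k = p(x, h^(k)) / q(h^(k)|x)
    w_norm = w / w.sum()
    # d/dtheta log p(x, h) = onehot(h) - prior for a softmax prior
    per_sample_grad = np.eye(len(q))[h] - prior
    return w_norm @ per_sample_grad

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 0.2])
lik = np.array([0.1, 0.7, 0.3])
q = np.array([0.2, 0.5, 0.3])
print(exact_grad(theta, lik))
print(rws_grad(theta, lik, q, K=200_000, rng=rng))
```

With a large K the two printed vectors should agree closely; with small K the estimator is biased, which is the regime studied in Fig. 1 B.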

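6.2 GRADIENTS FOR THE WAKE PHASE Q UPDATE

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
&= \frac{1}{p(x)} \frac{\partial}{\partial\phi} \sum_h p(x, h) \log q_\phi(h \mid x) \\
&= \frac{1}{p(x)}\, \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)}\, \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k\, \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{11}
$$

Note that we arrive at the same gradients (up to sign) when we set out to minimize $\mathrm{KL}(p(\cdot \mid x) \,\|\, q(\cdot \mid x))$ for a given datapoint x:

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathrm{KL}(p_\theta(h \mid x) \,\|\, q_\phi(h \mid x))
&= \frac{\partial}{\partial\phi} \sum_h p_\theta(h \mid x) \log \frac{p_\theta(h \mid x)}{q_\phi(h \mid x)} \\
&= -\sum_h p_\theta(h \mid x)\, \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \\
&\simeq -\frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k\, \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{12}
$$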


6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Figure 5 plot. x-axis: "samples" (log scale, 10^0 to 10^3); y-axes: "Bias" and "std-dev"; curves: bias (epoch 50), bias (last epoch), std dev (epoch 50), std dev (last epoch).]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K=5000 primary samples from a SBN/SBN 10-200-200 network trained on MNIST).
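The bootstrapping analysis can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used to produce Figure 5: it takes a pool of log importance weights for one datapoint, treats the estimate from the full pool as the reference, and resamples smaller sets with replacement. The function names and the synthetic Gaussian log-weights in the usage lines are hypothetical.

```python
import numpy as np

def log_mean_exp(a, axis=-1):
    """Numerically stable log(mean(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def bootstrap_bias_std(log_w, k, n_boot, rng):
    """Bias and std-dev of the k-sample log p(x) estimate relative to the full pool."""
    log_w = np.asarray(log_w, dtype=float)
    reference = log_mean_exp(log_w)                     # high-sample estimate
    idx = rng.integers(0, log_w.size, size=(n_boot, k))  # resample with replacement
    estimates = log_mean_exp(log_w[idx], axis=1)
    return float(estimates.mean() - reference), float(estimates.std())

rng = np.random.default_rng(0)
pool = rng.normal(size=5000)   # stand-in for log omega_k of one datapoint
for k in (1, 5, 100):
    print(k, bootstrap_bias_std(pool, k, n_boot=2000, rng=rng))
```

By Jensen's inequality the low-sample estimate is biased downwards, so the reported bias is negative and shrinks in magnitude as k grows.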


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on 8 different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space. The learning rate was set to a value in {0.001, 0.003, 0.01}. For SBNs we use K=10 training samples and we tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150 or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200 or 10-100-300 hidden units. We trained NADE/NADE models with K=5 training samples and one hidden layer with 30, 50, 75, 100 or 200 units in it.

| Model | ADULT | CONNECT4 | DNA | MUSHROOMS | NIPS-0-12 | OCR-LETTERS | RCV1 | WEB |
|---|---|---|---|---|---|---|---|---|
| FVSBN | 13.17 | 12.39 | 83.64 | 10.27 | 276.88 | 39.30 | 49.84 | 29.35 |
| NADE* | 13.19 | 11.99 | 84.81 | 9.81 | 273.08 | 27.22 | 46.66 | 28.39 |
| EoNADE+ | 13.19 | 12.58 | 82.31 | 9.68 | 272.38 | 27.31 | 46.12 | 27.87 |
| DARN3 | 13.19 | 11.91 | 81.04 | 9.55 | 274.68 | 28.17 | 46.10 | 28.83 |
| RWS - SBN | 13.65 | 12.68 | 90.63 | 9.90 | 272.54 | 29.99 | 46.16 | 28.18 |
| hidden units | 5-20-100 | 10-50-150 | 10-150 | 10-50-150 | 10-50-150 | 10-100-300 | 10-50-200 | 10-50-300 |
| RWS - NADE | 13.16 | 11.68 | 84.26 | 9.71 | 271.11 | 26.43 | 46.09 | 27.92 |

RWS - NADE hidden units, in the order listed: 30, 50, 100, 50, 75, 100.

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, Larochelle (2014). (NADE*: 500 hidden units; EoNADE+: 1hl, 16 ord)


with the same importance weights $\omega_k$ as in (3) (the details of the derivation can again be found in the supplement). Note that this is equivalent to optimizing q so as to minimize $\mathrm{KL}(p(\cdot \mid x) \,\|\, q(\cdot \mid x))$. For the sleep phase q-update we consider the model distribution p(x, h) a fully observed system and can thus derive gradients without further sampling:

$$
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, (x, h)) = \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \quad \text{with } x, h \sim p(x, h)
\tag{5}
$$

This update is equivalent to the sleep phase update in the classical wake-sleep algorithm.
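For concreteness, the sleep phase update (5) can be sketched for a single-layer model as follows. This is a minimal sketch, not the released implementation: it assumes a factorial Bernoulli prior p(h), a sigmoidal p(x|h) and a factorial sigmoidal q(h|x), and all names (`dream`, `sleep_grad`, `log_q`, the weight matrices) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_q(Wq, bq, h, x):
    """log q(h | x) for a factorial sigmoidal inference layer."""
    prob = sigmoid(Wq @ x + bq)
    return float(np.sum(h * np.log(prob) + (1.0 - h) * np.log1p(-prob)))

def dream(prior_logits, Wp, bp, rng):
    """Draw (x', h') from the generative model: h' ~ p(h), then x' ~ p(x | h')."""
    h = (rng.random(prior_logits.shape) < sigmoid(prior_logits)).astype(float)
    x = (rng.random(bp.shape) < sigmoid(Wp @ h + bp)).astype(float)
    return x, h

def sleep_grad(Wq, bq, x, h):
    """Gradient of log q(h | x) w.r.t. (Wq, bq); no importance weights are needed."""
    err = h - sigmoid(Wq @ x + bq)
    return np.outer(err, x), err

rng = np.random.default_rng(0)
Wp, bp, prior_logits = rng.normal(size=(4, 3)), np.zeros(4), np.zeros(3)
Wq, bq = rng.normal(size=(3, 4)), np.zeros(3)
x, h = dream(prior_logits, Wp, bp, rng)
grad_W, grad_b = sleep_grad(Wq, bq, x, h)
Wq += 0.001 * grad_W   # one gradient ascent step on log q(h' | x')
bq += 0.001 * grad_b
```

Because the dreamed pair (x', h') is fully observed, the update reduces to ordinary maximum-likelihood training of q on samples from p.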

Algorithm 1: Reweighted Wake-Sleep training procedure and likelihood estimator. K is the number of approximate inference samples and controls the trade-off between computation and accuracy of the estimators (both for the gradient and for the likelihood). We typically use a large value (K = 100,000) for the test set likelihood estimator but a small value (K = 5) for estimating gradients. Both the wake phase and sleep phase update rules for q are optionally included (either one or both can be used, and best results were obtained using both). The original wake-sleep algorithm has K=1 and only uses the sleep phase update of q. To estimate the log-likelihood at test time, only the computations up to $\hat{L}$ are required.

for number of training iterations do
  • Sample example(s) x from the training distribution
  for k = 1 to K do
    • Layerwise sample latent variables $h^{(k)}$ from $q(h \mid x)$
    • Compute $q(h^{(k)} \mid x)$ and $p(x, h^{(k)})$
  end for
  • Compute unnormalized weights $\omega_k = p(x, h^{(k)}) / q(h^{(k)} \mid x)$
  • Normalize the weights $\tilde{\omega}_k = \omega_k / \sum_{k'} \omega_{k'}$
  • Compute unbiased likelihood estimator $\hat{p}(x) = \mathrm{average}_k\, \omega_k$
  • Compute log-likelihood estimator $\hat{L}(x) = \log \mathrm{average}_k\, \omega_k$
  • Wake-phase update of p: Use gradient estimator $\sum_k \tilde{\omega}_k\, \frac{\partial \log p(x, h^{(k)})}{\partial\theta}$
  • Optionally, wake phase update of q: Use gradient estimator $\sum_k \tilde{\omega}_k\, \frac{\partial \log q(h^{(k)} \mid x)}{\partial\phi}$
  • Optionally, sleep phase update of q: Sample $(x', h')$ from p and use gradient $\frac{\partial \log q(h' \mid x')}{\partial\phi}$
end for
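The weight and estimator computations of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch and not the released implementation; it assumes the per-sample log-probabilities log p(x, h^(k)) and log q(h^(k)|x) are already available, works in the log domain for numerical stability, and uses hypothetical function names.

```python
import numpy as np

def rws_weights_and_loglik(log_p_joint, log_q):
    """Return normalized weights and the log-likelihood estimate log average_k omega_k.

    log_p_joint[k] = log p(x, h^(k)),  log_q[k] = log q(h^(k) | x).
    """
    log_w = np.asarray(log_p_joint, dtype=float) - np.asarray(log_q, dtype=float)
    m = log_w.max()
    w_shifted = np.exp(log_w - m)             # omega_k up to a common factor
    log_lik = m + np.log(w_shifted.mean())    # stable log of the average weight
    w_norm = w_shifted / w_shifted.sum()      # normalized weights, sum to one
    return w_norm, log_lik

def weighted_gradient(w_norm, per_sample_grads):
    """Combine per-sample gradients (first axis indexes k) with the normalized weights."""
    return np.tensordot(w_norm, np.asarray(per_sample_grads, dtype=float), axes=1)

w, ll = rws_weights_and_loglik(np.log([0.1, 0.2]), np.log([0.5, 0.5]))
print(w, ll)
```

The same normalized weights are reused for the wake phase updates of both p and q; only the per-sample gradients differ.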

2.4 RELATION TO WAKE-SLEEP AND VARIATIONAL BAYES

Recently there has been a resurgence of interest in algorithms related to the Helmholtz machine and to the wake-sleep algorithm for directed graphical models containing either continuous or discrete latent variables.

In Neural Variational Inference and Learning (NVIL; Mnih and Gregor, 2014) the authors propose to maximize the variational lower bound on the log-likelihood to get a joint objective for both p and q. It was known that this approach results in a gradient estimate of very high variance for the recognition network q (Dayan and Hinton, 1996). In the NVIL paper the authors therefore propose variance reduction techniques such as baselines to obtain a practical algorithm that improves significantly over the original wake-sleep algorithm. With respect to the computational complexity, we note that while we draw K samples from the inference network for RWS, NVIL on the other hand draws only a single sample from q but maintains, queries and trains an additional auxiliary baseline-estimating network. With RWS and a typical value of K = 5 we thus require at least twice as many arithmetic operations, but we do not have to store the baseline network and do not have to find suitable hyperparameters for it.

Recent examples for continuous latent variables include the auto-encoding variational Bayes (Kingma and Welling, 2014) and stochastic backpropagation papers (Rezende et al., 2014). In both cases one maximizes a variational lower bound on the log-likelihood that is rewritten as two terms: one that is the log-likelihood reconstruction error through a stochastic encoder (approximate inference) - decoder (generative model) pair, and one that regularizes the output of the approximate inference stochastic encoder so that its marginal distribution matches the generative prior on the latent variables (and the latter is also trained to match the marginal of the encoder output). Besides the fact that these variational auto-encoders are only for continuous latent variables, another difference with the reweighted wake-sleep algorithm proposed here is that in the former, a single sample from the approximate inference distribution is sufficient to get an unbiased estimator of the gradient of a proxy (the variational bound). Instead, with the reweighted wake-sleep, a single sample would correspond to regular wake-sleep, which gives a biased estimator of the likelihood gradient. On the other hand, as the number of samples increases, reweighted wake-sleep provides a less biased (asymptotically unbiased) estimator of the log-likelihood and of its gradient. Similar in spirit, but aimed at a structured output prediction task, is the method proposed by Tang and Salakhutdinov (2013). The authors optimize the variational bound of the log-likelihood instead of the direct IS estimate, but they also derive update equations for the proposal distribution that share many of the properties also found in reweighted wake-sleep.

3 COMPONENT LAYERS

Although the framework can be readily applied to continuous variables, we here restrict ourselves to distributions over binary visible and binary latent variables. We build our models by combining probabilistic components, each one associated with one of the layers of the generative network or of the inference network. The generative model can therefore be written as $p_\theta(x, h) = p_0(x \mid h_1)\, p_1(h_1 \mid h_2) \cdots p_L(h_L)$, while the inference network has the form $q_\phi(h \mid x) = q_1(h_1 \mid x) \cdots q_L(h_L \mid h_{L-1})$. For a distribution P to be a suitable component we must have a method to efficiently compute $P(x^{(k)} \mid y^{(k)})$ given $(x^{(k)}, y^{(k)})$, and we must have a method to efficiently draw i.i.d. samples $x^{(k)} \sim P(x \mid y)$ for a given y. In the following we will describe experiments containing three kinds of layers:

Sigmoidal Belief Network (SBN) layer: A SBN layer (Saul et al., 1996) is a directed graphical model with independent variables $x_i$ given the parents y:

$$
P^{\mathrm{SBN}}(x_i = 1 \mid y) = \sigma(W^{i,:}\, y + b_i)
\tag{6}
$$

Although a SBN is a very simple generative model given y, performing inference for y given x is in general intractable.
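The two operations required of a component layer (drawing samples and evaluating probabilities) are straightforward for the SBN layer in (6). The following is a minimal NumPy sketch with hypothetical function names; it is not taken from the released code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sbn_sample(W, b, y, rng):
    """Draw x ~ P_SBN(x | y); all units are sampled independently given y."""
    prob = sigmoid(W @ y + b)
    return (rng.random(prob.shape) < prob).astype(float)

def sbn_log_prob(W, b, x, y):
    """log P_SBN(x | y) = sum_i log P(x_i | y)."""
    prob = sigmoid(W @ y + b)
    return float(np.sum(x * np.log(prob) + (1.0 - x) * np.log1p(-prob)))

rng = np.random.default_rng(0)
W, b, y = rng.normal(size=(5, 3)), np.zeros(5), np.ones(3)
x = sbn_sample(W, b, y, rng)
print(x, sbn_log_prob(W, b, x, y))
```

Since the units are conditionally independent, both operations are a single matrix-vector product followed by elementwise work, which is what makes SBN layers cheap on parallel hardware.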

Autoregressive SBN layer (AR-SBN, DARN): If we consider $x_i$ an ordered set of observed variables and introduce directed, autoregressive links between all previous $x_{<i}$ and a given $x_i$, we obtain a fully-visible sigmoid belief network (FVSBN; Frey, 1998; Bengio and Bengio, 2000). When we additionally condition a FVSBN on the parent layer's y we obtain a layer model that was first used in Deep AutoRegressive Networks (DARN; Gregor et al., 2014):

$$
P^{\mathrm{AR\text{-}SBN}}(x_i = 1 \mid x_{<i}, y) = \sigma(W^{i,:}\, y + S^{i,<i}\, x_{<i} + b_i)
\tag{7}
$$

We use $x_{<i} = (x_1, x_2, \cdots, x_{i-1})$ to refer to the vector containing the first i-1 observed variables. The matrix S is a lower triangular matrix that contains the autoregressive weights between the variables $x_i$, and with $S^{i,<j}$ we refer to the first j-1 elements of the i-th row of this matrix. In contrast to a regular SBN layer, the units $x_i$ are thus not independent of each other but can be predicted, like in a logistic regression, in terms of their predecessors $x_{<i}$ and of the input of the layer, y.
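The sequential nature of the AR-SBN layer in (7) can be made explicit with a short sketch. The code below is illustrative only, with hypothetical names, and assumes S is stored as a full square matrix of which only the strictly lower triangle is used.

```python
import numpy as np

def ar_sbn_sample(W, S, b, y, rng):
    """Sample x ~ P_AR-SBN(x | y) unit by unit and return (x, log P(x | y))."""
    n = b.shape[0]
    x = np.zeros(n)
    log_prob = 0.0
    for i in range(n):
        # S[i, :i] holds the autoregressive weights on the predecessors x_1 .. x_{i-1}
        a = W[i] @ y + S[i, :i] @ x[:i] + b[i]
        p_i = 1.0 / (1.0 + np.exp(-a))
        x[i] = float(rng.random() < p_i)
        log_prob += np.log(p_i) if x[i] == 1.0 else np.log1p(-p_i)
    return x, float(log_prob)

rng = np.random.default_rng(0)
W, b, y = rng.normal(size=(4, 2)), np.zeros(4), np.array([1.0, 0.0])
S = np.tril(rng.normal(size=(4, 4)), k=-1)
print(ar_sbn_sample(W, S, b, y, rng))
```

Each unit depends on the sampled values of its predecessors, so the loop over i cannot be vectorized during sampling; evaluating the probability of a given x, in contrast, can be done in one pass.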

Conditional NADE layer: The Neural Autoregressive Distribution Estimator (NADE; Larochelle and Murray, 2011) is a model that uses an internal, accumulating hidden layer to predict variables $x_i$ given the vector containing all previous variables $x_{<i}$. Instead of logistic regression as in a FVSBN or an AR-SBN, the dependency between the variables $x_i$ is here mediated by an MLP (Bengio and Bengio, 2000):

$$
P(x_i = 1 \mid x_{<i}) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + a) + b_i)
\tag{8}
$$

with W and V denoting the encoding and decoding matrices for the NADE hidden layer. For our purposes we condition this model on the random variables y:

$$
P^{\mathrm{NADE}}(x_i = 1 \mid x_{<i}, y) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + U_a\, y + a) + U_b^{i,:}\, y + b_i)
\tag{9}
$$

Such a conditional NADE has been used previously for modeling musical sequences (Boulanger-Lewandowski et al., 2012).
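The "accumulating hidden layer" in (9) can be illustrated with a short sketch that evaluates log P(x | y) in a single pass. This is a minimal sketch under assumed shapes (W: H×n, V: n×H, U_a: H×d, U_b: n×d) with hypothetical function names; it is not the released implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cond_nade_log_prob(x, y, W, V, Ua, Ub, a, b):
    """log P_NADE(x | y) following (9), reusing the hidden pre-activation across i."""
    acc = Ua @ y + a                 # pre-activation before any x_i has been seen
    log_prob = 0.0
    for i in range(x.shape[0]):
        hidden = sigmoid(acc)        # depends on x_{<i} and y only
        p_i = sigmoid(V[i] @ hidden + Ub[i] @ y + b[i])
        log_prob += x[i] * np.log(p_i) + (1.0 - x[i]) * np.log1p(-p_i)
        acc = acc + W[:, i] * x[i]   # accumulate the contribution of x_i
    return float(log_prob)

rng = np.random.default_rng(1)
n, H, d = 3, 4, 2
W, V = rng.normal(size=(H, n)), rng.normal(size=(n, H))
Ua, Ub = rng.normal(size=(H, d)), rng.normal(size=(n, d))
a, b, y = rng.normal(size=H), rng.normal(size=n), np.array([1.0, 0.0])
print(cond_nade_log_prob(np.array([1.0, 0.0, 1.0]), y, W, V, Ua, Ub, a, b))
```

Updating `acc` incrementally avoids recomputing the product with all predecessors for every unit, so the cost per unit stays proportional to the number of hidden units.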

For each layer distribution we can construct an unconditioned distribution by removing the conditioning variable y. We use such unconditioned distributions as top layer p(h) for the generative network p.


[Figure 1 plots. Panel A: x-axis "training samples" (log scale), y-axis "Final LL estimate (test set)", curves for NADE 200, SBN 10-200-200 and SBN 200. Panel B: x-axis "training samples" (log scale), y-axes "bias" and "std-dev", curves for bias (epoch 50), bias (last epoch), std dev (epoch 50) and std dev (last epoch).]

Figure 1: A Final log-likelihood estimate w.r.t. number of samples used during training. B L2-norm of the bias and standard deviation of the low-sample estimated $p_\theta$ gradient relative to a high-sample (K=5000) based estimate.

| P-model | size | NVIL | wake-sleep | RWS (Q-model: SBN) | RWS (Q-model: NADE) |
|---|---|---|---|---|---|
| SBN | 200 | (113.1) | 116.3 (120.7) | 103.1 | 95.0 |
| SBN | 200-200 | (99.8) | 106.9 (109.4) | 93.4 | 91.1 |
| SBN | 200-200-200 | (96.7) | 101.3 (104.4) | 90.1 | 88.9 |

RWS results for autoregressive P-models: AR-SBN 200: 89.2; AR-SBN 200-200: 92.8; NADE 200: 86.8; NADE 200-200: 87.6.

Table 1: MNIST results for various architectures and training methods. In the 3rd column we cite the numbers reported by Mnih and Gregor (2014). Values in brackets are variational NLL bounds, values without brackets report NLL estimates (see section 2.2).

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasets. The supplement describes additional experiments on smaller datasets from the UCI repository. With these experiments we 1) quantitatively analyze the influence of the number of samples K; 2) demonstrate that using a more powerful layer-model for the inference network q can significantly enhance the results even when the generative model is a factorial SBN; and 3) show that we approach state-of-the-art performance when using either relatively deep models or when using powerful layer models such as a conditional NADE. Our implementation is available at https://github.com/jbornschein/reweighted-ws.

4.1 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) and downloaded in binarized form from (Larochelle, 2011). For training we use stochastic gradient descent with momentum (β=0.95) and set the mini-batch size to 25. The experiments in this paragraph were run with learning rates of {0.0003, 0.001, 0.003}. From these three we always report the experiment with the highest validation log-likelihood. In the majority of our experiments a learning rate of 0.001 gave the best results, even across different layer models (SBN, AR-SBN and NADE). If not noted otherwise, we use K = 5 samples during training and K = 100,000 samples to estimate the final log-likelihood on the test set.¹ To disentangle the influence of the different q updating methods we set up p and q networks consisting of three hidden SBN layers with 10, 200 and 200 units (SBN/SBN 10-200-200). After convergence, the model trained updating q during the sleep phase only reached a final estimated log-likelihood of −93.4, the model trained with a q-update during the wake phase reached −92.8, and the model trained with both wake and sleep phase updates reached −91.9. As a control we trained a model that does not update q at all. This model reached −171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

¹ We refer to the lower bound estimates, which can be arbitrarily tightened by increasing the number of test samples, as LL estimates to distinguish them from the variational LL lower bounds (see section 2.2).

[Figure 2 plots. Panel A: x-axis "samples" (log scale), y-axis "est. LL", curves for NADE-NADE 200, SBN-SBN 200-200-10 and SBN-SBN 200. Panels B and C: sample images, titled "SBN/SBN 10-100-200-300-400" and "NADE/NADE 250".]

Figure 2: A Final log-likelihood estimate w.r.t. number of test samples used. B Samples from the SBN/SBN 10-200-200 generative model. C Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)

Results on binarized MNIST

| Method | NLL (bound / est.) |
|---|---|
| RWS (SBN/SBN 10-100-200-300-400) | 85.48 |
| RWS (NADE/NADE 250) | 85.23 |
| RWS (AR-SBN/SBN 500)† | 84.18 |
| NADE (500 units, [1]) | 88.35 |
| EoNADE (2hl, 128 orderings, [2]) | 85.10 |
| DARN (500 units, [3]) | 84.13 |
| RBM (500 units, CD3, [4]) | 105.5 |
| RBM (500 units, CD25, [4]) | 86.34 |
| DBN (500-2000, [5]) | 86.22 (bound), 84.55 (est.) |

Results on CalTech 101 Silhouettes

| Method | NLL est. |
|---|---|
| RWS (SBN/SBN 10-50-100-300) | 113.3 |
| RWS (NADE/NADE 150) | 104.3 |
| NADE (500 hidden units) | 110.6 |
| RBM (4000 hidden units, [6]) | 107.8 |

Table 2: Various RWS trained models in relation to previously published methods: [1] Larochelle and Murray (2011), [2] Uria et al. (2014), [3] Gregor et al. (2014), [4] Salakhutdinov and Murray (2008), [5] Murray and Salakhutdinov (2009), [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

Next we investigate the influence of the number of samples used during training. The results are visualized in Fig. 1 A. Although the results depend on the layer-distributions and on the depth and width of the architectures, we generally observe that the final estimated log-likelihood does not improve significantly when using more than 5 samples during training for NADE models, and using more than 25 samples for models with SBN layers. We can quantify the bias and the variance of the gradient estimator (3) using bootstrapping. While training a SBN/SBN 10-200-200 model with K = 100 training samples, we use K = 5,000 samples to get a high quality estimate of the gradient for a small but fixed set of 25 datapoints (the size of one mini-batch). By repeatedly resampling smaller sets of {1, 2, 5, ···, 5000} samples with replacement and by computing the gradient based on these, we get a measure for the bias and the variance of the small sample estimates relative to the high quality estimate. These results are visualized in Fig. 1 B. In Fig. 2 A we finally investigate the quality of the log-likelihood estimator (eqn. 1) when applied to the MNIST test set.

Table 1 summarizes how different architectures compare to each other and how RWS compares to related methods for training directed models. We essentially observe that RWS trained models consistently improve over classical wake-sleep, especially for deep architectures. We furthermore observe that using autoregressive layers (AR-SBN or NADE) for the inference network improves the results even when the generative model is composed of factorial SBN layers. Finally, we see that the best performing models with autoregressive layers in p are always shallow with only a single hidden layer. In Table 2 (left) we compare some of our best models to the state-of-the-art results published on MNIST. The deep SBN/SBN 10-100-200-300-400 model was trained for 1000 epochs with K = 5 training samples and a learning rate of 0.001. For fine-tuning we run an additional 500 epochs with a learning rate decay of 1.005 and 100 training samples. For comparison we also train the best performing model from the DARN paper (Gregor et al., 2014) with RWS, i.e., a single layer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between the observed and the latents. We essentially obtain the same final test set log-likelihood. For this shallow network we thus do not observe any improvement from using RWS.

Figure 3: CalTech 101 Silhouettes: A Random selection of training data points. B Random samples from the SBN/SBN 10-50-100-300 generative network. C Random samples from the NADE-150 generative network. (We show the probabilities from which each pixel is sampled.)

42 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 times 28 pixel CalTech 101 Silhouettes dataset Thisdataset consists of 4100 examples in the training set 2264 examples in the validation set and2307 examples in the test set We trained various architectures on this dataset using the samehyperparameter as for the MNIST experiments Table 2 (right) summarizes our results Note thatour best SBNSBN model is a relatively deep network with 4 hidden layers (300-100-50-10) andreaches a estimated LL of -1169 on the test set Our best network a shallow NADENADE-150network reaches -1043 and improves over the previous state of the art (minus1078 a RBM with 4000hidden units by Cho et al (2013))

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers ofbinary latent variables It generalizes and improves over the wake-sleep algorithm providing a lowerbias and lower variance estimator of the log-likelihood gradient at the price of more samples from theinference network During training the weighted samples from the inference network decouple thelayers such that the learning gradients only propagate within the individual layers Our experimentsdemonstrate that a small number ofasymp 5 samples is typically sufficient to jointly train relatively deeparchitectures of at least 5 hidden layers without layerwise pretraining and without carefully tuninglearning rates The resulting models produce reasonable samples (by visual inspection) and theyapproach state-of-the-art performance in terms of log-likelihood on several discrete datasets

We found that even in the cases when the generative networks contain SBN layers only better resultscan be obtained with inference networks composed of more powerful autoregressive layers Thishowever comes at the price of reduced computational efficiency on eg GPUs as the individualvariables hi sim q(h|x) have to be sampled in sequence (even though the theoretical complexity isnot significantly worse compared to SBN layers)

We furthermore found that models with autoregressive layers in the generative network p typicallyproduce very good results But the best ones were always shallow with only a single hidden layerAt this point it is unclear if this is due to optimization problems

Acknowledgments We would like to thank Laurent Dinh Vincent Dumoulin and Li Yao forhelpful discussions and the developers of Theano (Bergstra et al 2010 Bastien et al 2012) fortheir powerful software We furthermore acknowledge CIFAR and Canada Research Chairs forfunding and Compute Canada and Calcul Quebec for providing computational resources

8

Published as a conference paper at ICLR 2015

REFERENCES

Bastien F Lamblin P Pascanu R Bergstra J Goodfellow I J Bergeron A Bouchard Nand Bengio Y (2012) Theano new features and speed improvements Deep Learning andUnsupervised Feature Learning NIPS 2012 Workshop

Bengio Y (2009) Learning deep architectures for AI Now Publishers

Bengio Y and Bengio S (2000) Modeling high-dimensional discrete data with multi-layer neuralnetworks In NIPSrsquo99 pages 400ndash406 MIT Press

Bengio Y Mesnil G Dauphin Y and Rifai S (2013) Better mixing via deep representationsIn Proceedings of the 30th International Conference on Machine Learning (ICMLrsquo13) ACM

Bergstra J Breuleux O Bastien F Lamblin P Pascanu R Desjardins G Turian J Warde-Farley D and Bengio Y (2010) Theano a CPU and GPU math expression compiler InProceedings of the Python for Scientific Computing Conference (SciPy) Oral Presentation

Boulanger-Lewandowski N Bengio Y and Vincent P (2012) Modeling temporal dependenciesin high-dimensional sequences Application to polyphonic music generation and transcription InICMLrsquo2012

Cho K Raiko T and Ilin A (2013) Enhanced gradient for training restricted boltzmann ma-chines Neural computation 25(3) 805ndash831

Dayan P and Hinton G E (1996) Varieties of helmholtz machine Neural Networks 9(8) 1385ndash1403

Dayan P Hinton G E Neal R M and Zemel R S (1995) The Helmholtz machine Neuralcomputation 7(5) 889ndash904

Frey B J (1998) Graphical models for machine learning and digital communication MIT Press

Gregor K Danihelka I Mnih A Blundell C and Wierstra D (2014) Deep autoregressivenetworks In Proceedings of the 31st International Conference on Machine Learning

Hinton G E Dayan P Frey B J and Neal R M (1995) The wake-sleep algorithm for unsu-pervised neural networks Science 268 1558ndash1161

Hinton G E Osindero S and Teh Y (2006) A fast learning algorithm for deep belief netsNeural Computation 18 1527ndash1554

Kingma D P and Welling M (2014) Auto-encoding variational bayes In Proceedings of theInternational Conference on Learning Representations (ICLR)

Larochelle H (2011) Binarized mnist dataset httpwwwcstorontoedu˜larochehpublicdatasetsbinarized_mnistbinarized_mnist_[train|valid|test]amat

Larochelle H and Murray I (2011) The Neural Autoregressive Distribution Estimator In Proceedings of theFourteenth International Conference on Artificial Intelligence and Statistics (AISTATSrsquo2011) volume 15 ofJMLR WampCP

Mnih A and Gregor K (2014) Neural variational inference and learning in belief networks In Proceedingsof the 31st International Conference on Machine Learning (ICML 2014) to appear

Murray B U I and Larochelle H (2014) A deep and tractable density estimator In ICMLrsquo2014

Murray I and Salakhutdinov R (2009) Evaluating probabilities under high-dimensional latent variable mod-els In NIPSrsquo08 volume 21 pages 1137ndash1144

Rezende D J Mohamed S and Wierstra D (2014) Stochastic backpropagation and approximate inferencein deep generative models In ICMLrsquo2014

Salakhutdinov R and Murray I (2008) On the quantitative analysis of deep belief networks In Proceedingsof the International Conference on Machine Learning volume 25

Saul L K Jaakkola T and Jordan M I (1996) Mean field theory for sigmoid belief networks Journal ofArtificial Intelligence Research 4 61ndash76

Tang Y and Salakhutdinov R (2013) Learning stochastic feedforward neural networks In NIPSrsquo2013

9

Published as a conference paper at ICLR 2015

6 SUPPLEMENT

61 GRADIENTS FOR p(xh)

part

partθLp(θx sim D) =

part

partθlog pθ(x) =

1

p(x)

part

partθ

sumh

p(xh)

=1

p(x)

sumh

p(xh)part

partθlog p(xh)

=1

p(x)

sumh

q (h |x) p(xh)q (h |x)

part

partθlog p(xh)

=1

p(x)E

hsimq(h |x)

[p(xh)

q (h |x)part



Published as a conference paper at ICLR 2015

latent variables (and the latter is also trained to match the marginal of the encoder output). Besides the fact that these variational auto-encoders are only for continuous latent variables, another difference with the reweighted wake-sleep algorithm proposed here is that in the former, a single sample from the approximate inference distribution is sufficient to get an unbiased estimator of the gradient of a proxy (the variational bound). With reweighted wake-sleep, in contrast, a single sample would correspond to regular wake-sleep, which gives a biased estimator of the likelihood gradient. On the other hand, as the number of samples increases, reweighted wake-sleep provides a less biased (asymptotically unbiased) estimator of the log-likelihood and of its gradient. Similar in spirit, but aimed at a structured output prediction task, is the method proposed by Tang and Salakhutdinov (2013). The authors optimize the variational bound of the log-likelihood instead of the direct IS estimate, but they also derive update equations for the proposal distribution that share many of the properties found in reweighted wake-sleep.

3 COMPONENT LAYERS

Although the framework can be readily applied to continuous variables, we here restrict ourselves to distributions over binary visible and binary latent variables. We build our models by combining probabilistic components, each one associated with one of the layers of the generative network or of the inference network. The generative model can therefore be written as $p_\theta(x, h) = p_0(x \mid h_1)\, p_1(h_1 \mid h_2) \cdots p_L(h_L)$, while the inference network has the form $q_\phi(h \mid x) = q_1(h_1 \mid x) \cdots q_L(h_L \mid h_{L-1})$. For a distribution $P$ to be a suitable component we must have a method to efficiently compute $P(x^{(k)} \mid y^{(k)})$ given $(x^{(k)}, y^{(k)})$, and we must have a method to efficiently draw i.i.d. samples $x^{(k)} \sim P(x \mid y)$ for a given $y$. In the following we will describe experiments containing three kinds of layers:

Sigmoidal Belief Network (SBN) layer: A SBN layer (Saul et al., 1996) is a directed graphical model with independent variables $x_i$ given the parents $y$:

$$P^{\mathrm{SBN}}(x_i = 1 \mid y) = \sigma(W^{i,:}\, y + b_i) \qquad (6)$$

Although a SBN is a very simple generative model given $y$, performing inference for $y$ given $x$ is in general intractable.
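The two requirements on a component (evaluating $P(x \mid y)$ and drawing samples) can be sketched for the SBN layer of Eq. (6) as follows. This is a minimal illustration in plain Python, not the authors' Theano implementation; the function names and the list-of-lists weight layout are assumptions made for the example.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sbn_probs(W, b, y):
    """P(x_i = 1 | y) = sigmoid(W[i] . y + b[i]) for every unit i (Eq. 6)."""
    return [sigmoid(sum(w * yj for w, yj in zip(row, y)) + bi)
            for row, bi in zip(W, b)]


def sbn_log_prob(W, b, x, y):
    """log P(x | y); the units are independent given y, so the terms add up."""
    return sum(math.log(p if xi == 1 else 1.0 - p)
               for xi, p in zip(x, sbn_probs(W, b, y)))


def sbn_sample(W, b, y, rng):
    """Draw one binary sample x ~ P(x | y); all units can be drawn in parallel."""
    return [1 if rng.random() < p else 0 for p in sbn_probs(W, b, y)]


# Hypothetical toy parameters: 2 units in the layer, 2 parent units.
W = [[0.5, -1.0], [2.0, 0.3]]
b = [0.1, -0.2]
y = [1, 0]
x = sbn_sample(W, b, y, random.Random(0))
log_p = sbn_log_prob(W, b, x, y)
```

Because the units factorize given $y$, both operations cost one matrix-vector product per layer.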

Autoregressive SBN layer (AR-SBN, DARN): If we consider $x_i$ an ordered set of observed variables and introduce directed, autoregressive links between all previous $x_{<i}$ and a given $x_i$, we obtain a fully-visible sigmoid belief network (FVSBN; Frey, 1998; Bengio and Bengio, 2000). When we additionally condition a FVSBN on the parent layer's $y$, we obtain a layer model that was first used in Deep AutoRegressive Networks (DARN; Gregor et al., 2014):

$$P^{\mathrm{AR\text{-}SBN}}(x_i = 1 \mid x_{<i}, y) = \sigma(W^{i,:}\, y + S^{i,<i}\, x_{<i} + b_i) \qquad (7)$$

We use $x_{<i} = (x_1, x_2, \cdots, x_{i-1})$ to refer to the vector containing the first $i-1$ observed variables. The matrix $S$ is a lower triangular matrix that contains the autoregressive weights between the variables $x_i$, and with $S^{i,<j}$ we refer to the first $j-1$ elements of the $i$-th row of this matrix. In contrast to a regular SBN layer, the units $x_i$ are thus not independent of each other, but each can be predicted, as in a logistic regression, in terms of its predecessors $x_{<i}$ and of the input of the layer $y$.
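A sketch of Eq. (7) in plain Python illustrates why sampling from an autoregressive layer is sequential: unit $i$ can only be drawn once $x_{<i}$ is known. The helper names and the nested-list layout are assumptions for this example; only the strictly lower triangular part of `S` is ever read.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def ar_sbn_prob(W, S, b, i, x_prev, y):
    """P(x_i = 1 | x_<i, y); only the first i entries of row S[i] are used."""
    a = b[i] + sum(w * yj for w, yj in zip(W[i], y))
    a += sum(S[i][j] * x_prev[j] for j in range(i))
    return sigmoid(a)


def ar_sbn_log_prob(W, S, b, x, y):
    """log P(x | y) as a sum of the per-unit conditionals of Eq. (7)."""
    total = 0.0
    for i, xi in enumerate(x):
        p = ar_sbn_prob(W, S, b, i, x, y)
        total += math.log(p if xi == 1 else 1.0 - p)
    return total


def ar_sbn_sample(W, S, b, y, rng):
    """Draw x ~ P(x | y) one unit at a time, in the fixed ordering."""
    x = []
    for i in range(len(b)):
        p = ar_sbn_prob(W, S, b, i, x, y)
        x.append(1 if rng.random() < p else 0)
    return x


# Hypothetical toy parameters: 3 units in the layer, 2 parent units.
W = [[0.2, -0.4], [1.0, 0.5], [-0.7, 0.3]]
S = [[9.0, 9.0, 9.0], [1.5, 9.0, 9.0], [-2.0, 0.8, 9.0]]  # entries on/above the diagonal are ignored
b = [0.0, -0.5, 0.25]
y = [1, 1]
x = ar_sbn_sample(W, S, b, y, random.Random(1))
```

Evaluating the log-probability of a given $x$ does not need the sequential loop in principle, since all conditionals can be computed at once when $x$ is known; the loop here is only for readability.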

Conditional NADE layer: The Neural Autoregressive Distribution Estimator (NADE; Larochelle and Murray, 2011) is a model that uses an internal, accumulating hidden layer to predict variables $x_i$ given the vector containing all previous variables $x_{<i}$. Instead of the logistic regression used in a FVSBN or an AR-SBN, the dependency between the variables $x_i$ is here mediated by an MLP (Bengio and Bengio, 2000):

$$P(x_i = 1 \mid x_{<i}) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + a) + b_i) \qquad (8)$$

with $W$ and $V$ denoting the encoding and decoding matrices for the NADE hidden layer. For our purposes we condition this model on the random variables $y$:

$$P^{\mathrm{NADE}}(x_i = 1 \mid x_{<i}, y) = \sigma(V^{i,:}\, \sigma(W^{:,<i}\, x_{<i} + U_a\, y + a) + U_b^{i,:}\, y + b_i) \qquad (9)$$

Such a conditional NADE has been used previously for modeling musical sequences (Boulanger-Lewandowski et al., 2012).
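The "accumulating hidden layer" of Eq. (9) can be made concrete with a short sketch: the hidden pre-activation starts at $U_a y + a$ and receives one additional column of $W$ for every observed $x_i$. The code below is an illustrative plain-Python version under assumed shapes (`V`: n×H, `W`: H×n, `Ua`: H×m, `Ub`: n×m); it is not taken from the authors' implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def cond_nade_log_prob(V, W, Ua, Ub, a, b, x, y):
    """log P(x | y) for the conditional NADE of Eq. (9)."""
    n_hidden = len(a)
    # Pre-activation before any x_i has been observed: U_a y + a.
    act = [a[h] + sum(u * yj for u, yj in zip(Ua[h], y)) for h in range(n_hidden)]
    log_prob = 0.0
    for i, xi in enumerate(x):
        hid = [sigmoid(v) for v in act]
        logit = b[i] + sum(u * yj for u, yj in zip(Ub[i], y))
        logit += sum(V[i][h] * hid[h] for h in range(n_hidden))
        p = sigmoid(logit)
        log_prob += math.log(p if xi == 1 else 1.0 - p)
        # Accumulate column i of W so the next unit sees x_1 .. x_i.
        for h in range(n_hidden):
            act[h] += W[h][i] * xi
    return log_prob


# Hypothetical toy parameters: n = 3 visible units, H = 2 hidden units, m = 2 parents.
V = [[0.3, -0.2], [0.7, 0.1], [-0.5, 0.9]]
W = [[0.4, -1.0, 0.2], [0.6, 0.3, -0.8]]
Ua = [[0.1, 0.2], [-0.3, 0.5]]
Ub = [[0.0, 0.4], [0.2, -0.1], [-0.6, 0.3]]
a = [0.05, -0.05]
b = [0.1, 0.0, -0.1]
y = [1, 0]
log_p = cond_nade_log_prob(V, W, Ua, Ub, a, b, [1, 0, 1], y)
```

The running update of `act` is what keeps the cost of one pass linear in the number of units times the number of hidden units, rather than quadratic in the number of units.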

For each layer distribution we can construct an unconditioned distribution by removing the conditioning variable $y$. We use such unconditioned distributions as top layer $p(h)$ for the generative network $p$.


[Figure 1 plots. Panel A: final LL estimate (test set) against the number of training samples (10^0 to 10^2) for NADE 200, SBN 10-200-200 and SBN 200. Panel B: bias and std-dev against the number of training samples (10^0 to 10^2), each shown at epoch 50 and at the last epoch.]

Figure 1: A) Final log-likelihood estimate w.r.t. the number of samples used during training. B) L2-norm of the bias and standard deviation of the low-sample estimated $p_\theta$ gradient relative to a high-sample (K = 5000) based estimate.

| P-model size | NVIL | wake-sleep | RWS (Q-model: SBN) | RWS (Q-model: NADE) |
|---|---|---|---|---|
| SBN 200 | (113.1) | 116.3 (120.7) | 103.1 | 95.0 |
| SBN 200-200 | (99.8) | 106.9 (109.4) | 93.4 | 91.1 |
| SBN 200-200-200 | (96.7) | 101.3 (104.4) | 90.1 | 88.9 |

| P-model size | RWS |
|---|---|
| AR-SBN 200 | 89.2 |
| AR-SBN 200-200 | 92.8 |
| NADE 200 | 86.8 |
| NADE 200-200 | 87.6 |

Table 1: MNIST results for various architectures and training methods. In the 3rd column we cite the numbers reported by Mnih and Gregor (2014). Values in brackets are variational NLL bounds; values without brackets report NLL estimates (see section 2.2).

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasets. The supplement describes additional experiments on smaller datasets from the UCI repository. With these experiments we 1) quantitatively analyze the influence of the number of samples K; 2) demonstrate that using a more powerful layer-model for the inference network q can significantly enhance the results, even when the generative model is a factorial SBN; and 3) show that we approach state-of-the-art performance when using either relatively deep models or powerful layer models such as a conditional NADE. Our implementation is available at https://github.com/jbornschein/reweighted-ws.

4.1 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) and downloaded in binarized form from (Larochelle, 2011). For training we use stochastic gradient descent with momentum (β = 0.95) and set the mini-batch size to 25. The experiments in this paragraph were run with learning rates of 0.0003, 0.001 and 0.003. From these three we always report the experiment with the highest validation log-likelihood. In the majority of our experiments a learning rate of 0.001 gave the best results, even across different layer models (SBN, AR-SBN and NADE). If not noted otherwise, we use K = 5 samples during training and K = 100,000 samples to estimate the final log-likelihood on the test set.¹ To disentangle the influence of the different q updating methods, we set up p and q networks consisting of three hidden SBN layers with 10, 200 and 200 units (SBN/SBN 10-200-200). After convergence, the model trained updating q during the sleep phase only reached a final estimated log-likelihood of −93.4, the model trained with a q-update during the wake phase reached −92.8, and the model trained with both wake and sleep phase updates reached −91.9. As a control, we trained a model that does not update q at all. This model reached −171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

¹ We refer to the lower bound estimates, which can be arbitrarily tightened by increasing the number of test samples, as LL estimates to distinguish them from the variational LL lower bounds (see section 2.2).

[Figure 2 plots. Panel A: estimated LL against the number of test samples (10^0 to 10^2) for NADE-NADE 200, SBN-SBN 200-200-10 and SBN-SBN 200. Panel B title: SBN/SBN 10-100-200-300-400. Panel C title: NADE/NADE 250.]

Figure 2: A) Final log-likelihood estimate w.r.t. the number of test samples used. B) Samples from the SBN/SBN 10-200-200 generative model. C) Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)

Results on binarized MNIST

| Method | NLL |
|---|---|
| RWS (SBN/SBN 10-100-200-300-400) | 85.48 |
| RWS (NADE/NADE 250) | 85.23 |
| RWS (AR-SBN/SBN 500)† | 84.18 |
| NADE (500 units, [1]) | 88.35 |
| EoNADE (2hl, 128 orderings, [2]) | 85.10 |
| DARN (500 units, [3]) | 84.13 |
| RBM (500 units, CD3, [4]) | 105.5 |
| RBM (500 units, CD25, [4]) | 86.34 |
| DBN (500-2000, [5]) | 86.22 (bound), 84.55 (est.) |

Results on CalTech 101 Silhouettes

| Method | NLL est. |
|---|---|
| RWS (SBN/SBN 10-50-100-300) | 113.3 |
| RWS (NADE/NADE 150) | 104.3 |
| NADE (500 hidden units) | 110.6 |
| RBM (4000 hidden units, [6]) | 107.8 |

Table 2: Various RWS trained models in relation to previously published methods: [1] Larochelle and Murray (2011), [2] Uria, Murray and Larochelle (2014), [3] Gregor et al. (2014), [4] Salakhutdinov and Murray (2008), [5] Murray and Salakhutdinov (2009), [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

Next we investigate the influence of the number of samples used during training. The results are visualized in Fig. 1 A. Although the results depend on the layer-distributions and on the depth and width of the architectures, we generally observe that the final estimated log-likelihood does not improve significantly when using more than 5 samples during training for NADE models, or more than 25 samples for models with SBN layers. We can quantify the bias and the variance of the gradient estimator (3) using bootstrapping. While training a SBN/SBN 10-200-200 model with K = 100 training samples, we use K = 5,000 samples to get a high quality estimate of the gradient for a small but fixed set of 25 datapoints (the size of one mini-batch). By repeatedly resampling smaller sets of {1, 2, 5, ..., 5000} samples with replacement, and by computing the gradient based on these, we get a measure for the bias and the variance of the small sample estimates relative to the high quality estimate. These results are visualized in Fig. 1 B. In Fig. 2 A we finally investigate the quality of the log-likelihood estimator (eqn. 1) when applied to the MNIST test set.

Table 1 summarizes how different architectures compare to each other and how RWS compares to related methods for training directed models. We essentially observe that RWS trained models consistently improve over classical wake-sleep, especially for deep architectures. We furthermore observe that using autoregressive layers (AR-SBN or NADE) for the inference network improves the results, even when the generative model is composed of factorial SBN layers. Finally, we see that the best performing models with autoregressive layers in p are always shallow, with only a single hidden layer. In Table 2 (left) we compare some of our best models to the state-of-the-art results published on MNIST. The deep SBN/SBN 10-100-200-300-400 model was trained for 1000 epochs with K = 5 training samples and a learning rate of 0.001. For fine-tuning we run an additional 500 epochs with a learning rate decay of 1.005 and 100 training samples. For comparison we also train the best performing model from the DARN paper (Gregor et al., 2014) with RWS, i.e., a single layer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between the observed and the latents. We essentially obtain the same final test set log-likelihood. For this shallow network we thus do not observe any improvement from using RWS.

Figure 3: CalTech 101 Silhouettes. A) Random selection of training data points. B) Random samples from the SBN/SBN 10-50-100-300 generative network. C) Random samples from the NADE-150 generative network. (We show the probabilities from which each pixel is sampled.)

4.2 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 × 28 pixel CalTech 101 Silhouettes dataset. This dataset consists of 4,100 examples in the training set, 2,264 examples in the validation set and 2,307 examples in the test set. We trained various architectures on this dataset using the same hyperparameters as for the MNIST experiments. Table 2 (right) summarizes our results. Note that our best SBN/SBN model is a relatively deep network with 4 hidden layers (300-100-50-10) and reaches an estimated LL of −116.9 on the test set. Our best network, a shallow NADE/NADE-150 network, reaches −104.3 and improves over the previous state of the art (−107.8, a RBM with 4000 hidden units by Cho et al. (2013)).

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers of binary latent variables. It generalizes and improves over the wake-sleep algorithm, providing a lower bias and lower variance estimator of the log-likelihood gradient at the price of more samples from the inference network. During training, the weighted samples from the inference network decouple the layers such that the learning gradients only propagate within the individual layers. Our experiments demonstrate that a small number of ≈ 5 samples is typically sufficient to jointly train relatively deep architectures of at least 5 hidden layers, without layerwise pretraining and without carefully tuning learning rates. The resulting models produce reasonable samples (by visual inspection) and they approach state-of-the-art performance in terms of log-likelihood on several discrete datasets.

We found that even in the cases when the generative networks contain SBN layers only, better results can be obtained with inference networks composed of more powerful, autoregressive layers. This, however, comes at the price of reduced computational efficiency on, e.g., GPUs, as the individual variables $h_i \sim q(h \mid x)$ have to be sampled in sequence (even though the theoretical complexity is not significantly worse compared to SBN layers).

We furthermore found that models with autoregressive layers in the generative network p typically produce very good results, but the best ones were always shallow, with only a single hidden layer. At this point it is unclear if this is due to optimization problems.

Acknowledgments: We would like to thank Laurent Dinh, Vincent Dumoulin and Li Yao for helpful discussions, and the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for their powerful software. We furthermore acknowledge CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources.


REFERENCES

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS'99, pages 400–406. MIT Press.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML'2012.

Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3), 805–831.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Larochelle, H. (2011). Binarized MNIST dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS'2011), volume 15 of JMLR W&CP.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014). To appear.

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In ICML'2014.

Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In NIPS'08, volume 21, pages 1137–1144.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25.

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In NIPS'2013.


6 SUPPLEMENT

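6.1 GRADIENTS FOR p(x, h)

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
= \frac{\partial}{\partial\theta} \log p_\theta(x)
&= \frac{1}{p(x)} \frac{\partial}{\partial\theta} \sum_h p(x, h) \\
&= \frac{1}{p(x)} \sum_h p(x, h) \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \, \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\end{aligned}
\qquad (10)
$$

with $\omega_k = \dfrac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$ and $h^{(k)} \sim q(h \mid x)$.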
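The last line of Eq. (10) only needs the weights $\omega_k$ up to a common factor, so in practice they can be computed from log-probabilities and normalized in a numerically stable way. The following plain-Python sketch shows this under the assumption that the per-sample values $\log p(x, h^{(k)})$, $\log q(h^{(k)} \mid x)$ and the per-sample gradients are already available; the function names are hypothetical.

```python
import math


def normalized_weights(log_p_joint, log_q):
    """Return omega_k / sum_k omega_k with omega_k = p(x, h_k) / q(h_k | x)."""
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    m = max(log_w)  # subtract the maximum before exponentiating for stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def weighted_gradient(weights, per_sample_grads):
    """Weighted sum of per-sample gradient vectors (last line of Eq. 10)."""
    dim = len(per_sample_grads[0])
    return [sum(w * g[d] for w, g in zip(weights, per_sample_grads))
            for d in range(dim)]


# Toy numbers (not from the paper): K = 3 samples, gradient of dimension 2.
log_p_joint = [-95.0, -93.0, -97.5]
log_q = [-4.0, -3.0, -5.0]
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = normalized_weights(log_p_joint, log_q)
g = weighted_gradient(w, grads)
```

With K = 1 the single normalized weight is exactly 1, so the estimator reduces to using the one sampled $h$ directly.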

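6.2 GRADIENTS FOR THE WAKE PHASE Q UPDATE

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
&= \frac{1}{p(x)} \frac{\partial}{\partial\phi} \sum_h p(x, h) \log q_\phi(h \mid x) \\
&= \frac{1}{p(x)} \, \mathbb{E}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\qquad (11)
$$

Note that we arrive at the same gradients, up to the sign, when we set out to minimize $\mathrm{KL}(p(\cdot \mid x) \,\|\, q(\cdot \mid x))$ for a given datapoint $x$:

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathrm{KL}(p_\theta(h \mid x) \,\|\, q_\phi(h \mid x))
&= \frac{\partial}{\partial\phi} \sum_h p_\theta(h \mid x) \log \frac{p_\theta(h \mid x)}{q_\phi(h \mid x)} \\
&= -\sum_h p_\theta(h \mid x) \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \\
&\simeq -\frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\qquad (12)
$$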


6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Figure 5 plot: bias and std-dev against the number of samples (10^0 to 10^3), each shown at epoch 50 and at the last epoch.]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K = 5000 primary samples from a SBN/SBN 10-200-200 network trained on MNIST).
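The bootstrapping procedure behind this analysis can be sketched as follows: given the log importance weights of a large set of primary samples for one datapoint, smaller sets are repeatedly resampled with replacement, and the resulting low-sample estimates of log p(x) are compared against the estimate from all primary samples. This is an illustrative plain-Python sketch; the function names, the number of repetitions and the toy weights are assumptions, not values from the paper.

```python
import math
import random


def log_mean_exp(log_w):
    """log( (1/K) * sum_k exp(log_w[k]) ), computed stably."""
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w) / len(log_w))


def bootstrap_bias_std(log_w, k, n_rep, rng):
    """Bias and std-dev of the k-sample log p(x) estimate.

    The reference value is the estimate that uses all primary samples.
    """
    reference = log_mean_exp(log_w)
    estimates = [log_mean_exp(rng.choices(log_w, k=k)) for _ in range(n_rep)]
    mean = sum(estimates) / n_rep
    var = sum((e - mean) ** 2 for e in estimates) / n_rep
    return mean - reference, math.sqrt(var)


# Toy primary samples: 5000 synthetic log-weights standing in for
# log p(x, h_k) - log q(h_k | x) of one datapoint.
rng = random.Random(0)
primary = [rng.gauss(-90.0, 2.0) for _ in range(5000)]
for k in (1, 10, 100):
    bias, std = bootstrap_bias_std(primary, k, n_rep=200, rng=rng)
```

The same resampling scheme applies to the gradient estimator analysed in Fig. 1 B, with the scalar estimate replaced by a gradient vector and the bias measured by its L2-norm.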


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on 8 different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space. The learning rate was set to a value in {0.001, 0.003, 0.01}. For SBNs we use K = 10 training samples and we tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150 or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200 or 10-100-300 hidden units. We trained NADE/NADE models with K = 5 training samples and one hidden layer with 30, 50, 75, 100 or 200 units in it.
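For orientation, the screened space can be written out as an explicit list of configurations. The sketch below assumes that every learning rate was combined with every architecture (a full cross product), which the text does not state explicitly; the dictionary keys are hypothetical names chosen for this example.

```python
from itertools import product

LEARNING_RATES = (0.001, 0.003, 0.01)
SBN_ARCHITECTURES = (
    (10, 50), (10, 75), (10, 100), (10, 150), (10, 200),
    (5, 20, 100), (10, 50, 100), (10, 50, 150), (10, 50, 200), (10, 100, 300),
)
NADE_HIDDEN_UNITS = (30, 50, 75, 100, 200)


def screening_grid():
    """Enumerate the configurations, assuming a full cross product."""
    configs = []
    for lr, layers in product(LEARNING_RATES, SBN_ARCHITECTURES):
        configs.append({"model": "SBN/SBN", "learning_rate": lr,
                        "layers": layers, "n_samples": 10})
    for lr, units in product(LEARNING_RATES, NADE_HIDDEN_UNITS):
        configs.append({"model": "NADE/NADE", "learning_rate": lr,
                        "layers": (units,), "n_samples": 5})
    return configs


grid = screening_grid()
```

Under that assumption the grid holds 30 SBN/SBN and 15 NADE/NADE configurations per dataset.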

| Model | ADULT | CONNECT4 | DNA | MUSHROOMS | NIPS-0-12 | OCR-LETTERS | RCV1 | WEB |
|---|---|---|---|---|---|---|---|---|
| FVSBN | 13.17 | 12.39 | 83.64 | 10.27 | 276.88 | 39.30 | 49.84 | 29.35 |
| NADE* | 13.19 | 11.99 | 84.81 | 9.81 | 273.08 | 27.22 | 46.66 | 28.39 |
| EoNADE+ | 13.19 | 12.58 | 82.31 | 9.68 | 272.38 | 27.31 | 46.12 | 27.87 |
| DARN3 | 13.19 | 11.91 | 81.04 | 9.55 | 274.68 | 28.17 | 46.10 | 28.83 |
| RWS - SBN | 13.65 | 12.68 | 90.63 | 9.90 | 272.54 | 29.99 | 46.16 | 28.18 |
| hidden units | 5-20-100 | 10-50-150 | 10-150 | 10-50-150 | 10-50-150 | 10-100-300 | 10-50-200 | 10-50-300 |
| RWS - NADE | 13.16 | 11.68 | 84.26 | 9.71 | 271.11 | 26.43 | 46.09 | 27.92 |

hidden units (RWS - NADE): 30, 50, 100, 50, 75, 100

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, Larochelle (2014). (NADE*: 500 hidden units; EoNADE+: 1hl, 16 ord)


Published as a conference paper at ICLR 2015

100 101 102

training samples

minus130

minus120

minus110

minus100

minus90

minus80

Fin

alL

Les

tim

ate

(tes

tset

)

NA DE 200SBN 10-200-200SBN 200

A B

bias (epoch 50)

bias (last epoch)std dev (epoch50)

100 101102

training samples

03

04

05

06

07

08

09

bia

s

06

08

10

12

14

16

18

std

-dev

std dev (last epoch)

Figure 1 A Final log-likelihood estimate wrt number of samples used during training B L2-normof the bias and standard deviation of the low-sample estimated pθ gradient relative to a high-sample(K=5000) based estimate

NVIL wake-sleep RWS RWSP-model size Q-model SBN Q-model NADESBN 200 (1131) 1163 (1207) 1031 950SBN 200-200 (998) 1069 (1094) 934 911SBN 200-200-200 (967) 1013 (1044) 901 889AR-SBN 200 892AR-SBN 200-200 928NADE 200 868NADE 200-200 876

Table 1 MNIST results for various architectures and training methods In the 3rd column we citethe numbers reported by Mnih and Gregor (2014) Values in brackets are variational NLL boundsvalues without brackets report NLL estimates (see section 22)

4 EXPERIMENTS

Here we present a series of experiments on the MNIST and the CalTech-Silhouettes datasetsThe supplement describes additional experiments on smaller datasets from the UCI repositoryWith these experiments we 1) quantitatively analyze the influence of the number of samples K2) demonstrate that using a more powerful layer-model for the inference network q can signif-icantly enhance the results even when the generative model is a factorial SBN and 3) showthat we approach state-of-the-art performance when using either relatively deep models or whenusing powerful layer models such as a conditional NADE Our implementation is available athttpsgithubcomjbornscheinreweighted-ws

41 MNIST

We use the MNIST dataset that was binarized according to Murray and Salakhutdinov (2009) anddownloaded in binarized form from (Larochelle 2011) For training we use stochastic gradientdecent with momentum (β=095) and set mini-batch size to 25 The experiments in this paragraphwere run with learning rates of 00003 0001 and 0003 From these three we always report theexperiment with the highest validation log-likelihood In the majority of our experiments a learningrate of 0001 gave the best results even across different layer models (SBN AR-SBN and NADE) Ifnot noted otherwise we use K = 5 samples during training and K = 100 000 samples to estimatethe final log-likelihood on the test set1 To disentangle the influence of the different q updatingmethods we setup p and q networks consisting of three hidden SBN layers with 10 200 and 200units (SBNSBN 10-200-200) After convergence the model trained updating q during the sleepphase only reached a final estimated log-likelihood of minus934 the model trained with a q-updateduring the wake phase reachedminus928 and the model trained with both wake and sleep phase updatereached minus919 As a control we trained a model that does not update q at all This model reached

1We refer to the lower bound estimates which can be arbitrarily tightened by increasing the number of testsamples as LL estimates to distiguish them from the variational LL lower bounds (see section 22)

6

Published as a conference paper at ICLR 2015

100 101 102

samples

minus130

minus120

minus110

minus100

minus90

minus80

est

LL

NA DE-NA DE 200SBN-SBN 200-200-10SBN-SBN 200

A B CSBNSBN 10-100-200-300-400 NADENADE 250

Figure 2 A Final log-likelihood estimate wrt number of test samples used B Samples from theSBNSBN 10-200-200 generative model C Samples from the NADENADE 250 generative model(We show the probabilities from which each pixel is sampled)

Results on binarized MNISTNLL NLL

Method bound estRWS (SBNSBN 10-100-200-300-400) 8548RWS (NADENADE 250) 8523RWS (AR-SBNSBN 500)dagger 8418NADE (500 units [1]) 8835EoNADE (2hl 128 orderings [2]) 8510DARN (500 units [3]) 8413RBM (500 units CD3 [4]) 1055RBM (500 units CD25 [4]) 8634DBN (500-2000 [5]) 8622 8455

Results on CalTech 101 SilhouettesNLL

Method estRWS (SBNSBN 10-50-100-300) 1133RWS (NADENADE 150) 1043

NADE (500 hidden units) 1106RBM (4000 hidden units [6]) 1078

Table 2 Various RWS trained models in relation to previously published methods [1] Larochelleand Murray (2011) [2] Murray and Larochelle (2014) [3] Gregor et al (2014) [4] Salakhutdinovand Murray (2008) [5] Murray and Salakhutdinov (2009) [6] Cho et al (2013) dagger Same model asthe best performing in [3] a AR-SBN with deterministic hidden variables between the observed andlatent All RWS NLL estimates on MNIST have confidence intervals of asymp plusmn040

minus1714 We confirmed that combining wake and sleep phase q-updates generally gives the bestresults by repeating this experiment with various other architectures For the remainder of this paperwe therefore train all models with combined wake and sleep phase q-updates

Next we investigate the influence of the number of samples used during training The results arevisualized in Fig 1 A Although the results depend on the layer-distributions and on the depthand width of the architectures we generally observe that the final estimated log-likelihood does notimprove significantly when using more than 5 samples during training for NADE models and usingmore than 25 samples for models with SBN layers We can quantify the bias and the variance ofthe gradient estimator (3) using bootstrapping While training a SBNSBN 10-200-200 model withK = 100 training samples we useK = 5 000 samples to get a high quality estimate of the gradientfor a small but fixed set of 25 datapoints (the size of one mini-batch) By repeatedly resamplingsmaller sets of 1 2 5 middot middot middot 5000 samples with replacement and by computing the gradient basedon these we get a measure for the bias and the variance of the small sample estimates relative thehigh quality estimate These results are visualized in Fig 1 B In Fig 2 A we finally investigate thequality of the log-likelihood estimator (eqn 1) when applied to the MNIST test set

Table 1 summarizes how different architectures compare to each other and how RWS comparesto related methods for training directed models We essentially observe that RWS trained modelsconsistently improve over classical wake-sleep especially for deep architectures We furthermoreobserve that using autoregressive layers (AR-SBN or NADE) for the inference network improvesthe results even when the generative model is composed of factorial SBN layers Finally we see thatthe best performing models with autoregressive layers in p are always shallow with only a single

7

Published as a conference paper at ICLR 2015

Figure 3 CalTech 101 Silhouettes A Random selection of training data points B Random samplesfrom the SBNSBN 10-50-100-300 generative network C Random Samples from the NADE-150generative network (We show the probabilities from which each pixel is sampled)

hidden layer In Table 2 (left) we compare some of our best models to the state-of-the-art resultspublished on MNIST The deep SBNSBN 10-100-200-300-400 model was trained for 1000 epochswith K = 5 training samples and a learning rate of 0001 For fine-tuning we run additional 500epochs with a learning rate decay of 1005 and 100 training samples For comparison we also trainthe best performing model from the DARN paper (Gregor et al 2014) with RWS ie a singlelayer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between theobserved and the latents We essentially obtain the same final testset log-likelihood For this shallownetwork we thus do not observe any improvement from using RWS

42 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 times 28 pixel CalTech 101 Silhouettes dataset Thisdataset consists of 4100 examples in the training set 2264 examples in the validation set and2307 examples in the test set We trained various architectures on this dataset using the samehyperparameter as for the MNIST experiments Table 2 (right) summarizes our results Note thatour best SBNSBN model is a relatively deep network with 4 hidden layers (300-100-50-10) andreaches a estimated LL of -1169 on the test set Our best network a shallow NADENADE-150network reaches -1043 and improves over the previous state of the art (minus1078 a RBM with 4000hidden units by Cho et al (2013))

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers of binary latent variables. It generalizes and improves over the wake-sleep algorithm, providing a lower-bias and lower-variance estimator of the log-likelihood gradient at the price of more samples from the inference network. During training, the weighted samples from the inference network decouple the layers such that the learning gradients only propagate within the individual layers. Our experiments demonstrate that a small number of ≈ 5 samples is typically sufficient to jointly train relatively deep architectures of at least 5 hidden layers without layerwise pretraining and without carefully tuning learning rates. The resulting models produce reasonable samples (by visual inspection) and they approach state-of-the-art performance in terms of log-likelihood on several discrete datasets.

We found that even in the cases when the generative networks contain SBN layers only, better results can be obtained with inference networks composed of more powerful, autoregressive layers. This, however, comes at the price of reduced computational efficiency on, e.g., GPUs, as the individual variables h_i ∼ q(h|x) have to be sampled in sequence (even though the theoretical complexity is not significantly worse compared to SBN layers).
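The difference between sampling a factorial layer and an autoregressive layer can be sketched as follows. This is a minimal illustration and not the implementation used for the experiments: the parameter names `U`, `W` and `b`, and the exact form of the autoregressive conditional (a sigmoid over the input contribution plus a weighted sum of the previously sampled units), are assumptions made for the example.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_factorial_layer(x, U, b, rng):
    """Factorial (SBN-style) layer: every h_j depends only on x,
    so all probabilities can be computed before any unit is sampled."""
    probs = [sigmoid(sum(u * xi for u, xi in zip(U[j], x)) + b[j])
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_autoregressive_layer(x, U, W, b, rng):
    """Autoregressive layer (assumed form): h_j also depends on h_1..h_{j-1},
    so unit j cannot be sampled before the earlier units are known."""
    h = []
    for j in range(len(b)):
        a = sum(u * xi for u, xi in zip(U[j], x)) + b[j]
        a += sum(W[j][i] * h[i] for i in range(j))  # only earlier units contribute
        h.append(1 if rng.random() < sigmoid(a) else 0)
    return h


# Hypothetical toy parameters: 3 inputs, 4 hidden units.
rng = random.Random(0)
x = [1, 0, 1]
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.7, 0.2, 0.4], [0.1, 0.1, 0.1]]
W = [[0.0] * 4 for _ in range(4)]
W[1][0], W[2][0], W[2][1], W[3][2] = 1.5, -2.0, 0.7, 1.0
b = [0.0, -0.1, 0.2, 0.0]
h_factorial = sample_factorial_layer(x, U, b, rng)
h_autoregressive = sample_autoregressive_layer(x, U, W, b, rng)
```

In the factorial case the list of probabilities is available in one step, which maps well onto parallel hardware. In the autoregressive case the loop over `j` carries a data dependency on the units already sampled, which is the source of the sequential cost mentioned above.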

We furthermore found that models with autoregressive layers in the generative network p typically produce very good results, but the best ones were always shallow with only a single hidden layer. At this point it is unclear whether this is due to optimization problems.

Acknowledgments: We would like to thank Laurent Dinh, Vincent Dumoulin and Li Yao for helpful discussions, and the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for their powerful software. We furthermore acknowledge CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Quebec for providing computational resources.


REFERENCES

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS'99, pages 400–406. MIT Press.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML'2012.

Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted boltzmann machines. Neural computation, 25(3), 805–831.

Dayan, P. and Hinton, G. E. (1996). Varieties of helmholtz machine. Neural Networks, 9(8), 1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural computation, 7(5), 889–904.

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Larochelle, H. (2011). Binarized mnist dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS'2011), volume 15 of JMLR: W&CP.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014). To appear.

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In ICML'2014.

Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In NIPS'08, volume 21, pages 1137–1144.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25.

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Tang, Y. and Salakhutdinov, R. (2013). Learning stochastic feedforward neural networks. In NIPS'2013.


6 SUPPLEMENT

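6.1 GRADIENTS FOR p(x, h)

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathcal{L}_p(\theta, x \sim \mathcal{D})
&= \frac{\partial}{\partial\theta} \log p_\theta(x)
 = \frac{1}{p(x)} \frac{\partial}{\partial\theta} \sum_h p(x, h) \\
&= \frac{1}{p(x)} \sum_h p(x, h) \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \sum_h q(h \mid x) \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \\
&= \frac{1}{p(x)} \mathop{\mathbb{E}}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\theta} \log p(x, h) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\theta} \log p(x, h^{(k)})
\end{aligned}
\tag{10}
$$

with $\omega_k = \dfrac{p(x, h^{(k)})}{q(h^{(k)} \mid x)}$ and $h^{(k)} \sim q(h \mid x)$.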
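The last line of Eq. (10) can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, the log-probabilities and per-sample gradients are made-up toy values, and the weights are normalized in log space purely for numerical stability.

```python
import math


def normalized_importance_weights(log_p_joint, log_q):
    """Return w_k / sum_j w_j with w_k = p(x, h_k) / q(h_k | x).

    Inputs are log p(x, h_k) and log q(h_k | x) for K samples h_k ~ q(h | x).
    Subtracting the maximum log-weight before exponentiating avoids underflow.
    """
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    m = max(log_w)
    unnormalized = [math.exp(lw - m) for lw in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_gradient(weights, per_sample_grads):
    """Weighted sum of the per-sample gradients d/dtheta log p(x, h_k)."""
    dim = len(per_sample_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, per_sample_grads))
            for i in range(dim)]


# Toy values for K = 3 samples and a 2-dimensional parameter vector.
log_p = [-3.0, -2.0, -4.0]
log_q = [-1.0, -1.0, -1.0]
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = normalized_importance_weights(log_p, log_q)
g = weighted_gradient(w, grads)
```

Because only the ratio of each weight to the sum of all weights enters the estimator, any constant factor shared by all weights cancels, which is why the shift by the maximum log-weight does not change the result.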

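6.2 GRADIENTS FOR THE WAKE PHASE q UPDATE

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathcal{L}_q(\phi, x \sim \mathcal{D})
&= \frac{1}{p(x)} \frac{\partial}{\partial\phi} \sum_h p(x, h) \log q_\phi(h \mid x) \\
&= \frac{1}{p(x)} \mathop{\mathbb{E}}_{h \sim q(h \mid x)} \left[ \frac{p(x, h)}{q(h \mid x)} \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \right] \\
&\simeq \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{11}
$$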

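Note that we arrive at the same gradients (up to the sign, since the KL divergence is minimized) when we set out to minimize KL(p(·|x) ‖ q(·|x)) for a given datapoint x:

$$
\begin{aligned}
\frac{\partial}{\partial\phi} \mathrm{KL}\left(p_\theta(h \mid x) \,\|\, q_\phi(h \mid x)\right)
&= \frac{\partial}{\partial\phi} \sum_h p_\theta(h \mid x) \log \frac{p_\theta(h \mid x)}{q_\phi(h \mid x)} \\
&= - \sum_h p_\theta(h \mid x) \frac{\partial}{\partial\phi} \log q_\phi(h \mid x) \\
&\simeq - \frac{1}{\sum_k \omega_k} \sum_{k=1}^{K} \omega_k \frac{\partial}{\partial\phi} \log q_\phi(h^{(k)} \mid x)
\end{aligned}
\tag{12}
$$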
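The second line of Eq. (12) can be checked numerically on a model small enough to enumerate. The sketch below assumes a single binary latent variable with a fixed posterior probability `post1` standing in for p(h = 1 | x) and a one-parameter inference distribution q(h = 1 | x) = sigmoid(phi); both are hypothetical choices made for the illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def kl_p_q(post1, phi):
    """KL(p(h|x) || q_phi(h|x)) for one binary latent h."""
    q1 = sigmoid(phi)
    return (post1 * math.log(post1 / q1)
            + (1.0 - post1) * math.log((1.0 - post1) / (1.0 - q1)))


def kl_grad_analytic(post1, phi):
    """Minus the posterior-weighted sum of d/dphi log q_phi(h|x)."""
    q1 = sigmoid(phi)
    dlogq_h1 = 1.0 - q1   # d/dphi log sigmoid(phi)
    dlogq_h0 = -q1        # d/dphi log (1 - sigmoid(phi))
    return -(post1 * dlogq_h1 + (1.0 - post1) * dlogq_h0)


def kl_grad_numeric(post1, phi, eps=1e-6):
    """Central finite difference of the KL divergence with respect to phi."""
    return (kl_p_q(post1, phi + eps) - kl_p_q(post1, phi - eps)) / (2.0 * eps)


post1, phi = 0.8, 0.3
analytic = kl_grad_analytic(post1, phi)
numeric = kl_grad_numeric(post1, phi)
```

For this toy model the analytic gradient simplifies to sigmoid(phi) minus the posterior probability, so it vanishes exactly when q matches the posterior. In the last line of Eq. (12) the exact posterior weights are replaced by the normalized importance weights.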


6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Figure 5 plot. x-axis: number of samples (log scale, 10^0 to 10^3); left y-axis: bias; right y-axis: std.-dev.; curves: bias (epoch 50), bias (last epoch), std. dev. (epoch 50), std. dev. (last epoch).]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K=5000 primary samples from a SBN/SBN 10-200-200 network trained on MNIST).
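The kind of bootstrapping analysis shown in Figure 5 can be sketched as follows. This is an illustration under assumptions and not the analysis code behind the figure: the pool of log importance weights is synthetic (drawn uniformly from [-10, 0]) rather than taken from a trained network, the number of resamples is arbitrary, and the function names are hypothetical.

```python
import math
import random


def log_mean_exp(log_w):
    """Numerically stable log of the mean of exp(log_w)."""
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w) / len(log_w))


def bootstrap_bias_std(log_w_pool, k, n_resamples, rng):
    """Bias and std. dev. of the k-sample estimate of log p(x),
    measured relative to the estimate that uses the whole pool."""
    reference = log_mean_exp(log_w_pool)
    estimates = []
    for _ in range(n_resamples):
        subset = [rng.choice(log_w_pool) for _ in range(k)]  # with replacement
        estimates.append(log_mean_exp(subset))
    mean = sum(estimates) / n_resamples
    var = sum((e - mean) ** 2 for e in estimates) / n_resamples
    return mean - reference, math.sqrt(var)


rng = random.Random(0)
# Synthetic stand-in for 5000 primary log importance weights.
pool = [rng.uniform(-10.0, 0.0) for _ in range(5000)]
results = {k: bootstrap_bias_std(pool, k, 500, rng) for k in (1, 10, 100)}
```

Each entry of `results` holds a (bias, standard deviation) pair for one sample size. Since the logarithm is concave, the small-sample estimates are expected to fall below the reference on average, with the gap and the spread shrinking as the number of samples grows.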


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on 8 different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space: The learning rate was set to a value in 0.001, 0.003, 0.01. For SBNs we use K=10 training samples and we tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150 or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200 or 10-100-300 hidden units. We trained NADE/NADE models with K=5 training samples and one hidden layer with 30, 50, 75, 100 or 200 units in it.
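The screened hyperparameter space can be written down as a small grid. The sketch below only enumerates the values listed above; that every learning rate was combined with every architecture is an assumption made for the illustration, and the dictionary keys are hypothetical names.

```python
from itertools import product

learning_rates = [0.001, 0.003, 0.01]

# SBN architectures as listed in the text (layer sizes, K = 10 samples).
sbn_architectures = [
    (10, 50), (10, 75), (10, 100), (10, 150), (10, 200),
    (5, 20, 100), (10, 50, 100), (10, 50, 150), (10, 50, 200), (10, 100, 300),
]

# NADE/NADE models use a single hidden layer (K = 5 samples).
nade_hidden_units = [30, 50, 75, 100, 200]

sbn_grid = [{"lr": lr, "layers": arch, "K": 10}
            for lr, arch in product(learning_rates, sbn_architectures)]
nade_grid = [{"lr": lr, "layers": (n,), "K": 5}
             for lr, n in product(learning_rates, nade_hidden_units)]
```

Under the full-grid assumption this gives 30 SBN configurations and 15 NADE configurations per dataset.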

Model         ADULT   CONNECT4   DNA     MUSHROOMS   NIPS-0-12   OCR-LETTERS   RCV1    WEB
FVSBN         13.17   12.39      83.64   10.27       276.88      39.30         49.84   29.35
NADE∗         13.19   11.99      84.81   9.81        273.08      27.22         46.66   28.39
EoNADE+       13.19   12.58      82.31   9.68        272.38      27.31         46.12   27.87
DARN3         13.19   11.91      81.04   9.55        274.68      28.17         46.10   28.83
RWS - SBN     13.65   12.68      90.63   9.90        272.54      29.99         46.16   28.18
  hidden units: 5-20-100, 10-50-150, 10-150, 10-50-150, 10-50-150, 10-100-300, 10-50-200, 10-50-300
RWS - NADE    13.16   11.68      84.26   9.71        271.11      26.43         46.09   27.92
  hidden units: 30, 50, 100, 50, 75, 100

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, Larochelle (2014). (NADE∗: 500 hidden units; EoNADE+: 1hl, 16 ord)


[Figure 2 panels. A: plot of est. LL (y-axis, −130 to −80) against the number of samples (x-axis, log scale, 10^0 to 10^2) for NADE-NADE 200, SBN-SBN 200-200-10 and SBN-SBN 200. B: SBN/SBN 10-100-200-300-400. C: NADE/NADE 250.]

Figure 2: A: Final log-likelihood estimate w.r.t. number of test samples used. B: Samples from the SBN/SBN 10-200-200 generative model. C: Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)

Results on binarized MNIST

Method                                 NLL (bound / est.)
RWS (SBN/SBN 10-100-200-300-400)       85.48
RWS (NADE/NADE 250)                    85.23
RWS (AR-SBN/SBN 500)†                  84.18
NADE (500 units, [1])                  88.35
EoNADE (2hl, 128 orderings, [2])       85.10
DARN (500 units, [3])                  84.13
RBM (500 units, CD3, [4])              105.5
RBM (500 units, CD25, [4])             86.34
DBN (500-2000, [5])                    86.22 / 84.55

Results on CalTech 101 Silhouettes

Method                                 NLL est.
RWS (SBN/SBN 10-50-100-300)            113.3
RWS (NADE/NADE 150)                    104.3
NADE (500 hidden units)                110.6
RBM (4000 hidden units, [6])           107.8

Table 2: Various RWS trained models in relation to previously published methods: [1] Larochelle and Murray (2011), [2] Uria, Murray and Larochelle (2014), [3] Gregor et al. (2014), [4] Salakhutdinov and Murray (2008), [5] Murray and Salakhutdinov (2009), [6] Cho et al. (2013). † Same model as the best performing in [3]: an AR-SBN with deterministic hidden variables between the observed and latent. All RWS NLL estimates on MNIST have confidence intervals of ≈ ±0.40.

−171.4. We confirmed that combining wake and sleep phase q-updates generally gives the best results by repeating this experiment with various other architectures. For the remainder of this paper we therefore train all models with combined wake and sleep phase q-updates.

Next we investigate the influence of the number of samples used during training. The results are visualized in Fig. 1 A. Although the results depend on the layer-distributions and on the depth and width of the architectures, we generally observe that the final estimated log-likelihood does not improve significantly when using more than 5 samples during training for NADE models, and using more than 25 samples for models with SBN layers. We can quantify the bias and the variance of the gradient estimator (3) using bootstrapping. While training a SBN/SBN 10-200-200 model with K = 100 training samples, we use K = 5,000 samples to get a high quality estimate of the gradient for a small but fixed set of 25 datapoints (the size of one mini-batch). By repeatedly resampling smaller sets of 1, 2, 5, · · · , 5000 samples with replacement, and by computing the gradient based on these, we get a measure for the bias and the variance of the small sample estimates relative to the high quality estimate. These results are visualized in Fig. 1 B. In Fig. 2 A we finally investigate the quality of the log-likelihood estimator (eqn. 1) when applied to the MNIST test set.



Published as a conference paper at ICLR 2015

Figure 3 CalTech 101 Silhouettes A Random selection of training data points B Random samplesfrom the SBNSBN 10-50-100-300 generative network C Random Samples from the NADE-150generative network (We show the probabilities from which each pixel is sampled)

hidden layer In Table 2 (left) we compare some of our best models to the state-of-the-art resultspublished on MNIST The deep SBNSBN 10-100-200-300-400 model was trained for 1000 epochswith K = 5 training samples and a learning rate of 0001 For fine-tuning we run additional 500epochs with a learning rate decay of 1005 and 100 training samples For comparison we also trainthe best performing model from the DARN paper (Gregor et al 2014) with RWS ie a singlelayer AR-SBN with 500 latent variables and a deterministic layer of hidden variables between theobserved and the latents We essentially obtain the same final testset log-likelihood For this shallownetwork we thus do not observe any improvement from using RWS

42 CALTECH 101 SILHOUETTES

We applied reweighted wake-sleep to the 28 times 28 pixel CalTech 101 Silhouettes dataset Thisdataset consists of 4100 examples in the training set 2264 examples in the validation set and2307 examples in the test set We trained various architectures on this dataset using the samehyperparameter as for the MNIST experiments Table 2 (right) summarizes our results Note thatour best SBNSBN model is a relatively deep network with 4 hidden layers (300-100-50-10) andreaches a estimated LL of -1169 on the test set Our best network a shallow NADENADE-150network reaches -1043 and improves over the previous state of the art (minus1078 a RBM with 4000hidden units by Cho et al (2013))

5 CONCLUSIONS

We introduced a novel training procedure for deep generative models consisting of multiple layers ofbinary latent variables It generalizes and improves over the wake-sleep algorithm providing a lowerbias and lower variance estimator of the log-likelihood gradient at the price of more samples from theinference network During training the weighted samples from the inference network decouple thelayers such that the learning gradients only propagate within the individual layers Our experimentsdemonstrate that a small number ofasymp 5 samples is typically sufficient to jointly train relatively deeparchitectures of at least 5 hidden layers without layerwise pretraining and without carefully tuninglearning rates The resulting models produce reasonable samples (by visual inspection) and theyapproach state-of-the-art performance in terms of log-likelihood on several discrete datasets

We found that even in the cases when the generative networks contain SBN layers only better resultscan be obtained with inference networks composed of more powerful autoregressive layers Thishowever comes at the price of reduced computational efficiency on eg GPUs as the individualvariables hi sim q(h|x) have to be sampled in sequence (even though the theoretical complexity isnot significantly worse compared to SBN layers)

We furthermore found that models with autoregressive layers in the generative network p typicallyproduce very good results But the best ones were always shallow with only a single hidden layerAt this point it is unclear if this is due to optimization problems

Acknowledgments We would like to thank Laurent Dinh Vincent Dumoulin and Li Yao forhelpful discussions and the developers of Theano (Bergstra et al 2010 Bastien et al 2012) fortheir powerful software We furthermore acknowledge CIFAR and Canada Research Chairs forfunding and Compute Canada and Calcul Quebec for providing computational resources

8

Published as a conference paper at ICLR 2015

REFERENCES

Bastien F Lamblin P Pascanu R Bergstra J Goodfellow I J Bergeron A Bouchard Nand Bengio Y (2012) Theano new features and speed improvements Deep Learning andUnsupervised Feature Learning NIPS 2012 Workshop

Bengio Y (2009) Learning deep architectures for AI Now Publishers

Bengio Y and Bengio S (2000) Modeling high-dimensional discrete data with multi-layer neuralnetworks In NIPSrsquo99 pages 400ndash406 MIT Press

Bengio Y Mesnil G Dauphin Y and Rifai S (2013) Better mixing via deep representationsIn Proceedings of the 30th International Conference on Machine Learning (ICMLrsquo13) ACM

Bergstra J Breuleux O Bastien F Lamblin P Pascanu R Desjardins G Turian J Warde-Farley D and Bengio Y (2010) Theano a CPU and GPU math expression compiler InProceedings of the Python for Scientific Computing Conference (SciPy) Oral Presentation

Boulanger-Lewandowski N Bengio Y and Vincent P (2012) Modeling temporal dependenciesin high-dimensional sequences Application to polyphonic music generation and transcription InICMLrsquo2012

Cho K Raiko T and Ilin A (2013) Enhanced gradient for training restricted boltzmann ma-chines Neural computation 25(3) 805ndash831

Dayan P and Hinton G E (1996) Varieties of helmholtz machine Neural Networks 9(8) 1385ndash1403

Dayan P Hinton G E Neal R M and Zemel R S (1995) The Helmholtz machine Neuralcomputation 7(5) 889ndash904

Frey B J (1998) Graphical models for machine learning and digital communication MIT Press

Gregor K Danihelka I Mnih A Blundell C and Wierstra D (2014) Deep autoregressivenetworks In Proceedings of the 31st International Conference on Machine Learning

Hinton G E Dayan P Frey B J and Neal R M (1995) The wake-sleep algorithm for unsu-pervised neural networks Science 268 1558ndash1161

Hinton G E Osindero S and Teh Y (2006) A fast learning algorithm for deep belief netsNeural Computation 18 1527ndash1554

Kingma D P and Welling M (2014) Auto-encoding variational bayes In Proceedings of theInternational Conference on Learning Representations (ICLR)

Larochelle H (2011) Binarized mnist dataset httpwwwcstorontoedu˜larochehpublicdatasetsbinarized_mnistbinarized_mnist_[train|valid|test]amat

Larochelle H and Murray I (2011) The Neural Autoregressive Distribution Estimator In Proceedings of theFourteenth International Conference on Artificial Intelligence and Statistics (AISTATSrsquo2011) volume 15 ofJMLR WampCP

Mnih A and Gregor K (2014) Neural variational inference and learning in belief networks In Proceedingsof the 31st International Conference on Machine Learning (ICML 2014) to appear

Murray B U I and Larochelle H (2014) A deep and tractable density estimator In ICMLrsquo2014

Murray I and Salakhutdinov R (2009) Evaluating probabilities under high-dimensional latent variable mod-els In NIPSrsquo08 volume 21 pages 1137ndash1144

Rezende D J Mohamed S and Wierstra D (2014) Stochastic backpropagation and approximate inferencein deep generative models In ICMLrsquo2014

Salakhutdinov R and Murray I (2008) On the quantitative analysis of deep belief networks In Proceedingsof the International Conference on Machine Learning volume 25

Saul L K Jaakkola T and Jordan M I (1996) Mean field theory for sigmoid belief networks Journal ofArtificial Intelligence Research 4 61ndash76

Tang Y and Salakhutdinov R (2013) Learning stochastic feedforward neural networks In NIPSrsquo2013

9

Published as a conference paper at ICLR 2015

6 SUPPLEMENT

61 GRADIENTS FOR p(xh)

part

partθLp(θx sim D) =

part

partθlog pθ(x) =

1

p(x)

part

partθ

sumh

p(xh)

=1

p(x)

sumh

p(xh)part

partθlog p(xh)

=1

p(x)

sumh

q (h |x) p(xh)q (h |x)

part

partθlog p(xh)

=1

p(x)E

hsimq(h |x)

[p(xh)

q (h |x)part

partθlog p(xh)

] 1sum

k ωk

Ksumk=1

ωkpart

partθlog p(xh(k)) (10)

with ωk =p(xh(k))

q(h(k) |x

) and h(k) sim q (h |x)

62 GRADIENTS FOR THE WAKE PHASE Q UPDATE

part

partφLq(φx sim D) =

part

partφ

sumh

p(xh)log qφ(h|x)

=1

p(x)E

hsimq(h |x)

[p(xh)

q (h |x)part

partφlog qφ(h|x)

] 1sum

k ωk

Ksumk=1

ωkpart

partφlog qφ(h

(k)|x) (11)

Note that we arrive at the same gradients when we set out to minimize the KL(p(middot|x) q(middot|x) for agiven datapoint x

part

partφKL(pθ(h|x) qφ(h|x)) =

part

partφ

sumh

pθ(h|x) logpθ(h|x)qφ(h|x)

= minussumh

pθ(h|x)part

partφlog qφ(h|x)

minus 1sumk ωk

Ksumk=1

ωkpart

partφlog qφ(h

(k)|x) (12)

10

Published as a conference paper at ICLR 2015

6.3 ADDITIONAL EXPERIMENTAL RESULTS

6.3.1 LEARNING CURVES FOR MNIST EXPERIMENTS

Figure 4: Learning curves for various MNIST experiments.

6.3.2 BOOTSTRAPPING BASED log(p(x)) BIAS/VARIANCE ANALYSIS

Here we show the bias/variance analysis from Fig. 1 B (main paper) applied to the estimated log(p(x)) w.r.t. the number of test samples.

[Plot: x-axis "samples" (log scale, 10^0 to 10^3); left y-axis "Bias"; right y-axis "std-dev"; curves: bias (epoch 50), bias (last epoch), std dev (epoch 50), std dev (last epoch).]

Figure 5: Bias and standard deviation of the low-sample estimated log(p(x)) (bootstrapping with K=5000 primary samples from an SBN/SBN 10-200-200 network trained on MNIST).
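
The bootstrapping procedure behind Figure 5 might look roughly like the sketch below: draw K=5000 primary importance weights once, then repeatedly resample n of them with replacement and recompute the estimate log((1/n) Σ ω_k). The log-weights here are synthetic (normally distributed) stand-ins, since the trained network is not available; the function names, the number of bootstrap repetitions, and the use of the full-sample estimate as the reference value are all assumptions.

```python
import numpy as np


def log_mean_exp(a, axis=-1):
    """Numerically stable log(mean(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))


def bootstrap_bias_std(log_w, sizes, n_boot=2000, seed=0):
    """Bias and standard deviation of the n-sample log p(x) estimate.

    The bias is measured relative to the estimate that uses all primary
    samples, which serves as a stand-in for the true log p(x).
    """
    rng = np.random.default_rng(seed)
    reference = log_mean_exp(log_w)
    stats = {}
    for n in sizes:
        # Resample n primary weights with replacement, n_boot times.
        idx = rng.integers(0, log_w.shape[0], size=(n_boot, n))
        estimates = log_mean_exp(log_w[idx], axis=1)
        stats[n] = (float(estimates.mean() - reference), float(estimates.std()))
    return stats


# Synthetic log importance weights for a single datapoint (illustrative only).
rng = np.random.default_rng(1)
log_w = rng.normal(loc=-90.0, scale=1.5, size=5000)

stats = bootstrap_bias_std(log_w, sizes=[1, 10, 100, 1000])
for n, (bias, std) in stats.items():
    print(f"n={n:5d}  bias={bias:+.3f}  std={std:.3f}")
```

By Jensen's inequality the low-sample estimate underestimates log p(x) on average, so the bias is negative for small n and shrinks toward zero as n grows.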


6.3.3 UCI BINARY DATASETS

We performed a series of experiments on eight different binary datasets from the UCI database.

For each dataset we screened a limited hyperparameter space. The learning rate was set to a value in {0.001, 0.003, 0.01}. For SBNs we used K=10 training samples and tried the following architectures: two hidden layers with 10-50, 10-75, 10-100, 10-150, or 10-200 hidden units, and three hidden layers with 5-20-100, 10-50-100, 10-50-150, 10-50-200, or 10-100-300 hidden units. We trained NADE/NADE models with K=5 training samples and one hidden layer with 30, 50, 75, 100, or 200 units.
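
The screened search space can be enumerated as in the sketch below, reading the learning rates as 0.001, 0.003, and 0.01. The dictionary keys and variable names are hypothetical and do not come from the authors' code; architectures are kept as the strings given in the text.

```python
from itertools import product

learning_rates = [0.001, 0.003, 0.01]

# Architectures exactly as listed in the text (two- and three-layer SBNs).
sbn_architectures = [
    "10-50", "10-75", "10-100", "10-150", "10-200",
    "5-20-100", "10-50-100", "10-50-150", "10-50-200", "10-100-300",
]
nade_hidden_units = [30, 50, 75, 100, 200]

# One configuration per (learning rate, architecture) pair.
sbn_grid = [
    {"model": "SBN", "K": 10, "learning_rate": lr, "hidden": arch}
    for lr, arch in product(learning_rates, sbn_architectures)
]
nade_grid = [
    {"model": "NADE", "K": 5, "learning_rate": lr, "hidden": units}
    for lr, units in product(learning_rates, nade_hidden_units)
]

print(len(sbn_grid), len(nade_grid))
```

Under this reading there are 30 SBN and 15 NADE configurations per dataset.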

Model         ADULT     CONNECT4   DNA     MUSHROOMS  NIPS-0-12  OCR-LETTERS  RCV1       WEB
FVSBN         13.17     12.39      83.64   10.27      276.88     39.30        49.84      29.35
NADE*         13.19     11.99      84.81   9.81       273.08     27.22        46.66      28.39
EoNADE+       13.19     12.58      82.31   9.68       272.38     27.31        46.12      27.87
DARN3         13.19     11.91      81.04   9.55       274.68     28.17        46.10      28.83
RWS - SBN     13.65     12.68      90.63   9.90       272.54     29.99        46.16      28.18
hidden units  5-20-100  10-50-150  10-150  10-50-150  10-50-150  10-100-300   10-50-200  10-50-300
RWS - NADE    13.16     11.68      84.26   9.71       271.11     26.43        46.09      27.92
hidden units  30, 50, 100, 50, 75, 100

Table 3: Results on various binary datasets from the UCI repository. The top two rows quote the baseline results from Larochelle & Murray (2011); the third row shows the baseline results taken from Uria, Murray, and Larochelle (2014). (NADE*: 500 hidden units; EoNADE+: 1hl, 16 ord)
