
Taming VAEs

Danilo J. Rezende ∗ Fabio Viola ∗

{danilor, fviola}@google.com
DeepMind, London, UK

Abstract

In spite of remarkable progress in deep latent variable generative modeling, training still remains a challenge due to a combination of optimization and generalization issues. In practice, a combination of heuristic algorithms (such as hand-crafted annealing of KL-terms) is often used in order to achieve the desired results, but such solutions are not robust to changes in model architecture or dataset. The best settings can often vary dramatically from one problem to another, which requires doing expensive parameter sweeps for each new case. Here we develop on the idea of training VAEs with additional constraints as a way to control their behaviour. We first present a detailed theoretical analysis of constrained VAEs, expanding our understanding of how these models work. We then introduce and analyze a practical algorithm termed Generalized ELBO with Constrained Optimization, GECO. The main advantage of GECO for the machine learning practitioner is a more intuitive, yet principled, process of tuning the loss. This involves defining a set of constraints, which typically have an explicit relation to the desired model performance, in contrast to tweaking abstract hyper-parameters which implicitly affect the model behavior. Encouraging experimental results in several standard datasets indicate that GECO is a very robust and effective tool to balance reconstruction and compression constraints.

1 Introduction

Deep variational auto-encoders (VAEs) have made substantial progress in the last few years on modelling complex data such as images [1, 2, 3, 4, 5, 6, 7], speech synthesis [8, 9], molecular discovery [10, 11, 12], robotics [13, 14, 15] and 3d-scene understanding [16, 17, 18].

VAEs are latent-variable generative models that define a joint density p(x, z) between some observed data x ∈ R^{d_x} and unobserved or latent variables z ∈ R^{d_z}. The most popular method for training these models is through stochastic amortized variational approximations [2, 1], which use a variational posterior (also referred to as encoder), q(z|x), to construct the evidence lower-bound (ELBO) objective function.

One prominent limitation of vanilla VAEs is the use of simple diagonal Gaussian posterior approximations. It has been observed empirically that VAEs with simple posterior models have a tendency to ignore some of the latent-variables (latent-collapse) [19, 20] and produce blurred reconstructions [20, 4]. As a result, several mechanisms have been proposed to increase the expressiveness of the variational posterior density [21, 22, 23, 24, 25, 3, 26, 5] but it still remains a challenge to train complex encoders due to a combination of optimization and generalization issues [27, 28]. Some of these issues have been partially addressed in the literature through heuristics, such as hand-crafted annealing of the KL-term [16, 4, 5], injection of uniform noise to the pixels [29] and reduction of the bit-depth of the data. A case of interest are VAEs with information bottleneck constraints such as β-VAEs [30]. While a body of work on information bottleneck has primarily focused on tools to analyze

∗Both authors contributed equally to this work.

Preprint. Work in progress.

arXiv:1810.00597v1 [stat.ML] 1 Oct 2018


models [31, 32, 33], it has also been shown that VAEs with various information bottleneck constraints can trade off reconstruction accuracy for better-factorized latent representations [30, 34, 35], a highly desired property in many real-world applications as well as model analysis. Other types of constraints have also been used to improve sample quality and reduce latent-collapse [36], but while using supplementary constraints in VAEs has produced encouraging empirical results, we find that there have been fewer efforts towards general tools and theoretical analysis for constrained VAEs at scale.

Here, we introduce a practical mechanism for controlling the balance between compression (KL minimization) and other constraints we wish to enforce in our model (not limited to, but including reconstruction error) termed Generalized ELBO with Constrained Optimization, GECO. GECO enables an intuitive, yet principled, work-flow for tuning loss functions. This involves the definition of a set of constraints, which typically have an explicit relation to the desired model performance, in contrast to tweaking abstract information-theoretic hyper-parameters which implicitly affect the model behavior. In spite of its simplicity, our experiments support the view that GECO is an empowering tool and we argue that it has enabled us to have an unprecedented level of control over the properties and robustness of complex models such as ConvDraw [5, 16] and VAEs with NVP posteriors [28].

The key contributions of this paper are: (a) With a focus on ELBOs with supplementary constraints, we present a detailed theoretical analysis of the behaviour of high-capacity VAEs with and without KL and Lipschitz constraints, advancing our understanding of VAEs on multiple fronts: (i) We demonstrate that the posterior density of unconstrained VAEs will converge to an equiprobable partition of the latent-space; (ii) We provide a connection between β-VAEs and spectral clustering; (iii) Drawing from statistical mechanics we study phase-transitions in the reconstruction fixed-points of β-VAEs; (b) Equipped with a better understanding of the impact of constraints in VAEs, we design the GECO algorithm.

The remainder of the paper is structured as follows: in Section 3.1 we study the behavior of unconstrained VAEs at convergence; in Section 3.2 we draw links between β-VAEs and spectral clustering, and we study phase-transitions in β-VAEs; in Section 3.4 we present the GECO algorithm; finally, in Section 4 we illustrate our theoretical analysis on mixture data and we assess the behavior of GECO on several larger datasets and for different types of constraints.

2 Related work

Similarly to [36] we study VAEs with supplementary constraints in addition to the ELBO objective function and we study its behaviour in the information plane as in [31]. Our theoretical analysis extends the analysis performed in [36, 31] with an emphasis on the properties of the learned posterior densities in high-capacity VAEs. Our analysis also extends the idea of information bottleneck and geometrical clustering from [37] to the case where latent variables are continuous instead of discrete. A further extension of the analysis including the incorporation of Lipschitz constraints is also available in Appendix B. Complementary to the analysis from [38] with (semi-)affine assumptions on the VAE's decoder, we focus on non-linear aspects of high-capacity constrained VAEs.

GECO is a simple mechanism for approximately optimizing VAEs under different types of constraint. It is inspired by the empirical observation that some high-capacity VAEs such as ConvDRAW [5, 16] may reach reconstruction errors far below the perceptually distinguishable level, at the expense of a weak compression rate (large KL). GECO is designed to be an easy-to-implement and easy-to-tune approximation of more complex stochastic constrained optimization techniques [39, 40] and to take advantage of the reparametrization trick in VAEs [1, 2]. At a high level, GECO is similar to information constraints studied in [31, 41, 42, 43]. However, we argue that in many practical cases it is much easier to decide on useful constraints in the data-domain, such as a desired reconstruction accuracy, rather than information constraints. Additionally, the types of information constraints we can impose on VAEs are restricted to a few combinations of KL-divergences (e.g. mutual information between latents and data), whereas there are many easily available ways of meaningfully constraining reconstructions (e.g. bounding reconstruction errors, bounding the ability of a classifier to correctly classify reconstructions or bounding reconstruction errors in some feature space).

Some widespread practices for modelling images such as injecting uniform noise to the pixels and reducing the bit-depth of the color channels (e.g. [29, 44, 5, 28]) can also be mathematically interpreted as constraints which bound the values of the likelihood from above. For instance, training a model with density p(x) by injecting uniform noise to the samples, x → x + bε with ε ∼ Unif(−1/2, 1/2),


is a way of maximizing the likelihood under the constraint p(x) ≤ 1/b. With GECO, there is no need to resort to these heuristics.
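To see why such a bound holds, here is a one-dimensional sketch in our own notation (not taken from the original derivation): conditionally on the clean sample x, the noisy observation x̃ = x + bε has a uniform density of height 1/b on an interval of width b centered at x, i.e. p(x̃|x) = (1/b) I_{|x̃ − x| ≤ b/2} ≤ 1/b. Marginalizing over the data distribution preserves the bound, p(x̃) = E_x[p(x̃|x)] ≤ 1/b, so a model fit by maximum likelihood to the dequantized samples is implicitly targeting a density that never exceeds 1/b per dimension.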

3 Methods

VAEs are smooth parametric latent variable models of the form p(x, z) = p(x|z)π(z) and are trained typically by maximizing the ELBO variational objective, F, using a parametric variational posterior q(z|x),

F = E_{ρ(x)}[ E_{q(z|x)}[ln p(x|z)] − KL[q(z|x); π(z)] ],   (1)

where the prior is taken to be a unit Gaussian π(z) = N(0, I) and the empirical data density is represented as ρ(x) = (1/n) ∑_{i}^{n} δ(x − x_i), where x_i are the training data-points.
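As a concrete reference point for Equation (1), the following minimal NumPy sketch (our own illustrative code, not the authors' implementation; all function names are ours) computes a single-sample Monte Carlo estimate of the ELBO for a diagonal-Gaussian encoder and the Gaussian decoder introduced below:

import numpy as np

def diag_gauss_kl(mu, logvar):
    # KL[N(mu, diag(exp(logvar))) || N(0, I)], computed per data-point.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def elbo_estimate(x, mu_z, logvar_z, decode, sigma=1.0, seed=0):
    # Single-sample Monte Carlo estimate of Equation (1) for a batch x, with
    # p(x|z) = N(x | g(z), sigma^2 I) and q(z|x) = N(mu_z, diag(exp(logvar_z))).
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(mu_z.shape)
    z = mu_z + np.exp(0.5 * logvar_z) * eps                  # reparametrized sample
    g = decode(z)                                            # decoder mean g(z)
    d = x.shape[-1]
    log_lik = (-0.5 * np.sum((x - g) ** 2, axis=-1) / sigma**2
               - 0.5 * d * np.log(2.0 * np.pi * sigma**2))
    return np.mean(log_lik - diag_gauss_kl(mu_z, logvar_z))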

We focus on VAEs with a Gaussian decoder density of the form p(x|z) = N(x | g(z), σ²I), where σ is a global parameter and g(z) is referred to as a decoder or generator. Restricting to decoder densities where the components x_i of x are conditionally independent given a latent vector z eliminates a family of solutions in the infinite-capacity limit where the decoder density p(x|z) ignores the latent variables, i.e. p(x|z) = p(x) ≈ ρ(x), as observed in [31]. This restriction is important because we are interested in the behaviour of the latent variables in this paper.

In contrast to ELBO maximization, we consider a constrained optimization problem for variational auto-encoders where we seek to minimize the KL-divergence, KL[q(z|x); π(z)], under a set of expectation inequality constraints of the form E_{ρ(x)q(z|x)}[C(x, g(z))] ≤ 0, where C(x, g(z)) ∈ R^L. We refer to this type of constraint as a reconstruction constraint since it is based on some comparison between a data-point x and its reconstruction. We can solve this problem by using a standard method of Lagrange multipliers, where we introduce the Lagrange multipliers λ ∈ R^L and optimize the Lagrangian L_λ via a min-max optimization scheme [45],

L_λ = E_{ρ(x)}[KL[q(z|x); π(z)]] + λ^T E_{ρ(x)q(z|x)}[C(x, g(z))].   (2)

A general algorithm for finding the extreme points of the ELBO F and of the Lagrangian L_λ is provided in Proposition 1, where we extend the well-known derivations from rate-distortion theory [46, 47, 48, 32] to the case with reconstruction constraints.

Proposition 1. (Fixed-point equations) The extrema of the ELBO F with respect to the decoder g(z) and encoder q(z|x) are solutions of the fixed-point Equations (3) and (4), and for the Lagrangian L_λ, of Equations (5) and (6), respectively:

q^t_F(z|x) ∝ π(z) exp(−‖x − g^{t−1}_F(z)‖² / (2σ²)),   (3)

g^t_F(z) = ∑_i w^t_i(z) x_i,   (4)

q^t_{L_λ}(z|x) ∝ π(z) e^{−H(x,z)},   (5)

0 = ∑_i q^t_{L_λ}(z|x_i) λ^T G(x_i, g^t_{L_λ}(z)),   (6)

where H(x, z) = λ^T C(x, g^{t−1}_{L_λ}(z)), w^t_i(z) = q^t_F(z|x_i) / ∑_j q^t_F(z|x_j) and G(x, g(z)) = ∂C(x, g(z)) / ∂g(z). See proof in Appendix C.1.
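To make Equations (3) and (4) concrete, the following toy NumPy sketch (our own illustration, not the code behind the paper's mixture experiments) iterates the two fixed-point updates on one-dimensional data with a discretized latent space:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.1, 20), rng.normal(2.0, 0.1, 20)])  # toy 1-D data
z = np.linspace(-4.0, 4.0, 400)                      # discretized latent space
log_prior = -0.5 * z**2                              # unnormalized log N(0, 1) on the grid
sigma2 = 0.25
g = rng.normal(0.0, 1.0, z.shape)                    # random initial decoder g(z)

for t in range(100):
    # Eq. (3): q(z|x_i) proportional to pi(z) exp(-||x_i - g(z)||^2 / (2 sigma^2))
    log_q = log_prior[None, :] - (x[:, None] - g[None, :])**2 / (2.0 * sigma2)
    log_q -= log_q.max(axis=1, keepdims=True)        # for numerical stability
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)
    # Eq. (4): g(z) = sum_i w_i(z) x_i with w_i(z) = q(z|x_i) / sum_j q(z|x_j)
    w = q / (q.sum(axis=0, keepdims=True) + 1e-12)
    g = (w * x[:, None]).sum(axis=0)
# At convergence each q(z|x_i) concentrates on a region of the grid and g(z) becomes a
# convex combination of the data, mirroring the partition behaviour analyzed in Section 3.1.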

In the following Sections 3.1 and 3.2 we analyze properties of Equations (3) to (6) to gain a better understanding of the behaviour of VAEs. In Appendix B we also provide additional analysis of the effect of local and global Lipschitz constraints for the interested reader.

3.1 Unconstrained VAEs

In the unconstrained case, where we optimize the ELBO, we can derive a few interesting conclusions from Equations (3) and (4): (i) The globally optimal decoder g(z) is a convex linear combination of the training data of the form g(z) = ∑_i w_i(z) x_i; (ii) If we optimize the standard deviation σ jointly with the decoder and encoder, it will converge to zero; (iii) The posterior density q(z|x) will converge to a distribution with support corresponding to one element of a partition of the latent space. Moreover, the set of supports of the posterior density formed by each data-point constitutes a partition of the latent space that is equiprobable under the prior. These results are formalized in Proposition 2, where we demonstrate that a solution satisfying all these properties is a fixed point of Equations (3) and (4).

Proposition 2. (High-capacity VAEs learn an equiprobable partition of the latent space): Let q(z|x_i) = π(z) I_{z∈Ω_i} / π_i, where π_i = E_π[I_{z∈Ω_i}] is a normalization constant, be the variational posterior density, evaluated at a training point x_i. This density is equal to the restriction of the prior π(z) to a limited support Ω_i ⊂ R^{d_z} in the latent space. q(z|x_i) is a fixed-point of Equations (3) and (4) for any set of volumes Ω_i that form a partition of the latent space. Furthermore, the highest ELBO is achieved when the partition is equiprobable under the prior. That is, E_π[I_{z∈Ω_i}] = E_π[I_{z∈Ω_j}] = 1/n, R^{d_z} = ∪_i Ω_i and Ω_i ∩ Ω_j = ∅ if i ≠ j. See proof in Appendix C.2.

The fact that the standard deviation will converge to 0 results in a numerically ill-posed problem, as observed in [49]. Nevertheless, the fixed-point equations still admit a stationary solution where the VAE becomes a mixture of Dirac-delta densities centered at the training data-points.

It is known empirically that low-capacity VAEs tend to produce blurred reconstructions and samples [36, 30]. Contrary to a popularly held belief (e.g. [50]), this phenomenon is not caused by using a Gaussian likelihood alone: as observed in [36], it is primarily caused by a sub-optimal variational posterior. The fixed-point equation (4) provides a mathematical explanation for this phenomenon, generalizing the result from [36]: the optimal decoder g(z) for a given encoder q(z|x) is a convex linear combination of the training data. If the VAE's encoder cannot accurately distinguish between multiple training data-points, the resulting weights w_i(z) in the VAE's decoder will be spread across the same data-points, resulting in a blurred reconstruction. This is formalized in Proposition 3.

Proposition 3. (Blurred reconstructions) If the supports Ω_i from Proposition 2 are overlapping (i.e. Ω_i ∩ Ω_j ≠ ∅ for i ≠ j), then the optimal reconstruction at a latent point z for a fixed encoder q(z|x_i) = π(z) I_{z∈Ω_i} / π_i will be the average of all data-points mapping to any of the overlapping basis elements, weighted by the inverse prior probability of the respective basis element. See proof in Appendix C.3.

Another striking conclusion we can derive from Propositions 1 and 3 is that the support of the optimal decoder as a function of the latent vector will be concentrated in the support of the marginal posterior. In fact, if we revisit the proof of Proposition 1 in Appendix C.1 when there are regions in the latent space where q(z|x) ≈ 0, we notice that the decoder is completely unconstrained by the ELBO in these regions. We refer to this as the "holes problem" in VAEs. This problem is commonly encountered when using simple Gaussian posteriors (e.g. [3]).

3.2 High-capacity β-VAEs and spectral methods

The β-coefficient in a β-VAE [30] can be interpreted as the Lagrange multiplier of an inequality constraint imposing either a restriction on the value of the KL-term or a constraint on the reconstruction error [34, 31]. When using the reconstruction error constraint C(x, g(z)) = ‖x − g(z)‖² − κ² in Equation (2), the Lagrange multiplier λ is related to the β from [30] by λ = 1/β.
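Spelling this correspondence out (a one-line rearrangement in our own notation): with this choice of C, Equation (2) becomes L_λ = E_{ρ(x)}[KL[q(z|x); π(z)]] + λ E_{ρ(x)q(z|x)}[‖x − g(z)‖² − κ²]. For any fixed λ > 0, dividing by λ leaves the stationary points with respect to the model parameters unchanged, giving E_{ρ(x)q(z|x)}[‖x − g(z)‖²] + (1/λ) E_{ρ(x)}[KL[q(z|x); π(z)]] − κ², which is, up to the additive constant κ² and the σ² scaling of the Gaussian log-likelihood, the β-VAE objective with β = 1/λ.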

While VAEs with simple linear decoders can be related to a form of robust PCA [38], we demonstrate a relation between β-VAEs with high-capacity decoders and kernel methods such as spectral clustering. More precisely, we show in Proposition 4 that the fixed-point equations of a high-capacity decoder expressed in a particular orthogonal basis are analogous to the reconstruction fixed-point equations used in spectral clustering.

In the literature of spectral clustering and kernel PCA it is known that reconstructions based on a Gaussian Gram matrix may suffer phase-transitions (sudden changes in eigen-values or reconstruction fixed-points) as a function of the scale parameter [51, 52, 53, 54, 55]. Making a bridge between VAEs and these methods allows us to investigate phase-transitions in high-capacity β-VAEs, where the expected reconstruction error is treated as an order parameter u(β) = E[‖x − g(z)‖²]. In this case, phase transitions will occur at critical temperature points β_c, which can be detected by analyzing regions of high curvature (high absolute second-order derivative |∂²u(β)/∂β²|). As in spectral clustering, β-VAE phase transitions correspond to the merging of neighboring data-clusters at different spatial scales. This is illustrated in the experiment shown in Figure 2, where we look at the merging of reconstruction fixed-points as we increase β. Interestingly, the phase-transitions that we observe are similar to what is known as first-order phase transitions in statistical mechanics [56], but further analysis is necessary to clarify this connection.


Figure 1: Illustration of the "blurred reconstructions" and the "holes" problems. Left: Latent space with a posterior with support in a tiling Ω_i, where each tile Ω_i represents the support of the posterior for the data-point x_i. Right: Data space. In the region of the latent space where the posteriors of the data-points x_1 and x_2 overlap, Ω_1 ∩ Ω_2, the optimal reconstruction x is a weighted average of the corresponding data-points, resulting in a blurred sample. In a region of low density under the marginal posterior, a "hole" (represented by the black area in the figure), the optimal reconstructions from these regions x are unconstrained by the ELBO objective function.

Proposition 4. (High-capacity β-VAE and spectral methods) Let φ_a : R^{d_z} → {0, 1} be an orthogonal basis in the latent space. If we express the posterior density and generator using this basis respectively as q(z|x_i) = π(z) ∑_a m_{ia} φ_a(z) and g(z) = ψ^T φ(z), where m_{ia} is a matrix with positive entries satisfying the constraint ∑_a m_{ia} π_a = 1 with π_a = E_{π(z)}[φ_a(z)], then the fixed-point equations that maximize the ELBO with respect to ψ are convergent under the appropriate initial conditions and are equivalent to computing the pre-images (reconstructions) of a Kernel-PCA model with a normalized Gaussian kernel with scale parameter √β. See proof in Appendix C.4.

3.3 Equipartition of energy in β-VAEs

In statistical mechanics there is an important result, known as the equipartition of energy theorem. It states that, for a system in thermodynamic equilibrium, energy is shared equally amongst all accessible degrees of freedom at a fixed energy level [57]. We demonstrate that a similar theorem holds for VAEs in Proposition 5. In Proposition 2 we have seen that the optimal posterior for each data point in a high-capacity VAE has its support in the elements of a partition of the latent space and that this partition is equiprobable under the prior. We can generalize this result to β-VAEs by noticing that, as a result of the existence of reconstruction fixed points from Proposition 4, there will be regions in the latent space where the Hamiltonian H(x, z) is approximately constant for a given x. In these regions, the posterior will be proportional to the prior and they will act as a discrete partition of the latent space, as in Proposition 2. The concept that VAE encoders learn a tiling of the latent space, each tile corresponding to a different level of the function H(x, z), can be a guiding principle to evaluate generative models as well as to construct more meaningful constraints.

Proposition 5. (Equipartition of Energy for high-capacity VAEs). Let H(x, z) be the "Hamiltonian" function from Proposition 1 for a VAE trained on a dataset with n data-points. For a given data-point x ∈ R^{d_x}, latent point z ∈ R^{d_z} and precision ε > 0, let Ω(x, z_0) = {z′ : |H(x, z′) − H(x, z_0)| ≤ ε} ⊆ R^{d_z}. That is, Ω(x, z_0) is the set of latent points where the Hamiltonian is approximately constant. As we vary x and z_0, each set Ω(x, z_0) will be one element of a discrete set of disjoint sets, which we enumerate as Ω_a. The encoder density q(z|x_i) will converge to a mixture of the restrictions of the prior to the basis elements Ω_a. Moreover, the probability γ_a of a sample from the prior falling in the partition element Ω_a is a solution of ∑_i e^{−H_{ia}/β} / (∑_b e^{−H_{ib}/β} γ_b) = n, where H_{ia} is the value of H(x_i, z) for z ∈ Ω_a. See proof in Appendix C.6.

Figure 2: Effect of β on the reconstruction fixed-points and phase-transitions. Top images: Grey curves indicate the trajectories of the vectors ψ. Red points are the fixed-points. Blue points are the data points. Bottom left: Expected reconstruction error as a function of β. Vertical grey lines indicate the detected critical temperatures β^k_c. Bottom right: Expected second-order derivative of the reconstruction error as a function of β. At critical temperatures reconstruction fixed-points will merge with each other, resulting in sudden changes in the slope of the reconstruction error with respect to the temperature β; these points correspond to spikes in the second-order derivatives. For the analysis, we sorted the local maxima according to their height and restricted the analysis to the top-3 points, β^{k=1,2,3}_c. Details of this simulation are explained in Appendix A.

3.4 The GECO algorithm for VAEs

To derive GECO we start from the augmented Lagrangian defined in Equation (2) for a VAE with a decoder parametrized by a vector θ and an encoder density parametrized by a vector η. Optimization of the loss involves joint minimization w.r.t. θ and η, and maximization w.r.t. the Lagrange multipliers λ. The parameters θ and η are optimized by directly following the negative gradients of Equation (2). The Lagrange multipliers λ are optimized following a moving average of the constraint vector C(x, g(z)). In order to avoid backpropagation through the moving averages, we only apply the gradients to the last step of the moving average. This procedure is detailed in Algorithm 1.

Figure 3 captures a representative example of the typical behavior of GECO: early on in the optimization the solver quickly moves the model parameters into a regime of valid solutions, i.e. parameter configurations satisfying the constraints, and then minimizes the ELBO while preserving the validity of the current solution. We refer the reader to Figure 3's caption for a more detailed description of the optimization phases.

The main advantage of GECO for the machine learning practitioner is that the process of tuning the loss involves the definition of a set of constraints, which typically have a direct relation to the desired model performance, and can be set in the model output space. This is clearly a very different work-flow compared to tweaking abstract hyper-parameters which implicitly affect the model performance. For example, if we were to work in the β-VAE setting, we would observe this transition: NLL + β KL =⇒ KL + β RE_κ, where RE_κ is the reconstruction error constraint as defined in Table 1, and κ is a tolerance hyper-parameter. On the lhs β is a hyper-parameter tuning the relative weight of the negative log-likelihood (NLL) and KL terms, affecting model reconstructions in a non-smooth way as shown in Figure 2 and, as discussed in Section 3.2, implicitly defining a constraint on the VAE reconstruction error. On the rhs β is a Lagrange multiplier, whose final value is automatically tuned during optimization as a function of the κ tolerance hyper-parameter, which the user can define in pixel space, explicitly specifying the required reconstruction performance of the model.


Figure 3: Trajectory in the information plane induced by GECO during training. This plot shows a typical trajectory in the NLL/KL plane for a model trained using GECO with a RE constraint, alongside the corresponding values of the equivalent β and pixel reconstruction errors; note that iteration information is consistently encoded using color in the three plots. At the beginning of training, it < 10^4, the reconstruction constraint dominates optimization, with β < 1 implicitly amplifying the NLL term in the ELBO. When the inequality constraint is met, i.e. the reconstruction error curve crosses the κ threshold horizontal line, β slowly starts changing, modulated by the moving average, until at it = 10^4 the β curve flexes and β starts growing. This specific example is for a conditional ConvDraw model trained on MNIST-rotate.

Result: Learned parameters θ, η and Lagrange multipliers λ
Initialize t = 0; Initialize λ = 1;
while is training do
  Read current data batch x;
  Sample from variational posterior z ∼ q(z|x);
  Compute the batch average of the constraint C^t ← C(x^t, g(z^t));
  if t == 0 then
    Initialize the constraint moving average C^0_ma ← C^0;
  else
    C^t_ma ← α C^{t−1}_ma + (1 − α) C^t;
  end
  C^t ← C^t + StopGradient(C^t_ma − C^t);
  Compute gradients G_θ ← ∂L_λ/∂θ and G_η ← ∂L_λ/∂η;
  Update parameters as Δ_{θ,η} ∝ −G_{θ,η} and Lagrange multiplier(s) Δ_{log(λ)} ∝ C^t;
  t ← t + 1;
end

Algorithm 1: GECO. Pseudo-code for joint optimization of VAE parameters and Lagrange multipliers. The update of the Lagrange multipliers is of the form λ^t ← λ^{t−1} exp(∝ C^t); this is to enforce positivity of λ, a necessary condition [45] for tackling the inequality constraints. The parameter α controls the slowness of the moving average, which provides an approximation to the expectation of the constraint.
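For illustration, one possible translation of Algorithm 1 into a PyTorch-style training step might look as follows (our own sketch, not the authors' implementation; the encoder, decoder, optimizer, constraint function and the learning rate for log λ are all assumptions of this example):

import torch

def geco_step(x, encoder, decoder, constraint, optimizer, state,
              alpha=0.99, lambda_lr=0.1):
    # One joint update of the VAE parameters and the Lagrange multiplier, mirroring
    # Algorithm 1. `state` holds log lambda and the constraint moving average, e.g.
    # state = {'log_lambda': torch.tensor(0.0), 'C_ma': None}.
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparametrized sample
    x_rec = decoder(z)

    kl = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar, dim=1).mean()
    C = constraint(x, x_rec).mean()                               # batch average of C(x, g(z))

    # Moving average of the constraint; gradients only flow through the last step.
    if state['C_ma'] is None:
        state['C_ma'] = C.detach()
    else:
        state['C_ma'] = alpha * state['C_ma'] + (1.0 - alpha) * C.detach()
    C_t = C + (state['C_ma'] - C).detach()

    lam = state['log_lambda'].exp()
    loss = kl + lam * C_t                                         # Lagrangian of Equation (2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Multiplicative multiplier update: delta log(lambda) proportional to C^t.
    with torch.no_grad():
        state['log_lambda'] += lambda_lr * C_t.detach()
    return loss.item(), C.item(), lam.item()

Here the multiplier is parametrized through log λ, so the exponentiated update λ^t ← λ^{t−1} exp(lambda_lr · C^t) keeps it positive, matching the update rule described in the caption of Algorithm 1.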

4 Experiments

4.1 Constrained optimization for large scale models and larger datasets

We demonstrate empirically that GECO provides an easy and robust way of balancing compression versus different types of reconstruction. We conduct experiments using standard implementations of ConvDraw [5] (both in the conditional and unconditional case) and a VAE+NVP model that uses a convolutional decoder similar to [28] and a fully connected conditional NVP [58] model as the encoder density, so that we can approximate high-capacity encoders.


Figure 4: Information plane analysis of Conditional ConvDraw, ConvDraw and VAE+NVP with and without RE constraints. Each plot shows the final reconstruction / compression trade-off achieved during training for the same ConvDraw and VAE+NVP models using ELBO, GECO and ELBO with a hand-annealed β, respectively. For GECO we report results for the following reconstruction thresholds κ ∈ {0.06, 0.08, 0.1, 0.125, 0.175}, and visually tie them together by connecting them via a line colour-coded by the dataset instance they refer to. For the hand-annealed β we use the same annealing scheme reported in [16]. Results are shown for a variety of conditional and unconditional datasets, providing evidence of the consistency of the behavior of GECO across different domains.

In Table 1 we show a few examples of reconstruction constraints that we have considered in this study. To inspect the performance of GECO we look specifically at the behavior of trained models in the information plane (negative reconstruction likelihood vs KL) on various datasets, with and without the RE constraint. All models were trained using Adam [59], with a learning rate of 1e-5 for ConvDraw and 1e-6 for the VAE+NVP, and a constraint moving average parameter α = 0.99.

Name                                         C(x, g(z))
Reconstruction Error (RE)                    ‖x − g(z)‖² − κ²
Feature Reconstruction Error (FRE)           ‖f(x) − f(g(z))‖² − κ²
Classification Accuracy (CLA)                l(x)^T c(g(z)) − κ
Patch Normalized Cross-correlation (pNCC)    ∑_i [κ − ψ(x, i)^T ψ(g(z), i)]

Table 1: Constraints studied in this paper. For the FRE constraint, the features f(x) can be extracted from a classification network (in our experiments we use a pretrained Cifar10 resnet), or computed from local image statistics, such as mean and standard deviation. For the CLA constraint, c(x) is a simple convolutional MNIST classifier that outputs class probabilities and l(x) is the one-hot true label vector of image x. For the pNCC constraint we define the operator ψ(x, i), which returns a whitened fixed-size patch from input image x at location i, and constrain the dot products of corresponding patches from targets and reconstructions.
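As an example of how the first two rows of Table 1 translate into code, a possible PyTorch-style sketch is given below (our own illustration; whether the squared error is summed or averaged over pixels, and the choice of feature extractor, are assumptions of this example):

import torch

def re_constraint(x, x_rec, kappa=0.1):
    # Reconstruction Error (RE): ||x - g(z)||^2 - kappa^2, one value per data-point.
    sq_err = torch.sum((x - x_rec) ** 2, dim=tuple(range(1, x.dim())))
    return sq_err - kappa ** 2

def fre_constraint(x, x_rec, features, kappa=1.0):
    # Feature Reconstruction Error (FRE): ||f(x) - f(g(z))||^2 - kappa^2,
    # where `features` is e.g. the feature extractor of a pretrained classifier.
    fx, fr = features(x), features(x_rec)
    sq_err = torch.sum((fx - fr) ** 2, dim=tuple(range(1, fx.dim())))
    return sq_err - kappa ** 2

Either function could be passed as the constraint function in a GECO implementation such as the sketch in Section 3.4.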

Here we look at the behavior of VAE+NVP and ConvDraw (the latter both in the conditional and unconditional case) in the information plane (negative reconstruction likelihood vs KL) on various datasets, with and without a RE constraint.

The datasets we use for the unconditional case are CelebA [60], Cifar10 [61], MNIST [62], Color-MNIST [63] and a variant of MNIST we will refer to as MNIST-triplets. MNIST-triplets is comprised of triplets of MNIST digits {(I_i, l_i)}_{i=0,1,2} such that l_2 = (l_0 + l_1) mod 10; the model is trained to capture the joint distribution of the image vectors {(I_{i,0}, I_{i,1}, I_{i,2})}_i.
In the conditional case we use variants of MNIST we will refer to as MNIST-sum, MNIST-sum-hard and MNIST-rotate. All variants of the datasets are comprised of contexts and targets derived from triplets of MNIST digits {(I_i, l_i)}_{i=0,1,2}, with constraints as follows. For MNIST-sum, contexts are {(I_{i,0}, I_{i,1})}_i and targets are {I_{i,2}}_i, such that l_2 = (l_0 + l_1) mod 10; for MNIST-sum-hard, contexts are {I_{i,2}}_i and targets are {(I_{i,0}, I_{i,1})}_i, such that l_2 = (l_0 + l_1) mod 10; finally, for MNIST-rotate, contexts are {(I_{i,0}, I_{i,1})}_i and targets are {I_{i,2}}_i, such that l_2 = l_0 and the target I_2 is rotated about its centre by l_1 · 30 degrees; note that whilst I_0 and I_2 have the same label, they are not the same digit instance.
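Such triplets can be assembled, for instance, by sampling indices grouped by label; a small NumPy sketch of one possible construction for MNIST-sum is shown below (our own illustration; the paper does not specify its sampling procedure):

import numpy as np

def mnist_sum_triplets(labels, n_triplets, seed=0):
    # labels: integer labels of the MNIST training images.
    # Returns index triplets (i0, i1, i2) with label(i2) = (label(i0) + label(i1)) mod 10.
    rng = np.random.default_rng(seed)
    by_label = {d: np.flatnonzero(labels == d) for d in range(10)}
    i0 = rng.integers(0, len(labels), n_triplets)
    i1 = rng.integers(0, len(labels), n_triplets)
    l2 = (labels[i0] + labels[i1]) % 10
    i2 = np.array([rng.choice(by_label[int(d)]) for d in l2])
    return np.stack([i0, i1, i2], axis=1)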


In Figure 4 we show in all cases that the negative reconstruction log-likelihoods (NLL, see Appendix D for details) reached by VAE+NVP, ConvDraw and conditional ConvDraw trained only with the ELBO objective are lower compared to the values obtained with ELBO + GECO, at the expense of KL-divergences that are sometimes many orders of magnitude higher. This result comes from the observation that the numerical values of reconstruction errors necessary to achieve good reconstructions can be much larger, allowing the model to achieve lower compression rates. To provide a notion of the quality of the reconstructions when using GECO, we show in Figure 6 a few model samples and reconstructions for different reconstruction thresholds and constraints. In Appendix F we show reconstructions and samples for all levels of reconstruction targets. As we can see from Figure 6, the use of different constraints has a dramatic impact on both the quality of reconstructions and samples.

4.1.1 Average and Marginal KL analysis

At a fixed reconstruction error, a computationally cheap indicator of the quality of the learned encoder is the average KL between prior and posterior, (1/n) ∑_i KL[q(z|x_i); π(z)], which we analyze in Figure 5. Our analysis shows that an expressive model can achieve lower average KL at a given reconstruction error when trained with GECO compared to the same model trained with ELBO.


Figure 5: GECO results in lower average KL at fixed reconstruction error compared to ELBO. We first trained an expressive ConvDRAW model on CIFAR10 using the standard ELBO objective until convergence and recorded its reconstruction error (MSE = 0.00029). At this reconstruction error value, the reconstructions are visually perfect. We then trained the same model architecture using GECO with a RE constraint set up to achieve the same reconstruction error. The curves for the model trained with ELBO (green) and with GECO (blue) demonstrate that we can achieve the same reconstruction error but with a lower average KL between prior and posterior.

From Propositions 2 and 5, the optimal solutions for VAE encoders are inference models that cover the latent space in such a way that their marginal is equal to the prior. That is, q(z) = (1/n) ∑_i q(z|x_i) = π(z). We refer to the KL between q(z) and π(z) as the "marginal KL".

If the learned encoder or inference network fails to cover the latent space, it may result in the "holes"problem which, in turn, is associated with bad sample quality.

In contrast to the average KL, the marginal KL is also sensitive to the "holes problem" discussed in Section 3.1. In Table 2 we evaluate the effect of GECO on the marginal KL of the VAE+NVP models trained in Section 4.1 (in a limited number of combinations due to the significant computational costs) and observe that models trained with GECO also have a much lower marginal KL while maintaining an acceptable reconstruction accuracy.
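Since the marginal KL involves the aggregate posterior q(z), it has to be estimated; a simple Monte Carlo sketch for diagonal-Gaussian posteriors is shown below (our own illustration; the estimator used for Table 2 is not specified in this excerpt, and the O(S·n·d) memory cost makes this version practical only for modest n):

import numpy as np

def marginal_kl(mu, logvar, n_samples=10000, seed=0):
    # mu, logvar: (n, d) parameters of q(z|x_i) for the n training points.
    # Estimates KL[q(z) || pi(z)] with q(z) = (1/n) sum_i q(z|x_i) and pi = N(0, I).
    rng = np.random.default_rng(seed)
    n, d = mu.shape
    idx = rng.integers(0, n, n_samples)                     # pick a mixture component
    z = mu[idx] + np.exp(0.5 * logvar[idx]) * rng.standard_normal((n_samples, d))
    # log q(z): log-mean-exp over all n mixture components.
    diff = z[:, None, :] - mu[None, :, :]
    log_comp = -0.5 * np.sum(diff**2 / np.exp(logvar)[None, :, :]
                             + logvar[None, :, :] + np.log(2.0 * np.pi), axis=-1)
    log_q = np.logaddexp.reduce(log_comp, axis=1) - np.log(n)
    log_pi = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)
    return float(np.mean(log_q - log_pi))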

5 Discussion

We have provided a detailed theoretical analysis of the behavior of high-capacity VAEs and variants of β-VAEs. We have made connections between VAEs, spectral clustering methods and statistical mechanics (phase transitions). Our analysis provides novel insights into the two most common problems with VAEs: blurred reconstructions/samples, and the "holes problem". Finally, we have introduced GECO, a simple-to-use algorithm for constrained optimization of VAEs. Our experiments indicate that it is a highly effective tool to achieve a good balance between reconstructions and compression in practice (without resorting to large parameter sweeps) in a broad variety of tasks.


Dataset        Marginal KL for ELBO    Marginal KL for GECO
Cifar10        725.2                   45.3
Color-MNIST    182.5                   10.3

Table 2: Marginal KL comparison for a VAE+NVP model on Cifar10 and Color-MNIST.


Figure 6: Examples of samples and reconstructions from ConvDraw trained on CelebA and Color-MNIST. In each block of samples, rows correspond to samples from the data, model reconstructions and model samples respectively. From left to right we have models trained with: (i) ELBO only. (ii) ELBO + GECO+RE constraint with κ = 0.1. (iii) ELBO + GECO+FRE constraint with κ = 1.0. (iv) ELBO only. (v) ELBO + GECO+RE constraint with κ = 0.06. More samples available in Appendix F.

with VAEs: blurred reconstructions/samples, and the "holes problem". Finally, we have introduced GECO, a simple-to-use algorithm for constrained optimization of VAEs. Our experiments indicate that it is a highly effective tool for achieving a good balance between reconstruction and compression in practice (without resorting to large parameter sweeps) across a broad variety of tasks.

Acknowledgements

We would like to thank Mihaela Rosca for the many useful discussions and the help with the Marginal KL evaluation experiments.

References

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[3] Diederik P. Kingma, Tim Salimans, and Max Welling. Improved variational inference with inverse autoregressive flow. 2016.

[4] Casper Kaae Sonderby, Tapani Raiko, Lars Maaloe, Soren Kaae Sonderby, and Ole Winther. Ladder variational autoencoders. In NIPS, 2016.

[5] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In NIPS, 2016.

[6] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vázquez, and Aaron C. Courville. PixelVAE: A latent variable model for natural images. CoRR, abs/1611.05013, 2016.

[7] Philip Bachman. An architecture for deep, hierarchical generative models. In NIPS, 2016.

[8] Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. Expressive speech synthesis via modeling expressions with variational autoencoder. arXiv preprint arXiv:1804.02135, 2018.

[9] Wei-Ning Hsu, Yu Zhang, and James Glass. Learning latent representations for speech generation and transformation. arXiv preprint arXiv:1704.04222, 2017.

[10] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.

[11] Mohammad M Sultan, Hannah K Wayment-Steele, and Vijay S Pande. Transferable neural networks for enhanced sampling of protein dynamics. arXiv preprint arXiv:1801.00636, 2018.

[12] Carlos X Hernández, Hannah K Wayment-Steele, Mohammad M Sultan, Brooke E Husic, and Vijay S Pande. Variational encoding of complex dynamics. arXiv preprint arXiv:1711.08576, 2017.

[13] Tadanobu Inoue, Subhajit Chaudhury, Giovanni De Magistris, and Sakyasingha Dasgupta. Transfer learning from synthetic to real images using variational autoencoders for robotic applications. arXiv preprint arXiv:1709.06762, 2017.

[14] Daehyung Park, Yuuna Hoshi, and Charles C Kemp. A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robotics and Automation Letters, 3(3):1544–1551, 2018.

[15] Herke van Hoof, Nutan Chen, Maximilian Karl, Patrick van der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 3928–3934. IEEE, 2016.

[16] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

[17] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems, pages 4996–5004, 2016.

[18] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.

[19] Jakub M. Tomczak and Max Welling. VAE with a VampPrior. In AISTATS, 2018.

[20] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.

[21] Eric T. Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent Gaussian mixtures. 2016.

[22] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.

[23] Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.

[24] Jakub M. Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. CoRR, abs/1611.09630, 2016.

[25] Dustin Tran, Rajesh Ranganath, and David M. Blei. The variational Gaussian process. 2015.

[26] Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. CoRR, abs/1803.05649, 2018.

[27] Chris Cremer, Xuechen Li, and David K. Duvenaud. Inference suboptimality in variational autoencoders. CoRR, abs/1801.03558, 2018.

[28] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. CoRR, abs/1802.06847, 2018.

[29] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[30] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

[31] Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. 2017.

[32] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[33] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pages 1–5. IEEE, 2015.

[34] Christopher Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. 2018.

[35] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin P Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016.

[36] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

[37] DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017.

[38] Bin Dai, Yu Wang, John Aston, Gang Hua, and David O Wipf. Hidden talents of the variational autoencoder. 2017.

[39] I-J Wang and James C Spall. Stochastic optimization with inequality constraints using simultaneous perturbations and penalty functions. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, volume 4, pages 3808–3813. IEEE, 2003.

[40] Ana Maria AC Rocha and Edite MGP Fernandes. A stochastic augmented Lagrangian equality constrained-based algorithm for global optimization. In AIP Conference Proceedings, volume 1281, pages 967–970. AIP, 2010.

[41] Mary Phuong, Max Welling, Nate Kushman, Ryota Tomioka, and Sebastian Nowozin. The mutual autoencoder: Controlling information in latent code representations, 2018.

[42] Yan Zhang, Mete Ozay, Zhun Sun, and Takayuki Okatani. Information potential auto-encoders. CoRR, abs/1706.04635, 2017.

[43] Artemy Kolchinsky, Brendan D. Tracey, and David H. Wolpert. Nonlinear information bottleneck. CoRR, abs/1705.02436, 2017.

[44] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

[45] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods (Optimization and Neural Computation Series). 1996.

[46] Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.

[47] Imre Csiszár. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984.

[48] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[49] Pierre-Alexandre Mattei and Jes Frellsen. Leveraging the exact likelihood of deep latent variable models. CoRR, abs/1802.04826, 2018.

[50] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[51] David Hoyle and Magnus Rattray. Limiting form of the sample covariance eigenspectrum in PCA and kernel PCA. In Advances in Neural Information Processing Systems, pages 1181–1188, 2004.

[52] Boaz Nadler, Stephane Lafon, Ronald Coifman, and Ioannis G Kevrekidis. Diffusion maps - a probabilistic interpretation for spectral embedding and clustering algorithms. In Principal Manifolds for Data Visualization and Dimension Reduction, pages 238–260. Springer, 2008.

[53] Quan Wang. Kernel principal component analysis and its applications in face recognition and active shape models. arXiv preprint arXiv:1207.3538, 2012.

[54] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.

[55] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.

[56] Stephen J Blundell and Katherine M Blundell. Concepts in Thermal Physics. OUP Oxford, 2009.

[57] Richard Chace Tolman. The Principles of Statistical Mechanics. Courier Corporation, 1938.

[58] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

[59] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[60] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[61] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[62] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[63] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.

[64] Mícheál O’Searcoid. Metric Spaces. Springer Science & Business Media, 2006.

[65] Maxwell Rosenlicht. Introduction to Analysis. Courier Corporation, 1968.

Appendix A Reconstruction fixed-points experiment details

To produce the results in Figure 2, we iterated the fixed-point equations for the matrix ψ^t using exponential smoothing with α = 0.9. For each experiment, we iterated the smoothed fixed-point equations until either the number of iterations exceeded 400 steps or the Euclidean distance between two successive steps became smaller than 1e-3. The basis functions φ_i(z) were arranged as a 32x32 grid in a compact latent space z ∈ [−1/2, 1/2]², the prior was chosen to be uniform, π(z) = 1, and the matrix ψ was initialized at the center-points of the grid tiles with small uniform noise in [−0.1, 0.1].
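The sketch below is our own illustrative reconstruction of this loop, not the original code: it applies the fixed-point update of Equation (20) under a uniform prior over the tiles, with exponential smoothing of successive iterates and the stopping criteria described above. The function name, array shapes, initialisation argument, and the exact placement of the smoothing step are assumptions.

```python
import numpy as np

def iterate_reconstruction_fixed_points(x, psi_init, beta=1.0, alpha=0.9,
                                        max_steps=400, tol=1e-3):
    """Smoothed fixed-point iteration for the reconstruction matrix psi.

    x:        (n, d) array of data points.
    psi_init: (K, d) initial reconstruction vectors (e.g. grid centres + noise).
    Returns the (K, d) matrix of converged reconstruction vectors.
    """
    psi = psi_init.copy()
    for _ in range(max_steps):
        # Responsibilities m_ib ∝ exp(-||x_i - psi_b||^2 / (2*beta)), normalised
        # over b (for a uniform prior the constant π_b cancels in the update).
        sq_dist = ((x[:, None, :] - psi[None, :, :]) ** 2).sum(-1)        # (n, K)
        m = np.exp(-(sq_dist - sq_dist.min(1, keepdims=True)) / (2 * beta))
        m /= m.sum(1, keepdims=True)
        # Fixed-point update of Eq. (20): psi_b = sum_i x_i m_ib / sum_j m_jb.
        psi_target = (m.T @ x) / (m.sum(0)[:, None] + 1e-12)
        # Exponential smoothing of successive iterates (alpha = 0.9 in Figure 2).
        psi_new = alpha * psi + (1 - alpha) * psi_target
        # Stop when two successive smoothed iterates are closer than the tolerance.
        if np.linalg.norm(psi_new - psi) < tol:
            return psi_new
        psi = psi_new
    return psi
```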

Appendix B High-capacity β-VAEs and Lipschitz constraints

The analysis of high-capacity VAEs in Sections 3.1 and 3.2 reveals interesting aspects of VAEs near convergence, but the type of solutions implied by Propositions 2 and 4 may seem unrealistic for VAEs with smooth decoders parametrized by deep neural networks. For instance, the solutions of the fixed-point equations from Propositions 2 and 4 have no notion that the outputs of the decoder g(z) and g(z′) for two similar latent vectors z and z′ should also be similar. That is, these solutions are not sensitive to the metric and topological properties of the latent space. This implies that, if we want to work in a more realistic high-capacity regime, we must further constrain the solutions so that they have at least a continuous limit as we grow the number of basis functions to infinity.

A sufficient condition for a function g(z) to be continuous is that it is a locally-Lipschitz function of z [64]. Thus, to bring our analysis closer to more realistic VAEs, we consider high-capacity β-VAEs with an extra L-Lipschitz inequality constraint C on the decoder function g(z). The new term that we add to the augmented Lagrangian, with a functional Lagrange multiplier Ω(z, z′) ≥ 0, is given by

C[g] = (1/2) ∫ dz dz′ π(z) π(z′) Ω(z, z′) [ ‖g(z) − g(z′)‖² − L² ‖z − z′‖² ].   (7)

By expressing g(z) and Ω(z, z′) in the functional basis φ(z) of Proposition 4, g(z) = Σ_a ψ_a φ_a(z) and Ω(z, z′) = Σ_{a,b} Ω_{ab} φ_a(z) φ_b(z′), we can rewrite the constraint term as a quadratic inequality constraint in the matrix ψ,

C[g] = (1/2) Σ_{a,b} Ω̄_{ab} [ C_{ab} ‖ψ_a − ψ_b‖² − 1 ],   (8)

where Ω̄_{ab} = L² K_{ab} Ω_{ab} are new Lagrange multipliers, C_{ab} = π_a π_b / (L² K_{ab}) and K_{ab} = ∫ dz dz′ φ_a(z) φ_b(z′) ‖z − z′‖². The matrices C_{ab} and K_{ab} embed the metric and topological properties of the latent space but are otherwise independent of the rest of the model.

The constraint (7) can be used to enforce both global and local Lipschitz constraints by controlling the size of the support of the Lagrange multiplier function Ω(z, z′). If Ω(z, z′) = 0 for ‖z − z′‖ ≥ r, we will be constraining the decoder function to be locally Lipschitz within a radius r in the latent space.
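To make Equation (8) and the support argument concrete, the sketch below (an illustration with our own function names and array shapes, not the authors' code) evaluates the basis-expanded constraint term for given matrices ψ, Ω and C, and builds a multiplier mask whose support is restricted to tile pairs within a latent radius r.

```python
import numpy as np

def lipschitz_constraint_term(psi, omega, c):
    """Basis-expanded Lipschitz constraint of Eq. (8):
    C[g] = 1/2 * sum_{a,b} omega_ab * (c_ab * ||psi_a - psi_b||^2 - 1),
    with psi of shape (K, d) and omega, c of shape (K, K)."""
    sq_dist = ((psi[:, None, :] - psi[None, :, :]) ** 2).sum(-1)   # (K, K)
    return 0.5 * np.sum(omega * (c * sq_dist - 1.0))

def local_multiplier_mask(tile_centres, r):
    """Support mask for the multipliers: zero whenever ||z_a - z_b|| >= r, so
    that the constraint only acts locally within radius r in latent space."""
    dist = np.linalg.norm(tile_centres[:, None, :] - tile_centres[None, :, :], axis=-1)
    return (dist < r).astype(float)
```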

Note that the Lipschitz constraint is not a reconstruction constraint, as it only constrains the VAE's decoder at arbitrary points in the latent space. For this reason, it can be implemented as a projection step just after the iteration from Equation (20). This is formalized in Proposition 6. We illustrate the combined effect of β, local and global Lipschitz constraints on VAEs in Figure 7.

Proposition 6. The Lipschitz constraint from Equation (8) can be incorporated into the fixed-point Equation (20) as a projection of the form ψ^{t+1} = F(ψ^t) P^t, where F is the transition operator without Lipschitz constraints. See proof in Appendix C.5.

Appendix C Proofs

C.1 Derivation of Proposition 1

Proof. We can obtain these equations by taking the functional derivatives δF/δg(z), δF/δq(z|x), δL_λ/δg(z) and δL_λ/δq(z|x) of F and L_λ with respect to g(z) and q(z|x) respectively and re-arranging the terms of the

[Figure 7 panel labels: columns β = 0.5, 1.0, 5.0, 20.0, 50.0; rows No Lipschitz, Local Lipschitz, Global Lipschitz.]

Figure 7: Combined effect of β-VAEs and Lipschitz constraints. Blue points are data-points for a mixture of lines (left) and a mixture of circles (right). Grey curves indicate the trajectories of the reconstruction vectors ψ from initial conditions on a uniform 2D grid. Red points are the found fixed-points. Top row: Reconstruction fixed-points without Lipschitz constraints. An increase in the β factor causes the reconstruction fixed-points to collapse, effectively clustering the data at different spatial resolutions. Middle row: Reconstruction fixed-points with local Lipschitz constraints (r = 0.2). As we increase the strength of the local Lipschitz constraints, the fixed-points tend to organize in thin manifolds connecting regions of high density. Bottom row: Reconstruction fixed-points with global Lipschitz constraints. As we increase the strength of the global Lipschitz constraints (r = 1.0), the fixed-points tend to organize in manifolds covering regions of high density.

equations δF/δq(z|x) = 0, δF/δg(z) = 0, δL_λ/δq(z|x) = 0 and δL_λ/δg(z) = 0. For the density q(z|x) we must also take the normalization constraint ∫ dx dz λ(x)(q(z|x) − 1) into account:

δF/δg(z) = (1/σ²) ( E_{ρ(x)}[q(z|x) x] − E_{ρ(x)}[q(z|x)] g(z) ) = 0,   (9)

δF/δq(z|x) = ρ(x) [ ‖x − g(z)‖²/(2σ²) + ln( q(z|x)/π(z) ) + 1 + λ(x) ] = 0,   (10)

δL_λ/δg(z) = −λ^T E_{ρ(x)}[ q(z|x) ∂C(x, g(z))/∂g(z) ] = 0,   (11)

δL_λ/δq(z|x) = ρ(x) [ ln( q(z|x)/π(z) ) + 1 + λ^T C(x, g(z)) + λ(x) ] = 0,   (12)

where F was computed using a Gaussian likelihood with global variance σ². A straightforward algebraic simplification of these equations results in the fixed-point equations of Proposition 1. For a general constraint C(x, g(z)) we cannot simply solve δL_λ/δg(z) = 0 for g(z) due to the non-linear dependency of C(x, g(z)) on g(z). In this case, we can often employ a standard technique for constructing fixed-point equations, which consists in converting an equation of the form x − f(x) = 0 into a recurrence relation of the form x_n = f(x_{n−1}).

C.2 Derivation of Proposition 2

Proof. First, we note that for any given q(z|x), the ELBO is a convex quadratic functional of the decoder g(z) and, for a fixed g(z), it is a convex functional of the encoder q(z|x). Second, for a fixed partition Ω_i, replacing the solution q^t(z|x_i) = π(z) I_{z∈Ω_i}/π_i, where π_i = E_π[I_{z∈Ω_i}], in the fixed-point equations, results in itself. That is,

w_i^t(z) = q^t(z|x_i) / Σ_j q^t(z|x_j) = I_{z∈Ω_i}   (13)

g^t(z) = Σ_i w_i^t(z) x_i = Σ_i x_i I_{z∈Ω_i}   (14)

σ^t = √( E_{ρ(x)q^t(z|x)}[ ‖x − g^t(z)‖² ] ) = 0   (15)

q^{t+1}(z|x_i) = lim_{σ→0} π(z) e^{−‖x_i − g^t(z)‖²/(2σ²)} / c(x_i)
             = lim_{σ→0} π(z) Σ_j e^{−‖x_i − x_j‖²/(2σ²)} I_{z∈Ω_j} / ( ∫ dz′ π(z′) Σ_k e^{−‖x_i − x_k‖²/(2σ²)} I_{z′∈Ω_k} )
             = (1/π_i) π(z) I_{z∈Ω_i}.   (16)

Therefore, q^t(z|x_i) = π(z) I_{z∈Ω_i}/π_i is a fixed-point in the family of densities constrained by the partition Ω_i. We observe that the negative ELBO reduces to the expected KL term only, E_p[KL(q; π)] = (1/n) Σ_j ∫ dz (π(z) I_{z∈Ω_j}/π_j)(− ln π_j) = −(1/n) Σ_j ln π_j. We can now optimize the partition to further maximize the ELBO. This results in π_j = 1/n. That is, the tiles Ω_i must be equiprobable. At this point, we have E_p[KL(q; π)] = ln n.
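As an illustrative numerical reading of this result (our own example, not part of the original derivation): with n equiprobable tiles the expected KL term at this fixed point is exactly ln n, so for a dataset of n = 7 points, such as the Micro-MNIST set of Appendix E.1,

E_p[KL(q; π)] = ln 7 ≈ 1.95 nats ≈ 2.81 bits,

which is precisely the rate needed to identify one of the 7 training points.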

C.3 Derivation of Proposition 3

Proof. From (6) we have that g(z) = Σ_i w_i^t(z) x_i = Σ_i x_i q(z|x_i) / Σ_j q(z|x_j). Substituting q(z|x_i) = π(z) I_{z∈Ω_i}/π_i we have,

g(z) = Σ_i x_i q(z|x_i) / Σ_j q(z|x_j) = Σ_i x_i (I_{z∈Ω_i}/π_i) / (Σ_j I_{z∈Ω_j}/π_j) = Σ_{i|z∈Ω_i} x_i (1/π_i) / ( Σ_{j|z∈Ω_j} 1/π_j ),   (17)

where π_i = E_π[I_{z∈Ω_i}].

C.4 Derivation of Proposition 4

Proof. The expression for the generator can be obtained by substitution and algebraic simplification, using the fact that the φ_a form an orthogonal basis and that φ_a(z) ∈ {0, 1}:

g(z) = Σ_i [ q^t(z|x_i) / Σ_j q^t(z|x_j) ] x_i = Σ_{ia} [ m_{ia} φ_a(z) / Σ_{jb} m_{jb} φ_b(z) ] x_i = Σ_{ib} [ m_{ib} / Σ_j m_{jb} ] φ_b(z) x_i = ψ^T φ(z),   (18)

with ψ_b = Σ_i [ m_{ib} / Σ_j m_{jb} ] x_i. Similarly, we can compute the fixed point equations for ψ_a by substitution on equations (5) and (6),

q_i^{t+1}(z) = π(z) Σ_b [ e^{−‖x_i − ψ_b^t‖²/(2β)} / c_i^t ] φ_b(z) = π(z) Σ_a m_{ia}^{t+1} φ_a(z)   (19)

ψ_b^{t+1} = Σ_i x_i m_{ib}^{t+1} / Σ_j m_{jb}^{t+1} = [ Σ_i x_i e^{−‖x_i − ψ_b^t‖²/(2β)} / c_i^t ] / [ Σ_i e^{−‖x_i − ψ_b^t‖²/(2β)} / c_i^t ],   (20)

where c_i^t = Σ_b e^{−‖x_i − ψ_b^t‖²/(2β)} π_b and m_{ib}^{t+1} = e^{−‖x_i − ψ_b^t‖²/(2β)} / c_i^t. If the initial reconstruction vectors ψ_a^t are in the convex-hull of the training data-points, then equations (20) will map them to another set of points ψ_a^{t+1} in the convex-hull of the training data. Since these equations are also smooth with respect to ψ, they are guaranteed to converge as a consequence of the fixed-point theorem [65]. Importantly, equation (20) corresponds to the fixed point iterations for computing the pre-image (reconstructions) of Kernel-PCA but using a normalized Gaussian kernel.

C.5 Derivation of Proposition 6

Proof. We first write the relevant terms of the augmented Lagrangian L_{λ,Ω} in the basis φ(z) from Proposition 4,

L_{λ,Ω} = E_{ρ(x)}[ E_{q(z|x)}[ ‖x − ψ^T φ(z)‖²/(2σ²) ] ] + (1/2n) Σ_{a,b} Ω_{ab} [ C_{ab} ‖ψ_a − ψ_b‖² − 1 ] + cst   (21)
        = E_{ρ(x)}[ Σ_a E_{q(z|x)}[φ_a(z)] ‖x − ψ_a‖²/(2σ²) ] + (1/2n) Σ_{a,b} Ω_{ab} [ C_{ab} ‖ψ_a − ψ_b‖² − 1 ] + cst   (22)
        = (1/n) Σ_{ia} m_{ia} π_a ‖x_i − ψ_a‖²/(2σ²) + (1/2n) Σ_{a,b} Ω_{ab} C_{ab} ‖ψ_a − ψ_b‖² + cst,   (23)

where cst are terms that do not depend on ψ for a fixed q. Solving ∂L_{λ,Ω}/∂ψ_a = 0 with respect to ψ results in

∂L_{λ,Ω}/∂ψ_a = (π_a Σ_i m_{ia} / σ²) [ Σ_i x_i m_{ia}/Σ_i m_{ia} − ψ_a + (σ²/(π_a Σ_i m_{ia})) Σ_b Ω_{ab} C_{ab} (ψ_a − ψ_b) ] = 0   (24)

ψ_b = Σ_{i,a} [ x_i m_{ia} / Σ_i m_{ia} ] P_{ab},   (25)

where P = [ I − diag( nσ² / (1^T m ⊙ π) ) ( diag(1^T (Ω ⊙ C)) − Ω ⊙ C ) ]^{−1}.
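A small sketch of this projection step (our own illustration; the helper name, array shapes, and the handling of the nσ² scale follow the matrix expression for P exactly as printed above, which may differ from the authors' implementation):

```python
import numpy as np

def lipschitz_projection(mu, m, pi, omega, c, sigma2, n):
    """Projection of Proposition 6 / Eq. (25).

    mu:       (K, d) unconstrained targets, mu_a = sum_i x_i m_ia / sum_i m_ia.
    m:        (n, K) responsibilities; pi: (K,) tile masses under the prior.
    omega, c: (K, K) multiplier and metric matrices from Eq. (8).
    Returns the projected reconstruction vectors psi of shape (K, d).
    """
    w = omega * c                                       # Ω ⊙ C
    scale = n * sigma2 / (m.sum(0) * pi + 1e-12)        # nσ² / (1ᵀm ⊙ π)
    lap = np.diag(w.sum(0)) - w                         # diag(1ᵀ(Ω⊙C)) − Ω⊙C
    p = np.linalg.inv(np.eye(len(pi)) - np.diag(scale) @ lap)
    return p.T @ mu                                     # ψ_b = Σ_a μ_a P_ab
```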

C.6 Derivation of Proposition 5

Proof. From Proposition 4, we have seen that the reconstruction vectors ψ_a of a high-capacity β-VAE will converge to a set of m fixed-points. This means that ψ will map the set of basis-functions to a smaller subset of points. As a consequence, all the latent-vectors falling in the support of these basis elements will also map to the same reconstruction and, for a fixed x, the function H(x, z) will only have m possible distinct values, which we can enumerate as H_{ia}. If we enumerate all distinct supports as Ω_a, then from Equation (5) we have that

q_i(z) = π(z) Σ_a [ e^{−H_{ia}/β} / ( Σ_b e^{−H_{ib}/β} γ_b ) ] I_{z∈Ω_a}.

Replacing q_i(z) in Equation (2) and maximizing with respect to γ_a results in the condition Σ_i e^{−H_{ia}/β} / ( Σ_b e^{−H_{ib}/β} γ_b ) = n.

Appendix D Computing β-independent NLLs

When we train β-VAEs with different constraints using GECO, it is not obvious how to compare them in the information plane due to the arbitrary scaling learned by the optimizer. In order to do a more meaningful comparison after the models have been trained, we recompute an optimal global standard-deviation σ_opt for all models on the training data (keeping all other parameters fixed):

σ_opt ≈ √[ (1/M) Σ_{i∈training set} ‖x_i − g(z_i)‖² ]   (26)

where z_i ∼ q(z|x_i) and M is the batch size used for this computation. All the reported negative reconstruction likelihoods were computed using σ_opt. We report all likelihoods per-pixel using the "quantized normal" distribution [28] to make it easier to compare with other models.
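A minimal sketch of Equation (26), assuming generic `encode`/`decode` callables standing in for the trained model (the evaluation code itself is not part of the paper):

```python
import numpy as np

def refit_global_sigma(x_batch, encode, decode, rng=None):
    """Re-fit a single global standard deviation on a batch of training data,
    keeping all other model parameters fixed (Eq. (26)).

    encode(x) is assumed to return the mean and std of q(z|x);
    decode(z) is assumed to return the reconstruction g(z).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sq_errors = []
    for x in x_batch:
        mu, std = encode(x)
        z = mu + std * rng.standard_normal(np.shape(mu))   # z ~ q(z|x)
        sq_errors.append(np.sum((x - decode(z)) ** 2))
    return float(np.sqrt(np.mean(sq_errors)))
```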

Appendix E Extra Experiments

E.1 Micro-MNIST

As a way to illustrate the impact of different constraints on the learned encoder densities, we created the "Micro-MNIST" dataset, comprising only 7 random samples from MNIST. In these experiments, shown in Figure 8, we observed that as we make the constraints weaker (larger κ for the RE constraint or smaller κ for the CLA constraint), we encourage the posterior density to collapse the modes of the data, focusing on the properties of the data relevant to the constraints.

Figure 8: Learned posterior densities for a β-VAE+NVP trained via ELBO and with RE and CLA constraints on Micro-MNIST. Each plot shows 100 samples from the learned encoder in a 2-dimensional latent space for each data-point from Micro-MNIST. The posterior samples are colored according to the data-point they belong to. From left to right: (i) ELBO only, β = 1; (ii) β optimized by Algorithm 1 using a GECO+RE constraint with κ = 0.05; (iii) β optimized by Algorithm 1 using a GECO+RE constraint with κ = 0.5; (iv) β optimized by Algorithm 1 using a GECO+CLA constraint with κ = 0.9. This experiment illustrates that VAEs with highly expressive posterior densities trained via ELBO have problems forming a tight tiling of the latent space, as predicted by the theory. By training the VAE+NVP with GECO we can substantially tighten the posterior tiles.


Figure 9: Samples from ConvDraw trained on CelebA. In each block of samples, rows correspond to samples from the data, model reconstructions and model samples respectively. From left to right we have models trained with: (a) Data; (b) ELBO only; (c) ELBO + hand-crafted β ∈ [2.0, 0.7] annealing; (d) GECO+RE constraint with κ = 0.08; (e) GECO+RE constraint with κ = 0.06; (f) GECO+RE constraint with κ = 0.1; (g) GECO+FRE constraint with κ = 0.0625.

Appendix F Model and Data Samples


Figure 10: Samples from ConvDraw trained on Color-MNIST. In each block of samples, rows correspond to samples from the data, model reconstructions and model samples respectively. From left to right we have models trained with: (a) Data; (b) ELBO only; (c) ELBO + hand-crafted β ∈ [2.0, 0.7] annealing; (d) GECO+RE constraint with κ = 0.08; (e) GECO+RE constraint with κ = 0.06; (f) GECO+RE constraint with κ = 0.1.


Figure 11: Samples from ConvDraw trained on CIFAR10. In each block of samples, rows correspond to samples from the data, model reconstructions and model samples respectively. From left to right we have models trained with: (a) Data; (b) ELBO only; (c) ELBO + hand-crafted β ∈ [2.0, 0.7] annealing; (d) GECO+RE constraint with κ = 0.06; (e) GECO+RE constraint with κ = 0.0028; (f) GECO+FRE constraint with κ = 0.0625; (g) GECO+pNCC constraint.
