
Variational Sparse Coding

Francesco Tonolini
School of Computing Science
University of Glasgow
Glasgow, UK

Bjørn Sand Jensen
School of Computing Science
University of Glasgow
Glasgow, UK

Roderick Murray-Smith
School of Computing Science
University of Glasgow
Glasgow, UK

Abstract

Unsupervised discovery of interpretable features and controllable generation with high-dimensional data are currently major challenges in machine learning, with applications in data visualisation, clustering and artificial data synthesis. We propose a model based on variational auto-encoders (VAEs) in which interpretation is induced through latent space sparsity with a mixture of Spike and Slab distributions as prior. We derive an evidence lower bound for this model and propose a specific training method for recovering disentangled features as sparse elements in latent vectors. In our experiments, we demonstrate superior disentanglement performance to standard VAE approaches when an estimate of the number of true sources of variation is not available and objects display different combinations of attributes. Furthermore, the new model provides unique capabilities, such as recovering feature exploitation, synthesising samples that share attributes with a given input object and controlling both discrete and continuous features upon generation.

1 INTRODUCTION

Variational auto-encoders (VAEs) offer an efficient way of performing approximate posterior inference with otherwise intractable generative models and yield probabilistic encoding functions that can map complicated high-dimensional data to lower dimensional representations (Kingma & Welling, 2013; Rezende et al., 2014; Sønderby et al., 2016). Making such representations meaningful, however, is a particularly difficult task and currently a major challenge in representation learning (Burgess et al., 2018; Tomczak & Welling, 2018; Kim & Mnih, 2018). Large latent spaces often give rise to many latent dimensions that do not carry any information, and obtaining codes that properly capture the complexity of the observed data is generally problematic (Tomczak & Welling, 2018; Higgins et al., 2017b; Burgess et al., 2018).

In the case of linear mappings, sparse coding offers an elegant solution to the aforementioned problem: the representation space is induced to be sparse. In this way, the encoding function can exploit a high-dimensional space to model a large number of possible features, while being encouraged to use a small subset of non-zero elements to describe each individual observation (Olshausen & Field, 1996a;b). Due to their efficiency of representation, sparse codes have been used in many learning and recognition systems, as they provide easier interpretation when processing natural data (Lee et al., 2007; Bengio et al., 2013). Biological systems have also notably been shown to exploit signal sparsity for visual perception (Olshausen & Field, 1996a;b).

In this work, we aim to extend the aforementioned capability of linear sparse coding to non-linear probabilistic generative models, thus allowing efficient, informative and interpretable representations in the general case. To this end we formulate a new variation of VAEs in which we employ a sparsity-inducing prior and a discrete mixture recognition model based on the Spike and Slab distribution. We construct a flexible sparse prior as a combination of Spike and Slab recognition models through auxiliary pseudo-inputs. We derive a non-trivial analytical evidence lower bound (ELBO) for the model and design a pre-training procedure that avoids mode collapse.¹

In our experiments, we study feature disentanglement in the general situation where no estimate of the number of ground truth features is available and different features can be present or absent in each observed data example. We show how our model considerably outperforms traditional disentanglement approaches (Higgins et al., 2017a; Gao et al., 2019) in such conditions. We then consider two benchmark data sets, Fashion-MNIST and UCI HAR, and demonstrate how the variational sparse coding (VSC) model recovers sparse representations that properly capture the mixed discrete and continuous nature of their variability. We further show how such representations can be used to investigate feature relationships among different objects and classes and to perform conditional generation in a completely unsupervised manner.

¹An implementation of the VSC model is available from https://github.com/ftonolini45/Variational_Sparse_Coding.

2 BACKGROUND

2.1 SPARSE CODING

Sparse coding aims to approximately represent observed signals with a weighted linear combination of few unknown basis vectors (Lee et al., 2007; Bengio et al., 2013). The task of determining the optimal basis is generally formulated as an optimisation problem, where a reconstruction cost and a sparsity cost of the embedded representations are jointly minimised with respect to the coefficients of a linear transformation. Sparse coding makes the important realisation that, though large ensembles of natural signals need many variables to be described, individual samples can be well represented by a small subset of such variables. This realisation is supported by substantial empirical evidence, and pioneering work by Olshausen & Field (1996a) also showed that the visual cortex in mammals processes information as sparse signals, demonstrating that biological learning exploits sparsity in natural images similarly to sparse coding based models. Olshausen & Field (2004) provide a comprehensive review of linear sparse coding.
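To make the linear formulation concrete, the following toy sketch (ours, not code from the paper) infers a sparse code for a fixed random dictionary with ISTA: gradient steps on the reconstruction cost followed by soft-thresholding, the proximal operator of the L1 sparsity cost.

import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding, the proximal operator of the L1 norm
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam=0.5, n_steps=200):
    # Infer a sparse code z minimising (1/2)||x - D z||^2 + lam * ||z||_1 for a fixed dictionary D
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the reconstruction gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)           # gradient of the quadratic reconstruction cost
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy usage: a signal built from 5 atoms of a random dictionary
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
x = D[:, :5] @ rng.normal(size=5)
print("non-zero coefficients:", np.count_nonzero(sparse_code_ista(x, D)))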

Sparse coding can be probabilistically interpreted as a generative model, where the observed signals are generated from unobserved latent variables through a linear process (Lee et al., 2007; Bengio et al., 2013). The model can then be described with a latent prior distribution, which assigns high density to sparse latent variables, and a likelihood distribution, which quantifies the reconstruction density given a latent embedding. In fact, performing maximum a posteriori (MAP) estimation with such models recovers the common formulation of sparse coding described above. Previous work has also demonstrated variational inference with sparse coding probabilistic models, exploiting algorithms based on EM inference (Titsias & Lazaro-Gredilla, 2011; Goodfellow et al., 2012). However, EM inference becomes intractable for more complicated non-linear posteriors and a large number of input vectors (Kingma & Welling, 2013), making such an approach unsuitable to scale to our target models.

Conversely, some work has been done in generalising sparse coding to non-linear transformations, by defining sparsity on Riemannian manifolds (Ho et al., 2013; Cherian & Sra, 2017). These generalisations, however, are not probabilistic, as they define a non-linear equivalent of MAP inference, and are limited to simple manifolds due to the need to compute the manifold's logarithmic map.

2.2 VARIATIONAL AUTO-ENCODERS

Variational auto-encoders (VAEs) are models for unsupervised efficient coding that aim to maximise the marginal likelihood $p(x) = \prod_i p(x_i)$ with respect to some decoding parameters $\theta$ of the likelihood function $p_\theta(x|z)$ and encoding parameters $\phi$ of a recognition model $q_\phi(z|x)$ (Kingma & Welling, 2013; Rezende et al., 2014; Pu et al., 2016).

The VAE model is defined as follows: an observed vector $x_i \in \mathbb{R}^{M\times 1}$ is assumed to be drawn from a likelihood function $p_\theta(x|z)$. Common choices are a Gaussian or a Bernoulli distribution. The parameters of $p_\theta(x|z)$ are the output of a neural network having as input a latent variable $z_i \in \mathbb{R}^{J\times 1}$. The latent variable is assumed to be drawn from a prior $p(z)$ which can take different parametric forms. In the most common VAE implementations, the prior takes the form of a multivariate Gaussian with identity covariance $\mathcal{N}(z; 0, I)$ (Kingma & Welling, 2013; Rezende et al., 2014; Higgins et al., 2017b; Burgess et al., 2018; Yeung et al., 2017). The aim is then to maximise a joint posterior distribution of the form $p(x) = \prod_i \int p_\theta(x_i|z)\, p(z)\, dz$, which for an arbitrarily complicated conditional $p(x|z)$ is intractable. To address this intractability, VAEs introduce a recognition model $q_\phi(z|x)$ and define an evidence lower bound (ELBO) to be estimated in place of the true posterior, which can be formulated as

$$\log p_\theta(x_i) = \log \int p_\theta(x_i|z)\,\frac{p(z)}{q_\phi(z|x_i)}\,q_\phi(z|x_i)\, dz \;\geq\; -D_{KL}(q_\phi(z|x_i)\,||\,p(z)) + \mathbb{E}_{q_\phi(z|x_i)}\left[\log p_\theta(x_i|z)\right]. \quad (1)$$

The ELBO is composed of two terms: a prior term, which encourages minimisation of the KL divergence between the encoding distributions and the prior, and a reconstruction term, which maximises reconstruction likelihood. The ELBO is then maximised with respect to the model's parameters $\theta$ and $\phi$.
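As a reference point before the sparse extension, the snippet below sketches a single-sample estimate of equation 1 for the common case of a Bernoulli likelihood and an $\mathcal{N}(0, I)$ prior; it is a generic illustration in PyTorch, not the paper's implementation, and assumes encoder outputs mu, log_var and a decoder network defined elsewhere.

import torch
import torch.nn.functional as F

def gaussian_vae_elbo(x, mu, log_var, decoder):
    # Reparameterised draw from the diagonal Gaussian recognition model q(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Reconstruction term E_q[log p(x|z)], one Monte Carlo sample, Bernoulli likelihood
    recon = -F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian recognition model
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon - kl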


2.2.1 Interpretation in VAEs

Obtaining informative and interpretable representations of unlabelled data is currently a major objective of unsupervised learning. VAEs have recently been considered as ideal models to obtain such representations, as they rely on a low-dimensional latent space that necessarily embeds information about the observations they model. Inducing interpretation in a VAE latent space is generally formulated as a feature disentanglement problem: assuming observed data is generated from hidden interpretable factors of variation, the task is to obtain a latent space in which such factors are aligned with the axes. The commonly adopted method to enhance the disentanglement of features in a VAE is to assign a larger weight to the KL divergence component in equation 1. Models following this approach are known as β-VAEs, and they allow controllable improvement of latent disentanglement, at the expense of reconstruction likelihood (Higgins et al., 2017a; Burgess et al., 2018). Several alternative methods to improve disentanglement have been proposed. Recent works extended the β-VAE to explicitly control total correlation between latent dimensions, expressing it as a penalty term that quantifies disentanglement (Chen et al., 2018; Gao et al., 2019). Kim & Mnih (2018) proposed an algorithm to directly induce factorisation, demonstrating a more favourable trade-off between disentanglement and reconstruction accuracy. These approaches have proven promising; however, they rely on the assumption that target features are always present in every observation with continuously varying values. Differently from these, the model presented here relies on the sparse coding realisation of natural signals, assuming that individual observations are described by only a small subset of a large ensemble of possible features. Furthermore, we assume no knowledge of the number of source factors.

2.2.2 Discrete Latent Variables and Sparsity in VAEs

Discrete latent distributions are a theme closely related to sparsity, as exactly sparse PDFs involve sampling from some discrete variables. Nalisnick & Smyth (2017) and Singh et al. (2017) model VAEs with Stick-Breaking Process and Indian Buffet Process priors respectively, in order to allow for stochastic dimensionality in the latent space. In such a way, the prior can set unused dimensions to zero. However, the resulting representations are not truly sparse; the same elements are set to zero for every encoded observation. The scope of these works is dimensionality selection rather than sparsification.

Other models which present discrete variables in their latent space have been proposed in order to capture discrete features in natural observations. Rolfe (2017) models a discrete latent space composed of continuous variables conditioned on discrete ones, in order to capture both discrete and continuous sources of variation in observations. Similarly motivated, van den Oord et al. (2017) perform variational inference with a learned discrete prior and recognition model. The resulting latent spaces can present sparsity, depending on the choice of prior. However, they do not directly induce sparse statistics in the latent space.

Other works model sparsity more directly. Yeung et al. (2017) propose to learn a deterministic selection variable that dictates which latent dimensions the recognition model should exploit in the latent space. In such a way, different embeddings can exploit different combinations of variables, which achieves the goal of counteracting over-pruning. This approach does result in sparse latent variables; however, only the continuous components are treated variationally, while the activation of elements is deterministic. More recently, Mathieu et al. (2019) modelled sparsity in the latent space with a mixture of Gaussians model, using a narrow Gaussian component to encourage elements to be close to zero. In their work, a continuous relaxation of sparsity is modelled in the latent space, as elements are not encouraged to be exactly zero, but only close to zero by the narrow Gaussian component of the prior.

Differently from these prior works, we directly model the mixed continuous-discrete nature of sparsity in the latent space through an exactly sparse prior, and find a suitable evidence lower bound and training procedure to perform approximate variational inference.

3 VARIATIONAL SPARSE CODING

We propose to use the framework of VAEs to perform approximate variational inference with neural network sparse coding architectures. With this approach, we aim to discover and discern the non-linear features that constitute variability in data and to represent them as a few non-zero elements in sparse vectors.

3.1 RECOGNITION MODEL

In order to encode observations as sparse vectors in the latent space, the recognition model is chosen to be a Spike and Slab distribution. The Spike and Slab distribution is defined over two variables: a binary spike variable $s_j$ and a continuous slab variable $z_j$ (Mitchell & Beauchamp, 1988). The spike variable is either one or zero with probabilities $\gamma$ and $(1-\gamma)$ respectively, and the slab variable has a distribution which is either a Gaussian or a Delta function centered at zero, conditioned on whether the spike variable is one or zero respectively. The resulting recognition model is

$$q_\phi(z|x_i) = \prod_{j=1}^{J}\left[\gamma_{i,j}\,\mathcal{N}(z_{i,j};\mu_{z,i,j},\sigma^2_{z,i,j}) + (1-\gamma_{i,j})\,\delta(z_{i,j})\right], \quad (2)$$

where the distribution parameters $\mu_{z,i,j}$, $\sigma^2_{z,i,j}$ and $\gamma_{i,j}$ are the outputs of a neural network having parameters $\phi$ and input $x_i$, $J$ is the number of latent dimensions and $\delta(\cdot)$ indicates the Dirac delta function centered at zero. A description of the recognition model neural network can be found in supplementary A.2.
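For illustration, a recognition network producing the parameters of equation 2 could be sketched as below; the single hidden layer and the layer sizes are our assumptions and do not reproduce the architecture of supplementary A.2.

import torch
import torch.nn as nn

class SpikeSlabEncoder(nn.Module):
    # Outputs the Slab means, Slab log-variances and Spike probabilities gamma of Eq. (2)
    def __init__(self, x_dim, hidden_dim, latent_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.gamma_logits = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.body(x)
        # gamma in (0, 1) is the probability of each latent element being non-zero
        return self.mu(h), self.log_var(h), torch.sigmoid(self.gamma_logits(h))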

3.2 PRIOR DISTRIBUTION

In order to induce sparsity in the latent space while allowing the model to flexibly adjust to represent different combinations of features, we build upon two recent advances in VAEs. Following the prior structure presented in Tomczak & Welling (2018), we build the prior with recognition models $q_\phi(z|x_u)$ from pseudo-inputs $x_u$, which are trained along with the networks' weights. However, differently from this previous work, which builds the prior with the sum of all pseudo-inputs' encodings $p(z) = \frac{1}{U}\sum_u q_\phi(z|x_u)$, we implement a classifier $u^* = C_\omega(x_i)$ with parameters $\omega$ to select a specific pseudo-input $x_{u^*}$, and consequently a single component $q_\phi(z|x_{u^*})$, for an observation $x_i$, hence assuming that each observation $x_i$ is generated from a single component in the latent space. This feature is similar to the latent variable selection presented in Yeung et al. (2017), with the difference that the selector in the VSC model assigns a different prior for each observation, rather than different latent dimensions. The sparse prior is then

$$p_s(z) = q_\phi(z|x_{u^*}), \qquad u^* = C_\omega(x_i), \quad (3)$$

where $C_\omega(x)$ is a neural network classifier. The use of the classifier $C_\omega(x)$, rather than taking the whole ensemble as prior, allows us to compute the KL divergence analytically, hence rendering optimisation efficient while maintaining the flexibility of a prior composed of multiple PDFs. A detailed description of the selection function $C_\omega(x)$ can be found in supplementary A.3.
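A minimal sketch of this prior is given below; it reuses the encoder sketched above, uses illustrative sizes, and replaces the selection function of supplementary A.3 with a hard argmax, which would need a differentiable relaxation in practice.

import torch
import torch.nn as nn

class PseudoInputPrior(nn.Module):
    # Learnable pseudo-inputs x_u and a classifier C_omega selecting one prior component per observation (Eq. 3)
    def __init__(self, encoder, n_pseudo, x_dim, hidden_dim):
        super().__init__()
        self.encoder = encoder                                # shared recognition network
        self.pseudo_inputs = nn.Parameter(torch.rand(n_pseudo, x_dim))
        self.classifier = nn.Sequential(
            nn.Linear(x_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_pseudo))

    def forward(self, x):
        # Hard selection shown for clarity only; it is not differentiable w.r.t. the classifier
        u_star = self.classifier(x).argmax(dim=-1)
        return self.encoder(self.pseudo_inputs[u_star])       # Spike and Slab parameters of the prior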

3.3 VSC OBJECTIVE FUNCTION

As in the standard VAE setting, we aim to perform approximate variational inference by maximising an ELBO of the form detailed in equation 1, with the Spike and Slab probability density function of equation 3 as prior and the recognition model of equation 2. Additionally, we need to infer a defined prior sparsity $\alpha$ in the latent space. This is done by inducing the average Spike probability $\gamma_u$ of each pseudo-input's recognition model to match the prior one $\alpha$. A sparsity KL divergence penalty term $D_{KL}(\gamma_u||\alpha)$ is then minimised jointly with the maximisation of the ELBO:

$$\arg\max_{\theta,\phi,\omega,x_u} \sum_i -D_{KL}(q_\phi(z|x_i)\,||\,q_\phi(z|x_{u^*})) + \mathbb{E}_{q_\phi(z|x_i)}\left[\log p_\theta(x_i|z)\right] - J \cdot D_{KL}(\gamma_{u^*}||\alpha). \quad (4)$$

The KL divergence between the pseudo-input's prior and the target sparsity, $D_{KL}(\gamma_{u^*}||\alpha)$, has a simple form and can readily be differentiated with respect to the weights and pseudo-inputs. This term induces the encodings to have the prior sparsity on average, acting similarly to the aggregate posterior regularisation term described in Mathieu et al. (2019). In the following subsections, we elaborate on the remaining two terms: the reconstruction and KL divergence terms of the ELBO.
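Putting the pieces together, the per-observation objective of equation 4 can be sketched as follows; treating $\gamma_{u^*}$ as the mean Spike probability of the selected pseudo-input is our reading of the sparsity term, and the reconstruction and Spike and Slab KL terms are assumed to be computed as in equations 5 and 6.

import torch

def bernoulli_kl(gamma, alpha):
    # KL divergence between Bernoulli distributions with means gamma and alpha
    return gamma * torch.log(gamma / alpha) + (1 - gamma) * torch.log((1 - gamma) / (1 - alpha))

def vsc_objective(recon_term, spike_slab_kl, gamma_u_star, alpha):
    # Eq. (4) for one observation, to be maximised; gamma_u_star has shape (J,)
    J = gamma_u_star.shape[-1]
    gamma_u = gamma_u_star.mean(dim=-1)        # average Spike probability of the pseudo-input
    return recon_term - spike_slab_kl - J * bernoulli_kl(gamma_u, alpha)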

3.3.1 Reconstruction Term

The reconstruction component of the ELBO is estimated stochastically as follows:

$$\mathbb{E}_{q_\phi(z|x_i)}\left[\log p_\theta(x_i|z)\right] \simeq \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x_i|z_{i,l}), \quad (5)$$

where the samples $z_{i,l}$ are drawn from the recognition model $q_\phi(z|x_i)$ and $L$ is the number of such draws. As in the standard VAE, to make the reconstruction term differentiable with respect to the encoding parameters $\phi$, we employ a reparameterization trick to draw from $q_\phi(z|x_i)$. To parametrise samples from the discrete binary component of $q_\phi(z|x_i)$ we use a continuous relaxation of binary variables analogous to that presented in Maddison et al. (2017) and Rolfe (2017). We make use of two auxiliary noise variables $\epsilon$ and $\eta$, normally and uniformly distributed respectively. $\epsilon$ is used to draw from the Slab distributions, resulting in a reparametrisation analogous to the standard VAE (Kingma & Welling, 2013). $\eta$ is used to parametrise draws of the Spike variables through a non-linear binary selection function $T(y_{i,l})$. The two variables are then multiplied together to obtain the parametrised draw from $q_\phi(z|x_i)$. A more detailed description of the reparametrisation of sparse samples is reported in supplementary B.
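A sketch of such a reparameterised draw is shown below; the Binary Concrete relaxation (Maddison et al., 2017) stands in for the selection function $T$ detailed in supplementary B, and the temperature is an illustrative choice.

import torch

def sample_spike_slab(mu, log_var, gamma, temperature=0.1):
    # Slab: standard Gaussian reparameterisation with noise eps
    eps = torch.randn_like(mu)
    slab = mu + torch.exp(0.5 * log_var) * eps
    # Spike: relaxed binary draw driven by uniform noise eta (Binary Concrete)
    eta = torch.rand_like(gamma)
    logits = (torch.log(gamma + 1e-8) - torch.log(1 - gamma + 1e-8)
              + torch.log(eta + 1e-8) - torch.log(1 - eta + 1e-8))
    spike = torch.sigmoid(logits / temperature)
    # Product of the two gives the (approximately) sparse latent sample
    return spike * slab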

3.3.2 Spike and Slab KL Divergence

KL divergences with discrete mixture PDFs have been used in various discrete latent variable models (Rolfe, 2017; Nalisnick & Smyth, 2017). However, in these works, they are estimated and optimised stochastically.


Gal (2016) derives an analytic form for a particular case in which the recognition model contains the prior. Differently from these, we derive a closed-form expression for the KL divergence between two arbitrary Spike and Slab distributions, hence rendering the optimisation of the ELBO for our model of comparable complexity to the standard VAE case.

By solving the KL divergence between the prior of equation 3 and the recognition model of equation 2, we derive with a novel approach the closed-form expression

$$D_{KL}(q_\phi(z|x_i)\,||\,q_\phi(z|x_{u^*})) = \sum_{j=1}^{J}\Bigg[\gamma_{i,j}\left(\log\frac{\sigma_{z,u^*,j}}{\sigma_{z,i,j}} + \frac{\sigma^2_{z,i,j} + (\mu_{z,i,j}-\mu_{z,u^*,j})^2}{2\sigma^2_{z,u^*,j}} - \frac{1}{2}\right) + (1-\gamma_{i,j})\log\frac{1-\gamma_{i,j}}{1-\gamma_{u^*,j}} + \gamma_{i,j}\log\frac{\gamma_{i,j}}{\gamma_{u^*,j}}\Bigg]. \quad (6)$$

A detailed derivation of this expression is reported in supplementary C. The expression naturally presents two components. The first term in the sum is the negative KL divergence between the distributions of the Slab variables, multiplied by the probability $\gamma_{i,j}$ of $z_{i,j}$ being non-zero. This component gives a similar regularisation to that of the standard VAE and encourages the Gaussian components of the recognition model to match those of the prior, proportionally to the Spike probabilities $\gamma_{i,j}$. The second term is the negative KL divergence between the distributions of the Spike variables, which encourages the probabilities $\gamma_{i,j}$ of the latent variables being non-zero to match those of the pseudo-input prior $\gamma_{u^*,j}$.
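Assuming a log-variance parameterisation of the Gaussian components (our convention), equation 6 can be implemented directly, for example:

import torch

def spike_slab_kl(mu_q, log_var_q, gamma_q, mu_p, log_var_p, gamma_p, eps=1e-8):
    # q: recognition model of x_i; p: recognition model of the selected pseudo-input (Eq. 6)
    # Gaussian (Slab) KL, weighted by the probability gamma_q of each element being active
    slab_kl = gamma_q * (0.5 * (log_var_p - log_var_q)
                         + (log_var_q.exp() + (mu_q - mu_p) ** 2) / (2 * log_var_p.exp())
                         - 0.5)
    # Bernoulli (Spike) KL between the activation probabilities
    spike_kl = ((1 - gamma_q) * torch.log((1 - gamma_q + eps) / (1 - gamma_p + eps))
                + gamma_q * torch.log((gamma_q + eps) / (gamma_p + eps)))
    return (slab_kl + spike_kl).sum(dim=-1)     # sum over the J latent dimensions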

3.4 TRAINING

The VSC model is trained by maximising the objective function in equation 4 using gradient descent, with the KL divergence of equation 6 and the empirical reconstruction term of equation 5. Like other VAE models, VSC suffers from the inherent problem of posterior collapse: during training, certain latent variables tend to store all information about the encoded observations while the remaining ones perfectly satisfy the KL divergence constraint. In a VSC model with a very low prior Spike probability $\alpha$ this results in a dimensionality collapse; some latent variables are set to zero for almost all encodings, while others store most of the information necessary to represent data. This may be desirable in some settings, such as dimensionality selection, but hinders the ability to adequately describe observations with different combinations of recovered features.

To counteract the aforementioned posterior collapse effect, we employ a straightforward Spike variable warm-up strategy. During a pre-training phase, the recognition model $q_\phi(z|x_i)$ is modified as follows:

$$q_{\phi,\lambda}(z|x_i) = \prod_{j=1}^{J}\left[\gamma_{i,j}\,\mathcal{N}(z_{i,j};\lambda\mu_{z,i,j},\lambda\sigma^2_{z,i,j} + (1-\lambda)) + (1-\gamma_{i,j})\,\delta(z_{i,j})\right], \quad (7)$$

where $\lambda$ is a constant which is initially set to zero, then linearly increased from zero to one, and lastly kept equal to one throughout training. When $\lambda$ is equal to zero, the Slab components of the latent variables are fixed to a zero-mean and unit-covariance Gaussian. This implies that the recognition model can initially store information only in the Spike variable patterns, similarly to a binary encoding, hence forcing latent vectors from different observations to activate different elements. As $\lambda$ is increased to one, the model can store more information in the continuous Slab variables, while maintaining different combinations of active latent variables for distinct observations.

Algorithm 1 Training the VSC Model

Inputs: observations $x$; initial model parameters $\{\theta^{(0)}, \phi^{(0)}, \omega^{(0)}, x_u^{(0)}\}$; user-defined latent dimensionality $J$; user-defined prior sparsity level $\alpha$; user-defined number of warm-up steps $N_{warmup}$; user-defined linear increment $\Delta\lambda$ in the warm-up coefficient $\lambda$; user-defined number of iterations $N_{iter}$; user-defined number of samples $L$ used in estimating the reconstruction term.

1: $\lambda^{(k=0)} \leftarrow 0$
2: for the $k$'th iteration in $[0 : N_{iter} - 1]$
3:   for the $i$'th observation
4:     $z_{i,l} \sim q_{\phi^{(k)},\lambda^{(k)}}(z|x_i) \;\; \forall\, l \in [1 : L]$ (Eq. 5)
5:     $E_i^{(k)} \leftarrow \sum_l \log p_{\theta^{(k)}}(x_i|z_{i,l})$ (Eq. 5)
6:     $u^* \leftarrow C_{\omega^{(k)}}(x_i)$ (Eq. 3)
7:     $K_i^{(k)} \leftarrow D_{KL}(q_{\phi^{(k)}}(z|x_i)\,||\,q_{\phi^{(k)}}(z|x_{u^*}))$ (Eq. 6)
8:     $D_i^{(k)} \leftarrow J \cdot D_{KL}(\gamma_{u^*}||\alpha)$ (Eq. 4)
9:   end
10:  $F^{(k)} = \sum_i \left(E_i^{(k)} - K_i^{(k)} - D_i^{(k)}\right)$ (Eq. 4)
11:  $\theta^{(k+1)}, \phi^{(k+1)}, \omega^{(k+1)}, x_u^{(k+1)} \leftarrow \arg\max(F^{(k)})$ (Eq. 4)
12:  if $\lambda^{(k)} < 1$ and $k > N_{warmup}$ (Eq. 7)
13:    $\lambda^{(k+1)} \leftarrow \lambda^{(k)} + \Delta\lambda$
14:  else
15:    $\lambda^{(k+1)} \leftarrow \lambda^{(k)}$
16:  end
17: end
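For illustration, the warm-up of equation 7 amounts to blending the encoder's Slab parameters with those of a unit Gaussian before sampling; a minimal sketch, again assuming a log-variance parameterisation, is:

import torch

def warmup_slab_params(mu, log_var, lam):
    # Eq. (7): with lam = 0 the Slab is frozen to N(0, 1) so only the Spike pattern
    # carries information; lam is increased linearly towards 1 during pre-training
    var = lam * log_var.exp() + (1.0 - lam)
    return lam * mu, torch.log(var)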


In summary, we contribute two main elements of novelty: the derivation of a closed-form expression for the Spike and Slab KL divergence and the Spike warm-up strategy to avoid posterior collapse. Both are crucial in handling the mixed continuous-discrete nature of sparse representations and in efficiently training the proposed model, hence obtaining sparse probabilistic representations of observations. For completeness we outline the training procedure in Algorithm 1. The values for each user-defined parameter that we employ in our experiments are reported in supplementary D. The maximisation at each iteration is carried out as an ADAM optimisation step.

4 EXPERIMENTS

4.1 ELBO EVALUATION

We evaluate and compare the ELBO values for VSC and β-VAE models. β-VAEs are VAEs in which the KL divergence term of equation 1 is assigned a controllable weight β. Typically, they are used with a coefficient β greater than one to induce interpretation (Higgins et al., 2017a). We make use of the Fashion-MNIST dataset, composed of 28 × 28 grey-scale images of different pieces of clothing (Xiao et al., 2017), and the UCI HAR dataset, which consists of filtered accelerometer signals from mobile phones worn by different people during common activities (Anguita et al., 2012). In all cases, the latent space is chosen to have 60 latent dimensions and the neural networks of the two models are of equal capacity. For the β-VAE we vary the β coefficient, while for the VSC model we vary the prior spike probability α. Further details can be found in supplementary D.1. Results are shown in figure 1.

Overall, the ELBO values for the VSC model are comparable to those recovered with β-VAEs and a standard VAE (β = 1). The VSC model results in lower ELBO values at decreasing prior spike probability α, similarly to the trend experienced by the β-VAE at increasing KL coefficient β. In fact, the two models present a similar behaviour: as more structure is imposed in the latent space by enforcing a 'stronger' prior, the ELBO is necessarily reduced. The VSC model, however, imposes structure in the latent space through sparsity, rather than an increased weight on the KL divergence term of the ELBO. In the next subsection we show how this subtle difference results in important representation advantages.

4.2 FEATURE DISENTANGLEMENT

Figure 1: ELBO evaluation for β-VAEs (as a function of the KL coefficient β) and VSCs (as a function of the prior Spike probability α) on Fashion-MNIST and UCI HAR, including standard deviations over several repetitions with different random seeds. In both models, the ELBO is reduced by imposing an increasingly more stringent structure in the latent space through the prior, in the β-VAE case through the KL divergence coefficient and in the VSC case through the prior sparsity.

We investigate the VSC model's ability to disentangle generating features and align them with the latent space axes. To do so, we make use of an artificial dataset, where the different examples are synthesised from a set of parameters and therefore the generative source features are known and can be used as ground truth. Previous work similarly evaluated disentanglement with artificial samples (Higgins et al., 2017a; Kim & Mnih, 2018). Data sets used in these investigations, however, contain signals generated by altering each feature continuously, leading to all examples expressing variability in all the generating factors. Following a sparse realisation of natural signals, we are instead interested in situations where groups of generating features are present or absent in different combinations. To this end, we contribute the Smiley sparse data set, in which four different attributes (mouth, eyes, hat and bow tie) can be present or absent, each with 0.5 probability. Each attribute constitutes a feature group and, if present in an example, is controlled by a different number of continuous source features between 3 and 6, for a total of 18 source variables. Examples from the Smiley sparse data set are shown in figure 2, while a detailed description is provided in supplementary E.

Figure 2: Examples from the Smiley sparse data set. Each is composed from a sparse superposition of 4 different attributes, including mouth, eyes, hat and bow tie.

In our investigation, we consider both the situation in which the total number of source variables is known and that in which it is unknown. We obtain representations using 60,000 examples from the Smiley sparse data set with both a β-VAE and a VSC, with latent spaces of 18 dimensions for experiments in which the source dimensionality is assumed to be known, and 60 dimensions to test instead the situation in which knowledge of the source features' number is not available. The VSC model is trained in both instances with a purposely low prior sparsity coefficient α = 0.01. In such a way, the prior regularisation encourages almost all variables to be inactive and the model activates only those that are needed to describe each observation.

Both models are given neural networks of equal capacity. We then test feature disentanglement with a test set of 20,000 new samples by computing the absolute value of the correlation between each source feature and each recovered latent variable. Additional details about these experiments can be found in supplementary D.2. The absolute values of the correlation matrices for the two models are shown in figure 3. If the number of latent dimensions is chosen to match that of the generating features, the β-VAE displays appreciable correlation between the true attributes and the recovered latent variables. The VSC model displays higher correlation contrast and hence better disentanglement. This is due to the fact that VSC is able to model with its prior both the presence or absence of different features in different observations and their continuous variability, while a β-VAE attempts to model the data only with continuous variables.
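The quantity plotted in figure 3 is the absolute correlation between source features and latent variables; a straightforward way to compute it (our sketch, assuming the encoded Slab means are used as latent codes) is:

import numpy as np

def abs_correlation_matrix(sources, codes):
    # sources: (N, S) ground-truth features; codes: (N, J) latent codes
    full = np.corrcoef(sources.T, codes.T)      # (S + J) x (S + J) joint correlation matrix
    S = sources.shape[1]
    cross = np.abs(full[:S, S:])                # S x J block of source-vs-latent correlations
    return np.nan_to_num(cross)                 # collapsed (constant) latent dims give NaN; treat as 0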

As shown in figure 3(c), in the situation where the number of generating dimensions was assumed to be unknown, the β-VAE did not recover latent variables that correlate well with each source feature. β-VAEs encourage encoded data to be distributed as a univariate Gaussian distribution which is factorisable along all its latent space axes. This is effective in situations where the number of latent variables is chosen to match fairly well the number of source features, as shown in figure 3(a), and even more so if such features are all present and continuously distributed in each example, as demonstrated by Higgins et al. (2017a). However, in a more general situation, where data is described by different superpositions of features and their number is not known a priori, enforcing a latent distribution that factorises along all axes does not result in good source feature recovery; the model forces data to stretch along many more dimensions than is necessary to describe it, dispersing the correlation with the true sources of variation.

Figure 3: Absolute value of correlation between source features (mouth, eyes, hat and bow tie) and latent variables for the Smiley data set, for (a) β-VAE with dz = 18, (b) VSC with dz = 18, (c) β-VAE with dz = 60 and (d) VSC with dz = 60. The x-axis indicates the index of the latent variable. Variables have been permuted to group dimensions that display the highest correlation with the same feature. Feature disentanglement is achieved if each latent variable correlates strongly with one source attribute but weakly with the remaining ones, thus leading to a block diagonal structure.

Conversely, as shown in figure 3(d), the VSC model is able to disentangle well the different feature groups. Because it is designed to model different combinations of features, it can activate or deactivate latent variables for each encoding according to the features recognised in each observation, and it successfully disentangles each attribute in distinct sub-spaces with little interdependence, regardless of the choice of latent space dimensionality. The VSC model adjusted to a suitable number of latent variables needed to represent the data. In the matrix of figure 3(d), the zero columns correspond to collapsed dimensions where the latent variables were consistently inactive. Given 60 latent dimensions prior to training and no knowledge of the source dimensionality or sparsity, the model collapsed to 20 exploited dimensions to describe data generated independently from 18 source variables. The number of latent variables assigned to each feature group is also recovered with fairly good accuracy, as the VSC correlation matrix in figure 3 presents a near-block diagonal structure. We also compared feature disentanglement with the β-TCVAE (Gao et al., 2019), and its behaviour at increasing latent space dimensionality was found to be analogous to that of the β-VAE. The correlation matrices from these experiments are shown in full in supplementary F.

It is not possible to quantify feature disentanglement in natural data, as the source features are not known. However, similarly to Kim & Mnih (2018), we can qualitatively examine the effect of changing single latent variables on generated samples. To this end, we train a VSC model with 100,000 examples from the CelebA data set, encode examples from a test set, individually alter exploited dimensions in the latent space and finally generate samples from these altered latent vectors. We find that several of the dimensions exploited by the VSC model control interpretable aspects in the generated data, as shown in the examples of figure 4.

Figure 4: Examples generated by altering individual latent variables in a VSC model trained on the CelebA data set, controlling lighting position, smile and fringe. Individual latent variables control interpretable aspects of the generated images, i.e., interpretable sources of variation are aligned with the representation space axes.

4.3 FEATURE ACTIVATION RECOVERY

The Spike probabilities retrieved when encoding an observation can be interpreted as the probabilities of certain recognised features being present or absent in a given sample. These activation patterns can be used to investigate what features different objects are expected to have in common. As an example, we train two VSC models having 60 latent dimensions with the training sets from the Fashion-MNIST and UCI HAR data sets respectively, encode the entire test sets and examine the average Spike probabilities per class recovered by the recognition model. Results are shown in figure 5. In the VSC latent space, similar classes present a high overlap of active latent variables, while classes that are very different exploit different sets of latent dimensions to describe their samples. These Spike variable activation patterns readily provide a visualisation of the similarity of different objects in terms of the factors of variation they are described by.

Figure 5: Average Spike probability per class in (a) Fashion-MNIST (T-shirts, trousers, jumpers, dresses, shirts, sandals, tops, shoes, bags and boots, spanning upper-body garments and footwear) and (b) UCI HAR (walking, walking upstairs, walking downstairs, sitting, standing and laying, spanning activity and inactivity), plotted as classes against the 60 latent variables. Black corresponds to 0, or always inactive, and white to 1, or always active. The Spike probabilities show the recovered features that classes have in common. Objects that are similar, such as T-shirts and shirts in the Fashion-MNIST example, activate mostly the same latent dimensions, i.e., they can largely be described with the same features. More distinct objects, such as T-shirts and trousers, activate different latent dimensions, i.e., they exploit different features to express their variability.

Figure 6: Generation conditioned on single observations with the VSC model. An object is encoded in the latent space and synthetic examples are then generated by sampling only along the activated latent dimensions. This makes it possible to generate a variety of objects that share the same features with the input object without any supervision.

The capability of VSC to recover feature exploitation for a given observation can also be used to control generation. For instance, it is possible to generate samples conditioned on a single object by exploring the subspace defined by the activated latent variables of that object. Figure 6 shows examples of images generated by encoding a single observation from Fashion-MNIST and subsequently sampling the subspace defined by the resulting active dimensions. As shown in figure 6, the VSC model naturally provides a way of performing conditional generation without the need for any supervision.
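A minimal sketch of this kind of conditional generation is given below; the activation threshold and the unit-Gaussian sampling of the Slab values on the active dimensions are illustrative choices, not taken from the paper, and the decoder network is assumed to be defined elsewhere.

import torch

def generate_sharing_features(decoder, gamma, n_samples=8, threshold=0.5):
    # Keep the activation pattern of one encoded observation (Spike probabilities gamma)
    active = (gamma > threshold).float()
    # Sample Slab values only on the active dimensions, zero everywhere else
    z = active * torch.randn(n_samples, gamma.shape[-1])
    return decoder(z)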

The VSC model can also be used to study the nature of the difference between objects through controlled generation. Two objects may have some active latent variables in common, describing characteristics that they both retain but that might have different values, and some that are different, describing features that they do not have in common. For instance, in figure 7, we consider the features of a T-shirt and a shirt taken from the Fashion-MNIST test set. The VSC latent variables for the two observations share some dimensions, which we individually alter and generate from. Examples are shown on the left of figure 7. As shown, these dimensions control features the two objects have in common. Conversely, there are some latent dimensions that are active for one observation, but not the other, and vice versa. These dimensions correspond to the recovered features the two objects do not share. The effects on generation of activating and deactivating two of these for the T-shirt example are shown on the right of figure 7.

Figure 7: Investigating the difference between two objects with VSC. Single examples of a T-shirt and a shirt are encoded in the latent space. On the left, generations obtained by individually altering two latent dimensions which are active for both objects (shoulder height and arm width). On the right, generations obtained by activating and deactivating latent dimensions exploited by one object but not the other (long sleeves, and buttons and collar).

All of the controlled generation examples described above are carried out without any supervision, but simply by examining and individually controlling the latent dimensions activated by single observation examples.

5 CONCLUSION

We present a new model to retrieve non-linear sparse representations of data through a variational auto-encoding approach. The proposed VSC model is capable of retrieving and disentangling sources of variation from diverse data, where attributes can be present and absent in different combinations and the total number of factors of variation and their occurrence is unknown a priori. The sparse representations also offer novel visualisation and generation capability, thanks to the ability to examine and exploit latent variable activation. In defining the VSC model, we also contribute general components and methods that can be used to perform sparse probabilistic inference in different settings, such as an analytical expression for the Spike and Slab KL divergence and a Spike pre-training strategy.

Acknowledgements: F.T. acknowledges support from Amazon Research and EPSRC grant EP/M01326X/1, B.S.J. from EPSRC grant EP/R018634/1, and R.M-S. from EPSRC grants EP/M01326X/1 and EP/R018634/1.


References

D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Proceedings of the International Workshop on Ambient Assisted Living, pp. 216–223. Springer, 2012.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1798–1828, 2013.

C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv:1804.03599, 2018.

T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.

A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Transactions on Neural Networks and Learning Systems, 28(12):2859–2871, 2017.

Y. Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.

S. Gao, R. Brekelmans, G. V. Steeg, and A. Galstyan. Auto-encoding total correlation explanation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019.

I. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with Spike-and-Slab sparse coding. In Proceedings of the International Conference on Machine Learning, 2012.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, 2017a.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, 2017b.

J. Ho, Y. Xie, and B. Vemuri. On a nonlinear generalization of sparse coding and dictionary learning. In Proceedings of the International Conference on Machine Learning, pp. 1480–1488, 2013.

H. Kim and A. Mnih. Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, pp. 2649–2658, 2018.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2013.

H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. Neural Computation, pp. 801–808, 2007.

C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.

E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In Proceedings of the International Conference on Machine Learning, pp. 4402–4412, 2019.

T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.

B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996a.

B. A. Olshausen and D. J. Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7(2):333–339, 1996b.

B. A. Olshausen and D. J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In Proceedings of the Advances in Neural Information Processing Systems, 2016.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.

J. T. Rolfe. Discrete variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.

R. Singh, J. Ling, and F. Doshi-Velez. Structured variational autoencoders for the Beta-Bernoulli process. In Proceedings of the Advances in Neural Information Processing Systems, 2017.

C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. How to train deep variational autoencoders and probabilistic ladder networks. In Proceedings of the International Conference on Machine Learning, 2016.

M. K. Titsias and M. Lazaro-Gredilla. Spike and Slab variational inference for multi-task and multiple kernel learning. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2339–2347, 2011.

J. M. Tomczak and M. Welling. VAE with a VampPrior. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, 2018.

A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei. Tackling over-pruning in variational autoencoders. International Conference on Machine Learning: Workshop on Principled Approaches to Deep Learning, 2017.