TRANSCRIPT
Deep Generative Models with Stick-Breaking Priors
Eric Nalisnick, University of California, Irvine
Outline
1. Overview of Deep Generative Models: Definition, Applications, Training
2. The Dirichlet Process: Definition, Stochastic Gradient Variational Bayes for the DP
3. Deep Generative Models with Stick-Breaking Priors: Stick-Breaking Variational Autoencoders, Dirichlet Process Variational Autoencoders, Stick-Breaking for Adaptive Computation
Overview of Deep Generative Models
“Generative modeling…denotes modeling tasks in which a density over all the observable quantities is constructed.”
David MacKay, Information Theory, Inference, and Learning Algorithms
Discriminative modeling: p(y | x). Generative modeling: p(x, y), p(x).
Deep Generative Models
Density Network (MacKay & Gibbs, 1997)
[Diagram: a stochastic latent variable Z passes through one or more neural network layers to produce the observed data X; in plate notation, each of the N observations X has a d-dimensional latent Z, with network parameters θ and prior parameters λ]
Sampling from a Deep Generative Model
[Diagram: draw ẑ from the prior, then propagate it through the network to generate x̂]
[Figures: ImageNet training images alongside samples from a DGM; text samples from a DGM (Bowman et al., 2016)]
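As a rough sketch of ancestral sampling (not code from the talk): draw ẑ from the prior and push it through the decoder network. The toy `decoder` and its weights below are hypothetical stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z, W1, b1, W2, b2):
    """Hypothetical two-layer decoder: maps latent z to Bernoulli pixel means."""
    h = np.tanh(z @ W1 + b1)                        # hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # pixel means in (0, 1)

# Toy (untrained) weights just to make the sketch runnable
d, h, p = 2, 16, 784                                # latent, hidden, pixel dims
params = (rng.normal(size=(d, h)), np.zeros(h),
          rng.normal(size=(h, p)), np.zeros(p))

z_hat = rng.standard_normal(d)        # 1. draw z-hat from the prior p(z)
x_mean = decoder(z_hat, *params)      # 2. push it through the network
x_hat = rng.random(p) < x_mean        # 3. sample pixels from p(x | z)
```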
Overview of Deep Generative Models
1.1 Why Generative Modeling?
Why Generative Modeling?
1. Cognitive motivations.
2. Semi-supervised learning: regularizing discriminative models p(y | x) with unlabeled data by modeling the joint p(x, y).
3. Simulating environments for reinforcement learning.
Overview of Deep Generative Models
1.2 How to Train a Deep Generative Model
Learning a Deep Generative Model
1. Importance Sampling
2. Variational Inference
3. Adversarial Training
(listed in order of increasing sophistication)
Learning a Deep Generative Model
1. Importance Sampling (MacKay & Gibbs, 1997)
Estimate the marginal likelihood with samples from the prior: p(x) = ∫ p(x | z) p(z) dz ≈ (1/S) Σ_s p(x | z_s), with z_s ~ p(z). Works only in low dimensions.
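A minimal sketch of that estimator, assuming a stub `log_likelihood` in place of a real decoder network; sampling z from the prior keeps the estimator simple, but its variance blows up in high dimensions, which is why the method works only in low dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(x, z):
    """Hypothetical log p(x | z) for a Bernoulli observation model (stub)."""
    mu = 1.0 / (1.0 + np.exp(-z.sum()))   # stand-in for a real network
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

def log_marginal_estimate(x, d=2, S=1000):
    # p(x) = E_{p(z)}[p(x | z)] ~= (1/S) sum_s p(x | z_s), z_s ~ p(z)
    zs = rng.standard_normal((S, d))
    lls = np.array([log_likelihood(x, z) for z in zs])
    m = lls.max()                          # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(lls - m)))

x = (rng.random(10) < 0.5).astype(float)
print(log_marginal_estimate(x))
```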
Learning a Deep Generative Model
2. Variational Inference
Introduce a posterior approximation q(z | x) and maximize the evidence lower bound, which splits into a data-reconstruction term and a regularization term:
log p(x) ≥ E_{q(z | x)}[log p(x | z)] − KL[q(z | x) || p(z)]
Problem: for neural network likelihoods, these integrals are analytically intractable.
Step #1: Use Monte Carlo approximations.
Problem: we still can't compute gradients w.r.t. the variational parameters.
Step #2: Sample via differentiable non-centered parametrizations (Kingma & Welling, 2014; Rezende et al., 2014).
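A minimal sketch of a differentiable non-centered parametrization for a Gaussian posterior: the sample is written as a deterministic function of the variational parameters and parameter-free noise, so gradients can flow through it (shown with PyTorch; the names are illustrative).

```python
import torch

mu = torch.zeros(5, requires_grad=True)         # variational mean
log_sigma = torch.zeros(5, requires_grad=True)  # variational log-scale

eps = torch.randn(5)                  # parameter-free noise, eps ~ N(0, I)
z = mu + torch.exp(log_sigma) * eps   # z ~ N(mu, sigma^2), differentiable

loss = (z ** 2).sum()                 # any downstream Monte Carlo objective
loss.backward()                       # gradients reach mu and log_sigma
print(mu.grad, log_sigma.grad)
```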
The Variational Autoencoder (VAE)
[Diagram: an inference network (encoder) produces the variational posterior over Z, and the generative network (decoder) reconstructs X]
Learning a Deep Generative Model
3. Adversarial Training (Goodfellow et al., 2014)
Train a classifier to predict whether x is generated or observed data. Two-step training:
1. Minimize the objective with respect to the generator's parameters.
2. Maximize it with respect to the classifier's (discriminator's) parameters.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014)
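A schematic of the two-step alternating updates, with small hypothetical generator and discriminator networks; this is a sketch of the standard (non-saturating) GAN recipe, not the authors' code.

```python
import torch
import torch.nn as nn

d_z, d_x = 8, 2
G = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d_x))
D = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    z = torch.randn(64, d_z)
    x_real = torch.randn(64, d_x) + 3.0        # stand-in "observed data"

    # 1. Minimize w.r.t. the generator: make D label fakes as real
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # 2. Maximize w.r.t. the discriminator (minimize its classification loss)
    d_loss = bce(D(x_real), torch.ones(64, 1)) + \
             bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```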
Variational Bayes vs. Adversarial Training
[Figure: side-by-side comparison of the VAE and GAN architectures]
Deep Generative Models Summary

Model                          | Training Method                                     | Auxiliary Model
(Original) Density Network     | Importance sampled estimate of marginal likelihood  | No auxiliary model
Variational Autoencoder        | Stochastic gradient variational Bayes               | Backend auxiliary model
Generative Adversarial Network | Adversarial training                                | Frontend auxiliary model
The Dirichlet Process
A draw G ~ DP(α, G₀) can be built by stick-breaking:
1. Draw the mixture weights π from the GEM distribution: v_k ~ Beta(1, α), π_k = v_k Π_{j<k} (1 − v_j).
2. Draw the atoms from the base distribution: θ_k ~ G₀, giving G = Σ_k π_k δ_{θ_k}.
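A minimal sketch of the truncated stick-breaking construction for the weights: each Beta draw breaks off a fraction of the stick that remains.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gem(alpha, T=50):
    """Truncated draw of mixture weights pi ~ GEM(alpha)."""
    v = rng.beta(1.0, alpha, size=T)   # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining               # pi_k = v_k * prod_{j<k} (1 - v_j)

pi = sample_gem(alpha=5.0)
print(pi[:5], pi.sum())  # weights decay; the sum approaches 1 as T grows
```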
SGVB for the Dirichlet Process
Two obstacles: 1. Gradients through the stick-breaking weights π_k. 2. Gradients through samples from the mixture.
Differentiating Through π_k
Obstacle: the Beta distribution does not have a non-centered parametrization (except in special cases).
Kumaraswamy distribution: a Beta-like distribution with a closed-form inverse CDF; use it as the variational posterior. (Named for Poondi Kumaraswamy, 1930-1988.)
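A minimal sketch of the Kumaraswamy inverse-CDF sampler: because the draw is a differentiable function of the shape parameters a and b, gradients reach the variational parameters, which is exactly what the Beta lacks in general (PyTorch; the parameter values are illustrative).

```python
import torch

def sample_kumaraswamy(a, b):
    """Draw v ~ Kumaraswamy(a, b) via its closed-form inverse CDF.

    CDF: F(v) = 1 - (1 - v^a)^b, so
    v = (1 - (1 - u)^(1/b))^(1/a) with u ~ Uniform(0, 1).
    """
    u = torch.rand_like(a)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# Variational parameters for T stick-breaking fractions
a = torch.ones(50, requires_grad=True)
b = torch.full((50,), 5.0, requires_grad=True)

v = sample_kumaraswamy(a, b)                      # differentiable draws
pi = v * torch.cumprod(
    torch.cat([torch.ones(1), 1.0 - v[:-1]]), 0)  # stick-breaking weights
pi.sum().backward()                               # gradients reach a and b
```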
Stochastic Backpropagation through Mixtures
Obstacle: it is not obvious how to use the reparametrization trick for samples from a mixture distribution.
Brute-force solution: summing over all components results in a tractable ELBO but requires O(T) decoder propagations, one per component of the truncated mixture.
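A minimal sketch of that brute-force marginalization under an assumed Bernoulli decoder: decode once per component and weight each reconstruction term by π_k, giving a tractable objective at the cost of T decoder passes.

```python
import torch
import torch.nn as nn

T, d, p = 10, 4, 20
decoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, p))

def bernoulli_log_lik(x, logits):
    return -nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum")

x = (torch.rand(p) < 0.5).float()        # toy observation
pi = torch.softmax(torch.randn(T), 0)    # stand-in mixture weights
z = torch.randn(T, d)                    # one reparametrized draw per component

# Sum out the discrete component: one decoder pass per component, O(T)
recon = sum(pi[k] * bernoulli_log_lik(x, decoder(z[k])) for k in range(T))
```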
Deep Generative Models with Stick-Breaking Priors
Why Combine the DP with DGMs?
[Diagram: the latent variable Z is drawn from a stochastic process and decoded to X]
1. Dynamic latent representations (width).
2. Neural networks with adaptive computation (depth).
3. Automatic and principled model selection (based on marginal likelihood).
3.1 The Stick-Breaking Variational Autoencoder
In collaboration with Padhraic Smyth
Deep Generative Models with Stick-Breaking Priors
MODEL #1: Stick-Breaking Variational Autoencoder (Nalisnick & Smyth, 2017)
[Diagram: the encoder outputs Kumaraswamy samples that are stick-broken into the latent weights; the posterior is truncated, which is not necessary but learns faster]
Samples from Generative Model (Nalisnick & Smyth, 2017)
[Figure: samples from the Stick-Breaking VAE (truncation level of 50 dimensions, Beta(1, 5) prior) alongside samples from a Gaussian VAE (50 dimensions, N(0, 1) prior)]
Quantitative Results for SB-VAE (Nalisnick & Smyth, 2017)
[Figures: t-SNE of the MNIST Dirichlet latent space vs. the MNIST Gaussian latent space; kNN classifier accuracy on the latent space; (estimated) marginal likelihoods; unsupervised and semi-supervised settings]
The semi-supervised model is a nonparametric version of the M2 model of (Kingma et al., NIPS 2014).
3.2 The Dirichlet Process Variational Autoencoder
In collaboration with Lars Hertel and Padhraic Smyth
Deep Generative Models with Stick-Breaking Priors
MODEL #2: Dirichlet Process Variational Autoencoder (Nalisnick et al., 2016)
Samples from Generative Model (Nalisnick et al., 2016)
[Figures: MNIST samples from component #1 and from component #5]
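A hedged sketch of ancestral sampling from a truncated mixture prior of the kind the DP-VAE places on the latent space; the component parameters below are hypothetical, and fixing `component` mimics drawing from a single component as in the figures above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_vae_latent(pi, mus, sigmas, component=None):
    """Draw z from a (truncated) mixture prior in latent space."""
    k = component if component is not None else rng.choice(len(pi), p=pi)
    return k, rng.normal(mus[k], sigmas[k])

# Hypothetical learned parameters, truncated at T = 5 components
T, d = 5, 2
pi = np.full(T, 1.0 / T)
mus, sigmas = rng.normal(size=(T, d)), np.ones((T, d))

k, z = sample_dp_vae_latent(pi, mus, sigmas, component=0)  # "component #1"
# z would then be decoded to produce an MNIST-like sample
```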
Quantitative Results for DP-VAE (Nalisnick et al., 2016)
[Figures: t-SNE of the MNIST Dirichlet process latent space vs. the MNIST Gaussian latent space; kNN classifier accuracy on the latent space; (estimated) marginal likelihoods]
3.3 Adaptive Computation Time via Stick-Breaking
In collaboration with Marc-Alexandre Côté and Hugo Larochelle
Deep Generative Models with Stick-Breaking Priors
MODEL #3: Stick-Breaking Adaptive Computation
[Diagram: the network reads x and emits outputs y₁, y₂, …, y_k with stick-breaking weights π₁, π₂, …, π_k; computation stops once the remaining stick is negligibly small]
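A rough sketch of the halting rule (illustrative names, not the authors' code): keep emitting outputs and stick fractions until the remaining stick mass falls below a threshold, then return the π-weighted combination.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_forward(x, step_fn, eps=0.01, max_steps=20):
    """Run computation steps until the remaining stick is negligible."""
    remaining, outputs, weights = 1.0, [], []
    h = np.zeros(8)                        # recurrent state
    for _ in range(max_steps):
        h, y, v = step_fn(x, h)            # output y_k and stick fraction v_k
        weights.append(remaining * v)      # pi_k = v_k * remaining stick
        outputs.append(y)
        remaining *= (1.0 - v)
        if remaining < eps:                # halt: leftover stick is tiny
            break
    pi, ys = np.array(weights), np.array(outputs)
    return (pi[:, None] * ys).sum(0) / pi.sum()  # pi-weighted prediction

# Toy step function standing in for a trained RNN cell
def step_fn(x, h):
    h = np.tanh(h + x.mean())
    return h, h.copy(), 0.3                # fixed fraction for illustration

print(adaptive_forward(rng.normal(size=4), step_fn))
```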
Quantitative Results for SB-RNN: Image Modeling
Quantitative Results for SB-RNN: Text Modeling
Conclusions
1. Superior latent spaces with nonparametric priors: they seem to preserve class structure well, resulting in better discriminative properties, and their dynamic capacity naturally encodes factors of variation.
2. Adaptive computation: fusing NNs with nonparametric priors endows NNs with natural model selection properties.
Thank you. Questions?
Appendix