Probabilistic modelling for NLP powered by deep learning

Wilker Aziz
University of Amsterdam

April 3, 2018

Outline

1 Deep generative models

2 Variational inference

3 Variational auto-encoder

Problems

Supervised problems
"learn a distribution over observed data"
sentences in natural language
images, ...

Unsupervised problems
"learn a distribution over observed and unobserved data"
sentences in natural language + parse trees
images + bounding boxes, ...

Supervised problems

We have data x^{(1)}, ..., x^{(N)}, e.g. sentences, images, ...

generated by some unknown procedure,
which we assume can be captured by a probabilistic model
with a known probability (mass/density) function, e.g.

X \sim \mathrm{Cat}(\pi_1, \ldots, \pi_K)  (e.g. nationality)    or    X \sim \mathcal{N}(\mu, \sigma^2)  (e.g. height)

We estimate parameters that assign maximum likelihood to the observations.

Multiple problems, same language

(Conditional) density estimation

Task         Side information (φ)     Observation (x)
Parsing      a sentence               parse tree
Translation  a sentence in French     translation in English
Captioning   an image                 caption in English
Entailment   a text and hypothesis    entailment relation

Where does deep learning kick in?

Let φ be all side information available, e.g. deterministic inputs/features.

Have neural networks predict the parameters of our probabilistic model

X \mid \phi \sim \mathrm{Cat}(\pi_\theta(\phi))    or    X \mid \phi \sim \mathcal{N}(\mu_\theta(\phi), \sigma_\theta(\phi)^2)

and proceed to estimate the parameters θ of the NNs.
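To make the parametrisation concrete, here is a minimal PyTorch-style sketch (not part of the original slides) in which a small feed-forward network maps side information φ to the parameters of a Categorical distribution; the feature dimension, hidden size and number of classes are made-up assumptions.

import torch
import torch.nn as nn
from torch.distributions import Categorical

D_PHI, K = 10, 5                                   # assumed feature dimension and number of classes

class CategoricalModel(nn.Module):
    """NN that maps side information phi to the parameters pi_theta(phi)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_PHI, 32), nn.Tanh(), nn.Linear(32, K))

    def forward(self, phi):
        logits = self.net(phi)                     # unconstrained scores produced by the NN
        return Categorical(logits=logits)          # X | phi ~ Cat(pi_theta(phi))

model = CategoricalModel()
phi = torch.randn(3, D_PHI)                        # a batch of 3 made-up feature vectors
x = torch.tensor([0, 2, 4])                        # observed classes
log_px = model(phi).log_prob(x)                    # log p(x | phi, theta), one value per example

The Gaussian case is analogous: the network would output μ_θ(φ) and a positive σ_θ(φ) instead of logits.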

NN as efficient parametrisation

From the statistical point of view, NNs do not generate data:

they parametrise distributions that, by assumption, govern the data
they are a compact and efficient way to map from complex side information to parameter space

Prediction is done by a decision rule outside the statistical model,
e.g. beam search.

Maximum likelihood estimation

Let p(x | θ) be the probability of an observation x, and let θ refer to all of its parameters, e.g. the parameters of the NNs involved.

Given a dataset x^{(1)}, ..., x^{(N)} of i.i.d. observations, the likelihood function

\mathcal{L}(\theta \mid x^{(1:N)}) = \log \prod_{s=1}^{N} p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta)

quantifies the fit of our model to the data.
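As a small hedged sketch (not from the slides), the log-likelihood of i.i.d. observations under a univariate Gaussian model is just the sum of per-observation log-densities; the data below are made up.

import torch
from torch.distributions import Normal

x = torch.tensor([1.62, 1.75, 1.80, 1.68])         # made-up i.i.d. observations (e.g. heights)
mu = torch.tensor(1.70, requires_grad=True)         # theta = (mu, log_sigma)
log_sigma = torch.tensor(0.0, requires_grad=True)

def log_likelihood(x, mu, log_sigma):
    # L(theta | x^(1:N)) = sum_s log p(x^(s) | theta)
    return Normal(mu, log_sigma.exp()).log_prob(x).sum()

print(log_likelihood(x, mu, log_sigma))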

MLE via gradient-based optimisation

If the log-likelihood is differentiable and tractable to evaluate, then backpropagation gives us its gradient

\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)}) = \nabla_\theta \sum_{s=1}^{N} \log p(x^{(s)} \mid \theta) = \sum_{s=1}^{N} \nabla_\theta \log p(x^{(s)} \mid \theta)

and we can update θ in the direction

\gamma \, \nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)})

to attain a local optimum of the likelihood function.

Stochastic optimisation

We can also use a gradient estimate

\nabla_\theta \mathcal{L}(\theta \mid x^{(1:N)})
  = \nabla_\theta \, \underbrace{\mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[ N \log p(x^{(S)} \mid \theta) \right]}_{\mathcal{L}(\theta \mid x^{(1:N)})}
  = \underbrace{\mathbb{E}_{S \sim \mathcal{U}(1..N)}\left[ N \nabla_\theta \log p(x^{(S)} \mid \theta) \right]}_{\text{expected gradient :)}}
  \overset{\text{MC}}{\approx} \frac{1}{M} \sum_{m=1}^{M} N \nabla_\theta \log p(x^{(s_m)} \mid \theta), \qquad S_m \sim \mathcal{U}(1..N)

and take steps in the direction

\gamma \frac{N}{M} \nabla_\theta \mathcal{L}(\theta \mid x^{(s_1:s_M)})

where x^{(s_1)}, ..., x^{(s_M)} is a random mini-batch of size M.
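Below is a hedged sketch of this recipe for the toy Gaussian model above: the mini-batch estimate of the gradient is obtained by backprop through the rescaled mini-batch log-likelihood, and a generic optimiser takes the step. The synthetic data, batch size and learning rate are arbitrary choices for illustration.

import torch
from torch.distributions import Normal

N, M = 1000, 32                                    # dataset size and mini-batch size (assumed)
data = 1.7 + 0.1 * torch.randn(N)                  # synthetic observations
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([mu, log_sigma], lr=1e-2)    # gamma = 1e-2

for step in range(2000):
    idx = torch.randint(0, N, (M,))                # S_m ~ U(1..N)
    x = data[idx]
    # (N/M) * sum_m log p(x^(s_m) | theta) is an unbiased estimate of L(theta | x^(1:N))
    loglik_hat = (N / M) * Normal(mu, log_sigma.exp()).log_prob(x).sum()
    loss = -loglik_hat                             # minimise the negative log-likelihood
    opt.zero_grad()
    loss.backward()                                # backprop gives the stochastic gradient
    opt.step()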

DL in NLP recipe

Maximum likelihood estimation
tells you which loss to optimise (i.e. negative log-likelihood)

Automatic differentiation (backprop)
chain rule of derivatives: "give me a tractable forward pass and I will give you gradients"

Stochastic optimisation powered by backprop
general-purpose gradient-based optimisers

Tractability is central

Likelihood gives us a differentiable objective to optimise for

but we need to stick with tractable likelihood functions

When do we have intractable likelihood?

Unsupervised problems contain unobserved random variables

p_\theta(x, z) = \overbrace{p(z) \, \underbrace{p_\theta(x \mid z)}_{\text{observation model}}}^{\text{latent variable model}}

thus assessing the marginal likelihood requires marginalising the latent variables

p_\theta(x) = \int p_\theta(x, z) \, dz = \int p(z) \, p_\theta(x \mid z) \, dz

Examples of latent variable models

Discrete latent variable, continuous observation: too many forward passes

p_\theta(x) = \sum_{c=1}^{K} \mathrm{Cat}(c \mid \pi_1, \ldots, \pi_K) \, \underbrace{\mathcal{N}(x \mid \mu_\theta(c), \sigma_\theta(c)^2)}_{\text{forward pass}}

Continuous latent variable, discrete observation: infinitely many forward passes

p_\theta(x) = \int \mathcal{N}(z \mid 0, I) \, \underbrace{\mathrm{Cat}(x \mid \pi_\theta(z))}_{\text{forward pass}} \, dz
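The first case can still be handled exactly: with K latent classes the marginal is a finite mixture, so K forward passes and a log-sum-exp suffice. The sketch below uses an illustrative architecture and made-up sizes (not from the slides); no analogous finite sum exists for the continuous case.

import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

K = 10                                             # number of latent classes (assumed)
pi = torch.full((K,), 1.0 / K)                     # Cat(pi_1, ..., pi_K)
mu_net = nn.Sequential(nn.Embedding(K, 16), nn.Linear(16, 1))      # c -> mu_theta(c)
sigma_net = nn.Sequential(nn.Embedding(K, 16), nn.Linear(16, 1))   # c -> log sigma_theta(c)

def log_marginal(x):
    # p_theta(x) = sum_c Cat(c | pi) N(x | mu_theta(c), sigma_theta(c)^2): K forward passes
    cs = torch.arange(K)
    mu = mu_net(cs).squeeze(-1)
    sigma = sigma_net(cs).squeeze(-1).exp()
    log_joint = Categorical(probs=pi).log_prob(cs) + Normal(mu, sigma).log_prob(x)
    return torch.logsumexp(log_joint, dim=-1)      # exact marginalisation over c

# For a continuous z there is no such finite sum: the integral is intractable in general.
print(log_marginal(torch.tensor(0.3)))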

Deep generative models

Joint distribution with a deep observation model

p_\theta(x, z) = \underbrace{p(z)}_{\text{prior}} \, \underbrace{p_\theta(x \mid z)}_{\text{likelihood}}

The mapping from the latent variable z to p(x | z) is a NN with parameters θ.

Marginal likelihood (or evidence)

p_\theta(x) = \int p_\theta(x, z) \, dz = \int p(z) \, p_\theta(x \mid z) \, dz

intractable in general

Gradient

The exact gradient is intractable:

\nabla_\theta \log p_\theta(x)
  = \nabla_\theta \log \underbrace{\int p_\theta(x, z) \, dz}_{\text{marginal}}
  = \frac{1}{\int p_\theta(x, z) \, dz} \int \nabla_\theta p_\theta(x, z) \, dz \qquad \text{(chain rule)}
  = \frac{1}{p_\theta(x)} \int p_\theta(x, z) \, \nabla_\theta \log p_\theta(x, z) \, dz \qquad \text{(log-identity for derivatives)}
  = \int \underbrace{\frac{p_\theta(x, z)}{p_\theta(x)}}_{\text{posterior}} \nabla_\theta \log p_\theta(x, z) \, dz
  = \int p_\theta(z \mid x) \, \nabla_\theta \log p_\theta(x, z) \, dz
  = \underbrace{\mathbb{E}_{p_\theta(z \mid x)}\left[ \nabla_\theta \log p_\theta(x, Z) \right]}_{\text{expected gradient :)}}

Can we get an estimate?

\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\left[ \nabla_\theta \log p_\theta(x, Z) \right]
  \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x, z^{(k)}), \qquad z^{(k)} \sim p_\theta(Z \mid x)

The MC estimate of the gradient requires sampling from the posterior

p_\theta(z \mid x) = \frac{p(z) \, p_\theta(x \mid z)}{p_\theta(x)}

which is unavailable due to the intractability of the marginal.

Summary

We like probabilistic models because we can make explicit modelling assumptions

We want complex observation models parameterised by NNs

But we cannot use backprop for parameter estimation

We need approximate inference techniques!

Outline

2 Variational inference

The Basic Problem

The marginal likelihood

p(x) = \int p(x, z) \, dz

is generally intractable, which prevents us from computing quantities that depend on the posterior p(z | x):

e.g. gradients in MLE
e.g. the predictive distribution in Bayesian modelling

Strategy

Accept that p(z | x) is not computable:

approximate it by an auxiliary distribution q(z | x) that is computable
choose q(z | x) as close as possible to p(z | x) to obtain a faithful approximation

Evidence lower bound

\log p(x) = \log \int p(x, z) \, dz
  = \log \int q(z \mid x) \, \frac{p(x, z)}{q(z \mid x)} \, dz
  = \log \left( \mathbb{E}_{q(z \mid x)}\left[ \frac{p(x, Z)}{q(Z \mid x)} \right] \right)
  \geq \underbrace{\mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(x, Z)}{q(Z \mid x)} \right]}_{\text{ELBO}} \qquad \text{(Jensen's inequality)}
  = \mathbb{E}_{q(z \mid x)}\left[ \log p(x, Z) \right] - \mathbb{E}_{q(z \mid x)}\left[ \log q(Z \mid x) \right]
  = \mathbb{E}_{q(z \mid x)}\left[ \log p(x, Z) \right] + \mathbb{H}\left( q(z \mid x) \right)
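Because the ELBO is an expectation under q, it can be estimated by sampling from q alone, without touching the posterior. A toy sketch (every distribution here is an illustrative assumption, not from the slides): p(z) = N(0, 1), p(x | z) = N(z, 1), and q(z | x) = N(m, s^2).

import torch
from torch.distributions import Normal

x = torch.tensor(2.0)                              # a single made-up observation
m, s = torch.tensor(1.0), torch.tensor(0.8)        # parameters of q(z | x) (assumed)

def elbo_estimate(x, m, s, num_samples=1000):
    q = Normal(m, s)
    z = q.sample((num_samples,))
    log_joint = Normal(0.0, 1.0).log_prob(z) + Normal(z, 1.0).log_prob(x)   # log p(x, z)
    # ELBO = E_q[ log p(x, Z) - log q(Z | x) ], estimated by a Monte Carlo average
    return (log_joint - q.log_prob(z)).mean()

print(elbo_estimate(x, m, s))                      # a lower bound on log p(x)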

An approximate posterior

\log p(x) \geq \underbrace{\mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(x, Z)}{q(Z \mid x)} \right]}_{\text{ELBO}}
  = \mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(Z \mid x) \, p(x)}{q(Z \mid x)} \right]
  = \mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(Z \mid x)}{q(Z \mid x)} \right] + \underbrace{\log p(x)}_{\text{constant}}
  = - \underbrace{\mathbb{E}_{q(z \mid x)}\left[ \log \frac{q(Z \mid x)}{p(Z \mid x)} \right]}_{\mathrm{KL}\left( q(z \mid x) \,\|\, p(z \mid x) \right)} + \log p(x)

We have derived a lower bound on the log-evidence whose gap is exactly KL(q(z | x) || p(z | x)).

Variational Inference

Objective

\max_{q(z \mid x)} \; \mathbb{E}_{q(z \mid x)}\left[ \log p(x, Z) \right] + \mathbb{H}\left( q(z \mid x) \right)

The ELBO is a lower bound on log p(x).

Blei et al. (2016)

Mean field assumption

Suppose we have N latent variables:
assume the posterior factorises into N independent terms,
each with an independent set of parameters

q(z_1, \ldots, z_N) = \underbrace{\prod_{i=1}^{N} q_{\lambda_i}(z_i)}_{\text{mean field}}

Amortised variational inference

Amortise the cost of inference using NNs

q(z_1, \ldots, z_N \mid x_1, \ldots, x_N) = \prod_{i=1}^{N} q_\lambda(z_i \mid x_i)

with a shared set of parameters λ, e.g.

Z \mid x \sim \mathcal{N}(\underbrace{\mu_\lambda(x), \sigma_\lambda(x)^2}_{\text{inference network}})
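A minimal sketch of such an inference network in PyTorch (the sizes and architecture are assumptions): a single network with parameters λ maps any x to the mean and standard deviation of q_λ(z | x).

import torch
import torch.nn as nn
from torch.distributions import Normal

D_X, D_Z = 50, 8                                   # assumed data and latent dimensionalities

class InferenceNetwork(nn.Module):
    """Shared parameters lambda: every x_i is mapped to its own q_lambda(z_i | x_i)."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(D_X, 64), nn.ReLU())
        self.mu = nn.Linear(64, D_Z)
        self.log_sigma = nn.Linear(64, D_Z)

    def forward(self, x):
        h = self.hidden(x)
        # Z | x ~ N(mu_lambda(x), sigma_lambda(x)^2)
        return Normal(self.mu(h), self.log_sigma(h).exp())

q = InferenceNetwork()(torch.randn(16, D_X))       # one approximate posterior per data point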

Outline

3 Variational auto-encoder

Variational auto-encoder

Generative model with NN likelihood

[Graphical model: a plate over the N data points, with latent variable z generating observation x; θ parametrises the generative model and λ the inference model.]

complex (non-linear) observation model p_θ(x | z)
complex (non-linear) mapping from data to latent variables q_λ(z | x)

Jointly optimise the generative model p_θ(x | z) and the inference model q_λ(z | x) under the same objective (ELBO).

Kingma and Welling (2013)

Objective

\log p_\theta(x) \geq \overbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x, Z) \right] + \mathbb{H}\left( q_\lambda(z \mid x) \right)}^{\text{ELBO}}
  = \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid Z) + \log p(Z) \right] + \mathbb{H}\left( q_\lambda(z \mid x) \right)
  = \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid Z) \right] - \mathrm{KL}\left( q_\lambda(z \mid x) \,\|\, p(z) \right)

Parameter estimation

\arg\max_{\theta, \lambda} \; \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid Z) \right] - \mathrm{KL}\left( q_\lambda(z \mid x) \,\|\, p(z) \right)

assume KL(q_λ(z | x) || p(z)) is analytical (true for exponential families)
approximate E_{q_λ(z | x)}[log p_θ(x | Z)] by sampling (true because we design q_λ(z | x) to be simple)
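Putting the two observations together, one ELBO evaluation might look like the following sketch. The decoder and inference network are assumed to return torch.distributions objects, as in the earlier sketches; this is an illustration, not the slides' exact model. The KL term uses the analytical formula and the expectation uses a single sample; how to get gradients of that sample with respect to λ is the subject of the next slides.

import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, decoder, inference_net):
    # decoder(z) and inference_net(x) are assumed to return torch.distributions objects.
    q = inference_net(x)                                    # q_lambda(z | x)
    prior = Normal(torch.zeros_like(q.loc), torch.ones_like(q.scale))
    z = q.sample()                                          # one sample (gradients wrt lambda: see below)
    reconstruction = decoder(z).log_prob(x).sum(-1)         # MC estimate of E_q[log p_theta(x | Z)]
    kl = kl_divergence(q, prior).sum(-1)                    # analytical KL(q_lambda(z | x) || p(z))
    return (reconstruction - kl).mean()                     # averaged over the batch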

Generative Network Gradient

\frac{\partial}{\partial \theta}\left( \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - \overbrace{\mathrm{KL}\left( q_\lambda(z \mid x) \,\|\, p(z) \right)}^{\text{constant wrt } \theta} \right)
  = \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[ \frac{\partial}{\partial \theta} \log p_\theta(x \mid z) \right]}_{\text{expected gradient :)}}
  \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \frac{\partial}{\partial \theta} \log p_\theta(x \mid z^{(k)}), \qquad z^{(k)} \sim q_\lambda(Z \mid x)

Note: q_λ(z | x) does not depend on θ.

Inference Network Gradient

\frac{\partial}{\partial \lambda}\left( \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - \overbrace{\mathrm{KL}\left( q_\lambda(z \mid x) \,\|\, p(z) \right)}^{\text{analytical}} \right)
  = \frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - \underbrace{\frac{\partial}{\partial \lambda} \mathrm{KL}\left( q_\lambda(z \mid x) \,\|\, p(z) \right)}_{\text{analytical computation}}

The first term again requires approximation by sampling, but there is a problem.

Inference Network Gradient

\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  = \frac{\partial}{\partial \lambda} \int q_\lambda(z \mid x) \log p_\theta(x \mid z) \, dz
  = \underbrace{\int \frac{\partial}{\partial \lambda}\left( q_\lambda(z \mid x) \right) \log p_\theta(x \mid z) \, dz}_{\text{not an expectation}}

MC estimator is non-differentiable: cannot sample first.
Differentiating the expression does not yield an expectation: cannot approximate via MC.

Score function estimator

We can again use the log identity for derivatives:

\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  = \frac{\partial}{\partial \lambda} \int q_\lambda(z \mid x) \log p_\theta(x \mid z) \, dz
  = \underbrace{\int \frac{\partial}{\partial \lambda}\left( q_\lambda(z \mid x) \right) \log p_\theta(x \mid z) \, dz}_{\text{not an expectation}}
  = \int q_\lambda(z \mid x) \frac{\partial}{\partial \lambda}\left( \log q_\lambda(z \mid x) \right) \log p_\theta(x \mid z) \, dz
  = \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \frac{\partial}{\partial \lambda} \log q_\lambda(z \mid x) \right]}_{\text{expected gradient :)}}

Score function estimator: high variance

We can now build an MC estimator

\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  = \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \frac{\partial}{\partial \lambda} \log q_\lambda(z \mid x) \right]
  \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x \mid z^{(k)}) \frac{\partial}{\partial \lambda} \log q_\lambda(z^{(k)} \mid x), \qquad z^{(k)} \sim q_\lambda(Z \mid x)

but

the magnitude of log p_θ(x | z) varies widely
the model likelihood does not contribute to the direction of the gradient
too much variance to be useful
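For completeness, here is a hedged sketch of this estimator on a toy model (p(x | z) = N(x; z, 1) and q_λ(z | x) = N(m, s^2) are assumptions, not the slides' model). The surrogate objective below is built so that autograd's gradient with respect to λ is exactly the score-function estimator; running it a few times shows how noisy the resulting gradients are.

import torch
from torch.distributions import Normal

x = torch.tensor(2.0)
m = torch.tensor(0.0, requires_grad=True)          # lambda = (m, log_s)
log_s = torch.tensor(0.0, requires_grad=True)

K = 64
q = Normal(m, log_s.exp())
z = q.sample((K,))                                 # sampling itself is not differentiable wrt lambda,
log_px = Normal(z, 1.0).log_prob(x)                # log p(x | z^(k)) for the toy likelihood
# so we build a surrogate whose gradient wrt lambda is the score-function estimator
# (1/K) sum_k log p(x | z^(k)) * d/dlambda log q(z^(k) | x):
surrogate = (log_px.detach() * q.log_prob(z)).mean()
surrogate.backward()
print(m.grad, log_s.grad)                          # high-variance estimates of the true gradient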

When variance is high we can

sample more: won't scale

use variance reduction techniques (e.g. baselines and control variates): excellent idea, but not just yet

stare at  \frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda(z \mid x)}\left[ \log p_\theta(x \mid z) \right]  until we find a way to rewrite the expectation in terms of a density that does not depend on λ

Reparametrisation

Find a transformation h : z ↦ ε that expresses z through a random variable ε such that q(ε) does not depend on λ

h(z, λ) needs to be invertible

h(z, λ) needs to be differentiable

Invertibility implies

h(z, λ) = ε and h⁻¹(ε, λ) = z

(Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lazaro-Gredilla, 2014)


Gaussian Transformation

If Z ∼ N(µλ(x), σλ(x)²) then
\[
h(z, \lambda) = \frac{z - \mu_\lambda(x)}{\sigma_\lambda(x)} = \varepsilon \sim \mathcal{N}(0, I)
\]
\[
h^{-1}(\varepsilon, \lambda) = \mu_\lambda(x) + \sigma_\lambda(x) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
\]
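A quick sanity check of this pair of transformations in PyTorch; `mu` and `sigma` stand in for µλ(x) and σλ(x), and the concrete values below are placeholders.

```python
# Gaussian reparametrisation h and its inverse: a minimal sketch.
import torch

def h(z, mu, sigma):
    """h(z, lambda): standardise z, giving eps ~ N(0, I)."""
    return (z - mu) / sigma

def h_inv(eps, mu, sigma):
    """h^{-1}(eps, lambda) = mu + sigma * eps, differentiable in mu and sigma."""
    return mu + sigma * eps

mu = torch.zeros(4)                           # placeholder for mu_lambda(x)
sigma = torch.ones(4)                         # placeholder for sigma_lambda(x)
eps = torch.randn(4)                          # eps ~ N(0, I), independent of lambda
z = h_inv(eps, mu, sigma)                     # a sample from N(mu_lambda(x), sigma_lambda(x)^2)
assert torch.allclose(h(z, mu, sigma), eps)   # invertibility: h(h^{-1}(eps, lambda), lambda) = eps
```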


Inference Network – Reparametrised Gradient

\[
\begin{aligned}
\frac{\partial}{\partial\lambda}\,\mathbb{E}_{q_\lambda(z|x)}\!\left[\log p_\theta(x|z)\right]
&= \frac{\partial}{\partial\lambda}\int q_\lambda(z|x)\,\log p_\theta(x|z)\,\mathrm{d}z\\
&= \frac{\partial}{\partial\lambda}\int q(\varepsilon)\,\log p_\theta\big(x \mid \underbrace{h^{-1}(\varepsilon,\lambda)}_{=z}\big)\,\mathrm{d}\varepsilon\\
&= \int q(\varepsilon)\,\frac{\partial}{\partial\lambda}\log p_\theta\big(x \mid h^{-1}(\varepsilon,\lambda)\big)\,\mathrm{d}\varepsilon\\
&= \underbrace{\mathbb{E}_{q(\varepsilon)}\!\left[\frac{\partial}{\partial\lambda}\log p_\theta\big(x \mid h^{-1}(\varepsilon,\lambda)\big)\right]}_{\text{expected gradient :D}}
\end{aligned}
\]


Reparametrised gradient estimate

\[
\begin{aligned}
\frac{\partial}{\partial\lambda}\,\mathbb{E}_{q_\lambda(z|x)}\!\left[\log p_\theta(x|z)\right]
&= \underbrace{\mathbb{E}_{q(\varepsilon)}\!\left[\frac{\partial}{\partial\lambda}\log p_\theta\big(x \mid h^{-1}(\varepsilon,\lambda)\big)\right]}_{\text{expected gradient :D}}\\
&= \mathbb{E}_{q(\varepsilon)}\Big[\underbrace{\frac{\partial}{\partial z}\log p_\theta\big(x \mid \overbrace{h^{-1}(\varepsilon,\lambda)}^{=z}\big)\times\frac{\partial}{\partial\lambda}h^{-1}(\varepsilon,\lambda)}_{\text{chain rule}}\Big]\\
&\overset{\text{MC}}{\approx} \frac{1}{K}\sum_{k=1}^{K}\underbrace{\frac{\partial}{\partial z}\log p_\theta\big(x \mid h^{-1}(\varepsilon^{(k)},\lambda)\big)\times\frac{\partial}{\partial\lambda}h^{-1}(\varepsilon^{(k)},\lambda)}_{\text{backprop's job}},
\qquad \varepsilon^{(k)} \sim q(\varepsilon)
\end{aligned}
\]

Note that both models contribute gradients: θ through log pθ(x|z) and λ through h⁻¹(ε, λ)
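In code the estimator amounts to sampling through a differentiable path. A minimal sketch with the same hypothetical `encoder`/`decoder` modules as before; for a Gaussian, `rsample` draws ε ∼ N(0, I) and returns µ + σ ⊙ ε, i.e. exactly h⁻¹(ε, λ).

```python
# Reparametrised (pathwise) gradient estimate: a minimal sketch.
import torch
import torch.distributions as td

def reparam_surrogate(x, encoder, decoder, K=1):
    mu, sigma = encoder(x)            # lambda = encoder parameters
    q = td.Normal(mu, sigma)          # q_lambda(z|x)
    z = q.rsample((K,))               # z = mu + sigma * eps, eps ~ N(0, I): gradients flow to lambda
    return decoder(x, z).mean()       # MC average of log p_theta(x|z^(k))
```

A single .backward() on this surrogate realises the 1/K estimate above; automatic differentiation handles both factors of the chain rule, and the decoder parameters θ receive gradients in the very same pass.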


Gaussian KL

ELBO
\[
\mathbb{E}_{q_\lambda(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left(q_\lambda(z|x)\,\|\,p(z)\right)
\]

Analytical computation of −KL(qλ(z|x) || p(z)):
\[
-\mathrm{KL}\!\left(q_\lambda(z|x)\,\|\,p(z)\right) = \frac{1}{2}\sum_{i=1}^{d}\left(1 + \log\sigma_i^2 - \mu_i^2 - \sigma_i^2\right)
\]

Thus backprop will compute −∂/∂λ KL(qλ(z|x) || p(z)) for us
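A minimal sketch of this closed form; parameterising the inference network by log σ² is an assumption made here for numerical stability, not something the slides prescribe.

```python
# Analytic -KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian posterior.
import torch

def neg_kl_to_standard_normal(mu, log_var):
    # 1/2 * sum_i (1 + log sigma_i^2 - mu_i^2 - sigma_i^2), summed over the latent dimension
    return 0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```

Adding this term to the reparametrised reconstruction surrogate gives a fully differentiable ELBO, so backprop indeed delivers −∂/∂λ KL(qλ(z|x) || p(z)) as a by-product.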


Computation Graph

[Figure: VAE computation graph. The inference model (parameters λ) maps the input x to µ and σ; noise ε ∼ N(0, 1) is combined with them to give z; the generative model (parameters θ) maps z to a reconstruction of x; KL terms are attached to µ and σ.]
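Putting the graph together, a minimal end-to-end sketch; the layer sizes, tanh activations and the Bernoulli likelihood over a binary x are illustrative assumptions rather than the slides' architecture.

```python
# A minimal Gaussian VAE wired up as in the graph above: a sketch, not the slides'
# exact architecture. Layer sizes, tanh activations and the Bernoulli likelihood
# (x assumed to be a binary vector) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.distributions as td

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        # inference model (parameters lambda): x -> mu, sigma
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # generative model (parameters theta): z -> x
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))

    def forward(self, x):                                  # x: [batch, x_dim], binary
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                         # eps ~ N(0, I)
        z = mu + (0.5 * log_var).exp() * eps               # reparametrisation: z = mu + sigma * eps
        logits = self.dec(z)
        log_px_z = td.Bernoulli(logits=logits).log_prob(x).sum(-1)
        neg_kl = 0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
        return log_px_z + neg_kl                           # ELBO per example
```

Training is then ordinary stochastic gradient ascent on the returned ELBO (in practice, gradient descent on its negation).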


Example

[Figure: graphical model with latent variable z, observed sentence x_1^m, generative parameters θ, variational parameters λ, and a plate over N sentences.]

Generative model
\[
Z \sim \mathcal{N}(0, I), \qquad X_i \mid z, x_{<i} \sim \mathrm{Cat}\big(f_\theta(z, x_{<i})\big)
\]

Inference model
\[
Z \sim \mathcal{N}\big(\mu_\lambda(x_1^m), \sigma_\lambda(x_1^m)^2\big)
\]

(Bowman et al., 2016)
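A minimal sketch of such a sentence VAE in the spirit of Bowman et al. (2016); the vocabulary size, single-layer LSTMs, and the way z initialises the decoder state are illustrative assumptions rather than the paper's exact architecture.

```python
# Sentence VAE: Z ~ N(0, I), X_i | z, x_<i ~ Cat(f_theta(z, x_<i)). A minimal sketch.
import torch
import torch.nn as nn
import torch.distributions as td

class SentenceVAE(nn.Module):
    def __init__(self, vocab=10000, emb=128, hid=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)   # inference model reads x_1..x_m
        self.to_mu = nn.Linear(hid, z_dim)
        self.to_logvar = nn.Linear(hid, z_dim)
        self.z_to_h = nn.Linear(z_dim, hid)                   # z initialises the decoder state
        self.decoder = nn.LSTM(emb, hid, batch_first=True)    # conditions on z and x_<i
        self.out = nn.Linear(hid, vocab)

    def forward(self, x):                                     # x: [batch, m] token ids
        e = self.embed(x)
        _, (h_enc, _) = self.encoder(e)
        mu, log_var = self.to_mu(h_enc[-1]), self.to_logvar(h_enc[-1])
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu) # reparametrised sample of Z
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)          # [1, batch, hid]
        c0 = torch.zeros_like(h0)
        h_dec, _ = self.decoder(e[:, :-1], (h0, c0))          # teacher forcing on x_<i
        logits = self.out(h_dec)                              # Cat(f_theta(z, x_<i))
        log_px_z = td.Categorical(logits=logits).log_prob(x[:, 1:]).sum(-1)
        neg_kl = 0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
        return log_px_z + neg_kl                              # ELBO per sentence
```

In practice Bowman et al. (2016) also rely on tricks such as KL cost annealing and word dropout to keep the decoder from ignoring z.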


VAEs – Summary

Advantages

Backprop training

Easy to implement

Posterior inference possible

One objective for both NNs

Drawbacks

Discrete latent variables are difficult

Optimisation may be difficult with several latent variables

Location-scale families only (but see Ruiz et al. (2016) and Kucukelbir et al. (2017))


Summary

Deep learning in NLP

task-driven feature extraction

models with more realistic assumptions

Probabilistic modelling

better (or at least more explicit) statistical assumptions

compact models

semi-supervised learning


Literature I

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. 2014. URL http://arxiv.org/abs/1409.0473.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. 2016. URL https://arxiv.org/abs/1601.00670.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/K16-1002.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. URL http://arxiv.org/abs/1312.6114.


Literature II

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.pdf.

Francisco R. Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, NIPS, pages 460–468. 2016. URL http://papers.nips.cc/paper/6328-the-generalized-reparameterization-gradient.pdf.


Literature III

Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1009.

Michalis Titsias and Miguel Lazaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Tony Jebara and Eric P. Xing, editors, ICML, pages 1971–1979, 2014. URL http://jmlr.org/proceedings/papers/v32/titsias14.pdf.
