
Learning Fast-Mixing Models for Structured Prediction

Jacob Steinhardt Percy Liang

Stanford University

{jsteinhardt,pliang}@cs.stanford.edu

July 8, 2015

J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 1 / 11


Structured Prediction Task

[Figure: keyboard letter layout for the swipe-typing example]

x: b d s a d b n n n f a a s s j j j

z: b # # a # # n-n-n # a-a # # n-n a

y: b a n a n a

Goal: fit maximum likelihood model pθ(z | x). Two routes:

Use simple model u, exact inference

Use expressive model, Gibbs sampling (transition kernel A)

Can we get the best of both worlds?


Strong Doeblin Chains

Definition (Doeblin, 1940)

A chain Ã is strong Doeblin with parameter ε if

$$\tilde{A}(z_t \mid z_{t-1}) = \epsilon\, u(z_t) + (1 - \epsilon)\, A(z_t \mid z_{t-1})$$

for some distribution u and chain A.

u A A A u A u A A · · ·
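The sequence of u and A moves above can be simulated directly. The following is a minimal sketch, assuming a toy three-state base chain and a uniform u (both hypothetical, chosen only for illustration): at each step the chain restarts from u with probability ε and otherwise takes one step of A.

```python
import random

def sample_u(rng):
    # Hypothetical restart distribution u: uniform over states {0, 1, 2}.
    return rng.randrange(3)

def step_A(z, rng):
    # Hypothetical base chain A: lazy random walk on a 3-cycle.
    return (z + rng.choice([-1, 0, 1])) % 3

def doeblin_step(z_prev, eps, rng):
    """One transition of the strong Doeblin chain: restart w.p. eps, else apply A."""
    if rng.random() < eps:
        return sample_u(rng), "u"
    return step_A(z_prev, rng), "A"

def run_chain(num_steps, eps, seed=0):
    rng = random.Random(seed)
    z = sample_u(rng)            # the chain starts with a draw from u
    states, moves = [z], ["u"]
    for _ in range(num_steps):
        z, tag = doeblin_step(z, eps, rng)
        states.append(z)
        moves.append(tag)
    return states, moves

states, moves = run_chain(num_steps=9, eps=0.3)
print(" ".join(moves))  # a string of u/A tags, one per move
```

Each "u" in the printed sequence marks a restart; the state just before any time step depends only on the moves since the most recent restart.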

All Doeblin chains mix quickly:

Proposition

If Ã is ε-strong Doeblin, then its mixing time is at most 1/ε.

Moreover, the stationary distribution of Ã is A^T u (A applied T times to u), where T ∼ Geometric(ε).
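Both claims in the proposition can be checked numerically on a small example. The sketch below assumes a hypothetical 3-state chain A, a hypothetical u, and the convention P(T = t) = ε(1 − ε)^t for t = 0, 1, 2, ...; it verifies that the geometric mixture of A^t u is a fixed point of Ã and that total variation distance to it decays at least as fast as (1 − ε)^t.

```python
def apply_kernel(K, dist):
    """Push a distribution through K, where K[i][j] = P(next = j | current = i)."""
    n = len(dist)
    return [sum(dist[i] * K[i][j] for i in range(n)) for j in range(n)]

def doeblin_kernel(A, u, eps):
    """Build A_tilde = eps * u + (1 - eps) * A, row by row."""
    n = len(u)
    return [[eps * u[j] + (1 - eps) * A[i][j] for j in range(n)] for i in range(n)]

def geometric_mixture(A, u, eps, num_terms=400):
    """Truncated sum over t >= 0 of eps * (1 - eps)**t * (A^t u)."""
    pi = [0.0] * len(u)
    dist, weight = list(u), eps
    for _ in range(num_terms):
        pi = [p + weight * d for p, d in zip(pi, dist)]
        dist = apply_kernel(A, dist)
        weight *= 1 - eps
    return pi

def tv_distance(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical toy inputs (not from the slides).
A = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]
u = [0.7, 0.2, 0.1]
eps = 0.2

A_tilde = doeblin_kernel(A, u, eps)
pi_tilde = geometric_mixture(A, u, eps)

# Stationarity: one step of A_tilde should leave pi_tilde unchanged.
residual = tv_distance(apply_kernel(A_tilde, pi_tilde), pi_tilde)

# Mixing: start from a point mass and track distance to pi_tilde.
dist = [0.0, 0.0, 1.0]
tv_curve = []
for t in range(1, 11):
    dist = apply_kernel(A_tilde, dist)
    tv_curve.append((t, tv_distance(dist, pi_tilde), (1 - eps) ** t))
```

The bound (1 − ε)^t falls below 1/e after roughly 1/ε steps, which is one way to read the "at most 1/ε" mixing time.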



A Strong Doeblin Family

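Let θ parameterize a distribution uθ and transition matrix Aθ.

$$\tilde{A}_\theta = \epsilon\, u_\theta + (1 - \epsilon)\, A_\theta$$

The stationary distribution π̃θ of Ãθ satisfies

$$\tilde{\pi}_\theta = \epsilon\, u_\theta + (1 - \epsilon)\, A_\theta \tilde{\pi}_\theta$$

Three model families:

F0 = {uθ}θ∈Θ

F̃ = {π̃θ}θ∈Θ

F = {πθ}θ∈Θ (πθ: stationary distribution of Aθ)

F̃ parameterizes computationally tractable distributions!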
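One way to see why the middle family is tractable is to unroll the fixed-point equation for the stationary distribution. The derivation below is a sketch under the convention that T takes values 0, 1, 2, ... with P(T = t) = ε(1 − ε)^t; that indexing is an assumption made here for readability.

```latex
\begin{align*}
\tilde{\pi}_\theta
  &= \epsilon u_\theta + (1-\epsilon) A_\theta \tilde{\pi}_\theta \\
  &= \epsilon u_\theta + (1-\epsilon) A_\theta
     \bigl(\epsilon u_\theta + (1-\epsilon) A_\theta \tilde{\pi}_\theta\bigr) \\
  &= \cdots \\
  &= \sum_{t=0}^{\infty} \epsilon (1-\epsilon)^{t} A_\theta^{t} u_\theta
   = \mathbb{E}_{T \sim \mathrm{Geometric}(\epsilon)}\bigl[A_\theta^{T} u_\theta\bigr].
\end{align*}
```

Setting ε = 1 recovers uθ, so F0 is contained in the strong Doeblin family; smaller ε puts more weight on long runs of Aθ.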


Strategy

Parameterize strong Doeblin distributions π̃θ

Maximize log-likelihood:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \tilde{\pi}_\theta\bigl(z^{(i)}\bigr)$$

Issue: hard to compute ∇L(θ)

Insight: interpret Markov chain as latent variable model:

pθ : z1 → z2 → · · · → zT, with z1 ∼ uθ and each transition drawn from Aθ

Observe: π̃θ(z) = pθ(zT = z), T ∼ Geometric(ε)
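The latent variable view gives an exact sampler for the strong Doeblin distribution: draw T, draw z1 from u, then take T − 1 steps of A. The sketch below assumes a hypothetical two-state model and the indexing P(T = t) = ε(1 − ε)^(t−1) for t = 1, 2, ..., so that zT is A applied T − 1 times to a draw from u; it compares empirical frequencies of zT with the closed-form mixture.

```python
import random

def sample_categorical(probs, rng):
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def sample_geometric(eps, rng):
    """T in {1, 2, ...} with P(T = t) = eps * (1 - eps)**(t - 1)."""
    t = 1
    while rng.random() >= eps:
        t += 1
    return t

def sample_trajectory(u, A, eps, rng):
    """Draw z_1 ~ u, then T - 1 transitions of A; returns [z_1, ..., z_T]."""
    T = sample_geometric(eps, rng)
    z = sample_categorical(u, rng)
    traj = [z]
    for _ in range(T - 1):
        z = sample_categorical(A[z], rng)  # A[z] = P(next | current = z)
        traj.append(z)
    return traj

def exact_pi_tilde(u, A, eps, num_terms=400):
    """Truncated geometric mixture of A^s u over s = 0, 1, 2, ..."""
    n = len(u)
    pi, dist, weight = [0.0] * n, list(u), eps
    for _ in range(num_terms):
        pi = [p + weight * d for p, d in zip(pi, dist)]
        dist = [sum(dist[i] * A[i][j] for i in range(n)) for j in range(n)]
        weight *= 1 - eps
    return pi

# Hypothetical toy model (not from the slides).
u = [0.9, 0.1]
A = [[0.5, 0.5], [0.2, 0.8]]
eps = 0.25

rng = random.Random(0)
num_samples = 20000
counts = [0, 0]
for _ in range(num_samples):
    counts[sample_trajectory(u, A, eps, rng)[-1]] += 1
empirical = [c / num_samples for c in counts]
exact = exact_pi_tilde(u, A, eps)
```

The expected trajectory length is 1/ε, so each exact sample costs about 1/ε transitions of A on average.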


Learning Updates

Recall latent variable model:

pθ : z1 → z2 → · · · → zT, with each transition drawn from Aθ

Lemma

For any fixed z,

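$$\frac{\partial \log p_\theta(z_T = z)}{\partial \theta} = \mathbb{E}_{z_{1:T-1} \sim p_\theta(\cdot \mid z_T = z)}\left[\frac{\partial \log p_\theta(z_{1:T})}{\partial \theta}\right].$$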
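The lemma follows from the standard identity for gradients of marginal likelihoods in latent variable models. A short derivation, written for a fixed trajectory length T and a discrete state space (both simplifying assumptions made here):

```latex
\begin{align*}
\frac{\partial \log p_\theta(z_T = z)}{\partial \theta}
  &= \frac{1}{p_\theta(z_T = z)}
     \sum_{z_{1:T-1}} \frac{\partial p_\theta(z_{1:T-1}, z_T = z)}{\partial \theta} \\
  &= \sum_{z_{1:T-1}} \frac{p_\theta(z_{1:T-1}, z_T = z)}{p_\theta(z_T = z)}
     \,\frac{\partial \log p_\theta(z_{1:T-1}, z_T = z)}{\partial \theta} \\
  &= \mathbb{E}_{z_{1:T-1} \sim p_\theta(\cdot \mid z_T = z)}
     \left[\frac{\partial \log p_\theta(z_{1:T})}{\partial \theta}\right].
\end{align*}
```

The second line uses ∂p = p ∂ log p; the third recognizes the ratio as the posterior over the unobserved prefix.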

Upshot: just need to sample trajectories that end at z ⟹ importance sampling
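A minimal sketch of one possible importance sampler for trajectories that end at z follows. It is an illustration under stated assumptions, not necessarily the proposal used in the paper: the prefix z1..z(T−1) is drawn from the forward chain and weighted by A(z | z(T−1)), with T fixed, a hypothetical two-state model, and a stand-in function g playing the role of the gradient term inside the expectation.

```python
import itertools
import random

# Hypothetical toy model (not from the slides).
U = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]  # A[i][j] = P(next = j | current = i)

def sample_categorical(probs, rng):
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def forward_prefix(T, rng):
    """Proposal: sample z_1..z_{T-1} from the forward chain (requires T >= 2)."""
    z = sample_categorical(U, rng)
    prefix = [z]
    for _ in range(T - 2):
        z = sample_categorical(A[z], rng)
        prefix.append(z)
    return prefix

def is_estimate(g, z_end, T, num_samples, rng):
    """Self-normalized estimate of E[g(z_{1:T-1}) | z_T = z_end]."""
    num = den = 0.0
    for _ in range(num_samples):
        prefix = forward_prefix(T, rng)
        w = A[prefix[-1]][z_end]  # target / proposal, up to a constant
        num += w * g(prefix)
        den += w
    return num / den

def exact_expectation(g, z_end, T):
    """Brute-force enumeration over all prefixes, for checking only."""
    num = den = 0.0
    for prefix in itertools.product(range(len(U)), repeat=T - 1):
        p = U[prefix[0]]
        for a, b in zip(prefix, prefix[1:]):
            p *= A[a][b]
        p *= A[prefix[-1]][z_end]
        num += p * g(list(prefix))
        den += p
    return num / den

def g(prefix):
    # Stand-in statistic: number of visits to state 0 along the prefix.
    return float(sum(1 for s in prefix if s == 0))

rng = random.Random(0)
estimate = is_estimate(g, z_end=1, T=4, num_samples=20000, rng=rng)
exact = exact_expectation(g, z_end=1, T=4)
```

With this proposal the weights depend only on the last prefix state, so the estimator degrades when few forward prefixes can reach z in one step; a proposal that looks at z earlier would be the natural refinement.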


Experiments: Task

Task from before:

[Figure: keyboard letter layout for the swipe-typing example]

x: b d s a d b n n n f a a s s j j j

z: b # # a # # n-n-n # a-a # # n-n a

y: b a n a n a

Note y is a deterministic function y = f(z).

Goal: learn model p(z | x) that maximizes

$$p(y \mid x) = \sum_{z \in f^{-1}(y)} p(z \mid x)$$
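The map f and the marginal above can be made concrete with the example from the slide. The sketch below assumes z is encoded as a space-separated string in which "#" marks a skipped input character and hyphen-joined groups such as "n-n-n" collapse to one output letter; this encoding is inferred from the slide's example, and the small distribution over z is hypothetical.

```python
def f(z):
    """Collapse a latent string z to its output y (encoding assumed from the example)."""
    letters = []
    for token in z.split():
        if token == "#":
            continue                          # skipped input character
        letters.append(token.split("-")[0])   # "n-n-n" contributes a single "n"
    return "".join(letters)

def marginal_likelihood(y, p_z_given_x):
    """p(y | x) as the sum of p(z | x) over all z with f(z) = y."""
    return sum(p for z, p in p_z_given_x.items() if f(z) == y)

z_example = "b # # a # # n-n-n # a-a # # n-n a"
print(f(z_example))  # banana

# Hypothetical distribution over latent strings for a 3-character input.
p_z_given_x = {
    "b # a": 0.2,
    "b a-a": 0.3,
    "b-b a": 0.1,
    "b # #": 0.4,
}
p_ba = marginal_likelihood("ba", p_z_given_x)
```

Here three of the four latent strings map to "ba", so the marginal for that output sums their probabilities. In the real task the preimage is far too large to enumerate, which is why sampling trajectories that end in it matters.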


Experiments: Setup

Models:

u(z | x) (bigram, DP): z1 z2 z3 z4

A(zt | zt−1, x) (dictionary, Gibbs): z1 z2 z3 z4

Comparisons:

basic: Gibbs sampling (A) (compute gradients assuming exact inference)

uθ-Gibbs: Gibbs with random restarts from u

Doeblin: our method


Experiments: Results

[Figure: Swipe Typing (Test Accuracy). x-axis: training passes (ticks at 5, 10, 15); y-axis: accuracy (0.00 to 0.25); curves: Doeblin, uθ-Gibbs, basic]


Discussion

Summary:

Strong Doeblin property enables fast mixing

Interpolates between tractability and expressivity

Provides better learning updates at training time

Also in paper:

Theoretical analysis of strong Doeblin family

Multi-stage Doeblin chains


Discussion

Related work:

Policy gradient (Sutton et al., 1999)

Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al., 2011; Huang et al., 2012)

Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran & Tweedie, 1998)

Future work:

Explore other tractable families

Learn multi-stage chains

Reproducible experiments on CodaLab: codalab.org/worksheets
