Learning Fast-Mixing Models for Structured Prediction

Jacob Steinhardt Percy Liang

Stanford University

{jsteinhardt,pliang}@cs.stanford.edu

July 8, 2015

J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 1 / 11

Structured Prediction Task

[Figure: letter layout (a to z) for the swipe typing task]

x: b d s a d b n n n f a a s s j j j

z: b # # a # # n-n-n # a-a # # n-n a

y: b a n a n a

Goal: fit maximum likelihood model pθ(z | x). Two routes:

Use simple model u, exact inference

Use expressive model, Gibbs sampling (transition kernel A)

Can we get the best of both worlds?







Strong Doeblin Chains

Definition (Doeblin, 1940)

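A chain $\tilde{A}$ is strong Doeblin with parameter $\epsilon$ if

$$\tilde{A}(z_t \mid z_{t-1}) = \epsilon\, u(z_t) + (1 - \epsilon)\, A(z_t \mid z_{t-1})$$

for some distribution $u$ and transition kernel $A$.

Each step restarts from $u$ with probability $\epsilon$ and otherwise applies $A$:

u A A A u A u A A · · ·

All Doeblin chains mix quickly:

Proposition

If $\tilde{A}$ is $\epsilon$ strong Doeblin, then its mixing time is at most $1/\epsilon$.

Moreover, the stationary distribution is $A^T u$, where $T \sim \mathrm{Geometric}(\epsilon)$.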
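Both claims of the proposition can be checked numerically on a small example. The sketch below is illustrative only: the 3-state `u` and `A` are made-up values, matrices are stored column-wise (entry `[i][j]` is the probability of moving to `i` from `j`), and the mixture chain is built as "restart from `u` with probability `eps`, otherwise apply `A`". It compares the stationary distribution of the mixture chain with the geometric mixture of `A^t u` (with `t` counted from 0), and checks that total variation distance to stationarity shrinks at least as fast as `(1 - eps)^t`, which falls below `1/e` after `1/eps` steps.

```python
def mat_vec(K, p):
    """Advance distribution p by one step of column-stochastic kernel K."""
    return [sum(K[i][j] * p[j] for j in range(len(p))) for i in range(len(K))]

def doeblin_kernel(u, A, eps):
    """Mixture chain: entry [i][j] = eps * u[i] + (1 - eps) * A[i][j]."""
    n = len(u)
    return [[eps * u[i] + (1 - eps) * A[i][j] for j in range(n)] for i in range(n)]

def distribution_after(K, p0, t):
    """Distribution of the state after t steps of K, starting from p0."""
    p = list(p0)
    for _ in range(t):
        p = mat_vec(K, p)
    return p

def geometric_mixture(u, A, eps, terms=500):
    """Truncated sum over t >= 0 of eps * (1 - eps)^t * A^t u."""
    total = [0.0] * len(u)
    power = list(u)   # holds A^t u
    weight = eps      # holds eps * (1 - eps)^t
    for _ in range(terms):
        total = [s + weight * q for s, q in zip(total, power)]
        power = mat_vec(A, power)
        weight *= 1 - eps
    return total

def tv_distance(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical example: A is a slow, "sticky" chain; every column sums to 1.
u = [0.5, 0.3, 0.2]
A = [[0.98, 0.01, 0.00],
     [0.02, 0.98, 0.02],
     [0.00, 0.01, 0.98]]
eps = 0.2

A_tilde = doeblin_kernel(u, A, eps)
pi_tilde = distribution_after(A_tilde, [1.0, 0.0, 0.0], 500)
mixture = geometric_mixture(u, A, eps)
```

Here `pi_tilde` and `mixture` should agree up to floating-point error, even though `A` on its own mixes slowly.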










A Strong Doeblin Family

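Let θ parameterize a distribution $u_\theta$ and a transition matrix $A_\theta$.

$$\tilde{A}_\theta = \epsilon\, u_\theta \mathbf{1}^\top + (1 - \epsilon)\, A_\theta$$

Here $A_\theta$ has stationary distribution $\pi_\theta$ and $\tilde{A}_\theta$ has stationary distribution $\tilde{\pi}_\theta$.

Three model families:

$\mathcal{F}_0 = \{u_\theta\}_{\theta \in \Theta}$

$\mathcal{F} = \{\pi_\theta\}_{\theta \in \Theta}$

$\tilde{\mathcal{F}} = \{\tilde{\pi}_\theta\}_{\theta \in \Theta}$

$\tilde{\mathcal{F}}$ parameterizes computationally tractable distributions!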
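How the mixture family sits between the other two can be made explicit. The derivation below is a short sketch, writing $\tilde{A}_\theta$ for the mixture chain and $\tilde{\pi}_\theta$ for its stationary distribution. It assumes column-stochastic matrices (so $\mathbf{1}^\top \tilde{\pi}_\theta = 1$) and, for the limit $\epsilon \to 0$, that $A_\theta$ has a unique stationary distribution $\pi_\theta$.

```latex
\begin{align*}
\tilde{\pi}_\theta
  &= \tilde{A}_\theta \tilde{\pi}_\theta
   = \epsilon\, u_\theta + (1 - \epsilon)\, A_\theta \tilde{\pi}_\theta \\
\Longrightarrow\quad
\tilde{\pi}_\theta
  &= \epsilon \bigl(I - (1 - \epsilon) A_\theta\bigr)^{-1} u_\theta
   = \sum_{t=0}^{\infty} \epsilon (1 - \epsilon)^t A_\theta^t u_\theta \\
\epsilon = 1:\quad
\tilde{\pi}_\theta &= u_\theta,
\qquad
\epsilon \to 0:\quad
\tilde{\pi}_\theta \to \pi_\theta
\end{align*}
```

So $\epsilon$ acts as a dial: large values stay close to the simple model $u_\theta$, small values approach the expressive but slowly mixing chain $A_\theta$.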








Strategy

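Parameterize strong Doeblin distributions $\tilde{\pi}_\theta$

Maximize log-likelihood: $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \tilde{\pi}_\theta(z^{(i)})$

Issue: hard to compute $\nabla L(\theta)$

Insight: interpret Markov chain as latent variable model:

$$p_\theta:\quad z_1 \sim u_\theta, \qquad z_1 \xrightarrow{A_\theta} z_2 \xrightarrow{A_\theta} \cdots \xrightarrow{A_\theta} z_T$$

Observe: $\tilde{\pi}_\theta(z) = p_\theta(z_T = z)$, $T \sim \mathrm{Geometric}(\epsilon)$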
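The observation also gives a direct way to draw exact samples from the stationary distribution of the mixture chain: draw the first state from `u`, then keep applying `A` and stop with probability `eps` before each step. The sketch below is illustrative. The function names, the 3-state example, and the row-wise storage (`A_rows[j]` is the next-state distribution from state `j`) are assumptions. The number of `A` steps is geometric and counted from 0.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def sample_pi_tilde(u, A_rows, eps, rng):
    """One exact draw: z_1 ~ u, then a Geometric(eps) number of A steps."""
    z = sample_categorical(u, rng)
    while rng.random() >= eps:  # continue with probability 1 - eps
        z = sample_categorical(A_rows[z], rng)
    return z

def exact_pi_tilde(u, A_rows, eps, terms=200):
    """Truncated geometric mixture of the t-step distributions started at u."""
    n = len(u)
    total, p, weight = [0.0] * n, list(u), eps
    for _ in range(terms):
        total = [s + weight * q for s, q in zip(total, p)]
        p = [sum(p[j] * A_rows[j][i] for j in range(n)) for i in range(n)]
        weight *= 1 - eps
    return total

# Hypothetical example values.
u = [0.5, 0.3, 0.2]
A_rows = [[0.9, 0.1, 0.0],
          [0.1, 0.8, 0.1],
          [0.0, 0.2, 0.8]]
eps = 0.3

rng = random.Random(0)
samples = [sample_pi_tilde(u, A_rows, eps, rng) for _ in range(20000)]
empirical = [samples.count(s) / len(samples) for s in range(len(u))]
exact = exact_pi_tilde(u, A_rows, eps)
```

The empirical frequencies should match `exact` up to sampling noise, with an expected cost of about `1/eps` transitions per sample.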


Learning Updates

Recall latent variable model:

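$$p_\theta:\quad z_1 \sim u_\theta, \qquad z_1 \xrightarrow{A_\theta} z_2 \xrightarrow{A_\theta} \cdots \xrightarrow{A_\theta} z_T$$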

Lemma

For any fixed z,

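$$\frac{\partial \log p_\theta(z_T = z)}{\partial \theta} = \mathbb{E}_{z_{1:T-1} \sim p_\theta(\cdot \mid z_T = z)}\left[\frac{\partial \log p_\theta(z_{1:T})}{\partial \theta}\right].$$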
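The lemma can be sanity-checked by brute force on a tiny model. The sketch below is an illustration under stated assumptions, not the model from the talk: two states, a scalar parameter `theta`, a fixed chain length `T` (the talk draws `T` from a geometric distribution), and hypothetical logit `OFFSETS` that make the transition depend on the previous state. It compares a finite-difference derivative of the log marginal with the posterior expectation of the full-trajectory gradient.

```python
import itertools
import math

OFFSETS = (0.5, -1.0)  # hypothetical logit shifts, indexed by the previous state

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def factor_logits(theta, path):
    """Logit of P(state = 0) for the initial factor and each transition."""
    return [theta] + [theta + OFFSETS[prev] for prev in path[:-1]]

def joint_prob(theta, path):
    """p(z_1..z_T) = u(z_1) * product over t of A(z_t | z_{t-1})."""
    prob = 1.0
    for z, logit in zip(path, factor_logits(theta, path)):
        q = sigmoid(logit)
        prob *= q if z == 0 else 1.0 - q
    return prob

def joint_score(theta, path):
    """d/dtheta log p(z_1..z_T); each Bernoulli factor contributes 1[z=0] - q."""
    return sum((1.0 if z == 0 else 0.0) - sigmoid(logit)
               for z, logit in zip(path, factor_logits(theta, path)))

def paths_ending_at(z_end, T):
    return [prefix + (z_end,) for prefix in itertools.product((0, 1), repeat=T - 1)]

def log_marginal(theta, z_end, T):
    return math.log(sum(joint_prob(theta, p) for p in paths_ending_at(z_end, T)))

def lemma_gradient(theta, z_end, T):
    """Right-hand side: posterior-weighted average of the trajectory score."""
    paths = paths_ending_at(z_end, T)
    weights = [joint_prob(theta, p) for p in paths]
    total = sum(weights)
    return sum(w / total * joint_score(theta, p) for w, p in zip(weights, paths))

def finite_difference(theta, z_end, T, h=1e-5):
    """Left-hand side, approximated by a central difference."""
    return (log_marginal(theta + h, z_end, T) - log_marginal(theta - h, z_end, T)) / (2 * h)

theta, z_end, T = 0.3, 1, 4
lhs = finite_difference(theta, z_end, T)
rhs = lemma_gradient(theta, z_end, T)
```

The two values agree up to finite-difference error. Enumeration is only feasible for toy models, which is why sampling trajectories matters in practice.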

Upshot: just need to sample trajectories that end at z.

⟹ importance sampling
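One simple way to sample trajectories that end at a given z is self-normalized importance sampling. Draw the prefix from the forward model, then weight each draw by the probability of the final transition into z. The sketch below makes several assumptions that are not from the slides: this particular proposal, a fixed chain length `T >= 2`, and a toy two-state model with hypothetical logit `OFFSETS`. It is not necessarily the estimator used in the paper.

```python
import itertools
import math
import random

OFFSETS = (0.5, -1.0)  # hypothetical logit shifts, indexed by the previous state

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def prob_zero(theta, prev):
    """P(next state = 0); prev is None for the initial draw from u."""
    return sigmoid(theta if prev is None else theta + OFFSETS[prev])

def step_prob(theta, prev, z):
    q = prob_zero(theta, prev)
    return q if z == 0 else 1.0 - q

def path_score(theta, path):
    """d/dtheta log p(z_1..z_T) for the Bernoulli-logit factors."""
    prevs = (None,) + tuple(path[:-1])
    return sum((1.0 if z == 0 else 0.0) - prob_zero(theta, prev)
               for prev, z in zip(prevs, path))

def sample_prefix(theta, length, rng):
    """Forward-sample `length` states: the first from u, the rest from A."""
    prefix, prev = [], None
    for _ in range(length):
        z = 0 if rng.random() < prob_zero(theta, prev) else 1
        prefix.append(z)
        prev = z
    return tuple(prefix)

def importance_gradient(theta, z_end, T, num_samples, rng):
    """Self-normalized estimate of d/dtheta log p(z_T = z_end); needs T >= 2."""
    weighted_sum, weight_total = 0.0, 0.0
    for _ in range(num_samples):
        prefix = sample_prefix(theta, T - 1, rng)
        weight = step_prob(theta, prefix[-1], z_end)  # A(z_end | z_{T-1})
        weighted_sum += weight * path_score(theta, prefix + (z_end,))
        weight_total += weight
    return weighted_sum / weight_total

def exact_gradient(theta, z_end, T):
    """Same quantity by enumerating every prefix (tiny models only)."""
    num, den = 0.0, 0.0
    for prefix in itertools.product((0, 1), repeat=T - 1):
        path = prefix + (z_end,)
        prob = 1.0
        for prev, z in zip((None,) + prefix, path):
            prob *= step_prob(theta, prev, z)
        num += prob * path_score(theta, path)
        den += prob
    return num / den

estimate = importance_gradient(0.3, 1, 4, 20000, random.Random(0))
exact = exact_gradient(0.3, 1, 4)
```

With this proposal the weights are bounded by 1, so the estimator is well behaved here. When few forward samples land near z, the weights become uneven and a better proposal would be needed.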


Experiments: Task

Task from before:

[Figure: letter layout (a to z) for the swipe typing task]


z: b # # a # # n-n-n # a-a # # n-n a

y: b a n a n a

Note y is a deterministic function y = f(z).

Goal: learn model p(z | x) that maximizes

$$p(y \mid x) = \sum_{z \in f^{-1}(y)} p(z \mid x)$$
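The map f and the sum over its preimage can be illustrated in a few lines. The encoding is an assumption read off the example above: z is a sequence of whitespace-separated tokens, `#` marks a position that emits nothing, and a hyphen-tied run such as `n-n-n` emits a single letter. The toy distribution `p_z` is made up for illustration.

```python
from collections import defaultdict

def collapse(z_tokens):
    """f(z): drop '#' tokens and keep one letter per hyphen-tied run."""
    return "".join(tok[0] for tok in z_tokens if tok != "#")

def marginal_over_alignments(p_z):
    """p(y | x) as the sum of p(z | x) over all z with collapse(z) == y."""
    p_y = defaultdict(float)
    for z_tokens, prob in p_z.items():
        p_y[collapse(z_tokens)] += prob
    return dict(p_y)

# The alignment shown on the slide.
z_slide = tuple("b # # a # # n-n-n # a-a # # n-n a".split())
y_slide = collapse(z_slide)

# Hypothetical distribution p(z | x) over three short alignments.
p_z = {
    ("b", "#", "a-a"): 0.5,
    ("b-b", "a", "#"): 0.3,
    ("b", "#", "#"): 0.2,
}
p_y = marginal_over_alignments(p_z)
```

Two different alignments collapse to the same output `ba`, so their probabilities add. In the real task the preimage is far too large to enumerate, which is where sampling comes in.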


Experiments: Setup

Models:

u(z | x) (bigram model, dynamic programming) [diagram over z1, z2, z3, z4]

A(zt | zt−1, x) (dictionary model, Gibbs sampling) [diagram over z1, z2, z3, z4]

Comparisons:

basic: Gibbs sampling (A) (compute gradients assuming exact inference)

uθ-Gibbs: Gibbs with random restarts from u

Doeblin: our method
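"Gibbs with random restarts from u" can be pictured as the following loop. This is a generic sketch, not the experimental code: `sample_u` and `gibbs_step` are hypothetical caller-supplied callables, and the toy stand-ins use 4-bit states with a Gibbs step for a uniform target (resample one coordinate uniformly).

```python
import random

def run_chain_with_restarts(sample_u, gibbs_step, eps, num_steps, rng):
    """Simulate num_steps states: restart from u w.p. eps, else apply A.

    Returns the visited states and a parallel list of "u"/"A" move labels.
    """
    state = sample_u(rng)
    states, moves = [state], ["u"]
    for _ in range(num_steps - 1):
        if rng.random() < eps:
            state, move = sample_u(rng), "u"
        else:
            state, move = gibbs_step(state, rng), "A"
        states.append(state)
        moves.append(move)
    return states, moves

# Toy stand-ins (assumptions for illustration only).
def toy_sample_u(rng):
    return tuple(rng.randint(0, 1) for _ in range(4))

def toy_gibbs_step(state, rng):
    i = rng.randrange(len(state))
    return state[:i] + (rng.randint(0, 1),) + state[i + 1:]

states, moves = run_chain_with_restarts(
    toy_sample_u, toy_gibbs_step, eps=0.3, num_steps=10, rng=random.Random(0))
print(" ".join(moves))  # a sequence of u and A moves
```

Setting `eps=0` gives plain Gibbs sampling after the initial draw, and `eps=1` gives independent draws from u.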




Experiments: Results

[Figure: Swipe Typing (Test Accuracy). x-axis: training passes (5 to 15); y-axis: accuracy (0.00 to 0.25); curves: Doeblin, uθ-Gibbs, basic]


Discussion

Summary:

Strong Doeblin property enables fast mixing

Interpolates between tractability and expressivity

Provides better learning updates at training time

Also in paper:

Theoretical analysis of strong Doeblin family

Multi-stage Doeblin chains


Discussion

Related work:

Policy gradient (Sutton et al., 1999)

Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al., 2011; Huang et al., 2012)

Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran & Tweedie, 1998)

Future work:

Explore other tractable families

Learn multi-stage chains

Reproducible experiments on CodaLab: codalab.org/worksheets

