
Page 1

Frank Wood - [email protected]

Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton

presented by Frank Wood

Page 2

Goal

• Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.).

Product of Experts (use contrastive divergence to learn parameters):

$$p(\vec d \,|\, \theta_1, \ldots, \theta_n) = \frac{\prod_m f_m(\vec d \,|\, \theta_m)}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}$$

Mixture Model (use EM to learn parameters):

$$p(\vec d \,|\, \theta_1, \ldots, \theta_n) = \sum_m \alpha_m f_m(\vec d \,|\, \theta_m)$$
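To make the contrast concrete, here is a small sketch (the two 1-D Gaussian "experts", the grid, and the mixture weights below are made up for illustration; they are not from the slides) that combines the same pair of component densities first as a product of experts and then as a mixture. The product is renormalized over the grid, which plays the role of the sum over $\vec c$ in the denominator above.

```python
import numpy as np

# Two 1-D Gaussian "experts" evaluated on a grid (hypothetical example).
x = np.linspace(-6.0, 6.0, 1201)
f1 = np.exp(-0.5 * (x - 1.0) ** 2)          # expert 1, unnormalized
f2 = np.exp(-0.5 * ((x + 1.0) / 2.0) ** 2)  # expert 2, unnormalized
dx = x[1] - x[0]

# Product of Experts: multiply, then renormalize (the sum over c in the denominator).
poe = f1 * f2
poe /= poe.sum() * dx

# Mixture model: weighted sum of individually normalized experts.
alpha = np.array([0.5, 0.5])
mix = alpha[0] * f1 / (f1.sum() * dx) + alpha[1] * f2 / (f2.sum() * dx)

# The product is sharper than either expert; the mixture is broader.
print("PoE std:", np.sqrt(np.sum(poe * x**2 * dx) - np.sum(poe * x * dx) ** 2))
print("Mixture std:", np.sqrt(np.sum(mix * x**2 * dx) - np.sum(mix * x * dx) ** 2))
```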

Page 3

Take Home

• Contrastive divergence is a general MCMC gradient ascent learning algorithm particularly well suited to learning Product of Experts (PoE) and energy-based (Gibbs distributions, etc.) model parameters.

• The general algorithm (sketched in code after this list):
  – Repeat until "convergence":
    • Draw samples from the current model, starting from the training data.
    • Compute the expected gradient of the log probability w.r.t. all model parameters, over both the samples and the training data.
    • Update the model parameters according to the gradient.
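A minimal skeleton of that loop, as a sketch rather than Wood's or Hinton's actual code: `sample_from_model` and `grad_log_f` are hypothetical placeholders that the reader would supply for their own model.

```python
def cd_training_loop(data, theta, sample_from_model, grad_log_f,
                     learning_rate=0.01, n_iters=100):
    """Generic contrastive-divergence-style loop (a sketch, not a reference implementation).

    data:              (N, D) array of training vectors
    theta:             current parameter array
    sample_from_model: function (data, theta) -> samples drawn by starting a
                       Markov chain at the data and running it briefly
    grad_log_f:        function (x, theta) -> per-example gradient of log f w.r.t. theta
    """
    for _ in range(n_iters):                       # "repeat until convergence"
        samples = sample_from_model(data, theta)   # draw samples starting from the data
        grad = (grad_log_f(data, theta).mean(axis=0)
                - grad_log_f(samples, theta).mean(axis=0))  # expected gradient difference
        theta = theta + learning_rate * grad       # gradient ascent step
    return theta
```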

Page 4

Sampling – Critical to Understanding

• Uniform: rand(), a linear congruential generator (sketched, together with Box-Muller, in code after this list)
  – x(n) = (a · x(n−1) + b) mod M
  – Example output: 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185
• Normal: randn(), the Box-Muller transform
  – x1, x2 ~ U(0,1) → y1, y2 ~ N(0,1)
  – y1 = sqrt(−2 ln(x1)) cos(2π x2)
  – y2 = sqrt(−2 ln(x1)) sin(2π x2)
• Binomial(p): if (rand() < p)
• More complicated distributions
  – Mixture model: sample a component from the multinomial (CDF + uniform), then sample from that component's Gaussian.
  – Product of Experts: Metropolis and/or Gibbs sampling.
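A sketch of the first two bullets (the constants a, b, M below are illustrative choices, not the ones any particular rand() implementation uses):

```python
import math

def lcg(seed, a=1664525, b=1013904223, M=2**32):
    """Linear congruential generator: x(n) = (a*x(n-1) + b) mod M, scaled to [0, 1)."""
    x = seed
    while True:
        x = (a * x + b) % M
        yield x / M

def box_muller(u1, u2):
    """Map two U(0,1) draws to two independent N(0,1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

uniforms = lcg(seed=42)
y1, y2 = box_muller(next(uniforms), next(uniforms))
print(y1, y2)
```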

Page 5

The Flavor of Metropolis Sampling

• Given some distribution $p(\vec d \,|\, \theta)$, a random starting point $\vec d_{t-1}$, and a symmetric proposal distribution $J(\vec d_t \,|\, \vec d_{t-1})$.

• Calculate the ratio of densities

$$r = \frac{p(\vec d_t \,|\, \theta)}{p(\vec d_{t-1} \,|\, \theta)}$$

where $\vec d_t$ is sampled from the proposal distribution.

• With probability $\min(r, 1)$ accept $\vec d_t$.

• Given sufficiently many iterations,

$$\{\vec d_{n+1}, \vec d_{n+2}, \ldots\} \sim p(\vec d \,|\, \theta)$$

Only need to know the distribution up to a proportionality! (A sampler is sketched below.)
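A minimal random-walk Metropolis sketch of the above (the Gaussian proposal, step size, and the Laplace-like target in the example are my illustrative assumptions; only an unnormalized log-density is required):

```python
import numpy as np

def metropolis(log_p_unnorm, d0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis: needs log p only up to an additive constant."""
    if rng is None:
        rng = np.random.default_rng()
    d = np.asarray(d0, dtype=float)
    chain = []
    for _ in range(n_steps):
        proposal = d + step * rng.standard_normal(d.shape)   # symmetric proposal J
        log_r = log_p_unnorm(proposal) - log_p_unnorm(d)     # log of the density ratio r
        if np.log(rng.random()) < log_r:                     # accept with prob min(r, 1)
            d = proposal
        chain.append(d.copy())
    return np.array(chain)

# Example: sample from an unnormalized 1-D density proportional to exp(-|x|).
samples = metropolis(lambda x: -np.abs(x).sum(), d0=[0.0], n_steps=5000)
print(samples[1000:].mean(), samples[1000:].std())  # roughly 0 and sqrt(2)
```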

Page 6

Contrastive Divergence (Final Result!)

$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^1_\theta}$$

$$\Delta\theta_m \propto \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{\vec c \sim P^1_\theta} \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Here the $\theta_m$ are the model parameters, $P^0$ is the training data (the empirical distribution), and $P^1_\theta$ is the distribution of samples from the model obtained by starting the Markov chain at the data and running it for one step. By the Law of Large Numbers, the expectations are computed using sample averages.

Now you know how to do it, let's see why this works!
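The displayed update translates almost literally into code. A sketch (the function and argument names are mine, not the slides'): average the per-example gradients of log f over the data and over the one-step samples, then take the difference.

```python
def cd_update(dlogf, data, one_step_samples, learning_rate=0.01):
    """Contrastive divergence parameter update (sketch).

    dlogf(x) must return an (n_examples, n_params) array of
    d log f_m(x | theta_m) / d theta_m evaluated at the current parameters.
    """
    positive = dlogf(data).mean(axis=0)              # <.>_{P^0}: average over training data
    negative = dlogf(one_step_samples).mean(axis=0)  # <.>_{P^1}: average over one-step samples
    return learning_rate * (positive - negative)     # delta theta_m
```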

Page 7

But First: The last vestige of concreteness.

• Looking towards the future:
  – Take f to be a Student-t.
  – Then (for instance):

$$f_m(\vec d \,|\, \theta_m) = f_m\!\left(\vec j_m^{\,T} \vec d ;\, \alpha_m\right) = \frac{1}{\left(1 + \tfrac{1}{2}\left(\vec j_m^{\,T} \vec d\right)^2\right)^{\alpha_m}}$$

[Figure: a 1-D Student-t expert, $1/\left(1 + \tfrac{1}{2}(j x)^2\right)^{\alpha}$, plotted for $\alpha = 0.6$, $j = 5$.]

$$\frac{\partial \log f_m\!\left(\vec j_m^{\,T} \vec d ;\, \alpha_m\right)}{\partial \alpha_m} = \frac{\partial}{\partial \alpha_m}\left[-\alpha_m \log\!\left(1 + \tfrac{1}{2}\left(\vec j_m^{\,T} \vec d\right)^2\right)\right] = -\log\!\left(1 + \tfrac{1}{2}\left(\vec j_m^{\,T} \vec d\right)^2\right)$$

Dot product → projection → 1-D marginal.
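A small sketch of this expert and its α-gradient, with a finite-difference check of the derivative (the vectors and values below are arbitrary test inputs, not from the slides):

```python
import numpy as np

def student_t_expert(d, j, alpha):
    """Unnormalized Student-t expert: 1 / (1 + 0.5*(j^T d)^2)^alpha."""
    proj = float(np.dot(j, d))               # dot product -> projection -> 1-D marginal
    return (1.0 + 0.5 * proj**2) ** (-alpha)

def dlogf_dalpha(d, j, alpha):
    """d log f / d alpha = -log(1 + 0.5*(j^T d)^2)."""
    proj = float(np.dot(j, d))
    return -np.log1p(0.5 * proj**2)

d = np.array([0.3, -1.2])
j = np.array([5.0, 0.0])
alpha, eps = 0.6, 1e-6

analytic = dlogf_dalpha(d, j, alpha)
numeric = (np.log(student_t_expert(d, j, alpha + eps))
           - np.log(student_t_expert(d, j, alpha - eps))) / (2 * eps)
print(analytic, numeric)   # the two values should agree closely
```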

Page 8

Maximizing the training data log likelihood

• We want the maximizing parameters:

$$\theta_1, \ldots, \theta_n = \underset{\theta_1, \ldots, \theta_n}{\operatorname{argmax}} \; \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n) = \underset{\theta_1, \ldots, \theta_n}{\operatorname{argmax}} \; \log \prod_{\vec d \in \mathcal{D}} \frac{\prod_m f_m(\vec d \,|\, \theta_m)}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}$$

Here the outer product runs over all training data, assuming the $\vec d$'s are drawn independently from $p()$; the fraction inside is the standard PoE form.

• Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.

• The derivation is somewhat nasty. (A numerical version of the objective, for a small discrete domain, is sketched below.)
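To see concretely what is being maximized, and why the $\sum_{\vec c}$ in the denominator is the troublesome part, here is a sketch that evaluates the PoE log likelihood exactly on a tiny discrete domain where the sum over all configurations $\vec c$ can be enumerated. The simple exponential-form experts and the toy data are made up for illustration.

```python
import itertools
import numpy as np

# Tiny binary domain: all c in {0,1}^4 can be enumerated, so the normalizer is exact.
DIM = 4
configs = np.array(list(itertools.product([0, 1], repeat=DIM)), dtype=float)

def log_f(x, theta):
    """Per-expert unnormalized log densities; one simple expert per dimension (illustrative)."""
    return theta * x   # shape (..., DIM): expert m contributes log f_m = theta_m * x_m

def log_poe_likelihood(data, theta):
    log_numer = log_f(data, theta).sum(axis=-1)                       # sum_m log f_m(d | theta_m)
    log_Z = np.logaddexp.reduce(log_f(configs, theta).sum(axis=-1))   # log sum_c prod_m f_m(c | theta_m)
    return (log_numer - log_Z).sum()                                  # log prod_d p(d | theta)

theta = np.array([0.5, -0.2, 1.0, 0.0])
data = np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]], dtype=float)
print(log_poe_likelihood(data, theta))
# With a D-dimensional binary domain the normalizer costs 2**D terms: this is the
# sum that contrastive divergence lets us avoid computing exactly.
```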

Page 9

Maximizing the training data log likelihood

$$\frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} = \frac{\partial}{\partial \theta_m} \log \prod_{\vec d \in \mathcal{D}} p(\vec d \,|\, \theta_1, \ldots, \theta_n)$$

$$= \sum_{\vec d \in \mathcal{D}} \frac{\partial \log p(\vec d \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m}$$

$$= N \left\langle \frac{\partial \log p(\vec d \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} \right\rangle_{P^0}$$

Remember this equivalence!

Page 10

Maximizing the training data log likelihood

$$\frac{1}{N} \frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} = \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \frac{\partial}{\partial \theta_m} \log \frac{\prod_m f_m(\vec d \,|\, \theta_m)}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}$$

$$= \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \left[ \frac{\partial \log \prod_m f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} - \frac{\partial \log \sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right]$$

$$= \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \frac{\partial \log \sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Page 11

Maximizing the training data log likelihood

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{\partial \log \sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

(the average over the training data is the expectation under the empirical distribution $P^0$; the second term does not depend on $\vec d$, so its average is itself)

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)} \frac{\partial \sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Using $(\log x)' = x'/x$.

Page 12

Maximizing the training data log likelihood

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)} \sum_{\vec c} \left[ \prod_{j \neq m} f_j(\vec c \,|\, \theta_j) \right] \frac{\partial f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m)} \sum_{\vec c} \prod_m f_m(\vec c \,|\, \theta_m) \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Again using $(\log x)' = x'/x$, i.e. $\partial f_m = f_m \, \partial \log f_m$.

Page 13

Maximizing the training data log likelihood

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \sum_{\vec c} \frac{\prod_m f_m(\vec c \,|\, \theta_m)}{\sum_{\vec c'} \prod_m f_m(\vec c' \,|\, \theta_m)} \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \sum_{\vec c} p(\vec c \,|\, \theta_1, \ldots, \theta_n) \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Page 14

Maximizing the training data log likelihood

$$= \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty_\theta}$$

where $P^\infty_\theta$ is the model's equilibrium distribution $p(\vec c \,|\, \theta_1, \ldots, \theta_n)$, so the sum over $\vec c$ weighted by $p(\vec c \,|\, \theta_1, \ldots, \theta_n)$ is exactly an expectation under $P^\infty_\theta$.

Phew! We're done! So:

$$\frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty_\theta}$$

Remember the equivalence

$$N \left\langle \frac{\partial \log p(\vec d \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} \right\rangle_{P^0} \Leftrightarrow \frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m}$$
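Since this derivation is easy to get wrong, here is a sketch that checks the final identity numerically on a tiny discrete domain, comparing a finite-difference gradient of $\log p(\vec d \,|\, \theta)$ against $\partial \log f_m(\vec d)/\partial\theta_m$ minus the model expectation of $\partial \log f_m(\vec c)/\partial\theta_m$. The toy experts are the same illustrative exponential form used earlier, not anything from the slides.

```python
import itertools
import numpy as np

DIM = 4
configs = np.array(list(itertools.product([0, 1], repeat=DIM)), dtype=float)

def log_p(d, theta):
    """Exact log PoE density on the binary domain, with log f_m(x) = theta_m * x_m."""
    log_Z = np.logaddexp.reduce(configs @ theta)
    return float(d @ theta) - log_Z

theta = np.array([0.5, -0.2, 1.0, 0.3])
d = np.array([1.0, 0.0, 1.0, 1.0])
m, eps = 2, 1e-6

# Finite-difference gradient of log p(d | theta) w.r.t. theta_m.
tp, tm = theta.copy(), theta.copy()
tp[m] += eps
tm[m] -= eps
numeric = (log_p(d, tp) - log_p(d, tm)) / (2 * eps)

# Derived form: dlogf_m(d)/dtheta_m - sum_c p(c|theta) * dlogf_m(c)/dtheta_m.
p_c = np.exp(configs @ theta - np.logaddexp.reduce(configs @ theta))
analytic = d[m] - np.sum(p_c * configs[:, m])

print(numeric, analytic)   # the two values should agree closely
```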

Page 15

Equilibrium Is Hard to Achieve

• With

$$\frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty_\theta}$$

we can now train our PoE model.

• But… there's a problem:
  – $\langle \cdot \rangle_{P^\infty_\theta}$ is computationally infeasible to obtain (especially in an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!

Page 16

Solution: Contrastive Divergence!

• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after 1 iteration (or, more typically, perhaps a few iterations).

• Why does this work?
  – It attempts to minimize the ways that the model distorts the data.

$$\frac{\partial \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n)}{\partial \theta_m} \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^1_\theta}$$

Page 17

Equivalence of argmax log P() and argmin KL()

$$P^0 \,\|\, P^\infty_\theta = \sum_{\vec d} P^0(\vec d) \log \frac{P^0(\vec d)}{P^\infty_\theta(\vec d)}$$

$$= \sum_{\vec d} P^0(\vec d) \log P^0(\vec d) - \sum_{\vec d} P^0(\vec d) \log P^\infty_\theta(\vec d)$$

$$= -H(P^0) - \left\langle \log P^\infty_\theta(\vec d) \right\rangle_{P^0}$$

$$\frac{\partial \left( P^0 \,\|\, P^\infty_\theta \right)}{\partial \theta_m} = - \left\langle \frac{\partial \log P^\infty_\theta(\vec d)}{\partial \theta_m} \right\rangle_{P^0}$$

This is what we got out of the nasty derivation!
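To spell out the equivalence claimed in the slide title (this framing is mine, using the notation above): the entropy term does not depend on θ, and the remaining expectation is exactly the average training log likelihood from the earlier derivation, so

$$\underset{\theta}{\operatorname{argmin}} \; P^0 \,\|\, P^\infty_\theta
= \underset{\theta}{\operatorname{argmax}} \; \left\langle \log P^\infty_\theta(\vec d) \right\rangle_{P^0}
= \underset{\theta}{\operatorname{argmax}} \; \frac{1}{N} \log p(\mathcal{D} \,|\, \theta_1, \ldots, \theta_n).$$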

Page 18

Contrastive Divergence

$$\frac{\partial}{\partial \theta_m} \left( P^0 \,\|\, P^\infty_\theta - P^1_\theta \,\|\, P^\infty_\theta \right) = - \left\langle \frac{\partial \log P^\infty_\theta(\vec d)}{\partial \theta_m} \right\rangle_{P^0} + \left\langle \frac{\partial \log P^\infty_\theta(\vec d)}{\partial \theta_m} \right\rangle_{P^1_\theta}$$

(the additional term arising from the dependence of $P^1_\theta$ itself on $\theta_m$ is small and is ignored), so the update direction is

$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty_\theta} - \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^1_\theta} + \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty_\theta}$$

$$\propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^1_\theta}$$

with the intractable $\langle \cdot \rangle_{P^\infty_\theta}$ terms cancelling.

• We want to “update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step”.

Page 19

Contrastive Divergence (Final Result!)

$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{P^1_\theta}$$

$$\Delta\theta_m \propto \frac{1}{N} \sum_{\vec d \in \mathcal{D}} \frac{\partial \log f_m(\vec d \,|\, \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{\vec c \sim P^1_\theta} \frac{\partial \log f_m(\vec c \,|\, \theta_m)}{\partial \theta_m}$$

Here the $\theta_m$ are the model parameters, $P^0$ is the training data (the empirical distribution), and $P^1_\theta$ is the distribution of one-step samples from the model. By the Law of Large Numbers, the expectations are computed using sample averages.

Now you know how to do it and why it works!
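Putting the pieces together, here is an end-to-end sketch under several assumptions of my own (fixed projection vectors $\vec j_m$, learning only the exponents $\alpha_m$, synthetic training data, and a single random-walk Metropolis step to produce the one-step reconstructions). It trains a two-expert Student-t PoE with CD-1 using exactly the update displayed above; it is an illustration, not the slides' or the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: two 1-D Student-t experts with fixed projections j_m;
# only the exponents alpha_m are learned.
J = np.array([[1.0, 0.3], [0.0, 1.0]])        # rows are the projection vectors j_m
alpha = np.array([1.0, 1.0])                  # expert exponents to learn
data = rng.standard_t(df=3, size=(500, 2))    # synthetic training data

def dlogf_dalpha(x):
    """d log f_m / d alpha_m = -log(1 + 0.5*(j_m^T x)^2), for each expert m."""
    return -np.log1p(0.5 * (x @ J.T) ** 2)

def log_poe_unnorm(x, alpha):
    """Unnormalized log PoE density: sum_m -alpha_m * log(1 + 0.5*(j_m^T x)^2)."""
    return (alpha * -np.log1p(0.5 * (x @ J.T) ** 2)).sum(axis=-1)

learning_rate = 0.01
for epoch in range(200):
    # One Metropolis step started at each data point gives samples c ~ P^1_theta.
    proposal = data + 0.5 * rng.standard_normal(data.shape)
    log_r = log_poe_unnorm(proposal, alpha) - log_poe_unnorm(data, alpha)
    accept = np.log(rng.random(len(data))) < log_r
    reconstruction = np.where(accept[:, None], proposal, data)

    # CD-1 update: <dlogf/dalpha>_{P^0} - <dlogf/dalpha>_{P^1}.
    grad = dlogf_dalpha(data).mean(axis=0) - dlogf_dalpha(reconstruction).mean(axis=0)
    alpha += learning_rate * grad

print("learned alphas:", alpha)
```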