
DEEP BELIEF NETWORKS

Ruslan Salakhutdinov and Geoffrey Hinton
University of Toronto, Machine Learning Group

The Snowbird Workshop

March 22, 2007


Talk outline

• Deep Belief Nets as stacks of Restricted Boltzmann Machines.

– Nonlinear Dimensionality Reduction.
– Discriminative Fine-tuning for Regression and Classification.

• Deep Belief Nets as Generative Models.

– A Generative Model of Simple Shapes.

• Another Application of Deep Belief Nets (if time permits).

– Semantic Hashing for Ultra Fast Document Retrieval.


Restricted Boltzmann Machines

[Figure: a two-layer RBM with visible units v, hidden units h, symmetric weights W, and bias terms.]

• We can model an ensemble of binary images using Restricted Boltzmann Machines (RBM).

• An RBM is a two-layer network in which visible, binary stochastic pixels v are connected to hidden binary stochastic feature detectors h.

• A joint configuration (v,h) has an energy:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i\in\text{pixels}} b_i v_i \;-\; \sum_{j\in\text{features}} b_j h_j \;-\; \sum_{i,j} v_i h_j W_{ij}$$

• The probability that the model assigns to v is

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v},\mathbf{h}) = \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h}))}{\sum_{\mathbf{u},\mathbf{g}} \exp(-E(\mathbf{u},\mathbf{g}))}$$
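As a concrete illustration (not from the slides), here is a minimal numpy sketch of the energy function and the unnormalized probability of a visible vector for a tiny RBM; the weight matrix, biases, and the brute-force enumeration over hidden states are illustrative choices that only make sense at toy scale.

```python
import numpy as np
from itertools import product

def energy(v, h, W, b_v, b_h):
    # E(v,h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j W_ij
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

def unnormalized_p(v, W, b_v, b_h):
    # Sum over all hidden configurations h of exp(-E(v, h)).
    n_hid = W.shape[1]
    return sum(np.exp(-energy(v, np.array(h), W, b_v, b_h))
               for h in product([0, 1], repeat=n_hid))

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))   # 4 visible pixels, 3 hidden feature detectors
b_v, b_h = np.zeros(4), np.zeros(3)
v = np.array([1, 0, 1, 1])

# p(v) would divide this by the partition function, a sum over all (v, h) pairs.
print(unnormalized_p(v, W, b_v, b_h))
```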


Inference and Learning

[Figure: correlations ⟨v_i h_j⟩ measured on the data, on the one-step reconstruction, and on the model's equilibrium ("fantasy") samples.]

• Conditional distributions over hidden and visible units are given by the logistic function:

$$p(h_j = 1\,|\,\mathbf{v}) = \frac{1}{1 + \exp(-b_j - \sum_i v_i W_{ij})}$$

$$p(v_i = 1\,|\,\mathbf{h}) = \frac{1}{1 + \exp(-b_i - \sum_j h_j W_{ij})}$$

• Maximum Likelihood learning:

$$\Delta W_{ij} = \epsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\infty}\right)$$

• Contrastive Divergence (1-step) learning:

$$\Delta W_{ij} = \epsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{1}\right)$$
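A hedged numpy sketch of one contrastive-divergence (CD-1) weight update on a batch of binary data. The learning rate, batching, and omission of bias updates are simplifications for illustration, not the authors' training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_v, b_h, eps=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step on a batch of binary visible vectors V (batch x n_vis)."""
    # Up pass: p(h=1|v) and a binary sample of the hidden units.
    ph_data = sigmoid(V @ W + b_h)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Down pass: one-step reconstruction of the visible units, then hidden probabilities again.
    pv_recon = sigmoid(h_data @ W.T + b_v)
    ph_recon = sigmoid(pv_recon @ W + b_h)
    # Delta W_ij = eps * (<v_i h_j>_data - <v_i h_j>_1), averaged over the batch.
    n = V.shape[0]
    return W + eps * (V.T @ ph_data - pv_recon.T @ ph_recon) / n

rng = np.random.default_rng(1)
V = (rng.random((20, 6)) < 0.5).astype(float)   # toy batch of 6-pixel "images"
W = 0.01 * rng.standard_normal((6, 4))
W = cd1_update(V, W, b_v=np.zeros(6), b_h=np.zeros(4))
```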


What a single RBM learns

• Random sample of the RBM’s receptive fields (W) for MNIST (left) and Olivetti (right).

• Input data

• Learned W


Learning Stacks of RBM’s

[Figure: a stack of RBMs with hidden layers of 2000, 1000, 500 and 30 units, learned one layer at a time with weight matrices W_1 through W_4.]

• A single layer of binary features generally cannot perfectly model the structure in the data.

• Perform greedy, layer-by-layer learning:

– Learn and Freeze W_1.
– Treat the existing feature detectors, driven by the training data, σ(W_1^T v), as if they were data.
– Learn and Freeze W_2.
– Greedily learn as many layers of features as desired (a code sketch of this procedure follows the bullets below).

• Under certain conditions adding an extra layer always improves a lower bound on the log probability of the data (explained later).

• Each layer of features captures strong high-order correlations between the activities of units in the layer below.
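A minimal sketch, under toy assumptions, of the greedy layer-by-layer procedure: train one small CD-1 RBM, freeze it, push the data through σ(Wᵀv + b), and treat those hidden probabilities as data for the next RBM. The layer sizes, epoch count, and trainer are placeholders, not the slide's actual settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, epochs=10, eps=0.1, seed=0):
    """Very small CD-1 trainer; returns the learned weights and hidden biases."""
    rng = np.random.default_rng(seed)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        ph = sigmoid(data @ W + b_h)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b_v)
        ph_recon = sigmoid(pv @ W + b_h)
        W += eps * (data.T @ ph - pv.T @ ph_recon) / len(data)
    return W, b_h

def greedy_stack(data, layer_sizes):
    """Learn and freeze one RBM per layer; feed its feature activations upward as data."""
    weights, layer_input = [], data
    for n_hid in layer_sizes:
        W, b_h = train_rbm(layer_input, n_hid)
        weights.append((W, b_h))                      # freeze this layer
        layer_input = sigmoid(layer_input @ W + b_h)  # features become the next layer's data
    return weights

rng = np.random.default_rng(2)
images = (rng.random((100, 64)) < 0.5).astype(float)   # toy 8x8 binary images
stack = greedy_stack(images, layer_sizes=[32, 16, 8])  # placeholder layer sizes
```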


Nonlinear Dimensionality Reduction

[Figure: the pretrained stack is unrolled into a deep autoencoder: an encoder using weights W_1 through W_4, a 30-unit code layer, and a decoder using the transposed weights W_4^T through W_1^T.]

• Perform greedy, layer-by-layer pretraining.

• After pretraining multiple layers, the model is unrolled to create a deep autoencoder (sketched in code below).

• Initially encoder and decoder networks use the same weights.

• The global fine-tuning uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.

• Backpropagation only has to do local search.

• We used a 625-2000-1000-500-30 autoencoder to extract 30-D real-valued codes for Olivetti face patches (a network with 7 hidden layers is usually hard to train).

• We used a 784-1000-500-250-30 autoencoder to extract 30-D real-valued codes for MNIST images.
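A hedged sketch of the unrolling step: the pretrained weights are copied into an encoder, their transposes into a decoder, and the result is a single deep autoencoder whose reconstruction error can then be fine-tuned with backpropagation (the backprop pass itself is omitted). For simplicity every layer here uses the logistic nonlinearity, and the shapes are toy stand-ins for the architectures on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unroll(pretrained):
    """Build encoder/decoder weight lists from a list of pretrained (W, b_h) pairs.
    Encoder and decoder initially share the same (transposed) weights."""
    encoder = [(W, b_h) for W, b_h in pretrained]
    decoder = [(W.T, np.zeros(W.shape[0])) for W, b_h in reversed(pretrained)]
    return encoder, decoder

def autoencode(v, encoder, decoder):
    """Map input -> code -> reconstruction through the unrolled network."""
    x = v
    for W, b in encoder:
        x = sigmoid(x @ W + b)
    code = x                      # stand-in for the 30-D code layer
    for W, b in decoder:
        x = sigmoid(x @ W + b)
    return code, x

# Toy pretrained stack standing in for e.g. 784-1000-500-250-30.
rng = np.random.default_rng(3)
pretrained = [(0.01 * rng.standard_normal((64, 32)), np.zeros(32)),
              (0.01 * rng.standard_normal((32, 8)), np.zeros(8))]
encoder, decoder = unroll(pretrained)
code, recon = autoencode((rng.random((5, 64)) < 0.5).astype(float), encoder, decoder)
```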


The Big Picture

[Figure: the big picture: pretraining a 2000-1000-500-30 stack of RBMs, unrolling it into a deep autoencoder, and fine-tuning all encoder and decoder weights (W_i + ε) with backpropagation.]

Show Demo.


Reuters Corpus: Learning 2-D code space

[Figure: left, the autoencoder's 2-D topic space, with clusters for Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, and Energy Markets; right, the LSA 2-D topic space.]

• We use a 2000-500-250-125-2 autoencoder to convert test documents into a two-dimensional code.

• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).

• We used a simple “bag-of-words” representation. Each article is represented as a vector containing the counts of the most frequent 2000 words in the training dataset.
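A small illustrative sketch (with made-up documents) of the bag-of-words step: choose the most frequent words in the training set as the vocabulary and represent each article as a vector of their counts. Only the vocabulary size of 2000 comes from the slide; everything else is a placeholder.

```python
from collections import Counter

def build_vocab(train_docs, vocab_size=2000):
    """The most frequent training-set words define the dimensions of the count vector."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts.get(word, 0) for word in vocab]

train_docs = ["oil prices rose in energy markets",
              "earnings reports lifted markets"]          # toy stand-ins for Reuters stories
vocab = build_vocab(train_docs)
vector = bag_of_words("energy markets and earnings", vocab)
```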


Results for 10-D codes

• We use the cosine of the angle between two codes as a measure ofsimilarity.

• Precision-recall curves when a 10-D query document from the test set is used to retrieve other test set documents, averaged over 402,207 possible queries (a retrieval sketch in code follows).
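A hedged sketch of this retrieval protocol: rank every other test document by the cosine of the angle between its code and the query's code. The random codes and the top-k cut-off are placeholders; precision and recall would then be computed against whatever relevance labels one has.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def retrieve(query_code, codes, k=10):
    """Indices of the k codes most similar to the query (excluding the query is up to the caller)."""
    sims = np.array([cosine(query_code, c) for c in codes])
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(4)
codes = rng.standard_normal((1000, 10))       # stand-in for 10-D document codes
topk = retrieve(codes[0], codes, k=10)
# precision = fraction of retrieved documents sharing the query's topic label;
# recall    = fraction of the query's topic that was retrieved.
```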

[Figure: precision (%) versus recall (%) for 10-D autoencoder codes, 10-D LSA, 50-D LSA, and the 10-D autoencoder prior to fine-tuning.]


Deep Belief Nets for Classification

[Figure: pretraining a 784-500-500-2000 stack of RBMs, unrolling it, adding a 10-way softmax output layer, and fine-tuning all weights (W_i + ε) discriminatively.]

• After layer-by-layer pretraining of a 784-500-500-2000-10 network, discriminative fine-tuning achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.

• Clearly pretraining helps generalization. It ensures that most of the information in the weights comes from modeling the input data.

• The very limited information in the labels is used only to slightly adjust the final weights (a sketch of the fine-tuning stage follows below).
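A minimal sketch of the discriminative stage, assuming a pretrained stack like the one above: place a softmax output layer on top of the pretrained feature layers and adjust it with the cross-entropy gradient. Only the softmax layer's gradient is shown here; full backpropagation through the lower layers is omitted, and all data and shapes are toy stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, pretrained, W_out, b_out):
    """Pretrained layers (kept fixed here for brevity) followed by a 10-way softmax."""
    h = X
    for W, b in pretrained:
        h = sigmoid(h @ W + b)
    return h, softmax(h @ W_out + b_out)

def finetune_step(X, labels, pretrained, W_out, b_out, eps=0.1):
    h, probs = forward(X, pretrained, W_out, b_out)
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0      # d(cross-entropy)/d(logits)
    W_out -= eps * h.T @ grad / len(X)
    b_out -= eps * grad.mean(axis=0)
    return W_out, b_out

rng = np.random.default_rng(5)
pretrained = [(0.01 * rng.standard_normal((784, 500)), np.zeros(500)),
              (0.01 * rng.standard_normal((500, 500)), np.zeros(500)),
              (0.01 * rng.standard_normal((500, 2000)), np.zeros(2000))]
W_out, b_out = 0.01 * rng.standard_normal((2000, 10)), np.zeros(10)
X = rng.random((32, 784))                  # toy batch standing in for MNIST images
labels = rng.integers(0, 10, size=32)
W_out, b_out = finetune_step(X, labels, pretrained, W_out, b_out)
```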


A Regression Task

• Predicting the orientation of a face patch.

[Figure: example face patches with orientation labels of -66.84°, 43.48°, 14.22°, 30.01°, -57.14° and -35.75°.]

• Labeled Training Data: input is 1000 labeled training patches from Olivetti faces of 30 training people; output is the orientation.

• Labeled Test Data: input is 1000 labeled test patches from Olivetti faces of 10 new people; the task is to predict the orientation.

• Gaussian Processes with a Gaussian kernel (using Radford Neal’s software) achieve an RMSE of 16.35° (±0.45°).


Deep Belief Nets for Regression

[Figure: labeled face patches with their orientations, alongside additional unlabeled patches.]

• Additional Unlabeled Training Data: 12000 face patches from 30 training people.

• Pretrain a stack of RBM’s: 784-1000-500.

• Features were extracted with no idea of the final task.

• Training a dumb linear regression model on the top-level features, using the 1000 labeled training cases, gives an RMSE of 13.73° (a sketch of this setup follows below).

• The same GP on the top-level features: RMSE 10.06° (±0.36°).
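A hedged sketch of the regression setup: push the labeled patches through the pretrained stack and fit an ordinary least-squares regressor on the resulting top-level features. The GP variant is omitted, and all shapes and data here are toy placeholders rather than the Olivetti patches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_level_features(X, pretrained):
    """Features extracted by the pretrained stack (toy sizes standing in for 784-1000-500)."""
    h = X
    for W, b in pretrained:
        h = sigmoid(h @ W + b)
    return h

rng = np.random.default_rng(6)
pretrained = [(0.01 * rng.standard_normal((64, 32)), np.zeros(32)),
              (0.01 * rng.standard_normal((32, 16)), np.zeros(16))]
X_train = rng.random((1000, 64))              # stand-in for 1000 labeled face patches
y_train = rng.uniform(-90, 90, size=1000)     # stand-in orientations in degrees

# "Dumb" linear regression on the top-level features (least squares with a bias term).
F = top_level_features(X_train, pretrained)
F1 = np.hstack([F, np.ones((len(F), 1))])
coef, *_ = np.linalg.lstsq(F1, y_train, rcond=None)
pred = F1 @ coef
rmse = np.sqrt(np.mean((pred - y_train) ** 2))
```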


The Generative View of Stacks of RBM’s

[Figure: the generative view of a stack of RBMs; the bottom weights W_frozen are fixed while the layers above model the distribution over h_0.]

• When W_frozen = W, the two models are the same.

• The weights W_frozen define p(v_0 | h_0, W_frozen) but also indirectly define p(h_0).

• Idea: Freeze the bottom layer of weights at W_frozen and change the higher layers to build a better model for p(h_0), one that is closer to the posterior over hidden features produced by applying W_frozen^T to the data, p(h_0 | v_0, W_frozen^T).

• As we learn a new layer, the inference becomes incorrect, but the bound on the log probability of the data increases (see Hinton et al.).


The Generative View of Stacks of RBM’s

[Figure: the frozen bottom weights define the likelihood term σ(W v_0), while the tied higher layers implement a complementary prior over the hidden units.]

• What about explaining away?

• A complementary prior exactly cancels out correlations created by explaining away! So the posterior factors.


Two Alternatives to Our Method

[Figure: alternative 1, learning a single layer h_0 over v_0 under a factorial prior; alternative 2, learning the weights W_1, W_2, W_3 of all layers at once, where the prior for h_0 depends on the higher layers.]

• Alternative 1:

– Without a complementary prior, learning one layer at a time is hard because of explaining away.

• Alternative 2:

– If we start with different weights in each layer and try to learn them all at once, we have major problems.

– Just to calculate the prior for h_0 requires integration over all higher-level hidden configurations! Good luck with that.


Semantic Hashing

[Figure: the semantic hashing pipeline: recursive pretraining of a 2000-500-500-20 stack of RBMs on bag-of-words vectors, unrolling into a deep autoencoder with a 20-unit code layer, and fine-tuning with Gaussian noise added to the input of the code layer.]

• Learn to map documents into a semantic 20-D binary code and use these codes as memory addresses.

• We have the ultimate retrieval tool: given a query document, compute its 20-bit address and retrieve all of the documents stored at similar addresses with no search at all (sketched below).
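A hedged sketch of what "no search at all" amounts to: threshold the 20-D code into a 20-bit address, store each document in a table keyed by that address, and answer a query by probing the query's address and the addresses within a small Hamming ball. The thresholding rule and the Hamming radius are illustrative choices, not the slide's exact procedure.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def to_address(code):
    """Threshold a 20-D real-valued code into a 20-bit integer address."""
    bits = (np.asarray(code) > 0.5).astype(int)
    return int("".join(map(str, bits)), 2)

def neighbors(address, n_bits=20, radius=1):
    """All addresses within the given Hamming distance of `address` (including itself)."""
    addrs = [address]
    for r in range(1, radius + 1):
        for flip in combinations(range(n_bits), r):
            a = address
            for bit in flip:
                a ^= 1 << bit
            addrs.append(a)
    return addrs

# Build the "memory": a table from 20-bit address to document ids.
rng = np.random.default_rng(7)
codes = rng.random((10000, 20))                     # stand-in for learned document codes
memory = defaultdict(list)
for doc_id, code in enumerate(codes):
    memory[to_address(code)].append(doc_id)

# Query: probe the query's address and its Hamming-ball neighbours; no scan over the corpus.
query_code = codes[0]
shortlist = [d for a in neighbors(to_address(query_code)) for d in memory.get(a, [])]
```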


Semantic Hashing

[Figure: a 2-D embedding of the 20-bit codes for Reuters documents, with clusters for Accounts/Earnings, Government Borrowing, European Community Monetary/Economic, Disasters and Accidents, and Energy Markets; semantically similar documents map to nearby memory addresses. Precision (%) versus recall (%) is shown for TF-IDF, TF-IDF using 20 bits, and Locality Sensitive Hashing.]

• We used a simple C implementation on the Reuters dataset (402,212 training and 402,212 test documents).

• For a given query, it takes about 0.5 milliseconds to create a short-list of about 3,000 semantically similar documents.

• It then takes 10 milliseconds to retrieve the top few matches from that short-list using TF-IDF, and it is more accurate than full TF-IDF.

• Locality-Sensitive Hashing takes about 500 milliseconds, and is less accurate.

• Our method is 50 times faster than the fastest existing method and is more accurate.


THE END
