DEEP BELIEF NETWORKS
Ruslan Salakhutdinov and Geoffrey Hinton
University of Toronto, Machine Learning Group
The Snowbird Workshop
March 22, 2007
Talk outline
• Deep Belief Nets as stacks of Restricted Boltzmann Machines.
– Nonlinear Dimensionality Reduction.
– Discriminative Fine-tuning for Regression and Classification.
• Deep Belief Nets as Generative Models.
– A Generative Model of Simple Shapes.
• Another Application of Deep Belief Nets (if time permits).
– Semantic Hashing for Ultra Fast Document Retrieval.
Restricted Boltzmann Machines
[Figure: an RBM — visible units v and hidden units h connected by weights W, with bias terms.]
• We can model an ensemble of binary images using Restricted Boltzmann Machines (RBMs).
• An RBM is a two-layer network in which visible, binary stochastic pixels v are connected to hidden binary stochastic feature detectors h.
• A joint configuration (v,h) has an energy:
E(v,h) = −∑_{i∈pixels} b_i v_i − ∑_{j∈features} b_j h_j − ∑_{i,j} v_i h_j W_{ij}
• The probability that the model assigns to v is
p(v) = ∑_{h∈H} p(v,h) = ∑_{h∈H} exp(−E(v,h)) / ∑_{u,g} exp(−E(u,g))
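To make the energy and probability concrete, here is a minimal NumPy sketch (the function names and array shapes are our illustration, not from the talk); note that the normalizer ∑_{u,g} exp(−E(u,g)) sums over every configuration, so only unnormalized quantities are cheap to compute:

    import numpy as np

    def rbm_energy(v, h, W, b_vis, b_hid):
        # E(v,h) = -sum_i b_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j W_ij
        return -(b_vis @ v) - (b_hid @ h) - (v @ W @ h)

    def unnormalized_p(v, h, W, b_vis, b_hid):
        # p(v,h) is proportional to exp(-E(v,h)); the partition function
        # is intractable for realistically sized models.
        return np.exp(-rbm_energy(v, h, W, b_vis, b_hid))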
Inference and Learning
[Figure: alternating Gibbs chain — ⟨v_i h_j⟩_data (data), ⟨v_i h_j⟩_1 (reconstruction), ⟨v_i h_j⟩_∞ (fantasy).]
• Conditional distributions over hidden and visible units are given by the logistic function:
p(h_j = 1 | v) = 1 / (1 + exp(−b_j − ∑_i v_i W_{ij}))
p(v_i = 1 | h) = 1 / (1 + exp(−b_i − ∑_j h_j W_{ij}))
• Maximum Likelihood learning:
∆W_{ij} = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_∞)
• Contrastive Divergence (1-step) learning:
∆W_{ij} = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_1)
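A minimal NumPy sketch of one CD-1 update under common conventions (mini-batch averaging, a binary sample of the hidden units for the data phase, probabilities for the reconstruction phase); the learning rate and these sampling choices are our assumptions, not prescriptions from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v_data, W, b_vis, b_hid, lr=0.1):
        """One CD-1 step for a mini-batch v_data of shape (n, num_visible)."""
        p_h_data = sigmoid(v_data @ W + b_hid)            # p(h=1|v) on data
        h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
        p_v_recon = sigmoid(h_sample @ W.T + b_vis)       # one-step reconstruction
        p_h_recon = sigmoid(p_v_recon @ W + b_hid)
        n = v_data.shape[0]
        # Delta W = eps * (<v h>_data - <v h>_1), averaged over the batch.
        W += lr * (v_data.T @ p_h_data - p_v_recon.T @ p_h_recon) / n
        b_vis += lr * (v_data - p_v_recon).mean(axis=0)
        b_hid += lr * (p_h_data - p_h_recon).mean(axis=0)
        return W, b_vis, b_hid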
What a single RBM learns
• Random sample of the RBM's receptive fields (W) for MNIST (left) and Olivetti (right).
[Figure: input data and the learned weights W.]
Learning Stacks of RBM’s
[Figure: a stack of RBMs — hidden layers of size 2000, 1000, 500, 30 trained with weights W1, W2, W3, W4.]
• A single layer of binary features generally cannot perfectly model the structure in the data.
• Perform greedy, layer-by-layer learning (see the sketch below):
– Learn and freeze W1.
– Treat the existing feature detectors, driven by the training data, σ(W1^T V), as if they were data.
– Learn and freeze W2.
– Greedily learn as many layers of features as desired.
• Under certain conditions adding an extra layer always improves a lower bound on the log probability of the data (explained later).
• Each layer of features captures strong high-order correlations between the activities of units in the layer below.
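A sketch of the greedy loop, where train_rbm is a hypothetical stand-in for a CD-1 training routine like the one sketched earlier:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def greedy_pretrain(data, layer_sizes, epochs=10):
        """layer_sizes might be [2000, 1000, 500, 30]; train_rbm is assumed
        to train one RBM with CD-1 and return its weights and biases."""
        stack = []
        layer_input = data
        for num_hidden in layer_sizes:
            W, b_vis, b_hid = train_rbm(layer_input, num_hidden, epochs)
            stack.append((W, b_vis, b_hid))      # learn and freeze this layer
            # Treat the data-driven feature activities as data for the next RBM.
            layer_input = sigmoid(layer_input @ W + b_hid)
        return stack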
Nonlinear Dimensionality Reduction
[Figure: Unrolling — the pretrained stack becomes an encoder (W1–W4, through layers of 2000, 1000, 500 units down to a 30-D code layer) and a decoder using the transposed weights (W4^T–W1^T) back up through 500, 1000, 2000 units.]
• Perform greedy, layer-by-layer pretraining.
• After pretraining multiple layers, the model is unrolled to create a deep autoencoder (sketched below).
• Initially the encoder and decoder networks use the same weights.
• The global fine-tuning uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.
• Backpropagation only has to do a local search.
• We used a 625-2000-1000-500-30 autoencoder to extract 30-D real-valued codes for Olivetti face patches (a network with 7 hidden layers is usually hard to train).
• We used a 784-1000-500-250-30 autoencoder to extract 30-D real-valued codes for MNIST images.
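A sketch of the forward pass through the unrolled autoencoder, reusing the stack of pretrained (W, b_vis, b_hid) triples; for the 30-D real-valued codes mentioned above the code layer would be linear, but this sketch uses logistic units throughout for brevity:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def autoencode(v, stack):
        x = v
        for W, _, b_hid in stack:                # encoder: W1..Wk up to the code
            x = sigmoid(x @ W + b_hid)
        code = x
        for W, b_vis, _ in reversed(stack):      # decoder: Wk^T..W1^T back down,
            x = sigmoid(x @ W.T + b_vis)         # initially tied to the encoder
        return code, x                           # fine-tuning unties the weights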
The Big Picture
[Figure: the full pipeline — Pretraining (a stack of RBMs with weights W1–W4 over layers of 2000, 1000, 500, 30 units), Unrolling (encoder W1–W4 and decoder W4^T–W1^T around the 30-D code layer), and Fine-tuning (all weights adjusted to W+ε by backpropagation).]
Show Demo.
Reuters Corpus: Learning 2-D code space
[Figure: Autoencoder 2-D topic space vs. LSA 2-D topic space — clusters include Accounts/Earnings, Government Borrowings, European Community Monetary/Economic, Disasters and Accidents, Energy Markets, Interbank Markets, Legal/Judicial, and Leading Economic Indicators.]
• We use a 2000-500-250-125-2 autoencoder to convert test documents into a two-dimensional code.
• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).
• We used a simple "bag-of-words" representation. Each article is represented as a vector containing the counts of the most frequent 2000 words in the training dataset.
Results for 10-D codes
• We use the cosine of the angle between two codes as a measure of similarity (see the sketch below).
• Precision-recall curves when a 10-D query document from the test set is used to retrieve other test set documents, averaged over 402,207 possible queries.
[Plot: Precision (%) vs. Recall (%) — Autoencoder 10-D, LSA 10-D, LSA 50-D, and Autoencoder 10-D prior to fine-tuning.]
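For concreteness, a sketch of ranking by cosine similarity over the 10-D codes (the helper name and the cutoff k are our choices):

    import numpy as np

    def retrieve(query_code, all_codes, k=10):
        """Return indices of the k documents most similar to the query,
        where all_codes has shape (num_docs, 10)."""
        q = query_code / np.linalg.norm(query_code)
        c = all_codes / np.linalg.norm(all_codes, axis=1, keepdims=True)
        sims = c @ q                   # cosine of the angle between codes
        return np.argsort(-sims)[:k]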
Deep Belief Nets for Classification
[Figure: Pretraining (a stack of RBMs over layers of 500, 500, 2000 units), Unrolling (adding a 10-way softmax output), and Fine-tuning (weights W+ε) for classification.]
• After layer-by-layer pretraining of a 784-500-500-2000-10 network, discriminative fine-tuning (sketched below) achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.
• Clearly pretraining helps generalization. It ensures that most of the information in the weights comes from modeling the input data.
• The very limited information in the labels is used only to slightly adjust the final weights.
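A sketch of the discriminative forward pass after unrolling: the pretrained layers are kept, only the 10-way softmax head is new (and randomly initialized), and fine-tuning backpropagates the cross-entropy loss through everything; the names here are ours:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def predict(v, stack, W_out, b_out):
        """v: batch of images; stack: pretrained 784-500-500-2000 layers."""
        x = v
        for W, _, b_hid in stack:
            x = sigmoid(x @ W + b_hid)
        return softmax(x @ W_out + b_out)   # 10 class probabilities per image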
A Regression Task
• Predicting the orientation of a face patch.
[Figure: face patches labeled with orientations such as −66.84°, 43.48°, 14.22°, 30.01°, −57.14°, −35.75°.]
• Labeled Training Data: input is 1000 labeled training patches from Olivetti faces of 30 training people; output is the orientation.
• Labeled Test Data: input is 1000 labeled test patches from Olivetti faces of 10 new people; predict the orientation.
• A Gaussian Process with a Gaussian kernel (using Radford Neal's software) achieves an RMSE of 16.35° (±0.45°).
Deep Belief Nets for Regression
[Figure: labeled face patches as above, plus unlabeled patches.]
• Additional Unlabeled Training Data: 12000 face patches from the 30 training people.
• Pretrain a stack of RBMs: 784-1000-500.
• Features were extracted with no idea of the final task.
• Training a dumb linear regression model on the top-level features using the 1000 labeled training cases achieves an RMSE of 13.73° (see the sketch below).
• The same GP on the top-level features: RMSE 10.06° (±0.36°).
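A sketch of the "dumb" linear baseline, assuming features produced by the pretrained 784-1000-500 stack; the least-squares formulation is standard, the helper names are ours:

    import numpy as np

    def fit_linear_head(features, angles):
        """Least-squares regression from top-level features to orientation."""
        X = np.hstack([features, np.ones((len(features), 1))])  # append bias
        w, *_ = np.linalg.lstsq(X, angles, rcond=None)
        return w

    def rmse(pred, target):
        return float(np.sqrt(np.mean((pred - target) ** 2)))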
The Generative View of Stacks of RBM’s
[Figure: the bottom weights are frozen (W_frozen, W_frozen^T) while higher layers (h0, v1, h1, …) continue upward with weights W.]
• When W_frozen = W, the two models are the same.
• The weights W_frozen define p(v0 | h0, W_frozen) but also indirectly define p(h0).
• Idea: freeze the bottom layer of weights at W_frozen and change the higher layers to build a better model for p(h0), one that is closer to the posterior over hidden features produced by applying W_frozen^T to the data, p(h0 | v0, W_frozen^T).
• As we learn a new layer, the inference becomes incorrect, but the bound on the log probability of the data increases (see Hinton et al.).
The Generative View of Stacks of RBM’s
[Figure: the frozen weights W_frozen^T define a likelihood term σ(W v0), while the layers above act as a complementary prior over h0; the toy example uses weights of +20.]
• What about explaining away?
• A complementary prior exactly cancels out correlations created by explaining away! So the posterior factors.
Two Alternatives to Our Method
[Figure: Alternative 1 — a factorial prior over h0 with likelihood W1; Alternative 2 — a full prior over h0 defined by higher layers h1, h2 with weights W2, W3.]
• Alternative 1:
– Without a complementary prior, learning one layer at a time is hard because of explaining away.
• Alternative 2:
– If we start with different weights in each layer and try to learn them all at once, we have major problems.
– Just to calculate the prior for h0 requires integration over all higher-level hidden configurations! Good luck with that.
Semantic Hashing
[Figure: semantic hashing — Recursive Pretraining (an RBM stack 2000-500-500-20 on bag-of-words input), Unrolling into an autoencoder around the 20-unit code layer, and Fine-tuning with Gaussian noise injected into the code layer.]
• Learn to map documents into a semantic 20-D binary code and use these codes as memory addresses.
• We have the ultimate retrieval tool: given a query document, compute its 20-bit address and retrieve all of the documents stored at similar addresses with no search at all (see the sketch below).
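A sketch of the scheme, assuming real-valued code activities thresholded into 20 bits; the Hamming radius and helper names are our choices, not from the talk:

    from collections import defaultdict

    def to_address(code_probs):
        """Threshold the 20 code activities into a 20-bit integer address."""
        return sum(1 << i for i, p in enumerate(code_probs) if p > 0.5)

    memory = defaultdict(list)   # address -> list of document ids
    # for doc_id, code in enumerate(codes): memory[to_address(code)].append(doc_id)

    def shortlist(query_code, memory, radius=2):
        """Collect documents whose addresses lie within `radius` bit flips of
        the query address -- a short-list found with no corpus search at all."""
        addr = to_address(query_code)
        candidates, frontier = set(memory.get(addr, [])), {addr}
        for _ in range(radius):
            frontier = {a ^ (1 << b) for a in frontier for b in range(20)}
            for a in frontier:
                candidates.update(memory.get(a, []))
        return candidates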
Semantic Hashing
[Figure: Reuters 2-D embedding of 20-bit codes — clusters include Accounts/Earnings, Government Borrowing, European Community Monetary/Economic, Disasters and Accidents, and Energy Markets; a memory diagram maps each document to an address near semantically similar documents.]
[Plot: Precision (%) vs. Recall (%) — TF-IDF, TF-IDF using 20 bits, and Locality-Sensitive Hashing.]
• We used a simple C implementation on the Reuters dataset (402,207 training and 402,207 test documents).
• For a given query, it takes about 0.5 milliseconds to create a short-list of about 3,000 semantically similar documents.
• It then takes 10 milliseconds to retrieve the top few matches from that short-list using TF-IDF, and this is more accurate than full TF-IDF.
• Locality-Sensitive Hashing takes about 500 milliseconds and is less accurate.
• Our method is 50 times faster than the fastest existing method and is more accurate.
THE END