
Document Classification using Deep Belief Nets

Lawrence McAfee, 6/9/08

CS224n, Spring ‘08

Overview

• Corpus: Wikipedia XML Corpus

• Single-labeled data: each document falls under a single category

• Binary feature vectors: bag-of-words, where ‘1’ indicates the word occurred one or more times in the document (see the sketch at the end of this slide)

[Diagram: documents (Doc #1, Doc #2, Doc #3) fed into a classifier, which assigns category labels such as Food, Brazil, and President.]
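
As a rough illustration of the binary feature vectors above, here is a minimal Python/NumPy sketch (not from the original slides); the toy vocabulary, whitespace tokenizer, and example documents are assumptions made only for this example.

    import numpy as np

    def binary_bow(documents, vocabulary):
        """Binary bag-of-words: entry j is 1 if word j occurs at least once."""
        index = {word: j for j, word in enumerate(vocabulary)}
        X = np.zeros((len(documents), len(vocabulary)), dtype=np.uint8)
        for i, doc in enumerate(documents):
            for word in doc.lower().split():      # naive whitespace tokenizer (assumption)
                j = index.get(word)
                if j is not None:
                    X[i, j] = 1                   # presence only; counts are discarded
        return X

    # Hypothetical toy example
    vocab = ["food", "brazil", "president", "rice", "election"]
    docs = ["Rice is a staple food", "Brazil held a president election"]
    print(binary_bow(docs, vocab))

Discarding the counts is exactly what the later “Depth” slide points to as a limitation: the vector for a document that mentions a word once is identical to one that mentions it fifty times.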

Background on Deep Belief Nets

[Diagram: a DBN trained as a stack of RBMs. The training data feeds RBM 1, which learns features/basis vectors for the training data; RBM 2 learns higher-level features; RBM 3 learns very abstract features.]

RBM

• Unsupervised, clustering training algorithm

Inside an RBM

[Diagram: an RBM’s visible layer connected to its hidden layer (units indexed i and j); plot of energy versus configuration (v, h), with low energy at the input/training data (e.g. ‘Golf’ and ‘Cycling’ documents).]

• The goal in training an RBM is to minimize the energy of configurations corresponding to the input data

• The RBM is trained by repeatedly sampling the hidden and visible units for a given data input (see the CD-1 sketch below)
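
The sampling-based training described above is, in essence, contrastive divergence. Below is a minimal NumPy sketch of CD-1 for a single binary RBM; the learning rate, epoch count, and initialization are illustrative assumptions, not the settings used in these experiments.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm(V, n_hidden, lr=0.05, epochs=10, rng=np.random.default_rng(0)):
        """Train a binary RBM with 1-step contrastive divergence (CD-1).

        V: (n_examples, n_visible) binary data matrix.
        Returns weights W, visible biases a, hidden biases b.
        """
        n_visible = V.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        a = np.zeros(n_visible)          # visible biases
        b = np.zeros(n_hidden)           # hidden biases
        for _ in range(epochs):
            for v0 in V:
                # Up: sample hidden units given the data (positive phase)
                ph0 = sigmoid(v0 @ W + b)
                h0 = (rng.random(n_hidden) < ph0).astype(float)
                # Down-up: reconstruct visible units, then recompute hidden probabilities
                pv1 = sigmoid(h0 @ W.T + a)
                v1 = (rng.random(n_visible) < pv1).astype(float)
                ph1 = sigmoid(v1 @ W + b)
                # Update lowers the energy of data configurations relative
                # to the model's reconstructions
                W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
                a += lr * (v0 - v1)
                b += lr * (ph0 - ph1)
        return W, a, b

Stacking RBMs as in the earlier diagram then amounts to training the first RBM on the data and training each subsequent RBM on the hidden activations sigmoid(V @ W + b) of the layer below it.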

Depth

• Binary representation does not capture word frequency information

• Inaccurate features learned at each level of DBN

[Plot: accuracy (%) versus number of layers; series: ‘straight’, ‘linear’.]

Training Iterations

• Accuracy increases with more training iterations

• Increasing iterations may (partially) make up for learning poor features

[Plot: accuracy (%) versus training iterations per layer (0–12,000).]

[Diagram: two plots of energy versus configuration (v, h), shown for ‘Lions’ and ‘Tigers’ examples.]

Comparison to SVM, NB

• Binary features do not provide a good starting point for learning higher-level features

• Binary features are still useful, as 22% accuracy is better than random

• Training time: DBN 2 h 13 min; SVM 4 s; NB 3 s (a baseline sketch follows the chart below)

[Bar chart: accuracy (%) by classifier on 30 categories: DBN (100K iterations), SVM, NB.]
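
For reference, baselines of the kind compared above can be run with scikit-learn on the same binary feature matrix; the classifier choices (LinearSVC, BernoulliNB), the random placeholder data, and the train/test split below are assumptions for illustration, not the exact setups timed in the chart.

    # Hypothetical baseline run on a binary bag-of-words matrix X with labels y
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(300, 500)).astype(float)   # placeholder binary features
    y = rng.integers(0, 30, size=300)                        # placeholder labels, 30 categories

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    for name, clf in [("SVM", LinearSVC()), ("NB", BernoulliNB())]:
        clf.fit(X_train, y_train)
        print(name, "accuracy:", clf.score(X_test, y_test))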

Lowercasing

• Supposedly richer vocabulary when lowercasing

• Overfitting: we don’t need these extra words

• Other experiments show only the top 500 words are relevant

[Plot: accuracy (%) versus number of hidden neurons in the top layer; series: lowercase, non-lowercase.]

Suggestions for Improvement

• Use appropriate continuous-valued neurons
  • Linear or Gaussian neurons
  • Slower to train
  • Not much documentation on using continuous-valued neurons with RBMs

• Implement backpropagation to fine-tune weights and biases (see the sketch below)
  • Propagate error derivatives from the top-level RBM back to the inputs
  • Unsupervised training gives good initial weights, while backpropagation slightly modifies the weights/biases
  • Backpropagation cannot be used alone, as it tends to get stuck in local optima
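
One way to realize the fine-tuning suggestion, as a sketch only, is to stack a softmax layer on top of an RBM-pretrained hidden layer and propagate the cross-entropy error derivatives back into the pretrained weights. The W and b arguments below are assumed to come from the earlier train_rbm sketch, and the learning rate and epoch count are illustrative assumptions, not part of the original project.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fine_tune(V, labels, W, b, n_classes, lr=0.1, epochs=20,
                  rng=np.random.default_rng(0)):
        """Supervised fine-tuning sketch: a softmax layer is stacked on the
        RBM-pretrained hidden layer (W, b), and cross-entropy error derivatives
        are propagated back into W and b."""
        n_hidden = W.shape[1]
        U = 0.01 * rng.standard_normal((n_hidden, n_classes))   # softmax weights
        c = np.zeros(n_classes)
        Y = np.eye(n_classes)[labels]                            # one-hot targets
        for _ in range(epochs):
            H = sigmoid(V @ W + b)                               # hidden activations
            scores = H @ U + c
            P = np.exp(scores - scores.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)                    # softmax probabilities
            dscores = (P - Y) / len(V)                           # cross-entropy gradient
            dH = dscores @ U.T * H * (1 - H)                     # backprop through sigmoid
            U -= lr * H.T @ dscores
            c -= lr * dscores.sum(axis=0)
            W -= lr * V.T @ dH                                   # small update to pretrained weights
            b -= lr * dH.sum(axis=0)
        return W, b, U, c

Because the unsupervised pre-training already supplies reasonable values for W and b, the fine-tuning learning rate can stay small, which matches the point above that backpropagation only slightly modifies the weights and biases.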