Random Forests and Ferns
David Capel
The Multi-class Classification Problem
Figure 1 (from Fergus, Zisserman and Perona). Some sample images from the datasets. Note the large variation in scale in, for example, the cars (rear) database. These datasets are from both http://www.vision.caltech.edu/html-files/archive.html and http://www.robots.ox.ac.uk/~vgg/data/, except for the Cars (Side), from http://l2r.cs.uiuc.edu/cogcomp/index_research.html, and Spotted Cats, from the Corel Image library.
Training: labelled exemplars representing multiple classes
Classifying: to which class does this new example belong?
Planes, Faces, Cars, Cats, Handwritten digits
• A classifier is a mapping H from feature vectors f to discrete class labels C
• Numerous choices of feature space F are possible, e.g.
The Multi-class Classification Problem
Raw pixel values, color histograms, oriented filter banks, texton histograms
f = (f1, f2, ..., fN),    C ∈ {c1, c2, ..., cK}

H : f → C
The Multi-class Classification Problem
• We have a (large) database of labelled exemplars
• Problem: Given such training data, learn the mapping H
• Even better: learn the posterior distribution over class label conditioned on the features:
...and obtain the classifier H as the mode of the posterior:
Dm = (fm, Cm) for m = 1...M

P(C = ck | f1, f2, ..., fN)

H(f) = argmax_k P(C = ck | f1, f2, ..., fN)
How do we represent and learn the mapping?

In Bishop's book, we saw numerous ways to build multi-class classifiers, e.g.
• Direct, non-parametric learning of the class posterior (histograms)
• K-Nearest Neighbours
• Fisher Linear Discriminants
• Relevance Vector Machines
• Multi-class SVMs
[Figure 4 (panels plot edge pixel intensity against center pixel intensity): Distribution of (a) negative pairs gn and (b) positive pairs gp used to construct the fast classifier. (c) ROC curve (sensitivity vs. 1 − specificity) used to determine the value of λ (indicated by the dashed line) which will produce the optimal decision surface given the costs of positive and negative errors. (d) The selected Bayes decision surface.]
The two distributions can be combined to yield a Bayes decision surface. If gp and gn represent the positive and negative distributions, then the classification of a given intensity pair x is:

classification(x) = +1 if λ · gp(x) > gn(x), −1 otherwise    (1)

where λ is the relative cost of a false negative over a false positive. The parameter λ was varied to produce the ROC curve shown in Figure 4c. A weighting of λ = e^12 produces the decision boundary shown in Figure 4d, and corresponds to a sensitivity of 0.9965 and a specificity of 0.75.

A subwindow is marked as a possible fiducial if a series of intensity pairs all lie within the positive decision region. Each pair contains the central point and one of seven outer pixels. The outer edge pixels were selected to minimize the number of false positives based on the above empirical distributions.

The first stage of the cascade seeks dark points surrounded by lighter backgrounds, and thus functions like a well-trained edge detector. Note, however, that the decision criterion is not simply (edge − center) > threshold, as would be the case if the center were merely required to be darker than the outer edge. Instead, the decision surface in Figure 4d encodes the fact that {dark center, dark edge} pairs are more likely to be background, and {light center, light edge} pairs are rare in the positive examples. Even at this early edge-detection stage there are benefits from including learning in the algorithm.
3.4 Cascade Stage Two: Nearest Neighbour

Among the various methods of supervised statistical pattern recognition, the nearest neighbour rule [18] achieves consistently high performance [19]. The strategy is very simple: given a training set of examples from each class, a new sample is assigned the class of the nearest training example. In contrast with many other classifiers, this makes no a priori assumptions about the distributions from which the training examples are drawn, other than the notion that nearby points will tend to be of the same class.
[Figure: a two-class example (Class A, Class B) in a 2-D feature space (f1, f2), comparing direct non-parametric classification with k-Nearest Neighbours.]
Binary Decision Trees

• Decision trees classify features by a series of Yes/No questions
• At each node, the feature space is split according to the outcome of some binary decision criterion
• The leaves are labelled with the class C corresponding to the feature reached via that path through the tree

[Diagram: root node B1(f)=True?; Yes branch → B2(f)=True?, No branch → B3(f)=True?; each path terminates at a leaf labelled c1 or c2.]
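Classification with such a tree is just a root-to-leaf walk, answering one Yes/No question per node. A minimal sketch in Python (the Node layout and the example tests are illustrative assumptions, not taken from the slides):

```python
# Minimal binary decision tree: internal nodes hold a binary test B(f),
# leaves hold a class label. The node layout here is an assumption.
class Node:
    def __init__(self, test=None, yes=None, no=None, label=None):
        self.test, self.yes, self.no, self.label = test, yes, no, label

def classify(node, f):
    # Descend until we reach a leaf, answering Yes/No at each node.
    while node.label is None:
        node = node.yes if node.test(f) else node.no
    return node.label

# Example tree: B1 thresholds f[0]; B2 thresholds f[1] (arbitrary tests).
tree = Node(test=lambda f: f[0] > 0.5,
            yes=Node(test=lambda f: f[1] > 0.5,
                     yes=Node(label="c1"), no=Node(label="c2")),
            no=Node(label="c1"))

print(classify(tree, [0.9, 0.8]))  # -> c1
print(classify(tree, [0.9, 0.1]))  # -> c2
```

The same traversal works for any node tests, axis-aligned or not; only `test` changes.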
Binary Decision Trees: Training
• To train, recursively partition the training data into subsets according to some Yes/No tests on the feature vectors
• Partitioning continues until each subset contains features of a single class
Common decision rules
• Axis-aligned splitting
Threshold on a single feature at each node
Very fast to evaluate
• General plane splitting
Threshold on a linear combination of features
More expensive but produces smaller trees
fn > T        (axis-aligned: threshold on a single feature)

wᵀf > T       (general plane: threshold on a linear combination)
Choosing the right split
• The split location may be chosen to optimize some measure of classification performance of the child subsets
• Encourage child subsets with lower entropy (confusion)
• Gini impurity
• Information gain (entropy reduction)
Gini impurity:

I_G = Σ_k p_k^left (1 − p_k^left) + Σ_k p_k^right (1 − p_k^right)

Information gain (entropy reduction):

I_E = − Σ_k p_k^left log p_k^left − Σ_k p_k^right log p_k^right
[Diagram: a split fn > T ? sends each training sample to the left or right child; pk denotes the fraction of each class in a child subset.]
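Both criteria score a candidate split directly from the class fractions pk in the two children; lower is better. A small sketch (function names are mine):

```python
import math

def gini(p):
    # Gini impurity of one child: sum_k p_k (1 - p_k)
    return sum(pk * (1.0 - pk) for pk in p)

def entropy(p):
    # Entropy of one child: -sum_k p_k log p_k (0 log 0 taken as 0)
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def split_score(p_left, p_right, impurity=gini):
    # Total child impurity, as in the slides' I_G / I_E; lower = better split
    return impurity(p_left) + impurity(p_right)

# A split that separates the two classes well scores lower:
good = split_score([0.9, 0.1], [0.1, 0.9])   # nearly pure children
bad  = split_score([0.5, 0.5], [0.5, 0.5])   # maximally confused children
print(good < bad)  # -> True
```

In practice the children would also be weighted by their sample counts; the slides' unweighted sums are kept here for fidelity.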
Pros and Cons of Decision Trees

Advantages
• Training can be fast and easy to implement
• Easily handles a large number of input variables

Drawbacks
• Clearly, it is always possible to construct a decision tree that scores 100% classification accuracy on the training set
• But such trees tend to over-fit and do not generalize very well
• Hence, performance on the testing set will be far less impressive

So, what can be done to overcome these problems?
Random Forests

Basic idea
• Somehow introduce randomness into the tree-learning process
• Build multiple, independent trees based on the training set
• When classifying an input, each tree votes for a class label
• The forest output is the consensus of all the tree votes
• If the trees really are independent, performance should improve with more trees

Ho, Tin (1995). "Random Decision Forests." Int'l Conf. on Document Analysis and Recognition.
Breiman, Leo (2001). "Random Forests." Machine Learning.
How to introduce randomness?

• Bagging: generate randomized training sets by sampling with replacement from the full training set (bootstrap sampling)

  Full training set:  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12
  Random "bag":       D4 D9 D3 D4 D12 D10 D10 D7 D3 D1 D6 D1

• Feature subset selection: choose a different random subset of the full feature vector to generate each tree

  Full feature vector: f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
  Feature subset:      f4 f6 f7 f9 f10
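Both randomization mechanisms are one-liners. A sketch mirroring the D1...D12 / f1...f10 example above (helper names are mine):

```python
import random

def bootstrap_bag(data, rng):
    # Sample len(data) items with replacement: some repeat, some are omitted.
    return [rng.choice(data) for _ in data]

def feature_subset(num_features, subset_size, rng):
    # Random subset of feature indices used to grow one tree.
    return sorted(rng.sample(range(num_features), subset_size))

rng = random.Random(0)
D = [f"D{i}" for i in range(1, 13)]      # full training set D1..D12
print(bootstrap_bag(D, rng))              # a random "bag"
print(feature_subset(10, 5, rng))         # 5 of the feature indices 0..9
```

Each tree in the forest gets its own bag and/or feature subset, which is what decorrelates the trees.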
Random Forests

[Diagram: an input feature vector f is passed down L independently trained trees (Tree 1, Tree 2, ..., Tree L); each tree's leaf votes for a class label (c1 or c2), and the votes are tallied into a histogram over Class 1 vs. Class 2.]
Performance comparison

• Results from Ho (1995)
• Handwritten digits (10 classes), 20x20-pixel binary images
• 60000 training, 10000 testing samples
• Feature set f1: raw pixel values (400 features)
• Feature set f2: f1 plus gradients (852 features in total)
• Uses the full training set for every tree (no bagging)
• Compares different feature subset sizes (100 and 200 respectively)
Performance comparison

[Plot 1: %Correct (60–100) vs. number of trees in the forest (1–20); curves: cap (f1) 100d, cap (f1) 200d, cap (f2) 100d, cap (f2) 200d.]

[Plot 2: %Correct (60–100) vs. number of trees in the forest (1–10); curves: per (f1) 100d, per (f1) 200d, per (f2) 100d, per (f2) 200d.]
Revision: Naive Bayes Classifiers

• We would like to evaluate the posterior probability over class labels
• Bayes' rule tells us that this is equivalent to
• But learning the joint likelihood distribution over all features is most likely intractable!
• Naive Bayes makes the simplifying assumption that the features are conditionally independent given the class label
argmax_k P(Ck | f1, f2, ..., fN)

= argmax_k P(f1, f2, ..., fN | Ck) P(Ck)    (likelihood × prior)

P(f1, f2, ..., fN | Ck) = Π_{n=1..N} P(fn | Ck)
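A minimal sketch of such a classifier for discrete features, learned by counting (the toy data and helper names are made up; add-one smoothing is included so unseen feature values do not give log(0)):

```python
import math
from collections import Counter, defaultdict

# Naive Bayes with binary features:
# Class(f) = argmax_k P(C_k) * prod_n P(f_n | C_k); computed in log space.
def train(examples):
    # examples: list of (feature_tuple, class_label)
    prior = Counter(c for _, c in examples)
    cond = defaultdict(Counter)            # cond[(c, n)][value] = count
    for f, c in examples:
        for n, v in enumerate(f):
            cond[(c, n)][v] += 1
    return prior, cond, len(examples)

def classify(f, prior, cond, M):
    def log_p(c):
        lp = math.log(prior[c] / M)
        for n, v in enumerate(f):
            # Add-one (Laplace) smoothing over the two binary values
            lp += math.log((cond[(c, n)][v] + 1) / (prior[c] + 2))
        return lp
    return max(prior, key=log_p)

data = [((1, 1, 0), "c1"), ((1, 0, 0), "c1"),
        ((0, 1, 1), "c2"), ((0, 0, 1), "c2")]
prior, cond, M = train(data)
print(classify((1, 1, 0), prior, cond, M))  # -> c1
```

The 1-d conditionals P(fn|Ck) are just per-class value counts, which is exactly why the naive model is so cheap to learn.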
Revision: Naive Bayes Classifiers

• This independence assumption is usually false!
• The resulting approximation tends to grossly underestimate the true posterior probabilities

However...
• It is usually easy to learn the 1-d conditional densities P(fn|Ck)
• It often works rather well in practice!
• Can we do better without sacrificing simplicity?
Ferns: "Semi-Naive" Bayes

• Group the features into L small sets of size S (called ferns), where the features fn are the outcomes of binary tests on the input vector, so that fn ∈ {0, 1}
• Assume the groups are conditionally independent, hence
• Learn the class-conditional distributions for each group and apply Bayes' rule to obtain the posterior

M. Özuysal et al. "Fast Keypoint Recognition in Ten Lines of Code." CVPR 2007.
Fl = {f_{l,1}, f_{l,2}, ..., f_{l,S}}

P(f1, f2, ..., fN | Ck) = Π_{l=1..L} P(Fl | Ck)

Class(f) ≈ argmax_k P(Ck) Π_{l=1..L} P(Fl | Ck)
Naive vs Semi-Naive Bayes

• Full joint class-conditional distribution: intractable to estimate
• Naive approximation: too simplistic; a poor approximation to the true posterior
• Semi-Naive: a balance of complexity and tractability; trade off complexity against performance via the choice of fern size (S) and number of ferns (L)
Full joint:   P(f1, f2, ..., fN | Ck)

Naive:        P(f1, f2, ..., fN | Ck) = Π_{n=1..N} P(fn | Ck)

Semi-Naive:   P(f1, f2, ..., fN | Ck) = Π_{l=1..L} P(Fl | Ck),  with Fl = {f_{l,1}, ..., f_{l,S}}
So how does a single Fern work?

• A fern applies a series of S binary tests to the input vector I

  e.g. relative intensities of a pair of pixels:
  f1(I) = I(xa, ya) > I(xb, yb)  → true
  f2(I) = I(xc, yc) > I(xd, yd)  → false

• This gives an S-digit binary code for the feature, which may be interpreted as an integer in the range [0 ... 2^S − 1]

  e.g. binary code 010111 ↔ decimal 23

• It's really just a simple hashing scheme that drops input vectors into one of 2^S "buckets"
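The hashing step can be written directly. A sketch with a made-up 3x3 image and hand-picked pixel pairs (the helper name and test locations are mine):

```python
# A fern = S fixed pixel-pair comparisons; the S answers form a binary
# code, i.e. an integer bucket index in [0, 2^S - 1].
def make_fern(pixel_pairs):
    def fern(image):
        code = 0
        for (xa, ya), (xb, yb) in pixel_pairs:
            # Each test contributes one bit: is pixel a brighter than b?
            code = (code << 1) | int(image[ya][xa] > image[yb][xb])
        return code
    return fern

image = [[10, 200, 30],
         [40,  50, 60],
         [70,  80, 90]]

# S = 3 tests -> buckets 0..7
fern = make_fern([((0, 0), (1, 0)),    # 10 > 200 ? no  -> 0
                  ((2, 1), (0, 1)),    # 60 > 40  ? yes -> 1
                  ((1, 2), (2, 0))])   # 80 > 30  ? yes -> 1
print(fern(image))  # binary 011 -> 3
```

The pixel pairs are fixed once, at fern-construction time, and every input is pushed through the same S tests.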
So how does a single Fern work?

• The output of a fern, when applied to a large number of input vectors of the same class, follows a multinomial distribution

[Histogram: p(F|C) over fern outputs F = 0, 1, 2, 3, ..., 2^S − 1]
Training a Fern

• Apply the fern to each labelled training example Dm = (Im, cm) and compute its output F(Dm)
• Learn the multinomial densities p(F|Ck) as histograms of the fern output for each class

[Histograms: p(F|C0), p(F|C1), p(F|C2), ..., p(F|CK), each over fern outputs 0 ... 2^S − 1]
Classifying using a single Fern

• Given a test input, simply apply the fern and "look up" the posterior distribution over class label

[Diagram: test input I → apply fern: F(I) = [00011] = 3 → read off bin 3 of each histogram p(F|C0) ... p(F|CK) → normalize the resulting values across classes → class posterior over C1, C2, C3, ..., CK]
p(F = z | Ck) = (N(F = z | Ck) + 1) / Σ_{z=0..2^S−1} (N(F = z | Ck) + 1)

p(Ck | F) = p(F | Ck) / Σ_k p(F | Ck)
Adding randomness: an ensemble of ferns

• A single fern does not give great classification performance
• But... we can build an ensemble of "independent" ferns by randomly choosing different subsets of features, e.g.

  F1 = {f2, f7, f22, f5, f9}
  F2 = {f4, f1, f11, f8, f3}
  F3 = {f6, f31, f28, f11, f2}

• Finally, combine their outputs using Semi-Naive Bayes:
Class(f) ≈ argmax_k P(Ck) Π_{l=1..L} P(Fl | Ck)
Classifying using Random Ferns
• Having randomly selected and trained a collection of ferns, classifying new inputs involves only simple look-up operations
[Diagram: Fern 1, Fern 2, ..., Fern L each look up P(Fl|Ck) for the input; the products are combined into a posterior over the class label.]
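Putting the pieces together, here is a sketch of training and classifying with an ensemble of random ferns on binary feature vectors. The toy data, parameter values, and uniform class prior are my assumptions; add-one smoothing of the histograms is included to keep the product well defined:

```python
import math
import random
from collections import defaultdict

S, L = 4, 10                    # fern size and number of ferns (small demo)
N_FEATURES = 16                 # binary input features (an assumption)
rng = random.Random(1)
# Each fern = a random subset of S feature indices
ferns = [rng.sample(range(N_FEATURES), S) for _ in range(L)]

def fern_code(f, subset):
    # Hash the S selected binary features into a bucket in [0, 2^S - 1]
    code = 0
    for n in subset:
        code = (code << 1) | int(f[n])
    return code

def train(examples):
    # counts[class][fern][bucket] = number of training hits
    counts = defaultdict(lambda: [[0] * (1 << S) for _ in range(L)])
    for f, c in examples:
        for l, subset in enumerate(ferns):
            counts[c][l][fern_code(f, subset)] += 1
    return counts

def classify(f, counts):
    def log_post(c):
        lp = 0.0                # uniform class prior assumed
        for l, subset in enumerate(ferns):
            n_z = counts[c][l][fern_code(f, subset)]
            total = sum(counts[c][l]) + (1 << S)
            lp += math.log((n_z + 1) / total)   # smoothed p(F_l | C_k)
        return lp
    return max(counts, key=log_post)

# Two separable toy classes: "A" uses the low features, "B" the high ones.
def sample(cls):
    on = range(8) if cls == "A" else range(8, 16)
    f = [0] * N_FEATURES
    for i in on:
        f[i] = int(rng.random() < 0.8)
    return f

data = [(sample(c), c) for c in "AB" * 100]
counts = train(data)
test_set = [(sample(c), c) for c in "AB" * 50]
acc = sum(classify(f, counts) == c for f, c in test_set) / len(test_set)
print(f"held-out accuracy: {acc:.2f}")
```

Working in log space turns the product of L class-conditionals into a sum, which avoids floating-point underflow when L is large.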
One small subtlety...

• Even for moderate-size ferns, the output range can be large
  e.g. fern size S = 10, output range = [0 ... 2^10 − 1] = [0 ... 1023]
• Even with a large training set, many "buckets" may see no samples
• These zero probabilities will be problematic for our Semi-Naive Bayes (a single zero factor wipes out the whole product)!
• So... assume a Dirichlet prior on the distributions P(F|Ck)
• This assigns a baseline low probability to all possible fern outputs:

where N(F = z|Ck) is the number of times we observe fern output z in the training set for class Ck

[Histogram: p(F|Ck) over fern outputs 0, 1, 2, 3, ..., 2^S − 1]
p(F = z | Ck) = (N(F = z | Ck) + 1) / Σ_{z=0..2^S−1} (N(F = z | Ck) + 1)
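A quick numeric check of this smoothed estimate with toy counts (S = 2, so four buckets; the counts are made up):

```python
# Smoothed estimate: p(F = z | C_k) = (N_z + 1) / sum_z (N_z + 1)
# Toy fern with S = 2 (four buckets); bucket 3 was never observed.
S = 2
N = [5, 2, 1, 0]                       # N(F = z | C_k) for z = 0..3
total = sum(n + 1 for n in N)           # = 12
p = [(n + 1) / total for n in N]

print(p)         # every bucket now has nonzero probability
print(p[3] > 0)  # -> True
```

The unseen bucket gets probability 1/12 instead of 0, so it can no longer annihilate the Semi-Naive Bayes product.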
Comparing Forests and Ferns

Tree classifier (Forests)
• Directly learns the posterior P(Ck|F)
• Applies a different sequence of tests in each child node
• Training time grows exponentially with tree depth
• Combines tree hypotheses by averaging

Fern classifier (Ferns)
• Learns class-conditional distributions P(F|Ck)
• Applies the same sequence of tests to every input vector
• Training time grows linearly with fern size S
• Combines hypotheses using Bayes' rule (multiplication)
Application: Fast Keypoint Matching

• Özuysal et al. use Random Ferns for keypoint recognition
• Similar to the typical application of SIFT descriptors
• Very robust to affine viewing deformations
• Very fast to evaluate (13.5 microseconds per keypoint)

Training
• Each keypoint to be recognized is a separate class
• Training sets are generated by synthesizing random affine deformations of the image patch (10000 samples)
• Features are pairwise intensity comparisons: fn(I) = I(xa, ya) > I(xb, yb)
• Fern size S = 10 (randomly selected pixel pairs)
• Ensembles of 5 to 50 ferns are tested
Fig. 6. Warped patches from the images of Figure 5 show the range of affine deformations we considered. In each line, the leftmost patch is the original one and the others are deformed versions of it. (a) Sample patches from the City image. (b) Sample patches from the Flowers image. (c) Sample patches from the Museum image.

...for the estimated distributions. Also, the number of features evaluated per patch is equal in all cases.

The training set is obtained by randomly deforming the images of Figure 5. To perform these experiments, we represent affine image deformations as 2 × 2 matrices of the form

R_θ R_φ^{-1} diag(λ1, λ2) R_φ,

where diag(λ1, λ2) is a diagonal 2 × 2 matrix and R_γ represents a rotation of angle γ. Both to train and to test our ferns, we warped the original images using such deformations, computed by randomly choosing θ and φ in the [0 : 2π] range and λ1 and λ2 in the [0.6 : 1.5] range. Fig. 6 depicts patches surrounding individual interest points, first in the original images and then in the warped ones. We used 30 random affine deformations per degree of rotation to produce 10800
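Sampling this deformation model (R_θ R_φ^{-1} diag(λ1, λ2) R_φ, with θ, φ drawn from [0, 2π] and λ1, λ2 from [0.6, 1.5]) is straightforward. A pure-Python sketch (the matrix helpers are mine; parameter ranges are as in the text):

```python
import math
import random

def rot(a):
    # 2x2 rotation matrix for angle a
    return [[math.cos(a), -math.sin(a)],
            [math.sin(a),  math.cos(a)]]

def matmul(A, B):
    # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def random_affine(rng):
    # A = R_theta * R_phi^{-1} * diag(l1, l2) * R_phi, with the parameter
    # ranges used by the authors for training and testing the ferns.
    theta = rng.uniform(0, 2 * math.pi)
    phi = rng.uniform(0, 2 * math.pi)
    l1, l2 = rng.uniform(0.6, 1.5), rng.uniform(0.6, 1.5)
    D = [[l1, 0], [0, l2]]
    return matmul(matmul(matmul(rot(theta), rot(-phi)), D), rot(phi))

A = random_affine(random.Random(42))
print(A)  # a random 2x2 affine deformation matrix
```

Since rotations have unit determinant, det(A) = λ1·λ2, so the area scaling of every sampled deformation lies in [0.36, 2.25].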
Synthesized viewpoint variation
Classification Rate vs Number of Ferns

[Plot: Average Classification Rate (0–100%) vs. number of structures (depth/size 10), from 0 to 50, comparing Ferns with Random Trees.]

Figure 4. The percentage of correctly classified image patches is shown against different classifier parameters for the fern and tree based methods. The independence assumption between ferns allows the joint utilization of the features, resulting in a higher classification rate. The error bars show the 95% confidence interval assuming a Gaussian distribution.
[Plot: Average Classification Rate (0–100%) vs. number of classes (0–2000), for Ferns (20 or 30 ferns with size 10) vs. Random Trees (20 or 30 trees with depth 10).]

Figure 5. Classification using ferns can handle many more classes than Random Tree based methods. For both figures, ferns with size 10 and trees with depth 10 are used.
4.1. Ferns Outperform Trees
We evaluate the performance of the proposed fern based approach by comparing it to the results of a Random Tree based implementation. The number of tests in the ferns and the depth of the trees are taken to be equal, and we compare the classification rate when using the same number of structures, that is, of ferns or Random Trees. In particular, the same number of tests is performed on each keypoint, and the same number of joint probabilities has to be stored. We do our tests on the images presented in Figure 3 with 500 classes and calculate the average classification rate on randomly generated test images, while eliminating false matches using object geometry. Since the feature selection is random, we repeat the test 10 times, calculate the mean and variance of the classification rate, and perform the test on the two images. As depicted by Figure 4, despite the inaccuracy of the independence assumptions, the fern based classifier outperforms the combination of trees. Furthermore, as the number of ferns is increased, the random selection method does not cause large variations in classifier performance.

We also investigate the behavior of the classification rate as the number of classes increases. Figure 5 shows that a larger number of classes does not affect the performance of ferns much, while tree based methods cannot cope with many classes. In both experiments we have trained the classifiers using classes from three different images, up to 700 classes for each image.
4.2. Linking the Two Approaches
Here we show that the two approaches are equivalent when the training set is small, and give some insights into why the ferns perform better when it is large. Recall that we evaluate P(F_k | C = c_i) of Eq. (3) as

    P(F_k | C = c_i) = P_e · P(σ(F_k)) + μ · (1 − P(σ(F_k))) ,

where P_e is the empirical probability and μ = 1/H. Since computing the product of such terms is the same as summing
• 300 classes were used
• Also compares to Random Trees of equivalent size
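The fern classifier described above can be sketched compactly. This is a minimal illustration, not the authors' implementation: each fern concatenates S binary test outcomes into a leaf index, per-class counts at each leaf are smoothed with a uniform prior (playing the role of the μ = 1/H term), and classification sums log-probabilities across ferns, which is the product form of the semi-naive Bayes model. All names here are invented for the sketch.

```python
import numpy as np

class RandomFerns:
    """Minimal semi-naive-Bayes fern classifier over precomputed 0/1 features.
    Each fern owns S randomly chosen binary tests (feature indices)."""

    def __init__(self, n_ferns, fern_size, n_classes, n_features, rng):
        self.S = fern_size
        self.tests = rng.integers(0, n_features, size=(n_ferns, fern_size))
        # Count table per fern: 2^S leaves x n_classes, initialized to a
        # uniform prior of 1 so unseen leaves keep nonzero probability.
        self.counts = np.ones((n_ferns, 2 ** fern_size, n_classes))

    def _leaf(self, features):
        # Pack each fern's S binary test outcomes into an integer leaf index.
        bits = features[self.tests]                 # shape (n_ferns, S)
        return bits @ (1 << np.arange(self.S))     # shape (n_ferns,)

    def train(self, features, label):
        self.counts[np.arange(len(self.counts)), self._leaf(features), label] += 1

    def classify(self, features):
        # Normalize counts into P(leaf | class) per fern, then sum logs:
        # the product over ferns of P(F_k | C) becomes a sum of logarithms.
        probs = self.counts / self.counts.sum(axis=1, keepdims=True)
        leaves = self._leaf(features)
        return int(np.argmax(np.log(probs[np.arange(len(probs)), leaves]).sum(axis=0)))

# Tiny smoke demo on two synthetic classes with constant feature vectors.
rng = np.random.default_rng(1)
clf = RandomFerns(n_ferns=5, fern_size=4, n_classes=2, n_features=16, rng=rng)
for _ in range(10):
    clf.train(np.zeros(16, dtype=int), 0)
    clf.train(np.ones(16, dtype=int), 1)
```

In a real keypoint recognizer the 0/1 features would be pixel intensity comparisons inside the patch, and training would loop over the randomly warped views of each keypoint.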
Classification Rate vs Number of Classes (Keypoints)
• 30 random ferns of size 10 were used
Drawback: Memory requirements
• Fern classifiers can be very memory hungry, e.g.:
  Fern size = 11
  Number of ferns = 50
  Number of classes = 1000
  RAM = 2^fern size * sizeof(float) * NumFerns * NumClasses
      = 2048 * 4 * 50 * 1000
      ≈ 400 MBytes!
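The arithmetic above is easy to check directly; a small helper, assuming one 4-byte float per table entry (the function name is illustrative):

```python
def fern_memory_bytes(fern_size, n_ferns, n_classes, bytes_per_entry=4):
    """Each fern stores a table of 2**fern_size leaves x n_classes entries."""
    return (2 ** fern_size) * bytes_per_entry * n_ferns * n_classes

# 2048 * 4 * 50 * 1000 = 409,600,000 bytes, i.e. roughly the 400 MB quoted above.
mb = fern_memory_bytes(fern_size=11, n_ferns=50, n_classes=1000) / 1e6
```

Note the exponential dependence on fern size: adding one more test per fern doubles the table, which is why fern size and fern count are traded off against memory in practice.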
Conclusions?
• Random Ferns are easy to understand and easy to train
• Very fast to perform classification once trained
• Provide a probabilistic output (how accurate though?)
• Appear to outperform Random Forests
• Can be very memory hungry!