Sharing features for multi-class object detection
Antonio Torralba, Kevin Murphy and Bill Freeman
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Sept. 1, 2004
Goal
• We want a machine to be able to identify thousands of different objects as it looks around the world.
Multi-class object detection:
[Figure: one patch of the image yields local features vp, which feed a bank of per-class classifiers P(car | vp), P(cow | vp), P(person | vp), …; here each reports "no car", "no cow", "no person".]
Desired detector outputs: bookshelf, desk, screen.
Need to detect as well as to recognize
Detection: localize screens, keyboards, and bottles in the image.
In detection, the big problem is discriminating the sparse objects from the background.
Recognition: name each object.
In recognition, the problem is discriminating between object classes.
There are efficient solutions for:
• detecting a single object category and view (Viola & Jones, 2001; Papageorgiou & Poggio, 2000; …)
• detecting particular objects (Lowe, 1999)
• detecting objects in isolation (Leibe & Schiele, 2003)
But the problem of multi-class and multi-view object detection is still largely unsolved.
Why multi-object detection is a hard problem
[Figure: axes of variation — object classes, viewpoints, styles.]
Need to detect Nclasses * Nviews * Nstyles combinations. Lots of variability within classes, across viewpoints, and across styles, lighting conditions, etc.
Existing approaches for multiclass object detection (vision community)
Using a set of independent binary classifiers is the dominant strategy: one detector for each class.
• Viola-Jones extension for dealing with rotations: two cascades for each view.
• Schneiderman-Kanade multiclass object detection.
Promising approaches:
• Fei-Fei, Fergus, & Perona, 2003
• Krempp, Geman, & Amit, 2002: look for a vocabulary of edges that reduces the number of features.
Characteristics of one-vs-all multiclass approaches: cost
Computational cost grows linearly with Nclasses * Nviews * Nstyles. Surely, this will not scale well to 30,000 object classes.
Characteristics of one-vs-all multiclass approaches: representation
What is the best representation to detect a traffic sign?
A very regular object: template matching will do the job.
Parts derived from training a binary classifier: ~100% detection rate with 0 false alarms. But some of these parts can only be used for this one object.
Meaningful parts, in one-vs-all approaches
Part-based object representation (looking for meaningful parts):
• A. Agarwal and D. Roth
• M. Weber, M. Welling and P. Perona
• Ullman, Vidal-Naquet, and Sali, 2004: features of intermediate complexity are most informative for (single-object) classification.
…
Multi-class classifiers (machine learning community)
• Error correcting output codes (Dietterich & Bakiri, 1995; …): but only use classification decisions (+1/-1), not real values.
• Reducing multi-class to binary (Allwein et al., 2000): showed that the best code matrix is problem-dependent; don't address how to design the code matrix.
• Bunching algorithm (Dekel and Singer, 2002): also learns the code matrix and classifies, but more complicated than our algorithm and not applied to object detection.
• Multitask learning (Caruana, 1997; …)
Train tasks in parallel to improve generalization, share features. But not applied to object detection, nor in a boosting framework.
Our approach
• Share features across objects, automatically selecting the best sharing pattern.
• Benefits of shared features:
– Efficiency
– Accuracy
– Generalization ability
Algorithm goals, for object recognition.
We want to find the vocabulary of parts that can be shared
We want to share generic knowledge about detecting objects across different object classes (e.g., distinguishing objects from the background).
We want to share computations across classes so that computational cost < O(Number of classes)
[Figure: binary classifiers for object classes 1–4, each defined by a set of decision-boundary hyperplanes (features).]
Independent features: total number of hyperplanes (features) is 4 x 6 = 24. Scales linearly with the number of classes.
Shared features: total number of shared hyperplanes (features) is 8. May scale sub-linearly with the number of classes, and may generalize better.
Note: sharing is a graph, not a tree
[Figure: the characters R, b, and 3 drawn from a common set of strokes, each stroke shared by a subset of the characters.]
Objects: {R, b, 3}
This defines a vocabulary of parts shared across objects
At the algorithmic level
• Our approach is a variation on boosting that allows for sharing features in a natural way.
• So let’s review boosting (ada-boost demo)
Joint boosting, outside of the context of images.
Additive models for classification:
H(v, c) = Σm hm(v, c)
where c indexes the classes (a +1/-1 classification for each class) and v are the feature responses.
Feature sharing in additive models
1) It is simple to have sharing between additive models.
2) Each term hm can be mapped to a single feature.
Example, two classifiers sharing one term:
H1 = G1,2 + G1
H2 = G1,2 + G2
where G1,2 is shared by classes 1 and 2, and G1, G2 are class-specific.
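As a minimal sketch (the G functions below are hypothetical stand-ins for learned terms), two additive classifiers can share a term so it is evaluated only once:

```python
# Hypothetical weak-learner terms; each returns +1/-1 for a feature vector v.

def G12(v):   # term shared by classes 1 and 2
    return 1.0 if v[0] > 0.5 else -1.0

def G1(v):    # term specific to class 1
    return 1.0 if v[1] > 0.2 else -1.0

def G2(v):    # term specific to class 2
    return 1.0 if v[2] > 0.8 else -1.0

def H(v):
    """Evaluate H1 = G12 + G1 and H2 = G12 + G2, computing G12 once."""
    shared = G12(v)            # shared term: one evaluation, two classifiers
    return shared + G1(v), shared + G2(v)
```

The computational saving is exactly this reuse: the shared term's feature is extracted and thresholded once for all classes in its sharing set.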
Flavors of boosting
• Different boosting algorithms use different loss functions or minimization procedures (Freund & Schapire, 1995; Friedman, Hastie, & Tibshirani, 1998).
• We base our approach on gentle boosting, which learns faster than the others (Friedman, Hastie, & Tibshirani, 1998; Lienhart, Kuranov, & Pisarevsky, 2003).
Joint Boosting
We use the exponential multi-class cost function:
J = Σc E[ exp(-z^c H(v, c)) ]
where H(v, c) is the classifier output for class c and z^c is membership in class c (+1/-1).
At each boosting round, we add a function:
H(v, c) := H(v, c) + hm(v, c)
Newton's method: treat hm as a perturbation, and expand the loss J to second order in hm (classifier with perturbation). Minimizing J then reduces to a weighted squared error:
Jwse = Σc Σi wi^c (zi^c - hm(vi, c))^2, with reweighting wi^c = exp(-zi^c H(vi, c)).
[Figure: the shared regression stump as a function of the feature output vf — output a+b when vf is above threshold, b otherwise.]
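The multi-class exponential cost on this slide can be evaluated directly; a minimal sketch, assuming H and z are per-class lists of classifier outputs and +1/-1 labels over the training examples:

```python
import math

def multiclass_exp_cost(H, z):
    """J = sum over classes c and examples i of exp(-z[c][i] * H[c][i]),
    the multi-class exponential cost summed over training data."""
    return sum(math.exp(-z_ci * H_ci)
               for z_c, H_c in zip(z, H)
               for z_ci, H_ci in zip(z_c, H_c))
```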
Given a sharing pattern, the decision stump parameters are obtained analytically: for a trial sharing pattern, set the weak learner hm(v, c) to optimize the overall classification.
[Figure: hm(v, c) as a function of the feature output v.]
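A sketch of that analytic fit (pure Python; names are illustrative). It implements the weighted-least-squares solution for the stump form on the previous slide: output a+b above the threshold and b below for classes in the sharing set, and a per-class constant kc otherwise:

```python
def fit_shared_stump(v, z, w, theta, S):
    """Closed-form weighted least-squares parameters for a shared stump.

    v:     feature responses, length n
    z:     z[c][i] in {+1, -1}, membership of example i in class c
    w:     w[c][i] >= 0, boosting weights
    S:     class indices that share this stump
    Stump output: a+b if v[i] > theta, else b (classes in S);
    a constant k[c] for classes outside the sharing set.
    """
    num_hi = den_hi = num_lo = den_lo = 0.0
    for c in S:
        for i, vi in enumerate(v):
            if vi > theta:
                num_hi += w[c][i] * z[c][i]; den_hi += w[c][i]
            else:
                num_lo += w[c][i] * z[c][i]; den_lo += w[c][i]
    b = num_lo / den_lo                  # weighted mean of z below threshold
    a = num_hi / den_hi - b              # stump outputs a+b above threshold
    k = [sum(wc[i] * zc[i] for i in range(len(v))) / sum(wc)
         for wc, zc in zip(w, z)]        # per-class constants, for c not in S
    return a, b, k
```

Each side of the threshold just takes the weighted mean of the labels pooled over the sharing set, which is why the fit is analytic.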
[Figure: response histograms hm(v, c) for background (blue) and class members (red), with constants kc (kc=1, kc=2, kc=5) for classes outside the sharing set.]
The constants kc prevent sharing features just due to an asymmetry between positive and negative examples for each class. They only appear during training.
Joint Boosting: select sharing pattern and weak learner to minimize cost.
Algorithm details in CVPR 2004, Torralba, Murphy & Freeman
Approximate best sharing: finding the exact best pattern requires exploring 2^C - 1 possible sharing patterns. Instead we use a first-best search:
• S = []
• 1) Fit stumps for each class independently.
• 2) Take the best class ci and add it to the set: S = [S ci].
• Fit stumps for [S ci], for each candidate ci not in S.
• Go to 2, until length(S) = Nclasses.
• 3) Select the sharing pattern with the smallest WLS error.
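The first-best search above can be sketched as follows (the `wls_error` callback is a hypothetical stand-in for fitting the optimal stump on a candidate sharing set and returning its weighted least-squares error):

```python
def best_sharing_greedy(classes, wls_error):
    """First-best (greedy) search over sharing patterns.

    Instead of enumerating all 2^C - 1 subsets, grow the sharing set one
    class at a time, always adding the class that gives the lowest
    weighted least-squares error, then keep the best set seen overall.
    """
    S, candidates = [], list(classes)
    best_S, best_err = None, float("inf")
    while candidates:
        # try adding each remaining class to the current sharing set
        err, c = min((wls_error(S + [c]), c) for c in candidates)
        S.append(c)
        candidates.remove(c)
        if err < best_err:
            best_S, best_err = list(S), err
    return best_S, best_err
```

This costs O(C^2) stump fits per round instead of O(2^C), which is what makes joint boosting tractable for tens of classes.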
Comparison of the classifiers
Shared features: note better isolation of individual classes.
Non-shared features.
Now, apply this to images.
Image features (weak learners)
[Figure: a 32x32 training image of an object; a 12x12 patch gf(x), normalized to mean = 0 and energy = 1; and a binary mask wf(x) giving the location of that patch within the 32x32 object.]
Feature output: vf = wf ⊗ |I ⊗ gf| — correlate the image with the template, rectify, then pool within the position mask.
The candidate features: a dictionary of 2000 candidate patches and position masks (template + position), randomly sampled from the training images.
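A minimal sketch of this feature computation (pure Python, valid-mode correlation with no padding; it assumes the patch is already normalized and the mask has the same size as the correlation output):

```python
def correlate(img, kern):
    """Valid-mode 2-D cross-correlation of img with kern (lists of lists)."""
    H, W = len(img), len(img[0])
    h, w = len(kern), len(kern[0])
    return [[sum(img[i + a][j + b] * kern[a][b]
                 for a in range(h) for b in range(w))
             for j in range(W - w + 1)]
            for i in range(H - h + 1)]

def feature_response(image, patch, mask):
    """vf = wf (x) |I (x) gf|: correlate the template with the image,
    rectify, then pool the response under the binary position mask."""
    resp = correlate(image, patch)
    return sum(abs(resp[i][j]) * mask[i][j]
               for i in range(len(resp)) for j in range(len(resp[0])))
```

The mask restricts where in the object window the part is allowed to fire, which is what ties a template to a position.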
Multiclass object detection
21 objects
We use [20, 50] training samples per object, and about 20 times as many background examples as object examples.
Example shared feature (weak classifier)
At each round of running joint boosting on the training set we get a feature and a sharing pattern.
[Figure: response histograms for background (blue) and class members (red).]
How the features were shared across objects (features sorted left-to-right from generic to specific)
ROC curves for our 21 object database
• How will this work under training-starved or feature-starved conditions?
• Presumably, in the real world, we will always be starved for training data and for features.
[Figure, shared vs. non-shared features: 70 features, 20 training examples (left); 15 features, 20 training examples (middle); 15 features, 2 training examples (right).]
Scaling
Joint Boosting shows sub-linear scaling of features with objects (for area under ROC = 0.9).
Results averaged over 8 training sets and different combinations of objects. Error bars show variability.
Generic vs. specific features
Parts derived from training a binary classifier vs. parts derived from training a joint classifier with 20 more objects.
In both cases, ~100% detection rate with 0 false alarms.
Multi-view object detection: train for object and orientation.
[Figure: view-invariant features vs. view-specific features.]
Sharing features is a natural approach to view-invariant object detection.
Summary
• Argued that feature sharing will be an essential part of scaling up object detection to hundreds or thousands of objects (and viewpoints).
• We introduced joint boosting, a generalization of boosting that incorporates feature sharing in a natural way.
• Initial results (up to 30 objects) show the desired scaling behavior for # features vs # objects.
• The shared features are observed to generalize better, allowing learning from fewer examples, using fewer features.