andrew zisserman talk - part 2
![Page 1: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/1.jpg)
Andrew Zisserman
Visual Geometry Group
University of Oxford
http://www.robots.ox.ac.uk/~vgg
Includes slides from: Mark Everingham, Pedro Felzenszwalb, Rob Fergus, Kristen Grauman, Bastian Leibe, Fei-Fei Li, Marcin Marszalek, Pietro Perona, Deva Ramanan, Josef Sivic and Andrea Vedaldi
Visual search and recognition
Part II – category recognition
![Page 2: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/2.jpg)
What we would like to be able to do …
• Visual recognition and scene understanding
• What is in the image and where
• scene type: outdoor, city …
• object classes
• material properties
• actions
![Page 3: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/3.jpg)
Recognition Tasks
• Image Classification
– Does the image contain an aeroplane?
• Object Class Detection/Localization
– Where are the aeroplanes (if any)?
• Object Class Segmentation
– Which pixels are part of an aeroplane (if any)?
![Page 4: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/4.jpg)
Things vs. Stuff
Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape.
Thing (n): An object with a specific size and shape.
Ted Adelson, Forsyth et al. 1996.
Slide: Geremy Heitz
![Page 5: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/5.jpg)
Challenges: Clutter
![Page 6: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/6.jpg)
Challenges: Occlusion and truncation
![Page 7: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/7.jpg)
Challenges: Intra-class variation
![Page 8: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/8.jpg)
Object Category Recognition by Learning
• Difficult to define model of a category. Instead, learn from example images
![Page 9: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/9.jpg)
Level of Supervision for Learning
Image-level label
Pixel-level segmentation
Bounding box
“Parts”
![Page 10: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/10.jpg)
Outline
1. Image Classification
• Bag of visual words method
• Features and adding spatial information
• Encoding
• PASCAL VOC and other datasets
2. Object Category Detection
3. The future and challenges
![Page 11: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/11.jpg)
Recognition Task
• Image Classification
– Does the image contain an aeroplane?
• Challenges
– Imaging factors e.g. lighting, pose, occlusion, clutter
– Intra-class variation
– Position can vary within image
– Training data may not specify
![Page 12: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/12.jpg)
Image classification
• Supervised approach – Training data with labels indicating presence/absence of the class
Positive training images containing an object class, here motorbike
Negative training images that don’t
– Learn classifier
![Page 13: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/13.jpg)
Image classification
• Test image
– Image without label
– Determine whether test image contains the object class or not
• Classify
– Run classifier on the test image
?
![Page 14: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/14.jpg)
Bag of visual words
• Images yield varying number of local features
• Features are high-dimensional, e.g. 128-D for SIFT
• How to summarize image content in a fixed-length vector for classification?
1. Map descriptors onto a common vocabulary of visual words
2. Represent image as a histogram over visual words – a bag of words
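The two steps above can be sketched in a few lines. This is a minimal illustration, assuming the vocabulary (cluster centres) has already been learnt, e.g. by k-means; the function names and the toy 2-D "descriptors" are hypothetical stand-ins for 128-D SIFT vectors.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs in one image.

    The output length is len(vocabulary), however many descriptors
    the image produced.
    """
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist


# Toy data (assumed for illustration only).
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_a = [(0.1, 0.2), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]  # four descriptors
image_b = [(0.0, 0.1)]                                      # one descriptor
hist_a = bow_histogram(image_a, vocabulary)
hist_b = bow_histogram(image_b, vocabulary)
```

Both images end up as vectors of the same length, which is what lets a standard classifier consume them.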
![Page 15: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/15.jpg)
Examples for visual words
Airplanes
Motorbikes
Faces
Wild Cats
Leaves
People
Bikes
![Page 16: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/16.jpg)
Intuition
Visual Vocabulary
• Visual words represent “iconic” image fragments
• Discarding spatial information gives lots of invariance
![Page 17: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/17.jpg)
positive negative
Train classifier, e.g. SVM
Training data: vectors are histograms, one from each training image
![Page 18: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/18.jpg)
Faces 435
Motorbikes 800
Airplanes 800
Cars (rear) 1155
Background 900
Total: 4090
Example Image collection: four object classes + background
The “Caltech 5”
![Page 19: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/19.jpg)
Example: weak supervision
Training
• 50% images
• No identification of object within image
Testing
• 50% images
• Simple object present/absent test
Motorbikes, Airplanes, Frontal Faces, Cars (Rear), Background
Learning
• SVM classifier
• Gaussian kernel using χ² as similarity between histograms
Result
• Between 98.3 – 100% correct, depending on class
Zhang et al 2005, Csurka et al 2004
$K(x, y) = e^{-\gamma \chi^2(x, y)}$
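A minimal sketch of the kernel above for two bag-of-words histograms. The χ² distance is taken here as the sum of (x_i − y_i)² / (x_i + y_i); some papers include an extra factor of 1/2, so the normalisation and the default `gamma` are assumptions.

```python
import math


def chi2_distance(x, y):
    """Chi-squared distance between two non-negative histograms."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0:  # skip bins that are empty in both histograms
            total += (xi - yi) ** 2 / (xi + yi)
    return total


def chi2_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * chi2(x, y))."""
    return math.exp(-gamma * chi2_distance(x, y))
```

Identical histograms give a kernel value of 1, and the value decays towards 0 as the histograms diverge.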
![Page 20: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/20.jpg)
Localization according to visual word probability
foreground word more probable
background word more probable
sparse segmentation
![Page 21: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/21.jpg)
Why does SVM learning work?
• Learns foreground and background visual words
foreground words – positive weight
background words – negative weight
Linear SVM with weight vector w: $f(x) = w^\top x + b$
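To make the foreground/background reading concrete, here is a small sketch that splits the linear score into one term per visual word. The weights, bias and histogram are made-up values for illustration, not learnt ones.

```python
def linear_svm_score(w, hist, b):
    """f(x) = w^T x + b for a bag-of-words histogram x."""
    return sum(wi * xi for wi, xi in zip(w, hist)) + b


def word_contributions(w, hist):
    """Per-word terms w_i * x_i; positive terms push towards 'object present'."""
    return [wi * xi for wi, xi in zip(w, hist)]


# Hypothetical model: word 0 is a foreground word, word 1 a background word.
w = [2.0, -1.0, 0.5]
b = -1.0
hist = [3, 1, 0]
score = linear_svm_score(w, hist, b)
contrib = word_contributions(w, hist)
```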
![Page 22: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/22.jpg)
Bag of visual words summary
• Advantages:
– largely unaffected by position and orientation of object in image
– fixed length vector irrespective of number of detections
– Very successful in classifying images according to the objects they contain
• Disadvantages:
– No explicit use of configuration of visual word positions
– Poor at localizing objects within an image
![Page 23: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/23.jpg)
Adding Spatial Information
![Page 24: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/24.jpg)
Beyond BOW II: Grids and spatial pyramids
Start from BoW for image
• no spatial information recorded
Bag of Words
Feature Vector
![Page 25: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/25.jpg)
Adding Spatial Information to Bag of Words
Bag of Words
Concatenate
Feature Vector
[Fergus et al, 2005]
Keeps fixed length feature vector for a window
![Page 26: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/26.jpg)
Tiling defines (records) the spatial correspondence of the words
If codebook has V visual words, then representation has dimension 4V
Fergus et al ICCV 05
• parameter: number of tiles
![Page 27: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/27.jpg)
Spatial Pyramid – represent correspondence
1 BoW
4 BoW
16 BoW
[Lazebnik et al, 2006] [Grauman & Darrell, 2005]
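A sketch of the tiling and pyramid construction: one BoW histogram per tile, concatenated over 1 + 4 + 16 tiles. It assumes each feature is given as (x, y, word) with coordinates normalised to [0, 1); the per-level weighting used in the spatial pyramid kernel is left out, so this shows only the layout of the feature vector.

```python
def tiled_bow(points, num_words, grid):
    """Concatenated BoW histograms over a grid x grid tiling.

    points: iterable of (x, y, word), x and y in [0, 1), word in [0, num_words).
    """
    hist = [0] * (grid * grid * num_words)
    for x, y, word in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        tile = row * grid + col
        hist[tile * num_words + word] += 1
    return hist


def spatial_pyramid(points, num_words, levels=3):
    """Concatenate tilings of 1x1, 2x2, 4x4, ... (1, 4, 16 BoW for levels=3)."""
    feature = []
    for level in range(levels):
        feature.extend(tiled_bow(points, num_words, 2 ** level))
    return feature


# Toy features (assumed): V = 2 visual words.
points = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.6, 0.2, 0)]
pyramid = spatial_pyramid(points, num_words=2)
```

With V words, a 2 x 2 tiling alone has dimension 4V and the three-level pyramid has dimension 21V.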
![Page 28: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/28.jpg)
Dense Visual Words
• Why extract only sparse visual words?
• Good where lots of invariance is needed (e.g. to rotation or scale), but not relevant if it isn’t
• Also, interest points do not necessarily capture “all” features
• Instead, extract dense visual words of fixed scales on an overlapping grid
• More “detail” at the expense of invariance
• Improves performance for most categories
• Pyramid histogram of visual words (PHOW)
[Luong & Malik, 1999] [Varma & Zisserman, 2003] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Fei-Fei & Perona, 2005] [Bosch et al, 2006]
Patch / SIFT
Quantize
Word
![Page 29: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/29.jpg)
Image categorization: summary
... ... ... ...
VQ
Linear SVM
dogs
![Page 30: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/30.jpg)
Example – retrieving videos of scene categories
• TrecVid 2010 test data
– 8383 videos
– 144,988 shots
– 232,276 key-frames
– 200G mpeg2
![Page 31: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/31.jpg)
Training
• Training Images: Positive (Airplane), Negative (Background)
• Feature Extraction: Dense SIFT + VQ + Spatial Tiling
• Features
• Learning: Support Vector Machine (SVM)
• Classifier Model
Testing
• Test Images
• Feature Extraction: Dense SIFT + VQ + Spatial Tiling
• Features
• Scoring
• Scores: 0.91, 0.12, 0.65, 0.89
• Ranking
• Ranked List: 0.91, 0.89, 0.65, 0.12
![Page 32: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/32.jpg)
Example – retrieving videos of scene categories
aircraft
![Page 33: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/33.jpg)
Example – retrieving videos of scene categories
cityscape
![Page 34: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/34.jpg)
Example – retrieving videos of scene categories
demonstration
![Page 35: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/35.jpg)
Example – retrieving videos of scene categories
singing
![Page 36: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/36.jpg)
Encodings
![Page 37: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/37.jpg)
Beyond hard assignments …
1. Distance based methods
– Represent descriptor by function of distance to set of nearest neighbour cluster centres (soft assignment)
Hard Assignment: B: 1.0
Soft Assignment: A: 0.1, B: 0.5, C: 0.4
[Philbin et al. CVPR 2008, Van Gemert et al. ECCV 2008]
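One common way to realise soft assignment is a Gaussian weighting of the k nearest centres, normalised to sum to one. The sketch below assumes that form; `k` and `sigma` are illustrative parameters, not values from the cited papers.

```python
import math


def soft_assign(descriptor, centres, k=3, sigma=1.0):
    """Weights over the k nearest cluster centres, summing to 1.

    Returns {centre_index: weight}. With k=1 this reduces to hard assignment.
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(descriptor, c)) for c in centres]
    nearest = sorted(range(len(centres)), key=lambda i: d2[i])[:k]
    weights = {i: math.exp(-d2[i] / (2.0 * sigma ** 2)) for i in nearest}
    total = sum(weights.values())
    return {i: wgt / total for i, wgt in weights.items()}


# Toy 1-D centres A, B, C (assumed); the descriptor lies closest to B.
centres = [(0.0,), (1.0,), (3.0,)]
soft = soft_assign((0.8,), centres, k=3)
hard = soft_assign((0.8,), centres, k=1)
```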
![Page 38: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/38.jpg)
Beyond hard assignments …
2. Reconstruction based methods
– Sparse-coding: approximate descriptor using nearest neighbour centres as a basis
– Locality-constrained Linear Coding (LLC) [Wang et al. CVPR 2010]
$\min_{\alpha} \|x - B\alpha\|^2 + \lambda \|\alpha\|^2$
where B is a matrix formed from the nearest centres as column vectors
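Reading both norms in the objective as squared Euclidean norms (an assumption about the slide's notation), the problem is a ridge regression and has a closed-form minimiser. Setting the gradient with respect to α to zero gives the following; the full LLC formulation in the cited paper adds further terms that this simplified form does not capture.

```latex
\begin{aligned}
J(\alpha) &= \|x - B\alpha\|^2 + \lambda \|\alpha\|^2 \\
\nabla_{\alpha} J &= -2 B^\top (x - B\alpha) + 2\lambda\alpha = 0 \\
\alpha^{*} &= (B^\top B + \lambda I)^{-1} B^\top x
\end{aligned}
```

Since B holds only the few nearest centres as columns, the matrix to invert is small.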
![Page 39: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/39.jpg)
Beyond hard assignments …
3. Represent the residual (and more)
– Measure (and quantize) the difference vector from the cluster centre
[Jegou et al. CVPR 2010, Perronnin et al., Fisher kernels ECCV 2010]
ci
x
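A sketch of residual encoding: each descriptor x contributes its difference x − c_i to the slot of its nearest centre c_i, and the per-centre sums are concatenated. This follows the general idea of aggregating residuals; normalisation and quantisation steps are omitted, and the toy data is assumed.

```python
def residual_encoding(descriptors, centres):
    """Sum of (x - c_i) per nearest centre, concatenated over all centres."""
    dim = len(centres[0])
    acc = [[0.0] * dim for _ in centres]
    for x in descriptors:
        # Nearest centre by squared Euclidean distance.
        i = min(range(len(centres)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centres[k])))
        for j in range(dim):
            acc[i][j] += x[j] - centres[i][j]
    return [v for row in acc for v in row]


centres = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]
encoding = residual_encoding(descriptors, centres)
```

The output length is (number of centres) x (descriptor dimension), so it stays fixed regardless of how many descriptors the image has.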
![Page 40: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/40.jpg)
Spatial tiling & binning
1. Aggregate by sum-pooling
2. Aggregate by max-pooling
[Wang et al. CVPR 2010]
sum
max
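The two aggregation rules differ only in how per-descriptor codes within a tile are combined. A minimal sketch, with each code given as a list of per-word responses (the example codes are made up):

```python
def sum_pool(codes):
    """Add up the codes of all descriptors in a tile, word by word."""
    return [sum(col) for col in zip(*codes)]


def max_pool(codes):
    """Keep the strongest response per word over all descriptors in a tile."""
    return [max(col) for col in zip(*codes)]


# Three descriptors encoded over three visual words (assumed values).
codes = [[0.1, 0.5, 0.4],
         [0.0, 1.0, 0.0],
         [0.7, 0.2, 0.1]]
pooled_sum = sum_pool(codes)
pooled_max = max_pool(codes)
```

With hard-assignment codes, sum-pooling reproduces the ordinary BoW histogram.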
![Page 41: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/41.jpg)
Datasets
![Page 42: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/42.jpg)
The PASCAL Visual Object Classes (VOC) Dataset and Challenge
Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
![Page 43: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/43.jpg)
The PASCAL VOC Challenge
• Challenge in visual object recognition funded by PASCAL network of excellence
• Publicly available dataset of annotated images
• Main competitions in classification (is there an X in this image), detection (where are the X’s), and segmentation (which pixels belong to X)
• “Taster competitions” in 2-D human “pose estimation” (2007-present) and static action classes (2010-present)
• Standard evaluation protocol (software supplied)
![Page 44: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/44.jpg)
Dataset Content
• 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV/monitor
• Real images downloaded from flickr, not filtered for “quality”
• Complex scenes, scale, pose, lighting, occlusion, ...
![Page 45: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/45.jpg)
Annotation
• Complete annotation of all objects
• Annotated in one session with written guidelines
Truncated: Object extends beyond BB
Occluded: Object is significantly occluded within BB
Pose: Facing left
Difficult: Not scored in evaluation
![Page 46: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/46.jpg)
Examples
Aeroplane, Bicycle, Bird, Boat, Bottle
Bus, Car, Cat, Chair, Cow
![Page 47: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/47.jpg)
Examples
Dining Table, Dog, Horse, Motorbike, Person
Potted Plant, Sheep, Sofa, Train, TV/Monitor
![Page 48: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/48.jpg)
Dataset Collection
• Images downloaded from flickr
– 500,000 images downloaded and random subset selected for annotation
– Queries
• Keyword e.g. “car”, “vehicle”, “street”, “downtown”
• Date of capture e.g. “taken 21-July”
– Removes “recency” bias in flickr results
• Images selected from random page of results
– Reduces bias toward particular flickr users
• 2008/9 datasets retained as subset of 2010
– Assignments to training/test sets maintained
![Page 49: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/49.jpg)
Dataset Statistics
• Around 40% increase in size over VOC2009
Training: 10,103 images (7,054), 23,374 objects (17,218)
Testing: 9,637 images (6,650), 22,992 objects (16,829)
VOC2009 counts shown in brackets
• Minimum ~500 training objects per category
– ~1700 cars, 1500 dogs, 7000 people
• Approximately equal distribution across training and test sets
![Page 50: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/50.jpg)
Classification Challenge
• Predict whether at least one object of a given class is present in an image
• Evaluation: average precision per class
[Figure: precision/recall curve (x-axis: recall, y-axis: precision), with AP marked]
• Average Precision (AP) measures area under precision/recall curve
• Application independent
• A good score requires both high recall and high precision
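A minimal sketch of average precision computed from classifier scores and ground-truth labels. It uses the plain, non-interpolated form (mean of the precision at the rank of each positive); the official VOC evaluation software uses its own interpolation of the precision/recall curve, so numbers from this sketch are not directly comparable. The labels below are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: average of precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    num_pos = sum(1 for _, label in ranked if label)
    if num_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / num_pos


scores = [0.91, 0.12, 0.65, 0.89]
ap_perfect = average_precision(scores, [1, 0, 0, 1])  # positives ranked first
ap_mixed = average_precision(scores, [1, 0, 1, 0])    # a negative ranked second
```

A negative ranked above a positive lowers precision at that positive's recall level, which is why both high recall and high precision are needed for a good score.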
![Page 51: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/51.jpg)
Precision/Recall: Aeroplane (All)
[Figure: precision/recall curves for all results (x-axis: recall, y-axis: precision)]
All results, AP (%) in brackets:
NEC_V1_HOGLBP_NONLIN_SVM (93.3)
NEC_V1_HOGLBP_NONLIN_SVMDET (93.3)
NUSPSL_KERNELREGFUSING (93.0)
NUSPSL_MFDETSVM (91.9)
CVC_PLUSDET (91.7)
UVA_BW_NEWCOLOURSIFT (91.5)
NUSPSL_EXCLASSIFIER (91.3)
CVC_PLUS (91.0)
SURREY_MK_KDA (90.6)
UVA_BW_NEWCOLOURSIFT_SRKDA (90.6)
NLPR_VSTAR_CLS_DICTLEARN (90.3)
BUT_FU_SVM_SIFT (89.7)
CVC_FLAT (89.4)
BONN_FGT_SEGM (88.0)
LIRIS_MKL_TRAINVAL (87.5)
TIT_SIFT_GMM_MKL (87.2)
XRCE_IFV (87.1)
NUDT_SVM_LDP_SIFT_PMK_SPMK (86.1)
RITSU_CBVR_WKF (85.6)
UC3M_GENDISC (85.5)
NUDT_SVM_WHGO_SIFT_CENTRIST_LLM (83.5)
BUPT_LPBETA_MULTFEAT (82.1)
BUPT_SVM_MULTFEAT (81.1)
BUPT_SPM_SC_HOG (79.6)
LIP6UPMC_RANKING (78.8)
LIP6UPMC_MKL_L1 (78.5)
LIP6UPMC_KSVM_BASELINE (78.4)
NTHU_LINSPARSE_2 (77.9)
WLU_SPM_EMDIST (75.8)
LIG_MSVM_FUSE_CONCEPT (74.4)
NII_SVMSIFT (69.3)
HIT_PROTOLEARN_2 (60.7)
![Page 52: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/52.jpg)
Precision/Recall: Aeroplane (Top 10 by AP)
[Figure: precision/recall curves for the top 10 results (x-axis: recall, y-axis: precision)]
Top 10 results by AP, AP (%) in brackets:
NEC_V1_HOGLBP_NONLIN_SVM (93.3)
NEC_V1_HOGLBP_NONLIN_SVMDET (93.3)
NUSPSL_KERNELREGFUSING (93.0)
NUSPSL_MFDETSVM (91.9)
CVC_PLUSDET (91.7)
UVA_BW_NEWCOLOURSIFT (91.5)
NUSPSL_EXCLASSIFIER (91.3)
CVC_PLUS (91.0)
SURREY_MK_KDA (90.6)
UVA_BW_NEWCOLOURSIFT_SRKDA (90.6)
![Page 53: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/53.jpg)
AP by Class
[Figure: bar chart of AP (%) per class showing max, median and chance; classes in plotted order: aeroplane, person, train, bus, motorbike, horse, car, cat, bicycle, boat, tv/monitor, bird, dog, sheep, cow, chair, diningtable, sofa, bottle, pottedplant]
• Max AP: 93.3% (aeroplane) ... 53.3% (potted plant)
![Page 54: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/54.jpg)
Progress 2008-2010
• Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories
– Caveat: Better methods or more training data?
[Figure: bar chart of max AP (%) per class for 2008, 2009 and 2010 methods; classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor]
![Page 55: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/55.jpg)
The Indoor Scene Dataset
• 67 indoor categories
• 15620 images
• At least 100 images per category
• Training 67 x 80 images
• Testing 67 x 20 images
• A. Quattoni and A. Torralba. Recognizing Indoor Scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
![Page 56: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/56.jpg)
The Oxford Flowers Dataset
• Explore fine grained visual categorization
• 102 different species
![Page 57: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/57.jpg)
![Page 58: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/58.jpg)
Dataset statistics
• 102 categories
• Training set
– 10 images per category
• Validation set
– 10 images per category
• Test set
– >20 images per category
– Total 6129 images
![Page 59: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/59.jpg)
Fine grained visual classification – flowers
Y. Chai, M.E. Nilsback, V. Lempitsky, A. Zisserman, ICVGIP’08, ICCV’ 11
![Page 60: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/60.jpg)
![Page 61: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/61.jpg)
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
![Page 62: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/62.jpg)
Recognition Task
• Object Class Detection/Localization
– Where are the aeroplanes (if any)?
• Challenges
– Imaging factors e.g. lighting, pose, occlusion, clutter
– Intra-class variation
• Compared to Classification
– Detailed prediction e.g. bounding box
– Location usually provided for training
![Page 63: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/63.jpg)
Preview of typical results
aeroplane, bicycle, car, cow, horse, motorbike
![Page 64: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/64.jpg)
Problem of background clutter
• Use a sub-window
– At correct position, no clutter is present
– Slide window to detect object
– Change size of window to search over scale
![Page 65: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/65.jpg)
Detection by Classification
• Basic component of sliding window classifier: binary classifier
Car/non-car Classifier
Yes, a car
No, not a car
![Page 66: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/66.jpg)
Detection by Classification
• Detect objects in clutter by search
Car/non-car Classifier
• Sliding window: exhaustive search over position and scale
![Page 67: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/67.jpg)
![Page 68: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/68.jpg)
Detection by Classification
• Detect objects in clutter by search
Car/non-car Classifier
• Sliding window: exhaustive search over position and scale (can use same size window over a spatial pyramid of images)
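The exhaustive search can be sketched as a generator over positions and scales. This version rescales the window rather than the image, which is one of the two equivalent options mentioned above; the stride, scales and image size are illustrative assumptions, and each yielded box would be passed to the car/non-car classifier.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride, scales):
    """Yield (x, y, w, h) for every grid position at every scale."""
    for s in scales:
        w, h = int(round(win_w * s)), int(round(win_h * s))
        if w > image_w or h > image_h:
            continue  # window at this scale does not fit in the image
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)


# A 64 x 128 base window searched over a 128 x 128 image at three scales.
windows = list(sliding_windows(128, 128, 64, 128, stride=8, scales=[0.5, 1.0, 2.0]))
```

Even this tiny example produces over a hundred windows, which is why the per-window classifier has to be cheap.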
![Page 69: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/69.jpg)
Window (Image) Classification
• Features usually engineered
• Classifier learnt from data
Feature Extraction
Classifier
Training Data
Car/Non-car
![Page 70: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/70.jpg)
Problems with sliding windows …
• aspect ratio
• granularity (finite grid)
• partial occlusion
• multiple responses
See work by
• Christoph Lampert et al CVPR 08, ECCV 08
![Page 71: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/71.jpg)
Bag of (visual) Words representation
• Detect affine invariant local features (e.g. affine-Harris)
• Represent by high-dimensional descriptors, e.g. 128-D for SIFT
• Summarizes sliding window content in a fixed-length vector suitable for classification
1. Map descriptors onto a common vocabulary of visual words
2. Represent sliding window as a histogram over visual words – a bag of words
![Page 72: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/72.jpg)
Sliding window detector
• Classifier: SVM with linear kernel
• BOW representation for ROI
Example detections for dog
Lampert et al CVPR 08: Efficient branch and bound search over all windows
![Page 73: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/73.jpg)
Discussion: ROI as a Bag of Visual Words
• Advantages
– No explicit modelling of spatial information ⇒ high level of invariance to position and orientation in image
– Fixed length vector ⇒ standard machine learning methods applicable
• Disadvantages
– No explicit modelling of spatial information ⇒ less discriminative power
– Inferior to state of the art performance
– Add dense features
![Page 74: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/74.jpg)
Dalal & Triggs CVPR 2005
Pedestrian detection
• Objective: detect (localize) standing humans in an image
• Sliding window classifier
• Train a binary classifier on whether a window contains a standing person or not
• Histogram of Oriented Gradients (HOG) feature
• Although HOG + SVM was originally introduced for pedestrians, it has been used very successfully for many object categories
![Page 75: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/75.jpg)
Feature: Histogram of Oriented Gradients (HOG)
[Figure: panels for image, dominant direction and HOG; cell histogram of frequency against orientation]
• tile 64 x 128 pixel window into 8 x 8 pixel cells
• each cell represented by histogram over 8 orientation bins (i.e. angles in range 0-180 degrees)
![Page 76: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/76.jpg)
Histogram of Oriented Gradients (HOG) continued
• Adds a second level of overlapping spatial bins re-normalizing orientation histograms over a larger spatial area
• Feature vector dimension (approx) = 16 x 8 (for tiling) x 8 (orientations) x 4 (for blocks) = 4096
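A sketch of the per-cell orientation histogram and of the dimension arithmetic above, using the slide's numbers (8 x 8 pixel cells, 8 unsigned orientation bins, a factor of 4 for block normalisation). Gradient magnitudes vote into a single bin here; the original method interpolates votes between bins, and the dimension is the slide's approximation.

```python
import math


def cell_histogram(gx, gy, num_bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientation (0-180 deg).

    gx, gy: flat lists of horizontal and vertical gradients for one cell.
    """
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for dx, dy in zip(gx, gy):
        magnitude = math.hypot(dx, dy)
        angle = math.degrees(math.atan2(dy, dx)) % 180.0
        hist[int(angle // bin_width) % num_bins] += magnitude
    return hist


def hog_dimension(win_w=64, win_h=128, cell=8, num_bins=8, block_factor=4):
    """Approximate feature length: cells x orientation bins x block factor."""
    cells = (win_w // cell) * (win_h // cell)  # 8 x 16 cells
    return cells * num_bins * block_factor


# Two pixels: one horizontal gradient, one vertical gradient.
hist = cell_histogram([1.0, 0.0], [0.0, 1.0])
```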
![Page 77: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/77.jpg)
Window (Image) Classification
• HOG Features• Linear SVM classifier
FeatureExtraction
Classifier
Training Data
pedestrian/Non-pedestrian
![Page 78: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/78.jpg)
![Page 79: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/79.jpg)
Averaged examples
![Page 80: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/80.jpg)
Advantages of linear SVM:
• Training (Learning)
– Very efficient packages for the linear case, e.g. LIBLINEAR for batch training and Pegasos for on-line training.
– Complexity O(N) for N training points (cf O(N^3) for general SVM)
• Testing (Detection)
Classifier: linear SVM, $f(x) = w^\top x + b$
Non-linear: $f(x) = \sum_{i}^{S} \alpha_i k(x_i, x) + b$
S = # of support vectors = (worst case) N, the size of the training data
Linear: $f(x) = \sum_{i}^{S} \alpha_i x_i^\top x + b = w^\top x + b$
Independent of size of training data
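The test-time saving comes from collapsing the sum over support vectors into a single weight vector once, before detection. A small sketch with made-up coefficients (as on the slide, each α_i is taken to already include the sign of its label):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def kernel_svm_score(alphas, support_vectors, x, b, kernel):
    """Non-linear form: one kernel evaluation per support vector."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors)) + b


def collapse_to_weight_vector(alphas, support_vectors):
    """w = sum_i alpha_i x_i, computed once after training."""
    dim = len(support_vectors[0])
    return [sum(a * sv[j] for a, sv in zip(alphas, support_vectors)) for j in range(dim)]


def linear_svm_score(w, x, b):
    """Linear form: a single dot product, whatever the training set size."""
    return dot(w, x) + b


# Hypothetical trained model with three support vectors.
alphas = [0.5, -1.0, 2.0]
support_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
b = 0.1
x = (2.0, 3.0)
w = collapse_to_weight_vector(alphas, support_vectors)
slow = kernel_svm_score(alphas, support_vectors, x, b, kernel=dot)
fast = linear_svm_score(w, x, b)
```

Both forms give the same score when the kernel is the plain dot product; only the cost per window differs.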
![Page 81: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/81.jpg)
Dalal and Triggs, CVPR 2005
![Page 82: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/82.jpg)
Learned model
$f(x) = w^\top x + b$
average over positive training data
![Page 83: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/83.jpg)
Slide from Deva Ramanan
![Page 84: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/84.jpg)
Why does HOG + SVM work so well?
• Similar to SIFT, records spatial arrangement of histogram orientations
• Compare to learning only edges:
– Complex junctions can be represented
– Avoids problem of early thresholding
– Represents also soft internal gradients
• Older methods based on edges have become largely obsolete
• HOG gives fixed length vector for window, suitable for feature vector for SVM
![Page 85: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/85.jpg)
Training a sliding window detector
• Object detection is inherently asymmetric: much more “non-object” than “object” data
• Classifier needs to have very low false positive rate
• Non-object category is very complex – need lots of data
![Page 86: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/86.jpg)
Bootstrapping
1. Pick negative training set at random
2. Train classifier
3. Run on training data
4. Add false positives to training set
5. Repeat from 2
• Collect a finite but diverse set of non-object windows
• Force classifier to concentrate on hard negative examples
• For some classifiers can ensure equivalence to training on entire data set
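The loop above can be sketched as follows. To stay dependency-free, a simple perceptron stands in for the SVM, and the feature vectors are toy 2-D points; both are assumptions made for illustration, as are the function names.

```python
import random


def train_perceptron(pos, neg, epochs=50):
    """Stand-in linear classifier (a perceptron, not an SVM)."""
    w, b = [0.0] * len(pos[0]), 0.0
    data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def bootstrap(pos, negative_pool, initial_size, rounds, seed=0):
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, initial_size)  # 1. random negatives
    model = train_perceptron(pos, negatives)             # 2. train
    for _ in range(rounds):
        # 3. run on the pool of non-object windows; 4. keep the false positives
        hard = [x for x in negative_pool if score(model, x) > 0 and x not in negatives]
        if not hard:
            break
        negatives.extend(hard)
        model = train_perceptron(pos, negatives)         # 5. repeat from 2
    return model, negatives


pos = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0)]
pool = [(float(i % 5) - 3.0, float(i // 5) - 3.0) for i in range(25)]
model, negatives = bootstrap(pos, pool, initial_size=3, rounds=5)
```

The negative set only grows with windows the current model gets wrong, so training effort is spent on the hard negatives rather than on the whole pool.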
![Page 87: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/87.jpg)
Example: train an upper body detector
– Training data – used for training and validation sets
• 33 Hollywood2 training movies
• 1122 frames with upper bodies marked
– First stage training (bootstrapping)
• 1607 upper body annotations jittered to 32k positive samples
• 55k negatives sampled from the same set of frames
– Second stage training (retraining)
• 150k hard negatives found in the training data
![Page 88: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/88.jpg)
Training data – positive annotations
![Page 89: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/89.jpg)
Positive windows
Note: common size and alignment
![Page 90: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/90.jpg)
Jittered positives
![Page 91: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/91.jpg)
Jittered positives
![Page 92: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/92.jpg)
Random negatives
![Page 93: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/93.jpg)
Random negatives
![Page 94: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/94.jpg)
Window (Image) first stage classification
HOG Feature Extraction
Linear SVM Classifier
Jittered positives
random negatives
$f(x) = w^\top x + b$
• find high-scoring false positive detections
• these are the hard negatives for the next round of training
• cost = # training images × inference on each image
![Page 95: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/95.jpg)
Hard negatives
![Page 96: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/96.jpg)
Hard negatives
![Page 97: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/97.jpg)
First stage performance on validation set
![Page 98: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/98.jpg)
Performance after retraining
![Page 99: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/99.jpg)
Effects of retraining
![Page 100: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/100.jpg)
Side by side
before retraining after retraining
![Page 101: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/101.jpg)
Side by side
before retraining after retraining
![Page 102: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/102.jpg)
Side by side
before retraining after retraining
![Page 103: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/103.jpg)
Tracked upper body detections
![Page 104: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/104.jpg)
Accelerating Sliding Window Search
• Sliding window search is slow because so many windows are needed, e.g. x × y × scale ≈ 100,000 for a 320×240 image
• Most windows are clearly not the object class of interest
• Can we speed up the search?
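As a rough illustration of where a count of this order comes from, the sketch below enumerates window positions over a pyramid of scales. The window size, stride and scale step are assumptions; the slide does not state the values behind its ≈ 100,000 estimate.

```python
def count_windows(img_w, img_h, win_w, win_h, stride=4, scale_step=1.2):
    """Count sliding windows over all positions and scales that fit the image."""
    total = 0
    scale = 1.0
    while True:
        # Window size in original-image pixels at this scale.
        w, h = win_w * scale, win_h * scale
        if w > img_w or h > img_h:
            break  # window no longer fits
        nx = int((img_w - w) // stride) + 1  # horizontal positions
        ny = int((img_h - h) // stride) + 1  # vertical positions
        total += nx * ny
        scale *= scale_step
    return total
```

Calling `count_windows(320, 240, 32, 32)` gives the count for one hypothetical setting; a finer stride or scale step increases it quickly.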
![Page 105: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/105.jpg)
Cascaded Classification
• Build a sequence of classifiers with increasing complexity
• Reject easy non-objects using simpler and faster classifiers
Window → Classifier 1 → (possibly a face) → Classifier 2 → (possibly a face) → ... → Classifier N → Face
Each classifier can reject the window as Non-face
Later classifiers: more complex, slower, lower false positive rate
![Page 106: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/106.jpg)
Cascaded Classification
• Slow expensive classifiers only applied to a few windows ⇒ significant speed-up
• Controlling classifier complexity/speed:
– Number of support vectors [Romdhani et al, 2001]
– Number of features [Viola & Jones, 2001]
– Type of SVM kernel [Vedaldi et al, 2009]
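The early-rejection idea can be sketched in a few lines. The stage representation (a scoring function plus a threshold) and the name `cascade_classify` are assumptions made for illustration, not an interface from any of the cited systems.

```python
def cascade_classify(window, stages):
    """Run a window through a cascade.

    stages: list of (score_fn, threshold) pairs, ordered cheapest first.
    Returns (accepted, number_of_stages_evaluated).
    """
    for depth, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            # Rejected early: later, more expensive stages never run.
            return False, depth
    return True, len(stages)
```

Because most windows fail the first stage, the average cost per window stays close to the cost of the cheapest classifier.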
![Page 107: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/107.jpg)
Summary: Sliding Window Detection
• Can convert any image classifier into an object detector by sliding window. Efficient search methods available.
• Requirements for invariance are reduced by searching over e.g. translation and scale
• Spatial correspondence can be “engineered in” by spatial tiling
![Page 108: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/108.jpg)
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
![Page 109: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/109.jpg)
The PASCAL Visual Object Classes (VOC) Dataset and Challenge
Mark Everingham
Luc Van Gool
Chris Williams
John Winn
Andrew Zisserman
![Page 110: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/110.jpg)
Detection: Evaluation of Bounding Boxes
• Area of Overlap (AO) Measure
Ground truth $B_{gt}$, predicted $B_p$
Detection if $\frac{|B_{gt} \cap B_p|}{|B_{gt} \cup B_p|} > \text{Threshold}$ (50%)
• Evaluation: Average precision per class on predictions
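The area of overlap measure is the intersection area of the predicted and ground truth boxes divided by the area of their union. A minimal sketch, assuming boxes given as (xmin, ymin, xmax, ymax) in continuous coordinates (the function names are hypothetical):

```python
def area_of_overlap(bp, bgt):
    """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))  # overlap width
    iy = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))  # overlap height
    inter = ix * iy
    union = ((bp[2] - bp[0]) * (bp[3] - bp[1])
             + (bgt[2] - bgt[0]) * (bgt[3] - bgt[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_detection(bp, bgt, threshold=0.5):
    """A prediction counts as a detection if the overlap exceeds the threshold."""
    return area_of_overlap(bp, bgt) > threshold
```

Two 2×2 boxes offset by half their width overlap by only 1/3, so they would not count as a detection at the 50% threshold.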
![Page 111: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/111.jpg)
Precision/Recall - Bicycle
[Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). Legend: method (AP %)]
NLPR_HOGLBP_MC_LCEGCHLC (55.3)
UOCTTI_LSVM_MDPM (54.3)
UCI_DPM_SP (52.6)
NUS_HOGLBP_CTX_CLS_RESCORE_V2 (52.4)
MITUCLA_HIERARCHY (48.5)
UVA_DETMONKEY (39.8)
UVA_GROUPLOC (39.6)
UMNECUIUC_HOGLBP_DHOGBOW_SVM (34.7)
BONN_FGT_SEGM (33.7)
UMNECUIUC_HOGLBP_LINSVM (33.7)
CMU_RANDPARTS (31.7)
LJKINPG_HOG_LBP_LTP_PLS2ROOTS (29.7)
CMIC_SYNTHTRAIN (28.9)
CMIC_VARPARTS (28.2)
BONN_SVR_SEGM (24.4)
TIT_SIFT_GMM_MKL2 (14.5)
UC3M_GENDISC (5.5)
TIT_SIFT_GMM_MKL (1.6)
![Page 112: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/112.jpg)
Precision/Recall - Car
[Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). Legend: method (AP %)]
UOCTTI_LSVM_MDPM (49.1)
NLPR_HOGLBP_MC_LCEGCHLC (46.7)
UCI_DPM_SP (44.5)
MITUCLA_HIERARCHY (43.5)
UVA_GROUPLOC (37.8)
UVA_DETMONKEY (36.9)
UMNECUIUC_HOGLBP_DHOGBOW_SVM (33.8)
UMNECUIUC_HOGLBP_LINSVM (33.1)
BONN_SVR_SEGM (32.9)
NUS_HOGLBP_CTX_CLS_RESCORE_V2 (32.8)
BONN_FGT_SEGM (31.9)
LJKINPG_HOG_LBP_LTP_PLS2ROOTS (27.5)
CMU_RANDPARTS (19.5)
CMIC_VARPARTS (13.7)
CMIC_SYNTHTRAIN (13.3)
TIT_SIFT_GMM_MKL2 (8.1)
UC3M_GENDISC (5.4)
TIT_SIFT_GMM_MKL (1.6)
![Page 113: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/113.jpg)
Precision/Recall – Potted plant
[Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). Legend: method (AP %)]
UVA_GROUPLOC (13.0)
UVA_DETMONKEY (12.1)
UCI_DPM_SP (11.6)
NLPR_HOGLBP_MC_LCEGCHLC (10.2)
MITUCLA_HIERARCHY (9.7)
NUS_HOGLBP_CTX_CLS_RESCORE_V2 (9.6)
UOCTTI_LSVM_MDPM (9.1)
UMNECUIUC_HOGLBP_LINSVM (7.2)
BONN_SVR_SEGM (6.7)
UMNECUIUC_HOGLBP_DHOGBOW_SVM (6.4)
BONN_FGT_SEGM (5.8)
LJKINPG_HOG_LBP_LTP_PLS2ROOTS (3.1)
UC3M_GENDISC (1.5)
CMU_RANDPARTS (1.1)
TIT_SIFT_GMM_MKL2 (0.8)
TIT_SIFT_GMM_MKL (0.3)
![Page 114: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/114.jpg)
True Positives - Motorbike
MITUCLA_HIERARCHY
NLPR_HOGLBP_MC_LCEGCHLC
NUS_HOGLBP_CTX_CLS_RESCORE_V2
![Page 115: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/115.jpg)
False Positives - Motorbike
MITUCLA_HIERARCHY
NLPR_HOGLBP_MC_LCEGCHLC
NUS_HOGLBP_CTX_CLS_RESCORE_V2
![Page 116: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/116.jpg)
True Positives - Cat
UVA_DETMONKEY
UVA_GROUPLOC
MITUCLA_HIERARCHY
![Page 117: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/117.jpg)
False Positives - Cat
UVA_DETMONKEY
UVA_GROUPLOC
MITUCLA_HIERARCHY
![Page 118: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/118.jpg)
Progress 2008-2010
Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories
[Bar chart: Max AP (%) per class for 2008, 2009 and 2010; y-axis 0 to 60. Classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor]
![Page 119: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/119.jpg)
Multiple Kernels for Object Detection
Andrea Vedaldi, Varun Gulshan, Manik Varma, Andrew Zisserman
ICCV 2009
![Page 120: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/120.jpg)
Approach
• Three stage cascade
– Each stage uses a more powerful and more expensive classifier
• Multiple kernel learning for the classifiers over multiple features
• Jumping window first stage
Feature vector
Fast Linear SVM
Quasi-linear SVM
Jumping Window
Non-linear SVM
![Page 121: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/121.jpg)
Multiple Kernel Classification
PHOW Gray
Visual Words
PHOG
SSIM
PHOW Color
PHOG Sym
MK SVM
combine one kernel per histogram
[Varma & Rai, 2007][Gehler & Nowozin, 2009]
Feature vector
![Page 122: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/122.jpg)
Jumping window
Hypothesis
Position of visual word with respect to the object
learn the position/scale/aspect ratio of the ROI with respect to the visual word
Training
Detection
Handles change of aspect ratio
![Page 123: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/123.jpg)
SVMs overview
• First stage
– linear SVM
– (or jumping window)
– time: #windows
• Second stage
– quasi-linear SVM
– χ² kernel
– time: #windows × #dimensions
• Third stage
– non-linear SVM
– χ²-RBF kernel
– time: #windows × #dimensions × #SVs
Feature vector
Fast Linear SVM
Quasi-linear SVM
Jumping Window
Non-linear SVM
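The kernels named for the three stages can be written out as follows. Normalisation conventions for the χ² kernel and distance vary between papers; the forms below are one common choice and are an assumption, as are the function names. Inputs are taken to be non-negative histograms.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def chi2_kernel(x, y):
    """Additive chi-squared kernel: a sum of independent per-dimension terms."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_distance(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_rbf_kernel(x, y, gamma=1.0):
    """Exponentiated chi-squared distance: no longer additive over dimensions."""
    return math.exp(-gamma * chi2_distance(x, y))

def svm_score(x, support_vectors, alphas, bias, kernel):
    """Generic kernel SVM decision function.

    Evaluated this way, the cost per window grows with #SVs × #dimensions.
    """
    return sum(a * kernel(sv, x)
               for a, sv in zip(alphas, support_vectors)) + bias
```

Because the χ² kernel is a sum over dimensions, its decision function can be evaluated one dimension at a time, which is what makes a per-window cost proportional to #dimensions plausible for the second stage; the χ²-RBF kernel has no such decomposition, so every support vector must be visited.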
![Page 124: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/124.jpg)
Results
![Page 125: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/125.jpg)
Results
![Page 126: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/126.jpg)
Results
![Page 127: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/127.jpg)
Single Kernel vs. Multiple Kernels
• Multiple kernels give a substantial boost
• Multiple Kernel Learning:
– small improvement over averaging
– sparse feature selection
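The difference between averaging and learned weights can be illustrated on a single pair of windows, given one kernel value per feature channel. How the weights are learned is not shown; `combine_kernels` and the example weights are hypothetical.

```python
def combine_kernels(kernel_values, weights=None):
    """Weighted sum of per-feature kernel values for one pair of windows.

    With weights=None this is plain averaging; multiple kernel learning
    would instead supply learned weights, many of which may be zero.
    """
    if weights is None:
        weights = [1.0 / len(kernel_values)] * len(kernel_values)
    return sum(d * k for d, k in zip(weights, kernel_values))
```

A sparse weight vector such as `[0.0, 2.0, 0.0]` keeps only the second feature channel, which is the sense in which learned weights perform feature selection.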
![Page 128: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/128.jpg)
Object Detection with Discriminatively Trained Part Based Models
Pedro F. Felzenszwalb, David McAllester, Deva Ramanan, Ross Girshick
PAMI 2010
![Page 129: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/129.jpg)
Approach
• Mixture of deformable part-based models
– One component per “aspect” e.g. front/side view
• Each component has global template + deformable parts
• Discriminative training from bounding boxes alone
![Page 130: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/130.jpg)
Example Model
• One component of person model
root filters (coarse resolution)
part filters (finer resolution)
deformation models
[Figure: part positions labelled x1 to x6]
![Page 131: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/131.jpg)
Object Hypothesis
• Position of root + each part
• Each part: HOG filter (at higher resolution)
Score is sum of filter scores minus deformation costs
$p_0$: location of root
$p_1, \dots, p_n$: location of parts
$z = (p_0, \dots, p_n)$
![Page 132: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/132.jpg)
Score of a Hypothesis
• Linear classifier applied to feature subset defined by hypothesis
filters deformation parameters
displacements
Appearance term Spatial prior
concatenation of HOG features and part displacement features
concatenation of filters and deformation parameters
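The labels on this slide annotate an equation that did not survive extraction. A reconstruction consistent with the labels, using the notation of the published part-based model paper, is sketched below. The symbols $F_i$ (filters), $d_i$ (deformation parameters), $\phi(H, p_i)$ (HOG features at location $p_i$ in feature pyramid $H$) and $\phi_d$ (displacement features) are assumptions as to what the slide showed; the paper uses $\phi_d(dx, dy) = (dx, dy, dx^2, dy^2)$.

```latex
\operatorname{score}(p_0, \dots, p_n)
  = \underbrace{\sum_{i=0}^{n} F_i \cdot \phi(H, p_i)}_{\text{appearance term}}
  \;-\; \underbrace{\sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i)}_{\text{spatial prior}}
  \;=\; \beta \cdot \Psi(H, z)
```

Here $\beta$ is the concatenation of filters and deformation parameters, and $\Psi(H, z)$ is the concatenation of HOG features and part displacement features, so the score is linear in $\beta$ once the hypothesis $z$ is fixed.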
![Page 133: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/133.jpg)
Training
• Training data = images + bounding boxes
• Need to learn: model structure, filters, deformation costs
![Page 134: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/134.jpg)
Latent SVM (MI-SVM)
Minimize
Training data
We would like to find β such that:
Classifiers that score an example x using
β are model parameters
z are latent values
• Which component?
• Where are the parts?
SVM objective
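The equations these labels refer to were lost in extraction. The standard latent SVM formulation, which matches the labels, is given below as a reconstruction; the symbols $\Phi$, $Z(x)$, $D$ and $C$ follow the published paper and are assumptions as to the slide's exact notation.

```latex
% Classifiers that score an example x using
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

% Training data
D = \bigl(\langle x_1, y_1 \rangle, \dots, \langle x_n, y_n \rangle\bigr),
\qquad y_i \in \{-1, 1\}

% We would like to find beta such that
y_i \, f_\beta(x_i) > 0

% Minimize (SVM objective)
L_D(\beta) = \tfrac{1}{2} \lVert \beta \rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```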
![Page 135: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/135.jpg)
Latent SVM Training
• Convex if we fix z for positive examples
• Optimization:
– Initialize β and iterate:
  • Pick best z for each positive example
  • Optimize β with z fixed
• Local minimum: needs good initialization
– Parts initialized heuristically from root
Alternation strategy
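A toy version of the alternation is sketched below. Each example is represented as a list of candidate feature vectors, one per latent value z; this representation, the plain subgradient optimiser and all names are assumptions made for illustration, and the published system uses a different optimiser together with hard negative mining.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def best_latent(beta, candidates):
    """Index of the latent value z whose feature vector scores highest."""
    return max(range(len(candidates)), key=lambda j: dot(beta, candidates[j]))

def latent_score(beta, candidates):
    """f(x) = max over z of beta . Phi(x, z)."""
    return max(dot(beta, phi) for phi in candidates)

def train_latent_svm(positives, negatives, beta0,
                     outer_iters=5, epochs=50, lr=0.05, C=1.0):
    beta = list(beta0)
    for _ in range(outer_iters):
        # Step 1: pick the best z for each positive example under current beta.
        fixed_pos = [c[best_latent(beta, c)] for c in positives]
        # Step 2: optimise beta with those z fixed (a convex problem).
        for _ in range(epochs):
            grad = list(beta)  # gradient of 0.5 * ||beta||^2
            for phi in fixed_pos:
                if dot(beta, phi) < 1:  # positive inside the margin
                    grad = [g - C * p for g, p in zip(grad, phi)]
            for cands in negatives:
                # Negatives keep the max over z inside the hinge loss.
                phi = cands[best_latent(beta, cands)]
                if dot(beta, phi) > -1:  # negative inside the margin
                    grad = [g + C * p for g, p in zip(grad, phi)]
            beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta
```

With `beta0` all zeros, ties in step 1 resolve to the first candidate, so the ordering of candidates acts as the initialisation; this is where a heuristic such as placing parts relative to the root would enter.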
![Page 136: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/136.jpg)
Person Model
root filters (coarse resolution)
part filters (finer resolution)
deformation models
Handles partial occlusion/truncation
![Page 137: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/137.jpg)
Car Model
root filters (coarse resolution)
part filters (finer resolution)
deformation models
![Page 138: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/138.jpg)
Car Detections
high scoring false positives
high scoring true positives
![Page 139: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/139.jpg)
Person Detections
high scoring true positives
high scoring false positives (not enough overlap)
![Page 140: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/140.jpg)
Comparison of Models
![Page 141: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/141.jpg)
Summary
• Multiple features and multiple kernels boost performance
• Discriminative learning of model with latent variables for single feature (HOG):
– Latent variables can learn best alignment in the ROI training annotation
– Parts can be thought of as local SIFT vectors
– Some similarities to Implicit Shape Model/Constellation models but with discriminative/careful training throughout
NB: Code available for latent model!
![Page 142: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/142.jpg)
Outline
1. Image Classification
2. Object Category Detection
3. The future and challenges
![Page 143: Andrew Zisserman Talk - Part 2](https://reader034.vdocuments.us/reader034/viewer/2022042607/554e7309b4c9054a698b4be9/html5/thumbnails/143.jpg)
Current Research Challenges
• Context
– from scene properties: GIST, BoW, stuff
– from other objects, e.g. Felzenszwalb et al, PAMI 10
– from geometry of scene, e.g. Hoiem et al CVPR 06
• Occlusion/truncation
– Winn & Shotton, Layout Consistent Random Field, CVPR 06
– Vedaldi & Zisserman, NIPS 09
– Yang et al, Layered Object Detection, CVPR 10
• 3D
• Scaling up – thousands of classes
– Torralba et al, Feature sharing
– ImageNet
• Weak and noisy supervision