Lecture 02: Internet Video Search
TRANSCRIPT
Before the break
Color, texture, time, spatial structure: Gauss does it all … but not invariance, which is badly needed.
Before the break
Most basic systems:
System 1: Swain & Ballard, match colors.
System 2: Blobworld, match texture blobs.
We need more invariance.
4. Descriptors
Patch descriptors
For a 4x4 grid of patches, find the local gradient directions.
Count the directions per patch into 8 bins: the 4 x 4 x 8 = 128D SIFT histogram.
Lowe IJCV 2004
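A minimal numpy sketch of this layout (function name mine; Lowe's Gaussian weighting, trilinear interpolation, and dominant-orientation normalization are omitted):

import numpy as np

def sift_like_descriptor(patch):
    # 16x16 patch -> 4x4 grid of 4x4 sub-patches; per sub-patch an 8-bin
    # histogram of gradient directions, weighted by gradient magnitude,
    # giving 4 * 4 * 8 = 128 dimensions
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)                  # direction in [0, 2pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)

    desc = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            b = bins[4*i:4*i+4, 4*j:4*j+4].ravel()
            m = mag[4*i:4*i+4, 4*j:4*j+4].ravel()
            desc[i, j] = np.bincount(b, weights=m, minlength=8)
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)            # L2-normalized 128D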
Affine patch descriptor
Compute the prominent direction.
Start with centered, Gaussian-distributed weights W.
Compute the 2nd-order moment matrix M_k over all directions.
Adapt the weights to the elliptic shape.
Iterate until there is no further change.
$W_{k+1} = M_k W_k$

$M_k = \sum_{x,y} w_k(x,y) \begin{bmatrix} f_x f_x & f_x f_y \\ f_y f_x & f_y f_y \end{bmatrix}$
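A hedged sketch of the iteration (numpy; names mine): the slide's re-weighting $W_{k+1} = M_k W_k$ is realized here through the eigen-decomposition of M_k into an elliptic window shape, the usual affine-adaptation recipe.

import numpy as np

def second_moment_matrix(fx, fy, w):
    # M_k = sum_xy w_k(x,y) [[fx fx, fx fy], [fy fx, fy fy]]
    return np.array([[np.sum(w * fx * fx), np.sum(w * fx * fy)],
                     [np.sum(w * fy * fx), np.sum(w * fy * fy)]])

def affine_adapt(fx, fy, n_iter=10, tol=1e-3):
    # start from circular Gaussian weights, compute M_k, reshape the
    # window to the ellipse of M_k, repeat until the shape stops changing
    h, w = fx.shape
    yy, xx = np.mgrid[0:h, 0:w]
    xy = np.stack([xx - (w - 1) / 2, yy - (h - 1) / 2], axis=-1)
    sigma = min(h, w) / 4.0
    A = np.eye(2)                                  # current window shape
    for _ in range(n_iter):
        q = xy @ np.linalg.inv(A).T                # warp ellipse to circle
        wgt = np.exp(-(q[..., 0]**2 + q[..., 1]**2) / (2 * sigma**2))
        M = second_moment_matrix(fx, fy, wgt)
        lam, V = np.linalg.eigh(M / (np.trace(M) + 1e-12))
        lam = np.maximum(lam, 1e-9)
        # ellipse axes along the eigenvectors, lengths ~ lambda^(-1/2)
        A_new = V @ np.diag(lam ** -0.5) @ V.T
        A_new /= np.sqrt(np.linalg.det(A_new))     # keep unit area
        if np.linalg.norm(A_new - A) < tol:
            break
        A = A_new
    return A                                       # maps unit circle to ellipse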
Color Patch Descriptors
Invariance properties per descriptor:

Descriptor   | Light intensity change | Light intensity shift | Light intensity change and shift | Light color change | Light color change and shift
SIFT         | +                      | +                     | +                                | -                  | -
OpponentSIFT | +                      | +                     | +                                | -                  | -
C-SIFT       | +                      | -                     | -                                | -                  | -
RGB-SIFT     | +                      | +                     | +                                | +                  | +

van de Sande PAMI 2010
Results on PASCAL VOC 2007
Results per object category
[Figure: average precision per object category (person, aeroplane, horse, car, train, boat, bus, motorbike, bicycle, chair, cat, tvmonitor, sofa, bird, sheep, diningtable, dog, cow, pottedplant, bottle) plus MAP, on a 0.0-0.9 scale, comparing OpponentSIFT (L2 norm) with two-channel I+C (L2 norm).]
Corner selector
The change energy at x over a small shift vector (u, v):

$E(u,v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}, \qquad M = \sum_{x,y} \begin{bmatrix} f_x f_x & f_x f_y \\ f_y f_x & f_y f_y \end{bmatrix}$
Since M is symmetric, we have

$M = R^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R$

The eigenvector of $\lambda_{\max}$ gives the direction of the fastest change; the axes of the corresponding ellipse have lengths $(\lambda_{\max})^{-1/2}$ and $(\lambda_{\min})^{-1/2}$.

For a corner both eigenvalues should be large. The Harris cornerness is

$R = \det M - k\,(\operatorname{trace} M)^2$, with $\det M = \lambda_1 \lambda_2 = I_x^2\, I_y^2 - (I_x I_y)^2$ and $\operatorname{trace} M = \lambda_1 + \lambda_2 = I_x^2 + I_y^2$.
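The cornerness map follows directly from these formulas; a short numpy/scipy sketch (names mine):

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    # R = det(M) - k * trace(M)^2 per pixel; M is the second-moment
    # matrix of gradient products pooled with a Gaussian window.
    # Large positive R means both eigenvalues are large: a corner.
    img = img.astype(float)
    fy, fx = np.gradient(img)
    Mxx = gaussian_filter(fx * fx, sigma)
    Mxy = gaussian_filter(fx * fy, sigma)
    Myy = gaussian_filter(fy * fy, sigma)
    det = Mxx * Myy - Mxy * Mxy        # = lambda1 * lambda2
    tr = Mxx + Myy                     # = lambda1 + lambda2
    return det - k * tr**2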
Directionality of gradients
Harris’ stability
Blob detector
2D Laplacian:
$L(x,y,\sigma) = \sigma^2 \left( G_{xx}(x,y,\sigma) + G_{yy}(x,y,\sigma) \right)$
DoG:
$\mathrm{DoG}(x,y,\sigma) = G(x,y,k\sigma) - G(x,y,\sigma)$
The Laplacian has a single maximum at the size of the blob; multiply by $\sigma^2$ to normalize the response across scales.
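A short scipy sketch of both responses (names mine); for a bright blob of radius r the normalized Laplacian peaks near sigma = r / sqrt(2):

import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def laplace_blob_response(img, sigmas):
    # sigma^2-normalized Laplacian per scale; bright blobs give strong
    # negative extrema, so detect on the magnitude or flip the sign
    img = img.astype(float)
    return np.stack([s**2 * gaussian_laplace(img, s) for s in sigmas])

def dog_response(img, sigma, k=1.6):
    # difference of Gaussians: the fast approximation of the normalized
    # Laplacian used for detection
    img = img.astype(float)
    return gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)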
Laplace blob detector
DoG detection + SIFT description
Jepson 2005
System 3: patch detection
http://www.cloudburstresearch.com/
System 3 is an app: Stitching
4. Conclusion
Patch descriptors bring local orderless information. Best combined with color invariance for illumination. Scene-pose-illumination invariance brings meaning.
Lee Comm. ACM 2011
5. Words & Similarity
Before words
1,000 patches of 128 features, over 1,000,000 images: ~11.5 days / 100 GByte.
Capture the pattern in patch
Measure the pattern in a patch with abundant features.
More is better. Different is better. Normalized is better.
Sample many patches
Sample the patches in the image.
Dense: 256 K words; salient: 1 K words.
Salience is good. Dense is better. Combined is even better.
Salient is memory efficient; dense is compute efficient.
Sample many images
Sample the images in the world: the learning set.
Learn all relevant distinctions. Learn all irrelevant variations not covered by the invariance of the features.
Form a dictionary of words
Form regions in feature space.
Size 4,000 (general) to 400,000 (buildings). A random forest is good and fast; 4 runs, 10 deep, is OK.
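A minimal sketch of dictionary formation (assuming scikit-learn); the slide's random forest is swapped here for plain k-means, the simplest way to form the regions:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, n_words=4000, seed=0):
    # descriptors: (n_patches, 128) array of patch descriptors.
    # Partition feature space into n_words regions; the centers are
    # the visual words.
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km.cluster_centers_          # (n_words, 128) word centers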
Count words per image
Retain the word boundaries.
Fill the histogram of words per training image.
Map histogram in similarity space
In 4096-D word-count space, 1 point is 1 image.
Hard assignment: one patch, one word.
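A sketch of hard-assignment counting (numpy/scipy, names mine):

import numpy as np
from scipy.spatial.distance import cdist

def word_histogram(descriptors, codebook):
    # each patch goes to its single nearest word; the image becomes
    # one point in #words-dimensional count space
    words = cdist(descriptors, codebook, 'sqeuclidean').argmin(axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)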
Learn histogram similarity
Learn to distinguish the histograms of the images in the learning set, sorted per class.
The histogram is $V_d = (t_1, t_2, \dots, t_i, \dots, t_n)^T$, where $t_i$ is the total number of occurrences of visual word $i$.
The number of words in common is the intersection between query and image: $S_q = V_q \cap V_j$
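The intersection written out as histogram intersection (numpy, name mine):

import numpy as np

def histogram_intersection(Vq, Vj):
    # words in common between query and image: sum_i min(t_i^q, t_i^j)
    return np.minimum(Vq, Vj).sum()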
Classify unknown image
Retain the word-count discrimination and the support vectors.
Go from patches to words to counts to discrimination.
System 4: Oxford building search
http://www.robots.ox.ac.uk/~vgg/research/oxbuildings/index.html
Note 1: Soft assignment is better
Soft assignment: assign each patch to multiple clusters, weighted by its distance to each center, with a single pooled sigma for all codebook elements.
van Gemert, PAMI 2010
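A sketch of the soft assignment (numpy/scipy, names mine); the per-patch normalization is one of the van Gemert variants, and the single pooled sigma is a free parameter:

import numpy as np
from scipy.spatial.distance import cdist

def soft_word_histogram(descriptors, codebook, sigma):
    # each patch contributes to every word, weighted by a Gaussian on
    # its distance to the word center; one sigma for the whole codebook
    d2 = cdist(descriptors, codebook, 'sqeuclidean')
    w = np.exp(-d2 / (2.0 * sigma**2))
    w /= w.sum(axis=1, keepdims=True) + 1e-12   # normalize per patch
    return w.sum(axis=0)                        # pool into one histogram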
Note 2: SVM similarity is better
SVM can reconstruct a complex geometry at the boundary including disjoint subspaces. The distance metric in the kernel is important.
Note 2: nonlinear SVMs
Vapnik, 1995
How to transform the data such that the samples from the two classes are separable by a linear function (preferably with a margin)? Or, equivalently, define a kernel that does this for you straight away.
Note 2: χ² - kernels
Zhang, IJCV ‘07
Because χ² is meant to discriminate histograms!
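A minimal chi-square kernel SVM over word histograms, assuming scikit-learn's chi2_kernel and SVC (function name mine):

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X, y, gamma=1.0, C=1.0):
    # k(x, z) = exp(-gamma * sum_i (x_i - z_i)^2 / (x_i + z_i)),
    # defined for the non-negative word histograms used here
    K = chi2_kernel(X, gamma=gamma)
    return SVC(kernel='precomputed', C=C).fit(K, y)

# to score new histograms:
#   clf.predict(chi2_kernel(X_test, X_train, gamma=gamma))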
Note 2: … or multiple kernels
Let multiple kernel learning determine the weight of all features
Descriptors                    | Norm = L2 | #  | Norm ∈ L | #
SIFT                           | 0.4902    | 1  | 0.5169   | 4
OpponentSIFT (baseline)        | 0.4975    | 1  | 0.5203   | 4
SIFT and OpponentSIFT          | 0.5187    | 2  | 0.5357   | 8
One channel from C             | 0.5351    | 49 | 0.5405   | 196
Two channel: I and one from C  | 0.5463    | 49 | 0.5507   | 196
Note 3: Speed
For the intersection kernel, $h_i$ is piecewise linear and quite smooth. We can approximate it with fewer uniformly spaced segments. Saves a factor of 75 in time!
Maji CVPR 2008
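A sketch of the trick (numpy, names mine; assumes non-negative histogram features): tabulate the per-dimension function h_d at a few uniform points, then evaluate the decision function by interpolation in O(#dims) instead of O(#support vectors x #dims).

import numpy as np

def build_lookup(support, alpha_y, n_bins=30):
    # support: (n_sv, n_dim) support-vector histograms;
    # alpha_y: (n_sv,) signed dual coefficients alpha_i * y_i.
    # h_d(s) = sum_i alpha_i y_i min(s, x_id) is piecewise linear and
    # smooth, so sample it at n_bins uniform points per dimension.
    smax = support.max(axis=0) + 1e-12
    grid = np.linspace(0, 1, n_bins)[:, None] * smax[None, :]   # (n_bins, n_dim)
    h = np.einsum('i,ibd->bd', alpha_y,
                  np.minimum(grid[None, :, :], support[:, None, :]))
    return grid, h

def fast_decision(x, grid, h, bias=0.0):
    # f(x) = bias + sum_d h_d(x_d), each h_d read off by interpolation
    f = bias
    for d in range(len(x)):
        f += np.interp(x[d], grid[:, d], h[:, d])
    return f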
Note 4: What is in a word?
Gavves 2011; Chum ICCV 2007; Turcot ICCV 2009
This is what a word looks like
Note 4: Where are the synonyms?
But not all views of the same detail are close!
Gavves 2011
Note 4: Forming a selective dictionary
Build the vocabulary by selecting the minimal set of words that maximizes the cross entropy:
99% vocabulary reduction
6% improved recognition
Needs 100 words per concept.
Gavves CVPR 2011
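A heavily hedged stand-in sketch (numpy, names mine): the exact Gavves CVPR 2011 objective is not reproduced here; as an illustration, words are ranked by their per-concept cross-entropy contribution against the overall word distribution, and the top ~100 per concept (the slide's figure) are kept.

import numpy as np

def select_words(H, y, per_concept=100, eps=1e-12):
    # H: (n_images, n_words) word counts; y: concept label per image.
    # HYPOTHETICAL scoring: per concept, weight -log of the overall
    # word probability by the concept's word probability and keep the
    # top per_concept words; union over concepts gives the dictionary.
    p_all = H.sum(axis=0).astype(float)
    p_all /= p_all.sum() + eps
    keep = set()
    for c in np.unique(y):
        p_c = H[y == c].sum(axis=0).astype(float)
        p_c /= p_c.sum() + eps
        score = -p_c * np.log(p_all + eps)      # per-word cross-entropy term
        keep.update(np.argsort(score)[-per_concept:].tolist())
    return np.array(sorted(keep))               # indices of retained words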
Note 4
Selective dictionary by cross entropy.
Examples.
Note 5: Deconstruct words
Fisher vectors capture the internal structure of words.
Train a Gaussian mixture model in which each codebook element has its own sigma, one per dimension. Store the differences in all descriptor dimensions. The feature vector is #codewords x #descriptor dimensions.
Perronnin ECCV 2010
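A sketch of the mean-gradient part (assuming scikit-learn's GaussianMixture with diagonal covariances; Perronnin's full vector also stores sigma gradients and applies power/L2 normalization):

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(X, gmm):
    # per Gaussian k, accumulate posterior-weighted, sigma-normalized
    # differences (descriptor - mean): #codewords x #descriptor dims
    q = gmm.predict_proba(X)                     # (N, K) soft assignments
    sig = np.sqrt(gmm.covariances_)              # (K, D), diagonal model
    diff = (X[:, None, :] - gmm.means_[None, :, :]) / sig[None, :, :]
    fv = (q[:, :, None] * diff).sum(axis=0)      # (K, D)
    fv /= X.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()

# gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(train_descriptors)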
System 5: MediaMill search engine
http://www.mediamill.nl
5. Conclusion
Words are the essential step forward.
More is better, but costly.
Soft assignment works better than hard.
At the cost of less orthogonal methods.
Approximate algorithms are sufficient, mostly.