Lecture 02: Internet video search


Page 1: Lecture 02 internet video search

Before the break

Color, texture, time, spatial structure: Gauss does it all … but not invariance, which is badly needed.

Page 2: Lecture 02 internet video search

Before the break

Most basic systems:

System 1: Swain & Ballard, match colors.

System 2: Blobworld, match texture blobs.

We need more invariance.

Page 3: Lecture 02 internet video search

4. Descriptors

Page 4: Lecture 02 internet video search

Patch descriptors

For a 4x4 grid of patches, find the local gradient directions.

Count the directions per patch into 8 orientation bins, giving the 128-D SIFT histogram (4 x 4 x 8 = 128).

Lowe IJCV 2004
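As a hedged illustration of this counting step only (not Lowe's full pipeline: no keypoint detection, no dominant-orientation alignment, no trilinear interpolation), a numpy sketch for a single 16x16 patch; the function name and the final normalization are illustrative:

```python
import numpy as np

def sift_descriptor(patch):
    """128-D SIFT-style histogram for one 16x16 grayscale patch:
    4x4 grid of subpatches, 8 orientation bins each, gradient
    magnitude as the vote weight."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                      # in [-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for i in range(4):                            # 4x4 grid of 4x4 subpatches
        for j in range(4):
            sb = bins[4*i:4*i+4, 4*j:4*j+4]
            sm = mag[4*i:4*i+4, 4*j:4*j+4]
            for b in range(8):
                desc[i, j, b] = sm[sb == b].sum()
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)  # L2-normalize
```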

Page 5: Lecture 02 internet video search

Affine patch descriptor

Compute the prominent direction.

Start with central Gaussian distributed weights in W.

Compute 2nd order moments matrix Mk over all directions.

Adapt weights to elliptic shape.

Iterate until there is no further change.

$$W_{k+1} = M_k W_k$$

$$M_k = \begin{pmatrix} \sum_{x,y} w_k(x,y)\, f_x f_x & \sum_{x,y} w_k(x,y)\, f_x f_y \\ \sum_{x,y} w_k(x,y)\, f_y f_x & \sum_{x,y} w_k(x,y)\, f_y f_y \end{pmatrix}$$
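A minimal numpy sketch of this iteration, assuming precomputed patch derivatives fx and fy; the trace normalization that keeps the loop stable is an illustrative choice, not part of the slide's formula:

```python
import numpy as np

def second_moment(fx, fy, w):
    # M_k: 2nd-order moments of the gradients, weighted by w(x, y)
    return np.array([[(w * fx * fx).sum(), (w * fx * fy).sum()],
                     [(w * fy * fx).sum(), (w * fy * fy).sum()]])

def affine_adapt(fx, fy, sigma=4.0, iters=20, tol=1e-4):
    """Iterate W_{k+1} = M_k W_k until the window shape stops changing."""
    n = fx.shape[0]
    ys, xs = np.mgrid[0:n, 0:n] - (n - 1) / 2.0
    W = np.eye(2) / sigma**2            # start: central isotropic Gaussian
    for _ in range(iters):
        # elliptic Gaussian weights with shape matrix W
        q = W[0, 0]*xs*xs + 2*W[0, 1]*xs*ys + W[1, 1]*ys*ys
        w = np.exp(-0.5 * q)
        M = second_moment(fx, fy, w)
        M = M / np.trace(M)             # illustrative: keep the scale stable
        W_new = M @ W
        if np.abs(W_new - W).max() < tol * np.abs(W).max():
            break
        W = W_new
    return W                            # ellipse: eigen-decomposition of W
```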

Page 6: Lecture 02 internet video search

Color Patch Descriptors

Invariance properties per descriptor (columns: light intensity change; light intensity shift; light intensity change and shift; light color change; light color change and shift):

SIFT            +   +   +   -   -
OpponentSIFT    +   +   +   -   -
C-SIFT          +   -   -   -   -
RGB-SIFT        +   +   +   +   +

van de Sande PAMI 2010

Page 7: Lecture 02 internet video search

Results on PASCAL VOC 2007

Page 8: Lecture 02 internet video search

Results per object category

[Bar chart: Average Precision (0.0 to 0.9) per category: person, aeroplane, horse, car, train, boat, bus, motorbike, bicycle, chair, cat, tvmonitor, sofa, bird, sheep, diningtable, dog, cow, pottedplant, bottle, plus MAP; comparing OpponentSIFT (L2 norm) against two-channel I+C (L2 norm).]

Page 9: Lecture 02 internet video search

Corner selector

The change energy at x over a small vector u:

$$E(u, v) = [u, v]\; M\; [u, v]^T, \qquad M = \sum_{x,y} \begin{pmatrix} f_x f_x & f_x f_y \\ f_y f_x & f_y f_y \end{pmatrix}$$

Since M is symmetric, we have

$$M = R^{-1} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} R$$

where R rotates onto the eigenvectors: the eigenvector of $\lambda_{\max}$ gives the direction of the fastest change, and the ellipse of constant energy has axes $(\lambda_{\max})^{-1/2}$ and $(\lambda_{\min})^{-1/2}$. For a corner, both eigenvalues should be large.

Harris cornerness response:

$$R = \det M - k\,(\operatorname{trace} M)^2$$

$$\det M = \lambda_1 \lambda_2 = I_x^2 I_y^2 - (I_x I_y)^2, \qquad \operatorname{trace} M = \lambda_1 + \lambda_2 = I_x^2 + I_y^2$$
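A short numpy/scipy sketch of the resulting cornerness map; the sigma of the smoothing window and k = 0.04 are illustrative defaults:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    """R = det(M) - k * trace(M)^2 per pixel, with the entries of the
    2nd-moment matrix M smoothed by a Gaussian window."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)
    Ixx = gaussian_filter(Ix * Ix, sigma)   # smoothed f_x f_x
    Ixy = gaussian_filter(Ix * Iy, sigma)   # smoothed f_x f_y
    Iyy = gaussian_filter(Iy * Iy, sigma)   # smoothed f_y f_y
    det = Ixx * Iyy - Ixy**2                # lambda_1 * lambda_2
    trace = Ixx + Iyy                       # lambda_1 + lambda_2
    return det - k * trace**2
```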

Page 10: Lecture 02 internet video search

Directionality of gradients

Page 11: Lecture 02 internet video search

Harris’ stability

Page 12: Lecture 02 internet video search

Blob detector

2D Laplacian, multiplied by σ² to normalize across scales:

$$L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$$

DoG approximation:

$$\mathrm{DoG} = G(x, y, k\sigma) - G(x, y, \sigma)$$

The Laplacian has a single maximum at the size of the blob.
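A hedged sketch of blob detection with these formulas: build a Gaussian stack, difference adjacent levels, and keep extrema over space and scale. The threshold and the brute-force scan are illustrative simplifications:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_blobs(img, sigma0=1.6, k=1.4, levels=8, thresh=0.02):
    """Blobs as local extrema of DoG(x, y, sigma) = G(k*sigma) - G(sigma),
    which approximates the sigma^2-normalized Laplacian."""
    img = img.astype(float)
    sigmas = [sigma0 * k**i for i in range(levels + 1)]
    smoothed = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([smoothed[i + 1] - smoothed[i] for i in range(levels)])
    blobs = []
    for i in range(1, levels - 1):          # interior scales only
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                v = dog[i, y, x]
                cube = dog[i-1:i+2, y-1:y+2, x-1:x+2]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    blobs.append((y, x, sigmas[i]))  # blob at scale sigma_i
    return blobs
```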

Page 13: Lecture 02 internet video search

Laplace blob detector

Page 14: Lecture 02 internet video search

Laplace blob detector

Page 15: Lecture 02 internet video search

Laplace blob detector

Page 16: Lecture 02 internet video search

DoG detection + SIFT description

Jepson 2005

Page 17: Lecture 02 internet video search

System 3: patch detection

http://www.cloudburstresearch.com/

System 3 is an app: Stitching

Page 18: Lecture 02 internet video search

4. Conclusion

Patch descriptors bring local, orderless information. They are best combined with color invariance against illumination changes. Scene, pose, and illumination invariance brings meaning.

Lee Comm. ACM 2011

Page 19: Lecture 02 internet video search

5. Words & Similarity

Page 20: Lecture 02 internet video search

Before words

1,000 patches x 128 features x 1,000,000 images ~ 11.5 days / 100 GB
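As a rough check on the storage figure, assuming one byte per stored feature value: $10^6 \times 1000 \times 128 \approx 1.3 \times 10^{11}$ bytes, about 120 GB, the same order as the 100 GB quoted.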

Page 21: Lecture 02 internet video search

Capture the pattern in patch

Measure the pattern in a patch with abundant features.

More is better. Different is better. Normalized is better.

Page 22: Lecture 02 internet video search

Sample many patches

Sample the patches in the image.

Dense: 256 K words; salient: 1 K words. Salient is good. Dense is better. Combined is even better. Salient is memory efficient. Dense is compute efficient.

Page 23: Lecture 02 internet video search

Sample many images

Sample the images in the world: the learning set.

Learn all relevant distinctions. Learn all irrelevant variations not covered in the invariance of features.

Page 24: Lecture 02 internet video search

Form a dictionary of words

Form regions in feature space.

Size 4,000 (general) to 400,000 (buildings). A random forest is good and fast; 4 runs, 10 deep, is OK.
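The slide suggests a random forest; as a sketch, here is the common k-means alternative for forming the regions, using scikit-learn (the vocabulary size and batch parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_dictionary(descriptors, n_words=4000, seed=0):
    """Cluster sampled 128-D descriptors into n_words regions of
    feature space; each cluster center is one visual word."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed,
                         batch_size=10_000, n_init=3)
    km.fit(descriptors)                 # descriptors: (N, 128) array
    return km.cluster_centers_          # the dictionary of visual words
```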

Page 25: Lecture 02 internet video search

Count words per image

Retain the word boundaries.

Fill the histogram of words per training image.
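A sketch of this counting step, hard-assigning each patch descriptor to its nearest word (it assumes a dictionary array such as the one built in the previous sketch):

```python
import numpy as np
from scipy.spatial.distance import cdist

def word_histogram(descriptors, dictionary):
    """Hard-assign each patch descriptor to its nearest word and
    fill the per-image word-count histogram."""
    d = cdist(descriptors, dictionary)        # (n_patches, n_words)
    words = d.argmin(axis=1)                  # one patch -> one word
    return np.bincount(words, minlength=len(dictionary))
```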

Page 26: Lecture 02 internet video search

Map histogram in similarity space

In 4,096-D word-count space, 1 point is 1 image.

Hard assignment: one patch one word.

Page 27: Lecture 02 internet video search

Learn histogram similarity

Learn the histogram distinction between the image histograms sorted per class of images in the learning set.

The histogram is $V_d = (t_1, t_2, \ldots, t_i, \ldots, t_n)^T$, where $t_i$ is the total number of occurrences of visual word $i$.

The number of words in common is the intersection between query and image: $S_q = V_q \cap V_j = \sum_i \min(t_i^q, t_i^j)$.
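A direct numpy rendering of this intersection for count histograms:

```python
import numpy as np

def intersection(Vq, Vj):
    """Words in common between query and image histograms:
    S_q = sum_i min(t_i^q, t_i^j)."""
    return np.minimum(Vq, Vj).sum()
```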

Page 28: Lecture 02 internet video search

Classify unknown image

Retain the word-count discrimination and the support vectors.

Go from patches > words > counts > discrimination.

Page 29: Lecture 02 internet video search

System 4: Oxford building search

http://www.robots.ox.ac.uk/~vgg/research/oxbuildings/index.html

Page 30: Lecture 02 internet video search

Note 1: Soft assignment is better

Soft assignment: assign to multiple clusters, weighted by the distance to each center. A single pooled sigma for all codebook elements.

van Gemert, PAMI 2010
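A sketch of this soft assignment with one pooled sigma for the whole codebook; the sigma value is illustrative, not the tuned one from the paper:

```python
import numpy as np
from scipy.spatial.distance import cdist

def soft_histogram(descriptors, dictionary, sigma=90.0):
    """Soft assignment: each patch votes for all words, weighted by a
    Gaussian of its distance to each center; a single pooled sigma."""
    d = cdist(descriptors, dictionary)            # (n_patches, n_words)
    w = np.exp(-0.5 * (d / sigma) ** 2)
    w /= w.sum(axis=1, keepdims=True)             # each patch votes total 1
    return w.sum(axis=0)                          # pooled per-word counts
```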

Page 31: Lecture 02 internet video search

Note 2: SVM similarity is better

SVM can reconstruct a complex geometry at the boundary including disjoint subspaces. The distance metric in the kernel is important.

Page 32: Lecture 02 internet video search

Note 2: nonlinear SVMs

Vapnik, 1995

The question is how to transform the data such that the samples from the two classes are separable by a linear function (preferably with a margin); or, equivalently, how to define a kernel that does this for you straight away.

Page 33: Lecture 02 internet video search

Note 2: χ² kernels

Zhang IJCV 2007

Because χ² is meant to discriminate histograms!
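A sketch of a χ² kernel used with a precomputed-kernel SVM over histograms; gamma and the epsilon guard are illustrative, and scikit-learn also ships an equivalent chi2_kernel in sklearn.metrics.pairwise:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(A, B, gamma=0.5):
    """k(h, h') = exp(-gamma * sum_i (h_i - h'_i)^2 / (h_i + h'_i))
    for nonnegative (histogram) features."""
    num = (A[:, None, :] - B[None, :, :]) ** 2
    den = A[:, None, :] + B[None, :, :] + 1e-12
    return np.exp(-gamma * (num / den).sum(axis=2))

# Illustrative usage with a precomputed Gram matrix:
# clf = SVC(kernel='precomputed').fit(chi2_kernel(X_train, X_train), y)
# scores = clf.decision_function(chi2_kernel(X_test, X_train))
```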

Page 34: Lecture 02 internet video search

Note 2: … or multiple kernels

Let multiple kernel learning determine the weight of all features

Descriptors                      Norm = L2     #    Norm ∈ L     #
SIFT                              0.4902       1     0.5169      4
OpponentSIFT (baseline)           0.4975       1     0.5203      4
SIFT and OpponentSIFT             0.5187       2     0.5357      8
One channel from C                0.5351      49     0.5405    196
Two channel: I and one from C     0.5463      49     0.5507    196

Page 35: Lecture 02 internet video search

Note 3: Speed

For the intersection kernel, the per-dimension function $h_i$ is piecewise linear and quite smooth (blue plot). We can approximate it with fewer, uniformly spaced segments (red plot). This saves a factor of 75 in time!

Maji CVPR 2008
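A sketch of this lookup-table idea for the intersection-kernel decision function; here alpha is assumed to fold in the labels, and the single shared uniform grid is an illustrative simplification of Maji's scheme:

```python
import numpy as np

def build_tables(support, alpha, n_bins=30):
    """Precompute h_i(t) = sum_j alpha_j * min(t, s_ji) on a uniform
    grid per dimension; exact evaluation is piecewise linear, so few
    uniformly spaced samples suffice."""
    _, n_dim = support.shape
    grid = np.linspace(0, support.max(), n_bins)   # shared uniform grid
    h = np.array([[(alpha * np.minimum(t, support[:, i])).sum()
                   for t in grid] for i in range(n_dim)])
    return grid, h                                 # h: (n_dim, n_bins)

def decision(x, grid, h, bias=0.0):
    """f(x) = sum_i h_i(x_i): one interpolated table lookup per
    dimension instead of a pass over all support vectors."""
    return sum(np.interp(x[i], grid, h[i]) for i in range(len(x))) + bias
```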

Page 36: Lecture 02 internet video search

Note 4: What is in a word?

This is what a word looks like.

Gavves 2011; Chum ICCV 2007; Turcot ICCV 2009

Page 37: Lecture 02 internet video search

Note 4: Where are the synonyms?

But not all views of the same detail are close!

Gavves 2011

Page 38: Lecture 02 internet video search

Note 4: Forming selective dictionary

Build the vocabulary by selecting the minimal set that maximizes the cross entropy:

99% vocabulary reduction

6% improved recognition

Needs 100 words per concept.

Gavves 2011 CVPR

Page 39: Lecture 02 internet video search

Note 4

Selective dictionary by cross entropy.

Examples.

Page 40: Lecture 02 internet video search

Note 5: Deconstruct words

Fisher vectors capture the internal structure of words.

Train a Gaussian Mixture Model, where each codebook element has its own sigma, one per dimension. Store the differences in all descriptor dimensions. The feature vector is #codewords × #descriptor dimensions.

Perronnin ECCV 2010
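A reduced sketch of the Fisher vector, keeping only the gradients with respect to the means (the full Perronnin vector also stores the per-dimension sigma terms), assuming a diagonal-covariance scikit-learn GMM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Gradient of the descriptor log-likelihood w.r.t. the GMM means.
    Output dimension = n_codewords * n_descriptor_dims."""
    q = gmm.predict_proba(descriptors)            # (N, K) soft assignments
    K, D = gmm.means_.shape
    fv = np.empty((K, D))
    for k in range(K):
        # per-dimension sigma: sqrt of the diagonal covariance of word k
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        fv[k] = (q[:, k:k+1] * diff).sum(axis=0) / (
            descriptors.shape[0] * np.sqrt(gmm.weights_[k]))
    return fv.ravel()

# Illustrative training of the codebook GMM:
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(descs)
```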

Page 41: Lecture 02 internet video search

System 5: MediaMill search engine

http://www.mediamill.nl

Page 42: Lecture 02 internet video search

5. Conclusion

Words are the essential step forward.

More is better, but costly.

Soft assignment works better than hard.

At the cost of less orthogonal methods.

Approximate algorithms are sufficient, mostly.