
Recognition

Topics that we will try to cover:

Indexing for fast retrieval (we still owe this one)

History of recognition techniques

Object classification: bag-of-words, spatial pyramids, neural networks

Object class detection: Hough-voting techniques, Support Vector Machine (SVM) detector on HOG features, deformable part-based model (DPM), R-CNN (detector with neural networks)

Object (class) segmentation: unsupervised segmentation (“bottom-up” techniques), supervised segmentation (“top-down” techniques)


Recognition: Indexing for Fast Retrieval


Recognizing or Retrieving Specific Objects

Example: Visual search in feature films

[Source: J. Sivic, slide credit: R. Urtasun]


Example: Search photos on the web for particular places

[Source: J. Sivic, slide credit: R. Urtasun]


Why is it Difficult?

Objects can undergo possibly large changes in scale, viewpoint, and lighting, as well as partial occlusion.

[Source: J. Sivic, slide credit: R. Urtasun]


There are tons of data.


Our Case: Matching with Local Features

For each image in our database we extracted local descriptors (e.g., SIFT)


Let’s focus on descriptors only (vectors of, e.g., 128 dimensions for SIFT).


Indexing!


Indexing Local Features: Inverted File Index

For text documents, an efficient way to find all pages on which a word occurs is to use an index.

We want to find all images in which a feature occurs.

To use this idea, we’ll need to map our features to “visual words”.

Why?

[Source: K. Grauman, slide credit: R. Urtasun]

What Are Our Visual “Words”?


Visual Words

All example patches on the right belong to the same visual word.

[Source: R. Urtasun]



Inverted File Index

Now we have found all images in the database that have at least one visual word in common with the query image.

But this can still give us lots of images... What can we do?

Idea: Compute similarity between query image and retrieved images. How can we do this fast?
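Before ranking by similarity, the candidate set itself comes from the inverted file lookup. A minimal Matlab sketch, under assumed variable names (word_ids{d} lists the visual word ids present in database image d, and query_words those in the query image):

    % Offline: map each visual word to the list of database images containing it
    index = containers.Map('KeyType', 'double', 'ValueType', 'any');
    for d = 1:numel(word_ids)
        uw = unique(word_ids{d});                 % distinct words in image d
        for w = uw(:)'                            % iterate as a row vector
            if index.isKey(w), index(w) = [index(w), d];
            else, index(w) = d;
            end
        end
    end

    % Online: candidate images are those sharing at least one word with the query
    candidates = [];
    uq = unique(query_words);
    for w = uq(:)'
        if index.isKey(w), candidates = [candidates, index(w)]; end
    end
    candidates = unique(candidates);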


Relation to Documents

[Slide credit: R. Urtasun]


Bags of Visual Words

[Slide credit: R. Urtasun]

Summarize the entire image based on its distribution (histogram) of word occurrences.

Analogous to bag of words representation commonly used for documents.
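As a concrete sketch in Matlab: if IDX holds the visual word id of each descriptor in an image (the assignment to words is shown later in the summary) and k is the vocabulary size, the histogram is one line:

    % bow(i) = number of descriptors of this image assigned to visual word i
    bow = accumarray(IDX(:), 1, [k, 1])';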


Compute a Bag-of-Words Description

Instead of a histogram, for retrieval it’s better to re-weight the image description vector t = [t_1, t_2, ..., t_i, ...] with term frequency-inverse document frequency (tf-idf), a standard trick in document retrieval:

t_i = (n_id / n_d) · log(N / n_i)

where:

n_id ... is the number of occurrences of word i in image d

n_d ... is the total number of words in image d

n_i ... is the number of occurrences of word i in the whole database

N ... is the number of documents in the whole database

The weighting is a product of two terms: the word frequency n_id / n_d, and the inverse document frequency log(N / n_i).

Intuition behind this: the word frequency upweights words that occur often in a particular document, and thus describe it well, while the inverse document frequency downweights words that occur often in the full dataset.
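As a sketch with hypothetical names: if H is an N x k matrix of raw counts with H(d, i) = n_id, the re-weighting is a few lines of Matlab (uses implicit expansion, R2016b+; assumes every word occurs at least once in the database):

    nd = sum(H, 2);                  % n_d: total number of words in each image
    ni = sum(H, 1);                  % n_i: occurrences of word i in the whole database
    N  = size(H, 1);                 % number of documents in the database
    T  = (H ./ nd) .* log(N ./ ni);  % T(d, i) = t_i for image d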


Comparing Images

Compute the similarity by the normalized dot product between their (tf-idf weighted) occurrence counts:

sim(t_j, q) = <t_j, q> / (||t_j|| · ||q||)

Rank images in the database based on the similarity score (the higher the better).

Are we done?

No. Our similarity doesn’t take into account geometric relations between features.

Take the top K best-ranked images and do spatial verification (compute a transformation and count inliers).
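In Matlab, again as a sketch with assumed names (T is the N x k matrix of tf-idf vectors of the retrieved images, q the 1 x k tf-idf vector of the query):

    scores = (T * q') ./ (sqrt(sum(T.^2, 2)) * norm(q) + eps);  % cosine similarity per image
    [~, order] = sort(scores, 'descend');                       % best matches first
    topK = order(1:min(100, numel(order)));                     % e.g., keep K = 100 for verification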


Spatial Verification

Both image pairs have many visual words in common

Only some of the matches are mutually consistent

[Source: O. Chum]


Summary – Stuff You Need To Know

Fast image retrieval:

Compute features in all images from the database, and in the query image.

Cluster the descriptors from the images in the database (e.g., k-means) to get k clusters. These clusters are vectors that live in the same dimensional space as the descriptors. We call them visual words.

Assign each descriptor in database and query image to the closest cluster.

Build an inverted file index

For a query image, look up all the visual words in the inverted file index to get a list of images that share at least one visual word with the query.

Compute a bag-of-words (BoW) vector for each retrieved image and the query. This vector just counts the number of occurrences of each word. It has as many dimensions as there are visual words. Weight the vector with tf-idf.

Compute similarity between the query BoW vector and all retrieved image BoW vectors. Sort (highest to lowest). Take the top K most similar images (e.g., 100).

Do spatial verification on all top K retrieved images (RANSAC + affine or homography + remove images with too few inliers); a minimal sketch follows below.
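A sketch of that verification step for one retrieved image, assuming the Computer Vision Toolbox (estimateGeometricTransform) and hypothetical M x 2 matrices (M >= 4) of matched keypoint locations:

    % RANSAC fit of a homography to the putative matches, then count inliers
    [tform, inliers1, inliers2] = estimateGeometricTransform( ...
        pts_query, pts_retrieved, 'projective', 'MaxDistance', 4);
    numInliers = size(inliers1, 1);
    isVerified = numInliers >= 15;   % illustrative threshold; too few inliers -> reject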


Summary – Stuff You Need To Know

Matlab function:

    [IDX, W] = kmeans(X, k);

where rows of X are descriptors, rows of W are the visual word vectors, and IDX are assignments of rows of X to visual words.

Once you have W, you can quickly compute IDX via the dist2 function (Assignment 2):

    D = dist2(X', W'); [~, IDX] = min(D, [], 2);

A much faster way of computing the closest cluster (IDX) is via the FLANN library: http://www.cs.ubc.ca/research/flann/

Since X is typically super large, kmeans will run for days... A solution is to randomly sample a few descriptors from X and cluster those. Another great possibility is to use this: http://www.robots.ox.ac.uk/~vgg/software/fastanncluster/
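For instance, a sketch of the random-subsampling workaround (the sample size is illustrative):

    % Cluster a random subset of descriptors instead of all of X
    m = min(200000, size(X, 1));                 % e.g., 200k sampled descriptors
    sample = X(randperm(size(X, 1), m), :);
    [~, W] = kmeans(sample, k, 'MaxIter', 200);  % W: the k visual words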


Even Faster?

Can we make the retrieval process even more efficient?


Vocabulary Trees

Hierarchical clustering for large vocabularies, [Nister et al., 06].

k defines the branch factor (number of children of each node) of the tree.

First, an initial k-means process is run on the training data, defining k cluster centers.

The same process is then recursively applied to each group.

The tree is determined level by level, up to some maximum number of levels L.

Each division into k parts is only defined by the distribution of the descriptor vectors that belong to the parent quantization cell.


Constructing the tree

Offline phase: hierarchical clustering.
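A recursive Matlab sketch of this offline phase (the node layout is hypothetical; kmeans is assumed from the Statistics and Machine Learning Toolbox):

    function node = buildTree(descriptors, k, L)
    % Hierarchical k-means: split the cell into k children, recurse up to L levels
    node.centers  = [];
    node.children = {};
    if L == 0 || size(descriptors, 1) < k
        return;                                    % leaf: too deep or too few points
    end
    [idx, node.centers] = kmeans(descriptors, k);  % k centers from this cell only
    for c = 1:k
        node.children{c} = buildTree(descriptors(idx == c, :), k, L - 1);
    end
    end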


Parsing the tree

Online phase: each descriptor vector is propagated down the tree by comparing it at each level to the k candidate cluster centers (represented by the k children in the tree) and choosing the closest one.

The tree directly defines the visual vocabulary and an efficient search procedure in an integrated manner.

Every node in the vocabulary tree is associated with an inverted file.

The inverted files of inner nodes are the concatenation of the inverted files of the leaf nodes (virtual).
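A matching sketch of the online step for one descriptor, using the hypothetical node layout from the construction sketch above (implicit expansion, R2016b+):

    function path = parseTree(node, desc)
    % Walk down the tree, at each level choosing the closest of the k centers
    path = [];
    while ~isempty(node.centers)
        d2 = sum((node.centers - desc).^2, 2);  % squared distances to the k centers
        [~, c] = min(d2);
        path(end + 1) = c;                      % record the chosen branch
        node = node.children{c};
    end
    end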


Vocabulary size

Complexity is governed by the branching factor k and the number of levels L: the tree has up to k^L leaf words, while assigning a descriptor costs only about kL comparisons.

Most important for retrieval quality is to have a large vocabulary.


Visual words/bags of words

Good

flexible to geometry / deformations / viewpoint

compact summary of image content

provides vector representation for sets

very good results in practice

Bad

background and foreground mixed when bag covers whole image

optimal vocabulary formation remains unclear

basic model ignores geometry; must verify afterwards, or encode via features


Next Time

Recognition Techniques in 20 min
