Global and high-level image descriptions

Presented by Yao Pan

Low-level descriptor

• Pixel
• Patch
• Mathematical transformation (Laplace, Gauss) of pixel and patch (SIFT, GIST…)

Semantic gap

High-level task: object recognition, scene classification

Low-level image feature: pixel intensity, gradient

Analogy to text analysis

We want to determine which author wrote an article, or what type of content it is (science, politics, entertainment).

Low-level feature: letter frequency

Levels of text, from high to low: meaning group, sentence, phrase, word, letter frequency

Overview

• Modeling the shape of the scene: a Holistic Representation of the Spatial Envelope. A. Oliva and A. Torralba. IJCV 2001
• Efficient Object Category Recognition Using Classemes. Lorenzo Torresani, Martin Szummer, Andrew Fitzgibbon. ECCV 2010
• Objects as Attributes for Scene Classification. Li-Jia Li*, Hao Su*, Yongwhan Lim, Li Fei-Fei. ECCV 2010
• Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. L-J. Li, H. Su, E. Xing, L. Fei-Fei. NIPS 2010

Modeling the shape of the scene: a Holistic Representation of the Spatial Envelope

A. Oliva and A. Torralba. IJCV 2001


Motivation

Scene categorization

One way of doing this: segment and detect the objects in the picture, then classify the scene according to which objects the picture contains.

But, segmentation and object detection are hard problems.

Picture from: J. Yao, S. Fidler and R. Urtasun


Motivation

Experiment in cognitive psychology (Mary C. Potter, 1975, Science)

• Subjects were shown a target scene picture or a scene name beforehand.
• They were then presented a sequence of pictures at rates up to 8 per second and asked to press a button when they saw the target.
• Detection rates were surprisingly high (more than 90%).


Motivation

• Subsequent experiments imply that object information might be ignored during rapid categorization of scenes.
• Humans use holistic visual features (spatial layout, spatial structure, shape of the scene...).
• In this paper, the authors call these features the Spatial Envelope.
• Scenes belonging to the same category share a similar spatial structure that can be extracted without segmenting the image.


Approach

What is a scene?

• Traditionally: an unconstrained configuration of objects.
• In this paper: the scene is treated as an individual object.


What exactly is the spatial envelope?

Spatial Envelope Properties

• Naturalness
    • Straight horizontal and vertical lines in man-made scenes vs. textured zones in natural landscapes.
• Openness
• Roughness
• Expansion
• Ruggedness

Goal: find a low-dimensional scene space in which scenes of the same category are projected close together.


Naturalness

Natural vs. man-made

Slides credit: scene understanding seminar


Openness

• Decreases as the number of boundaries increases



Roughness

• Size of elements at each spatial scale



Expansion (mainly for man-made scenes)

A flat view of a building would have a low degree of Expansion. A street with long vanishing lines would have a high degree of Expansion.



Expansion

Follow-up: Depth estimation from image structure. A. Torralba, A. Oliva, 2003


Ruggedness (mainly for natural scene)

• Deviation of ground relative to horizon



Approach

How can these abstract concepts be translated into computable mathematical values?

• Discrete Fourier transform (DFT)
• Windowed DFT
• PCA


Approach

Discrete Fourier transform (DFT)

DFT of an image:

Where i(x,y) is the intensity distribution
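The formula on this slide was not preserved. As a reminder, a standard form of the 2-D DFT is sketched below, assuming an N×N image and integer frequency indices (f_x, f_y); the paper's own definition may differ, for example by including a window function. Here A and Φ are assumed names for the amplitude and phase spectra.

```latex
I(f_x, f_y) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} i(x, y)\, e^{-j 2\pi (f_x x + f_y y)/N}
            = A(f_x, f_y)\, e^{j \Phi(f_x, f_y)}
```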


What is the Fourier transform of an image?

[Figure: Fourier transform of a 1-D signal; an original image and its Fourier transform (amplitude spectrum)]


Fourier transform of an image

Polar form:

The origin represents the DC component (zero frequency).
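A minimal NumPy sketch of the amplitude spectrum, with the DC term shifted to the center of the array as in the usual spectrum plots. The function name and the synthetic test image are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def amplitude_and_phase(image):
    """Return the centered amplitude and phase spectra of a 2-D grayscale image."""
    # fftshift moves the zero-frequency (DC) term to the center of the array
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.abs(spectrum), np.angle(spectrum)

# Hypothetical test image: a constant offset plus a horizontal sinusoid
# with 4 cycles across the image width.
n = 64
x = np.arange(n)
image = 1.0 + np.cos(2 * np.pi * 4 * x / n)[None, :] * np.ones((n, 1))

amplitude, phase = amplitude_and_phase(image)
dc = amplitude[n // 2, n // 2]          # DC term: sum of all pixel values
peak = amplitude[n // 2, n // 2 + 4]    # energy of the 4-cycle sinusoid
```

The DC value at the center equals the sum of the pixel intensities, and the sinusoid shows up as a symmetric pair of peaks 4 bins to the left and right of the center.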


Fourier transform of an image

Keep low frequencies only: image detail is lost

Keep high frequencies only: the gradual intensity variation (gradient) is lost
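The two cases above can be reproduced with a simple circular mask in the frequency domain. This is only a sketch: the hard cutoff, the `radius` parameter, and the random test image are assumptions, and the slides do not say which filter was used for the figures.

```python
import numpy as np

def frequency_filter(image, radius, keep="low"):
    """Keep only frequencies inside (low) or outside (high) a circle of given radius."""
    rows, cols = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # DC at the array center
    yy, xx = np.ogrid[:rows, :cols]
    dist = np.hypot(yy - rows // 2, xx - cols // 2)  # distance from the DC term
    mask = dist <= radius if keep == "low" else dist > radius
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

rng = np.random.default_rng(0)
image = rng.random((32, 32))                         # hypothetical test image
low = frequency_filter(image, radius=4, keep="low")    # blurry: detail removed
high = frequency_filter(image, radius=4, keep="high")  # edges only: mean removed
```

Because the two masks are complementary, `low + high` reconstructs the original image, and the high-pass result has zero mean since the DC term is removed.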


Approach

Different scene categories have different spectral signatures

• Amplitude captures roughness
• Orientation captures dominant edges


Experiment

• 8 categories
    • Natural: Coast, Country, Forest, Mountain
    • Man-made: Highway, Street, Close-up, Tall building
• Choose 400 target images from the database and retrieve the first 7 neighbors of each.
• Neighbors are determined by the Euclidean distance between attributes.


Experiment

• A scene was considered correctly recognized when at least 4 of its neighbors had the same category membership.
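The recognition criterion can be sketched as a nearest-neighbor vote. The function name, the attribute vectors, and the two toy clusters below are assumptions for illustration; only the 7 neighbors, the Euclidean distance, and the threshold of 4 come from the slides.

```python
import numpy as np

def recognized_by_neighbors(attributes, labels, k=7, min_agree=4):
    """Mark each image as recognized if >= min_agree of its k nearest neighbors share its label."""
    attributes = np.asarray(attributes, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between attribute vectors
    diffs = attributes[:, None, :] - attributes[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)              # an image is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest images
    agree = (labels[nearest] == labels[:, None]).sum(axis=1)
    return agree >= min_agree

# Hypothetical data: two well-separated clusters of 8 images each
rng = np.random.default_rng(0)
attributes = np.vstack([rng.normal(0.0, 0.1, size=(8, 2)),
                        rng.normal(10.0, 0.1, size=(8, 2))])
labels = np.array(["coast"] * 8 + ["street"] * 8)
recognized = recognized_by_neighbors(attributes, labels)
accuracy = recognized.mean()
```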


Experiment


Experiment

[Figure: confusion matrix for natural scenes; confusion matrix for man-made scenes]

Average accuracy: WDST 92%, DST 86%


Limitation

• Primarily for man-made vs. natural differences.
• Coarse-grained classification

Efficient Object Category Recognition Using Classemes

Lorenzo Torresani, Martin Szummer, Andrew Fitzgibbon


Motivation

Large-scale object category recognition

Requirements:

• Novel categories
    • Zero-shot learning: a category may not appear in the training examples. There are countless object categories, so it is impossible to cover them all in a training dataset.
• Compact descriptor
    • Disk vs. memory
• Simple classifier


Motivation

Existing system:

• Attribute approach
    • Categories are described by a set of boolean attributes

            Has beak    Has tail    Near water
    duck    √           ×           √

• Drawback: humans must label the training data, and for some categories it is hard to define attributes.


Approach

• Classeme
    • Represent an object as a combination of the other object classes (classemes) to which it is related.
    • These classemes are extracted automatically and do not necessarily carry semantic meaning.


Approach

• Classeme learning
    • Choose a set of category labels from the Large Scale Concept Ontology for Multimedia: C = 2659 categories in total.
    • Learn a one-versus-all classifier for each category (by multiple kernel learning).
    • An image x is represented as the vector $f(x) = [\phi_1(x), \ldots, \phi_C(x)]$, where $\phi_c(x)$ is the output of the classifier for category c.
    • To achieve compactness, the vector is not stored in double precision but quantized to Q levels (1 bit to 4 bits).
    • After obtaining this representation, apply a classification method such as SVM…
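A rough sketch of building and quantizing a classeme vector follows. The linear classifiers stand in for the multiple-kernel classifiers of the paper, the uniform min/max quantization scheme is an assumption (the slides only give the bit widths), and the toy sizes are far smaller than the real C = 2659.

```python
import numpy as np

def classeme_vector(x, weights, biases):
    """f(x) = [phi_1(x), ..., phi_C(x)], using hypothetical linear classifiers phi_c."""
    return weights @ x + biases

def quantize(f, bits):
    """Quantize classifier outputs to Q = 2**bits levels (assumed uniform scheme)."""
    if bits == 1:
        # Binary case: keep only the sign of each classifier output
        return (f > 0).astype(np.uint8)
    levels = 2 ** bits
    lo, hi = f.min(), f.max()
    scaled = (f - lo) / (hi - lo) if hi > lo else np.zeros_like(f)
    # Map [0, 1] onto integer levels 0 .. levels - 1
    return np.minimum((scaled * levels).astype(np.uint8), levels - 1)

# Toy setup: C = 10 classemes over a 16-dimensional low-level feature
rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 16))
biases = rng.normal(size=10)
x = rng.normal(size=16)

f = classeme_vector(x, weights, biases)
q1 = quantize(f, bits=1)   # 1 bit per classeme
q4 = quantize(f, bits=4)   # 4 bits per classeme, values 0..15
```

The quantized vector can then be fed to any standard classifier in place of the real-valued one.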


Approach


Approach

[Figure: example category labels "sky" and "crying"]

Get 150 training images for each classeme from the bing.com image search engine

Experiment 1: Multiclass classification

Dataset: Caltech256

• 256 categories, 30608 images.

Competitors:

• Multiclass SVM
• Neural network
• Decision forests
• Nearest neighbor
• LP-β
    • Combines multiple complementary features (color-based, shape-based, texture-based) and learns the weights for the different features.


Accuracy comparison

• Accuracy: 36% versus 42%
• But much faster


Accuracy comparison


Speed comparison

Over two orders of magnitude faster!


Compactness comparison

Objects as Attributes for Scene Classification

Li-Jia Li*, Hao Su*, Yongwhan Lim, Li Fei-Fei

Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

L-J. Li, H. Su, E. Xing, L. Fei-Fei.

Motivation

More of an action recognition task than a scene classification task?

Semantic gap

High-level task: object recognition, scene classification

Low-level image feature: pixel intensity, gradient

Approach

After we get the OB representation for each image, we can use any machine learning method for classification. (In this paper, SVM and logistic regression are chosen.)

Approach

Spatial pyramid representation

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. S. Lazebnik, C. Schmid and J. Ponce

Implementation Detail

How to choose the objects in the object bank? From where? How many?

Choose the most frequent objects from popular datasets (LabelMe, ESP, ImageNet, Flickr) and take their intersection. This results in 200 objects.

Implementation Detail

The large number of objects brings the curse of dimensionality.

The second paper mainly deals with the computation problem.

In this paper, N=200 object detectors compute responses on S=12 scales with L=3 spatial pyramid levels. So the total response for each image has 200*12*(1+4+16) = 50,400 ≈ 50,000 dimensions.
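The dimension count can be checked with a small sketch that pools each detector response map over a 3-level spatial pyramid (1 + 4 + 16 cells). Taking the maximum response in each cell is an assumption here, as are the random response maps and their common 16×16 size; real response maps differ in size across scales.

```python
import numpy as np

def pyramid_max_pool(response_map, levels=3):
    """Pool one response map over 1x1, 2x2 and 4x4 grids (assumed max pooling)."""
    h, w = response_map.shape
    pooled = []
    for level in range(levels):
        cells = 2 ** level                       # cells per side at this level
        row_edges = np.linspace(0, h, cells + 1).astype(int)
        col_edges = np.linspace(0, w, cells + 1).astype(int)
        for r in range(cells):
            for c in range(cells):
                cell = response_map[row_edges[r]:row_edges[r + 1],
                                    col_edges[c]:col_edges[c + 1]]
                pooled.append(cell.max())
    return np.array(pooled)

n_objects, n_scales = 200, 12
rng = np.random.default_rng(0)
# Stand-in for detector outputs: one 16x16 response map per object and scale
responses = rng.random((n_objects, n_scales, 16, 16))

feature = np.concatenate([pyramid_max_pool(responses[o, s])
                          for o in range(n_objects)
                          for s in range(n_scales)])
print(feature.shape)  # (50400,)
```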

Experiment

Datasets:

• 15-Scene: 15 natural scene classes
• LabelMe: 9 classes
• MIT Indoor: 67 indoor scenes
• UIUC sports: 8 complex event classes

Experiment

15-Scene

Experiment

LabelMe: 9 categories

beach, mountain, bathroom, church, garage, office, sail, street, forest

Experiment

MIT indoor (67 categories):

Experiment

UIUC sports (8 categories):

Experiment

Experiment

The performance gain is larger on the MIT indoor and UIUC sports datasets because these two are more complex.

• Similar texture (low-level) but different objects (semantic information)

• Confirms the effectiveness of Object Bank in high-level tasks.

Experiment

Accuracy with growing object bank

Classification performance continuously increases when more objects are incorporated in the OB representation.

Experiment

Comparison with classemes in object recognition

Not fair, because classemes were proposed for speed

                OB      Classeme
Caltech256      39%     36%

Efficiency comparison of Classeme and Object bank

                                Classeme                                   Object bank    Spatial envelope
Feature extraction per image    0.4s                                       7.2s           0.1s
Feature descriptor size         12KB for continuous, 626 bytes for binary  236KB          2KB

Dataset: 50 images randomly selected from Caltech256.

What if we binarize the continuous values of Object Bank?

Spatial Envelope on MIT indoor

Dataset: A subset of MIT indoor which contains 8 categories

Average classification accuracy

Spatial envelope    By chance
17.5%               12.5%

Retrospect

Spatial envelope (2001): ignores object information.

• For coarse-grained scenes. Fast scene recognition.

Classemes (2010), Object Bank (2010): utilize object information.

• For fine-grained scenes or object categories. Increase in accuracy.

Acknowledgement

• Dr. Devi Parikh

• Questions?
