large-scale image parsing

Post on 24-Feb-2016


LARGE-SCALE IMAGE PARSING

Joseph Tighe and Svetlana Lazebnik
University of North Carolina at Chapel Hill

(Title-slide figure: an image parsed into road, building, car, and sky regions.)

Small-scale image parsing: tens of classes, hundreds of images

He et al. (2004), Hoiem et al. (2005), Shotton et al. (2006, 2008, 2009), Verbeek and Triggs (2007), Rabinovich et al. (2007), Galleguillos et al. (2008), Gould et al. (2009), etc.

Figure from Shotton et al. (2009)

Large-scale image parsing: hundreds of classes, tens of thousands of images

(Figure: histogram of class frequencies over classes such as building, floor, sea, water, sand, person, sky, skyscraper, sign, mirror, pillow, fountain, flower, shop, counter, paper, furniture, crane, pot, arcade, bridge, windshield, brick, clock, drawer, fan, dishwasher, vase, closet, handle, bottle, outlet, bag, tail light, and light switch; counts range from a handful up to over a million.)

Non-uniform class frequencies

Large-scale image parsing: hundreds of classes, tens of thousands of images

Evolving training set

http://labelme.csail.mit.edu/

Non-uniform class frequencies

Challenges

What’s considered important for small-scale image parsing?
Combination of local cues
Multiple segmentations, multiple scales
Context

How much of this is feasible for large-scale, dynamic datasets?

Our first attempt: A nonparametric approach

Lazy learning: do (almost) nothing up front

To parse (label) an image we will:
Find a set of similar images
Transfer labels from the similar images by matching pieces of the image (superpixels)

Finding Similar Images

Ocean

Open Field

Highway

Street

Forest

Mountain

Inner City

Tall Building

What is depicted in this image?

Which image is most similar?

Then assign the label from the most similar image

Pixels are a bad measure of similarity

Most similar according to pixel distance
Most similar according to “Bag of Words”

Origin of the Bag of Words model

Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)

US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/

What are words for an image?

(Figure: airplane images annotated with visual “words”: wing, tail, wheel, building, propeller, jet engine.)

But where do the words come from?

Then where does the dictionary come from?

Example Dictionary

Source: B. Leibe

Another dictionary


Source: B. Leibe

Fei-Fei et al. 2005

Outline of the Bag of Words method

Divide the image into patches
Assign a “word” for each patch
Count the number of occurrences of each “word” in the image
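The three steps above can be sketched in a few lines (a minimal toy example: the two-word dictionary and the patch descriptors below are made up, and in practice the dictionary comes from clustering training patches):

```python
# Minimal bag-of-words sketch: quantize each patch descriptor to its
# nearest dictionary "word", then count word occurrences per image.

def nearest_word(patch, dictionary):
    """Index of the dictionary word closest (squared distance) to the patch."""
    return min(range(len(dictionary)),
               key=lambda w: sum((p - d) ** 2 for p, d in zip(patch, dictionary[w])))

def bag_of_words(patches, dictionary):
    """Histogram of word counts over all patches in the image."""
    hist = [0] * len(dictionary)
    for patch in patches:
        hist[nearest_word(patch, dictionary)] += 1
    return hist

dictionary = [[0.0, 0.0], [1.0, 1.0]]           # two toy "words"
patches = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]  # three image patches
print(bag_of_words(patches, dictionary))  # → [1, 2]
```

Two images can then be compared by a distance between their histograms instead of raw pixels.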

Does this work for our problem?

65,536 Pixels 256 Dimensions

Which look the most similar?

(Figure: several street-scene parses sharing the labels building, road, car, and sky, alongside scenes labeled sky, tree, sand, mountain, and sea.)

Step 1: Scene-level matching

Global image features: Gist (Oliva & Torralba, 2001), Spatial Pyramid (Lazebnik et al., 2006), Color Histogram

Retrieval set:
Source of possible labels
Source of region-level matches
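Building the retrieval set can be sketched as a nearest-neighbor search over global descriptors (a minimal sketch with made-up 1-D features; the real system combines gist, spatial pyramid, and color histogram distances):

```python
# Hypothetical sketch of Step 1: rank training images by distance in a
# global feature space and keep the top-K as the retrieval set.

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieval_set(query_feat, train_feats, k=3):
    """Indices of the k training images closest to the query."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: euclidean(query_feat, train_feats[i]))
    return ranked[:k]

# Toy example: 1-D "global descriptors" for four training images
train = [[0.0], [0.9], [5.0], [5.1]]
print(retrieval_set([1.0], train, k=2))  # → [1, 0]
```

Only the labels present in this retrieval set need to be considered when parsing the query, which is what keeps the approach scalable.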

Step 2: Region-level matching

Segment the query image into superpixels (Felzenszwalb & Huttenlocher, 2004), then describe each superpixel with several feature types:
Pixel area (size)
Absolute mask (location)
Texture
Color histogram

Region-level likelihoods

Nonparametric estimate of class-conditional densities for each class c and feature type k:

\hat{P}(f^k(r_i) \mid c) = \frac{\#(N(f^k(r_i)), c)}{\#(D, c)}

where f^k(r_i) is the kth feature type of the ith region, the numerator \#(N(f^k(r_i)), c) counts the features of class c within some radius of f^k(r_i), and the denominator \#(D, c) counts the total features of class c in the dataset D.

Per-feature likelihoods combined via Naïve Bayes (product over feature types k):

\hat{P}(r_i \mid c) = \prod_k \hat{P}(f^k(r_i) \mid c)
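A toy sketch of this estimate (assumptions: scalar features, a fixed radius of 1.0, and a small floor `eps` to avoid zero probabilities; the real system matches high-dimensional features of several types against the retrieval set):

```python
# Nonparametric class-conditional likelihood, per feature type:
#   P(f^k(r_i) | c) ≈ #(features of class c within radius) / #(class c total)

def likelihood(feat, class_feats, radius=1.0, eps=1e-6):
    """Fraction of class features falling within `radius` of `feat`."""
    near = sum(1 for f in class_feats if abs(f - feat) <= radius)
    return max(near / len(class_feats), eps)

def region_likelihood(feats_by_type, dataset, c):
    """Naive Bayes: multiply per-feature-type likelihoods for class c."""
    p = 1.0
    for k, feat in feats_by_type.items():
        p *= likelihood(feat, dataset[c][k])
    return p

# Toy dataset: per class, per feature type, the stored training features
dataset = {
    "sky":  {"color": [9.0, 9.5, 8.8], "texture": [1.0, 1.2]},
    "road": {"color": [2.0, 2.2, 1.9], "texture": [5.0, 5.5]},
}
query = {"color": 9.1, "texture": 1.1}
print(region_likelihood(query, dataset, "sky") >
      region_likelihood(query, dataset, "road"))  # → True
```

Because the estimate is just counting neighbors, there is no training phase: adding images to the dataset immediately changes the densities, which suits an evolving training set.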

Region-level likelihoods

(Figure: region-level likelihood maps for building, car, crosswalk, sky, window, and road.)

Step 3: Global image labeling

How do we resolve issues like this?

(Figure: a beach scene whose maximum likelihood labeling contains spurious road labels scattered among the sky, tree, sand, and sea regions. Panels: original image, maximum likelihood labeling.)

Step 3: Global image labeling

Compute a global image labeling by optimizing a Markov random field (MRF) energy function:

E(\mathbf{c}) = \sum_{i} -\log L(r_i, c_i) + \lambda \sum_{(i,j)} \varphi(c_i, c_j)\,[c_i \neq c_j]

where \mathbf{c} is the vector of region labels, the first sum ranges over regions, the second over pairs of neighboring regions, L(r_i, c_i) is the likelihood score for region r_i and label c_i, [c_i \neq c_j] is the smoothing penalty, and \varphi(c_i, c_j) is the co-occurrence penalty.
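A brute-force sketch of this energy on a toy two-region graph (the likelihood values and the co-occurrence penalty table are made up for illustration; exhaustive enumeration is only feasible at toy scale, and in practice the energy is minimized with graph-cut-based methods):

```python
import itertools
import math

# E(c) = sum_i -log L(r_i, c_i) + lam * sum_{(i,j)} phi(c_i, c_j)[c_i != c_j]
def energy(labels, lik, edges, phi, lam=1.0):
    data = sum(-math.log(lik[i][labels[i]]) for i in range(len(labels)))
    smooth = sum(phi.get(frozenset((labels[i], labels[j])), 1.0)
                 for i, j in edges if labels[i] != labels[j])
    return data + lam * smooth

# Two regions joined by one edge; region 0 strongly prefers "sky",
# region 1 is ambiguous, so the smoothing term breaks the tie toward "sky".
lik = [{"sky": 0.9, "road": 0.1}, {"sky": 0.5, "road": 0.5}]
edges = [(0, 1)]
phi = {frozenset(("sky", "road")): 2.0}  # sky/road rarely adjacent

best = min(itertools.product(["sky", "road"], repeat=2),
           key=lambda c: energy(list(c), lik, edges, phi))
print(best)  # → ('sky', 'sky')
```

The data term pulls each region toward its own best label; the pairwise term charges for label discontinuities, more so for label pairs that rarely co-occur.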

Step 3: Global image labeling

Compute a global image labeling by optimizing the same MRF energy function:

(Figure: maximum likelihood labeling, edge penalties, final labeling, and final edge penalties for a street scene labeled road, building, car, window, and sky.)

Step 3: Global image labeling

Compute a global image labeling by optimizing the same MRF energy function:

(Figure: the beach scene; the MRF labeling removes the spurious road labels present in the maximum likelihood labeling. Panels: original image, maximum likelihood labeling, edge penalties, MRF labeling.)

Joint geometric/semantic labeling

Semantic labels: road, grass, building, car, etc.
Geometric labels: sky, vertical, horizontal

Gould et al. (ICCV 2009)

(Figure: original image, semantic labeling (sky, tree, car, road), and geometric labeling (sky, horizontal, vertical).)

Joint geometric/semantic labeling

Objective function for joint labeling:

F(\mathbf{c}, \mathbf{g}) = E(\mathbf{c}) + E(\mathbf{g}) + \lambda \sum_{r_i \in \text{regions}} \varphi(c_i, g_i)

where \mathbf{c} are the semantic labels, \mathbf{g} the geometric labels, E(\mathbf{c}) and E(\mathbf{g}) the costs of the semantic and geometric labelings, and \varphi(c_i, g_i) the geometric/semantic consistency penalty.
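The joint cost can be sketched as follows (the per-labeling costs are stand-in lambdas for the two MRF energies, and the table of inconsistent label pairs is hypothetical, chosen only to illustrate the consistency term):

```python
# F(c, g) = E(c) + E(g) + lam * sum_i phi(c_i, g_i)

# Hypothetical semantically/geometrically inconsistent pairs, e.g. a
# "road" region labeled geometrically "vertical".
INCONSISTENT = {("road", "vertical"), ("building", "horizontal"),
                ("tree", "horizontal")}

def consistency_penalty(c, g):
    """Count regions whose semantic and geometric labels clash."""
    return sum(1.0 for ci, gi in zip(c, g) if (ci, gi) in INCONSISTENT)

def joint_cost(c, g, E_sem, E_geom, lam=1.0):
    """F(c, g): semantic cost + geometric cost + consistency penalty."""
    return E_sem(c) + E_geom(g) + lam * consistency_penalty(c, g)

# Toy stand-ins for the two MRF energies
E_sem = lambda c: 0.5
E_geom = lambda g: 0.5
consistent = joint_cost(["road", "sky"], ["horizontal", "sky"], E_sem, E_geom)
clashing = joint_cost(["road", "sky"], ["vertical", "sky"], E_sem, E_geom)
print(clashing > consistent)  # → True
```

The extra term couples the two labelings, so an otherwise cheap semantic labeling is penalized if it contradicts the scene geometry.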


Example of joint labeling

Understanding scenes on many levels

To appear at ICCV 2011

Datasets

Dataset                           Training images  Test images  Labels
SIFT Flow (Liu et al., 2009)                2,488          200      33
Barcelona (Russell et al., 2007)           14,871          279     170
LabelMe+SUN                                50,424          300     232

Datasets

(Figure: number of superpixels per class on a log scale, for indoor classes such as wall, books, plate, chair, bed, keyboard, utensil, screen, desk, cupboard, napkin, placemat, cup, picture, counter, lamp, and toilet, and outdoor classes such as building, tree, road, car, window, river, rock, sand, desert, person, fence, awning, crosswalk, boat, pole, cow, and moon; frequencies span several orders of magnitude.)

Overall performance

Per-pixel classification rates, with average per-class rates in parentheses:

             SIFT Flow            Barcelona            LabelMe+SUN
             Semantic     Geom.   Semantic     Geom.   Semantic     Geom.
Base         73.2 (29.1)  89.8    62.5 (8.0)   89.9    46.8 (10.7)  81.5
MRF          76.3 (28.8)  89.9    66.6 (7.6)   90.2    50.0 (9.1)   81.0
MRF + Joint  76.9 (29.4)  90.8    66.9 (7.6)   90.7    50.2 (10.5)  82.2

             LabelMe+SUN Indoor   LabelMe+SUN Outdoor
             Semantic     Geom.   Semantic     Geom.
Base         22.4 (9.5)   76.1    53.8 (11.0)  83.1
MRF          27.5 (6.5)   76.4    56.4 (8.6)   82.3
MRF + Joint  27.8 (9.0)   78.2    56.6 (10.8)  84.1

*SIFT Flow: 74.75

Per-class classification rates

(Figure: per-class classification rates from 0% to 100% on the SIFT Flow, Barcelona, and LM+SUN datasets.)

Results on SIFT Flow dataset

(Figure: example parses with per-image accuracies 55.3, 92.2, and 93.6.)

Results on LM+SUN dataset

(Figures: four examples, each showing the image, ground truth, initial semantic, final semantic, and final geometric labelings, with per-image accuracies including 57.3, 58.9, 93.0; 11.6, 0.0, 60.3, 93.0; and 65.6, 75.8, 87.7.)

Running times

(Figure: running times on the SIFT Flow and Barcelona datasets.)

Conclusions

Lessons learned:
Can go pretty far with very little learning
Good local features and global (scene) context matter more than neighborhood context

What’s missing:
A rich representation for scene understanding
The long tail
Scalable, dynamic learning
