LARGE-SCALE IMAGE PARSING
Joseph Tighe and Svetlana LazebnikUniversity of North Carolina at Chapel Hill
[Title image labeled: road, building, car, sky]
Small-scale image parsing: tens of classes, hundreds of images
He et al. (2004), Hoiem et al. (2005), Shotton et al. (2006, 2008, 2009), Verbeek and Triggs (2007), Rabinovich et al. (2007), Galleguillos et al. (2008), Gould et al. (2009), etc.
Figure from Shotton et al. (2009)
Large-scale image parsing: hundreds of classes, tens of thousands of images
[Histogram of superpixel counts per class (building, floor, sea, water, sand, person, sky, skyscraper, sign, mirror, pillow, fountain, flower, shop, counter, top, paper, furniture, crane, pot, arcade, bridge, windshield, brick, clock, drawer, fan, dishwasher, vase, closet, handle, bottle, outlet, bag, tail light, light switch); counts range from near zero to over 1,200,000]
Non-uniform class frequencies
Large-scale image parsing: hundreds of classes, tens of thousands of images
Evolving training set
http://labelme.csail.mit.edu/
Non-uniform class frequencies
Challenges
What's considered important for small-scale image parsing?
Combination of local cues
Multiple segmentations, multiple scales
Context
How much of this is feasible for large-scale, dynamic datasets?
Our first attempt: A nonparametric approach
Lazy learning: do (almost) nothing up front
To parse (label) an image we will:
Find a set of similar images
Transfer labels from the similar images by matching pieces of the image (superpixels)
Finding Similar Images
Ocean
Open Field
Highway
Street
Forest
Mountain
Inner City
Tall Building
What is depicted in this image?
Which image is most similar?
Then assign the label from the most similar image
Pixels are a bad measure of similarity
Most similar according to pixel distance
Most similar according to "Bag of Words"
Origin of the Bag of Words model
Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)
US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/
What are words for an image?
Wing, Tail, Wheel, Building, Propeller, Jet Engine
But where do the words come from?
Then where does the dictionary come from?
Example Dictionary
Source: B. Leibe
Another dictionary
Source: B. Leibe
Fei-Fei et al. 2005
Outline of the Bag of Words method
Divide the image into patches
Assign a "word" for each patch
Count the number of occurrences of each "word" in the image
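As a concrete toy illustration of these three steps, here is a minimal numpy sketch; the patch descriptors, dictionary size, and k-means settings are illustrative assumptions, not the actual features used in the talk:

```python
import numpy as np

def build_dictionary(patches, k=8, iters=10, seed=0):
    """Form a visual 'dictionary' by k-means clustering of patch descriptors."""
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every patch to its nearest center (its visual "word").
        dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        words = dists.argmin(axis=1)
        # Move each center to the mean of its assigned patches.
        for j in range(k):
            members = patches[words == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(patches, centers):
    """Count word occurrences and normalize: the orderless image descriptor."""
    dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

Two images can then be compared by the distance between their normalized histograms rather than between raw pixels.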
Does this work for our problem?
65,536 Pixels 256 Dimensions
Which look the most similar?
[Candidate images with region labels: several street scenes labeled building, road, car, sky; other scenes labeled sky, tree; building, sand, mountain; car, road]
Step 1: Scene-level matching
Gist (Oliva & Torralba, 2001)
Spatial Pyramid (Lazebnik et al., 2006)
Color Histogram
Retrieval set:
Source of possible labels
Source of region-level matches
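Step 1 amounts to nearest-neighbor search on global scene descriptors. A minimal sketch, assuming each image has already been summarized by a feature vector (e.g. a concatenation of gist, spatial pyramid, and color histogram; the Euclidean metric and set size are illustrative assumptions):

```python
import numpy as np

def retrieval_set(query_desc, train_descs, n=5):
    """Return the indices of the n training images whose global scene
    descriptors are closest (Euclidean) to the query image's descriptor.
    These images supply both candidate labels and region-level matches."""
    dists = np.linalg.norm(train_descs - query_desc, axis=1)
    return np.argsort(dists)[:n]
```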
Step 2: Region-level matching
Superpixels (Felzenszwalb & Huttenlocher, 2004)
Superpixel features:
Pixel area (size): e.g. matches labeled snow, road, tree, building, sky
Absolute mask (location): e.g. road, sidewalk
Texture: e.g. road, sky, snow, sidewalk
Color histogram: e.g. building, sidewalk, road
Region-level likelihoods
Nonparametric estimate of class-conditional densities for each class c and feature type k:

\hat{P}(f^k(r_i) \mid c) = \frac{\#(N(f^k(r_i)), c)}{\#(D, c)}

where f^k(r_i) is the kth feature type of the ith region, the numerator counts features of class c within some radius of f^k(r_i), and the denominator counts the total features of class c in the dataset D.

Per-feature likelihoods are combined via naive Bayes:

\hat{P}(r_i \mid c) = \prod_k \hat{P}(f^k(r_i) \mid c)
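A sketch of this estimate, assuming each training feature of type k is stored alongside its class label; the radius and the way features are pooled are illustrative, not the talk's exact settings:

```python
import numpy as np

def class_conditional(query_feat, train_feats, train_labels, c, radius=1.0):
    """Nonparametric estimate of P(f^k(r_i) | c): the count of class-c
    training features within `radius` of the query feature, over the
    total count of class-c features."""
    mask = train_labels == c
    total_c = mask.sum()
    if total_c == 0:
        return 0.0
    near = np.linalg.norm(train_feats[mask] - query_feat, axis=1) <= radius
    return near.sum() / total_c

def region_likelihood(query_feats_by_type, pools, c, radius=1.0):
    """Naive-Bayes combination over feature types:
    P(r_i | c) = product over k of P(f^k(r_i) | c)."""
    p = 1.0
    for q, (feats, labels) in zip(query_feats_by_type, pools):
        p *= class_conditional(q, feats, labels, c, radius)
    return p
```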
Region-level likelihoods
[Likelihood maps for: Building, Car, Crosswalk, Sky, Window, Road]
Step 3: Global image labeling
How do we resolve issues like this?
[Original image vs. maximum likelihood labeling; labels: sky, tree, sand, road, sea]
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:

E(\mathbf{c}) = \sum_{i \in \mathrm{regions}} -\log L(r_i, c_i) + \lambda \sum_{(i,j) \in \mathrm{neighbors}} [c_i \neq c_j]\, \phi(c_i, c_j)

where \mathbf{c} is the vector of region labels, L(r_i, c_i) is the likelihood score for region r_i and label c_i, [c_i \neq c_j] is the smoothing penalty over neighboring regions, and \phi(c_i, c_j) is the co-occurrence penalty.
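A toy version of this optimization, with iterated conditional modes standing in for the stronger solver (e.g. graph-cut expansion moves) one would use in practice; the unary costs are the -log likelihoods, and `cooc` is an assumed label-pair penalty matrix:

```python
import numpy as np

def mrf_energy(labels, unary, edges, cooc, lam=1.0):
    """E(c): per-region data costs plus a pairwise penalty that fires
    only when neighboring regions take different labels."""
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    for i, j in edges:
        if labels[i] != labels[j]:
            e += lam * cooc[labels[i], labels[j]]
    return e

def icm_labeling(unary, edges, cooc, lam=1.0, sweeps=5):
    """Greedy minimization: start from the maximum-likelihood labeling,
    then repeatedly give each region its locally cheapest label."""
    n, num_labels = unary.shape
    labels = unary.argmin(axis=1).tolist()
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(sweeps):
        for i in range(n):
            costs = unary[i].copy()
            for j in nbrs[i]:
                for c in range(num_labels):
                    if c != labels[j]:
                        costs[c] += lam * cooc[c, labels[j]]
            labels[i] = int(costs.argmin())
    return labels
```

This is how the smoothing term repairs isolated mislabeled regions: disagreeing with all one's neighbors costs more than a slightly worse data term.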
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:
Maximum likelihood labeling, edge penalties, final labeling, final edge penalties
[Labels: road, building, car, window, sky]
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:
[Original image, maximum likelihood labeling, edge penalties, MRF labeling; labels: sky, tree, sand, road, sea]
Joint geometric/semantic labeling
Semantic labels: road, grass, building, car, etc. Geometric labels: sky, vertical, horizontal
Gould et al. (ICCV 2009)
[Original image, semantic labeling (sky, tree, car, road), geometric labeling (sky, horizontal, vertical)]
Joint geometric/semantic labeling
Objective function for joint labeling:

F(\mathbf{c}, \mathbf{g}) = E(\mathbf{c}) + E(\mathbf{g}) + \sum_{r_i \in \mathrm{regions}} \mu(c_i, g_i)

where \mathbf{c} are the semantic labels, \mathbf{g} the geometric labels, E(\mathbf{c}) and E(\mathbf{g}) are the costs of the semantic and geometric labelings, and \mu(c_i, g_i) is the geometric/semantic consistency penalty.
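A sketch of the joint objective, with a hypothetical compatibility table standing in for the learned consistency penalty (both the table and the fixed penalty value are illustrative assumptions):

```python
# Hypothetical semantic -> geometric compatibility (illustrative only).
COMPATIBLE = {
    "road": "horizontal",
    "grass": "horizontal",
    "building": "vertical",
    "car": "vertical",
    "tree": "vertical",
    "sky": "sky",
}

def mu(c, g, penalty=1.0):
    """Consistency penalty: zero when the semantic label c agrees with
    the geometric label g, a fixed cost otherwise."""
    return 0.0 if COMPATIBLE.get(c) == g else penalty

def joint_cost(sem_labels, geo_labels, e_sem, e_geo):
    """F(c, g) = E(c) + E(g) + sum over regions of mu(c_i, g_i)."""
    return e_sem + e_geo + sum(mu(c, g) for c, g in zip(sem_labels, geo_labels))
```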
Example of joint labeling
Understanding scenes on many levels
To appear at ICCV 2011
Datasets
Dataset                           Training images  Test images  Labels
SIFT Flow (Liu et al., 2009)          2,488            200          33
Barcelona (Russell et al., 2007)     14,871            279         170
LabelMe+SUN                          50,424            300         232
Datasets
[Log-scale histograms of superpixel counts per class (roughly 100 to 1,000,000). Indoor classes include wall, books, plate, chair, bed, keyboard, utensil, screen, desk, cupboard, napkin, placemat, cup, picture, counter, lamp, toilet; outdoor classes include building, tree, road, car, window, river, rock, sand, desert, person, fence, awning, crosswalk, boat, pole, cow, moon.]
Overall performance (numbers in parentheses: average per-class rates)

              SIFT Flow           Barcelona           LabelMe+SUN
              Semantic    Geom.   Semantic    Geom.   Semantic    Geom.
Base          73.2 (29.1) 89.8    62.5 (8.0)  89.9    46.8 (10.7) 81.5
MRF           76.3 (28.8) 89.9    66.6 (7.6)  90.2    50.0 (9.1)  81.0
MRF + Joint   76.9 (29.4) 90.8    66.9 (7.6)  90.7    50.2 (10.5) 82.2

              LabelMe+SUN Indoor  LabelMe+SUN Outdoor
              Semantic    Geom.   Semantic    Geom.
Base          22.4 (9.5)  76.1    53.8 (11.0) 83.1
MRF           27.5 (6.5)  76.4    56.4 (8.6)  82.3
MRF + Joint   27.8 (9.0)  78.2    56.6 (10.8) 84.1

*SIFT Flow: 74.75
Per-class classification rates
[Bar charts of per-class rates, 0% to 100%, for SIFT Flow, Barcelona, and LM+SUN]
Results on SIFT Flow dataset
[Example parses with per-image accuracies: 55.3, 92.2, 93.6]
Results on LM+SUN dataset
[Each example shows: image, ground truth, initial semantic, final semantic, final geometric]
[Per-image accuracies: 58.9, 93.0, 57.3; 11.6, 0.0, 60.3, 93.0; 65.6, 75.8, 87.7]
Running times
[Timing comparison on the SIFT Flow and Barcelona datasets]
Conclusions
Lessons learned:
We can go pretty far with very little learning.
Good local features and global (scene) context are more important than neighborhood context.
What's missing:
A rich representation for scene understanding
The long tail
Scalable, dynamic learning
[Closing image labeled: road, building, car, sky]