LARGE-SCALE IMAGE PARSING
Joseph Tighe and Svetlana LazebnikUniversity of North Carolina at Chapel Hill
[Title image labeled: road, building, car, sky]
Small-scale image parsing: tens of classes, hundreds of images
He et al. (2004), Hoiem et al. (2005), Shotton et al. (2006, 2008, 2009), Verbeek and Triggs (2007), Rabinovich et al. (2007), Galleguillos et al. (2008), Gould et al. (2009), etc.
Figure from Shotton et al. (2009)
Large-scale image parsing: hundreds of classes, tens of thousands of images
[Histogram of superpixel counts per class (building, floor, sea, water, sand, person, sky, skyscraper, sign, mirror, pillow, fountain, flower, shop, counter, top, paper, furniture, crane, pot, arcade, bridge, windshield, brick, clock, drawer, fan, dishwasher, vase, closet, handle, bottle, outlet, bag, tail light, light switch); counts range from near zero to over 1,200,000]
Non-uniform class frequencies
Large-scale image parsing: hundreds of classes, tens of thousands of images
Evolving training set
http://labelme.csail.mit.edu/
Non-uniform class frequencies
Challenges
What's considered important for small-scale image parsing?
Combination of local cues
Multiple segmentations, multiple scales
Context
How much of this is feasible for large-scale, dynamic datasets?
Our first attempt: A nonparametric approach
Lazy learning: do (almost) nothing up front
To parse (label) an image we will:
Find a set of similar images
Transfer labels from the similar images by matching pieces of the image (superpixels)
Finding Similar Images
Ocean
Open Field
Highway
Street
Forest
Mountain
Inner City
Tall Building
What is depicted in this image?
Which image is most similar?
Then assign the label from the most similar image
Pixels are a bad measure of similarity
Most similar according to pixel distance
Most similar according to "Bag of Words"
Origin of the Bag of Words model
Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)
US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/
What are words for an image?
Wing, Tail, Wheel, Building, Propeller, Jet Engine
But where do the words come from?
Then where does the dictionary come from?
Example Dictionary
Source: B. Leibe
Another dictionary
Source: B. Leibe
Fei-Fei et al. 2005
Outline of the Bag of Words method
Divide the image into patches
Assign a "word" for each patch
Count the number of occurrences of each "word" in the image
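As a concrete toy illustration of these three steps, here is a minimal numpy sketch; the patch descriptors, dictionary size, and k-means settings are illustrative assumptions, not the actual features used in the talk:

```python
import numpy as np

def build_dictionary(patches, k=8, iters=10, seed=0):
    """Form a visual 'dictionary' by k-means clustering of patch descriptors."""
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every patch to its nearest center (its visual "word").
        dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        words = dists.argmin(axis=1)
        # Move each center to the mean of its assigned patches.
        for j in range(k):
            members = patches[words == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(patches, centers):
    """Count word occurrences and normalize: the orderless image descriptor."""
    dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

Two images can then be compared by the distance between their normalized histograms rather than between raw pixels.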
Does this work for our problem?
65,536 Pixels 256 Dimensions
Which look the most similar?
[Candidate images with region labels: several street scenes labeled building, road, car, sky; other scenes labeled sky, tree; building, sand, mountain; car, road]
Step 1: Scene-level matching
Gist (Oliva & Torralba, 2001)
Spatial Pyramid (Lazebnik et al., 2006)
Color Histogram
Retrieval set:
Source of possible labels
Source of region-level matches
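Step 1 amounts to nearest-neighbor search on global scene descriptors. A minimal sketch, assuming each image has already been summarized by a feature vector (e.g. a concatenation of gist, spatial pyramid, and color histogram; the Euclidean metric and set size are illustrative assumptions):

```python
import numpy as np

def retrieval_set(query_desc, train_descs, n=5):
    """Return the indices of the n training images whose global scene
    descriptors are closest (Euclidean) to the query image's descriptor.
    These images supply both candidate labels and region-level matches."""
    dists = np.linalg.norm(train_descs - query_desc, axis=1)
    return np.argsort(dists)[:n]
```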
Step 2: Region-level matching
Superpixels (Felzenszwalb & Huttenlocher, 2004)
Superpixel features:
Pixel area (size): e.g. matches labeled snow, road, tree, building, sky
Absolute mask (location): e.g. road, sidewalk
Texture: e.g. road, sky, snow, sidewalk
Color histogram: e.g. building, sidewalk, road
Region-level likelihoods
Nonparametric estimate of class-conditional densities for each class c and feature type k:

\hat{P}(f^k(r_i) \mid c) = \frac{\#(N(f^k(r_i)), c)}{\#(D, c)}

where f^k(r_i) is the kth feature type of the ith region, the numerator counts features of class c within some radius of f^k(r_i), and the denominator counts the total features of class c in the dataset D.

Per-feature likelihoods are combined via naive Bayes:

\hat{P}(r_i \mid c) = \prod_k \hat{P}(f^k(r_i) \mid c)
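A sketch of this estimate, assuming each training feature of type k is stored alongside its class label; the radius and the way features are pooled are illustrative, not the talk's exact settings:

```python
import numpy as np

def class_conditional(query_feat, train_feats, train_labels, c, radius=1.0):
    """Nonparametric estimate of P(f^k(r_i) | c): the count of class-c
    training features within `radius` of the query feature, over the
    total count of class-c features."""
    mask = train_labels == c
    total_c = mask.sum()
    if total_c == 0:
        return 0.0
    near = np.linalg.norm(train_feats[mask] - query_feat, axis=1) <= radius
    return near.sum() / total_c

def region_likelihood(query_feats_by_type, pools, c, radius=1.0):
    """Naive-Bayes combination over feature types:
    P(r_i | c) = product over k of P(f^k(r_i) | c)."""
    p = 1.0
    for q, (feats, labels) in zip(query_feats_by_type, pools):
        p *= class_conditional(q, feats, labels, c, radius)
    return p
```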
Region-level likelihoods
[Likelihood maps for: Building, Car, Crosswalk, Sky, Window, Road]
Step 3: Global image labeling
How do we resolve issues like this?
[Original image vs. maximum likelihood labeling; labels: sky, tree, sand, road, sea]
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:

E(\mathbf{c}) = \sum_{i \in \mathrm{regions}} -\log L(r_i, c_i) + \lambda \sum_{(i,j) \in \mathrm{neighbors}} [c_i \neq c_j]\, \phi(c_i, c_j)

where \mathbf{c} is the vector of region labels, L(r_i, c_i) is the likelihood score for region r_i and label c_i, [c_i \neq c_j] is the smoothing penalty over neighboring regions, and \phi(c_i, c_j) is the co-occurrence penalty.
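A toy version of this optimization, with iterated conditional modes standing in for the stronger solver (e.g. graph-cut expansion moves) one would use in practice; the unary costs are the -log likelihoods, and `cooc` is an assumed label-pair penalty matrix:

```python
import numpy as np

def mrf_energy(labels, unary, edges, cooc, lam=1.0):
    """E(c): per-region data costs plus a pairwise penalty that fires
    only when neighboring regions take different labels."""
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    for i, j in edges:
        if labels[i] != labels[j]:
            e += lam * cooc[labels[i], labels[j]]
    return e

def icm_labeling(unary, edges, cooc, lam=1.0, sweeps=5):
    """Greedy minimization: start from the maximum-likelihood labeling,
    then repeatedly give each region its locally cheapest label."""
    n, num_labels = unary.shape
    labels = unary.argmin(axis=1).tolist()
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(sweeps):
        for i in range(n):
            costs = unary[i].copy()
            for j in nbrs[i]:
                for c in range(num_labels):
                    if c != labels[j]:
                        costs[c] += lam * cooc[c, labels[j]]
            labels[i] = int(costs.argmin())
    return labels
```

This is how the smoothing term repairs isolated mislabeled regions: disagreeing with all one's neighbors costs more than a slightly worse data term.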
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:
Maximum likelihood labeling, edge penalties, final labeling, final edge penalties
[Labels: road, building, car, window, sky]
Step 3: Global image labeling
Compute a global image labeling by optimizing a Markov random field (MRF) energy function:
[Original image, maximum likelihood labeling, edge penalties, MRF labeling; labels: sky, tree, sand, road, sea]
Joint geometric/semantic labeling
Semantic labels: road, grass, building, car, etc. Geometric labels: sky, vertical, horizontal
Gould et al. (ICCV 2009)
[Original image, semantic labeling (sky, tree, car, road), geometric labeling (sky, horizontal, vertical)]
Joint geometric/semantic labeling
Objective function for joint labeling:

F(\mathbf{c}, \mathbf{g}) = E(\mathbf{c}) + E(\mathbf{g}) + \sum_{r_i \in \mathrm{regions}} \mu(c_i, g_i)

where \mathbf{c} are the semantic labels, \mathbf{g} the geometric labels, E(\mathbf{c}) and E(\mathbf{g}) are the costs of the semantic and geometric labelings, and \mu(c_i, g_i) is the geometric/semantic consistency penalty.
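A sketch of the joint objective, with a hypothetical compatibility table standing in for the learned consistency penalty (both the table and the fixed penalty value are illustrative assumptions):

```python
# Hypothetical semantic -> geometric compatibility (illustrative only).
COMPATIBLE = {
    "road": "horizontal",
    "grass": "horizontal",
    "building": "vertical",
    "car": "vertical",
    "tree": "vertical",
    "sky": "sky",
}

def mu(c, g, penalty=1.0):
    """Consistency penalty: zero when the semantic label c agrees with
    the geometric label g, a fixed cost otherwise."""
    return 0.0 if COMPATIBLE.get(c) == g else penalty

def joint_cost(sem_labels, geo_labels, e_sem, e_geo):
    """F(c, g) = E(c) + E(g) + sum over regions of mu(c_i, g_i)."""
    return e_sem + e_geo + sum(mu(c, g) for c, g in zip(sem_labels, geo_labels))
```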
Example of joint labeling
Understanding scenes on many levels
To appear at ICCV 2011
Datasets
Dataset                           Training images  Test images  Labels
SIFT Flow (Liu et al., 2009)          2,488            200          33
Barcelona (Russell et al., 2007)     14,871            279         170
LabelMe+SUN                          50,424            300         232
Datasets
[Log-scale histograms of superpixel counts per class (roughly 100 to 1,000,000). Indoor classes include wall, books, plate, chair, bed, keyboard, utensil, screen, desk, cupboard, napkin, placemat, cup, picture, counter, lamp, toilet; outdoor classes include building, tree, road, car, window, river, rock, sand, desert, person, fence, awning, crosswalk, boat, pole, cow, moon.]
Overall performance (numbers in parentheses: average per-class rates)

              SIFT Flow           Barcelona           LabelMe+SUN
              Semantic    Geom.   Semantic    Geom.   Semantic    Geom.
Base          73.2 (29.1) 89.8    62.5 (8.0)  89.9    46.8 (10.7) 81.5
MRF           76.3 (28.8) 89.9    66.6 (7.6)  90.2    50.0 (9.1)  81.0
MRF + Joint   76.9 (29.4) 90.8    66.9 (7.6)  90.7    50.2 (10.5) 82.2

              LabelMe+SUN Indoor  LabelMe+SUN Outdoor
              Semantic    Geom.   Semantic    Geom.
Base          22.4 (9.5)  76.1    53.8 (11.0) 83.1
MRF           27.5 (6.5)  76.4    56.4 (8.6)  82.3
MRF + Joint   27.8 (9.0)  78.2    56.6 (10.8) 84.1

*SIFT Flow: 74.75
Per-class classification rates
[Bar charts of per-class rates, 0% to 100%, for SIFT Flow, Barcelona, and LM+SUN]
Results on SIFT Flow dataset
[Example parses with per-image accuracies: 55.3, 92.2, 93.6]
Results on LM+SUN dataset
[Each example shows: image, ground truth, initial semantic, final semantic, final geometric]
[Per-image accuracies: 58.9, 93.0, 57.3; 11.6, 0.0, 60.3, 93.0; 65.6, 75.8, 87.7]
Running times
[Timing comparison on the SIFT Flow and Barcelona datasets]
Conclusions
Lessons learned:
We can go pretty far with very little learning.
Good local features and global (scene) context are more important than neighborhood context.
What's missing:
A rich representation for scene understanding
The long tail
Scalable, dynamic learning
[Closing image labeled: road, building, car, sky]