
Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers – Supplementary Materials

Phillip Isola, MIT, [email protected]

Ce Liu, Microsoft Research, [email protected]

1. Maximum entropy prior

Our scenegraph grammar (Section 2.3, main text) serves as a binary prior: scenes are either valid or not. In order to compare against a method that can capture more subtle gradations in the prior probability, we also implemented a maximum entropy (maxent) prior over object co-occurrence statistics [3]. With this approach, we express the prior probability of a scene $X$ as:

$$P(X) = \frac{1}{Z(\Lambda)} \exp\Big(-\sum_i \lambda_i f_i(X)\Big), \qquad (1)$$

where $Z$ normalizes and the distribution is parameterized by $\lambda_i \in \Lambda$. Many methods exist for learning $\Lambda$ from training data; we employ gradient descent [2].
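For concreteness, the sketch below shows one such gradient step. With the minus sign inside the exponential of Equation (1), the derivative of the average log-likelihood with respect to $\lambda_i$ is the model expectation of $f_i$ minus its empirical expectation; estimating the model expectation (e.g., by sampling scenes from the current model) is left outside the sketch. All names are illustrative; this is a sketch of the standard maxent update, not the implementation used in the paper.

```python
import numpy as np

def maxent_gradient_step(lmbda, data_feature_mean, model_feature_mean, lr=0.1):
    """One gradient-ascent step on the average log-likelihood under Eq. (1).

    With the minus sign inside the exponential of Eq. (1), the derivative of
    the average log-likelihood with respect to lambda_i is
    E_model[f_i(X)] - E_data[f_i(X)]. `model_feature_mean` is assumed to be
    estimated elsewhere, e.g., from samples drawn from the current model.
    """
    lmbda = np.asarray(lmbda, dtype=float)
    grad = np.asarray(model_feature_mean) - np.asarray(data_feature_mean)
    return lmbda + lr * grad
```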

The functions $f_i$ measure object class count statistics and object class co-count statistics.

We approximate the distribution of object class counts with a histogram with bins $R$. For each object class $i$,

$$f^{(\mathrm{count})}_{i,r}(X) = \mathbb{1}\big(n_i(X) \in r\big), \qquad (2)$$

where $n_i(X)$ is the number of objects of class $i$ in the scene $X$, that is:

$$n_i(X) = \sum_{\ell \in L} \mathbb{1}(\tilde{c}_\ell = i). \qquad (3)$$

Recall from Section 2 of the main text that $L$ is the set of dictionary indices of the objects used in a scene $X$ and $\tilde{c}_\ell$ is the class of an object $\ell$. $r \in R$ is a range of integer values. In our current implementation, $R = \{[0], [1], [2,3], [4,7], [8,\infty)\}$.

We approximate the distribution of object class co-counts with a 2D histogram $R \times R$. For each pair of object classes $(i, j)$, and histogram bin $(r_1, r_2)$, $r_1, r_2 \in R$, we have the feature:

$$f^{(\mathrm{co\text{-}count})}_{i,j,r_1,r_2}(X) = \mathbb{1}\big(n_i(X) \in r_1,\; n_j(X) \in r_2\big). \qquad (4)$$
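The count and co-count features of Equations (2)–(4) can be assembled for a single scene as follows. This is a minimal sketch that assumes the scene is given as a flat list of object class indices (one per $\ell \in L$) and uses the bin set $R$ above; function names and data layout are assumptions for illustration.

```python
import numpy as np
from collections import Counter

# Count bins R from the text: {[0], [1], [2,3], [4,7], [8, inf)}.
R = [(0, 0), (1, 1), (2, 3), (4, 7), (8, np.inf)]

def bin_index(n):
    """Index of the bin r in R that contains the integer count n."""
    for k, (lo, hi) in enumerate(R):
        if lo <= n <= hi:
            return k
    raise ValueError(n)

def count_features(scene_classes, num_classes):
    """Indicator features f^(count)_{i,r}(X) of Eq. (2).

    `scene_classes` is the list of class indices c~_l for the objects in L,
    so n_i(X) is simply a count over this list.
    """
    counts = Counter(scene_classes)
    f = np.zeros((num_classes, len(R)))
    for i in range(num_classes):
        f[i, bin_index(counts.get(i, 0))] = 1.0
    return f

def cocount_features(scene_classes, num_classes):
    """Indicator features f^(co-count)_{i,j,r1,r2}(X) of Eq. (4)."""
    counts = Counter(scene_classes)
    f = np.zeros((num_classes, num_classes, len(R), len(R)))
    for i in range(num_classes):
        ri = bin_index(counts.get(i, 0))
        for j in range(num_classes):
            rj = bin_index(counts.get(j, 0))
            f[i, j, ri, rj] = 1.0
    return f
```

For maxent learning, these arrays would be flattened and concatenated into the feature vector whose empirical mean the model is trained to match.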

In Section 5.2.1 of the main text, we compare our system using only the scenegraph grammar as a prior against our system using both the scenegraph grammar and the maxent prior described above. As shown in Table 4, adding the maxent prior decreases performance compared to the scenegraph grammar alone.

2. Segment transformation – details

Translation and scaling:

As summarized in Section 4.2 of the main text, each object from our dictionary is translated and scaled independently. We seek to place the object at a location and scale where it explains the appearance of the query image $I$ well. Using $\theta_t$ to represent the center position of the object's mask and $\theta_s$ to represent how much we scale the object's mask, we start with the following objective function:

$$E_1(\theta_t, \theta_s) = \sum_{q \in Q_\ell} \tilde{g}_\ell\big(f(I)_q\big) + \lambda_t \sum_{q \in \tilde{Q}_\ell} p_t(\tilde{c}_\ell, q) + \lambda_s\, p_s(\theta_s), \qquad (5)$$

$$Q_\ell = T_{ts}\big(\tilde{Q}_\ell, \{\theta_t, \theta_s\}\big), \qquad (6)$$

where $\tilde{g}_\ell$ is the segment's appearance model, $p_t(\tilde{c}_\ell, q)$ is a prior over mask position for objects of class $\tilde{c}_\ell$, and $p_s(\theta_s)$ is a prior over the scale factor. $T_{ts}$ applies a similarity transformation to the dictionary object mask $\tilde{Q}_\ell$. In order to encourage objects to stay largely within the image frame, we assign a default penalty to all pixels in the transformed mask that fall outside the image frame.

We calculate $p_t(c, q)$ by averaging masks of class $c$ across our dictionary:

$$p_t(c, q) \propto \sum_{\ell \in L\ \mathrm{s.t.}\ \tilde{c}_\ell = c} \mathbb{1}\big(q \in \tilde{Q}_\ell\big). \qquad (7)$$
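Equation (7) amounts to summing the dictionary masks of each class, rasterized at a common image size, and normalizing the result; a minimal sketch (array names and layout are assumptions):

```python
import numpy as np

def position_prior(dictionary_masks, dictionary_classes, num_classes, eps=1e-6):
    """Per-class position prior p_t(c, q) of Eq. (7).

    `dictionary_masks` is assumed to be an array of binary object masks of
    shape (num_objects, H, W), rasterized at a common image size, and
    `dictionary_classes` gives the class c~_l of each mask.
    """
    H, W = dictionary_masks.shape[1:]
    prior = np.full((num_classes, H, W), eps)
    for mask, c in zip(dictionary_masks, dictionary_classes):
        prior[c] += mask
    # Normalize each class map so it can be used as a (proportional) prior.
    prior /= prior.sum(axis=(1, 2), keepdims=True)
    return prior
```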

We manually set $p_s(\theta_s) = \{0.75, 0.875, 1, 0.875, 0.75\}$ for the discrete set of scales $\theta_s \in \{0.5, 0.75, 1, 1.5, 2\}$. That is, objects slightly prefer to remain at the scale at which they were found in our dictionary.


The objective $E_1$ prefers that objects scale down as much as possible until they achieve a precise fit onto the query image. In order to prevent all objects from choosing the smallest scale, we add a term that trades off between precision and coverage of the detections:

$$E_2(\theta_t, \theta_s) = \frac{E_1(\theta_t, \theta_s)}{|Q_\ell| + \eta(\tilde{Q}_\ell, \tilde{z}_\ell)}. \qquad (8)$$

Here we have scaled $E_1$ by a softened version of the area the transformed object mask covers. The softening factor, $\eta(\tilde{Q}_\ell, \tilde{z}_\ell)$, controls the preference for precision versus recall in the object detections. We set this term so as to prefer recall for large, background objects and precision for small, foreground objects. The intuition for this choice is that objects in the background are likely to be largely covered in the final scene explanation, and consequently it is okay if only parts of them match the image well. This term is a function of the object's untransformed mask $\tilde{Q}_\ell$ and the layer of the object in its dictionary scene, $\tilde{z}_\ell$:

$$\eta(\tilde{Q}_\ell, \tilde{z}_\ell) \propto \frac{1}{|\tilde{Q}_\ell|\,\tilde{z}_\ell}. \qquad (9)$$

We choose transformation parameters that maximize our objective:

$$\{\theta_t^*, \theta_s^*\} = \operatorname*{argmax}_{\theta_t, \theta_s}\; E_2(\theta_t, \theta_s). \qquad (10)$$

To choose $\theta_s^*$ we try each scale in our discrete set. To choose $\theta_t^*$, we adopt a sliding-window approach, searching over all possible placements of the object mask such that its center is within the image frame (however, for dictionary objects that touch an image border, we only consider placements such that the object still touches that border). We use convolution between the object mask and the image features to perform this search efficiently.
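The sketch below illustrates this convolution-based search. For simplicity it folds the appearance term and the position prior into a single per-pixel map (a simplification of Equation (5)), applies a constant out-of-frame penalty via padding, and omits the special case for border-touching objects; all names and constants are illustrative rather than the implementation described here.

```python
import numpy as np
from scipy.signal import fftconvolve

def place_object(score_map, pos_prior, mask, scales, scale_prior,
                 lambda_t=1.0, lambda_s=1.0, eta=1.0, oob_penalty=-1.0):
    """Sliding-window search for {theta_t*, theta_s*}, sketching Eqs. (5)-(10).

    `score_map[y, x]` plays the role of the appearance term g~_l(f(I)_q)
    evaluated at every pixel, `pos_prior` of p_t(c~_l, .), `mask` is the
    binary dictionary mask Q~_l, and `eta` stands in for eta(Q~_l, z~_l).
    """
    H, W = score_map.shape
    best_val, best_pos, best_scale = -np.inf, None, None
    for s, p_s in zip(scales, scale_prior):
        # Rescale the mask with nearest-neighbour sampling so it stays binary.
        h = max(1, int(round(mask.shape[0] * s)))
        w = max(1, int(round(mask.shape[1] * s)))
        ys = (np.arange(h) / s).astype(int).clip(0, mask.shape[0] - 1)
        xs = (np.arange(w) / s).astype(int).clip(0, mask.shape[1] - 1)
        m = mask[np.ix_(ys, xs)].astype(float)

        # Convolving the per-pixel terms with the flipped mask sums them over
        # the transformed mask for every candidate center at once; padding
        # with a penalty discourages placements that leave the image frame.
        per_pixel = score_map + lambda_t * pos_prior
        padded = np.pad(per_pixel, ((h // 2, h // 2), (w // 2, w // 2)),
                        mode='constant', constant_values=oob_penalty)
        e1 = fftconvolve(padded, m[::-1, ::-1], mode='valid')[:H, :W] \
             + lambda_s * p_s
        e2 = e1 / (m.sum() + eta)                       # Eq. (8)

        y, x = np.unravel_index(np.argmax(e2), e2.shape)
        if e2[y, x] > best_val:
            best_val, best_pos, best_scale = e2[y, x], (y, x), s
    return best_pos, best_scale, best_val
```

Because the convolution evaluates every center position at once, the exhaustive translation search costs one convolution per scale rather than one mask overlap per candidate placement.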

Trimming and growing:

We edit object mask silhouettes after each iteration of greedy optimization (Algorithm 1 in the main text; labeled as TRIMANDGROW). We formulate this part of the problem as 2D MRF-based segmentation, in which each object in the scene becomes a segment class label. Here, we denote object segment labels as a 2D array $\ell(q)$, and minimize the following energy function using BP-S [1]:

$$-\log p(\ell \mid I, X) = \sum_q \tilde{g}_{\ell(q)}\big(f(I)_q\big) + \lambda_1 \sum_q \psi_{\ell(q)}(q)\; + \qquad (11)$$

$$\lambda_2 \sum_{\{p,q\} \in \mathcal{E}} \phi\big(\ell(p), \ell(q); I\big) + \log Z, \qquad (12)$$

where $Z$ normalizes and $\mathcal{E}$ is the set of neighboring pixel pairs. The data term $\tilde{g}_\ell(\cdot)$ is our appearance model from Section 2.2 of the main text. Spatial priors $\psi_\ell(\cdot)$ are provided by our object visibility masks (Equation 4 in the main text):

$$\psi_\ell(q) = (G * V_\ell)(q), \qquad (13)$$

where $G$ is a Gaussian kernel with scale proportional to the object's mask area, $|Q_\ell|$.

For the spatial smoothness term $\phi(\cdot)$, we borrow the image-sensitive smoothness potential from [1]:

$$\phi\big(\ell(p), \ell(q); I\big) = \mathbb{1}\big(\ell(p) \neq \ell(q)\big)\left(\frac{\xi + e^{-\gamma \|I(p) - I(q)\|^2}}{\xi + 1}\right), \qquad (14)$$

with $\gamma = \big(2\,\langle \|I(p) - I(q)\|^2 \rangle\big)^{-1}$, where $\langle\cdot\rangle$ denotes the average over neighboring pixel pairs.

This silhouette editing process leaves us with a 2D map of object labels $\ell(q)$. We use this map to update our object visibility masks: $V_\ell = \bigcup_q \mathbb{1}(\ell = \ell(q))$. These visibility masks affect the synthesized scene's likelihood (Equation 6 in the main text), and are used to calculate pixel-wise and class-wise label accuracy (Section 5 of the main text). Upon each invocation of TRIMANDGROW, we discard the old silhouette edits and start afresh from the visibility masks given by Equation 4 in the main text.
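The sketch below assembles the ingredients of Equations (11)–(14) in a form that could be handed to a generic discrete MRF solver (the minimization itself, BP-S [1] in the paper, is not shown). The per-object appearance maps, the visibility-mask layout, and the constant tying the Gaussian scale to the mask area are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def trim_and_grow_terms(appearance, visibility, image, lambda1=1.0, xi=0.1):
    """Assemble the unary and pairwise terms of Eqs. (11)-(14).

    `appearance[l, y, x]` stands in for g~_l(f(I)_q), `visibility[l]` for the
    binary visibility mask V_l of object l, and `image` is an H x W x 3 float
    array. The actual minimization (BP-S in the paper) is left to a solver.
    """
    num_objects = appearance.shape[0]

    # Unary term per label: the appearance term plus the spatial prior psi_l
    # of Eq. (13), a Gaussian-smoothed visibility mask whose scale grows with
    # the object's mask area |Q_l| (the proportionality constant is a guess).
    unary = np.empty_like(appearance, dtype=float)
    for l in range(num_objects):
        sigma = 1e-3 * visibility[l].sum() + 1.0
        psi = gaussian_filter(visibility[l].astype(float), sigma)
        unary[l] = appearance[l] + lambda1 * psi

    # Image-sensitive pairwise weights of Eq. (14), one per horizontal and
    # vertical pixel pair; the indicator 1(l(p) != l(q)) is applied by the
    # solver (a Potts-style model with these edge weights).
    diff_h = ((image[:, 1:] - image[:, :-1]) ** 2).sum(axis=-1)
    diff_v = ((image[1:, :] - image[:-1, :]) ** 2).sum(axis=-1)
    gamma = 1.0 / (2.0 * np.mean(np.r_[diff_h.ravel(), diff_v.ravel()]) + 1e-12)
    w_h = (xi + np.exp(-gamma * diff_h)) / (xi + 1.0)
    w_v = (xi + np.exp(-gamma * diff_v)) / (xi + 1.0)
    return unary, w_h, w_v
```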

3. Random scene synthesis – details

During random scene synthesis, we do not have a target image to match, so several changes need to be made to our collaging algorithm. First, the likelihood term is set to be proportional to the number of pixels the collage covers: $\log P(I \mid X) \propto \sum_{\ell \in L} |V_\ell|$. This biases inference toward collages that fill the entire image frame. Second, segment transformation and trimming and growing are skipped – instead, segments are placed exactly where they were found in the dictionary scene they came from. Third, we choose the layer on which to place each object segment $\ell$ by finding the layer in the collage that best matches the object's layer in the dictionary scene from which it was sampled. Let $A(\tilde{z}_\ell)$ be the histogram of object class counts above $\tilde{z}_\ell$ in the dictionary scene and $B(\tilde{z}_\ell)$ be the histogram below. Further, let $A'(z)$ and $B'(z)$ represent the histograms of object class counts above and below $z$ in the scene collage being synthesized. We place the object on the layer $z^*_\ell$ that maximizes the following sum of histogram intersections:

$$z^*_\ell = \operatorname*{argmax}_{z} \sum_k \min\big(A(\tilde{z}_\ell)_k,\, A'(z)_k\big) + \min\big(B(\tilde{z}_\ell)_k,\, B'(z)_k\big). \qquad (15)$$
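Equation (15) transcribes directly into code: for each candidate layer of the collage, sum the intersections of the above/below class-count histograms and keep the best layer. Array names and layout below are assumptions for illustration.

```python
import numpy as np

def choose_layer(above_dict, below_dict, above_collage, below_collage):
    """Pick the collage layer z*_l for a sampled segment, per Eq. (15).

    `above_dict` / `below_dict`: length-C histograms A(z~_l), B(z~_l) of class
    counts above/below the segment's layer in its dictionary scene.
    `above_collage[z]` / `below_collage[z]`: the corresponding histograms
    A'(z), B'(z) for each candidate layer z of the collage being synthesized.
    """
    scores = [np.minimum(above_dict, above_collage[z]).sum()
              + np.minimum(below_dict, below_collage[z]).sum()
              for z in range(len(above_collage))]
    return int(np.argmax(scores))  # layer with the largest total intersection
```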

4. Additional results

In Figures 1, 2, and 3 we display additional example parses on each dataset. Figures 4, 5, and 6 show the average per-class accuracy of our algorithm on the top 30 most frequent classes in each dataset. Figure 7 shows several characteristic failure cases for our algorithm.

[Figure 1: for each example, the panels show the query image, human annotation, ground-truth scenegraph, inferred appearance, inferred annotations, inferred scenegraph, and inferred layer-world; the per-object labels from the figure are omitted here.]

Figure 1: Additional example results parsing LMO images. Zoom in to see details, such as the object labels on the scenegraphs. The last row shows an example in which the dictionary contained the same building as in the query image (although a different photo of that building). This occurs fairly frequently in LMO and SUN since these datasets contain many cases of multiple photos of more or less the same place.

[Figure 2: same panel layout as Figure 1 (query image, human annotation, ground-truth scenegraph, inferred appearance, inferred annotations, inferred scenegraph, inferred layer-world); per-object labels omitted.]

Figure 2: Additional example results parsing SUN images. Zoom in to see details, such as the object labels on the scenegraphs.

[Figure 3: same panel layout as Figure 1 (query image, human annotation, ground-truth scenegraph, inferred appearance, inferred annotations, inferred scenegraph, inferred layer-world); per-object labels omitted.]

Figure 3: Additional example results parsing NYU RGBD images. Zoom in to see details, such as the object labels on the scenegraphs.

References

[1] C. Liu, J. Yuen, and A. Torralba. Nonparametric Scene Parsing via Label Transfer. PAMI, 2011.

[2] C. Liu, S. Zhu, and H. Shum. Learning inhomogeneous Gibbs model of faces by minimax entropy. In Proc. Eighth IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 281–287, 2001.

[3] J. Yuen, C. Zitnick, C. Liu, and A. Torralba. A Framework for Encoding Object-level Image Priors. Microsoft Research Technical Report, 2011.

Figure 4: Mean per-class pixel labeling accuracy on the LMO dataset. [Bar chart over the 30 most frequent LMO object classes; y-axis: mean per-class pixel labeling accuracy, from 0 to 1.]

Figure 5: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the SUN test set. [Bar chart; y-axis: mean per-class pixel labeling accuracy, from 0 to 1.]

Figure 6: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the NYU RGBD test set. [Bar chart; y-axis: mean per-class pixel labeling accuracy, from 0 to 1.]

[Figure 7: each row shows a query image, human annotation, inferred appearance, and inferred annotations, labeled by failure type: wrong category, underexplaining, overexplaining, and bad alignment.]

Figure 7: Characteristic failure modes of our algorithm. Top row: Sometimes the algorithm gets the entire scene category wrong. Second row: Our algorithm often runs into difficulty with cluttered scenes full of small objects. Groups of small objects are sometimes explained by one big object that resembles the ensemble appearance of the small objects. One reason for this is that often the human annotations do not mark all small objects in a scene. In this example, the human annotations did not mark the people in either the query image or in the scene whose segments were used in the scene collage. Third row: Conversely, sometimes the algorithm hallucinates small objects where none are necessary, as in the addition of the foreground plants in this simple image of a field. Together, the second and third rows demonstrate that the algorithm has difficulty choosing between explaining a region with several small objects versus one big object. Bottom row: Our current segment transformation algorithm produces fairly coarse alignments, and often makes false matches. The system often finds reasonable small objects to place into a scene collage (such as the people in this collage), but gets their position wrong, leading to low per-class accuracy; better small object detection and alignment is an important direction for future work.