

3D Scene Retrieval from Text with Semantic Parsing

Will Monroe*
Stanford University
Computer Science Department
Stanford, CA 94305, USA
[email protected]

    Abstract

We examine the problem of providing a natural-language interface for retrieving 3D scenes from descriptions. We frame the natural language understanding component of this task as a semantic parsing problem, wherein we first generate a structured meaning representation of a description and then use the meaning representation to specify the space of acceptable scenes. Our model outperforms a one-vs.-all logistic regression classifier on synthetic data, and remains competitive on a large, real-world dataset. Furthermore, our model learns to favor meaning representations that take into account more of the natural language meaning and structure over those that ignore words or relations between them.

    1 Introduction

We look at the task of 3D scene retrieval: given a natural-language description and a set of 3D scenes, identify a scene matching the description. Geometric specifications of 3D scenes are part of the craft of many graphical computing applications, including computer animation, games, and simulators. Large databases of such scenes have become available in recent years as a result of improvements in

* This project is part of a joint research project with Angel Chang (Computer Science) and Christopher Potts (Linguistics). I have omitted their names from the author section to avoid confusion, as neither is a CS 229 student; however, their contributions are substantial. Angel implemented the infrastructure for working with 3D scenes and collecting data via Mechanical Turk. Prof. Potts has been our primary faculty advisor for this research project.

Figure 1: One instance of the scene retrieval task. (The figure shows the description “brown room with a refrigerator in the back corner” alongside five candidate scenes labeled A–E.)

the ease of use of tools for 3D scene design. A system that can identify a 3D scene from a natural language description is useful for making such databases of scenes readily accessible. Natural language has evolved to be well-suited to describing our (three-dimensional) world, and it provides a convenient way of specifying the space of acceptable scenes: a description of a physical environment encodes logical propositions about the space of environments a speaker is describing.

This logical structure of scene descriptions suggests that semantic parsing is an appropriate framework in which to understand the problem of retrieving scenes based on their natural-language descriptions. Semantic parsing (Zelle and Mooney, 1996; Tang and Mooney, 2001; Zettlemoyer and Collins, 2005) is the problem of extracting logical forms from natural language.

  • S → P F PF → O (P O)∗

    P → Σ∗

    O → room|green|weird| . . .

    Figure 2: Context-free grammar for parsing scene de-scriptions.
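As a toy illustration of how this grammar can be applied, the sketch below greedily matches lexicon phrases (the O symbol) against a description and skips everything else (the P → Σ* rule), conjoining the matched predicates. The lexicon entries and the greedy longest-match strategy are illustrative assumptions, not the actual SEMPRE chart parser used in this work.

```python
# Hypothetical lexicon O: phrases mapped to object categories or model
# IDs. These entries are illustrative, not the paper's actual lexicon.
LEXICON = {
    "room": "category:room",
    "refrigerator": "category:refrigerator",
    "crazy poster": "model:poster463",
}

def parse(description: str) -> str:
    """Greedily match lexicon phrases (longest first) and conjoin them
    into an (and ...) logical form; unmatched tokens are skipped, as
    permitted by the P -> Sigma* rule."""
    tokens = description.lower().split()
    predicates = []
    i = 0
    while i < len(tokens):
        # Try a bigram match first, then a unigram; otherwise skip.
        for n in (2, 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                predicates.append(LEXICON[phrase])
                i += n
                break
        else:
            i += 1  # token not in lexicon: skipped
    return "(and " + " ".join(predicates) + ")"
```

A real parser would enumerate many candidate derivations rather than committing to one greedy parse; the beam search over those candidates is described in Section 2.2.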

A large and growing body of literature from the computer vision and information retrieval communities exists surrounding the task of scene retrieval over images and video given descriptions (Sebe et al., 2003; Datta et al., 2008; Snoek and Worring, 2008). However, relatively little work has looked at identifying 3D scenes from natural language descriptions. 3D scene representations offer advantages in the form of a highly structured, segmented input that is semantically annotated.

Other work has looked at generation of 3D scenes from text. WordsEye (Coyne and Sproat, 2001) pioneered the approach of visualizing natural language by constructing a 3D scene that represents its meaning. More recently, Chang et al. (2014) built a system for text-to-scene generation which can learn scene synthesis from a database of human-built scenes (Fisher et al., 2012) and be trained interactively, updating the weights for object support hierarchy priors in response to user corrections. This helps ensure that generated scenes are realistic, by sampling from an estimate of the density of human-generated scenes that is trained on both positive and negative examples.

Both of these systems notably lack sophisticated approaches for extracting meaning from natural language descriptions. Our work relates to this goal in that we define a space of meaning representations that can be used for both 3D scene retrieval and 3D scene generation. We aim to fill the natural language understanding gap in this work by applying a state-of-the-art semantic parser to map textual input to these meaning representations.

    2 Models

We define the scene retrieval problem as follows: given $x^{(i)}$, a natural-language description of a 3D scene, identify the scene $y^{(i)}$ that inspired the description from among a set of scenes $Y^{(i)}$ consisting of $y^{(i)}$ and one or more distractor scenes.

2.1 Logistic regression

We train and evaluate two models for solving this problem. The first is a one-vs.-all logistic regression (LR) classifier, which learns to estimate the probability of a scene being correct:

$$h_\theta(y \mid x) = \frac{1}{1 + \exp[-\phi(x, y)^\top \theta]}$$

In training the LR model, we maximize the L2-regularized log likelihood of the dataset, counting each true scene and each distractor scene as one example:

$$O(x, y) = \frac{\lambda}{2}\|\theta\|_2^2 + \sum_{i=1}^{m} \sum_{y \in Y^{(i)}} \Big( 1\{y = y^{(i)}\} \log h_\theta(y \mid x^{(i)}) + 1\{y \neq y^{(i)}\} \log\big[1 - h_\theta(y \mid x^{(i)})\big] \Big)$$

At test time, the model predicts the scene with the highest estimated probability:

$$\hat{y}^{(i)} = \arg\max_{y \in Y^{(i)}} h_\theta(y \mid x^{(i)})$$
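A minimal sketch of this one-vs.-all prediction rule, assuming sparse dict-of-floats feature vectors and weights; the representations are illustrative, not the actual implementation:

```python
import math

def h_theta(phi: dict, theta: dict) -> float:
    """Estimated probability that a scene matches a description, given
    the sparse feature vector phi(x, y) and learned weights theta."""
    score = sum(theta.get(name, 0.0) * value for name, value in phi.items())
    return 1.0 / (1.0 + math.exp(-score))

def predict(x, candidates, phi, theta):
    """One-vs.-all prediction: return the candidate scene with the
    highest estimated probability h_theta(y | x). `phi(x, y)` is a
    placeholder feature function (see Section 2.2.1)."""
    return max(candidates, key=lambda y: h_theta(phi(x, y), theta))
```

Because the sigmoid is monotonic in the linear score, taking the argmax over probabilities is equivalent to taking it over raw scores.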

2.2 Semantic parsing

The second model we consider is a structured prediction model that uses the SEMPRE semantic parsing framework (Berant et al., 2013) to derive a structured logical representation ($z$, a latent variable) of the meaning of a scene description. SEMPRE uses a structured softmax regression model with features defined over utterance–logical form pairs:

$$h_\theta(z \mid x) = \frac{\exp[\phi(x, z)^\top \theta]}{\sum_{z' \in D(x)} \exp[\phi(x, z')^\top \theta]}$$

The space of logical forms compatible with an input utterance, $D(x)$, is in general exponentially large, so SEMPRE approximates this sum using a beam search algorithm (guided by the learned parameters $\theta$) to search a subset $\tilde{D}(x; \theta)$ of possible outputs.
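The structured softmax over the beam can be sketched as follows, assuming each candidate logical form's linear score $\phi(x, z)^\top \theta$ has already been computed. The max-subtraction trick for numerical stability is a standard implementation detail, not something specified in the text:

```python
import math

def beam_softmax(scores: dict) -> dict:
    """Normalize linear scores phi(x, z)^T theta over the beam's
    candidate logical forms -- a sketch of the structured softmax,
    restricted to the searched subset of D(x)."""
    # Subtract the max score before exponentiating for numerical
    # stability; this does not change the normalized probabilities.
    m = max(scores.values())
    exps = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}
```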

For each logical form $z$, our model uses deterministic scoring rules to select the scene that best matches the constraints expressed in the logical form:

$$\llbracket z \rrbracket = \arg\max_{y \in Y^{(i)}} \mathrm{Score}(y, z)$$

The training objective is to maximize the L1-regularized log likelihood of the correct scenes, marginalized over $z$:

$$O(x, y) = \lambda\|\theta\|_1 + \sum_{i=1}^{m} \log \sum_{z \in \tilde{D}(x^{(i)}; \theta)} 1\{\llbracket z \rrbracket = y^{(i)}\}\, h_\theta(z \mid x^{(i)})$$

The final prediction of the model is the scene selected by the logical form on the beam that is the most likely according to the model:

$$\hat{y}^{(i)} = \Big\llbracket \arg\max_{z \in \tilde{D}(x^{(i)}; \theta)} h_\theta(z \mid x^{(i)}) \Big\rrbracket$$

$D(x)$, the true set of logical forms compatible with a scene description $x$, is defined by a context-free grammar over the input augmented with semantic interpretations of each grammar rule. Figure 2 shows the simple grammar we use. Any terminal and nonterminal symbol in this grammar can be surrounded by zero or more tokens that are skipped (not used in the output logical form). The symbol O represents the lexicon for this grammar, which consists of phrases mapped to subsets of a database of 3D models, either by category (e.g. “chair” is mapped to the set of objects categorized by humans as “chairs” in the database) or by model ID (e.g. “crazy poster” is mapped to the specific model poster463). Our logical forms then consist of a conjunction of predicates denoting the existence of at least one object from each subset. For example, one possible logical form might look like:

(and category:room model:refrigerator192)
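A hedged sketch of the deterministic scoring and denotation step: a logical form is treated as a list of category:/model: constraints, and Score(y, z) counts how many constraints a scene satisfies. The scene representation (a list of object records) and the count-based scorer are assumptions for illustration; the text does not specify the exact Score function.

```python
def score(scene, logical_form) -> int:
    """Score(y, z): count how many constraints of the logical form
    (e.g. 'category:room') are satisfied by at least one object in
    the scene. Each object is a dict such as {'category': 'room'}."""
    satisfied = 0
    for constraint in logical_form:
        kind, _, value = constraint.partition(":")
        if any(obj.get(kind) == value for obj in scene):
            satisfied += 1
    return satisfied

def denote(logical_form, candidate_scenes):
    """[[z]]: the candidate scene that best matches the constraints."""
    return max(candidate_scenes, key=lambda y: score(y, logical_form))
```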

2.2.1 Features

Both models use binary-valued features indicating the co-occurrence of a unigram or bigram and an object category or model ID. The logistic regression model uses features defined over every unigram and bigram in the input, a total of 15,000 features on the development dataset and 236,000 features on the full dataset. SEMPRE uses unigram and bigram features only for entries in the lexicon O (approximately 900 for the development dataset and 2,000 for the full dataset). These features are very sparse; for a given input $x$ containing $n$ words, at most $2n - 1$ of these features are non-zero in the LR model, and in the SEMPRE model, at most $(2n - 1)|z|$ are non-zero, where $|z|$ is the number of object descriptions in a given logical form.

For the SEMPRE model, we also have features over the logical forms $z$. These consist of counts of words skipped by part of speech and counts of grammar rules applied in constructing the logical form. These features allow the algorithm to adjust the fraction of the natural language input that is used or ignored in building the output as well as the complexity of the output logical forms. On the development set, the model stored weights for 19 rule application features and 20 POS-skipping features; for the full dataset, the numbers were 26 and 34, respectively.
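The unigram/bigram co-occurrence feature template can be sketched as below; the feature-naming scheme and the set of object IDs passed in are illustrative assumptions. For a description of $n$ tokens, each object contributes at most $n$ unigram plus $n - 1$ bigram features, matching the $2n - 1$ bound mentioned above.

```python
from itertools import chain

def ngram_features(description: str, object_ids) -> dict:
    """Binary co-occurrence features between each unigram/bigram in
    the description and each object category or model ID. The
    'gram&object' naming scheme is illustrative."""
    tokens = description.lower().split()
    unigrams = tokens                                    # n features
    bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]  # n - 1
    return {
        f"{gram}&{obj}": 1.0
        for gram in chain(unigrams, bigrams)
        for obj in object_ids
    }
```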

    3 Experiments

    3.1 Datasets

We collected a dataset of 860 3D scenes and 1,735 descriptions of these scenes from workers on Mechanical Turk. One set of Turkers was shown a sentence (one of 60 seed sentences we wrote describing simple configurations of interior scenes) and asked to build a scene using the WebSceneStudio interface (the program used by Fisher et al. (2012) to build their scene synthesis database). Other Turkers were then asked to describe each of these scenes in one to two sentences.

Our development was done on a smaller dataset of 302 machine-generated scenes, using an existing scene generation system (Chang et al., 2014). These are created from a set of 36 logical forms, some of which were augmented with spatial constraints, such as (and category:candle category:table) and (on_top_of category:candle category:table). For each of these, we generated a number of scenes that satisfy the requirements expressed by the logical form (e.g. that have both a candle and a table), manually filtering scenes to eliminate low-quality models and unphysical object placement. We then wrote one description for each of these 302 scenes.

Model           Train   Test
Random          20%     20%
LR-raw-uni      100%    55%
LR-raw          100%    56%
LR-lem          100%    57%
SEMPRE-raw      40.6%   39%
SEMPRE-lem      46.0%   47%
SEMPRE-lex-raw  81.2%   57%
SEMPRE-lex-lem  86.6%   66%
Size            202     100

Table 1: Development dataset results. raw and lem indicate using unprocessed and lemmatized text for unigram and bigram features; uni indicates unigram features only; lex indicates use of a lexicon assembled from the training set instead of written by hand.

Model   Train   Test
Random  20%     20%
LR      99.9%   52.3%
SEMPRE  70.7%   41.3%
Size    1,437   298

Table 2: Results on the Mechanical Turk dataset.

    3.2 Results

Table 1 shows the performance of our model on the development dataset. Our best semantic parsing model achieves 66% accuracy at distinguishing a scene that matches a description from four distractors (chance is 20%) on a held-out set of 100 scenes from this development dataset. The one-vs.-all classifier achieves 57% accuracy on the held-out dev set.

The fact that the LR model is able to perfectly classify the development training set indicated that it was overfitting. To reduce variance, we looked at the effects of regularization and training set size on model performance.

Figure 3 shows held-out dev set accuracy as a function of the regularization constant λ. We found that including regularization only barely improved the LR model, and hurt performance of the SEMPRE model across the board. (Note that LR uses L2 regularization and SEMPRE uses L1 regularization, so the constants are not directly comparable, but in both cases larger numbers represent a stronger prior.)

Figure 4 shows the learning curves of the two models on the development set.

Figure 3: Effects of regularization on development set accuracy (train and dev accuracy for the LR and SEMPRE models, plotted against the regularization constant λ from 10⁻⁶ to 10⁴). We found that regularization did not improve performance on either model.

Figure 4: Learning curve on the development dataset (train and dev accuracy for the LR and SEMPRE models, plotted against training set size from 20 to 220 examples). The upward slope indicated that additional data was likely to improve performance; the steeper line for the LR model suggested (correctly) that LR would outperform SEMPRE on the larger dataset.

The slopes of the learning curves suggest that the SEMPRE model does well in the low-data setting, but the LR model benefits from the larger dataset. This can be explained by the fact that SEMPRE has a fixed lexicon that encodes some amount of prior knowledge about word meanings; as the training data grows, this lexicon becomes less capable of accounting for the variety of language found in the data, whereas the LR model continues to add features and learn from the additional data.

The results on our dataset collected from Mechanical Turk (Table 2) support this conclusion: the LR model performed better than the semantic parsing model on this larger dataset, achieving 52.3% accuracy on the test set compared to 41.3% for the semantic parsing model (chance is again 20%).

    4 Discussion

On the smaller dev set, the semantic parsing model had higher performance than the simple classifier. However, with a larger dataset, logistic regression outperforms the structured model. This discrepancy is best explained by a combination of two factors: firstly, additional data provides a greater benefit to the LR model than to the semantic parsing model, as mentioned in Section 3.2. Secondly, the larger dataset contains descriptions collected from many workers, whereas the descriptions in the smaller dataset were written by only two people. This naturally leads to greater linguistic diversity in the full dataset, making the problem harder for both models (as we can see from the lower test set scores in Table 2) but disproportionately affecting the semantic parsing model because of its limited capacity for encoding linguistic variation in the logical forms.

One promising outcome we observed is that the semantic parsing model gives negative weights to the word-skipping features (with the exception of verbs, wh-pronouns, and punctuation) and the grammar rule application features with counts of zero. This indicates that paying attention to all of the language in a description and producing outputs with complex structure are empirically beneficial.

    5 Future Work

For future work, we plan to examine ways to speed up training times, so we can use a larger lexicon. We also plan to add new features and new kinds of meaning representations, incorporating spatial relationships between objects and other information about 3D models that is available in our model database, such as human-written descriptions of models. Other attributes that can be extracted from the models and could be useful in features include color, size, and proportions.

The meaning representations produced by the SEMPRE model can also be used for scene synthesis, a line of work we plan to pursue over the coming months.

References

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

Angel X. Chang, Manolis Savva, and Christopher D. Manning. 2014. Interactive learning of spatial knowledge for text to 3D scene generation. In ACL-ILLVI.

Bob Coyne and Richard Sproat. 2001. WordsEye: an automatic text-to-scene conversion system. In SIGGRAPH.

Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5.

Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan. 2012. Example-based synthesis of 3D object arrangements. ACM Transactions on Graphics (TOG), 31(6):135.

Nicu Sebe, Michael S. Lew, Xiang Zhou, Thomas S. Huang, and Erwin M. Bakker. 2003. The state of the art in image and video retrieval. In Image and Video Retrieval, pages 1–8. Springer.

Cees G. M. Snoek and Marcel Worring. 2008. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 2(4):215–322.

Lappoon R. Tang and Raymond J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In Machine Learning: ECML 2001. Springer.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.