Corpus-Based Approaches to Semantic Interpretation in Natural Language Processing

Hwee Tou Ng and John Zelle

AI Magazine Volume 18, Number 4, 1997. Copyright 1997, American Association for Artificial Intelligence. All rights reserved.


In recent years, there has been a flurry of research into empirical, corpus-based learning approaches to natural language processing (NLP). Most empirical NLP work to date has focused on relatively low-level language processing such as part-of-speech tagging, text segmentation, and syntactic parsing. The success of these approaches has stimulated research in using empirical learning techniques in other facets of NLP, including semantic analysis: uncovering the meaning of an utterance. This article is an introduction to some of the emerging research in the application of corpus-based learning techniques to problems in semantic interpretation. In particular, we focus on two important problems in semantic interpretation, namely, word-sense disambiguation and semantic parsing.

Getting computer systems to understand natural language input is a tremendously difficult problem and remains a largely unsolved goal of AI. In recent years, there has been a flurry of research into empirical, corpus-based learning approaches to natural language processing (NLP). Whereas traditional NLP has focused on developing hand-coded rules and algorithms to process natural language input, corpus-based approaches use automated learning techniques over corpora of natural language examples in an attempt to automatically induce suitable language-processing models. Traditional work in natural language systems breaks the process of understanding into broad areas of syntactic processing, semantic interpretation, and discourse pragmatics. Most empirical NLP work to date has focused on using statistical or other learning techniques to automate relatively low-level language processing such as part-of-speech tagging, segmenting text, and syntactic parsing. The success of these approaches, following on the heels of the success of similar techniques in speech-recognition research, has stimulated research in using empirical learning techniques in other facets of NLP, including semantic analysis: uncovering the meaning of an utterance.

In the area of semantic interpretation, there have been a number of interesting uses of corpus-based techniques. Some researchers have used empirical techniques to address a difficult subtask of semantic interpretation, that of developing accurate rules to select the proper meaning, or sense, of a semantically ambiguous word. These rules can then be incorporated as part of a larger system performing semantic analysis. Other research has considered whether, at least for limited domains, virtually the entire process of semantic interpretation might yield to an empirical approach, producing a sort of semantic parser that generates appropriate machine-oriented meaning representations from natural language input. This article is an introduction to some of the emerging research in the application of corpus-based learning techniques to problems in semantic interpretation.

    Word-Sense Disambiguation

The task of word-sense disambiguation (WSD) is to identify the correct meaning, or sense, of a word in context. The input to a WSD program consists of real-world natural language sentences. Typically, a separate phase prior to WSD to identify the correct part of speech of



the words in the sentence is assumed (that is, whether a word is a noun, verb, and so on). In the output, each word occurrence w is tagged with its correct sense, in the form of a sense number i, where i corresponds to the i-th sense definition of w in its assigned part of speech. The sense definitions are those specified in some dictionary. For example, consider the following sentence: In the interest of stimulating the economy, the government lowered the interest rate.

Suppose a separate part-of-speech tagger has determined that the two occurrences of interest in the sentence are nouns. The various sense definitions of the noun interest, as given in the Longman Dictionary of Contemporary English (LDOCE) (Bruce and Wiebe 1994; Procter 1978), are listed in table 1. In this sentence, the first occurrence of the noun interest is in sense 4, but the second occurrence is in sense 6. Another wide-coverage dictionary commonly used in WSD research is WORDNET (Miller 1990), which is a public-domain dictionary containing about 95,000 English word forms, with a rather refined sense distinction for words.

WSD is a long-standing problem in NLP. To achieve any semblance of understanding natural language, it is crucial to figure out what each individual word in a sentence means. Words in natural language are known to be highly ambiguous, which is especially true for the frequently occurring words of a language. For example, in the WORDNET dictionary, the average number of senses for each noun for the most frequent 121 nouns in English is 7.8, but that for the most frequent 70 verbs is 12.0 (Ng and Lee 1996). This set of 191 words is estimated to account for about 20 percent of all word occurrences in any English free text. As such, WSD is a difficult and prevalent problem in NLP.

Sense Number  Sense Definition
1             Readiness to give attention
2             Quality of causing attention to be given
3             Activity, subject, and so on, that one gives time and attention to
4             Advantage, advancement, or favor
5             A share (in a company, business, and so on)
6             Money paid for the use of money

Table 1. Sense Definitions of the Noun Interest.


WSD is also an essential part of many NLP applications. In information retrieval, WSD has brought about improvement in retrieval accuracy. When tested on part of the TREC corpus, a standard information-retrieval test collection, WSD improves precision by about 4.3 percent (from 29.9 percent to 34.2 percent) (Schütze and Pedersen 1995). Similarly, in machine translation, WSD has been used to select the appropriate words to translate into a target language. Specifically, Dagan and Itai (1994) reported successful use of WSD to improve the accuracy of machine translation. The work of Schütze and Pedersen (1995) and Dagan and Itai (1994) clearly demonstrates the utility of WSD in practical NLP applications.

Early work on WSD, such as Kelly and Stone (1975) and Hirst (1987), used hand coding of knowledge to disambiguate word sense. The knowledge-acquisition process can be laborious and time consuming. For each word to be disambiguated, the appropriate inference knowledge must be handcrafted. It is difficult to come up with a comprehensive set of the necessary disambiguation knowledge. Also, as the amount of disambiguation knowledge grows, manual maintenance and further expansion become increasingly complex. Thus, it is difficult to scale up manual knowledge acquisition to achieve wide coverage for real-world sentences.

The recent surge in corpus-based NLP research has resulted in a large body of work on WSD of unconstrained real-world sentences. In contrast to manually hand coding disambiguation knowledge into a system, the corpus-based approach uses machine-learning techniques to automatically acquire such disambiguation knowledge, using sense-tagged corpora and large-scale linguistic resources such as online dictionaries. As in other corpus-based learning approaches to NLP, more emphasis is now placed on the empirical evaluation of WSD algorithms on large quantities of real-world sentences.

Corpus-based WSD research can broadly be classified into the supervised approach and the unsupervised approach.

Supervised Word-Sense Disambiguation

In the supervised approach, a WSD program learns the necessary disambiguation knowledge from a large sense-tagged corpus, in which word occurrences have been tagged manually with senses from some wide-coverage dictionary, such as the LDOCE or WORDNET. After training on a sense-tagged corpus in which occurrences of word w have been tagged, a WSD program is then able to assign an appropriate sense to w appearing in a new sentence, based on the knowledge acquired during the learning phase.

The heart of supervised WSD is the use of a supervised learning algorithm. Typically, such a learning algorithm requires its training examples, as well as test examples, to be encoded in the form of feature vectors. Hence, there are two main parts to supervised WSD: (1) transforming the context of a word w to be disambiguated into a feature vector and (2) applying a supervised learning algorithm. Figure 1 illustrates the typical processing phases of supervised WSD.

Figure 1. Supervised Word-Sense Disambiguation. (A sense-tagged corpus passes through feature selection to yield feature vectors; a learning algorithm produces a WSD classifier, whose knowledge is then used to disambiguate a new sentence into a sense-tagged sentence.)
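In code, the two-part pipeline might be outlined as follows. This is only an illustrative sketch: the extract_features function and the learner object are hypothetical stand-ins for whatever feature encoding and supervised learning algorithm a particular system adopts.

def train_wsd(sense_tagged_corpus, extract_features, learner):
    # sense_tagged_corpus: iterable of (context, target_position, sense) triples
    X, y = [], []
    for context, position, sense in sense_tagged_corpus:
        X.append(extract_features(context, position))  # part 1: context -> feature vector
        y.append(sense)                                 # the sense tag is the class label
    return learner.fit(X, y)                            # part 2: train the classifier

def disambiguate(classifier, extract_features, context, position):
    # Assign a sense to the word at `position` in a new sentence.
    return classifier.predict(extract_features(context, position))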

Feature Selection

Let w be the word to be disambiguated. The words surrounding w will form the context of w. This context is then mapped into a feature vector ((f1 v1) ... (fn vn)), which is a list of features fi and their associated values vi. An important issue in supervised WSD is the choice of appropriate features. Intuitively, a good feature should capture an important source of knowledge critical in determining the sense of w.

Various kinds of feature representing different knowledge sources have been used in supervised WSD research. They include the following:

Surrounding words: Surrounding words are the unordered set of words surrounding w. These surrounding words typically come from a fixed-size window centered at w or the sentence containing w. Unordered surrounding words, especially in a large context window, tend to capture the broad topic of a text, which is useful for WSD. For example, surrounding words such as bank, loan, and payment tend to indicate the "money paid for the use of money" sense of interest.

Local collocations: A local collocation refers to a short sequence of words near w, taking the word order into account. Such a sequence of words need not be an idiom to qualify as a local collocation. Collocations differ from surrounding words in that word order is taken into consideration. For example, although the words in, the, and of by themselves are not indicative of any particular sense of interest, when these words occur in the sequence "in the interest of," it always implies the "advantage, advancement, or favor" sense of interest. The work of Kelly and Stone (1975), Yarowsky (1993), Yarowsky (1994), and Ng and Lee (1996) made use of local collocations to disambiguate word sense.

Syntactic relations: Traditionally, selectional restrictions as indicated by syntactic relations such as subject-verb, verb-object, and adjective-noun are considered an important source of WSD knowledge. For example, in the sentence "he sold his interest in the joint venture," the verb-object syntactic relation between the verb sold and the head of the object noun phrase interest is indicative of the "share in a company" sense of interest. In disambiguating the noun interest, the possible values of the verb-object feature are the verbs (such as sold) that stand in a verb-object syntactic relation with interest.

Parts of speech and morphological forms: The part of speech of the neighboring words of w and the morphological form of w also provide useful knowledge to disambiguate w (Ng and Lee 1996; Bruce and Wiebe 1994). For example, in Ng and Lee (1996), there is a part-of-speech feature for each of the 3 words centered around w. There is also one feature for the morphological form of w. For a noun, the value of this feature is either singular or plural; for a verb, the value is one of infinitive (as in the uninflected form of a verb such as fall), present-third-person-singular (as in falls), past (as in fell), present-participle (as in falling), or past-participle (as in fallen).

Most previous research efforts on corpus-based WSD use one or more of the previously given knowledge sources. In particular, the work of Ng and Lee (1996) attempts to integrate this diverse set of knowledge sources for corpus-based WSD. Some preliminary findings suggest that local collocation provides the most important source of disambiguation knowledge, although the accuracy achieved by the combined knowledge sources exceeds that obtained by using any one of the knowledge sources alone (Ng and Lee 1996). That local collocation is the most predictive seems to agree with the past observation that humans need a narrow window of only a few words to perform WSD (Choueka and Lusignan 1985).

One approach taken to select and decide on the explicit interaction among the possible features is from Bruce and Wiebe (1994) and Pedersen, Bruce, and Wiebe (1997), who explore a search space for model selection in the context of building a probabilistic classifier.

Learning Algorithms

Having decided on a set of features to encode the context and form the training examples, the next step is to use some supervised learning algorithm to learn from the training examples. A large number of learning algorithms have been used in previous WSD research, including Bayesian probabilistic algorithms (Pedersen and Bruce 1997a; Mooney 1996; Bruce and Wiebe 1994; Leacock, Towell, and Voorhees 1993; Gale, Church, and Yarowsky 1992a; Yarowsky 1992), neural networks (Mooney 1996; Leacock, Towell, and Voorhees 1993), decision lists (Mooney 1996; Yarowsky 1994), and exemplar-based algorithms (Ng 1997a; Mooney 1996; Ng and Lee 1996; Cardie 1993).

In particular, Mooney (1996) evaluated seven widely used machine-learning algorithms on a common data set for disambiguating six senses of the noun line (Leacock, Towell, and Voorhees 1993). The seven algorithms that he evaluated are (1) a naive-Bayes classifier (Duda and Hart 1973), (2) a perceptron (Rosenblatt 1958), (3) a decision-tree learner (Quinlan 1993), (4) a k nearest-neighbor classifier (exemplar-based learner) (Cover and Hart 1967), (5) logic-based DNF (disjunctive normal form) learners (Mooney 1995), (6) logic-based CNF (conjunctive normal form) learners (Mooney 1995), and (7) a decision-list learner (Rivest 1987). We briefly describe two of the learning algorithms that have been used for WSD: a naive-Bayes algorithm and an exemplar-based algorithm, PEBLS.

Naive-Bayes: The naive-Bayes algorithm (Duda and Hart 1973) is based on Bayes's theorem:

P(Ci | vj) = P(vj | Ci) P(Ci) / P(vj),  i = 1, ..., K,

where P(Ci | vj) is the probability that a test example is of class Ci given feature values vj (vj denotes the conjunction of all feature values in the test example). The goal of a naive-Bayes classifier is to determine the class Ci with the highest conditional probability P(Ci | vj). Because the denominator P(vj) of the previous expression is constant for all classes Ci, the problem reduces to finding the class Ci with the maximum value for the numerator.

The naive-Bayes classifier assumes independence of example features, so that

P(vj | Ci) = Product over the individual feature values vk of P(vk | Ci).

During training, naive-Bayes constructs the matrix P(vk | Ci), and P(Ci) is estimated from the distribution of training examples among the classes.

PEBLS: PEBLS is an exemplar-based (or nearest-neighbor) algorithm developed by Cost and Salzberg (1993). It has been used successfully for WSD in Ng (1997a) and Ng and Lee (1996). The heart of exemplar-based learning is a measure of the similarity, or distance, between two examples. If the distance between two examples is small, then the two examples are similar. In PEBLS, the distance between two symbolic values v1 and v2 of a feature f is defined as

d(v1, v2) = Sum over i = 1, ..., n of | P(Ci | v1) - P(Ci | v2) |,

where n is the total number of classes.


P(Ci | v1) is estimated as N1,i/N1, where N1,i is the number of training examples with value v1 for feature f that are classified as class i in the training corpus, and N1 is the number of training examples with value v1 for feature f in any class. P(Ci | v2) is estimated similarly. This distance metric of PEBLS is adapted from the value-difference metric of the earlier work of Stanfill and Waltz (1986). The distance between two examples is the sum of the distances between the values of all the features of the two examples.

Let k be the number of nearest neighbors to use for determining the class of a test example, k >= 1. During testing, a test example is compared against all the training examples. PEBLS then determines the k training examples with the shortest distance to the test example. Among these k closest-matching training examples, the class that the majority of these k examples belong to will be assigned as the class of the test example, with ties among multiple majority classes broken randomly.
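To make the metric concrete, the value-difference distance and the k-nearest-neighbor decision just described might be rendered as the following sketch. It is a simplified illustration, not the PEBLS implementation of Cost and Salzberg (1993).

from collections import Counter, defaultdict

def value_class_probs(training_examples, feature_index):
    # Estimate P(Ci | v) = N_v,i / N_v for every value v of one feature.
    # training_examples: list of (feature_vector, class_label) pairs.
    counts = defaultdict(Counter)   # value -> class -> count
    totals = Counter()              # value -> count
    for features, label in training_examples:
        v = features[feature_index]
        counts[v][label] += 1
        totals[v] += 1
    return {v: {c: counts[v][c] / totals[v] for c in counts[v]} for v in totals}

def value_distance(probs1, probs2, classes):
    # d(v1, v2) = sum over classes Ci of |P(Ci | v1) - P(Ci | v2)|
    return sum(abs(probs1.get(c, 0.0) - probs2.get(c, 0.0)) for c in classes)

def example_distance(x1, x2, prob_tables, classes):
    # Distance between two examples: sum of value distances over all features.
    return sum(value_distance(prob_tables[j].get(x1[j], {}),
                              prob_tables[j].get(x2[j], {}), classes)
               for j in range(len(x1)))

def knn_classify(test_x, training_examples, prob_tables, classes, k=1):
    # Majority vote among the k nearest training examples (ties broken arbitrarily).
    nearest = sorted(training_examples,
                     key=lambda ex: example_distance(test_x, ex[0], prob_tables, classes))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# prob_tables would be built once per feature:
# prob_tables = [value_class_probs(examples, j) for j in range(number_of_features)]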

Mooney (1996) reported that the simple naive-Bayes algorithm gives the highest accuracy on the line corpus tested. The set of features used in his study only consists of the unordered set of surrounding words. Past research in machine learning has also reported that the naive-Bayes algorithm achieved good performance on other machine-learning tasks, in spite of the conditional independence assumption made by the naive-Bayes algorithm, which might be unjustified in some of the domains tested.
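For reference, the naive-Bayes decision rule given earlier amounts to comparing log-probability sums. The sketch below is one simple rendering; the add-one smoothing it uses is our own assumption, not necessarily what the cited studies did.

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    # Picks argmax over classes Ci of P(Ci) * product over features of P(vj | Ci),
    # computed in log space.

    def fit(self, X, y, alpha=1.0):
        self.class_counts = Counter(y)
        self.total = len(y)
        self.feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts
        for features, label in zip(X, y):
            for j, v in enumerate(features):
                self.feature_counts[(label, j)][v] += 1
        self.alpha = alpha
        return self

    def predict(self, features):
        def log_score(c):
            score = math.log(self.class_counts[c] / self.total)      # log P(Ci)
            for j, v in enumerate(features):
                counts = self.feature_counts[(c, j)]
                num = counts[v] + self.alpha                          # add-one smoothing (assumed)
                den = sum(counts.values()) + self.alpha * (len(counts) + 1)
                score += math.log(num / den)                          # log P(vj | Ci)
            return score
        return max(self.class_counts, key=log_score)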

Recently, Ng (1997a) compared the naive-Bayes algorithm with the exemplar-based algorithm PEBLS on the DSO National Laboratories corpus (Ng and Lee 1996), using a larger value of k (k = 20) and 10-fold cross validation to automatically determine the best k. His results indicate that with this improvement, the exemplar-based algorithm achieves accuracy comparable to the naive-Bayes algorithm on a test set from the DSO corpus. The set of features used in this study only consists of local collocations.

Hence, both the naive-Bayes algorithm and the exemplar-based algorithm give good performance for WSD. The potential interaction between the choice of features and the learning algorithms appears to be an interesting topic worthy of further investigation.

One drawback of the supervised learning approach to WSD is the need for manual sense tagging to prepare a sense-tagged corpus. The work of Brown et al. (1991) and Gale, Church, and Yarowsky (1992a) tried to make use of aligned parallel corpora to get around this problem. In general, a word in a source language can be translated into several different target language words, depending on its intended sense in context. For example, the tax sense of duty is translated as droit in French, whereas the obligation sense is translated as devoir (Gale, Church, and Yarowsky 1992a). An aligned parallel corpus can thus be used as a natural source of sense tags: occurrences of the word duty that are translated as droit have the tax sense of duty, and so on. The shortcoming of this approach is that sense disambiguation is now task specific and language specific. It is tied to the machine-translation task between the two languages concerned.

Another trick that has been used to generate sense-tagged data is to conflate two unrelated English words, such as author and baby, into an artificial compound word author-baby (Schütze 1992). All occurrences of the words author and baby in the texts are replaced with the compound word author-baby. The goal is then to disambiguate occurrences of author-baby as author or baby, with the correct answers being the original word forms in the texts. Although this approach has the advantage of an unlimited supply of tagged data without manual annotation, the disambiguation problem is highly artificial, and it is unclear how the disambiguation accuracy obtained should be interpreted.
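The pseudo-word construction is easy to reproduce. The sketch below conflates an arbitrary pair of words; a real experiment would also record which occurrence of the compound is the disambiguation target.

import re

def make_pseudoword_data(sentences, w1="author", w2="baby"):
    # Replace every occurrence of w1 or w2 with the compound w1-w2; the original
    # word serves as the "sense" label for that occurrence.
    pattern = re.compile(r"\b(%s|%s)\b" % (re.escape(w1), re.escape(w2)))
    compound = w1 + "-" + w2
    examples = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            sense = match.group(1)                       # the correct answer (author or baby)
            conflated = pattern.sub(compound, sentence)  # the artificially ambiguous text
            examples.append((conflated, sense))
    return examples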

Unsupervised Word-Sense Disambiguation

In the WSD research literature, unsupervised WSD typically refers to disambiguating word sense without the use of a sense-tagged corpus. It does not necessarily refer to clustering of unlabeled training examples, which is what unsupervised learning traditionally means in machine learning.

Most research efforts in unsupervised WSD rely on the use of knowledge contained in a machine-readable dictionary (Lin 1997; Resnik 1997; Agirre and Rigau 1996; Luk 1995; Wilks et al. 1990). A widely used resource is WORDNET. Besides being an online, publicly available dictionary, WORDNET is also a large-scale taxonomic class hierarchy, where each English noun sense corresponds to a taxonomic class in the hierarchy. The is-a relationship in WORDNET's taxonomic class hierarchy is an important source of knowledge exploited in unsupervised WSD algorithms.

We illustrate unsupervised WSD with an algorithm of Resnik (1997) that disambiguates noun senses. His algorithm makes use of the WORDNET class hierarchy and requires that sentences in a training corpus be parsed so that syntactic relations such as subject-verb, verb-object, adjective-noun, head-modifier, and modifier-head can be extracted from a sentence.



Each syntactic relation involves two words: (1) the noun n to be disambiguated and (2) a verb (in subject-verb, verb-object relations), an adjective (in adjective-noun relations), or another noun (in head-modifier, modifier-head relations).

Consider the following example to explain Resnik's algorithm: Suppose we want to disambiguate the noun coffee in a test sentence containing the verb-object syntactic relation drink coffee. The noun coffee has four senses in WORDNET: coffee as a kind of (1) beverage, (2) tree, (3) seed, or (4) color. The algorithm examines the parsed training corpus, looking for all occurrences of drink x in a verb-object relation. Examples of possible past occurrences are drink tea, drink milk, and drink wine. The goal is to be able to determine that the nouns tea, milk, wine, and so on, are most similar to the beverage sense of coffee without requiring that tea, milk, wine, and so on, be manually tagged with the correct sense in the training corpus.

It turns out that this goal can readily be achieved. The key observation is that although each of the four nouns coffee, tea, milk, and wine has multiple senses, the sense that is most commonly shared by these four nouns is the beverage sense. Figure 2 lists all the senses of these four nouns in WORDNET. For example, the noun tea has four upward-pointing arrows to its four senses, namely, as a kind of (1) beverage, (2) meal, (3) bush, or (4) herb. The beverage sense is shared by all four nouns, whereas the color sense is shared by only two nouns (coffee and wine), with the remaining senses only pointed to by one individual noun.

A frequency-counting scheme will be able to identify this target beverage sense. Specifically, let n be a noun with senses s1, ..., sk. Suppose the syntactic relation R holds for n and the verb p. For i from 1 to k, Resnik's (1997) method computes

Ci = {c | c is an ancestor of si}

ai = max over classes c in Ci of the sum, over words w belonging to taxonomic class c, of NR(p, w) / Nt(w),

where NR(p, w) is the number of times the syntactic relation R holds for word w and p, and Nt(w) is the number of taxonomic classes to which w belongs. The goal is to select the sense si with the maximum value of ai.

In the previous example, n is the noun coffee, with s1 = beverage, s2 = tree, s3 = seed, and s4 = color. Suppose that each of the syntactic relations drink coffee, drink tea, drink milk, and drink wine occurs exactly once in our raw corpus. For simplicity, let us consider only the concepts listed in figure 2, ignoring other ancestor concepts not shown in the figure. Then a1 = 1/4 + 1/4 + 1/4 + 1/2, a2 = a3 = 1/4, and a4 = 1/4 + 1/2. Hence, the beverage sense s1 with the highest value a1 is chosen. See Resnik (1997) for the details.
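The counting scheme can be mirrored in a few lines of code. The sketch below assumes a hypothetical classes_of helper that returns the taxonomic classes of a word and a precomputed table of counts NR(p, w); it reproduces only the arithmetic of the coffee example, not Resnik's full method.

def sense_scores(noun_senses, cooccurring_counts, classes_of):
    # noun_senses: {sense: set of ancestor classes Ci} for the noun to disambiguate.
    # cooccurring_counts: {word w: NR(p, w)} for words seen with predicate p in relation R.
    # classes_of: function giving the set of taxonomic classes of a word.
    scores = {}
    for sense, ancestors in noun_senses.items():
        best = 0.0
        for c in ancestors:                              # take the best ancestor class of si
            total = sum(count / len(classes_of(w))       # NR(p, w) / Nt(w)
                        for w, count in cooccurring_counts.items()
                        if c in classes_of(w))
            best = max(best, total)
        scores[sense] = best
    return scores

# Reproducing the coffee example: each of drink coffee/tea/milk/wine occurs once.
classes = {
    "coffee": {"beverage", "tree", "seed", "color"},
    "tea": {"beverage", "meal", "bush", "herb"},
    "milk": {"beverage", "body_fluid", "river", "nourishment"},
    "wine": {"beverage", "color"},
}
scores = sense_scores(
    {"beverage": {"beverage"}, "tree": {"tree"}, "seed": {"seed"}, "color": {"color"}},
    {"coffee": 1, "tea": 1, "milk": 1, "wine": 1},
    classes.get,
)
# scores["beverage"] == 1.25, the highest value, so the beverage sense is chosen.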

Figure 2. Unsupervised Word-Sense Disambiguation: Resnik's Method. (WORDNET senses of the four nouns. coffee: beverage, tree, seed, color; tea: beverage, meal, bush, herb; milk: beverage, body_fluid, river, nourishment; wine: beverage, color.)

The work of Agirre and Rigau (1996) also uses the taxonomic class hierarchy of WORDNET to achieve WSD. However, instead of using syntactic relations, their algorithm uses the surrounding nouns in a window centered at n, the noun to be disambiguated. The idea is to locate a class that contains, proportionately speaking, the highest number of senses of the nouns in the context window. The sense of n under this class is the chosen sense. The precise metric is formulated in the form of conceptual density.


LDOCE is another large-scale linguistic resource exploited in past work on unsupervised WSD, such as the work of Wilks et al. (1990) and Luk (1995), who use the dictionary definitions of LDOCE and the cooccurrence statistics collected to achieve WSD.

The unsupervised WSD method of Yarowsky (1995) iteratively uses the decision-list supervised learning algorithm. It requires only a small number of sense-tagged occurrences of a word w to start with. These initial sense-tagged occurrences form the seeds, as illustrated in figure 3a. In this figure, an example is either labeled with a 1 or a 2, denoting whether it is classified as sense 1 or sense 2 by the algorithm.

At each iteration of the algorithm, new untagged occurrences of w that the algorithm can confidently assign senses to are tagged with senses. These are untagged examples where the probability of classification, as assigned by the decision-list algorithm, is above a certain threshold. This process is illustrated in figure 3b, where more examples are labeled with senses at an intermediate stage.
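The bootstrapping loop itself is straightforward; the sketch below abstracts the decision-list learner behind a hypothetical train_classifier function that returns a model with a predict_with_confidence method.

def bootstrap_wsd(seed_examples, untagged_examples, train_classifier, threshold=0.95):
    # seed_examples: list of (context, sense) pairs tagged by hand.
    # untagged_examples: list of contexts with no sense tag.
    tagged = list(seed_examples)
    untagged = list(untagged_examples)
    while untagged:
        model = train_classifier(tagged)                      # e.g., a decision list
        newly_tagged, still_untagged = [], []
        for context in untagged:
            sense, confidence = model.predict_with_confidence(context)
            if confidence >= threshold:                       # confident enough to self-label
                newly_tagged.append((context, sense))
            else:
                still_untagged.append(context)
        if not newly_tagged:                                  # no further progress: stop
            break
        tagged.extend(newly_tagged)
        untagged = still_untagged
    return train_classifier(tagged)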


Figure 3. Unsupervised Word-Sense Disambiguation: Yarowsky's Method. (a) Iteration 0 (initial stage); (b) iteration k (intermediate stage); (c) last iteration (final stage).


A new classifier is then built from the new, larger set of tagged occurrences, using the decision-list learning algorithm. More untagged occurrences are then assigned senses in the next iteration, and the algorithm terminates when all untagged occurrences of w in a test corpus have been assigned word senses, as shown in figure 3c. Yarowsky reported that his method did as well as supervised learning, although he only tested his method on disambiguating binary, coarse-grained senses.

The work of Schütze (1992) also adopted an unsupervised approach to WSD, relying only on a large untagged raw corpus. For each word w in the corpus, a vector of the cooccurrence counts of the surrounding words centered at a window size of 1,000 characters around w is created. These vectors are clustered, and the resulting clusters represent the various senses of the word. To make the computation tractable, he used singular-value decomposition to reduce the dimensions of the vectors to around 100. A new occurrence of the word w is then assigned the sense of the nearest cluster. The accuracy on disambiguating 10 words into binary distinct senses is in excess of 90 percent.
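A simplified version of this context-vector approach can be put together from standard tools. The sketch below counts cooccurrences over token windows rather than the 1,000-character windows Schütze used, and the vocabulary, dimensionality, and clustering choices are ours.

import numpy as np
from sklearn.cluster import KMeans

def sense_clusters(contexts, vocab, n_senses=2, dims=100):
    # contexts: list of token lists surrounding occurrences of the target word.
    # vocab: list of context words to count.
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(contexts), len(vocab)))
    for row, tokens in enumerate(contexts):
        for tok in tokens:
            if tok in index:
                X[row, index[tok]] += 1              # cooccurrence counts
    U, S, Vt = np.linalg.svd(X, full_matrices=False) # dimensionality reduction via SVD
    k = min(dims, len(S))
    reduced = U[:, :k] * S[:k]                       # project each context into k dimensions
    km = KMeans(n_clusters=n_senses, n_init=10).fit(reduced)
    return km.labels_, km, Vt[:k]

# A new occurrence is assigned the sense of the nearest cluster:
# project its count vector with counts @ Vt[:k].T and call km.predict on the result.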

Performance

As pointed out in Resnik and Yarowsky (1997), the evaluation of empirical, corpus-based WSD has not been as rigorously pursued as other areas of corpus-based NLP, such as part-of-speech tagging and syntactic parsing. The lack of standard, large, and widely available test corpora discourages the empirical comparison of various WSD approaches.

Currently, several sense-tagged corpora are available. These include a corpus of 2,094 examples about 6 senses of the noun line (Leacock, Towell, and Voorhees 1993); a corpus of 2,369 sentences about 6 senses of the noun interest (Bruce and Wiebe 1994); SEMCOR (Miller et al. 1994b), a subset of the BROWN corpus with about 200,000 words in which all content words (nouns, verbs, adjectives, and adverbs) in a running text have manually been tagged with senses from WORDNET; and the DSO corpus (Ng and Lee 1996), consisting of approximately 192,800 word occurrences of the most frequently occurring (and, hence, most ambiguous) 121 nouns and 70 verbs in English. The last three corpora are publicly available from New Mexico State University, Princeton University, and the Linguistic Data Consortium (LDC), respectively.

A baseline called the most frequent heuristic has been proposed as a performance measure for WSD algorithms (Gale, Church, and Yarowsky 1992b). This heuristic simply chooses the most frequent sense of a word w and assigns it as the sense of w in test sentences without considering any effect of context. Any WSD algorithm must perform better than the most frequent heuristic to be of any significant value.

For supervised WSD algorithms, an accuracy of about 73 percent to 76 percent is achieved on the line corpus (Mooney 1996; Leacock, Towell, and Voorhees 1993); 87.4 percent on the interest corpus (Ng and Lee 1996); 69.0 percent on the SEMCOR corpus (Miller et al. 1994b); and 58.7 percent and 75.2 percent on two test sets from the DSO corpus (Ng 1997a). In general, the disambiguation accuracy depends on the particular words tested, the number of senses to a word, and the test corpus from which the words originate. Verbs are typically harder to disambiguate than nouns, and disambiguation accuracy for words chosen from a test corpus composed of a wide variety of genres and domains, such as the Brown corpus, is lower than the accuracy of words chosen from, say, business articles from the Wall Street Journal (Ng and Lee 1996).

Given an adequate number of sense-tagged training examples (typically a few hundred examples), state-of-the-art WSD programs outperform the most frequent heuristic baseline. For example, figure 4 shows the learning curve of WSD accuracy achieved by LEXAS, an exemplar-based, supervised WSD program (Ng 1997b). The accuracy shown is averaged over 43 words of the DSO corpus, and each of these words has at least 1,300 training examples in the corpus. The figure indicates that the accuracy of supervised WSD is significantly higher than the most frequent heuristic baseline.

The accuracy for unsupervised WSD algorithms is harder to assess because most are tested on different test sets with varying difficulty. Although supervised WSD algorithms have the drawback of requiring a sense-tagged corpus, they tend to give a higher accuracy compared with unsupervised WSD algorithms. For example, when tested on SEMCOR, Resnik's (1997) unsupervised algorithm achieved accuracies in the range of 35.3 percent to 44.3 percent for ambiguous nouns. Because WORDNET orders the sense definitions of each word from the most frequent to the least frequent, the baseline method of the most frequent heuristic can be realized by always picking sense one of WORDNET. As reported in Miller et al. (1994b), such a most frequent heuristic baseline achieved an accuracy of 58.2 percent for all ambiguous words.

Similarly, the conceptual density method of


Agirre and Rigau (1996) was also implemented and tested on a subset of the MUC-4 terrorist corpus in Peh and Ng (1997). The results indicate that on this test corpus, the accuracy achieved by the conceptual density method is still below the most frequent heuristic baseline. Recently, Pedersen and Bruce (1997a, 1997b) also used both supervised and unsupervised learning algorithms to disambiguate a common set of 12 words. Their results indicate that although supervised algorithms give average accuracies in excess of 80 percent, the accuracies of unsupervised algorithms are about 66 percent, and the accuracy of the most frequent heuristic is about 73 percent. However, the accuracies of unsupervised algorithms on disambiguating a subset of five nouns are better than the most frequent baseline (62 percent versus 57 percent). These five nouns have two to three senses each. In general, although WSD for coarse-grained, binary word-sense distinctions can yield accuracy in excess of 90 percent, as reported in Yarowsky (1995), disambiguating word senses to the refined sense distinctions of, say, the WORDNET dictionary is still a challenging task, and much research is still necessary to achieve broad-coverage, high-accuracy WSD (Ng 1997b).

    Semantic Parsing

The research discussed to this point involves using empirical techniques to disambiguate word sense, a difficult subpart of the semantic-interpretation process. This section considers research that goes a step farther by using empirical approaches to automate the entire process of semantic interpretation.

Abstractly viewed, semantic interpretation can be considered as a problem of mapping some representation of a natural language input into a structured representation of meaning in a form suitable for computer manipulation. We call this the semantic parsing problem. The semantic parsing model of language acquisition has a long history in AI research (Langley 1982; Selfridge 1981; Sembugamoorthy 1981; Anderson 1977; Reeker 1976; Siklossy 1972; Klein and Kuppin 1970).

Figure 4. Effect of Number of Training Examples on LEXAS Word-Sense Disambiguation Accuracy, Averaged over 43 Words with at Least 1,300 Training Examples. (WSD accuracy in percent, from 55.0 to 65.0, plotted against the number of training examples, from 0 to 1,400, for LEXAS and for the most frequent heuristic.)


This early work focused on discovering learning mechanisms specific to language acquisition and the modeling of cognitive aspects of human language learning. The more recent work discussed here is based on the practical requirements of robust NLP for realistic NLP systems. These approaches use general-purpose machine-learning methods and apply the resulting systems on much more elaborate NLP domains.

Modern approaches to semantic parsing can vary significantly in both the form of input and the form of output. For example, the input might be raw data, strings of phonemes spoken to a speech-recognition system or a sentence typed by a user, or it might be significantly preprocessed to turn the linear sequence of sounds or words into a more structured intermediate representation such as a parse tree that captures the syntactic structure of a sentence. Similarly, the type of output that might be useful is strongly influenced by the task at hand because as yet, there is little agreement on what a general meaning-representation language should look like.

Research on empirical methods in semantic parsing has focused primarily on the creation of natural language front ends for database querying. In this application, a sentence submitted to the semantic parser is some representation of a user's question, and the resulting meaning is a formal query in a database query language that is then submitted to a database-management system to retrieve the (hopefully) appropriate answers. Figure 5 illustrates the type of mapping that might be required, transforming a user's request into an appropriate SQL command.

Sentence: Show me the morning flights from Boston to Denver.

Meaning: SELECT flight_id
         FROM flights
         WHERE from_city = Boston
         AND to_city = Denver
         AND departure_time

The database query task has long been a touchstone in NLP research. The potential of allowing inexperienced users to retrieve useful information from computer archives was recognized early on as an important application of computer understanding. The traditional approach to constructing such systems has been to use the current linguistic representation to build a set of rules that transforms input sentences into appropriate queries. Thus, query systems have been constructed using augmented transition networks (Woods 1970), an operationalization of context-free grammars with associated actions to produce semantic representations; semantic grammars (Hendrix, Sagalowicz, and Slocum 1978; Brown and Burton 1975), context-free grammars having nonterminal symbols that represent the expression of domain-specific concepts; and logic grammars (Abramson and Dahl 1989; Warren and Pereira 1982), general phrase-structure grammars that encode linguistic dependencies and structure-building operations using logical unification.

Traditional approaches have suffered a number of shortcomings related to the complexity of the knowledge that must be expressed. It takes considerable linguistic expertise to construct an appropriate grammar. Furthermore, the use of domain-specific knowledge is required to correctly interpret natural language input. To handle this knowledge efficiently, it is often intertwined with the rules for representing knowledge about language itself. As the sophistication of the input and the database increase, it becomes exceedingly difficult to craft an accurate and relatively complete interface. The bottom line is that although it is possible to construct natural language interfaces by hand, the process is time and expertise intensive, and the resulting interfaces are often incomplete, inefficient, and brittle.

Database applications are an attractive proving ground for empirical NLP. Although challenging, the understanding problem has built-in constraints that would seem to make it a feasible task for automatic acquisition. The problem is circumscribed both by the limited domain of discourse found in a typical database and the simple communication goal, namely, query processing. It is hoped that techniques of empirical NLP might enable the rapid development of robust domain-specific database interfaces with less overall time and effort.

An empirical approach to constructing natural language interfaces starts with a training corpus comprising sentences paired with appropriate translations into formal queries. Learning algorithms are utilized to analyze the training data and produce a semantic parser that can map subsequent input sentences into


appropriate queries. The learning problem is depicted in figure 6. Approaches to this problem can be differentiated according to the learning techniques used. As in other areas of empirical NLP, some researchers have attempted to extend statistical approaches, which have been successful in domains such as speech recognition, to the semantic parsing problem (Miller et al. 1996; Miller et al. 1994a; Pieraccini, Levin, and Lee 1991). Kuhn and De Mori (1995) describe a method based on semantic classification trees, a variant of traditional decision-tree induction approaches familiar in machine learning. The CHILL system (Zelle and Mooney 1996; Zelle 1995) uses techniques from inductive logic programming, a subfield of machine learning that investigates the learning of relational concept definitions.

    A Statistical Approach

Research at BBN Systems and Technologies as part of the Advanced Research Projects Agency-sponsored Air Travel Information Service (ATIS) Project (Miller et al. 1994a) represents a paradigm case of the application of statistical modeling techniques to the semantic parsing problem. The target task in the ATIS competition is the development of a spoken natural language interface for air travel information. In the BBN approach, the heart of the semantic parser is a statistical model that represents the associations between strings of words and meaning structures. Parsing of input is accomplished by searching the statistical model in an attempt to find the most likely meaning structure given the input sentence. The learning problem is to set the parameters of the statistical model to accurately reflect the relative probabilities of various meanings, given a particular input.

Representations

Meanings are represented as a tree structure, with nodes in the tree representing concepts. The children of a node represent its component concepts. For example, the concept of a flight might include component concepts such as airline, flight number, origin, and destination. Figure 7 shows a tree that might represent the sentence from our previous example. The nodes shown in circles are the nonterminal nodes of the tree and represent concepts, and the rounded rectangles are terminal nodes and represent something akin to lexical categories. This tree essentially represents the type of analysis that one might obtain from a semantic grammar, a popular approach to the construction of database interfaces.

The basic requirements of the representation are that the semantic concepts are hierarchically nested and that the order of nodes in the tree matches the ordering of words in the sentence (that is, the tree can be placed in correspondence with the sentence without any crossing edges). This meaning representation is not a query language as such, but partially specified trees form a basic framelike annotation scheme that is automatically translated into a database query. The frame representation of our example is shown in figure 8.

Learning a Statistical Model

In recent incarnations (Miller et al. 1996), the construction of the frame representation is a two-step process: First, a statistical parsing process is used to produce a combined syntactic-semantic parse tree. This tree is based on a simple syntactic parsing theory, with nodes labeled both for syntactic role and semantic category. For example, the flight node of figure 7 would be labeled flight/np to show it is a node representing the concept of a flight and plays the role of a noun phrase in the syntax of the sentence. This tree structure is then augmented with frame-building operations. The top node in the tree is associated with the decision of what the overall frame should be, and internal nodes of the tree are associated with slot-filling operations for the frame. Each node gets tagged as filling some particular slot in the frame (for example, origin) or the special null tag to indicate it does not directly fill a slot. The input to the learning phase is a set of sentences paired with syntactic-semantic parse trees that are annotated with frame-building actions.
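The annotated trees that serve as training input might be represented along the following lines. The node and slot labels follow the examples in the text (flight/np, origin), but the data structure itself is only an illustration, not the representation used by the BBN system.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    semantic: str                    # concept label, e.g. "flight", "origin", "city"
    syntactic: str                   # syntactic role, e.g. "np", "pp", "indicator"
    slot: Optional[str] = None       # frame slot this node fills ("origin", ...) or None (null tag)
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None       # set on terminal nodes only

# A fragment of the tree for "... flights from Boston ...":
boston = Node("city", "name", word="Boston")
origin_pp = Node("origin", "pp", slot="origin",
                 children=[Node("origin", "indicator", word="from"), boston])
flight_np = Node("flight", "np",
                 children=[Node("flight", "indicator", word="flights"), origin_pp])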

The initial parsing process is similar to other approaches proposed for statistically based syntactic parsing. Because our main focus here is on semantic mapping, we do not go into the details; however, a general overview is in order. The statistical model treats the derivation of the parse as the traversal of a path through a probabilistic recursive transition network.


Figure 6. Learning Semantic Mappings. (Training examples feed a learning system that produces a parser; the parser maps a sentence to its meaning.)


Probabilities on transitions have the form P(state_n | state_n-1, state_up). That is, the probability of a transition to state_n depends on the previous state and the label of the network we are currently traversing. For example, P(location/pp | arrival/vp_head, arrival/vp) represents the probability that within a verb phrase describing an arrival, a location/pp node (a prepositional phrase describing a location) occurs immediately following an arrival/vp_head node (the main verb in a phrase associated with an arrival). The probability metric assigned to a parse is then just the product of each transition probability along the path corresponding to the parse tree. The transition probabilities for the parsing model are estimated by noting the empirical probabilities of the transitions in the training data.

In the database interface task, the trained parsing model is searched to produce the n most probable parse trees corresponding to the sequence of words in an input sentence. These n best parses are then submitted to the semantic interpretation model for annotation with frame-building operations. As mentioned, the meanings Ms are decomposed into a decision about the frame type FT and the slot fillers S. The probability of a particular meaning Ms given a tree is calculated as

P(Ms | T) = P(FT, S | T) = P(FT) P(T | FT) P(S | FT, T).

That is, the probability of a particular frame and set of slot fillers is determined by the a priori likelihood of the frame type, the probability of the parse tree given the frame type, and the probability that the slots have the assigned values given the frame type and the parse tree. P(FT) is estimated directly from the training data. P(T | FT) is obtained by rescoring the transition probabilities in the parse tree, as described earlier, but with the extra frame-type information. Finally, the probabilistic model for slot filling assumes that fillers can accurately be predicted by considering only the frame type, the slot operations already performed, and the local parse tree context. This local context includes the node itself, two left siblings, two right siblings, and four immediate ancestors.
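The transition model described above can be estimated by simple counting. The following sketch scores a parse as a product of transition probabilities; the state bookkeeping is deliberately simplified and is not BBN's implementation.

from collections import Counter
import math

def estimate_transition_probs(parse_paths):
    # parse_paths: list of traversals, each a list of (state, parent_state) pairs
    # in the order the parse network is traversed.
    counts = Counter()
    context_counts = Counter()
    for path in parse_paths:
        prev = "<start>"
        for state, parent in path:
            counts[(state, prev, parent)] += 1     # N(state_n, state_n-1, state_up)
            context_counts[(prev, parent)] += 1    # N(state_n-1, state_up)
            prev = state
    return {key: counts[key] / context_counts[key[1:]] for key in counts}

def log_parse_score(path, probs, floor=1e-9):
    # log P(parse) = sum of log P(state_n | state_n-1, state_up) along the path.
    score, prev = 0.0, "<start>"
    for state, parent in path:
        score += math.log(probs.get((state, prev, parent), floor))
        prev = state
    return score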

FRAME: Air-Transportation
SHOW: (flight-information)
ORIGIN: (City Boston)
DEST: (City Denver)
TIME: (part-of-day MORNING)

Figure 8. Frame Representation of Flight-Information Query.

Figure 7. A Tree-Structured Meaning Representation for the sentence "Show me the morning flights from Boston to Denver." (Nonterminal concept nodes such as show, flight, time, origin, and dest dominate terminal categories such as show indicator, flight indicator, part-of-day, origin indicator, dest indicator, and city name.)


The final probability of a set of slot fillings is taken as the product of the likelihood of each individual slot-filling choice. The training examples are insufficient for the direct estimation of all the parameters in the slot-filling model, so the technique relies on heuristic methods for growing statistical decision trees (Magerman 1994).

Using the learned semantic model is a multiphase process. The n best syntactic-semantic parse trees from the parsing model are first rescored to determine P(T | FT) for each possible frame type. Each of those models is combined with the corresponding prior probability of the frame type to yield P(FT) P(T | FT). The n best of these theories then serve as candidates for consideration with possible slot fillers. A beam search is utilized to consider possible combinations of fill operations. The final result is an approximation of the n most likely semantic interpretations.

The parsing and semantic analysis components described here have been joined with a statistically based discourse module to form a complete, trainable natural language interface. The system has been evaluated on some samples from the ATIS corpus. After training on 4,000 annotated examples, the system had a reported error rate of 21.6 percent on a disjoint set of testing examples. Although this performance would not put the system at the top of contenders in the ATIS competition, it represents an impressive initial demonstration of a fully trainable statistical approach to the ATIS task.

    Semantic Classification Trees

The CHANEL system (Kuhn and De Mori 1995) takes an approach to semantic parsing based on a variation of traditional decision-tree induction called semantic classification trees (SCTs). This technique has also been applied to the ATIS database-querying task.

Representations

The final representations from the CHANEL system are database queries in SQL. Like the statistical approach discussed previously, however, the output of semantic interpretation is not a directly executable query but an intermediate form that is then turned into an SQL query. The meaning representation consists of a set of features to be displayed and a set of constraints on these features. The representation of our sentence is depicted in figure 9. Basically, this is a listing of the subset of all possible attributes that might be displayed and a subset of all possible constraints on attributes. Using this representation reduces the problem

    In the CHANEL system, the input to the se-mantic interpretation process is a partiallyparsed version of the original sentence. Thisinitial processing is performed by a handcraft-

    ed local chart parser that scans through the in-put sentence for words or phrases that carryconstraints relevant to the domain of the data-base. The target phrases are replaced by vari-able symbols, and the original content of eachsymbol is saved for later insertion into themeaning representation. In our example sen-tence, the chart parser might produce an out-put such as show me the TIM flights from CITto CIT. Here, TIM is the symbol for a word orphrase indicating a time constraint, and CIT isthe symbol for a reference to a city. This sen-tence would then be sent to an SCT-based ro-bust matcher to generate the displayed at-

    tribute list and assign roles to constraints.Learning Semantic Classification TreesAn SCT is a variant of traditional decision treesthat examines a string of symbols and classifiesit into a set of discrete categories. In a typicalclassification tree, each internal node repre-sents a question about the value of some at-tribute in an example. An SCT is a binary deci-sion tree where a node represents a pattern tomatch against the sentence. If the patternmatches, we take the left branch; if it doesnt,then the right branch is taken. When a leafnode is reached, the example is classified withthe category assigned to the leaf.

    The patterns stored in the nodes of an SCTare a restricted form of regular expressionconsisting of sequences of words and arbitrar-ily large gaps. For example, the pattern, would match any sentencecontaining the words flight and from, provid-ed they appeared in this order with at leastone word between them. The special symbolsindicate the start and end of the sen-

DISPLAYED ATTRIBUTES = {flight.flight_info}

CONSTRAINTS = {flight.from_city = Boston, flight.to_city = Denver, flight.dep_time


A word in a pattern can also be replaced with a set of words indicating alternation, a choice of words that could be inserted at that point in the pattern.

During training, an SCT is grown using a typical greedy classification-tree-growing strategy. The tree starts as a single node containing the most general pattern, a single gap. Patterns are specialized by replacing a gap with a more specific pattern. If w is a word covered by the gap in some example, then the gap can be replaced with w, w+, +w, or +w+. The tree-growing algorithm follows the standard practice of testing all possible single specializations to see which produces the best split of examples according to some metric of the purity of the leaves. This specialization is then implemented, and the growing algorithm is recursively applied to each of the leaves until all leaves contain sets of examples that cannot be split any further. This occurs when all the examples in the leaf belong to the same category, or the pattern at the leaf contains no gaps.
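One way to realize the gap patterns and their single-step specializations is sketched below. This is our own rendering of the scheme, not CHANEL's code; patterns are represented as token lists in which + stands for a gap.

def matches(pattern, words):
    # pattern: list of tokens, where "+" stands for a gap of one or more words.
    # words: the sentence as a list of tokens.  Matching is anchored at both ends.
    def match_from(p, w):
        if p == len(pattern):
            return w == len(words)                  # pattern used up: sentence must be too
        tok = pattern[p]
        if tok == "+":                              # gap: consume at least one word
            return any(match_from(p + 1, w2) for w2 in range(w + 1, len(words) + 1))
        return w < len(words) and words[w] == tok and match_from(p + 1, w + 1)
    return match_from(0, 0)

def specializations(pattern, word):
    # Single-step specializations of a gap with word w: w, w+, +w, +w+ (as in the text).
    out = []
    for i, tok in enumerate(pattern):
        if tok == "+":
            for repl in ([word], [word, "+"], ["+", word], ["+", word, "+"]):
                out.append(pattern[:i] + repl + pattern[i + 1:])
    return out

# matches(["flight", "+", "from"], "show me the flight leaving from boston".split())      -> False (anchored)
# matches(["+", "flight", "+", "from", "+"], "show me the flight leaving from boston".split()) -> True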

Applying Semantic Classification Trees to Semantic Interpretation

Semantic interpretation is done by a robust matcher consisting of forests of SCTs. Determining the list of displayed attributes is straightforward; there is a forest of SCTs, one for each possible attribute. Each SCT classifies a sentence as either yes or no. A yes classification indicates that the attribute associated with this tree should be displayed for the query and, hence, included in the listing of displayed attributes. A no classification results in the attribute being left out of the list.

Creating the list of constraints is a bit more involved. In this case, SCTs are used to classify substrings within the sentence. There is a tree for each symbol (for example, CIT, TIM) that can be inserted by the local parser. This tree is responsible for classifying the role that this symbol fulfills, wherever it occurs in the sentence. To identify the role of each symbol in the sentence, the symbol being classified is specially marked. To classify the first CIT symbol in our example, the sentence "show me the TIM flights from *CIT to CIT" would be submitted to the role-classification tree for CIT. The * marks the instance of CIT that is being classified. This tree would then classify the entire sentence pattern according to the role of *CIT. Leaves of the tree would have labels such as origin, destination, or scrap, the last indicating that it should not produce any constraint. The constraint is then built by binding the identified role to the value for the symbol that was originally extracted by the local parser in the syntactic phase.

It is difficult to assess the effectiveness of the semantic mapping component alone because it is embedded in a larger system, including many hand-coded parts. However, some preliminary experiments with the semantic mapping portion of CHANEL indicated that it was successful in determining the correct set of displayed attributes in 91 percent of sentences held back for testing. Overall performance on the ATIS measures placed the system near the middle of the 10 systems evaluated in December 1993.

Inductive Logic Programming

Both the statistical and SCT approaches are based on representations that are basically propositional. The CHILL system (Zelle and Mooney 1996; Zelle 1995), which has also been applied to the database query problem, is based on techniques from inductive logic programming for learning relational concepts. The representations used by CHILL differ substantially from those used by CHANEL. CHILL has been demonstrated on a database task for U.S. geography, where the database is encoded as a set of facts in Prolog, and the query language is a logical form that is directly executed by a query interpreter to retrieve appropriate answers from the database.

Representations

The input to CHILL is a set of training instances consisting of sentences paired with the desired queries. The output is a shift-reduce parser in Prolog that maps sentences into queries. CHILL treats parser induction as a problem of learning rules to control the actions of a shift-reduce parser expressed as a Prolog program. Control rules are expressed as definite-clause (Prolog) concept definitions. These rules are induced using a general concept-learning system employing techniques from inductive logic programming, a subfield of machine learning that addresses the problem of learning definite-clause logic descriptions (Prolog programs) (Lavrac and Dzeroski 1994; Muggleton 1992) from examples.

As mentioned previously, CHILL produces parsers that turn sentences directly into a logical form suitable for direct interpretation. The meaning of our example sentence might be represented as

answer(F, time(F,T), flight(F), morning(T), origin(F,O), equal(O, city(boston)), destination(F,D), equal(D, city(denver))).

The central insight in CHILL is that the general operators required for a shift-reduce parser to produce a given set of sentence analyses are directly inferable from the representations themselves. For example, building the previous representation requires operators to introduce logical terms such as time(_,_), operators




    The current context of a parse is contained in the contents of the stack and the remaining input buffer. CHILL uses parses of the training examples to figure out the contexts in which each of the inferred operators is and is not applicable. These contexts are then given to a general induction algorithm that learns rules to classify the contexts in which each operator should be used. Because the contexts are arbitrarily complex parser states involving nested (partial) constituents, CHILL uses a learning algorithm that can deal with structured input and produce relational concept descriptions, which is exactly the problem addressed by inductive logic programming research.
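A sketch of what such contexts and an induced control rule might look like, reusing the invented state(Stack, Buffer) encoding from the earlier operator sketch; the example states and the learned conditions are fabricated purely for illustration.

    % Contexts in which the embed operator was and was not applied while
    % parsing the training examples (invented data).
    positive(embed, state([time(f1, t1), answer(f1, flight(f1))], [])).
    negative(embed, state([flight(f1), answer(f1, true)], [morning, flights])).

    % A control rule the ILP learner might induce from such contexts:
    % embed only when the top of the stack is a completed time/2 term and
    % the input buffer has been consumed.
    embed_applicable(state([Top, answer(_, _)|_], [])) :-
        functor(Top, time, 2).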

    Learning a Shift-Reduce Parser

    Figure 10 shows the basic components of CHILL. During parser operator generation, the training examples are analyzed to formulate an overly general shift-reduce parser that is capable of producing parses from sentences. The initial parser simply consists of all the parsing operators that can be inferred from the meaning representations in the training examples. In an initial parser, an operator can be applied at any point in a parse, which makes it wildly overly general in that it can produce a great many spurious analyses for any given input sentence. In example analysis, the training examples are parsed using the overly general parser to extract contexts in which the various parsing operators should and should not be employed. The parsing of the training examples is guided by the form of the output (the database query) to determine the correct sequence of operators yielding the desired output. Control-rule induction then uses a general inductive logic programming algorithm to learn rules that characterize these contexts. The inductive logic programming system analyzes a set of positive and negative examples of a concept and formulates a set of Prolog rules that succeed for the positive examples and fail for the negative examples. For each parsing operator, the induction algorithm is called to learn a definition of the parsing contexts in which this operator should be used. Finally, program specialization folds the learned control rules back into the overly general parser to produce the final parser, a shift-reduce parser that only applies operators to parsing states for which they have been deemed appropriate.


    Figure 10. The CHILL Architecture. (Training examples feed a parsing-operator generator, which yields an overly general parser; example analysis produces control examples; control-rule induction turns these into control rules; and program specialization folds the rules back into the parser to give the final Prolog parser.)


    As mentioned previously, the CHILL system was tested on a database task for U.S. geography. Sample questions in English were obtained by distributing a questionnaire to 50 uninformed subjects. The questionnaire provided the basic information from the online tutorial supplied for an existing natural language database application, including a verbatim list of the type of information in the database and sample questions that the system could answer. The result was a corpus of 250 sentences that were then annotated with the appropriate database queries. This set was then split into a training set of 225 examples, with 25 held out for testing.

    Tests were run to determine whether the final application produced the correct answer to previously unseen questions. Each test sentence was parsed to produce a query. This query was then executed to extract an answer from the database. The extracted answer was then compared to the answer produced by the correct query associated with the test sentence. Identical answers were scored as a correct parse; any discrepancy was scored as a failure. Figure 11 shows the accuracy of CHILL's parsers averaged over 10 trials. The line labeled Geobase shows the average accuracy of the preexisting hand-coded interface on these 10 test sets of 25 sentences. The curves show that CHILL outperforms the hand-coded system when trained on 175 or more examples. In the best trial, CHILL's induced parser, comprising 1,100 lines of Prolog code, achieved 84-percent accuracy in answering novel queries.
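A sketch of this evaluation protocol in Prolog follows; the parser is passed in as a predicate that maps a sentence to a term of the form query(Answer, Goal), and that query shape, like the helper names, is an assumption made for illustration.

    % correct(+Parser, +Sentence, +GoldQuery): a parse is scored correct
    % only if the answers its query retrieves match those of the gold query.
    correct(Parser, Sentence, query(GX, GoldGoal)) :-
        call(Parser, Sentence, query(X, Goal)),
        findall(X, Goal, Got),
        findall(GX, GoldGoal, Want),
        msort(Got, Answers),
        msort(Want, Answers).

    % Accuracy over a list of Sentence-GoldQuery test pairs.
    accuracy(Parser, Pairs, Accuracy) :-
        count_correct(Parser, Pairs, 0, Right),
        length(Pairs, N),
        Accuracy is Right / N.

    count_correct(_, [], Right, Right).
    count_correct(Parser, [S-Q|Rest], R0, Right) :-
        (   correct(Parser, S, Q)
        ->  R1 is R0 + 1
        ;   R1 = R0
        ),
        count_correct(Parser, Rest, R1, Right).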

    Figure 11. Geoquery: Accuracy. (Parsing accuracy as a function of the number of training examples, for CHILL and the hand-coded Geobase system.)



    Discussion and Future Directions

    Unfortunately, it is difficult to directly compare the effectiveness of these three approaches to the semantic parsing problem. They have not been tested head to head on identical tasks, and given the various differences, including how they are embedded in larger systems and the differing form of input and output, any such comparison would be difficult. What can be said is that each has been successful in the sense that it has been demonstrated as a useful component of an NLP system. In the case of systems that compete in ATIS evaluations, we can assume that designers of systems with empirically based semantic components have chosen to use these approaches because they offered leverage over hand-coded alternatives. There is no requirement that these systems be corpus-based learning approaches. The previous CHILL example shows that an empirical approach outperformed a preexisting hand-coded system at the level of a complete natural language application. Of course, none of these systems is a complete answer, and there is much room for improvement in accuracy. However, the overall performance of these three approaches suggests that corpus-based techniques for semantic parsing do hold promise.

    Another dimension for consideration is the flexibility offered by these approaches. Particularly interesting is the question of how easily they might be applied to new domains. All three approaches entail a combination of hand-coded and automatically acquired components.

    In the case of a purely statistical approach, the effort of crafting a statistical model should carry over to new domains, provided the model really is sufficient to capture the decisions that are required. Creating statistical models involves the identification of specific contexts over which decisions are made (for example, deciding which parts of a parse-tree context are relevant for a decision about slot filling). If appropriate, tractable contexts are found for a particular combination of input language and output representation, most of the effort in porting to a new database would then fall on the annotation process. Providing a large corpus of examples with detailed syntactic-semantic parse trees and frame-building operations would still seem to require a fair amount of effort and linguistic expertise.

    In the case of SCTs, the annotation task seems straightforward; in fact, the necessary information can probably be extracted directly from SQL queries. In this case, it seems that most of the effort would be in designing the local (syntactic) parser to find, tag, and correctly translate the form of the constraints. Perhaps future research would allow the automation of this component so that the system could learn directly from sentences paired with SQL queries.

    The inductive logic programming approach is perhaps the most flexible in that it learns directly from sentences paired with executable queries. The effort in porting to a new domain would be mainly in constructing a domain-specific meaning-representation language to express the types of query that the system should handle. Future research into constructing SQL-type queries within a shift-reduce parsing framework might alleviate some of this burden.

    The CHILL system also provides more flexibility in the type of representation produced. CHILL builds structures with arbitrary embeddings, allowing queries with recursive nestings. For example, it can handle sentences such as "What state borders the most states?" It is difficult to see how systems relying on a flat listing of possible attributes or slot fillers could be modified to handle more sophisticated queries, where arbitrary queries might be embedded inside the top-level query.
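For instance, one plausible rendering of that question in a nested, Geoquery-style representation (the exact syntax CHILL uses may differ) embeds the borders relation under a most operator instead of filling a flat list of slots:

    % "What state borders the most states?"  (illustrative nesting only)
    answer(S, most(S, C, (state(S), next_to(S, C), state(C)))).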

    Another consideration is the efficiency of these learning approaches. The simpler learning mechanisms such as statistical and decision-tree methods can be implemented efficiently and, hence, can be used with large corpora of training examples. Although training times are not reported for these approaches, they are probably somewhat faster than for inductive logic programming learning algorithms. Inductive logic programming algorithms are probably still too slow to be used practically on corpora of more than a few thousand examples. The flip side of efficiency is how many training examples are required to actually produce an effective interface. The reported experiments with CHILL suggest that (at least for some domains) a huge number of examples might not be required. The training time for the geography database experiments was modest (on the order of 15 minutes of central-processing-unit time on a midlevel workstation). In general, much more research needs to be done investigating the learning efficiency of these various approaches.

    Finally, it is worth speculating on the overall utility of learning approaches to constructing natural language database interfaces. Some might argue that empirical approaches simply replace hand-built rule sets with hand-built corpus annotations. Is the construction of suitable corpora even feasible? We believe that corpus-based approaches will continue to make significant gains. Even in the traditional approach, the construction of a suitable corpus is an important component in interface design. From this corpus, a set of rules is hypothesized and tested. The systems are improved by examining performance on novel examples and trying to fix the rules so that they can handle previously incorrect examples. Of course, changes that fix some inputs might have deleterious effects on sentences that were previously handled correctly. The result is a sort of grammar-tweaking and regression-testing cycle to ensure that overall progress is being made. Empirical approaches follow a similar model where the burden of creating a set of rules that is consistent with the examples is placed squarely on an automated learning component rather than on human analysts. The use of an automated learning component frees the system designer to spend more time on collecting and analyzing a larger body of examples, thus improving the overall coverage and quality of the interface.

    Another possibility is that techniques of active learning might be used to automatically identify or construct the examples that the learning system considers most informative. Active learning might significantly reduce the size of the corpus required to achieve good interfaces. Indeed, extending corpus-based approaches to the entire semantic mapping task, for example, where training samples might consist of English sentences paired with SQL queries, offers intriguing possibilities for the rapid development of database interfaces. Formal database queries could be collected during the normal day-to-day operations of the database. After some period of use, the collected queries could be glossed with natural language questions and the paired examples fed to a learning system that produces a natural language interface. With a natural language interface in place, examples could continue to be collected: inputs that are not correctly translated would be glossed with the correct SQL query and fed back into the learning component. In this fashion, the construction and refinement of the interface might be accomplished as a side effect of normal use without the investment of significant time or linguistic expertise.


    Conclusion

    The field of NLP has witnessed an unprecedented surge of interest in empirical, corpus-based learning approaches. Inspired by their successes in speech-recognition research, and their subsequent successful application to part-of-speech tagging and syntactic parsing, many NLP researchers are now turning to empirical techniques in their attempts to solve the long-standing problem of semantic interpretation of natural languages. The initial results of corpus-based WSD look promising, and it has the potential to significantly improve the accuracy of information-retrieval and machine-translation applications. Similarly, the construction of a complete natural language query system to databases using corpus-based learning techniques is exciting. In summary, given the current level of enthusiasm and interest, it seems certain that empirical approaches to semantic interpretation will remain an active and fruitful research topic in the coming years.

    References

    Abramson, H., and Dahl, V. 1989. Logic Grammars. New York: Springer-Verlag.

    Agirre, E., and Rigau, G. 1996. Word-Sense Disambiguation Using Conceptual Density. In Proceedings of the Sixteenth International Conference on Computational Linguistics. Somerset, N.J.: Association for Computational Linguistics.

    Anderson, J. R. 1977. Induction of Augmented Transition Networks. Cognitive Science 1:125–157.

    Brown, J. S., and Burton, R. R. 1975. Multiple Representations of Knowledge for Tutorial Reasoning. In Representation and Understanding, eds. D. G. Bobrow and A. Collins. San Diego, Calif.: Academic.

    Brown, P. F.; Della Pietra, S. A.; Della Pietra, V. J.; and Mercer, R. L. 1991. Word-Sense Disambiguation Using Statistical Methods. In Proceedings of the Twenty-Ninth Annual Meeting of the Association for Computational Linguistics, 264–270. Somerset, N.J.: Association for Computational Linguistics.

    Bruce, R., and Wiebe, J. 1994. Word-Sense Disambiguation Using Decomposable Models. In Proceedings of the Thirty-Second Annual Meeting of the Association for Computational Linguistics, 139–146. Somerset, N.J.: Association for Computational Linguistics.

    Cardie, C. 1993. A Case-Based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 798–803. Menlo Park, Calif.: American Association for Artificial Intelligence.

    Choueka, Y., and Lusignan, S. 1985. Disambiguation by Short Contexts. Computers and the Humanities 19:147–157.

    Cost, S., and Salzberg, S. 1993. A Weighted Nearest-Neighbor Algorithm for Learning with Symbolic Features. Machine Learning 10(1): 57–78.

    Cover, T. M., and Hart, P. 1967. Nearest-Neighbor Pattern Classification. IEEE Transactions on Information Theory 13(1): 21–27.

    Dagan, I., and Itai, A. 1994. Word-Sense Disambiguation Using a Second-Language Monolingual Corpus. Computational Linguistics 20(4): 563–596.

    Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. New York: Wiley.




    Gale, W.; Church, K. W.; and Yarowsky, D. 1992a. A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities 26:415–439.

    Gale, W.; Church, K. W.; and Yarowsky, D. 1992b. Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs. In Proceedings of the Thirtieth Annual Meeting of the Association for Computational Linguistics, 249–256. Somerset, N.J.: Association for Computational Linguistics.

    Hendrix, G. G.; Sacerdoti, E.; Sagalowicz, D.; and Slocum, J. 1978. Developing a Natural Language Interface to Complex Data. ACM Transactions on Database Systems 3(2): 105–147.

    Hirst, G. 1987. Semantic Interpretation and the Resolution of Ambiguity. Cambridge, U.K.: Cambridge University Press.

    Kelly, E., and Stone, P. 1975. Computer Recognition of English Word Senses. Amsterdam: North-Holland.

    Klein, S., and Kuppin, M. A. 1970. An Interactive, Heuristic Program for Learning Transformational Grammars, Technical Report, TR-97, Computer Sciences Department, University of Wisconsin.

    Kuhn, R., and De Mori, R. 1995. The Application of Semantic Classification Trees to Natural Language Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(5): 449–460.

    Langley, P. 1982. Language Acquisition through Error Recovery. Cognition and Brain Theory 5.

    Lavrac, N., and Dzeroski, S. 1994. Inductive Logic Programming: Techniques and Applications. London: Ellis Horwood.

    Leacock, C.; Towell, G.; and Voorhees, E. 1993. Corpus-Based Statistical Sense Resolution. In Proceedings of the ARPA Human Language Technology Workshop, 260–265. Washington, D.C.: Advanced Research Projects Agency.

    Lin, D. 1997. Using Syntactic Dependency as Local Context to Resolve Word-Sense Ambiguity. In Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics. Somerset, N.J.: Association for Computational Linguistics.

    Luk, A. K. 1995. Statistical Sense Disambiguation with Relatively Small Corpora Using Dictionary Definitions. In Proceedings of the Thirty-Third Annual Meeting of the Association for Computational Linguistics, 181–188. Somerset, N.J.: Association for Computational Linguistics.

    Magerman, D. M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University.

    Miller, G. A., ed. 1990. WORDNET: An Online Lexical Database. International Journal of Lexicography 3(4): 235–312.

    Miller, S.; Bobrow, R.; Ingria, R.; and Schwartz, R. 1994a. Hidden Understanding Models of Natural Language. In Proceedings of the Thirty-Second Annual Meeting of the Association for Computational Linguistics, 25–32. Somerset, N.J.: Association for Computational Linguistics.

    Miller, G. A.; Chodorow, M.; Landes, S.; Leacock, C.; and Thomas, R. G. 1994b. Using a Semantic Concordance for Sense Identification. In Proceedings of the ARPA Human Language Technology Workshop, 240–243. Washington, D.C.: Advanced Research Projects Agency.

    Miller, S.; Stallard, D.; Bobrow, R.; and Schwartz, R. 1996. A Fully Statistical Approach to Natural Language Interfaces. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, 55–61. Somerset, N.J.: Association for Computational Linguistics.

    Mooney, R. J. 1996. Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 82–91. Somerset, N.J.: Association for Computational Linguistics.

    Mooney, R. J. 1995. Encouraging Experimental Results on Learning CNF. Machine Learning 19(1): 79–92.

    Muggleton, S. H., ed. 1992. Inductive Logic Programming. San Diego, Calif.: Academic.

    Ng, H. T. 1997a. Exemplar-Based Word-Sense Disambiguation: Some Recent Improvements. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 208–213. Somerset, N.J.: Association for Computational Linguistics.

    Ng, H. T. 1997b. Getting Serious about Word-Sense Disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? 1–7. Somerset, N.J.: Association for Computational Linguistics.

    Ng, H. T., and Lee, H. B. 1996. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, 40–47. Somerset, N.J.: Association for Computational Linguistics.

    Pedersen, T., and Bruce, R. 1997a. A New Supervised Learning Algorithm for Word-Sense Disambiguation. In Proceedings of the Fourteenth National Conference on Artificial Intelligence. Menlo Park, Calif.: American Association for Artificial Intelligence.

    Pedersen, T., and Bruce, R. 1997b. Distinguishing Word Senses in Untagged Text. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 197–207. Somerset, N.J.: Association for Computational Linguistics.

    Pedersen, T.; Bruce, R.; and Wiebe, J. 1997. Sequential Model Selection for Word-Sense Disambiguation. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 388–395. Somerset, N.J.: Association for Computational Linguistics.

    Peh, L. S., and Ng, H. T. 1997. Domain-Specific Semantic-Class Disambiguation Using WORDNET. In Proceedings of the Fifth Workshop on Very Large Corpora, 56–64. Somerset, N.J.: Association for Computational Linguistics.

    Pieraccini, R.; Levin, E.; and Lee, C. H. 1991. Stochastic Representation of Conceptual Structure in the ATIS Task. In Proceedings of the 1991 DARPA Speech and Natural Language Workshop, 121–124. San Francisco, Calif.: Morgan Kaufmann.

    Procter, P. 1978. Longman Dictionary of Contemporary English. London: Longman.

    Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

    Reeker, L. H. 1976. The Computational Study of Language Acquisition. In Advances in Computers, Volume 15, eds. M. Yovits and M. Rubinoff. San Diego, Calif.: Academic.

    Resnik, P. 1997. Selectional Preference and Sense Disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? 52–57. Somerset, N.J.: Association for Computational Linguistics.

    Resnik, P., and Yarowsky, D. 1997. A Perspective on Word-Sense Disambiguation Methods and Their Evaluation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? 79–86. Somerset, N.J.: Association for Computational Linguistics.

    Rivest, R. L. 1987. Learning Decision Lists. Machine Learning 2(3): 229–246.

    Rosenblatt, F. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review 65:386–408.

    Schütze, H. 1992. Dimensions of Meaning. In Proceedings of Supercomputing, 787–796. Washington, D.C.: IEEE Computer Society Press.

    Schütze, H., and Pedersen, J. 1995. Information Retrieval Based on Word Senses. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 161–175. Las Vegas, Nev.: University of Nevada at Las Vegas.

    Selfridge, M. 1981. A Computer Model of Child Language Acquisition. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 106–108. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

    Sembugamoorthy, V. 1981. A Paradigmatic Language-Acquisition System. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 106–108. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

    Siklossy, L. 1972. Natural Language Learning by Computer. In Representation and Meaning: Experiments with Information Processing Systems, eds. H. A. Simon and L. Siklossy. Englewood Cliffs, N.J.: Prentice Hall.

    Stanfill, C., and Waltz, D. 1986. Toward Memory-Based Reasoning. Communications of the Association for Computing Machinery 29(12): 1213–1228.

    Warren, D. H. D., and Pereira, F. C. N. 1982. An Efficient Easily Adaptable System for Interpreting Natural Language Queries. American Journal of Computational Linguistics 8(3–4): 110–122.

    Wilks, Y.; Fass, D.; Guo, C.; McDonald, J. E.; Plate, T.; and Slator, B. M. 1990. Providing Machine-Tractable Dictionary Tools. Machine Translation 5(2): 99–154.

    Woods, W. A. 1970. Transition Network Grammars for Natural Language Analysis. Communications of the Association for Computing Machinery 13:591–606.


    Yarowsky, D. 1995. Unsupervised Word-Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the Thirty-Third Annual Meeting of the Association for Computational Linguistics, 189–196. Somerset, N.J.: Association for Computational Linguistics.

    Yarowsky, D. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the Thirty-Second Annual Meeting of the Association for Computational Linguistics, 88–95. Somerset, N.J.: Association for Computational Linguistics.

    Yarowsky, D. 1993. One Sense per Collocation. In Proceedings of the ARPA Human Language Technology Workshop, 266–271. Washington, D.C.: Advanced Research Projects Agency.

    Yarowsky, D. 1992. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, 454–460. Somerset, N.J.: Association for Computational Linguistics.

    Zelle, J. M. 1995. Using Inductive Logic Programming to Automate the Construction of Natural Language Parsers. Ph.D. thesis, Department of Computer Sciences, University of Texas at Austin.

    Zelle, J. M., and Mooney, R. J. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1050–1055. Menlo Park, Calif.: American Association for Artificial Intelligence.

    Hwee Tou Ng is a group head at DSO National Laboratories, Singapore. He received a B.Sc. in computer science for data management (with high distinction) from the University of Toronto, an M.S. in computer science from Stanford University, and a Ph.D. in computer science from the University of Texas at Austin. His current research interests include natural language processing, information retrieval, and machine learning. He received the Best Nonstudent Paper Award at the 1997 ACM SIGIR conference. His e-mail address is [email protected].

    John Zelle is an assistant professor of computer science at Drake University, where he has been since 1995. He holds a B.S. and an M.S. in computer science from Iowa State University and a Ph.D. in computer sciences from the University of Texas at Austin. He developed the CHILL system as part of his dissertation work and continues to pursue an interest in natural language applications of machine learning. His e-mail address is [email protected].

