
  • 7/31/2019 Lang Process

    1/41

    A Memory-Based Model of Syntactic

    Analysis: Data-Oriented Parsing

    Remko Scha, Rens Bod and Khalil Sima'an

    Institute for Logic, Language and Computation, University of Amsterdam, Spuistraat 134, 1012 VB Amsterdam, The Netherlands

    Abstract

    This paper presents a memory-based model of human syntactic processing: Data-Oriented Parsing. After a brief introduction (section 1), it argues that any account of disambiguation and many other performance phenomena inevitably has an important memory-based component (section 2). It discusses the limitations of probabilistically enhanced competence-grammars, and argues for a more principled memory-based approach (section 3). In sections 4 and 5, one particular memory-based model is described in some detail: a simple instantiation of the "Data-Oriented Parsing" approach ("DOP1"). Section 6 reports on experimentally established properties of this model, and section 7 compares it with other memory-based techniques. Section 8 concludes and points to future work.

    1. Introduction

    Could it be the case that all of human language cognition takes place by means of similarity- and analogy-based processes which operate on a store of concrete past experiences? For those of us who are tempted to give a positive answer to this question, one of the most important challenges consists in describing the processes that deal with syntactic structure.

    A person who knows a language can understand and produce a virtually endless variety of new and unforeseen utterances. To describe precisely how people actually do this is clearly beyond the scope of linguistic theory; some kind of abstraction is necessary. Modern linguistics has therefore focussed its attention on the infinite repertoire of possible sentences (and their structures and interpretations) that a person's conception of a language in principle allows: the person's linguistic "competence".

    In its effort to understand the nature of this "knowledge of language", linguistic theory uses the artificial languages of logic and mathematics as its paradigm


    sources of inspiration. Linguistic research proceeds on the assumption that a language is a well-defined formal code -- that to know a language is to know a non-redundant, consistent set of rules (a "competence grammar"), which establishes unequivocally which word sequences belong to the language, and what their pronunciations, syntactic analyses and semantic interpretations are.

    Language-processing algorithms which are built for practical applications, or which are intended as cognitive models, must address some of the problems that linguistic competence grammars abstract away from. They cannot just produce the set of all possible analyses of an input utterance: in the case of ambiguity, they should pick the most plausible analysis; if the input is uncertain (as in the case of speech recognition) they should pick the most plausible candidate; if the input is corrupted (by typos or spelling mistakes) they should make the most plausible correction.

    A competence-grammar which gives a plausible characterization of the set of possible sentences of a language does no more (and no less) than provide a rather broad framework within which many different models of an individual's language processing capabilities ("performance models") may be specified. To investigate what a performance model of human language processing should look like, we do not have to start from scratch. We may, for instance, look at the ideas of previous generations of linguists and psychologists, even though these ideas did not yet get articulated as mathematical theories or computational models. If we do that, we find one very common idea: that language users produce and understand new utterances by constructing analogies with previously experienced ones. Noam Chomsky has noted, for instance, that this view was held by Bloomfield, Hockett, Jespersen, Paul, Saussure, and "many others" (Chomsky 1966, p. 12).

    This intuitively appealing idea may be summarized as memory-based language-processing (if we want to emphasize that it involves accessing representations of concrete past language experiences), or as analogy-based language-processing (if we want to draw attention to the nature of the process that is applied to these representations). The project of embodying it in a formally articulated model seems a worthwhile challenge. In the next few sections of this paper we will discuss empirical reasons for actually undertaking such a project, and we will report on our first steps in that direction.

    The next section therefore discusses in some detail one particular reason to beinterested in memory-based models: the problem of ambiguity resolution.

    Section 3 will then start to address the technical challenge of designing a mathematical and computational system which complies with our intuitions about the memory-based nature of language-processing, while at the same time doing justice to some insights about syntactic structure which have emerged from the Chomskyan tradition.

    2. Disambiguation and statistics


    to be correct than others -- and past occurrence frequencies may be the most reliable indicator for these likelihoods.

    3. From probabilistic competence-grammars to data-oriented parsing

    In the previous section we saw that the human language processing system seems to estimate the most probable analysis of a new input sentence, on the basis of successful analyses of previously encountered ones. But how is this done? What probabilistic information does the system derive from its past language experiences? The set of sentences that a language allows may best be viewed as infinitely large, and probabilistic information is used to compare alternative analyses of sentences never encountered before. A finite set of probabilities of units and combination operations must therefore be used to characterize an infinite set of probabilities of sentence-analyses.

    This problem can only be solved if a more basic, non-probabilistic one is solved first: we need a characterization of the complete set of possible sentence-analyses of the language. As we saw before, that is exactly what the competence-grammars of theoretical syntax try to provide. Most probabilistic disambiguation models therefore build directly on that work: they characterize the probabilities of sentence-analyses by means of a "stochastic grammar", constructed out of a competence grammar by augmenting the rules with application probabilities derived from a corpus. Different syntactic frameworks have been extended in this way. Examples are Stochastic Context-Free Grammar (Suppes, 1970; Sampson, 1986; Black et al., 1992), Stochastic Lexicalized Tree-Adjoining Grammar (Resnik, 1992; Schabes, 1992), Stochastic Unification-Based Grammar (Briscoe & Carroll, 1993) and Stochastic Head-Driven Phrase Structure Grammar (Brew, 1995).
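    The rule probabilities of such a stochastic grammar are typically estimated from rule-application counts in a corpus, normalized per left-hand side. A minimal Python sketch; the rule inventory and counts below are invented purely for illustration:

```python
# Estimating stochastic CFG rule probabilities from corpus counts:
# P(A -> alpha) = count(A -> alpha) / total count of all rules with
# left-hand side A. Rules and counts here are hypothetical.
from collections import defaultdict

rule_counts = {
    ("S", ("NP", "VP")): 100,
    ("NP", ("Det", "N")): 60,
    ("NP", ("NP", "PP")): 40,
    ("VP", ("V", "NP")): 70,
    ("VP", ("VP", "PP")): 30,
}

def rule_probabilities(counts):
    """Normalize rule counts per left-hand side non-terminal."""
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

probs = rule_probabilities(rule_counts)
print(probs[("NP", ("Det", "N"))])  # 0.6
```

    By construction, the probabilities of all rules sharing a left-hand side sum to one, so the grammar defines a proper distribution over derivations.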

    A statistically enhanced competence grammar of this sort defines all sentences of a language and all analyses of these sentences. It also assigns probabilities to each of these sentences and each of these analyses. It therefore makes definite predictions about an important class of performance phenomena: the preferences that people display when they must choose between different sentences (in language production and speech recognition), or between alternative analyses of sentences (in disambiguation).

    The accuracy of these predictions, however, is necessarily limited. Stochastic grammars assume that the statistically significant language units coincide exactly with the lexical items and syntactic rules employed by the competence grammar. The most obvious case of frequency-based bias in human disambiguation behavior therefore falls outside their scope: the tendency to assign previously seen interpretations rather than innovative ones to platitudes and conventional phrases. Platitudes and conventional phrases demonstrate that syntactic constructions of arbitrary size and complexity may be statistically important, even if they are completely redundant from the point of view of a competence grammar.


    Stochastic grammars which define their probabilities on minimal syntactic units thus have intrinsic limitations as to the kind of statistical distributions they can describe. In particular, they cannot account for the statistical biases which are created by frequently occurring complex structures. (For a more detailed discussion regarding some specific formalisms, see Bod (1995, Ch. 3).) The obvious way to remedy this is to allow redundancy: to specify statistically significant complex structures as part of a "phrasal lexicon", even though the grammar could already generate these structures in a compositional way. To be able to do that, we need a grammar formalism which builds up a sentence structure out of explicitly specified component structures: a "Tree Grammar" (cf. Fu 1982). The simplest kind of Tree Grammar that might fit our needs is the formalism known as Tree Substitution Grammar (TSG).

    A Tree Substitution Grammar describes a language by specifying a set of arbitrarily complex "elementary trees". The internal nodes of these trees are labelled by non-terminal symbols, the leaf nodes by terminals or non-terminals. Sentences are generated by a "tree rewrite process": if a tree has a leaf node with a non-terminal label, substitute on that node an elementary tree with that root label; repeat until all leaf nodes are terminals.
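    The tree rewrite process just described can be sketched in a few lines of Python. The tree encoding (nested tuples with string leaves), the non-terminal inventory and the elementary trees below are our own invented illustration, not part of any standard TSG formalization:

```python
# Sketch of the TSG tree rewrite process: repeatedly substitute an
# elementary tree on the leftmost non-terminal leaf until every leaf is
# a terminal. A tree is a tuple (label, child1, child2, ...); a bare
# string leaf is a terminal, or a non-terminal awaiting substitution.
import random

NONTERMINALS = {"S", "NP", "VP", "V"}

def leftmost_nonterminal_path(tree, path=()):
    """Path (sequence of child indices) to the leftmost NT leaf, or None."""
    if isinstance(tree, str):
        return path if tree in NONTERMINALS else None
    label, *children = tree
    for i, child in enumerate(children):
        p = leftmost_nonterminal_path(child, path + (i,))
        if p is not None:
            return p
    return None

def substitute(tree, path, subtree):
    """Replace the node at `path` with `subtree`, rebuilding the spine."""
    if not path:
        return subtree
    label, *children = tree
    children[path[0]] = substitute(children[path[0]], path[1:], subtree)
    return (label, *children)

def generate(start, elementary, rng):
    """Rewrite from `start` until no non-terminal leaves remain."""
    tree = start
    while True:
        path = leftmost_nonterminal_path(tree)
        if path is None:
            return tree
        node = tree
        for i in path:                      # read the label at that leaf
            node = node[i + 1]              # children start at index 1
        tree = substitute(tree, path, rng.choice(elementary[node]))

# hypothetical elementary trees for illustration
elementary = {
    "S": [("S", "NP", ("VP", ("V", "saw"), "NP"))],
    "NP": [("NP", "she"), ("NP", "the", "dress")],
}
print(generate("S", elementary, random.Random(0)))
```

    Each iteration performs one substitution step; the process terminates exactly when the generated tree has only terminal leaves, mirroring the definition above.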

    Tree Substitution Grammars can be arbitrarily redundant: there is no formal reason to disallow elementary trees which can also be generated by combining other elementary trees. Because of this property, a probabilistically enhanced TSG could model in a rather direct way how frequently occurring phrases and structures influence a language user's preferences and expectations: we could design a very redundant TSG, containing elementary trees for all statistically relevant phrases, and then assign the proper probabilities to all these elementary trees.

    If we want to explore the possibilities of such Stochastic Tree Substitution Grammars (STSG's), the obvious next question is: what are the statistically relevant phrases? Suppose we have a corpus of utterances sampled from the population of expected inputs, and annotated with labelled constituent trees representing the contextually correct analyses of the utterances. Which subtrees should we now extract from this corpus to serve as elementary trees in our STSG?

    There may very well be constraints on the form of cognitively relevant subtrees, but currently we do not know what they are. Note that if we only use subtrees of depth 1, the TSG is non-redundant: it would be equivalent to a CFG. If we introduce redundancy by adding larger subtrees, we can bias the analysis of previously experienced phrases and patterns in the direction of their previously experienced structures. We certainly want to include the structures of complete constituents and sentences for this purpose, but we may also want to include many partially lexicalized syntactic patterns.

    Are there statistical constraints on the elementary trees that we want to consider? Should we only employ the most frequently occurring ones? That is not clear


    either. Psychological experiments have confirmed that the interpretation of ambiguous input is influenced by the frequency of occurrence of various interpretations in one's past experience. Apparently, the individual occurrences of these interpretations had a cumulative effect on the cognitive system. This implies that, at the time of a new occurrence, there is a memory of the previous occurrences. And in particular, that at the time of the second occurrence, there is a memory of the first. Frequency effects can only build up over time on the basis of memories of unique occurrences. The simplest way to allow this to happen is to store everything.

    We thus arrive at a memory-based language processing model, which employs a corpus of annotated utterances as a representation of a person's past language experience, and analyses new input by means of an STSG which uses as its elementary trees all subtrees that can be extracted from the corpus, or a large subset of them. This approach has been called Data-Oriented Parsing (DOP). As we just summarized it, this model is crude and underspecified, of course. To build working systems based on this idea, we must be more specific about subtree selection, probability calculations, parsing algorithms, and disambiguation criteria. These issues will be considered in the next few sections of this paper.

    But before we do that, we should zoom out a little bit and emphasize that we do not expect that a simple STSG model as just sketched will be able to account for all linguistic and psycholinguistic phenomena that we may be interested in. We employ Stochastic Tree Substitution Grammar because it is a very simple kind of probabilistic grammar which nevertheless allows us to take into account the probabilities of arbitrarily complex subtrees. We do not believe that a corpus of contextless utterances with labelled phrase structure trees is an adequate model of someone's language experience, nor that syntactic processing is necessarily limited to subtree-composition. To build more adequate models, the corpus annotations will have to be enriched considerably, and more complex processes will have to be allowed in extracting data from the corpus as well as in analysing the input.

    The general approach proposed here should thus be distinguished from the specific instantiations discussed in this paper. We can in fact articulate the overall idea fairly explicitly by indicating what is involved in specifying a particular technical instantiation (cf. Bod, 1995). To describe a specific "data-oriented processing" model, four components must be defined:

    1. a formalism for representing utterance-analyses,

    2. an extraction function which specifies which fragments or abstractions of the utterance-analyses may be used as units in constructing an analysis of a new utterance,

    3. the combination operations that may be used in putting together new utterances out of fragments or abstractions,


    4. a probability model which specifies how the probability of an analysis of a new utterance is computed on the basis of the occurrence-frequencies of the fragments or abstractions in the corpus.

    Construed in this way, the data-oriented processing framework allows for a wide range of different instantiations. It then boils down to the hypothesis that human language processing is a probabilistic process that operates on a corpus of representations of past language experiences -- leaving open how the utterance-analyses in the corpus are represented, what sub-structures or other abstractions of these utterance-analyses play a role in processing new input, and what the details of the probabilistic calculations are.

    Current DOP models are typically concerned with syntactic disambiguation, and employ readily available corpora which consist of contextless sentences with syntactic annotations. In such corpora, sentences are annotated with their surface phrase structures as perceived by a human annotator. Constituents are labeled with syntactic category symbols: a human annotator has designated each constituent as belonging to one of a finite number of mutually exclusive classes which are considered as potentially inter-substitutable.

    Corpus-annotation necessarily occurs against the background of an annotation convention of some sort. Formally, this annotation convention constitutes a grammar, and in fact, it may be considered as a competence grammar in the Chomskyan sense: it defines the set of syntactic structures that is possible. We do not presuppose, however, that the set of possible sentences, as defined by the representational formalism employed, coincides with the set of sentences that a person will judge to be grammatical. The competence grammar as we construe it must be allowed to overgenerate: as long as it generates a superset of the grammatical sentences and their structures, a properly designed probabilistic disambiguation mechanism may be able to distinguish grammatical sentences and grammatical structures from their ungrammatical or less grammatical alternatives. An annotated corpus can thus be viewed as a stochastic grammar which defines a subset of the sentences and structures allowed by the annotation scheme, and which assigns empirically motivated probabilities to each of these sentences and structures.

    The current paper thus explores the properties of some varieties of a language-processing model which embodies this approach in a stark and simple way. The model demonstrates that memory-based language-processing is possible in principle. For certain applications it already performs better than some probabilistically enhanced competence-grammars, but its main goal is to serve as a starting point for the development of further refinements, modifications and generalizations.

    4. A Simple Data-Oriented Parsing Model: DOP1


    We will now look in some detail at one simple DOP model, which is known as DOP1 (Bod 1992, 1993a, 1995). Consider a corpus consisting of only two trees, labeled with conventional syntactic categories:

    Figure 1. Imaginary corpus of two trees.

    Various subtrees can be extracted from the trees in such a corpus. The subtrees we consider are: (1) the trees of complete constituents (including the corpus trees themselves, but excluding individual terminal nodes); and (2) all trees that can be constructed out of these constituent trees by deleting proper subconstituent trees and replacing them by their root nodes.
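    This extraction scheme can be sketched in Python, under an assumed tree encoding (nested tuples with string terminal leaves); the toy corpus tree below is invented for illustration:

```python
# Sketch of DOP subtree extraction: for every constituent node, generate
# all variants in which any subset of its proper subconstituents is
# pruned back to its bare root label. Trees are tuples (label, children...);
# strings are terminal leaves, which are never extracted on their own.
from itertools import product

def cuts(tree):
    """All ways of cutting a constituent tree at constituent boundaries:
    each child constituent is either kept (and recursively cut) or
    replaced by its root label, leaving a non-terminal leaf."""
    label, *children = tree
    options = []
    for child in children:
        if isinstance(child, str):            # terminal leaf: always kept
            options.append([child])
        else:                                 # constituent: prune or keep
            options.append([child[0]] + cuts(child))
    return [(label, *combo) for combo in product(*options)]

def all_subtrees(tree):
    """Cut variants of every constituent in the tree."""
    result, stack = [], [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        result.extend(cuts(node))
        stack.extend(node[1:])
    return result

corpus_tree = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "it")))
subtrees = all_subtrees(corpus_tree)
print(len(subtrees))  # 17
```

    The fully pruned variants, such as ("S", "NP", "VP"), are exactly the depth-1 subtrees that correspond to CFG rules; the unpruned variant of the root is the corpus tree itself.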

    The subtree-set extracted from a corpus defines a Stochastic Tree Substitution Grammar. The stochastic sentence generation process of DOP1 employs only one operation for combining subtrees, called "composition", indicated as o. The composition-operation identifies the leftmost nonterminal leaf node of one tree with the root node of a second tree, i.e., the second tree is substituted on the leftmost nonterminal leaf node of the first tree. Starting out with the "corpus" of Figure 1 above, for instance, the sentence "She saw the dress with the telescope" may be generated by repeated application of the composition operator to corpus subtrees in the following way:

    Figure 2. Derivation and parse for "She saw the dress with the telescope".


    Several other derivations, involving different subtrees, may of course yield the same parse tree; for instance:

    or

    Figures 3/4. Two other derivations of the same parse for "She saw the dress with the telescope".

    Note also that, given this example corpus, the sentence we considered is ambiguous; by combining other subtrees, a different parse may be derived, which is analogous to the first rather than the second corpus sentence.

    DOP1 computes the probability of substituting a subtree t on a specific node as the probability of selecting t among all corpus-subtrees that could be substituted on that node. This probability is equal to the number of occurrences of t, |t|, divided by the total number of occurrences of subtrees t' with the same root label as t. Let r(t) return the root label of t. Then we may write:

        P(t) = |t| / Σ_{t': r(t') = r(t)} |t'|

    Since each node substitution is independent of previous substitutions, the probability of a derivation D = t1 o ... o tn is computed as the product of the probabilities of the subtrees ti involved in it:

        P(t1 o ... o tn) = Π_i P(ti)


    The probability of a parse tree is the probability that it is generated by any of its derivations. The probability of a parse tree T is thus computed as the sum of the probabilities of its distinct derivations D:

        P(T) = Σ_{D derives T} P(D)

    This probability may be viewed as a measure for the average similarity between a sentence analysis and the analyses of the corpus utterances: it correlates with the number of corpus trees that share subtrees with the sentence analysis, and also with the size of these shared fragments. Whether this measure constitutes an optimal way of weighing frequency and size against each other is a matter of empirical investigation.
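    The DOP1 probability model just described can be sketched on a toy subtree bank; all subtree identifiers and counts below are invented for illustration:

```python
# Sketch of the DOP1 probability model: substitution probabilities are
# relative frequencies per root label; derivation probabilities multiply;
# parse probabilities sum over distinct derivations. All numbers invented.
from collections import defaultdict
from math import prod

# (root_label, subtree_id) -> corpus frequency |t|
counts = {
    ("S", "s1"): 1, ("S", "s2"): 1,
    ("NP", "np1"): 3, ("NP", "np2"): 1,
    ("VP", "vp1"): 2,
}

root_totals = defaultdict(int)
for (root, _), c in counts.items():
    root_totals[root] += c

def p_subtree(root, tid):
    """P(t) = |t| / total count of subtrees with the same root label."""
    return counts[(root, tid)] / root_totals[root]

def p_derivation(subtrees):
    """P(t1 o ... o tn) = product of the substitution probabilities."""
    return prod(p_subtree(r, t) for r, t in subtrees)

def p_parse(derivations):
    """P(T) = sum over T's distinct derivations."""
    return sum(p_derivation(d) for d in derivations)

d1 = [("S", "s1"), ("NP", "np1")]
d2 = [("S", "s2"), ("NP", "np1"), ("VP", "vp1")]
print(p_derivation(d1))    # 0.5 * 0.75 = 0.375
print(p_parse([d1, d2]))   # 0.375 + 0.5 * 0.75 * 1.0 = 0.75
```

    Note how a parse assembled from several derivations (here d1 and d2 for the same tree) accumulates probability mass that no single derivation carries by itself.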

    5. Computational Aspects of DOP1

    We now consider the problems of parsing and disambiguation with DOP1. The algorithms we discuss do not exploit the particular properties of Data-Oriented Parsing; they work with any Stochastic Tree-Substitution Grammar.

    5.1 Parsing

    The algorithm that creates a parse forest for an input sentence is derived from algorithms that exist for Context-Free Grammars, which parse an input sentence of n words in polynomial (usually cubic) time. These parsers use a chart or well-formed substring table. They take as input a set of context-free rewrite rules and a sentence and produce as output a chart of labeled phrases. A labeled phrase is a sequence of words labeled with a category symbol which denotes the syntactic category of that phrase. A chart-like parse forest can be obtained by including pointers from a category to the other categories which caused it to be placed in the chart. Algorithms that accomplish this can be found in e.g. Kay (1980), Winograd (1983), Jelinek et al. (1990).

    The chart parsing approach can be applied to parsing with Stochastic Tree-Substitution Grammars if we note that every elementary tree t of the STSG can be viewed as a context-free rewrite rule: root(t) > yield(t) (cf. Bod 1992). In order to obtain a chart-like forest for a sentence parsed with an STSG, we label the phrases not only with their syntactic categories but with their full elementary trees. Note that in a chart-like forest generated by an STSG, different derivations that generate identical trees do not collapse. We will therefore talk about a derivation forest generated by an STSG (cf. Sima'an et al. 1994).

    We now show what such a derivation forest may look like. Assume an example STSG which has the trees in Figure 5 as its elementary trees. A chart parser analysing the input string abcd on the basis of this STSG will then create the derivation forest illustrated in Figure 6. The visual representation is based on Kay


    (1980): every entry (i,j) in the chart is indicated by an edge and spans the words between the i-th and the j-th position of a sentence. Every edge is labeled with linked elementary trees that constitute subderivations of the underlying subsentence. (The probabilities of the elementary trees, needed in the disambiguation phase, have been left out.)

    Figure 5. Elementary trees of an example STSG.

    Figure 6. Derivation forest for the string abcd.


    labels H is represented by A > t_H. Let the triple (A, i, j) denote the fact that non-terminal A is in chart entry (i,j) after parsing the input string W1,...,Wn; this implies that the STSG can derive the substring Wi+1,...,Wj, starting with an elementary tree that has root label A. The probability of the MPD of string W1,...,Wn, represented as P_MPD, is computed recursively as follows:

    where

    Obviously, if we drop the CNF assumption, we may apply exactly the same strategy. And by introducing some bookkeeping to keep track of the subderivations which yield the highest probabilities at each step, we get an algorithm which actually computes the most probable derivation. (For some more detail, see Sima'an et al. 1994.)

    5.2.2 The most probable parse

    The most probable parse tree of a sentence cannot be computed in deterministic polynomial time: Sima'an (1996b) proved that for STSG's the problem of computing the most probable parse is NP-hard. This does not mean, however, that every disambiguation algorithm based on this notion is necessarily intractable. We will now investigate to what extent tractability may be achieved if we forsake analytical probability calculations, and are satisfied with estimations instead.

    Because the derivation forest specifies a statistical ensemble of derivations, we may employ the Monte Carlo method (Hammersley & Handscomb 1964) for this purpose: we can estimate parse tree probabilities by sampling a suitable number of derivations from this ensemble, and observing which parse tree results most frequently from these derivations.

    We have seen that a best-first search, as accomplished by Viterbi, can be used for computing the most probable derivation from the derivation forest. In an analogous way, we may conduct a random-first search, which selects a random derivation from the derivation forest by making, for each node at each chart-entry, a random choice between the different alternative subtrees on the basis of their respective substitution probabilities. By iteratively generating several random derivations we can estimate the most probable parse as the parse which results most often from these derivations. (The probability of a parse is the


    probability that any of its derivations occurs.) According to the Law of Large Numbers, the most frequently generated parse converges to the most probable parse as we increase the number of derivations that we sample.

    This strategy is exemplified by the following algorithm (Bod 1993b, 1995):

    Algorithm 2: Sampling a random derivation

    Given a derivation forest of a sentence of n words, consisting of labeled entries (i,j) that span the words between the i-th and the j-th position of the sentence. Every entry is labeled with elementary trees, each with its probability and, for every non-terminal leaf node, a pointer to the relevant sub-entry. (Cf. Figure 6 in Section 5.1 above.) Sampling a derivation from such a chart consists of choosing at random one of the elementary trees for every root-node at every labeled entry (e.g. bottom-up, breadth-first):

    for length := 1 to n do
        for start := 0 to n - length do
            for each root node X in chart-entry (start, start + length) do
                select at random a tree from the distribution of elementary trees with root node X;
                eliminate the other elementary trees with root node X from this chart-entry

    The resulting randomly pruned derivation forest trivially defines one "random derivation" for the whole sentence: take the elementary tree of chart-entry (0, n) and recursively substitute the elementary subtrees of the relevant sub-entries on non-terminal leaf nodes.

    The parse tree that results from this derivation constitutes a first guess for the most probable parse. A more reliable guess can be computed by sampling a larger number of random derivations, and selecting the parse which results most often from these derivations. How large a sample set should be chosen?
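    The estimation idea can be sketched as follows. The derivation distribution below is invented for illustration, and the sampler is a stand-in for the random-first chart search described above; the example also shows why the most frequent parse need not coincide with the parse of the single most probable derivation:

```python
# Monte Carlo estimation of the most probable parse: sample N derivations,
# map each to its parse tree, and return the most frequent parse. The toy
# distribution over derivations below is hypothetical.
import random
from collections import Counter

# hypothetical derivations: (parse_id, derivation probability)
DERIVATIONS = [("T1", 0.3), ("T1", 0.25), ("T2", 0.35), ("T3", 0.1)]

def sample_derivation(rng):
    """Draw one derivation according to the derivation probabilities."""
    return rng.choices(DERIVATIONS, weights=[p for _, p in DERIVATIONS])[0]

def estimate_mpp(n_samples, rng):
    """Most frequently observed parse among n_samples sampled derivations."""
    tallies = Counter(sample_derivation(rng)[0] for _ in range(n_samples))
    return tallies.most_common(1)[0][0]

# T1 has total parse probability 0.55 (two derivations) while T2 has the
# single most probable derivation (0.35); with enough samples the
# estimate settles on T1, the most probable parse.
print(estimate_mpp(2000, random.Random(1)))
```

    The sample size governs the trade-off discussed next: the closer the competing parse probabilities, the more samples are needed before the most frequent parse reliably equals the most probable one.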

    Let us first consider the probability of error: the probability that the parse that is most frequently generated by the sampled derivations is in fact not equal to the most probable parse. An upper bound for this probability is given by

    where the different values of i are indices corresponding to the different parses, 0 is the index of the most probable parse, p_i is the probability of parse i, and N is the number of derivations that was sampled (cf. Hammersley & Handscomb 1964).

    This upper bound on the probability of error becomes small if we increase N, but if there is an i with p_i close to p_0 (i.e., if there are different parses in the top of the sampling distribution that are almost equally likely), we must make N very large


    Note that this algorithm differs essentially from the disambiguation algorithm given in Bod (1995), which increases the sample size until the probability of error of the MPP estimation has become sufficiently small. That algorithm takes exponential time in the worst case, though this was overlooked in the complexity discussion in Bod (1995). (This was brought to our attention in personal conversation by Johnson (1995) and in writing by Goodman (1996, 1998).) The present algorithm (from Bod & Scha 1996, 1997) therefore focuses on estimating the distribution of the parse probabilities; it assumes a value for the maximally allowed standard error (e.g. 0.05), and samples a number of derivations which is guaranteed to achieve this; this number is quadratic in the chosen standard error. Only in the case of a forced choice experiment is the most frequently occurring parse selected from the sample distribution.

    5.2.3 Optimizations

    In the past few years, several optimizations have been proposed for disambiguating with STSG. Sima'an (1995, 1996a) gives an optimization for computing the most probable derivation which starts out by using only the CFG-backbone of an STSG; subsequently, the constraints imposed by the STSG are employed to further restrict the parse space and to select the most probable derivation. This optimization achieves linear time complexity in STSG size without risking an impractical increase of memory-use. Bod (1993b, 1995) proposes to use only a small random subset of the corpus subtrees (5%) so as to reduce the search space for computing the most probable parse. Sekine and Grishman (1995) use only subtrees rooted with S or NP categories, but their method suffers considerably from undergeneration. Goodman (1996) proposes a different polynomial time disambiguation strategy which computes the so-called "maximum constituents parse" of a sentence (i.e. the parse which maximizes the expected number of correct constituents) rather than the most probable parse or most probable derivation. However, Goodman also shows that the "maximum constituents parse" may return parse trees that cannot be produced by the subtrees of DOP1 (Goodman 1996: 147). Chappelier & Rajman (1998) and Goodman (1998) give some optimizations for selecting a random derivation from a derivation forest. For a more extensive discussion of these and some other optimization techniques see Bod (1998a) and Sima'an (1999).

    6. Experimental Properties of DOP1

    In this section, we establish some experimental properties of DOP1. We will do so by studying the impact of various fragment restrictions.

    6.1 Experiments on the ATIS corpus

    We first summarize a series of pilot experiments carried out by Bod (1998a) on a set of 750 sentence analyses from the Air Travel Information System (ATIS) corpus (Hemphill et al. 1990) that were annotated in the Penn Treebank (Marcus


    et al. 1993).[1] These experiments focussed on tests about the Most Probable Parse as defined by the original DOP1 probability model.[2] Their goal was not primarily to establish how well DOP1 would perform on this corpus, but to find out how the accuracies obtained by "undiluted" DOP1 compare with the results obtained by more restricted STSG-models which do not employ the complete set of corpus subtrees as their elementary trees.

    We use the blind testing method, dividing the 750 ATIS trees into a 90% training set of 675 trees and a 10% test set of 75 trees. The division was random except for one constraint: that all words in the test set actually occurred in the training set.[3] The 675 training set trees were converted into fragments (i.e. subtrees) and were enriched with their corpus probabilities. The 75 sentences from the test set served as input sentences that were parsed with the subtrees from the training set and disambiguated by means of the algorithms described in the previous section. The most probable parses were estimated from probability distributions of 100 sampled derivations. We use the notion of parse accuracy as our accuracy metric, defined as the percentage of the most probable parses that are identical to the corresponding test set parses.

Because the MPP estimation is a fairly costly algorithm, we have not yet been able to repeat all our experiments for different training-set/test-set splits, to obtain average results with standard deviations. We made one exception, however. We will very often be comparing the results of an experiment with the results obtained when employing all corpus-subtrees as elementary trees; therefore, it was important to establish at least that the parse accuracy obtained in this fashion (which was 85%) was not due to some unlikely random split.

On 10 random training/test set splits of the ATIS corpus we achieved an average parse accuracy of 84.2% with a standard deviation of 2.9%. Our 85% baseline accuracy thus lies solidly within the range of values predicted by the more extensive experiment.

    The impact of overlapping fragments: MPP vs. MPD

The stochastic model of DOP1 can generate the same parse in many different ways; the probability of a parse must therefore be computed as the sum of the probabilities of all its derivations. We have seen, however, that the computation of the Most Probable Parse according to this model has an unattractive complexity, whereas the Most Probable Derivation is much easier to compute. We may therefore wonder how often the parse generated by the Most Probable Derivation is in fact the correct one: perhaps this method constitutes a good approximation of the Most Probable Parse, and can achieve very similar parse accuracies. And we cannot exclude that it might yield even better accuracies, if it somehow compensates for bad properties of the stochastic model of DOP1. For instance, by summing over the probabilities of several derivations, the Most Probable Parse takes into account overlapping fragments, while the Most Probable Derivation does not. It is not a priori obvious whether we do or do not want this property.
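The difference between the two criteria can be made concrete with a toy calculation (all subtree names and probabilities below are invented for illustration): a parse with several overlapping derivations can outrank a parse whose single derivation is individually the most probable one.

```python
def derivation_prob(derivation, subtree_prob):
    """P(d): the product of the probabilities of the subtrees in d."""
    p = 1.0
    for subtree in derivation:
        p *= subtree_prob[subtree]
    return p

def parse_prob(derivations, subtree_prob):
    """P(T): the sum of P(d) over all derivations d that generate T."""
    return sum(derivation_prob(d, subtree_prob) for d in derivations)

# Invented toy STSG: parse T1 has two derivations, parse T2 has one.
probs = {"t_a": 0.5, "t_b": 0.4, "t_c": 0.1}
t1_derivations = [["t_a", "t_b"], ["t_c"]]   # 0.20 + 0.10 = 0.30
t2_derivations = [["t_a", "t_a"]]            # 0.25
```

Here the single most probable derivation (0.25) belongs to T2, yet the most probable parse is T1 (0.30 > 0.25), so the two criteria disagree.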

We thus calculated the accuracies based on the analyses generated by the Most Probable Derivations of the test sentences. The parse accuracy obtained by the trees generated by the Most Probable Derivations was 69%, which is lower than the 85% baseline parse accuracy obtained by the Most Probable Parse. We conclude that overlapping fragments play an important role in predicting the appropriate analysis of a sentence.

    The impact of fragment size

Next, we tested the impact of the size of the fragments on the parse accuracy. Large fragments capture more lexical/syntactic dependencies than small ones. We investigated to what extent this actually leads to better predictions of the appropriate parse. We therefore performed experiments with versions of DOP1 where the fragment collection is restricted to subtrees with a certain maximum depth (where the depth of a tree is defined as the length of the longest path from the root to a leaf). For instance, restricting the maximum depth of the subtrees to 1 gives us fragments that cover exactly one level of constituent structure, which makes DOP1 equivalent to a stochastic context-free grammar (SCFG). For a maximal subtree depth of 2, we obtain fragments that also cover two levels of constituent structure, which capture some more lexical/syntactic dependencies, etc. The following table shows the results of these experiments, where the parse accuracy for each maximal depth is given both for the most probable parse and for the parse generated by the most probable derivation (the accuracies are rounded off to the nearest integer).
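As a sketch, using an invented `(label, children)` tuple representation for subtrees (not the actual implementation), the depth restriction amounts to a simple filter:

```python
def depth(tree):
    """Length of the longest path from the root to a leaf."""
    label, children = tree
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)

def restrict_depth(subtrees, max_depth):
    """Keep only subtrees up to max_depth; with max_depth=1 only
    one-level fragments (CFG rules) survive, i.e. an SCFG."""
    return [t for t in subtrees if depth(t) <= max_depth]
```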

    Table 1. Accuracy increases if larger corpus fragments are used

The table shows an increase in parse accuracy, for both the most probable parse and the most probable derivation, when enlarging the maximum depth of the subtrees. The table confirms that the most probable parse yields better accuracy than the most probable derivation, except for depth 1 where DOP1 is equivalent to an SCFG (and where every parse is generated by exactly one derivation). The highest parse accuracy reaches 85%.


    The impact of fragment lexicalization

We now consider the impact of lexicalized fragments on the parse accuracy. By a lexicalized fragment we mean a fragment whose frontier contains one or more words. The more words a fragment contains, the more lexical (collocational) dependencies are taken into account. To test the impact of lexicalization on the parse accuracy, we performed experiments with different versions of DOP1 where the fragment collection is restricted to subtrees whose frontiers contain a certain maximum number of words; the maximal subtree depth was kept constant at 6.
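Using a toy `(label, children)` tuple representation for subtrees, with a caller-supplied `is_word` predicate to tell terminal words apart from substitution sites (both assumptions of this sketch), the lexicalization restriction can be written as:

```python
def frontier(tree):
    """The ordered sequence of leaf labels of a subtree."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in frontier(child)]

def lexicalization(tree, is_word):
    """Number of words on the fragment's frontier."""
    return sum(1 for leaf in frontier(tree) if is_word(leaf))

def restrict_lexicalization(subtrees, max_words, is_word):
    """Keep only fragments with at most max_words frontier words."""
    return [t for t in subtrees if lexicalization(t, is_word) <= max_words]
```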

These experiments are particularly interesting since we can simulate lexicalized grammars in this way. Lexicalized grammars have become increasingly popular in computational linguistics (e.g. Schabes 1992; Srinivas & Joshi 1995; Collins 1996, 1997; Charniak 1997; Carroll & Weir 1997). However, all lexicalized grammars that we know of restrict the number of lexical items contained in a rule or elementary tree. It is a significant feature of the DOP approach that we can straightforwardly test the impact of the number of lexical items allowed.

The following table shows the results of our experiments, where the parse accuracy is given for both the most probable parse and the most probable derivation.

Table 2. Accuracy as a function of the maximum number of words in fragment frontiers.

The table shows an initial increase in parse accuracy, for both the most probable parse and the most probable derivation, when enlarging the amount of lexicalization that is allowed. For the most probable parse, the accuracy is stationary when the maximum is enlarged from 4 to 6 words, but it increases again if the maximum is enlarged to 8 words. For the most probable derivation, the parse accuracy reaches its maximum already at a lexicalization bound of 4 words. Note that the parse accuracy deteriorates if the lexicalization bound exceeds 8 words. Thus, there seems to be an optimal lexical maximum for the ATIS corpus. The table confirms that the most probable parse yields better accuracy than the most probable derivation, also for different lexicalization sizes.

    The impact of fragment frequency


We may expect that highly frequent fragments contribute to a larger extent to the prediction of the appropriate parse than very infrequent fragments. But while small fragments can occur very often, most larger fragments typically occur once. Nevertheless, large fragments contain much lexical/structural context, and can parse a large piece of an input sentence at once. Thus, it is interesting to see what happens if we systematically remove low-frequency fragments. We performed an additional set of experiments by restricting the fragment collection to subtrees with a certain minimum number of occurrences, but without applying any other restrictions.
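The frequency restriction itself reduces to a one-line filter; sketched in Python (the fragment identifiers below are invented):

```python
from collections import Counter

def restrict_frequency(fragments, min_occurrences):
    """Discard fragments occurring fewer than min_occurrences times;
    raising this lower bound removes mostly the large, once-occurring
    fragments."""
    counts = Counter(fragments)
    return [f for f in fragments if counts[f] >= min_occurrences]
```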

Table 3. Accuracy decreases if the lower bound on fragment frequency increases (for the most probable parse).

The results, presented in Table 3, indicate that low frequency fragments contribute significantly to the prediction of the appropriate analysis: the parse accuracy seriously deteriorates if low frequency fragments are discarded. This seems to contradict the common wisdom that probabilities based on sparse data are not reliable. Since especially large fragments are once-occurring events, there seems to be a preference in DOP1 for an occurrence-based approach if enough context is provided: large fragments, even if they occur once, tend to contribute to the prediction of the appropriate parse, since they provide much contextual information. Although these fragments have very low probabilities, they tend to induce the most probable parse because fewer fragments are needed to construct a parse.

In Bod (1998a), the impact of some other fragment restrictions is studied. Among other things, it is shown there that the parse accuracy decreases if subtrees with only non-head words are eliminated.

6.2 Experiments on larger corpora: the SRI-ATIS corpus and the OVIS corpus

In the following experiments (summarized from Sima'an 1999)[4] we only employ the most probable derivation rather than the most probable parse. Since the most probable derivation can be computed more efficiently than the most probable parse (see section 5), it can be tested more extensively on larger corpora. The experiments were conducted on two domains: the Amsterdam OVIS tree-bank (Bonnema et al. 1997) and the SRI-ATIS tree-bank (Carter 1997).[5] It is worth noting that the SRI-ATIS tree-bank differs considerably from the Penn Treebank ATIS-corpus that was employed in the experiments reported in the preceding subsection.

In order to acquire workable and accurate DOP1 models from larger tree-banks, a set of heuristic criteria is used for selecting the fragments. Without these criteria, the number of subtrees would not be manageable. For example, in OVIS there are more than a hundred and fifty million subtrees. Even when the subtree depth is limited to e.g. depth five, the number of subtrees in the OVIS tree-bank remains more than five million. The subtree selection criteria are expressed as constraints on the form of the subtrees that are projected from a tree-bank into a DOP1 model. The constraints are expressed as upper-bounds on: the depth (d), the number of substitution-sites (n), the number of terminals (l) and the number of consecutive terminals (L) of the subtree. These constraints apply to all subtrees but the subtrees of depth 1, i.e. subtrees of depth 1 are not subject to these selection criteria. In the sequel we represent the four upper-bounds by the short notation ddnnllLL. For example, d4n2l7L3 denotes a DOP STSG obtained from a tree-bank such that every elementary tree has at most depth 4, and a frontier containing at most 2 substitution sites and 7 terminals; moreover, the length of any consecutive sequence of terminals on the frontier of that elementary tree is limited to 3 terminals. Since all projection parameters except for the upper-bound on the depth are usually fixed a priori, the DOP1 STSG obtained under a depth upper-bound that is equal to an integer i will be represented by the short notation DOP(i).
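A sketch of how one subtree could be checked against such d/n/l/L upper-bounds; the `(label, children)` tuple representation and the `is_word` predicate are assumptions of this illustration, not the actual implementation.

```python
def depth(tree):
    """Length of the longest root-to-leaf path."""
    label, children = tree
    return 0 if not children else 1 + max(depth(c) for c in children)

def frontier(tree):
    """Ordered leaf labels of a subtree."""
    label, children = tree
    if not children:
        return [label]
    return [x for c in children for x in frontier(c)]

def within_bounds(subtree, d, n, l, L, is_word):
    """Check one subtree against the d/n/l/L upper-bounds (e.g. d4n2l7L3).
    Depth-1 subtrees are exempt, as in the text."""
    if depth(subtree) <= 1:
        return True
    if depth(subtree) > d:
        return False
    flags = [is_word(leaf) for leaf in frontier(subtree)]
    if flags.count(False) > n:   # substitution sites
        return False
    if flags.count(True) > l:    # terminals
        return False
    run = longest = 0            # longest consecutive terminal run
    for w in flags:
        run = run + 1 if w else 0
        longest = max(longest, run)
    return longest <= L
```

For instance, a depth-2 subtree whose frontier contains four consecutive words violates the L3 bound of d4n2l7L3 but satisfies an L4 bound.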

We used the following evaluation metrics: Recognized (percentage of recognized sentences), TLC (Tree-Language Coverage: the percentage of test set parses that is in the tree language of the DOP1 STSG), exact match (either syntactic/semantic or only syntactic), and labeled bracketing recall and precision (as defined in Black et al. 1991, concerning either syntactic plus semantic or only syntactic annotation). Below we summarize the experimental results pertaining to some of the issues that are addressed in Sima'an (1999). Some of these issues are similar to those addressed by the experiments with the most probable parse on the small ATIS tree-bank in subsection 6.1, e.g. the impact of fragment size. Other issues are orthogonal and supplement the issues addressed in the experiments concerning the most probable parse.
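For concreteness, the labeled bracketing measures can be sketched as follows. This is a simplified illustration in the spirit of Black et al. (1991), using an invented `(label, children)` tree representation in which each word occupies one string position.

```python
def labeled_brackets(tree):
    """List of (label, start, end) spans for every constituent."""
    spans = []
    def walk(node, start):
        label, children = node
        if not children:          # a word: occupies one position
            return start + 1
        end = start
        for child in children:
            end = walk(child, end)
        spans.append((label, start, end))
        return end
    walk(tree, 0)
    return spans

def bracketing_scores(candidate, gold):
    """Labeled bracketing (precision, recall)."""
    c, g = labeled_brackets(candidate), labeled_brackets(gold)
    matched = sum(min(c.count(s), g.count(s)) for s in set(c))
    return matched / len(c), matched / len(g)
```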

    Experiments on the SRI-ATIS corpus

In this section we report experiments on syntactically annotated utterances from the SRI International ATIS tree-bank. The utterances of the tree-bank originate from the ATIS domain (Hemphill et al. 1990). For the present experiments, we have access to 13335 utterances that are annotated syntactically (we refer to this tree-bank here as the SRI-ATIS corpus/tree-bank). The annotation scheme originates from the linguistic grammar that underlies the Core Language Engine (CLE) system in Alshawi (1992). The annotation process is described in Carter (1997). For the experiments summarized below, some of the training parameters were fixed: the DOP models were projected under the parameters n2l4L3, while the subtree depth bound was allowed to vary.

A training-set of 12335 trees and a test-set of 1000 trees were obtained by partitioning the SRI-ATIS tree-bank randomly. DOP1 models with various depth upper-bound values were trained on the training-set and tested on the test-set. It is noteworthy that the present experiments are extremely time-consuming: for upper-bound values larger than three, the models become huge and very slow; e.g. it takes more than 10 days for DOP(4) to parse and disambiguate the test-set (1000 sentences). This is despite the subtree upper bounds n2l4L3, which limit the total size of the subtree-set to less than three hundred thousand subtrees.

    Table 4. The impact of subtree depth (SRI-ATIS)

Table 4 (left-hand side) shows the results for depth upper-bound values up to four. An interesting and surprising result is that the exact-match of DOP(1) on this larger and different ATIS tree-bank (46%) is very close to the result reported in the preceding subsection. This also holds for the DOP(4) model (here 82.70% exact-match vs. 84% on the Penn Treebank ATIS corpus). More striking is that the present experiments concern the most probable derivation while the experiments of the preceding section concern the most probable parse. In the preceding subsection, the exact-match of the most probable derivation did not exceed 69%, while in this case it is 82.70%. This might be explained by the fact that the availability of more data is more crucial for the accuracy of the most probable derivation than for the most probable parse. This is certainly not due to a simpler tree-bank or domain, since the annotations here are as deep as those in the Penn Treebank. In any case, it would be interesting to consider the exact match that the most probable parse achieves on this tree-bank. This, however, will remain an issue for future research because computing the most probable parse is still infeasible on such large tree-banks.

The issue here is of course still the impact of employing deeper subtrees. Clearly, as the results show, the difference between DOP(1) (the SCFG) and any deeper DOP1 model is at least 23% (DOP(2)). This difference increases to 36.70% at DOP(4). To validate this difference, we ran a four-fold cross-validation experiment that confirms the magnitude of this difference. In the right-hand side of table 4, means and standard-deviations for two DOP1 models are reported. Four independent partitions into test sets (1000 trees each) and training sets (12335 trees each) were used here for training and testing these DOP1 models. These results show a mean difference of 24% exact-match between DOP(2) and DOP(1) (SCFG): a substantial accuracy improvement achieved by memory-based parsing using DOP1, above simply using the SCFG underlying the tree-bank (as for instance in Charniak 1996).

Experiments on the OVIS corpus

The Amsterdam OVIS ("Openbaar Vervoer Informatie Systeem") corpus contains 10000 syntactically and semantically annotated trees. For detailed information concerning the syntactic and semantic annotation scheme of the OVIS tree-bank we refer the reader to Bonnema et al. (1997). In acquiring DOP1 models the semantic and syntactic annotations are treated as one annotation in which the labels of the nodes in the trees are a juxtaposition of the syntactic and semantic labels. Although this results in many more non-terminal symbols (and thus also DOP model parameters), Bonnema (1996) shows that the resulting syntactic+semantic DOP models are better than the merely syntactic DOP1 models. Since the utterances in the OVIS tree-bank are answers to questions asked by a dialogue system, these utterances tend to be short. The average sentence length in OVIS is 3.43 words. However, the results reported in Sima'an (1999) concern only sentences that contain at least two words; the number of those sentences is 6797 and their average length is 4.57 words. All DOP1 models are projected under the subtree selection criterion n2l7L3, while the subtree depth upper bound was allowed to vary.

It is interesting here to observe the effect of varying subtree depth on the performance of the DOP1 models on a tree-bank from a different domain. To this end, one random partition of the OVIS tree-bank into a test-set of 1000 trees and a training set of 9000 trees was used to test the effect of allowing the projection of deeper elementary trees in DOP STSGs. DOP STSGs were projected from the training set with upper-bounds on subtree depth equal to 1, 3, 4, and 5. Each of the four DOP models was run on the sentences of the test-set (1000 sentences). The resulting parse trees were then compared to the correct test set trees.


    Table 5. The impact of subtree depth (OVIS)

The left-hand side of table 5 above shows the results of these DOP1 models. Note that the recognition power (Recognized) is not affected by the depth upper-bound in any of the DOP1 models. This is because all models allowed all subtrees of depth 1 to be elementary trees. As the results show, a slight accuracy degradation occurs when the subtree depth upper bound is increased from four to five. This has been confirmed separately by earlier experiments conducted on similar material (Bonnema et al. 1997). An explanation for this degradation might be that including larger subtrees implies many more subtrees and sparse-data effects. It is not clear, therefore, whether this finding contradicts the Memory-Based Learning doctrine that maintaining all cases in the case-base is more profitable. It might equally well be the case that this problem is specific to the DOP1 model and the way it assigns probabilities to subtrees.

Table 5 also shows the means and standard deviations (stds) of two DOP1 models on five independent partitions of the OVIS tree-bank into training set (9000 trees) and test set (1000 trees). For every partition, the DOP1 model was trained only on the training set and then tested on the test set. We observe here the means and standard deviations of the models DOP(1) (the SCFG underlying the tree-bank) and DOP(4). Clearly, the difference between DOP(4) and DOP(1) observed in the preceding set of experiments is supported here. However, the improvement of DOP(4) over the SCFG in exact match and the other accuracy measures, although significant, is disappointing: it is about 2.85% exact match and 3.35% syntactic exact match. The improvement itself is indeed in line with the observation that DOP1 improves on the SCFG underlying the tree-bank. This can be seen as an argument for MBL syntactic analysis as opposed to the traditional enrichment of "linguistic" grammars with probabilities.

We have thus seen that there is quite some evidence for our hypothesis that the structural units of language processing cannot be limited to a minimal set of rules, but should be defined in terms of a large set of previously seen structures. It is interesting to note that similar results are obtained by other instantiations of memory-based language processing. For example, van den Bosch & Daelemans (1998) report that almost every criterion for removing instances from memory yields worse accuracy than keeping full memory of learning material (for the task of predicting English word pronunciation). Despite this interesting convergence of results, there is a significant difference between DOP and other memory-based approaches. We will go into this topic in the following section.

7. DOP: probabilistic recursive MBL

In this section we make explicit the relationship between the present Data-Oriented Processing (DOP) framework and the Memory-Based Learning framework. We show how the DOP framework extends the general MBL framework with probabilistic reasoning in order to deal with complex performance phenomena such as syntactic disambiguation. In order to keep this discussion concrete we also analyze the model DOP1, the first instantiation of the DOP framework. Subsequently, we contrast the DOP model with other existing MBL approaches that employ so-called "flat" or "intermediate" descriptions as opposed to the hierarchical descriptions used by the DOP model.

    7.1 Case-Based Reasoning

In the Machine Learning (ML) literature, e.g. Aamodt & Plaza (1994), Mitchell (1997), various names, e.g. Instance-Based, Case-Based, Memory-Based or Lazy, are used for a paradigm of learning that can be characterized by two properties:

(1) it involves a lazy learning algorithm that does not generalize over the training examples but stores them, and

(2) it involves lazy generalization during the application phase: each new instance is classified (on its own) on the basis of its relationship to the stored training examples; the relationship between two instances is examined by means of so-called similarity functions.
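These two properties can be compressed into a minimal sketch; the nearest-neighbor similarity and the memory format below are placeholders invented for this illustration.

```python
def train(examples):
    """Property (1): lazy learning just stores the classified
    (instance, class) examples; no generalization at training time."""
    return list(examples)

def classify(query, memory, similarity):
    """Property (2): at application time, classify a new instance via
    its similarity to the stored examples (here: nearest neighbor)."""
    instance, label = max(memory, key=lambda ex: similarity(query, ex[0]))
    return label
```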

We will refer to this general paradigm by the term MBL (although the term Lazy Learning, as Aha (1998) suggests, might be more suitable). There are various forms of MBL that differ in several respects. In this study we are specifically interested in the Case-Based Reasoning (CBR) variant of MBL (Kolodner 1993, Aamodt & Plaza 1994).

Case-Based Reasoning differs from other MBL approaches, e.g. k-nearest neighbor methods, in that it does not represent the instances of the task concept[6] as real-valued points in an n-dimensional Euclidean space; instead, CBR methods represent the instances by means of complex symbolic descriptions, e.g. graphs (Aamodt & Plaza 1994, Mitchell 1997). This implies that CBR methods require more complex similarity functions. It also implies that CBR methods view their learning task in a different way than other MBL methods: while the latter methods view their learning task as a classification problem, CBR methods view their learning task as the construction of classes for input instances by reusing parts of the stored classified training-instances.

According to overviews of the CBR literature, e.g. Mitchell (1997), Aha & Wettschereck (1997), there exist various CBR methods that address a wide variety of tasks, e.g. conceptual designs of mechanical devices (Sycara et al. 1992), reasoning about legal cases (Ashley 1990), scheduling and planning problems (Veloso & Carbonell 1993) and mathematical integration problems (Elliot & Scott 1991). Rather than pursuing the infeasible task of contrasting DOP to each of these methods, we will first highlight the specific aspects of DOP as an extension to CBR. Subsequently we compare DOP to recent approaches that extend CBR with probabilistic reasoning.

    7.2 The DOP framework and CBR

We will show that the first three components of a DOP model as described in the DOP framework of section 3 define a CBR method, and that the fourth component extends CBR with probabilistic reasoning. To this end, we will express each component in CBR terminology and then show how this specifies a CBR system. The first component of DOP, i.e. a formal representation of utterance-analyses, specifies the representation language of the instances and classes in the parsing task, i.e. the so-called case description language. The second component, i.e. the fragments definition, defines the units that are retrieved from memory when a class (tree) is being constructed for a new instance; the retrieved units are exactly the sub-instances and sub-classes that can be combined into instances and classes. The third component, i.e. the definition of combination operations, concerns the definition of the constraints on combining the retrieved units into trees when parsing a new utterance. Together, the latter two components define exactly the retrieval, reuse and revision aspects of the CBR problem solving cycle (Aamodt & Plaza 1994). The similarity measure, however, is not explicitly specified in DOP but is implicit in a retrieval strategy that relies on simple string equivalence. Thus, the first three components of a DOP model specify exactly a simple CBR system for natural language analysis, i.e., a natural language parser. This system is lazy: it does not generalize over the tree-bank until it starts parsing a new sentence, and it defines a space of analyses for a new input sentence simply by matching and combining fragments from the case-base (i.e. tree-bank).

The fourth component of the DOP framework, however, extends the CBR approach for dealing with ambiguities in the definition of the case description language. It specifies how the frequencies of the units retrieved from the case-base define a probability value for the utterance that is being parsed. We may conclude that the four components of the DOP framework define a CBR method that uses possibly recursive case descriptions, string-matching for retrieval of cases, and a probabilistic model for resolving ambiguity. The latter property of DOP is crucial for the task that the DOP framework addresses: disambiguation.

Disambiguation differs from the task that mainstream CBR approaches address, i.e. constructing a class for the input instance. Linguistic disambiguation involves classification under an ambiguous definition of the "case description language", i.e., the formal representation of the utterance analyses, which is usually a grammar. Since the fragments (second component of a DOP model) are defined under this "grammar", combining them by means of the combination operations (third component) usually defines an ambiguous CBR system: for classifying (i.e. parsing) a new instance (i.e. sentence), this CBR system may construct more than one analysis (i.e. class). The ambiguity of this CBR system is inherent to the fact that it is often infeasible to construct an unambiguous formal representation of natural language analyses. A DOP model solves this ambiguity when classifying an instance by assigning a probability value to every constructed class in order to select the most probable one. Next we will show how these observations about the DOP framework apply to the DOP1 model for syntactic disambiguation.

    7.3 DOP1 and CBR methods

In order to explore the differences between the DOP1 model and CBR methods, we will express DOP1 as an extension to a CBR parsing system. To this end, it is necessary to identify the case-base, the instance-space, the class-space, and the "similarity function" that DOP assumes. In DOP, the training tree-bank contains (string, tree) pairs that represent the classified instances, where the string is an instance and the tree is a class.

A DOP model memorizes in its case-base exactly the finite set of tree-bank trees. When parsing a new input sentence, the DOP model retrieves from the case-base all subtrees of the trees in its case-base and tries to use them for constructing trees for that sentence. Let us refer to the ordered sequence of symbols that decorate the leaf nodes of a subtree as the frontier of that subtree. Moreover, let us call the frontier of a subtree from the case-base a subsentential-form (SSF) from the tree-bank. During the retrieval phase, a DOP1 model retrieves from its case-base all (str, st) pairs, where st is a subtree of the tree-bank and the string str is its frontier SSF. These subtrees are used for constructing classes, i.e. trees, for the input sentence using the substitution-operation. This operation enables constrained assemblage of sentences (instances) and trees (classes). The set of sentences and the set of trees that can be assembled from subtrees by means of substitution constitute respectively the instance-space and the class-space of a DOP1 model.

Thus, the instance-space and the class-space of a DOP1 model are defined by the Tree-Substitution Grammar (TSG) that employs the subtrees of the tree-bank trees as its elementary trees; this TSG is a recursive device that defines infinite instance- and class-spaces. However, this TSG, which represents the "runtime" expansion of DOP's case-base, does not generalize over the CFG that underlies the tree-bank (the case description language), since the two grammars have equal string-languages (instance-spaces) and equal tree-languages (class-spaces).


The probabilities that DOP1 attaches to the subtrees in the TSG are induced from the frequencies in the tree-bank and can be seen as subtree weights. Thus, the STSG that a DOP model projects from a tree-bank can be viewed as an infinite runtime case-base containing instance-class-weight triples of the form (string, tree, weight).

    Task and similarity function

The task implemented by a DOP1 model is disambiguation: the identification of the most plausible tree that the TSG assigns to the input sentence. Syntactic disambiguation is indeed a classification task in the presence of an infinite class-space. For an input sentence, this class-space is first limited to a finite set by the (usually) ambiguous TSG: only trees that the TSG constructs for that sentence by combining subtrees from the tree-bank are in the specific class-space of that sentence. Subsequently, the tree with the highest probability (according to the STSG) in this limited space is selected as the most plausible one for the input sentence. To understand the matching and retrieval processes of a DOP1 model, let us consider both steps of disambiguation separately.

In the first step, i.e. parsing, the similarity function that DOP1 employs is a simple recursive string-matching procedure. First, all substrings of the input sentence are matched against SSFs in the case-base (i.e. the TSG) and the subtrees corresponding to the matched SSFs are retrieved; every substring that matches an SSF is replaced by the label of the root node of the retrieved subtree (note that there can be many subtrees retrieved for the same SSF; their roots replace the substring as alternatives). This SSF-matching and subtree-retrieval procedure is recursively applied to the resulting set of strings until the last set of strings does not change.

Technically speaking, this "recursive matching process" is usually implemented as a parsing algorithm that constructs an efficient representation of all trees that the TSG assigns to the input sentence, called the parse-forest that the TSG constructs for that sentence (see section 5).
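A deliberately naive fixpoint version of this matching process, without the efficient parse-forest representation, might look as follows; fragments are reduced to invented `(frontier, root)` pairs for the illustration.

```python
def ssf_closure(sentence, fragments):
    """Repeatedly replace substrings that match a fragment frontier (SSF)
    by the fragment's root label, until the set of strings is stable.
    Exponential in general; real systems build a parse forest instead."""
    rules = [(tuple(frontier), root) for frontier, root in fragments]
    seen = {tuple(sentence)}
    agenda = [tuple(sentence)]
    while agenda:
        s = agenda.pop()
        for frontier, root in rules:
            n = len(frontier)
            for i in range(len(s) - n + 1):
                if s[i:i + n] == frontier:
                    reduced = s[:i] + (root,) + s[i + n:]
                    if reduced not in seen:
                        seen.add(reduced)
                        agenda.append(reduced)
    return seen
```

A sentence is recognized when the closure contains the start symbol alone, e.g. `("S",)`.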

    What is the role of probabilities in the similarity function?

Thus, rather than employing a Euclidean or any other metric to measure the distance between the input sentence and the sentences in the case-base, DOP1 resorts to a recursive matching process where similarity between SSFs is implemented as simple string-equivalence. Beyond parsing, which can be seen as the first step of disambiguation, DOP1 faces the ambiguity of the TSG, which follows from the ambiguity of the CFG underlying the tree-bank (i.e. the case description language definition). This is one important point where DOP1 deviates from mainstream CBR methods, which usually employ unambiguous definitions of the case description language, or resort to (often ad hoc) heuristics that give a marginal role to disambiguation.


For natural language processing it is usually not feasible to construct unambiguous grammars. Therefore, the parse-forest that the parsing process constructs for an input sentence usually contains more than one tree. The task of selecting the most plausible tree from this parse-forest, i.e. syntactic disambiguation, constitutes the main task addressed by performance models of syntactic analysis such as DOP1. For disambiguation, DOP1 ranks the alternative trees in the parse-forest of the input sentence by computing a probability for every tree. Subsequently, it selects the tree with the highest probability from this space. It is interesting to consider how the probability of a tree in DOP1 relates to the matching process that takes place during parsing.

Given a parse-tree, we may view each derivation that generates it as a "successful" combination of subtrees from the case-base; to every such combination we assign a "matching-score" of 1. All sentence-derivations that generate a different parse-tree (including the ones that generate a different sentence) receive matching-score 0. The probability of a parse-tree as computed by DOP1 is in fact the expectation value (or mean) of the scores (with respect to this parse-tree) of all derivations allowed by the TSG; this expectation value weighs the score of every derivation by the probability of that derivation. The probability of a derivation is computed as the product of the probabilities of the subtrees that participate in it. Subtree probabilities, in their turn, are based on the frequency counts of the subtrees and are conditioned on the constraint embodied by the tree-substitution combination operation.
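As an illustration of this computation, the following sketch uses invented subtree identifiers and counts; `tree_prob` sums the probabilities of the derivations of one tree, which equals the expectation of the 0/1 matching-score described above, since every other derivation contributes 0.

```python
from collections import Counter
from math import prod

# Illustrative subtree frequencies from a hypothetical tree-bank.
# A subtree's probability is its count divided by the total count of
# subtrees with the same root label: the substitution operation only
# allows subtrees whose root matches the open substitution site.
counts = Counter({
    ("S", "t1"): 2, ("S", "t2"): 1,    # two S-rooted subtrees
    ("NP", "t3"): 3, ("NP", "t4"): 1,  # two NP-rooted subtrees
})
root_totals = Counter()
for (root, _), c in counts.items():
    root_totals[root] += c

def subtree_prob(subtree):
    return counts[subtree] / root_totals[subtree[0]]

def derivation_prob(derivation):
    # Probability of a derivation = product of its subtree probabilities.
    return prod(subtree_prob(t) for t in derivation)

def tree_prob(derivations_of_tree):
    # DOP1 probability of a parse-tree: the sum of the probabilities of
    # all derivations that generate it (all other derivations have
    # matching-score 0 and drop out of the expectation).
    return sum(derivation_prob(d) for d in derivations_of_tree)

# Two hypothetical derivations of the same parse-tree:
d1 = [("S", "t1")]                # one large memorized fragment
d2 = [("S", "t2"), ("NP", "t3")]  # smaller fragments combined
print(tree_prob([d1, d2]))  # 2/3 + (1/3 * 3/4) = 0.9166...
```

The example shows the characteristic DOP1 effect: a tree gains probability from every derivation that produces it, so trees supported by large memorized fragments as well as by many smaller-fragment combinations are preferred.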

This brief description of DOP1 in CBR terminology shows that the following aspects of DOP1 are not present in other mainstream CBR methods: (i) both the instance- and class-spaces are infinite and are defined in terms of a recursive matching process embodied by a TSG-based parser that matches strings by equality, and then retrieves and combines the subtrees associated with the matching strings using the substitution-operation; (ii) the CBR task of constructing a tree (i.e. class) for an input sentence is further complicated in DOP1 by the ambiguity of this TSG; (iii) the "structural similarity function" that most other CBR methods employ is therefore implemented in DOP as a recursive process, complemented by a probability distribution spanned over the instance- and class-spaces in order to facilitate disambiguation; and (iv) the probabilistic disambiguation process in DOP1 is conducted globally over the whole sentence rather than locally on parts of the sentence. Hence we may characterize DOP1 as a lazy probabilistic recursive CBR classifier that addresses the problem of global sentential-level syntactic disambiguation.

    7.4 DOP and recent probabilistic extensions to CBR

Recent literature within the CBR approach advocates extending CBR with probabilistic reasoning. Waltz and Kasif (1996) and Kasif et al. (forthcoming) refer to the framework that combines CBR with probabilistic reasoning by the name Memory-Based Reasoning (MBR). Their arguments for this framework are based on the need for systems to be able to adapt to rapidly changing environments "where abstract truths are at best temporary or contingent". This work differs from DOP in at least two important ways: (i) it assumes a non-recursive finite class-space, and (ii) it employs probabilistic reasoning for inducing so-called "adaptive distance metrics" (distance metrics that automatically change as new training material enters the system) rather than for disambiguation.

These differences imply that this approach does not take note of the specific aspects of the disambiguation task as found in natural language parsing, e.g., the need for recursive symbolic descriptions and the fact that disambiguation lies at the heart of any performance task. The syntactic ambiguity problem thus has an additional dimension of complexity next to the dimensions that are addressed by the mainstream ML literature.

    7.5 DOP vs. other MBL approaches in NLP

In this section we will concentrate on contrasting DOP with some MBL methods that are used for implementing natural language processing (NLP) tasks. First we briefly address the relatively clear differences between DOP and methods based on variations of the k-Nearest Neighbor (k-NN) approach. Subsequently we discuss more thoroughly the recently introduced Memory-Based Sequence Learning (MBSL) method (Argamon et al. 1998) and how it relates to the DOP model.

    7.5.1 k-NN vs. DOP

From the description of CBR methods earlier in this section, it became clear that the main difference between k-NN methods and CBR methods is that the latter employ complex data structures rather than feature vectors for representing cases. As mentioned before, DOP's case description language is further enhanced by recursion and complicated by ambiguity. Moreover, while k-NN methods classify their input into a partitioning of a real-valued multi-dimensional Euclidean space, the CBR methods (including DOP) must construct a class for an input instance. The similarity measures in k-NN methods are based on measuring the distance between the input instance and each of the instances in memory. In DOP, this measure is simplified during parsing to string-equivalence and complicated during disambiguation by a probabilistic ranking of the alternative trees of the input sentence. Of course, it is possible to imagine a DOP model that employs k-NN methods during the parsing phase, so that the string matching process becomes more complex than simple string-equivalence. In fact, a simple version of such an extension of DOP has been studied in Zavrel (1996) -- with inconclusive empirical results due to the lack of suitable training material.

Recently, extensions and enhancements of the basic k-NN methods (Daelemans et al. 1999a) have been applied to limited forms of syntactic analysis (Daelemans et al. 1999b). This work employs k-NN methods for very specific syntactic classification tasks (for instance for recognizing NP's or VP's, and for deciding on PP-attachment or subject/object relations), and then combines these classifiers into shallow parsers. The classifiers carry out their task on the basis of local information; some of them (e.g. subject/object) rely on preprocessing of the input string by other classifiers (e.g. NP- and VP-recognition). This approach has been tested successfully on the individual classification tasks and on shallow parsing, yielding state-of-the-art accuracy results.

This work differs from the DOP approach in important respects. First of all, it does not address the full parsing problem; it is not intended to deal with arbitrary tree structures derived from an arbitrary corpus, but hardcodes a limited number of very specific syntactic notions. Secondly, the classification decisions (or disambiguation decisions) are based on a limited context which is fixed in advance. Thirdly, the approach employs similarity metrics rather than stochastic modeling techniques. Some of these features make the approach more efficient than DOP, and therefore easier to apply to large tree-banks. But at the same time this method shows clear limitations if we look at its applicability to general parsing tasks, or if we consider the disambiguation accuracy that can be achieved if only local information is taken into account.

    7.5.2 Memory-Based Sequence Learning vs. DOP

Memory-Based Sequence Learning (MBSL), described in Argamon et al. (1998), can be seen as analogous to a DOP model at the level of flat non-recursive linguistic descriptions. It is interesting to pursue this analogy by analysing MBSL in terms of the four components that constitute a DOP model (cf. the end of section 3 above). Since MBSL is a new method that seems closer to DOP1 than all other MBL methods discussed in this volume, we will first summarize how it works before we compare it with DOP1.

Like DOP1, an MBSL system works on the basis of a corpus of utterances annotated with labelled constituent structures. It assumes a different representation of these structures, however: an MBSL corpus consists of bracketed strings. Each pair of brackets delimits the borders of a substring of some syntactic category, e.g., Noun Phrase (NP) or Verb Phrase (VP). For every syntactic category, a separate MBSL system is constructed. Given a corpus containing bracketed strings of part-of-speech tags (pos-tags), an MBSL system stores all the substrings that contain at least one bracket, together with their occurrence counts; such substrings are called tiles.[7] Moreover, the MBSL system stores all substrings stripped of their brackets, together with their occurrence counts. The positive evidence in the corpus for a given tile is computed as the ratio of the occurrence count of the tile to the total occurrence count of the substring obtained by stripping the tile of its brackets.
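The tile statistics can be sketched as follows; the bracketed pos-tag corpus and the tag names below are made up for illustration, and this is a simplification of the MBSL statistics rather than the actual implementation.

```python
from collections import Counter

# Toy bracketed pos-tag corpus for an NP-level MBSL system.
# The third sentence leaves a "DT NN" sequence unbracketed, so the
# tile "[ DT NN ]" will have evidence < 1.
corpus = [
    ["[", "DT", "NN", "]", "VB", "[", "DT", "NN", "]"],
    ["[", "NNP", "]", "VB", "[", "DT", "NN", "]"],
    ["DT", "NN", "VB", "[", "NNP", "]"],
]

def strip(seq):
    """Remove the bracket symbols, keeping only the pos-tags."""
    return tuple(t for t in seq if t not in ("[", "]"))

tile_counts, plain_counts = Counter(), Counter()
for sent in corpus:
    n = len(sent)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = tuple(sent[i:j])
            if "[" in sub or "]" in sub:
                tile_counts[sub] += 1  # a tile: contains >= 1 bracket
    # count bracket-free substrings of the bare tag sequence
    tags = strip(sent)
    for i in range(len(tags)):
        for j in range(i + 1, len(tags) + 1):
            plain_counts[tags[i:j]] += 1

def positive_evidence(tile):
    # ratio: occurrences of the tile vs. occurrences of the same
    # tag string with its brackets stripped off
    return tile_counts[tile] / plain_counts[strip(tile)]

print(positive_evidence(("[", "DT", "NN", "]")))  # 3/4 = 0.75
```

Here "DT NN" occurs four times in the corpus but is bracketed as an NP only three times, so the fully bracketed tile receives positive evidence 0.75.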

When an MBSL system is prompted to assign brackets to a new input sentence, it assigns all possible brackets to the sentence and then computes the positive evidence for every pair of brackets on the basis of its stored corpus. To this end, every subsequence of the input sentence surrounded by a (matching) pair of brackets is considered a candidate. A candidate together with the rest of the sentence is considered a situated candidate. To evaluate the positive evidence for a situated candidate, a tiling of that situated candidate is attempted on the basis of the stored tiles. When tiling a situated candidate, tiles are retrieved from storage and placed such that they cover it entirely. Only tiles of sufficient positive evidence are retrieved for this purpose (a threshold on sufficient evidence is used). To specify how a cover is obtained, it is necessary to define the notion of connecting tiles (with respect to a situated candidate). We say that a tile partially covers a situated candidate if the tile is equivalent to some substring of that situated candidate; the substring is then called a matching substring of the tile. Given a situated candidate and a tile T1 that partially covers it, tile T2 is called connecting to tile T1 iff T2 also partially covers the situated candidate and the matching substrings of T1 and T2 overlap without being included in each other, or are adjacent to each other in the situated candidate. The shortest substring of a situated candidate that is partially covered by two connecting tiles is said to be covered by the two tiles; we also say that the two tiles constitute a cover of that substring. The notion of a cover is then generalized to a sequence of connecting tiles: a sequence of connecting tiles is an ordered sequence of tiles such that every tile connects to the tile that precedes it in the sequence. Hence, a cover of the situated candidate is a sequence of connecting tiles that covers it entirely. Crucially, there can be many different covers of a situated candidate.
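The connecting-tile and cover conditions can be expressed over match positions; the following sketch assumes each tile match is represented simply by a half-open (start, end) interval over the situated candidate, abstracting away from the tile contents.

```python
def connects(span1, span2):
    """Do two tile matches (half-open [start, end) intervals over a
    situated candidate) connect? They must either be directly adjacent,
    or overlap without one being included in the other."""
    (s1, e1), (s2, e2) = sorted([span1, span2])
    if s2 == e1:                        # adjacent matches
        return True
    overlap = s2 < e1                   # share at least one position
    contained = e2 <= e1 or s1 == s2    # one match inside the other
    return overlap and not contained

def is_cover(spans, length):
    """A sequence of tile matches covers a situated candidate of the
    given length iff consecutive matches connect and the whole span
    [0, length) is covered."""
    spans = sorted(spans)
    if not spans or spans[0][0] != 0 or max(e for _, e in spans) != length:
        return False
    return all(connects(a, b) for a, b in zip(spans, spans[1:]))

print(connects((0, 3), (2, 5)))  # overlapping, not nested -> True
print(connects((0, 5), (1, 3)))  # nested -> False
print(is_cover([(0, 3), (2, 5), (5, 7)], 7))  # True
```

Enumerating all sequences of connecting tiles that satisfy `is_cover` yields the (possibly many) covers of a situated candidate mentioned above.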

The evidence score for a situated candidate is a function of the evidence for the various covers that can be found for it in the corpus. MBSL estimates this score by a heuristic function that combines various statistical measures concerning properties of the covers, e.g., the number of covers, the number of tiles in a cover, the size of the overlap between tiles, etc. In order to select a bracketing for the input sentence, all possible situated candidates are ranked according to their scores. Starting with the situated candidate with the highest score, C, the bracketing algorithm removes all other situated candidates that contain a pair of brackets that overlaps with the brackets in C. This procedure is repeated iteratively until it stabilizes. The remaining set of situated candidates defines a bracketing of the input sentence.
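The iterative selection can be sketched as a greedy loop; the scores below are invented, and we adopt one plausible reading of "overlapping brackets", namely that two distinct candidate spans conflict whenever they share a position.

```python
def select_bracketing(candidates):
    """Greedy selection sketch: candidates maps a candidate span
    (start, end) to its evidence score. Repeatedly keep the highest
    scoring remaining candidate and discard every candidate whose
    span conflicts with it (here: shares any position with it)."""
    remaining = sorted(candidates.items(), key=lambda kv: -kv[1])
    chosen = []
    while remaining:
        (span, _), *remaining = remaining
        chosen.append(span)
        remaining = [(s, sc) for s, sc in remaining
                     if s[1] <= span[0] or s[0] >= span[1]]  # keep disjoint
    return sorted(chosen)

scores = {(0, 2): 0.9, (1, 3): 0.8, (3, 5): 0.7, (0, 5): 0.6}
print(select_bracketing(scores))  # [(0, 2), (3, 5)]
```

The highest-scoring span (0, 2) eliminates its competitors (1, 3) and (0, 5), after which (3, 5) survives; the surviving spans define the output bracketing.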

It is interesting to look at MBSL from the perspective of the DOP framework. Let us first consider its representation formalism, its fragment extraction function and its combination operation. The formalism that MBSL employs to represent utterance analyses is a part-of-speech tagging of the sentences together with a bracketing that defines the target syntactic category. The fragments are the tiles that can be extracted from the training corpus. The combination operation is an operation that combines two tiles such that they connect. We cannot extend this analogy to include the covers, however, because MBSL considers the covers locally with respect to every situated candidate. If MBSL had employed covers for a whole consistent bracketing of the input sentence, MBSL covers would have corresponded to DOP derivations, and consistent bracketings to parses. Another important difference between MBSL and DOP shows up if we look at the way in which "disambiguation" or pruning of the bracketing takes place in MBSL. Rather than employing a probabilistic model, MBSL resorts to local heuristic functions with various parameters that must be tuned in order to achieve optimal results.

In summary, MBSL and DOP are analogous but nevertheless rather different MBL methods for dealing with syntactic ambiguity. The differences can be summarized by (i) the locally-based (MBSL) vs. globally-based (DOP) ranking strategy for alternative analyses, and (ii) the ad hoc heuristics for computing scores (MBSL) vs. the stochastic model (DOP) for computing probabilities.

Note also that the combination operation employed by the MBSL system allows a kind of generalization over the tree-bank that is not possible in DOP1. Because MBSL allows tiling situated candidates using tiles that contain one bracket (either a left or a right bracket), the matching pairs of brackets that result from a series of connecting tiles may delimit a string of part-of-speech tags that cannot be constructed by nesting pairs of matching brackets from the tree-bank (which would amount to a kind of substitution-operation on bracketed strings).

In contrast to MBSL, DOP1's tree substitution-operation generates only SSFs that can be obtained by nesting SSFs from the tree-bank under the restrictive constraints of the substitution-operation. This implies that each pair of matching brackets that corresponds to an SSF is a matching pair of brackets that can be found in the tree-bank. DOP1 generates exactly the same set of trees as the CFG underlying the tree-bank, while MBSL generalizes beyond the string-language and the tree-language of this CFG. It is not obvious, however, that MBSL operates in terms of the most felicitous representation of hierarchic surface structure, and one may wonder to what extent the generalizations produced in this way are actually useful.

Before we close off this section, we should emphasize that DOP-models do not necessarily lack all generalization power. Some extensions of DOP1 have been developed that learn new subtrees (and SSFs) by allowing mismatches between categories (in particular between SSFs and part-of-speech tags). For instance, Bod (1995) discusses a model which assigns non-zero probabilities to unobserved lexical items on the basis of Good-Turing estimation. Another DOP-model (LFG-DOP) employs a corpus annotated with LFG-structures, and allows generalization over feature values (Bod & Kaplan 1998). We are currently investigating the design of DOP-models with more powerful generalization capabilities.

    8. Conclusion

In this paper we focused on the Memory-Based aspects of the Data-Oriented Parsing (DOP) model. We argued that disambiguation is a central element of linguistic processing that can be most suitably modelled by a memory-based learning model. Furthermore, we argued that this model must employ linguistic grammatical descriptions and that it should employ probabilities in order to account for the psycholinguistic observations concerning the central role that frequencies play in linguistic processing. Based on these motivations we presented the DOP model as a probabilistic recursive Memory-Based model for linguistic processing. We discussed some computational aspects of a simple instantiation of it, called DOP1, that aims specifically at syntactic disambiguation.

We also summarized some of our empirical and experimental observations concerning DOP1 that are specifically related to its Memory-Based nature. We conclude that these observations provide convincing evidence for the hypothesis that the structural units for language processing should not be limited to a minimal set of grammatical rules, but must be defined in terms of a (redundant) space of linguistic relations that are encountered in an explicit "memory" or "case-base" representing linguistic experience. Moreover, although many of our empirical observations support the argument for a purely memory-based approach, we encountered other empirical observations that exhibit the strong tension between this argument and various factors encountered in practice, e.g. data-sparseness. We may conclude, however, that our empirical observations do support the idea that "forgetting exceptions is harmful" (Van den Bosch & Daelemans 1998).

Furthermore, we analyzed the DOP model within the MBL paradigm and contrasted it with other MBL approaches within NLP and outside it. We concluded that despite the many similarities between DOP and other MBL models, there are major differences. For example, DOP1 distinguishes itself from other MBL models in that (i) it deals with a classification task involving infinite instance spaces and class spaces described by an ambiguous recursive grammar, and (ii) it employs a disambiguation facility based on a stochastic extension of the syntactic case-base.

    Endnotes

1. We employed a cleaned-up version of this corpus in which mistagged words had been corrected by hand, and analyses with so-called pseudo-attachments had been removed.

2. Bod (1998a) also presents experiments with extensions of DOP1 that generalize over the corpus fragments. DOP1 cannot cope with unknown words.

3. Bod (1995, 1998a) shows how to extend the model to overcome this limitation.

4. The experiments in Sima'an (1999) concern a comparison between DOP1 and a new model, called Specialized DOP (SDOP). Since this comparison is not the main issue here, we will summarize the results concerning the DOP1 model only. However, it might be of interest here to mention the conclusions of the comparison. In short, the SDOP model extends the DOP1 model with automatically inferred subtree-selection criteria. These criteria are determined by a new learning algorithm that specializes the annotation of the training tree-bank to the domain of language use represented by that tree-bank. The SDOP models acquired from a tree-bank are substantially smaller than the DOP1 models. Nevertheless, Sima'an (1999) shows that the SDOP models are at least as accurate and have the same coverage as DOP1 models.

5. The SRI-ATIS tree-bank was generously made available for these experiments by SRI International, Cambridge (UK).

6. A concept is a function; members of its domain are called instances and members of its range are called classes. The task or target concept is the function that the learning process tries to estimate.

7. In fact only limited context is allowed around a bracket, which means that not all of the substrings in the corpus are stored.

    References

A. Aamodt and E. Plaza, 1994. "Case-Based Reasoning: Foundational issues, methodological variations and system approaches", AI Communications 7, 39-59.

D. Aha, 1998. "The Omnipresence of Case-Based Reasoning in Science and Application", Knowledge-Based Systems 11(5-6), 261-273.

D. Aha and D. Wettschereck, 1997. "Case-Based Learning: Beyond Classification of Feature Vectors", ECML-97 invited paper. Also in MLnet ECML'97 workshop, MLnet News 5:1, 8-11.

S. Abney, 1996. "Statistical Methods and Linguistics." In Judith L. Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language, Cambridge (Mass.): MIT Press, pp. 1-26.

    H. Alshawi, editor, 1992. The Core Language Engine, Boston: MIT Press.

H. Alshawi and D. Carter, 1994. "Training and Scaling Preference Functions for Disambiguation", Computational Linguistics.

K. Ashley, 1990. Modeling Legal Argument: Reasoning with Cases and Hypotheticals, Cambridge, MA: MIT Press.

M. van den Berg, R. Bod and R. Scha, 1994. "A Corpus-Based Approach to Semantic Interpretation", Proceedings Ninth Amsterdam Colloquium, Amsterdam, The Netherlands.


E. Black, S. Abney, D. Flickenger, C. Gnadiec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini and T. Strzalkowski, 1991. "A Procedure for Quantitatively Comparing the Syntactic Coverage of English", Proceedings DARPA Speech and Natural Language Workshop, Pacific Grove, Morgan Kaufmann.

A. van den Bosch and W. Daelemans, 1998. "Do not forget: Full memory in memory-based learning of word pronunciation", Proceedings of NeMLaP3/CoNLL98.

E. Black, J. Lafferty and S. Roukos, 1992. "Development and Evaluation of a Broad-Coverage Probabilistic Grammar of English-Language Computer Manuals", Proceedings ACL'92, Newark, Delaware.

R. Bod, 1992. "A Computational Model of Language Performance: Data Oriented Parsing", Proceedings COLING'92, Nantes.

R. Bod, 1993a. "Using an Annotated Language Corpus as a Virtual Stochastic Grammar", Proceedings AAAI'93, Morgan Kaufmann, Menlo Park, Ca.

R. Bod, 1993b. "Monte Carlo Parsing", Proceedings of the Third International Workshop on Parsing Technologies, Tilburg/Durbuy.

R. Bod, 1995. Enriching Linguistics with Statistics: Performance Models of Natural Language, Ph.D. thesis, ILLC Dissertation Series 1995-14, University of Amsterdam.

R. Bod, 1998a. Beyond Grammar, CSLI Publications / Cambridge University Press, Cambridge.

R. Bod, 1998b. "Spoken Dialogue Interpretation with the DOP Model", Proceedings COLING-ACL'98, Montreal, Canada.

R. Bod and R. Kaplan, 1998. "A Probabilistic Corpus-Driven Model for Lexical-Functional Analysis", Proceedings COLING-ACL'98, Montreal, Canada.

R. Bod and R. Scha, 1996. Data-Oriented Language Processing. An Overview. Technical Report LP-96-13, Institute for Logic, Language and Computation, University of Amsterdam.

R. Bod and R. Scha, 1997. "Data-Oriented Language Processing." In S. Young and G. Bloothooft (eds.), Corpus-Based Methods in Language and Speech Processing, Kluwer Academic Publishers, Boston, 137-173.

R. Bonnema, 1996. Data-Oriented Semantics. Master's Thesis, Department of Computational Linguistics (Institute for Logic, Language and Computation).


R. Bonnema, R. Bod and R. Scha, 1997. "A DOP Model for Semantic Interpretation", Proceedings 35th Annual Meeting of the ACL / 8th Conference of the EACL, Madrid, Spain.

C. Brew, 1995. "Stochastic HPSG", Proceedings European Chapter of the ACL'95, Dublin, Ireland.

T. Briscoe and J. Carroll, 1993. "Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars", Computational Linguistics 19(1), 25-59.

J. Carroll and D. Weir, 1997. "Encoding Frequency Information in Lexicalized Grammars", Proceedings 5th International Workshop on Parsing Technologies, MIT, Cambridge (Mass.).

D. Carter, 1997. "The TreeBanker: a Tool for Supervised Training of Parsed Corpora", Proceedings of the Workshop on Computational Environments for Grammar Development and Linguistic Engineering, ACL/EACL'97, Madrid, Spain.

T. Cartwright and M. Brent, 1997. "Syntactic categorization in early language acquisition: formalizing the role of distributional analysis", Cognition 63, pp. 121-170.

J. Chappelier and M. Rajman, 1998. "Extraction stocha