Improving Subcategorization Acquisition using Word Sense Disambiguation
Anna Korhonen and Judith Preiss
University of Cambridge, Computer Laboratory
15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
Outline
Subcategorization Acquisition
Baseline System
Baseline System combined with WSD
Probabilistic WSD
Experiment
Evaluation
Methods
Introduction
Subcategorization
The dependents of a verb are classified into:
- arguments (subject, object, direct object)
  - the subject
  - non-subject arguments (complements), e.g. Mary knows that she is winning.
- adjuncts, e.g. She read the book with great interest.
The type of complements that a verb permits gives the verb classification.
The verb classification is called subcategorization.
SCFs: subcategorization frames for a given predicate; essential for parsing.
Introduction
SCF: a particular set of arguments that a verb can appear with
- Intransitive verb. NP[subject]. They danced.
- Transitive verb. NP[subject], NP[object]. Mary appreciates her Professor.
- Intransitive with PP. NP[subject], PP. He lives in Paris.
- Transitive with PP. NP[subject], NP[object], PP. She put the book on the table.
Introduction
Manual versus automatic subcategorization
Manual
- does not provide the relative frequency of SCFs
- predicates change behavior
Automatic
- no lexical/semantic information is exploited
- reveals only syntactic aspects
- no distinction between predicate senses
Korhonen (2002) model: back-off estimates based on the predominant sense of a verb (WordNet)
Acquisition goal: a domain-specific lexicon (written vs. spoken; genres based on different senses)
Subcategorization Acquisition
Baseline System: a system with knowledge of verb semantics
- Levin (93): verb senses divide into classes that are distinctive for subcategorization
- Korhonen (2002): verb forms can be divided into semantic classes based on their predominant sense (fly - move)
- determine the sense and the semantic class (Levin class "Motion verbs")
- Briscoe & Carroll (97): SCF distributions are acquired from corpus data
Subcategorization Acquisition
Baseline System: description
Linear interpolation smoothing with back-off estimates is used for the SCF distribution.
The method of obtaining the back-off estimates:
a) 4-5 representative verbs are chosen from a verb class
b) for these verbs, the SCF distribution is built by manual analysis of 300 occurrences of each verb (BNC)
c) the resulting SCF distributions are merged, giving equal weight to each distribution
E.g. fly - move, slide, arrive, travel, sail
An empirical threshold is used to filter out noisy SCFs.
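The back-off steps above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the frame labels, the toy probabilities, the interpolation weight `lam` and the threshold value are all assumptions made for the example.

```python
from collections import Counter


def merge_equal_weight(distributions):
    """Step c): merge SCF distributions, giving equal weight to each."""
    merged = Counter()
    for dist in distributions:
        for scf, prob in dist.items():
            merged[scf] += prob / len(distributions)
    return dict(merged)


def interpolate(observed, backoff, lam):
    """Linear interpolation: lam * observed + (1 - lam) * back-off estimate."""
    scfs = set(observed) | set(backoff)
    return {
        scf: lam * observed.get(scf, 0.0) + (1 - lam) * backoff.get(scf, 0.0)
        for scf in scfs
    }


def filter_noisy(dist, threshold):
    """Keep only SCFs whose smoothed probability reaches the threshold."""
    return {scf: prob for scf, prob in dist.items() if prob >= threshold}


# Hypothetical distributions for two representative verbs of one class.
class_distributions = [
    {"NP": 0.5, "NP-PP": 0.5},
    {"NP": 1.0},
]
backoff = merge_equal_weight(class_distributions)

# Hypothetical distribution observed in corpus data for the target verb.
observed = {"NP": 1.0}
smoothed = interpolate(observed, backoff, lam=0.8)  # lam is an assumed value
filtered = filter_noisy(smoothed, threshold=0.1)    # threshold is an assumed value
```

In this toy case smoothing gives the unseen frame NP-PP a small non-zero probability, which the assumed threshold then filters out again; in practice the threshold is set empirically, as the slide notes.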
Subcategorization Acquisition
Combining with WSD
Preiss & Korhonen (02)
- created separate corpus datasets for the senses being disambiguated (the first and/or second sense) and further datasets for the remaining senses
- SCFs were acquired from both types of dataset
- back-off estimates were used for the SCFs acquired from the first type of dataset; the estimates were used for smoothing according to the relevant sense
- the acquired SCF lexicons were merged at the end; the resulting SCF distribution was specific to a verb rather than to a sense
- problems for subcategorization acquisition: the datasets were too small, and separating the data was unnecessary
Subcategorization Acquisition
New method
- does not involve separating the data
- uses back-off estimates for the whole sense distribution given by the WSD system, not only for the predominant sense
p_j(scf_i), j = 1..nb0 (nb0 = the number of back-off estimates): the probabilities of the SCFs in the different back-off distributions
P(scf_i) = ∑_{j=1}^{nb0} λ_j · p_j(scf_i)
λ_j: weights for the different distributions; they sum to 1 and are obtained from the probabilistic WSD system
Probabilistic WSD
- able to determine a probability distribution over senses for each noun, verb, adjective and adverb
- able to determine a probability distribution over the senses for each verb and to compute its average
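The weighted sum P(scf_i) = ∑ λ_j · p_j(scf_i) can be illustrated with a short sketch. It assumes that the weights λ_j are obtained by averaging the per-occurrence sense distributions returned by the WSD system; the sense names, frame labels and numbers are hypothetical.

```python
def average_sense_distribution(per_occurrence):
    """Average per-occurrence sense distributions to obtain the weights lambda_j."""
    n = len(per_occurrence)
    senses = set().union(*per_occurrence)
    return {s: sum(d.get(s, 0.0) for d in per_occurrence) / n for s in senses}


def mix_backoff(weights, backoff_by_sense):
    """P(scf_i) = sum over j of lambda_j * p_j(scf_i)."""
    mixed = {}
    for sense, lam in weights.items():
        for scf, prob in backoff_by_sense[sense].items():
            mixed[scf] = mixed.get(scf, 0.0) + lam * prob
    return mixed


# Hypothetical WSD output for two occurrences of one verb.
occurrences = [
    {"motion": 0.8, "transport": 0.2},
    {"motion": 0.4, "transport": 0.6},
]
weights = average_sense_distribution(occurrences)

# Hypothetical back-off distributions, one per sense (semantic class).
backoff_by_sense = {
    "motion": {"INTRANS": 0.7, "PP": 0.3},
    "transport": {"NP": 0.9, "PP": 0.1},
}
mixed = mix_backoff(weights, backoff_by_sense)
```

Because the weights sum to 1 and each back-off distribution sums to 1, the mixed distribution also sums to 1 without renormalisation.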
Subcategorization Acquisition
System Description
- based on the Stevenson and Wilks (2001) system, which combines knowledge sources to produce a WSD tool
- the probabilistic WSD system combines the probability distributions over senses produced by each module (modules described in Yarowsky (2000); Mihalcea (2002); Pedersen (2002))
- each module's distribution is smoothed according to its confidence value; a low-confidence module is smoothed heavily towards the uniform distribution
- the optimal combination of modules is chosen by accuracy (F-measure) on the English all-words task
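One way to picture confidence-based smoothing is the sketch below. The slides do not give the formulas, so both choices here are assumptions: smoothing is modelled as linear interpolation with the uniform distribution, and the smoothed module outputs are combined by simple averaging. Module outputs and confidence values are hypothetical.

```python
def smooth_towards_uniform(dist, confidence):
    """Interpolate a sense distribution with the uniform one (assumed scheme).

    confidence = 1.0 leaves the distribution unchanged;
    confidence = 0.0 replaces it with the uniform distribution.
    """
    n = len(dist)
    return {s: confidence * p + (1 - confidence) / n for s, p in dist.items()}


def combine_modules(module_outputs):
    """Average the smoothed distributions of all modules (assumed combination).

    module_outputs is a list of (distribution, confidence) pairs
    defined over the same set of senses.
    """
    smoothed = [smooth_towards_uniform(d, c) for d, c in module_outputs]
    senses = smoothed[0].keys()
    return {s: sum(d[s] for d in smoothed) / len(smoothed) for s in senses}


# Hypothetical outputs of two WSD modules for one verb occurrence.
module_outputs = [
    ({"sense1": 0.9, "sense2": 0.1}, 1.0),  # fully trusted module
    ({"sense1": 0.2, "sense2": 0.8}, 0.0),  # untrusted module, becomes uniform
]
combined = combine_modules(module_outputs)
```

With these toy values the untrusted module contributes no preference at all, so the combined distribution still favours sense1.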
Subcategorization Acquisition
Experiment: Test Data
- polysemous verbs whose predominant sense is not very frequent: 29 verbs chosen randomly
- the WordNet senses of the chosen verbs are mapped to Levin-style senses
- the maximum number of Levin senses considered was 4, and some of the given senses were left out
Subcategorization Acquisition
Evaluation Method
- took 20 million words of the BNC and extracted all sentences containing the test verbs
- disambiguated 1000 sentences for each verb with the probabilistic WSD system
- applied the modified subcategorization system
- built an individual set of back-off estimates for each verb, based on the frequencies of its different senses in the corpus data
- evaluated the results against a manual analysis of the corpus data
- an average of 300 occurrences of each verb in the BNC test data yielded 5-21 gold standard SCFs per verb (16 on average)
Subcategorization Acquisition
Evaluation Method
- F-measure = 2∙P∙R ∕ (P + R), where P is precision and R is recall
- RC: Spearman rank correlation
- KL: Kullback-Leibler distance
- CE: cross entropy
- the total number of SCFs missing from the distribution is recorded, to determine the accuracy of the back-off estimates
- comparison with other systems: the baseline, and another system which assumes no sense at all
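The measures listed above can be computed along these lines. This is a minimal sketch using standard textbook definitions; the natural logarithm is an assumption (the slides do not state the base), and the SCF labels and probabilities are hypothetical.

```python
import math


def f_measure(acquired, gold):
    """F = 2PR / (P + R) over sets of SCF types."""
    true_positives = len(acquired & gold)
    precision = true_positives / len(acquired)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def kl_distance(p, q):
    """Kullback-Leibler distance KL(p || q); q must be non-zero wherever p is."""
    return sum(pi * math.log(pi / q[s]) for s, pi in p.items() if pi > 0)


def cross_entropy(p, q):
    """Cross entropy of q with respect to p (natural log, assumed base)."""
    return -sum(pi * math.log(q[s]) for s, pi in p.items() if pi > 0)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold standard and acquired SCF distributions.
gold = {"A": 0.5, "B": 0.5}
acquired = {"A": 0.25, "B": 0.75}
f_score = f_measure({"A", "B", "C"}, {"A", "B", "D", "E"})
kl = kl_distance(gold, acquired)
ce = cross_entropy(gold, acquired)
rc = spearman([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])
```

KL and CE are undefined when the acquired distribution assigns zero probability to a gold standard SCF, which is one reason the number of SCFs left unseen after smoothing is tracked separately.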
Subcategorization Acquisition
Results
- of a total of 175 gold standard SCFs unseen in the unsmoothed lexicon, 107 remain unseen after using the predominant sense method
- with the WSD method, only 22 remain unseen
- performance improves with the number of senses
- the IS measure shows an intersection between the acquired and the gold standard SCFs when WSD is used
Subcategorization Acquisition
Results
- improvement for the highly polysemous verbs (bear, count, roar etc.)
- verbs whose senses differ substantially in terms of subcategorization (conceive, continue, grasp etc.)
- verbs whose senses involve mainly NP/PP
- SCFs seem to appear in the data as "families" for a sense of a verb
- worse performance for seek using WSD, even though it is highly polysemous and its senses differ in terms of subcategorization
- no clear improvement: choose, compose, induce, watch
Subcategorization Acquisition
Conclusions
- using WSD, an improvement can be shown in SCF acquisition for difficult verbs, because their senses differ in terms of subcategorization too, not only in the degree of polysemy
Future work
- a better way of integrating the frequency of the acquired senses into the SCFs, and a refinement of the subcategorization method