EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE Massimo Poesio Università di Trento and University of Essex Vilem Mathesius Lectures Praha, 2007

Page 1: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Massimo Poesio, Università di Trento and University of Essex

Vilem Mathesius Lectures Praha, 2007

Page 2: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Plan of the series

Wednesday: Annotating context dependence, and particularly anaphora

Yesterday: Using anaphorically annotated corpora to investigate local & global salience

Today: Using anaphorically annotated corpora to investigate anaphora resolution

Page 3: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Today’s lecture

The Vieira / Poesio work on robust definite description resolution

Bridging references

Discourse-new

(If time allows): Task-oriented evaluation

Page 4: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Preliminary corpus study (Poesio and Vieira, 1998)

Annotators asked to classify about 1,000 definite descriptions from the ACL/DCI corpus (Wall Street Journal texts) into three classes:

DIRECT ANAPHORA: a house … the house

DISCOURSE-NEW: the belief that ginseng tastes like spinach is more widespread than one would expect

BRIDGING DESCRIPTIONS: the flat … the living room; the car … the vehicle

Massimo Poesio:

Add better examples (e.g., from The book of evidence)

Page 5: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Poesio and Vieira, 1998

Results:

More than half of the definite descriptions are first-mention

Subjects didn’t always agree on the classification of an antecedent (bridging descriptions: ~8%)

Page 6: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

The Vieira / Poesio system for robust definite description resolution

Follows a SHALLOW PROCESSING approach (Carter, 1987; Mitkov, 1998): it only uses

Structural information (extracted from Penn Treebank)

Existing lexical sources (WordNet)

(Very little) hand-coded information

(Vieira & Poesio, 1996 / Vieira, 1998 / Vieira & Poesio, 2001)

Page 7: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Methods for resolving direct anaphors

DIRECT ANAPHORA:

the red car, the car, the blue car: premodification heuristics

segmentation: approximated with ‘loose’ windows

Page 8: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Methods for resolving discourse-new definite descriptions

DISCOURSE-NEW DEFINITES

the first man on the Moon, the fact that Ginseng tastes of spinach: a list of the most common functional predicates (fact, result, belief) and modifiers (first, last, only… )

heuristics based on structural information (e.g., establishing relative clauses)

Page 9: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

A `knowledge-based’ classification of bridging descriptions (Vieira, 1998)

Based on LEXICAL RELATIONS such as synonymy, hyponymy, and meronymy, available from a lexical resource such as WordNet: the flat … the living room

The antecedent is introduced by a PROPER NAME: Bach … the composer

The anchor is a NOMINAL MODIFIER introduced as part of the description of a discourse entity: selling discount packages … the discounts

Page 10: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

… continued (cases NOT attempted by our system)

The anchor is not explicitly mentioned in the text, but is a `discourse topic’: the industry (in a text about oil companies)

The resolution depends on more general commonsense knowledge: last week’s earthquake … the suffering people

The anchor is introduced by a VP: Kadane oil is currently drilling two oil wells. The activity…

Page 11: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Distribution of bridging descriptions

Class              Total       Percentage
Syn/Hyp/Mer        12/14/12    19%
Names              49          24%
Compound Nouns     25          12%
Events             40          20%
Discourse Topic    15          7%
Inference          37          18%
Total              204         100%

Page 12: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

The (hand-coded) decision tree

1. Apply ‘safe’ discourse-new recognition heuristics

2. Attempt to resolve as same-head anaphora

3. Attempt to classify as discourse new

4. Attempt to resolve as bridging description. Search backward 1 sentence at a time and apply heuristics in the following order:
   1. Named entity recognition heuristics – R=.66, P=.95
   2. Heuristics for identifying compound nouns acting as anchors – R=.36
   3. Access WordNet – R, P about .28

Page 13: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Overall Results

Evaluated on a ‘test corpus’ of 464 definite descriptions

Overall results:

             R      P      F
Version 1    53%    76%    62%
Version 2    57%    70%    62%
D-N def      77%    77%    77%

Page 14: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Overall Results

Results for each type of definite description:

                   R      P      F
Direct anaphora    62%    83%    71%
Disc new           69%    72%    70%
Bridging           29%    38%    32.9%

Page 15: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Questions raised by the Vieira / Poesio work

Do these results hold for larger datasets?

Do discourse-new detectors help?

Bridging:
– How to define the phenomenon?
– Where to get the information?
– How to combine salience with lexical & commonsense knowledge?

Can such a system be helpful for applications?

Page 16: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Mereological bridging references

Cartonnier (Filing Cabinet) with Clock

This piece of mid-eighteenth-century furniture was meant to be used like a modern filing cabinet; papers were placed in leather-fronted cardboard boxes (now missing) that were fitted into the open shelves. A large table decorated in the same manner would have been placed in front for working with those papers. Access to the cartonnier's lower half can only be gained by the doors at the sides, because the table would have blocked the front.

Page 17: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

PREVIOUS RESULTS

A series of experiments using the Poesio / Vieira dataset, containing 204 bridging references, including 39 `WordNet’ bridges

Resolving bridging references needs lexical knowledge (Vieira and Poesio, 2000, but also Carter 1985, Hobbs in a number of papers, etc.)

But: even large lexical resources such as WordNet not enough, particularly for mereological references (Poesio et al, 1997; Vieira and Poesio, 2000; Poesio, 2003; Garcia-Almanza, 2003)

Partial solution: use lexical acquisition (HAL, Hearst-style construction method). Best results (for mereology): construction-style

Page 18: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

FINDING MERONYMICAL RELATIONS USING SYNTACTIC INFORMATION

Some syntactic constructions suggest semantic relations

– (Cf. Hearst 1992, 1998 for hyponyms)
– Ishikawa 1998, Poesio et al 2002: use syntactic constructions to extract mereological information from corpora
  – The WINDOW of the CAR
  – The CAR’s WINDOW
  – The CAR WINDOW

See also Berland & Charniak 1999, Girju et al 2002
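The construction-based idea can be illustrated with a small sketch. It assumes plain text and single-word nouns; the regular expressions and the function name `extract_part_whole` are illustrative and are not the patterns used in the cited work.

```python
import re

# Illustrative patterns for two of the three constructions on the slide.
# The bare compound ("the car window") is left out because a surface pattern
# alone cannot tell it apart from other noun-noun compounds.
OF_PATTERN = re.compile(r"\bthe (\w+) of the (\w+)\b")
GENITIVE_PATTERN = re.compile(r"\bthe (\w+)'s (\w+)\b")


def extract_part_whole(text):
    """Return candidate (part, whole) pairs found in text."""
    text = text.lower()
    # "the PART of the WHOLE": group 1 is the part, group 2 the whole.
    pairs = [(m.group(1), m.group(2)) for m in OF_PATTERN.finditer(text)]
    # "the WHOLE's PART": group 1 is the whole, group 2 the part.
    pairs += [(m.group(2), m.group(1)) for m in GENITIVE_PATTERN.finditer(text)]
    return pairs


pairs = extract_part_whole("The window of the car broke. The car's window broke.")
```

Counting how often each pair is extracted from a large corpus would then give a rough measure of how plausible the part-of relation is.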

Page 19: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

LEXICAL RESOURCES FOR BRIDGING: A SUMMARY

Class            Syn          Hyp          Mer          Total WN
Total            12           14           12           38
WordNet          4 (33.3%)    8 (57.1%)    3 (33.3%)    15 (39%)
HAL              4 (33.3%)    2 (14.3%)    2 (16.7%)    8 (22.2%)
Constructions    1 (8.3%)     0            8 (66.7%)    9 (23.7%)

(All using the Vieira / Poesio dataset.)

Page 20: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

FOCUSING AND MEREOLOGICAL BRIDGES

Cartonnier (Filing Cabinet) with Clock

This piece of mid-eighteenth-century furniture was meant to be used like a modern filing cabinet; papers were placed in leather-fronted cardboard boxes (now missing) that were fitted into the open shelves. A large table decorated in the same manner would have been placed in front for working with those papers. Access to the cartonnier's lower half can only be gained by the doors at the sides, because the table would have blocked the front.

(See Sidner, 1979; Markert et al, 1995.)

Page 21: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

FOCUS (CB) TRACKING + GOOGLE SEARCH (POESIO, 2003)

Analyzed 169 associative BDs in GNOME corpus (58 mereology)

Correlation between distance and focusing (Poesio et al, 2004) and choice of anchor

– 77.5% anchor same or previous sentence; 95.8% in last five sentences
– CB(U-1) anchor for only 33.6% of BDs,
– but 89% of anchors had been CB or CP

Using `Google distance’ to choose among salient anchor candidates

Page 22: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

FINDING MEREOLOGICAL RELATIONS USING GOOGLE

Lexical vicinity measure (for MERONYMS) between NBD and NPA
– Search in Google for “the NBD of the NPA” (cf. Ishikawa, 1998; Poesio et al, 2002). E.g., “the drawer of the cabinet”
– Choose as anchor the PA whose NPA results in the greater number of hits

Preliminary results for associative BDs: around 70% P/R (by hand)

See also: Markert et al, 2003, 2005; Modjeska et al, 2003
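The anchor selection step can be sketched as follows. The search engine is replaced by a caller-supplied function, and the hit counts below are made up for illustration; `build_query` and `choose_anchor` are hypothetical names.

```python
def build_query(bd_head, anchor_head):
    """Build the search pattern "the NBD of the NPA"."""
    return f"the {bd_head} of the {anchor_head}"


def choose_anchor(bd_head, candidate_heads, hit_count):
    """Pick the candidate whose query gets the most hits.

    hit_count is any callable mapping a query string to an integer; it stands
    in for a search-engine lookup. Returns None if every candidate has zero
    hits.
    """
    best, best_hits = None, 0
    for head in candidate_heads:
        hits = hit_count(build_query(bd_head, head))
        if hits > best_hits:
            best, best_hits = head, hits
    return best


# Made-up counts, for illustration only.
FAKE_HITS = {"the drawer of the cabinet": 120, "the drawer of the table": 45}
anchor = choose_anchor("drawer", ["table", "cabinet", "clock"],
                       lambda query: FAKE_HITS.get(query, 0))
```

Returning `None` when no candidate has any hits keeps the zero-hit case explicit, so a caller can fall back on salience alone.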

Page 23: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

NEW EXPERIMENTS (Poesio et al, 2004)

Using the GNOME corpus
– 58 mereological bridging refs realized by the-nps
– 153 mereological bridging references in total
– Reliably annotated

Completely automatic feature extraction
– Google & WordNet for lexical distance
– Using (an approximation of) salience

Using machine learning to combine the features

Page 24: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

More (and reliably annotated) data: the GNOME corpus

Texts from 3 genres (museum descriptions, pharmaceutical leaflets, tutorial dialogues)

Reliably annotated syntactic, semantic and discourse information

– grammatical function, agreement features
– anaphoric relations
– uniqueness, ontological information, animacy, genericity, …

Reliable annotation of bridging references

http://cswww.essex.ac.uk/Research/NLE/corpora/GNOME

Page 25: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

METHODS

Salience features:
– Utterance distance
– First mention
– ‘Global first mention’ (approximate CB)

Lexical distance:
– WordNet (using a pure hypernym-based search strategy)
– Google
– Tried both separately and together

Statistical classifiers: MLP, Naïve Bayes
– (MatLab / Weka ML Library)

Page 26: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Lexical Distance 1 (WordNet)

Computing WordNet Distance:
– Get the head noun of the anaphor and find all the (noun) senses for the head noun.
– Get all the noun senses for the head noun of the potential antecedent under consideration.
– Retrieve the hypernym trees from WordNet for each sense of the anaphor and the antecedent.
– Traverse each unique path in these trees and find a common parent for the anaphor and the antecedent; count the no. of nodes they are apart.
– Select the least distance path across all combinations.
– If no common parent is found, assign a hypothetical distance (30).
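The steps above can be sketched on a toy hypernym graph. The senses and links below are invented for illustration and are not taken from WordNet; the sketch also assumes that "nodes apart" means the number of hypernym links on the path through the common parent.

```python
NO_PATH_DISTANCE = 30  # the hypothetical distance assigned when no path exists

# Toy data, invented for illustration: sense -> direct hypernym senses.
HYPERNYMS = {
    "drawer.n.01": ["container.n.01"],
    "cabinet.n.01": ["furniture.n.01"],
    "container.n.01": ["artifact.n.01"],
    "furniture.n.01": ["artifact.n.01"],
    "artifact.n.01": [],
    "idea.n.01": [],
}
SENSES = {
    "drawer": ["drawer.n.01"],
    "cabinet": ["cabinet.n.01"],
    "idea": ["idea.n.01"],
}


def ancestor_depths(sense):
    """Map each ancestor of sense (itself included) to its minimum link count."""
    depths = {sense: 0}
    frontier = [sense]
    while frontier:  # breadth-first, so the first depth found is the smallest
        node = frontier.pop(0)
        for parent in HYPERNYMS.get(node, []):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                frontier.append(parent)
    return depths


def wordnet_distance(anaphor_head, antecedent_head):
    """Shortest path through a common parent, over all sense combinations."""
    best = NO_PATH_DISTANCE
    for anaphor_sense in SENSES.get(anaphor_head, []):
        up_from_anaphor = ancestor_depths(anaphor_sense)
        for antecedent_sense in SENSES.get(antecedent_head, []):
            up_from_antecedent = ancestor_depths(antecedent_sense)
            for common in up_from_anaphor.keys() & up_from_antecedent.keys():
                path = up_from_anaphor[common] + up_from_antecedent[common]
                best = min(best, path)
    return best
```

With a real lexical resource, `SENSES` and `HYPERNYMS` would be replaced by lookups into it; the search itself stays the same.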

Page 27: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Lexical Distance, 1: WordNet

Page 28: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Lexical Distance 2 (Google)

As in (Poesio, 2003)

But use Google API to access the Google search engine

Computing Google hits:
– Get the head noun for BR and potential candidate.
– Check whether the potential candidate is a mass or count noun.
– If count, build the query as “the body of the person” and search for the pattern.
– Retrieve the no. of Google hits

Page 29: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

WN vs GOOGLE

Description                                         Results
No path in WordNet                                  503/1720
No path in WordNet between BD and correct anchor    10/58
Anchor with Min WN Distance correct                 8/58
Zero Google Hits                                    1089/1720
Zero Google Hits for correct anchor                 24/58
Max Google Hits identify correct candidate          8/58

Page 30: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

BASELINES

BASELINE ACCURACY

Random choice (previous 5) 4%

Random choice (previous) 19%

Random choice among FM 21.3%

Min Google Distance 13.8%

Min WN Distance 13.8%

FM entity in previous sentence 31%

Min Google in previous sentence 17.2%

Min WN in previous sentence 25.9%

Min Google among FM 12%

Min WN among FM 24.1%

Page 31: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS (58 THE-NPs, 50:50)

                         WN DISTANCE    GOOGLE DISTANCE
MatLab NN, self-tuned    92 (79.3%)     89 (76.7%)
Weka NN Algorithm        91 (78.4%)     86 (74.1%)
Weka Naïve Bayes         88 (75.9%)     85 (73.3%)

                   Prec     Recall    F
WN distance        75.4%    84.5%     79.6%
Google distance    70.6%    86.2%     77.6%

Page 32: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

MORE RESULTS

Accuracy F

WN distance 224 (74.2%) 76.3%

Google distance 230 (75.2%) 75.8%

1:3 dataset:

all 153 mereological BRs:

Accuracy F

WN distance 80.6% 55.7%

Google distance 82% 56.7%

Page 33: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

MEREOLOGICAL BDS REALIZED WITH BARE-NPS

The combination of rare and expensive materials used on this cabinet indicates that it was a particularly expensive commission. The four Japanese lacquer panels date from the mid- to late 1600s and were created with a technique known as kijimaki-e. For this type of lacquer, artisans sanded plain wood to heighten its strong grain and used it as the background of each panel. They then added the scenic elements of landscape, plants, and animals in raised lacquer. Although this technique was common in Japan, such large panels were rarely incorporated into French eighteenth-century furniture.

Heavy Ionic pilasters, whose copper-filled flutes give an added rich color and contrast to the gilt-bronze mounts, flank the panels. Yellow jasper, a semiprecious stone, rather than the usual marble, forms the top.

Page 34: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

HARDER TEST

Distance       Balance    Accuracy on balanced    F on bal    Accuracy on unbal    F on unbal
WN             1:1        70.2%                   .7          80.2%                .2
WN             1:3        75.9%                   .4          91.7%                0
Google         1:1        64.4%                   .7          63.6%                .1
Google         1:3        79.8%                   .5          88.4%                .3
WN + Google    1:1        66.3%                   .6          65.3%                .2
WN + Google    1:3        77.9%                   .4          92.5%                .5

Using classifiers trained on balanced /slightly unbalanced data (the-nps) on unbalanced ones (10-fold cross validation)

Page 35: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

DISCUSSION

Previous results:
– Construction-based techniques provide adequate lexical resources, particularly when using Web as corpus
– But need to combine lexical knowledge and salience modeling

This work:
– Combining (simple) salience with lexical resources results in significant improvements

Future work:
– Larger dataset
– Better approximation of focusing

Page 36: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Back to discourse-new detection

The GUITAR system

Recent results

Page 37: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

GUITAR (Kabadjov, to appear)

A robust, usable anaphora resolution system designed to work as part of an XML pipeline

Incorporates:
– Pronouns: the Mitkov algorithm
– Definite descriptions: the Vieira / Poesio algorithm
– Proper nouns: the Bontcheva algorithm

Several versions
– Version 1 (Poesio & Kabadjov, 2004): direct anaphora
– Version 2: DN detection
– Version 3: proper name resolution

Freely available from http://privatewww.essex.ac.uk/~malexa/GuiTAR/

Page 38: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

DISCOURSE-NEW DEFINITE DESCRIPTIONS

Poesio and Vieira (1998): about 66% of definite descriptions in their texts (WSJ) are discourse-new

(1) Toni Johnson pulls a tape measure across the front of what was once a stately Victorian home.

(2) The Federal Communications Commission allowed American Telephone & Telegraph Co. to continue offering discount phone services for large-business customers and said it would soon re-examine its regulation of the long-distance market.

Page 39: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

WOULD DNEW RECOGNITION HELP?

First version of GUITAR without DN detection on subset of DDs in GNOME corpus
– 574 DDs, of which
  – 184 anaphoric (32%)
  – 390 discourse-new (67.9%)

Total        Sys Ana    Corr         NM    WM    SM            P              R              F
574 (184)    198        457 (119)    38    27    52 (26.3%)    79.6 (60.1)    79.6 (64.7)    79.6 (62.3)
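The anaphora-only figures follow from the counts on this slide: 198 DDs resolved by the system, 119 of them correct, and 184 anaphoric DDs in the data. A minimal check, assuming the usual definitions of precision, recall and F (the function name is illustrative):

```python
def precision_recall_f(correct, system, gold):
    """Precision, recall and balanced F from raw counts."""
    precision = correct / system
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = precision_recall_f(correct=119, system=198, gold=184)
print(f"P={p:.1%} R={r:.1%} F={f:.1%}")  # prints "P=60.1% R=64.7% F=62.3%"
```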

Page 40: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

SPURIOUS MATCHES

If your doctor has told you in detail HOW MUCH to use and HOW OFTEN then keep to this advice.

….. If you are not sure then follow the advice on the back of this leaflet.

Page 41: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

GOALS OF THE WORK

Vieira and Poesio’s (2000) system incorporated DISCOURSE-NEW DD DETECTORS (P=69, R=72, F=70.5)

Two subsequent strands of work:
– Bean and Riloff (1999), Uryupina (2003) developed improved detectors (e.g., Uryupina: F=86.9)
– Ng and Cardie (2002) questioned whether such detectors improve results

Our project: systematic investigation of whether DN detectors actually help
– ACL 04 ref res: features, preliminary results
– THIS WORK: results of further experiments

Page 42: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

DN CLASSIFIER: THE UPPER BOUND

Current number of SMs: 52/198 (26.3%)

If SM = 0,
– P=R=F overall = 509/574 = 88.7
– (P=R=F on anaphora only: 119/146 = 81.5)
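The upper bound can be reproduced from the counts reported for GUITAR without DN detection (574 DDs, 457 correct overall, 198 resolved, 119 correctly resolved, 52 spurious matches). The sketch assumes that a perfect DN classifier would turn every spurious match into a correct discourse-new decision and leave everything else unchanged.

```python
total_dds, correct_overall = 574, 457
resolved, correct_anaphora, spurious = 198, 119, 52

# Every spurious match becomes a correct discourse-new classification.
upper_overall = (correct_overall + spurious) / total_dds
# The spurious matches drop out of the set of resolved DDs.
upper_anaphora = correct_anaphora / (resolved - spurious)

print(f"{upper_overall:.1%} {upper_anaphora:.1%}")  # prints "88.7% 81.5%"
```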

Page 43: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

VIEIRA AND POESIO’S DN DETECTORS

Recognize SEMANTICALLY FUNCTIONAL descriptions:
– SPECIAL PREDICATES / PREDICATE MODIFIERS (HAND-CODED): the front of what was once a stately Victorian home; the best chance of saving the youngest children
– PROPER NAMES: the Federal Communications Commission …

LARGER SITUATION descriptions (HAND-CODED): the City, the sun, ….

Page 44: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

VIEIRA AND POESIO’S DN DETECTORS, II

Descriptions ESTABLISHED by modification: The warlords and private militias who were once regarded as the West’s staunchest allies are now a greater threat to the country’s security than the Taliban …. (Guardian, July 13th 2004, p.10)

PREDICATIVE descriptions:
– COPULAR CLAUSES: he is the hardworking son of a Church of Scotland minister ….
– APPOSITIONS: Peter Kenyon, the Chelsea chief executive …

Page 45: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

VIEIRA AND POESIO’S DECISION TREES

Tried both hand-coded and ML

Hand-coded decision tree:

1. Try the DN detectors with highest accuracy (attempt to classify as functional using special predicates, and as predicative by looking for apposition)

2. Attempt to resolve the DD as direct anaphora

3. Try other DN detectors in order: proper name, establishing clauses, proper name modification ….

ML DT: swap 1. and 2.

Page 46: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

VIEIRA AND POESIO’S RESULTS

                            P       R      F
Baseline                    50.8    100    67.4
DN detection                69      72     70
Hand-coded DT (partial)     62      85     71.7
Hand-coded DT (total)       77      77     77
ID3                         75      75     75

Page 47: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

BEAN AND RILOFF (1999)

Developed a system for identifying DN definites

Adopted syntactic heuristics from Vieira and Poesio, and developed several new techniques:

SENTENCE-ONE (S1) EXTRACTION: identify as discourse-new every description found in first sentence of a text.

DEFINITE PROBABILITY: create a list of nominal groups encountered at least 5 times with definite article, but never with indefinite

VACCINES: block heuristics when prob. too low.

Page 48: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

BEAN AND RILOFF’S ALGORITHM

1. If the head noun appeared earlier, classify as anaphoric

2. If DD occurs in S1 list, classify as DN unless vaccine

3. Classify DD as DN if one of the following applies: (a) high definite probability; (b) matches an EHP pattern; (c) matches one of the syntactic heuristics

4. Classify as anaphoric
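The four steps form a simple cascade, sketched below. All names are hypothetical, the resources are plain sets standing in for the lists the algorithm consults, and the syntactic heuristics are reduced to a single toy cue. Letting a vaccinated description fall through to step 3 is an assumption; the slide does not say what happens to it.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    """Hypothetical container for the lists the cascade consults."""
    s1_list: set = field(default_factory=set)        # DDs seen in first sentences
    vaccines: set = field(default_factory=set)       # DDs whose S1 label is blocked
    definite_only: set = field(default_factory=set)  # high definite probability
    ehp_heads: set = field(default_factory=set)      # stand-in for EHP patterns


def has_syntactic_cue(dd):
    """Toy stand-in for the syntactic heuristics: a postmodifying 'of' phrase."""
    return " of " in dd


def classify(dd, head, previous_heads, res):
    """Return "anaphoric" or "discourse-new" for one definite description."""
    if head in previous_heads:                        # step 1
        return "anaphoric"
    if dd in res.s1_list and dd not in res.vaccines:  # step 2
        return "discourse-new"
    if (dd in res.definite_only                       # step 3 (a)
            or head in res.ehp_heads                  # step 3 (b)
            or has_syntactic_cue(dd)):                # step 3 (c)
        return "discourse-new"
    return "anaphoric"                                # step 4
```

The order of the tests is the point of the sketch: a same-head match wins over every discourse-new cue, and "anaphoric" is the default when nothing fires.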

Page 49: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

BEAN AND RILOFF’S RESULTS

                                      R       P
Baseline                              100     72.2
Syn heuristics                        43      93.1
Syn heuristics + S1                   66.3    84.3
Syn heuristics + EHP                  60.7    87.3
Syn heuristics + DO                   69.2    83.9
Syn heuristics + S1 + EHP + DO        81.7    82.2
Syn heuristics + S1 + EHP + DO + V    79.1    84.5

Page 50: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

NG AND CARDIE (2002)

Directly investigate the question of whether discourse-new detectors improve the performance of an anaphora resolution system

Dealing with ALL types of anaphoric expressions

Page 51: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

NG AND CARDIE’S METHODS

DN detectors:
– statistical classifiers trained using C4.5 and RIPPER
– Features: predicate & superlative detection / head match / position in text of NP
– Tested over MUC-6 (F=86) and MUC-7 (F=84)

2 architectures for integration of detectors and AR:
1. Run DN detector first, apply AR on NPs classified as anaphoric
2. Run AR if str_match or alias=Y; otherwise, as in 1.
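The two architectures differ only in whether the DN detector can be bypassed. A minimal sketch, with the detector, the resolver and the two feature tests passed in as callables (all names are hypothetical):

```python
def architecture_1(np_, is_anaphoric, resolve):
    """Run the DN detector first; resolve only NPs it classifies as anaphoric."""
    return resolve(np_) if is_anaphoric(np_) else None


def architecture_2(np_, is_anaphoric, resolve, str_match, alias):
    """Bypass the DN detector when a string match or an alias is available."""
    if str_match(np_) or alias(np_):
        return resolve(np_)
    return architecture_1(np_, is_anaphoric, resolve)


# Toy stand-ins: the detector rejects everything, the resolver always answers.
never_anaphoric = lambda np_: False
resolver = lambda np_: "antecedent of " + np_
result_1 = architecture_1("the car", never_anaphoric, resolver)
result_2 = architecture_2("the car", never_anaphoric, resolver,
                          str_match=lambda np_: True, alias=lambda np_: False)
```

In the toy run the first architecture loses the link because the detector blocks it, while the second still resolves the NP thanks to the string match.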

Page 52: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

NG AND CARDIE’S RESULTS

                              MUC-6                   MUC-7
                              P       R       F       P       R       F
Baseline (no DN detection)    70.3    58.3    63.8    65.5    58.2    61.6
DN detection runs first       57.4    71.6    63.7    47.0    77.1    58.4
Same head runs first          63.4    68.3    65.8    59.7    69.3    64.2

Page 54: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

URYUPINA’S METHODS

A DN statistical classifier trained using RIPPER

Trained / tested over Ng and Cardie’s MUC-7 data

Page 55: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

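URYUPINA’S FEATURES: WEB-BASED DEFINITE PROBABILITY

$$\frac{\#\,\text{“the Y”}}{\#\,\text{“a Y”}}, \qquad \frac{\#\,\text{“the Y”}}{\#\,\text{“Y”}}, \qquad \frac{\#\,\text{“the H”}}{\#\,\text{“a H”}}, \qquad \frac{\#\,\text{“the H”}}{\#\,\text{“H”}}$$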
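The four Web-based ratios can be computed from hit counts as sketched below. The sketch assumes that Y stands for the noun phrase without its determiner and H for its head noun, and that each ratio has the count for the "the" query as its numerator; the counts are invented for illustration and the function names are hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def definite_probabilities(counts, y, h):
    """Compute the four ratios from a mapping of query string -> hit count."""
    def hits(query):
        return counts.get(query, 0)

    return {
        "the Y / a Y": ratio(hits(f"the {y}"), hits(f"a {y}")),
        "the Y / Y": ratio(hits(f"the {y}"), hits(y)),
        "the H / a H": ratio(hits(f"the {h}"), hits(f"a {h}")),
        "the H / H": ratio(hits(f"the {h}"), hits(h)),
    }


# Invented counts, for illustration only.
COUNTS = {"the first man": 900, "first man": 1000, "a first man": 10,
          "the man": 500, "man": 5000, "a man": 1000}
features = definite_probabilities(COUNTS, y="first man", h="man")
```

A phrase that nearly always occurs with the definite article gets high ratios, which is the cue a discourse-new classifier can exploit.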

Page 56: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

URYUPINA’S RESULTS (DNEW CLASSIFIER)

                        P       R       F
All NPs  No Def Prob    87.9    86.0    86.9
         Def Prob       88.5    84.3    86.3
Def NPs  No Def Prob    82.5    79.3    80.8
         Def Prob       84.8    82.3    83.5

(On MUC-7)

Page 58: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

PRELIMINARY CONCLUSIONS

Quite a lot of agreement on features for DN recognition:
– Recognizing predicative NPs
– Recognizing establishing relatives
– Recognizing DNEW proper names
– Identifying functional DDs

Automatic detection of these better

Using the Web best

All these systems integrate DN detection with some form of AR resolution
– See Ng’s results concerning how `globally optimized’ classifiers are better than `locally optimized’ ones (ACL 2004)

Page 59: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

PRELIMINARY CONCLUSIONS, II

Ng and Cardie’s results not the last word:
– Performance of their DN detector not as high as Uryupina’s (F=84 vs. F=87 on same dataset, MUC-7)
– Overall performance of their resolution system not that high
  – best performance: F=65.8 on ALL NPs
  – But on full NPs (i.e., excluding PNs and pronouns): F=31.7
  – (GUITAR on DDs, unparsed text: F=56.4)

Room for improvement

Page 60: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

A NEW SET OF EXPERIMENTS

Incorporate the improvements in DN detection technology to
– the Vieira / Poesio algorithm, as reimplemented in a state-of-the-art `specialized’ AR system, GUITAR
– a statistical `general purpose’ AR resolver (Uryupina, in progress)

Test over a large variety of data
– New: GNOME corpus (623 DDs)
– Original Vieira and Poesio dataset (1400 DDs)
– MUC-7 (for comparison with Ng and Cardie, Uryupina) (3000 DDs)

Page 61: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

ARCHITECTURE

A two-level system:
– Run GUITAR’s direct anaphora resolution
– Results used as one of the features of a statistical discourse-new classifier
– A `globally optimized’ system (Ng, ACL 2004)

Trained / tested over
– GNOME corpus
– Vieira / Poesio dataset, converted to MMAX, converted to MAS-XML (still correcting the annotation)

Page 62: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

A NEW SET OF FEATURES

DIRECT ANAPHORA
– Run the Vieira / Poesio algorithm; -1 if no result, else distance

PREDICATIVE NP DETECTOR
– DD occurs in apposition
– DD occurs in copular construction

PROPER NAMES
– c-head
– c-premod

Bean and Riloff’s S1

Page 63: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

A REVISED SET OF FEATURES (II)

FUNCTIONALITY
– Uryupina’s four definite probabilities (computed off the Web)
– superlative

ESTABLISHING RELATIVE (a single feature)

POSITION IN TEXT OF NP (Ng and Cardie)
– header / first sentence / first para

Page 64: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

LEARNING A DN CLASSIFIER

Use of the data:
– 8% for parameter tuning
– 10-fold cross-validation over the rest

Classifiers: from the Weka package
– Decision Tree (C4.5), NN (MLP), SVM

3 evaluations (overall, DN, DA)

Performance comparison: t-test (cf. Dietterich 98)
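The data protocol can be sketched in a few lines: hold out 8% for parameter tuning, then build 10 train/test folds over the remainder. The shuffling, the fixed seed and the way folds are cut are assumptions made for the sketch; the function name is hypothetical.

```python
import random


def split_for_tuning(items, tuning_fraction=0.08, folds=10, seed=0):
    """Return (tuning_set, [(train, test), ...]) for the protocol above."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_tune = round(len(items) * tuning_fraction)
    tuning, rest = items[:n_tune], items[n_tune:]
    # Deal the remaining items into `folds` roughly equal parts.
    parts = [rest[i::folds] for i in range(folds)]
    splits = []
    for i, test in enumerate(parts):
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        splits.append((train, test))
    return tuning, splits


tuning, splits = split_for_tuning(range(574))
```

With 574 items this yields 46 tuning items and ten folds over the remaining 528, each item appearing in exactly one test fold.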

Page 65: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

3 EVALUATIONS

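OVERALL:

$$P = \frac{\mathrm{corr}_{DA} + \mathrm{corr}_{DN}}{\mathrm{sys}_{DA} + \mathrm{sys}_{DN}}, \qquad R = \frac{\mathrm{corr}_{DA} + \mathrm{corr}_{DN}}{DA + DN}$$

DA:

$$P_{DA} = \frac{\mathrm{corr}_{DA}}{\mathrm{sys}_{DA}}, \qquad R_{DA} = \frac{\mathrm{corr}_{DA}}{DA}$$

DN:

$$P_{DN} = \frac{\mathrm{corr}_{DN}}{\mathrm{sys}_{DN}}, \qquad R_{DN} = \frac{\mathrm{corr}_{DN}}{DN}$$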

Page 66: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS: OVERALL

                 T      Res    C      P=R=F
GuiTAR           574    574    457    79.6
GuiTAR + MLP     574    574    473    82.4
GuiTAR + C4.5    574    574    466    81.18

p .1, not sig

Page 67: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS: DNEW CLASSIFICATION

                             P       R       F       A
DN C4.5                      86.9    92.3    89.3    85.04
DN MLP                       86.4    94.6    90.2    85.89
DN SVM                       90.0    86.4    88.1    84.15
BASELINE (all DDs are DN)    67.5    100     80.6    67.5

Page 68: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS: DIRECT ANAPHORA RESOLUTION

                 T      Res    C      NM    WM    SM    P       R       F
GuiTAR           184    198    119    38    27    52    60.1    64.7    62.3
GuiTAR + MLP     184    142    104    60    20    18    74.1    56.5    63.4
GuiTAR + C4.5    184    158    106    56    22    30    68.9    57.7    62.1
GuiTAR + SVM     184    198    119    38    27    52    60.1    64.7    62.3

Page 69: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

ERROR ANALYSIS

A 65% reduction in spurious matches:
– “the answer to any of these questions”
– “the title of cabinet maker and sculptor to Louis XIV, King of France”
– “the other half of the plastic”

But: a 58% increase in no matches
– “the palm of the hand”
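Both percentages are consistent with the direct anaphora results if they compare GuiTAR with GuiTAR + MLP (spurious matches 52 to 18, no matches 38 to 60); that pairing is an assumption, since the slide does not name the rows. A quick check:

```python
def relative_change(before, after):
    """Signed change relative to the starting value."""
    return (after - before) / before


spurious_change = relative_change(52, 18)
no_match_change = relative_change(38, 60)
print(f"{spurious_change:+.1%} {no_match_change:+.1%}")  # prints "-65.4% +57.9%"
```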

Page 70: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

THE DECISION TREE

[Figure: the learned decision tree. Root test: DirectAna <= -1? Further tests: DirectAna <= 20?, TheY/A Y <= 201.2?, 1stPar = 0?, Relative = 0?, DirectAna <= 12? Leaves: DNEW (339/36), DNEW (11/1), DNEW (12/1), DNEW, ANAPH.]

Page 71: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS: THE VIEIRA/POESIO CORPUS

Tested on 400 DDs (the ‘test’ corpus)

Initial results at DN detection very poor

Problem: the two conversions resulted in the loss of much information about modification, particularly relatives

Currently correcting the annotation by hand

Page 72: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

RESULTS: AUTOMATIC PARSING

GUITAR without DN detection over the same texts, but using a chunker: 10% less accuracy

Main problem: many DDs not detected (particularly possessives)

Currently experimenting with full parsers (tried several, settled on Charniak’s)

Page 73: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

CONCLUSIONS AND DISCUSSION

All results so far support the idea that DN detectors improve the performance of AR with DD (if perhaps by only a few percent)

Some agreement on what features are useful
– One clear lesson: interleave AR and DN detection!

But: will need to test on larger corpora (also to improve performance of classifier)

Current work:
– Test on unparsed text
– Test on MUC-7 data

Page 74: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Task-based evaluation

RANLP / EMNLP slides

Page 75: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

Conclusions

Page 76: EMPIRICAL INVESTIGATIONS OF ANAPHORA AND SALIENCE

URLs

Massimo Poesio: http://cswww.essex.ac.uk/staff/poesio

GUITAR: http://privatewww.essex.ac.uk/~malexa/GuiTAR/

WEKA: http://www.cs.waikato.ac.nz/~ml