
NUMBER OR NUANCE: Factors Affecting Reliable Word Sense Annotation

Susan Windisch Brown, Travis Rood, and Martha Palmer
University of Colorado at Boulder

Annotators in their little nests agree;
And ’tis a shameful sight,
When taggers on one project
Fall out, and chide, and fight.
(adapted from Isaac Watts)


Automatic word sense disambiguation

Lexical ambiguity is a significant problem in natural language processing (NLP) applications (Agirre & Edmonds, 2006), for example:

  Text summarization
  Question answering

WSD systems might help

Several studies show benefits for NLP tasks (Sanderson, 2000; Stokoe, 2003; Carpuat and Wu, 2007; Chan, Ng and Chiang, 2007), but only with higher system accuracy (90%+)


Annotation reliability affects system accuracy

WSD system           System performance   Inter-annotator agreement   Sense inventory
SensEval2            62.5%                70%                         WordNet
Chen et al. (2007)   82%                  89%                         OntoNotes
Palmer (2008)        90%                  94%                         PropBank
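The relationship between the two percentage columns can be made explicit with a little arithmetic. The following is a minimal sketch that uses only the figures reported in the table; the names `RESULTS` and `gap_to_ita` are illustrative assumptions, not part of the original study.

```python
# Figures copied from the table: (system performance %, ITA %, sense inventory).
RESULTS = {
    "SensEval2": (62.5, 70.0, "WordNet"),
    "Chen et al. (2007)": (82.0, 89.0, "OntoNotes"),
    "Palmer (2008)": (90.0, 94.0, "PropBank"),
}


def gap_to_ita(results):
    """Return, per system, how many percentage points it trails its ITA."""
    return {name: ita - perf for name, (perf, ita, _inventory) in results.items()}


gaps = gap_to_ita(RESULTS)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points below inter-annotator agreement")
```

In every row the system scores below the agreement level of the annotators who produced its training data, and both figures rise together as the sense inventory becomes coarser.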


Senses for the verb control

WordNet

1. exercise authoritative control or power over
2. control (others or oneself) or influence skillfully
3. handle and cause to function
4. lessen the intensity of; temper
5. check or regulate (a scientific experiment) by conducting a parallel experiment
6. verify by using a duplicate register for comparison
7. be careful or certain to do something
8. have a firm understanding of

OntoNotes

1. exercise power or influence over; hold within limits
2. verify something by comparing to a standard


Possible factors affecting the reliability of word sense annotation

Fine-grained senses result in many senses per word, creating a heavy cognitive load on annotators, making accurate and consistent tagging difficult

Fine-grained senses are not distinct enough for annotators to discriminate between them reliably


Requirements to compare fine-grained and coarse-grained annotation

Annotation of the same words on the same corpus instances

Sense inventories differing only in sense granularity

Previous work (Ng et al., 1999; Edmonds & Cotton, 2001; Navigli et al. 2007)


3 experiments

40 verbs

Number of senses: 2-26

Sense granularity: WordNet vs. OntoNotes

Exp. 1: confirm difference in reliability between fine- and coarse-grained annotation; vary granularity and number of senses

Exp. 2: hold granularity constant; vary number of senses

Exp. 3: hold number constant; vary granularity


Experiment 1

Compare fine-grained sense inventory to coarse

70 instances for each verb from the OntoNotes (ON) corpus

Annotated with WordNet (WN) senses by multiple pairs of annotators

Annotated with ON senses by multiple pairs of annotators

Compare the ON inter-annotator agreement rates (ITAs) to the WN ITAs

Sense inventory   Ave. number of senses   Granularity
OntoNotes         6.2                     Coarse
WN                14.6                    Fine
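The slides do not spell out how ITA is computed. The following is a minimal sketch, assuming ITA is simple observed agreement (the share of instances on which a pair of annotators chose the same sense), averaged over annotator pairs. The function names and the toy labels are hypothetical and are not taken from the study.

```python
def pair_agreement(tags_a, tags_b):
    """Fraction of instances on which two annotators chose the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same instances")
    if not tags_a:
        raise ValueError("no instances to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def mean_ita(pairs):
    """Average pairwise agreement over several annotator pairs."""
    scores = [pair_agreement(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Toy data (not from the study): sense labels for five instances of one verb.
annotator_1 = ["1", "1", "2", "1", "2"]
annotator_2 = ["1", "2", "2", "1", "2"]

ita = pair_agreement(annotator_1, annotator_2)
print(f"ITA = {ita:.0%}")
```

Observed agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be an alternative, but the slides report plain percentages.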


Results

[Bar chart of ITA, axis from 0% to 100%: WordNet (fine-grained) 57%; OntoNotes (coarse-grained) 91%]


Results

Coarse-grained ON annotations had higher ITAs than fine-grained WN annotations

Number of senses: no significant effect (t(79) = -1.28, p = .206).

Sense nuance: a significant effect (t(79) = 10.39, p < .0001).

With number of senses held constant, ITA for coarse-grained annotation is 16.2 percentage points higher than for fine-grained annotation.


Experiment 2: Number of senses

Hold sense granularity constant; vary number of senses

2 pairs of annotators, using fine-grained WN senses

First pair uses full set of WN senses for a word

Second pair uses a restricted set on instances that we know should fit one of those senses

Sense inventory     Ave. number of senses   Granularity
WN full set         14.6                    Fine
WN restricted set   5.6                     Fine


OntoNotes grouped sense A: WN 1, 2, 4, 5, 6, 11, 12

OntoNotes grouped sense B: WN 3, 7, 8, 13, 14

OntoNotes grouped sense C: WN 9, 10


"Then I just bought plywood, drew the pieces on it and cut them out."

Full set of WN senses: 1 through 14

Restricted set of WN senses: 3, 7, 8, 13, 14


Results

[Bar chart of ITA, axis from 0% to 100%: WN full set 59%; WN restricted set 53%]


Experiment 3

Number of senses controlled; vary sense granularity

Compare the ITAs for the ON tagging with the restricted-set WN tagging

Sense inventory     Ave. number of senses   Granularity
OntoNotes           6.2                     Coarse
WN restricted set   5.6                     Fine


Results

[Bar chart of ITA, axis from 0% to 100%: WN restricted set (fine-grained) 53%; OntoNotes (coarse-grained) 91%]


Conclusion

Number of senses annotators must choose between: never a significant factor

Granularity of the senses: a significant factor, with fine-grained senses leading to lower ITAs

Poor reliability of fine-grained word sense annotation cannot be improved by reducing the cognitive load on annotators.

Annotators cannot reliably discriminate between nuanced sense distinctions.


Acknowledgements

We gratefully acknowledge the efforts of all of the annotators and the support of the National Science Foundation Grants NSF-0415923, Word Sense Disambiguation and CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation and CISE-CRI 0709167, as well as a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, a subcontract from BBN, Inc.


Restricted set annotation

Use the adjudicated ON data to determine the ON sense for each instance.

Use instances from Experiment 1 that were labeled with one selected ON sense (35 instances).

Each restricted-set annotator saw only the WN senses that were clustered to form the appropriate ON sense.

Compare to the full set annotation for those instances.
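The restricted-set procedure above can be sketched in a few lines. This is an illustration only: the mapping reuses the grouping shown on the earlier slide (ON grouped senses A, B and C over 14 WN senses), while the record layout, the field name `on_sense` and both function names are assumptions introduced here.

```python
# Grouping from the earlier slide: each ON grouped sense clusters several WN senses.
ON_TO_WN = {
    "A": [1, 2, 4, 5, 6, 11, 12],
    "B": [3, 7, 8, 13, 14],
    "C": [9, 10],
}


def select_instances(instances, on_sense):
    """Keep only instances whose adjudicated ON sense is the selected one."""
    return [inst for inst in instances if inst["on_sense"] == on_sense]


def restricted_wn_senses(on_sense, on_to_wn=ON_TO_WN):
    """WN senses shown to a restricted-set annotator for that ON sense."""
    return on_to_wn[on_sense]


# Hypothetical adjudicated instances; ids and labels are made up for illustration.
instances = [
    {"id": 1, "on_sense": "B"},
    {"id": 2, "on_sense": "A"},
    {"id": 3, "on_sense": "B"},
]

selected = select_instances(instances, "B")
options = restricted_wn_senses("B")
print([inst["id"] for inst in selected], options)
```

Because every selected instance is already known to belong to one ON sense, the restricted-set annotator chooses among fewer WN senses while the granularity of those senses stays the same.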
