jigsaw: an algorithm for word sense disambiguation

25
EVALITA 2007 Evaluation of NLP Tools for Italian EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy JIGSAW: an Algorithm for Word Sense Disambiguation Dipartimento di Informatica University of Bari Pierpaolo Basile ([email protected]) Giovanni Semeraro ([email protected])

Upload: cree

Post on 10-Jan-2016

36 views

Category:

Documents


0 download

DESCRIPTION

JIGSAW: an Algorithm for Word Sense Disambiguation. Dipartimento di Informatica University of Bari Pierpaolo Basile ([email protected]) Giovanni Semeraro ([email protected]). Word Sense Disambiguation. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007Evaluation of NLP Tools for Italian

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW: an Algorithm for Word Sense Disambiguation

Dipartimento di InformaticaUniversity of Bari

Pierpaolo Basile ([email protected])Giovanni Semeraro ([email protected])

Page 2: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilitiesSense Inventory usually comes from a

dictionary or thesaurusKnowledge intensive methods, supervised

learning, and (sometimes) bootstrapping approaches

Page 3: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

All Words WSD

Attempt to disambiguate all open-class words in a text: “He put his suit over the back of the chair”

How?Knowledge-based approachesUse information from dictionariesPosition in a semantic networkUse discourse propertiesMinimally supervised approachesMost frequent sense

Page 4: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW

Knowledge-based WSD algorithmDisambiguation of words in a text by

exploiting WordNet sensesCombination of three different strategies

to disambiguate nouns, verbs, adjectives and adverbs

Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word

Page 5: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW algorithm

Input: document d = {w1, w2, ... , wh} Output: list of WordNet synsets X = {s1, s2, ... ,

sk}each element si is obtained by disambiguating the

target word wi

based on the information obtained from WordNet about words in the context context C of the target word: a window of n words to the left

and another n words to the right, for a total of 2n surrounding words

For each word JIGSAW adopts a different strategy based on POS tag

Page 6: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_nouns: the idea

Based on Resnik [Resnik95] algorithm for disambiguating noun groups

Given a set of nouns W={w1,w2, ... ,wn} from document d:each wi has an associated sense inventory

Si={si1, si2, ... , sik} of possible senses

Goal: assigning each wi with the most appropriate sense sihSi, according to the similarity of wi with the other nouns in W

Page 7: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_nouns: semantic similarity

w = cat C = {mouse}

cat mousemouse

Mouse#1: any of numerous small

rodents…

Mouse#2: a hand-operated electronic

device …

Cat#1: feline mammal…

Cat#2: computerized axial

tomography…

T={02244530,03651364}

Wcat={02037721,00847815}

T={mouse#1,mouse#2}

Wcat={cat#1,cat#2}

“The white cat is hunting the mouse”

0.726

0.0

0.107

0.0

Page 8: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_nouns: MSS support

MostSpecificSubsumer between wordsGive more importance to senses that are

hyponym of MSS Combine MSS support with semantic

similarity

Wcat={cat#1,cat#2} T={mouse#1,mouse#2}

MostSpecificSubsumer{placental_mammal}

Page 9: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Difference between JIGSAW_nouns and Resnik

Leacock-Chodorow measure to calculate similarity (instead Information Content)

a Gaussian factor G, which takes into account the distance between words in the text

a factor R, which takes into account the synset frequency score in WordNet

a parameterized search for the MSS (Most Specific Subsumer) between two concepts

Page 10: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_verbs: the idea

Try to establish a relation between verbs and nouns (distinct IS-A hierarchies in WordNet)

Verb wi disambiguated using:nouns in the context C of wi

nouns into the description (gloss + WordNet usage examples) of each candidate synset for wi

Page 11: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_verbs: algorithm [1/4]

For each candidate synset sik of wi

computes nouns(i, k): the set of nouns in the description for sik

for each wj in C and each synset sik computes the highest similarity maxjk

maxjk is the highest similarity value for wj wrt the nouns related to the k-th sense for wi (using Leacock-Chodorow measure)

Page 12: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

1. (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches")

2. (29) play -- (play on an instrument; "The band played all night long")

3. …

nouns(play,1): game, sport, hockey, afternoon, card, team, match

wi=playC={basketball, soccer}

nouns(play,2): instrument, band, night

nouns(play,35): …

I play basketball and soccer

JIGSAW_verbs: algorithm [2/4]

Page 13: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

nouns(play,1): game, sport, hockey, afternoon, card, team, match

basketball

basketball1

basketballh

MAXbasketball = MAXi Sim(wi,basketball) winouns(play,1)

game

game1

game2

gamek

sport

sport1

sport2

sportm

wi=playC={basketball, soccer}

JIGSAW_verbs: algorithm [3/4]

similarity

Page 14: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_verbs: algorithm [4/4]

finally, an overall similarity score, (i, k), among sik and the whole context C is computed:

hhi

Cwjkji

wposwposG

wposwposG

kRkij

))(),((

max))(),((

)(),(

the synset assigned to wi is the one with the highest value

Page 15: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW_others

Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee07] (inspired to Lesk)

Idea: computes the overlap between the glosses of each candidate sense for the target word to the glosses of all words in its context

assigns the synset with the highest overlap score if ties occur, the most common synset in WordNet is

chosen

Page 16: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Evaluation

EVALITA All-Words-Taskdisambiguate all words in a text

Dataset16 texts in Italian languageabout 5000 words (tagged by ItalWordNet)

ProcessingWSD needs others NLP steps:

Text normalization and Tokenization Part-Of-Speech Tagging (based on ACOPOST) Lemmatization (based on Morph-it! Resource)

META

Page 17: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Evaluation (Results)

system precision recall attempted

JIGSAW 0,560 0,414 73,95%

Baseline (1° sense)

0,669 0,669 100%

Comments: results are encouraging considering that our system exploits only ItalWordNetpre-processing phases, lemmatization and POS-tagging, introduce errors:

77,66% lemmatization precision 76,23% POS-tagging precision

Page 18: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Conclusions

Conclusions:Knowledge-based WSD algorithmDifferent strategy for each POS-taggingUse only WordNet (ItalWordNet) and some

heuristics Advantage: use the same strategy for Italian and

English [Basile07] Drawback: low precision (now)

Page 19: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Future Work

Including new knowledge sources:Web

(e.g. Topic Signature)

Wikipedia (e.g. Wikipedia similarity)

Do not include resources that are available for only few languages

Use others heuristics: Statistical distribution of senses instead WordNet frequency WordNet domains

Page 20: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

References S. Banerjee and T. Pedersen. An adapted lesk algorithm for word

sense disambiguation using wordnet. In CICLing’02: Proc. 3rd Int’l Conf. on Computational Linguistics and Intelligent Text Processing,pages 136–145, London, UK, 2002. Springer-Verlag.

P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations, pages 398–401. ACL press, 2007.

C. Leacock and M. Chodorow. Combining local context and wordnet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.

P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.

Page 21: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Backup slides

Page 22: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

Results (detail)

total valid precision

nouns 2405 1923 0,556

verbs 1479 1118 0,375

others 694 330 0,676

proper nouns 159 126 0,913

Comments:polysemy of verb is highgenerally proper nouns have only one sense lemmatizer and pos-tagger work worst for adjectives

Page 23: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAWnouns: The idea Inspired by the idea proposed by Resnik

[Resnik95]

[Resnik95] P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.

Most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen

[ w1, w2, … wn ]

[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]

[ w1 w2 … wn ]

Relatedness is measured by computing a score for each sij

confidence with which sij is the most appropriate synset for wi

The score is a way to assign credit to word senses based on similarity with co-occurring words (support)

[0.4 0.3 0.5] [0.2 0.3 0.4] [0.6 0.1 0.2]

Page 24: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAWnouns: The support

[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]

[ w1 w2 … wn ]

the more similar two words are, the more informative will be the most specific concept that subsumes both of them

0.56

0.15semantic similarity score

most specific subsumer MSS

Page 25: JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAWnouns: The idea

[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]

[ w1 w2 … wn ]

MSS = placental mammal

most specific subsumer MSS

cat mouse

feline mamma

l

rodent[0.56 0.0 0.0] [0.56 0.0 0.56] [0.0 0.0 0.0]

Placental mammal

Carnivore Rodent

Feline, felid

Cat(feline mammal)

Mouse(rodent)

MSS