jigsaw: an algorithm for word sense disambiguation

EVALITA 2007Evaluation of NLP Tools for Italian

EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy

JIGSAW: an Algorithm for Word Sense Disambiguation

Dipartimento di InformaticaUniversity of Bari

Pierpaolo Basile ([email protected])Giovanni Semeraro ([email protected])


Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilitiesSense Inventory usually comes from a

dictionary or thesaurusKnowledge intensive methods, supervised

learning, and (sometimes) bootstrapping approaches


All Words WSD

Attempt to disambiguate all open-class words in a text: “He put his suit over the back of the chair”

How?Knowledge-based approachesUse information from dictionariesPosition in a semantic networkUse discourse propertiesMinimally supervised approachesMost frequent sense


JIGSAW

Knowledge-based WSD algorithmDisambiguation of words in a text by

exploiting WordNet sensesCombination of three different strategies

to disambiguate nouns, verbs, adjectives and adverbs

Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word


JIGSAW algorithm

Input: document d = {w1, w2, ... , wh} Output: list of WordNet synsets X = {s1, s2, ... ,

sk}each element si is obtained by disambiguating the

target word wi

based on the information obtained from WordNet about words in the context context C of the target word: a window of n words to the left

and another n words to the right, for a total of 2n surrounding words

For each word JIGSAW adopts a different strategy based on POS tag


JIGSAW_nouns: the idea

Based on Resnik [Resnik95] algorithm for disambiguating noun groups

Given a set of nouns W={w1,w2, ... ,wn} from document d:each wi has an associated sense inventory

Si={si1, si2, ... , sik} of possible senses

Goal: assigning each wi with the most appropriate sense sihSi, according to the similarity of wi with the other nouns in W


JIGSAW_nouns: semantic similarity

w = cat C = {mouse}

cat mousemouse

Mouse#1: any of numerous small

rodents…

Mouse#2: a hand-operated electronic

device …

Cat#1: feline mammal…

Cat#2: computerized axial

tomography…

T={02244530,03651364}

Wcat={02037721,00847815}

T={mouse#1,mouse#2}

Wcat={cat#1,cat#2}

“The white cat is hunting the mouse”

0.726

0.0

0.107

0.0


JIGSAW_nouns: MSS support

MostSpecificSubsumer between wordsGive more importance to senses that are

hyponym of MSS Combine MSS support with semantic

similarity

Wcat={cat#1,cat#2} T={mouse#1,mouse#2}

MostSpecificSubsumer{placental_mammal}


Difference between JIGSAW_nouns and Resnik

Leacock-Chodorow measure to calculate similarity (instead Information Content)

a Gaussian factor G, which takes into account the distance between words in the text

a factor R, which takes into account the synset frequency score in WordNet

a parameterized search for the MSS (Most Specific Subsumer) between two concepts


JIGSAW_verbs: the idea

Try to establish a relation between verbs and nouns (distinct IS-A hierarchies in WordNet)

Verb wi disambiguated using:nouns in the context C of wi

nouns into the description (gloss + WordNet usage examples) of each candidate synset for wi


JIGSAW_verbs: algorithm [1/4]

For each candidate synset sik of wi

computes nouns(i, k): the set of nouns in the description for sik

for each wj in C and each synset sik computes the highest similarity maxjk

maxjk is the highest similarity value for wj wrt the nouns related to the k-th sense for wi (using Leacock-Chodorow measure)


1. (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches")

2. (29) play -- (play on an instrument; "The band played all night long")

3. …

nouns(play,1): game, sport, hockey, afternoon, card, team, match

wi=playC={basketball, soccer}

nouns(play,2): instrument, band, night

nouns(play,35): …

…

I play basketball and soccer



nouns(play,1): game, sport, hockey, afternoon, card, team, match

basketball

basketball1

basketballh

MAXbasketball = MAXi Sim(wi,basketball) winouns(play,1)

game

game1

game2

gamek

…

sport

sport1

sport2

sportm

…

…

wi=playC={basketball, soccer}


similarity



finally, an overall similarity score, (i, k), among sik and the whole context C is computed:

hhi

Cwjkji

wposwposG

wposwposG

kRkij

))(),((

max))(),((

)(),(

the synset assigned to wi is the one with the highest value


JIGSAW_others

Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee07] (inspired to Lesk)

Idea: computes the overlap between the glosses of each candidate sense for the target word to the glosses of all words in its context

assigns the synset with the highest overlap score if ties occur, the most common synset in WordNet is

chosen


Evaluation

EVALITA All-Words-Taskdisambiguate all words in a text

Dataset16 texts in Italian languageabout 5000 words (tagged by ItalWordNet)

ProcessingWSD needs others NLP steps:

Text normalization and Tokenization Part-Of-Speech Tagging (based on ACOPOST) Lemmatization (based on Morph-it! Resource)

META


Evaluation (Results)

system precision recall attempted

JIGSAW 0,560 0,414 73,95%

Baseline (1° sense)

0,669 0,669 100%

Comments: results are encouraging considering that our system exploits only ItalWordNetpre-processing phases, lemmatization and POS-tagging, introduce errors:

77,66% lemmatization precision 76,23% POS-tagging precision


Conclusions

Conclusions:Knowledge-based WSD algorithmDifferent strategy for each POS-taggingUse only WordNet (ItalWordNet) and some

heuristics Advantage: use the same strategy for Italian and

English [Basile07] Drawback: low precision (now)


Future Work

Including new knowledge sources:Web

(e.g. Topic Signature)

Wikipedia (e.g. Wikipedia similarity)

Do not include resources that are available for only few languages

Use others heuristics: Statistical distribution of senses instead WordNet frequency WordNet domains


References S. Banerjee and T. Pedersen. An adapted lesk algorithm for word

sense disambiguation using wordnet. In CICLing’02: Proc. 3rd Int’l Conf. on Computational Linguistics and Intelligent Text Processing,pages 136–145, London, UK, 2002. Springer-Verlag.

P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations, pages 398–401. ACL press, 2007.

C. Leacock and M. Chodorow. Combining local context and wordnet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.

P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.


Backup slides


Results (detail)

total valid precision

nouns 2405 1923 0,556

verbs 1479 1118 0,375

others 694 330 0,676

proper nouns 159 126 0,913

Comments:polysemy of verb is highgenerally proper nouns have only one sense lemmatizer and pos-tagger work worst for adjectives


JIGSAWnouns: The idea Inspired by the idea proposed by Resnik

[Resnik95]

[Resnik95] P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.

Most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen

[ w1, w2, … wn ]

[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]

[ w1 w2 … wn ]

Relatedness is measured by computing a score for each sij

confidence with which sij is the most appropriate synset for wi

The score is a way to assign credit to word senses based on similarity with co-occurring words (support)

[0.4 0.3 0.5] [0.2 0.3 0.4] [0.6 0.1 0.2]


JIGSAWnouns: The support


[ w1 w2 … wn ]

the more similar two words are, the more informative will be the most specific concept that subsumes both of them

0.56

0.15semantic similarity score

most specific subsumer MSS


JIGSAWnouns: The idea


[ w1 w2 … wn ]

MSS = placental mammal

most specific subsumer MSS

cat mouse

feline mamma

l

rodent[0.56 0.0 0.0] [0.56 0.0 0.56] [0.0 0.0 0.0]

Placental mammal

Carnivore Rodent

Feline, felid

Cat(feline mammal)

Mouse(rodent)

MSS

jigsaw: an algorithm for word sense disambiguation

Documents