Advances in WSD



    Advances in Word Sense Disambiguation

    Tutorial at ACL 2005

    June 25, 2005

    Ted Pedersen

    University of Minnesota, Duluth

    http://www.d.umn.edu/~tpederse

    Rada Mihalcea

    University of North Texas

    http://www.cs.unt.edu/~rada


    Goal of the Tutorial

    Introduce the problem of word sense disambiguation

    (WSD), focusing on the range of formulations and

    approaches currently practiced.

    Accessible to anyone with an interest in NLP.

    Persuade you to work on word sense disambiguation

    It's an interesting problem

    Lots of good work already done, still more to do

    There is infrastructure to help you get started

    Persuade you to use word sense disambiguation in your

    text applications.


    Outline of Tutorial

    Introduction (Ted)

    Methodology (Rada)

    Knowledge Intensive Methods (Rada)

    Supervised Approaches (Ted)

    Minimally Supervised Approaches (Rada) / BREAK

    Unsupervised Learning (Ted)

    How to Get Started (Rada)

    Conclusion (Ted)


    Part 1:

    Introduction


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Definitions

    Word sense disambiguation is the problem of selecting a

    sense for a word from a set of predefined possibilities.

    Sense Inventory usually comes from a dictionary or thesaurus.

    Knowledge intensive methods, supervised learning, and

    (sometimes) bootstrapping approaches

    Word sense discrimination is the problem of dividing the

    usages of a word into different meanings, without regard to

    any particular existing sense inventory.

    Unsupervised techniques


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Computers versus Humans

    Polysemy: most words have many possible meanings.

    A computer program has no basis for knowing which one
    is appropriate, even if it is obvious to a human.

    Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases.


    Ambiguity for Humans - Newspaper Headlines!

    DRUNK GETS NINE YEARS IN VIOLIN CASE

    FARMER BILL DIES IN HOUSE

    PROSTITUTES APPEAL TO POPE

    STOLEN PAINTING FOUND BY TREE

    RED TAPE HOLDS UP NEW BRIDGE

    DEER KILL 300,000

    RESIDENTS CAN DROP OFF TREES

    INCLUDE CHILDREN WHEN BAKING COOKIES

    MINERS REFUSE TO WORK AFTER DEATH


    Ambiguity for a Computer

    The fisherman jumped off the bank and into the water.

    The bank down the street was robbed!

    Back in the day, we had an entire bank of computers
    devoted to this problem.

    The bank in that road is entirely too steep and is really
    dangerous.

    The plane took a bank to the left, and then headed off

    towards the mountains.


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Early Days of WSD

    Noted as problem for Machine Translation (Weaver, 1949)

    A word can often only be translated if you know the specific sense

    intended (A bill in English could be a pico or a cuenta in Spanish)

    Bar-Hillel (1960) posed the following:

    Little John was looking for his toy box. Finally, he found it. The

    box was in the pen. John was very happy.

    Is pen a writing instrument or an enclosure where children play?

    He declared it unsolvable and left the field of MT!


    Since then

    1970s - 1980s

    Rule based systems

    Rely on hand crafted knowledge sources

    1990s

    Corpus based approaches

    Dependence on sense tagged text

    (Ide and Véronis, 1998) overview the history from the early days to 1998.

    2000s

    Hybrid Systems

    Minimizing or eliminating use of sense tagged text

    Taking advantage of the Web


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Interdisciplinary Connections

    Practical Applications


    Interdisciplinary Connections

    Cognitive Science & Psychology

    Quillian (1968), Collins and Loftus (1975): spreading activation

    Hirst (1987) developed marker passing model

    Linguistics

    Fodor & Katz (1963): selectional preferences

    Resnik (1993) pursued statistically

    Philosophy of Language

    Wittgenstein (1958): meaning as use

    For a large class of cases-though not for all-in which we employ

    the word "meaning" it can be defined thus: the meaning of a word

    is its use in the language.


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Practical Applications

    Machine Translation

    Translate bill from English to Spanish

    Is it a pico or a cuenta?

    Is it a bird jaw or an invoice?

    Information Retrieval

    Find all Web Pages about cricket

    The sport or the insect?

    Question Answering

    What is George Miller's position on gun control?

    The psychologist or US congressman?

    Knowledge Acquisition

    Add to KB: Herb Bergson is the mayor of Duluth.

    Minnesota or Georgia?


    References

    (Bar-Hillel, 1960) The Present Status of Automatic Translations of Languages. In Advances in Computers, Volume 1. Alt, F. (editor). Academic Press, New York, NY. pp. 91-163.

    (Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82) pp. 407-428.

    (Fodor and Katz, 1963) The structure of semantic theory. Language (39). pp. 170-210.

    (Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.

    (Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24) pp. 1-40. http://www.cs.vassar.edu/~ide/papers/idevero.ps

    (Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA. pp. 227-270.

    (Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation. University of Pennsylvania.

    (Weaver, 1949) Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors). The MIT Press, Cambridge, Mass. pp. 15-23.

    (Wittgenstein, 1958) Philosophical Investigations, 3rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York.

    Part 2:

    Methodology


    Outline

    General considerations

    All-words disambiguation

    Targeted-words disambiguation

    Word sense discrimination, sense discovery

    Evaluation (granularity, scoring)


    Overview of the Problem

    Many words have several meanings (homonymy / polysemy)

    Ex: chair = furniture or person

    Ex: child = young person or human offspring

    Determine which sense of a word is used in a specific sentence

    Note:

    often, the different senses of a word are closely related

    Ex: title - right of legal ownership
    - document that is evidence of the legal ownership

    sometimes, several senses can be activated in a single context (co-activation)

    Ex: This could bring competition to the trade

    competition: - the act of competing
    - the people who are competing


    Word Senses

    The meaning of a word in a given context

    Word sense representations

    With respect to a dictionary:

    chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"

    chair = the position of professor; "he was awarded an endowed chair in economics"

    With respect to the translation in a second language:

    chair = chaise

    chair = directeur

    With respect to the context where it occurs (discrimination):

    Sit on a chair / Take a seat on this chair

    The chair of the Math Department / The chair of the meeting


    Approaches to Word Sense Disambiguation

    Knowledge-Based Disambiguation

    use of external lexical resources such as dictionaries and thesauri

    discourse properties

    Supervised Disambiguation

    based on a labeled training set

    the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category)

    Unsupervised Disambiguation

    based on unlabeled corpora

    the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)


    All Words Word Sense Disambiguation

    Attempt to disambiguate all open-class words in a text

    He put his suit over the back of the chair

    Knowledge-based approaches

    Use information from dictionaries

    Definitions / Examples for each meaning

    Find similarity between definitions and current context

    Position in a semantic network

    Find that table is closer to chair/furniture than to chair/person

    Use discourse properties

    A word exhibits the same sense in a discourse / in a collocation


    All Words Word Sense Disambiguation

    Minimally supervised approaches

    Learn to disambiguate words using small annotated corpora

    E.g. the SemCor corpus, where all open-class words are

    disambiguated

    200,000 running words

    Most frequent sense


    Targeted Word Sense Disambiguation

    Disambiguate one target word

    Take a seat on this chair

    The chair of the Math Department

    WSD is viewed as a typical classification problem

    use machine learning techniques to train a system

    Training:

    Corpus of occurrences of the target word, each occurrence

    annotated with appropriate sense

    Build feature vectors: a vector of relevant linguistic features that represents the context (ex:

    a window of words around the target word)

    Disambiguation:

    Disambiguate the target word in new unseen text


    Targeted Word Sense Disambiguation

    Take a window of n words around the target word

    Encode information about the words around the target word

    typical features include: words, root forms, POS tags, frequency, etc.

    An electric guitar and bass player stand off to one side, not really part of

    the scene, just as a sort of nod to gringo expectations perhaps.

    Surrounding context (local features)

    [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]

    Frequent co-occurring words (topical features)

    [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

    [0,0,0,1,0,0,0,0,0,0,1,0]

    Other features:

    [followed by "player", contains "show" in the sentence, ...]

    [yes, no, ...]
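    A minimal Python sketch of this kind of feature encoding (the function name, window size, and BNC-style POS tags are illustrative, not from the tutorial's software). With the slide's topical vocabulary, the sentence about the bass player yields exactly the 0/1 vector shown above:

        def extract_features(tokens, pos_tags, target_index, window=2,
                             topical_vocab=("fishing", "big", "sound", "player",
                                            "fly", "rod", "pound", "double",
                                            "runs", "playing", "guitar", "band")):
            # Local features: words and POS tags in a window around the target.
            local = {}
            for offset in range(-window, window + 1):
                i = target_index + offset
                if offset != 0 and 0 <= i < len(tokens):
                    local[f"word_{offset:+d}"] = tokens[i]
                    local[f"pos_{offset:+d}"] = pos_tags[i]
            # Topical features: binary indicators for frequent co-occurring words.
            topical = [1 if w in tokens else 0 for w in topical_vocab]
            return local, topical

        tokens = "an electric guitar and bass player stand off to one side".split()
        pos = ["AT0", "AJ0", "NN1", "CJC", "NN1", "NN1", "VVB", "AVP", "PRP", "CRD", "NN1"]
        print(extract_features(tokens, pos, target_index=4))  # target word: bass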


    Unsupervised Disambiguation

    Disambiguate word senses:

    without supporting tools such as dictionaries and thesauri

    without a labeled training text

    Without such resources, word senses are not labeled

    We cannot say chair/furniture or chair/person

    We can:

    Cluster/group the contexts of an ambiguous word into a number

    of groups

    Discriminate between these groups without actually labeling

    them


    Unsupervised Disambiguation

    Hypothesis: same senses of words will have similar neighboring

    words

    Disambiguation algorithm

    Identify context vectors corresponding to all occurrences of a particular word

    Partition them into regions of high density

    Assign a sense to each such region

    Sit on a chair

    Take a seat on this chair

    The chair of the Math Department

    The chair of the meeting


    Evaluating Word Sense Disambiguation

    Metrics:

    Precision = percentage of words that are tagged correctly, out of the words addressed by the system

    Recall = percentage of words that are tagged correctly, out of all words in the test set

    Example: test set of 100 words; the system attempts 75 words; 50 words are correctly disambiguated

    Precision = 50 / 75 = 0.66

    Recall = 50 / 100 = 0.50
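    The example's arithmetic, double-checked as a few lines of Python:

        attempted, correct, total = 75, 50, 100
        precision = correct / attempted   # 0.666... (the slide truncates to 0.66)
        recall = correct / total          # 0.5
        print(precision, recall)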

    Special tags are possible:

    Unknown

    Proper noun

    Multiple senses

    Compare to a gold standard

    SemCor corpus, Senseval corpus, etc.


    Evaluating Word Sense Disambiguation

    Difficulty in evaluation:

    Nature of the senses to distinguish has a huge impact on results

    Coarse versus fine-grained sense distinctions:

    chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"

    chair = the position of professor; "he was awarded an endowed chair in economics"

    bank = a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"

    bank = a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon"

    Sense maps

    Cluster similar senses

    Allow for both fine-grained and coarse-grained evaluation


    Bounds on Performance

    Upper and Lower Bounds on Performance:

    Measure of how well an algorithm performs relative to the difficulty of the task.

    Upper Bound:

    Human performance

    Around 97%-99% with few and clearly distinct senses

    Inter-judge agreement:

    With words with clear & distinct senses: 95% and up

    With polysemous words with related senses: 65%-70%

    Lower Bound (or baseline):

    The assignment of a random sense / the most frequent sense

    90% is excellent for a word with 2 equiprobable senses

    90% is trivial for a word with 2 senses with probability ratios of 9 to 1


    References

    (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.

    (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

    (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995.

    (Senseval) Senseval evaluation exercises: http://www.senseval.org


    Part 3:

    Knowledge-based Methods for

    Word Sense Disambiguation


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Task Definition

    Knowledge-based WSD = class of WSD methods relying

    (mainly) on knowledge drawn from dictionaries and/or raw

    text

    Resources:

    Yes:

    Machine Readable Dictionaries

    Raw corpora

    No:

    Manually annotated corpora

    Scope:

    All open-class words


    Machine Readable Dictionaries

    In recent years, most dictionaries have been made available in Machine Readable format (MRD):

    Oxford English Dictionary

    Collins

    Longman Dictionary of Contemporary English (LDOCE)

    Thesauruses add synonymy information:

    Roget's Thesaurus

    Semantic networks add more semantic relations:

    WordNet

    EuroWordNet


    MRD: A Resource for Knowledge-based WSD

    For each word in the language vocabulary, an MRD provides:

    A list of meanings

    Definitions (for all word meanings)

    Typical usage examples (for most word meanings)

    WordNet definitions/examples for the noun plant:

    1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"

    2. a living organism lacking the power of locomotion

    3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"

    4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience


    MRD: A Resource for Knowledge-based WSD

    A thesaurus adds:

    An explicit synonymy relation between word meanings

    A semantic network adds:

    Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF),

    antonymy, entailment, etc.

    WordNet synsets for the noun plant

    1. plant, works, industrial plant

    2. plant, flora, plant life

    WordNet related concepts for the meaning plant life

    {plant, flora, plant life}

    hypernym: {organism, being}

    hyponym: {house plant}, {fungus}, ...

    meronym: {plant tissue}, {plant part}

    holonym: {Plantae, kingdom Plantae, plant kingdom}


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Lesk Algorithm

    (Michael Lesk 1986): identify senses of words in context using definition overlap

    Algorithm:

    1. Retrieve from MRD all sense definitions of the words to be

    disambiguated

    2. Determine the definition overlap for all possible sense combinations

    3. Choose senses that lead to highest overlap

    Example: disambiguate PINE CONE

    PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness

    CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

    Pine#1 ∩ Cone#1 = 0
    Pine#2 ∩ Cone#1 = 0
    Pine#1 ∩ Cone#2 = 1
    Pine#2 ∩ Cone#2 = 0
    Pine#1 ∩ Cone#3 = 2
    Pine#2 ∩ Cone#3 = 0


    Lesk Algorithm for More than Two Words?

    I saw a man who is 98 years old and can still walk and tell jokes

    nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)

    43,929,600 sense combinations! How to find the optimal sense combination?

    Simulated annealing (Cowie, Guthrie, Guthrie 1992)

    Define a function E = combination of word senses in a given text.

    Find the combination of senses that leads to highest definition overlap (redundancy):

    1. Start with E = the most frequent sense for each word

    2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E

    3. Stop iterating when there is no change in the configuration of senses (a rough sketch in code follows)
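    A rough Python sketch of this search, assuming an overlap() function that scores the total definition overlap of a sense configuration; the greedy acceptance rule below is a simplification (true simulated annealing also accepts some worse configurations early on):

        import random

        def anneal_senses(words, senses_of, overlap, iterations=1000):
            # senses_of[w] lists w's senses, most frequent first
            config = {w: senses_of[w][0] for w in words}   # start: most frequent senses
            best = overlap(config)
            for _ in range(iterations):
                w = random.choice(words)
                old = config[w]
                config[w] = random.choice(senses_of[w])    # try a different sense
                new = overlap(config)
                if new >= best:
                    best = new           # keep configurations with higher overlap
                else:
                    config[w] = old      # revert
            return config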


    Lesk Algorithm: A Simplified Version

    Original Lesk definition: measure overlap between sense definitions for all words in context

    Identify simultaneously the correct senses for all words in context

    Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap

    between sense definitions of a word and current context

    Identify the correct sense for one word at a time

    Search space significantly reduced


    Lesk Algorithm: A Simplified Version

    Example: disambiguate PINE in

    Pine cones hanging in a tree

    PINE

    1. kinds of evergreen tree with needle-shaped leaves

    2. waste away through sorrow or illness

    Pine#1 ∩ Sentence = 1

    Pine#2 ∩ Sentence = 0

    Algorithm for simplified Lesk (a runnable sketch follows):

    1. Retrieve from MRD all sense definitions of the word to be disambiguated

    2. Determine the overlap between each sense definition and the current context

    3. Choose the sense that leads to highest overlap
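    A minimal sketch of simplified Lesk in Python; the two-sense inventory and stopword list are toy stand-ins for a real MRD lookup:

        def simplified_lesk(context, sense_definitions,
                            stopwords=frozenset({"a", "of", "in", "the", "with"})):
            # Pick the sense whose definition shares the most words with the context.
            context_words = {w.lower() for w in context.split()} - stopwords
            best_sense, best_overlap = None, -1
            for sense, definition in sense_definitions.items():
                def_words = {w.lower() for w in definition.split()} - stopwords
                overlap = len(context_words & def_words)
                if overlap > best_overlap:
                    best_sense, best_overlap = sense, overlap
            return best_sense, best_overlap

        senses = {
            "pine#1": "kinds of evergreen tree with needle-shaped leaves",
            "pine#2": "waste away through sorrow or illness",
        }
        print(simplified_lesk("Pine cones hanging in a tree", senses))
        # ('pine#1', 1): only "tree" overlaps, as on the slide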


    Evaluations of Lesk Algorithm

    Initial evaluation by M. Lesk: 50-70% on short samples of text, manually annotated with respect to the Oxford Advanced Learner's Dictionary

    Simulated annealing

    47% on 50 manually annotated sentences

    Evaluation on Senseval-2 all-words data, with back-off to

    random sense (Mihalcea & Tarau 2004)

    Original Lesk: 35%

    Simplified Lesk: 47%

    Evaluation on Senseval-2 all-words data, with back-off to

    most frequent sense (Vasilescu, Langlais, Lapalme 2004)

    Original Lesk: 42%

    Simplified Lesk: 58%


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Preferences

    Measures of Semantic Similarity

    Heuristic-based Methods


    Selectional Preferences

    A way to constrain the possible meanings of words in a given context

    E.g. Wash a dish vs. Cook a dish

    WASH-OBJECT vs. COOK-FOOD

    Capture information about possible relations between semantic classes

    Common sense knowledge

    Alternative terminology:

    Selectional Restrictions

    Selectional Preferences

    Selectional Constraints


    Acquiring Selectional Preferences

    From annotated corpora

    Circular relationship with the WSD problem

    Need WSD to build the annotated corpus

    Need selectional preferences to derive WSD

    From raw corpora

    Frequency counts

    Information theory measures

    Class-to-class relations


    Preliminaries: Learning Word-to-Word Relations

    An indication of the semantic fit between two words

    1. Frequency counts

    Pairs of words connected by a syntactic relation

    2. Conditional probabilities

    Condition on one of the words

    Count(W1, W2, R)

    P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
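    A small Python sketch of these estimates over (word1, word2, relation) triples; the triples below are invented for illustration:

        from collections import Counter

        triples = [("drink", "tea", "verb-obj"), ("drink", "tea", "verb-obj"),
                   ("drink", "beer", "verb-obj"), ("see", "tea", "verb-obj")]
        pair_counts = Counter(triples)                        # Count(W1, W2, R)
        w2_counts = Counter((w2, r) for _, w2, r in triples)  # Count(W2, R)

        def p_w1_given_w2(w1, w2, r):
            return pair_counts[(w1, w2, r)] / w2_counts[(w2, r)]

        print(p_w1_given_w2("drink", "tea", "verb-obj"))  # 2/3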


    Learning Selectional Preferences (1)

    Word-to-class relations (Resnik 1993)

    Quantify the contribution of a semantic class using all the concepts

    subsumed by that class

    A(W1, C2, R) = P(C2 | W1, R) log [ P(C2 | W1, R) / P(C2) ] / Σ_C P(C | W1, R) log [ P(C | W1, R) / P(C) ]

    where

    P(C2 | W1, R) = Count(W1, C2, R) / Count(W1, R)

    Count(W1, C2, R) = Σ_{W2 ∈ C2} Count(W1, W2, R) / |classes(W2)|


    Learning Selectional Preferences (2)

    Determine the contribution of a word sense based on the assumption of equal sense distributions:

    e.g. plant has two senses: 50% of occurrences are sense 1, 50% are sense 2

    Example: learning restrictions for the verb to drink

    Find high-scoring verb-object pairs

    Find prototypical object classes (high association score)

    Co-occ score   Verb    Object
    11.75          drink   tea
    11.75          drink   Pepsi
    11.75          drink   champagne
    10.53          drink   liquid
    10.2           drink   beer
     9.34          drink   wine

    A(v,c)   Object class
    3.58     (beverage, [drink, ...])
    2.05     (alcoholic_beverage, [intoxicant, ...])


    Learning Selectional Preferences (3)

    Other algorithms

    Learn class-to-class relations (Agirre and Martinez, 2002)

    E.g.: ingest-food is a class-to-class relation for eat chicken

    Bayesian networks (Ciaramita and Johnson, 2000)

    Tree cut model (Li and Abe, 1998)


    Using Selectional Preferences for WSD

    Algorithm:

    1. Learn a large set of selectional preferences for a given syntactic relation R

    2. Given a pair of words W1-W2 connected by a relation R

    3. Find all selectional preferences W1-C2 (word-to-class) or C1-C2 (class-to-class) that apply

    4. Select the meanings of W1 and W2 based on the selected semantic class

    Example: disambiguate coffee in drink coffee

    1. (beverage) a beverage consisting of an infusion of ground coffee beans

    2. (tree) any of several small trees native to the tropical Old World

    3. (color) a medium to dark brown color

    Given the selectional preference DRINK BEVERAGE: coffee#1


    Evaluation of Selectional Preferences for WSD

    Data set

    mainly on verb-object, subject-verb relations extracted from

    SemCor

    Compare against random baseline

    Results (Agirre and Martinez, 2000): average results on 8 nouns

    Similar figures reported in (Resnik 1997)

                    Object              Subject
                    Precision  Recall   Precision  Recall
    Random          19.2       19.2     19.2       19.2
    Word-to-word    95.9       24.9     74.2       18.0
    Word-to-class   66.9       58.0     56.2       46.8
    Class-to-class  66.6       64.8     54.0       53.7


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Semantic Similarity

    Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976)

    Use this property for WSD: identify related meanings for words that share a common context

    Context span:

    1. Local context: semantic similarity between pairs of words

    2. Global context: lexical chains


    Semantic Similarity in a Local Context

    Similarity determined between pairs of concepts, or

    between a word and its surrounding context

    Relies on similarity metrics on semantic networks

    (Rada et al. 1989)

    [Figure: fragment of a semantic network for carnivores, with carnivore at the top, subclasses such as fissiped mammal (fissiped), canine (canid), feline (felid), and bear, and below them dog, wolf, wild dog, hunting dog, hyena, hyena dog, dingo, dachshund, and terrier.]


    Semantic Similarity Metrics (1)

    Input: two concepts (same part of speech)

    Output: similarity measure (Leacock and Chodorow 1998)

    E.g. Similarity(wolf, dog) = 0.60

    Similarity(wolf, bear) = 0.42

    (Resnik 1995)

    Define information content, P(C) = probability of seeing a concept of type

    C in a large corpus

    Probability of seeing a concept = probability of seeing instances of that concept

    Determine the contribution of a word sense based on the assumption of

    equal sense distributions:

    e.g. plant has two senses: 50% of occurrences are sense 1, 50% are sense 2

    Similarity(C1, C2) = -log [ Path(C1, C2) / 2D ], where D is the taxonomy depth

    IC(C) = -log P(C)


    Semantic Similarity Metrics (2)

    Similarity using information content (Resnik 1995): define similarity between two concepts (LCS = Least Common Subsumer)

    Alternatives (Jiang and Conrath 1997)

    Other metrics:

    Similarity using information content (Lin 1998)

    Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)

    Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995)

    Adapted Lesk algorithm (Banerjee and Pedersen 2002)

    Similarity(C1, C2) = IC( LCS(C1, C2) )   (Resnik 1995)

    Similarity(C1, C2) = 2 × IC( LCS(C1, C2) ) / ( IC(C1) + IC(C2) )   (Lin 1998)
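    The path-based and information-content measures above, transcribed into Python; path lengths, concept probabilities, and IC values are assumed to be precomputed from a taxonomy such as WordNet:

        import math

        def information_content(p_concept):
            return -math.log(p_concept)            # IC(C) = -log P(C)

        def lch_similarity(path_len, depth):
            # Leacock & Chodorow: -log( Path(C1, C2) / (2 * D) )
            return -math.log(path_len / (2.0 * depth))

        def resnik_similarity(ic_lcs):
            return ic_lcs                          # IC of the Least Common Subsumer

        def lin_similarity(ic_lcs, ic_c1, ic_c2):
            return 2.0 * ic_lcs / (ic_c1 + ic_c2)  # 2*IC(LCS) / (IC(C1)+IC(C2))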


    Semantic Similarity Metrics for WSD

    Disambiguate target words based on similarity with one word to the left and one word to the right (Patwardhan, Banerjee, Pedersen 2003)

    Evaluation:

    1,723 ambiguous nouns from Senseval-2

    Among 5 similarity metrics, (Jiang and Conrath 1997) provide the

    best precision (39%)

    Example: disambiguate PLANT in "plant with flowers"

    PLANT
    1. plant, works, industrial plant
    2. plant, flora, plant life

    Similarity(plant#1, flower) = 0.2
    Similarity(plant#2, flower) = 1.5, so choose plant#2


    Semantic Similarity in a Global Context

    Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)

    A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse

    Algorithm for finding lexical chains:

    1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.

    2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word meaning.

    3. If such a chain is found, insert the word in this chain; otherwise, create a new chain. (A greedy sketch of this loop follows.)
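    A greedy Python sketch of the three steps; relatedness() stands in for any of the similarity metrics above, and the threshold is an assumed parameter:

        def build_chains(words_with_senses, relatedness, threshold=0.5):
            # words_with_senses: (word, [sense, ...]) pairs in text order,
            # senses listed most frequent first.
            chains = []  # each chain is a list of (word, sense) pairs
            for word, senses in words_with_senses:
                best = None  # (score, chain, sense)
                for sense in senses:
                    for chain in chains:
                        # relatedness of this sense to everything already in the chain
                        score = min(relatedness(sense, s) for _, s in chain)
                        if score >= threshold and (best is None or score > best[0]):
                            best = (score, chain, sense)
                if best is not None:
                    _, chain, sense = best
                    chain.append((word, sense))         # insert into the best chain
                else:
                    chains.append([(word, senses[0])])  # otherwise start a new chain
            return chains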


    Semantic Similarity of a Global Context

    A very long train traveling along the rails with a constant velocity v in a certain direction

    train
    #1: public transport
    #2: ordered set of things
    #3: piece of cloth

    travel
    #1: change location
    #2: undergo transportation

    rail
    #1: a barrier
    #2: a bar of steel for trains
    #3: a small bird


    Lexical Chains for WSD

    Identify lexical chains in a text

    Usually target one part of speech at a time

    Identify the meaning of words based on their membership in a lexical chain

    Evaluation:

    (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%

    (Mihalcea and Moldovan 2000): on five SemCor texts, 90% precision with 60% recall

    lexical chains anchored on monosemous words

    (Okumura and Honda 1994) lexical chains on five Japanese texts

    give 63.4%


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Most Frequent Sense (1)

    Identify the most often used meaning and use this meaning by default

    Example: plant/flora is used more often than plant/factory
    - annotate any instance of PLANT as plant/flora

    Word meanings exhibit a Zipfian distribution

    E.g. distribution of word senses in SemCor

    [Figure: relative frequency of sense numbers 1-10 in SemCor, plotted separately for nouns, verbs, adjectives, and adverbs; the first sense dominates and frequency falls off sharply for higher sense numbers.]


    Most Frequent Sense (2)

    Method 1: Find the most frequent sense in an annotated corpus

    Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)

    1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, ..., nk}, with associated similarity scores {dss(w,n1), dss(w,n2), ..., dss(w,nk)}

    2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score

    3. Rank senses wsi of w based on the total similarity score

    Score(wsi) = Σ_{nj ∈ Nw} dss(w, nj) × wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj)

    where wnss(wsi, nj) = max_{nsx ∈ senses(nj)} wnss(wsi, nsx)


    Most Frequent Sense (3)

    Word senses:

    pipe #1 = tobacco pipe
    pipe #2 = tube of metal or plastic

    Distributionally similar words:

    N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, ...}

    For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity)

    pipe#1 - tube (#3) = 0.3
    pipe#2 - tube (#1) = 0.6

    Compute score for each sense pipe#i:

    score(pipe#1) = 0.25
    score(pipe#2) = 0.73

    Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
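    A Python sketch of the ranking step, plugged with the pipe numbers above; with only a single neighbor the normalized scores come out as 0.30 and 0.60 rather than the slide's 0.25 and 0.73, which sum contributions over many neighbors:

        def predominant_sense_scores(dss, wnss):
            # dss: {neighbor: distributional similarity with the target word}
            # wnss: {(sense, neighbor): best WordNet similarity over the neighbor's senses}
            senses = {s for s, _ in wnss}
            scores = {}
            for s in senses:
                total = 0.0
                for n, d in dss.items():
                    norm = sum(wnss.get((s2, n), 0.0) for s2 in senses)
                    if norm > 0:
                        total += d * wnss.get((s, n), 0.0) / norm
                scores[s] = total
            return scores

        dss = {"tube": 0.9}  # illustrative value
        wnss = {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6}
        print(predominant_sense_scores(dss, wnss))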


    One Sense Per Discourse

    A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)

    What does this mean?

    E.g. the ambiguous word PLANT occurs 10 times in a discourse:
    all instances of plant carry the same meaning

    Evaluation:

    8 words with two-way ambiguity, e.g. plant, crane, etc.

    98% of these two-way ambiguous word occurrences in the same discourse carry the same meaning

    The grain of salt: performance depends on granularity

    (Krovetz 1998) experiments with words with more than two senses

    Performance of one sense per discourse measured on SemCor is approx. 70%


    One Sense per Collocation

    A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)

    Strong for adjacent collocations

    Weaker as the distance between words increases

    An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation industrial plant, regardless of the context where this collocation occurs

    Evaluation:

    97% precision on words with two-way ambiguity

    Finer granularity:

    (Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses

    70% precision on SemCor words



    References

    (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.

    (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.

    (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.

    (Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J.A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.

    (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992.

    (Halliday and Hasan 1976) Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman.

    (Galley and McKeown 2003) Galley, M. and McKeown, K. (2003). Improving word sense disambiguation in lexical chaining. IJCAI 2003.

    (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press.

    (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.

    (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.

    (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.

    (Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.



    (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.

    (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

    (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995.

    (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.

    (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.

    (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004.

    (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003.

    (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989.

    (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993.

    (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.

    (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.

    (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.


    Part 4:

    Supervised Methods of Word

    Sense Disambiguation


    Outline

    What is Supervised Learning?

    Task Definition

    Single Classifiers

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    What is Supervised Learning?

    Collect a set of examples that illustrate the various possible classifications or outcomes of an event.

    Identify patterns in the examples associated with each

    particular class of the event.

    Generalize those patterns into rules.

    Apply the rules to classify a new event.



    Learn from these examples:
    when do I go to the store?

    Day  CLASS: Go to Store?  F1: Hot Outside?  F2: Slept Well?  F3: Ate Well?
    1    YES                  YES               NO               NO
    2    NO                   YES               NO               YES
    3    YES                  NO                NO               NO
    4    NO                   NO                NO               YES



    Outline

    What is Supervised Learning?

    Task Definition

    Single Classifiers

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Task Definition

    Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.

    Resources

    Sense Tagged Text

    Dictionary (implicit source of sense inventory)

    Syntactic Analysis (POS tagger, Chunker, Parser, ...)

    Scope

    Typically one target word per context

    Part of speech of target word resolved

    Lends itself to targeted word formulation

    Reduces WSD to a classification problem where a target word is

    assigned the most appropriate sense from a given set of possibilities

    based on the context in which it occurs


    Sense Tagged Text

    Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.

    My bank/1 charges too much for an overdraft.

    I went to the bank/1 to deposit my check and get a new ATM card.

    The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.

    My grandfather planted his pole in the bank/2 and got a great big catfish!

    The bank/2 is pretty muddy, I can't walk there.


    Two Bags of Words

    (Co-occurrences in the window of context)

    FINANCIAL_BANK_BAG:

    a an and are ATM Bonnie card charges check Clyde

    criminals deposit famous for get I much My new overdraft

    really robbers the they think to too two went were

    RIVER_BANK_BAG:

    a an and big campus cant catfish East got grandfather great

    has his I in is Minnesota Mississippi muddy My of on planted

    pole pretty right River The the there University walk West


    Simple Supervised Approach

    Given a sentence S containing "bank":

    For each word Wi in S
        If Wi is in FINANCIAL_BANK_BAG then
            Sense_1 = Sense_1 + 1;
        If Wi is in RIVER_BANK_BAG then
            Sense_2 = Sense_2 + 1;

    If Sense_1 > Sense_2 then print "Financial"
    else if Sense_2 > Sense_1 then print "River"
    else print "Can't Decide";
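    The same pseudocode as runnable Python (a sketch; the bags are abbreviated from the slide's full word lists):

        FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "criminals",
                              "deposit", "overdraft", "robbers"}
        RIVER_BANK_BAG = {"campus", "catfish", "muddy", "mississippi",
                          "planted", "pole", "river"}

        def disambiguate_bank(sentence):
            sense_1 = sense_2 = 0
            for word in sentence.lower().split():
                if word in FINANCIAL_BANK_BAG:
                    sense_1 += 1
                if word in RIVER_BANK_BAG:
                    sense_2 += 1
            if sense_1 > sense_2:
                return "Financial"
            if sense_2 > sense_1:
                return "River"
            return "Can't Decide"

        print(disambiguate_bank("I went to the bank to deposit my check"))  # Financial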


    Supervised Methodology

    Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.

    One tagged word per instance / lexical sample disambiguation

    Select a set of features with which to represent context.

    co-occurrences, collocations, POS tags, verb-obj relations, etc.

    Convert sense-tagged training instances to feature vectors.

    Apply a machine learning algorithm to induce a classifier.

    Form: structure or relation among features

    Parameters: strength of feature interactions

    Convert a held out sample of test data into feature vectors.

    correct sense tags are known but not used

    Apply classifier to test instances to assign a sense tag.


    From Text to Feature Vectors

    My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)

    The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)

        P-2  P-1  P+1   P+2  fish  check  river  interest  SENSE TAG
    S1  adv  det  prep  det  Y     N      Y      N         SHORE
    S2  -    det  verb  det  N     Y      N      Y         FINANCE


    Supervised Learning Algorithms

    Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:

    Support Vector Machines

    Nearest Neighbor Classifiers

    Decision Trees

    Decision Lists

    Naïve Bayesian Classifiers

    Perceptrons

    Neural Networks

    Graphical Models

    Log Linear Models


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifier

    Decision Lists and Trees

    Ensembles of Classifiers


    Naïve Bayesian Classifier

    The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)

    Word Sense Disambiguation is no exception

    Assumes conditional independence among features, given the sense of a word.

    The form of the model is assumed, but parameters are estimated from training instances

    When applied to WSD, features are often a bag of words that come from the training data

    Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)


    Bayesian Inference

    Given observed features, what is the most likely sense?

    Estimate probability of observed features given sense

    Estimate unconditional probability of sense

    Unconditional probability of features is a normalizing term, doesn't affect sense classification

    p(S | F1, F2, F3, ..., Fn) = p(F1, F2, F3, ..., Fn | S) * p(S) / p(F1, F2, F3, ..., Fn)


    Naïve Bayesian Model

    [Figure: graphical model with the sense S as parent node and features F1, F2, F3, F4, ..., Fn as conditionally independent children.]

    P(F1, F2, ..., Fn | S) = p(F1|S) * p(F2|S) * ... * p(Fn|S)


    The Naïve Bayesian Classifier

    Given 2,000 instances of "bank": 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    P(S=1) = 1,500/2,000 = .75
    P(S=2) = 500/2,000 = .25

    Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:

    P(F1=credit) = 204/2,000 = .102
    P(F1=credit|S=1) = 200/1,500 = .133
    P(F1=credit|S=2) = 4/500 = .008

    Given a test instance that has one feature "credit":

    P(S=1|F1=credit) = .133*.75/.102 = .978
    P(S=2|F1=credit) = .008*.25/.102 = .020

    sense = argmax_{S} p(F1|S) * ... * p(Fn|S) * p(S)
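    The bank example checked in Python; exact arithmetic gives .980 where the slide, rounding intermediate values, reports .978:

        p_s = {1: 1500 / 2000, 2: 500 / 2000}            # priors
        p_credit_given_s = {1: 200 / 1500, 2: 4 / 500}   # feature likelihoods
        p_credit = 204 / 2000                            # normalizing term

        for s in (1, 2):
            posterior = p_credit_given_s[s] * p_s[s] / p_credit
            print(f"P(S={s} | F1=credit) = {posterior:.3f}")
        # P(S=1 | F1=credit) = 0.980
        # P(S=2 | F1=credit) = 0.020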


    Comparative Results

    (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line

    (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line

    (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words

    All found that the Naïve Bayesian Classifier performed as well as any of the other methods!


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Decision Lists and Trees

    Very widely used in Machine Learning.

    Decision trees used very early for WSD research (e.g.,

    Kelly and Stone, 1975; Black, 1988).

    Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.

    List decides between two senses after one positive answer

    Tree allows for decision among multiple senses after a series of answers

    Uses a smaller, more refined set of features than bag of words and Naïve Bayes.

    More descriptive and easier to interpret.


    Decision List for WSD (Yarowsky, 1994)

    Identify collocational features from sense tagged data.

    Word immediately to the left or right of target:

    I have my bank/1 statement.

    The river bank/2 is muddy.

    Pair of words to immediate left or right of target:

    The world's richest bank/1 is here in New York.

    The river bank/2 is muddy.

    Words found within k positions to left or right of target, where k is often 10-50:

    My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.


    Building the Decision List

    Sort order of collocation tests using log of conditional

    probabilities.

    Words most indicative of one sense (and not the other)

    will be ranked highly.

    Abs( log( P(S=1 | Fi = Collocation_i) / P(S=2 | Fi = Collocation_i) ) )


    Computing DL score

    Given 2,000 instances of "bank": 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    P(S=1) = 1,500/2,000 = .75

    P(S=2) = 500/2,000 = .25

    Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:

    P(F1=credit) = 204/2,000 = .102

    P(F1=credit|S=1) = 200/1,500 = .133

    P(F1=credit|S=2) = 4/500 = .008

    From Bayes Rule:

    P(S=1|F1=credit) = .133*.75/.102 = .978

    P(S=2|F1=credit) = .008*.25/.102 = .020

    DL Score = abs (log (.978/.020)) = 3.89
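    The same score in a couple of lines of Python (natural log, which reproduces the 3.89):

        import math

        p1, p2 = 0.978, 0.020        # posteriors from Bayes Rule above
        dl_score = abs(math.log(p1 / p2))
        print(round(dl_score, 2))    # 3.89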


    Using the Decision List

    Sort by DL-score, then go through the test instance looking for a matching feature. The first match reveals the sense.

    DL-score  Feature             Sense
    3.89      credit within bank  Bank/1 financial
    2.20      bank is muddy       Bank/2 river
    1.09      pole within bank    Bank/2 river
    0.00      of the bank         N/A


    Using the Decision List

    [Figure: decision flow. CREDIT? yes -> BANK/1 FINANCIAL; no -> IS MUDDY? yes -> BANK/2 RIVER; no -> POLE? yes -> BANK/2 RIVER.]


    Learning a Decision Tree

    Identify the feature that most cleanly divides the training data into the known senses.

    "Cleanly" is measured by information gain or gain ratio.

    Create subsets of training data according to feature values.

    Find another feature that most cleanly divides a subset of the training data.

    Continue until each subset of training data is pure or as clean as possible.

    Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)

    In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)


    Supervised WSD with Individual Classifiers

    Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation; most work reasonably well.

    (Witten and Frank, 2000) is a great intro to supervised learning.

    Features tend to differentiate among methods more than the learning algorithms.

    Good sets of features tend to include:

    Co-occurrences or keywords (global)

    Collocations (local)

    Bigrams (local and global)

    Part of speech (local)

    Predicate-argument relations

    Verb-object, subject-verb, etc.

    Heads of Noun and Verb Phrases


    Convergence of Results

    Accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another.

    Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.

    Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.

    Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task.

    What to do next?


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Ensembles of Classifiers

    Classifier error has two components (Bias and Variance)

    Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias / High Variance

    Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: High Bias / Low Variance

    Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy

    "Bagging" a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)

    Sample with replacement from the training data to learn multiple

    decision trees.

    Outliers in training data will tend to be obscured/eliminated.


    Ensemble Considerations

    Must choose different learning algorithms with significantly different bias/variance characteristics.

    Naïve Bayesian Classifier versus Decision Tree

    Must choose feature representations that yield significantly different (independent?) views of the training data.

    Lexical versus syntactic features

    Must choose how to combine classifiers (a simple voting sketch follows this list).

    Simple Majority Voting

    Averaging of probabilities across multiple classifier outputs

    Maximum Entropy combination (e.g., Klein et al., 2002)
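    A minimal sketch of the simplest combination rule, majority voting (the labels are illustrative):

        from collections import Counter

        def majority_vote(predictions):
            # predictions: one sense label per member classifier
            return Counter(predictions).most_common(1)[0][0]

        print(majority_vote(["bank/1", "bank/1", "bank/2"]))  # bank/1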


    Ensemble Results

    (Pedersen, 2000) achieved state of the art for interest and line data using an ensemble of Naïve Bayesian Classifiers.

    Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.

    Classifiers combined by a weighted vote.

    (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.

    Rich set of collocational and syntactic features.

    Combined via linear combination of the top three classifiers.

    Many Senseval-2 and Senseval-3 systems employed ensemble methods.


    References

    (Black, 1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pp. 185-194.

    (Breiman, 1996) The heuristics of instability in model selection. Annals of Statistics (24) pp. 2350-2383.

    (Domingos and Pazzani, 1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29) pp. 103-130.

    (Domingos, 2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. pp. 564-569.

    (Florian and Yarowsky, 2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP. pp. 25-32.

    (Kelly and Stone, 1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.

    (Klein et al., 2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. Proceedings of Senseval-2. pp. 87-89.

    (Leacock et al., 1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pp. 260-265.

    (Mooney, 1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP. pp. 82-91.


    References

    (Pedersen, 1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University.

    (Pedersen, 2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.

    (Quinlan, 1986) Induction of Decision Trees. Machine Learning (1). pp. 81-106.

    (Quinlan, 1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.

    (Witten and Frank, 2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

    (Yarowsky, 1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL. pp. 88-95.

    (Yarowsky, 2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.


    Part 5:

    Minimally Supervised Methods

    for Word Sense Disambiguation


    Outline

    Task definition

    What does "minimally supervised" mean?

    Bootstrapping algorithms

    Co-training

    Self-training

    Yarowsky algorithm

    Using the Web for Word Sense Disambiguation

    Web as a corpus

    Web as collective mind


    Task Definition

Supervised WSD = learning sense classifiers starting with annotated data

Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision

Examples:
  Automatically bootstrap a corpus starting with a few human annotated examples
  Use monosemous relatives / dictionary definitions to automatically construct sense tagged data
  Rely on Web-users + active learning for corpus annotation


    Outline

Task definition: What does "minimally supervised" mean?

Bootstrapping algorithms:
  Co-training
  Self-training
  Yarowsky algorithm

Using the Web for Word Sense Disambiguation:
  Web as a corpus
  Web as collective mind


    Bootstrapping WSD Classifiers

Build sense classifiers with little training data

Expand applicability of supervised WSD

Bootstrapping approaches:
  Co-training
  Self-training
  Yarowsky algorithm


    Bootstrapping Recipe

Ingredients:
  (Some) labeled data
  (Large amounts of) unlabeled data
  (One or more) basic classifiers

Output:
  Classifier that improves over the basic classifiers


Labeled seed examples: "plants#1 and animals", "industry plant#2"

Unlabeled examples:
  building the only atomic plant
  plant growth is retarded
  a herb or flowering plant
  a nuclear power plant
  building a new vehicle plant
  the animal and plant life
  the passion-fruit plant

Classifier 1 and Classifier 2 label new examples:
  plant#1 growth is retarded
  a nuclear power plant#2


Co-training / Self-training

Given:
  A set L of labeled training examples
  A set U of unlabeled examples
  Classifiers Ci

1. Create a pool of examples U'
   choose P random examples from U
2. Loop for I iterations
   Train Ci on L and label U'
   Select G most confident examples and add to L
   (maintain distribution in L)
   Refill U' with examples from U
   (keep U' at constant size P)
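A minimal Python sketch of the loop above, for the self-training case with a single classifier. The fit, predict, and confidence methods are an assumed interface for illustration; the step that maintains the sense distribution in L is omitted for brevity:

import random

def self_train(classifier, L, U, P, G, I):
    """Self-training sketch: L = labeled set, U = unlabeled set,
    P = pool size, G = growth size, I = iterations."""
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(P, len(U)))]   # 1. create pool U'
    for _ in range(I):                                # 2. loop for I iterations
        classifier.fit(L)                             # train Ci on L
        pool.sort(key=classifier.confidence, reverse=True)
        for example in pool[:G]:                      # G most confident examples
            L.append((example, classifier.predict(example)))
        del pool[:G]
        while U and len(pool) < P:                    # refill U' from U
            pool.append(U.pop())
    return classifier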


Co-training

(Blum and Mitchell 1998)
  Two classifiers
  Independent views [independence condition can be relaxed]

Co-training in Natural Language Learning:
  Statistical parsing (Sarkar 2001)
  Co-reference resolution (Ng and Cardie 2003)
  Part of speech tagging (Clark, Curran and Osborne 2003)
  ...


    Self-training

(Nigam and Ghani 2000)
  One single classifier
  Retrain on its own output

Self-training for Natural Language Learning:
  Part of speech tagging (Clark, Curran and Osborne 2003)
  Co-reference resolution (Ng and Cardie 2003) - several classifiers through bagging


    Parameter Setting for Co-training/Self-training

1. Create a pool of examples U'
   choose P random examples from U          [Pool size P]
2. Loop for I iterations                    [Iterations I]
   Train Ci on L and label U'
   Select G most confident examples and add to L   [Growth size G]
   (maintain distribution in L)
   Refill U' with examples from U
   (keep U' at constant size P)

A major drawback of bootstrapping: no principled method for selecting optimal values for these parameters (Ng and Cardie 2003)

Experiments with Co-training / Self-training for WSD

Training / Test data: Senseval-2 nouns (29 ambiguous nouns)
  Average corpus size: 95 training examples, 48 test examples

Raw data: British National Corpus
  Average corpus size: 7,085 examples

Co-training: two classifiers (local and topical classifiers)

Self-training: one classifier (global classifier)

(Mihalcea 2004)


    Parameter Settings

Parameter ranges:
  P = {1, 100, 500, 1000, 1500, 2000, 5000}
  G = {1, 10, 20, 30, 40, 50, 100, 150, 200}
  I = {1, ..., 40}
  29 nouns -> 120,000 runs

Upper bound in co-training/self-training performance, optimised on the test set:
  Basic classifier: 53.84%
  Optimal self-training: 65.61%
  Optimal co-training: 65.75%
  ~25% error reduction

Per-word parameter setting:
  Co-training = 51.73%
  Self-training = 52.88%

Global parameter setting:
  Co-training = 55.67%
  Self-training = 54.16%

Example: lady
  basic = 61.53%
  self-training = 84.61% [20/100/39]
  co-training = 82.05% [1/1000/3]


    Yarowsky Algorithm

(Yarowsky 1995)

Similar to co-training; differs in the basic assumption (Abney 2002): view independence (co-training) vs. precision independence (Yarowsky algorithm)

Relies on two heuristics and a decision list:

One sense per collocation:
  Nearby words provide strong and consistent clues as to the sense of a target word

One sense per discourse:
  The sense of a target word is highly consistent within a single document


    Learning Algorithm

A decision list is used to classify instances of the target word:

  "... the loss of animal and plant species through extinction ..."

Classification is based on the highest ranking rule that matches the target context:

  LogL   Collocation                   Sense
  9.31   flower (within +/- k words)   A (living)
  9.24   job (within +/- k words)      B (factory)
  9.03   fruit (within +/- k words)    A (living)
  9.02   plant species                 A (living)
  ...    ...                           ...
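A minimal sketch of decision list learning and classification in this spirit. The score compares the count of a feature's majority sense against the remaining senses; the smoothing constant is an assumption (Yarowsky's original work uses more careful smoothing):

import math

def train_decision_list(examples, smoothing=0.1):
    """examples: list of (features, sense) pairs.
    Returns rules sorted by log-likelihood, highest first."""
    counts = {}                       # feature -> {sense: count}
    for features, sense in examples:
        for f in features:
            counts.setdefault(f, {}).setdefault(sense, 0)
            counts[f][sense] += 1
    rules = []
    for f, by_sense in counts.items():
        best = max(by_sense, key=by_sense.get)
        rest = sum(n for s, n in by_sense.items() if s != best)
        logl = math.log((by_sense[best] + smoothing) / (rest + smoothing))
        rules.append((logl, f, best))
    return sorted(rules, reverse=True)

def classify(rules, features, default=None):
    """Apply the highest ranking rule that matches the context."""
    for logl, f, sense in rules:
        if f in features:
            return sense
    return default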


    Bootstrapping Algorithm

All occurrences of the target word are identified.

A small training set of seed data is tagged with word sense (Sense-A: life, Sense-B: factory).


    Bootstrapping Algorithm

Seed set grows and residual set shrinks.


    Bootstrapping Algorithm

    Convergence: Stop when residual set stabilizes


    Bootstrapping Algorithm

Iterative procedure:
  Train decision list algorithm on seed set
  Classify residual data with decision list
  Create new seed set by identifying samples that are tagged with a probability above a certain threshold
  Retrain classifier on new seed set

Selecting training seeds:
  Initial training set should accurately distinguish among possible senses
  Strategies:
    Select a single, defining seed collocation for each possible sense (e.g., life and manufacturing for target plant)
    Use words from dictionary definitions
    Hand-label most frequent collocates
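A sketch of the iterative procedure above, reusing train_decision_list and the rule format from the previous sketch; the min_logl confidence threshold stands in for the probability threshold mentioned above and is an assumption:

def yarowsky_bootstrap(seed, residual, min_logl=3.0, max_iters=20):
    """seed: list of (features, sense) pairs; residual: list of features.
    Grows the seed set until the residual set stabilizes."""
    for _ in range(max_iters):
        rules = train_decision_list(seed)
        confident, leftover = [], []
        for features in residual:
            match = next(((l, s) for l, f, s in rules if f in features), None)
            if match and match[0] >= min_logl:     # tagged confidently enough
                confident.append((features, match[1]))
            else:
                leftover.append(features)
        if not confident:                          # residual set has stabilized
            break
        seed = seed + confident                    # seed grows, residual shrinks
        residual = leftover
    return train_decision_list(seed)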


    Evaluation

Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)

Performance of multiple models compared with:
  supervised decision lists
  unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

  Word    Senses             Supervised  Unsupervised (Schütze)  Unsupervised (Bootstrapping)
  plant   living/factory     97.7        92                      98.6
  space   volume/outer       93.9        90                      93.6
  tank    vehicle/container  97.1        95                      96.5
  motion  legal/physical     98.0        92                      97.9
  Avg.    -                  96.1        92.2                    96.5


    Outline

Task definition: What does "minimally supervised" mean?

Bootstrapping algorithms:
  Co-training
  Self-training
  Yarowsky algorithm

Using the Web for Word Sense Disambiguation:
  Web as a corpus
  Web as collective mind


The Web as a Corpus

Use the Web as a large textual corpus:
  Build annotated corpora using monosemous relatives
  Bootstrap annotated corpora starting with few seeds (similar to Yarowsky 1995)

Use the (semi)automatically tagged data to train WSD classifiers


    Monosemous Relatives

Idea: determine a phrase (SP) which uniquely identifies the sense of a word (W#i)

1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics
2. Search the Web using the Search Phrases from step 1
3. Replace the Search Phrases in the examples gathered at step 2 with W#i

Output: sense annotated corpus for the word sense W#i

  "As a pastime, she enjoyed reading."            -> "As an interest, she enjoyed reading."
  "Evaluate the interestingness of the website."  -> "Evaluate the interest of the website."
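A minimal sketch of step 3, assuming the Web hits have already been retrieved into a dict mapping each search phrase to its snippets; the retrieval itself and article agreement (e.g. "a pastime" vs. "an interest") are not handled:

import re

def tag_snippets(search_phrases, snippets, word, sense_no):
    """Replace each search phrase in the retrieved snippets with the
    sense-tagged target word, e.g. 'pastime' -> 'interest#5'."""
    target = "%s#%d" % (word, sense_no)
    corpus = []
    for sp in search_phrases:
        pattern = re.compile(re.escape(sp), re.IGNORECASE)
        for snippet in snippets.get(sp, []):
            corpus.append(pattern.sub(target, snippet))
    return corpus

# Hypothetical usage with canned snippets standing in for Web results:
snippets = {"pastime": ["As a pastime, she enjoyed reading."]}
print(tag_snippets(["pastime"], snippets, "interest", 5))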


    Heuristics to Identify Monosemous Relatives

Synonyms: determine a monosemous synonym
  remember#1 has recollect as a monosemous synonym -> SP = recollect

Dictionary definitions (1):
  Parse the gloss and determine the set of single phrase definitions
  produce#5 has the definition "bring onto the market or release" -> 2 definitions: "bring onto the market" and "release"
  Eliminate "release" as being ambiguous -> SP = bring onto the market

Dictionary definitions (2):
  Parse the gloss and determine the set of single phrase definitions
  Replace the stop words with the NEAR operator
  Strengthen the query: concatenate the words from the current synset using the AND operator
  produce#6 has the synset {grow, raise, farm, produce} and the definition "cultivate by growing"
  SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)

    Heuristics to Identify Monosemous Relatives


Dictionary definitions (3):
  Parse the gloss and determine the set of single phrase definitions
  Keep only the head phrase
  Strengthen the query: concatenate the words from the current synset using the AND operator
  company#5 has the synset {party, company} and the definition "band of people associated in some activity"
  SP = band of people AND (company OR party)

    Example


Building annotated corpora for the noun interest.

  #  Synset                         Definition
  1  {interest#1, involvement}      sense of concern with and curiosity about someone or something
  2  {interest#2, interestingness}  the power of attracting or holding one's interest
  3  {sake, interest#3}             reason for wanting something done
  4  {interest#4}                   fixed charge for borrowing money; usually a percentage of the amount borrowed
  5  {pastime, interest#5}          a subject or pursuit that occupies one's time and thoughts
  6  {interest#6, stake}            a right or legal share of something; financial involvement with something
  7  {interest#7, interest group}   a social group whose members control some field of activity and who have common aims

  Sense #  Search phrase
  1        sense of concern AND (interest OR involvement)
  2        interestingness
  3        reason for wanting AND (interest OR sake)
  4        fixed charge AND interest
           percentage of amount AND interest
  5        pastime
  6        right share AND (interest OR stake)
           legal share AND (interest OR stake)
           financial involvement AND (interest OR stake)
  7        interest group


Example

Gathered 5,404 examples. Checked the first 70 examples: 67 correct; 95.7% accuracy.

1. I appreciate the genuine interest#1 which motivated you to write your message.
2. The webmaster of this site warrants neither accuracy, nor interest#2.
3. He forgives us not only for our interest#3, but for his own.
4. Interest#4 coverage, including rents, was 3.6x.
5. As an interest#5, she enjoyed gardening and taking part into church activities.
6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand.
7. The Adam Smith Society is a new interest#7 organized within the APA.

    Experimental Evaluation


Tests on 20 words: 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings)

Manually check the first 10 examples of each sense of a word => 91% accuracy (Mihalcea 1999)

  Word        Polysemy  Examples   Total examples  Examples manually  Correct
              count     in SemCor  acquired        checked            examples
  interest    7         139        5404            70                 67
  report      7         71         4196            70                 63
  company     9         90         6292            80                 77
  school      7         146        2490            59                 54
  produce     7         148        4982            67                 60
  remember    8         166        3573            67                 57
  write       8         285        2914            69                 67
  speak       4         147        4279            40                 39
  small       14        192        10954           107                92
  clearly     4         48         4031            29                 28
  TOTAL       120       2582       80741           1080               978
  (20 words)


Web-based Bootstrapping

Similar to the Yarowsky algorithm; relies on data gathered from the Web.

1. Create a set of seeds (phrases) consisting of:
   Sense tagged examples in SemCor
   Sense tagged examples from WordNet
   Additional sense tagged examples, if available

   Phrase? At least two open class words; words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.)

2. Search the Web using queries formed with the seed expressions found at Step 1
   Add to the generated corpus a maximum of N text passages

Results competitive with manually tagged corpora (Mihalcea 2002)

    The Web as Collective Mind


Two different views of the Web:
  collection of Web pages
  very large group of Web users

Millions of Web users can contribute their knowledge to a data repository.

Open Mind Word Expert (Chklovski and Mihalcea, 2002)

Fast growing rate:
  Started in April 2002
  Currently more than 100,000 examples of noun senses in several languages

OMWE online: http://teach-computers.org

    Open Mind Word Expert: Quantity and Quality


Data: a mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus

Word senses: based on WordNet definitions

Active learning to select the most informative examples for learning:
  Use two classifiers trained on existing annotated data
  Select items where the two classifiers disagree for human annotation

Quality:
  Two tags per item
  One tag per item per contributor

Evaluations:
  Agreement rates of about 65% - comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers
  Replicability: tests on 1,600 examples of interest led to 90%+ replicability
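The active learning step above amounts to a disagreement filter; a minimal sketch, with predict as an assumed classifier interface:

def select_for_annotation(classifier_a, classifier_b, unlabeled):
    """Keep for human annotation only the items on which the two
    classifiers disagree - the presumably most informative ones."""
    return [item for item in unlabeled
            if classifier_a.predict(item) != classifier_b.predict(item)]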

    References


(Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.

(Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.

(Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD.

(Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.

(Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.

(Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.

(Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004.

(Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.

(Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.

(Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001.

(Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.


    Part 6:

    Unsupervised Methods of Word

    Sense Disambiguation

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    What is Unsupervised Learning?


Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources.

These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with those of any other cluster.

Note! If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning:
  Supervised Classification identifies features that trigger a sense tag
  Unsupervised Clustering finds similarity between contexts

    Cluster this Data!

    Facts about my day


  Day  F1: Hot Outside?  F2: Slept Well?  F3: Ate Well?
  1    YES               NO               NO
  2    YES               NO               YES
  3    NO                NO               NO
  4    NO                NO               YES



    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Task Definition


Unsupervised Word Sense Discrimination: a class of methods that cluster words based on similarity of context.

Strong Contextual Hypothesis:
  (Miller and Charles, 1991): Words with similar meanings tend to occur in similar contexts
  (Firth, 1957): "You shall know a word by the company it keeps." - words that keep the same company tend to have similar meanings

Only use the information available in raw text; do not use outside knowledge sources or manual annotations.

No knowledge of existing sense inventories, so clusters are not labeled with senses.

    Task Definition


Resources: large corpora

Scope:
  Typically one targeted word per context to be discriminated
  Equivalently, measure similarity among contexts
  Features may be identified in separate training data, or in the data to be clustered
  Does not assign senses or labels to clusters

Word Sense Discrimination reduces to the problem of finding the targeted words that occur in the most similar contexts and placing them in a cluster.

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Agglomerative Clustering


Create a similarity matrix of instances to be discriminated:
  Results in a symmetric instance-by-instance matrix, where each cell contains the similarity score between a pair of instances
  Typically a first order representation, where similarity is based on the features observed in the pair of instances

Apply an agglomerative clustering algorithm to the matrix:
  To start, each instance is its own cluster
  Form a cluster from the most similar pair of instances
  Repeat until the desired number of clusters is obtained

Advantages: high quality clustering

Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons
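A minimal sketch of the agglomerative loop over a pairwise similarity matrix. Cluster similarity here is averaged over instance pairs (UPGMA); McQuitty's variant, used a few slides below, averages the two cluster-level similarities instead:

from itertools import combinations

def agglomerate(sim, n, k):
    """Cluster n instances, given similarities sim[(i, j)] for i < j,
    merging the most similar pair until k clusters remain."""
    clusters = [(i,) for i in range(n)]

    def cluster_sim(a, b):
        # average of all instance-pair similarities across the two clusters
        pairs = [tuple(sorted((i, j))) for i in a for j in b]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(combinations(clusters, 2),
                   key=lambda ab: cluster_sim(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# The similarity matrix from the S1..S4 example below (0-indexed):
sim = {(0, 1): 3, (0, 2): 4, (0, 3): 2, (1, 2): 2, (1, 3): 0, (2, 3): 1}
print(agglomerate(sim, 4, 2))   # [(3,), (1, 0, 2)]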

    Measuring Similarity


Integer Values:
  Matching Coefficient: |X ∩ Y|
  Jaccard Coefficient: |X ∩ Y| / |X ∪ Y|
  Dice Coefficient: 2|X ∩ Y| / (|X| + |Y|)

Real Values:
  Cosine: (X · Y) / (|X| |Y|)
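These measures, sketched in Python: the set-based coefficients take feature sets, and cosine takes real-valued feature vectors represented as dicts:

import math

def matching(x, y):
    return len(x & y)                       # |X ∩ Y|

def jaccard(x, y):
    return len(x & y) / len(x | y)          # |X ∩ Y| / |X ∪ Y|

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):
    dot = sum(x[f] * y[f] for f in x.keys() & y.keys())
    norm = (math.sqrt(sum(v * v for v in x.values())) *
            math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0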

    Instances to be Clustered


      P-2  P-1   P+1   P+2   fish  check  river  interest
  S1  adv  det   prep  det   Y     N      Y      N
  S2       det   prep  det   N     Y      N      Y
  S3  det  adj   verb  det   Y     N      N      N
  S4  det  noun  noun  noun  N     N      N      N

      S1  S2  S3  S4
  S1      3   4   2
  S2  3       2   0
  S3  4   2       1
  S4  2   0   1

Average Link Clustering (aka McQuitty's Similarity Analysis)


      S1  S2  S3  S4
  S1      3   4   2
  S2  3       2   0
  S3  4   2       1
  S4  2   0   1

Merge the most similar pair, S1 and S3 (similarity 4). The merged cluster's similarity to the others is the average of its members' similarities: sim(S1S3, S2) = (3+2)/2 = 2.5 and sim(S1S3, S4) = (2+1)/2 = 1.5

        S1S3  S2   S4
  S1S3        2.5  1.5
  S2    2.5        0
  S4    1.5   0

Next merge S1S3 and S2 (similarity 2.5): sim(S1S3S2, S4) = (1.5+0)/2 = 0.75

          S1S3S2  S4
  S1S3S2          0.75
  S4      0.75

    Evaluation of Unsupervised Methods


If sense tagged text is available, it can be used for evaluation - but don't use the sense tags for clustering or feature selection!

Assume that sense tags represent true clusters, and compare these to discovered clusters:
  Find the mapping of clusters to senses that attains maximum accuracy

Pseudo words are especially useful, since it is hard to find data that is already discriminated:
  Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  http://www.d.umn.edu/~kulka020/kanaghaName.html

Baseline algorithm: group all instances into one cluster; this will reach accuracy equal to the majority classifier.

    Baseline Performance


If C3 is labeled S3:

          S1  S2  S3  Totals
  C1      0   0   0   0
  C2      0   0   0   0
  C3      80  35  55  170
  Totals  80  35  55  170

  (0+0+55)/170 = .32

If C3 is labeled S1 (columns rearranged):

          S3  S2  S1  Totals
  C1      0   0   0   0
  C2      0   0   0   0
  C3      55  35  80  170
  Totals  55  35  80  170

  (0+0+80)/170 = .47

    Evaluation


Suppose that C1 is labeled S1, C2 as S2, and C3 as S3:

          S1  S2  S3  Totals
  C1      10  30  5   45
  C2      20  0   40  60
  C3      50  5   10  65
  Totals  80  35  55  170

Accuracy = (10 + 0 + 10) / 170 = 12%

The diagonal shows how many members of the cluster actually belong to the sense given on the column. Can the columns be rearranged to improve the overall accuracy? Optimally assign clusters to senses.

    Evaluation


The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%

Find the ordering of the columns in the matrix that maximizes the sum of the diagonal.

This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory.

          S2  S3  S1  Totals
  C1      30  5   10  45
  C2      0   40  20  60
  C3      5   10  50  65
  Totals  35  55  80  170
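For small matrices like these, the optimal assignment can simply be found by brute force over column permutations (for larger instances, a Hungarian-algorithm solver such as scipy.optimize.linear_sum_assignment does the same job):

from itertools import permutations

def best_accuracy(confusion):
    """Maximize the diagonal sum over all cluster-to-sense assignments."""
    n = len(confusion)
    total = sum(map(sum, confusion))
    best = max(sum(confusion[c][assignment[c]] for c in range(n))
               for assignment in permutations(range(n)))
    return best / total

# The confusion matrix from the slide above (rows C1..C3, columns S1..S3):
confusion = [[10, 30, 5],
             [20, 0, 40],
             [50, 5, 10]]
print(round(best_accuracy(confusion), 2))   # 0.71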

    Agglomerative Approach


(Pedersen and Bruce, 1997) explore discrimination with a small number (approx. 30) of features near the target word:
  Morphological form of target word (1)
  Part of speech of the two words to left and right of target word (4)
  Co-occurrences (3): most frequent content words in context
  Unrestricted collocations (19): most frequent words located one position to left or right of target, OR
  Content collocations (19): most frequent content words located one position to left or right of target

Features identified in the instances to be clustered.

Similarity measured by the matching coefficient.

Clustered with McQuitty's Similarity Analysis, Ward's Method, and the EM Algorithm; found that McQuitty's method was the most accurate.

    Experimental Evaluation


Word (instances): majority sense -> discrimination accuracy

Adjectives:
  Chief (1048): 86% majority -> 86%
  Common (1060): 84% majority -> 80%
  Last (3004): 94% majority -> 79%
  Public (715): 68% majority -> 63%

Nouns:
  Bill (1341): 68% majority -> 75%
  Concern (1235): 64% majority -> 68%
  Drug (1127): 57% majority -> 65%
  Interest (2113): 59% majority -> 65%
  Line (1149): 37% majority -> 42%

Verbs:
  Agree (1109): 74% majority -> 69%
  Close (1354): 77% majority -> 72%
  Help (1267): 78% majority -> 70%
  Include (1526): 91% majority -> 77%

    Analysis


Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning.

Evaluation based on assuming that sense tags represent the true clusters is likely a bit harsh. Alternatives:
  Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
  Use the contents of the cluster to generate a descriptive label that could be inspected by a human

First order feature sets may be problematic with smaller amounts of data, since these features must occur exactly in the test instances in order to be matched.

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Latent Semantic Indexing/Analysis


Adapted by (Schütze, 1998) to word sense discrimination:

Represent training data as a word co-occurrence matrix.

Reduce the dimensionality of the co-occurrence matrix via Singular Value Decomposition (SVD); significant dimensions are associated with concepts.

Represent the instances of a target word to be clustered by taking the average of all the vectors associated with all the words in that context (context represented by an averaged vector).

Measure the similarity amongst instances via cosine and record in a similarity matrix, or cluster the vectors directly.
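A minimal numpy sketch of this pipeline: SVD-reduce the co-occurrence matrix, then average the reduced word vectors found in each context (each context is assumed to contain at least one word from the matrix):

import numpy as np

def context_vectors(cooc, words, contexts, k=2):
    """cooc: co-occurrence counts, one row per word in `words`.
    Returns one averaged k-dimensional vector per context."""
    A = np.array(cooc, dtype=float)
    U, D, Vt = np.linalg.svd(A, full_matrices=False)
    reduced = U[:, :k] * D[:k]                 # k-dimensional word vectors
    index = {w: i for i, w in enumerate(words)}
    return [np.mean([reduced[index[w]] for w in ctx if w in index], axis=0)
            for ctx in contexts]

Cosine similarity between the averaged vectors then fills the similarity matrix used for clustering.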

    Co-occurrence matrix


         apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
  pc     2      0      0      1    3     1    0       0         0       0      0
  body   0      3      0      0    0     0    2       0         0       2      1
  disk   1      0      0      2    0     3    0       1         2       0      0
  petri  0      2      1      0    0     0    2       0         1       0      1
  lab    0      0      3      0    2     0    2       0         2       1      3
  sales  0      0      0      2    3     0    0       1         2       0      0
  linux  2      0      0      1    3     2    0       1         1       0      0
  debt   0      0      0      2    3     4    0       2         0       0      0

Singular Value Decomposition

A = UDV^T


    U


  .35   .09  -.20   .52  -.09   .40   .02   .63   .20  -.00  -.02
  .05  -.49   .59   .44   .08  -.09  -.44  -.04  -.60  -.02  -.01
  .35   .13   .39  -.60   .31   .41  -.22   .20  -.39   .00   .03
  .08  -.45   .25  -.02   .17