Advances in WSD



    Advances in Word Sense Disambiguation

    Tutorial at ACL 2005

    June 25, 2005

    Ted Pedersen

    University of Minnesota, Duluth

    http://www.d.umn.edu/~tpederse

    Rada Mihalcea

    University of North Texas

    http://www.cs.unt.edu/~rada


    Goal of the Tutorial

    Introduce the problem of word sense disambiguation

    (WSD), focusing on the range of formulations and

    approaches currently practiced.

    Accessible to anyone with an interest in NLP.

    Persuade you to work on word sense disambiguation

    It's an interesting problem

    Lots of good work already done, still more to do

    There is infrastructure to help you get started

    Persuade you to use word sense disambiguation in your

    text applications.


    Outline of Tutorial

    Introduction (Ted)

    Methodology (Rada)

    Knowledge Intensive Methods (Rada)

    Supervised Approaches (Ted)

    Minimally Supervised Approaches (Rada) / BREAK

    Unsupervised Learning (Ted)

    How to Get Started (Rada)

    Conclusion (Ted)


    Part 1:

    Introduction


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Definitions

    Word sense disambiguation is the problem of selecting a

    sense for a word from a set of predefined possibilities.

    Sense Inventory usually comes from a dictionary or thesaurus.

    Knowledge intensive methods, supervised learning, and

    (sometimes) bootstrapping approaches

    Word sense discrimination is the problem of dividing the

    usages of a word into different meanings, without regard to

    any particular existing sense inventory.

    Unsupervised techniques


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Computers versus Humans

    Polysemy: most words have many possible meanings.

    A computer program has no basis for knowing which one
    is appropriate, even if it is obvious to a human.

    Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases.


    Ambiguity for Humans - Newspaper Headlines!

    DRUNK GETS NINE YEARS IN VIOLIN CASE

    FARMER BILL DIES IN HOUSE

    PROSTITUTES APPEAL TO POPE

    STOLEN PAINTING FOUND BY TREE

    RED TAPE HOLDS UP NEW BRIDGE

    DEER KILL 300,000

    RESIDENTS CAN DROP OFF TREES

    INCLUDE CHILDREN WHEN BAKING COOKIES

    MINERS REFUSE TO WORK AFTER DEATH


    Ambiguity for a Computer

    The fisherman jumped off the bank and into the water.

    The bank down the street was robbed!

    Back in the day, we had an entire bank of computers
    devoted to this problem.

    The bank in that road is entirely too steep and is really
    dangerous.

    The plane took a bank to the left, and then headed off

    towards the mountains.


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Early Days of WSD

    Noted as problem for Machine Translation (Weaver, 1949)

    A word can often only be translated if you know the specific sense

    intended (A bill in English could be a pico or a cuenta in Spanish)

    Bar-Hillel (1960) posed the following:

    Little John was looking for his toy box. Finally, he found it. The

    box was in the pen. John was very happy.

    Is pen a writing instrument or an enclosure where children play?

    He declared it unsolvable and left the field of MT!


    Since then

    1970s - 1980s

    Rule based systems

    Rely on hand crafted knowledge sources

    1990s

    Corpus based approaches

    Dependence on sense tagged text

    (Ide and Véronis, 1998) overview the history from the early days to 1998.

    2000s

    Hybrid Systems

    Minimizing or eliminating use of sense tagged text

    Taking advantage of the Web


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Interdisciplinary Connections

    Practical Applications


    Interdisciplinary Connections

    Cognitive Science & Psychology

    Quillian (1968), Collins and Loftus (1975): spreading activation

    Hirst (1987) developed marker passing model

    Linguistics

    Fodor & Katz (1963): selectional preferences

    Resnik (1993) pursued statistically

    Philosophy of Language

    Wittgenstein (1958): meaning as use

    For a large class of cases-though not for all-in which we employ

    the word "meaning" it can be defined thus: the meaning of a word

    is its use in the language.


    Outline

    Definitions

    Ambiguity for Humans and Computers

    Very Brief Historical Overview

    Theoretical Connections

    Practical Applications


    Practical Applications

    Machine Translation

    Translate bill from English to Spanish

    Is it a pico or a cuenta?

    Is it a bird jaw or an invoice?

    Information Retrieval

    Find all Web Pages about cricket

    The sport or the insect?

    Question Answering

    What is George Miller's position on gun control?

    The psychologist or US congressman?

    Knowledge Acquisition

    Add to KB: Herb Bergson is the mayor of Duluth.

    Minnesota or Georgia?


    References

    (Bar-Hillel, 1960) The Present Status of Automatic Translations of Languages. In Advances in Computers, Volume 1. Alt, F. (editor). Academic Press, New York, NY. pp. 91-163.

    (Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82) pp. 407-428.

    (Fodor and Katz, 1963) The structure of semantic theory. Language (39). pp. 170-210.

    (Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.

    (Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24) pp. 1-40. http://www.cs.vassar.edu/~ide/papers/idevero.ps

    (Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA. pp. 227-270.

    (Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation. University of Pennsylvania.

    (Weaver, 1949) Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors). The MIT Press, Cambridge, Mass. pp. 15-23.

    (Wittgenstein, 1958) Philosophical Investigations, 3rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York.

    Part 2:

    Methodology


    Outline

    General considerations

    All-words disambiguation

    Targeted-words disambiguation

    Word sense discrimination, sense discovery

    Evaluation (granularity, scoring)


    Overview of the Problem

    Many words have several meanings (homonymy / polysemy)

    Ex: chair = furniture or person

    Ex: child = young person or human offspring

    Determine which sense of a word is used in a specific sentence

    Note:

    often, the different senses of a word are closely related

    Ex: title - right of legal ownership
    - document that is evidence of the legal ownership

    sometimes, several senses can be activated in a single context (co-activation)

    Ex: This could bring competition to the trade

    competition: - the act of competing
    - the people who are competing


    Word Senses

    The meaning of a word in a given context

    Word sense representations

    With respect to a dictionary:

    chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"

    chair = the position of professor; "he was awarded an endowed chair in economics"

    With respect to the translation in a second language:

    chair = chaise

    chair = directeur

    With respect to the context where it occurs (discrimination):

    Sit on a chair / Take a seat on this chair

    The chair of the Math Department / The chair of the meeting


    Approaches to Word Sense Disambiguation

    Knowledge-Based Disambiguation

    use of external lexical resources such as dictionaries and thesauri

    discourse properties

    Supervised Disambiguation

    based on a labeled training set

    the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category)

    Unsupervised Disambiguation

    based on unlabeled corpora

    the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)


    All Words Word Sense Disambiguation

    Attempt to disambiguate all open-class words in a text

    He put his suit over the back of the chair

    Knowledge-based approaches

    Use information from dictionaries

    Definitions / Examples for each meaning

    Find similarity between definitions and current context

    Position in a semantic network

    Find that table is closer to chair/furniture than to chair/person

    Use discourse properties

    A word exhibits the same sense in a discourse / in a collocation


    All Words Word Sense Disambiguation

    Minimally supervised approaches

    Learn to disambiguate words using small annotated corpora

    E.g. the SemCor corpus, where all open-class words are

    disambiguated

    200,000 running words

    Most frequent sense


    Targeted Word Sense Disambiguation

    Disambiguate one target word

    Take a seat on this chair

    The chair of the Math Department

    WSD is viewed as a typical classification problem

    use machine learning techniques to train a system

    Training:

    Corpus of occurrences of the target word, each occurrence

    annotated with appropriate sense

    Build feature vectors: a vector of relevant linguistic features that represents the context (ex:

    a window of words around the target word)

    Disambiguation:

    Disambiguate the target word in new unseen text


    Targeted Word Sense Disambiguation

    Take a window of n words around the target word

    Encode information about the words around the target word

    typical features include: words, root forms, POS tags, frequency, etc.

    An electric guitar and bass player stand off to one side, not really part of

    the scene, just as a sort of nod to gringo expectations perhaps.

    Surrounding context (local features)

    [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]

    Frequent co-occurring words (topical features)

    [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

    [0,0,0,1,0,0,0,0,0,0,1,0]

    Other features:

    [followed by "player", contains "show" in the sentence, ...]

    [yes, no, ...]
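    A minimal Python sketch of this kind of feature encoding (the function name, window size, and BNC-style POS tags are illustrative, not from the tutorial's software). With the slide's topical vocabulary, the sentence about the bass player yields exactly the 0/1 vector shown above:

        def extract_features(tokens, pos_tags, target_index, window=2,
                             topical_vocab=("fishing", "big", "sound", "player",
                                            "fly", "rod", "pound", "double",
                                            "runs", "playing", "guitar", "band")):
            # Local features: words and POS tags in a window around the target.
            local = {}
            for offset in range(-window, window + 1):
                i = target_index + offset
                if offset != 0 and 0 <= i < len(tokens):
                    local[f"word_{offset:+d}"] = tokens[i]
                    local[f"pos_{offset:+d}"] = pos_tags[i]
            # Topical features: binary indicators for frequent co-occurring words.
            topical = [1 if w in tokens else 0 for w in topical_vocab]
            return local, topical

        tokens = "an electric guitar and bass player stand off to one side".split()
        pos = ["AT0", "AJ0", "NN1", "CJC", "NN1", "NN1", "VVB", "AVP", "PRP", "CRD", "NN1"]
        print(extract_features(tokens, pos, target_index=4))  # target word: bass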


    Unsupervised Disambiguation

    Disambiguate word senses:

    without supporting tools such as dictionaries and thesauri

    without a labeled training text

    Without such resources, word senses are not labeled

    We cannot say chair/furniture or chair/person

    We can:

    Cluster/group the contexts of an ambiguous word into a number

    of groups

    Discriminate between these groups without actually labeling

    them


    Unsupervised Disambiguation

    Hypothesis: same senses of words will have similar neighboring

    words

    Disambiguation algorithm

    Identify context vectors corresponding to all occurrences of a particular word

    Partition them into regions of high density

    Assign a sense to each such region

    Sit on a chair

    Take a seat on this chair

    The chair of the Math Department

    The chair of the meeting


    Evaluating Word Sense Disambiguation

    Metrics:

    Precision = percentage of words that are tagged correctly, out of the words addressed by the system

    Recall = percentage of words that are tagged correctly, out of all words in the test set

    Example: test set of 100 words; the system attempts 75 words; 50 words are correctly disambiguated

    Precision = 50 / 75 = 0.66

    Recall = 50 / 100 = 0.50
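    The example's arithmetic, double-checked as a few lines of Python:

        attempted, correct, total = 75, 50, 100
        precision = correct / attempted   # 0.666... (the slide truncates to 0.66)
        recall = correct / total          # 0.5
        print(precision, recall)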

    Special tags are possible:

    Unknown

    Proper noun

    Multiple senses

    Compare to a gold standard

    SemCor corpus, Senseval corpus, etc.


    Evaluating Word Sense Disambiguation

    Difficulty in evaluation:

    Nature of the senses to distinguish has a huge impact on results

    Coarse versus fine-grained sense distinctions:

    chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"

    chair = the position of professor; "he was awarded an endowed chair in economics"

    bank = a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"

    bank = a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon"

    Sense maps

    Cluster similar senses

    Allow for both fine-grained and coarse-grained evaluation


    Bounds on Performance

    Upper and Lower Bounds on Performance:

    Measure of how well an algorithm performs relative to the difficulty of the task.

    Upper Bound:

    Human performance

    Around 97%-99% with few and clearly distinct senses

    Inter-judge agreement:

    With words with clear & distinct senses: 95% and up

    With polysemous words with related senses: 65%-70%

    Lower Bound (or baseline):

    The assignment of a random sense / the most frequent sense

    90% is excellent for a word with 2 equiprobable senses

    90% is trivial for a word with 2 senses with probability ratios of 9 to 1


    References

    (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.

    (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

    (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995.

    (Senseval) Senseval evaluation exercises: http://www.senseval.org


    Part 3:

    Knowledge-based Methods for

    Word Sense Disambiguation


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Task Definition

    Knowledge-based WSD = class of WSD methods relying

    (mainly) on knowledge drawn from dictionaries and/or raw

    text

    Resources:

    Yes:

    Machine Readable Dictionaries

    Raw corpora

    No:

    Manually annotated corpora

    Scope:

    All open-class words


    Machine Readable Dictionaries

    In recent years, most dictionaries have been made available in Machine Readable format (MRD):

    Oxford English Dictionary

    Collins

    Longman Dictionary of Contemporary English (LDOCE)

    Thesauruses add synonymy information:

    Roget's Thesaurus

    Semantic networks add more semantic relations:

    WordNet

    EuroWordNet


    MRD: A Resource for Knowledge-based WSD

    For each word in the language vocabulary, an MRD provides:

    A list of meanings

    Definitions (for all word meanings)

    Typical usage examples (for most word meanings)

    WordNet definitions/examples for the noun plant:

    1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"

    2. a living organism lacking the power of locomotion

    3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"

    4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience


    MRD: A Resource for Knowledge-based WSD

    A thesaurus adds:

    An explicit synonymy relation between word meanings

    A semantic network adds:

    Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF),

    antonymy, entailment, etc.

    WordNet synsets for the noun plant

    1. plant, works, industrial plant

    2. plant, flora, plant life

    WordNet related concepts for the meaning plant life

    {plant, flora, plant life}

    hypernym: {organism, being}

    hyponym: {house plant}, {fungus}, ...

    meronym: {plant tissue}, {plant part}

    holonym: {Plantae, kingdom Plantae, plant kingdom}


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Lesk Algorithm

    (Michael Lesk 1986): identify senses of words in context using definition overlap

    Algorithm:

    1. Retrieve from MRD all sense definitions of the words to be

    disambiguated

    2. Determine the definition overlap for all possible sense combinations

    3. Choose senses that lead to highest overlap

    Example: disambiguate PINE CONE

    PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness

    CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

    Pine#1 ∩ Cone#1 = 0
    Pine#2 ∩ Cone#1 = 0
    Pine#1 ∩ Cone#2 = 1
    Pine#2 ∩ Cone#2 = 0
    Pine#1 ∩ Cone#3 = 2
    Pine#2 ∩ Cone#3 = 0


    Lesk Algorithm for More than Two Words?

    I saw a man who is 98 years old and can still walk and tell jokes

    nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)

    43,929,600 sense combinations! How to find the optimal sense combination?

    Simulated annealing (Cowie, Guthrie, Guthrie 1992)

    Define a function E = combination of word senses in a given text.

    Find the combination of senses that leads to highest definition overlap (redundancy):

    1. Start with E = the most frequent sense for each word

    2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E

    3. Stop iterating when there is no change in the configuration of senses (a rough sketch in code follows)
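    A rough Python sketch of this search, assuming an overlap() function that scores the total definition overlap of a sense configuration; the greedy acceptance rule below is a simplification (true simulated annealing also accepts some worse configurations early on):

        import random

        def anneal_senses(words, senses_of, overlap, iterations=1000):
            # senses_of[w] lists w's senses, most frequent first
            config = {w: senses_of[w][0] for w in words}   # start: most frequent senses
            best = overlap(config)
            for _ in range(iterations):
                w = random.choice(words)
                old = config[w]
                config[w] = random.choice(senses_of[w])    # try a different sense
                new = overlap(config)
                if new >= best:
                    best = new           # keep configurations with higher overlap
                else:
                    config[w] = old      # revert
            return config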


    Lesk Algorithm: A Simplified Version

    Original Lesk definition: measure overlap between sense definitions for all words in context

    Identify simultaneously the correct senses for all words in context

    Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap

    between sense definitions of a word and current context

    Identify the correct sense for one word at a time

    Search space significantly reduced


    Lesk Algorithm: A Simplified Version

    Example: disambiguate PINE in

    Pine cones hanging in a tree

    PINE

    1. kinds of evergreen tree with needle-shaped leaves

    2. waste away through sorrow or illness

    Pine#1 ∩ Sentence = 1

    Pine#2 ∩ Sentence = 0

    Algorithm for simplified Lesk (a runnable sketch follows):

    1. Retrieve from MRD all sense definitions of the word to be disambiguated

    2. Determine the overlap between each sense definition and the current context

    3. Choose the sense that leads to highest overlap
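    A minimal sketch of simplified Lesk in Python; the two-sense inventory and stopword list are toy stand-ins for a real MRD lookup:

        def simplified_lesk(context, sense_definitions,
                            stopwords=frozenset({"a", "of", "in", "the", "with"})):
            # Pick the sense whose definition shares the most words with the context.
            context_words = {w.lower() for w in context.split()} - stopwords
            best_sense, best_overlap = None, -1
            for sense, definition in sense_definitions.items():
                def_words = {w.lower() for w in definition.split()} - stopwords
                overlap = len(context_words & def_words)
                if overlap > best_overlap:
                    best_sense, best_overlap = sense, overlap
            return best_sense, best_overlap

        senses = {
            "pine#1": "kinds of evergreen tree with needle-shaped leaves",
            "pine#2": "waste away through sorrow or illness",
        }
        print(simplified_lesk("Pine cones hanging in a tree", senses))
        # ('pine#1', 1): only "tree" overlaps, as on the slide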


    Evaluations of Lesk Algorithm

    Initial evaluation by M. Lesk: 50-70% on short samples of text, manually annotated with respect to the Oxford Advanced Learner's Dictionary

    Simulated annealing

    47% on 50 manually annotated sentences

    Evaluation on Senseval-2 all-words data, with back-off to

    random sense (Mihalcea & Tarau 2004)

    Original Lesk: 35%

    Simplified Lesk: 47%

    Evaluation on Senseval-2 all-words data, with back-off to

    most frequent sense (Vasilescu, Langlais, Lapalme 2004)

    Original Lesk: 42%

    Simplified Lesk: 58%


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Preferences

    Measures of Semantic Similarity

    Heuristic-based Methods


    Selectional Preferences

    A way to constrain the possible meanings of words in a given context

    E.g. Wash a dish vs. Cook a dish

    WASH-OBJECT vs. COOK-FOOD

    Capture information about possible relations between semantic classes

    Common sense knowledge

    Alternative terminology:

    Selectional Restrictions

    Selectional Preferences

    Selectional Constraints


    Acquiring Selectional Preferences

    From annotated corpora

    Circular relationship with the WSD problem

    Need WSD to build the annotated corpus

    Need selectional preferences to derive WSD

    From raw corpora

    Frequency counts

    Information theory measures

    Class-to-class relations


    Preliminaries: Learning Word-to-Word Relations

    An indication of the semantic fit between two words

    1. Frequency counts

    Pairs of words connected by a syntactic relation

    2. Conditional probabilities

    Condition on one of the words

    Count(W1, W2, R)

    P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
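    A small Python sketch of these estimates over (word1, word2, relation) triples; the triples below are invented for illustration:

        from collections import Counter

        triples = [("drink", "tea", "verb-obj"), ("drink", "tea", "verb-obj"),
                   ("drink", "beer", "verb-obj"), ("see", "tea", "verb-obj")]
        pair_counts = Counter(triples)                        # Count(W1, W2, R)
        w2_counts = Counter((w2, r) for _, w2, r in triples)  # Count(W2, R)

        def p_w1_given_w2(w1, w2, r):
            return pair_counts[(w1, w2, r)] / w2_counts[(w2, r)]

        print(p_w1_given_w2("drink", "tea", "verb-obj"))  # 2/3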


    Learning Selectional Preferences (1)

    Word-to-class relations (Resnik 1993)

    Quantify the contribution of a semantic class using all the concepts

    subsumed by that class

    A(W1, C2, R) = P(C2 | W1, R) log [ P(C2 | W1, R) / P(C2) ] / Σ_C P(C | W1, R) log [ P(C | W1, R) / P(C) ]

    where

    P(C2 | W1, R) = Count(W1, C2, R) / Count(W1, R)

    Count(W1, C2, R) = Σ_{W2 ∈ C2} Count(W1, W2, R) / |classes(W2)|


    Learning Selectional Preferences (2)

    Determine the contribution of a word sense based on the assumption of equal sense distributions:

    e.g. plant has two senses: 50% of occurrences are sense 1, 50% are sense 2

    Example: learning restrictions for the verb to drink

    Find high-scoring verb-object pairs

    Find prototypical object classes (high association score)

    Co-occ score   Verb    Object
    11.75          drink   tea
    11.75          drink   Pepsi
    11.75          drink   champagne
    10.53          drink   liquid
    10.2           drink   beer
     9.34          drink   wine

    A(v,c)   Object class
    3.58     (beverage, [drink, ...])
    2.05     (alcoholic_beverage, [intoxicant, ...])


    Learning Selectional Preferences (3)

    Other algorithms

    Learn class-to-class relations (Agirre and Martinez, 2002)

    E.g.: ingest-food is a class-to-class relation for eat chicken

    Bayesian networks (Ciaramita and Johnson, 2000)

    Tree cut model (Li and Abe, 1998)


    Using Selectional Preferences for WSD

    Algorithm:

    1. Learn a large set of selectional preferences for a given syntactic relation R

    2. Given a pair of words W1-W2 connected by a relation R

    3. Find all selectional preferences W1-C2 (word-to-class) or C1-C2 (class-to-class) that apply

    4. Select the meanings of W1 and W2 based on the selected semantic class

    Example: disambiguate coffee in drink coffee

    1. (beverage) a beverage consisting of an infusion of ground coffee beans

    2. (tree) any of several small trees native to the tropical Old World

    3. (color) a medium to dark brown color

    Given the selectional preference DRINK BEVERAGE: coffee#1


    Evaluation of Selectional Preferences for WSD

    Data set

    mainly on verb-object, subject-verb relations extracted from

    SemCor

    Compare against random baseline

    Results (Agirre and Martinez, 2000): average results on 8 nouns

    Similar figures reported in (Resnik 1997)

                    Object              Subject
                    Precision  Recall   Precision  Recall
    Random          19.2       19.2     19.2       19.2
    Word-to-word    95.9       24.9     74.2       18.0
    Word-to-class   66.9       58.0     56.2       46.8
    Class-to-class  66.6       64.8     54.0       53.7


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Semantic Similarity

    Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976)

    Use this property for WSD: identify related meanings for words that share a common context

    Context span:

    1. Local context: semantic similarity between pairs of words

    2. Global context: lexical chains


    Semantic Similarity in a Local Context

    Similarity determined between pairs of concepts, or

    between a word and its surrounding context

    Relies on similarity metrics on semantic networks

    (Rada et al. 1989)

    [Figure: fragment of a semantic network for carnivores, with carnivore at the top, subclasses such as fissiped mammal (fissiped), canine (canid), feline (felid), and bear, and below them dog, wolf, wild dog, hunting dog, hyena, hyena dog, dingo, dachshund, and terrier.]


    Semantic Similarity Metrics (1)

    Input: two concepts (same part of speech)

    Output: similarity measure (Leacock and Chodorow 1998)

    E.g. Similarity(wolf, dog) = 0.60

    Similarity(wolf, bear) = 0.42

    (Resnik 1995)

    Define information content, P(C) = probability of seeing a concept of type

    C in a large corpus

    Probability of seeing a concept = probability of seeing instances of that concept

    Determine the contribution of a word sense based on the assumption of

    equal sense distributions:

    e.g. plant has two senses: 50% of occurrences are sense 1, 50% are sense 2

    Similarity(C1, C2) = -log [ Path(C1, C2) / 2D ], where D is the taxonomy depth

    IC(C) = -log P(C)


    Semantic Similarity Metrics (2)

    Similarity using information content (Resnik 1995): define similarity between two concepts (LCS = Least Common Subsumer)

    Alternatives (Jiang and Conrath 1997)

    Other metrics:

    Similarity using information content (Lin 1998)

    Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)

    Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995)

    Adapted Lesk algorithm (Banerjee and Pedersen 2002)

    Similarity(C1, C2) = IC( LCS(C1, C2) )   (Resnik 1995)

    Similarity(C1, C2) = 2 × IC( LCS(C1, C2) ) / ( IC(C1) + IC(C2) )   (Lin 1998)
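    The path-based and information-content measures above, transcribed into Python; path lengths, concept probabilities, and IC values are assumed to be precomputed from a taxonomy such as WordNet:

        import math

        def information_content(p_concept):
            return -math.log(p_concept)            # IC(C) = -log P(C)

        def lch_similarity(path_len, depth):
            # Leacock & Chodorow: -log( Path(C1, C2) / (2 * D) )
            return -math.log(path_len / (2.0 * depth))

        def resnik_similarity(ic_lcs):
            return ic_lcs                          # IC of the Least Common Subsumer

        def lin_similarity(ic_lcs, ic_c1, ic_c2):
            return 2.0 * ic_lcs / (ic_c1 + ic_c2)  # 2*IC(LCS) / (IC(C1)+IC(C2))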


    Semantic Similarity Metrics for WSD

    Disambiguate target words based on similarity with one word to the left and one word to the right (Patwardhan, Banerjee, Pedersen 2003)

    Evaluation:

    1,723 ambiguous nouns from Senseval-2

    Among 5 similarity metrics, (Jiang and Conrath 1997) provide the

    best precision (39%)

    Example: disambiguate PLANT in "plant with flowers"

    PLANT
    1. plant, works, industrial plant
    2. plant, flora, plant life

    Similarity(plant#1, flower) = 0.2
    Similarity(plant#2, flower) = 1.5, so choose plant#2


    Semantic Similarity in a Global Context

    Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)

    A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse

    Algorithm for finding lexical chains:

    1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.

    2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word meaning.

    3. If such a chain is found, insert the word in this chain; otherwise, create a new chain. (A greedy sketch of this loop follows.)
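    A greedy Python sketch of the three steps; relatedness() stands in for any of the similarity metrics above, and the threshold is an assumed parameter:

        def build_chains(words_with_senses, relatedness, threshold=0.5):
            # words_with_senses: (word, [sense, ...]) pairs in text order,
            # senses listed most frequent first.
            chains = []  # each chain is a list of (word, sense) pairs
            for word, senses in words_with_senses:
                best = None  # (score, chain, sense)
                for sense in senses:
                    for chain in chains:
                        # relatedness of this sense to everything already in the chain
                        score = min(relatedness(sense, s) for _, s in chain)
                        if score >= threshold and (best is None or score > best[0]):
                            best = (score, chain, sense)
                if best is not None:
                    _, chain, sense = best
                    chain.append((word, sense))         # insert into the best chain
                else:
                    chains.append([(word, senses[0])])  # otherwise start a new chain
            return chains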


    Semantic Similarity of a Global Context

    A very long train traveling along the rails with a constant velocity v in a certain direction

    train
    #1: public transport
    #2: ordered set of things
    #3: piece of cloth

    travel
    #1: change location
    #2: undergo transportation

    rail
    #1: a barrier
    #2: a bar of steel for trains
    #3: a small bird


    Lexical Chains for WSD

    Identify lexical chains in a text

    Usually target one part of speech at a time

    Identify the meaning of words based on their membership in a lexical chain

    Evaluation:

    (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%

    (Mihalcea and Moldovan 2000): on five SemCor texts, 90% precision with 60% recall

    lexical chains anchored on monosemous words

    (Okumura and Honda 1994) lexical chains on five Japanese texts

    give 63.4%


    Outline

    Task definition

    Machine Readable Dictionaries

    Algorithms based on Machine Readable Dictionaries

    Selectional Restrictions

    Measures of Semantic Similarity

    Heuristic-based Methods


    Most Frequent Sense (1)

    Identify the most often used meaning and use this meaning by default

    Example: plant/flora is used more often than plant/factory
    - annotate any instance of PLANT as plant/flora

    Word meanings exhibit a Zipfian distribution

    E.g. distribution of word senses in SemCor

    [Figure: relative frequency of sense numbers 1-10 in SemCor, plotted separately for nouns, verbs, adjectives, and adverbs; the first sense dominates and frequency falls off sharply for higher sense numbers.]


    Most Frequent Sense (2)

    Method 1: Find the most frequent sense in an annotated corpus

    Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)

    1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, ..., nk}, with associated similarity scores {dss(w,n1), dss(w,n2), ..., dss(w,nk)}

    2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score

    3. Rank senses wsi of w based on the total similarity score

    Score(wsi) = Σ_{nj ∈ Nw} dss(w, nj) × wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj)

    where wnss(wsi, nj) = max_{nsx ∈ senses(nj)} wnss(wsi, nsx)


    Most Frequent Sense (3)

    Word senses:

    pipe #1 = tobacco pipe
    pipe #2 = tube of metal or plastic

    Distributionally similar words:

    N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, ...}

    For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity)

    pipe#1 - tube (#3) = 0.3
    pipe#2 - tube (#1) = 0.6

    Compute score for each sense pipe#i:

    score(pipe#1) = 0.25
    score(pipe#2) = 0.73

    Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
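    A Python sketch of the ranking step, plugged with the pipe numbers above; with only a single neighbor the normalized scores come out as 0.30 and 0.60 rather than the slide's 0.25 and 0.73, which sum contributions over many neighbors:

        def predominant_sense_scores(dss, wnss):
            # dss: {neighbor: distributional similarity with the target word}
            # wnss: {(sense, neighbor): best WordNet similarity over the neighbor's senses}
            senses = {s for s, _ in wnss}
            scores = {}
            for s in senses:
                total = 0.0
                for n, d in dss.items():
                    norm = sum(wnss.get((s2, n), 0.0) for s2 in senses)
                    if norm > 0:
                        total += d * wnss.get((s, n), 0.0) / norm
                scores[s] = total
            return scores

        dss = {"tube": 0.9}  # illustrative value
        wnss = {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6}
        print(predominant_sense_scores(dss, wnss))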


    One Sense Per Discourse

    A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)

    What does this mean?

    E.g. the ambiguous word PLANT occurs 10 times in a discourse:
    all instances of plant carry the same meaning

    Evaluation:

    8 words with two-way ambiguity, e.g. plant, crane, etc.

    98% of these two-way ambiguous word occurrences in the same discourse carry the same meaning

    The grain of salt: performance depends on granularity

    (Krovetz 1998) experiments with words with more than two senses

    Performance of one sense per discourse measured on SemCor is approx. 70%


    One Sense per Collocation

    A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)

    Strong for adjacent collocations

    Weaker as the distance between words increases

    An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation industrial plant, regardless of the context where this collocation occurs

    Evaluation:

    97% precision on words with two-way ambiguity

    Finer granularity:

    (Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses

    70% precision on SemCor words



    References

    (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.

    (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.

    (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.

    (Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J.A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.

    (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992.

    (Halliday and Hasan 1976) Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman.

    (Galley and McKeown 2003) Galley, M. and McKeown, K. (2003). Improving word sense disambiguation in lexical chaining. IJCAI 2003.

    (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press.

    (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.

    (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.

    (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.

    (Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.



    (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.

    (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.

    (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995.

    (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.

    (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.

    (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004.

    (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003.

    (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989.

    (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993.

    (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.

    (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.

    (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.


    Part 4:

    Supervised Methods of Word

    Sense Disambiguation


    Outline

    What is Supervised Learning?

    Task Definition

    Single Classifiers

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    What is Supervised Learning?

    Collect a set of examples that illustrate the various possible classifications or outcomes of an event.

    Identify patterns in the examples associated with each

    particular class of the event.

    Generalize those patterns into rules.

    Apply the rules to classify a new event.



    Learn from these examples:
    when do I go to the store?

    Day  CLASS: Go to Store?  F1: Hot Outside?  F2: Slept Well?  F3: Ate Well?
    1    YES                  YES               NO               NO
    2    NO                   YES               NO               YES
    3    YES                  NO                NO               NO
    4    NO                   NO                NO               YES



    Outline

    What is Supervised Learning?

    Task Definition

    Single Classifiers

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Task Definition

    Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.

    Resources

    Sense Tagged Text

    Dictionary (implicit source of sense inventory)

    Syntactic Analysis (POS tagger, Chunker, Parser, ...)

    Scope

    Typically one target word per context

    Part of speech of target word resolved

    Lends itself to targeted word formulation

    Reduces WSD to a classification problem where a target word is

    assigned the most appropriate sense from a given set of possibilities

    based on the context in which it occurs


    Sense Tagged Text

    Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.

    My bank/1 charges too much for an overdraft.

    I went to the bank/1 to deposit my check and get a new ATM card.

    The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.

    My grandfather planted his pole in the bank/2 and got a great big catfish!

    The bank/2 is pretty muddy, I can't walk there.


    Two Bags of Words

    (Co-occurrences in the window of context)

    FINANCIAL_BANK_BAG:

    a an and are ATM Bonnie card charges check Clyde

    criminals deposit famous for get I much My new overdraft

    really robbers the they think to too two went were

    RIVER_BANK_BAG:

    a an and big campus cant catfish East got grandfather great

    has his I in is Minnesota Mississippi muddy My of on planted

    pole pretty right River The the there University walk West


    Simple Supervised Approach

    Given a sentence S containing "bank":

    For each word Wi in S
        If Wi is in FINANCIAL_BANK_BAG then
            Sense_1 = Sense_1 + 1;
        If Wi is in RIVER_BANK_BAG then
            Sense_2 = Sense_2 + 1;

    If Sense_1 > Sense_2 then print "Financial"
    else if Sense_2 > Sense_1 then print "River"
    else print "Can't Decide";
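    The same pseudocode as runnable Python (a sketch; the bags are abbreviated from the slide's full word lists):

        FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "criminals",
                              "deposit", "overdraft", "robbers"}
        RIVER_BANK_BAG = {"campus", "catfish", "muddy", "mississippi",
                          "planted", "pole", "river"}

        def disambiguate_bank(sentence):
            sense_1 = sense_2 = 0
            for word in sentence.lower().split():
                if word in FINANCIAL_BANK_BAG:
                    sense_1 += 1
                if word in RIVER_BANK_BAG:
                    sense_2 += 1
            if sense_1 > sense_2:
                return "Financial"
            if sense_2 > sense_1:
                return "River"
            return "Can't Decide"

        print(disambiguate_bank("I went to the bank to deposit my check"))  # Financial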


    Supervised Methodology

    Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.

    One tagged word per instance / lexical sample disambiguation

    Select a set of features with which to represent context.

    co-occurrences, collocations, POS tags, verb-obj relations, etc.

    Convert sense-tagged training instances to feature vectors.

    Apply a machine learning algorithm to induce a classifier.

    Form: structure or relation among features

    Parameters: strength of feature interactions

    Convert a held out sample of test data into feature vectors.

    correct sense tags are known but not used

    Apply classifier to test instances to assign a sense tag.


    From Text to Feature Vectors

    My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)

    The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)

        P-2  P-1  P+1   P+2  fish  check  river  interest  SENSE TAG
    S1  adv  det  prep  det  Y     N      Y      N         SHORE
    S2  -    det  verb  det  N     Y      N      Y         FINANCE


    Supervised Learning Algorithms

    Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:

    Support Vector Machines

    Nearest Neighbor Classifiers

    Decision Trees

    Decision Lists

    Naïve Bayesian Classifiers

    Perceptrons

    Neural Networks

    Graphical Models

    Log Linear Models


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifier

    Decision Lists and Trees

    Ensembles of Classifiers


    Naïve Bayesian Classifier

    The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)

    Word Sense Disambiguation is no exception

    Assumes conditional independence among features, given the sense of a word.

    The form of the model is assumed, but parameters are estimated from training instances

    When applied to WSD, features are often a bag of words that come from the training data

    Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)


    Bayesian Inference

    Given observed features, what is the most likely sense?

    Estimate probability of observed features given sense

    Estimate unconditional probability of sense

    Unconditional probability of features is a normalizing term, doesn't affect sense classification

    p(S | F1, F2, F3, ..., Fn) = p(F1, F2, F3, ..., Fn | S) * p(S) / p(F1, F2, F3, ..., Fn)


    Naïve Bayesian Model

    [Figure: graphical model with the sense S as parent node and features F1, F2, F3, F4, ..., Fn as conditionally independent children.]

    P(F1, F2, ..., Fn | S) = p(F1|S) * p(F2|S) * ... * p(Fn|S)


    The Naïve Bayesian Classifier

    Given 2,000 instances of "bank": 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    P(S=1) = 1,500/2,000 = .75
    P(S=2) = 500/2,000 = .25

    Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:

    P(F1=credit) = 204/2,000 = .102
    P(F1=credit|S=1) = 200/1,500 = .133
    P(F1=credit|S=2) = 4/500 = .008

    Given a test instance that has one feature "credit":

    P(S=1|F1=credit) = .133*.75/.102 = .978
    P(S=2|F1=credit) = .008*.25/.102 = .020

    sense = argmax_{S} p(F1|S) * ... * p(Fn|S) * p(S)
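    The bank example checked in Python; exact arithmetic gives .980 where the slide, rounding intermediate values, reports .978:

        p_s = {1: 1500 / 2000, 2: 500 / 2000}            # priors
        p_credit_given_s = {1: 200 / 1500, 2: 4 / 500}   # feature likelihoods
        p_credit = 204 / 2000                            # normalizing term

        for s in (1, 2):
            posterior = p_credit_given_s[s] * p_s[s] / p_credit
            print(f"P(S={s} | F1=credit) = {posterior:.3f}")
        # P(S=1 | F1=credit) = 0.980
        # P(S=2 | F1=credit) = 0.020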


    Comparative Results

    (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line

    (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line

    (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words

    All found that the Naïve Bayesian Classifier performed as well as any of the other methods!


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Decision Lists and Trees

    Very widely used in Machine Learning.

    Decision trees used very early for WSD research (e.g.,

    Kelly and Stone, 1975; Black, 1988).

    Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.

    List decides between two senses after one positive answer

    Tree allows for decision among multiple senses after a series of answers

    Uses a smaller, more refined set of features than bag of words and Naïve Bayes.

    More descriptive and easier to interpret.


    Decision List for WSD (Yarowsky, 1994)

    Identify collocational features from sense tagged data.

    Word immediately to the left or right of target:

    I have my bank/1 statement.

    The river bank/2 is muddy.

    Pair of words to immediate left or right of target:

    The world's richest bank/1 is here in New York.

    The river bank/2 is muddy.

    Words found within k positions to left or right of target, where k is often 10-50:

    My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.


    Building the Decision List

    Sort order of collocation tests using log of conditional

    probabilities.

    Words most indicative of one sense (and not the other)

    will be ranked highly.

    Abs( log( P(S=1 | Fi = Collocation_i) / P(S=2 | Fi = Collocation_i) ) )


    Computing DL score

    Given 2,000 instances of "bank": 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    P(S=1) = 1,500/2,000 = .75

    P(S=2) = 500/2,000 = .25

    Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:

    P(F1=credit) = 204/2,000 = .102

    P(F1=credit|S=1) = 200/1,500 = .133

    P(F1=credit|S=2) = 4/500 = .008

    From Bayes Rule:

    P(S=1|F1=credit) = .133*.75/.102 = .978

    P(S=2|F1=credit) = .008*.25/.102 = .020

    DL Score = abs (log (.978/.020)) = 3.89
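    The same score in a couple of lines of Python (natural log, which reproduces the 3.89):

        import math

        p1, p2 = 0.978, 0.020        # posteriors from Bayes Rule above
        dl_score = abs(math.log(p1 / p2))
        print(round(dl_score, 2))    # 3.89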


    Using the Decision List

    Sort by DL-score, then go through the test instance looking for a matching feature. The first match reveals the sense.

    DL-score  Feature             Sense
    3.89      credit within bank  Bank/1 financial
    2.20      bank is muddy       Bank/2 river
    1.09      pole within bank    Bank/2 river
    0.00      of the bank         N/A


    Using the Decision List

    [Figure: decision flow. CREDIT? yes -> BANK/1 FINANCIAL; no -> IS MUDDY? yes -> BANK/2 RIVER; no -> POLE? yes -> BANK/2 RIVER.]


    Learning a Decision Tree

    Identify the feature that most cleanly divides the training data into the known senses.

    "Cleanly" is measured by information gain or gain ratio.

    Create subsets of training data according to feature values.

    Find another feature that most cleanly divides a subset of the training data.

    Continue until each subset of training data is pure or as clean as possible.

    Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)

    In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)


    Supervised WSD with Individual Classifiers

    Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation; most work reasonably well.

    (Witten and Frank, 2000) is a great intro to supervised learning.

    Features tend to differentiate among methods more than the learning algorithms.

    Good sets of features tend to include:

    Co-occurrences or keywords (global)

    Collocations (local)

    Bigrams (local and global)

    Part of speech (local)

    Predicate-argument relations

    Verb-object, subject-verb, etc.

    Heads of Noun and Verb Phrases


    Convergence of Results

    Accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another.

    Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.

    Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.

    Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task.

    What to do next?


    Outline

    What is Supervised Learning?

    Task Definition

    Naïve Bayesian Classifiers

    Decision Lists and Trees

    Ensembles of Classifiers


    Ensembles of Classifiers

    Classifier error has two components (Bias and Variance)

    Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias / High Variance

    Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: High Bias / Low Variance

    Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy

    "Bagging" a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)

    Sample with replacement from the training data to learn multiple

    decision trees.

    Outliers in training data will tend to be obscured/eliminated.


    Ensemble Considerations

    Must choose different learning algorithms with significantly different bias/variance characteristics.

    Naïve Bayesian Classifier versus Decision Tree

    Must choose feature representations that yield significantly different (independent?) views of the training data.

    Lexical versus syntactic features

    Must choose how to combine classifiers (a simple voting sketch follows this list).

    Simple Majority Voting

    Averaging of probabilities across multiple classifier outputs

    Maximum Entropy combination (e.g., Klein et al., 2002)
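    A minimal sketch of the simplest combination rule, majority voting (the labels are illustrative):

        from collections import Counter

        def majority_vote(predictions):
            # predictions: one sense label per member classifier
            return Counter(predictions).most_common(1)[0][0]

        print(majority_vote(["bank/1", "bank/1", "bank/2"]))  # bank/1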


    Ensemble Results

    (Pedersen, 2000) achieved state of the art for interest and line data using an ensemble of Naïve Bayesian Classifiers.

    Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.

    Classifiers combined by a weighted vote.

    (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.

    Rich set of collocational and syntactic features.

    Combined via linear combination of the top three classifiers.

    Many Senseval-2 and Senseval-3 systems employed ensemble methods.


    References

    (Black, 1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pp. 185-194.

    (Breiman, 1996) The heuristics of instability in model selection. Annals of Statistics (24) pp. 2350-2383.

    (Domingos and Pazzani, 1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29) pp. 103-130.

    (Domingos, 2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. pp. 564-569.

    (Florian and Yarowsky, 2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP. pp. 25-32.

    (Kelly and Stone, 1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.

    (Klein et al., 2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. Proceedings of Senseval-2. pp. 87-89.

    (Leacock et al., 1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pp. 260-265.

    (Mooney, 1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP. pp. 82-91.


    References

    (Pedersen, 1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University.

    (Pedersen, 2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.

    (Quinlan, 1986) Induction of Decision Trees. Machine Learning (1). pp. 81-106.

    (Quinlan, 1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.

    (Witten and Frank, 2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

    (Yarowsky, 1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL. pp. 88-95.

    (Yarowsky, 2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.


    Part 5:

    Minimally Supervised Methods

    for Word Sense Disambiguation


    Outline

    Task definition

    What does "minimally supervised" mean?

    Bootstrapping algorithms

    Co-training

    Self-training

    Yarowsky algorithm

    Using the Web for Word Sense Disambiguation

    Web as a corpus

    Web as collective mind


    Task Definition

Supervised WSD = learning sense classifiers starting with annotated data

Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision

Examples:
  Automatically bootstrap a corpus starting with a few human annotated examples
  Use monosemous relatives / dictionary definitions to automatically construct sense tagged data
  Rely on Web-users + active learning for corpus annotation


    Outline

Task definition: What does "minimally supervised" mean?

Bootstrapping algorithms:
  Co-training
  Self-training
  Yarowsky algorithm

Using the Web for Word Sense Disambiguation:
  Web as a corpus
  Web as collective mind


    Bootstrapping WSD Classifiers

Build sense classifiers with little training data

Expand applicability of supervised WSD

Bootstrapping approaches:
  Co-training
  Self-training
  Yarowsky algorithm


    Bootstrapping Recipe

Ingredients:
  (Some) labeled data
  (Large amounts of) unlabeled data
  (One or more) basic classifiers

Output:
  Classifier that improves over the basic classifiers


Labeled seed examples: "plants#1 and animals", "industry plant#2"

Unlabeled examples:
  building the only atomic plant
  plant growth is retarded
  a herb or flowering plant
  a nuclear power plant
  building a new vehicle plant
  the animal and plant life
  the passion-fruit plant

Classifier 1 and Classifier 2 label new examples:
  plant#1 growth is retarded
  a nuclear power plant#2


Co-training / Self-training

Given:
  A set L of labeled training examples
  A set U of unlabeled examples
  Classifiers Ci

1. Create a pool of examples U'
   choose P random examples from U
2. Loop for I iterations
   Train Ci on L and label U'
   Select G most confident examples and add to L
   (maintain distribution in L)
   Refill U' with examples from U
   (keep U' at constant size P)
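A minimal Python sketch of the loop above, for the self-training case with a single classifier. The fit, predict, and confidence methods are an assumed interface for illustration; the step that maintains the sense distribution in L is omitted for brevity:

import random

def self_train(classifier, L, U, P, G, I):
    """Self-training sketch: L = labeled set, U = unlabeled set,
    P = pool size, G = growth size, I = iterations."""
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(P, len(U)))]   # 1. create pool U'
    for _ in range(I):                                # 2. loop for I iterations
        classifier.fit(L)                             # train Ci on L
        pool.sort(key=classifier.confidence, reverse=True)
        for example in pool[:G]:                      # G most confident examples
            L.append((example, classifier.predict(example)))
        del pool[:G]
        while U and len(pool) < P:                    # refill U' from U
            pool.append(U.pop())
    return classifier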


Co-training

(Blum and Mitchell 1998)
  Two classifiers
  Independent views [independence condition can be relaxed]

Co-training in Natural Language Learning:
  Statistical parsing (Sarkar 2001)
  Co-reference resolution (Ng and Cardie 2003)
  Part of speech tagging (Clark, Curran and Osborne 2003)
  ...


    Self-training

(Nigam and Ghani 2000)
  One single classifier
  Retrain on its own output

Self-training for Natural Language Learning:
  Part of speech tagging (Clark, Curran and Osborne 2003)
  Co-reference resolution (Ng and Cardie 2003) - several classifiers through bagging


    Parameter Setting for Co-training/Self-training

1. Create a pool of examples U'
   choose P random examples from U          [Pool size P]
2. Loop for I iterations                    [Iterations I]
   Train Ci on L and label U'
   Select G most confident examples and add to L   [Growth size G]
   (maintain distribution in L)
   Refill U' with examples from U
   (keep U' at constant size P)

A major drawback of bootstrapping: no principled method for selecting optimal values for these parameters (Ng and Cardie 2003)

Experiments with Co-training / Self-training for WSD

Training / Test data: Senseval-2 nouns (29 ambiguous nouns)
  Average corpus size: 95 training examples, 48 test examples

Raw data: British National Corpus
  Average corpus size: 7,085 examples

Co-training: two classifiers (local and topical classifiers)

Self-training: one classifier (global classifier)

(Mihalcea 2004)


    Parameter Settings

Parameter ranges:
  P = {1, 100, 500, 1000, 1500, 2000, 5000}
  G = {1, 10, 20, 30, 40, 50, 100, 150, 200}
  I = {1, ..., 40}
  29 nouns -> 120,000 runs

Upper bound in co-training/self-training performance, optimised on the test set:
  Basic classifier: 53.84%
  Optimal self-training: 65.61%
  Optimal co-training: 65.75%
  ~25% error reduction

Per-word parameter setting:
  Co-training = 51.73%
  Self-training = 52.88%

Global parameter setting:
  Co-training = 55.67%
  Self-training = 54.16%

Example: lady
  basic = 61.53%
  self-training = 84.61% [20/100/39]
  co-training = 82.05% [1/1000/3]


    Yarowsky Algorithm

(Yarowsky 1995)

Similar to co-training; differs in the basic assumption (Abney 2002): view independence (co-training) vs. precision independence (Yarowsky algorithm)

Relies on two heuristics and a decision list:

One sense per collocation:
  Nearby words provide strong and consistent clues as to the sense of a target word

One sense per discourse:
  The sense of a target word is highly consistent within a single document


    Learning Algorithm

A decision list is used to classify instances of the target word:

  "... the loss of animal and plant species through extinction ..."

Classification is based on the highest ranking rule that matches the target context:

  LogL   Collocation                   Sense
  9.31   flower (within +/- k words)   A (living)
  9.24   job (within +/- k words)      B (factory)
  9.03   fruit (within +/- k words)    A (living)
  9.02   plant species                 A (living)
  ...    ...                           ...
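A minimal sketch of decision list learning and classification in this spirit. The score compares the count of a feature's majority sense against the remaining senses; the smoothing constant is an assumption (Yarowsky's original work uses more careful smoothing):

import math

def train_decision_list(examples, smoothing=0.1):
    """examples: list of (features, sense) pairs.
    Returns rules sorted by log-likelihood, highest first."""
    counts = {}                       # feature -> {sense: count}
    for features, sense in examples:
        for f in features:
            counts.setdefault(f, {}).setdefault(sense, 0)
            counts[f][sense] += 1
    rules = []
    for f, by_sense in counts.items():
        best = max(by_sense, key=by_sense.get)
        rest = sum(n for s, n in by_sense.items() if s != best)
        logl = math.log((by_sense[best] + smoothing) / (rest + smoothing))
        rules.append((logl, f, best))
    return sorted(rules, reverse=True)

def classify(rules, features, default=None):
    """Apply the highest ranking rule that matches the context."""
    for logl, f, sense in rules:
        if f in features:
            return sense
    return default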


    Bootstrapping Algorithm

All occurrences of the target word are identified.

A small training set of seed data is tagged with word sense (Sense-A: life, Sense-B: factory).


    Bootstrapping Algorithm

Seed set grows and residual set shrinks.


    Bootstrapping Algorithm

    Convergence: Stop when residual set stabilizes


    Bootstrapping Algorithm

Iterative procedure:
  Train decision list algorithm on seed set
  Classify residual data with decision list
  Create new seed set by identifying samples that are tagged with a probability above a certain threshold
  Retrain classifier on new seed set

Selecting training seeds:
  Initial training set should accurately distinguish among possible senses
  Strategies:
    Select a single, defining seed collocation for each possible sense (e.g., life and manufacturing for target plant)
    Use words from dictionary definitions
    Hand-label most frequent collocates
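A sketch of the iterative procedure above, reusing train_decision_list and the rule format from the previous sketch; the min_logl confidence threshold stands in for the probability threshold mentioned above and is an assumption:

def yarowsky_bootstrap(seed, residual, min_logl=3.0, max_iters=20):
    """seed: list of (features, sense) pairs; residual: list of features.
    Grows the seed set until the residual set stabilizes."""
    for _ in range(max_iters):
        rules = train_decision_list(seed)
        confident, leftover = [], []
        for features in residual:
            match = next(((l, s) for l, f, s in rules if f in features), None)
            if match and match[0] >= min_logl:     # tagged confidently enough
                confident.append((features, match[1]))
            else:
                leftover.append(features)
        if not confident:                          # residual set has stabilized
            break
        seed = seed + confident                    # seed grows, residual shrinks
        residual = leftover
    return train_decision_list(seed)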


    Evaluation

Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)

Performance of multiple models compared with:
  supervised decision lists
  unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

  Word    Senses             Supervised  Unsupervised (Schütze)  Unsupervised (Bootstrapping)
  plant   living/factory     97.7        92                      98.6
  space   volume/outer       93.9        90                      93.6
  tank    vehicle/container  97.1        95                      96.5
  motion  legal/physical     98.0        92                      97.9
  Avg.    -                  96.1        92.2                    96.5


    Outline

Task definition: What does "minimally supervised" mean?

Bootstrapping algorithms:
  Co-training
  Self-training
  Yarowsky algorithm

Using the Web for Word Sense Disambiguation:
  Web as a corpus
  Web as collective mind


The Web as a Corpus

Use the Web as a large textual corpus:
  Build annotated corpora using monosemous relatives
  Bootstrap annotated corpora starting with few seeds (similar to Yarowsky 1995)

Use the (semi)automatically tagged data to train WSD classifiers


    Monosemous Relatives

Idea: determine a phrase (SP) which uniquely identifies the sense of a word (W#i)

1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics
2. Search the Web using the Search Phrases from step 1
3. Replace the Search Phrases in the examples gathered at step 2 with W#i

Output: sense annotated corpus for the word sense W#i

  "As a pastime, she enjoyed reading."            -> "As an interest, she enjoyed reading."
  "Evaluate the interestingness of the website."  -> "Evaluate the interest of the website."
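A minimal sketch of step 3, assuming the Web hits have already been retrieved into a dict mapping each search phrase to its snippets; the retrieval itself and article agreement (e.g. "a pastime" vs. "an interest") are not handled:

import re

def tag_snippets(search_phrases, snippets, word, sense_no):
    """Replace each search phrase in the retrieved snippets with the
    sense-tagged target word, e.g. 'pastime' -> 'interest#5'."""
    target = "%s#%d" % (word, sense_no)
    corpus = []
    for sp in search_phrases:
        pattern = re.compile(re.escape(sp), re.IGNORECASE)
        for snippet in snippets.get(sp, []):
            corpus.append(pattern.sub(target, snippet))
    return corpus

# Hypothetical usage with canned snippets standing in for Web results:
snippets = {"pastime": ["As a pastime, she enjoyed reading."]}
print(tag_snippets(["pastime"], snippets, "interest", 5))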


    Heuristics to Identify Monosemous Relatives

Synonyms: determine a monosemous synonym
  remember#1 has recollect as a monosemous synonym -> SP = recollect

Dictionary definitions (1):
  Parse the gloss and determine the set of single phrase definitions
  produce#5 has the definition "bring onto the market or release" -> 2 definitions: "bring onto the market" and "release"
  Eliminate "release" as being ambiguous -> SP = bring onto the market

Dictionary definitions (2):
  Parse the gloss and determine the set of single phrase definitions
  Replace the stop words with the NEAR operator
  Strengthen the query: concatenate the words from the current synset using the AND operator
  produce#6 has the synset {grow, raise, farm, produce} and the definition "cultivate by growing"
  SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)

    Heuristics to Identify Monosemous Relatives


Dictionary definitions (3):
  Parse the gloss and determine the set of single phrase definitions
  Keep only the head phrase
  Strengthen the query: concatenate the words from the current synset using the AND operator
  company#5 has the synset {party, company} and the definition "band of people associated in some activity"
  SP = band of people AND (company OR party)

    Example


Building annotated corpora for the noun interest.

  #  Synset                         Definition
  1  {interest#1, involvement}      sense of concern with and curiosity about someone or something
  2  {interest#2, interestingness}  the power of attracting or holding one's interest
  3  {sake, interest#3}             reason for wanting something done
  4  {interest#4}                   fixed charge for borrowing money; usually a percentage of the amount borrowed
  5  {pastime, interest#5}          a subject or pursuit that occupies one's time and thoughts
  6  {interest#6, stake}            a right or legal share of something; financial involvement with something
  7  {interest#7, interest group}   a social group whose members control some field of activity and who have common aims

  Sense #  Search phrase
  1        sense of concern AND (interest OR involvement)
  2        interestingness
  3        reason for wanting AND (interest OR sake)
  4        fixed charge AND interest
           percentage of amount AND interest
  5        pastime
  6        right share AND (interest OR stake)
           legal share AND (interest OR stake)
           financial involvement AND (interest OR stake)
  7        interest group


Example

Gathered 5,404 examples. Checked the first 70 examples: 67 correct; 95.7% accuracy.

1. I appreciate the genuine interest#1 which motivated you to write your message.
2. The webmaster of this site warrants neither accuracy, nor interest#2.
3. He forgives us not only for our interest#3, but for his own.
4. Interest#4 coverage, including rents, was 3.6x.
5. As an interest#5, she enjoyed gardening and taking part into church activities.
6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand.
7. The Adam Smith Society is a new interest#7 organized within the APA.

    Experimental Evaluation


Tests on 20 words: 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings)

Manually check the first 10 examples of each sense of a word => 91% accuracy (Mihalcea 1999)

  Word        Polysemy  Examples   Total examples  Examples manually  Correct
              count     in SemCor  acquired        checked            examples
  interest    7         139        5404            70                 67
  report      7         71         4196            70                 63
  company     9         90         6292            80                 77
  school      7         146        2490            59                 54
  produce     7         148        4982            67                 60
  remember    8         166        3573            67                 57
  write       8         285        2914            69                 67
  speak       4         147        4279            40                 39
  small       14        192        10954           107                92
  clearly     4         48         4031            29                 28
  TOTAL       120       2582       80741           1080               978
  (20 words)


Web-based Bootstrapping

Similar to the Yarowsky algorithm; relies on data gathered from the Web.

1. Create a set of seeds (phrases) consisting of:
   Sense tagged examples in SemCor
   Sense tagged examples from WordNet
   Additional sense tagged examples, if available

   Phrase? At least two open class words; words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.)

2. Search the Web using queries formed with the seed expressions found at Step 1
   Add to the generated corpus a maximum of N text passages

Results competitive with manually tagged corpora (Mihalcea 2002)

    The Web as Collective Mind


Two different views of the Web:
  collection of Web pages
  very large group of Web users

Millions of Web users can contribute their knowledge to a data repository.

Open Mind Word Expert (Chklovski and Mihalcea, 2002)

Fast growing rate:
  Started in April 2002
  Currently more than 100,000 examples of noun senses in several languages

OMWE online: http://teach-computers.org

    Open Mind Word Expert: Quantity and Quality


Data: a mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus

Word senses: based on WordNet definitions

Active learning to select the most informative examples for learning:
  Use two classifiers trained on existing annotated data
  Select items where the two classifiers disagree for human annotation

Quality:
  Two tags per item
  One tag per item per contributor

Evaluations:
  Agreement rates of about 65% - comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers
  Replicability: tests on 1,600 examples of interest led to 90%+ replicability
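The active learning step above amounts to a disagreement filter; a minimal sketch, with predict as an assumed classifier interface:

def select_for_annotation(classifier_a, classifier_b, unlabeled):
    """Keep for human annotation only the items on which the two
    classifiers disagree - the presumably most informative ones."""
    return [item for item in unlabeled
            if classifier_a.predict(item) != classifier_b.predict(item)]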

    References


(Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.

(Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.

(Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD.

(Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.

(Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.

(Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.

(Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004.

(Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.

(Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.

(Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001.

(Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.


    Part 6:

    Unsupervised Methods of Word

    Sense Disambiguation

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    What is Unsupervised Learning?


Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources.

These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with those of any other cluster.

Note! If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning:
  Supervised Classification identifies features that trigger a sense tag
  Unsupervised Clustering finds similarity between contexts

    Cluster this Data!

    Facts about my day


  Day  F1: Hot Outside?  F2: Slept Well?  F3: Ate Well?
  1    YES               NO               NO
  2    YES               NO               YES
  3    NO                NO               NO
  4    NO                NO               YES



    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Task Definition


Unsupervised Word Sense Discrimination: a class of methods that cluster words based on similarity of context.

Strong Contextual Hypothesis:
  (Miller and Charles, 1991): Words with similar meanings tend to occur in similar contexts
  (Firth, 1957): "You shall know a word by the company it keeps." - words that keep the same company tend to have similar meanings

Only use the information available in raw text; do not use outside knowledge sources or manual annotations.

No knowledge of existing sense inventories, so clusters are not labeled with senses.

    Task Definition


Resources: large corpora

Scope:
  Typically one targeted word per context to be discriminated
  Equivalently, measure similarity among contexts
  Features may be identified in separate training data, or in the data to be clustered
  Does not assign senses or labels to clusters

Word Sense Discrimination reduces to the problem of finding the targeted words that occur in the most similar contexts and placing them in a cluster.

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Agglomerative Clustering


Create a similarity matrix of instances to be discriminated:
  Results in a symmetric instance-by-instance matrix, where each cell contains the similarity score between a pair of instances
  Typically a first order representation, where similarity is based on the features observed in the pair of instances

Apply an agglomerative clustering algorithm to the matrix:
  To start, each instance is its own cluster
  Form a cluster from the most similar pair of instances
  Repeat until the desired number of clusters is obtained

Advantages: high quality clustering

Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons
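A minimal sketch of the agglomerative loop over a pairwise similarity matrix. Cluster similarity here is averaged over instance pairs (UPGMA); McQuitty's variant, used a few slides below, averages the two cluster-level similarities instead:

from itertools import combinations

def agglomerate(sim, n, k):
    """Cluster n instances, given similarities sim[(i, j)] for i < j,
    merging the most similar pair until k clusters remain."""
    clusters = [(i,) for i in range(n)]

    def cluster_sim(a, b):
        # average of all instance-pair similarities across the two clusters
        pairs = [tuple(sorted((i, j))) for i in a for j in b]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(combinations(clusters, 2),
                   key=lambda ab: cluster_sim(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# The similarity matrix from the S1..S4 example below (0-indexed):
sim = {(0, 1): 3, (0, 2): 4, (0, 3): 2, (1, 2): 2, (1, 3): 0, (2, 3): 1}
print(agglomerate(sim, 4, 2))   # [(3,), (1, 0, 2)]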

    Measuring Similarity


Integer Values:
  Matching Coefficient: |X ∩ Y|
  Jaccard Coefficient: |X ∩ Y| / |X ∪ Y|
  Dice Coefficient: 2|X ∩ Y| / (|X| + |Y|)

Real Values:
  Cosine: (X · Y) / (|X| |Y|)
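These measures, sketched in Python: the set-based coefficients take feature sets, and cosine takes real-valued feature vectors represented as dicts:

import math

def matching(x, y):
    return len(x & y)                       # |X ∩ Y|

def jaccard(x, y):
    return len(x & y) / len(x | y)          # |X ∩ Y| / |X ∪ Y|

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):
    dot = sum(x[f] * y[f] for f in x.keys() & y.keys())
    norm = (math.sqrt(sum(v * v for v in x.values())) *
            math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0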

    Instances to be Clustered


      P-2  P-1   P+1   P+2   fish  check  river  interest
  S1  adv  det   prep  det   Y     N      Y      N
  S2       det   prep  det   N     Y      N      Y
  S3  det  adj   verb  det   Y     N      N      N
  S4  det  noun  noun  noun  N     N      N      N

      S1  S2  S3  S4
  S1      3   4   2
  S2  3       2   0
  S3  4   2       1
  S4  2   0   1

Average Link Clustering (aka McQuitty's Similarity Analysis)


      S1  S2  S3  S4
  S1      3   4   2
  S2  3       2   0
  S3  4   2       1
  S4  2   0   1

Merge the most similar pair, S1 and S3 (similarity 4). The merged cluster's similarity to the others is the average of its members' similarities: sim(S1S3, S2) = (3+2)/2 = 2.5 and sim(S1S3, S4) = (2+1)/2 = 1.5

        S1S3  S2   S4
  S1S3        2.5  1.5
  S2    2.5        0
  S4    1.5   0

Next merge S1S3 and S2 (similarity 2.5): sim(S1S3S2, S4) = (1.5+0)/2 = 0.75

          S1S3S2  S4
  S1S3S2          0.75
  S4      0.75

    Evaluation of Unsupervised Methods


If sense tagged text is available, it can be used for evaluation - but don't use the sense tags for clustering or feature selection!

Assume that sense tags represent true clusters, and compare these to discovered clusters:
  Find the mapping of clusters to senses that attains maximum accuracy

Pseudo words are especially useful, since it is hard to find data that is already discriminated:
  Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  http://www.d.umn.edu/~kulka020/kanaghaName.html

Baseline algorithm: group all instances into one cluster; this will reach accuracy equal to the majority classifier.

    Baseline Performance


If C3 is labeled S3:

          S1  S2  S3  Totals
  C1      0   0   0   0
  C2      0   0   0   0
  C3      80  35  55  170
  Totals  80  35  55  170

  (0+0+55)/170 = .32

If C3 is labeled S1 (columns rearranged):

          S3  S2  S1  Totals
  C1      0   0   0   0
  C2      0   0   0   0
  C3      55  35  80  170
  Totals  55  35  80  170

  (0+0+80)/170 = .47

    Evaluation


Suppose that C1 is labeled S1, C2 as S2, and C3 as S3:

          S1  S2  S3  Totals
  C1      10  30  5   45
  C2      20  0   40  60
  C3      50  5   10  65
  Totals  80  35  55  170

Accuracy = (10 + 0 + 10) / 170 = 12%

The diagonal shows how many members of the cluster actually belong to the sense given on the column. Can the columns be rearranged to improve the overall accuracy? Optimally assign clusters to senses.

    Evaluation


The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%

Find the ordering of the columns in the matrix that maximizes the sum of the diagonal.

This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory.

          S2  S3  S1  Totals
  C1      30  5   10  45
  C2      0   40  20  60
  C3      5   10  50  65
  Totals  35  55  80  170
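For small matrices like these, the optimal assignment can simply be found by brute force over column permutations (for larger instances, a Hungarian-algorithm solver such as scipy.optimize.linear_sum_assignment does the same job):

from itertools import permutations

def best_accuracy(confusion):
    """Maximize the diagonal sum over all cluster-to-sense assignments."""
    n = len(confusion)
    total = sum(map(sum, confusion))
    best = max(sum(confusion[c][assignment[c]] for c in range(n))
               for assignment in permutations(range(n)))
    return best / total

# The confusion matrix from the slide above (rows C1..C3, columns S1..S3):
confusion = [[10, 30, 5],
             [20, 0, 40],
             [50, 5, 10]]
print(round(best_accuracy(confusion), 2))   # 0.71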

    Agglomerative Approach


(Pedersen and Bruce, 1997) explore discrimination with a small number (approx. 30) of features near the target word:
  Morphological form of target word (1)
  Part of speech of the two words to left and right of target word (4)
  Co-occurrences (3): most frequent content words in context
  Unrestricted collocations (19): most frequent words located one position to left or right of target, OR
  Content collocations (19): most frequent content words located one position to left or right of target

Features identified in the instances to be clustered.

Similarity measured by the matching coefficient.

Clustered with McQuitty's Similarity Analysis, Ward's Method, and the EM Algorithm; found that McQuitty's method was the most accurate.

    Experimental Evaluation


Word (instances): majority sense -> discrimination accuracy

Adjectives:
  Chief (1048): 86% majority -> 86%
  Common (1060): 84% majority -> 80%
  Last (3004): 94% majority -> 79%
  Public (715): 68% majority -> 63%

Nouns:
  Bill (1341): 68% majority -> 75%
  Concern (1235): 64% majority -> 68%
  Drug (1127): 57% majority -> 65%
  Interest (2113): 59% majority -> 65%
  Line (1149): 37% majority -> 42%

Verbs:
  Agree (1109): 74% majority -> 69%
  Close (1354): 77% majority -> 72%
  Help (1267): 78% majority -> 70%
  Include (1526): 91% majority -> 77%

    Analysis


Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning.

Evaluation based on assuming that sense tags represent the true clusters is likely a bit harsh. Alternatives:
  Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
  Use the contents of the cluster to generate a descriptive label that could be inspected by a human

First order feature sets may be problematic with smaller amounts of data, since these features must occur exactly in the test instances in order to be matched.

    Outline


What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

    Latent Semantic Indexing/Analysis


Adapted by (Schütze, 1998) to word sense discrimination:

Represent training data as a word co-occurrence matrix.

Reduce the dimensionality of the co-occurrence matrix via Singular Value Decomposition (SVD); significant dimensions are associated with concepts.

Represent the instances of a target word to be clustered by taking the average of all the vectors associated with all the words in that context (context represented by an averaged vector).

Measure the similarity amongst instances via cosine and record in a similarity matrix, or cluster the vectors directly.
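A minimal numpy sketch of this pipeline: SVD-reduce the co-occurrence matrix, then average the reduced word vectors found in each context (each context is assumed to contain at least one word from the matrix):

import numpy as np

def context_vectors(cooc, words, contexts, k=2):
    """cooc: co-occurrence counts, one row per word in `words`.
    Returns one averaged k-dimensional vector per context."""
    A = np.array(cooc, dtype=float)
    U, D, Vt = np.linalg.svd(A, full_matrices=False)
    reduced = U[:, :k] * D[:k]                 # k-dimensional word vectors
    index = {w: i for i, w in enumerate(words)}
    return [np.mean([reduced[index[w]] for w in ctx if w in index], axis=0)
            for ctx in contexts]

Cosine similarity between the averaged vectors then fills the similarity matrix used for clustering.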

    Co-occurrence matrix


         apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
  pc     2      0      0      1    3     1    0       0         0       0      0
  body   0      3      0      0    0     0    2       0         0       2      1
  disk   1      0      0      2    0     3    0       1         2       0      0
  petri  0      2      1      0    0     0    2       0         1       0      1
  lab    0      0      3      0    2     0    2       0         2       1      3
  sales  0      0      0      2    3     0    0       1         2       0      0
  linux  2      0      0      1    3     2    0       1         1       0      0
  debt   0      0      0      2    3     4    0       2         0       0      0

Singular Value Decomposition

A = UDV^T


    U


  .35   .09  -.20   .52  -.09   .40   .02   .63   .20  -.00  -.02
  .05  -.49   .59   .44   .08  -.09  -.44  -.04  -.60  -.02  -.01
  .35   .13   .39  -.60   .31   .41  -.22   .20  -.39   .00   .03
  .08  -.45   .25  -.02   .17