2002.11.07 - SLIDE 1 IS 202 – FALL 2002 Lecture 20: Lexical Relations & WordNet Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/ is202/f02/ SIMS 202: Information Organization and Retrieval

2002.11.07 - SLIDE 1IS 202 – FALL 2002

Lecture 20: Lexical Relations & WordNet

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002

http://www.sims.berkeley.edu/academics/courses/is202/f02/

SIMS 202: Information Organization and Retrieval

2002.11.07 - SLIDE 2IS 202 – FALL 2002

Lecture Overview

• Review
  – Probabilistic Models of IR
  – Relevance Feedback

• Lexical Relations

• WordNet

• Can Lexical and Semantic Relations be Exploited to Improve IR?

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2002.11.07 - SLIDE 4IS 202 – FALL 2002

Probability Ranking Principle

• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977
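
As a minimal sketch of the principle above, assume some model has already produced a probability-of-usefulness estimate for each document. The document IDs and probabilities below are invented for illustration. Under that assumption, producing the ranking reduces to sorting by the estimate in decreasing order:

```python
def rank_by_probability(estimates):
    """Return document IDs ordered by decreasing estimated probability of usefulness.

    `estimates` maps a document ID to an estimated probability.
    Both the IDs and the values are hypothetical inputs.
    """
    return sorted(estimates, key=lambda doc_id: estimates[doc_id], reverse=True)


# Hypothetical estimates for three documents and one request.
estimates = {"doc_a": 0.12, "doc_b": 0.87, "doc_c": 0.45}
ranking = rank_by_probability(estimates)
print(ranking)
```

The principle says nothing about how the probabilities are estimated; the models reviewed next are different ways of doing that.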

2002.11.07 - SLIDE 5IS 202 – FALL 2002

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, $x \subseteq D$

• y = class of similar queries, $y \subseteq Q$

• Relevance (R) is a relation:

$$R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$$

2002.11.07 - SLIDE 6IS 202 – FALL 2002

Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R|Qj,Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation

2002.11.07 - SLIDE 7IS 202 – FALL 2002

Probabilistic Models

[Figure: diagram relating the sets D and Q, the classes x and y, and the individual document Di and query Qj]

2002.11.07 - SLIDE 8IS 202 – FALL 2002

Logistic Regression

• Another approach to estimating probability of relevance

• Based on work by William Cooper, Fred Gey and Daniel Dabney

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence

2002.11.07 - SLIDE 9IS 202 – FALL 2002

Logistic Regression

[Figure: plot of Relevance (vertical axis, 0 to 100) against Term Frequency in Document (horizontal axis, 0 to 60)]

2002.11.07 - SLIDE 10IS 202 – FALL 2002

Logistic Regression

• Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients

• At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

• For the 6 X attribute measures shown previously
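
The retrieval-time computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the coefficient and attribute values are invented, and the linear combination is treated as log-odds and mapped to a probability with the logistic function, which is the usual logistic-regression reading and is not spelled out on the slide.

```python
import math


def linear_score(c0, coefficients, attributes):
    """Intercept plus the weighted sum of the attribute measures X_i."""
    return c0 + sum(c * x for c, x in zip(coefficients, attributes))


def relevance_probability(c0, coefficients, attributes):
    """Map the linear score to (0, 1) with the logistic function.

    Assumption: the linear score is interpreted as log-odds of relevance.
    """
    return 1.0 / (1.0 + math.exp(-linear_score(c0, coefficients, attributes)))


# Hypothetical coefficients (as if fitted on training data) and the six
# attribute values for one query-document pair.
c0 = -3.5
coefficients = [0.4, 0.1, -0.2, 0.3, 0.05, 0.6]
attributes = [2.0, 1.0, 0.5, 1.5, 3.0, 0.0]

score = linear_score(c0, coefficients, attributes)
p = relevance_probability(c0, coefficients, attributes)
print(round(score, 3), round(p, 3))
```

Documents would then be ranked by `p`, in line with the Probability Ranking Principle.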

2002.11.07 - SLIDE 11IS 202 – FALL 2002

Relevance Feedback in an IR System

[Figure: diagram of an Information Storage and Retrieval System. On one side, interest profiles & queries are formulated in terms of descriptors and stored (Store 1: Profiles/Search requests). On the other side, documents & data are indexed (descriptive and subject) and stored (Store 2: Document representations). Both sides are governed by the "rules of the game" = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). A storage line separates input from Comparison/Matching, which yields potentially relevant documents; the diagram also shows "Selected relevant docs".]

2002.11.07 - SLIDE 12IS 202 – FALL 2002

Relevance Feedback

• Main Idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • And/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
  – Users/system select terms from an automatically-generated list

2002.11.07 - SLIDE 13IS 202 – FALL 2002

Rocchio/Vector Illustration

[Figure: two-dimensional term space with axes "Retrieval" and "Information" (each running from 0 to 1.0), showing the vectors D1, D2, Q0, Q’ and Q”]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½*Q0 + ½*D1 = (0.45, 0.55)
Q” = ½*Q0 + ½*D2 = (0.80, 0.20)
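
The arithmetic on this slide can be reproduced with a few lines of code. The sketch below covers only the half-and-half mix of the original query with one relevant document shown here; the function name and default weights are illustrative choices, and a fuller Rocchio formulation would also handle several relevant documents and non-relevant ones.

```python
def feedback_query(query, document, query_weight=0.5, doc_weight=0.5):
    """Move a query vector toward a relevant document vector.

    Each component of the new query is a weighted sum of the matching
    components of the old query and the document.
    """
    return tuple(query_weight * q + doc_weight * d for q, d in zip(query, document))


q0 = (0.7, 0.3)  # "retrieval of information"
d1 = (0.2, 0.8)  # "information science"
d2 = (0.9, 0.1)  # "retrieval systems"

q_prime = feedback_query(q0, d1)
q_double_prime = feedback_query(q0, d2)

# Rounded for display, since floating-point sums are not exact.
print([round(v, 2) for v in q_prime])
print([round(v, 2) for v in q_double_prime])
```

Feedback on D1 pulls the query toward the "information" axis, while feedback on D2 pulls it toward "retrieval".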

2002.11.07 - SLIDE 14IS 202 – FALL 2002

Alternative Notions of Relevance Feedback

• Find people whose taste is “similar” to yours
  – Will you like what they like?

• Follow a user’s actions in the background
  – Can this be used to predict what the user will want to see next?

• Track what lots of people are doing
  – Does this implicitly indicate what they think is good and not good?

2002.11.07 - SLIDE 15IS 202 – FALL 2002

Alternative Notions of Relevance Feedback

• Several different criteria to consider:
  – Implicit vs. Explicit judgements
  – Individual vs. Group judgements
  – Standing vs. Dynamic topics
  – Similarity of the items being judged vs. similarity of the judges themselves

2002.11.07 - SLIDE 16IS 202 – FALL 2002

Lecture Overview

• Review
  – Probabilistic Models of IR
  – Relevance Feedback

• Lexical Relations

• WordNet

• Can Lexical and Semantic Relations be Exploited to Improve IR?

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2002.11.07 - SLIDE 17IS 202 – FALL 2002

Syntax

• The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language

• These rules codify permissible combinations of classes of word forms

2002.11.07 - SLIDE 18IS 202 – FALL 2002

Semantics

• Semantics is the study of linguistic meaning

• Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics):
  – (1) compositional
  – (2) relational

2002.11.07 - SLIDE 19IS 202 – FALL 2002

Lexical Semantics: Compositional Approach

• Compositional lexical semantics, introduced by Katz & Fodor (1963), analyzes the meaning of a word in much the same way a sentence is analyzed into semantic components. The semantic components of a word are not themselves considered to be words, but are abstract elements (semantic atoms) postulated in order to describe word meanings (semantic molecules) and to explain the semantic relations between words. For example, the representation of bachelor might be ANIMATE and HUMAN and MALE and ADULT and NEVER MARRIED. The representation of man might be ANIMATE and HUMAN and MALE and ADULT; because all the semantic components of man are included in the semantic components of bachelor, it can be inferred that bachelor → man. In addition, there are implicational rules between semantic components, e.g. HUMAN → ANIMATE, which also look very much like meaning postulates.
  – George Miller, “On Knowing a Word,” 1999
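
The inclusion argument in the passage above (every component of man is also a component of bachelor, so a bachelor is a man) can be sketched with sets. The component lists are the ones given in the quotation; the dictionary and function names are illustrative only.

```python
# Semantic components taken from the example in the quotation.
SEMANTIC_COMPONENTS = {
    "man": {"ANIMATE", "HUMAN", "MALE", "ADULT"},
    "bachelor": {"ANIMATE", "HUMAN", "MALE", "ADULT", "NEVER MARRIED"},
}


def entails(word_a, word_b, components=SEMANTIC_COMPONENTS):
    """True if every semantic component of word_b is also a component of word_a."""
    return components[word_b] <= components[word_a]


print(entails("bachelor", "man"))  # being a bachelor implies being a man
print(entails("man", "bachelor"))  # the converse does not hold
```

The implicational rules between components (such as HUMAN implying ANIMATE) are not modelled here; they would need a separate rule table.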

2002.11.07 - SLIDE 20IS 202 – FALL 2002

Lexical Semantics: Relational Approach

• Relational lexical semantics was first introduced by Carnap (1956) in the form of meaning postulates, where each postulate stated a semantic relation between words. A meaning postulate might look something like dog → animal (if x is a dog then x is an animal) or, adding logical constants, bachelor → man and never married [if x is a bachelor then x is a man and not(x has married)] or tall → not short [if x is tall then not(x is short)]. The meaning of a word was given, roughly, by the set of all meaning postulates in which it occurs.
  – George Miller, “On Knowing a Word,” 1999

2002.11.07 - SLIDE 21IS 202 – FALL 2002

Pragmatics

• Deals with the relation between signs or linguistic expressions and their users

• Deixis (literally “pointing out”)
  – E.g., “I’ll be back in an hour” depends upon the time of the utterance

• Conversational implicature
  – A: “Can you tell me the time?”
  – B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.]

• Presupposition
  – “Are you still such a bad driver?”

• Speech acts
  – Constatives vs. performatives
  – E.g., “I second the motion.”

• Conversational structure
  – E.g., turn-taking rules

2002.11.07 - SLIDE 22IS 202 – FALL 2002

Language

• Language only hints at meaning

• Most meaning of text lies within our minds and common understanding
  – “How much is that doggy in the window?”
    • How much: social system of barter and trade (not the size of the dog)
    • “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
    • “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

2002.11.07 - SLIDE 23IS 202 – FALL 2002

Semantics: The Meaning of Symbols

• Semantics versus Syntax
  – add(3,4)
  – 3 + 4
  – (different syntax, same meaning)

• Meaning versus Representation
  – What a person’s name is versus who they are
    • A rose by any other name...
  – What the computer program “looks like” versus what it actually does

2002.11.07 - SLIDE 24IS 202 – FALL 2002

Semantics

• Semantics: Assigning meanings to symbols and expressions
  – Usually involves defining:
    • Objects
    • Properties of objects
    • Relations between objects
  – More detailed versions include
    • Events
    • Time
    • Places
    • Measurements (quantities)

2002.11.07 - SLIDE 25IS 202 – FALL 2002

The Role of Context

• The concept associated with the symbol “21” means different things in different contexts
  – Examples?

• The question “Is there any salt?”
  – Asked of a waiter at a restaurant
  – Asked of an environmental scientist at work

2002.11.07 - SLIDE 26IS 202 – FALL 2002

What’s In a Sentence?

“A sentence is not a verbal snapshot or movie of an event. In framing an utterance, you have to abstract away from everything you know, or can picture, about a situation, and present a schematic version which conveys the essentials. In terms of grammatical marking, there is not enough time in the speech situation for any language to allow for the marking of everything which could possibly be significant to the message.”

Dan Slobin, in Language Acquisition: The state of the art, 1982

2002.11.07 - SLIDE 27IS 202 – FALL 2002

Lexical Relations

• Conceptual relations link concepts
  – Goal of Artificial Intelligence

• Lexical relations link words
  – Goal of Linguistics

2002.11.07 - SLIDE 28IS 202 – FALL 2002

Major Lexical Relations

• Synonymy

• Polysemy

• Metonymy

• Hyponymy/Hyperonymy

• Meronymy

• Antonymy

2002.11.07 - SLIDE 29IS 202 – FALL 2002

Synonymy

• Different ways of expressing related concepts

• Examples
  – cat, feline, Siamese cat

• Overlaps with basic and subordinate levels

• Synonyms are almost never truly substitutable:
  – Used in different contexts
  – Have different implications
    • This is a point of contention

2002.11.07 - SLIDE 30IS 202 – FALL 2002

Polysemy

• Most words have more than one sense
  – Homonym: same word, different meaning
    • bank (river)
    • bank (financial)
  – Polysemy: different senses of same word
    • That dog has floppy ears.
    • She has a good ear for jazz.
    • bank (financial) has several related senses
      – the building, the institution, the notion of where money is stored

2002.11.07 - SLIDE 31IS 202 – FALL 2002

Metonymy

• Use one aspect of something to stand for the whole
  – The building stands for the institution of the bank.
  – Newscast: “The White House released new figures today.”
  – Waitperson: “The ham sandwich spilled his drink.”

2002.11.07 - SLIDE 32IS 202 – FALL 2002

Hyponymy/Hyperonymy

• ISA relation

• Related to Superordinate and Subordinate level categories
  – hyponym(robin,bird)
  – hyponym(bird,animal)
  – hyponym(emu,bird)

• A is a hypernym of B if B is a type of A

• A is a hyponym of B if A is a type of B
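
The hyponym facts on this slide can be stored as a small lookup table and followed upward. This is a toy sketch: it assumes each term has a single direct hypernym and that the ISA relation is transitive, and the names `HYPONYM_OF`, `hypernyms` and `is_hyponym` are made up for illustration.

```python
# Direct ISA facts from the slide: hyponym -> hypernym.
HYPONYM_OF = {"robin": "bird", "emu": "bird", "bird": "animal"}


def hypernyms(term, hyponym_of=HYPONYM_OF):
    """Return the chain of hypernyms of `term`, nearest first."""
    chain = []
    while term in hyponym_of:
        term = hyponym_of[term]
        chain.append(term)
    return chain


def is_hyponym(a, b, hyponym_of=HYPONYM_OF):
    """True if a is a type of b, directly or through intermediate levels."""
    return b in hypernyms(a, hyponym_of)


print(hypernyms("robin"))
print(is_hyponym("emu", "animal"))
```

Reading the table in the other direction gives the hypernym relation: A is a hypernym of B exactly when `is_hyponym(B, A)` holds.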

2002.11.07 - SLIDE 33IS 202 – FALL 2002

Basic-Level Categories (review)

• Brown 1958, 1965, Berlin et al., 1972, 1973

• Folk biology:
  – Unique beginner: plant, animal
  – Life form: tree, bush, flower
  – Generic name: pine, oak, maple, elm
  – Specific name: Ponderosa pine, white pine
  – Varietal name: Western Ponderosa pine

• No overlap between levels

• Level 3 is basic
  – Corresponds to genus
  – Folk biological categories correspond accurately to scientific biological categories only at the basic level

2002.11.07 - SLIDE 34IS 202 – FALL 2002

Psychologically Primary Levels

SUPERORDINATE animal furniture

BASIC LEVEL dog chair

SUBORDINATE terrier rocker

• Children take longer to learn superordinate

• Superordinate not associated with mental images or motor actions

2002.11.07 - SLIDE 35IS 202 – FALL 2002

Meronymy

• Parts-of relation
  – part of(beak, bird)
  – part of(bark, tree)

• Transitive conceptually but not lexically:
  – The knob is a part of the door.
  – The door is a part of the house.
  – ? The knob is a part of the house ?

2002.11.07 - SLIDE 36IS 202 – FALL 2002

Antonymy

• Lexical opposites
  – antonym(large, small)
  – antonym(big, small)
  – antonym(big, little)
  – but not large, little

• Many antonymous relations can be reliably detected by looking for statistical correlations in large text collections. (Justeson & Katz 91)

2002.11.07 - SLIDE 37IS 202 – FALL 2002

Thesauri and Lexical Relations

• Polysemy: Same word, different senses of meaning
  – Slightly different concepts expressed similarly

• Synonyms: Different words, related senses of meanings
  – Different ways to express similar concepts

• Thesauri help draw all these together

• Thesauri also commonly define a set of relations between terms that is similar to lexical relations
  – BT, NT, RT

2002.11.07 - SLIDE 38IS 202 – FALL 2002

What is an Ontology?

• From Merriam-Webster’s Collegiate:
  – A branch of metaphysics concerned with the nature and relations of being
  – A particular theory about the nature of being or the kinds of existence

• More prosaically:
  – A carving up of the world’s meanings
  – Determine what things exist, but not how they interrelate

• Related terms:
  – Taxonomy, dictionary, category structure

• Commonly used now in CS literature to describe structures that function as Thesauri

2002.11.07 - SLIDE 39IS 202 – FALL 2002

Lecture Overview

• Review
  – Probabilistic Models of IR
  – Relevance Feedback

• Lexical Relations

• WordNet

• Can Lexical and Semantic Relations be Exploited to Improve IR?

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2002.11.07 - SLIDE 40IS 202 – FALL 2002

WordNet

• Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University

• Can be downloaded for free:
  – www.cogsci.princeton.edu/~wn/

• “In terms of coverage, WordNet’s goals differ little from those of a good standard college-level dictionary, and the semantics of WordNet is based on the notion of word sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that WordNet aspires to innovation.”
  – (Miller, 1998, Chapter 1)

2002.11.07 - SLIDE 41IS 202 – FALL 2002

Presuppositions of WordNet project

• Separability hypothesis:
  – The lexical component of language can be separated and studied in its own right

• Patterning hypothesis:
  – People have knowledge of the systematic patterns and relations between word meanings

• Comprehensiveness hypothesis:
  – Computational linguistics programs need a store of lexical knowledge that is as extensive as that which people have

2002.11.07 - SLIDE 42IS 202 – FALL 2002

WordNet: Size

POS         Unique Strings   Synsets
Noun        107930           74488
Verb        10806            12754
Adjective   21365            18523
Adverb      4583             3612
Totals      144684           109377

WordNet Uses “Synsets” – sets of synonymous terms

2002.11.07 - SLIDE 43IS 202 – FALL 2002

Structure of WordNet

2002.11.07 - SLIDE 46IS 202 – FALL 2002

Unique Beginners

• Entity, something
  – (anything having existence (living or nonliving))

• Psychological_feature
  – (a feature of the mental life of a living organism)

• Abstraction
  – (a general concept formed by extracting common features from specific examples)

• State
  – (the way something is with respect to its main attributes; "the current state of knowledge"; "his state of health"; "in a weak financial state")

• Event
  – (something that happens at a given place and time)

2002.11.07 - SLIDE 47IS 202 – FALL 2002

Unique Beginners

• Act, human_action, human_activity
  – (something that people do or cause to happen)

• Group, grouping
  – (any number of entities (members) considered as a unit)

• Possession
  – (anything owned or possessed)

• Phenomenon
  – (any state or process known through the senses rather than by intuition or reasoning)

2002.11.07 - SLIDE 48IS 202 – FALL 2002

WordNet Usage

• Available online (from Unix) if you wish to try it…
  – Login to irony and type “wn word” for any word you are interested in
  – Demo…

2002.11.07 - SLIDE 49IS 202 – FALL 2002

Lecture Overview

• Review
  – Probabilistic Models of IR
  – Relevance Feedback

• Lexical Relations

• WordNet

• Can Lexical and Semantic Relations be Exploited to Improve IR?

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2002.11.07 - SLIDE 50IS 202 – FALL 2002

Lexical Relations and IR

• Recall that most IR research has primarily looked at statistical approaches to inferring the topicality or meaning of documents

• I.e., Statistics imply Semantics
  – Is this really true or correct?

• How has WordNet been used (or how might it be used) to provide more functionality in searching?

• What about other thesauri, classification schemes and ontologies?

2002.11.07 - SLIDE 51IS 202 – FALL 2002

Natural Language Processing and IR

• The main approach in applying NLP to IR has been to attempt to address
  – Phrase usage vs individual terms
  – Search expansion using related terms/concepts
  – Attempts to automatically exploit or assign controlled vocabularies
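
Search expansion using related terms, the second item above, can be sketched as a lookup in a table of related terms. The table below is hypothetical and hand-made; in practice the related terms might come from a thesaurus or from WordNet synsets, and the function name is an illustrative choice.

```python
# Hypothetical table of related terms (not drawn from any real thesaurus).
RELATED_TERMS = {
    "car": ["automobile", "auto"],
    "film": ["movie"],
}


def expand_query(terms, related=RELATED_TERMS):
    """Return the query terms plus their related terms, without duplicates.

    Original order is kept, and each term's related terms follow it directly.
    """
    expanded = []
    for term in terms:
        for candidate in [term] + related.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"]))
```

A sketch like this treats every related term as equally good; it does nothing about word sense, which is one reason naive expansion can hurt precision.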

2002.11.07 - SLIDE 52IS 202 – FALL 2002

NLP and IR

• Early research showed that (at least in the restricted test databases tested)
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

2002.11.07 - SLIDE 53IS 202 – FALL 2002

NLP and IR

• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
  – E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches
  – Semantics is even harder to accomplish
    • WordNet alone can’t disambiguate word senses in texts

2002.11.07 - SLIDE 54IS 202 – FALL 2002

Using NLP

• Strzalkowski

[Figure: processing pipeline with the labels Text, NLP, repres, Dbase search; the NLP stage is broken down into TAGGER, PARSER, TERMS]

2002.11.07 - SLIDE 55IS 202 – FALL 2002

Using NLP

INPUT SENTENCE

The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE

The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

2002.11.07 - SLIDE 56IS 202 – FALL 2002

Using NLP

TAGGED & STEMMED SENTENCE

the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
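
A tagged and stemmed sentence in this word/tag format is easy to take apart in code. The sketch below splits it into (word, tag) pairs and keeps candidate index terms; the set of tags kept is an assumption made for illustration, not the selection rule used in Strzalkowski's system.

```python
TAGGED = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj "
    "hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per"
)


def parse_tagged(sentence):
    """Split 'word/tag' tokens into (word, tag) pairs, splitting on the last slash."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]


def content_terms(pairs, keep_tags=("nn", "np", "jj", "vbd")):
    """Keep words whose tag is in keep_tags (an assumed content-word filter)."""
    return [word for word, tag in pairs if tag in keep_tags]


pairs = parse_tagged(TAGGED)
terms = content_terms(pairs)
print(terms)
```

Determiners, auxiliaries, adverbs, prepositions and punctuation drop out, leaving the words from which single terms and head+modifier pairs could later be built.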

2002.11.07 - SLIDE 57IS 202 – FALL 2002

Using NLP

PARSED SENTENCE

[assert

[[perf [have]][[verb[BE]]

[subject [np[n PRESIDENT][t_pos THE]

[adj[FORMER]][adj[SOVIET]]]]

[adv EVER]

[sub_ord[SINCE [[verb[INVADE]]

[subject [np [n TANK][t_pos A]

[adj [RUSSIAN]]]]

[object [np [name [WISCONSIN]]]]]]]]]

2002.11.07 - SLIDE 58IS 202 – FALL 2002

Using NLP

EXTRACTED TERMS & WEIGHTS

Term               Weight
president          2.623519
soviet             5.416102
president+soviet   11.556747
president+former   14.594883
hero               7.896426
hero+local         14.314775
invade             8.435012
tank               6.848128
tank+invade        17.402237
tank+russian       16.030809
russian            7.383342
wisconsin          7.785689

2002.11.07 - SLIDE 59IS 202 – FALL 2002

NLP & IR Research Issues

• Is natural language indexing using more NLP knowledge needed?

• Or, should controlled vocabularies be used?

• Can NLP in its current state provide the improvements needed?

• How to test?

2002.11.07 - SLIDE 60IS 202 – FALL 2002

NLP & IR Research Areas

• Lewis and Sparck Jones (CACM 1996) suggest research in three areas
  – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
  – The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
  – Using NLP-based methods for searching and matching

2002.11.07 - SLIDE 61IS 202 – FALL 2002

NLP & IR: Possible Approaches

• Indexing
  – Use of NLP methods to identify phrases
    • Test weighting schemes for phrases
  – Use of more sophisticated morphological analysis

• Searching
  – Use of two-stage retrieval
    • Statistical retrieval
    • Followed by more sophisticated NLP filtering

2002.11.07 - SLIDE 62IS 202 – FALL 2002

Can Statistics approach Semantics?

• One approach is the Entry Vocabulary Index (EVI) work being done here…

• (The following slides are from my presentation at JCDL 2002)

2002.11.07 - SLIDE 63IS 202 – FALL 2002

What is an Entry Vocabulary Index?

• EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
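
One way to picture such a mapping is a table built from documents that already carry controlled-vocabulary terms: each ordinary word is associated with the controlled terms of the records it appears in. The sketch below is a deliberate simplification that uses raw co-occurrence counts rather than a statistical association weight, and the training records (including the "trucks" heading) are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (free text, assigned controlled-vocabulary term).
TRAINING_RECORDS = [
    ("used automobile prices", "pass mtr veh spark ign eng"),
    ("automobile engine repair", "pass mtr veh spark ign eng"),
    ("diesel truck engine", "trucks"),
]


def build_evi(records):
    """Count, for every word, how often it co-occurs with each controlled term."""
    evi = defaultdict(Counter)
    for text, controlled_term in records:
        for word in set(text.lower().split()):
            evi[word][controlled_term] += 1
    return evi


def suggest(evi, word, n=3):
    """Return up to n controlled terms most often seen with the user's word."""
    return [term for term, _ in evi[word.lower()].most_common(n)]


evi = build_evi(TRAINING_RECORDS)
print(suggest(evi, "automobile"))
```

A user who types "automobile" is pointed to the controlled term without needing to know how the collection's indexers phrase it.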

2002.11.07 - SLIDE 64IS 202 – FALL 2002

Start with a collection of documents.

2002.11.07 - SLIDE 65IS 202 – FALL 2002

Classify and index with controlled vocabulary.

Ideally, use a database already indexed.

[Figure label: Index]

2002.11.07 - SLIDE 66IS 202 – FALL 2002

Problem: Controlled vocabularies can be difficult for people to use.

“pass mtr veh spark ign eng”

[Figure label: Index]

2002.11.07 - SLIDE 67IS 202 – FALL 2002

Solution: Entry Level Vocabulary Indexes.

“pass mtr veh spark ign eng” = “Automobile”

[Figure labels: Index, EVI]

2002.11.07 - SLIDE 68IS 202 – FALL 2002

EVI example

User Query: “Automobile”

EVI 1: Index term: “pass mtr veh spark ign eng”

EVI 2: Index term: “automobiles” OR “internal combustible engines”

2002.11.07 - SLIDE 69IS 202 – FALL 2002

But why stop there?

[Figure labels: Index, EVI]

2002.11.07 - SLIDE 70IS 202 – FALL 2002

“Which EVI do I use?”

[Figure: several Index and EVI pairs]

2002.11.07 - SLIDE 71IS 202 – FALL 2002

EVI to EVIs

[Figure: several Index and EVI pairs, together with a further element labelled EVI2]

2002.11.07 - SLIDE 72IS 202 – FALL 2002

Find: Plutonium

In: Arabic Chinese Greek Japanese Korean Russian Tamil

Why not treat language the same way?

2002.11.07 - SLIDE 73IS 202 – FALL 2002

Find: Plutonium

In: Arabic Chinese Greek Japanese Korean Russian Tamil

Statistical association

$$W(c,t) = 2[\log L(p_1, a, a+b) + \ldots]$$

Digital library resources