Qualitative Differences Between Human Data and Co-occurrence Models of Semantic Similarity Gabriel Recchia

Upload: gabriel-recchia

Post on 13-Jun-2015


TRANSCRIPT

Page 1: Qualitative differences between human behavioral data and co-occurrence models of semantic similarity

Qualitative Differences Between Human Data and Co-occurrence Models of Semantic Similarity

Gabriel Recchia

Page 2

Background

• Corpus-based methods

– Pointwise Mutual Information (Church & Hanks, 1989)

– Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)

– Latent Semantic Analysis (Landauer & Dumais, 1997)

– Generalized Latent Semantic Analysis (Matveeva et al., 2005)

• WordNet-based methods

– Resnik (1995), Jiang & Conrath (1997), Hirst & St-Onge (1998), Leacock & Chodorow (1998), Lin (1998), Pedersen et al. (2004)

Page 3

1. Pointwise Mutual Information (PMI)

Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)

$$I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

Page 4

1. Pointwise Mutual Information (PMI)

Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)

In practice, I(x,y) is estimated from document counts:

$$\frac{\#\text{ of docs containing both } x \text{ and } y}{(\#\text{ of docs containing } x)\,(\#\text{ of docs containing } y)}$$
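The document-count version of PMI can be sketched as follows. This is a minimal illustration, assuming each probability is approximated by a document frequency (for example, P(x) ≈ n_x / N, where N is the total number of documents) and that the base-2 logarithm from the definition is kept. The function name, argument names, and the counts in the usage lines are hypothetical. The slides do not say how pairs that never co-occur are handled, so returning negative infinity is an assumption.

```python
import math


def pmi_from_doc_counts(n_xy, n_x, n_y, n_docs):
    """PMI in bits, with probabilities estimated as document frequencies.

    n_xy   -- number of documents containing both x and y
    n_x    -- number of documents containing x
    n_y    -- number of documents containing y
    n_docs -- total number of documents in the corpus (assumed known)
    """
    if n_x <= 0 or n_y <= 0 or n_docs <= 0:
        raise ValueError("n_x, n_y and n_docs must be positive")
    if n_xy == 0:
        # log2(0) is undefined; -inf is one convention (an assumption here).
        return float("-inf")
    # P(x,y) / (P(x) P(y)) = (n_xy / N) / ((n_x / N) (n_y / N))
    #                      = n_xy * N / (n_x * n_y)
    return math.log2((n_xy * n_docs) / (n_x * n_y))


# Hypothetical counts: x and y co-occur four times more often than chance.
print(pmi_from_doc_counts(n_xy=20, n_x=100, n_y=50, n_docs=1000))  # 2.0
# Hypothetical counts: x and y co-occur exactly as often as chance predicts.
print(pmi_from_doc_counts(n_xy=5, n_x=100, n_y=50, n_docs=1000))   # 0.0
```

Because N is the same for every word pair, dropping it (or the logarithm) changes the scores but not the ranking of pairs, which is what matters for rank-based evaluations.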

Page 5

• PMI is a metric, not a process model

• Like PMI, simple vector space models

– reward co-occurrences

– penalize highly frequent words

• Unlike PMI, they also store latent/indirect similarity information

Page 6

2. Vector Addition Model

Page 7

3. Latent Semantic Analysis (LSA)

(Landauer & Dumais, 1997)

• Build a term-document matrix where element (i,j) describes the frequency of term i in document j

• Apply a log-entropy weighting scheme to decrease the weight of high-frequency words

• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k

• Optimize k for the task at hand
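The LSA steps above can be sketched with NumPy. This is a toy illustration under stated assumptions, not the pipeline used in the talk: the slides do not give the exact log-entropy formula, so a common variant is assumed (local weight log(1 + tf), global weight 1 + Σ_j p_ij log p_ij / log n), and k is fixed by hand rather than optimized for a task. All function names and the tiny count matrix are hypothetical.

```python
import numpy as np


def log_entropy_weight(counts):
    """Log-entropy weight a term-document count matrix (rows = terms)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]  # must be > 1, otherwise log(n_docs) is 0
    term_totals = counts.sum(axis=1, keepdims=True)
    # p[i, j]: share of term i's occurrences that fall in document j
    p = np.divide(counts, term_totals,
                  out=np.zeros_like(counts), where=term_totals > 0)
    # p * log(p), treating 0 * log(0) as 0
    p_log_p = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    # Global weight: 1 for a term confined to one document,
    # 0 for a term spread evenly over all documents.
    global_weight = 1.0 + p_log_p.sum(axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(counts) * global_weight


def lsa_term_vectors(counts, k):
    """Return k-dimensional term vectors from a rank-k truncated SVD."""
    weighted = log_entropy_weight(counts)
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]  # scale each kept dimension by its singular value


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical counts: 4 terms (rows) x 4 documents (columns).
counts = np.array([
    [2, 1, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [1, 1, 1, 1],  # evenly spread term: its global weight is 0
])
vectors = lsa_term_vectors(counts, k=2)
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

In this toy matrix the first two terms have identical rows, so their cosine is 1 up to rounding, while the first and third terms share no documents and come out orthogonal. Real corpora would need a sparse matrix and a truncated SVD routine rather than a dense full decomposition.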

Page 8

• Forced-choice synonymy tests

– TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)

– ESL synonymy test (Turney, 2001) (ESL)

• Semantic similarity judgments

– Rubenstein & Goodenough, 1965 (RG)

– Miller & Charles, 1991 (MC)

– Resnik, 1995 (R)

– Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)

Page 9

Trained on small Wikipedia subset:

Task PMI Vectors

ESL 0.35 0.28

TOEFL 0.41 0.33

RG 0.47 0.14

R 0.46 0.17

MC 0.46 0.12

WS353 0.54 0.57

Page 10

PMI vs. LSA

Task PMI (Wiki subset) LSA (Wiki subset)

ESL .35 .36

TOEFL .41 .44

RG .47 .62

R .46 .60

MC .46 .46

WS353 .54 .57

Page 11

Task PMI (full Wikipedia) WN NSS.F NSS.T SA.N SA.W LSA.T

ESL .62 .70 .44 .56 .39 .51 .44

TOEFL .64 .87 .59 .50 .61 .59 .55

RG .78 .88 .62 .53 .49 .39 .69

R .86 .90 .56 .54 .49 .52 .74

MC .76 .77 .61 .56 .45 .45 .61

WS353 .73 .46 .60 .59 .40 .38 .60

• WN: WordNet::Similarity vector measure

• NSS.F: Normalized Search Similarity, using Factiva business news corpus

• NSS.T: Normalized Search Similarity, using TASA corpus

• SA.N: Spreading Activation, using Google counts restricted to nytimes.com

• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org

• LSA.T: LSA, using TASA corpus

Page 12

PMI vs. Vector Addition vs. LSA

Task PMI (Wikipedia) Vectors (Wikipedia) LSA (Wiki subset)

ESL .62 .44 .36

TOEFL .64 .54 .44

RG .78 .62 .62

R .86 .73 .60

MC .76 .68 .46

WS353 .73 .45 .57

Page 13

Task cor(pmi, hum) cor(vec, hum) cor(pmi, vec)

ESL -- -- --

TOEFL -- -- --

RG 0.78 0.62 0.52

R 0.86 0.73 0.64

MC 0.76 0.68 0.57

WS353 0.73 0.45 0.54
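Correlations like those in the table can be computed as in the sketch below. It assumes the cor(...) values are Spearman rank correlations, which the slides suggest elsewhere ("rank correlations") but do not state for this table. The function names and the four-pair score lists are hypothetical and are not data from the talk. Ties receive their average rank, which matters when a model assigns the same score to many pairs.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for four word pairs (not data from the talk).
human = [3.9, 3.1, 1.2, 0.4]
pmi = [5.0, 2.0, 0.0, 0.0]   # two pairs tied at zero
vec = [0.8, 0.9, 0.3, 0.1]

print("cor(pmi, hum):", spearman(pmi, human))
print("cor(vec, hum):", spearman(vec, human))
print("cor(pmi, vec):", spearman(pmi, vec))
```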

Page 14

Rubenstein & Goodenough Word Pairs

[Figure: two panels. x-axis: Rank (Humans), from most similar to least similar; y-axes: Rank (PMI) and Rank (Vectors).]

Page 15

Resnik Word Pairs

[Figure: two panels. x-axis: Rank (Humans), from most similar to least similar; y-axes: Rank (PMI) and Rank (Vectors).]

Page 16

Miller & Charles Word Pairs

[Figure: two panels. x-axis: Rank (Humans), from most similar to least similar; y-axes: Rank (PMI) and Rank (Vectors).]

Page 17

WordSim353 Word Pairs

[Figure: two panels. x-axis: Rank (Humans), from most similar to least similar; y-axes: Rank (PMI) and Rank (Vectors).]

Page 18

Proportion of Word Pairs that PMI Assigns a Similarity Value of 0 to, By Task

ESL TOEFL RG R MC WS353

.26 .34 .25 .21 .20 .04
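A proportion like those in the table can be computed as in the sketch below. The function name and the score list are hypothetical. The slides do not say why PMI yields 0 for a pair; one plausible reason, stated here only as an assumption, is that the two words never co-occur in the corpus.

```python
def proportion_zero(scores):
    """Fraction of word pairs whose similarity score is exactly 0."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 0) / len(scores)


# Hypothetical PMI scores for five word pairs (not data from the talk).
pmi_scores = [0.0, 1.2, 0.0, 3.4, 0.7]
print(proportion_zero(pmi_scores))  # 0.4
```

Pairs that share a score of 0 are tied in any ranking, so a high proportion limits how finely the model can order the least similar pairs.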

Page 19

Rubenstein & Goodenough Word Pairs

[Figure: four panels: Vectors, PMI, Humans, LSA.]

Page 20

Resnik Word Pairs

[Figure: four panels: Vectors, PMI, Humans, LSA.]

Page 21

Miller & Charles Word Pairs

[Figure: four panels: Vectors, PMI, Humans, LSA.]

Page 22

WordSim353 Word Pairs

[Figure: four panels: Vectors, PMI, Humans, LSA.]

Page 23

USF Free Association Norms

[Figure: four panels: Vectors, PMI, Humans (Forward Strengths), LSA.]

Page 24

• Looking at rank correlations alone obscures important distributional properties that should not be ignored if the goal is to emulate human semantic representations

• Closer attention to qualitative trends should guide model design