More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora


Page 1

More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora

Gabriel Recchia, Cognitive Science; Michael N. Jones, Psychology
Indiana University, Bloomington

Page 2

Word pair           Similarity
car / automobile    0.975
lad / wizard        0.175
gem / jewel         0.875
glass / magician    0.025
journey / voyage    0.875
asylum / madhouse   0.900

Page 3: Pointwise Mutual Information (PMI)

Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance):

I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

(Church & Hanks, 1990)
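As an illustration (a minimal sketch, not the authors' released tool), PMI can be estimated from a corpus by counting how often words occur and co-occur. The sketch below assumes "co-occurrence" means appearing in the same document; the toy corpus is invented:

```python
from collections import Counter
from itertools import combinations
from math import log2

def train_pmi(documents):
    """Estimate PMI from a list of tokenized documents.

    Probabilities are estimated from document counts: P(x) is the fraction
    of documents containing x, P(x, y) the fraction containing both.
    """
    n = len(documents)
    word_df = Counter()   # documents containing each word
    pair_df = Counter()   # documents containing each unordered pair
    for doc in documents:
        types = sorted(set(doc))
        word_df.update(types)
        pair_df.update(combinations(types, 2))

    def pmi(x, y):
        p_xy = pair_df[tuple(sorted((x, y)))] / n
        if p_xy == 0:
            return float("-inf")   # the words never co-occur
        return log2(p_xy / ((word_df[x] / n) * (word_df[y] / n)))

    return pmi

# Toy example (invented data):
pmi = train_pmi([["car", "automobile", "road"],
                 ["car", "road", "trip"],
                 ["gem", "jewel"]])
print(round(pmi("car", "automobile"), 3))   # 0.585
```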

Page 4: Latent Semantic Analysis (LSA)

• Build a term-document matrix in which element (i, j) is the frequency of term i in document j
• Apply a log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find a rank-k approximation to the term-document matrix (sketched in code below)
• Optimize k for the task at hand

(Landauer & Dumais, 1997)
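A compact sketch of this pipeline, assuming one common form of log-entropy weighting (implementations differ in detail) and NumPy's SVD:

```python
import numpy as np

def lsa_term_vectors(term_doc, k):
    """Rank-k LSA term vectors from a (terms x documents) frequency matrix."""
    n_docs = term_doc.shape[1]
    # Log-entropy weighting: terms spread evenly over documents (high
    # entropy) get global weights near 0; focused terms get weights near 1.
    row_sums = np.maximum(term_doc.sum(axis=1, keepdims=True), 1)
    p = term_doc / row_sums
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    global_weight = 1.0 + entropy / np.log(n_docs)
    weighted = np.log(term_doc + 1.0) * global_weight[:, None]
    # Truncated SVD: keep only the k largest singular values/vectors.
    U, S, _ = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k] * S[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Word similarity is then the cosine between term vectors, with k (typically a few hundred) tuned per task, as the last bullet notes.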

Page 5: Latent Semantic Analysis (LSA)

• Criticisms:
  • Scalability
  • Incrementality
• Lessons from computational linguistics: simple models that can be trained on more data often outperform complex models that are restricted to less

Page 6

• Forced-choice tests
  • TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
  • ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
  • Rubenstein & Goodenough, 1965 (RG)
  • Miller & Charles, 1991 (MC)
  • Resnik, 1995 (R)
  • Finkelstein et al.'s WordSimilarity-353, 2002 (WS353)

Scoring for both test types is sketched below.
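The two benchmark types are scored differently: forced-choice tests by accuracy, similarity judgments by correlation with human ratings. A hedged sketch, assuming a generic word-pair similarity function `sim` and a simple item format:

```python
from scipy.stats import spearmanr

def forced_choice_accuracy(items, sim):
    """TOEFL/ESL-style scoring: pick the choice most similar to the probe.

    items: list of (probe, choices, answer) tuples.
    """
    hits = sum(max(choices, key=lambda c: sim(probe, c)) == answer
               for probe, choices, answer in items)
    return hits / len(items)

def rating_correlation(pairs, human_ratings, sim):
    """RG/MC/R/WS353-style scoring: correlate model scores with human
    similarity ratings (published studies report Pearson or Spearman)."""
    model_scores = [sim(x, y) for x, y in pairs]
    return spearmanr(model_scores, human_ratings).correlation
```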

Page 7: PMI vs. LSA (Budiu et al., 2007)

Task    PMI (TASA)  LSA (TASA)
ESL     .22         .44
TOEFL   .22         .60
RG      .61         .64
R       .61         .71
MC      .65         .75
WS353   .58         .60

Page 8: PMI vs. LSA (Budiu et al., 2007)

Task    PMI (Stanford)  LSA (TASA)
ESL     .52             .44
TOEFL   .51             .60
RG      .75             .64
R       .83             .71
MC      .79             .75
WS353   .71             .60

Page 9

• Budiu et al. (2007) concluded that PMI performs better when given more data
• However, their comparison had a confound: corpus size varied together with document size and type of text (web documents vs. carefully constructed textbook sentences)

Page 10: PMI vs. LSA (Experiment 1)

Task    PMI (Wiki subset)  LSA (Wiki subset)
ESL     .35                .36
TOEFL   .41                .44
RG      .47                .62
R       .46                .60
MC      .46                .46
WS353   .54                .57

Page 11: PMI vs. LSA (Experiment 1)

Task    PMI (full Wikipedia)  LSA (Wiki subset)
ESL     .62                   .36
TOEFL   .64                   .44
RG      .78                   .62
R       .86                   .60
MC      .76                   .46
WS353   .73                   .57

Page 12: Experiment 2

PMI trained on lots of data outperforms LSA trained on less. How does it compare with other measures of semantic relatedness?

To find out, compare it with other publicly available measures at the Rensselaer Measures of Semantic Relatedness (MSR) website, cwl-projects.cogsci.rpi.edu/msr (Veksler et al., 2008).

Page 13: Experiment 2

Measures compared:

• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)
• Normalized Search Similarity (Cilibrasi & Vitányi, 2007; Veksler et al., 2008) (sketched below)
• WordNet::Similarity vector measure (Pedersen et al., 2004)
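Normalized Search Similarity builds on search-count ideas from Cilibrasi & Vitányi. As a hedged sketch of that family of measures, the related Normalized Google Distance can be computed from hit counts as below; the exact NSS formula used on the MSR site may differ:

```python
from math import log

def normalized_google_distance(f_x, f_y, f_xy, n):
    """Normalized Google Distance (Cilibrasi & Vitanyi, 2007).

    f_x, f_y: hit counts for queries x and y; f_xy: hits for x AND y;
    n: (an estimate of) the total number of indexed pages.
    Smaller values indicate more closely related terms.
    """
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Toy counts (invented): frequent co-occurrence yields a small distance.
print(normalized_google_distance(1_000_000, 400_000, 300_000, 10_000_000_000))
```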

Page 14: Experiment 2

Task    PMI (full Wikipedia)  WN    NSS.F  NSS.T  SA.N  SA.W  LSA.T
ESL     .62                   .70   .44    .56    .39   .51   .44
TOEFL   .64                   .87   .59    .50    .61   .59   .55
RG      .78                   .88   .62    .53    .49   .39   .69
R       .86                   .90   .56    .54    .49   .52   .74
MC      .76                   .77   .61    .56    .45   .45   .61
WS353   .73                   .46   .60    .59    .40   .38   .60

• WN: WordNet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus

Page 15: Discussion

Released a tool for calculating PMI scores: http://www.indiana.edu/~clcl/lmoss/

Page 16: Discussion

Page 17: Discussion

• PMI does not take latent (indirect) information into account; unmodified, it is not plausible as a complete model of semantics
• However, its success when scaled to data on the order of human experience favors models based on simple co-occurrences (e.g., models based on vector addition, sketched below)
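For instance, a minimal sketch of such a simple co-occurrence model, with invented toy counts: each word is represented by its raw co-occurrence vector, and multiword expressions are composed by vector addition:

```python
import numpy as np

# Invented toy co-occurrence counts (rows: words, columns: context terms).
vocab = {"journey": 0, "voyage": 1, "glass": 2}
cooc = np.array([[8.0, 2.0, 0.0],
                 [7.0, 3.0, 1.0],
                 [0.0, 1.0, 9.0]])

def vec(words):
    # Vector addition: an expression's vector is the sum of its words' rows.
    return sum(cooc[vocab[w]] for w in words)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec(["journey"]), vec(["voyage"])))  # high: similar contexts
print(cosine(vec(["journey"]), vec(["glass"])))   # low: different contexts
```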

Page 18: Conclusions

• Simple, scalable, incremental models of semantic similarity show promise
• The results suggest that complexity should be left in the data rather than added to the model
• A publicly available tool allows non-programmers to retrieve corpus-specific semantic similarity scores