Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning

Evaluating the Utility of Vector Differences for Lexical Relation Learning

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Tim Baldwin

9 August 2016




The utility of difference vectors

DIFFVEC = word2 − word1

Vector difference, or offset (Mikolov et al., 2013):
king − man + woman ≈ queen
CAPITAL-CITY: Paris − France + Poland ≈ Warsaw
PLURALISATION: cars − car + apple ≈ apples

Can DIFFVEC(w1, w2) be clustered or classified into a broad-coverage set of lexical relations?
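The offset arithmetic above can be sketched with toy vectors. The 4-dimensional embeddings below are made-up illustrative values, not real word2vec vectors:

```python
import numpy as np

# Toy 4-dimensional embeddings; values are made up for illustration
# and are not real word2vec vectors.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.9, 0.1, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "queen": np.array([0.1, 0.8, 0.9, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# DIFFVEC = word2 - word1: the offset king - man, added to woman,
# should land near queen if the relation is encoded as a direction.
offset = emb["king"] - emb["man"]
predicted = emb["woman"] + offset

# Nearest vocabulary word to the predicted point.
best = max(emb, key=lambda w: cosine(predicted, emb[w]))
print(best)  # -> queen
```

The same offset-then-nearest-neighbour search is what the CAPITAL-CITY and PLURALISATION examples rely on.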


Types of relations

Lexical semantic relations:
- LEXSEM_Hyper: Hypernymy (animal, dog)
- LEXSEM_Mero: Meronymy (bird, wing)
- LEXSEM_Event: Object's action (zip, coat)

Morphosyntactic relations:
- VERB_Past: Present, 1st → Past (know, knew)
- VERB_3: Present, 1st → Present, 3rd (know, knows)
- VERB_3Past: Present, 3rd → Past (knows, knew)
- NOUN_SP: Singular → Plural (year, years)

Morphosemantic relations:
- VERB_NOUN: Nominalisation of a verb (drive, drift)
- PREFIX: Prefixing with the re- morpheme (vote, revote)


Word Embeddings

Name        Dimensions  Training data (tokens)
w2v         300         100 × 10^9
GloVe       200         6 × 10^9
SENNA       100         37 × 10^6
HLBL        200         37 × 10^6
w2v_wiki    300         50 × 10^6
GloVe_wiki  300         50 × 10^6
SVD_wiki    300         50 × 10^6

The models used: w2v (Mikolov et al., 2013), GloVe (Pennington et al., 2014), SENNA (Collobert et al., 2011), HLBL (Mnih and Hinton, 2009), PPMI+SVD (Levy and Goldberg, 2015).


Closed-World Experiments

Closed-world setting: multi-class classifier.
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
(w_i, w_j) ↦ r_k ∈ R,
i.e. every word pair can be uniquely classified according to a relation in R.


Spectral clustering: t-SNE projection for 10 samples per class

[Figure: t-SNE projection of DIFFVECs, coloured by class: LEXSEM_Attr, LEXSEM_Cause, NOUN_Coll, LEXSEM_Event, LEXSEM_Hyper, LVC, LEXSEM_Mero, NOUN_SP, PREFIX, LEXSEM_Ref, LEXSEM_Space, VERB_3, VERB_3Past, VERB_Past, VERB_NOUN]

Methods

Clustering algorithm: spectral clustering (von Luxburg, 2007), with two hyperparameters: (1) the number of clusters; and (2) the pairwise similarity measure for comparing DIFFVECs.

Clustering evaluation: V-Measure (Rosenberg and Hirschberg, 2007).
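A minimal sketch of this pipeline, assuming scikit-learn and synthetic DIFFVECs. The two offset directions and the RBF affinity are illustrative assumptions, not the talk's exact setup:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

# Two synthetic "relations", each a fixed offset direction plus noise,
# standing in for the DIFFVECs of two relation types.
offset_a = np.array([3.0, 0.0])
offset_b = np.array([0.0, 3.0])
diffvecs = np.vstack([
    offset_a + 0.1 * rng.standard_normal((20, 2)),
    offset_b + 0.1 * rng.standard_normal((20, 2)),
])
gold = np.array([0] * 20 + [1] * 20)

# The two hyperparameters from the slide: the number of clusters and
# the pairwise similarity (here an RBF affinity, an assumed choice).
clustering = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0)
pred = clustering.fit_predict(diffvecs)

# V-Measure compares the predicted clusters against the gold relations.
score = v_measure_score(gold, pred)
print(round(score, 2))
```

On this cleanly separated toy data the score should be near its maximum of 1.0; on real DIFFVECs the next slide's curves sit much lower.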


Clustering results

[Figure: V-Measure (y-axis, roughly 0.15-0.40) against the number of clusters (x-axis, 10-80) for w2v, w2v_wiki, GloVe, GloVe_wiki, SVD_wiki, HLBL, and SENNA]

Clustering results

Incorrectly classified due to ambiguity, or one word overwhelming the other:
- studies − study ⇒ VERB_NOUN
- saw − utensil ⇒ VERB_Past
- tigers − ambush ⇒ NOUN_Coll

Single "hypernym"-specific clusters: necklace − unit, wristband − unit, hairpin − unit

Semantic sub-clusters:
- movement verb − animal noun
- food verb − food noun
- action verb − profession noun


From Clustering to Classification

Encouraged by the results of the clustering experiment, we next move to classification experiments. We train a multi-class linear classifier to differentiate between the relation types.
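The closed-world classification step can be sketched as follows, again on synthetic DIFFVECs. Scikit-learn's LinearSVC stands in for the multi-class linear classifier; the data and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Synthetic DIFFVECs for three relation types, each centred on its own
# offset direction (toy stand-ins for e.g. NOUN_SP, VERB_Past, PREFIX).
offsets = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
X = np.vstack([o + 0.2 * rng.standard_normal((30, 3)) for o in offsets])
y = np.repeat([0, 1, 2], 30)

# A multi-class linear SVM over the difference vectors (closed-world:
# every pair belongs to exactly one relation in R).
clf = LinearSVC(C=1.0).fit(X, y)
micro_f1 = f1_score(y, clf.predict(X), average="micro")
print(round(micro_f1, 2))
```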


Classification: Multi-class linear SVM, F-scores

[Figure: F-scores per relation — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll, and Micro-Avg — for Baseline, w2v, w2v_wiki, and SVD_wiki]

Open-World Experiments

Open-world setting: binary classifier.
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
(w_i, w_j) ↦ r_k ∈ R ∪ {φ},
where φ signifies that none of the relations in R apply to the word pair.


Binary classification

We add random pairs, i.e. randomly linked word pairs.

Generating random pairs:
(1) sample seed words proportional to their frequency in Wikipedia;
(2) take the Cartesian product over pairs of words from the seed lexicon;
(3) sample word pairs uniformly from this set.

Training the classifiers: train 9 binary SVM classifiers with an RBF kernel and evaluate on a test set augmented with the random samples.
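The three sampling steps can be sketched with the standard library. The seed lexicon and its counts below are made up; real seeds would come from Wikipedia frequencies:

```python
import random

random.seed(0)

# Hypothetical seed lexicon with made-up counts standing in for
# Wikipedia frequencies.
freq = {"year": 90, "know": 60, "bird": 30, "vote": 20, "coat": 10}
words, weights = zip(*freq.items())

# (1) sample seed words proportional to their corpus frequency
seeds = random.choices(words, weights=weights, k=4)

# (2) Cartesian product over pairs of distinct seed words
candidates = [(a, b) for a in seeds for b in seeds if a != b]

# (3) sample word pairs uniformly from that set
random_pairs = random.sample(candidates, k=min(3, len(candidates)))
print(random_pairs)
```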


Open-World Results

[Figure: open-world precision and recall per relation (Pr: No NS, Re: No NS) — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]

Binary classification

Results: the classifiers correctly captured many true instances of the relations (high recall), but also labelled many of the random samples as related (low precision):

(have, works), (turn, took), (works, started) ⇒ VERB_3, VERB_Past and VERB_3Past

NOUN_Coll: everything related to animals.
LEXSEM_Mero: mainly relations consisting of nouns.

Relational similarity ≈ a combination of attributional similarities. The classifier captures some of these (e.g., syntactic) while others (e.g., semantic) may be missing.


Binary classification

The classifier over-generalises → add extra negative samples to the training data.

Negative samples:
- Opposite pairs: switch the order of the word pair, Oppos(w1, w2) = word1 − word2.
- Shuffled pairs: replace w2 with a random word w2' from the same relation, Shuff(w1, w2) = word2' − word1.
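Both negative-sampling schemes can be sketched as follows; the NOUN_SP pairs are illustrative:

```python
import random

random.seed(2)

# Positive pairs for one relation (NOUN_SP, singular -> plural).
positives = [("year", "years"), ("car", "cars"), ("bird", "birds")]

def opposite(pairs):
    # Oppos: switch the order of each pair, negating the DIFFVEC direction.
    return [(w2, w1) for w1, w2 in pairs]

def shuffled(pairs):
    # Shuff: replace w2 with a random w2' drawn from another pair of the
    # same relation, breaking the pairing while keeping the word types.
    w2s = [w2 for _, w2 in pairs]
    return [(w1, random.choice([w for w in w2s if w != w2]))
            for w1, w2 in pairs]

negatives = opposite(positives) + shuffled(positives)
print(negatives)
```

Each positive pair thus contributes two negatives, neither of which is a genuine instance of the relation.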


Binary classification with negative samples

Usage of negative samples

[Figure: precision and recall per relation with and without negative samples (Pr: No NS, Pr: With NS, Re: No NS, Re: With NS) — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]


Lexical memorization

A short example. Train:
- LEXSEM_Hyper: Hypernymy (animal, dog)
- LEXSEM_Hyper: Hypernymy (animal, cat)
- LEXSEM_Hyper: Hypernymy (animal, monkey)

Then in test:
- LEXSEM_Hyper: Hypernymy (animal, banana)

The classifier can score such a pair correctly simply by memorising that "animal" occurs as the first word of hypernym pairs, without learning the relation itself.
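One way to diagnose memorisation is to enforce a split with no lexical overlap between train and test, as in the next experiment. A greedy sketch on made-up pairs:

```python
# Toy pairs (made up) standing in for the relation data; the goal is a
# train/test split with no lexical overlap, so a classifier cannot score
# well by memorising individual words such as "animal".
pairs = [("animal", "dog"), ("animal", "cat"), ("animal", "monkey"),
         ("fruit", "banana"), ("fruit", "apple"), ("vehicle", "car")]

train, test = [], []
train_vocab, test_vocab = set(), set()
for i, pair in enumerate(pairs):
    words = set(pair)
    in_train, in_test = bool(words & train_vocab), bool(words & test_vocab)
    if in_train and in_test:
        continue  # straddles both sides: drop it to preserve disjointness
    if in_train or (not in_test and i % 2 == 0):
        train.append(pair)
        train_vocab |= words
    else:
        test.append(pair)
        test_vocab |= words

# No word occurs in both train and test.
print(sorted(train_vocab & test_vocab))  # -> []
```

This greedy pass may drop pairs that share vocabulary with both sides; that is the price of a lexically disjoint evaluation.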


Lexical memorization: No lexical overlap between test and train

[Figure: precision, recall and F-score (P, R, F; P+neg, R+neg, F+neg with negative samples) against the volume of random word pairs (0-5)]

Conclusion

- Many types of morphosyntactic difference are captured by DIFFVECs; morphosemantic relations are somewhat harder, and lexical semantic relations are captured less well.
- Classification over DIFFVECs works extremely well in a closed-world setup, but less well over open data.
- With the introduction of automatically generated negative samples, however, the results improve substantially.


Open Questions

- Could some examples be more representative of their relation type?
- How much data do we need for the best generalisation?
- Some morphosemantic relations (derivations: prefixing) and lexical relations (meronymy) are still hard to capture.


Thanks

Thank you for your time and attention! Questions?
See more details here: http://arxiv.org/abs/1509.01692
