Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning

Evaluating the Utility of Vector Differences for Lexical Relation Learning

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Tim Baldwin

9 August 2016




The utility of difference vectors

DIFFVEC = word2 − word1

Vector difference, or offset (Mikolov et al., 2013):
king − man + woman ≈ queen
CAPITAL-CITY: Paris − France + Poland ≈ Warsaw
PLURALISATION: cars − car + apple ≈ apples

Can DIFFVEC(w1, w2) be clustered or classified into a broad-coverage set of lexical relations?
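The offset arithmetic above can be sketched with toy vectors. The 4-dimensional embeddings below are made-up illustrative values, not real word2vec vectors:

```python
import numpy as np

# Toy 4-dimensional embeddings; values are made up for illustration
# and are not real word2vec vectors.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.9, 0.1, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "queen": np.array([0.1, 0.8, 0.9, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# DIFFVEC = word2 - word1: the offset king - man, added to woman,
# should land near queen if the relation is encoded as a direction.
offset = emb["king"] - emb["man"]
predicted = emb["woman"] + offset

# Nearest vocabulary word to the predicted point.
best = max(emb, key=lambda w: cosine(predicted, emb[w]))
print(best)  # -> queen
```

The same offset-then-nearest-neighbour search is what the CAPITAL-CITY and PLURALISATION examples rely on.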


Types of relations

Lexical semantic relations:
- LEXSEM_Hyper: Hypernymy (animal, dog)
- LEXSEM_Mero: Meronymy (bird, wing)
- LEXSEM_Event: Object's action (zip, coat)

Morphosyntactic relations:
- VERB_Past: Present, 1st → Past (know, knew)
- VERB_3: Present, 1st → Present, 3rd (know, knows)
- VERB_3Past: Present, 3rd → Past (knows, knew)
- NOUN_SP: Singular → Plural (year, years)

Morphosemantic relations:
- VERB_NOUN: Nominalisation of a verb (drive, drift)
- PREFIX: Prefixing with the re- morpheme (vote, revote)


Word Embeddings

Name        Dimensions  Training data (tokens)
w2v         300         100 × 10^9
GloVe       200         6 × 10^9
SENNA       100         37 × 10^6
HLBL        200         37 × 10^6
w2v_wiki    300         50 × 10^6
GloVe_wiki  300         50 × 10^6
SVD_wiki    300         50 × 10^6

The models used: w2v (Mikolov et al., 2013), GloVe (Pennington et al., 2014), SENNA (Collobert et al., 2011), HLBL (Mnih and Hinton, 2009), PPMI+SVD (Levy and Goldberg, 2015).


Closed-World Experiments

Closed-world setting: multi-class classifier.
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
(w_i, w_j) ↦ r_k ∈ R,
i.e. every word pair can be uniquely classified according to a relation in R.


Spectral clustering: t-SNE projection for 10 samples per class

[Figure: t-SNE projection of DIFFVECs, coloured by class: LEXSEM_Attr, LEXSEM_Cause, NOUN_Coll, LEXSEM_Event, LEXSEM_Hyper, LVC, LEXSEM_Mero, NOUN_SP, PREFIX, LEXSEM_Ref, LEXSEM_Space, VERB_3, VERB_3Past, VERB_Past, VERB_NOUN]

Methods

Clustering algorithm: spectral clustering (von Luxburg, 2007), with two hyperparameters: (1) the number of clusters; and (2) the pairwise similarity measure for comparing DIFFVECs.

Clustering evaluation: V-Measure (Rosenberg and Hirschberg, 2007).
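A minimal sketch of this pipeline, assuming scikit-learn and synthetic DIFFVECs. The two offset directions and the RBF affinity are illustrative assumptions, not the talk's exact setup:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

# Two synthetic "relations", each a fixed offset direction plus noise,
# standing in for the DIFFVECs of two relation types.
offset_a = np.array([3.0, 0.0])
offset_b = np.array([0.0, 3.0])
diffvecs = np.vstack([
    offset_a + 0.1 * rng.standard_normal((20, 2)),
    offset_b + 0.1 * rng.standard_normal((20, 2)),
])
gold = np.array([0] * 20 + [1] * 20)

# The two hyperparameters from the slide: the number of clusters and
# the pairwise similarity (here an RBF affinity, an assumed choice).
clustering = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0)
pred = clustering.fit_predict(diffvecs)

# V-Measure compares the predicted clusters against the gold relations.
score = v_measure_score(gold, pred)
print(round(score, 2))
```

On this cleanly separated toy data the score should be near its maximum of 1.0; on real DIFFVECs the next slide's curves sit much lower.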


Clustering results

[Figure: V-Measure (y-axis, roughly 0.15-0.40) against the number of clusters (x-axis, 10-80) for w2v, w2v_wiki, GloVe, GloVe_wiki, SVD_wiki, HLBL, and SENNA]

Clustering results

Incorrectly classified due to ambiguity, or one word overwhelming the other:
- studies − study ⇒ VERB_NOUN
- saw − utensil ⇒ VERB_Past
- tigers − ambush ⇒ NOUN_Coll

Single "hypernym"-specific clusters: necklace − unit, wristband − unit, hairpin − unit

Semantic sub-clusters:
- movement verb − animal noun
- food verb − food noun
- action verb − profession noun


From Clustering to Classification

Encouraged by the results of the clustering experiment, we next move to classification experiments. We train a multi-class linear classifier to differentiate between the relation types.
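The closed-world classification step can be sketched as follows, again on synthetic DIFFVECs. Scikit-learn's LinearSVC stands in for the multi-class linear classifier; the data and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Synthetic DIFFVECs for three relation types, each centred on its own
# offset direction (toy stand-ins for e.g. NOUN_SP, VERB_Past, PREFIX).
offsets = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
X = np.vstack([o + 0.2 * rng.standard_normal((30, 3)) for o in offsets])
y = np.repeat([0, 1, 2], 30)

# A multi-class linear SVM over the difference vectors (closed-world:
# every pair belongs to exactly one relation in R).
clf = LinearSVC(C=1.0).fit(X, y)
micro_f1 = f1_score(y, clf.predict(X), average="micro")
print(round(micro_f1, 2))
```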


Classification: Multi-class linear SVM, F-scores

[Figure: F-scores per relation — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll, and Micro-Avg — for Baseline, w2v, w2v_wiki, and SVD_wiki]

Open-World Experiments

Open-world setting: binary classifier.
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
(w_i, w_j) ↦ r_k ∈ R ∪ {φ},
where φ signifies that none of the relations in R apply to the word pair.


Binary classification

We add random pairs, i.e. randomly linked word pairs.

Generating random pairs:
(1) sample seed words proportional to their frequency in Wikipedia;
(2) take the Cartesian product over pairs of words from the seed lexicon;
(3) sample word pairs uniformly from this set.

Training the classifiers: train 9 binary SVM classifiers with an RBF kernel and evaluate on a test set augmented with the random samples.
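The three sampling steps can be sketched with the standard library. The seed lexicon and its counts below are made up; real seeds would come from Wikipedia frequencies:

```python
import random

random.seed(0)

# Hypothetical seed lexicon with made-up counts standing in for
# Wikipedia frequencies.
freq = {"year": 90, "know": 60, "bird": 30, "vote": 20, "coat": 10}
words, weights = zip(*freq.items())

# (1) sample seed words proportional to their corpus frequency
seeds = random.choices(words, weights=weights, k=4)

# (2) Cartesian product over pairs of distinct seed words
candidates = [(a, b) for a in seeds for b in seeds if a != b]

# (3) sample word pairs uniformly from that set
random_pairs = random.sample(candidates, k=min(3, len(candidates)))
print(random_pairs)
```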


Open-World Results

[Figure: open-world precision and recall per relation (Pr: No NS, Re: No NS) — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]

Binary classification

Results: the classifiers correctly captured many true instances of the relations (high recall), but also labelled many of the random samples as related (low precision):

(have, works), (turn, took), (works, started) ⇒ VERB_3, VERB_Past and VERB_3Past

NOUN_Coll: everything related to animals.
LEXSEM_Mero: mainly relations consisting of nouns.

Relational similarity ≈ a combination of attributional similarities. The classifier captures some of these (e.g., syntactic) while others (e.g., semantic) may be missing.


Binary classification

The classifier over-generalises → add extra negative samples to the training data.

Negative samples:
- Opposite pairs: switch the order of the word pair, Oppos(w1, w2) = word1 − word2.
- Shuffled pairs: replace w2 with a random word w2' from the same relation, Shuff(w1, w2) = word2' − word1.
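Both negative-sampling schemes can be sketched as follows; the NOUN_SP pairs are illustrative:

```python
import random

random.seed(2)

# Positive pairs for one relation (NOUN_SP, singular -> plural).
positives = [("year", "years"), ("car", "cars"), ("bird", "birds")]

def opposite(pairs):
    # Oppos: switch the order of each pair, negating the DIFFVEC direction.
    return [(w2, w1) for w1, w2 in pairs]

def shuffled(pairs):
    # Shuff: replace w2 with a random w2' drawn from another pair of the
    # same relation, breaking the pairing while keeping the word types.
    w2s = [w2 for _, w2 in pairs]
    return [(w1, random.choice([w for w in w2s if w != w2]))
            for w1, w2 in pairs]

negatives = opposite(positives) + shuffled(positives)
print(negatives)
```

Each positive pair thus contributes two negatives, neither of which is a genuine instance of the relation.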


Binary classification with negative samples

Usage of negative samples

[Figure: precision and recall per relation with and without negative samples (Pr: No NS, Pr: With NS, Re: No NS, Re: With NS) — Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]


Lexical memorization

A short example. Train:
- LEXSEM_Hyper: Hypernymy (animal, dog)
- LEXSEM_Hyper: Hypernymy (animal, cat)
- LEXSEM_Hyper: Hypernymy (animal, monkey)

Then in test:
- LEXSEM_Hyper: Hypernymy (animal, banana)

The classifier can score such a pair correctly simply by memorising that "animal" occurs as the first word of hypernym pairs, without learning the relation itself.
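One way to diagnose memorisation is to enforce a split with no lexical overlap between train and test, as in the next experiment. A greedy sketch on made-up pairs:

```python
# Toy pairs (made up) standing in for the relation data; the goal is a
# train/test split with no lexical overlap, so a classifier cannot score
# well by memorising individual words such as "animal".
pairs = [("animal", "dog"), ("animal", "cat"), ("animal", "monkey"),
         ("fruit", "banana"), ("fruit", "apple"), ("vehicle", "car")]

train, test = [], []
train_vocab, test_vocab = set(), set()
for i, pair in enumerate(pairs):
    words = set(pair)
    in_train, in_test = bool(words & train_vocab), bool(words & test_vocab)
    if in_train and in_test:
        continue  # straddles both sides: drop it to preserve disjointness
    if in_train or (not in_test and i % 2 == 0):
        train.append(pair)
        train_vocab |= words
    else:
        test.append(pair)
        test_vocab |= words

# No word occurs in both train and test.
print(sorted(train_vocab & test_vocab))  # -> []
```

This greedy pass may drop pairs that share vocabulary with both sides; that is the price of a lexically disjoint evaluation.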


Lexical memorization: No lexical overlap between test and train

[Figure: precision, recall and F-score (P, R, F; P+neg, R+neg, F+neg with negative samples) against the volume of random word pairs (0-5)]

Conclusion

- Many types of morphosyntactic difference are captured by DIFFVECs; morphosemantic relations are somewhat harder, and lexical semantic relations are captured less well.
- Classification over DIFFVECs works extremely well in a closed-world setup, but less well over open data.
- With the introduction of automatically generated negative samples, however, the results improve substantially.


Open Questions

- Could some examples be more representative of their relation type?
- How much data do we need for the best generalisation?
- Some morphosemantic relations (derivations: prefixing) and lexical relations (meronymy) are still hard to capture.


Thanks

Thank you for your time and attention! Questions?
See more details here: http://arxiv.org/abs/1509.01692
