
Lecture 6: Vector Space Model

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


This lecture

v How to represent a word, a sentence, or a document?

v How to infer the relationship among words?

v We focus on “semantics”: distributional semantics

v What is the meaning of “life”?



How to represent a word

v Naïve way: represent words as atomic symbols: student, talk, university
v N-gram language model, logical analysis

v Represent a word as a “one-hot” vector: [ 0 0 0 1 0 … 0 ]

v How large is this vector?
v PTB data: ~50k; Google 1T data: 13M

v 𝑣 ⋅ 𝑢 =?


(Vector dimensions correspond to words: egg, student, talk, university, happy, buy.)

Issues?

v Dimensionality is large; vector is sparse
v No similarity

v Cannot represent new words
v Any idea?


𝑣_happy = [ 0 0 0 1 0 … 0 ], 𝑣_buy = [ 0 0 1 0 0 … 0 ], 𝑣_talk = [ 1 0 0 0 0 … 0 ]

𝑣_happy ⋅ 𝑣_buy = 𝑣_happy ⋅ 𝑣_talk = 0
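A minimal numpy sketch (not from the slides) that makes the zero dot products above concrete; the 6-word vocabulary is the toy one shown on the slide.

import numpy as np

# Toy one-hot vectors over a 6-word vocabulary.
vocab = ['egg', 'student', 'talk', 'university', 'happy', 'buy']

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

v_happy, v_buy, v_talk = one_hot('happy'), one_hot('buy'), one_hot('talk')
print(v_happy @ v_buy)    # 0.0
print(v_happy @ v_talk)   # 0.0 -- one-hot vectors carry no notion of similarity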

Idea 1: Taxonomy (Word category)


What is “car”?

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]


>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

Word similarity?


>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
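As a companion to lowest_common_hypernyms, a short sketch using NLTK's path_similarity, which turns the taxonomy into a similarity score; this call is real NLTK but not shown on the slide.

from nltk.corpus import wordnet as wn

# path_similarity scores a pair of senses by the shortest path between them in
# the hypernym graph (1.0 means identical senses; longer paths give smaller scores).
right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
novel = wn.synset('novel.n.01')

print(right.path_similarity(minke))   # fairly high: both are baleen whales
print(right.path_similarity(novel))   # close to zero: they only share 'entity'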

Requires human labor

Taxonomy (Word category)

v Synonym, hypernym (Is-A), hyponym


Idea 2: Similarity = Clustering


Cluster n-gram model

v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information


Implementation of the Brown hierarchical word clustering algorithm, by Percy Liang

Idea 3: Distributional representation

v Linguistic items with similar distributions have similar meanings
v i.e., words occur in the same contexts ⇒ similar meaning


"a word is characterized by the company it keeps” --Firth, John, 1957

Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meaning is a vector of “basic concepts”

v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance?


𝑣_king = [0.8 0.9 0.1 0 …], 𝑣_queen = [0.8 0.1 0.8 0 …], 𝑣_apple = [0.1 0.2 0.1 0.8 …]

(basic concepts: royalty, masculinity, femininity, eatable)

An illustration of vector space model


(Figure: word vectors w1–w5 plotted along the axes Royalty, Masculine, and Eatable; the distance |D2 − D4| between two of them is marked.)

Semantic similarity in 2D

v Example: Home Depot products


Capture the structure of words

v Example from GloVe


How to use word vectors?


Pre-trained word vectors

v Google word2vec: https://code.google.com/archive/p/word2vec
v 100 billion tokens, 300 dimensions, 3M words

v GloVe project: http://nlp.stanford.edu/projects/glove/
v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B); see the loading sketch below
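For concreteness, a minimal sketch of loading one of these pre-trained sets with the gensim library; gensim and the local file name are assumptions here, not something the slides prescribe.

# Adjust the file name to wherever the word2vec archive was downloaded.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

print(kv.similarity('king', 'queen'))           # cosine similarity of two words
print(kv.most_similar('university', topn=5))    # nearest neighbors in the space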


Distance/similarity

v Vector similarity measure ⇒ similarity in meaning

v Cosine similarity: cos(𝑢, 𝑣) = (𝑢 ⋅ 𝑣) / (||𝑢|| ⋅ ||𝑣||)
v Word vectors are normalized by length
v Euclidean distance: ||𝑢 − 𝑣||₂
v Inner product: 𝑢 ⋅ 𝑣
v Same as cosine similarity if vectors are normalized


𝑢 / ||𝑢|| is a unit vector


Linguistic Regularities in Sparse and Explicit Word Representations, Levy & Goldberg, CoNLL 2014

Choosing the right similarity metric is important
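A small numpy sketch (not from the slides) computing the three measures above on the toy king/queen/apple vectors from the earlier slide.

import numpy as np

# Toy "basic concept" vectors: dimensions are [royalty, masculinity, femininity, eatable].
v_king  = np.array([0.8, 0.9, 0.1, 0.0])
v_queen = np.array([0.8, 0.1, 0.8, 0.0])
v_apple = np.array([0.1, 0.2, 0.1, 0.8])

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(v_king, v_queen))            # high: the royalty dimension is shared
print(cosine(v_king, v_apple))            # low: almost no shared concepts
print(np.linalg.norm(v_king - v_queen))   # Euclidean distance, for comparison
print(v_king @ v_queen)                   # inner product (equals cosine only if
                                          # the vectors are length-normalized)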

Word similarity DEMO

v http://msrcstr.cloudapp.net/


Word analogy

v 𝑣_uncle − 𝑣_man + 𝑣_woman ∼ 𝑣_aunt
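A toy numpy sketch of answering the analogy by vector arithmetic and nearest-neighbor search; the 3-dimensional vectors and their "concept" dimensions are made up for illustration, not real embeddings.

import numpy as np

# Made-up dimensions: [masculinity, parent's-sibling, adult].
vocab = {'man':   np.array([1.0, 0.0, 1.0]),
         'woman': np.array([0.0, 0.0, 1.0]),
         'uncle': np.array([1.0, 1.0, 1.0]),
         'aunt':  np.array([0.0, 1.0, 1.0])}

query = vocab['uncle'] - vocab['man'] + vocab['woman']

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Nearest word to the query vector, excluding the words used in the query.
best = max((w for w in vocab if w not in ('uncle', 'man', 'woman')),
           key=lambda w: cosine(vocab[w], query))
print(best)   # 'aunt'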


From words to phrases


Neural Language Models


How to “learn” word vectors?


What are the “basic concepts”? How to assign weights? How to define the similarity/distance? Cosine similarity

Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Bag-of-words model: documents (clusters) as the basis for the vector space (see the sketch below)
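A toy sketch of the bag-of-words construction: each word becomes a column of a document-term count matrix; the three "documents" are made up.

import numpy as np

docs = ["the student gave a talk at the university",
        "the student wants to buy an egg",
        "a happy student"]
vocab = sorted({w for d in docs for w in d.split()})
C = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        C[i, vocab.index(w)] += 1

# Dot products of columns count how often two words occur in the same document.
student, talk, egg = (C[:, vocab.index(w)] for w in ('student', 'talk', 'egg'))
print(student @ talk)   # 1: they share one document
print(talk @ egg)       # 0: never in the same document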


Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Skip-grams


Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)


                         joy   gladden   sorrow   sadden   goodwill
Group 1: “joyfulness”     1       1        0        0         0
Group 2: “sad”            0       0        1        1         0
Group 3: “affection”      0       0        0        0         1

Input: synonyms from a thesaurus. Joyfulness: joy, gladden; Sad: sorrow, sadden.


Cosine similarity?

Pros and cons?
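To make the cosine-similarity question concrete, a small numpy sketch (not from the slides) over the thesaurus incidence matrix above.

import numpy as np

# Rows are synonym groups (the "basic concepts"), columns are words.
words = ['joy', 'gladden', 'sorrow', 'sadden', 'goodwill']
M = np.array([[1, 1, 0, 0, 0],    # Group 1: "joyfulness"
              [0, 0, 1, 1, 0],    # Group 2: "sad"
              [0, 0, 0, 0, 1]])   # Group 3: "affection"

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_joy, v_gladden, v_sorrow = (M[:, words.index(w)] for w in ('joy', 'gladden', 'sorrow'))
print(cosine(v_joy, v_gladden))   # 1.0: same synonym group
print(cosine(v_joy, v_sorrow))    # 0.0: no group in common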

Problems?

v Number of basic concepts is large
v Basis is not orthogonal (i.e., not linearly independent)
v Some function words are too frequent (e.g., “the”)
v Syntax has too much impact
v E.g., TF-IDF can be applied (see the sketch below)
v E.g., skip-gram: scaling by distance to the target
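A minimal sketch (toy numbers, not from the slides) of the TF-IDF reweighting mentioned above: words that occur in nearly every document get weight close to zero.

import numpy as np

# Toy document-term count matrix: rows = documents, columns = words.
# Column 0 stands in for a frequent function word like "the".
C = np.array([[2., 1., 0., 1.],
              [1., 0., 1., 0.],
              [3., 1., 1., 0.]])

df = (C > 0).sum(axis=0)       # document frequency of each word
idf = np.log(C.shape[0] / df)  # inverse document frequency
print(C * idf)                 # column 0 is zeroed out; rarer words are boosted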


Latent Semantic Analysis (LSA)

v Data representation
v Encode single-relational data in a matrix

v Co-occurrence (e.g., document-term matrix, skip-gram)

v Synonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components


Principal Component Analysis (PCA)

v Decompose the similarity space into a set of orthonormal basis vectors



v For an 𝑚×𝑛 matrix 𝐴, there exists a factorization such that

𝐴 = 𝑈Σ𝑉ᵀ

v 𝑈, 𝑉 are orthogonal matrices


Low-rank Approximation

v Idea: store the most important information in a small number of dimensions (e.g., 100–1000)

v SVD can be used to compute optimal low-rank approximation

v Set the smallest 𝑛 − 𝑟 singular values to zero

v Similar words map to similar locations in the low-dimensional space
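A short numpy sketch (toy random data, not from the slides) of the truncation step: keep the largest singular values and rebuild the matrix from them.

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 6)).astype(float)   # stand-in for a count matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2                                             # keep the r largest singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]       # best rank-r approximation
                                                  # (in the Frobenius norm)
print(np.linalg.norm(A - A_r))                    # approximation error
print(s)                                          # the discarded values are the small ones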



LSA example

v Original matrix 𝐶


Example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval

LSA example

v SVD: 𝐶 = 𝑈Σ𝑉ᵀ


LSA example

v Original matrix 𝐶
v Dimension reduction: 𝐶 ∼ 𝑈Σ𝑉ᵀ


LSA example

v Original matrix 𝐶 vs. reconstructed matrix 𝐶₂

v What is the similarity between ship and boat?


Word vectors

𝐶 ∼ 𝑈Σ𝑉ᵀ

𝐶𝐶ᵀ ∼ (𝑈Σ𝑉ᵀ)(𝑈Σ𝑉ᵀ)ᵀ = 𝑈Σ𝑉ᵀ𝑉Σᵀ𝑈ᵀ = 𝑈ΣΣᵀ𝑈ᵀ (why?) = (𝑈Σ)(𝑈Σ)ᵀ

v 𝐶_ship ⋅ 𝐶_boat ∼ (𝑈Σ)_ship ⋅ (𝑈Σ)_boat (the rows of 𝐶 and 𝑈Σ indexed by “ship” and “boat”)
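A toy numpy check of the claim that rows of 𝑈Σ behave as word vectors. 𝐶 here is a made-up word-by-document count matrix for ship, boat, ocean, wood, tree; the counts are not the ones in Manning & Nayak's example.

import numpy as np

C = np.array([[1., 0., 1., 0., 0.],   # ship
              [0., 1., 1., 0., 0.],   # boat
              [1., 1., 0., 0., 0.],   # ocean
              [0., 0., 0., 1., 1.],   # wood
              [0., 0., 0., 1., 0.]])  # tree

U, s, Vt = np.linalg.svd(C, full_matrices=False)
W = U @ np.diag(s)                     # one row per word: the LSA word vectors

print(np.allclose(C @ C.T, W @ W.T))   # True: same pairwise word similarities
print(W[0] @ W[1])                     # ship . boat, computed from the word vectors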


Why do we need low-rank approximation?

v A knowledge base (e.g., a thesaurus) is never complete

v Noise reduction by dimension reduction
v Intuitively, LSA brings together “related” axes (concepts) in the vector space
v A compact model


All problems solved?


An analogy game

“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”, NIPS 2016


Continuous Semantic Representations

(Figure: word clusters in a continuous space, e.g., {sunny, rainy, windy, cloudy}, {car, wheel, cab}, {sad, joy, emotion, feeling}.)


Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?


Continuous representations for entities


(Figure: entity embeddings with Michelle Obama, the Democratic Party, George W. Bush, Laura Bush, and the Republican Party; one relation is marked “?”.)

Continuous representations for entities


• Useful resources for NLP applications
• Semantic Parsing & Question Answering
• Information Extraction

Next lecture: a more flexible framework

v Directly learn word vectors using an NN model
v More flexible
v Easier to learn new words
v Incorporate other information
v Optimize a task-specific loss

v Review calculus!

