
Lecture 6: Vector Space Model

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


This lecture

v How to represent a word, a sentence, or a document?

v How to infer the relationship among words?

v We focus on “semantics”: distributional semantics

v What is the meaning of “life”?



How to represent a word

v Naïve way: represent words as atomic symbols: student, talk, university
v N-gram language model, logical analysis

v Represent a word as a “one-hot” vector: [ 0 0 0 1 0 … 0 ]

v How large is this vector?
v PTB data: ~50k; Google 1T data: 13M

v 𝑣 ⋅ 𝑢 =?


(Vector dimensions correspond to words: egg, student, talk, university, happy, buy.)

Issues?

v Dimensionality is large; vector is sparse
v No similarity

v Cannot represent new words
v Any idea?


𝑣_happy = [ 0 0 0 1 0 … 0 ], 𝑣_buy = [ 0 0 1 0 0 … 0 ], 𝑣_talk = [ 1 0 0 0 0 … 0 ]

𝑣_happy ⋅ 𝑣_buy = 𝑣_happy ⋅ 𝑣_talk = 0
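A minimal numpy sketch (not from the slides) that makes the zero dot products above concrete; the 6-word vocabulary is the toy one shown on the slide.

import numpy as np

# Toy one-hot vectors over a 6-word vocabulary.
vocab = ['egg', 'student', 'talk', 'university', 'happy', 'buy']

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

v_happy, v_buy, v_talk = one_hot('happy'), one_hot('buy'), one_hot('talk')
print(v_happy @ v_buy)    # 0.0
print(v_happy @ v_talk)   # 0.0 -- one-hot vectors carry no notion of similarity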

Idea 1: Taxonomy (Word category)


What is “car”?

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]


>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

Word similarity?


>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
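As a companion to lowest_common_hypernyms, a short sketch using NLTK's path_similarity, which turns the taxonomy into a similarity score; this call is real NLTK but not shown on the slide.

from nltk.corpus import wordnet as wn

# path_similarity scores a pair of senses by the shortest path between them in
# the hypernym graph (1.0 means identical senses; longer paths give smaller scores).
right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
novel = wn.synset('novel.n.01')

print(right.path_similarity(minke))   # fairly high: both are baleen whales
print(right.path_similarity(novel))   # close to zero: they only share 'entity'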

Requires human labor

Taxonomy (Word category)

v Synonym, hypernym (Is-A), hyponym


Idea 2: Similarity = Clustering


Cluster n-gram model

v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information


Implementation of the Brown hierarchical word clustering algorithm, by Percy Liang

Idea 3: Distributional representation

v Linguistic items with similar distributions have similar meanings
v i.e., words occur in the same contexts ⇒ similar meaning


"a word is characterized by the company it keeps” --Firth, John, 1957

Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meaning is a vector of “basic concepts”

v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance?


𝑣_king = [0.8 0.9 0.1 0 …], 𝑣_queen = [0.8 0.1 0.8 0 …], 𝑣_apple = [0.1 0.2 0.1 0.8 …]

(basic concepts: royalty, masculinity, femininity, eatable)

An illustration of vector space model


(Figure: word vectors w1–w5 plotted along the axes Royalty, Masculine, and Eatable; the distance |D2 − D4| between two of them is marked.)

Semantic similarity in 2D

v Example: Home Depot products


Capture the structure of words

v Example from GloVe


How to use word vectors?


Pre-trained word vectors

v Google word2vec: https://code.google.com/archive/p/word2vec
v 100 billion tokens, 300 dimensions, 3M words

v GloVe project: http://nlp.stanford.edu/projects/glove/
v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B); see the loading sketch below
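For concreteness, a minimal sketch of loading one of these pre-trained sets with the gensim library; gensim and the local file name are assumptions here, not something the slides prescribe.

# Adjust the file name to wherever the word2vec archive was downloaded.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

print(kv.similarity('king', 'queen'))           # cosine similarity of two words
print(kv.most_similar('university', topn=5))    # nearest neighbors in the space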


Distance/similarity

v Vector similarity measure ⇒ similarity in meaning

v Cosine similarity: cos(𝑢, 𝑣) = (𝑢 ⋅ 𝑣) / (||𝑢|| ⋅ ||𝑣||)
v Word vectors are normalized by length
v Euclidean distance: ||𝑢 − 𝑣||₂
v Inner product: 𝑢 ⋅ 𝑣
v Same as cosine similarity if vectors are normalized


𝑢 / ||𝑢|| is a unit vector


Linguistic Regularities in Sparse and Explicit Word Representations, Levy & Goldberg, CoNLL 2014

Choosing the right similarity metric is important
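A small numpy sketch (not from the slides) computing the three measures above on the toy king/queen/apple vectors from the earlier slide.

import numpy as np

# Toy "basic concept" vectors: dimensions are [royalty, masculinity, femininity, eatable].
v_king  = np.array([0.8, 0.9, 0.1, 0.0])
v_queen = np.array([0.8, 0.1, 0.8, 0.0])
v_apple = np.array([0.1, 0.2, 0.1, 0.8])

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(v_king, v_queen))            # high: the royalty dimension is shared
print(cosine(v_king, v_apple))            # low: almost no shared concepts
print(np.linalg.norm(v_king - v_queen))   # Euclidean distance, for comparison
print(v_king @ v_queen)                   # inner product (equals cosine only if
                                          # the vectors are length-normalized)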

Word similarity DEMO

v http://msrcstr.cloudapp.net/


Word analogy

v 𝑣_uncle − 𝑣_man + 𝑣_woman ∼ 𝑣_aunt
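A toy numpy sketch of answering the analogy by vector arithmetic and nearest-neighbor search; the 3-dimensional vectors and their "concept" dimensions are made up for illustration, not real embeddings.

import numpy as np

# Made-up dimensions: [masculinity, parent's-sibling, adult].
vocab = {'man':   np.array([1.0, 0.0, 1.0]),
         'woman': np.array([0.0, 0.0, 1.0]),
         'uncle': np.array([1.0, 1.0, 1.0]),
         'aunt':  np.array([0.0, 1.0, 1.0])}

query = vocab['uncle'] - vocab['man'] + vocab['woman']

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Nearest word to the query vector, excluding the words used in the query.
best = max((w for w in vocab if w not in ('uncle', 'man', 'woman')),
           key=lambda w: cosine(vocab[w], query))
print(best)   # 'aunt'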


From words to phrases


Neural Language Models


How to “learn” word vectors?


What are the “basic concepts”? How to assign weights? How to define the similarity/distance? Cosine similarity

Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Bag-of-words model: documents (clusters) as the basis for the vector space (see the sketch below)
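A toy sketch of the bag-of-words construction: each word becomes a column of a document-term count matrix; the three "documents" are made up.

import numpy as np

docs = ["the student gave a talk at the university",
        "the student wants to buy an egg",
        "a happy student"]
vocab = sorted({w for d in docs for w in d.split()})
C = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        C[i, vocab.index(w)] += 1

# Dot products of columns count how often two words occur in the same document.
student, talk, egg = (C[:, vocab.index(w)] for w in ('student', 'talk', 'egg'))
print(student @ talk)   # 1: they share one document
print(talk @ egg)       # 0: never in the same document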


Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Skip-grams


Back to distributional representation

v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)


                         joy   gladden   sorrow   sadden   goodwill
Group 1: “joyfulness”     1       1        0        0         0
Group 2: “sad”            0       0        1        1         0
Group 3: “affection”      0       0        0        0         1

Input: synonyms from a thesaurus. Joyfulness: joy, gladden; Sad: sorrow, sadden.


Cosine similarity?

Pros and cons?
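To make the cosine-similarity question concrete, a small numpy sketch (not from the slides) over the thesaurus incidence matrix above.

import numpy as np

# Rows are synonym groups (the "basic concepts"), columns are words.
words = ['joy', 'gladden', 'sorrow', 'sadden', 'goodwill']
M = np.array([[1, 1, 0, 0, 0],    # Group 1: "joyfulness"
              [0, 0, 1, 1, 0],    # Group 2: "sad"
              [0, 0, 0, 0, 1]])   # Group 3: "affection"

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_joy, v_gladden, v_sorrow = (M[:, words.index(w)] for w in ('joy', 'gladden', 'sorrow'))
print(cosine(v_joy, v_gladden))   # 1.0: same synonym group
print(cosine(v_joy, v_sorrow))    # 0.0: no group in common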

Problems?

v Number of basic concepts is large
v Basis is not orthogonal (i.e., not linearly independent)
v Some function words are too frequent (e.g., “the”)
v Syntax has too much impact
v E.g., TF-IDF can be applied (see the sketch below)
v E.g., skip-gram: scaling by distance to the target
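A minimal sketch (toy numbers, not from the slides) of the TF-IDF reweighting mentioned above: words that occur in nearly every document get weight close to zero.

import numpy as np

# Toy document-term count matrix: rows = documents, columns = words.
# Column 0 stands in for a frequent function word like "the".
C = np.array([[2., 1., 0., 1.],
              [1., 0., 1., 0.],
              [3., 1., 1., 0.]])

df = (C > 0).sum(axis=0)       # document frequency of each word
idf = np.log(C.shape[0] / df)  # inverse document frequency
print(C * idf)                 # column 0 is zeroed out; rarer words are boosted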


Latent Semantic Analysis (LSA)

v Data representation
v Encode single-relational data in a matrix

v Co-occurrence (e.g., document-term matrix, skip-gram)

v Synonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components


Principal Component Analysis (PCA)

v Decompose the similarity space into a set of orthonormal basis vectors



v For an 𝑚×𝑛 matrix 𝐴, there exists a factorization such that

𝐴 = 𝑈Σ𝑉ᵀ

v 𝑈, 𝑉 are orthogonal matrices


Low-rank Approximation

v Idea: store the most important information in a small number of dimensions (e.g., 100–1000)

v SVD can be used to compute optimal low-rank approximation

v Set the smallest 𝑛 − 𝑟 singular values to zero

v Similar words map to similar locations in the low-dimensional space
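A short numpy sketch (toy random data, not from the slides) of the truncation step: keep the largest singular values and rebuild the matrix from them.

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 6)).astype(float)   # stand-in for a count matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2                                             # keep the r largest singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]       # best rank-r approximation
                                                  # (in the Frobenius norm)
print(np.linalg.norm(A - A_r))                    # approximation error
print(s)                                          # the discarded values are the small ones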



LSA example

v Original matrix 𝐶


Example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval

LSA example

v SVD: 𝐶 = 𝑈Σ𝑉ᵀ


LSA example

v Original matrix 𝐶
v Dimension reduction: 𝐶 ∼ 𝑈Σ𝑉ᵀ


LSA example

v Original matrix 𝐶 vs. reconstructed matrix 𝐶₂

v What is the similarity between ship and boat?


Word vectors

𝐶 ∼ 𝑈Σ𝑉ᵀ

𝐶𝐶ᵀ ∼ (𝑈Σ𝑉ᵀ)(𝑈Σ𝑉ᵀ)ᵀ = 𝑈Σ𝑉ᵀ𝑉Σᵀ𝑈ᵀ = 𝑈ΣΣᵀ𝑈ᵀ (why?) = (𝑈Σ)(𝑈Σ)ᵀ

v 𝐶_ship ⋅ 𝐶_boat ∼ (𝑈Σ)_ship ⋅ (𝑈Σ)_boat (the rows of 𝐶 and 𝑈Σ indexed by “ship” and “boat”)
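A toy numpy check of the claim that rows of 𝑈Σ behave as word vectors. 𝐶 here is a made-up word-by-document count matrix for ship, boat, ocean, wood, tree; the counts are not the ones in Manning & Nayak's example.

import numpy as np

C = np.array([[1., 0., 1., 0., 0.],   # ship
              [0., 1., 1., 0., 0.],   # boat
              [1., 1., 0., 0., 0.],   # ocean
              [0., 0., 0., 1., 1.],   # wood
              [0., 0., 0., 1., 0.]])  # tree

U, s, Vt = np.linalg.svd(C, full_matrices=False)
W = U @ np.diag(s)                     # one row per word: the LSA word vectors

print(np.allclose(C @ C.T, W @ W.T))   # True: same pairwise word similarities
print(W[0] @ W[1])                     # ship . boat, computed from the word vectors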


Why do we need low-rank approximation?

v A knowledge base (e.g., a thesaurus) is never complete

v Noise reduction by dimension reduction
v Intuitively, LSA brings together “related” axes (concepts) in the vector space
v A compact model


All problems solved?


An analogy game

“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”, NIPS 2016


Continuous Semantic Representations

(Figure: word clusters in a continuous space, e.g., {sunny, rainy, windy, cloudy}, {car, wheel, cab}, {sad, joy, emotion, feeling}.)


Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?


Continuous representations for entities


(Figure: entity embeddings with Michelle Obama, the Democratic Party, George W. Bush, Laura Bush, and the Republican Party; one relation is marked “?”.)

Continuous representations for entities


• Useful resources for NLP applications
• Semantic Parsing & Question Answering
• Information Extraction

Next lecture: a more flexible framework

v Directly learn word vectors using an NN model
v More flexible
v Easier to learn new words
v Incorporate other information
v Optimize a task-specific loss

v Review calculus!

