Lecture 7: Word Embeddings
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Learning word vectors (cont.)
- Representation learning in NLP
Recap: Latent Semantic Analysis
- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
Recap: Mapping to Latent Space via SVD
- SVD generalizes the original data
- Uncovers relationships not explicit in the thesaurus
- Term vectors projected to a $k$-dimensional latent space
- Word similarity: cosine of two column vectors in $\Sigma V^\top$

$$C \approx U \Sigma V^\top, \qquad C:\ d\times n,\quad U:\ d\times k,\quad \Sigma:\ k\times k,\quad V^\top:\ k\times n$$
Low rank approximation
- Frobenius norm: for an $m\times n$ matrix $C$,

$$\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^2}$$
- Rank of a matrix: how many vectors (rows/columns) of the matrix are linearly independent of each other
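A quick numerical check of these two definitions, as a minimal numpy sketch (the toy matrix values are made up for illustration):

```python
import numpy as np

# Toy 3x4 matrix whose third row is the sum of the first two,
# so only 2 rows are linearly independent.
C = np.array([[1., 0., 2., 1.],
              [0., 1., 1., 3.],
              [1., 1., 3., 4.]])

# Frobenius norm: square root of the sum of squared entries.
print(np.sqrt((C ** 2).sum()))        # same value as the built-in below
print(np.linalg.norm(C, 'fro'))

# Rank: number of linearly independent rows/columns.
print(np.linalg.matrix_rank(C))       # 2
```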
Low rank approximation
- Low-rank approximation problem:

$$\min_{X} \|C - X\|_F \quad \text{s.t.}\ \operatorname{rank}(X) = k$$

- If I can only use $k$ independent vectors to describe the points in the space, what are the best choices?
Essentially, we minimize the "reconstruction loss" under a low-rank constraint.
Low rank approximation
- Assume the rank of $C$ is $r$
- SVD: $C = U\Sigma V^\top$, with $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, 0, 0, \dots, 0)$
- Zero out the $r - k$ trailing values: $\Sigma' = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_k, 0, 0, \dots, 0)$
- $C_k = U\Sigma' V^\top$ is the best rank-$k$ approximation:

$$C_k = \arg\min_{X} \|C - X\|_F \quad \text{s.t.}\ \operatorname{rank}(X) = k$$
$$\Sigma = \begin{pmatrix}\sigma_1 & & \\ & \ddots & \\ & & 0\end{pmatrix} \qquad \text{($r$ non-zero entries on the diagonal)}$$
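A minimal numpy sketch of this construction, with a small random matrix standing in for $C$ (illustrative only): truncating the SVD gives the best rank-$k$ approximation, and the Frobenius reconstruction loss equals the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 5))          # toy matrix standing in for co-occurrence counts
k = 2

# Full SVD, then zero out the trailing singular values.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
s_k = np.where(np.arange(len(s)) < k, s, 0.0)

# Best rank-k approximation under the Frobenius norm (Eckart-Young).
C_k = U @ np.diag(s_k) @ Vt
print(np.linalg.matrix_rank(C_k))        # k
print(np.linalg.norm(C - C_k, 'fro'))    # reconstruction loss ...
print(np.sqrt((s[k:] ** 2).sum()))       # ... equals the discarded singular-value energy
```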
Word2Vec
- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
- Easy to incorporate new words or sentences
Word2Vec
- Similar to a language model, but predicting the next word is not the goal.
- Idea: words that are semantically similar often occur near each other in text
  - Embeddings that are good at predicting neighboring words are also good at representing similarity
Skip-gram vs. Continuous bag-of-words
- What are the differences?
- Skip-gram predicts the surrounding context words given the center word; CBOW predicts the center word given its surrounding context words.
Objective of Word2Vec (Skip-gram)
- Maximize the log likelihood of the context words $w_{t-m}, w_{t-m+1}, \dots, w_{t-1}, w_{t+1}, w_{t+2}, \dots, w_{t+m}$ given the word $w_t$
- $m$ is usually 5 to 10
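Written out explicitly (the standard skip-gram formulation of what the slide states in words), for a corpus of $T$ tokens the objective is

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t)$$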
Objective of Word2Vec (Skip-gram)
- How do we model $\log P(w_{t+j}\mid w_t)$?

$$P(w_{t+j}\mid w_t) = \frac{\exp(u_{w_{t+j}}\cdot v_{w_t})}{\sum_{w'}\exp(u_{w'}\cdot v_{w_t})}$$

- The softmax function, again!
- Every word has 2 vectors
  - $v_w$: when $w$ is the center word
  - $u_w$: when $w$ is the outside word (context word)
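A minimal numpy sketch of this probability, using a tiny made-up vocabulary and random vectors (illustrative only, not the lecture's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4

# Two vectors per word: v (center) and u (outside/context).
V = rng.standard_normal((len(vocab), dim))   # center-word vectors v_w
U = rng.standard_normal((len(vocab), dim))   # context-word vectors u_w

def p_context_given_center(center_idx):
    """Softmax over the dot products u_w' . v_center for all words w'."""
    scores = U @ V[center_idx]               # one score per vocabulary word
    scores -= scores.max()                   # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

probs = p_context_given_center(vocab.index("cat"))
print(dict(zip(vocab, np.round(probs, 3))))  # P(o | c = "cat") for every o
```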
How to update?
$$P(w_{t+j}\mid w_t) = \frac{\exp(u_{w_{t+j}}\cdot v_{w_t})}{\sum_{w'}\exp(u_{w'}\cdot v_{w_t})}$$

- How do we minimize $J(\theta)$?
  - Gradient descent!
- How do we compute the gradient?
Recap: Calculus
- Gradient: for $\mathbf{x}^\top = (x_1\ x_2\ x_3)$,

$$\nabla\phi(\mathbf{x}) = \begin{pmatrix}\partial\phi(\mathbf{x})/\partial x_1 \\ \partial\phi(\mathbf{x})/\partial x_2 \\ \partial\phi(\mathbf{x})/\partial x_3\end{pmatrix}$$

- If $\phi(\mathbf{x}) = \mathbf{a}\cdot\mathbf{x}$ (also written $\mathbf{a}^\top\mathbf{x}$), then $\nabla\phi(\mathbf{x}) = \mathbf{a}$
Recap: Calculus
- If $y = f(u)$ and $u = g(x)$ (i.e., $y = f(g(x))$), then

$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
Examples: 1. $y = (x^2 + 6)^3$   2. $y = \ln(x^2 + 5)$   3. $y = \exp(x^2 + 3x + 2)$
Other useful formulas
- $y = \exp(x) \ \Rightarrow\ \dfrac{dy}{dx} = \exp(x)$
- $y = \log x \ \Rightarrow\ \dfrac{dy}{dx} = \dfrac{1}{x}$
When I say log (in this course), I usually mean ln.
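As a quick check of the chain rule together with the log rule above, worked for example 2 (assuming the reading $y = \ln(x^2 + 5)$):

$$u = x^2 + 5,\quad y = \ln u \ \Rightarrow\ \frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = \frac{1}{u}\cdot 2x = \frac{2x}{x^2 + 5}$$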
Example
- Assume the vocabulary set is $W$. We have one center word $c$ and one context word $o$.
- What is the conditional probability $P(o\mid c)$?

$$P(o\mid c) = \frac{\exp(u_o\cdot v_c)}{\sum_{w'\in W}\exp(u_{w'}\cdot v_c)}$$

- What is the gradient of the log likelihood w.r.t. $v_c$?

$$\frac{\partial \log P(o\mid c)}{\partial v_c} = u_o - \mathbb{E}_{w\sim P(w\mid c)}[u_w]$$
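The derivation behind this result (standard, not spelled out on the slide): write out the log of the softmax and differentiate with respect to $v_c$.

$$\log P(o\mid c) = u_o\cdot v_c - \log\sum_{w'\in W}\exp(u_{w'}\cdot v_c)$$

$$\frac{\partial \log P(o\mid c)}{\partial v_c} = u_o - \sum_{w\in W}\frac{\exp(u_w\cdot v_c)}{\sum_{w'\in W}\exp(u_{w'}\cdot v_c)}\,u_w = u_o - \sum_{w\in W} P(w\mid c)\,u_w = u_o - \mathbb{E}_{w\sim P(w\mid c)}[u_w]$$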
Gradient Descent
$$\min_{w} J(w)$$

Update $w$: $w \leftarrow w - \eta\,\nabla J(w)$
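A minimal sketch of this update rule on a toy quadratic objective (the objective and numbers are made up for illustration):

```python
import numpy as np

# Toy objective J(w) = ||w - w_star||^2 with a known minimizer w_star.
w_star = np.array([3.0, -1.0])

def grad_J(w):
    return 2 * (w - w_star)       # gradient of J at w

w = np.zeros(2)                   # initial guess
eta = 0.1                         # learning rate (step size)
for _ in range(100):
    w = w - eta * grad_J(w)       # w <- w - eta * grad J(w)

print(w)                          # close to [3., -1.]
```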
Local minimum vs. global minimum
Stochastic gradient descent
- Let $J(w) = \frac{1}{n}\sum_{i=1}^{n} J_i(w)$
- Gradient descent update rule:

$$w \leftarrow w - \frac{\eta}{n}\sum_{i=1}^{n}\nabla J_i(w)$$

- Stochastic gradient descent:
  - Approximate $\frac{1}{n}\sum_{i=1}^{n}\nabla J_i(w)$ by the gradient at a single example $\nabla J_i(w)$ (why?)
  - At each step:
    Randomly pick an example $i$: $w \leftarrow w - \eta\,\nabla J_i(w)$
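A minimal SGD sketch on a toy least-squares objective (the data and learning rate are made up for illustration); each step follows the rule above, using the gradient at one randomly chosen example rather than the full sum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: J(w) = (1/n) * sum_i (x_i . w - y_i)^2, with a known true w.
n, dim = 1000, 3
X = rng.standard_normal((n, dim))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(dim)
eta = 0.05
for step in range(5000):
    i = rng.integers(n)                       # randomly pick an example i
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]     # gradient of J_i at w
    w = w - eta * grad_i                      # w <- w - eta * grad J_i(w)

print(w)   # close to w_true
```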
Negative sampling
- With a large vocabulary set, stochastic gradient descent is still not enough (why?)

$$\frac{\partial \log P(o\mid c)}{\partial v_c} = u_o - \mathbb{E}_{w\sim P(w\mid c)}[u_w]$$

- Let's approximate it again!
  - Only sample a few words that do not appear in the context (see the sketch below)
  - Essentially, put more weight on positive samples
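A minimal sketch of one negative-sampling update in the spirit of word2vec (an illustrative approximation, not the exact objective or hyperparameters from the lecture; the sigmoid-based loss with K noise words follows Mikolov et al.'s formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, K = 10000, 50, 5                    # K = number of negative samples

V = 0.01 * rng.standard_normal((vocab_size, dim))    # center-word vectors v_w
U = 0.01 * rng.standard_normal((vocab_size, dim))    # context-word vectors u_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, eta=0.05):
    """One negative-sampling update for a (center, context) word pair."""
    negatives = rng.integers(vocab_size, size=K)     # sampled "noise" words
    grad_v = np.zeros(dim)

    # Positive pair: push sigmoid(u_o . v_c) toward 1.
    g = sigmoid(U[context] @ V[center]) - 1.0
    grad_v += g * U[context]
    U[context] -= eta * g * V[center]

    # Negative pairs: push sigmoid(u_k . v_c) toward 0.
    for neg in negatives:
        g = sigmoid(U[neg] @ V[center])
        grad_v += g * U[neg]
        U[neg] -= eta * g * V[center]

    # The center-word update touches only K+1 vectors, not the full vocabulary.
    V[center] -= eta * grad_v
```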
More about Word2Vec – relation to LSA
- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

$$\mathrm{PMI}(w,c) = \log\frac{P(w\mid c)}{P(w)} = \log\frac{P(w,c)}{P(w)\,P(c)} = \log\frac{\#(w,c)\cdot|D|}{\#(w)\,\#(c)}$$
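A minimal numpy sketch computing this quantity, plus the shifted positive variant that Levy and Goldberg relate to skip-gram with k negative samples (the tiny count matrix is made up for illustration):

```python
import numpy as np

# Toy word-context co-occurrence counts #(w, c); rows = words, columns = contexts.
counts = np.array([[10.,  2.,  0.],
                   [ 3.,  8.,  1.],
                   [ 0.,  1.,  6.]])

D = counts.sum()                                  # |D|: total number of (w, c) pairs
w_counts = counts.sum(axis=1, keepdims=True)      # #(w)
c_counts = counts.sum(axis=0, keepdims=True)      # #(c)

with np.errstate(divide='ignore'):
    pmi = np.log(counts * D / (w_counts * c_counts))   # log(#(w,c)|D| / #(w)#(c))

# Shifted positive PMI: subtract log k and clip at zero.
k = 5
sppmi = np.maximum(pmi - np.log(k), 0.0)
print(np.round(sppmi, 2))
```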
All problems solved?
Continuous Semantic Representations
[Figure: word clusters in the embedding space - weather terms (sunny, rainy, windy, cloudy), vehicle terms (car, wheel, cab), and emotion terms (sad, joy, emotion, feeling) each form nearby groups.]
Semantics Needs More Than Similarity
Tomorrow will be rainy.
Tomorrow will be sunny.
similar(rainy, sunny)?
antonym(rainy, sunny)?
Polarity Inducing LSA [Yih, Zweig, Platt 2012]
- Data representation
  - Encode two opposite relations in a matrix using "polarity"
  - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
Encode Synonyms & Antonyms in Matrix
- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
- Inducing polarity

                          joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"       1       1       -1       -1        0
Group 2: "sad"             -1      -1        1        1        0
Group 3: "affection"        0       0        0        0        1

Target word: row-vector
Cosine score: + for synonyms, - for antonyms
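A toy check of the polarity idea using the matrix above, treating each word's column as its vector before any SVD (a simplification for illustration, not the full PILSA pipeline):

```python
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
# Thesaurus-group x word matrix from the table above (antonyms get -1).
M = np.array([[ 1,  1, -1, -1, 0],    # Group 1: "joyfulness"
              [-1, -1,  1,  1, 0],    # Group 2: "sad"
              [ 0,  0,  0,  0, 1]],   # Group 3: "affection"
             dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Word vectors here are just the columns of the toy matrix.
vec = {w: M[:, i] for i, w in enumerate(words)}
print(cosine(vec["joy"], vec["gladden"]))   # +1.0 : synonyms
print(cosine(vec["joy"], vec["sorrow"]))    # -1.0 : antonyms
```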
Continuous representations for entities
[Figure: entity embedding example with Michelle Obama, George W. Bush, Laura Bush, the Democratic Party, the Republican Party, and a missing "?" entity.]
Continuous representations for entities
- Useful resources for NLP applications
  - Semantic Parsing & Question Answering
  - Information Extraction