CSE 5243 INTRO. TO DATA MINING
Word Embedding
Yu Su, CSE@The Ohio State University
How to let a computer understand meaning?
A cat sits on a mat. #_$@^_&*^&_()_@_+@^=
Distributional semantics
- You can get a lot of value by representing a word by means of its neighbors (context)
- One of the most successful ideas of modern statistical NLP
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
History of word embedding
Last lecture
History of word embedding
This lecture
Different embeddings are based on different priors

- Latent semantic analysis: “Words that occur in the same documents should be similar”
- Word2vec: “Words that occur in similar contexts should be similar”
- Neural network language modeling: “Word vectors should assign high probability to plausible sentences”
- Collobert et al., 2008 & 2011: “Word vectors should facilitate downstream classification tasks”
- Faruqui et al., 2015: “Words should follow linguistic constraints from semantic lexicons”
Latent semantic analysis: word-document occurrence matrix

- A word-document occurrence matrix gives general topics, e.g., all sports words will have similar entries
- Apply SVD for dimensionality reduction
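As a concrete toy illustration, the sketch below builds a small word-document count matrix and applies SVD to get low-dimensional word vectors; sports words end up near each other. The vocabulary, counts, and kept dimensionality are all made-up assumptions, not from the slides.

```python
import numpy as np

# Toy word-document occurrence matrix: rows = words, columns = documents.
# (Illustrative counts only.)
words = ["ball", "team", "score", "stock", "market"]
X = np.array([
    [3, 2, 0],   # ball   appears in the two sports documents
    [2, 3, 0],   # team
    [1, 2, 0],   # score
    [0, 0, 4],   # stock  appears in the finance document
    [0, 1, 3],   # market
], dtype=float)

# SVD for dimensionality reduction: keep the top-k singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]          # k-dimensional word embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that co-occur in the same documents get similar vectors.
print(cos(word_vecs[0], word_vecs[1]))   # ball vs. team  (high)
print(cos(word_vecs[0], word_vecs[3]))   # ball vs. stock (low)
```

Here cosine similarity in the reduced space captures the "same documents" prior: "ball" and "team" share documents, "ball" and "stock" do not.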
Word2vec: “Words that occur in similar contexts should be similar”

- Word2vec adjusts the vector of a word to be similar to the vectors of its context words
- Words with similar contexts thus end up with similar vectors

I just played with my dog. / I just played with my cat.
My dog likes to sleep on my bed. / My cat likes to sleep on my bed.
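This intuition can be made concrete by looking at the (center word, context word) training pairs word2vec extracts. The helper below and the window size of 2 are illustrative assumptions, not from the slides; the point is that "dog" and "cat" receive exactly the same context words.

```python
# Sketch: word2vec pairs each word with its neighbors inside a context window.
def context_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

s1 = "i just played with my dog".split()
s2 = "i just played with my cat".split()

def contexts(pairs, word):
    return {c for (t, c) in pairs if t == word}

# "dog" and "cat" get identical context words, so training pulls their
# vectors toward the same neighbors -- and hence toward each other.
print(contexts(context_pairs(s1), "dog"))
print(contexts(context_pairs(s2), "cat"))
```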
Probabilistic Language Modeling

- Goal: assign a probability to a sentence
  - Machine translation:
    - Source sentence: 今晚大风 (“strong winds tonight”)
    - P(large winds tonight) < P(strong winds tonight)
  - Spell correction:
    - “The office is about fifteen minuets from my house”
    - P(about fifteen minutes from) > P(about fifteen minuets from)
  - Speech recognition:
    - P(I saw a van) >> P(eyes awe of an)
  - Also: summarization, question answering, etc.
Probabilistic Language Modeling

- Goal: compute the probability of a sentence or a sequence of words:
  P(w_1^m) = P(w_1, w_2, ..., w_m), e.g., P(a, dog, is, running, in, a, room)
- How to compute the joint probability? Chain rule:
  P(w_1, w_2, ..., w_m) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_m | w_1, ..., w_{m-1})
  P(a, dog, is, running) = P(a) P(dog | a) P(is | a, dog) P(running | a, dog, is)
Probabilistic Language Modeling

- Key quantity: P(w_t | w_1, ..., w_{t-1}), since
  P(w_1, w_2, ..., w_m) = ∏_{t=1}^{m} P(w_t | w_1, ..., w_{t-1})
- Just count? Exponential number of entries, and sparsity.
- Markov assumption:
  P(w_t | w_1, ..., w_{t-1}) ≈ P(w_t | w_{t-n+1}, ..., w_{t-1})
Probabilistic Language Modeling

- N-gram (bigram):
  P(running | a, dog, is) ≈ P(running | is) = count(is, running) / count(is)
- What’s the problem?
  - Small context window (typically bigram or trigram)
  - Not utilizing word similarity: seeing “A dog is running in a room” should increase the probability of
    - “The dog is walking in a room”
    - “A cat is running in the room”
    - “Some cats are running in the room”
- Solution: neural network language modeling!
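The count-based bigram estimate above can be computed directly. A minimal sketch, using the slide's example sentence as the entire corpus:

```python
from collections import Counter

# Count-based bigram model: P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1})
corpus = "a dog is running in a room".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("is", "running"))   # count(is, running) / count(is) = 1/1
print(p_bigram("a", "dog"))        # count(a, dog) / count(a) = 1/2
```

With such tiny counts the sparsity problem is obvious: any unseen bigram gets probability zero, no matter how plausible it is.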
Neural Network Language Model

A Neural Probabilistic Language Model. Bengio et al. JMLR 2003.

- Projection layer
- Fully connected non-linear layer
- Softmax: learn P(w_t | w_{t-n+1}, ..., w_{t-1})
The Lookup Table

- Each word in the vocabulary maps to a vector in R^d
- In the original one-hot space, words are orthogonal:
  cat = (0,0,0,0,0,0,0,0,0,1,0,0,0,0, ...)
  dog = (0,0,1,0,0,0,0,0,0,0,0,0,0,0, ...)
- To get the embedding vector of a word with one-hot vector x, multiply Cx, where C is a d × D matrix and D is the vocabulary size
- C contains the word vectors!
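A quick numeric check of this point: multiplying C by a one-hot vector just selects one column, which is why the lookup is implemented as indexing in practice. The toy sizes and the index chosen for "cat" below are assumptions for illustration.

```python
import numpy as np

d, D = 4, 10                      # embedding dim and vocabulary size (toy)
rng = np.random.default_rng(0)
C = rng.standard_normal((d, D))   # d x D lookup table: column i is word i's vector

cat = np.zeros(D)
cat[9] = 1.0                      # one-hot vector for "cat" (index 9, arbitrary)
vec = C @ cat                     # multiplying by a one-hot selects a column
print(np.allclose(vec, C[:, 9]))  # same as direct column indexing
```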
Neural Network Language Model

Input: w_{t-n+1}, w_{t-n+2}, ..., w_{t-1}

- Projection: x = (C w_{t-n+1}, C w_{t-n+2}, ..., C w_{t-1})^T
- Non-linearity: z = tanh(Hx + b_1)
- Output: y = Uz + b_2
- Softmax: P(w_t = i) = exp(y_i) / Σ_{j=1}^{D} exp(y_j)
Dimensionality of each layer?

(d: word vector dimensionality; n: window size; D: vocabulary size; h: # of hidden units)

- Projection: x = (C w_{t-n+1}, ..., C w_{t-1})^T → n*d
- Non-linearity: z = tanh(Hx + b_1) → h
- Output/softmax: y = Uz + b_2, P(w_t = i) = exp(y_i) / Σ_{j=1}^{D} exp(y_j) → D
# of parameters in each layer?

(d: word vector dimensionality; n: window size; D: vocabulary size; h: # of hidden units)

- Projection: x = (C w_{t-n+1}, ..., C w_{t-1})^T → n*d
- Non-linearity: z = tanh(Hx + b_1) → n*d*h + h
- Output/softmax: y = Uz + b_2 → h*D + D
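The layers and their sizes can be sketched as a single forward pass in NumPy. All sizes below are toy values; following the slide's notation, n here counts the context words (so the projection has dimension n*d).

```python
import numpy as np

# d: word vector dim, n: window size, D: vocab size, h: hidden units (toy sizes)
d, n, D, h = 5, 3, 20, 8
rng = np.random.default_rng(1)

C  = rng.standard_normal((d, D))     # lookup table (shared embeddings)
H  = rng.standard_normal((h, n * d)) # hidden-layer weights
b1 = np.zeros(h)
U  = rng.standard_normal((D, h))     # output-layer weights
b2 = np.zeros(D)

context = [2, 7, 11]                            # indices of the n context words
x = np.concatenate([C[:, i] for i in context])  # projection: n*d
z = np.tanh(H @ x + b1)                         # non-linearity: h
y = U @ z + b2                                  # output scores: D
p = np.exp(y - y.max())                         # softmax over the vocabulary
p /= p.sum()

print(H.size + b1.size)   # n*d*h + h parameters in the hidden layer
print(U.size + b2.size)   # h*D + D parameters in the output layer
```

Note how the output layer dominates the parameter and compute cost as D grows, which is exactly the bottleneck the later speed-up slides address.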
Training

- All free parameters: θ = (C, H, U, b_1, b_2)
- Backpropagation + stochastic gradient ascent:
  θ ← θ + ε ∂ log P(w_t | w_{t-n+1}, ..., w_{t-1}) / ∂θ
- Costly!
Speed up training

- Most computations are at the output layer
  - To compute the normalization term of the softmax, we have to compute y_i for every word!
  - Cost is (almost) linear in the vocabulary size
  - Same problem in Skip-gram
- Solutions: approximate the normalized probability
  - Negative sampling
  - Noise contrastive estimation
  - Hierarchical softmax
  - ...
Refresher: Skip-gram

- Given the central word, predict surrounding words in a window of length c
- Objective: maximize the log probability of the observed (central word I, context word O) pairs
- Softmax: p(O | I) = exp(v'_O^T v_I) / Σ_{w∈V} exp(v'_w^T v_I)
- Gradient: ∂ log p(O | I) / ∂v_I = v'_O − Σ_w p(w | I) v'_w
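The softmax and its gradient can be verified numerically. The sketch below implements p(O | I), evaluates the gradient formula above, and checks one coordinate against a finite difference; the vocabulary size, dimensionality, and random vectors are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)
V, dim = 6, 4
v_in = rng.standard_normal((V, dim))    # input (center-word) vectors v_I
v_out = rng.standard_normal((V, dim))   # output (context-word) vectors v'_w

def p_all(i):
    """Softmax p(. | I=i) over the whole vocabulary."""
    scores = v_out @ v_in[i]
    e = np.exp(scores - scores.max())
    return e / e.sum()

I, O = 0, 3
grad = v_out[O] - p_all(I) @ v_out      # v'_O - sum_w p(w|I) v'_w

# Finite-difference check of the first coordinate of the gradient:
eps = 1e-6
v_in[I, 0] += eps
hi = np.log(p_all(I)[O])
v_in[I, 0] -= 2 * eps
lo = np.log(p_all(I)[O])
v_in[I, 0] += eps
print(abs((hi - lo) / (2 * eps) - grad[0]))   # should be tiny
```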
Negative sampling

- I: central word; O: a context word
- Original objective: maximize p(O | I, θ)
- We will derive an alternative that is less costly to compute
- Reframe the question: does the pair (I, O) really come from the training data?
  θ = argmax_θ p(D = 1 | I, O, θ),
  where p(D = 1 | I, O, θ) = σ(v_I^T v'_O) = 1 / (1 + e^{−v_I^T v'_O})
- Trivial solution: the same (long enough) vector for all words
- Fix: contrast with negative words!
Negative sampling

- Solution: randomly sample k negative words w_i from a noise distribution and assume the pairs (I, w_i) are incorrect
- E.g., I = “is”, O = “running”, w_1 = “walk”, w_2 = “do”, etc.
- Maximize
  p(D = 1 | I, O, θ) · ∏_{i=1}^{k} p(D = 0 | I, w_i, θ)
  or, in log form,
  log σ(v_I^T v'_O) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)}[log σ(−v_I^T v'_{w_i})]
  where P_n(w) = U(w)^{3/4} / Z, with U(w) the unigram distribution
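A minimal sketch of this objective, written as a loss to minimize: the noise distribution P_n(w) ∝ U(w)^{3/4} is built from toy unigram counts, and k, the vectors, and all sizes are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Noise distribution P_n(w) = U(w)^{3/4} / Z, from toy unigram counts U(w).
unigram = np.array([10, 5, 3, 1, 1], dtype=float)
P_n = unigram ** 0.75
P_n /= P_n.sum()

V, dim, k = 5, 4, 2
v_in = rng.standard_normal((V, dim))    # center-word vectors v_I
v_out = rng.standard_normal((V, dim))   # context-word vectors v'_w

def ns_loss(I, O):
    """Negative of: log sig(v_I.v'_O) + sum_i log sig(-v_I.v'_{w_i})."""
    negs = rng.choice(V, size=k, p=P_n)              # k sampled negative words
    pos = np.log(sigmoid(v_in[I] @ v_out[O]))
    neg = sum(np.log(sigmoid(-v_in[I] @ v_out[w])) for w in negs)
    return -(pos + neg)

print(ns_loss(0, 1))
```

Only k + 1 dot products are needed per training pair, instead of one per vocabulary word, which is the entire point of the approximation.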
What to get from this work

- How to supervise the learning of word embeddings using external classification tasks
- How to do semi-supervised learning of word embeddings
- How to apply word vectors and neural networks to other traditional NLP tasks
Embedding for other NLP tasks (Collobert et al., 2008 & 2011)
The Large-scale Feature Engineering Way
The sub-optimal cascade
NLP: Large-scale machine learning
The big picture
Semantic lexicon: WordNet

(Figure: a WordNet neighborhood linking “wrong” to “untrue”, “false”, “flawed”, and “incorrect”)
Retrofitting word vectors to semantic lexicons (NAACL’15)

- Incorporates information from lexicons into word vectors
- Post-processing approach
- Applicable to any word embedding method
- Applicable to any lexicon
Retrofitting

(Figure: original vectors contrasted with retrofitted vectors)
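The retrofitting update from Faruqui et al. (2015) can be sketched directly: each vector is repeatedly pulled toward its original value and toward its lexicon neighbors, q_i ← (α q̂_i + Σ_{j∈N(i)} β q_j) / (α + β |N(i)|), iterated to convergence. The graph, α, β, and sizes below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
Q_hat = rng.standard_normal((4, 3))         # original vectors for 4 words
edges = {0: [1], 1: [0, 2], 2: [1], 3: []}  # lexicon graph: synonym links
alpha, beta = 1.0, 1.0                      # fidelity vs. smoothness weights

Q = Q_hat.copy()
for _ in range(50):                         # iterate the update to convergence
    for i, nbrs in edges.items():
        if nbrs:
            Q[i] = (alpha * Q_hat[i] + beta * sum(Q[j] for j in nbrs)) \
                   / (alpha + beta * len(nbrs))

# Word 3 has no lexicon neighbors, so retrofitting leaves it unchanged.
print(np.allclose(Q[3], Q_hat[3]))
```

Because it only touches the vectors, this post-processing step works on the output of any embedding method, as the slide notes.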
Semantic lexicons used in this work

- PPDB: lexical paraphrases obtained from parallel texts
- WordNet: synonyms, hypernyms, and hyponyms
- FrameNet: e.g., Cause_change_of_position -> push = raise = growth

Table 1. Approximate size of the graphs obtained from different lexicons
Experiment results

(Figure: original embeddings are combined with semantic lexicons and evaluated on word similarity, synonym selection, syntactic analysis, and sentiment analysis)
In this lecture…

- More types of supervision used in training word embeddings
  - Language modeling
  - NLP labeling tasks
  - Semantic lexicons
- Ways to speed up training
  - E.g., negative sampling
  - Necessary for training on huge text corpora
  - Scale up from hundreds of millions to hundreds of billions
- How word embeddings help other NLP tasks