Part-of-Speech Tagging
+
Neural Networks 3: Word Embeddings
CS 287
Review: Neural Networks
One-layer multi-layer perceptron architecture,

NN_MLP1(x) = g(xW1 + b1)W2 + b2

I xW + b; perceptron
I x is the dense representation in R^(1×din)
I W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid); first affine transformation
I W2 ∈ R^(dhid×dout), b2 ∈ R^(1×dout); second affine transformation
I g : R^(dhid) → R^(dhid) is an activation non-linearity (often pointwise)
I g(xW1 + b1) is the hidden layer
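The one-layer MLP above can be sketched directly in numpy; the dimensions below are toy assumptions, with tanh standing in for the pointwise non-linearity g.

```python
import numpy as np

# Sketch of NN_MLP1(x) = g(xW1 + b1)W2 + b2 with toy dimensions.
rng = np.random.default_rng(0)
din, dhid, dout = 50, 100, 10
W1, b1 = rng.standard_normal((din, dhid)) * 0.1, np.zeros((1, dhid))
W2, b2 = rng.standard_normal((dhid, dout)) * 0.1, np.zeros((1, dout))

x = rng.standard_normal((1, din))   # dense input representation
hidden = np.tanh(x @ W1 + b1)       # g(xW1 + b1), the hidden layer
output = hidden @ W2 + b2           # second affine transformation
```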
Review: Non-Linearities Tanh
Hyperbolic Tangent:

tanh(t) = (exp(t) − exp(−t)) / (exp(t) + exp(−t))

I Intuition: Similar to sigmoid, but range between −1 and 1.
Review: Backpropagation
Forward: each layer fi+1(∗; θi+1) is applied to the output of the previous layers, giving the composition fi+1(fi(. . . f1(x0))).

Backward: the chain rule passes the gradient of the loss L down the composition,

∂L/∂fi(. . . f1(x0)) = ∂L/∂fi+1(. . . f1(x0)) × ∂fi+1(. . . f1(x0))/∂fi(. . . f1(x0))

and the parameter gradients ∂L/∂θi+1 are computed along the way.
Quiz
One common class of operations in neural network models is known as
pooling. Informally, a pooling layer consists of an aggregation unit, typically
unparameterized, that reduces the input to a smaller size.

Consider three pooling functions of the form f : R^n → R,

1. f(x) = max_i x_i
2. f(x) = min_i x_i
3. f(x) = ∑_i x_i / n

What effect does each of these functions have? What are their
gradients? How would you implement backpropagation for these units?
Quiz
I Max pooling: f(x) = max_i x_i
  I Keeps only the most activated input
  I Fprop is simple; however must store the arg max ("switch")
  I Bprop gradient is zero except at the switch, which gets gradOutput
I Min pooling: f(x) = min_i x_i
  I Keeps only the least activated input
  I Fprop is simple; however must store the arg min ("switch")
  I Bprop gradient is zero except at the switch, which gets gradOutput
I Avg pooling: f(x) = ∑_i x_i / n
  I Keeps the average input activation
  I Fprop is simply the mean.
  I GradOutput is divided by n and passed to all inputs.
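The three backward passes above can be sketched in a few lines of numpy; the input vector and upstream gradient are toy values.

```python
import numpy as np

# Sketch of the three pooling units' backward passes: max/min store a
# "switch" (arg max / arg min); avg divides the output gradient by n.
x = np.array([1.0, -2.0, 3.0, 0.5])
grad_output = 1.0                   # upstream gradient dL/df(x)

# Max pooling: gradient flows only to the arg max.
grad_max = np.zeros_like(x)
grad_max[int(np.argmax(x))] = grad_output

# Min pooling: gradient flows only to the arg min.
grad_min = np.zeros_like(x)
grad_min[int(np.argmin(x))] = grad_output

# Average pooling: gradient is split evenly over all inputs.
grad_avg = np.full_like(x, grad_output / x.size)
```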
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
1. Use dense representations instead of sparse
2. Use windowed area instead of sequence models
3. Use neural networks to model windowed interactions
What about rare words?
Word Embeddings
Embedding layer,
x0 W0

I x0 ∈ R^(1×d0) is a one-hot word vector.
I W0 ∈ R^(d0×din), d0 = |V|

Notes:

I d0 >> din, e.g. d0 = 10000, din = 50
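A short numpy sketch of the embedding layer, with the toy dimensions above: multiplying a one-hot row vector x0 by W0 just selects a row of W0, so in practice the lookup is implemented as indexing rather than a matrix multiply.

```python
import numpy as np

# Embedding lookup: one-hot matmul vs. direct row indexing.
d0, din = 10000, 50               # vocabulary size, embedding size
rng = np.random.default_rng(0)
W0 = rng.standard_normal((d0, din))

word_index = 42                   # illustrative word id
x0 = np.zeros((1, d0))
x0[0, word_index] = 1.0           # one-hot encoding of the word

via_matmul = x0 @ W0              # 1 x din dense representation
via_lookup = W0[word_index]       # same vector, O(din) instead of O(d0 * din)
```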
Pretraining Representations
I We would like strong shared representations of words
I However, PTB has only 1M labeled words, relatively small
I Collobert et al. (2008, 2011) use a semi-supervised method.
I (Close connection to Bengio et al. (2003), next topic)
Semi-Supervised Training
Idea: Train representations separately on more data
1. Pretrain word embeddings W0 first.
2. Substitute them in as first NN layer
3. Fine-tune embeddings for final task
I Modify the first layer based on supervised gradients
I Optional, some work skips this step
Large Corpora
To learn rare word embeddings, we need many more tokens,
I C&W
I English Wikipedia (631 million word tokens)
I Reuters Corpus (221 million word tokens)
I Total vocabulary size: 130,000 word types
I word2vec
I Google News (6 billion word tokens)
I Total vocabulary size: ≈ 1M word types
But this data has no labels...
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
C&W Embeddings
I Assumption: Text in Wikipedia is coherent (in some sense).
I Most randomly corrupted text is incoherent.
I Embeddings should distinguish coherence.
I Common idea in unsupervised learning (distributional hypothesis).
C&W Setup
Let V be the vocabulary of English and let s score any window of size
dwin = 5. If we see the phrase
[ the dog walks to the ]
It should score higher by s than
[ the dog house to the ]
[ the dog cats to the ]
[ the dog skips to the ]
...
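The corrupted windows above can be generated by swapping the middle word for a random vocabulary word; this is a minimal sketch, with a toy vocabulary and the helper name `corrupt` chosen for illustration.

```python
import random

# Sketch of C&W-style corruption: replace the middle word of a window
# with a randomly drawn vocabulary word.
random.seed(0)
vocab = ["the", "dog", "walks", "to", "house", "cats", "skips"]

def corrupt(window, vocab, k=3):
    """Return k corrupted copies of the window, differing in the middle word."""
    mid = len(window) // 2
    return [window[:mid] + [random.choice(vocab)] + window[mid + 1:]
            for _ in range(k)]

print(corrupt(["the", "dog", "walks", "to", "the"], vocab))
```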
C&W Setup
Can estimate the score s as a windowed neural network,

s(w1, . . . , wdwin) = hardtanh(xW1 + b1)W2 + b2

with

x = [v(w1) v(w2) . . . v(wdwin)]

I din = dwin × 50, dhid = 100, dwin = 11, dout = 1
Example: Function s
x = [v(w3) v(w4) v(w5) v(w6) v(w7)]
Training?
I Different setup than previous experiments.
I No direct supervision y
I Train to rank good examples better.
Ranking Loss
Given only examples {x1, . . . , xn}, and for each example xi a set D(xi) of
corrupted alternatives,

L(θ) = ∑_i ∑_{x′∈D(xi)} Lranking(s(xi; θ), s(x′; θ))

Lranking(y, y′) = max{0, 1 − (y − y′)}
Example: C&W ranking
x = [the dog walks to the]
D(x) = { [the dog skips to the], [the dog in to the], . . . }
I (Torch nn.RankingCriterion)
I Note: slightly different setup.
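The margin ranking loss and its subgradient can be sketched directly; the scalar scores below are illustrative stand-ins for s applied to a true window (y) and a corrupted window (y′).

```python
# Sketch of Lranking(y, y') = max{0, 1 - (y - y')} and its subgradient.
def ranking_loss(y, y_prime):
    return max(0.0, 1.0 - (y - y_prime))

def ranking_grad(y, y_prime):
    """Subgradients (dL/dy, dL/dy_prime)."""
    if 1.0 - (y - y_prime) > 0.0:
        return -1.0, 1.0   # push the true score up, the corrupted score down
    return 0.0, 0.0        # margin satisfied: no update
```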
C&W Embeddings in Practice
I Vocabulary size |D(x)| > 100, 000
I Training time of 4 weeks
I (Collobert is a main author of Torch)
Sampling (Sketch of Wsabie (Weston, 2011))
Observation: in many contexts
Lranking(y, y′) = max{0, 1 − (y − y′)} = 0

Particularly true later in training.

For difficult contexts, it may still be easy to find an x′ with

Lranking(y, y′) = max{0, 1 − (y − y′)} ≠ 0

We can therefore sample from D(x) to find an update.
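A minimal sketch of this sampling idea (an assumption about the procedure, not the exact Wsabie algorithm): draw corrupted scores until one violates the margin, then use that single sample for the update.

```python
import random

random.seed(1)

def sample_violation(y_true, corrupted_scores, max_tries=100):
    """Return a corrupted score violating the margin, or None if none found."""
    for _ in range(max_tries):
        y_prime = random.choice(corrupted_scores)
        if 1.0 - (y_true - y_prime) > 0.0:   # margin violated
            return y_prime
    return None

scores = [0.1, -0.5, 1.4, 0.0]   # illustrative corrupted-window scores
```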
C&W Results
1. Use dense representations instead of sparse
2. Use windowed area instead of sequence models
3. Use neural networks to model windowed interactions
4. Use semi-supervised learning to pretrain representations.
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
word2vec
I Contributions:
I Scale embedding process to massive sizes
I Experiments with several architectures
I Empirical evaluations of embeddings
I Influential release of software/data.
I Differences with C&W
I Instead of MLP uses (bi)linear model (linear in paper)
I Instead of ranking model, directly predict word (cross-entropy)
I Various other extensions.
I Two different models
1. Continuous Bag-of-Words (CBOW)
2. Continuous Skip-gram
word2vec (Bilinear Model)
Back to a pure bilinear model, but with a much bigger output space,

y = softmax(((∑_i x0_i W0) / (dwin − 1)) W1)

I x0_i ∈ R^(1×d0); input words as one-hot vectors.
I W0 ∈ R^(d0×din); d0 = |V|, word embeddings
I W1 ∈ R^(din×dout); dout = |V|, output embeddings

Notes:

I Bilinear parameter interaction.
I d0 >> din, e.g. 50 ≤ din ≤ 1000, 10,000 ≤ |V| ≤ 1M or more
word2vec (Mikolov, 2013)
Continuous Bag-of-Words (CBOW)
y = softmax(((∑_i x0_i W0) / (dwin − 1)) W1)
I Attempt to predict the middle word
[ the dog walks to the ]
Example: CBOW
x = (v(w3) + v(w4) + v(w6) + v(w7)) / (dwin − 1)

y = δ(w5)

W1 is no longer partitioned by row (order is lost)
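The CBOW forward pass above can be sketched with toy dimensions (V, din, and the context indices are illustrative): average the context embeddings, then score every vocabulary word.

```python
import numpy as np

# Toy CBOW forward pass.
rng = np.random.default_rng(0)
V, din = 1000, 50
W0 = rng.standard_normal((V, din)) * 0.1   # input (word) embeddings
W1 = rng.standard_normal((din, V)) * 0.1   # output embeddings

def softmax(z):
    e = np.exp(z - z.max())                # subtract max for stability
    return e / e.sum()

context = [3, 4, 6, 7]                     # indices of the context words
x = W0[context].mean(axis=0)               # (sum of lookups) / (dwin - 1)
y = softmax(x @ W1)                        # distribution over the middle word
```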
Continuous Skip-gram

y = softmax((x0 W0) W1)
I Also a bilinear model
I Attempt to predict each context-word from middle
[ the dog ]
Example: Skip-gram
x = v(w5)
y = δ(w3)
Done for each word in window.
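A matching sketch for skip-gram (same toy dimensions as the CBOW sketch; the indices are illustrative): embed the middle word once, then predict each context word independently from the same distribution.

```python
import numpy as np

# Toy skip-gram forward pass.
rng = np.random.default_rng(0)
V, din = 1000, 50
W0 = rng.standard_normal((V, din)) * 0.1   # input (word) embeddings
W1 = rng.standard_normal((din, V)) * 0.1   # output embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

middle = 5                                  # index of the middle word
x = W0[middle]                              # x = v(w5)
y = softmax(x @ W1)                         # one distribution, reused for
context = [3, 4, 6, 7]                      # every context position
loss = -sum(np.log(y[c]) for c in context)  # cross-entropy summed over window
```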
Additional aspects
I The window dwin is sampled for each SGD step
I Frequent words are subsampled during SGD.
I We have slightly simplified the training objective.
Softmax Issues
Use a softmax to force a distribution,

softmax(z) = exp(z) / ∑_{c∈C} exp(zc)

log softmax(z) = z − log ∑_{c∈C} exp(zc)

I Issue: the class set C is huge.
I For C&W, 100,000 types; for word2vec, 1,000,000 types
I Note the largest dataset is 6 billion words
Two-Layer Softmax
First, cluster words into hard classes (for instance Brown clusters),
which group words together based on word-context.

(Figure: example clusters, e.g. class 5 = {car, truck, motorcycle, . . .} and class 3 = {dog, cat, horse, . . .})
Two-Layer Softmax
Assume that we first generate a class C and then a word,

p(Y |X) ≈ P(Y |C, X; θ)P(C |X; θ)

Estimate the distributions with a shared embedding layer,

P(C |X; θ):  y1 = softmax((x0 W0)W1 + b1)

P(Y |C = class, X; θ):  y2 = softmax((x0 W0)Wclass + b2)
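A minimal sketch of this two-layer (class-factored) softmax with a shared embedding layer; the class assignments and all sizes are toy assumptions.

```python
import numpy as np

# Class-factored softmax: log p(y|x) = log p(class(y)|x) + log p(y|class(y), x).
rng = np.random.default_rng(0)
V, C, din = 1000, 32, 50
word_class = rng.integers(0, C, size=V)     # hard class for each word

W0 = rng.standard_normal((V, din)) * 0.1    # shared word embeddings
W1 = rng.standard_normal((din, C)) * 0.1    # class scores
Wc = [rng.standard_normal((din, int((word_class == c).sum()))) * 0.1
      for c in range(C)]                    # per-class output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_prob(x_index, y_index):
    x = W0[x_index]
    c = word_class[y_index]
    p_class = softmax(x @ W1)               # over the C classes
    members = np.flatnonzero(word_class == c)
    p_word = softmax(x @ Wc[c])             # over words within the class
    j = int(np.searchsorted(members, y_index))
    return np.log(p_class[c]) + np.log(p_word[j])

lp = log_prob(3, 7)
```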
Softmax as Tree
(Figure: the words arranged as a two-layer tree, e.g. class 5 = {car, truck, motorcycle, . . .} and class 3 = {dog, cat, horse, . . .})

y(1) = softmax((x0 W0)W1 + b1)

y(2) = softmax((x0 W0)Wclass + b2)

L2SM(y(1), y(2), ŷ(1), ŷ(2)) = − log p(class(y)|x) − log p(y|x, class(y))
                             = − log y(1)_{c1} − log y(2)_{c2}
Speed
(Figure: the same two-layer tree over classes and words)

I Computing the loss only requires walking a path.
I With a two-layer balanced tree, computing the loss requires O(√|V|)
I (Note: computing full distribution requires O(|V|))
Hierarchical Softmax (HSM)

I Build a multi-layer tree

LHSM(y(1), . . . , y(C), ŷ(1), . . . , ŷ(C)) = −∑_i log y(i)_{ci}

I A balanced tree only requires O(log2 |V|)
I Experiments on website (Mnih and Hinton, 2008)
HSM with Huffman Encoding
I Requires O(log2 perp(unigram))
I Reduces time to only 1 day for 1.6 million tokens
Contents
Embedding Motivation
C&W Embeddings
word2vec
Evaluating Embeddings
How good are embeddings?
I Qualitative Analysis/Visualization
I Analogy task
I Extrinsic Metrics
Metrics
Dot product,

x_cat x_dog^T

Cosine similarity,

x_cat x_dog^T / (||x_cat|| ||x_dog||)
k-nearest neighbors (cosine sim)
dog
cat 0.921800527377
dogs 0.851315870426
horse 0.790758298322
puppy 0.775492121034
pet 0.772470734611
rabbit 0.772081457265
pig 0.749006160038
snake 0.73991884888
I Intuition: trained to match words that act the same.
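Cosine-similarity nearest neighbors can be sketched in a few lines; the embeddings below are random stand-ins, so the neighbors are meaningless here, but with trained vectors this reproduces lists like the dog/cat example above.

```python
import numpy as np

# k-nearest neighbors under cosine similarity.
rng = np.random.default_rng(0)
vocab = ["dog", "cat", "horse", "puppy", "car", "truck"]
E = rng.standard_normal((len(vocab), 50))   # toy embedding matrix

def nearest(word, k=3):
    q = E[vocab.index(word)]
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)               # most similar first
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:k]

print(nearest("dog"))
```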
Empirical Measures: Analogy task
Analogy questions:
A:B::C:
I 5 types of semantic questions, 9 types of syntactic
Embedding Tasks
Analogy Prediction
A:B::C:

x′ = xB − xA + xC

Project to the closest word,

arg max_{D∈V} xD x′^T / (||xD|| ||x′||)
I Code example
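A sketch of the analogy prediction above ("A is to B as C is to ?") via cosine similarity; the random embeddings are toy stand-ins for trained vectors, and the query words themselves are excluded from the projection.

```python
import numpy as np

# Analogy prediction: x' = xB - xA + xC, then arg max cosine similarity.
rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "paris", "france"]
E = rng.standard_normal((len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows

def analogy(a, b, c):
    x = E[vocab.index(b)] - E[vocab.index(a)] + E[vocab.index(c)]
    sims = E @ (x / np.linalg.norm(x))
    for w in (a, b, c):                         # exclude the query words
        sims[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(sims))]

print(analogy("man", "woman", "king"))   # with trained vectors: "queen"
```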
Extrinsic Tasks
I Text classification
I Part-of-speech tagging
I Many, many others over last couple years
Conclusion
I Word Embeddings
I Scaling issues and tricks
I Next Class: Language Modeling