Distributed Representations of Words and Phrases and their Compositionality
Abdullah Khan Zehady
Neural Word Embedding
● Continuous vector space representation
  o Words represented as dense real-valued vectors in R^d
● Distributed word representation ↔ Word Embedding
  o Embed an entire vocabulary into a relatively low-dimensional linear space where dimensions are latent continuous features.
● Classical n-gram model works in terms of discrete units
  o No inherent relationship between units in an n-gram model.
● In contrast, word embeddings capture regularities and relationships between words.
Syntactic & Semantic Relationship
Regularities are observed as a constant offset vector between pairs of words sharing some relationship (a small vector-arithmetic sketch follows the examples below).
Gender Relation: KING - QUEEN ~ MAN - WOMAN
Singular/Plural Relation: KING - KINGS ~ QUEEN - QUEENS
Other Relations:
Language: France - French ~ Spain - Spanish
Past Tense: Go - Went ~ Capture - Captured
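As an illustration of the offset regularity, the snippet below answers an analogy question ("man is to king as woman is to ?") with plain vector arithmetic and cosine similarity. It is a minimal sketch: the `vectors` dictionary here holds random placeholder embeddings, not actual trained vectors.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vectors):
    # Return the word whose vector is closest to vec(b) - vec(a) + vec(c),
    # e.g. analogy("man", "king", "woman", ...) should rank "queen" first
    # when the embeddings capture the gender offset.
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical embeddings (random, purely for illustration).
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50)
           for w in ["king", "queen", "man", "woman", "spain", "spanish"]}
print(analogy("man", "king", "woman", vectors))
```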
Neural Net Language Model (LM)
Different models for estimating continuous representations of words:
● Latent Semantic Analysis (LSA)
● Latent Dirichlet Allocation (LDA)
● Neural Network Language Model (NNLM)
Feed Forward NNLM
● Consists of input, projection, hidden and output layers.
● N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26 (see the sketch below).
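A quick sketch of the 1-of-V coding mentioned above; the toy vocabulary and word order are made up for illustration.

```python
import numpy as np

def one_hot(word, vocab):
    # 1-of-V coding: a V-dimensional vector with a single 1 at the word's index.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["A", "B", "C", "Z"]      # toy vocabulary, V = 4
print(one_hot("B", vocab))        # [0. 1. 0. 0.]
```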
● NNLM becomes computationally complex between the projection (P) and hidden (H) layers.
● For N = 10, size of P = 500-2000, size of H = 500-1000. The hidden layer is used to compute a probability distribution over all the words in the vocabulary V.
● Hierarchical softmax comes to the rescue (a rough complexity comparison follows below).
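To see why the output layer dominates, the per-example training cost of the feed-forward NNLM is roughly Q = N×D + N×D×H + H×V (projection, hidden, and output terms); hierarchical softmax replaces the H×V term with about H×log2(V). A back-of-the-envelope calculation using the sizes above (the vocabulary size is an assumed example value):

```python
import math

# context size, embedding dim, hidden size, vocabulary size (assumed example)
N, D, H, V = 10, 500, 500, 1_000_000

full_softmax = N * D + N * D * H + H * V
hier_softmax = N * D + N * D * H + H * math.ceil(math.log2(V))

print(full_softmax)   # 502,505,000 multiplications per example
print(hier_softmax)   # 2,515,000 -- the output term shrinks from H*V to ~H*log2(V)
```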
Recurrent NNLM
● No projection layer; consists of input, hidden and output layers only.
● No need to specify the context length as in the feed-forward NNLM.
● What is special in the RNN model? The recurrent matrix that connects the hidden layer to itself.
Recurrent NNLM
● w(t): input word at time t
● y(t): output layer, produces a probability distribution over words
● s(t): hidden layer
● U: each column represents a word
● The RNN is trained with backpropagation to maximize the log likelihood (a one-step forward sketch follows below).
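A minimal sketch of one forward step of the recurrent NNLM described above: s(t) = sigmoid(U·w(t) + W·s(t-1)) and y(t) = softmax(V·s(t)). Dimensions and initialization are arbitrary toy values, not the authors' implementation.

```python
import numpy as np

V_size, H = 10, 8                                 # toy vocabulary and hidden sizes
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V_size))       # input -> hidden (each column acts as a word vector)
W = rng.normal(scale=0.1, size=(H, H))            # hidden -> hidden (the recurrent matrix)
Vout = rng.normal(scale=0.1, size=(V_size, H))    # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_index, s_prev):
    # w(t) is 1-of-V, so multiplying U by it just selects a column of U.
    s_t = 1.0 / (1.0 + np.exp(-(U[:, word_index] + W @ s_prev)))   # hidden state s(t)
    y_t = softmax(Vout @ s_t)                                      # prob. dist. over words y(t)
    return s_t, y_t

s = np.zeros(H)
for w in [0, 3, 7]:        # a toy word-index sequence
    s, y = step(w, s)
print(y.round(3))
```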
Continuous Bag of Words Model (CBOW)
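CBOW predicts the current word from the average of the surrounding context word vectors. Below is a minimal sketch using a full softmax output (in practice the paper uses hierarchical softmax or negative sampling); all sizes are toy values.

```python
import numpy as np

V, D = 12, 6                                   # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.1, size=(V, D))      # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, D))     # output (target) embeddings

def cbow_probs(context_ids):
    h = W_in[context_ids].mean(axis=0)         # average the context word vectors
    scores = W_out @ h                         # score every vocabulary word
    e = np.exp(scores - scores.max())
    return e / e.sum()                         # softmax over the vocabulary

p = cbow_probs([2, 5, 7, 9])                   # context words around the center word
print(p.argmax(), p.max())
```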
Hierarchical Softmax
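For reference, the paper defines the hierarchical softmax probability of a word w given the input word w_I as a product of sigmoid decisions along the binary-tree path from the root to w:

```latex
p(w \mid w_I) = \prod_{j=1}^{L(w)-1}
  \sigma\!\Big( [\![\, n(w, j{+}1) = \mathrm{ch}(n(w,j)) \,]\!] \cdot
  {v'_{n(w,j)}}^{\top} v_{w_I} \Big)
```

where n(w, j) is the j-th node on the path from the root to w, L(w) is the path length, ch(n) is an arbitrary fixed child of n, and [[x]] is 1 if x is true and -1 otherwise. The cost of evaluating one output word drops from V to about log2(V) inner-product computations.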
Negative Sampling
Negative Sampling
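Negative sampling replaces the full softmax objective: for each observed (input, output) pair (w_I, w_O), the model learns to distinguish the true context word from k words drawn from a noise distribution. The paper's per-pair objective is:

```latex
\log \sigma\!\big({v'_{w_O}}^{\top} v_{w_I}\big)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
  \Big[ \log \sigma\!\big(-{v'_{w_i}}^{\top} v_{w_I}\big) \Big]
```

with k in the range 5-20 for small training sets and 2-5 for large ones, and the noise distribution P_n(w) taken as the unigram distribution raised to the 3/4 power.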
Subsampling of Frequent Words
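Each training word w_i is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the word's frequency and t is a threshold (around 10^-5 in the paper). A minimal sketch of that filter, with made-up toy frequencies:

```python
import random

def keep(word, freq, t=1e-5):
    # freq[word]: fraction of the corpus made up of this word.
    f = freq[word]
    p_discard = max(0.0, 1.0 - (t / f) ** 0.5)   # P(w_i) = 1 - sqrt(t / f(w_i))
    return random.random() >= p_discard

freq = {"the": 0.05, "aardvark": 1e-6}           # toy unigram frequencies
corpus = ["the", "aardvark", "the", "the"]
print([w for w in corpus if keep(w, freq)])      # frequent words are mostly dropped
```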
Skip-gram Model
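The skip-gram model uses each word to predict the words in a window of size c around it; training maximizes the average log probability:

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \neq 0}}
  \log p(w_{t+j} \mid w_t)
```

where p(w_O | w_I) is defined by a softmax over the output word vectors, approximated in practice with hierarchical softmax or negative sampling as above.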
Empirical Result
Skip-gram Model
Learning Phrases
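Phrases are found with a simple data-driven pass over the corpus: bigrams whose words occur together much more often than chance are merged into single tokens. The paper scores a bigram as score(w_i, w_j) = (count(w_i w_j) - δ) / (count(w_i) · count(w_j)), with a discounting coefficient δ that suppresses very infrequent pairs. A minimal sketch (the δ and threshold values are illustrative, not the paper's settings):

```python
from collections import Counter

def find_phrases(tokens, delta=5, threshold=1e-4):
    # Count unigrams and adjacent bigrams in one pass over the corpus.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:        # merge "a b" into a single token "a_b"
            phrases.add((a, b))
    return phrases

tokens = ["new", "york", "times", "new", "york", "is", "big"] * 10
print(find_phrases(tokens))          # bigrams scoring above the threshold
```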
Phrase Skip-gram Results
Additive compositionality
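Because skip-gram vectors are trained to predict their contexts more or less linearly, element-wise addition of two word vectors tends to land near words that co-occur with both; the paper's example is vec("Russia") + vec("river") being close to vec("Volga River"). A minimal sketch of that check, assuming a dictionary of pre-trained vectors (random placeholders here):

```python
import numpy as np

def closest(query, vectors, exclude=()):
    # Return the word whose vector has the highest cosine similarity to `query`.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], query))

rng = np.random.default_rng(2)
vectors = {w: rng.normal(size=50)              # hypothetical pre-trained embeddings
           for w in ["russia", "river", "volga_river", "moscow", "nile"]}

query = vectors["russia"] + vectors["river"]   # element-wise addition
print(closest(query, vectors, exclude=("russia", "river")))
```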
Compare with published word representations