Midterm Review Instructor: Wei Xu
Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent
Classification
Text Classification
[Figure: a test document containing words like "parser, language, label, translation, …" must be assigned to one of several categories (Machine Learning, NLP, Planning, GUI, Garbage Collection), each characterized by its typical words, e.g. "learning, training, algorithm, shrinkage, network, …" for Machine Learning; "garbage, collection, memory, optimization, region, …" for Garbage Collection; "planning, temporal, reasoning, plan, language, …" for Planning.]
Classification Methods: Supervised Machine Learning
• Input:
  – a document d
  – a fixed set of classes C = {c_1, c_2, …, c_J}
  – a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m)
• Output:
  – a learned classifier γ: d → c
Naïve Bayes Classifier
c_MAP = argmax_{c∈C} P(c | d)                      MAP is "maximum a posteriori" = most likely class
      = argmax_{c∈C} P(d | c) P(c) / P(d)          Bayes Rule
      = argmax_{c∈C} P(d | c) P(c)                 dropping the denominator
      = argmax_{c∈C} P(x_1, x_2, …, x_n | c) P(c)
c_NB = argmax_{c∈C} P(c) ∏_{x∈X} P(x | c)

bag of words
conditional independence assumption
Multinomial Naïve Bayes: Learning
• maximum likelihood estimates – simply use the frequencies in the data
P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w∈V} count(w, c_j)

P̂(c_j) = doccount(C = c_j) / N_doc

With add-1 smoothing:

P̂(w_i | c_j) = ( count(w_i, c_j) + 1 ) / ( Σ_{w∈V} count(w, c_j) + |V| )
• Unknown words (words that do not occur anywhere in the training set) can simply be ignored at test time.
• Laplace (add-1) smoothing avoids zero probabilities.
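The training and prediction steps above can be sketched in Python. This is a minimal multinomial Naive Bayes with add-1 smoothing; all function and variable names are illustrative, not from the slides, and unknown test words are ignored as described.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocab = {w for d in docs for w in d}
    classes = set(labels)
    # priors: P(c) = doccount(c) / N_doc
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    total = {c: sum(word_counts[c].values()) for c in classes}

    def log_likelihood(w, c):
        # add-1 smoothing: (count(w,c) + 1) / (sum_w count(w,c) + |V|)
        return math.log((word_counts[c][w] + 1) / (total[c] + len(vocab)))

    return prior, log_likelihood, vocab

def predict_nb(doc, prior, log_likelihood, vocab):
    scores = {}
    for c in prior:
        # unknown words (not in the training vocabulary) are ignored
        scores[c] = math.log(prior[c]) + sum(
            log_likelihood(w, c) for w in doc if w in vocab)
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.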
Weaknesses of Naïve Bayes
• Assuming conditional independence
• Correlated features -> double counting evidence – Parameters are estimated independently
• This can hurt classifier accuracy and calibration
Logistic Regression
• Doesn't assume conditional independence of features
  – better calibrated probabilities
  – can handle highly correlated, overlapping features
Logistic vs. Softmax Regression
• Logistic regression: one estimated probability; good for binary classification.
• Softmax regression: one estimated probability for each class; predict the class with the highest probability.
Generalization to Multi-class Problems
Maximum Entropy Models (MaxEnt)
• a.k.a. logistic regression, multinomial logistic regression, multiclass logistic regression, or softmax regression
• belongs to the family of classifiers known as log-linear classifiers
• Math proof of "LR = MaxEnt": [Klein and Manning 2003], [Mount 2011]
  http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
• The softmax function converts scores into probabilities between [0, 1].
• The training objective (the log-likelihood) is a concave function!
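The softmax function itself is a one-liner; a minimal sketch (the max-subtraction is a standard numerical-stability trick, not something the slides mention):

```python
import math

def softmax(scores):
    # subtract the max so exp() never overflows; result is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]
```

The outputs are nonnegative and sum to 1, so they can be read as a probability distribution over classes.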
Gradient Ascent
Loop while not converged:
  For all features j, compute and add derivatives

• Objective function: the training-set log-likelihood
• Gradient vector: the vector of its partial derivatives, one per feature weight
LR Gradient
logistic function
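A minimal batch gradient-ascent sketch for binary logistic regression, assuming labels y ∈ {0, 1}; the gradient of the log-likelihood in feature j is Σ_i (y_i − σ(w·x_i)) x_ij. The learning rate and iteration count are illustrative hyperparameters, not values from the slides.

```python
import math

def sigmoid(z):
    # the logistic function
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient_ascent(X, y, lr=0.1, iters=1000):
    w = [0.0] * len(X[0])
    for _ in range(iters):
        # gradient_j = sum_i (y_i - sigmoid(w . x_i)) * x_ij
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        # ascend: move w in the direction of the gradient
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

Because the log-likelihood is concave, gradient ascent converges toward the global optimum (for separable data the weights grow without bound, which is why regularization is used in practice).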
Perceptron vs. Logistic Regression
weights (parameters)
inputs (features)
choices of activation function
Initialize weight vector w = 0
Loop for K iterations
  Loop for all training examples x_i
    if sign(w · x_i) != y_i
      w += y_i * x_i
Error-driven!
Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing the gradient
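The binary perceptron pseudocode above can be made runnable as follows, assuming labels y ∈ {−1, +1}; treating sign(0) as −1 is my assumption, since the pseudocode does not specify it.

```python
def perceptron(X, y, K=10):
    w = [0.0] * len(X[0])
    for _ in range(K):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            # error-driven: update only on a misclassification
            # (sign(0) is treated as -1 here, an arbitrary choice)
            if (1 if score > 0 else -1) != yi:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w
```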
MultiClass Perceptron Algorithm
Initialize weight vectors w_y = 0
Loop for K iterations
  Loop for all training examples x_i
    y_pred = argmax_y w_y · x_i
    if y_pred != y_i
      w_{y_gold} += x_i
      w_{y_pred} -= x_i
increase score for right answer
decrease score for wrong answer
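A runnable sketch of the multiclass perceptron above: one weight vector per class, and on an error the gold class's score is raised while the predicted class's score is lowered. Ties in the argmax go to the first class in the list, an arbitrary choice.

```python
def multiclass_perceptron(X, y, classes, K=10):
    W = {c: [0.0] * len(X[0]) for c in classes}
    for _ in range(K):
        for xi, yi in zip(X, y):
            # y_pred = argmax_y (w_y . x_i)
            y_pred = max(classes, key=lambda c: sum(
                wj * xj for wj, xj in zip(W[c], xi)))
            if y_pred != yi:
                # increase score for the right answer
                W[yi] = [wj + xj for wj, xj in zip(W[yi], xi)]
                # decrease score for the wrong answer
                W[y_pred] = [wj - xj for wj, xj in zip(W[y_pred], xi)]
    return W
```

With two classes this pair of updates is effectively the single binary update w += y_i * x_i applied to the difference of the two weight vectors.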
Perceptron vs. Logistic Regression
• Only hyperparameter of perceptron is maximum number of iterations (LR also needs learning rate)
• Perceptron is guaranteed to converge if the data is linearly separable (LR always converges)
Perceptron vs. Logistic Regression
• The Perceptron is an online learning algorithm.
• Logistic Regression is not.
this update is effectively the same as “w += y_i * x_i”
Multinomial LR Gradient
gradient = empirical feature count minus expected feature count
MAP-based learning (Perceptron)
• Maximum A Posteriori: approximate the expectation using maximization, i.e. replace the expected feature count with the feature count of the single highest-scoring output (≈)
Markov Assumption, Perplexity, Interpolation, Backoff, and
Kneser-Ney Smoothing
Language Modeling
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w_1, w_2, w_3, w_4, w_5 … w_n)
• Related task: probability of an upcoming word: P(w_5 | w_1, w_2, w_3, w_4)
• A model that computes either P(W) or P(w_n | w_1, w_2 … w_{n-1}) is called a language model or LM
The Chain Rule applied to compute joint probability of words in sentence
Markov Assumption
Bigram model
• Condition on the previous word: P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-1})
Add-one estimation
•Also called Laplace smoothing
• Pretend we saw each word one more time than we did • Just add one to all the counts!
•MLE estimate:
•Add-1 estimate:
P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

P_Add-1(w_i | w_{i-1}) = ( c(w_{i-1}, w_i) + 1 ) / ( c(w_{i-1}) + V )
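The two estimates above can be sketched together; this assumes sentence-boundary markers "<s>"/"</s>" counted as part of the vocabulary, which is one common convention rather than something the slides specify.

```python
from collections import Counter

def bigram_add1(corpus):
    """corpus: list of token lists (sentences). Returns p(w, prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])          # context counts c(w_{i-1})
        bigrams.update(zip(tokens, tokens[1:]))  # c(w_{i-1}, w_i)
    V = len({w for s in corpus for w in s} | {"<s>", "</s>"})

    def p(w, prev):
        # P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    return p
```

For any fixed context, the add-1 probabilities over the whole vocabulary still sum to 1, which is the sanity check that the +V in the denominator buys.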
Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams: •We’ll see better methods
• But add-1 is used to smooth other NLP models • For text classification • In domains where the number of zeros isn’t so huge.
•Linear interpolation
•Backoff •Absolute Discounting •Kneser-Ney Smoothing
Better Language Models
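Of the methods listed above, linear interpolation is the simplest to sketch: mix unigram, bigram, and trigram estimates with weights that sum to 1. The lambda values below are illustrative; in practice they are tuned on held-out data.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """p_uni(w), p_bi(w, prev1), p_tri(w, prev2, prev1): component models."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1

    def p(w, prev2, prev1):
        # weighted mixture of the three n-gram estimates
        return (l1 * p_uni(w)
                + l2 * p_bi(w, prev1)
                + l3 * p_tri(w, prev2, prev1))

    return p
```

Because each component is a probability distribution and the weights sum to 1, the mixture is itself a valid distribution.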
Perplexity
Perplexity is the inverse probability of the test set, "normalized" by the number of words: PP(W) = P(w_1 w_2 … w_N)^(−1/N)
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set • Gives the highest P(sentence)
Chain rule, for bigram models: PP(W) = ( ∏_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)
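A sketch of perplexity for a bigram model, computed in log space to avoid underflow; the conditional-probability interface p(w, prev) and the boundary-marker convention are my assumptions, not from the slides.

```python
import math

def perplexity(sentences, p):
    """sentences: list of token lists; p(w, prev): bigram probability."""
    log_prob, N = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(p(w, prev))
            N += 1
    # PP(W) = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * sum log P)
    return math.exp(-log_prob / N)
```

As a sanity check, a model that assigns every word probability 1/k has perplexity exactly k, so lower perplexity means the model predicts the test set better.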
Hidden Markov Models, Maximum Entropy Markov
Models (Log-linear Models for Tagging), and Viterbi Algorithm
Tagging
Tagging (Sequence Labeling)
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Many NLP problems can be viewed as sequence labeling:
  – POS Tagging
  – Chunking
  – Named Entity Tagging
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors
Plays/VBZ well/RB with/IN others/NNS .
Emission parameters
Trigram parameters
Andrew Viterbi, 1967
running time for trigram HMM is O(n|S|³)
running time for bigram HMM is O(n|S|²)
Decoding
§ Linear Perceptron: s* = argmax_s w · Φ(x, s)
§ Features must be local: for x = x_1 … x_m and s = s_1 … s_m,
  Φ(x, s) = Σ_{j=1}^{m} φ(x, j, s_{j-1}, s_j)
§ Define π(i, s_i) to be the max score of a sequence of length i ending in tag s_i
§ Viterbi algorithm (HMMs):
  π(i, s_i) = max_{s_{i-1}} e(x_i | s_i) q(s_i | s_{i-1}) π(i-1, s_{i-1})
§ Viterbi algorithm (MaxEnt):
  π(i, s_i) = max_{s_{i-1}} p(s_i | s_{i-1}, x_1 … x_m) π(i-1, s_{i-1})
§ Viterbi algorithm (Perceptron):
  π(i, s_i) = max_{s_{i-1}} [ w · φ(x, i, s_{i-1}, s_i) + π(i-1, s_{i-1}) ]
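The HMM recurrence π(i, s_i) = max_{s_{i-1}} e(x_i|s_i) q(s_i|s_{i-1}) π(i-1, s_{i-1}) can be sketched directly, giving the O(n|S|²) bigram Viterbi with backpointers. Representing e and q as dicts keyed by tuples, and the "<s>" start symbol, are my assumptions about the interface.

```python
def viterbi(words, states, q, e, start="<s>"):
    """q[(s_prev, s)]: transition prob; e[(s, w)]: emission prob."""
    pi = {start: 1.0}  # pi[s] = best score of a path ending in tag s
    back = []          # one backpointer table per position
    for w in words:
        new_pi, bp = {}, {}
        for s in states:
            # max over previous tags: pi(prev) * q(s|prev)
            best_prev = max(pi, key=lambda sp: pi[sp] * q.get((sp, s), 0.0))
            new_pi[s] = (pi[best_prev] * q.get((best_prev, s), 0.0)
                         * e.get((s, w), 0.0))
            bp[s] = best_prev
        pi, back = new_pi, back + [bp]
    # follow backpointers from the best final tag
    s = max(pi, key=pi.get)
    tags = [s]
    for bp in reversed(back[1:]):
        s = bp[s]
        tags.append(s)
    return list(reversed(tags))
```

In practice scores are summed in log space instead of multiplied, for the same underflow reason as in language-model evaluation.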