
Linear models for classification

Grzegorz Chrupała and Nicolas Stroppa

Saarland University / Google

META Workshop


Outline

1 Linear models

2 Perceptron

3 Naïve Bayes

4 Logistic regression


An example

Positive examples are blank, negative are filled.

[Scatter plot of labeled points in the (x, y) plane, with the lines y = −1x − 0.5, y = −3x + 1, and y = 69x + 1 drawn as candidate separators]

Linear models

Think of training examples as points in d-dimensional space. Each dimension corresponds to one feature.

A linear binary classifier defines a plane in the space which separates positive from negative examples.

Linear decision boundary

A hyperplane is a generalization of a straight line to more than 2 dimensions.

A hyperplane contains all the points in a d-dimensional space satisfying the following equation:

$w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0 = 0$

Each coefficient $w_i$ can be thought of as a weight on the corresponding feature.

The vector containing all the weights $\mathbf{w} = (w_0, \dots, w_d)$ is the parameter vector or weight vector.

Normal vector

Geometrically, the weight vector $\mathbf{w}$ is a normal vector of the separating hyperplane.

A normal vector of a surface is any vector which is perpendicular to it.

Hyperplane as a classifier

Let

$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0$

Then

$y = \operatorname{sign}(g(\mathbf{x})) = \begin{cases} +1 & \text{if } g(\mathbf{x}) \geq 0 \\ -1 & \text{otherwise} \end{cases}$

Bias

The slope of the hyperplane is determined by $w_1 \dots w_d$. The location (intercept) is determined by the bias $w_0$.

Include the bias in the weight vector and add a dummy component to the feature vector.

Set this component to $x_0 = 1$.

Then

$g(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
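As a concrete illustration, here is a minimal sketch of this decision rule in Python; the weights and features below are made-up values, and the leading 1 is the dummy bias component:

    import numpy as np

    def predict(w, x):
        """Classify x with weights w; x[0] is the dummy component fixed at 1."""
        g = np.dot(w, x)            # g(x) = w . x, with w[0] acting as the bias
        return 1 if g >= 0 else -1

    w = np.array([-0.5, 1.0, -2.0])   # w[0] is the bias w0
    x = np.array([1.0, 2.0, 0.5])     # x[0] = 1 (dummy feature)
    print(predict(x=x, w=w))          # g = -0.5 + 2.0 - 1.0 = 0.5, so +1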

Separating hyperplanes in 2 dimensions

[The same scatter plot as before, showing three separating lines: y = −1x − 0.5, y = −3x + 1, and y = 69x + 1]

Learning

The goal of the learning process is to come up with a "good" weight vector w.

The learning process will use examples to guide the search for a "good" w.

Different notions of "goodness" exist, which yield different learning algorithms.

We will describe some of these algorithms in the following.


Perceptron training

How do we find a set of weights that separate our classes?

Perceptron: a simple mistake-driven online algorithm

- Start with a zero weight vector and process each training example in turn.
- If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
- If the weights stop changing, stop.

If the examples are linearly separable, then this algorithm is guaranteed to converge to a solution vector.

Fixed increment online perceptron algorithm

Binary classification, with classes +1 and −1

Decision function $y' = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x})$

Perceptron(x^(1:N), y^(1:N), I):
1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y^(n) (w · x^(n)) ≤ 0 then
5:       w ← w + y^(n) x^(n)
6: return w

Or more explicitly

1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y^(n) = sign(w · x^(n)) then
5:       pass
6:     else if y^(n) = +1 ∧ sign(w · x^(n)) = −1 then
7:       w ← w + x^(n)
8:     else if y^(n) = −1 ∧ sign(w · x^(n)) = +1 then
9:       w ← w − x^(n)
10: return w
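The algorithm translates almost line for line into Python. This sketch assumes numpy arrays, with the bias folded into the features as a constant column:

    import numpy as np

    def perceptron(X, y, I):
        """Fixed-increment online perceptron.
        X: N x d feature matrix (bias folded in as a constant column),
        y: length-N label vector with entries in {-1, +1}, I: epochs."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(I):
            for n in range(N):
                if y[n] * np.dot(w, X[n]) <= 0:   # mistake (or on the boundary)
                    w += y[n] * X[n]              # move w toward the example
        return w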

Weight averaging

Although the algorithm is guaranteed to converge, the solution is not unique!

Sensitive to the order in which examples are processed

Separating the training sample does not equal good accuracy on unseen data

Empirically, better generalization performance with weight averaging

- A method of avoiding overfitting
- As the final weight vector, use the mean of all the weight vector values for each step of the algorithm
- (cf. regularization in a following session)


Probabilistic model

Instead of thinking in terms of multidimensional space...

Classification can be approached as a probability estimation problem.

We will try to find a probability distribution which

- Describes our training data well
- Allows us to make accurate predictions

We'll look at Naive Bayes as the simplest example of a probabilistic classifier.

Representation of examples

We are trying to classify documents. Let's represent a document as the sequence of terms (words) it contains: $\mathbf{t} = (t_1 \dots t_n)$

For (binary) classification we want to find the most probable class:

$y = \operatorname{argmax}_{y \in \{-1,+1\}} P(Y = y \mid \mathbf{t})$

But documents are close to unique: we cannot reliably condition $Y$ on $\mathbf{t}$.

Bayes' rule to the rescue.

Bayes' rule

Bayes' rule determines how joint and conditional probabilities are related:

$P(Y = y_i \mid X = x) = \dfrac{P(X = x \mid Y = y_i)\, P(Y = y_i)}{\sum_j P(X = x \mid Y = y_j)\, P(Y = y_j)}$

That is:

$\text{posterior} = \dfrac{\text{prior} \times \text{likelihood}}{\text{evidence}}$

Prior and likelihood

With Bayes' rule we can invert the direction of conditioning:

$y = \operatorname{argmax}_y \dfrac{P(Y = y)\, P(\mathbf{t} \mid Y = y)}{\sum_{y'} P(Y = y')\, P(\mathbf{t} \mid Y = y')} = \operatorname{argmax}_y P(Y = y)\, P(\mathbf{t} \mid Y = y)$

(the denominator is the same for every $y$, so it can be dropped)

We have decomposed the task into estimating the prior $P(Y)$ (easy) and the likelihood $P(\mathbf{t} \mid Y = y)$.

Conditional independence

How do we estimate $P(\mathbf{t} \mid Y = y)$?

Naively assume the occurrence of each word in the document is independent of the others, when conditioned on the class:

$P(\mathbf{t} \mid Y = y) = \prod_{i=1}^{|\mathbf{t}|} P(t_i \mid Y = y)$

Naive Bayes

Putting it all together:

$y = \operatorname{argmax}_y P(Y = y) \prod_{i=1}^{|\mathbf{t}|} P(t_i \mid Y = y)$

Decision function

For binary classification:

$g(\mathbf{t}) = \dfrac{P(Y = +1) \prod_{i=1}^{|\mathbf{t}|} P(t_i \mid Y = +1)}{P(Y = -1) \prod_{i=1}^{|\mathbf{t}|} P(t_i \mid Y = -1)} = \dfrac{P(Y = +1)}{P(Y = -1)} \prod_{i=1}^{|\mathbf{t}|} \dfrac{P(t_i \mid Y = +1)}{P(t_i \mid Y = -1)}$

$y = \begin{cases} +1 & \text{if } g(\mathbf{t}) \geq 1 \\ -1 & \text{otherwise} \end{cases}$
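A small sketch of this classifier in Python, with add-alpha smoothed relative-count estimates (the smoothing choice is illustrative; estimation and smoothing are discussed on the comparison slides below):

    from collections import Counter

    def train_nb(docs, labels, alpha=1.0):
        """Estimate P(Y) and P(t|Y) from relative counts, with add-alpha smoothing.
        docs: list of token lists; labels: list of labels in {-1, +1}."""
        prior = Counter(labels)
        counts = {+1: Counter(), -1: Counter()}
        for toks, y in zip(docs, labels):
            counts[y].update(toks)
        vocab = {t for toks in docs for t in toks}
        total = {y: sum(counts[y].values()) for y in (+1, -1)}
        p_class = lambda y: prior[y] / len(labels)
        p_term = lambda t, y: (counts[y][t] + alpha) / (total[y] + alpha * len(vocab))
        return p_class, p_term

    def classify(tokens, p_class, p_term):
        """Predict +1 iff the likelihood ratio g(t) >= 1."""
        g = p_class(+1) / p_class(-1)
        for t in tokens:
            g *= p_term(t, +1) / p_term(t, -1)
        return +1 if g >= 1 else -1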

Documents in vector notation

Let's represent documents as vocabulary-size-dimensional count vectors:

             V1      V2       V3      V4
             Obama   Ferrari  voters  movies
    x = (    1       0        2       0    )

Dimension $i$ indicates how many times the $i$-th vocabulary item appears in document $\mathbf{x}$.

Naive Bayes in vector notation

Counts appear as exponents:

$g(\mathbf{x}) = \dfrac{P(+1)}{P(-1)} \prod_{i=1}^{|V|} \left( \dfrac{P(V_i \mid +1)}{P(V_i \mid -1)} \right)^{x_i}$

If we take the logarithm of the threshold ($\ln 1 = 0$) and of $g$, we get an equivalent decision function:

$h(\mathbf{x}) = \ln \left( \dfrac{P(+1)}{P(-1)} \right) + \sum_{i=1}^{|V|} x_i \ln \left( \dfrac{P(V_i \mid +1)}{P(V_i \mid -1)} \right)$

Linear classifier

Remember the linear classifier?

$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$

$h(\mathbf{x}) = \ln \left( \dfrac{P(+1)}{P(-1)} \right) + \sum_{i=1}^{|V|} x_i \ln \left( \dfrac{P(V_i \mid +1)}{P(V_i \mid -1)} \right)$

The log prior ratio corresponds to the bias term.

The log likelihood ratios correspond to the feature weights.
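To make the correspondence concrete, the (hypothetical) p_class and p_term estimators from the earlier sketch can be packed into a bias plus one weight per vocabulary item:

    import math

    def nb_weights(p_class, p_term, vocab):
        """Naive Bayes as a linear model: h(x) = w0 + sum_i w[i] * x[i]."""
        w0 = math.log(p_class(+1) / p_class(-1))              # log prior ratio
        w = {t: math.log(p_term(t, +1) / p_term(t, -1)) for t in vocab}
        return w0, w

    def h(counts, w0, w):
        """counts maps vocabulary items to their frequency in the document."""
        return w0 + sum(w[t] * c for t, c in counts.items() if t in w)

Classifying with sign(h(x)) gives the same decisions as the ratio test g(t) ≥ 1.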

What is the difference?

Training criterion and procedure.

Perceptron: zero-one loss function

$\operatorname{error}(\mathbf{w}, D) = \sum_{(\mathbf{x}, y) \in D} \begin{cases} 0 & \text{if } \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}) = y \\ 1 & \text{otherwise} \end{cases}$

Error-driven algorithm

Naive Bayes: Maximum Likelihood criterion

$P(D \mid \theta) = \prod_{(\mathbf{x}, y) \in D} P(Y = y \mid \theta)\, P(\mathbf{x} \mid Y = y, \theta)$

Find the parameters which maximize the log likelihood:

$\theta = \operatorname{argmax}_\theta \log P(D \mid \theta)$

Parameters reduce to relative counts

+ ad-hoc smoothing

Alternatives exist (e.g. maximum a posteriori)

Comparison

    Model                      Naive Bayes   Perceptron
    Model power                Linear        Linear
    Type                       Generative    Discriminative
    Distribution modeled       P(x, y)       N/A
    Smoothing                  Crucial       Optional
    Independence assumptions   Strong        None


Probabilistic conditional model

Let's try to come up with a probabilistic model which has some of the advantages of the perceptron.

Model $P(y \mid \mathbf{x})$ directly, and not via $P(\mathbf{x}, y)$ and Bayes' rule as in Naive Bayes.

Avoid the issue of dependencies between the features of $\mathbf{x}$.

We'll take linear regression as a starting point.

- The goal is to adapt regression to model the class-conditional probability

Linear Regression

Training data: observations paired with outcomes ($y \in \mathbb{R}$)

Observations have features (predictors, typically also real numbers)

The model is a regression line $y = ax + b$ which best fits the observations

- $a$ is the slope
- $b$ is the intercept
- This model has two parameters (or weights)
- One feature = $x$
- Example:
  - $x$ = number of vague adjectives in property descriptions
  - $y$ = amount the house sold for over the asking price


Multiple linear regression

More generally $y = w_0 + \sum_{i=1}^{d} w_i x_i$, where

- $y$ = outcome
- $w_0$ = intercept
- $x_1 .. x_d$ = feature vector and $w_1 .. w_d$ = weight vector
- Get rid of the explicit bias: $g(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w} \cdot \mathbf{x}$

Linear regression: uses $g(\mathbf{x})$ directly

Linear classifier: uses $\operatorname{sign}(g(\mathbf{x}))$

Learning linear regression

Minimize the sum of squared errors over the $N$ training examples:

$\operatorname{Err}(\mathbf{w}) = \sum_{n=1}^{N} \left( g(\mathbf{x}^{(n)}) - y^{(n)} \right)^2$

There is a closed-form formula for choosing the best weights $\mathbf{w}$:

$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$

where the matrix $X$ contains the training example features, and $\mathbf{y}$ is the vector of outcomes.
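A sketch of the closed-form solution with numpy, on made-up data. It solves the normal equations with a linear solve; numerically, np.linalg.lstsq is usually preferred over forming an explicit inverse:

    import numpy as np

    # Toy data: constant column for the bias w0, plus one feature
    X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
    y = np.array([0.1, 0.9, 2.1, 2.9])

    # w = (X^T X)^{-1} X^T y, computed without inverting X^T X explicitly
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)   # approximately [0.06, 0.96]: intercept and slope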

Logistic regression

In logistic regression we use the linear model to assign probabilities to class labels.

For binary classification, predict $P(Y = 1 \mid \mathbf{x})$. But the predictions of the linear regression model are in $\mathbb{R}$, whereas $P(Y = 1 \mid \mathbf{x}) \in [0, 1]$.

Instead, predict the logit of the probability:

$\ln \left( \dfrac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} \right) = \mathbf{w} \cdot \mathbf{x}$

$\dfrac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} = e^{\mathbf{w} \cdot \mathbf{x}}$

Solving for $P(Y = 1 \mid \mathbf{x})$ we obtain:

$P(Y = 1 \mid \mathbf{x}) = e^{\mathbf{w} \cdot \mathbf{x}} \left( 1 - P(Y = 1 \mid \mathbf{x}) \right)$

$P(Y = 1 \mid \mathbf{x}) = e^{\mathbf{w} \cdot \mathbf{x}} - e^{\mathbf{w} \cdot \mathbf{x}} P(Y = 1 \mid \mathbf{x})$

$P(Y = 1 \mid \mathbf{x}) + e^{\mathbf{w} \cdot \mathbf{x}} P(Y = 1 \mid \mathbf{x}) = e^{\mathbf{w} \cdot \mathbf{x}}$

$P(Y = 1 \mid \mathbf{x}) (1 + e^{\mathbf{w} \cdot \mathbf{x}}) = e^{\mathbf{w} \cdot \mathbf{x}}$

$P(Y = 1 \mid \mathbf{x}) = \dfrac{e^{\mathbf{w} \cdot \mathbf{x}}}{1 + e^{\mathbf{w} \cdot \mathbf{x}}} = \dfrac{\exp\left( \sum_{i=0}^{d} w_i x_i \right)}{1 + \exp\left( \sum_{i=0}^{d} w_i x_i \right)}$
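In code, this probability is the logistic sigmoid of w · x. The form below, 1/(1 + exp(−z)), is algebraically identical to exp(z)/(1 + exp(z)) but avoids overflow for large positive z:

    import numpy as np

    def p_y1(w, x):
        """P(Y = 1 | x) under the logistic regression model."""
        z = np.dot(w, x)
        return 1.0 / (1.0 + np.exp(-z))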

Logistic regression – classification

Example $\mathbf{x}$ belongs to class 1 if:

$\dfrac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} > 1$

$e^{\mathbf{w} \cdot \mathbf{x}} > 1$

$\mathbf{w} \cdot \mathbf{x} > 0$

$\sum_{i=0}^{d} w_i x_i > 0$

The equation $\mathbf{w} \cdot \mathbf{x} = 0$ defines a hyperplane, with the points above it belonging to class 1.

Multinomial logistic regression

Logistic regression generalized to more than two classes:

$P(Y = y \mid \mathbf{x}, W) = \dfrac{\exp(W_{y\bullet} \cdot \mathbf{x})}{\sum_{y'} \exp(W_{y'\bullet} \cdot \mathbf{x})}$

where row $W_{y\bullet}$ of the weight matrix $W$ holds the weights for class $y$.
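A sketch of this distribution in Python, with W stored as a matrix whose row y holds the weights for class y. Shifting the scores by their maximum leaves the ratios unchanged but avoids overflow:

    import numpy as np

    def class_probs(W, x):
        """P(Y = y | x, W) for every class y, as a vector."""
        scores = W @ x                    # one score W_y . x per class
        scores = scores - scores.max()
        e = np.exp(scores)
        return e / e.sum()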

Learning parameters

Conditional likelihood estimation: choose the weights which make the probability of the observed values $y$ the highest, given the observations $\mathbf{x}$.

For a training set with $N$ examples:

$W = \operatorname{argmax}_W \prod_{n=1}^{N} P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W) = \operatorname{argmax}_W \sum_{n=1}^{N} \log P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W)$

Error function

Equivalently, we seek the value of the parameters which minimizes the error function:

$\operatorname{Err}(W, D) = -\sum_{n=1}^{N} \log P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W)$

where $D = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$

A problem in convex optimization

Minimizing this error function is a problem in convex optimization; standard methods include:

- L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno method)
- gradient descent
- conjugate gradient
- iterative scaling algorithms

Stochastic gradient descent

Gradient descent:

The gradient is the slope of a function.

That is, the set of partial derivatives, one for each dimension (parameter).

By following the gradient of a convex function we can descend to the bottom (minimum).


Gradient descent example

Find $\operatorname{argmin}_\theta f(\theta)$ where $f(\theta) = \theta^2$

Initial value $\theta^{(1)} = -1$

Gradient function: $\nabla f(\theta) = 2\theta$

Update: $\theta^{(n+1)} = \theta^{(n)} - \eta \nabla f(\theta^{(n)})$

The learning rate $\eta$ (here $= 0.2$) controls the speed of the descent.

After the first iteration: $\theta^{(2)} = -1 - 0.2 \cdot (-2) = -0.6$
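The same example as a short Python loop, run for the five iterations shown in the plot below:

    def gradient_descent(grad, theta, eta=0.2, iterations=5):
        """Repeatedly step against the gradient."""
        for _ in range(iterations):
            theta = theta - eta * grad(theta)
        return theta

    # f(theta) = theta^2, so grad f(theta) = 2 * theta; start at -1
    print(gradient_descent(lambda t: 2 * t, theta=-1.0))   # -0.07776

Each step multiplies θ by 0.6, so the iterates approach the minimum at 0 geometrically.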


Five iterations of gradient descent

[Plot of $f(\theta) = \theta^2$ over $\theta \in [-1, 1]$, with the iterates marked as points approaching the minimum at $\theta = 0$]

Stochastic

We could compute the gradient of the error for the full dataset before each update. Instead:

- Compute the gradient of the error for a single example
- Update the weights
- Move on to the next example

On average, we'll move in the right direction.

An efficient, online algorithm.

Error gradient

The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters $W_{yi}$:

$\nabla_{y,i} \operatorname{Err}(D, W) = \dfrac{\partial}{\partial W_{yi}} \left( -\sum_{n=1}^{N} \log P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W) \right) = -\sum_{n=1}^{N} \dfrac{\partial}{\partial W_{yi}} \log P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W)$

Single training example

$\dfrac{\partial}{\partial W_{yi}} \log P(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W) = \dfrac{\partial}{\partial W_{yi}} \log \dfrac{\exp(W_{y^{(n)}\bullet} \cdot \mathbf{x}^{(n)})}{\sum_{y'} \exp(W_{y'\bullet} \cdot \mathbf{x}^{(n)})}$

$\cdots$

$= x_i^{(n)} \left( \mathbb{1}(y = y^{(n)}) - P(Y = y \mid \mathbf{x}^{(n)}, W) \right)$

Update

Stochastic gradient update step:

$W_{yi}^{(n)} = W_{yi}^{(n-1)} + \eta\, x_i^{(n)} \left( \mathbb{1}(y = y^{(n)}) - P(Y = y \mid \mathbf{x}^{(n)}, W) \right)$

Update: Explicit

For the correct class ($y = y^{(n)}$):

$W_{yi}^{(n)} = W_{yi}^{(n-1)} + \eta\, x_i^{(n)} \left( 1 - P(Y = y \mid \mathbf{x}^{(n)}, W) \right)$

where $1 - P(Y = y \mid \mathbf{x}^{(n)}, W)$ is the residual.

For all other classes ($y \neq y^{(n)}$):

$W_{yi}^{(n)} = W_{yi}^{(n-1)} - \eta\, x_i^{(n)} P(Y = y \mid \mathbf{x}^{(n)}, W)$
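Both cases are covered at once by updating every class row with the residual 1(y = y^(n)) − P(Y = y | x^(n), W). A sketch of one stochastic pass, with class labels assumed to be indices 0..K−1:

    import numpy as np

    def sgd_epoch(W, X, y, eta=0.1):
        """One stochastic gradient pass for multinomial logistic regression.
        W: K x d weight matrix; X: N x d features; y: class indices."""
        for x_n, y_n in zip(X, y):
            scores = W @ x_n
            scores = scores - scores.max()     # stabilize the softmax
            p = np.exp(scores)
            p = p / p.sum()                    # P(Y = y | x_n, W) for all y
            target = np.zeros(W.shape[0])
            target[y_n] = 1.0                  # the indicator 1(y = y_n)
            W = W + eta * np.outer(target - p, x_n)
        return W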

Logistic regression SGD vs perceptron

Perceptron: $\mathbf{w}^{(n)} = \mathbf{w}^{(n-1)} + 1 \cdot \mathbf{x}^{(n)}$

Logistic regression SGD: $W_{yi}^{(n)} = W_{yi}^{(n-1)} + \eta\, x_i^{(n)} \left( 1 - P(Y = y \mid \mathbf{x}^{(n)}, W) \right)$

Very similar update!

The perceptron is simply an instantiation of SGD for a particular error function.

The perceptron criterion: for a correctly classified example, zero error; for a misclassified example, $-y^{(n)} \mathbf{w} \cdot \mathbf{x}^{(n)}$.

Maximum entropy

The multinomial logistic regression model is also known as the Maximum Entropy model. What is the connection?

Entia non sunt multiplicanda praeter necessitatem

("Entities must not be multiplied beyond necessity")

Maximum Entropy principle

Jaynes, 1957:

... in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which we by hypothesis do not have.

Entropy

Out of all possible models, choose the simplest one consistent with the data.

Entropy of the distribution of $X$:

$H(X) = -\sum_x P(X = x) \log_2 P(X = x)$

The uniform distribution has the highest entropy.
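A quick numeric check of the last claim (a sketch; a distribution is a list of probabilities summing to 1):

    import math

    def entropy(dist):
        """H(X) = -sum_x P(x) log2 P(x); terms with P(x) = 0 contribute 0."""
        return -sum(p * math.log2(p) for p in dist if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: the uniform maximum
    print(entropy([0.7, 0.1, 0.1, 0.1]))       # about 1.36 bits: lower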

Finding the maximum entropy distribution:

$p^* = \operatorname{argmax}_{p \in C} H(p)$

Berger et al. (1996) showed that solving this problem is equivalent to finding the multinomial logistic regression model whose weights maximize the likelihood of the training data.

Maxent principle – simple example

Find a Maximum Entropy probability distribution $p(a, b)$ where $a \in \{x, y\}$ and $b \in \{0, 1\}$.

The only thing we know is the following constraint: $p(x, 0) + p(y, 0) = 0.6$

    p(a, b)   b=0   b=1
    a=x        ?     ?
    a=y        ?     ?
    total     0.6    1

Solution (spread the probability mass as uniformly as the constraint allows):

    p(a, b)   b=0   b=1
    a=x       0.3   0.2
    a=y       0.3   0.2
    total     0.6    1


Comparison

    Model                      Naive Bayes   Perceptron       Log. regression
    Model power                Linear        Linear           Linear
    Type                       Generative    Discriminative   Discriminative
    Distribution modeled       P(x, y)       N/A              P(y|x)
    Smoothing                  Crucial       Optional¹        Optional¹
    Independence assumptions   Strong        None             None

¹ Aka regularization


The end


Efficient averaged perceptron algorithm

Perceptron(x^(1:N), y^(1:N), I):
1: w ← 0 ; wa ← 0
2: b ← 0 ; ba ← 0
3: c ← 1
4: for i = 1...I do
5:   for n = 1...N do
6:     if y^(n) (w · x^(n) + b) ≤ 0 then
7:       w ← w + y^(n) x^(n) ; b ← b + y^(n)
8:       wa ← wa + c y^(n) x^(n) ; ba ← ba + c y^(n)
9:     c ← c + 1
10: return (w − wa/c, b − ba/c)
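The same algorithm in Python; the auxiliary pair (wa, ba) accumulates counter-weighted updates so the averaged weights come out of a single pass, as in the pseudocode above:

    import numpy as np

    def averaged_perceptron(X, y, I):
        """Efficient averaged perceptron.
        X: N x d features; y: labels in {-1, +1}; I: epochs."""
        N, d = X.shape
        w, wa = np.zeros(d), np.zeros(d)
        b, ba = 0.0, 0.0
        c = 1
        for _ in range(I):
            for n in range(N):
                if y[n] * (np.dot(w, X[n]) + b) <= 0:
                    w = w + y[n] * X[n]
                    b = b + y[n]
                    wa = wa + c * y[n] * X[n]
                    ba = ba + c * y[n]
                c += 1
        return w - wa / c, b - ba / c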