
Logistic Regression

Professor Ameet Talwalkar

CS260 Machine Learning Algorithms
January 25, 2017


Outline

1 Administration

2 Review of last lecture

3 Logistic regression


HW1

Will be returned today in class during our break

Can also pick up from Brooke during her office hours


Outline

1 Administration

2 Review of last lecture: Naive Bayes

3 Logistic regression


How to tell spam from ham?

FROM THE DESK OF MR. AMINU SALEH, DIRECTOR, FOREIGN OPERATIONS DEPARTMENT, AFRI BANK PLC, Afribank Plaza, 14th Floor, 51/55 Broad Street, P.M.B 12021 Lagos-Nigeria

Attention: Honorable Beneficiary,

IMMEDIATE PAYMENT NOTIFICATION VALUED AT US$10 MILLION

Dear Ameet,

Do you have 10 minutes to get on a videocall before 2pm?

Thanks,

Stefano


Simple strategy: count the words

Count vector for the first (spam) email:  [ free: 100, money: 2, ..., account: 2, ... ]

Count vector for the second (ham) email:  [ free: 1, money: 1, ..., account: 2, ... ]

Bag-of-words representation of documents (and textual data)
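As a concrete illustration, here is a minimal sketch of how such count vectors might be computed in Python. The vocabulary and the two example strings are made up for illustration; they are not the emails from the slide.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in the text (bag-of-words vector)."""
    counts = Counter(text.lower().split())
    return {word: counts.get(word, 0) for word in vocabulary}

# Hypothetical vocabulary and documents, for illustration only.
vocab = ["free", "money", "account"]
spam_text = "free money free offer claim your free account money"
ham_text = "free to chat about your account before the money transfer"

print(bag_of_words(spam_text, vocab))   # {'free': 3, 'money': 2, 'account': 1}
print(bag_of_words(ham_text, vocab))    # {'free': 1, 'money': 1, 'account': 1}
```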


Naive Bayes (in our Spam Email Setting)

Assume $X \in \mathbb{R}^D$, all $X_d \in [K]$, and $z_k$ is the number of times $k$ appears in $X$

$$P(X = x, Y = c) = P(Y = c)\,P(X = x \mid Y = c) = P(Y = c)\prod_k P(k \mid Y = c)^{z_k} = \pi_c \prod_k \theta_{ck}^{z_k}$$

Key assumptions made?

Conditional independence: $P(X_i, X_j \mid Y = c) = P(X_i \mid Y = c)\,P(X_j \mid Y = c)$.

$P(X_i \mid Y = c)$ depends only on the value of $X_i$, not on $i$ itself (the order of words does not matter in the bag-of-words representation of documents)


Learning problem

Training data: $\mathcal{D} = \{(\{z_{nk}\}_{k=1}^{K}, y_n)\}_{n=1}^{N}$

Goal

Learn $\pi_c$, $c = 1, 2, \cdots, C$, and $\theta_{ck}$, $\forall c \in [C], k \in [K]$, under the constraints:

$$\sum_c \pi_c = 1, \qquad \sum_k \theta_{ck} = \sum_k P(k \mid Y = c) = 1,$$

and all quantities should be nonnegative.


Likelihood Function

Let X1, . . . , XN be IID with PDF f(x|θ) (also written as f(x; θ))

Likelihood function is defined by L(θ|x) (also written as L(θ;x)):

$$L(\theta \mid x) = \prod_{i=1}^{N} f(X_i; \theta).$$

Notes: The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$, $L : \Theta \to [0, \infty)$.


Maximum Likelihood Estimator

Definition: The maximum likelihood estimator (MLE) $\hat\theta$ is the value of $\theta$ that maximizes $L(\theta)$.

The log-likelihood function is defined by $\ell(\theta) = \log L(\theta)$

Its maximum occurs at the same place as that of the likelihood function

Using logs simplifies mathematical expressions (converts exponents to products and products to sums)

Using logs helps with numerical stability

The same is true of the likelihood function times any constant. Thus we shall often drop constants in the likelihood function.


Bayes Rule

For any document x, we want to compare p(spam|x) and p(ham|x)

Axiom of Probability: p(spam, x) = p(spam|x)p(x) = p(x|spam)p(spam)

This gives us (via Bayes rule):

$$p(\text{spam} \mid x) = \frac{p(x \mid \text{spam})\,p(\text{spam})}{p(x)}, \qquad p(\text{ham} \mid x) = \frac{p(x \mid \text{ham})\,p(\text{ham})}{p(x)}$$

The denominators are the same, and it is easier to compute with logarithms, so we compare:

log[p(x|spam)p(spam)] versus log[p(x|ham)p(ham)]


Our hammer: maximum likelihood estimation

Log-likelihood of the training data

$$L = \log P(\mathcal{D}) = \log \prod_{n=1}^{N} \pi_{y_n} P(x_n \mid y_n) = \log \prod_{n=1}^{N} \left(\pi_{y_n} \prod_k \theta_{y_n k}^{z_{nk}}\right) = \sum_n \left(\log \pi_{y_n} + \sum_k z_{nk} \log \theta_{y_n k}\right) = \sum_n \log \pi_{y_n} + \sum_{n,k} z_{nk} \log \theta_{y_n k}$$

Optimize it!

$$(\pi_c^*, \theta_{ck}^*) = \arg\max \sum_n \log \pi_{y_n} + \sum_{n,k} z_{nk} \log \theta_{y_n k}$$


Details

Note the separation of parameters in the likelihood

$$\sum_n \log \pi_{y_n} + \sum_{n,k} z_{nk} \log \theta_{y_n k}$$

which implies that $\{\pi_c\}$ and $\{\theta_{ck}\}$ can be estimated separately. Reorganize terms:

$$\sum_n \log \pi_{y_n} = \sum_c \log \pi_c \times (\#\text{ of data points labeled as } c)$$

and

$$\sum_{n,k} z_{nk} \log \theta_{y_n k} = \sum_c \sum_{n: y_n = c} \sum_k z_{nk} \log \theta_{ck} = \sum_c \sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck}$$

The latter implies $\{\theta_{ck}, k = 1, 2, \cdots, K\}$ and $\{\theta_{c'k}, k = 1, 2, \cdots, K\}$ can be estimated independently.


Estimating {πc}

We want to maximize

$$\sum_c \log \pi_c \times (\#\text{ of data points labeled as } c)$$

Intuition

Similar to rolling a die (or flipping a coin): each side of the die shows up with probability $\pi_c$ ($C$ sides in total)

And we have $N$ trials of rolling this die in total

Solution

$$\pi_c^* = \frac{\#\text{ of data points labeled as } c}{N}$$


Estimating {θck, k = 1, 2, · · · ,K}

We want to maximize

$$\sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck}$$

Intuition

Again similar to rolling a die: each side of the die shows up with probability $\theta_{ck}$ ($K$ sides in total)

And we have $\sum_{n: y_n = c,\, k} z_{nk}$ trials in total.

Solution

$$\theta_{ck}^* = \frac{\#\text{ of times side } k \text{ shows up in data points labeled as } c}{\#\text{ of total trials for data points labeled as } c}$$


Translating back to our problem of detecting spam emails

Collect a lot of ham and spam emails as training examples

Estimate the “bias”

$$p(\text{ham}) = \frac{\#\text{ of ham emails}}{\#\text{ of emails}}, \qquad p(\text{spam}) = \frac{\#\text{ of spam emails}}{\#\text{ of emails}}$$

Estimate the weights (i.e., p(funny word|ham) etc)

$$p(\text{funny word} \mid \text{ham}) = \frac{\#\text{ of funny word in ham emails}}{\#\text{ of words in ham emails}} \quad (1)$$

$$p(\text{funny word} \mid \text{spam}) = \frac{\#\text{ of funny word in spam emails}}{\#\text{ of words in spam emails}} \quad (2)$$


Classification rule

Given an unlabeled data point $x = \{z_k, k = 1, 2, \cdots, K\}$, label it with

$$y^* = \arg\max_{c \in [C]} P(y = c \mid x) = \arg\max_{c \in [C]} P(y = c)\,P(x \mid y = c) = \arg\max_c \left[\log \pi_c + \sum_k z_k \log \theta_{ck}\right]$$


Constrained optimization

Equality Constraints

Method of Lagrange multipliers

Construct the following function (Lagrangian)

$$\min f(x) \quad \text{s.t. } g(x) = 0$$

$$L(x, \lambda) = f(x) + \lambda g(x)$$


A short derivation of the maximum likelihood estimation

To maximize
$$\sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck}$$

We can use Lagrange multipliers

$$f(\theta) = -\sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck}, \qquad g(\theta) = 1 - \sum_k \theta_{ck} = 0$$

Lagrangian

$$L(\theta, \lambda) = f(\theta) + \lambda g(\theta) = -\sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck} + \lambda\left(1 - \sum_k \theta_{ck}\right)$$


$$L(\theta, \lambda) = -\sum_{n: y_n = c,\, k} z_{nk} \log \theta_{ck} + \lambda\left(1 - \sum_k \theta_{ck}\right)$$

First take derivatives with respect to θck and then find the stationary point

$$-\left(\sum_{n: y_n = c} \frac{z_{nk}}{\theta_{ck}}\right) - \lambda = 0 \;\Rightarrow\; \theta_{ck} = -\frac{1}{\lambda}\sum_{n: y_n = c} z_{nk}$$

Plug the expression above for $\theta_{ck}$ into the constraint $\sum_k \theta_{ck} = 1$

Solve for λ

Plug this expression for λ back into expression for θck to get:

$$\theta_{ck} = \frac{\sum_{n: y_n = c} z_{nk}}{\sum_k \sum_{n: y_n = c} z_{nk}}$$
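For completeness, one way to fill in the elided "solve for λ" step (a sketch; the Friday section covers it in detail): summing the stationary-point expression over k and using the constraint gives

$$\sum_k \theta_{ck} = -\frac{1}{\lambda}\sum_k \sum_{n: y_n = c} z_{nk} = 1 \;\Rightarrow\; \lambda = -\sum_k \sum_{n: y_n = c} z_{nk},$$

and substituting this λ back into $\theta_{ck} = -\frac{1}{\lambda}\sum_{n: y_n = c} z_{nk}$ yields the formula above.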

Chris will review in section on Friday


Summary

Things you should know

The form of the naive Bayes model
  - write down the joint distribution
  - explain the conditional independence assumption implied by the model
  - explain how this model can be used to classify spam vs ham emails
  - explain how it could be used for categorical variables

Be able to go through the short derivation for parameter estimation
  - The model illustrated here is called discrete Naive Bayes
  - HW2 asks you to apply the same principle to other variants of naive Bayes
  - The derivations are very similar – except there you need to estimate different model parameters


Moving forward

Examine the classification rule for naive Bayes

$$y^* = \arg\max_c \log \pi_c + \sum_k z_k \log \theta_{ck}$$

For binary classification, we thus determine the label based on the sign of

$$\log \pi_1 + \sum_k z_k \log \theta_{1k} - \left(\log \pi_2 + \sum_k z_k \log \theta_{2k}\right)$$

This is just a linear function of the features {zk}

$$w_0 + \sum_k z_k w_k$$

where we “absorb” $w_0 = \log \pi_1 - \log \pi_2$ and $w_k = \log \theta_{1k} - \log \theta_{2k}$.
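A small sketch making the "absorb" step concrete: converting hypothetical naive Bayes parameters into a bias w0 and weights wk, then classifying by the sign of the resulting linear score.

```python
import numpy as np

# Hypothetical naive Bayes parameters for classes 1 and 2 over a 3-word vocabulary.
pi = np.array([0.4, 0.6])                   # [pi_1, pi_2]
theta = np.array([[0.5, 0.3, 0.2],          # theta_1k
                  [0.2, 0.3, 0.5]])         # theta_2k

w0 = np.log(pi[0]) - np.log(pi[1])          # w0 = log pi_1 - log pi_2
w = np.log(theta[0]) - np.log(theta[1])     # w_k = log theta_1k - log theta_2k

z = np.array([2, 0, 1])                     # word counts of a new document
score = w0 + w @ z                          # linear in the features z_k
label = 1 if score > 0 else 2               # positive score -> class 1
```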


Naive Bayes is a linear classifier

Fundamentally, what really matters in deciding the decision boundary is

$$w_0 + \sum_k z_k w_k$$

This motivates many new methods, including logistic regression, to be discussed next


Outline

1 Administration

2 Review of last lecture

3 Logistic regression
   General setup
   Maximum likelihood estimation
   Gradient descent
   Newton's method


Logistic classification

Setup for two classes

Input: x ∈ RD

Output: $y \in \{0, 1\}$

Training data: $\mathcal{D} = \{(x_n, y_n), n = 1, 2, \ldots, N\}$

Model: $p(y = 1 \mid x; b, w) = \sigma[g(x)]$

where $g(x) = b + \sum_d w_d x_d = b + w^T x$

and σ[·] stands for the sigmoid function

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
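A minimal sketch of this model in Python; the parameter values and the input are placeholders for illustration.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))"""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, b, w):
    """p(y = 1 | x; b, w) = sigma(b + w^T x)"""
    return sigmoid(b + w @ x)

# Placeholder parameters and input.
b, w = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 0.3])
p = predict_proba(x, b, w)         # probability that y = 1
y_hat = 1 if p > 0.5 else 0        # classify by thresholding at 0.5
```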


Why the sigmoid function?

What does it look like?

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \quad \text{where } a = b + w^T x$$

[Plot: the sigmoid curve rising from 0 to 1 as $a$ goes from $-6$ to $6$]

Properties

Bounded between 0 and 1 ← thus, interpretable as probability

Monotonically increasing ← thus, usable to derive classification rules
  - $\sigma(a) > 0.5$: positive (classify as '1')
  - $\sigma(a) < 0.5$: negative (classify as '0')
  - $\sigma(a) = 0.5$: undecidable

Nice computational properties ← as we will see soon

Linear or nonlinear classifier?


Linear or nonlinear?

$\sigma(a)$ is nonlinear; however, the decision boundary is determined by

$$\sigma(a) = 0.5 \;\Rightarrow\; a = 0 \;\Rightarrow\; g(x) = b + w^T x = 0$$

which is a linear function in x

We often call b the offset term.


Contrast Naive Bayes and our new model

Similar

Both classification models are linear functions of features

Different

Naive Bayes models the joint distribution: $P(X, Y) = P(Y)\,P(X \mid Y)$

Logistic regression models the conditional distribution: $P(Y \mid X)$

Generative vs. Discriminative

NB is a generative model, LR is a discriminative model

We will talk more about the differences later


Likelihood function

Probability of a single training sample (xn, yn)

$$p(y_n \mid x_n; b, w) = \begin{cases} \sigma(b + w^T x_n) & \text{if } y_n = 1 \\ 1 - \sigma(b + w^T x_n) & \text{otherwise} \end{cases}$$

Compact expression, exploiting that $y_n$ is either 1 or 0

$$p(y_n \mid x_n; b, w) = \sigma(b + w^T x_n)^{y_n}\,[1 - \sigma(b + w^T x_n)]^{1 - y_n}$$


Log Likelihood or Cross Entropy Error

Log-likelihood of the whole training data D

$$\log P(\mathcal{D}) = \sum_n \{y_n \log \sigma(b + w^T x_n) + (1 - y_n) \log[1 - \sigma(b + w^T x_n)]\}$$

It is convenient to work with its negation, which is called the cross-entropy error function

$$E(b, w) = -\sum_n \{y_n \log \sigma(b + w^T x_n) + (1 - y_n) \log[1 - \sigma(b + w^T x_n)]\}$$


Shorthand notation

This is for convenience

Append 1 to x: $x \leftarrow [1\;\; x_1\;\; x_2\;\; \cdots\;\; x_D]$

Append b to w: $w \leftarrow [b\;\; w_1\;\; w_2\;\; \cdots\;\; w_D]$

Cross-entropy is then

$$E(w) = -\sum_n \{y_n \log \sigma(w^T x_n) + (1 - y_n) \log[1 - \sigma(w^T x_n)]\}$$
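A short sketch of this error function in Python under the shorthand convention (a leading 1 appended to each x_n so the bias lives in w[0]); the tiny dataset is made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    """E(w) = -sum_n [ y_n log sigma(w^T x_n) + (1 - y_n) log(1 - sigma(w^T x_n)) ].
    X already contains a leading column of ones, so w[0] plays the role of b."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # first column is the appended 1
y = np.array([1, 0, 1])
w = np.array([0.1, 0.3])                              # [b, w_1] after absorbing b
print(cross_entropy(w, X, y))
```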


How to find the optimal parameters for logistic regression?

We will minimize the error function

$$E(w) = -\sum_n \{y_n \log \sigma(w^T x_n) + (1 - y_n) \log[1 - \sigma(w^T x_n)]\}$$

However, this function is complex and we cannot find a simple solution as we did in Naive Bayes. So we need to use numerical methods.

Numerical methods are messier, in contrast to cleaner analytic solutions.

In practice, we often have to tune a few optimization parameters, so patience is necessary.


An overview of numerical methods

We describe two

Gradient descent (our focus in lecture): simple, especially effective for large-scale problems

Newton’s method: classical and powerful method

Gradient descent is often referred to as a first-order method

Requires computation of gradients (i.e., the first-order derivative)

Newton’s method is often referred as to an second-order method

Requires computation of second derivatives


Gradient Descent

Start at a random point
Repeat
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied

[Figure: a one-dimensional function f(w), with iterates w0, w1, w2, ... descending toward the minimizer w*]


Where Will We Converge?

Convex $f(w)$: any local minimum is a global minimum.

Non-convex $g(w)$: multiple local minima may exist.

Least Squares, Ridge Regression and Logistic Regression are all convex!

[Figure: a convex function f(w) with a single minimizer w*, and a non-convex function g(w) with several local minima]


Why do we move in the direction opposite the gradient?


Choosing Descent Direction (1D)

We can only move in two directions. Negative slope is the direction of descent!
  - slope positive ⇒ go left!
  - slope negative ⇒ go right!
  - slope zero ⇒ done!

Update Rule: $w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$, where $\alpha_i$ is the step size and $-\frac{df}{dw}(w_i)$ is the negative slope

[Figure: two one-dimensional functions f(w), with a starting point w0 on either side of the minimizer w*]


Choosing Descent Direction

We can move anywhere in $\mathbb{R}^d$. Negative gradient is the direction of steepest descent!

Update Rule: $w_{i+1} = w_i - \alpha_i \nabla f(w_i)$, where $\alpha_i$ is the step size and $-\nabla f(w_i)$ is the negative gradient

2D Example: function values are shown in black/white (black represents higher values); arrows are gradients.
["Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg]


Example: $\min f(\theta) = 0.5(\theta_1^2 - \theta_2)^2 + 0.5(\theta_1 - 1)^2$

We compute the gradients

$$\frac{\partial f}{\partial \theta_1} = 2(\theta_1^2 - \theta_2)\theta_1 + \theta_1 - 1, \qquad \frac{\partial f}{\partial \theta_2} = -(\theta_1^2 - \theta_2)$$

Use the following iterative procedure for gradient descent

1 Initialize $\theta_1^{(0)}$ and $\theta_2^{(0)}$, and $t = 0$

2 do

$$\theta_1^{(t+1)} \leftarrow \theta_1^{(t)} - \eta\left[2\big((\theta_1^{(t)})^2 - \theta_2^{(t)}\big)\theta_1^{(t)} + \theta_1^{(t)} - 1\right]$$

$$\theta_2^{(t+1)} \leftarrow \theta_2^{(t)} - \eta\left[-\big((\theta_1^{(t)})^2 - \theta_2^{(t)}\big)\right]$$

$$t \leftarrow t + 1$$

3 until $f(\theta^{(t)})$ does not change much
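A direct transcription of this procedure in Python; the step size η, the starting point, and the stopping tolerance are illustrative choices, not values given on the slide.

```python
def f(t1, t2):
    return 0.5 * (t1**2 - t2)**2 + 0.5 * (t1 - 1)**2

def grad(t1, t2):
    return (2 * (t1**2 - t2) * t1 + t1 - 1,   # df/dtheta_1
            -(t1**2 - t2))                    # df/dtheta_2

eta = 0.1                        # step size (illustrative)
t1, t2 = 0.0, 0.0                # step 1: initialize theta^(0)
prev = f(t1, t2)
while True:                      # step 2: repeat the two coordinate updates
    g1, g2 = grad(t1, t2)
    t1, t2 = t1 - eta * g1, t2 - eta * g2
    cur = f(t1, t2)
    if abs(prev - cur) < 1e-10:  # step 3: until f(theta^(t)) does not change much
        break
    prev = cur

print(t1, t2)                    # approaches the minimizer (1, 1)
```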


Impact of step size

Choosing the right η is important

small η is too slow?

[Plot: gradient descent trajectory with a small step size]

large η is too unstable?

[Plot: gradient descent trajectory with a large step size]


Gradient descent

General form for minimizing f(θ)

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta\,\frac{\partial f}{\partial \theta}$$

Remarks

η is the step size – how far we go in the direction of the negative gradient
  - Step size needs to be chosen carefully to ensure convergence.
  - Step size can be adaptive, e.g., we can use line search

We are minimizing a function, hence the subtraction (−η)

With a suitable choice of η, we converge to a stationary point

$$\frac{\partial f}{\partial \theta} = 0$$

A stationary point is not always a global minimum (but we are happy when the function is convex)

Popular variant called stochastic gradient descent


Gradient Descent Update for Logistic Regression

Simple fact: derivatives of σ(a)

$$\frac{d\,\sigma(a)}{d\,a} = \frac{d}{d\,a}\left(1 + e^{-a}\right)^{-1} = \frac{-(1 + e^{-a})'}{(1 + e^{-a})^2} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a)[1 - \sigma(a)]$$


Gradients of the cross-entropy error function

Cross-entropy Error Function

$$E(w) = -\sum_n \{y_n \log \sigma(w^T x_n) + (1 - y_n) \log[1 - \sigma(w^T x_n)]\}$$

Gradients

$$\frac{\partial E(w)}{\partial w} = -\sum_n \left\{y_n[1 - \sigma(w^T x_n)]\,x_n - (1 - y_n)\,\sigma(w^T x_n)\,x_n\right\} = \sum_n \left\{\sigma(w^T x_n) - y_n\right\} x_n$$

Remark

$e_n = \sigma(w^T x_n) - y_n$ is called the error for the nth training sample.


Numerical optimization

Gradient descent for logistic regression

Choose a proper step size η > 0

Iteratively update the parameters following the negative gradient to minimize the error function

$$w^{(t+1)} \leftarrow w^{(t)} - \eta \sum_n \left\{\sigma(w^T x_n) - y_n\right\} x_n$$
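A compact sketch of this update in Python; the synthetic data, step size, and iteration count are illustrative assumptions (the design matrix uses the earlier shorthand with a leading column of ones).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, eta=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy error.
    X: (N, D) design matrix with a leading column of ones; y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = sigmoid(X @ w) - y      # e_n = sigma(w^T x_n) - y_n
        gradient = X.T @ errors          # sum_n e_n x_n
        w = w - eta * gradient           # move against the gradient
    return w

# Tiny synthetic example.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic_gd(X, y)
print(sigmoid(X @ w))                    # predicted probabilities for the training points
```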


Intuition for Newton’s method

Approximate the true function with an easy-to-solve optimization problem

[Figure: a function f(x) and its quadratic approximation f_quad(x) around the current iterate x_k, whose minimizer gives the next iterate x_k + d_k]

In particular, we can approximate the cross-entropy error function around $w^{(t)}$ by a quadratic function, and then minimize this quadratic function


Approximation

Second Order Taylor expansion around xt

$$f(x) \approx f(x_t) + f'(x_t)(x - x_t) + \frac{1}{2} f''(x_t)(x - x_t)^2$$

Taylor expansion of cross-entropy error function around w(t)

$$E(w) \approx E(w^{(t)}) + (w - w^{(t)})^T \nabla E(w^{(t)}) + \frac{1}{2}(w - w^{(t)})^T H^{(t)} (w - w^{(t)})$$

where

∇E(w(t)) is the gradient

H(t) is the Hessian matrix evaluated at w(t)


So what is the Hessian matrix?

The matrix of second-order derivatives

$$H = \frac{\partial^2 E(w)}{\partial w\, \partial w^T}$$

In other words,

$$H_{ij} = \frac{\partial}{\partial w_j}\left(\frac{\partial E(w)}{\partial w_i}\right)$$

So the Hessian matrix is in $\mathbb{R}^{D \times D}$, where $w \in \mathbb{R}^D$.


Optimizing the approximation

Minimize the approximation

$$E(w) \approx E(w^{(t)}) + (w - w^{(t)})^T \nabla E(w^{(t)}) + \frac{1}{2}(w - w^{(t)})^T H^{(t)} (w - w^{(t)})$$

and use the solution as the new estimate of the parameters

$$w^{(t+1)} \leftarrow \arg\min_w\; (w - w^{(t)})^T \nabla E(w^{(t)}) + \frac{1}{2}(w - w^{(t)})^T H^{(t)} (w - w^{(t)})$$

The quadratic function minimization has a closed form, thus, we have

$$w^{(t+1)} \leftarrow w^{(t)} - \left(H^{(t)}\right)^{-1} \nabla E(w^{(t)})$$

i.e., Newton's method.
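To make the update concrete, here is a sketch applying it to the small two-dimensional example used earlier for gradient descent. The Hessian below is worked out by hand for that example function only; it is an illustration of the update rule, not the Hessian of the logistic regression error (which is left as homework).

```python
import numpy as np

def grad(t):
    t1, t2 = t
    return np.array([2 * (t1**2 - t2) * t1 + t1 - 1,
                     -(t1**2 - t2)])

def hessian(t):
    t1, t2 = t
    return np.array([[6 * t1**2 - 2 * t2 + 1, -2 * t1],
                     [-2 * t1, 1.0]])

t = np.array([0.0, 0.0])                          # starting point
for _ in range(10):
    t = t - np.linalg.solve(hessian(t), grad(t))  # t <- t - H^{-1} grad
print(t)                                          # reaches the minimizer (1, 1) in a couple of steps
```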


Contrast gradient descent and Newton’s method

Similar

Both are iterative procedures.

Different

Newton’s method requires second-order derivatives.

Newton’s method does not have the magic η to be set.


Other important things about the Hessian

Our cross-entropy error function is convex

$$\frac{\partial E(w)}{\partial w} = \sum_n \{\sigma(w^T x_n) - y_n\}\, x_n \quad (3)$$

$$\Rightarrow\; H = \frac{\partial^2 E(w)}{\partial w\, \partial w^T} = \text{homework} \quad (4)$$

For any vector $v$, $v^T H v = \text{homework} \geq 0$

Thus $H$ is positive semi-definite, and the cross-entropy error function is convex, with only one global optimum.


Good about Newton’s method

Fast (in terms of convergence)!

Newton’s method finds the optimal point in a single iteration when thefunction we’re optimizing is quadratic

In general, the better our Taylor approximation, the more quickly Newton’smethod will converge


Bad about Newton’s method

Not scalable!

Computing and inverting the Hessian matrix can be very expensive for large-scale problems where the dimensionality D is very large. There are fixes and alternatives, such as quasi-Newton (quasi-second-order) methods.


Summary

Setup for 2 classes

Logistic Regression models the conditional distribution as: $p(y = 1 \mid x; w) = \sigma[g(x)]$, where $g(x) = w^T x$

Linear decision boundary: $g(x) = w^T x = 0$

Minimizing cross-entropy error (negative log-likelihood)

$E(b, w) = -\sum_n \{y_n \log \sigma(b + w^T x_n) + (1 - y_n) \log[1 - \sigma(b + w^T x_n)]\}$

No closed-form solution; must rely on iterative solvers

Numerical optimization

Gradient descent: simple, scalable to large-scale problems
  - move in direction opposite of gradient!
  - gradient of logistic function takes nice form

Newton's method: fast to converge but not scalable
  - At each iteration, find optimal point in 2nd-order Taylor expansion
  - Closed form solution exists for each iteration


Naive Bayes and logistic regression: two different modeling paradigms

Maximize joint likelihood $\sum_n \log p(x_n, y_n)$ (Generative, NB)

Maximize conditional likelihood $\sum_n \log p(y_n \mid x_n)$ (Discriminative, LR)

More on this next class
