classification: logistic regression from data
TRANSCRIPT
Classification: Logistic Regression from Data
Machine Learning: Jordan Boyd-Graber | University of Colorado Boulder | Lecture 3
Slides adapted from Emily Fox
Reminder: Logistic Regression
P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}   (1)

P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}   (2)
• Discriminative prediction: p(y |x)
• Classification uses: ad placement, spam detection
• What we didn’t talk about is how to learn β from data
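To make (1) and (2) concrete, here is a minimal sketch in Python/NumPy (the function name and the numbers are illustrative, not from the lecture) that evaluates P(Y = 1 | X) for one feature vector:

```python
import numpy as np

def predict_prob(beta0, beta, x):
    """P(Y = 1 | X = x) under logistic regression, equation (2)."""
    z = beta0 + np.dot(beta, x)           # linear score beta_0 + sum_i beta_i x_i
    return np.exp(z) / (1.0 + np.exp(z))  # equivalently 1 / (1 + exp(-z))

# Toy example with two features and hand-picked weights
beta0, beta = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.25])
print(predict_prob(beta0, beta, x))       # P(Y = 1 | x); P(Y = 0 | x) is one minus this
```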
Logistic Regression: Objective Function
\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta)   (3)

= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]   (4)
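A minimal sketch of the log likelihood in (3)-(4), assuming the examples x^{(j)} are the rows of a matrix X and the labels y^{(j)} are 0/1 (names and data are illustrative):

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Equation (4): sum_j [ y_j * z_j - ln(1 + exp(z_j)) ] with z_j = beta_0 + beta . x_j."""
    z = beta0 + X @ beta                        # one linear score per example
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1, 0, 1])
print(log_likelihood(0.0, np.array([0.5, -0.5]), X, y))
```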
Convexity
• Convex function
• Doesn’t matter where you start, if you “walk up” the objective
• Gradient!
Gradient Descent (non-convex)
Goal
Optimize log likelihood with respect to variables W and b
[Figure: gradient descent steps on a non-convex objective (objective vs. parameter); the update W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}} can settle in a local optimum, leaving a better region (the “Undiscovered Country”) unexplored.]
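As a rough sketch of the picture above (the toy objective J and its gradient are invented for illustration), a plain gradient-descent loop just repeats the update shown in the figure; where it ends up depends on where it starts:

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.01, n_steps=200):
    """Repeat w <- w - alpha * dJ/dw, i.e. the update from the figure."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= alpha * grad_J(w)
    return w

# Toy non-convex objective J(w) = w^4 - 3 w^2 + w; its gradient is 4 w^3 - 6 w + 1.
grad_J = lambda w: 4 * w**3 - 6 * w + 1
print(gradient_descent(grad_J, [2.0]))   # settles near w ~ 1.13 (a local minimum)
print(gradient_descent(grad_J, [-2.0]))  # settles near w ~ -1.30 (a lower basin): the start matters
```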
Gradient for Logistic Regression
Gradient
\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right]   (5)
Update
\Delta\beta \equiv \eta \, \nabla_\beta \mathcal{L}(\vec{\beta})   (6)
\beta_i \leftarrow \beta'_i + \eta \, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i}   (7)
Why are we adding? What would we do if we wanted to do descent?
\eta: step size, must be greater than zero
NB: Conjugate gradient is usually better, but harder to implement
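A sketch of the batch update (5)-(7), assuming each row of X starts with a constant 1 so that beta[0] plays the role of β₀ (the names and data are illustrative, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ascent_step(beta, X, y, eta=0.1):
    """One step of beta <- beta + eta * grad L(beta), equations (5)-(7)."""
    grad = X.T @ (y - sigmoid(X @ beta))  # dL/dbeta_i = sum_j (y_j - p(y=1|x_j)) x_ji
    return beta + eta * grad              # adding, because we are maximizing the likelihood

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # leading 1s are the bias feature
y = np.array([0, 0, 1])
beta = np.zeros(2)
for _ in range(100):
    beta = ascent_step(beta, X, y)
print(beta)  # the learned weights give higher P(y=1) to the third example than to the first two
```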
Choosing Step Size
[Figure: effect of the step size, shown on the objective vs. parameter curve.]
Remaining issues
• When to stop?
• What if β keeps getting bigger?
Regularized Conditional Log Likelihood
Unregularized
\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta)   (8)
Regularized
\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2   (9)
\mu is a “regularization” parameter that trades off between likelihood and having small parameters
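A small sketch of the regularized objective (9); leaving the bias β₀ unpenalized is a common convention assumed here, not something stated on the slide:

```python
import numpy as np

def regularized_objective(beta0, beta, X, y, mu):
    """Equation (9): conditional log likelihood minus mu * sum_i beta_i^2."""
    z = beta0 + X @ beta
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))
    return log_lik - mu * np.sum(beta ** 2)  # larger mu pulls the weights toward zero

X = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1, 0])
print(regularized_objective(0.0, np.array([0.5, -0.5]), X, y, mu=0.1))
```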
Approximating the Gradient
• Our datasets are big (too big to fit into memory)
• . . . or data are changing / streaming
• Hard to compute true gradient
\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right]   (10)
• Average over all observations
• What if we compute an update just from one observation?
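In code, equation (10) says the true gradient is the average of per-example gradients, and any single randomly chosen example gives a cheap, noisy estimate of it. A minimal sketch with made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_grad(beta, x, y):
    """Per-example gradient of the log likelihood: (y - p(y=1|x)) * x."""
    return (y - sigmoid(x @ beta)) * x

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])
beta = np.array([0.1, -0.2])

full = np.mean([example_grad(beta, x_j, y_j) for x_j, y_j in zip(X, y)], axis=0)  # E_x[...], eq. (10)
j = np.random.randint(len(y))
single = example_grad(beta, X[j], y[j])  # noisy but unbiased estimate from one observation
print(full, single)
```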
Getting to Union Station
Pretend it’s a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression
Given a single observation x chosen at random from the dataset,
\beta_i \leftarrow \beta'_i + \eta \left[ -\mu \beta'_i + x_i \left( y - p(y = 1 \mid \vec{x}, \vec{\beta}') \right) \right]   (11)
Examples in class.
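A sketch of one update per equation (11), followed by a tiny worked example (all numbers invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(beta, x, y, eta, mu):
    """Equation (11): beta_i <- beta_i' + eta * ( -mu*beta_i' + x_i*(y - p(y=1|x,beta')) )."""
    p = sigmoid(x @ beta)
    return beta + eta * (-mu * beta + x * (y - p))

# Worked example: x = (1, 1, 2) with a leading 1 for the bias, y = 1, beta' = (0, 0, 0),
# eta = 0.1, mu = 0.  Then p = sigmoid(0) = 0.5, so each beta_i moves by
# 0.1 * x_i * (1 - 0.5), giving (0.05, 0.05, 0.1).
print(sgd_step(np.zeros(3), np.array([1.0, 1.0, 2.0]), y=1, eta=0.1, mu=0.0))
```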
Stochastic Gradient for Regularized Regression
\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2   (12)
Taking the derivative for a single example x_i,
\frac{\partial \mathcal{L}}{\partial \beta_j} = \left( y_i - p(y_i = 1 \mid \vec{x}_i; \beta) \right) x_j - 2\mu\beta_j   (13)
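One way to check the derivative (13) is to compare it against a finite-difference estimate of (12) for a single example; a small sketch (the data and tolerances are illustrative, not part of the lecture's materials):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(beta, x, y, mu):
    """Equation (12) for one example: log p(y | x; beta) - mu * sum_j beta_j^2."""
    p = sigmoid(x @ beta)
    return y * np.log(p) + (1 - y) * np.log(1 - p) - mu * np.sum(beta ** 2)

def analytic_grad(beta, x, y, mu):
    """Equation (13): (y - p(y=1|x;beta)) * x_j - 2*mu*beta_j."""
    return (y - sigmoid(x @ beta)) * x - 2 * mu * beta

beta, x, y, mu, eps = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1, 0.05, 1e-6
numeric = np.array([(objective(beta + eps * e, x, y, mu) -
                     objective(beta - eps * e, x, y, mu)) / (2 * eps)
                    for e in np.eye(len(beta))])
print(analytic_grad(beta, x, y, mu))  # should closely match the numeric estimate below
print(numeric)
```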
Proofs about Stochastic Gradient
• Depends on convexity of the objective and how close (within ε) you want to get to the actual answer
• Best bounds depend on changing η over time and per dimension (not all features are created equal); see the sketch below
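The second bullet is often realized with a step size that decays over time plus an AdaGrad-style per-dimension scaling; the following is a rough illustration of both ideas, not something specified in the lecture:

```python
import numpy as np

def decayed_eta(eta0, t, power=0.5):
    """Step size that shrinks over iterations, e.g. eta_t = eta0 / (1 + t)^power."""
    return eta0 / (1.0 + t) ** power

class PerDimensionStep:
    """AdaGrad-flavored sketch: coordinates with large accumulated gradients get smaller steps."""
    def __init__(self, n_dims, eta0=0.1, eps=1e-8):
        self.g2 = np.zeros(n_dims)   # running sum of squared gradients, per dimension
        self.eta0, self.eps = eta0, eps

    def step(self, beta, grad):
        self.g2 += grad ** 2
        return beta + self.eta0 * grad / (np.sqrt(self.g2) + self.eps)  # ascent, as in (7)
```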
In class
• Your questions!
• Working through a simple example
• Preparing for the logistic regression homework