
Page 1: Classification Logistic Regression

©Kevin Jamieson 2018

Classification
Logistic Regression

Machine Learning – CSE546 Kevin Jamieson University of Washington

October 16, 2016


Page 2: Classification Logistic Regression

THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS

Page 3: Classification Logistic Regression

Weather prediction revisited

[Figure: temperature data]

Page 4: Classification Logistic Regression

Reading Your Brain, Simple Example

[Figure: brain-imaging examples for the two classes, Animal vs. Person]

Pairwise classification accuracy: 85% [Mitchell et al.]

Page 5: Classification Logistic Regression

Binary Classification

■ Learn: f: X → Y
  X – features
  Y – target classes, $Y \in \{0, 1\}$

■ Loss function:

■ Expected loss of f:

■ Suppose you know P(Y|X) exactly, how should you classify? Bayes optimal classifier:

Page 6: Classification Logistic Regression

Binary Classification

■ Learn: f: X → Y
  X – features
  Y – target classes, $Y \in \{0, 1\}$

■ Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$

■ Expected loss of f:

$$\mathbb{E}_{XY}\big[\mathbf{1}\{f(X) \neq Y\}\big] = \mathbb{E}_X\Big[\mathbb{E}_{Y|X}\big[\mathbf{1}\{f(x) \neq Y\} \mid X = x\big]\Big]$$

■ Suppose you know P(Y|X) exactly, how should you classify? Bayes optimal classifier:

$$f(x) = \arg\max_y P(Y = y \mid X = x)$$

since

$$\mathbb{E}_{Y|X}\big[\mathbf{1}\{f(x) \neq Y\} \mid X = x\big] = \sum_i P(Y = i \mid X = x)\,\mathbf{1}\{f(x) \neq i\} = \sum_{i \neq f(x)} P(Y = i \mid X = x) = 1 - P(Y = f(x) \mid X = x)$$
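Not from the slides: a minimal numpy sketch of the Bayes optimal rule when $P(Y \mid X = x)$ is known exactly; the conditional probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical conditional probabilities P(Y = y | X = x) for 4 discrete
# inputs x and 2 classes y; each row sums to 1. Values are made up.
p_y_given_x = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
])

# Bayes optimal classifier: predict the most probable class for each x.
f_bayes = np.argmax(p_y_given_x, axis=1)           # -> [0, 0, 1, 1]

# Its conditional error is 1 - P(Y = f(x) | X = x) for each x.
cond_error = 1.0 - p_y_given_x[np.arange(4), f_bayes]
print(f_bayes, cond_error)                          # [0 0 1 1] [0.1 0.4 0.3 0.2]
```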

Page 7: Classification Logistic Regression

Link Functions

■ Estimating P(Y|X): Why not use standard linear regression?

■ Combining regression and probability? Need a mapping from real values to [0,1]: a link function!

Page 8: Classification Logistic Regression

Logistic Regression

Logistic function (or Sigmoid): $\sigma(z) = \frac{1}{1 + \exp(-z)}$

■ Learn P(Y|X) directly: assume a particular functional form for the link function, namely the sigmoid applied to a linear function of the input features:

$$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

Features can be discrete or continuous!
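A small illustrative sketch (weights and features are made up, not from the slides) of the sigmoid link applied to a linear score, using the convention above where $P(Y = 0 \mid w, X) = 1/(1 + \exp(w_0 + \sum_k w_k X_k))$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example weights and a feature vector (illustration only).
w0, w = -2.0, np.array([1.0, -0.5])
x = np.array([3.0, 1.0])

p_y1 = sigmoid(w0 + w @ x)   # P(Y = 1 | w, x) under this convention
p_y0 = 1.0 - p_y1            # P(Y = 0 | w, x)
print(p_y1, p_y0)
```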

Page 9: Classification Logistic Regression

Understanding the sigmoid

[Three plots of $\sigma(w_0 + w_1 x)$ over $x \in [-6, 6]$: (w0=0, w1=-1), (w0=-2, w1=-1), (w0=0, w1=-0.5)]

Page 10: Classification Logistic Regression

Sigmoid for binary classes

$$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp\Big(w_0 + \sum_k w_k X_k\Big)$$

Page 11: Classification Logistic Regression

Sigmoid for binary classes

$$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp\Big(w_0 + \sum_k w_k X_k\Big)$$

$$\log \frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = w_0 + \sum_k w_k X_k$$

Linear Decision Rule!
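As a quick check (made-up weights, not from the slides), thresholding $P(Y = 1 \mid w, x)$ at $1/2$ gives exactly the same predictions as checking the sign of the linear score:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustration-only weights and a few points.
w0, w = 0.5, np.array([2.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 2.0]])

scores = w0 + X @ w                                # log-odds, one per row of X
pred_by_prob = (sigmoid(scores) > 0.5).astype(int)
pred_by_sign = (scores > 0).astype(int)
assert np.array_equal(pred_by_prob, pred_by_sign)  # same linear decision rule
```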

Page 12: Classification Logistic Regression

Logistic Regression – a Linear classifier

[Plot: sigmoid of the linear score; predict $Y = 1$ when $w_0 + w^T x > 0$ and $Y = 0$ when $w_0 + w^T x < 0$, so the decision boundary $w_0 + w^T x = 0$ is linear]

Page 13: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}$

$$P(Y = 1 \mid x, w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)} \qquad P(Y = -1 \mid x, w) = \frac{1}{1 + \exp(w^T x)}$$

■ This is equivalent to:

$$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$$

■ So we can compute the maximum likelihood estimator:

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)$$

Page 14: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w))$$

Page 15: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w))$$

Logistic Loss: $\ell_i(w) = \log(1 + \exp(-y_i\, x_i^T w))$

Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$ (MLE for Gaussian noise)
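A minimal sketch of the two losses above on a tiny made-up dataset (illustration only, not from the slides):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * x_i^T w)) over the data, y_i in {-1, +1}."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def squared_loss(w, X, y):
    """Sum of (y_i - x_i^T w)^2 over the data."""
    return np.sum((y - X @ w) ** 2)

# Tiny made-up dataset for illustration.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)
print(logistic_loss(w, X, y), squared_loss(w, X, y))  # 3*log(2), 3.0
```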

Page 16: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w)) =: \arg\min_w J(w)$$

What does J(w) look like? Is it convex? Each term is $g(y_i\, x_i^T w)$ with $g(z) = \log(1 + \exp(-z))$, and recall that $f$ is convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$.
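A numeric illustration (not a proof, and not from the slides) of the convexity inequality for J(w) on made-up data:

```python
import numpy as np

def J(w, X, y):
    """Logistic loss J(w) = sum_i log(1 + exp(-y_i x_i^T w))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(rng.normal(size=20))

# Check J(lam*u + (1-lam)*v) <= lam*J(u) + (1-lam)*J(v) on random pairs.
for _ in range(1000):
    u, v = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = J(lam * u + (1 - lam) * v, X, y)
    rhs = lam * J(u, X, y) + (1 - lam) * J(v, X, y)
    assert lhs <= rhs + 1e-9
print("convexity inequality held on all sampled pairs")
```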

Page 17: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w)) =: \arg\min_w J(w)$$

Good news: J(w) is a convex function of w, so there are no local-optima problems.

Bad news: there is no closed-form solution minimizing J(w).

Good news: convex functions are easy to optimize.

Page 18: Classification Logistic Regression

Linear Separability

$$\widehat{w} = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w))$$

When is this loss small?

Page 19: Classification Logistic Regression

Large parameters → Overfitting

■ If the data is linearly separable, the weights go to infinity; in general, this leads to overfitting.

■ Penalizing high weights can prevent overfitting…

Page 20: Classification Logistic Regression

Regularized Conditional Log Likelihood

■ Add regularization penalty, e.g., L2:

$$\arg\min_{w, b} \sum_{i=1}^n \log\big(1 + \exp(-y_i\,(x_i^T w + b))\big) + \lambda \|w\|_2^2$$

Be sure to not regularize the offset b!
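A minimal sketch of this regularized objective, with the L2 penalty applied to w only and not to the offset b; the data and λ below are made up for illustration:

```python
import numpy as np

def regularized_logistic_loss(w, b, X, y, lam):
    """L2-regularized logistic loss; the penalty is on w only, not on b."""
    margins = y * (X @ w + b)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

# Made-up data and settings, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
print(regularized_logistic_loss(np.zeros(3), 0.0, X, y, lam=0.1))  # 5*log(2)
```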

Page 21: Classification Logistic Regression


Gradient Descent

Machine Learning – CSE546 Kevin Jamieson University of Washington

October 16, 2016

Page 22: Classification Logistic Regression

Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Page 23: Classification Logistic Regression

Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

f convex:
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \forall x, y,\ \lambda \in [0, 1]$$
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y$$

f $\ell$-strongly convex:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\ell}{2} \|y - x\|_2^2 \quad \forall x, y$$
$$\nabla^2 f(x) \succeq \ell I \quad \forall x$$

g is a subgradient of f at x if $f(y) \ge f(x) + g^T (y - x)$ for all y.

Page 24: Classification Logistic Regression

Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Logistic Loss: $\ell_i(w) = \log(1 + \exp(-y_i\, x_i^T w))$

Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$

Page 25: Classification Logistic Regression

Least squares

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve: $\min_w \tfrac{1}{2} \|Xw - y\|_2^2$?

The minimizer satisfies the normal equations $(X^T X)\,\widehat{w} = X^T y$, i.e., solve a linear system $Ax = b$.
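A small numpy sketch (made-up data, not from the slides) of two standard ways to do this in software, via the normal equations or via a least-squares routine:

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Solve the normal equations (X^T X) w = X^T y directly...
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# ...or let a least-squares routine handle it, which is usually preferred
# numerically when X^T X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True on this well-conditioned example
```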

Page 26: Classification Logistic Regression

Least squares

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\min_w \tfrac{1}{2} \|Xw - y\|_2^2$?

…it's complicated:
  Do you need high precision?
  Is X column/row sparse?
  Is $\widehat{w}_{LS}$ sparse?
  Is $X^T X$ "well-conditioned"?
  Can $X^T X$ fit in cache/memory?

(LAPACK, BLAS, MKL…)

Page 27: Classification Logistic Regression

Taylor Series Approximation

■ Taylor series in one dimension:

$$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \dots$$

■ Gradient descent: initialize $x_0$ (e.g., randomly), then repeat $x_{t+1} = x_t - \eta\, f'(x_t)$.
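A minimal sketch of the update above in one dimension, applied to a made-up quadratic (illustration only):

```python
def gradient_descent(grad, x0, eta=0.1, iters=100):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = x0
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star)  # ~3.0
```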

Page 28: Classification Logistic Regression

Taylor Series Approximation

■ Taylor series in d dimensions:

$$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \dots$$

■ Gradient descent: $x_{t+1} = x_t - \eta\, \nabla f(x_t)$

Page 29: Classification Logistic Regression

Gradient Descent

$$w_{t+1} = w_t - \eta\, \nabla f(w_t)$$

For $f(w) = \tfrac{1}{2} \|Xw - y\|_2^2$:

$$\nabla f(w) = X^T (Xw - y) = X^T X w - X^T y$$

$$w_{t+1} = w_t - \eta\, X^T (X w_t - y) = (I - \eta\, X^T X)\, w_t + \eta\, X^T y$$
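A short sketch of gradient descent on the least-squares objective, with made-up data and a hand-picked step size (illustration only, not from the slides):

```python
import numpy as np

def gd_least_squares(X, y, eta, iters):
    """Gradient descent on f(w) = 0.5 * ||Xw - y||^2, gradient X^T(Xw - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - eta * (X.T @ (X @ w - y))
    return w

# Made-up well-conditioned example; compare against the true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0])
w_gd = gd_least_squares(X, y, eta=0.01, iters=2000)
print(w_gd)  # close to [2, -1]
```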


Page 31: Classification Logistic Regression

Gradient Descent

$$w_{t+1} = w_t - \eta\, \nabla f(w_t), \qquad f(w) = \tfrac{1}{2} \|Xw - y\|_2^2$$

Example: $X = \begin{bmatrix} 10^{-3} & 0 \\ 0 & 1 \end{bmatrix}$, $\ y = \begin{bmatrix} 10^{-3} \\ 1 \end{bmatrix}$, $\ w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so $w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.

$$(w_{t+1} - w^*) = (I - \eta\, X^T X)(w_t - w^*) = (I - \eta\, X^T X)^{t+1} (w_0 - w^*)$$

Here $X^T X = \mathrm{diag}(10^{-6}, 1)$, so each coordinate of the error contracts by a factor $|1 - \eta\, d_k|$, which must be less than 1 for convergence. Any stable step size must satisfy $|1 - \eta| < 1$, and then the first coordinate shrinks only by $|1 - 10^{-6}\eta|$ per step: extremely slowly.
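A small sketch reproducing the slide's diagonal example numerically; it shows the well-scaled coordinate converging immediately while the $10^{-6}$-curvature coordinate barely moves (illustration only):

```python
import numpy as np

# The ill-conditioned example from the slide: X^T X = diag(1e-6, 1).
X = np.diag([1e-3, 1.0])
y = np.array([1e-3, 1.0])
w_star = np.array([1.0, 1.0])

eta = 1.0                     # largest stable step size for the d_2 = 1 direction
w = np.zeros(2)
for t in range(10_000):
    w = w - eta * (X.T @ (X @ w - y))

print(w - w_star)             # second coordinate ~0, first still ~ -0.99
```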


Page 33: Classification Logistic Regression

Taylor Series Approximation

■ Taylor series in one dimension:

$$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \dots$$

■ Newton's method: choose $\delta$ to zero the derivative of the quadratic approximation, $f'(x) + f''(x)\,\delta = 0$, i.e. $\delta = -f'(x)/f''(x)$.

Page 34: Classification Logistic Regression

Taylor Series Approximation

■ Taylor series in d dimensions:

$$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \dots$$

■ Newton's method: $x_{t+1} = x_t + \eta\, v_t$, where $v_t = -\big[\nabla^2 f(x_t)\big]^{-1} \nabla f(x_t)$

Page 35: Classification Logistic Regression

Newton's Method

$$f(w) = \tfrac{1}{2} \|Xw - y\|_2^2, \qquad \nabla f(w) = ?, \qquad \nabla^2 f(w) = ?$$

$$w_{t+1} = w_t + \eta\, v_t, \qquad v_t \text{ is the solution to } \nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$$

Page 36: Classification Logistic Regression

Newton's Method

$$f(w) = \tfrac{1}{2} \|Xw - y\|_2^2, \qquad \nabla f(w) = X^T (Xw - y), \qquad \nabla^2 f(w) = X^T X$$

$$w_{t+1} = w_t + \eta\, v_t, \qquad v_t \text{ is the solution to } \nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$$

$$w_1 = w_0 - \eta\, (X^T X)^{-1} X^T (X w_0 - y) = w^* \quad \text{(taking } \eta = 1\text{)}$$

For quadratics, Newton's method converges in one step! (Not a surprise, why?)
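A minimal numerical check (made-up data, not from the slides) that a single Newton step with $\eta = 1$ lands on the least-squares solution:

```python
import numpy as np

# Made-up least-squares instance to illustrate the one-step property.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

w0 = np.zeros(3)
grad = X.T @ (X @ w0 - y)                 # gradient of 0.5*||Xw - y||^2 at w0
hess = X.T @ X                            # Hessian, constant for a quadratic
v = np.linalg.solve(hess, -grad)          # Newton direction
w1 = w0 + v                               # eta = 1

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w1, w_star))            # True: one Newton step reaches w*
```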

Page 37: Classification Logistic Regression

General case

In general, Newton's method needs only $t \approx \log(\log(1/\epsilon))$ iterations to achieve $f(w_t) - f(w^*) \le \epsilon$.

So why are ML problems overwhelmingly solved by gradient methods?

Hint: at every iteration, $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$, a $d \times d$ linear system.

Page 38: Classification Logistic Regression

General Convex case

Goal: $f(w_t) - f(w^*) \le \epsilon$

Newton's method: $t \approx \log(\log(1/\epsilon))$

Gradient descent:
• f is smooth and strongly convex ($aI \preceq \nabla^2 f(w) \preceq bI$):
• f is smooth ($\nabla^2 f(w) \preceq bI$):
• f is potentially non-differentiable ($\|\nabla f(w)\|_2 \le c$):

Other: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, … (Nocedal + Wright; Bubeck)

Clean convergence proofs: Bubeck

Page 39: Classification Logistic Regression


Revisiting…Logistic Regression

Machine Learning – CSE546 Kevin Jamieson University of Washington

October 16, 2016

Page 40: Classification Logistic Regression

Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form:

$\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T w)) =: \arg\min_w f(w)$$

$$\nabla f(w) = \sum_{i=1}^n \frac{-y_i\, x_i}{1 + \exp(y_i\, x_i^T w)} = -\sum_{i=1}^n y_i\, x_i \,\big(1 - P(y_i \mid x_i, w)\big)$$
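A sketch of this gradient in numpy, checked against a finite-difference approximation on made-up data (illustration only, not from the slides):

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of f(w) = sum_i log(1 + exp(-y_i x_i^T w)), with y_i in {-1, +1}."""
    # 1 - P(y_i | x_i, w) = 1 / (1 + exp(y_i x_i^T w)) for each example
    coef = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -(X.T @ (y * coef))

# Sanity check against central finite differences on made-up data.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(10, 4)), np.sign(rng.normal(size=10))
w = rng.normal(size=4)

def f(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

eps = 1e-6
num_grad = np.array([(f(w + eps * np.eye(4)[k]) - f(w - eps * np.eye(4)[k])) / (2 * eps)
                     for k in range(4)])
print(np.allclose(logistic_grad(w, X, y), num_grad, atol=1e-5))  # True
```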