Logistic Regression: Maximum Likelihood Estimation

Harvard-MIT Division of Health Sciences and Technology
HST.951J / 6.873: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

Page 1:

Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Fall 2005

Logistic Regression: Maximum Likelihood Estimation

Lucila Ohno-Machado

Page 2:

Risk Score of Death from Angioplasty
Unadjusted overall mortality rate = 2.1%

[Figure: bar chart of number of cases (0 to 3000) and mortality (0% to 60%) by risk score category.]

Risk score category:   0 to 2   3 to 4   5 to 6   7 to 8   9 to 10   >10
Share of cases:         62%      26%      7.6%     2.9%     1.6%     1.3%
Mortality:              0.4%     1.4%     2.2%     12.4%    21.5%    53.6%

Page 3:

Linear Regression: Ordinary Least Squares (OLS)

[Figure: scatter plot of points x1 ... x4 around a fitted line y(x).]

Minimize the Sum of Squared Errors (SSE) over the n data points (i is the subscript for each point):

$$\hat{y}_i = \beta_0 + \beta_1 x_i$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$
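As a concrete illustration (not from the slides), a minimal NumPy sketch of the closed-form OLS solution for one predictor; the data values are made up:

```python
import numpy as np

# Illustrative data: a single predictor x and a continuous outcome y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates for y_i = b0 + b1 * x_i.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)  # the Sum of Squared Errors that OLS minimizes
print(f"b0={b0:.3f}, b1={b1:.3f}, SSE={sse:.3f}")
```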

Page 4:

Logit

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}} = \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}$$

$$\mathrm{logit}(p_i) = \log\!\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i$$

[Figure: the probability p(x) follows an S-shaped logistic curve in x, while the logit of p is linear in x.]
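A minimal sketch of the logistic function and its inverse, the logit; the coefficient values here are arbitrary illustrations:

```python
import numpy as np

def logistic(x, b0, b1):
    """p_i = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

x = np.linspace(0, 30, 7)
p = logistic(x, b0=-5.0, b1=0.4)          # illustrative coefficients
print(np.allclose(logit(p), -5.0 + 0.4 * x))  # True: logit(p) is linear in x
```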

Page 5:

Increasing β

[Figure: repeated panels of logistic curves, p on a 0 to 1.2 axis against x from 0 to 30, illustrating how the curve changes as the β coefficients increase.]

Page 6:

Finding β0

• Baseline case:

$$p_i = \frac{1}{1 + e^{-\beta_0}}$$

         Blue(1)  Green(0)  Total
Death      28        22       50
Life       45        52       97
Total      73        74      147

Using the baseline (Green) group's death rate, 22/74 ≈ 0.297:

$$0.297 = \frac{1}{1 + e^{-\beta_0}} \quad\Rightarrow\quad \beta_0 = -0.8616$$
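A quick check of this calculation (the slide's −0.8616 comes from rounding the rate to 0.297; the unrounded table gives −0.8602):

```python
import math

# Baseline (Green = 0) group from the 2x2 table: 22 deaths out of 74 cases.
p_green = 22 / 74                       # ≈ 0.297
b0 = math.log(p_green / (1 - p_green))  # invert p = 1 / (1 + exp(-b0))
print(f"p_green={p_green:.3f}, b0={b0:.4f}")  # b0 ≈ -0.8602

# Sanity check: plugging b0 back into the logistic function recovers p_green.
assert abs(1 / (1 + math.exp(-b0)) - p_green) < 1e-12
```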

Page 7:

Odds ratio

• Odds: p / (1 − p)
• Odds ratio:

$$OR = \frac{p_{\text{death}|\text{blue}} \,/\, (1 - p_{\text{death}|\text{blue}})}{p_{\text{death}|\text{green}} \,/\, (1 - p_{\text{death}|\text{green}})}$$

         Blue  Green  Total
Death     28     22     50
Life      45     52     97
Total     73     74    147

$$OR = \frac{28/45}{22/52} = 1.47$$
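The same arithmetic as a short sketch:

```python
# 2x2 table: deaths and survivals by color.
deaths = {"blue": 28, "green": 22}
lives  = {"blue": 45, "green": 52}

odds_blue  = deaths["blue"] / lives["blue"]    # 28/45
odds_green = deaths["green"] / lives["green"]  # 22/52
odds_ratio = odds_blue / odds_green
print(f"OR = {odds_ratio:.2f}")                # 1.47
```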

Page 8:

What do coefficients mean?

$$e^{\beta_{\text{color}}} = OR_{\text{color}}$$

$$OR = \frac{28/45}{22/52} = 1.47 \quad\Rightarrow\quad e^{\beta_{\text{color}}} = 1.47, \quad \beta_{\text{color}} = 0.385$$

         Blue  Green  Total
Death     28     22     50
Life      45     52     97
Total     73     74    147

$$p_{\text{blue}} = \frac{1}{1 + e^{-(-0.8616 + 0.385)}} = 0.383 \qquad p_{\text{green}} = \frac{1}{1 + e^{0.8616}} = 0.297$$
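A sketch verifying the slide's numbers, i.e., that β_color = log(OR) reproduces the observed group rates:

```python
import math

b0 = math.log((22 / 74) / (52 / 74))       # baseline (Green) log-odds, ≈ -0.8602
b_color = math.log((28 / 45) / (22 / 52))  # log odds-ratio, ≈ 0.385

p_blue  = 1 / (1 + math.exp(-(b0 + b_color)))  # ≈ 0.383 = 28/73
p_green = 1 / (1 + math.exp(-b0))              # ≈ 0.297 = 22/74
print(f"b_color={b_color:.3f}, p_blue={p_blue:.3f}, p_green={p_green:.3f}")
```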

Page 9:

What do coefficients mean?

$$e^{\beta_{\text{age}}} = OR_{\text{age}} = \frac{p_{\text{death}|\text{age}=50} \,/\, (1 - p_{\text{death}|\text{age}=50})}{p_{\text{death}|\text{age}=49} \,/\, (1 - p_{\text{death}|\text{age}=49})}$$

         Age 49  Age 50  Total
Death      28      22      50
Life       45      52      97
Total      73      74     147

For a continuous predictor such as age, e^{β_age} is the odds ratio for a one-unit increase (e.g., age 50 vs. age 49).

Page 10:

Why not search using OLS?

$$\hat{y}_i = \beta_0 + \beta_1 x_i \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

[Figure: scatter of points x1 ... x4 around the linear fit y(x).]

$$\mathrm{logit}(p_i) = \log\!\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i$$

[Figure: the logit is linear in x, while the probability curve is not.]

Page 11:

P(model | data)?

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$$

If only the intercept is allowed, which value would it have?

[Figure: two scatter plots of binary y against x with candidate logistic fits.]

Page 12: Logistic Regression - MIT OpenCourseWare · 2020-01-04 · Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor

P (data | model) ?P(data|model) = [P(model | data) P(data)] / P(model)

When comparing models:

P(model): assume all the same (ie, chances ofbeing a model with high coefficients the same aslow, etc)

P(data): assume it is the same Then,

P(data | model) α P(model | data)

Page 13:

Maximum Likelihood Estimation

• Maximize P(data | model)
• Maximize the probability that we would observe what we observed (given the assumption of a particular model)
• Choose the best parameters for the particular model

Page 14:

Maximum Likelihood Estimation

• Steps:
  – Define an expression for the probability of the data as a function of the parameters
  – Find the values of the parameters that maximize this expression

Page 15:

Likelihood Function

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n)$$

Assuming independent observations:

$$L = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$$

Page 16:

Likelihood Function: Binomial

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} \Pr(y_i)$$

$$\Pr(y_i = 1) = p_i \qquad \Pr(y_i = 0) = 1 - p_i$$

$$\Pr(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
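A minimal sketch of this Bernoulli probability and the resulting likelihood; the outcomes and probabilities are made-up illustrations:

```python
import numpy as np

def bernoulli_likelihood(y, p):
    """L = prod_i p_i^y_i * (1 - p_i)^(1 - y_i) for binary outcomes y."""
    return np.prod(p ** y * (1 - p) ** (1 - y))

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])  # illustrative predicted probabilities
print(bernoulli_likelihood(y, p))        # 0.8 * 0.7 * 0.6 * 0.9 * 0.8 = 0.24192
```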

Page 17:

Likelihood Function

$$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{p_i}{1 - p_i}\right)^{y_i} (1 - p_i)$$

$$\log L = \sum_i y_i \log\!\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)$$

Since the model is the logit:

$$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$$

Page 18:

Log Likelihood Function

$$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{p_i}{1 - p_i}\right)^{y_i} (1 - p_i)$$

$$\log L = \sum_i y_i \log\!\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)$$

Page 19:

Log Likelihood Function

$$\log L = \sum_i y_i \log\!\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)$$

Since the model is the logit, $\log[p_i/(1-p_i)] = \beta x_i$ and $1 - p_i = 1/(1 + e^{\beta x_i})$, so:

$$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$$
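A numeric sanity check (with made-up data and coefficients) that the two forms of the log likelihood agree:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.integers(0, 2, size=50)
b0, b1 = -0.5, 1.2                  # illustrative coefficients
eta = b0 + b1 * x                   # linear predictor (beta * x_i)
p = 1 / (1 + np.exp(-eta))

# Form 1: sum_i [ y_i * log(p_i / (1 - p_i)) + log(1 - p_i) ]
ll1 = np.sum(y * np.log(p / (1 - p)) + np.log(1 - p))
# Form 2: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
ll2 = np.sum(y * eta - np.log(1 + np.exp(eta)))
print(np.isclose(ll1, ll2))         # True: the logit substitution is exact
```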

Page 20:

Maximize

$$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$$

$$\frac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i = 0$$

Not easy to solve because ŷ is non-linear in β; iterative methods are needed (the most popular is Newton-Raphson).

Page 21:

Maximize

$$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$$

$$\frac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i = 0, \qquad \hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}$$

Not easy to solve because ŷ is non-linear in β; need to use iterative methods. The most popular is Newton-Raphson.

Page 22:

Newton-Raphson

• Start with random or zero βs
• "Walk" in the "direction" that maximizes the likelihood:
  – how big a step (gradient, or score)
  – direction

Page 23:

Maximizing the Log Likelihood

[Figure: log-likelihood curve over β; the first iteration steps from β_i (initial LL) to β_{i+1} (first-iteration LL).]

Page 24:

Maximizing the Log Likelihood

[Figure: the second iteration steps from β_i to β_{i+1}, moving from the new initial LL toward the maximum.]

Page 25:

Similar iterative method to minimizing the error in gradient descent (neural nets)

[Figure: error surface over the weights; training moves from w_initial (initial error) to w_trained (final error) at a local minimum; a negative derivative produces a positive change in the weight.]

Page 26:

Newton-Raphson Algorithm

$$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$$

$$U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i x_i y_i - \sum_i x_i \hat{y}_i \qquad \text{(gradient, or score)}$$

$$I(\beta) = \frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = -\sum_i x_i x_i'\, \hat{y}_i (1 - \hat{y}_i) \qquad \text{(Hessian)}$$

$$\beta_{j+1} = \beta_j - I(\beta_j)^{-1} U(\beta_j) \qquad \text{(a step)}$$
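A minimal NumPy sketch of this update, assuming a design matrix X whose first column is ones for the intercept; newton_raphson_logistic and the synthetic data are illustrative, not from the lecture:

```python
import numpy as np

def newton_raphson_logistic(X, y, tol=1e-4, max_iter=25):
    """Fit logistic regression by Newton-Raphson.

    X: (n, k) design matrix (include a column of ones for the intercept).
    y: (n,) binary outcomes.
    """
    beta = np.zeros(X.shape[1])                 # start with zero betas
    for _ in range(max_iter):
        y_hat = 1 / (1 + np.exp(-X @ beta))     # fitted probabilities
        U = X.T @ (y - y_hat)                   # gradient (score)
        I = -(X * (y_hat * (1 - y_hat))[:, None]).T @ X  # Hessian, -X'WX
        step = np.linalg.solve(I, U)            # I^{-1} U
        beta = beta - step                      # beta_{j+1} = beta_j - I^{-1} U
        if np.all(np.abs(step) < tol * (np.abs(beta) + 1e-12)):
            break                               # relative-change criterion
    return beta

# Illustrative use on synthetic data with known coefficients (-1, 2).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-1 + 2 * x)))
y = rng.binomial(1, p)
X = np.column_stack([np.ones_like(x), x])
print(newton_raphson_logistic(X, y))  # close to [-1, 2]
```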

Page 27:

Convergence

Criterion:

$$\frac{\beta_{j+1} - \beta_j}{\beta_j} < 0.0001$$

Convergence problems: complete and quasi-complete separation

Page 28:

Complete separation

The MLE does not exist (i.e., it is infinite): when the y = 0 and y = 1 cases are perfectly separated by the predictor, the likelihood keeps improving as the coefficients grow (β_i → β_{i+1} → ...).

[Figure: binary y against x with a gap between the y = 0 and y = 1 points; ever-steeper logistic curves fit the data ever better.]
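A small demonstration (made-up data) of why the MLE is infinite under complete separation: the log likelihood keeps improving as the slope grows, approaching 0 from below without ever reaching a maximum:

```python
import numpy as np

# Completely separated data: every x below 0 has y = 0, every x above 0 has y = 1.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])

for b1 in [1, 2, 5, 10]:
    p = 1 / (1 + np.exp(-b1 * x))
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(f"b1={b1:2d}  log L={ll:.6f}")   # increases toward 0 as b1 grows
```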

Page 29:

Quasi-complete separation

Same values for the predictors, different outcomes.

[Figure: binary y against x; the y = 0 and y = 1 points are separated except for ties at a boundary value of x.]

Page 30: Logistic Regression - MIT OpenCourseWare · 2020-01-04 · Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor

No (quasi)complete separationis fine to find MLE

y

x

Page 31:

How good is the model?

• Is it better than predicting the same prior probability for everyone (i.e., a model with just β0)?
• How well do the training data fit?
• How well does it generalize?

Page 32:

Generalized likelihood-ratio test

• Are β1, β2, ..., βn different from 0?

$$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

$$\log L = \sum_i \left[\,y_i \log p_i + (1 - y_i) \log(1 - p_i)\,\right]$$

$$G = -2 \log L_0 + 2 \log L_1$$

G has a χ² distribution (L_0 is the likelihood of the null model, L_1 of the fitted model).

$$\text{cross-entropy error} = -\sum_i \left[\,y_i \log p_i + (1 - y_i) \log(1 - p_i)\,\right]$$
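A sketch of the test on the Blue/Green table from earlier slides, assuming the alternative model reproduces the two group death rates (which a one-predictor logistic model does here); SciPy supplies the χ² tail probability:

```python
import numpy as np
from scipy.stats import chi2

def log_lik(y, p):
    # log L = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Null model (intercept only): everyone gets the overall death rate 50/147.
y = np.array([1] * 50 + [0] * 97)
ll0 = log_lik(y, np.full(147, 50 / 147))

# Alternative model: group-specific rates, Blue 28/73 and Green 22/74.
y1 = np.array([1] * 28 + [0] * 45 + [1] * 22 + [0] * 52)
p1 = np.concatenate([np.full(73, 28 / 73), np.full(74, 22 / 74)])
ll1 = log_lik(y1, p1)

G = -2 * ll0 + 2 * ll1            # likelihood-ratio statistic
print(G, chi2.sf(G, df=1))        # compare to chi-square with 1 df (one extra parameter)
```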

Page 33:

AIC, SC, BIC

• Used to compare models
• Akaike's Information Criterion, with k parameters:

$$AIC = -2 \log L + 2k$$

• Schwarz Criterion, also called the Bayesian Information Criterion, with n cases:

$$BIC = -2 \log L + k \log n$$
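Both criteria are simple functions of the log likelihood; a minimal sketch with made-up log-likelihood values (lower values are better for both):

```python
import numpy as np

def aic(log_l, k):
    """Akaike's Information Criterion: AIC = -2 log L + 2k."""
    return -2 * log_l + 2 * k

def bic(log_l, k, n):
    """Schwarz / Bayesian Information Criterion: BIC = -2 log L + k log n."""
    return -2 * log_l + k * np.log(n)

# Illustrative comparison: a 2-parameter model vs a 5-parameter model
# with slightly higher likelihood.
print(aic(-93.6, k=2), aic(-92.9, k=5))
print(bic(-93.6, k=2, n=147), bic(-92.9, k=5, n=147))
```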

Page 34:

Summary

• Maximum Likelihood Estimation is used to find the parameters of models
• MLE maximizes the probability that the data obtained would have been generated by the model
• Coming up: goodness of fit (how good are the predictions?)
  – How well do the training data fit?
  – How well does it generalize?