UVA CS 4501: Machine Learning
Lecture 13: Logistic Regression
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
3/23/18
Where are we? → Five major sections of this course

- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Where are we? → Three major types of classification

We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
3. Instance-based classifiers: use observations directly (no models), e.g., k-nearest neighbors
Today

- Bayes Classifier
- Logistic Regression
- Binary to multi-class
- Training LR by MLE
Bayes classifiers

- Treat each feature attribute and the class label as random variables.
- Given a sample x with attributes (x1, x2, …, xp):
  - Goal: predict its class C.
  - Specifically, we want to find the value of Ci that maximizes p(Ci | x1, x2, …, xp).
- Can we estimate p(Ci | x) = p(Ci | x1, x2, …, xp) directly from data?
Bayes Classifiers – MAP Rule

Task: classify a new instance X, described by a tuple of attribute values X = (X1, X2, …, Xp), into one of the classes cj ∈ C.

cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xp)

MAP = Maximum A Posteriori probability

(Adapted from Carlos' probability tutorial. Please read the L13 Extra slides for WHY.)
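As a minimal illustration of the MAP rule above: given estimated posteriors P(cj | x) for each class, predict the class with the largest posterior. The class names and probability values below are made up for illustration.

```python
# Minimal sketch of the MAP rule: pick the class c_j that maximizes P(c_j | x).
# The posterior values here are hypothetical, not estimated from any data.

def map_classify(posteriors):
    """Return the class label c_j with the largest posterior P(c_j | x)."""
    return max(posteriors, key=posteriors.get)

posteriors = {"spam": 0.7, "ham": 0.3}  # hypothetical P(c | x) for one sample x
print(map_classify(posteriors))  # -> spam
```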
A Dataset for classification

- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: special column to be predicted [last column]

Output as discrete class label C ∈ {C1, C2, …, CL}
Establishing a probabilistic model for classification – discriminative model

A discriminative probabilistic classifier takes the input x = (x1, x2, …, xp) and outputs P(c1|x), P(c2|x), …, P(cL|x) directly; the prediction is

argmax_{c ∈ C} P(c | x),  C = {c1, …, cL}

(Adapted from Prof. Ke Chen's NB slides)
A Dataset for classification (continued)

The same dataset, viewed with the two classifier families:

Generative (later!):
argmax_{c ∈ C} P(c | X) = argmax_{c ∈ C} P(X, c) = argmax_{c ∈ C} P(X | c) P(c)

Discriminative:
argmax_{c ∈ C} P(c | X),  C = {c1, …, cL}
Today

- Bayes Classifier
- Logistic Regression
- Binary to multi-class
- Training LR by MLE
From multivariate linear regression to logistic regression

Multivariate linear regression:
y = β0 + β1x1 + β2x2 + … + βpxp

Logistic regression for binary classification:
ln[ P(y|x) / (1 − P(y|x)) ] = β0 + β1x1 + β2x2 + … + βpxp
Review: Probabilistic Interpretation of Linear Regression

- Let us assume that the target variable and the inputs are related by the equation:

  yi = θᵀxi + εi

  where εi is an error term of unmodeled effects or random noise.

- Now assume that ε follows a Gaussian N(0, σ). Then we have:

  p(yi | xi; θ) = (1 / (√(2π) σ)) exp( −(yi − θᵀxi)² / (2σ²) )

- By the i.i.d. assumption:

  L(θ) = ∏_{i=1}^{n} p(yi | xi; θ) = (1 / (√(2π) σ))ⁿ exp( −Σ_{i=1}^{n} (yi − θᵀxi)² / (2σ²) )
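A quick numeric sketch of the point behind this review: under the Gaussian noise model, maximizing L(θ) is the same as minimizing the sum of squared residuals. For a 1-D, no-intercept model y = θx, the least-squares solution is θ̂ = Σxy / Σx²; the data below are made up for illustration.

```python
# Sketch: the least-squares theta also minimizes the Gaussian negative
# log-likelihood from the slide. Data are illustrative, not real.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Closed-form least-squares fit for y = theta * x (no intercept).
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def neg_log_likelihood(theta, sigma=1.0):
    # -log L(theta), dropping the constant n * log(sqrt(2*pi) * sigma).
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)

# The least-squares estimate beats nearby values of theta in likelihood:
assert neg_log_likelihood(theta_hat) < neg_log_likelihood(theta_hat + 0.1)
assert neg_log_likelihood(theta_hat) < neg_log_likelihood(theta_hat - 0.1)
print(theta_hat)  # -> 1.99
```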
Logistic Regression: p(y|x)

P(y|x) = e^(β0 + β1x1 + β2x2 + … + βpxp) / (1 + e^(β0 + β1x1 + β2x2 + … + βpxp)) = 1 / (1 + e^(−βᵀX))

Equivalently:
ln[ P(y|x) / (1 − P(y|x)) ] = β0 + β1x1 + β2x2 + … + βpxp
Logistic Regression models a linear classification boundary!

y ∈ {0, 1}

ln[ P(y|x) / (1 − P(y|x)) ] = β0 + β1x1 + β2x2 + … + βpxp

The right-hand side is linear in x, so the decision boundary P(y|x) = 0.5 (where the log-odds equal zero) is a hyperplane.
The logistic function: a common "S"-shaped function

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

[Figure: P(Y=1|X) plotted against x, rising from 0.0 to 1.0; e.g., the probability of disease.]
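The "S"-shape above can be sketched in a few lines; the values of α and β here are illustrative, not fitted to any data.

```python
import math

# Sketch of the logistic function from the slide:
# P(y|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x)).
# alpha and beta are illustrative defaults, not fitted parameters.

def logistic(x, alpha=0.0, beta=1.0):
    z = alpha + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

print(logistic(0.0))    # 0.5: the log-odds are exactly zero here
print(logistic(10.0))   # close to 1
print(logistic(-10.0))  # close to 0
```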
Logistic Regression – when?

Logistic regression models are appropriate for a target variable coded as 0/1.

We only observe "0" and "1" for the target variable, but we think of the target variable conceptually as the probability that "1" will occur.

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = p(y=1|x) given by the model.

The main interest → predicting the probability that an event occurs, i.e., p(y=1|x).
The logit function view

Logit of P(y|x):
ln[ P(y|x) / (1 − P(y|x)) ] = α + βx   ⇔   P(y|x) = e^(α + βx) / (1 + e^(α + βx))

ln[ P(y=1|x) / P(y=0|x) ] = ln[ P(y=1|x) / (1 − P(y=1|x)) ] = α + β1x1 + β2x2 + … + βpxp

Decision boundary → logit equals zero
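The two equivalent forms above can be checked numerically: the logit ln[p/(1−p)] inverts the logistic function, so applying it to P(y|x) recovers the linear score α + βx. The values of α, β, and x are illustrative.

```python
import math

# Sketch checking the logit view: logit(logistic(z)) == z, so the logit of
# P(y|x) recovers the linear score alpha + beta*x. Values are illustrative.

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

alpha, beta, x = -1.0, 2.0, 0.75
p = logistic(alpha + beta * x)
print(logit(p))  # recovers alpha + beta*x = 0.5
```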
Logistic Regression Assumptions

- Linearity in the logit: the regression equation should have a linear relationship with the logit form of the target variable.
- There is no assumption about the feature variables / predictors being linearly related to each other.
Today

- Bayes Classifier
- Logistic Regression
- Binary to multi-class
- Training LR by MLE
Binary Logistic Regression

In summary, logistic regression tells us two things at once:
- Transformed, the "log odds" (logit) ln[p/(1−p)] are a linear function of x.
- Untransformed, P(Y=1|x) follows the logistic ("S"-shaped) curve in x.

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = p(y=1|x) given by the model. The odds are p/(1−p).
Binary → Multinoulli Logistic Regression Model

Pr(G = k | X = x) = exp(βk0 + βkᵀx) / (1 + Σ_{l=1}^{K−1} exp(βl0 + βlᵀx)),  k = 1, …, K−1

Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(βl0 + βlᵀx))

- Directly models the posterior probabilities as the output of regression.
- Note that the class boundaries are linear.
- x is a p-dimensional input vector; βk is a p-dimensional vector for each class k.
- Total number of parameters is (K−1)(p+1).
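The two formulas above can be sketched directly: given the K−1 linear scores βk0 + βkᵀx, the model turns them into K probabilities that sum to 1, with class K as the reference class. The score values below are illustrative, not fitted.

```python
import math

# Sketch of the multinoulli logistic model with class K as reference:
# Pr(G=k|x) = exp(s_k) / (1 + sum_l exp(s_l)) for k = 1..K-1,
# Pr(G=K|x) = 1 / (1 + sum_l exp(s_l)).
# The scores below are illustrative values of beta_k0 + beta_k^T x.

def multinoulli_probs(scores):
    """scores: list of K-1 linear scores; returns K class probabilities."""
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # reference class K
    return probs

probs = multinoulli_probs([0.2, -0.5])  # K = 3 classes
print(probs, sum(probs))  # K probabilities summing to 1
```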
Today

- Bayes Classifier
- Logistic Regression
- Binary to multi-class
- Training LR by MLE
Parameter Estimation for LR → MLE from the data

- Review:
  - Linear regression training → least squares
  - Gaussian explanation of linear regression training → MLE
- Logistic regression training:
  - Maximum likelihood estimation
  - Classic algorithms use iterative Newton steps

Please read the L13 Extra slides for HOW.
Review: Defining Likelihood for a basic Bernoulli

- Likelihood = p(data | parameter)

PMF (a function of xi):
f(xi | p) = p^(xi) (1 − p)^(1−xi)

Likelihood (a function of p):
L(p) = ∏_{i=1}^{n} p^(xi) (1 − p)^(1−xi) = p^x (1 − p)^(n−x)

→ e.g., for n independent tosses of a coin with unknown p, the observed data are x heads from n trials, where x = Σ_{i=1}^{n} xi.
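The coin-toss example can be made concrete: with x heads out of n tosses, L(p) = p^x (1−p)^(n−x) is maximized at p̂ = x/n. The toss sequence below is made up for illustration.

```python
# Sketch of the Bernoulli likelihood from the slide, with an illustrative
# toss sequence (1 = heads). The MLE is p_hat = x / n.

tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n, x = len(tosses), sum(tosses)

def likelihood(p):
    return (p ** x) * ((1 - p) ** (n - x))

p_hat = x / n  # maximum likelihood estimate
# p_hat gives at least as high a likelihood as nearby values of p:
assert likelihood(p_hat) >= likelihood(p_hat - 0.05)
assert likelihood(p_hat) >= likelihood(p_hat + 0.05)
print(p_hat)  # -> 0.7
```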
MLE for Logistic Regression Training (extra)

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: (xi, yi), i = 1, …, N. We want to maximize the log-likelihood in order to estimate β. Each xi is a (p+1)-dimensional input vector with leading entry 1, and β is a (p+1)-dimensional vector.

For the Bernoulli distribution, Pr(Y = y | X = x) = pβ(y|x)^y (1 − pβ(y|x))^(1−y), so:

l(β) = Σ_{i=1}^{N} log Pr(Y = yi | X = xi)
     = Σ_{i=1}^{N} [ yi log Pr(Y=1|X=xi) + (1 − yi) log Pr(Y=0|X=xi) ]
     = Σ_{i=1}^{N} [ yi log( exp(βᵀxi) / (1 + exp(βᵀxi)) ) + (1 − yi) log( 1 / (1 + exp(βᵀxi)) ) ]
     = Σ_{i=1}^{N} [ yi βᵀxi − log(1 + exp(βᵀxi)) ]
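The final line of the derivation translates directly into code. The tiny dataset and β below are illustrative; each xi carries a leading 1 for the intercept, matching the slide's convention.

```python
import math

# Sketch of the binary logistic log-likelihood from the slide:
# l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].
# Dataset and beta are illustrative.

data = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1)]  # (x_i, y_i) pairs
beta = [0.1, 1.0]

def log_likelihood(beta, data):
    total = 0.0
    for x, y in data:
        score = sum(b * xj for b, xj in zip(beta, x))  # beta^T x_i
        total += y * score - math.log(1.0 + math.exp(score))
    return total

# A sum of log-probabilities, so it is always <= 0:
print(log_likelihood(beta, data))
```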
How do we maximize

l(β) = Σ_{i=1}^{N} { log Pr(Y = yi | X = xi) } ?
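The lecture's answer (per the summary slide) is an iterative Newton method; as a simpler stand-in, the sketch below maximizes l(β) by plain gradient ascent, using the gradient ∇l(β) = Σi (yi − pi) xi with pi = 1/(1 + exp(−βᵀxi)). The dataset, step size, and iteration count are all illustrative.

```python
import math

# Gradient-ascent sketch for maximizing the logistic log-likelihood.
# (The slides' actual method is iterative Newton; this is a simpler stand-in.)
# Data points carry a leading 1 for the intercept; values are made up.

data = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, -0.3], 0)]

def predict(beta, x):
    """P(y=1|x) = 1 / (1 + exp(-beta^T x))."""
    return 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, x))))

beta = [0.0, 0.0]
for _ in range(500):
    grad = [0.0, 0.0]
    for x, y in data:
        err = y - predict(beta, x)       # (y_i - p_i)
        for j in range(len(beta)):
            grad[j] += err * x[j]        # accumulate (y_i - p_i) * x_i
    beta = [b + 0.1 * g for b, g in zip(beta, grad)]  # ascent step

# After training, the fitted model separates the illustrative points:
print([round(predict(beta, x), 2) for x, _ in data])
```

Newton's method replaces the fixed step with a Hessian-weighted one, which converges in far fewer iterations; the gradient itself is the same.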
Logistic Regression (summary)

Task: classification
Representation: log-odds = linear function of X's
Score function: EPE, MLE
Search/optimization: iterative (Newton) method
Models, parameters: logistic weights, P(c=1|x) = e^(α+βx) / (1 + e^(α+βx))
References

- Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Prof. Ke Chen's NB slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2, No. 1. New York: Springer, 2009.