TRANSCRIPT
UVA CS 6316/4501 – Fall 2016
Machine Learning
Lecture 14: Logistic Regression / Generative vs. Discriminative
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
11/2/16
[Slide footer throughout: Dr. Yanjun Qi / UVA CS 6316-4501 / f16]
Where are we? → Five major sections of this course

- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Where are we? → Three major sections for classification

- We can divide the large variety of classification approaches into roughly three major types:
  1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree
  2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
  3. Instance-based classifiers: use observations directly (no models), e.g., k-nearest neighbors
A Dataset for classification

- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: special column to be predicted [last column]
Output as Discrete Class Label

C ∈ {c1, c2, …, cL}

Generative:
argmax_C P(C|X) = argmax_C P(X, C) = argmax_C P(X|C) P(C)

Discriminative:
argmax_C P(C|X), C = c1, …, cL
Establishing a probabilistic model for classification (cont.)

(1) Generative model

[Figure: one generative probabilistic model per class — P(x|c1), P(x|c2), …, P(x|cL) — each taking the input x = (x1, x2, …, xp)]

argmax_C P(C|X) = argmax_C P(X, C) = argmax_C P(X|C) P(C)

Adapted from Prof. Ke Chen's NB slides
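The generative MAP rule above can be sketched with a tiny discrete example. The class priors and class-conditional likelihoods below are invented purely for illustration; only the rule argmax_C P(X|C)P(C) comes from the slide.

```python
# Hypothetical discrete example of the generative MAP rule:
#   argmax_C P(C|X) = argmax_C P(X|C) P(C).
# All numbers below are invented for illustration only.

priors = {"c1": 0.7, "c2": 0.3}                # P(C)
likelihoods = {                                 # P(X = x | C)
    "c1": {"sunny": 0.8, "rainy": 0.2},
    "c2": {"sunny": 0.3, "rainy": 0.7},
}

def map_classify(x):
    # Score each class by P(X|C) P(C); the normalizer P(X) is
    # the same for every class, so it can be dropped.
    return max(priors, key=lambda c: likelihoods[c][x] * priors[c])

print(map_classify("sunny"))   # c1: 0.8*0.7 = 0.56 beats 0.3*0.3 = 0.09
print(map_classify("rainy"))   # c2: 0.7*0.3 = 0.21 beats 0.2*0.7 = 0.14
```

Dropping P(X) is what makes argmax over P(X,C) and over P(C|X) equivalent.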
Establishing a probabilistic model for classification

(2) Discriminative model

P(C|X), C = c1, …, cL, X = (X1, …, Xn)

[Figure: a single discriminative probabilistic classifier taking x = (x1, x2, …, xn) as input and outputting P(c1|x), P(c2|x), …, P(cL|x)]

Adapted from Prof. Ke Chen's NB slides
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Bayes Classifiers – MAP Rule

Task: classify a new instance X, described by a tuple of attribute values X = (X1, X2, …, Xp), into one of the classes:

c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xp)

MAP = Maximum A Posteriori Probability

WHY?

Adapted from Carlos' prob tutorial
0-1 LOSS for Classification

- Procedure for a categorical output variable C.
- Frequently, the 0-1 loss function L(k, ℓ) is used.
- L(k, ℓ) is the price paid for misclassifying an element from class Ck as belonging to class Cℓ → an L×L matrix.
Expected prediction error (EPE)

- Expected prediction error (EPE), with the expectation taken w.r.t. the joint distribution Pr(C, X) = Pr(C|X) Pr(X):

EPE(f) = E_{X,C}[ L(C, f(X)) ] = E_X Σ_{k=1}^{L} L(C_k, f(X)) Pr(C_k | X)

(Consider the sample population distribution.)
Expected prediction error (EPE)

- Pointwise minimization suffices:

EPE(f) = E_{X,C}[ L(C, f(X)) ] = E_X Σ_{k=1}^{K} L(C_k, f(X)) Pr(C_k | X)

- → simply

f̂(x) = argmin_{g ∈ C} Σ_{k=1}^{K} L(C_k, g) Pr(C_k | X = x)

f̂(x) = C_k if Pr(C_k | X = x) = max_{g ∈ C} Pr(g | X = x)

→ the Bayes Classifier. (Consider the sample population distribution.)
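The pointwise minimization f̂(x) = argmin_g Σ_k L(C_k, g) Pr(C_k|X=x) can be sketched directly. The posterior values and the asymmetric loss matrix below are invented for illustration; with 0-1 loss the rule reduces to picking the most probable class (the Bayes classifier / MAP rule).

```python
# Pointwise EPE minimization:
#   f(x) = argmin_g sum_k L(C_k, g) Pr(C_k | X = x).
# Posterior values and the asymmetric loss are invented for illustration.

def bayes_rule(posterior, loss):
    """posterior[k] = Pr(C_k|X=x); loss[k][g] = price of predicting g when truth is k."""
    K = len(posterior)
    expected = [sum(loss[k][g] * posterior[k] for k in range(K)) for g in range(K)]
    return min(range(K), key=lambda g: expected[g])

posterior = [0.2, 0.5, 0.3]
zero_one = [[0 if k == g else 1 for g in range(3)] for k in range(3)]
print(bayes_rule(posterior, zero_one))   # 1 -- the most probable class

# An asymmetric loss can change the decision even though the posterior
# is unchanged (here, misclassifying class 2 costs 10x as much):
asym = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
print(bayes_rule(posterior, asym))       # 2
```

This is why 0-1 loss, and only 0-1 loss, makes the EPE-optimal rule coincide with MAP.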
SUMMARY: WHEN EPE USES DIFFERENT LOSS

Loss Function | Estimator
L2            | conditional mean E[C|X]
L1            | conditional median
0-1           | most probable class (Bayes classifier / MAP)
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Multivariate linear regression to Logistic Regression

Names for the target: dependent variable, predicted, response variable, outcome variable.
Names for the inputs: independent variables, predictor variables, explanatory variables, covariables.

Multivariate linear regression:
y = α + β1 x1 + β2 x2 + … + βp xp

Logistic regression for binary classification:
ln[ P(y|x) / (1 − P(y|x)) ] = α + β1 x1 + β2 x2 + … + βp xp
y ∈ {0, 1}

(1) Linear decision boundary
(2) p(y|x)

ln[ P(y|x) / (1 − P(y|x)) ] = α + β1 x1 + β2 x2 + … + βp xp
The logistic function (1): a common "S"-shaped function

[Figure: sigmoid curve of P(Y=1|X) against x, rising from 0.0 to 1.0; e.g., probability of disease]

P(y|x) = e^{α+βx} / (1 + e^{α+βx})
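The "S" shape is easy to verify numerically. A minimal sketch of the logistic function above, with arbitrary illustrative values for α and β:

```python
import math

# The logistic function: P(y=1|x) = e^(a+bx) / (1 + e^(a+bx))
#                                 = 1 / (1 + e^-(a+bx)).
# alpha and beta defaults are arbitrary illustrative values.

def logistic(x, alpha=0.0, beta=1.0):
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

print(logistic(0.0))    # 0.5 at the midpoint (when alpha + beta*x = 0)
print(logistic(6.0))    # close to 1
print(logistic(-6.0))   # close to 0
```

The output is always strictly between 0 and 1, which is what lets it be read as a probability.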
Logistic Regression – when?

Logistic regression models are appropriate for a target variable coded as 0/1.

We only observe "0" and "1" for the target variable, but we think of the target variable conceptually as a probability that "1" will occur.
This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = P(y=1|x) predefined. The main interest → predicting the probability that an event occurs, i.e., P(y=1|x).
[Figure: sigmoid curve of P(y=1|X) against x, rising from 0.0 to 1.0; e.g., probability of disease]

Discriminative: logistic regression models for a binary target variable coded 0/1.

Logistic function:
P(y=1|x) = e^{α+βx} / (1 + e^{α+βx})
Logit function:
ln[ P(y=1|x) / P(y=0|x) ] = ln[ P(y=1|x) / (1 − P(y=1|x)) ] = α + β1 x1 + β2 x2 + … + βp xp

Decision boundary → where this equals zero
The logistic function (2)

Logit of P(y|x):
ln[ P(y|x) / (1 − P(y|x)) ] = α + βx

P(y|x) = e^{α+βx} / (1 + e^{α+βx})
The logistic function (3)

- Advantages of the logit:
  - Simple transformation of P(y|x)
  - Linear relationship with x
  - Can be continuous (logit ranges from −∞ to +∞)
  - Directly related to the notion of log odds of the target event

ln[ P / (1 − P) ] = α + βx
P / (1 − P) = e^{α+βx}
z = log[ p / (1 − p) ]
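The log-odds relationship P/(1−P) = e^{α+βx} above has a concrete reading: a one-unit increase in x multiplies the odds by e^β, regardless of where x starts. A small sketch, with invented coefficient values:

```python
import math

# The logit z = ln(p/(1-p)) is linear in x: z = alpha + beta*x,
# so the odds are p/(1-p) = e^(alpha + beta*x).
# Coefficient values below are invented for illustration.

alpha, beta = -1.0, 0.8

def prob(x):
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

# The odds ratio between x+1 and x equals e^beta at every x:
print(odds(1.0) / odds(0.0))   # e^0.8
print(odds(5.0) / odds(4.0))   # same ratio
```

This constant multiplicative effect on the odds is what "directly related to the notion of log odds" means in practice.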
Logistic Regression Assumptions

- Linearity in the logit: the regression equation should have a linear relationship with the logit form of the target variable.
- There is no assumption about the feature variables / target predictors being linearly related to each other.
Binary Logistic Regression

In summary, logistic regression tells us two things at once:
- Transformed, the "log odds" (logit) ln[p/(1−p)] are linear in x.
- The probability P(Y=1|x) follows a logistic distribution as a function of x.

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = P(y=1|x) predefined.

With outcome probabilities p and 1−p: Odds = p / (1 − p)
Binary → Multinomial Logistic Regression Model

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)),  k = 1, …, K−1

Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x))

- Directly models the posterior probabilities as the output of regression.
- Note that the class boundaries are linear.
- x is a p-dimensional input vector; β_k is a p-dimensional vector for each k; the total number of parameters is (K−1)(p+1).
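The multinomial model above can be sketched directly from its two formulas, with class K as the reference class. The coefficients below are invented for illustration (K = 3, p = 2, so (K−1)(p+1) = 6 parameters):

```python
import math

# Multinomial logistic regression:
#   Pr(G=k|X=x) = exp(b_k0 + b_k^T x) / (1 + sum_l exp(b_l0 + b_l^T x)), k = 1..K-1
#   Pr(G=K|X=x) = 1 / (1 + sum_l exp(b_l0 + b_l^T x))
# Coefficient values are invented for illustration.

def multinomial_probs(x, betas):
    """betas: list of K-1 vectors [b_k0, b_k1, ..., b_kp]."""
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)        # the reference class K
    return probs

betas = [[0.5, 1.0, -1.0], [-0.2, 0.3, 0.4]]   # (K-1)(p+1) = 6 parameters
p = multinomial_probs([1.0, 2.0], betas)
print(p, sum(p))                     # K probabilities summing to 1
```

The shared denominator is what guarantees the K posteriors sum to one, and the boundary between any two classes is where their (linear) scores are equal.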
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
  - Parameter Estimation for LR
- Generative vs. Discriminative
Parameter Estimation for LR → MLE from the data

- RECAP: linear regression → least squares
- Logistic regression → maximum likelihood estimation
MLE for Logistic Regression Training

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: (xi, yi), i = 1, …, N. The xi are (p+1)-dimensional input vectors with leading entry 1; β is a (p+1)-dimensional vector.

(Conditional) log-likelihood:

l(β) = Σ_{i=1}^{N} log Pr(Y = yi | X = xi)
     = Σ_{i=1}^{N} [ yi log Pr(Y=1|X=xi) + (1 − yi) log Pr(Y=0|X=xi) ]
     = Σ_{i=1}^{N} [ yi log( exp(β^T xi) / (1 + exp(β^T xi)) ) + (1 − yi) log( 1 / (1 + exp(β^T xi)) ) ]
     = Σ_{i=1}^{N} [ yi β^T xi − log(1 + exp(β^T xi)) ]

(For a Bernoulli distribution: p(y|x) = p^y (1 − p)^{1−y}.)

We want to maximize the log-likelihood in order to estimate β.
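The last line of the derivation, l(β) = Σ_i [ yi β^T xi − log(1 + exp(β^T xi)) ], is straightforward to compute. A sketch on a tiny invented dataset (each xi carries a leading 1 for the intercept):

```python
import math

# Conditional log-likelihood for binary logistic regression:
#   l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].
# The dataset below is invented for illustration.

def log_likelihood(beta, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, xi))     # beta^T x_i
        total += yi * z - math.log(1.0 + math.exp(z))
    return total

X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
print(log_likelihood([0.0, 0.0], X, y))   # 3 * -log(2), about -2.0794
print(log_likelihood([0.5, 1.0], X, y))   # larger: this beta fits better
```

At β = 0 every example contributes −log 2, since the model then predicts probability 1/2 for everything; MLE searches for the β that drives this sum as high as possible.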
How?

Newton-Raphson for LR (optional)

Setting the gradient of l(β) = Σ_{i=1}^{N} log Pr(Y = yi | X = xi) to zero:

∂l(β)/∂β = Σ_{i=1}^{N} xi ( yi − exp(β^T xi) / (1 + exp(β^T xi)) ) = 0

These are (p+1) non-linear equations to solve for (p+1) unknowns.

Solve by the Newton-Raphson method:

β^new ← β^old − [ ∂²l(β) / ∂β∂β^T ]^{−1} ∂l(β)/∂β

where

∂²l(β) / ∂β∂β^T = − Σ_{i=1}^{N} xi xi^T ( exp(β^T xi) / (1 + exp(β^T xi)) ) ( 1 / (1 + exp(β^T xi)) )
                = − Σ_{i=1}^{N} xi xi^T p(xi; β) (1 − p(xi; β))
Newton-Raphson for LR…

Newton-Raphson minimizes a quadratic approximation to the function we are really interested in.

In matrix notation:

∂l(β)/∂β = Σ_{i=1}^{N} ( yi − exp(β^T xi) / (1 + exp(β^T xi)) ) xi = X^T (y − p)

∂²l(β)/∂β∂β^T = − X^T W X

So the NR rule becomes:

β^new ← β^old + (X^T W X)^{−1} X^T (y − p)

where
- X: N×(p+1) matrix of the xi
- y: N×1 vector of the yi
- p: N×1 vector of p(xi; β^old) = exp(β^old T xi) / (1 + exp(β^old T xi))
- W: N×N diagonal matrix with entries p(xi; β^old)(1 − p(xi; β^old))
Newton-Raphson for LR…

- Re-expressing the Newton step as a weighted least squares step:

β^new = β^old + (X^T W X)^{−1} X^T (y − p)
      = (X^T W X)^{−1} X^T W ( X β^old + W^{−1}(y − p) )
      = (X^T W X)^{−1} X^T W z

- Adjusted response: z = X β^old + W^{−1}(y − p)

- Iteratively reweighted least squares (IRLS):

β^new ← argmin_β (z − Xβ)^T W (z − Xβ)
β^new ← argmin_β (y − p)^T W^{−1} (y − p)
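The Newton/IRLS update above fits in a few lines of numpy. A minimal sketch, with a toy dataset invented for illustration (in practice one would also add a convergence check and guard against perfectly separable data, where the update diverges):

```python
import numpy as np

# Newton-Raphson / IRLS for logistic regression:
#   beta_new = beta_old + (X^T W X)^{-1} X^T (y - p).
# The simulated data below is invented for illustration.

def irls_logistic(X, y, n_iter=15):
    """X: N x (p+1) design matrix with a leading column of ones; y: N-vector of 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # p(x_i; beta)
        W = np.diag(p * (1.0 - p))              # N x N diagonal weight matrix
        # Newton step; equivalently a weighted least-squares solve
        # on the adjusted response z = X beta + W^{-1}(y - p).
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(irls_logistic(X, y))   # should roughly recover [-0.5, 2.0]
```

Each iteration is exactly one weighted least-squares solve, which is why the method is called iteratively reweighted least squares: both W and the working response change as β changes.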
Logistic Regression

Task:                classification
Representation:      log-odds = linear function of X's
Score Function:      EPE, with conditional log-likelihood
Search/Optimization: iterative (Newton) method
Models, Parameters:  logistic weights, P(c=1|x) = e^{α+βx} / (1 + e^{α+βx})
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Discriminative vs. Generative

- Generative approach: model the joint distribution p(X, C) using p(X|C = ck) and the class prior p(C = ck)
- Discriminative approach: model the conditional distribution p(C|X) directly
Discriminative vs. Generative

[Figure: LDA vs. Logistic Regression – generative side fits a Gaussian per class over Height, discriminative side models the posterior Pr directly with logistic regression]
Discriminative vs. Generative

- Definitions:
  - h_gen and h_dis: generative and discriminative classifiers
  - h_gen,inf and h_dis,inf: the same classifiers, but trained on the entire population (asymptotic classifiers)
  - As n → infinity, h_gen → h_gen,inf and h_dis → h_dis,inf

Ng, A., and M. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.
Discriminative vs. Generative

Proposition 1:
Proposition 2:
- p: number of dimensions
- n: number of observations
- ϵ: generalization error
Logistic Regression vs. NBC

- Discriminative classifier (Logistic Regression):
  - Smaller asymptotic error
  - Slow convergence ~ O(p)
- Generative classifier (Naive Bayes):
  - Larger asymptotic error
  - Can handle missing data (EM)
  - Fast convergence ~ O(lg(p))

In numerical analysis, the speed at which a convergent sequence approaches its limit is called the rate of convergence.

[Figure: generalization error vs. size of training set – Naive Bayes converges faster, Logistic Regression reaches a lower asymptotic error]

Ng, A., and M. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.
[Figure: generalization error vs. size of training set, from the comment paper below]

Xue, Jing-Hao, and D. Michael Titterington. "Comment on 'On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes'." Neural Processing Letters 28.3 (2008): 169-187.
Discriminative vs. Generative

- Empirically, generative classifiers approach their asymptotic error faster than discriminative ones
  - Good for small training sets
  - Handle missing data well (EM)
- Empirically, discriminative classifiers have lower asymptotic error than generative ones
  - Good for larger training sets
References

- Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Prof. Ke Chen's NB slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2, No. 1. New York: Springer, 2009.