TRANSCRIPT
UVA CS 6316/4501 – Fall 2016
Machine Learning
Lecture 14: Logistic Regression / Generative vs. Discriminative
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
11/2/16
[Slide footer throughout: Dr. Yanjun Qi / UVA CS 6316-4501 / f16]
Where are we? → Five major sections of this course

- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Where are we? → Three major sections for classification

- We can divide the large variety of classification approaches into roughly three major types:
  1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree
  2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
  3. Instance-based classifiers: use observations directly (no models), e.g., k-nearest neighbors
A Dataset for classification

- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: special column to be predicted [last column]
Output as Discrete Class Label

C ∈ {c1, c2, …, cL}

Generative:
argmax_C P(C|X) = argmax_C P(X, C) = argmax_C P(X|C) P(C)

Discriminative:
argmax_C P(C|X), C = c1, …, cL
Establishing a probabilistic model for classification (cont.)

(1) Generative model

[Figure: one generative probabilistic model per class — P(x|c1), P(x|c2), …, P(x|cL) — each taking the input x = (x1, x2, …, xp)]

argmax_C P(C|X) = argmax_C P(X, C) = argmax_C P(X|C) P(C)

Adapted from Prof. Ke Chen's NB slides
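The generative MAP rule above can be sketched with a tiny discrete example. The class priors and class-conditional likelihoods below are invented purely for illustration; only the rule argmax_C P(X|C)P(C) comes from the slide.

```python
# Hypothetical discrete example of the generative MAP rule:
#   argmax_C P(C|X) = argmax_C P(X|C) P(C).
# All numbers below are invented for illustration only.

priors = {"c1": 0.7, "c2": 0.3}                # P(C)
likelihoods = {                                 # P(X = x | C)
    "c1": {"sunny": 0.8, "rainy": 0.2},
    "c2": {"sunny": 0.3, "rainy": 0.7},
}

def map_classify(x):
    # Score each class by P(X|C) P(C); the normalizer P(X) is
    # the same for every class, so it can be dropped.
    return max(priors, key=lambda c: likelihoods[c][x] * priors[c])

print(map_classify("sunny"))   # c1: 0.8*0.7 = 0.56 beats 0.3*0.3 = 0.09
print(map_classify("rainy"))   # c2: 0.7*0.3 = 0.21 beats 0.2*0.7 = 0.14
```

Dropping P(X) is what makes argmax over P(X,C) and over P(C|X) equivalent.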
Establishing a probabilistic model for classification

(2) Discriminative model

P(C|X), C = c1, …, cL, X = (X1, …, Xn)

[Figure: a single discriminative probabilistic classifier taking x = (x1, x2, …, xn) as input and outputting P(c1|x), P(c2|x), …, P(cL|x)]

Adapted from Prof. Ke Chen's NB slides
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Bayes Classifiers – MAP Rule

Task: classify a new instance X, described by a tuple of attribute values X = (X1, X2, …, Xp), into one of the classes:

c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xp)

MAP = Maximum A Posteriori Probability

WHY?

Adapted from Carlos' prob tutorial
0-1 LOSS for Classification

- Procedure for a categorical output variable C.
- Frequently, the 0-1 loss function L(k, ℓ) is used.
- L(k, ℓ) is the price paid for misclassifying an element from class Ck as belonging to class Cℓ → an L×L matrix.
Expected prediction error (EPE)

- Expected prediction error (EPE), with the expectation taken w.r.t. the joint distribution Pr(C, X) = Pr(C|X) Pr(X):

EPE(f) = E_{X,C}[ L(C, f(X)) ] = E_X Σ_{k=1}^{L} L(C_k, f(X)) Pr(C_k | X)

(Consider the sample population distribution.)
Expected prediction error (EPE)

- Pointwise minimization suffices:

EPE(f) = E_{X,C}[ L(C, f(X)) ] = E_X Σ_{k=1}^{K} L(C_k, f(X)) Pr(C_k | X)

- → simply

f̂(x) = argmin_{g ∈ C} Σ_{k=1}^{K} L(C_k, g) Pr(C_k | X = x)

f̂(x) = C_k if Pr(C_k | X = x) = max_{g ∈ C} Pr(g | X = x)

→ the Bayes Classifier. (Consider the sample population distribution.)
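The pointwise minimization f̂(x) = argmin_g Σ_k L(C_k, g) Pr(C_k|X=x) can be sketched directly. The posterior values and the asymmetric loss matrix below are invented for illustration; with 0-1 loss the rule reduces to picking the most probable class (the Bayes classifier / MAP rule).

```python
# Pointwise EPE minimization:
#   f(x) = argmin_g sum_k L(C_k, g) Pr(C_k | X = x).
# Posterior values and the asymmetric loss are invented for illustration.

def bayes_rule(posterior, loss):
    """posterior[k] = Pr(C_k|X=x); loss[k][g] = price of predicting g when truth is k."""
    K = len(posterior)
    expected = [sum(loss[k][g] * posterior[k] for k in range(K)) for g in range(K)]
    return min(range(K), key=lambda g: expected[g])

posterior = [0.2, 0.5, 0.3]
zero_one = [[0 if k == g else 1 for g in range(3)] for k in range(3)]
print(bayes_rule(posterior, zero_one))   # 1 -- the most probable class

# An asymmetric loss can change the decision even though the posterior
# is unchanged (here, misclassifying class 2 costs 10x as much):
asym = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
print(bayes_rule(posterior, asym))       # 2
```

This is why 0-1 loss, and only 0-1 loss, makes the EPE-optimal rule coincide with MAP.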
SUMMARY: WHEN EPE USES DIFFERENT LOSS

Loss Function | Estimator
L2            | conditional mean E[C|X]
L1            | conditional median
0-1           | most probable class (Bayes classifier / MAP)
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Multivariate linear regression to Logistic Regression

Names for the target: dependent variable, predicted, response variable, outcome variable.
Names for the inputs: independent variables, predictor variables, explanatory variables, covariables.

Multivariate linear regression:
y = α + β1 x1 + β2 x2 + … + βp xp

Logistic regression for binary classification:
ln[ P(y|x) / (1 − P(y|x)) ] = α + β1 x1 + β2 x2 + … + βp xp
y ∈ {0, 1}

(1) Linear decision boundary
(2) p(y|x)

ln[ P(y|x) / (1 − P(y|x)) ] = α + β1 x1 + β2 x2 + … + βp xp
The logistic function (1): a common "S"-shaped function

[Figure: sigmoid curve of P(Y=1|X) against x, rising from 0.0 to 1.0; e.g., probability of disease]

P(y|x) = e^{α+βx} / (1 + e^{α+βx})
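The "S" shape is easy to verify numerically. A minimal sketch of the logistic function above, with arbitrary illustrative values for α and β:

```python
import math

# The logistic function: P(y=1|x) = e^(a+bx) / (1 + e^(a+bx))
#                                 = 1 / (1 + e^-(a+bx)).
# alpha and beta defaults are arbitrary illustrative values.

def logistic(x, alpha=0.0, beta=1.0):
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

print(logistic(0.0))    # 0.5 at the midpoint (when alpha + beta*x = 0)
print(logistic(6.0))    # close to 1
print(logistic(-6.0))   # close to 0
```

The output is always strictly between 0 and 1, which is what lets it be read as a probability.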
Logistic Regression – when?

Logistic regression models are appropriate for a target variable coded as 0/1.

We only observe "0" and "1" for the target variable, but we think of the target variable conceptually as a probability that "1" will occur.
This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = P(y=1|x) predefined. The main interest → predicting the probability that an event occurs, i.e., P(y=1|x).
[Figure: sigmoid curve of P(y=1|X) against x, rising from 0.0 to 1.0; e.g., probability of disease]

Discriminative: logistic regression models for a binary target variable coded 0/1.

Logistic function:
P(y=1|x) = e^{α+βx} / (1 + e^{α+βx})
Logit function:
ln[ P(y=1|x) / P(y=0|x) ] = ln[ P(y=1|x) / (1 − P(y=1|x)) ] = α + β1 x1 + β2 x2 + … + βp xp

Decision boundary → where this equals zero
The logistic function (2)

Logit of P(y|x):
ln[ P(y|x) / (1 − P(y|x)) ] = α + βx

P(y|x) = e^{α+βx} / (1 + e^{α+βx})
The logistic function (3)

- Advantages of the logit:
  - Simple transformation of P(y|x)
  - Linear relationship with x
  - Can be continuous (logit ranges from −∞ to +∞)
  - Directly related to the notion of log odds of the target event

ln[ P / (1 − P) ] = α + βx
P / (1 − P) = e^{α+βx}
z = log[ p / (1 − p) ]
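The log-odds relationship P/(1−P) = e^{α+βx} above has a concrete reading: a one-unit increase in x multiplies the odds by e^β, regardless of where x starts. A small sketch, with invented coefficient values:

```python
import math

# The logit z = ln(p/(1-p)) is linear in x: z = alpha + beta*x,
# so the odds are p/(1-p) = e^(alpha + beta*x).
# Coefficient values below are invented for illustration.

alpha, beta = -1.0, 0.8

def prob(x):
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

# The odds ratio between x+1 and x equals e^beta at every x:
print(odds(1.0) / odds(0.0))   # e^0.8
print(odds(5.0) / odds(4.0))   # same ratio
```

This constant multiplicative effect on the odds is what "directly related to the notion of log odds" means in practice.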
Logistic Regression Assumptions

- Linearity in the logit: the regression equation should have a linear relationship with the logit form of the target variable.
- There is no assumption about the feature variables / target predictors being linearly related to each other.
Binary Logistic Regression

In summary, logistic regression tells us two things at once:
- Transformed, the "log odds" (logit) ln[p/(1−p)] are linear in x.
- The probability P(Y=1|x) follows a logistic distribution as a function of x.

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = P(y=1|x) predefined.

With outcome probabilities p and 1−p: Odds = p / (1 − p)
Binary → Multinomial Logistic Regression Model

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)),  k = 1, …, K−1

Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x))

- Directly models the posterior probabilities as the output of regression.
- Note that the class boundaries are linear.
- x is a p-dimensional input vector; β_k is a p-dimensional vector for each k; the total number of parameters is (K−1)(p+1).
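The multinomial model above can be sketched directly from its two formulas, with class K as the reference class. The coefficients below are invented for illustration (K = 3, p = 2, so (K−1)(p+1) = 6 parameters):

```python
import math

# Multinomial logistic regression:
#   Pr(G=k|X=x) = exp(b_k0 + b_k^T x) / (1 + sum_l exp(b_l0 + b_l^T x)), k = 1..K-1
#   Pr(G=K|X=x) = 1 / (1 + sum_l exp(b_l0 + b_l^T x))
# Coefficient values are invented for illustration.

def multinomial_probs(x, betas):
    """betas: list of K-1 vectors [b_k0, b_k1, ..., b_kp]."""
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)        # the reference class K
    return probs

betas = [[0.5, 1.0, -1.0], [-0.2, 0.3, 0.4]]   # (K-1)(p+1) = 6 parameters
p = multinomial_probs([1.0, 2.0], betas)
print(p, sum(p))                     # K probabilities summing to 1
```

The shared denominator is what guarantees the K posteriors sum to one, and the boundary between any two classes is where their (linear) scores are equal.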
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
  - Parameter Estimation for LR
- Generative vs. Discriminative
Parameter Estimation for LR → MLE from the data

- RECAP: linear regression → least squares
- Logistic regression → maximum likelihood estimation
MLE for Logistic Regression Training

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: (xi, yi), i = 1, …, N. The xi are (p+1)-dimensional input vectors with leading entry 1; β is a (p+1)-dimensional vector.

(Conditional) log-likelihood:

l(β) = Σ_{i=1}^{N} log Pr(Y = yi | X = xi)
     = Σ_{i=1}^{N} [ yi log Pr(Y=1|X=xi) + (1 − yi) log Pr(Y=0|X=xi) ]
     = Σ_{i=1}^{N} [ yi log( exp(β^T xi) / (1 + exp(β^T xi)) ) + (1 − yi) log( 1 / (1 + exp(β^T xi)) ) ]
     = Σ_{i=1}^{N} [ yi β^T xi − log(1 + exp(β^T xi)) ]

(For a Bernoulli distribution: p(y|x) = p^y (1 − p)^{1−y}.)

We want to maximize the log-likelihood in order to estimate β.
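The last line of the derivation, l(β) = Σ_i [ yi β^T xi − log(1 + exp(β^T xi)) ], is straightforward to compute. A sketch on a tiny invented dataset (each xi carries a leading 1 for the intercept):

```python
import math

# Conditional log-likelihood for binary logistic regression:
#   l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].
# The dataset below is invented for illustration.

def log_likelihood(beta, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, xi))     # beta^T x_i
        total += yi * z - math.log(1.0 + math.exp(z))
    return total

X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
print(log_likelihood([0.0, 0.0], X, y))   # 3 * -log(2), about -2.0794
print(log_likelihood([0.5, 1.0], X, y))   # larger: this beta fits better
```

At β = 0 every example contributes −log 2, since the model then predicts probability 1/2 for everything; MLE searches for the β that drives this sum as high as possible.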
How?

Newton-Raphson for LR (optional)

Setting the gradient of l(β) = Σ_{i=1}^{N} log Pr(Y = yi | X = xi) to zero:

∂l(β)/∂β = Σ_{i=1}^{N} xi ( yi − exp(β^T xi) / (1 + exp(β^T xi)) ) = 0

These are (p+1) non-linear equations to solve for (p+1) unknowns.

Solve by the Newton-Raphson method:

β^new ← β^old − [ ∂²l(β) / ∂β∂β^T ]^{−1} ∂l(β)/∂β

where

∂²l(β) / ∂β∂β^T = − Σ_{i=1}^{N} xi xi^T ( exp(β^T xi) / (1 + exp(β^T xi)) ) ( 1 / (1 + exp(β^T xi)) )
                = − Σ_{i=1}^{N} xi xi^T p(xi; β) (1 − p(xi; β))
Newton-Raphson for LR…

Newton-Raphson minimizes a quadratic approximation to the function we are really interested in.

In matrix notation:

∂l(β)/∂β = Σ_{i=1}^{N} ( yi − exp(β^T xi) / (1 + exp(β^T xi)) ) xi = X^T (y − p)

∂²l(β)/∂β∂β^T = − X^T W X

So the NR rule becomes:

β^new ← β^old + (X^T W X)^{−1} X^T (y − p)

where
- X: N×(p+1) matrix of the xi
- y: N×1 vector of the yi
- p: N×1 vector of p(xi; β^old) = exp(β^old T xi) / (1 + exp(β^old T xi))
- W: N×N diagonal matrix with entries p(xi; β^old)(1 − p(xi; β^old))
Newton-Raphson for LR…

- Re-expressing the Newton step as a weighted least squares step:

β^new = β^old + (X^T W X)^{−1} X^T (y − p)
      = (X^T W X)^{−1} X^T W ( X β^old + W^{−1}(y − p) )
      = (X^T W X)^{−1} X^T W z

- Adjusted response: z = X β^old + W^{−1}(y − p)

- Iteratively reweighted least squares (IRLS):

β^new ← argmin_β (z − Xβ)^T W (z − Xβ)
β^new ← argmin_β (y − p)^T W^{−1} (y − p)
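The Newton/IRLS update above fits in a few lines of numpy. A minimal sketch, with a toy dataset invented for illustration (in practice one would also add a convergence check and guard against perfectly separable data, where the update diverges):

```python
import numpy as np

# Newton-Raphson / IRLS for logistic regression:
#   beta_new = beta_old + (X^T W X)^{-1} X^T (y - p).
# The simulated data below is invented for illustration.

def irls_logistic(X, y, n_iter=15):
    """X: N x (p+1) design matrix with a leading column of ones; y: N-vector of 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # p(x_i; beta)
        W = np.diag(p * (1.0 - p))              # N x N diagonal weight matrix
        # Newton step; equivalently a weighted least-squares solve
        # on the adjusted response z = X beta + W^{-1}(y - p).
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(irls_logistic(X, y))   # should roughly recover [-0.5, 2.0]
```

Each iteration is exactly one weighted least-squares solve, which is why the method is called iteratively reweighted least squares: both W and the working response change as β changes.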
Logistic Regression

Task:                classification
Representation:      log-odds = linear function of X's
Score Function:      EPE, with conditional log-likelihood
Search/Optimization: iterative (Newton) method
Models, Parameters:  logistic weights, P(c=1|x) = e^{α+βx} / (1 + e^{α+βx})
Today: Generative vs. Discriminative

- Why Bayes Classification – MAP Rule?
  - Empirical Prediction Error
  - 0-1 Loss function for Bayes Classifier
- Logistic regression
- Generative vs. Discriminative
Discriminative vs. Generative

- Generative approach: model the joint distribution p(X, C) using p(X|C = ck) and the class prior p(C = ck)
- Discriminative approach: model the conditional distribution p(C|X) directly
Discriminative vs. Generative

[Figure: LDA vs. Logistic Regression – generative side fits a Gaussian per class over Height, discriminative side models the posterior Pr directly with logistic regression]
Discriminative vs. Generative

- Definitions:
  - h_gen and h_dis: generative and discriminative classifiers
  - h_gen,inf and h_dis,inf: the same classifiers, but trained on the entire population (asymptotic classifiers)
  - As n → infinity, h_gen → h_gen,inf and h_dis → h_dis,inf

Ng, A., and M. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.
Discriminative vs. Generative

Proposition 1:
Proposition 2:
- p: number of dimensions
- n: number of observations
- ϵ: generalization error
Logistic Regression vs. NBC

- Discriminative classifier (Logistic Regression):
  - Smaller asymptotic error
  - Slow convergence ~ O(p)
- Generative classifier (Naive Bayes):
  - Larger asymptotic error
  - Can handle missing data (EM)
  - Fast convergence ~ O(lg(p))

In numerical analysis, the speed at which a convergent sequence approaches its limit is called the rate of convergence.

[Figure: generalization error vs. size of training set – Naive Bayes converges faster, Logistic Regression reaches a lower asymptotic error]

Ng, A., and M. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.
[Figure: generalization error vs. size of training set, from the comment paper below]

Xue, Jing-Hao, and D. Michael Titterington. "Comment on 'On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes'." Neural Processing Letters 28.3 (2008): 169-187.
Discriminative vs. Generative

- Empirically, generative classifiers approach their asymptotic error faster than discriminative ones
  - Good for small training sets
  - Handle missing data well (EM)
- Empirically, discriminative classifiers have lower asymptotic error than generative ones
  - Good for larger training sets
References

- Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Prof. Ke Chen's NB slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2, No. 1. New York: Springer, 2009.