Post on 20-Dec-2015
Data mining and statistical learning, lecture 5
Outline
Summary of regressions on correlated inputs
- Ridge regression
- PCR (principal components regression)
- PLS (partial least squares regression)
- Model selection using cross-validation

Linear classification models
- Logistic regression
- Regression on indicator functions
- Linear discriminant analysis (LDA)
Ridge regression
The ridge regression coefficients minimize a penalized residual sum of squares:
$$\hat{\beta}^{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

or, equivalently,

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Normally, inputs are centred prior to the estimation of regression coefficients
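The penalized form above has a closed-form solution. The following sketch (synthetic data and an illustrative lambda value, not from the lecture's software) solves the penalized normal equations on centred inputs:

```python
import numpy as np

# Sketch of ridge regression via the penalized normal equations:
# beta_ridge = (X'X + lambda*I)^-1 X'y on centred data; the intercept
# is then the mean of y. Data and lambda are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 1.0
Xc = X - X.mean(axis=0)          # centred inputs
yc = y - y.mean()                # centred response
p = Xc.shape[1]
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y.mean()

# With lambda = 0 this reduces to ordinary least squares;
# the ridge coefficient vector is shrunk relative to the OLS one.
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
```

Setting `lam = 0` recovers ordinary least squares, which makes the effect of the penalty easy to inspect.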
Regression methods using derived input directions
- Partial Least Squares Regression
Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of these features
[Diagram: inputs x1, x2, …, xp are combined into derived features z1, z2, …, zM, which in turn feed a linear model for the response y]
Select the intermediates so that the covariance with the response variable is maximized
Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis
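A minimal sketch of extracting the first PLS direction (synthetic data assumed for illustration): after standardization, the weight vector proportional to X'y maximizes the covariance of the derived feature z = Xw with the response.

```python
import numpy as np

# Sketch of the first PLS component: the direction w proportional to
# X'y maximizes the covariance of z = Xw with the response among unit
# vectors. Data below are an illustrative assumption.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=40)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized inputs
yc = y - y.mean()

w = Xs.T @ yc
w = w / np.linalg.norm(w)                   # first PLS direction
z1 = Xs @ w                                 # first derived feature
theta1 = (z1 @ yc) / (z1 @ z1)              # regress y on z1
```

Subsequent PLS components would be extracted the same way after deflating X and y, which is omitted here for brevity.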
PLS vs PCR
- absorbance records for chopped meat
[Figure: percent variation accounted for (0 to 120) plotted against the number of factors (1 to 10), with one curve for PLS and one for PCR]

In general, PLS models have fewer factors than PCR models
Common characteristics of ridge regression, PCR, and PLS
Ridge regression, PCR, and PLS can all handle high-dimensional inputs
In contrast to ordinary least squares regression, the cited methods can be used for prediction even if the number of inputs (x-variables) exceeds the number of cases
For minimizing prediction error, ridge regression, PCR, and PLS are generally preferable to variable subset selection in ordinary least squares regression
Behaviour of ridge regression, PCR, and PLS
Ridge regression, PCR, and PLS tend to behave similarly
Ridge regression shrinks all directions, but shrinks low-variance directions more
Principal components regression leaves M high-variance directions alone, and discards the rest
Partial least squares regression tends to shrink the low-variance directions, but may inflate some of the higher variance directions
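The differing shrinkage behaviour can be made concrete for ridge regression: along the j-th principal direction of the inputs, the fitted values are shrunk by the factor d_j^2 / (d_j^2 + lambda), where d_j is the j-th singular value. The sketch below (synthetic data and an assumed lambda) computes these factors.

```python
import numpy as np

# Illustration (not from the slides) of ridge's direction-wise shrinkage:
# the j-th principal direction is shrunk by d_j^2 / (d_j^2 + lambda),
# so low-variance directions (small d_j) are shrunk the most.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.2])  # unequal variances
Xc = X - X.mean(axis=0)

d = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
lam = 1.0
shrinkage = d**2 / (d**2 + lam)             # close to 1 for high-variance directions
```

PCR corresponds to replacing these factors by 1 for the first M directions and 0 for the rest.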
Model selection: ordinary cross-validation
For each model, do the following:
(i) Fit the model to the training set of inputs and responses
(ii) Use the fitted model to predict the response value in the test set and compute the prediction error
Select the model that produces the smallest PRESS-value
[Data layout: rows (x_{i1}, x_{i2}, …, x_{ip}, y_i); one subset of rows is set aside as the test set and the remaining rows form the training set]

PRESS = Prediction Error Sum of Squares
Model selection: leave-one-out cross-validation
For each model, do the following:
(i) Leave out one case and fit the model to the remaining data
(ii) Use the fitted model to predict the response value in the case that was left out and compute the prediction error
(iii) Repeat steps (i) and (ii) for all cases and compute the PRESS-value (prediction error sum of squares)
Select the model that produces the smallest PRESS-value
[Data layout: rows (x_{i1}, x_{i2}, …, x_{ip}, y_i); case j is left out in turn as the single test case]

PRESS = Prediction Error Sum of Squares
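For least squares models the leave-one-out PRESS can be computed without refitting, using the hat matrix H = X(X'X)^-1 X'. The sketch below (synthetic data assumed for illustration) uses this shortcut and verifies it against explicit refitting:

```python
import numpy as np

# Leave-one-out PRESS for a linear model: the leave-one-out residual is
# e_i / (1 - h_ii), where h_ii are the diagonal elements of the hat matrix.
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # ordinary residuals
press = np.sum((e / (1 - np.diag(H)))**2)

# Brute-force check: refit the model with each case left out in turn
press_slow = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    press_slow += (y[i] - X[i] @ beta)**2
```

The identity is exact for ordinary least squares, so both computations agree to machine precision.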
Model selection: K-fold (block) cross-validation
Divide the data set into m blocks of size K and do the following for each model:
(i) Leave out one block of cases and fit the model to the remaining data
(ii) Use the fitted model to predict the response values in the block that was left out and compute the sum of squared prediction errors
(iii) Repeat steps (i) and (ii) for all blocks and compute the PRESS-value (prediction error sum of squares)
Select the model that produces the smallest PRESS-value
[Data layout: the n = mK rows (x_{i1}, x_{i2}, …, x_{ip}, y_i) are divided into Block 1, …, Block m of K consecutive rows each; block j is left out in turn]
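The block scheme above can be sketched as follows for a least squares model (synthetic data; the layout follows the slide's convention of m blocks of size K):

```python
import numpy as np

# Block (K-fold) cross-validation PRESS: each of the m blocks of K
# consecutive rows is left out in turn. Data are an illustrative assumption.
rng = np.random.default_rng(4)
m, K = 5, 10
n = m * K
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=n)

press = 0.0
for j in range(m):
    test = np.arange(j * K, (j + 1) * K)        # block j is left out
    train = np.setdiff1d(np.arange(n), test)
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    press += np.sum((y[test] - X[test] @ beta)**2)
```

In practice the rows are often shuffled before forming blocks so that each block is representative of the whole data set.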
Classification
The task of assigning objects to one of several predefined categories
Detecting spam among e-mails
Credit scoring
Classifying tumours as malignant or benign
Customer relations management
- an example
Consider a database in which 2470 customers have been registered
For each customer the enterprise has recorded a binary response variable Y (Y = 1: multiple purchases, Y = 0: single purchase) and several predictors
We shall model the probability that Y = 1.
Y  Installment  First_amount_spent  No._products  Age51_89  Age36_50  Age15_35  Sex  North  Central  South_and_islands
0  0            520000              0             0         0         1         0    0      0        1
0  1            1484000             2             0         1         0        1    0      0        1
0  0            2459000             1             1         0         0        1    0      0        1
0  0            3389000             0             0         1         0        1    0      1        0
Logistic regression for a binary response variable Y
- single input
$$p = P(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$

$$\log\frac{p}{1-p} = \log\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = \beta_0 + \beta_1 x$$

[Figure: P(Y = 1) plotted against x from 0 to 3; the curve rises from 0 towards 1]
The log of the odds ratio is linear in x
Logistic regression of multiple purchases
vs first amount spent
[Figure: observed binary response and estimated event probability (0 to 1) plotted against first amount spent (0 to 7000)]
Logistic regression of multiple purchases vs first amount spent
- inference from a model comprising a single input
Response Information
Variable Value Count
Multiple_purchases 1 34 (Event)
0 66
Total 100
Logistic Regression Table
Odds 95% CI
Predictor Coef SE Coef Z P Ratio Lower Upper
Constant -2.50310 0.450895 -5.55 0.000
First_amount_spent 0.0014381 0.0003063 4.69 0.000 1.00 1.00 1.00
Log-Likelihood = -43.215
Test that all slopes are zero: G = 41.776, DF = 1, P-Value = 0.000
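The fitted coefficients above determine the estimated event probability at any given first amount spent. This sketch evaluates it (the x values are chosen for illustration):

```python
import math

# Estimated coefficients from the output above:
# Constant = -2.50310, First_amount_spent = 0.0014381.
# The fitted event probability is p = exp(eta) / (1 + exp(eta)).
b0, b1 = -2.50310, 0.0014381

def event_probability(x):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

p_low = event_probability(1000)    # modest first purchase
p_high = event_probability(5000)   # large first purchase
```

As the positive slope suggests, the estimated probability of multiple purchases increases with the first amount spent.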
Logistic regression for a binary response variable
- multiple inputs
Consider a binary response variable Y
Set p = P(Y = 1)
Assume that the log odds ratio
is a linear function of m predictors x1, …, xm
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}$$

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$$
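Such models are commonly fitted by Newton-Raphson (iteratively reweighted least squares). The sketch below (synthetic data; an illustrative optimizer, not the lecture's software) fits a two-input logistic model:

```python
import numpy as np

# Newton-Raphson (IRLS) for logistic regression; X includes a column of
# ones for the intercept. Data are a synthetic assumption for illustration.
rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -1.5])
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)

beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                       # IRLS weights
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information
    beta = beta + np.linalg.solve(hess, grad)
```

The iteration typically converges in a handful of steps when the classes are not linearly separable.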
Logistic regression
- inference from a model comprising two inputs
Binary Logistic Regression: RestingPulse versus Smokes, Weight
Variable Value Count
RestingPulse Low 70 (Event)
High 22
Total 92
Factor Information
Factor Levels Values
Smokes 2 No, Yes
Logistic Regression Table
Odds 95% CI
Predictor Coef SE Coef Z P Ratio Lower Upper
Constant -1.98717 1.67930 -1.18 0.237
SmokesYes -1.19297 0.552980 -2.16 0.031 0.30 0.10 0.90
Weight 0.0250226 0.0122551 2.04 0.041 1.03 1.00 1.05
The estimated coefficient -1.19297 represents the change in the log of P(low pulse)/P(high pulse) when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant
The odds of smokers in the sample having a low pulse are 30% of the odds of non-smokers having a low pulse.
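The reported odds ratio follows directly from the coefficient:

```python
import math

# The odds ratio for SmokesYes is the exponential of its coefficient:
# exp(-1.19297) is approximately 0.30, matching the table above.
odds_ratio = math.exp(-1.19297)
```

The same relationship holds for Weight: exp(0.0250226) gives the reported odds ratio of about 1.03 per unit of weight.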
Logistic regression for an ordinal response variable Y
$$P(Y = 1 \mid X = x) = \frac{\exp(\beta_{10} + \beta_{11} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

$$P(Y = 2 \mid X = x) = \frac{\exp(\beta_{20} + \beta_{21} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

$$P(Y = 3 \mid X = x) = \frac{1}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5]
Logistic regression for an ordinal response variable Y
$$\log\frac{P(Y = 1 \mid X = x)}{P(Y = 3 \mid X = x)} = \beta_{10} + \beta_{11} x$$

$$\log\frac{P(Y = 2 \mid X = x)}{P(Y = 3 \mid X = x)} = \beta_{20} + \beta_{21} x$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5]
Classification using logistic regression
Assign the object to the class k that maximizes
$$P(Y = k \mid X = x)$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5; each x is assigned to the class with the largest probability]
Regression of an indicator matrix
[Figure: scatter plot of Class 1 and Class 2 objects in the (x1, x2) plane]

Find a linear function

$$\hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2$$

which is (on average) one for objects in class 1 and otherwise (on average) zero

Find a second linear function

$$\hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2$$

which is (on average) one for objects in class 2 and otherwise (on average) zero

Assign a new object to class 1 if

$$\hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2)$$
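The scheme above can be sketched in a few lines (synthetic two-class data assumed for illustration): regress each class indicator on the inputs, then classify each object by the larger fitted value.

```python
import numpy as np

# Regression on an indicator matrix: one least-squares fit per class
# indicator column, then classification by the largest fitted function.
rng = np.random.default_rng(6)
n = 60
X1 = rng.normal(loc=[3.0, 4.0], scale=1.0, size=(n, 2))   # class 1
X2 = rng.normal(loc=[7.0, 12.0], scale=1.0, size=(n, 2))  # class 2
X = np.vstack([X1, X2])
labels = np.array([0] * n + [1] * n)

# Indicator matrix Y: column k is one for objects in class k+1, else zero
Y = np.column_stack([(labels == 0).astype(float), (labels == 1).astype(float)])
Xd = np.column_stack([np.ones(2 * n), X])                 # add intercept
B = np.linalg.lstsq(Xd, Y, rcond=None)[0]                 # one fit per column

fitted = Xd @ B
predicted = np.argmax(fitted, axis=1)
accuracy = np.mean(predicted == labels)
```

With two classes, comparing the two fitted functions is equivalent to thresholding their difference, which defines the linear discriminating function shown on the next slide.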
3D-plot of an indicator matrix for class 1
[Figure: 3D scatterplot of Class_1 vs x2 vs x1; the indicator Class_1 takes the values 0 and 1]
3D-plot of an indicator matrix for class 2
[Figure: 3D scatterplot of Class_2 vs x2 vs x1; the indicator Class_2 takes the values 0 and 1]
Regression of an indicator matrix
- discriminating function
[Figure: Class 1 and Class 2 objects in the (x1, x2) plane, separated by the discriminating line]
Regression of an indicator matrix
- discriminating function
[Figure: Class 1, Class 2, and Class 3 objects in the (x1, x2) plane]

Estimate discriminant functions $\delta_k(x)$ for each class, and then classify a new object to the class with the largest value of its discriminant function
Linear discriminant analysis (LDA)
LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix
[Figure: scatter plot of Class 1, Class 2, and Class 3 objects in the (x1, x2) plane, separated by linear decision boundaries]
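A minimal LDA sketch under these assumptions (synthetic Gaussian data with a pooled covariance estimate; all names are illustrative): each class k gets a linear discriminant delta_k(x) = x' S^-1 mu_k - mu_k' S^-1 mu_k / 2 + log(pi_k), and a new object is assigned to the class with the largest delta_k.

```python
import numpy as np

# LDA with a pooled (common) within-class covariance matrix, matching the
# Gaussian common-covariance assumption under which LDA is optimal.
rng = np.random.default_rng(7)
n = 100
X1 = rng.normal(loc=[4.0, 6.0], size=(n, 2))   # class 1
X2 = rng.normal(loc=[8.0, 12.0], size=(n, 2))  # class 2
X = np.vstack([X1, X2])
labels = np.array([0] * n + [1] * n)

means = np.array([X1.mean(axis=0), X2.mean(axis=0)])
priors = np.array([0.5, 0.5])
# Pooled within-class covariance estimate
S = ((X1 - means[0]).T @ (X1 - means[0]) +
     (X2 - means[1]).T @ (X2 - means[1])) / (2 * n - 2)
S_inv = np.linalg.inv(S)

def discriminants(x):
    # Linear discriminant score for each class
    return np.array([x @ S_inv @ m - 0.5 * m @ S_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

predicted = np.array([np.argmax(discriminants(x)) for x in X])
accuracy = np.mean(predicted == labels)
```

Because the covariance matrix is shared across classes, the quadratic terms cancel and the decision boundaries are linear in x.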
Software recommendation
SAS Proc DISCRIM
Proc DISCRIM data=mining.lda;
  CLASS class;
  VAR x1 x2;
Run;