Post on 20-Dec-2015
Data mining and statistical learning, lecture 5
Outline
Summary of regressions on correlated inputs
- Ridge regression
- PCR (principal components regression)
- PLS (partial least squares regression)
- Model selection using cross-validation

Linear classification models
- Logistic regression
- Regression on indicator functions
- Linear discriminant analysis (LDA)
Ridge regression
The ridge regression coefficients minimize a penalized residual sum of squares:
$$\hat{\beta}^{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

or, equivalently,

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Normally, inputs are centred prior to the estimation of regression coefficients
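The penalized form above has a closed-form solution. The following sketch (synthetic data and an illustrative lambda value, not from the lecture's software) solves the penalized normal equations on centred inputs:

```python
import numpy as np

# Sketch of ridge regression via the penalized normal equations:
# beta_ridge = (X'X + lambda*I)^-1 X'y on centred data; the intercept
# is then the mean of y. Data and lambda are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 1.0
Xc = X - X.mean(axis=0)          # centred inputs
yc = y - y.mean()                # centred response
p = Xc.shape[1]
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y.mean()

# With lambda = 0 this reduces to ordinary least squares;
# the ridge coefficient vector is shrunk relative to the OLS one.
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
```

Setting `lam = 0` recovers ordinary least squares, which makes the effect of the penalty easy to inspect.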
Regression methods using derived input directions
- Partial Least Squares Regression
Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of these features
[Diagram: inputs x1, x2, …, xp are combined into derived features z1, z2, …, zM, which in turn feed a linear model for the response y]
Select the intermediates so that the covariance with the response variable is maximized
Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis
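A minimal sketch of extracting the first PLS direction (synthetic data assumed for illustration): after standardization, the weight vector proportional to X'y maximizes the covariance of the derived feature z = Xw with the response.

```python
import numpy as np

# Sketch of the first PLS component: the direction w proportional to
# X'y maximizes the covariance of z = Xw with the response among unit
# vectors. Data below are an illustrative assumption.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=40)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized inputs
yc = y - y.mean()

w = Xs.T @ yc
w = w / np.linalg.norm(w)                   # first PLS direction
z1 = Xs @ w                                 # first derived feature
theta1 = (z1 @ yc) / (z1 @ z1)              # regress y on z1
```

Subsequent PLS components would be extracted the same way after deflating X and y, which is omitted here for brevity.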
PLS vs PCR
- absorbance records for chopped meat
[Figure: percent variation accounted for (0 to 120) plotted against the number of factors (1 to 10), with one curve for PLS and one for PCR]

In general, PLS models have fewer factors than PCR models
Common characteristics of ridge regression, PCR, and PLS
Ridge regression, PCR, and PLS can all handle high-dimensional inputs
In contrast to ordinary least squares regression, the cited methods can be used for prediction even if the number of inputs (x-variables) exceeds the number of cases
For minimizing prediction error, ridge regression, PCR, and PLS are generally preferable to variable subset selection in ordinary least squares regression
Behaviour of ridge regression, PCR, and PLS
Ridge regression, PCR, and PLS tend to behave similarly
Ridge regression shrinks all directions, but shrinks low-variance directions more
Principal components regression leaves M high-variance directions alone, and discards the rest
Partial least squares regression tends to shrink the low-variance directions, but may inflate some of the higher variance directions
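The differing shrinkage behaviour can be made concrete for ridge regression: along the j-th principal direction of the inputs, the fitted values are shrunk by the factor d_j^2 / (d_j^2 + lambda), where d_j is the j-th singular value. The sketch below (synthetic data and an assumed lambda) computes these factors.

```python
import numpy as np

# Illustration (not from the slides) of ridge's direction-wise shrinkage:
# the j-th principal direction is shrunk by d_j^2 / (d_j^2 + lambda),
# so low-variance directions (small d_j) are shrunk the most.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.2])  # unequal variances
Xc = X - X.mean(axis=0)

d = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
lam = 1.0
shrinkage = d**2 / (d**2 + lam)             # close to 1 for high-variance directions
```

PCR corresponds to replacing these factors by 1 for the first M directions and 0 for the rest.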
Model selection: ordinary cross-validation
For each model, do the following:
(i) Fit the model to the training set of inputs and responses
(ii) Use the fitted model to predict the response value in the test set and compute the prediction error
Select the model that produces the smallest PRESS-value
[Data layout: rows (x_{i1}, x_{i2}, …, x_{ip}, y_i); one subset of rows is set aside as the test set and the remaining rows form the training set]

PRESS = Prediction Error Sum of Squares
Model selection: leave-one-out cross-validation
For each model, do the following:
(i) Leave out one case and fit the model to the remaining data
(ii) Use the fitted model to predict the response value in the case that was left out and compute the prediction error
(iii) Repeat steps (i) and (ii) for all cases and compute the PRESS-value (prediction error sum of squares)
Select the model that produces the smallest PRESS-value
[Data layout: rows (x_{i1}, x_{i2}, …, x_{ip}, y_i); case j is left out in turn as the single test case]

PRESS = Prediction Error Sum of Squares
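For least squares models the leave-one-out PRESS can be computed without refitting, using the hat matrix H = X(X'X)^-1 X'. The sketch below (synthetic data assumed for illustration) uses this shortcut and verifies it against explicit refitting:

```python
import numpy as np

# Leave-one-out PRESS for a linear model: the leave-one-out residual is
# e_i / (1 - h_ii), where h_ii are the diagonal elements of the hat matrix.
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # ordinary residuals
press = np.sum((e / (1 - np.diag(H)))**2)

# Brute-force check: refit the model with each case left out in turn
press_slow = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    press_slow += (y[i] - X[i] @ beta)**2
```

The identity is exact for ordinary least squares, so both computations agree to machine precision.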
Model selection: K-fold (block) cross-validation
Divide the data set into m blocks of size K and do the following for each model:
(i) Leave out one block of cases and fit the model to the remaining data
(ii) Use the fitted model to predict the response values in the block that was left out and compute the sum of squared prediction errors
(iii) Repeat steps (i) and (ii) for all blocks and compute the PRESS-value (prediction error sum of squares)
Select the model that produces the smallest PRESS-value
[Data layout: the n = mK rows (x_{i1}, x_{i2}, …, x_{ip}, y_i) are divided into Block 1, …, Block m of K consecutive rows each; block j is left out in turn]
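The block scheme above can be sketched as follows for a least squares model (synthetic data; the layout follows the slide's convention of m blocks of size K):

```python
import numpy as np

# Block (K-fold) cross-validation PRESS: each of the m blocks of K
# consecutive rows is left out in turn. Data are an illustrative assumption.
rng = np.random.default_rng(4)
m, K = 5, 10
n = m * K
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=n)

press = 0.0
for j in range(m):
    test = np.arange(j * K, (j + 1) * K)        # block j is left out
    train = np.setdiff1d(np.arange(n), test)
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    press += np.sum((y[test] - X[test] @ beta)**2)
```

In practice the rows are often shuffled before forming blocks so that each block is representative of the whole data set.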
Classification
The task of assigning objects to one of several predefined categories
Detecting spam among e-mails
Credit scoring
Classifying tumours as malignant or benign
Customer relations management
- an example
Consider a database in which 2470 customers have been registered
For each customer the enterprise has recorded a binary response variable Y (Y = 1: multiple purchases, Y = 0: single purchase) and several predictors
We shall model the probability that Y = 1.
Y  Installment  First_amount_spent  No._products  Age51_89  Age36_50  Age15_35  Sex  North  Central  South_and_islands
0  0            520000              0             0         0         1         0    0      0        1
0  1            1484000             2             0         1         0        1    0      0        1
0  0            2459000             1             1         0         0        1    0      0        1
0  0            3389000             0             0         1         0        1    0      1        0
Logistic regression for a binary response variable Y
- single input
$$p = P(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$

$$\log\frac{p}{1-p} = \log\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = \beta_0 + \beta_1 x$$

[Figure: P(Y = 1) plotted against x from 0 to 3; the curve rises from 0 towards 1]
The log of the odds ratio is linear in x
Logistic regression of multiple purchases
vs first amount spent
[Figure: observed binary response and estimated event probability (0 to 1) plotted against first amount spent (0 to 7000)]
Logistic regression of multiple purchases vs first amount spent
- inference from a model comprising a single input
Response Information
Variable Value Count
Multiple_purchases 1 34 (Event)
0 66
Total 100
Logistic Regression Table
Odds 95% CI
Predictor Coef SE Coef Z P Ratio Lower Upper
Constant -2.50310 0.450895 -5.55 0.000
First_amount_spent 0.0014381 0.0003063 4.69 0.000 1.00 1.00 1.00
Log-Likelihood = -43.215
Test that all slopes are zero: G = 41.776, DF = 1, P-Value = 0.000
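The fitted coefficients above determine the estimated event probability at any given first amount spent. This sketch evaluates it (the x values are chosen for illustration):

```python
import math

# Estimated coefficients from the output above:
# Constant = -2.50310, First_amount_spent = 0.0014381.
# The fitted event probability is p = exp(eta) / (1 + exp(eta)).
b0, b1 = -2.50310, 0.0014381

def event_probability(x):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

p_low = event_probability(1000)    # modest first purchase
p_high = event_probability(5000)   # large first purchase
```

As the positive slope suggests, the estimated probability of multiple purchases increases with the first amount spent.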
Logistic regression for a binary response variable
- multiple inputs
Consider a binary response variable Y
Set p = P(Y = 1)
Assume that the log odds ratio
is a linear function of m predictors x1, …, xm
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}$$

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$$
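Such models are commonly fitted by Newton-Raphson (iteratively reweighted least squares). The sketch below (synthetic data; an illustrative optimizer, not the lecture's software) fits a two-input logistic model:

```python
import numpy as np

# Newton-Raphson (IRLS) for logistic regression; X includes a column of
# ones for the intercept. Data are a synthetic assumption for illustration.
rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -1.5])
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)

beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                       # IRLS weights
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information
    beta = beta + np.linalg.solve(hess, grad)
```

The iteration typically converges in a handful of steps when the classes are not linearly separable.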
Logistic regression
- inference from a model comprising two inputs
Binary Logistic Regression: RestingPulse versus Smokes, Weight
Variable Value Count
RestingPulse Low 70 (Event)
High 22
Total 92
Factor Information
Factor Levels Values
Smokes 2 No, Yes
Logistic Regression Table
Odds 95% CI
Predictor Coef SE Coef Z P Ratio Lower Upper
Constant -1.98717 1.67930 -1.18 0.237
SmokesYes -1.19297 0.552980 -2.16 0.031 0.30 0.10 0.90
Weight 0.0250226 0.0122551 2.04 0.041 1.03 1.00 1.05
The estimated coefficient -1.19297 represents the change in the log of P(low pulse)/P(high pulse) when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant
The odds of smokers in the sample having a low pulse are 30% of the odds of non-smokers having a low pulse.
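The reported odds ratio follows directly from the coefficient:

```python
import math

# The odds ratio for SmokesYes is the exponential of its coefficient:
# exp(-1.19297) is approximately 0.30, matching the table above.
odds_ratio = math.exp(-1.19297)
```

The same relationship holds for Weight: exp(0.0250226) gives the reported odds ratio of about 1.03 per unit of weight.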
Logistic regression for an ordinal response variable Y
$$P(Y = 1 \mid X = x) = \frac{\exp(\beta_{10} + \beta_{11} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

$$P(Y = 2 \mid X = x) = \frac{\exp(\beta_{20} + \beta_{21} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

$$P(Y = 3 \mid X = x) = \frac{1}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)}$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5]
Logistic regression for an ordinal response variable Y
$$\log\frac{P(Y = 1 \mid X = x)}{P(Y = 3 \mid X = x)} = \beta_{10} + \beta_{11} x$$

$$\log\frac{P(Y = 2 \mid X = x)}{P(Y = 3 \mid X = x)} = \beta_{20} + \beta_{21} x$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5]
Classification using logistic regression
Assign the object to the class k that maximizes
$$P(Y = k \mid X = x)$$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x from 0 to 5; each x is assigned to the class with the largest probability]
Regression of an indicator matrix
[Figure: scatter plot of Class 1 and Class 2 objects in the (x1, x2) plane]

Find a linear function

$$\hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2$$

which is (on average) one for objects in class 1 and otherwise (on average) zero

Find a second linear function

$$\hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2$$

which is (on average) one for objects in class 2 and otherwise (on average) zero

Assign a new object to class 1 if

$$\hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2)$$
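The scheme above can be sketched in a few lines (synthetic two-class data assumed for illustration): regress each class indicator on the inputs, then classify each object by the larger fitted value.

```python
import numpy as np

# Regression on an indicator matrix: one least-squares fit per class
# indicator column, then classification by the largest fitted function.
rng = np.random.default_rng(6)
n = 60
X1 = rng.normal(loc=[3.0, 4.0], scale=1.0, size=(n, 2))   # class 1
X2 = rng.normal(loc=[7.0, 12.0], scale=1.0, size=(n, 2))  # class 2
X = np.vstack([X1, X2])
labels = np.array([0] * n + [1] * n)

# Indicator matrix Y: column k is one for objects in class k+1, else zero
Y = np.column_stack([(labels == 0).astype(float), (labels == 1).astype(float)])
Xd = np.column_stack([np.ones(2 * n), X])                 # add intercept
B = np.linalg.lstsq(Xd, Y, rcond=None)[0]                 # one fit per column

fitted = Xd @ B
predicted = np.argmax(fitted, axis=1)
accuracy = np.mean(predicted == labels)
```

With two classes, comparing the two fitted functions is equivalent to thresholding their difference, which defines the linear discriminating function shown on the next slide.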
3D-plot of an indicator matrix for class 1
[Figure: 3D scatterplot of Class_1 vs x2 vs x1; the indicator Class_1 takes the values 0 and 1]
3D-plot of an indicator matrix for class 2
[Figure: 3D scatterplot of Class_2 vs x2 vs x1; the indicator Class_2 takes the values 0 and 1]
Regression of an indicator matrix
- discriminating function
[Figure: Class 1 and Class 2 objects in the (x1, x2) plane, separated by the discriminating line]
Regression of an indicator matrix
- discriminating function
[Figure: Class 1, Class 2, and Class 3 objects in the (x1, x2) plane]

Estimate discriminant functions $\delta_k(x)$ for each class, and then classify a new object to the class with the largest value of its discriminant function
Linear discriminant analysis (LDA)
LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix
[Figure: scatter plot of Class 1, Class 2, and Class 3 objects in the (x1, x2) plane, separated by linear decision boundaries]
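A minimal LDA sketch under these assumptions (synthetic Gaussian data with a pooled covariance estimate; all names are illustrative): each class k gets a linear discriminant delta_k(x) = x' S^-1 mu_k - mu_k' S^-1 mu_k / 2 + log(pi_k), and a new object is assigned to the class with the largest delta_k.

```python
import numpy as np

# LDA with a pooled (common) within-class covariance matrix, matching the
# Gaussian common-covariance assumption under which LDA is optimal.
rng = np.random.default_rng(7)
n = 100
X1 = rng.normal(loc=[4.0, 6.0], size=(n, 2))   # class 1
X2 = rng.normal(loc=[8.0, 12.0], size=(n, 2))  # class 2
X = np.vstack([X1, X2])
labels = np.array([0] * n + [1] * n)

means = np.array([X1.mean(axis=0), X2.mean(axis=0)])
priors = np.array([0.5, 0.5])
# Pooled within-class covariance estimate
S = ((X1 - means[0]).T @ (X1 - means[0]) +
     (X2 - means[1]).T @ (X2 - means[1])) / (2 * n - 2)
S_inv = np.linalg.inv(S)

def discriminants(x):
    # Linear discriminant score for each class
    return np.array([x @ S_inv @ m - 0.5 * m @ S_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

predicted = np.array([np.argmax(discriminants(x)) for x in X])
accuracy = np.mean(predicted == labels)
```

Because the covariance matrix is shared across classes, the quadratic terms cancel and the decision boundaries are linear in x.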
Software recommendation
SAS Proc DISCRIM
Proc DISCRIM data=mining.lda;
  CLASS class;
  VAR x1 x2;
Run;