Linear Methods for Classification
Based on Chapter 4 of Hastie, Tibshirani, and Friedman
David Madigan
Predictive Modeling

Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1,…,cm}: classification
Real-valued y: regression
Note: usually assume {c1,…,cm} are mutually exclusive and exhaustive
Probabilistic Classification
Let p(c_k) = probability that a randomly chosen object comes from class c_k.

Objects from c_k have class-conditional density p(x | c_k, θ_k) (e.g., multivariate normal).

Then, by Bayes' rule: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

Bayes error rate:

p*_B = ∫ (1 − max_k p(c_k | x)) p(x) dx

•Lower bound on the error rate achievable by any classifier
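As a concrete illustration (not from the slides), the posterior and the Bayes error rate can be computed for two one-dimensional Gaussian classes; the means, variance, and priors below are illustrative assumptions.

```python
import numpy as np

mu = np.array([0.0, 2.0])     # class means (assumed)
sigma = 1.0                   # common standard deviation (assumed)
prior = np.array([0.5, 0.5])  # p(c_k)

def posterior(x):
    """p(c_k | x) ∝ p(x | c_k) p(c_k), normalized over k.

    The Gaussian normalizing constant cancels because sigma is shared."""
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) * prior
    return dens / dens.sum(axis=1, keepdims=True)

# Monte Carlo estimate of p*_B = ∫ (1 − max_k p(c_k | x)) p(x) dx
rng = np.random.default_rng(0)
k = rng.choice(2, size=200_000, p=prior)
x = rng.normal(mu[k], sigma)
bayes_err = (1 - posterior(x).max(axis=1)).mean()
print(bayes_err)  # for these parameters, roughly Phi(-1) ≈ 0.159
```

The estimate matches the closed-form answer Φ(−Δ/2) for equal-prior Gaussians separated by Δ standard deviations, which is a useful sanity check.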
[Figure: Bayes error rate about 6%]
Classifier Types
Discriminative: model p(c_k | x)
- e.g., logistic regression, CART

Generative: model p(x | c_k, θ_k)
- e.g., “Bayesian classifiers”, LDA
Regression for Binary Classification
•Can fit a linear regression model to a 0/1 response
•Predicted values are not necessarily between zero and one
[Figure: linear regression fit of y on x for a 0/1 response (data: zeroOneR.txt)]
•With p > 1, the decision boundary is linear, e.g., 0.5 = b0 + b1 x1 + b2 x2
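A minimal sketch of the point above: fit ordinary least squares to a 0/1 response and observe that fitted values escape [0, 1]. The data are synthetic (the original slide's zeroOneR.txt is not available), so the exact numbers are illustrative.

```python
import numpy as np

# Two well-separated clusters labeled 0 and 1 (synthetic stand-in data)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 0.7, 50), rng.normal(1.5, 0.7, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit to the 0/1 response

y_hat = X @ beta
print(y_hat.min(), y_hat.max())               # fitted values stray outside [0, 1]

# Classify as 1 when the fitted value exceeds 0.5; the boundary solves
# 0.5 = b0 + b1 * x, which is linear in x
boundary = (0.5 - beta[0]) / beta[1]
print(boundary)
```

With p > 1 predictors, the same rule 0.5 = b0 + b1 x1 + b2 x2 defines a line (more generally a hyperplane) in predictor space.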
[Figure: scatterplot of the data in the (x1, x2) plane]
Linear Discriminant Analysis

K classes, X an n × p data matrix.

p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)

Could model each class density as multivariate normal:

p(x \mid c_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)

LDA assumes Σ_k = Σ for all k. Then:

\log \frac{p(c_k \mid x)}{p(c_l \mid x)} = \log \frac{p(c_k)}{p(c_l)} - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)
This is linear in x.
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict: \arg\max_k \delta_k(x), where

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log p(c_k)

is the “linear discriminant function”.

If we don’t assume the Σ_k’s are identical, we get Quadratic Discriminant Analysis:

\delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(c_k)
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:

\hat{\mu}_k = \sum_{i: c_i = k} x_i / N_k

\hat{p}(c_k) = N_k / N

\hat{\Sigma} = \sum_{k=1}^{K} \sum_{i: c_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)
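The estimates above can be sketched directly in NumPy. The synthetic two-class data and variable names (`mu_hat`, `Sigma_hat`, etc.) are mine, not from the slides; the formulas are the ones just stated.

```python
import numpy as np

# Synthetic two-class data (assumed for illustration)
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 0.5, size=(40, 2))
X1 = rng.normal([2, 1], 0.5, size=(40, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

N, K = len(y), 2
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])  # class means
prior_hat = np.array([(y == k).mean() for k in range(K)])      # N_k / N

# Pooled covariance: within-class scatter summed over classes, divided by N - K
Sigma_hat = sum(
    (X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K)
) / (N - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x, k):
    """Linear discriminant function delta_k(x)."""
    return (x @ Sigma_inv @ mu_hat[k]
            - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
            + np.log(prior_hat[k]))

pred = np.array([max(range(K), key=lambda k: delta(x, k)) for x in X])
print((pred == y).mean())  # training accuracy
```

Predicting argmax_k δ_k(x) recovers the LDA rule; with well-separated classes the training accuracy should be near 1.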
[Figure: LDA vs. QDA decision boundaries]
LDA (cont.)
•Fisher’s LDA is optimal if the classes are MVN with a common covariance matrix
•Computational complexity: O(m p² n)
Logistic Regression
Note that LDA is linear in x:
\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \log \frac{p(c_k)}{p(c_0)} - \tfrac{1}{2} (\mu_k + \mu_0)^T \Sigma^{-1} (\mu_k - \mu_0) + x^T \Sigma^{-1} (\mu_k - \mu_0) = \alpha_{k0} + \beta_k^T x
Linear logistic regression looks the same:
\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \alpha_{k0} + \beta_k^T x
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X], while logistic regression maximizes the conditional likelihood [y | X]. The two usually give similar predictions.
Logistic Regression MLE
For the two-class case, the log-likelihood is:

l(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right]

Since \log \frac{p(x; \beta)}{1 - p(x; \beta)} = \beta^T x, we have \log(1 - p(x; \beta)) = -\log(1 + \exp(\beta^T x)), so:

l(\beta) = \sum_{i=1}^{n} \left[ y_i \beta^T x_i - \log(1 + \exp(\beta^T x_i)) \right]

To maximize, need to solve the (non-linear) score equations:

\frac{dl}{d\beta} = \sum_{i=1}^{n} x_i \left( y_i - p(x_i; \beta) \right) = 0
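The score equations are usually solved by Newton-Raphson (equivalently, iteratively reweighted least squares). A minimal sketch, with synthetic data and illustrative names of my own:

```python
import numpy as np

# Synthetic logistic data: intercept plus one predictor
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))       # p(x_i; beta)
    score = X.T @ (y - p)                 # dl/dbeta
    W = p * (1 - p)                       # variance weights
    hessian = -(X * W[:, None]).T @ X     # d^2 l / dbeta dbeta^T
    step = np.linalg.solve(hessian, score)
    beta = beta - step                    # Newton update
    if np.abs(step).max() < 1e-10:        # converged
        break

print(beta)  # roughly recovers beta_true
```

At convergence the score X^T (y − p) is numerically zero, which is exactly the condition on the slide.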
Logistic Regression Modeling

South African Heart Disease Example (y = MI)

|           | Coef.  | S.E.  | Z score (Wald) |
|-----------|--------|-------|----------------|
| Intercept | -4.130 | 0.964 | -4.285         |
| sbp       |  0.006 | 0.006 |  1.023         |
| tobacco   |  0.080 | 0.026 |  3.034         |
| ldl       |  0.185 | 0.057 |  3.219         |
| famhist   |  0.939 | 0.225 |  4.178         |
| obesity   | -0.035 | 0.029 | -1.187         |
| alcohol   |  0.001 | 0.004 |  0.136         |
| age       |  0.043 | 0.010 |  4.184         |
Regularized Logistic Regression
•Ridge/LASSO logistic regression
•Successfully implemented with over 100,000 predictor variables
•Can also regularize discriminant analysis
\hat{w} = \arg\min_w \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i w^T x_i)\right) + \lambda \sum_j |w_j|
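A minimal sketch of the ridge (L2) variant of this objective, fit by plain gradient descent; the data, step size, and `lam` are illustrative assumptions, and labels use the {−1, +1} coding of the objective above. (The L1/LASSO version needs a proximal or coordinate-descent solver and is omitted here.)

```python
import numpy as np

# Synthetic data with a sparse-ish true weight vector (assumed)
rng = np.random.default_rng(3)
n, p = 300, 5
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ w_true)), 1.0, -1.0)

lam = 0.1       # regularization strength (illustrative)
w = np.zeros(p)
for _ in range(2000):
    margins = y * (X @ w)
    # gradient of (1/n) * sum log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
    w = w - 0.5 * grad  # fixed step size (illustrative)

obj = np.log1p(np.exp(-y * (X @ w))).mean() + lam * w @ w
print(w, obj)
```

The penalty shrinks the coefficients toward zero, which is what makes the fit stable even with very many predictors.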
Simple Two-Class Perceptron
Define: h(x) = \sum_{j=1}^{p} w_j x_j

Classify as class 1 if h(x) > 0, class 2 otherwise.

Score function: number of misclassification errors on the training data.

For training, replace each class-2 x_j by −x_j; now every point needs h(x) > 0.

Initialize the weight vector w
Repeat one or more times:
  For each training data point x_i:
    If the point is correctly classified, do nothing
    Else w ← w + x_i

Guaranteed to converge to a separating hyperplane (if one exists)
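The algorithm above can be sketched as follows, using the slide's trick of negating class-2 points so that every point must satisfy h(x) > 0. The separable synthetic data are assumed, and I add an intercept via a constant column (a common choice the slide leaves implicit).

```python
import numpy as np

# Linearly separable synthetic data (assumed)
rng = np.random.default_rng(4)
X1 = rng.normal([2, 2], 0.5, size=(30, 2))    # class 1
X2 = rng.normal([-2, -2], 0.5, size=(30, 2))  # class 2
Z = np.vstack([X1, -X2])                      # negate class-2 points
Z = np.column_stack([np.ones(len(Z)), Z])     # constant column for an intercept

w = np.zeros(3)
for _ in range(100):                          # "repeat one or more times"
    updated = False
    for z in Z:
        if z @ w <= 0:                        # misclassified: needs h(x) > 0
            w = w + z                         # perceptron update
            updated = True
    if not updated:                           # every point satisfies h(x) > 0
        break

print(w, np.all(Z @ w > 0))
```

Since the clusters are separable, the loop terminates with a weight vector for which every (sign-adjusted) point has h(x) > 0, as the convergence guarantee promises.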
“Optimal” Hyperplane
The “optimal” hyperplane separates the two classes and maximizes the distance to the closest point from either class.
Finding this hyperplane is a convex optimization problem.
This notion plays an important role in support vector machines.
[Figure: margin geometry. The hyperplane wx + b = 0 lies between wx + b = 1 and wx + b = −1; their distances from the origin (0,0) are |1 − b| / ||w|| and |−1 − b| / ||w||, so the margin between them is 2 / ||w||.]