Linear Models for Classification
DESCRIPTION
Linear Models for Classification. Berkay Topçu. Goal: take an input vector and assign it to one of K classes (C_k, where k = 1, ..., K). Linear separation of classes. Generalized linear models.
TRANSCRIPT
Linear Models for Classification
Berkay Topçu
Linear Models for Classification
Goal: take an input vector $\mathbf{x}$ and assign it to one of $K$ classes $\mathcal{C}_k$, where $k = 1, \ldots, K$.
Linear separation of classes.
Generalized Linear Models
We wish to predict discrete class labels, or more generally class posterior probabilities $p(\mathcal{C}_k \mid \mathbf{x})$, which lie in the range $(0, 1)$.
The classification model is a linear function of the parameters:
$$y(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + w_0)$$
Classification is performed either directly in the original input space $\mathbf{x}$, or in a fixed nonlinear transformation of the input variables using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$.
Discriminant Functions
Linear discriminants:
$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
If $y(\mathbf{x}) \geq 0$, assign $\mathbf{x}$ to class $\mathcal{C}_1$, and to class $\mathcal{C}_2$ otherwise.
The decision boundary is given by $y(\mathbf{x}) = 0$; $\mathbf{w}$ determines the orientation of the decision surface and $w_0$ determines its location.
Compact notation:
$$y(\mathbf{x}) = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}, \qquad \tilde{\mathbf{w}} = (w_0, \mathbf{w}), \quad \tilde{\mathbf{x}} = (1, \mathbf{x})$$
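As a concrete illustration, here is a minimal sketch of this two-class rule in Python; the weight vector `w` and bias `w0` are hypothetical hand-picked values, not learned parameters:

```python
import numpy as np

# Minimal sketch of the two-class linear discriminant; w and w0 are
# hypothetical values standing in for learned parameters.
w = np.array([1.0, -2.0])   # determines the orientation of the decision surface
w0 = 0.5                    # determines the location of the decision surface

def y(x):
    """y(x) = w^T x + w0; assign x to C1 if y(x) >= 0, else to C2."""
    return w @ x + w0

x = np.array([3.0, 1.0])
print("C1" if y(x) >= 0 else "C2")
```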
Multiple Classes
A K-class discriminant can be built by combining a number of two-class discriminant functions ($K > 2$):
One-versus-the-rest: separating points in one particular class $\mathcal{C}_k$ from points not in that class.
One-versus-one: $K(K-1)/2$ binary discriminant functions.
Multiple Classes
A single K-class discriminant comprising $K$ linear functions:
$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$$
Assign $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
How do we learn the parameters of the linear discriminant functions?
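A minimal sketch of this decision rule, using the augmented notation $\tilde{\mathbf{x}} = (1, \mathbf{x})$ from above; the weight matrix here is random placeholder data standing in for learned parameters:

```python
import numpy as np

# Sketch of the K-class decision rule with augmented inputs x~ = (1, x).
# The (D+1) x K weight matrix is a random placeholder, not learned.
rng = np.random.default_rng(0)
W_tilde = rng.normal(size=(3, 4))        # D = 2 inputs, K = 4 classes

def classify(x):
    x_tilde = np.concatenate(([1.0], x)) # prepend the bias component
    scores = W_tilde.T @ x_tilde         # y_k(x) for k = 1, ..., K
    return int(np.argmax(scores))        # class with the largest y_k(x)

print(classify(np.array([0.2, -1.3])))
```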
Least Squares for Classification
Each class $\mathcal{C}_k$ is described by its own linear model:
$$y_k(\mathbf{x}) = \tilde{\mathbf{w}}_k^T \tilde{\mathbf{x}}, \qquad \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}}$$
Training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ for $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, 0, \ldots, 1, \ldots, 0)^T$ is a 1-of-K target vector.
Let $\mathbf{T}$ be the matrix whose $n$th row is the vector $\mathbf{t}_n^T$, and $\tilde{\mathbf{X}}$ the matrix whose $n$th row is $\tilde{\mathbf{x}}_n^T$.
Least Squares for Classification
Minimizing the sum-of-squares error function:
$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \mathrm{Tr}\left\{ (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}) \right\}$$
Solution:
$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{T} = \tilde{\mathbf{X}}^\dagger \mathbf{T}$$
Discriminant function:
$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} = \mathbf{T}^T (\tilde{\mathbf{X}}^\dagger)^T \tilde{\mathbf{x}}$$
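A minimal sketch of this closed-form solution on toy data, using the pseudo-inverse $\tilde{\mathbf{X}}^\dagger$ via `numpy.linalg.pinv`; the data set is made up for illustration:

```python
import numpy as np

# Sketch of least-squares classification: W~ = pinv(X~) T with 1-of-K
# targets, then argmax over the K outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # toy inputs, D = 2
labels = rng.integers(0, 3, size=100)        # toy class indices, K = 3
T = np.eye(3)[labels]                        # 1-of-K target matrix

X_tilde = np.hstack([np.ones((100, 1)), X])  # augmented inputs (1, x)
W_tilde = np.linalg.pinv(X_tilde) @ T        # pseudo-inverse solution

Y = X_tilde @ W_tilde                        # y(x) for every training point
pred = Y.argmax(axis=1)                      # assign each x to the largest output
```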
Fisher’s Linear Discriminant
Dimensionality reduction: take the D-dimensional input vector $\mathbf{x}$ and project it down to one dimension using
$$y = \mathbf{w}^T \mathbf{x}$$
and choose the projection that maximizes class separation.
Two-class problem: $N_1$ points of $\mathcal{C}_1$ and $N_2$ points of $\mathcal{C}_2$, with class means
$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}_n$$
Fisher’s idea: seek a large separation between the projected class means, $m_2 - m_1 = \mathbf{w}^T (\mathbf{m}_2 - \mathbf{m}_1)$, together with a small variance within each class, minimizing class overlap. Maximizing the mean separation alone gives $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$.
Fisher’s Linear Discriminant
The Fisher criterion:
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$$
where the between-class and within-class scatter matrices are
$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$
$$\mathbf{S}_W = \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$
Maximizing $J(\mathbf{w})$ gives
$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$
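A minimal sketch of the two-class Fisher direction on toy Gaussian blobs; the data and class assignments are made up for illustration:

```python
import numpy as np

# Sketch of the two-class Fisher direction w ∝ S_W^{-1}(m2 - m1);
# X1 and X2 hold the samples of C1 and C2 row-wise.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(50, 2))
X2 = rng.normal(loc=2.0, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(S_W, m2 - m1)   # Fisher direction (overall scale is irrelevant)

y1, y2 = X1 @ w, X2 @ w             # 1-D projections with maximal class separation
```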
Fisher’s Linear Discriminant
For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis, Hastie, Buja and Tibshirani).
For multiple classes:
$$\mathbf{y} = \mathbf{W}^T \mathbf{x}$$
$$\mathbf{S}_W = \sum_{k=1}^{K} \mathbf{S}_k, \qquad \mathbf{S}_B = \sum_{k=1}^{K} N_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$$
$$J(\mathbf{W}) = \mathrm{Tr}\left\{ (\mathbf{W} \mathbf{S}_W \mathbf{W}^T)^{-1} (\mathbf{W} \mathbf{S}_B \mathbf{W}^T) \right\}$$
The weight values are determined by the eigenvectors that correspond to the $K$ largest eigenvalues of $\mathbf{S}_W^{-1} \mathbf{S}_B$.
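A sketch of the multi-class case under the definitions above; the toy data, the number of retained directions, and the use of `numpy.linalg.eig` are illustrative choices:

```python
import numpy as np

# Sketch of the multi-class Fisher projection: keep the eigenvectors of
# S_W^{-1} S_B with the largest eigenvalues. Toy data: K = 3 classes, D = 5.
rng = np.random.default_rng(2)
Xs = [rng.normal(loc=3.0 * k, size=(40, 5)) for k in range(3)]

m = np.vstack(Xs).mean(axis=0)                    # overall mean
S_W = sum((X - X.mean(0)).T @ (X - X.mean(0)) for X in Xs)
S_B = sum(len(X) * np.outer(X.mean(0) - m, X.mean(0) - m) for X in Xs)

evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]              # eigenvalues, descending
W = evecs.real[:, order[:2]]                      # leading discriminant directions
Y = np.vstack(Xs) @ W                             # projected features y = W^T x
```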
The Perceptron Algorithm
The input vector $\mathbf{x}$ is transformed using a nonlinear transformation $\boldsymbol{\phi}(\mathbf{x})$:
$$y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})), \qquad f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$
with targets $t \in \{+1, -1\}$.
Perceptron criterion: for all training samples we would like $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n > 0$. We therefore need to minimize
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) t_n$$
where $\mathcal{M}$ is the set of misclassified patterns.
The Perceptron Algorithm: Stochastic Gradient Descent
Cycle through the training patterns in turn. If a pattern is correctly classified, the weight vector remains unchanged; otherwise:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta \boldsymbol{\phi}(\mathbf{x}_n) t_n$$
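A minimal sketch of this training loop, assuming $\boldsymbol{\phi}(\mathbf{x}) = (1, \mathbf{x})$ and targets in $\{-1, +1\}$; the fixed epoch count is an illustrative stopping rule, since the updates only terminate on their own for separable data:

```python
import numpy as np

# Sketch of the perceptron rule: cycle through the patterns and apply
# w <- w + eta * phi(x_n) * t_n only when a pattern is misclassified.
def perceptron(X, t, eta=1.0, epochs=100):
    Phi = np.hstack([np.ones((len(X), 1)), X])   # phi(x) = (1, x)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:           # misclassified (or on the boundary)
                w = w + eta * phi_n * t_n        # stochastic gradient step on E_P
    return w
```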
Probabilistic Generative Models
These depend on simple assumptions about the distribution of the data. For two classes:
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2) p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
where
$$a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2) p(\mathcal{C}_2)}$$
The logistic sigmoid function $\sigma(a)$ maps the whole real axis to the finite interval $(0, 1)$.
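For reference, a one-line sketch of the sigmoid:

```python
import numpy as np

# The logistic sigmoid sigma(a) = 1 / (1 + exp(-a)) squashes the real
# axis into (0, 1), so its output can be read as a posterior probability.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```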
Continuous Inputs - Gaussian
Assuming the class-conditional densities are Gaussian with a shared covariance matrix $\boldsymbol{\Sigma}$:
$$p(\mathbf{x} \mid \mathcal{C}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$$
Case of two classes:
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$$
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$
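A minimal sketch that maps given Gaussian parameters to the posterior parameters $\mathbf{w}$ and $w_0$; the function name and argument layout are illustrative:

```python
import numpy as np

# Sketch: turn Gaussian class-conditional parameters (mu1, mu2, shared Sigma)
# and priors p1, p2 into the posterior parameters w and w0 above.
def posterior_params(mu1, mu2, Sigma, p1, p2):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    return w, w0                 # p(C1|x) = sigmoid(w @ x + w0)
```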
Maximum Likelihood Solution
Likelihood function:
$$p(\mathbf{t} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \left[ \pi \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \right]^{t_n} \left[ (1 - \pi) \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) \right]^{1 - t_n}$$
Maximizing the log-likelihood gives
$$\pi = \frac{N_1}{N_1 + N_2}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) \mathbf{x}_n$$
$$\boldsymbol{\Sigma} = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2, \qquad \mathbf{S}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$
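A sketch of these closed-form estimates, assuming a binary label vector with $t_n = 1$ for $\mathcal{C}_1$ and $t_n = 0$ for $\mathcal{C}_2$:

```python
import numpy as np

# Sketch of the closed-form ML estimates; X is (N, D), t holds binary labels.
def fit_generative(X, t):
    N = len(t)
    N1 = int(t.sum())
    N2 = N - N1
    pi = N1 / N                                   # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average of class covariances
    return pi, mu1, mu2, Sigma
```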
Probabilistic Discriminative Models
In the probabilistic generative model, the number of parameters grows quadratically with $M$ (the number of dimensions). However, the discriminative model
$$p(\mathcal{C}_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$$
has only $M$ adjustable parameters.
Maximum likelihood solution for logistic regression:
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
Error function: the negative log-likelihood (cross-entropy)
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}$$
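A minimal sketch that minimizes this error by plain gradient descent, using the gradient $\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$ derived on the next slide; the step size and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Sketch: gradient descent on the cross-entropy error for logistic
# regression. Phi is the N x M design matrix, t holds targets in {0, 1}.
def logistic_regression_gd(Phi, t, eta=0.1, steps=500):
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        w -= eta * Phi.T @ (y - t)           # step along -grad E(w)
    return w
```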
Iterative Reweighted Least Squares
Newton-Raphson iterative optimization, applied first to linear regression:
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1} \nabla E(\mathbf{w})$$
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (\mathbf{w}^T \boldsymbol{\phi}_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w} - \boldsymbol{\Phi}^T \mathbf{t}$$
$$\mathbf{H} = \sum_{n=1}^{N} \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \left\{ \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T \mathbf{t} \right\} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$
This is the same as the standard least-squares solution.
Iterative Reweighted Least Squares
Newton-Raphson update for the negative log-likelihood:
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^T (\mathbf{y} - \mathbf{t})$$
$$\mathbf{H} = \sum_{n=1}^{N} y_n (1 - y_n) \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi}, \qquad R_{nn} = y_n (1 - y_n)$$
$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T (\mathbf{y} - \mathbf{t}) = (\boldsymbol{\Phi}^T \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{R} \mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \mathbf{R}^{-1} (\mathbf{y} - \mathbf{t})$$
This is a weighted least-squares problem.
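A sketch of the resulting IRLS loop; for simplicity it ignores the possibility of $R_{nn} \to 0$, which a robust implementation would guard against:

```python
import numpy as np

# Sketch of IRLS for logistic regression: each Newton step is the weighted
# least-squares solve above with R_nn = y_n (1 - y_n). Phi is (N, M),
# targets t are in {0, 1}.
def irls(Phi, t, iters=10):
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        r = y * (1.0 - y)                        # diagonal of R
        z = Phi @ w - (y - t) / r                # working response
        w = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
    return w
```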
Maximum Margin Classifiers
Support vector machines for the two-class problem:
$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b, \qquad t_n \in \{-1, +1\}$$
Assuming a linearly separable data set, there exists at least one choice of the parameters satisfying $t_n y(\mathbf{x}_n) > 0$ for all $n$; we seek the one that gives the smallest generalization error.
Margin: the smallest distance between the decision boundary and any of the samples.
Support Vector Machines
Optimization of the parameters, maximizing the margin:
$$\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$$
Maximizing the margin amounts to minimizing $\|\mathbf{w}\|^2$:
$$\arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
subject to the constraints
$$t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b) \geq 1$$
Introducing Lagrange multipliers $a_n \geq 0$:
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) + b) - 1 \right\}$$
Support Vector Machines - Lagrange Multipliers
Minimize with respect to $\mathbf{w}$ and $b$, and maximize with respect to $\mathbf{a}$:
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n), \qquad \sum_{n=1}^{N} a_n t_n = 0$$
The dual form:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
subject to $a_n \geq 0$ and $\sum_{n=1}^{N} a_n t_n = 0$.
This is a quadratic programming problem: maximize a quadratic function of $\mathbf{a}$,
$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \left\{ \mathbf{c}^T \mathbf{a} - \frac{1}{2} \mathbf{a}^T \mathbf{Q} \mathbf{a} \right\},$$
subject to linear constraints. New points are classified using
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$
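A minimal sketch of this prediction rule, assuming the multipliers $a_n$ and offset $b$ have already been obtained from the quadratic program; the RBF kernel is an illustrative choice:

```python
import numpy as np

# Sketch of the dual-form SVM prediction y(x) = sum_n a_n t_n k(x, x_n) + b.
# The multipliers a, offset b, and training set (X, t) are assumed given.
def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, X, t, a, b, kernel=rbf_kernel):
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, t, X)) + b
```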
Support Vector Machines
Overlapping class distributions (linearly non-separable data). Slack variable $\xi_n$: distance from the boundary,
$$\xi_n = 0 \text{ (correct side of the margin)}, \quad \text{or} \quad \xi_n = |t_n - y(\mathbf{x}_n)|$$
$$t_n y(\mathbf{x}_n) \geq 1 - \xi_n$$
To maximize the margin while penalizing points that lie on the wrong side of the margin boundary:
$$\min \; C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2$$
SVM - Overlapping Class Distributions
The Lagrangian is
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n$$
The dual form is identical to the separable case:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
now subject to the box constraints $0 \leq a_n \leq C$ and $\sum_{n=1}^{N} a_n t_n = 0$. Again this represents a quadratic programming problem.
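As a usage note, off-the-shelf solvers handle this quadratic program; for example, scikit-learn's `SVC` exposes the slack penalty as its `C` argument (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Usage sketch: SVC solves the soft-margin quadratic program; C is the
# slack penalty, i.e. the box constraint 0 <= a_n <= C.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy, nearly separable labels

clf = SVC(C=1.0, kernel="rbf").fit(X, t)
print(clf.predict([[0.5, -0.2]]))
```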
Support Vector Machines
Relation to logistic regression: writing $z = t_n y(\mathbf{x}_n)$, the hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE).
[Figure: loss functions plotted against $z$. Black: MCE, blue: hinge loss, red: logistic regression, green: squared error.]