TRANSCRIPT
CS 782 – Machine Learning, Lecture 4
Linear Models for Classification
Probabilistic generative models
Probabilistic discriminative models
Probabilistic Generative Models
We have shown:
$$p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + w_0)}}$$
where
$$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$
Decision boundary:
$$p(C_1|\mathbf{x}) = p(C_2|\mathbf{x}) = 0.5 \;\Longleftrightarrow\; \mathbf{w}^T\mathbf{x} + w_0 = 0$$
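As a quick illustration, here is a minimal NumPy sketch (not part of the original slides) that evaluates this posterior for assumed values of the means, shared covariance, and priors; all variable names and numbers are illustrative.

```python
import numpy as np

# Minimal sketch: posterior p(C1|x) for two Gaussian classes with a shared
# covariance, using the closed-form w and w0 given above.
# mu1, mu2, Sigma, p1, p2 are assumed known (illustrative values below).
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
p1, p2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(p1 / p2))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -0.2])
post_c1 = sigmoid(w @ x + w0)                           # p(C1|x)
print(post_c1, "-> C1" if post_c1 > 0.5 else "-> C2")   # boundary at 0.5
```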
Probabilistic Generative Models
K > 2 classes:
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\bigl[p(\mathbf{x}|C_k)\,p(C_k)\bigr]$$
We can show the following result: for Gaussian class-conditional densities with a shared covariance matrix $\Sigma$,
$$p(C_k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + w_{k0}}}{\sum_j e^{\mathbf{w}_j^T\mathbf{x} + w_{j0}}}$$
where
$$\mathbf{w}_k = \Sigma^{-1}\boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T\Sigma^{-1}\boldsymbol{\mu}_k + \ln p(C_k)$$
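A short sketch of the same computation for K classes, again with illustrative (assumed) means, shared covariance, and priors; the softmax is applied to the activations $a_k = \mathbf{w}_k^T\mathbf{x} + w_{k0}$.

```python
import numpy as np

# Sketch of the K-class result: posteriors via softmax of a_k = w_k^T x + w_k0,
# with w_k, w_k0 computed from assumed (illustrative) means, covariance, priors.
mus = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
Sigma = np.eye(2)
priors = np.array([0.5, 0.3, 0.2])

Sigma_inv = np.linalg.inv(Sigma)
W = np.stack([Sigma_inv @ mu for mu in mus])                   # rows are w_k
w0 = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(p)
               for mu, p in zip(mus, priors)])

def softmax(a):
    e = np.exp(a - a.max())        # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.4, 0.8])
posteriors = softmax(W @ x + w0)   # p(C_k|x) for k = 1..K
print(posteriors, "-> class", posteriors.argmax() + 1)
```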
Maximum likelihood solution
We have a parametric functional form for the class-conditional densities:
$$p(\mathbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right\}$$
We can estimate the parameters and the prior class probabilities using maximum likelihood.
Two class case with shared covariance matrix.
Training data: $\{\mathbf{x}_n, t_n\}$, $n = 1, \ldots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.
Priors: $p(C_1) = \pi$, $p(C_2) = 1 - \pi$.
Maximum likelihood solution
For a data point $\mathbf{x}_n$ from class $C_1$ we have $t_n = 1$ and therefore
$$p(\mathbf{x}_n, C_1) = p(C_1)\,p(\mathbf{x}_n|C_1) = \pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \Sigma)$$
For a data point $\mathbf{x}_n$ from class $C_2$ we have $t_n = 0$ and therefore
$$p(\mathbf{x}_n, C_2) = p(C_2)\,p(\mathbf{x}_n|C_2) = (1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \Sigma)$$
Maximum likelihood solution
Assuming observations are drawn independently, we can write the likelihood function as follows:
$$p(\mathbf{t}\,|\,\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^{N}\bigl[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma)\bigr]^{t_n}\bigl[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\Sigma)\bigr]^{1-t_n}$$
where $\mathbf{t} = (t_1, \ldots, t_N)^T$.
Maximum likelihood solution
We want to find the values of the parameters that maximize
the likelihood function, i.e., fit a model that best describes
the observed data.
As usual, we consider the log of the likelihood:
$$\ln p(\mathbf{t}\,|\,\pi,\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma) = \sum_{n=1}^{N}\Bigl\{ t_n\bigl[\ln\pi + \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma)\bigr] + (1-t_n)\bigl[\ln(1-\pi) + \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\Sigma)\bigr]\Bigr\}$$
Maximum likelihood solution
We first maximize the log likelihood with respect to $\pi$. The terms that depend on $\pi$ are
$$\sum_{n=1}^{N}\bigl\{ t_n \ln\pi + (1-t_n)\ln(1-\pi)\bigr\}$$
Setting the derivative with respect to $\pi$ equal to zero gives
$$\sum_{n=1}^{N}\left[\frac{t_n}{\pi} - \frac{1-t_n}{1-\pi}\right] = 0 \;\;\Longrightarrow\;\; \pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1+N_2}$$
where $N_1$ is the number of training points in class $C_1$ and $N_2$ the number in class $C_2$.
Maximum likelihood solution
Thus, the maximum likelihood estimate of $\pi$ is the fraction of points in class $C_1$:
$$\pi_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$
The result can be generalized to the multiclass case: the maximum likelihood estimate of $p(C_k)$ is given by the fraction of points in the training set that belong to $C_k$.
Maximum likelihood solution
We now maximize the log likelihood with respect to $\boldsymbol{\mu}_1$. The terms that depend on $\boldsymbol{\mu}_1$ are
$$\sum_{n=1}^{N} t_n \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma) = -\frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n-\boldsymbol{\mu}_1)^T\Sigma^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_1) + \text{const}$$
Setting the derivative with respect to $\boldsymbol{\mu}_1$ equal to zero gives
$$\sum_{n=1}^{N} t_n \Sigma^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_1) = 0 \;\;\Longrightarrow\;\; \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n$$
Maximum likelihood solution
Thus, the maximum likelihood estimate of $\boldsymbol{\mu}_1$ is the sample mean of all the input vectors $\mathbf{x}_n$ assigned to class $C_1$:
$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n$$
By maximizing the log likelihood with respect to $\boldsymbol{\mu}_2$ we obtain a similar result for $\boldsymbol{\mu}_2$:
$$\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1-t_n)\,\mathbf{x}_n$$
Maximum likelihood solution
Maximizing the log likelihood with respect to $\Sigma$, we obtain the maximum likelihood estimate
$$\Sigma_{ML} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2$$
where
$$\mathbf{S}_1 = \frac{1}{N_1}\sum_{n\in C_1}(\mathbf{x}_n-\boldsymbol{\mu}_1)(\mathbf{x}_n-\boldsymbol{\mu}_1)^T, \qquad \mathbf{S}_2 = \frac{1}{N_2}\sum_{n\in C_2}(\mathbf{x}_n-\boldsymbol{\mu}_2)(\mathbf{x}_n-\boldsymbol{\mu}_2)^T$$
Thus, the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the two classes.
This result extends to K classes.
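The estimates above translate directly into code. The following sketch (illustrative, with assumed array shapes: X is an N-by-D NumPy array and t holds 1 for class $C_1$ and 0 for $C_2$) computes $\pi_{ML}$, the class means, and the pooled covariance.

```python
import numpy as np

# Sketch of the maximum likelihood estimates derived above, two-class case
# with a shared covariance matrix.
def fit_shared_cov_gaussians(X, t):
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                  # prior p(C1) = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1      # sample mean of class C1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1                          # per-class sample covariances
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2        # weighted average
    return pi, mu1, mu2, Sigma
```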
Probabilistic Discriminative Models
Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.
Two-class case:
$$p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$$
Multiclass case:
$$p(C_k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + w_{k0}}}{\sum_j e^{\mathbf{w}_j^T\mathbf{x} + w_{j0}}}$$
Probabilistic Discriminative Models
Advantages:
Fewer parameters to be determined
Improved predictive performance, especially when the
class-conditional density assumptions give a poor
approximation of the true distributions.
Probabilistic Discriminative Models
Two-class case:
$$p(C_1|\mathbf{x}) = y(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0), \qquad p(C_2|\mathbf{x}) = 1 - p(C_1|\mathbf{x})$$
In the terminology of statistics, this model is known as logistic regression.
Assuming $\mathbf{x} \in \mathbb{R}^M$, how many parameters do we need to estimate? $M + 1$.
Probabilistic Discriminative Models
How many parameters did we estimate to fit Gaussian
class-conditional densities (generative approach)?
Mean vectors: $2M$ parameters. Shared covariance matrix: $M(M+1)/2$ parameters. Prior $p(C_1)$: $1$ parameter.
Total: $2M + \dfrac{M(M+1)}{2} + 1 = \dfrac{M(M+5)}{2} + 1$, which grows quadratically with $M$, i.e., $O(M^2)$.
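For example, with $M = 100$ the generative approach requires $200 + 5050 + 1 = 5251$ parameters, whereas logistic regression requires only $M + 1 = 101$.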
Logistic Regression
We use maximum likelihood to determine the parameters
of the logistic regression model.
$$p(C_1|\mathbf{x}) = y(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$$
Training data: $\{\mathbf{x}_n, t_n\}$, $n = 1, \ldots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.
We want to find the values of $\mathbf{w}$ that maximize the posterior probabilities associated with the observed data.
Likelihood function:
$$L(\mathbf{w}) = \prod_{n=1}^{N} p(C_1|\mathbf{x}_n)^{t_n}\bigl[1 - p(C_1|\mathbf{x}_n)\bigr]^{1-t_n} = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}, \qquad y_n = y(\mathbf{x}_n)$$
Logistic Regression
We consider the negative logarithm of the likelihood:
$$E(\mathbf{w}) = -\ln L(\mathbf{w}) = -\sum_{n=1}^{N}\bigl\{ t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr\}$$
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$$
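For concreteness, a small sketch (assumed shapes: X is N-by-D, t contains 0/1 labels; names are illustrative) that evaluates this error function:

```python
import numpy as np

# Cross-entropy error E(w) for logistic regression on a data set (X, t).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, w0, X, t, eps=1e-12):
    y = sigmoid(X @ w + w0)
    y = np.clip(y, eps, 1 - eps)    # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```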
Logistic Regression
We compute the derivative of the error function
$$E(\mathbf{w}) = -\sum_{n=1}^{N}\bigl\{ t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr\}$$
with respect to $\mathbf{w}$ (the gradient). To do this we first need the derivative of the logistic sigmoid function:
$$\frac{d\sigma}{da} = \frac{d}{da}\left(\frac{1}{1+e^{-a}}\right) = \frac{e^{-a}}{(1+e^{-a})^{2}} = \frac{1}{1+e^{-a}}\cdot\frac{e^{-a}}{1+e^{-a}} = \sigma(a)\,\bigl(1-\sigma(a)\bigr)$$
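A quick numerical sanity check of this identity (illustrative, not from the slides), comparing the analytic form against a central finite difference:

```python
import numpy as np

# Verify d(sigma)/da = sigma(a) * (1 - sigma(a)) at an arbitrary point.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.7, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
analytic = sigmoid(a) * (1 - sigmoid(a))
print(numeric, analytic)   # the two values agree to roughly 1e-10
```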
Logistic Regression
$$\nabla E(\mathbf{w}) = -\nabla\sum_{n=1}^{N}\bigl\{ t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr\} = -\sum_{n=1}^{N}\left[\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right]\nabla y_n$$
With $y_n = \sigma(\mathbf{w}^T\mathbf{x}_n + w_0)$, the chain rule and the sigmoid derivative give $\nabla y_n = y_n(1-y_n)\,\mathbf{x}_n$, so
$$\nabla E(\mathbf{w}) = -\sum_{n=1}^{N}\bigl[t_n(1-y_n) - (1-t_n)\,y_n\bigr]\,\mathbf{x}_n = \sum_{n=1}^{N}(y_n - t_n)\,\mathbf{x}_n$$
Logistic Regression
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\mathbf{x}_n$$
The gradient of $E$ at $\mathbf{w}$ gives the direction of steepest increase of $E$ at $\mathbf{w}$. Since we need to minimize $E$, we update $\mathbf{w}$ so that we move along the opposite direction of the gradient:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E(\mathbf{w}^{(\tau)})$$
This technique is called gradient descent.
It can be shown that $E$ is a convex function of $\mathbf{w}$; thus, it has a unique minimum.
An efficient iterative technique exists to find the optimal $\mathbf{w}$ parameters (Newton-Raphson optimization).
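A minimal sketch of batch gradient descent for logistic regression using this update rule; the bias $w_0$ is absorbed by appending a constant feature, and the step size and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

# Batch gradient descent with grad E = sum_n (y_n - t_n) x_n.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_batch(X, t, eta=0.01, n_iters=1000):
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append 1 for the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Xb @ w)
        grad = Xb.T @ (y - t)                   # sum_n (y_n - t_n) x_n
        w = w - eta * grad                      # move against the gradient
    return w[:-1], w[-1]                        # (w, w0)
```

For large data sets a smaller step size (or averaging the gradient over N) may be needed for the iterations to converge.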
Batch vs. on-line learning
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\mathbf{x}_n$$
The computation of the above gradient requires processing the entire training set (batch technique).
If the data set is large, this can be costly.
For real-time applications in which data become available as continuous streams, we may want to update the parameters as data points are presented to us (on-line technique).
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E(\mathbf{w}^{(\tau)})$$
On-line learning
After the presentation of each data point n, we compute the contribution of that data point to the gradient (stochastic gradient):
$$\nabla E_n(\mathbf{w}) = (y_n - t_n)\,\mathbf{x}_n$$
The on-line updating rule for the parameters becomes:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_n(\mathbf{w}^{(\tau)}) = \mathbf{w}^{(\tau)} - \eta\,(y_n - t_n)\,\mathbf{x}_n$$
$\eta > 0$ is called the learning rate. Its value needs to be chosen carefully to ensure convergence.
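A corresponding on-line sketch, updating the parameters one data point at a time; the learning rate, epoch count, and random presentation order are illustrative choices.

```python
import numpy as np

# On-line (stochastic gradient) learning: after each point, adjust w using
# only that point's contribution (y_n - t_n) x_n.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_online(X, t, eta=0.01, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])     # bias feature appended
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(Xb)):        # present points in random order
            y_n = sigmoid(Xb[n] @ w)
            w = w - eta * (y_n - t[n]) * Xb[n]    # per-point update
    return w[:-1], w[-1]
```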
Multiclass Logistic Regression
Multiclass case:
$$p(C_k|\mathbf{x}) = y_k(\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + w_{k0}}}{\sum_j e^{\mathbf{w}_j^T\mathbf{x} + w_{j0}}}$$
We use maximum likelihood to determine the parameters
of the logistic regression model.
Training data: $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ is a 1-of-$K$ coding vector whose component $t_{nk} = 1$ if $\mathbf{x}_n$ belongs to class $C_k$.
We want to find the values of $\mathbf{w}_1, \ldots, \mathbf{w}_K$ that maximize the posterior probabilities associated with the observed data.
Likelihood function:
$$L(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} p(C_k|\mathbf{x}_n)^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = y_k(\mathbf{x}_n)$$
Multiclass Logistic Regression
We consider the negative logarithm of the likelihood:
$$E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\ln L(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$$
$$(\mathbf{w}_1^*, \ldots, \mathbf{w}_K^*) = \arg\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K} E(\mathbf{w}_1, \ldots, \mathbf{w}_K)$$
Multiclass Logistic Regression
We compute the gradient of the error function with respect
to one of the parameter vectors:
$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\nabla_{\mathbf{w}_j}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$$
Multiclass Logistic Regression
Thus, we need to compute the derivatives of the softmax
function:
For $j = k$:
$$\frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k}\left(\frac{e^{a_k}}{\sum_j e^{a_j}}\right) = \frac{e^{a_k}\sum_j e^{a_j} - e^{a_k}e^{a_k}}{\left(\sum_j e^{a_j}\right)^2} = y_k - y_k^2 = y_k(1 - y_k)$$
Multiclass Logistic Regression
Thus, we need to compute the derivatives of the softmax
function:
For $j \neq k$:
$$\frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j}\left(\frac{e^{a_k}}{\sum_i e^{a_i}}\right) = -\frac{e^{a_k}e^{a_j}}{\left(\sum_i e^{a_i}\right)^2} = -y_k\,y_j$$
Multiclass Logistic Regression
Compact expression:
$$\frac{\partial y_k}{\partial a_j} = \begin{cases} y_k(1-y_k) & j = k \\[4pt] -y_k\,y_j & j \neq k \end{cases} \;\;\Longrightarrow\;\; \frac{\partial y_k}{\partial a_j} = y_k\,(I_{kj} - y_j)$$
where $I_{kj}$ are the elements of the identity matrix.
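A quick numerical check of this compact expression against a finite-difference Jacobian (illustrative values, not from the slides):

```python
import numpy as np

# Compare dy_k/da_j = y_k (I_kj - y_j) with a central finite difference.
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.2, -1.0, 0.5])
y = softmax(a)
analytic = np.diag(y) - np.outer(y, y)            # y_k (I_kj - y_j)

h = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    da = np.zeros(3)
    da[j] = h
    numeric[:, j] = (softmax(a + da) - softmax(a - da)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))         # on the order of 1e-10
```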
Multiclass Logistic Regression
Using the chain rule together with $\dfrac{\partial y_{nk}}{\partial a_{nj}} = y_{nk}\,(I_{kj} - y_{nj})$ and $\nabla_{\mathbf{w}_j} a_{nj} = \mathbf{x}_n$:
$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1,\ldots,\mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj} - y_{nj})\,\mathbf{x}_n = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}(I_{kj} - y_{nj})\,\mathbf{x}_n = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,\mathbf{x}_n$$
where the last step uses $\sum_k t_{nk} = 1$ and $\sum_k t_{nk} I_{kj} = t_{nj}$.
Multiclass Logistic Regression
$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1,\ldots,\mathbf{w}_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,\mathbf{x}_n$$
It can be shown that $E$ is a convex function of the parameters; thus, it has a unique minimum.
For a batch solution, we can use the Newton-Raphson
optimization technique.
On-line solution (stochastic gradient descent):
$$\mathbf{w}_j^{(\tau+1)} = \mathbf{w}_j^{(\tau)} - \eta\,\nabla_{\mathbf{w}_j} E_n = \mathbf{w}_j^{(\tau)} - \eta\,(y_{nj} - t_{nj})\,\mathbf{x}_n$$
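A minimal sketch of this on-line procedure for the multiclass case, with T holding the 1-of-K target vectors; the learning rate and number of epochs are illustrative choices, and the bias terms are again absorbed via a constant feature.

```python
import numpy as np

# Stochastic gradient descent for multiclass logistic regression, using the
# per-point gradient (y_nj - t_nj) x_n for each parameter vector w_j.
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def fit_softmax_online(X, T, eta=0.01, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])     # bias feature appended
    N, K = T.shape
    W = np.zeros((K, Xb.shape[1]))                # row j holds (w_j, w_j0)
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            y = softmax(W @ Xb[n])                # y_nk for k = 1..K
            W -= eta * np.outer(y - T[n], Xb[n])  # w_j <- w_j - eta (y_nj - t_nj) x_n
    return W
```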