Part 3: Estimation of Parameters
METU Informatics Institute
Min720
Pattern Classification with Bio-Medical Applications
Lecture Notes
by
Neşe Yalabık
Spring 2011
Estimation of Parameters
• Most of the time, we have random samples, but the densities are not given.
• If the parametric form of the densities is given or assumed, then the parameters can be estimated using the labeled samples (supervised learning).
Maximum Likelihood Estimation of Parameters
• Assume we have a sample set $D_j = \{X_1, X_2, \ldots, X_n\}$ belonging to a given class, drawn from $P(X \mid \theta_j)$ (i.i.d.: independently drawn from an identically distributed r.v.).
• $\theta_j$ is the unknown parameter vector; for a Gaussian, $\theta_j = (\mu_j, \Sigma_j)$.
• The density function $P(X \mid \theta_j)$ is assumed to be of known form.
• So our problem: estimate $\theta_j$ using the sample set $D_j$.
• Now drop the subscript $j$ and assume a single density function $P(X \mid \theta)$.
• $\hat{\theta}$: estimate of $\theta$.
Anything can be an estimate. What makes a good estimate?
• It should converge to the actual value as the sample size grows
• It should be unbiased, etc.
• Consider the joint density of the sample set (the product form is due to statistical independence):
$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$
• $L(\theta)$ is called the "likelihood function".
• $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$ (the value that best agrees with the drawn samples).
• Here $\theta = [t_1, t_2, \ldots, t_p]^T$ is the parameter vector, and $D_j = \{X_{j1}, X_{j2}, \ldots, X_{jn}\}$ is drawn i.i.d. from $P(X \mid \theta_j)$.
• If $\theta$ is a scalar, then find $\hat{\theta}$ such that $\frac{dL}{d\theta} = 0$, and solve for $\theta$.
• When $\theta$ is a vector, then $\nabla_{\theta} L = 0$, where
$\nabla_{\theta} L = \left[ \frac{\partial L}{\partial t_1}, \frac{\partial L}{\partial t_2}, \ldots, \frac{\partial L}{\partial t_p} \right]^T$
is the gradient of $L$ with respect to $\theta$.
• Therefore
$\hat{\theta} = \arg\max_{\theta} L(\theta)$, or $\hat{\theta} = \arg\max_{\theta} \ln L(\theta) = \arg\max_{\theta} l(\theta)$ (log-likelihood).
• (Be careful not to find a minimum when using derivatives.)
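When no closed-form solution exists, the same recipe can be carried out numerically. Below is a minimal sketch (not from the notes), assuming a univariate Gaussian with unknown mean and standard deviation; it maximizes the log-likelihood by minimizing its negative with SciPy.

```python
# A minimal sketch (not from the notes): maximize the log-likelihood
# numerically when no closed-form solution is available. Here the model
# is a univariate Gaussian with unknown mean and standard deviation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=500)  # the drawn sample set D

def neg_log_likelihood(theta, x):
    """Negative log-likelihood: -sum_i ln p(x_i | mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:                     # keep the search in the valid region
        return np.inf
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# Minimizing -l(theta) is the same as maximizing l(theta).
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(samples,),
                  method="Nelder-Mead")
print(result.x)  # approaches the true (mu, sigma) as n grows
```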
Example 1:
Consider an exponential distribution (single feature, single parameter):
$f(x; \theta) = \theta e^{-\theta x}$ for $x \ge 0$; $0$ otherwise (valid for $\theta > 0$).
With a random sample $\{X_1, X_2, \ldots, X_n\}$:
$L(\theta) = f(X_1, X_2, \ldots, X_n; \theta) = \prod_{i=1}^{n} \theta e^{-\theta X_i} = \theta^n e^{-\theta \sum_{i=1}^{n} X_i}$
$l(\theta) = \ln L(\theta) = n \ln \theta - \theta \sum_{i=1}^{n} X_i$
$\frac{dl}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} X_i = 0$
$\hat{\theta} = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{x}}$ (the inverse of the sample average)
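A small numeric check (not from the notes) of the closed-form result above: for exponential data, the MLE is the inverse of the sample average.

```python
# Check that theta_hat = 1 / sample_mean recovers the true parameter.
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.5
x = rng.exponential(scale=1 / theta_true, size=10_000)  # f(x) = theta * exp(-theta * x)

theta_hat = 1.0 / x.mean()   # n / sum(x_i), the derived estimator
print(theta_hat)             # close to 2.5 for a large sample
```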
Example 2:
• Multivariate Gaussian with unknown mean vector $M$. Assume $\Sigma$ is known.
• $k$ samples from the same distribution: $X_1, X_2, \ldots, X_k$ (i.i.d.)
$L(M) = \prod_{i=1}^{k} \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(X_i - M)^T \Sigma^{-1} (X_i - M)}$
$l(M) = \ln L(M) = \sum_{i=1}^{k} \left( -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(X_i - M)^T \Sigma^{-1} (X_i - M) \right)$
Setting the gradient with respect to $M$ to zero (linear algebra):
$\nabla_M l = \sum_{i=1}^{k} \Sigma^{-1} (X_i - M) = 0$
$\hat{M} = \frac{1}{k} \sum_{i=1}^{k} X_i$
• (sample average or sample mean)
• Estimation of $\Sigma$ when it is unknown (do it yourself: not so simple):
$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (X_k - \hat{M})(X_k - \hat{M})^T$ : the sample covariance,
• where $\hat{M}$ is the same sample-mean estimate as above.
• This is a biased estimate: $E[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2 \ne \sigma^2$.
• Use $\frac{1}{n-1} \sum_{k=1}^{n} (X_k - \hat{M})(X_k - \hat{M})^T$ for an unbiased estimate.
Example 3:
Binary variables with unknown parameters $p_i = P(x_i = 1)$, $1 \le i \le n$ ($n$ parameters). So,
$P(X) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}$
$\log P(X) = \sum_{i=1}^{n} \left[ x_i \log p_i + (1 - x_i) \log(1 - p_i) \right]$
With $k$ samples:
$l = \log L = \sum_{j=1}^{k} \log P(X_j) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left[ x_{ij} \log p_i + (1 - x_{ij}) \log(1 - p_i) \right]$
where $x_{ij}$ is the $i$-th element of the $j$-th sample $X_j$.
• So, set the gradient with respect to $p = [p_1, \ldots, p_n]^T$ to zero:
$\nabla_p \log L = \left[ \frac{\partial \log L}{\partial p_1}, \ldots, \frac{\partial \log L}{\partial p_n} \right]^T = 0$
$\frac{\partial \log L}{\partial p_i} = \sum_{j=1}^{k} \left( \frac{x_{ij}}{p_i} - \frac{1 - x_{ij}}{1 - p_i} \right) = 0$
which gives
$\hat{p}_i = \frac{1}{k} \sum_{j=1}^{k} x_{ij}$
• $\hat{p}_i$ is the sample average of the $i$-th feature.
• Since $x_{ij}$ is binary, $\sum_{j=1}^{k} x_{ij}$ is the same as counting the occurrences of '1'.
• Consider a character recognition problem with binary matrices (e.g., images of the characters 'A' and 'B').
• For each pixel, count the number of 1's over the $k$ samples; divided by $k$, this is the estimate of $p_i$. (See the sketch below.)
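A hedged sketch (not from the notes) of this per-pixel Bernoulli MLE for one character class; the image size and the samples themselves are made up for illustration.

```python
# Per-pixel Bernoulli MLE for a class of binary character images.
import numpy as np

# k binary images of the same character, flattened to n pixels each
k, n = 100, 7 * 5
rng = np.random.default_rng(3)
images = (rng.random((k, n)) < 0.4).astype(int)  # stand-in for real 'A' samples

# p_hat[i] = (1/k) * sum_j x_ij: the fraction of samples where pixel i is 1
p_hat = images.mean(axis=0)
print(p_hat.reshape(7, 5))  # one estimated probability per pixel
```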
Part 4: Features and Feature Extraction
METU Informatics Institute
Min720
Pattern Classification with Bio-Medical Applications
Lecture Notes
by
Neşe Yalabık
Spring 2011
Problems of Dimensionality and Feature Selection
• Are all features independent? Especially with binary features, we might have more than 100.
• How does classification accuracy behave vs. the size of the feature set?
• Consider the Gaussian case with the same $\Sigma$ for both categories (assuming the a priori probabilities are the same). The probability of error is ($e$: error)
$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du$
• where $r^2 = (M_1 - M_2)^T \Sigma^{-1} (M_1 - M_2)$ is the square of the Mahalanobis distance between the class means $M_1$ and $M_2$.
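A quick numeric check (not from the notes): the Mahalanobis distance between two assumed class means, and the resulting error probability, using the identity $\frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} du = 1 - \Phi(r/2)$.

```python
# Mahalanobis distance and error probability for equal-covariance Gaussians.
import numpy as np
from scipy.stats import norm

M1 = np.array([0.0, 0.0])
M2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])       # shared covariance (assumed)

diff = M1 - M2
r = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)  # Mahalanobis distance
P_e = 1.0 - norm.cdf(r / 2)                      # the tail integral above
print(r, P_e)
```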
• $P(e)$ decreases as $r$ (the distance between the means) increases.
• If $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$ (all features statistically independent), then
$r^2 = \sum_{i=1}^{d} \frac{(m_{i1} - m_{i2})^2}{\sigma_i^2}$
• We conclude from here that:
• 1. The most useful features are the ones with a large distance between the means and a small variance.
• 2. Each feature contributes to reducing the probability of error.
• When $r$ increases, the probability of error decreases.
• The best features are the ones with distant means and small variances.
So add new features if the ones we already have are not adequate (more features, decreasing probability of error).
But it has been shown that, in practice, adding new features beyond some point leads to worse performance.
What we want:
• Find statistically independent features
• Find discriminating features
• Find computationally feasible features
• Principal Component Analysis (PCA) (Karhunen-Loeve Transform)
• Finds (reduces the set to) statistically independent features.
• Given $n$ vectors $X_1, X_2, \ldots, X_n$:
• find a representative $X_0$
• using a squared-error criterion.
Eliminating Redundant Features
• $Y = [y_1, \ldots, y_m]^T$ is to be found using a larger set $X = [x_1, \ldots, x_d]^T$.
• So we either: throw one feature away; generate a new feature using $y_1$ and $y_2$ (e.g., projecting the points onto a line, each projection being a new sample in 1-d space); or form a linear combination of the features.
• (Figure: scatter of two linearly dependent features $y_1$ and $y_2$, with their projections onto a line.)
• A linear transformation $X = WY$, i.e., linear functions
$x_1 = f_1(y_1, \ldots, y_m)$
$x_2 = f_2(y_1, \ldots, y_m)$
$\vdots$
$x_d = f_d(y_1, \ldots, y_m)$
• $W$? Can be found by: K-L expansion, Principal Component Analysis.
• $W$ is chosen so that the above requirements are satisfied (class discrimination and independent $x_1, x_2, \ldots$).
• Zero-degree representation: $X_1, \ldots, X_n$ represented with a single vector $X_0$.
• Find a vector $X_0$ so that the sum of the squared distances to $X_0$ is minimum.
• Squared-error criterion:
$J_0(X_0) = \sum_{k=1}^{n} \| X_0 - X_k \|^2$
• Find $X_0$ that minimizes $J_0$.
• The solution is given by the sample mean $M = \frac{1}{n} \sum_{k=1}^{n} X_k$. Writing
$J_0(X_0) = \sum_{k=1}^{n} \| (X_0 - M) - (X_k - M) \|^2 = \sum_{k=1}^{n} \| X_0 - M \|^2 - 2 \sum_{k=1}^{n} (X_0 - M)^T (X_k - M) + \sum_{k=1}^{n} \| X_k - M \|^2$
• the middle term vanishes ($\sum_k (X_k - M) = 0$) and the last term is independent of $X_0$, so this expression is minimized when $X_0 = M$.
• Consider now a 1-d representation from 2-d.
• The line should pass through the sample mean:
$X = M + a e$, where $e$ is a unit vector in the direction of the line.
• Projecting $X_k$ onto the line gives $a_k = e^T (X_k - M)$, the distance of $X_k$'s projection from $M$, and the criterion becomes
$J_1 = \sum_{k=1}^{n} \| (M + a_k e) - X_k \|^2$
• Now how do we find the best $e$, the one that minimizes $J_1$?
• It turns out that, given the scatter matrix
$S = \sum_{k=1}^{n} (X_k - M)(X_k - M)^T,$
$e$ must be the eigenvector of the scatter matrix with the largest eigenvalue $\lambda$ (it maximizes $e^T S e$ subject to $\| e \| = 1$).
• That is, we project the data onto a line through the sample mean, in the direction of the eigenvector of the scatter matrix with the largest eigenvalue.
• Now consider a $d'$-dimensional projection ($d' < d$):
$X = M + \sum_{i=1}^{d'} a_i e_i$
• Here $e_1, \ldots, e_{d'}$ are the $d'$ eigenvectors of the scatter matrix having the largest eigenvalues. (A sketch follows below.)
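A compact sketch (not from the notes) of PCA via the scatter matrix on synthetic data: compute $S$, take the top-$d'$ eigenvectors, project, and reconstruct.

```python
# PCA via eigendecomposition of the scatter matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200)

M = X.mean(axis=0)
centered = X - M
S = centered.T @ centered             # scatter matrix sum_k (X_k - M)(X_k - M)^T

eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]     # sort eigenvalues, largest first
d_prime = 2
E = eigvecs[:, order[:d_prime]]       # top-d' eigenvectors e_1, ..., e_d'

A = centered @ E                      # principal components a_i = e_i^T (X_k - M)
X_approx = M + A @ E.T                # reconstruction X ~ M + sum_i a_i e_i
print(np.mean(np.sum((X - X_approx) ** 2, axis=1)))  # small residual error
```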
• The coefficients $a_i$ are called principal components.
• So each $d$-dimensional feature vector $X_k$ is transferred to the $d'$-dimensional space, since the components are given as
$a_{ki} = e_i^T (X_k - M)$
• The new feature vector's elements are therefore
$X_k' = [a_{k1}, a_{k2}, \ldots, a_{kd'}]^T = [e_1^T (X_k - M), \; e_2^T (X_k - M), \; \ldots, \; e_{d'}^T (X_k - M)]^T$
• FISHER'S LINEAR DISCRIMINANT
• Curse of dimensionality: more features require more samples.
• We would like to choose features with more discriminating ability.
• In its simplest form, it reduces the dimension of the problem to one.
• It separates samples from different categories.
• Consider samples from 2 different categories now.
• Find a line so that the projections onto it separate the samples best.
• Same as: apply a transformation to the samples $X$ to obtain a scalar
$y = W^T X$
• such that Fisher's criterion function
$J(W) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
is maximized, where $\tilde{m}_i$ and $\tilde{s}_i^2$ are the mean and scatter of the projected samples of class $i$.
• (Figure: Line 1 is a bad solution, Line 2 is a good solution; blue: original data, red: projection onto Line 1, yellow: projection onto Line 2.)
• This reduces the problem to 1-d, keeping the classes as distant from each other as possible.
• But if we write $\tilde{m}_i$ and $\tilde{s}_i^2$ in terms of $W$ and the original samples, with $M_i = \frac{1}{n_i} \sum_{X \in C_i} X$:
$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in C_i} y = \frac{1}{n_i} \sum_{X \in C_i} W^T X = W^T M_i$
$\tilde{s}_i^2 = \frac{1}{n_i} \sum_{y \in C_i} (y - \tilde{m}_i)^2 = \frac{1}{n_i} \sum_{X \in C_i} (W^T X - W^T M_i)^2 = W^T \left( \frac{1}{n_i} \sum_{X \in C_i} (X - M_i)(X - M_i)^T \right) W$
• Then,
$\tilde{s}_i^2 = W^T S_i W$, and $\tilde{s}_1^2 + \tilde{s}_2^2 = W^T (S_1 + S_2) W = W^T S_W W$
• $S_W = S_1 + S_2$: the within-class scatter matrix
$(\tilde{m}_1 - \tilde{m}_2)^2 = (W^T M_1 - W^T M_2)^2 = W^T (M_1 - M_2)(M_1 - M_2)^T W = W^T S_B W$
• $S_B = (M_1 - M_2)(M_1 - M_2)^T$: the between-class scatter matrix
• Then, maximize
$J(W) = \frac{W^T S_B W}{W^T S_W W}$ (Rayleigh quotient)
• It can be shown that the $W$ that maximizes $J$ can be found by solving an eigenvalue problem again:
$S_W^{-1} S_B W = \lambda W$
• and the solution is given by
$W = S_W^{-1} (M_1 - M_2)$
• This is optimal if the densities are Gaussians with equal covariance matrices; in that case, reducing the dimension does not cause any loss.
• Multiple Discriminant Analysis: the $c$-category problem, a generalization of the 2-category problem. (A sketch of the 2-class case follows below.)
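A minimal sketch (not from the notes) of Fisher's linear discriminant for two classes, using the closed-form solution $W = S_W^{-1}(M_1 - M_2)$ on synthetic Gaussian data.

```python
# Fisher's linear discriminant for two classes via the closed form.
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100)  # class 1
X2 = rng.multivariate_normal([3, 2], [[1, 0.5], [0.5, 1]], size=100)  # class 2

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - M1).T @ (X1 - M1)         # class scatter matrices
S2 = (X2 - M2).T @ (X2 - M2)
S_W = S1 + S2                        # within-class scatter

W = np.linalg.solve(S_W, M1 - M2)    # W = S_W^{-1} (M1 - M2)

y1, y2 = X1 @ W, X2 @ W              # 1-d projections y = W^T X
print(y1.mean(), y2.mean())          # projected class means are well separated
```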
• Non-Parametric Techniques
• Density Estimation
• Use samples directly for classification:
– Nearest Neighbor Rule
– 1-NN
– k-NN
• Linear Discriminant Functions: $g_i(X)$ is linear.