Introduction to Machine Learning (cs.kangwon.ac.kr/~leeck/ai2/05.pdf)
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
Multivariate Parameters
Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$
Covariance: $\sigma_{ij} \equiv \operatorname{Cov}(X_i, X_j)$
Correlation: $\operatorname{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\boldsymbol{\Sigma} \equiv \operatorname{Cov}(\mathbf{X}) = E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation
Sample mean $\mathbf{m}$: $m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t$
Sample covariance $\mathbf{S}$: $s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)$
Sample correlation $\mathbf{R}$: $r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
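A minimal NumPy sketch of these estimators (the data and variable names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # N=100 instances, d=3 attributes

m = X.mean(axis=0)                      # sample mean vector m
S = np.cov(X, rowvar=False, bias=True)  # sample covariance S (divides by N)
R = np.corrcoef(X, rowvar=False)        # sample correlation matrix R
```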
Estimation of Missing Values
What should we do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small.
Use 'missing' as an attribute value: it may itself give information.
Imputation: fill in the missing value.
Mean imputation: use the most likely value (e.g., the mean).
Imputation by regression: predict the missing value based on the other attributes.
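A minimal sketch of mean imputation and imputation by regression in NumPy, assuming (our encoding choice) that missing entries are stored as NaN:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # instance with a missing attribute
              [3.0, 6.0],
              [4.0, 8.0]])

# Mean imputation: replace each NaN with its column mean.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Imputation by regression: predict column 1 from column 0,
# fitting least squares on the rows where column 1 is observed.
obs = ~np.isnan(X[:, 1])
A = np.c_[np.ones(obs.sum()), X[obs, 0]]           # design matrix with bias
w, *_ = np.linalg.lstsq(A, X[obs, 1], rcond=None)
X_reg = X.copy()
X_reg[~obs, 1] = w[0] + w[1] * X[~obs, 0]          # fill missing entries
```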
Multivariate Normal Distribution
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
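The density translates directly into NumPy; this helper (our own, for illustration) evaluates $p(\mathbf{x})$ for a given $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the d-variate normal density N(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu); solve() avoids
    # forming the explicit inverse.
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```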
Multivariate Normal Distribution
Mahalanobis distance: $(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$
measures the distance from x to μ in terms of Σ (normalizes for differences in variances and correlations).
Bivariate: d = 2
$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$
where $z_i = \dfrac{x_i - \mu_i}{\sigma_i}$ and $\rho = \dfrac{\sigma_{12}}{\sigma_1\sigma_2}$.
Bivariate Normal
[figure]
Mahalanobis distance: a correlation-based distance measure.
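In code, the squared Mahalanobis distance is one line of linear algebra (a sketch; the function name is ours):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

# With Sigma = I, this reduces to the squared Euclidean distance.
```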
Independent Inputs: Naive Bayes
If the $x_i$ are independent, the off-diagonals of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}\exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]$$
If the variances are also equal, this reduces to the Euclidean distance.
Parametric Classification
If $p(\mathbf{x}|C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:
$$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$
Discriminant functions:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}|C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad \mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t} \qquad \mathbf{S}_i = \frac{\sum_t r_i^t(\mathbf{x}^t-\mathbf{m}_i)(\mathbf{x}^t-\mathbf{m}_i)^T}{\sum_t r_i^t}$$
where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise. Plugging the estimates into the discriminant:
$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
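Putting the estimators and the discriminant together, a minimal sketch of the resulting classifier (all names are ours; assumes labels 0..K-1):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate P(C_i), m_i, and S_i from labeled data."""
    classes = np.unique(y)
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))                   # P(C_i)
        means.append(Xc.mean(axis=0))                     # m_i
        covs.append(np.cov(Xc, rowvar=False, bias=True))  # S_i
    return np.array(priors), np.array(means), covs

def discriminants(x, priors, means, covs):
    """g_i(x) = -0.5 log|S_i| - 0.5 (x-m_i)^T S_i^{-1} (x-m_i) + log P(C_i)."""
    g = []
    for P, m, S in zip(priors, means, covs):
        diff = x - m
        g.append(-0.5 * np.linalg.slogdet(S)[1]
                 - 0.5 * diff @ np.linalg.solve(S, diff)
                 + np.log(P))
    return np.array(g)

# Predict the class with the largest discriminant:
# y_hat = np.argmax(discriminants(x, priors, means, covs))
```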
Different Si: Quadratic Discriminant
$$g_i(\mathbf{x}) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
where
$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1} \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
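To see where $\mathbf{W}_i$, $\mathbf{w}_i$, and $w_{i0}$ come from, expand the quadratic form in $g_i(\mathbf{x})$ (a routine algebraic step, spelled out here):

$$-\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) = -\frac{1}{2}\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{x} + \mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{x} - \frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i$$

The quadratic term gives $\mathbf{W}_i$, the linear term gives $\mathbf{w}_i$, and the constant, combined with $-\frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$, gives $w_{i0}$.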
[figure: likelihoods, posterior for C1, and the discriminant P(C1|x) = 0.5]
Common Covariance Matrix S
Shared common sample covariance S:
$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$
The discriminant reduces to
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
Since the quadratic term $\mathbf{x}^T\mathbf{S}^{-1}\mathbf{x}$ is the same for every class, it cancels, and this is a linear discriminant:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0} \quad\text{where}\quad \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
Diagonal S
When $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal:
$$p(\mathbf{x}|C_i) = \prod_j p(x_j|C_i) \quad \text{(Naive Bayes' assumption)}$$
$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$
Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean.
Diagonal S
[figure: variances may be different]
Diagonal S, equal variances
Nearest mean classifier: classify based on Euclidean distance to the nearest mean:
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\mathbf{m}_i\|^2}{2s^2} + \log\hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}(x_j - m_{ij})^2 + \log\hat{P}(C_i)$$
Each mean can be considered a prototype or template, and this is template matching.
Diagonal S, equal variances
[figure]
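A minimal sketch of the nearest-mean (template-matching) rule described above, assuming equal priors (names are ours):

```python
import numpy as np

def nearest_mean_predict(X, means):
    """Assign each row of X to the class with the closest mean (Euclidean)."""
    # dists[t, i] = ||x^t - m_i||^2
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1)
```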
Model Selection

Assumption                   Covariance matrix       No. of parameters
Shared, Hyperspheric         Si = S = s^2 I          1
Shared, Axis-aligned         Si = S, with s_ij = 0   d
Shared, Hyperellipsoidal     Si = S                  d(d+1)/2
Different, Hyperellipsoidal  Si                      K d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 | C_i)$
If the $x_j$ are independent (Naive Bayes'):
$$p(\mathbf{x}|C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}(1-p_{ij})^{1-x_j}$$
The discriminant is linear:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}|C_i) + \log P(C_i) = \sum_j \left[x_j\log p_{ij} + (1-x_j)\log(1-p_{ij})\right] + \log P(C_i)$$
Estimated parameters:
$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
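A sketch of the binary-feature (Bernoulli) Naive Bayes estimator and discriminant; the Laplace smoothing term alpha is our own addition to avoid log 0:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate p_ij = p(x_j = 1 | C_i), with Laplace smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    p = np.array([(X[y == c].sum(axis=0) + alpha) /
                  ((y == c).sum() + 2 * alpha) for c in classes])
    return priors, p

def bernoulli_discriminants(x, priors, p):
    """g_i(x) = sum_j [x_j log p_ij + (1-x_j) log(1-p_ij)] + log P(C_i)."""
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)
```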
Discrete Features
Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$, with indicator $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise:
$$p_{ijk} \equiv p(z_{jk} = 1 | C_i) = p(x_j = v_k | C_i)$$
If the $x_j$ are independent:
$$p(\mathbf{x}|C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$
$$g_i(\mathbf{x}) = \sum_j\sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$
$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$
Multivariate Regression
$$r^t = g(\mathbf{x}^t | w_0, w_1, \ldots, w_d) + \varepsilon$$
Multivariate linear model:
$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$
$$E(w_0, w_1, \ldots, w_d | \mathcal{X}) = \frac{1}{2}\sum_t\left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$
Multivariate polynomial model: define new higher-order variables $z_1 = x_1$, $z_2 = x_2$, $z_3 = x_1^2$, $z_4 = x_2^2$, $z_5 = x_1 x_2$, and use the linear model in this new z space (basis functions, kernel trick: Chapter 13).
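A minimal sketch of fitting the linear model by least squares, plus the polynomial model through the explicit z-space expansion for d = 2 (the synthetic data here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # N=50, d=2
r = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=50)

# Linear model: minimize E = 1/2 sum_t (r^t - w0 - w1 x1^t - w2 x2^t)^2
A = np.c_[np.ones(len(X)), X]                # prepend a bias column
w, *_ = np.linalg.lstsq(A, r, rcond=None)    # [w0, w1, w2]

# Polynomial model: z1=x1, z2=x2, z3=x1^2, z4=x2^2, z5=x1*x2,
# then the same linear fit in z space.
Z = np.c_[X, X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]]
wz, *_ = np.linalg.lstsq(np.c_[np.ones(len(Z)), Z], r, rcond=None)
```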