Introduction to Machine Learning (cs.kangwon.ac.kr/~leeck/ai2/05.pdf)
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
Multivariate Parameters
Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$
Covariance: $\sigma_{ij} \equiv \operatorname{Cov}(X_i, X_j)$
Correlation: $\operatorname{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\boldsymbol{\Sigma} \equiv \operatorname{Cov}(\mathbf{X}) = E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation
Sample mean $\mathbf{m}$: $m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t$
Sample covariance $\mathbf{S}$: $s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)$
Sample correlation $\mathbf{R}$: $r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
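A minimal NumPy sketch of these estimators (the data and variable names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # N=100 instances, d=3 attributes

m = X.mean(axis=0)                      # sample mean vector m
S = np.cov(X, rowvar=False, bias=True)  # sample covariance S (divides by N)
R = np.corrcoef(X, rowvar=False)        # sample correlation matrix R
```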
Estimation of Missing Values
What should we do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small.
Use 'missing' as an attribute value: it may itself give information.
Imputation: fill in the missing value.
Mean imputation: use the most likely value (e.g., the mean).
Imputation by regression: predict the missing value based on the other attributes.
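A minimal sketch of mean imputation and imputation by regression in NumPy, assuming (our encoding choice) that missing entries are stored as NaN:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # instance with a missing attribute
              [3.0, 6.0],
              [4.0, 8.0]])

# Mean imputation: replace each NaN with its column mean.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Imputation by regression: predict column 1 from column 0,
# fitting least squares on the rows where column 1 is observed.
obs = ~np.isnan(X[:, 1])
A = np.c_[np.ones(obs.sum()), X[obs, 0]]           # design matrix with bias
w, *_ = np.linalg.lstsq(A, X[obs, 1], rcond=None)
X_reg = X.copy()
X_reg[~obs, 1] = w[0] + w[1] * X[~obs, 0]          # fill missing entries
```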
Multivariate Normal Distribution
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
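The density translates directly into NumPy; this helper (our own, for illustration) evaluates $p(\mathbf{x})$ for a given $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the d-variate normal density N(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu); solve() avoids
    # forming the explicit inverse.
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```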
Multivariate Normal Distribution
Mahalanobis distance: $(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$
measures the distance from x to μ in terms of Σ (normalizes for differences in variances and correlations).
Bivariate: d = 2
$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$
where $z_i = \dfrac{x_i - \mu_i}{\sigma_i}$ and $\rho = \dfrac{\sigma_{12}}{\sigma_1\sigma_2}$.
Bivariate Normal
[figure]
Mahalanobis distance: a correlation-based distance measure.
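In code, the squared Mahalanobis distance is one line of linear algebra (a sketch; the function name is ours):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

# With Sigma = I, this reduces to the squared Euclidean distance.
```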
Independent Inputs: Naive Bayes
If the $x_i$ are independent, the off-diagonals of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}\exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]$$
If the variances are also equal, this reduces to the Euclidean distance.
Parametric Classification
If $p(\mathbf{x}|C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:
$$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$
Discriminant functions:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}|C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad \mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t} \qquad \mathbf{S}_i = \frac{\sum_t r_i^t(\mathbf{x}^t-\mathbf{m}_i)(\mathbf{x}^t-\mathbf{m}_i)^T}{\sum_t r_i^t}$$
where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise. Plugging the estimates into the discriminant:
$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
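Putting the estimators and the discriminant together, a minimal sketch of the resulting classifier (all names are ours; assumes labels 0..K-1):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate P(C_i), m_i, and S_i from labeled data."""
    classes = np.unique(y)
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))                   # P(C_i)
        means.append(Xc.mean(axis=0))                     # m_i
        covs.append(np.cov(Xc, rowvar=False, bias=True))  # S_i
    return np.array(priors), np.array(means), covs

def discriminants(x, priors, means, covs):
    """g_i(x) = -0.5 log|S_i| - 0.5 (x-m_i)^T S_i^{-1} (x-m_i) + log P(C_i)."""
    g = []
    for P, m, S in zip(priors, means, covs):
        diff = x - m
        g.append(-0.5 * np.linalg.slogdet(S)[1]
                 - 0.5 * diff @ np.linalg.solve(S, diff)
                 + np.log(P))
    return np.array(g)

# Predict the class with the largest discriminant:
# y_hat = np.argmax(discriminants(x, priors, means, covs))
```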
Different Si: Quadratic Discriminant
$$g_i(\mathbf{x}) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
where
$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1} \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
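To see where $\mathbf{W}_i$, $\mathbf{w}_i$, and $w_{i0}$ come from, expand the quadratic form in $g_i(\mathbf{x})$ (a routine algebraic step, spelled out here):

$$-\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) = -\frac{1}{2}\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{x} + \mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{x} - \frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i$$

The quadratic term gives $\mathbf{W}_i$, the linear term gives $\mathbf{w}_i$, and the constant, combined with $-\frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$, gives $w_{i0}$.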
[figure: likelihoods, posterior for C1, and the discriminant P(C1|x) = 0.5]
Common Covariance Matrix S
Shared common sample covariance S:
$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$
The discriminant reduces to
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
Since the quadratic term $\mathbf{x}^T\mathbf{S}^{-1}\mathbf{x}$ is the same for every class, it cancels, and this is a linear discriminant:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0} \quad\text{where}\quad \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
Diagonal S
When $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal:
$$p(\mathbf{x}|C_i) = \prod_j p(x_j|C_i) \quad \text{(Naive Bayes' assumption)}$$
$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$
Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean.
Diagonal S
[figure: variances may be different]
Diagonal S, equal variances
Nearest mean classifier: classify based on Euclidean distance to the nearest mean:
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\mathbf{m}_i\|^2}{2s^2} + \log\hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}(x_j - m_{ij})^2 + \log\hat{P}(C_i)$$
Each mean can be considered a prototype or template, and this is template matching.
Diagonal S, equal variances
[figure]
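A minimal sketch of the nearest-mean (template-matching) rule described above, assuming equal priors (names are ours):

```python
import numpy as np

def nearest_mean_predict(X, means):
    """Assign each row of X to the class with the closest mean (Euclidean)."""
    # dists[t, i] = ||x^t - m_i||^2
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1)
```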
Model Selection

Assumption                   Covariance matrix       No. of parameters
Shared, Hyperspheric         Si = S = s^2 I          1
Shared, Axis-aligned         Si = S, with s_ij = 0   d
Shared, Hyperellipsoidal     Si = S                  d(d+1)/2
Different, Hyperellipsoidal  Si                      K d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 | C_i)$
If the $x_j$ are independent (Naive Bayes'):
$$p(\mathbf{x}|C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}(1-p_{ij})^{1-x_j}$$
The discriminant is linear:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}|C_i) + \log P(C_i) = \sum_j \left[x_j\log p_{ij} + (1-x_j)\log(1-p_{ij})\right] + \log P(C_i)$$
Estimated parameters:
$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
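A sketch of the binary-feature (Bernoulli) Naive Bayes estimator and discriminant; the Laplace smoothing term alpha is our own addition to avoid log 0:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate p_ij = p(x_j = 1 | C_i), with Laplace smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    p = np.array([(X[y == c].sum(axis=0) + alpha) /
                  ((y == c).sum() + 2 * alpha) for c in classes])
    return priors, p

def bernoulli_discriminants(x, priors, p):
    """g_i(x) = sum_j [x_j log p_ij + (1-x_j) log(1-p_ij)] + log P(C_i)."""
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)
```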
Discrete Features
Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$, with indicator $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise:
$$p_{ijk} \equiv p(z_{jk} = 1 | C_i) = p(x_j = v_k | C_i)$$
If the $x_j$ are independent:
$$p(\mathbf{x}|C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$
$$g_i(\mathbf{x}) = \sum_j\sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$
$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$
Multivariate Regression
$$r^t = g(\mathbf{x}^t | w_0, w_1, \ldots, w_d) + \varepsilon$$
Multivariate linear model:
$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$
$$E(w_0, w_1, \ldots, w_d | \mathcal{X}) = \frac{1}{2}\sum_t\left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$
Multivariate polynomial model: define new higher-order variables $z_1 = x_1$, $z_2 = x_2$, $z_3 = x_1^2$, $z_4 = x_2^2$, $z_5 = x_1 x_2$, and use the linear model in this new z space (basis functions, kernel trick: Chapter 13).
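A minimal sketch of fitting the linear model by least squares, plus the polynomial model through the explicit z-space expansion for d = 2 (the synthetic data here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # N=50, d=2
r = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=50)

# Linear model: minimize E = 1/2 sum_t (r^t - w0 - w1 x1^t - w2 x2^t)^2
A = np.c_[np.ones(len(X)), X]                # prepend a bias column
w, *_ = np.linalg.lstsq(A, r, rcond=None)    # [w0, w1, w2]

# Polynomial model: z1=x1, z2=x2, z3=x1^2, z4=x2^2, z5=x1*x2,
# then the same linear fit in z space.
Z = np.c_[X, X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]]
wz, *_ = np.linalg.lstsq(np.c_[np.ones(len(Z)), Z], r, rcond=None)
```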