Lecture Slides for
INTRODUCTION TO Machine Learning, 2nd Edition
ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/MachineLearning/
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Outline
This class: Ch 5: Multivariate Methods
● Multivariate Data
● Parameter Estimation
● Estimation of Missing Values
● Multivariate Classification
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
CHAPTER 5:
Multivariate Methods
Multivariate Distribution
● Assume all members of a class came from a joint distribution
● Can learn the distribution P(x|C) from data
● Assign a new instance to the most probable class P(C|x) using Bayes' rule
● An instance is described by a vector of correlated parameters
● Realm of multivariate distributions
● Multivariate normal
Multivariate Data
$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
● Multiple measurements (sensors)
● d inputs/features/attributes: d-variate
● N instances/observations/examples
Multivariate Parameters
Mean: $E[\mathbf{x}] \equiv \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\Sigma \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation
Sample mean: $m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N}(x_i^t - m_i)(x_j^t - m_j)$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
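As a concrete illustration, here is a minimal NumPy sketch of these estimators; the small data matrix is a made-up example, and the 1/N convention matches the formulas above.

```python
import numpy as np

# Hypothetical data: N=5 instances, d=2 features (e.g. age, income)
X = np.array([[25, 30000],
              [32, 42000],
              [47, 61000],
              [51, 58000],
              [38, 52000]], dtype=float)
N, d = X.shape

m = X.mean(axis=0)          # sample mean m_i
D = X - m                   # centered data
S = D.T @ D / N             # covariance matrix s_ij (1/N convention)
s = np.sqrt(np.diag(S))     # per-feature standard deviations
R = S / np.outer(s, s)      # correlation matrix r_ij = s_ij / (s_i s_j)

print(m, S, R, sep="\n")
```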
Estimation of Missing Values
● What to do if certain instances have missing attributes?
● Ignore those instances: not a good idea if the sample is small
● Use ‘missing’ as an attribute: may give information
● Imputation: Fill in the missing value
  – Mean imputation: Use the most likely value (e.g., the mean)
  – Imputation by regression: Predict based on other attributes
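A minimal sketch of mean imputation, assuming missing entries are encoded as NaN; the array and its missing positions are hypothetical.

```python
import numpy as np

X = np.array([[25.0, 30000.0],
              [32.0, np.nan],      # missing income
              [np.nan, 61000.0],   # missing age
              [51.0, 58000.0]])

col_means = np.nanmean(X, axis=0)                # per-feature mean, ignoring NaNs
X_imputed = np.where(np.isnan(X), col_means, X)  # replace each NaN by its column mean
print(X_imputed)
```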
Multivariate Normal
● Have d attributes
● Often we can assume each one is distributed normally
● Attributes might be dependent/correlated
● Joint distribution of several correlated variables
  – P(X1 = x1, X2 = x2, ..., Xd = xd) = ?
  – Each Xi is normally distributed with mean μi and variance σi²
Multivariate Normal
● Mahalanobis distance: (x – μ)ᵀ Σ⁻¹ (x – μ)
● If two variables are correlated, their (large) covariance appears in Σ; multiplying by Σ⁻¹ means such directions contribute less to the Mahalanobis distance and therefore more to the probability
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right], \qquad \mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \Sigma)$$
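As a sketch of this density, the code below computes the Mahalanobis distance explicitly and plugs it into the formula; μ and Σ are made-up example values.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, written via the Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([1.0, 1.0]), mu, Sigma))
```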
Bivariate Normal
Multivariate Normal Distribution
● Mahalanobis distance: (x – μ)T ∑–1 (x – μ)
measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)
● Bivariate: d = 2
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right], \qquad z_i = \frac{x_i - \mu_i}{\sigma_i}$$
Bivariate Normal
Independent Inputs: Naive Bayes
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]$$
● If xi are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σi) Euclidean distance:
● If variances are also equal, reduces to Euclidean distance
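The independence case can be sketched directly as a product of univariate normals; the per-feature μ and σ values below are arbitrary examples.

```python
import numpy as np

def diag_gaussian_pdf(x, mu, sigma):
    """Multivariate normal with diagonal Sigma = product of d univariate normals."""
    z2 = ((x - mu) / sigma) ** 2                   # weighted (1/sigma_i) squared distances
    norm = (2 * np.pi) ** (len(mu) / 2) * np.prod(sigma)
    return np.exp(-0.5 * z2.sum()) / norm

mu = np.array([0.0, 5.0, -1.0])
sigma = np.array([1.0, 2.0, 0.5])
print(diag_gaussian_pdf(np.array([0.5, 4.0, -1.2]), mu, sigma))
```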
Projection Distribution
● Example: a vector of 3 features with a multivariate normal distribution
● Projection onto a 2-dimensional space (e.g. the XY plane) gives vectors of 2 features
● The projection also has a multivariate normal distribution
● In general, the projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal
1D projection
Multivariate Classification
● Assume members of a class come from a single multivariate distribution
● Multivariate normal is a good choice
  – Easy to analyze
  – Models many natural phenomena
  – Models a class as a single prototype source (mean) with slight random variation
Example
● Matching cars to customers
● Each car defines a class of matching customers
● Customers are described by (age, income)
● There is a correlation between age and income
● Assume each class is multivariate normal
● Need to learn P(x|C) from data
● Use Bayes' rule to compute P(C|x)
Parametric Classification
● If p (x | Ci ) ~ N ( μi , ∑i )
● Discriminant functions are
● Need to know Covariance Matrix and mean to compute discriminant functions.
● Can ignore P(x) as the same for all classes
$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

$$g_i(\mathbf{x}) = \log\frac{p(\mathbf{x}\mid C_i)\,P(C_i)}{p(\mathbf{x})} = \log p(\mathbf{x}\mid C_i) + \log P(C_i) - \log p(\mathbf{x})$$
$$= -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i) - \log p(\mathbf{x})$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

Dropping the terms common to all classes, the discriminant becomes

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
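A short sketch of these estimators and the resulting discriminant, using 0/1 class-membership indicators r_i^t; the toy data and labels are hypothetical.

```python
import numpy as np

def fit_class(X, r):
    """Estimate prior, mean, and covariance of one class from indicators r (0/1)."""
    Ni = r.sum()
    prior = Ni / len(r)
    mean = (r[:, None] * X).sum(axis=0) / Ni
    D = X - mean
    cov = (r[:, None] * D).T @ D / Ni
    return prior, mean, cov

def g(x, prior, mean, cov):
    """Quadratic discriminant g_i(x), with terms common to all classes dropped."""
    diff = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.solve(cov, diff)
            + np.log(prior))

X = np.array([[1.0, 2.0], [1.4, 1.7], [0.8, 2.5],
              [3.0, 4.1], [3.5, 3.6], [2.7, 4.4]])
r1 = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # indicators for class C1
p1, m1, S1 = fit_class(X, r1)
print(g(np.array([1.1, 2.0]), p1, m1, S1))
```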
Covariance Matrix per Class
● Quadratic discriminant
● Requires estimation of K*d*(d+1)/2 parameters for covariance matrix
$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{x} - 2\,\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i\right) + \log\hat{P}(C_i)$$
$$= \mathbf{x}^T\mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
[Figure: class likelihoods, the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Common Covariance Matrix S
$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$
● If there is not enough data, we can assume all classes share a common sample covariance matrix S
● The discriminant reduces to a linear discriminant (xᵀS⁻¹x is common to all discriminants and can be dropped)
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \text{where } \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
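A compact sketch of the shared-covariance case: pool the class covariances with the priors, then evaluate the linear discriminant; variable names mirror the formulas above.

```python
import numpy as np

def pooled_covariance(priors, covs):
    """S = sum_i P(C_i) * S_i."""
    return sum(p * S for p, S in zip(priors, covs))

def linear_discriminant(x, mean, prior, S_inv):
    """g_i(x) = w_i^T x + w_i0 with w_i = S^{-1} m_i."""
    w = S_inv @ mean
    w0 = -0.5 * mean @ S_inv @ mean + np.log(prior)
    return w @ x + w0
```

The predicted class is the one with the largest g_i(x).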
Common Covariance Matrix S
Diagonal S
$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$
● When xj, j = 1,...,d, are independent, Σ is diagonal
p (x|Ci) = ∏j p (xj |Ci) (Naive Bayes’ assumption)
Classify based on weighted Euclidean distance (in sj units) to the nearest mean
Diagonal S
Variances may be different
Diagonal S, equal variances
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\mathbf{m}_i\|^2}{2s^2} + \log\hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}(x_j^t - m_{ij})^2 + \log\hat{P}(C_i)$$
● Nearest mean classifier: Classify based on Euclidean distance to the nearest mean
● Each mean can be considered a prototype or template and this is template matching
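A minimal nearest-mean (template-matching) sketch; the prototypes and test points are made up, and equal priors are assumed so the log P̂(C_i) term is dropped.

```python
import numpy as np

def nearest_mean_predict(X, means):
    """Assign each row of X to the class whose mean is closest in Euclidean distance."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # N x K distances
    return dists.argmin(axis=1)

means = np.array([[0.0, 0.0], [5.0, 5.0]])   # one prototype (mean) per class
X = np.array([[0.5, -0.2], [4.1, 5.3]])
print(nearest_mean_predict(X, means))        # -> [0 1]
```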
Diagonal S, equal variances
Model Selection

Assumption                    Covariance matrix        No. of parameters
Shared, Hyperspheric          Si = S = s²I             1
Shared, Axis-aligned          Si = S, with sij = 0     d
Shared, Hyperellipsoidal      Si = S                   d(d+1)/2
Different, Hyperellipsoidal   Si                       K·d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Model Selection
● Different covariance matrix for each class
● Have to estimate many parameters
● Small bias, large variance
● Common covariance matrices, diagonal covariance, etc. reduce the number of parameters
● Increase bias but control variance
● In-between states?
Regularized Discriminant Analysis (RDA)
● a = b = 0: Quadratic classifier
● a = 0, b = 1: Shared covariance, linear classifier
● a = 1, b = 0: Diagonal covariance
● Choose the best a, b by cross-validation
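One common way to write the regularized covariance behind these settings is S'_i = a·s²I + b·S + (1 − a − b)·S_i, which reproduces the three special cases above; the sketch below assumes that formulation, with a and b to be chosen by cross-validation.

```python
import numpy as np

def rda_covariance(S_i, S, a, b):
    """Blend a*s^2*I (spherical), b*S (shared), and (1-a-b)*S_i (class-specific)."""
    d = S.shape[0]
    s2 = np.trace(S) / d          # overall variance scale for the spherical term
    return a * s2 * np.eye(d) + b * S + (1 - a - b) * S_i

# a = b = 0   -> S_i   (quadratic classifier)
# a = 0, b = 1 -> S     (shared covariance, linear classifier)
# a = 1, b = 0 -> s^2 I (diagonal/spherical covariance)
```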
Model Selection: Example
Model Selection
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$

If the $x_j$ are independent (Naive Bayes'):
$$p(\mathbf{x}\mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}\,(1-p_{ij})^{1-x_j}$$

the discriminant is linear:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}\mid C_i) + \log P(C_i) = \sum_j \left[x_j \log p_{ij} + (1-x_j)\log(1-p_{ij})\right] + \log P(C_i)$$

Estimated parameters:
$$\hat{p}_{ij} = \frac{\sum_t x_j^t\, r_i^t}{\sum_t r_i^t}$$
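A sketch of these estimates and the linear discriminant for binary features; the toy data are made up, and the probabilities are clipped to avoid log 0.

```python
import numpy as np

def fit_bernoulli(X, r):
    """p_hat_ij = sum_t x_j^t r_i^t / sum_t r_i^t for one class."""
    return (X * r[:, None]).sum(axis=0) / r.sum()

def discriminant(x, p, prior, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum() + np.log(prior)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
r1 = np.array([1.0, 1.0, 0.0, 0.0])              # membership indicators for class C1
p1 = fit_bernoulli(X, r1)
print(discriminant(np.array([1.0, 0.0, 1.0]), p1, prior=0.5))
```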
Multivariate Regression

Multivariate linear model:
$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t + \varepsilon$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left[r^t - (w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t)\right]^2$$
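Minimizing this squared error is ordinary least squares; a minimal sketch with a made-up data set, adding a column of ones so that w0 is the intercept.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # N x d inputs
r = np.array([5.1, 4.9, 11.2, 10.8])                             # targets r^t

A = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend 1s for the intercept w0
w, *_ = np.linalg.lstsq(A, r, rcond=None)        # minimizes sum_t (r^t - A w)^2
print(w)                                         # [w0, w1, ..., wd]
```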
CHAPTER 6:
Dimensionality Reduction
Dimensionality of input
● Number of observables (e.g. age and income)
● If the number of observables is increased:
  – More time to compute
  – More memory to store inputs and intermediate results
  – More complicated explanations (knowledge from learning)
    ● Regression with 100 vs. 2 parameters
  – No simple visualization
    ● 2D vs. 10D graph
  – Need much more data (curse of dimensionality)
    ● 1M one-dimensional inputs are not equal to 1 input of dimension 1M
Dimensionality reduction
● Some features (dimensions) bear little or no useful information (e.g. color of hair when selecting a car)
  – Can drop some features
  – Have to estimate from data which features can be dropped
● Several features can be combined together without loss, or even with gain, of information (e.g. income of all family members for a loan application)
  – Some features can be combined together
  – Have to estimate from data which features to combine
Feature Selection vs Extraction
● Feature selection: Choose k<d important features, ignoring the remaining d – k
  – Subset selection algorithms
● Feature extraction: Project the original xi, i = 1,...,d dimensions to new k<d dimensions, zj, j = 1,...,k
  – Principal Components Analysis (PCA)
  – Linear Discriminant Analysis (LDA)
  – Factor Analysis (FA)
Usage
● Have data of dimension d
● Reduce dimensionality to k<d
  – Discard unimportant features
  – Combine several features into one
● Use the resulting k-dimensional data set for
  – Learning for a classification problem (e.g. parameters of the probabilities P(x|C))
  – Learning for a regression problem (e.g. parameters of the model y = g(x|θ))
Subset selection
● Have an initial set of features of size d
● There are 2^d possible subsets
● Need a criterion to decide which subset is the best
● Need a way to search over the possible subsets
● Can't go over all 2^d possibilities, so we need some heuristics
“Goodness” of feature set
● Supervised
  – Train using the selected subset
  – Estimate error on a validation data set
● Unsupervised
  – Look at the input only (e.g. age, income, and savings)
  – Select the subset of 2 that bears most of the information about the person
Mutual Information
● Have 3 random variables (features) X, Y, Z and have to select the 2 that give the most information
● If X and Y are "correlated", then much of the information about Y is already in X
● It makes sense to select features which are "uncorrelated"
● Mutual information (based on the Kullback–Leibler divergence) is a more general measure of dependence
● Can be extended to n variables (the information that variables x1,...,xn have about variable xn+1)
Subset-selection
● Forward search
  – Start from an empty set of features
  – Try each of the remaining features
  – Estimate classification/regression error for adding a specific feature
  – Select the feature that gives the maximum improvement in validation error
  – Stop when there is no significant improvement
● Backward search
  – Start with the original set of size d
  – Drop features with the smallest impact on error
Subset Selection
There are 2^d subsets of d features
Forward search: Add the best feature at each step
  – Set of features F initially Ø
  – At each iteration, find the best new feature: j = argmin_i E(F ∪ x_i)
  – Add x_j to F if E(F ∪ x_j) < E(F)
Hill-climbing O(d^2) algorithm
Backward search: Start with all features and remove one at a time, if possible
Floating search (Add k, remove l)
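A sketch of the forward-search loop; `error(F)` stands in for whatever validation error the chosen model yields on feature subset F, so it is a hypothetical callback supplied by the user.

```python
def forward_select(d, error):
    """Greedy forward search over feature indices 0..d-1.

    error(F) must return the validation error for a set F of feature indices.
    """
    F = set()
    best_err = error(F)
    while True:
        candidates = [j for j in range(d) if j not in F]
        if not candidates:
            break
        errs = {j: error(F | {j}) for j in candidates}   # E(F ∪ {x_j}) for each candidate
        j = min(errs, key=errs.get)                      # j = argmin_j E(F ∪ {x_j})
        if errs[j] < best_err:                           # add x_j only if it improves E
            F.add(j)
            best_err = errs[j]
        else:
            break
    return F, best_err
```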
Floating Search
● Forward and backward search are "greedy" algorithms
  – Select the best option at a single step
  – Do not always achieve the optimum value
● Floating search
  – Two types of steps: Add k, remove l
  – More computations
Feature Extraction
● Face recognition problem
  – Training data input: pairs of Image + Label (name)
  – Classifier input: Image
  – Classifier output: Label (name)
● Image: matrix of 256×256 = 65536 values in the range 0..255
● Each pixel bears little information, so we can't simply select the 100 best ones
● The average of pixels around specific positions may give an indication about, e.g., eye color
Projection
● Find a projection matrix w from d-dimensional to k-dimensional vectors that keeps error low
PCA: Motivation
● Assume the d observables are linear combinations of k<d vectors
● z_i = w_{i1}x_1 + … + w_{id}x_d
● We would like to work with this basis, as it has lower dimension and retains (almost) all the required information
● What we expect from such a basis:
  – Uncorrelated, or otherwise it can be reduced further
  – Large variance (e.g. z_i has large variation), or otherwise it bears no information
PCA: Motivation
PCA: Motivation
● Choose directions such that the total variance of the data is maximal
  – Maximize total variance
● Choose directions that are orthogonal
  – Minimize correlation
● Choose k<d orthogonal directions which maximize total variance
PCA
● Choose the first direction w1
● Maximize the variance Var(z1) = w1ᵀΣw1 subject to the constraint w1ᵀw1 = 1, using Lagrange multipliers: max_w1 w1ᵀΣw1 − α(w1ᵀw1 − 1)
● Taking derivatives gives Σw1 = αw1
● So w1 is an eigenvector of Σ; since we want to maximize w1ᵀΣw1 = α, we should choose the eigenvector with the largest eigenvalue
PCA
● d-dimensional feature space
● d-by-d symmetric covariance matrix estimated from samples
● Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors
● The first eigenvector is the direction with the largest variance
What PCA does
z = WT(x – m)
where the columns of W are the eigenvectors of ∑, and m is sample mean
Centers the data at the origin and rotates the axes
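A compact sketch of exactly this recipe: eigendecompose the sample covariance, sort the eigenvalues in descending order, keep the top k eigenvectors as the columns of W, and project z = Wᵀ(x − m).

```python
import numpy as np

def pca_fit(X, k):
    """Return (m, W, eigenvalues) for the projection z = W^T (x - m)."""
    m = X.mean(axis=0)
    D = X - m
    Sigma = D.T @ D / len(X)                 # sample covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    order = np.argsort(vals)[::-1]           # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    return m, vecs[:, :k], vals

def pca_transform(X, m, W):
    return (X - m) @ W                       # each row is z^t = W^T (x^t - m)

X = np.random.RandomState(0).randn(100, 5)   # made-up data
m, W, vals = pca_fit(X, k=2)
print(pca_transform(X, m, W).shape)          # (100, 2)
```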
How to choose k ?
● Proportion of Variance (PoV) explained:

$$\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$$

when the λi are sorted in descending order
● Typically, stop at PoV > 0.9
● Scree graph plots PoV vs. k; stop at the “elbow”
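Continuing from the eigenvalues computed by PCA, choosing k by PoV is a short computation; the 0.9 threshold follows the slide, while the example eigenvalues are made up.

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k whose proportion of variance reaches the threshold."""
    vals = np.sort(eigvals)[::-1]            # descending order
    pov = np.cumsum(vals) / vals.sum()       # PoV for k = 1..d
    return int(np.searchsorted(pov, threshold) + 1), pov

k, pov = choose_k(np.array([4.0, 2.0, 1.0, 0.5, 0.5]))
print(k, pov)                                # k = 4, PoV = [0.5, 0.75, 0.875, 0.9375, 1.0]
```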
PCA
● PCA is unsupervised (does not take into account class information)
● Can take classes into account: Karhunen–Loève expansion
  – Estimate the covariance per class
  – Take the average weighted by the priors
● Common Principal Components
  – Assume all classes have the same eigenvectors (directions) but different variances
PCA
● Does not try to explain noise
  – Large noise can become a new dimension / the largest PC
● Interested in the resulting uncorrelated variables which explain a large portion of the total sample variance
● Sometimes we are interested instead in the shared variance (common factors) that affects the data