
Page 1:

INTRODUCTION TO Machine Learning

2nd Edition

ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/MachineLearning/ © The MIT Press, 2010

[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Lecture Slides for

Page 2:

Outline

This class: Ch 5: Multivariate Methods
● Multivariate Data
● Parameter Estimation
● Estimation of Missing Values
● Multivariate Classification

Page 3:

CHAPTER 5:

Multivariate Methods

Page 4:

Multivariate Distribution

● Assume all members of a class came from a joint distribution
● Can learn the class-conditional distributions P(x|C) from data
● Assign a new instance to the most probable class P(C|x) using Bayes' rule
● An instance is described by a vector of correlated parameters
● Realm of multivariate distributions
● Multivariate normal

Page 5:

Multivariate Data

● Multiple measurements (sensors)
● d inputs/features/attributes: d-variate
● N instances/observations/examples

X = [ X₁¹  X₂¹  …  X_d¹
      X₁²  X₂²  …  X_d²
      ⋮
      X₁ᴺ  X₂ᴺ  …  X_dᴺ ]   (an N × d data matrix)

Page 6:

Multivariate Parameters

Mean:  E[x] = μ = [μ₁, …, μ_d]ᵀ

Covariance:  σ_ij ≡ Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]

Correlation:  Corr(X_i, X_j) ≡ ρ_ij = σ_ij / (σ_i σ_j)

Σ ≡ Cov(X) = E[(X − μ)(X − μ)ᵀ] =
    [ σ₁²    σ₁₂   …  σ₁d
      σ₂₁    σ₂²   …  σ₂d
      ⋮
      σ_d1   σ_d2  …  σ_d² ]

Page 7:

Parameter Estimation

Sample mean:  m_i = (1/N) Σ_t x_i^t ,   i = 1, …, d

Covariance matrix S:  s_ij = (1/N) Σ_t (x_i^t − m_i)(x_j^t − m_j)

Correlation matrix R:  r_ij = s_ij / (s_i s_j)
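A minimal NumPy sketch of these estimators (the small data matrix X below is hypothetical):

```python
import numpy as np

# Hypothetical N x d data matrix: N = 4 instances, d = 2 features (e.g. age, income)
X = np.array([[25.0, 38000.0],
              [47.0, 62000.0],
              [33.0, 45000.0],
              [51.0, 70000.0]])

m = X.mean(axis=0)                       # sample mean m_i
S = np.cov(X, rowvar=False, bias=True)   # covariance matrix S with the 1/N normalization used above
s = np.sqrt(np.diag(S))                  # per-feature standard deviations s_i
R = S / np.outer(s, s)                   # correlation matrix R, r_ij = s_ij / (s_i s_j)
print(m, S, R, sep="\n")
```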

Page 8:

Estimation of Missing Values


● What to do if certain instances have missing attributes?

● Ignore those instances: not a good idea if the sample is small

● Use ‘missing’ as an attribute: may give information

● Imputation: Fill in the missing value
  – Mean imputation: Use the most likely value (e.g., the mean)
  – Imputation by regression: Predict based on the other attributes
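A small sketch of mean imputation in NumPy (missing entries are assumed to be encoded as NaN in this hypothetical array):

```python
import numpy as np

# Hypothetical data with missing attribute values encoded as NaN
X = np.array([[25.0,   np.nan],
              [47.0, 62000.0],
              [np.nan, 45000.0],
              [51.0, 70000.0]])

col_means = np.nanmean(X, axis=0)                # per-attribute mean, ignoring missing values
X_imputed = np.where(np.isnan(X), col_means, X)  # mean imputation: fill each gap with its column mean
print(X_imputed)
```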

Page 9:

Multivariate Normal


● Have d attributes
● Often we can assume each one is distributed normally
● Attributes might be dependent/correlated
● Joint distribution of several correlated variables
  – P(X₁=x₁, X₂=x₂, …, X_d=x_d) = ?
  – Each X_i is normally distributed with mean μ_i and variance σ_i²

Page 10:

Multivariate Normal


● Mahalanobis distance: (x – μ)ᵀ Σ⁻¹ (x – μ)
● If two variables are highly correlated, their (large) covariance is inverted, so they contribute less to the Mahalanobis distance and hence more to the probability

p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp[ −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ]

x ~ N_d(μ, Σ)
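A sketch of both quantities with NumPy/SciPy; the mean vector and covariance matrix below are made-up examples:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([30.0, 5.0])            # hypothetical mean vector
Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])        # hypothetical covariance matrix
x = np.array([33.0, 6.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
diff = x - mu
mahalanobis_sq = diff @ np.linalg.inv(Sigma) @ diff

# Multivariate normal density p(x) at the same point
density = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(mahalanobis_sq, density)
```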

Page 11:

Bivariate Normal

Page 12:

Multivariate Normal Distribution


● Mahalanobis distance: (x – μ)T ∑–1 (x – μ)

measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)

● Bivariate: d = 2

Σ = [ σ₁²     ρσ₁σ₂
      ρσ₁σ₂   σ₂²   ]

p(x₁, x₂) = 1 / (2π σ₁ σ₂ √(1 − ρ²)) · exp[ −1/(2(1 − ρ²)) · (z₁² − 2ρ z₁ z₂ + z₂²) ]

where z_i = (x_i − μ_i) / σ_i

Page 13:


Bivariate Normal

Page 14:

Bivariate Normal


Page 15:

Independent Inputs: Naive Bayes


● If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:

p(x) = ∏_{i=1}^{d} p_i(x_i) = 1 / ((2π)^{d/2} ∏_{i=1}^{d} σ_i) · exp[ −½ Σ_{i=1}^{d} ((x_i − μ_i)/σ_i)² ]

● If the variances are also equal, it reduces to the Euclidean distance
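A quick numerical check of this factorization (the means and standard deviations below are hypothetical):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.array([2.0, -1.0, 0.5])     # hypothetical per-feature means
sigma = np.array([1.0, 0.5, 2.0])   # hypothetical per-feature standard deviations
x = np.array([1.5, -0.8, 1.0])

# Product of independent 1-D normal densities
p_product = np.prod(norm(loc=mu, scale=sigma).pdf(x))

# Joint density from a multivariate normal with diagonal covariance
p_joint = multivariate_normal(mean=mu, cov=np.diag(sigma**2)).pdf(x)
print(p_product, p_joint)  # equal when the inputs are independent
```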

Page 16:

Projection Distribution

● Example: a vector of 3 features with a multivariate normal distribution
● Projection onto a 2-dimensional space (e.g. the XY plane) gives vectors of 2 features
● The projection also has a multivariate normal distribution
● The projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal

Page 17:

1D projection

Page 18:

Multivariate Classification

● Assume the members of a class come from a single multivariate distribution
● The multivariate normal is a good choice
  – Easy to analyze
  – Models many natural phenomena
  – Models a class as a single prototype source (the mean) with slight random variation

Page 19:

Example

● Matching cars to customers
● Each car defines a class of matching customers
● Customers are described by (age, income)
● There is a correlation between age and income
● Assume each class is multivariate normal
● Need to learn P(x|C) from data
● Use Bayes' rule to compute P(C|x)

Page 20:

Parametric Classification

● If p (x | Ci ) ~ N ( μi , ∑i )

● Discriminant functions are

● Need to know the covariance matrix and the mean to compute the discriminant functions
● Can ignore P(x) since it is the same for all classes

p(x | C_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ −½ (x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) ]

g_i(x) = log p(x | C_i) + log P(C_i) − log p(x)
       = −(d/2) log 2π − ½ log |Σ_i| − ½ (x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) + log P(C_i) − log p(x)

Page 21:

Estimation of Parameters

P̂(C_i) = Σ_t r_i^t / N

m_i = Σ_t r_i^t x^t / Σ_t r_i^t

S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t

where r_i^t = 1 if x^t belongs to C_i and 0 otherwise.

g_i(x) = −½ log |S_i| − ½ (x − m_i)ᵀ S_i⁻¹ (x − m_i) + log P̂(C_i)
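Putting the estimates and the discriminant together, here is a hedged NumPy sketch for two classes; the synthetic data, means, and spreads are all hypothetical:

```python
import numpy as np

# Hypothetical 2-class training data: X is (N, d), y holds class labels 0/1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([30.0, 40.0], [5.0, 8.0], size=(50, 2)),
               rng.normal([50.0, 70.0], [6.0, 10.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

priors, means, covs = [], [], []
for c in (0, 1):
    Xc = X[y == c]
    priors.append(len(Xc) / len(X))                     # P̂(C_i)
    means.append(Xc.mean(axis=0))                       # m_i
    covs.append(np.cov(Xc, rowvar=False, bias=True))    # S_i (1/N_i normalization)

def g(x, i):
    """g_i(x) = -1/2 log|S_i| - 1/2 (x - m_i)^T S_i^{-1} (x - m_i) + log P̂(C_i)."""
    d = x - means[i]
    return (-0.5 * np.log(np.linalg.det(covs[i]))
            - 0.5 * d @ np.linalg.inv(covs[i]) @ d
            + np.log(priors[i]))

x_new = np.array([42.0, 55.0])
print(np.argmax([g(x_new, i) for i in (0, 1)]))  # index of the most probable class
```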

Page 22:

Covariance Matrix per Class

● Quadratic discriminant

● Requires estimation of K·d·(d+1)/2 parameters for the covariance matrices

g_i(x) = xᵀ W_i x + w_iᵀ x + w_{i0}

where
W_i = −½ S_i⁻¹
w_i = S_i⁻¹ m_i
w_{i0} = −½ m_iᵀ S_i⁻¹ m_i − ½ log |S_i| + log P̂(C_i)

Page 23:

[Figure: likelihoods; posterior for C₁; discriminant: P(C₁|x) = 0.5]

Page 24:

Common Covariance Matrix S


● If there is not enough data, we can assume that all classes share a common sample covariance matrix S:

S = Σ_i P̂(C_i) S_i

● The discriminant reduces to a linear discriminant (xᵀ S⁻¹ x is common to all discriminants and can be dropped):

g_i(x) = −½ (x − m_i)ᵀ S⁻¹ (x − m_i) + log P̂(C_i)
       = w_iᵀ x + w_{i0}

where w_i = S⁻¹ m_i and w_{i0} = −½ m_iᵀ S⁻¹ m_i + log P̂(C_i)
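A minimal sketch of the shared-covariance case; the per-class estimates below are hypothetical numbers standing in for the quantities defined above:

```python
import numpy as np

# Hypothetical per-class estimates for two classes: priors P̂(C_i), means m_i, covariances S_i
priors = [0.5, 0.5]
means = [np.array([30.0, 40.0]), np.array([50.0, 70.0])]
covs = [np.array([[25.0, 10.0], [10.0, 64.0]]),
        np.array([[36.0, 20.0], [20.0, 100.0]])]

S = sum(p * S_i for p, S_i in zip(priors, covs))   # shared covariance S = sum_i P̂(C_i) S_i
S_inv = np.linalg.inv(S)

def g(x, i):
    """Linear discriminant g_i(x) = w_i^T x + w_i0."""
    w = S_inv @ means[i]
    w0 = -0.5 * means[i] @ S_inv @ means[i] + np.log(priors[i])
    return w @ x + w0

x_new = np.array([42.0, 55.0])
print(np.argmax([g(x_new, i) for i in (0, 1)]))
```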

Page 25:

Common Covariance Matrix S

Page 26:

Diagonal S


● When the x_j , j = 1, …, d, are independent, Σ is diagonal
  p(x|C_i) = ∏_j p(x_j|C_i)   (Naive Bayes' assumption)

g_i(x) = −½ Σ_{j=1}^{d} ((x_j − m_ij) / s_j)² + log P̂(C_i)

● Classify based on weighted Euclidean distance (in s_j units) to the nearest mean

Page 27:

Diagonal S


variances may be different

Page 28:

Diagonal S, equal variances


g_i(x) = −‖x − m_i‖² / (2s²) + log P̂(C_i)
       = −1/(2s²) · Σ_{j=1}^{d} (x_j − m_ij)² + log P̂(C_i)

● Nearest mean classifier: Classify based on Euclidean distance to the nearest mean

● Each mean can be considered a prototype or template and this is template matching
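A tiny sketch of the nearest-mean (template-matching) rule, with made-up class means:

```python
import numpy as np

means = np.array([[30.0, 40.0],    # hypothetical prototype (mean) for class 0
                  [50.0, 70.0]])   # hypothetical prototype (mean) for class 1

def nearest_mean(x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

print(nearest_mean(np.array([42.0, 55.0])))
```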

Page 29:

Diagonal S, equal variances

Page 30:

Model Selection

Assumption                     Covariance matrix       No. of parameters
Shared, Hyperspheric           Si = S = s²I            1
Shared, Axis-aligned           Si = S, with sij = 0    d
Shared, Hyperellipsoidal       Si = S                  d(d+1)/2
Different, Hyperellipsoidal    Si                      K·d(d+1)/2

As we increase complexity (less restricted S), bias decreases and variance increases

Assume simple models (allow some bias) to control variance (regularization)

Page 31:

Model Selection

● Different covariance matrix for each class
● Have to estimate many parameters
● Small bias, large variance
● Common covariance matrices, diagonal covariance, etc. reduce the number of parameters
● Increase bias but control variance
● In-between states?

Page 32:

Regularized Discriminant Analysis (RDA)

● a = b = 0: Quadratic classifier
● a = 0, b = 1: Shared covariance, linear classifier
● a = 1, b = 0: Diagonal covariance
● Choose the best a, b by cross-validation
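The slide does not spell out how a and b enter the covariance estimate; one common parametrization, used here as an assumption, writes the regularized per-class covariance as S'_i = a·σ²I + b·S + (1 − a − b)·S_i, which recovers the special cases listed above (with a = 1, b = 0 giving a spherical rather than a general diagonal covariance). A sketch:

```python
import numpy as np

def regularized_cov(S_i, S_shared, a, b):
    """Assumed RDA parametrization: S'_i = a*sigma^2*I + b*S + (1 - a - b)*S_i."""
    d = S_i.shape[0]
    sigma2 = np.trace(S_shared) / d              # average variance for the spherical term
    return a * sigma2 * np.eye(d) + b * S_shared + (1 - a - b) * S_i

# Hypothetical per-class and shared covariance estimates
S_i = np.array([[4.0, 1.0], [1.0, 3.0]])
S_shared = np.array([[5.0, 0.5], [0.5, 2.5]])

print(regularized_cov(S_i, S_shared, a=0.0, b=0.0))  # per-class S_i: quadratic classifier
print(regularized_cov(S_i, S_shared, a=0.0, b=1.0))  # shared S: linear classifier
print(regularized_cov(S_i, S_shared, a=1.0, b=0.0))  # spherical/diagonal covariance
```

In practice the pair (a, b) would be chosen by cross-validation, as the last bullet says.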

Page 33:

Model Selection: Example

Page 34:

Model Selection

Page 35:

Discrete Features

Binary features:  p_ij ≡ p(x_j = 1 | C_i)

If the x_j are independent (Naive Bayes'):

p(x | C_i) = ∏_{j=1}^{d} p_ij^{x_j} (1 − p_ij)^{1 − x_j}

the discriminant is linear:

g_i(x) = log p(x | C_i) + log P(C_i)
       = Σ_j [ x_j log p_ij + (1 − x_j) log(1 − p_ij) ] + log P(C_i)

Estimated parameters:  p̂_ij = Σ_t x_j^t r_i^t / Σ_t r_i^t
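A sketch of these estimates and the linear discriminant for binary features; the tiny data set is hypothetical, and a small Laplace-style smoothing term (not on the slide) is added to avoid log 0:

```python
import numpy as np

# Hypothetical binary data: rows are instances, columns are binary features
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([0, 0, 0, 1, 1])

classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
# p̂_ij: per-class estimate of p(x_j = 1 | C_i), with +1/+2 smoothing to avoid log(0)
p = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def g(x):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i), for every class i."""
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)

print(np.argmax(g(np.array([1, 0, 1]))))  # predicted class for a new instance
```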

Page 36:

Multivariate Regression

r^t = g(x^t | w_0, w_1, …, w_d) + ε

Multivariate linear model:

g(x^t) = w_0 + w_1 x_1^t + w_2 x_2^t + … + w_d x_d^t

E(w_0, w_1, …, w_d | X) = ½ Σ_t [ r^t − (w_0 + w_1 x_1^t + w_2 x_2^t + … + w_d x_d^t) ]²

Page 37:

Multivariate Regression

E(w_0, w_1, …, w_d | X) = ½ Σ_t [ r^t − (w_0 + w_1 x_1^t + w_2 x_2^t + … + w_d x_d^t) ]²
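A least-squares sketch of the multivariate linear model in NumPy; the data and the true weights used to generate them are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                                # hypothetical inputs (N=100, d=3)
r = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=100)   # targets with noise

X1 = np.column_stack([np.ones(len(X)), X])     # prepend a column of ones so w_0 is the intercept
w, *_ = np.linalg.lstsq(X1, r, rcond=None)     # minimizes E(w | X) = 1/2 sum_t (r^t - w^T x^t)^2
print(w)                                       # close to [2.0, 1.5, -0.7, 0.3]
```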

Page 38:

CHAPTER 6:

Dimensionality Reduction

Page 39:

Dimensionality of input


● Number of observables (e.g. age and income)
● If the number of observables is increased:
  – More time to compute
  – More memory to store inputs and intermediate results
  – More complicated explanations (knowledge from learning)
    ● Regression with 100 vs. 2 parameters
  – No simple visualization
    ● 2D vs. 10D graph
  – Need much more data (curse of dimensionality)
    ● 1M 1-dimensional inputs is not equal to 1 input of dimension 1M

Page 40:

Dimensionality reduction


● Some features (dimensions) bear little or no useful information (e.g. color of hair for a car selection)
  – Can drop some features
  – Have to estimate from data which features can be dropped
● Several features can be combined together without loss, or even with a gain, of information (e.g. income of all family members for a loan application)
  – Some features can be combined together
  – Have to estimate from data which features to combine

Page 41:

Feature Selection vs Extraction


● Feature selection: Choosing k<d important features, ignoring the remaining d – k
  – Subset selection algorithms
● Feature extraction: Project the original x_i , i = 1,…,d dimensions onto new k<d dimensions, z_j , j = 1,…,k
  – Principal Components Analysis (PCA)
  – Linear Discriminant Analysis (LDA)
  – Factor Analysis (FA)

Page 42:

Usage


● Have data of dimension d
● Reduce dimensionality to k<d
  – Discard unimportant features
  – Combine several features into one
● Use the resulting k-dimensional data set for
  – Learning for a classification problem (e.g. the parameters of the probabilities P(x|C))
  – Learning for a regression problem (e.g. the parameters of the model y = g(x|θ))

Page 43:

Subset selection


● Have an initial set of features of size d
● There are 2^d possible subsets
● Need a criterion to decide which subset is best
● Need a way to search over the possible subsets
● Can't go over all 2^d possibilities
● Need some heuristics

Page 44:

“Goodness” of feature set


● Supervised
  – Train using the selected subset
  – Estimate the error on a validation data set
● Unsupervised
  – Look at the input only (e.g. age, income and savings)
  – Select the subset of 2 that bears most of the information about the person

Page 45:

Mutual Information


● Have 3 random variables (features) X, Y, Z and have to select the 2 that give the most information
● If X and Y are "correlated", then much of the information about Y is already in X
● It makes sense to select features that are "uncorrelated"
● Mutual Information (Kullback–Leibler divergence) is a more general measure of "mutual information"
● Can be extended to n variables (the information that variables x1,…,xn have about variable xn+1)

Page 46:

Subset-selection


● Forward search
  – Start from an empty set of features
  – Try each of the remaining features
  – Estimate the classification/regression error for adding each specific feature
  – Select the feature that gives the maximum improvement in validation error
  – Stop when there is no significant improvement
● Backward search
  – Start with the original set of size d
  – Drop features with the smallest impact on error

Page 47:

Subset Selection

There are 2^d subsets of d features
Forward search: Add the best feature at each step
– Set of features F initially Ø.
– At each iteration, find the best new feature j = argmin_i E(F ∪ x_i)
– Add x_j to F if E(F ∪ x_j) < E(F)
Hill-climbing O(d^2) algorithm
Backward search: Start with all features and remove one at a time, if possible.
Floating search (Add k, remove l)
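A greedy forward-search sketch; the error function below is a toy stand-in (training error of a nearest-mean classifier) for the validation error the slide refers to, and the data are synthetic:

```python
import numpy as np

def forward_selection(X, y, error_fn, max_features=None):
    """Greedy forward search: start from the empty feature set and, at each step,
    add the feature that most reduces the error; stop when nothing improves."""
    n_features = X.shape[1]
    selected, best_err = [], error_fn(X, y, [])
    while len(selected) < (max_features or n_features):
        candidates = [j for j in range(n_features) if j not in selected]
        errs = {j: error_fn(X, y, selected + [j]) for j in candidates}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:          # no improvement: stop
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return selected

def example_error(X, y, features):
    """Toy error: misclassification rate of a nearest-mean classifier on the chosen subset."""
    if not features:
        return 1.0
    Xs = X[:, features]
    means = np.array([Xs[y == c].mean(axis=0) for c in np.unique(y)])
    preds = np.argmin(((Xs[:, None, :] - means) ** 2).sum(axis=2), axis=1)
    return np.mean(preds != y)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                         # hypothetical data with 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)        # labels depend only on features 0 and 2
print(forward_selection(X, y, example_error))
```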

Page 48:

Floating Search


● Forward and backward search are "greedy" algorithms
  – Select the best option at each single step
  – Do not always achieve the optimum value
● Floating search
  – Two types of steps: Add k, remove l
  – More computations

Page 49:

Feature Extraction


● Face recognition problem
  – Training data input: pairs of Image + Label (name)
  – Classifier input: Image
  – Classifier output: Label (name)
● Image: a matrix of 256×256 = 65,536 values in the range 0..255
● Each pixel bears little information, so we can't simply select the 100 best pixels
● The average of the pixels around specific positions may give an indication of, e.g., eye color

Page 50:

Projection


● Find a projection matrix W from d-dimensional to k-dimensional vectors that keeps the error low

Page 51:

PCA: Motivation


● Assume that the d observables are linear combinations of k<d vectors
● z_i = w_i1 x_1 + … + w_id x_d
● We would like to work with this basis, as it has lower dimension and retains all (almost all) of the required information
● What we expect from such a basis:
  – Uncorrelated, or otherwise it can be reduced further
  – Large variance (e.g. w_i1 has large variation), or otherwise it bears no information

Page 52:

PCA: Motivation

Page 53:

PCA: Motivation


● Choose directions such that the total variance of the data is maximum
  – Maximize total variance
● Choose directions that are orthogonal
  – Minimize correlation
● Choose k<d orthogonal directions which maximize the total variance

Page 54:

PCA


● Choose the directions only: maximize the variance Var(z₁) = w₁ᵀ Σ w₁ subject to the constraint ‖w₁‖ = 1, using Lagrange multipliers
● Taking derivatives gives Σ w₁ = λ w₁, so w₁ is an eigenvector of Σ
● Since we want to maximize the variance, we should choose the eigenvector with the largest eigenvalue

Page 55:

PCA


● d-dimensional feature space
● d×d symmetric covariance matrix estimated from the samples
● Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors
● The first eigenvector is the direction with the largest variance

Page 56:

What PCA does


z = WT(x – m)

where the columns of W are the eigenvectors of ∑, and m is the sample mean

Centers the data at the origin and rotates the axes
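A NumPy sketch of this projection; the 3-dimensional data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 1.5, 0.5],
                                 [1.5, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=200)   # hypothetical d = 3 data

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)                  # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]          # columns of W = eigenvectors with the k largest eigenvalues
Z = (X - m) @ W             # z = W^T (x - m): center the data, then rotate/project
print(Z.shape)              # (200, 2)
```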

Page 57:

How to choose k ?


● Proportion of Variance (PoV) explained:

PoV = (λ₁ + λ₂ + … + λ_k) / (λ₁ + λ₂ + … + λ_k + … + λ_d)

when the λ_i are sorted in descending order
● Typically, stop at PoV > 0.9
● Scree graph: plot of PoV vs. k, stop at the "elbow"
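A short sketch for choosing k from hypothetical eigenvalues:

```python
import numpy as np

eigvals = np.array([5.2, 2.1, 0.9, 0.4, 0.2, 0.1])   # hypothetical eigenvalues, sorted descending

pov = np.cumsum(eigvals) / eigvals.sum()             # PoV for k = 1, ..., d
k = int(np.argmax(pov > 0.9)) + 1                    # smallest k with PoV > 0.9
print(pov, k)
```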

Page 58:

Page 59:

PCA


● PCA is unsupervised (does not take class information into account)
● Can take classes into account: Karhunen-Loève expansion
  – Estimate the covariance per class
  – Take the average weighted by the priors
● Common Principal Components
  – Assume all classes have the same eigenvectors (directions) but different variances

Page 60:

PCA


● PCA does not try to explain noise
  – Large noise can become a new dimension / the largest PC
● We are interested in the resulting uncorrelated variables that explain a large portion of the total sample variance
● Sometimes we are interested instead in the explained shared variance (common factors) that affect the data