Lecture Slides for
INTRODUCTION TO Machine Learning, 2nd Edition
ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/MachineLearning/
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Outline
This class: Ch 5: Multivariate Methods
● Multivariate Data
● Parameter Estimation
● Estimation of Missing Values
● Multivariate Classification
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
CHAPTER 5:
Multivariate Methods
Multivariate Distribution
● Assume all members of a class came from a joint distribution
● Can learn the distribution P(x|C) from data
● Assign a new instance to the most probable class P(C|x) using Bayes' rule
● An instance is described by a vector of correlated parameters
● Realm of multivariate distributions
● Multivariate normal
Multivariate Data
$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
● Multiple measurements (sensors)
● d inputs/features/attributes: d-variate
● N instances/observations/examples
Multivariate Parameters
Mean: $E[\mathbf{x}] \equiv \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\Sigma \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation
Sample mean: $m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N}(x_i^t - m_i)(x_j^t - m_j)$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
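As a concrete illustration, here is a minimal NumPy sketch of these estimators; the small data matrix is a made-up example, and the 1/N convention matches the formulas above.

```python
import numpy as np

# Hypothetical data: N=5 instances, d=2 features (e.g. age, income)
X = np.array([[25, 30000],
              [32, 42000],
              [47, 61000],
              [51, 58000],
              [38, 52000]], dtype=float)
N, d = X.shape

m = X.mean(axis=0)          # sample mean m_i
D = X - m                   # centered data
S = D.T @ D / N             # covariance matrix s_ij (1/N convention)
s = np.sqrt(np.diag(S))     # per-feature standard deviations
R = S / np.outer(s, s)      # correlation matrix r_ij = s_ij / (s_i s_j)

print(m, S, R, sep="\n")
```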
Estimation of Missing Values
● What to do if certain instances have missing attributes?
● Ignore those instances: not a good idea if the sample is small
● Use ‘missing’ as an attribute: may give information
● Imputation: Fill in the missing value
  – Mean imputation: Use the most likely value (e.g., the mean)
  – Imputation by regression: Predict based on other attributes
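A minimal sketch of mean imputation, assuming missing entries are encoded as NaN; the array and its missing positions are hypothetical.

```python
import numpy as np

X = np.array([[25.0, 30000.0],
              [32.0, np.nan],      # missing income
              [np.nan, 61000.0],   # missing age
              [51.0, 58000.0]])

col_means = np.nanmean(X, axis=0)                # per-feature mean, ignoring NaNs
X_imputed = np.where(np.isnan(X), col_means, X)  # replace each NaN by its column mean
print(X_imputed)
```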
Multivariate Normal
● Have d attributes
● Often we can assume each one is distributed normally
● Attributes might be dependent/correlated
● Joint distribution of several correlated variables
  – P(X1 = x1, X2 = x2, ..., Xd = xd) = ?
  – Each Xi is normally distributed with mean μi and variance σi²
Multivariate Normal
● Mahalanobis distance: (x – μ)ᵀ Σ⁻¹ (x – μ)
● If two variables are correlated, their (large) covariance appears in Σ; multiplying by Σ⁻¹ means such directions contribute less to the Mahalanobis distance and therefore more to the probability
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right], \qquad \mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \Sigma)$$
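As a sketch of this density, the code below computes the Mahalanobis distance explicitly and plugs it into the formula; μ and Σ are made-up example values.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, written via the Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([1.0, 1.0]), mu, Sigma))
```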
Bivariate Normal
Multivariate Normal Distribution
● Mahalanobis distance: (x – μ)T ∑–1 (x – μ)
measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)
● Bivariate: d = 2
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right], \qquad z_i = \frac{x_i - \mu_i}{\sigma_i}$$
Bivariate Normal
Independent Inputs: Naive Bayes
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]$$
● If xi are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σi) Euclidean distance:
● If variances are also equal, reduces to Euclidean distance
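The independence case can be sketched directly as a product of univariate normals; the per-feature μ and σ values below are arbitrary examples.

```python
import numpy as np

def diag_gaussian_pdf(x, mu, sigma):
    """Multivariate normal with diagonal Sigma = product of d univariate normals."""
    z2 = ((x - mu) / sigma) ** 2                   # weighted (1/sigma_i) squared distances
    norm = (2 * np.pi) ** (len(mu) / 2) * np.prod(sigma)
    return np.exp(-0.5 * z2.sum()) / norm

mu = np.array([0.0, 5.0, -1.0])
sigma = np.array([1.0, 2.0, 0.5])
print(diag_gaussian_pdf(np.array([0.5, 4.0, -1.2]), mu, sigma))
```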
Projection Distribution
● Example: a vector of 3 features with a multivariate normal distribution
● Projection onto a 2-dimensional space (e.g. the XY plane) gives vectors of 2 features
● The projection also has a multivariate normal distribution
● In general, the projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal
1D projection
Multivariate Classification
● Assume members of a class come from a single multivariate distribution
● Multivariate normal is a good choice
  – Easy to analyze
  – Models many natural phenomena
  – Models a class as a single prototype source (mean) with slight random variation
Example
● Matching cars to customers
● Each car defines a class of matching customers
● Customers are described by (age, income)
● There is a correlation between age and income
● Assume each class is multivariate normal
● Need to learn P(x|C) from data
● Use Bayes' rule to compute P(C|x)
Parametric Classification
● If p (x | Ci ) ~ N ( μi , ∑i )
● Discriminant functions are
● Need to know Covariance Matrix and mean to compute discriminant functions.
● Can ignore P(x) as the same for all classes
$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

$$g_i(\mathbf{x}) = \log\frac{p(\mathbf{x}\mid C_i)\,P(C_i)}{p(\mathbf{x})} = \log p(\mathbf{x}\mid C_i) + \log P(C_i) - \log p(\mathbf{x})$$
$$= -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i) - \log p(\mathbf{x})$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

Dropping the terms common to all classes, the discriminant becomes

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
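A short sketch of these estimators and the resulting discriminant, using 0/1 class-membership indicators r_i^t; the toy data and labels are hypothetical.

```python
import numpy as np

def fit_class(X, r):
    """Estimate prior, mean, and covariance of one class from indicators r (0/1)."""
    Ni = r.sum()
    prior = Ni / len(r)
    mean = (r[:, None] * X).sum(axis=0) / Ni
    D = X - mean
    cov = (r[:, None] * D).T @ D / Ni
    return prior, mean, cov

def g(x, prior, mean, cov):
    """Quadratic discriminant g_i(x), with terms common to all classes dropped."""
    diff = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.solve(cov, diff)
            + np.log(prior))

X = np.array([[1.0, 2.0], [1.4, 1.7], [0.8, 2.5],
              [3.0, 4.1], [3.5, 3.6], [2.7, 4.4]])
r1 = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # indicators for class C1
p1, m1, S1 = fit_class(X, r1)
print(g(np.array([1.1, 2.0]), p1, m1, S1))
```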
Covariance Matrix per Class
● Quadratic discriminant
● Requires estimation of K*d*(d+1)/2 parameters for covariance matrix
$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{x} - 2\,\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i\right) + \log\hat{P}(C_i)$$
$$= \mathbf{x}^T\mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
[Figure: class likelihoods, the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Common Covariance Matrix S
$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$
● If there is not enough data, we can assume all classes share a common sample covariance matrix S
● The discriminant reduces to a linear discriminant (xᵀS⁻¹x is common to all discriminants and can be dropped)
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \text{where } \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
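A compact sketch of the shared-covariance case: pool the class covariances with the priors, then evaluate the linear discriminant; variable names mirror the formulas above.

```python
import numpy as np

def pooled_covariance(priors, covs):
    """S = sum_i P(C_i) * S_i."""
    return sum(p * S for p, S in zip(priors, covs))

def linear_discriminant(x, mean, prior, S_inv):
    """g_i(x) = w_i^T x + w_i0 with w_i = S^{-1} m_i."""
    w = S_inv @ mean
    w0 = -0.5 * mean @ S_inv @ mean + np.log(prior)
    return w @ x + w0
```

The predicted class is the one with the largest g_i(x).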
Common Covariance Matrix S
Diagonal S
$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$
● When xj, j = 1,...,d, are independent, Σ is diagonal
p (x|Ci) = ∏j p (xj |Ci) (Naive Bayes’ assumption)
Classify based on weighted Euclidean distance (in sj units) to the nearest mean
Diagonal S
Variances may be different
Diagonal S, equal variances
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\mathbf{m}_i\|^2}{2s^2} + \log\hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}(x_j^t - m_{ij})^2 + \log\hat{P}(C_i)$$
● Nearest mean classifier: Classify based on Euclidean distance to the nearest mean
● Each mean can be considered a prototype or template and this is template matching
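A minimal nearest-mean (template-matching) sketch; the prototypes and test points are made up, and equal priors are assumed so the log P̂(C_i) term is dropped.

```python
import numpy as np

def nearest_mean_predict(X, means):
    """Assign each row of X to the class whose mean is closest in Euclidean distance."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # N x K distances
    return dists.argmin(axis=1)

means = np.array([[0.0, 0.0], [5.0, 5.0]])   # one prototype (mean) per class
X = np.array([[0.5, -0.2], [4.1, 5.3]])
print(nearest_mean_predict(X, means))        # -> [0 1]
```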
Diagonal S, equal variances
Model Selection

Assumption                    Covariance matrix        No. of parameters
Shared, Hyperspheric          Si = S = s²I             1
Shared, Axis-aligned          Si = S, with sij = 0     d
Shared, Hyperellipsoidal      Si = S                   d(d+1)/2
Different, Hyperellipsoidal   Si                       K·d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Model Selection
● Different covariance matrix for each class
● Have to estimate many parameters
● Small bias, large variance
● Common covariance matrices, diagonal covariance, etc. reduce the number of parameters
● Increase bias but control variance
● In-between states?
Regularized Discriminant Analysis (RDA)
● a = b = 0: Quadratic classifier
● a = 0, b = 1: Shared covariance, linear classifier
● a = 1, b = 0: Diagonal covariance
● Choose the best a, b by cross-validation
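One common way to write the regularized covariance behind these settings is S'_i = a·s²I + b·S + (1 − a − b)·S_i, which reproduces the three special cases above; the sketch below assumes that formulation, with a and b to be chosen by cross-validation.

```python
import numpy as np

def rda_covariance(S_i, S, a, b):
    """Blend a*s^2*I (spherical), b*S (shared), and (1-a-b)*S_i (class-specific)."""
    d = S.shape[0]
    s2 = np.trace(S) / d          # overall variance scale for the spherical term
    return a * s2 * np.eye(d) + b * S + (1 - a - b) * S_i

# a = b = 0   -> S_i   (quadratic classifier)
# a = 0, b = 1 -> S     (shared covariance, linear classifier)
# a = 1, b = 0 -> s^2 I (diagonal/spherical covariance)
```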
Model Selection: Example
Model Selection
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$

If the $x_j$ are independent (Naive Bayes'):
$$p(\mathbf{x}\mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}\,(1-p_{ij})^{1-x_j}$$

the discriminant is linear:
$$g_i(\mathbf{x}) = \log p(\mathbf{x}\mid C_i) + \log P(C_i) = \sum_j \left[x_j \log p_{ij} + (1-x_j)\log(1-p_{ij})\right] + \log P(C_i)$$

Estimated parameters:
$$\hat{p}_{ij} = \frac{\sum_t x_j^t\, r_i^t}{\sum_t r_i^t}$$
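A sketch of these estimates and the linear discriminant for binary features; the toy data are made up, and the probabilities are clipped to avoid log 0.

```python
import numpy as np

def fit_bernoulli(X, r):
    """p_hat_ij = sum_t x_j^t r_i^t / sum_t r_i^t for one class."""
    return (X * r[:, None]).sum(axis=0) / r.sum()

def discriminant(x, p, prior, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum() + np.log(prior)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
r1 = np.array([1.0, 1.0, 0.0, 0.0])              # membership indicators for class C1
p1 = fit_bernoulli(X, r1)
print(discriminant(np.array([1.0, 0.0, 1.0]), p1, prior=0.5))
```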
Multivariate Regression

Multivariate linear model:
$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t + \varepsilon$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left[r^t - (w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t)\right]^2$$
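Minimizing this squared error is ordinary least squares; a minimal sketch with a made-up data set, adding a column of ones so that w0 is the intercept.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # N x d inputs
r = np.array([5.1, 4.9, 11.2, 10.8])                             # targets r^t

A = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend 1s for the intercept w0
w, *_ = np.linalg.lstsq(A, r, rcond=None)        # minimizes sum_t (r^t - A w)^2
print(w)                                         # [w0, w1, ..., wd]
```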
CHAPTER 6:
Dimensionality Reduction
Dimensionality of input
● Number of observables (e.g. age and income)
● If the number of observables is increased:
  – More time to compute
  – More memory to store inputs and intermediate results
  – More complicated explanations (knowledge from learning)
    ● Regression with 100 vs. 2 parameters
  – No simple visualization
    ● 2D vs. 10D graph
  – Need much more data (curse of dimensionality)
    ● 1M one-dimensional inputs are not equal to 1 input of dimension 1M
Dimensionality reduction
● Some features (dimensions) bear little or no useful information (e.g. color of hair when selecting a car)
  – Can drop some features
  – Have to estimate from data which features can be dropped
● Several features can be combined together without loss, or even with gain, of information (e.g. income of all family members for a loan application)
  – Some features can be combined together
  – Have to estimate from data which features to combine
Feature Selection vs Extraction
● Feature selection: Choose k<d important features, ignoring the remaining d – k
  – Subset selection algorithms
● Feature extraction: Project the original xi, i = 1,...,d dimensions to new k<d dimensions, zj, j = 1,...,k
  – Principal Components Analysis (PCA)
  – Linear Discriminant Analysis (LDA)
  – Factor Analysis (FA)
Usage
● Have data of dimension d
● Reduce dimensionality to k<d
  – Discard unimportant features
  – Combine several features into one
● Use the resulting k-dimensional data set for
  – Learning for a classification problem (e.g. parameters of the probabilities P(x|C))
  – Learning for a regression problem (e.g. parameters of the model y = g(x|θ))
Subset selection
● Have an initial set of features of size d
● There are 2^d possible subsets
● Need a criterion to decide which subset is the best
● Need a way to search over the possible subsets
● Can't go over all 2^d possibilities, so we need some heuristics
“Goodness” of feature set
● Supervised
  – Train using the selected subset
  – Estimate error on a validation data set
● Unsupervised
  – Look at the input only (e.g. age, income, and savings)
  – Select the subset of 2 that bears most of the information about the person
Mutual Information
● Have 3 random variables (features) X, Y, Z and have to select the 2 that give the most information
● If X and Y are "correlated", then much of the information about Y is already in X
● It makes sense to select features which are "uncorrelated"
● Mutual information (based on the Kullback–Leibler divergence) is a more general measure of dependence
● Can be extended to n variables (the information that variables x1,...,xn have about variable xn+1)
Subset-selection
● Forward search
  – Start from an empty set of features
  – Try each of the remaining features
  – Estimate classification/regression error for adding a specific feature
  – Select the feature that gives the maximum improvement in validation error
  – Stop when there is no significant improvement
● Backward search
  – Start with the original set of size d
  – Drop features with the smallest impact on error
Subset Selection
There are 2^d subsets of d features
Forward search: Add the best feature at each step
  – Set of features F initially Ø
  – At each iteration, find the best new feature: j = argmin_i E(F ∪ x_i)
  – Add x_j to F if E(F ∪ x_j) < E(F)
Hill-climbing O(d^2) algorithm
Backward search: Start with all features and remove one at a time, if possible
Floating search (Add k, remove l)
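A sketch of the forward-search loop; `error(F)` stands in for whatever validation error the chosen model yields on feature subset F, so it is a hypothetical callback supplied by the user.

```python
def forward_select(d, error):
    """Greedy forward search over feature indices 0..d-1.

    error(F) must return the validation error for a set F of feature indices.
    """
    F = set()
    best_err = error(F)
    while True:
        candidates = [j for j in range(d) if j not in F]
        if not candidates:
            break
        errs = {j: error(F | {j}) for j in candidates}   # E(F ∪ {x_j}) for each candidate
        j = min(errs, key=errs.get)                      # j = argmin_j E(F ∪ {x_j})
        if errs[j] < best_err:                           # add x_j only if it improves E
            F.add(j)
            best_err = errs[j]
        else:
            break
    return F, best_err
```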
Floating Search
● Forward and backward search are "greedy" algorithms
  – Select the best option at a single step
  – Do not always achieve the optimum value
● Floating search
  – Two types of steps: Add k, remove l
  – More computations
Feature Extraction
● Face recognition problem
  – Training data input: pairs of Image + Label (name)
  – Classifier input: Image
  – Classifier output: Label (name)
● Image: matrix of 256×256 = 65536 values in the range 0..255
● Each pixel bears little information, so we can't simply select the 100 best ones
● The average of pixels around specific positions may give an indication about, e.g., eye color
Projection
● Find a projection matrix w from d-dimensional to k-dimensional vectors that keeps error low
PCA: Motivation
● Assume the d observables are linear combinations of k<d vectors
● z_i = w_{i1}x_1 + … + w_{id}x_d
● We would like to work with this basis, as it has lower dimension and retains (almost) all the required information
● What we expect from such a basis:
  – Uncorrelated, or otherwise it can be reduced further
  – Large variance (e.g. z_i has large variation), or otherwise it bears no information
PCA: Motivation
PCA: Motivation
● Choose directions such that the total variance of the data is maximal
  – Maximize total variance
● Choose directions that are orthogonal
  – Minimize correlation
● Choose k<d orthogonal directions which maximize total variance
PCA
● Choose the first direction w1
● Maximize the variance Var(z1) = w1ᵀΣw1 subject to the constraint w1ᵀw1 = 1, using Lagrange multipliers: max_w1 w1ᵀΣw1 − α(w1ᵀw1 − 1)
● Taking derivatives gives Σw1 = αw1
● So w1 is an eigenvector of Σ; since we want to maximize w1ᵀΣw1 = α, we should choose the eigenvector with the largest eigenvalue
PCA
● d-dimensional feature space
● d-by-d symmetric covariance matrix estimated from samples
● Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors
● The first eigenvector is the direction with the largest variance
What PCA does
z = WT(x – m)
where the columns of W are the eigenvectors of ∑, and m is sample mean
Centers the data at the origin and rotates the axes
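A compact sketch of exactly this recipe: eigendecompose the sample covariance, sort the eigenvalues in descending order, keep the top k eigenvectors as the columns of W, and project z = Wᵀ(x − m).

```python
import numpy as np

def pca_fit(X, k):
    """Return (m, W, eigenvalues) for the projection z = W^T (x - m)."""
    m = X.mean(axis=0)
    D = X - m
    Sigma = D.T @ D / len(X)                 # sample covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    order = np.argsort(vals)[::-1]           # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    return m, vecs[:, :k], vals

def pca_transform(X, m, W):
    return (X - m) @ W                       # each row is z^t = W^T (x^t - m)

X = np.random.RandomState(0).randn(100, 5)   # made-up data
m, W, vals = pca_fit(X, k=2)
print(pca_transform(X, m, W).shape)          # (100, 2)
```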
How to choose k ?
● Proportion of Variance (PoV) explained:

$$\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$$

when the λi are sorted in descending order
● Typically, stop at PoV > 0.9
● Scree graph plots PoV vs. k; stop at the “elbow”
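Continuing from the eigenvalues computed by PCA, choosing k by PoV is a short computation; the 0.9 threshold follows the slide, while the example eigenvalues are made up.

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k whose proportion of variance reaches the threshold."""
    vals = np.sort(eigvals)[::-1]            # descending order
    pov = np.cumsum(vals) / vals.sum()       # PoV for k = 1..d
    return int(np.searchsorted(pov, threshold) + 1), pov

k, pov = choose_k(np.array([4.0, 2.0, 1.0, 0.5, 0.5]))
print(k, pov)                                # k = 4, PoV = [0.5, 0.75, 0.875, 0.9375, 1.0]
```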
PCA
● PCA is unsupervised (does not take into account class information)
● Can take classes into account: Karhunen–Loève expansion
  – Estimate the covariance per class
  – Take the average weighted by the priors
● Common Principal Components
  – Assume all classes have the same eigenvectors (directions) but different variances
PCA
● Does not try to explain noise
  – Large noise can become a new dimension / the largest PC
● Interested in the resulting uncorrelated variables which explain a large portion of the total sample variance
● Sometimes we are interested instead in the shared variance (common factors) that affects the data