Kernel Density EstimationTheory and Application in Discriminant Analysis
Thomas Ledl
Universität Wien
Contents:
Introduction
Theory
Application Aspects
Simulation Study
Summary
Introduction
25 observations: Which distribution?
[Figure: 25 observations on [0, 4], together with several candidate density curves]
Kernel density estimator model:

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

with the kernel function K(.) and the bandwidth h to choose.
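A minimal NumPy sketch of this estimator (the sample, the Gaussian kernel and the bandwidth value are illustrative choices, not taken from the talk):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, h, kernel=gaussian_kernel):
    """f_hat(x) = 1/(n*h) * sum_i K((x - x_i) / h), evaluated on a grid."""
    u = (x_grid[:, None] - data[None, :]) / h   # shape (grid, n)
    return kernel(u).sum(axis=1) / (len(data) * h)

# 25 observations, echoing the slide's opening question
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.6, size=25)
x = np.linspace(0.0, 4.0, 200)
f_hat = kde(x, data, h=0.3)
```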
Kernel / bandwidth:
[Figure: estimates of the same sample with a triangular vs. a Gaussian kernel, and with a "small" vs. a "large" bandwidth h]
Question 1:
Which choice of K(.) and h is best for descriptive purposes?
Classification:
[Figure: histograms/density estimates of the class data on [−3, 3], motivating density-based classification]
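The classifier behind these plots follows the usual density-based Bayes rule: estimate each class density with a KDE and assign a point to the class with the largest prior-weighted estimated density. A minimal sketch assuming equal priors and a common bandwidth (both simplifications):

```python
import numpy as np

def kde_class_scores(x, class_samples, h):
    """Estimated Gaussian-kernel density of each class at point x."""
    scores = []
    for data in class_samples:
        u = (x - data) / h
        f_hat = np.exp(-0.5 * u**2).sum() / (len(data) * h * np.sqrt(2 * np.pi))
        scores.append(f_hat)
    return np.array(scores)

rng = np.random.default_rng(1)
classes = [rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)]  # toy data
label = int(np.argmax(kde_class_scores(1.7, classes, h=0.5)))   # 0 or 1
```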
Levelplot – LDA (based on the assumption of a multivariate normal distribution):
[Figure: level plot of LDA posterior probabilities (0.34-0.93) over (V1, V2), with the observations of five classes overlaid]
Levelplot – KDE classifier:
[Figure: two level plots of KDE-based posterior probabilities (0.34-0.93) over (V1, V2), same five-class data as above]
Question 2:
How does classification based on KDE perform in more than 2 dimensions?
Theory
Essential issues:
Optimization criteria
Improvements of the standard model
Resulting optimal choices of the model parameters K(.) and h
Optimization criteria
Lp-distances:

$$d_p(\hat f, f) = \left( \int \bigl| \hat f(x) - f(x) \bigr|^p \, dx \right)^{1/p}$$
[Figure: two density curves f(.) and g(.) on [−2, 4] (standing for estimate and truth), whose distance is to be measured]
[Figure: the pointwise difference between the two curves, whose overall size the following criteria measure]

"Integrated absolute error":
$$\mathrm{IAE} = \int \bigl| \hat f(x) - f(x) \bigr| \, dx$$

"Integrated squared error":
$$\mathrm{ISE} = \int \bigl( \hat f(x) - f(x) \bigr)^2 \, dx$$
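Both criteria are easy to evaluate numerically once the true density is known, as in a simulation. A sketch (the sample and bandwidth are illustrative; `np.trapz` does the integration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(size=100)
x = np.linspace(-4.0, 4.0, 1000)

f_true = norm.pdf(x)
h = 1.06 * data.std(ddof=1) * len(data) ** (-0.2)   # normal-rule bandwidth
u = (x[:, None] - data[None, :]) / h
f_hat = norm.pdf(u).sum(axis=1) / (len(data) * h)

iae = np.trapz(np.abs(f_hat - f_true), x)   # L1 criterion
ise = np.trapz((f_hat - f_true) ** 2, x)    # L2 criterion
```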
Other ideas:
Consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995)
Comparison of the number and position of modes
Minimization of the maximum vertical distance
Overview of some minimization criteria:
L1-distance = IAE: difficult mathematical tractability
L2-distance = ISE, MISE, AMISE, ...: most commonly used
L∞-distance = maximum difference: does not consider the overall fit
"Modern" criteria, which include some measure of horizontal distance: difficult mathematical tractability
[Figure: MISE, IV, ISB and their asymptotic counterparts AMISE, AIV, AISB as functions of log10(h), next to the underlying density]
ISE is a random variable.
MISE = E(ISE), the expectation of ISE.
AMISE = Taylor approximation of MISE, easier to calculate.
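In the usual notation, with $R(g) = \int g(x)^2\,dx$ and $\mu_2(K) = \int u^2 K(u)\,du$, these quantities read (a standard reconstruction; the slides show them only as images):

$$\mathrm{MISE}(h) = \mathbb{E}\!\int \bigl(\hat f_h(x) - f(x)\bigr)^2 dx, \qquad \mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, h^4\, \mu_2(K)^2\, R(f'').$$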
The AMISE-optimal bandwidth
AMISE is minimized by

$$h_{\mathrm{AMISE}} = \left( \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right)^{1/5}$$
It is dependent on the kernel function K(.); the kernel-dependent factor is minimized by the "Epanechnikov kernel":
[Figure: the Epanechnikov kernel K(u) = (3/4)(1 − u²) on [−1, 1]]
It is also dependent on the unknown density f(.) (through R(f'')).
How to proceed?
Data-driven bandwidth selection methods
Leave-one-out selectors:
Maximum likelihood cross-validation
Least-squares cross-validation (Bowman, 1984)
Criteria based on substituting R(f'') in the AMISE formula:
"Normal rule" ("rule of thumb"; Silverman, 1986)
Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)
Smoothed bootstrap
Least-squares cross-validation (LSCV):
Undisputed selector in the 1980s
Gives an unbiased estimator of the ISE (up to a constant that does not depend on h)
Suffers from more than one local minimizer – no agreement about which one to use
Bad convergence rate for the resulting bandwidth h_opt
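The criterion being minimized is, in its standard form (not shown explicitly on the slides),

$$\mathrm{LSCV}(h) = \int \hat f_h(x)^2\,dx \;-\; \frac{2}{n} \sum_{i=1}^{n} \hat f_{h,-i}(x_i),$$

where $\hat f_{h,-i}$ is the estimate computed without observation $i$. A brute-force sketch for the Gaussian kernel, using the fact that the convolution of two N(0, h²) kernels is N(0, 2h²), so the first integral has a closed form:

```python
import numpy as np
from scipy.stats import norm

def lscv(h, data):
    """LSCV(h) for a Gaussian-kernel KDE, O(n^2) brute force."""
    n = len(data)
    d = data[:, None] - data[None, :]
    int_f2 = norm.pdf(d, scale=np.sqrt(2.0) * h).sum() / n**2
    k = norm.pdf(d, scale=h)
    loo = (k.sum() - np.trace(k)) / (n * (n - 1))   # leave-one-out term
    return int_f2 - 2.0 * loo

data = np.random.default_rng(3).normal(size=200)
grid = np.linspace(0.05, 1.0, 60)
h_opt = grid[np.argmin([lscv(h, data) for h in grid])]
```

A grid search like this also makes the multiple-local-minima problem mentioned above directly visible.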
Normal rule ("rule of thumb"):
Assumes f(x) to be N(μ, σ²)
Easiest selector
Often oversmooths the function
For the Gaussian kernel the resulting bandwidth is given by

$$h_{\mathrm{opt}} = \left(\frac{4}{3n}\right)^{1/5} \hat\sigma \approx 1.06\, \hat\sigma\, n^{-1/5}$$
Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990):
Do not substitute R(f'') in the AMISE formula directly, but estimate it via R(f^(IV)), R(f^(IV)) via R(f^(VI)), etc.
Another parameter i to choose (the number of stages to go back) – one stage is mostly sufficient
Better rates of convergence
Does not ultimately circumvent the problem of the unknown density either
The multivariate case
The bandwidth h becomes H, a (d × d) bandwidth matrix.
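In d dimensions the estimator then takes the standard form (one common parametrization, reconstructed to match the univariate formula above):

$$\hat f_H(x) = \frac{1}{n\,|H|^{1/2}} \sum_{i=1}^{n} K\!\bigl(H^{-1/2}(x - x_i)\bigr), \qquad x \in \mathbb{R}^d,$$

with H a symmetric, positive definite bandwidth matrix.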
Issues of generalization in d dimensions
d² bandwidth entries instead of one bandwidth parameter
Unstable estimates
Bandwidth selectors are essentially straightforward to generalize
For plug-in methods it is "too difficult" to give succinct expressions for d > 2 dimensions
Aspects of Application
Essential issues:
Curse of dimensionality
Connection between goodness-of-fit and optimal classification
Two methods for discriminatory purposes
The "curse of dimensionality":
The data "disappears" into the distribution tails in high dimensions: a good fit in the tails is desired!
[Figure: probability mass NOT in the "tail" of a multivariate normal density, falling from nearly 100% towards 0% as the number of dimensions grows from 1 to 20]
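One way to make the effect concrete: for X ~ N(0, I_d), the squared norm ‖X‖² is χ²(d)-distributed, so the mass inside a fixed "central" ball can be computed directly (an illustrative computation in the spirit of the figure, not its exact definition of "tail"):

```python
from scipy.stats import chi2

# Mass of a standard d-variate normal within the ball of radius 1.96:
# P(||X|| <= r) = chi2.cdf(r**2, df=d); ~95% for d=1, ~4.6% for d=10.
for d in (1, 2, 5, 10, 20):
    print(d, chi2.cdf(1.96**2, df=d))
```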
The "curse of dimensionality":
Much data is necessary to maintain a constant estimation error in high dimensions.
Dimensionality   Required sample size
1                4
2                19
3                67
4                223
5                768
6                2790
7                10700
8                43700
9                187000
10               842000
Connection between goodness-of-fit and optimal classification:
AMISE-optimal parameter choice is L2-optimal; optimal classification (in high dimensions) is L1-optimal (misclassification rate).
Estimation of the tails is important – yet the fit is worse in the tails.
Many observations are required for a reasonable fit – yet the calculation is intensive for large n.
Method 1:
Reduce the data onto a subspace that allows a reasonably accurate estimation but does not destroy too much information ("trade-off")
Use the multivariate kernel density concept to estimate the class densities
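A compact sketch of this pipeline (PCA via SVD for the reduction, then `scipy.stats.gaussian_kde` per class, whose default bandwidth is a normal-reference rule; the two-class data are hypothetical):

```python
import numpy as np
from scipy.stats import gaussian_kde

def pca_fit(X, k):
    """Return the mean and the first k principal directions of X."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

rng = np.random.default_rng(4)
X1 = rng.normal(0.0, 1.0, (600, 10))       # class 1 training data
X2 = rng.normal(0.5, 1.0, (600, 10))       # class 2 training data

mu, W = pca_fit(np.vstack([X1, X2]), k=3)  # shared 3-D subspace
kde1 = gaussian_kde(((X1 - mu) @ W.T).T)   # gaussian_kde expects (d, n)
kde2 = gaussian_kde(((X2 - mu) @ W.T).T)

x_new = rng.normal(0.0, 1.0, 10)
z = (x_new - mu) @ W.T                     # project the test point
label = 1 if kde1(z)[0] >= kde2(z)[0] else 2   # equal priors assumed
```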
Method 2:
Use the univariate concept to "normalize" the data nonparametrically
Use classical methods like LDA and QDA for classification
Drawback: calculation intensive
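A plausible reading of the normalization step, sketched for one margin: estimate the marginal CDF with a Gaussian-kernel smoother and map each value through Φ⁻¹, so the transformed margin is approximately standard normal; LDA/QDA is then run on the transformed columns. The bandwidth and the clipping constant are illustrative:

```python
import numpy as np
from scipy.stats import norm

def marginal_normalizer(train_col, h):
    """t(x) = Phi^{-1}(F_hat(x)), with F_hat a kernel-smoothed CDF."""
    train_col = np.asarray(train_col, dtype=float)
    def t(x):
        # Smoothed empirical CDF: average of kernel CDFs at the data points
        F = norm.cdf((np.asarray(x, dtype=float)[..., None] - train_col) / h)
        F = F.mean(axis=-1)
        F = np.clip(F, 1e-6, 1 - 1e-6)   # keep Phi^{-1} finite
        return norm.ppf(F)
    return t

col = np.random.default_rng(5).exponential(size=600)   # a skewed margin
t = marginal_normalizer(col, h=0.2)
z = t(col)   # approximately N(0, 1); repeat per margin, then apply LDA/QDA
```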
Method 2 (illustration):
[Figure: two panels (a, b) showing a marginal density f(x), the CDFs F(x) and G(x), and the induced transformation t(x)]
Simulation Study
Criticism of former simulation studies:
Carried out 20-30 years ago
Outdated parameter selectors
Restriction to uncorrelated normals
Fruitless estimation because of high dimensions
No dimension reduction
The present simulation study:
21 datasets x 14 estimators x 2 error criteria = 588 classification scores
Many results
Each dataset has...
...2 classes to distinguish
...600 observations per class
...200 test observations, 100 produced by each class
...therefore dimension 1400 x 10
Univariate prototype distributions:
[Figure: density panels – Normal; Normal with small, medium and large noise; Exponential(1); Bimodal (close); Bimodal (far)]
10 datasets having equal covariance matrices
+ 10 datasets having unequal covariance matrices
+ 1 insurance dataset
= 21 datasets total

Dataset Nr.   Abbrev.   Contains
1             NN1       10 normal distributions with "small noise"
2             NN2       10 normal distributions with "medium noise"
3             NN3       10 normal distributions with "large noise"
4             SkN1      2 skewed (exp-)distributions and 7 normals
5             SkN2      5 skewed (exp-)distributions and 5 normals
6             SkN3      7 skewed (exp-)distributions and 3 normals
7             Bi1       4 normals, 4 skewed and 2 bimodal (close) distributions
8             Bi2       4 normals, 4 skewed and 2 bimodal (close) distributions
9             Bi3       8 skewed and 2 bimodal (far) distributions
10            Bi4       8 skewed and 2 bimodal (far) distributions
14 estimators:
Classical methods: LDA and QDA (2) – 2 estimators
Method 1 (multivariate density estimator): principal component reduction onto 2, 3, 4 and 5 dimensions (4) x multivariate "normal rule" or multivariate LSCV criterion (2) – 8 estimators
Method 2 ("marginal normalizations"): univariate normal rule or Sheather-Jones plug-in (2) x subsequent LDA or QDA (2) – 4 estimators
Misclassification criteria:
The classical misclassification rate ("error rate")
The Brier score
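For reference, with g classes, estimated posterior probabilities $\hat p(j \mid x_i)$ and 0/1 class indicators, the Brier score can be written as (one standard definition; the slide's exact normalization is not recoverable):

$$\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{g} \bigl( \hat p(j \mid x_i) - \mathbf{1}\{y_i = j\} \bigr)^2 .$$

Unlike the error rate, it rewards well-calibrated posterior probabilities rather than only the final argmax decision.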
Results:
Error rate vs. Brier score
[Figure: Brier score plotted against error rate for all classification scores]
The choice of the misclassification criterion is not essential.
Results: The choice of the multivariate bandwidth parameter (method 1) is not essential in most cases.
Error rates for method 1
[Figure: LSCV error rates plotted against "normal rule" error rates]
Superiority of LSCV in the case of bimodal distributions with unequal covariance matrices.
Results: The choice of the univariate bandwidth parameter (method 2) is not essential.
Error rates for method 2
[Figure: Sheather-Jones selector error rates plotted against "normal rule" error rates]
Results: The best trade-off is a projection onto 2-3 dimensions.
Error rate regarding different subspaces
[Figure: error rate vs. number of subspace dimensions (2-5) for the NN-, SkN- and Bi-distributions]
Results:
Equal covariance matrices: method 1 performs worse than LDA.
[Figure: error rates per dataset – LDA (classical) vs. LSCV(3), method 1]
Equal covariance matrices: method 2 sometimes improves slightly.
[Figure: error rates per dataset (NN1 ... Bi4) – LDA (classical) vs. the normal rule in method 2]
Results:
Unequal covariance matrices: method 1 performs quite poorly, but not for skewed distributions.
[Figure: error rates per dataset – QDA (classical) vs. LSCV(3), method 1]
Unequal covariance matrices: method 2 often improves substantially.
[Figure: error rates per dataset – QDA (classical) vs. the normal rule in method 2]
Is the additional calculation time justified?
Results: required calculation time (in increasing order):
LDA, QDA
multivariate "normal rule"
preliminary univariate normalizations, LSCV, Sheather-Jones plug-in
Summary
Summary (1/3) – Classification Performance
Restriction to only a few dimensions
Improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices)
Poor performance of the multivariate kernel density classifier
LDA is undisputed in the case of equal covariance matrices and equal prior probabilities
The additional computation time seems not to be justified
Summary (2/3) – KDE for Data Description
Great variety in error criteria, parameter selection procedures and additional model improvements (3 dimensions of variety)
No consensus about a feasible error criterion
Nobody knows what is finally optimized ("upper bounds" in L1-theory; in L2-theory ISE vs. MISE vs. AMISE, several minima in LSCV, ...)
Different parameter selectors are of varying quality with respect to different underlying densities
Summary (3/3) – Theory vs. Application
Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification
For discriminatory purposes the issue of estimating log-densities is much more important
Some univariate model improvements are not generalizable
The widely ignored "curse of dimensionality" forces the user to find a trade-off between necessary dimension reduction and information loss
Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time
The End