data mining and statistical learning - 2008
TRANSCRIPT
Kernel methods - overview
Kernel smoothers
Local regression
Kernel density estimation
Radial basis functions
Introduction
Kernel methods are regression techniques used to estimate a response function
from noisy data
Properties:
• Different models are fitted at each query point, and only those observations close to that point are used to fit the model
• The resulting function is smooth
• The models require only a minimum of training
$$y = f(X) + \varepsilon, \qquad X \in \mathbb{R}^d$$
A simple one-dimensional kernel smoother
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K(x_0, x_i)}$$

where

$$K(x_0, x) = \begin{cases} 1, & \text{if } |x - x_0| \le \lambda \\ 0, & \text{otherwise} \end{cases}$$
[Figure: observed values and the fitted curve produced by the kernel smoother.]
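As a minimal sketch of this smoother (function and variable names are illustrative, not from the slides), the box-kernel estimate is just the average of the responses whose inputs fall within distance λ of the query point:

```python
import numpy as np

def box_kernel_smoother(x0, x, y, lam=1.0):
    """Kernel-weighted average with the box kernel:
    K(x0, x) = 1 if |x - x0| <= lam, else 0."""
    w = (np.abs(x - x0) <= lam).astype(float)
    return np.sum(w * y) / np.sum(w)

# Noiseless observations of f(x) = x
x = np.arange(10, dtype=float)
y = x.copy()
# At x0 = 4 with lam = 1 the window covers x = 3, 4, 5, so the estimate is 4
print(box_kernel_smoother(4.0, x, y))  # 4.0
```

Because only observations inside the window enter the average, the fitted curve changes as points enter and leave the window when x0 moves, which is what makes wider windows smoother.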
Kernel methods, splines and ordinary least squares regression (OLS)
• OLS: A single model is fitted to all data
• Splines: Different models are fitted to different subintervals (cuboids) of the input domain
• Kernel methods: Different models are fitted at each query point
Kernel-weighted averages and moving averages
The Nadaraya-Watson kernel-weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

where $\lambda$ indicates the window size and the function $D$ shows how the weights change with distance within this window

The estimated function is smooth!

K-nearest neighbours

$$\hat f(x) = \mathrm{Ave}\bigl(y_i \mid x_i \in N_k(x)\bigr)$$

The estimated function is piecewise constant!
Examples of one-dimensional kernel smoothers
• Epanechnikov kernel

$$D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2), & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

• Tri-cube kernel

$$D(t) = \begin{cases} (1 - |t|^3)^3, & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$
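A short sketch of the two kernels (names are my own); both vanish outside [-1, 1], but the tri-cube kernel is flatter near 0 and has continuous second derivatives at the boundary:

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

# Peak values at t = 0: 3/4 for Epanechnikov, 1 for tri-cube;
# both are exactly zero outside the window
print(epanechnikov(0.0), tricube(0.0), epanechnikov(2.0))
```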
Issues in kernel smoothing
• The smoothing parameter λ has to be defined
• When there are ties at xi : Compute an average y value and introduce weights representing the number of points
• Boundary issues
• Varying density of observations:
  – the bias is constant
  – the variance is inversely proportional to the density
Boundary effects of one-dimensional kernel smoothers
Locally-weighted averages can be badly biased on the boundaries if the response function has a significant slope; the remedy is to apply local linear regression
Local linear regression
Find the intercept and slope parameters $\alpha(x_0)$, $\beta(x_0)$ solving

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - \alpha(x_0) - \beta(x_0)\,x_i\bigr]^2$$

The solution is a linear combination of the $y_i$:

$$\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
Kernel smoothing vs local linear regression
Kernel smoothing

Solve the minimization problem

$$\min_{\alpha(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - \alpha(x_0)\bigr]^2$$

Local linear regression

Solve the minimization problem

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - \alpha(x_0) - \beta(x_0)\,x_i\bigr]^2$$
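The local linear minimization is an ordinary weighted least-squares problem at each query point. A minimal sketch (assuming an Epanechnikov kernel; names are illustrative):

```python
import numpy as np

def local_linear(x0, x, y, lam=1.0):
    """Fit alpha(x0), beta(x0) by kernel-weighted least squares
    and return the estimate f_hat(x0) = alpha(x0) + beta(x0) * x0."""
    t = (x - x0) / lam
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)  # Epanechnikov weights
    B = np.column_stack([np.ones_like(x), x])              # design matrix (1, x_i)
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)       # [alpha(x0), beta(x0)]
    return coef[0] + coef[1] * x0

x = np.linspace(0, 10, 21)
y = 2.0 + 3.0 * x   # noiseless linear response
# Local linear regression reproduces a linear function exactly,
# even at the boundary point x0 = 0 where a kernel average would be biased
print(local_linear(0.0, x, y, lam=2.0))
```

This illustrates the boundary-bias correction: a plain kernel-weighted average at x0 = 0 would only see points with x ≥ 0 and overestimate the response, while the local line fits the slope and is exact here.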
Properties of local linear regression
• Automatically modifies the kernel weights to correct for bias
• Bias depends only on the terms of order higher than one in the expansion of f.
Local polynomial regression
• Fitting polynomials instead of straight lines
Behavior of estimated response function:
Polynomial vs local linear regression
Advantages:
• Reduces the "trimming of hills and filling of valleys"
Disadvantages:
• Higher variance (tails are more wiggly)
Selecting the width of the kernel
Bias-Variance tradeoff:
Selecting a narrow window leads to high variance and low bias, whilst selecting a wide window leads to high bias and low variance.
Selecting the width of the kernel
1. Automatic selection (e.g., cross-validation)
2. Fixing the degrees of freedom

$$\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}, \qquad \{\mathbf{S}_\lambda\}_{ij} = l_j(x_i), \qquad df = \mathrm{trace}(\mathbf{S}_\lambda)$$
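Because the fit is linear in y, the smoother matrix and its trace are easy to compute. A sketch for the Nadaraya-Watson smoother with an Epanechnikov kernel (names are my own); widening the window spreads each row's weights over more points, which shrinks the trace, i.e. the effective degrees of freedom:

```python
import numpy as np

def smoother_matrix(x, lam=1.0):
    """S_lambda for Nadaraya-Watson smoothing: row i holds the
    normalized Epanechnikov weights l_j(x_i)."""
    t = (x[:, None] - x[None, :]) / lam
    K = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)
    return K / K.sum(axis=1, keepdims=True)

x = np.linspace(0, 10, 50)
S = smoother_matrix(x, lam=1.0)
df = np.trace(S)   # effective degrees of freedom for this lambda
# A wider window gives a smoother fit and fewer effective degrees of freedom
print(df, np.trace(smoother_matrix(x, lam=3.0)))
```

Fixing df then amounts to solving trace(S_λ) = target for λ, e.g. by bisection.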
Local regression in ℝᵖ

The one-dimensional approach is easily extended to p dimensions by

• Using the Euclidean norm as a measure of distance in the kernel
• Modifying the polynomial, e.g. for p = 2 and a local quadratic fit:

$$b(X) = \bigl(1,\; X_1,\; X_2,\; X_1^2,\; X_2^2,\; X_1 X_2\bigr)$$
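A sketch of this basis as a function (the name is illustrative):

```python
import numpy as np

def quad_basis(x1, x2):
    """Local quadratic basis in R^2: b(X) = (1, X1, X2, X1^2, X2^2, X1*X2)."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

print(quad_basis(2.0, 3.0))  # [1. 2. 3. 4. 9. 6.]
```

At each query point x0 one then solves a kernel-weighted least-squares problem in these six coordinates instead of the two-term (intercept, slope) problem of the one-dimensional case.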
Local regression in ℝᵖ
”The curse of dimensionality”
• The fraction of points close to the boundary of the input domain increases with its dimension
• Observed data do not cover the whole input domain
Structured local regression models
Structured kernels (e.g., standardize each variable)

$$K_{\lambda,A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$

Note: A is positive semidefinite
Structured local regression models
Structured regression functions
• ANOVA decompositions (e.g., additive models)
Backfitting algorithms can be used
• Varying coefficient models (partition X into (X₁, …, X_q) and the remaining coordinates Z)

$$f(X) = \alpha(Z) + \beta_1(Z)\,X_1 + \cdots + \beta_q(Z)\,X_q$$
Structured local regression models
Varying coefficient models (example)
Local methods
• Assumption: the model is locally linear → maximize the log-likelihood locally at x0
• Autoregressive time series: $y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \varepsilon_t$, i.e. $y_t = z_t^T \beta + \varepsilon_t$ with $z_t = (1, y_{t-1}, \ldots, y_{t-k})$. Fit by local least squares with kernel $K(z_0, z_t)$
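The autoregressive case can be sketched as an ordinary kernel-weighted least-squares fit on the lagged design matrix (function name, Gaussian kernel choice, and the noiseless test series are my own assumptions):

```python
import numpy as np

def local_ar_forecast(y, k, z0, lam):
    """Fit y_t = z_t^T beta + e_t by least squares weighted with a Gaussian
    kernel K(z0, z_t), then predict the response at the query point z0."""
    n = len(y)
    # Row t of Z is z_t = (1, y_{t-1}, ..., y_{t-k}); targets are y_k .. y_{n-1}
    Z = np.column_stack([np.ones(n - k)] + [y[k - j:n - j] for j in range(1, k + 1)])
    t = y[k:]
    w = np.exp(-np.sum((Z - z0) ** 2, axis=1) / (2 * lam ** 2))  # K(z0, z_t)
    W = np.diag(w)
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ t)
    return z0 @ beta

# Noiseless AR(1) series y_t = 0.5 * y_{t-1}: forecast from z0 = (1, y_n)
y = 0.5 ** np.arange(20)
print(local_ar_forecast(y, 1, np.array([1.0, y[-1]]), lam=1.0))
```

Since the test series is exactly linear in its lag, the local fit recovers β = (0, 0.5) and the forecast equals 0.5·y_n regardless of the kernel width.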
Kernel density estimation
• Straightforward estimates of the density are bumpy
• Instead, Parzen's smooth estimate is preferred:

$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

Normally, Gaussian kernels are used
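With Gaussian kernels the Parzen estimate is just the average of N normal densities centred at the observations. A minimal sketch (names and data are illustrative):

```python
import numpy as np

def parzen_kde(x0, x, lam=1.0):
    """Parzen density estimate at x0: the average of Gaussian
    densities N(x_i, lam^2) evaluated at x0."""
    z = (x0 - x) / lam
    return np.mean(np.exp(-0.5 * z**2) / (lam * np.sqrt(2 * np.pi)))

x = np.random.default_rng(0).normal(size=1000)
# Near the true N(0,1) density at 0, which is 1/sqrt(2*pi) ~ 0.4
print(parzen_kde(0.0, x, lam=0.3))
```

A small λ reproduces the bumpiness of the raw data; a large λ oversmooths, which is the same bias-variance tradeoff as in kernel regression.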
Radial basis functions and kernels
Using the idea of basis expansion, we treat kernel functions as basis functions:

$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j$$

where ξⱼ is a prototype (location) parameter and λⱼ a scale parameter
Radial basis functions and kernels
Choosing the parameters:
• Estimate {λⱼ, ξⱼ, βⱼ} jointly by minimizing the residual sum of squares (a non-linear optimization problem)
• Estimate {λⱼ, ξⱼ} separately from βⱼ (often by using the distribution of X alone) and then solve a least-squares problem for βⱼ
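The second strategy can be sketched as follows (Gaussian basis, evenly spaced prototypes, and all names are my own illustrative choices): fix the centres ξⱼ and scale λ from the x-values alone, then the βⱼ follow from ordinary least squares.

```python
import numpy as np

def rbf_fit(x, y, centers, lam):
    """Least-squares coefficients beta_j for the Gaussian radial basis
    expansion f(x) = sum_j exp(-(x - xi_j)^2 / (2 lam^2)) * beta_j."""
    H = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * lam**2))
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta

def rbf_predict(x, centers, lam, beta):
    H = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * lam**2))
    return H @ beta

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
centers = np.linspace(0, 2 * np.pi, 10)   # prototypes xi_j chosen from the x-range alone
beta = rbf_fit(x, y, centers, lam=1.0)
resid = np.max(np.abs(rbf_predict(x, centers, lam=1.0, beta=beta) - y))
print(resid)  # small: the expansion tracks sin(x) closely
```

Keeping {λⱼ, ξⱼ} fixed makes the problem linear in β, which is why this two-stage recipe is so much cheaper than the joint non-linear fit.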