Kernel Methods
Arie Nakhmani
Outline
- Kernel Smoothers
- Kernel Density Estimators
- Kernel Density Classifiers
Kernel Smoothers – The Goal
- Estimate a function from noisy observations when no parametric model for the function is known
- The resulting function should be smooth
- The level of "smoothness" should be set by a single parameter
f(X): \mathbb{R}^p \to \mathbb{R}
Example
[Figure: Y vs. X scatter plot of the sample points]
N=100 sample points
What does "smooth enough" mean?
Example
Y = \sin X + \varepsilon, \qquad X \sim U[0, 2\pi], \qquad \varepsilon \sim N(0, 1/4)
N=100 sample points
[Figure: the noisy samples plotted over X]
Exponential Smoother
\hat{Y}(i) = (1 - \alpha)\,\hat{Y}(i-1) + \alpha\, Y_{\mathrm{sorted}}(i), \qquad \hat{Y}(1) = Y_{\mathrm{sorted}}(1), \qquad 0 \le \alpha \le 1
[Figure: exponential smoother output for \alpha = 0.25]
Smaller \alpha: smoother line, but more delayed
Exponential Smoother
- Simple
- Sequential
- Single parameter
- Single-value memory
- Too rough
- Delayed
\hat{Y}(i) = (1 - \alpha)\,\hat{Y}(i-1) + \alpha\, Y_{\mathrm{sorted}}(i)
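A minimal Python sketch of this recursion (the function name and the default α = 0.25 are illustrative; the data are assumed to be pre-sorted by X):

```python
import numpy as np

def exponential_smoother(y_sorted, alpha=0.25):
    """Y_hat(i) = (1 - alpha) * Y_hat(i - 1) + alpha * Y_sorted(i), with Y_hat(1) = Y_sorted(1)."""
    y_hat = np.empty(len(y_sorted))
    y_hat[0] = y_sorted[0]                      # initialize with the first sorted sample
    for i in range(1, len(y_sorted)):
        y_hat[i] = (1 - alpha) * y_hat[i - 1] + alpha * y_sorted[i]
    return y_hat
```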
Moving Average Smoother
For m = 5:
\hat{Y}(1) = Y_{\mathrm{sorted}}(1)
\hat{Y}(2) = \big(Y_{\mathrm{sorted}}(1) + Y_{\mathrm{sorted}}(2) + Y_{\mathrm{sorted}}(3)\big)/3
\hat{Y}(3) = \big(Y_{\mathrm{sorted}}(1) + Y_{\mathrm{sorted}}(2) + Y_{\mathrm{sorted}}(3) + Y_{\mathrm{sorted}}(4) + Y_{\mathrm{sorted}}(5)\big)/5
\hat{Y}(4) = \big(Y_{\mathrm{sorted}}(2) + Y_{\mathrm{sorted}}(3) + Y_{\mathrm{sorted}}(4) + Y_{\mathrm{sorted}}(5) + Y_{\mathrm{sorted}}(6)\big)/5
...
Moving Average Smoother
[Figure: moving-average smoother output for m = 11]
Larger m: smoother, but more straightened (flattened) line
Moving Average Smoother
- Sequential
- Single parameter: the window size m
- Memory for m values
- Irregularly smooth
- What if we have a p-dimensional problem with p > 1 ???
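A minimal sketch of the moving-average scheme above (names and defaults are illustrative); near the ends the window shrinks symmetrically, as in the m = 5 example:

```python
import numpy as np

def moving_average_smoother(y_sorted, m=5):
    """Centered moving average of the sorted responses with window size m."""
    half = m // 2
    n = len(y_sorted)
    y_hat = np.empty(n)
    for i in range(n):
        k = min(half, i, n - 1 - i)             # shrink the half-window near the boundaries
        y_hat[i] = np.mean(y_sorted[i - k:i + k + 1])
    return y_hat
```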
Nearest Neighbors Smoother
\hat{Y}(x_0) = \mathrm{Average}\left( y_i \mid x_i \in \mathrm{Neighborhood}_m(x_0) \right)
[Figure: nearest-neighbor estimate \hat{Y}(x_0) at a point x_0, with m = 160]
Larger m: smoother, but more biased line
Nearest Neighbors Smoother
- Not sequential
- Single parameter: the number of neighbors m
- Trivially extended to any number of dimensions
- Memory for m values
- Depends on the metric definition
- Not smooth enough
- Biased end-points
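A minimal sketch of the nearest-neighbors estimate at a single point (a 1-D Euclidean metric and the default m are assumptions):

```python
import numpy as np

def knn_smoother(x, y, x0, m=30):
    """Average the responses of the m sample points whose x is closest to x0."""
    idx = np.argsort(np.abs(np.asarray(x) - x0))[:m]    # indices of the m nearest neighbors
    return np.mean(np.asarray(y)[idx])
```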
Low Pass Filter
[Figure: low-pass filtered output, Y vs. X]
2nd-order Butterworth:
H(z) = 0.0078\,\frac{z^2 + 2z + 1}{z^2 - 1.73z + 0.77}
Why do we need kernel smoothers ???
Low Pass Filter
[Figure: the same filter applied to data from a log function]
Low Pass Filter
- Smooth
- Simply extended to any number of dimensions
- Effectively, 3 parameters: type, order, and bandwidth
- Biased end-points
- Inappropriate for some functions (depends on bandwidth)
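A sketch of this kind of smoothing with SciPy's Butterworth design; the normalized cutoff (0.1) is an illustrative choice, not the exact filter from the slide:

```python
from scipy.signal import butter, filtfilt

def butterworth_smoother(y_sorted, cutoff=0.1, order=2):
    """Zero-phase low-pass filtering of the responses (data assumed sorted by X)."""
    b, a = butter(order, cutoff)        # design the digital low-pass filter
    return filtfilt(b, a, y_sorted)     # forward-backward filtering removes the phase delay
```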
Kernel Average Smoother
\hat{Y}(x_0) = \mathrm{Average}\left( w_i\, y_i \mid x_i \in \mathrm{Neighborhood}(x_0) \right)
[Figure: kernel average estimate \hat{Y}(x_0) at a point x_0]
Kernel Average Smoother
Nadaraya–Watson kernel-weighted average:
\hat{Y}(x_0) = \frac{\sum_{i=1}^{N} K(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K(x_0, x_i)}
with the kernel:
K(x_0, x) = D\!\left( \frac{|x - x_0|}{h(x_0)} \right)
where h_m(x_0) = |x_0 - x_{[m]}| for the Nearest Neighbor Smoother, and h(x_0) = \lambda (constant) for the Locally Weighted Average.
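A minimal sketch of the Nadaraya–Watson estimate at a single point; the Gaussian kernel and the default bandwidth `lam` are illustrative choices:

```python
import numpy as np

def nadaraya_watson(x, y, x0, lam=0.5):
    """Kernel-weighted average Y_hat(x0) = sum_i K(x0, x_i) y_i / sum_i K(x0, x_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)    # K(x0, x_i) = D(|x_i - x0| / lam), Gaussian D
    return np.sum(w * y) / np.sum(w)
```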
Popular Kernels
Epanechnikov kernel:
D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2), & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}
Tri-cube kernel:
D(t) = \begin{cases} (1 - |t|^3)^3, & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}
Gaussian kernel:
D(t) = \phi(t) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right)
[Figure: the three kernels plotted on [-3, 3]]
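The three kernel profiles D(t) translate directly into code; a small sketch for reference:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def tricube(t):
    return np.where(np.abs(t) <= 1, (1 - np.abs(t) ** 3) ** 3, 0.0)

def gaussian(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
```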
Non-Symmetric Kernel
Kernel example:
D(t) = \begin{cases} \alpha (1 - \alpha)^{t}, & t \ge 0,\ 0 < \alpha < 1 \\ 0, & \text{otherwise} \end{cases}
\hat{Y}_i = \alpha Y_i + \alpha(1-\alpha) Y_{i-1} + \alpha(1-\alpha)^2 Y_{i-2} + \dots + (1-\alpha)^{i-1} Y_1, \qquad \alpha \sum_{i \ge 0} (1-\alpha)^i = 1
Which kernel is that ???
[Figure: the kernel D(t) plotted for t \ge 0]
Kernel Average Smoother
- Single parameter: window width
- Smooth
- Trivially extended to any number of dimensions
- Memory-based method – little or no training is required
- Depends on the metric definition
- Biased end-points
Local Linear Regression
Kernel-weighted average minimizes:
\min_{\alpha(x_0)} \sum_{i=1}^{N} K(x_0, x_i)\,\big[ y_i - \alpha(x_0) \big]^2, \qquad \hat{Y}(x_0) = \alpha(x_0)
Local linear regression minimizes:
\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K(x_0, x_i)\,\big[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \big]^2
\hat{Y}(x_0) = \alpha(x_0) + \beta(x_0)\, x_0
Local Linear Regression
Solution:
\hat{Y}(x_0) = [\,1,\ x_0\,]\,\big( \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B} \big)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y}
where:
\mathbf{B}^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_N \end{pmatrix}, \qquad \mathbf{W}(x_0) = \operatorname{diag}\big( K(x_0, x_i) \big)_{N \times N}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}
Other representation:
\hat{Y}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i \quad \text{(equivalent kernel)}, \qquad \sum_{i=1}^{N} l_i(x_0) = 1
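A minimal sketch of the closed-form solution above, with an Epanechnikov kernel (kernel choice and default bandwidth `lam` are illustrative; at least two points must receive nonzero weight):

```python
import numpy as np

def local_linear(x, y, x0, lam=0.5):
    """Y_hat(x0) = [1, x0] (B^T W B)^{-1} B^T W y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)          # K(x0, x_i), Epanechnikov
    B = np.column_stack([np.ones_like(x), x])                # regression matrix B
    W = np.diag(w)                                           # W(x0) = diag(K(x0, x_i))
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)         # weighted least-squares coefficients
    return np.array([1.0, x0]) @ beta
```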
Local Linear Regression
[Figure: local linear regression estimate \hat{Y}(x_0) at a point x_0]
Equivalent Kernels
Local Polynomial Regression
Why stop at local linear fits? Let's minimize:
\min_{\alpha(x_0),\, \beta_j(x_0),\ j=1,\dots,d} \ \sum_{i=1}^{N} K(x_0, x_i)\left[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{\,j} \right]^2
\hat{Y}(x_0) = \alpha(x_0) + \sum_{j=1}^{d} \beta_j(x_0)\, x_0^{\,j}
Local Polynomial Regression
Variance Compromise
\operatorname{Var}\big( \hat{Y}(x_0) \big) = \sigma^2 \left\| l(x_0) \right\|^2, \qquad \text{for } y_i = f(x_i) + \varepsilon_i,\ \operatorname{Var}(\varepsilon_i) = \sigma^2,\ E(\varepsilon_i) = 0
[Figure: variance comparison; \lambda = 0.2, tri-cube kernel]
Conclusions
- Local linear fits can help reduce bias dramatically at the boundaries, at a modest cost in variance.
- Local linear fits are more reliable for extrapolation.
- Local quadratic fits do little for bias at the boundaries, but increase the variance a lot.
- Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.
- λ controls the tradeoff between bias and variance: larger λ gives lower variance but higher bias.
Local Regression in \mathbb{R}^p
Radial kernel:
K(x_0, x) = D\!\left( \frac{\| x - x_0 \|}{h(x_0)} \right)
\hat{\beta}(x_0) = \arg\min_{\beta(x_0)} \sum_{i=1}^{N} K(x_0, x_i)\,\big( y_i - b(x_i)^T \beta(x_0) \big)^2
b(X)^T = \big( 1,\ X_1,\ X_2,\ X_1^2,\ X_2^2,\ X_1 X_2,\ \dots \big)
\hat{Y}(x_0) = b(x_0)^T\, \hat{\beta}(x_0)
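A sketch of the p-dimensional local fit with a radial Gaussian kernel, restricting b(X) to the linear terms (1, X_1, ..., X_p) for brevity; all names and defaults are illustrative:

```python
import numpy as np

def local_regression_rp(X, y, x0, lam=1.0):
    """Locally weighted linear fit in R^p at the point x0."""
    X, y, x0 = np.asarray(X, float), np.asarray(y, float), np.asarray(x0, float)
    d = np.linalg.norm(X - x0, axis=1)                       # ||x_i - x0||
    w = np.exp(-0.5 * (d / lam) ** 2)                         # radial kernel weights
    B = np.column_stack([np.ones(len(X)), X])                 # rows b(x_i)^T
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)          # beta_hat(x0)
    return np.concatenate(([1.0], x0)) @ beta                 # Y_hat(x0) = b(x0)^T beta_hat(x0)
```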
Popular Kernels
Epanechnikov kernel:
D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2), & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}
Tri-cube kernel:
D(t) = \begin{cases} (1 - |t|^3)^3, & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}
Gaussian kernel:
D(t) = \phi(t) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right)
Example
Z = \sin(X) + \varepsilon, \qquad X \sim U[0, 2\pi], \qquad \varepsilon \sim N(0, 0.2)
[Figure: 3-D view of the sampled data over (X, Y) with response Z]
Higher Dimensions
- The boundary estimation is problematic
- Many sample points are needed to reduce the bias
- Local regression is less useful for p > 3
- It is impossible to maintain localness (low bias) and sizeable samples (low variance) at the same time
Structured Kernels
Non-radial kernel:
K_{\lambda,\mathbf{A}}(x_0, x) = D\!\left( \frac{(x - x_0)^T \mathbf{A}\,(x - x_0)}{\lambda} \right), \qquad \mathbf{A} \succeq 0
- Coordinates or directions can be downgraded or omitted by imposing restrictions on A.
- The covariance can be used to adapt the metric A (related to the Mahalanobis distance).
- Projection-pursuit model
Structured Regression
Divide X \in \mathbb{R}^p into a set (X_1, X_2, \dots, X_q) with q < p, and collect the remainder of the variables in the vector Z.
Conditionally linear model:
f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \dots + \beta_q(Z)\, X_q
For given Z, fit the model by locally weighted least squares:
\min_{\alpha(z_0),\, \beta(z_0)} \sum_{i=1}^{N} K(z_0, z_i)\,\big( y_i - \alpha(z_0) - \beta_1(z_0)\, x_{1i} - \dots - \beta_q(z_0)\, x_{qi} \big)^2
Density Estimation
\hat{f}_X(x_0) = \frac{\#\{ x_i \in \mathrm{Neighborhood}_\lambda(x_0) \}}{N \lambda}
Mixture of two normal distributions, N = 600, \lambda = 0.3
[Figure: original distribution, constant-window estimate, and the sample set]
Kernel Density Estimation
Smooth Parzen estimate:
\hat{f}_X(x_0) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(x_0, x_i)
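A minimal sketch of the Parzen estimate at a single point, with a Gaussian kernel (kernel choice and bandwidth `lam` are illustrative):

```python
import numpy as np

def parzen_density(samples, x0, lam=0.4):
    """f_hat(x0) = (1/N) * sum_i phi_lam(x0 - x_i)."""
    samples = np.asarray(samples, float)
    phi = np.exp(-0.5 * ((x0 - samples) / lam) ** 2) / (lam * np.sqrt(2 * np.pi))
    return np.mean(phi)
```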
Comparison
[Figure: density estimates for a mixture of two normal distributions — data samples, nearest-neighbor, Epanechnikov, and Gaussian estimates]
Usually, bandwidth selection is more important than kernel-function selection.
Kernel Density Estimation
Gaussian kernel density estimation:
\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} \phi_\lambda(x - x_i) = \big( \hat{F} * \phi_\lambda \big)(x) \quad \text{(an LPF applied to the sample distribution } \hat{F}\text{)}
where \phi_\lambda denotes the Gaussian density with mean zero and standard deviation \lambda.
Generalization to \mathbb{R}^p:
\hat{f}_X(x_0) = \frac{1}{N\,(2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} \exp\left( -\frac{\| x_i - x_0 \|^2}{2\lambda^2} \right)
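The \mathbb{R}^p formula translated directly into code (a sketch; `lam` is an illustrative bandwidth):

```python
import numpy as np

def gaussian_kde_rp(X, x0, lam=0.4):
    """f_hat(x0) = sum_i exp(-||x_i - x0||^2 / (2 lam^2)) / (N (2 pi lam^2)^(p/2))."""
    X = np.asarray(X, float)
    N, p = X.shape
    sq = np.sum((X - np.asarray(x0, float)) ** 2, axis=1)     # squared distances ||x_i - x0||^2
    return np.sum(np.exp(-sq / (2 * lam ** 2))) / (N * (2 * np.pi * lam ** 2) ** (p / 2))
```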
Kernel Density Classification
For a J-class problem:
\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat{\pi}_j\, \hat{f}_j(x_0)}{\sum_{k=1}^{J} \hat{\pi}_k\, \hat{f}_k(x_0)}, \qquad j = 1, \dots, J
where \hat{f}_j(x) is a kernel density estimate of the class-conditional density f_j(x) = \Pr(X = x \mid G = j), and \hat{\pi}_j is the estimated class prior.
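A sketch of such a classifier with a one-dimensional Gaussian KDE per class and sample-proportion priors (all names and the bandwidth `lam` are illustrative):

```python
import numpy as np

def kde_classify(x0, class_samples, lam=0.4):
    """Return posterior probabilities Pr_hat(G = j | X = x0); class_samples is a list of arrays, one per class."""
    n_total = sum(len(s) for s in class_samples)
    scores = []
    for s in class_samples:
        s = np.asarray(s, float)
        prior = len(s) / n_total                                # pi_hat_j, sample proportion
        dens = np.mean(np.exp(-0.5 * ((x0 - s) / lam) ** 2)) / (lam * np.sqrt(2 * np.pi))  # f_hat_j(x0)
        scores.append(prior * dens)
    scores = np.array(scores)
    return scores / scores.sum()                                # normalize over the J classes
```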
Radial Basis Functions
Function f(x) is represented as an expansion in basis functions:
f(x) = \sum_{j=1}^{M} \beta_j\, h_j(x)
Radial basis function expansion (RBF):
f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^{M} D\!\left( \frac{\| x - \xi_j \|}{\lambda_j} \right) \beta_j
where the sum of squares is minimized with respect to all the parameters (for the Gaussian kernel):
\min_{\{\lambda_j,\, \xi_j,\, \beta_j\}_1^M} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\left( -\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2} \right) \right)^2
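A simplified sketch: the centers ξ_j and the width λ are held fixed and only the β coefficients are solved by linear least squares (the problem above is a full nonlinear fit over all parameters; names and defaults are illustrative):

```python
import numpy as np

def rbf_fit(x, y, centers, lam=0.5):
    """Solve for beta in f(x) = sum_j beta_j * exp(-(x - xi_j)^2 / lam^2)."""
    x, centers = np.asarray(x, float), np.asarray(centers, float)
    H = np.exp(-((x[:, None] - centers[None, :]) ** 2) / lam ** 2)   # basis matrix h_j(x_i)
    beta, *_ = np.linalg.lstsq(H, np.asarray(y, float), rcond=None)
    return beta

def rbf_predict(x_new, centers, beta, lam=0.5):
    x_new, centers = np.asarray(x_new, float), np.asarray(centers, float)
    H = np.exp(-((x_new[:, None] - centers[None, :]) ** 2) / lam ** 2)
    return H @ beta
```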
Radial Basis Functions
When assuming a constant \lambda_j = \lambda: the problem of "holes" (regions where no basis function has appreciable support).
The solution – renormalized RBF:
h_j(x) = \frac{D\big( \| x - \xi_j \| / \lambda \big)}{\sum_{k=1}^{M} D\big( \| x - \xi_k \| / \lambda \big)}
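The renormalized basis in code (a sketch with a Gaussian profile for D and a fixed λ; names are illustrative):

```python
import numpy as np

def renormalized_rbf_basis(x, centers, lam=0.5):
    """h_j(x) = D(|x - xi_j| / lam) / sum_k D(|x - xi_k| / lam); each row sums to 1."""
    x, centers = np.asarray(x, float), np.asarray(centers, float)
    D = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / lam) ** 2)  # unnormalized kernel values
    return D / D.sum(axis=1, keepdims=True)                          # normalize across the M centers
```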
Additional Applications
- Local likelihood
- Mixture models for density estimation and classification
- Mean-shift
Conclusions
- Memory-based methods: the model is the entire training data set
- Infeasible for many real-time applications
- Provides good smoothing results for an arbitrarily sampled function
- Appropriate for interpolation and extrapolation
- When the model is known, it is better to use another fitting method