Lecture 2: Introduction to Kernel Methods
Pavel Laskov and Blaine Nelson
Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany
Advanced Topics in Machine Learning, April 24, 2012
An Apartment for Rent.
Living area: 51 m²
Monthly rent: 550 €
Is this price fair?
Linear Regression
Objective: find the relationship between (correlated) input variables x and output variables y
f(x) = w⊤x + b
Applications of regression:
Interpolation: determining output values for other points within the range of available data
Extrapolation: determining output values for other points outside of the range of available data
Analysis: understanding specific parameters, e.g., the slope w or the bias b
Linear Regression - Problem Setup
Given is the data D = {(xi, yi)}, i = 1, …, N, where each data point is a pair of input variables xi ∈ ℝ^D and the corresponding output yi
Objective: find the weight vector w and the bias b which minimize the discrepancy between the predictions w⊤xi + b and the true outputs yi for each point i
Linear Regression - Matrix Representation
Arrange all data points in rows of the data matrix X
X = [x1⊤; x2⊤; … ; xN⊤]
Then the prediction error can be expressed as
ξ = y − Xw − 1b
The objective is now to minimize the squared prediction error
min L = ξ⊤ξ = (y − Xw − 1b)⊤(y − Xw − 1b)
Linear Regression - Derivation of the Optimal Solution
Compute partial derivatives of L with respect to w and b:
∂L/∂w = −2X⊤(y − Xw − 1b) = −2X⊤y + 2X⊤Xw + 2X⊤1b
∂L/∂b = −2 · 1⊤(y − Xw − 1b) = −2 · 1⊤y + 2 · 1⊤Xw + 2 · 1⊤1b
The optimal solution is obtained by equating partial derivatives to zero:
[X⊤X, X⊤1; 1⊤X, 1⊤1] [w; b] = [X⊤; 1⊤] y
By redefining X = [X, 1] and w = [w; b], one obtains the classical system of equations for linear regression passing through the origin:
X⊤X w = X⊤y
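As an aside (not part of the original slides), the normal equations above can be solved directly with NumPy; this is a minimal sketch, with made-up apartment-style numbers used purely for illustration.

```python
import numpy as np

# Hypothetical data: living area in m^2 and monthly rent in EUR (illustrative only)
X = np.array([[35.0], [51.0], [64.0], [80.0]])
y = np.array([380.0, 520.0, 610.0, 770.0])

# Augment with a column of ones: X = [X, 1], w = [w; b]
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Classical normal equations: X^T X w = X^T y
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = w_aug[:-1], w_aug[-1]
print("slope:", w[0], "bias:", b)
```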
Linear Regression - Solution of the Apartment Problem
[Plot: rent (€) vs. living area (m²) with the fitted line; linear regression, λ = 0.00, MSE = 20.72]
The regression slope is 9.47 €/m², the bias is 9.42 €.
The (fair) rent should have been 9.47 · 51 ≈ 483 €.
Nonlinear Regression
Model the dependency with a linear combination of non-linear basis functions:
f(x) = c1 f1(x) + c2 f2(x) + … + cM fM(x)
Construct the data matrix X from the non-linear contributions of each data point:
X = [f1(x1), f2(x1), …, fM(x1); … ; f1(xN), f2(xN), …, fM(xN)]
Re-use the standard solution
The solution vector w contains coefficients c1, . . . , cM .
In general, we will denote the space containing nonlinear transformations of the input data as the feature space:
Φ(x) = [f1(x), f2(x), …, fM(x)]
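To make the feature-space construction concrete, here is a small sketch (not from the slides) that builds the data matrix X from monomial basis functions of a one-dimensional input; the degree and the data values are assumptions for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Data matrix with columns [x, x^2, ..., x^degree]; the bias is handled separately."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x ** k for k in range(1, degree + 1)])

x = np.array([35.0, 51.0, 64.0, 80.0])      # hypothetical living areas
X_feat = poly_features(x, degree=2)         # rows: data points, columns: f_1(x), f_2(x)
# X_feat can be plugged into the same normal equations as in the linear case.
```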
Nonlinear Regression - Example: Polynomial of Degree 2
[Plot: rent (€) vs. living area (m²) with the fitted curve; degree-2 polynomial regression, MSE = 20.66]
f(x) = b + c1 x + c2 x²
Slightly better accuracy of fit: MSE of 20.66 instead of 20.72
Nonlinear Regression - Example: Polynomial of Degree 3
[Plot: rent (€) vs. living area (m²) with the fitted curve; degree-3 polynomial regression, MSE = 20.65]
f(x) = b + c1 x + c2 x² + c3 x³
A tiny improvement in accuracy: MSE of 20.65 instead of 20.66
Nonlinear Regression - Example: Polynomial of Degree 4
[Plot: rent (€) vs. living area (m²) with the fitted curve; degree-4 polynomial regression, MSE = 20.29]
f(x) = b + c1 x + c2 x² + c3 x³ + c4 x⁴
A significant jump in accuracy: MSE of 20.29 instead of 20.65
Overfitting

Polynomials of higher degree can easily fit any constellation of training data. Some other complex functions fit better, but...

They poorly approximate the true dependency, both between and outside of the range of the training data points.

Regularization: the trade-off between the accuracy of fit to the training data and the accuracy of approximation of the true dependency can be controlled by restricting the coefficients c1, c2, …, cM.

Modify the objective function as follows:

min L = ξ⊤ξ + λ w⊤w

Regularized solution:

[X⊤X + λI, X⊤1; 1⊤X, 1⊤1] [w; b] = [X⊤; 1⊤] y
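A minimal sketch of the regularized solution above, assuming (as the block system suggests) that only the weights w are penalized and the bias b is not; the data matrix X, targets y, and λ would come from the application.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve [X^T X + lam*I, X^T 1; 1^T X, 1^T 1] [w; b] = [X^T; 1^T] y."""
    N, M = X.shape
    ones = np.ones((N, 1))
    A = np.block([[X.T @ X + lam * np.eye(M), X.T @ ones],
                  [ones.T @ X,                ones.T @ ones]])
    rhs = np.concatenate([X.T @ y, ones.T @ y])
    sol = np.linalg.solve(A, rhs)
    return sol[:-1], sol[-1]    # weights w and bias b
```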
Nonlinear Regression - Example: Regularized Polynomial of Degree 4
[Plot: rent (€) vs. living area (m²) with the fitted curve; degree-4 polynomial regression, λ = 0.50, MSE = 20.62]
More improvement in accuracy of fit: MSE of 20.62 instead of 20.65
The Curse of Dimensionality

How many monomials does a polynomial of degree k in d dimensions have?

k = 2, d = 2:
x1² + x1x2 + x2² + x1 + x2 + b ⇒ 6

k = 2, d = 3:
x1² + x2² + x3² + x1x2 + x1x3 + x2x3 + x1 + x2 + x3 + b ⇒ 10

k = 2, general d:
(x1² + … + xd²)  [d terms]  +  (x1x2 + … + xd−1 xd)  [d(d−1)/2 terms]  +  (x1 + … + xd)  [d terms]  +  b ⇒ O(d²)

general k and d:
monomials of degree k + monomials of degree k−1 + … + b ⇒ Σ_{i=1}^{k} (i+d−1)! / (i! (d−1)!) = O(d^k)
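The count on this slide is easy to check numerically; a small sketch (the "+ 1" accounts for the constant term b):

```python
from math import comb

def num_monomials(k, d):
    """Number of monomials of a degree-k polynomial in d variables, constant term included."""
    # comb(i + d - 1, i) = (i + d - 1)! / (i! (d - 1)!)
    return sum(comb(i + d - 1, i) for i in range(1, k + 1)) + 1

assert num_monomials(2, 2) == 6      # x1^2 + x1x2 + x2^2 + x1 + x2 + b
assert num_monomials(2, 3) == 10
print(num_monomials(4, 100))         # grows like O(d^k): already 4,598,126 features
```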
A New Look at the Problem

Consider the following optimization problem:

min_{ξ,w} ½ ξ⊤ξ + (λ/2) w⊤w   subject to:   ξ = y − Xw − 1b

How do we minimize a function subject to constraints?
Some Facts from Optimization Theory

1. To optimize a function f(x) subject to constraints g(x) = 0:
Add g(x) to the objective function, weighted by the dual variables α:
L = f(x) + α⊤g(x)
The objective function extended with the weighted constraints is called the Lagrangian.
Optimize the Lagrangian L with respect to both x and α.

2. If joint optimization is difficult, simplify the problem by the following trick:
Compute the partial derivatives of L with respect to x and equate them to zero.
Use these conditions to eliminate x from L.
The resulting optimization problem, containing only α, is known as the dual optimization problem.
Nonlinear Regression: A Dual View - Differential of the Lagrangian
The Lagrangian of the regression problem is:
L = ½ ξ⊤ξ + (λ/2) w⊤w − α⊤(ξ − y + Xw + 1b)
Its partial derivatives with respect to primal variables are:
∂L/∂ξ = ξ − α = 0 ⇒ ξ = α
∂L/∂w = λw − X⊤α = 0 ⇒ w = (1/λ) X⊤α
∂L/∂b = 1⊤α = 0
Nonlinear Regression: A Dual View - Derivation of the Dual Problem

Substituting the expressions for the optimal ξ and w back into the Lagrangian, we obtain:

L = ½ α⊤α + (λ/2)(1/λ²) α⊤XX⊤α − α⊤α + α⊤y − (1/λ) α⊤XX⊤α − (α⊤1) b,   where α⊤1 = 0
  = −(1/(2λ)) α⊤XX⊤α − ½ α⊤α + α⊤y

Differentiating with respect to α, we obtain the following optimality condition:

(XX⊤ + λI) α = λy
Primal vs. Dual

The final solutions of both problems are similar:

primal: (X⊤X + λI) w = X⊤y
dual: (XX⊤ + λI) α = λy

Key difference: the dimensionality of the linear system

primal: X⊤X is (M + 1) × (M + 1)
dual: XX⊤ is N × N

For nonlinear regression,

N ≪ M = Σ_{i=1}^{k} (i+d−1)! / (i! (d−1)!)
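A quick numerical sketch (illustrative only, with random data) that the two systems above indeed give the same weights while differing in size; the bias is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 5, 50, 0.1                  # few data points, many features: N << M
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)   # (M x M) system
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), lam * y)      # (N x N) system
w_dual = X.T @ alpha / lam                                        # w = (1/lambda) X^T alpha

print(np.allclose(w_primal, w_dual))    # True: same solution, much smaller system
```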
A Closer Look at the Dual Problem

How long does it take to compute XX⊤?

Multiplying the [N × M] matrix by the [M × N] matrix costs O(N²M) operations, with M = Σ_{i=1}^{k} (i+d−1)! / (i! (d−1)!), which is prohibitively expensive.

Let us look at the structure of XX⊤:

XX⊤ = [x1⊤; x2⊤; … ; xN⊤] [x1, x2, …, xN] = [x1⊤x1, x1⊤x2, …, x1⊤xN; x2⊤x1, x2⊤x2, …, x2⊤xN; … ; xN⊤x1, xN⊤x2, …, xN⊤xN]

The main challenge is to compute the inner products xi⊤xj efficiently.

Can we do this faster than in O(M)?
Kernel Magic - Example 1: 2-Dimensional Polynomials of Degree 2

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 2:

Φ(x) = [x1², x2², √2 x1x2, √2 x1, √2 x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1²y1² + x2²y2² + 2 x1x2y1y2 + 2 x1y1 + 2 x2y2 + 1 = (x1y1 + x2y2 + 1)²

Complexity: 3 multiplications instead of 6.
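The identity on this slide can be verified numerically; a small sketch with arbitrary test points:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map in 2 dimensions, as on the slide."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([1.5, -0.3])
y = np.array([0.7, 2.0])
print(phi(x) @ phi(y), (x @ y + 1.0) ** 2)   # both print the same value
```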
Kernel Magic - Example 2: 2-Dimensional Polynomials of Degree 3

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 3:

Φ(x) = [x1³, x2³, √3 x1²x2, √3 x1x2², √3 x1², √3 x2², √6 x1x2, √3 x1, √3 x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1³y1³ + x2³y2³ + 3 x1²x2y1²y2 + 3 x1x2²y1y2² + 3 x1²y1² + 3 x2²y2² + 6 x1x2y1y2 + 3 x1y1 + 3 x2y2 + 1 = (x1y1 + x2y2 + 1)³

Complexity: 3 multiplications instead of 10.
Polynomial Kernel
For a polynomial of degree k in d dimensions, the inner product between two feature vectors can be computed as:
Φ(x)⊤Φ(y) = (x⊤y + 1)^k
Complexity: O(d)
The function k(x, y) = (x⊤y + 1)^k is called a reproducing kernel for the feature space of polynomials (the qualification “reproducing” will be discussed in the next lecture)
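As a sketch, the polynomial kernel can be evaluated directly on the raw inputs, so the cost stays O(d) no matter how large the implicit feature space is:

```python
import numpy as np

def poly_kernel(x, y, k):
    """Polynomial kernel k(x, y) = (x^T y + 1)^k, evaluated in O(d)."""
    return (np.dot(x, y) + 1.0) ** k

x = np.random.default_rng(1).normal(size=1000)   # d = 1000 input dimensions
y = np.random.default_rng(2).normal(size=1000)
print(poly_kernel(x, y, k=4))                    # never builds the O(d^4) feature vector
```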
Classical Kernel Functions
The Radial Basis Function (RBF) feature space is defined solely via its kernel:
k(x, y) = exp(−‖x − y‖² / (2σ))
Hyperbolic Tangent Function (used for neural networks):
k(x, y) = tanh(αx⊤y + c)
Both of these functions enable O(d) computation of inner products in infinite-dimensional feature spaces.
More than 30 different kernel functions are widely used in machine learning algorithms: splines, chi-square, histogram, wavelets, etc.
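For illustration, both kernels can be computed as full N × N kernel matrices in a few lines; this sketch follows the slide's bandwidth convention (2σ in the denominator), which is an assumption about the intended parameterization.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2*sigma))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * sigma))

def tanh_kernel(X, Y, alpha, c):
    """Hyperbolic tangent kernel matrix: K[i, j] = tanh(alpha * x_i^T y_j + c)."""
    return np.tanh(alpha * X @ Y.T + c)
```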
Algorithmic Properties of Kernels
+ Kernel functions provide an efficient way to compute similarity between data in complex feature spaces.
Complexity varies from linear to low-degree polynomial, compared to high-degree polynomial, exponential, or even infinite.
+ Kernel functions provide a powerful abstraction for algorithmic design.
For algorithms formulated in terms of kernels, it suffices to change the kernel function to implement a new feature space; no change to the algorithm is necessary.
− Numeric properties of kernel-based algorithms are worse.
Special care must be taken to avoid numeric instability.
Kernel Nonlinear Regression - Example: Polynomial of Degree 4
[Plot: rent (€) vs. living area (m²) with the fitted curve; ridge regression with a degree-4 polynomial kernel, λ = 5.00, MSE = 20.13]
Excellent accuracy: MSE of 20.13 (20.26 for explicit space)
Numeric instability: strong regularization required (λ = 5)
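Putting the pieces together, here is a minimal kernel ridge regression sketch in the dual formulation above (bias omitted, data made up); the degree-4 polynomial kernel and λ = 5 mirror the example on this slide, which also warns that strong regularization is needed for numeric stability.

```python
import numpy as np

poly4 = lambda A, B: (A @ B.T + 1.0) ** 4            # polynomial kernel of degree 4

def fit(X, y, kernel, lam):
    K = kernel(X, X)                                  # N x N kernel matrix
    return np.linalg.solve(K + lam * np.eye(len(y)), lam * y)   # dual variables alpha

def predict(X_train, alpha, kernel, lam, X_new):
    # f(x) = (1/lambda) * sum_i alpha_i * k(x_i, x)
    return kernel(X_new, X_train) @ alpha / lam

X = np.array([[35.0], [51.0], [64.0], [80.0]])        # hypothetical living areas (m^2)
y = np.array([380.0, 520.0, 610.0, 770.0])            # hypothetical rents (EUR)
alpha = fit(X, y, poly4, lam=5.0)
print(predict(X, alpha, poly4, lam=5.0, X_new=np.array([[51.0]])))
```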
Kernel Nonlinear Regression - Example: Radial Basis Function, σ = 30
[Plot: rent (€) vs. living area (m²) with the fitted curve; ridge regression with an RBF kernel, σ = 30, λ = 0.01, MSE = 20.29]
Good accuracy: MSE of 20.29
Smooth kernel parameter required: σ = 30
Summary
1. The development of learning algorithms involves complex algorithms for the optimization of cost functions.
2. A widespread method in the development of learning algorithms is mapping data objects into a nonlinear feature space.
3. The prohibitively high dimensionality of feature spaces can, in many cases, be overcome by transforming the problem into the dual form, in which the data occurs in the form of inner products.
4. Inner products in high-dimensional spaces can be efficiently computed with nonlinear kernel functions.
5. In general, kernel functions offer a powerful abstraction for the development of learning algorithms.
6. Next Lecture: We will discuss further mathematical properties of kernel functions which affect the development of learning algorithms.