Kernels
Usman Roshan, CS 675 Machine Learning
Feature space representation
• Consider the two classes shown below
• The data cannot be separated by a hyperplane
Feature space representation
• Suppose we square each coordinate; in other words, (x1, x2) → (x1^2, x2^2)
• Now the data are well separated
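As a concrete illustration (the toy data here are my own, not from the slides): points on an inner and an outer circle are not linearly separable, but squaring each coordinate makes them separable by a hyperplane in the new space.

```python
import math

# Toy data: class -1 on a unit circle, class +1 on a circle of radius 3.
# No line separates them in (x1, x2).
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5)]

def square_map(p):
    # (x1, x2) -> (x1^2, x2^2)
    x1, x2 = p
    return (x1 * x1, x2 * x2)

# In the squared space the hyperplane z1 + z2 = 5 separates the classes:
# inner points satisfy z1 + z2 = 1, outer points z1 + z2 = 9.
assert all(sum(square_map(p)) < 5 for p in inner)
assert all(sum(square_map(p)) > 5 for p in outer)
```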
Feature spaces/Kernel trick
• Using a linear classifier (nearest means or SVM) we solve a non-linear problem simply by working in a different feature space.
• With kernels:
– we don't have to make the new feature space explicit
– we can implicitly work in a different space and efficiently compute dot products there
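This can be checked concretely: for 2-d inputs, the degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2 equals an ordinary dot product in a 6-dimensional feature space that we never construct. A minimal sketch using the standard degree-2 construction (not code from the slides):

```python
import math

def poly_kernel(x, z, d=2):
    # K(x, z) = (x . z + 1)^d, computed without building the feature space
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def phi(x):
    # Explicit degree-2 feature map for 2-d input (6 dimensions)
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                              # works in 2 dimensions
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))     # works in 6 dimensions
assert math.isclose(implicit, explicit)
```

The kernel evaluation touches only the 2-d inputs, which is the point: the 6-d space exists only implicitly.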
Support vector machine
• Consider the hard-margin SVM optimization:

  min_{w, w_0} (1/2)||w||^2   subject to   y_i(w^T x_i + w_0) ≥ 1 for all i

• Solve by applying KKT. Think of KKT as a tool for constrained convex optimization.
• Form the Lagrangian:

  L_p = (1/2)||w||^2 − Σ_i α_i (y_i(w^T x_i + w_0) − 1)

  where the α_i ≥ 0 are Lagrange multipliers
Support vector machine
• KKT says the optimal w and w_0 are given by the saddle-point solution:

  min_{w, w_0} max_α L_p = (1/2)||w||^2 − Σ_i α_i (y_i(w^T x_i + w_0) − 1)   subject to   α_i ≥ 0

• The KKT conditions imply that w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
Support vector machine
• Substituting w = Σ_i α_i y_i x_i into the primal eliminates w and w_0 and gives the dual (which is maximized):

  max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j   subject to   α_i ≥ 0 and Σ_i α_i y_i = 0
SVM and kernels
• We can rewrite the dual in a compact form: with G_ij = y_i y_j x_i^T x_j,

  max_α 1^T α − (1/2) α^T G α   subject to   α ≥ 0, y^T α = 0

• Since the data enter only through the dot products x_i^T x_j, we may replace them by a kernel K(x_i, x_j)
Optimization
• The SVM dual is thus a quadratic program that can be solved by any quadratic-program solver.
• Platt's Sequential Minimal Optimization (SMO) algorithm offers a simple solution specific to the SVM dual
• The idea is to perform coordinate ascent, selecting two variables at a time to optimize (two, because the constraint Σ_i α_i y_i = 0 prevents changing a single α_i alone)
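A sketch of the two-variable coordinate-ascent idea, loosely following Platt's simplified SMO variant (his working-set heuristics are replaced by a random choice of the second variable; this is for intuition, not production use):

```python
import random

def smo_linear(X, y, C=1.0, tol=1e-4, max_passes=10, seed=0):
    # Simplified SMO for the linear-kernel SVM dual: coordinate ascent that
    # optimizes two multipliers alpha_i, alpha_j at a time.
    rng = random.Random(seed)
    n = len(y)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0

    def f(i):  # current decision value on training point i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = iters = 0
    while passes < max_passes and iters < 100:
        iters += 1
        changed = 0
        for i in range(n):
            Ei = f(i) - y[i]
            # Only touch alpha_i if it violates the KKT conditions
            if not ((y[i] * Ei < -tol and alpha[i] < C) or
                    (y[i] * Ei > tol and alpha[i] > 0)):
                continue
            j = rng.randrange(n - 1)          # pick a second index j != i
            if j >= i:
                j += 1
            Ej = f(j) - y[j]
            ai, aj = alpha[i], alpha[j]
            # Box [L, H] keeps 0 <= alpha_j <= C while preserving sum_k alpha_k y_k = 0
            if y[i] != y[j]:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]   # curvature along the pair
            if L == H or eta >= 0:
                continue
            alpha[j] = min(H, max(L, aj - y[j] * (Ei - Ej) / eta))
            if abs(alpha[j] - aj) < 1e-7:
                continue
            alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
            # Update the bias from whichever multiplier is strictly inside (0, C)
            b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i][i] - y[j] * (alpha[j] - aj) * K[i][j]
            b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i][j] - y[j] * (alpha[j] - aj) * K[j][j]
            b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Toy separable data (my own): the decision values should match the labels' signs
X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
y = [1, 1, -1, -1]
alpha, b = smo_linear(X, y)

def decision(x):
    return sum(alpha[k] * y[k] * sum(p * q for p, q in zip(X[k], x))
               for k in range(len(y))) + b
```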
• Let’s look at some kernels.
Example kernels
• Polynomial kernels of degree d give a feature space with higher-order non-linear terms:

  K(x_i, x_j) = (x_i^T x_j + 1)^d

• The radial basis (RBF) kernel gives an infinite-dimensional feature space (seen by expanding the exponential as a Taylor series):

  K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2s^2))
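Both kernels are a few lines of Python (function and parameter names are mine):

```python
import math

def polynomial_kernel(x, z, d=3):
    # K(x_i, x_j) = (x_i . x_j + 1)^d
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def rbf_kernel(x, z, s=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 s^2)); s controls the width
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * s * s))

# The RBF kernel is 1 at zero distance and decays toward 0 with distance
assert rbf_kernel((1.0, 2.0), (1.0, 2.0)) == 1.0
assert rbf_kernel((0.0, 0.0), (3.0, 0.0)) < rbf_kernel((0.0, 0.0), (1.0, 0.0))
```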
Example kernels
• Empirical kernel map
– Define a set of reference vectors m_1, …, m_p
– Define a score s(x_i, m_j) between x_i and each m_j
– Then Φ(x_i) = (s(x_i, m_1), …, s(x_i, m_p))
– And K(x_i, x_j) = Φ(x_i)^T Φ(x_j)
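A sketch of the empirical kernel map; the RBF-style score and the reference vectors below are assumptions of mine (the slides leave the score unspecified):

```python
import math

def score(x, m):
    # One possible score between x and a reference vector m (an assumption)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)))

def phi(x, refs):
    # Empirical kernel map: represent x by its scores against the references
    return [score(x, m) for m in refs]

def empirical_kernel(x, z, refs):
    # K(x, z) = phi(x) . phi(z) is an explicit dot product, hence a valid kernel
    return sum(a * b for a, b in zip(phi(x, refs), phi(z, refs)))

refs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]  # hypothetical reference vectors
```

Because K is defined as a dot product of explicit feature vectors, it is symmetric and positive semidefinite by construction.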
Example kernels
• Bag of words
– Given two documents D1 and D2, we define the kernel K(D1, D2) as the number of words they have in common
– To prove this is a kernel, first create a large set of words W_i. Define the mapping Φ(D1) as a high-dimensional vector where Φ(D1)[i] is 1 if the word W_i is present in the document and 0 otherwise. Then K(D1, D2) = Φ(D1)^T Φ(D2), which is exactly the number of shared words.
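The kernel and the explicit map Φ used in the proof can be sketched directly (the example documents are my own):

```python
def bow_kernel(d1, d2):
    # K(D1, D2) = number of distinct words the two documents share
    return len(set(d1.lower().split()) & set(d2.lower().split()))

def phi(doc, vocab):
    # Explicit map: binary indicator vector over the vocabulary
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]

d1 = "the cat sat on the mat"
d2 = "the dog sat outside"
vocab = sorted(set(d1.lower().split()) | set(d2.lower().split()))

# The kernel equals the dot product of the indicator vectors
assert bow_kernel(d1, d2) == sum(a * b for a, b in zip(phi(d1, vocab), phi(d2, vocab)))
```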
SVM and kernels
• What if we make the kernel matrix K a variable and optimize the dual over it as well?
• But then there is no way to tie the kernel matrix to the training data points.
SVM and kernels
• To tie the kernel matrix to the training data we assume the kernel to be determined is a linear combination of existing base kernels: K = Σ_k μ_k K_k with μ_k ≥ 0.
• The resulting problem is no longer a quadratic program.
• Instead it is a semi-definite program (Lanckriet et al., 2002)
Theoretical foundation
• Recall the margin error theorem (Theorem 7.3 from Learning with Kernels)
Theoretical foundation
• The kernel analogue of Theorem 7.3, from Lanckriet et al., 2002:
How does MKL work in practice?
• Gönen and Alpaydın, JMLR, 2011
• Datasets:
– Digit recognition
– Internet advertisements
– Protein folding
• Form kernels with different sets of features
• Apply SVM with various kernel-learning algorithms
How does MKL work in practice?
From Gönen and Alpaydın, JMLR, 2011
How does MKL work in practice?
From Gönen and Alpaydın, JMLR, 2011
How does MKL work in practice?
From Gönen and Alpaydın, JMLR, 2011
How does MKL work in practice?
• MKL is better than a single kernel
• The mean kernel (uniform weights) is hard to beat
• Non-linear MKL looks promising