TRANSCRIPT
Introduction to Machine Learning
Felix Brockherde¹² Kristof Schütt¹
¹Technische Universität Berlin, ²Max Planck Institute of Microstructure Physics
IPAM Tutorial 2013
Felix Brockherde, Kristof Schütt. Introduction to Machine Learning. IPAM Tutorial 2013. 1 / 35
What is Machine Learning?
Data with Pattern → Algorithm → ML Model (inferred structure)
ML is about learning structure from data
Examples
Drug discovery
Search engines
Brain-computer interfaces (BCI)
DNA splice site detection
Face recognition
Recommender systems
Speech recognition
This Talk

Part 1: Learning Theory and Supervised ML
- Basic Ideas of Learning Theory
- Support Vector Machines
- Kernels
- Kernel Ridge Regression

Part 2: Unsupervised ML and Application
- PCA
- Model Selection
- Feature Representation

Not covered
- Probabilistic Models
- Neural Networks
- Online Learning
- Reinforcement Learning
- Semi-supervised Learning
- etc.
Supervised Learning

Classification: yi ∈ {−1, +1}
Regression: yi ∈ R

- Given: points X = (x1, . . . , xN) with xi ∈ R^d and labels Y = (y1, . . . , yN), generated by some joint probability distribution.
- Learn the underlying unknown mapping f(x) = y.
- Important: performance on unseen data.
Basic Ideas in Learning Theory

Risk Minimization (RM)
Learn a model function f from examples (x1, y1), . . . , (xN, yN) ∈ R^d × R (regression) or R^d × {+1, −1} (classification), generated from P(x, y), such that the expected error on test data (drawn from P(x, y)),

R[f] = ∫ ½ |f(x) − y|² dP(x, y),

is minimal.

Problem: the distribution P(x, y) is unknown.

Empirical Risk Minimization (ERM)
Replace the average over P(x, y) by the average over the training samples (i.e. minimize the training error):

Remp[f] = (1/N) Σ_{i=1}^N ½ |f(xi) − yi|²
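As a minimal sketch (assuming numpy; the toy data and the two candidate functions are invented for illustration), the empirical risk above can be evaluated directly, and a function close to the true mapping gets a lower training error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise, drawn i.i.d. from some P(x, y).
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 0.1 * rng.normal(size=100)

def empirical_risk(f, x, y):
    """R_emp[f] = (1/N) * sum_i 1/2 |f(x_i) - y_i|^2."""
    return np.mean(0.5 * (f(x) - y) ** 2)

good = empirical_risk(lambda x: 2 * x, x, y)   # close to the true mapping
bad = empirical_risk(lambda x: -2 * x, x, y)   # far from it
print(good, bad)
```

Minimizing Remp over a rich enough function class would drive the training error to zero, which is exactly why the uniform-convergence question on the next slide matters.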
Law of large numbers: Remp[f] → R[f] as N → ∞.

Question: Does min_f Remp[f] give us min_f R[f] for sufficiently large N?

No: uniform convergence is needed.

Error bound for classification
With probability of at least 1 − η:

R[f] ≤ Remp[f] + √( (D (log(2N/D) + 1) − log(η/4)) / N )

where D is the VC dimension (Vapnik and Chervonenkis (1971)).

Introduce structure on the set of possible functions and use Structural Risk Minimization (SRM).
The linear function class has VC dimension D = 3.

min_f  Remp[f] + Complexity[f]
Support Vector Machines (SVM)

Decision hyperplane: {x | w · x + b = 0}
Margin hyperplanes: {x | w · x + b = +1} and {x | w · x + b = −1}, separated by 2/||w||.

Normalize w so that min_{xi} |w · xi + b| = 1. For points x1, x2 on the two margin hyperplanes:

w · x1 + b = +1
w · x2 + b = −1
⟺ w · (x1 − x2) = 2
⟺ (w/||w||) · (x1 − x2) = 2/||w||
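A small sketch of the geometry (assuming scikit-learn; the four points are invented, and a very large C is used to approximate the hard-margin problem): the learned w recovers the margin width 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds, 3 units apart along the x-axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# Very large C approximates the hard-margin problem
# min ||w||^2 subject to y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]

margin = 2 / np.linalg.norm(w)   # width of the margin, 2/||w||
print(margin)
```

For this configuration the maximum margin equals the 3-unit gap between the clouds, so `margin` should come out close to 3.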
VC Dimension of Hyperplane Classifiers

Theorem (Cortes and Vapnik (1995))
Hyperplanes in canonical form have VC dimension

D ≤ min{ R² ||w||² + 1, N + 1 }

where R is the radius of the smallest sphere containing the data.

SRM bound:

R[f] ≤ Remp[f] + √( (D (log(2N/D) + 1) − log(η/4)) / N )

Maximal margin = minimal ||w||² → good generalization, i.e. low risk:

min_{w,b} ||w||²  subject to  yi (w · xi + b) ≥ 1  for i = 1 . . . N
Slack Variables

Introduce slack variables ξi:

min_{w,b,ξ} ||w||² + C Σ_{i=1}^N ξi
subject to yi (w · xi + b) ≥ 1 − ξi and ξi ≥ 0
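The trade-off parameter C can be seen directly in code (a sketch assuming scikit-learn, on invented overlapping clouds): cheap slack gives a wide margin, expensive slack a narrow one.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Overlapping classes: no separating hyperplane exists, so slack is needed.
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C tolerates slack (wide margin, many margin violations);
# large C penalizes slack heavily (narrower margin).
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)

print(np.linalg.norm(soft.coef_), np.linalg.norm(hard.coef_))
# ||w|| grows with C, i.e. the margin 2/||w|| shrinks as slack gets expensive.
```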
Non-linear Hyperplanes

Map into a higher-dimensional feature space:

Φ : R² → R³
(x1, x2) ↦ (x1², √2 x1 x2, x2²)
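This particular map has a nice property that anticipates the kernel trick: the scalar product in the feature space equals the squared scalar product in input space. A minimal numpy check (the example points are invented):

```python
import numpy as np

def phi(x):
    """Explicit map R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])

# Scalar product in feature space = squared scalar product in input space,
# i.e. phi(x) . phi(z) = (x . z)^2, the polynomial kernel of degree 2.
print(phi(x) @ phi(z), (x @ z) ** 2)
```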
Dual SVM

Primal
min_{w,b,ξ} ||w||² + C Σ_{i=1}^N ξi
subject to yi (w · Φ(xi) + b) ≥ 1 − ξi and ξi ≥ 0 for i = 1 . . . N

Dual
max_α Σ_{i=1}^N αi − ½ Σ_{i,j=1}^N αi αj yi yj (Φ(xi) · Φ(xj))
subject to Σ_{i=1}^N αi yi = 0 and 0 ≤ αi ≤ C for i = 1 . . . N

Data points xi only appear in scalar products (Φ(xi) · Φ(xj)).
The Kernel Trick

Replace scalar products with a kernel function (Müller et al. (2001)):

k(x, y) = Φ(x) · Φ(y)

- Compute the kernel matrix Kij = k(xi, xj), i.e. never use Φ directly
- The underlying mapping Φ can be unknown
- Kernels can be adapted to the specific task, e.g. using prior knowledge (kernels for graphs, trees, strings, . . . )

Common kernels
Gaussian kernel: k(x, y) = exp(−||x − y||² / (2σ²))
Linear kernel: k(x, y) = x · y
Polynomial kernel: k(x, y) = (x · y + c)^d
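A sketch of the Gaussian kernel matrix in numpy (the random data is invented); the checks reflect properties any valid kernel matrix must have:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix K_ij = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(X, X)

print(K.shape)  # (5, 5): one entry per pair of data points
# Symmetric, unit diagonal (k(x, x) = 1), and positive semi-definite.
```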
The Support Vectors in SVM

max_α Σ_{i=1}^N αi − ½ Σ_{i,j=1}^N αi αj yi yj (Φ(xi) · Φ(xj))
subject to Σ_{i=1}^N αi yi = 0 and 0 ≤ αi ≤ C for i = 1 . . . N

KKT conditions
yi [w · Φ(xi) + b] > 1 ⟹ αi = 0 → xi irrelevant
yi [w · Φ(xi) + b] = 1 ⟹ on/in margin → xi is a support vector

Via w = Σ_{i=1}^N αi yi Φ(xi), the model f(x) = w · Φ(x) + b becomes

f(x) = Σ_{i=1}^N αi yi k(xi, x) + b → f(x) = Σ_{xi ∈ SV} αi yi k(xi, x) + b
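The sparsity of the expansion can be inspected directly (a sketch assuming scikit-learn, on invented data): only the points with αi > 0 are stored by the fitted model.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only points on or inside the margin get alpha_i > 0; the decision function
# is a kernel expansion over these support vectors alone (dual_coef_ holds
# the products alpha_i * y_i).
print(len(clf.support_vectors_), "of", len(X), "points are support vectors")
```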
Kernel Ridge Regression (KRR)

Ridge Regression

min_w Σ_{i=1}^N ||yi − w · xi||² + λ||w||²

Setting the derivative to zero gives

w = (λI + Σ_{i=1}^N xi xiᵀ)⁻¹ Σ_{i=1}^N yi xi

Linear model: f(x) = w · x
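The closed-form solution is a few lines of numpy (a sketch; the synthetic data and the true weight vector are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # rows are the points x_i in R^3
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

lam = 0.1
# w = (lambda I + sum_i x_i x_i^T)^{-1} sum_i y_i x_i,
# using X^T X = sum_i x_i x_i^T and X^T y = sum_i y_i x_i.
w = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)

print(w)   # close to w_true for small noise and small lambda
```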
Kernelizing Ridge Regression

Setting X = (x1, . . . , xN) ∈ R^{d×N} and Y = (y1, . . . , yN)ᵀ ∈ R^N:

w = (λI + XXᵀ)⁻¹ XY

Apply the Woodbury matrix identity:

w = X (XᵀX + λI)⁻¹ Y

Introduce α:

α = (K + λI)⁻¹ Y  and  w = Σ_{i=1}^N Φ(xi) αi

Kernel model: f(x) = w · Φ(x) = Σ_{i=1}^N αi k(xi, x)
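The kernelized form fits a non-linear function with one linear solve (a sketch in numpy; the sin target, kernel width and regularization strength are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 80))
y = np.sin(x) + 0.05 * rng.normal(size=80)

def k(a, b, sigma=0.5):
    """Gaussian kernel between 1-D sample vectors a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

lam = 1e-3
alpha = np.linalg.solve(k(x, x) + lam * np.eye(len(x)), y)  # (K + lambda I)^{-1} Y

x_new = np.array([np.pi / 2])
pred = k(x_new, x) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
print(pred)                     # should be close to sin(pi/2) = 1
```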
Unsupervised Learning

- Learn structure from unlabeled data
- Fit an assumed model / distribution to the data
- Examples: clustering, blind source separation, outlier detection, dimensionality reduction

[Figure: Bishop (2006), Fig. 9.1 — illustration of the K-means algorithm on the re-scaled Old Faithful data set. Starting from initial cluster centres, the algorithm alternates E steps (assign each point to the nearest centre) and M steps (recompute each centre as the mean of its assigned points) until convergence.]
Principal Component Analysis (PCA)

Given a centered data matrix X = (x1, . . . , xN)ᵀ ∈ R^{N×D}:

- best linear approximation:
  w1 = argmin_{||w||=1} ||X − X w wᵀ||²
- direction of largest variance:
  w1 = argmax_{||w||=1} ||Xw||²
- matrix deflation for further components:
  X_{k+1} = X_k − X_k wk wkᵀ

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559-572. http://pbil.univ-lyon1.fr/R/pearson1901.pdf
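Both views give the same components as the eigenvectors of the covariance matrix. A numpy sketch (the correlated toy data and its principal direction are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: large variance along (1, 1)/sqrt(2), small across it.
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.3])
X = X @ (np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2))   # rotate by 45 deg
X -= X.mean(axis=0)                  # PCA assumes centered data

cov = X.T @ X / len(X)               # Sigma = (1/N) X^T X
var, W = np.linalg.eigh(cov)         # Sigma w_k = sigma_k^2 w_k
order = np.argsort(var)[::-1]        # order components by variance
w1 = W[:, order[0]]

print(w1)    # first PC, roughly +-(1, 1)/sqrt(2)
```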
Principal Component Analysis (PCA)

Given a centered data matrix X ∈ R^{N×D}, decompose the correlated data matrix into uncorrelated, orthogonal PCs:

- diagonalize the covariance matrix Σ = (1/N) XᵀX:
  Σ wk = σk² wk
- order the principal components wk by variance σk²
- project the data onto the first n principal components

What about nonlinear correlations?
Kernel Principal Component Analysis (kPCA)

Transformation to feature space X ↦ Xf:

Σf = (1/N) Xfᵀ Xf,  K = Xf Xfᵀ,  Kij = k(xi, xj)

Xfᵀ Xf wk = N σk² wk
⇓ with wk = Xfᵀ αk
Xfᵀ Xf Xfᵀ αk = N σk² Xfᵀ αk
⇓ multiply from the left by Xf
K² αk = N σk² K αk
⇓ multiply by K⁻¹
K αk = N σk² αk
Kernel Principal Component Analysis (kPCA)

Projection of a point x with feature vector xf:

xfᵀ wk = xfᵀ Xfᵀ αk = Σ_{i=1}^N αk,i k(x, xi)

Schölkopf et al. (1997)
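A numpy sketch of this eigenproblem (the toy data is invented; the double-centering step, which makes K correspond to centered feature vectors as the derivation assumes, is included but was not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

def k(a, b, sigma=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = k(X, X)
# Center K so that it corresponds to centered feature vectors X_f.
N = len(X)
J = np.eye(N) - np.ones((N, N)) / N
Kc = J @ K @ J

# K alpha_k = N sigma_k^2 alpha_k: eigenvectors of K give the coefficients.
eigval, alphas = np.linalg.eigh(Kc)
alpha_1 = alphas[:, -1]          # coefficients of the top component

# Projections of the training points onto w_1: sum_i alpha_{1,i} k(x, x_i)
proj = Kc @ alpha_1
print(proj.shape)
```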
Model Selection

- Find the model that best fits the data distribution
- We can only estimate this distribution
- Consider: noise ratio / distribution, data correlation
Hyperparameters

- adjust model complexity: regularization, kernel parameters, etc.
- have to be tuned using examples not used for training
- standard solution: exhaustive search over a parameter grid

[Figure: train and test points of the target f(x) = sin(x) with a kernel model fit]

f(x) = sin(x)

f(x) = Σ_i αi exp(−||x − xi||² / σ²)

α = (K + τI)⁻¹ y
Grid Search
[Figure: kernel model fits of f(x) for different hyperparameter settings, and a heat map of the test RMSE (roughly 0.08 to 0.64) over a grid of τ ∈ [10⁻², 10²] and σ² ∈ [10⁻², 10²]]
k-fold Cross-Validation

- split the data
- inner loop (4x): model selection on the training folds
- outer loop (5x): evaluation on the held-out test fold

Don't even think about looking at the test set!
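The nested scheme above can be sketched with scikit-learn (the data set, grid values and fold counts are invented for illustration): the inner 4-fold loop tunes the hyperparameters, the outer 5-fold loop scores the whole procedure on folds the tuning never saw.

```python
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Inner loop: grid search over KRR hyperparameters (regularization alpha,
# Gaussian kernel width gamma) with 4-fold cross-validation.
grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": [1e-3, 1e-1, 1e1], "gamma": [1e-3, 1e-1, 1e1]},
    cv=4,
)

# Outer loop: 5-fold cross-validation scores the entire selection procedure;
# the outer test folds are never touched during tuning.
scores = cross_val_score(grid, X, y, cv=5)
print(scores.mean())
```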
From Objects to Vectors

How to represent complex objects for kernel methods?
- explicit map to a vector space, φ : M → R^n, then a standard kernel (e.g. linear, polynomial, Gaussian) k : R^n × R^n → R on the mapped features
- direct use of a kernel function k : M × M → R
Feature Representation

Given a physical object (molecule, crystal, etc.) and a property of interest, what is a good ML representation?

- no loss of valuable information
- support generalization: remove invariances, decompose the problem
- incorporation of domain knowledge
- depends on the data set, target function and learning method
Feature Representation - Molecules

Coulomb matrix:

Cij = 0.5 Zi^2.4            if i = j
Cij = Zi Zj / ||ri − rj||   if i ≠ j

[Figure: example molecules (a)-(e) with their Coulomb matrices]
(Rupp et al., 2012; Montavon et al., 2012)
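A minimal numpy sketch of the definition above (the diatomic toy geometry, with nuclear charges 1 and 8 at unit distance, is invented for illustration):

```python
import numpy as np

def coulomb_matrix(Z, R):
    """C_ij = 0.5 Z_i^2.4 on the diagonal, Z_i Z_j / ||r_i - r_j|| otherwise.

    Z: nuclear charges, shape (n,); R: atom positions, shape (n, 3).
    """
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

# Hypothetical diatomic: charges (1, 8) at distance 1.
C = coulomb_matrix(np.array([1, 8]), np.array([[0.0, 0, 0], [1.0, 0, 0]]))
print(C)
# Diagonal: 0.5 * Z^2.4; off-diagonal: Z_i Z_j over the interatomic distance.
```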
Feature Representation - Molecules
PCA of Coulomb matrices with atom permutations (Montavon et al. (2013))
Results - Molecules
Feature Representation - Crystals
element pair r1 · · · rn
α α gαα(r1) · · · gαα(rn)
α β gαβ(r1) · · · gαβ(rn)
β α gβα(r1) · · · gβα(rn)
β β gββ(r1) · · · gββ(rn)
Results - Crystals
Learning curve for predicting the DOS at the Fermi energy.

K.T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K.-R. Müller, E.K.U. Gross. How to represent crystal structures for machine learning: towards fast prediction of electronic properties. arXiv, 2013.
Machine Learning ...
... has been successfully applied to various research fields.
... is based on statistical learning theory.
... provides fast and accurate predictions on previously unseen data.
... is able to model non-linear relationships of high-dimensional data.
Feature representation is key!
Literature I
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
Montavon, G., Hansen, K., Fazli, S., Rupp, M., Biegler, F., Ziehe, A., Tkatchenko, A., Lilienfeld, A. V., and Müller, K.-R. (2012). Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 449–457.
Montavon, G., Rupp, M., Gobre, V., Vazquez-Mayagoitia, A., Hansen, K., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2013). Machine learning of molecular electronic properties in chemical compound space. arXiv preprint arXiv:1305.7074.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.
Rupp, M., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301.
Schölkopf, B., Smola, A., and Müller, K.-R. (1997). Kernel principal component analysis. In Artificial Neural Networks — ICANN'97, pages 583–588. Springer.
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280.