Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Lecture 4: Hyperplanes, Perceptrons, and Kernel-Based Classifiers
• Definition: Hyperplane Classifier
• Minimum Classification Error Training Methods
  – Empirical risk
  – Differentiable estimates of the 0-1 loss function
  – Error backpropagation
• Kernel Methods
  – Nonparametric expression of a hyperplane
  – Mathematical properties of a dot product
  – Kernel-based classifier
  – The implied high-dimensional space
  – Error backpropagation for a kernel-based classifier
• Useful kernels
  – Polynomial kernel
  – RBF kernel
Classifier Terminology
Hyperplane Classifier
[Figure: training points x scattered on both sides of the class boundary ("separatrix"), the plane wᵀx = b, with normal vector w; the distance from the origin (x = 0) to the plane is b (assuming unit-norm w).]
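In code, the hyperplane classifier is a single dot product and threshold; a minimal sketch (the normal vector, offset, and test points below are invented for illustration):

```python
import numpy as np

def hyperplane_classify(x, w, b):
    """Label +1 if x is on the normal-vector side of the plane w.T x = b, else -1."""
    return 1 if np.dot(w, x) >= b else -1

# Illustrative values (not from the slides): a unit-norm normal and offset b.
w = np.array([1.0, 0.0])   # normal vector
b = 2.0                    # distance from the origin, since ||w|| = 1

print(hyperplane_classify(np.array([3.0, 1.0]), w, b))   # 1: beyond the plane
print(hyperplane_classify(np.array([1.0, -1.0]), w, b))  # -1: origin side
```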
Loss, Risk, and Empirical Risk
Empirical Risk with 0-1 Loss Function = Error Rate on Training Data
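Under the 0-1 loss, the empirical risk is just the fraction of misclassified training points; a sketch with an invented toy training set:

```python
import numpy as np

def empirical_risk_01(w, b, X, y):
    """Average 0-1 loss on training data: the fraction of misclassified points."""
    predictions = np.sign(X @ w - b)
    return np.mean(predictions != y)

# Toy training set (invented for illustration); labels are +1 / -1.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-1.0, 0.5], [0.5, -2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

print(empirical_risk_01(w, b, X, y))  # 0.25: one of the four points is misclassified
```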
Differentiable Approximations of the 0-1 Loss Function: Hinge Loss
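The hinge loss max(0, 1 − margin), where the margin is y·(wᵀx − b), is one standard differentiable-almost-everywhere surrogate for the 0-1 loss; a sketch:

```python
def hinge_loss(margin):
    """Hinge loss as a function of the signed margin y * (w.x - b)."""
    return max(0.0, 1.0 - margin)

# Correctly classified with margin >= 1: zero loss.
print(hinge_loss(1.5))   # 0.0
# Correct but inside the margin: small positive loss.
print(hinge_loss(0.5))   # 0.5
# Misclassified: loss grows linearly with the size of the violation.
print(hinge_loss(-1.0))  # 2.0
```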
Differentiable Empirical Risks
Error Backpropagation: Hyperplane Classifier with Sigmoidal Loss
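Error backpropagation for the hyperplane classifier is the chain rule applied to a sigmoidal loss; a minimal gradient-descent sketch (learning rate and data invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    """One gradient step on the sigmoidal loss sigma(-y * (w.x - b)).

    The loss is a smooth stand-in for the 0-1 loss: near 1 when x is
    misclassified, near 0 when x is classified with a large margin.
    """
    g = np.dot(w, x) - b             # discriminant
    s = sigmoid(-y * g)              # loss value
    ds_dg = s * (1.0 - s) * (-y)     # chain rule through the sigmoid
    w_new = w - lr * ds_dg * x       # dg/dw = x
    b_new = b - lr * ds_dg * (-1.0)  # dg/db = -1
    return w_new, b_new

# Invented example: one point on the boundary; a step should reduce its loss.
w, b = np.array([0.0, 0.0]), 0.0
x, y = np.array([1.0, 2.0]), 1
loss_before = sigmoid(-y * (np.dot(w, x) - b))
w, b = sgd_step(w, b, x, y)
loss_after = sigmoid(-y * (np.dot(w, x) - b))
print(loss_after < loss_before)  # True
```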
Sigmoidal Classifier = Hyperplane Classifier with Fuzzy Boundaries
[Figure: training points around a fuzzy boundary; the classifier output shades smoothly from "more red" through "less red" and "less blue" to "more blue" across the boundary.]
Error Backpropagation: Sigmoidal Classifier with Absolute Loss
Sigmoidal Classifier: Signal Flow Diagram
[Figure: signal-flow diagram. The inputs x1, x2, x3 are scaled by connection weights w1, w2, w3 and summed to form the sigmoid input g(x); the sigmoid's output is the hypothesis h(x).]
Multilayer Perceptron
[Figure: signal-flow diagram of a two-layer perceptron. The input h0(x) ≡ x passes through the first layer's connection weights w1 and biases b11, b12, b13 to give the sigmoid inputs g1(x) and sigmoid outputs h1(x); the second layer's connection weights and bias b21 then give the sigmoid input g2(x) and the hypothesis h2(x).]
Multilayer Perceptron: Classification Equations
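The forward-pass equations can be sketched directly, with layer naming following the signal-flow diagram (the weights below are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer perceptron forward pass.

    g1 = W1 h0 + b1, h1 = sigmoid(g1)   (hidden layer)
    g2 = W2 h1 + b2, h2 = sigmoid(g2)   (output layer)
    """
    h0 = x
    g1 = W1 @ h0 + b1
    h1 = sigmoid(g1)
    g2 = W2 @ h1 + b2
    h2 = sigmoid(g2)
    return h2

# Invented sizes and weights: 3 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
out = mlp_forward(np.array([1.0, 0.5, -0.2]), W1, b1, W2, b2)
print(0.0 < out[0] < 1.0)  # True: the sigmoid output lies in (0, 1)
```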
Error Backpropagation for a Multilayer Perceptron
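A sketch of one backpropagation step for a two-layer perceptron, assuming (a choice made for illustration, not taken from the slides) a squared-error loss E = ½(h2 − t)²:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One backpropagation step: errors propagate backward layer by layer."""
    # Forward pass.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    # Backward pass: delta = dE/dg at each layer (sigmoid' = h (1 - h)).
    delta2 = (h2 - t) * h2 * (1.0 - h2)          # output layer
    delta1 = (W2.T @ delta2) * h1 * (1.0 - h1)   # hidden layer
    # Gradient descent on weights and biases.
    W2 -= lr * np.outer(delta2, h1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return W1, b1, W2, b2

# Invented toy problem: push the output toward target t = 1 for one input.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
x, t = np.array([1.0, -1.0]), np.array([1.0])
before = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0]
for _ in range(100):
    W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)
after = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0]
print(after > before)  # True: training moved the output toward the target
```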
Classification Power of a One-Layer Perceptron
Classification Power of a Two-Layer Perceptron
Classification Power of a Three-Layer Perceptron
Output of Multilayer Perceptron is an Approximation of Posterior Probability
Kernel-Based Classifiers
Representation of Hyperplane in terms of Arbitrary Vectors
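If the normal vector is written as a weighted sum of other vectors, w = Σᵢ αᵢ xᵢ, then the discriminant wᵀx becomes a sum of dot products Σᵢ αᵢ (xᵢᵀx); a numeric check with invented vectors and coefficients:

```python
import numpy as np

# Invented vectors and coefficients.
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])  # rows are the x_i
alpha = np.array([0.5, -1.0, 0.25])

# w expressed nonparametrically as a weighted sum of the x_i.
w = alpha @ X

x = np.array([2.0, 2.0])
direct = np.dot(w, x)               # w.T x computed directly
via_dots = np.sum(alpha * (X @ x))  # sum_i alpha_i (x_i . x)
print(np.isclose(direct, via_dots))  # True: the two forms agree
```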
Kernel-based Classifier
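Replacing each dot product xᵢᵀx in the nonparametric form with a kernel K(xᵢ, x) gives the kernel-based classifier h(x) = sign(Σᵢ αᵢ K(xᵢ, x) − b); a sketch with invented data, using the linear kernel as a sanity check:

```python
import numpy as np

def kernel_classify(x, vectors, alpha, b, kernel):
    """h(x) = sign(sum_i alpha_i K(x_i, x) - b)."""
    scores = np.array([kernel(xi, x) for xi in vectors])
    return 1 if np.dot(alpha, scores) - b >= 0 else -1

# With the linear kernel K(u, v) = u.v this reduces to a hyperplane classifier.
linear = lambda u, v: np.dot(u, v)
sv = np.array([[1.0, 0.0], [0.0, 1.0]])  # invented vectors
alpha = np.array([1.0, -1.0])            # invented coefficients
b = 0.0

print(kernel_classify(np.array([2.0, 1.0]), sv, alpha, b, linear))  # 1
print(kernel_classify(np.array([0.0, 3.0]), sv, alpha, b, linear))  # -1
```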
Error Backpropagation for a Kernel-Based Classifier
The Implied High-Dimensional Space
Some Useful Kernels
Polynomial Kernel
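The degree-d polynomial kernel K(u, v) = (1 + uᵀv)ᵈ equals an ordinary dot product in an implied higher-dimensional feature space; for d = 2 with two-dimensional inputs the feature map can be written out by hand and checked numerically:

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """Polynomial kernel K(u, v) = (1 + u.v)^d."""
    return (1.0 + np.dot(u, v)) ** d

def phi(u):
    """Explicit feature map for d = 2, 2-D input: (1 + u.v)^2 = phi(u).phi(v)."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * u[0], s * u[1],
                     u[0] ** 2, u[1] ** 2, s * u[0] * u[1]])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(poly_kernel(u, v), np.dot(phi(u), phi(v))))  # True
```

The kernel never constructs the feature space explicitly; here it is 6-dimensional, and it grows on the order of K^d for K-dimensional inputs and degree d.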
Polynomial Kernel: Separatrix (Boundary Between Two Classes) is a Polynomial Surface
Classification Boundaries Available from a Polynomial Kernel (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Implied Higher-Dimensional Space has a Dimension of K^d
The Radial Basis Function (RBF) Kernel
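The RBF kernel K(u, v) = exp(−γ‖u − v‖²) measures similarity by distance; a sketch showing how γ controls the kernel's width (values invented):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

u, v = np.array([0.0, 0.0]), np.array([1.0, 1.0])
wide = rbf_kernel(u, v, gamma=0.1)     # ~0.82: slow decay, smoother boundary
narrow = rbf_kernel(u, v, gamma=10.0)  # ~2e-9: fast decay, wigglier boundary

print(rbf_kernel(u, u))  # 1.0: the RBF kernel is maximal at zero distance
print(wide > narrow)     # True
```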
RBF Classifier Can Represent Any Classifier Boundary
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
In these figures, C was adjusted, not γ, but a similar effect can be achieved by setting N << M and adjusting γ.
- More training-corpus errors; smoother boundary
- Fewer training-corpus errors; wigglier boundary
If N < M, Gamma Can Adjust Boundary Smoothness
Summary

• Classifier definitions
  – Classifier = a function from x into y
  – Loss = the cost of a mistake
  – Risk = the expected loss
  – Empirical risk = the average loss on training data
• Multilayer Perceptrons
  – A sigmoidal classifier is similar to a hyperplane classifier with a sigmoidal loss function
  – Train using error backpropagation
  – With two hidden layers, an MLP can model any boundary (the MLP is a "universal approximator")
  – The MLP output is an estimate of p(y|x)
• Kernel Classifiers
  – Equivalent to: (1) project into φ(x), (2) apply a hyperplane classifier
  – Polynomial kernel: the separatrix is a polynomial surface of order d
  – RBF kernel: the separatrix can be any surface (the RBF classifier is also a "universal approximator")
  – RBF kernel: if N < M, γ can adjust the "wiggliness" of the separatrix