Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 4: Hyperplanes, Perceptrons, and Kernel-Based Classifiers
• Definition: Hyperplane Classifier
• Minimum Classification Error Training Methods
  – Empirical risk
  – Differentiable estimates of the 0-1 loss function
  – Error backpropagation
• Kernel Methods
  – Nonparametric expression of a hyperplane
  – Mathematical properties of a dot product
  – Kernel-based classifier
  – The implied high-dimensional space
  – Error backpropagation for a kernel-based classifier
• Useful kernels
  – Polynomial kernel
  – RBF kernel
Classifier Terminology
Hyperplane Classifier
[Figure: scatter of labeled training points separated by the class boundary ("separatrix"), the plane wᵀx = b. The normal vector w points away from the boundary, and the distance from the origin (x = 0) to the plane is b (for unit-length w).]
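The original slide shows the classifier only as a figure; a minimal sketch of the corresponding definitions, assuming labels y ∈ {−1, +1} and the slide's wᵀx = b convention:

```latex
h(x) = \operatorname{sign}\!\left(w^{\mathsf T}x - b\right),
\qquad \text{separatrix: } \{x : w^{\mathsf T}x = b\},
\qquad \text{distance from origin} = \frac{b}{\lVert w \rVert}
\;\;(= b \text{ when } \lVert w \rVert = 1)
```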
Loss, Risk, and Empirical Risk
Empirical Risk with 0-1 Loss Function = Error Rate on Training Data
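The equations for these two slides are not in the transcript; a sketch of the standard definitions they name, with ℓ the loss, D the data distribution, and {(x_i, y_i)}, i = 1..n, the training set (notation assumed):

```latex
\text{Risk: } R(h) = \mathbb{E}_{(x,y)\sim D}\big[\ell(y, h(x))\big],
\qquad
\text{Empirical risk: } \hat R(h) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, h(x_i)\big)
```

```latex
\text{0--1 loss: } \ell_{01}(y, h(x)) = \mathbf{1}\,[\,y \ne h(x)\,]
\;\Rightarrow\;
\hat R_{01}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\,[\,y_i \ne h(x_i)\,]
= \text{training error rate}
```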
Differentiable Approximations of the 0-1 Loss Function: Hinge Loss
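A sketch of the hinge loss as a differentiable (almost everywhere) upper bound on the 0-1 loss, written on the discriminant g(x) = wᵀx − b (my notation):

```latex
\ell_{\text{hinge}}\big(y, g(x)\big) = \max\!\left(0,\; 1 - y\,g(x)\right)
\;\ge\; \mathbf{1}\,[\,y \ne \operatorname{sign}(g(x))\,],
\qquad y \in \{-1, +1\}
```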
Differentiable Empirical Risks
Error Backpropagation: Hyperplane Classifier with Sigmoidal Loss
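A sketch of a sigmoidal surrogate loss and the resulting gradient-descent ("error backpropagation") update for the hyperplane classifier; the learning rate η and the exact form of the surrogate are assumptions:

```latex
\sigma(a) = \frac{1}{1+e^{-a}},\qquad
\ell_\sigma\big(y, g(x)\big) = \sigma\!\left(-y\,g(x)\right),\qquad
g(x) = w^{\mathsf T}x - b
```

```latex
\frac{\partial \ell_\sigma}{\partial w} = -\,y\,\sigma'\!\left(-y\,g(x)\right)x,\qquad
\frac{\partial \ell_\sigma}{\partial b} = y\,\sigma'\!\left(-y\,g(x)\right),\qquad
w \leftarrow w - \eta\,\frac{\partial \hat R}{\partial w},\quad
b \leftarrow b - \eta\,\frac{\partial \hat R}{\partial b}
```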
Sigmoidal Classifier = Hyperplane Classifier with Fuzzy Boundaries
[Figure: the same two-class scatter, but with a fuzzy boundary: the classifier output grades smoothly from "more red" on one side, through "less red" and "less blue" near the separatrix, to "more blue" on the other side.]
Error Backpropagation: Sigmoidal Classifier with Absolute Loss
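A sketch of the chain-rule ("backpropagated") gradient for absolute loss on the sigmoidal classifier, assuming targets y ∈ {0, 1}:

```latex
\ell\big(y, h(x)\big) = \lvert\, y - h(x) \,\rvert,\qquad
h(x) = \sigma\!\left(g(x)\right),\qquad
g(x) = w^{\mathsf T}x - b
```

```latex
\frac{\partial \ell}{\partial w}
= -\operatorname{sign}\!\big(y - h(x)\big)\,h(x)\big(1 - h(x)\big)\,x,
\qquad
\frac{\partial \ell}{\partial b}
= \operatorname{sign}\!\big(y - h(x)\big)\,h(x)\big(1 - h(x)\big)
```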
Sigmoidal Classifier: Signal Flow Diagram
[Figure: signal flow diagram. The inputs x = (x1, x2, x3) are multiplied by connection weights w = (w1, w2, w3) and summed (+) to form the sigmoid input g(x); the sigmoid output is the hypothesis h(x).]
Multilayer Perceptron
[Figure: signal flow diagram of a two-layer perceptron. The inputs h0(x) ≡ x = (x1, x2, x3) are weighted by the first-layer connection weights w1 and biases b11, b12, b13 to form the sigmoid inputs g1(x); the sigmoid outputs h1(x) are weighted by the second-layer connection weights (with bias b21) to form g2(x); the final hypothesis is h2(x).]
Multilayer Perceptron: Classification Equations
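The equations themselves are images in the original deck; a sketch of the forward pass for the two-layer network in the preceding figure, with layer-index superscripts (the element names are my guess at the slide's notation):

```latex
h^{0}(x) \equiv x,\qquad
g^{1}_j(x) = \sum_{k} w^{1}_{jk}\,h^{0}_k(x) + b^{1}_j,\qquad
h^{1}_j(x) = \sigma\!\left(g^{1}_j(x)\right)
```

```latex
g^{2}(x) = \sum_{j} w^{2}_{j}\,h^{1}_j(x) + b^{2},\qquad
h^{2}(x) = \sigma\!\left(g^{2}(x)\right)
```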
Error Backpropagation for a Multilayer Perceptron
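As a concrete illustration (not the original slide's derivation), here is a minimal NumPy sketch of error backpropagation for the two-layer sigmoidal network above, trained on squared error; all names and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W1, b1, w2, b2, eta=0.1):
    """One gradient-descent step for a two-layer sigmoidal perceptron with
    squared-error loss 0.5*(y - h2)**2. Shapes: x (K,), W1 (J,K), b1 (J,),
    w2 (J,), b2 scalar; target y in {0, 1}."""
    # Forward pass
    g1 = W1 @ x + b1          # hidden-layer sigmoid inputs g1(x)
    h1 = sigmoid(g1)          # hidden-layer sigmoid outputs h1(x)
    g2 = w2 @ h1 + b2         # output-layer sigmoid input g2(x)
    h2 = sigmoid(g2)          # network hypothesis h2(x)

    # Backward pass (chain rule)
    d2 = (h2 - y) * h2 * (1.0 - h2)     # dLoss/dg2
    d1 = d2 * w2 * h1 * (1.0 - h1)      # dLoss/dg1, shape (J,)

    # Gradient-descent updates
    w2 -= eta * d2 * h1
    b2 -= eta * d2
    W1 -= eta * np.outer(d1, x)
    b1 -= eta * d1
    return W1, b1, w2, b2, h2
```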
Classification Power of a One-Layer Perceptron
Classification Power of a Two-Layer Perceptron
Classification Power of a Three-Layer Perceptron
Output of Multilayer Perceptron is an Approximation of Posterior Probability
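A one-line statement of the property this slide names (a standard result; the conditions are my summary): with enough hidden units and data, training the MLP to minimize expected squared error with 0/1 targets drives the output toward the conditional mean of the label,

```latex
\arg\min_{h(x)}\; \mathbb{E}\!\left[(y - h(x))^2 \mid x\right]
= \mathbb{E}[\,y \mid x\,]
= p(y = 1 \mid x)
\qquad \text{for } y \in \{0, 1\}
```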
Kernel-Based Classifiers
Representation of Hyperplane in terms of Arbitrary Vectors
Kernel-based Classifier
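A sketch of what these two slides name: write the weight vector as a combination of N reference vectors, so the discriminant needs only dot products, which a kernel then replaces (the symbols a_i, x_i, N are my notation):

```latex
w = \sum_{i=1}^{N} a_i\,x_i
\quad\Rightarrow\quad
g(x) = w^{\mathsf T}x - b = \sum_{i=1}^{N} a_i\,\big(x_i^{\mathsf T}x\big) - b
```

```latex
\text{Kernel-based classifier: }\quad
h(x) = \sigma\!\left(\sum_{i=1}^{N} a_i\,K(x_i, x) - b\right)
```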
Error Backpropagation for a Kernel-Based Classifier
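A sketch of the gradients used to train the expansion coefficients a_i by error backpropagation, with the same sigmoidal output and a generic differentiable loss ℓ (again, notation assumed):

```latex
h(x) = \sigma\!\left(g(x)\right),\qquad
g(x) = \sum_{i=1}^{N} a_i\,K(x_i, x) - b
```

```latex
\frac{\partial \ell}{\partial a_i}
= \frac{\partial \ell}{\partial h}\,h(x)\big(1 - h(x)\big)\,K(x_i, x),
\qquad
\frac{\partial \ell}{\partial b}
= -\,\frac{\partial \ell}{\partial h}\,h(x)\big(1 - h(x)\big)
```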
The Implied High-Dimensional Space
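The point of this slide, sketched in one line: a valid (Mercer) kernel behaves like a dot product in some implied feature space φ, so the kernel classifier is a hyperplane classifier in that space:

```latex
K(x, z) = \varphi(x)^{\mathsf T}\varphi(z)
\quad\Rightarrow\quad
g(x) = \sum_{i=1}^{N} a_i\,K(x_i, x) - b
= \left(\sum_{i=1}^{N} a_i\,\varphi(x_i)\right)^{\!\mathsf T}\varphi(x) - b
```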
Some Useful Kernels
Polynomial Kernel
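A sketch of the usual order-d polynomial kernel on K-dimensional inputs (the exact constants used on the slide are not in the transcript):

```latex
K(x, z) = \left(1 + x^{\mathsf T}z\right)^{d}
```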
Polynomial Kernel: Separatrix (Boundary Between Two Classes) is a Polynomial Surface
Classification Boundaries Available from a Polynomial Kernel(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Implied Higher-Dimensional Space has a Dimension of K^d
The Radial Basis Function (RBF) Kernel
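A sketch of the RBF (Gaussian) kernel, with γ the width parameter discussed on the following slides:

```latex
K(x, z) = \exp\!\left(-\gamma\,\lVert x - z\rVert^{2}\right),
\qquad \gamma > 0
```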
RBF Classifier Can Represent Any Classifier Boundary
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
In these figures, C was adjusted, not γ, but a similar effect can be achieved by setting N<<M and adjusting γ.
- More training corpus errors, smoother boundary
- Fewer training corpus errors, wigglier boundary
If N<M, Gamma can Adjust Boundary Smoothness
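To make the N<M and γ trade-off concrete, here is a small, self-contained NumPy sketch (not from the original slides): an RBF kernel classifier with N randomly chosen centers out of M training points, trained by gradient descent on a sigmoid output. Raising γ narrows the kernels and makes the learned separatrix wigglier; lowering γ smooths it. All names and hyperparameters are illustrative.

```python
import numpy as np

def rbf_kernel(centers, X, gamma):
    """K[i, j] = exp(-gamma * ||centers[i] - X[j]||^2)."""
    d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_rbf_classifier(X, y, n_centers=20, gamma=1.0, eta=0.5, n_iter=2000, seed=0):
    """Toy RBF kernel classifier with N < M centers, trained by gradient
    descent on squared error of a sigmoid output; y has entries in {0, 1}."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]  # N of the M points
    a = np.zeros(n_centers)
    b = 0.0
    K = rbf_kernel(centers, X, gamma)        # (N, M)
    for _ in range(n_iter):
        g = a @ K - b                         # discriminant for every training sample
        h = 1.0 / (1.0 + np.exp(-g))          # sigmoid outputs
        d = (h - y) * h * (1.0 - h)           # dLoss/dg, shape (M,)
        a -= eta * (K @ d) / len(y)           # dLoss/da_i = sum_j d_j K[i, j]
        b += eta * d.mean()                   # dg/db = -1, so descent adds eta*mean(d)
    return centers, a, b
```

Training the same data twice, once with a small γ and once with a large γ, and plotting the resulting decision boundaries is one way to reproduce the smooth-versus-wiggly contrast in the Hastie et al. figures.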
Summary
• Classifier definitions
  – Classifier = a function from x into y
  – Loss = the cost of a mistake
  – Risk = the expected loss
  – Empirical Risk = the average loss on training data
• Multilayer Perceptrons
  – Sigmoidal classifier is similar to hyperplane classifier with sigmoidal loss function
  – Train using error backpropagation
  – With two hidden layers, can model any boundary (MLP is a "universal approximator")
  – MLP output is an estimate of p(y|x)
• Kernel Classifiers
  – Equivalent to: (1) project into φ(x), (2) apply hyperplane classifier
  – Polynomial kernel: separatrix is polynomial surface of order d
  – RBF kernel: separatrix can be any surface (RBF is also a "universal approximator")
  – RBF kernel: if N<M, can adjust the "wiggliness" of the separatrix