This week: overview of pattern recognition (related to machine learning)
TRANSCRIPT
This week: overview of pattern recognition (related to machine learning)
Non-review of chapters 6/7
• Z-transforms
• Convolution
• Sampling/aliasing
• Linear difference equations
• Resonances
• FIR/IIR filtering
• DFT/FFT
Speech Pattern Recognition
• Soft pattern classification plus temporal sequence integration
• Supervised pattern classification: class labels used in training
• Unsupervised pattern classification: labels not available (or at least not used)
Training and testing
• Training: learning the parameters of the classifier
• Testing: classify an independent test set, compare with the labels, and score (see the sketch below)
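A minimal sketch of this train/test protocol. The dataset (scikit-learn's iris data), the classifier, and the 70/30 split are illustrative assumptions, not choices made in the lecture:

```python
# Hedged sketch of the training/testing protocol above.
# Dataset, classifier, and split ratio are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training: learn the classifier's parameters on one portion of the data.
# Testing: classify an independent test set and score against its labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```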
F1 and F2 for various vowels
Swedish basketball players vs. speech recognition researchers
Feature extraction criteria
• Class discrimination
• Generalization
• Parsimony (efficiency)
Feature vector size
• The representations that discriminate best on the training set tend to be high-dimensional
• The representations that generalize best to the test set tend to be succinct
Dimensionality reduction
• Principal components (i.e., SVD, KL transform, eigenanalysis, ...)
• Linear Discriminant Analysis (LDA)
• Application-specific knowledge
• Feature Selection via PR Evaluation
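As one concrete instance of the first bullet, here is a small sketch of principal-component reduction via the SVD; the synthetic feature matrix and the choice of keeping 3 components are assumptions for illustration:

```python
# Principal components via SVD (one of the reduction methods listed above).
# The random feature matrix and k=3 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))      # 200 examples, 10-dimensional features

Xc = X - X.mean(axis=0)                 # center each feature dimension
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                   # keep the top-k principal directions
X_reduced = Xc @ Vt[:k].T               # project onto those directions
print(X_reduced.shape)                  # (200, 3)
```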
PR Methods
• Minimum distance
• Discriminant Functions
  – Linear
  – Nonlinear (e.g., quadratic, neural networks)
  – Some aspects of each: SVMs
• Statistical Discriminant Functions
Minimum Distance
• Vector or matrix representing each element
• Define a distance function
• Collect examples for each class
• In testing, choose the class of the closest example (see the sketch below)
• The choice of distance is equivalent to an implicit statistical assumption
• Signals (e.g., speech) add temporal variability
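A minimal sketch of this recipe. Here each class is summarized by a single prototype (its training mean) compared with Euclidean distance; both choices are assumptions, and the slide's "closest example" rule corresponds to the nearest-neighbor decision rule on the next slide:

```python
# Minimum-distance classifier sketch.
# Assumptions: one prototype per class (its training mean), Euclidean distance.
import numpy as np

def fit_prototypes(X, y):
    """Collect examples for each class and keep one prototype: the class mean."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, prototypes):
    """Choose the class whose prototype is closest to x."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy 2-D data: two classes centered at (0, 0) and (3, 3).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

protos = fit_prototypes(X, y)
print(classify(np.array([2.5, 2.8]), protos))   # expected: 1
```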
Limitations
• Variable scale of dimensions
• Variable importance of dimensions
• For high dimensions, sparsely sampled space
• For tough problems, resource limitations (storage, computation, memory access)
Decision Rule for Min Distance
• Nearest Neighbor (NN): in the limit of infinite samples, at most twice the error of the optimum classifier
• k-Nearest Neighbor (kNN): see the sketch below
• Lots of storage for large problems; potentially large searches
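A brute-force sketch of the kNN rule; with k=1 it reduces to plain NN. The exhaustive distance computation over every stored example is exactly the storage/search cost flagged above. The toy data are an assumption:

```python
# k-nearest-neighbor decision rule (k=1 gives plain NN).
# Brute-force search over all stored examples -- the storage/search cost noted above.
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored example
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

# Toy 2-D data: two classes centered at (0, 0) and (3, 3).
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.5, 0.2]), X_train, y_train, k=5))   # expected: 0
```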
Some Opinions
• Better to throw away bad data than to reduce its weight
• Dimensionality-reduction based on variance often a bad choice for supervised pattern recognition
• Both of these are only true sometimes
Discriminant Analysis
• Discriminant functions: maximum for the correct class, minimum for the others
• Decision surface between classes
• A linear decision surface in 2 dimensions is a line, in 3 dimensions a plane; generally called a hyperplane
• For 2 classes, the surface is at w^T x + w_0 = 0
• In the 2-class quadratic case, the surface is at x^T W x + w^T x + w_0 = 0
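A small sketch evaluating the two decision-surface forms above; the particular w, w0, and W values are arbitrary illustrations, not trained parameters:

```python
# Evaluating linear and quadratic 2-class discriminants.
# w, w0, and W are arbitrary illustrative values, not trained parameters.
import numpy as np

w, w0 = np.array([1.0, -2.0]), 0.5
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])

def linear_g(x):
    """Linear case: the decision surface is w^T x + w0 = 0."""
    return w @ x + w0

def quadratic_g(x):
    """Quadratic case: the decision surface is x^T W x + w^T x + w0 = 0."""
    return x @ W @ x + w @ x + w0

x = np.array([0.3, 0.7])
print(np.sign(linear_g(x)), np.sign(quadratic_g(x)))   # side of each surface
```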
Two-prototype example
• D_i^2 = (x – z_i)^T (x – z_i) = x^T x + z_i^T z_i – 2 x^T z_i
• D_1^2 – D_2^2 = 2 x^T z_2 – 2 x^T z_1 + z_1^T z_1 – z_2^T z_2
• At the decision surface the distances are equal, so 2 x^T z_2 – 2 x^T z_1 = z_2^T z_2 – z_1^T z_1, or x^T (z_2 – z_1) = ½ (z_2^T z_2 – z_1^T z_1)
• If the prototypes are all normalized to 1, the decision surface is x^T (z_2 – z_1) = 0
• And each discriminant function is x^T z_i
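A quick numeric check of the algebra above with arbitrary vectors (nothing here is from the lecture; it only verifies the expansion and the unit-prototype special case):

```python
# Numeric check of the two-prototype derivation above.
import numpy as np

rng = np.random.default_rng(3)
x, z1, z2 = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)

d1_sq = (x - z1) @ (x - z1)
d2_sq = (x - z2) @ (x - z2)

# D1^2 - D2^2 should equal 2 x^T z2 - 2 x^T z1 + z1^T z1 - z2^T z2.
lhs = d1_sq - d2_sq
rhs = 2 * x @ z2 - 2 * x @ z1 + z1 @ z1 - z2 @ z2
print(np.isclose(lhs, rhs))   # True

# With unit-length prototypes, "closest prototype" = "largest inner product x^T z_i".
z1u, z2u = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)
print((x @ z1u > x @ z2u) == (np.linalg.norm(x - z1u) < np.linalg.norm(x - z2u)))   # True
```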
System to discriminate between two classes
Training Discriminant Functions
• Minimum distance
• Fisher linear discriminant
• Gradient learning
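As a sketch of the third item, here is gradient learning of a 2-class linear discriminant; the logistic loss, learning rate, and toy data are assumptions chosen only to make the example concrete:

```python
# Gradient learning of a linear discriminant g(x) = w^T x + w0.
# The logistic loss, learning rate, and toy data are illustrative assumptions.
import numpy as np

def train_linear_discriminant(X, y, lr=0.1, epochs=200):
    """y in {0, 1}; gradient descent on the mean logistic loss."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + w0)))    # sigmoid of the discriminant
        w -= lr * (X.T @ (p - y)) / len(y)         # gradient step for the weights
        w0 -= lr * np.mean(p - y)                  # gradient step for the bias
    return w, w0

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, w0 = train_linear_discriminant(X, y)
print(np.mean((X @ w + w0 > 0) == (y == 1)))       # training accuracy
```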
Generalized Discriminators - ANNs
• McCulloch-Pitts neural model
• Rosenblatt Perceptron
• Multilayer Systems
Typical unit for MLP
Typical MLP
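The two figures are not reproduced here; as a stand-in, this sketch implements one typical MLP unit (a weighted sum passed through a nonlinearity) and a tiny one-hidden-layer forward pass. The sigmoid nonlinearity and the layer sizes are assumptions:

```python
# A typical MLP unit and a tiny one-hidden-layer MLP forward pass.
# The sigmoid nonlinearity and the layer sizes are illustrative assumptions.
import numpy as np

def unit(x, w, b):
    """One unit: weighted sum of inputs plus bias, passed through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def mlp_forward(x, W1, b1, W2, b2):
    """Input -> layer of hidden units -> single output unit."""
    h = np.array([unit(x, W1[j], b1[j]) for j in range(len(b1))])
    return unit(h, W2, b2)

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)   # 3 hidden units, 2 inputs
W2, b2 = rng.standard_normal(3), 0.0                           # 1 output unit
print(mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```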
Support Vector Machines (SVMs)
• High-dimensional feature vectors
  – Transformed from simple features (e.g., polynomial)
  – Can potentially classify the training set arbitrarily well
• Improve generalization by maximizing the margin
• Via the "kernel trick", no explicit high-dimensional representation is needed
  – Inner product function between 2 points in the space
• Slack variables allow for imperfect classification (see the sketch below)
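A short sketch with scikit-learn's SVC (assumed available): the polynomial kernel supplies the implicit high-dimensional inner product, and C controls how much slack is tolerated. The toy data and parameter values are illustrative only:

```python
# SVM sketch: the kernel is the inner-product function between two points,
# and C trades margin width against the slack allowed for misclassification.
# Toy data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="poly", degree=2, C=1.0)   # polynomial kernel, as in the slide's example
clf.fit(X, y)
print(clf.score(X, y))                      # training accuracy
```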
Maximum margin decision boundary, SVM
Unsupervised clustering
• Large and diverse literature
• Many methods
• Next time: one method explained in the context of a larger, statistical system
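As one concrete example from that large literature, here is a bare-bones k-means sketch; it is shown only for illustration, and whether it matches the method covered next time is not specified here:

```python
# One common unsupervised clustering method (k-means), shown only as an example.
# No class labels are used anywhere.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init from random points
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers.round(2))   # roughly the two true cluster centers
```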
Some PR/machine learning Issues
• Testing on the training set
• Training on the test set
• # parameters vs. # training examples: overfitting and overtraining (see the sketch below)
• For much more on machine learning, see CS 281A/B
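A tiny sketch of the parameters-vs-examples point: fitting 10 noisy points with a 10-parameter polynomial versus a 3-parameter one. The data and model are arbitrary illustrations:

```python
# Overfitting sketch: with only 10 training examples, a 10-parameter (degree-9)
# polynomial can fit them almost exactly yet typically does worse on fresh data
# than a 3-parameter (degree-2) fit.  Data and model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(8)
x_train = np.sort(rng.uniform(-1, 1, 10))
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(10)
x_test = np.sort(rng.uniform(-1, 1, 100))
y_test = np.sin(3 * x_test) + 0.1 * rng.standard_normal(100)

for degree in (2, 9):                        # 3 vs. 10 parameters on 10 examples
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 5), round(test_err, 5))
```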