Applications of Risk Minimization to Speech Recognition
• Joseph Picone
Inst. for Signal and Info. Processing
Dept. Electrical and Computer Eng.
Mississippi State University
• Contact Information:
Box 9571
Mississippi State University
Mississippi State, Mississippi 39762
Tel: 662-325-3149
Fax: 662-325-2298
Email: [email protected]
MIT LINCOLN LABORATORY
• URL: www.isip.msstate.edu/publications/seminars/.../2003/lincoln_labs
• Acknowledgement: Supported by NSF under Grant No. IIS-0085940.
INTRODUCTION: ABSTRACT AND BIOGRAPHY
ABSTRACT: Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. In this presentation, we will review our attempts to apply notions of risk minimization to pattern recognition problems such as speech recognition. New approaches based on probabilistic Bayesian learning are shown to provide an order of magnitude reduction in complexity over comparable approaches based on HMMs and Support Vector Machines.
BIOGRAPHY: Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE and a registered Professional Engineer.
INTRODUCTION: GENERALIZATION AND RISK
• Optimal decision surface is a line
• Optimal decision surface changes abruptly
• Optimal decision surface still a line
• How much can we trust isolated data points?
• Can we integrate prior knowledge about data, confidence, or willingness to take risk?
INTRODUCTION: ACOUSTIC CONFUSABILITY
• Regions of overlap represent classification error
• Reduce overlap by introducing acoustic and linguistic context
• Comparison of “aa” in “lOck” and “iy” in “bEAt” for conversational speech
INTRODUCTION: PROBABILISTIC FRAMEWORK
• Maximum likelihood convergence does not translate to optimal classification if a priori assumptions about the data are not correct.
• Finding the optimal decision boundary requires only one parameter.
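To make the "one parameter" point concrete, here is a minimal sketch (not from the talk; the data are synthetic and illustrative) in which two classes are modeled by maximum-likelihood Gaussians and the entire decision rule collapses to a single threshold:

# A minimal sketch: for two classes modeled by one-dimensional Gaussians
# fit by maximum likelihood, the whole decision rule is one threshold on x.
import numpy as np

rng = np.random.default_rng(0)
class0 = rng.normal(loc=-1.0, scale=1.0, size=500)   # hypothetical data
class1 = rng.normal(loc=+1.0, scale=1.0, size=500)

# Maximum-likelihood estimates of each class-conditional Gaussian
mu0, mu1 = class0.mean(), class1.mean()
var = np.concatenate([class0 - mu0, class1 - mu1]).var()  # shared variance

# With equal priors and equal variances, the optimal decision boundary is
# the single point where the two class likelihoods are equal:
threshold = 0.5 * (mu0 + mu1)
print("decision threshold:", threshold)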
INTRODUCTION: ML CONVERGENCE NOT OPTIMAL
INTRODUCTION: POOR GENERALIZATION WITH GMM MLE
• Data is often not separable by a hyperplane – nonlinear classifier is needed
• Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization
• Three problems: controlling generalization, direct discriminative training, and sparsity.
• Structural optimization often guided by an Occam’s Razor approach
• Trading goodness of fit and model complexity – Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination (a BIC example follows)
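As a concrete instance of this trade-off (an illustrative aside, not from the slides), the Bayesian Information Criterion penalizes the maximized likelihood $\hat{L}$ of a model by its number of free parameters $k$ and the sample size $n$:

$$\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$$

The model with the smallest BIC is preferred; MDL, AIC, and the SRM bound below differ mainly in the form of the penalty term.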
RISK MINIMIZATION: STRUCTURAL OPTIMIZATION
[Figure: error versus model complexity; the training-set error decreases monotonically with complexity, the open-loop (test) error passes through a minimum, and the optimum model complexity lies at that minimum.]
RISK MINIMIZATION: STRUCTURAL RISK MINIMIZATION
• The VC dimension is a measure of the complexity of the learning machine
• Higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik)
• Expected Risk: $R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(x,\alpha)\rvert \, dP(x,y)$
• Not possible to estimate P(x,y)
• Empirical Risk: $R_{emp}(\alpha) = \tfrac{1}{2l}\sum_{i=1}^{l} \lvert y_i - f(x_i,\alpha)\rvert$
• Related by the VC dimension, h: $R(\alpha) \le R_{emp}(\alpha) + \phi(h)$, where $\phi(h)$ is the VC confidence
• Approach: choose the machine that gives the least upper bound on the actual risk (a numerical sketch of the bound follows)
[Figure: the bound on the expected risk is the sum of the empirical risk and the VC confidence; plotted against the VC dimension h, the expected risk has an optimum where that sum is smallest.]
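A minimal numerical sketch of this bound, assuming the standard Vapnik/Burges form of the VC confidence term; the empirical risks, VC dimensions, and sample size below are made up for illustration:

# VC confidence and risk bound: R <= R_emp + phi(h), with probability 1 - eta.
import math

def vc_confidence(h, l, eta=0.05):
    """VC confidence term for VC dimension h, l training samples,
    and confidence level 1 - eta."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def risk_bound(empirical_risk, h, l, eta=0.05):
    """Upper bound on the expected (actual) risk."""
    return empirical_risk + vc_confidence(h, l, eta)

# A more complex machine (larger h) can carry a worse bound
# even though its empirical risk is lower.
print(risk_bound(empirical_risk=0.10, h=50,  l=1000))
print(risk_bound(empirical_risk=0.05, h=500, l=1000))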
RISK MINIMIZATION: SUPPORT VECTOR MACHINES (OPTIMIZATION FOR SEPARABLE DATA)
• Hyperplane: $x \cdot w + b = 0$
• Constraints: $y_i (x_i \cdot w + b) - 1 \ge 0$
• Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion become support vectors.
• Final classifier: $f(x) = \sum_{i \in \mathrm{SVs}} \alpha_i y_i (x_i \cdot x) + b$
[Figure: two classes separated by a margin bounded by hyperplanes H1 and H2 with normal vector w; candidate classifiers C0, C1, and C2 all lie between the classes.]
• Hyperplanes C0-C2 achieve zero empirical risk; C0 generalizes optimally (see the sketch after this slide)
• The data points that define the boundary are called support vectors
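As an illustration of the sparseness of the resulting classifier, the following sketch trains a linear SVM on synthetic data with scikit-learn and counts the support vectors; the data and parameter settings are assumptions for illustration only:

# Train a linear SVM and inspect how few points become support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 2)),
               rng.normal(+2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors carry nonzero alphas; the rest of the training
# set does not affect the final classifier f(x) = w.x + b.
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("w =", clf.coef_[0], " b =", clf.intercept_[0])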
RISK MINIMIZATION: SVMS FOR NON-SEPARABLE DATA
• No hyperplane could achieve zero empirical risk (in any dimension space!)
• Recall the SRM principle: balance empirical risk and model complexity
• Relax our optimization constraint to allow for errors on the training set: $y_i (x_i \cdot w + b) \ge 1 - \xi_i$
• A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity (a cross-validation sketch follows)
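Because C cannot be learned from the training objective itself, it is typically tuned on held-out data. A minimal sketch (illustrative data and candidate grid, not the talk's setup):

# Choose the trade-off parameter C on a held-out set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)  # noisy labels: not separable

X_train, X_held, y_train, y_held = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    acc = SVC(kernel="rbf", C=C).fit(X_train, y_train).score(X_held, y_held)
    if acc > best_acc:
        best_C, best_acc = C, acc
print("selected C:", best_C, "held-out accuracy:", round(best_acc, 3))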
RISK MINIMIZATION: DRAWBACKS OF SVMS
• Uses a binary (yes/no) decision rule
– Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification
– Can produce a "probability" as a function of the distance (e.g., using sigmoid fits; see the sketch below), but these are inadequate
• Number of support vectors grows linearly with the size of the data set
• Requires the estimation of the trade-off parameter, C, via held-out sets
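The sigmoid fit mentioned above maps the hyperplane distance d to a pseudo-probability p(t=1|d) = 1/(1 + exp(A·d + B)). A minimal sketch of such a fit, with made-up distances and labels:

# Platt-style sigmoid mapping from SVM distance to a pseudo-probability.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(distances, labels):
    """Fit p(t=1|d) = 1 / (1 + exp(A*d + B)) by minimizing cross-entropy."""
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * distances + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0]).x

# Illustrative distances from a hyperplane and their true labels
d = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.8])
t = np.array([0, 0, 0, 1, 1, 1])
A, B = fit_sigmoid(d, t)
print("p(t=1 | d=0.5) =", 1.0 / (1.0 + np.exp(A * 0.5 + B)))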
RELEVANCE VECTOR MACHINES: AUTOMATIC RELEVANCE DETERMINATION
• A kernel-based learning machine: $y(x; w) = w_0 + \sum_{i=1}^{N} w_i K(x, x_i)$, with $P(t=1 \mid x, w) = \dfrac{1}{1 + e^{-y(x;w)}}$ (a small sketch of the decision function follows)
• Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay): $P(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$
• A flat (non-informative) prior over the hyperparameters $\alpha$ completes the Bayesian specification
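Putting the pieces above together, a trained RVM scores a new input by a kernel expansion over the surviving relevance vectors passed through a sigmoid. A minimal sketch with a hypothetical two-vector model (the RBF kernel and its width are assumptions, not the talk's configuration):

# RVM decision function: sigmoid of w0 + sum_i w_i K(x, x_i).
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def rvm_posterior(x, weights, relevance_vectors, w0=0.0, gamma=0.5):
    """P(t=1 | x, w) for a trained RVM; only the relevance vectors
    (weights not pruned to zero by ARD) appear in the sum."""
    y = w0 + sum(w * rbf_kernel(x, xi, gamma)
                 for w, xi in zip(weights, relevance_vectors))
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical trained model with just two relevance vectors
rvs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
w = [1.5, -2.0]
print(rvm_posterior(np.array([0.5, 0.5]), w, rvs))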
RELEVANCE VECTOR MACHINES: ITERATIVE REESTIMATION
• The goal in training becomes finding: $\hat{w}, \hat{\alpha} = \arg\max_{w,\alpha}\, p(w, \alpha \mid t, X)$, where $p(w, \alpha \mid t, X) = \dfrac{p(t \mid w, \alpha, X)\, p(w, \alpha \mid X)}{p(t \mid X)}$
• Estimation of the "sparsity" parameters is inherent in the optimization – no need for a held-out set!
• A closed-form solution to this maximization problem is not available; $\hat{w}$ and $\hat{\alpha}$ are reestimated iteratively
RELEVANCE VECTOR MACHINES: LAPLACE'S METHOD
• Fix $\alpha$ and estimate $w$ (e.g., gradient descent): $\hat{w} = \arg\max_{w}\, p(t \mid w)\, p(w \mid \alpha)$
• Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at $\hat{w}$: $\Sigma = \left(-\nabla_w \nabla_w \log p(t \mid w)\, p(w \mid \alpha)\right)^{-1}\big|_{\hat{w}}$
• With $\hat{w}$ and $\Sigma$ as the mean and covariance, respectively, of the Gaussian approximation, we find $\hat{\alpha}$ via $\hat{\alpha}_i = \dfrac{\gamma_i}{\hat{w}_i^{2}}$, where $\gamma_i = 1 - \alpha_i \Sigma_{ii}$
• Method is O(N²) in memory and O(N³) in time (a training-loop sketch follows)
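The two alternating steps above can be sketched directly. The toy implementation below follows Tipping-style updates (Newton/IRLS for the Laplace step, then the alpha reestimation); it assumes Phi is an N x (N+1) matrix of kernel values with a bias column and t holds 0/1 labels. This is illustrative code, not the ISIP trainer:

# Iterative reestimation of weights w and ARD hyperparameters alpha.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rvm_train(Phi, t, n_iters=50, prune_at=1e6):
    N, M = Phi.shape
    alpha = np.ones(M)          # ARD hyperparameters, one per weight
    w = np.zeros(M)
    for _ in range(n_iters):
        # Laplace step: find the mode of p(t|w) p(w|alpha) with alpha fixed
        for _ in range(25):
            y = sigmoid(Phi @ w)
            g = Phi.T @ (t - y) - alpha * w              # gradient of log posterior
            B = y * (1.0 - y)                            # logistic noise precisions
            H = (Phi.T * B) @ Phi + np.diag(alpha)       # negative Hessian
            w = w + np.linalg.solve(H, g)
        Sigma = np.linalg.inv(H)                         # Gaussian posterior covariance
        # Hyperparameter step: alpha_i = gamma_i / w_i^2, gamma_i = 1 - alpha_i * Sigma_ii
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (w ** 2 + 1e-12)
    relevant = alpha < prune_at      # weights whose alpha stays finite survive
    return w, alpha, relevant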
RELEVANCE VECTOR MACHINES: COMPARISON TO SVMS
RVM:
• Data: Class labels (0, 1)
• Goal: Learn the posterior, P(t=1|x)
• Structural Optimization: Hyperprior distribution encourages sparsity
• Training: Iterative, O(N³)
SVM:
• Data: Class labels (-1, 1)
• Goal: Find the optimal decision surface under the constraints $y_i (x_i \cdot w + b) \ge 1 - \xi_i$
• Structural Optimization: Trade-off parameter C that must be estimated
• Training: Quadratic programming, O(N²)
EXPERIMENTAL RESULTS: DETERDING VOWEL DATA
• Deterding Vowel Data: 11 vowels spoken in "h*d" context; 10 log-area parameters; 528 train, 462 speaker-independent test

Approach                     % Error    # Parameters
SVM: Polynomial Kernels      49%        –
K-Nearest Neighbor           44%        –
Gaussian Node Network        44%        –
SVM: RBF Kernels             35%        83 SVs
Separable Mixture Models     30%        –
RVM: RBF Kernels             30%        13 RVs
EXPERIMENTAL RESULTS: INTEGRATION WITH SPEECH RECOGNITION
• Data size:
– 30 million frames of data in training set
– Solution: Segmental phone models
• Source for Segmental Data:
– Solution: Use HMM system in bootstrap procedure
– Could also build a segment-based decoder
• Probabilistic decoder coupling:
– SVMs: Sigmoid-fit posterior
– RVMs: naturally probabilistic
EXPERIMENTAL RESULTS: HYBRID DECODER
[Figure: segmental feature extraction for the phone sequence "hh aw aa r y uw"; a segment of k frames is split into three regions of 0.3·k, 0.4·k, and 0.3·k frames, and the mean of each region forms the segmental feature vector (see the sketch below).]
[Figure: hybrid system block diagram; Mel-cepstral features feed HMM RECOGNITION, which produces an N-best list and segment information; the SEGMENTAL CONVERTER turns these into segmental features for the HYBRID DECODER, which outputs the final hypothesis.]
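A minimal sketch of the segmental conversion suggested by the figure: a variable-length run of frame-level features is reduced to the concatenated means of its first 30%, middle 40%, and last 30% (the 3-4-3 split is taken from the figure; everything else is illustrative):

# Convert a variable-length phone segment into a fixed-length feature vector.
import numpy as np

def segmental_features(frames):
    """frames: (k, d) array of Mel-cepstral frames for one phone segment
    (assumes k >= 3).  Returns the concatenated means of the first 30%,
    middle 40%, and last 30% of the frames -> a (3*d,) vector."""
    k = len(frames)
    b1 = max(1, int(round(0.3 * k)))
    b2 = max(b1 + 1, int(round(0.7 * k)))
    regions = [frames[:b1], frames[b1:b2], frames[b2:]]
    return np.concatenate([r.mean(axis=0) for r in regions])

# Example: a segment of 10 frames of 13-dimensional cepstra
seg = np.random.default_rng(0).normal(size=(10, 13))
print(segmental_features(seg).shape)   # (39,)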
EXPERIMENTAL RESULTS: SVM ALPHADIGIT RECOGNITION
• HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models
• SVM system has monophone models with segmental features
• System combination experiment yields another 1% reduction in error

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best + Ref    Reference       3.3%    6.3%
EXPERIMENTAL RESULTS: SVM/RVM ALPHADIGIT COMPARISON
• RVMs yield a large reduction in the parameter count while attaining superior performance
• Computational cost for RVMs lies mainly in training, but it is still prohibitive for larger sets

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%        257                 0.5 hours       30 mins
RVM        16.2%        12                  30 days         1 min
SUMMARY: PRACTICAL RISK MINIMIZATION?
• Reduction of complexity at the same level of performance is interesting:
• Results hold across tasks
• RVMs have been trained on 100,000 vectors
• Results suggest integrated training is critical
• Risk minimization provides a family of solutions:
• Is there a better solution than minimum risk?
• What is the impact on complexity and robustness?
• Applications to other problems?
• Speech/Non-speech classification?
• Speaker adaptation?
• Language modeling?
APPENDIX: SCALING RVMS TO LARGE DATA SETS
• Central to RVM training is the inversion of an M×M Hessian matrix: an O(N³) operation initially
• Solutions:
– Constructive Approach: Start with an empty model and iteratively add candidate parameters; M is typically much smaller than N (see the sketch at the end of this slide)
– Divide and Conquer Approach: Divide the complete problem into a set of sub-problems and iteratively refine the candidate parameter set according to the sub-problem solutions; M is user-defined
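A minimal sketch of the constructive strategy above: greedy forward selection that starts from an empty model and adds one candidate basis function at a time, so any matrix that must be handled stays M x M with M much smaller than N. The scoring function here is a stand-in least-squares criterion, not the marginal-likelihood criterion an actual RVM trainer would use:

# Greedy forward selection of basis functions (illustrative only).
import numpy as np

def constructive_select(Phi, t, score_candidate, max_basis=50, tol=1e-4):
    """Phi: N x N kernel matrix (one column per candidate basis function)."""
    active = []                       # indices of basis functions in the model
    best_score = -np.inf
    while len(active) < max_basis:
        gains = []
        for j in range(Phi.shape[1]):
            if j in active:
                gains.append(-np.inf)
            else:
                gains.append(score_candidate(Phi[:, active + [j]], t))
        j_best = int(np.argmax(gains))
        if gains[j_best] - best_score < tol:
            break                     # no candidate improves the model enough
        active.append(j_best)
        best_score = gains[j_best]
    return active

# Stand-in scoring function: negative residual of a least-squares fit
def ls_score(Phi_active, t):
    w, *_ = np.linalg.lstsq(Phi_active, t, rcond=None)
    return -np.sum((t - Phi_active @ w) ** 2)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 100))
t = (Phi[:, 3] - Phi[:, 17] > 0).astype(float)
print(constructive_select(Phi, t, ls_score, max_basis=5))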
APPENDIX: PRELIMINARY RESULTS
Approach             Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM                  15.5%        994                 3 hours         1.5 hours
RVM (Constructive)   14.8%        72                  5 days          5 mins
RVM (Reduction)      14.8%        74                  6 days          5 mins
• Data increased to 10000 training vectors
• Reduction method has been trained up to 100k vectors (on a toy task); this is not possible for the Constructive method
SUMMARY: ACKNOWLEDGEMENTS
• Principal Investigators: Aravind Ganapathiraju (Conversay) and Jon Hamaker (Microsoft), as part of their Ph.D. studies at Mississippi State
• Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (Cornell)
• Motivation: Serious work began after discussions with V.N. Vapnik at the CLSP Summer Workshop in 1997.
SUMMARY: RELEVANT SOFTWARE RESOURCES
• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets
• Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state of the art ASR toolkit
• Fun Stuff: have you seen our commercial on the Home Shopping Channel?
• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches
SUMMARY: BRIEF BIBLIOGRAPHY
Applications to Speech Recognition:
1. J. Hamaker and J. Picone, “Advances in Speech Recognition Using Sparse Bayesian Methods,” submitted to the IEEE Transactions on Speech and Audio Processing, January 2003.
2. A. Ganapathiraju, J. Hamaker and J. Picone, “Applications of Risk Minimization to Speech Recognition,” submitted to the IEEE Transactions on Signal Processing, July 2003.
3. J. Hamaker, J. Picone, and A. Ganapathiraju, “A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines,” Proceedings of the International Conference of Spoken Language Processing, vol. 2, pp. 1001-1004, Denver, Colorado, USA, September 2002.
4. J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, December 2003.
5. A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.
Influential work:
6. M. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.
7. D. J. C. MacKay, “Probable networks and plausible predictions --- a review of practical Bayesian methods for supervised neural networks,” Network: Computation in Neural Systems, 6, pp. 469-505, 1995.
8. D. J. C. MacKay, Bayesian Methods for Adaptive Models, Ph. D. thesis, California Institute of Technology, Pasadena, California, USA, 1991.
9. E. T. Jaynes, “Bayesian Methods: General Background,” Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25, Cambridge Univ. Press, Cambridge, UK, 1986.
10. V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
11. V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.
12. C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” AT&T Bell Laboratories, November 1999.